Human and nonhuman primates comprehend the actions of other individuals by detecting social cues, including others’ goal-directed motor actions and faces. However, little is known about how information from the face is integrated with action understanding. Here, we examine the ontogenetic and evolutionary foundations of this capacity by comparing the face-scanning patterns of chimpanzees and humans as they viewed goal-directed human actions in contexts that differed in whether the predicted goal was achieved. Human adults and children attend to the actor’s face during action sequences, and this tendency is particularly pronounced in adults when they observe that the predicted goal is not achieved. Chimpanzees rarely attend to the actor’s face during the goal-directed action, regardless of whether the predicted action goal is achieved. These results suggest that in humans, but not in chimpanzees, attention to the actor’s face, which conveys referential information about the target object, reflects observers’ inferences about the intentionality of an action. Furthermore, this remarkable predisposition to observe others’ actions by integrating the prediction of action goals with the actor’s intention is developmentally acquired.