Introduction

Following the gaze direction of others plays a crucial role in achieving a fundamental aspect of social interaction, which involves gaining insight into the motivations and needs of individuals (Argyle & Cook, 1976; Clifford & Palmer, 2018). This is primarily due to the wealth of information typically associated with the gaze direction exhibited by people during social interactions (see Emery, 2000; Kleinke, 1986). The phenomenon referred to as the gaze cueing effect (GCE) illustrates the propensity of individuals to shift their attention to the same spatial location as a person they observe looking in a particular direction (Driver et al., 1999; Friesen & Kingstone, 1998; Langton & Bruce, 1999). Numerous studies employing variations of a cue-target paradigm have consistently reported the presence of a GCE (see, e.g., Driver et al., 1999; Friesen & Kingstone, 1998; Kingstone et al., 2019). In fact, the concept of joint attention prompted by the GCE holds significant importance in social cognition research (for a review, see Barbato et al., 2020; Bayliss et al., 2006; Capozzi et al., 2018; Dalmaso et al., 2020; Tomasello, 1995; McKay et al., 2021; Mundy & Newell, 2007). In a seminal work by Bayliss and Tipper (2006), gaze predictability was manipulated by having unpredictable faces alternate between valid and invalid trials, while predictive faces consistently either looked at (valid) or looked away from (invalid) the target. The study found that gaze-evoked attention shifts can influence an observer’s evaluation of that person’s character (e.g., trustworthiness). This underscores the significance of gaze behavior in social groups and interactions between individuals. Previous studies exploring cognitive factors such as social status (Dalmaso et al., 2011; Jones et al., 2010) and judgments related to agents (Langton, 2009) have shown that these factors affect spatial attention bias during social interactions by regulating the GCE. Given the pivotal role of the GCE in human interaction (Frischen et al., 2007), identifying factors influencing the GCE is crucial, aligning with the goal of our investigation.

Facial features serve as crucial social cues, typically leading individuals to draw inferences about others based on their facial appearance (Gawronski & Quinn, 2013; Ma et al., 2016; Rule et al., 2009; Todorov et al., 2005, 2015; Willis & Todorov, 2006). Trustworthiness emerges as a primary facet in the initial formation of facial impressions, as underscored in research by Oosterhof and Todorov (2008). Importantly, numerous studies have indicated that individuals can spontaneously form stable impressions of trustworthiness solely from facial characteristics within a brief time frame (Jones et al., 2021; Klapper et al., 2016; Marzi et al., 2014; Todorov et al., 2008a).

When defining trust, Hale et al. (2018) emphasize its multifaceted nature, with diverse definitions proposed by various scholars (Ashraf et al., 2006; Ben-Ner & Halldorsson, 2010). Broadly speaking, an individual's trust level can be classified into two categories. It may manifest as a stable personal trait, commonly referred to as "generalized trust," which signifies a general inclination to trust (Couch & Jones, 1997; Freitag & Traunmüller, 2009). Alternatively, it can be a more targeted response, which is based on the perceived attributes of a specific individual and fosters a sense of positive anticipation within peer groups, as articulated by Butler and Cantrell (1984) and further supported by McKnight et al. (1998). In our investigation, we adopt this latter perspective to explore the role of trustworthiness in shaping the GCE.

Gibson's (1979) ecological approach to social perception emphasizes the dynamic process of detecting and adjusting to environmental features, driving human adaptability in diverse situational contexts. Within this framework, spontaneous inferences of trustworthiness from facial expressions emerge as significant social cues that guide human behavior, influencing tendencies toward approach or avoidance (Todorov et al., 2008b). These inferences are integral in informing individuals on adapting effectively to their social and environmental surroundings. Recognizing the social relevance of facial features and the consistent human inclination to form judgments of trustworthiness based on facial cues, it is reasonable to hypothesize that the extraction of trustworthiness information from facial stimuli plays a crucial role in directing attention.

While some previous research has attempted to address the question of whether trustworthiness can modulate the GCE, the findings have been inconsistent. We present a selection of studies that demonstrate both non-supportive and supportive perspectives. King et al. (2011) used vignettes to prime face trustworthiness, finding no significant moderation effect on the GCE during object categorization tasks. Similarly, Strachan et al. (2017) utilized pre-existing face stimuli varying in trustworthiness and required participants to complete object categorization, failing to demonstrate a modulation of the GCE by trustworthiness. Conversely, Süßenbach and Schönbrodt (2014) primed trustworthiness using brief personality descriptions, revealing a pronounced GCE with trustworthy faces during an object localization task. Likewise, Jessen and Grossmann (2020) showed increased attention allocation to objects associated with trustworthy faces using event-related potentials (ERPs) and face stimuli with varying trustworthiness levels from the Oosterhof and Todorov (2008) database. In existing research on whether trustworthiness influences the GCE, disparities arise from varied manipulation techniques and diverse tasks, leading to contradictory outcomes even in seemingly similar experimental designs (see King et al., 2011; Süßenbach & Schönbrodt, 2014). Considering these divergent perspectives, we put forth two plausible hypotheses that may underlie the divergent result patterns. Subsequently, we explore an enhanced methodology to discern which hypothesis aligns more accurately with empirical observations.

The reflexive hypothesis postulates the automatic nature of gaze following. Previous investigations into GCE have demonstrated that the direction of another person's gaze spontaneously induces a shift in the attention of observers (for a review, see Langton et al., 2000). For instance, despite informing participants of target likelihood on the opposite side, experiments by Driver et al. (1999) demonstrated faster discrimination on the gazed side, highlighting the automatic nature. Additionally, Law et al. (2010) found that GCE exhibited minimal dependence on cognitive resources. Assessing the impact of working memory load by having participants hold one or five digits during gaze trials, they observed consistent GCE magnitudes across both low and high memory load conditions. Even without conscious awareness, Sato et al. (2007) found that subliminally presented gaze cues still elicited shorter response times, supporting the automaticity of GCE. In light of this body of evidence, the reflexive hypothesis posits that trustworthiness information associated with faces may not significantly influence gaze following since the cues inherently prompt an automatic shift in attention.

Contrary to the reflexive hypothesis, the flexible hypothesis suggests that GCE is not solely automatic but can be influenced by various factors. These modulators encompass factors such as social competence (Capozzi et al., 2016), communicative intent (Myllyneva & Hietanen, 2015; Senju & Csibra, 2008), knowledge about others' mental states (Teufel et al., 2010a), and the interplay between the social aspects stemming from eye contact and the strategic control related to cue validity (Kompatsiari et al., 2022). For instance, Teufel et al. (2010a) found that observers were more likely to follow gaze cues when they perceived the cue provider as having visual capability (transparent goggles) compared to situations where they believed the person had a lack of visual capability (opaque goggles). However, Cole et al. (2015), in a subsequent study using a gaze cue agent in a cardboard box, manipulating the agent's object perception ability, did not find support for the obligatory influence of attributed mental states on gaze cues, contrary to Teufel et al. (2010a). In multi-agent contexts, Capozzi et al. (2021) found a flexible interplay between cue numerosity and quality of social information, highlighting the adaptable nature of attention capture by gaze cues. In our research, we propose that if the flexible hypothesis holds, trustworthy faces may elicit a more pronounced GCE compared to untrustworthy facial cues.

Although there have been several studies examining the impact of trustworthiness on the GCE, there are opportunities for further refinement for several reasons. First, while prior studies have indirectly manipulated trustworthiness through methods such as priming neutral faces with brief personality descriptions (King et al., 2011; Süßenbach & Schönbrodt, 2014), these manipulations may introduce cognitive and affective elements. Our aim is to focus on understanding the fundamental perceptual processes to gain deeper insights into the intrinsic impact of trustworthiness. However, researchers have encountered inconsistencies in revealing trustworthiness-related features. For instance, Todorov et al. (2008a) identified the brow ridge, cheekbone, chin, and nose sellion as strongly correlated with trustworthiness judgments. Besides, Dotsch and Todorov (2012) observed that information in the mouth, eye, eyebrow, and hair regions also played a significant role in shaping judgments of trustworthiness. To ensure we could manipulate trustworthiness changes while keeping other dimensions unaffected, we employed a data-driven approach inspired by Oosterhof and Todorov's (2008) model for social perception of faces. They adopted principal components analysis (PCA) to identify key features, allowing representation of holistic sets of feature changes without prior assumptions about the significance of specific facial components.

Second, we rigorously excluded confounding variables unrelated to trustworthiness perception. In Experiments 2 and 3, we utilized stimuli from Oosterhof and Todorov’s (2008) database. Their framework posits that trustworthiness and dominance contribute to the 2D structure of face evaluation, aligning with established social perception models (Fiske et al., 2007; Wiggins, 1979). Their methodology enabled changes in trustworthiness while maintaining other trait judgments within the same face identity. Morphing facial stimuli by 3 standard deviations in trustworthiness (Experiment 2) and setting them to baseline trustworthiness (Experiment 3) enables a focused examination of trustworthiness in the GCE, minimizing interference from confounding variables related to face judgments. This stands in contrast to studies manipulating trustworthiness perception across different face identities. Without a comparison of the GCE within the same identity with only trustworthiness changed, potential confounding variables may be introduced, such as the influence of other dimensions like dominance in the 2D structure of face evaluation. This oversight could contribute to the mixed findings regarding the influence of trustworthiness on the GCE.

Third, our study diverges from previous research, which predominantly favored employing object categorization tasks to gauge participants' responses (e.g., Bayliss & Tipper, 2006; King et al., 2011; Strachan et al., 2017). Object categorization tasks involve higher-level cognitive functions associated with grouping and recognizing visual stimuli within meaningful categories. In contrast, our approach centered around an orientation discrimination task. This task specifically targets basic visual perception abilities, focusing on fundamental processes rather than complex object identification. As outlined in our first point, our primary objective is to delve into the core perceptual mechanisms to gain a deeper understanding of the intrinsic impact of trustworthiness.

In sum, the primary objective of the present study was to investigate whether trustworthiness perceived from facial visual appearance affects the GCE. Specifically, we employed facial stimuli that had been pre-rated for trustworthiness by a separate group (categorized as trustworthy or untrustworthy). These faces served as cues with manipulated gaze directions, and participants were required to promptly indicate the orientation of the Gabor probe. In Experiment 1, we utilized facial stimuli from the Klapper et al. (2016) study. Our aim was to investigate whether the degree of trustworthiness in these facial stimuli would impact the magnitude of the GCE. For Experiment 2, we sought to replicate the findings of Experiment 1 using facial stimuli sourced from the Oosterhof and Todorov (2008) database. In Experiments 3a and 3b, we standardized the trustworthiness information of the facial stimuli employed in Experiment 2 to establish a baseline for a control experiment. This approach enabled us to examine whether any effects observed in Experiments 1 and 2 were influenced by potential confounding variables, apart from trustworthiness. If the flexible hypothesis were true, it would imply that trustworthiness information from faces could indeed modulate the GCE in Experiments 1 and 2. Furthermore, when we set the trustworthiness information to baseline in Experiments 3a and 3b, the modulation effect of trustworthiness on the GCE should disappear. Conversely, if the reflexive hypothesis were accurate, we would expect no modulation effect of trustworthiness in any of the experiments. The raw data for all these studies can be accessed on the Open Science Framework website (https://osf.io/y6xhf).

Experiment 1: the trustworthiness of faces can modulate the GCE

Methods

Participants

Thirty-six students (19 female, age range: 18–26 years, mean age: 20.17 years) were recruited from Shandong Normal University. All participants reported normal or corrected to normal vision. Participants were naïve to the purpose of the study and provided written consent before participation. The study was approved by the Research Ethics Board of Shandong Normal University. Each participant was compensated with a monetary reward of 15 CNY for their participation in this experiment.

To ensure adequate power, the sample size was determined by a power analysis based on the predicted effect size using PANGEA (Westfall, 2015). Based on the results of previous studies (e.g., Süßenbach & Schönbrodt, 2014), we predicted a large effect size (d = 0.7, according to Cohen, 1988) for the interaction effect in our experimental design. With 90% power at a 0.05 significance level, the suggested sample size was approximately 22 individuals. This sample size determination method was applied consistently across all experiments.
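As a rough plausibility check on this figure, the sketch below applies a textbook normal approximation for a two-sided test of a single within-subject contrast. It is not the PANGEA computation, which handles general ANOVA designs; the function name `approx_sample_size` and the choice to treat the interaction as one standardized contrast with d = 0.7 are assumptions made purely for illustration.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_size(d, alpha=0.05, power=0.90):
    """Normal-approximation sample size for a two-sided test of one
    standardized within-subject contrast (illustrative, not PANGEA)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(((z_alpha + z_power) / d) ** 2)


# Inputs taken from the text: d = 0.7, alpha = 0.05, power = 0.90.
print(approx_sample_size(0.7))
```

Under these simplifying assumptions the approximation lands close to the sample size quoted above; an exact t-based calculation would typically ask for slightly more participants.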

Initially, we recruited 24 participants for each of Experiments 1 and 2. However, given the inconsistency observed in previous literature on this topic, an additional 12 participants were recruited for each experiment to enhance the stability and robustness of our results, resulting in a final sample size of 36. It is noteworthy that the overall findings from the total sample of 36 participants are in line with those from the initial sample of 24 participants. Details regarding both the original and newly recruited samples are provided in Appendix A. Furthermore, to evaluate the potential impact of sample recruitment status (original vs. newly sampled) on our findings, we included sample status as a between-subject factor and found no significant influence (see Appendix B).

Design

For Experiment 1, we used a 2 (face trustworthiness: trustworthy and untrustworthy) × 2 (gaze-cue type: valid and invalid) within-subject design. On one-half of the trials, the faces were trustworthy, and on the other, untrustworthy. The gaze direction (left or right) was consistent with the position (left or right of the fixation) of the Gabor probe on one-half of the trials (valid trials) and inconsistent on the other half of the trials (invalid trials). Participants completed 256 trials, which were divided into two blocks. For each block, the four conditions mentioned above were fully randomized, and each face was repeated eight times.
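The trial counts above can be reproduced with a small sketch. It assumes that the eight repetitions of each face image within a block arise from crossing gaze direction (2), probe side (2), and Gabor tilt (2); the text does not spell out this factorization, and the identity labels and function names are hypothetical.

```python
import itertools
import random

# Hypothetical labels: eight identities, each in a trustworthy and an
# untrustworthy version, giving 16 face images in total.
IDENTITIES = [f"id{i}" for i in range(1, 9)]
TRUST_LEVELS = ("trustworthy", "untrustworthy")
SIDES = ("left", "right")
TILTS = ("clockwise", "counterclockwise")


def build_block(rng):
    """Return one fully randomized block (8 x 2 x 2 x 2 x 2 = 128 trials)."""
    trials = []
    for identity, trust, gaze, probe_side, tilt in itertools.product(
        IDENTITIES, TRUST_LEVELS, SIDES, SIDES, TILTS
    ):
        trials.append({
            "identity": identity,
            "trust": trust,
            "gaze": gaze,
            "probe_side": probe_side,
            "tilt": tilt,
            # A trial is valid when the gaze points at the probe location.
            "validity": "valid" if gaze == probe_side else "invalid",
        })
    rng.shuffle(trials)
    return trials


def build_session(seed=0, n_blocks=2):
    """Concatenate independently shuffled blocks into one session."""
    rng = random.Random(seed)
    return [trial for _ in range(n_blocks) for trial in build_block(rng)]


session = build_session()
print(len(session))
```

With this crossing, half of the trials are valid and each of the four trustworthiness by validity cells holds the same number of trials, as the design requires.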

Stimuli

We referenced the Klapper et al. (2016) database, which consisted of eight trustworthy-looking faces and eight untrustworthy-looking faces (12.12° × 14.45°; see Fig. 1). We used all the faces from this database in our experiment. The facial materials were generated using the FaceGen software development kit (Oosterhof & Todorov, 2008). Initially, eight male speakers' standard average faces were morphed two standard deviations towards male characteristics. Then, each face was further altered by morphing it six standard deviations on a random dimension orthogonal to known social dimensions, creating distinct individual features. Individual overlay textures were applied to enhance realism, adding details like skin irregularities. Additionally, each face was given a unique haircut from real face images in the Radboud Face Database. Importantly, trustworthy and untrustworthy versions of each of these eight faces were created by morphing them 2.5 standard deviations toward either trustworthiness or untrustworthiness. In addition, Oosterhof and Todorov (2008) highlight the similarities in trait evaluations between natural and computer-generated faces. The correlations between trait ratings, as well as the results of PCA, were found to be consistent, underlining the validity of using computer-generated faces as experimental materials. Before the experiment, we recruited two groups of 30 Chinese participants (15 females, age range: 19–25 years, mean age: 21.0 years) and asked them to rate the degree of trustworthiness for the adopted face stimuli on a seven-point scale (1 = very untrustworthy, 7 = very trustworthy). Results of the trustworthiness evaluation showed that ratings for face stimuli used here were significantly higher for trustworthy (M = 4.47, SD = 0.911) compared with untrustworthy faces (M = 2.91, SD = 0.912), t (29) = 8.01, p < 0.001, Cohen’s d = 1.46. The rating results confirmed that the trustworthiness manipulation was both successful and robust for Chinese participants.

Fig. 1

Examples of trustworthy (left) and untrustworthy faces (right)

We manipulated each face to produce left-gaze and right-gaze cues by shifting the pupil and iris area of each eye into the left and right corner of each eye, using GIMP software (version 2.8.16, GIMP Development Team, 2016). The average degree of gaze change was 0.17°. Ultimately, we adopted 24 trustworthy-looking faces (i.e., eight different faces, each in straight-gaze, left-gaze, and right-gaze versions) and 24 untrustworthy-looking faces (i.e., eight different faces in three versions each). Dynamic eye cues were used in the current research because they extend work on attention to a more realistic situation (Hermens & Walker, 2012; Kuhn & Tipples, 2011).

For the current experiment, two Gabor patches (1.5° × 1.5°, 3° clockwise or counter-clockwise rotation, and 4.0 cycles per degree) were used as probes. A 200-ms, 2000-Hz beep was used as the error feedback.

Apparatus

Visual stimuli were generated on a PC using PsychoPy (Peirce, 2007), and presented on a linearized CRT monitor (21" Sun GDM-5510; resolution: 1024 × 768 pixels; refresh rate: 100 Hz). Participants viewed the screen from a distance of 70 cm in a dark room. Stimuli were presented against a uniform gray background at a mean luminance of 40 cd/m2.
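For readers who wish to relate the stimulus sizes reported in degrees of visual angle to physical sizes on the screen, the following sketch applies the standard visual-angle formula at the 70 cm viewing distance. The helper names are hypothetical, and no conversion to pixels is attempted because the physical dimensions of the display area are not reported.

```python
import math


def angle_to_cm(angle_deg, distance_cm=70.0):
    """Physical extent (cm) that subtends angle_deg at the given distance."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)


def cm_to_angle(size_cm, distance_cm=70.0):
    """Visual angle (degrees) subtended by size_cm at the given distance."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


# Face stimuli (12.12 x 14.45 degrees) and Gabor probes (1.5 degrees).
for label, angle in [("face width", 12.12), ("face height", 14.45), ("Gabor", 1.5)]:
    print(f"{label}: {angle_to_cm(angle):.2f} cm")
```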

Procedure

As illustrated in Fig. 2, each trial began with a black fixation cross (500 ms), followed by a face with dynamic gaze (500-ms straight gaze + 500-ms left or right gaze) and then the 100-ms ISI display. Next, a Gabor patch was presented on the left (or right) side of the fixation until the participant’s response or for a maximum of 1500 ms. The Gabor patch was presented slightly tilted either clockwise or counter-clockwise. Participants were instructed to promptly press one of two keys to indicate their perceived orientation of the Gabor patch, regardless of its side of presentation. The tilted directions of Gabor patches were counterbalanced across different conditions. The inter-trial interval was 800–1200 ms. At the beginning of the experiment, participants were explicitly told that neither of the gaze directions was predictive of probe location (Petrican et al., 2012) and the primary objective was to respond swiftly while minimizing errors.
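The timing of one trial can be summarized in a short sketch that converts the stated phase durations into frame counts at the 100-Hz refresh rate reported in the Apparatus section. The phase labels and function names are hypothetical; only the durations and the refresh rate come from the text.

```python
REFRESH_HZ = 100  # refresh rate reported for the CRT monitor

# Fixed-duration phases that precede the Gabor probe, in milliseconds.
PHASES_MS = [
    ("fixation", 500),
    ("straight_gaze", 500),
    ("averted_gaze", 500),
    ("isi", 100),
]


def ms_to_frames(duration_ms, refresh_hz=REFRESH_HZ):
    """Convert a duration to a whole number of screen refreshes."""
    frames = duration_ms * refresh_hz / 1000
    if frames != int(frames):
        raise ValueError(f"{duration_ms} ms is not a whole number of frames")
    return int(frames)


def probe_onset_ms(phases=PHASES_MS):
    """Time from trial start to Gabor onset."""
    return sum(duration for _, duration in phases)


for name, duration in PHASES_MS:
    print(name, ms_to_frames(duration), "frames")
print("probe onset:", probe_onset_ms(), "ms")
```

Measured from the onset of the averted gaze, the probe appears after the 500-ms cue plus the 100-ms ISI.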

Fig. 2

Procedure for Experiment 1. Each trial began with a black fixation cross (500 ms), followed by a face with dynamic gaze (500-ms straight gaze + 500-ms left or right gaze) and then the 100-ms ISI display. Next, a Gabor probe was presented on the left (or right) side of the fixation until the participant pressed either the left or right arrow key on the keyboard or for a maximum of 1500 ms

Statistical analysis

The statistical analysis was carried out using JASP (JASP Team, 2024), employing repeated measures analysis of variance (ANOVA). Initially, ANOVA was applied to the mean accuracy rate with within-subject factors: face trustworthiness (trustworthy and untrustworthy) and gaze-cue type (valid and invalid). Previous studies exploring influence of trustworthiness on GCE usually measured reaction times (RTs) as an index (King et al., 2011; Petrican et al., 2012; Süßenbach & Schönbrodt, 2014). Therefore, in our research, we referenced the previous literature and made RTs the main measure of interest. We also reported accuracy results for the sake of completeness.

For the analysis of RTs, to enhance the robustness of our analysis, we adhered to the methodology outlined by Petrican et al. (2012), which involves excluding trials with RTs deviating more than three standard deviations from the participant mean. This precautionary step aimed to minimize the impact of extreme outlier values on our results and ensure a more reliable dataset. Subsequently, a two-way repeated-measures ANOVA was conducted for the remaining correct RTs, employing the same within-subject factors as in the accuracy analysis. Our exclusion criterion resulted in the exclusion of 393, 363, 323, and 334 trials, constituting 4.26%, 3.94%, 5.26%, and 3.62% of the total number of trials in Experiments 1, 2, 3a, and 3b, respectively. For further scrutiny of the GCE, defined as the difference between RTs in the valid condition and RTs in the invalid condition, post hoc multiple comparisons were conducted using Holm’s correction method (Holm, 1979) if the interaction effect reached significance. It is worth noting that while the post hoc test encompassed all possible comparisons, only the contrasts pertinent to our research goals were reported. These contrasts specifically focused on the comparison between RTs of valid and invalid conditions for trustworthy and untrustworthy stimuli, respectively.
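The three-standard-deviation exclusion rule can be illustrated as follows. This is a minimal sketch that assumes the mean and standard deviation are computed once over all of a participant's RTs; whether the original analysis trimmed per participant overall or per condition is not stated here, and the RT values below are invented for demonstration.

```python
from statistics import mean, stdev


def trim_rts(rts, n_sd=3.0):
    """Drop RTs more than n_sd sample standard deviations from the mean.

    Returns the retained RTs and the number of excluded trials.
    """
    centre = mean(rts)
    spread = stdev(rts)
    lower, upper = centre - n_sd * spread, centre + n_sd * spread
    kept = [rt for rt in rts if lower <= rt <= upper]
    return kept, len(rts) - len(kept)


# Hypothetical RTs (ms) for one participant, with one extreme value.
example_rts = [600 + 5 * i for i in range(30)] + [3000]
kept, n_removed = trim_rts(example_rts)
print(n_removed, "trial(s) excluded;", len(kept), "retained")
```

Because an extreme value inflates the standard deviation it is judged against, the rule can only flag outliers when enough trials are available per participant.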

To provide additional support for our findings, we computed Bayes Factors using JASP with default prior width for all the analyses we performed above. As outlined by Wagenmakers et al. (2018), the Bayes Factor is denoted as BF10 (and its reciprocal BF01 = 1/BF10), quantifying the strength of evidence that the data offer for the alternative hypothesis in comparison to the null hypothesis. We interpret Bayes Factors as follows: values below 3 indicate anecdotal evidence, those falling within the range of 3 to 10 suggest moderate evidence, values between 10 and 30 signify strong evidence, the range of 30 to 100 represents very strong evidence, and values exceeding 100 are indicative of decisive evidence (Gray et al., 2018; Jeffreys, 1961).
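The interpretation scheme above can be written as a small lookup. The sketch assumes the Bayes Factor is expressed in favor of whichever hypothesis it supports (BF10 or BF01, so that the value is at least 1) and treats each upper bound as exclusive except the bound at 100; the text does not specify how boundary values are handled, and the function name is hypothetical.

```python
def evidence_label(bf):
    """Map a Bayes Factor (>= 1) onto the verbal categories used in the text."""
    if bf < 1:
        raise ValueError("pass the reciprocal so that the factor is at least 1")
    if bf < 3:
        return "anecdotal"
    if bf < 10:
        return "moderate"
    if bf < 30:
        return "strong"
    if bf <= 100:
        return "very strong"
    return "decisive"


# Example values in the range reported for the experiments.
for value in (2.186, 4.630, 31.568):
    print(value, "->", evidence_label(value))
```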

Results

In Experiment 1, we tested whether the trustworthiness of faces can affect the GCE using stimuli sourced from the Klapper et al. (2016) database. Descriptive data for Experiment 1 (and the other experiments) are presented in Table 1.

Table 1 Descriptive results of all the experiments

RTs

The analysis revealed a significant main effect of gaze-cue type [valid vs. invalid: 647 vs. 660 ms, MSE = 477.528, F (1, 35) = 12.525, p = 0.001, \(\eta_{p}^{2}\) = 0.264, BF10 = 31.568] and a significant interaction between trustworthiness and gaze-cue type [MSE = 160.332, F (1, 35) = 4.297, p = 0.046, \(\eta_{p}^{2}\) = 0.109, BF10 = 2.186]. Notably, there was no significant main effect of trustworthiness [trustworthy vs. untrustworthy: 653 vs. 653 ms, MSE = 153.043, F (1, 35) = 0.014, p = 0.906, \(\eta_{p}^{2}\) < 0.001, BF01 = 4.630; see Fig. 3].
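The effect sizes reported above can be recovered from the F statistics and their degrees of freedom using the standard identity for partial eta squared, \(\eta_{p}^{2} = F \cdot df_{effect} / (F \cdot df_{effect} + df_{error})\). The sketch below applies it to the three RT effects of Experiment 1; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported for the RT analysis.
effects = {
    "gaze-cue type": 12.525,
    "trustworthiness x gaze-cue type": 4.297,
    "trustworthiness": 0.014,
}
for name, f_value in effects.items():
    print(f"{name}: {partial_eta_squared(f_value, 1, 35):.3f}")
```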

Fig. 3

Results of Experiment 1. Mean RTs were calculated for facial cues categorized as trustworthy and untrustworthy in both valid and invalid trial conditions. Error bars represent standard errors. Significance levels are denoted as follows: *p < 0.05, **p < 0.01, ***p < 0.001, and “n.s.” indicating no significant difference

We conducted post hoc multiple comparisons using Holm's correction method. For trustworthy faces, a GCE was detected, with a mean GCE of 17 ms, t (35) = 4.102, p < 0.001, Cohen’s d = 0.243. However, no significant GCE was observed for untrustworthy faces, with a mean GCE of 9 ms, t (35) = 2.023, p = 0.144, Cohen’s d = 0.120. These findings indicate that the perceived trustworthiness of facial appearance modulates the GCE.
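For reference, Holm's step-down adjustment can be sketched in a few lines. The p-values in the example are hypothetical and are not the uncorrected values from this experiment, which are not reported; the function name is likewise illustrative, and statistical packages may present the adjusted values in a slightly different form.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical uncorrected p-values for four comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in holm_adjust(raw)])
```

Because the multiplier shrinks as the p-values grow, Holm's procedure is never more conservative than a plain Bonferroni correction applied to the same family of comparisons.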

Accuracy

The overall accuracy was 97%. A repeated-measures ANOVA with trustworthiness and gaze-cue type as the within-subjects factors revealed that neither the main effects nor the interaction effect was significant (trustworthiness: F(1, 35) = 0.243, p = 0.625, \(\eta_{p}^{2}\) = 0.007, BF01 = 4.310; gaze-cue type: F(1, 35) = 2.476, p = 0.125, \(\eta_{p}^{2}\) = 0.066, BF01 = 2.137; interaction effect: F(1, 35) = 0.508, p = 0.481, \(\eta_{p}^{2}\) = 0.014, BF01 = 2.976).

Discussion

In Experiment 1, we utilized facial stimuli from the database developed by Klapper et al. (2016), which varied in trustworthiness. Our findings revealed that participants exhibited faster RTs in response to the orientation of the Gabor patch when the gaze cue was valid. However, there were no significant differences in RTs between the trustworthy and untrustworthy conditions. Importantly, when the facial stimuli were perceived as trustworthy, they elicited a significantly stronger GCE compared to situations where the stimuli were deemed untrustworthy. Such findings support the flexible hypothesis, which posits that the GCE can be modulated by factors such as trustworthiness, over the reflexive hypothesis.

Notably, the observed pattern of results can be interpreted to mean either that trustworthy faces facilitated the GCE or that untrustworthy faces impeded it. Considering previous findings suggesting that gaze cues can direct attention independently, even without explicit trustworthiness manipulations (for an overview, refer to Frischen et al., 2007), we favor the possibility that, rather than trustworthiness enhancing the GCE, untrustworthiness hinders this attentional phenomenon. While the reported p-value for the interaction effect achieved statistical significance, the Bayes Factor did not surpass the threshold of 3, suggesting only anecdotal evidence for the interaction between trustworthiness and gaze-cue type. Consequently, Experiment 2 was conducted as a replication of Experiment 1 to confirm the robustness of our results.

Experiment 2: the trustworthiness of faces can modulate the GCE (Replication study of experiment 1)

Methods

In Experiment 2, we adopted another face stimuli set from the Oosterhof and Todorov (2008) database and aimed to confirm the robustness of the results in Experiment 1. The facial stimuli were generated via computer using FaceGen Modeller 3.2 software (Singular Inversions, 2007, Toronto, Canada) and were manipulated to represent different levels of trustworthiness based on the model outlined by Oosterhof and Todorov (2008). Furthermore, the facial trustworthiness information in this database could be adjusted to baseline for the control experiment later.

Participants

Thirty-six students (12 female, age range: 18–20 years, mean age: 19.2 years) were recruited from Sun Yat-sen University. All participants reported normal or corrected to normal vision. Participants were naïve to the purpose of the study and provided written consent before participation. The study was approved by the Research Ethics Board of Sun Yat-sen University.

Stimuli

Eight trustworthy looking faces and eight untrustworthy looking faces (12.12° × 14.45°) were randomly selected from the Oosterhof and Todorov (2008) database (see Fig. 4).

Fig. 4

Examples of trustworthy (left) and untrustworthy faces (right)

As in Experiment 1, we recruited another group of 30 Chinese participants (15 females, age range: 19–25 years, mean age: 20.8 years) before the experiment and asked them to rate the degree of trustworthiness for face stimuli on a seven-point scale (1 = very untrustworthy, 7 = very trustworthy). Results of the trustworthiness evaluation showed that ratings of trustworthy faces (M = 4.92, SD = 0.891) were significantly higher than untrustworthy faces (M = 2.91, SD = 0.912), t (29) = 9.12, p < 0.001, Cohen’s d = 2.40. The results confirmed that these facial stimuli worked well in Chinese samples.

The design, apparatus, procedure, and statistical analysis were the same as in Experiment 1.

Results

RTs

The results revealed no significant main effects of trustworthiness or gaze-cue type [trustworthiness (trustworthy vs. untrustworthy): 634 vs. 634 ms, MSE = 115.396, F (1, 35) = 0.054, p = 0.818, \(\eta_{p}^{2}\) = 0.002, BF01 = 4.695; gaze-cue type (valid vs. invalid): 632 vs. 635 ms, MSE = 225.332, F (1, 35) = 1.528, p = 0.225, \(\eta_{p}^{2}\) = 0.042, BF01 = 1.815]. However, there was a significant interaction between trustworthiness and gaze-cue type [MSE = 107.669, F (1, 35) = 26.420, p < 0.001, \(\eta_{p}^{2}\) = 0.414, BF10 = \(3.5863 \times 10^{4}\); Fig. 5].

Fig. 5

Results of Experiment 2

Through post hoc tests, we identified a GCE for trustworthy faces [12 ms, t (35) = 3.939, p = 0.001, Cohen’s d = 0.200]. For untrustworthy faces, however, the GCE was numerically reversed but not significant [−6 ms, t (35) = −1.906, p = 0.184, Cohen’s d = −0.097]. Consistent with the findings of Experiment 1, these results showed that the trustworthiness impression from facial appearance affected the GCE.

Accuracy

The overall accuracy rate reached 97%. A repeated-measures ANOVA indicated that neither the main effect of gaze-cue type nor the interaction effect achieved statistical significance [gaze-cue type: F (1, 35) = 0.002, p = 0.967, \(\eta_{p}^{2}\) < 0.001, BF01 = 2.994; interaction effect: F (1, 35) = 1.688, p = 0.202, \(\eta_{p}^{2}\) = 0.046, BF01 = 1.842]. The main effect of trustworthiness was statistically significant [F (1, 35) = 5.924, p = 0.020, \(\eta_{p}^{2}\) = 0.145, BF10 = 1.708], with a 97.05% accuracy rate in the trustworthy condition and a 97.68% accuracy rate in the untrustworthy condition. However, given the generally high accuracy rate, the importance of this effect may be limited.

Discussion

In Experiment 2, we utilized facial stimuli from the database created by Oosterhof and Todorov (2008) and aimed to replicate the findings obtained in Experiment 1. Despite the factors of validity and trustworthiness not individually yielding differences in RTs, it is worth highlighting that when the facial stimuli were perceived as trustworthy, they elicited a significant GCE. Conversely, for facial stimuli perceived as untrustworthy, we observed a subtle reversal in the GCE tendency, indicating even shorter RTs when the gaze cue was invalid. This, to some extent, can provide support for the proposal we posited in the discussion section of Experiment 1, namely that untrustworthiness impedes the GCE. Experiment 2 also lent support for the flexible hypothesis.

In Experiment 2, the Bayes factors for the RT analyses were consistent with the conclusions drawn from the p values. In particular, the BF10 for the interaction effect constituted decisive evidence according to Jeffreys (1961). This successful replication of the modulation effect of trustworthiness on the GCE further underpins the robustness of the findings obtained in Experiment 1.

The main effect of trustworthiness differed between accuracy and RTs, although, as noted, RTs are our primary index. While the effect on accuracy reached statistical significance, the BF10 suggested only anecdotal evidence for it. Additionally, this main effect of trustworthiness on accuracy was not replicated in any other experiment in the present study. Moreover, the accuracy rates for the trustworthy and untrustworthy conditions were 97.05% and 97.68%, respectively, both indicating a high level of accuracy. We therefore consider the practical significance of this difference to be trivial.

Experiment 3a/3b: the face stimuli without trustworthiness cannot modulate the GCE

In the Oosterhof and Todorov (2008) database, the trustworthiness level of facial stimuli can be modified on a scale ranging from -3 to +3 while other facial attributes remain unchanged. For instance, the same face may exhibit a high trustworthiness level (+3) or a low trustworthiness level (-3), as in the stimuli we employed in Experiment 2. A trustworthiness score of 0 serves as the baseline. In Experiment 3, we reset the trustworthiness information of the faces used in Experiment 2 to the baseline level, effectively transforming them into baseline faces. This manipulation was undertaken to investigate the influence of trustworthiness on GCE while keeping other low-level characteristics consistent. We hypothesized that, following this baseline manipulation, no significant differences would be observed in RTs between the formerly trustworthy and formerly untrustworthy faces employed in Experiment 2. To assess the stability of the results, we replicated Experiment 3a in Experiment 3b with a larger sample.

Methods

Participants

In Experiment 3a, twenty-four participants (12 females, age range: 18–20 years, mean age: 18.8 years) were recruited, while in Experiment 3b, thirty-six participants (22 females, age range: 19–27 years, mean age: 21.3 years) were recruited from Sun Yat-sen University.

All participants reported normal or corrected-to-normal vision. Participants were naïve to the purpose of the study and provided written consent before participation. The study was approved by the Research Ethics Board of Sun Yat-sen University.

Stimuli

The face stimuli used in Experiment 2, taken from Oosterhof and Todorov (2008), allowed us to manipulate the degree of trustworthiness. We used the same faces as in Experiment 2, except that we set the trustworthiness information to baseline (see Fig. 6).

Fig. 6

Examples of setting trustworthy (left) and untrustworthy faces (right) to baseline faces (i.e., trustworthiness information set to zero)

As in the previous experiments, a group of 30 Chinese participants (16 females, age range: 19–21 years, mean age: 20.5 years) was recruited before the experiment and asked to rate the trustworthiness of the face stimuli on a seven-point scale (1 = very untrustworthy, 7 = very trustworthy). The ratings of the formerly trustworthy faces (M = 4.33, SD = 0.793) did not differ significantly from those of the formerly untrustworthy faces (M = 4.11, SD = 0.849), t (29) = 1.012, p = 0.316, Cohen’s d = 0.261. The results confirmed that resetting the trustworthiness information to baseline was successful.

The design, apparatus, procedure, and statistical analysis were the same as in Experiment 1, except that we additionally performed a Bayesian ANOVA in JASP (Version 0.17.2) to quantify support for the null hypothesis of no interaction effect.

Results of Experiment 3a

RTs

The results revealed a significant main effect of gaze-cue type [valid vs. invalid: 636 vs. 646 ms, MSE = 51.036, F (1, 23) = 45.897, p < 0.001, \({\upeta }_{p}^{2}\)= 0.666, BF10 = \(1.151 \times 10^{4}\)]. However, the main effect of trustworthiness [trustworthy vs. untrustworthy: 640 vs. 642 ms, MSE = 135.019, F (1, 23) = 1.139, p = 0.297, \({\upeta }_{p}^{2}\)= 0.047, BF01 = 1.876] and the interaction effect [MSE = 41.156, F (1, 23) = 0.018, p = 0.896, \({\upeta }_{p}^{2}\)= 0.001, BF01 = 3.663; Fig. 7] were not significant.
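As a reading aid, the partial eta squared values reported with each F test can be recovered from the F ratio and its degrees of freedom using the standard identity \({\upeta }_{p}^{2} = (F \cdot df_{1}) / (F \cdot df_{1} + df_{2})\). The sketch below is an illustration under that identity, not the software output of the study, and the function name is hypothetical. It is applied to the two RT main effects of Experiment 3a reported above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio: (F * df1) / (F * df1 + df2)."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Gaze-cue type main effect on RTs in Experiment 3a: F(1, 23) = 45.897
eta_gaze = partial_eta_squared(45.897, 1, 23)

# Trustworthiness main effect on RTs in Experiment 3a: F(1, 23) = 1.139
eta_trust = partial_eta_squared(1.139, 1, 23)

print(round(eta_gaze, 3), round(eta_trust, 3))  # prints: 0.666 0.047
```

Both values agree with the effect sizes reported for Experiment 3a.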

Fig. 7

Results of Experiment 3a/3b. Mean RTs for facial cues that were previously categorized as trustworthy and untrustworthy, in the valid and invalid trial conditions of Experiment 3. Note that in Experiment 3 the facial stimuli were intentionally adjusted to be neutral on the trustworthiness dimension. There were therefore no inherently trustworthy or untrustworthy faces in this experiment, only faces that had been rated as trustworthy or untrustworthy in Experiment 2 before the relevant features were altered. (a) RTs for Experiment 3a (N = 24). (b) RTs for Experiment 3b (N = 36)

Accuracy

The overall accuracy was 96%. A repeated-measures ANOVA revealed that neither the main effects nor the interaction effect was significant (trustworthiness: F (1, 23) = 0.240, p = 0.629, \({\upeta }_{p}^{2}\)= 0.010, BF01 = 3.460; gaze-cue type: F (1, 23) = 0.175, p = 0.679, \({\upeta }_{p}^{2}\)= 0.008, BF01 = 2.770; interaction effect: F (1, 23) < 0.001, p = 0.997, \({\upeta }_{p}^{2}\)< 0.001, BF01 = 3.717).

Results of Experiment 3b

RTs

The results revealed a significant main effect of gaze-cue type, with valid cues (686 ms) leading to shorter RTs compared to invalid cues (704 ms) [MSE = 341.704, F (1, 35) = 37.173, p < 0.001, \({\upeta }_{p}^{2}\) = 0.515, BF10 = \(2.351 \times 10^{4}\)]. However, the main effect of trustworthiness did not reach significance [trustworthy vs. untrustworthy: 693 vs. 697 ms, MSE = 258.054, F (1, 35) = 1.607, p = 0.213, \({\upeta }_{p}^{2}\)= 0.044, BF01 = 2.604], and there was no significant interaction effect [MSE = 220.233, F (1, 35) = 1.575, p = 0.218, \({\upeta }_{p}^{2}\)= 0.043, BF01 = 1.786; Fig. 7].

Accuracy

The overall accuracy stood at 96%. A repeated-measures ANOVA indicated that neither the main effects nor the interaction effect was significant (trustworthiness: F (1, 35) = 0.227, p = 0.637, \({\upeta }_{p}^{2}\)= 0.006, BF01 = 3.831; gaze-cue type: F (1, 35) = 0.213, p = 0.648, \({\upeta }_{p}^{2}\)= 0.006, BF01 = 4.505; interaction effect: F (1, 35) = 1.154, p = 0.290, \({\upeta }_{p}^{2}\)= 0.032, BF01 = 1.901).

Discussion

Experiment 3 served as a control experiment designed to test whether the interaction disappears once trustworthiness information is removed, thus ensuring that the observed changes in GCE were specifically attributable to trustworthiness rather than to other low-level facial features. Because even a small residual effect should not be overlooked, we replicated Experiment 3a with an increased sample size of 36 participants in Experiment 3b, thereby strengthening the validity of the preceding findings.

The results of Experiments 3a and 3b exhibit a consistent pattern. When the trustworthiness information was removed from the facial stimuli employed in Experiment 2, the interaction effect, which was statistically significant in Experiment 2, became non-significant in Experiment 3. In line with the main objective of Experiment 3, we also examined the absence of the interaction effect from a Bayesian perspective. The BF01 for the interaction effect was 3.663 in Experiment 3a, indicating moderate evidence in favor of a null effect, and 1.786 in Experiment 3b, providing anecdotal evidence for the null hypothesis. In addition, the significant main effect of gaze-cue type in both experiments showed that a GCE was present overall. These findings support the conclusion that trustworthiness is indeed the factor modulating the GCE, providing evidence in favor of the flexible hypothesis. This outcome aligns with our hypothesis that the GCE is present even for facial stimuli rated as neutral in trustworthiness, and that the modulation effect of trustworthiness primarily manifests when untrustworthiness hinders the GCE, rather than when trustworthiness enhances it.
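The verbal labels attached to the Bayes factors above can be made explicit with a small lookup. The following is a sketch under stated assumptions, not part of the reported analysis: it assumes the commonly used rounded cut-offs of 3, 10, 30, and 100 derived from Jeffreys (1961), uses the labels that appear in this article ("anecdotal", "moderate", "decisive") and fills in the intermediate labels "strong" and "very strong" as an assumption. The function name is hypothetical.

```python
def evidence_label(bayes_factor):
    """Verbal label for the strength of evidence expressed by a Bayes factor.

    The direction is ignored: BF = 4 and BF = 1/4 are equally strong,
    but favor opposite hypotheses. Cut-offs (3, 10, 30, 100) and labels
    are assumptions of this sketch.
    """
    if bayes_factor <= 0:
        raise ValueError("a Bayes factor must be positive")
    strength = bayes_factor if bayes_factor >= 1 else 1.0 / bayes_factor
    if strength < 3:
        return "anecdotal"
    if strength < 10:
        return "moderate"
    if strength < 30:
        return "strong"
    if strength < 100:
        return "very strong"
    return "decisive"


# Interaction effect on RTs, values as reported in the text.
print(evidence_label(3.663))     # BF01 in Experiment 3a
print(evidence_label(1.786))     # BF01 in Experiment 3b
print(evidence_label(3.5863e4))  # BF10 in Experiment 2
```

Under these assumed cut-offs, the labels match those used in the text: moderate and anecdotal support for the null interaction in Experiments 3a and 3b, and decisive support for the interaction in Experiment 2.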

General discussion

The current study has revealed that perceived trustworthiness plays a crucial role in modulating the GCE, and that this influence cannot be ascribed to trustworthiness-unrelated features of the facial stimuli. In Experiment 1, the GCE was evident in the trustworthy condition, whereas no such effect was observed in the untrustworthy condition. Building on this, in Experiment 2, we replicated the modulation effect of trustworthiness on the GCE using a new stimulus set sourced from Oosterhof and Todorov (2008). Importantly, this stimulus set allowed the trustworthiness information to be reset to a baseline level. In Experiment 3a, after resetting the trustworthiness information to baseline, we found no significant difference in the magnitude of the GCE between the formerly trustworthy and formerly untrustworthy faces used in Experiment 2. Furthermore, to enhance the robustness of the findings of Experiment 3a and to increase statistical power, we recruited an additional group of 36 participants in Experiment 3b, which confirmed the outcomes observed in Experiment 3a. These results collectively indicate that the effects observed in Experiments 1 and 2 were not confounded by trustworthiness-unrelated characteristics of the face stimuli. Taken together, our findings support the flexible hypothesis that attentional orienting triggered by gaze following is subject to modulation by top-down factors such as trustworthiness.

Regarding the modulation effect of trustworthiness on GCE, our study offers a nuanced interpretation. Existing research has consistently shown that gaze cues on their own can direct attention, even in the absence of explicit trustworthiness manipulations (for a review, see Frischen et al., 2007). Our findings suggest that the modulation arises because untrustworthy faces impede the GCE, rather than because trustworthy faces facilitate it. Notably, in Experiment 3, we observed a significant GCE for neutral faces with trustworthiness set to baseline. Conversely, in Experiment 2, no significant GCE was noted for untrustworthy faces, and there was even a slight reversed GCE, although this did not reach statistical significance. This pattern could be explained by the proposal of Todorov et al. (2008a, 2008b) that judgments of trustworthiness play a crucial role in social interactions, specifically in determining whether to approach or avoid a stranger, particularly in the absence of clear emotional cues indicating the other person’s intentions. Accordingly, in Experiment 2, when untrustworthy faces directed their gaze in a specific direction, individuals may have been predisposed to refrain from following the gaze and instead to redirect their attention elsewhere. This inclination could account for the numerically quicker responses in invalid trials. The absence of a significant GCE in the untrustworthy condition in Experiment 1 further reinforces the notion that untrustworthy faces hinder GCE. To account for the discrepancy between the non-significant GCE tendency in Experiment 1 and the non-significant reversed GCE in Experiment 2, we consider the nature of the stimuli used. In Experiment 1, we relied on a database established by Klapper et al. (2016), in which trustworthy and untrustworthy versions of each of eight faces were created by morphing them 2.5 standard deviations toward trustworthiness or untrustworthiness, following the model of Oosterhof and Todorov (2008). In contrast, for Experiment 2, we selected the most trustworthy and untrustworthy faces from the database of Oosterhof and Todorov (2008), morphed 3 standard deviations in the respective trustworthiness direction. Consequently, the stimuli used in Experiment 2 may have appeared more untrustworthy than those in Experiment 1, and may therefore have diverted participants’ attention to another direction more effectively. Nevertheless, this is only one plausible speculation that should be investigated in future research.

The present study contributes to the literature on the flexible interpretation of gaze cueing. Our findings support this interpretation by demonstrating that top-down influences from agents, encompassing intricate social factors such as trustworthiness, effectively modulate GCE. This extension of prior research brings to light the dynamic nature of perception in gaze following, challenging the notion of a rigid or purely reflexive human attentional orienting system. Instead, we argue that this system is flexible and adaptable across diverse situational contexts, consistent with earlier research findings (Teufel et al., 2010b). Our ability to selectively attend to stimuli in the environment is indicative of a prioritized processing mechanism for utilitarian stimuli, enabling the conservation of resources, facilitation of social interactions, and enhanced efficiency in goal attainment. However, this attentional allocation is not fixed; rather, it dynamically adjusts based on perceived social factors like trustworthiness. As elucidated in this study, we propose that evaluating someone’s trustworthiness based on facial appearance serves as a valuable means for individuals to efficiently extract utilitarian information when other cues are lacking. Notably, our study points to a swift redirection of attention away from the direction cued by untrustworthy faces, which may reflect a strategic response to conserve resources in the face of potential deception.

Considering the significance of GCE in a social context, one might question whether GCE is inherently flexible. In a recent publication (Zohary et al., 2022), researchers explored whether people can automatically develop GCE when gaze cues become available only in late childhood. Results indicated that late-treated cataract patients failed to understand and follow eye gaze direction, suggesting that individuals rely on an unsupervised gaze-learning process during infancy to shape GCE. However, it remains unclear whether the modulation effect of social factors observed in our research is an outcome of acquired learning during infancy. Future research endeavors could further illuminate the link between primary characteristics and the more intricate top-down processes of social factor modulation identified in our study.

Our paradigm involves potential interfering factors that require further explanation. First, some may question the use of the orientation discrimination task, as it may involve a second level of congruency in which the direction of the Gabor patch tilt aligns with the gaze cue. However, the Gabor orientation discrimination paradigm has previously been used by Shi et al. (2010) to investigate the relationship between the walking direction of biological motion and attentional orienting, which partially affirms its validity. Moreover, participants were instructed to discern whether the Gabor patch was tilted clockwise or counter-clockwise and to respond promptly by pressing one of two keys accordingly. The tilt directions of the Gabor patches were counterbalanced across conditions, so that tilt direction was not systematically aligned with the gaze cue. Second, examining the stimuli, it is evident that eye size is often altered across identities when manipulating trustworthiness. Systematic differences in eye size across conditions, typically with smaller eyes in less trustworthy conditions, may mean that eye gaze information is less salient for smaller eyes, potentially weakening the GCE. However, studies such as Bayless et al. (2011) demonstrated that emotional expressions (e.g., fear), rather than eye size, account for GCE. When eye size was held constant but the facial configuration was altered by inverting the face or presenting only the eye regions, the GCE was diminished or even eliminated. Additional research, exemplified by the study of Carlson and Aday (2018), supports the notion that eye size alone cannot drive attentional guidance. Nevertheless, future studies that manipulate trustworthiness information while holding eye size constant would enhance the credibility of the results.
Moreover, our use of computer-generated faces as stimuli in the current study encourages future research to incorporate real faces to extend the validity of findings.

In summary, our study highlights that trustworthiness significantly modulates the GCE, with indications suggesting that untrustworthy faces may hinder the GCE. These results challenge the conventional view of GCE as a purely reflexive attentional shift, proposing instead that attentional orienting operates flexibly, supporting top-down processing of the GCE. This provides empirical support for the flexible hypothesis, underscoring the dynamic nature of attentional mechanisms in the context of social cues.