Introduction

How do you know where someone is looking? What they are feeling? Or whether you recognize them? Faces are a central source of input for quick social evaluation of other people. Eye gaze conveys information about action, prediction, and goal direction, while facial expressions portray underlying mental states (e.g., emotions, recognition). Within milliseconds of seeing a face, perceivers from infancy to adulthood can make socially motivated and informed decisions about a person (Jessen & Grossmann, 2016, 2019; Qian et al., 2016; Todorov & Oh, 2021; Willis & Todorov, 2006; Yovel & Belin, 2013). The salience and centrality of social information associated with and learned from faces drive researchers to study face processing across the lifespan.

Understanding the way faces are perceived, processed, and responded to has far-reaching implications for almost all aspects of our social lives. Researchers in areas ranging from public policy to neurodivergent development consider individual-level and/or system-level phenomena relevant to processing faces. From the justice system (Chen et al., 2021; Eberhardt et al., 2004, 2006; Golby et al., 2001), to health outcomes (Greenwood et al., 2020; Hardeman et al., 2016), education (Kumar et al., 2015; Williams et al., 2019), and socialization (Halberstadt, 2020; McKone et al., 2021), researchers demonstrate that the way faces are quickly categorized and interpreted informs our explicit and implicit judgments.

Disturbances in face processing capabilities across the lifespan are believed to be foundational indicators of a variety of psychiatric and neurodevelopmental disorders (Costa et al., 2021; Griffin et al., 2021; Killgore et al., 2014; Monk et al., 2006; Saarinen et al., 2021). From a developmental perspective, the first years of life include dramatic shifts in face processing capacities (Markant & Scott, 2018; Scherf & Scott, 2012; Scott & Arcaro, 2023; Scott & Fava, 2013; Scott et al., 2007). These and other shifts in face processing throughout the lifespan are believed to be driven by both top-down and bottom-up processes (Hadley et al., 2014) and shaped by one’s environment or culture (Liu et al., 2015; Rennels & Davis, 2008; Sugden et al., 2014). It is even hypothesized that face processing biases are a cornerstone of implicit racial biases (Lee et al., 2017a, 2017b), and that improving face recognition for unfamiliar-race faces can reduce implicit associations in both children and adults (Lebrecht et al., 2009; Qian et al., 2019; Xiao et al., 2015).

Due to ongoing interest in the early-emerging and robust face processing system, there are numerous face stimuli databases using computer-generated faces (Matheson & McMullen, 2011; Roesch et al., 2011) and dynamic recordings (Krumhuber et al., 2017, 2021) as well as photographs of adults (Gross, 2005; Workman & Chatterjee, 2021) and children (Bijsterbosch et al., 2021; Dalrymple et al., 2013; Egger et al., 2011; LoBue & Thrasher, 2015; Prada et al., 2018) in either simple controlled or more natural environments. However, a much smaller subset of these validated and published databases includes Models of Color as well as White models (Chen et al., 2021; Conley et al., 2018; LoBue & Thrasher, 2015; Ma et al., 2015, 2021; Sacco et al., 2016; Strohminger et al., 2016; Tottenham et al., 2009; Ueda et al., 2019). These validated stimuli sets more accurately reflect racial diversity and globally growing multiracial populations (U.S. Census Bureau, 2021; Chen et al., 2021; Ma et al., 2021). Additionally, these racially diverse stimuli sets contribute to the needed systemic shift away from implicitly equating a White racial category and White faces with the norm and/or baseline for other comparisons. Having racially diverse, validated, and publicly available face databases enables a richer characterization of face perception and the influences of experience. Many of these data sets provide images of direct and profile views along with a range of positive and negative facial expressions, giving researchers different ways to examine face-related perceptual expertise. However, most are missing variation in a key social communication attribute: eye gaze orientation. The aim of the Diverse Face Images (DFI) database is not only to include racial and ethnic representation, but also to provide images of direct and averted eye gaze.

Research on gaze orientation relies on high-quality face images. Eye gaze and gaze following are critical communication cues that infants learn about in their first year (Akhtar & Gernsbacher, 2008; Itier & Batty, 2009; Renfrew et al., 2008). Infants learn that gaze orientation is not merely meaningless movement, but instead a purposeful cue indicative of shifting one’s attention (Frischen et al., 2007; Okumura et al., 2013a, 2013b; Senju et al., 2008; Striano & Reid, 2006). Gaze following is related to socio-cognitive skills such as joint attention, language, and even theory of mind, supporting our ability to learn about people, objects, and events in our visual world (Cleveland et al., 2007; Flom et al., 2017; Johnson et al., 2007; Reid & Striano, 2005; Reid et al., 2004; Striano et al., 2006). Disruptions or atypical responses to facial communication cues are linked to many psychiatric and neurodevelopmental disorders, such as major depressive disorder (Gaffrey et al., 2013) and anxiety disorders (Monk et al., 2006). In fact, interventions involving gaze detection and training gaze shifts are proposed as a critical step for autism spectrum disorder (Krstovska-Guerrero & Jones, 2016; Webb et al., 2014), a disorder for which varied or atypical perception of and response to gaze is observed early in development (Bedford et al., 2012; Leekam et al., 1998; Stallworthy et al., 2022). Even within neurotypical populations, successful gaze following and processing of gaze-cued objects is influenced by several social factors (Dalmaso et al., 2020; Hadders-Algra, 2022). For example, more efficient gaze following and cued-object processing has been found when viewing faces that are familiar at the individual level (Barry-Anwar et al., 2017; Hoehl et al., 2012) or at the group level, such as familiar-race faces (Pavan et al., 2011; Pickron et al., 2017; Xiao et al., 2018). Other cues such as affective valence (Hoehl & Striano, 2010; Hoehl et al., 2008) and social status (Ciardo et al., 2013) have also been found to influence gaze processing. The intersection of group membership, social evaluation, and gaze following further indicates the importance of having publicly available diverse stimuli sets of faces with both direct and averted eye gaze.

There are some published face databases available that include models with direct and averted gaze (Courset et al., 2018; Langner et al., 2010) as well as eye gaze toward peripheral objects in complex scenes (Bill et al., 2020; Recasens et al., 2015). The currently available large stimuli sets that highlight eye gaze orientation offer many strengths that fulfill the respective authors’ aims, yet there are a few missed opportunities for generalized applications of these stimuli sets that DFI aims to address. For example, some of the current stimuli sets include only one (all White; Langner et al., 2010) or two ethnicities, or lack validation data evaluating the quality of eye gaze orientation (Courset et al., 2018). The current stimuli set supplements these other sets by including women, multiple races and ethnicities, and data on perceived eye gaze orientation quality. Additionally, current stimuli sets that offer eye gaze following within naturalistic scenes include distant, indirect, partially obstructed, or even backward-facing images of human faces (Bill et al., 2020; Recasens et al., 2015). These qualities are strengths for studying perception of scenes but may have limitations for specifically investigating face processing. The current stimuli set includes close-crop views of faces that can be used for eye gaze following as well as for many other research aims.

Many studies related to gaze following and object processing report manually manipulating photographs of models with direct eye gaze into averted gaze by moving the appearance of the iris using image editing software (Ciardo et al., 2013, 2021; Hoehl & Striano, 2010; Hoehl et al., 2008; Richeson et al., 2008; Weisbuch et al., 2017). This type of manipulation means the faces used are no longer the exact faces validated in the published stimuli dataset. Despite the effectiveness researchers can achieve by manipulating a photographed face’s direct gaze to averted gaze, utilizing existing databases of stimuli with already averted gaze will likely cut down on stimuli development time and reduce the risk of between-stimuli editing variation. Alternatively, some researchers utilize computer-generated faces (Pavan et al., 2011), which provide flexibility in face race and eye gaze orientation but may have reduced ecological validity. It may be that some of the existing databases need to be more widely publicized or that researchers have requirements (e.g., racial or gender diversity) not completely fulfilled by what is presently available.

Specifically, the present paper contributes norming data for female models who self-identified with one of five racial or ethnic identities. This database of faces expands existing databases both in the actual faces that are included and in the type of norming data collected. The faces include multiple exemplars of female models and images of direct and averted eye gaze orientation that have been equated for low-level visual differences. The rating data for the stimuli aim to go beyond accuracy of categorization by including in-depth ratings of each model across three major themes: (1) racial and ethnic group associations, (2) eye gaze orientation, and (3) emotion expression. Researchers will have access not only to averaged data across all models, but critically, to summary data about each individual model included in our database. By including individual model data, we offer opportunities for researchers to consider variability in evaluations within and between the different groups of faces.

Methods

Procedures for recording the face models were approved by the University of Massachusetts Amherst’s Institutional Review Board in 2015. All face models provided written informed consent for their images to be used and distributed for study participation, future research, and publication purposes. Validation data collection methods were approved by the Institutional Review Board of the University of Florida (2022). Participants recruited through the University of Florida provided written informed consent to participate in data collection procedures and have de-identified data published. All research conducted through each institution was performed in accordance with the Declaration of Helsinki.

Stimulus development

Model recruitment and demographics

Individuals were recruited from the University of Massachusetts Amherst and surrounding communities to serve as models for our stimuli dataset. Information on age, gender, and racial and ethnic identity was included in our recruitment materials. Specifically, we advertised for adults between 18 and 34 years of age who self-identified as male or female and as African-American/Black, White, Chinese/Chinese-American/Taiwanese American, Chinese/Vietnamese, South Asian/Indian/Indian-American, or Hispanic/Latiné. Participants were paid $5.00 in cash for a single 30-min recording session. A total of 41 female models were video-recorded (7 East Asian, 7 Southeast Asian, 10 Black/African-American, 10 White, and 7 Hispanic/Latiné). Six of the seven women who ethnically identified as Hispanic/Latiné also self-identified with multiple racial groups, which is commonly reported among this population (Araujo Dawson & Quiros, 2014; Cruz-Janzen, 2002; Umaña-Taylor et al., 2014). We included these models in the Hispanic/Latiné group as their primary ethnic identity. The goal of the project is to provide both the models’ self-identity data and raters’ perceptions of racial and ethnic identity, giving researchers who use the dataset as much information as possible and leaving it to each researcher to choose which faces to use and how to incorporate race/ethnicity demographic diversity. Only eight men volunteered to participate in the stimuli recording. This sample size did not provide enough representation across race and ethnicity groups, and we therefore decided not to include male faces in this iteration of our face stimuli set.

After recording all the models, our research group created still frame images of each female model expressing direct eye gaze and averted eye gaze to the left and the right. Our team then completed a preliminary visual inspection of the quality of each model’s still frame image. We looked for clear facial expressions, direction of eye gaze, and centrality of head position. The final sample of models included 41 self-identified women (Mage = 25.2, SDage = 3.43). Additional demographic metadata (e.g., further racial and/or ethnic identity details if provided by the individual) are included in Fig. 1 (see column headings “Self-identified ethnicity group” and “Self-identified ethnicity subgroup”).

Fig. 1

DFI face stimuli. Note. * In the questionnaire, South Asian and Southeast Asian ethnicities were collapsed. The figure includes every model, their self-identified race and ethnicity, and multiple examples of each model based on facial expression and eye gaze

Model recording procedure

Each model wore the same black T-shirt and little to no makeup or jewelry, and removed eyeglasses, hats, and any other items that may have obscured the view of their faces. A Canon XF100 camera was used and positioned directly in front of the models. Fluorescent box lights and softboxes were used for the lighting setup. Light fixtures were positioned to the left and right of the model. Each model was recorded from their shoulders up, and all sat in the same chair during recording sessions. The height of the camera was adjusted based on the height of each participant.

Models were instructed to maintain a pleasant, relaxed, but neutral expression, keeping their mouth closed during recording. For some models (see Fig. 1), representative images of every category were not available from the video; for example, several models did not have a closed-mouth happy image that could be extracted. To control between-model variation in eye movement, models followed a PowerPoint presentation of a ball projected on a wall across from them. The ball moved from the center of the wall to either the left or the right. Models were instructed to track the ball with their eyes while maintaining a still head, neck, and torso (i.e., keeping the rest of their body facing forward toward the camera). This ensured that eye movement timing and duration were consistent across models.

Post-recording image processing

Within Photoshop, faces were centered within an oval measuring 875 × 1387 pixels, with a white background masking the other parts of the image. All faces were centered such that the eyes were level within a boundary (875 × 246 pixels) positioned 123 pixels above the center. A vertical boundary (246 × 1387 pixels) served as a guide to ensure that eyes were equally spaced horizontally. Faces were resized to ensure that both the chin and part of the hairline were visible. Any additional distinguishing features (e.g., a nose ring) were blurred or removed to reduce attentional capture. Faces were converted to grayscale, because performance on face recognition tasks is stronger for grayscale than color images (R. Russell et al., 2007). Low-level perceptual differences (e.g., luminance and contrast) were then reduced by averaging dark and light pixel contrasts and equating the standard deviations of the luminance distribution across all faces using SHINE (Spectrum, Histogram, and Intensity Normalization and Equalization; Willenbockel et al., 2010) in MATLAB. Within the SHINE toolbox source folder, the faces were placed in the SHINE input folder, and no template images were used in the SHINE template folder. In MATLAB, the path was set to the SHINE toolbox folder, and lines 69–72 of the main m-file, SHINE, were edited to reflect the correct image file type (e.g., TIFF) and the locations of the input, output, and template folders. The main m-file, SHINE, was run with custom options to obtain mean luminance matching on the whole images: the matching mode was luminance, the luminance option was lumMatch, and the matching region was the whole image. The final SHINE-processed faces were automatically generated into the SHINE output folder. Individual faces were then edited for unnatural patches or distortions using Adobe Photoshop. See Fig. 1 for examples of finalized model stimuli.
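
For readers approximating this step without MATLAB, the core of SHINE's lumMatch operation, rescaling every grayscale image to a shared mean and standard deviation of luminance, can be sketched in R. The folder names and target statistics below are hypothetical; this is an approximation of the idea, not the authors' pipeline.

```r
## Approximate luminance matching across a set of grayscale images,
## mirroring SHINE's whole-image lumMatch mode (hypothetical folders)
library(tiff)  # readTIFF()/writeTIFF() use intensities in [0, 1]

files <- list.files("faces_in", pattern = "\\.tif{1,2}$", full.names = TRUE)
imgs  <- lapply(files, readTIFF)  # assumes single-channel grayscale TIFFs

# Targets: grand mean luminance and the average per-image SD across the set
target_m  <- mean(sapply(imgs, mean))
target_sd <- mean(sapply(imgs, sd))

matched <- lapply(imgs, function(im) {
  z <- (im - mean(im)) / sd(im)               # standardize this image
  pmin(pmax(z * target_sd + target_m, 0), 1)  # rescale, clip to valid range
})

dir.create("faces_out", showWarnings = FALSE)
for (i in seq_along(files)) {
  writeTIFF(matched[[i]], file.path("faces_out", basename(files[i])))
}
```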

Rater data collection

Participants

A total of 327 adults consented to participate in the face model rating study; 38 exited the survey after consent or after completing demographic information and were subsequently excluded. Of the final 288 participants, the majority completed the entire survey (n = 241, 83.7%); the remaining participants completed between 10% and 55% of the survey, but their data were still included in analyses given a planned analytic solution to account for missing data. Most participants were aged 18–25 years (n = 279, 96.9%), and an additional six participants indicated being 25–45 years of age. Participants self-reported their race and ethnicity identity (see Table 1). The majority of participants identified as women (n = 219, 76%), with an additional 63 identifying as men (21.9%); the remaining participants identified as transgender (n = 1) or nonbinary (n = 2), or preferred not to answer (n = 3). See Supplemental Table S1 for additional demographic information related to highest education level, annual household income, and additional self-reported details regarding gender identity.

Table 1 Race and ethnicity self-identity from 288 participants who rated the stimuli

Data collection procedure

Sona Systems, a participant pool management system for universities, was used to recruit adults taking undergraduate psychology courses at the University of Florida (Copyright 1997–2023, Sona Systems Ltd.). An online study page was created in the University of Florida’s Department of Psychology Research Participation Credit Manager, Sona, as a platform for adults to access our stimulus set. Adults opted to participate by registering for one time slot per person, after which they received the link to our survey. We tested as many students as were willing to sign up for the online study within a single academic semester. Face model ratings were collected using an online survey through Qualtrics software (release February 2023; Copyright 2023, Provo, UT, USA). Each participant rated all the available images, for a total of 149 trials (one image per trial). A randomized list of the stimuli was generated in Excel and then used for the presentation order in Qualtrics. Each face image was presented by itself in the center of the screen. Below the image, raters were given instructions about the type of rating they were being asked to complete.
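
For illustration, the randomization step could equivalently be scripted; the authors generated theirs in Excel. Below is a hedged R sketch with invented file names (the real set totals 149 images because not every model had every expression available).

```r
## Hypothetical sketch of generating a randomized presentation order for
## Qualtrics; the authors used Excel, and file names here are invented
set.seed(2023)  # make the single fixed order reproducible

# 41 models x up to 4 image types; the real set totals 149 images because
# some models lack certain expressions
stimuli <- sprintf("model%02d_%s.tif",
                   rep(1:41, each = 4),
                   c("gaze_direct_calm", "gaze_direct_happy",
                     "gaze_averted_left", "gaze_averted_right"))

presentation_order <- sample(stimuli)  # one random permutation of all images
write.csv(data.frame(trial = seq_along(presentation_order),
                     image = presentation_order),
          "qualtrics_presentation_order.csv", row.names = FALSE)
```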

The primary goal of our study is to offer validation data evaluating three major categories of our included models. Categories included (1) racial and ethnic group associations, (2) eye gaze orientation, and (3) emotional facial expressions. Participants were asked multiple sub-questions within each of these three major categories, and the model’s image was visible to participants for all of the questions. These three categories were selected for evaluation as there is extensive evidence that each quality both uniquely and interconnectedly influences face processing across the lifespan (Adams et al., 2010; Farroni et al., 2004; Gregory et al., 2020; Hoehl & Striano, 2008; Quinn et al., 2018; Richeson et al., 2008; Trawalter et al., 2008).

Race and ethnicity categories

For the race and ethnicity group association, participants responded to the prompt: “How strong of an association does this face have with the following five racial or ethnic groups?” The five racial and ethnic groups included Asian (Chinese, Japanese, Korean, etc.), Black (African-American, Ethiopian, Haitian, Nigerian, etc.), Hispanic and/or Latin American (Mexican, Puerto Rican, Cuban, Dominican, etc.), South and Southeast Asian (Indian, Indonesian, Thai, Malaysian, etc.), and White (Dutch, English, Irish, European, Norwegian, etc.). Participants used a five-point Likert scale, coded as follows: 0 = no association, 1 = very little association, 2 = not sure, 3 = strong association, and 4 = extremely strong association. For every face, participants used the same Likert scale to answer the association prompt for each of the five listed racial and ethnic groups. There were no explicit instructions about how many races a model could be associated with. By asking for an association rating for each race/ethnicity category for every face, we allowed raters to give a range of association values. For example, a model could be rated as having an extremely strong association for the racial category of Black, and no association for racial categories of White, Asian, or South or Southeast Asian, but could also be rated as having a strong association with the ethnicity category of Hispanic/Latiné.

Using this rating technique, we obtained more nuanced information about the way each model’s face was perceived racially and ethnically, instead of a simple racial categorization response. We acknowledge that race and corresponding racialized categories are social constructs that can be used to perpetuate false associations of biological underpinnings and meaningful distinctions between people (Salter et al., 2018; Smedley & Smedley, 2005). Despite racial categories being products of racial ideation, particularly in the United States, people have been historically socialized to use racialized categories as marking real differences (Hochman, 2021; Roberts & Rizzo, 2021). Perceptual categorization is driven by many factors including experience, socio-cognitive bias, cultural and individual definitions of race and corresponding racial groups, emotional expression, and eye gaze orientation. In our study, we are not focused on the “accuracy” of categorizing faces. That is, we aim to provide information about individual variability in perceiving racial category associations, not whether any given person’s categorization matches the stated racial identity of the model. An individual’s racial identity can be fluid, is constructed from experience, and often does not match with how the rest of the human population perceives and categorizes them (Albuja et al., 2018; Davenport, 2020; Douglass et al., 2016). We focus on perception instead of racial identity accuracy because perceptual processes, especially experience-driven biased perceptions, underlie fast-acting neural and behavioral responses following the presentation of a face. These responses are important and are often what is being measured by researchers using human faces in their studies. Thus, the collected data offer a descriptive range of the way the faces in this set are evaluated and the strength of associations individuals have between different socially constructed racial categories and the presented faces. These association data describe how a U.S.-based population views the faces and inform the decisions researchers must make when choosing which faces to include in their own research.

Eye gaze orientation

Participants responded to the prompt: “Where do you think this person is looking?” Responses were as follows: 0 = difficult to determine, 1 = directly at the participant, 2 = away from the participant. Models were prompted to display direct and averted eye gaze (to the left and to the right). These naturalistic eye movements come with individual variability in how obvious it is where someone is looking.

Emotional facial expressions

Including ratings regarding facial expressions is critical when creating a database of face stimuli. We included this evaluation for three reasons. First, the perception of a facial expression is contextually influenced; that is, there could be spillover effects from rating one face to the next (Albohn & Adams, 2021; Russell & Fehr, 1987). Second, despite the instructions to express a calm or neutral expression, there is individual variability in the way this type of expression is executed. Third, calm or neutral expressions can be perceived as having emotional messages, particularly a more negative valence (Albohn et al., 2019; Lee et al., 2008). The interpretation of a neutral facial expression can also be racialized based on the race of the viewer and the presented face (Hu et al., 2017). Participant raters responded to the prompt: “Based on the image above: How ___ does this person look in this image?” The emotional expressions included happy, calm, angry, and neutral (no expression). Participants used a five-point Likert scale for each of the four emotional expressions listed. The four emotion category questions were listed in table format, such that there were four rows (one row per emotion category) and five columns (one for each point on the Likert scale). Ratings were coded as follows: 0 = not at all, 1 = somewhat, 2 = average, 3 = very, and 4 = extremely. For example, a model could be rated as “very” calm, “very” neutral, and “not at all” happy and angry. Similar to the racial and ethnic group associations, we aimed to collect data that highlight individual variability instead of simple yes/no or categorization data.

Data analysis

Descriptive statistics are included for future users of the stimuli set to understand the means and standard deviations of ratings. In addition, analyses were conducted to better understand whether these ratings were influenced by the stimuli (i.e., differences due to stimulus race, orientation, or emotional expression) or by rater race. These analyses were conducted in R (version 4.3.1) using the linear mixed-effects package lme4 (Bates et al., 2015) to model association strength for each judgment (race/ethnicity, eye gaze orientation, and emotion). All analytic models include a random intercept for each photo model to account for the multiple ratings across different images. This approach also helps account for missing item-level data (e.g., empty survey answers) by utilizing available item-level data without reducing statistical power (Mazza et al., 2015). Estimated marginal means (EMMs) were used to describe fixed effects (specified for each model in each respective results section), and Bonferroni correction was applied to pairwise comparisons.
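
To make this pipeline concrete, a minimal R sketch follows. The data frame and column names (ratings, rating, model_race, rater_match, photo_model) are hypothetical, and the lmerTest and emmeans packages are assumptions added here to reproduce the F tests and EMM comparisons the text reports (the paper names only lme4).

```r
## Minimal sketch of the analytic approach, under assumed column names
library(lme4)      # linear mixed-effects models (Bates et al., 2015)
library(lmerTest)  # adds F tests (Satterthwaite) for fixed effects
library(emmeans)   # estimated marginal means and corrected comparisons

# Random intercept per photo model accounts for repeated ratings of the
# same model's multiple images; fixed effects vary by research question
m1 <- lmer(rating ~ model_race * rater_match + (1 | photo_model),
           data = ratings)
anova(m1)  # F tests for the fixed effects and their interaction

# EMMs by face race, with Bonferroni-corrected pairwise comparisons
emm <- emmeans(m1, ~ model_race)
pairs(emm, adjust = "bonferroni")
```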

Results

Analysis of the presented data focuses on descriptive statistics for three areas of evaluation: (1) race and ethnicity group associations, (2) eye gaze orientation, and (3) emotional facial expressions.

Validity of race groups

Validity ratings for each of the five race and ethnicity groups are presented in Fig. 2 (see Supplemental Table S2 and Supplemental Figs. S2–S5 for mean association ratings of the race and ethnicity group association for each model’s multiple face exemplars). Linear mixed-effects models predicting the mean rating of association strength included a fixed effect of the race category being rated, a fixed effect of rater’s race (same as or different from the photo model’s race), and the interaction between the two fixed effects.

Full statistical model details are available in the Supplemental Material for Model 1. Considering target responses (i.e., responses matching the photo model’s self-identified race; e.g., association strength of Asian faces for a self-identified Asian photo model), raters’ association strength did vary based upon the photo model’s self-identified race, F(4, 34) = 23.51, p < 0.0001, such that, overall, estimated marginal mean (EMM) ratings were lower for Hispanic faces (EMM = 2.16, SE = 0.12) than all other faces, p < 0.001. Ratings for target responses to Southeast Asian faces (EMM = 2.78, SE = 0.13) were also lower than for Black faces (EMM = 3.63, SE = 0.11), p = 0.002. As indicated in Fig. 3, the variability in ratings was higher for Hispanic faces (EMM = 0.39, SE = 0.069) than for Asian faces (EMM = 0.068, SE = 0.079; p = 0.0004), Black faces (EMM = 0.11, SE = 0.066; p = 0.0005), and Southeast Asian faces (EMM = 0.31, SE = 0.073; p = 0.042), as confirmed using a post hoc Levene’s test based upon absolute deviation from the mean, statistic = 7.23, p < 0.001. See Fig. 3 for mean ratings of associations for each face race and ethnicity group by rater’s self-identified race.
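
For concreteness, the variability check could be scripted as follows; this is a hypothetical sketch (column names assumed, reusing the ratings data frame from the earlier sketch), where car::leveneTest with center = mean matches the reported test based on absolute deviation from the mean.

```r
## Hypothetical sketch of the post hoc Levene's test on target responses
## (each face rated on its model's self-identified race category)
library(car)

target <- subset(ratings, rated_category == model_race)  # target responses only
leveneTest(rating ~ factor(model_race), data = target, center = mean)
```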

Fig. 2

Averaged race and ethnicity group association by model number. Note. Figure depicts the mean rating of race and ethnicity group associations for each model. These data are collapsed across the specific model examples (e.g., averted left eye gaze, averted right eye gaze, calm, and happy) and grouped by the self-identified racial group of the models

Fig. 3

Mean rating of the strength of association (i.e., strength of stimulus association to each of the five race and ethnicity groups), scored on a Likert scale from 0 (no association) to 4 (extremely strong association). Analysis was conducted on numeric means but is represented by category here for illustrative purposes. Note. Violin plots highlight relative density, and boxplots illustrate quartile values. *** p < 0.0001

Overall, association strength was weaker when the photo model’s race was different from the rater’s self-identified race, F(1, 34) = 10.69, p = 0.003; however, a significant interaction, F(4, 34) = 14.34, p < 0.0001, indicated that this was only true for Black faces, p < 0.0001, and not for the other face races, p > 0.34. In other words, non-Black raters had weaker association strength to Black faces.

Validity of eye gaze orientation

Generally, participants were accurate in reporting gaze orientation (range = 74.4–99.5%). Full statistical model details are available in the Supplemental Material for Model 2. Linear mixed-effects models predicting percent gaze orientation accuracy included a fixed effect of gaze condition [averted (collapsed across left and right) and direct (collapsed across happy and calm)], a fixed effect of photo model race, a fixed effect of rater’s race (same as or different from the photo model’s race), and full-factorial interactions between all fixed effects. Raters were more accurate for direct gaze (EMM = 94.9%, SE = 0.4%) than averted gaze (EMM = 90.2%, SE = 0.4%), F(1, 102) = 83.84, p < 0.0001. An interaction between photo model race, rater race, and gaze direction, F(4, 102) = 4.64, p = 0.0018, indicated that Black raters were less accurate than non-Black raters for gaze orientation of Black faces, but only for direct-facing stimuli, p = 0.045. No other pairwise comparisons were significant, p > 0.064. See Fig. 4 for the accuracy ratings identifying direct and averted eye gaze orientation for each photo model race.
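
As an illustration of how the accuracy variable could be derived from the coded gaze responses (0 = difficult to determine, 1 = direct, 2 = averted) before fitting Model 2, here is a hedged sketch with hypothetical column names, assuming dplyr plus the lme4/lmerTest setup from the earlier sketch.

```r
## Hypothetical sketch: score each gaze response against the displayed
## condition, aggregate to percent correct, and fit the Model 2 structure
library(dplyr)

gaze <- gaze_responses %>%
  mutate(correct = case_when(
    gaze_condition == "direct"  & response == 1 ~ 1,  # correctly judged direct
    gaze_condition == "averted" & response == 2 ~ 1,  # correctly judged averted
    TRUE                                        ~ 0   # miss or "difficult to determine"
  ))

acc <- gaze %>%
  group_by(rater_id, photo_model, model_race, rater_match, gaze_condition) %>%
  summarise(pct_correct = 100 * mean(correct), .groups = "drop")

m2 <- lmer(pct_correct ~ gaze_condition * model_race * rater_match +
             (1 | photo_model), data = acc)
anova(m2)  # F tests for main effects and interactions
```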

Fig. 4

Accuracy of eye gaze orientation. Note. Percent accuracy of eye gaze orientation categorization (e.g., accurately indicated that averted stimulus is looking away). Violin plot highlights relative density, and boxplots illustrate quartile values. * p < 0.05

Validity of facial expressions

Full statistical model details are available in the Supplemental Material for Model 3. Linear mixed-effects models predicting the mean rating included a fixed effect of emotion condition for each stimulus, a fixed effect of photo model race, a fixed effect of rater’s race (same as or different from photo model’s race), and full-factorial interactions between all fixed effects (see Supplemental Table S3 for the average rating of emotional facial expression for all face models and each iteration of their image).

Overall, ratings were lowest for angry (EMM = 0.46, SE = 0.027) relative to other emotions, p < 0.001, including calm (EMM = 1.77, SE = 0.027), happy (EMM = 1.48, SE = 0.027), and neutral (EMM = 1.70, SE = 0.027). Happy ratings were also lower relative to neutral, p < 0.0001. Rating of facial expressions was not related to rater race, p > 0.17. An interaction between photo model race and emotional expression, F(12, 238) = 4.88, p < 0.0001, indicated that, first, Asian and Hispanic faces had stronger happy associations than Black faces, p < 0.022, and second, Black faces had stronger neutral associations than White faces, p = 0.036. See Fig. 5 for depiction of facial expression findings.
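
Follow-up comparisons for an interaction like this one can be requested as conditional EMMs; below is a minimal sketch under the same hypothetical naming assumptions (emmeans and lme4/lmerTest loaded as in the earlier sketches).

```r
## Hypothetical sketch: Model 3 structure, then face-race comparisons
## within each emotion category to unpack the reported interaction
m3 <- lmer(rating ~ emotion * model_race * rater_match + (1 | photo_model),
           data = emotion_ratings)

emm3 <- emmeans(m3, ~ model_race | emotion)  # race EMMs within each emotion
pairs(emm3, adjust = "bonferroni")           # e.g., Asian vs. Black for "happy"
```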

Fig. 5

Evaluation of facial expression. Note. Violin plots highlight relative density, and boxplots illustrate quartile values. Mean rating (i.e., strength of stimulus association to each of the four emotions) was scored on a Likert scale from 0 (not at all) to 4 (extremely). Analysis was conducted on numeric means but is represented by category here for illustrative purposes

Discussion

The present project introduces a new face stimuli database along with validation data rating three key themes of face processing: racial and ethnicity group association, eye gaze orientation, and emotional facial expressions. The DFI stimuli set will be an open-access tool that includes images of racially diverse female-identifying adults with direct and averted eye gaze orientation, as captured from dynamic video recordings to ensure ecological validity.

The current validation rating data extend beyond the question of accurately categorizing faces on a single criterion. A unique feature of our stimuli rating procedure was that it allowed participants to individually rate faces on each element of interest. Our aim is to provide researchers with data that give a richer characterization of how the faces in our database are perceived. From the reported descriptive statistics, researchers can generate additional analyses to fit their specific needs when making stimuli selection choices. We were particularly interested in highlighting the variability in adults’ perceptions of racial and ethnic group associations. Specifically, we provide descriptive data that highlight the variability in model ratings within and between racial and ethnic groups. We offer a general conclusion from each of the three face category evaluations of interest. First, adult participants showed more variability in race and ethnicity group associations for models who self-identified as Hispanic/Latiné, relative to Asian and Black models. Second, images with averted eye gaze were clearly perceived as such by our participant raters. Third, faces were overall rated as relatively calm or neutral. This rating of the models’ emotional facial expressions is consistent with the original instruction to the models during the video recording session, which was to express a calm but pleasant expression.

Broadly, our results suggest consistent findings across raters’ self-identified races, even though the ratings of these models come from a majority White (60.5%) sample of adults. There were two unexpected rater race effects. First, non-Black raters had slightly weaker race-rating association strength to Black faces (3.52) than Black raters (3.74). This weaker association made by non-Black raters was somewhat unexpected given the robust “other-race categorization advantage,” reflected in stronger or faster race categorization for unfamiliar-race faces (Caharel et al., 2011; de Lissa et al., 2021; Feng et al., 2011; Sekimoto, 2018; Zhao & Bentin, 2008). It is possible that non-Black raters’ reduced categorization association was a result of seeing both direct and averted eye gaze of Black models, as eye gaze orientation is reported to impact different elements of face processing (Adams et al., 2010; Sessa & Dalmaso, 2016). However, it is unclear why this finding was specific to Black models and non-Black raters rather than to any model of a different race than the rater. Second, Black raters were slightly less accurate in rating gaze orientation (89.9%) relative to non-Black raters (93.4%). This finding may be related to prior work suggesting differences in visual scanpath strategies for Black observers (i.e., attending to the lower half of the face) compared to White observers (the upper half of the face) (Hills & Pake, 2013). Broadly, however, all raters were accurate (> 70%), and this subtle difference requires more work before concluding race group differences in categorizing direct and averted eye gaze.

The faces included in DFI will be particularly useful for researchers and practitioners examining processes related to areas such as intergroup bias, face perception, attention orientation, and communication cues. With increasing racial and ethnic participant diversity represented in developmental, socio-cognitive, and neuroscience-based studies, this database will support researchers’ efforts in maximizing inclusivity. One example of using this new database of faces is to support testing of the Interactive Model of Attention and Perception (I-MAP). I-MAP predicts that with development comes increasing control over attention which directs perceptual learning and supports top-down selective attention biases for familiar faces (Markant & Scott, 2018). The I-MAP model hypothesizes that the interaction between perceptual learning and attention results in increased anterior-to-posterior neural processing and increasingly right-lateralized occipitotemporal face processing during the first year of life. To this end, the DFI face database will support such research with its racial and ethnic representation, multiple images of each model, and controls for low-level visual cues.

Limitations and future directions

Despite the strengths of the DFI stimuli set, there are two key limitations to acknowledge. The first is the demographics of the participants who rated the models. The participants who completed the stimuli rating questionnaire were primarily White and female. However, the participant sample is representative of the population where the data were collected in north-central Florida. Despite obtaining detailed race and ethnicity identity from participating raters, we did not obtain details about the racial diversity of raters’ daily lives. This type of information will be beneficial for future stimuli development projects to gain a richer characterization of the experiential context that may influence raters’ evaluations. In the future, we would also like to increase the racial and ethnic diversity within our participant sample and to add cross-cultural evaluations. Of particular interest is increasing the sample size of raters whose racial and ethnic identities match those of the models included in the DFI stimuli set. This may be especially important for Black-identified raters to obtain a fuller representation of individual- and group-level differences in reviewing eye gaze orientation. Researchers conducting cross-cultural work may find it especially useful to complete additional validation checks with intended participant populations (i.e., those outside of north-central Florida, USA). The second limitation is that we were unable to recruit enough male models, and not all racial and ethnic groups are represented within our stimuli set, including multiracial-identifying models. In the future, the authors plan to make the original videos of each model publicly available and to add male-identifying models to the DFI database.

Conclusion

Development of the DFI was motivated by the need for high-quality, racially representative images of faces with averted eye gaze to investigate the way early experiences shape face perception. Responding to eye gaze is a key developmental skill that may have transdiagnostic implications for neurodiverse development and social communication capabilities. This publicly available stimuli database and its rater validation descriptive statistics complement and extend the face stimuli resources that researchers and practitioners have access to. The included images give researchers a path forward in efforts toward decentering Whiteness as a standard in studying processes related to human faces.