Gender equity in all fields in academia has progressed over the past several decades, but data from the National Science Foundation (2004) and the U.S. Department of Commerce (2011) suggest that women continue to be less likely than men to access academic careers, to attain full-time positions, and to be promoted and tenured in the natural and social sciences, engineering, and mathematics disciplines. In what has been dubbed the “pipeline problem,” women enter graduate school at about the same frequency as do men, but are less likely to enter and succeed in academia than are their male counterparts (Aiston, 2014; Deo, 2014; Ding, Murray, & Stuart, 2013; Ellemers, Heuvel, Gilder, Maas, & Bovini, 2004; Taylor, 2007; Yost, Winstead, Cotten, & Handley, 2013). In addition, once hired, women leave academia at slightly higher rates than their male counterparts across various disciplines (e.g., Adamo, 2013; Easterly & Ricard, 2011; Kaminski & Geisler, 2012; Levine, Lin, Kern, Wright, & Carrese, 2011; National Academy of Sciences et al., 2007).

One of the limitations in this literature is that the majority of the research focuses on the selection rates of women versus men in specific fields of academia and how women experience bias in their academic careers (i.e., after selection decisions are made) (e.g., Aguirre, 2000; Howe-Walsh & Turnbull, 2016; Lee & Won, 2014; Lerback & Hanson, 2017; Settles, Cortina, Malley, & Stewart, 2006). This is an important limitation because it has left a gap in understanding how bias manifests in the early stages of the selection process. In the current studies, we address this limitation by examining letters of recommendation, one of the most important early-stage selection tools used in academia (Abbott et al. 2010; Sheehan, McDevitt, & Ross, 1998). A growing body of literature has shown how bias can influence the manner in which letters of recommendation are written. Specifically, gender biases arising from perceived gender differences can lead to differences in how letters are written for men and women (Dutt, Pfaff, Bernstein, Dillard, & Block, 2016; Isaac, Chertoff, Lee, & Carnes, 2011; LaCroix, 1985; Madera, Hebl, & Martin, 2009; Moss-Racusin, Dovidio, Brescoll, Graham, & Handelsman, 2012; Rubini & Menegatti, 2014; Schmader, Whitehead, & Wysocki, 2007; Shen, 2013).

The current studies draw from the literature on gender schemas, which are mental models summarizing implicit beliefs and expectations of male and female roles (Crockett, 1988; Fiske & Linville, 1980; Valian, 1998), and the literature on gender linguistic bias (Maass & Arcuri, 1996; Rubini & Menegatti, 2014) to examine doubt raisers (i.e., phrases or statements that question an applicant’s aptness for a job) in letters of recommendation (Trix & Psenka, 2003). Examples of doubt raisers include statements like “somewhat challenging personality,” “might make a good colleague,” and “in view of the difficulties [being experienced], . . . performance was especially impressive.” Though they may vary in the degree of negativity and subtleness, they all potentially raise doubts for the evaluator because they indicate that the writer is uncertain about the applicant or does not have an entirely positive impression of the applicant.

The first aim of the current studies is to determine if letters of recommendation for academic positions include more doubt raisers for women than for men. In study 1, we examine gender differences in letters of recommendation using objective methods (i.e., language content analysis) and statistical procedures appropriate for nested data. In addition, because there are well-known gender differences for several job predictor domains, such as various measures of cognitive ability, personality, and vocational interests (Hough, Oswald, & Ployhart, 2001; Su, Rounds, & Armstrong, 2009), we include measures of academic performance as control variables. Specifically, we use several variables that reflect objective measures of academic performance (e.g., number of publications and number of courses taught) to examine gender differences in academic performance and control for any potential differences that could be related to the use of doubt raisers.

The second aim of the current studies is to determine if doubt raisers actually affect how applicants are evaluated. Even if more doubt raisers are used for women than men in letters of recommendation, such subtleties in language may not matter. In study 2, we use experimental methods and an academic sample to examine if doubt raisers in letters of recommendation negatively affect how applicants are evaluated.

By examining gender differences in the use of doubt raisers in letters of recommendations (study 1) and how doubt raisers negatively affect applicant evaluations (study 2), the current studies will provide a better understanding of how gender schemas affect women in the early stages of the selection process in academia. By examining doubt raisers in letters, the current studies contribute to understanding how gender schemas influence the manner in which men and women are described differently in letters, even after accounting for various indicators of productivity. Research suggests that bias against women might be reduced when women are described as highly qualified because it reduces the uncertainty of whether an applicant will be successful (Heilman, Wallen, Fuchs, & Tamkins, 2004) and offsets gender schema stereotypes that work against women in occupations or roles that are often related to male gender norms (Heilman, 2012). Therefore, it is important to examine if more doubt is raised for women than men in the letters of recommendation, because doubt raisers lead to questions regarding the potential for success of an applicant by introducing uncertainty.

This research also contributes to our understanding of how gender schemas can affect women even before the selection process begins. That is, gender schemas can influence how letters of recommendation are constructed, before they are even used to evaluate an applicant, potentially biasing evaluations for women in the earliest stages of selection. This is particularly important to examine because recent research suggests a new trend for female applicants in academia; namely, selection rates for women in academia seem to be substantially improving in some STEM-related fields (Ceci, Ginther, Kahn, & Williams, 2014a, 2014b; National Research Council, 2009). A series of studies shows that women were preferred over men, but only when they were described as equally and not less qualified than men (e.g., Williams & Ceci, 2015; Ceci & Williams, 2015). Despite this encouraging progress, what this research ignores is the possible bias women face at earlier stages of the selection process, before final selection decisions are made. The results of our current research represent a particularly important contribution to this literature, considering that so much of the research on gender bias in academia has focused on what occurs after selection decisions are made.

Background

Letters of Recommendation in Academia

Although they are only one of numerous factors that are considered in evaluating and selecting applicants for jobs, letters of recommendation are an important tool used to screen graduate students, medical school applicants, and faculty in academic settings (Johnson et al., 1998; Landrum, Jeglum, & Cashin, 1994; Nicklin & Roch, 2009; Sheehan et al., 1998) and are valid predictors of undergraduate performance, graduate performance, and professional school performance (Kuncel, Kochevar, & Ones, 2014). Letters of recommendation are tools that screen candidates in the selection process (Guion, 1998; Morgan, Elder, & King, 2013) because they verify information provided by applicants and offer information about applicants’ past performance (Aamodt, Nagy, & Thompson, 1998; Gatewood & Feild, 2001; McCarthy & Goffin, 2001).

Both quantitative and qualitative research have identified the use and strong importance of letters of recommendation in academia. First, letters are critical determinants of who gets academic-based internships. That is, Mittenberg, Peterson, Cooper, Strauman, and Essig (2000) found that letters of recommendation and personal interviews were considered more important than grades or work samples. Similarly, in a study of pre-doctoral internships, 82% of internship selection members from the Association of Psychology Postdoctoral and Internship Centers ranked letters of recommendation as “important” to “very important” in their selection process (APPIC, 2005).

Second, letters are important in assessing teaching abilities of academicians. For example, in a study of how search committee chairs in psychology evaluate applicants’ teaching, Benson and Buskist (2005) found that letters of recommendation were the second most used criterion (after student evaluations), and were more important than previous teaching experience, statement of teaching philosophy, and the applicant’s job talk. In a similar qualitative study of how search committees in academia evaluate teaching ability, Meizlish and Kaplan (2008) examined a sample of 457 surveys from various departments, including English, history, political science, psychology, biology, and chemistry. They found that search committees put more weight on letters of recommendation to assess faculty applicants than on any other criterion and that CVs, cover letters, and letters of recommendation were the three most commonly requested materials for open positions.

Third, letters of recommendation are important for inviting applicants in academia for an interview. A study of the hiring process from 368 English departments (Broughton & Conlogue, 2001) found that letters of recommendation were ranked among the top four application materials in terms of importance when screening candidates for on-campus interviews. Letters of recommendation ranked higher than other metrics, such as the number of teaching awards and course evaluations. A similar study of search committee chairs from psychology (Landrum & Clump, 2004) found that letters of recommendation were ranked higher in screening applicants than quality of graduate school, grant activity or potential, and transcripts. Most of the literature on the use of letters of recommendation to assess applicants in academia has been either survey-based or qualitative in nature. However, an experiment in which a sample of professors evaluated a hypothetical applicant for an academic job found that a strong letter of recommendation (versus a weak letter) had a significant effect on the likelihood of inviting an applicant for an on-campus interview (Applegate, Cable, & Sitren, 2009). Not only do professors use letters of recommendation to select candidates for interviews, but academic administrators also value letters of recommendation as important and useful. For example, a study of political science department chairs from 231 universities (Fuerstman & Lavertu, 2005) found that letters of recommendation were among the top three factors in inviting applicants to campus interviews across all types of universities (e.g., liberal arts colleges, doctoral-granting institutions). They found that letters of recommendation outranked a variety of other factors.

Fourth, letters of recommendation are important for the actual selection of applicants for academic positions. Showing the importance of letters of recommendation for selection purposes, Nicklin and Roch (2009) found that those in academia use and rely on letters of recommendation in selecting candidates more than those in applied professions outside of academia. Additionally, the more that faculty wrote letters themselves, the more likely they were to rely on others’ letters when making selection decisions. Provosts, department heads, and other administrators also use letters of recommendation for hiring and promoting faculty (Abbott et al., 2010). In fact, decision-makers in academic administration positions rely on letters of recommendation, particularly from outside experts, more heavily than impact factor, citations, and other metrics available. They reasoned that the best applicants have similar impact factors and citation counts, so letters help distinguish applicants more.

Several conclusions emerge from examining the literature focusing on the use of letters of recommendation to assess applicants in academia. First, letters are among the most commonly requested materials for the academic selection process. Second, letters are used to evaluate applicants for both specific (e.g., teaching) and general abilities. Third, letters are often used in the early stage of the hiring process to make decisions for campus visits, so their weight and use are important to advance further in the selection process. Thus, any potential bias in letters can hinder applicants from being hired, not only because they are used to make hiring decisions, but also because they are used when selecting applicants for a campus interview.

Letters of Recommendation and Gender Differences

Despite the frequent use of letters of recommendation in academia, the instructions for how to write those letters are often ambiguous and open-ended (Morgan et al., 2013). Further, the way in which letters of recommendation are used to evaluate candidates usually lacks structure (Liu, Minsky, Ling, & Kyllonen, 2009). The ambiguity and lack of structure of letters of recommendations can lead to biases in how letters are written for men and women (Dutt et al., 2016; LaCroix, 1985; Madera et al., 2009; Schmader et al., 2007). Gender schemas, mental models summarizing beliefs about what it means to be male or female (Crockett, 1988; Fiske & Linville, 1980), provide a theoretical framework for gender biases in letters of recommendation. Gender schemas can be both descriptive and prescriptive (Burgess & Borgida, 1999; Heilman, 2001; Rudman & Glick, 2001), and are implicit, mostly non-conscious beliefs and expectations that can lead to different interpretations of the same behavior in men and women (Valian, 1998). These differences are due, at least partially, to a perceived lack of fit between the stereotypes about and the positions held by men and women (Heilman, 1983; Heilman, 2012).

Central to understanding how gender schemas can affect women in academia is the gender-typing of work through two conditions. First, the distribution of men and women in an occupation is used to stereotype an occupation as either a male or female occupation (Cejka & Eagly, 1999). Men are disproportionately represented in academia: women enter graduate school at about the same rate as men, but are less likely to stay in academia (Aiston, 2014; Deo, 2014; Ding et al., 2013; Ellemers et al., 2004; Taylor, 2007; Yost et al., 2013). Many academic departments, such as the natural sciences, engineering, and mathematics, remain male-dominated, whereas other departments, such as education and social work, remain female-dominated (Bailyn, 2003; Eveline, 2005; Pyke, 2013; Van den Brink & Benschop, 2012; Westring et al., 2012). In fact, the majority (86%) of full professors at American institutions are men (U.S. Department of Education, National Center for Education Statistics, 2015).

Second, the responsibilities of the job are tied to gender norms (Heilman, 2001). For example, management roles traditionally have been considered to be male gender-typed because of the importance of traits (e.g., agency) that comprise the male gender schema (Eagly & Johannesen-Schmidt, 2001; Eagly & Karau, 2002; Ragins & Sundstrom, 1989; Ragins, Townsend, & Mattis, 1998). Job advertisements for male- (versus female-) dominated areas of employment use more masculine wording, thereby enhancing the belongingness that men versus women will experience when reading the ads (Gaucher, Friesen, & Kay, 2011). Responsibilities of academics have been based historically on masculine traits, such as being assertive, competitive, authoritative, independent, and experts in their field (Bailyn, 2003). All of these traits are tied to agency, which is a set of traits that men, but not women, are expected to hold (Eagly & Johannesen-Schmidt, 2001). Women, in contrast, are expected to be communal, which includes being concerned with the welfare of other people, affectionate, kind, sensitive, and nurturing.

One example of how gender schemas influence expectations in academia comes from a study of the awarding of endowed professorships at a sample of business schools at tier 1 American research universities. Treviño, Gomez-Mejia, Balkin, and Mixon (2015) found that female professors were less likely to be awarded named professorships than male professors were, even after controlling for years of experience, research productivity, and other performance factors. The disparity was even wider when the endowed chair was awarded to an internal candidate. Female professors had to meet a higher bar for recognition, as shown by the fact that women with endowed chairs scored significantly higher on performance measures than did men. Treviño et al. (2015) argued that these results were partly due to the facts that the majority (86%) of full professors at American institutions are men, and men make up the majority of gatekeepers for hiring and promoting in universities, which develops a work environment based on male gender norms. As such, a masculine-gendered work environment is incongruent with female gender norms.

Because what is required for success in many academic departments may be based on norms of masculinity (Bailyn, 2003; Van den Brink & Benschop, 2012; Westring et al., 2012), a potential bias against female faculty can arise when writing letters of recommendation. Letter writers may have sex-related stereotypes about women that are incongruent with the attributes that are believed to be required for success in a particular job (Eagly & Karau, 2002; Heilman, 2001), such as academia. The language used to describe men and women in work domains also may be related to gender schemas (Maass & Arcuri, 1996; Rubini & Menegatti, 2014). For example, letters of recommendation for medical school residency show gender differences in the language used to describe the applicants (Isaac et al., 2011). Specifically, letters for female (versus male) applicants contained more “tentative” words (e.g., “she might,” “it is possible she could”).

In chemistry and biochemistry faculty positions, letters of recommendation for male versus female applicants were found to contain more standout adjectives, such as “superb,” “outstanding,” “remarkable,” and “exceptional” (Schmader et al., 2007). Similarly, in psychology, male applicants for faculty positions were described as more agentic and less communal than female applicants (Madera et al., 2009). In addition, communal descriptions were negatively related to the hireability of the applicants. Such studies suggest that (1) language in letters of recommendation may be biased unintentionally by gender schemas and (2) male and female writers are equivalent in their attribution of traits to male and female candidates.

Standout adjectives are not the only domain in which writers can describe job candidates. A qualitative study conducted by Trix and Psenka (2003) examined over 300 letters of recommendation that were written for medical school faculty at a large American medical school. Letters for women tended to contain more doubt raisers than letters for men, with no difference between male and female writers. The authors described four sets of doubt raisers: negativity, faint praise, hedges, and irrelevant information. For example, one might describe an applicant as someone who “does not have much teaching experience” (negativity), who “needs only minimum supervision” (faint praise), who “might not be the best…” (hedging), or who “is active in church” (irrelevancy). Doubt raisers vary in how negative and subtle they are and may not have an equivalently pernicious impact. Negativity may tend to be the most obvious and negative doubt raiser, because it points out an overt weakness of the applicant. Irrelevancy is typically the least negative and most subtle, but because irrelevant statements are not related to the essential functions of a job, the reader wonders why they are present at all, making them a doubt raiser. Hedging is less negative than negativity, but is still a forthright doubt raiser, because the writer directly admits uncertainty. Lastly, faint praise is something of a backhanded compliment.

In general, the majority of letter content was very positive, so the inclusion of a single doubt raiser questions an applicant’s aptness for a job in a manner that is not necessarily direct and apparent (Trix & Psenka, 2003). Letter writers may not have intended to put female applicants at a disadvantage, but may have done so nevertheless if they included doubt raisers more frequently in letters for women versus men.

The current studies build on Trix and Psenka’s (2003) preliminary evidence of gender differences in doubt raisers by using different methodological and statistical procedures. For example, they scored letters of recommendation without removing information about the gender of the applicant. Thus, confirmation bias might have been present: the coders (who were the authors themselves) were not blind to the applicant gender and were coding for gender differences. Additionally, Trix and Psenka (2003) did not use inferential statistics, nor did they control for the fact that letters of recommendation were nested within applicants. Given these potential limitations, it is important to establish whether doubt raisers really do appear more in letters of recommendation written for women than men.

Study 1

Overview and Hypothesis

To examine gender differences in how men and women are described in letters of recommendation, we analyzed letters of recommendation written for applicants for faculty positions in a psychology department at a university that is classified as having a very high research activity level (Carnegie Classification of Institutions of Higher Education, n.d.). Because academic positions, particularly at elite research institutions, tend to be more male gendered, and because gender schemas portray men as more agentic, task-oriented, and instrumental than women (Burgess & Borgida, 1999; Rudman & Glick, 2001; Valian, 1998), we expected that men would be described more positively in letters of recommendation than would women, even after controlling for ten indicators of academic achievement (e.g., number of publications). Based on the studies by Trix and Psenka (2003), Schmader et al. (2007), and Madera et al. (2009), we specifically examined gender differences in doubt raisers.

Hypothesis 1

Letters of recommendation written for women are more likely to include doubt raisers than are letters of recommendation written for men.

Method

Sample

We examined letters of recommendation for psychology junior-faculty job applicants (collected and reported by Madera et al., 2009) and analyzed letter content that has not been reported previously (see Appendix 1 for data transparency). The sample consisted of 624 letters of recommendation for 174 applicants applying for eight assistant-level faculty positions at a university in the southern USA. In regard to applicant and recommender sex, 49% (n = 85) of the applicants were female and 51% (n = 89) were male; 29% (n = 179) of the recommenders were female and 69% (n = 430) were male (the sex for 2% could not be identified). Applicants’ ages ranged from 26 to 40 years, with a mean of 32 (SD = 3.69). The mean number of letters per applicant was 3.59.

Procedure

Three trained research coders rated the extent to which letters contained doubt raisers. Through a redaction procedure in which all information about the gender of the applicant and letter writers was removed, we kept coders blind to the purpose of the study and also to the gender of both the applicant and the letter writer. The anonymity of the applicants also was preserved by removing identifying information, such as the name of the applicants, letter writers, institutions, and research labs. The coders were provided with the definitions and examples of each of the four different types of doubt raisers.

Measures

Doubt Raisers

To measure doubt raisers, the coders used a 9-point Likert-type scale anchored at 1 (not at all) and 9 (very much) on four items assessing the extent to which letters contained (a) negativity, (b) hedging, (c) faint praise, and (d) irrelevant information. The coders also recorded the frequency of doubt raisers using a free-response format by responding to the following items: (a) How many instances of negative language did the letter contain? (b) How many hedging comments did the letter contain? (c) How many times did the letter contain faint praise? (d) How many times did the letter contain irrelevant information? The eight items were standardized because they were rated on different scales. These items represent the four doubt raiser types: negativity, hedging, faint praise, and irrelevancy.
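The standardization step can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the item names and values are hypothetical, and the use of the sample standard deviation and a simple mean across items are assumptions, since the text does not specify them.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize one item across letters (sample SD; an assumption)."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("item has no variance and cannot be standardized")
    return [(v - m) / s for v in values]


def composite(items):
    """Average the z-scored items letter by letter.

    items maps an item name to a list of per-letter scores; all lists
    must have the same length and the same letter order.
    """
    standardized = [zscores(column) for column in items.values()]
    return [mean(row) for row in zip(*standardized)]


# Hypothetical scores for four letters: one 1-9 rating item and one count item.
example_items = {
    "hedging_rating": [1, 3, 5, 9],
    "hedging_count": [0, 1, 1, 4],
}
scores = composite(example_items)
```

Because each item is centered before averaging, the composite has a mean of zero across letters, which is why the group means reported later can be negative.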

Following the recommendations from LeBreton and Senter (2007), we used a two-way mixed-effects intraclass correlation, ICC(A,1), and the group mean intraclass correlation, ICC(A,K), to measure coder agreement and coder consistency. The results showed sufficient individual coder reliability, ICC(A,1) = 0.86, and group mean reliability, ICC(A,K) = 0.94. On the basis of these indexes, ratings were combined by averaging within and then across the coders. The alpha coefficient for the measure was 0.79. A principal components factor analysis revealed one meaningful factor that accounted for 71% of the variance. All four items representing negativity, hedging, faint praise, and irrelevant information were retained.
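For readers unfamiliar with these indexes, the sketch below computes single-rater and average-rater absolute-agreement intraclass correlations from a letters-by-coders table using the standard two-way ANOVA mean squares. It assumes a complete table with no missing ratings; the function name and the example ratings are hypothetical and are not the study data.

```python
def icc_agreement(ratings):
    """Return (ICC(A,1), ICC(A,k)) for a complete targets-by-raters table.

    ratings is a list of rows, one row per rated letter, each row holding
    one score per coder.
    """
    n = len(ratings)        # number of rated targets (letters)
    k = len(ratings[0])     # number of raters (coders)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-target mean square
    msc = ss_cols / (k - 1)                 # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Hypothetical ratings: four letters scored by three coders on a 1-9 scale.
example = [[2, 3, 2], [5, 5, 6], [8, 7, 8], [1, 2, 1]]
icc_single, icc_average = icc_agreement(example)
```

The average-rater index is at least as high as the single-rater index whenever the latter is positive, which matches the pattern of the two values reported above.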

Gender

Gender for both applicants and recommenders was coded separately as female (1) or male (2).

Control Variables

We used ten control variables to assess applicant performance on the basis of curriculum vitae (CV) information. These were the number of first-author publications, the number of honors, the number of post-doc years, the number of courses taught, the ranking of the applicants’ school, the highest journal impact factor by the applicant, the number of total publications, the position applied for, number of years in graduate school, and the length of the letters measured as the number of words in each letter. The number of first-author publications, the number of honors, the number of post-doc years, the number of courses taught, the ranking of the applicants’ school, the highest journal impact factor by the applicant, and the number of total publications are direct indicators of productivity. We also controlled for the position applied for because applicants from certain backgrounds, such as industrial/organizational psychology, might have more publications; those with cognitive backgrounds might have more post-doc years. The other two control variables are not necessarily objective measures of productivity, but they might influence perceptions of productivity. For example, years in graduate school was controlled for because letter readers might adjust their estimation of productivity by taking into account number of years (i.e., divide productivity by number of years). For instance, 3 publications in 5 years would be equivalent to 4.2 in 7 years. Lastly, we controlled for the letter length because past research suggests that longer letters are seen as more positive when assessing applicants in general (Liu et al., 2009; Trix & Psenka, 2003), even if they do not necessarily reflect an applicant’s productivity. In addition, longer letters might provide more opportunity for doubt raisers.
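The per-year adjustment in the publication example amounts to a simple rate conversion. A minimal sketch, with a hypothetical function name:

```python
def equivalent_output(publications, years, target_years):
    """Rescale a publication count to a different number of years,
    assuming a constant yearly rate."""
    if years <= 0:
        raise ValueError("years must be positive")
    return publications / years * target_years


# 3 publications in 5 years is a rate of 0.6 per year,
# which corresponds to about 4.2 publications over 7 years.
equivalent = equivalent_output(3, 5, 7)
```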

Results

Descriptive statistics and intercorrelations for all of the variables are reported in Table 1. Table 2 shows the descriptive statistics for the variables by the gender of the applicants. For exploratory purposes, we conducted a multivariate analysis of variance (MANOVA) with the objective measures of applicant performance from their CVs (i.e., the control variables) as the dependent variables and applicant gender as the independent variable to examine whether male and female applicants differed in the measures of applicant performance. The omnibus MANOVA result was not significant for gender, Wilks’ Λ = 0.86, F(10, 50) = 0.81, p > 0.05, partial η² = 0.12, suggesting that no gender differences emerged among the control variables. Because doubt raisers are aggregated data, nested within applicants, they were not included in this initial test.
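As a rough check on how the omnibus statistics relate, Wilks’ Λ converts to an F statistic when the effect has a single degree of freedom, as with a two-group comparison. The sketch below applies that conversion to the reported Λ and degrees of freedom; the function name is hypothetical, and the calculation is an illustration rather than a reanalysis of the data.

```python
def f_from_wilks(wilks_lambda, df_hypothesis, df_error):
    """Convert Wilks' lambda to F for an effect with one degree of freedom.

    df_hypothesis is the number of dependent variables and df_error is the
    denominator degrees of freedom reported with the F test.
    """
    if not 0 < wilks_lambda <= 1:
        raise ValueError("Wilks' lambda must lie in (0, 1]")
    return (1 - wilks_lambda) / wilks_lambda * df_error / df_hypothesis


# Reported values: lambda = 0.86 with F(10, 50); the result is roughly 0.81.
f_value = f_from_wilks(0.86, 10, 50)
```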

Table 1 Means, standard deviations, and correlations for level 1 variables in study 1
Table 2 Descriptive statistics for level 2 variables and aggregated level 1 variables by applicant gender for study 1

Since letters of recommendations were nested within applicants, we used the HLM6 program (Raudenbush, Bryk, Cheong, & Congdon, 2004) to analyze the data. We used full maximum likelihood estimation procedures and included random effects. For the analyses, the intercepts of the level 1 variables (doubt raisers) were predicted by the level 2 variable (gender of the applicant). That is, we predicted the content of the letters of recommendation (level 1 variables, which were nested within applicants) by the gender of the applicant (level 2 variable). For exploratory purposes, we also included the gender of the letter writer and the interaction of the gender of applicant and letter writer in the analyses (level 2 variables). Before testing the hypotheses, we investigated whether systematic within- and between-applicant variance existed in the hypothesized dependent variable (i.e., doubt raisers). The results of the unconditional (null) models indicated that there was significant between-applicant variance in the dependent variable; 14% of doubt raiser variance was accounted for by differences between applicants. Thus, there is substantial between and within variance that warrants the use of HLM to examine level 1 and level 2 variables.
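The share of variance attributed to differences between applicants in a null model is the between-applicant variance divided by the total variance. The sketch below shows that calculation; the variance components are hypothetical values chosen only to reproduce a 14% share, since the text reports the percentage but not the components themselves.

```python
def between_group_share(between_variance, within_variance):
    """Proportion of total variance lying between groups in a two-level
    null model: between / (between + within)."""
    if between_variance < 0 or within_variance < 0:
        raise ValueError("variance components must be non-negative")
    total = between_variance + within_variance
    if total == 0:
        raise ValueError("total variance is zero")
    return between_variance / total


# Hypothetical components (not taken from the paper): between-applicant
# variance 0.14 and within-applicant (letter-level) variance 0.86.
share = between_group_share(0.14, 0.86)
```

A share well above zero indicates that letters for the same applicant resemble one another, so treating the letters as independent observations would understate the standard errors.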

Test of Hypothesis

We first tested the standardized measures of doubt raisers as a whole. As shown in Table 3, applicant gender significantly predicted doubt raisers (estimate = −0.11, p < 0.05). Letters for women contained significantly more doubt raisers (M = 0.12, SD = 0.69) than letters for men (M = −0.05, SD = 0.51). Using the frequency items of the doubt raiser measure (i.e., the raw sum of the times the letter had negativity, hedges, faint praises, and irrelevant information), the letters for female applicants had an average of 0.69 (SD = 0.96) doubt raisers and the letters for male applicants had an average of 0.55 (SD = 0.71) doubt raisers. Across gender, 52% of the letters had at least one doubt raiser in the letter, 10% had two or more doubt raisers, and 48% of the letters had no doubt raisers (ranging from 0 to 4.5 doubt raisers). For female applicants, 54% had at least one, 13% of the letters had two or more, and 46% had no doubt raisers. For male applicants, in contrast, 51% had at least one, 7% had two or more, and 49% had no doubt raisers. Neither the main effect of the letter writer gender nor the interaction between the applicant and writer gender was significant.

Table 3 Hierarchical linear modeling results with applicant gender, writer gender, and their interaction as predictors for study 1

When broken down by type of doubt raiser, across gender, 12% of the letters had at least one negativity, 18% had a hedging, 27% had a faint praise, and 14% had an irrelevancy. For female applicants, 14% had at least one negativity, 20% had a hedging, 30% had a faint praise, and 12% had an irrelevancy in their letters. For male applicants, in contrast, 10% had at least one negativity, 15% had a hedging, 24% had a faint praise, and 16% had an irrelevancy in their letters.

We next examined the effect of applicant gender on each individual doubt raiser, using the same set of control variables (see Table 4 for a summary of the results). For three of the four types, there was a significant effect of applicant gender. Letters for women contained significantly more negativity (M = 0.18, SD = 1.21) than letters for men (M = −0.06, SD = 0.87; estimate = −0.12, p < 0.05). Letters for women contained significantly more hedging (M = 0.13, SD = 1.09) than letters for men (M = −0.04, SD = 0.86; estimate = −0.14, p < 0.05). Letters for women contained significantly more faint praises (M = 0.15, SD = 1.14) than letters for men (M = −0.04, SD = 0.90; estimate = −0.15, p < 0.05). But there was no effect of applicant gender on irrelevant information doubt raisers (estimate = −0.05, p = 0.30). Neither the main effect of letter writer gender nor the interaction between applicant and writer gender was significant for any individual doubt raiser.

Table 4 Descriptive statistics for doubt raisers by applicant gender for study 1

In addition to our quantitative analysis of the data, we provide coded examples of actual doubt raisers from the letters of recommendation to provide contextual information. Examples of doubt raisers in letters for women include the following: “She is unlikely to become a superstar, but she is very solid,” “She is not the brightest, the most creative, the most independent, or original or productive, the most likely to be an outstanding teacher, or the most “anything” of her peers,” “A look at [applicant’s] publication record will show that she has not published a huge amount....,” “Although she has a number of papers in preparation and one under review, I think it would be fair to say that her record on paper would not place her among the top echelon of candidates for first rate programs,” “At first, despite truly spectacular GRE scores, she seemed quite unsure of herself,” “I assume she will be a relatively good teacher of undergraduate and graduate students,” and “She may not be the strongest student we’ve ever put out in any one aspect of academic excellence, but her profile of talents is unique.”

Examples of doubt raisers in letters for men include the following: “I know that first-author publications are priceless for job applicants. Although [applicant] doesn’t have any as of yet, that should not be a concern for you....,” “Instead he chose to apply what he had learned to a venture that involved web-based monitoring of internal states—a great idea, but one that unfortunately coincided with the bottom falling out of the dotcoms, so [applicant] is back on the academic market, somewhat poorer but hopefully wiser,” “His speaking style is fairly slow, and his ideas do not always spring forth into words without a bit of a struggle,” “He has always been passionate about developing himself and improving our program. At times, this has meant that he has not followed through on lower priority projects...,” “[Applicant] was dividing himself among an unusual number of projects and, although each was interesting and important, and all were inter-related, nonetheless his projects seemed stuck approximately 90% of the way to publication,” and “I no longer need to make any major corrections on his manuscripts with regards to grammar and usage. And although he has an accent, I would say it is less thick than many others from a similar background.” Thus, these exemplary doubt raisers show that doubt raisers are mostly related to potential research productivity or their overall ability. Specifically, 66% of the coded doubt raisers were related to research productivity; only 17% were related to teaching. This pattern was found for both men and women.

Discussion

As predicted, letters of recommendation for female applicants for faculty positions contained more doubt raisers than letters for male applicants. In regard to the type of doubt raiser, letters for women contained more negativity, hedging, and faint praises than the letters for the men. Although irrelevant information did not reach statistical significance, the directions of the means of irrelevant information were consistent with the means for negativity, hedging, and faint praises. These differences were obtained even though we controlled for objective measures of applicant performance from their CVs. Given that we included these control variables, we can conclude that the differences in doubt raisers were not due to these specific objective aspects of candidates’ performance.

Study 2

Overview and Hypotheses

Study 1 showed that, as predicted, letters for women contain more doubt raisers than do letters for men, but it leaves open whether doubt raisers influence how applicants are evaluated. It is possible that letter readers are not affected by doubt raisers. To test that possibility, using a sample of university professors, study 2 examines the influence of doubt raisers on evaluations. One reason to think that doubt raisers will have an effect is that in the sea of positive comments that make up most letters of recommendation (Knouse, 1983; Ralston & Thameling, 1988), even small numbers of doubt raisers may stand out and be disadvantageous to applicants. Although doubt raisers are not necessarily directly or overtly negative, they question an applicant’s aptness for a job, suggesting that the applicant may not be the strongest candidate (Trix & Psenka, 2003). We thus predicted the following:

Hypothesis 2

Applicants for academic job positions whose letters of recommendation contain (versus do not contain) doubt raisers will be evaluated more negatively by actual faculty members.

Method

Sample

The sample consisted of 305 university professors from various universities across the USA (46% men, 54% women). In regard to their discipline, 43% were from psychology and 57% were from various disciplines, such as sociology, engineering, neuroscience, and business departments. The largest group of respondents were full professors (39%), followed by assistant professors (26%), associate professors (25%), and lecturers (10%). In regard to racial/ethnic identity, 83% of the participants identified themselves as White/Caucasian, 1.4% as African-American/Black, 7.6% as Asian/Asian-American, 3.4% as Hispanic, and 4.6% as other/mixed.

Procedure and Experimental Manipulations

The authors sent an email with the study link to a convenience sample of faculty members, who were also requested to forward the study to their colleagues. After consenting to participate in a study called “Letter of Recommendation,” participants were presented with written instructions indicating that they were going to read a letter of recommendation for a junior faculty position at a tier 1 research institution. Participants were informed that the letter they were going to read had been redacted to remove identifying information. Embedded in the four-paragraph, one-page letter of recommendation was a doubt raiser manipulation that immediately followed the introductory paragraph (see Appendix 2 for the script of the letter).

The doubt raiser manipulation was based on the four doubt raisers from study 1 and related to research productivity as shown in study 1: negativity, faint praise, hedging, and irrelevant information. Participants in the doubt raiser condition read one of the following: (1) “I can say with certainty that AA does not have the skills to be the best researcher you have ever seen, but she/he does have the potential to become successful in developing an independent research program at your institution” (negativity) or (2) “I have confidence that AA will become better than average at being successful in developing an independent research program at your institution” (faint praise) or (3) “I am uncertain that AA has the potential to become one of the best researchers but I believe she/he could be a solid independent researcher at your institution and be successful in developing an independent research program at your institution” (hedging) or (4) “Also impressively, AA is an avid skier and enjoys photography—two tasks that we share in common. I believe she/he can be successful in developing an independent research program at your institution” (irrelevant information). The manipulations from these conditions were derived from earlier work by Trix and Psenka (2003) and measured in study 1. Participants in the control condition read: “I believe that AA will be a solid independent researcher at your institution.” See Table 5 for the manipulated statements. After reading the letter of recommendation, the participants evaluated the applicant. We also manipulated the gender of the applicant to examine whether doubt raisers are equally damaging to male and female candidates.

Table 5 Manipulation of doubt raiser and control statements in study 2

Measures

Teaching competence

We developed a measure of professional teaching competence based on the dictionary from Trix and Psenka (2003) and Schmader et al. (2007). Participants evaluated the applicant on five items using a Likert-type scale from 1 (“I strongly disagree”) to 7 (“strongly agree”). These items assessed whether the applicant (a) had teaching competence, (b) had professionalism, (c) had teaching skills, (d) had teaching potential, and (e) had mentoring skills (α = 0.85).

Research competence

We developed a measure of research competence also based on the dictionary from Trix and Psenka (2003) and Schmader et al. (2007). The participants evaluated applicants on the five items using a Likert-type scale from 1 (“I strongly disagree”) to 7 (“strongly agree”). These items included (a) research skills, (b) research potential, (c) external funding potential, (d) being a top-notch researcher, and (e) excellence in research (α = 0.91).
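The α values reported for the two five-item scales are Cronbach's alpha coefficients. As a minimal sketch of how such a coefficient is computed from item ratings, the standard formula can be written as below. The function name and the ratings are hypothetical illustrations, not the study's data or the authors' code.

```python
from statistics import variance


def cronbach_alpha(ratings: list[list[float]]) -> float:
    """Cronbach's alpha; one row per respondent, one column per item."""
    k = len(ratings[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sum of the sample variances of the individual items.
    item_var_sum = sum(variance(col) for col in zip(*ratings))
    # Sample variance of each respondent's summed scale score.
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical 1-7 ratings from four respondents on five items (made up).
example = [
    [6, 6, 5, 6, 7],
    [4, 5, 4, 4, 5],
    [7, 6, 7, 7, 6],
    [3, 4, 3, 4, 3],
]
alpha = cronbach_alpha(example)
print(round(alpha, 2))
```

Alpha rises toward 1 as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency.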

Manipulation check

Participants were asked to identify the gender of the applicant using a three-option response: male, female, I do not remember. Two participants did not correctly identify the gender, but their inclusion in the analysis did not change the results. Thirty-six (12%) respondents indicated not remembering the gender, but their inclusion also did not change the pattern of the results.

Results

Psychometric Analyses

A CFA on the teaching and research items demonstrated adequate fit: χ² = 82.39, df = 34, p < 0.05; CFI = 0.97; IFI = 0.97; RMSEA = 0.074; all loadings were statistically significant and were higher than 0.5 (they varied from 0.55 to 0.91), indicating convergent validity (Hair, Black, Babin, & Anderson, 2010; Anderson & Gerbing, 1988). The AVE was 0.54 for the teaching competence measure and 0.64 for the research competence measure, both greater than the 0.50 cutoff (Bagozzi & Yi, 1988). The squared correlation between the measures (r² = 0.25) was lower than each AVE, demonstrating discriminant validity (Fornell & Larcker, 1981). This two-factor model was compared to a one-factor model, which demonstrated poor fit and fit significantly worse than the two-factor model: χ² = 636.07, df = 35, p < 0.05; CFI = 0.68; IFI = 0.68; RMSEA = 0.24 (Δχ² = 553.68; Δdf = 1; p < 0.05).
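The validity checks in the preceding paragraph reduce to a few comparisons and one subtraction. The sketch below reruns that arithmetic on the reported values; it is an illustration, not the authors' analysis code. The variable names and the helper `chi2_sf_df1` are assumptions introduced here, and the helper is valid only for a chi-square statistic with exactly 1 degree of freedom.

```python
import math

# Values reported in the text.
ave = {"teaching": 0.54, "research": 0.64}
squared_correlation = 0.25
two_factor = {"chi2": 82.39, "df": 34}
one_factor = {"chi2": 636.07, "df": 35}

# Convergent validity: each AVE exceeds the 0.50 cutoff.
convergent_ok = all(value > 0.50 for value in ave.values())

# Discriminant validity (Fornell-Larcker criterion): each AVE exceeds
# the squared correlation between the two constructs.
discriminant_ok = all(value > squared_correlation for value in ave.values())

# Chi-square difference test between the nested models.
delta_chi2 = one_factor["chi2"] - two_factor["chi2"]
delta_df = one_factor["df"] - two_factor["df"]


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df.

    A 1-df chi-square is a squared standard normal, so
    P(Z**2 > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi2_sf_df1(delta_chi2)
print(round(delta_chi2, 2), delta_df, p_value < 0.05)
```

Because the one-factor model is the more constrained of the two, a significant difference indicates that collapsing teaching and research competence into a single factor worsens fit.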

Test of Hypothesis

Table 6 shows the descriptive statistics for study 2 dependent variables by experimental conditions. A 5 × 2 MANOVA with the teaching and research competence as the dependent variables and the doubt raisers and applicant gender as the independent variables revealed a significant main effect for doubt raiser (Wilks’s Λ = 0.85, F(8, 582) = 5.91, p < 0.05), but not for applicant gender (Wilks’s Λ = 0.99, F(2, 291) = 1.27, p > 0.05); the interaction was not significant (Wilks’s Λ = 0.95, F(8, 582) = 1.79, p > 0.05).

Table 6 Descriptive statistics dependent variables by experimental condition for study 2

The main effect of doubt raisers on the research competence measure was significant, F(4, 292) = 7.39, p < 0.01, ηp² = 0.09. Tukey HSD and Scheffé’s post hoc tests showed that the applicants with the negativity and hedging doubt raisers were evaluated significantly lower than the applicants in the other conditions, whereas the control and the other doubt raiser conditions were not significantly different from each other. The main effect of doubt raiser on teaching competence was not significant, F(4, 292) = 1.38, p > 0.05, ηp² = 0.02. The univariate main effects of applicant gender were not significant for either teaching competence, F(1, 292) = 1.45, p > 0.05, ηp² = 0.01, or research competence, F(1, 292) = 2.33, p > 0.05, ηp² = 0.01. Similarly, the interaction univariate effects were not significant for either teaching, F(4, 292) = 0.85, p > 0.05, ηp² = 0.01, or research competence, F(4, 292) = 1.57, p > 0.05, ηp² = 0.02.
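Partial eta squared can be recovered from an F statistic and its degrees of freedom through the identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to the two doubt raiser main effects reported above; the function name is an assumption introduced for illustration, and small discrepancies from published values can arise because the reported F statistics are themselves rounded.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    effect = f_value * df_effect
    return effect / (effect + df_error)


# Doubt raiser main effects as reported: F(4, 292) = 7.39 and F(4, 292) = 1.38.
research = partial_eta_squared(7.39, 4, 292)
teaching = partial_eta_squared(1.38, 4, 292)
print(round(research, 2), round(teaching, 2))
```

On these figures, the doubt raiser manipulation accounts for roughly 9% of the variance in research competence ratings that is not attributed to other effects, against about 2% for teaching competence.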

Discussion

Using experimental methods and an academic sample, the results from study 2 show that doubt raisers in letters of recommendation do indeed influence how applicants are evaluated. The applicant whose letter contained negativity (“… does not have the skills …”) was evaluated lower on research skills than the otherwise identical applicant in the other conditions. In addition, hedging (“I am uncertain …”) also led to lower evaluations on the research skills. But doubt raisers did not affect the ratings of teaching skills, probably because they were specifically related to research and not teaching. That suggests that faculty evaluate applicants based on the specific content of the doubt raiser (e.g., research) without generalizing to other domains (e.g., teaching). Further, the effects of doubt raisers were equally detrimental for both female and male applicants. Even a small island of negativity in an otherwise positive letter (Liu et al., 2009; Morgan et al., 2013) stands out and reduces an applicant’s standing.

General Discussion

Study 1 showed that letters of recommendation for women, compared to letters for men, contain more doubt raisers, specifically, negativity, hedges, and faint praise. This result held despite controls for productivity, such as number of publications and teaching experience. Thus, objective gender differences in productivity do not appear to be the reason that more women than men receive doubt raisers in their letters of recommendation. Differences in doubt raisers are more likely due to gender schemas than to systematic differences in the preparedness or quality of male versus female applicants.

Study 2 showed that both negativity (i.e., a type of doubt raiser that points out weaknesses) and hedging (i.e., a forthright admission of uncertainty) in letters of recommendation lead to lower evaluations of applicants, regardless of the gender of the applicant. Taken together, the key contribution of these studies is the clear illustration that doubt raisers in letters of recommendation do indeed hurt women more than men, but only because doubt raisers are more frequent in letters for women. In other words, evaluators treat doubt raisers equally seriously whether they are provided for a woman or a man (study 2), but because doubt raisers are more often used for women than for men (study 1), women are more likely to be negatively affected by them.

The combined findings are particularly interesting because the lack of evidence of gender bias when doubt raisers are presented in letters of recommendation potentially obscures the gender bias that has occurred at an earlier point, namely, when a recommender is writing the letter. Doubt raisers are a minus for everyone, but letter writers assign that minus more often to women than to men. If search committees ignored letters of recommendation, that asymmetry would not matter. But letters of recommendation are commonly used as selection tools in academia (Nicklin & Roch, 2009; Kuncel et al., 2014). The data have important implications for women in academia, particularly because women face biases early in the selection process (Bailyn, 2003; Eveline, 2005; Pyke, 2013; Van den Brink & Benschop, 2012; Westring et al., 2012; cf. Ceci et al., 2014a, b).

The current research makes important contributions to the literature on the effects of gender schemas on workplace outcomes. Our studies reveal how gender schemas can negatively affect women through the use of doubt raisers in letters of recommendation. That is to say, the letters in our sample contained more phrases that doubt the female (versus male) applicants’ ability to be successful. Letters of recommendation can be ambiguous and unstructured, which allows for biases stemming from gender schemas to play a role. For example, Heilman et al. (2004) argued that biases are more likely in situations that are ambiguous. Because instructions for what should be included in letters of recommendation are often ambiguous and open to interpretation, letter writers may depend on heuristics and stereotypes when writing letters and describing women; these biased descriptions (including doubt raisers) are negatively related to applicant evaluations, as shown in study 2.

The phenomenon that we have reported is not propagated more by male versus female letter writers (study 1) or evaluators (study 2). There were no main effects of the gender of the letter writer, and letter writer gender did not interact with applicant gender to predict doubt raisers. The female letter writers (in study 1) wrote letters similarly to their male counterparts and were just as likely as men to describe female applicants with more doubt raisers than male applicants. This provides some support for the universality of gender schemas and the manner in which men and women are described. Similarly, the evaluators (in study 2) rated letters containing doubt raisers (negativity and hedging) more negatively than letters that did not contain them. The lack of gender differences in how doubt raisers affect an applicant is consistent with the broader literature on stigma (Crocker, Major, & Steele, 1998; Hebl, Tickle, & Heatherton, 2000) and more specifically the literature on sex bias in the workplace (e.g., see Heilman et al., 2004; Heilman & Okimoto, 2007).

The results showed that the inclusion of even a single doubt raiser, particularly negativity or hedging, was enough to lead to significantly lower evaluations of the applicant (study 2). This finding is of particular interest because study 1 showed that 14 and 20% of the letters for female applicants had at least one negativity and hedging doubt raiser, respectively, compared to 10 and 15% of the letters for the male applicants. Although these gender differences are small, they are reliable, and the results from study 2 showed that a single statement can make a difference for an applicant.

The results of the current studies also offer important implications for the use of letters of recommendation outside of academia. Although professionals outside of academia rely on letters of recommendation less than academics (Nicklin & Roch, 2009), there are reasons to expect that gender schemas can also influence the development of letters of recommendation outside of academia. As shown in study 1, letters written for women had more doubt raisers than letters for men, even after controlling for objective measures of research productivity. We argue that this occurs partly because of how gender schemas can influence what is expected from men and women and how they are described, particularly in occupations that have norms related to one sex. In particular, we argue that, because what is required for success in many academic departments may be based on norms of masculinity (Bailyn, 2003; Van den Brink & Benschop, 2012; Westring et al., 2012), a potential bias against female faculty can arise when developing letters of recommendation. Letter writers can have sex-related stereotypes of women that are incongruent with the attributes that are believed to be required for success in a particular job (Eagly & Karau, 2002; Heilman, 2001), such as academia. Likewise, gender schemas can also influence the development of letters of recommendation, particularly in male-dominated occupations. For example, extant research shows how gender schemas influence the evaluations and stereotypes of managers and leaders, such that management and leadership qualities are still perceived to be more masculine than feminine (Duehr & Bono, 2006; Heilman, 2012; Koenig, Eagly, Mitchell, & Ristikari, 2011).

Thus, we would expect that if occupations (e.g., accounting positions in Big 8 firms) or positions (e.g., management roles) are related to masculine schemas (e.g., agentic qualities), then letter writers for applicants might be influenced by schemas when developing these letters, despite real or perceived gender differences. Again, we want to highlight that study 1 shows gender differences in doubt raisers even after controlling for productivity. Because the male and female applicants did not differ in the number of publications, impact factor, and teaching experience, gender schemas might provide a reason for why letters for women contained more doubt raisers than letters for the men.

Organizational Implications

Our research has important implications for academic institutions and for organizations that do rely on letters of recommendation. Our findings show that the gender disparity in doubt raisers found in study 1 is related to selection decisions, as shown in study 2. One obvious implication for academic institutions and organizations is that they should adopt strategies that can help identify such biases (see also Kervyn, Bergsieker, & Fiske, 2012) and then work to reduce those biases in the selection process. For example, universities can give less weight to letters of recommendation, or they can wait to collect letters of recommendation until they have reviewed an applicant’s work, or they can provide letter writers prompts so that recommenders are less likely to include doubt raisers in the letters. For instance, recent research has shown that gender biases can be reduced in letters of recommendation by requiring raters of such letters to elaborate and expand on interpretations of letters (Morgan et al., 2013). In particular, when participants were asked to read letters of recommendation and make ratings of the applicant, those who were asked to explain their ratings showed less gender bias against the applicants than those who were not asked to explain their ratings.

Another suggestion is that letters of recommendation should be structured in both their development and how they are used in the selection process. The low validity coefficients in Kuncel et al. (2014) were based on samples of letters that varied in structure (some were structured and others were not). This relationship between structure and validity is also found for interviews, with structured interviews having greater validity than unstructured interviews. Thus, academic institutions and organizations can reduce gender bias in letters by being aware of the potential biases in letters of recommendation through formal organizational policies or diversity training (Hebl, Madera, & King, 2007), taking direct steps to deactivate the impact of these biases (Morgan et al., 2013), and adding structure to their development and use in their evaluations.

Limitations and Future Research

Although we used actual archival data and not hypothetical letters of recommendation in study 1, a potential limitation is that a variable that we did not include in our analyses may have caused some systematic differences in the extent of doubt raisers that were based on real gender differences. However, because we controlled for number of years in graduate school, the number of total publications, the number of first author publications, the number of honors, the number of post-doc years, the position applied for, and the number of courses taught, we doubt the existence of other major differences. Furthermore, other research has shown that such differences still exist (Morgan et al., 2013), even when the quality of candidates is controlled (see Madera et al., 2009).

One fruitful area of future research is how the content of doubt raisers influences evaluations of applicants. In the current research, we manipulated different types of doubt raisers that were related to research but not to teaching (study 2). The doubt raisers did not affect ratings of teaching skills in our academic sample, suggesting that faculty evaluate applicants based on the content of the doubt raiser (e.g., research) without generalizing to other domains (e.g., teaching). Future research might investigate, via standardized manipulations, how doubt raiser content potentially influences letters of recommendation for and appraisal of candidates.

In addition, the current studies did not examine the race of the applicants (in studies 1 and 2) or of the letter writers (in study 1). This is an area for future research to explore. In fact, qualitative research suggests that racial minority faculty face subtle forms of discrimination in academia (e.g., Kelly & McCann, 2014; Perry, Moore, Edwards, Acosta, & Frey, 2009; Peterson, Friedman, Ash, Franco, & Carr, 2004; Stanley, 2006); this body of literature has examined how discrimination manifests when one is already employed in academia. Very little research has examined how an academic racial minority applicant faces discrimination in the selection process or how this interacts with gender, particularly in letters of recommendation.

These data are from a single field, namely psychology. Specifically, the letters for study 1 were for eight assistant-level positions, but for one department (psychology) at one university. However, our results from study 1 are consistent with similar research that examined biases in letters of recommendation from non-psychology samples. In particular, past research using samples from the STEM fields has found similar gender effects in letters of recommendation (Isaac et al., 2011; Schmader et al., 2007; Trix & Psenka, 2003). In addition, the sample for study 2 included professors from various disciplines; only 43.3% were from psychology. These professors also worked in a variety of institutions, including four-year teaching schools. Therefore, the results from study 1 (i.e., how letters for women have more doubt raisers than letters for men) and study 2 (i.e., how doubt raisers influence applicant evaluations) can generalize to other academic fields and types of institutions. However, we do encourage future research to examine if these effects hold in other fields. Of particular importance are the STEM fields in which women are underrepresented in academia (U.S. Department of Commerce, 2011) and for occupations or positions outside of academia that are related to masculine schemas.

Relatedly, the current studies focused only on letters of recommendation. Another area for future research is to examine if other methods used early in the selection process (e.g., reference check phone calls) can also be biased by gender schemas, leading to gender differences in doubt raisers. For example, researchers have argued that many reference check phone calls are unstructured and therefore susceptible to biases (e.g., Hedricks, Robie, & Oswald, 2013; Taylor, Pajo, Cheung, & Stringfield, 2004). The unstructured nature of reference checks is an important feature in light of research that suggests that bias against women is less prevalent when structure reduces the uncertainty of whether a female applicant will be successful in a masculine-gendered work environment, role, or position (Heilman et al., 2004; Heilman, 2012).

Conclusion

The implications of the current research on letters of recommendation are particularly important because their use in academia is well established (Johnson et al., 1998; Landrum et al., 1994; Sheehan et al., 1998). Our studies show how bias in the letter-writing process can be propagated, even if evaluators do not necessarily display overt gender biases. The differences in word choice may seem negligible, but in fact, as our data show, doubt raisers have discernible penalties for women in academia (Eagly & Karau, 2002; Eagly & Johannesen-Schmidt, 2001; Wood & Eagly, 2000). Awareness of and attention to these differences are critical areas of future research and application if we want to maximize fairness in occupations, such as academia, that rely on letters of recommendation.