Introduction

Given the importance of integrity, motivation, and interpersonal skills to being an effective doctor, there are strong grounds for incorporating measures of non-cognitive characteristics into academic and employment admissions procedures (Patterson et al. 2016). Despite their importance, it remains unclear how best to measure these characteristics, particularly in the context of high-stakes testing, where applicants may be motivated to distort their responses (Albanese et al. 2003; Bore et al. 2009; Musson 2009; Patterson et al. 2016). In recent years, a variety of selection tools have been developed to assess these non-cognitive characteristics, including situational judgment tests (Bore et al. 2009; De Leng et al. 2017; Lievens 2013; Patterson et al. 2009, 2012), multiple mini-interviews (Eva and Macala 2014; Eva et al. 2004, 2009; Griffin and Wilson 2012a; Kulasegaram et al. 2010), emotional intelligence tests (Libbrecht et al. 2014), and personality tests (Griffin and Wilson 2012b; Lievens et al. 2002; MacKenzie et al. 2017; Rothstein and Goffin 2006). The current study focuses on personality testing in the high-stakes selection of medical interns, assessing the degree to which response distortion might limit the utility of personality tests.

A recurring criticism of personality testing is that applicants might intentionally distort their responses when they know that those responses will inform admission decisions (Morgeson et al. 2007). However, most of what is known about personality testing in selection contexts comes from the large literature on employee selection (Oswald and Hough 2008; Rothstein and Goffin 2006). Meta-analytic research comparing job applicants with non-applicants suggests that applicants do distort their responses, typically scoring around half a standard deviation higher on measures of conscientiousness (Birkeland et al. 2006). In contrast, the vast majority of research on medical students and interns has administered personality tests in contexts where scores are used only for low-stakes (research) purposes (McLarnon et al. 2017). The few studies that have examined personality testing in high-stakes (selection) contexts generally have methodological limitations, such as small sample sizes and the lack of comparison groups (Hobfoll et al. 1982; Shen and Comrey 1997). Arguably, the best current estimate of applicant response distortion in a medical context comes from a small-sample repeated-measures study (n = 63), in which responses in the selection context were approximately two-thirds of a standard deviation higher on the Big Five (i.e., extraversion, openness, conscientiousness, agreeableness, and reversed neuroticism) than non-applicant responses (Griffin and Wilson 2012b).

A related issue is whether response distortion reduces the predictive validity of personality test scores (Griffin and Wilson 2012b; MacKenzie et al. 2017; Rothstein and Goffin 2006). The Big Five personality traits have provided a useful organizing framework; meta-analytic research, largely in contexts other than medical education, has found modest correlations with academic grades for conscientiousness (r = .19) and openness (r = .10) (Poropat 2009).

Although research has examined correlations of medical student personality with academic performance (Doherty and Nugent 2011; Ferguson et al. 2003; Haight et al. 2012; Knights and Kennedy 2007; Lievens et al. 2002, 2009; McLarnon et al. 2017; Peng et al. 1995) and other outcomes (Hojat et al. 2015; Jerant et al. 2012; McManus et al. 2004; Pohl et al. 2011; Song and Shi 2017; Tyssen et al. 2007), there appears to be no research that has systematically compared personality test responses in a large sample of medical students with those of applicants to medical programs. The “jury is still out”, therefore, regarding the extent to which applicants to medical programs distort their responses. Thus, this study examined the degree to which graduated medical students distort their responses on personality tests when applying for one of a limited number of prestigious medical internship positions. To assess the degree of response distortion in this high-stakes medical selection context, we relied on an established paradigm in social desirability research and compared results in the applicant context with several normative samples whose responses were collected in a standard low-stakes research context (Mesmer-Magnus and Viswesvaran 2006). To assess whether this response distortion leads to reduced predictive validity, we used university grades that were available both for the applicants and for a large non-applicant sample of medical students.

Methods

Supplementary materials and analyses are available on the OSF at https://osf.io/bjxwu.

Participants and procedure

The applicant sample consisted of medical graduates applying for one of 60 medical internship positions at a major health provider in Australia. The internship represents the first year of post-graduate education and consists of a period of supervised clinical experience that is typically completed immediately after a graduate degree in medicine. Satisfactory completion is a requirement for general registration with the Medical Board of Australia. Applying for an intern position is a competitive process: students in the final year of their medical degree record their preferences for where they wish to complete their intern year, and each health provider administers its own selection process. Importantly, the intern program that formed the basis for this research was especially prestigious, and applications outstripped positions by approximately ten to one, suggesting that applicants would have an incentive to distort their responses on the personality test. Because only a limited number of places were offered to applicants from outside the state, and because of issues with grade standardization, only within-state applicants were retained.

The application process required applicants to first complete an initial application form. On the basis of this form, 15% of applicants were screened out of the selection process because they were clearly not competitive for a position. Remaining applicants were then required to complete the personality test online. Applicants were informed that the personality test would influence selection decisions; they were encouraged to answer honestly and told that being found to have faked would significantly diminish their chances of being selected. The final applicant sample used for the present analyses consisted of 530 participants (55% female; age at time of personality testing, M = 24.9, SD = 3.0, range 21–46).

Materials

Personality

Applicants completed the NEO Personality Inventory-3 (NEO PI-3), which measures five domains (i.e., neuroticism, extraversion, openness, agreeableness, and conscientiousness), each with six facets. The test consists of 240 items, with 8 items per facet. Items were rated on a 1–5 scale, and scales were scored as the mean of their items after reverse-scoring negatively keyed items. The test is one of the most established and well-validated measures of the Big Five, with Cronbach’s alpha reliabilities for the domains of around .90 (Costa and McCrae 1992). The NEO PI-3 (McCrae et al. 2005) involved minor revisions to 37 of the 240 items of the earlier NEO PI-R (Costa and McCrae 1992) in order to improve readability. A large cross-cultural analysis comparing the NEO PI-R and NEO PI-3 found that norms were not substantially different between the two versions (De Fruyt et al. 2009).
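To make this scoring procedure concrete, the following is a minimal sketch. The item names and the reverse-keyed item are hypothetical illustrations, not the actual NEO PI-3 scoring key.

```python
import pandas as pd

# Hypothetical responses to a three-item facet scale rated 1-5.
# Column names and the reverse-keyed list are illustrative only,
# not the actual NEO PI-3 scoring key.
items = pd.DataFrame({
    "item_1": [4, 5, 3],
    "item_2": [2, 1, 3],  # reverse-keyed item
    "item_3": [5, 4, 4],
})
reverse_keyed = ["item_2"]

# On a 1-5 scale, reverse-scoring maps a response x to 6 - x.
items[reverse_keyed] = 6 - items[reverse_keyed]

# The scale score is the mean of the (re-keyed) items.
facet_score = items.mean(axis=1)
print(facet_score)
```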

Grade point average

Academic performance was measured using grade point average (GPA), i.e., the mean student grade over the entire medical degree. Grades were obtained directly from universities, averaged over years, and then z-score standardized within universities to remove any systematic differences in grading practices between universities.
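As an illustration of this within-university standardization, a minimal sketch follows; the data frame and column names are hypothetical, and the grading scales shown are invented to highlight why standardization within institution is needed.

```python
import pandas as pd

# Hypothetical records: raw GPA together with the awarding university.
# University A grades on a 7-point scale, university B on a percentage
# scale, mimicking systematic between-university grading differences.
df = pd.DataFrame({
    "university": ["A", "A", "A", "B", "B", "B"],
    "gpa_raw":    [6.2, 5.8, 6.5, 74.0, 68.0, 81.0],
})

# z-score GPA within each university, removing systematic differences
# in grading scales and leniency between institutions.
df["gpa_z"] = df.groupby("university")["gpa_raw"].transform(
    lambda g: (g - g.mean()) / g.std(ddof=1)
)
print(df)
```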

Comparison samples

To assess the degree to which the applicant context led to response distortion, we calculated standardized mean differences in personality scores between the applicants and several normative samples. Each normative sample completed the personality test in a standard research context in which there was no obvious incentive to make a positive impression. In addition, the medical student sample provided measures of GPA, from which comparative predictive validities of personality on GPA could be derived.

Medical students

Individual-level data were obtained for a sample of medical students drawn from research in Flemish universities (n = 539; 62% female; age at time of completion of the personality test, M = 18.25, SD = 0.83). Students completed the official Dutch translation of the NEO PI-R during the first year of their medical degree as part of a larger longitudinal research project; those included here were drawn from the 1997 and 1998 cohorts of that study (Lievens et al. 2009). Student grades were then obtained throughout the degree and z-score standardized in the same way as for the applicant sample. The present analyses differ from previous uses of these data in that they (a) combine the 1997 and 1998 cohorts to maximize sample size, and (b) make the sample more comparable to the applicant sample, all of whom finished their medical degree, by including only students who provided at least 6 years of academic grades. Thus, the average time difference between personality measurement and grades was similar for the student and applicant samples, and both samples excluded students who did not complete their medical degrees.

Current interns and physician role models

Scale means and standard deviations for the NEO PI-R were obtained from a study by Hojat et al. (1999). This included a current intern sample of 104 physicians in internal medicine residency (33% female) and a physician role model sample of 188 physicians (13% female) who were invited to participate because their managers deemed them to be positive role models.

NEO PI-3 and NEO PI-R norms

Combined-gender norms for the NEO PI-3 Form S young adult sample (21–30 years, n = 218) (McCrae et al. 2005) and combined college-age norms for the NEO PI-R (n = 389) (Costa and McCrae 1992) were obtained from the test manuals. The age and gender composition of these two samples is very similar to that of the applicant sample. The correspondence of the NEO PI-3 and NEO PI-R norms illustrates that the small changes between versions of the test do not substantively influence conclusions about applicant response distortion.

Data analytic approach

The degree of applicant response distortion was quantified using standardized differences in means between applicants and non-applicants. These standardized differences (i.e., Cohen’s d) were calculated by subtracting the non-applicant mean from the applicant mean and dividing this difference by the applicant standard deviation; the applicant standard deviation was used in order to have a consistent denominator across normative samples. A common rule of thumb is to interpret d values of 0.2, 0.5, and 0.8 as indicating small, medium, and large effects, respectively (Cohen 1992). Correlations were used to examine the bivariate relationships between personality and GPA, and regression models were used to examine the overall prediction of GPA by personality. Facet-level correlations are presented and discussed in relation to the existing literature (Anglim and Grant 2016; Anglim et al. 2017; de Vries et al. 2011; Gray and Watson 2002; Griffin et al. 2004; Horwood et al. 2015; Lievens et al. 2002; Marshall et al. 2005; Paunonen and Jackson 2000; Woo et al. 2015) in the online supplement. Item-level descriptive statistics, which are relevant to quantifying item-level social desirability on the NEO PI-3, are also provided in the online supplement.
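Expressed as a minimal code sketch (the function name and the example summary statistics are ours and purely illustrative):

```python
def cohens_d(applicant_mean, norm_mean, applicant_sd):
    """Standardized mean difference, d = (M_applicant - M_norm) / SD_applicant.

    The applicant SD is used as the common denominator so that d values
    are comparable across the different normative samples."""
    return (applicant_mean - norm_mean) / applicant_sd

# Hypothetical example: conscientiousness means on a 1-5 scale.
print(cohens_d(applicant_mean=4.1, norm_mean=3.6, applicant_sd=0.45))  # approx. 1.11
```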

Results

Means and standard deviations for applicants, along with standardized differences between applicants and the norm groups, are presented for the Big Five and the 30 personality facets in Table 1 (confidence intervals are provided in the online supplement). Applicant responses were more socially desirable than those of all comparison groups, although the differences were somewhat smaller for the intern and medical student samples than for the standard age norms. After reversing neuroticism, the average Cohen’s d effect size across the five comparison samples was d = 1.03, and the averages for the individual samples were d = 0.48 (physician role models), d = 0.95 (interns), d = 1.07 (medical students), d = 1.31 (NEO PI-3 young adult norms), and d = 1.32 (NEO PI-R young adult norms). Overall, substantial differences were present on all of the Big Five factors, but were largest for agreeableness, neuroticism, and conscientiousness. Specifically, when averaged over the five normative samples, standardized differences were −1.14 for neuroticism, 0.64 for extraversion, 0.84 for openness, 1.38 for agreeableness, and 1.14 for conscientiousness. Differences at the facet level varied substantially within a given Big Five factor. For example, scores for openness to actions, ideas, and values were much higher in applicants, whereas scores for openness to aesthetics and feelings were about the same for applicants and non-applicants.

Table 1 Differences in personality factor and facet scores between applicants and normative samples

Correlations between the Big Five personality domains and GPA for both the applicant and medical student samples are shown in Table 2. In general, the correlations between personality and GPA were fairly small. Conscientiousness was significantly correlated with GPA in both medical students (r = .21, p < .001) and applicants (r = .11, p < .05). While this correlation was smaller in the applicant sample, the difference was not statistically significant, Δr = −.11, p = .07. Openness was a significant predictor of GPA in non-applicants (r = .14, p < .01) but not in applicants (r = .04, ns), although this difference was also not statistically significant, Δr = −.10, p = .11. Correlations with GPA for neuroticism, extraversion, and agreeableness were close to zero in both samples. Overall prediction of GPA from the Big Five in regression models appeared lower for applicants (adjusted multiple R = .11) than for medical students (adjusted multiple R = .25). Thus, overall, there was modest evidence for a reduction in the predictive validity of Big Five domain scores in the applicant context.

Table 2 Comparison of bivariate correlations between Big Five domain scores and GPA in non-applicants and applicants
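The article does not state which test was used for these Δr comparisons; a standard approach for comparing correlations from two independent samples is Fisher’s r-to-z transformation, sketched below under that assumption. With the rounded correlations and sample sizes reported above, the resulting p value is close to, but not identical to, the reported value.

```python
import math
from scipy.stats import norm

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided test of the difference between correlations from two
    independent samples using Fisher's r-to-z transformation (assumed
    here; the original article does not state which test was used)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z transforms
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # SE of the difference
    z = (z1 - z2) / se
    p = 2 * norm.sf(abs(z))
    return z, p

# Conscientiousness-GPA: applicants (r = .11, n = 530) vs medical
# students (r = .21, n = 539); rounded inputs mean the p value only
# approximates the one reported in the text.
z, p = compare_independent_correlations(0.11, 530, 0.21, 539)
print(round(z, 2), round(p, 3))
```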

Discussion

Overall, the present study fills an important gap in the literature by providing the first nuanced assessment in a large medical sample of the degree to which a selection context influences responses on personality tests, and the degree to which this might alter the predictive validity of those tests for academic medical performance. The key conclusions are twofold. First, when used in a high-stakes medical selection context, applicants appear to respond in more socially desirable ways. Indeed, applicant response distortion in this sample was somewhat larger than is commonly seen in the literature. Differences from the intern and medical student samples, which are arguably the most comparable to the applicant sample, were around one standard deviation. This is larger than the estimate of around two-thirds of a standard deviation from the small-sample repeated-measures study by Griffin and Wilson (2012b), and it is also larger than meta-analytic estimates comparing applicants and non-applicants (Birkeland et al. 2006).

Second, response distortion in the applicant context may reduce but not remove the predictive validity of personality test scores. Correlations of openness and conscientiousness with GPA were slightly lower in the applicant context, although these differences were not statistically significant. A small reduction in validity is consistent with a model in which response distortion adds a small amount of noise to personality measurement, but in which the negative effect on predictive validity is offset because applicants with lower scores tend to distort their responses more (Anglim et al. 2017). Thus, the change in rank ordering is not as extreme as it would be if the size of response distortion were unrelated to true personality scores. Whether the modest predictive validities obtained by personality testing are sufficient to overcome other concerns about its use is a matter of judgment; however, the present data are highly relevant to informing such decisions. At the very least, personality testing should not be used in isolation when measuring interpersonal constructs, and should be combined with other admissions procedures such as multiple mini-interviews and situational judgment tests.

Some limitations should be acknowledged. Although applicants were compared with a range of normative groups, potential differences other than the selection context may partially explain the differences between applicants and non-applicants. Applicants were older (mean age 25 years) than the medical students (mean age 18 years); applicants had completed more years of education; applicants completed the English version of the personality test whereas the medical students completed the Dutch version; and Flemish and Australian medical students may have different personality profiles. Nonetheless, ancillary comparisons across the normative samples suggested that the effects of age, years of education, test version, and cultural differences are likely to be small relative to the large differences in personality that were obtained; in particular, the student and intern samples presented only a slightly more socially desirable response profile than implied by standard young adult test norms. Future research could nevertheless seek to obtain samples of applicants and non-applicants that are matched on more variables.

Although the current research used academic performance as the outcome variable, the broader research goal is to predict what makes an effective doctor. Thus, the assessment of reductions in predictive validity in the applicant sample was mainly intended to address the broader question of whether predictive validity might decline in general. In particular, many employers perceive the benefits of personality testing to relate more to predicting discretionary behavior than standard task performance: the aim is to avoid hiring people who might engage in acts such as bullying, fraud, and unsafe work practices, and to hire people who are more likely to create a positive work climate in the organization. Accordingly, future research should seek to obtain measures of actual performance in medical practice.