Overview

Reader be warned—the following is not a formal literature review. That need has already been elegantly addressed for non-cognitive measures (Albanese et al. 2003), for grade point average (GPA) (Kreiter and Kreiter 2007) and for subsections of the Medical College Admission Test (MCAT) (Donnon et al. 2007). This paper is meant as an overview, looking for trends in the predictive validity data provided by admissions tools in the hope that this might provide guidance for future assessment tool development.

In pursuit of cogency, some rules of conduct were set for this overview. Measurement tools were considered to be of helpful predictive validity if they demonstrated all of the following four characteristics:

  1.

    The positive predictive validity correlation must be statistically significant.

  2.

    The positive predictive validity correlation must be practically relevant. A very highly powered study might demonstrate the existence of a correlation with great confidence, and with equally great confidence confirm that the extent of the correlation is vanishingly small. (A brief numerical sketch following this list illustrates the distinction.)

  3.

    The positive predictive validity correlation must be consistent across multiple studies. It is acknowledged a priori that seeking positive correlations with unreliable outcome variables may limit achievement of this characteristic.

  4.

    The positive predictive validity correlation must add value, demonstrating incremental validity above and beyond other predictors.
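
To make the second criterion concrete, the following is a minimal sketch with purely illustrative values (the correlation of 0.03 and the sample sizes are assumptions, not figures from any study cited here): a correlation that explains less than 0.1% of outcome variance still reaches conventional statistical significance once the sample is large enough.

```python
# Illustrative sketch only: the correlation (r = 0.03) and sample sizes are assumed
# values, chosen to show how statistical significance diverges from practical relevance.
import numpy as np
from scipy import stats

def pearson_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a Pearson correlation r observed in a sample of size n."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

# r = 0.03 explains less than 0.1% of the variance in the outcome (r^2 = 0.0009).
for n in (100, 1_000, 100_000):
    print(f"r = 0.03, n = {n:>7}: p = {pearson_p_value(0.03, n):.4f}")
# n = 100     -> p ~ 0.77 (not significant)
# n = 1,000   -> p ~ 0.34 (not significant)
# n = 100,000 -> p < 0.0001 ("significant", yet the effect remains vanishingly small)
```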

When examining strength of data, three levels, from high to low, are considered: (a) meta-analysis, (b) large studies (>500 subjects) and (c) small studies. Lower-level data are considered only when higher-level data are unavailable.

What’s worked…

This list is limited to GPA, the MCAT and the multiple mini-interview (MMI).

Grade point average

Grade Point Average has consistently shown statistically significant, practically relevant, positive predictive correlations with future performance, confirmed on meta-analysis (Kreiter and Kreiter 2007). It has some degree of incremental validity above the MCAT (Julian 2005). Correlations are particularly strong (0.40), trending downwards with increasing time from medical school admission. Awareness of this downward trend is longstanding (Gough 1978). Confounding factors for this decrease may include changing cognitive content and/or a shift of content from cognitive towards non-cognitive emphasis in outcome variables administered later in training.

No clear conclusions can be drawn that differentiate between overall GPA and science GPA, as the latter predominates as the standard in reported studies, and no head-to-head comparison is readily available (Koenig et al. 1998; Huff et al. 1999). Science GPA continues as the overwhelmingly preferred GPA indicator in American schools and some Canadian schools without clear data support for that phenomenon, and with only occasional challenge (Barr et al. 2008). Several studies suggest that students from non-science backgrounds initially experience higher stress but ultimately perform equally well relative to their science-background counterparts (Dickman et al. 1980; Yens and Stimmel 1982; Woodward and McAuley 1983; Neame et al. 1992; Koenig 1992; Huff and Fang 1999). These studies, however, are plagued by the use of unreliable outcome variables. The lack of a demonstrated difference may be because no such difference exists; alternatively, a difference may exist but remain undetectable, since nothing can be expected to correlate with outcome variables scarcely more dependable than random number generators.

Medical college admission test

The MCAT has consistently shown statistically significant, practically relevant, positive predictive correlations with future performance, confirmed on meta-analysis (Donnon et al. 2007). The MCAT has a strong degree of incremental validity above and beyond GPA (Julian 2005). Correlations are particularly strong, trending downwards with increasing time from medical school admission and with the increasing shift of focus from predominantly cognitive outcome variables earlier in training to predominantly non-cognitive and clinical outcome variables later in training. These trends, however, vary considerably across MCAT sections.

The MCAT itself is apportioned into four sections. The Physical Sciences section is intended to assess problem solving ability in general chemistry and physics; the Biological Sciences section is intended to do the same for organic chemistry and biology. The Writing Sample section requires the composition of two essays, intended to measure candidates’ ability to develop a central idea, synthesize concepts, and present those ideas cohesively, logically, and with correct use of grammar and syntax. The Verbal Reasoning section consists of approximately seven passages, each followed by 5–7 questions that require the candidate to understand, evaluate and apply the information and arguments provided.

Physical Sciences (MCAT-PS) demonstrates a moderately strong correlation that drifts downwards with increasing time from medical school admission.

Biological Sciences (MCAT-BS) demonstrates a strong correlation which trends downwards with increasing time from medical school admission, though less precipitously than MCAT-PS.

Writing Sample (MCAT-WS) would be categorized more appropriately in the section below entitled “…And what hasn’t…”. MCAT-WS has failed to demonstrate consistently positive results for predictive validity.

Verbal Reasoning (MCAT-VR) demonstrates moderately strong correlations. These correlations are not only sustained, but strengthened, with increasing time from medical school admission. The relative immunity to time of MCAT-VR compared to MCAT-PS and MCAT-BS (Violato and Donnon 2005) bears some scrutiny. It may be due to a combination of factors. One factor may be that MCAT-VR is less context-bound, so correlations with future performance remain unaffected as the context changes. Another may be that MCAT-VR straddles the somewhat artificial divide between cognitive and non-cognitive domains and therefore remains relevant, even as assessment measures in clerkship and on national licensing examinations shift from cognitive towards non-cognitive domains.

Multiple mini-interview

The MMI has consistently shown statistically significant, practically relevant, positive predictive correlations with future performance. This conclusion has been confirmed in small studies only (Eva et al. 2004, 2009; Reiter et al. 2007). It has a strong degree of incremental validity above GPA and the MCAT. This is unsurprising, given the zero correlation of the MMI with GPA and only a small correlation of the MMI with MCAT-VR. Correlations are sustained, and trend upwards, with increasing time from medical school admission. This may be due to a shift from cognitive towards non-cognitive domains in later outcome assessments during clerkship and national licensure examinations. Conclusions remain guarded, pending assessment of a sufficiently large cohort of MMI-tested individuals reaching USMLE Steps I, II and III.

…And what hasn’t…

This list includes personal interviews (PI), personal statements, letters of reference (LOR), personality testing (PT), measures of emotional quotient/emotional intelligence (EQ/EI), and written and video-based situational judgment tests (w-SJT and v-SJT).

Personal interview

Personal interviews should work. They certainly work in the Human Resources (HR) setting. The HR methodological gold standard includes a pre-interview job analysis, behavioural descriptor interview (BDI) questions (e.g. describe an occasion you felt challenged in the workplace, and how you dealt with that challenge), and situational interview (SI) questions (e.g. faced with the following challenge, how would you deal with it). This approach yields predictive validity correlations of 0.20–0.30 with subsequent job performance (Wiesner and Cronshaw 1988; Taylor and Small 2002). Yet similar results have not been found for medical school admissions interviews (Albanese et al. 2003; Salvatori 2001).

Personal interview for medical school admission has demonstrated positive predictive validity in three studies, one published after the Albanese review. However, the personal interview still fails to meet the criteria described at this paper’s outset. Firstly, the correlations in two studies (Powis et al. 1988; Kreiter et al. 2004) were not found for the entire cohort under study, but only for applicants scoring extremely highly and extremely poorly in both predictor and outcome variables. Further, the Kreiter study employed a degree of standardization in interview methodology rarely achieved by other medical schools. Short of marked enhancement of other medical schools’ interview formats, the generalizability of those results may therefore be questioned. Most intriguing, however, is that two studies (Meredith et al. 1982; Powis et al. 1988) employed a variation in interview format that likely contributed to their serendipitous results. Specifically, Meredith et al. used four independent interviews per candidate and Powis et al. used two, rather than the customary single interview. As with the MMI, this would tend to combat the negative impact of context specificity and halo effect on overall test reliability. The Meredith study has the previously unrecognized distinction of being the first published study to demonstrate some degree of positive predictive validity for a multiple interview, a full generation before Eva et al. and the MMI.

Why the difference in results between HR and medical school interviews? The latter tend not to use BDI/SI questions, a factor which would drop predictive validity to the 0.10–0.20 range, according to the HR experience (Wiesner and Cronshaw 1988; Huffcutt and Arthur 1994; McDaniel et al. 1994). But even weak correlations have not been consistently achieved by medical schools. The difference likely lies in differing degrees of applicant pool homogeneity for HR and medical schools. The smaller the difference between individual applicants, the greater the challenge in differentiating between them, and the less likely there will be wide differences in the hired/admitted applicants’ subsequent job performance.

For better or for worse, the practice of medicine has a higher profile than most job positions. Those who apply in a casting call for leads in a Broadway play are going to appear very different from those who apply for an off-Broadway play. The resumes of the stars on Broadway will cluster at the upper end of the scale, and theatre-goers can rightly expect a uniformly expert performance. The resumes of the off-Broadway group run across the board, and the viewing audience knows in advance that the performance will vary greatly based upon the chosen performers. In a similar vein, it is far more challenging to differentiate between medical school applicants, and subsequently to demonstrate positive predictive validity, than it is for the usual HR cattle calls.
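
This clustering argument can be restated as restriction of range: selecting only the upper tail on a predictor attenuates the correlation that predictor shows with later performance. The simulation below is a minimal sketch under assumed values (a true correlation of 0.5 and selection of the top decile); the numbers are illustrative and are not drawn from the HR or medical admissions data cited here.

```python
# Minimal simulation, with assumed illustrative numbers, of restriction of range:
# a predictor that correlates 0.5 with later performance in a broad applicant pool
# shows a much weaker correlation once only the top decile on that predictor is kept.
import numpy as np

rng = np.random.default_rng(0)
true_r, n = 0.5, 100_000

# Bivariate normal "predictor score" and "later performance" with correlation true_r.
cov = [[1.0, true_r], [true_r, 1.0]]
predictor, performance = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Heterogeneous pool (the HR "cattle call"): observed correlation is close to the true value.
print(np.corrcoef(predictor, performance)[0, 1])                      # ~0.50

# Clustered, medical-school-style pool: keep only the top 10% on the predictor.
selected = predictor > np.quantile(predictor, 0.90)
print(np.corrcoef(predictor[selected], performance[selected])[0, 1])  # ~0.23, markedly attenuated
```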

Letters of reference

Letters of reference (LOR) have been an integral part of the medical school admissions process and remain one of the most common criteria in candidate screening (DeLisa et al. 1994; Berstein et al. 2002). There is, however, little evidence to support their effectiveness or their continued usage in the medical school admission process (Salvatori 2001). Many of the concerns that arise from the use of LOR as a selection tool can be attributed to their poor predictive validity (Kirchner and Holm 1997; Standridge et al. 1997) and poor reliability (Ross and Leichner 1984). In terms of inter-rater reliability, Dirschl and Adams (2000) evaluated LOR for 58 residency applicants and found it to be slight (0.17–0.28). Concerns regarding the use of LOR in the admission process also focus on the lack of information in these documents, the perceived ambiguity of their terminology and rater bias arising from open file reviews.

Although it is widely believed that traditional LOR offer a great deal of information about an applicant’s non-cognitive abilities, little has been found to support this contention. In a study conducted by O’Halloran et al. (1993), two experienced reviewers could not reliably extract information on non-cognitive qualities from candidate letters of reference. Ferguson et al. (2003) have voiced similar concerns, stating that the amount of information contained in a teacher’s reference does not reliably predict the performance of a student at medical school.

In his review of Deans’ letters for pediatric residency programs, Ozuah (2002) found that a substantial proportion of Deans’ letters from US medical schools failed to include comparative performance information with a key that allowed for accurate interpretation. Such subjective letters may be one of the many factors contributing to the poor inter-rater reliability highlighted above.

Interestingly, studies by both Brothers and Wetherholt (2007) and Peskun et al. (2007) found letters of reference to be predictive of later performance. Brothers and Wetherholt (2007) found a correlation with subsequent clinical performance ratings in residency training. However, the authors acknowledge an important limitation: faculty raters also received information about academic performance and USMLE scores alongside the reference letters. This open file review process likely biased their overall scoring of the applicants’ LOR. The same may or may not be true for Peskun et al. (2007), who found that non-cognitive assessments (LOR and personal statements culled from medical school admissions files) provided additional value to standard academic criteria in predicting ranking by two residency programs.

Personal statements

Much like the assessment of LOR, there has also been little support found for the predictive validity of personal statements (McManus and Richards 1986). Ferguson et al. (2000) evaluated the personal statements of 176 medical students and found that neither the information categories nor the amount of information in the statements was predictive of future preclinical performance.

Personal statements are also subject to rater biases, which can have notable effects on reliability. In a study of the autobiographical sketch submissions of applicants to the Michael G. DeGroote School of Medicine at McMaster University, Kulatunga-Moruzi and Norman (2002) reported an intermediate inter-rater reliability coefficient of 0.45. Later research (Dore et al. 2006; Hanson et al. 2007) suggested that even that modest claim might be unreasonably optimistic.

Personal statements add relatively little value to an applicant’s profile, both because input from others is common and because comparing personal statements across applicants is subjective and difficult. In surveys of first-year medical students conducted over 3 years, Albanese et al. (2003) found that 41–44% of students reported that their personal statement involved input from others, with 15–51% reporting input in content development and 2–6% receiving input from professional services. Similarly, in an assessment of off-site pre-interview autobiographical submissions compared with onsite submissions, Hanson et al. (2007) suggested that candidates only infrequently answered pre-interview autobiographical questions independently, and reported the deleterious impact of this on test reliability. This is hardly surprising. It is in the best interest of all applicants to present themselves in a fashion that clusters them at the saintly end of the spectrum. The higher the stakes, the tighter the cluster.

Albanese et al. (2003) have additionally pointed out that, because of the personal statement’s free-form nature, “any given personal statement will highlight a set of personal characteristics potentially different from the set highlighted in another applicant’s personal statement”. Such non-standardized information about applicants’ personal characteristics makes valid comparison extremely challenging.

Although research has indicated limited predictive value of the personal statement as a selection tool, attention to this tool in the medical school admissions literature has remained sparse. Interestingly, there have been conflicting reports on the reliability and validity of personal statements in the admissions processes of other health care disciplines. Kirchner and Holm (1997) found a positive correlation between the autobiographical essay and Occupational Therapy GPA; the essay also added incremental validity to their model used to predict therapy outcomes. However, in a follow-up study, Kirchner et al. (2000) could not replicate the same positive correlation and suggested that the previous results may have been anomalous. Similarly, Brown et al. (1991) reported good inter-team reliability (0.71–0.80) when assessing applications to the basic stream of the baccalaureate nursing program at McMaster University. When evaluating letters from applicants to the post-RN stream, however, they found a lower coefficient of 0.43. The authors attributed some of this variance to rater bias.

Although the data regarding the reliability and validity of personal statements in other health care disciplines are conflicting, the influence of bias and random error cannot be ruled out. Additionally, the higher stakes of medical school compared to other health care schools will tend to cluster medical school applicants’ statements at the haloed extreme. If anything, then, weak results for other health disciplines augur even more poorly for medical school admissions. Not surprisingly, having failed to meet the outlined criterion of replicability across studies, personal statements cannot be considered an assessment tool that “works”.

Personality testing

A wealth of large-scale studies, literature reviews and meta-analyses can be found on personality testing in the human resources (HR) literature (Barrick and Mount 1991). That literature provides a source of encouragement for the application of PT to medical school admissions, albeit with a rather large caveat. As discussed above, the population of medical school aspirants tends to be far more homogeneous than those reported in HR studies. A microscope strong enough to differentiate between HR job applicants may not be anywhere near powerful enough to do the same for medical school applicants. Even worse, people are generally unable to self-report reliably. Self-assessment tends to be profoundly inaccurate, so it is likely expecting too much of self-reporting tools like personality tests to avoid myopic, albeit consistent, viewpoints. Drivers who know they are good drivers will consistently report themselves as such, regardless of true driving ability. Because of this, high internal consistency (Cronbach’s alpha) on personality tests should not enter the discussion—there is no great merit to being consistent in one’s responses when those responses are consistently inaccurate. This conclusion in no way negates the successful use of personality testing under the auspices of forensic psychologists, who deal with a pool of individuals liberally sprinkled with psychopaths and sociopaths. Medical school applicant pools are homogeneously populated by high academic performers, with psychopaths sufficiently (and thankfully) rare as to make the personality test unreliable in that setting.

Part of the challenge in interpreting results of early personality testing was generated by the plethora of different personality testing constructs. With different constructs, and different measuring tools within each construct, translating between study results and generalizing from any one set of results was an exercise in futility. Over time, one construct—the Big Five Factor construct—and one measuring tool within that construct—the NEO (Neuroticism-Extroversion-Openness) Five Factor Inventory—have garnered more widespread use. The NEO Five Factor Inventory measures Conscientiousness, Openness to Experience, Neuroticism, Extroversion, and Agreeableness. Of these, only Conscientiousness has consistently demonstrated predictive validity. Claims of predictive validity for the other factors are plagued by failure to make Bonferroni corrections when correlations with multiple endpoints are sought. If a study finds that a proportion close to 5% of the correlations sought are statistically significant at P < 0.05, then the “positive result” is quite likely due to random chance. Put another way, let’s say Mr. Q boasts of his betting system on horses, and returns from the track having won his bet on a horse running at 20:1 odds. Would you bet money on his system? Well, did he win that on a single bet, or did he win one 20:1 bet while losing 19 others at the same odds? Further, if he tells you that he won that bet because the winning horse had the longest legs and then tells you a week later that he has won another 20:1 bet because the winning horse had the most experienced jockey, you might be wise to keep your wallet closed. Explaining the reasons for the winning bet only after the fact, and rarely in consonance with other post hoc explanations of winning bets, is another warning sign. The HR literature is replete with such situations: positive correlations in one study explained sagaciously, without similarly robust findings in other studies. For these reasons, only the NEO factor of Conscientiousness demonstrates sufficiently consistent findings to be described, along with aptitude tests, as a worthwhile predictive measure to consider (Behling 1998).
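
The horse-racing analogy can be put in numbers. The sketch below is illustrative only (the choice of 20 tests at a 5% threshold is an assumption, not a figure from the personality-testing studies cited); it shows why a few “significant” correlations among many tested are expected even when nothing real is present, and how a Bonferroni correction restores the intended error rate.

```python
# Illustrative sketch of the multiple-comparisons problem; the 20 tests and the 5%
# threshold are assumed values, not figures from the personality-testing literature.
alpha = 0.05
k = 20  # e.g. several personality factors each correlated with several outcome variables

# If every null hypothesis is true and the k tests are independent, the chance of at
# least one spuriously "significant" correlation is already about two in three.
p_any_false_positive = 1 - (1 - alpha) ** k
print(f"P(at least one spurious hit out of {k} tests) = {p_any_false_positive:.2f}")       # ~0.64

# Bonferroni correction: require each individual correlation to pass alpha / k instead.
bonferroni_alpha = alpha / k
print(f"Bonferroni per-test threshold = {bonferroni_alpha:.4f}")                           # 0.0025
print(f"Family-wise error rate after correction ~ {1 - (1 - bonferroni_alpha) ** k:.3f}")  # ~0.049
```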

The NCQ (non-cognitive questionnaire) is another example of a personality testing construct which has gained a level of acceptance, best detailed in Beyond the Big Test: Non-cognitive Assessment in Higher Education, by William Sedlacek (Sedlacek 2004). Attempting to interpret the conclusions drawn is a challenge. Trying to extrapolate a concept that is pertinent to medical school admissions from these conclusions is an even more daunting task. Of the extensive list of 403 references cited in the book, 15 address predictive validity results in peer-reviewed journals (Bandalos and Sedlacek 1989; Boyer and Sedlacek 1988; Fuertes and Sedlacek 1994, 1995; O’Callaghan and Bryant 1990; Sedlacek 1991, 1996, 1999, 2003; Sedlacek and Adams-Gaston 1992; Ting 1997; Tracey and Sedlacek 1985, 1987, 1988, 1989). The vast majority of these 15 publications do not deal with overall applicant populations, but rather with subgroups, particularly racially defined subgroups. Are the results generalizable, particularly when applied to the medical school applicant population, which is likely to be far more homogeneous than the populations examined using the NCQ? Does the number of statistically significant positive predictive correlations exceed what one would expect by chance alone? Further to this issue, there is no explicit use of Bonferroni corrections. Like other personality tests, those positive correlations that are found are explained post hoc, and are not necessarily consistent between studies. Finally, when statistically significant associations between NCQ scores and academic scores are found, it is not always inherently clear which represents the predictor variable and which represents the outcome variable. As acknowledged by Ting (1997),

“Psychosocial variables including successful leadership experiences, preference for long-range goals, acquired knowledge in a field and a strong support person were significantly related to GPA in the 1st and 2nd semesters. Thus, students who have higher scores on these psychosocial variables also tend to have higher GPAs, or vice versa.”

If the score for “availability of a strong support person” correlates with higher GPA, is that because the former led to the latter, or because a strong support person is more likely to be attracted to those more likely to succeed academically? Ultimately, the limitations in interpreting and extrapolating NCQ results do not necessarily mean that the construct and tests proposed are without merit; only that it is difficult to draw conclusions based upon the extensive information provided.

Emotional quotient/emotional intelligence

Unlike personality testing, testing of EQ/EI has not developed to the point that a single test format has gained greater favour over most others. Worse, it has not developed to the point that a single construct has gained greater favour over most others. With the incongruence of different constructs and different test formats essentially speaking different languages, the interpretation of results between studies is all but impossible. Furthermore, all the same limitations expected when interpreting personality testing studies for their applicability to medical school admissions are also present in emotional intelligence studies: attempted extrapolations from results with more heterogeneous populations, misplaced trust in the ability to self-assess, lack of Bonferroni corrections and lack of robust results between studies, even when similar constructs and test formats can be found. The promise and the challenge of EQ/EI have recently been addressed by Lewis et al. (2005), and will not be further addressed here, beyond accepting that the existing predictive validity data do not provide a compelling argument for its present use.

Written and video-based situational judgment tests

Recently, situational judgment tests (SJT) administered during the medical school admissions process have emerged as relatively strong predictors of future academic performance (Lievens et al. 2005; Lievens and Sackett 2006; Oswald et al. 2004). In SJTs, applicants are presented with either written or video-based depictions of hypothetical scenarios and are asked to identify an appropriate response from a list of alternatives (Lievens et al. 2005). The intent was to develop an admissions test that might predict future non-cognitive performance.

In their evaluation of SJTs, Lievens and Sackett (2006) examined the validity of written versus video-based SJTs within traditional student admission testing for candidates writing the Medical and Dental Studies Admission Exam in Belgium. They compared 1,159 students who completed the video SJT against 1,750 who completed the same SJT in a written format. The written SJT offered significantly less predictive and incremental validity for interpersonally oriented criteria than did the video SJT. Interestingly, the written SJT was more predictive of cognitive aspects of future performance as measured by GPA (Lievens and Sackett 2006). As previously discussed in this paper, GPA and MCAT scores have already been shown to be strong cognitively oriented predictors. Thus, since the intention of the SJT was to provide an alternative measure for assessing non-cognitive qualities, the written SJT adds little value.

In contrast, the video SJT has shown very promising results in its overall ability to predict future performance. As noted above, the video SJT had significantly higher predictive and incremental validity for interpersonally oriented criteria than did the written SJT (Lievens and Sackett 2006). Specifically, in their sample of 1,159 candidates, Lievens et al. (2005) found the video-based SJT to be a significant predictor of performance in the interpersonally oriented courses that partially determined GPA, with a validity coefficient of 0.21 against interpersonal course ratings. In contrast to the written SJT, the video-based version was not predictive for cognitive domains. Further, the validity of the video SJT increased across the academic years when students were followed through their first 4 years of medical training (Lievens et al. 2005).

One major limitation that Lievens et al. (2005) address is the difference between admissions to North American and European medical schools. Not only is the admission process different, with medical school admission controlled by the state in Belgium, but the population of medical school aspirants also tends to be far more homogeneous in North America than in Belgium. Applicants to the majority of North American medical schools are required to have completed several years of post-secondary studies before applying to medical school. Most prospective applicants with lower academic achievement have been siphoned off, leaving a more homogeneous applicant pool. The majority of applicants thus have very competitive profiles, making selection that much more difficult amongst such a homogeneous applicant pool. This may be very different from the more heterogeneous candidates in Belgium, who apply to medicine directly after secondary studies. It is as yet unclear whether applying the video SJT methodology to the more homogeneous (more highly clustered) North American applicant pool maintains a sufficient level of predictive validity to warrant its ongoing use. Since the video SJT does not correlate with GPA, it remains possible, albeit unconfirmed, that this homogenizing or clustering effect is minimal for non-cognitive domains.

Because there are not yet multiple studies demonstrating predictive validity, the video SJT, for the moment, remains in the “…and what hasn’t…” portion of this paper. The likelihood of future widespread implementation may also be limited by the same cost and technological concerns that led to the cessation of video SJT use in Belgium in 2002. Nevertheless, the promising results from Belgium may yet herald a future move of the video SJT into the “what’s worked” portion of this paper.

As a guide…

As stated earlier, this overview looks for trends in the predictive validity data provided by admissions tools, in the hope that these might provide guidance for future assessment tool development. Yes, Grade Point Average provides statistically significant positive predictive validity, but how do those correlations change both over time and over the shifting emphasis of endpoints from purely cognitive to clinical? We may be happy that those admitted are more likely to perform well in basic medical science courses in the first 2 years of medical school, but can that endpoint hold a candle to performance as clinical clerks? Does clerkship performance carry the same cachet as scores on certification examinations? Not when higher certification examination scores predict the appropriate use of radiological investigations and correct prescribing techniques (Tamblyn et al. 1998, 2002), lower mortality following cardiac events (Norcini et al. 2002), peer ratings of clinical competence (Ramsey et al. 1989) and fewer complaints to medical boards (Tamblyn et al. 2007). The shift in emphasis can be illustrated in tabular and graphic form, moving from earlier to later training, from mainly cognitive to mainly clinical assessment and ultimately to assessment of professionalism. With that in mind, the endpoints available in the literature are presented in the following series, in order:

  1.

    Grades in the first 2 years of medical school (1st 2 years).

  2.

    Results of early (less clinical) portions of national licensure examinations (US Medical Licensing Examination Steps 1 and 2, Medical Council of Canada Qualifying Examination Part I; USMLE 1, 2, MCC I).

  3.

    Results of early portions of national licensure examinations explicitly dealing with clinical issues (Medical Council of Canada Qualifying Examination Part I—Clinical Decision-Making; MCC I CDM).

  4.

    Clinical Clerkship scores.

  5.

    Results of later portions of national licensure examinations dealing to a greater extent with clinical issues (US Medical Licensing Examination Step 3, Medical Council of Canada Qualifying Examination Part II; USMLE 3, MCC II).

  6.

    Results of later portions of national licensure examinations explicitly dealing with clinical issues (Medical Council of Canada Qualifying Examination Part II—Clinical Decision-Making; MCC II CDM).

  7.

    Results of national licensure examinations explicitly dealing with issues of professionalism (Medical Council of Canada—Considerations of Legal, Ethical and Organizational; MCC-CLEO).

For those tools that have not reproducibly demonstrated positive predictive validity, there is no trend to show, and they are excluded from the table and graphs. The tools that are included in tabular (see Table 1) and linear trend graphic form include GPA (see Fig. 1), the three predictive sections of MCAT (see Figs. 2, 3, 4), and MMI (see Fig. 5). The numbers used to fill in the table arise from (a) meta-analyses (designated in bold), (b) large studies (>500 subjects; designated in regular font), and (c) small studies (designated in italics).

Table 1 Predictive tools and their correlation with future assessments

Fig. 1 Correlation of GPA with future assessments

Fig. 2 Correlation of MCAT-PS with future assessments

Fig. 3 Correlation of MCAT-BS with future assessments

Fig. 4 Correlation of MCAT-VR with future assessments

Fig. 5 Correlation of MMI with future assessments

Towards predictive admissions tool development

Grade Point Average predicts for Grade Point Average. Correlation between one course score and another, whether in or out of medical school, is moderate; correlation between years of courses is extremely high (Trail et al. 2006). But the ability to excel academically carries less and less gravitas as the domain assessed shifts from the more purely cognitive to the more clinical and ultimately more professional. If the goal of medical schools is to churn out medical science cognitive experts, then GPA is the way to go. The real world, however, places a higher premium on the superb clinician and professional—at least, so it would seem. But in that real world, there are not a lot of physicians with weak cognitive skills. The majority of complaints to State medical boards may be due to issues of professionalism, but that is only because the vast majority of medical aspirants of lower intellectual caliber have been weeded out by GPA and by the Biological Science and Physical Science sections of the MCAT whose predictive validity trends mirror those of GPA (see Figs. 1, 2, 3). Without these screening measures, a much higher proportion of complaints would be due to cognitive, rather than professional, ineptness.

Society has been saved from that fate by the insistence on strong cognitive ability, on the potential to attain expertise in medical science, and owes much to the Flexner Report of 1910 in that regard. GPA and the MCAT have proven sufficiently successful that further progress on cognitive assessment would provide only limited returns. But scaling one mountain peak opens vistas on the challenges beyond. In a recent sampling of three state medical boards, unprofessional behaviour represented 92% of the known types of violations (Papadakis et al. 2005). This figure is consistent with expectations culled from surveys of sued physicians, non-sued physicians, and suing patients (Shapiro et al. 1989). The approach of the centennial of Flexner’s signature report provides increasing perspective on the next domain to be challenged and conquered. Without Flexner, we would still be mired in complaints of cognitive origin. In response to Flexner, the cognitive mountain peak has been scaled and, with it, the vantage point gained from which to see the next peak to be assaulted. Where is the Flexner Report of 2010, to define non-cognitive skills, and in particular professionalism, as the next mountains to scale?

How is that assault to be managed? Success in conquering the first, cognitive peak and early success in non-cognitive evaluation provide perspective (TAU) on potential future success—Trust no-one, Avoid self-reporting, Use repeated measures:

  1.

    Trust no-one. Do not trust the applicants. When stakes are high, the likelihood of cheating increases; the stakes for medical school admission are very high. Do not trust the referees; they were not chosen by the applicants for their ability to remain neutral and objective. Do not trust the file reviewers; compared with the formulaic approach, individual biases always get in the way, from parole board decisions (Grove and Meehl 1996) to medical school admissions (Schofield and Garrard 1975). Do not trust the interviewers, or at the very least, dilute their biases into submission.

  2.

    Avoid self-reporting. Applicants’ inability to cheat self-reporting instruments intentionally is small comfort when they can so easily fill them out inaccurately when telling the truth. It’s not just that people are generally poor at self-assessment, but also that the worst performers tend to be the worst self-assessors (Eva and Regehr 2007). To expect the bad apples to weed themselves out on the basis of self-reporting goes beyond Pollyanna into self-delusion. All observations about the candidate should be made from a more remote position than the applicants themselves.

  3.

    Use repeated measures. Medical schools would not admit students based upon one course’s GPA, nor would they trust an aptitude test with an insufficient number of questions. One interview makes no sense. From Meredith through Powis to Eva, the interview formats that suggest reliability and validity are those that use multiple samples, the more the better; the sketch after this list illustrates the expected gain in reliability. Single measures, however obtained, are indefensible, whether viewed on the basis of the psychometric principles of context specificity and halo effect, or on the basis of empiric data.
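
As a rough illustration of why repeated, independent samples help, the Spearman-Brown prophecy formula projects the reliability of a composite of k parallel measurements from the reliability of a single one. In the sketch below, the single-interview reliability of 0.3 and the station counts are assumed, illustrative values, not figures taken from the studies cited above.

```python
# Sketch of the reliability argument for repeated measures, using the Spearman-Brown
# prophecy formula. The single-interview reliability (0.3) and the station counts are
# assumed, illustrative values, not figures from the studies cited in the text.
def spearman_brown(single_reliability: float, k: int) -> float:
    """Reliability of the composite of k parallel measurements."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)

# One interview vs. Powis-style (2), Meredith-style (4) and an MMI-style circuit (10).
for k in (1, 2, 4, 10):
    print(f"{k:>2} independent samples: composite reliability = {spearman_brown(0.3, k):.2f}")
# 1 -> 0.30, 2 -> 0.46, 4 -> 0.63, 10 -> 0.81
```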

In truth, this prescription for test development has already been applied in limited fashion. In Belgium, video-based situational judgment testing has provided initial success; in Canada, MMIs have provided the same. As similar endeavours spread globally, it is only a matter of time (and concerted effort) before non-cognitive assessments are taken for granted in the same way that GPA and MCAT are today. That hope leads inexorably to the next obvious question. The conquest of cognitive medical expertise provided a vantage point of the next peak to conquer. As the next peak of non-cognitive skills is scaled, what new peak(s) might become visible? And will it take a century to identify those next challenges and manage the assault?