The trail making test (TMT) is a commonly used neuropsychological test (Tombaugh, 2004) that is sensitive to brain dysfunction arising from a number of conditions, such as traumatic brain injury (Sanchez-Cubillo et al., 2009; Woods, Wyma, Herron, & Yund, 2015), schizophrenia (Sanchez-Cubillo et al., 2009), and dementia (Rasmusson, Zonderman, Kawas, & Resnick, 1998; Ashendorf et al., 2008). In addition to being an indicator of brain dysfunction and a measure of executive functioning, attention, and visuomotor skills, the TMT has established cutoffs that allow it to function as an embedded validity indicator (EVI; Ruffolo, Guilmette, & Willis, 2000; Iverson, Lange, Green, & Franzen, 2002; O’Bryant, Hilsabeck, Fisher, & McCaffrey, 2003; Merten, Bossink, & Schmand, 2007; Powell, Locke, Smigielski, & McCrea, 2011; Busse & Whiteside, 2012; Ashendorf, Clark, & Sugarman, 2017). The use of the TMT as an EVI is consistent with the general trend of expanding performance validity assessment beyond traditional forced choice recognition memory paradigms and is supported by literature showing that processing speed measures can be useful in the detection of invalid performance (Ashendorf et al., 2017; Erdodi et al., 2017; Erdodi & Lichtenstein, 2017; Erdodi, Abeare, Lichtenstein, Erdodi, & Linnea, 2017).

Performance validity is the degree to which scores on cognitive tests reflect the test taker’s actual ability (Boone, 2013; Larrabee, 2012). A number of factors can lead to invalid performance, including inattention, lack of motivation to provide one’s best performance, limited English proficiency (Erdodi, Jongsma, & Issa, 2017; Erdodi, Nussbaum, Sagar, Abeare, & Schwartz, 2017; Salazar, Lu, Wen, & Boone, 2007), and intentional suppression of performance (i.e., malingering). Administration of performance validity tests (PVTs) in all neuropsychological evaluations is considered best practice and a basic standard of care (American Academy of Clinical Neuropsychology, 2007; Boone, 2009; Bush, Heilbronner, & Ruff, 2014; Chafetz et al., 2015; Heilbronner et al., 2009; Schutte, Axelrod, & Montoya, 2015).

PVTs can be stand-alone measures specifically designed to detect invalid performance. In many circumstances, administration of stand-alone PVTs is considered a requisite component of a complete neuropsychological evaluation. However, the use of stand-alone PVTs incurs additional time and financial costs, while providing only a snapshot of performance validity at a discrete moment. In forensic practice, this has led to an average use of 2.4 stand-alone PVTs in a battery, placing more time demands on clinicians and patients alike (Martin, Schroeder, & Odland, 2015).

For this and many other reasons, EVIs have been developed to supplement PVTs in a neuropsychological assessment. EVIs have the benefit of providing a more continuous measurement of performance validity across a test battery and require no additional resources (Lichtenstein et al., 2017). However, EVIs have been criticized on the grounds that they have higher rates of false positives due to factors such as credible cognitive impairment (Bigler, 2012; Bigler, 2015), limited English proficiency (Erdodi, Nussbaum, et al., 2017), or other factors known to influence test scores, such as age (Abeare, Messa, Zuccato, Merker, & Erdodi, 2018; Lichtenstein, Holcomb, & Erdodi, 2018). Although some of these issues are more easily addressed than others, there is a straightforward potential solution to address the influence of demographics: use EVIs based on demographically corrected scores.

By definition, EVIs based on demographically adjusted scores are less likely to be influenced by patient characteristics taken into account through stratified norming procedures. Controlling for key demographic factors is especially important with the TMT, as it is known to be influenced by age and education (Heaton, Miller, Taylor, & Grant, 2004; Tombaugh, 2004; Sanchez-Cubillo et al., 2009; Cavaco et al., 2013). Age has been demonstrated to be positively correlated with time to completion (Yuspeh, Drane, Huthwaite, & Klingler, 2000; MacNeill Horton & Roberts, 2001; Tombaugh, 2004; Hester, Kinsella, Ong, & McGregor, 2005; Perianez et al., 2007; Hamdan & Hamdan, 2009). Similarly, education has been shown to be negatively correlated with TMT completion times (MacNeill Horton & Roberts, 2001; Hamdan & Hamdan, 2009). However, the findings with regard to education have been less consistent: some studies report that only the TMT-B is influenced by education (Tombaugh, 2004; Hester et al., 2005; Hashimoto et al., 2006).

A number of existing EVIs already make use of demographically adjusted scores in order to avoid specific demographic biases (Arnold et al., 2005; Erdodi et al., 2017; Erdodi, Kirsch, Sabelli, & Abeare, 2018). However, until recently, there were no such cutoffs established for the TMT. Ashendorf et al. (2017) appear to be the first to do so, correcting for age, based on data published by Mitrushina, Boone, Razani, and D’Elia (2005). Although this adjustment appears to improve diagnostic accuracy while correcting for the confounding effect of age, educational adjustments may be needed as well. The present study was designed to fill the gap in the current research literature by establishing cutoffs based on age and education corrected T-scores.

Method

Participants

A consecutive case series of 100 clinically referred patients was evaluated following a TBI at an academic medical center in the Midwestern USA. The majority were men (56%) and right-handed (91%). Mean age was 38.8 years (SD = 14.9, range: 17–74). Mean level of education was 13.7 years (SD = 2.6, range: 7–20). Mean FSIQ was 92.7 (SD = 15.7, range: 61–130). Most patients in the sample were classified as having mild TBI (76%) by the assessing neuropsychologist based on commonly used guidelines (GCS > 13, LOC < 30 min, and PTA < 24 h). Data on litigation status were not available for the majority of the sample, given that all patients were referred by physicians. The present sample overlaps with samples in previous publications from the same research group; however, the focus of those studies was different from that of the current study: the Forced Choice Recognition trial of the California Verbal Learning Test–Second Edition (CVLT-II; Erdodi, Abeare, Medoff et al., 2018), the Grooved Pegboard Test (GPB; Erdodi, Kirsch, Sabelli, & Abeare, 2018), and the Conners’ Continuous Performance Test–Second Edition (CPT-II; Erdodi, Roth, Kirsch, Lajiness-O’Neill, & Medoff, 2014).

Materials

The core battery of neuropsychological tests administered to all patients included the Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV; Wechsler, 2008), the Wechsler Memory Scale–Fourth Edition (WMS-IV; Wechsler, 2009), the CVLT-II (Delis, Kramer, Kaplan, & Ober, 2000), the CPT-II (Conners, 2004), the trail making test (TMT A & B; Reitan, 1955; Reitan, 1958), verbal fluency (FAS & animals; Gladsjo et al., 1999; Newcombe, 1969), the Wisconsin Card Sorting Test (WCST; Heaton, Chelune, Talley, Kay, & Curtis, 1993), and the Tactual Performance Test (TPT; Halstead, 1947). Premorbid functioning was estimated using the single word reading subtest of the Wide Range Achievement Test–Fourth Edition (WRAT-4; Wilkinson & Robertson, 2006) and the Peabody Picture Vocabulary Test–Fourth Edition (PPVT-4; Dunn & Dunn, 2007). Manual dexterity was measured with the Finger Tapping Test and the GPB, using demographically adjusted T-scores based on norms by Heaton et al. (2004). The computer-administered version of the Word Memory Test (WMT; Green, 2003) was administered as the primary PVT.

A variety of methods have been utilized to detect invalid responding using the TMT. Time to completion of TMT-A, TMT-B, and combined trials (TMT A + B) have all been previously identified as effective measures of performance validity (O’Bryant et al., 2003; Powell et al., 2011; Busse & Whiteside, 2012; Shura, Miskey, Rowland, Yoash-Gatz, & Denning, 2016). Additionally, the difference between the two trials (TMT B–A; Ruffolo et al., 2000; Yuspeh et al., 2000) and their ratio (TMT B/A; Corrigan & Hinkeldey, 1987; Ruffolo et al., 2000) have also gained some support as methods of identifying non-credible performance.

In order to continuously monitor performance validity throughout the assessment (Boone, 2009; Bush et al., 2014; Chafetz et al., 2015; Schutte et al., 2015) and to provide a comprehensive evaluation of performance validity across the cognitive domains (Cottingham, Victor, Boone, Ziegler, & Zeller, 2014; Webber, Critchfield, & Soble, 2018), two validity composites were created (“Erdodi Index Five” or EI-5), following the methodology outlined by Erdodi (2019). One was based on PVTs embedded within measures of visuomotor processing speed and was labeled “EI-5VM.” The other was based on PVTs based on forced choice recognition paradigms and was labeled “EI-5FCR.”

EI-5 components were recoded onto a four-point ordinal scale, where zero was defined as an incontrovertible Pass and three as an incontrovertible Fail. An EI-5 value of one signifies failing the most liberal (i.e., high sensitivity, low specificity) cutoff, which is typically associated with a base rate of failure (BRFail) around 25% (An et al., 2019; Erdodi et al., 2018; Erdodi et al., 2016; Pearson, 2009). The cutoffs for EI-5 values of 2 and 3 were either determined on rational grounds in combination with the existing research literature or, in the case of scales with a wide range, adjusted to correspond to BRFail values of 10% and 5%, respectively (Table 1). In other words, as the EI-5 value increases, so does the confidence in correctly classifying a given response set as invalid. By design, the EI-5 captures both the number and the extent of PVT failures.

Table 1 The components of the EI-5s and base rates of failure at given cutoffs

The value of the EI-5 composite is obtained by summing its recoded components. As such, it can range from 0 (i.e., all five constituent PVTs were passed at the most liberal cutoff) to 15 (i.e., all five constituent PVTs were failed at the most conservative cutoff). An EI-5 ≤ 1 is classified as an overall Pass, as it contains at most one failure at the most liberal cutoff. The interpretation of EI-5 values of 2 and 3 is more problematic, as they can indicate either a single failure at a more conservative cutoff or multiple failures at the most liberal one. Regardless of the specific configuration, this range of performance fails to deliver sufficiently strong evidence to render the entire response set invalid. At the same time, it signals subthreshold evidence of non-credible responding (Abeare et al., 2018; Erdodi, Hurtubise, et al., 2018; Erdodi, Seke, Shahein, et al., 2017; Erdodi et al., 2018; Proto et al., 2014). Therefore, it is labeled Borderline and excluded from analyses requiring a dichotomous (Pass/Fail) criterion variable. In contrast, an EI-5 ≥ 4 indicates at least two failures at the more conservative cutoffs, at least four failures at the most liberal cutoff, or some combination of both. As such, it meets commonly used standards for classifying the entire neurocognitive profile as invalid (Boone, 2013; Davis & Millis, 2014; Larrabee, 2014; Odland, Lammy, Martin, Grote, & Mittenberg, 2015). The majority of the sample produced a score in the passing range on both versions of the EI-5, with a BRFail of approximately 20% (Table 2).

Table 2 Frequency, cumulative frequency and classification range for the first ten levels of the EI-5s
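The EI-5 aggregation logic described above can be sketched in a few lines of code. This is an illustrative sketch only: the component names and cutoff values below are hypothetical placeholders, not the published EI-5 specification (the actual cutoffs are listed in Table 1), and the example assumes lower scores indicate worse performance, as with T-scores.

```python
# Illustrative sketch of EI-5 scoring; cutoff values are hypothetical.

def recode(score, cutoffs):
    """Recode a raw PVT score onto the 0-3 ordinal EI scale.

    `cutoffs` lists the liberal, intermediate, and conservative
    thresholds (lower scores indicate worse performance here).
    """
    liberal, intermediate, conservative = cutoffs
    if score <= conservative:
        return 3  # incontrovertible Fail
    if score <= intermediate:
        return 2
    if score <= liberal:
        return 1  # failed only the most liberal cutoff
    return 0      # incontrovertible Pass

def ei5(scores, cutoff_table):
    """Sum the recoded components (with five PVTs, range: 0-15)."""
    return sum(recode(scores[name], cutoff_table[name])
               for name in cutoff_table)

def classify(total):
    """Map the composite to the three interpretive ranges."""
    if total <= 1:
        return "Pass"
    if total <= 3:
        return "Borderline"
    return "Fail"
```

Under this scheme, a composite of 2 or 3 lands in the Borderline range regardless of whether it reflects one conservative failure or multiple liberal ones, mirroring the interpretive ambiguity noted above.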

Because the EI model is a relatively new approach to multivariate performance validity assessment, it was validated against the WMT, a well-established free-standing PVT (Green, Iverson, & Allen, 1999; Iverson, Green, & Gervais, 1999; Tan, Slick, Strauss, & Hultsch, 2002). The EI-5FCR produced a slightly higher (.88) overall classification accuracy than the EI-5VM (.84), driven by superior sensitivity (.75 vs. .64). However, both versions of the EI-5 were highly specific (.96 and .98, respectively) to failure on the WMT (Table 3). Significant differences emerged on several TMT scores as a function of EI-5 level (Pass, Borderline, Fail). The contrasts were associated with notably larger main effects (ηp2: .33–.44, extremely large) for the EI-5VM (Table 4) than for the EI-5FCR (ηp2: .12–.20, large effects; Table 5). All post hoc contrasts between Pass and Fail were significant, although effect sizes were more pronounced for the EI-5VM (d: 1.57–1.73, very large) than the EI-5FCR (d: 0.74–0.97, large). Likewise, all post hoc contrasts between Pass and Borderline were significant. Again, larger effects were observed for the EI-5VM (d: 0.75–1.15, large) than the EI-5FCR (d: 0.66–0.91, medium-large). This pattern of findings reinforces the notion that a third category (“indeterminate range”) is a legitimate outcome in performance validity assessment in addition to the traditional Pass/Fail dichotomy (Erdodi, 2019).

Table 3 Classification accuracy of the EI-5s against the WMT

Procedure

Patients were assessed in two half-day (4-hour) appointments in an outpatient setting. Testing was conducted by Master’s level psychologists with specialty training in psychometric testing. De-identified data were collected retrospectively. The project was approved by the relevant institutional review boards.


Table 4 Results of one-way ANOVAs on TMT scores across EI-5VM classification ranges

Data Analysis

Descriptive statistics (mean, standard deviation, range, BRFail) were reported for key variables. Overall classification accuracy (AUC) and the corresponding 95% confidence intervals were computed using SPSS version 23.0. AUC values in the .70–.79 range were classified as acceptable, values in the .80–.89 range as excellent, and values ≥ .90 as outstanding (Hosmer & Lemeshow, 2000). Sensitivity and specificity were computed using standard formulas (Grimes & Schulz, 2005). In the context of PVTs, the minimum acceptable threshold for specificity is .84 (Larrabee, 2003), although values ≥ .90 are desirable and are becoming the emerging norm (Boone, 2013; Donders & Strong, 2011). The statistical significance of risk ratios (RR) was determined using the Chi-square test of independence. One-way ANOVAs and t tests were computed to compare means, as appropriate.
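The standard formulas referenced above reduce to simple proportions over the 2 × 2 cross-tabulation of an EVI against a criterion PVT. A minimal sketch (the counts passed in are hypothetical, not values from this study):

```python
# Standard 2x2 classification-accuracy formulas (cf. Grimes & Schulz, 2005).
# "Positive" = flagged as invalid by the EVI; criterion = the reference PVT.

def sensitivity(true_pos, false_neg):
    """Proportion of criterion-invalid profiles correctly flagged."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of criterion-valid profiles correctly passed."""
    return true_neg / (true_neg + false_pos)

def overall_accuracy(true_pos, true_neg, false_pos, false_neg):
    """Proportion of all profiles classified correctly."""
    total = true_pos + true_neg + false_pos + false_neg
    return (true_pos + true_neg) / total
```

For example, with 100 criterion-invalid and 100 criterion-valid profiles, 75 true positives and 96 true negatives would yield a sensitivity of .75 and a specificity of .96, the values reported for the EI-5FCR against the WMT.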

Table 5 Results of one-way ANOVAs on TMT scores across EI-5FCR classification ranges

Results

Raw Score Cutoffs

The raw score-based EVIs for TMT A and B performed reasonably well overall against the three criterion PVTs, with sensitivity ranging from .10 to .63 and specificity ranging from .87 to .98 (Table 6). Classification accuracy was best against the EI-5VM, as expected due to modality specificity (Erdodi, 2019). The TMT A + B had the best classification accuracy of the raw score-based EVIs, with sensitivity ranging from .36 to .82 and specificity ranging from .84 to .95. Again, classification accuracy was best against the EI-5VM. The TMT B/A had unacceptably poor classification accuracy and was therefore excluded from subsequent analyses.

Table 6 Classification accuracy of various TMT validity cutoffs

T-Score Cutoffs

Classification accuracy for the TMT A and B demographically corrected T-scores was generally better than that of the raw score cutoffs and varied depending on the criterion PVT. TMT A T-scores were slightly less sensitive to failure on the WMT (.33 to .36) than TMT B T-scores (.38 to .49), but were slightly more sensitive to failure on the two EI-5s (.43 to .74 and .39 to .68, respectively). Specificity values for TMT A and B T-scores were comparable across criterion PVTs. Cutoffs of T ≤ 33 on both TMT A and B optimized sensitivity (.33 and .38, respectively) with strong specificity (≥ .90) against the WMT. The TMT B–A had fairly poor sensitivity (.10 to .17) at acceptable specificity values (.85 to .88).

Lastly, the three T-score-based EVIs were combined into two multivariate composites: a liberal composite (T3-LIB), which summed the number of failures at the liberal cutoffs, and a conservative composite (T3-CON), which summed the number of failures at the conservative cutoffs, each with an overall cutoff of ≥ 2 failures. The multivariate combinations had classification accuracy comparable to that of the univariate cutoffs, with sensitivity ranging from .28 to .74 and specificity ranging from .89 to .97 (Table 7).

Table 7 Base rate of failure (%) as a function of injury severity across various PVTs and cutoffs
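The multivariate decision rule described above is a simple failure count with a threshold. The sketch below illustrates the logic only; the cutoff values are hypothetical placeholders, not the thresholds established in this study.

```python
# Hedged sketch of the T3-LIB / T3-CON decision rule: count failures
# across the three T-score EVIs and flag the profile when >= 2 fail.
# All cutoff values below are hypothetical, for illustration only.

T3_LIB_CUTOFFS = {"TMT_A": 37, "TMT_B": 37, "TMT_B_minus_A": 40}
T3_CON_CUTOFFS = {"TMT_A": 33, "TMT_B": 33, "TMT_B_minus_A": 36}

def t3_failures(t_scores, cutoffs):
    """Count how many of the three EVIs fall at or below their cutoff."""
    return sum(t_scores[name] <= cutoffs[name] for name in cutoffs)

def t3_fail(t_scores, cutoffs):
    """Overall Fail requires at least two individual failures."""
    return t3_failures(t_scores, cutoffs) >= 2
```

Requiring two failures rather than one is what keeps the composite's specificity high: a single isolated failure at a liberal cutoff does not, by itself, flag the profile.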

Effects of Age and Education

As expected, EVIs based on raw scores were influenced by age (d: .47–.84, medium-large). However, there was no effect of age on the T-score cutoffs (see Table 8). Similarly, there was no effect of education on TMT A raw or T-scores. However, there was an effect of education on TMT B raw scores and TMT A + B raw scores (d: .39–.52, medium). There was no effect of education on TMT A or B T-scores (see Table 9).

Table 8 The relationship between failing TMT validity cutoffs and age of the examinee
Table 9 The Relationship between failing TMT validity cutoffs and the examinee’s level of education

The effects of age and education were also examined for the criterion PVTs. There was a medium effect of age on the WMT (d = .42) and the EI-5VM (d = .45), but there was no effect of age on the EI-5FCR. There were small to medium, but statistically non-significant, effects of education on the WMT (d = .38) and the EI-5VM (d = .44), and a medium effect on the EI-5FCR (d = .63).

Effects of Injury Severity

Consistent with the literature, there was a reverse dose-response relationship (Hill, 1965; Carone, 2008; Erdodi & Rai, 2017; Green, Flaro, & Courtney, 2009; Green et al., 1999) between injury severity and the BRFail on the WMT, with mTBI patients failing at a higher rate (46.1%) than those with moderate to severe TBIs (16.7%). The EI-5VM also had a higher BRFail in the mTBI group (29.3%) than in the moderate to severe group (9.1%). The EI-5FCR showed the same pattern, with the mTBI group more than twice as likely to fail, but this difference was not statistically significant. It should be noted, however, that the Chi-square test is sensitive to sample size and, in this case, was underpowered. The BRFail on the TMT did not vary as a function of TBI severity (Table 10).

Discussion

EVIs based on demographically corrected T-scores on the TMT were developed and compared to previously published raw score cutoffs in an attempt to reduce potential age and educational bias in the assessment of performance validity. Raw score cutoffs were found to be influenced by age and education, whereas the newly developed cutoffs for TMT A and TMT B demographically corrected T-scores successfully eliminated potential age and education biases. In addition, the T-score cutoffs had generally better classification accuracy than the raw-score cutoffs, with a better balance between sensitivity and specificity.

The so-called “derivative” EVIs (e.g., TMT A + B, B–A, B/A) had mixed support. The TMT B–A T-score difference and the B/A raw score ratio had poor sensitivity at cutoffs with acceptable specificity. The TMT A + B raw score had strong classification accuracy in this sample, but it was confounded by age and education, consistent with the other raw-score cutoffs. Consequently, these EVIs have limited utility in a general clinical sample, as they have an inflated false positive rate for older adults and examinees with lower levels of education.

Table 10 The relationship between failing criterion PVTs and the examinee’s age and level of education

The multivariate combinations, T3-LIB and T3-CON, had classification accuracy comparable to that of the univariate cutoffs. This is somewhat surprising, as multivariate combinations of EVIs often result in improved classification accuracy (Boone, 2009; Erdodi et al., 2014; Rai, An, Charles, Ali, & Erdodi, 2019; Tyson et al., 2018). The reason for the lack of incremental utility in the multivariate combinations in this sample is unclear.

The criterion PVTs showed the expected reverse dose-response relationship with injury severity. The mTBI group had a 46.1% BRFail on the WMT compared to only 16.7% in the moderate to severe group (RR = 2.76). The same pattern was found for the EI-5VM (BRFail = 29.3% and 9.1%, respectively; RR = 3.22) and the EI-5FCR (BRFail = 32.3% and 15.0%; RR = 2.15), although the Chi-square for the EI-5FCR was non-significant (p = .111). However, consistent with Iverson et al. (2002), the reverse dose-response relationship was not observed for any of the TMT-based EVIs. In fact, there were no reliable differences between severity groups on any of these EVIs. One can only speculate about possible reasons behind these negative findings and their clinical interpretation. The propensity to exaggerate deficits in patients with mTBI in the post-acute stage of recovery may have been canceled out by genuine persisting deficits in patients with moderate-to-severe TBI. It should be noted that the lack of difference in BRFail between these two levels of injury severity itself provides circumstantial evidence for non-credible performance in the mTBI sample, as the normative outcome of mTBI is full return to baseline (Boone, 2013; McCrea, 2008). Therefore, if those with mTBI had produced valid performance, a relative deficit would be expected in patients with moderate-to-severe TBI on measures known to be sensitive to the deleterious effects of head injury (Axelrod, Fichtenberg, Liethen, Czarnota, & Stucky, 2001; Donders & Strong, 2015; Lange & Iverson, 2005).
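The risk ratios above follow directly from the group base rates of failure, as a quick arithmetic check confirms:

```python
# Each RR is the mild-TBI base rate of failure divided by the
# moderate-to-severe base rate, using the percentages reported above.

def risk_ratio(br_mild, br_mod_severe):
    """Ratio of base rates of failure between severity groups."""
    return round(br_mild / br_mod_severe, 2)

rr_wmt = risk_ratio(0.461, 0.167)      # WMT
rr_ei5vm = risk_ratio(0.293, 0.091)    # EI-5VM
rr_ei5fcr = risk_ratio(0.323, 0.150)   # EI-5FCR
```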

Although this suggests that TMT validity cutoffs, especially the derivative indices, may be less sensitive to aspects of invalid performance that are responsible for the commonly observed reverse dose-response relationship, they also have the benefit of not being sensitive to genuine impairment. This finding is consistent with previous reports (Arnold et al., 2005; Erdodi et al., 2017; Erdodi, Hurtubise, et al., 2018; Erdodi et al., 2018; Erdodi, Tyson, et al., 2018; Jasinski, Berry, Shandera, & Clark, 2011). The variability in the signal detection profile across samples occasionally leads to accidental discoveries that reaffirm the utility of derivative EVIs (Axelrod, Meyers, & Davis, 2014; Dean, Victor, Boone, Philpott, & Hess, 2009; Glassmire, Wood, Ta, Kinney, & Nitch, 2019). Therefore, the cumulative evidence suggests that experimenting with derivative EVIs is a worthwhile exercise, even if they occasionally underperform.

The modality specificity effect (Erdodi, 2019; Rai & Erdodi, 2019) was observed in this study. The various cutoffs on the TMT were generally much more sensitive when the criterion PVT was the EI-5VM, which is composed of several visuomotor processing speed tests, compared to the WMT and the EI-5FCR, which are based on forced choice recognition paradigms. This finding underscores the importance of using a diverse combination of PVTs that vary in terms of cognitive domain and paradigm, to enable assessors to detect various manifestations of non-credible responding (Cottingham et al., 2014). In addition, the development of these EVIs will benefit clinicians because they provide non-memory-based EVIs to complement the commonly used PVTs that are typically based on forced choice recognition memory.

Overall, results suggest that the demographically corrected T-score cutoffs show promise in clinical use. These cutoffs had reasonably strong classification accuracy against the criterion PVTs and successfully eliminated potential age and education bias observed in the raw score cutoffs. TMT validity cutoffs were unrelated to injury severity, suggesting that they are robust to genuine cognitive impairment. This is particularly important given that the TMT is known to be sensitive to the deleterious effects of TBI (Sanchez-Cubillo et al., 2009; Woods et al., 2015).

Inevitably, the present study has a number of limitations, the most obvious of which is the relatively small and geographically restricted sample. The demographically corrected T-score cutoffs should be examined in other clinical populations and in other geographic regions to ensure the generalizability of findings. These cutoffs should also be examined with larger samples and using different criterion measures as well as research designs, including experimental malingering paradigms.

This study makes a small, but important contribution to the knowledge base of clinical neuropsychology. The use of PVTs that are biased by demographic variables can lead to increased misclassification and contribute to erroneous diagnoses, especially in vulnerable populations (e.g., the elderly and examinees with lower levels of educational achievement). Ensuring that tests are not biased against demographic groups is a fundamental expectation in behavioral sciences. Violating this assumption can have serious social and ethical implications. In addition, in an era of increased financial and time pressures being placed on clinicians, establishing EVIs with strong classification accuracy may help to shorten evaluation time while simultaneously improving the assessment of performance validity (Boone, 2013; Erdodi, Pelletier, & Roth, 2018d; Larrabee, 2008, 2012).