Valid interpretation of neuropsychological test results relies on the assumption that examinees perform to the best of their abilities. As such, it has become a broadly accepted standard of practice to routinely assess the credibility of examinees’ responding (Bush et al., 2005; Heilbronner et al., 2009). Performance validity may also fluctuate over the course of a long testing session as a function of idiosyncratic expression of cognitive deficits and/or malingering strategy (Cottingham, Victor, Boone, Ziegler, & Zeller, 2014; Erdodi, Kirsch, Lajiness-O’Neill, Vingilis, & Medoff, 2014a). Detection methods must be robust enough to identify a broad range of abnormal test-taking behaviors that may indicate non-credible responding.

The most effective strategy for a comprehensive assessment of performance validity is using multivariate models (An, Kaploun, Erdodi, & Abeare, 2017; Boone, 2013; Erdodi et al., 2018c; Larrabee, 2008; Tyson et al., 2018). Therefore, it is important to administer several different measures of performance validity dispersed throughout an assessment (Boone, 2009; Bush, Heilbronner, & Ruff, 2014; Chafetz et al., 2015; Schutte, Axelrod, & Montoya, 2015). Performance validity may be compromised by a wide range of factors, such as fatigue (Kalfon, Gal, Shorer, & Ablin, 2016; Suhr, 2003), lack of interest (An et al., 2017, 2019; Rai, An, Charles, Ali, & Erdodi, 2019), cogniphobia (Suhr & Spickard, 2012), illness perception (Henry et al., 2018), acute emotional distress (Clark, Amick, Fortier, Millberg, & McGlinchey, 2014; Erdodi et al., 2016), failing to appreciate the importance of demonstrating the highest level of neurocognitive ability (Abeare, Messa, Zuccato, Merker, & Erdodi, 2018b; Abeare et al., 2018a), complex trauma history (Costabile, Bilo, DeRosa, Pane, & Sacca, 2018; Williamson, Holsman, Chaytor, Miller, & Drane, 2012), deficits in adaptive functioning (Lippa, Agbayani, Hawes, Jokic, & Caroselli, 2014), or outright malingering (Boone, 2013; Larrabee, 2012). Regardless of the specific cause, non-credible responding is a significant threat to the integrity of test data that can undermine the validity of the clinical interpretation.

Clinicians’ judgment about the credibility of performance has long been known to be inaccurate (Heaton, Smith, Lehman, & Vogt, 1978), despite (or perhaps because of) their high level of confidence in their ability to detect non-credible performance (Guilmette, 2013). Therefore, objective empirical methods are needed to evaluate performance validity (Chafetz et al., 2015; Larrabee, 2014). Instruments developed to monitor performance validity fall into two main categories: free-standing performance validity tests (PVTs) and embedded validity indicators (EVIs).

Free-standing PVTs were created with the specific and sole purpose of monitoring the credibility of test-taking behavior. Consequently, they provide little (if any) information on cognitive ability. They are usually simple tasks designed to appear more difficult than they actually are. In addition to the low cognitive demand, the threshold for failure is often set so that the majority of examinees with severe genuine impairment are able to pass (Allen & Green, 1999; Greve, Bianchini, & Doane, 2006; Green & Flaro, 2003). Given the combination of an objectively easy task and typically highly conservative cutoffs (Erdodi & Rai, 2017; Greve, Etherton, Ord, Bianchini, & Curtis, 2009; Jones, 2013; Kulas, Axelrod, & Rinaldi, 2014), failing free-standing PVTs provides strong evidence of non-credible responding that is unlikely to be attributable to genuine neurological impairment. However, the practice of deeming an entire neurocognitive profile invalid based on failures on free-standing PVTs represents an inferential leap that has been criticized on both logical (An et al., 2019; Bigler, 2012) and empirical grounds (Erdodi, Pelletier, & Roth, 2018e; Erdodi & Roth, 2017; Locke, Smigielski, Powell, & Stevens, 2008).

In contrast, EVIs are derived from traditional neuropsychological tests subsequently co-opted as PVTs (Committee on Psychological Testing, 2015). EVIs have several advantages over free-standing PVTs, including being cost-effective (Boone, 2013; Erdodi, Abeare, Lichtenstein, Erdodi, & Linnea, 2017; Erdodi, Hurtubise, et al., 2018c) and more resistant to coaching, in that examinees are less likely to identify them as PVTs compared to their free-standing counterparts (Chafetz et al., 2015; Miele, Gunner, Lynch, & McCaffrey, 2012). In addition, relying on EVIs reduces the demand on the examinee’s mental stamina, which can be particularly useful when testing young children and medically or emotionally fragile patients (Lichtenstein et al., 2017). The main liability inherent in relying on EVIs to determine the credibility of a given response set is that separating ability from performance validity is more difficult and not without controversy (Erdodi & Lichtenstein, 2017).

The Finger Tapping Test (FTT; Reitan, 1969) is a measure of simple motor speed that has demonstrated its potential for use as an EVI (Heaton et al., 1978; Mittenberg, Rotholc, Russell, & Heilbronner, 1996; Rapport, Farchione, Coleman, & Axelrod, 1998). During the FTT, examinees are instructed to tap a lever attached to a mechanical counter with their extended index finger as fast as they can for 10 s at a time, over at least five consecutive trials. The procedure is then repeated with the non-dominant hand. Once the FTT is completed, the trials are averaged to create a summary score for each hand, which is used to establish the individual’s relative standing. Raw FTT scores are also used to compare oscillation speed between the dominant and non-dominant hands in order to draw inferences about the integrity of the cortical motor areas and the efferent motor pathways (Schatz, 2011).

Historically, the potential of motor measures to evaluate the credibility of performance has received less attention than measures assessing other cognitive domains, such as memory, especially the two-alternative forced-choice recognition paradigm (Baker, Connery, Kirk, & Kirkwood, 2014; Bauer, Yantz, Ryan, Warden, & McCaffrey, 2005; Blaskewitz, Merten, & Brockhaus, 2009; Bortnik et al., 2010; Greve et al., 2009; Lu, Boone, Cozolino, & Mitchell, 2003; Reedy et al., 2013; Wolfe et al., 2010). However, their ability to separate valid from invalid response sets has longstanding empirical support. The FTT demonstrated its ability to perform as an EVI in multiple studies in which participants feigning head injuries were compared with patients who had sustained true head injuries (Greiffenstein, Baker, & Gola, 1996; Heaton et al., 1978; Rapport et al., 1998). Feigning participants demonstrated abnormally poor and inconsistent performance patterns (i.e., performing worse on simple motor tests, such as the FTT, than on more complex measures, such as the Grooved Pegboard Test; Matthews & Klove, 1964). Overall, the existing evidence suggests that the FTT can help separate examinees who are feigning injuries from those with genuine motor deficits. However, early studies stopped short of introducing specific cutoffs to separate credible from non-credible response sets (Erdodi et al., 2017c).

Eventually, normative data that included demographic corrections (i.e., taking into account differences in performance due to age, sex, and education) were published as an alternative to the raw score cutoffs typically used to determine impairment (Heaton, Grant, & Matthews, 1991). Naturally, performance on the FTT is influenced by innate motor skills and, by extension, demographic variables. Specifically, previous research has shown that older age was associated with decreased finger tapping speed (Shimoyama et al., 2012). Similarly, at the group level, males typically outperformed females (Arnold et al., 2005; Ruff & Parker, 1993). Likewise, examinees with more formal education demonstrated better performance on the FTT (Homann et al., 2003). Such examinee characteristics with known influence on FTT performance can inadvertently bias validity cutoffs by artificially inflating failure rates in demographic groups that are at inherently higher risk for low finger tapping scores (i.e., examinees who are women, older, or less educated).

The FTT’s integrity as a performance validity indicator hinges on its potential to differentiate between ability, which appears to vary with demographic factors, and performance validity, which is thought to be independent of demographic factors (Boone, 2013; Lichtenstein, Erdodi, Rai, Mazur-Mosiewicz, & Flaro, 2018a; Lichtenstein, Holcomb, & Erdodi, 2018b). The credibility of an individual response set is determined based on binary validity cutoffs (i.e., Pass/Fail). Although existing FTT validity cutoffs have typically been calibrated to clear the minimum acceptable threshold for specificity (.84, Larrabee, 2003; .90, Boone, 2013), they have historically been based exclusively on raw scores.

Larrabee (2003) published validity cutoffs on the FTT that combine dominant and non-dominant hand raw scores. However, these cutoffs do not account for the effect of demographic variables. The only demographic factor currently used to adjust cutoffs for the FTT is sex: Arnold et al. (2005) were the first to formally recognize the need for a separate set of cutoffs for men (higher) and women (lower) to maintain comparable classification accuracy. In more recent research, Axelrod, Meyers, and Davis (2014) explored alternative cutoff scores but did not adjust for any demographic variables.

Previously published cutoffs that exclusively rely on raw scores may inadvertently introduce systematic error in the interpretation of test results. Namely, groups that demonstrate naturally lower performance on this task (i.e., women, older examinees, less educated individuals) are at higher risk for false positive error. The present study was designed to take the next logical step in developing the FTT as an EVI by offering more comprehensive demographically adjusted validity cutoffs. Therefore, FTT raw scores were transformed to T-scores corrected for race, age, gender, and education using normative data published by Heaton, Miller, Taylor, and Grant (2004). Existing cutoffs based on raw scores and newly developed cutoffs based on T-scores were compared directly to evaluate their classification accuracy (i.e., their ability to separate valid and invalid responses). It was hypothesized that using demographically adjusted validity cutoffs for the FTT would achieve specificity and sensitivity values comparable to raw score cutoffs. In addition, we predicted that demographically adjusted FTT validity cutoffs would protect demographically disadvantaged examinees against false positive errors.
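To illustrate the logic of the demographic adjustment, the sketch below shows the generic raw-to-T conversion under assumed stratum parameters. The actual Heaton et al. (2004) norms are published lookup tables stratified by age, education, sex, and race; the stratum means and standard deviations here are invented for illustration only.

```python
# Minimal sketch of a demographically adjusted T-score conversion. The real
# Heaton et al. (2004) norms are lookup tables stratified by age, education,
# sex, and race; the stratum parameters below are invented placeholders.
def raw_to_t(raw: float, stratum_mean: float, stratum_sd: float) -> float:
    """Convert a raw FTT score to a T-score (mean 50, SD 10) relative to
    the examinee's demographic stratum."""
    z = (raw - stratum_mean) / stratum_sd
    return 50 + 10 * z

# The same raw score maps onto different T-scores in different strata
print(round(raw_to_t(38, stratum_mean=51, stratum_sd=7)))  # 31
print(round(raw_to_t(38, stratum_mean=44, stratum_sd=7)))  # 41
```

Because the stratum parameters differ, the same raw score can fall well below a T-score cutoff for one examinee and comfortably above it for another, which is precisely the confound that demographically adjusted cutoffs are intended to remove.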

Method

Participants

The sample consisted of a consecutive case sequence of 100 patients evaluated following a traumatic brain injury (TBI) at an academic medical center in the Midwestern US (the fifth author’s institutional affiliation). All patients were clinically referred for a neuropsychological evaluation by their treating physician. Data on external incentives to appear impaired (Criterion A of Malingered Neurocognitive Dysfunction; Slick, Sherman, & Iverson, 1999) were not available for the majority of the sample. Instead, the credibility of neurocognitive profiles was established using psychometric methods. Patients with missing index fingers, acute orthopedic injury to the hand, or neurological conditions known to cause significant impairment in dexterity (hemiparesis, multiple sclerosis, Parkinson’s disease) were excluded from the study to control for a major confound in evaluating the FTT’s potential to serve as a PVT.

The majority of the participants were men (56%) and right-handed (91%). Mean age was 38.8 years (SD = 14.9, range 17–74). Mean level of education was 13.7 years (SD = 2.6, range 7–20). Mean full scale IQ (FSIQ) was 92.7 (SD = 15.7, range 61–130). Although seven of the patients produced an FSIQ < 70, none of them had the documented history of longstanding adaptive deficits required for a diagnosis of intellectual disability. In fact, two of them had a university diploma, and all of them had psychometric evidence of normal or even above-average cognitive functioning. Therefore, an FSIQ < 70 in this subset of patients was likely a manifestation of non-credible responding.

As expected, the majority of head injuries were categorized as mild TBI (76%) by the assessing neuropsychologist according to commonly used guidelines (Glasgow Coma Scale score > 13, loss of consciousness < 30 min, and post-traumatic amnesia < 24 h; American Congress of Rehabilitation Medicine, 1993). Data on injury parameters were collected through clinical interview and review of the medical records. The present sample overlaps with previous publications from the same research group designed to investigate different topics (Erdodi, Kirsch, et al., 2018d; Erdodi, Abeare, Medoff, et al., 2018; Erdodi, Roth, et al., 2014b). None of these papers examined the FTT as a PVT.

Materials

Tests Administered

A core battery of neuropsychological tests was administered to all patients, including the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV; Wechsler, 2008), the Wechsler Memory Scale – Fourth Edition (WMS-IV; Wechsler, 2009), the California Verbal Learning Test – Second Edition (CVLT-II; Delis, Kramer, Kaplan, & Ober, 2000), Conners’ Continuous Performance Test – Second Edition (CPT-II; Conners, 2004), the Trail Making Test (TMT A & B; Reitan, 1955; Reitan, 1958), verbal fluency (FAS & animals; Gladsjo et al., 1999; Newcombe, 1969), the Wisconsin Card Sorting Test (WCST; Heaton, Chelune, Talley, Kay, & Curtis, 1993), the Grooved Pegboard Test (GPB; Matthews & Klove, 1964), and the Tactual Performance Test (TPT; Halstead, 1947). Premorbid functioning was estimated using the single word reading subtest of the Wide Range Achievement Test – Fourth Edition (WRAT-4; Wilkinson & Robertson, 2006) and the Peabody Picture Vocabulary Test – Fourth Edition (PPVT-4; Dunn & Dunn, 2007). Manual dexterity was measured with the FTT and GPB, using demographically adjusted T-scores based on the norms by Heaton et al. (2004).

Administration protocols for the FTT vary (Strauss, Sherman, & Spreen, 2006). Patients in the present sample were administered the FTT following the instructions by Reitan and Wolfson (1985): five consecutive trials with the dominant hand, followed by five consecutive trials with the non-dominant hand. If the range of raw scores across the five trials was within 5 points, the test was discontinued for that hand. If performance was more variable, further trials were administered until the “five (trials) within five (range)” criterion was met, or 10 trials were reached.
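A minimal sketch of this discontinuation rule appears below, assuming a hypothetical next_trial callable that administers one 10-s trial and returns its raw score; the trial values are invented.

```python
# Sketch of the "five (trials) within five (range)" discontinuation rule
# (Reitan & Wolfson, 1985) as described above. `next_trial` is a hypothetical
# stand-in for administering one 10-s tapping trial with a given hand.
def administer_hand(next_trial, max_trials=10):
    trials = []
    while len(trials) < max_trials:
        trials.append(next_trial())
        last_five = trials[-5:]
        # Discontinue once five consecutive trials fall within 5 points
        if len(trials) >= 5 and max(last_five) - min(last_five) <= 5:
            break
    return trials

# Hypothetical raw trial scores: the criterion is met on the seventh trial
scores = iter([52, 44, 50, 48, 49, 47, 50])
trials = administer_hand(lambda: next(scores))
print(trials, round(sum(trials) / len(trials), 1))  # summary score = 48.6
```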

Criterion PVTs

The main free-standing PVT was the computer-administered version of the Word Memory Test (WMT; Green, 2003). The Genuine Memory Impairment Profile (GMIP) was not applied as an exclusionary clause after failing the WMT, for a number of reasons. First, although earlier investigations demonstrated that careful screening for both non-credible responding and genuine memory deficits is needed to detect a dose-response relationship between TBI severity and memory skills (Donders & Strong, 2013), the present study was explicitly focused on performance validity and non-memory-based EVIs. Second, previous research found that a quarter of graduate students asked to feign dementia qualified for the GMIP (Armistead-Jehle & Denney, 2015), suggesting that it is not specific to genuine impairment. Third, injury severity as a potential confound for PVT failure was explicitly investigated in this sample.

To meet the broadly accepted practice guideline of continuously monitoring performance validity throughout the assessment (Boone, 2009; Bush et al., 2014; Chafetz et al., 2015; Schutte et al., 2015), two validity composites were created (“Erdodi Index Five” or EI-5), following the methodology outlined by Erdodi (2019). The first one was based on PVTs embedded within measures of psychomotor processing speed and was labeled “EI-5PSP.” The second validity composite was based on PVTs embedded within measures of attention, working memory, and auditory verbal memory, and was labeled “EI-5AWM.”

Each EI-5 component was recoded onto a 4-point ordinal scale, where 0 was defined as an incontrovertible Pass and 3 as an incontrovertible Fail. Psychometrically, an EI-5 value of 1 is anchored to failing the most liberal (i.e., high sensitivity, low specificity) cutoff, which is typically associated with a base rate of failure (BRFail) around 25% (Pearson, 2009). The cutoffs for EI-5 values of 2 and 3 are either determined on rational grounds in combination with the existing research literature or, in the case of scales with a wide range, adjusted to correspond to BRFail values of 10% and 5%, respectively (Table 1; Erdodi, Kirsch, Sabelli, & Abeare, 2018d). In other words, as the EI-5 value increases, so does the confidence in correctly classifying a given response set as invalid. By design, the EI-5 captures both the number and the extent of PVT failures (Erdodi et al., 2018b).

Table 1 Individual components of the EI-5s and base rates of failure at given cutoffs (n = 100)

The value of the EI-5 composite is obtained by summing its recoded components. As such, it can range from 0 (i.e., all five constituent PVTs were passed at the most liberal cutoff) to 15 (i.e., all five constituent PVTs were failed at the most conservative cutoff). An EI-5 ≤ 1 is classified as an overall Pass, in that it contains at most one failure at the most liberal cutoff (Erdodi, 2019).

The interpretation of EI-5 values 2 and 3 is more problematic, for they can indicate either a single failure at a more conservative cutoff, or multiple failures at the most liberal cutoff. Regardless of the specific configuration, this range of performance fails to deliver sufficiently strong evidence to render the entire response set invalid. At the same time, it signals subthreshold evidence of non-credible responding (Abeare, Messa, Whitfield, et al., 2018a; Erdodi, Hurtubise, et al., 2018c; Erdodi, Seke, Shahein, et al., 2017c; Erdodi et al., 2018f; Proto et al., 2014).

Therefore, this range of performance on the EI-5 is labeled Borderline and excluded from analyses requiring a dichotomous (Pass/Fail) criterion variable (Erdodi, 2019). In contrast, an EI-5 ≥ 4 indicates at least two failures at the more conservative cutoffs, at least four failures at the most liberal cutoff, or some combination of both. As such, these scores meet commonly used standards for classifying the entire neurocognitive profile as invalid (Boone, 2013; Davis & Millis, 2014; Larrabee, 2014; Odland, Lammy, Martin, Grote, & Mittenberg, 2015). The majority of the sample produced a score in the passing range on both versions of the EI-5, with a BRFail of roughly 20% (Table 2).
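A minimal sketch of the EI-5 logic described above (recoding, summation, and classification) follows; the cutoff values passed to recode_component are invented placeholders, as the actual components and cutoffs are listed in Table 1.

```python
# Minimal sketch of EI-5 scoring (Erdodi, 2019). The cutoffs below are
# invented placeholders; the actual components and cutoffs appear in Table 1.
def recode_component(score, liberal, moderate, conservative):
    """Recode one embedded PVT onto the 0-3 ordinal scale (lower scores
    reflect worse performance; liberal > moderate > conservative)."""
    if score <= conservative:
        return 3   # incontrovertible Fail (BRFail ~ 5%)
    if score <= moderate:
        return 2   # BRFail ~ 10%
    if score <= liberal:
        return 1   # most liberal cutoff (BRFail ~ 25%)
    return 0       # incontrovertible Pass

def classify_ei5(recoded_components):
    total = sum(recoded_components)   # 0-15 across the five components
    if total <= 1:
        return "Pass"
    if total <= 3:
        return "Borderline"  # excluded from dichotomous Pass/Fail analyses
    return "Fail"            # EI-5 >= 4: entire profile deemed invalid

print(classify_ei5([recode_component(s, 37, 33, 29) for s in
                    [45, 36, 50, 41, 52]]))  # Pass (one liberal failure)
```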

Table 2 Frequency, cumulative frequency, and classification range for the first ten levels of the EI-5s

Given that the EI model is a relatively new approach to multivariate performance validity assessment, it was validated against the WMT, a well-established free-standing PVT (Green, Iverson, & Allen, 1999; Iverson, Green, & Gervais, 1999; Tan, Slick, Strauss, & Hultsch, 2002), in order to demonstrate its diagnostic utility within the present sample. The EI-5AWM produced a significantly higher overall classification accuracy (.91) than the EI-5PSP (.79), driven by superior sensitivity (.74 vs. .47). However, both versions of the EI-5 were highly specific (.94) to psychometrically defined non-credible responding (Table 3).

Table 3 Classification accuracy of the EI-5s against the WMT as criterion PVT

Procedure

Patients were assessed in two half-day (4-hour) appointments in an outpatient setting. Psychometric testing was performed by Master’s level psychologists who received specialized training in test administration and scoring. The clinical interview and the interpretation of neuropsychological profiles were performed by a staff clinical neuropsychologist (fifth author), who was also responsible for generating an integrative summary report. By design, this is a retrospective archival study. Clinical data were fully and irreversibly de-identified prior to being used for research purposes. The project was approved by the institutional board overseeing compliance with research ethics.

Data Analysis

Descriptive statistics (mean, standard deviation, range, BRFail) were reported when relevant. Overall classification accuracy (AUC) and the corresponding 95% confidence intervals were computed using SPSS version 23.0. An AUC in the .70–.79 range is classified as acceptable; a value in the .80–.89 range is classified as excellent, whereas an AUC ≥ .90 is considered outstanding (Hosmer & Lemeshow, 2000). Sensitivity and specificity were computed using standard formulas (Grimes & Schulz, 2005). In the context of PVTs, the minimum acceptable threshold for specificity is .84 (Larrabee, 2003), although values ≥ .90 are desirable and are becoming the emerging norm (Boone, 2013; Donders & Strong, 2011). The statistical significance of risk ratios (RR) was determined using the chi-square test of independence.
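For illustration, the sketch below reproduces these metrics in Python on made-up data (the study itself used SPSS 23.0; scikit-learn is assumed available for the AUC). Sensitivity is TP/(TP + FN) and specificity is TN/(TN + FP), per the standard formulas.

```python
# Illustrative computation of the metrics described above (the study used
# SPSS 23.0). scikit-learn is assumed available; all data are made up.
from sklearn.metrics import roc_auc_score

def sensitivity_specificity(flagged, criterion):
    tp = sum(f and c for f, c in zip(flagged, criterion))
    fn = sum(not f and c for f, c in zip(flagged, criterion))
    tn = sum(not f and not c for f, c in zip(flagged, criterion))
    fp = sum(f and not c for f, c in zip(flagged, criterion))
    return tp / (tp + fn), tn / (tn + fp)  # TP/(TP+FN), TN/(TN+FP)

t_scores = [29, 55, 41, 33, 62, 48, 30, 51]                  # hypothetical
criterion = [True, False, True, True, False, False, False, False]
flagged = [t <= 33 for t in t_scores]                        # candidate cutoff
sens, spec = sensitivity_specificity(flagged, criterion)
auc = roc_auc_score(criterion, [-t for t in t_scores])  # lower T = more suspect
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={auc:.2f}")
# sensitivity=0.67, specificity=0.80, AUC=0.87
```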

Results

Dominant Hand T-Score Cutoffs

Dominant hand FTT T-scores produced significant AUCs (.70–.81) against all three criterion PVTs (Table 4). A cutoff of T ≤ 29 was highly specific (.97–1.00) to psychometrically defined non-credible responding, but had variable sensitivity (.23–.45). Increasing the cutoff to T ≤ 33 resulted in a predictable trade-off between improved sensitivity (.36–.55) and diminished, but still high, specificity (.90–.98). Making the cutoff even more liberal (T ≤ 35) reached the point of diminishing returns: a loss in specificity (.87–.94) without any corresponding gain in sensitivity.

Table 4 Base rates of failure and classification accuracy of FTT validity cutoffs against the WMT and the EI-5s as criterion PVTs

Non-dominant Hand T-Score Cutoffs

Non-dominant hand FTT T-scores produced significant AUCs (.67–.74) against all three criterion PVTs. A cutoff of T ≤ 29 had uniformly good specificity (.92–.93), but low and variable sensitivity (.23–.44). Increasing the cutoff to T ≤ 33 was largely inconsequential to both sensitivity (.28–.44) and specificity (.91–.92). No patient obtained a T-score of 35. Making the cutoff even more liberal (T ≤ 37) resulted in a meaningful increase in sensitivity (.41–.50). However, it also caused specificity to drop below the minimum threshold (.83) against the EI-5PSP, while maintaining adequate specificity against the other two criterion PVTs (.88–.90).

Combined T-Score Cutoffs

Consistent with the practice established in previous research on measures of manual dexterity as PVTs (Arnold et al., 2005; Axelrod et al., 2014; Erdodi, Seke, Shahein, et al., 2017c; Larrabee, 2003), a third variable combining both hands (COM) was created. A Fail on the COM was defined as failing a given cutoff with both hands; a Pass was defined as passing a given cutoff with either hand. As such, the COM provides a more conservative index of non-credible responding, because a failure with only one hand still counts as an overall Pass.
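A minimal sketch of the COM scoring rule, with hypothetical T-score inputs:

```python
# Sketch of the COM rule described above: an overall Fail requires both
# hands at or below the cutoff; passing with either hand is an overall Pass.
def com_fails(dominant_t, nondominant_t, cutoff=37):
    return dominant_t <= cutoff and nondominant_t <= cutoff

print(com_fails(33, 36))  # True: both hands fail T <= 37
print(com_fails(33, 45))  # False: the non-dominant hand passes
```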

A COM cutoff of T ≤ 33 had very high specificity (.97–.98), but low sensitivity (.18–.33). Increasing the cutoff to T ≤ 37 preserved the specificity (.97–.98) against the WMT and EI-5AWM, while improving sensitivity (.36–.43). However, this cutoff disproportionately traded specificity (.89) for sensitivity (.43) against the EI-5PSP. Making the cutoff even more liberal (T ≤ 39) produced no net improvement in classification accuracy: it further deflated specificity (.86–.95) while yielding only comparable gains in sensitivity (.38–.50).

Raw Score Cutoffs

To provide a head-to-head comparison with T-score-based cutoffs, the classification accuracy of raw score cutoffs was also computed. Dominant hand raw score cutoffs produced significant AUCs (.70–.75) against all three criterion PVTs. Gender-corrected cutoffs (≤ 28 for women and ≤ 35 for men; Arnold et al., 2005) had a low BRFail (6.0%). Consequently, they resulted in high specificity (.98–1.00) and low sensitivity (.13–.33). Combined hand raw score cutoffs produced a similar signal detection profile: low BRFail (5.0%), high specificity (.98–1.00), and low sensitivity (.13–.22).
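For illustration, the gender-corrected dominant hand raw score rule can be expressed as follows (a sketch on hypothetical inputs):

```python
# Sketch of the gender-corrected dominant-hand raw score cutoffs reported
# by Arnold et al. (2005): <= 28 for women, <= 35 for men.
def fails_raw_cutoff(dominant_raw, sex):
    cutoff = 28 if sex == "F" else 35
    return dominant_raw <= cutoff

print(fails_raw_cutoff(30, "F"))  # False: passes the women's cutoff
print(fails_raw_cutoff(30, "M"))  # True: fails the men's cutoff
```

Note how the same raw score is classified differently depending on sex, which is the rationale for demographic correction in the first place.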

Relationship Between Injury Severity and Failing FTT Validity Cutoffs

Since severe TBI could result in decreased motor speed (Curtis, Greve, & Bianchini, 2009; Donders & Strong, 2015; Erdodi, Abeare, et al., 2017a, 2018a; Haaland, Temkin, Randahl, & Dikmen, 1994) or difficulty inhibiting the concurrent movement of other fingers during tapping (Prigatano & Borgaro, 2003), it is important to ensure that genuine impairment is not misclassified as non-credible responding. Therefore, the BRFail was compared as a function of injury severity. Mild TBI was invariably associated with higher BRFail on all PVTs examined (RR 1.40–7.65) compared to moderate-to-severe TBI (Table 5). The contrasts involving the WMT and the EI-5AWM were statistically significant (p < .01).
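The sketch below illustrates how such a risk ratio and its significance test can be computed; the cell counts are invented, chosen only to mirror the sample's 76%/24% severity split (scipy is assumed available).

```python
# Sketch of the risk-ratio comparison described above; counts are invented
# (76 mild vs. 24 moderate-to-severe, mirroring the sample's 76% mild TBI).
from scipy.stats import chi2_contingency

mild = [19, 57]          # hypothetical [Fail, Pass] counts, mild TBI
mod_severe = [2, 22]     # hypothetical [Fail, Pass] counts, moderate-to-severe
rr = (mild[0] / sum(mild)) / (mod_severe[0] / sum(mod_severe))
chi2, p, dof, _ = chi2_contingency([mild, mod_severe])
print(f"RR = {rr:.2f}, chi2({dof}) = {chi2:.2f}, p = {p:.3f}")  # RR = 3.00
```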

Table 5 Base rate of failure (%) as a function of injury severity across various PVTs and cutoffs

Old and New: A Head-to-Head Comparison of Raw and T-Score Cutoffs

Finally, FTT validity cutoffs based on raw scores were compared to demographically adjusted ones in order to empirically evaluate their relationship with age, level of education, and sex. No significant age effects emerged as a function of passing or failing either type of cutoff (Table 6). However, patients who failed the COM raw score cutoff had notably lower levels of education (a large effect, d = 1.00). BRFail was evenly distributed between males and females using T-score-based cutoffs (RR 0.85–1.27; p values .598–.821).

Table 6 Age and level of education as a function of passing or failing various FTT validity cutoffs

However, women failed raw score cutoffs at higher rates than men (RR 1.89–2.53), although the difference in failure rates was not statistically significant (p ≥ .249). It must be noted that the sample size of the Fail group was notably smaller than that of the Pass group for both the T-score-based (n: 15–16 vs. 84–85) and raw score-based (n: 5–6 vs. 95–94) cutoffs. Therefore, statistical tests were likely underpowered by unequal group sizes and the small overall sample (Table 7).

Table 7 Base rate of failure (%) in males and females across various FTT validity cutoffs

Discussion

This study was designed to compare the classification accuracy of raw score-based and demographically adjusted validity cutoffs embedded within the FTT. We predicted that the two types of cutoffs (raw vs. T-scores) would have comparable signal detection performance. While this was the case in terms of AUC values, demographically adjusted validity cutoffs had notably higher (i.e., roughly double) sensitivity than raw score-based cutoffs at comparable levels of specificity. In other words, raw score-based cutoffs disproportionately sacrificed their ability to detect non-credible performance in order to ensure a low false positive rate.

Our second hypothesis received mixed support. We predicted that T-score-based FTT validity cutoffs would better protect demographically disadvantaged (i.e., female, older, less educated) examinees against the threat of false positive errors. Surprisingly, both sets of cutoffs were equally robust to age effects. Failing combined hand raw score cutoffs was indeed associated with lower levels of education, whereas no such difference was observed across T-score cutoffs. Paradoxically, women failed the gender-corrected raw score cutoffs introduced by Arnold et al. (2005) roughly twice as often as men. Again, no sex difference was observed across T-score cutoffs.

Taken together, these findings suggest that demographically adjusted cutoffs not only have superior classification accuracy but also provide a better balance between sensitivity and specificity while neutralizing the potential confounding effects of age, sex, and level of education. By pre-emptively accounting for these demographic variables, assessors improve their diagnostic accuracy in clinical settings and protect themselves against challenges in medico-legal settings (i.e., avoid the perception of unchecked biases). Logistically, it is also easier to apply uniform cutoffs rather than unique ones for different demographic categories and dominant vs. non-dominant hand on the FTT. These advantages may explain the growing trend towards demographically adjusted cutoffs (Ashendorf, Clark, & Sugarman, 2017; Erdodi, Kirsch, et al., 2018d; Erdodi & Lichtenstein, 2019).

The modality specificity effect (Erdodi, 2019; Rai & Erdodi, 2019) was observed during both the cross-validation of the EI-5s and the evaluation of the FTT cutoffs. The EI-5AWM, the modality congruent predictor variable, produced a significantly higher AUC against the WMT, the criterion PVT. Although its specificity was identical to that of the modality incongruent counterpart (EI-5PSP), its sensitivity was notably higher (.74 vs. .47). Similarly, FTT cutoffs tended to produce the best classification accuracy against the EI-5PSP, the modality congruent criterion. This engineered method variance was introduced to evaluate the signal detection profile of FTT validity cutoffs across changing definitions of invalid performance, consistent with the methodological triangulation proposed by Campbell and Fiske (1959). The fact that the optimal cutoffs (dominant and non-dominant hand T ≤ 33, as well as combined T ≤ 37) maintained high levels of specificity regardless of the criterion PVT further consolidates the evidence base supporting their clinical utility.

Equally important, the newly introduced FTT validity cutoffs were insensitive to injury severity—an important feature of a good PVT (Donders, 2005). In fact, patients with mild TBI produced consistently higher BRFail on the FTT than those with moderate or severe TBI. This paradoxical finding was even more pronounced on the WMT and the EI-5AWM, and is well-replicated in the research literature (Carone, 2008; Erdodi & Rai, 2017; Green, Flaro, & Courtney, 2009; Green et al., 1999; Ord, Boettcher, Greve, & Bianchini, 2010; Sweet, Goldman, & Guidotti Breting, 2013). This pattern of reverse dose-response relationship (Hill, 1965) provides additional evidence that PVT failures in mild TBI are unlikely to be attributable to genuine neurological impairment (Critchfield et al., 2019).

Consistent with previous reports (Arnold et al., 2005; Axelrod et al., 2014; Rapport et al., 1998), dominant hand FTT validity cutoffs provided a better overall measure of the credibility of a given response set (i.e., higher classification accuracy, more stable parameter estimates) compared to non-dominant hand FTT validity cutoffs. Therefore, if one must select one validity cutoff, our results suggest that a dominant hand T-score ≤ 33 is the single best demarcation line between credible and non-credible responding on the FTT.

Inevitably, the study also has a number of limitations, the most obvious of which is the relatively small sample size collected from a single geographic region. Consequently, the distribution of FTT scores was truncated, and certain T-scores were not observed at all (Table 4). Therefore, the results may be vulnerable to sample-specific findings and should be replicated before applying them broadly to clinical and forensic practice (Lichtenstein et al., 2019). Also, since all patients in this study were evaluated for residual deficits subsequent to TBI, it is unclear whether the present findings would generalize to different clinical populations. Future studies on larger samples with more diverse etiologies are needed to establish confidence in the proposed FTT validity cutoffs.

However, the study also has a number of strengths. It expanded the existing knowledge base on raw score-based FTT cutoffs (Arnold et al., 2005; Axelrod et al., 2014; Larrabee, 2003) by introducing demographically adjusted validity cutoffs, following a growing trend in EVI development (Ashendorf et al., 2017; Brooks & Ploetz, 2015; Erdodi & Lichtenstein, 2019; Erdodi, Lichtenstein, Rai, & Flaro, 2017b; Erdodi, Pelletier, & Roth, 2018e; Erdodi et al., 2016; Kirkwood, Hargrave, & Kirk, 2011; Lichtenstein, Flaro, Baldwin, Rai, & Erdodi, 2019; Sussman, Peterson, Connery, Baker, & Kirkwood, 2019). In addition, it provided further validation of the EI-5 as a multivariate model of performance validity assessment.