Introduction

Performance Validity Assessment and the Word Memory Test (WMT)

An independent survey by Rabin et al. (2014) showed that the three most widely used free-standing performance validity tests (PVTs) were the Test of Memory Malingering (TOMM; Tombaugh, 1996), the Word Memory Test (WMT; Green & Astner, 1995; Green, 2003; Green, Lees-Haley, & Allen, 2003), and the Medical Symptom Validity Test (MSVT; Green, 2004). These tests were designed to be very easy and to provide an indication of whether the examinee produced a credible response set on neuropsychological testing. Failing PVTs suggests that performance on ability tests cannot be treated as valid (Green, Rohling, Lees-Haley, & Allen, 2001; Boone, 2013; Larrabee, 2012).

The basic validation of the WMT began by applying the test to healthy volunteers, all of whom scored above the standard cutoffs, revealing perfect specificity (1.00; Green, Lees-Haley, & Allen, 2003). Of 25 psychologists and physicians who were asked to take the WMT while simulating early dementia, 24 fell below the cutoffs. Hence, the sensitivity of the test to experimental malingering was .96. Very similar results were obtained by independent researchers comparing simulators and good effort volunteers in Turkish (Brockhaus & Peker, 2003), Russian (Tydecks, Merten, & Gubbay, 2006), and German (Brockhaus & Merten, 2004).

In the German study, the WMT had perfect sensitivity (1.00) to simulated impairment, whereas only one of 32 institutionalized adults with intellectual disability failed the WMT (.97 specificity). In fact, the WMT had perfect specificity because it was later discovered that the woman who failed it had a borderline personality disorder and was malingering intellectual disability to obtain accommodation (R. Brockhaus, personal communication, 2010). Tydecks, Merten, and Gubbay (2006) reported that the Russian WMT yielded perfect sensitivity and specificity in healthy Russian speaking volunteers assigned to a control or experimental malingering condition.

Native Turkish speakers living in Turkey or Germany were administered the Turkish WMT and asked to feign impairment (Brockhaus & Peker, 2003; Fritze, 2003). All 50 participants in the simulator group were detected (1.00 sensitivity) and 47 out of 48 volunteers in the control group passed the WMT (.98 specificity). The sole false positive was likely due to dyslexia or limited literacy, because the person was 67 years old, and reported zero years of formal education.
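The sensitivity and specificity values quoted for these validation studies follow directly from the reported counts. The short sketch below reproduces that arithmetic; the function and variable names are illustrative and do not come from the cited studies.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of non-credible (simulating) examinees who fail the PVT."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of credible examinees who pass the PVT."""
    return true_neg / (true_neg + false_pos)


# Counts as quoted in the text above.
original_sens = sensitivity(true_pos=24, false_neg=1)   # 24 of 25 simulators detected
german_spec = specificity(true_neg=31, false_pos=1)     # 1 of 32 adults failed
turkish_sens = sensitivity(true_pos=50, false_neg=0)    # all 50 simulators detected
turkish_spec = specificity(true_neg=47, false_pos=1)    # 47 of 48 controls passed

for label, value in [
    ("Original sample sensitivity", original_sens),
    ("German sample specificity", german_spec),
    ("Turkish sample sensitivity", turkish_sens),
    ("Turkish sample specificity", turkish_spec),
]:
    print(f"{label}: {value:.2f}")
```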

The WMT in Children and Adults with Severe Dyslexia

Only 10% of a sample of children with severe dyslexia, a population at high risk for failing a PVT that requires word reading, failed the WMT (Larochette & Harrison, 2012). This is consistent with subsequent reports that severe dyslexia does not prevent credible examinees from passing PVTs based on the forced choice recognition paradigm that appear to require intact single word reading ability during the encoding trial (Hurtubise, Scavone, Sagar, & Erdodi, 2017). Following the recommendation to use the oral rather than the computerized WMT if the person’s reading level is below grade 3 may further decrease the failure rate.

False-Positive Rates (FPR): Facts and Controversy

The false-positive rate (FPR) and false-negative rate on PVTs are important issues. A false positive on a PVT may lead an assessor to misclassify a person with genuine impairment as non-credible. Conversely, a false negative on a PVT could result in concluding severe and debilitating cognitive impairment where none is actually present.

A Myth Is Born: Failing the WMT = False-Positive Error

In head injury litigation, it is not unusual for the plaintiff expert to argue that failure on PVTs was caused by severe genuine impairment. The notion that the WMT recognition subtests actually measure ability and not effort, thereby substantiating the myth of high FPR, was proposed by Bowden, Shores, and Mathias (2006). They questioned the finding of a greater base rate of failure on the WMT in mild TBI than in severe TBI, as reported by Green, Rohling, Lees-Haley, and Allen (2001). Instead, they hypothesized that the WMT recognition subtests were sensitive to genuine deficits and, therefore, behaved like ability tests. They reported similar failure rates on the WMT across injury severity ranges.

This conclusion is problematic for a number of reasons. First, Green, Rohling, Lees-Haley, and Allen (2001) reported that WMT failure rates in adults with mild TBI are greater than in adults with severe TBI. In contrast, Bowden, Shores, and Mathias (2006) included both children and adults with TBI in their sample. Notably, the percentage of children versus adults was not reported in the Bowden, Shores, and Mathias (2006) study—a methodologically relevant oversight that complicates the interpretation of their findings.

Second, the authors did not administer the WMT in the standard way, which involves multiple subtests, including Immediate Recognition (IR), Delayed Recognition (DR), Multiple Choice (MC), Paired Associate Recall (PA), Free Recall (FR), and Delayed Free Recall (LDFR). Instead, Bowden, Shores, and Mathias (2006) administered only the first subtest, IR. It is assumed that they used the oral version, which increases the risk of examiners inadvertently influencing the outcome of testing, as well as recording and scoring errors.

When Rohling and Demakis (2010) reanalyzed the raw data, they arrived at a different conclusion: the WMT did not behave like an ability test. The WMT-IR scores were uncorrelated with age and FSIQ, which supports the original designation of the WMT as a measure of performance validity, not cognitive ability. This finding was also consistent with other reports that WMT scores were independent of age and overall intellectual functioning (Lichtenstein, Erdodi, Rai, Mazur-Mosiewicz, & Flaro, 2018; Lichtenstein, Flaro, Baldwin, Rai, & Erdodi, 2019; Rohling & Demakis, 2010). In addition, the failure rate on the WMT-IR was higher (38%) in the mild TBI than in the moderate to severe TBI subsample (28.5%). If the WMT-IR subtest measured ability, the opposite would have been expected: a higher base rate of failure (BRFail) in more severe TBI. Failure on the WMT is usually defined as a score below the standard cutoff on IR, DR, or the consistency score (CNS), not just the IR.

Far more people fail the WMT using the standard criteria than the IR subtest alone. In a sample of 2162 compensation seeking adults 643 failed the WMT (29.7%), but only 454 failed the IR (21.0%; R. Gervais, personal communication, 2019). By deviating from the standard administration protocol, Bowden, Shores, and Mathias (2006) undermined the psychometric integrity of the test, probably resulting in an artificially low failure rate, which in turn may have skewed the findings. If the WMT had been administered in the standard way (i.e., IR, DR, and CNS) to adults only, the findings may have been different.

Other authors who have argued that failures on the WMT in patients with mild TBI were false positives include Greve, Ord, Curtis, Bianchini, and Brennan (2008). They reported a 30% FPR on the WMT in a mixed sample including some mild TBI cases. In that study, non-credible responding was psychometrically defined based on passing or failing a certain set of PVTs, mainly the criteria used by Meyers, Volbrecht, Axelrod, and Reinsch-Boothby (2011). Thus, if patients passed these criteria but failed the WMT, they were classified as false positives for the WMT. Notably, several of the validity cutoffs used to define invalid performance in this study were shown to be overly conservative (i.e., disproportionately sacrificing sensitivity for specificity) by subsequent research (Rai, 2019; Blaskewitz, Merten, & Brockhaus, 2009; Whiteside, Wald, & Busse, 2011).

More recently, Hall, Worthington, and Venables (2014) reported a failure rate of 18% in their acute mild TBI sample, and they equated this to the FPR on the WMT. Their argument was that these cases had passed a combination of other PVTs, including the TOMM (Fail defined as Trial 2 ≤ 44), Reliable Digit Span (Fail defined as ≤ 6), and the Processing Speed Index of the WAIS-III (Fail defined as ≤ 75), and had therefore demonstrated valid performance. They concluded that the mild TBI cases who failed the WMT were unable to pass because they had “verbal processing deficits.” Once again, these cutoffs have been shown to prioritize specificity at the expense of sensitivity (Erdodi et al., 2017a; Etherton, Bianchini, Heinly, & Greve, 2006; Jones, 2013; Kulas, Axelrod, & Rinaldi, 2014; Mathias et al., 2002; Reese, Suhr, & Riddle, 2012). As such, by virtue of selecting highly conservative cutoffs, the authors created a signal detection environment in which the WMT appears prone to high FPR by design.

Rationally and Empirically Based Criticisms of the Myth of High FPR on the WMT

These conclusions are problematic on both rational and methodological grounds. Greve, Ord, Curtis, Bianchini, and Brennan (2008) provided no argument to support the assumption that mild TBI could result in cognitive impairment sufficient to cause failure on the WMT. An equally plausible alternative explanation is that those who passed Meyers’ criteria but failed the WMT were false negatives for the Meyers’ criteria. In fact, this is quite likely, because Meyers’ criteria are based on ability tests like Digit Span, which are correlated with age and intelligence, unlike the WMT recognition trials (Lichtenstein, Erdodi, Rai, Mazur-Mosiewicz, & Flaro, 2018; Lichtenstein, Flaro, Baldwin, Rai, & Erdodi, 2019; Rohling & Demakis, 2010). To reduce false positives on Meyers’ criteria, the cutoffs were set low, resulting in high specificity but low sensitivity. Optimizing the signal detection model for specificity is certainly a legitimate strategy, but its methodological implications should be made explicit (i.e., that creating a highly conservative criterion measure artificially inflates FPR on the WMT).

Likewise, Hall, Worthington, and Venables (2014) failed to acknowledge that the WMT is more sensitive than the TOMM at standard cutoffs (Erdodi & Rai, 2017; Green, 2007; Green et al., 2004). Therefore, the discrepancies between PVTs may be better explained by the superior sensitivity of the WMT. More importantly, when failure on a PVT is observed, Slick et al. (1999) state that it is necessary to rule out the possibility that failure on the PVT was caused by a genuine neurological condition. To do so, we need evidence that the condition in question (in this case mild TBI) involves sufficient impairment to explain the PVT failure.

No study has ever shown that mild TBI can produce impairment which is severe enough to cause failure on easy tests like the WMT, the MSVT, or the TOMM. In fact, a meta-analysis by Rohling et al. (2011) found no measurable impairment on standard measures of cognitive ability 3 months after mild TBI. This is consistent with research by McCrea (2008), which showed no detectable cognitive impairment 12 h after mild TBI in athletes who were tested both pre and post mild TBI on various neuropsychological tests. Therefore, persistent cognitive impairment is not expected after an uncomplicated mild TBI.

Similarly, there is no empirical support in the literature that people with mild TBI have acquired “verbal processing deficits” of any type, and certainly none sufficiently severe to cause WMT failure. In credible examinees, WMT failure is only seen in patients with dementia (Green, Montijo, & Brockhaus, 2011) or severe dyslexia (Larochette & Harrison, 2012). Most relevant to the present argument, WMT failure is rarely seen in people with moderate to severe TBI (Carone, 2008; Erdodi & Rai, 2017; Green, Lees-Haley, & Allen, 2003). In fact, the cutoffs for the WMT recognition measures were chosen specifically to be three standard deviations below the mean from patients with moderate to severe brain injury, who were assumed to demonstrate valid performance (Allen & Green, 1999).

Not surprisingly, the standard cutoffs had .98 specificity in the moderate to severe TBI group. In addition, the cutoffs were almost three standard deviations below the mean from mixed neurological outpatients who were disabled from work by medically verified neurological disorders (brain tumor, stroke, ruptured aneurysm, or other brain diseases; Green & Allen, 1999). As a result, the WMT cutoffs had a specificity of .93 in this population. It defies logic to argue that a mild TBI can cause WMT failure, when such failure implies scoring far worse than average for people with genuine cognitive deficits acquired subsequent to a severe TBI, and far worse than most people with disabling neurological diseases.

If a substance causes cancer, more of that substance should create more cancer, according to the Bradford Hill criteria for medical causation (Hill, 1965). In other words, there should be a “dose-response relationship.” Similarly, if we wished to argue that TBI causes failure on the WMT, we would have to show a greater failure rate in severe TBI than in mild TBI. However, the empirical data show just the opposite. Green, Rohling, Lees-Haley, and Allen (2001) reported that there was a 34% WMT failure rate in cases of mild TBI but only an 18% failure rate in those with moderate to severe TBI. An even greater discrepancy in WMT failure rates was reported by Erdodi, Kirsch, Sabelli, and Abeare (2018) between patients with mild (46.1%) and moderate to severe (16.7%) TBI.

These findings extended to the MSVT and NV-MSVT as well as several other PVTs. In all cases but one (Warrington’s Recognition Memory Test – Words; Warrington, 1984), the failure rate was significantly higher in mild TBI than in severe TBI (Green & Merten, 2013). This reverse dose-response relationship argues beyond reasonable doubt that brain injury cannot explain WMT, MSVT, or NV-MSVT failures. Another source of convincing evidence comes from adults and children diagnosed with intellectual disability, who were nevertheless able to pass the WMT (Brockhaus & Merten, 2004; Green, Flaro, Brockhaus, & Montijo, 2012; Green & Flaro, 2016). Even patients with left hippocampectomy (Carone, Green, & Drane, 2013) or bilateral hippocampal damage have been reported to pass the WMT (Goodrich-Hunsaker & Hopkins, 2009). A person who fails the WMT is performing in the same range as patients with dementia (Green, Montijo, & Brockhaus, 2011). The literature on mild TBI is extensive and it does not support dementia level impairment in people with mild TBI.

Carone (2014) showed that a 9-year-old girl with intractable seizures, massive developmental absence of brain tissue from hydrocephalus, and FSIQ < 60 produced near-perfect scores on the WMT and MSVT recognition trials and thus, could easily pass these PVTs. Therefore, it would be highly unusual for an adult with only a mild TBI and credible responding to fail these PVTs. Previous research suggests that failure on the WMT cannot be attributed to depression (Rohling, Green, Allen, & Iverson, 2002; Williamson, Holsman, Chaytor, Miller, & Drane, 2012), PTSD (Demakis, Gervais, & Rohling, 2008), or fibromyalgia (Gervais et al., 2001; Suhr, 2003). Although some investigators found that patients with comorbid TBI and PTSD had an increased cumulative failure rate using multiple PVTs (Clark, Amick, Fortier, Milberg, & McGlinchey, 2014; Greiffenstein & Baker, 2008), the clinical interpretation of the role psychogenic factors play in PVT failure is an evolving debate (Henry et al., 2018). Regardless of underlying etiology, non-credible responding is the most parsimonious explanation of WMT failure in mild TBI (Delis & Wetter, 2007; Erdodi, Abeare et al., 2018).

PVT failure rate is especially high among individuals with mild TBI who are involved in compensation seeking. Mittenberg, Patton, Canyock, and Condit (2002) estimated from multiple sources that 39% of people with mild TBI produce invalid data in a compensation context. Larrabee’s (2009) survey concluded that PVT failures occurred in approximately 40% of cases involved with litigation or compensation, including people with mild TBI. The single largest group surveyed by Larrabee (2009) contained 719 cases taken from the practice of the second author of the present paper. Of this sample, 40% failed the WMT or the Computerized Assessment of Response Bias (CARB; Conder, Allen, & Cox, 1992). Larrabee, Millis, and Meyers (2009) used the term “the magic number of 40% plus or minus 10” to describe the incidence of PVT failure among individuals evaluated in the context of external incentives to appear impaired.

Comparison to Other Free-Standing PVTs

The MSVT was deliberately designed to be even easier than the WMT, and has been shown to be easily passed by children with severe TBI (Carone, Green, & Drane, 2013) as well as by adults with acute severe TBI, once they emerge from post-traumatic amnesia (Macciocchi, Seel, Yi, & Small, 2017). As long as they had grade 3 reading levels, all children with intellectual disability passed the MSVT (Green & Flaro, 2015). In fact, the MSVT is so easy that when children who spoke no French were tested in French, they scored a mean of 98% correct on the recognition subtests, which was the same as fluent French speaking adults (Richman et al., 2006). This finding demonstrates that, unlike other free-standing PVTs based on forced choice word recognition like the Word Choice Test (Pearson, 2009), the MSVT is unaffected by limited English proficiency (Erdodi, Nussbaum, Sagar, Abeare, & Schwartz, 2017).

In a large sample of adults with mild TBI, Green, Flaro, and Courtney (2009) used performance on the MSVT to evaluate the possibility that WMT failures were false positives. Those who failed the WMT were, in most cases, also administered the even easier MSVT. If true deficits accounted for the WMT failure, one would expect a lower failure rate on the easier test, the MSVT. However, this was not the case. The majority of patients who failed the WMT also failed the MSVT. In fact, their group average on the MSVT was significantly lower than that of a group of children with intellectual disabilities. There is nothing in the mild TBI literature to suggest that such severe intellectual impairment can be acquired as a result of mild TBI. In addition, the mild TBI cases who failed the WMT also scored significantly lower on the MSVT than a group of children with developmental disability, who had been specifically selected for having impaired verbal memory. The literature on mild TBI in adults does not provide any support for the link between mild TBI and impaired verbal memory. Thus, it was concluded that patients with mild TBI who failed the WMT were true positives.

The Limits of Legitimate Exemptions from PVT Failures

Admittedly, some patients do suffer from cognitive impairment severe enough to cause failure on PVTs. For example, 27% of patients with dementia described on pages 43 to 45 of the TOMM test manual failed the TOMM Trial 2 (Tombaugh, 1996). Similarly, people with dementia, mainly of an Alzheimer’s type, failed the WMT or the MSVT (Green, Montijo, & Brockhaus, 2011). It appears that, whereas most forms of brain disease do not cause PVT failure, some credible examinees will fail PVTs if their impairment is severe enough, such as Alzheimer’s disease.

The problem with using one PVT or a combination of PVTs to validate another PVT as in the studies of Greve, Ord, Curtis, Bianchini, and Brennan (2008) and Hall, Worthington, and Venables (2014) is that it introduces circular reasoning (Bigler, 2015). For example, in a series of 1500 patients tested by R. Gervais (personal communication, 2016), the TOMM and Reliable Digit Span (RDS) were used to define invalid test data. Only 6% of compensation seeking cases produced invalid results (i.e., failed both the TOMM and the RDS). Yet in the same group, 30% of cases failed the WMT. In addition, the WMT was not administered if examinees had dyslexia or less than a grade 3 reading level to further protect against an elevated FPR.

One possible conclusion is that the WMT produced a FPR of 24%, consisting mainly of patients with orthopedic injuries, depression, anxiety, PTSD, and dealing with situational stressors. However, it would be hard to argue that an adult with normal premorbid cognitive abilities was unable to pass the TOMM, the WMT or the MSVT due to a mild TBI. To maintain that the high failure rate in the WMT reflects elevated FPR, one would have to adopt the indefensible argument that these examinees assessed in outpatient setting failed the WMT because of genuine and severe cognitive impairment equivalent to that seen in some dementias. No other neurological condition has been shown to reliably elevate failure rates on the WMT.
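The 24% figure above is simply the gap between the two failure rates. The sketch below spells out that arithmetic for the series of 1500 patients and shows how the apparent FPR depends on the criterion. The case counts are derived here from the reported percentages, and the overlap between the two failing groups is not reported, so the code assumes (hypothetically) that every examinee who failed both the TOMM and the RDS also failed the WMT.

```python
n_total = 1500

# Counts derived from the reported percentages (not given in the source).
n_fail_criterion = round(0.06 * n_total)  # failed both TOMM and RDS
n_fail_wmt = round(0.30 * n_total)        # failed the WMT

# ASSUMPTION: all criterion failures are also WMT failures.
n_fail_both = n_fail_criterion

# WMT failures that the TOMM + RDS criterion labels as "valid".
n_wmt_only = n_fail_wmt - n_fail_both

# Expressed as a share of the whole sample (the 24% quoted in the text).
excess_share_of_sample = n_wmt_only / n_total

# Expressed as a conventional FPR: WMT failures among criterion-pass cases.
n_pass_criterion = n_total - n_fail_criterion
apparent_fpr = n_wmt_only / n_pass_criterion

print(f"WMT-only failures: {n_wmt_only}")
print(f"Share of sample: {excess_share_of_sample:.1%}")
print(f"Apparent FPR among criterion-pass cases: {apparent_fpr:.1%}")
```

Either way, the size of the apparent FPR is driven by how few cases the criterion flags: the less sensitive the criterion, the more WMT failures are counted as false positives by definition.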

The Current Study: Rationale, Design, and Hypothesis

The above summary suggests that the WMT has high specificity in both children and adults with genuine cognitive impairment, with the notable exceptions of dementia and severe dyslexia. In the current study, we further explored the issue of FPR on the WMT in adults with mild TBI. We selected a large case series of patients with mild TBI who were administered the WMT in conjunction with other PVTs, both stand-alone and embedded. We hypothesized that the actual FPR on the WMT would be low (i.e., ≤ 10%). In order to deem a WMT failure a false positive, it must occur in the context of psychometric evidence of credible overall responding. In other words, the examinee must fail no more than one other PVT. In contrast, if an examinee failed ≥ 2 independent PVTs in addition to the WMT, the failure on the WMT was considered a true positive, following established clinical and forensic guidelines (Boone, 2013; Larrabee, 2014). To circumvent some of the methodological flaws of previous studies, traditional free-standing PVTs were complemented by embedded validity indicators aggregated into a single validity composite, in order to harness the advantage of multivariate models of performance validity assessment (An, Kaploun, Erdodi, & Abeare, 2017; Abeare et al., 2018; Cottingham, Victor, Boone, Ziegler, & Zeller, 2014; Erdodi et al., 2017a; Pearson, 2009; Tyson et al., 2018).

Method

Participants

Archival data were collected from a consecutive case sequence of 170 patients with mild TBI referred for neuropsychological assessment in a private practice setting in the context of a compensation or disability claim. The majority were male (67.7%) and right-handed (87.4%). Mean age was 42.5 years (SD = 11.4), while mean level of education was 12.2 years (SD = 2.4). The most common reasons and sources for referral were the Workman’s Compensation Board (42.5%), followed by personal injury lawyers (22.8%), independent medical examinations (15.6%), and clinical referrals (10.8%). Inclusion criteria were a claimed history of mild TBI (Glasgow Coma Scale [GCS] ≥ 13, loss of consciousness < 30 min, post-traumatic amnesia < 1 h), being in the post-acute stage of recovery (i.e., assessed > 3 months after the mild TBI), and a complete administration of the WMT and MSVT (Rai & Erdodi, 2019).

Materials

A core battery of neuropsychological tests assessing attention, memory, processing speed, and executive functioning was administered to all patients, in addition to the two free-standing PVTs. Given the similarity between the WMT and MSVT (both free-standing computerized tests with recognition subtests based on the forced choice recognition paradigm) and previous research suggesting that shared features between the predictor and criterion variable can influence the outcome of classification accuracy analyses (Erdodi, 2019), two additional validity composites were developed by combining data from multiple embedded PVTs. The purpose of creating these aggregate measures of performance validity was to avoid the perception of bias in the choice of criterion PVT (i.e., evaluating the WMT against another PVT that is very similar in order to artificially inflate the concordance rate) (Rai & Erdodi, 2019).

The first composite was labeled “Validity Index Seven” (VI-7). As the name suggests, the VI-7 was built from seven independent embedded PVTs, some of which contained multiple indicators (Table 1). The VI-7 represents the traditional approach to multivariate models of performance validity assessment. Its value reflects the number of failures on the component PVTs. As such, it can range from zero (no failure) to seven (failure on all constituent PVTs). Each additional unit increase represents an increased likelihood that the overall response set is invalid (Table 2). However, for the purpose of classification accuracy analyses, an overall Pass on the VI-7 was defined as ≤ 1 (at most one PVT failure), and Fail was defined as ≥ 2 (at least two PVT failures), following commonly accepted forensic standards for embedded measures (Boone, 2013; Larrabee, 2014).
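The scoring rule for the VI-7 described above can be sketched in a few lines. The component cutoffs live in Table 1 and are not reproduced here, so the function below takes seven Pass/Fail outcomes as input; the function name and input format are illustrative assumptions.

```python
from typing import Sequence, Tuple


def score_vi7(component_failed: Sequence[bool]) -> Tuple[int, str]:
    """Return the VI-7 value (number of failed components) and its label.

    component_failed holds one True/False flag per embedded PVT,
    where True means the examinee failed that component.
    """
    if len(component_failed) != 7:
        raise ValueError("The VI-7 requires exactly seven component outcomes.")
    total = sum(1 for failed in component_failed if failed)
    # Pass is defined as at most one failure; Fail as two or more.
    label = "Pass" if total <= 1 else "Fail"
    return total, label


# Hypothetical examinee who failed two of the seven components.
example = [False, True, False, False, True, False, False]
print(score_vi7(example))
```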

Table 1 The components of the VI-7, cutoffs, and base rates of failure
Table 2 The distribution of scores for the VI-7 and clinical classification ranges

The second composite was labeled “Erdodi Index Five” (EI-5). The EI-5 represents a novel approach to aggregating PVTs designed to recapture the underlying continuity in performance validity (Erdodi et al., 2017a) by simultaneously measuring both the number and extent of PVT failures (Erdodi, Hurtubise et al., 2018). Instead of dichotomizing (Pass/Fail) its components, the EI-5 model recodes each constituent PVT onto a four-point ordinal scale, where zero means an unequivocal Pass, and three means an unequivocal Fail, with two intermediate levels of failure (Table 3). An EI-5 value of one is defined by the most liberal cutoff available in the literature. As such, it is optimized for sensitivity. Conversely, an EI-5 value of three is defined by the most conservative cutoff available in the literature or, in the absence of a natural threshold, by the bottom 5% of the distribution on an embedded PVT nested in a measure of cognitive ability. An EI-5 value of two represents an intermediate level of failure, and is therefore associated with a cutoff more conservative than an EI-5 value of one, but more liberal than an EI-5 value of three. In the absence of a natural demarcation line, it is defined as “the next 5% of worst performance” following the bottom 5% of the distribution. As such, its upper limit is roughly defined by the 10th percentile, while its lower limit is defined by the 5th percentile (Erdodi, 2019).

Table 3 Individual components of the EI-5 and base rates of failure at given cutoffs

The value of the EI-5 is obtained by summing its recoded components. Therefore, it can range from zero (the examinee passed the most liberal cutoff on all five components) to 15 (the examinee failed the most conservative cutoff on all five components). An EI-5 ≤ 1 reflects at most one marginal failure, and is consequently labeled an overall Pass. The next two levels (2–3) are problematic, as they could represent either several independent failures at the most liberal cutoff or a single failure at a more conservative cutoff. Although in previous studies, this level of performance has been associated with significantly stronger independent evidence of invalid performance compared to the Pass range (Erdodi, 2019; Erdodi, Hurtubise et al., 2018; Erdodi, Seke et al., 2017), it does not meet the strict definition of global non-credible responding, and it has a relatively high multivariate base rate (Pearson, 2009). As such, this range was labeled Borderline, and excluded from analyses that require a dichotomous (Pass/Fail) criterion measure.

However, an EI-5 ≥ 4 crosses the “red line”: it represents either a minimum of four independent failures at the most liberal cutoff, two at the more conservative cutoff, or one at the most liberal and one at the most conservative. Regardless of the specific configuration, this combination of PVT failures meets the commonly accepted definition of invalid performance and has a low multivariate base rate (Pearson, 2009). Therefore, this range of EI-5 was labeled a Fail, with increasing confidence in the classification with each unit increase in the EI-5 value (Erdodi, Dunn et al., 2018), as shown in Table 4.
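The aggregation and classification rules for the EI-5 can be illustrated with a minimal sketch. It assumes each of the five components has already been recoded to a level from 0 to 3 using the cutoffs in Table 3 (not reproduced here); the function name and input format are hypothetical.

```python
from typing import Sequence, Tuple


def score_ei5(levels: Sequence[int]) -> Tuple[int, str]:
    """Sum five recoded components (each 0-3) and classify the total."""
    if len(levels) != 5:
        raise ValueError("The EI-5 requires exactly five recoded components.")
    if any(level not in (0, 1, 2, 3) for level in levels):
        raise ValueError("Each component must be recoded to 0, 1, 2, or 3.")
    total = sum(levels)
    if total <= 1:
        label = "Pass"        # at most one marginal failure
    elif total <= 3:
        label = "Borderline"  # excluded from dichotomous analyses
    else:
        label = "Fail"        # meets the >= 4 threshold
    return total, label


# The configurations named in the text all reach the Fail threshold.
print(score_ei5([1, 1, 1, 1, 0]))  # four failures at the most liberal cutoff
print(score_ei5([2, 2, 0, 0, 0]))  # two failures at the intermediate cutoff
print(score_ei5([1, 3, 0, 0, 0]))  # one liberal plus one conservative failure
```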

Table 4 Frequency, cumulative frequency and classification range for the first ten levels of the EI-5

To illustrate the shift in signal detection profile as a function of cutoffs, classification accuracy was computed against the MSVT across a range of cutoffs. When failure on the EI-5 was defined as ≥ 1, specificity was unacceptably low (.50). Increasing the cutoff to ≥ 2 (Pass ≤ 1) only slightly improved specificity (.63). Further increasing the cutoff to ≥ 3 produced steady, but still insufficient improvement in specificity (.77). However, the standard ≥ 4 cutoff resulted in a good combination of sensitivity (.57) and specificity (.93). It also proved to be the point of diminishing returns: making the cutoff any more conservative produced marginal improvement in specificity while disproportionately sacrificing sensitivity.
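The trade-off described above can be demonstrated with a small sketch that recomputes sensitivity and specificity as the cutoff moves. The scores and criterion outcomes below are made-up toy data, not the study sample, so the printed values will not match the figures reported in the text; only the direction of the trade-off is the point.

```python
from typing import Sequence, Tuple


def sens_spec(scores: Sequence[int], criterion_fail: Sequence[bool],
              cutoff: int) -> Tuple[float, float]:
    """Sensitivity and specificity when Fail is defined as score >= cutoff."""
    pairs = list(zip(scores, criterion_fail))
    tp = sum(1 for s, c in pairs if c and s >= cutoff)
    fn = sum(1 for s, c in pairs if c and s < cutoff)
    tn = sum(1 for s, c in pairs if not c and s < cutoff)
    fp = sum(1 for s, c in pairs if not c and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# TOY DATA: ten criterion-pass cases followed by eight criterion-fail cases.
scores = [0, 0, 0, 1, 1, 2, 3, 0, 1, 5, 0, 2, 4, 5, 6, 7, 1, 4]
criterion_fail = [False] * 10 + [True] * 8

for cutoff in range(1, 6):
    sens, spec = sens_spec(scores, criterion_fail, cutoff)
    print(f"Fail >= {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```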

Given that the EI-5 is a relatively new method of aggregating independent PVTs, its classification was first validated against established free-standing instruments. The EI-5 was a significant predictor of passing or failing the MSVT: AUC = .75 (95% CI .66–.82), with .57 sensitivity and .93 specificity. The EI-5 produced a similar signal detection profile against the Non-Verbal MSVT: AUC = .71 (95% CI .59–.83), .41 sensitivity, and .87 specificity.
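For readers less familiar with AUC, one way to understand it is as the probability that a randomly chosen criterion-fail case has a higher composite score than a randomly chosen criterion-pass case, with ties counted as one half. The sketch below implements that rank-based definition on hypothetical scores; it is an illustration of the concept, not a reproduction of the SPSS output reported above.

```python
from typing import Sequence


def auc(fail_scores: Sequence[float], pass_scores: Sequence[float]) -> float:
    """Rank-based AUC: P(fail score > pass score) + 0.5 * P(tie)."""
    wins = 0.0
    for f in fail_scores:
        for p in pass_scores:
            if f > p:
                wins += 1.0
            elif f == p:
                wins += 0.5
    return wins / (len(fail_scores) * len(pass_scores))


# HYPOTHETICAL composite scores for criterion-fail and criterion-pass cases.
fail_group = [0, 2, 4, 5, 6, 7, 1, 4]
pass_group = [0, 0, 0, 1, 1, 2, 3, 0, 1, 5]
print(f"AUC = {auc(fail_group, pass_group):.2f}")
```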

Procedure

Psychometric testing was administered and scored by trained technicians in an outpatient setting. A licensed clinical neuropsychologist performed the clinical interview, reviewed the available records, prepared an integrative report, and rendered the final diagnosis. Patients gave written consent for anonymized retrospective analysis of their test results in group data. Only de-identified data were captured for research purposes to protect patient confidentiality. APA ethical guidelines regulating research involving human participants were followed throughout the process.

Data Analysis

Basic descriptive statistics (M, SD, frequency distribution, BRFail) were reported where relevant. The main inferential statistics were independent t tests for continuous variables and the χ² test of independence or two-proportion z tests for categorical variables. The statistical significance of the difference between SDs was evaluated using Levene’s test of homogeneity of variances. Effect size estimates were given in Cohen’s d and Cramer’s ϕ². Overall classification accuracy (AUC) and the associated 95% CI were calculated using SPSS 23.0. Sensitivity and specificity were computed using standard formulae (Grimes & Schultz, 2005).
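As a reminder of how the effect size for group contrasts is obtained, the sketch below computes Cohen's d from summary statistics using the pooled standard deviation. The source does not state which variant of d was used, so the pooled-SD form is an assumption, and the input values are hypothetical.

```python
import math


def cohens_d(mean_a: float, sd_a: float, n_a: int,
             mean_b: float, sd_b: float, n_b: int) -> float:
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# HYPOTHETICAL summary statistics for a Pass group and a Fail group.
d = cohens_d(mean_a=10.0, sd_a=2.0, n_a=20, mean_b=8.0, sd_b=2.0, n_b=20)
print(f"Cohen's d = {d:.2f}")
```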

Results

Base Rates of Failure Across PVTs

The highest failure rate was observed on the WMT (44.7%), followed by the VI-7 (41.8%), the MSVT (39.4%), and the EI-5 (18.2%). The difference in failure rate between the WMT and the VI-7 or the MSVT was not statistically significant (p = .584 and .322, respectively). However, patients were significantly (p < .001) more likely to fail the WMT compared to the EI-5 (risk ratio 2.46).
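The risk ratio and the WMT versus EI-5 contrast can be checked with a short sketch. The failure counts are reconstructed from the reported percentages and N = 170, which is an assumption (the source reports only percentages), and the test treats the two proportions as independent, as a two-proportion z test does.

```python
import math
from typing import Tuple


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z test; returns (z, two-tailed p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Risk ratio from the reported failure rates.
risk_ratio = 44.7 / 18.2

# ASSUMED counts: 44.7% and 18.2% of 170, rounded to whole patients.
n = 170
wmt_fail = round(0.447 * n)
ei5_fail = round(0.182 * n)
z, p = two_proportion_z(wmt_fail, n, ei5_fail, n)

print(f"Risk ratio = {risk_ratio:.2f}")
print(f"z = {z:.2f}, p < .001: {p < 0.001}")
```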

The Rate of Agreement Between WMT and Criterion PVTs

The dichotomous outcome (Pass/Fail) of the WMT and that of the MSVT were strongly related: χ²(1) = 78.4, p < .001, ϕ² = .46 (very large effect). The two tests agreed on classifying 84.1% of the sample. The relationship with the VI-7 was weaker, but still highly significant: χ²(1) = 40.2, p < .001, ϕ² = .24 (very large effect). The two tests agreed on classifying 74.7% of the sample. Finally, there was a large proportion of shared variance with the EI-5: χ²(1) = 30.1, p < .001, ϕ² = .26 (very large effect). The two tests agreed on classifying 75.9% of the sample.
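The relationship among the chi-square statistic, the effect size, and the agreement rate can be made concrete with a 2 × 2 table. The cell counts below for the WMT by MSVT comparison are not reported in the source; they are reconstructed here from N = 170, the two failure rates, and the 84.1% agreement rate, and should be read as an assumption. The sketch uses the uncorrected Pearson chi-square.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for a 2 x 2 table."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# RECONSTRUCTED cells (assumption): rows = WMT Fail/Pass, columns = MSVT Fail/Pass.
both_fail, wmt_fail_only = 58, 18
msvt_fail_only, both_pass = 9, 85
n_total = both_fail + wmt_fail_only + msvt_fail_only + both_pass

chi2 = chi_square_2x2(both_fail, wmt_fail_only, msvt_fail_only, both_pass)
phi_squared = chi2 / n_total                 # effect size
agreement = (both_fail + both_pass) / n_total  # same classification on both tests

print(f"chi-square(1) = {chi2:.1f}")
print(f"phi-squared = {phi_squared:.2f}")
print(f"agreement = {agreement:.1%}")
```

The same identity, ϕ² = χ²/N, links the other coefficients reported above to their chi-square values, provided the N used in each analysis is known.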

Clinical Characteristics and Neurocognitive Profiles Associated with Failing the WMT

Patients who failed the WMT were significantly older and reported higher levels of depression (medium effect) than those who passed the WMT. No difference was found on gender, education, GCS, and lateral dominance (Table 5). Failing the WMT was associated with significantly lower performance on measures of auditory attention, learning and memory (large effects), concept formation (large effects), visuomotor processing speed (large effects), and manual dexterity (large effects). In addition, the patients who failed the WMT produced significantly more variable scores on a test of concept formation and a derivative index of executive deficits.

Table 5 Comparing patients who passed and those who failed the Word Memory Test on demographic variables and performance on neuropsychological tests

Clinical Characteristics and Neurocognitive Profiles Associated with Failing the VI-7

Patients who failed the VI-7 were significantly older and reported higher levels of depression (medium effect). No difference was found on gender, education, GCS, lateral dominance, and manual dexterity (Table 6). Failing the VI-7 was associated with significantly lower performance on measures of auditory attention, learning and memory (large effects), concept formation (large effects), and visuomotor processing speed (large effects). As observed with the WMT, patients who failed the VI-7 produced significantly more variable scores on a test of concept formation and a derivative index of executive deficits.

Table 6 Comparing patients who passed and those who failed the VI-7 on demographic variables and performance on neuropsychological tests

Clinical Characteristics and Neurocognitive Profiles Associated with Failing the EI-5

Patients who failed the EI-5 reported significantly higher levels of depression (medium effect). No difference was found on age, gender, education, GCS, and lateral dominance (Table 7). Failing the EI-5 was associated with significantly lower performance on measures of auditory attention (medium effect), auditory learning and memory (large effects), concept formation (large effects), visuomotor processing speed (large effects), and manual dexterity (large effect). Patients who failed the EI-5 produced significantly more variable scores during the delayed free recall trial of an auditory verbal learning test and a measure of visual attention, sequencing, and motor speed.

Table 7 Comparing patients who passed and those who failed the EI-5 on demographic variables and performance on neuropsychological tests

Examining the False-Positive Rate on the WMT

Of the 76 patients who failed the WMT (44.7% of the sample), 72 also failed either another free-standing PVT or at least two embedded PVTs. In other words, 94.7% of patients identified as non-credible by the WMT also had independent evidence of invalid performance. This is equivalent to the overall specificity of the WMT relative to the other PVTs employed. Conversely, the false-positive rate on the WMT (i.e., the proportion of patients who failed the WMT but had no other psychometric evidence of non-credible responding) was 5.3%. The vast majority of patients (92.1%) who failed the WMT also failed ≥ 2 other PVTs, 71.1% failed ≥ 3, and 51.3% failed ≥ 4.
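The arithmetic behind this ipsative analysis can be sketched as follows. The decision rule is written as a hypothetical helper function that mirrors the criterion described above (failure on another free-standing PVT or on at least two embedded PVTs); its name and arguments are illustrative, and only the two counts (76 and 72) come from the text.

```python
def has_independent_evidence(other_free_standing_fails, embedded_fails):
    """Hypothetical helper: True if a WMT failure is corroborated by
    another free-standing PVT failure or by >= 2 embedded PVT failures."""
    return other_free_standing_fails >= 1 or embedded_fails >= 2


wmt_failures = 76  # patients who failed the WMT
corroborated = 72  # of those, patients meeting the rule above

corroborated_pct = 100 * corroborated / wmt_failures
false_positive_pct = 100 * (wmt_failures - corroborated) / wmt_failures

print(f"corroborated WMT failures: {corroborated_pct:.1f}%")
print(f"uncorroborated WMT failures: {false_positive_pct:.1f}%")
```

Note that both percentages use the 76 WMT failures as the denominator, not the full sample.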

Discussion

This archival study empirically evaluated earlier claims that the WMT produces an unacceptably high FPR in patients with mTBI (Bowden, Shores, & Mathias, 2006; Hall, Worthington, & Venables, 2014; Greve, Ord, Curtis, Bianchini, & Brennan, 2008). The investigation was extended to four levels of analysis. First, the failure rate on the WMT was compared to that observed on other PVTs. Second, the performance of patients who passed versus those who failed the WMT was compared across commonly used neuropsychological tests measuring core cognitive domains (attention and processing speed, memory, concept formation, mental flexibility, visuomotor scanning, and manual dexterity). Neurocognitive profiles, demographic variables, and injury parameters associated with valid vs. invalid performance on the WMT were then compared to those obtained using different criterion PVTs to examine the effect of the changing psychometric definitions of non-credible responding. Third, the concordance rate between the Pass/Fail outcome on the WMT and the other criterion PVTs was computed. Finally, an ipsative analysis was performed on the subset of patients who failed the WMT to determine the proportion of these cases with independent evidence of invalid performance.

Indeed, the highest failure rate (44.7%) was observed on the WMT, followed by the VI-7, a multivariate composite of embedded validity indicators (41.8%), the MSVT (39.4%) and the EI-5, a novel approach to aggregating multiple PVTs (18.2%). The lower failure rate on the EI-5 can be attributed to the conservative approach guiding its design: borderline range scores (2 and 3) are excluded from analyses that require a dichotomous outcome (Erdodi, 2019), even though a performance in this range was equally likely to indicate credible and non-credible responding based on the present findings (Erdodi, Hurtubise et al., 2018). Moreover, previous research suggests that EI-5 scores in the borderline range are more likely to reflect invalid than valid performance (Erdodi et al., 2018a; Erdodi et al., 2017b). As such, the EI-5 tends to underestimate the base rate of failure by design. Although the failure rate on the WMT within this sample was high, it is comparable to previous estimates of failure rates on PVTs in this population tested in both clinical (Abeare, Messa, Zuccato, Merker, & Erdodi, 2018; Erdodi, Kirsch, Sabelli, & Abeare, 2018; Erdodi & Roth, 2017) and forensic settings (Mittenberg, Patton, Canyock, & Condit, 2002; Larrabee, Millis, & Meyers, 2009).

Consistent with previous studies, failing the WMT was associated with being older and with higher self-reported level of depression (Erdodi, Abeare et al., 2018; Erdodi, Seke et al., 2017), but unrelated to sex, education, lateral dominance, and GCS. Patients who passed the WMT produced mean scores in the average range on ability tests, in line with the cumulative evidence that a full recovery of cognitive functioning is the normative outcome 3 months after a mild TBI (Rohling et al., 2011; McCrea et al., 2008).

In contrast, those who failed the WMT had mean scores on neuropsychological tests in the borderline range. These contrasts were associated with effect sizes (d = .64–1.56) that would be considered large even for experimental malingering paradigms (Rapport, Farchione, Coleman, & Axelrod, 1998; Rogers, Sewell, Martin, & Vitacco, 2003; Suhr & Boyer, 1999). Moreover, failing the WMT was associated with increased within-group variability on a test of concept formation and a derivative index of executive deficits, a phenomenon considered to be an emergent sign of non-credible responding, probably owing to the heterogeneity in malingering strategy across individuals (Cottingham, Victor, Boone, Ziegler, & Zeller, 2014; Erdodi, Kirsch, et al., 2014; Erdodi, Pelletier, & Roth, 2018). This neurocognitive profile was replicated with remarkable consistency using the VI-7 and EI-5 as criterion PVTs.

The WMT produced a high rate of agreement (74.7–84.1%) with other criterion PVTs, providing objective empirical evidence against claims that the failure rate on the WMT is artificially inflated by false-positive errors. Finally, the strongest evidence comes from a careful examination of the neurocognitive profile of patients who failed the WMT: 94.7% of them had independent psychometric evidence of non-credible responding. Therefore, they are considered to be true positives, bringing the actual FPR down to 5.3%, which is considered low by most standards (Boone, 2013; Donders & Strong, 2011; Larrabee, 2003).

Inevitably, the study also has a number of limitations. First, it did not examine FPR in any diagnosis apart from mild TBI. However, limiting the investigation to mild TBI was necessary to ensure high internal validity, and to produce findings that would likely generalize to one of the most contentious categories of patients in terms of PVT profiles. This clinical population was chosen partly because it lends itself well to PVT studies due to its propensity to PVT failures. It is a condition which has been extensively studied, leading to the conclusion that mild TBI does not produce permanent measurable neuropsychological impairment in credible examinees (Rohling et al., 2011; McCrea et al., 2008).

Thus, failure on a PVT designed to be very easy cannot be attributed to genuine cognitive impairment. In contrast, FPRs in individuals with severe cognitive impairment are more difficult to estimate. Although empirically supported algorithms to aid the clinical interpretation of PVT failures in patients with more severe cognitive impairment have been developed (based on profile analysis of tests like the WMT, MSVT, and NV-MSVT) and their clinical utility continues to be replicated by independent investigators (Alverson, O’Rourke, & Soble, 2019; Elbaum, Golan, Lupu, Wagner, & Braw, 2019; Lupu, Elbaum, Wagner, & Braw, 2018; Shelley-Tremblay, Eyer, & Hill, 2019; Tomer, Lupu, Golan, Wagner, & Braw, 2018), a detailed discussion of this built-in safeguard against false-positive errors within the Green family of PVTs is beyond the scope of this paper.

One by-product of the high sensitivity of the WMT, as seen in multiple studies based on experimental malingerers, is that more people will fail the WMT than less sensitive PVTs. Although it is a natural tendency to ask whether the higher failure rate on the WMT versus other PVTs might (at least partly) be due to an increased FPR on the WMT, this may reflect a bias inherent in the assessor, not the test (Erdodi & Lichtenstein, 2017). The evidence from the current study does not support this interpretation. On the contrary, the results add to the impression from many studies summarized in the introduction that individuals with mild TBI who fail the WMT are unlikely to be false positives.