Introduction

The validity of the clinical interpretation of neuropsychological profiles rests on the assumption that examinees were able and willing to demonstrate their true ability level during the testing (Bigler, 2015; Lezak, Howieson, Bigler, & Tranel, 2012). The limitation of clinical intuition in detecting noncredible responding has long been evident (Heaton, Smith, Lehman, & Vogt, 1978). Therefore, a consensus emerged among professional organizations that a systematic and empirical evaluation of performance validity during neuropsychological assessment is necessary (Bush et al., 2005; Bush, Heilbronner, & Ruff, 2014; Chafetz et al., 2015; Heilbronner et al., 2009).

The traditional gold standard measures were free-standing performance validity tests (PVTs) designed specifically to evaluate the credibility of a given response set. These instruments are robust and, by design, optimized to differentiate genuine impairment from invalid performance. However, they often require multiple learning trials or time delays and provide little to no information on cognitive ability, which remains the ultimate goal of a clinical evaluation. In contrast, embedded validity indicators (EVIs) are nested within established tests of neuropsychological functioning and were subsequently co-opted as PVTs.

Given that EVIs simultaneously measure cognitive ability and performance validity, they have the potential to address several of the current challenges faced by practicing neuropsychologists (Boone, 2013). Because they abbreviate psychometric testing without sacrificing data on either ability or effort, EVIs allow clinicians to provide a comprehensive evaluation of core constructs despite growing systemic pressures to optimize assessment practices for cost-effectiveness. Shorter testing also reduces demand on the examinee’s mental stamina, which is especially important in certain vulnerable populations such as young children and patients with complex medical and/or psychiatric conditions (Lichtenstein, Erdodi, & Linnea, 2017). Limiting exposure to free-standing PVTs could also help preserve the long-term integrity of these instruments, as repeated exposure to PVTs has been reported to compromise their signal detection performance (Boone, 2013). On the other hand, the practice of using the same instrument to measure both cognitive ability and performance validity has raised concerns about the inevitable confluence of these conceptually distinct constructs, which PVTs are meant to differentiate (Bigler, 2012, 2015).

Research on EVIs has increased exponentially in recent years. Today, EVIs cover a broad range of cognitive domains: attention (Ashendorf, Clark, & Sugarman, 2017; Reese, Suhr, & Riddle, 2012; Trueblood, 1994), memory (Bortnik et al., 2010; Moore & Donders, 2004; Pearson, 2009; Shura, Miskey, Rowland, Yoash-Gatz, & Denning, 2016), processing speed (Erdodi et al., 2017a; Etherton, Bianchini, Heinly, & Greve, 2006; Kim et al., 2010b; Shura et al., 2016; Sugarman & Axelrod, 2015; Trueblood, 1994), language (Erdodi, 2017; Whiteside et al., 2015), executive functions (Ashendorf, Clark, & Sugarman, 2017; Shura et al., 2016; Suhr & Boyer, 1999), vigilance (Erdodi, Roth, Kirsch, Lajiness-O’Neill, & Medoff, 2014b; Lange et al., 2013; Ord, Boettcher, Greve, & Bianchini, 2010; Shura et al., 2016), and visuospatial/perceptual abilities (Lu, Boone, Cozolino, & Mitchell, 2003; Reedy et al., 2013; Shura et al., 2016; Sussman, Peterson, Connery, Baker, & Kirkwood, 2017). In contrast, measures of motor speed have been underrepresented in this otherwise growing trend. Although the Finger Tapping Test has long-established validity cutoffs (Arnold et al., 2005; Axelrod, Meyers, & Davis, 2014; Greiffenstein, Baker, & Gola, 1996; Larrabee, 2003), and the Grooved Pegboard Test (GPB) has a growing evidence base supporting its potential as an EVI (Arnold & Boone, 2007), the majority of the extant research only reported the effect of invalid performance on GPB scores as a continuous variable. The GPB (Lafayette Instrument, 2015) is a measure of manual dexterity designed to assess fine motor speed. The examinee is instructed to place grooved pegs into slotted holes angled in various directions, one at a time, as quickly as possible. Physically, the test consists of a small board with five rows of five holes each. The most commonly used score is completion time.

The GPB as an EVI has a long presence in the research literature. Examinees with noncredible responding produced consistently lower mean scores when compared to credible controls, with effect size ranging from medium (d = 0.50; Inman & Berry, 2002) to very large (d = 1.21; Rapport, Farchione, Coleman, & Axelrod, 1998) for the dominant hand and from small (d = 0.21—nonsignificant; Inman & Berry, 2002) to large (d = 1.03; Rapport et al., 1998) for the nondominant hand. The largest effect (d = 1.33) was reported by Binder and Willis (1991) using the combined raw score from both hands. All studies found a larger effect in the dominant hand (Johnson & Lesniak-Karpiak, 1997; van Gorp et al., 1999) relative to the nondominant hand. As a reference, in experimental malingering designs, Cohen’s d values of 0.75 are considered moderate, while Cohen’s d values of 1.25 are considered large (Rogers, Sewell, Martin, & Vitacco, 2003). Based on this classification system, the GPB shows promise as a PVT.

While the research reviewed above was important to establish the sensitivity of the GPB to noncredible responding in general, the practical demands of clinical neuropsychology require specific thresholds with known classification accuracy. The first study to publish validity cutoffs for the GPB was Erdodi et al. (2017d). Their mixed clinical sample consisted of 190 patients medically referred for neuropsychological assessment in the northeast USA. A demographically adjusted T score ≤ 29 for either hand was specific (0.85–0.90) to psychometrically defined invalid responding, with variable sensitivity (0.33–0.66). If Fail was defined as T ≤ 31 on both hands, specificity improved slightly (0.86–0.91). In addition, failing GPB validity cutoffs was associated with higher levels of self-reported symptoms on the Beck Depression Inventory—Second Edition (BDI-II) and several scales of the Personality Assessment Inventory (PAI; Depression, Somatic Concerns, Borderline and Antisocial Features, Alcohol and Drug Problems). Those who passed the dominant hand cutoff reported mean BDI-II scores in the mild range, while those who failed it had a mean score in the moderate severity range. The clinical significance of the increased symptom report was less clear on the PAI scales. On Somatic Concerns, the mean T score for patients who failed any of the GPB validity cutoffs crossed the T > 70 mark, while the mean for those who passed remained below 70. However, all means on the Antisocial Features scale were T < 60, regardless of Pass/Fail status on the GPB.

The authors attributed this unusual pattern of findings to psychogenic interference: a hypothesized mechanism by which emotional distress disrupts examinees’ ability to consistently demonstrate their maximal performance during psychometric testing (Bigler, 2012; Erdodi et al., 2016), producing internally inconsistent profiles, which in turn are commonly interpreted as evidence of noncredible responding (Boone, 2007a; Delis & Wetter, 2007; Greiffenstein et al., 1996; Slick, Sherman, Grant, & Iverson, 1999).

The link between emotional distress and cognitive performance is an intriguing topic in the context of clinical and forensic neuropsychology (Boone, 2007b; Henry et al., 2018; Suhr, Tranel, Wefel, & Barrash, 1997). Although multiple independent investigations converge on the conclusion that depression is orthogonal to performance validity (Considine et al., 2011; Egeland et al., 2005; Rees, Tombaugh, & Boulay, 2001), the evidence on the link between depression and test performance in specific cognitive domains remains equivocal. For example, some investigations concluded that depression and memory performance were unrelated (Egeland et al., 2005; Langenecker et al., 2005; Raskin, Mateer, & Tweeten, 1998; Rohling, Green, Allen, & Iverson, 2002). In contrast, other studies found a significant relationship between them (Bearden et al., 2006; Christensen, Griffiths, MacKinnon, & Jacomb, 1997; Considine et al., 2011).

Other factors affecting emotional functioning, such as complex trauma history, pain, and fatigue (Costabile, Bilo, DeRosa, Pane, & Sacca, 2018; Greiffenstein & Baker, 2008; Kalfon, Gal, Shorer, & Ablin, 2016; Suhr, 2003; Williamson, Holsman, Chaytor, Miller, & Drane, 2012), have also been linked to both PVT failure and cognitive performance. Accumulating evidence for the psychogenic interference hypothesis led to a proposed diagnostic entity (“cogniform disorder”) designed to capture excessive cognitive symptoms and poor test-taking effort in the context of an assumed sick role nested in a conversion-like manifestation (Delis & Wetter, 2007).

More recently, and using a conceptually and computationally sophisticated methodology, Henry et al. (2018) demonstrated that illness perception in general and cogniphobia specifically predicted PVT outcomes. Illness perception refers to thoughts and beliefs about one’s health status, while cogniphobia is conceptualized as the belief that cognitive exertion may exacerbate an underlying neurological condition and the resulting avoidance of tasks that require significant mental effort (Suhr & Spickard, 2012). Taken together, existing research suggests that studying psychogenic interference has the potential to account for some of the unexplained variance in cognitive test performance.

The Erdodi et al. (2017d) study had a number of limitations. Their sample was diagnostically heterogeneous, which raises questions on whether the classification accuracy statistics developed in their study would generalize to specific diagnostic groups. In addition, missing data limited the effective sample size and, potentially, biased their parameter estimates. Therefore, the present study was designed to replicate their findings in a sample of patients with traumatic brain injury (TBI) assessed in a different region of the USA, using a fixed battery of neuropsychological tests.

Method

Participants

The sample consisted of a consecutive case sequence of 100 adults (55% male, 90% right-handed) clinically referred for neuropsychological assessment subsequent to a TBI at an outpatient neurorehabilitation unit of a Midwestern academic medical center. Mean age was 38.3 years (range 17–70), and mean level of education was 13.6 years. Overall intellectual functioning was at the low end of the average range (MFSIQ = 92.8), as was estimated premorbid functioning based on performance on a single-word reading test (MWRAT-4 = 94.1). The sample was used in two previous publications (Erdodi, Roth, Kirsch, Lajiness-O’Neill, & Medoff, 2014b; Erdodi, Abeare et al., 2017b) focused on EVIs within Conners’ Continuous Performance Test and the Forced Choice Recognition trial of the California Verbal Learning Test, respectively. Given that patients were referred for cognitive evaluation by treating physicians, information on external incentive to appear impaired was inconsistently available. Therefore, the criteria for malingered neurocognitive deficits put forth by Slick, Sherman, Grant, and Iverson (1999) could not be applied. Instead, noncredible performance was operationalized using a variety of psychometric tools.

The majority (76%) of the patients sustained a head injury of mild severity. The rest were classified as moderate or severe by the assessing neuropsychologist, based on available data on commonly used injury parameters (duration of loss of consciousness, evidence of intracranial abnormalities on neuroradiological imaging, duration of peritraumatic amnesia, Glasgow Coma Scale [GCS] score at the scene of the accident). A mild head injury was operationalized as a GCS ≥ 13, loss of consciousness < 30 min, posttraumatic amnesia < 24 h, and negative neuroradiological findings. Patients with compromised neurological (e.g., hemiparesis or lesions to the peripheral nervous system) or orthopedic (e.g., bone fracture or soft tissue injury to the arm or hand) integrity of the upper extremity were not administered the GPB. All patients were in the postacute stage of recovery (> 3 months post mild TBI and > 1 year post severe TBI).

Materials

Tests Administered

A fixed battery of commonly used neuropsychological tests [Booklet Category Test (DeFilippis & McCampbell, 1997); Conners’ Continuous Performance Test—Second Edition (Conners, 2004); Peabody Picture Vocabulary Test—Fourth Edition (Dunn & Dunn, 2007); Tactual Performance Test (Halstead, 1947); Trail Making Test (Reitan, 1955, 1958); verbal fluency (FAS and animals); Wisconsin Card Sorting Test (Heaton et al., 1993); Wide Range Achievement Test—Fourth Edition (Wilkinson & Robertson, 2006); Word Choice Test (WCT; Pearson, 2009)] was administered to all patients, covering a wide range of cognitive domains. Intellectual functioning was measured with the Wechsler Adult Intelligence Scale—Fourth Edition (WAIS-IV; Wechsler, 2008). Memory functioning was measured using the Wechsler Memory Scale—Fourth Edition (WMS-IV; Wechsler, 2008) and the California Verbal Learning Test—Second Edition (CVLT-II; Delis, Kramer, Kaplan, & Ober, 2000).

Self-reported emotional functioning was assessed using the Beck Depression Inventory—Second Edition (BDI-II; Beck, Steer, & Brown, 1996) and the Symptom Checklist 90—Revised (SCL-90-R; Derogatis, 1994). BDI-II scores between 14 and 19 are considered mild, 20–28 moderate, and ≥ 29 severe. SCL-90-R T scores ≥ 63 on any of the clinical scales are considered clinical elevations. The Global Severity Index (GSI) has been reported to be a particularly sensitive indicator of overall psychological distress in individuals with TBI (Westcott & Alfano, 2005), although other scales have also been shown to be sensitive to the residual effects of head injury (Linn, Allen, & Willer, 1994). However, the SCL-90-R has also been reported to be vulnerable to symptom fabrication/exaggeration in the context of experimentally induced malingering (McGuire & Shores, 2001; Sullivan & King, 2008).

The main free-standing PVT was the Word Memory Test (WMT; Green, 2003) at standard cutoffs. GPB scores were demographically corrected using the norms by Heaton, Miller, Taylor, and Grant (2004). Given previous theoretically based (Bigler, 2014; Leighton, Weinborn, & Maybery, 2014) and empirically substantiated (Erdodi, 2017; Erdodi et al., 2017c) concerns about modality specificity as a confounding variable in calibrating PVTs, two additional validity composites, both versions of the Erdodi Index Five (EI-5), were developed by aggregating five EVIs each into a single composite measure of performance validity.

The EI-5s (FCR and PSP)

The first composite consisted of PVTs based on forced choice recognition memory (EI-5FCR), while the other consisted of EVIs nested within measures of processing speed (EI-5PSP). Each EI-5 component was recoded onto a four-point scale: a score in the clear Pass range was coded as 0, while a score in the clear Fail range was coded as 3, with two intermediate levels of failure (Table 1), following the methodology described by Erdodi (2017). The demarcation for an EI value of 1 on each individual component is determined by the most liberal cutoff available in the literature. EI values of 2 and 3 are defined either by previously published, more conservative cutoffs or by the 10th and 5th percentiles of the local distribution, respectively. In other words, if no clear cutoff is available to define the EI level of 3, the score associated with the bottom 5% of the distribution (the most egregious failures) determines the specific cutoff. This methodology ensures that the EI model retains a similar interpretation across studies by balancing universally accepted cutoffs against study-specific variations in base rates of failure (BRFail).
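To make the recoding logic concrete, the following is a minimal sketch in Python. The function name and the cutoff values are purely illustrative (hypothetical placeholders rather than the published cutoffs for any specific EVI); the sketch assumes a metric on which lower scores indicate less credible performance, as with demographically adjusted T scores.

```python
# Minimal illustrative sketch of the EI recoding logic described above.
# The cutoff values are hypothetical placeholders, not published values:
# level 1 = most liberal published cutoff; levels 2 and 3 = more
# conservative published cutoffs (or the local 10th/5th percentiles
# when no published cutoff exists).

def recode_ei_component(score: float, liberal: float,
                        conservative: float, extreme: float) -> int:
    """Recode a single EVI score onto the 0-3 EI scale.

    Assumes lower scores reflect less credible performance
    (extreme < conservative < liberal on this metric).
    """
    if score <= extreme:        # clear Fail (e.g., bottom 5% locally)
        return 3
    if score <= conservative:   # intermediate level of failure
        return 2
    if score <= liberal:        # marginal failure at the liberal cutoff
        return 1
    return 0                    # clear Pass


# Example with hypothetical T-score cutoffs of 31/29/25:
recode_ei_component(33, liberal=31, conservative=29, extreme=25)  # -> 0
recode_ei_component(27, liberal=31, conservative=29, extreme=25)  # -> 2
```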

Table 1 Individual components of the EI-5s and base rates of failure at given cutoffs (n = 100)

Naturally, fixing the cutoffs for level 2 and level 3 to BRFail does not eliminate the confounding effect of inherent variability across samples. The 5th percentile will be defined by very different scores in healthy university students than in forensically evaluated patients with comorbid severe neuropsychiatric disorders. Therefore, EI values may not be comparable across studies. Moreover, extreme fluctuations in local distributions (i.e., samples of unusually high or low functioning examinees) may render the EI model outright useless. For example, if only 5% of the sample fails the most liberal cutoff, meaningful differentiation of examinees in terms of the “extent of failure” is impossible due to the low BRFail. Conversely, if 90% of the sample fails highly conservative cutoffs, determining the range for EI level 1 is problematic: level 1 is anchored to the most liberal published cutoff and is designed to be the gateway to a noncredible designation (i.e., the additive effect of repeated near-Passes is equivalent to a couple of clear Fails), yet the vast majority of the sample demonstrates much stronger evidence of invalid performance, undermining the practical utility of this designation.

Such dramatic sampling idiosyncrasies are rare. When they do occur, they violate the implicit assumption underlying the EI model, namely that invalid performance is a continuous variable with a normative distribution of the severity gradient. Existing research on the EI produced remarkably consistent results supporting this a priori condition for the psychometric utility of the model. Across several studies, samples, EI components, and cutoffs (An et al., 2018; Erdodi, 2017; Erdodi et al., 2014b, 2017a, 2017b, 2017d, 2018a, 2018c, 2018d; Erdodi, Pelletier, & Roth, 2018b; Erdodi, Tyson et al., 2017c; Zuccato, Tyson, & Erdodi, 2018), zero was the modal value, the majority (half to two thirds) of the sample fell in the Pass range (i.e., EI ≤ 1), around 15–25% fell in the Borderline range (i.e., EI of 2 or 3), and 15–25% in the Fail range (i.e., EI ≥ 4). Two notable exceptions occurred when multiple EVIs nested within the same test were allowed to serve as independent EI components (Erdodi & Roth, 2017) or when two of the EI components had unusually high BRFail (Erdodi & Rai, 2017). In both cases, the left side of the distribution of EI scores was flattened, and as a result, the demarcation lines became less clear. However, more extreme EI scores retained their high specificity to noncredible responding.

The value of the full EI-5 scale is obtained by summing its recoded components and, thus, can range from 0 (all five constituent PVTs were passed) to 15 (all five constituent PVTs were failed at the most conservative cutoff). An EI-5 score ≤ 1 can be confidently classified as a Pass, as it indicates at most a single marginal failure. The clinical interpretation of the next range of scores (EI-5 of 2 and 3) is problematic: it can mean several soft Fails, a single failure at the most conservative cutoff, or some combination of both. While this level of performance is clearly not a clean Pass, it also lacks sufficiently strong evidence to deem the response set invalid. Therefore, it is labeled Borderline and excluded from analyses requiring a binary (Pass/Fail) outcome. Previous research demonstrated that individuals who scored in the Borderline range were more likely to fail other PVTs compared to those in the Pass range, but less likely to fail other PVTs compared to those in the Fail range (Erdodi, 2017; Erdodi et al., 2018c; Erdodi, Tyson et al., 2017c). In contrast, an EI-5 score ≥ 4 contains sufficient evidence of noncredible responding from independent PVTs and, therefore, serves as the lower limit of the Fail range (Table 2). In previous studies, the EI model produced classification accuracy comparable to established free-standing PVTs in clinical samples of both adults (Erdodi, 2017; Erdodi et al., 2016; Erdodi & Rai, 2017) and children (Lichtenstein, Erdodi, Rai, Mazur-Mosiewicz, & Flaro, 2018).
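Continuing the illustrative sketch above, the full-scale logic can be expressed as follows; again, this is a hypothetical illustration of the scoring rules just described, not published scoring code.

```python
def classify_ei5(levels: list[int]) -> str:
    """Sum five recoded components (each 0-3) and map the total onto
    the interpretive ranges described above."""
    assert len(levels) == 5 and all(0 <= v <= 3 for v in levels)
    total = sum(levels)           # possible range: 0-15
    if total <= 1:
        return "Pass"             # at most one marginal failure
    if total <= 3:
        return "Borderline"       # excluded from binary analyses
    return "Fail"                 # cumulative evidence of invalidity
```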

Table 2 Frequency, cumulative frequency, and classification range for the first ten levels of the EI-5s (n = 100)

The EI-5s were designed to capture both the number and the extent of PVT failures, providing a single-number summary of cumulative evidence of noncredible responding on a continuous scale of measurement. By accounting for different levels of PVT failure, they offer a more nuanced measure of performance validity. In contrast, the common practice of using a single binary cutoff that separates a scale into a valid and an invalid range ignores alternative cutoffs and inevitably sacrifices the diagnostic purity of both groups by forcing borderline cases into one of the two categories.

For example, the Reliable Digit Span (RDS; Greiffenstein, Baker, & Gola, 1994) has two commonly used cutoffs. While both have been shown to meet minimum specificity standards (Heinly, Greve, Bianchini, Love, & Brennan, 2005; Mathias, Greve, Bianchini, Houston, & Crouch, 2002; Reese, Suhr, & Riddle, 2012), RDS ≤ 7 (the liberal cutoff) is optimized for sensitivity, while RDS ≤ 6 (the conservative cutoff) is optimized for specificity. As such, they have different predictive power. Ignoring that fact attenuates the overall classification accuracy and introduces systematic errors into the analysis.

The coexistence of alternative cutoffs creates a potential for diverging interpretations. For example, some assessors may consider an RDS score of 7 a “near-Pass” (Bigler, 2012)—in other words, a performance that does not meet a “sufficiently conservative” standard for failing a PVT. As such, they may be inclined to classify the response set as valid. Others may interpret the same score as the first level of noncredible responding and, consequently, label it as a “soft Fail” (Erdodi & Lichtenstein, 2017). While both hypothetical assessors above correctly determined that the score provides some, albeit weak, evidence of invalid performance, the confluence of the demand for binary classification (Pass/Fail), subjective thresholds for failure, and the implications of descriptive labels (“near-Pass” vs. “soft Fail”) ultimately led them to reach the opposite conclusions.

The EI model is a methodological innovation that provides a psychometric solution to the dilemma of “in-between” scores by virtue of rescaling each individual performance onto a four-point ordinal scale (0–3), establishing a clearly valid range (= 0) and three levels of failure (1–3). This recoding is performed without making a definitive binary classification (i.e., Pass/Fail) at the level of individual components. Delaying the final decision on any given constituent PVT allows the assessor to take into account scores on other PVTs and make the ultimate determination based on the cumulative evidence, consistent with recommendations from multiple professional organizations (Bush et al., 2005; Bush, Heilbronner, & Ruff, 2014; Chafetz et al., 2015; Heilbronner et al., 2009).

As such, an RDS of 7 is coded as 1 (weak evidence of noncredible performance). If all other PVTs were passed (= 0), this is interpreted as a single incident of unusually low performance, and the overall EI composite is considered a Pass. Similarly, in isolation, a CVLT-II forced choice recognition (FCR) score of 15, a Logical Memory recognition (LMRecog) raw score of 20 out of 30, or a Coding age-corrected scaled score of 5 would be treated as insufficient evidence to deem the entire response set invalid. However, if all four of these scores were to occur together, the overall neurocognitive profile would be interpreted as a Fail and considered equivalent to the combination of an RDS of 6 and an FCR of 14, both of which are individually highly specific to invalid performance (Jasinski, Berry, Shandera, & Clark, 2011; Persinger et al., 2018; Schwartz et al., 2016).
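As a hypothetical worked example using the sketch above, the four level-1 scores from this paragraph (plus one passed component, since the composite has five constituents) would cumulate to an overall Fail:

```python
# Hypothetical worked example: four marginal (level-1) failures plus
# one clear Pass cross the EI-5 Fail threshold (total = 4).
levels = [1,  # RDS = 7 (liberal cutoff)
          1,  # CVLT-II FCR = 15
          1,  # LMRecog = 20/30
          1,  # Coding age-corrected scaled score = 5
          0]  # fifth component passed
classify_ei5(levels)  # -> "Fail"
```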

Most of the independent empirical support for the EI model comes from the Advanced Clinical Solutions module for assessing suboptimal effort (Pearson, 2009). The Technical Manual reports that around 25% of the overall clinical sample had an RDS of 7 and an LMRecog of 20. It also notes that 19% of the overall clinical sample failed both of them [or two other PVTs at a comparably liberal cutoff (i.e., at the 25% base rate)]. However, only 6% failed three, 3% failed four, and only 1% failed all five. The precipitous drop in the base rate of cumulative failure can be interpreted as a decrease in the false positive rate. In other words, while one or two marginal PVT failures are a relatively common occurrence (i.e., weak evidence of globally invalid performance), the probability of ≥ 3 such failures is very low even in a clinical sample, and therefore, it provides strong evidence that the overall neurocognitive profile is likely invalid. Flexibility is a major advantage of the EI model over the Pearson ACS model: it is a modular index that can be built by mixing and matching components. Anyone (researcher or clinician) who has data on five or more well-validated EVIs can build their own version of the EI and use it as a multivariate index of performance validity in either archival research or prospective studies.

Although some methodologists recommend excluding scores in the indeterminate range of performance on the criterion variable to improve the internal validity of the design (Greve & Bianchini, 2004), and this has since become widespread practice in performance validity research (Axelrod, Meyers, & Davis, 2014; Erdodi et al., 2017d; Jones, 2013; Kulas, Axelrod, & Rinaldi, 2014), it also raises concerns about artificially inflating classification accuracy and, in turn, limiting the generalizability of the findings. Handling borderline cases is a complex, multifactorial decision. The ultimate decision should take into account missing data (i.e., if component PVTs are inconsistently administered, a score in the Pass range is less reliable compared to a sample where all participants have data on all measures), the level of cutoff (failing PVTs at a more conservative cutoff provides stronger evidence of noncredible performance, and hence, fewer failures are required to confidently establish a criterion group of invalid profiles), and the number of PVTs administered (requiring two failures has a different meaning if 2 [100% failure rate] or 12 PVTs [16.7% failure rate] were administered).

The VI-7

To provide an alternative validity composite that included all participants (i.e., did not exclude indeterminate cases), the Validity Index Seven (VI-7) was created, following the traditional approach of counting the number of PVT failures along more conservative dichotomous (i.e., Pass/Fail) cutoffs (Arnold et al., 2005; Babikian, Boone, Lu, & Arnold, 2006; Bell-Sprinkle et al., 2013; Kim et al., 2010a; Nelson et al., 2003). A VI-7 value ≤ 1 (i.e., at most one failed PVT) was considered a Pass, while a VI-7 value ≥ 2 (i.e., at least two failed PVTs) was considered an overall Fail, following commonly accepted and empirically supported forensic standards (Boone, 2013; Larrabee, 2014; Odland, Lammy, Martin, Grote, & Mittenberg, 2015). Table 3 lists the components of the VI-7, cutoffs, references, and BRFail. Given that in the present sample all tests were administered to every patient, and only two failures were required to deem the entire profile invalid, more conservative cutoffs were applied to each of the seven components.
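In contrast to the graded EI-5 logic, the VI-7 reduces to simple dichotomous counting. A minimal sketch, assuming each component has already been scored Pass/Fail at its conservative cutoff:

```python
def classify_vi7(component_failed: list[bool]) -> str:
    """Traditional counting logic of the VI-7: overall Fail if at
    least two of the seven dichotomously scored PVTs are failed."""
    assert len(component_failed) == 7
    return "Fail" if sum(component_failed) >= 2 else "Pass"
```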

Table 3 Components of the VI-7, cutoffs, references, and base rates of failure (BRFail)

Evidence for the Validity of the Composite PVTs

The EI model has growing empirical support as a multivariate PVT. Since its introduction, it has performed very similarly to free-standing PVTs as an alternative measure of performance validity (Erdodi et al., 2014b, 2016, 2017d, 2018b, 2018c, 2018d; Erdodi & Roth, 2017). Cross-validation against free-standing PVTs further consolidated these findings. Erdodi and Rai (2017) reported that on a different version of the EI-5, a score < 4 had poor specificity (0.65–0.66) against the WMT and the Non-Verbal Medical Symptom Validity Test (Green, 2008). However, specificity improved greatly at ≥ 4 (0.80–0.85) and ≥ 5 (0.89–0.90). More recently, An et al. (2018) found that a version of the EI-5 was highly predictive of experimental malingering (0.73 sensitivity at 0.88 specificity) as well as failure on the WCT (0.79 sensitivity at 0.92 specificity) and the Test of Memory Malingering (0.65 sensitivity at 0.97 specificity).

The evidence to support the indeterminate range (Borderline) as a legitimate third outcome of performance validity testing is even stronger, both for the traditional VI and the novel EI model. Erdodi (2017) demonstrated that EI-5 scores in the Borderline range provide consistently stronger evidence of noncredible responding than scores in the Pass range, but consistently weaker evidence of noncredible responding than scores in the Fail range. He argued that forcing these scores into either the Pass or the Fail category would contaminate the diagnostic purity of the criterion groups and, hence, attenuate classification accuracy. These findings were replicated in subsequent studies using different versions of both the VI and the EI (Erdodi et al., 2017d, 2018a, 2018c). Taken together, the cumulative evidence suggests that when validity composites are built by aggregating individual PVTs using liberal cutoffs, excluding the indeterminate range is necessary to optimize the separation of valid and invalid profiles and to restore the stringent (≥ 0.90) specificity standard.

Procedure

Data were collected from the clinical archives of the outpatient neuropsychology service where patients were assessed. Only deidentified test data were recorded for research purposes to protect patient confidentiality. The project was approved by the Institutional Review Board of the hospital system where the data were collected. APA ethical guidelines regulating research involving human participants were followed throughout the project.

Data Analysis

Basic descriptive statistics (M, SD, BRFail) were reported when relevant. Overall classification accuracy (AUC) and corresponding 95% confidence intervals (95% CIs) were computed in SPSS version 23.0. Sensitivity, specificity, positive (PPP) and negative predictive power (NPP), and risk ratio (RR) were calculated using standard formulas. The minimum acceptable level of specificity is 0.84 (Larrabee, 2003), although values ≥ 0.90 are the emerging new norm (Boone, 2013; Donders & Strong, 2011). Between-group contrasts were performed using independent t tests. Alpha level was not corrected for multiple comparisons, given that all contrasts were planned, and effect size estimates (Cohen’s d) were provided to allow readers to evaluate the magnitude of the difference independent of sample size.
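For reference, the standard formulas mentioned above can be expressed as follows. This is a generic sketch of textbook definitions (including one common formulation of the risk ratio), not the SPSS procedure used in the study; the function names are illustrative.

```python
from math import sqrt

def classification_accuracy(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Textbook formulas from the 2 x 2 table of predictor PVT against
    criterion PVT, where 'positive' denotes invalid performance."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPP": tp / (tp + fp),            # positive predictive power
        "NPP": tn / (tn + fn),            # negative predictive power
        # risk ratio: P(criterion Fail | predictor Fail) /
        #             P(criterion Fail | predictor Pass)
        "RR": (tp / (tp + fp)) / (fn / (fn + tn)),
    }

def cohens_d(m1: float, sd1: float, n1: int,
             m2: float, sd2: float, n2: int) -> float:
    """Cohen's d using the pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                     / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd
```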

Results

To demonstrate their utility in differentiating valid and invalid response sets, the classification accuracy of the validity composites was computed against the WMT as criterion (Table 4). The EI-5FCR produced a good AUC (0.88), with very high sensitivity (0.83) and adequate specificity (0.90). The EI-5PSP performed significantly worse; nevertheless, it produced an AUC in the moderate range (0.74) and a good combination of sensitivity (0.50) and specificity (0.90). The VI-7 had the highest AUC (0.91), with 0.67 sensitivity and 0.92 specificity.

Table 4 Classification accuracy of the EI-5s against the WMT as criterion PVT

Dominant hand GPB T scores produced significant AUCs against all four criterion PVTs (0.68–0.82). The T ≤ 29 cutoff had high specificity (0.90–0.99) and variable sensitivity (0.36–0.55). Increasing the cutoff to ≤ 31 sacrificed some of the specificity (0.87–0.94) without any gains in sensitivity. T ≤ 33 produced a good combination of sensitivity (0.50) and specificity (0.89) against the EI-5FCR but failed to maintain the minimum specificity standard against the WMT, EI-5PSP, and VI-7.

Similarly, nondominant hand GPB T scores produced significant AUCs against all four criterion PVTs (0.67–0.79). The T ≤ 29 cutoff had adequate specificity (0.89–0.92), but low sensitivity (0.24–0.30). Increasing the cutoff to ≤ 31 resulted in a predictable trade-off: improved sensitivity (0.32–0.50) at a reasonable cost to specificity (0.84–0.89). As before, T ≤ 33 produced a good combination of sensitivity (0.45) and specificity (0.88) against the EI-5FCR but failed to maintain the minimum specificity standard against the WMT, EI-5PSP, and VI-7 (Table 5).

Table 5 Base rates of failure and classification accuracy of GPB validity cutoffs against the WMT, the EI-5s, and VI-7 as criterion PVTs

When failure was redefined as scoring at or below a given cutoff with both hands, T ≤ 31 produced high specificity (0.93–0.98) but relatively low sensitivity (0.26–0.36). Increasing the cutoff to ≤ 33 cleared the specificity standard against all four criterion PVTs (0.89–0.94) and improved sensitivity (0.31–0.45). Given the increasing awareness that sensitivity and specificity should be interpreted in the broader context of the base rate of the condition of interest (Lange & Lippa, 2017), PPP and NPP were calculated at five different hypothetical BRFail (Table 6).
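The dependence of predictive power on the base rate follows directly from Bayes’ theorem. A brief sketch, using the 0.45/0.90 figures above purely as an example and five arbitrary hypothetical base rates:

```python
def predictive_power(sens: float, spec: float, base_rate: float):
    """PPP and NPP at a hypothetical base rate of invalid performance,
    derived from Bayes' theorem."""
    ppp = (sens * base_rate) / (
        sens * base_rate + (1 - spec) * (1 - base_rate))
    npp = (spec * (1 - base_rate)) / (
        spec * (1 - base_rate) + (1 - sens) * base_rate)
    return ppp, npp

# e.g., sensitivity = 0.45 and specificity = 0.90 (the <= 33
# both-hands cutoff above) across hypothetical BRFail values:
for br in (0.10, 0.20, 0.30, 0.40, 0.50):
    ppp, npp = predictive_power(0.45, 0.90, br)
    print(f"BRFail = {br:.0%}: PPP = {ppp:.2f}, NPP = {npp:.2f}")
```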

Table 6 Predictive power of GPB validity cutoffs at various hypothetical BRFail (n = 100)

As severe TBI may result in legitimately impaired performance on the GPB, the effect of injury severity was examined independently (Table 7). Patients with mild TBI had consistently higher BRFail on the four criterion PVTs (RR 1.25–2.76) as well as the newly introduced GPB validity cutoffs (RR 1.23–2.20) compared to patients with moderate-to-severe TBI. However, it should be noted that most differences in BRFail did not reach statistical significance. In addition, to examine the domain specificity effect (Erdodi, 2017) as a potential confound, independent t tests were performed using the newly introduced GPB validity cutoffs (i.e., T ≤ 29 and T ≤ 31) as the independent variable and the EI-5s as dependent variables. Previous research suggests that the match in cognitive domain between the criterion and predictor variable influences classification accuracy: PVTs perform better against criterion measures that are similar in cognitive domain and/or administration format (Erdodi et al., 2017d, 2018a, 2018c), perhaps due to idiosyncratic patterns of selectively demonstrating certain types of deficits, but not others (Cottingham, Victor, Boone, Ziegler, & Zeller, 2014; Erdodi et al., 2014a). Although patients who failed the GPB had significantly stronger evidence of invalid performance on both versions of the EI-5s (Table 8), effect sizes were larger on the EI-5PSP (d = 0.82–1.13, large) than on the EI-5FCR (d = 0.47–0.93, medium-to-large). This finding is consistent with the modality specificity hypothesis.

Table 7 Base rate of failure (%) as a function of injury severity across various measures of performance validity and cutoffs (n = 100)
Table 8 The effect of passing or failing GPB cutoffs on validity composites based on tests of forced choice recognition memory (EI-5FCR) and processing speed (EI-5PSP)

Finally, independent t tests were computed with PVT status (Pass/Fail) as the independent variable and the SCL-90-R scales as well as the BDI-II as dependent variables. Patients who failed the WMT reported significantly higher scores on all measures of self-reported emotional distress, with effect sizes ranging from medium (d = 0.42) to large (d = 1.03). Failing the VI-7 was associated with smaller effects (d = 0.42–0.63). Similar findings emerged with the EI-5FCR, with somewhat larger effects ranging from medium (d = 0.51) to large (d = 0.91). However, none of the contrasts with the EI-5PSP as the independent variable reached significance.

More importantly, mean BDI-II scores among those who passed the criterion PVTs were in the mild range or below (≤ 14.6), whereas those who failed produced means at the upper limit of the mild range (17.1–18.8) or in the moderate severity range (20.3–20.7). Likewise, among significant contrasts, those who passed the criterion PVTs scored in the nonclinical range on the SCL-90-R, with the exception of the Obsessive–Compulsive scale (Table 9). However, those who failed the criterion PVTs did not report clinically significant symptoms on the majority of the SCL-90-R scales (Interpersonal Sensitivity, Anxiety, Hostility, Paranoid Ideation, and Phobic Anxiety). At the same time, PVT failure was associated with clinically elevated scores on the Somatization, Obsessive–Compulsive, Depression, and Psychotic Symptoms scales as well as the GSI.

Table 9 The relationship between performance validity and self-reported psychiatric symptoms

In contrast, passing or failing any of the newly endorsed GPB cutoffs was unrelated to SCL-90-R and BDI-II scores (Table 10). As a side note, there was no significant difference between the sample of Erdodi et al. (2017d; M = 18.1, SD = 12.1, range 0–51) and the present sample (M = 15.4, SD = 11.5, range 0–45) on the BDI-II, the only measure of self-reported psychiatric symptoms administered in both studies: t(268) = 1.79, p = 0.075, d = 0.23 (small effect). Both group means were within the mild range. Passing the GPB validity cutoffs was associated with mean scores in the nonclinical range on all SCL-90-R scales. Patients who failed the GPB reported mean scores in the clinical range on the Somatization, Obsessive–Compulsive, Depression, and GSI scales.

Table 10 The effect of passing or failing the GPB cutoffs on SCL-90-R scales

Discussion

This study was designed to replicate an earlier investigation based on a mixed clinical sample, which suggested that in addition to measuring fine motor speed, the GPB can also function as an index of noncredible responding and that failing GPB validity cutoffs is associated with elevated self-reported psychiatric symptoms (Erdodi et al., 2017d). The first finding was well replicated: the previously endorsed cutoff (GPB T score ≤ 29 in either hand) produced good classification accuracy in the present sample against a different set of criterion PVTs. Moreover, based on these data, the default cutoffs could be raised to ≤ 31 for either hand or ≤ 33 for both hands, the point where the instrument best approximates the “Larrabee limit.” The phrase refers to the seemingly inescapable trade-off between false positives and false negatives, such that fixing specificity at 0.90 results in sensitivity hovering around the 0.50 mark (Lichtenstein et al., 2017).

More importantly, scoring below the GPB validity cutoffs was largely unrelated to head injury severity, pre-empting arguments that genuine neurological impairment could account for the PVT failure, despite previous reports that GPB is sensitive to neurological disorders (Larrabee, Millis, & Meyers, 2008). In fact, patients with moderate-to-severe TBI were less likely to score below the cutoffs than those with mild head injuries. While this finding may appear paradoxical, it is well replicated in performance validity research (Carone, 2008; Erdodi & Rai, 2017; Green, Iverson, & Allen, 1999; Grote et al., 2000).

As in the original study, GPB validity cutoffs resulted in similar classification accuracy across psychometrically diverse criterion PVTs (free-standing vs. embedded, univariate vs. multivariate, modality congruent vs. incongruent), suggesting that existing sensitivity and specificity values provide stable estimates of the GPB’s signal detection parameters. The consistency across samples and criteria alleviates concerns about modality specificity as a confounding variable in PVT research (Leighton et al., 2014; Root et al., 2006). At the same time, in the original study, GPB validity cutoffs performed notably better against the domain-congruent criterion PVT, the EI-5PSP. In contrast, in the present sample, the GPB produced consistently higher specificity against the domain-incongruent criterion PVT, the EI-5FCR. This puzzling finding serves as a reminder that measurement models can behave differently across samples, reinforcing the need for multiple independent replications. A possible explanation for this inconsistency is sample-specific differences in BRFail, most notably on the CPT-II Omissions scale: while only 3.2% of patients in the original study scored T > 100, 15.0% of the present sample failed that cutoff.

The second finding of the Erdodi et al. (2017d) study was not replicated: passing or failing the GPB validity cutoffs was orthogonal to self-reported psychopathology. However, the outcome of the WMT or EI-5FCR was significantly related to BDI-II and SCL-90-R scores, producing a large overall effect. Similarly, failing the VI-7 was associated with a medium effect on all three measures of self-reported psychiatric symptoms. Within the SCL-90-R, the GSI was particularly sensitive to invalid responding, confirming previous reports that the instrument is vulnerable to noncredible responding (McGuire & Shores, 2001; Sullivan & King, 2008). This finding also cautions against equating elevated SCL-90-R scores with the presence of psychiatric illness (Johnson, Ellison, & Heikkinen, 1989) and reveals the need for a systematic evaluation of the credibility of symptom report, either by utilizing free-standing instruments (Giromini, Viglione, Pignolo, & Zennaro, 2018; Viglione, Giromini, & Landis, 2017) or by relying on built-in validity scales within established inventories such as the PAI. While these results provide partial support for the psychogenic interference hypothesis overall, they also emphasize that the effect may be instrument-, scale-, and/or sample-specific.

A possible explanation for the discrepancy between the two sets of results is the choice of outcome measure: the SCL-90-R vs. the Personality Assessment Inventory (Morey, 1991). The latter is a more robust instrument with a strong evidence base in a variety of clinical populations (Boone, 1998; Hopwood et al., 2007; Karlin et al., 2005; Siefert, Sinclair, Kehl-Fie, & Blais, 2009; Sims, Thomas, Hopwood, Chen, & Pascale, 2013; Sinclair et al., 2015) that contains nearly four times as many items and, more importantly, validity scales to evaluate the veracity of the response set. The SCL-90-R lacks this important feature.

However, both studies used the BDI-II (Beck, Steer, & Brown, 1996) as a measure of emotional functioning, allowing for a direct comparison. The BDI-II has excellent psychometric properties (Sprinkle et al., 2002; Storch, Roberti, & Roth, 2004) and has demonstrated both high sensitivity and specificity to a clinical diagnosis of depression (Kjærgaard, Arfwedson Wang, Waterloo, & Jorde, 2014). A careful examination of the results from Erdodi et al. (2017d) and the present study reveals that the only discrepancy in BDI-II scores as a function of passing or failing the GPB validity cutoffs was on the dominant hand (Cohen’s d = 0.46 vs. 0.09). The nondominant hand (Cohen’s d = 0.18 vs. 0.11) and combined (Cohen’s d = 0.29 vs. 0.13) cutoffs produced essentially the same results.

It could be argued that the current BDI-II findings are internally consistent (i.e., nonsignificant results were observed as a function of passing or failing all GPB cutoffs) and that the isolated positive finding by Erdodi et al. (2017d) associated with failing the dominant hand GPB cutoff was an outlier. Likewise, medium to large effects were observed on the BDI-II as a function of passing or failing PVTs based on the forced choice recognition paradigm and the validity composite that contained several constituent PVTs based on recognition memory, whereas contrasts involving a criterion PVT based solely on processing speed measures were consistently nonsignificant. Further comparisons are hindered by the fact that the original study did not report BDI-II scores as a function of passing or failing the criterion PVTs. If future investigators reported such findings (i.e., whether there is a difference in self-reported psychiatric symptoms between those who passed and those who failed PVTs), regardless of the main focus of the study, the knowledge base on psychogenic interference could be advanced more quickly.

As one of the reviewers pointed out, the age range of the present sample (17–70 years) and that of Erdodi et al. (2017d; 18–69 years) constrain the generalizability of the findings. Although the GPB cutoffs are expressed in a metric adjusted for age in addition to gender, education, and race using the norms by Heaton et al. (2004), it is unclear whether their classification accuracy would extend outside that range. While the emerging evidence base suggests that adult cutoffs on some PVTs can be applied to children (Donders, 2005; Erdodi & Lichtenstein, 2017; Kirkwood & Kirk, 2010; Lichtenstein et al., 2017, 2018; MacAllister, Nakhutina, Bender, Karantzoulis, & Carlson, 2009), increased false positive rates have been reported in older adults (Ashendorf, O’Bryant, & McCaffrey, 2003; Dean, Victor, Boone, Philpott, & Hess, 2009; Kiewel, Wisdom, Bradshaw, Pastorek, & Strutt, 2012; Zenisek, Millis, Banks, & Miller, 2016).

A significant limitation of the present investigation is that neither measure of emotional functioning has a built-in validity scale to provide an objective evaluation of the credibility of self-reported symptoms. Therefore, the data can support competing interpretations: (A) individuals who exaggerate neurocognitive impairment also tend to exaggerate psychiatric symptoms (i.e., “double malingering”); and (B) genuine emotional distress (accurately captured on psychiatric inventories) prevented a subset of the patients from demonstrating their true ability level on performance-based tests, resulting in both repeated PVT failures and elevations on the BDI-II and SCL-90-R (i.e., psychogenic interference; Delis & Wetter, 2007; Kalfon et al., 2016; Suhr, 2003; Suhr & Spickard, 2012). The fact that in some of the earlier studies the relationship between performance validity and self-reported psychiatric symptoms persisted even when patients passed the symptom validity checks (Erdodi et al., 2017d, 2018c) argues for the latter explanation. In addition, the lack of data on external incentive status prevented us from classifying patients according to the diagnostic criteria for malingered neurocognitive dysfunction proposed by Slick et al. (1999).

Overall, the limited available evidence precludes a definitive conclusion. While the psychogenic interference hypothesis has merit and, thus, warrants further investigation, clinicians are cautioned against broad interpretations aimed at discounting objective evidence of noncredible performance, especially in the presence of external incentives to appear impaired (Larrabee, 2012; Slick et al., 1999). Instead, the combination of multiple PVT failures and elevated self-reported distress could inform the clinical management of the patient, in the form of psychotherapy or cognitive rehabilitation exploring the causal mechanisms behind poor test performance (Boone, 2007a, b; Delis & Wetter, 2007; Erdodi et al., 2017b). If, following symptom relief achieved through successful therapy, the patient produces valid data upon retesting, that pattern of findings would retroactively support psychogenic interference as an explanation for the initial PVT failures.

Future research on the psychogenic interference hypothesis would benefit from prospective studies specifically designed to dissociate the effects of several confounding variables identified in previous reports, such as apparent external incentives to underperform or exaggerate symptoms and self-reported emotional distress on both face-valid and opaque instruments. In addition, a history of complex psychiatric trauma should be recorded and evaluated as a potential contributing factor to PVT failures of unknown etiology (Williamson et al., 2012). Although the link between abuse history and performance validity is far from well understood (Kemp et al., 2008; Tyson et al., 2018), previous studies found that individuals with severe developmental trauma were overrepresented among patients who were misclassified by multivariate models of performance validity assessment due to internal contradictions among various scores (Erdodi et al., 2017c, 2017e). If future studies replicate internal inconsistency as a reliable psychometric marker of the link between adverse life events and inexplicable PVT failures, it would significantly advance our current understanding of the mechanisms behind noncredible responding (Berry et al., 1996; Bigler, 2012, 2015; Boone, 2007a, b; Delis & Wetter, 2007).

In sum, the present investigation successfully replicated an earlier study that introduced validity cutoffs embedded within the GPB in a TBI sample from a different geographic region. Results suggest that a demographically adjusted T score ≤ 31 in either hand or ≤ 33 in both hands is specific to psychometrically defined invalid responding. Unlike in the previous report, failing the GPB was unrelated to self-reported psychiatric symptoms. At the same time, patients who failed PVTs based on the forced choice recognition paradigm reported higher levels of depression, somatic concerns, and overall symptomatology. Given the consistently good classification accuracy of GPB validity cutoffs across samples and criterion PVTs, they appear to be a valuable addition to the growing arsenal of EVIs available to clinical neuropsychologists. However, further research is clearly needed to elucidate the complex relationship between noncredible responding and self-reported emotional distress. Exploring the complex manifestation of psychogenic interference during cognitive testing appears to be a promising line of investigation that has the potential to provide new insights into the causal mechanisms behind internally inconsistent neuropsychological profiles, isolated cognitive deficits with no plausible etiology, and PVT failures.