The establishment of practice standards calling for the routine assessment of performance validity in all neuropsychological evaluations has resulted in a proliferation of Performance Validity Test (PVT) research (Martin et al., 2015). Researchers have pursued the development and validation of new PVTs, as well as cross-validation of existing PVTs, in diverse clinical populations. Traditionally, simulator studies and studies involving samples with high motivation to feign (e.g., personal injury litigants) were the primary means by which the accuracy and utility of these tests were investigated. More recently, researchers have sought to validate experimental PVTs using other well-validated PVTs as the criterion for classifying sample participants into valid and invalid/noncredible groups. While some research has begun to consider how the base rate or severity of PVT failure can be incorporated when operationalizing criterion measures of invalid performance (e.g., Erdodi, 2019), best practices for these validation/cross-validation studies have yet to be fully established. As such, research is needed to determine how many, and which, PVTs should be used as the criterion measures of performance validity or invalidity in these studies.

While PVT research occasionally relies on a single criterion PVT for classifying performance as valid or invalid, the implications of failing a single criterion PVT remain inconclusive (Lippa, 2018). Recently, Schroeder et al. (2019) demonstrated the limitations of using just one criterion PVT to establish validity groups and advocated for the use of ≥ 2 PVT failures for identifying criterion groups in PVT research—a recommendation that closely mirrors the standard of practice for clinical evaluations (Larrabee, 2014). Importantly, although requiring failure on two criterion measures yields higher specificity than either test alone, the sensitivity of the two tests considered together will, by definition, be lower than that of either test alone (Weinstein et al., 2005). The consequence is that researchers must avoid requiring failure on too many criterion PVTs to establish invalidity, particularly PVTs that are inappropriate for their population or have poor psychometric properties, lest sensitivity be sacrificed for an inconsequential increase in specificity. Said another way, using more PVTs is not always better.
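To make this trade-off concrete, the following minimal sketch works through the arithmetic for a two-test "fail both" criterion. The sensitivity/specificity values are hypothetical, and the conditional-independence assumption is one that real, correlated PVTs only approximate:

```python
# Hypothetical operating characteristics for two PVTs; values are
# illustrative and not drawn from any specific test.
sens_a, spec_a = 0.70, 0.90  # PVT A
sens_b, spec_b = 0.60, 0.92  # PVT B

# Criterion rule: classify performance as invalid only if BOTH PVTs are
# failed. Assuming conditional independence between the two tests:
sens_both = sens_a * sens_b                  # 0.420 -- lower than either alone
spec_both = 1 - (1 - spec_a) * (1 - spec_b)  # 0.992 -- higher than either alone

print(f"combined sensitivity = {sens_both:.3f}")
print(f"combined specificity = {spec_both:.3f}")
```

Requiring failure on every additional test drives sensitivity down further still, which is why the multi-PVT criteria examined below require failure on a subset of measures (e.g., ≥ 2 of 3) rather than on all of them.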

Lippa (2018) also reviewed literature describing the effect of “chaining” PVTs that were strongly correlated and suggested that using multiple, highly correlated PVTs as criterion measures does not improve diagnostic accuracy. However, she noted that this literature was limited by samples with poor generalizability (e.g., samples that included only severely impaired or only cognitively normal subjects). To date, there is scant literature pertaining to which PVTs maximize classification accuracy when used in combination. With these considerations in mind, this study aimed to explore the implications of using different PVT combinations as criterion measures to classify validity group status, as well as to investigate whether adding PVTs increases accuracy in establishing validity groups for PVT research studies. Finally, given emerging research examining how PVT cut-scores influence failure base rates (e.g., An et al., 2017), this study also aimed to investigate how adjusting individual PVT failure cutoffs to be more conservative or liberal affects overall classification accuracy.

Materials and Methods

Participants

Participants were 114 nonconsecutive veterans clinically referred for neuropsychological services from 2015 to 2017 at a VA medical center who completed a PVT protocol consisting of the Word Memory Test (WMT; Green, 2003), Test of Memory Malingering (TOMM; Tombaugh, 1996) Trial 1, Advanced Clinical Solutions Word Choice Test (WCT; Pearson, 2009), Dot Counting Test (DCT; Boone et al., 2002), Reliable Digit Span (RDS; Greiffenstein et al., 1994) from the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV; Wechsler, 2008), and Rey 15-Item Test (RFIT; Rey, 1964) as part of their clinical evaluations and who consented to the inclusion of their data in an ongoing, IRB-approved database study. Selection of the PVTs administered, and of the validity cutoffs interpreted, was based on a combination of factors, including frequency of use (Martin et al., 2015; Slick et al., 2004), research support, and classification accuracy. Missing data were as follows: 1 RDS, 1 WCT, and 8 RFIT (the RFIT was added to the protocol shortly after the study began). Neuropsychological test batteries differed among patients as clinically indicated, but in addition to the six standard study PVTs, various embedded PVTs, typically within standard memory tests or global cognitive measures, were also administered. Although these embedded PVTs varied between participants and were not examined in this study, they were considered when determining validity status on clinical evaluation. Thus, in total, each participant completed the six PVTs from the study protocol and at least one additional embedded PVT (see Table 1) during their evaluation, for a total of ≥ 7 PVTs per test battery that were used to determine overall validity status at the time of evaluation.

Table 1 Performance Validity Tests (PVTs) and associated conservative and liberal cut-scores for test failure for the study PVT protocol and ancillary embedded measures

Clinical classification of validity status was made at the time of each patient’s neuropsychological evaluation by the evaluating board-certified clinical neuropsychologist based on (1) failures on at least two PVTs (2 freestanding, 2 embedded, or 1 freestanding and 1 embedded) and (2) established objective criteria for identifying noncredible performance (i.e., Slick et al., 1999), which examine for marked discrepancies between test performance and accepted models of central nervous system dysfunction, observed behavior, self- or collateral reports, or documented clinical history. Given a veteran sample in which patients largely receive their healthcare in one system, neuropsychologists had access to complete medical records (e.g., histories, neuroimaging, laboratory workup, prior evaluations) and information on service-connection/disability-seeking status to incorporate into clinical decision-making regarding validity status. Based on PVT performance and Slick criteria, 80/114 (70%) cases were clinically classified as valid and 34/114 (30%) as invalid. For all cases in this study, PVT scores and these histories were reviewed by the authors to ensure the above criteria were met. Of the 80 valid cases, 43 (54%) met formal Diagnostic and Statistical Manual of Mental Disorders-Fifth Edition (DSM-5; APA, 2013) criteria for a mild (N = 35; 81%) or major (N = 8; 18%) neurocognitive disorder, whereas 37 (46%), despite having subjective complaints (or provider concern for possible cognitive impairment), were cognitively unimpaired based on their objective evaluation results. Diagnoses for those without cognitive impairment were no diagnosis (N = 11), posttraumatic stress disorder (PTSD; N = 3), mood disorder (N = 10), anxiety disorder (N = 4), sleep disorder (N = 4), or other psychiatric disorder (N = 5). Diagnoses for those with cognitive impairment in this mixed clinical sample were stroke/cerebrovascular disease (N = 15), Alzheimer’s disease/mild cognitive impairment-amnestic subtype (N = 6), epilepsy (N = 4), attention-deficit/hyperactivity disorder (ADHD)/learning disorder (N = 4), moderate-severe traumatic brain injury (TBI; N = 3), frontotemporal degeneration (N = 2), Parkinson’s disease (N = 1), substance-induced (N = 1), and multiple comorbid etiologies (N = 7). Diagnoses for the 34 invalid cases were TBI (N = 11; 10 mild/1 moderate); primary PTSD, depression, or anxiety (N = 10); primary PTSD or depression with medical comorbidity (e.g., remote stroke, epilepsy, tumor resection; N = 6); and possible malingering (N = 7).

Measures

TOMM Trial 1 (as opposed to the full TOMM) was used given evidence of good diagnostic accuracy as a briefer freestanding PVT among veterans (Webber et al., 2018a), with pooled sensitivity of 77% and specificity of 92% (Denning, 2012; Martin et al., 2019). (See Table 1 for the conservative and liberal pass/fail cut-scores for the six PVTs included in the study protocol.)

Data Analyses

Chi-square tests and analyses of variance (ANOVAs) were used to assess systematic differences in demographic characteristics between valid and invalid cases. Correlations between PVTs were examined for valid cases. Failure base rates and the mean number of PVT failures were then calculated for valid and invalid cases using both conservative and liberal cutoffs. Next, all possible two- and three-PVT combinations were generated from the six PVTs in the protocol. Concordance rates were then calculated between validity group classification and each two- and three-PVT combination to establish the accuracy of each PVT combination for identifying invalid performance. For all two- and three-PVT combinations, failure on 0 PVTs was operationalized as valid performance and failure on ≥ 2 PVTs as invalid performance. All concordance analyses were conducted once using conservative cut-scores for criterion PVT failure and then repeated with liberal PVT failure cut-scores to examine and compare the effect(s) of different criterion cut-scores on overall classification accuracy.
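As a rough illustration of the enumeration and concordance computation just described, a minimal sketch follows. The pandas DataFrame `df`, its column names, and the binary fail indicators (1 = fail at the chosen cut-score) are assumptions for illustration, not the authors' actual code:

```python
from itertools import combinations

import pandas as pd

PVTS = ["WMT", "TOMM1", "WCT", "DCT", "RDS", "RFIT"]  # 0/1 fail indicators

def classify(row: pd.Series, combo: tuple) -> str:
    """Apply the operational rule to one participant for one combination."""
    n_failed = sum(row[p] for p in combo)
    if n_failed == 0:
        return "valid"
    if n_failed >= 2:
        return "invalid"
    return "equivocal"  # exactly one failure meets neither criterion

def concordance_rates(df: pd.DataFrame, k: int) -> dict:
    """Agreement between each k-PVT combination and clinical classification.

    Equivocal (one-failure) cases never match the clinical label, so they
    count against concordance, mirroring the ambiguity discussed later.
    """
    return {
        combo: (df.apply(classify, axis=1, combo=combo)
                == df["clinical_status"]).mean()
        for combo in combinations(PVTS, k)
    }

# Usage, assuming df carries the six fail indicators plus a "clinical_status"
# column ("valid"/"invalid"):
# two_pvt = concordance_rates(df, 2)
# three_pvt = concordance_rates(df, 3)
```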

Results

The sample was 83% male, but otherwise rather diverse in terms of age (M = 54.2; SD = 14.9; range = 24–84), education (M = 13.7; SD = 2.4; range = 7–19), and race (46% Caucasian, 35% Hispanic, 16% African American, 3% Other), with nonsignificant demographic differences between valid and invalid participants: age, F(1, 113) = 2.33, p = 0.13; education, F(1, 113) = 0.45, p = 0.50; sex, χ2(1, N = 114) = 0.13, p = 0.71; and race, χ2(3, N = 114) = 3.72, p = 0.29. Means/standard deviations for the six PVTs by validity group and correlations between the six PVTs among the valid cases are included in Table 2. Failure base rates for each PVT at conservative and liberal cutoffs, as well as the mean number of PVT failures by validity group, are presented in Table 3, and overall PVT failure percentages are included in Fig. 1. For invalid cases, failure rates ranged from 29 to 94% at conservative and 45 to 94% at liberal cutoffs. Among valid cases, failure rates ranged from 5 to 16% at conservative and 5 to 28% at liberal cutoffs. Across both valid, t(79) = − 6.30, p < 0.001, d = 0.57, and invalid, t(33) = − 4.74, p < 0.001, d = 0.38, cases, the mean number of PVT failures was significantly higher with liberal versus conservative cutoffs, though a substantially larger difference and effect size were noted for valid cases relative to invalid cases.

Table 2 Performance Validity Tests scores by validity group and correlations for the valid cases
Table 3 Failure base rates and mean Performance Validity Test failures by validity group
Fig. 1 Percentages of Performance Validity Test (PVT) failures using conservative versus liberal failure cut-scores

At conservative cutoffs, when two PVTs were used, 72–91% of clinically valid cases would be correctly identified as valid (i.e., failed 0/2 PVTs) regardless of which two PVTs were used as criterion measures, and a very small percentage (i.e., 0–4%) would be incorrectly classified as invalid (i.e., false positives) by failing 2/2 PVTs. (See Tables 4 and 5 for all possible two-PVT combinations using conservative and liberal cut-scores, respectively.) In contrast, for invalid cases, correct classification (i.e., failure on 2/2 PVTs) varied widely from 17 to 74% based on the two-PVT combination used. WMT/TOMM (74%), WMT/WCT (67%), WMT/DCT (62%), and TOMM/WCT (58%) had the highest concordance, whereas RDS/RFIT (17%), TOMM/RFIT (19%), DCT/RFIT (23%), and WCT/RFIT (27%) had the lowest. Misclassification of invalid performance as valid (i.e., failure on 0/2 PVTs; false negatives) also varied considerably, with few to no false negatives for all WMT combinations as well as for TOMM/DCT and TOMM/WCT, whereas ≥ 20% of invalid cases passed several of the other PVT combinations. Overall, the WMT/TOMM, WMT/WCT, WMT/DCT, TOMM/WCT, and TOMM/DCT two-PVT combinations would yield the greatest number of participants in each validity group if used to assign group membership for a research study. Finally, the rate of failure on 1/2 PVTs was high, ranging from 6 to 25% for valid cases and 26 to 68% for invalid cases. At liberal cut-scores, 53–86% of clinically valid cases would be correctly identified as valid (i.e., failed 0/2 PVTs), with an increased false-positive rate of 0–15%, though all but 2 combinations retained a false-positive rate of ≤ 6%. Among invalid cases, correct classification (i.e., failure on 2/2 PVTs) improved significantly to 39–82%, and the false-negative rate ranged from 0 to 30% depending on the two-PVT combination. Again, combinations containing RDS and RFIT tended to have lower accuracy for identifying invalid cases. Similar to conservative cut-scores, the rate of failure on 1/2 PVTs remained high, ranging from 16 to 45% for valid cases and 26 to 58% for invalid cases.

Table 4 Number of Performance Validity Test (PVT) failures by validity group for all combinations of two PVTs using conservative criterion PVT cut-scores for task failure
Table 5 Number of Performance Validity Test (PVT) failures by validity group for all combinations of two PVTs using liberal criterion PVT cut-scores for task failure

At conservative cutoffs, when three PVTs were used and the benchmark for being classified as invalid was failing ≥ 2 of the 3 PVTs in the combination, classification accuracy substantially increased, with 48–97% of invalid cases being correctly identified as such. (See Tables 6 and 7 for all possible three-PVT combinations using conservative and liberal cut-scores, respectively.) Moreover, the number of invalid cases that would have been incorrectly classified as valid by failing 0/3 PVTs dropped considerably; it was 0 for 9/19 three-PVT combinations and ≤ 4 cases for 8 of the 10 remaining combinations. For valid cases, the majority (i.e., 67–86%) again failed 0 PVTs, whereas a small minority (i.e., ≤ 5 cases; 6%) failed ≥ 2 PVTs. Further examination of those valid cases who failed ≥ 2 PVTs revealed that these individuals generally had significant (e.g., Alzheimer’s disease, temporal lobe epilepsy s/p resection) and/or multifactorial (e.g., multiple medical conditions with polypharmacy) cognitive impairment. Overall, the WMT/TOMM/DCT combination had the highest concordance in that it classified 33/34 (97%) invalid cases, with 0/80 valid cases failing ≥ 2 of these tests, though several other three-PVT combinations also had high concordance (see Fig. 2). Finally, even when using three PVTs, 11–31% of valid cases and 3–43% of invalid cases continued to fail just one PVT. At liberal cut-scores, 60–97% of invalid cases would be correctly identified, with a false-negative rate of 0–17% depending on the three-PVT combination used. Conversely, for the valid cases, 46–75% failed 0 PVTs, with a significantly increased false-positive rate of 3–27%. Combinations containing RDS and RFIT again generally evidenced the weakest relative sensitivities and had the highest false-positive rates. Lastly, the rate of failure on 1/3 PVTs remained high, ranging from 21 to 50% for valid cases and 3 to 28% for invalid cases.

Table 6 Number of Performance Validity Test (PVT) failures by validity group for all combinations of three PVTs using conservative criterion PVT cut-scores for task failure
Table 7 Number of Performance Validity Test (PVT) failures by validity group for all combinations of three PVTs using liberal criterion PVT cut-scores for task failure
Fig. 2 Classification accuracy of three-Performance Validity Test (PVT) combinations for identifying valid and invalid cases using conservative cut-scores

Discussion

Returning to the original question of whether quality or quantity of criterion PVTs is critical for establishing validity groups in PVT research, our results suggest that the answer is, to some degree, both. When two PVTs were used, false-negative rates of invalid cases that were incorrectly classified as valid (i.e., passed 2/2 PVTs) ranged from 0% to as high as 40% depending on the psychometric properties of the specific PVTs included in the two-PVT combination. Notably, combinations of the WMT, TOMM, WCT, and DCT had the fewest false negatives, whereas RDS and RFIT resulted in higher false-negative and equivocal rates (i.e., 1 pass/1 failure) when paired with the more robust measures. This has notable implications for the choice of measures for PVT research. As a practical example, suppose a researcher conducted a hypothetical study with these data to cross-validate the WCT using the TOMM Trial 1 and one additional PVT as the two criterion measures. As depicted in Table 8, all four receiver operating characteristic (ROC) curve analyses would yield significant areas under the curve (AUCs); however, alternating the two criterion PVTs would result in (1) different optimal cut-scores, (2) a 23% difference in sensitivity if ≥ 90% specificity is to be maintained (per recommendations by Boone (2012) and Larrabee et al. (2019)), and (3) a large range in the number of cases with only one criterion PVT failure (i.e., 18–33%) who may be excluded from analysis depending on the specific two-PVT criterion combination. Thus, both the quantity and quality of measures used as criterion measures in PVT research can greatly affect the reliability and replicability of study findings and, by extension, evidence-based clinical practice. These methodological decisions may also limit the use of historical datasets if appropriate PVTs were not included at the time of data collection.

Table 8 Hypothetical study cross-validating the word choice test using different Performance Validity Test combinations to establish validity groups
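A minimal sketch of how such an ROC analysis might be run follows, using scikit-learn. The variable names (`wct_scores`, `criterion_group`), the assumption that lower raw WCT scores indicate more suspect performance, and the data themselves are illustrative; this is not the authors' code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def roc_summary(scores, criterion_group, min_specificity=0.90):
    """AUC plus the cut-score maximizing sensitivity at >= 90% specificity."""
    inverted = -np.asarray(scores)  # higher value -> more suspect performance
    auc = roc_auc_score(criterion_group, inverted)  # 1 = invalid group
    fpr, tpr, thresholds = roc_curve(criterion_group, inverted)
    ok = fpr <= (1.0 - min_specificity)  # enforce the specificity floor first
    best = np.argmax(tpr[ok])            # then maximize sensitivity
    cut = -thresholds[ok][best]          # back to the raw-score scale:
    #                                      classify invalid when score <= cut
    return auc, cut, tpr[ok][best], 1.0 - fpr[ok][best]

# Rerunning roc_summary(wct_scores, criterion_group) with validity groups
# defined by different two-PVT criterion combinations shows how the
# "optimal" cut-score and achievable sensitivity shift with the criterion.
```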

When the number of PVTs was increased from two to three, overall false-negative rates dropped considerably, such that 13/19 possible three-PVT combinations had ≤ 3% of invalid cases failing 0/3 PVTs. WMT/TOMM/DCT classified the most participants, given that all but one invalid case failed ≥ 2/3 of these tests and 0/80 valid cases failed ≥ 2/3. While increasing the number of PVTs from two to three reduced false negatives overall, it is important to note that 4/19 three-PVT combinations still had misclassification rates of > 10%, with the DCT/RDS/RFIT combination having the highest false-negative rate (i.e., 23%), followed by WCT/RDS/RFIT (17%), indicating that PVT quality still matters, to some degree, even when more PVTs are used.

Minimizing the number of valid participants misclassified as invalid is equally paramount in PVT research. To this end, when two PVTs were used, the false-positive rate (i.e., failed 2/2) was low (i.e., 0–4%). When the number of PVTs increased to three, the false-positive rate remained low (i.e., 0–6%), which is consistent with prior findings documenting a small, but nonsignificant, increase in the false-positive rate as the number of PVTs increases (Larrabee, 2014; Critchfield et al., 2019). Because the specificity of these PVT cut-scores (or of any combination derived from individual PVT cut-scores, for that matter) is less than 100% if adequate sensitivity is to be maintained, there is a risk that a small percentage of patients with cognitive impairment will be misidentified as invalid. PVT combinations also could yield different classification accuracy in clinical populations with different levels of impairment severity. Specifically, caution is warranted when using combinations that include RDS and RFIT with elderly patients and patients with suspected/known cognitive impairment, and clinicians are encouraged to remain abreast of the literature on alternative scoring/cutoff criteria if these measures are used (Bailey et al., 2018a; Fazio et al., 2019; Poynter et al., 2019; Webber et al., 2018b; Webber & Soble, 2018). Consequently, using appropriate, empirically derived PVT cut-scores within the context of clinical history and structured validity criteria may help reduce the likelihood that these individuals are included in noncredible reference samples when conducting PVT research.
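As a back-of-the-envelope illustration of how the ≥ 2-failure rule behaves as tests are added, the following sketch computes the chance that a valid examinee fails ≥ 2 of N tests. The per-test false-positive rate is hypothetical, and the independence assumption is a simplification that correlated PVTs do not strictly satisfy:

```python
from math import comb

def p_at_least_k_failures(k, n, p):
    """P(>= k failures across n independent tests, each failed with prob p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p = 0.10  # hypothetical false-positive rate per PVT (i.e., 90% specificity)
for n in (2, 3, 6):
    print(f"n = {n} PVTs: P(fail >= 2 | valid) = {p_at_least_k_failures(2, n, p):.3f}")
# n = 2: 0.010; n = 3: 0.028; n = 6: 0.114
```

Under this toy model, the false-positive rate of the ≥ 2-failure rule grows with battery size but remains well below the per-test rate for two- and three-test combinations, echoing the small increases reported in the literature cited above.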

This study also highlighted the methodological challenges that arise in research when a participant fails just one PVT. Notably, regardless of whether two- or three-PVT combinations were used, several cases in both validity groups failed only one PVT. Given that all participants in this study completed at least seven PVTs during their evaluation and corroborating clinical data were available, clinical classification of overall validity status was relatively straightforward. In contrast, validity classification becomes more complex in a research context if all the investigator has available to classify validity status is a two- or three-PVT combination, particularly given the concern that a sizeable percentage of participants would end up in the wrong reference group if they were retained via a uniform operational definition (e.g., classifying those with one failure as invalid). While a simple solution is excluding these participants from research studies, doing so may reduce generalizability of results to mixed clinical samples, particularly considering that many of the valid cases with one PVT failure in our sample had diagnosed cognitive impairment. Thus, depending on the population of interest, it may be advisable to use alternative operational criteria that examine PVT scores in the context of other available clinical information and/or established criteria for noncredible performance (e.g., Slick criteria; Slick et al., 1999) to more accurately establish group assignment.

Finally, given that recent research (e.g., An et al., 2017) has suggested potential utility in applying liberal PVT cutoffs, this study examined the effect of adopting more liberal TOMM, WCT, RDS, and RFIT cut-scores on overall validity classification rates. For two-PVT combinations, accuracy for identifying invalid cases increased from 17–74% at conservative cutoffs to 39–85% at liberal cutoffs, whereas the corresponding decrease in false negatives was more modest (i.e., 0–40% at conservative versus 0–30% at liberal cutoffs). By contrast, liberal cutoffs resulted in significantly more valid cases misclassified as invalid (i.e., 0–15% versus 0–3%) as well as a notable drop (i.e., from 72–91% to 53–86%) in the number of valid cases failing 0/2 PVTs relative to conservative cutoffs. Similarly, when liberal cutoffs were used for three-PVT combinations, overall accuracy for identifying invalid cases increased minimally, from 57–97% at conservative cut-scores to 60–97%, with a corresponding false-negative decrease from 0–23% to 0–17%. Moreover, among valid cases, the use of liberal cutoffs significantly increased the false-positive rate from 0–6% to 3–27%, while simultaneously reducing the number of valid cases with 0 PVT failures from 67–86% to 46–75%. Thus, the modest increase in sensitivity associated with liberal cutoffs was countered by a significant reduction in specificity, with nearly half of the three-PVT combinations having a ≥ 10% false-positive rate. Across both two- and three-PVT combinations, those containing RDS and RFIT generally continued to evidence the weakest relative classification accuracy even after adopting liberal cutoffs and, particularly in the case of three-PVT combinations, produced substantially higher false-positive rates that violated the 90% specificity benchmark (Boone, 2012). Consequently, results indicated that PVT quality remains critically important for overall classification, even in the context of adjusting cutoffs more liberally, or specificity will be sacrificed for, at best, modest gains in sensitivity. Finally, even with liberal cut-scores, 18–55% of invalid and 14–45% of valid cases failed one PVT for two-PVT combinations, and 3–28% of invalid cases and 21–43% of valid cases failed just one PVT for three-PVT combinations. Therefore, despite applying more liberal cutoffs, a large number of valid and invalid cases continued to fail only one PVT in each combination, which is especially problematic for two-PVT combinations, as this would result in a considerable number of essentially equivocal cases with one PVT pass and one PVT failure if only two PVTs were available as criterion measures.

Whether a researcher is attempting to test the classification accuracy of a PVT or to rule out the impact of validity status on other neuropsychological tests (e.g., Webber et al., 2018c), the current results are particularly relevant for research studies that incorporate information on performance validity status. While clinicians may have the liberty of adding PVTs in vivo to clearly establish a patient’s validity status during clinical and/or forensic exams, research studies in which test selection was established a priori often will have more limited flexibility. When validating new PVTs, results suggest that a three-PVT combination efficiently maximizes classification accuracy over a two-PVT combination. Using only two PVTs is likely to decrease overall classification accuracy and introduce meaningful error when only one of the two measures is failed; indeed, a large number of participants failed one PVT regardless of whether a two- or three-PVT combination was used. With a two-PVT combination, the researcher's classification of such cases would ultimately rest on a single score, as either the passing score or the failing score must be weighed as most important to group assignment, an approach recently identified as having significant limitations (see Lippa, 2018 and Schroeder et al., 2019). In contrast, a three-PVT combination likely would provide additional clarity for establishing validity groups, as it would allow the researcher to use ≥ 2 independent PVT failures as the benchmark for invalid performance, which mirrors recommended clinical practice guidelines (Larrabee, 2014).

Of note, a strength of this study is the inclusion of a diverse, mixed clinical sample, which served to enhance the generalizability of findings. In contrast to other studies that have reported an increased likelihood of false positives in subjects whose cultural identity may differ from the majority culture (e.g., Gasquoine et al., 2017), no such differences in classification were identified in the current study. Further, consideration of alternative cutoffs validated in elderly patients or those with cognitive impairment (e.g., WMT GMIP) reduced false-positive errors, underscoring the importance of considering these characteristics in research designs.

Limitations and Future Directions

While this study makes meaningful progress toward identifying the best methods for establishing criterion groups in PVT research, future research is still needed to identify the most appropriate test combinations and whether varying the types of PVTs provides optimal classification accuracy among different clinical samples. While our clinical sample supports use of memory-, attention-, and speed-based PVTs as the foundation for a robust PVT combination, it is a limitation that the majority of the PVTs in this study were memory-based (or appeared to assess memory). Replication of our results with non-memory-based PVTs is warranted, considering that some patients may selectively exhibit invalid performance within a single cognitive domain (Boone, 2009, 2012). Therefore, the optimal combination of criterion PVTs may need not only to include PVTs that measure (or appear to measure) multiple cognitive domains, but also to ensure construct match between the measure being validated and the criterion PVTs.

Given that different published cut-scores exist for many PVTs, individual study findings may be, to some degree, dependent on and/or may change as a function of the specific pass/fail cut-scores used for each PVT. Unsurprisingly, prior research has yielded varying published cut-scores for several PVTs in the current study. As such, this study offers an important reminder that the interpretation of PVTs (as well as a PVT’s efficacy or “quality”) depends on the pass/fail cut-scores selected, the choice of which is ultimately at the professional’s discretion. This is a particular issue with newer PVTs (e.g., WCT), for which there are comparatively fewer studies and a lack of systematic reviews/meta-analyses to guide optimal cut-score selection. Nonetheless, data from this study highlight how simply using more liberal cutoffs may not be the optimal solution to this problem: more liberal cutoffs predictably increased accurate identification of invalid cases, but at the cost of significantly higher false positives, while a substantial number of individuals in both validity groups continued to fail only one PVT. Another limitation is that the PVTs examined in this study comprised the majority of those originally used to establish validity groups during the clinical evaluation. Therefore, cross-validation of these results in other clinical samples and/or with a prospective study methodology, in which the criterion PVTs used to establish validity groups are separate/distinct from the test PVT combinations being examined, will be critical in replication efforts.

Although our overall false-positive rate was low and consistent with prior research that similarly found a low false-positive rate in Alzheimer’s disease when failure on 2 PVTs was used as the criterion (Loring et al., 2016), our mixed clinical sample included cognitively impaired patients with predominantly mild presentations. Replication of findings in clinical samples with a higher prevalence of more severe cognitive impairment/dementia is recommended to increase external validity. Relatedly, the use of a veteran sample may limit the generalizability of results to other populations given unique aspects of this population (Armistead-Jehle et al., 2017). Unlike in civilian populations, external incentive (i.e., service-connection disability) is an ever-present concern when delivering neuropsychological services in the VA system, particularly because evaluation for VA service-connected disability initiation or a rating increase can be requested at any point, and past research demonstrates that PVT performance can vary with evaluation context in veteran and active duty populations (e.g., clinical vs. disability, history of military concussion; Armistead-Jehle et al., 2016; Grills & Armistead-Jehle, 2016). Finally, it is a limitation that no combinations of more than three PVTs were examined, as increased accuracy may be obtained with additional measures.

While it is well established that failing ≥ 2 PVTs accurately identifies those exhibiting noncredible performance (Larrabee, 2008; Meyers et al., 2014), few studies have examined how using multiple PVTs to identify criterion groups in PVT research impacts study findings. This study builds on the work of Schroeder et al. (2019) and demonstrates that both the number and the quality of PVTs are important when operationalizing criterion variables in PVT research. Base rate of PVT failure and clinical history can be useful for identifying invalidity clinically, though researchers may be tasked with selecting criterion PVTs a priori. By providing evidence that both the number and quality of PVTs impact criterion grouping, these results may help inform the selection of criterion PVTs and, in turn, increase the replicability of PVT research findings.