Neuropsychological impairment is a core feature of psychotic disorders that significantly predicts poor functional outcome (Bowie et al., 2008; Green et al., 2008; Keefe, Poe, Walker, Kang, & Harvey, 2006). Meta-analyses of neuropsychological test performance indicate that individuals with psychotic disorders (e.g., schizophrenia, schizoaffective disorder, psychosis NOS, bipolar disorder with psychosis) perform approximately 1.5 standard deviations below healthy controls (Dickinson, Ramsey, & Gold, 2007; Fioravanti, Carlone, Vitale, Cinti, & Clare, 2005; Heinrichs & Zakzanis, 1998), with impairment across a range of cognitive domains and no distinctive pattern of differential deficits (Dickinson, 2008; Dickinson, Iannone, Wilk, & Gold, 2004; Dickinson, Ragland, Gold, & Gur, 2008; Reichenberg & Harvey, 2007). Such findings have led some to propose that psychotic disorders are characterized by a “generalized neurocognitive deficit” (Dickinson & Harvey, 2009; Dickinson et al., 2008).

Several accounts have been proposed to explain the generalized neurocognitive deficit. For example, studies support a role for grey- and white-matter abnormalities, reduced signal integration across neural networks, neuropathology at the cellular level (e.g., NMDA receptor dysfunction, GABA interneurons), and abnormalities in “general systems” that impact the brain (e.g., inflammatory, metabolic, and oxidative stress processes; Dickinson & Harvey, 2009). These central nervous system and general systems theories have led to important advances in our understanding of the generalized neurocognitive deficit. However, they account for a limited proportion of variance in global neuropsychological test scores (1–34%) and have led to minimal treatment breakthroughs (Gur, Turetsky, Bilker, & Gur, 1999; Sullivan, Shear, Lim, Zipursky, & Pfefferbaum, 1996; Zipursky, Lambe, Kapur, & Mikulis, 1998). Alternative approaches to understanding the generalized neurocognitive deficit may therefore be beneficial.

One possibility is that the generalized neurocognitive deficit in individuals with psychotic disorders results from problems with motivation that undermine the ability to allocate sufficient effort during testing. This proposal has appealing face validity for explaining the generalized neurocognitive deficit. If a significant proportion of individuals with psychotic disorders put forth insufficient effort on neuropsychological tests, this could explain why they display impairments across a wide range of cognitive domains with no specific profile of differential deficits.

There has been considerable interest in determining whether individuals with psychotic disorders put forth inadequate effort on neuropsychological tests. Insufficient effort is typically examined using standardized measures that are designed to look like difficult tests of cognitive ability but are in fact quite easy. These tests can be embedded within neuropsychological tests or batteries designed to quantify level and type of cognitive impairment (e.g., Reliable Digit Span; Iverson & Tulsky, 2003), or free-standing tests designed specifically to index insufficient effort (e.g., Word Memory Test; Green, Iverson, & Allen, 1999). The key outcome from both free-standing and embedded effort tests is a binary score indicating whether adequate effort has been allocated (i.e., pass/fail). Examinees typically fail effort tests either because an incentive to perform poorly (e.g., during litigation or disability determination) leads them to feign impairment (i.e., malingering; Mittenberg, Patton, Canyock, & Condit, 2002), or because they are apathetic and lack the intrinsic motivation to perform well (Konstantakopoulos et al., 2011).

The validity of free-standing and embedded effort testing is supported by studies indicating that neuropsychiatric patients with severe cognitive impairment and intellectual disability (i.e., IQ < 70) can obtain near-perfect scores on these measures (Green, Rohling, Lees-Haley, & Allen, 2001; Rees, Tombaugh, Gansler, & Moczynski, 1998), suggesting that failure reflects motivation rather than cognitive impairment. However, there are some exceptions, such as severe dementia, where effort tests are not thought to be valid because patients fail them due to genuine cognitive deficits (Bianchini, Mathias, & Greve, 2001; Chafetz & Dufrene, 2014; Dean, Victor, Boone, & Arnold, 2008; Larrabee, 2014; Merten, Bossink, & Schmand, 2007; Singhal, Green, Ashaye, Shankar, & Gill, 2009; Smith et al., 2014; Teichner & Wagner, 2004; Tombaugh, 1996; Victor & Boone, 2007). Although effort testing is commonly used with individuals with psychotic disorders in clinical and forensic settings, for a multitude of purposes with important legal and health ramifications, it is unclear whether effort tests are well suited to this population. It is therefore also unclear whether these tests can be used to make meaningful inferences about whether low effort contributes to the generalized neurocognitive deficit, and whether they should be used in routine clinical practice.

A number of studies have used standardized embedded or free-standing tests to examine neuropsychological effort test failure rates in psychotic disorders. Results have been highly variable across studies, with failure rates ranging from 0 to 72% (see results section below). This is similar to the range of effort test failure rates found in other clinical groups with comparable or more severe cognitive deficits, including patients with traumatic brain injury at 10.7–59.1% (Armistead-Jehle, 2010; Armistead-Jehle & Buican, 2012; Lange, Sonal, Bhagwat, Anderson-Barnes, & French, 2012; Lippa, Agbayani, Hawes, Jokic, & Caroselli, 2015; Lippa, Lange, French, & Iverson, 2018; Nelson et al., 2010; Whitney, Shepard, Mariner, Mossbarger, & Herman, 2010; Whitney, Shepard, Williams, Davis, & Adams, 2009), Parkinson’s disease at 8–62.6% (Carter, Scott, Adams, & Linck, 2016), Huntington’s disease at 18–70% (Sieck, Smith, Duff, Paulsen, & Beglinger, 2013), and across dementia disorders at 5–76% (Bortnik, Homer, & Bachman, 2013; Burton, Enright, O’Connell, Lanting, & Morgan, 2015; Davis, 2018; Duff et al., 2011; McGuire, Crawford, & Evans, 2019). Several factors may contribute to these inconsistent failure rates across studies, including sample characteristics (e.g., symptoms, demographics, IQ), psychiatric hospitalization status (e.g., inpatient vs. outpatient), forensic status (e.g., litigating vs. not), the type of effort test used (i.e., free-standing vs. embedded), the cognitive domain measured by the test, and differences in the sensitivity and specificity of the effort tests administered. To date, there has been no systematic exploration of factors moderating effort test failure rates in psychotic disorders. However, some trends within the literature suggest possible candidate moderators.

The strongest candidate for a moderator appears to be negative symptoms, as several studies point to greater effort test failure rates among those with higher negative symptoms, particularly avolition (Avery, Startup, & Calabria, 2009; Gorissen, Sanz, & Schmand, 2005; Strauss, Morra, Sullivan, & Gold, 2015). This finding is consistent with the intuitive notion that motivational deficits lead individuals with psychotic disorders to put forth low effort and perform poorly on a range of neuropsychological tests. In contrast, associations with positive symptoms have generally been nonsignificant (Foussias et al., 2015; Moore et al., 2013; Morra, Gold, Sullivan, & Strauss, 2015; Strauss et al., 2015; Whearty, Allen, Lee, & Strauss, 2015).

Forensic status may be another relevant moderator. Studies have found that neuropsychiatric patients, broadly defined, who are involved in legal proceedings are more likely to fail effort testing. For example, up to 28% of individuals diagnosed with depression who were tested in a forensic context produced invalid effort (Green et al., 2001; Mittenberg et al., 2002). Other studies have documented that the availability of secondary gains (e.g., applying for disability benefits, avoiding criminal prosecution) is associated with malingering (Belanger, Curtiss, Demery, Lebowitz, & Vanderploeg, 2005; Rohling, Binder, & Langhinrichsen-Rohling, 1995). Using a heterogeneous sample, Paradis, Solomon, Owen, and Brooker (2013) found that the incidence of psychotic disorders was higher among individuals who failed an effort test than among those who passed. Hence, involvement in a civil forensic case may increase the likelihood of effort test failure in psychotic disorders. However, because the vast majority of studies examining effort failure rates to date have been conducted in pure research environments with non-litigating individuals who had little apparent incentive to feign impairment (i.e., malinger), forensic status seems unlikely to be a strong explanation for elevated effort test failure rates in psychotic disorders.

It is unclear whether effort test type (i.e., free-standing vs. embedded) has a differential impact on failure rates in psychotic disorders. In the broader effort testing literature, there is some evidence that embedded measures have better classification accuracy (Miele, Gunner, Lynch, & McCaffrey, 2012). However, other studies warn against using embedded effort tests with cognitively impaired individuals because they may produce false positives (Slick, Hopp, Strauss, & Thompson, 1997; Teichner & Wagner, 2004). Whether failure rates in psychotic disorders differ as a function of embedded versus free-standing measures therefore remains an open question.

Demographic characteristics, such as age and education, are also known to impact effort test failure rates across a range of neuropsychiatric disorders (Rees et al., 1998; Tombaugh, 1996, 1997). Psychotic disorders have been associated with reduced personal educational attainment (Keefe, Eesley, & Poe, 2005) and accelerated aging (Kirkpatrick et al., 2008). However, it is unclear whether demographic factors might explain effort test failure rate in this population.

Finally, it has become clear that effort tests are not immune to the effects of genuine intellectual disability, and IQ may be another relevant moderator. IQ has been associated with failure rates in several neuropsychiatric conditions (Chafetz & Dufrene, 2014; Dean et al., 2008; Larrabee, 2014; Smith et al., 2014; Victor & Boone, 2007), and a growing literature suggests that this may be true in psychotic disorders as well (Dean et al., 2008; Whearty et al., 2015). To depict this association, Table 1 presents a re-analysis of two studies conducted by our group (Morra et al., 2015; Strauss et al., 2015), indicating that only participants with estimated Full-Scale IQ scores below 80 showed elevated effort test failure rates; individuals with psychotic disorders with average or higher IQ did not. Thus, psychotic disorders may be among the neuropsychiatric conditions in which intellectual deficits are severe enough to produce elevated effort test failure rates due to genuine cognitive impairment.

Table 1 Reanalysis of the Effects of Estimated Full-Scale IQ on Effort Test Performance

Together, prior studies raise several important questions:

1. What is the mean pooled effort test failure rate in individuals with psychotic disorders? Answering this question is of critical importance given the wide range of failure rates reported across studies (0–72%).

2. Which variables significantly moderate effort test failure rates in individuals with psychotic disorders? The literature reviewed above identifies several plausible moderators, including: effort test type (embedded vs. free-standing), forensic status, estimated IQ, positive and negative symptom severity, and demographic variables. Answering this question has important implications for determining which factors contribute to effort test failure and whether effort tests are valid for use in clinical and research settings in this population.

3. Is a low effort hypothesis a viable explanation for the generalized neurocognitive deficit? If such an account is viable, it would suggest that psychosocial, pharmacological, and cognitive rehabilitation approaches to treatment should target motivational processes in an attempt to enhance cognition. Few interventions for cognitive impairment have incorporated motivational elements, and the infusion of such elements would reflect a novel approach with potential to improve cognition if low effort is a contributing factor (Velligan, Kern, & Gold, 2006).

The current study evaluated the aforementioned questions via a meta-analysis of data from 2205 individuals with psychotic disorders taken from 19 studies with 24 independent effects. It was hypothesized that: 1) a nontrivial proportion (i.e., >10%) of individuals with psychotic disorders would fail effort testing; 2) effort test failure would be moderated by IQ and negative symptoms, but not by effort test type, forensic status, positive symptoms, or demographic variables; and 3) global neuropsychological impairment would be predicted by effort test failure rate, a pattern that would be consistent with a motivational account of the generalized neurocognitive deficit.

Method

Protocol and Registration

This meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement guidelines (Liberati et al., 2009; Moher et al., 2009). The review protocol was pre-registered at the PROSPERO International Prospective Register of Systematic Reviews (CRD42018086018) and can be accessed at: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=86018.

Eligibility Criteria

The following inclusion criteria were implemented: (1) empirical articles published in a peer-reviewed journal, (2) use of at least one embedded or free-standing effort test, (3) inclusion of a psychosis group, and (4) availability in English. Studies were excluded if they did not include sufficient data for effect size calculation (percent failure) and attempts to obtain the missing data from the corresponding authors were unsuccessful. Studies that included multiple psychosis groups contributed multiple independent effects. For studies in which the same participants completed multiple effort tests, a mean effect (percent failure) was calculated rather than arbitrarily selecting the failure rate from one of the effort tests reported. This method was considered necessary to ensure that only independent effects were included.

Information Sources

A publication search was performed using the PsycINFO and PubMed databases between March 17th, 2018 and July 31st, 2018. Once relevant publications were identified, Google Scholar was used to manually search for eligible studies citing the already-identified publications that did not appear in the database search.

Search

The following key terms were entered in each database: “schizophrenia” OR “schizoaffective” OR “psychosis” OR “psychotic” OR psychotic disorder AND “neuropsychological test” OR “neuropsychological test of effort” OR “performance validity” OR “symptom validity” OR “malingering” OR “faking bad” OR “feigned impairment” OR “non-credible respon*” OR “Effort” OR “Motivation”. The reference lists of relevant publications were searched manually. Exclusion criteria were: duplicate articles, theses, dissertations, reviews and meta-analyses, book chapters, and poster presentations.

Study Selection

An initial search was conducted using the specified search strategy. All titles were screened to identify relevant publications for full-text retrieval, with abstracts reviewed as needed. Full texts were retrieved and examined to confirm inclusion. For longitudinal studies, baseline scores were used. Studies that included measures that could have served as embedded measures of effort (e.g., WAIS-IV Digit Span) but did not report percent effort failure were excluded.

Data Collection Process

A list of relevant variables was first developed by the first author and then revised in consultation with the last author. Corresponding authors of relevant publications with insufficient data to extract the main effect or moderator variables were contacted by the first author with a data request; a follow-up email was sent approximately one month later. To minimize experimenter bias, three coders independently extracted main effects and related data: 18 effects were extracted consistently by all three coders and 4 effects by two of the three. An intraclass correlation coefficient of .83 was obtained, suggesting good reliability between coders. Coding discrepancies were resolved through discussion and consensus.

Data Items

The following variables were extracted from relevant publications: (1) study, (2) effort test, (3) effort test type (embedded vs. free-standing), (4) forensic status (forensic vs. non-forensic), (5) effort predicted by IQ (yes vs. no), (6) effort predicted by negative symptoms (yes vs. no), (7) sample size, (8) %failure, (9) effort predicted by positive symptoms (yes vs. no), (10) IQ measure, (11) IQ t-score, (12) premorbid IQ measure, (13) premorbid t-score, (14) depression measure, (15) depression score (%), (16) positive symptom measure, (17) positive symptom score (%), (18) negative symptom measure, (19) negative symptom score (%), (20) %treated with antipsychotic medication, (21) Chlorpromazine equivalent (mean), (22) Chlorpromazine equivalent (SD), (23) Olanzapine equivalent (mean), (24) Olanzapine equivalent (SD), (25) %first generation antipsychotic only, (26) %second generation antipsychotic only, (27) %first and second antipsychotic generation, (28) age (mean), (29) age (SD), (30) personal education (mean), (31) personal education (SD), (32) parental education (mean), (33) parental education (SD), (34) %male, and (35) %White.

Risk of Bias in Individual Studies

The Quality Assessment Tool for Observational Cohort and Cross-Sectional Studies was adapted to assess risk of bias for each included publication (National Heart, Lung, and Blood Institute, 2014). This measure was designed to assist reviewers in appraising the internal validity of a study. Using this measure, each publication was evaluated on the following questions: 1) Was the research question clearly defined? 2) Was the study population clearly specified? 3) Was the participation rate of eligible persons at least 50%? 4) Were all subjects recruited from the same or similar populations? 5) Was a sample size justification, power description, or variance and effect estimates provided? 6) Were the independent variables clearly defined, valid, reliable, and implemented consistently across all study participants? 7) Was the exposure assessed more than once over time? 8) Were the dependent variables clearly defined, valid, reliable, and implemented consistently across all study participants? 9) Was the outcome assessor blind to the exposure status of participants? 10) Were key potential confounding variables measured and adjusted statistically for their impact on the relationship between exposure and outcome? Each item received a score of 0 (yes) or 1 (no, cannot be determined, or not reported). Publications received a total risk of bias score (higher scores indicating greater risk of bias), which was used in moderator analyses.

Summary Measures

The effect size for this meta-analysis was the percent (proportion) of effort test failure. Since this effect was not normally distributed (Shapiro-Wilk test: .86, df = 24, p = .003), effects were transformed to logit units, which are approximately normally distributed with a mean of zero and a standard deviation of 1.83. All analyses were computed in logit units using the following formulas:

$$ ES=\log_e\left[\frac{p}{1-p}\right],\qquad SE=\sqrt{\frac{1}{np}+\frac{1}{n\left(1-p\right)}},\qquad W=\frac{1}{SE^{2}}=np\left(1-p\right), $$

where p is the effort failure rate and n is the total number of participants for each effect. Final results were transformed back to percentages for easier interpretation using the formulas:

$$ p=\frac{e^{x}}{e^{x}+1},\qquad \%=100p, $$

where x is the ES.

Independent effects were extracted from each study. When multiple effects were derived from the same study (e.g., the same group of participants completed multiple effort tests), the effects (in logit units) were averaged and included as a single independent effect. This procedure was established a priori and was considered necessary to include studies providing multiple effects from the same group. When single studies reported effort failure rates from independent samples, one effect was extracted and included from each sample.
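As an illustration of these computations, the following minimal R sketch implements the logit transformation, its standard error and weight, the back-transformation to percentages, and the within-study averaging described above; the failure rates and sample size are hypothetical:

```r
# Minimal sketch of the logit-unit computations described above.
# p = effort test failure rate, n = sample size (both hypothetical here).
logit_es <- function(p, n) {
  es <- log(p / (1 - p))                       # ES = log_e[p / (1 - p)]
  se <- sqrt(1 / (n * p) + 1 / (n * (1 - p)))  # SE in logit units
  w  <- 1 / se^2                               # W = 1 / SE^2 = n * p * (1 - p)
  list(es = es, se = se, w = w)
}

# Back-transformation from logit units (x) to a percentage
logit_to_pct <- function(x) 100 * exp(x) / (exp(x) + 1)

# Hypothetical study in which one sample completed two effort tests:
# the two logit effects are averaged into a single independent effect.
# (A failure rate of exactly 0 would require a continuity correction.)
p1 <- 0.15; p2 <- 0.25; n <- 60
mean_logit <- mean(c(logit_es(p1, n)$es, logit_es(p2, n)$es))
logit_to_pct(mean_logit)  # pooled within-study failure rate, in %
```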

Synthesis of Results

A mean pooled effect size was calculated using a maximum-likelihood random-effects model, as true between-study differences were expected to exceed the heterogeneity attributable to subject-level sampling error alone (i.e., due to substantial variability in symptom severity, patient characteristics, effort tests used, etc.). To examine whether the logit transformation distorted the results, we also meta-analyzed untransformed effects and checked for discrepancies between the two sets of results. We additionally meta-analyzed the transformed effects using a fixed-effects model to explore whether the two models yielded discrepant results, as previous meta-analyses have done (Martin et al., 2020; Sugawara et al., 2018). Heterogeneity between effects was tested using the Q statistic, which has a chi-square distribution with k − 1 degrees of freedom, where k is the number of independent effects (Hedges & Olkin, 1985). I² was calculated to estimate the percentage of variance attributable to true heterogeneity rather than sampling error (Borenstein, Hedges, Higgins, & Rothstein, 2011; Higgins, Thompson, Deeks, & Altman, 2003). T² was also calculated to estimate the absolute amount of true variance between effects (Borenstein et al., 2011). All statistical analyses were conducted in R (version 3.6.0, 2019-04-26) using the metafor package (version 2.1-0; Viechtbauer, 2010).
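The analyses were run in R with metafor; the following sketch shows how such a model could be fit, using hypothetical failure counts (`xi`) and sample sizes (`ni`):

```r
library(metafor)

# Hypothetical data: xi = number failing an effort test, ni = sample size
dat <- data.frame(xi = c(3, 12, 0, 25), ni = c(40, 58, 30, 120))

# Logit-transformed proportions (measure = "PLO"); by default escalc
# applies a continuity correction to studies with 0% (or 100%) failure
dat <- escalc(measure = "PLO", xi = xi, ni = ni, data = dat)

# Maximum-likelihood random-effects model; the summary reports the
# Q test of heterogeneity along with the I^2 and tau^2 (T^2) estimates
res <- rma(yi, vi, data = dat, method = "ML")
summary(res)

# Back-transform the pooled logit estimate and its CI to a proportion
predict(res, transf = transf.ilogit)
```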

Risk of Bias across Studies

A fail-safe procedure was implemented to assess publication bias (Rosenberg, 2005). In this method, the number of null effects (N+; i.e., effects at the estimated weighted failure rate), representing unpublished studies that would be needed to render the pooled estimate nonsignificant, is calculated (Rosenberg, 2005). When N+ exceeds 5n + 10, where n is the number of independent effects included in the meta-analysis, the fail-safe number is considered robust (Rosenberg, 2005). Rosenberg’s fail-safe procedure is a weighted calculation applicable to both fixed- and random-effects models. A funnel plot was also created to visually represent the retrieved effects.
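Continuing the hypothetical metafor example above, Rosenberg's weighted fail-safe number and a funnel plot could be obtained as follows:

```r
# Rosenberg's weighted fail-safe N for the logit-unit effects
fsn(yi, vi, data = dat, type = "Rosenberg")

# Funnel plot of the fitted random-effects model `res`
funnel(res)
```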

Additional Analyses

Univariate regression analyses were conducted to examine whether each of the predetermined moderator variables independently contributed to the heterogeneity in the pooled weighted effect. Categorical variables were dichotomized. The following moderator variables were considered: effort test type (0 = embedded, 1 = free-standing); forensic status (0 = non-forensic, 1 = forensic); IQ (t-scores); premorbid IQ (t-scores); depression (%); negative symptoms (%); positive symptoms (%); disorganization (%); age (years); education (years); gender (% male); and medication dose (chlorpromazine equivalent).
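A univariate moderator analysis of this kind amounts to a mixed-effects meta-regression; below is a sketch with a hypothetical study-level IQ moderator added to the metafor example above:

```r
# Hypothetical study-level IQ t-scores as a continuous moderator
dat$iq_t <- c(38, 45, 50, 42)

# Mixed-effects meta-regression: does IQ account for heterogeneity in
# logit failure rates? summary() reports the QM omnibus test and a
# pseudo-R^2 (proportion of heterogeneity accounted for).
mod <- rma(yi, vi, mods = ~ iq_t, data = dat, method = "ML")
summary(mod)
```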

Studies that included data on positive symptoms reported using the Brief Psychiatric Rating Scale-Positive (Overall & Gorham, 1962), Positive and Negative Syndrome Scale-Positive (Kay, Fiszbein, & Opler, 1987), and Scale for the Assessment of Positive Symptoms (Andreasen, 1984), whereas studies that included data on negative symptoms utilized the Brief Negative Symptom Scale (Kirkpatrick et al., 2018; Kirkpatrick et al., 2011; Strauss & Gold, 2016; Strauss et al., 2012a; Strauss et al., 2012b; Strauss, Vertinski, Vogel, Ringdahl, & Allen, 2016), Scale for the Assessment of Negative Symptoms (Andreasen, 1982), Schedule for the Deficit Syndrome (Kirkpatrick, Buchanan, McKenney, Alphs, & Carpenter Jr., 1989), Positive and Negative Syndrome Scale-Negative (Kay et al., 1987), or Brief Psychiatric Rating Scale-Negative (Overall & Gorham, 1962). Since there is no single measure for assessing symptom severity in these dimensions of psychopathology in psychotic disorders, composite scores were calculated to put each symptom severity score on the same metric across scales by dividing the total symptom score observed by the maximum possible score for that scale or subscale. Similar methods have been used in past meta-analyses (Fusar-Poli et al., 2013; Llerena, Strauss, & Cohen, 2012; Wykes, Huddy, Cellard, McGurk, & Czobor, 2011).
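A sketch of this composite scoring, with hypothetical totals and scale maxima (the true maxima depend on the specific scale versions used):

```r
# Each observed total is divided by the maximum possible score for its
# scale, putting all scales on a common percent-of-maximum metric.
observed <- c(scale_a = 28, scale_b = 60)   # hypothetical symptom totals
maximum  <- c(scale_a = 49, scale_b = 125)  # hypothetical scale maxima
round(100 * observed / maximum, 1)          # composite severity, in %
```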

To evaluate the association between neuropsychological effort and global cognitive performance, correlations between effort and global cognition reported in eligible publications were meta-analyzed using Fisher’s r-to-z transformation (Fisher, 1921).
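A minimal sketch of this correlation meta-analysis in metafor, with hypothetical effort-cognition correlations (`ri`) and sample sizes (`ni`):

```r
# Fisher r-to-z transformed correlations (measure = "ZCOR")
cor_dat <- data.frame(ri = c(0.45, 0.60, 0.52), ni = c(50, 80, 65))
cor_dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = cor_dat)

# Random-effects model in z units, back-transformed to r for reporting
cor_res <- rma(yi, vi, data = cor_dat, method = "ML")
predict(cor_res, transf = transf.ztor)  # pooled r with 95% CI
```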

Results

Study Selection

An electronic search using the aforementioned keywords within the title produced a total of 364 publications (PsycINFO = 188; PubMed = 178). After duplicates were removed, 228 unique publications were examined for eligibility. Of those, irrelevant publications were excluded via title (n = 164), abstract (n = 36), or full-text (n = 16) review, leaving 12 eligible publications. Additionally, publications citing each of the relevant studies were identified via Google Scholar and screened for eligibility (n = 1073) using the same inclusion criteria, yielding seven additional eligible publications. In total, 19 studies were included and 24 independent effects were extracted (see Fig. 1 for a flowchart of the literature search).

Fig. 1 Flowchart of the Literature Search Process

Study Characteristics

Tables 2 and 3 report study characteristics, including sample size, demographic variables, effort test used, effort test type, patient status, group type, and percent failure rate (raw and logit-transformed). In total, 19 unique articles providing k = 24 independent effects were included in this meta-analysis, with a total sample of 2205 individuals with psychotic disorders (M = 91.9, SD = 123.9, median = 58, range = 20–595 per sample). Of the 16 samples that reported ethnicity, 598 (42%) individuals were White. Of the 22 samples that reported gender, 1301 (70%) participants were male. Of the 22 samples that reported age, the mean age was 39.3 (SD = 10.4). Eighteen samples reported years of personal education, which averaged 11.8 (SD = 2.9). Data from 10 studies indicated that 93.5% of patients were medicated with an antipsychotic. Three studies reported medication dose and four reported medication type. The effort test most frequently used was the Test of Memory Malingering (TOMM; 7 studies). The Reliable Digit Span was used in six studies. Finger Tapping, the Word Memory Test, and the RBANS Effort Index were each used in three studies. Dot Counting, CVLT Forced-Choice, Validity Indicator Profile-Nonverbal, Validity Indicator Profile-Verbal, Rey 15-Item, and the Victoria Symptom Validity Test were each used in two studies. Other effort measures used in a single study included: Hiscock Forced-Choice, Digit Span Age-Corrected, Rey Word Reading Test, Rey/Osterrieth Effort Equation, RAVLT Effort Equation, Recognition Memory Test, Sentence Repetition, Logical Memory Test Rarely Missed Items, Rey Complex Figure Test Recognition, and the 21-Item Test French Version.

Table 2 Sample Characteristics of Studies Included in the Meta-Analysis
Table 3 Details of Studies Included in Meta-Analysis

Risk of Bias within Studies

The potential risk of bias within studies was examined in multiple ways. First, all publications received a Quality Assessment Tool total score (0–10; higher scores = higher risk of bias); the majority of studies obtained 3 or fewer points and none obtained more than 5, suggesting adequate internal validity. Second, a moderator analysis revealed that the Quality Assessment Tool total score was not a statistically significant moderator of effort failure (R² = .15, β = .44, SE = .27, z = 1.66, p = .10). Third, when the four effects with the highest Quality Assessment Tool total scores were excluded, results were similar (effort failure = 15%, 95% CI: 10–22%, Q = 154.15, p < .0001). Therefore, risk of bias within studies was considered low.

Synthesis of Results

Percent failure rates across studies ranged from 0.0 to 72.0%. Across 24 effects from 19 studies, the meta-analysis yielded a pooled weighted effort test failure rate of 18% (95% CI: 12–26%; see Fig. 2). We also meta-analyzed untransformed effects to examine the potential distributional impact of the logit transformation, and results revealed a similar pooled failure rate (20%; 95% CI: 13–27%). Exploratory analyses revealed a similar failure rate when implementing a more conservative fixed-effects model for comparison (pooled failure rate = 22%; 95% CI: 20–24%). The weighted effort failure rate across studies using the logit method indicated that a nontrivial proportion of patients with a psychotic disorder failed an effort test. A test of heterogeneity indicated significant heterogeneity across the effects (Q = 243.0, p < .001). I² was 93% and T² was 1.12, suggesting that a significant amount of variance within the effects could be accounted for by true heterogeneity (Higgins et al., 2003) rather than sampling error alone.

Fig. 2 Forest Plot. Note. Forest plot of 24 independent effects extracted from 19 publications. Each effect includes first author, publication year, proportion (failure rate), and 95% confidence interval. The mean pooled proportion is located at the bottom

A sensitivity analysis was conducted to examine whether the pooled mean failure rate changed significantly after systematically removing each of the effects. Results indicated little change in the pooled effect size. Although heterogeneity dropped when the largest effect was excluded from the analysis (Gorissen et al., 2005), the pooled effect was similar and heterogeneity remained large and significant (failure rate = 17%; 95% CI: 12–26%; Q = 178.4, p < .001).

Risk of Bias across Studies

Publication bias across studies was assessed using the fail-safe N+ test (Rosenberg, 2005). N+ is considered robust when it exceeds 5k + 10, or 5 × 24 + 10 = 130 (Rosenberg, 2005); the obtained N+ of 787 therefore suggests that this meta-analysis is unlikely to be affected by publication bias. A funnel plot is presented in Fig. 3, which displays asymmetry across the effects. When significant heterogeneity is expected and a sample includes a small number of effects, asymmetry is thought to reflect true heterogeneity due to moderator variables rather than publication bias (Sterne, Gavaghan, & Egger, 2000).

Fig. 3 Funnel Plot. Note. Funnel plot illustrating log odds (x-axis) by standard error (y-axis)

Moderator Analyses

Moderator analyses were conducted for each of the predetermined moderator variables using univariate regression analyses when at least four studies provided sufficient data. The following moderator variables were not tested because the data extracted from the included studies were insufficient: premorbid IQ, depression, medication dose (chlorpromazine equivalent), and disorganization.

As shown in Table 4, the following moderators were nonsignificant: patient status, forensic status, negative symptoms, positive symptoms, age, gender, and % prescribed an antipsychotic. However, IQ was a significant moderator, such that patients with lower IQ were more likely to fail effort testing, suggesting that lower intellectual functioning is associated with a higher effort test failure rate in individuals with psychotic disorders. Similarly, personal education was a significant moderator, with lower education associated with a higher failure rate. A sensitivity analysis indicated that IQ and education remained significant moderators even when each sample was systematically removed from the analysis. Exploratory analyses examined whether the proportion of nonaffective (schizophrenia, psychotic disorder, psychosis NOS) versus affective psychosis (schizoaffective disorder, bipolar disorder with psychosis, depression with psychosis) or percent medicated independently moderated failure rates; neither achieved statistical significance.

Table 4 Moderator Analyses

Does Effort Predict the Generalized Neurocognitive Deficit?

An analysis was conducted to examine whether effort test failure rate is associated with global neuropsychological functioning in order to evaluate the motivational account of the generalized neurocognitive deficit. Correlations between effort test failure rate and global neurocognitive performance from five independent effects were meta-analyzed using the Fisher r-to-z method. Results revealed a significant association between effort and global cognition (r = .57; 95% CI = .44–.70; p < .0001), explaining a significant amount of variance (R² = .32). Because global neurocognition and effort were assessed using the same instrument (the RBANS and RBANS Effort Index) in Moore et al. (2013), the analysis was repeated with that study removed; the correlation remained significant (r = .49; 95% CI = .36–.61; p < .0001; R² = .24). This result should be interpreted with caution given that IQ was found to be a significant moderator, and effort test performance may therefore reflect genuine cognitive impairment rather than true motivational deficits.

Discussion

Failure Rate

The first aim of this meta-analysis was to quantitatively synthesize the failure rate of effort test performance in psychotic disorders. Findings from 2205 individuals with psychotic disorders taken from 19 published studies with 24 independent effects indicated a pooled failure rate of 18%. This pooled failure rate was similar across transformed and untransformed effects and across fixed- and random-effects models. Furthermore, when only effects derived from non-forensic, non-litigating samples were meta-analyzed (k = 17), a pooled mean failure rate of 16% was obtained, indicating similar failure rates when patients have no clear incentive to feign impairment. These findings suggest that a nontrivial proportion of individuals with psychotic disorders fail effort testing and that malingering is unlikely to account for the majority of observed failures.

Significant Moderators of Failure Rate

To address the reasons why individuals with psychotic disorders fail effort testing, the second aim of the study was to evaluate moderators of failure rate. IQ and personal education were found to be significant moderator variables, explaining 91% and 32% of the variance, respectively.

Evidence that IQ moderates effort test failure rates does not lead to a single clear interpretation. It is possible that low effort resulting from a motivational problem leads people with psychotic disorders to perform poorly on IQ and other cognitive tests. Alternatively, low IQ or genuine cognitive impairment may lead them to fail effort tests simply because effort tests are themselves cognitive tests on which genuinely impaired people perform poorly. A third possibility is that a combination of genuine cognitive impairment and low motivation leads to effort test failure in psychotic disorders.

Unfortunately, moderation analyses alone cannot clarify which of these explanations is most viable. One way these competing explanations have been addressed in the literature is by calculating an “easy-hard” difference score on effort tests that allow such comparisons (Green, Montijo, & Brockhaus, 2011; Howe, Anderson, Kaufman, Sachs, & Loring, 2007). Using such difference metrics, dementia patients with genuinely profound cognitive impairment have been found to show large difference scores (i.e., they perform much worse on the hard than the easy items), whereas patients exerting low effort tend to perform equivalently on the easy and hard items (i.e., they score low on both).

Because we could not obtain data on easy and hard effort sub-test conditions from the studies included in this meta-analysis, we were unable to examine meta-analytically whether profound cognitive impairment or low effort primarily moderates effort test failure rates in psychotic disorders. However, we were able to evaluate data from our two-experiment paper that used the VSVT and WMT in schizophrenia patients and healthy controls (Strauss et al., 2015) to draw some preliminary conclusions. On the VSVT, the control group demonstrated negligible differences between easy and hard conditions (24 vs. 23.7), whereas the psychotic disorder group demonstrated a larger difference (23.6 vs. 21.9). However, no subjects failed the VSVT in either sample, making it a less than ideal comparison for this purpose. On the WMT, which 15.2% of individuals with psychotic disorders failed, the difference between easy (mean of Immediate Recognition, Delayed Recognition, and Consistency) and hard (mean of Multiple Choice, Paired Associates, and Free Recall) conditions was even more pronounced in psychotic disorders (93.4% easy vs. 56.9% hard). In comparison, controls scored 98.7% on easy and 87.2% on hard conditions. This magnitude of difference in WMT performance in psychotic disorders (36.5 percentage points) is comparable to that observed in cases of severe dementia (Green et al., 2011), consistent with the notion that individuals with psychotic disorders fail effort tests due to genuine cognitive impairment.

However, these group-level results may be misleading. To further explore this “profile analysis,” we compared individuals with psychotic disorders who failed versus passed the WMT in our sample. When those who failed the WMT (easy = 80%, hard = 31%; difference = 49 percentage points) are separated from those who passed (easy = 96%, hard = 62%; difference = 34 percentage points), it becomes clear that those who failed performed very poorly on both easy and hard subtests and showed an even larger difference score than the patients who passed. This profile analysis speaks strongly against a pure motivational account, or a combined motivational and genuine cognitive impairment account, of why individuals with psychotic disorders fail effort tests. If a motivational account were supported, one would expect a smaller, not larger, difference score in the patients who failed compared to those who passed. The opposite was true, suggesting that individuals with psychotic disorders fail effort tests due to genuinely severe cognitive impairment.

To further explore whether people with psychotic disorders fail effort tests due to low IQ alone or due to a combination of low IQ and motivational deficits, we conducted supplemental mediation analyses on three of the studies in the meta-analysis that came from our lab (Morra et al., 2015; Strauss et al., 2015; Whearty et al., 2015; see Supplemental Materials for full details). Three mediation models were constructed to determine whether: 1) IQ mediated the association between global neurocognitive impairment and effort test failure, 2) motivational symptoms (avolition measured via a negative symptom clinical rating scale) mediated this association, and 3) the combination of motivational symptoms and IQ mediated this association. Results indicated that IQ significantly mediated the association between global neurocognitive impairment and effort test failure, whereas motivational symptoms did not. The combined mediation of IQ and motivational symptoms was significant in one of the three studies, and that model was driven by IQ. When coupled with the supplemental profile analyses (easy-hard discrepancies), these findings suggest that effort test failure in psychosis samples primarily results from low IQ or genuine cognitive impairment rather than from low motivation or a combination of the two. Thus, individuals with psychotic disorders may fail effort tests because they have genuine cognitive impairments, and effort tests are simply another type of cognitive test on which they perform poorly due to genuine intellectual deficits.

Nonsignificant Moderators of Failure Rate

The observation that other moderators were nonsignificant is also noteworthy. A number of studies have found that greater negative symptom severity is associated with an increased likelihood of failing effort testing (Avery et al., 2009; Foussias et al., 2015; Gorissen et al., 2005; Morra et al., 2015; Strauss et al., 2015), leading us to hypothesize that negative symptoms would be a significant moderator. The null result is therefore surprising. However, this finding should be interpreted with caution, given that the majority of past studies used first-generation negative symptom measures, such as the Brief Psychiatric Rating Scale (BPRS: Overall & Gorham, 1962) and Positive and Negative Syndrome Scale (PANSS: Kay et al., 1987), that do not measure the construct according to modern conceptualizations. Associations between effort and negative symptoms have been stronger in studies using second-generation negative symptom rating scales, such as the Brief Negative Symptom Scale (BNSS: Kirkpatrick et al., 2011; Kirkpatrick et al., 2018; Strauss & Chapman, 2018; Strauss & Gold, 2016; Strauss et al., 2012a; Strauss et al., 2012b; Strauss et al., 2016; Strauss et al., 2018a; Strauss et al., 2018b). It is not surprising that first-generation scales would show little relation to effort: they emphasize the more expressive aspects of negative symptom pathology (e.g., alogia, blunted affect) and neglect the apathetic dimension (avolition, anhedonia, asociality) that should be more theoretically related to effort. Thus, it is unclear whether the nonsignificant negative symptom effect is reliable. Future studies should utilize the most conceptually updated measures (BNSS and CAINS; Kirkpatrick et al., 2011; Kring, Gur, Blanchard, Horan, & Reise, 2013) to most adequately test the role of negative symptoms in low effort.

Furthermore, the nonsignificant effects observed for other moderators are also meaningful. It is easy to imagine how delusions and hallucinations could interfere with cognitive testing, for example, by distracting patients with internal or external stimuli and thereby leading them to perform poorly and fail effort tests. However, neither positive symptoms nor inpatient status (where positive symptoms are most pronounced) was a significant moderator, suggesting that the more florid aspects of psychotic psychopathology contribute minimally to failure rates. Furthermore, demographic factors that have been associated with poor cognitive test performance in other disorders (e.g., age) were not significant moderators.

Is the Motivational Account of the Generalized Neurocognitive Deficit Viable?

The third aim of the study was to determine whether a motivational account of the generalized neurocognitive deficit might be viable. To examine this question, correlations between effort and global cognitive functioning were meta-analyzed from five independent effects. Results indicated that low effort accounted for 32% of the variance in global neurocognitive functioning. At first glance, these findings appear to support the motivational account of the generalized neurocognitive deficit. However, caution must be applied when interpreting this result given that IQ also moderated effort test failure. Because effort tests may lack validity in the psychotic disorder population, the finding that effort test failure predicts global cognitive test scores may merely reflect that poor performance on one cognitive test is associated with poor performance on another; the correlation may have little to do with effort per se.

To evaluate this possibility directly, we conducted further supplemental mediation analyses on the data from the three studies previously published by our group (Morra et al., 2015; Strauss et al., 2015; Whearty et al., 2015; see Supplemental Materials). Mediation models examined whether: 1) IQ mediated the link between effort and global neurocognitive impairment and 2) effort mediated the link between IQ and global cognitive impairment (see Supplemental Materials). Findings indicated that IQ was either a partial or full mediator between effort and global cognitive impairment in all three studies, whereas the model evaluating whether effort mediated the association between IQ and global cognitive impairment was significant in only one of the three studies and nonsignificant (at trend level) in the other two. These findings offer at best limited support for the motivational account of the generalized neurocognitive deficit. Moreover, it is hard to have much confidence in such conclusions since they are drawn from neuropsychological effort tests, which may not be ideal for testing the viability of the motivational account because their failure rates are heavily influenced by intellectual ability.

To accurately evaluate the question of whether low effort accounts for the generalized neurocognitive deficit in psychotic disorders, alternative approaches to assessing effort are needed, such as those developed in cognitive neuroscience (Kool, McGuire, Rosen, & Botvinick, 2010; Westbrook, Kester, & Braver, 2013). Using such tests, there is some evidence that poor cognition is associated with deficits in effort-cost computation, which are driven by motivational symptoms (Bismark et al., 2018; Gold et al., 2013). Cognitive neuroscience-based effort tests could be adapted for clinical purposes, standardized, normed, and tested to see if they have clinical utility for detecting the type of low effort that results from motivational impairments that are expected to contribute to the generalized neurocognitive deficit in psychotic disorders.

Limitations

Certain limitations should be noted. First, the studies included in this meta-analysis utilized several types of embedded and free-standing tests that assessed effort via performance in specific cognitive domains (e.g., memory, attention, visual perception) or behaviors (e.g., finger tapping). Although these tests theoretically measure the same construct (i.e., effort), differences in sensitivity and specificity across measures might impact performance differently depending on patient characteristics. Prospective meta-analyses of specific effort tests would be informative. Second, results should be interpreted with the caveat that only a limited number of effects were available for certain moderator analyses. Additionally, a lack of data on premorbid IQ, disorganization, medication type and dose, and depression prevented examination of these variables as moderators. Third, the approach taken to standardize positive and negative symptom severity across measures is not ideal; however, in the absence of a universal measure of positive and negative symptom severity, this approach was considered necessary to examine moderating effects of symptom severity on effort test failure rate. Relatedly, we were unable to examine the unique moderating effects of the most relevant negative symptom dimension (motivation/pleasure) because only total negative symptom scores were available. Future studies may be able to address this question using more conceptually up-to-date rating scales (BNSS, CAINS). Lastly, we were unable to address the effects of illness stage due to an insufficient number of first-episode versus multi-episode samples.

Conclusions and Implications

In summary, across a heterogeneous sample of studies, patients with psychotic disorders fail standard effort tests at elevated rates. Effort test failure is a significant predictor of global neuropsychological impairment. However, this association should not be taken as strong support for the motivational account of the generalized neurocognitive deficit, given that effort test failure rates are influenced by genuine cognitive impairment in psychotic disorders. Thus, new effort tests are needed that are specifically designed to accommodate the cognitive capacity limitations of this population.

Our findings have important implications for the use of effort tests in clinical practice. Effort tests are now widely used to determine performance validity during neuropsychological evaluations. It is important for these tests to be valid because they inform legal decisions pertaining to external benefits (e.g., disability status, compensation) and criminal sentencing. However, the moderation and supplemental mediation analyses reported here suggest that effort tests are not immune to the effects of low IQ in this population. Individuals with psychotic disorders appear to fail effort tests because they have genuine cognitive deficits, and effort tests are simply cognitive tests that are not immune to true cognitive impairment or low IQ. Similar findings have been reported in other populations with more severe cognitive impairment, such as Alzheimer’s dementia (Bianchini et al., 2001; Merten et al., 2007; Singhal et al., 2009; Teichner & Wagner, 2004; Tombaugh, 1996), and have led to the conclusion that there are “limits to effort testing” and to the populations in which these tests should be used (Green et al., 2011; Merten et al., 2007). Our results suggest that psychotic disorders should be added to the list of disorders with severe cognitive impairment for which there are known “limits to effort testing.” In situations where such tests must be used with individuals with psychotic disorders, it may be beneficial to administer multiple effort measures and engage in a multi-faceted decision-making process that draws on as many sources of information as are available, for example, clinical interview, behavioral observations, referral information, whether the examinee belongs to a non-credible risk group, and effort test performance (Lezak, Howieson, Bigler, & Tranel, 2012; Rickards, Cranston, Touradji, & Bechtold, 2018; Slick, Sherman, & Iverson, 1999).