Over-reporting of depressive symptoms needs to be assessed carefully in civil forensic settings. Indeed, the duration of absence from work is typically longer for cases of major depression than it is for cases of serious medical problems such as back pain, hypertension, diabetes mellitus, and heart disease (Druss et al. 2000). Moreover, workers’ compensation claimants (Repko and Cooper 1983), personal injury claimants (Lees-Haley 1997), and military veterans seeking disability compensation (Frueh et al. 1996; Smith and Frueh 1996) all typically report significant depressive symptoms in their evaluations. In addition, possibly because everyone has experienced low mood at some point in life, and information about depressive symptoms is readily accessible to anyone (Lees-Haley and Dunn 1994), major depression symptoms can be feigned easily (Bagby et al. 2000; Nicholson and Martelli 2006; Steffan et al. 2003). In fact, it has been estimated that about 15% of the depressive syndromes diagnosed in litigation or compensation cases are likely feigned (Mittenberg et al. 2002).

To assess the credibility of depression-related presentations, practitioners should always employ multiple sources of information and multiple tests (Boone 2009; Bush et al. 2005; Heilbronner et al. 2009; Iverson 2006; Larrabee 2008). Several stand-alone symptom validity tests (SVTs) and performance validity tests (PVTs) are available for that purpose. The Test of Memory Malingering (TOMM; Tombaugh 1996, 1997), Word Memory Test (WMT; Green et al. 1996), and Rey 15-item Memorization Test (RMT; Lezak 1995) are three popular examples of PVTs. The Structured Inventory of Malingered Symptomatology (SIMS; Smith and Burger 1997; Widows and Smith 2005) is a popular example of an SVT (Dandachi-FitzGerald et al. 2013; Martin et al. 2015). Additionally, several multiscale personality inventories that include one or more validity indicators designed to detect atypical response styles and exaggeration are available. Among them, the most investigated one for malingered depression issues (Nicholson and Martelli 2007) is probably the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway and McKinley 1940), in its most updated versions, i.e., the MMPI-2 (Butcher et al. 2001) and MMPI-2-RF (Ben-Porath and Tellegen 2008).

Three MMPI-2 scales are particularly useful for assessing symptom validity: F (Infrequency), Fb (Back Infrequency), and Fp (Infrequency-Psychopathology). The F scale was originally designed to measure atypical responding, which occurs in cases of random responding or poor understanding of the meaning of the items (Friedman et al. 2015). Because its items address uncommon or deviant behavior, however, elevations of F have been commonly used as an indicator of over-reporting or exaggeration. The Fb scale was designed to operate similarly to the F scale, i.e., to detect divergences from normality. Its focus, however, is on the second half of the inventory (Friedman et al. 2015). Thus, what characterizes Fb is that it is sensitive to possible shifts in the respondent’s attitude, for example, due to fatigue or poor cooperation during the latter part of the test. Additionally, F and Fb differ in their content: While the former mainly addresses psychosis-related problems, the latter focuses more on acute distress and depression or low self-esteem issues. Lastly, the Fp scale was developed by Arbisi and Ben-Porath (1995a, 1995b) to help practitioners disentangle whether a high score on F reflects a “faking bad” response set or other phenomena such as random responding, poor reading ability, or severe psychopathology (Friedman et al. 2001). Indeed, while F includes items that are endorsed rarely by healthy controls, Fp is comprised of items that are endorsed rarely by both healthy controls and psychiatric patients with known psychopathology. As such, F elevations associated with low Fp scores are deemed to indicate random responding, poor reading ability, or genuine, but severe disturbance, whereas high F scores with high Fp scores might instead suggest an over-reporting or faking bad attitude.

While each of these three validity scales provides useful and unique information, the Fp scale is currently considered the strongest MMPI-2 scale for discriminating bona fide from feigned psychopathology. Indeed, a meta-analysis of 76 MMPI-2 studies (Rogers et al. 2003) indicates that, although the average effect size across studies was slightly higher for F (d = 2.21) than for Fp (d = 1.90), the cut-off scores of Fp were much more stable across the different diagnostic targets taken under consideration. Along the same lines, when compared to F and Fb, Fp yielded more similar effect sizes from one investigation to another. Conversely, the empirically derived cut-off scores of F were quite variable across the different studies, ranging from > 8 to > 30, and the average effect size of Fb was remarkably lower (d = 1.62) compared to both F and Fp. Based on these findings, Rogers et al. (2003) “recommended the Fp as the primary scale for the assessment of feigning” (p. 173) and “questioned the routine use of Fb” (p. 160).

With the introduction of the briefer MMPI-2-RF, the Fb scale was not retained, and the F and Fp scales were slightly revised to adjust to the new format of the test (which has decreased from 567 to 338 items) and to the newly collected normative reference data. Named “F-r” and “Fp-r,” these revised counterparts of the MMPI-2 F and Fp scales remained highly consistent with their MMPI-2 predecessors. Indeed, while F-r addresses possible divergences from normality, Fp-r addresses possible divergences from both healthy controls and bona fide psychiatric patients.

It is noteworthy that examination of research on the MMPI-2-RF validity scales leads to conclusions similar to those described above. In fact, a recent meta-analysis of 30 studies by Sharf et al. (2017) suggests that Fp-r may be superior to all other MMPI-2-RF scales for several reasons. First, unlike F-r, which exhibited marked elevations in some bona fide patients affected by mixed diagnoses, major depression, or somatoform disorders (i.e., false positive results), Fp-r was highly specific in all of the studies included in the meta-analysis, with small variations from one tested condition to another. Second, unlike all other MMPI-2-RF validity scales, Fp-r continued to prove highly useful in the only study (Sellbom and Bagby 2010), among those included in the meta-analysis, that compared coached simulators against clinical samples. Third, its effect sizes, receiver operating characteristic (ROC) curves, and empirically derived cut-off scores were particularly stable from one study to another.

All in all, however, the F-r scale also showed some merit in this meta-analytic report. Indeed, its average effect size was d = 1.15, when comparing all feigners (n = 2575) against all genuine patients (n = 1836) taken into consideration. Furthermore, Sellbom et al. (2010) suggested that F-r may provide some incremental validity over Fp-r in criminal forensic settings, where malingerers likely present complaints in multiple, rather than one, domains (i.e., psychopathology, cognitive, and somatic).

Both MMPI-2 F and Fp scales, as well as their MMPI-2-RF counterparts F-r and Fp-r, are effective because malingerers likely do not fully know what symptoms are common versus rare for a given psychopathological condition (Greene 2000). More specifically, Fp and Fp-r measure the extent to which a test-taker endorses rare symptoms, i.e., symptoms that are infrequent among both healthy controls and psychiatric patients, and F and F-r measure endorsement of quasi-rare symptoms, i.e., symptoms that are infrequent in the normative nonclinical samples but may not be so infrequent among bona fide patients, especially if affected by severe psychopathology. The main idea is that elevation of these scales should raise concerns as to whether a given presentation is credible or not, as it is rather unlikely to find high scores on these scales if the test-takers have answered honestly. Although both the MMPI-2 and MMPI-2-RF use some other detection strategies too (e.g., erroneous stereotypes, obvious-subtle, symptom selectivity), the rare-symptoms detection strategy presently appears to be the most effective one in the assessment of feigned mental disorders (Sharf et al. 2017; Rogers et al. 2003). As reviewed above, both MMPI-2 and MMPI-2-RF meta-analytic studies indicate that the MMPI-2 Fp and its MMPI-2-RF counterpart Fp-r produce by far the most stable and satisfactory results across studies. No other MMPI-2 or MMPI-2-RF validity scale reaches similar levels of effectiveness across different studies.

The Current Study

Nowadays, virtually all researchers and practitioners would agree that the clinical determination of malingering should never rely on a single measure and should instead use multiple instruments, possibly implementing different feigning detection strategies (Boone 2009; Bush et al. 2014; Bush et al. 2005; Chafetz et al. 2015; Rogers 2008; Rogers and Bender 2018). To that end, it might be argued that a tool that could prove particularly useful, when used in combination with the MMPI instruments, is the Inventory of Problems – 29 (IOP-29; Viglione et al. 2017). Comprised of only 29 items, the IOP-29 was indeed designed specifically to provide incremental validity over the classic, MMPI-based, rare-symptom approach scales (Viglione, Giromini et al. 2018).

Rather than focusing on rare-symptoms endorsement, the IOP-29 addresses the subjective experience of the test-taker concerning his or her ability to deal and cope with his or her problems. For example, instead of asking whether or not the respondent has problems falling asleep, it investigates whether s/he feels like there is anything s/he can do about it, whether s/he feels like s/he bears some responsibility for that problem, and so on. Furthermore, in addition to the classic “True” versus “False” response options, the IOP-29 also offers a third possible choice: “Doesn’t make sense.” This is because accumulating experience in the field indicates that feigners may at times present themselves with some confusion, cognitive deficiency, and resistance to the evaluation (Rogers 2008), which may be well captured by this type of response option (Viglione et al. 2017). Along the same lines, in addition to 26 self-report items, the IOP-29 also presents three cognitive, or PVT items, which also contribute to make the IOP-29 a very different tool, compared to the MMPI instruments. For all these reasons, we hypothesized that using the MMPI together with the IOP-29 would provide some useful incremental validity, over using either instrument alone. The current study tested this hypothesis by administering the MMPI-2 and IOP-29 to a sample of patients with depression-related disorders and to a sample of experimental malingerers (expMAL) instructed to feign depression.

Method

Three different Italian samples contributed to this research. A first sample included 36 psychiatric patients diagnosed with and in treatment for major depressive disorder (MDD) or adjustment disorder with depressed mood (ADDM). A second sample was comprised of 28 adult individuals who met the following three criteria: (1) they had been referred to psychiatric and psychological units of a public hospital for work-related stress issues; (2) they had received a diagnosis of MDD or ADDM; (3) their symptom presentation was deemed to be highly credible. The third sample was comprised of 100 nonclinical adults instructed to feign depressive symptoms elicited by a work-related accident. Thus, a total of 64 patients with depression and 100 expMAL contributed to this study. All signed an informed consent form, and the procedures of this project were reviewed and approved by the applicable ethical committees. Data collection began in March 2018, when the IOP-29 was officially made available to practitioners, and ended in December 2018.

Participants and Procedures

All participants were native Italian-speaking adults, who defined themselves as “Italian” or “Caucasian.” As such, all materials were administered in Italian, consistent with standard Italian practice. In addition, because all had completed at least middle school, their reading abilities were considered adequate for filling out both the MMPI-2 and IOP-29.

Depressed Patients in Treatment

All individuals included in this sample (n = 36) were consecutive adult patients from a psychiatric outpatient clinic located in the North of Italy. Two thirds (n = 24) were referred to this clinic for the first time for psychological assessment and treatment purposes, whereas 12 had been in treatment for months (with SSRI antidepressants and, in some cases, benzodiazepines) and, at the time when the MMPI-2 and IOP-29 were administered, were considered to be in remission. In all cases, the diagnoses of MDD and ADDM had been formulated by the two chief psychiatrists of the clinic via clinical interview, after consulting with each other. For the majority of the sample, the presented depressive symptoms were not considered to be particularly severe.

Twenty-two (i.e., 61.1%) of the patients included in this sample were women, average age was 50.1 (SD = 14.0), and average number of years of education was 12.8 (SD = 3.5).

Depressed Patients with Work-Related Stress

Individuals included in this sample were depressed patients evaluated for possible exaggeration and considered highly unlikely to be malingerers. Because all had external incentives to look depressed (e.g., lawsuits in progress), they were first evaluated through an extensive clinical interview by a medical doctor in the Occupational Health Unit of a hospital located in the North of Italy. Then, if this doctor believed that their complaints were bona fide, they were sent to a different unit of the same hospital for a second clinical interview, this time performed by a psychiatrist. Diagnoses of MDD and ADDM were formulated on this occasion. Then, all of these patients returned to the Occupational Health Unit, where two experienced psychologists conducted another extensive clinical interview and reviewed, together with the doctor from the first interview, all relevant information concerning the cases, including clinical histories and any potentially useful materials such as emails and photos. This thorough, three-step examination concluded with the identification of 28 patients deemed to be genuinely affected by MDD or ADDM. All individuals who did not receive one of these psychiatric diagnoses or whose symptom presentation was not considered fully credible were excluded from the current study.

The administration of both MMPI-2 and IOP-29 occurred at the end of this three-step examination. Slightly more than half of this genuinely depressed sample (i.e., 15, or 53.6%) were women, average age was 48.9 (SD = 8.3), and average number of years of education was 14.8 (SD = 3.2).

Experimental Malingerers

A nonclinical sample comprised of 100 adult participants instructed to feign depression also contributed to this research. These were recruited via convenience and snowball sampling procedures in various Italian cities (mainly located in the North of Italy). Inclusion criteria required being 18 years of age or older, not having been diagnosed with any major psychiatric disorders, and being able (and willing) to read and sign an informed consent form. In line with standard guidelines on how to conduct a simulation study (Rogers and Bender 2018), all were given a vignette depicting a situation in which a person might decide to fake depression, a brief list of symptoms characterizing this psychopathological condition, a cautionary statement “not to over-do it” or else their performance would not be believable, and a small economic incentive to do their best to successfully feign depression without looking like feigners (see Appendix 1). Lastly, at the end of the experiment, they were asked about their feigning strategies, so as to ascertain that everyone had followed the instructions. In terms of demographic variables, 62 (i.e., 62.0%) were women, average age was 51.0 (SD = 17.0), and average number of years of education was 14.0 (SD = 3.7).

Measures

The Minnesota Multiphasic Personality Inventory-2 (Butcher et al. 2001)

The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) is probably the most popular measure of general psychopathology for forensic and psychiatric assessment. It is comprised of 567 “True” or “False” items, and offers several validity and clinical scales, as well as content components and supplementary scales. In this study, the official Italian version of the MMPI-2 was used (Pancheri and Sirigatti 1995).

As reviewed above, among all MMPI-2 validity scales addressing negative response bias, Fp is probably the most supported one, from an empirical standpoint, but F and—to a lesser extent—Fb have some merit too. According to Rogers et al.’s (2003) meta-analysis, optimal cut scores for these F scales may be F raw > 20 or F raw > 24; Fb raw > 18 or Fb raw > 20; and Fp raw > 7. It should be noted that the MMPI-2 is generally considered too long and time consuming to be used as a screening test for malingering. It follows then that all these cut scores favor specificity (less than 5% or 2% of patients should be classified as feigning) over sensitivity, which would be favored in screening tests.

The Inventory of Problems-29 (Viglione et al. 2017)

The Inventory of Problems-29 (IOP-29) is a relatively new, self-administered test, comprised of 27 “True,” “False,” or “Doesn’t make sense” items and two open-ended cognitive items. Its chief feigning scale, the False Disorder Probability Score (FDS), was derived from logistic regression, and therefore, it consists of a probability score. More specifically, the IOP-29 FDS provides the likelihood that a given IOP-29 comes from a sample of experimental feigners versus a sample of bona fide patients, when the a-priori expectations are 50% and 50%. The higher the score, the more likely the score represents noncredible complaints. In this study, the cross-culturally adapted version of the IOP-29 for use with Italian populations has been used (Giromini et al. 2018).

According to the results of a clinical comparison simulation study conducted by Giromini et al. (2018), despite it having only 29 items, the IOP-29 offered a better classification accuracy compared to the 75-item SIMS. Furthermore, two recent studies from Portugal (Giromini et al. 2019a) and Italy (Giromini et al. 2019b) have shown that the IOP-29 FDS may be similarly sensitive to different types of mental health complaints, such as those related to depression, PTSD, psychosis, or mild traumatic brain injury. As a general principle, an FDS ≥ .50 should offer the best balance between sensitivity and specificity, offering an average overall correct classification percentage of about 80%. Because the IOP-29 is so short, however, one could also use it as a screening instrument. If that were the case, cut scores of FDS ≥ .30 or FDS ≥ .15 may be preferable, as they should produce higher sensitivity rates, of 90% and 95% respectively. Conversely, in forensic contexts where specificity is likely more important than sensitivity, cut scores of FDS ≥ .65 or FDS ≥ .70 may be more appropriate as they should offer higher specificity rates, of 90% and 95% respectively (for details on these cut scores, please see Giromini et al. 2018).
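The trade-off between these cut scores can be illustrated with a minimal Python sketch that computes sensitivity and specificity for a given FDS threshold. The function name and the toy scores are hypothetical and serve only as an illustration; they are not data from this or any other study.

```python
def classification_accuracy(patient_scores, feigner_scores, cut):
    """Sensitivity and specificity when scores >= cut are flagged as noncredible."""
    true_pos = sum(s >= cut for s in feigner_scores)  # feigners correctly flagged
    true_neg = sum(s < cut for s in patient_scores)   # patients correctly not flagged
    sensitivity = true_pos / len(feigner_scores)
    specificity = true_neg / len(patient_scores)
    return sensitivity, specificity


# Hypothetical FDS values (illustrative only, not study data).
patients = [0.05, 0.12, 0.20, 0.35, 0.55]
feigners = [0.25, 0.48, 0.66, 0.80, 0.93]

for cut in (0.15, 0.30, 0.50, 0.65, 0.70):
    sens, spec = classification_accuracy(patients, feigners, cut)
    print(f"FDS >= {cut:.2f}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Lowering the threshold can only keep or raise sensitivity while keeping or lowering specificity, which is why lower cut scores suit screening and higher ones suit forensic decisions.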

Protocol Screening and Statistical Analyses

Prior to analyzing the data, all 164 available MMPI-2 and IOP-29 records were screened for content-unrelated distortions, such as inconsistencies and inadequate item endorsement. Thus, records with MMPI-2 Cannot Say (CNS) ≥ 30, True Response Inconsistency (TRIN) T ≥ 80, Variable Response Inconsistency (VRIN) T ≥ 80, or more than 3 missing responses on the IOP-29 were excluded. This approach reduced the sample size to 155 valid cases, as 8 people had an invalid MMPI-2 and one person had an invalid IOP-29. Of these 155 valid cases, 62 were depressed patients (36 depressed patients in treatment and 26 depressed patients assessed for work-related stress) and 93 were expMAL. Next, we compared the patient and expMAL groups on gender, age, and years of education, to evaluate whether the two groups were sufficiently balanced on these demographic variables. None of these analyses produced statistically significant results, all p ≥ .41.
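The exclusion rules just described can be expressed as a short Python sketch. The function name, the dictionary keys, and the example records are hypothetical; only the four thresholds come from the text above.

```python
def is_valid_protocol(cns, trin_t, vrin_t, iop_missing):
    """Return True when none of the exclusion rules described above is met."""
    mmpi_invalid = cns >= 30 or trin_t >= 80 or vrin_t >= 80
    iop_invalid = iop_missing > 3
    return not (mmpi_invalid or iop_invalid)


# Hypothetical records (illustrative values only, not study data).
records = [
    {"id": "A", "cns": 2, "trin_t": 57, "vrin_t": 50, "iop_missing": 0},
    {"id": "B", "cns": 31, "trin_t": 50, "vrin_t": 46, "iop_missing": 0},  # too many unanswered MMPI-2 items
    {"id": "C", "cns": 0, "trin_t": 65, "vrin_t": 80, "iop_missing": 1},   # VRIN T at the exclusion threshold
    {"id": "D", "cns": 0, "trin_t": 50, "vrin_t": 50, "iop_missing": 4},   # too many missing IOP-29 responses
]

valid_ids = [
    r["id"]
    for r in records
    if is_valid_protocol(r["cns"], r["trin_t"], r["vrin_t"], r["iop_missing"])
]
print(valid_ids)  # only record "A" passes every rule
```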

Subsequently, we focused on Cohen’s d effect sizes, ROC curves, and classification accuracy statistics by contrasting the patients’ data against those of expMAL. To evaluate incremental validity, we then performed a series of hierarchical logistic regressions, with group (0 = patient; 1 = expMAL) as criterion variable and the MMPI-2 and IOP-29 scores as predictors. Lastly, we inspected MMPI-2 clinical scales to evaluate whether the expMAL did elevate the depression-related scales, as one would expect.
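For readers less familiar with these statistics, the following Python sketch shows one common way to compute them. It assumes that Cohen’s d is based on the pooled standard deviation and that the AUC is obtained as the proportion of feigner-patient pairs in which the feigner scores higher (ties counting one half); the text does not specify these computational details, and the scores below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def auc(positives, negatives):
    """Probability that a random positive case outscores a random negative case (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical raw scores (illustrative only, not study data).
exp_mal = [12, 18, 22, 25, 30]
patients = [5, 8, 10, 12, 15]

print(f"d = {cohens_d(exp_mal, patients):.2f}, AUC = {auc(exp_mal, patients):.2f}")
```

The pairwise definition of the AUC is equivalent to the area under the empirical ROC curve, so it needs no explicit threshold sweep.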

Results

Table 1 reports on average MMPI-2 and IOP-29 scores produced by the depressed patients and the expMAL included in this study. As shown in Table 2, the MMPI-2 scale that produced the highest effect size and AUC was F: When considering the entire sample (N = 155), it produced a Cohen’s d of 1.48 and an AUC of .89. With that same sample (i.e., when considering the entire group), the IOP-29 FDS produced relatively similar, perhaps slightly superior results, with Cohen’s d = 1.80 and AUC = .89. According to Rogers et al.’s (2003) characterization of Cohen’s d values from experimental malingering studies, the IOP-29 FDS showed “very large” effect sizes (i.e., ≥ 1.75), MMPI-2 scales F and Fb showed “large” effect sizes (i.e., ≥ 1.25), and MMPI-2 Fp showed “moderate” effect sizes (i.e., ≥ .75).

Table 1 MMPI-2 and IOP-29 scores across groups
Table 2 Comparison between ExpMAL and depressed patients: Cohen’s d and area under the curve values (AUC)

Table 3 reports on the classification accuracy of selected MMPI-2 F, Fb, and Fp cut scores, as well as IOP-29 FDS cut scores. As expected, using MMPI-2 cut scores from Rogers et al.’s (2003) meta-analysis ensured very high specificity values, ranging from .94 to 1.00, depending on the sample under consideration. Sensitivity, for those same cut scores, ranged from .33 to .52.

Table 3 Classification accuracy of selected MMPI-2 and IOP-29 cut scores

The classification accuracy of the IOP-29 also was in line with previous research and expectations. Consistent with Giromini et al. (2018), using FDS ≥ .70 and FDS ≥ .65 yielded specificity values of about .95 and .90 (.92 and .89 respectively, considering the entire sample), whereas using FDS ≥ .15 and FDS ≥ .30 generated sensitivity values of about .95 and .90 (.97 and .89 respectively, considering the entire sample). Also in line with Giromini et al.’s (2018) findings, FDS ≥ .50 provided the best balance between sensitivity and specificity (.75 and .87 respectively, considering the entire, combined sample), with an approximate overall correct classification rate of 80%.

Tables 4 and 5 present the results of our incremental validity analyses, which focused on the entire sample so as to maximize statistical power. Table 4 demonstrates that entering the IOP-29 after each of the three MMPI-2 validity scales under investigation significantly improved the prediction of group membership (0 = patient; 1 = expMAL). Likewise, but in the opposite direction, each of the three selected MMPI-2 scales also significantly improved our logistic regression models, when entered after the IOP-29 FDS (Table 5). Interestingly, the model with the highest χ2 was the one that included MMPI-2 F together with IOP-29 FDS, χ2 (2) = 105.06, p < .001. Also noteworthy is that neither MMPI-2 Fb, χ2 (1) = 2.10, p = .15, nor MMPI-2 Fp, χ2 (1) = .02, p = .90, significantly improved the prediction of group membership when entered after MMPI-2 F. That is, the only scale that yielded some incremental validity over MMPI-2 F, in this study, was the IOP-29 FDS.

Table 4 Incremental validity of IOP-29 FDS over MMPI-2 Validity Scales: logistic regression models
Table 5 Incremental validity of MMPI-2 Validity Scales over IOP-29 FDS: logistic regression models
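As an illustration of how such step tests are evaluated, the Python sketch below assumes that the reported χ2 values are likelihood-ratio statistics, i.e., twice the gain in log-likelihood when a predictor is added to a nested model; the text does not state this explicitly. The function names and the log-likelihood values are hypothetical, whereas the two χ2 values passed to the p value function are those reported above.

```python
from math import erfc, exp, sqrt


def lr_statistic(loglik_reduced, loglik_full):
    """Likelihood-ratio chi-square for two nested models."""
    return 2 * (loglik_full - loglik_reduced)


def chi2_p_value(chi2, df):
    """Upper-tail p value of a chi-square statistic, using the closed forms for df = 1 and df = 2."""
    if df == 1:
        return erfc(sqrt(chi2 / 2))
    if df == 2:
        return exp(-chi2 / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")


# Hypothetical log-likelihoods chosen so that the step statistic is about 2.10.
print(round(lr_statistic(-60.00, -58.95), 2))

# Values reported in the text: Fb entered after F, and the model with F plus FDS.
print(round(chi2_p_value(2.10, 1), 2))   # 0.15
print(chi2_p_value(105.06, 2) < .001)    # True
```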

Because entering MMPI-2 F together with IOP-29 FDS produced the best model, we created a composite score, calculated as the Z average of the MMPI-2 F and IOP-29 FDS scores (for details on how to calculate this variable, see Appendix 2). As expected, when considering the entire sample (N = 155), this Z average index produced slightly higher Cohen’s d (= 1.85) and AUC (= .93) values compared to all of the MMPI-2 and IOP-29 scales under investigation (Fig. 1). To further investigate whether combining the IOP-29 FDS with the MMPI-2 F scale would improve classification accuracy compared to the MMPI-2 F scale alone, we performed ROC analyses. Given that our a-priori selected cut scores for F (see Table 3) yielded specificity values of .95 (F > 20) and .98 (F > 24), we selected cut scores for our Z average index with the same specificity values. We then examined whether Z average cut scores would yield increased sensitivity. The Z average cut score of Z ≥ 1.5 produced a specificity of .95 and a sensitivity of .66, and the cut score of Z ≥ 1.8 produced a specificity of .98 and a sensitivity of .60. With the same specificity values, the MMPI-2 F cut scores produced notably lower sensitivity values of .52 and .38. This pattern thus demonstrates that adding the IOP-29 to MMPI-2 F, the most effective MMPI-2 validity scale in this study, remarkably improved the prediction of group membership.
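The logic of this analysis can be sketched in Python as follows. Because Appendix 2 is not reproduced here, the sketch assumes that each score is standardized against the mean and standard deviation of the sample at hand before averaging; the reference values actually used for standardization may differ. All function names and scores are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values against their own mean and sample standard deviation (an assumption)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def z_average(f_raw, fds):
    """Average of the two standardized scores, one composite per participant."""
    return [(zf + zd) / 2 for zf, zd in zip(z_scores(f_raw), z_scores(fds))]


def sensitivity_at_specificity(patient_scores, feigner_scores, target_spec):
    """Lowest observed cut (score >= cut is flagged) whose specificity reaches target_spec."""
    for cut in sorted(set(patient_scores) | set(feigner_scores)):
        spec = sum(s < cut for s in patient_scores) / len(patient_scores)
        if spec >= target_spec:
            sens = sum(s >= cut for s in feigner_scores) / len(feigner_scores)
            return cut, sens, spec
    return None  # no observed cut reaches the requested specificity


# Hypothetical scores for eight participants (illustrative only, not study data).
f_raw = [4, 6, 9, 12, 15, 22, 26, 30]
fds = [0.10, 0.20, 0.15, 0.40, 0.60, 0.55, 0.85, 0.95]
is_feigner = [False, False, False, False, True, True, True, True]

composite = z_average(f_raw, fds)
patients = [c for c, g in zip(composite, is_feigner) if not g]
feigners = [c for c, g in zip(composite, is_feigner) if g]
print(sensitivity_at_specificity(patients, feigners, 0.95))
```

Fixing specificity first and then comparing sensitivities, as done above for F and the Z average, keeps the false positive rate equal across the indices being compared.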

Fig. 1 ROC curve of MMPI-2 F, IOP-29 FDS, and their Z average

Lastly, we inspected MMPI-2 clinical scales across the three groups to evaluate the extent to which our expMAL could reproduce adequate elevations in the depression-related indicators. As depicted in Fig. 2, the expMAL group showed elevations in several scales, including—but not limited to—Scale 2 (D, Depression). Conversely, the group of depressed patients with work-related stress showed notable elevations on scales 1 (Hs, Hypochondriasis), 2 (D, Depression), and 3 (Hy, Hysteria), but lower scores on all other scales, as one might expect in the case of depression-related conditions. The group of depressed patients in treatment instead showed markedly lower scores compared to both other groups.

Fig. 2 MMPI-2 Clinical Scales: average scores by group

Discussion

This study was designed to test whether using the Minnesota Multiphasic Personality Inventory-2 (MMPI-2; Butcher et al. 2001) together with the recently developed Inventory of Problems-29 (IOP-29; Viglione et al. 2017) would provide incremental validity in evaluating the credibility of presented depressive symptoms, compared to using either test alone. Examination of MMPI-2 and IOP-29 data from 93 experimental malingerers (expMAL), 36 patients in treatment for depression, and 26 depressed patients assessed for work-related stress confirmed this hypothesis. In fact, a series of hierarchical logistic regressions with group membership as criterion variable (0 = depressed patient; 1 = expMAL) and the selected MMPI-2 and IOP-29 scales as predictors demonstrated that using both instruments together yielded a statistically significantly better prediction than using either instrument alone. Importantly, both the IOP-29 scale FDS and the MMPI-2 scales F, Fb, and Fp also demonstrated effectiveness in differentiating bona fide from feigned depression when considered alone, with relatively large Cohen’s d (≥ 1.28 for MMPI-2 F and ≥ 1.64 for IOP-29 FDS) and excellent AUC (≥ .88 for both MMPI-2 F and IOP-29 FDS) values (for thresholds for characterizing AUC values, please see Hosmer and Lemeshow 2000). Taken together, these findings thus suggest that including both the MMPI-2 and IOP-29 in multimethod forensic assessments might be a particularly suitable choice when evaluating depression-related complaints.

The fact that the IOP-29 FDS provided some incremental validity over the use of MMPI-2 F scales is not too surprising. As briefly reviewed in the introduction, many of the detection strategies used by the IOP-29 items were aimed precisely at offering some incremental validity over the classic rare-symptom approach implemented by the MMPI-2 F scales (Viglione et al. 2018). Indeed, while the MMPI-2 F scales primarily focus on symptom endorsement, the emphasis of the IOP-29 FDS is more on how a person manages life despite symptoms, and on this person’s beliefs surrounding the possibility of influencing the severity and expression of problems. Combining the results of the MMPI instruments with those of the IOP-29 might thus prove particularly useful because it potentially allows one to understand not only what symptom(s) the person is experiencing, but also how s/he is managing to cope with them. In this article, to illustrate how one might integrate the results of the MMPI-2 with those of the IOP-29, we have calculated the Z average of the MMPI-2 F and IOP-29 FDS scores. As shown in Fig. 1, this composite score showed superior effectiveness compared to either instrument used alone. Future studies might thus further investigate this approach and perhaps provide additional information on what cut scores one might want to use, if s/he intended to adopt this Z average score in his/her practice.

In contrast to recent MMPI meta-analytic research, our study found that F and Fb provided superior effectiveness in detecting experimental feigning compared to Fp. This unexpected finding is quite difficult to explain. On the one hand, many of the patients in the first of the two patient samples, i.e., the one comprised of depressed patients in treatment (many of whom were in remission), suffered from very mild depressive symptoms (see Fig. 2), which may have favored F and Fb over Fp. Indeed, while F and Fb are elevated by endorsement of symptoms that are infrequent in the MMPI-2 normative nonclinical samples, Fp is elevated by endorsement of symptoms that are infrequent among psychiatric patients. As such, the fact that our patients were not suffering from severe psychopathology may have boosted the specificity of F and Fb, without influencing the overall effectiveness of Fp (Rogers et al. 2003). This explanation, however, does not fit well with the fact that this same pattern of findings, with F and Fb offering better classification accuracy than Fp, was also observed when comparing our expMAL group against the sample of patients assessed for work-related stress and presumably affected by genuine depression (Table 2). These patients, indeed, reported remarkably more severe psychological problems compared to the sample of depressed patients in treatment, as shown in Fig. 2. Future studies might thus attempt to clarify whether this unexpected finding is specific to our sample or perhaps depends on other variables such as the type of vignette we used, the specific instructions we gave to our expMAL, and so on.

One more consideration deserves mentioning. When compared to other similar IOP-29 experimental malingering studies, ours has produced slightly lower sensitivity results when using the standard cut score of FDS ≥ .50. In fact, when investigating feigning of depression-related symptoms via an experimental malingering paradigm, previous studies using that same cut score showed sensitivity rates ranging from .79 to .96 (Giromini et al. 2018; Giromini et al. 2019a, b; Viglione et al. 2017). In our study, sensitivity with that cut score was .75. Because the exact same instructions used in our study were also used in Giromini et al. (2019a) and Giromini et al. (2019b), we speculate that our reduced sensitivity may be due to the administration of the MMPI-2 somehow negatively affecting the IOP-29’s ability to detect feigned depression. Indeed, it is possible that our participants felt like they had already convinced the examiner about their depressive symptoms with the 567 MMPI-2 items, so that they did not have to continue over-reporting depression-related problems when responding to the IOP-29. Alternatively, it is also possible that, given the length of the MMPI-2, some fatigue occurred while responding to the two tests, so that our participants attended to the IOP-29 with relatively less attention, compared to Giromini et al.’s (2019a) and Giromini et al.’s (2019b) studies. Indeed, in those previous studies, participants only had to fill out the IOP-29 and be examined with the TOMM (Giromini et al. 2019a) or fill out the IOP-29 alone (Giromini et al. 2019b), which obviously required notably less cognitive effort compared to filling out a long and complex personality inventory such as the MMPI-2. Additional research using both the IOP-29 and MMPI-2 would therefore be highly beneficial, to better understand the possible influence of MMPI-2 administration on IOP-29 sensitivity results.

Lastly, it should be noted that, like all malingering-related studies, ours also has some limitations that need to be considered. First, external validity may be questioned, given that our expMAL participants were instructed to feign depression within an experimental paradigm, so it is unknown whether real-life malingerers in high-stakes contexts would behave as our experimental participants did. Second, although Table 2 reveals no notable differences between the results of the MMPI-2 F scales and IOP-29 FDS across the two patient samples, our inclusion of a patient sample characterized by very mild depressive symptoms may have boosted the effect sizes of our study to some extent. Third, the patients included in the group of individuals assessed for work-related stress were considered highly unlikely to be malingerers. Although all of them had been thoroughly screened through a series of interviews performed by experienced psychiatrists and psychologists, we cannot rule out that some of them may have in fact over-reported their symptoms. Indeed, the limitations of clinical judgment in determining the credibility of a response set have long been known (Heaton et al. 1978). Fourth, using only the MMPI-2 and IOP-29 may have limited ecological validity, given that real-life symptom validity assessment is typically performed with a multitude of instruments. Fifth, our inclusion criteria required our expMAL participants to report that they had not been diagnosed with any major psychiatric disorder. However, given that depression is a high-prevalence mental disorder and self-report has its limitations, we cannot rule out that some of our expMAL participants did in fact suffer from depression. If that were the case, our results could be inaccurate regarding the actual effectiveness of the MMPI-2 and IOP-29 in detecting feigned depression.
Sixth, our study could not evaluate the possible impact of administration order, which in previous studies has been shown to have the potential to significantly influence test scores (Erdodi and Lajiness-O'Neill 2014; Ryan et al. 2010; Zuccato et al. 2018). Future research randomizing administration sequence and examining its potential impact on MMPI-2 and IOP-29 scores would therefore be beneficial.
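As one possible way to implement the randomization of administration sequence suggested above, the following sketch assigns each participant to one of the two test orders at random while keeping the two order groups balanced in size. It is offered as an assumption-laden illustration, not as a procedure used in this or any cited study; the function name `assign_orders`, the participant identifiers, and the seed are all hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, Optional, Sequence, Tuple

Order = Tuple[str, ...]


def assign_orders(
    participant_ids: Iterable[Hashable],
    orders: Sequence[Order] = (("MMPI-2", "IOP-29"), ("IOP-29", "MMPI-2")),
    seed: Optional[int] = None,
) -> Dict[Hashable, Order]:
    """Randomly assign each participant to an administration order.

    Participants are shuffled and the orders are then dealt out in turn,
    so group sizes differ by at most one (a simple counterbalancing scheme).
    Hypothetical helper written for illustration.
    """
    rng = random.Random(seed)  # local generator, so the seed is reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: orders[i % len(orders)] for i, pid in enumerate(ids)}


# Invented participant identifiers and seed.
assignment = assign_orders(range(1, 11), seed=2024)
for pid in sorted(assignment):
    print(pid, " then ".join(assignment[pid]))
```

With order recorded as a between-subjects factor in this way, scores on each instrument could later be compared across the two order groups to estimate any order effect.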

Despite these limitations, our study has the merit of being the first to report on the utility of using the MMPI-2 together with the IOP-29 when assessing the credibility of depression-related complaints. All in all, our findings indicate that the IOP-29 may provide useful incremental validity over the classic MMPI-2 scales based on the rare-symptoms detection strategy. Researchers are therefore encouraged to continue investigating the utility of using the IOP-29 in combination with other popular instruments, such as the Personality Assessment Inventory (PAI; Morey 1991, 2007) or the recently developed and very promising Self-Report Symptom Inventory (SRSI; Merten et al. 2016).