The use of the Dissociative Experiences Scale (DES) in psychological research is widespread, as suggested by the more than 2200 studies utilizing the scale listed in PsycINFO. Since its inception, the DES (Bernstein & Putnam, 1986) and its variants (e.g., DES-II, Dissociative Experiences Scale-Revised (DES-R), DES-Comparison) have found increasingly wide use in the assessment of psychopathology; this usage reflects the recognition of the ubiquity and complexity of dissociation in psychopathology (Lyssenko et al., 2018) and the increasing evidence for dissociation as a trauma consequence (Carlson, Dalenberg, & McDade-Montez, 2012). In addition to the disorders in which dissociation is a defining feature (e.g., dissociative amnesia, dissociative identity disorder, and depersonalization/derealization disorder), severe dissociation is also a criterion in other disorders, such as posttraumatic stress disorder (PTSD) and borderline personality disorder (American Psychiatric Association, 2013). Indeed, recognition of the role of dissociation in trauma has led to advances and refinements in the understanding of several disorders, such as the recent identification of a dissociative subtype of PTSD. A number of studies have not only supported the existence of a dissociative subtype of PTSD in veterans but have also associated it with greater symptom severity (Armour, Karstoft, & Richardson, 2014; Haagen, van Rijn, Knipscheer, van der Aa, & Kleber, 2018; Tsai, Armour, Southwick, & Pietrzak, 2015; Waelde, Silvern, & Fairbank, 2005; Wolf, Lunney, et al., 2012; Wolf, Miller, et al., 2012).

Despite general acceptance in diagnostic manuals, dissociation remains a subject of debate. Researchers have cast doubt on a causal link between trauma and dissociation (Patihis & Lynn, 2017) and on the role of trauma in dissociative amnesia (Pope, Poliakoff, Parker, Boynes, & Hudson, 2007) as well as dissociative identity disorder (DID; Piper & Merskey, 2004). Merckelbach and Patihis (2018) challenge the very use of the term “trauma-related dissociation” (TRD) as potentially prejudicial, favoring claimants who assert a causal link between past trauma and current dissociation and implying that dissociation has only one prominent cause. However, this position has been countered by Brand et al. (2018), who provide a robust defense of the trauma-related dissociation concept. In the context of a forensic disagreement between experts over the validity of dissociative symptoms, however, the DES can be criticized both for its distributional qualities (skewness) and for its lack of a validity scale.

The DES has been translated into numerous languages, including Spanish, Hebrew, Italian, Dutch, Japanese, Turkish, Russian, Portuguese, German, Czech, and French. A version of the DES has also been created for use with adolescent populations and, like the DES, shows strong cross-cultural validity and reliability (Soukup, Papežová, Kuběna, & Mikolajová, 2010). However, the DES and the DES-II, the original test and its first revision, have been criticized for poor distributional qualities and a confusing format. The DES asks individuals what percentage of the time they experience various symptoms, a difficult question to answer for subjective items such as feelings of unreality. The DES-R was a revision of the DES in which the format was changed to a frequency scale: respondents are asked how often they experience various symptoms, with choices ranging from “never” to “more than once a week.” This change normalized the distribution while preserving the correlation with the original scale, with r values from .80 to .90 (Coe, Dalenberg, Aransky, & Reto, 1995; Arzoumanian et al., 2017). The DES-R has been successfully used to predict relevant clinical constructs in a number of published papers (Coe et al., 1995; Kluemper & Dalenberg, 2014).

Despite improvements to the distribution of the DES in some of the recent revisions, the lack of a validity scale within the measure remains a significant drawback, particularly for a scale that may often find its way into the forensic arena (given the relationship of dissociation to trauma). At the same time, current assessment instruments that are widely used in forensic settings were not developed with complex trauma profiles in mind and, therefore, have misclassified true dissociative patients as feigners in experimental trials (Brand, Webermann, & Frankel, 2016; Palermo & Brand, 2018). The relatively recent rise of paid online survey taking (e.g., on MTurk) has also provided a ready source of participants for studies, with online samples increasingly used for tests of theoretical models. Consequently, both the general research field and the forensic field would greatly benefit from a scale that could identify potential instances of malingered dissociation while remaining sensitive to individuals with true elevations in dissociative symptoms.

It appears that most experts favor administration of a feigning assessment as part of standard practice, and particularly forensic practice (Melton, Petrila, Poythress, & Slobogin, 2007). It has been noted that many psychological measures of personality (e.g., the Beck Depression Inventory-II; Beck, Steer, & Brown, 1996) do not consider the issue of feigning. However, recent scale development has been more consistent in incorporating a feigning screen into new measures. For example, the Detailed Assessment of Posttraumatic Stress (DAPS), Trauma Symptom Inventory (TSI), and Minnesota Multiphasic Personality Inventory-2 (MMPI-2) all include scales that identify possible malingering, inconsistency, or feigning, thereby adding a means to assess the validity of each measure (Resnick, West, & Wooley, 2018; Gray, Elhai, & Briere, 2010; Butcher, Graham, Ben-Porath, Tellegen, & Dahlstrom, 2001). In evaluating validity, researchers tend to employ a few specific methods that have been shown to be effective in identifying feigning within an assessment. The following methods have been used in published measures to evaluate validity:

Inconsistency

This method involves addressing a specific thought, feeling, or construct multiple times throughout an assessment in order to determine the consistency of an individual’s responses. As an example, an evaluation of inconsistency has been incorporated into the newer versions of the MCMI: both the MCMI-III and MCMI-IV include an Inconsistency Scale that compares pairs of items to identify whether a person appears to be responding randomly (Millon, Millon, Davis, & Grossman, 1994; Millon, Grossman, & Millon, 2015). The Variable Response Inconsistency Scale and the True Response Inconsistency Scale scores on the MMPI-2 are also inconsistency measures.
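
To illustrate the logic (a sketch, not any published scale's actual scoring), a pairwise inconsistency count can be written in a few lines of Python; the item pairings, tolerance, and 0–6 response scale below are hypothetical:

```python
# Sketch of a pairwise inconsistency count. Item index pairs and the
# 0-6 response scale are hypothetical; a pair is flagged when two
# answers that should agree diverge by more than a set tolerance.

PAIRED_ITEMS = [(1, 29), (5, 31), (12, 33)]  # hypothetical item pairs
TOLERANCE = 3  # maximum acceptable absolute difference

def inconsistency_count(responses: dict[int, int]) -> int:
    """Count item pairs whose responses diverge beyond the tolerance."""
    return sum(
        abs(responses[a] - responses[b]) > TOLERANCE
        for a, b in PAIRED_ITEMS
    )
```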

Atypicality

To assess atypicality, measures include several items that probe for thoughts, feelings, or behaviors that are extremely rare or unknown. Multiple endorsements of such items typically indicate feigning. There are several well-known measures that include such validity items, including the Atypical Response Scale of the TSI (Briere, Elliott, Harris, & Cotman, 1995; Gray et al., 2010). The Validity scale of the MCMI-IV also includes such improbable items (Millon et al., 1994; Millon et al., 2015).
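
A minimal sketch of such an atypicality count, with hypothetical item numbers and an assumed endorsement threshold:

```python
# Sketch of an atypicality screen: count endorsements of items that
# describe experiences too rare to be plausibly endorsed together.
# Item indices and the endorsement threshold are hypothetical.

ATYPICAL_ITEMS = [30, 32, 34, 36, 38, 40]  # hypothetical rare items
ENDORSED_AT = 1  # any rating of 1+ on a 0-6 scale counts as endorsement

def atypicality_endorsements(responses: dict[int, int]) -> int:
    """Count atypical items endorsed at or above the threshold."""
    return sum(responses[i] >= ENDORSED_AT for i in ATYPICAL_ITEMS)
```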

Unlikely Extremity

Unlikely extremity is related to atypicality; however, it differs slightly in that it aims to identify those individuals who have a pattern of extremity. Examples of this type of measure are the Under-response and Hyper-response scales of the Trauma Symptom Checklist for Children (Briere, 1996). In this measure, counts are made of complete denial of common thoughts or behaviors (under-response) and responses at the ceiling for less common thoughts or behaviors (hyper-response).
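
A sketch of under- and hyper-response counts in this spirit; the item lists and response scale are hypothetical, not the TSCC's actual content:

```python
# Sketch of under-/hyper-response counts in the spirit of the scales
# described above; item lists and the 0-6 scale are hypothetical.

COMMON_ITEMS = [2, 4, 6, 8]    # thoughts/behaviors almost everyone reports
UNCOMMON_ITEMS = [10, 14, 18]  # experiences rarely rated at the ceiling
SCALE_MAX = 6

def under_response(responses: dict[int, int]) -> int:
    """Complete denial of common thoughts or behaviors."""
    return sum(responses[i] == 0 for i in COMMON_ITEMS)

def hyper_response(responses: dict[int, int]) -> int:
    """Ceiling responses on less common thoughts or behaviors."""
    return sum(responses[i] == SCALE_MAX for i in UNCOMMON_ITEMS)
```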

Structure

This validity check involves patterns of responses that do not typically occur, such as unusual, co-occurring symptoms or endorsement of more serious aspects of a disorder without endorsing the less serious aspects. The Infrequency-Psychopathology Scale (Fp) of the MMPI-2 partially addresses this form of feigning. The scale attempts to identify those individuals who are either over-reporting symptoms or those who seem to be responding randomly, resulting in unlikely patterns of symptoms (Butcher et al., 2001). That is, here the issue is not that the individual was extreme but that he or she endorsed items in a pattern that was unlikely. For the DES, the structure score consisted of subtracting the taxon score (items with a low base rate) from the absorption score (items with a high base rate), with low or negative scores indicating invalidity. Base rates for the absorption and taxon items have been repeatedly established in prior studies (Olsen, Clapp, Parra, & Beck, 2013; Waller, Putnam, & Carlson, 1996).

Language Proficiency

This validity check involves administering an examination of grade-level language proficiency using the same level of vocabulary that is used in the remainder of the assessment or in other administered assessments. Such checks are quite rare outside of the neuropsychological realm, but lack of proficiency in language clearly increases the likelihood of confusion for the participants as to the meaning of assessment questions.

Duration

This validity check examines whether the speed of completion for a specific assessment is too slow or too fast when compared with a mean completion time for a group. In online forums, completion in a few minutes, significantly faster than the likely reading capacity of the respondents, can mean either that the test was taken by an automated system or that the individual did not read the questions.
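
The check reduces to a simple threshold rule, sketched below; the reference statistics are hypothetical stand-ins, not the pilot values reported later in the Methods:

```python
# Sketch of a completion-time check: flag respondents who finish far
# faster than a reference group. Reference statistics are hypothetical.

REFERENCE_MEAN_SEC = 420.0  # hypothetical reference-group mean
REFERENCE_SD_SEC = 90.0     # hypothetical reference-group SD

def suspiciously_fast(duration_sec: float, n_sds: float = 2.0) -> bool:
    """True when completion is n_sds SDs faster than the reference mean."""
    return duration_sec < REFERENCE_MEAN_SEC - n_sds * REFERENCE_SD_SEC
```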

Manipulation Check

Experimental studies often include several questions that ensure that the respondent is aware of the directions for his or her group assignment (Foschi, 2014). Failure on this check could have a number of meanings, including a lack of clarity on the part of the experimenter, but it clearly compromises the interpretation of results.

Another type of data integrity protection that has specific applicability to some online surveys involves checking the geographic location of the participant. As more and more research relies on Internet-based solicitation, an important aspect of validity may concern the country of origin of participants. Survey sites such as MTurk pay participants according to the number of surveys that they complete. Consequently, individuals may be incentivized to participate in surveys for which they are not necessarily qualified (e.g., ignoring such qualifiers as U.S. residency or native English speaker). While imperfect, checking the Internet Protocol (IP) addresses of participants provides at least a first tool in culling potentially inappropriate participants from a data set. Although definitional questions from Language Proficiency (LP) tests will catch many of these individuals, those who value the incentive can easily use efficient internet resources to find the definitions of words and pass the LP requirement.
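
A sketch of such an IP screen, assuming country codes have already been resolved by a geolocation service (the lookup itself is out of scope here, and the country whitelist is illustrative rather than an exhaustive standard):

```python
# Sketch of an IP-based screen: retain only respondents whose resolved
# country is on a whitelist of English-speaking countries, and drop
# duplicate addresses. Country list and record format are assumptions.

ENGLISH_SPEAKING = {"US", "CA", "GB", "IE", "AU", "NZ"}

def screen_ips(records: list[dict]) -> list[dict]:
    """records: dicts with 'ip' and 'country' keys (pre-resolved)."""
    seen: set[str] = set()
    kept = []
    for rec in records:
        if rec["country"] not in ENGLISH_SPEAKING:
            continue  # non-English-speaking country of origin
        if rec["ip"] in seen:
            continue  # duplicate IP address
        seen.add(rec["ip"])
        kept.append(rec)
    return kept
```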

In light of concerns over ensuring valid administration of the DES-R, we developed a form of the assessment that incorporated an embedded validity scale comprising atypicality items, inconsistency items, and structure. The addition includes ten new questions: six atypicality items and six inconsistency items. With regard to structure, we subtracted taxon from absorption item scores, with a cutoff set through receiver operating characteristic (ROC) analysis. Additionally, all participants were required to meet specific cutoffs for English proficiency and duration of test taking, were required to pass the manipulation check, and were screened for duplicate IP addresses or IP addresses from non-English-speaking countries. After exclusion based on 8th-grade vocabulary scores under 60%, unlikely duration, and suspect IP addresses, structure, inconsistency, and atypicality were hypothesized to differentiate honest responders from those asked to feign dissociative symptoms. Exclusion based on IP address, vocabulary, duration, and the three validity scales was expected to increase the correlation between reported trauma and dissociation, as would be predicted by theories of trauma-related dissociation, rather than decrease it, as would be expected from fantasy-based theories of dissociation (see Dalenberg et al., 2012).

Methods

Participants

Participants were recruited via MTurk to answer a 7–15-min survey in exchange for $0.75 to $1.00. MTurk is a Web-based platform designed to recruit and pay participants to perform various tasks. The quality of data collected via MTurk has been shown to meet or exceed the psychometric standards associated with published research (Buhrmester, Kwang, & Gosling, 2011). Additionally, samples collected from MTurk are more representative of the US population than in-person convenience samples (Berinsky, Huber, & Lenz, 2012; Buhrmester et al., 2011). To maximize the probability that the participants were attentive and motivated during the task, only participants who had a “Master Worker” designation from MTurk were allowed to participate for the first 25% of data collection (across honest control and feigning groups); these individuals have demonstrated a high degree of success in performing a wide range of human intelligence tasks across a large number of requesters. The second quarter of the participants was collected without the Master Worker designation, and all variables were compared. As no differences were found, the remaining participants were also collected without the Master Worker designation.

PTSD participants (n = 40) and dissociative disorder (DD) participants (n = 5) were consecutive patients requesting therapy at a local trauma-centered clinical practice. The PTSD diagnosis was confirmed through administration of the Clinician-Administered PTSD Scale (CAPS-5); sixteen had a diagnosis of the dissociative subtype of PTSD. The five DD clients were not utilized in the main analyses, given unacceptable power for testing as a separate group; their results will be presented for pilot purposes.

Measures

Respondents completed a brief demographic scale, the DES-V (see below), a brief depression inventory, and a vocabulary screen.

Demographic Questionnaire

Participants completed a brief demographic questionnaire regarding age, gender, and race/ethnicity. Options for prior trauma were taken from the DAPS (Briere, 2001) and could range from 0 to 11.

Patient Health Questionnaire 9

The Patient Health Questionnaire 9 (PHQ-9) was designed to measure depression severity in medical populations in clinical settings. Its nine categories are derived from the DSM-IV classification system and pertain to: (1) anhedonia, (2) depressed mood, (3) trouble sleeping, (4) feeling tired, (5) change in appetite, (6) guilt or worthlessness, (7) trouble concentrating, (8) feeling slowed down or restless, and (9) suicidal thoughts. Participant response options vary from “not at all” to “nearly every day.” The PHQ-9 has been shown several times to have good reliability and validity (Kroenke, Spitzer, & Williams, 2001; Lowe, Unutzer, Callahan, Perkins, & Kroenke, 2004; Martin, Rief, Klaiberg, & Braehler, 2006; Pinto-Meza, Serrano-Blanco, Peñarrubia, Blanco, & Haro, 2005).

English Vocabulary Screener

A vocabulary test was administered consisting of seven 8th-grade-reading level vocabulary words, as identified by the Spache Readability Formula. Respondents who received scores under 60% on the vocabulary test were excluded.

Dissociative Experiences Scale-Revised

The Dissociative Experiences Scale (DES; Bernstein & Putnam, 1986) is a 28-item self-report measure that assesses the frequency of dissociative experiences in three major categories: (1) absorption/imaginative involvement, (2) amnesia, and (3) depersonalization/derealization. In the frequency format used here, responses range from “this happens never” to “this happens at least once a week.” The DES has high reliability (r = .83, p < .0001) and high internal consistency (α = .95; Frischholz, Braun, Sachs, Hopkins, et al., 1990). The test also was found to differentiate well between dissociative and non-dissociative clinical groups (Dubester & Braun, 1995).

Due to repeated findings of skewness and leptokurtosis in the DES, Dalenberg et al. (1994) revised the response format of the scale to a frequency scale. This change normalized the distribution without changing the relationship of the DES-R to other important variables (Coe et al., 1995). The relationship between the DES and DES-R reported in Coe et al. was .90.

The structure scale of the DES-V (the label given to the DES with the embedded validity scale) was based on subtraction of the taxon items from the absorption items. Absorption items were DES items 1, 2, 14, 15, 16, 17, 18, 20, 21, 22, and 23, and taxon items were DES items 3, 5, 7, 8, 12, 13, 22, and 27 (Waller et al., 1996). The Atypical Response scale was the sum of six atypical items developed by the Trauma Research Institute and verified as atypical by three highly published authors on dissociation and a pilot group of 30 patients with dissociative PTSD and 5 DD clients (none of whom were included in the main study). Items were included if they were seen as atypical by 90% or more of the patients and all experts (see Table 1 for a list of DES-R items with corresponding inconsistency items and a list of atypical items). Chosen items had a mean of 1 or less on the 0–6 Likert-scaled atypicality ratings. Examples of atypical items include “sometimes people find they do not feel physical pain at all” and “sometimes people find that they collect things that remind them of their trauma but they do not remember buying those things.” Finally, six of the DES-R items were reworded in a positive direction, and an inconsistency score was generated. To be considered inconsistent, participants had to answer with a frequency score of five or more both on an item and on its matching reversed-inconsistency question. Disagreement with both items was not considered an inconsistency, as it is possible to say that one is neither extremely hypervigilant as a driver nor unaware of surroundings. A score from 0 to 6 was calculated for the number of paired disagreements.

Table 1 DES-R items with corresponding inconsistency items
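
To make the scoring concrete, the three embedded scores can be sketched as follows. The absorption and taxon item numbers follow the text; the atypical-item keys and inconsistency pairings are placeholders for the Table 1 content, which is not reproduced here:

```python
# Sketch of the three embedded DES-V scores on the 0-6 DES-R frequency
# scale. Absorption and taxon item numbers follow the text; atypical
# keys and inconsistency pairings are placeholders for Table 1.

ABSORPTION = [1, 2, 14, 15, 16, 17, 18, 20, 21, 22, 23]
TAXON = [3, 5, 7, 8, 12, 13, 22, 27]
ATYPICAL = ["atyp1", "atyp2", "atyp3", "atyp4", "atyp5", "atyp6"]
PAIRS = [(2, "rev2"), (14, "rev14"), (17, "rev17"),
         (20, "rev20"), (23, "rev23"), (27, "rev27")]  # placeholders

def structure_score(r: dict) -> int:
    """Absorption sum minus taxon sum; low or negative values are suspect."""
    return sum(r[i] for i in ABSORPTION) - sum(r[i] for i in TAXON)

def atypicality_score(r: dict) -> int:
    """Sum of the six atypical items."""
    return sum(r[k] for k in ATYPICAL)

def inconsistency_score(r: dict) -> int:
    """Pairs where both an item and its reversed mate are rated 5+."""
    return sum(r[a] >= 5 and r[b] >= 5 for a, b in PAIRS)
```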

Other Exclusion Criteria

IP addresses that indicated the participant was from a non-English-speaking country were excluded. Further, we established a cutoff for time to completion that might indicate insufficient care in answering the questionnaire. The cutoff for duration of test taking was derived by asking a group of 14 Ph.D. students to complete our survey as quickly as possible while recording their completion times. All participants in the duration pilot test were members of the Trauma Research Institute and were therefore familiar with all instruments; thus, their durations were likely faster than is typical. Participants who completed the survey more than two standard deviations faster than our Ph.D. group were considered to have put forth less effort than required for this assessment and were excluded. Lastly, as a manipulation check, all participants were required to correctly identify their group assignment. Incorrect responses on the manipulation check, unlikely duration, and IP addresses outside of English-speaking countries resulted in the participant being dropped from the study.

Procedures

Participants were divided into three groups. All groups were given a brief definition of dissociation. Participants could stop at any time with no penalty; thus, completion of the survey was voluntary after receiving the instructions for their group assignment. Dropouts are discussed in “Results.”

Honest Control Group

Participants were asked to answer the survey as honestly as possible. The honest group was also told that “giving dishonest answers on a medical survey is like giving contaminated blood in a blood donation. It can be extremely harmful for the scientific project.” In a separate pilot project, the addition of this statement was found to increase admission of alcohol misuse (t = 3.24, p < .01), failure to use a condom with new sexual partners (t = 4.41, p < .01) and cheating in undergraduate school (t = 5.94, p < .01) in an undergraduate sample of 434 students.

Feigning Group

Participants were asked to pretend to be someone who experiences dissociative symptoms and to try to convince the researchers through their responses to the questions. To ensure the participants were attentive and motivated during the task, the feigning group was provided an extra monetary incentive: an additional $1.00 for the top 50 most believable malingerers and an additional $5.00 for the top 5 most believable. Dissociation was described as an experience that at times occurs after negative events, wherein a person feels detached or disconnected from reality or feels fragmented or disconnected internally. Other listed symptoms were claims of lack of recall for various activities and difficulty in feeling normal sensations.

Posttraumatic Stress Disorder Group

The 40 PTSD participants took the survey under the same conditions as the honest control group, as did the 5 DD participants (who, again, were not included in the analyses).

Results

Initially, 357 individuals participated in the experiment. However, 12 individuals were participating through IP addresses originating in non-English-speaking countries and were excluded from the subsequent data analysis. Of the 345 remaining participants, 20 did not complete the DES-R, 34 failed the manipulation check (incorrectly identifying their honest/feigning group membership), and 35 failed the vocabulary test (including 58% (n = 7) of those with IP addresses from non-English-speaking countries). Only seven individuals fell into the category of suspiciously fast completers. As there was overlap among these failures, a total of 84 individuals were eliminated. Analysis of excluded versus retained participants yielded non-significant results for age, race, and gender. Despite instructions that asked those in the feigning group to answer honestly on these demographics, feigning group members were more likely to be excluded (χ2 = 10.491, p < .001; 33 vs. 14%). It should be noted that, of those who were excluded because they failed the vocabulary check, 94% also failed the atypicality criterion and 90.3% failed structure. Of those who failed the manipulation check, 81% (n = 25) also failed atypicality, 18% (n = 6) failed inconsistency, and 84% (n = 27) failed structure. Ninety-one percent of the respondents who failed the manipulation check also failed at least one of the validity checks. Of the remaining subjects to be evaluated on the DES-V, 40 were in the PTSD group, 98 were in the feigning group, and 135 were in the honest control group, for a total N = 273.

Table 2 describes the age, gender, and race distribution of the retained sample. The honest control, feigning, and PTSD groups did not differ on gender, age, or race distribution.

Table 2 Participant characteristics

Group Differences on Atypicality, Structure, and Inconsistency

Distributions for the DES-R total score, atypicality, and structure were relatively normal within the two honest groups (PTSD, honest control). Inconsistency scores were skewed, with only 26 individuals inconsistent on one item and nine inconsistent on more than one item. The vast majority (89.8%) had no inconsistencies. The inconsistency variable was thus recoded as a dichotomy.

The three groups significantly differed on the atypical items, F(2, 271) = 110.12, p < .001, with one missing value (see Table 3 for full results). The three groups also significantly differed on the inconsistency items, χ2(2) = 22.58, p < .001, and the structure items, F(2, 272) = 27.28, p < .001. Effect sizes were larger for atypicality (η2 = .50) than for structure (η2 = .17) or inconsistency (η2 = .08).

Table 3 Between group differences on malingering items

Results of the Pearson correlations indicated significant positive associations between inconsistency and atypicality, r(323) = .40, p < .01, between inconsistency and structure, r(324) = .117, p < .05, and between atypicality and structure, r(324) = .484, p < .01. Among those who were not excluded, duration of test taking did not significantly correlate with any of the three constructs.

Logistic Results

Using the three constructs (continuous scores on atypicality and structure together with the dichotomous inconsistency score), a logistic regression comparing the honest participants (honest controls and PTSD participants) with the feigning participants correctly classified 90.8% of the honest participants and 75.3% of the feigning participants. The model was statistically significant, χ2(3) = 157.421, p < .001, and explained 60.3% of the variance in responding (Nagelkerke R2).
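
The analysis can be sketched as follows; the stand-in data and variable names are illustrative, not the study data or the authors' analysis code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in data: columns = [atypicality, structure,
# inconsistent (0/1)]; y = 1 for feigning. Real scores would go here.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(3, 2, 273),    # atypicality (stand-in)
    rng.normal(5, 3, 273),    # structure (stand-in)
    rng.integers(0, 2, 273),  # dichotomous inconsistency (stand-in)
])
y = rng.integers(0, 2, 273)   # stand-in group labels

model = LogisticRegression().fit(X, y)
pred = model.predict(X)
for label, name in [(0, "honest"), (1, "feigning")]:
    mask = y == label
    print(f"correctly classified {name}: {(pred[mask] == label).mean():.1%}")
```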

ROC Analyses

To establish cutoffs for feigning, we used ROC curves for the structure and atypicality data. For the structure items, we followed a two-step process. As previously defined, structure represents the difference between absorption and taxon scores, with the expectation that honest respondents would endorse more high-base-rate absorption items than low-base-rate taxon items; as the reader will recall, feigning is therefore associated with a low rather than a high structure score. A high score on either absorption or taxon items can occur only if the individual is expressing dissociative symptoms; if dissociation scores as a whole were low, the issue of feigning would be moot. Therefore, participants who did not elevate (score 3 or over) on any DES-R item were not included in the ROC analysis.

In the second step, a ROC curve was used to establish a cutoff separating honest respondents from feigners on structure, that is, the score below which an individual was likely to be feigning. This curve established a cutoff of 4.5, with an AUC of .70, p < .001. Thus, those who scored 4 or less on structure and who elevated at least one item on the DES-R were given 1 point (for possible feigning) on structure. Using these criteria, 18.6% (n = 19) of the honest controls, 73.3% (n = 66) of the feigning group, and 12.5% (n = 5) of the PTSD group failed on the structure variable, as did 1 of the DD case controls.

The ROC curve for atypicality established a cutoff of 5.5, above which an individual was judged to be feigning. The AUC statistic was excellent (.906, p < .001). One of the PTSD group failed the atypicality criterion, as did 18 of the honest controls (13.3%) and 79 of the feigning group (80.6%). One of the DD group also failed the atypicality criterion.
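
A sketch of how such cutoffs can be derived from a score distribution; whether the study used Youden's J or another threshold criterion is not stated, so that choice, like the stand-in scores, is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Stand-in score distributions for honest (n = 135) and feigning
# (n = 98) respondents; real atypicality/structure scores would be used.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.5, 135),
                         rng.normal(6.0, 1.5, 98)])
is_feigning = np.concatenate([np.zeros(135), np.ones(98)])

fpr, tpr, thresholds = roc_curve(is_feigning, scores)
best = np.argmax(tpr - fpr)  # Youden's J: sensitivity + specificity - 1
print(f"AUC = {roc_auc_score(is_feigning, scores):.3f}")
print(f"cutoff = {thresholds[best]:.2f}")
```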

For inconsistency, there was insufficient variance to apply the ROC procedure. Chi-square tests established that 3.7% (n = 5) of the honest control group, 22% (n = 22) of the feigning group, and 5% (n = 2) of the PTSD group endorsed at least one inconsistency item. None of the DD cases had a positive inconsistency score.

In order to calculate the sensitivity and specificity of our cutoffs, we assigned individuals one point each for scoring in the feigning range on atypicality, structure, and inconsistency. Scores for this scale, which we call the ASI, could range from 0 to 3. The full results for the three groups are in Table 4, and the related sensitivity, specificity, PPV, and NPV values are in Table 5. Note that elimination of subjects who failed at least one of the three tests retained 90% of honest responders while eliminating 71.31% of those asked to feign dissociative symptoms.

Table 4 Relationship between feigning scale score and group membership
Table 5 Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for values of ASI
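
The Table 5 values follow from the standard definitions, sketched here with illustrative counts rather than the Table 4 data, treating feigning as the positive class:

```python
# Classification diagnostics from raw counts at a given ASI cutoff.
# The counts passed below are illustrative, not the Table 4 values.

def diagnostics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    return {
        "sensitivity": tp / (tp + fn),  # feigners correctly flagged
        "specificity": tn / (tn + fp),  # honest responders retained
        "ppv": tp / (tp + fp),          # flagged cases who truly feigned
        "npv": tn / (tn + fn),          # passed cases who were truly honest
    }

print(diagnostics(tp=70, fp=18, tn=157, fn=28))  # illustrative counts
```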

Comparison of Correlations Before and After Exclusion

The number of types of potential traumas was summed based on reports in the initial demographic questionnaire. Using the full sample, the correlation of trauma with the DES-R was .14, p < .01. Using the sample after exclusion of the 84 subjects based on vocabulary, IP address, incomplete reports, and unlikely duration, the correlation was .19, p < .01. Using only those who passed all three validity screens (n = 145), the correlation was .29, p < .001. The comparable figures for the depression screen (PHQ-9) were .19, p < .01 for the full sample, .29, p < .001 after the initial exclusions, and .32, p < .001 after screening on the validity tests.
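
This before/after comparison can be sketched with stand-in data (the real study variables would be substituted for the simulated ones):

```python
import numpy as np
from scipy.stats import pearsonr

# Stand-in data: trauma-type counts (0-11), DES-R totals, and a boolean
# mask for passing all validity screens.
rng = np.random.default_rng(0)
trauma = rng.integers(0, 12, 329).astype(float)
des_r = trauma * 0.3 + rng.normal(0, 3, 329)  # induce a weak association
passed = rng.random(329) < 0.5                # stand-in screen results

r_full, p_full = pearsonr(trauma, des_r)
r_pass, p_pass = pearsonr(trauma[passed], des_r[passed])
print(f"full sample: r = {r_full:.2f} (p = {p_full:.3f})")
print(f"screened:    r = {r_pass:.2f} (p = {p_pass:.3f})")
```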

Discussion

Although the DES is commonly used in research, forensic, and clinical settings to assess dissociative symptomology, it currently does not have a validity scale. This limitation can affect the status of the field in a number of ways: decreasing effect sizes in survey research, undermining diagnostic accuracy, and encouraging the small group of critics who argue that the dissociative disorders themselves (most notably dissociative amnesia and dissociative identity disorder) do not exist (McNally, 2007; Merckelbach & Patihis, 2018). Arguably, the fact that we were able to distinguish known feigners from honest responders with 84.5% accuracy (in the logistic regression) suggests that the feigners and the allegedly true responders were in some way different. The use of three quite different methods, each of which individually differentiated the honest responders from those told to feign symptoms, also adds weight to the argument in favor of such methods. It is particularly important to note that substantial differences in findings emerge with and without taking the validity scales into account. Before the embedded validity screens were applied, the correlation between the total number of possible traumas reported and the DES-R total was .19, p < .01; considering only those who passed the validity checks, the same correlation was .29, p < .001. Substantial differences in findings with strict exclusion criteria in online samples are also noted by Thomas and Clifford (2017).

In this study, we created a validity scale for the DES-R using several different types of checks, including structure, atypicality, inconsistency, manipulation checks, and vocabulary, to assess the accuracy of responses. The inclusion of a validity scale should make the DES-V a more effective tool in Web-based research, as it provides a simple means of checking the validity of data gathered online. The package of assessments addressed a variety of methods suggested for protecting the integrity of data.

The elimination of IP addresses from non-English-speaking countries is an option available (typically at extra cost) with many survey-respondent companies, including MTurk. In our work, only 3% (n = 12) of the sample were eliminated for this reason. Although it is reasonable to argue that those from non-English-speaking countries may well have a vocabulary sufficient to take the test, it is telling that 58% of this group also failed vocabulary and 83% failed atypicality, suggesting that they were not reliable responders. Almost all of those who failed the vocabulary test (97%, n = 30) also failed one or more of the validity checks, strongly supporting the use of such measures in online research. Failure of the manipulation check was also a strong correlate of failure of the validity checks: here, 91% (n = 29) failed at least one of the embedded validity measures. Both of these findings are in keeping with Thomas and Clifford’s (2017) advisory for strong exclusion criteria in online research.

The atypicality scale demonstrated the strongest ability to differentiate honest respondents from feigners among the measures used in the study. Scores for participants in the feigning group were more than five times higher than those of honest responders and participants with PTSD. While the atypical items show good initial utility in research, it is important that these items be tested with a group of individuals with confirmed dissociative disorders before being utilized in a clinical setting. It is promising, however, that if a cut point of 1 was used, rather than 0 (the recommended cut point for online research), none of those with DD and only one PTSD client would have been excluded in the present sample.

Few participants showed the elevated levels of inconsistency in responding that were meant to pick up carelessness among those less committed to honest responding. Participants in the feigning group were more easily identified through atypicality or structure than through inconsistency items. However, inconsistency was defined here by extreme differences (strongly agreeing with two similar items worded in opposite directions). This definition disallows inconsistency between answers in the mid-range of the scale, which was characteristic of the majority of the sample. In retrospect, it may have been more useful to include an attention check rather than a set of inconsistency items in order to judge careless responding. The former method involves the inclusion of items with an obvious correct response, requires only one to two items, and has been shown to work well in other online research (Berinsky, Margolis, & Sances, 2014). A number of studies have shown comparable careless responding in online and in-person samples, but estimates of unacceptable carelessness tend to range from 5 to 10% (Johnson, 2005; Meade & Craig, 2012).

The results also indicate that the structure variable is useful in assessing the validity of responses. Among the subgroup who claimed any dissociative symptoms, the feigning group showed a smaller difference between the absorption and taxon items than did the two honest groups. Currently, assessment of taxon and absorption items is unique to the DES/DES-R, but the technique could easily be incorporated into other scales with reported base rates, such as the PHQ-9 (Rief, Nanke, Klaiberg, & Braehler, 2004), the Beck Anxiety Inventory (Gillis, Haaga, & Ford, 1995), and the Beck Depression Inventory (Dawes et al., 2010). We emphasize that structure is recommended at present only as a tool for the measurement and validation of dissociation as a symptom, including dissociation in the context of PTSD or BPD. Within populations with DID, which should be exceedingly rare in an online random sample, structure may not be valid as an indicator, as taxon scores are often elevated. In an archival sample of DES-R data from 32 DID individuals who had participated in other studies within our laboratory, 20% would have failed the structure criterion. This failure, however, is difficult to interpret, given that several of these individuals were questionable cases of DID (as reported by their therapists) and many did not have verification of their diagnoses through a reliable assessment tool.

The idea of using duration of test taking as a validity check has face validity, and it was somewhat surprising that the cutoff identified so few participants. This result may have been an artifact of study design, given that the target duration figures were generated by highly educated students who were both familiar with the instruments and trained in digesting material quickly. This design was an effort to account for the quick reaction times of professional survey takers by substituting higher average education and training, but it may have been a poor comparison.

Limitations and Conclusions

The DES-V results using the criterion of 1 or more elevated validity scales show general promise. The specificity and sensitivity scores of .90 and .71 are comparable with the average findings of many accepted validity screens. The Test of Memory Malingering Trial 1, for instance, has an average reported specificity of .90 and sensitivity between .59 and .70 (Martin et al., 2019). The validity screen for the Conners ADHD scale likewise has specificity of .86–.90 but sensitivity of .44–.63 for random responding and .31–.46 for feigning (Walls, Wallace, Brothers, & Berry, 2017). The SIRS-2 is reported to have high specificity (.90) but moderate sensitivity (.54) (Tarescavage & Glassmire, 2016). Individually, both the structure and atypicality subscales achieved statistics comparable to or better than these scales when used alone, with atypicality alone rivaling the full set of predictors (sensitivity = .80; specificity = .91). Further work will focus on a shift to the attention-check methodology to capture carelessness, expansion of the atypicality items to ensure validity across types of samples, broadening of the structure criterion (using base rates of items on the full scale, rather than only the absorption and taxon items), and inclusion of a dissociative disorder sample. The use of IP investigation or exclusion and of a vocabulary screen is also clearly supported.

Another limitation of the study, ubiquitous in this area of research, is concern about the “honest” responders. It is quite possible that a number of the honest controls were in fact malingering, despite the attempt to use social influence to increase compliance and the largely accurate manipulation check responses. If so, however, the screen is likely even more effective than presented here, and a few of the “honest” responders labeled as false positives (feigners) were indeed true feigners. It can be argued with certainty only that the honest groups were likely more honest on average than those told to feign symptoms.

The most important limitation of the current work is the absence of a dissociative disorder control group; collection of such a group is now in progress. The types of dissociation associated with simple PTSD (versus complex PTSD) are limited (Van der Hart, Nijenhuis, & Steele, 2005) and do not include the more severe fragmentation that is characteristic of dissociative identity disorder. Although those with traumatic histories in general often elevate on validity scales (Flitter, Elhai, & Gold, 2003), concern about validity scales is more significant for dissociative disorder groups (Palermo & Brand, 2018). Given that the base rate of DID is low (Ross, 1991; Şar, Akyüz, & Doğan, 2007), we believe that the current version of the scale is valuable for further research in community populations, but clinical replications are mandatory before it could be deemed usable for forensic purposes. Such research efforts are challenging, in that many institutions do not routinely screen for dissociative disorders (Ginzburg, Somer, Tamarkin, & Kramer, 2010).

The incorporation of embedded validity checks into the administration of the DES-R provides a means for the objective evaluation of feigning, a critical requirement in forensic practice. In forensic evaluation, the lack of a validity scale leaves experts relying on clinical intuition or experience in judging the reliability of a specific case, a process fraught with opportunities for miscarriages of justice (Dawes, Faust, & Meehl, 1989). In research and clinical fields, the proliferation of new assessment tools, compounded by the increasing reliance on isolated and unseen (i.e., Internet) participants, often professional survey takers who benefit more monetarily from speed than from accuracy, necessitates that more attention be paid to the validity of the data produced through such means. The DES-V may be a tool to move the field toward this goal.