The many foibles of Likert scales challenge claims that self-report measures of self-control are better than performance-based measures

Paap, Kenneth R.; Anders-Jefferson, Regina T.; Balakrishnan, Nithyasri; Majoubi, John B.

doi:10.3758/s13428-023-02089-2

Published: 09 March 2023

Volume 56, pages 908–933, (2024)

Behavior Research Methods

Abstract

Self-control and executive functioning are often treated as highly related psychological constructs. However, measures of each rarely correlate with one another. This reflects some combination of true separability between the constructs and measurement differences. Traditionally, executive functioning is objectively measured as performance on computer-controlled tasks in the laboratory, whereas self-control is subjectively measured with self-report scales of predispositions and behaviors in everyday life. Self-report measures tend to better predict outcomes that should be affected by individual differences in control. Our two studies show that the original version of Tangney, Baumeister, and Boone’s brief self-control scale (consisting of four positive and nine negative items) strongly correlates with self-esteem, mental health, and fluid intelligence, but only weakly with satisfaction with life and happiness. Four variants of the original scale were created by reverse-wording the 13 original items and recombining them to form, for example, versions with all positive or all negative items. As the proportion of items with positive valence increased: (1) the outcomes with strong correlations in the original scale weakened and the weak correlations strengthened and (2) the mean overall scores increased. Both studies replicated a common finding that the original scale yields two factors in an exploratory factor analysis. However, the second factor is generated by method differences, namely, having items with both positive and negative valence. The second factor is induced by the common practice of reverse-coding the items with negative valence and the faulty assumption that Likert scales are equal-interval scales with a neutral point at midscale.

There are several important psychological constructs that are highly related to one another: self-control (SC), executive functioning (EF), self-regulation, impulse control, cognitive control, attention control, executive attention, will power, grit, ego-strength, inhibitory control, and the list goes on. Nigg (2017) provides a useful discussion of the similarities and differences between these constructs and a framework that might facilitate integration and cross-talk between researchers working from different perspectives. This project is in that spirit but focuses on only SC and EF because they are verbally described in much the same way, yet dominate different disciplines and typically use different measurement methods. The project also has close ties to self-regulation, as Nigg observed: “Executive function and cognitive control are not identical to self-regulation because they can be used for other activities, but account for top-down aspects of self-regulation at the cognitive level” (p. 1).

For the present purposes, SC and EF refer to the set of general-purpose control processes central to the self-regulation of thoughts, emotions, and behaviors that are instrumental to accomplishing goals, especially in the presence of more tempting or automatic action plans that would disrupt or replace goal-consistent actions if they were not subject to control (Paap, 2023). EF plays a critical role in cognitive science, where EF ability is typically measured in terms of performance on artificial, but highly prescribed, computer-controlled tasks such as the Stroop color–word interference task (Paap et al., 2022). Traditions from personality, social, developmental, counseling, and clinical psychology, as well as psychiatry, rely heavily on self-report (or reports from other informants such as parents or teachers) of typical behavioral tendencies in everyday life. Our focal example of this type of scale is the Brief Self-Control (BSC) scale (Tangney et al., 2004), where participants are asked to indicate how much a statement (e.g., I am good at resisting temptation) reflects how they typically are on a five-point scale ranging from Not at all to Very much.

Are these two traditions studying the same construct and leading to compatible understandings of individual differences in self-control or how control varies across situations?

Convergent validity of subjective self-report measures

Preliminary to the question of the degree to which the two approaches measure the same construct is whether there is agreement between measures developed within each tradition. Our recent publications have often included the BSC, Barkley’s Deficits in Executive Functioning (BDEFS, Barkley, 2011), and three subscales from the UPPS Impulsive Behavior Scale (Whiteside & Lynam, 2001). Table 1 is not an exhaustive review of the relevant literature but does support the conclusion that commonly used scales based on subjective self-reports usually correlate with one another and sometimes the correlations are strong.

Table 1 Correlations between three subscales of the UPPS-P Impulsive Behavior Scale and the BSC and BDEFS self-control scales

Convergent validity of performance-based measures is weak

The answer to the question of convergent validity for performance-based measures is less straightforward because seminal studies using latent variables suggested that EF should be viewed as three separable components: shifting, updating, and inhibition (Miyake et al., 2000). Although the convergent validity observed for the first two components is typically adequate, that for inhibitory control is notoriously challenged (Paap & Sawi, 2014). Even variants of the same task can have near-zero correlations, as exemplified by Salthouse (2010) for the flanker task and Shilling et al. (2002) for the Stroop task. Rey-Mermet et al. (2018) conclude that there is little evidence for domain-general inhibition and that inhibitory control is task specific. In their review, they point out that an inhibitory-control factor is often dominated by a single measure and that statistical models offer weak support for a domain-general EF ability. Randy Engle’s group (Draheim et al., 2020; Burgoyne et al., 2023) offers a spirited rebuttal to this pessimistic view based on the development of a set of new performance-based tasks that show substantially better reliability and convergent validity. However, the relevant and voluminous research literature used the “traditional” tasks, and because these measures do not show adequate convergent validity, the best possible state of affairs would be that some coherent subset of them would strongly correlate with self-report measures.

The modest association between the two types of measures

Recall that SC and EF were defined as the set of general-purpose control processes central to the self-regulation of thoughts, emotions, and behaviors that are instrumental to accomplishing goals: especially in the presence of more tempting or automatic action plans that would disrupt or replace goal-consistent actions if they were not subject to control. The core idea that control is needed to prevent an act that would otherwise occur is a highly salient facet of self-report measures of control. In contrast, items referring to switching (mental flexibility), updating (working memory), and planning are less common. This consideration leads to the expectation that any alignment between the two types of measures may appear for some of the performance-based measures of inhibitory control rather than switching or updating. This expectation was not confirmed because weak correlations between the two types of measures are typically observed for all components of EF.

In their recent excellent analytic review, Friedman and Gustavson (2022) conclude that “…ample evidence now suggests that tasks and ratings do not correlate well with each other” (p. 262). This echoes our independent evaluation (Paap, 2023) and follows in the footsteps of the first seminal reviews by Cyders and Coskunpinar (2011) and Duckworth and Kern (2011). Another highlight comes from Allom et al. (2016), who showed near-zero correlations between a composite of self-report measures and a composite of performance-based measures.

The capstone for this lack of convergence is provided by Mazza et al. (2020), who obtained measures of self-regulation from 23 self-report surveys and 37 cognitive tasks from a group of 522 adult participants (mean = 33.6 years old) recruited through MTurk. Variables derived from self-report scales weakly correlated with variables derived from the cognitive tasks (M = + 0.05, range = 0.00 to 0.27 for the absolute value of r). The self-report scales included the two we have used most often (BSC and UPPS impulsivity), and the cognitive tasks included many performance-based stalwarts (e.g., stop-signal, go/no-go, flanker, Simon, Stroop, backward digit span, backward spatial span, and N-back).

When two purported measures of the same construct fail to correlate with each other, one way of determining which is the more valid measure is to see how well they each predict anticipated outcomes. For example, individuals with better self-control should engage in more health-promoting activities. Indeed, Allom et al. report that the self-report measure of control (but not performance-based measures of EF) predicted the degree of physical exercise. Thus, this suggests that the self-report measures are the more valid measures of self-control ability. We have told a very similar story in Mason et al. (2020) where we reported that a set of performance-based measures of EF (viz., switch costs, mixing costs, and spatial Stroop effects) do not correlate with self-report measures such as the BSC or BDEFS. Furthermore, like Allom et al., we correlated both self-report and performance-based measures with physical activity and replicated their finding that only the self-report measures predicted the amount of physical activity. Paap (2023) reviews additional studies consistent with the view that self-report and performance-based measures weakly correlate with each other and that self-report measures usually enjoy greater predictive validity.

Issues associated with self-report measures

To this point, the groundwork has been laid for the hypothesis that subjective measures of SC/EF in everyday life may be superior to performance-based measures because they tend to correlate more strongly and consistently with tendencies to exercise control outside the laboratory. Given the shared ecological validity of self-report measures and self-report outcomes, and the obtained pattern of correlations, the potential superiority of self-report enjoys substantial plausibility. However, self-report measures are vulnerable to response biases and measurement problems different from the challenges faced in performance-based measures. Another purpose of this project is to explore these problems in the specific context of the most popular self-report scale of SC.

The BSC as our focal example

Tangney, Baumeister, and Boone’s (henceforth TBB’s) BSC (Brief Self-Control) scale was an obvious target, as it already has an immense user group and had gathered more than 7800 Google Scholar citations as of December 2022. In general, Likert scales ask how much a person agrees with a statement. For example, another item from the BSC asks respondents how much they agree with the statement I am able to work effectively toward long-term goals. In the most straightforward application, the BSC can be treated as a unitary measure of general SC with individual differences reflected in the total score or the mean across all 13 items. The choice between total score and mean score is a matter of preference, and henceforth the mean score is used.

Reverse-wording

The following discussion relies on the concepts of reverse-wording and reverse-coding. Please avoid the jingle fallacy, that is, erroneously assuming that two quite different things are the same simply because they have similar names. Reverse-wording refers to taking a potential scale item and rewriting it so that the valence is reversed. An item with positive valence, I am good at resisting temptation, might be reverse-worded to I am bad at resisting temptation. Similarly, an item with negative valence, I am lazy, might be reverse-worded to I am not lazy. There are two ways of reverse-wording an item: (1) negation, either by adding a negative particle such as not or no or by adding an affix such as un- or -less (e.g., I am not lazy), and (2) replacing a keyword with its polar opposite (e.g., bad for good or energetic for lazy).

Reverse-coding

Reverse-coding refers to the standard practice applied to items with negative valence whereby the raw score is subtracted from the maximum scale value (“5” in the case of the BSC) plus 1. For example, a response of “1” to I am lazy would be reverse-coded to 6 – 1 = 5 and a response of “2” to I am bad at resisting temptation would be reverse-coded to 6 – 2 = 4. The underlying (but usually implicit) assumptions are that reverse-wording completely reverses the semantics AND that the Likert scale is equal interval with a neutral point of 3 such that reverse-coding is an unbiased transformation. In an ideal application, the response to a negative item (e.g., I am bad at resisting temptation. “2”) after reverse coding (6 – 2 = 4) will be identical to the response given to its positive mate (e.g., I am good at resisting temptation, “4”).
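The reverse-coding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring script used in the study; the function name reverse_code and the parameter scale_max are assumptions introduced here, and the sketch assumes a scale whose lowest value is 1.

```python
def reverse_code(raw: int, scale_max: int = 5) -> int:
    """Reverse-code a Likert response on a 1..scale_max scale.

    The raw score is subtracted from the maximum scale value plus 1.
    """
    if not 1 <= raw <= scale_max:
        raise ValueError(f"response {raw} is outside 1..{scale_max}")
    return (scale_max + 1) - raw


# The two worked examples from the text (five-point BSC scale).
print(reverse_code(1))  # 5
print(reverse_code(2))  # 4
```

Note that the midpoint response (3) maps onto itself, which is exactly where the assumption of a neutral point at midscale enters.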

Acquiescence bias

But what motivates researchers to include both positive and negative items? One belief is that it reduces the likelihood of an acquiescence bias. Acquiescence is a response tendency to agree with statements. It is considered a personality trait, with some individuals predisposed to acquiesce, but it is also thought to be more prevalent when the situation promotes satisficing over optimizing responses. Furthermore, response inertia might potentiate acquiescence as one gets into the rhythm of indicating agreement at the top end of the scale: 5, 5, 5…. However, if the 5’s given to negative items are reverse-scored to 1’s, then the mean score due to acquiescence (for a scale with 50% negative items) is 3.0, the scale midpoint. If acquiescence continues unabated and equivalently on both positive and negative items, then its effects on the overall scale mean (after reverse scoring the negative items) may cancel out.

A different possibility is that participants may experience the inherent inconsistency in bouncing back and forth between agreeing to both X (e.g., having good self-control) and not X. This might lead to a response strategy that substantially reduces the tendency to acquiesce. Both of these scenarios lead to a better state of affairs, namely, mean scores that are less biased than those obtained with scales consisting of only positive (or only negative) items.
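The cancellation scenario can be made concrete with a small sketch. It scores a purely hypothetical respondent who answers 5 to every item, on three hypothetical 12-item scales that differ only in the proportion of negative items. The function name scale_mean and the 12-item length are assumptions chosen for illustration.

```python
def scale_mean(responses, is_negative, scale_max=5):
    """Mean scale score after reverse-coding the items flagged as negative."""
    coded = [
        (scale_max + 1 - r) if negative else r
        for r, negative in zip(responses, is_negative)
    ]
    return sum(coded) / len(coded)


n_items = 12                      # hypothetical scale length
always_agree = [5] * n_items      # a pure acquiescer: 5, 5, 5, ...

balanced = [i % 2 == 1 for i in range(n_items)]   # 50% negative items
all_positive = [False] * n_items
all_negative = [True] * n_items

print(scale_mean(always_agree, balanced))      # 3.0
print(scale_mean(always_agree, all_positive))  # 5.0
print(scale_mean(always_agree, all_negative))  # 1.0
```

On the balanced scale the acquiescent responses land on the midpoint, whereas on single-valence scales the same response style pushes the mean to one extreme.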

Social desirability bias

In order to protect one’s self-image or to project a more positive image, some individuals may be biased to respond in a socially desirable way, viz., shifting their self-appraisal in the direction of greater agreement (toward 5) with positive items and less agreement (toward 1) with negative items. When the negative items are (as usual) reverse-coded, the overall effect of social desirability is to bias the mean scores in the direction of better self-control (toward 5). Thus, the overall mean scores for self-control are susceptible to desirability effects, as high means may indicate excellent self-control or a strong desirability bias. Unlike acquiescence bias, including negative items does not appear to provide any mitigating mechanism for controlling a desirability bias. A common strategy is to measure social desirability, partial out its effects in a simultaneous regression, and hope that the relationship of interest remains robust.
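To illustrate what partialling out social desirability does to a bivariate relationship, the sketch below uses the first-order partial correlation formula. This is a simpler stand-in for the simultaneous regression mentioned above, not the analysis reported in this article, and the three input correlations are hypothetical values chosen only for illustration.

```python
from math import sqrt


def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with z partialled out of both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical inputs (not results from any study):
r_sc_outcome = 0.40   # self-control scale with an outcome measure
r_sc_sds = 0.45       # self-control scale with social desirability
r_outcome_sds = 0.30  # outcome measure with social desirability

print(round(partial_r(r_sc_outcome, r_sc_sds, r_outcome_sds), 2))  # 0.31
```

In this made-up case the association shrinks but does not vanish once social desirability is controlled, which is the pattern a researcher using this strategy hopes to see.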

The foibles of reverse-wording

As described above, a popular cure for acquiescence bias is to include items with both positive and negative valence. During scale development, this is likely to involve reverse-wording some positive items into negative items that will be reverse-coded in order to derive an overall scale mean. This cure for acquiescence bias may be worse than the disease because it is very difficult to completely reverse the semantics or content of an item. As Krosnick and Presser (2010) describe in a meta-analysis of 41 studies, nonequivalence is more likely than not. When people are asked to agree or disagree with pairs of statements stating mutually exclusive views (e.g., I enjoy socializing vs. I don’t enjoy socializing), the between-pair correlation (before reverse-scoring) is only –0.22. Although some of the weakness in this negative correlation could be due to acquiescence to both types of statements, it more likely reflects that syntactic negation is rarely understood as a polar opposite meaning that would lead to a negative correlation of –1.00.

In a project using 100 San Francisco State University undergraduates, we tried to obtain a purer measure of the semantics of the BSC’s items and their reversals by focusing the target behavior/predisposition on a hypothetical other person rather than on a self-appraisal. To minimize acquiescence, judgments involving self-appraisal were replaced with direct judgments of good or poor self-control. The instruction read as follows: “Suppose someone you do not know says: ‘I have a hard time breaking bad habits.’ Based on this statement, how much self-control (aka self-discipline, willpower, impulse control, perseverance) do you think this person has? 1 = Substantially Below Average, 2 = Somewhat Below Average, 3 = Average, 4 = Somewhat Above Average, 5 = Substantially Above Average”.

When there is no issue of agreement (and hence no opportunity for acquiescence) and no issue of social desirability (because the question is not about the respondent), the correlation (after reverse coding the negative items) between the positively and negatively worded versions of the 13 BSC items averages r = + 0.33. Although this approaches a medium effect size, it shows that knowing the degree of self-control implied by a description of some specific behavior predicts only 11% of the variance in judgments about the reverse-wording of that behavior. Consider two contrasting examples. “I have trouble concentrating” captures quite well the polar opposite of “I have good concentration”, r = 0.72 (when the negative item is reverse coded), but the reverse wording of most of the original BSC items does not substantially reverse the semantics. For example, the correlation between “Sometimes I can’t stop myself from doing something even if I know it is wrong” and “Sometimes I can stop myself from doing something when I know it is wrong” is r = –0.14. Simply put, negating a statement does not mean that the comprehender will make a polar opposite inference about its meaning.
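The step from a correlation to a proportion of shared variance is simply squaring r. The short sketch below applies that step to the three correlations quoted in the paragraph above; the labels and the function name variance_explained are introduced here for illustration.

```python
def variance_explained(r: float) -> float:
    """Proportion of variance shared by two variables correlated at r."""
    return r ** 2


reported = [
    ("mean across the 13 item pairs", 0.33),
    ("concentration pair", 0.72),
    ("stop-myself pair", -0.14),
]

for label, r in reported:
    print(f"{label}: r = {r:+.2f}, r squared = {variance_explained(r):.2f}")
```

The mean correlation of + 0.33 corresponds to about 11% shared variance, the best-matched pair to about 52%, and the poorly matched pair to about 2%.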

Scales that use both positive and negative items seem to require that we make the (nearly always unstated) assumption that respondents really do think about valence as a single dimension. As a reviewer observed, substantial evidence from behavioral economics strongly suggests that this is often not the case. For example, people seem to reason about potential losses and potential gains in qualitatively different ways even when the phrasing results in mathematically equivalent results (Kahneman & Tversky’s prospect theory, 1979). For example, people tend to overweight small probabilities to guard against losses. From this perspective, the fact that mirrored items have different factor loadings accurately captures real cognitive differences between them, but this source of item-item difference is minimized by using scales with all positive or all negative items.

Reverse-worded items of the need for cognition scale alter the factor structure

As we were embarking on our odyssey through the BSC, we were unaware that Zhang et al. (2016) had explored similar manipulations of the 18-item Need for Cognition (NFC) scale (Cacioppo et al., 1984). In the original NFC, nine items endorsed the need to engage in and enjoy cognitive activities and nine were reverse-worded. Three new versions were created. The All-Positive version reversed the nine reverse-worded items to create a scale that uniformly endorsed a need for cognition, for example, by changing “Thinking is not my idea of fun” to “Thinking is my idea of fun”. The other two versions maintained an even division between positive and negative endorsement of the need for cognition, but the nine reversed items in version Reverse-1 were uniformly created by using polar opposite adjectives (e.g., I would prefer simple to complex problems) and in Reverse-2 by using negative particles (e.g., The notion of thinking abstractly is not appealing to me). About 315 University of British Columbia undergraduates completed each of the four versions.

Although these appear to be subtle wording changes, Zhang et al. correctly anticipated that they would lead to scales with different factor structures. Exploratory factor analyses of the original NFC scale clearly indicated two factors, but the modified scales yielded only weak evidence for two factors. Consistent with this analysis, the correlation between the two factors was only + 0.56 for the original NFC, but + 0.96 or greater for the three versions with wording changes. Confirmatory factor analyses verified that the type of reverse-wording may also affect the factor structure of the NFC. Inconsistent responses to polar opposite items may occur because they are not actually polar opposites with respect to the construct of interest. A respondent may agree with both I like simple tasks and I like complex tasks because liking simple tasks does not preclude also liking complex tasks. This problem is often referred to as reversal ambiguity. Furthermore, there is a long and convoluted history of psycholinguistic research on the comprehension of negative sentences (see Wang et al., 2021) that permits the conclusion that negation is tricky even when the reader does not miss the negative particle due to inattentiveness. Zhang et al. are drawn to the conclusion that the use of reverse-worded items in Likert scales “… has serious disadvantages” (p. 13).

Study 1: between-subjects comparisons of different versions of the BSC

The Zhang et al. study resets the stage for our interest in how promising the BSC (and other self-report measures of self-control) might be as reliable and valid measures of self-control. The purpose of this study was to investigate the effects of reverse-wording and valence (behaviors that have either positive or negative social desirability) in the BSC with respect to not only the factor structure of the scale but also its ability to predict both positive and negative outcomes in everyday life. Thus, testing the predictive validity of wording variants of the BSC is an important and novel contribution of our study.

Method

Measures of self-control

TBB reported that the original BSC enjoyed excellent reliability in two large samples of college students. Cronbach’s alpha showed an internal consistency of 0.83 (N = 351) in Study 1 and 0.85 (N = 255) in Study 2. Test–retest reliability over a 1–3-week period was measured in Study 1 and yielded an impressive r = 0.87.
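For readers less familiar with the internal-consistency statistic cited above, Cronbach’s alpha can be computed from a respondents-by-items score matrix as k/(k − 1) × (1 − sum of item variances / variance of total scores). The sketch below is a minimal standard-library illustration on a made-up three-item, five-respondent data set; it is not the BSC data, and it assumes that any negative items have already been reverse-coded.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                              # number of items
    item_columns = list(zip(*rows))               # one tuple of scores per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_item_var / total_var)


# Toy data (hypothetical): 5 respondents x 3 items on a 1-5 scale.
toy = [
    [4, 5, 4],
    [2, 3, 2],
    [3, 3, 4],
    [5, 4, 5],
    [1, 2, 1],
]

print(round(cronbach_alpha(toy), 2))  # 0.94
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why a response style shared across items (such as acquiescence on a single-valence scale) can also inflate it.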

For the present study, five versions (counting the original form with 13 items) of the BSC were created. As shown in Table 2, each of the 13 original items was reworded to reverse its valence. The new version is termed the Mirrored version and consists of nine items with positive valence and four items with negative valence.

Table 2 The original BSC items and their mirrored version

An All-Positive version was formed by retaining the original four positive items and recombining them with the nine new positive items from the Mirrored version. Similarly, an All-Negative version was formed by retaining the original nine negative items and recombining them with the four new negative items from the Mirrored version. An anomaly that captured our attention is that the original Item 8 (People would say that I have iron self-discipline.) asks about what other people would say rather than soliciting a self-appraisal. In order to explore the extent to which a subtle wording change unrelated to valence can affect an item’s interpretation, a fifth version that we dubbed the Who Version replaced the original Item 8 with this statement: I would say that I have iron self-discipline.
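The recombination logic for the five versions can be summarized as a small data-structure sketch. It is only an illustration of the bookkeeping: the text states that the original scale has four positive and nine negative items and that Item 8 is the item changed in the Who Version, but the particular set of positive item numbers used below is an assumption made for this example.

```python
# Item numbers 1..13 of the original BSC.
ITEMS = range(1, 14)

# ASSUMPTION (illustrative only): which four items are positively worded in
# the original scale. Only the count (four) and the wording of Item 8 come
# from the text; the other three numbers are placeholders.
ORIGINAL_POSITIVE = {1, 6, 8, 11}


def build_versions():
    """Return each scale version as {item number: (wording source, valence)}."""
    original = {i: ("original", "+" if i in ORIGINAL_POSITIVE else "-") for i in ITEMS}
    # Mirroring reverse-words every item, so every valence flips.
    mirrored = {i: ("mirrored", "-" if v == "+" else "+") for i, (_, v) in original.items()}
    # For each item, keep whichever wording has the wanted valence.
    all_positive = {i: original[i] if original[i][1] == "+" else mirrored[i] for i in ITEMS}
    all_negative = {i: original[i] if original[i][1] == "-" else mirrored[i] for i in ITEMS}
    # The Who version changes only the wording of Item 8, not its valence.
    who = dict(original)
    who[8] = ("who", "+")
    return {
        "Original": original,
        "Mirrored": mirrored,
        "All-Positive": all_positive,
        "All-Negative": all_negative,
        "Who": who,
    }


def valence_counts(version):
    """Count (positive, negative) items in one version."""
    positive = sum(1 for _, valence in version.values() if valence == "+")
    return positive, len(version) - positive


for name, version in build_versions().items():
    print(name, valence_counts(version))
```

Under these assumptions the counts are (4, 9) for the Original and Who versions, (9, 4) for Mirrored, (13, 0) for All-Positive, and (0, 13) for All-Negative, matching the composition described in the text.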

Measures of outcomes associated with self-control

In order to assess the predictive validity of the five versions of the BSC, all participants completed several scales that have been associated with self-control in past research. For each scale below, we describe the scale, report its internal and retest reliability, and then review the published correlations between that outcome scale and the BSC.

Self-esteem

Self-esteem was measured using the Rosenberg (1965) Self-Esteem Scale consisting of five items with positive valence (e.g., I feel I am a person of worth) and five items with negative valence that are reverse scored (e.g., I feel I do not have much to be proud of) to yield an overall score where larger numbers signify greater self-esteem. A four-point Likert scale is used, anchored by Strongly Agree (1) and Strongly Disagree (4). The scale has good internal reliability (Cronbach’s alphas ranging from 0.72 to 0.91). For example, Sinclair et al. (2010) reported an alpha of 0.91 for a representative sample of US adults (N = 503). Test–retest correlations with college students have indicated a 1-week test–retest correlation of 0.82 (Fleming & Courtney, 1984) and a 2-week test–retest correlation of 0.95 (Silber & Tippett, 1965).

A search of the published literature returned 15 studies that reported the bivariate correlation between the BSC and Rosenberg’s Self-Esteem scale. These ranged from r = + 0.19 (531 undergraduates, M = 19.3 years old, Trumpeter et al., 2006) to r = + 0.53 (147 university students from Tehran, M = 26.9 years old, Ghorbani et al., 2014) with a mean across all 15 studies of r = + 0.38.

General health questionnaire (GHQ)

General mental health was measured using the 12-item scale originally developed by Goldberg (Goldberg & Williams, 1991). Six items have positive valence (Have you been able to concentrate well on what you were doing) with Likert choices of 0 (better than usual), 1 (same as usual), 2 (less than usual), 3 (much less than usual). The remaining six items have negative valence (Have your worries made you lose a lot of sleep) with Likert choices of 0 (not at all), 1 (no more than usual), 2 (more than usual), and 3 (much more than usual). If the scale values of 0 to 3 are used, then better health is signified by smaller totals or means. The test–retest reliability over a 7–14-day interval for an Italian version of the scale (N = 83) was r = 0.84 (Piccinelli et al., 1993) when administered to adult volunteers at a general medical practice clinic. Cronbach’s alpha (0.90, N = 3705) showed good internal consistency in the Health Survey for England 2004 cohort (Haskins, 2008).

Three published studies reported that better mental health was significantly associated with better self-control, r = –0.42 (159 college students, M = 21.3 years old, Fung et al., 2020), r = –0.19 (328 Russian science majors, M = 18.4 years old, Gordeeva et al., 2017), and r = –0.27 (106 employees, M = 44.1 years old, Jammieson et al., 2017).

Satisfaction with Life (SWL)

Satisfaction with life was measured using Diener et al.’s (1985) classic five-item scale (e.g., In most ways my life is close to my ideal) with Likert choices from 1 (Strongly disagree) to 7 (Strongly agree). Diener et al. (1985) tested two large samples of college students (N = 176 and N = 163) and a smaller group of 53 elderly persons. For the undergraduates, the 2-month test–retest correlation was 0.82 and Cronbach’s alpha was 0.87.

Fourteen studies have reported the bivariate correlation between the BSC and Diener’s SWL scale. These ranged from r = + 0.20 (500 Chinese adult employees, M = 28.0 years old, Dou et al., 2019) to r = + 0.37 (328 Russian undergraduate science majors, M = 18.4 years old, Gordeeva et al., 2017) with a mean of r = + 0.29.

Happiness

Self-rated happiness was measured with Lyubomirsky and Lepper’s (1999) four-item Subjective Happiness Scale (e.g., In general, I consider myself: 1 “not a very happy person” to 7 “a very happy person”). One item required reverse coding. The developers validated their happiness scale across more than a dozen samples totaling 2732 participants in the United States and Russia who were either college students or from the local community. Internal consistency measured as Cronbach’s alpha ranged from 0.80 to 0.94 with a mean of 0.86. Test–retest reliability was assessed across five samples at time lags ranging from 3 weeks to 1 year. The reliability ranged from 0.55 to 0.90 (M = 0.72). The smallest coefficient (r = 0.55) was observed in a U.S. adult community sample that was retested after 1 year. More recently, Extremera and Fernández-Berrocal (2013) reported a Cronbach’s alpha of 0.81 and retest reliability (at intervals of 6–8 weeks) of r = 0.72 for a Spanish version of the scale.

We know of only one other study that probed this association between self-control and happiness. Fung et al. (2020) reported a significant positive correlation (r = + 0.33) based on responses from 903 students attending Chinese universities.

Social desirability

Stöber’s (2001) Social Desirability Scale (SDS) was primarily used as a covariate when treating BSC as a predictor of the outcome variables described above. SDS scores will correlate with BSC scores to the extent that an individual is biased to overreport on items with positive valence or underreport on those with negative valence. The original SDS has nine positive-valence items (e.g., I always admit my mistakes openly and face the potential negative consequences) and seven with negative valence (e.g., I sometimes litter). One point is scored for each true response to a positive item and one point for each false response to a negative item. Thus, larger totals indicate a greater bias to give socially desirable answers. To avoid issues of legality, we did not use one of the negative items (I have tried illegal drugs …). Stöber reported test–retest correlations over 0.80 across intervals from 2 to 6 weeks and Cronbach alphas of either 0.74 or 0.75 across three college student samples and a large community sample.
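The true/false scoring rule described above (one point for true on a positive item, one point for false on a negative item) can be sketched as follows. The item identifiers and responses are made up for illustration and are not the actual SDS items; the function name score_sds is likewise an assumption.

```python
def score_sds(responses, positive_items):
    """Score true/false social desirability responses.

    responses: dict mapping item id -> True ("true") or False ("false").
    positive_items: set of ids for items with positive valence.
    """
    score = 0
    for item, said_true in responses.items():
        if item in positive_items:
            score += int(said_true)        # "true" to a desirable statement
        else:
            score += int(not said_true)    # "false" to an undesirable statement
    return score


# Hypothetical four-item example (ids and answers are made up).
positive_items = {"admit_mistakes", "always_polite"}
responses = {
    "admit_mistakes": True,   # positive item, true  -> 1 point
    "always_polite": False,   # positive item, false -> 0 points
    "litter": False,          # negative item, false -> 1 point
    "gossip": True,           # negative item, true  -> 0 points
}

print(score_sds(responses, positive_items))  # 2
```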

Prior studies show strong correlations between BSC and social desirability scores. Bertrams and Dickhäuser (2012) reported r = + 0.46 for a sample of 150 undergraduates. Similarly, Kwapis and Bartczuk (2020) reported r = + 0.45 for a sample of 141 adolescents (M = 17.7 years old) with both scales translated into Polish. Uysal and Knee (2012) reported similar correlations when social desirability was measured with the Marlowe–Crowne scale: r = + 0.43 for 160 undergraduates in Study 1, r = + 0.59 for 74 undergraduates in Study 2, and r = + 0.51 for 55 undergraduates in Study 3. Collectively, these substantial correlations highlight the possibility that positive correlations between BSC scores and other desirable outcomes may be mediated by social desirability.

Raven’s test of general fluid intelligence (gF)

Fluid intelligence was assessed using Set 1 of the Raven’s Advanced Progressive Matrices (Raven et al., 1977). The task consisted of 12 items. Each item was composed of a pattern with a missing piece in the lower right. Participants were instructed to “Look at the pattern, think what the missing part must be like to complete the pattern correctly, both across the rows and down the columns.” Participants selected from a set of eight alternatives. The task was computerized and controlled by Qualtrics. Participants were given a maximum of 2 min to respond to each item. Most responses, regardless of correctness, in this self-paced computer-controlled version were made well within the deadline. The manual states that with self-pacing, Set 1 can be used as a short 10-min test. The 12-item test has a decent Cronbach’s alpha, for example, 0.81 (Partchev, 2020) and 0.73 (Bors & Stokes, 1998, N = 506 University of Toronto students). Arthur et al. (1999) reported a test–retest r = 0.76 for 71 participants at a 1-week interval.

Raven’s scores were included in this study because our previous work (Paap et al., 2020) showed that general fluid intelligence was the most consistent predictor of performance-based measures of self-control. Furthermore, some experts (Salthouse, 2005, 2010) have argued that EF and gF may be two names for the same ability. However, a significant correlation was not anticipated in this study as we had never observed a significant correlation between the BSC and Raven’s using samples of university students (Paap et al., 2019, r = –0.07; Mason et al., 2021, r = –0.10; Paap et al., 2019, r = –0.04; Paap et al., 2022, r = + 0.05). Similarly, Erceg et al. (2019) reported a correlation of r = –0.14 for a sample of 159 college students (M = 21.3 years old). Finally, Mazza et al. (2020) reported a correlation of r = + 0.07 based on a sample of 522 MTurk participants (M = 33.6 years) paid considerably more than is typical ($60 plus an average of $10 in bonuses) for completing a 10-h battery of surveys and cognitive tasks. This disconnect between the relationship of self-report and performance-based measures of self-control to Raven’s scores is not surprising given our earlier discussion that the two types of self-control measures do not correlate with each other.

Design and procedure

Participants were randomly assigned to one of five groups that differed only with respect to the BSC version they received: original BSC, All Positive, All Negative, Mirrored, and Who. The sequence of events was controlled by a Qualtrics Survey: (1) informed consent, (2) language and demographic background, (3) the randomly assigned version of the BSC, (4) self-esteem scale, (5) general health questionnaire, (6) satisfaction with life scale, (7) subjective happiness scale, (8) social desirability, and (9) Raven’s test of gF.

Participants

MTurk workers were recruited for a modest compensation of $0.20. MTurk results were returned to Qualtrics from 1439 workers. Responses were deleted if less than 70% of the survey was completed. Because the Raven’s test came last (and constituted the final 30%), this means that correlations involving gF are based on somewhat smaller sample sizes. Each participant responded to only one of the versions of the BSC and this was followed by an attention check such as “respond 3 to this item”. A similar attention check was presented after the GHQ scale and again after the Happiness scale. Seventy-six (6.2%) participants were deleted because they failed more than one of the three attention checks. An additional and more important screening involved the following affirmation procedure. Participants were warned at the onset of the survey that we would eventually ask them if they had paid attention to each item and if they had answered honestly. We also told them that they would be paid regardless of how they answered these questions. These two affirmation items were presented at the end of the survey. Fourteen (1.3%) reported that they did not always pay attention and 40 (3.6%) reported that they did not always answer honestly. After eliminating these participants, a pool of 1003 qualified participants remained with about 200 participants in each group (see Table 3 for specific N’s).

Table 3 Descriptive statistics for the five versions of the BSC


Results

Differences between BSC versions

At the individual participant level, all responses to items with negative valence were reverse scored by subtracting each raw score from 6. The left side of Table 3 shows the means across all 13 items for each of the five versions of the BSC with larger means indicating greater self-control. A one-way ANOVA with Version as a between-subject variable was significant, F(4, 997) = 12.51, p < 0.001. Post hoc Bonferroni tests were used to compare each of the modified versions to the mean of the original BSC. By this relatively conservative test, the only significant difference was that the All-Positive mean of 3.63 was greater than the mean of 3.37 for the Original BSC, p < 0.001.
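As a small illustration of the scoring step described above, the following Python sketch reverse scores negative-valence items on a 1-to-5 scale and averages across items. The function names and the toy responses are hypothetical.

```python
def reverse_score(raw: int, scale_max: int = 5) -> int:
    """Reverse score a response on a 1..scale_max scale (6 - raw when scale_max is 5)."""
    return (scale_max + 1) - raw


def scale_mean(raw_responses, is_negative):
    """Mean item score after reverse scoring the negative-valence items.

    raw_responses: raw 1-5 responses, one per item.
    is_negative:   parallel booleans, True where the item has negative valence.
    """
    scored = [
        reverse_score(raw) if negative else raw
        for raw, negative in zip(raw_responses, is_negative)
    ]
    return sum(scored) / len(scored)


# Toy participant: one positive item (4) and two negative items (2 and 5).
print(scale_mean([4, 2, 5], [False, True, True]))  # scored values 4, 4, 1 -> prints 3.0
```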

It should be noted that the mean (3.37) for the Original version of our MTurk sample is greater than the means reported by TBB for college-student samples (3.02 and 3.07). This difference may be driven mostly by age, as the mean MTurk participant in our sample was 39 years old and, for the group assigned to the Original version, the significant correlation between age and mean self-control is r = + 0.23. As shown in Fig. 1, the linear fit to these data shows a mean score just above 3.00 for a 20-year-old participant. A meta-analysis of 50 studies using the BSC reported a mean of 3.26 with a range of 2.87 to 4.26. In summary, the mean for the Original version in our study fits very well with means observed in past studies.

Assessing the possibility that the obtained pattern of mean differences across the five versions reflects contributions from acquiescence or social-desirability bias is complicated. Consider first a participant who only pays attention to the valence of the item (from a social desirability perspective) and responds 5 to all the positive items and responds 1 to all negative items. As shown in Table 4, this would result in a mean of 5.0 for all four versions because all responses of 1 to negative items are reverse coded to 5’s and, of course, all positive items are 5’s to begin with. Thus, uniform social-desirability does not drive differences between versions that differ in terms of ratio of positive to negative items. Said another way, a strong desirability bias drives scores higher in general, but all other factors equal, all boats (versions) should rise or fall together as the tides of desirability ebb or flow.

Table 4 Effects of reverse-coding items with negative valence on overall scale means assuming a total acquiescence bias or a total desirability bias


In contrast to the scenario just discussed, suppose a participant adopts a complete acquiescence bias and responds 5 to all items regardless of whether they have positive or negative valence. Because the negative-valence items are reverse coded, a complete acquiescence bias to agree to any type of statement should have opposing effects on positive and negative items. That is, if a strong acquiescence bias leads to full agreement to “I am good at resisting temptation”, the individual will select 5 (Very Much) and the scored value is 5. If that same strong acquiescence bias leads to full agreement to “I am lazy”, the individual will likewise select 5 (Very Much), but when this negative item is reverse scored the scored value is 6 − 5 = 1. Thus, the means for the acquiescence-bias scenario show that the overall scale means should increase as the ratio of positive to negative items increases (see bottom row of Table 4). Because acquiescence bias pushes responses to the higher end of the scale regardless of valence, reverse-coding the contribution of acquiescence bias for the negative items is the wrong thing to do. The problem, of course, is that when someone responds “5” (Very Much) to “I am lazy”, we do not know how much of that agreement represents a genuine self-appraisal that the participant is lazy (and that should be reverse-coded) and how much is due to the tendency to acquiesce. If experiencing all positive items promotes more acquiescence compared to versions that include items with negative valence, then this could account for why the All-Positive version had the greatest mean score in Study 1.
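The arithmetic behind the two bias scenarios can be reproduced with a short Python sketch. The function name is hypothetical. The item counts for the Original, All-Positive, and All-Negative versions follow from the scale descriptions; the counts for the Mirrored version (nine positive, four negative) are an assumption that every original item was reverse-worded, and they should be checked against Table 4.

```python
def scored_mean(n_positive, n_negative, raw_on_positive, raw_on_negative):
    """Scale mean when every positive item gets one raw response and every
    negative item gets another, with negative items reverse scored as 6 - raw."""
    total = n_positive * raw_on_positive + n_negative * (6 - raw_on_negative)
    return total / (n_positive + n_negative)


# (positive items, negative items) per version.
versions = {
    "Original": (4, 9),
    "All Positive": (13, 0),
    "All Negative": (0, 13),
    "Mirrored": (9, 4),  # ASSUMPTION: all 13 original items reverse-worded
}

for name, (n_pos, n_neg) in versions.items():
    # Total desirability bias: 5 to positive items, 1 to negative items.
    desirability = scored_mean(n_pos, n_neg, 5, 1)
    # Total acquiescence bias: 5 to every item regardless of valence.
    acquiescence = scored_mean(n_pos, n_neg, 5, 5)
    print(f"{name:13s} desirability={desirability:.2f} acquiescence={acquiescence:.2f}")
```

Under total desirability bias every version yields 5.0, whereas under total acquiescence bias the mean rises with the proportion of positive items, from 1.0 (All Negative) to 5.0 (All Positive).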

Internal consistency and factor structure

Cronbach’s alpha is shown in the fourth column of Table 3 for each of the five versions of the BSC. As a measure of internal consistency, alpha provides an index of the degree to which the items cohere into a unitary measure of the construct of interest. The alpha of 0.85 for the Original version corresponds closely to the values (0.83 and 0.85) reported by TBB in their two college-student samples. The BSC should have a high alpha because TBB, in part, selected items for the brief form because they showed high inter-item consistency in the long form.

TBB also selected items for the BSC such that there would be representation from each of the five factors that emerged from an exploratory factor analysis (EFA) of the 36-item long form. The factor structure of the 13-item BSC has a somewhat tortured history that Paap (2023) traces in detail. For present purposes, a key and safe conclusion drawn from TBB and subsequent publications is that the total score (or overall mean score) provides impressive predictive validity of many outcomes that is rarely exceeded to any nontrivial degree by the predictive power of any constituent factor. To rephrase, individual factor scores are usually not better than total-scale scores at predicting outcomes. This is not surprising if second factors are often method factors rather than content factors.

Figure 2 shows the scree plot for the group responding to the original 13-item BSC. The elbow supports the extraction of two factors. The two-factor solution is shown in Fig. 3, with the red dots indicating items with positive valence and the green dots those with negative valence. It is clear that the salient difference between the two factors is simply the valence of the item. This is further supported by comparison to the other versions. As shown in Table 3, the percentage of variance accounted for by a second factor is the least when the items are all of one valence (i.e., All Positive or All Negative), but for reasons that elude us, the unitary structure is most apparent in the All-Negative version shown in Fig. 4.

The effects of the wording change on item 8

Undertaking a small exploration of wording changes not involving changes in valence, we modified “People would say I have iron self-discipline” to “I would say I have iron self-discipline”. This shifts the semantics from an appraisal by others to a self-appraisal. The mean for the original item, “People would say…” (M = 3.27), is significantly greater than the mean for the modified item, “I would say…” (M = 3.02), t(399) = 2.07, p = 0.039. This is consistent with Duckworth et al.’s (2017) observation that across the lifespan and around the world, individuals experience SC as a very hard thing to do and report that they often fail. Indeed, people rate themselves lower in SC than in kindness, fairness, honesty, and most other aspects of character (Park & Peterson, 2006). The more general implication of this example is that apparently subtle wording changes can generate non-trivial shifts in the item mean. This also suggests that there may be potential risks in translating scales into other languages because it is difficult and often impossible to avoid subtle shifts in meaning.

Using self-control to predict everyday outcomes depends on the mix of valence

One very important use of a self-control measure is that it enables one to test for relationships between the ability (and/or predisposition) to exercise self-control and important positive and negative outcomes in everyday life. However, because this study is limited to correlation, it is difficult to distinguish causal relationships from those due to confounds or response biases. With those caveats in mind, Table 5 shows the bivariate correlations between the self-control scores of each of the five versions of the BSC and five outcomes: self-esteem, mental health, satisfaction with life, happiness, and general fluid intelligence. Table 6 shows the beta coefficients for these relationships when the effects of social desirability are partialed out. The internal consistency (Cronbach’s alpha) and retest reliability reported in other studies for each of these outcome variables and social desirability were reviewed in the Method section. The means, standard deviations, and Cronbach’s alpha for each of these variables are shown in Table 7.
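One common way to obtain a standardized beta for a predictor while controlling for a single covariate is a two-predictor regression on standardized variables, for which the coefficient has the closed form β = (r_yp − r_yc · r_pc) / (1 − r_pc²). The Python sketch below illustrates that formula. It is an assumption that this matches the exact procedure behind Table 6, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_beta(outcome, predictor, covariate):
    """Standardized beta for predictor in a regression of outcome on
    predictor and covariate (both predictors entered together)."""
    r_yp = pearson_r(outcome, predictor)
    r_yc = pearson_r(outcome, covariate)
    r_pc = pearson_r(predictor, covariate)
    return (r_yp - r_yc * r_pc) / (1 - r_pc ** 2)


# Toy data: the outcome is fully explained by the covariate (e.g., social
# desirability), so the predictor's beta shrinks to zero even though its
# bivariate correlation with the outcome is positive.
desirability = [1, 2, 3, 4, 5]
self_control = [2, 1, 4, 3, 5]
outcome = [1, 2, 3, 4, 5]
print(round(pearson_r(outcome, self_control), 2))
print(round(standardized_beta(outcome, self_control, desirability), 2))
```

The contrast between the two printed values mirrors the distinction between the bivariate correlations and the betas with social desirability partialed out.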

Table 5 Predicting outcomes: Self-esteem, mental health, satisfaction with life, happiness


Table 6 Predicting outcomes while controlling for social desirability (standardized betas)


Table 7 Descriptive statistics for six outcome variables
