There are several important psychological constructs that are highly related to one another: self-control (SC), executive functioning (EF), self-regulation, impulse-control, cognitive control, attention control, executive attention, willpower, grit, ego-strength, inhibitory control, and the list goes on. Nigg (2017) provides a useful discussion of the similarities and differences between these constructs and a framework that might facilitate integration and cross-talk between researchers working from different perspectives. This project is in that spirit but focuses only on SC and EF because they are verbally described in much the same way, yet dominate different disciplines and typically use different measurement methods. The project also has close ties to self-regulation, as Nigg observed: “Executive function and cognitive control are not identical to self-regulation because they can be used for other activities, but account for top-down aspects of self-regulation at the cognitive level” (p. 1).

For the present purposes, SC and EF refer to the set of general-purpose control processes central to the self-regulation of thoughts, emotions, and behaviors that are instrumental to accomplishing goals, especially in the presence of more tempting or automatic action plans that would disrupt or replace goal-consistent actions if they were not subject to control (Paap, 2023). EF plays a critical role in cognitive science, where EF ability is typically measured in terms of performance on artificial, but highly prescribed, computer-controlled tasks such as the Stroop color–word interference task (Paap et al., 2022). Traditions in personality, social, developmental, counseling, and clinical psychology, and in psychiatry, rely heavily on self-report (or reports from other informants such as parents or teachers) of typical behavioral tendencies in everyday life. Our focal example of this type of scale is the Brief Self-Control (BSC) scale (Tangney et al., 2004), where participants are asked to indicate how much a statement (e.g., I am good at resisting temptation) reflects how they typically are on a five-point scale ranging from Not at all to Very much.

Are these two traditions studying the same construct and leading to compatible understandings of individual differences in self-control or how control varies across situations?

Convergent validity of subjective self-report measures

Preliminary to the question of the degree to which the two approaches measure the same construct is whether there is agreement between measures developed within each tradition. Our recent publications have often included the BSC, Barkley’s Deficits in Executive Functioning (BDEFS, Barkley, 2011), and three subscales from the UPPS Impulsive Behavior Scale (Whiteside & Lynam, 2001). Table 1 is not an exhaustive review of the relevant literature but does support the conclusion that commonly used scales based on subjective self-reports usually correlate with one another and sometimes the correlations are strong.

Table 1 Correlations between three subscales of the UPPS-P Impulsive Behavior Scale and the BSC and BDEFS self-control scales

Convergent validity of performance-based measures is weak

The answer to the question of convergent validity for performance-based measures is less straightforward because seminal studies using latent variables suggested that EF should be viewed as three separable components: shifting, updating, and inhibition (Miyake et al., 2000). Although the convergent validity observed for the first two components is typically adequate, that for inhibitory control is notoriously challenged (Paap & Sawi, 2014). Even variants of the same task can have near-zero correlations as exemplified by Salthouse (2010) for the flanker task and Shilling et al. (2002) for the Stroop task. Rey-Mermet et al. (2018) conclude that there is little evidence for domain-general inhibition and that inhibitory control is task specific. In their review, they point out that an inhibitory-control factor is often dominated by a single measure and that statistical models offer weak support for a domain-general EF ability. Randy Engle’s group (Draheim et al., 2020; Burgoyne et al., 2023) offer a spirited rebuttal to this pessimistic view based on the development of a set of new performance-based tasks that show substantially better reliability and convergent validity. But the relevant and voluminous research literature used the “traditional” tasks and because these measures do not show adequate convergent validity, the best possible state of affairs would be that some coherent subset of them would strongly correlate with self-report measures.

The modest association between the two types of measures

Recall that SC and EF were defined as the set of general-purpose control processes central to the self-regulation of thoughts, emotions, and behaviors that are instrumental to accomplishing goals: especially in the presence of more tempting or automatic action plans that would disrupt or replace goal-consistent actions if they were not subject to control. The core idea that control is needed to prevent an act that would otherwise occur is a highly salient facet of self-report measures of control. In contrast, items referring to switching (mental flexibility), updating (working memory), and planning are less common. This consideration leads to the expectation that any alignment between the two types of measures may appear for some of the performance-based measures of inhibitory control rather than switching or updating. This expectation was not confirmed because weak correlations between the two types of measures are typically observed for all components of EF.

In their recent excellent analytic review, Friedman and Gustavson (2022) conclude that “…ample evidence now suggests that tasks and ratings do not correlate well with each other” p. 262. This echoes our independent evaluation (Paap, 2023) and follows in the footsteps of the first seminal reviews by Cyders and Coskunpinar (2011) and Duckworth and Kern (2011). Another highlight comes from Allom et al. (2016), who showed near-zero correlations between a composite of self-report measures and a composite of performance-based measures.

The capstone for this lack of convergence is provided by Mazza et al. (2020), who obtained measures of self-regulation from 23 self-report surveys and 37 cognitive tasks from a group of 522 adult participants (mean age = 33.6 years) recruited through MTurk. Variables derived from self-report scales weakly correlated with variables derived from the cognitive tasks (M = +0.05, range = 0.00 to 0.27 for the absolute value of r). The self-report scales included the two we have used most often (BSC and UPPS impulsivity), and the cognitive tasks included many performance-based stalwarts (e.g., stop-signal, go/no-go, flanker, Simon, Stroop, backward digit span, backward spatial span, and N-back).

When two purported measures of the same construct fail to correlate with each other, one way of determining which is the more valid measure is to see how well they each predict anticipated outcomes. For example, individuals with better self-control should engage in more health-promoting activities. Indeed, Allom et al. report that the self-report measure of control (but not performance-based measures of EF) predicted the degree of physical exercise. Thus, this suggests that the self-report measures are the more valid measures of self-control ability. We have told a very similar story in Mason et al. (2020) where we reported that a set of performance-based measures of EF (viz., switch costs, mixing costs, and spatial Stroop effects) do not correlate with self-report measures such as the BSC or BDEFS. Furthermore, like Allom et al., we correlated both self-report and performance-based measures with physical activity and replicated their finding that only the self-report measures predicted the amount of physical activity. Paap (2023) reviews additional studies consistent with the view that self-report and performance-based measures weakly correlate with each other and that self-report measures usually enjoy greater predictive validity.

Issues associated with self-report measures

To this point, the groundwork has been laid for the hypothesis that subjective measures of SC/EF in everyday life may be superior to performance-based measures because they tend to correlate more strongly and consistently with tendencies to exercise control outside the laboratory. Given the shared ecological validity of self-report measures and self-report outcomes, and the obtained pattern of correlations, the potential superiority of self-report enjoys substantial plausibility. However, self-report measures are vulnerable to response biases and measurement problems different from the challenges faced in performance-based measures. Another purpose of this project is to explore these problems in the specific context of the most popular self-report scale of SC.

The BSC as our focal example

The Brief Self-Control (BSC) scale of Tangney, Baumeister, and Boone (2004; henceforth TBB) was an obvious target, as it already has an immense user group and had gathered more than 7800 Google Scholar citations as of December 2022. In general, Likert scales ask how much a person agrees with a statement. Another item from the BSC asks respondents how much they agree that I am able to work effectively toward long-term goals. In the most straightforward application, the BSC can be treated as a unitary measure of general SC with individual differences reflected in the total score or the mean across all 13 items. The choice between total score and mean score is a matter of preference and henceforth the mean score is used.

Reverse-wording

The following discussion relies on the concepts of reverse-wording and reverse-coding. Please avoid the jingle fallacy: erroneously assuming that two quite different things are the same simply because they have similar names. Reverse-wording refers to taking a potential scale item and rewriting it so that the valence is reversed. An item with positive valence, I am good at resisting temptation, might be reverse-worded to I am bad at resisting temptation. Similarly, an item with negative valence, I am lazy, might be reverse-worded to I am not lazy. There are two ways of reverse-wording an item: (1) negating it, either by adding a negative particle such as not or no or by adding an affixal negation such as un- or -less (e.g., I am not lazy), and (2) replacing a keyword with its polar opposite (e.g., bad for good or energetic for lazy).

Reverse-coding

Reverse-coding refers to the standard practice applied to items with negative valence whereby the raw score is subtracted from the maximum scale value (“5” in the case of the BSC) plus 1. For example, a response of “1” to I am lazy would be reverse-coded to 6 – 1 = 5 and a response of “2” to I am bad at resisting temptation would be reverse-coded to 6 – 2 = 4. The underlying (but usually implicit) assumptions are that reverse-wording completely reverses the semantics AND that the Likert scale is equal interval with a neutral point of 3 such that reverse-coding is an unbiased transformation. In an ideal application, the response to a negative item (e.g., I am bad at resisting temptation. “2”) after reverse coding (6 – 2 = 4) will be identical to the response given to its positive mate (e.g., I am good at resisting temptation, “4”).
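The reverse-coding rule can be written out as a minimal sketch. The function names and the item labels passed to them are hypothetical and chosen only for illustration; the arithmetic (6 minus the raw score on a 1 to 5 scale) is the rule described above.

```python
def reverse_code(raw, scale_max=5, scale_min=1):
    """Reverse-code one Likert response: (max + min) - raw, i.e., 6 - raw on a 1-5 scale."""
    if not scale_min <= raw <= scale_max:
        raise ValueError(f"response {raw} is outside {scale_min}-{scale_max}")
    return scale_max + scale_min - raw


def scale_mean(responses, negative_items, scale_max=5):
    """Mean across items after reverse-coding the negatively valenced ones.

    responses: dict mapping a (hypothetical) item label to a raw response.
    negative_items: set of labels whose valence is negative.
    """
    scored = [
        reverse_code(raw, scale_max) if item in negative_items else raw
        for item, raw in responses.items()
    ]
    return sum(scored) / len(scored)


# The two worked examples from the text.
print(reverse_code(1))  # "I am lazy" answered 1 -> 5
print(reverse_code(2))  # "I am bad at resisting temptation" answered 2 -> 4
```

Note that the transformation is only unbiased under the two assumptions stated above: complete semantic reversal and an equal-interval scale.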

Acquiescence bias

But what motivates researchers to include both positive and negative items? One belief is that it reduces the likelihood of an acquiescence bias. Acquiescence is a response tendency to agree with statements. It is considered a personality trait, with some individuals predisposed to acquiesce, but it is also thought to be more prevalent when the situation promotes satisficing over optimizing responses. Furthermore, response inertia might potentiate acquiescence as one gets into the rhythm of indicating agreement at the top end of the scale: 5, 5, 5…. However, if the 5’s given to negative items are reverse-scored to 1’s, then the mean score (for a scale with 50% negative items) due to acquiescence is 3.0, the midpoint of the scale. If acquiescence continues unabated and equivalently on both positive and negative items, then its effects on the overall scale mean (after reverse scoring the negative items) may cancel out.

A different possibility is that participants may experience the inherent inconsistency in bouncing back and forth between agreeing to both X (e.g., having good self-control) and not X. This might lead to a response strategy that substantially reduces the tendency to acquiesce. Both of these scenarios lead to a better state of affairs, namely, mean scores that are less biased than those obtained with scales consisting of only positive (or only negative) items.
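The cancellation argument can be checked with a toy respondent who answers 5 to every item. This is a sketch under stated assumptions: the 10-item scale and its item labels are hypothetical (the BSC has 13 items and is not evenly balanced), and the respondent is a pure acquiescer with no true self-control signal.

```python
def mean_after_reverse_coding(raw_by_item, negative_items, scale_max=5):
    """Scale mean after reverse-coding (scale_max + 1 - raw) the negative items."""
    scored = [
        (scale_max + 1 - raw) if item in negative_items else raw
        for item, raw in raw_by_item.items()
    ]
    return sum(scored) / len(scored)


# Hypothetical 10-item scale answered by someone who agrees with everything.
items = [f"item{i}" for i in range(1, 11)]
always_agree = {item: 5 for item in items}

# Balanced scale: half of the items are negatively worded.
balanced_mean = mean_after_reverse_coding(always_agree, set(items[:5]))
# All-positive scale: nothing is reverse-coded.
all_positive_mean = mean_after_reverse_coding(always_agree, set())

print(balanced_mean, all_positive_mean)  # 3.0 5.0
```

On the balanced scale the acquiescent pattern lands on the scale midpoint, whereas on the all-positive scale it is indistinguishable from excellent self-control.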

Social desirability bias

In order to protect one’s self-image or to project a more positive image, some individuals may be biased to respond in a socially desirable way, viz., shifting their self-appraisal in the direction of greater agreement (toward 5) with positive items and less agreement (toward 1) with negative items. When the negative items are (as usual) reverse-coded, the overall effect of social desirability is to bias the mean scores in the direction of better self-control (toward 5). Thus, the overall mean scores for self-control are susceptible to desirability effects as high means may indicate excellent self-control or a strong desirability bias. Unlike acquiescence bias, including negative items does not appear to have any mitigating mechanism for controlling a desirability bias. A common strategy is to measure social desirability, partial out its effects in a simultaneous regression, and hope that the relationship of interest remains robust.
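The logic of partialling out social desirability can be illustrated with a first-order partial correlation, which is a simpler relative of the simultaneous regression mentioned above and is not necessarily the analysis used in this project. The function names are hypothetical, and the three input correlations in the usage lines are made-up illustrative values, not results from any study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with the linear effect of z removed from both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Illustrative (made-up) values: x = self-control, y = an outcome,
# z = social desirability.
r_controlled = partial_r(r_xy=0.40, r_xz=0.50, r_yz=0.30)
print(round(r_controlled, 2))  # roughly 0.30, down from the zero-order 0.40
```

If the partial correlation stays close to the zero-order correlation, the relationship of interest is robust to the measured desirability bias; a large drop suggests that desirability accounts for much of it.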

The foibles of reverse-wording

As described above, a popular cure for acquiescence bias is to include items with both positive and negative valence. During scale development, this is likely to involve reverse-wording some positive items into negative items that will be reverse-coded in order to derive an overall scale mean. This cure for acquiescence bias may be worse than the disease because it is very difficult to completely reverse the semantics or content of an item. As Krosnick and Presser (2010) describe in a meta-analysis of 41 studies, nonequivalence is more likely than not. When people are asked to agree or disagree with pairs of statements stating mutually exclusive views (e.g., I enjoy socializing vs. I don’t enjoy socializing), the between-pair correlation (before reverse-scoring) is only –0.22. Although some of the weakness in this negative correlation could be due to acquiescence to both types of statements, it more likely reflects that syntactic negation is rarely understood as a polar opposite meaning that would lead to a negative correlation of –1.00.

In a project using 100 San Francisco State University undergraduates, we tried to obtain a purer measure of the semantics of the BSC items and their reverse-worded counterparts by focusing the target behavior/predisposition on a hypothetical other person rather than as a self-appraisal. To minimize acquiescence, judgments involving self-appraisal were replaced with direct judgments of good or poor self-control. The instruction read as follows: “Suppose someone you do not know says: ‘I have a hard time breaking bad habits.’ Based on this statement, how much self-control (aka self-discipline, willpower, impulse control, perseverance) do you think this person has? 1 = Substantially Below Average, 2 = Somewhat Below Average, 3 = Average, 4 = Somewhat Above Average, 5 = Substantially Above Average”.

When there is no issue of agreement (and hence no opportunity for acquiescence) and no issue of social desirability (because the question is not about the respondent), the correlations (after reverse coding the negative items) between the positively and negatively worded versions of the 13 BSC items averaged r = +0.33. Although this approaches a medium effect size, it shows that knowing the degree of self-control implied by a description of some specific behavior predicts only 11% of the variance (0.33² ≈ 0.11) in judgments about the reverse-wording of that behavior. Consider two contrasting examples. “I have trouble concentrating” captures quite well the polar opposite of “I have good concentration”, r = 0.72 (when the negative item is reverse coded), but the reverse wording of most of the original BSC items does not substantially reverse the semantics. For example, the correlation between “Sometimes I can’t stop myself from doing something even if I know it is wrong” and “Sometimes I can stop myself from doing something when I know it is wrong” is r = –0.14. Simply put, negating a statement does not mean that the comprehender will make a polar opposite inference about its meaning.

Scales that use both positive and negative items seem to require that we make the (nearly always unstated) assumption that respondents really do think about valence as a single dimension. As a reviewer observed, substantial evidence from behavioral economics strongly suggests that this is often not the case. For example, people seem to reason about potential losses and potential gains in qualitatively different ways even when the phrasing results in mathematically equivalent results (Kahneman & Tversky’s prospect theory, 1979). For example, people tend to overweight small probabilities to guard against losses. From this perspective, the fact that mirrored items have different factor loadings accurately captures real cognitive differences between them, but this source of item-item difference is minimized by using scales with all positive or all negative items.

Reverse-worded items of the need for cognition scale alter the factor structure

As we were embarking on our odyssey through the BSC, we were unaware that Zhang et al. (2016) had explored similar manipulations of the 18-item Need for Cognition (NFC) scale (Cacioppo et al., 1984). In the original NFC, nine items endorsed the need to engage in and enjoy cognitive activities and nine were reverse-worded. Three new versions were created. The All-Positive version reversed the nine reverse-worded items to create a scale that uniformly endorsed a need for cognition, for example, reversing “Thinking is not my idea of fun” to “Thinking is my idea of fun”. The other two versions maintained an even division between positive and negative endorsement of the need for cognition, but the nine reversed items in version Reverse-1 were uniformly created by using polar opposite adjectives (e.g., I would prefer simple to complex problems) and in Reverse-2 by using negative particles (e.g., The notion of thinking abstractly is not appealing to me). About 315 University of British Columbia undergraduates completed each of the four versions.

Although these appear to be subtle wording changes, Zhang et al. correctly anticipated that they would lead to scales with different factor structures. Using exploratory factor analyses, the original NFC scale clearly indicated two factors, but the modified scales yielded weak evidence for two factors. Consistent with this analysis, the correlation between the two factors was only +0.56 for the original NFC, but +0.96 or greater for the three versions with wording changes. Confirmatory factor analyses verified that the type of reverse-wording may also affect the factor structure of the NFC. Inconsistent responses to polar opposite items may occur because they are not actually polar opposites with respect to the construct of interest. A respondent may agree with both I like simple tasks and I like complex tasks because liking simple tasks does not preclude also liking complex tasks. This problem is often referred to as reversal ambiguity. Furthermore, there is a long and convoluted history of psycholinguistic research on the comprehension of negative sentences (see Wang et al., 2021) that permits the conclusion that negation is tricky even when the reader does not miss the negative particle due to inattentiveness. Zhang et al. are drawn to the conclusion that the use of reverse-worded items in Likert scales “… has serious disadvantages” (p. 13).

Study 1: between-subjects comparisons of different versions of the BSC

The Zhang et al. study resets the stage for our interest in how promising the BSC (and other self-report measures of self-control) might be as reliable and valid measures of self-control. The purpose of this study was to investigate the effects of reverse-wording and valence (behaviors that have either positive or negative social desirability) in the BSC with respect to not only the factor structure of the scale but also its ability to predict both positive and negative outcomes in everyday life. Thus, testing the predictive validity of wording variants of the BSC is an important and novel contribution of our study.

Method

Measures of self-control

TBB reported that the original BSC enjoyed excellent reliability in two large samples of college students. Cronbach’s alpha showed an internal consistency of 0.83 (N = 351) in Study 1 and 0.85 (N = 255) in Study 2. Test–retest reliability over a 1–3-week period was measured in Study 1 and yielded an impressive r = 0.87.
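For readers less familiar with the internal-consistency statistic cited here, Cronbach’s alpha can be computed from item-level data as sketched below. This is the standard textbook formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores); the function name and the tiny data set are hypothetical and are not taken from TBB.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table.

    rows: list of respondents, each a list of k item scores that have
    already been reverse-coded where needed.
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from three people to a two-item scale.
toy_data = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(toy_data), 2))  # 0.67
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is sensitive to how well reverse-worded items mirror their counterparts.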

For the present study, five versions (counting the original form with 13 items) of the BSC were created. As shown in Table 2, each of the 13 original items was reworded to reverse its valence. The new version is termed the Mirrored version and consists of nine items with positive valence and four items with negative valence.

Table 2 The original BSC items and their mirrored version

An All-Positive version was formed by retaining the original four positive items and recombining them with the nine new positive items from the Mirrored version. Similarly, an All-Negative version was formed by retaining the original nine negative items and recombining them with the four new negative items from the Mirrored version. An anomaly that captured our attention is that the original Item 8 (People would say that I have iron self-discipline) asks what other people would say rather than soliciting a self-appraisal. In order to explore the extent to which a subtle wording change unrelated to valence can impact an item’s interpretation, a fifth version that we dubbed the Who Version replaced the original Item 8 with this statement: I would say that I have iron self-discipline.
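The way the five versions are assembled from two pools of wordings can be sketched as follows. Only the counts (four positive and nine negative original items) and the positive valence of Item 8 come from the text; the other item numbers marked as positive are placeholders, and all names are hypothetical.

```python
# Hypothetical item numbering: the text confirms only that the original BSC has
# 4 positive and 9 negative items and that Item 8 is positively worded. The
# other three positive item numbers below are placeholders.
ORIGINAL_POSITIVE = {1, 6, 8, 11}
ITEM_NUMBERS = range(1, 14)  # 13 items


def build_versions():
    """Return, per version, a dict: item number -> (wording source, valence)."""
    original = {
        i: ("original", "positive" if i in ORIGINAL_POSITIVE else "negative")
        for i in ITEM_NUMBERS
    }
    # Mirrored: every original item reworded so that its valence flips.
    mirrored = {
        i: ("mirrored", "negative" if valence == "positive" else "positive")
        for i, (_, valence) in original.items()
    }
    # All-Positive / All-Negative: take whichever wording has the wanted valence.
    all_positive = {
        i: original[i] if original[i][1] == "positive" else mirrored[i]
        for i in ITEM_NUMBERS
    }
    all_negative = {
        i: original[i] if original[i][1] == "negative" else mirrored[i]
        for i in ITEM_NUMBERS
    }
    # Who: the original scale with only Item 8 reworded; valence is unchanged.
    who = dict(original)
    who[8] = ("who-reworded", "positive")
    return {
        "Original": original,
        "Mirrored": mirrored,
        "All-Positive": all_positive,
        "All-Negative": all_negative,
        "Who": who,
    }


def count_valence(version, valence):
    """Number of items in a version that carry the given valence."""
    return sum(1 for _, v in version.values() if v == valence)


versions = build_versions()
for name, version in versions.items():
    print(name, count_valence(version, "positive"), count_valence(version, "negative"))
```

Every version keeps 13 items, so differences between versions reflect wording and valence rather than scale length.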

Measures of outcomes associated with self-control

In order to assess the predictive validity of the five versions of the BSC, all participants completed several scales that have been associated with self-control in past research. Below, each scale is described along with its reported internal and retest reliability, and then we review the published correlations between each outcome scale and the BSC scale.

Self-esteem

Self-esteem was measured using the Rosenberg (1965) Self-Esteem Scale consisting of five items with positive valence (e.g., I feel I am a person of worth) and five items with negative valence that are reverse scored (e.g., I feel I do not have much to be proud of) to yield an overall score where larger numbers signify greater self-esteem. A four-point Likert scale is used, anchored by Strongly Agree (1) and Strongly Disagree (4). The scale has good internal reliability (Cronbach’s alphas ranging from 0.72 to 0.91). For example, Sinclair et al. (2010) reported an alpha of 0.91 for a representative sample of US adults (N = 503). Test–retest correlations with college students have indicated a 1-week test–retest correlation of 0.82 (Fleming & Courtney, 1984) and a 2-week test–retest correlation of 0.95 (Silber & Tippett, 1965).

A search of the published literature returned 15 studies that reported the bivariate correlation between the BSC and Rosenberg’s Self-Esteem scale. These ranged from r = +0.19 (531 undergraduates, M = 19.3 years old, Trumpeter et al., 2006) to r = +0.53 (147 university students from Tehran, M = 26.9 years old, Ghorbani et al., 2014) with a mean across all 15 studies of r = +0.38.

General health questionnaire (GHQ)

General mental health was measured using the 12-item scale originally developed by Goldberg (Goldberg & Williams, 1991). Six items have positive valence (Have you been able to concentrate well on what you were doing) with Likert choices of 0 (better than usual), 1 (same as usual), 2 (less than usual), 3 (much less than usual). The remaining six items have negative valence (Have your worries made you lose a lot of sleep) with Likert choices of 0 (not at all), 1 (no more than usual), 2 (more than usual), and 3 (much more than usual). If the scale values of 0 to 3 are used, then better health is signified by smaller totals or means. The test–retest reliability over a 7–14-day interval for an Italian version of the scale (N = 83) was r = 0.84 (Piccinelli et al., 1993) when administered to adult volunteers at a general medical practice clinic. Cronbach’s alpha (0.90, N = 3705) showed good internal consistency in the Health Survey for England 2004 cohort (Haskins, 2008).

Three published studies reported that better mental health was significantly associated with better self-control, r = –0.42 (159 college students, M = 21.3 years old, Fung et al., 2020), r = –0.19 (328 Russian science majors, M = 18.4 years old, Gordeeva et al., 2017), and r = –0.27 (106 employees, M = 44.1 years old, Jammieson et al., 2017).

Satisfaction with Life (SWL)

Satisfaction with life was measured using Diener et al.’s (1985) classic five-item scale (e.g., In most ways my life is close to my ideal) with Likert choices ranging from 1 (Strongly disagree) to 7 (Strongly agree). Diener et al. (1985) tested two large samples of college students (N = 176 and N = 163) and a smaller group of 53 elderly persons. For the undergraduates, the 2-month test–retest correlation was 0.82 and Cronbach’s alpha was 0.87.

Fourteen studies have reported the bivariate correlation between the BSC and Diener’s SWL scale. These ranged from r = +0.20 (500 Chinese adult employees, M = 28.0 years old, Dou et al., 2019) to r = +0.37 (328 Russian undergraduate science majors, M = 18.4 years old, Gordeeva et al., 2017) with a mean of r = +0.29.

Happiness

Self-rated happiness was measured with Lyubomirsky and Lepper’s (1999) four-item Subjective Happiness Scale (e.g., In general, I consider myself: 1 “not a very happy person” to 7 “a very happy person”). One item required reverse coding. The developers validated their happiness scale across more than a dozen samples totaling 2732 participants in the United States and Russia who were either college students or from the local community. Internal consistency measured as Cronbach’s alpha ranged from 0.80 to 0.94 with a mean of 0.86. Test–retest reliability was assessed across five samples at time lags ranging from 3 weeks to 1 year. The reliability ranged from 0.55 to 0.90 (M = 0.72). The smallest coefficient (r = 0.55) was observed in a U.S. adult community sample that was retested after 1 year. More recently, Extremera and Fernández-Berrocal (2013) reported a Cronbach’s alpha of 0.81 and retest reliability (at intervals of 6–8 weeks) for a Spanish version of the scale of r = 0.72.

We know of only one other study that probed this association between self-control and happiness. Fung et al. (2020) reported a significant positive correlation (r = +0.33) based on responses from 903 students attending Chinese universities.

Social desirability

Stöber’s (2001) Social Desirability Scale (SDS) was primarily used as a covariate when treating BSC as a predictor of the outcome variables described above. SDS scores will correlate with BSC scores to the extent that an individual is biased to overreport on items with positive valence or underreport on those with negative valence. The original SDS has nine positive-valence items (e.g., I always admit my mistakes openly and face the potential negative consequences) and seven with negative valence (e.g., I sometimes litter). One point is scored for each true response to a positive item and one point for each false response to a negative item. Thus, larger totals indicate a greater bias to give socially desirable answers. To avoid issues of legality, we did not use one of the negative items (I have tried illegal drugs …). Stöber reported test–retest correlations over 0.80 across intervals from 2 to 6 weeks and Cronbach’s alphas of either 0.74 or 0.75 across three college student samples and a large community sample.
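The SDS scoring rule just described can be sketched in a few lines. The function name and the three-item answer sheet are hypothetical and serve only to illustrate the keying; they are not the actual SDS items or item numbers.

```python
def sds_score(answers, positive_items, negative_items):
    """Total social desirability score.

    answers: dict mapping an item label to True ("true") or False ("false").
    One point for each "true" on a positive-valence item and one point for
    each "false" on a negative-valence item.
    """
    score = 0
    for item, answer in answers.items():
        if item in positive_items and answer is True:
            score += 1
        elif item in negative_items and answer is False:
            score += 1
    return score


# Hypothetical three-item answer sheet.
answers = {"admit_mistakes": True, "litter": False, "gossip": True}
print(sds_score(answers, {"admit_mistakes"}, {"litter", "gossip"}))  # 2
```

Because both keyed directions add to the same total, a high score cannot arise from simply answering "true" to everything.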

Prior studies show strong correlations between BSC and social desirability scores. Bertrams and Dickhäuser (2012) reported r = +0.46 for a sample of 150 undergraduates. Similarly, Kwapis and Bartczuk (2020) reported r = +0.45 for a sample of 141 adolescents (M = 17.7 years old) with both scales translated into Polish. Uysal and Knee (2012) reported similar correlations when social desirability was measured with the Marlowe–Crowne scale: r = +0.43 for 160 undergraduates in Study 1, r = +0.59 for 74 undergraduates in Study 2, and r = +0.51 for 55 undergraduates in Study 3. Collectively, these substantial correlations highlight the possibility that positive correlations between BSC scores and other desirable outcomes may be mediated by social desirability.

Raven’s test of general fluid intelligence (gF)

Fluid intelligence was assessed using Set 1 of the Raven’s Advanced Progressive Matrices (Raven et al., 1977). The task consisted of 12 items. Each item was composed of a pattern with a missing piece in the lower right. Participants were instructed to “Look at the pattern, think what the missing part must be like to complete the pattern correctly, both across the rows and down the columns.” Participants selected from a set of eight alternatives. The task was computerized and controlled by Qualtrics. Participants were given a maximum of 2 min to respond to each item. Most responses, regardless of correctness, in this self-paced computer-controlled version were made well within the deadline. The manual states that with self-pacing, Set 1 can be used as a short 10-min test. The 12-item test has a decent Cronbach’s alpha, for example, 0.81 (Partchev, 2020) and 0.73 (Bors & Stokes, 1998, N = 506 University of Toronto students). Arthur et al. (1999) reported a test–retest r = 0.76 for 71 participants at a 1-week interval.

Raven’s scores were included in this study because our previous work (Paap et al., 2020) showed that general fluid intelligence was the most consistent predictor of performance-based measures of self-control. Furthermore, some experts (Salthouse, 2005, 2010) have argued that EF and gF may be two names for the same ability. However, a significant correlation was not anticipated in this study as we had never observed a significant correlation between the BSC and Raven’s using samples of university students (Paap et al., 2019, r = –0.07; Mason et al., 2021, r = –0.10; Paap et al., 2019, r = –0.04; Paap et al., 2022, r = +0.05). Similarly, Erceg et al. (2019) reported a correlation of r = –0.14 for a sample of 159 college students (M = 21.3 years old). Finally, Mazza et al. (2020) reported a correlation of r = +0.07 based on a sample of 522 MTurk participants (M = 33.6 years) paid considerably more than is typical ($60 plus an average of $10 in bonuses) for completing a 10-h battery of surveys and cognitive tasks. This disconnect between the relationship of self-report and performance-based measures of self-control to Raven’s scores is not surprising given our earlier discussion that the two types of self-control measures do not correlate with each other.

Design and procedure

Participants were randomly assigned to one of five groups that differed only with respect to the BSC version they received: original BSC, All Positive, All Negative, Mirrored, and Who. The sequence of events was controlled by a Qualtrics Survey: (1) informed consent, (2) language and demographic background, (3) the randomly assigned version of the BSC, (4) self-esteem scale, (5) general health questionnaire, (6) satisfaction with life scale, (7) subjective happiness scale, (8) social desirability, and (9) Raven’s test of gF.

Participants

MTurk workers were recruited for a modest compensation of $0.20. MTurk results were returned to Qualtrics from 1439 workers. Responses were deleted if less than 70% of the survey was completed. Because the Raven’s test came last (and constituted the final 30%), this means that correlations involving gF are based on somewhat smaller sample sizes. Each participant responded to only one of the versions of the BSC and this was followed by an attention check such as “respond 3 to this item”. A similar attention check was presented after the GHQ scale and again after the Happiness scale. Seventy-six (6.2%) participants were deleted because they failed more than one of the three attention checks. An additional and more important screening involved the following affirmation procedure. Participants were warned at the onset of the survey that we would eventually ask them if they had paid attention to each item and if they had answered honestly. We also told them that they would be paid regardless of how they answered these questions. These two affirmation items were presented at the end of the survey. Fourteen (1.3%) reported that they did not always pay attention and 40 (3.6%) reported that they did not always answer honestly. After eliminating these participants, a pool of 1003 qualified participants remained with about 200 participants in each group (see Table 3 for specific N’s).

Table 3 Descriptive statistics for the five versions of the BSC

Results

Differences between BSC versions

At the individual participant level, all responses to items with negative valence were reverse scored by subtracting each raw score from 6. The left side of Table 3 shows the means across all 13 items for each of the five versions of the BSC with larger means indicating greater self-control. A one-way ANOVA with Version as a between-subject variable was significant, F(4, 997) = 12.51, p < 0.001. Post hoc Bonferroni tests were used to compare each of the modified versions to the mean of the original BSC. By this relatively conservative test, the only significant difference was that the All-Positive mean of 3.63 was greater than the mean of 3.37 for the Original BSC, p < 0.001.

It should be noted that the mean (3.37) for the Original version of our MTurk sample is greater than the means reported by TBB for college-student samples (3.02 and 3.07). This difference may be driven mostly by age, as the mean age of the MTurk participants in our sample was 39 years and, for the group assigned to the Original version, the correlation between age and mean self-control is significant, r = +0.23. As shown in Fig. 1, the linear fit to these data shows a mean score just above 3.00 for a 20-year-old participant. A meta-analysis of 50 studies using the BSC reported a mean of 3.26 with a range of 2.87 to 4.26. In summary, the mean for the Original version in our study fits very well with means observed in past studies.

Fig. 1 Mean self-control for the original version of the BSC as a function of age

Assessing the possibility that the obtained pattern of mean differences across the five versions reflects contributions from acquiescence or social-desirability bias is complicated. Consider first a participant who only pays attention to the valence of the item (from a social desirability perspective) and responds 5 to all the positive items and 1 to all negative items. As shown in Table 4, this would result in a mean of 5.0 for all four versions because all responses of 1 to negative items are reverse coded to 5’s and, of course, all positive items are 5’s to begin with. Thus, a uniform social-desirability bias does not drive differences between versions that differ in their ratio of positive to negative items. Said another way, a strong desirability bias drives scores higher in general but, all other factors being equal, all boats (versions) should rise or fall together as the tides of desirability ebb or flow.

Table 4 Effects of reverse-coding items with negative valence on overall scale means assuming a total acquiescence bias or a total desirability bias

In contrast to the scenario just discussed, suppose a participant adopts a complete acquiescence bias and responds 5 to all items regardless of whether they have positive or negative valence. Because the negative-valence items are reverse coded, a complete acquiescence bias to agree to any type of statement should have opposing effects on positive and negative items. That is, if a strong acquiescence bias leads to full agreement to “I am good at resisting temptation”, the individual will select 5 (Very Much) and the scored value is 5. If that same strong acquiescence bias leads to full agreement to “I am lazy”, the individual will likewise select 5 (Very Much), but when this negative item is reverse scored the scored value is 6 – 5 = 1. Thus, the means for the acquiescence-bias scenario show that the overall scale means should increase as the ratio of positive to negative items increases (see bottom row of Table 4). Because acquiescence bias pushes responses to the higher end of the scale regardless of valence, reverse-coding the contribution of acquiescence bias for the negative items is the wrong thing to do. The problem, of course, is that when someone responds “5” (Very Much) to “I am lazy”, we do not know how much of that agreement represents a genuine self-appraisal that the participant is lazy (and that should be reverse-coded) and how much is due to the tendency to acquiesce. If experiencing all positive items promotes more acquiescence compared to versions that include items with negative valence, then this could account for why the All-Positive version had the greatest mean score in Study 1.
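The arithmetic behind the two bias scenarios can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, and the item counts per version (4 positive and 9 negative items in the Original, 9 positive and 4 negative in the Mirrored) are taken from the descriptions given in the text rather than from Table 4 itself.

```python
# Illustrative sketch of the two extreme response styles discussed above.
# Names are hypothetical; item counts per version follow the text.

MAX_PLUS_ONE = 6  # five-point scale: reverse score = 6 - raw


def scored(raw, negative):
    """Return the scored value; negative-valence items are reverse scored."""
    return MAX_PLUS_ONE - raw if negative else raw


def scale_mean(n_pos, n_neg, raw_pos, raw_neg):
    """Mean scored value when every positive item receives raw_pos
    and every negative item receives raw_neg."""
    total = n_pos * scored(raw_pos, False) + n_neg * scored(raw_neg, True)
    return total / (n_pos + n_neg)


# (number of positive items, number of negative items)
VERSIONS = {
    "Original": (4, 9),
    "All Positive": (13, 0),
    "All Negative": (0, 13),
    "Mirrored": (9, 4),
}


def bias_table():
    """Map each version to (desirability mean, acquiescence mean)."""
    table = {}
    for name, (n_pos, n_neg) in VERSIONS.items():
        # Total desirability bias: 5 to positive items, 1 to negative items.
        desirability = scale_mean(n_pos, n_neg, raw_pos=5, raw_neg=1)
        # Total acquiescence bias: 5 to every item regardless of valence.
        acquiescence = scale_mean(n_pos, n_neg, raw_pos=5, raw_neg=5)
        table[name] = (desirability, acquiescence)
    return table


if __name__ == "__main__":
    for name, (d, a) in bias_table().items():
        print(f"{name:13s} desirability={d:.2f} acquiescence={a:.2f}")
```

Under these assumptions the desirability scenario yields 5.00 for every version, whereas the acquiescence scenario yields means that rise with the number of positive items: 1.00 (All Negative), 2.23 (Original), 3.77 (Mirrored), and 5.00 (All Positive).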

Internal consistency and factor structure

Cronbach’s alpha is shown in the fourth column of Table 3 for each of the five versions of the BSC. As a measure of internal consistency, alpha provides an index of the degree to which the items cohere into a unitary measure of the construct of interest. The alpha of 0.85 for the Original version corresponds closely to the values (0.83 and 0.85) reported by TBB in their two college-student samples. The BSC should have a high alpha because TBB, in part, selected items for the brief form because they showed high inter-item consistency in the long form.
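For readers who want to see how alpha is obtained from item-level data, the following Python sketch applies the standard formula, alpha = k/(k – 1) × (1 – sum of item variances / variance of total scores). It is a minimal illustration, not a description of the software used in this study; the function name and the toy responses are made up.

```python
# Minimal sketch of Cronbach's alpha using the standard variance formula.
# The toy data are invented for illustration; they are not study data.
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list per participant holding k scored item responses
    (negative-valence items already reverse scored)."""
    k = len(rows[0])
    # Sample variance of each item across participants.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each participant's total score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    toy = [
        [4, 5, 4],
        [2, 3, 2],
        [3, 3, 4],
        [5, 4, 5],
    ]
    print(f"alpha = {cronbach_alpha(toy):.2f}")
```

Alpha approaches 1 when the items rise and fall together across participants and drops toward 0 when item variances dominate the variance of the total score.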

TBB also selected items for the BSC such that there would be representation from each of the five factors that emerged from an exploratory factor analysis (EFA) of the 36-item long form. The factor structure of the 13-item BSC has a somewhat tortured history that Paap (2023) traces in detail. For present purposes, a key and safe conclusion drawn from TBB and subsequent publications is that the total score (or overall mean score) provides impressive predictive validity of many outcomes that is rarely exceeded to any nontrivial degree by the predictive power of any constituent factor. To rephrase, individual factor scores are usually not better than total-scale scores at predicting outcomes. This is not surprising if second factors are often method factors rather than content factors.

Figure 2 shows the scree plot for the group responding to the original 13-item BSC. The elbow supports the extraction of two factors. The two-factor solution is shown in Fig. 3, with the red dots indicating items with positive valence and the green dots those with negative valence. It is clear that the salient difference between the two factors is simply the valence of the item. This is further supported by comparison to the other versions. As shown in Table 3, the percentage of variance accounted for by a second factor is the least when the items are all of one valence (i.e., All Positive or All Negative), but for reasons that elude us, the unitary structure is most apparent in the All-Negative version shown in Fig. 4.

Fig. 2 Scree plot for the group responding to the original 13 BSC items. The elbow appears to support the extraction of two factors

Fig. 3 The two-factor solution for the group given the original version of the BSC. Red dots are the items with positive valence

Fig. 4 Factor loading plot for the All-Negative version of the BSC. The 13 items load on a single factor

The effects of the wording change on item 8

In a small exploration of wording changes that do not involve changes in valence, we modified “People would say I have iron self-discipline” to “I would say I have iron self-discipline”. This shifts the semantics from an appraisal by others to a self-appraisal. The mean for the original item, People would say…, (M = 3.27) is significantly greater than the mean (M = 3.02) for the modified item, I would say…, t(399) = 2.07, p = 0.039. This is consistent with Duckworth et al.’s (2017) observation that across the lifespan and around the world, individuals experience SC as a very hard thing to do and report that they often fail. Indeed, people rate themselves lower in SC than in kindness, fairness, honesty, and most other aspects of character (Park & Peterson, 2006). The more general implication of this example is that apparently subtle wording changes can generate non-trivial shifts in the item mean. This also suggests that there may be potential risks in translating scales into other languages because it is difficult and often impossible to avoid subtle shifts in meaning.

Using self-control to predict everyday outcomes depends on the mix of valence

One very important use of a self-control measure is that it enables one to test for relationships between the ability (and/or predisposition) to exercise self-control and important positive and negative outcomes in everyday life. However, because this study is limited to correlations, it is difficult to distinguish causal relationships from those due to confounds or response biases. With those caveats in mind, Table 5 shows the bivariate correlations between the self-control scores of each of the five versions of the BSC and five outcomes: self-esteem, mental health, satisfaction with life, happiness, and general fluid intelligence. Table 6 shows the beta coefficients for these relationships when the effects of social desirability are partialed out. The internal consistency (Cronbach’s alpha) and retest reliability reported in other studies for each of these outcome variables and social desirability were reviewed in the Method section. The means, standard deviations, and Cronbach’s alpha for each of these variables are shown in Table 7.
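As a rough illustration of what partialing out social desirability involves, the sketch below computes the standardized beta for self-control in a regression with two predictors (self-control and social desirability) from the three pairwise correlations. This is a sketch under an assumption: the exact regression model behind Table 6 is not spelled out here, so the two-predictor form, the function names, and the example correlations are all hypothetical.

```python
# Sketch: standardized beta for predictor x (self-control) when a second
# predictor z (social desirability) is in the model. All example numbers
# are hypothetical and are not taken from Tables 5 or 6.
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_beta(r_yx, r_yz, r_xz):
    """Beta for x predicting y with z also in the model
    (standard two-predictor formula)."""
    return (r_yx - r_yz * r_xz) / (1 - r_xz ** 2)


if __name__ == "__main__":
    # Hypothetical correlations: outcome with self-control (r_yx), outcome
    # with social desirability (r_yz), and between the two predictors (r_xz).
    beta = standardized_beta(r_yx=0.60, r_yz=0.30, r_xz=0.20)
    print(f"standardized beta for self-control = {beta:.3f}")
```

When the two predictors are uncorrelated (r_xz = 0), the beta equals the bivariate correlation, which is one way to read the finding that associations changed little once social desirability was controlled.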

Table 5 Predicting outcomes: Self-esteem, mental health, satisfaction with life, happiness
Table 6 Predicting outcomes while controlling for social desirability (standardized betas)
Table 7 Descriptive statistics for six outcome variables

Self-esteem

As reviewed above, many published studies reported a robust correlation between the original BSC and Rosenberg’s self-esteem measure. The observed correlation for the BSC (r = +0.60) in this study is higher than in earlier studies and higher than for the four other versions tested in this study. The lowest (but still highly statistically significant) correlations were associated with the versions that had the most modified items, i.e., the All Positive version (+0.32), which has nine modified items, and the Mirrored version (+0.33), which modified all 13 items. Thus, although these modifications led to higher mean self-control scores, they appear to have reduced the scale’s predictive validity compared to the Original version.

Mental health

The observed correlation between the original BSC and the GHQ scale (when coded such that higher scores indicate better mental health) of r = +0.63 is strong and stronger than that reported in three earlier studies. The other four versions also yielded significant bivariate correlations, but as observed in the analysis of self-esteem, the association was weaker in those versions (Mirrored and All Positive) that modified more of the original items. The strength of the observed associations was maintained when the social-desirability scores were partialed out.

The correlations examined to this point would be consistent with the assumption that TBB did a fine job in developing the BSC – perhaps a fortunate confluence of exceptional domain expertise and the sifting and winnowing of items during scale development. This leads to the possibility that the BSC and its nearest neighbors (Who and All Negative) genuinely predict self-esteem and mental health more so than more distant neighbors with more modified items.

Satisfaction with life and happiness

The correlations between the various self-control scales and satisfaction with life and happiness show an opposite pattern from that just described for self-esteem and mental health. The correlation between the original BSC scores and both satisfaction with life and happiness is near zero (|r| < .10), but robust for the All Positive and Mirrored versions. Although the pattern is different from that observed for self-esteem and mental health, these correlations were also affected very little when social desirability was partialed out. The lack of an association with satisfaction with life is surprising as 14 published studies have reported a statistically significant positive correlation (M = +0.29) between BSC scores and satisfaction with life. If a preponderance of positive items induces acquiescence in a subset of participants (an intuition many scaling experts might endorse), then this may have elevated the scores for this subset in the All-Positive version (and to a lesser extent in the Mirrored version where 9 of the 13 items were positive) and in the satisfaction with life (all positive items) and happiness (only one negative item) scales. Said another way, pseudo-correlations between two scales (e.g., the correlation between All Positive and Satisfaction with Life) may emerge when scales are matched (e.g., a preponderance of positive items) in terms of their valence structure. This is, of course, wild speculation. A safer, but less informative, summary statement is that the Original version and its nearest neighbors strongly predict self-esteem and mental health, whereas more distant neighbors display weaker correlations with self-esteem and mental health but, mysteriously, are correlated with satisfaction with life and happiness.

Raven’s (general fluid intelligence)

As shown in Table 5, when the Original version of the BSC or its nearest neighbors (Who and All Negative) are correlated with Raven’s scores, the bivariate correlations are highly significant and of medium strength. In contrast, the correlations are near zero when all (Mirrored) or most (All Positive) of the items have been modified. In summary, predictions of Raven’s conform to the pattern observed for self-esteem and mental health, which differ from the pattern observed for satisfaction with life and happiness.

Summary

The predictive validity of the different versions of the BSC depends on the outcome measured. One possible interpretation is that some versions (e.g., the Original) are better (e.g., more valid and reliable) and, thus, more sensitive measures of true relationships. Given the obtained pattern this line of reasoning leads to the conclusion that self-control is associated with self-esteem, mental health, and general fluid intelligence, but not with satisfaction with life and happiness. But robust correlations are observed between the Mirrored (13 modified items, nine positive items) and All Positive (nine modified items) versions and satisfaction with life or happiness.

The possible spillover effects of BSC version on responses to the outcome scales

It is possible that the act of completing a self-control scale might affect responses to the outcome scales that follow. One-way ANOVAs were used to test if the specific version of the BSC influenced the means for self-esteem, mental health, satisfaction with life, and happiness. As shown in the bottom row of Table 8, there are no statistically significant differences between the groups randomly assigned to different versions of the BSC, although the ANOVA on the Happiness scores yielded F(4, 997) = 2.26, p = 0.061. Furthermore, the largest mean on the happiness scale occurs for the All-Positive version, which raises the possibility that responding to positive SC items may induce a boost in mood that, in turn, enhances the happiness ratings. This would be a risky interpretation because (1) the exact probability of the F statistic is greater than the standard alpha of 0.05, (2) the rank order of means for happiness is not predicted by the number of items with positive valence, and (3) any such boost in positive mood would need to be maintained as the participant responds to the self-esteem, mental health, and satisfaction-with-life scales that preceded the happiness scale.

Table 8 Effects of BSC version on mean of outcomes scales. The bottom row is the exact probability associated with the F statistic in the corresponding one-way ANOVA

Revisiting the critical assumptions underlying reverse coding of negative items

An optimistic and unstated assumption in scale development is that any item with positive valence can be rewritten as a polar opposite item with negative valence (and the reverse, that is, any item with negative valence can be rewritten as a polar opposite item with positive valence). If this assumption is true and put into practice, then scales that have different mixes of positive and negative items will be equivalent once the negative items are reverse scored by subtracting the raw scores from the maximum value plus 1. Recall that the instruction for the BSC is to Indicate how much each of the following statements reflects how you typically are on a scale of 1 (Not at all) to 5 (Very Much). Thus, a response of 1 to the negative item I am lazy would be reverse coded as 6 – 1 = 5. If the key assumption is true, then rewriting the item to have positive valence, I am not lazy, should elicit a response of “5”. Similarly, positive items receiving a response of “4” should have their negative counterpart responded to as a “2”. In this ideal world, self-control scales would be valid and unbiased regardless of the ratio of positive to negative items.
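The polar-opposite assumption implies a simple check for any positive item and its negative mate: the raw response to the positive item should equal the reverse-coded response to the negative item. The Python sketch below expresses that check. The function names are hypothetical, and the responses in the usage example are invented for illustration.

```python
# Sketch of the ideal-world relation between a positive item and its
# negative counterpart on a five-point scale. Example responses are invented.

def reverse_code(raw, scale_max=5):
    """Reverse score: subtract the raw score from the maximum value plus 1."""
    return scale_max + 1 - raw


def pair_discrepancy(pos_raw, neg_raw, scale_max=5):
    """Scored positive item minus scored (reverse-coded) negative mate.
    Zero when the polar-opposite assumption holds for this pair."""
    return pos_raw - reverse_code(neg_raw, scale_max)


if __name__ == "__main__":
    # Assumption holds: "4" on the positive item, "2" on its negative mate.
    print(pair_discrepancy(4, 2))
    # Agreeing strongly with both wordings leaves a large discrepancy.
    print(pair_discrepancy(5, 5))
```

A nonzero discrepancy for a pair means that the two wordings are not behaving as exact opposites for that participant, whatever the reason.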

However, the distributions of mean scores for the different versions of the BSC are far from equivalent. Examination of the distributions of means for the All-Positive (Fig. 5, top) and All-Negative (Fig. 5, bottom) versions shows that the All-Positive distribution has a somewhat larger mean and decidedly smaller standard deviation. The following factors contribute to these differences. The first factor challenges the assumption that reverse coding is a valid transformation because the five-point Likert scale used by the BSC is not an equal-interval scale with a neutral point of 3. For example, the contrast shown in Fig. 5 suggests that participants are reluctant to use the Not at All (1) end point for positive items but less reluctant for negative items (where they are recoded to 6 – 1 = 5). A second factor is that it is very difficult to rewrite items into their polar opposites. It is worth noting that of the nine original BSC items with negative valence, only one involves the negation of a positive statement (Sometimes I can’t stop myself from doing something, even if I know it is wrong). The remaining eight items with negative valence directly assert an undesirable attribute (e.g., I am lazy). Thus, the greater variability for the All-Negative version is not due to the fact that sentences expressing grammatical negation tend to be comprehended less well, especially when the reader is not highly attentive. It is important to remember that the items with negative valence in the original BSC, with the one exception noted above, do not use any form of syntactic negation – they simply describe a behavior consistent with weak self-control, an undesirable trait. As observed by Rav Suri in an informal review, the original BSC items may have selected the most natural way to express a predisposition influenced by self-control.

Fig. 5 Histograms showing the frequency of participant means for the samples receiving the All Positive (top) and All Negative versions of the BSC. The raw scores for all 13 negative items were reverse-coded

Echoes of gavagai, inscrutability, and indeterminacy

Five versions of the BSC were examined. For the most part, each displayed a satisfactory (and sometimes outstanding) Cronbach’s alpha, produced EFAs with coherent single-factor or two-factor solutions, and showed significant correlations with at least two outcomes that are thought to be related to self-control. In an insightful treatise on traditional methods of survey validation, Maul (2017) refers to these steps as the trinity of classic methods for validating scaling instruments, but expresses concerns that they are not very effective in validating the scale as a measure of the intended theoretical construct. Maul cleverly illustrates the fallibility of this trinity by showing that nonsensical scales can pass with flying colors. One of Maul’s demonstrations riffs on Dweck’s (2006) theory that a growth mindset (the degree to which an individual believes that intelligence is malleable and changeable) predicts outcomes that should be affected by intelligence. Maul’s Study 1 included a scale of growth mindset that consisted of four positive items (e.g., You can always substantially change how intelligent you are) and four negative items (e.g., Your intelligence is something about you that you can’t change very much). The novelty introduced was to also collect data on a nonsensical scale where intelligence is replaced with the nonword gavagai, e.g., You can always substantially change how gavagai you are. This 8-item gavagai scale enjoyed a high alpha (0.91) and a two-factor solution that explained 99% of the common variance and separated the positive from the negative items. The total scores on the Theory of Intelligence were moderately associated with total scores on the “Theory of Gavagai” (r = 0.44) and exhibited weak but significant positive correlations with Agreeableness (r = +0.09) and Openness (r = +0.09). The lesson for the present purposes is that one should be skeptical that all five of the versions of the BSC tested in Study 1 have been “validated” as a measure of self-control simply because they passed the classic trinity of checks.

Study 2 within-subjects comparisons of different versions of the BSC

Study 1 was designed to discover if the valence structure of a 13-item self-control scale mattered when there were no carryover effects from responding to other scales. To that end, participants were randomly assigned to one of the five versions of the BSC and the assigned version was then followed by the self-esteem, mental health, satisfaction with life, happiness, and social desirability scales. Under these relatively pristine conditions, interesting differences between the BSC versions emerged. The All-Positive version had a larger mean and smaller standard deviation. Furthermore, a second factor emerged only when the scale version used items with both positive and negative valence and the factor structure was completely consistent with the method difference (viz, positive versus negative valence). Although there were about 200 participants in each group, it is difficult to discern if the small differences between the means for some of the other versions were statistically significant and likely to replicate. Perhaps more important, there was an unanticipated interaction between the Version of the self-control scale and correlations with different outcomes (self-esteem and mental health versus happiness and satisfaction with life). To increase power, Study 2 used a within-subject design to explore the same issues.

Method

Materials and design

Each participant was presented with the 26 items shown in Table 2 plus the special Who-version item I would say that I have iron self-discipline. The items were arranged as follows. The first block of 13 items consisted of the 13 Mirrored items presented in a different random order for each participant. The second block of 13 items consisted of the 13 Original items, also randomized for each participant. The Who item was always the 27th item presented. Thus, each of the unique 27 items was presented only once. From the participant’s perspective this was a single scale consisting of 27 items, with a random mix of positive and negative items. At analysis, the responses of each participant can be rearranged to calculate mean scores for each of the five versions. Consistent with Study 1, the self-control items were followed by the scales for self-esteem, mental health, satisfaction with life, happiness, social desirability, and then by the Raven’s test with the affirmation checks regarding honesty and paying attention coming last.
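The rearrangement of one participant’s 27 responses into five version scores can be sketched as follows. This is an illustration under stated assumptions, not the analysis script used in the study: the data layout and names are hypothetical, and the set of positively worded original items is assumed to be items 1, 6, 8, and 11 (only items 1 and 8 are identified as positive in the text; Table 2 is the authority).

```python
# Sketch: derive the five version means from one participant's 27 raw
# responses. Layout and names are hypothetical; POSITIVE_ORIGINAL is an
# assumption (only items 1 and 8 are confirmed as positive in the text).

POSITIVE_ORIGINAL = {1, 6, 8, 11}
ITEMS = range(1, 14)


def score(raw, positive):
    """Keep positive items as is; reverse score negative items (6 - raw)."""
    return raw if positive else 6 - raw


def version_means(orig, mirror, who):
    """orig, mirror: dicts mapping item number (1-13) to a raw response (1-5)
    for the Original item and its Mirrored counterpart; who: raw response to
    the positively worded Who item that replaces Original item 8."""
    s_orig = {i: score(orig[i], i in POSITIVE_ORIGINAL) for i in ITEMS}
    # A mirrored item has the opposite valence of its original.
    s_mirror = {i: score(mirror[i], i not in POSITIVE_ORIGINAL) for i in ITEMS}
    versions = {
        "Original": [s_orig[i] for i in ITEMS],
        "Who": [who if i == 8 else s_orig[i] for i in ITEMS],
        "All Positive": [s_orig[i] if i in POSITIVE_ORIGINAL else s_mirror[i]
                         for i in ITEMS],
        "All Negative": [s_mirror[i] if i in POSITIVE_ORIGINAL else s_orig[i]
                         for i in ITEMS],
        "Mirrored": [s_mirror[i] for i in ITEMS],
    }
    return {name: sum(vals) / len(vals) for name, vals in versions.items()}


if __name__ == "__main__":
    # Hypothetical participant who answers 5 to every one of the 27 items.
    all_fives = {i: 5 for i in ITEMS}
    for name, mean in version_means(all_fives, all_fives, who=5).items():
        print(f"{name:13s} {mean:.2f}")
```

Because every response is reused across versions, each version differs from the others only in which of the 27 scored responses enter its mean.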

Participants

Participants for Study 2 were tested on July 14, 2022, via MTurk from English-speaking countries (dominantly USA) for the same modest compensation of $0.20 as offered in Study 1. MTurk returned a total of 242 completed surveys, but 11 did not complete the Raven’s test and were not retained for analysis. On the affirmation questions, eight (3.3%) indicated that they did not always pay attention and six (2.5%) indicated they did not always answer honestly. After eliminating these participants, the main analyses were based on 228 qualified participants.

Results

Differences between BSC versions

Scale means

The overall measure of self-control (the mean of the 13 items) for each of the five versions is shown on the right side of Table 3. Note that each of the 27 items was presented only once. Thus, a participant’s response to Item 1, I am good at resisting temptation, contributes the same score (e.g., “5”) to each version that included that specific item; in this case the Original, Who, and All Positive versions. This reduces the amount of measurement error one would have if each of the five versions was presented in a separate block of 13 items whereby a participant caught between two scale values may choose “4” on one encounter with Item 1, but “5” on another. It also minimizes practice effects (or other sequential effects) that might occur if each item is repeated in each of five different blocks.

The mean self-control score for the All-Positive version (M = 3.87) is the highest in this within-subject comparison just as it was in the between-subject Study 1. But contrary to Study 1, the rest of the means systematically decline as the number of items with positive valence decreases, reaching a low of 2.44 for the All Negative version. This decreasing function (see Table 3, right side) is consistent with a strong acquiescence bias coupled with the reverse-coding of items with negative valence, as illustrated in the last row of Table 4. This observation raises the question of why the within-subjects format should induce a strong acquiescence bias when the between-subjects format did not, or at least did not do so uniformly and particularly for the Negative version. Adopting a strong acquiescence bias for negative items entails not having a social-desirability bias because agreeing to an item with negative valence means one is expressing a judgment that one has a negative (undesirable) trait. Perhaps there is something about experiencing items with some mix of positive and negative items that induces acquiescence and this response set dominates in Study 2, where the first 26 items are randomized and consist of 13 positive and 13 negative items. Unfortunately, this potential explanation for the scale means decreasing as the number of negative items increases is in complete opposition to the lore discussed earlier that including negative items is likely to prevent, or at least attenuate, the tendency to acquiesce.

Factor structure

In an ideal world where items can be rewritten to perfectly flip their valence and reverse coding works because the Likert scale behaves as an equal-interval scale with a neutral point of 3, the factor-loading plot for a 26-item scale consisting of the 13 positive items and their corresponding 13 negative counterparts should all load in one region representing Factor 1. Furthermore, in this ideal world the loadings for each specific positive item should superimpose on that for its negative mate. Rather, as clearly shown in Fig. 6, a robust second factor is fabricated by the reverse coding. When the items with negative valence are not reverse coded, the analysis clearly favors a single factor as shown in Fig. 7. (Although the big cluster in Fig. 7 can be partitioned into both a positive and negative cluster, the CFA clearly favors a model based on a single factor.) To the point, the dramatic separation into two factors seen in Fig. 6 is the product of a transformation (reverse-coding) that is intended to produce identical results for a positive item and its reverse-worded negative counterpart. Factor 2 in the CFA analysis shown in Fig. 6 is a pseudo factor. We were tempted to title this article: The Alchemy of Reverse Scoring the Brief Self-Control Scale Transmutes Valence Differences into a Fool’s-Gold Second Factor.

Fig. 6 Factor loading plot of a scale consisting of the 13 positive items and their corresponding 13 negative items when the negative items are, as is typical, reverse-coded

Fig. 7 The factor plot for a scale consisting of 13 positive items and their corresponding 13 negative items when the negative items are not reverse-coded

Predictive validity of the five versions

The scale means derived in Study 2 are shown on the right side of Table 3, while their ability to predict the five outcome variables is shown on the right side of Tables 5 and 6.

Self-esteem and mental health

The pattern of interaction between the five versions of the BSC and the self-esteem and mental-health outcome variables is similar to that observed in Study 1. The versions with the fewest modified items (Original 0, Who 1, All Negative 4) all produced strong correlations with self-esteem and mental health, whereas the correlations were weaker for the Mirrored and All Positive versions. It is interesting to note that although the All-Negative mean was much lower in the within-subject study (Study 1 M = 3.34, Study 2 M = 2.44), this did not impact the strong correlations with self-esteem and mental health. In addition to the five versions investigated in Study 1, a new sixth version of the BSC was derived by including all 27 items, and this All 27 version also showed strong correlations with self-esteem and mental health.

Satisfaction with life and happiness

As was the case in Study 1, the Original and highly similar Who version (1 modified item of the same valence) show weak and mostly non-significant correlations with both satisfaction with life and happiness. Why previous strong and positive correlations between the BSC and satisfaction with life are not replicated in our studies remains a mystery, but valence per se or reverse-coding may play a role as the All-Positive correlation of r = +0.51 (with satisfaction with life) reverses to r = –0.30 for the All-Negative version. As pointed out earlier, and in contrast to the self-esteem and mental-health scales, all five of the satisfaction with life items have positive valence.

Raven’s (general fluid intelligence)

Consistent with Study 1, the within-subject design shows modest, but statistically significant, correlations with Raven’s scores for most versions of the BSC. The exception, once again, is the All-Positive version that shows no correlation at all (r = –0.01). Recall that one account of the correlation between gF and cognitive control is that they are much the same thing. To the extent this is true, it suggests that the All-Positive version does not provide a good measure of self-control, perhaps because we did a poor job of rewriting the original nine items with negative valence so that they would capture the same content when expressed with positive valence.

Item overlap

Another potential problem in interpreting correlations between self-control scores and self-report measures of everyday outcomes is that positive correlations can be driven by item overlap. For example, Item 10 from the BSC (I have trouble concentrating) is very similar to Item 1 of the mental-health scale (Have you been able to concentrate well on what you were doing?). A strong correlation need not be mediated by individual differences in self-control.

Limitations of our empirical work

Generalizing our results is risky because they are based on manipulating the valence structure of only one scale (the BSC), examining a small set of outcome variables (self-esteem, mental health, satisfaction with life, happiness, and general fluid intelligence), and sampling from a single population (MTurk participants). Although we showed that, in broad strokes, our results replicate, they may not generalize to other scales and populations.

General discussion

Which version of the BSC is best?

Hats off to Tangney, Baumeister, and Boone if one answers the question on the basis of ability to predict our outcome measures. Across both studies, the correlations with self-esteem, mental health, and general fluid intelligence are the strongest for the original version. Other versions do better in predicting satisfaction with life and happiness in our studies, but many other studies show strong positive correlations between the BSC and satisfaction with life.

A version of the BSC based on all 26 positive and negative items could be nominated for best self-control scale as it significantly correlates with all five outcomes. However, it is no longer “Brief”. Perhaps more important, more needs to be done in investigating why (in our data) it is the items with positive valence that drive the correlation with satisfaction/happiness. Are these positive correlations due to associations between the labeled psychological constructs (e.g., self-control and happiness) or are they pseudo-correlations driven by response biases or the valence structure of the two scales? If these turn out to be real, then the All-26 version will prove to be the best of these alternatives.

What are the lessons for best practice in scale construction?

  1. Reverse-wording scale items to switch their valence is a difficult and imprecise process that will, at best, elicit a meaning that is approximately equal and opposite to the base item and will sometimes fall considerably short of that goal. The potential benefit of assessing how respondents interpret and think about individual items is often undervalued (see Newton, 2018, for an introduction to response-process validation).

  2. Scales that include both positive and negative items and are intended to measure a unitary construct will need to reverse-score one set of items, typically those with negative valence. Even if the reverse-wording of an item is well done conceptually, the typical method for reverse coding will introduce a bias because Likert scales are unlikely to be treated as equal interval scales with a neutral point corresponding to the center value (see Footnote 1).

  3. Including both positive and negative items does not mitigate a desirability bias because reverse-coding a negative item reinforces the bias. For example, if a desirability bias induces the response Not at All (1) to the negative item I am lazy, reverse scoring transforms the raw score of 1 to a 5 (the value indicating the maximum level of self-control). Thus, a desirability bias operating on both positive and negative items leads to a compounded overestimation of true self-control.

  4. Including items with both positive and negative valence in a scale intended to measure a unitary concept is likely to generate a two-factor solution as the best model rendered through linear factor analysis, but the second factor is likely to be fabricated from the reverse-coding of the negative items (see discussion of Fig. 6 versus Fig. 7).

  5. The lore suggesting that participants are less likely to adopt an acquiescence bias when the scale includes both positive and negative items lacks compelling empirical support.

  6. Given the difficulties associated with reverse-wording and reverse-coding, it may simply be better to develop a scale consisting of only positive items or only negative items. Of course, the semantics of each item must capture the construct of interest. One can sift and winnow through a larger set of possible items to select those with better internal reliability, convergent validity, and predictive validity.

  7. By contemporary standards, scales and tests should undergo a validation process.
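The reverse-scoring arithmetic described in lesson 3 can be made concrete with a minimal sketch. The function names, item labels, and the two-item respondent below are hypothetical illustrations and are not taken from the BSC scoring materials; the only assumption carried over from the text is a five-point scale on which a raw 1 on a negative item is rescored as 5.

```python
def reverse_score(raw, scale_min=1, scale_max=5):
    """Map a raw Likert response onto its mirror image on the same scale."""
    return scale_min + scale_max - raw


def scale_mean(responses, negative_items):
    """Mean item score after reverse-scoring the negatively worded items.

    responses: dict mapping an item label to its raw response.
    negative_items: set of labels for items with negative valence.
    """
    scored = [
        reverse_score(raw) if label in negative_items else raw
        for label, raw in responses.items()
    ]
    return sum(scored) / len(scored)


# Hypothetical respondent who answers in the socially desirable direction
# on one positive item and one negative item.
biased = {"good at resisting temptation": 5, "lazy": 1}
print(reverse_score(1))              # 5
print(scale_mean(biased, {"lazy"}))  # 5.0
```

In this toy case the desirable answer to the negative item is rescored to the same maximum as the desirable answer to the positive item, so mixing valences does nothing to offset the bias in the scale mean.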

The process of validation

To this point, our discussion of validity has been quite operationist as reflected in considerations of “convergent validity” and “predictive validity” as empirical forms of evidence. But this ignores a decades-long movement toward a broader and more theoretical view of validity.

Validity as a causal relation

A fundamental starting point is Denny Borsboom’s (Borsboom et al., 2004) definition that a test is valid if the target property (aka, attribute or construct) exists, and that variation of the target property causes variation in the test score (i.e., the responses to the test items). Thus, a convincing validation process begins with a clear description of the target property – the psychological construct of interest. This is not easy for a construct as ambitious and promiscuous as self-control.

TBB regard self-control as the capacity to change and adapt the self so as to produce a better, more optimal fit between self and world. Central to this concept of self-control is the ability to override or change one’s inner responses, as well as to interrupt undesired behavioral tendencies (such as impulses) and refrain from acting on them. Regulating the stream of thought (e.g., forcing oneself to concentrate), altering moods or emotions, restraining undesirable impulses, and achieving optimal performance (e.g., by making oneself persist) all constitute important instances of the self overriding its responses and altering its states or behaviors. More generally, breaking habits, resisting temptation, and keeping good self-discipline all reflect the ability of the self to control itself. How control is accomplished was not the focus of TBB, although Baumeister has, of course, energetically endorsed a theory that effortful control requires mental resources that can be depleted and eventually sabotage the amount required for successful control.

Returning to Borsboom, the precondition for a valid test is that the target construct exists. We have been haunted by the specter that there may be no such thing as a domain-general ability for EF because of the lack of convergent validity in performance-based measures of EF, but as noted in the introduction, self-report tests, including the BSC, pass this test. As TBB point out, there are anecdotal observations and research findings suggesting that substantial individual differences exist in people’s capacity for self-control, and they appeal to the “obvious” universality of observations like this: “Some people are much better able than others to manage their lives, hold their tempers, keep their diets, fulfill their promises, stop after a couple of drinks, save money, persevere at work, keep secrets, and so forth.” But the degree to which control is highly consistent across situations may be illusory. Perhaps we also know someone who loses their temper but keeps a healthy diet, fails to follow through on what they assure us they will do but saves money, and overindulges at happy hour and can’t keep a secret but stays on task at work with tenacity.

One might argue that evidence gleaned from observing everyday life does not provide probable evidence that self-control is a unitary capacity that spans thoughts, emotions, impulses, and performance across all types of situations. But in the interest of considering Borsboom’s second aspect of validity, let’s stipulate that a domain-general ability for self-control does exist. Assuming that construct exists, a test is valid if variations in that attribute cause variations in the outcomes of the measurement procedure. The spirit underlying the adequacy of this definition of validity is that the causality assumption ensures that the test scores must provide relevant information about the degree to which the test-taker embodies the construct.

How can the causality assumption be evaluated for self-control? One approach might be to induce variation in the construct through training and then test for transfer-of-training. There is a plethora of research exploring the possibility that specific activities assumed to require domain-general EF will lead to the enhancement of performance-based measures of EF (e.g., nonverbal interference tasks like flanker, Simon, and Stroop). Exciting early reports of far transfer of the effects of video-gaming and music performance have failed to replicate in later studies that use both random assignment to treatment groups and active control groups (Paap et al., 2019; see Footnote 2). There is a paucity of similar research that uses self-report measures of self-control like the BSC.

Borsboom reasons that if the crucial ingredient of validity involves the causal effect of an attribute (a construct) on the test scores, then the locus of evidence for validity lies in the processes that convey this effect. Somewhere in the chain of events that occurs between item presentation and item response, the measured attribute must play a causal role in determining what value the measurement outcomes will take. His pithy assertion is that “… if one does not have an idea of how the attribute variations produce variations in measurement outcomes, one cannot have a clue as to whether the test measures what it should measure. No table of correlations can be a substitute for knowledge of the processes that lead to item responses.” p. 1068 (Borsboom et al., 2004).

We suggest that for the BSC the “chain of events between item presentation and item response” might consist of a primary and secondary chain. The primary chain is the underlying theory of SC specifying how and how well this capacity causes adaptive control of thoughts, emotions, and performance. We sketch such a model in Fig. 8. The primary chain at the top refers to an individual’s self-control history. His or her capacity for SC is brought to bear on many occasions every day and each encounter results in actions that vary from success to failure. An objective record of the success-to-failure ratio (perhaps weighted by the importance of each act) rationally defines the individual’s SC ability relative to others. For measurement purposes, this could be considered the true SC ability. However, each act is not carved in stone, but rather encoded (with some degree of abstraction and bias) in an autobiographical memory subject to decay and/or interference. The secondary chain, at the bottom of Fig. 8, traces the events triggered by reading a test item and leading to a semantic representation of that item (bottom right). The task requires the individual to compare this semantic representation (e.g., semantics of “I am lazy”) to a synthesized representation of all autobiographical memories of relevant acts (e.g., instances of being lazy) resulting in a value that reflects the degree of SC and then to further determine the Likert option that best matches the result of the comparison operation. As suggested in the bottom-left this selection process is open to influences and biases from other sources, such as a motivation for social desirability.
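To illustrate how the two chains can come apart, the following minimal sketch computes an importance-weighted success rate over a record of acts and then recomputes it with older acts discounted. This is not a fitted or published model: the exponential decay, the `decay` parameter, the proportion-of-successes formulation (used here in place of a success-to-failure ratio), and the mapping onto a five-point response are all assumptions introduced for illustration.

```python
import math


def weighted_success_rate(acts, decay=0.0):
    """Importance-weighted proportion of successful self-control acts.

    acts: list of (succeeded, importance, age_in_days) tuples.
    decay: assumed exponential forgetting rate per day; 0.0 gives the
           objective record, larger values down-weight older acts.
    """
    numerator = 0.0
    denominator = 0.0
    for succeeded, importance, age_in_days in acts:
        weight = importance * math.exp(-decay * age_in_days)
        numerator += weight * (1.0 if succeeded else 0.0)
        denominator += weight
    return numerator / denominator


def to_likert(rate, points=5):
    """Map a rate in [0, 1] onto the nearest option of a Likert scale."""
    return 1 + round(rate * (points - 1))


# Hypothetical history: two old successes and two recent failures.
history = [(True, 1.0, 100), (True, 1.0, 90), (False, 1.0, 2), (False, 1.0, 1)]

objective = weighted_success_rate(history)               # objective record
remembered = weighted_success_rate(history, decay=0.05)  # decayed memory
print(to_likert(objective), to_likert(remembered))
```

Under these assumptions the remembered rate is dominated by the recent failures, so the selected response falls below the one implied by the objective record, even before any social-desirability influence is added.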

Fig. 8. A model showing that Likert responses to a BSC item require a search and synthesis of autobiographical memory. See text for details.

This exercise shows that the “chain of events between item presentation and item response” is very complicated for self-report measures of SC and that gathering evidence that changes in true SC cause changes in BSC scores is not a matter of adding a gram to one side of the scale and directly observing a change. Is the BSC a valid test of self-control? A test intended to measure whether the test-taker has the necessary algebra skills to be successful in calculus can be validated directly and objectively. Tests based on self-reports that embed one causal chain (the effects of self-control on behavior in everyday life) inside of another (the effects of memories of relevant past events on the response to a self-control item) may never provide such compelling levels of evidence.

Validation as a process

Messick’s (1995) widely cited definition of validity is that “Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (p. 741). This resonates with current theory and practice on validation, which assumes that test-score interpretations and uses that are clearly stated and supported by appropriate evidence are considered to be valid. Kane (1992) reemphasized that it is the interpretation and use of the test scores that are validated, not the test on an uninhabited island. This approach is endorsed in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), which state that “It is the interpretations of test scores for proposed uses that are evaluated, not the test itself… Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use” p. 11.

Inferences from test scores to theoretical constructs depend on assumptions included in the theory defining the construct. Because it is not possible to prove all the assumptions that lead to the interpretation and use of the test, the best that can be done is to show that the argument is highly plausible, given all available evidence (Kane, 1992). Following this framework, the Standards permit and encourage the examination of whether the test measures less or more than its proposed construct. Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the target construct because it does not capture some psychological processes that are encompassed by the intended construct. Construct irrelevance refers to the degree to which test scores are affected by processes that are extraneous to the test’s intended purpose. The phrasing used by the Standards enables the tail (the intentions of the test developer) to wag the dog (the test). The long-term utility of this depends on whether traits like self-control are there to be discovered or are “constructed” to serve as part of a useful theory. A more useful fruit of this discussion is that BSC scores have been interpreted and used for multiple purposes and each purpose may deserve its own validation process.

Validation by objective measures

As speculated earlier, the correlation between self-control scores and a variety of outcome variables may depend on the valence structure of both the predictor and the outcome. This is likely to be very difficult to disentangle. A better approach might be to use objective measures rather than subjective self-report measures for self-control, life outcomes, or both. However, as addressed in the Introduction, performance-based measures of self-control do not correlate with self-report measures like the BSC and, compounding the disappointment, it is the objective performance-based measures that are weakly and inconsistently associated with outcomes in everyday life.

Although objective (performance-based) measures of self-control have not been helpful, perhaps objective measures of everyday outcomes will have more utility. Promising results have been reported. TBB reported that in their two college-student samples, BSC scores predicted GPA (r = +0.39 and +0.15). Duckworth et al. (2010) provide evidence that self-control (BSC scores) may causally influence academic achievement. This longitudinal study tracked 142 fifth-graders (M = 10.5 years) for 4 years. A growth curve analysis showed that changes in self-control over time predicted subsequent changes in GPA. Gordeeva et al. (2017) reported a statistically significant correlation (r = +0.17) between BSC scores for first-year university science majors and average scores in the immediately following examination session.

Ferrari et al. (2009) studied 606 adults (407 men, 199 women, M age = 38.5 years) who were in recovery and residing in self-governed, communal living, abstinent homes across the United States. BSC scores were positively related to length of abstinence, but the factor defined by the four items with positive valence was primarily responsible for the significant relationship. These specific examples bolster the prospects that BSC scores will correlate with objective measures and help adjudicate which correlations are real (self-esteem?) and which may be the product of response strategies imposed on self-reports (satisfaction with life?). A meta-analysis of the predictive validity of the BSC conducted by de Ridder et al. (2012) showed that observed behaviors (drawn from eight studies) and self-reported behaviors (drawn from 29 studies) were equally related to self-control as measured by the BSC. This is promising, but most of the evidence relies on subjective measures and correlational studies.

Validation through theory testing

Oberauer and Lewandowsky (2019) advocate for more theory-testing research. According to them, the defining feature of theory-testing research is a theory that implies that under the conditions specified in the theory, X must be the case. The hypothesis follows deductively from the core assumptions and this tight logical link between theory and hypothesis implies that establishing X as an empirical generalization supports theory T, and conversely, empirically establishing that X is not true counts as evidence against T. This is quintessential theory testing. It offers a chance to obtain strong evidence both in favor of and against a theory. When the strong evidence is obtained, it also validates the measures used to test the theory. Oberauer and Lewandowsky’s requirements for theory testing resonate with the parallel use of mathematical algorithms that instantiate the conditions and consequences in a computational model that enables the simulation of empirical phenomena. This also aligns with Maul’s desire for scales of psychological constructs to be tied more directly to explicit theories and that we cannot be satisfied with just operationally defining an important concept like self-control as a score on a superficially “validated” scale.

In this concluding section, we sketch a framework that could lead to a testable theory of SC. The core of this theory will be a mechanism that exerts SC without the need for an unscientific homunculus exerting free will. It is based on Suri and Paap’s (2023) Comparison with Goal States Model (CGSM). This cybernetic core is instantiated as a neural network that compares available response options to a relevant goal in memory and can select the option that is most similar to that goal even if other options are initially more attractive. This network is represented in Fig. 9 by the yellow circle labeled CGSM Network, where it is noted that the network acts as an iterative Comparator, Amplifier, and Attenuator. The network performs these control acts by amplifying the activation of options whose representations are similar to the goal and attenuating the activation of options that are dissimilar. When does the CGSM succeed in selecting the more goal-fulfilling option even when it is initially less attractive (less activated)? The outcome depends on a balance between the initial activation of the goal, the relative initial activation of the options, the relative similarity of each option to the goal, and the weights assigned to similarity versus dissimilarity. Specific assumptions about the architecture and processes within the CGSM are provided in Suri and Paap (2023).
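The compare, amplify, and attenuate cycle can be illustrated with a deliberately simplified toy loop. It is not the neural network of Suri and Paap (2023): the cosine similarity measure, the mean-centering of similarities, the `gain` and `threshold` parameters, and the two-feature option vectors are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_option(goal_vec, goal_activation, options,
                  gain=0.1, threshold=1.0, max_steps=1000):
    """Iteratively amplify goal-similar options and attenuate the rest.

    options: dict mapping a name to (initial_activation, feature_vector).
    Returns (winner, steps), or (None, max_steps) if no option reaches
    the threshold.
    """
    sims = {name: cosine(vec, goal_vec) for name, (_, vec) in options.items()}
    mean_sim = sum(sims.values()) / len(sims)
    act = {name: a0 for name, (a0, _) in options.items()}
    for step in range(1, max_steps + 1):
        for name in act:
            # Above-average similarity is amplified, below-average is
            # attenuated, both in proportion to the goal's activation.
            change = gain * goal_activation * (sims[name] - mean_sim)
            act[name] = max(0.0, act[name] + change)
        winner = max(act, key=act.get)
        if act[winner] >= threshold:
            return winner, step
    return None, max_steps


# Hypothetical features: [healthiness, tastiness]. The goal is "eat healthy".
goal = [1.0, 0.0]
options = {"apple": (0.3, [1.0, 0.2]), "donut": (0.6, [0.1, 1.0])}
weak_goal = select_option(goal, 1.0, options)
strong_goal = select_option(goal, 2.0, options)
print(weak_goal, strong_goal)
```

In this toy setting the apple is selected despite the donut's head start, and doubling the goal activation reduces the number of iterations needed, which mirrors the qualitative claim that attention to the goal speeds goal-consistent selection.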

Fig. 9. The Comparison with Goal States Model embedded within a more complete model of self-control.

Here, we are most interested in what factors drive individual differences in SC that should be reflected in BSC scores (or any other candidate measure of SC). One individual-difference driver addressed by Suri and Paap is the capacity to apply or sustain activation to the goal. Thus, some individuals may be better at attending to the goal (perhaps an ability with high heritability). The more attention paid to the goal the more rapidly the network can amplify the activation of the option most similar to the goal and attenuate the activity of the others.

In its current state of development the simulation uses hand-crafted representations of the options and goals, but the CGSM postulates an unsupervised Hebbian learning component (see Suri & Paap) that translates statistical regularities across semantic representations that have been acquired (e.g., representations of what one knows about apples) into distributed representations corresponding to the available choices in the environment and the relevant goals (e.g., I should eat healthy food). In Fig. 9 the available choices appear as Option Representations and the relevant goal as a Goal Representation. These activated representations provide the initial input to the comparator. Suri and Paap used the CGSM to simulate empirical data in healthy versus tasty food choice and in temporal discounting – two phenomena central to the study of self-control (see Footnote 3). The model successfully simulated reaction times in food choice and the dynamics reflected in mouse-tracking trajectories during food choice. More specifically, the CGSM closely simulated the empirical advantage in mouse-tracking for individuals with better SC (based on a composite measure that included BSC items) simply by using a higher activation level for the goal representation. A key pathway in the model in Fig. 9 shows that attending to a goal leads to the activation of its representation. In the CGSM network, greater activation of the goal means that the more similar options will gain the upper hand over the dissimilar options faster, even when the latter enjoy a head start because they have higher initial levels of activation.

The effectiveness of SC should, in part, depend on the quality of the goal or option information that an individual experiences. For example, the Option Representation of a tasty “energy” bar may contain bogus features regarding calories, nutrition, or saturated fat if the individual has been exposed to misleading or incomplete advertising. Or, on the Goal side, an individual may simply not be sufficiently exposed to (or convinced of) the benefits of a healthy diet or the features that provide a healthy diet. Should the degree to which an individual seeks and comprehends high-quality information about important life goals be considered an essential facet of SC? On the one hand, it has little or nothing to do with the actual process of control; on the other hand, it may account for substantial individual differences in objective measures of SC success. Smarter people may not have better willpower, but they are likely to know more – to have more accurate representations about the world.

If the option selected involves a delicate balance involving the current activation levels of the competing options and the relevant goal (as is the case for the CGSM), then it follows that the situation will play a powerful role in SC as it is likely to dominate the relative salience of the representations engaged by the CGSM network. The important role of the situation in selectively activating the goal and the available options is marked by the Situation box on the left side of Fig. 9. The pathway branching toward the Goal Representation acknowledges that aspects of the situation are powerful determinants of the degree to which a goal is activated. Suri and Paap’s simulations show that when a goal is highly activated (e.g., you post your New Year’s resolution to eat healthy on the refrigerator door) it can compensate for initially low levels of activation in a highly similar option (e.g., no apple is in plain sight) or, conversely, that a highly salient option with low similarity to the goal (e.g., that open package of donuts on the counter) can lead to a quick impulsive decision. These outcomes are consistent with the empirical evidence showing that manipulations of the environment can either make SC fast and easy (the Activation/Attention-to-Goal pathway in Fig. 9) or lead to SC failures when the situation triggers the Urges/Habits/Impulses leading to poor choices. As Duckworth et al. (2017) observe: “Ironically, we may underappreciate situational self-control for the same reason it is so effective, namely that by manipulating our circumstances to advantage we are often able to minimize the in-the-moment experience of intrapsychic struggle typically associated with exercising self-control… an individual who would rather snack on bananas than donuts after work might decide to enter her home via the living room (rather than the kitchen), calling out to her husband to hide the box of donuts she knows she left out on the kitchen counter that morning. Then, ensuring that her gaze falls anywhere but those donuts, she might deliberately think to herself, ‘calorie bomb!’ and thereby strengthen her resolve not to eat any.” p. 1. These hypothetical scenarios not only ring true, but the underlying strategies that maneuver our immediate environment and/or change the way we think about it have received empirical support. Thinking about the money we have in our pocket as a windfall inclines us to spend, but cognitively framing it as part of our future income stream or as an acquired asset induces us to save it (Milkman & Beshears, 2009). Surrounding oneself with fellow saints rather than other sinners makes one less likely to give in to temptations (Hofmann et al., 2012). The absence of distractors (e.g., cell phone, TV) where high-school students study predicts their enjoyment and intrinsic motivation for doing homework (Galla & Duckworth, 2015). We consume more from larger snack packages than smaller ones (Wansink, 1996). Regardless of their weight, cafeteria patrons take more high-calorie desserts when they are easier to reach (Levitz, 1976). But the salience of the option can also be leveraged to induce better choices. For example, students are more likely to make healthier food choices when those items are available at the beginning, rather than the middle, of the cafeteria line. Likewise, shoppers purchase more fruit and vegetables when they are within easy reach (Rozin et al., 2011). In communities where purchasing alcohol becomes easier, alcohol consumption and related medical and criminal consequences increase (Campbell et al., 2009).

Modeling advances such as the CGSM network potentially provide a much better understanding of what SC is as a decision mechanism. But the Process Model of Self-Control, which identifies learnable strategies for shaping the environment to make SC easy or even unnecessary, may account for the lion’s share of individual differences in making the right choices to satisfy long-term goals. Is SC a skill or a personality trait? The items comprising the BSC all ask about the respondent’s predispositions for SC successes (positive valence) or failures (negative valence), but not about typical behaviors used to manage the need for SC. Such items could be used and, for example, a study by Hofmann et al. (2012) includes: “I avoid situations in which I might be tempted to act immorally” and “I choose friends who keep me on track to accomplishing my long-term goals”. If BSC scores are to be used as a measure of how well individuals have adapted to their environment (see section regarding how TBB defined SC) and to predict the degree to which their decisions in the face of prepotent and less goal-fulfilling options can lead to success over failure, then it is likely that the scale will need to cover both the skill and the trait. In summary, this section shows how theory can and should play a role in the validation process. However, the constructive role of theory must be coupled with scale-development methods that minimize the foibles of self-report and Likert scales highlighted in the empirical part of this report.