Aggression, risk-taking, and other types of disinhibitory behavior have been the focus of decades of empirical study due to their potential for deleterious outcomes. In particular, in seeking to explain why males exhibit higher levels of physical aggression and risk-taking than females, researchers have often focused on the role of androgens. Testosterone (T), a steroid hormone product of the hypothalamic–pituitary–gonadal axis, has been highlighted as one potentially influential factor associated with status-relevant behaviors (e.g., aggression, dominance). Evidence for the T–aggression association appears to be robust in animal studies; however, evidence for the effect in humans is much less consistent (Archer, 2006; Carré & Archer, 2018; Wingfield et al., 1990). Mixed empirical findings suggest, at best, a weak positive correlation between T and status-relevant behaviors (Book et al., 2001; Rubinow & Schmidt, 1996). Indeed, a meta-analysis on the subject finds a significant but very small effect (Archer et al., 2005).

The dual-hormone hypothesis (DHH) has been offered as one way to account for inconsistent findings regarding the role of T on status-seeking behaviors in humans. As defined by Mehta and Josephs (2010), the DHH posits that T’s association with dominance and aggression depends on levels of cortisol (C), a hormone product of the hypothalamic–pituitary–adrenal axis. Behaviorally, C is implicated in the stress response and associated with submissiveness, withdrawal, fear, and avoidance (Brown et al., 1996; Denson et al., 2009; Dickerson & Kemeny, 2004; Goldsmith & Lemery, 2000; Roelofs et al., 2009; Schulkin et al., 1998; Taylor et al., 2000), which are quite distinct from the behaviors associated with T (approach-related behaviors, antagonism, and risk-taking; Carré & McCormick, 2008; Mehta & Josephs, 2006; Oyegbile & Marler, 2005; Platje et al., 2015; van Honk et al., 2004). Specifically, the DHH states that T and C jointly relate to socially dominant behavior such that higher levels of T are associated with more status-relevant behaviors, but only when C is low (Mehta & Josephs, 2010). When C is high, the DHH predicts that the effects of T would be inhibited or suppressed. Since it was first proposed, this model has been discussed extensively in the literature (see Knight et al., 2020; Mehta & Prasad, 2015; Sarkar et al., 2019), and the interaction between T and C has been linked to many status-relevant behavioral outcomes.

Proposed Mechanisms of the DHH

Support for the theoretical underpinnings of the DHH comes from a variety of research targeting potential mechanisms of the T × C interaction (Dekkers et al., 2019; Mehta & Josephs, 2010). As noted previously, behavioral manifestations of T and C are largely at odds with one another. These conflicting behavioral manifestations could potentially result in an interaction effect whereby high levels of one hormone override, or suppress, the other. Further support for the DHH can be drawn from evidence for T × C interactions at the molecular level; for example, high levels of C have been shown to actively suppress T expression at multiple molecular levels (Burnstein et al., 1995; Chen et al., 1997; Johnson et al., 1992; Tilbrook et al., 2000), and vice versa. This mechanistic process has also been empirically observed at the cellular and genomic levels (Viau, 2002). Thus, biological support for T × C interaction has been drawn on to reinforce proposals for the DHH at behavioral levels.

Multiple studies have now found support for the DHH across a variety of behavioral outcomes, with fairly extensive recent coverage of these studies found elsewhere (Dekkers et al., 2019; Knight et al., 2020; Sarkar et al., 2019). Specifically, these studies typically find that associations between T and the outcome of interest are only observed when levels of C are low. Within this context, many status-relevant variables have been examined as the dependent variable, including social status and winning (Casto et al., 2019; Edwards & Casto, 2013; Ponzi et al., 2016; Sherman et al., 2016), social dominance (Mehta & Josephs, 2010; Pfattheicher, 2017), disinhibitory psychopathology and physical aggression (Dabbs et al., 1991; Geniole et al., 2011; Pfattheicher et al., 2014; Popma et al., 2007; Tackett et al., 2014), and risk-taking (Mehta et al., 2015; Ronay et al., 2018; Van den Bos, 2013).

Inconsistent Evidence Supporting the DHH

Despite this wide-ranging evidence supporting the DHH for a variety of status-relevant behaviors, other studies have shown reversed or null findings (e.g., Geniole et al., 2013; Grebe et al., 2019). Dekkers and colleagues (2019) recently conducted a meta-analysis examining the T × C interaction in predicting status, dominance, aggression, psychopathy, and risk-taking. This meta-analysis concluded that there was a significant T × C interaction in predicting these outcomes, although the effect size was very small (r = -0.061 overall; r = -0.03 for dominance). The authors failed to find sizeable moderation of this effect by participant sex, age, or population type. Further, Dekkers and colleagues (2019) found evidence of methodological weaknesses in the literature on the DHH, including low power, variability in operationalization of variables of interest, and analytic flexibility.

Potential Reasons for Inconsistency

A number of moderators not addressed by Dekkers et al. (2019) have been proposed to explain heterogeneity in the DHH literature. Prior studies have tested whether the DHH may only be detectable in targeted groups, while higher measurement error or restricted range of variance in other groups (e.g., lower T signal in females) may substantially decrease researchers’ ability to detect dual-hormone effects. As early DHH studies included highly specific samples (e.g., male delinquents in Popma et al., 2007), some have proposed that sample stratification across different moderators (e.g., age, gender, clinical status) may better isolate T × C interaction effects, if and when they do exist, by increasing the ratio of signal to noise. Other moderators that have demonstrated this pattern include developmental considerations aside from chronological age (e.g., the influence of pubertal status on T × C coupling; Ruttle et al., 2015) or personality traits (e.g., disinhibitory personality traits; Tackett et al., 2014). However, many prior studies have lacked the robust sample sizes required to detect these complex interactions (Dekkers et al., 2019).

Inconsistencies in sample types and assays used to measure hormone levels may also explain heterogeneity in DHH findings. Most DHH studies have used immunoassay methods to measure hormone levels in single saliva samples. This may contribute to measurement error, as immunoassay methods may inflate very low T concentrations (Welker et al., 2016). Associations between immunoassays and liquid chromatography tandem mass spectrometry (LC–MS/MS) assays are weaker for T than for C, suggesting more precise assays should be used to measure T concentrations in particular. LC–MS/MS methods have a number of advantages over immunoassays, including higher specificity, greater matrix independence, and a wider analytical range (Field, 2013). Additionally, whereas single saliva samples are moderately stable when collected at the same time of day (Harden et al., 2016), single samples are subject to fluctuations due to mood or reactivity to current environmental states. Hair sampling may capture more trait-like individual differences in C and T levels, as hormones extracted from hair measure aggregate free hormone secretion over several months (Stadler et al., 2012). There is a need to examine possible DHH effects across these sample and assay types, as effects may be most trustworthy when using hair and LC–MS/MS assay methods.

Other compelling explanations for non-replication of the DHH effect include poor construct validation and other measurement practices (e.g., Grahek et al., in press), low power, and analytic flexibility (e.g., Bakker et al., 2016). The DHH has been proposed to impact constructs ranging from leadership and dominance behaviors (Mehta & Josephs, 2010) to psychopathy (e.g., Welker et al., 2014), aggression (e.g., Geniole et al., 2011), and risk-taking (e.g., Mehta et al., 2015). While all of these constructs may overlap to some extent, they are not isomorphic, and construct confusion may result in overgeneralizing findings from one domain to another. Evidence also suggests that measurement practices in many areas of psychology are often poor, with studies using previously unvalidated measures (Flake et al., 2017; Grahek et al., in press) and selectively reporting measures or variables showing hypothesized effects (Simmons et al., 2011). While several recent studies that used pre-registered analysis plans have found evidence in support of the DHH (Kordsmeyer & Penke, 2019; Lausen et al., 2020; Ronay et al., 2018; Schild et al., 2020), additional sufficiently powered, multi-method, and pre-registered replications of this question may help to resolve discrepancies in this literature.

The Present Study

The goal of the present study was to evaluate the DHH literature broadly through a conceptual extension of many recent dual-hormone studies (see Supplementary Table 1 for methodological differences between the present study and three other relevant dual-hormone studies). We aimed to assess whether the interaction between T and C is associated with dominance in an adolescent sample via multiple measures of T, C, and dominance, and with pre-registered hypotheses and analyses. The first of our three primary aims was to examine whether salivary T and C interact in predicting dominance in adolescents as measured by observer ratings of performance in a leadership task similar to that used in Mehta and Josephs (2010) Study 1. Our second aim was to evaluate evidence for the DHH across different hormone samples (hair and saliva), and different operationalizations of dominance (self- and parent-reported dominance motivation and self-reported social potency). We hypothesized no main effects of either T or C on dominance, but we did not have specific predictions about whether or not we would observe significant T × C interactions. Our third aim was to evaluate evidence for a three-way T × C × gender interaction to assess whether the T × C interaction differs between males and females, given mixed findings regarding gender differences in the broader literature. We hypothesized that there would not be a difference in the T × C interaction across genders. Our hypotheses and analytic decisions were pre-registered after data collection, but before data cleaning or analysis (https://osf.io/5mzga), and any deviations from this pre-registration are noted below in the Deviations from the Pre-Registration section.

Table 1 Descriptive Statistics

Methods

Participants

The total sample consisted of 337 adolescents (mean age = 14.98 years, SD = 1.51 years, 57% female) and their caregivers (83.5% mothers). Adolescents were recruited via community outreach in a large metropolitan area in the midwestern United States. Flyers were distributed online, in local high schools, and in the community. Inclusion criteria were fluency in English for both adolescent and parent and an age range of 13–17 for adolescents (18-year-olds still in high school were also eligible). Exclusion criteria were the presence of an intellectual disability, neurodevelopmental disorder, or psychotic disorder in the adolescent. We excluded three participants who self-identified as transgender or gender nonbinary, as T values were winsorized and standardized within-sex (see Data Analysis section below). Parent-reported (or self-reported, if parent-reported data were unavailable) race/ethnicity of adolescents was as follows: 8.9% Asian, 10.7% Black, 6.2% Hispanic/Latinx, 62.0% White, 11.6% other/multiple races; 0.6% did not report their race/ethnicity.

Procedures

Data from the present study were collected as part of the Game Changers project, a broad study assessing many domains, including personality, psychopathology, and social functioning. Ethics approval was obtained from the Institutional Review Board at Northwestern University. Female participants who had reached menarche were scheduled to participate in the study during the first 10 days of their menstrual cycle, when hormone levels are most stable (Liening et al., 2010). All participants were instructed to refrain from eating and drinking for 2 h, as well as to refrain from smoking for 4 h prior to the lab visit. Participants were also instructed not to floss and not to eat dairy or anything containing live cultures the day of their visit.

Data from hair and saliva samples, dominance questionnaires, cognitive tasks, and pubertal development are included in the present study. A simplified timeline and version of the full in-person lab protocol, including a list of all measures administered, can be found on the OSF page for this project (https://osf.io/9n8gf/). Participants reported to the lab and completed informed consent (parents) and assent (adolescents). Hair and saliva samples were collected from youth within the first 1.5 h of the visit, followed by cognitive tasks and computer-based questionnaires (including measures of dominance motivation, social potency, and pubertal status). Youth participants later completed a leadership task near the end of the 3.5-h visit (see Footnote 1). Parental participation consisted of approximately 45 min of computer-based questionnaires. Youth and parents who did not finish questionnaires in the lab were sent a link to complete them online.

Hormone Samples

Standard collection, processing, and assaying procedures were used for salivary hormones (Schultheiss & Stanton, 2009). Passive drool saliva samples were collected between noon and sundown to account for diurnal variation in hormone levels (Kirschbaum & Hellhammer, 1994; Liening et al., 2010). After completing assent, adolescent participants were asked to rinse their mouths with water, then drink four ounces of water. Participants provided two baseline saliva samples: T1 was completed 30 min after their arrival to the lab and after completing sedentary questionnaires, and T2 was completed approximately 45 min thereafter. Participants passively drooled into a 2-mL IBL vial through a sanitary straw. Saliva samples were kept at room temperature until participants took a short break or until the study visit was complete (approximately 2 to 3 h), then frozen at -80 degrees Celsius. Samples were shipped on dry ice to the Technical University of Dresden, where they were centrifuged for 5 min. Hormone concentrations were measured using chemiluminescence-immunoassays (IBL International, Hamburg, Germany). The intra- and inter-assay coefficients of variation were < 9% and < 12%, respectively, for both hormones.

Hair samples, 3 mm in diameter and at minimum 3 cm in length, were collected from participants’ scalps, inferior to the cranial bones. Hair samples were stored at room temperature and were shipped to the Technical University of Dresden where T and C concentrations were assessed from the point of the hair closest to the scalp. Samples were washed with isopropanol and were incubated in methanol for 18 h. Samples were then centrifuged at 10,000 rpm for 2 min. Hormone concentrations of these hair samples were measured using LC–MS/MS. Hair extracts with known hormone concentrations were included with every batch of LC–MS/MS analysis to ensure reliable and valid hormone quantification. Full wash and assay procedures are detailed in Gao et al. (2013).

“Leader–Follower” Leadership Task

Participants were taken into a room with the examiner and a confederate (“follower”) who was seated in front of blocks from the Wechsler Adult Intelligence Scale block design task (Wechsler, 2008; see Footnote 2). A video camera facing the participant was put in plain view. Participants were explicitly informed that they would take part in a leadership task where they would be judged on their leadership ability in providing directions to the follower on building block designs. Participants stood directly behind the follower and were given the first design by the examiner. They were then instructed to prompt the follower to start a timer, after which they were to provide verbal directions on how the follower was to construct the blocks in front of them to match the picture on the card (which was only observable to the adolescent participant; the follower did not see the picture of the design). Once the block construction was complete, the participant prompted the follower to stop the timer, and the examiner recorded the time and the design for accuracy. Participants completed a total of four designs: two 2 × 2 designs (4 blocks) and two 3 × 3 designs (9 blocks; see Footnote 3).

Measures

Descriptive statistics for all variables are listed in Table 1. Descriptive statistics by participant gender are in Supplementary Table 2.

Observer Ratings of Adolescent Dominance

Participants’ dominance behavior during the leadership task was rated by the examiner and the confederate “follower.” The dominance scale (Mehta & Josephs, 2010, Study 1) consists of 19 items: engaged, bored (reversed), leader-like, energetic, confident, shy/timid (reversed), gave clear instructions, comfortable, assertive, directive, indecisive (reversed), dominant, comfortable giving instructions, nervous (reversed), stumbled over words (reversed), masculine, anxious (reversed), strong posture, and hesitant (reversed). Participants were rated on all items using a 7-point Likert scale ranging from (1) Does not describe him/her at all to (7) Describes him/her very well. Cronbach’s alpha for the observer-rated dominance scale was 0.96.
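As an illustration of how a scale with reversed items is typically scored, the following minimal Python sketch reverse-scores a negatively keyed item and computes Cronbach's alpha. The function names and the toy ratings are hypothetical and are not taken from the study materials; the text does not describe the scoring code that was actually used.

```python
from statistics import variance


def reverse_score(rating, low=1, high=7):
    """Flip a Likert rating so that higher values always mean more dominance."""
    return low + high - rating


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-participant item-score lists.

    Items are assumed to be reverse-scored already where needed.
    """
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy ratings (hypothetical): three participants rated on two items,
# "confident" and "nervous (reversed)".
raw = [[6, 2], [4, 4], [2, 6]]
scored = [[confident, reverse_score(nervous)] for confident, nervous in raw]
alpha = cronbach_alpha(scored)
```

In this toy example the two items agree perfectly after reverse-scoring, so alpha reaches its maximum; real rating data would yield lower values.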

Dominance Motivation

Dominance motivation was measured using both self- and informant-report versions of the Achievement Motivation Scale (AMS; Cassidy & Lynn, 1989). The AMS is a 49-item scale designed to measure different motivations for achievement. Participants and parents responded dichotomously (Yes or No) on the 7-item Dominance scale (e.g., I think I would enjoy having authority over people or I think they would enjoy having authority over other people). Cronbach’s alpha for the Dominance scale was 0.71 for youth report and 0.77 for parent report.

Social Potency

Social potency was measured using the 18-item Social Potency subscale of the Multidimensional Personality Questionnaire Brief Form (MPQ-BF; Patrick et al., 2002; Tellegen, 1982). The MPQ-BF is a 155-item questionnaire designed to measure personality from a psychobiological framework. Adolescent participants responded to the Social Potency scale (e.g., I am quite good at convincing others to see things my way) on a 4-point Likert scale ranging from (1) Definitely true to (4) Definitely false. Cronbach’s alpha for the youth-report Social Potency subscale was 0.81.

Covariates

Cognitive Performance.

To maintain consistency with Mehta and Josephs (2010) Study 1, cognitive performance was used as a covariate in regression analyses, assessed via the following metrics.

Leadership Task Time.

When the outcome variable was observer-rated dominance, cognitive performance was operationalized as the amount of time, in seconds, it took for participants to complete the leadership task. Each participant completed 4 designs, and the third and fourth designs were expected to take longer to complete than the first and second. We created z-scores for a participant’s time on each design; the participant’s average z-score across the four designs was used as the time-based cognitive performance score.
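The time-based score described above can be sketched as follows. Standardizing within each design puts the shorter 2 × 2 and longer 3 × 3 designs on a common scale before averaging. This is a minimal sketch: the function name and the example times are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev


def time_based_score(times):
    """Average per-design z-score for each participant.

    times: one row per participant, one column per design (seconds).
    """
    n_designs = len(times[0])
    columns = [[row[d] for row in times] for d in range(n_designs)]
    # Mean and sample SD of completion time for each design (assumed n - 1).
    params = [(mean(col), stdev(col)) for col in columns]
    return [
        mean((row[d] - m) / s for d, (m, s) in enumerate(params))
        for row in times
    ]


# Hypothetical completion times for three participants on four designs.
times = [[10, 20, 40, 80], [20, 40, 80, 160], [30, 60, 120, 240]]
scores = time_based_score(times)
```

Higher scores indicate slower completion relative to the sample, regardless of how long each design takes in absolute terms.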

Executive Function Tasks.

Because our data collection protocol also included decontextualized tasks indexing cognitive performance, we used scores on executive function tasks as cognitive performance covariates for survey-based outcome variables. We also used these tasks as cognitive performance covariates for the observer-rated dominance outcome variable as a robustness check. Executive function tasks assessed working memory maintenance (Digit Span Forward; Lahey et al., 2004), working memory updating (Digit Span Backward; Lahey et al., 2004), and set-shifting (Trail Making Test Part B; Reitan et al., 1992). The Digit Span Forward and Digit Span Backward scores were computed by summing the number of correctly recalled digit sequences. Performance on the Trail Making Test was operationalized as the amount of time, in seconds, it took for participants to complete Part B.

Testing Time of Day.

Salivary cortisol levels vary diurnally (Kirschbaum & Hellhammer, 1994; Liening et al., 2010), so time of saliva sample collection (i.e., the midpoint between T1 and T2 collection times) was used as a covariate in regression analyses using salivary hormones, consistent with Mehta and Josephs (2010) Study 1. However, others suggest that cortisol assessment should be anchored on waking time, rather than sample collection time, as diurnal rhythms are more affected by sleep–wake cycles than by light–dark cycles (Adam & Kumari, 2009). In the present study, there were small negative correlations between time since waking (computed as the number of minutes between waking and sample collection) and salivary C (T1 r = -0.02, 95% CI [-0.14, 0.10]; T2 r = -0.04, 95% CI [-0.16, 0.08]) and T (T1 r = -0.06, 95% CI [-0.18, 0.06]; T2 r = -0.07, 95% CI [-0.19, 0.05]). Thus, we conducted additional robustness analyses using wake time as a covariate rather than sample collection time.
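For readers who wish to check intervals of this kind, the sketch below computes time since waking and a confidence interval for a Pearson correlation using the Fisher z transformation. The text does not state how the reported intervals were obtained, so the Fisher approach is an assumption, as is the sample size of 271 (the number of participants with salivary data; the number with wake-time data is not reported). Function names are hypothetical.

```python
from datetime import datetime
from math import atanh, sqrt, tanh
from statistics import NormalDist


def minutes_since_waking(wake, sample):
    """Minutes elapsed between waking and saliva sample collection."""
    return (sample - wake).total_seconds() / 60


def correlation_ci(r, n, level=0.95):
    """Confidence interval for a Pearson r via the Fisher z transformation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit / sqrt(n - 3)
    z = atanh(r)
    return tanh(z - half_width), tanh(z + half_width)


# Hypothetical clock times for one participant.
elapsed = minutes_since_waking(
    datetime(2020, 1, 1, 7, 30), datetime(2020, 1, 1, 13, 0)
)

# Assumed n = 271; r taken from the T1 cortisol correlation reported above.
low, high = correlation_ci(-0.02, 271)
```

Under these assumptions the interval rounds to [-0.14, 0.10], in line with the value reported for T1 salivary C.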

Pubertal Development.

Pubertal development was rated using the Pubertal Development Scale (PDS; Petersen et al., 1988), a self-report measure designed to assess pubertal status in male and female adolescents. Males and females rated their growth in height, appearance of body hair, and changes in their skin using a 4-point scale, ranging from (1) Has not yet started to (4) Seems completed. Male participants also rated their growth of facial hair and their voice deepening, while female participants rated their breast development and the onset of menstruation. Onset of menstruation was measured using the responses Yes or No. Dimensional composite PDS scores were used in analyses (female M = 3.16, male M = 2.81); most female participants (n = 158) were in the advanced pubertal or post-pubertal stages, whereas most male participants (n = 119) were in the mid- or advanced pubertal stages of development.

Data Analysis

Some participants were missing either saliva (n = 66) or hair (n = 118) samples due to refusal or insufficient material (i.e., very short hair) for assays. The present sample reflects all participants who had either hair (n = 219) or salivary (n = 271) hormone data (total N = 337). The number of participants with data for each variable used in the present study can be found in Table 1.

The lower sensitivity limits for salivary C and T are 0.11 nmol/L and 1.8 pg/ml, respectively. In the present study, no samples fell below the limit for salivary C, whereas 9 samples fell below the limit for salivary T. Excluding participants whose salivary T values were below the lower limit of detectability did not substantially affect results (see Supplementary Table S6), so we retained the salivary T levels provided for these 9 samples. The lower sensitivity limit for hair T and C is 0.1 pg/mg; samples below this threshold were given values of 0.1 pg/mg for hair T (n = 45) or C (n = 2).

For both salivary and hair hormones, we winsorized extreme C and T values (i.e., replaced with the highest value within 3 standard deviations of the sample mean). For T only, we winsorized values separately for males and females to account for sex differences in T levels. After winsorization, we averaged salivary T and C values obtained at T1 and T2. Pearson rs were 0.89 and 0.69 for associations between T1 and T2 salivary C and T, respectively. As expected, T and C values were skewed, so we used log-transformed values in analyses. Analyses were also performed with non-log-transformed T and C values as a robustness check.
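The preprocessing steps in this paragraph can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: values below the mean are treated symmetrically (replaced with the lowest value within 3 standard deviations), the sample standard deviation is used, the logarithm is natural, and missing values are not handled. All function names and example values are hypothetical.

```python
from math import log
from statistics import mean, stdev


def winsorize_3sd(values):
    """Replace values beyond mean ± 3 SD with the nearest value inside that range."""
    m, s = mean(values), stdev(values)
    inside = [v for v in values if m - 3 * s <= v <= m + 3 * s]
    floor, cap = min(inside), max(inside)
    return [min(max(v, floor), cap) for v in values]


def winsorize_by_group(values, groups):
    """Winsorize separately within each group (e.g., within sex for T)."""
    out = list(values)
    for g in set(groups):
        idx = [i for i, label in enumerate(groups) if label == g]
        for i, v in zip(idx, winsorize_3sd([values[i] for i in idx])):
            out[i] = v
    return out


def log_mean(sample_1, sample_2):
    """Average two (already winsorized) samples per participant, then log-transform."""
    return [log((a + b) / 2) for a, b in zip(sample_1, sample_2)]


# Hypothetical concentrations with one extreme value at the end.
demo = [1.0, 2.0, 3.0, 4.0, 5.0] * 4 + [60.0]
cleaned = winsorize_3sd(demo)
```

In the example, the extreme value of 60.0 is replaced with 5.0, the highest observed value within 3 standard deviations of the mean, and all other values are unchanged.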

Hierarchical linear regression models were estimated to examine primary study aims, including main effects and interaction effects. Eight hierarchical regression models were estimated to examine evidence for the DHH using a) salivary and b) hair hormones (T and C) as predictors, predicting dominance as measured by a) performance on the leadership task, b) parent-reported AMS Dominance, c) self-reported AMS Dominance, and d) self-reported MPQ Social Potency. Dominance was entered as the dependent variable for all regression models. Covariates (time of day for salivary hormone analyses and cognitive performance for all analyses) were entered in Step 1. Gender, T, and C were entered in Step 2; the T × C interaction term was entered in Step 3; finally, the T × gender, C × gender, and T × C × gender interaction terms were entered in Step 4.
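The four-step structure can be illustrated with a dependency-free Python sketch that enters predictor blocks cumulatively and records R² and ΔR² at each step. This is not the authors' analysis code: the data are simulated, the helper names are hypothetical, a single covariate stands in for the full covariate set, and the hormone variables are assumed to be already transformed and standardized. In practice a statistics package would normally be used in place of the hand-written least-squares solver.

```python
import random


def solve(a, b):
    """Solve the square system a·x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def r_squared(predictors, y):
    """R² of an ordinary least-squares fit of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, cols)) for i in range(n)]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def hierarchical_r2(steps, y):
    """Enter predictor blocks cumulatively; return R² and ΔR² for each step."""
    entered, r2 = [], []
    for block in steps:
        entered = entered + block
        r2.append(r_squared(entered, y))
    delta = [r2[0]] + [later - earlier for earlier, later in zip(r2, r2[1:])]
    return r2, delta


# Simulated data only (not the study data).
rng = random.Random(1)
n = 200
time_of_day = [rng.gauss(0, 1) for _ in range(n)]
gender = [rng.choice([0.0, 1.0]) for _ in range(n)]
t = [rng.gauss(0, 1) for _ in range(n)]
c = [rng.gauss(0, 1) for _ in range(n)]
dominance = [0.2 * ti - 0.3 * ti * ci + rng.gauss(0, 1) for ti, ci in zip(t, c)]

tc = [ti * ci for ti, ci in zip(t, c)]
steps = [
    [time_of_day],                                   # Step 1: covariates
    [gender, t, c],                                  # Step 2: main effects
    [tc],                                            # Step 3: T × C
    [[ti * g for ti, g in zip(t, gender)],           # Step 4: T × gender,
     [ci * g for ci, g in zip(c, gender)],           #         C × gender,
     [x * g for x, g in zip(tc, gender)]],           #         T × C × gender
]
r2, delta_r2 = hierarchical_r2(steps, dominance)
```

The third element of `delta_r2` corresponds to the ΔR² attributable to the T × C interaction, and the fourth to the block of gender interaction terms.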

Results

Deviations from the Pre-Registration

Analyses were conducted as pre-registered (https://osf.io/9n8gf/), with a few exceptions. First, we conducted a number of validity checks on our hormone data (e.g., examining whether T values were associated with gender, age, and pubertal status). Second, in an effort to more fully clarify associations between hormone outcomes and relevant covariates (age, gender, and pubertal status), we conducted a set of preliminary analyses in which hormonal outcomes were regressed on these covariates. Third, because hormonal concentrations change dramatically across pubertal development (Shirtcliff et al., 2009), we conducted an additional robustness check by controlling for pubertal status in regression models. Fourth, we conducted analyses excluding a subset of participants whose salivary T values were below the lower limit of detectability. Fifth, to maintain consistency with other DHH studies, we analyzed regression models separately for males and females. Output for all non-pre-registered analyses can be found on the OSF page for this project (https://osf.io/9n8gf/) and is explicitly labeled as such.

Preliminary Analyses

Hormonal outcomes were first regressed on relevant covariates: Age, Gender, Gender × Age, and pubertal development (Supplementary Table 3). In initial models, Age² and Age² × Gender effects were estimated as nonsignificant for all outcomes and were subsequently dropped to produce more interpretable main effects. Similarly, the Gender × Age effect was estimated as nonsignificant for both hair outcomes and for salivary C and was dropped from those models. Results indicated that salivary C and both salivary and hair T concentrations were significantly higher in male participants, whereas hair C concentrations did not differ by gender. For salivary T, the gender difference significantly increased with age. Salivary T levels were also positively associated with age and pubertal development, whereas salivary C and hair T and C levels were not significantly associated with age or pubertal development. Given robust gender differences in testosterone levels, we standardized hair and salivary T within sex so that high levels indicate high T relative to other same-sex adolescents.
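Within-sex standardization of this kind can be sketched as below. The function name and example values are hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def standardize_within(values, groups):
    """Convert values to z-scores computed separately within each group."""
    out = [0.0] * len(values)
    for g in set(groups):
        idx = [i for i, label in enumerate(groups) if label == g]
        group_values = [values[i] for i in idx]
        m, s = mean(group_values), stdev(group_values)
        for i in idx:
            out[i] = (values[i] - m) / s
    return out


# Hypothetical T values: the second group has much higher raw levels.
t_raw = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
sex = ["f", "f", "f", "m", "m", "m"]
t_z = standardize_within(t_raw, sex)
```

After standardization both groups have the same z-scores, so a high value reflects high T relative to same-sex peers rather than the overall sex difference in raw levels.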

Correlations between all variables are presented in Supplementary Table 4a and are presented separately by participant gender in Supplementary Table 4b. Consistent with previous research (Mehta et al., 2008; Mehta & Josephs, 2010, Study 1), salivary T and C levels were positively correlated (r = 0.34, p < 0.001). Hair T and C levels were also positively correlated (r = 0.28, p < 0.001). Salivary and hair C were uncorrelated (r = -0.03, p = 0.68), whereas salivary and hair T were negatively correlated (r = -0.18, p = 0.02). Dominance variables were positively correlated, with associations ranging from r = 0.17 (p = 0.01) between observer-rated dominance and parent-reported AMS Dominance to r = 0.62 (p < 0.001) between youth-reported AMS Dominance and youth-reported MPQ Social Potency.

Dual-Hormone Hypothesis

As pre-registered, we conducted eight hierarchical regression analyses to evaluate evidence for the DHH. A negative T × C interaction term indicates that the association between T and dominance becomes more positive as C decreases, which is the pattern predicted by the DHH. Here, we present only T and C main effects, T × C interaction effects, and T × C × gender interaction effects. Model results can be found in Table 2, and models run separately by participant gender can be found in Supplementary Table 5.
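To make the sign of the interaction term concrete: in a model of the form dominance = b0 + b_T·T + b_C·C + b_TC·T·C, the slope of T at a given level of C is b_T + b_TC·C. The short sketch below evaluates that slope at low, mean, and high C. The coefficients are hypothetical values chosen only to illustrate the DHH pattern and are not estimates from this study; C is assumed to be standardized.

```python
def simple_slope(b_t, b_tc, c_value):
    """Slope of T on the outcome at a given value of (standardized) C."""
    return b_t + b_tc * c_value


# Hypothetical coefficients: small positive T effect, negative T × C term.
b_t, b_tc = 0.10, -0.20
slopes = {
    "low C (-1 SD)": simple_slope(b_t, b_tc, -1.0),
    "mean C": simple_slope(b_t, b_tc, 0.0),
    "high C (+1 SD)": simple_slope(b_t, b_tc, 1.0),
}
```

With a negative interaction coefficient, the slope of T is largest at low C and shrinks (here, turning negative) at high C. A positive interaction coefficient would reverse that ordering.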

Table 2 Hierarchical Regression Results Predicting Dominance

Observer-Rated Dominance

For salivary hormones, there was a small main effect of T on observer-rated dominance, such that adolescents with lower levels of T were rated as more dominant, on average (β = -0.13, 95% CI [-0.23, -0.03], p = 0.02). Main effects of C (β = 0.07, 95% CI [-0.10, 0.24], p = 0.24) and gender (β < 0.01, 95% CI [-0.19, 0.19], p > 0.99) on observer-rated dominance were nonsignificant. The T × C interaction was nonsignificant (β = -0.12, 95% CI [-0.28, 0.05], p = 0.55, ΔR² < 0.01; Fig. 1a). The T × C × gender interaction was also nonsignificant (β = 0.52, 95% CI [0.19, 0.85], p = 0.07, ΔR² = 0.02). However, in models run separately by participant gender, the direction of the T × C interaction was negative in males (β = -0.42, 95% CI [-0.70, -0.14], p = 0.17) and positive in females (β = 0.28, 95% CI [0.08, 0.49], p = 0.27).

Fig. 1 Salivary testosterone and salivary cortisol do not jointly relate to adolescent dominance across four measures. Intercepts and slopes from multiple regression models were used to plot the relationship between salivary testosterone and adolescent dominance at mean, mean + 1 standard deviation, and mean – 1 standard deviation salivary cortisol levels.

For hair hormones, main effects of T (β = 0.07, 95% CI [-0.04, 0.18], p = 0.27), C (β = -0.03, 95% CI [-0.11, 0.06], p = 0.65), and gender (β = 0.02, 95% CI [-0.22, 0.26], p = 0.74) on observer-rated dominance were nonsignificant. The T × C interaction was nonsignificant (β = -0.15, 95% CI [-0.23, -0.07], p = 0.14, ΔR² < 0.01; Fig. 2a). The T × C × gender interaction was also nonsignificant (β = -0.27, 95% CI [-0.46, -0.08], p = 0.20, ΔR² = 0.01), though the T × C interaction was positive in males (β = 0.07, 95% CI [-0.11, 0.25], p = 0.69) and negative in females (β = -0.22, 95% CI [-0.32, -0.12], p = 0.10).

Fig. 2 Hair testosterone and hair cortisol do not jointly relate to adolescent dominance across four measures. Intercepts and slopes from multiple regression models were used to plot the relationship between hair testosterone and adolescent dominance at mean, mean + 1 standard deviation, and mean – 1 standard deviation hair cortisol levels.

Parent-Reported AMS Dominance

For salivary hormones, main effects of T (β = -0.05, 95% CI [-0.09, -0.005], p = 0.54), C (β = 0.02, 95% CI [-0.05, 0.09], p = 0.79), and gender (β = 0.06, 95% CI [-0.02, 0.14], p = 0.38) on parent-reported dominance were nonsignificant. The T × C interaction was nonsignificant (β = 0.03, 95% CI [-0.03, 0.10], p = 0.89, ΔR² < 0.01; Fig. 1b) and the direction of this effect was equivalent in males and females. The T × C × gender interaction was also nonsignificant (β = 0.04, 95% CI [-0.09, 0.18], p = 0.90, ΔR² < 0.01).

For hair hormones, main effects of T (β = 0.03, 95% CI [-0.01, 0.08], p = 0.66), C (β = 0.11, 95% CI [0.08, 0.15], p = 0.15), and gender (β = 0.07, 95% CI [-0.03, 0.17], p = 0.37) on parent-reported dominance were also nonsignificant. The T × C interaction was nonsignificant (β < 0.01, 95% CI [-0.03, 0.03], p > 0.99, ΔR² < 0.01; Fig. 2b). The T × C × gender interaction was also nonsignificant (β = -0.27, 95% CI [-0.35, -0.20], p = 0.23, ΔR² = 0.02), though the T × C interaction was positive in males (β = 0.21, 95% CI [0.14, 0.28], p = 0.36) and negative in females (β = -0.07, 95% CI [-0.11, -0.02], p = 0.69).

Youth-Reported AMS Dominance

For salivary hormones, main effects of T (ß = 0.01, 95% CI [-0.03, 0.05], p = 0.88), C (ß = -0.05, 95% CI [-0.11, 0.01], p = 0.45), and gender (ß = -0.01, 95% CI [-0.08, 0.06], p = 0.93) on youth-reported dominance were nonsignificant. The T × C interaction was nonsignificant (ß = 0.29, 95% CI [0.23, 0.36], p = 0.20, ΔR2 = 0.01; Fig. 1c) and the direction of this effect was equivalent in males and females. The T × C × gender interaction was also nonsignificant (ß = 0.03, 95% CI [-0.09, 0.15], p = 0.93, ΔR2 = 0.02).

For hair hormones, main effects of T (ß = 0.08, 95% CI [0.04, 0.12], p = 0.29), C (ß = -0.11, 95% CI [-0.14, -0.07], p = 0.16), and gender (ß = 0.09, 95% CI [< 0.01, 0.17], p = 0.22) on youth-reported dominance were also nonsignificant. The T × C interaction was nonsignificant (ß = 0.18, 95% CI [0.14, 0.21], p = 0.15, ΔR2 = 0.01; Fig. 2c) and the direction of this effect was equivalent in males and females. The T × C × gender interaction was also nonsignificant (ß = -0.31, 95% CI [-0.38, -0.25], p = 0.17, ΔR2 = 0.03).

Youth-Reported MPQ Social Potency

For salivary hormones, there was a small main effect of T on youth-reported social potency, such that adolescents with higher levels of salivary T were higher in social potency, on average (ß = 0.17, 95% CI [0.10, 0.24], p = 0.04). There was also a small main effect of C, such that adolescents with lower levels of salivary C were higher in social potency, on average (ß = -0.21, 95% CI [-0.33, -0.09], p = 0.01). The main effect of gender on social potency was nonsignificant (ß = -0.02, 95% CI [-0.16, 0.11], p = 0.73). The T × C interaction was nonsignificant (ß = 0.44, 95% CI [0.33, 0.55], p = 0.10, ΔR2 = 0.01; Fig. 1d) and the direction of this effect was equivalent in males and females. The T × C × gender interaction was also nonsignificant (ß = -0.14, 95% CI [-0.38, 0.09], p = 0.75, ΔR2 = 0.04).

For hair hormones, main effects of T (ß = 0.10, 95% CI [0.03, 0.17], p = 0.23), C (ß = 0.02, 95% CI [-0.04, 0.07], p = 0.82) and gender (ß = 0.10, 95% CI [-0.06, 0.25], p = 0.24) on youth-reported social potency were nonsignificant. The T × C interaction was nonsignificant (ß = 0.25, 95% CI [0.19, 0.31], p = 0.10, ΔR2 = 0.02; Fig. 2d) and the direction of this effect was equivalent in males and females. The T × C × gender interaction was also nonsignificant (ß = -0.01, 95% CI [-0.14, 0.11], p = 0.97, ΔR2 = 0.05).

Robustness Analyses

Robustness checks were conducted by examining analyses in the following conditions: 1) removing all covariates; 2) using non-log-transformed C and T values; 3) excluding participants whose hormone secretion levels may be inflated or decreased due to dairy consumption prior to their visit (n = 64), use of hormonal birth control (n = 11), or use of diabetes medication (n = 1); 4) excluding participants whose salivary T and C values differed substantially from T1 to T2 collection times (i.e., > 3 standard deviations from the average T1 to T2 difference; n = 5); 5) use of wake time rather than saliva sample time as a covariate (for salivary hormone models only); 6) inclusion of pubertal status as an additional covariate; 7) use of executive function tasks rather than leadership task response time as a cognitive performance covariate in models where observer-rated dominance was the dependent variable; and 8) excluding participants whose salivary T values were below the lower limit of detectability (n = 9). Results of these robustness analyses can be found in Supplementary Table 6.
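
As an illustration of the exclusion rule in condition 4, the following sketch flags participants whose change between collection times lies more than 3 standard deviations from the average change. It assumes the sample standard deviation and a single hormone at a time; the function name `flag_unstable` and the data are hypothetical, and the study's exact implementation may differ.

```python
from statistics import mean, stdev


def flag_unstable(t1, t2, n_sd=3.0):
    """Return indices whose T1-to-T2 change is more than n_sd SDs from the mean change."""
    diffs = [second - first for first, second in zip(t1, t2)]
    avg, sd = mean(diffs), stdev(diffs)  # sample SD (n - 1 denominator), an assumption
    return [i for i, d in enumerate(diffs) if abs(d - avg) > n_sd * sd]


# Hypothetical hormone values for 20 participants; participant 7 changes sharply.
t1_values = [10.0] * 20
t2_values = [10.1, 9.9] * 10
t2_values[7] = 15.0

print(flag_unstable(t1_values, t2_values))
```

Note that with very small samples no observation can exceed 3 sample standard deviations from the mean, so a rule of this kind is only informative when the number of participants is reasonably large.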

Overall, the results of our main analyses presented above remained consistent across these robustness tests. In just a few cases, robustness analyses indicated variability in effect sizes that sometimes resulted in a change in statistical significance of effects. For example, the negative main effect of salivary T on observer-rated dominance was no longer statistically significant when covariates were removed, when using non-log-transformed T and C values, and when executive function tasks were used as a cognitive performance covariate rather than response time on the leadership task. The positive main effect of salivary T on MPQ Social Potency was no longer statistically significant when covariates were removed, when excluding participants whose T or C values changed greatly between collection times, when pubertal status was included as a covariate, and when removing participants who had consumed dairy on the morning of their visit or who were taking hormonal birth control or diabetes medication. The negative main effect of salivary C on MPQ Social Potency was no longer statistically significant when using non-log-transformed T and C values.

In three cases, statistical significance emerged in robustness analyses in the context of initial null results. First, a significant salivary T × C × gender interaction emerged for observer-rated dominance when non-log-transformed T and C values were used. Second, there was a positive main effect of hair C on parent-reported dominance when non-log-transformed T and C values were used. Third, a significant salivary T × C interaction emerged for youth-reported dominance when non-log-transformed T and C values were used. However, we interpret these statistically significant effects as likely spurious because they were only present under highly selective conditions. Further, given the number of tests conducted, there was a high likelihood of observing several spurious positive results. Indeed, all statistically significant effects which emerged in the context of initial null results were reduced to nonsignificance when the Holm (1979) method was used to correct p-values for multiple comparisons within each hypothesis family (e.g., all tests examining salivary hormones and observer-rated dominance; Supplementary Table 6).
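
For readers unfamiliar with the Holm (1979) procedure, a minimal sketch of the step-down adjustment within one hypothesis family follows. The function name `holm_adjust` and the p-values are hypothetical; the sketch shows the general method, not the analysis code used in this study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical family of four tests (illustrative values only).
family = [0.04, 0.01, 0.30, 0.03]
print(holm_adjust(family))
```

In this made-up family, the adjusted values are approximately 0.09, 0.04, 0.30, and 0.09, so two of the three results that were below .05 before correction no longer are.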

Discussion

Extending previous research on the DHH, we evaluated whether T and C jointly relate to adolescent dominance using task-based, parent-report, and youth self-report measures of adolescent dominance and two measures of hormones (saliva and hair). Evidence for main effects of either T or C on dominance was largely null. The only outcome variable where salivary T and C main effects emerged in typically expected directions was for Social Potency, a subscale of the well-validated MPQ. These single main effects emerging in the expected directions were small but statistically significant, consistent with recent meta-analytic findings (Dekkers et al., 2019). The positive T main effect was variable across robustness analyses, whereas the negative C main effect remained statistically significant across robustness analyses, except when non-log-transformed T and C values were used.

Across methods and hormone measures, evidence in support of the DHH was weak. Point estimates of T × C interactions were suggestive of a small DHH-consistent effect for observer-rated dominance for both salivary and hair hormones, consistent with the interactions reported in Mehta and Josephs (2010) Study 1 and meta-analytic findings (Dekkers et al., 2019). However, point estimates of the interaction effects were near zero for parent-reported dominance and positive for youth-reported dominance and social potency, highlighting weak support for the DHH across dominance measures. Finally, whereas three-way T × C × gender interactions across methods and hormone measures were nonsignificant, gender-specific analyses indicated the direction of the T × C interaction was not always equivalent in males and females. This lack of consistent support for the DHH – both for the task-based dominance measure and the three questionnaire measures – may be interpreted in a broader context, which we discuss below.

One interpretation of these findings is that development may play a critical role in hormone-behavior relationships. The current study was conducted with an adolescent sample, and the largely null results suggest the possibility that effects consistent with the DHH may not emerge until after hormone variability associated with puberty has stabilized. Most previous studies of the DHH have been carried out in young adult samples. Though Dekkers and colleagues (2019) did not find evidence that age moderated the effect in their meta-analysis, this may have been due to the restricted age range of the included studies (i.e., only 3 of the 33 studies were conducted with a predominantly adolescent sample). Importantly, within our adolescent sample, the results of the present study did not change when pubertal status was included as a covariate. In addition, two early studies of the DHH involved adolescent boys and found evidence for an effect, but it is important to note that these studies were carried out in justice-involved samples: Dabbs and colleagues (1991) examined a sample of late-adolescent offenders, and Popma et al. (2007) examined boys referred to a delinquency diversion program. A more recent mixed-gender study of externalizing behaviors in a community sample of adolescents, however, found evidence in support of the DHH for aggression (Grotzinger et al., 2018). Thus, the fact that the current study finds no evidence of an effect may be explained by the less clinically severe nature of the community sample, divergent operationalizations of status-seeking (i.e., aggression or violence in the aforementioned studies versus trait and situational dominance in the present study), or a combination of these factors.

Another interpretation of these findings implicates the variable nature of potential moderators across studies, including the clinical status of the population of interest. Indeed, research on hormone-behavior relationships has increasingly indicated that variability in hormone-behavior relationships across studies may be driven by influential moderators (Geniole et al., 2019; Knight et al., 2020). Viewing the DHH literature in its entirety, a cluster of potential moderators emerges (gender, age, clinical status, disinhibitory personality traits; see Knight et al., 2020). These proposed moderators may serve to illuminate meaningful evidence for the DHH by effectively stratifying broader samples in such a way that amplifies the signal of the effect. For example, given that women manifest lower T levels than men, on average, the fact that evidence for the effect is more common in male samples may be an illustration of the extent to which restricted range of variance in female samples serves to suppress the effect overall, making it more difficult to detect. Consistent with this interpretation, meta-analytic evidence suggests that the T × C interaction effect is statistically significant in men, but not women, though effect sizes for men and women did not significantly differ (Dekkers et al., 2019). Similar patterns of undetected effects might be expected in community (vs. clinical) samples, children and adolescents (vs. adults), and high self-control personality traits (vs. low) – essentially, any time we have reason to expect that restricted range of variance in hormones may make it more difficult to detect an effect, even when it does exist. Not all potential moderators should automatically be dismissed as suspect or evidence of less-than-rigorous practices (e.g., cherry-picking to produce statistically significant findings); however, researchers need to denote theoretically and methodologically justifiable moderators by carefully pre-registering and transparently reporting all analyses, covariates, and hypothesized moderators.

A final interpretation of the present findings is that prior results showing a T × C interaction in predicting dominance may have been inflated due to the presence of methodological and reporting issues. This interpretation is in line with the findings of Dekkers and colleagues (2019) that some studies in the DHH literature have relied on insufficient sample sizes and likely employed flexible analytic practices. Further, publication bias and selective reporting of measures may have led to the suppression of null findings for T × C interactions in predicting dominance in the literature (Franco et al., 2014; Simmons et al., 2011). As was illustrated through the results of our robustness analyses, a significant T × C interaction may be found when a high number of tests are applied across multiple measures of the target construct, but this effect may not generalize across the majority of tests. In the presence of problematic scientific practices and reporting standards, it is virtually impossible to formulate a certainty or confidence level around the positive empirical support the DHH has garnered in the literature.

Limitations and Future Directions

As discussed extensively by Prasad, Lassetter, and colleagues (2019), several factors may impact hormone measurement validity, including matrix interference (e.g., variance in freeze–thaw cycles and centrifugation and freezer storage periods) and the performance of antibodies used to isolate the substrate being analyzed, which can vary across immunoassay kits. These factors, along with use of different methods to measure hair and salivary hormones (Welker et al., 2016) and saliva collection at only two time points on a single day (Zhang et al., 2018), could affect precision in hormone measurement. Measurement error reduces statistical power even in the context of adequate sample size, and it is common to find largely nonsignificant effects in underpowered studies (Blake & Gangestad, 2020). This raises the possibility that the mostly nonsignificant findings in the present study may be a result of Type II errors secondary to limitations in hormone measurement. Measurement error also contributes to increased Type I error rates, however, particularly in the presence of publication bias. Moreover, while a strength of our study is its in-depth pre-registration, we caution researchers against viewing pre-registration as a panacea. The goal of pre-registrations is to guard against “over-confidence in post hoc explanations” (Nosek et al., 2018, p. 2600) and to increase transparency in the reporting of analyses, ultimately promoting scientific reproducibility.
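
To make concrete why measurement error is especially costly for interaction tests, the sketch below applies two standard psychometric approximations: the reliability of a product of two mean-centered, roughly bivariate-normal predictors, and the classical attenuation of a correlation by unreliability. The function names and the reliability values are hypothetical and are not estimates from the present data.

```python
from math import sqrt


def product_reliability(rel_t, rel_c, r_tc=0.0):
    """Approximate reliability of the T x C product term.

    Assumes mean-centered, roughly bivariate-normal predictors with
    reliabilities rel_t and rel_c and an observed correlation r_tc.
    """
    return (rel_t * rel_c + r_tc ** 2) / (1 + r_tc ** 2)


def attenuated_r(true_r, rel_predictor, rel_outcome=1.0):
    """Classical attenuation: expected observed correlation given unreliability."""
    return true_r * sqrt(rel_predictor * rel_outcome)


# Hypothetical reliabilities of 0.80 for each hormone measure, uncorrelated predictors.
rel_tc = product_reliability(0.80, 0.80)
print(round(rel_tc, 2))                      # reliability of the interaction term
print(round(attenuated_r(0.20, rel_tc), 2))  # observed effect if the true effect were 0.20
```

Under these assumptions, two hormone measures that each look acceptable in isolation yield an interaction term with a reliability of only about 0.64, so a true standardized effect of 0.20 would be expected to appear as roughly 0.16 before any unreliability in the outcome is considered.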

In the context of these limitations, we offer three suggestions. First, we encourage readers to be critical consumers of scientific products and to view science as a systematic process of uncertainty reduction. As such, we view our project as one of many relevant contributions to the dual-hormone literature. Second, we recommend that pre-registrations using hormonal data include plans for several hormone measurement reliability and validity checks, including inter- and intra-assay coefficients of variability, checks for sex differences or diurnal fluctuations in hormone levels if these are expected, and (if possible) use of multiple methods to measure at least a subset of hormone samples. Many useful checks are outlined in existing papers (e.g., Schultheiss & Stanton, 2009). We recommend that hormone validity checks be conducted in blinded datasets (i.e., datasets that do not include outcome variables) to reduce the potential for bias.
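
As one example of the reliability checks recommended above, intra- and inter-assay coefficients of variability can be computed as in the sketch below. It assumes that each sample was assayed in duplicate and that one control value is available per plate; the function names and concentrations are hypothetical, and laboratories differ in exactly how these summaries are computed.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation (sample SD divided by the mean), as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def intra_assay_cv(replicates):
    """Average CV across samples, each given as a tuple of replicate measurements."""
    return mean([cv_percent(sample) for sample in replicates])


def inter_assay_cv(control_by_plate):
    """CV of the same control sample measured once on each plate."""
    return cv_percent(control_by_plate)


# Hypothetical concentrations (arbitrary units), for illustration only.
duplicates = [(10.0, 11.0), (20.0, 19.0), (5.0, 5.5)]
controls = [100.0, 104.0, 96.0]

print(f"intra-assay CV: {intra_assay_cv(duplicates):.1f}%")
print(f"inter-assay CV: {inter_assay_cv(controls):.1f}%")
```

Pre-registering the thresholds at which such values would trigger re-assay or exclusion, and computing them in a dataset without outcome variables, keeps these decisions independent of the hypothesis tests.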

Third, future research using more rigorous methods and practices, including increased statistical power and precision, larger sample sizes, pre-registration of hypotheses, measures, and analyses, and improved measurement, is still needed. All forms of pre-registration control for analytic flexibility and reporting biases, but Registered Reports may also be used to significantly diminish publication bias for null effects (Nosek et al., 2018). Registered Reports have certain advantages for research on hormone-behavior relationships because they provide the opportunity for reviewers to address methodological concerns prior to data collection. Whereas the present study used hormonal and dominance measures selected from a much larger study, there are some benefits to designing a study with narrower aims in mind (e.g., testing for hormone effects in a restricted set of dominance-relevant outcomes rather than selecting outcomes from a large set of measures), and these are well-suited for Registered Reports.

An ideal future study of the DHH would be a very large Registered Report that employs multiple methods of measuring 1) hormones and 2) outcome variables such as dominance that are pre-specified prior to data collection. For example, comprehensive assessment of hormones may include measuring C and T in saliva, blood serum, and/or urine via multiple methods (e.g., immunoassays and LC–MS/MS; El-Farhan et al., 2017). This kind of study would be positioned to test the boundaries around dual-hormone effects, and to rigorously investigate theoretically informed potential moderators. Further, measurement error is an often-overlooked hindrance to statistical power (Williams & Zimmerman, 1989), and not all measures of hormones (e.g., immunoassays vs. mass spectrometry; Welker et al., 2016) nor outcomes (e.g., Flake et al., 2017) are created equal. Rigorous construct validation results in tests that are more reliable, are purer representations of the latent construct of interest, and therefore produce more signal than those that are created “on-the-fly” or on the basis of face validity alone (Cronbach & Meehl, 1955; Grahek et al., in press).

Conclusions

In a community sample of adolescents, the present study used a multi-method approach to evaluate the DHH in a series of pre-registered hypotheses and analyses. Results were suggestive of inconsistent support for the DHH across task-based and parent- and self-report questionnaire measures of dominance and across salivary and hair hormones. These results are generally consistent with a recent meta-analysis indicating only marginal support for the DHH (Dekkers et al., 2019). These null effects may indicate that empirical support for the DHH is a result of spurious or cherry-picked findings. However, it is also possible that hormone concentrations in typical community samples are too low to pick up on meaningful variance in hormone effects, but that the dual-hormone effect may appear in some population subgroups or under the influence of specific moderators. We urge researchers to continue to interrogate the DHH in the context of theoretically strong moderators and with an eye toward more precise delineation of those boundary conditions under which the DHH might emerge in human samples. Finally, these results underscore the importance of using open science practices when investigating hormone-behavior relationships. Practices such as pre-registration allow for maximum transparency in manipulation of hormone variables, measures chosen, analytical approaches, and robustness checks, and ultimately offer the only solution to discovery of true DHH effects.