Introduction

The body of research linking steroid hormones, such as testosterone (T) and other androgens, to human sexuality and social contexts is rapidly growing. For example, T has been linked with physiological and self-reported levels of arousal in women (Heiman et al., 2011) and men (Stoleru, Ennaji, Cournot, & Spira, 1993) as well as social and sexual behaviors in both men and women (Edelstein, Chopik, & Kean, 2011; van Anders & Goldey, 2010). Viewing sexual stimuli increases T in men (Redoute et al., 2000; Stoleru et al., 1999) but not women (van Anders, Brotto, Farrell, & Yule, 2009), whereas sexual thoughts have been shown to elicit T increases in women (Goldey & van Anders, 2011). In addition, research and theory indicate that T is positively linked with sexual intimacy and negatively linked with nurturant intimacy in men and women (van Anders, Goldey, & Kuo, 2011).

Given the increasing interest in measuring androgens in human sexuality research, evidence-based guidelines on appropriate methodological considerations are critical for empirically supported research protocols. There are contemporary reviews on best practices for incorporating cortisol and other stress-related hormones into a variety of study designs with humans (e.g., Adam & Kumari, 2009). There have also been two recent reviews that addressed methodological issues for research with estrogens from a physical/biological anthropology approach (Jasienska & Jasienski, 2008; Vitzthum, 2009). In contrast, there are no methodological reviews on T or other androgens, except one foundational article published over two decades ago (Ellison, 1988). Furthermore, there are no reviews on best practices for incorporating T in sexuality studies, which can involve a specific set of sexuality-related confounds and issues. Accordingly, our goal in this article is to provide a current and comprehensive review of methodological issues for human sexuality research with androgens, focusing on T. Though the considerations provided in this review are focused on human sexuality research, they will also be pertinent to more general human biobehavioral research that involves hormones. Given how critical best practices are to successful research, we also provide evidence-based practical suggestions that may be useful for avoiding confounds in sexuality studies with T (see Table 1). This article may thus be most useful to researchers who are considering or have recently begun incorporating T to better address their questions of interest.

Table 1 Methodological issues in sexuality research with testosterone and other androgens in humans, and suggested practices for addressing them

Any methodological review might focus on myriad issues. The scope of this article is intended to be comprehensive rather than exhaustive, and is limited to pre-sampling issues, i.e., methodological concerns that affect study design. Though the majority of methodological research on androgens is conducted with men, studies with T increasingly include women, and this review focuses on both women and men where possible. Given the prevalence and utility of salivary measures in human sexuality research, we focus more on methodological considerations for salivary rather than blood sampling of hormones (see also below: Saliva vs. Blood). Some of the issues we focus on in this review, such as gender/sex and other social location variables, will be relevant to the content of sexuality research as well as its methods. Other topics, like biological rhythms, are less likely to be of interest per se, but are critical to designing methodologically strong studies with hormones and conducting analyses that provide the best chances for detecting weak to moderate effects (the range in which hormone-behavior associations tend to fall). There are a number of potential confounds and topics for research on T that are reviewed in this article; not all are relevant to each researcher or study.

We also want to note that the ease of salivary sampling has meant that incorporating T into sexuality research has become increasingly feasible. However, the theoretical rationales for measuring T, as well as the theoretical foundations for interpreting results with T, have not necessarily changed. Even in cause-effect studies, there may well be mediating variables that merit attention. And, theoretically-oriented research is as valuable in sexual social neuroendocrinology as it is in any field.

Social and Demographic Factors

Gender and Sex

T is markedly higher in men than women with little overlap in distribution and researchers have hypothesized that women are more sensitive to fluctuations in T than men (i.e., in women, smaller differences in T may account for larger changes in behavior or desire compared with men whose T levels differ by the same amount) (Bancroft, 2002, 2009; Sherwin, 1988). According to this hypothesis, a gender/sex difference in sensitivity to T may occur because substantially higher levels of T in males during early development could de-sensitize them to the behavioral effects of T, which may function to offset potentially adverse behavioral effects of high levels of T (Bancroft, 2002, 2009). T is largely studied in men and this focus on males parallels non-human research where females are vastly understudied (Beery & Zucker, 2010), with some exceptions (see Dixson, 2012); accordingly, researchers may not appreciate that T is meaningfully studied in women (Halpern, Udry, & Suchindran, 1997; Hamilton & Meston, 2010; Singh, Vidaurri, Zambarano, & Dabbs, 1999; van Anders, Brotto, Farrell, & Yule, 2009; van Anders et al., 2011; van Anders, Hamilton, Schmidt, & Watson, 2007; Welling et al., 2007). In addition to basic research on T and women’s sexuality, a large clinical literature has examined the effects of T deficiency and administration on women’s sexual function, with nuanced results suggesting that some women are more sensitive to the behavioral effects of T than others (for reviews, see Bancroft, 2002, 2005, 2009). However, research that includes T typically focuses on male participants and/or addresses behaviors that are tied to cultural stereotypes about masculinity, including aggression and high/hypersexuality (Jordan-Young, 2010). 
In women, this has translated into either a deficit approach, where (low) T is generally studied in association with low sexual desire, or an over-focus on female biology, where T is only studied in relation to menopausal status, pregnancy, or menstrual cycles as opposed, for example, to the “general” behavior studied in men. But research on a host of topics beyond this limited range has provided important insights and T is tied to sexuality in nuanced ways that challenge cultural stereotypes; for example, associations between T and sexual desire may be positive, negative, or non-existent depending on context and gender/sex (van Anders, 2012b). Accordingly, researchers are increasingly including women in their research, even for those general questions that go beyond female-specific phenomena (though these issues are important and merit attention). And, researchers have moved beyond the more narrow culturally scripted proscription for T to topics on sexuality other than hyper/high sexuality (e.g., Exton et al., 2001; Goldey & van Anders, 2011; Sagarin, Cutler, Cutler, Lawler-Sagarin, & Matuszewich, 2009; van Anders & Dunn, 2009).

Sexuality differences between men and women related to T are often interpreted as evidence of sex differences, i.e., inborn, evolutionary, or nonplastic, because they are mediated by T and thus “biology.” However, social contexts alter T in humans and non-human species (Gleason, Fuxjager, Oyegbile, & Marler, 2009; van Anders & Watson, 2006c), such that differential gender socializations could lead to ostensible “sex” differences that appear to be mediated by T. Alternatively, different patterns of T or T responsivity could reflect lifetimes of distinct social contexts and learning. Without experimental data, it would be difficult to tell whether hormone-behavior correlations that differ in men and women reflect gender or sex differences (Hines, 2005), so assumptions of biological causes for difference can be unfounded. Accordingly, many researchers use “gender” to avoid terminology that implies biological causation. Still, “gender” is sometimes used not to denote sociocultural construction, but rather as a placeholder for sex, so we use “gender/sex” to reflect that biological versus social causation cannot be clearly disentangled.

Sexual Orientations and Sexual Diversity

Sexual diversity may be incorporated as a major topic of investigation or an important individual difference variable by sexuality researchers, and research that examines links between sexuality and T is often conducted with an eye to questions that relate to sexual minorities. Human research on T and sexual minorities has a difficult past that can affect current research practices. For example, individuals from sexual minorities were pathologized by hormone researchers seeking to discover a T-related etiology or treatment for same-gender/sex fantasies, interests, and behaviors (Oudshoorn, 1994). This legacy has not surprisingly made LGBT (lesbian, gay, bisexual, trans-identified) individuals concerned about participating in hormone studies. In addition, LGB individuals may rightly conclude that current research on T and sexual diversity in humans is still largely concerned with etiologies of non-heterosexuality (van Anders, 2012a), even though the current frame is basic (e.g., how? why?) as opposed to medical (e.g., what went wrong?), and some researchers are LGBT-identified themselves and consider their work to, in part, promote acceptance/tolerance of LGBT individuals and communities.

Further, many researchers still exclude sexual minorities when that research is not explicitly about sexual minorities (e.g., about “general” sexual behaviors or relationships). This may be taken to suggest further to this community that their only epistemological value is related to their sexual minority status (van Anders, 2012a), and that researchers see them as too other to be included among the “regular” sample of everyday participants (i.e., “the” population). In contrast to this “etiology approach,” researchers have begun to incorporate sexual minorities into their research in ways that do not position sexual diversity as a problem to be explained. For example, researchers have studied hormonal influences on sexual motivation in sexual minority women (Diamond & Wallen, 2010; Matteo & Rissman, 1984), or how diverse approaches to relationships, including polyamory, are linked to T (van Anders, Hamilton, & Watson, 2007).

The way researchers position sexual diversity (as a problem to be explained, as a research-worthy sexual variation, or as just one of many possible individual difference variables) may impact the willingness of sexual minorities to participate in research, both at present and in the future, which makes these issues relevant to even the least politically interested scientist. In addition, the inclusion or exclusion of sexual minorities in research has far-reaching impacts because it affects other researchers’ ability to draw conclusions or build on past research. For example, the “etiology focus” on comparing individuals with same- versus other-sex interests means that research about bisexuality (the second most common sexual orientation/identity; Herbenick et al., 2010) is largely absent from the literature and thus absent from scientific sexual knowledge.

Sexual diversity takes different forms between and within cultures and times, such that unitary understandings of sexual minorities cannot be taken for granted. For example, in some cultures, anal sex between men is seen as a part of gay sexuality whereas, in others, only receptive anal sex would qualify. “MSM” (men who have sex with men) or “WSW” (women who have sex with women) are terms used to describe sexual behaviors in contrast to sexual identities, though these terms are now informing identities in some places. Researchers who study links between T and sexuality typically focus on orientation, i.e., some mix of behavior, attraction, and fantasy typically via some iteration of the Kinsey questions of sexual practice (Kinsey, Pomeroy, & Martin, 1948), though some are interested in identity and therefore group individuals by self-report (e.g., heterosexual, queer, bisexual, lesbian, etc.). A large number of researchers simply ask participants to check a box indicating heterosexual, bisexual, or homosexual. All three possibilities (Kinsey questions, open-ended self-report, check boxes) have drawbacks and advantages. Given that sexual behavior, orientation, and identity reflect distinct aspects of an individual’s sexuality, the method researchers use to categorize participants by sexual orientation or identity can affect the conclusions drawn from research.

The Kinsey questions are widely used and participants respond to questions of behavior and/or fantasy (and sometimes attraction) on scales of “0” to “6” that range from exclusively other gender/sex to exclusively same-gender/sex. Scholars have noted that there is no standardization of coding such that an array of practices and cut-offs are used to categorize individuals (Jordan-Young, 2010). Still, many researchers categorize participants who select “0” or “1” as heterosexual and participants who select “5” or “6” as homosexual (the orientation term used in the Kinsey questions) or gay/lesbian (identity terms preferred by the sexual minority community), with people selecting intermediate numbers as bisexual, though research does question collapsing 0’s and 1’s together (Chivers, Bouchard, Timmers, & Haberl, 2012). Some others will categorize participants as heterosexuals and non-heterosexuals, which collapses bisexual and same-gender/sex oriented individuals together. This may be problematic as a large body of research demonstrates that bisexual, heterosexual, and lesbian/gay individuals differ on a number of dimensions in nonparallel ways, i.e., bisexual individuals are not “more” similar to either heterosexual or lesbian/gay individuals across measures (van Anders, 2012a). One disadvantage of the Kinsey questions is the “homosexual” terminology, which is seen as pejorative or outdated by many sexual minority individuals and intrinsically tied to the pathologizing interest of past medical practitioners and some “conversion” therapists (who claim to decrease same-gender sexual interests or at least practices); we sometimes consider using same-gender/sex oriented (which itself has gendered assumption problematics built in). 
Another disadvantage is that the Kinsey questions are rooted in gender, but some individuals are not interested in gender as a factor in sexual attraction (as with person-not-gender/sex), and others are genderqueer or trans-identified or interested in genderqueer/trans individuals (and thus may not have an expressed gender/sex to provide a basis for “same” or “other” gender/sex-attraction). In this case, bisexual and pansexual may appear to be the same (i.e., the midpoint on the scale) even though they imply very different things (i.e., attracted to women and men vs. attracted to people regardless of their gender). Some individuals may also be attracted to masculinity but not femininity in men and women (or the reverse), or may be attracted to masculine women but feminine men (or the reverse); obviously, the Kinsey questions become nonsensical for these individuals. The benefit of the Kinsey questions, however, is that they tap behavior and fantasy rather than identity, which can be important when identity is not the variable of interest. But this can be another drawback when current identity might matter more than life history of behavior (e.g., a woman who lived her life as heterosexual but fell in love and partnered with a woman in later life might have very heterosexual scores on this scale despite currently living and identifying as lesbian). Or, she might just respond to the scale in ways that affirm her current identity, which questions the utility of the scale at all if in practice it is used to reflect current identity anyway.
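The common coding practice described above (0 or 1 as heterosexual, intermediate scores as bisexual, 5 or 6 as same-gender/sex oriented) can be sketched in a few lines. This is only an illustration, not a standardized scheme: the function name, the category labels, and the alternative cut-off table are hypothetical, and the cut-offs are passed as a parameter precisely because no standard coding exists.

```python
from typing import Optional

# Hypothetical cut-offs mirroring the common (but non-standardized) practice
# described in the text: 0-1, 2-4, 5-6. Labels are illustrative only.
DEFAULT_CUTOFFS = {
    "heterosexual": range(0, 2),
    "bisexual": range(2, 5),
    "same-gender/sex oriented": range(5, 7),
}

# Hypothetical alternative that keeps 0s and 1s apart, since collapsing
# them has been questioned.
SEPARATE_ZERO_ONE = {
    "exclusively other-gender/sex": range(0, 1),
    "predominantly other-gender/sex": range(1, 2),
    "bisexual": range(2, 5),
    "same-gender/sex oriented": range(5, 7),
}


def categorize_kinsey(score: Optional[int], cutoffs=DEFAULT_CUTOFFS) -> Optional[str]:
    """Map a 0-6 Kinsey-type response to a category label.

    Missing responses return None so they are not silently forced
    into a category.
    """
    if score is None:
        return None
    if not 0 <= score <= 6:
        raise ValueError(f"Kinsey-type scores run from 0 to 6, got {score}")
    for label, scores in cutoffs.items():
        if score in scores:
            return label
    return None  # score not covered by a custom cut-off scheme


responses = [0, 1, 3, 6, None]
default_labels = [categorize_kinsey(r) for r in responses]
split_labels = [categorize_kinsey(r, SEPARATE_ZERO_ONE) for r in responses]
```

Reporting the cut-off table alongside the results makes the categorization reproducible, whichever scheme is chosen. A sketch like this does not address the conceptual limitations discussed above (e.g., for participants whose attractions are not organized by gender/sex).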

Open-ended self-report is also widely used and is especially useful when identity is important and freely expressed. In contrast, in cultures where identifying as LGB or queer puts one at risk for violence or other repercussions, asking participants to report their sexual orientation or identity may be problematic and only certain individuals might participate (e.g., those who are already culturally stigmatized by public identities, those who are out, or those who make the choice to do so despite the potential cost). For example, changes in who is comfortable about identifying as lesbian or gay in public or in questionnaires may lead to perceptions of changes over time in the content of sexual diversity when, in reality, what has changed may be the visibility of specific parts of the community. In addition, people may identify one way but still have fantasies or behaviors that extend beyond this identity. The benefit of open-ended self-report is that researchers can tap into existing and evolving linguistic norms (e.g., homosexual → gay → lesbian → dyke → queer) that can still be categorized in a post hoc fashion for analyses. That is, even self-report responses can be categorized for quantitative analysis purposes (van Anders & Goldey, 2010). However, providing examples (e.g., heterosexual, gay, lesbian, bisexual, queer) can be helpful, as some individuals are less confident with what “sexual orientation” actually means (and this is especially true for sexual majority individuals who have not been forced to contemplate their sexual identities). In our lab, participants have sometimes written “female,” assuming the question asks whom their sexual orientation is directed toward, and others have written “regular” or “normal,” which can obviously be offensive for sexual minority (or ally) experimenters to have to see repeatedly.

Asking participants to select from a preset list of options allows for quick, easily categorized responses, which can be especially useful for large online studies. However, because the options must be picked a priori, they may not reflect the community under study, and may be viewed as outdated or offensive. In our lab, we have found that only one participant has ever self-identified as “homosexual” and that sexual minority women rarely self-identify as “gay,” yet both these terms appear regularly as pre-selected checklist options. As noted above, homosexual is both outdated and linked to conceptualizing sexual minorities as pathological; gay is a term that many use to refer to same-sex oriented individuals even though communities typically use it to refer to men only and terms like lesbian, dyke, queer (and others) are instead used by women (men also use other terms, including queer). The use of homo/hetero/bisexual or gay/heterosexual distinctions in pre-set checkboxes limits community members’ abilities to self-define, and also imposes identity labels that sexual minorities have fought and still fight to both control and make public.

“Transsexual,” “transgender” or similar terms are sometimes included in a list of sexual orientations, but many trans-identified individuals see their trans status as being a property of gender/sex rather than sexual orientation. However, natal sex may be seen as relevant to sexual orientation and/or sexual identity for some scientists and/or trans-identified individuals. Increasingly, though, many would argue that current gender/sex identification and community associations are what matters for sexual identity rather than natal sex or the junction of natal and transitioned gender/sex. This remains contentious and, likely, a study- or community-specific research and social issue.

Social and Familial Connections and Processes

Social and relational variables influence T levels and can also be linked to T in trait ways. For example, both current sexual activity levels and having ever engaged in sexual activity (i.e., sexually active status) are linked with T. We have found that associations between T and partnering status are mediated by current sexual activity levels in women, such that the lower T in women in long-term relationships relative to single women was explained by long-term partnered women’s more frequent sexual activity (van Anders & Goldey, 2010). And, there is evidence that sexual experience can moderate associations between T and other social variables: Roney, Mahler, and Maestripieri (2003) (cf. Roney, Lukaszewski, & Simmons, 2007) found that only sexually experienced men showed T responses to conversations with women. Paradoxically, desisting from sexual activity for a period of time, i.e., abstinence, is itself linked to higher T as well (Exton et al., 2001). These factors may be meaningful confounds or explanations of other T-sexuality associations (e.g., it may be that anticipation is linked to higher T). Accordingly, some researchers assess sexual experience and frequency. However, a fascinating body of research highlights that “sex” and “sexuality” can be interpreted in different ways (e.g., sex might mean intercourse to some, but any sexual contact to others, including non-genital contact) (Sanders & Reinisch, 1999). Accordingly, many researchers define sexual experience and frequency of sexual activity by specifying what does and what does not count in the definition (e.g., consensual sexual contact with your or a partner’s genitals; any sexual contact, including deep kissing but not friendly “pecks” or backrubs, etc.). The choice of an appropriate definition may depend on the outcome variable of the study; for example, a study on STI prevention may define sexual activity differently than a study on sexual desire.
Given that more frequent masturbation and solitary orgasms are linked with higher T in women (van Anders, Hamilton, Schmidt, et al., 2007) and that T is differentially linked with solitary (positively) versus dyadic (negatively) desire in women (van Anders, 2012b), considering solitary sexual behaviors may be important as well.

In addition to sexual experience, a large body of research demonstrates that T differs by relationship status in women and men (Gray, Chapman et al., 2004; Gray & Campbell, 2009; van Anders, 2009; van Anders & Goldey, 2010; van Anders & Gray, 2007; van Anders & Watson, 2007b). This body of work suggests that monoamorously partnered individuals in romantic/sexual relationships characterized by commitment and nurturance have lower T. In contrast, individuals in multiple relationships, ostensibly monogamous relationships characterized by low commitment or cheating, or relationships characterized by a lack of nurturance and the presence of hostility, appear to have higher T as do single individuals. For women, partner presence seems to be a key variable, as same-city but not long-distance partnered women have lower T than single women. Casual relationships appear to differ by gender/sex, with lower T for women and higher T for men. Querying and recording relationship status can thus be a critical way to interpret potential third variable associations or other statistical issues. However, individuals have more complicated relationship profiles than might be expected and term definitions can be similarly helpful here, especially for research with populations that differ from researchers by age, social location, etc. Relationship characterizations and terminologies can differ by culture and time, but also by generation. For example, “hook-ups” and “friends with benefits” are categories that do not neatly fit into single/committed dichotomies and might be more meaningful to younger versus older generations. Moreover, “single” might be interpreted as meaning free to pursue relationships or having no relationships; specifying the term’s components can avoid misunderstanding (e.g., “single” = currently having no sexual or romantic contacts with anyone, meaning no hook-ups, one-night stands, etc.). 
As such, some researchers ask a variety of questions, including open-ended ones, to try and accurately assess relationship status given the complexities in nuance. Open-ended questions also allow researchers the opportunity to discover terms and relationship approaches they may not have previously encountered. Our lab’s own research on polyamory stemmed from participants informing us that our limited checkboxes (which we previously used) did not fit their relationship approach.

Similar to relationship status, there is a growing body of evidence demonstrating that parents have lower T than non-parents and that T shows a birth-specific drop in parents (Gray & Campbell, 2009; Gray, Parkin, & Samms-Vaughan, 2007; Gray, Yang, & Pope, 2006; Kuzawa, Gettler, Huang, & McDade, 2010; Storey, Walsh, Quinton, & Wynne-Edwards, 2000). Gonadal steroids, including T, increase during pregnancy and are lower during lactation in women (Alder & Bancroft, 1988; Greenspan & Gardner, 2001). T also changes among co-fathers (i.e., fathers who are involved in parental care together with their partners) over pregnancy: T decreases (with perhaps one brief increase) and stays low, then slowly increases as infants age (Gettler, McDade, Feranil, & Kuzawa, 2011; Storey et al., 2000); this is similar to other biparental mammals (Wynne-Edwards, 2001). Accordingly, many researchers query participants about parental status, including age of offspring, and especially about pregnancy/lactation status, because these factors could introduce noise or confound other group differences in T.

Social Location Variables

Although sexuality researchers with training in Women’s Studies or feminist psychology tend to consider intersections among sexuality and other social location variables, such as ethnicity, immigration status, or socioeconomic status (SES) (Blanc, 2005; Froyum, 2010; Phillips et al., 2011), sexuality research conducted from a physiological perspective (e.g., research with T or genital arousal) has devoted less attention to these identity variables. In contrast to research on sexuality and hormones, research on stress and cortisol increasingly takes these identity variables as critical to understanding the interplay between social location/experience and hormonal processes (e.g., DeSantis et al., 2007). Research on T is often unconcerned with these issues, yet characterizing samples is critical to understanding the phenomenon under investigation and to what extent the results generalize beyond the study’s sample. Moreover, cultural variables have been shown to modify associations between gonadal hormones and other social and health variables (Gehlert et al., 2008). Accordingly, a few researchers use open-ended questions to address ethnicity and immigration status and include these in their demographic description. And, some use measures of participant (or parental, in the case of college students) income to crudely characterize SES.

Biological Rhythms

Diurnal Variation

Androgens show diurnal rhythms linked both to sleep patterns and time of day, with a near 50% decrease from morning to evening (Dabbs, 1990b). Levels are highest upon waking and then steeply decline in the first 1–2 h post-waking, followed by a more moderate decline during the waking period until levels are lowest just before sleep, at which point androgens start to increase until their highest point immediately before waking (see Table 2) (Aedo, Nunez, Landgren, Cekan, & Diczfalusy, 1977; Axelsson, Ingre, Akerstedt, & Holmback, 2005; Boyar et al., 1974; Dabbs, 1990b; Piro, Fraioli, Sciarra, & Conti, 1973; van Anders & Hampson, 2005). These diurnal patterns are not necessarily standard, however, as older age (e.g., >65 years) is associated with a flatter decrease over the day (Brambilla, Matsumoto, Araujo, & McKinlay, 2009; Bremner, Vitiello, & Prinz, 1983; Luboshitzky, 2003; Nicolau et al., 1985; Panico et al., 1990; Plymate, Tenover, & Bremner, 1989). Because of the strong and well-known diurnal variation in T, researchers have tended to restrict time of sampling and/or control for sampling time via statistical analyses.
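One way to control for sampling time statistically is to regress T on time since waking and carry the residuals forward into later analyses. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the numbers are made up for demonstration and are not data from any study, and a single linear term is a simplification given that the decline is steeper in the first hours after waking.

```python
def residualize(values, covariate):
    """Return residuals of `values` after a simple linear regression on `covariate`."""
    n = len(values)
    if n != len(covariate) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from sampling time
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Made-up illustrative numbers (arbitrary units), not data from any study.
hours_since_waking = [1.0, 3.0, 5.0, 8.0, 10.0]
salivary_t = [110.0, 95.0, 88.0, 74.0, 70.0]

# Time-adjusted T: what remains after removing the linear effect of sampling time
adjusted_t = residualize(salivary_t, hours_since_waking)
```

In practice, many analysts would instead enter sampling time as a covariate in the main model, which has the same goal; the residual approach is shown here only because it makes the adjustment explicit.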

Table 2 Levels of testosterone over a 1-day period

In addition to these diurnal rhythms, sleep itself is linked to T: sleep duration is positively correlated with levels of T and sleep disruption is associated with altered levels of T (Goh & Tong, 2010; Luboshitzky, 2003). This may have implications for studies on androgens with parents of infants and young children, college students, depressed individuals, and shift workers, among other groups, because of the associated alterations in sleep patterns.

A number of researchers have reported associations between T and behavior that are stronger when T has been collected in the afternoon versus the morning (Berg & Wynne-Edwards, 2001; Gray, Kahlenberg, Barrett, Lipson, & Ellison, 2002; Muller & Wrangham, 2004; van Anders, Hamilton, Schmidt, et al., 2007; Worthman & Konner, 1987). This has not been consistently shown for any one behavior or across a range of behaviors, but the steep declines in T over the morning may add noise or variation that obscures underlying effects that are visible with afternoon sampling, when levels are less variable. The growing preference for afternoon sampling of T stands in stark contrast to cortisol and biomedical research that focus on waking samples or daily slopes (O’Donnell, Badrick, Kumari, & Steptoe, 2008), yet this afternoon sampling approach with T has provided meaningful and consistent results. Many researchers successfully use a single sample to measure T if sampling time is restricted to the afternoon (see also below: Sampling Frequency, Timing, and Context), but researchers interested in T profiles across the day may sample T once in the morning and once in the afternoon or evening (Gettler et al., 2011; Gray et al., 2006).
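Restricting analyses to afternoon samples can be sketched as a simple filter on recorded collection times. The window bounds below (12:00 to 18:00) are an assumption made purely for illustration, since the text specifies only "afternoon"; the participant records and field names are likewise hypothetical.

```python
from datetime import time

# Hypothetical window: the source says only "afternoon", so these bounds
# are an assumption and should be set per protocol.
WINDOW_START = time(12, 0)
WINDOW_END = time(18, 0)


def in_sampling_window(sample_time, start=WINDOW_START, end=WINDOW_END):
    """True if a sample was collected inside the allowed clock-time window."""
    return start <= sample_time <= end


# Made-up records for illustration only.
samples = [
    {"id": "P01", "collected": time(9, 30)},
    {"id": "P02", "collected": time(13, 15)},
    {"id": "P03", "collected": time(17, 45)},
    {"id": "P04", "collected": time(19, 5)},
]

afternoon_samples = [s for s in samples if in_sampling_window(s["collected"])]
excluded_ids = [s["id"] for s in samples if not in_sampling_window(s["collected"])]
```

Keeping the list of excluded records, rather than discarding them silently, makes it straightforward to report how many samples fell outside the window.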

Seasonality

In addition to diurnal rhythms, there is seasonal variation in androgens, with increases of up to twofold, though the majority of evidence stems from cross-sectional rather than longitudinal studies. Moreover, evidence is somewhat variable and generally focused on seasonality in North America and Europe, limiting generalizability. Autumn tends to show the most consistent peaks in androgens in men and women (in men: Dabbs, 1990a; Moffat & Hampson, 2000; Reinberg et al., 1978; Reinberg, Lagoguey, Chauffournier, & Cesselin, 1975; Reinberg, Smolensky, Hallek, Smith, & Steinberger, 1988; Smals, Kloppenborg, & Benraad, 1976; Stanton, Mullette-Gillman, & Huettel, 2011; Svartberg, Jorde, Sundsfjord, Bonaa, & Barrett-Connor, 2003; van Anders, Hampson, & Watson, 2006; in women: Kauppila, Kivela, Pakarinen, & Vakkuri, 1987; Kauppila, Pakarinen, Kirkinen, & Markila, 1987; Stanton et al., 2011; van Anders et al., 2006; Wisniewski & Nelson, 2000). Though other peaks and no peaks have also been identified (see Table 3) (Brambilla, O’Donnell, Matsumoto, & McKinlay, 2007; Garde, Hansen, Skovgaard, & Christensen, 2000; Martikainen, Tapanainen, Vakkuri, Leppaluoto, & Huhtaniemi, 1985; Perry, Miller, Patrick, & Morley, 2000; Valero-Politi & Fuentes-Arderiu, 1998), there are data that are consistent with the autumn peak in male non-human primates, including Japanese macaques and rhesus monkeys (Gordon, Bernstein, & Rose, 1978; Muroyama, Shimizu, & Sugiura, 2007). Considering seasonal variation in androgens may be especially critical for longitudinal studies or when data collection spans a considerable time period, and some researchers have accordingly controlled for either season or day of testing in their analyses, or identified testing season/month in their methods.
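Controlling for season requires first deriving a season variable from each testing date. The sketch below shows one way to do so; it assumes meteorological (calendar-month) seasons and a simple hemisphere flag, both of which are illustrative choices rather than conventions taken from the studies cited above, and the function name is hypothetical.

```python
from datetime import date

# Assumption: meteorological seasons for the Northern Hemisphere,
# defined by calendar month (December-February = winter, and so on).
_NORTHERN_SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Seasons are offset by half a year in the Southern Hemisphere.
_OPPOSITE = {
    "winter": "summer", "summer": "winter",
    "spring": "autumn", "autumn": "spring",
}


def season_of(testing_date: date, southern_hemisphere: bool = False) -> str:
    """Return the season label for a testing date."""
    season = _NORTHERN_SEASONS[testing_date.month]
    return _OPPOSITE[season] if southern_hemisphere else season


# Made-up testing dates for illustration only.
testing_dates = [date(2012, 1, 20), date(2012, 4, 3), date(2012, 10, 15)]
seasons = [season_of(d) for d in testing_dates]
```

The resulting label can be entered as a categorical covariate or simply reported in the methods. Day of testing could be used instead when a finer-grained control is preferred.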

Table 3 Seasonality and androgens in women and men

Researchers have speculated that seasonal variation in androgens results from seasonal variations in nutrition, caloric intake, and work (Jasienska & Ellison, 2004; Vitzthum et al., 2009). It may also be possible that seasonal variation in exposure to light and weather patterns contributes directly to fluctuations in androgens, as it does in some other species (Nelson, Denlinger, & Somers, 2009), and given that most research has focused on populations living in North America or Europe, it is unknown whether seasonal effects on T are limited to populations living at higher latitudes. There is no real body of literature investigating the causes of seasonality in humans, but seasonality can be an important methodological issue to consider in longitudinal studies or when data collection spans multiple seasons.
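As an illustration of how testing date might be turned into a covariate, the following minimal Python sketch derives a season label and a day-of-year value from each participant's testing date. The function names are hypothetical, and the month-to-season mapping assumes Northern Hemisphere meteorological seasons, which would need to be adapted for other populations.

```python
from datetime import date

# Assumption: Northern Hemisphere meteorological seasons (Dec-Feb = winter, etc.).
# Studies at other latitudes would need a different mapping.
_SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def season_of(test_date: date) -> str:
    """Return the season label for a testing date (categorical covariate)."""
    return _SEASON_BY_MONTH[test_date.month]


def day_of_year(test_date: date) -> int:
    """Return the day of the year (1-366) for a testing date (continuous covariate)."""
    return test_date.timetuple().tm_yday
```

Either value could then be reported in the methods or entered as a control variable in analyses.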

Menstrual Cycles

Menstrual cycles are characterized by large fluctuations in estrogens and progesterone, and T shows more moderate variation. T is low during the menstrual phase, but begins a gradual increase that continues over the follicular phase until a peak around ovulation, with a gradual decrease during the luteal phase until onset of menses (see Fig. 1) (Campbell & Ellison, 1992). Similar peaks in T around ovulation have been found in a number of non-human primate species, though experiments with rhesus monkeys suggest that the mid-cycle peak in T has little to no effect on sexual behavior (Dixson, 2012; Michael, Richter, Cain, Zumpe, & Bonsall, 1978). Researchers have addressed whether the magnitude of human menstrual variation in T is large enough such that studies should control for it, and concluded that (1) this is unnecessary unless menstrual variation in T is itself of interest and (2) menstrual variation in T is relatively small compared to other sources of variation like diurnicity or individual differences (Dabbs, 1990b; Dabbs & de La Rue, 1991). Accordingly, many researchers who incorporate T into their (non-phasic) behavioral research do not control for menstrual phase, while others who are specifically interested in menstrual or ovulatory phase do identify and analyze menstrual phase. Still, many researchers assume that menstrual status must be controlled in any hormonal studies with women, an assumption that likely stems from historical definitions of sex hormones that tied cyclicity and instability to women and females (Oudshoorn, 1994), especially in light of empirical evidence (as above) that consistently shows menstrual cycle adds less noise than time of day or other variables that routinely go uncontrolled.

Fig. 1

Relative testosterone levels over a 30 day menstrual cycle, and phase breakdowns using Backwards Counting. An optional “ovulatory phase” could constitute two days around ovulation if necessary. Note: 30 days is closer to average cycle lengths than the traditional 28 day cycle

Menstrual cycles relate to androgens in other ways. The variability in length and regularity of women’s menstrual cycles can reflect differing hormonal contributions. Very long or short cycles can be related to altered T in anovulatory or even healthy women (Campbell & Ellison, 1992; van Anders & Watson, 2006a). Moreover, cycle length and regularity decrease with age (Vitzthum, 2009). There are also population differences in menstrual cycle lengths, regularity, and hormones that are still being explored (Vitzthum, 2009).

Identifying menstrual phase is most accurate with longitudinal monitoring of relative hormone levels (e.g., see Jasienska & Jasienski, 2008). However, researchers have sometimes used other methods to identify menstrual phase given that long-term monitoring is costly, invasive, and time-intensive (see also Vitzthum, 2009). Below, we detail three methods researchers have used as “shorthands” to menstrual phase identification: Forwards Estimation, the Two-Week Method, and Backwards Counting.

Forwards Estimation

Here, women’s phase is identified by counting forward from the first day of the most recent period using a 28-day cycle such that Day 14 is ovulation. Despite the widespread use of Forwards Estimation (near-exclusively by non-hormonal researchers), it is unreliable for a number of reasons. First, it presumes a 28-day average that is known to be incorrect; average cycle length is instead 29.5 or 30 days. Second, it does not take into account the wide variation in cycle length between and within women (Vitzthum, 2009). Third, and perhaps of most concern, there is neither consensus nor standardization of phase breakdown, which has translated into very loose determination of phases with no discernible decision rule.

The Two-Week Method

Researchers interested in quasi-experimental approaches to studying T-behavior links might consider incorporating menstrual phase into their research. Some of these researchers might use menstrual phase as a proxy for different hormone levels rather than because of interest in menstrual phase per se. Given the difficulties of accurately identifying phase, some researchers have instead used the Two-Week Method, in which women are simply tested at two points separated by 2 weeks (van Anders, Chernick, Chernick, Hampson, & Fisher, 2005; Welling et al., 2007). Here, women can be tested close to expected onset of menses or during menses and then 2 weeks following, because T levels should be higher midcycle than earlier or later. Accordingly, this approach is useful when hormone variation, rather than menstrual variation per se, is of interest. Obviously, it is of limited use when menstrual phase is specifically of interest. Moreover, it still requires two testing points separated by weeks where hormones must be sampled, which can be difficult for some research designs.

Backwards Counting

A third method for estimating menstrual phase is Backwards Counting (Harvey, 1987), which, like Forwards Estimation, uses a counting method to identify women’s menstrual phase. Here, researchers calculate the actual length of each woman’s menstrual cycle via reports of the first days of two consecutive menstrual periods, and then count backwards from the onset of the second period to locate the luteal phase. Menstrual phases are thus more reliably estimated than with Forwards Estimation, because the luteal phase is close to 14 days in healthy women and is less variable relative to other phases (Ellison, 2001). The menstrual phase is assigned to those days that contain menstrual bleeding. The follicular phase is situated between the menstrual and luteal phases. Ovulation should occur in between the follicular and luteal phases, so researchers could assign the 2 days around ovulation as an ovulatory phase if this is important, though this is less accurate because of the short window. An advantage of Backwards Counting is that it is relatively noninvasive and inexpensive; though women need to be contacted post-study, this can be done remotely. Figure 1 provides a breakdown of phases by cycle day.
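The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a validated protocol: the function name and arguments are hypothetical, the luteal phase is assumed to be exactly 14 days, and the number of bleeding days is assumed to come from participant report.

```python
from datetime import date

LUTEAL_DAYS = 14  # assumption: luteal phase fixed at 14 days ("close to 14 days" in the text)


def backwards_counting_phase(test_day, onset, next_onset, bleeding_days,
                             ovulatory_window=False):
    """Estimate menstrual phase on test_day from two consecutive menses onsets.

    onset / next_onset: first days of two consecutive menstrual periods.
    bleeding_days: participant-reported number of days of bleeding.
    ovulatory_window: if True, label the 2 days around estimated ovulation.
    """
    cycle_length = (next_onset - onset).days
    if not onset <= test_day < next_onset:
        raise ValueError("test_day must fall within the reported cycle")
    if bleeding_days + LUTEAL_DAYS >= cycle_length:
        raise ValueError("cycle too short to assign a follicular phase")

    cycle_day = (test_day - onset).days + 1        # Day 1 = first day of bleeding
    luteal_start = cycle_length - LUTEAL_DAYS + 1  # counted back from next onset

    if cycle_day <= bleeding_days:
        return "menstrual"
    # Ovulation is assumed to fall between the last follicular and first luteal day.
    if ovulatory_window and cycle_day in (luteal_start - 1, luteal_start):
        return "ovulatory"
    if cycle_day >= luteal_start:
        return "luteal"
    return "follicular"
```

For a 30-day cycle with 5 days of bleeding, this sketch assigns Days 1 to 5 to the menstrual phase, Days 6 to 16 to the follicular phase, and Days 17 to 30 to the luteal phase, with Days 16 and 17 relabeled as ovulatory only when the optional window is requested.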

Aging and Menopause

Older ages are associated with lower T, but the cross-sectional nature of the majority of this research makes it difficult to definitively conclude whether these changes are indeed age-related or instead are due to cohort effects (i.e., variation due to birth period) (cf. Feldman et al., 2002; Morley, Perry, Patrick, Dollbaum, & Kells, 2006). Evidence shows lower T with older ages in women (Zumoff, Strain, Miller, & Rosner, 1995) and men (Burger, Dudley, Cui, Dennerstein, & Hopper, 2000; Ellison et al., 2002; Feldman et al., 2002; Ferrini & Barrett-Connor, 1998; Gray, Berlin, McKinlay, & Longcope, 1991; Morley et al., 1997; Nahoul & Roger, 1990; Uchida et al., 2006). Effects of aging on T in men may be a Western phenomenon related to atypically high T early in adulthood; some studies of non-Western populations show no significant age differences in T (Campbell, Gray, & Ellison, 2006; Ellison et al., 2002; Ellison & Panter-Brick, 1996) though some do (Ellison et al., 2002; Lukas, Campbell, & Ellison, 2004), depending on the specific population studied. Note that there is no research with women on aging and T in non-Western populations, so effects of aging that appear to be culturally-specific for men may also be so for women. Because of the variation in T by age, many researchers control for age in statistical analyses, and it may be that age is of varying importance depending on populations.

Though menopause brings a marked change in levels of many hormones in women, there are no specific menopause-related decreases in T (Burger et al., 2000). Instead, there is a change in relative levels of androgens and estrogens due to the large decline in estradiol that results from menopause (Vermeulen, 1980). Given this change in T and other hormones, many researchers limit their participants to premenopausal women and similarly-aged men, though this introduces serious issues about the generalizability of findings beyond this age range; given that menopause does not introduce large changes in T itself, excluding postmenopausal women may not really have any justification in theory or evidence.

Sample Collection, Handling, and Storage

Saliva Versus Blood

Androgens are most commonly measured via saliva or blood (serum) in humans, and these result in comparable though different measures. Some fraction of circulating T is bound to albumin or sex hormone binding globulin (SHBG), and the portion of interest to behavioral researchers is typically the unbound fraction available to travel throughout the blood and bind to receptors. Serum results in one of two measures: total T (a direct measure) or “free T” (generally an estimate based on the ratio of total T relative to SHBG). Data indicate that some estimates of free T from total T and SHBG are only moderately correlated with actual levels of free T, especially in men, but methods that physically separate the free and bound portions of serum T (e.g., equilibrium dialysis) are rarely used due to their high costs, labor intensiveness, and inadequate sensitivity for measuring free T in women (Ellison, 1988; Kapoor, Luttrell, & Williams, 1993; Morris, Malkin, Channer, & Jones, 2004; Rosner, Auchus, Azziz, Sluss, & Raff, 2007). Salivary T results in only one measure that is referred to as T, salivary T, or bioavailable T. Free T and salivary T are not the same (free T is an estimate whereas salivary T is a direct measure), though they both represent the unbound and potentially weakly-bound portion of T and are thus especially useful to researchers (Quissell, 1993). In addition, salivary measures may not reflect all of T circulating in the blood, though measures are usually highly correlated (Ellison, 1988). Additional androgens that can be measured include dehydroepiandrosterone (DHEA) and its sulphate (DHEAS), which are released from the adrenal gland and thus useful in comparison to other adrenal hormones like cortisol, as well as two that are less commonly measured in behavioral studies: dihydrotestosterone (DHT), which is tied to physical virilization, and androstenedione, a weaker androgen and precursor to testosterone.
Other androgens are uncommon in sexuality research or other behavioral research with humans.

Sampling androgens via blood or saliva brings method-specific advantages and disadvantages (see also Ellison, 1988; Vitzthum, 2009). Advantages of salivary sampling over blood include low or no biohazard implications, low invasiveness, high compliance from participants, ease of collection, storage, and shipment of samples, and ability to postpone freezing samples if needed. A special and important bonus of salivary sampling, particularly relevant to sexuality researchers, is the ability of participants to self-collect and self-store samples, allowing for the study of hormone-sexuality links in private and/or naturalistic settings. In addition, little is known about the effects of venipuncture on subsequent hormone levels or measures relevant to sexuality, though blood sampling (actual or anticipated) seems prima facie to interfere with sexual arousal much more than saliva sampling. Moreover, blood loss itself (i.e., that accompanies blood sampling) is a signal to the body of physical damage or fluid loss (Garrioch, 2004) in a way that spitting is not; the widespread assumption that this emergency signal is nonreactive in terms of research questions is based on faith rather than evidence. The pulsatile fashion of gonadal steroid release may also make saliva a more accurate option, since saliva represents a sort of averaging of hormone release over a short period, thus reflecting both the highs and lows associated with pulsatile release. Furthermore, and countering some questions about whether hydration would matter, salivary flow rate does not affect the measurement of T in saliva (Arregger, Contreras, Tumilasci, Aquilano, & Cardoso, 2007). These advantages make salivary measurement of T easier, more practical, and more amenable to a large array of study designs. However, though salivary assays have been conducted for decades, some biomedical researchers still question the validity of using saliva, with blood being the gold standard.

Saliva T measures have also been validated for research, with studies demonstrating their internal validity (i.e., accuracy, precision, linearity of dilution, sensitivity, and specificity), reliability across time and different laboratories, and external validity (e.g., expected associations with time of day, age, pubertal status, gender/sex, pharmacological manipulations, and clinical conditions) (Dabbs, 1990b; Dabbs et al., 1995; Granger, Schwartz, Booth, & Arentz, 1999; Johnson, Jopling, & Burrin, 1987; Luisi et al., 1980; Walker, Wilson, Read, & Riad-Fahmy, 1980). They are still more controversial than blood as some studies point to nonsignificant or low correlations between salivary and free T in women (Granger, Shirtcliff, Booth, Kivlighan, & Schwartz, 2004; Shirtcliff, Granger, & Likos, 2002; Swinkels, Meulenberg, Ross, & Benraad, 1988), while other studies show good correlations (Khan-Dawood, Choe, & Dawood, 1984; Magrini, Chiodoni, Rey, & Felber, 1986; Swinkels et al., 1988). Some studies also show good correlations between salivary and total T (Granger et al., 2004; Shirtcliff et al., 2002). How problematic are these data for studies with T in women? Salivary T measurements in women may add noise and lead to underestimation of effects; thus, sufficiently large samples of women should overcome these problems, and in practice they do (van Anders, 2010b; van Anders & Dunn, 2009; van Anders & Watson, 2006b). Additionally, serum assays of T in women may be problematic (Taieb et al., 2003), so low correlations between salivary and serum measures of T may be due to issues with accuracy in serum or saliva measures. In men, correlations between salivary and free T are high (Goncharov et al., 2006; Granger et al., 2004; Khan-Dawood et al., 1984; Shirtcliff et al., 2002; Walker et al., 1980; Wang, Plymate, Nieschlag, & Paulsen, 1981).
Researchers should recognize that all hormone research involves an estimation of the hormone level of interest, and exact measures of some true hormone level are never available. This is most simply reflected in the accepted reporting of hormone levels that have been averaged from duplicate or triplicate assays of the same sample, along with intra-assay coefficients of variation; i.e., even the same assay provides slightly different estimates of the same hormone level from the same sample. Accordingly, hormone measurement (like all measurement) always involves compromise, and saliva holds many advantages in terms of validity and research design.
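To make the point about replicate assays concrete, the following minimal Python sketch averages duplicate or triplicate readings of one sample and computes the coefficient of variation (SD divided by mean, expressed as a percentage). The function names are hypothetical, and averaging the per-sample values to obtain a single intra-assay figure is one common convention that we assume here rather than a requirement.

```python
from statistics import mean, stdev


def sample_estimate(replicates):
    """Return (mean level, CV in %) for replicate assays of one sample.

    replicates: two or more readings of the same sample from the same assay.
    """
    level = mean(replicates)
    cv_percent = stdev(replicates) / level * 100
    return level, cv_percent


def intra_assay_cv(all_replicates):
    """Average the per-sample CVs across samples (assumed reporting convention)."""
    return mean([sample_estimate(r)[1] for r in all_replicates])
```

The mean of the replicates is the value carried into analyses, and the averaged coefficient of variation is the figure typically reported alongside it.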

Sialogogues (Saliva Stimulants), Saliva, and Tubes

Many researchers use sialogogues to speed up saliva production. Though there are few empirical studies demonstrating any time benefit, there is one study showing that chewing gum speeds up saliva production by 3–6 min depending on type of gum (van Anders, 2010a). Countering this benefit, sialogogues affect the assay process for T: cotton artificially inflates readings of gonadal steroids and candy also alters results (Lipson & Ellison, 1989; Shirtcliff, Granger, Schwartz, & Curran, 2001). Despite widespread use, chewing gum also affects assays of gonadal steroids, including T (Granger et al., 2004; Lipson & Ellison, 1989; Paton, Lowe, & Irvine, 2010; Shirtcliff et al., 2000; cf. Dabbs, 1991), and this includes even popular choices like Trident sugar-free Original flavor. Research has shown that six variations of gum artificially inflate assay readings of T in women and men by up to 150 %, with larger effects for women (van Anders, 2010a). There is also conflicting evidence of whether time since chewing affects assays, as time spent chewing would potentially reflect the amount of chemicals leached from gum into the sample (Dabbs, 1991; Granger et al., 2004). Accordingly, unless saving 3–6 min is a critical design consideration, the detractions of sialogogues appear to vastly outweigh their benefits.

Collection of saliva samples for T typically involves spitting into tubes. However, saliva can bring impurities in the form of blood, from sores or recent tooth brushing, or particulate from food, tobacco, drink, gum, etc. These impurities are known to affect the quality of assays and therefore the fidelity of results (Granger et al., 2007). Some researchers request that participants avoid introducing matter into their mouths prior to the study by avoiding smoking, eating, drinking non-water fluids, or brushing their teeth. Some researchers ask participants to rinse their mouths with water to remove loose or detachable detritus. Other researchers use assay kits that are designed to test for blood contamination or ask participants to report on oral state or taste of blood (Hamilton, van Anders, Cox, & Watson, 2009), which might be especially helpful for studies with athletes or individuals with oral diseases or hygiene issues where mouth injuries or sores might be common.

Most researchers will want to use some sort of plastic tube rather than glass (the gold standard of inertness) to collect saliva samples because of cost and breakage considerations. Other collection vehicles like salivettes and material-based swabs adversely affect steroid assays (Kozaki, Hashiguchi, Kaji, Yasukouchi, & Tochihara, 2009; Kruger, Breunig, Biskupek-Sigwart, & Dorr, 1996; Shirtcliff et al., 2001); however, not all plastic tubes are acceptably inert. Though some recommend polypropylene tubes (Vitzthum, 2009), an empirical study demonstrated that they were problematic for steroidal assays (Banerjee, Levitz, & Rosenberg, 1985). Polystyrene tubes are more acceptable, and no studies to our knowledge have shown interference. Indeed, T assays from samples collected in glass and polystyrene tubes are highly and significantly correlated at r = .92 (Lipson & Ellison, 1989).

Sampling Frequency, Timing, and Context

T shows high test-retest reliability over days (r = .64) and months (r = .52), suggesting that unitary measures of T can be meaningful for inferring trait levels of T, even though T does fluctuate over time and in response to social contexts (Dabbs & de La Rue, 1991). Indeed, research has indicated that between-person variability in T is much larger than within-person variability (Bain, Langevin, D’Costa, Sanders, & Hucker, 1988; Dabbs, 1991; Dabbs & de La Rue, 1991; Rowe et al., 1974), suggesting that a stronger approach to detecting signals is to include additional participants over additional samples per participant. Indeed, one investigation concluded that one sample was as good as the mean of three samples for estimating trait T, which may be an exaggeration but is nonetheless instructive. A number of researchers do use one sample to ascertain trait levels of T in behavioral research (Carré & Putnam, 2010; Gray, Campbell, Marlowe, Lipson, & Ellison, 2004; Mehta & Josephs, 2006; Roney et al., 2007; Schultheiss et al., 2005; van Anders et al., 2009; van Anders, Hamilton, & Watson, 2007; van Anders & Goldey, 2010).

Though some sexuality studies assess correlations between T and behaviors or attitudes, others investigate changes in T. In these experimental designs, samples are typically taken before and after a manipulation, leading to the following meaningful measures of T: baseline T, which can be used to assess anticipatory changes; change in T, which can be absolute (post minus pre) or percent change (post minus pre, all divided by pre); and stimulated levels, i.e., post-manipulation levels. Some researchers have found that percent changes in T are more sensitive than absolute changes, because the relative measures control for the large variability in baseline/trait T (van Anders, 2010a; van Anders & Watson, 2007a).
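The measures listed above can be written out as a short Python sketch. The function and key names are hypothetical, and percent change is multiplied by 100 here so that it reads as a percentage.

```python
def t_change(pre, post):
    """Return baseline, stimulated, absolute change, and percent change in T.

    pre: baseline (pre-manipulation) T level.
    post: stimulated (post-manipulation) T level, in the same units as pre.
    """
    if pre <= 0:
        raise ValueError("baseline T must be positive to compute percent change")
    absolute = post - pre              # post minus pre
    percent = absolute / pre * 100     # (post minus pre) divided by pre, as a %
    return {
        "baseline": pre,
        "stimulated": post,
        "absolute_change": absolute,
        "percent_change": percent,
    }
```

For example, a participant who moves from 20 to 25 and one who moves from 80 to 100 (same units) differ fourfold in absolute change but both show a 25% increase, which illustrates how the relative measure adjusts for differences in baseline T.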

Despite an increasing body of research on experimentally manipulated T, the time course of T changes (e.g., time to T response, length of T response) is unknown. Nevertheless, researchers have successfully used 15 min as the post-manipulation sampling time point, suggesting that 15 min is at least one time point at which changes are measurable (Mehta & Josephs, 2006; Schultheiss et al., 2005; van Anders, Hamilton, Schmidt, et al., 2007; van Anders & Watson, 2007a). Additionally, researchers have sampled 10 min and 20 min post-manipulation for T levels (e.g., Carré & Putnam, 2010). These timeframes are typically measured from the end of the manipulation to the sample itself, but whether the length of the manipulation itself matters remains an open question. Researchers must also consider whether sampling times for T are the same as or different from the most appropriate times for participants to complete questionnaires about psychological responses (e.g., mood and arousal) in experimental studies. It is unclear at what time points T and psychological responses to sexual stimuli are correlated, if they are correlated at all (e.g., Goldey & van Anders, 2011, 2012), so multiple measures of psychological responses (e.g., immediately post-manipulation and 15-min post-manipulation) may be helpful.

Given the 15 min gap between the manipulation’s end and the second saliva sample in pre/post experimental studies on T, researchers use a variety of activities to fill this otherwise empty space. These filler activities should ideally be neutral, to avoid any additional confounding and contributing to changes in T. Research clearly demonstrates that a wide range of activities do affect T (for reviews, see van Anders & Watson, 2006c; van Anders et al., 2011), including just thinking sexual thoughts (Goldey & van Anders, 2011) or anything competitive (Archer, 2006; Carré & Putnam, 2010; Mehta & Josephs, 2006; van Anders & Watson, 2006c). Thus, filler activities like leafing through fashion magazines, with their sexualized images, responding to sexuality-related questionnaires, or completing cognitive tasks that may result in feelings of success or victory, may be problematic as neutral tasks. Researchers have thus turned to activities like somewhat boring travel videos as neutral/control filler activities (Goldey & van Anders, 2011).

In addition to activities, lab studies necessitate testers and experimenters. Evidence demonstrates that interactions with women increase T in heterosexual men (who were the only group studied) (Roney et al., 2003), which may confound gender/sex difference analyses if all testers are women but participants are both women and men. Similarly, sexual orientation/identity analyses on T may be confounded if participants of only one sexual orientation and gender are tested by the gender/sex they find most (or least) attractive. No researchers, so far as we are aware, counterbalance gender of tester unless this is a specific research question. While doing so would be ideal, it seems difficult to reasonably accomplish, and some researchers instead choose to match the gender of testers and participants because this may increase participant comfort in some sexuality studies (but, of course, may not always: it seems reasonable to conjecture that gay-identified male participants might be more uncomfortable with a heterosexual-identified male tester than a female tester due to high heterosexual-male homophobia). Some researchers therefore identify the gender of tester so that at least this potential confound can become apparent.

Sample Shipping

Some sexuality researchers may wish to recruit participants over a wide geographic area to diversify their sample or to target a specific population that is underrepresented in the researcher’s locale (e.g., polyamorous individuals, LGBT parents). Thus, the ability to have participants ship saliva samples to the researchers via mail is desirable. One previous study found that mailing saliva samples resulted in a small but significant decline in T for men’s samples and a substantial elevation in T for women’s samples (Dabbs, 1991). In addition, mailing also introduced random error into women’s T measurements as evidenced by a relatively low correlation between mailed and frozen samples from the same individual. The reason that mailing affected women’s and men’s T differently is unknown, although Dabbs speculated that the elevation and greater introduction of error in women’s T measurements could be due to women’s already low levels of T. In this study, the samples were mailed unrefrigerated and spent an average of 8 days in transit.

We tested the viability of shipping frozen saliva samples using gel freezer packs by having three volunteers (two women, one man) provide eight pairs of samples (some volunteers provided multiple pairs of samples). Each pair of samples was provided by the same individual at the same time, who immediately froze both samples. Then, one sample of each pair (shipped sample) was delivered on dry ice to a location outside of the university and shipped back to the laboratory frozen with gel freezer packs (see Appendix for detailed packaging and shipping procedures). The other sample of each pair (control sample) remained in our lab freezer for the duration of the study. All shipped samples arrived frozen at the lab within 24–38 h, and the shipped and control samples were assayed for T via radioimmunoassay in the Core Assay Facility at the University of Michigan. Our results indicated that T remained stable during frozen storage and frozen shipping, in contrast to the previous efforts where samples were shipped at room temperature. Shipped and control samples within a pair did not significantly differ from one another, t(7) < 1 (Cohen’s d = 0.08), and were highly correlated, Pearson’s r(6) = 0.90, p = .003; Spearman’s ρ(6) = 0.79, p = .02 (see Fig. 2). This suggests that shipping with freezer packs over a 24–38 h period preserves the integrity of saliva samples and allows researchers and participants to bypass the time requirements of traveling to pick up or drop off samples and the safety and regulatory concerns of using dry ice for studies involving salivary T. However, we certainly cannot guarantee that samples will be shipped in the promised amount of time, and time variability would certainly affect whether samples remained frozen—and, therefore, the integrity of the samples.
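Researchers wishing to run the same kind of check on their own shipped and stored pairs could compute the statistics along the following lines. This is a minimal standard-library Python sketch with hypothetical function names, not the analysis code used for the results above; it assumes Cohen's d is computed as the mean difference divided by the SD of the differences (one of several conventions), and p values would require a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(control, shipped):
    """Return (t, df, d) for paired samples; d = mean diff / SD of diffs (assumed)."""
    diffs = [s - c for c, s in zip(control, shipped)]
    n = len(diffs)
    sd = stdev(diffs)
    t = mean(diffs) / (sd / sqrt(n))
    d = mean(diffs) / sd
    return t, n - 1, d


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

Each function takes two equal-length lists in which position i holds the control and shipped values for the same pair. The paired test requires that the differences are not all identical, since the SD of the differences would otherwise be zero.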

Fig. 2

Correlation between control (stored) and shipped levels of salivary testosterone from the same individual at the same time. Note: The five lowest data points are samples provided by two female volunteers, and the three highest data points are samples provided by a male volunteer

Health, Medical Issues, and the Body

Hormonal Contraceptives

Hormonal contraceptive (HC) use alters gonadal steroid levels markedly, since HCs themselves are synthetic estrogens and progesterone. Though HCs increase circulating estrogens and progesterone levels, they can decrease endogenous production of these hormones due to negative feedback loops. Accordingly, ovarian activity may be downregulated, and HC use appears to decrease T (Bancroft, Sherwin, Alexander, Davidson, & Walker, 1991; van Anders & Watson, 2006a). This effect may not be universal, as one prospective study found that HC use decreased T in some, but not all, women (Graham, Bancroft, Doll, Greco, & Tanner, 2007; Greco, Graham, Bancroft, Tanner, & Doll, 2007). Though oral contraceptives (the Pill) are the most well-known HC, there are a variety of others, including hormonal intrauterine devices (IUDs), injections, patches, etc. There is variation between HCs but also within HC type (e.g., pills can contain differing degrees of hormones on different regimens). In addition to affecting trait levels of T, there is mixed evidence of whether HCs moderate T responsiveness. HC users and naturally cycling women have been shown to exhibit similar T responses to both sexual and athletic activity (Edwards & O’Neal, 2009; van Anders, Hamilton, Schmidt, et al., 2007). Based on some of these data, Josephs (2009) suggested that HC use not be an exclusionary criterion and instead be analyzed as a source of variance. However, more recent research (Goldey & van Anders, 2011) suggests that HC use can lead to opposing T responses. Accordingly, most researchers continue to exclude HC users from research on T responsivity, or at the very least check for HC moderation of effects. However, researchers do exclude HC users from studies of trait T. 
In addition to hormonal effects, there is some evidence that HCs have variable effects on sexual interest, with similar proportions of women reporting increases, decreases, or no change in sexual interest (Graham et al., 2007), which might be relevant to some sexuality studies.

Medications and Nicotine

Both nicotine and a variety of prescription or non-prescription medications can affect T. Nicotine has strong effects on T, with nicotine users having higher T than non-users (Ponholzer et al., 2005), such that many researchers either exclude nicotine users or statistically control for its use. Anabolic–androgenic steroids, treatments for polycystic ovary syndrome, and steroids (or their antagonists) used by individuals who have transitioned sex all affect circulating T levels and endogenous T production. Some medications, including those that do not affect T, may have sexual side effects. We have found that recruiting for “healthy” individuals does not lead to self-screening, as up to 27 % of our participants recruited with inclusion criteria that specified “healthy” reported using some medication. Accordingly, merely recruiting for healthy participants is not an effective strategy for studies that require individuals with unaffected levels of T. However, it would be difficult to pre-identify all possible contraindicated medications or to ask participants to complete a lengthy checklist of all medications. Some researchers use the strategy of asking participants to report medication usage and name of substance and then examining potential effects on T through drug databases. However, as the use of medications and “lifestyle” drugs continues to grow in Western nations, the inclusion of individuals using some substance may become unavoidable.

Health Conditions and Stress

As noted above, the adrenal gland releases androgens (DHEA, DHEAS) that are precursors to T. Accordingly, stress and stress-related psychological health conditions may affect T levels through adrenal activation, and these include very common conditions like anxiety, depression, eating disorders, and chronic stress (Burke, Davis, Otte, & Mohr, 2005; Hellhammer, Wust, & Kudielka, 2009). Moreover, because adrenal androgens make up a larger proportion of androgen levels in women relative to men, it is possible that women with high stress may demonstrate elevated androgen levels even while ovarian hormone output is suppressed (Cruess et al., 2001; Weimann, 2002). Men may show a different response to chronic stress, i.e., decreased T, because chronic adrenal activation suppresses testicular output of hormones and the adrenals contribute only a low proportion of T (Aakvaag et al., 1978; Opstad & Aakvaag, 1982; Rose et al., 1969). As with medication use, the widespread prevalence of these conditions in Western nations makes the exclusion of people diagnosed with relevant conditions increasingly difficult.

In addition to stress and psychological conditions, there are physical conditions that can affect T. One of these is polycystic ovary syndrome (PCOS), which leads to higher levels of androgens in women (DeVane, Czekala, Judd, & Yen, 1975). Another is any condition that has required medical attention to the ovaries or testes (e.g., cysts). In addition, illnesses may lead to altered gonadal output, including of T, as demonstrated in non-human species (e.g., Besedovsky & del Rey, 1996). Evidence from influenza vaccinations suggests that immune challenges decrease T (Simmons & Roney, 2009). Some researchers query and record the presence of psychological and physical health conditions in their samples, and either include these data in the sample description or use these data to exclude participants from analyses if the conditions are known to interfere with T.

Body Composition, Weight, and Exercise

Gonadal steroids can be synthesized from hormone precursors in fatty tissues and fat content can, therefore, affect steroid hormone levels (Deslypere, Verdonck, & Vermeulen, 1985; Nimrod & Ryan, 1974). T is negatively correlated with both weight and fat deposition in men (Fejes et al., 2006; van den Beld, de Jong, Grobbee, Pols, & Lamberts, 2000), but positively correlated with each in women (Leenen, van der Kooy, Seidell, Deurenberg, & Koppeschaar, 1994; Lukanova et al., 2004). The negative links between T and both weight and fat in men may occur because peripheral conversion of steroids in fatty tissue increases aromatization of T to E. The positive links between T and both weight and fat in women are likely attributable to lower levels of SHBG in women with higher fat content, and perhaps also to increased peripheral synthesis of T in fatty tissues (Leenen et al., 1994; Lukanova et al., 2004). Given links between weight and T, many researchers measure height and weight to compute body mass index (BMI), and assess its utility as a statistical control. Exercise itself affects T in ways that differ depending on activity and intensity level. Many studies find increased T post-exercise in women and men (Copeland, Consitt, & Tremblay, 2002; Kraemer et al., 1999), but very high intensity exercise (e.g., intense long-term running) can decrease T (Kuoppasalmi, Naveri, Harkonen, & Adlercreutz, 1980). Some researchers therefore control for exercise frequency and/or intensity given its mixed associations with T.
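
As a minimal illustration of the BMI step described above, the following Python sketch computes BMI from measured weight and height using the standard formula (weight in kilograms divided by the square of height in meters) and then removes the linear association between BMI and T by simple least-squares residualization. The function names, and the choice of residualization as the form of statistical control, are our assumptions for illustration only; researchers may equally enter BMI as a covariate in their main model.

```python
from statistics import mean


def body_mass_index(weight_kg, height_m):
    """Standard BMI: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def residualize(outcome, covariate):
    """Return residuals of `outcome` after simple linear regression on `covariate`.

    Hypothetical helper: one of several ways to 'control for' a covariate.
    """
    if len(outcome) != len(covariate) or len(outcome) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = mean(covariate), mean(outcome)
    sum_sq_x = sum((x - mean_x) ** 2 for x in covariate)
    if sum_sq_x == 0:
        raise ValueError("covariate has no variance")
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, outcome)
    ) / sum_sq_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, outcome)]
```

In use, one would compute `body_mass_index` for each participant, pass the resulting list together with the T values to `residualize`, and carry the residuals forward only if BMI proves to be a useful control in that sample.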

Incorporating Multiple Hormones

We have focused our review on T due to widespread interest in T among sexuality researchers; however, sexuality researchers sometimes measure other hormones such as cortisol (C), estradiol (E), and progesterone (P), often in combination with androgens (e.g., Heiman et al., 2011; van Anders et al., 2009). Unpublished data from our lab show that changes in T, E, and C in response to erotic stimuli in women are moderately to strongly correlated. C is sometimes included in sexuality research when interactions between sexuality and stress or anxiety are of interest (e.g., Hamilton & Meston, 2011; van Anders, 2012b). The near-exclusive focus on E and P as markers of female reproductive cycles has limited the range of research topics investigated in relation to these hormones. Nonetheless, researchers have demonstrated that E responds to visual erotic stimuli in women, and that E is linked with perceptions of solitary orgasms in women but not men (van Anders & Dunn, 2009; van Anders et al., 2009). In addition, researchers have found that sexual identity moderates links between E and same-sex sexual desires, such that around the time of ovulation when E levels are at their highest, women who consistently identify as lesbian have more motivation to act on their attractions to the same sex as compared to women who identify as bisexual or who have changed their identity labels at some point (Diamond & Wallen, 2010). Sexuality research on P is mainly limited to studies examining menstrual variation in women’s sexuality (e.g., Rupp et al., 2009), and research with P in men is scarce, despite an important role for P in male sexual behavior in non-human species (Wagner, 2006). However, one study demonstrated that non-sexual affiliative stimuli increased P in women and men (Schultheiss, Wirth, & Stanton, 2004).

Generally, the methodological issues reviewed above in relation to T are similarly important to consider when measuring other steroids such as C, E, and P. Adam and Kumari (2009) have provided a thorough review of methodological issues to consider when sampling C. However, research is mixed for E and scarce for P on some issues, specifically diurnality and seasonality (Bao et al., 2004; Bjornerem, Straume, Oian, & Berntsen, 2006; Brambilla et al., 2007; Goji, 1993). When sampling E or P in women, researchers will likely need to control for menstrual phase in some way given the large changes in E and P over the menstrual cycle (Nelson, 2005), or perhaps incorporate very large numbers of women. Sampling multiple hormones may require researchers to make some adjustments to sample collection schedules; for example, researchers studying C may be interested in the awakening response or daily slopes (O’Donnell et al., 2008), and one study showed that multiple samples per menstrual cycle over several cycles are optimal for measuring P in women, though the number of samples required may depend on the population (Jasienska & Jasienski, 2008). However, single samples for trait levels and one set of pre-post samples for state responses to experimental manipulations have also yielded meaningful results with C and E in sexuality studies (Goldey & van Anders, 2012; van Anders & Dunn, 2009; van Anders et al., 2009).
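
For the pre-post designs mentioned above, one common way to express a state response is as a percent change from the baseline sample. The short Python sketch below shows that arithmetic; the function name and the choice of percent change (rather than, for example, a raw difference score or a regression-based adjustment) are our assumptions for illustration and are not drawn from the studies cited.

```python
def percent_change(pre, post):
    """Percent change from a baseline (pre) to a follow-up (post) hormone value.

    Both values must be in the same units; the baseline must be positive.
    """
    if pre <= 0:
        raise ValueError("baseline value must be positive")
    return (post - pre) / pre * 100.0


def percent_changes(pre_values, post_values):
    """Apply percent_change to paired lists of pre and post samples."""
    if len(pre_values) != len(post_values):
        raise ValueError("pre and post lists must be the same length")
    return [percent_change(a, b) for a, b in zip(pre_values, post_values)]
```

Because percent change scales each response by its own baseline, it can reduce the influence of large individual differences in trait levels, though it is also more sensitive to measurement error when baseline values are low.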

Conclusion

Clearly, there are a large number of methodological issues to consider in sexuality research design with T in humans. Some of these issues become more or less relevant depending on the question under investigation, and accordingly this review does not suggest or recommend that researchers attend to every possible confound and issue. Instead, this review has provided context for a variety of confounds along with some notions of how researchers currently address them methodologically, focusing especially on pragmatic and resource-efficient methods. There is a large host of issues this review did not cover, including post-collection methodological concerns. Furthermore, we did not address the relative effect sizes of the various methodological issues on T, and an examination of these issues via meta-analysis remains an area for future research (although a potentially challenging one given that few studies report effect sizes). Still, this review attempted to fill the gap in methodological reviews on T for human sexuality research and for human behavioral research in general, given that, in contrast to other hormones, no contemporary examples exist. Addressing relevant issues may only entail the addition of a few short questions or very minor adjustments to research design. The benefit of doing so is an increase in methodological rigor that should result in lower variation and increased ability to detect effects. Additional benefits of addressing potential confounds include the possibility of identifying under-researched or unknown effects or groups, and also engendering positive regard among participants. Attending to issues in design will help novices engage in evidence-based best practices for research with T, and strengthen the field of sexuality research with T in humans.