Introduction

A major goal in evolutionary biology is to understand the extent to which various biological features are products of adaptation (in response to Darwinian selection), or of other evolutionary processes such as drift (random changes) or developmental constraint. Simplified, the aim is to distinguish functional adaptations from nonfunctional byproducts (or ‘spandrels’; Andrews et al. 2002; Gould and Lewontin 1979).

One way this aim has been pursued is by comparing within-species variability of features. This approach is based on the premise that functional structures are less variable than nonfunctional structures; it is thought that low variation in a feature within a population indicates that variation in the feature has been winnowed away by strong selection (implying function), whereas high variation indicates that the feature has been under weak or no selection (implying no function). This idea is often attributed to Fisher’s fundamental theorem of natural selection (Fisher 1930), although the theorem’s original formulation is quite different (“The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.” p. 35), and its meaning and applicability to real populations have been controversial (Frank and Slatkin 1992; Okasha 2008; Lessard 1997; Price 1972). Nevertheless, researchers have used differences in variation in the physical size of features to infer their functionality or lack thereof.

One such example concerns the female orgasm, which could have developed as a byproduct of the clearly functional male orgasm (Gould 1987; Lloyd 2005; Symons 1979; Wallen and Lloyd 2008), or could alternatively serve its own adaptive function (e.g. in mate choice; Puts et al. 2012a, b). Within this debate, the greater length variability of the human clitoris versus penis has been taken as evidence that the human female orgasm is a functionless developmental byproduct of the functional male orgasm (Wallen and Lloyd 2008). Note, however, that the authors demonstrated no connection between men’s capacity to orgasm (and thus potentially impregnate a mate) and phallus size. Instead, it was simply presumed that functional features necessarily have less variance in physical size than non-functional features. In a related example, lower variability in human penis girth versus length has been used as evidence that a selective pressure of female choice is stronger on penis girth than on its length (Apostolou 2015).

However, such inferences are only valid if the premise is consistently true: that is, functional/strongly selected features are less variable in size than nonfunctional/weakly selected features. This assumption is falsifiable, given a consensually agreed upon feature that is functional in one sex, but not the other. Human nipples are such a feature. Gould, who first coined the term ‘spandrel’ (Gould and Lewontin 1979), has cited men’s nipples as a clear example of a functionless developmental byproduct of the strong selective pressure on women to have (functional) nipples (Gould 1987). That men’s nipples are spandrels is agreed upon by those who cite lack of variability as evidence of trait functionality (e.g. Wallen and Lloyd 2008). Not only are male nipples a good example of a spandrel, they are to our knowledge the only example of an uncontested spandrel of which a functional version of the feature is carried by members of the same species. This provides a rare test case for relative size variability of functional versus nonfunctional features, since factors other than functionality (e.g. type of trait, dimensionality of trait, species in which it is observed) are constant.

Thus, in the present paper, we compare the coefficient of variation (a measure of variability scaled to the mean) of men’s and women’s nipples, with and without controlling for potentially correlated body measures (bust size for women, chest size for men). If the degree of between-individual variation in the size of a feature can be used to infer the feature’s functionality and selection history, then human female nipples should be less variable than male nipples.

Methods

For the purposes of brevity we refer to the entire nipple-areola complex as ‘nipple(s)’. This complex was chosen because it can be reduced to a 2-dimensional area measurement (see Fig. 1), and includes a range of adaptive features (arguably more than the nipple bud itself). It contains the tubercles of Morgagni and Montgomery glands (Montgomery 1837; Rosen 2001) which secrete milk and oils (Kopans 2007; Montgomery 1837; Nicholson et al. 2009), with an “abundant lymphatic system” known as the “subareolar or Sappey plexus” (Nicholson et al. 2009, p. 511). The areola glands have a lubricating function that physically eases breastfeeding (An et al. 2010), and have been described as “miniature lacteal units combined with sebaceous units” (Smith et al. 1982). The areola secretions are an olfactory stimulant to infants (Doucet et al. 2009; Porter and Winberg 1999), help initiate suckling, and are positively related to infant growth (Doucet et al. 2012). Additional blood vessels relative to the breast increase areola temperature, diffusing the scent (Porter and Winberg 1999) and physically orienting infants (Doucet et al. 2012; Zanardo and Straface 2015). The dense skin of the nipple protects the breast against saliva and sucking (Perkins and Miller 1926; Zanardo and Straface 2015), and may be a visual target for infants (see Doucet et al. 2007). None of these features is useful in men, who do not breastfeed.

Fig. 1

Example of male nipple scan and tracing for size measurement. Image from volunteer not involved in study with written permission for publication

Participants

Australian undergraduates (33 male, 30 female) aged between 18 and 33 (M = 20.16, SD = 2.74) participated in exchange for course credit. Informed consent was obtained from all individual participants included in the study. Note that participation was completely voluntary, and participants were informed of the requirements of the study prior to signing up for participation. There were multiple different studies that students could choose to participate in for credit, or they could complete an alternative to any study participation and still receive full class credit. Participation was dependent on being 18 years or older, having natural breasts or pectorals (i.e. no breast or chest implants, or surgeries on those areas), and for females, having never experienced any stage of a pregnancy. The majority of the sample were Caucasian (82.5%), with the rest Asian (9.5%), African (1.6%), or Other/Non-Specified (6.4%).

Measures

Nipple area (mm²) was the average of measurements taken from two images of each participant’s chest. In private, the bare chest was pressed against a vertical A3 flatbed scanner for two scans. Two of the authors, Ashleigh Kelly and Shelli Dubbs, later independently measured the de-identified images on-screen by tracing the nipple structure (see Fig. 1) using the freehand tool in ImageJ (www.imagej.net). Freehand tool settings were calibrated identically for all measurements, based on a scale physically affixed to the scanner (hence present in all images).
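As an illustration of how a traced outline and a physical scale combine into an area in mm², the following minimal sketch applies the shoelace formula to a closed outline and converts square pixels to square millimetres. It is not the authors' procedure (ImageJ performed the measurement); the function names, the `pixels_per_mm` parameter, and the example outline are assumptions introduced here for illustration only.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates of one traced vertex


def polygon_area_px(outline: Sequence[Point]) -> float:
    """Area enclosed by a closed outline, in square pixels (shoelace formula)."""
    if len(outline) < 3:
        raise ValueError("an outline needs at least three points")
    twice_area = 0.0
    for i, (x0, y0) in enumerate(outline):
        # Wrap around so the last vertex connects back to the first.
        x1, y1 = outline[(i + 1) % len(outline)]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def area_mm2(outline: Sequence[Point], pixels_per_mm: float) -> float:
    """Convert a traced outline to mm^2, given a scale read from the image.

    pixels_per_mm is hypothetical here: it stands for the calibration
    obtained from the ruler visible in each scan.
    """
    if pixels_per_mm <= 0:
        raise ValueError("pixels_per_mm must be positive")
    # Area scales with the square of the linear calibration factor.
    return polygon_area_px(outline) / pixels_per_mm ** 2


if __name__ == "__main__":
    # Made-up outline: a 20 x 20 pixel square at 2 pixels per millimetre.
    square = [(0.0, 0.0), (20.0, 0.0), (20.0, 20.0), (0.0, 20.0)]
    print(area_mm2(square, pixels_per_mm=2.0))
```

Because the conversion divides by the square of the linear scale, a small error in the calibration is roughly doubled in relative terms in the area, which is one reason a fixed scale present in every image is useful.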

Inter-rater reliability was computed for nipple measurements using the irr package (Version 0.84). As recommended by Hallgren (2012), two-way average-measures intra-class correlations (ICC) were assessed against an absolute standard (i.e. that nipple area ratings would be similar in absolute value) using a mixed model (with our two raters treated as fixed effects). For nipple ratings, ICC = .99, indicating near-perfect agreement between the raters. For each participant, nipple area rating was averaged across raters, and this variable was used for all subsequent analyses. Inter-scan reliability was computed similarly (with nipple area averaged across left and right, and across raters). Again, results indicated excellent reliability between scans 1 and 2 (ICC = .99).
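For readers unfamiliar with this statistic, the sketch below computes a two-way, absolute-agreement, average-measures ICC from the usual ANOVA mean squares (the form often written ICC(A,k)). It is a hand-rolled illustration, not the irr package's code, and the function name and example ratings are assumptions; the ratings shown are made up and are not study data.

```python
from typing import Sequence


def icc_agreement_average(ratings: Sequence[Sequence[float]]) -> float:
    """Two-way, absolute-agreement, average-measures ICC.

    ratings[i][j] is rater j's measurement of subject i.
    Raises ZeroDivisionError if every rating is identical.
    """
    n = len(ratings)                      # number of subjects
    k = len(ratings[0]) if n else 0       # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into subjects, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: systematic rater differences (ms_cols) count as
    # disagreement, unlike in the consistency form of the ICC.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


if __name__ == "__main__":
    # Made-up areas (mm^2) from two raters for four subjects.
    example = [[410.0, 415.0], [980.0, 972.0], [655.0, 660.0], [1210.0, 1204.0]]
    print(round(icc_agreement_average(example), 3))
```

The absolute-agreement form is the stricter choice: a rater who consistently traces slightly larger outlines lowers this ICC, whereas a consistency-type ICC would ignore that offset.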

Weight in kilograms was measured standing and barefoot minus all jewellery and metal, using Tanita BC-575 Body Composition Scales.

Body fat percentage was measured in the same way as weight. To ensure accuracy, pre-tests were conducted comparing the Tanita scales to a highly accurate ImpediMed DF50 bio-impedance medical analysis device (for support for use of both bioelectrical impedance and the ImpediMed device, see Lukaski et al. 1985; Segal et al. 1988). The devices showed good relative agreement on body fat percentage (r = .96, p < .001), and absolute agreement was within 3%. Pateyjohns et al. (2006) performed similar tests using Tanita brand scales and the DF50 device, and came to similar conclusions regarding the accuracy of Tanita scales for group comparisons. As the ImpediMed device required a supine position, strict test and pre-test conditions, and multiple electrodes, it was impractical to use for our purposes.

Other body measurements were taken in centimeters using a tape measure and standardized procedures (Fig. 2 shows anatomical placement). Although participants were given the option of placing the tape themselves, nearly all opted to allow the researcher to place it (excl. bust). For sensitive measurements (e.g. bust), the option to hold the tape was taken by a small percentage of participants, with the researcher requesting adjustments to placement and tension, and reading the value. All measurements were taken while clothed and wearing a non-padded bra (females only).

  • Height was measured barefoot, pressing the back and heels against a wall.

  • Chest circumference was taken around the chest and back, directly under the armpits and above the bust.

  • Ribs circumference was taken for females only, just under the bust and around the back as part of a bra size measurement.

  • Bust circumference was taken for females only, at the fullest part of the breasts and around the back, typically aligned with the nipples.

Fig. 2

Measurements taken on both genders, excluding ribs and bust (females only). All were circumferences excluding height. Image by JBamaya Design (JBamaya.com)

A range of other body measurements was gathered, but most proved either redundant or uncorrelated with nipple area once analyzed, and were therefore excluded.

Room temperature was measured in degrees Celsius using a digital thermometer placed directly above the scanner.

Procedure

The study was advertised as an investigation into sexual dimorphism in humans, with procedures explicitly disclosed prior to participants signing up for participation in the study, and again during participation (at multiple times, to allow for withdrawal at any stage). Body measurements were taken while clothed with bulky items removed, and feet were cleaned for scale measurements. Participants also undertook a questionnaire, confirming gender and ethnicity. They were then instructed on using the scanner, taking supervised images of their hands for practice before being left in private to undertake chest scans.

Two female experimenters were present with the participant at all times except during chest scans. Anonymity, comfort, and consent were of utmost importance, with participants assigned a secret code, which they used as a reference when undertaking their scans. Withdrawal was offered multiple times, and any hint of discomfort required the experimenters to abandon the chest scans and confirm desire to continue with other segments. Ethical approval was obtained for all procedures.

Results

Preliminary Statistics

All analyses were performed in RStudio (Version 0.99.489; R Version 3.2.2). The dataset supporting this article has been uploaded as supplementary material. Descriptive statistics for women and men are displayed in Tables 1 and 2, respectively. Note that ambient temperature was not significantly correlated with nipple measurements. The distributions of nipple size had low and nonsignificant skew (.37 for men, .39 for women).

Table 1 Women’s uncentered means, standard deviations, ranges and intercorrelations
Table 2 Men’s uncentered means, standard deviations, ranges and intercorrelations

Main Analyses

Tables 1 and 2 reveal large mean differences between the area of female and male nipples, such that male nipples were on average 36% the size of female nipples. To compare the variability of samples with different means, variability was scaled to the sample means by calculating coefficients of variation (CVs) for each sample, expressed as percentages (Pearson 1896; Broberg 1999, 2016; as per Apostolou 2015; Lynch 2008; Wallen and Lloyd 2008). The CV is the standard deviation divided by the mean, multiplied by 100 to express variability as a unitless percentage of the mean. Significance (p) of the sex difference was computed via Studentized bootstrapping (a procedure that does not assume data normality) using Broberg’s rsd.test function within the SAGx R package (see Broberg 1999, 2016, for details on testing the significance of differences in CVs). Note that the results are substantively the same if a simple F-test is used, as per Apostolou (2015) and Wallen and Lloyd (2008).
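The CV calculation and the logic of a resampling comparison can be sketched as follows. This is an illustration under stated assumptions, not a reimplementation of rsd.test: it uses a plain percentile bootstrap interval for the difference in CVs rather than the Studentized bootstrap reported above, it assumes the sample (n − 1) standard deviation, and the function names and example values are invented for illustration (they are not study data).

```python
import random
import statistics
from typing import Sequence, Tuple


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation: 100 * sample SD / mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean


def bootstrap_cv_difference(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed CV(a) - CV(b), lower, upper).

    lower/upper bound an approximate 95% percentile bootstrap interval.
    Note: this is a simpler stand-in for a Studentized bootstrap.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    a, b = list(a), list(b)
    observed = cv_percent(a) - cv_percent(b)

    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(cv_percent(resampled_a) - cv_percent(resampled_b))

    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Made-up areas (mm^2) for two groups with different means.
    group_a = [620.0, 1100.0, 850.0, 1500.0, 930.0, 700.0, 1250.0]
    group_b = [300.0, 340.0, 410.0, 280.0, 360.0, 390.0, 330.0]
    print(bootstrap_cv_difference(group_a, group_b))
```

An interval for the CV difference that excludes zero points in the same direction as a significant test, but the two procedures are not interchangeable, and small samples can make percentile intervals too narrow.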

In raw terms, the CV of female nipples was significantly higher than that of male nipples (42% versus 27%, p < .001). After accounting for bust size in women, and chest size in men, the CV for female nipples was still higher than male nipples (38% for women, versus 25% for men, p < .001). Results were substantively the same when accounting for bust/chest size and women’s and men’s respective body fat percentages (38% versus 25%, p < .001).
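The text does not spell out how the covariate adjustment was carried out. One plausible reading, offered here purely as an assumption, is to regress nipple area on the body measure within each sex and then express the residual standard deviation as a percentage of the mean area. The sketch below implements that reading for a single covariate; the function name and the data are hypothetical.

```python
import math
import statistics
from typing import Sequence


def adjusted_cv_percent(sizes: Sequence[float], covariate: Sequence[float]) -> float:
    """Residual SD from a simple linear regression, as a percentage of the mean.

    ASSUMPTION: this is one possible way to 'account for' a covariate;
    it is not confirmed to be the adjustment used in the study.
    """
    n = len(sizes)
    if n != len(covariate) or n < 3:
        raise ValueError("need paired data with at least three observations")

    mean_y = statistics.fmean(sizes)
    mean_x = statistics.fmean(covariate)
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0 or mean_y == 0:
        raise ValueError("covariate must vary and mean size must be non-zero")

    # Ordinary least squares slope and intercept for size ~ covariate.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, sizes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variation is what remains after removing the linear trend.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(covariate, sizes))
    residual_sd = math.sqrt(rss / (n - 2))
    return 100.0 * residual_sd / mean_y


if __name__ == "__main__":
    # Made-up areas (mm^2) and circumferences (cm); not study data.
    areas = [620.0, 1100.0, 850.0, 1500.0, 930.0]
    circumferences = [82.0, 95.0, 88.0, 104.0, 90.0]
    print(round(adjusted_cv_percent(areas, circumferences), 1))
```

Under this reading, an adjusted CV can only fall below the raw CV to the extent that the covariate explains variation in size, which matches the modest drops reported above (42% to 38% and 27% to 25%).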

Discussion

Female nipples are functional whereas male nipples are nonfunctional byproducts; therefore, our finding that women’s nipples are significantly more variable in size than men’s nipples demonstrates that high morphological variability of a feature does not necessarily imply a lack of function or of historical selection, as has previously been assumed.

This finding is novel in that a comparison of variability had not previously been made between clearly functional versus nonfunctional byproduct versions of the same feature. However, previous influential studies have made other relevant comparisons of variability. Pomiankowski and Moller (1995) compared the variability of features that are functional in both sexes but are exaggerated as a sexual display in males (e.g. bird tails). They found that the sexually selected features showed greater variability. Houle (1992) compared variability across completely different kinds of traits, finding that the types of traits presumed to be under strong selection (e.g. life-history traits such as fecundity and longevity) tend to exhibit more variability than the types of traits presumed to be under weak selection (e.g. morphological traits such as whole-body or body-part sizes). These studies suggest a complex relationship between historical selection pressure and variation in a trait, but they have not prevented researchers from taking high trait variation as indicative of lack of functionality/selection (e.g. Wallen and Lloyd 2008), perhaps because the comparisons in the previous studies do not directly pertain to the byproduct/functional adaptation distinction. Our study, in contrast, pertains directly to this distinction. Indeed, male nipples are held as a prototype of an evolutionary byproduct (e.g. the major proponents of the byproduct explanation of female orgasm explicitly draw the analogy with male nipples; Gould 1987; Lloyd 2005; Symons 1979; Wallen and Lloyd 2008).

Given the status of women’s and men’s nipples as a prototypical product-byproduct pair, their further study could better illuminate the relation of function and selection to morphological variation. For example, our study used a 2-D measure of nipples in a relatively small sample; a whole-body, 3-D scan in a large sample could enable fine-grained measurement of variance in size and shape of different parts of the nipple and of the breasts, and if the sample were also genetically informative (e.g. comprising twins) then comparisons of genetic variance could be made as well as those of phenotypic variance.

In sum, our findings serve two main purposes. First, they discredit previous studies that have taken relatively high size variability of a feature to indicate lack of functionality. Returning to the example mentioned in the introduction, Wallen and Lloyd (2008) claimed that the greater length variation of clitorises than of penises meant that the female orgasm is a nonfunctional byproduct of male orgasm. This evidence can now be disregarded, since their own chosen analogy (male and female nipples) shows the opposite effect (the byproduct exhibits less variation in size than the adaptation). Second, our findings serve to prevent others from designing studies based on incorrect assumptions about the relationship of variability with functionality or selection. The salience and relatability of the nipple exemplar should help our findings to serve this purpose broadly.