Introduction

The concept of ‘physics envy’ describes the alleged fact that biologists would like to have theories that allow researchers to design experiments that readily distinguish between competing hypotheses. Biological data, especially in ecology and evolution, rarely provide researchers with such opportunities. Biological interactions are complex, forcing researchers to consider mutually non-exclusive hypotheses, and outcomes are often highly variable, being affected by environmental factors (Candolin and Heuschele 2008), demography (Kokko and Rankin 2006; Weir et al. 2011), historic contingencies and stochastic events (Jennions et al. 2012a). This applies to inter-specific parasite-host interactions (Poulin and Forbes 2012) and to contests for resources or space (Castellanos and Verdú 2012), but it also applies to within-species interactions, especially those between the sexes over reproduction. Closely related species often have markedly different mating systems and patterns of parental care, leading to variation in sexual selection. The same can be true for different populations of a single species, and even for the same population in different years (review: Siepielski et al. 2009; but see Morrissey and Hadfield 2012) or at different stages of the season (e.g. Forsgren et al. 2004).

This creates challenges when trying to make generalizations that apply across taxa (e.g. to determine whether a particular factor is generally important), and when trying to understand why a particular feature emerges under some conditions but not others. Both tasks are important: without the ability to generalize and to infer associations that hint at causal factors of general importance, science cannot proceed past individual cases of storytelling. Meta-analysis offers a potential solution. Generalizing is achieved by combining the results of different studies, with greater weight given to those offering more precise estimates, to determine the average influence of a factor of interest. We can estimate the mean ‘effect size’, which is a standardized measure of the strength of a relationship (Koricheva et al. 2012). Even more important is that we can include covariates, either continuous or categorical, in a meta-analysis to see if they correlate with the effect size estimates across studies. This approach, often referred to as a meta-regression (e.g. Jones et al. 2009), can identify factors that are potentially causally related to the phenomenon of interest. This is important for testing theory: advanced theory only rarely predicts a single outcome (e.g. ‘females benefit by producing extra-pair young’), but rather patterns of covariation (e.g. ‘production of extra-pair young should reflect variation in the costs incurred by mothers’). Meta-regression can reveal higher-order patterns that are undetectable in the original studies and either corroborate or refute existing theory (e.g. Griffin et al. 2005; Weir et al. 2011) or raise new research questions (see Jennions et al. 2012b).
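The weighting idea described above can be sketched in a few lines of Python. This is only a minimal fixed-effect illustration, not the procedure used in any study cited here. It assumes that the effect sizes are correlation coefficients, that each is transformed to Fisher's z with approximate sampling variance 1/(n − 3), and that weights are inverse variances. The function name and the three example studies are hypothetical.

```python
import math

def weighted_mean_r(correlations, sample_sizes):
    """Fixed-effect mean correlation using inverse-variance weights.

    Assumption: each r is converted to Fisher's z, whose sampling
    variance is approximately 1 / (n - 3), so the weight is n - 3.
    """
    zs = [math.atanh(r) for r in correlations]
    weights = [n - 3 for n in sample_sizes]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se_z = math.sqrt(1.0 / sum(weights))
    # Back-transform the mean and its 95 % confidence limits to r.
    ci = (math.tanh(mean_z - 1.96 * se_z), math.tanh(mean_z + 1.96 * se_z))
    return math.tanh(mean_z), ci

# Hypothetical data: one small study with a large effect and two
# larger studies with weak effects.
rs = [0.50, 0.10, 0.15]
ns = [13, 103, 203]
mean_r, (ci_low, ci_high) = weighted_mean_r(rs, ns)
```

With these made-up inputs the weighted mean lies well below the unweighted average of the three correlations, because the small study reporting r = 0.50 contributes little weight.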

A short history of meta-analyses of sexual selection

The first empirical meta-analyses by ecologists and evolutionary biologists appeared in 1992 (Gurevitch et al. 1992; Vanderwerf 1992). Subsequently, a key review paper aimed at evolutionary biologists highlighted the value of meta-analysis (Arnqvist and Wooster 1995). The first meta-analysis to explicitly address a sexual selection question considered assortative mating in water striders (Arnqvist et al. 1996). The number of meta-analyses of sexual selection and allied topics has since remained static at about 4–5 per year, although there was a sudden surge to 16 meta-analyses published or available online in 2011 (Fig. 1a). In other areas of evolutionary biology and ecology one can find examples of much faster increases in the uptake of meta-analysis (e.g. host-parasite interactions; Fig. 1 in Poulin and Forbes 2012). The total number of ecological and evolutionary meta-analyses published shows a steady annual increase (Fig. 1b; see also Jennions et al. 2012c).

Fig. 1 (a) Papers per year that include actual meta-analyses on topics on or closely linked to the study of sexual selection (n = 94, see Table 1). (b) Papers per year located using the search term topic = meta-analys* or metaanalys* in the field categories Ecology, Evolutionary Biology, Plant Sciences and Zoology using the ISI Web of Science (Dec 31, 2011). Note: not all of these studies are actual meta-analyses (e.g. some might simply refer to the need for meta-analysis)

Some of the earliest meta-analyses explored the role of fluctuating asymmetry (FA) in sexual selection. At the time, in the late 1990s, there was already disquiet as to whether FA in sexual traits offered a single characteristic that could capture much of the variation in individual quality and, hence, be a target of mate choice. Consequently, many FA studies were controversial (Leamy and Klingenberg 2005). In this atmosphere, early meta-analyses of FA and sexual selection often generated intense, even vitriolic, debates [e.g. commentaries in J. Evol. Biol. on Møller and Thornhill (1997)]. In addition, there were claims that publication bias was a major problem in published meta-analyses (Palmer 1999; Simmons et al. 1999; but see Møller et al. 2005), which led to wider concern that the use of meta-analysis in ecology and evolution might be inappropriate due to systemic publication bias (Palmer 2000). There were also issues surrounding the inclusion of unpublished work that could not be evaluated (Palmer 1999), and the lack of objective search criteria for relevant studies, with the inference that some studies were deliberately excluded from meta-analyses to bias the outcome.

In hindsight, many of the objections raised echo those from other disciplines (e.g. medicine, social sciences) when meta-analysis was first used. For example, the temporal decline in effect size found in ecological or evolutionary studies (Simmons et al. 1999; Poulin 2000; Jennions and Møller 2002) has been reported in many other fields (Koricheva et al. 2012). Similarly, a negative relationship between sample sizes and effect sizes has been reported in other fields (Jennions et al. 2012b). This pattern is expected if small sample studies are only published when they report significant results (but it can arise for many other reasons). More generally, as with any new technique, earlier studies were less rigorous and used less sophisticated approaches than those published later. For example, recent studies are far more likely to control for statistical non-independence (Nakagawa and Santos 2012) and correct for phylogeny (Chamberlain et al. in prep).

Despite these problems, several meta-analyses conducted in the 1990s were positively received, became highly cited, and probably encouraged others to attempt a meta-analysis. Notable among these were meta-analyses of the evidence for ‘good genes’ benefits to female choice (Møller and Alatalo 1999), the effects of polyandry on female fitness (Arnqvist and Nilsson 2000) and data on how well sexual traits predict male reproductive success in birds (Fiske et al. 1998; Møller and Ninni 1998). To date, at least 18 papers on ecological or evolutionary topics that include a meta-analysis have appeared in Science or Nature, including two on sexual selection related topics (effect of mate attractiveness on offspring sex ratio: West and Sheldon 2002; male-biased parasitism: Moore and Wilson 2002).

Here, we identify 94 papers that contain formal meta-analyses (i.e. the use of effect sizes, usually weighted by the inverse of their variance or sample size) on sexual selection or allied topics (Table 1; search protocol in Electronic Appendix 1). Our approach to defining ‘sexual selection or allied topics’ was pragmatic. We included topics of likely interest to those studying competition for mates and mating systems (e.g. maternal adjustment of offspring sex ratios in response to sire attractiveness or the future level of mating competition). We also considered relationships between immune function, testosterone, parasitism and survival, because the life history trade-off between immunity and sexual traits is a major area of research (i.e. the immunocompetence handicap hypothesis; Folstad and Karter 1992; see also Poulin and Forbes 2012). We included studies on phenotypic differences between the sexes because, in the absence of sexual selection, natural selection should result in similar optimal phenotypes, and the sexes should, all else being equal, only differ in gamete size (Kokko and Jennions 2008). We also considered meta-analyses of the effects of inbreeding and the relationship between heterozygosity and fitness because of recent interest in whether mate choice/biased fertilization improves parental genetic compatibility (Griffith and Immler 2009), or whether selection favours mating with kin (Kokko and Ots 2006) or more heterozygous individuals (review: Kempenaers 2007).

Table 1 Summary of details of sexual selection meta-analyses (n = 94)

We do not claim to provide a complete list of published sexual selection meta-analyses. Papers with ‘buried’ meta-analyses are easily overlooked. Indeed, fewer than half of the papers in Table 1 have ‘meta-analysis’ in the title. We have, however, probably located most meta-analyses that are unambiguously related to core sexual selection questions. To keep our review manageable, we excluded meta-analyses exclusively on humans or plants.

The 94 meta-analyses were authored by 179 different researchers, with a mean of 2.7 authors per paper. Interestingly, given the effort required to initially familiarize oneself with how to conduct a meta-analysis, 82 % of these researchers have only coauthored a single meta-analysis. We hope that this is not a sign that the experience was too traumatic to repeat! Only a few researchers published four or more meta-analyses (L. Garamszegi, M. D. Jennions, A. P. Møller, S. Nakagawa, R. Poulin, B. C. Sheldon, R. Thornhill). Another method that shares many similarities with meta-analysis—comparative analysis using phylogenetic contrasts—appears to be far more widely used. For example, Garamszegi and Møller (2010) located 194 comparative analyses from 2003 to mid-2007 in only four journals. The restricted use of meta-analysis is surprising given that published meta-analyses on sexual selection have had great impact. For example, 16 of the meta-analyses in Table 1 have been cited over 100 times, and 14 % are among the top 100 most cited ‘meta-analysis’ related publications in the ISI defined fields of ecology, evolution, plant sciences and zoology (13 of 94 studies), which is double the field average of 7 % (100 of 1,474 ‘meta-analyses’; see Fig. 1b).

Some of the 94 publications addressed several different questions, and contain multiple meta-analyses using effect sizes drawn from partially overlapping sets of primary studies. Nevertheless, it is possible to use five broad categories to account for two-thirds of the listed meta-analyses. First, 16 studies used phenotypic correlations to identify potential life history trade-offs between male attractiveness/sexual trait expression and naturally selected traits or fitness components. Second, 15 studies tested for sex differences in key life history traits. Third, 12 studies asked which male traits are associated with greater mating success. This provides correlative, and sometimes experimental, evidence that they are targets of female choice. Fourth, 11 studies related female mate choice, extra-pair mating and polyandry to genetic benefits, direct benefits and naturally selected costs. Fifth, 10 studies tested whether offspring sex ratios are adjusted in relation to factors that might affect sons’ mating success or daughters’ breeding success (e.g. mate attractiveness, maternal condition).

Insights and impact of past sexual selection meta-analyses

The impact of published sexual selection meta-analyses ranges from confirming to challenging established theory. In this section we discuss how meta-analysis can help distinguish between hypotheses, what issues arise when effect sizes (which are often small) are used to test theory, and we explain how meta-regression can yield further insights if sufficient data are available.

Distinguishing between predictions or hypotheses

In science, it is often portrayed as ideal to have competing hypotheses that make different testable predictions. In nature, however, multiple factors can simultaneously influence an outcome, so a more balanced approach is to ask which factors tend to drive the observed outcomes rather than trying to completely reject the importance of others. Sexual selection theory is replete with false dichotomies. Sometimes key factors that alter the direction of a relationship are ignored or artificially fixed (e.g. the opposite ends of the Fisher-Zahavi continuum: Kokko et al. 2006b). At other times, two explanatory frameworks are falsely treated as mutually exclusive (e.g. sexual conflict vs. good genes). It is, of course, challenging to confidently state what ‘drives’ a system, especially when taking into account that multiple factors often interact, albeit with varying importance. Even so, identifying prevalent trends can provide key insights into the relative importance of different factors.

We illustrate this with an example. Two opposing views have been promoted as to whether females should differentially allocate resources to offspring depending on their mating partner’s attractiveness (review: Ratikanen and Kokko 2010). It has been suggested that females mated to more attractive males should increase their reproductive investment if they are likely to produce higher quality offspring with greater reproductive value (‘differential allocation’: Burley 1986, also widely attributed to Sheldon 2000 who did not specify the direction of adjustment). Alternatively, females mated to less attractive males might increase their reproductive investment to ‘compensate’ (Gowaty 2008). The unifying consideration is ultimately how parents adjust the rearing environment (i.e. resources provided). Both positive and negative relationships between male attractiveness and female investment are possible (Harris and Uller 2009). Whether one views ‘differential allocation’ and ‘compensation’ as differing merely quantitatively or also qualitatively, there is a basic empirical question: which direction occurs more often? Horváthová et al. (2012) found that the mean relationship was for significantly greater investment when mated to attractive males (r = 0.117 corrected for phylogeny). This finding should focus attention on the underlying causes of adjustment in effort (i.e. why does ‘compensation’ tend not to occur?) and, hopefully, will lead to the development of better theoretical models.

We would argue that a general insight from sexual selection meta-analyses is that one can rarely unambiguously distinguish between hypotheses. This is because the relevant hypotheses are not mutually exclusive, and the evidence being summarized is usually correlative rather than experimental. Even so, by indicating prevailing trends in nature it is often possible to estimate the relative importance of different putative causal factors.

When meta-analysis confirms expected patterns, effect sizes are often low

In many cases, meta-analysis is not used to distinguish between competing hypotheses, but simply to confirm (or refute; see next section) an already accepted assumption about an empirical relationship. The value of the exercise is thus to provide an objective measure of the average strength of the relationship. For example, Kelly (2008) asked whether, in territorial species, there is a positive correlation between male resource holding potential (RHP) and the value of the resource held, between resource value and male reproductive success, and finally, given these relationships, a positive relationship between male RHP and reproductive success. As one might expect, all three relationships were indeed significantly positive. However, the mean estimated values of r lay between 0.37 and 0.45, which reminds us that RHP cannot be expected to correlate perfectly with the resources gained by a male and how many offspring he sires. It would be interesting to follow up and test whether RHP is a better predictor of male success in studies that included males who failed to establish a territory (resource value = 0 for these males) than in those which did not present data on such males. Some selection to maintain high RHP might remain invisible in studies that only consider variation within the subset of territorial males. Overall, however, RHP remains an imperfect predictor of male success. This fits with theoretical and experimental approaches that show that arrival order can override mild RHP differences in deciding territory ownership (Hardy and Field 1998, Kokko et al. 2006a).

In a similar vein, several studies have estimated the mean relationship between mating success and sexually dimorphic traits that are assumed to be under direct sexual selection (e.g. plumage colour, body size, song complexity). Again, the mean relationships are often modest (e.g. r = 0.30, Gontard-Danek and Møller 1999; r = 0.20, Soma and Garamszegi 2011). These examples remind us that stochasticity might play a large role in determining mating success in most systems (Jennions et al. 2012a).

This conclusion generalizes readily: mean effect sizes are often low in sexual selection studies. Even when the means are statistically significant, a univariate approach rarely explains more than 1–10 % of the variation in a trait of interest (Møller and Jennions 2002). Why are effect sizes so modest? If sexual selection theory is correct, would we not expect greater explanatory power for variables of interest? We offer five responses.
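As a worked check, and assuming that ‘variation explained’ is taken as the square of the correlation coefficient, the quoted 1–10 % range corresponds to r between about 0.1 and 0.32. The values r = 0.20 and r = 0.30 are the mean effect sizes mentioned in the preceding paragraph; the other two are included only to mark the ends of the range.

```latex
r = 0.10 \;\Rightarrow\; r^2 = 0.01 \;(1\,\%), \qquad
r = 0.20 \;\Rightarrow\; r^2 = 0.04 \;(4\,\%), \qquad
r = 0.30 \;\Rightarrow\; r^2 = 0.09 \;(9\,\%), \qquad
r \approx 0.32 \;\Rightarrow\; r^2 \approx 0.10 \;(10\,\%)
```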

First, the problem could lie in the quality of empirical measures. Above, we highlighted an example: males who never gained a territory/mate are sometimes excluded from primary studies so that the strength of selection is underestimated. More generally, it is notoriously difficult to measure fitness. The use of proxy measures like body condition or annual survival, which are fitness components and not fitness itself, introduces measurement error and, at worst, a systematic bias that must reduce the estimated strength of actual relationships (Hunt and Hodgson 2010). An example of a bias is that lifespan or survival are often poor proxies for fitness if the fittest males have a shorter life expectancy due to trade-offs between sexual signalling and survival (Kokko 2001; Hunt et al. 2004).

Second, the effect sizes obtained will depend on whether the primary researchers used a combination of biological intuition and current theoretical expectations to pick the most relevant variables or whether they inadvertently chose less relevant proxies that will lead to a low effect size. Similarly, if a broad-brush approach is used to identify many potentially relevant factors, some will probably be irrelevant and fail to give any support to theory. For example, in the meta-analysis on RHP and territoriality, Kelly (2008) highlights such problems by discussing data on red-collared widowbirds Euplectes ardens. The study predicts r ≈ 1 if collar colour is used as a determinant of territory size, but including other apparently non-important traits (e.g. tail asymmetry) deflates the mean effect size to r = 0.52. It is tempting to exclude, post hoc, traits that do not support the expected relationship, but this can obviously lead to Type I errors. Ultimately, more data are needed to confirm whether certain traits continue to have a low estimated effect size, suggesting that they are indeed unsuited to testing theory in the first place.

Third, most dependent variables in sexual selection studies are affected by multiple factors of interest. Consider, for example, studies where the benefit of polyandry due to elevated offspring fitness is calculated by randomly assigning varying numbers of mates to females (Slatyer et al. 2012). Even if polyandry elevates offspring fitness, there could still be much variation within each mating treatment if, for example, female size and age affect offspring fitness. The situation is even more complex if the benefits of choice are context-dependent so that some females benefit more than others (review: Schmoll 2011). In short, a distinction must be drawn between the variation explained by a single factor (the goal of many meta-analyses) and the total variation that can be explained in any given primary study using a model building approach with multiple predictors. Peek et al. (2003) estimated that, on average, statistical models in primary ecological studies explain 47–54 % of the observed variation.

Fourth, we suggest that the role of stochastic events (‘luck’ in plain language) is underestimated in sexual selection studies. It is likely to generate considerable variation in mating success, especially when the sex ratio is male biased. Mating is a binary event (mate or not) so continuous variation in a male trait is imperfectly correlated with mating success (for an illustrative example see Jennions et al. 2012a; see also Kahneman 2011 for a general review of the propensity to underestimate how chance affects success). For this reason alone, effect sizes are likely to be modest in studies investigating male mating success.
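A small simulation can illustrate why a binary outcome weakens the observed correlation. This is a hypothetical sketch and not a model from the cited work. It assumes a normally distributed male trait, a ‘competitiveness’ score that is the trait plus an equal amount of random luck, and a male-biased scenario in which only the top 20 % of scores obtain a mate. All names and parameter values are illustrative assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_males = 20000
trait = [rng.gauss(0, 1) for _ in range(n_males)]
# Hypothetical 'competitiveness': the trait plus an equal dose of luck.
score = [t + rng.gauss(0, 1) for t in trait]
# Assumed male-biased scenario: only the top 20 % of scores mate.
cutoff = sorted(score)[int(0.8 * n_males)]
mated = [1 if s >= cutoff else 0 for s in score]

r_score = pearson(trait, score)  # trait vs. continuous outcome
r_mated = pearson(trait, mated)  # trait vs. binary mating success
```

Under these assumptions `r_mated` is lower than `r_score`: collapsing the outcome to ‘mated or not’ discards information, so the trait predicts mating success less well than it predicts the underlying score, even though selection on the trait is the same in both cases.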

Fifth, there are genuine biological differences between populations, sites and species and environmental conditions across years. The direction of a relationship can genuinely vary. In such cases, the mean effect must be smaller than that expected based on studies that show the strongest support for theory (which are often the best known studies). These situations call for an analysis of moderating variables, which we will return to shortly.

So what should we make of low effect sizes? The most important implication is that biological systems are strongly influenced by noise so that most primary studies have minimal statistical power. Failure to detect a significant influence of a factor of interest should be commonplace, regardless of whether or not it has an impact on the measured outcome. This hard truth makes it exceedingly difficult to interpret a null result in any single study: was theory refuted or was the effect simply too weak to detect despite ultimately being evolutionarily significant?

These difficulties place into perspective apparent discrepancies between studies that do and do not obtain a significant result. Historically, such differences evoke conflict: researchers often explain them away by invoking biological differences (e.g. geographic variation) or criticize the inferior methodology of rival researchers. The use of meta-analysis discourages the dichotomous interpretation of P values and encourages greater consideration of the distribution of the observed magnitude of a relationship. It is worthwhile remembering that studies of the same phenomenon will, even in the absence of biological variation, generate a range of effect sizes purely due to sampling error (Nakagawa and Cuthill 2007). This subtle shift in outlook could profoundly direct emphasis away from over-reliance on extrapolating from individual studies and towards detecting general trends using meta-analysis.
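The spread of effect sizes expected from sampling error alone can be approximated with a short calculation. The sketch below assumes that the effect size is a correlation coefficient and uses the standard Fisher z approximation, in which atanh(r) is roughly normal with standard error 1/sqrt(n − 3). The true effect of 0.20 and the two sample sizes are hypothetical values chosen only for illustration.

```python
import math

def expected_r_range(true_r, n, z_crit=1.96):
    """Approximate 95 % range of observed r for a given true r and n.

    Assumption: Fisher's z = atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    z = math.atanh(true_r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical scenario: the same true effect studied with different n.
small_low, small_high = expected_r_range(0.20, 30)
large_low, large_high = expected_r_range(0.20, 300)
```

With these illustrative inputs, studies of n = 30 would be expected to report anything from a weakly negative correlation to one above 0.5, whereas the expected range for n = 300 excludes zero. Two small studies of the identical phenomenon can therefore disagree on significance without any biological or methodological difference between them.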

When meta-analysis refutes conventional wisdom, new alternatives are often identified

Meta-analysis does not always confirm a prior prejudice. A general insight from sexual selection meta-analyses (see the occasional ‘No’ answer in Table 1) is that it is easy to be misled by a few high profile studies into believing that a prediction is well supported. Support is often weaker than assumed.

To start with a simple example, meta-analysis has shown that, despite some high profile primary studies that reported a significant effect of certain male traits on mating success, the mean effect did not differ from zero when looking at all the available studies. For example, Nakagawa et al. (2007) found no significant relationship between bib size and reproductive success in house sparrows Passer domesticus in a meta-analysis of results from 12 populations. Similarly, Garamszegi and Møller (2004) found that song complexity did not predict a male’s within-pair paternity, although Soma and Garamszegi (2011) subsequently showed that song complexity was, on average, weakly but significantly positively correlated with total reproductive success (r = 0.20) (but this was not readily attributable to increased success at gaining extra-pair paternity).

More important, perhaps, are cases where meta-analysis fails to detect a theoretically predicted relationship that is assumed to be widely corroborated. For example, Roberts et al. (2004) tested some key assumptions of the immunocompetence handicap hypothesis (ICHH; Folstad and Karter 1992). The ICHH suggests that a major cost constraining elaboration of sexually selected traits in vertebrates, making their expression covary with male quality, is that testosterone elevates sexual trait expression but reduces the ability to repel parasites. There was, however, no evidence from experimental manipulations of testosterone levels that testosterone is immunosuppressive (based on immune function measures like white blood cell counts or subsequent parasite loads) when data were analyzed at the species level. In another meta-analysis, Boonekamp et al. (2008) showed, albeit with a very small data set, a potential causal link in the reverse direction: experimental immune system activation suppresses testosterone levels. The primary studies they analysed had previously received little attention, and it is unlikely that a case for this ‘reverse causality’ explanation could have been built without a meta-analysis. Given the implications of these findings for the well-studied ICHH, it is surprising that no follow-up meta-analyses have re-examined the findings of these two meta-analyses.

Another example can be found in game theory models of sperm competition. Models predict that males will increase ejaculate size when sperm competition risk is high (e.g. when a rival is also likely to mate with the female), but decrease ejaculate size when the intensity of sperm competition is greater (e.g. when many rivals are present, because the marginal returns per sperm decline). Two meta-analyses recently confirmed the ‘risk’ prediction (Kelly and Jennions 2011; DelBarco-Trillo 2011), but for the ‘intensity’ prediction, the mean effect size did not differ from zero (Kelly and Jennions 2011). If sperm competition theory is internally correct, such that the conclusions follow from the model assumptions, then a failure to meet the predictions can highlight two problems: the assumptions might not be met in nature and/or there is a problem in how theory is being tested or applied. It might be that effects of ‘risk’ and ‘intensity’ both exist but that, due to high measurement error, only one has by chance been detected. Alternatively, males might not perceive the experimental manipulation as intended by the researcher. For example, focal males might not use brief exposure to different numbers of males as a measure of the likely intensity of sperm competition. It could also be that many tests are simply on species where males do not facultatively adjust ejaculates to short-term changes in the intensity of sperm competition, thereby deflating the overall effect size.

This last point is an important consideration: is sexual selection theory as a whole incorrect if specific predictions do not apply in some species? Consider the fact that, outside of the insects, hermaphroditic species are common but never appear to have precopulatory ornaments (Lukas Schärer, pers. comm.). It would be counterproductive to include many hermaphroditic species in a meta-analysis testing predictions about ornament evolution as it could ‘dilute’ estimates of the mean effect for more appropriate study taxa.

This is one reason why it is worthwhile looking for heterogeneity in effect sizes and attempting to identify sources of variation in effect sizes. A theory is not best supported when effect sizes are strongest, but when effect sizes can be predicted based on explanatory variables. This is why it is also worthwhile to quantify the magnitude of phylogenetic effects (Lajeunesse 2009). A strong phylogenetic signal indicates that the theory being tested is potentially more applicable to some taxa (those where the effect is stronger) than others. If so, researchers can address a new set of questions. Could historic contingencies, combined with phylogenetic inertia, be sufficient to create a pattern where some taxa meet the theory’s assumptions better than others? For example, in sex allocation studies some taxa might simply lack any mechanisms that allow them to bias the offspring sex ratio. Alternatively, there could be predictable variation in species properties across taxa that make it readily understandable why evolutionary outcomes follow theoretical predictions better in some lineages than others (e.g. local mate competition could have a stronger impact on sex ratios in taxa that lack efficient dispersal).

Meta-regression: understanding variation in effect sizes

Meta-analysis provides us with an overview of the location and distribution of effect sizes (i.e. mean and associated confidence interval), and if the number of species is small this might be the only realistically achievable goal. Many early studies simply reported the mean effect size and, at best, calculated effect sizes for a few subsets of the data (i.e. tested the influence of categorical factors). However, when sample sizes permit, meta-regression provides techniques to investigate differences among studies. This helps to overcome the problem outlined earlier that low mean effect sizes do not necessarily mean that there is nothing biologically interesting going on. Meta-regression can be used to test whether variation in effect sizes among studies is not solely attributable to sampling error (i.e. there is significant heterogeneity) and can partly be explained by biological and/or methodological moderators. The use of meta-regression with multiple predictors has become more common in recent sexual selection studies following recent software advances (see Nakagawa and Santos 2012), and we highlight its importance with three examples.
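The two steps described above, testing for heterogeneity and then relating effect sizes to a moderator, can be sketched as follows. This is a deliberately simple fixed-effect illustration under stated assumptions and not the method of any study cited here. It uses Cochran's Q, one common heterogeneity statistic, and a weighted least-squares fit with a single moderator. It ignores random effects, non-independence and phylogeny, which published analyses usually handle with dedicated software. The function names and the six example studies are hypothetical.

```python
def cochran_q(effects, variances):
    """Heterogeneity statistic Q and its degrees of freedom."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))
    return q, len(effects) - 1

def meta_regression_slope(effects, variances, moderator):
    """Weighted least-squares intercept and slope for one moderator."""
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, moderator)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, effects)) / w_sum
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, moderator, effects))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, moderator))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical effect sizes (e.g. Fisher's z) from six studies, with a
# made-up categorical moderator coded 0/1 (say, two methodologies).
effects = [0.05, 0.10, 0.00, 0.40, 0.35, 0.45]
variances = [0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
moderator = [0, 0, 0, 1, 1, 1]

q, df = cochran_q(effects, variances)
intercept, slope = meta_regression_slope(effects, variances, moderator)
```

Q is compared with a chi-square distribution on `df` degrees of freedom. In this made-up data set Q is well above the 5 % critical value for 5 degrees of freedom (about 11.07), and the slope equals the difference between the two group means, so the moderator accounts for most of the variation that the overall mean conceals.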

Identifying sources of variation can prevent misinterpretation of a non-significant or low mean effect as evidence that underlying theory does not apply in nature. Consider the general finding that birds increase their reproductive effort when mated to attractive males (Horváthová et al. 2012). Closer inspection showed that the magnitude of the ‘attractiveness’ effect varies for different types of maternal investment. There was a significant increase for some traits (clutch size, egg size and maternal feeding rate), but not others (levels of immuno-stimulants and androgens in eggs). Given the problem outlined earlier that including many irrelevant traits can dilute a real effect to yield a non-significant mean effect, it is an interesting conjecture to consider what would have happened if many more studies had existed on androgens, and the true effect for androgen is zero. Would the initial ‘null’ result based on the full dataset have discouraged the researchers enough to make them forgo the more advanced step of meta-regression? One hopes not, as the meta-regression not only revealed effects of ‘attractiveness’ on clutch size, egg size and feeding responses, but also exposed an interesting difference between species with bi-parental and female-only care. Species with female-only care showed a significantly greater propensity to increase egg size, while those with bi-parental care tended to increase clutch size. This finding makes intuitive sense if the ability of a single parent to care for a larger brood is limited. The finding also raises wider questions about the likely patterns that will arise in non-avian taxa with different patterns of parental care that can now be tested based on a priori predictions (Ratikanen and Kokko 2010). [For another good example of a study that shows how a non-significant mean effect is not necessarily a refutation of theory see Griffin et al. (2005). They show that facultative adjustment of offspring sex in cooperative breeders in response to the number of existing helpers can be explained by variation in the extent to which helpers actually elevate parental fitness.]

Another example is provided by studies of extra-pair mating in birds. The cost of extra-pair mating due to reduced parental care by a cuckolded male varies widely among species. Small genetic benefits (Møller and Alatalo 1999; Slatyer et al. 2012) have led some to argue that extra-pair activity is male driven (Arnqvist and Kirkpatrick 2005, but see Eliassen and Kokko 2008; Griffith 2007; Schmoll 2011 for problems of measuring and interpreting the patterns obtained). If male behaviours (seeking and protecting paternity) were the only factors determining the proportion of extra-pair young, effect sizes should not predictably vary with the fitness consequences for females. However, Albrecht et al. (2006) showed in a phylogenetically controlled meta-analysis that rates of extra-pair fertilization decrease with increasing costs (due to lower male care) of multiple paternity for females. As with any across-study or species comparison, the evidence is correlative, but this finding clearly makes it difficult to maintain that the behaviour of only one sex (female or male) determines the distribution of paternity.

Finally, meta-regression can highlight important methodological differences between studies. We have already mentioned, in the context of RHP and territoriality (Kelly 2008), the potential for a meta-regression to estimate how important it is to include ‘failed’ males with no resources when estimating selection on RHP. In the context of sexual conflict over parental care, there was significant among-study heterogeneity in the compensatory increase in the focal parent’s feeding rate of offspring in response to reduced care by the partner (Harrison et al. 2009). Inclusion of experimental treatment as a moderating factor showed that the increase in care was greater when the mating partner was removed as opposed to experimentally manipulated so that his/her level of care was reduced (d = 1.69 vs. 0.50). This study highlights the basic importance of taking into account the role of methodology when estimating the strength of the relationship.
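The two steps described in this section, testing for heterogeneity and then asking whether a moderator accounts for it, can be sketched in a few lines. The following is a minimal fixed-effect illustration, not the method used in any of the studies cited above: the effect sizes, sampling variances and binary moderator are hypothetical values chosen so that the mean effect is zero although the moderator matters, and the function names are ours.

```python
import math


def fixed_effect_summary(effects, variances):
    """Inverse-variance weighted mean, its standard error, and Cochran's Q."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total_w
    se = math.sqrt(1.0 / total_w)
    # Q sums the weighted squared deviations from the pooled mean; under
    # homogeneity it follows a chi-square distribution with k - 1 df.
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    return mean, se, q


def weighted_meta_regression(effects, variances, moderator):
    """Weighted least-squares regression of effect size on one moderator.

    Returns (intercept, slope, residual Q), where residual Q is the
    heterogeneity left unexplained by the moderator.
    """
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, moderator)) / total_w
    y_bar = sum(w * y for w, y in zip(weights, effects)) / total_w
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, moderator))
    sxy = sum(
        w * (x - x_bar) * (y - y_bar)
        for w, x, y in zip(weights, moderator, effects)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    q_resid = sum(
        w * (y - intercept - slope * x) ** 2
        for w, x, y in zip(weights, moderator, effects)
    )
    return intercept, slope, q_resid


# Hypothetical data: six studies, three per level of a binary moderator.
effects = [-0.4, -0.5, -0.3, 0.4, 0.5, 0.3]
variances = [0.04] * 6
moderator = [0, 0, 0, 1, 1, 1]

mean, se, q_total = fixed_effect_summary(effects, variances)
intercept, slope, q_resid = weighted_meta_regression(effects, variances, moderator)
```

In this constructed case the pooled mean is zero, yet almost all of the total heterogeneity (`q_total`) is removed once the moderator is fitted (`q_resid`). A real analysis would normally use a random- or mixed-effects model and account for phylogeny, which this sketch omits.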

Great potential, but do we have enough species?

Attempts to evaluate the generality of an argument and to identify causal factors rely on sufficient coverage of evolutionary outcomes from many taxa. As with much sexual selection research, there is a strong bias towards meta-analyses exclusively of bird studies (36 % in Table 1), with far fewer meta-analyses exclusively devoted to mammals (11 %) or arthropods (14 %). More than 34 % of the meta-analyses in Table 1 did, however, use data sets covering four or more major taxa (e.g. birds, mammals, reptiles, fish, frogs, amphibians, insects or arthropods), but closer inspection shows that there was often still a strong sampling bias towards one or two taxa.

Of the 94 studies, 10 were single species studies. Of the remainder, there were 75 studies that provided detailed enough information about the number of species used to tabulate it (Table 1). We noted the number of species in the largest meta-analysis per publication, if the original publication clearly estimated separate mean effects for different subsets of the data (which could vary in the number of species for which data were available). If it was not possible for us to deduce the number of species of the most extensive meta-analysis in this way, we have simply noted the total number of species, and it should be kept in mind that the number of species in each meta-analysis might then be smaller.

The mean and median numbers of species in these 75 meta-analyses were 29.2 and 20, respectively. Only 11 publications included data on more than 50 species. These numbers should alert the reader to limitations of current datasets. As we have noted earlier, the value of a meta-analysis is much enhanced when there is sufficient data to test for moderating variables. Such an approach allows researchers to address questions of the type ‘why does our phenomenon of interest appear stronger in studies/species where, say, populations are at higher density’ which, if answered, advance a field more than a simple statement that ‘the phenomenon appears to exist (or not)’. The capacity to test for moderating variables is limited due to low statistical power if we have only a few cases at hand. It is important to consider the level of analysis when testing for moderators (e.g. do we want to look for variation across studies or species?). In general, especially given the need to formally correct for phylogenetic relatedness (Lajeunesse 2009), we would argue that most researchers are interested in explaining variation across species. In meta-analysis, the statistical power to test whether the mean effect differs from the null value depends on study sample sizes, which affect the precision of estimates of effect size, and on how variation among estimates is then distributed at different hierarchical levels (e.g. within and among species variation). This is information that we do not present in Table 1, partly because large differences in sample sizes among studies can make the interpretation of the mean sample size per study misleading. In contrast, simply put, the ability to detect significant moderators of variation across species will depend on the number of species examined (no matter how precisely each species mean value is estimated).
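The last point can be illustrated with a small calculation. The sketch below assumes a simple random-effects meta-regression with one estimate per species, the same sampling variance for every species, and a known between-species variance (`tau2`); all numerical values and the function name are hypothetical. Under these assumptions the standard error of a moderator slope has a floor set by `tau2` that no amount of within-species precision can remove, whereas adding species lowers it.

```python
import math


def moderator_slope_se(moderator, sampling_variance, tau2):
    """Standard error of a meta-regression slope across species.

    Assumes one effect size per species, identical sampling variance for
    every species, and a known between-species variance tau2, so that all
    species receive the same weight 1 / (sampling_variance + tau2).
    """
    weight = 1.0 / (sampling_variance + tau2)
    x_bar = sum(moderator) / len(moderator)
    sxx = sum((x - x_bar) ** 2 for x in moderator)
    return math.sqrt(1.0 / (weight * sxx))


def evenly_spaced(k):
    """Hypothetical moderator values spread evenly between 0 and 1."""
    return [i / (k - 1) for i in range(k)]


tau2 = 0.05  # hypothetical between-species variance

# Ten species, each estimated with modest precision.
se_baseline = moderator_slope_se(evenly_spaced(10), 0.05, tau2)

# The same ten species, each estimated with perfect precision.
se_perfect_precision = moderator_slope_se(evenly_spaced(10), 0.0, tau2)

# Forty species, each estimated only as precisely as in the baseline.
se_many_species = moderator_slope_se(evenly_spaced(40), 0.05, tau2)
```

With these hypothetical values, quadrupling the number of species reduces the standard error of the slope more than eliminating sampling error altogether in the original ten species.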

Handy hints to find a sexual selection topic to meta-analyse

Here, we assist those keen to conduct a meta-analysis but unsure where to begin. We provide general strategies to identify fruitful areas for meta-analysis and identify neglected areas in sexual selection.

Coming up with ideas for meta-analyses: some random thoughts

There are some simple strategies to identify areas where data exists, but meta-analyses are lacking. Reviews that include tables that ‘vote count’ studies that did or did not report a significant result usually indicate that there is sufficient empirical data to conduct a meta-analysis. More generally, any area that is the subject of extensive narrative reviews with tables of studies is likely to be ripe for meta-analysis [e.g. condition-dependence of female mate choice (Cotton et al. 2006); whether male-male competition and female choice select for different traits (Hunt et al. 2009)]. It is generally fruitful to identify statements that are repeatedly encountered but only backed up by citing specific, often high-profile, studies. This usually suggests that there is, as yet, no published meta-analysis. Even if the statements are ‘obviously’ true, it is unlikely that they have been objectively validated, or the relative magnitude of the relationship quantified. For example, it is often claimed that there is an ‘ownership advantage’ or ‘prior residency effect’ during territorial fights (Kokko et al. 2006a). Table A1 in Kokko et al. (2006a) suggests this is true but, if so, there is still scope to account for variation in the size of the residency effect which is important to explain the evolution of residency and migration (Kokko 2011).

Given the similarities between meta-analysis and comparative analysis, it should also be obvious that almost every published comparative analysis could be re-analyzed and ‘converted’ into a meta-analysis simply by taking into account sampling variance/measurement error (Garamszegi and Møller 2010).
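What ‘taking into account sampling variance’ means in practice can be shown with a minimal sketch. It assumes the comparative dataset consists of one correlation coefficient per species, uses Fisher's z-transformation (whose approximate sampling variance is 1/(n − 3)), and contrasts an unweighted mean, as in a conventional comparative analysis, with an inverse-variance weighted mean. The species data are hypothetical, and phylogenetic non-independence is ignored here.

```python
import math


def fisher_z(r):
    """Fisher's z-transformation of a correlation coefficient."""
    return math.atanh(r)


def z_variance(n):
    """Approximate sampling variance of Fisher's z for sample size n."""
    return 1.0 / (n - 3)


# Hypothetical (correlation, sample size) pairs, one per species.
species_data = [(0.60, 8), (0.10, 120), (0.45, 15), (0.05, 200)]

z_values = [fisher_z(r) for r, _ in species_data]
weights = [1.0 / z_variance(n) for _, n in species_data]

# Comparative-analysis style: every species counts equally.
unweighted_z = sum(z_values) / len(z_values)

# Meta-analysis style: precise estimates count for more.
weighted_z = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)

# Back-transform to the correlation scale for reporting.
unweighted_r = math.tanh(unweighted_z)
weighted_r = math.tanh(weighted_z)
```

In this constructed example the large correlations come from the smallest samples, so the weighted mean is considerably lower than the unweighted one.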

Existing meta-analyses should be periodically updated. Many early sexual selection meta-analyses had small sample sizes, with little potential to test for moderator factors, and estimates of mean effects had broad confidence intervals. As data accumulates it is possible to address these limitations (e.g. compare Coltman and Slate 2003 and Chapman et al. 2009). Inspection of Table 1 reveals some obvious cases of meta-analyses that could be revisited. Another reason to redo earlier meta-analyses is that they were often generous in the types of primary studies included. Specifically, experimental and observational studies were often pooled, when only the former provide strong evidence for causality. The opportunity therefore exists to conduct meta-analyses with more restrictive datasets (ideally only experiments) to reach more robust conclusions about causality. For example, Slatyer et al. (2012) tested whether polyandry confers genetic benefits using only experimental studies where the number of mates was varied and the number of matings per female stayed constant. In contrast, Møller and Alatalo (1999) looked at a range of sources of experimental and correlative evidence for ‘good genes’ due to mate choice.

It is not always necessary to think big. Some topics are so well studied that it is an overwhelming task to conduct a meta-analysis (e.g. the effect of body size on fight outcome). In such cases, a more restrictive meta-analysis is a pragmatic solution, such as confining the meta-analysis to certain taxa. It is also worth noting that a meta-analysis does not have to resolve major theoretical questions. As with primary studies there is value to tackling modest questions. A good start is to list the basic questions you ask in your own study system. So, for example, based on our own empirical work we might ask: Do larger fiddler crabs wave faster than smaller ones? Do male and female crickets differ in their ability to withstand an immune challenge? How are different measures of immune function in insects correlated? Do male fish avoid larger sexual competitors when deciding which females to court? If you work on a well-studied species remember that even species-specific questions can be meta-analysed (see Table 1). At the extreme, it can even be informative to conduct a meta-analysis of a single study system if the same type of data is repeatedly collected over time (or space) (e.g. Milner et al. 2010).

Neglected areas in sexual selection

Any mismatch between the topics covered in Table 1 and how many primary studies there are in these areas provides a clue as to which areas are neglected. Here we provide a ‘shopping list’.

  1. Intra-sexual selection: Fighting behaviour is extensively studied and there are many theoretical models (Briffa and Sneddon 2010), but we did not locate meta-analyses explicitly addressing these models. Obvious topics are: the winner/loser effect; the residency advantage; the role of body or weapon size in determining fight outcome; the relationship between body size differences and fight duration or escalation.

  2. What traits show a life history trade-off with greater male attractiveness? Phenotypic correlations between male attractiveness and many key life history traits have been well researched but the general trends, and sources of variation, have not been identified by meta-analysis. For example, what is the relationship between attractiveness and: social dominance, post-copulatory reproductive success (i.e. fertilization under sperm competition) or metabolic rate? Ideally, one should address questions about evolutionary trade-offs using data on genetic correlations (e.g. Evans 2010) or effects of experimental manipulations (Reznick et al. 2000). Meta-analyses of these data provide the strongest evidence for which trade-offs are most important.

  3. What are the average values of key genetic parameters measured in sexual selection studies (review: Chenoweth and McGuigan 2010)? For example, what is the mean difference between the axis of directional selection on male sexual traits and the axis of maximum additive genetic variation in attractiveness (see Blows 2007)? What is the mean heritability of specific male traits? Can we explain variation in the genetic correlation between female mating preferences and preferred male traits or net attractiveness? What is the genetic correlation between male attractiveness and lifespan, or between the strength of female mating preferences and lifespan?

  4. Many studies measure phenotypic selection (review: Kingsolver et al. 2012), but there has been little exploration of the importance of the main population parameters invoked by theoreticians to account for variation in the intensity of sexual selection (but see Weir et al. 2011). For example, how well do the adult or operational sex ratio, variance in male mating success, population density, the difference in ‘time out’ between the sexes after mating, the division of parental care and type of ‘mating system’ explain variation in the strength of selection on sexual traits? More specifically, can these parameters account for variation in the level of mate choosiness or of direct physical competition for mates?

  5. There are many statements in the literature about key differences between sexual and non-sexual traits in allometry, phenotypic variability and level of phenotypic plasticity (condition-dependence) that have not yet been meta-analysed (reviews: Pomiankowski and Møller 1995; Cotton et al. 2004; Kodric-Brown et al. 2006; Bonduriansky 2007).

  6. Several experimental evolution studies have investigated the evolution of traits in both sexes in the presence/absence of sexual selection (i.e. enforced monogamy vs. polygamy) (review: Edward et al. 2010). The average effect of removing sexual selection on different types of traits has not yet been formally quantified.

  7. There is much interest in the effect of inbreeding on mate choice. Some key questions are: Do females prefer heterozygous males based on either conventional mate choice or biased paternity when mating multiply (Kempenaers 2007)? Does inbreeding have a more detrimental effect on sexually than naturally selected traits (Cotton et al. 2004)? Do females mate with or bias paternity towards non-related males (i.e. avoid inbreeding)? If not, is this because inclusive fitness gains outweigh the direct costs of inbreeding depression (Kokko and Ots 2006)?

  8. An extreme form of phenotypic plasticity in relation to sexual selection is to change sex. We are, however, unaware of any meta-analyses that directly investigate how the timing or direction of sex change is related to factors of theoretical importance (but see Molloy et al. 2008 for a meta-analysis of an intriguing pattern).

  9. Several meta-analyses have investigated which secondary sexual traits predict male mating success (see Table 1). But what predicts success under sperm competition? There are now several ejaculate or sperm traits that seem to be of potential importance (Snook 2005), but their relative influence has not been quantified using meta-analysis.

  10. There are general claims about the types of individuals that are preferred as mates that have not been subject to a meta-analysis. For example, do females prefer older males? Do males prefer virgin females or larger females? Is the Coolidge effect common?

  11. Sexual selection theory makes predictions about how much each sex will invest in parental care (Kokko and Jennions 2008), but there is no theory to predict the total reproductive effort by each sex. Do the reproductive budgets of females tend to exceed those of males (see Hayward and Gillooly 2011)?

What next?

We conclude by highlighting a few general issues. While it is tempting to conduct a large and complicated meta-regression with several predictor variables, it is worth bearing in mind the available data. In general, we think it is prudent to think in terms of the number of species in the database when trying to generalise. A meta-regression should be no more complicated than the equivalent model one would run for a primary analysis with repeated measurement of individuals to explain individual level traits (i.e. treat species as analogous to individuals). Although it is possible to look at moderator effects at lower levels using a hierarchical model (Hadfield and Nakagawa 2010), the important theoretical questions in most meta-analyses are nearly always about the distribution of effect sizes across species, and the power to test species level hypotheses depends on sampling at the appropriate level.

When interpreting a meta-analysis one should be cautious in drawing inferences from non-significant results given low statistical power and the possibility that studies of irrelevant species, traits or sets of conditions ‘dilute’ the effect size and make it difficult to detect whether the theory is sometimes applicable (or is generally applicable but only yields a non-zero effect in some circumstances). The opposite problem applies if researchers cherry-pick study species that appear most likely to support their favourite theory. For example, in sexual selection studies there is a tendency to work on species that seem most likely to generate the observed pattern. When studying mate choice, researchers have historically favoured species that are sexually dimorphic, lek or mate polygynously. Likewise, studies of ‘sexual conflict’ tend to be on species where males are readily observed harassing females. A related problem is an understandable tendency to work on common species where large sample sizes can be collected. ‘Missing’ species and variation in sampling effort are consequently non-randomly distributed across phylogenies (Garamszegi and Møller 2011).

Several of the meta-analyses in Table 1 look at numerous moderator variables (and even their interactions). The risk of ‘statistical fishing’ arises. Even if the researchers used, say, information criteria to select the best model, this does not eliminate the underlying problem that when more moderators are examined there is a greater chance that the final model will contain some moderators with a significant effect due to type I errors (review: Forstmeier and Schielzeth 2011; Freckleton 2009; for the use of model averaging see Grueber et al. 2011). Finally, as with any statistical model, the reader should ask whether the meta-regression was tested for outliers, or points that had undue leverage. Given low sample sizes, sensitivity analyses seem appropriate.
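The scale of the ‘statistical fishing’ problem is easy to quantify under simplifying assumptions. The sketch below assumes that each moderator is tested independently at the same significance level and that none has a true effect; real moderators are often correlated, so the figures are indicative only. The function names are ours.

```python
def familywise_error_rate(n_moderators, alpha=0.05):
    """Probability of at least one false positive among independent tests.

    Assumes every moderator is tested at level alpha, the tests are
    independent, and no moderator has a true effect.
    """
    return 1.0 - (1.0 - alpha) ** n_moderators


def bonferroni_alpha(n_moderators, alpha=0.05):
    """Per-test significance level under a simple Bonferroni correction."""
    return alpha / n_moderators


for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(m), 3), bonferroni_alpha(m))
```

Under these assumptions, testing ten moderators gives roughly a 40 % chance of at least one spuriously significant result, which is why a meta-regression with many moderators and few species should be read with caution.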

The existence of publication bias remains controversial (reviews: Jennions et al. 2012b; Nakagawa and Santos 2012). Selective reporting of results is clearly an issue that can affect meta-analysis (e.g. Cassey et al. 2004). In sexual selection studies, however, we suspect that narrow sense publication bias is unlikely to be a major problem, because of the wide range of views held by researchers (e.g. whether or not ‘good genes’ are important). The one notable exception to this statement is that there is likely to be an issue with genetic parameter estimates. Quantitative genetic experiments that report negative heritabilities or those close to zero are probably less likely to be published. This is not least because current research is focused on questions about multivariate genetic variation, genetic correlations and ‘constraints’ on evolution (review: Blows 2007). A lack of additive genetic variation means that there is no point reporting genetic correlations or engaging in more sophisticated analyses. This undermines the main framework used to write up quantitative genetic studies of sexual selection. Of course, claims about publication bias need to be directly tested. There are many indirect tests (see Nakagawa and Santos 2012), but these are always open to alternative explanations and are often inapplicable if there is heterogeneity in effect sizes among the studies being analyzed (which is true in most cases). Direct tests that track the publication fate of completed studies are required. To date, there has been only one direct test for publication bias in sexual selection (Møller et al. 2005).

Finally, meta-analysis is a powerful way to summarize what we know, but it should not blind us to other sources of evidence. It is easy to become blinkered and assume that the only useful data is that collected in a uniform manner, and amenable to statistical tests that allow effect sizes to be calculated. There are, however, often lines of evidence based on idiosyncratic experimental approaches that illuminate the general importance of a factor of interest, but are unsuited to inclusion in a formal meta-analysis (see table 6 in Slatyer et al. 2012). Similarly, there are cases where a mechanistic understanding of a biological phenomenon might provide a more powerful explanation for a relationship than the ‘black box’ approach often taken in sexual selection studies of gathering correlative data. Ideally, these forms of evidence should be presented in a synthetic review as a complement to quantitative meta-analysis. Systematic, methodological advances help science to become more rigorous, but progress is never an entirely mechanical or statistical endeavour.