Cross-national data on multiple aspects of gender disparities are used in academic research on the causes and consequences of gender inequality and other gender-related matters, but are also of growing relevance to policy discussions. These data are used for agenda setting by raising awareness of gender gaps, and help to keep governments and a variety of organizations accountable for compliance with certain standards and goals. In addition to fairly visible global initiatives, such as the Beijing Platform for Action of 1995, there is evidence that in many countries and regions, data on gender equality are becoming part of the policy process.Footnote 1 These are undoubtedly positive indirect effects of the production of data, for, as the adage goes, “What gets measured, gets done.” But the use of data on gender also draws attention to the issue of data quality and the need both to evaluate measures of gender disparities and to engage in efforts to generate improved measures. If the data that are used in the policy arena are of questionable quality, the legitimacy of using data in the pursuit of collective goals suffers. Thus, issues of measurement, though involving intricate methodological questions, are more than merely academic curiosities. It is of policy importance to get measurement right.

Many researchers and organizations have been involved with developing measures disaggregated by gender.Footnote 2 Several projects have focused on measuring specific issues, such as reproductive health (Abdullah 2000; Yinger et al. 2002) and violence against women (Walby 2007; Kelly et al. 2008). Others have taken advantage of the growing availability of a broad range of indicators, and sought to identify which indicators can be used to measure diverse aspects of social life and to identify areas where measures are lacking (UNECLAC 1999; UNECE 2001; UNIFEM 2002; UNESCAP 2003; UNRISD 2005; UNDP 2006). As a result of a broad collective endeavor, the amount of data that incorporates a gender perspective has increased considerably.

A distinctive facet of this challenge of measurement has been the development of indices, that is, compound measures that aggregate multiple indicators. The first indices that explicitly included gender-differentiated data were the United Nations Development Programme’s (UNDP) Gender-related Development Index and Gender Empowerment Measure, both launched in 1995 (UNDP 1995). These pioneering UNDP indices gained some public visibility, in part because they were reported on a yearly basis in the UNDP’s Human Development Reports. But they also triggered a more academic debate about how to construct a valid and reliable index with gender-differentiated data. Contributions to this debate have focused on assessing the two UNDP indices, highlighting some of their weaknesses and suggesting how the UNDP might improve its indices. Much effort by organizations and independent researchers has also gone into developing new indices. Thus, today, a number of indices with gender-differentiated data are broadly available.

Indices are likely to play a growing role in discussions of gender, both in academic and public policy circles. All indices serve to summarize a complex phenomenon by aggregating multiple indicators. This function is all the more important in the study of gender disparities because the successful response to the lack of information on gender in the 1980s has created a new problem: the intractable number of gender indicators. Indeed, the current availability of gender indicators—a comprehensive list of data sources reveals a total of 303 indicators (UNECLAC 2002: Annex 6)—makes it difficult to study a particular country, let alone to compare several countries, by considering all the available information. However, by itself an index is not a solution to the problem brought about by the overwhelming number of indicators. For an index to serve as a synthetic measure of gender disparities, the construction of the index must take as its point of departure a clearly formulated concept and comply with basic methodological criteria. And, as the ongoing discussion about how best to craft a gender index underscores,Footnote 3 the construction of a sound index with gender-differentiated data is not an easy task.

This article contributes to this discussion by offering a systematic assessment of the most visible gender indices that provide data on most countries of the world. We focus on five indices: the UNDP’s Gender-related Development Index (GDI), the UNDP’s Gender Empowerment Measure (GEM), Social Watch’s Gender Equity Index (GEI), the World Economic Forum’s Global Gender Gap Index (GGGI), and the OECD’s Social Institutions and Gender Index (SIGI).Footnote 4 The purpose of our comparison of these five indices is to provide useful guidance to users of gender indices and to contribute to the collective effort to further develop cross-national indices that aggregate gender-differentiated data.

Comparing these gender indices is not straightforward. The descriptions of the methodologies underlying these indices do not share a common terminology, do not adequately describe methodological choices, and often remain silent on potential sources of measurement error. Comparing the five gender indices is further complicated because not all index developers are explicit about the overarching concept they seek to measure. A careful inspection of the methodological choices underlying the construction of these indices often shows that what the index actually measures is different from what it purportedly measures. Accordingly, our assessment addresses two related questions: what do these indices actually measure? and, how valid are these indices?

To answer these questions, in Sect. 1 we present a framework for evaluating gender indices that integrates a large literature on the methodology of measurement and identifies the key methodological choices underlying gender indices. This framework serves as the organizing principle for the article. Sections 2, 3 and 4 focus on the distinct methodological tasks involved in the construction of the five indices. In these three sections, we describe and compare the choices made by the index creators, and we assess whether each choice contributes to a measure that is clear and consistent in meaning, whether an appropriate justification is offered for each choice, whether the index creators offer empirical tests that validate their choices, and whether tests we conduct suggest that these indices are well constructed.Footnote 5

By way of a conclusion, we offer our response to the question, what do these indices measure? suggesting how each index should be interpreted and what ambiguities cloud a clear interpretation. We also answer the question, how valid are these indices? summarizing the discussion of methodological strengths and weaknesses of each index. Finally, we outline the implications of our assessment for users and producers of gender indices.

1 A Framework for the Evaluation of Gender Indices

A framework to evaluate gender indices need not be overly elaborate. In the broadest terms, producing a gender index involves two stages—securing raw data on indicators, and then combining the values of these indicators into an index—tackling a series of distinct tasks, and meeting certain criteria (see Table 1). Yet, assessing the validity of gender indices is tricky. The validity of an index is influenced by many methodological choices and these choices are interrelated. Thus, a brief discussion of the methodological choices involved in index construction sets the stage for the analysis of gender indices that follows.

Table 1 A methodological framework for the evaluation of indices

1.1 From Overarching Concept to Raw Data on Indicators

All measures, inasmuch as they are theoretically interpretable, explicitly or implicitly take as a point of reference an overarching concept. Hence the first task in developing a measuring instrument—identifying the dimensions of the overarching concept—is the most purely theoretical one. The first desideratum of a measuring instrument is a clear, theoretically justified definition, consisting of a mutually exclusive and jointly exhaustive set of conceptual dimensions that avoid contamination by extraneous concepts.

The second task is the selection of indicators to measure the conceptual dimensions identified in the prior step. Since indicators are simply concrete, empirically grounded concepts, this task is an extension of the prior more purely conceptual task. Thus, the selected indicators should appropriately bridge the abstract concepts that are based on theorizing and the observables that are essential to measurement. The selected indicators should also capture the full meaning of the conceptual dimensions they purportedly measure, avoiding duplication and extraneous indicators.Footnote 6

The third task concerns the design of scales for each indicator. These, too, should be consistent with the concept being measured, and should offer as much nuance, that is, as many distinctions, as is justified. Finally, the fourth task focuses on the process whereby values are assigned to each indicator and should aim to maximize reliability, that is, that an independent attempt to assign values to the indicators would produce similar data; and validity, that is, that the values assigned to each indicator are true measures of the concept being measured. It is also important to ensure the replicability of this process, that is, that the data generating process can be reproduced, so that the reliability and validity of the data on the indicators can be verified by independent researchers.

1.2 From Raw Data on Indicators to Data on an Index

The choices involved in securing raw data on indicators set the initial parameters of an index. But the transformation of raw data on indicators into data on an index involves several additional choices. First, an index developer faces some choice regarding rescaling. Specifically, rescaling might be necessary if the indicators have been measured with different scales and might be called for if the form of the function between the values being aggregated and the aggregate value is non-linear. Second, the index developer must consider what weight should be assigned to each indicator. And, third, the index developer must address the relationship among indicators and determine the appropriate aggregation rule. Moreover, when producing indices with gender-differentiated data, each of these choices must be confronted in the context of two distinct steps: (i) the aggregation of the male and female values for each indicator (for example, the percentage of men and women holding seats in parliament), and (ii) the aggregation of the values of all indicators.
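The two-step logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the bounds used for rescaling, the equal weights, and all data values are hypothetical and are not taken from any of the five indices.

```python
from typing import Sequence


def rescale(value: float, lower: float, upper: float) -> float:
    """Min-max rescaling of a raw value onto a 0-1 scale (hypothetical bounds)."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return (value - lower) / (upper - lower)


def female_to_male_ratio(female: float, male: float) -> float:
    """Step (i): combine the female and male values of one indicator."""
    if male <= 0:
        raise ValueError("male value must be positive")
    return female / male


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Step (ii): additive, weighted aggregation of indicator values."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and aligned")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical country: (female, male) values for two indicators.
seats = female_to_male_ratio(20.0, 80.0)     # share of parliamentary seats
literacy = female_to_male_ratio(72.0, 90.0)  # literacy rate
index = weighted_mean([seats, literacy], [1.0, 1.0])  # equal weights assumed

# Rescaling matters when an indicator is not already unit-free.
rescaled = rescale(30.0, 0.0, 40.0)
```

With these made-up numbers, the two ratios are 0.25 and 0.8, and the equally weighted index is roughly 0.525. Changing the weights or the rule used in either step changes the result, which is why each choice needs its own justification.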

Here again, theory and testing are important. For example, the choice of aggregation rule should be theoretically justified. Moreover, sensitivity tests that assess how robust the aggregated values are to changes in the aggregation rule should be part of the validation of an index. Indeed, if the theory guiding the choice of aggregation rule is weak, and if alternative, but plausible, aggregation rules result in considerable changes in the data that are generated, the validity of the index should be questioned.
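One simple form such a sensitivity test could take is to rank the same set of countries under alternative aggregation rules and compare the rankings. The sketch below assumes hypothetical indicator scores for four hypothetical countries and uses a hand-written Spearman rank correlation that is valid only when there are no ties; it is not the test reported later in this article.

```python
from statistics import fmean, geometric_mean

# Hypothetical indicator scores (0-1 scale) for four hypothetical countries.
scores = {
    "A": [0.90, 0.30, 0.60],
    "B": [0.60, 0.55, 0.60],
    "C": [0.80, 0.75, 0.10],
    "D": [0.50, 0.50, 0.50],
}


def ranking(rule):
    """Country names ordered from highest to lowest aggregated score."""
    totals = {name: rule(values) for name, values in scores.items()}
    return sorted(totals, key=totals.get, reverse=True)


def spearman(order_a, order_b):
    """Spearman rank correlation between two rankings without ties."""
    n = len(order_a)
    position_b = {name: i for i, name in enumerate(order_b)}
    d2 = sum((i - position_b[name]) ** 2 for i, name in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n**2 - 1))


by_mean = ranking(fmean)          # additive rule
by_geo = ranking(geometric_mean)  # multiplicative rule
by_min = ranking(min)             # weakest-link rule

rho_geo = spearman(by_mean, by_geo)
rho_min = spearman(by_mean, by_min)
```

In this toy example the ranking under the arithmetic mean correlates only moderately with the ranking under the geometric mean, and not at all with the ranking under the minimum. A low correlation of this kind is the sort of evidence that would call the choice of aggregation rule into question.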

This framework is not exhaustive, but it presents the key issues involved in the production of data, and draws attention to choices—such as those involved in aggregation—that are rarely treated explicitly, and highlights issues—such as the many links between theory and data—that are frequently overlooked. Moreover, this framework is suitable to the job at hand: a comparison of gender indices and an assessment of the strengths and weaknesses of these indices. Thus, we turn next to an analysis of gender indices, starting with the choices involved in securing data on the indicators used, addressing subsequently the two-step aggregation process that completes the production of these indices with gender-differentiated data.

2 From Overarching Concept to Raw Data on Indicators

2.1 Identifying Conceptual Dimensions

The creators of the UNDP’s GDI and GEM, Social Watch’s GEI, the World Economic Forum’s GGGI, and the OECD’s SIGI identified the dimensions of their indices’ overarching concepts in a similar way. Each overarching concept is disaggregated into three or four mutually exclusive conceptual dimensions (see column 2 in Table 2). This helps to orient the subsequent task of indicator selection. Nonetheless, there are notable differences in terms of the level of theorization of the overarching concept, and the content of the conceptual dimensions, of each index. And these differences are consequential.

Table 2 Indices with Gender-differentiated Data I: Key Methodological Features

The GDI has a key virtue: its clarity of purpose. The index’s overarching concept, development, is well theorized and is disaggregated and measured in much the same terms as the UNDP’s Human Development Index (HDI).Footnote 7 As a result, the GDI is relatively easy to interpret. But the explicit grounding of the GDI in a theorized concept also draws attention to a key limitation. The concept of human development is very broad, encompassing factors essential to the “formation of human capabilities” that are included in the HDI and the GDI, but also a series of “political, economic and social freedoms” (UNDP 1990: 10, Sen 1999: 3–53) that, due to the lack of data on the relevant indicators at the time the HDI was constructed (Fukuda-Parr 2003: 119), are excluded from the HDI as well as the GDI. Thus, the conceptual critique of the HDI, its failure to include several key conceptual dimensions of human development (Fukuda-Parr 2003), applies equally to the GDI.Footnote 8

The other four indices appear to take the GDI as a point of reference and, partly explicitly, partly implicitly, seek to overcome the narrowness of the GDI. But not all these indices are as clearly rooted in a well-theorized overarching concept as the GDI. The first index to incorporate conceptual dimensions excluded from the GDI was the UNDP’s own Gender Empowerment Measure (GEM), launched along with the GDI in 1995. This index explicitly addresses factors excluded from the GDI, such as rights and access to power. But the concept of empowerment used in the GEM was not elaborated and the relationship between the GDI and the GEM is unclear. Subsequently, the Social Watch’s Gender Equity Index (GEI), first released in 2004, and the World Economic Forum’s Global Gender Gap Index (GGGI), introduced in 2006, offered, as their key conceptual innovation, the inclusion of conceptual dimensions presented separately in the two UNDP indices.Footnote 9 However, as in the case of the GEM, the overarching concepts of the GEI and the GGGI are not explicitly discussed and thus their developers do not clarify what these indices aim to measure and do not offer a theoretical rationale for the conceptual dimensions included in these indices.

Indeed, besides the GDI, only the OECD’s Social Institutions and Gender Index (SIGI), made public in 2009, is based on an explicit framework, built around the overarching concept of social institutions (Morrisson and Jütting 2005: 1066–70). This helps to highlight the purpose and novelty of this index. The SIGI does not seek to subsume previous efforts, and thus drops all the conceptual dimensions introduced by the two UNDP indices, the GEI and the GGGI. Rather, its contribution lies in offering a supplementary measure that encompasses a range of issues pertaining to social institutions and norms that, even if not entirely comprehensive (for example, discrimination in the labor market is not addressed), are ignored by other indices. Thus, the concept of social institutions served as a useful anchor for the SIGI and helps to clarify the meaning of the index.

In sum, the history of gender indices involved a series of efforts to include broader dimensions of social life that had not been included in the GDI. In this regard, it is tempting to posit that all five indices could be understood as measures of the concept of human development, as elaborated by Sen (1999) and Nussbaum (2000), and that, even though none of these indices offer a jointly exhaustive measure, a combination of the World Economic Forum’s GGGI—which subsumes the two UNDP indices and the Social Watch’s GEI—and the OECD’s SIGI would be an important step in that direction. But, because the GEM, GEI and GGGI in particular do not rest on as clear an overarching concept as the GDI, it is difficult to compare these indices to each other conceptually.Footnote 10 Moreover, there are a number of other important differences in the choices that go into developing these indices and, as the analysis that follows will show, the meaning of each index is also affected by these choices.

2.2 Selecting Indicators

When assessing the selection of indicators to measure each of the conceptual dimensions identified in the prior step, some interesting points emerge (see column 3 in Table 2). The OECD’s SIGI breaks new ground by calling for indicators that address key spheres of social life in which gender differences could be a factor even if they are not readily available, i.e., the scarcity of readily available data does not constrain the selection of indicators. Indeed, the SIGI shows that data availability does not need to be seen as an overwhelming constraint on how indices are constructed and that new data can be generated to measure certain indicators that are considered central to an index’s overarching concept.

In contrast, the other four indices rely largely on the same indicators, such as Literacy Rate; Primary (net), Secondary (net), and Tertiary (gross) Level Enrolment; Earned income (PPP US$); Labor Force Participation; Professional and Technical Positions; and Parliamentary Seats (see Table 3).Footnote 11 The choices are undoubtedly driven by the desire to include indicators that cover a large number of countries, are publicly available, and are regularly updated. But there are weaknesses associated with these choices of convenience.

Table 3 Indices with gender-differentiated data II: a comparison of indicators

One problem is that the selected indicators fail to capture the full meaning of the conceptual components they are supposed to measure. As an example, Parliamentary Seats, by itself, seems too narrow a measure of the GEM’s conceptual component Political Participation and Decision-making. Another problem is the use of extraneous indicators, which are measures of conceptual components other than those they are claimed to measure. Indeed, it is clear that the indicator Legislators, Senior Officials and Managers—used in the GEM to measure the conceptual component Economic Participation and Decision-making, and in the GGGI to measure the conceptual component Economic Participation and Opportunity—is problematic in that political information about legislators is presented as a measure of an economic dimension.Footnote 12 In sum, the selection of indicators for which data are easily accessible is associated with the common failure to tap the full meaning of each conceptual dimension, a problem most evident in the GDI, and other problems, such as the inclusion of extraneous indicators.Footnote 13

2.3 Designing Indicator Scales

The next choice in the production of an index, the design of the scales used in collecting raw data for each indicator, is a moot question for most indices. Indeed, because most indices largely rely on data gathered by others, as pointed out above, they simply operate with scales designed by the original data collectors that consistently assess the situation of men and women. In contrast, the developers of the SIGI do make distinct scaling choices. Indeed, the SIGI differs from the other indices in that, because it seeks to respond to the laudable goal of generating new data on some hard-to-measure indicators, its elaboration includes the specific design of scales for new indicators (see column 5 in Table 2).Footnote 14 Nonetheless, from the perspective of the ultimate goal of constructing an index, the SIGI’s scales are quite problematic. On the one hand, though the designed scales reflect sensitivity about the number of distinctions that can justifiably be made in light of available information (Jütting et al. 2008: 68–70, 83–84; OECD Development Centre 2009b), they are all ordinal scales. Thus, their use in an index is highly questionable. On the other hand, most of the SIGI’s scales are one-sided, addressing inequalities that disadvantage girls or women. Potential inequalities that disadvantage boys and men are thus ignored and the possibility of empirically studying gender equality broadly understood is closed.Footnote 15 Furthermore, the interpretation of the index is muddled due to the combination of different types of indicators: some, such as Parental Authority and Inheritance, that posit a relationship between the situations of men and women, and others, such as Female Genital Mutilation and Violence Against Women, that focus on restrictions of the rights of girls and women while making no contrast to the situations of boys and men. As a result, the SIGI’s indicator scales do not offer a basis for systematically collecting gender-disaggregated data and fuse, in a largely implicit manner, measures of gender inequality and measures of distinctive women’s rights.

2.4 Assigning Values to Indicators

Since most of the gender indices rely on readily available data sets (the clearest exception being the OECD’s SIGI), the choice concerning the way values are assigned to indicators, much as the choice concerning indicator scales, largely follows the choice regarding which indicators to include. The concerns about the reliability and comparability of many of the data sets used in these indices expressed by Srinivasan (1994) deserve attention.Footnote 16 Nonetheless, when seen from a broad perspective, the data on these indicators are relatively reliable (see column 4 in Table 2). Indeed, the most serious doubts regarding the process of data generation pertain less to available data sets than to the efforts by the GGGI and the SIGI developers to generate their own data on indicators.

Data on the GGGI’s indicator Wage Equality for Similar Work are collected through a question in an expert survey (Porter and Schwab 2008: question 9.13), and their reliability is unknown. The degree of agreement among the respondents is not reported and, since the data are not publicly available, independent researchers are not able to assess the reliability of the data. And the data on the SIGI’s indicators, generated through expert coding, are even more troubling. Expert coding is a well-established method of generating data. Moreover, the sources of information used to code the SIGI’s indicators are made public, and an accompanying narrative provides valuable background information used in the coding of indicators (Jütting et al. 2008: 68–70, 83–84; OECD Development Centre 2009a). But, as some of the indicator scales are vaguely formulated (Jütting et al. 2008: 83–84; OECD Development Centre 2009b), and as the SIGI’s developers do not report the results of an intercoder reliability test, we are left with serious doubts about the validity and reliability of the data on the SIGI’s indicators. Given that this method of expert coding is used for nine of the SIGI’s twelve indicators, this is a major weakness.
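To make concrete what an intercoder reliability test involves, the following sketch computes Cohen's kappa, one common statistic for the agreement between two coders beyond what chance would produce. The codes are hypothetical, the three-point scale is an assumption, and nothing here reproduces the SIGI's actual coding procedure. The unweighted kappa shown treats categories as nominal; for ordinal scales a weighted variant may be more appropriate.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Unweighted Cohen's kappa for two coders rating the same cases."""
    if not coder_a or len(coder_a) != len(coder_b):
        raise ValueError("both coders must rate the same non-empty set of cases")
    n = len(coder_a)
    # Share of cases on which the two coders assign the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, given each coder's marginal frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes on a three-point ordinal scale for eight countries.
coder_1 = [0, 1, 2, 2, 0, 1, 1, 2]
coder_2 = [0, 1, 2, 1, 0, 1, 2, 2]

kappa = cohens_kappa(coder_1, coder_2)
```

Here the coders agree on six of eight cases, yet kappa is only about 0.62 once chance agreement is discounted. Reporting a statistic of this kind would let users judge how much weight expert-coded indicators can bear.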

2.5 An Interim Assessment

Summing up the discussion thus far, the following points deserve emphasis. The UNDP’s GDI and the OECD’s SIGI take, as a point of departure, a theoretically developed overarching concept and what these indices seek to measure is relatively clear. But the theoretical ideas guiding the design of these two indices are poorly executed; in the case of the GDI, because its indicators fail to capture a large part of the meaning of the concept of development and, in the case of SIGI, because the new data it relies on are of questionable quality. In contrast, the UNDP’s GEM, the Social Watch’s GEI and the World Economic Forum’s GGGI lack clear theoretical foundations. Hence, the meaning of these three indices has to be uncovered, inductively, by considering the content of the selected indicators. Moreover, any theoretical grounds for validating the choice of indicators included in these indices are surrendered. In short, though the creation of these indices reveals a considerable degree of self-awareness about the choices regarding conceptual dimensions, indicators, indicator scales, and the assignment of values to indicators, the discussion thus far also reveals weaknesses in all five indices.

3 From Raw Data on Indicators to Data on an Index I: The Rule to Aggregate Male and Female Values for Each Indicator

Indices with gender-differentiated data have a common goal: to move beyond simple averages that overlook possible differences in the situation of women relative to men. This goal simplifies the challenge of forming an index because, in most cases (the two UNDP indices and the OECD’s SIGI are partial exceptions), it more or less automatically resolves the need to ensure that all indicators are measured with the same units. Thus, the key decisions tackled in the first step of transforming the raw data on indicators into data on an index revolve around the issue of how to define a standard of parity, how to weight deviations from the posited standard, and whether and how to aggregate relational and absolute measures of attainment.

With regard to these choices, the gender indices under review—with the exception of the SIGI—share some features. The indices move toward a relational measure by positing a standard to be used in comparing the data on women and men for each indicator—opting, practically as a default option, for a 50/50 ratio as the standard of parityFootnote 17—and calculating deviations from this standard (see column 6 in Table 2). But the gender indices differ with regard to two key choices: whether to assign equal weights to deviations from the posited standard favoring men and women; and whether to combine a relational measure with an absolute measure and, if so, how to combine these measures.

Concerning the weight assigned to deviations from the chosen standard, two options can be distinguished. Both UNDP indices give equal weight to deviations from the standard of parity that favor men or women. That is, they measure inequality broadly understood. In contrast, the GEI and GGGI assign no weight to deviations from the standard of parity that favor women and only count inequalities that disadvantage women. That is, they measure inequality only from the perspective of women.Footnote 18 As to the combination of relational and absolute measures, two options can be distinguished. The GDI and GEM seek to retain the information about the absolute level of attainment of goals,Footnote 19 and aggregate a relational and absolute measure using the harmonic mean of these two measures, an aggregation rule that assigns a “moderate” weight to their relational measure vis-à-vis the absolute measure (UNDP 1995: 73; 2007: 358–60).Footnote 20 In contrast, the GEI and GGGI rely solely on relational measures, and do not take the absolute level of attainment of goals into consideration.
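The difference between counting deviations in both directions and counting only those that disadvantage women can be illustrated with two small functions. Both are hypothetical sketches: `truncated_ratio` caps the female-to-male ratio at parity, in the spirit of the one-sided treatment described above, while `symmetric_gap` is a generic two-sided measure chosen for illustration and is not the UNDP's formula.

```python
def truncated_ratio(female: float, male: float) -> float:
    """Female-to-male ratio capped at 1, so advantages for women earn no credit."""
    if male <= 0:
        raise ValueError("male value must be positive")
    return min(female / male, 1.0)


def symmetric_gap(female: float, male: float) -> float:
    """Distance from a 50/50 split, identical whichever group is ahead."""
    total = female + male
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return abs(female / total - 0.5)


# Hypothetical enrolment figures in two mirror-image countries.
women_ahead = (60.0, 40.0)
men_ahead = (40.0, 60.0)

one_sided = [truncated_ratio(*women_ahead), truncated_ratio(*men_ahead)]
two_sided = [symmetric_gap(*women_ahead), symmetric_gap(*men_ahead)]
```

Under the one-sided rule the country where women are ahead is scored as being at parity (1.0) and the mirror-image country is scored at about 0.67. Under the two-sided rule both countries show the same gap of 0.1.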

The options selected by the index developers are relatively well justified and the aggregation process is largely replicable, but basic questions can be raised about some choices. With the two UNDP indices, the decision to give equal weight to deviations from the standard of parity that favor men or women in both the GDI and the GEM is justified in light of the stated purpose of the UNDP to address gender equality in general as opposed to, say, the disadvantages of women relative to men. But the reliance on a strict 50/50 ratio in, for example, the GEM’s Parliamentary Seats indicator could be questioned. For example, if women fall short of gaining 50 percent of the seats in parliament, this may be due to the lack of a level playing field but could also be attributed to a host of unrelated factors. This standard is more consistent with a measure focused on equality of results than a measure of “opportunities,” which the UNDP suggests the GEM to be (UNDP 2007: 360).Footnote 21 Finally, the use of the harmonic mean to aggregate relational and absolute measures in the GDI and GEM is open to questioning. The rationale for this choice is spelled out in a paper by Anand and Sen (2003b: 211–13), who explain that this aggregation rule is meant to capture the degree of inequality aversion in a society and impose a penalty for deviations from equality, such that the measure that is produced is, to use the UNDP’s vocabulary, moderately lower than a measure solely based on a simple average. But no justification is offered for assigning a moderate weight to inequalities as opposed to some other weight; indeed, we are left with the impression that this weighting choice is arbitrary.Footnote 22
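The role of the inequality-aversion parameter can be made concrete with a population-weighted generalized mean, a family of rules in which the harmonic mean is one special case. The sketch below assumes this generalized-mean form and the parameter name `epsilon`; the function name, the equal population shares, and the input values are hypothetical, and the code is not a reproduction of the UNDP's published calculation.

```python
import math


def equally_distributed_index(female: float, male: float,
                              female_share: float = 0.5,
                              epsilon: float = 2.0) -> float:
    """Population-weighted generalized mean of order (1 - epsilon).

    epsilon = 0 gives the arithmetic mean, epsilon = 1 the geometric mean,
    epsilon = 2 the harmonic mean; larger values approach the minimum.
    """
    if female <= 0 or male <= 0:
        raise ValueError("attainment values must be positive")
    male_share = 1.0 - female_share
    if math.isclose(epsilon, 1.0):
        # Limit case of the formula: weighted geometric mean.
        return math.exp(female_share * math.log(female)
                        + male_share * math.log(male))
    power = 1.0 - epsilon
    return (female_share * female**power
            + male_share * male**power) ** (1.0 / power)


# Hypothetical attainment levels for women and men on a 0-1 scale.
results = {eps: equally_distributed_index(0.4, 0.8, epsilon=eps)
           for eps in (0.0, 1.0, 2.0, 50.0)}
```

With female and male attainment of 0.4 and 0.8, the index falls from 0.6 (no penalty) to about 0.566, 0.533, and 0.406 as `epsilon` rises through 1, 2, and 50. Seen this way, a "moderate" penalty is one point on a continuum, and the choice of that point requires its own justification.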

Regarding the GEI, though its use of a 50/50 ratio as a standard of parity is a reasonable choice for most indicators, the same question raised about the GEM’s Parliamentary Seats indicator applies to it. In turn, though the GEI’s assignment of no weight to deviations from the standard of parity favoring women is justified by Social Watch’s stated interest to draw “conclusions about critical deficiencies in what women are able or allowed to do” (Social Watch 2005: 77), it is important to note that such a decision makes it less general than the UNDP indices and hence limits the index’s use. Additionally, though the decision to rely solely on a relational measure, a notable difference between the GEI and GDI, is founded in Social Watch’s desire to empirically assess the relationship between gender equity and absolute levels of development (Social Watch 2005: 76–77), such a decision is not problem-free. This decision gets around the problem the UNDP faces in assigning weights to relative and absolute measures, but it creates a new problem, most evident with regard to the Parliamentary Seats indicator. The problem is that a country may or may not have a parliament and, to get around an awkward conclusion,Footnote 23 Social Watch opts for a costly solution: countries lacking a parliament are categorized as missing data on the Parliamentary Seats indicator, which leads either to dropping that country from the index (a choice that diminishes the GEI’s substantive utility) or to an ad hoc reweighting of the indicators of the Empowerment conceptual dimension for that particular country (a choice that reduces the GEI’s comparability across cases).

Finally, as to the GGGI, the use of a 50/50 ratio for the Parliamentary Seats and Heads of State indicators raises similar concerns to those mentioned above.Footnote 24 The option to assign no weight to deviations from the standard of parity favoring women entails, as with the GEI, a loss of generality that is somewhat cryptically played down by the index developers with little more than the claim that they find such a choice “more appropriate for [their] purposes” (Hausmann et al. 2007a: 5). In contrast, the decision to rely solely on a relational measure is better justified, in terms similar to those put forward by Social Watch (Hausmann et al. 2007a: 3). But, as with the GEI, the sole reliance on relational measures creates its own problems. Specifically, when a country has no parliament, the GGGI drops this indicator and reweights the remaining indicators of the Political Empowerment conceptual dimension, in what amounts to an ad hoc procedure that reduces the comparability across cases of the data.
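The effect of dropping a missing indicator and reweighting the rest can be shown with a short sketch. The indicator names follow the text, but the weights and scores are hypothetical and the function is an illustrative assumption, not the GGGI's published procedure.

```python
def subindex(values, weights):
    """Weighted mean over the indicators that are available.

    A value of None marks a missing indicator; its weight is redistributed
    proportionally across the remaining indicators.
    """
    available = [(v, w) for v, w in zip(values, weights) if v is not None]
    if not available:
        raise ValueError("at least one indicator must be available")
    total = sum(w for _, w in available)
    return sum(v * w for v, w in available) / total


# Hypothetical weights: parliamentary seats, ministerial posts, heads of state.
weights = [0.5, 0.3, 0.2]

with_parliament = subindex([0.2, 0.4, 0.0], weights)
without_parliament = subindex([None, 0.4, 0.0], weights)
```

The two hypothetical countries have identical scores on the indicators they share, yet they receive subindex values of about 0.22 and 0.24. For the country without a parliament the remaining indicators carry effective weights of 0.6 and 0.4 rather than 0.3 and 0.2, so its score is no longer built on the same basis as the others.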

The OECD’s SIGI is an exception with regard to aggregating male and female values for each indicator. The SIGI’s approach is close to the one adopted by the GEI and GGGI, in that it disregards potential inequalities favoring women. But, unlike the other indices, the SIGI does not use one criterion for aggregating male and female values consistently, given that some of the scales focus solely on women and make no comparison with the situation of men. Finally, as mentioned above, data on men and women are not collected separately for the SIGI’s indicators; rather, the comparisons are embedded in the scales of each indicator. Thus, unlike the other four indices, independent researchers are unable to test the robustness of SIGI’s choice of aggregation rule.Footnote 25

The possibility of conducting sensitivity analyses in the case of four of the five gender indices notwithstanding, it is striking that none of the index developers provides tests to validate their choice of rule to aggregate the male and female values of each indicator. Such tests are particularly relevant to the choice of aggregation used in the two UNDP indices. Indeed, given that the weight of the penalty assigned to deviations from equality in the GDI and GEM is not based on any theoretical grounds, it is important to verify what impact this choice has on the indices.Footnote 26 And when we performed such a test, the results were not encouraging (see Table 4). Though the purpose of building the GDI was to address inequalities, the GDI, which differs from the HDI only in its choice of the harmonic mean over the arithmetic mean, is not all that different from the HDI (a measure that does not address inequalities). Assigning the harshest possible penalty for inequalities, as captured by the minimum aggregation rule, may have been more suitable to the purpose. In contrast, when we calculate the GEM using alternative aggregation rules—whether the slightly less-punishing geometric mean or the vastly more-punishing minimum—we show that the GEM is not robust to changes in its aggregation rule. In other words, these tests raise questions about the choice of the use of the harmonic mean as an aggregation rule in both the GDI and GEM.
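The logic of such a sensitivity test can be sketched in a few lines. The example below compares the four aggregation rules mentioned in the text (arithmetic mean, geometric mean, harmonic mean, and minimum) on one pair of hypothetical female and male indicator values. It assumes equal population shares for women and men, which is a simplification, and the function name and values are illustrative only.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def aggregate(female, male, rule):
    """Combine a female and a male indicator value under a given rule.

    Assumption: women and men are weighted equally (equal population
    shares). Values must be positive for the geometric and harmonic means.
    """
    values = [female, male]
    if rule == "arithmetic":   # no penalty for inequality
        return fmean(values)
    if rule == "geometric":    # mild penalty
        return geometric_mean(values)
    if rule == "harmonic":     # stronger penalty
        return harmonic_mean(values)
    if rule == "minimum":      # harshest possible penalty
        return min(values)
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical indicator values on a 0-1 scale.
female_value, male_value = 0.4, 0.8
for rule in ("arithmetic", "geometric", "harmonic", "minimum"):
    print(rule, round(aggregate(female_value, male_value, rule), 3))
```

When the two values are equal, all four rules return the same number; the rules diverge only as the gap between the female and male values grows, which is why the choice among them amounts to a choice of penalty for inequality.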

Table 4 The impact of alternative aggregation rules I: the aggregation of male and female values

4 From Raw Data on Indicators to Data on an Index II: The Rule to Aggregate the Values of All Indicators

Turning to the last key choice in the production of gender indices—combining the values of all indicators—the five gender indices share some features. First, they all rely on a distinction between the level of conceptual dimensions and the level of indicators (see columns 2 and 3 in Table 2), and aggregate the group of indicators corresponding to conceptual dimensions into subindices before aggregating these subindices into a single index.Footnote 27 Second, they all rely on additive aggregation rules. Beyond this commonality, there are some significant differences (see column 7 in Table 2). The GEM and the GEI use simple averages, that is, an additive and unweighted aggregation rule, at both the level of indicators and the level of conceptual dimensions. The GDI follows suit, except that—following the HDI—it assigns different weights to the indicators that are part of the Knowledge conceptual dimension.Footnote 28 But the other two indices are quite different.

The GGGI is similar to the GEM and the GEI except that it breaks significantly with the equal-weight rule at the level of indicators. The GGGI’s indicators are assigned weights according to their standard deviations relative to those of other indicators of the same conceptual dimension: higher weights are assigned to indicators with lower standard deviations, and the weighting scheme calculated for the 2006 index is used in subsequent versions of the index (Hausmann et al. 2007a).Footnote 29 In turn, the SIGI uses the most complicated aggregation procedure (Branisa et al. 2009). First, the developers of the SIGI conducted a dimensionality test on the indicators within each conceptual dimension and, finding that the Missing Women indicator did not correlate with the other indicators of the Physical Integrity conceptual dimension, placed the Missing Women indicator under a new conceptual dimension labeled Son Preference. Second, to calculate the value of the subindices corresponding to each conceptual dimension, they use a factor-analytical technique and assign higher weights to indicators that correlate more strongly with the others.Footnote 30 Third, before aggregating the values of the subindices using an additive and unweighted aggregation rule, each subindex is rescaled using a non-linear (square function) transformation, which reduces the degree to which high levels of inequality on one conceptual dimension can be compensated with low levels of inequality on other conceptual dimensions.
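The effect of the square-function rescaling can be seen in a small sketch. It assumes subindices on a 0-1 scale where higher values mean more inequality, and it compares a plain unweighted average with an unweighted average of squared subindices. The function names and the two country profiles are hypothetical.

```python
def linear_index(subindices):
    """Unweighted average of subindex values (full compensation)."""
    return sum(subindices) / len(subindices)


def squared_index(subindices):
    """Unweighted average of squared subindex values.

    Squaring before averaging penalizes high inequality on any single
    dimension, so it can be offset only partially by low values elsewhere.
    """
    return sum(s ** 2 for s in subindices) / len(subindices)


# Two hypothetical countries with the same plain average of 0.5.
even_profile = [0.5, 0.5, 0.5, 0.5, 0.5]
uneven_profile = [0.9, 0.1, 0.5, 0.5, 0.5]

print(linear_index(even_profile), linear_index(uneven_profile))
print(squared_index(even_profile), squared_index(uneven_profile))
```

Under the plain average the two profiles are indistinguishable, whereas the squared version assigns a higher (worse) score to the profile in which inequality is concentrated in one dimension.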

The way index developers approach these choices has some merits. A plausible theoretical justification is offered or can be offered for many choices. The use of an additive aggregation rule is reasonable as all indices posit the existence of spheres that are neither interchangeable (gains in, say, the political sphere cannot make up or substitute for problems in, say, the economic sphere), nor prone to contamination (gains in one sphere are not overridden by problems in other spheres). Likewise, the rescaling using non-linear functions in the GDI and the SIGI is well justified (Anand and Sen 2003a: 140–41; Branisa et al. 2009: 7–8). But two weaknesses affect all five indices: one concerns the weighting scheme, and a second pertains to the testing of the selected aggregation rule against plausible alternatives.

The most common approach to weighting choices—used in the GDI, GEM, and GEI—is purely theory driven, meaning that weights are simply assigned by the index developers. And the selected choice, with only one exception, is to give equal weight to all indicators within a conceptual dimension and to apply equal weights across the conceptual dimensions. This weighting choice seems to be largely a default option used less out of reasoned analysis than a sense that agnosticism regarding all possible weighting schemes somehow constitutes a justification for assigning equal weights to indicators and conceptual dimensions.Footnote 31 In addition, index developers appear to be unaware of how the weights of indicators are affected by the formal weighting scheme that is adopted but also by the number of conceptual dimensions identified and the number of indicators selected to measure each conceptual dimension. The choices concerning the number of conceptual dimensions and indicators are never addressed, even though these two factors indirectly but strongly affect the weighting of indicators.Footnote 32 Indeed, we show that when these choices are factored into the weighting of the GDI’s, the GEM’s, and the GEI’s indicators (see columns 2, 3 and 4 in Table 5), the actual weight of each indicator within each index is anything but equal and the same indicator has quite different weights in different indices. Ironically, the index developers that opt for a theory-driven weighting scheme pay little attention to the multiple choices that affect the weight of each indicator and hence offer very little theoretical justification for their choices.
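The way conceptual framing indirectly determines indicator weights can be made concrete with a short sketch. It assumes equal weights across conceptual dimensions and equal weights across the indicators within each dimension; the dimension and indicator names describe a hypothetical index, not any of the five reviewed here.

```python
from fractions import Fraction


def effective_weights(structure):
    """Return each indicator's share of the overall index.

    `structure` maps conceptual dimensions to lists of indicator names.
    Assumption: equal weights across dimensions and, within each
    dimension, equal weights across indicators.
    """
    dimension_weight = Fraction(1, len(structure))
    weights = {}
    for indicators in structure.values():
        for indicator in indicators:
            weights[indicator] = dimension_weight * Fraction(1, len(indicators))
    return weights


# Hypothetical index: three dimensions with one, two, and three indicators.
hypothetical_index = {
    "dimension_a": ["a1"],
    "dimension_b": ["b1", "b2"],
    "dimension_c": ["c1", "c2", "c3"],
}

for name, weight in effective_weights(hypothetical_index).items():
    print(name, weight)
```

Although every formal weighting decision in this sketch is "equal", the single indicator of the first dimension ends up with three times the weight of each indicator in the third dimension.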

Table 5 Indices with gender-differentiated data III: a comparison of indicator weights (%)

The GGGI and SIGI exemplify a different approach to weighting choices. They assign an equal weight to each conceptual dimension. But they combine this theory-driven choice with data-driven choices at the indicator level. This mixed approach is interesting and potentially fruitful. But it is important to note that data-driven choices entail assumptions and that the legitimacy of the assumptions used in the GGGI and SIGI is questionable. The GGGI’s data-driven weighting scheme assigns higher weights to indicators with lower standard deviations (Hausmann et al. 2007a). But it is not clear, for example, why a ban on women holding seats in parliament in a certain country should be considered less of a problem simply because many other countries also have one. Furthermore, the GGGI’s use of weights calculated for the 2006 index in subsequent versions of the index (the rationale being the desire to ensure the comparability of the index over time) has the effect of allowing an arbitrarily selected year to determine whether certain problems will be given more or less weight in the future.Footnote 33 When the weights for indicators thus derived are combined with the theory-driven weights at the level of conceptual dimensions (see the last column in Table 5), the result is a weighting scheme that is open to many questions. In turn, the SIGI’s data-driven weighting scheme, which assigns higher weights to indicators that correlate more, hinges on the assumption that because a problem (for example, early marriage) does not correlate as highly as two other problems (for example, polygamy and equal rights regarding inheritance) it is less of a problem. In sum, the weighting schemes used by the five gender indices rest on a shaky foundation.Footnote 34
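A data-driven scheme of the kind attributed to the GGGI can be sketched as follows. The sketch assumes that weights within a conceptual dimension are proportional to the inverse of each indicator's standard deviation across countries and are normalized to sum to one. This is an illustrative reading of "higher weights for lower standard deviations", not a reproduction of the published GGGI procedure, and the indicator names and values are hypothetical.

```python
from statistics import pstdev


def inverse_sd_weights(indicator_values):
    """Weight indicators within one dimension by inverse standard deviation.

    `indicator_values` maps indicator names to lists of country values.
    Assumption: weight is proportional to 1 / standard deviation, then
    normalized. An indicator with zero spread would need special handling.
    """
    inverse = {name: 1.0 / pstdev(values) for name, values in indicator_values.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


# Hypothetical female-to-male ratios for four countries.
dimension = {
    "parliament_seats": [0.00, 0.05, 0.10, 0.05],     # little cross-country spread
    "labour_participation": [0.2, 0.9, 0.5, 0.7],     # large cross-country spread
}

weights = inverse_sd_weights(dimension)
print(weights)
```

Because the weights depend entirely on the sample used to compute the standard deviations, fixing them in a base year carries that year's cross-country distribution into every later edition of an index built this way.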

Last of all, and most generally, the developers of the five indices do little to present ideas about alternative ways in which they might have constructed their indices and to test the robustness of their chosen rules.Footnote 35 On the positive side, replicability of the process followed by each index allows for independent tests.Footnote 36 But when we conducted a test to assess the impact of weighting decisions, the results were mixed.

We recalculated each index using three broad alternative approaches to the choice of aggregation rule (see Table 6). One alternative, which we label the “direct and indirect weighting” option, includes aggregation rules that assign variable weights to indicators through direct means (the differential assignment of weights) but also via indirect means (that is, as a result of the way they are grouped under each conceptual dimension). Indeed, as shown above, the varying number of indicators per conceptual dimension yields a framing effect, whereby the conceptual framing of an index will result in indicators with different weights even if the conceptual dimensions are weighted equally. A second alternative, which we label the “indirect weighting” option, includes aggregation rules that directly assign an equal weight to each indicator but indirectly allow differential weighting of indicators by grouping the indicators under different conceptual dimensions. And a third option, which we label the “no weighting” option, not only directly assigns an equal weight to each indicator but also avoids any indirect weighting through the framing effect.
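The three options can be illustrated with a minimal sketch for a single hypothetical country. The function names, dimension names, indicator values, and direct weights below are all assumptions made for illustration; they do not reproduce the data or the exact computations behind Table 6.

```python
def weighted_mean(values, weights=None):
    """Weighted average; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def index_value(dimensions, direct_weights=None, nested=True):
    """Aggregate indicator values into a single index.

    dimensions: maps dimension names to lists of indicator values.
    direct_weights: optional per-dimension lists of indicator weights.
    nested: if True, average within dimensions first and then across
    them (framing effect present); if False, average all indicators at once.
    """
    if not nested:
        flat = [v for values in dimensions.values() for v in values]
        return weighted_mean(flat)
    subindices = []
    for name, values in dimensions.items():
        w = direct_weights.get(name) if direct_weights else None
        subindices.append(weighted_mean(values, w))
    return weighted_mean(subindices)


# Hypothetical country profile and hypothetical direct weights.
country = {"health": [0.9], "knowledge": [0.6, 0.3], "economy": [0.5, 0.4, 0.3]}
direct = {"knowledge": [2.0, 1.0]}

both = index_value(country, direct_weights=direct)   # direct and indirect weighting
indirect_only = index_value(country)                 # indirect weighting only
no_weighting = index_value(country, nested=False)    # no weighting
print(both, indirect_only, no_weighting)
```

Comparing the three results for many countries, rather than one, is what a test of this kind would involve; the sketch only shows how the same indicator values yield different index values under each option.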

Table 6 The impact of alternative aggregation rules II: the aggregation of the values of all indicators

The comparison of the mean value of each index when calculated using different aggregation rules is instructive. On the one hand, this test shows that the weighting choice used in some indices does not make much of a difference. Removing the GDI’s theory-driven weighting of indicators does not introduce much change (the mean value goes from 0.665 to 0.659). Likewise, the data-driven weighting of the GGGI and the SIGI, if somewhat questionable, is of little consequence; the GGGI’s mean value drops by 0.001 when this direct weighting of indicators is removed, and doing away with the direct weighting of the SIGI’s indicators makes no statistically significant difference.Footnote 37 On the other hand, this test shows that the GEM and the SIGI would change considerably if all their indicators were weighted equally. The conceptual framing used by both these indices, as with the other indices as well, relies on standard and defensible distinctions. Nonetheless, the greater importance of framing choices in the GEM and the SIGI relative to the other indices is a weakness.

In short, index developers have a responsibility to offer a justification for their choice of aggregation rule and also to consider alternative-but-feasible ways of constructing an index and to present the results of tests that assess the impact of these differences. Nonetheless, the developers of the reviewed indices have failed to shoulder the burden of performing sensitivity analyses and hence to consider evidence that could either strengthen or weaken the case for the validity of the proposed indices.

5 Overview and Implications

Measures of gender inequality, and gender indices in particular, are tools for raising awareness and they have become important in policy-agenda setting and decision-making. This is a welcome development. But it has also raised the stakes of measurement exercises and given urgency to doubts about the quality of gender measures. To respond to this need for a responsible use of data in discussions about gender matters, this article has provided a systematic and comprehensive assessment of the most visible gender indices. As shown, the methodological choices that go into the making of these indices are complex; therefore, any quick summary assessment of these indices risks oversimplifying matters. But the complexity of the issues addressed in this article also makes an overview all the more necessary and valuable. Thus, by way of a conclusion, we first provide an overall assessment of the gender indices we have scrutinized, addressing the two questions posed in the introduction: What are the five reviewed indices a measure of? And, how valid are these indices? Thereafter, we draw some lessons from our assessment in the form of advice for users and producers of gender indices.

5.1 What Do These Indices Measure?

The five reviewed gender indices differ in their meanings, which are not always evident from the index’s name or even the description offered by the index developers. The UNDP’s GDI is the best-explained and most easily understood index because of its focus on a well-theorized concept, development, and the wide dissemination of the human development index (HDI). As the UNDP states in the report that introduced the GDI, this index “measures the same basic capabilities as the HDI does, but takes note of inequalities in achievement between women and men. … The GDI is simply the HDI, discounted, or adjusted downwards, for gender inequality” (UNDP 1995: 73). Less self-evidently, to correctly interpret the GDI it is also important to recall that it is a general measure of gender inequality, which considers inequalities favoring men or women.Footnote 38 But the GDI stands out as a model of clarity and serves as a reference point in an effort to understand the meanings of the other four indices.

The GEM has not been as well explained as the GDI. The explanation that the GEM focuses “on women’s opportunities rather than their capabilities” (UNDP 2007: 360), and that it serves therefore as a supplement to the GDI, is not very helpful. After all, the GEM includes indicators that address results more than opportunities and, furthermore, it includes one of the same indicators used in the GDI. Explanations notwithstanding (UNDP 1995: 82), this contrast between opportunities and capabilities is somewhat puzzling. Moreover, since the GEM relies on the same general concept of gender equality as the GDI, the emphasis on women is misleading. Thus, the most obvious point to emphasize in interpreting the GEM is that it measures dimensions of the concept of human development related to economic and political power that were not included in the GDI and approaches these dimensions of human development from a gendered perspective.

The developers of the GEI and the GGGI do not explicitly state what concepts they intend to measure. That is, though these indices are advertised as measures of gender equality, their overarching concepts are not clearly formulated. The best one can do is to consider the conceptual dimensions of these two indices, infer that these conceptual dimensions are related to human development, as defined by the UNDP, and interpret the indices as measures of development that subsume (especially in the case of the GGGI) the indicators used in both the GDI and the GEM. But two key differences between the GEI and GGGI on the one hand, and the GDI on the other, need to be emphasized. First, as the creators of the GEI and GGGI clearly state (Social Watch 2005: 73; Hausmann et al. 2007a: 3), these indices are measures of the situation of women relative to men that do not take into consideration—in contrast to the GDI—the aggregate absolute situations of women and men. Second, the GEI and GGGI are measures of inequalities favoring men that—in contrast to the GDI—disregard inequalities that favor women.Footnote 39 Thus, in interpreting the GEI and GGGI, it is useful to see them as rather similar indices that are broader measures than the GDI, in that they cover more conceptual dimensions, but also narrower measures than the GDI, in that they focus only on relative positions and only on female disadvantages relative to men.

Finally, interpreting the OECD’s SIGI presents a set of distinct problems. The SIGI intends to provide a measure that goes beyond the other indices, and the overarching concept of social institutions is elaborated sufficiently to make it clear how the SIGI relies on indicators that supplement those used by the other indices (Jütting et al. 2008: 78). Nonetheless, some ambiguities cloud the interpretation of the SIGI. As with the GEI and GGGI, the SIGI does not address inequalities favoring women. But the SIGI includes both relational measures based on an implicit comparison of the rights of women relative to men and absolute measures of the rights of women that have no obvious male counterpart. And this lack of symmetry in the SIGI’s indicators and indicator scales makes it unclear what is being measured. The SIGI can be seen, somewhat loosely, as a combined measure of female disadvantages relative to men concerning some basic rights and the fulfillment of rights distinctive to women.

5.2 How Valid Are These Indices?

Turning to the question of how well the five gender indices measure what they measure, some methodological strengths and weaknesses can be identified. For the indices taken as a group, strengths include a reliance on well-established distinctions to differentiate among conceptual dimensions, the inclusion of indicators that incrementally tap into more aspects of gender, the use of indicators for which reliable data on most countries of the world are available, and the replicability of the aggregation process (see Table 7). Common weaknesses include an inadequate theorization of the index’s overarching concept, the choice of indicators that do not cover the full meaning of the concept being measured or that include extraneous or redundant indicators, the lack of justification for various choices made in the aggregation process, and the lack of sensitivity analyses that are essential to the validation of measures. The important advances made in developing gender indices notwithstanding, the failure to meet some standard and basic methodological criteria casts doubt on the quality of these indices.

Table 7 Indices with gender-differentiated data IV: methodological strengths and weaknesses

Beyond these shared features, the five indices differ in the extent to which they meet the criteria we highlight in our methodological framework. The UNDP’s pioneering GDI stands out as the soundest index, its key problem being the lack of a clear differentiation with respect to the HDI. At the other extreme, the OECD’s SIGI is a case of a measure that holds much promise but has been poorly executed. Finally, the other three indices fall in between these extremes, the lack of robustness of the UNDP’s GEM aggregate data (see Tables 4 and 6) pointing to a relative advantage of the Social Watch’s GEI and World Economic Forum’s GGGI.

5.3 Implications for Data Users and Producers

The valuable contributions made by the developers of the five reviewed gender indices should be duly recognized. These indices constitute a resource for the study of development and a range of rights, which has allowed analysts to portray gender disparities and the rights of women in a more comprehensive and integrated manner than was possible prior to 1995. Having an index, especially one of global empirical scope, is better than having none, even if the index suffers from methodological weaknesses. But the availability of multiple indices—which could support different conclusions—also creates a problem for users, who now must offer a rationale for opting for one index over the others.

This is not an easy choice, but it is becoming an increasingly pressing one. After all, confusion about the interpretation of gender indices has been common and has led to some resistance to embracing the use of gender indices (Shuller 2006: 173–76). Thus, in addition to attending to standard concerns about measurement error and the appropriate uses of data,Footnote 40 data users should be cautious when interpreting these gender indices, should be cognizant that grasping the meaning of these indices is a more complicated matter than it may seem at first sight, and should recognize, as stressed in this article, that the indices measure different concepts and hence are not interchangeable.

The shortcomings identified in this article also highlight the need to see the production of measures that incorporate a gender perspective as an evolving research agenda. This article has two central implications for this agenda. First, we emphasized the value of an explicit methodological framework in work on measurement. The value of such a framework has been recognized in discussions of measures of development (Booysen 2002: 116–17). But current discussions of measures with gender-differentiated data do not rely on such a framework or focus only on some sources of measurement error. Yet, as we have shown, the meaning and validity of an index can be ascertained only if consideration is given to the full range of methodological choices highlighted in our framework. Thus, future research on data that incorporate a gender perspective should build more explicitly on a broad and integrated methodological framework.

Second, we showed the value of critical-but-constructive evaluations of existing measures to the challenge of developing better gender indices. In this regard, the assessment of gender indices in this article contains a host of ideas to guide future work on gender indices. Researchers frequently remain wedded to established measures and such a tendency is likely to hold regarding gender indices. Nonetheless, as we have shown, there is a need to develop better gender indices. And, by pinpointing both the valuable insights that should be incorporated in new indices and the problems that need to be rectified, we hope to have contributed to the collective effort to generate better gender measures.