1 Introduction

The need to find better alternatives to GDP (see Ruta et al. 2005; Stiglitz et al. 2009; Thiry 2010; Dialga 2015) gave rise to a flourishing range of composite indicators (CIs) in the nineties [Footnote 1]. Some studies, including those of Stiglitz et al. (2009) and UNU-IHDP [Footnote 2] (2015), nevertheless insist that CI construction initiatives must be accompanied by dashboards that take into account qualitative aspects such as “inclusive wealth”. Looking at the initiatives under way in the field of CI construction (see Fig. 1 and Bornand et al. 2011), it is likely that this already perceptible trend will grow, since social demand favors these multidimensional measures. Today, public policy actions depend largely on these synthetic tools, at two levels at least. Upstream, they can guide policy makers on what they need to know about social aspirations. Downstream, the same tools serve to evaluate the performance of these policies.

Fig. 1 Evolution of the number of research articles on “Composite Indicators” published in ScienceDirect between 2000 and 2015 (18/03) in all journals (left side) and in economics, econometrics and finance journals. Source: Reproduced from Kutin et al. (2015)

Nevertheless, these synthetic indicators are not without their critics (see Saisana and Saltelli 2010; Klugman et al. 2011; Chiappini 2012), most of whom focus on methodological aspects of CI construction. For the Council of Europe (2005) and Chiappini (2012), the choice of certain weighting systems can be very subjective, with neither empirical evidence nor a defensible theoretical foundation. These persistent criticisms tend to call into question the local legitimacy [Footnote 3] of indicators as guidance tools for monitoring public action. This lack of confidence in figures is all the more likely as the profusion of indicators confuses users’ choices (see Bandura 2008). Which indicators should be adopted? For which actions? And to what ends?

As composite indicators are widely recognized as essential to public debate, their construction methods must be known to the widest possible audience. Otherwise, these synthetic tools would lack visibility and public actions would fail to convince national or local elected representatives. Yet composite indicators are among the best communication materials and supports for pragmatic action. The role of scientists in this context is to support their development by explicitly expressing reservations about these tools and by disseminating new research results and the evolution of debates in the field (Gadrey 2002). This paper contributes to that goal.

The OECD [Footnote 4] and JRC [Footnote 5] Handbook (2008) provides a comprehensive but technical introduction to the construction of CIs [Footnote 6]. A reader may find it difficult to use this handbook as a simple guide for action. As a complement, this paper aims to provide a brief and accessible synthesis of the required construction steps, based on a simple example that serves as a common thread.

More precisely, beyond its methodological contribution of presenting the construction steps of a composite indicator in full, the paper makes two major contributions to the CI literature. First, at the academic level, the paper opens the debate on the need for a standard methodological framework for CI construction, in order to address methodological challenges such as the sensitivity and uncertainty of composite scores with respect to the methods used. Moreover, by illustrating the construction steps with a simple Sustainable Development Indicator (SDI), the paper aims to “democratize” a field that is still a black box for some researchers. The aim is to involve more young researchers in this field, given the methodological and practical stakes related to CIs. Second, from the perspective of CI use, the paper highlights the need to make the construction of CIs non-technical. Given their growing use in public debates (housing policy, transport policy, sustainable development policy, social cohesion policy), the appropriation of CI construction methods by a wider audience becomes a major democratic challenge.

The paper is organized as follows. Sect. 2 discusses the steps of CI construction. We pay particular attention to the most problematic aspects, such as the choice of the theoretical framework, the selection of variables, normalization, weighting and aggregation, and show the implications they can have for uncertainty and for the credit that can be given to CIs. We illustrate each step (only the most widely used methods are described) by constructing a very simple SDI following the scheme initially presented. In Sect. 3, we discuss the results of the approaches used and analyze their implications for sustainable development. We conduct the indicator’s sensitivity and uncertainty analysis and test the validity of the SDI by examining its correlation with some well-known CIs. Section 4 concludes by summarizing the most important points of this illustrated review of CI construction.

2 Steps of a CI’s Construction

A composite indicator is a mathematical combination of many indicators representing different dimensions of the same concept (OECD and JRC 2008). It follows from this definition that CIs have no measurement unit. A CI can result from the combination of at most three types of variables. According to the Council of Europe (2005), these three types are defined as follows:

  • Quantitative-objective indicators are quantitative variables that are directly measurable values. Example: per capita income, unemployment rate, emissions of CO2.

  • Qualitative-objective indicators are not directly measurable but call for objectively verifiable variables such as presence or absence of a quality norm.

  • Qualitative-subjective indicators are matters of opinion and appreciation such as satisfaction, trust.

2.1 Definition of the CI and Choice of Variables (Steps 1 and 2)

The definition step is crucial, since an indicator can leave room for ambiguities and give rise to dubious or erroneous interpretations (OECD and JRC 2008). The definition of the CI should be coherent with its objective and with the phenomenon that it aims to represent. The dimensions of the phenomenon should be described by the most relevant variables; the latter are chosen according to criteria that can be objective or subjective (while following a coherent logic). These criteria should meet four requirements to ensure their quality (Council of Europe 2005):

  (1) Representative of the issue they deal with,

  (2) Informative and univocal,

  (3) Allowing a clear and accepted normative interpretation,

  (4) Not excessively onerous.

Thus, a sustainable development index should, for example, include at least three dimensions (economic, social and environmental), each of which is broken down into easily identifiable and interpretable variables. In our illustration, only one variable is used for each dimension: Gross National Income (GNI) per capita based on purchasing power parity (PPP, constant 2005 $US), the Gini coefficient and carbon dioxide emissions (metric tons per capita). Note that our SDI differs from existing indices at both the theoretical and empirical levels. However, the SDI is not intended to replace them. For example, the new SDI differs from the Human Sustainable Development Index (HSDI) of Togtokh (2011) and Bravo (2015) in its social dimension. Even if education and health are essential to human well-being, their achievement may be compromised as long as inequalities remain large in the country. Wealth inequality may therefore induce a poverty trap (Dialga 2015) in which only individuals earning a minimum income level can access basic services such as education and health. Following Talberth et al. (2006), who proposed to weight income inequality negatively using the Gini index, we use the Gini index to capture the social dimension, although this narrows it. As for the economic variable, we follow Stiglitz et al. (2009), who suggested that “to measure well-being, the national income is more suitable than GDP”. Finally, the limited availability of environmental data leads us to retain carbon emissions as the environmental measure. Both the economic and environmental variables are expressed per capita in order to take the country size effect into account.

As shown by the analysis of the correlations summarized in Table 9, our SDI index, although based on three simple variables, provides enough information to be considered as a non-redundant index compared to the usual ones, like HDI.

The refinement into sub-indicators depends on the degree of detail that the CI is meant to convey. However, it carries a risk of “information overload”.

Moreover, in seeking a high degree of detail, one may end up combining theoretically incompatible concepts in a single CI, which then cannot be interpreted convincingly. Indeed, the wish to represent a country’s wealth exhaustively can lead to including in the same indicator “stock” variables, which characterize wealth, and “flow” variables such as economic growth. On the other hand, the complexity of certain phenomena leads CI constructors to simplify and keep only the most relevant and representative variables. The representation of human capital by enrolment and literacy rates in the HDI illustrates such simplifications of social realities. A poor definition of the CI at the outset evidently affects the other construction steps, in particular collinearity analysis, normalization and interpretation of the CI.

In sum, without neglecting the other steps, the definition of the CI is an important prerequisite for its success, since a poorly constructed theoretical framework yields biased and hard-to-interpret findings and, consequently, inadequate policies. However, this does not mean that one should simply start from available and easily accessible data to build a CI; rather, the definition of relevant variables ought to guide the collection of statistical data.

2.2 Sources of Data and Imputation of Missing Data (Step 3)

After defining relevant variables according to the theoretical framework, the next step deals with data mobilization. Definition of a CI by identification of its sub-indicators and component variables should allow the determination of data types necessary for the construction of the final indicator. We conventionally distinguish two types of data: primary data and secondary data. Primary data are directly collected via surveys, observations or experiments done by researchers for a specific problem. Secondary data are available before the study is done and can come from statistical institutes, administrative sources or polling organizations.

In practice, the required data are not always fully available. To deal with this difficulty, researchers use a range of statistical techniques. Missing-data patterns can be of three types, depending on their links with the variable of interest on the one hand and with the other observed variables on the other.

They can be “Missing Completely At Random” and thus “do not depend on the variable of interest (Y) or on any other observed variable (\(X_{i}\)) in the data set” (OECD and JRC 2008). Formally, \(X_{i} \bot \left({X_{j};Y} \right)\quad \forall i \ne j\). In this case, it is possible to omit records from the analysis (case deletion) without producing a biased indicator. One example of this treatment is the removal of some countries from the HDI ranking when some data are missing. However, this removal reduces the quality of the information conveyed by the CI, especially when the variable represents an important element. Moreover, it is not possible to make a comparative study between the original sample and the reduced one. In this case, substituting another variable for the one whose data are not available could be considered.

Missing data can depend directly on the variable of interest (Not Missing At Random). Formally, \(X = f\left(Y \right)\quad \text{with}\quad X_{i} \bot X_{j} \quad \forall i \ne j.\)

Missing data can also be conditional on other variables in the data set without depending on the variable of interest (Missing At Random): \(X_{i} = f\left({X_{j}} \right)\quad \forall i \ne j\quad \text{and}\quad X_{i} \bot Y.\) In these last two cases, missing data can be imputed with statistical tools (central tendency indicators such as means, medians or modes) or econometric ones (such as linear regression). These approximations help to deal with one difficulty but raise another regarding the reliability of the CI, because of the uncertainties they can introduce. Indeed, the imputed values are treated as equivalent to observed data, yet a single imputed value cannot represent the whole uncertainty. Treating these values as equivalent to observed data underestimates this uncertainty and thus tends to reduce the variance of the sample and the confidence interval of the indicator (Donzé 2001). Similarly, Saisana and Saltelli (2010) show that the extent to which uncertainty in the collected data is taken into account can lead to significant variation in the final indicator’s value. The quality of the indicator depends strongly on the quality of the data used, which, in the case of imputation, depends on the robustness of the tools employed.
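The variance reduction caused by single-value imputation can be seen on a toy series. The sketch below uses hypothetical numbers (not the WDI data) and a hypothetical helper `mean_impute`; it only illustrates mean imputation, one of the central tendency options mentioned above.

```python
import statistics

# Hypothetical observed values for one variable; None marks a missing record.
observed = [2.0, 4.0, None, 8.0, None, 10.0]

def mean_impute(values):
    """Replace each missing entry (None) by the mean of the observed entries."""
    known_values = [v for v in values if v is not None]
    fill = statistics.mean(known_values)
    return [fill if v is None else v for v in values]

known = [v for v in observed if v is not None]
imputed = mean_impute(observed)

# Sample variance computed on observed records only, then after imputation.
var_known = statistics.variance(known)
var_imputed = statistics.variance(imputed)

print(var_known, var_imputed)
```

Because every imputed record sits exactly at the mean, the sum of squared deviations is unchanged while the number of records grows, so the sample variance after imputation is smaller than the variance of the observed records alone.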

In our example, raw data come from the World Bank’s World Development Indicators (WDI) database. We consider for each country the most recent year for which data are available for all three variables. Unfortunately, in order to have a full panel, we were not able to work with a year more recent than 2008. The sample structure (high, intermediate and low income countries) is based on the World Bank classification by level of GNI per capita, whereas the countries themselves are selected randomly in order to obtain a representative sample. Obviously, other selection criteria, such as the level of human development (based on the HDI), would lead to different choices, as shown in Table 9. According to our criteria, the sample is composed of five high income countries (Australia, Germany, France, USA and Qatar), five intermediate income countries (Brazil, Russia, India, China, Bulgaria) and five low income countries (Algeria, Burkina Faso, Burundi, Cambodia, Vietnam).

2.3 Multivariate Analysis (Step 4)

Multivariate analysis examines the general structure of the data in order to detect possible correlations between sub-indicators (in the case of the SDI, relations between the variables used). Its advantage is that it allows inconsistencies in the indicator’s formulation to be identified early and corrected when needed, for example through the inverse weighting of correlated sub-indicators. Indeed, if the analysis reveals a negative correlation between two sub-indicators, they should not both be components of the final indicator, since their effects neutralize each other and thus bias some aggregation functions, such as the arithmetic mean. Different weightings should be applied if these indicators represent important and distinct criteria. In practice, variables can be correlated with each other (see Table 1), and ignoring the endogeneity of these variables could result in biased estimators. When variables are correlated, Principal Component Analysis (PCA) provides weights that take these interactions into account. Weights are determined in three steps.

Table 1 Correlation matrix between variables of SDI and pairwise correlation test

In the first step, we verify that correlations exist between variables.

In the second step, we select the components, called factors, that explain most of the variance of the sample. PCA forms linear combinations of all the variables related to each other. Principal components are retained if the following three conditions are met:

  (1) The eigenvalue associated with the variable should be ≥1;

  (2) The individual contribution of the variable to the total variance should be ≥10 %;

  (3) The cumulative variance of the variables in a decreasing order should be ≥60 %.

The third step consists in obtaining weights from a rotation matrix, which gives coefficients related to the interactions between variables, called factor loadings (OECD and JRC 2008). For the components retained in the second step, weights are calculated by dividing the squared factor loadings by the respective variance of each component.
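The three steps can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the data matrix `X` is hypothetical (it is not the paper's sample), the loadings are left unrotated for brevity although the text refers to a rotation matrix, and the three retention conditions are read as applying to each component's eigenvalue and variance share.

```python
import numpy as np

# Hypothetical normalized scores: rows are countries, columns are the three
# SDI variables (economic, social, environmental). Not the paper's data.
X = np.array([
    [0.90, 0.60, 0.10],
    [0.70, 0.80, 0.30],
    [0.40, 0.50, 0.60],
    [0.20, 0.70, 0.80],
    [0.10, 0.40, 0.95],
    [0.05, 0.30, 1.00],
])

# Step 1: correlation matrix between the variables.
R = np.corrcoef(X, rowvar=False)

# Step 2: eigen-decomposition of R, sorted by decreasing eigenvalue
# (np.linalg.eigh returns eigenvalues in ascending order).
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

share = eigval / eigval.sum()              # individual share of total variance
keep = (eigval >= 1.0) & (share >= 0.10)   # conditions (1) and (2)
cumulative_ok = share[keep].sum() >= 0.60  # condition (3), reported as a flag

# Step 3: unrotated loadings (assumption: no rotation applied here), then
# squared loadings divided by the variance (eigenvalue) of each component.
loadings = eigvec[:, keep] * np.sqrt(eigval[keep])
weights = loadings**2 / eigval[keep]

print(np.round(weights, 3), cumulative_ok)
```

With this scaling, the weights within each retained component sum to one, which makes them directly comparable with the weights produced by the other methods.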

Table 1 shows negative correlations between GNI per capita and both CO2 per capita and the Gini index. However, because of the normalization method used below, i.e. \(I = \frac{{Value_{max} - Value_{country}}}{{Value_{max} - Value_{min}}}\), these negative correlations must be interpreted as positive coefficients. In other words, as GNI and CO2 emissions are positively linked (which is negative in terms of sustainable development, SD), the complementary value of CO2 given by the normalization method is negatively correlated with GNI. Note also that most correlations between variables are weak. The correlation coefficient between GNI per capita and the Gini index is <1 %, and that between CO2 emissions per capita and the Gini index is somewhat more than 10 %, whereas the correlation between CO2 emissions and GNI per capita is much greater, at more than 90 %, meaning that industrialized and emerging countries emit a much larger quantity of CO2 because of the size of their total production. Togtokh (2011) also highlighted the weak correlation between the economic and social dimensions of SD, whereas emissions are positively and strongly correlated with income.

Strictly speaking, the Gini coefficient is not representative of the social dimension, as it is weakly correlated with both the GNI and the CO2 per capita. It should be replaced by a more relevant variable. But as said above, the challenge is not to have an ideal SDI.

2.4 Normalization of Data (Step 5)

This step brings variables expressed in different measurement units onto a common or equivalent scale. Depending on the indicator’s type, namely warning indicator (existence of a critical level for a given phenomenon) or indicator for comparing performances (international indicators), different methods exist and suggest reference scales. Examples include Ranking, Standardization (or z-scores) and Denominator-Based Weight. In this article, the two most widely used approaches are presented, namely Min–Max and Benchmark scale-ratio.

2.4.1 Min–Max

In practice, it is the most widely used method, especially to normalize international indicators such as the HDI. Algebraically, \(SI_{ij}^{t} = \frac{I_{ij}^{t} - \min_{i^{\prime}}\left(I_{i^{\prime}j}^{t}\right)}{\max_{I}\left(I_{Ij}^{t}\right) - \min_{i^{\prime}}\left(I_{i^{\prime}j}^{t}\right)}\), where \(\min_{i^{\prime}}\left(I_{i^{\prime}j}^{t}\right)\) is the weakest score achieved by any entity. Entity \(i^{\prime}\) can be different from \(i\), which means that the weakest score can belong to an entity other than the one (\(i\)) for which indicator \(j\) is normalised; \(t\) denotes the year; \(\max_{I}\left(I_{Ij}^{t}\right)\) is the highest score achieved by any entity. Entity \(I\) can be different from \(i\) and is different from \(i^{\prime}\) unless all entities have the same score. By construction, the normalized sub-indicator ranges from 0 to 1, and entities are ranked according to their relative positions in this range. The min–max method is very sensitive to extreme values.

2.4.2 Benchmark Scale-Ratio

This method associates scores with performances made in a field with reference to a threshold more or less arbitrarily chosen. This threshold could be the performance of the reference country at the initial year: \(SI_{ij}^{t} = \frac{{I_{ij}^{t}}}{{I_{{i,j = \bar{J}}}^{{t_{0}}}}}.\) Two other approaches are also used: the threshold could be \(I_{{i,j = \bar{J}}}^{t}\), i.e. performance of the reference country at the current year; or it could be \(I_{i,j}^{{t_{0}}}\), i.e. performance of the considered country at the initial year.
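The three possible thresholds can be compared on a small numerical example. The figures below are hypothetical, and `scale_ratio` is an illustrative helper rather than a function taken from any library.

```python
def scale_ratio(value, benchmark):
    """Score a performance relative to a chosen reference value."""
    if benchmark == 0:
        raise ValueError("benchmark must be non-zero")
    return value / benchmark

# Hypothetical values of one sub-indicator for the country under study and
# for a reference country, at the initial year (t0) and the current year (t).
country = {"t0": 5.0, "t": 6.0}
reference = {"t0": 8.0, "t": 10.0}

# Threshold 1: reference country at the initial year.
vs_reference_initial = scale_ratio(country["t"], reference["t0"])
# Threshold 2: reference country at the current year.
vs_reference_current = scale_ratio(country["t"], reference["t"])
# Threshold 3: the country itself at the initial year.
vs_own_initial = scale_ratio(country["t"], country["t0"])

print(vs_reference_initial, vs_reference_current, vs_own_initial)
```

The same observed value yields three different scores, which shows how strongly the result depends on the choice of threshold.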

The normalized indexes of SDI are summarized in Table 2. Major clarifications have to be made in the normalization of sub-indicators of SDI. The indexes corresponding to the “social” and “environmental” dimensions are “warning indicators”, which means that the SDI’s score is improved when the values of variables decline (Gini index and CO2 emissions per capita). In other words, the warning indicators refer to indicators built to warn of the existence of a threshold for a given phenomenon. The existence of these types of indicators allows policy makers to take action at the right time to avoid exceeding critical thresholds. Thus, the normalization formula in this case is: \(I = \frac{{Value_{max} - Value_{country}}}{{Value_{max} - Value_{min}}}\).

Table 2 Normalization by the Min–Max method

The index corresponding to the “economic” dimension is a “prosperity indicator”, i.e. an indicator whose increase improves the composite indicator. For example, the level of national income (Gross National Income) is a prosperity indicator for a Sustainable Development Indicator (SDI) or for the Human Development Index (HDI). The normalization therefore follows the traditional formula: \(I = \frac{{Value_{country} - Value_{min}}}{{Value_{max} - Value_{min}}}\). Areal and Riesgo (2015) refer to “warning” and “prosperity” indicators as “less is better” and “more is better” indicators, respectively.
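The two min–max formulas, for “more is better” and “less is better” variables, can be sketched in a few lines. The country labels and values below are hypothetical, and `min_max` is an illustrative helper.

```python
def min_max(values, less_is_better=False):
    """Min-max normalization of a {entity: value} mapping onto [0, 1].

    less_is_better=False: (value - min) / (max - min)   (prosperity indicator)
    less_is_better=True:  (max - value) / (max - min)   (warning indicator)
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        raise ValueError("all entities have the same value")
    span = hi - lo
    if less_is_better:
        return {k: (hi - v) / span for k, v in values.items()}
    return {k: (v - lo) / span for k, v in values.items()}

# Hypothetical raw data for three fictitious countries.
gni_per_capita = {"A": 1000.0, "B": 5000.0, "C": 9000.0}
co2_per_capita = {"A": 0.5, "B": 4.5, "C": 8.5}

economic = min_max(gni_per_capita)                             # more is better
environmental = min_max(co2_per_capita, less_is_better=True)   # less is better

print(economic)
print(environmental)
```

In this toy sample, the country with the highest income scores 1 on the economic dimension and 0 on the environmental one, which mirrors the pattern described for the leader in economic matters in the paper's sample.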

Considering the “economic” dimension, Qatar has the best performance (1.00), whereas the persistent poverty in Burundi is reflected by a zero score for this country. Also, most of the countries in the sample have a score lower than 0.50, even industrialized economies such as Australia and France. Besides Qatar, only the USA and Germany obtain a score above 0.50.

For the “social” aspect, the Gini coefficient offers neither a “best” nor a “worst” performance against which the other countries could be evaluated (social policies differ from one country to another). Nonetheless, some countries get better scores than others: in this sample, Bulgaria tends toward a more egalitarian distribution of income than the rest, while Brazil remains quite inegalitarian. Developed countries such as Germany and France, two pillars of the European Union, as well as Australia, get high scores in this field, probably thanks to the effectiveness of their social protection policies.

With regard to the “environmental” aspect, regularly highlighted in discussions of sustainable development, it is interesting to note that Qatar, the leader in economic matters, gets the weakest score on environmental issues, whereas countries with the most limited production capacities, and thus low CO2 emissions, have high scores (Burkina Faso, Burundi). In the group of high income countries, there is one notable distinction: while the European ones manage to get good scores, the USA and Australia do better only than Qatar. Also, the gap between Qatar and the other 14 countries in terms of CO2 emissions is very large, since none of the latter has a score <0.50. These results are not surprising since, as highlighted by Table 1, a high correlation is found between CO2 emissions per capita and GNI per capita.

It would be interesting to compare the results of the two main normalization methods, namely min–max and scale-ratio. Unfortunately, the latter is not suited to our topic because it requires a benchmark [Footnote 7]. While it is reasonable to take the 1992 pollution level (the 1st Earth Summit) as a reference for normalizing the environmental variable, the choice of a baseline for the economic and social dimensions is subject to debate among researchers (see Klugman et al. 2011). What baseline should be chosen for all countries in the study? Is this choice legitimate and accepted by all? At the individual level, each country can set its reference level according to its development priorities.

2.5 Weighting and Aggregation (Steps 6 and 7)

These two steps are closely linked and difficult to dissociate in practice because the chosen weighting method implicitly imposes the aggregation method. Nevertheless, some methods make it possible to distinguish the two steps explicitly.

2.5.1 Budget Allocation Process

This method consists in asking each expert (or stakeholder) to allocate a budget of amount X between the different fields of a phenomenon. The mean of the allocated scores gives the weights of the indicators, and the composite score is their weighted sum. Although the allocation comes from experts in the field and thus gives the CI a professional legitimacy, the choices strongly depend on the experts’ perception of the phenomenon. The method therefore rests on an implicit subjectivity: the expert opinions may differ from both the target audience’s opinion and reality, which is likely to occur if the number of experts is not sufficiently large and representative. In such a case, the too limited number of experts can produce biased weights. Even if the list of experts is large, it is advisable to verify the logic of the value judgments of each expert or other stakeholder by calculating a coherence index [Footnote 8] of value judgments (Saaty 1987; Saaty 1990). When the value of this index is greater than 10 %, there is an incoherence in the value judgment, and thus in the budget allocation of that participant, which has to be identified and corrected.

Furthermore, when the phenomenon is multidimensional and the budget has to be allocated between these dimensions, this method could give weights that make no sense. Taking a larger number of variables into account in the construction of a CI does not necessarily lead to a high-quality indicator that is representative of the phenomenon; the reasonable number of sub-indicators is around twelve (Nardo et al. 2005).
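The weighting and aggregation rule of the budget allocation process can be sketched as follows. The three allocations and the normalized scores are hypothetical, and `bap_weights` and `composite` are illustrative helpers.

```python
# Hypothetical allocations of 100 points by three fictitious experts.
allocations = [
    {"economic": 40, "social": 30, "environmental": 30},
    {"economic": 20, "social": 50, "environmental": 30},
    {"economic": 30, "social": 40, "environmental": 30},
]

def bap_weights(allocations, budget=100):
    """Average the experts' allocations and express them as shares of the budget."""
    for allocation in allocations:
        if sum(allocation.values()) != budget:
            raise ValueError("each expert must allocate the whole budget")
    dimensions = allocations[0].keys()
    n_experts = len(allocations)
    return {
        d: sum(a[d] for a in allocations) / (n_experts * budget)
        for d in dimensions
    }

def composite(scores, weights):
    """Composite score as the weighted sum of normalized sub-indicators."""
    return sum(weights[d] * scores[d] for d in weights)

weights = bap_weights(allocations)

# Hypothetical normalized sub-indicators of one country.
score = composite({"economic": 0.5, "social": 0.8, "environmental": 0.9}, weights)

print(weights, score)
```

Since every expert spends the full budget, the averaged weights sum to one by construction, so the composite score stays within the range of the normalized sub-indicators.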

To illustrate this method, we asked 21 experts [Footnote 9] to allocate 100 points between the three dimensions of sustainable development.

Regarding the budget allocation made by the 21 experts, Table 3 shows that the three variables chosen are all considered crucial for Sustainable Development (SD) issues: no dimension received zero. The minimum weight is given to the social dimension (0.1), whereas the economic dimension received the maximum weight (0.6); the environmental dimension is in an intermediate position. However, on average, experts give more importance to social issues (0.355), followed by environmental issues (0.335). The allocations are evenly distributed, in that the differences between mean and median values are negligible. We can therefore conclude that, overall, experts have converging views on sustainable development issues.

Table 3 BAP weighting results from 21 experts

2.5.2 Maximization of Scores

This method is directly derived from the Benefit of the Doubt (BOD) method, itself an application of the DEA (Data Envelopment Analysis) approach (OECD and JRC 2008; Blancard and Hoarau 2013). The DEA approach consists in constructing an efficiency frontier from the best performances and then determining the other participants’ performances relative to it. Thus, BOD gives a relative weight to an individual i with respect to the best performances. By associating a score of 1 with the best performance, the scores of the least effective individuals are logically below 1 but still positive, according to the formula \(w_{i} = \frac{Performance\,of\,i}{Benchmark} \le 1\). Defined that way, the relative weight of each individual i depends on their performance compared with the “ideal” situation, i.e. the benchmark.

By definition, aggregation by this method maximizes the scores given by BOD subject to the probability constraint, i.e. non-negativity of the scores or sub-indicator weights, and a sum of weighted sub-indicators lower than or equal to 1 (\(\sum\nolimits_{j = 1}^{J} {I_{j} w_{j} \le 1}\)). For a set of sub-indicators, this method gives the benefit of the doubt to the individual, whose global performance is evaluated by keeping only the dimensions in which they are most effective (OECD and JRC 2008; Blancard and Hoarau 2013). The underlying idea is that each country has political priorities and seeks to maximize its actions in the dimensions judged essential. Formally, the CI results from the following maximization program:

$$\begin{aligned} CI_{i} &= \max_{w_{ij}} \sum\limits_{j = 1}^{J} w_{ij} SI_{ij} \\ \text{s.t.}\quad &\left\{ \begin{array}{l} 0 < \sum\limits_{j = 1}^{J} w_{ij} SI_{ij} \le 1 \\ w_{ij} > 0 \end{array} \right. \end{aligned}$$

with \(i\) the individual or country \(\left({i = 1, \ldots,I} \right)\); \(j\) the index of the sub-indicator (SI) representing one dimension \(\left({j = 1, \ldots,J} \right)\); and \(w_{ij}\) the relative weight associated with sub-indicator \(j\) in the CI of individual \(i\).

Like other methods, this one has weaknesses when dimensions are not substitutable. For example, if they are of equal importance (as in sustainable development) or complementary, the maximization method keeps only the scores for the dimensions in which the individual makes the most effort and thus does not treat the phenomenon in its totality. This creates a risk of an unbalanced understanding of the phenomenon. The second limit is inherent to the weighting method. By limiting scores to the range [0,1], the method does not allow the worst performances to be observed, and individuals who perform outside the predefined efficiency frontier are confined to this pre-established range. Some individuals might have scores lower than the imposed lower bound, and such scores could serve as a warning when analyzing the global performance of the entity. By restricting the weighted sub-indicators to at most 1, the method also excludes the possibility that better situations exist. Entities located on the frontier have no incentive to improve their performances, since these are considered references toward which other entities should tend. Moreover, the method becomes very complex, even inoperable, when the frontier has to be represented from a large number of sub-indicators. The same difficulty occurs when several of the entities in the study are “best performers” for one dimension. In principle, the BOD method is inappropriate in the case of the SDI, as it implicitly assumes that the dimensions of the measured phenomenon are perfectly substitutable. In other words, using the BOD method implicitly assumes that countries can choose, for example, to pollute more in order to further develop their economy. If perfect substitution is not allowed, the BOD method should be discarded.

All things considered, BOD has two essential advantages. First, it gives the entity an opportunity to excel in at least one of the many dimensions of the studied phenomenon, generally the one on which the entity focuses its efforts. Second, the weights of the sub-indicators are endogenously determined, which keeps the CI free from criticism about subjectivity in the weighting process.

To give an overall view of these weighting methods, Table 4 presents the weighting results from the BAP, PCA and BOD approaches. Before commenting on them, let us examine some methodological points:

Table 4 The weights of the dimensions according to the BAP, PCA and BOD methods

Weights in BAP are the average allocation of 100 points among the three dimensions of SD by the experts. PCA takes the weight of each variable to be its relative contribution to the total variance of the CI. In BOD, the way of computing weights differs slightly between the “economic” dimension and the other two dimensions. For “social” and “environmental”, the lower a country’s Gini index or CO2 emissions relative to the rest, the better its situation in these dimensions; the relative weight is equal to the value of the benchmark countries divided by the value of the entity. The benchmark countries have a weight equal to 1 for this dimension. For the “economic” dimension, the best performances are rewarded: relative weights are determined as the ratio of the GNI per capita of the country in question to the best value of this variable in the sample. To give meaning to comparisons, the weights are normalized to 1.
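The BOD relative weights described above can be sketched as follows. This is a minimal illustration, not the computation actually run for Table 4: the function name `bod_relative_weights` and the three-entity figures are hypothetical, and the sketch assumes strictly positive values so that the ratios are defined.

```python
def bod_relative_weights(values, higher_is_better):
    """Weight of each entity relative to the benchmark (best performer).

    The benchmark receives a weight of 1; every other entity receives a
    weight in (0, 1). Values are assumed to be strictly positive.
    """
    if higher_is_better:
        # Economic dimension: own value divided by the best (highest) value.
        best = max(values.values())
        return {name: value / best for name, value in values.items()}
    # Social / environmental dimensions: best (lowest) value divided by own value.
    best = min(values.values())
    return {name: best / value for name, value in values.items()}


# Hypothetical figures for illustration only (not the paper's data).
gni_per_capita = {"A": 40000.0, "B": 10000.0, "C": 2000.0}
co2_per_capita = {"A": 10.0, "B": 5.0, "C": 0.5}

economic = bod_relative_weights(gni_per_capita, higher_is_better=True)
environmental = bod_relative_weights(co2_per_capita, higher_is_better=False)
```

In this toy sample, entity A is the economic benchmark and entity C the environmental one, so each obtains a weight of 1 in the dimension where it performs best.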

The BAP results come from Table 3. Weights from this method are very different from the PCA ones. With the PCA method, the relative weight of the “economic” dimension is 0.493 and that of the “environmental” dimension is 0.500, whereas that of the “social” dimension is 0.008. The low weight associated with the social dimension is not surprising given the weak correlation between the Gini coefficient and the other variables identified in Table 1. BOD results, for their part, are analyzed dimension by dimension. The best performers are Qatar in the “economic” dimension, Bulgaria in the “social” dimension, and Brazil, Burkina Faso, Burundi, China and Russia in the “environmental” dimension (low emission of CO2 per capita). We calculated the average weight of each dimension over the whole sample and got the following results (in the same order): 0.430; 0.173; 0.0676. So most countries have a good performance in the social dimension, to which the weights given are beneficial. On the contrary, the environmental dimension is generally given a low weight, because most countries obtained poor results in this field relative to the best performer.

Although PCA can provide weights that clearly take into account the correlations between the variables, its results should be treated with caution, since a robust PCA needs a relatively high number of variables and correlations between variables greater than or equal to 0.30. Indeed, even if the PCA weighting system yields “objective weights” generated from the endogenous structure of the data, the constructor may be in an uncomfortable situation when the variables are not all well correlated (Table 1). According to the OECD and JRC Handbook (2008), the PCA weighting system is not suitable when variables are uncorrelated.
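The correlation condition mentioned above can be checked before running a PCA. The sketch below flags variable pairs whose correlation falls below 0.30. It rests on two assumptions that are not stated in the text: the Pearson coefficient is used, and the threshold is applied to its absolute value. The helper names and the data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def weakly_correlated_pairs(variables, threshold=0.30):
    """Return (name_a, name_b, r) for every pair with |r| below the threshold."""
    weak = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson(a, b)
        if abs(r) < threshold:  # assumption: the threshold applies to |r|
            weak.append((name_a, name_b, r))
    return weak


# Hypothetical observations for five entities (not the paper's data).
variables = {
    "gni": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gini": [2.0, 1.0, 3.0, 1.0, 2.0],
    "co2": [2.0, 4.0, 6.0, 8.0, 10.0],
}
weak = weakly_correlated_pairs(variables)
```

Here the toy "gini" series is uncorrelated with the other two, so both pairs involving it are flagged, which mirrors the situation in which PCA assigns a near-zero weight to a dimension.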

2.5.3 Indicators Average

The average can be additive (arithmetic), multiplicative (geometric) or harmonic. We present here the arithmetic and geometric means. The arithmetic mean is the most widely used weighting-aggregation method in practice, probably because of its transparency and because it is easily understood by a wide audience. However, equal weighting, which seems “neutral” (as it gives the same importance to the different dimensions of the treated phenomenon), can be a source of bias. In particular, the arithmetic mean is very sensitive to extreme values, which can give biased results when data contain outliers. Furthermore, this method rests on the implicit assumptions that the different dimensions are perfectly substitutable and that they are of equal importance. Thus dimensions for which values are relatively low would be overestimated. The geometric mean takes into account the lack of perfect substitution between sub-indicators and rewards entities that perform evenly in all fields. The influence of weighting-aggregation methods on the CI scores and on the rankings can be seen in Tables 5 and 6 and in Bravo (2015) and Areal and Riesgo (2015).
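The contrast between the two means can be sketched in a few lines. The example assumes normalized sub-indicator scores in [0, 1] and weights that sum to 1; the function names and the two score profiles are hypothetical.

```python
from math import prod


def additive_score(scores, weights):
    """Weighted arithmetic mean: low scores can be offset by high ones."""
    return sum(w * s for s, w in zip(scores, weights))


def geometric_score(scores, weights):
    """Weighted geometric mean: compensation between dimensions is limited."""
    if any(s < 0 for s in scores):
        raise ValueError("geometric aggregation requires non-negative scores")
    return prod(s ** w for s, w in zip(scores, weights))


equal_weights = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical normalized scores (economic, social, environmental).
balanced = [0.5, 0.5, 0.5]
uneven = [1.0, 0.5, 0.0]

additive_balanced = additive_score(balanced, equal_weights)
additive_uneven = additive_score(uneven, equal_weights)
geometric_balanced = geometric_score(balanced, equal_weights)
geometric_uneven = geometric_score(uneven, equal_weights)
```

Both profiles obtain the same additive score, whereas under geometric aggregation the uneven profile drops to zero because a single zero sub-score cancels the whole product.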

Table 5 Scores and ranking according to different weighting methods and the additive aggregation method
Table 6 Scores and ranking according to different weighting methods and the geometric aggregation method

With the additive aggregation method, results do not vary much from one weighting method to another, except in the case of BOD. Germany occupies the first place with the equal, BAP and BOD weightings; only with the PCA method is it dethroned by France. France, Bulgaria and Australia always have good scores, whereas Russia, China and Brazil occupy the last three places regardless of the weighting method used. The BOD method yields seven leaders, namely Germany, Bulgaria, Burkina Faso, Burundi, France, Qatar and the USA.

With the geometric method, we set aside the BOD method because it is inconsistent with nonlinear programs. The analysis therefore considers the three remaining weighting methods: equal, BAP and PCA. Germany remains the leader with the equal and BAP weighting methods, whereas Qatar falls to the last place of the group. France’s ranking is also stable, and even improves in the case of PCA. The positions of emerging and developing countries are mixed: emerging countries such as China, India and Russia are in intermediate positions, while developing countries (e.g. Vietnam, Burundi, Burkina Faso) hold the last places. Table 6 also highlights the inconsistency between some aggregation methods and some types of data. In the present case, Qatar’s score cannot be calculated because negative values are not allowed in the geometric aggregation method.

3 Discussion of Results and Sensitivity Analysis

In this section we discuss the results (ranks and scores) of the different combinations of approaches and their implications in terms of sustainable development. We also analyze the sensitivity of the indicator and of the scores to changes of method, and then assess its validity by conducting a pairwise correlation test with some well-known CIs.

3.1 Discussion of the Sustainable Development Index

As a reminder, the three dimensions of sustainable development (economic, social and environmental) are measured by three corresponding variables: GNI per capita based on purchasing power parity (PPP, constant 2005 US$); the Gini index (base 100); and CO2 emissions per capita (in metric tons).

The additive aggregation method gives similar results across the first three weighting methods (equal, BAP and PCA), with PCA departing slightly from the other two. Scores of countries in the sample do not vary much from one method to another and the ranking is mostly the same (except for Burundi and the USA). Germany is always at the top, whereas new economic powers such as Russia, China and Brazil occupy the last places, probably because of the imbalance between the efforts devoted to economic growth and those devoted to social and environmental matters. BOD produces somewhat different results, as Qatar and Burundi are rewarded for their respective performances in economic and environmental matters. These figures illustrate how this method promotes the aspects in which the entity has the greatest advantage, whereas in sustainable development all aspects are assumed to have equal importance. Also, with BOD, 7 countries share the best score of 1.000 because scores are restricted to the [0, 1] range, even if they do not really have the same performance.

The geometric aggregation method does not change the ranking results obtained with the equal and PCA weighting methods. BAP combined with the multiplicative method provides higher scores for all countries. However, Qatar, the leader in the economic dimension, gets the last place with both the equal and PCA methods. Indeed, its ranking suffers from its high CO2 emissions per capita (0 points for the environmental aspect, hence 0 for the total score). We observe no compensation between sub-indicators, unlike with additive aggregation.

Based on these results, industrialized nations seem to be closer to the SD objectives than developing countries. Adding other dimensions, such as intergenerational equity and good governance, will presumably lower the scores of the latter group of countries.

3.2 Sensitivity and Uncertainty Analysis

From the above results, we conduct tests on the sensitivity of the indicator and of the scores to weighting methods, with the additive aggregation. Table 7 presents cumulative gaps attributable to weighting methods (see “Appendix 2” for theoretical explanations).

Table 7 Sensitivity of the indicator and scores to weighting methods (additive aggregation)

Equal and BAP weighting methods give similar results in terms of ranking. Differences are slightly more perceptible when it comes to scores. While score variations are around 3 % between the equal and BAP methods, they are much larger in the case of the PCA and BOD methods, reaching −400 %. These results highlight how countries’ scores and the resulting ranks vary with the method used (Table 8).

Table 8 Relative uncertainty associated with SDI’s scores

After the sensitivity test, we also evaluate the uncertainty associated with the SDI, using the additive aggregation results. Since we do not have functional specifications linking the different combinations used (normalizations, weightings and aggregations), only the uncertainty related to changes in weighting method is evaluated. Monte Carlo simulations could have been an alternative, but they require a much larger sample, a condition that is not satisfied in our case (see “Appendix 1”). Thus we use the following relative uncertainty formula: \(\Delta X/\bar{X} = \left(t \cdot \frac{s}{\sqrt{n}}\right)/\bar{X}\), where \(\bar{X}\) is the SDI score, t = 3.18 is the Student t-value at 3 degrees of freedom \(\left({dof = n - 1} \right)\), s is the standard deviation of the sample (here the standard deviation of the SDI obtained with the 4 weighting methods) and n = 4 is the sample size (in the SDI, the number of weighting methods used). The absolute uncertainty ΔX measures the maximal error in the evaluation of the indicator \(\bar{X}\). The relative uncertainty \(\Delta X/\bar{X}\) measures the importance of the maximal error compared with the calculated value of the indicator at a certain degree of confidence (in our application, the degree of confidence is 95 %).
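The relative uncertainty formula above can be applied to one country as in the following sketch. The four scores are hypothetical, the function name is illustrative, and the sample standard deviation (n − 1 denominator) is assumed, which is consistent with the 3 degrees of freedom quoted in the text. The default t-value of 3.18 is only valid for n = 4.

```python
from math import sqrt
from statistics import mean, stdev

T_VALUE_95_DOF3 = 3.18  # Student t-value quoted in the text (95 %, 3 dof)


def relative_uncertainty(scores, t_value=T_VALUE_95_DOF3):
    """Return (t * s / sqrt(n)) / mean for the scores of a single entity."""
    n = len(scores)
    x_bar = mean(scores)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    delta_x = t_value * stdev(scores) / sqrt(n)
    return delta_x / x_bar


# Hypothetical SDI scores of one country under the four weighting methods.
scores_by_method = [0.60, 0.62, 0.58, 0.60]
uncertainty = relative_uncertainty(scores_by_method)
```

With these toy scores the relative uncertainty is roughly 4 %. A country whose score changes sharply under one weighting method would show a much larger value, since both s and the ratio to the mean grow.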

The uncertainty associated with the countries’ scores is around 30 %, except for Brazil, whose score has an uncertainty of 432 %. The results analyzed in the previous section suggest that both the BOD and PCA methods are the source of large variations in the SDI calculated for certain countries, especially Brazil. This illustrates the caution required when drawing conclusions from analyses of CIs built with different weighting methods. The same principle applies to other choices in CI construction, for example the normalization and aggregation methods.

3.3 Validity and Robustness Analysis

Comparing the SDI with some well-known composite indexes such as the HDI, HSDI, GNI and EFI, major changes appear in the countries’ rankings. While older indexes reflect the income level of the country, the new SDI produces scores that allow a nuanced reading of development. This is typically the case of the USA (see Table 9A). Results also show that countries like Bulgaria, Burkina Faso or Cambodia are more sustainable than historically dominant countries like the USA and Australia. Emerging countries are not well ranked in the SDI compared with the other indexes, except for the very strong sustainability index, namely the EFI (see Table 9A). Regarding sustainable development indices only, Table 9B shows that most of them remain strongly correlated with the income indices, except the new SDI and the EFI. For the SDI, results seem to show that income effects are reduced in the composite score when income is divided by the population size. As for the strong sustainability index, results show that the EFI is a partial measure of the concept of SD. As can be seen, there is no correlation between this index and the SDI, or between the latter and the HSDI. In addition, the strong correlation between the EFI and the GNI shows the pressure of economic activity on the biological capacity of the Earth. Note that the negative correlations must be interpreted as positive coefficients because of the normalization method used.

Table 9 Comparing SDI with some well-known indexes and correlation test

One of the most interesting results is that our SDI is not significantly correlated with any of the existing indicators at the 1 % level. This indicates that the indicator is not redundant with those already built. Despite the adjustments made by Bravo (2015), the results in Table 9 challenge the HSDI as a measure of sustainable development, in view of its strong correlation with the GNI and HDI and the almost identical rankings that result. The lack of significant correlation between the SDI and the EFI is explained by the difference between the theoretical frameworks underlying their construction. Indeed, while the ecological footprint is based on a strict and strong vision of sustainable development, the SDI accepts compensation between the three dimensions of sustainable development and integrates social aspects that are absent from the EFI.

4 Conclusion

In this paper, we review the steps in the construction of a composite indicator and illustrate them by constructing a simple sustainable development index (SDI), which includes three main dimensions: economic, social and environmental. Each step is illustrated by the approaches most used in practice. Results are discussed, as well as the sensitivity of the indicator and of the scores, and the uncertainty related to a change in method. Regardless of the method used, the indicator remains subject to uncertainties, which make scores and rankings fluctuate. It also appears that the legitimate need to popularize the construction of CIs and the need to develop robust tools to reduce the uncertainty and sensitivity of composite scores are two irreconcilable goals (Aguna and Kovacevic 2010; Blancard and Hoarau 2013; Areal and Riesgo 2015). For this reason, and since CIs define, guide and evaluate public action, the choice of a method in the construction of a CI requires a coherent theoretical justification, without which the indicator would lack legitimacy. Regarding the validity of the index, correlation tests with some well-known indicators reveal very consistent results. Finally, given the prominent role of CIs in defining what a country, a territory or a city would be, research in this field should be encouraged in order to develop methods that make CIs more robust.