1 Introduction

Composite indicators have been gaining a prominent place in the literature. They already cover practically all areas of activity as they facilitate the interpretation and understanding of complex realities and offer a one-dimensional measure for multidimensional phenomena associated with several sub-indicators (Becker et al., 2017; Kuc-Czarnecka et al., 2020; Mazziotta & Pareto, 2017; OECD, 2008). There are many examples of composite indicators in the operations research, economics and management fields (Dočekalová & Kocmanová, 2016; Charles et al., 2018; Calabria et al., 2018; Lafuente et al., 2020; D'Inverno & De Witte 2020; Ferreira et al., 2020; Karagiannis, 2021). In particular, researchers have widely used composite indicators in the social sciences field (Dickes & Valentova, 2013; Marozzi, 2015, 2016, 2021; Marzi et al., 2019; Mazziotta & Pareto, 2016). This popularity is due to the multidimensional characteristic of social phenomena such as poverty, well-being, inequality, and social vulnerability (De Muro et al., 2011; Frigerio et al., 2018; Libório et al., 2021; Mazziotta & Pareto, 2018; Pinar, 2019).

The construction of composite indicators is conceptually simple. It consists of ten steps: development of the theoretical framework; data selection; imputation of missing data; exploratory multivariate analysis; normalization; weighting and aggregation of sub-indicators; robustness and sensitivity analysis; going back to the data to identify the main drivers of the results; linking the composite indicator to other variables through correlation; and presentation and visualization of the composite indicator (OECD, 2008).

Although simple, the construction of composite indicators is associated with uncertainties arising from the different ways of normalizing, weighting, and aggregating the sub-indicators (Cinelli et al., 2020; Dialga & Giang, 2017). Despite some exceptions, the different ways of weighting the sub-indicators significantly impact the composite indicator scores (Becker et al., 2017; Greco et al., 2019).

Sub-indicator weighting approaches can be classified as Equal Weights, Data-driven, and Participatory (El Gibari et al., 2019). Equal Weights is the most employed weighting approach in the composite indicators literature (OECD, 2008). Data-driven approaches such as Principal Component Analysis (PCA, Pearson, 1901) and Data Envelopment Analysis (DEA, Charnes et al., 1978) assign the weights statistically. Participatory approaches such as the Analytic Hierarchy Process (AHP, Saaty, 1988) and Conjoint Analysis (Green & Srinivasan, 1978) are based on expert opinions.

Some authors argue that Data-driven weighting approaches are more reliable as they are less prone to subjective opinions (OECD, 2008). Others argue that Participatory approaches "make the subjectivity behind the process of weighing sub-indicators controllable and, most importantly, transparent" (Greco et al., 2019, p.67). Besides, bringing together experts, policymakers, or citizens to mutually decide on the importance of what is at stake is society's natural and desired behavior (Munda, 2005). However, none of these three weighting approaches is beyond criticism (Maričić et al., 2019). Equal Weights and Data-driven approaches do not capture the relative importance of the sub-indicators (Gan et al., 2017; Lafuente et al., 2020; Libório et al., 2021). Participatory approaches are susceptible to errors and bias in expert assessments (Greco et al., 2019).

The assessment errors are the first problem considered in this research. These errors are expected when the number of assessments is large because the consistency of the assessments decreases as the number of sub-indicators increases (Gan et al., 2017; Otoiu et al., 2021). The inconsistency of the assessments is a problem because multidimensional phenomena possess, by nature, a large number of sub-indicators (e.g., Lind, 2019; Marzi et al., 2019).

Researchers in the decision-making field understand that the inconsistency of the assessments also occurs because the experts find it difficult to express their preferences regarding weights directly (Fishburn, 1972; Roszkowska, 2013). One way to reduce the difficulty of direct assigning weights is to use other assessment formats, such as paired comparison of alternatives or ordering alternatives by importance (Ramalho et al., 2019).

Even though experts find it easier to express their preferences indirectly, the corresponding formats have shortcomings. In particular, paired comparisons of alternatives are laborious, as the number of required assessments grows as \(n\times (n - 1)/2\) with the number \(n\) of sub-indicators (Saaty, 1980). The ordering of alternatives by importance produces a ranking, not weights (Sureeyatanapas et al., 2018). In the latter case, it is possible to use transformation functions to obtain the weights of alternatives ordered by importance (Roszkowska, 2013). These transformation functions permit one to reduce different forms of heterogeneous information (quantitative or qualitative, ordinal or cardinal, interval or ratio scales, etc.) to a unique necessary format, providing information homogeneity (Libório et al., 2022; Pedrycz et al., 2011).
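To illustrate the growth in the number of paired comparisons, a minimal sketch (the function name is ours; the values 17 and 41 correspond to the number of sub-indicators of the Cost of Doing Business Index and Ease of Doing Business Index discussed later):

```python
def pairwise_comparisons(n: int) -> int:
    """Assessments needed to fill a reciprocal n x n comparison matrix."""
    return n * (n - 1) // 2
```

For seventeen sub-indicators this already implies 136 paired comparisons, and for forty-one it implies 820, against a single ordering in the \(OA\) format.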

Several functions can transform alternatives ordered by importance into weights. Many studies indicate which function produces the best results (Ekel et al., 2020; Feng et al., 2014). However, little is known about the extent to which these functions impair the consensus degree between individual and collective assessments. Understanding this effect helps identify which transformation functions generate the weights most compatible with the collective opinion. This identification is relevant because it makes it possible to obtain a composite indicator that is more faithful to the collective opinion, especially when the number of experts is small.

The second problem considered in this research occurs when the number of experts is small. The smaller the number of experts, the greater the influence of individual assessments on the results (Ekel et al., 2020; Parreiras et al., 2012). The small number of experts is a frequent problem in international comparisons as the weights reflect the specific conditions of each country, requiring experts who know the reality of several countries (Gan et al., 2017; OECD, 2008).

Therefore, the contributions offered by this research are focused on composite indicators built with two characteristics. First, when the sub-indicators are weighted with weights obtained by the participatory approach. Second, when the composite indicator aggregates a large number of sub-indicators assessed by a few experts. These characteristics are found in several composite indicators: Global Innovation Index (Garcia-Bernabeu et al., 2020), Multidimensional Poverty Index (Alkire & Santos, 2014), Sustainable Development Goals Index (Diaz‐Sarachaga et al., 2018), Ease of Doing Business Index, which is constructed with forty-one sub-indicators (Kuc-Czarnecka et al., 2020; World Bank, 2020) as well as the composite indicator explored in this research: Cost of Doing Business Index (Bernardes et al., 2021).

This contribution is relevant for three reasons. First, slight differences in the sub-indicators' weights can significantly modify the composite indicator's scores (Becker et al., 2017; Saisana et al., 2005). Second, it is difficult to reach a consensus on the weights when the number of sub-indicators is large (Greco et al., 2019). Third, the consensus is even more challenging to achieve when international comparisons are made (OECD, 2008). The contribution of this research is to provide the means to obtain consistent weights, with few assessment errors, from the opinions of a small number of experts.

In addition to this Introduction, Sect. 2 presents the different assessment formats and the format transformation functions. Section 3 describes the methods used to measure the consensus degree of the expert assessments. The consensus-based Cost of Doing Business Index for G20 countries is introduced in Sect. 4. The results and analyses are presented in Sects. 5 and 6, and the conclusions are in Sect. 7.

2 Assessment Formats for Obtaining the Sub-Indicators Weights

The bounded rationality of decision-makers (see Simon, 1950) makes the weight assessment process challenging (Greco et al., 2019). In these situations, researchers in the field of Decision-Making show that it is possible to facilitate the assessment of alternatives, that is, of the sub-indicator weights, by adopting the appropriate assessment format (Pedrycz et al., 2011; Roszkowska, 2013; Ramalho et al., 2019; Ekel et al., 2020).

When facing different information sources, any expert involved in the decision-making process may want to express his/her preferences through different preference formats (Ekel et al., 2020). Ramalho et al. (2019) consider the following five preference formats: Ordering of Alternatives (\(OA\)), Utility Value (\(UV\)), Multiplicative Preference Relations (\(MR\)), Fuzzy Estimates (\(FE\)), and Fuzzy Preference Relations (\(RF\)), which can be Additive Reciprocal (\(RR\)) or Non-Reciprocal (\(RN\)). These formats can cover most real decision-making situations (Ekel et al., 2020).

The \(OA\) format reflects the ordering of alternatives from best to worst by a vector \(O=\left[o\left({x}_{1}\right),o\left({x}_{2}\right),\dots ,o\left({x}_{k}\right),\dots ,o\left({x}_{n}\right)\right]\), where \(o\left({x}_{k}\right)\) is a permutation function that returns the position of the alternative \({x}_{k}\) among the integer values \(\{1, 2,\dots ,k,\dots , n\}\). In the \(UV\) format, the alternatives are given as a set of \(n\) utility values \(U=\{u\left({x}_{1}\right),u\left({x}_{2}\right),\dots , u({x}_{k}),\dots , u({x}_{n})\}\), where \(u\left({x}_{k}\right)\in \left[0,1\right]\) represents the utility value assigned to the alternative \({x}_{k}\). The \(MR\) format is given as a matrix \({MR}_{ n\times n}\) of positive and reciprocal preference relations \(m\left({x}_{k},{x}_{l}\right)\), \(k, l = 1,\dots ,n\), reflecting how many times the alternative \({x}_{k}\) is more or less preferable than \({x}_{l}\), expressed on Saaty's (1977) intensity scale. In the \(FE\) format, the alternatives can be directly assessed by experts using a set of estimates \(L=\{l\left({x}_{1}\right),l\left({x}_{2}\right),\dots ,l\left({x}_{k}\right),\dots ,l\left({x}_{n}\right)\}\), where \(l\left({x}_{k}\right)\) is the fuzzy estimate associated with alternative \({x}_{k}\) from the perspective of some criterion. Ekel et al. (1998) show that \(FE\) can be automatically transformed into \(RF\). These relations reflect the degree to which the alternative \({x}_{k}\) is at least as good as \({x}_{l}\). Annex I lists these and other notations used in the formulas presented below.

It is necessary to indicate that only the \(UV\) format allows the attribution of weights to sub-indicators directly, that is, by direct rating (Pedrycz et al., 2011). However, the literature has long recognized that individuals have difficulties making accurate assessments (Fishburn, 1972) because of their limited capacity to assess and process information (Simon, 1950). The assessment difficulty may be greater or lesser depending on the time available to carry out the assessments, the expert's knowledge of the alternatives, or the presence of inaccurate, incomplete, or partial information (Roszkowska, 2013). These difficulties can be even more significant when experts differ among themselves on the weights of sub-indicators, resulting in endless debates (Greco et al., 2019; Saisana & Tarantola, 2002).

Pedrycz et al. (2011) show that it is possible to reduce the difficulties faced by experts when they do not know how to assess quantitative alternatives by replacing the \(UV\) format with other formats.

Weight assessment using the \(MR\) format is considered less prone to errors in judgment than other methods (Saisana & Tarantola, 2002). However, when many alternatives are assessed, "it exerts cognitive stress on decision-makers" and increases the chances of inconsistent assessments (Greco et al., 2019, p. 68). Considering this, Roszkowska (2013) suggests assessing the alternatives by applying the \(OA\) format and obtaining the weights using the corresponding transformation functions.

The \(OA\) format is mainly used when the expert does not know how much one alternative is better or worse than the others but knows the order of the alternatives by importance (Ekel et al., 2020). Roszkowska (2013) considers it more reliable to obtain weights by transforming ordered assessments into weights than by assessing the weights directly. According to Roszkowska (2013), decision-makers are more confident about the order of the alternatives than about the alternatives' weights. Sureeyatanapas et al. (2018) affirm that assessment in the \(OA\) format decreases uncertainty about the data as it requires less cognitive effort from decision-makers during the evaluation process.

2.1 Transformation Functions to Obtain Weights Directly

The literature shows that many functions allow the direct transformation of sub-indicators ordered by importance into weights (Ekel et al., 2020; Roszkowska, 2013; Sureeyatanapas, 2016).

Equal Weights (EW) requires no knowledge about the priorities of the \(n\) sub-indicators, and the weights are obtained (Sureeyatanapas, 2016) as

$$w_{j} = \frac{1}{n}, j = 1,2,.., n$$
(1)

Rank Sum (RS) obtains the weights from the order of the alternatives, normalized by dividing each term by the sum of the rank positions (Stillwell et al., 1981), through the following expression:

$$w_{j} = \frac{n - o\left( j \right) + 1}{{\mathop \sum \nolimits_{j = 1}^{n} \left( {n - o\left( j \right) + 1} \right)}} = \frac{{2\left( {n + 1 - o\left( j \right)} \right)}}{{n\left( {n + 1} \right)}}, j = 1,2,..,{ }n$$
(2)

where \(o\left(j\right)\) is the order of importance of the \(j\)-th sub-indicator.

Rank Exponent (RE) is a generalization of RS that incorporates the weight of the most important sub-indicator through the parameter \(p\); it reproduces Equal Weights when \(p=0\) and Rank Sum when \(p=1\), through the expression

$$w_{j} = \frac{{\left( {n - o\left( j \right) + 1} \right)^{p} }}{{\mathop \sum \nolimits_{j = 1}^{n} \left( {n - o\left( j \right) + 1} \right)^{p} }}, j = 1,2,..,{ }n$$
(3)

where \(p\) is a parameter that controls the distribution of the weights.

The parameter \(p\) can be defined by a decision-maker using the weight of the most important criterion or by interactive adjustment, as shown in Table 1. It can be seen that the distribution of weights becomes steeper as \(p\) increases.

Table 1 Effect of parameter p on the weights of alternatives according to their respective orders

Rank Reciprocal (RR) uses the reciprocals of the alternatives' ranks, normalized by dividing each term by the sum of the reciprocals (Stillwell et al., 1981), as follows:

$$w_{j} = \frac{{1/o\left( j \right)}}{{\mathop \sum \nolimits_{j = 1}^{n} \left( {1/o\left( j \right)} \right)}},{ }j = { }1,2,..,{ }n$$
(4)

The Rank-Order Centroid (ROC) produces weight estimates that minimize the maximum error of each weight by identifying the centroid of all possible weights consistent with the rank order of importance (Barron & Barrett, 1996), through the expression

$$w_{j} = \frac{1}{n}{ }\mathop \sum \limits_{k = o\left( j \right)}^{n} \frac{1}{k}$$
(5)

The most important sub-indicator is ordered first \(o\left( j \right) = 1\) while the least important is \(o\left( j \right) = n\).
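Under these definitions, the transformation functions (2)-(5) can be sketched as follows (a minimal Python illustration; the function names are ours, and each function takes the vector of ranks \(o(j)\), with 1 denoting the most important sub-indicator):

```python
def rank_sum(order):
    """Eq. (2): Rank Sum weights from a vector of ranks (1 = most important)."""
    n = len(order)
    return [2 * (n + 1 - o) / (n * (n + 1)) for o in order]

def rank_exponent(order, p):
    """Eq. (3): Rank Exponent; p=0 reproduces Equal Weights, p=1 Rank Sum."""
    n = len(order)
    raw = [(n - o + 1) ** p for o in order]
    total = sum(raw)
    return [r / total for r in raw]

def rank_reciprocal(order):
    """Eq. (4): reciprocals of ranks normalized by their sum."""
    total = sum(1 / o for o in order)
    return [(1 / o) / total for o in order]

def rank_order_centroid(order):
    """Eq. (5): ROC weight = (1/n) * sum of 1/k for k from o(j) to n."""
    n = len(order)
    return [sum(1 / k for k in range(o, n + 1)) / n for o in order]
```

For example, with three sub-indicators ranked [1, 2, 3], RS yields roughly (0.50, 0.33, 0.17) while ROC yields (0.61, 0.28, 0.11), illustrating ROC's steeper weight distribution.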

The ROC function has an attractive theoretical basis and performs better than other functions that transform the \(OA\) format into the \(UV\) format (Roszkowska, 2013). It tends to perform best because the slope and nonlinearity of its weights are closest to decision-makers' behavior (Sureeyatanapas, 2016). The ROC function "outperforms other functions in most experimental scenarios and measures, particularly, in terms of the precision of choice" (Sureeyatanapas et al., 2018, p.73).

Researchers point out the ROC function as the best function for transforming assessments in \(OA\) format to \(UV\) (Roszkowska, 2013; Sureeyatanapas, 2016; Sureeyatanapas et al., 2018). However, this understanding ignores at least three elements. First, the weights of the sub-indicators can be obtained indirectly by converting the assessments in the \(OA\) format to \(MR\), and then applying the AHP method (Ekel et al., 2020; Ramalho et al., 2019). Second, comparing the ROC and the AHP results shows that these approaches provide similar estimates (Erkan & Elsharida, 2020). Third, little is known about the effects of transformation functions on the collective consensus degree calculated from the weights generated by each function (Parreiras et al., 2012; Pedrycz et al., 2011). In this context, it is necessary to consider obtaining the weights directly and indirectly, and also to measure the effects of each function on the consensus degree.

2.2 Transformation Function to Obtain Weights Indirectly

Ramalho et al. (2019) present a function that makes it possible to transform sub-indicators ordered by importance into the \(MR\) format and then obtain the weights through Saaty's (1980) AHP. For the transformation from \(OA\) to \(MR\), it is assumed that \(m\) is the upper limit of the \(MR\) scale, that is, \(m = 9\) (Saaty, 1980), and the following expression is applied:

$$RM\left( {x_{k} ,x_{l} } \right) = m^{{\frac{{o\left( {j_{l} } \right) - o\left( {j_{k} } \right)}}{n - 1}}} ,\quad n > 1$$
(6)

where \(o\left( {j_{l} } \right)\) is the order of importance of the sub-indicator \(l\), and \(o\left( {j_{k} } \right)\) is the order of importance of the sub-indicator \(k\).

This transformation function reduces the number of assessments needed to obtain the weights and, because the resulting matrix is reciprocal and consistent by construction, eliminates the possibility of inconsistent assessments (Chiclana et al., 1996). Once the assessments in \(MR\) format are obtained, the weights can be calculated using the AHP method.
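This indirect route can be sketched as follows (our function names; Eq. (6) is implemented in its exponential form, which preserves reciprocity, and the AHP weights are recovered via the geometric-mean approximation, which coincides with Saaty's principal eigenvector for a fully consistent matrix):

```python
import math

def oa_to_mr(order, m=9.0):
    """Eq. (6): build a multiplicative preference matrix from a ranking.
    Entry (k, l) = m ** ((o_l - o_k) / (n - 1)); reciprocity holds by construction."""
    n = len(order)
    return [[m ** ((order[l] - order[k]) / (n - 1)) for l in range(n)]
            for k in range(n)]

def ahp_weights(mr):
    """AHP weights via row geometric means, normalized to sum to one
    (exact for the consistent matrices produced by oa_to_mr)."""
    n = len(mr)
    gm = [math.prod(row) ** (1 / n) for row in mr]
    total = sum(gm)
    return [g / total for g in gm]
```

For three sub-indicators ranked [1, 2, 3], the resulting weights are 9/13, 3/13, and 1/13.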

3 Consensus Degree of Assessments

After transforming the assessments in the \(OA\) format into weights, directly by applying (1)-(5) or indirectly by applying (6) followed by the AHP, the collective weight of each sub-indicator can be calculated as

$$\lambda_{j} = \frac{1}{e}\mathop \sum \limits_{i = 1}^{e} w_{ij} ,{ }j = { }1,2,..,{ }n$$
(7)

where \(e\) is the number of experts involved in the assessment and \(w_{ij}\) is the weight assigned by the \(i\)-th expert to the \(j\)-th sub-indicator.

From (7), it is possible to calculate the consensus degree of assessments through several methods. These methods reflect the reliability of assessments and are divided into four classes: Test–retest, Parallel forms, Internal-reliability, and Inter-rater. Test–retest does not apply to this research because it estimates the reliability of the same test over time. Parallel forms are also not applicable because it assesses the reliability of different test versions designed to be equivalent. Cronbach's (1951) alpha and the average inter-item correlation obtained through Pearson's (1920), Kendall's (1948), and Spearman's (1961) correlation coefficients assess the internal consistency of individual items of a test. However, these tests are not the most suitable for this study because they disregard the possibility that agreement occurs by chance. Inter-rater reliability methods such as Cohen's (1960) kappa, Scott's (1955) pi, Fleiss' (1971) kappa, Lin's (1989) Concordance Correlation Coefficient, and Bartko's (1966) Intraclass Correlation Coefficient measure the reliability of assessments conducted by different people. These methods measure the proximity or agreement between the assessments and are indicated when one of the assessments is considered a reference (Barnhart et al., 2007).

The CCC and the ICC were used to measure the consensus degree among experts for five reasons. First, Cohen's kappa and Scott's pi measures agreement only for two raters. Second, Fleiss kappa measures agreement only for ordinal and categorical data. Third, "Historically, the agreement between quantitative measurements has been assessed via the Intraclass Correlation Coefficient" (Barnhart et al., 2007, p.25). Fourth, "the Concordance Correlation Coefficient is the most popular index for assessing agreement in the statistical literature" (Barnhart et al., 2007, p. 28). Fifth, these methods have been successfully used to measure inter-rater agreement in determining the weights of sub-indicators (Faucher et al., 2004; Gómez-Limón et al., 2020).

The CCC measures the precision and accuracy of the assessments (Crawford et al., 2007). Precision reflects how much the assessments \(w_{ij} ,i = 1,2...,e , j = 1,2,.., n\) and \(\lambda_{j} ,j = 1,2,.., n\) deviate from the fitted regression line, while accuracy reflects how much that fitted line deviates from the line of perfect concordance. The \({\text{CCC}}\) is calculated based on the following expression:

$${\text{CCC}} = \frac{{2\rho \sigma_{w} \sigma_{\lambda } }}{{\sigma_{w}^{2} + \sigma_{\lambda }^{2} + \left( {\mu_{w} - \mu_{\lambda } } \right)^{2} }},$$
(8)

where \(\mu_{w}\) and \(\mu_{\lambda }\) are the averages of \(w_{ij}\) and \(\lambda_{j}\), \(\sigma_{w}^{2}\) and \(\sigma_{\lambda }^{2}\) are the variances of \(w_{ij}\) and \(\lambda_{j}\), and \(\rho\) is the correlation coefficient between \(w_{ij}\) and \(\lambda_{j}\).
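The collective weights (7) and the concordance measure (8) can be sketched in plain Python (the function names are ours; population variances are used, matching the definition of the CCC):

```python
def collective_weights(W):
    """Eq. (7): W[i][j] is the weight assigned by expert i to sub-indicator j."""
    e = len(W)
    return [sum(row[j] for row in W) / e for j in range(len(W[0]))]

def ccc(x, y):
    """Eq. (8): Concordance Correlation Coefficient between two weight vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n  # rho * sigma_w * sigma_lambda
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

In this study, each expert's weight vector \(w_{ij}\) would be compared against the collective vector \(\lambda_{j}\); the CCC equals 1 only for perfect agreement.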

The Intraclass Correlation Coefficient measures the proportion of the overall variance in the data that is due to variability between experts' opinions, including systematic expert errors and random residual errors (Chen & Barnhart, 2008). For two sets of assessments, the Intraclass Correlation Coefficient \({\text{ICC}}\) can be calculated by applying

$${\text{ICC}} = \frac{1}{{Ns^{2} }}\mathop \sum \limits_{j = 1}^{N} \left( {w_{j,1} - \overline{w}} \right)\left( {w_{j,2} - \overline{w}} \right),$$
(9)

where

$$\overline{w} = \frac{1}{2N}\mathop \sum \limits_{j = 1}^{N} \left( {w_{j,1} + w_{j,2} } \right),$$
(10)

and

$$s^{2} = \frac{1}{2N}\left\{ {\mathop \sum \limits_{j = 1}^{N} \left( {w_{j,1} - \overline{w}} \right)^{2} + \mathop \sum \limits_{j = 1}^{N} \left( {w_{j,2} - \overline{w}} \right)^{2} } \right\}$$
(11)

Koo and Li (2016) interpret the ICC values as follows: \({\text{ICC}} < 0.50\) poor; \(0.50 \le {\text{ICC}} < 0.75\) moderate; \(0.75 \le {\text{ICC}} < 0.90\) good; and \({\text{ICC}} \ge 0.90\) excellent. Although both methods measure the concordance between opinions, the \({\text{ICC}}\) can be higher or lower than the \({\text{CCC}}\) (Feng et al., 2014).
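Equations (9)-(11) for two sets of assessments can be sketched as (our function name; the grand mean and pooled variance follow the two-rater definition above):

```python
def icc(x, y):
    """Two-rater Intraclass Correlation Coefficient, Eqs. (9)-(11)."""
    N = len(x)
    mean = sum(a + b for a, b in zip(x, y)) / (2 * N)            # Eq. (10)
    s2 = (sum((a - mean) ** 2 for a in x)
          + sum((b - mean) ** 2 for b in y)) / (2 * N)           # Eq. (11)
    return sum((a - mean) * (b - mean) for a, b in zip(x, y)) / (N * s2)  # Eq. (9)
```

Identical assessments yield an ICC of 1, while strongly opposed orderings produce negative values.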

4 Consensus-Based Cost of Doing Business Index for G20 Countries

The present study is developed in four steps. In the first step, the theoretical framework of the Cost of Doing Business Index is formulated. The second step consists of selecting, collecting, and normalizing the sub-indicator data. The expert assessments and the weighting and aggregation of the sub-indicators are carried out in the third step. The uncertainty analysis is carried out in the fourth and last step.

4.1 Theoretical Framework

The World Bank's (2020) Ease of Doing Business Index occupies a prominent position in the composite indicators literature. The index measures the quality of the business environment of one hundred and ninety countries (World Bank, 2020). The Ease of Doing Business Index is calculated in three steps. First, the forty-one sub-indicators are rescaled using the linear Min–Max transformation (see Cinelli et al., 2020 or Eq. 12). Second, the rescaled sub-indicators are aggregated by the arithmetic mean into ten Sub-Indices. Third, the Sub-Indices are aggregated into the final index, which assigns a value of 100 to the most business-friendly country and 0 to the least business-friendly country.

Despite its popularity, studies reveal that the Ease of Doing Business Index is associated with limitations that can distort the index's adequacy and generate misleading messages (Kuc-Czarnecka et al., 2020). First, the calculation of the Ease of Doing Business Index does not consider that the sub-indicators structured by the World Bank (2020) in its ten dimensions do not necessarily have the same relevance in representing the country's business environment (Maričić et al., 2019; Ruiz et al., 2018). Second, the sample of data collected for calculating the Ease of Doing Business Index is restricted to a single city and type of company, limiting the index's ability to represent the business environment conditions of the entire economy (Breen & Gillanders, 2012; Vokoun & Daza Aramayo, 2017). Third, the composite indicator construct is not internally consistent (Pinheiro-Alves & Zambujal-Oliveira, 2012).

The weighting of sub-indicators with equal weights is a concern among researchers as the sub-indicators have different relative importance in representing the business environment. In particular, participatory weighting approaches produce better results than Data-driven approaches to represent a country's economic performance and competitiveness (Lafuente et al., 2020). This scenario has motivated the construction of alternative composite indicators for the Ease of Doing Business Index (e.g., Rogge & Archer, 2021; Tan et al., 2018).

4.2 Selection, Collection, and Normalization of Data

The Cost of Doing Business Index is an alternative composite indicator to the Ease of Doing Business Index (Ekel et al., 2022). The Cost of Doing Business Index assigns different weights to the sub-indicators, aggregates only the sub-indicators whose values are the same for all cities and companies in the country, and presents an internally consistent construct (Bernardes et al., 2021). The Cost of Doing Business Index construct considers only sub-indicators of costs regulated by governments and applicable to the entire economy (Ekel et al., 2022; Bernardes et al., 2021).

The Cost of Doing Business Index comprises seventeen sub-indicators associated with eight (not calculated) Ease of Doing Business Sub-Indices. These Sub-Indices and their respective sub-indicators are presented in Table 2.

Table 2 Cost of Doing Business Index sub-indicators

The data refer to 2019 and were obtained from the World Bank DataBank (World Bank, 2021). Descriptive statistics (Table 7) and correlations between sub-indicators (Table 8) are available in Annex II. In general, there are few significant correlations between the sub-indicators. These low correlations suggest that variations in weights can significantly modify the composite indicator scores (Paruolo et al., 2013).

Then, the seventeen sub-indicators were normalized by the Min–Max method as follows

$$S_{jr} = \frac{{x_{jr} - {\text{min}}\left( {x_{jr} } \right)}}{{{\text{max}}\left( {x_{jr} } \right) - {\text{min}}\left( {x_{jr} } \right)}}$$
(12)

where \(x_{jr}\) is the value of the \(j\)-th sub-indicator of the \(r\)-th country, \({\text{min}}\left( {x_{jr} } \right)\) is the minimum value of the \(j\)-th sub-indicator among all countries, and \({\text{max}}\left( {x_{jr} } \right)\) is the maximum value of the \(j\)-th sub-indicator among all countries.
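Equation (12) reduces to the usual Min–Max rescaling across countries, sketched below (the function name is ours; the input is the vector of one sub-indicator's values over all countries):

```python
def min_max(values):
    """Eq. (12): rescale a sub-indicator across countries to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For instance, the values [2, 4, 6] are rescaled to [0.0, 0.5, 1.0].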

4.3 Expert Assessment, Weighting, and Aggregation of Sub-Indicators

A group of six experts ordered the seventeen sub-indicators presented in Table 2 by importance. The experts were selected based on three criteria: 1) holding a management position or higher, 2) working in a multinational with operations in G20 countries, and 3) having a minimum of five years' experience in international business. The small number of experts consulted is a consequence of criterion 2. This small number is critical, as it increases the impact of each expert on the results. However, this is a common situation when assessments involve international comparisons. Naturally, the results of this research are mainly focused on situations where few experts are qualified to assess the sub-indicators.

First, the experts ordered the seventeen sub-indicators from most important to least important. Then, the order of the sub-indicators was transformed into weights by the direct and indirect transformation functions presented in (1)-(6).

After normalizing the sub-indicators through (12), the composite indicator is calculated using the collective weights obtained in (7), where the experts were considered to have equal importance levels. The following expression performs this calculation:

$$CI_{r} = \mathop \sum \limits_{j = 1}^{n} \lambda_{j} S_{jr}$$
(13)
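The linear aggregation in (13) can be sketched as (the function name is ours; S holds the normalized sub-indicator values per country and lam the collective weights):

```python
def composite_indicator(S, lam):
    """Eq. (13): CI_r = sum_j lambda_j * S_jr for each country r.
    S[r][j] is the normalized value of sub-indicator j for country r."""
    return [sum(l * s for l, s in zip(lam, row)) for row in S]
```

For example, with weights (0.6, 0.4), a country with normalized values (1.0, 0.0) scores 0.6.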

The direct aggregation of the seventeen sub-indicators in the Cost of Doing Business Index reduces the compensation between sub-indicators with poor and above-average performance, which is greater when sub-indicators are aggregated into pillars or sub-indices (Paruolo et al., 2013).

4.4 Uncertainty Analysis

The degree of impact of the weighting scheme on composite indicator ratings is usually verified through uncertainty analysis (Becker et al., 2017; Greco et al., 2019). The uncertainty analysis verifies how the different ways of normalizing, weighting, and aggregating the sub-indicators impact the composite indicator scores and ranks (Marozzi, 2016; Saisana et al., 2005). It indicates how robust the ranking of observations of a composite indicator is in terms of normalization function, weights of sub-indicators, and aggregation scheme (OECD, 2008). From the uncertainty analysis, it is possible to identify and reduce as much as possible the sources of uncertainty (Munda et al., 2009). For these reasons, uncertainty analysis is considered an essential step in constructing composite indicators (Becker et al., 2017; Charles et al., 2018; Dialga & Giang, 2017; Marozzi, 2021).

The uncertainty analysis is calculated from the variation of the countries' position in the rankings of the composite indicator, applying the following expression:

$$\overline{R}_{S} = \frac{1}{M}\mathop \sum \limits_{r = 1}^{M} \left| {{\text{Rank}}_{ref} \left( {CI_{r} } \right) - {\text{Rank}}\left( {CI_{r} } \right)} \right|,$$
(14)

where \(\overline{R}_{S}\) measures the relative change in the position of the entire system of countries in a single number, \({\text{Rank}}_{ref} \left( {CI_{r} } \right)\) is the ranking of country \(r\) according to the composite indicator of reference, and \({\text{Rank}}\left( {CI_{r} } \right)\) is the ranking of country \(r\) according to the composite indicator constructed with input parameters (normalization method, weights, experts, and aggregation schemes) different from the \({\text{Rank}}_{ref} \left( {CI_{r} } \right)\).

The ranking assigned by \({\text{Rank}}\left( {CI_{r} } \right)\) is an output of the uncertainty analysis. A total of fifty \({\text{Rank}}\left( {CI_{r} } \right)\) were calculated to analyze the uncertainty associated with:

  • Functions that transform sub-indicators ordered by importance into weights;

  • Sub-indicators with higher and lower weight in the composite indicator;

  • Experts who most and least agree with the collective opinion.
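The rank-shift measure in (14) can be sketched as (the function name is ours; both arguments are rankings of the same M countries):

```python
def mean_rank_shift(rank_ref, rank_alt):
    """Eq. (14): average absolute change in country positions between a
    reference ranking and an alternative ranking."""
    M = len(rank_ref)
    return sum(abs(a - b) for a, b in zip(rank_ref, rank_alt)) / M
```

A value of 0 indicates that the alternative specification leaves every country's position unchanged.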

5 Consensus Degree and Sub-Indicators Weights

The graphs in Figs. 1, 2, 3, 4 and 5 show the results related to the Concordance Correlation Coefficient. The graphs show how much the expert and collective opinions deviate from the perfect concordance line, represented by the 45-degree line (Lin, 1989). Thus, the Concordance Correlation Coefficient is given by the distance between the observed concordance (dotted line) and the perfect concordance (continuous line).

Fig. 1
figure 1

The concordance Correlation Coefficient between the weights obtained by the RS function

Fig. 2
figure 2

The concordance Correlation Coefficient between the weights obtained by the RE function. Note: p-parameter equal to five

Fig. 3 The Concordance Correlation Coefficient between the weights obtained by the RR function

Fig. 4 The Concordance Correlation Coefficient between the weights obtained by the ROC function

Fig. 5 The Concordance Correlation Coefficient between the weights obtained by the AHP

The graphs in Figs. 1, 2, 3, 4 and 5 reveal the following findings. First, the average difference between the consensus degrees obtained by the different transformation functions is six percent. Second, the consensus degree of the RR function is 1.19 times, or twelve percent, higher than that of the RS function. Third, the collective consensus degree is affected by the function that transforms assessments ordered by importance into weights. Fourth, the RR function allows the construction of a composite indicator more compatible with collective opinion than the other functions. Fifth, the consensus degrees of the ROC and AHP functions were 0.745 and 0.714, respectively. Sixth, the small difference between the consensus degrees produced by the ROC and AHP functions confirms that they yield similar results (e.g., Erkan & Elsharida, 2020). Seventh, the RR function, with a \({\text{CCC}}\) of 0.761, and the RE function, with a \({\text{CCC}}\) of 0.752, showed the highest consensus degrees among all transformation functions. Eighth, the ROC function does not guarantee the highest compatibility between the weights and the expert assessments, contrary to the current literature, which holds that the ROC function presents the best results (Roszkowska, 2013; Sureeyatanapas, 2016; Sureeyatanapas et al., 2018).

Figure 6 shows each expert's \({\text{ICC}}\) with the group for each of the five transformation functions applied. The first point represents the \({\text{ICC}}\) between the first expert's and the group's assessments calculated from the AHP. The last point represents the \({\text{ICC}}\) between the sixth expert's and the group's assessments calculated from the RR.

Fig. 6 The Intraclass Correlation Coefficient of expert and group assessments for the five transformation functions applied

Only the weights obtained from Expert 4's assessments by the RR function showed poor concordance (\({\text{ICC}}\) < 0.50) with the collective opinion. Moderate concordance (0.50 < \({\text{ICC}}\) < 0.75) was observed in 47% of cases and good concordance (0.75 < \({\text{ICC}}\) < 0.90) in 50% of cases.

The results confirm that the \({\text{ICC}}\) and the \({\text{CCC}}\) produce different values (Feng et al., 2014). However, the results show no change in the order of the most compatible functions. Table 3 confirms that RR is the function most compatible between the individual assessments and the collective opinion.

Table 3 Comparison between Concordance Correlation Coefficient and Intraclass Correlation Coefficient
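The paper does not state which ICC form underlies Fig. 6 and Table 3; purely as an illustration of how an intraclass correlation differs from Lin's CCC, the sketch below computes the simplest one-way random-effects estimate, ICC(1,1), from a targets-by-raters matrix. The function name and the choice of ICC form are assumptions.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) for an (n_targets, k_raters) matrix.

    Illustrative sketch: ICC(1,1) = (MSB - MSW) / (MSB + (k-1) * MSW),
    where MSB/MSW are the between- and within-target mean squares."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)
    msb = k * ((row_means - grand) ** 2).sum() / (n - 1)           # between targets
    msw = ((Y - row_means[:, None]) ** 2).sum() / (n * (k - 1))    # within targets
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfect agreement across raters gives an ICC of 1; disagreement within targets lowers it, mirroring the poor/moderate/good bands cited above.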

The following findings are revealed regarding the Intraclass Correlation Coefficient. First, the \({\text{ICC}}\) of the RR function is 1.18 times greater than the \({\text{ICC}}\) of the RS function. Second, the consensus degrees calculated by the \({\text{CCC}}\) and the \({\text{ICC}}\) from the weights obtained by the RS and RR functions do not show significant differences. Third, the ROC function remains preferable to the AHP function. Fourth, the RR function, with an \({\text{ICC}}\) of 0.773, and the RE function, with an \({\text{ICC}}\) of 0.765, continue to present the best results. Fifth, the order of results obtained by the \({\text{ICC}}\) and the \({\text{CCC}}\) is the same.

The \({\text{CCC}}\) and \({\text{ICC}}\) values reflect the different weights generated by the preference format transformation functions. Table 4 shows the weights generated by the transformation functions presented in (1)-(6).

Table 4 Sub-indicators Weights by preference format transformation function

These findings reveal that functions that transform orders into weights produce different results, and that the weights obtained by the different transformation functions significantly influence the consensus degree. Weighting the sub-indicators with the weights obtained by the transformation function that produces the highest consensus degree allows constructing a composite indicator more compatible with collective opinion. Finally, the sub-indicators' weights vary according to the function used to transform sub-indicators ordered by importance into weights.
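The rank-to-weight functions compared in this section follow standard surrogate-weighting formulas from the literature the paper cites (e.g., Roszkowska, 2013). The sketch below assumes those standard parameterizations (rank sum, rank reciprocal, rank exponent with parameter \(p\), and rank order centroid); the function name and signature are hypothetical.

```python
import numpy as np

def rank_weights(ranks, method="RR", p=5):
    """Transform importance ranks (1 = most important) into normalized
    weights, using the standard surrogate-weight formulas."""
    r = np.asarray(ranks, dtype=float)
    n = len(r)
    if method == "RS":        # rank sum: w ~ n - r + 1
        w = n - r + 1
    elif method == "RR":      # rank reciprocal: w ~ 1 / r
        w = 1.0 / r
    elif method == "RE":      # rank exponent: w ~ (n - r + 1)^p
        w = (n - r + 1) ** p
    elif method == "ROC":     # rank order centroid: w_j = (1/n) * sum_{i=r_j}^{n} 1/i
        w = np.array([sum(1.0 / i for i in range(int(rj), n + 1)) / n for rj in r])
    else:
        raise ValueError(f"unknown method: {method}")
    return w / w.sum()        # normalize so the weights sum to one
```

With two sub-indicators, for instance, ROC assigns weights (0.75, 0.25) while RS and RR both assign (2/3, 1/3), illustrating how the chosen function alone changes the weight vector and, in turn, the consensus degree.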

6 Uncertainty Associated with Transformation Functions, Sub-Indicator with Higher and Lower Weight, and Expert with Higher and Lower Consensus Degree

The G20 countries showed an average shift of 3.90 positions up or down. Specifically, three countries shifted by more than ten positions; Saudi Arabia and South Africa, for instance, shifted positions between the rankings produced by the RS and RE functions. These results reveal that assessments carried out in the \(OA\) format and transformed by different functions generate rankings with significant differences in the countries' positions. Table 5 shows the impact of the weights obtained by the different transformation functions on the countries' positions in the ranking of the Cost of Doing Business Index.

Table 5 Uncertainty analysis of the weights obtained by the different transformation functions

The largest shifts in countries' positions are associated with the attribution of Equal Weights to the sub-indicators. These shifts were greater than ten positions in nine countries, reflecting the difference between the weights obtained by the Participatory and Equal Weights approaches. In particular, the difference between the weights obtained by the RR Function and the Equal Weights is 3.74% on average. This difference implies significant shifts in the position of countries in the ranking, as illustrated in Fig. 7.

Fig. 7 The difference between weights obtained by the Equal Weights and Participatory approach (RR Function) and its impact on the position of countries in the Cost of Doing Business Index ranking

France's position in the Cost of Doing Business Index shifts more than twenty positions when the sub-indicators are weighted by Equal Weights instead of weights obtained by the RR function. These results demonstrate that disregarding the relative importance of sub-indicators in the costs of doing business significantly impacts the position of countries in the ranking.

The position of countries in the ranking also shifts when the lowest and highest weight sub-indicators are excluded from the composite indicator. The lowest average weight obtained by the transformation functions was 0.02, corresponding to the Ec1 sub-indicator "Attorney fees for enforcing contracts." The highest average weight was 0.18, corresponding to the Pt3 sub-indicator "Paying taxes fees." The Pt3 sub-indicator weight is 6.9 times greater than the weight of the Ec1 sub-indicator. Consequently, the impact of excluding the Ec1 sub-indicator on the displacement of countries' positions is four times smaller than that of excluding the Pt3 sub-indicator. Table 6 shows the displacement of countries' positions according to the transformation function after excluding the sub-indicators Ec1 and Pt3.

Table 6 The uncertainty associated with the sub-indicators of higher and lower weight in the composite indicator by transformation function
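The exclusion experiment behind Table 6 can be emulated by dropping one sub-indicator column, re-normalizing the remaining weights, and re-ranking. A hypothetical sketch assuming a weighted-sum aggregation over normalized sub-indicators, with higher scores ranked first; the paper's actual normalization and aggregation scheme may differ.

```python
import numpy as np

def exclude_and_rank(X, w, drop):
    """Recompute a weighted-sum ranking after dropping one sub-indicator.

    X    : (n_countries, n_subindicators) matrix of normalized values
    w    : weight vector over sub-indicators
    drop : column index of the sub-indicator to exclude
    Returns (scores, ranks), with rank 1 assigned to the highest score."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    keep = np.delete(np.arange(X.shape[1]), drop)
    w_kept = w[keep] / w[keep].sum()               # re-normalize to sum to one
    scores = X[:, keep] @ w_kept
    ranks = np.argsort(np.argsort(-scores)) + 1    # 1-based descending ranks
    return scores, ranks
```

Comparing the ranks returned here against the full-indicator ranks (e.g., with `mean_rank_shift`-style averaging) reproduces the kind of displacement reported for Ec1 versus Pt3: dropping a heavy column perturbs the ranking far more than dropping a light one.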

The exclusion of the experts who least and most agree with the collective opinion impacts both the countries' positions in the ranking and the consensus degree on the weights. Figure 8 shows the \({\text{CCC}}\), \({\text{ICC}}\), and \(\overline{R}_{S}\) by transformation function after excluding the assessments of Expert 2 (highest agreement) and Expert 3 (lowest agreement).

Fig. 8 The Concordance Correlation Coefficient, Intraclass Correlation Coefficient, and uncertainty after excluding the experts who most and least agree with the collective opinion

The results show that excluding the expert with the most divergent assessments increases the consensus degree by six percent, while excluding the expert with the most convergent assessments decreases it by four percent. Moreover, countries move an average of 0.46 positions in the ranking after excluding the expert who most converges with the collective opinion. In turn, countries shift, on average, 1.02 positions when the assessments of the expert who least agrees with the collective opinion are excluded.

Excluding the assessments of the expert whose opinion diverges most from the collective opinion increases the consensus degree on the weights obtained by the RR function by eight percent: excluding Expert 3's assessments raises the \({\text{CCC}}\) from 0.76 to 0.82 and the \({\text{ICC}}\) from 0.77 to 0.83. In addition to removing an important source of uncertainty, excluding this expert allows the construction of a composite indicator more compatible with collective opinion. Figure 9 shows the impact of excluding Expert 3 on the countries' positions and the final ranking of the Cost of Doing Business Index.

Fig. 9 The impact on the position of countries after the exclusion of Expert 3 and the final Cost of Doing Business Index

Because of the uncertainties, this research suggests that the sub-indicators be weighted based on the function that produces the weights most compatible with the experts' opinion: the RR function. Furthermore, the research results suggest excluding the expert who most diverges from collective opinion as this expert represents a considerable source of uncertainty. Naturally, these choices do not eliminate the uncertainties associated with the composite indicator since the different ways of normalizing and aggregating sub-indicators also impact the position of countries in the ranking. However, the construction of composite indicators with weights more compatible with collective opinion is a strategy that makes it possible to identify and reduce uncertainties associated with assessment errors, especially when few experts assess the weights of a large number of sub-indicators.

7 Conclusions

Errors in assessing the weights of sub-indicators are a problem inherent to the participatory weighting approach. These errors occur mainly when the number of sub-indicators to be assessed is high, as this requires more cognitive effort from decision-makers and increases the inconsistency of the assessments. The participatory weighting approach is also often associated with the problem of international comparison: the weights of the sub-indicators are not necessarily the same because countries have specific characteristics. This problem can be overcome by selecting experts who know the reality of the countries involved. However, reducing the number of experts qualified to carry out the assessments is also problematic, as it increases the impact of each expert's assessments on the results. These problems are critical for composite indicators that involve a large number of sub-indicators and countries, such as the Global Innovation Index, the Multidimensional Poverty Index, the Sustainable Development Goals Index, and the Ease of Doing Business Index.

This research presents solutions to these problems by constructing the Cost of Doing Business Index for the G20 countries. First, ordering the sub-indicators by importance and then transforming the order into weights reduces the cognitive effort required to assess the sub-indicators. Second, transforming order into weights by the RR function yields weights with a consensus degree, on average, seven percent higher than the other functions. Third, excluding the most deviant assessments increases the stability of the composite indicator and the compatibility between the weights used in the weighting and the experts' collective opinion. In short, converting the sub-indicators ordered by importance into weights by the RR function and excluding Expert 3's assessments achieves a consensus degree twenty-nine percent higher than employing the RS function and keeping the assessments of all experts.

These results corroborate the current literature finding that transformation functions influence the weights and the countries' positions in the ranking, and add that these functions also influence the collective consensus degree. Furthermore, the results suggest weighting the sub-indicators with the weights obtained by the function that achieves the highest consensus degree and excluding experts whose opinions introduce uncertainty into the countries' positions in the ranking. In the present case, the RR function reached the highest consensus degree and is preferable to the other functions by the criterion of compatibility between the sub-indicators' weights and the collective opinion. Expert 3 introduced the greatest uncertainty into the countries' positions, and this expert's assessments were disregarded in the Cost of Doing Business Index for the G20 countries.

The research results also indicate a direct and positive relationship between an expert's influence on the consensus degree and the composite indicator's uncertainty. Future research that deepens the understanding of this relationship may indicate the threshold consensus degree required to obtain a stable composite indicator. Furthermore, replicating the methodology with other composite indicators would make it possible to assess how universal the results are and to answer two fundamental questions. Does the RR function always reach the highest consensus degree? Or is the function that achieves the greatest consensus degree specific to each situation?

8 Appendix 1: List of Notations

  • \(CI_{r}\) is the composite indicator score for the \(r\)-th country;

  • \({\text{CCC}}\) is the Concordance Correlation Coefficient of the assessments;

  • \(e\) is the number of experts involved in the assessment;

  • \(FE\) corresponds to the Fuzzy Estimates assessment format;

  • \({\text{ICC}}\) is the Intraclass Correlation Coefficient of the assessments;

  • \({\text{max}}\left( {x_{jr} } \right)\) is the maximum value of the \(j\)-th sub-indicator among all countries;

  • \({\text{min}}\left( {x_{jr} } \right)\) is the minimum value of the \(j\)-th sub-indicator among all countries;

  • \(\mu_{w}\) is the average of \(w_{ij}\);

  • \(\mu_{\lambda }\) is the average of \(\lambda_{j}\);

  • \(OA\) is the Ordering of Alternatives assessment format;

  • \(O = \left[ {o\left( {x_{1} } \right),o\left( {x_{2} } \right), \ldots ,o\left( {x_{k} } \right), \ldots ,o\left( {x_{n} } \right)} \right]\) is the vector of alternatives ordered from best to worst;

  • \(o\left( j \right)\) is the order of importance of the \(j\)-th sub-indicator;

  • \(o\left( {j_{l} } \right)\) is the order of importance of the sub-indicator \(l\);

  • \(o\left( {j_{k} } \right)\) is the order of importance of the sub-indicator \(k\);

  • \(o\left( {x_{k} } \right)\) is the \(k\)-th order of importance of alternative \(x\);

  • \(p\) is a parameter that controls the distribution of the weights;

  • \(\rho\) is the correlation coefficient between \(w_{ij}\) and \(\lambda_{j}\);

  • \({\text{Rank}}_{ref} \left( {CI_{r} } \right)\) is the ranking of the \(r\)-th country according to the composite indicator of reference;

  • \({\text{Rank}}\left( {CI_{r} } \right)\) is the ranking of the \(r\)-th country according to the composite indicator constructed with input parameters (normalization method, weights, experts, and aggregation schemes) different from the \({\text{Rank}}_{ref} \left( {CI_{r} } \right)\);

  • \(RF\) is the Fuzzy Preference Relations assessment format;

  • \(RM\left( {x_{k} ,x_{l} } \right)\) is a matrix \(RM_{n \times n}\) of positive and reciprocal preference relations \(m\left( {x_{k} ,x_{l} } \right)\), \(k, l = 1, \ldots ,n\), where \(m\left( {x_{k} ,x_{l} } \right)\) expresses how many times alternative \(x_{k}\) is more (or less) preferable than \(x_{l}\);

  • \(RN\) is the Non-Reciprocal form of \(RF\);

  • \(RR\) is the Reciprocal Additive form of \(RF\);

  • \(\overline{R}_{S}\) is the relative change in the position of the entire system of countries in a single number;

  • \(S_{jr}\) is the normalized value of the \(j\)-th sub-indicator of the \(r\)-th country;

  • \(\sigma_{w}^{2}\) is the variance of \(w_{ij}\);

  • \(\sigma_{\lambda }^{2}\) is the variance of \(\lambda_{j}\);

  • \(U = \left\{ {u\left( {x_{1} } \right),u\left( {x_{2} } \right), \ldots , u\left( {x_{k} } \right), \ldots , u\left( {x_{n} } \right)} \right\}\) is the set of \(n\) utility values;

  • \(UV\) is the Utility Value assessment format;

  • \(u\left( {x_{k} } \right) \in \left[ {0,1} \right]\) represents the utility value assigned to the alternative \(x_{k}\);

  • \(w_{j}\) is the weight of the \(j\)-th sub-indicator;

  • \(w_{ij}\) is the weight assigned by the \(i\)-th expert to the \(j\)-th sub-indicator;

  • \(x_{jr}\) is the value of the \(j\)-th sub-indicator of the \(r\)-th country;

  • \(\lambda_{j}\) is the collective weight of the \(j\)-th sub-indicator.

9 Appendix 2: Sub-Indicators Statistics

See Tables 7 and 8.

Table 7 Descriptive statistics of the Cost of Doing Business Index sub-indicators
Table 8 Correlation between the sub-indicators