1 Introduction

Income data are generally made available in binned formats by governing bodies to preserve the confidentiality of the individual participants. Obtaining inferences from such summary information was recently discussed by Dedduwakumara and Prendergast (2018), who obtained confidence intervals for quantiles from estimates of the underlying distribution based on grouped data. As we show in what follows, reliable confidence intervals for some inequality measures can also be obtained from grouped data using bootstrap and Wald-type approaches.

Motivated by these findings, we compare interval estimators for inequality measures when the data are available in grouped form only. For comparison, we use the well-known Gini, Theil and Atkinson indices and the newly proposed quantile ratio index (Prendergast and Staudte 2018). We begin by introducing these measures in Sect. 2 before discussing some distribution estimation strategies in Sect. 3. In Sect. 4, we describe the interval estimators, and in Sect. 5 we report findings of simulations for these estimators. Two real data examples are presented in Sect. 6, followed by a brief discussion in Sect. 7.

2 Some Inequality Measures

Let f, F and Q denote the density, distribution and quantile functions respectively for the population of interest. For \(p \in [0,1]\), let \(x_p=Q(p)=F^{-1}(p)\) denote the p-th quantile. We find it convenient to consider continuous probability distributions to model incomes while acknowledging that, in practice, a population of incomes has a finite number, N, of individuals. Let \(x_1,\ldots ,x_n\) denote a simple random sample of incomes from the population and let \(\widehat{x}_p\) be the estimated p-th quantile.

2.1 Gini Index

Suppose \(X\sim F\) where X represents a randomly chosen income from the population and let \(\mu =E(X)\) denote mean income. Easily the most commonly used inequality measure is the Gini index (Gini 1914), which measures the deviation of the income distribution from perfect equality. It can be defined as,

$$\begin{aligned} G =1-\frac{1}{\mu }\int \limits _0^\infty [1-F(x)]^2 \ dx \end{aligned}$$

with \(G \in [0,1]\). Here, \(G=1\) indicates that one individual holds all wealth (i.e. only one individual has income greater than zero) and \(G=0\) represents equality of incomes for all. For a simple random sample of size n with ordered values \(x_1\le \cdots \le x_n\), the Gini index can be estimated by

$$\begin{aligned} \hat{G}=\frac{2\sum _i i x_i}{n\sum _i x_i}-\frac{n+1}{n}. \end{aligned}$$

For more details on the Gini index and estimation see, for example, Dixon et al. (1988) and Damgaard and Weiner (2000).
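As an illustration, the sample estimator above can be sketched in a few lines of Python. The function name `gini_estimate` is our own choice for this sketch and is not taken from any package; the sample is sorted internally so that the index i matches the ordered values.

```python
def gini_estimate(incomes):
    """Sample Gini index: 2*sum(i*x_(i)) / (n*sum(x)) - (n+1)/n.

    Hypothetical helper for illustration; x_(i) are the incomes
    sorted in ascending order and i runs from 1 to n.
    """
    x = sorted(incomes)
    n = len(x)
    total = sum(x)
    if n == 0 or total <= 0:
        raise ValueError("need a non-empty sample with positive total income")
    weighted = sum(i * xi for i, xi in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

For a sample in which a single individual holds all income, this estimator returns (n-1)/n rather than exactly one, which approaches one as n grows.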

2.2 Theil Index

Based on information theory, Theil (1967) proposed an entropy-based measure which is defined to be

$$\begin{aligned} T =\int \limits _0^\infty \Big (\frac{x}{\mu } \Big )\log \Big (\frac{x}{\mu }\Big ) f(x)\ dx \end{aligned}$$

where \(T \in [0,\infty )\). In practice, where a population consists of a finite number N of incomes, the upper bound is \(\ln (N)\). The Theil index can be estimated by

$$\begin{aligned} \widehat{T} = \frac{1}{n} \sum _i\frac{x_i}{\bar{x}}\ln \bigg (\frac{x_i}{\bar{x}}\bigg ) \end{aligned}$$

where \(\bar{x}\) is the sample mean and where \(\widehat{T}\in [0, \ln (n)]\). Further properties of the Theil index can be found in Theil (1967), Allison (1978) and Shorrocks (1980).
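A minimal Python sketch of this estimator follows. The name `theil_estimate` is hypothetical, and we assume the usual convention that a zero income contributes zero to the sum (the limit of \(r\ln r\) as \(r\rightarrow 0\)).

```python
import math

def theil_estimate(incomes):
    """Sample Theil index: (1/n) * sum((x_i/xbar) * ln(x_i/xbar)).

    Hypothetical helper for illustration. Zero incomes contribute 0,
    the limiting value of r*ln(r) as r -> 0 (an assumed convention).
    """
    n = len(incomes)
    mean = sum(incomes) / n
    if mean <= 0:
        raise ValueError("mean income must be positive")
    total = 0.0
    for x in incomes:
        if x < 0:
            raise ValueError("incomes must be non-negative")
        if x > 0:
            ratio = x / mean
            total += ratio * math.log(ratio)
    return total / n
```

With all income held by one individual in a sample of size n, the sketch returns \(\ln (n)\), in agreement with the stated upper bound.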

2.3 Atkinson Index

The Atkinson index was introduced by Atkinson (1970). This measure depends on the sensitivity parameter, \(\epsilon \) \((0<\epsilon < \infty )\), which represents the level of inequality aversion. As this parameter increases, the index becomes more sensitive to changes at the lower end of the income distribution; as it decreases, the index becomes more sensitive to changes at the upper end. It is defined as

$$\begin{aligned} A =1-\left[ \int \limits _0^\infty \Big (\frac{x}{\mu } \Big )^{1-\epsilon }f(x) \ dx \right] ^{\frac{1}{1-\epsilon }} \end{aligned}$$

where \( A \in [0,1]\).

An Atkinson value of A indicates that the same level of social welfare could be achieved with only a proportion \(1-A\) of total income if incomes were equally distributed. Depending on the value of \(\epsilon \), the sample estimate is

$$\begin{aligned} \hat{A} = {\left\{ \begin{array}{ll} \displaystyle 1-\frac{1}{\bar{x}}\bigg (\frac{1}{n}\sum _i x_i^{1-\epsilon } \bigg )^{\frac{1}{1-\epsilon }}, &{} \text {for} \quad 0 \le \epsilon < 1 \\ \displaystyle 1-\frac{1}{\bar{x}}\bigg (\prod _i x_i\bigg )^{\frac{1}{n}}, &{} \text {for} \quad \epsilon = 1 \end{array}\right. } \end{aligned}$$

We use \(\epsilon =0.5\) for our analysis, which is the default value in the package ineq (Zeileis 2014) for the R software (R Core Team 2017). More details on the Atkinson index can be found in Atkinson (1970), Biewen and Jenkins (2006) and Shorrocks (1980).
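The two cases of the sample estimate can be sketched as follows. This is an illustrative Python version under our own naming (`atkinson_estimate`); it is not the implementation in the ineq package, and it only covers the range of \(\epsilon \) given in the display above.

```python
import math

def atkinson_estimate(incomes, epsilon=0.5):
    """Sample Atkinson index for 0 <= epsilon <= 1 (hypothetical helper).

    epsilon < 1: 1 - (mean of x^(1-eps))^(1/(1-eps)) / xbar
    epsilon = 1: 1 - (geometric mean) / xbar, which needs positive incomes.
    """
    if not 0 <= epsilon <= 1:
        raise ValueError("this sketch only covers 0 <= epsilon <= 1")
    n = len(incomes)
    mean = sum(incomes) / n
    if epsilon == 1:
        # Geometric mean computed on the log scale for numerical stability.
        equivalent = math.exp(sum(math.log(x) for x in incomes) / n)
    else:
        power = 1.0 - epsilon
        equivalent = (sum(x ** power for x in incomes) / n) ** (1.0 / power)
    return 1.0 - equivalent / mean
```

For example, for the two incomes 1 and 4 the sketch gives 0.1 when \(\epsilon =0.5\) and 0.2 when \(\epsilon =1\).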

2.4 Quantile Ratio Index

Prendergast and Staudte (2018, 2019) introduced the quantile ratio index (QRI), which uses the ratio of symmetric quantiles and which is simpler than the similarly defined inequality measures given by Prendergast and Staudte (2016b). The QRI is defined as

$$\begin{aligned} I = 1-\int \limits _0^1 \frac{x_{p/2}}{x_{1-p/2}} \ dp =1-\int \limits _0^1 R(p) \ dp \end{aligned}$$

where \(I \in [0,1]\). Note that R(p) is the ratio of symmetric quantiles, so that I can be seen to be based on the average ratio of incomes chosen symmetrically from the poorer and richer halves of the incomes respectively. For a suitably large J, I is estimated as \(J^{-1}\sum _j\left[ 1-\widehat{R}(p_j)\right] \) where \(p_j=(j-1/2)/J\) and \(\widehat{R}(p_j)\) is the ratio of the estimated \((p_j/2)\)-th and \((1-p_j/2)\)-th quantiles. Prendergast and Staudte (2018) show that \(J=100\) is large enough to obtain good estimates of I and so this will be our choice in what follows.
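The grid approximation just described needs only a quantile function, which is why it transfers naturally to grouped data. A minimal sketch in Python is given below; `qri_estimate` and its argument `quantile_fn` are hypothetical names, and any estimated quantile function with positive values on (0, 1) can be passed in.

```python
def qri_estimate(quantile_fn, J=100):
    """Approximate I = 1 - integral of R(p) on a midpoint grid.

    quantile_fn: callable p -> estimated p-th quantile (assumed positive).
    Uses p_j = (j - 1/2)/J and R(p_j) = Q(p_j/2) / Q(1 - p_j/2).
    """
    total = 0.0
    for j in range(1, J + 1):
        p = (j - 0.5) / J
        ratio = quantile_fn(p / 2.0) / quantile_fn(1.0 - p / 2.0)
        total += 1.0 - ratio
    return total / J
```

As a check, for the uniform distribution on (0, 1), where \(Q(p)=p\), the integral can be evaluated exactly as \(I=2-2\ln 2\), and the grid value with \(J=100\) agrees with it closely.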

3 Density Estimation Methods

We now consider two methods for estimating the density from grouped data. The first requires bins and frequencies, and the second also requires the bin means. The methods were used by Dedduwakumara and Prendergast (2018) to obtain intervals for quantiles from histograms.

3.1 GLD Estimation Method

Owing to its flexibility in approximating a wide range of distributions, the Generalized Lambda Distribution (GLD) is commonly used and is particularly favoured in fields such as economics and finance. The GLD is defined in terms of its quantile function, and several parameterizations exist. Below we give the FKML parameterization of Freimer et al. (1988), which is often favoured since it is defined for all parameter choices, the only restriction being that the scale parameter must be greater than zero. The GLD quantile function is

$$\begin{aligned} Q(p)=\lambda + \frac{1}{\eta } \bigg [\frac{(p^\alpha -1)}{\alpha }-\frac{(1-p)^\beta -1}{\beta }\bigg ]. \end{aligned}$$
(1)

The GLD has been used in different contexts to obtain various interval estimators (e.g. Su 2009; Prendergast and Staudte 2016a) when the full data set is available. However, using the percentile matching methods presented by Karian and Dudewicz (1999) and Tarsitano (2005), the GLD parameters can still be estimated when data is in grouped format with frequencies and bins. This method is available in the bda package (Wang 2015).
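To make Eq. (1) concrete, the quantile function can be coded directly. The sketch below is ours and is not the bda implementation; the argument names mirror the symbols in (1), and the handling of \(\alpha =0\) or \(\beta =0\) by the logarithmic limits is an assumption we add so that the function is defined for all shape values.

```python
import math

def gld_fkml_quantile(p, lam, eta, alpha, beta):
    """FKML GLD quantile function, Eq. (1), for 0 < p < 1.

    lam: location; eta: inverse-scale parameter (must be > 0);
    alpha, beta: shape parameters. For a zero shape value the term
    is replaced by its limit, ln(p) or ln(1 - p) (assumed convention).
    """
    if eta <= 0:
        raise ValueError("eta must be greater than zero")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    left = math.log(p) if alpha == 0 else (p ** alpha - 1.0) / alpha
    right = math.log(1.0 - p) if beta == 0 else ((1.0 - p) ** beta - 1.0) / beta
    return lam + (left - right) / eta
```

Because the distribution is specified through Q(p), a fitted GLD can be sampled by evaluating this function at uniform random numbers, which is the property exploited by the bootstrap in Sect. 4.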

3.2 Linear Interpolation Method

The linear interpolation method was proposed by Lyon et al. (2016) for estimating the underlying distribution of binned data when the group (bin) means are also available. Within each bounded bin, a linear density is estimated using the lower and upper bounds of the bin and the associated mean, and the final bin is fitted with an unbounded exponential tail. The slope of the linear density is determined by the position of the mean relative to the bin midpoint. Closed-form solutions for the density and quantile functions are provided by Lyon et al. (2016); the following is a summary of the density results.

Assume there are J intervals in the grouped data bounded by \([a_{j-1},a_j), j=1,\ldots ,J\) where \(a_0>-\infty \) and \(a_J=\infty \). Let the midpoint, mean and relative frequency of the jth bin be denoted by \(x^c_j\), \(\bar{x}_j\) and \(\widehat{f}_j\). The linear density for the jth bin is

$$\begin{aligned} h_j(x)=\alpha _j+\beta _j x, \qquad x\in [a_{j-1},a_j) \end{aligned}$$
(2)

where the estimates of \(\alpha _j\), \(\beta _j\) are given by,

$$\begin{aligned} \widehat{\beta _j}=\widehat{f_j}\frac{12(\bar{x}_j-x^c_j)}{(a_j-a_{j-1})^3}, \quad \widehat{\alpha }_j=\frac{\widehat{f_j}}{a_j-a_{j-1}}-\widehat{\beta _j}x^c_j. \end{aligned}$$
(3)

The density estimate for the final unbounded interval using an exponential tail is provided by,

$$\begin{aligned} h_J(x)=\frac{\eta }{\lambda }\exp \left\{ -\frac{(x-a_{J-1})}{\lambda }\right\} \end{aligned}$$
(4)

where \(\widehat{\eta }=\widehat{f_J}\) and \(\widehat{\lambda }=\bar{x}_J-a_{J-1}\).
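The estimates in (3) and (4) are simple to compute from the bin summaries. The following Python sketch uses hypothetical function names and covers a single bin at a time; it does not include any adjustment for the case where the fitted line would become negative inside the bin, which happens when the bin mean lies more than one sixth of the bin width from the midpoint.

```python
def linear_bin_coefficients(a_lo, a_hi, rel_freq, bin_mean):
    """Return (alpha_j, beta_j) of Eq. (3) for the bounded bin [a_lo, a_hi)."""
    width = a_hi - a_lo
    midpoint = 0.5 * (a_lo + a_hi)
    beta = rel_freq * 12.0 * (bin_mean - midpoint) / width ** 3
    alpha = rel_freq / width - beta * midpoint
    return alpha, beta

def exponential_tail_parameters(a_last, rel_freq, bin_mean):
    """Return (eta, lambda) of Eq. (4) for the final bin [a_last, infinity)."""
    return rel_freq, bin_mean - a_last
```

With these coefficients the fitted line integrates to the relative frequency of the bin and reproduces the bin mean, and a bin mean equal to the midpoint gives a flat (uniform) density within the bin.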

4 Interval Estimators Using Grouped Data

In this section, we propose and describe our bootstrap and Wald-type methods to produce intervals for inequality measures using grouped information. The variance of the QRI estimator depends on the underlying income density function evaluated at income quantiles (Prendergast and Staudte 2018). Therefore, provided we can obtain good estimates of the density from grouped data, the QRI is well-suited to obtaining Wald-type intervals in this setting. Aside from bootstrapping, to obtain the variance of, for example, the Gini index, it is common to use the jackknife approach or other methods that require the full data set. Consequently, obtaining an approximation to the variances of the Gini, Theil and Atkinson estimators from grouped data is not straightforward and is therefore an area for further research.

For the bootstrapping procedure, we obtain the bootstrap samples from the estimated quantile function arising from the estimated GLD or linear interpolation densities. We then use the percentile bootstrap interval described below. While other bootstrap methods often perform better than the percentile method, they require the full data set and it is not immediately clear how to use them when data are only available in grouped format; e.g. the bootstrap t interval requires the variance of the estimator, and the BCa method (Efron 1987) and Efron’s ABC method (Diciccio and Efron 1992) require the full sample data to calculate the acceleration parameter. However, we did try a variation of the bootstrap t interval whereby the \(\alpha \) parameter was estimated as usual, but where the estimate and its standard error were also approximated from the bootstrap samples given the lack of the full data set. Coverages were usually no better, and often worse, than those for the percentile approach, so for brevity we do not present them in what follows. Further variations of bootstrap methods to accommodate the lack of the full data set may give improved results and this is an area for future research.

Bootstrap Confidence Intervals. In the following algorithm, we describe the estimation of percentile bootstrap confidence intervals in detail.

  • Step 1: Estimate the GLD and linear interpolation densities using available summary information of bin points and frequencies (and bin means for the linear interpolation approach).

  • Step 2: Take 500 bootstrap samples of size n using the estimated quantile functions from the two estimation methods using the inverse transform sampling method. That is, randomly generate n numbers, \(y_1,\ldots , y_n\) in [0, 1] from the uniform distribution and then the ith observation for the jth bootstrap is \(y_{ji}=\widehat{Q}(y_i)\) where \(\widehat{Q}\) is the estimated quantile function.

  • Step 3: Construct the percentile bootstrap 95% confidence intervals by taking the 2.5% and 97.5% quantiles of the 500 bootstrapped estimates of the inequality measures.
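Steps 2 and 3 can be sketched generically as below, taking the estimated quantile function from Step 1 as given. All names are hypothetical, the interpolated empirical quantile is one of several reasonable conventions, and the sketch is not the code used for the simulations reported later.

```python
import math
import random

def empirical_quantile(sorted_values, prob):
    """Linearly interpolated empirical quantile of an ascending list."""
    pos = prob * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac

def percentile_bootstrap_interval(quantile_fn, statistic, n,
                                  n_boot=500, level=0.95, seed=None):
    """Percentile bootstrap interval from an estimated quantile function.

    quantile_fn: callable p -> estimated quantile (Step 1 output).
    statistic:   callable sample -> estimate of the inequality measure.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Step 2: inverse transform sampling, Q_hat(U) with U ~ Uniform(0, 1).
        sample = [quantile_fn(rng.random()) for _ in range(n)]
        estimates.append(statistic(sample))
    # Step 3: take the lower and upper tail quantiles of the estimates.
    estimates.sort()
    tail = (1.0 - level) / 2.0
    return (empirical_quantile(estimates, tail),
            empirical_quantile(estimates, 1.0 - tail))
```

Any estimator that accepts a list of incomes, such as a Gini or Theil function, can be supplied as `statistic`, while `quantile_fn` would be the fitted GLD or linear interpolation quantile function.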

For the GLD method, we treat the available bin points as the empirical percentiles in the percentile matching method, which provides the estimated GLD parameters. Using the GLD quantile function (Sect. 3.1) with the estimated parameters, we can easily draw the bootstrap samples by inverse transform sampling as in Step 2. For the linear interpolation approach, we use the following two quantile functions to generate data, depending on the value of p (Lyon et al. 2016). For the bounded interval \([a_{j-1},a_j)\), the quantile function for \(p\in [\widehat{F}_{j-1},\widehat{F}_j)\), where \(\widehat{F}_j\) denotes the cumulative relative frequency up to and including the jth bin, is

$$\begin{aligned} \widehat{x}_p=\frac{-\widehat{\alpha }_j + \sqrt{2\widehat{\beta }_jp+\widehat{C}_j}}{\widehat{\beta }_j} \end{aligned}$$
(5)

where, \(\widehat{C_j}=[\widehat{\alpha }_j^2-2\widehat{\beta }_j\widehat{F}_{j-1}+2\widehat{\beta }_j\widehat{\alpha }_j a_{j-1}+\widehat{\beta }_j^2(a_{j-1})^2]\), \(\widehat{\beta _j}\) and \(\widehat{\alpha _j}\) as in (3).

Further, the fitted exponential tail yields the following quantile function, where \(\widehat{F}_{J-1}\) denotes the cumulative relative frequency up to the start of the final (Jth) interval,

$$\begin{aligned} \widehat{x}_p=a_{J-1}-\widehat{\lambda }\ln \Bigg (1-\frac{p-\widehat{F}_{J-1}}{\widehat{\eta }}\Bigg ). \end{aligned}$$
(6)
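Equations (5) and (6) can be combined into a single piecewise quantile function. The Python sketch below is our own illustration with hypothetical names: it assumes the relative frequencies are positive and sum to one, recomputes the coefficients of (3) inside each bin, and falls back to linear inversion when \(\widehat{\beta }_j=0\), a case in which (5) cannot be evaluated directly.

```python
import math

def li_quantile(p, edges, rel_freqs, bin_means):
    """Piecewise quantile function for the linear interpolation method.

    edges:     finite lower bin edges a_0, ..., a_{J-1}; the last bin is
               [a_{J-1}, infinity).
    rel_freqs: relative frequencies of the J bins (assumed to sum to 1).
    bin_means: means of the J bins.
    """
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    n_bins = len(rel_freqs)
    cum = 0.0  # cumulative relative frequency below the current bin
    for k in range(n_bins):
        f = rel_freqs[k]
        if k == n_bins - 1:
            # Exponential tail, Eq. (6), with eta = f and lambda = mean - edge.
            lam = bin_means[k] - edges[k]
            return edges[k] - lam * math.log(1.0 - (p - cum) / f)
        if p < cum + f:
            lo, hi = edges[k], edges[k + 1]
            width, mid = hi - lo, 0.5 * (lo + hi)
            beta = f * 12.0 * (bin_means[k] - mid) / width ** 3
            alpha = f / width - beta * mid
            if beta == 0:
                # Flat density within the bin: invert the linear CDF.
                return lo + (p - cum) / alpha
            c = (alpha ** 2 - 2.0 * beta * cum
                 + 2.0 * beta * alpha * lo + beta ** 2 * lo ** 2)
            return (-alpha + math.sqrt(2.0 * beta * p + c)) / beta  # Eq. (5)
        cum += f
```

Passing this function, with the bin summaries fixed, as the quantile function in Step 2 gives the bootstrap samples for the linear interpolation approach.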

Wald-Type Confidence Intervals for the QRI. Obtaining confidence intervals for the QRI from full data sets is studied by Prendergast and Staudte (2018). The variance of the estimator depends on the density function and quantiles. Therefore, given a good estimation of the density which in turn would be expected to give good estimates to quantiles, QRI intervals from grouped data are possible.

The \((1-\alpha )\times 100\%\) confidence interval for I is given by \(\hat{I} \pm z_{1-\alpha /2}\sqrt{\text {Var}(\hat{I})}\), where the expression for \(\text {Var}(\hat{I})\) is adopted from Prendergast and Staudte (2018) with \(J=100\). Here, \(z_{1-\alpha /2}\) is the \(1-\alpha /2\) quantile of the standard normal distribution. \(\text {Var}(\hat{I})\) consists of variance and covariance terms of ratios of symmetrically chosen quantiles (see Prendergast and Staudte 2018), so we require estimates of the population quantiles and density function. As described earlier, we first estimate the underlying density and quantile functions using the GLD and linear interpolation methods. The estimated quantile functions are then used to estimate the symmetrically chosen quantiles.
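Once an estimate of \(\text {Var}(\hat{I})\) is available, forming the interval is immediate. The sketch below shows only this final step; the variance expression of Prendergast and Staudte (2018) is not reproduced here and must be supplied by the user, and the function name `wald_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def wald_interval(estimate, variance, alpha=0.05):
    """Wald-type interval: estimate -/+ z_{1-alpha/2} * sqrt(variance).

    variance: an externally supplied estimate of Var(I_hat); its
    computation is not part of this sketch.
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width
```

Unlike the bootstrap, this requires no resampling, which is the source of the computational advantage noted in the simulations.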

5 Simulations and Examples

We begin by reporting our findings for simulation studies conducted with a variety of distributions before considering real data examples.

5.1 Simulations

To assess coverage, we consider the lognormal distribution with \(\mu =0\) and \(\sigma =1\) and the Singh-Maddala distribution with parameter values \(a=1.6971\), \(b=87.6981\) and \(q=8.3679\), where these parameters were from fitted US family incomes reported by McDonald (1984). We also consider the Dagum distribution with the parameter choices \(a=4.273\), \(b=14.28\) and \(p=0.36\), which were used in Kleiber (2008) and were estimated from fitted US family incomes in 1969. The \(\chi ^2_2\) distribution, the Pareto type II distribution with scale one and shape equal to two, and the exponential distribution with rate one were also considered. Table 1 provides the population inequality values of each measure.
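For the lognormal case, population values of three of the measures have well-known closed forms that depend only on \(\sigma \): \(G=2\Phi (\sigma /\sqrt{2})-1\), \(T=\sigma ^2/2\) and \(A=1-\exp (-\epsilon \sigma ^2/2)\). The short sketch below evaluates them; it is offered as a way to check simulation code and is not a reproduction of Table 1. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_gini(sigma):
    """Population Gini index of a lognormal: 2*Phi(sigma/sqrt(2)) - 1."""
    return 2.0 * NormalDist().cdf(sigma / math.sqrt(2.0)) - 1.0

def lognormal_theil(sigma):
    """Population Theil index of a lognormal: sigma^2 / 2."""
    return sigma ** 2 / 2.0

def lognormal_atkinson(sigma, epsilon=0.5):
    """Population Atkinson index of a lognormal: 1 - exp(-epsilon*sigma^2/2)."""
    return 1.0 - math.exp(-epsilon * sigma ** 2 / 2.0)
```

None of these values depends on \(\mu \), so the choice \(\mu =0\) affects only the scale of the simulated incomes.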

Table 1. True values of inequality measures for each distribution.
Table 2. Empirical coverage probabilities and average widths (in brackets) of bootstrapped interval estimates of inequality measures from quintiles estimated using the linear interpolation method at nominal level 95%, each based on 1000 replications and 500 bootstrap repetitions.

From Table 2, for quintile-grouped data and the linear interpolation method, intervals for I produce coverage probabilities close to the nominal level of 0.95 together with narrow mean widths for all settings, for both the bootstrap and the Wald-type intervals. Given that the Wald-type interval is much cheaper to compute, there does not appear to be an advantage in using the bootstrap. However, for the Gini, Theil and Atkinson measures, the coverages are comparatively weaker but improve as the sample size increases for most of the distributions.

Table 3. Empirical coverage probabilities and average widths (in brackets) of bootstrapped interval estimates of inequality measures from quintiles estimated using the GLD method at nominal level 95%, each based on 1000 replications and 500 bootstrap repetitions.

Table 3 shows that the intervals based on the GLD and quintiles for the Gini, Theil and Atkinson measures have poor coverage. Coverages are typically very good for the QRI intervals, albeit more conservative than those using the linear interpolation method. However, coverages become low for the lognormal suggesting that quintiles do not provide enough information to get a good approximation using the GLD.

Table 4. Empirical coverage probabilities and average widths (in brackets) of bootstrapped interval estimates of inequality measures from deciles estimated using the GLD method at nominal level 95%, each based on 1000 replications and 500 bootstrap repetitions.

When the data is summarised in deciles rather than quintiles (i.e. more bins and more information), Table 4 shows improved coverage is achieved with the GLD method. However, coverage is still poor for the Gini, Theil and Atkinson measures when compared to the good coverages achieved for the QRI. Again, the similar coverages for the bootstrap and Wald-type intervals suggest that the Wald-type is a good choice since it is simple and quick to compute.

Fig. 1. Boxplots of 1000 centered (with respect to the true values) simulated estimates of inequality measures from quintiles, estimated using the linear interpolation method, from the lognormal distribution with mean 0 and various standard deviation values, where n = 250.

In Fig. 1 we look at what happens to estimates using the linear interpolation method for each measure (e.g. an estimate based on a bootstrap sample) as skew increases. In this case, we use the lognormal distribution while increasing the \(\sigma \) parameter from 0.5 to 2. The estimates are centered according to the true value so a value of zero indicates a perfect estimate. We exclude the Theil index from the analysis since its upper bound is unrestricted. As the distribution becomes more skewed, the Gini and Atkinson estimators have an increase in bias and variability whereas the quantile-based measure (I) indicates smaller variability and smaller bias throughout for all of the choices of \(\sigma \). This helps to explain why the coverages are poor for the Gini and Atkinson measures.

6 Applications

6.1 Example 1: Household Income Reported with Group Means

In this example, we present household income data reported with group means from the Survey of Consumer Finances and Expenditures carried out by Macquarie University and the University of Queensland, which can be found in Podder (1972) and Kakwani and Podder (1976). The data are summarised in Table 5.

Table 5. Australian household income data for 1967-68
Table 6. Interval and point estimates of the inequality measures generated using the linear interpolation (LI) and GLD methods for the data presented in Table 5.

The confidence intervals produced by 500 bootstrapped samples using the linear interpolation (LI) and GLD methods are given in Table 6. As the final interval is unbounded, we arbitrarily set the upper limit of that bin to $500,000. As can be seen, the confidence intervals and the estimates generated by the two methods are similar.

6.2 Example 2: Comparison of Equalized Disposable Household Income Data

In this example, we compare two assumed-independent income distributions reported in deciles from ABS (2011) (see Table 7) to assess whether the income inequality measures of the two distributions are significantly different from one another. It is simple to adapt the previous intervals to the two-sample setting. For example, for the bootstrap approach we simply estimate the difference at each iteration and then form the interval by taking percentiles of the bootstrapped differences. For the Wald-type approach, we obtain the variance of the difference as the sum of the variances of the two QRI estimators. For estimation purposes, the highest income was taken to be $5000 for both years.
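Both two-sample adaptations can be sketched briefly. The code below is an illustration with hypothetical names, assuming independent samples as stated above; the bootstrap version uses the nearest order statistics rather than interpolated percentiles, and the Wald version takes the two variance estimates as given.

```python
import math
import random
from statistics import NormalDist

def wald_difference_interval(est1, var1, est2, var2, alpha=0.05):
    """Wald-type interval for est2 - est1 with independent estimators,
    so that the variance of the difference is var1 + var2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    diff = est2 - est1
    half_width = z * math.sqrt(var1 + var2)
    return diff - half_width, diff + half_width

def bootstrap_difference_interval(qf1, qf2, statistic, n1, n2,
                                  n_boot=500, level=0.95, seed=None):
    """Percentile bootstrap interval for statistic(sample2) - statistic(sample1).

    qf1, qf2: estimated quantile functions of the two distributions.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        s1 = [qf1(rng.random()) for _ in range(n1)]
        s2 = [qf2(rng.random()) for _ in range(n2)]
        diffs.append(statistic(s2) - statistic(s1))
    diffs.sort()
    tail = (1.0 - level) / 2.0
    lo_idx = int(math.floor(tail * (n_boot - 1)))
    hi_idx = int(math.ceil((1.0 - tail) * (n_boot - 1)))
    return diffs[lo_idx], diffs[hi_idx]
```

In either case, an interval that excludes zero is read as evidence of a difference in inequality between the two distributions.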

Table 7. Equalized disposable household income at top of selected percentiles ($) in Western Australia.
Table 8. Point and interval estimates of inequality measures generated using GLD method for Equalized disposable household income in Western Australia presented in Table 7

From Table 8, it can be seen that none of the intervals for the difference in the measures includes zero. These intervals therefore suggest that income inequality has changed over the years. We conclude that inequality of equalized disposable household income in Western Australia increased significantly from 1996-97 to 2009-10.

7 Discussion

To preserve confidentiality, it is common for income data to be summarised in grouped format. We therefore considered interval estimators for several measures, including the popular Gini index and a newly proposed quantile-based measure, the QRI. Since grouped data contain bin boundaries and frequencies (and therefore quantile estimates of the data), the QRI is naturally suited to this setting. We showed that bootstrap intervals and a Wald-type interval, both using estimated densities from the grouped data, typically had excellent coverage (i.e. close to nominal). The other measures, however, often had intervals with poor coverage. Further research could consider how to obtain good approximations to the variances of the Gini, Theil and Atkinson estimators when dealing with grouped data. This was possible for the QRI since the variance of the estimator can be approximated using the estimated density function. For the other measures it is not so straightforward. In summary, when faced with grouped data, if confidence intervals are needed then the QRI is a good option for measuring inequality.