Introduction

Recently, periodic reports of summary statistics on income or wealth distribution have become quite common (Sitthiyot and Holasut, 2021). The International Labour Organization’s ILOSTAT, the United Nations Development Programme’s Human Development Report (UNDP-HDR), the United Nations University World Income Inequality Database (UNU-WIID), the World Bank’s Poverty and Inequality Platform (PIP), and the World Inequality Database (WID) are the largest cross-country databases that provide grouped data. Using grouped data on income or wealth, the Gini index can be estimated (1) by assuming a statistical distribution of income, such as the lognormal, Beta-II, generalized Pareto, or a mixture distribution (McDonald, 1984; Chotikapanich et al., 1997, 2007; Blanchet et al., 2022), or (2) by specifying a parametric functional form for the Lorenz curve (Paul and Shankar, 2020; Sitthiyot and Holasut, 2021).

Fitting the Lorenz curve is more convenient than fitting the income distribution because the cumulative population shares and the corresponding cumulative income shares form the points of the Lorenz curve. Numerous studies have suggested a variety of parametric functional forms to estimate the Lorenz curve directly. These include single-parametric Lorenz curves (Kakwani and Podder, 1973; Aggarwal, 1984; Chotikapanich, 1993; Paul and Shankar, 2020), double-parametric Lorenz curves (Rasche et al., 1980; Ortega et al., 1991; Sitthiyot and Holasut, 2021), and three-parametric Lorenz curves (Kakwani, 1980; Sarabia et al., 1999). Chotikapanich and Griffiths (2002) have suggested estimating the parameter(s) of the Lorenz curve using the maximum likelihood (ML) method, assuming that the income shares follow a joint Dirichlet distribution, while Jorda et al. (2021) have proposed estimating the parameter(s) of the Lorenz curve using the error minimization technique. Sitthiyot and Holasut (2021) introduce a simple and straightforward method for estimating the Lorenz curve using three indicators, namely, the Gini index, the income share of the bottom, and that of the top, combined with a specific functional form based on the weighted average of the exponential function and the functional form implied by the Pareto distribution.

This study proposes a new approach, named the regression method, for estimating the Gini index from decile data based on the functional form suggested by Kakwani (1980). The new approach builds a linear regression model to estimate three parameters, and can yield better estimates of the Gini index than the ML method and the error minimization technique for the same Lorenz curve. It readily provides estimates of the Lorenz curve, the corresponding Gini index, and the variance of the Gini index using popular computer programs such as EViews and Stata. In addition, the new approach allows negative income or wealth share values, which are not allowed in the Beta-II density function of Chotikapanich et al. (2007) or the Gamma function in the maximum likelihood of Chotikapanich and Griffiths (2002).

We demonstrate how to estimate the Gini index using the new approach based on a dataset of the income shares of sixteen economies, which differ in the level of income inequality and in economic, sociological, and regional background. We also compare the performance of our new method to that of the methods suggested by Sitthiyot and Holasut (2021) and Sarabia et al. (1999), which aim to estimate the Gini index and fit the income shares using the error minimization technique and decile data.

Methods

Kakwani (1980) suggests a functional form for fitting the Lorenz curve as follows:

$$L(x;a,p,q)=x-a{x}^{p}{(1-x)}^{q};\,a \,>\, 0,\,0 \,<\, p\,\le\, 1,\,0 \,<\, q\,\le\, 1$$
(1)

where x is the cumulative population share, 0 ≤ x ≤ 1. When fitting the Lorenz curve, the functional form in model (1) is more consistent with the actual data (Cheong, 2002; Tanak et al., 2018; Sitthiyot and Holasut, 2021).

An advantage of the function L(x) is that it is applicable to negative income or wealth shares. This addresses the criticism of Sarabia et al. (1999) that Eq. (1) violates L′(0+) ≥ 0. In fact, the wealth share of the poorest 10% is often negative in many economies; in this case, we only require that the function L(x) meets the conditions L(x) ≥ 0, L′(x) ≥ 0, and L″(x) ≥ 0 in the right-hand part of the interval [0,1].

The derivation of its second-order derivative is as follows:

Let \(f(x)={x}^{p}{(1-x)}^{q}\), so that L(x) = x − af(x), and

$$f^{\prime} (x)=\frac{d({e}^{{\mathrm{ln}}f})}{dx}=\frac{d({e}^{p\,{\mathrm{ln}}x+q\,{\mathrm{ln}}(1-x)})}{dx}={e}^{{\mathrm{ln}}f}\left(\frac{p}{x}-\frac{q}{1-x}\right)$$

Then

$$\begin{array}{c}f^{\prime\prime} (x)={e}^{{\mathrm{ln}}f}\left[{\Big(\frac{p}{x}-\frac{q}{1-x}\Big)}^{2}-\left(\frac{p}{{x}^{2}}\,+\,\frac{q}{{(1-x)}^{2}}\right)\right]\\ \qquad={e}^{{\mathrm{ln}}f}\frac{(p\,+\,q)(p\,+\,q-1){x}^{2}-2p(p\,+\,q-1)x\,+\,{p}^{2}-p}{{x}^{2}{(1-x)}^{2}}\end{array}$$

Let \(g(x)=(p+q)(p+q-1){x}^{2}-2p(p+q-1)x+{p}^{2}-p\). We can obtain the discriminant of the root nature of \(g(x)\) as follows

$$\begin{array}{l}\varDelta =4{p}^{2}{(p+q-1)}^{2}-4p(p-1)(p+q)(p+q-1)\\ =4pq(p+q-1)\end{array}$$

When p + q ≤ 1, p > 0, q > 0, then \(\varDelta \,\le\, 0\) and the leading coefficient (p + q)(p + q − 1) of g(x) is non-positive, indicating g(x) ≤ 0, which means f″(x) ≤ 0, so that L″(x) ≥ 0 (a > 0). That is, the curve is convex. When p + q > 1, p > 0, q > 0, then \(\varDelta \,>\, 0\), and g(x) has two roots, x1 and x2, as follows:

$${x}_{1}=\frac{p-\sqrt{\frac{pq}{p+q-1}}}{p+q}\le 0\iff (p+q)(p-1)\le 0\iff 0 \,<\, p\le 1$$
$${x}_{2}=\frac{p+\sqrt{\frac{pq}{p+q-1}}}{p+q}\ge 1\iff (1-q)(p+q)\ge 0\iff 0 \,<\, q\le 1$$

So that

⑴ when p + q ≤ 1, p > 0, q > 0, we have L″(x) ≥ 0, x \(\in\) [0,1].

⑵ when p + q > 1, 0 < p ≤ 1, 0 < q ≤ 1, we have L″(x) ≥ 0, x \(\in\) [0,1].

⑶ when p > 1, 0 < q ≤ 1, we have 0 < x1 < 1 and x2 ≥ 1, so that L″(x) ≥ 0, x \(\in\) [x1,1].
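The three cases above can be checked numerically. The following is a minimal sketch, not part of the original study: the function names `g_roots` and `lorenz_second_derivative` and the parameter values are hypothetical and chosen only for illustration. It computes the roots of g(x) and evaluates L″(x) = −af″(x) on either side of x1 for a case with p > 1.

```python
import math

def g_roots(p, q):
    """Roots x1 <= x2 of g(x); they are real and distinct only when p + q > 1."""
    if p + q <= 1.0:
        raise ValueError("two distinct real roots require p + q > 1")
    s = math.sqrt(p * q / (p + q - 1.0))
    return (p - s) / (p + q), (p + s) / (p + q)

def lorenz_second_derivative(x, a, p, q):
    """L''(x) = -a f''(x), with f(x) = x^p (1 - x)^q and 0 < x < 1."""
    f = x**p * (1.0 - x)**q
    g = ((p + q) * (p + q - 1.0) * x * x
         - 2.0 * p * (p + q - 1.0) * x
         + p * p - p)
    return -a * f * g / (x * x * (1.0 - x)**2)

# Illustrative (hypothetical) parameters for case (3): p > 1, 0 < q <= 1.
a, p, q = 1.0, 1.5, 0.8
x1, x2 = g_roots(p, q)
print("x1 =", x1, "x2 =", x2)
print("L'' left of x1 :", lorenz_second_derivative(0.5 * x1, a, p, q))
print("L'' right of x1:", lorenz_second_derivative(0.5 * (x1 + 1.0), a, p, q))
```

With these values the curve is concave to the left of x1 and convex on [x1, 1], which is the behavior described in case ⑶.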

Furthermore, since L′(x) = 1 − af′(x), we have L′(x) > 0 as \(x\to {1}^{-}\), because

$$\mathop{\mathrm{lim}}\limits_{x\to {1}^{-}}f^{\prime} (x)=\mathop{\mathrm{lim}}\limits_{x\to {1}^{-}}\frac{p(1-x)-qx}{{x}^{1-p}{(1-x)}^{1-q}} < 0\Rightarrow \mathop{\mathrm{lim}}\limits_{x\to {1}^{-}}L^{\prime} (x) > 0$$

This means that L(x) is convex and increasing in the right-hand part of the interval [0, 1].

The analysis of the condition L″(x) ≥ 0 shows that, when p + q > 1, convexity in the right-hand part of the interval [0, 1] requires only p > 0 rather than 0 < p ≤ 1. So Eq. (1) can be expressed as follows:

$$L(x;a,p,q)=x-a{x}^{p}{(1-x)}^{q};\,a \,>\, 0,\,p \,>\, 0,\,0 \,<\, q\,\le\, 1$$
(2)

Then L(x) satisfies L(x) ≥ 0, L′(x) ≥ 0, and L″(x) ≥ 0 in the right-hand part of the interval [0,1]. For example, in Fig. 1, the Lorenz curve passes through the point (0.53, −0.0021), which is obtained by fitting the generalized Pareto curves for the United States (Blanchet et al., 2022). Although model (2) may not satisfy the properties of the classic Lorenz curve, i.e., L(x) ≥ 0, L′(x) ≥ 0, and L″(x) ≥ 0 over the whole interval [0, 1], we still call model (2) a Lorenz curve or, more precisely, a fitting function of a Lorenz curve.

Fig. 1: The Lorenz curve of the capital of the USA in 2010.

Source: https://wid.world/gpinter/ (file2).

Rearranging Eq. (1), taking the natural logarithm, and adding an error term, we obtain the following equation:

$$\log (x-y)=\,\log a+p\,\log x+q\,\log (1-x)+\varepsilon$$
(3)

where y = L(x) and ε is an error term with zero mean.

Letting β0 = log(a), model (3) can be expressed in the following matrix form:

$${\bf{Y}}=\left(\begin{array}{c}\log \,({x}_{1}-{y}_{1})\\ \log \,({x}_{2}-{y}_{2})\\ \vdots \\ \log \,({x}_{9}-{y}_{9})\end{array}\right),{\bf{X}}=\left(\begin{array}{ccc}1 & \log {x}_{1} & \log \,(1-{x}_{1})\\ 1 & \log {x}_{2} & \log \,(1-{x}_{2})\\ \vdots & \vdots & \vdots \\ 1 & \log {x}_{9} & \log \,(1-{x}_{9})\end{array}\right),{\mathbf{\upbeta}} =\left(\begin{array}{c}{\beta }_{0}\\ p\\ q\end{array}\right)$$

We can obtain the parameter estimates of the model using the least squares method:

$$\hat{{\mathbf{\upbeta }}}={({\bf{X}}{{^{\prime} }}{\bf{X}})}^{-1}\,{\bf{X}}{{^{\prime} }}{\bf{Y}},\,{\bf{V}}{\bf{a}}{\bf{r}}(\hat{{\mathbf{\upbeta }}})={({\bf{X}}{{^{\prime} }}{\bf{X}})}^{-1}{\sigma }^{2}$$
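The least squares step can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `solve_linear` and `fit_kakwani_ols` are hypothetical, and the decile points are synthetic, generated from the hypothetical parameters a = 0.7, p = 0.9, q = 0.6 rather than taken from any database.

```python
import math

def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_kakwani_ols(xs, ys):
    """OLS on log(x - y) = log a + p log x + q log(1 - x); returns (a, p, q)."""
    X = [[1.0, math.log(x), math.log(1.0 - x)] for x in xs]
    Y = [math.log(x - y) for x, y in zip(xs, ys)]  # requires y < x
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    XtY = [sum(r[i] * yv for r, yv in zip(X, Y)) for i in range(3)]
    beta0, p, q = solve_linear(XtX, XtY)
    return math.exp(beta0), p, q

# Synthetic cumulative shares at the nine interior decile points (assumption).
xs = [i / 10 for i in range(1, 10)]
ys = [x - 0.7 * x**0.9 * (1.0 - x)**0.6 for x in xs]

a_hat, p_hat, q_hat = fit_kakwani_ols(xs, ys)
print(a_hat, p_hat, q_hat)
```

Because the synthetic points lie exactly on the curve, the fit should recover the generating parameters up to rounding error; with real decile data the residuals are nonzero and σ² can be estimated from them.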

where Var(ε) = σ². Therefore, we estimate not only the parameters of the Lorenz curve, but also the covariance matrix of the parameters. Using the estimated parameters, we calculate the Gini coefficient and its variance according to the following formulas:

$$G=2aBeta\,(p+1,\,q+1),\,Var\,(G)=\left(\frac{\partial G}{\partial {\mathbf{\upbeta}}}\right)^{\prime} Var\,({\mathbf{\upbeta}})\frac{\partial G}{\partial {\mathbf{\upbeta}}}$$
(4)
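For completeness, the closed form of G in Eq. (4) follows from the standard definition of the Gini index as one minus twice the area under the Lorenz curve. The short derivation below is a sketch that uses only the functional form of Eq. (1) and the definition of the Beta function:

$$G=1-2\int_{0}^{1}L(x)\,dx=1-2\left[\frac{1}{2}-a\int_{0}^{1}{x}^{p}{(1-x)}^{q}\,dx\right]=2a\,Beta\,(p+1,\,q+1)$$

Since \({\mathrm{ln}}\,Beta\,(p+1,\,q+1)={\mathrm{ln}}\,\Gamma (p+1)+{\mathrm{ln}}\,\Gamma (q+1)-{\mathrm{ln}}\,\Gamma (p+q+2)\), differentiating with respect to p gives

$$\frac{\partial \,{\mathrm{ln}}\,Beta\,(p+1,\,q+1)}{\partial p}=\Psi (1+p)-\Psi (2+p+q)$$

where Ψ denotes the digamma function; the derivative with respect to q is obtained by exchanging p and q.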

According to Kakwani and Podder (1976), we can obtain the following partial derivatives:

$$\begin{array}{l}\frac{\partial G}{\partial {\beta }_{0}}=G,\quad a={e}^{{\beta }_{0}}\\ \frac{\partial G}{\partial p}=[\Psi (1+p)-\Psi (2+p+q)]G\\ \frac{\partial G}{\partial q}=[\Psi (1+q)-\Psi (2+p+q)]G\\ \Psi (1+p)-\Psi (2+p+q)=\mathop{\sum }\limits_{k=0}^{+\infty }\left(\frac{1}{2+p+q+k}-\frac{1}{1+p+k}\right)\end{array}$$
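The Gini index and its delta-method variance in Eq. (4) can be evaluated with standard library functions only. The sketch below is illustrative: the function names are hypothetical, the digamma function is approximated by a recurrence plus an asymptotic series (an implementation choice made here, not taken from the paper), and the covariance matrix is a made-up example rather than an estimate from data.

```python
import math

def digamma(z):
    """Digamma for z > 0: shift z upward with psi(z) = psi(z + 1) - 1/z,
    then apply the asymptotic series ln z - 1/(2z) - 1/(12z^2) + ..."""
    result = 0.0
    while z < 6.0:
        result -= 1.0 / z
        z += 1.0
    inv2 = 1.0 / (z * z)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(z) - 0.5 / z - series

def beta_fn(u, v):
    """Beta function computed through log-gamma for numerical stability."""
    return math.exp(math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v))

def gini_kakwani(beta0, p, q):
    """G = 2 a Beta(p + 1, q + 1) with a = exp(beta0)."""
    return 2.0 * math.exp(beta0) * beta_fn(p + 1.0, q + 1.0)

def gini_gradient(beta0, p, q):
    """Partial derivatives of G with respect to (beta0, p, q)."""
    g = gini_kakwani(beta0, p, q)
    common = digamma(2.0 + p + q)
    return [g, g * (digamma(1.0 + p) - common), g * (digamma(1.0 + q) - common)]

def gini_variance(beta0, p, q, cov):
    """Delta method: Var(G) = grad' Var(beta) grad."""
    grad = gini_gradient(beta0, p, q)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(3) for j in range(3))

# Hypothetical estimates and covariance matrix, for illustration only.
beta0, p, q = math.log(0.7), 0.9, 0.6
cov = [[1e-4, 0.0, 0.0], [0.0, 4e-4, 0.0], [0.0, 0.0, 1e-4]]
G = gini_kakwani(beta0, p, q)
var_G = gini_variance(beta0, p, q, cov)
print("G =", G, "sd(G) =", math.sqrt(var_G))
```

In practice the covariance matrix would come from the regression output, and a library routine for the digamma function could replace the hand-written approximation.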

Because both x and y are increasing, ε may be heteroscedastic. To address this problem, we can use the weighted least squares method to estimate the parameters of Eq. (3), using 1/xi (i = 1, 2,…, n) as the weights:

$$\hat{{\mathbf{\upbeta }}}={({\bf{X}}{{^{\prime} }}\Lambda {\bf{X}})}^{-1}{\bf{X}}{{^{\prime} }}\Lambda {\bf{Y}},\,{\bf{Var}}(\hat{{\mathbf{\upbeta }}})={({\bf{X}}{{^{\prime} }}\Lambda {\bf{X}})}^{-1}{\sigma }^{2}$$

where Λ = diag(1/x1, 1/x2, …, 1/xn) and E(εε′) = \(\Lambda^{-1}\sigma^{2}\).
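A weighted least squares version of the regression can be sketched as follows. The helper names are hypothetical, the data are synthetic (generated from a = 0.7, p = 0.9, q = 0.6 with a small deterministic perturbation on the log scale), and the estimator of σ² as the weighted residual sum of squares divided by n − 3 is an assumption made here, since the text does not state how σ² is estimated.

```python
import math

def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def invert(A):
    """Invert a small square matrix column by column."""
    n = len(A)
    cols = [solve_linear(A, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def fit_kakwani_wls(xs, ys):
    """WLS with weights 1/x_i; returns ([beta0, p, q], covariance matrix)."""
    X = [[1.0, math.log(x), math.log(1.0 - x)] for x in xs]
    Y = [math.log(x - y) for x, y in zip(xs, ys)]
    w = [1.0 / x for x in xs]
    XtWX = [[sum(wi * r[i] * r[j] for wi, r in zip(w, X)) for j in range(3)]
            for i in range(3)]
    XtWY = [sum(wi * r[i] * yi for wi, r, yi in zip(w, X, Y)) for i in range(3)]
    beta = solve_linear(XtWX, XtWY)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, Y)]
    # Assumed estimator: weighted residual sum of squares over n - 3.
    sigma2 = sum(wi * e * e for wi, e in zip(w, resid)) / (len(xs) - 3)
    cov = [[sigma2 * v for v in row] for row in invert(XtWX)]
    return beta, cov

# Synthetic decile points with a small alternating perturbation (assumption).
xs = [i / 10 for i in range(1, 10)]
eps = [0.01 * (-1) ** i for i in range(1, 10)]
ys = [x - 0.7 * x**0.9 * (1.0 - x)**0.6 * math.exp(e) for x, e in zip(xs, eps)]

beta, cov = fit_kakwani_wls(xs, ys)
print("a, p, q =", math.exp(beta[0]), beta[1], beta[2])
print("variances =", [cov[i][i] for i in range(3)])
```

The returned covariance matrix is what the delta-method formula in Eq. (4) takes as Var(β).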

The total income or wealth is usually greater than 0, and the average income or wealth is also required to be greater than 0. The income Gini coefficient is equal to the Gini mean difference divided by twice the mean income; since the Gini mean difference is non-negative, the Gini coefficient is always non-negative.

The UNU-WIID database has data on the income shares by decile and the Gini index for economies around the world. These economies differ significantly in the degree of income inequality. For example, Belgium (BEL), Czechia (CZE), Slovakia (SVK), and Iceland (ISL) have lower Gini indices ranging from 0.2 to 0.3, while Panama (PAN), Brazil (BRA), South Africa (ZAF), and Hong Kong (HKG) have higher Gini indices greater than 0.5. We classify the economies in the UNU-WIID database into four groups according to their Gini index: the first group has a Gini index less than 0.3, the second group a Gini index ranging from 0.3 to 0.4, the third group a Gini index ranging from 0.4 to 0.5, and the last group a Gini index greater than 0.5.

To demonstrate the regression method for estimating the Gini index by decile, we choose sixteen economies (see Table 1) from the above four groups and utilize the data on the income shares of these sixteen economies during the period from 2016 to 2020.

Table 1 Estimating the Gini index by the error minimization technique and the regression method using the data on the income shares by decile as published in the UNU-WIID.

As suggested by Dagum (1977), a good parametric functional form for the Lorenz curve should be able to characterize the income distributions of different countries, regions, and socioeconomic groups in different periods. We can use goodness-of-fit statistics, such as the coefficient of determination (R²), mean squared error (MSE), and mean absolute error (MAE), to gauge how close the estimated income shares are to the actual observations (Chotikapanich, 1993; Cheong, 2002; Tanak et al., 2018; Paul and Shankar, 2020; Sitthiyot and Holasut, 2021).
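These statistics are simple to compute once actual and fitted shares are available. The sketch below uses a hypothetical function name and made-up share values; R² is taken as one minus the ratio of the residual sum of squares to the total sum of squares, which is an assumption because the text does not spell out the definition it uses.

```python
def goodness_of_fit(actual, fitted):
    """Return R2, MSE, MAE, and the maximum absolute error of a fit."""
    if len(actual) != len(fitted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - f for a, f in zip(actual, fitted)]
    sse = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "R2": 1.0 - sse / sst,          # assumed definition: 1 - SSE / SST
        "MSE": sse / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MAX_ABS": max(abs(e) for e in errors),
    }

# Made-up cumulative income shares, for illustration only.
actual = [0.03, 0.08, 0.15, 0.23, 0.33]
fitted = [0.03, 0.09, 0.15, 0.22, 0.33]
print(goodness_of_fit(actual, fitted))
```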

Fitting Lorenz curves using income shares by decile, we can estimate parameter(s) of the Lorenz curve based on the curve fitting technique and the method of minimizing the sum of squared errors:

$$\hat{\theta }=\mathop{\rm{arg\,min}}\limits_{\theta }\mathop{\sum}\limits_{i=1}^{N}{[{y}_{i}-L({x}_{i},\theta )]}^{2},\,\theta ^{\prime} =(a,p,q)$$
(5)
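To make Eq. (5) concrete for the Kakwani form, the sketch below minimizes the sum of squared errors by a plain grid search over (p, q), using the fact that for fixed (p, q) the optimal a has a closed form because the objective is quadratic in a. This is only one simple way to carry out the minimization and is not claimed to be the optimizer used in the study; the function names, the grid resolution, the upper bound on p, and the synthetic data are all assumptions.

```python
def kakwani(x, a, p, q):
    """Kakwani (1980) form L(x) = x - a x^p (1 - x)^q."""
    return x - a * x**p * (1.0 - x)**q

def fit_min_sse(xs, ys, grid=100, p_max=2.0):
    """Grid search over p in (0, p_max] and q in (0, 1]; a solved in closed form.

    Returns (sse, a, p, q) for the grid point with the smallest SSE.
    """
    best = None
    for i in range(1, int(round(p_max * grid)) + 1):
        p = i / grid
        for j in range(1, grid + 1):
            q = j / grid
            f = [x**p * (1.0 - x)**q for x in xs]
            # Minimizing sum((x - y) - a f)^2 over a gives a = <x - y, f> / <f, f>.
            a = (sum((x - y) * fi for x, y, fi in zip(xs, ys, f))
                 / sum(fi * fi for fi in f))
            if a <= 0.0:
                continue  # respect the constraint a > 0
            sse = sum((y - (x - a * fi)) ** 2 for x, y, fi in zip(xs, ys, f))
            if best is None or sse < best[0]:
                best = (sse, a, p, q)
    return best

# Synthetic decile points generated from hypothetical parameters.
xs = [i / 10 for i in range(1, 10)]
ys = [kakwani(x, 0.7, 0.9, 0.6) for x in xs]

sse, a_hat, p_hat, q_hat = fit_min_sse(xs, ys)
print(sse, a_hat, p_hat, q_hat)
```

A gradient-based or derivative-free optimizer would normally replace the grid search; the grid is used here only because it is easy to verify.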

From the viewpoint of fitting the Lorenz curve, although the goodness-of-fit of our regression method is in most cases poorer than that of the minimization technique of Eq. (5), it performs better in estimating the Gini index. We introduce the root mean squared error (RMSE) in Eq. (6) to measure the difference between the estimated and the actual Gini indices.

$$RMSE=\sqrt{\frac{1}{k}\mathop{\sum }\limits_{i=1}^{k}{\big({\hat{G}}_{i}-{G}_{i}\big)}^{2}}$$
(6)
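Eq. (6) translates directly into code. In the minimal sketch below, the function name `rmse` and the Gini values are hypothetical and serve only to show the calculation across k economies.

```python
import math

def rmse(estimated, actual):
    """Root mean squared error between estimated and actual Gini indices."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    k = len(actual)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / k)

# Made-up estimated and actual Gini indices for two economies.
print(rmse([0.30, 0.40], [0.31, 0.38]))
```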

Results and discussion

Table 1 reports the estimated parameters a, p, and q for the Lorenz curve of Eq. (2) for the sixteen economies using the error minimization technique and the regression method. We can substitute the estimated values of parameters a, p, and q into Eq. (4) to obtain the estimated Gini index. The ΔG1 and ΔG2 in Table 1 are the differences between the estimated and the actual Gini indices based on the two estimation methods, respectively. In most cases ΔG2 is smaller than ΔG1 for the sixteen economies, indicating a better performance of the regression method. Furthermore, the regression method has a lower RMSE than the minimization technique for the full sample estimation. This conclusion also holds for the estimations for the medium, higher, and high groups, which again indicates that the regression method is superior to the minimization technique. We also note that in some cases the estimated value of parameter p is greater than one under both methods, which justifies our adjustment of the range of parameter p in Eq. (2). Table 1 also gives the standard deviation of the estimated Gini index obtained with the regression method.

Comparison of goodness-of-fit by two methods

We then use the fitted Lorenz curves to calculate the values of the (cumulative) income shares by decile for the sixteen economies under the two methods, and compare the estimated income shares with the observed income shares. Table 2 reports the values of goodness-of-fit statistics, i.e., the information inaccuracy measure (IIM), R², MSE, MAE, and maximum absolute error (MAS). All the values of the goodness-of-fit measures suggest that there is no significant difference between the estimated (cumulative) income shares and the observed income shares for each economy. We find that the MSE of the fitted Lorenz curve by the error minimization technique is less than the MSE by the regression method for all the sampled economies. In most cases, this conclusion also holds in terms of the statistics MAS, MAE, and IIM. Therefore, the error minimization technique performs better than the regression method in fitting the Lorenz curve.

Table 2 Goodness-of-fit of the fitted Lorenz curves by the error minimization technique and the regression method using the data on the income shares by decile for the sixteen economies.

Comparison of the estimated Gini index using different Lorenz curves

There are many different functional forms of the Lorenz curve, of which the most common are the following:

$$\begin{array}{l}{\rm{Paul}}-{\rm{Shankar}}\!:{L}_{1}(x;r)=x[{e}^{-r(1-{e}^{x})}-1]/[{e}^{-r(1-e)}-1],\,r \,>\, 0\\ {\rm{Aggarwal}}\!:{L}_{2}(x;r)=[{(1-r)}^{2}x]/[{(1+r)}^{2}-4rx],0 \,<\, r \,<\, 1\\ {\rm{Chotikapanich}}\!:{L}_{3}(x;r)=({e}^{rx}-1)/({e}^{r}-1),r \,>\, 0\\ {\rm{Pareto}}\!:{L}_{4}(x;r)=1-{(1-x)}^{1/r},r \,>\, 1\\ {\rm{Kakwani}}-{\rm{Podder}}\!:{L}_{5}(x;r)=x{e}^{-r(1-x)},r \,>\, 0\\ {\rm{Rasche}}\,{\rm{et}}\,{\rm{al}}.\!:{L}_{6}(x;q,r)={[1-{(1-x)}^{q}]}^{r},0 \,<\, q\,\le\, 1,\,r\ge 1\\ {\rm{Ortega}}\,{\rm{et}}\,{\rm{al}}.\!:{L}_{7}(x;q,r)={x}^{q}[1-{(1-x)}^{r}],q\,\ge\, 0,0 \,<\, r\,\le\, 1\\ {\rm{Sitthiyot}}-{\rm{Holasut}}\!:{L}_{8}(x;q,r)=(1-r){x}^{q}+r[1-{(1-x)}^{1/q}],q\,\ge\, 1,0\,\le\, r\,\le\, 1\\ {\rm{Sarabia}}\,{\rm{et}}\,{\rm{al}}.\!:\,{L}_{9}(x;q,r,s)={x}^{q}{[1-{(1-x)}^{r}]}^{s},q\,\ge\, 0,0 \,<\, r\,\le\, 1,s\,\ge\, 1\end{array}$$

The first five specifications of the above Lorenz curve are single-parametric, the last one is three-parametric, and the rest are double-parametric.
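For any of these specifications, the implied Gini index can be cross-checked numerically from G = 1 − 2∫L(x)dx. The sketch below integrates with the composite Simpson rule and compares the result with closed forms for two of the single-parametric curves; the function names and the parameter value r = 2 are hypothetical, and the closed forms follow from direct integration of the expressions listed above.

```python
import math

def gini_from_lorenz(lorenz, n=2000):
    """G = 1 - 2 * integral of L over [0, 1], composite Simpson rule (n even)."""
    h = 1.0 / n
    total = lorenz(0.0) + lorenz(1.0)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * lorenz(k * h)
    return 1.0 - 2.0 * total * h / 3.0

def chotikapanich(x, r):
    """L3(x; r) = (e^{rx} - 1) / (e^r - 1), r > 0."""
    return (math.exp(r * x) - 1.0) / (math.exp(r) - 1.0)

def pareto(x, r):
    """L4(x; r) = 1 - (1 - x)^{1/r}, r > 1."""
    return 1.0 - (1.0 - x) ** (1.0 / r)

r = 2.0
g_chot = gini_from_lorenz(lambda x: chotikapanich(x, r))
g_par = gini_from_lorenz(lambda x: pareto(x, r))

# Closed forms obtained by integrating L3 and L4 directly.
g_chot_exact = 1.0 - 2.0 / r + 2.0 / (math.exp(r) - 1.0)
g_par_exact = (r - 1.0) / (r + 1.0)
print(g_chot, g_chot_exact)
print(g_par, g_par_exact)
```

The Pareto curve has an unbounded derivative at x = 1 when r > 1, so the numerical value agrees with the closed form less closely than for the smooth Chotikapanich curve.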

The minimization technique is applicable to all the above specifications of the Lorenz curve, while the regression method can only be applied to Kakwani’s (1980) Lorenz curve. We fit the above-mentioned nine specifications using the error minimization technique, and fit Kakwani’s Lorenz curve using the regression method based on the dataset of the sampled economies. Table 3 reports the estimated results. Column (4) presents the estimated result of Kakwani’s Lorenz curve using the regression method, and columns (5)-(13) present the estimated results of the above nine specifications using the error minimization technique, respectively.

Table 3 Estimating the Gini index by the regression method and the error minimization technique with the different Lorenz curves using the data on the income shares by decile as published in the UNU-WIID.

We find that the performance of the regression method is better than that of the error minimization technique under the single-parametric and the double-parametric specifications, while it is poorer than the error minimization method under the three-parametric specification presented in column (13). In addition, we find the regression method is always better than the error minimization technique under any specifications for the medium and the higher groups, since the RMSE of the regression method is smaller than the RMSE of the error minimization method under any specifications.

We also find that the three-parametric Lorenz curve performs better than the double-parametric ones when the error minimization technique is used to fit the Lorenz curve, since the RMSE under the three-parametric specification is smaller than the RMSE of the double-parametric ones. Similarly, the double-parametric specifications are better than the single-parametric ones. The results from Table 3 also suggest that L8, proposed by Sitthiyot and Holasut (2021), has the best performance among the three double-parametric Lorenz curves, and L2, proposed by Aggarwal (1984), has the best performance among the five single-parametric Lorenz curves. Sitthiyot and Holasut (2023) find that the estimated Gini index has a lower bound of 0.4180 for the Lorenz curve L1 under the condition r > 0. In order to better fit the above-mentioned nine Lorenz curves, we relax the constraints on the parameters of those Lorenz curves when we fit them using the error minimization technique. For example, for the Lorenz curve L1, we allow the parameter r to vary between negative infinity and positive infinity, i.e., -∞ < r < ∞, and for the Lorenz curve L9, we allow the following looser parameter constraints: -∞ < q < ∞, 0 < r ≤ 1, and s > 0.

Comparison of the estimated income shares between LSH and KRE

Given the poorer performance of the regression method in fitting the Lorenz curve in Table 2, we examine the performance of the regression method relative to the error minimization technique in fitting the income shares by decile. We compare the estimated income shares of the regression method under Kakwani’s Lorenz curve to those of the error minimization technique under the Lorenz curve L8, since the specification L8 has the best performance among all the single-parametric and double-parametric specifications. When estimating the income shares, we choose four countries from the sixteen economies, namely, Belgium, Albania (ALB), the United States (USA), and South Africa, which have significant differences in the level of inequality. For instance, the Gini index of Belgium is 0.2540 in 2019, Albania 0.343 in 2019, the USA 0.4709 in 2018, and South Africa 0.6170 in 2017.

Table 4 reports the estimated income shares for these four countries. Panel A in Table 4 presents the actual income shares and the estimated income shares using two methods for Belgium. Panel B, Panel C, and Panel D display the actual and the estimated income shares of Albania, the USA, and South Africa, respectively. The lower half of Table 4 reports the K-S test and the goodness-of-fit statistics such as IIM, MSE, MAE, and MAS.
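Estimated income shares by decile follow from a fitted curve by differencing its values at consecutive decile points. The sketch below shows this step for the Kakwani form; the function names and the parameter values a = 0.7, p = 0.9, q = 0.6 are hypothetical and do not correspond to any economy in the tables.

```python
def kakwani(x, a, p, q):
    """Kakwani (1980) form L(x) = x - a x^p (1 - x)^q."""
    return x - a * x**p * (1.0 - x)**q

def decile_shares(a, p, q):
    """Income share of each decile: L(i/10) - L((i - 1)/10), i = 1, ..., 10."""
    cumulative = [kakwani(i / 10, a, p, q) for i in range(11)]
    return [cumulative[i] - cumulative[i - 1] for i in range(1, 11)]

# Hypothetical parameter values, for illustration only.
shares = decile_shares(0.7, 0.9, 0.6)
for decile, share in enumerate(shares, start=1):
    print(decile, round(share, 4))
```

Because L(0) = 0 and L(1) = 1, the ten shares sum to one, and for a convex fitted curve they are non-decreasing from the bottom to the top decile.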

Table 4 The comparison of the estimated income shares of four countries by decile between the error minimization technique with L8 and the regression method.

The results of the K-S test suggest that there are no significant differences between the actual income shares and the estimated income shares for each country. Furthermore, judged by the MSE, MAE, MAS, and IIM, the regression method performs better than the error minimization technique in estimating income shares for Albania and the USA. In contrast, the error minimization technique performs better than the regression method for Belgium and South Africa.

Conclusions

Estimating the Gini index from income shares by decile has attracted considerable attention from researchers. Because the cumulative population shares and the corresponding cumulative income shares by decile form the points of the Lorenz curve, fitting the Lorenz curve is more convenient than fitting the income distribution function. Based on the Lorenz curve suggested by Kakwani (1980), we propose a new approach, named the regression method, to estimate the parameters of this Lorenz curve and calculate the Gini index for the sample economies.

We build a multiple linear regression equation and obtain the estimated values of the three parameters of Kakwani’s (1980) Lorenz curve using the latest data on the income shares by decile from the UNU-WIID database. We then calculate the Gini index based on the Beta function, which is easily computed using popular computer programs such as EViews, Stata, MATLAB, and R. We also provide a method to estimate the variance of the Gini index.

The results suggest that the regression method estimates the Gini index better than the error minimization technique when fitting the Lorenz curve proposed by Kakwani (1980). We extend the range of the parameter p in Eq. (2) by replacing 0 < p ≤ 1 with p > 0. At the same time, the Lorenz curve proposed by Kakwani (1980) allows for negative income and wealth values. This is very useful for research using survey data with negative values.

We also analyze the effects of the different forms of Lorenz curves on the estimated Gini index. Using nine popular Lorenz curves, we find that the three-parametric Lorenz curves have a better performance than the double-parametric curves judged by the RMSE, and the double-parametric specifications are better than the single-parametric. Furthermore, we find that the three-parametric Lorenz curve proposed by Sarabia et al. (1999) has the best performance among the nine Lorenz curves. In addition, we find the regression method has the best performance for the economies with medium and higher levels of inequality, while the error minimization technique is the best for the economies with low and high levels of inequality under the Lorenz curve proposed by Sarabia et al. (1999).

Among the double-parametric Lorenz curves, the Lorenz curve suggested by Sitthiyot and Holasut (2021) has the best performance in terms of the estimated Gini index. Therefore, we compare the performance of the regression method to that of the error minimization technique in estimating the income shares by decile. The results show that the regression method is better than the error minimization technique for the economies with medium and higher inequality, while the minimization technique is better than the regression method for the economies with low and high inequality.