1 Introduction

Life expectancy (LE) is an important demographic indicator, providing an overall index of the mortality of a given population [1]. This indicator is now used worldwide to evaluate population health, to compare health across nations [2], and to study sociological inequalities in mortality (e.g., [3]). As a statistical estimator, LE is computed from underlying random variables (the vital statistics of deaths and births as well as population counts), so the LE estimate varies randomly from sample to sample drawn from the population under consideration. Each LE estimate thus has a statistical variance that represents its precision. This precision can be expressed as a standard error, coefficient of variation, or confidence interval. The variance is also essential for statistical tests used to ascertain whether observed differences in LE estimates are likely due to chance.

Chiang [4, 5] first proposed a method of life expectancy estimation based on survival theory. The method is straightforward but effective and has been widely applied (e.g., [3, 6, 7]). Nevertheless, [8,9,10] have reported that the biases of Chiang’s current life tables are related to the estimation of the age-specific mortality rates \(M_x\) and of the conditional probabilities of dying \(q_x\) in the age intervals \([x;x+n)\). Specifically, the ratio of the number of deaths to the population count in a 5-year age interval usually provides a biased estimate of the age-specific mortality rate \(M_x\) [8, 11, 12]. As a result, the probability \(q_x\) of dying in the age interval, calculated from \(M_x\), is also biased. Moreover, many simulation and actual-data studies [13,14,15,16] have found that the Chiang estimation method is less accurate in a small-area setting. In addition, the Chiang variance formula was derived by applying the delta method under the assumption that the age-specific mortality rates are independent. However, this assumption has not been supported by any theoretical analysis. Furthermore, as shown by Scherbov and Ediev ([14], formula (3)), under the assumption of independence and unbiasedness of the individual mortality rates, the Chiang estimate of the survival probability is biased upwards, which leads to an overestimation of LE. We do not assume this independence in our model.

In [17], we introduced a model similar to the one in this article, based on a combination of Weibull distributions, which are commonly used in survival analysis. We also provided an estimation method for its parameters. This method can be applied to abridged datasets and shows certain advantages over the Chiang method. However, for that model we did not obtain a closed-form expression for the variance, nor did we prove that the local parametric estimator of life expectancy is approximately normally distributed.

In this article, we propose a different variant of the model studied in [17], which we refer to as the “Local parametric method” (LP method). Our main objective is to study the newly proposed model and obtain a closed-form formula for the variance of life expectancy. To validate our findings, we experiment on the same dataset used in [17]. Besides providing estimator variances, we also remark that, in contrast to many other methods (e.g., the Chiang method), we do not need abridged life tables to estimate life expectancy.

The rest of the article is organized as follows. Section 2 is devoted to the theoretical analysis of the life expectancy estimation and the proof of the explicit variance formula for the newly proposed model. In Sect. 3, we apply this result to life expectancy estimation on a longitudinal survival dataset of FilaBavi [18] using the Chiang, Kaplan–Meier, and our proposed LP methods. Conclusions are drawn in Sect. 4.

2 The local parametric method

The Weibull distribution was introduced in 1951 [19] and has become a popular model in many areas such as survival analysis. It was pointed out in [20,21,22,23] that the survivorship data can be well fit by the Weibull distribution. This inspired the authors of [17] to use the Weibull distribution to model the human survival process and to develop a life expectancy estimation method.

2.1 Local Weibull model for survival time distribution

A random variable T is said to follow the Weibull distribution if its density function is given by

$$\begin{aligned} f(t)=k\lambda (\lambda t)^{k-1}e^{-(\lambda t)^k}\,\,\, \text{for} \,\,\, 0\le t <\infty . \end{aligned}$$

We denote this by \(T\sim \textrm{Weibull}(\lambda ,k)\), where \(\lambda\) is a positive scale parameter and k is a positive shape parameter. For such a random variable T, if \(k<1\), the instantaneous hazard \(h(t) = \lim _{\Delta t \rightarrow 0} \frac{P(t \le T<t+\Delta t \mid T \ge t)}{\Delta t}= k\lambda (\lambda t)^{k-1}\) decreases monotonically with time. In the case \(k=1\), the instantaneous hazard is constant over time. For \(k>1\), the instantaneous hazard increases with time. Because of this property, a Weibull distribution with a single scale parameter \(\lambda\) and a single shape parameter k cannot be used to model human survival time over the whole life span: the age-specific mortality rate of a person usually decreases in early life, stays roughly constant during the middle ages, and then increases towards the end of life (e.g., [20]).
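The three hazard regimes can be checked numerically. The following is a minimal sketch using only the hazard formula \(h(t)=k\lambda (\lambda t)^{k-1}\) stated above; the function names `weibull_hazard` and `hazard_trend` and the chosen ages are illustrative assumptions, not part of the method.

```python
def weibull_hazard(t, lam, k):
    """Instantaneous hazard h(t) = k * lam * (lam * t)**(k - 1) for t > 0."""
    if t <= 0 or lam <= 0 or k <= 0:
        raise ValueError("t, lam and k must be positive")
    return k * lam * (lam * t) ** (k - 1)


def hazard_trend(lam, k, ages=(1.0, 10.0, 50.0)):
    """Label the hazard over increasing ages as decreasing, constant or increasing.

    The Weibull hazard is monotone in t, so comparing a few ages is enough.
    """
    values = [weibull_hazard(t, lam, k) for t in ages]
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d == 0 for d in diffs):
        return "constant"
    return "increasing" if all(d > 0 for d in diffs) else "decreasing"


if __name__ == "__main__":
    for shape in (0.5, 1.0, 2.0):
        print(shape, hazard_trend(lam=0.02, k=shape))
```

A single pair \((\lambda ,k)\) yields only one of these three shapes, which is why the model below uses a separate pair for each age band.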

This study deals with a survival model using the Weibull distribution with local parametrization over 19 age bands of [0; 1), [1; 5), [5; 10),..., [80; 85), and \([85;\infty )\).

For \(j=1,2,\ldots ,19\) let \([x_j, x_j+ o_j)\) denote the j-th age band with length \(o_j\). We set \(o_1=1\); \(o_2=4\); \(o_j=5\) for all \(j=3,4,\ldots ,18\); and \(o_{19}=\infty\). This choice is fairly standard; see, e.g., [5]. For a person who is alive at age \(x_j\), \(j=1,2,\ldots , 18\), we model the number of years that the person lives after age \(x_j\) by \(W_j\sim \textrm{Weibull}(\lambda _j,k_j)\) with density function

$$\begin{aligned} f_j(t)=k_j\lambda _j (\lambda _j t)^{k_j-1}e^{-(\lambda _j t)^{k_j}}\,\,\, \text{for} \,\,\, 0\le t <\infty , \end{aligned}$$
(1)

and cumulative distribution function given by

$$\begin{aligned} F_j(t) = 1 - e^{-(\lambda _jt)^{k_j}}. \end{aligned}$$
(2)

For the last age band \([x_{19}, \infty )\), we model the number of years that a person lives after \(x_{19}\) by \(W_{19}\sim \textrm{Weibull}(\lambda _{19},k_{19})\) with \(k_{19}=1\).
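As a small illustration of (1) and (2), the sketch below implements the local density and distribution function and the probability of surviving a whole band. The function names and the parameter values in the check are assumptions chosen for illustration only.

```python
import math


def local_weibull_pdf(t, lam, k):
    """Density (1): k*lam*(lam*t)**(k-1) * exp(-(lam*t)**k), for t > 0."""
    return k * lam * (lam * t) ** (k - 1) * math.exp(-((lam * t) ** k))


def local_weibull_cdf(t, lam, k):
    """Distribution function (2): 1 - exp(-(lam*t)**k)."""
    return 1.0 - math.exp(-((lam * t) ** k))


def band_survival(lam, k, width):
    """P(W_j >= o_j): probability of surviving a band of length `width`."""
    return math.exp(-((lam * width) ** k))


def midpoint_integral(f, a, b, steps=20000):
    """Composite midpoint rule; it never evaluates f at the endpoint t = 0."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    lam, k, width = 0.05, 1.5, 5.0  # illustrative values, not estimates
    area = midpoint_integral(lambda t: local_weibull_pdf(t, lam, k), 0.0, width)
    print(area, local_weibull_cdf(width, lam, k), band_survival(lam, k, width))
```

Integrating the density over a band reproduces the distribution function at the band length, and the survival probability is its complement.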

In addition, we assume that the random variables \(W_1, W_2,\ldots , W_{19}\) are independent. We remark that these random variables are not observable. In order to simplify many calculations, we also define the random variables

$$\begin{aligned} U_{j} = W_j\wedge o_j, j=1,2,\ldots ,18, \, \, \text{ and } \, \, U_{19} = W_{19}. \end{aligned}$$
(3)

From now on, we will denote by \({\textbf{1}}_{A}(x)\) the indicator function of set A, which equals 1 if \(x \in A\) and equals 0 otherwise.

We define

$$\begin{aligned} T_{1} = U_1 ; \, \, T_{j} = U_{j}\cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{j-1};\infty )}(W_{j-1}) \,, \, \text{ for } \, \, j = 2,3,\cdots ,19 \end{aligned}$$
(4)

and

$$\begin{aligned} T =T_{1}+T_{2}+T_{3}+ \dots +T_{18}+T_{19}\,. \end{aligned}$$
(5)

The locally parametrized Weibull distribution model defined in (1)–(5) is called the “Local Weibull model”.

The observable random variable \(T_j\) represents the number of years (as a real number) a person is alive within the j-th age band, \(j=1,2,\ldots ,19\), and the random variable T denotes a person’s lifetime. Then the expectation and the variance of T are important measures related to the life expectancy of a person. We will proceed to compute these values.
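The construction (3)–(5) can be simulated directly. The sketch below draws lifetimes with Python's standard `random.weibullvariate(alpha, beta)`, whose scale `alpha` corresponds to \(1/\lambda _j\) and whose shape `beta` corresponds to \(k_j\). The function name `simulate_lifetime` and the parameter values are illustrative assumptions. The loop stops at the first band in which death occurs; this gives the same distribution of T as drawing all \(W_j\), because every later \(T_i\) is then zero.

```python
import math
import random

# Band lengths o_1..o_19: 1, 4, sixteen 5-year bands, and an open last band.
WIDTHS = [1.0, 4.0] + [5.0] * 16 + [math.inf]


def simulate_lifetime(lams, ks, rng):
    """Draw one lifetime T = T_1 + ... + T_19 under the local Weibull model."""
    total = 0.0
    for lam, k, width in zip(lams, ks, WIDTHS):
        # Scale 1/lam and shape k match the survival function exp(-(lam*t)**k).
        w = rng.weibullvariate(1.0 / lam, k)
        total += min(w, width)  # U_j = W_j ^ o_j: years lived inside band j
        if w < width:           # death inside band j, so all later T_i are 0
            break
    return total


if __name__ == "__main__":
    rng = random.Random(2024)
    # Illustrative parameters: constant hazard 0.02 in every band (k_j = 1).
    lams, ks = [0.02] * 19, [1.0] * 19
    draws = [simulate_lifetime(lams, ks, rng) for _ in range(20000)]
    print(sum(draws) / len(draws))  # should be near 1/0.02 = 50
```

With \(k_j=1\) and a common \(\lambda\) the model reduces to a single exponential lifetime with mean \(1/\lambda\), which gives a convenient sanity check.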

By formula (5), the expectation and the variance of T are given by the following proposition.

Proposition 1

Let the random variable T model a person’s lifetime according to the local Weibull model defined in (1)–(5), where the random variables \(W_1, W_2,\ldots , W_{19}\) are independent. Then the mean life expectancy, \({\mathbb {E}}[T]\), is given by

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[T]&= \{{\mathbb {E}}[W_{1}\cdot {\textbf{1}}_{[0;o_1)}(W_1)]+p_{1}\}+\{{\mathbb {E}}[W_{2}\cdot {\textbf{1}}_{[0;o_2)}(W_2)]+o_2 p_{2}\} p_{1}\\&\quad + \{{\mathbb {E}}[W_{3}\cdot {\textbf{1}}_{[0;o_3)}(W_3)] + o_3 p_{3}\} p_{1} p_{2} + \cdots \\&\quad + \{{\mathbb {E}}[W_{18}\cdot {\textbf{1}}_{[0;o_{18})}(W_{18})] + o_{18} p_{18}\} p_{1} p_{2}\cdots p_{17}\\&\quad + {\mathbb {E}}[W_{19}] p_{1} p_{2}\cdots p_{18}, \end{aligned} \end{aligned}$$
(6)

with \(p_j = {\mathbb {P}}\left( W_{j}\ge o_j\right) ,j = 1,2,\ldots ,18\). Besides, the variance of T can be expressed by

$$\begin{aligned} Var [T] = \sum \limits _{j=1}^{19}Var [T_j] + 2\cdot \left\{ \sum \limits _{j=1}^{18}\sum \limits _{i=j+1}^{19} Cov [T_j,T_i] \right\} , \end{aligned}$$
(7)

where the variances \(Var [T_j]\) are given by the formulas

$$\begin{aligned} Var[T_1] = {\mathbb {E}}[W_{1}^2\cdot {\textbf{1}}_{[0;o_1)}(W_1)] + o_1^2 p_1 - ({\mathbb {E}}[W_{1}\cdot {\textbf{1}}_{[0;o_1)}(W_1)]+o_1 p_{1})^2; \end{aligned}$$
(8)
$$\begin{aligned} \begin{aligned} Var[T_j]&= \{{\mathbb {E}}[W_{j}^2\cdot {\textbf{1}}_{[0;o_j)}(W_j)] + o_j^2 p_j- ({\mathbb {E}}[W_{j}\cdot {\textbf{1}}_{[0;o_j)}(W_j)]+o_j p_{j})^2 p_{1} \cdots p_{j-1} \}\cdot \\&\quad \cdot p_{1} \cdots p_{j-1}, \end{aligned} \end{aligned}$$
(9)

for \(j=2,3,\ldots ,18\); and

$$\begin{aligned} Var[T_{19}] = \{{\mathbb {E}}[W_{19}^2] - {\mathbb {E}}[W_{19}]^2 p_{1} \cdots p_{18}\}\cdot p_{1} \cdots p_{18}. \end{aligned}$$
(10)

Moreover, the covariances \(Cov [T_j,T_i]\) are determined by

$$\begin{aligned} Cov [T_1,T_i] = (\{{\mathbb {E}}[W_i\cdot {\textbf{1}}_{[0;o_i)}(W_i)] + o_i p_i\} p_{1} \cdots p_{i-1})\cdot [1- {\mathbb {E}}[W_{1} {\textbf{1}}_{[0;o_1)}(W_1)]-p_1], \end{aligned}$$
(11)

for \(i=2,3,\ldots ,18\);

$$\begin{aligned} \begin{aligned} Cov [T_j,T_i]&= \{({\mathbb {E}}[W_{i} {\textbf{1}}_{[0;o_i)}(W_i)]+o_i p_{i})p_{1} \cdots p_{i-1} \} \cdot \\&\quad \cdot \{o_j-({\mathbb {E}}[W_{j}\cdot {\textbf{1}}_{[0;o_j)}(W_j)]+o_j p_{j}) \cdot p_{1} \cdots p_{j-1}\}, \end{aligned} \end{aligned}$$
(12)

for \(2 \le j < i = 3,\ldots ,18\); and

$$\begin{aligned} Cov [T_j,T_{19}] = {\mathbb {E}}[W_{19}]p_{1} \cdots p_{18} \cdot \{o_j-({\mathbb {E}}[W_{j} {\textbf{1}}_{[0;o_j)}(W_j)]+o_j p_{j})p_{1} \cdots p_{j-1}\}, \end{aligned}$$
(13)

for \(j = 1,2,\ldots ,18\).

Proof

It is evident that the expectation of the random variable T is given by

$$\begin{aligned} {\mathbb {E}}[T] ={\mathbb {E}}[T_{1}]+{\mathbb {E}}[T_{2}]+{\mathbb {E}}[T_{3}]+ \dots +{\mathbb {E}}[T_{18}]+{\mathbb {E}}[T_{19}]. \end{aligned}$$

We now compute each of the above expectations. Using the identity \(W_j\wedge o_j=W_j\cdot {\textbf{1}}_{[0;o_j)}(W_j) + o_j\cdot {\textbf{1}}_{[o_j;\infty )}(W_j)\) (recall that \(o_1=1\)) and the independence of \(W_1, W_2,\ldots , W_{19}\), we obtain

$$\begin{aligned} {\mathbb {E}}[T_{1}]= & {} {\mathbb {E}}[U_{1}] = {\mathbb {E}}[W_1\cdot {\textbf{1}}_{[0;o_1)}(W_1) + {\textbf{1}}_{[o_1;\infty )}(W_1)] = {\mathbb {E}}[W_{1}\cdot {\textbf{1}}_{[0;o_1)}(W_1)]+p_{1},\\ {\mathbb {E}}[T_{j}]= & {} {\mathbb {E}}[U_{j}\cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{j-1};\infty )}(W_{j-1})] \\= & {} ({\mathbb {E}}[ W_j\cdot {\textbf{1}}_{[0;o_j)}(W_j)]+o_j\cdot p_j)\cdot p_1 \cdots p_{j-1} , j=2,\ldots ,18, \\ {\mathbb {E}}[T_{19}]= & {} {\mathbb {E}}[U_{19}\cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{18};\infty )}(W_{18})] \\= & {} {\mathbb {E}}[W_{19}] \cdot p_1 \cdots p_{18}, \end{aligned}$$

That immediately implies (6).

To compute the variance of T, we must calculate the variances \(Var[T_i], i=1,2,\ldots ,19\), and the covariances \(Cov[T_j,T_i]\), \(1\le j < i =2,3,\ldots ,19\). Firstly, from (3) we have

$$\begin{aligned} {\mathbb {E}}[U_j^2]&= {\mathbb {E}}[ W_j^2 \wedge o_j^2 ] = {\mathbb {E}}[W_{j}^2\cdot {\textbf{1}}_{[0;o_j)}(W_j)] + o_j^2 p_j\, , \text{ for } \,j=1,2,\cdots ,18, \\ {\mathbb {E}}[U_{19}^2]&= {\mathbb {E}}[W_{19}^2]. \end{aligned}$$

That together with the equations \(Var[T_j] = {\mathbb {E}}[T_j^2] - {\mathbb {E}}[T_j]^2\), \(j=1,2,\ldots ,19\), the independence of \(W_1, W_2,\ldots , W_{19}\), and (4) yield (8), (9) and (10). Namely,

$$\begin{aligned} Var[T_1] = {\mathbb {E}}[T_1^2] - {\mathbb {E}}[T_1]^2 = {\mathbb {E}}[W_{1}^2\cdot {\textbf{1}}_{[0;o_1)}(W_1)] + o_1^2 p_1 - ({\mathbb {E}}[W_{1}\cdot {\textbf{1}}_{[0;o_1)}(W_1)]+o_1 p_{1})^2, \end{aligned}$$

which means that (8) holds. Furthermore, the independence of \(W_1, W_2,\ldots , W_{19}\) ensures

$$\begin{aligned} Var[T_j] = {\mathbb {E}}[T_j^2] - {\mathbb {E}}[T_j]^2 = {\mathbb {E}}[U_{j}^2 ] \cdot p_{1} p_{2}\cdots p_{j-1} - {\mathbb {E}}[U_j]^2 \cdot p_{1}^2 p_{2}^2 \cdots p_{j-1}^2\,, \end{aligned}$$

for \(j=2,\ldots ,19\). The above equation directly implies (9) and (10). On the other hand, for \(i =2,\ldots ,18\), we have

$$\begin{aligned} {\mathbb {E}}[T_1 \cdot T_i]&= {\mathbb {E}}[U_1 \cdot U_{i}\cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{i-1};\infty )}(W_{i-1})] \\&= {\mathbb {E}}[(W_1 \cdot {\textbf{1}}_{[0;o_1)}(W_1) + o_1\cdot {\textbf{1}}_{[o_1;\infty )}(W_1)) \cdot (W_i \cdot {\textbf{1}}_{[0;o_i)}(W_i) + o_i \cdot {\textbf{1}}_{[o_i;\infty )}(W_i)) \cdot \\&\quad \cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{i-1};\infty )}(W_{i-1})]\\&= \{{\mathbb {E}}[W_i \cdot {\textbf{1}}_{[0;o_i)}(W_i)] + o_i\cdot p_i\}\cdot p_{1} p_{2}\cdots p_{i-1} . \end{aligned}$$

For \(2\le j < i =3,\ldots ,18\), we also have

$$\begin{aligned} {\mathbb {E}}[T_i \cdot T_j]&= {\mathbb {E}}[U_i \cdot U_j \cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{j-1};\infty )}(W_{j-1}) \cdot {\textbf{1}}_{[o_{j};\infty )}(W_{j}) \cdot \\&\quad \quad \quad \quad \cdots {\textbf{1}}_{[o_{i-1};\infty )}(W_{i-1})] \\&=\{{\mathbb {E}}[W_i\cdot {\textbf{1}}_{[0;o_i)}(W_i)] + o_i p_i \} \cdot o_j \cdot p_{1} p_{2}\cdots p_j \cdots p_{i-1} . \end{aligned}$$

For \(j =1,2,\ldots ,18\), it is clear that

$$\begin{aligned} {\mathbb {E}}[T_j \cdot T_{19}]&= {\mathbb {E}}[(U_{j}\cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{j-1};\infty )}(W_{j-1}) ) \cdot \\&\quad \quad \cdot (U_{19}\cdot {\textbf{1}}_{[o_1;\infty )}(W_1) \cdot {\textbf{1}}_{[o_2;\infty )}(W_2) \cdots {\textbf{1}}_{[o_{18};\infty )}(W_{18}) )] \\&= {\mathbb {E}}[W_{19}] \cdot o_j \cdot p_{1} p_{2}\cdots p_j \cdots p_{18} . \end{aligned}$$

The above equations are used to derive the formulas of the covariances between \(T_j\) and \(T_i\) given in (11), (12), (13) and this completes the proof. \(\square\)

Proposition 1 shows that all entries of the covariance matrix of the random vector \((T_1, T_2,\ldots , T_{19})\) are well-defined real numbers; in other words, this random vector has a finite covariance matrix.
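The formulas of Proposition 1 can be evaluated numerically. The sketch below is an illustration under stated assumptions: the truncated moments are obtained with a composite Simpson rule (less accurate near 0 when \(k_j<1\)), the covariances are summed over all pairs \(j<i\), including \(j=1\) as supplied by (11), and the names `simpson`, `band_moments` and `lifetime_mean_variance` are hypothetical. It uses the compact forms \(Var[T_j]={\mathbb {E}}[U_j^2]P_j-({\mathbb {E}}[U_j]P_j)^2\) and \(Cov[T_j,T_i]={\mathbb {E}}[T_i](o_j-{\mathbb {E}}[T_j])\), with \(P_j=p_1\cdots p_{j-1}\), which follow from the proof above.

```python
import math

WIDTHS = [1.0, 4.0] + [5.0] * 16 + [math.inf]  # o_1, ..., o_19


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3


def band_moments(lam, k, width):
    """Return (E[U], E[U^2], p) for U = min(W, width), W ~ Weibull(lam, k)."""
    p = math.exp(-((lam * width) ** k))

    def density_times(power):
        # s**power * f(s), written as k*lam^k * s^(k+power-1) * exp(-(lam*s)^k)
        return lambda s: k * lam**k * s ** (k + power - 1) * math.exp(-((lam * s) ** k))

    m1 = simpson(density_times(1), 0.0, width) + width * p
    m2 = simpson(density_times(2), 0.0, width) + width**2 * p
    return m1, m2, p


def lifetime_mean_variance(lams, ks):
    """Evaluate E[T] and Var[T] from the band-wise moments (k_19 is taken as 1)."""
    m1, m2, reach = [], [], []  # reach[j] = p_1 * ... * p_{j-1}
    alive = 1.0
    for lam, k, width in zip(lams[:-1], ks[:-1], WIDTHS[:-1]):
        a, b, p = band_moments(lam, k, width)
        m1.append(a)
        m2.append(b)
        reach.append(alive)
        alive *= p
    # Open last band is exponential: E[W] = 1/lam, E[W^2] = 2/lam^2.
    m1.append(1.0 / lams[-1])
    m2.append(2.0 / lams[-1] ** 2)
    reach.append(alive)
    mean_parts = [a * r for a, r in zip(m1, reach)]  # E[T_j]
    var = sum(b * r - (a * r) ** 2 for a, b, r in zip(m1, m2, reach))
    for j in range(len(m1) - 1):
        for i in range(j + 1, len(m1)):
            var += 2.0 * mean_parts[i] * (WIDTHS[j] - mean_parts[j])
    return sum(mean_parts), var
```

For a constant hazard (\(k_j=1\), common \(\lambda\)) the lifetime is exponential, so the result should be close to \(1/\lambda\) and \(1/\lambda ^2\); for instance, `lifetime_mean_variance([0.02] * 19, [1.0] * 19)` should return values near 50 and 2500.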

2.2 Estimate survival time expectation by sample means

For a random variable X with a random sample \(\{X_1, X_2,\ldots , X_n\}\) of size n, the sample mean \(\bar{X} = (X_1+X_2+\cdots +X_n)/n\) is a natural estimator of the expectation \({\mathbb {E}}[X]\): it is unbiased and consistent, and it has the smallest variance among linear unbiased estimators. So, let \(n_j\) be the sample size of the observations in the j-th age group, \(j=1,2,\ldots ,19\), and let \(\{T_{j,1}, T_{j,2},\ldots , T_{j,n_j}\}\) be a random sample of the random variable \(T_j\); that is, \(T_{j,1}, T_{j,2},\ldots , T_{j,n_j}\) are independent random variables with the same distribution as \(T_j\). We take the sample mean \(\bar{T}_j = (T_{j,1}+T_{j,2}+\cdots +T_{j,n_j})/n_j\) as the estimate of the expectation \({\mathbb {E}}[T_j]\) of \(T_j\), and the sum \(\bar{T} = \bar{T}_1+\bar{T}_2+\cdots +\bar{T}_{19}\) as the estimate of the expectation \({\mathbb {E}}[T]\) of T. As a consequence of the Multivariate Central Limit Theorem and Proposition 1, which confirms that \((T_1, T_2,\ldots , T_{19})\) is a random vector with finite covariance matrix, we have the following proposition.

Proposition 2

Let \(\{T_{j,1}, T_{j,2},\ldots , T_{j,n_j}\}\) be a random sample of the random variable \(T_j\) and \(\bar{T}_j = (T_{j,1}+T_{j,2}+\cdots +T_{j,n_j})/n_j\), for \(j=1,2,\ldots ,19\). Then the estimator \(\bar{T} = \bar{T}_1+\bar{T}_2+\cdots +\bar{T}_{19}\) of the expectation \({\mathbb {E}}[T]\) of the lifetime T has a distribution which is approximately normal as the sample sizes \(n_1, n_2,\ldots , n_{19}\) tend to infinity.
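In practice, the approximate normality in Proposition 2 is what justifies a normal-theory confidence interval for life expectancy. A minimal sketch, assuming an estimate of \({\mathbb {E}}[T]\) and of the variance of \(\bar{T}\) are already available (the function name and the numbers in the example are hypothetical):

```python
from statistics import NormalDist


def normal_confidence_interval(estimate, variance, level=0.95):
    """Two-sided interval estimate +/- z * sqrt(variance) at the given level."""
    if variance < 0 or not 0.0 < level < 1.0:
        raise ValueError("variance must be >= 0 and level must lie in (0, 1)")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    half_width = z * variance ** 0.5
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    # Hypothetical inputs: estimated LE of 70 years with variance 0.25.
    print(normal_confidence_interval(70.0, 0.25))
```

The interval is only as reliable as the normal approximation, so it should be read with care when some of the sample sizes \(n_j\) are small.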

It is easy to show that \(Var(\bar{T}_j)=Var({T}_j)/n_j\). Furthermore, \(Cov(T_i,T_j)\), \(i, j=1,2,\ldots ,19\), may be different from 0 (see Eqs. (11), (12) and (13)) although \(W_1, W_2,\ldots , W_{19}\) are independent. Still, we can obtain \(Cov(\bar{T}_i,\bar{T}_j)=Cov({T}_i,{T}_j)/\max {(n_i;n_j)}\) for \(1\le i < j \le 19\). The above observation together with an argument similar to that of Proposition 1, where \(T_1, T_2,\ldots , T_{19}\) are respectively replaced by \(\bar{T}_1, \bar{T}_2,\ldots , \bar{T}_{19}\), implies the following proposition.

Proposition 3

Let T be the random variable modeling a person’s lifetime with the local Weibull model defined in (1)–(5), where \(W_1, W_2,\ldots , W_{19}\) are independent. Let \(\{T_{j,1}, T_{j,2},\ldots , T_{j,n_j}\}\) be a random sample of the random variable \(T_j\) and \(\bar{T}_j = (T_{j,1}+T_{j,2}+\cdots +T_{j,n_j})/n_j\) with sample size \(n_j\) of the j-th age group, for \(j=1,2,\ldots ,19\). Then \({\bar{T}}\) is an estimator of the life expectancy whose mean can be approximated as follows:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[{\bar{T}}]&= {\mathbb {E}}[\bar{T_1}] + {\mathbb {E}}[\bar{T_2}] +\cdots + {\mathbb {E}}[\bar{T_{19}}] \\&\approx \left( \frac{k_1 \lambda _1^{k_1} o_1^{k_1+1}}{k_1+1} - \frac{k_1 \lambda _1^{2k_1} o_1^{2k_1+1}}{2k_1 +1}+o_1p_{1}\right) +\left( \frac{k_2 \lambda _2^{k_2} o_2^{k_2+1}}{k_2+1} - \frac{k_2 \lambda _2^{2k_2} o_2^{2 k_2+1} }{2k_2 +1} +o_2 p_{2}\right) p_{1} + \\&\quad +\left( \frac{k_3 \lambda _3^{k_3} o_3^{k_3+1}}{k_3+1} - \frac{k_3 \lambda _3^{2k_3} o_3^{2 k_3+1} }{2k_3 +1} + o_3 p_{3}\right) p_{1} p_{2} +\cdots + \\&\quad + \left( \frac{k_{18} \lambda _{18}^{k_{18}} o_{18}^{k_{18}+1}}{k_{18}+1} - \frac{k_{18} \lambda _{18}^{2k_{18}} o_{18}^{2 k_{18}+1} }{2k_{18} +1} + o_{18} p_{18}\right) p_{1} \cdots p_{17} + \frac{1}{\lambda _{19}} p_{1} \cdots p_{18}. \end{aligned} \end{aligned}$$
(14)

Furthermore, the variance of \({\bar{T}}\) is given by

$$\begin{aligned} Var [\bar{T}] = \sum \limits _{j=1}^{19}Var [\bar{T}_j] + 2\cdot \left\{ \sum \limits _{j=1}^{18}\sum \limits _{i=j+1}^{19} Cov [\bar{T}_j,\bar{T}_i] \right\} \,, \end{aligned}$$

where the variances and the covariances can be approximated as follows:

$$\begin{aligned} Var [\bar{T}_1] \approx \frac{\frac{k_1 \lambda _1^{k_1}o_1^{k_1+2} }{k_1+2} - \frac{k_1 \lambda _1^{2k_1}o_1^{2k_1+2} }{2k_1 +2} + o_1^2 p_1 - \left( \frac{k_1 \lambda _1^{k_1}o_1^{k_1+1} }{k_1+1} - \frac{k_1 \lambda _1^{2k_1} o_1^{2k_1+1}}{2k_1 +1}+o_1 p_{1}\right) ^2 }{ n_1} , \end{aligned}$$
(15)
$$\begin{aligned} \begin{aligned} Var [\bar{T}_j]&\approx \frac{ 1}{ n_j} \times \left\{ \frac{k_j \lambda _j^{k_j} o_j^{k_j+2}}{k_j+2} - \frac{k_j \lambda _j^{2k_j} o_j^{2 k_j+2} }{2k_j +2} + o_j^2 p_j \right. \\&\quad \left. - \left( \frac{k_j \lambda _j^{k_j} o_j^{k_j+1}}{k_j+1} - \frac{k_j \lambda _j^{2k_j} o_j^{2 k_j+1} }{2k_j +1}+o_j p_{j}\right) ^2 p_{1} \cdots p_{j-1} \right\} p_{1} \cdots p_{j-1} , \end{aligned} \end{aligned}$$
(16)

for \(j =2,\ldots ,18\),

$$\begin{aligned} Var [\bar{T}_{19}] \approx \frac{\left\{ \frac{2}{\lambda _{19}^2} - \frac{1}{\lambda _{19}^2} p_{1} \cdots p_{18}\right\} \cdot p_{1} \cdots p_{18}}{n_{19}}. \end{aligned}$$
(17)

and

$$\begin{aligned} Cov [\bar{T}_1,\bar{T}_i] \approx \frac{\left( \left\{ \frac{k_i \lambda _i^{k_i} o_i^{k_i+1}}{k_i+1} - \frac{k_i \lambda _i^{2k_i} o_i^{2 k_i+1} }{2k_i +1} + o_i p_i\right\} \cdot p_{1} p_{2}\cdots p_{i-1}\right) \cdot \left[ 1- \frac{k_1 \lambda _1^{k_1} }{k_1+1} + \frac{k_1 \lambda _1^{2k_1} }{2k_1 +1} - p_1\right] }{\max (n_1; n_i)}, \end{aligned}$$
(18)

for \(i = 2,3,\ldots ,18\),

$$\begin{aligned} \begin{aligned} Cov [\bar{T}_j,\bar{T}_i]&\approx \frac{1}{\max (n_j; n_i)} \cdot \left\{ \left( \frac{k_i \lambda _i^{k_i} o_i^{k_i+1}}{k_i+1} - \frac{k_i \lambda _i^{2k_i} o_i^{2 k_i+1} }{2k_i +1}+o_i p_{i}\right) p_{1} \cdots p_{i-1} \right\} \cdot \\&\quad \quad \cdot \left\{ o_j-\left( \frac{k_j \lambda _j^{k_j} o_j^{k_j+1}}{k_j+1} - \frac{k_j \lambda _j^{2k_j} o_j^{2 k_j+1} }{2k_j +1}+o_j p_{j}\right) p_{1} \cdots p_{j-1}\right\} , \end{aligned} \end{aligned}$$
(19)

for \(2\le j < i =3,\cdots ,18\),

$$\begin{aligned} Cov [\bar{T}_j,\bar{T}_{19}] \approx \frac{\frac{1}{\lambda _{19}} p_{1} \cdots p_{18} \cdot \left\{ o_j-\left( \frac{k_j \lambda _j^{k_j} o_j^{k_j+1}}{k_j+1} - \frac{k_j \lambda _j^{2k_j} o_j^{2 k_j+1} }{2k_j +1}+o_j p_{j}\right) p_{1} \cdots p_{j-1}\right\} }{\max (n_j ; n_{19})}, \end{aligned}$$
(20)

for \(j =1,2,\cdots ,18\).

Proof

For \(j=1,2,\ldots ,18\), we have

$$\begin{aligned} {\mathbb {E}}[W_{j} {\textbf{1}}_{[0;o_j)}(W_j)]= \int \limits _0^{o_j} s f_j (s) d s = \int \limits _0^{o_j} k_j \lambda _j^{k_j} s^{k_j}e^{-\lambda _j^{k_j} s ^{k_j}} d s. \end{aligned}$$

To simplify the integration of the integrand \(k_j \lambda _j^{k_j} s^{k_j}e^{-\lambda _j^{k_j} s ^{k_j}}\), we expand its exponential factor in a power series up to first order, \(e^{-x}\approx 1-x\), which gives

$$\begin{aligned} {\mathbb {E}}[W_{j} {\textbf{1}}_{[0;o_j)}(W_j)] \approx \int \limits _0^{o_j} \left( k_j \lambda _j^{k_j} s^{k_j} - k_j \lambda _j^{2k_j} s^{2k_j} \right) ds = \frac{k_j \lambda _j^{k_j} o_j^{k_j+1}}{k_j+1} - \frac{k_j \lambda _j^{2k_j} o_j^{2 k_j+1} }{2k_j +1} . \end{aligned}$$
(21)

Using the same argument, we get

$$\begin{aligned} {\mathbb {E}}[W_{j}^2 {\textbf{1}}_{[0;o_j)}(W_j)]= \int \limits _0^{o_j} k_j \lambda _j^{k_j} s^{k_j+1}e^{-\lambda _j^{k_j} s ^{k_j}} d s \approx \frac{k_j \lambda _j^{k_j} o_j^{k_j+2}}{k_j+2} - \frac{k_j \lambda _j^{2k_j} o_j^{2 k_j+2} }{2k_j +2}. \end{aligned}$$
(22)

Moreover, because \(W_{19}\sim \textrm{Weibull}(\lambda _{19},k_{19})\) with \(k_{19}=1\), we have \({\mathbb {E}}[W_{19}]= 1/ \lambda _{19}\) and \(Var [W_{19}] = 1/ \lambda _{19}^2\), which together with the equation \(Var [W_{19}] = {\mathbb {E}}[W_{19}^2] - ( {\mathbb {E}}[W_{19}])^2\) implies

$$\begin{aligned} {\mathbb {E}}[W_{19}^2]= \frac{2}{\lambda _{19}^2} \,. \end{aligned}$$
(23)

Finally, by virtue of the above-mentioned facts that \(Var(\bar{T}_j)=Var({T}_j)/n_j\) and \(Cov(\bar{T}_i,\bar{T}_j)=Cov({T}_i,{T}_j)/\max {(n_i;n_j)}\) for \(1\le i < j \le 19\), we can combine (6)–(13) with (21)–(23) to get (14)–(20). Consequently, the proposition is proved. \(\square\)

We remark that the approximation used in the above proof can also be seen as a standard first order approximation for the incomplete Gamma function and in that sense the approximations could be improved if necessary.
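To give a feeling for the quality of the first-order approximations (21) and (22), the sketch below compares them with a midpoint-rule evaluation of the same integrals. The function names and the parameter values are illustrative assumptions; the numerical integral only serves as a reference value.

```python
import math


def truncated_moment_first_order(lam, k, width, power):
    """First-order approximation of E[W**power * 1{W < width}], power = 1 or 2.

    power = 1 reproduces (21) and power = 2 reproduces (22).
    """
    lead = k * lam**k * width ** (k + power) / (k + power)
    corr = k * lam ** (2 * k) * width ** (2 * k + power) / (2 * k + power)
    return lead - corr


def truncated_moment_numeric(lam, k, width, power, steps=20000):
    """Midpoint-rule value of the integral of s**power * f(s) over [0, width)."""
    h = width / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += k * lam**k * s ** (k + power - 1) * math.exp(-((lam * s) ** k))
    return total * h


if __name__ == "__main__":
    for lam in (0.001, 0.02, 0.2):  # low, moderate and high mortality
        approx = truncated_moment_first_order(lam, 1.0, 5.0, 1)
        exact = truncated_moment_numeric(lam, 1.0, 5.0, 1)
        print(lam, approx, exact)
```

The approximation is accurate when \((\lambda _j o_j)^{k_j}\) is small and degrades as this quantity approaches 1, which is the situation in which higher-order terms would be worth adding.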

2.3 Variance of life expectancy estimated using abridged annual reported data

An abridged population dataset (recorded annually) consists of age-classed pairs \((death_j, pop_j)\), one for each age band \([x_j; x_j+o_j)\), \(j=1,2,\ldots ,19\). Here \(death_j\) is the number of deaths recorded among persons whose age at the time of death was in the interval \([x_j; x_j+o_j)\), and \(pop_j\) is the number of persons alive on the midyear day (July 1) of the observation year whose age on that day belonged to the same interval. We refer to this kind of abridged data as midyear abridged data (MAD). Because an abridged dataset contains only \((death_j, pop_j)\) for each age band \([x_j; x_j+o_j)\), the data can be used only to estimate the scale parameter \(\lambda _j\), as we will see later. For this reason, in the present study we fix an appropriate value of the shape parameter \(k_j\) for every age group before using the data to estimate the parameters \(\lambda _j\).

Typically, data in population studies consist of follow-up reports over a certain calendar year, from January 1 to December 31 of the year. In this study, the abridged data used in the LP method of life expectancy estimation are organized differently from MAD. Specifically, the age on the last day (December 31) of the current observation year is used to classify the age groups \([x_j; x_j+o_j)\), \(j=1,2,\ldots ,19\), in the abridged data. Then for each \(j=1,2,\ldots ,19\), the number \(n_j\) is the number of people whose age on the last day of the year was in the interval \([x_j; x_j+o_j)\), so \(n_j\) is also the sample size of the observations in the j-th group. Similarly, \(d_j\) is the number of deaths that occurred in the current observation year among the \(n_j\) individuals of the j-th age group.
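The classification by age on December 31 can be sketched as follows. The record layout (a pair of age on December 31 and a death flag) and the names `band_index` and `tabulate` are assumptions made for illustration; for a person who died during the year, the age is assumed to be the one that would have been reached on December 31.

```python
from bisect import bisect_right

# Band starts x_1, ..., x_19 = 0, 1, 5, 10, ..., 85.
BAND_STARTS = [0, 1] + list(range(5, 90, 5))


def band_index(age):
    """0-based index (j - 1) of the band [x_j; x_j + o_j) that contains `age`."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return bisect_right(BAND_STARTS, age) - 1


def tabulate(records):
    """Count n_j and d_j from (age_on_dec_31, died_in_year) pairs."""
    n = [0] * len(BAND_STARTS)
    d = [0] * len(BAND_STARTS)
    for age, died in records:
        j = band_index(age)
        n[j] += 1
        if died:
            d[j] += 1
    return n, d


if __name__ == "__main__":
    sample = [(0.5, False), (3.0, True), (3.5, False), (90.0, True)]
    print(tabulate(sample))
```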

We define the random variable \(Y_j\) as the number of years (as a real number) by which the age of a person in the age band \([x_j; x_j+o_j)\) exceeds \(x_j\) on December 31 of that year. The age of this person on December 31 of the year is therefore \(x_j+Y_j\). Meanwhile, because \(W_j\) is the number of years that a person lives after age \(x_j\), if \(T \ge x_j\) then T can be written as \(T = W_j + x_j\).

We assume that the age bands have been chosen so that each \(Y_j\), \(j=1,2,\ldots ,18\), has a uniform distribution on \([0; o_j)\) with density function

$$\begin{aligned} g_j(s)&= {\left\{ \begin{array}{ll} \frac{1}{o_j} &{} \text{ if } s \in \left[ 0; o_j\right) , \\ 0 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$
(24)

For a person whose age on December 31 is in the age band \([x_{19}, \infty )\), we assume that \(Y_{19}\) has an exponential distribution with parameter \(\mu >0\) and density function \(g_{19}(s)=\mu e^{-\mu s}\) for \(s \ge 0\). Next, we describe the procedure to estimate the scale parameters \(\lambda _j\), assuming that the random variables \(W_j\) and \(Y_j\) are independent. Note that although both \(W_j\) and \(Y_j\) are defined relative to \(x_j\), independence means that their joint density function equals the product of the marginal density functions, \(f_{\left( W_j, Y_j\right) }(t, s)=f_j(t) g_j(s)\).

Following the argument of [17], we use the Lexis diagram (Fig. 1) to describe death events in a calendar year (fixed as 2010 to ease interpretation) related to the age groups [0; 1) and [1; 5) (marked by black and gray dots). In this diagram, the horizontal axis represents calendar time and the vertical axis the individual age (as a real number). The horizontal coordinate of each dot gives the exact date of death, and the vertical coordinate gives the age at death of the person concerned. The birth date of a deceased person is determined by the intersection of the horizontal axis with the line through the dot that is parallel to the main diagonal of the first quadrant (at a 45-degree angle). According to [17], we can assume \(n_j=pop_j\) for all \(j=1,2,\ldots ,19\).

Fig. 1 Mortality pattern in the age intervals [0;1) and [1;5)

Let \(d_{i;j}\), \(i=1,\ldots ,18\), \(j=1,2\), denote the number of deaths in age interval i that fall in either the lower triangle (\(j=1\)) or the upper triangle (\(j=2\)), as in Fig. 1. For example, \(d_{1;1}, d_{1;2}, d_{2;1}\) and \(d_{2;2}\) are the numbers of deaths that occurred in the domains A, B, C, and D, respectively (in the same way we can extend the Lexis diagram and define the numbers \(d_{j;1}, d_{j;2}\), \(j=3,\ldots ,18\)). We see from the diagram that \(death_1 = d_{1;1} + d_{1;2}\). Furthermore, deaths in the lower triangular domain A correspond to deaths in 2010 of children born in the same year. We now compute the probability \(q_{1;1}\) that a person belongs to the domain A; the count \(d_{1;1}\) is then approximately equal to \(n_1 \times q_{1;1}\). In fact, due to (24), the probability of death in the domain A is given by

$$\begin{aligned} \begin{aligned} q_{1;1}&= {\mathbb {P}}\left\{ 0 \le W_1 \le Y_1<1\right\} \\&= \int _0^1 {\mathbb {P}}\left\{ 0 \le W_1 < s \right\} g_1(s) d s \\&=\int _0^1 \left( 1-e^{-\left( \lambda _1s\right) ^{k_1}} \right) ds. \end{aligned} \end{aligned}$$
(25)

The black dots in the domain B, in turn, represent the deaths in 2010 of children born in 2009; in each of those cases the age on December 31, 2009 is less than the age at death, which is smaller than 1. These cases belong to the [0; 1) age group of the 2009 observation year. However, the two successive years 2009 and 2010 have similar survival models for the [0; 1) age group, which implies that \(d_{1;2}\), the number of deaths that occurred in the domain B, can be estimated by \(n_1 \times q_{1;2}\), where the probability of death \(q_{1;2}\) equals

$$\begin{aligned} \begin{aligned} q_{1;2}&= {\mathbb {P}}\{0 \le (Y_2 + 1) -1 \le W_1<1\}\\&=\int _0^1 {\mathbb {P}}\{s \le W_1< 1\} g_2(s) d s\\&=\int _0^1\frac{1}{4}\left\{ e^{-\left( \lambda _1 s\right) ^{k_1}} - e^{-\lambda _1^{k_1}} \right\} d s. \\ \end{aligned} \end{aligned}$$
(26)

From (25) and (26), using the first-order Taylor approximation of the exponential function, we get

$$\begin{aligned} \begin{aligned} death_1&\approx n_1 \times \left\{ \int _0^1 \left( 1-e^{-\left( \lambda _1s\right) ^{k_1}} \right) ds + \int _0^1\frac{1}{4}\left\{ e^{-\left( \lambda _1 s\right) ^{k_1}} - e^{-\lambda _1^{k_1}} \right\} d s\right\} \\&\approx n_1 \times \left\{ \int _0^1 \left( \lambda _1s \right) ^{k_1}ds + \int _0^1 \frac{1}{4}\left( \lambda _1^{k_1} -\left( \lambda _1s \right) ^{k_1} \right) ds \right\} \\&\approx n_1 \times \left\{ \frac{3\lambda _1^{k_1}}{4(k_1+1)} + \frac{1}{4}\lambda _1^{k_1} \right\} . \end{aligned} \end{aligned}$$

From here, we can propose the following estimator of \(\lambda _1\)

$$\begin{aligned} {\hat{\lambda }}_1 = \left\{ \frac{4(k_1+1)}{k_1+4} \times \frac{death_1}{n_1} \right\} ^{\frac{1}{k_1}}. \end{aligned}$$
(27)
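Estimator (27) is a one-line computation. The sketch below implements it together with the first-order relation it inverts, so that the two can be checked against each other; the function names and the numbers in the example are illustrative assumptions, and \(k_1\) is treated as fixed in advance, as described above.

```python
def estimate_lambda1(death_1, n_1, k_1):
    """Estimator (27) of the scale parameter of the first age band."""
    if n_1 <= 0 or death_1 < 0 or k_1 <= 0:
        raise ValueError("need n_1 > 0, death_1 >= 0 and k_1 > 0")
    return (4.0 * (k_1 + 1.0) / (k_1 + 4.0) * death_1 / n_1) ** (1.0 / k_1)


def expected_death_share_band1(lam_1, k_1):
    """First-order value of death_1 / n_1 implied by (25) and (26)."""
    return 3.0 * lam_1**k_1 / (4.0 * (k_1 + 1.0)) + lam_1**k_1 / 4.0


if __name__ == "__main__":
    # Hypothetical counts: 75 deaths among 1000 persons, shape fixed at 0.5.
    lam_hat = estimate_lambda1(75, 1000, 0.5)
    print(lam_hat, expected_death_share_band1(lam_hat, 0.5))
```

Plugging the estimate back into the first-order relation returns the observed ratio \(death_1/n_1\), which confirms that (27) is the exact inverse of that relation.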

Consequently, we can use (26) to approximate the value of \(d_{1;2}\) by

$$\begin{aligned} \begin{aligned} d_{1 ; 2}&=n_1 \times \int _0^1\frac{1}{4}\left\{ e^{-\left( \lambda _1 s\right) ^{k_1}} - e^{-\lambda _1^{k_1}} \right\} d s \\&\approx n_1 \times \int _0^1 \frac{1}{4}\left( \lambda _1^{k_1} -\left( \lambda _1s \right) ^{k_1} \right) ds\\&\approx n_1 \times \left\{ \frac{\lambda _1^{k_1}}{4}-\frac{\lambda _1^{k_1}}{4(k_1+1)} \right\} \\ \end{aligned} \end{aligned}$$
(28)

For the scale parameter \(\lambda _2\), we deal with the deaths corresponding to the dots in the domains B, C and D. The dots in the two domains B and C represent the deaths in 2010 of children in the age group [1; 5), who were born in the period from 2006 to 2009. Therefore, the number \(d_2\) of such deaths is approximately equal to \(n_2 \times q_{2;1}\). Recalling (24), the probability of death \(q_{2;1}\) can be written as

$$\begin{aligned} \begin{aligned} q_{2;1}&={\mathbb {P}}\left\{ 0 \le (Y_2+1)-1 \le W_1 \le 1 \right\} + {\mathbb {P}}\left\{ 1 \le W_2 + 1 \le Y_2+1 \le 2 \right\} \\&\quad \quad +{\mathbb {P}}\left\{ 1 \le (Y_2+1)-1 \le W_2 +1 \le Y_2+1<5\right\} \\&={\mathbb {P}}\left\{ 0 \le Y_2 \le W_1 \le 1 \right\} + {\mathbb {P}}\left\{ 0 \le W_2 \le Y_2 \le 1 \right\} \\&\quad \quad +{\mathbb {P}}\left\{ 1 \le Y_2 \le W_2 +1 \le Y_2 + 1<5\right\} \\&= \int _0^1 {\mathbb {P}}\left\{ s \le W_1< 1 \right\} g_2(s)ds + \int _0^1 {\mathbb {P}}\left\{ 0\le W_2< s \right\} g_2(s)ds\\&\quad \quad + \int _1^4 {\mathbb {P}}\left\{ s-1 \le W_2 < s \right\} g_2(s)ds\\&=\int _0^1 \frac{1}{4} \left( e^{-\left( \lambda _1 s\right) ^{k_1}} - e^{-\left( \lambda _1 \right) ^{k_1}}\right) ds + \int _0^1 \frac{1}{4} \left( 1- e^{-\left( \lambda _2 s\right) ^{k_2}} \right) ds \\&\quad \quad +\int _1^4 \frac{1}{4} \left( e^{-\left( \lambda _2 (s-1)\right) ^{k_2}} - e^{-\left( \lambda _2 s\right) ^{k_2}}\right) ds. \end{aligned} \end{aligned}$$
(29)

Simultaneously, the black dots in D represent the deaths in 2010 of persons born in 2005, who belonged to the age group [1; 5) on the last day of the observation year 2009, but not on the last day of the observation year 2010. It is clear that the mortality models of these two age groups are close to each other. We observe that the deaths in the domain D had an age at death greater than \((Y_3+5)-1\) and less than 5. Additionally, recalling (24), the number \(d_{2;2}\) of deaths that occurred in the domain D can be approximated by \(n_2 \cdot q_{2;2}\), where the probability of death \(q_{2;2}\) equals

$$\begin{aligned} \begin{aligned} q_{2;2}&={\mathbb {P}}\left\{ 4 \le (Y_3+5)-1 \le W_2 + 1<5\right\} \\&= \int _0^1 {\mathbb {P}}\left\{ s+ 3 \le W_2 <4 \right\} g_3(s) ds\\&=\int _0^1 \frac{1}{5} \left( e^{-\left( \lambda _2 (s+3)\right) ^{k_2}} - e^{-\left( 4\lambda _2 \right) ^{k_2}}\right) ds. \end{aligned} \end{aligned}$$
(30)

We observe that \(death_2\) is the number of deaths which occurred in the domains C and D, while \(d_2\) is the number of deaths which occurred in the domains C and B. That means \(death_2 = d_{2;1} + d_{2;2}\), \(d_2 = d_{2;1} + d_{1;2}\) so \(death_2 = d_2 + d_{2;2}-d_{1;2}\). We have

$$\begin{aligned} death_2 \approx n_2q_{2;1} + n_2q_{2;2} - d_{1;2}. \end{aligned}$$

From (29), (30), using an approximation of first order based on the Taylor expansion for the exponential function, we have

$$\begin{aligned} \frac{death_2+d_{1;2}}{n_2}&\approx \frac{k_1 \lambda _1^{k_1}}{4(k_1+1)} + \frac{\lambda _2^{k_2} (4^{k_2+1} - 3^{k_2+1})}{4(k_2+1)} \\&\quad \quad +\frac{1}{5} \left( \left( 4\lambda _2 \right) ^{k_2} - \frac{\lambda _2^{k_2} (4^{k_2+1} - 3^{k_2+1})}{k_2+1} \right) . \end{aligned}$$

This yields the estimator of \(\lambda _2\)

$$\begin{aligned} \begin{aligned} {\hat{\lambda }}_2 = \left\{ \frac{20(k_2+1)}{4^{k_2}(k_2+1)+ 4^{k_2+1} - 3^{k_2+1}} \left( \frac{death_2+d_{1;2}}{n_2} - \frac{k_1 \lambda _1^{k_1}}{4(k_1+1)} \right) \right\} ^{\frac{1}{k_2}}. \end{aligned} \end{aligned}$$
(31)

From (30), using the same first-order approximation, we get

$$\begin{aligned} {} d_{2;2} \approx \frac{n_2}{5}\left( \left( 4\lambda _2 \right) ^{k_2} - \frac{\lambda _2^{k_2} (4^{k_2+1} - 3^{k_2+1})}{k_2+1} \right) . \end{aligned}$$
(32)

Continuing the above argument with the extended Lexis diagram, we have \(death_i = d_{i;1}+d_{i;2}\) and \(d_i = d_{i;1}+ d_{i-1;2}\), so

$$\begin{aligned} death_i = d_i + d_{i;2} - d_{i-1;2}, \quad i=3,4,\ldots ,18. \end{aligned}$$

Firstly, \(d_i \approx n_i\cdot q_{i;1}\) with

$$\begin{aligned} \begin{aligned} q_{i;1}&={\mathbb {P}}\left\{ x_i-1 \le (Y_i + x_i)-1 \le W_{i-1} + x_{i-1} \le x_i \right\} +{\mathbb {P}}\left\{ x_i \le W_{i} + x_{i} \le Y_i+ x_i \le x_i +1 \right\} \\&\quad \quad + {\mathbb {P}}\left\{ x_i \le (Y_i + x_i)-1 \le W_i + x_i \le Y_i + x_i <x_i+o_i\right\} \\&=\int _0^1 \frac{1}{5} \left( e^{-\left( \lambda _{i-1}( s+o_{i-1}-1)\right) ^{k_{i-1}}} - e^{-\left( o_{i-1} \lambda _{i-1} \right) ^{k_{i-1}}}\right) ds +\int _0^1 \frac{1}{5} \left( 1- e^{-\left( \lambda _{i}s\right) ^{k_{i}}}\right) ds \\&\quad \quad + \int _1^{o_i} \frac{1}{5} \left( e^{-\left( \lambda _i (s-1)\right) ^{k_i}} - e^{-\left( \lambda _i s\right) ^{k_i}}\right) ds. \end{aligned} \end{aligned}$$
(33)

Besides, \(d_{i;2} \approx n_i\cdot q_{i;2}\) with

$$\begin{aligned} \begin{aligned} q_{i;2}&={\mathbb {P}}\left\{ x_{i+1}-1 \le (Y_{i+1} + x_{i+1})-1 \le W_i + x_i < x_{i+1}\right\} \\&=\int _0^1 \frac{1}{5} \left( e^{-\left( \lambda _i (s+o_i-1)\right) ^{k_i}} - e^{-\left( \lambda _i o_i \right) ^{k_i}}\right) ds. \end{aligned} \end{aligned}$$

Repeating the first-order Taylor approximation for the exponential function, we have

$$\begin{aligned} \begin{aligned}&\frac{death_i +d_{i-1;2}}{n_i} \approx \frac{1}{5}\left( \left( o_{i-1} \lambda _{i-1} \right) ^{k_{i-1}} - \frac{\lambda _{i-1}^{k_{i-1}}((o_{i-1})^{k_{i-1}+1}-(o_{i-1}-1)^{k_{i-1}+1})}{k_{i-1}+1}\right) \\&\quad + \frac{\lambda _i^{k_i} (o_i^{k_i+1} - (o_i-1)^{k_i+1})}{5(k_i+1)} + \frac{1}{5} \left( \left( \lambda _i o_i\right) ^{k_i} - \frac{\lambda _i^{k_i}(o_i^{k_i+1} -(o_i-1)^{k_i+1})}{k_i+1} \right) . \end{aligned} \end{aligned}$$

From that, \(\lambda _i\) is estimated by \({\hat{\lambda }}_i\), \(i=3,4,\ldots ,18\)

$$\begin{aligned} {\hat{\lambda }}_i = \frac{1}{o_i}\left( \frac{5(death_i +d_{i-1;2})}{n_i} - \frac{\lambda _{i-1}^{k_{i-1}}(o_{i-1} ^{k_{i-1}} +k_{i-1}o_{i-1}^{k_{i-1}}-o_{i-1}^{k_{i-1}+1}+(o_{i-1}-1)^{k_{i-1}+1})}{k_{i-1}+1} \right) ^{1/k_i} . \end{aligned}$$
(34)

And

$$\begin{aligned} d_{i;2} \approx \frac{n_i}{5} \left( \left( \lambda _i o_i\right) ^{k_i} - \frac{\lambda _i^{k_i}(o_i^{k_i+1} -(o_i-1)^{k_i+1})}{k_i+1} \right) , \quad i=3,4,5,\ldots ,18. \end{aligned}$$
(35)
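The recursion defined by (34) and (35) can be sketched in Python as follows: given the results for band \(i-1\), one step returns \({\hat{\lambda }}_i\), and a second helper returns the approximation of \(d_{i;2}\) needed for band \(i+1\). The function names and the guard against a non-positive base are illustrative assumptions; the formulas themselves are transcribed from (34) and (35).

```python
def carry_term(lam_prev, k_prev, o_prev):
    """Second term inside the parentheses of (34), built from band i-1."""
    num = (o_prev ** k_prev + k_prev * o_prev ** k_prev
           - o_prev ** (k_prev + 1) + (o_prev - 1) ** (k_prev + 1))
    return lam_prev ** k_prev * num / (k_prev + 1)

def lambda_hat_step(death_i, n_i, d_prev2, lam_prev, k_prev, o_prev, k_i, o_i):
    """Estimator (34) for band i, given the results for band i-1."""
    inner = 5.0 * (death_i + d_prev2) / n_i - carry_term(lam_prev, k_prev, o_prev)
    if inner <= 0:
        # Guard added for this sketch only: (34) needs a positive base.
        raise ValueError("non-positive base; (34) is undefined for these inputs")
    return inner ** (1.0 / k_i) / o_i

def d_i2(n_i, lam_i, k_i, o_i):
    """Approximation (35) of d_{i;2}."""
    spill = lam_i ** k_i * (o_i ** (k_i + 1) - (o_i - 1) ** (k_i + 1)) / (k_i + 1)
    return n_i / 5.0 * ((lam_i * o_i) ** k_i - spill)
```

In a full computation the two helpers would be called alternately for \(i=3,4,\ldots ,18\), each step feeding \({\hat{\lambda }}_i\) and \(d_{i;2}\) into the next one.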

For the open-ended age group \([x_{19};\infty )\), \(death_{19}\) is the number of deaths that occurred in the current year among people whose age at death belonged to the age interval \([x_{19}, \infty )\). Because \(n_{19}\) also includes \(d_{18;2}\), the number of deaths that occurred in the current year among people whose age on December 31 was in the interval \([x_{19}, \infty )\) but whose age at the time of death was less than \(x_{19}\), \(death_{19}\) satisfies

$$\begin{aligned} \begin{aligned} death_{19}&\approx \left( n_{19} - d_{18;2}\right) \cdot q_{19;2}, \end{aligned} \end{aligned}$$
(36)

with

$$\begin{aligned} \begin{aligned} q_{19;2}&= {\mathbb {P}}\left\{ 85 \le W_{19} + 85 \le Y_{19} + 85<86\right\} + \\&\quad \quad + {\mathbb {P}}\left\{ 85 \le (Y_{19}+85)-1 \le W_{19} +85 \le Y_{19} + 85 <\infty \right\} \\&=\int _0^1 {\mathbb {P}}\left\{ 0 \le W_{19} \le s \right\} g_{19}(s)ds+ \int _1^{\infty } {\mathbb {P}}\left\{ s-1\le W_{19} \le s \right\} g_{19}(s)ds, \end{aligned} \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} q_{19;2}&= \int _0^1 \left( 1- e^{-\lambda _{19}s}\right) \mu e^{-\mu s}ds + \int _1^{\infty } \left( e^{-\lambda _{19}(s-1)}-e^{-\lambda _{19}s} \right) \mu e^{-\mu s}ds \\&=\left( 1-e^{-\mu }\right) \frac{\lambda _{19}}{\mu + \lambda _{19}}, \end{aligned} \end{aligned}$$
(37)

where \(\mu\) is the rate parameter of the exponential density of the random variable \(Y_{19}\), \(g_{19}(s) = \mu e^{-\mu s}, s \ge 0\).

In addition,

$$\begin{aligned} {\mathbb {P}} \left( x_{19} \le Y_{19} + x_{19}< x_{19} +1 \right) = {\mathbb {P}} \left( 0 \le Y_{19} < 1 \right) = 1- \exp (-\mu ). \end{aligned}$$
(38)

Since the event \(\left( x_{19} \le Y_{19} + x_{19} < x_{19} + 1 \right)\) corresponds to people whose age at the end of the observation year belongs to \([x_{19}, x_{19}+1)\), the number of these people is approximately equal to the number of people who belonged to the interval \([x_{19} -1, x_{19} )\) in the preceding year. Moreover, the 5-year interval \([x_{18}, x_{19})\) contains \(n_{18}\) people, hence the number of people in the interval \([x_{19} -1, x_{19} )\) can be approximated by \(n_{18}/5\). Thus, from (38), we have

$$\begin{aligned} 1- \exp (-\mu ) \approx \frac{n_{18}}{5n_{19}}. \end{aligned}$$

So \(\mu\) is estimated by \({\hat{\mu }}\)

$$\begin{aligned} {\hat{\mu }} = \ln {\frac{5n_{19}}{5n_{19} - n_{18}}}. \end{aligned}$$
(39)

From formulas (36)–(39), we obtain the estimator of \(\lambda _{19}\)

$$\begin{aligned} {\hat{\lambda }}_{19} = \frac{{\hat{\mu }} \cdot death_{19}}{\left( n_{19} - d_{18;2}\right) \left( 1-e^{-{\hat{\mu }}}\right) -death_{19}}. \end{aligned}$$
(40)
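The estimators (39) and (40) for the open-ended band can be sketched in Python as below, together with the closed form (37), which allows a round-trip check of (36). The function names and the sample numbers used in any call are illustrative assumptions; the sketch also assumes \(5n_{19} > n_{18}\) and a positive denominator in (40).

```python
import math

def mu_hat(n18, n19):
    """Estimator (39) of the rate parameter of Y_19 (requires 5*n19 > n18)."""
    return math.log(5.0 * n19 / (5.0 * n19 - n18))

def q19_2(lam19, mu):
    """Closed form (37) of the death probability in the open-ended band."""
    return (1.0 - math.exp(-mu)) * lam19 / (mu + lam19)

def lambda19_hat(death19, n19, d18_2, mu):
    """Estimator (40); the denominator must be positive."""
    exposed = (n19 - d18_2) * (1.0 - math.exp(-mu))
    return mu * death19 / (exposed - death19)
```

Plugging \({\hat{\lambda }}_{19}\) and \({\hat{\mu }}\) back into (37) and multiplying by \(n_{19} - d_{18;2}\) reproduces \(death_{19}\), which is a convenient sanity check on an implementation.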

Thus, from these formulae we obtain estimated values \({\hat{\lambda }}_j\) for the scale parameters \(\lambda _j\), \(j=1,2,\ldots ,19\). Moreover, according to Proposition 1, \({\hat{p}}_j={\mathbb {P}}({\hat{W}}_j>o_j)\), where \({\hat{W}}_j\) has a Weibull distribution with parameters \({\hat{\lambda }}_j\) and \(k_j\), \(j=1,2,\ldots ,18\). Therefore, applying formula (2) gives \({\hat{p}}_j = 1-F_j(o_j) = e^{-({\hat{\lambda }}_j o_j)^{k_j }}\). Together with Proposition 3, this gives the following result (recall that according to [17], we can assume \(n_j = pop_j\), for all \(j = 1, 2,\ldots , 19\)):

Proposition 4

Let \(o_1=1\), \(o_2=4\), \(o_{19}=\infty\), \(o_j=5\) for \(j=3,4,\ldots ,18\), and \(x_1=0\), \(x_j=x_{j-1}+o_{j-1}\) for \(j=2,3,\ldots ,19\). Let \(\{(pop_j,death_j), j=1,2,\ldots ,19\}\) be a MAD abridged survival dataset, where \(pop_j\) and \(death_j\) are the population and the number of deaths recorded for the j-th band \([x_j; x_j+o_j)\), \(j=1,2,\ldots ,19\). Then using the estimated values of the local scale parameters \(\lambda _j\), the life expectancy can be estimated by the formula

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[{\bar{T}}]&\approx \left( \frac{k_1 {\hat{\lambda }}_1^{k_1}o_1^{k_1+1}}{k_1+1} - \frac{k_1 {\hat{\lambda }}_1^{2k_1}o_1^{2k_1+1} }{2k_1 +1}+o_1 {\hat{p}}_{1}\right) +\left( \frac{k_2 {\hat{\lambda }}_2^{k_2} o_2^{k_2+1}}{k_2+1} - \frac{k_2 {\hat{\lambda }}_2^{2k_2} o_2^{2 k_2+1} }{2k_2 +1} +o_2 {\hat{p}}_{2}\right) {\hat{p}}_{1} \\&\quad +\left( \frac{k_3 {\hat{\lambda }}_3^{k_3} o_3^{k_3+1}}{k_3+1} - \frac{k_3 {\hat{\lambda }}_3^{2k_3} o_3^{2 k_3+1} }{2k_3 +1} + o_3 {\hat{p}}_{3}\right) {\hat{p}}_{1} {\hat{p}}_{2} +\cdots \\&\quad + \left( \frac{k_{18} {\hat{\lambda }}_{18}^{k_{18}} o_{18}^{k_{18}+1}}{k_{18}+1} - \frac{k_{18} {\hat{\lambda }}_{18}^{2k_{18}} o_{18}^{2 k_{18}+1} }{2k_{18} +1} + o_{18} {\hat{p}}_{18}\right) {\hat{p}}_{1} \cdots {\hat{p}}_{17} + \frac{1}{{\hat{\lambda }}_{19}} {\hat{p}}_{1} \cdots {\hat{p}}_{18} \end{aligned} \end{aligned}$$
(41)

with \({\hat{p}}_1 = e^{-\left( {\hat{\lambda }}_1o_1 \right) ^{k_1}}\), \({\hat{p}}_2 = e^{-\left( {\hat{\lambda }}_2o_2\right) ^{k_2}}, \ldots , {\hat{p}}_{18} = e^{-\left( \hat{\lambda }_{18}o_{18}\right) ^{k_{18}}}\). Moreover, the estimator \(\bar{T}\) of the life expectancy satisfies the following variance formula

$$\begin{aligned} Var [\bar{T}] = \sum \limits _{j=1}^{19}Var [\bar{T}_j] + 2\cdot \left\{ \sum \limits _{j=2}^{18}\sum \limits _{i=j+1}^{19} Cov [\bar{T}_j,\bar{T}_i] \right\} , \end{aligned}$$
(42)

where the variances \(Var [\bar{T}_j]\) and the covariances \(Cov [\bar{T}_j,\bar{T}_i]\) can be calculated by formulas (15)–(20).

Note that formula (41) provides a life expectancy estimation method that does not use the abridged life table. This is a difference from our previous article [17].
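A minimal Python sketch of formula (41) is given below. It accepts lists of scale parameters, shape parameters and band widths for the closed bands (18 of them in Proposition 4) and the scale parameter of the open-ended band; the function names and the list-based interface are illustrative assumptions. Each bracketed term of (41) is a second-order approximation of the expected time lived within one band, weighted by the probability of surviving all earlier bands.

```python
import math

def band_term(lam, k, o):
    """Bracketed term of (41) for one closed band, and its survival probability."""
    p = math.exp(-((lam * o) ** k))
    term = (k * lam ** k * o ** (k + 1) / (k + 1)
            - k * lam ** (2 * k) * o ** (2 * k + 1) / (2 * k + 1)
            + o * p)
    return term, p

def life_expectancy(lams, ks, os_, lam_open):
    """Formula (41): closed bands first, then the open-ended band."""
    total, surv = 0.0, 1.0
    for lam, k, o in zip(lams, ks, os_):
        term, p = band_term(lam, k, o)
        total += term * surv  # weight by survival through earlier bands
        surv *= p
    return total + surv / lam_open  # open-ended band contributes 1/lambda_19
```

For \(k=1\) and a small scale parameter, the bracketed term is close to the exact value \((1-e^{-\lambda o})/\lambda\) of the expected time lived in a band of width \(o\), which gives a quick check of the implementation.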

By virtue of Proposition 3, the above result provides a formula for the confidence interval of life expectancy estimation in the following corollary:

Corollary 5

For a given number \(\alpha \in (0,1)\), the \((1-\alpha )\)-confidence interval of life expectancy estimation \({\mathbb {E}}[{\bar{T}}]\) is the interval

$$\begin{aligned} \biggl ({\mathbb {E}}[{\bar{T}}] - Z_{1-\alpha /2}\cdot \sqrt{Var [\bar{T}]}, {\mathbb {E}}[{\bar{T}}]+Z_{1-\alpha /2}\cdot \sqrt{Var [\bar{T}]}\,\biggr ), \end{aligned}$$

where \({\mathbb {E}}[{\bar{T}}]\) and \(Var [\bar{T}]\) are determined as in (41) and (42), while \(Z_{\beta }\) denotes the \(\beta\)-quantile of the standard normal distribution.
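The interval of Corollary 5 can be computed with the Python standard library as in the following sketch, where the normal quantile \(Z_{1-\alpha /2}\) is obtained from `statistics.NormalDist`. The function name and its arguments are illustrative assumptions; the estimate and its variance are taken as given, for instance from (41) and (42).

```python
from statistics import NormalDist

def le_confidence_interval(le_hat, var_hat, alpha=0.05):
    """(1 - alpha)-confidence interval of Corollary 5 for a given estimate and variance."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # Z_{1 - alpha/2}
    half_width = z * var_hat ** 0.5
    return le_hat - half_width, le_hat + half_width
```

With the default \(\alpha = 0.05\), the quantile is approximately 1.96, so the interval extends about 1.96 standard errors on either side of the estimate.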

3 Experiment and comparison of different life expectancy estimation methods with real data

In this section a real longitudinal survival dataset of FilaBavi [18] is used to compare the effectiveness of the LP method for life expectancy estimation and the traditional Chiang method. The variances of the Chiang life expectancy estimation and the LP life expectancy estimation are computed to compare the performance of the two estimation methods of life expectancy for males and females.

3.1 FilaBavi longitudinal dataset

FilaBavi, an epidemiological field laboratory sited in the Bavi District of northern Vietnam, was created by the Vietnam Health Strategy and Policy Institute in 1999, with the assistance of Swedish Sida/SAREC [18]. The authors of the dataset randomly sampled villages with probability proportional to their population sizes. The sample comprises 51,024 inhabitants in 11,089 households, accounting for approximately 20% of the total population of the district in 1999 (approximately 235,000 people). The data were collected through quarterly demographic surveys of vital events, including information on marital status changes, migrations, pregnancy follow-ups, births, and deaths.

The FilaBavi dataset was recorded through interviews from March 1999 until October 2015 and thus includes 15 years (2000–2014) of complete observation. This study uses a version of the data that has been reorganized into 15 one-year observation semi-cohort data files, each related to a specific year from 2000 to 2014. In the following analysis, we use these one-year observation semi-cohort data files to calculate the life expectancy by the Kaplan-Meier method, which serves as the gold standard for validating other life expectancy estimation methods.

Simultaneously, each of the 15 one-year observation semi-cohort data files is used to generate the respective midyear abridged data (MAD) file. We then use these 15 abridged datasets to calculate the life expectancy estimations by the Chiang method and the LP method, as well as the variances of these estimations.
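The grouping step behind an abridged dataset can be illustrated by the following Python sketch, which assigns ages to the 19 bands \([x_j; x_j+o_j)\) of Proposition 4 and counts persons and deaths per band. The function names, and the assumption that the inputs are plain lists of ages (of persons counted in the population and of persons who died in the observation year), are hypothetical; the exact rules used to build the FilaBavi MAD files are not reproduced here.

```python
from bisect import bisect_right

# Lower band limits x_1, ..., x_19 = 0, 1, 5, 10, ..., 85 (as in Proposition 4).
BAND_STARTS = [0, 1] + list(range(5, 90, 5))

def band_index(age):
    """Return j in 1..19 such that age lies in the j-th band [x_j, x_j + o_j)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return bisect_right(BAND_STARTS, age)

def abridge(ages_alive, ages_at_death):
    """Count persons and deaths per band; returns 19 (pop_j, death_j) pairs."""
    pop = [0] * 19
    death = [0] * 19
    for a in ages_alive:
        pop[band_index(a) - 1] += 1
    for a in ages_at_death:
        death[band_index(a) - 1] += 1
    return list(zip(pop, death))
```

The binary search via `bisect_right` places an age equal to a band limit in the band that starts at that limit, which matches the half-open intervals \([x_j; x_j+o_j)\).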

3.2 Life expectancy estimations and their variances

As mentioned previously, the shape parameter \(k_j\) of the random variable \(W_j\) influences the mortality rate in \([x_j, x_j+o_j)\), \(j=1,2,\ldots ,19\). The mortality rate in this model, which can be interpreted as \(k_j\lambda _j(\lambda _jt)^{(k_j-1)}\), decreases over time in the first year of life, but stays constant with time for the medium age groups. This mortality rate increases over time in the last age groups. In the LP method, the value of \(k_j\) must be pre-determined and used consistently for any dataset. Specifically, following our suggestion in [17], we propose the set of parameters for \(k_j\) as follows:

$$\begin{aligned} \left\{ k_i\right\} =\{0.1; 0.2; 0.9; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 2; 2; 2; 1\} \end{aligned}$$

where \(k_1=0.1; k_2=0.2; k_3=0.9; k_j=1\) for the medium age groups with \(j=4, \ldots , 15; k_j=2\) for the three older age groups with \(j=16,17,18;\) and \(k_{19}=1\).

In this study, starting from the proposed sequence of values for \(k_j\), \(j=1,2,\ldots ,19\), we found that they can be replaced by other values that make the numerical calculations easier without significantly influencing the accuracy of the life expectancy estimations. One such combination is the following:

$$\begin{aligned} \left\{ k_i\right\} =\{0.1; 0.2; 0.9; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1\} \end{aligned}$$

where \(k_1=0.1; k_2=0.2; k_3=0.9\) for the first age groups, and \(k_j=1\), \(j=4, \ldots , 19\), for the other age groups.
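To illustrate how the shape parameter governs the behaviour of the mortality rate \(k_j\lambda _j(\lambda _jt)^{(k_j-1)}\) within a band, the following Python sketch evaluates this rate and stores the two sets of shape parameters listed above. The names `weibull_hazard`, `K_ORIGINAL` and `K_SIMPLIFIED` are illustrative assumptions.

```python
def weibull_hazard(t, lam, k):
    """Mortality rate k * lam * (lam * t)**(k - 1) at time t > 0 within a band."""
    return k * lam * (lam * t) ** (k - 1)

# Shape parameters suggested in [17]: k = 2 for the bands j = 16, 17, 18.
K_ORIGINAL = [0.1, 0.2, 0.9] + [1.0] * 12 + [2.0] * 3 + [1.0]

# Simplified set used in this study: k = 1 for all bands j >= 4.
K_SIMPLIFIED = [0.1, 0.2, 0.9] + [1.0] * 16
```

For \(k<1\) the rate decreases in \(t\), for \(k=1\) it equals the constant \(\lambda\), and for \(k>1\) it increases, so the simplified set replaces the increasing rate in the bands \(j=16,17,18\) by a constant one.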

The article [17] proposes the Kaplan-Meier life expectancy estimation method, which can be applied to demographic datasets that fully record the birth and death dates of all deceased individuals. This method can provide an accurate estimation of life expectancy and can be used as a “gold” standard when investigating the accuracy of other life expectancy estimation methods. Meanwhile, the LP method and the Chiang method use abridged survival data, which are routinely reported data containing pairs of numbers of deaths and of persons grouped in 5-year age groups. Therefore, the latter two estimation methods should provide less precise estimations, although they can be more widely applicable.

To evaluate the accuracy of the life expectancy estimations given by the Chiang method (denoted as “Ch. Est” in the tables) and the LP method (denoted as “LP Est” in the tables), we use the Kaplan-Meier estimator (denoted as “K-M Est” in the tables) obtained from the one-year semi-cohort datasets described in [17] as the “gold” standard of life expectancy estimation. In Tables 1, 2 and 3, we computed the Kaplan-Meier life expectancy estimator for 15 years (2000–2014) using the full FilaBavi dataset. Simultaneously, the MAD’s are used as inputs to estimate the life expectancy by the Chiang method and the LP method. The differences between the Chiang estimations and the Kaplan-Meier estimations (the “gold” standard) are reported as estimation residuals in the “Ch. Res” columns of the tables. As these residuals show, the Chiang method yields over-estimated results in all cases: the Chiang estimations are consistently larger than the corresponding Kaplan-Meier estimations. For example, the residuals exceed 0.5 years in 10 out of 15 years for males, and in 13 out of 15 years for females.

Table 1 Male life expectancy estimations and variances
Table 2 Female life expectancy estimations and variances

Now we compare the differences between the LP life expectancy estimations and the Kaplan–Meier life expectancy estimations. These differences are reported in the “LP Res” columns of the tables. The residuals in these columns show the closeness of the LP life expectancy estimations to the corresponding Kaplan-Meier life expectancy estimations. These comparisons indicate that the LP estimation method is more accurate than the Chiang life expectancy estimation method.

Next we analyze the variances of the two estimation methods. The “Ch. Var” columns of the tables show the variances of the Chiang life expectancy estimations, whilst the “LP Var” columns of the tables contain the variances of the LP life expectancy estimations described in Sect. 2. Comparing the variances in these columns, we can confirm that the LP life expectancy estimation method always provides smaller variances than those obtained by the Chiang method. In particular, in Table 1, dedicated to the male population, the LP variances are around one half of the corresponding Chiang variances (average of 0.5603 in comparison to average of 1.3041).

In Table 2, dedicated to the female population, the LP variances are about one third of the corresponding Chiang variances (average of 0.3661 in comparison to average of 1.0336). These comparisons point to the higher effectiveness of the LP life expectancy estimation over the Chiang life expectancy estimation.
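The arithmetic behind these comparisons can be reproduced with the short Python sketch below, which uses only the four average variances quoted in the text. Because the half-width of a normal-approximation confidence interval is proportional to the standard error, the square root of the variance ratio gives the corresponding ratio of interval widths. The dictionary and function names are illustrative assumptions, and ratios of averages are only indicative of the year-by-year ratios.

```python
import math

# Average variances quoted in the text: (LP, Chiang) for Tables 1 and 2.
AVG_VAR = {"male": (0.5603, 1.3041), "female": (0.3661, 1.0336)}

def variance_and_width_ratio(lp_var, chiang_var):
    """Return (variance ratio, implied ratio of confidence interval widths)."""
    ratio = lp_var / chiang_var
    return ratio, math.sqrt(ratio)

for sex, (lp_var, chiang_var) in AVG_VAR.items():
    ratio, width = variance_and_width_ratio(lp_var, chiang_var)
    print(f"{sex}: variance ratio {ratio:.3f}, width ratio {width:.3f}")
```

For these averages the variance ratios are roughly 0.43 for males and 0.35 for females, which corresponds to interval widths of roughly 0.66 and 0.60 of the Chiang widths.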

To study the influence of the population size on the magnitude of the LE estimation variance, we randomly extract 5000 observations from each yearly dataset of the male and female populations. Using these randomly extracted datasets, we derive 30 MAD’s that contain the pairs of the population size and the number of deaths recorded for each of the 19 age bands. Applying the Chiang method and the LP method to the MAD’s, we obtain LE estimations together with their variances for each dataset. The results are presented in Table 3 for the male and Table 4 for the female population, respectively.

Table 3 Life expectancy estimations variance and 95% confidence interval—male population of 5000
Table 4 Life expectancy estimations variance and 95% confidence interval—female population of 5000

In these tables, the columns “Ch. Est” and “Ch. Var” contain the LE estimations produced by the Chiang method, together with the corresponding variances. From the LE estimations and variances, we derive the respective 95% confidence intervals in the column “Ch. 95% CI”. Similarly, the LE estimations and variances computed by applying the LP method are given in the columns “LP Est” and “LP Var”. In addition, the column “LP 95% CI” shows the respective 95% confidence intervals determined by virtue of Corollary 5.

Comparing Tables 3 and 4 to Tables 1 and 2, we see that the LE estimation variances increase about 5 times when the population sizes decrease around 5 times (from about 25,000 down to 5000). Simultaneously, the variances of the LP LE estimations are consistently about two times smaller than the corresponding variances of the Chiang LE estimations. Consequently, the widths of the confidence intervals of the LP LE estimations are smaller than 2/3 of the widths of the respective confidence intervals of the Chiang LE estimations. These observations emphasize the advantage in effectiveness of the LP estimations over the Chiang estimations.

4 Conclusion

The Chiang method of life expectancy estimation is a popular tool used to obtain a summary index of a population’s health. This indicator is widely used by epidemiologists, ecologists, economists, and policy makers to examine geographic and socio-demographic inequalities in health and to make international comparisons of countries’ living standards. Therefore, the validity of this estimation and its variance is an important topic that has attracted considerable interest from statisticians and other scientists.

Earlier works [8,9,10,11,12] have shown that there are biases in the Chiang life expectancy estimation method when applied to data. Furthermore, Scherbov and Ediev [14] have proved that the Chiang method provides an overestimation of life expectancy. Meanwhile, a bias in the expectation estimation always makes the estimation variance larger. On the other hand, [13,14,15,16] pointed out that the Chiang method of life expectancy estimation is often less accurate for small populations. This is confirmed by the fact that all the above articles point out that the variance of the Chiang life expectancy estimation is very large when the population size is small.

To address the above problems, this article constructs a life expectancy estimation model which is different from the one presented in [17]. In particular, we do not need to use the abridged life table. Furthermore, in comparison with the model proposed in [17], we are able to give formulas for the variance of our life expectancy estimators, which is important in practice. With this result one can prove the asymptotic normality of the life expectancy estimation method proposed here. This is the crucial theoretical basis for determining the confidence intervals of the estimation used in statistical tests.

The experiments presented in Sect. 3, which are based on a real dataset, demonstrate the additional advantages of the life expectancy estimation method presented here over the Chiang method. In particular, the lower variances of the new estimation method confirm the higher accuracy of our methodology on the dataset we used. Additionally, these lower variances allow statistical tests to be more sensitive in detecting differences in life expectancy estimation. These facts further reinforce the effectiveness of the proposed life expectancy estimation.

To answer the question of whether the life expectancy of males differs from the life expectancy of females, we can compare the confidence intervals in Table 3 to those in Table 4. From the columns “LP 95% CI” in the tables, we observe that in almost all years (the only exception being 2000), the LP LE estimation confidence interval of males is disjoint from the LP LE estimation confidence interval of females. That means the differences between the life expectancy of females and males are statistically significant. It is worth mentioning that these differences could not be detected using the Chiang LE estimation method, because almost all pairs of the corresponding confidence intervals intersect.
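The comparison described above amounts to checking, year by year, whether two confidence intervals overlap. A minimal Python sketch of this check follows; the function name is an illustrative assumption, and each interval is assumed to be given as a pair of lower and upper limits, as reported in the “95% CI” columns.

```python
def intervals_disjoint(ci_a, ci_b):
    """Return True if two (lower, upper) confidence intervals do not overlap."""
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    return hi_a < lo_b or hi_b < lo_a
```

Applied to the male and female intervals of a given year, a result of True corresponds to the situation in which the difference in life expectancy is regarded as statistically significant in the discussion above.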

Tables 3 and 4 contain the calculation results based on subsamples of the datasets with a population size of 5000, which is close to the population size of the datasets of many studies in small areas. Therefore, the above analyses can be extrapolated to datasets of other similarly small areas. In other words, the LP life expectancy estimation method can be useful in assessing health inequality in small area settings, particularly for remote areas (e.g., villages).