1 Introduction

The ratio estimator is commonly used to estimate the mean or the total of a variable of interest in many fields (Royall, 1970; Cochran, 1977, pp.150–188; Lu & Yan, 2014, p.1). Examples can be found in medical research (Wang et al., 2011), marine science (Hoenig et al., 1997; Stock et al., 2019), forest research (Bullock et al., 2020; Snowdon, 1992; Zarnoch & Bechtold, 2000), population growth research (Severud et al., 2019), and, last but not least, official statistics (Scheaffer, 2012, p.171).

When data are obtained through survey questionnaires, missing values are prevalent. For example, in official economic statistics, some enterprises do not answer sensitive items such as turnover (sales), while answering non-sensitive items such as the number of employees. Under this circumstance, to compute the mean (or the total) of turnover, missing values are often handled by imputation, where the missing values in turnover are predicted from the observed values of the number of employees. While multiple imputation may be used for variances and covariances, single imputation can yield reasonable estimates of means and totals (Takahashi & Watanabe, 2017, p.24; Little & Rubin, 2020, p.72). Specifically, ratio imputation is often used for missing values in official economic statistics (de Waal et al., 2011, pp.244–245; Takahashi et al., 2017).

However, official economic statistics deal with a variety of enterprises, from small- and medium-sized enterprises to large enterprises. Thus, it is important to consider the effects of outliers on the imputation model when dealing with missing values. This is of great importance, because the presence of outliers biases the parameter of the imputation model, which leads to biases in the imputed data and, in turn, to biased results in statistical analyses. This is also important for data science in general, because data science requires high-quality data. Official statistics is one of the most important sources of data; however, the quality of such data depends on how missing values are handled. Therefore, this research contributes to data science in general by helping official statistics deal with missing values that degrade the quality of data.

Traditionally, the following robust estimators have been suggested: the median (de Waal et al., 2011, p.210), the trimmed mean (de Waal et al., 2011, p.211), the Winsorized mean (Gwet & Rivest, 1992, p.1174; Mulry et al., 2014, pp.724–725), and the mean by \(M\)-estimators (Gwet & Rivest, 1992, p.1175; Mulry et al., 2014, pp.725–727; Wada & Sakashita, 2017, p.3). However, the median, the trimmed mean, and the Winsorized mean are univariate approaches to outliers; thus, they are potentially sensitive to bivariate outliers. We focus on bivariate outliers, but not on higher-dimensional outliers, because the ratio imputation model is intrinsically bivariate (de Waal et al., 2011, pp.244–245). Also, the mean by \(M\)-estimators takes only the residuals into account, which makes it potentially sensitive to outliers (high-leverage points) on the horizontal axis in the sense of the scatter plot, where the predictor is on the horizontal axis and the target variable for imputation is on the vertical axis.

This article proposes a new robust ratio estimator named the TC-ratio estimator, an extension of the ratio estimator with trimming based on Cook’s distance (Cook, 1977). Also, this article applies the TC-ratio estimator to the ratio imputation model to make it robust against outliers on the vertical axis (variable \(y\)), on the horizontal axis (variable \(x\)), and on both axes (\(x\) and \(y\)). Furthermore, this study proposes a novel method of automatically determining the number of outliers, based on the coefficient of determination \({R}^{2}\). Thus, the process can be fully automated, which is meaningful in the practice of official statistics: the schedule for producing estimates in official statistics is usually tight, so a method that can detect and treat influential values in an automated manner is preferable (Mulry et al., 2014, p.722).

Monte Carlo simulations based on 1,600,000 datasets reveal that the robust ratio imputation model by the TC-ratio estimator outperforms the traditional robust ratio imputation models when outliers exist only in \(y\), only in \(x\), or in both \(x\) and \(y\). When outliers are not present, the performance of the robust ratio imputation model by the TC-ratio estimator is approximately equivalent to that of the non-robust ratio imputation method.

2 The ratio estimator

Suppose that the population model is \({y}_{i}=\beta {x}_{i}+{\varepsilon }_{i}\), where \({y}_{i}\) is the target incomplete variable, \({x}_{i}\) is an auxiliary variable (completely observed), and \({\varepsilon }_{i}\sim N\left(0,{\sigma }^{2}{x}_{i}^{2\xi }\right)\), where \(\xi\) is some constant. In other words, the model is regression without an intercept, also known as regression through the origin (Eisenhauer, 2003; de Waal et al., 2011, p. 245), and the error term \({\varepsilon }_{i}\) has the expected value of zero, but the variance is proportional to \({x}_{i}^{2\xi }\); in other words, it is heteroskedastic.

Takahashi et al. (2017) show that weighted least squares (WLS) transforms the heteroskedastic error term \({\varepsilon }_{i}\) into the homoskedastic error term \({\gamma }_{i}={\varepsilon }_{i}/{x}_{i}^{\xi }\), where \({\gamma }_{i}\sim N\left(0,{\sigma }^{2}\right)\). Since \({x}_{i}^{\xi }\) is a function of \({x}_{i}\), not only is the expected value of \({\varepsilon }_{i}/{x}_{i}^{\xi }\) zero, conditional on \({x}_{i}\), but also the variance of \({\varepsilon }_{i}/{x}_{i}^{\xi }\) is constant, conditional on \({x}_{i}\). Therefore, Eq. (1) corrects for heteroskedasticity; see Takahashi et al. (2017) for how Eq. (1) is obtained. Note that, in this article, the sums are taken over \(i=1, 2,\dots , n\), where \(n\) is the sample size, unless otherwise stated. The homoskedastic error term \({\gamma }_{i}\) is shown in Eq. (2).

$${\widehat{\beta }}_{\mathrm{WLS}}=\frac{\sum {x}_{i}^{1-2\xi }{y}_{i}}{\sum {x}_{i}^{2\left(1-\xi \right)}},$$
(1)
$${\gamma }_{i}=\frac{{y}_{i}-{\widehat{\beta }}_{\mathrm{WLS}}{x}_{i}}{{x}_{i}^{\xi }}.$$
(2)

When \(\xi =0.0\), \({\widehat{\beta }}_{\mathrm{WLS}}\) reduces to the ordinary least squares (OLS) estimator \({\widehat{\beta }}_{\mathrm{OLS}}\) in Eq. (3), and the corresponding residual \({e}_{i}\) is Eq. (4).

$${\widehat{\beta }}_{\mathrm{OLS}}=\frac{\sum {x}_{i}^{1-2\times 0.0}{y}_{i}}{\sum {x}_{i}^{2\left(1-0.0\right)}}=\frac{\sum {x}_{i}{y}_{i}}{\sum {x}_{i}^{2}},$$
(3)
$${e}_{i}={y}_{i}-{\widehat{\beta }}_{\mathrm{OLS}}{x}_{i}.$$
(4)

When \(\xi =0.5\), \({\widehat{\beta }}_{\mathrm{WLS}}\) reduces to the ratio-of-means estimator \({\widehat{\beta }}_{\rm ratio}\) in Eq. (5), which is also known as the ratio estimator (Royall, 1970, p.380; Cochran, 1977, p.150), and the corresponding residual \({e}_{r,i}\) for the ratio estimator (\(\xi =0.5\)) is Eq. (6), where subscript \(r\) denotes ratio. Equation (6) will be an important component in robustifying the ratio imputation model.

$${\widehat{\beta }}_{\rm ratio}=\frac{\sum {x}_{i}^{1-2\times 0.5}y_{i}}{\sum {x}_{i}^{2\left(1-0.5\right)}}=\frac{\sum y_{i}}{\sum x_{i}} =\frac{\sum y_{i}/n} {\sum x_{i}/n}=\frac{\overline{y}}{\overline{x}},$$
(5)
$${e}_{r,i}=\frac{{y}_{i}-{\widehat{\beta }}_{\rm ratio}{x}_{i}}{\sqrt{{x}_{i}}}.$$
(6)

Since \({\widehat{\beta }}_{\rm ratio}\) is based on arithmetic means, it is easy to see that \({\widehat{\beta }}_{\rm ratio}\) is sensitive to outliers. This is the problem that the current study seeks to solve.
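For concreteness, the following minimal R sketch (illustrative code, not the exact implementation used in this study) computes \({\widehat{\beta }}_{\rm ratio}\) in Eq. (5) and the residuals \({e}_{r,i}\) in Eq. (6), and shows how a single high-leverage point can drag the estimate away from the true ratio.

```r
# Ratio-of-means estimator (Eq. 5) and its WLS residuals (Eq. 6).
# x and y are numeric vectors of equal length; x must be strictly positive.
ratio_fit <- function(x, y) {
  beta_hat <- mean(y) / mean(x)             # equivalently, sum(y) / sum(x)
  e_r <- (y - beta_hat * x) / sqrt(x)       # residuals on the homoskedastic scale
  list(beta = beta_hat, resid = e_r)
}

# Toy illustration: one large x-outlier pulls the estimate away from beta = 1.5.
set.seed(1)
x <- rgamma(100, shape = 3, scale = 16)
y <- 1.5 * x + rnorm(100, sd = sqrt(x))
ratio_fit(x, y)$beta                        # close to 1.5
ratio_fit(c(x, 500), c(y, 10))$beta         # dragged down by one leverage point
```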

3 Definition of outliers and influential observations

The definition of outliers is vague, because outliers are defined only in relation to the other observations in the data. Suffice it to say that outliers are observations that appear to be different from the rest of the data (Ghosh-Dastidar & Schafer, 2006, p.487; Wooldridge, 2020, p.317). In statistical analyses, the presence of outliers may imply that the model sufficiently describes the majority of observations but fails to describe a small number of them. Under this circumstance, the data may be modeled as a mixture of two types of distributions (Schafer, 1997, p.385).

There are a variety of reasons why outliers exist in data, but one distinction is important: (1) outliers that are incorrect observations (errors) and (2) outliers that are correct but unusual observations (Gwet & Rivest, 1992, p.1174; Bonate, 2011, p.71). If outliers are incorrect observations, then they should be corrected in the editing process before imputing missing values (de Waal, 2013; Di Zio & Guarnera, 2013; Ghosh-Dastidar & Schafer, 2006). The outliers considered in the current study, on the other hand, are correct but unusual observations. If an observation is correct but has an excessive effect on an estimate of a parameter, then the observation is regarded as influential (Mulry et al., 2014, p.721). As is the case with Mulry et al. (2014, p.721), the focus of this study “is on influential values that remain after all the data have been verified or corrected, so these unusual values are true and not the result of reporting or recording errors.” This is important to consider, because if outliers are correct but influential observations, these outliers remain in the data at the imputation stage. Then, \({\widehat{\beta }}_{\rm ratio}\) is influenced by the outliers and becomes biased, which leads to biases in the imputed data, which further leads to biased results in statistical analyses based on the imputed data.

A natural question, then, is what the influential observations are. To discuss this issue, let us first consider unconditional (univariate) outliers and conditional (bivariate) outliers (Fox, 2020, p.40). Suppose that heights are normally distributed with a mean of 170 cm and a standard deviation of 6 cm. If someone’s height is 200 cm, then this is an unconditional (univariate) outlier, because it is five standard deviations above the mean. Also, suppose that weights are normally distributed with a mean of 60 kg and a standard deviation of 10 kg. If the same person’s weight is 110 kg, then this is again an unconditional (univariate) outlier, because it is five standard deviations above the mean. However, this person is unlikely to be a conditional (bivariate) outlier: conditional on the person’s height (200 cm), a weight of 110 kg is a plausible value.

In the context of regression analysis by OLS, Fox (2020, p.41) notes that the combination of high leverage on the horizontal axis and the unusual size of residuals on the vertical axis exerts influence on the regression coefficients. In other words, influence is a function of unusualness with respect to both horizontal and vertical axes in the sense of the scatter plot (McClendon, 1994, p.52; Bonate, 2011, pp.73–74). These influential observations are the kinds of outliers against which the current study proposes a robust ratio estimator. See Sect. 7.3 for concrete examples.

4 Traditional robust ratio estimators

This section briefly surveys the traditional methods of robust ratio estimators, against which the performance of the proposed method will be tested in Sects. 7 and 8.

4.1 Ratio of medians

\({\widehat{\beta }}_{\rm ratio}\) is estimated by the ratio of arithmetic means. It is well known that the arithmetic mean is sensitive to outliers while the median is insensitive to them; naturally, the median is commonly used as an outlier-robust measure of location (de Waal et al., 2011, p.210). Therefore, replacing \({\widehat{\beta }}_{\rm ratio}\) by \({\widehat{\beta }}_{\rm med}\) in Eq. (7) yields the ratio-of-medians estimator, where \(\mathrm{med}\left(\bullet \right)\) denotes the median.

$${\widehat{\beta }}_{\rm med}=\frac{\mathrm{med}\left({y}_{i}\right)}{\mathrm{med}\left({x}_{i}\right)}.$$
(7)

4.2 Ratio of trimmed means

The trimmed mean is also one of the commonly used outlier-robust measures of location (de Waal et al., 2011, p.211). While the median is insensitive to outliers, it is inefficient because it utilizes information from very few observations. The trimmed mean can be regarded as a compromise between the arithmetic mean and the median (DeGroot & Schervish, 2002, p.579).

Let \({y}_{1},{y}_{2},\dots ,{y}_{n}\) be a random sample of size \(n\), which satisfies the following condition: \({y}_{1}<{y}_{2}<\dots <{y}_{n}\). Also, let \(k\) be a positive integer such that \(k<n/2\). Suppose that we delete from the data the \(k\) smallest observations \({y}_{1},{y}_{2},\dots ,{y}_{k}\) and the \(k\) largest observations \({y}_{n-k+1},\dots ,{y}_{n-1},{y}_{n}\). Then, the average of the remaining \(n-2k\) middle observations is the \(k\)-th level trimmed mean, which is Eq. (8) (DeGroot & Schervish, 2002, p.578).

$${\overline{y} }_{\rm trim}=\frac{1}{n-2k}\sum_{i=k+1}^{n-k}{y}_{i}.$$
(8)

Therefore, replacing \({\widehat{\beta }}_{\rm ratio}\) by \({\widehat{\beta }}_{\rm trim}\) in Eq. (9) yields the ratio-of-trimmed-means estimator, where \({\overline{x} }_{\rm trim}\) is defined in a manner similar to \({\overline{y} }_{\rm trim}\).

$${\widehat{\beta }}_{\rm trim}=\frac{{\overline{y} }_{\rm trim}}{{\overline{x} }_{\rm trim}}.$$
(9)

4.3 Ratio of Winsorized means

The Winsorized mean is also one of the commonly used outlier-robust measures of location (de Waal et al., 2011, p.211). In Winsorization, rather than deleting the \(k\) smallest and \(k\) largest values, as done by the \(k\)th level trimmed mean, they are set equal to the smallest or largest value not trimmed (Mair & Wilcox, 2020, p.465). Again, let \({y}_{1},{y}_{2},\dots ,{y}_{n}\) be a random sample of size \(n\), which satisfies the following condition: \({y}_{1}<{y}_{2}<\dots <{y}_{n}\). Also, let \(k\) be a positive integer such that \(k<n/2\). Suppose that we set each of the \(k\) smallest observations \({y}_{1},{y}_{2},\dots ,{y}_{k}\) equal to \({y}_{k}\) and each of the \(k\) largest observations \({y}_{n-k+1},\dots ,{y}_{n-1},{y}_{n}\) equal to \({y}_{n-k+1}\). Then, the Winsorized mean is Eq. (10).

$${\overline{y} }_{\rm winsor}=\frac{1}{n}\left(k{y}_{k}+\sum_{i=k+1}^{n-k}{y}_{i}+k{y}_{n-k+1}\right).$$
(10)

Therefore, replacing \({\widehat{\beta }}_{\rm ratio}\) by \({\widehat{\beta }}_{\rm winsor}\) in Eq. (11) yields the ratio-of-Winsorized-means estimator, where \({\overline{x} }_{\rm winsor}\) is defined in a manner similar to \({\overline{y} }_{\rm winsor}\).

$${\widehat{\beta }}_{\rm winsor}=\frac{{\overline{y} }_{\rm winsor}}{{\overline{x} }_{\rm winsor}}.$$
(11)
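For reference, the three classical robust ratios in Eqs. (7), (9), and (11) can be computed in a few lines of R; this is a minimal sketch in which the trimming level \(k\) is taken as given.

```r
# Ratio of medians (Eq. 7), of trimmed means (Eq. 9), and of Winsorized
# means (Eq. 11); k is the trimming/Winsorization level, with k < n/2.
ratio_med <- function(x, y) median(y) / median(x)

ratio_trim <- function(x, y, k) {
  trim_mean <- function(v) mean(sort(v)[(k + 1):(length(v) - k)])   # Eq. (8)
  trim_mean(y) / trim_mean(x)
}

ratio_wins <- function(x, y, k) {
  wins_mean <- function(v) {                                        # Eq. (10)
    v <- sort(v)
    n <- length(v)
    (k * v[k] + sum(v[(k + 1):(n - k)]) + k * v[n - k + 1]) / n
  }
  wins_mean(y) / wins_mean(x)
}
```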

4.4 Ratio of means by \(M\)-estimators

Rather than deleting a fixed amount of data or setting it to one value, \(M\)-estimators provide a more flexible method to deal with outliers, where \(M\) stands for maximum likelihood type, because \(M\)-estimators are found by maximizing a function that might not be the likelihood (DeGroot & Schervish, 2002, pp. 579–581; Mair & Wilcox, 2020, pp. 465–466).

In the context of regression analysis based on OLS, the sum of squared errors is a weighted sum of the errors, where each error is weighted by its own value. The idea behind \(M\)-estimators is to replace these weights by weights that do not keep growing in magnitude as the errors grow. Essentially, robust methods such as \(M\)-estimators give less weight to observations with larger residuals (Kennedy, 2003, p. 375; Wooldridge, 2020, p. 323).

To find \(M\)-estimators, we often need to use the method of iteratively reweighted least squares (IRLS) (Mulry et al., 2014, p. 727). In the current study, based on Wada and Sakashita (2017, p. 3), whose method was applied to the 2016 Japanese Economic Census, we replace \({\widehat{\beta }}_{\rm ratio}\) by \({\widehat{\beta }}_{\rm IRLS}\) in Eq. (12), where \({w}_{i}\) is Tukey’s biweight function defined in Eq. (13), \({e}_{r,i}\) is in Eq. (6), and \(\psi\) is an arbitrary constant ranging from 4 (more robust) to 8 (less robust). For the choice of \(\psi\), see Wada and Tsubaki (2020, p.3). For more information on a robust ratio estimator by \(M\)-estimators (IRLS), also see Gwet and Rivest (1992), Pannekoek (2018), Wada (2020), and Wada et al. (2021).

$${\widehat{\beta }}_{\rm IRLS}=\frac{\sum {w}_{i}{y}_{i}}{\sum {w}_{i}{x}_{i}},$$
(12)
$${w}_{i}=\left\{\begin{array}{ll}{\left[1-{\left(\frac{{e}_{r,i}}{\psi }\right)}^{2}\right]}^{2} & \mathrm{if}\ \left|{e}_{r,i}\right|\le \psi \\ 0 & \mathrm{if}\ \left|{e}_{r,i}\right|>\psi \end{array}\right..$$
(13)
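The following is a minimal IRLS sketch of Eqs. (12)–(13). The starting value, the convergence tolerance, and the iteration cap are illustrative assumptions here, not necessarily the settings used in the census implementation by Wada and Sakashita (2017).

```r
# Robust ratio estimation by M-estimators via IRLS (Eqs. 12-13),
# using Tukey's biweight; psi = 4 is more robust, psi = 8 less robust.
ratio_irls <- function(x, y, psi = 8, tol = 1e-8, max_iter = 100) {
  beta <- mean(y) / mean(x)                 # start from the plain ratio estimator
  for (iter in seq_len(max_iter)) {
    e_r <- (y - beta * x) / sqrt(x)         # residuals of the ratio model (Eq. 6)
    w <- ifelse(abs(e_r) <= psi, (1 - (e_r / psi)^2)^2, 0)   # biweight (Eq. 13)
    beta_new <- sum(w * y) / sum(w * x)     # weighted ratio (Eq. 12)
    if (abs(beta_new - beta) < tol) break
    beta <- beta_new
  }
  beta
}
```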

5 Cook’s distance for ordinary least squares (OLS)

This section briefly reviews the mechanism of Cook’s distance for OLS. If we want to know whether an observation is influential, an obvious way is to delete observations one at a time and to recalculate how the parameter estimates change. This would require refitting the model once per observation; however, there is a method that can directly assess the influence of the \(i\)th observation with no refitting (Bonate, 2011, p.75). This method is Cook’s distance (Cook, 1977), which is originally a composite score that evaluates an observation’s influence on a set of regression parameters in the context of OLS (McClendon, 1994, p.107; Bonate, 2011, p.76).

Specifically, Cook’s distance \({C}_{i}\) is shown in Eq. (14) (Cook, 1977, p.16; Fox, 2020, p. 49), where \(p\) is the number of parameters in the model, and \({e}_{i}^{{\prime}}\) is the studentized residual in Eq. (15), which deals with outliers on the vertical axis. Here, \({e}_{i}\) is the OLS residual in Eq. (4), and \(s\) is the standard error of the regression (Wooldridge, 2020, pp. 49–50) defined in Eq. (16); depending on the field, \(s\) is also called an estimate of the error standard deviation. Also, \({h}_{i}\) is the hat value in Eq. (17), which deals with the leverage on the horizontal axis (Fox, 2020, p.45).

$${C}_{i}=\frac{{e}_{i}^{{{\prime}}2}}{p}\times \frac{{h}_{i}}{1-{h}_{i}},$$
(14)
$${e}_{i}^{{\prime}}=\frac{{e}_{i}}{s\sqrt{1-{h}_{i}}},$$
(15)
$$s=\sqrt{\frac{\sum {e}_{i}^{2}}{n-p}},$$
(16)
$${h}_{i}=\frac{1}{n}+\frac{{\left({x}_{i}-\overline{x }\right)}^{2}}{{\sum }_{j=1}^{n}{\left({x}_{j}-\overline{x }\right)}^{2}}.$$
(17)

Therefore, if an observation has a large value of Cook’s distance, this means that the observation is influential in the OLS regression model, in terms of the vertical axis (measured by \({e}_{i}^{{{\prime}}2}\)), the horizontal axis (measured by \({h}_{i}\)), or the combination of both.
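For the textbook case of simple regression with an intercept (\(p=2\)), Eqs. (14)–(17) can be computed from first principles and checked against R’s built-in cooks.distance(); the sketch below is for illustration only.

```r
# Cook's distance for simple OLS regression (Eqs. 14-17).
cooks_ols <- function(x, y) {
  n <- length(x)
  p <- 2                                    # intercept and slope
  e <- residuals(lm(y ~ x))                 # OLS residuals
  h <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)   # hat values (Eq. 17)
  s <- sqrt(sum(e^2) / (n - p))             # standard error of regression (Eq. 16)
  e_stud <- e / (s * sqrt(1 - h))           # studentized residuals (Eq. 15)
  (e_stud^2 / p) * (h / (1 - h))            # Cook's distance (Eq. 14)
}

# Sanity check against the built-in implementation:
# all.equal(cooks_ols(x, y), unname(cooks.distance(lm(y ~ x))))
```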

6 Algorithm of the TC-ratio estimator

6.1 Extending Cook’s distance to the ratio estimator

This section presents how Cook’s distance can be extended to the ratio estimator as the TC-ratio estimator. First, we estimate \({\widehat{\beta }}_{\rm ratio}\) in Eq. (5) as if there were no outliers. Second, we calculate \({e}_{r,i}\), the residual of the ratio estimator in Eq. (6). As we saw in Sect. 2, \({\widehat{\beta }}_{\rm ratio}\) is a weighted least squares estimate, where the weight is \(1/\sqrt{{x}_{i}}\). Therefore, Eq. (6) differs from the OLS residual in Eq. (4), because we need to take the weight into account; Eq. (6) is the key component in extending Cook’s distance to the ratio estimator. Third, we calculate the studentized residual \({e}_{r,i}^{{\prime}}\) in Eq. (18), where \({s}_{r}\) is the standard error of the regression (Wooldridge, 2020, pp.49–50) for the ratio model in Eq. (19), with \(p=1\) because there is only one parameter in the ratio model. Again, depending on the field, \({s}_{r}\) is also called an estimate of the error standard deviation. Note that, as long as \(p\) is the number of parameters in the model, the formula for the standard error of the regression is the same with or without an intercept (Eisenhauer, 2003, p.78). The key difference between \(s\) in Eq. (16) and \({s}_{r}\) in Eq. (19) is the residuals: \({e}_{i}\) in Eq. (4) comes from the OLS regression model, while \({e}_{r,i}\) in Eq. (6) comes from the ratio model.

$${e}_{r,i}^{{\prime}}=\frac{{e}_{r,i}}{{s}_{r}\sqrt{1-{h}_{i}}},$$
(18)
$${s}_{r}=\sqrt{\frac{\sum {e}_{r,i}^{2}}{n-p}}.$$
(19)

Fourth, we calculate Cook’s distance \({C}_{r,i}\) in Eq. (20), where \(p=1\), \({e}_{r,i}^{{\prime}}\) is the studentized residual in Eq. (18), and \({h}_{i}\) is exactly the same as Eq. (17). Therefore, as was the case with Cook’s distance for OLS, if an observation has a large value of \({C}_{r,i}\), this means that the observation is influential in the ratio model, in terms of the vertical axis (measured by \({e}_{r,i}^{{{\prime}}2}\)), the horizontal axis (measured by \({h}_{i}\)), or the combination of both.

$${C}_{r,i}=\frac{{e}_{r,i}^{{{\prime}}2}}{p}\times \frac{{h}_{i}}{1-{h}_{i}}.$$
(20)

Based on the values of \({C}_{r,i}\), we trim the identified outliers, where outliers are defined as large values of \({C}_{r,i}\). Let \({D}_{i}=\left({x}_{i},{y}_{i}\right)\) be a random sample of size \(n\) (\(i=1, 2,\dots ,n\)). Also, let \(\lambda\) be a positive constant (the cutoff) and \(k\) a positive integer. When \({C}_{r,j}>\lambda\), trim \({D}_{j}\), and when \({C}_{r,j}\le \lambda\), do not trim \({D}_{j}\), where \(j\) denotes the \(j\)-th observation. Suppose that we trim \(k\) observations from the data. This means that we have \({D}_{tcr,i}=\left({x}_{tcr,i},{y}_{tcr,i}\right)\), where \(i=1, 2,\dots ,n-k\) and the subscript \(tcr\) stands for TC-ratio. Then, the average of the remaining \(n-k\) observations is the \(k\)th level trimmed mean based on the robust TC-ratio estimator. Thus, \({\widehat{\beta }}_{tcr}\) is given in Eq. (21), \({\overline{y} }_{tcr}\) in Eq. (22), and \({\overline{x} }_{tcr}\) in Eq. (23).

$${\widehat{\beta }}_{tcr}=\frac{{\overline{y} }_{tcr}}{{\overline{x} }_{tcr}},$$
(21)
$${\overline{y} }_{tcr}=\frac{1}{n-k}\sum_{i=1}^{n-k}{y}_{tcr,i},$$
(22)
$${\overline{x} }_{tcr}=\frac{1}{n-k}\sum_{i=1}^{n-k}{x}_{tcr,i}.$$
(23)

Finally, we compute the imputed values based on \({\widehat{y}}_{i}={\widehat{\beta }}_{tcr}{x}_{i}\), which is the robust ratio imputation model based on the TC-ratio estimator. This estimator is expected to work better than the traditional approaches, such as the ratio of medians, trimmed means, and Winsorized means, because the proposed estimator can detect both univariate (unconditional) and bivariate (conditional) outliers, while the traditional approaches (ratio of medians, trimmed means, and Winsorized means) can only detect univariate (unconditional) outliers.
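Putting the steps together, a minimal R sketch of the TC-ratio estimator for a given cutoff \(\lambda\) might look as follows; the automatic choice of \(\lambda\) is deferred to Sect. 6.2.

```r
# TC-ratio estimator: compute the modified Cook's distance (Eq. 20), trim the
# observations with C_{r,i} > lambda, and re-estimate the ratio (Eqs. 21-23).
tc_ratio <- function(x, y, lambda) {
  n <- length(x)
  p <- 1                                    # one parameter in the ratio model
  beta <- mean(y) / mean(x)                 # Eq. (5), as if there were no outliers
  e_r <- (y - beta * x) / sqrt(x)           # Eq. (6)
  h <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)   # Eq. (17)
  s_r <- sqrt(sum(e_r^2) / (n - p))         # Eq. (19)
  e_stud <- e_r / (s_r * sqrt(1 - h))       # Eq. (18)
  C_r <- (e_stud^2 / p) * (h / (1 - h))     # Eq. (20)
  keep <- C_r <= lambda                     # trim D_j whenever C_{r,j} > lambda
  list(beta = mean(y[keep]) / mean(x[keep]), C_r = C_r)  # Eqs. (21)-(23)
}
```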

6.2 Automatic method to determine the number of outliers

In the previous section, \(\lambda\) is defined as a positive constant that is used as a cutoff to determine whether an observation is an outlier or not. In general, there is “no clear guidance on the percentage of trimming to be done” (Young & Mathew, 2015, p.78). Therefore, the choice of \(\lambda\) is often arbitrary.

Traditionally, a cutoff for Cook’s distance is proposed as \({\lambda }_{\rm cook}=4/\left(n-p\right)\), where \(n\) is the number of observations and \(p\) is the number of parameters (Fox, 2020, p.51). However, this cutoff is so simple that it does not take into account the characteristics of the data at hand, because it is only a function of the number of observations and parameters. The current study proposes a novel method of automatically determining a cutoff based on the coefficient of determination \({R}^{2}\), by exploiting the fact that deleting an outlier is likely to increase \({R}^{2}\), because the model will then fit the remaining data better. Also, this study suggests a scree-like plot to graphically assess where the cutoff can be found.

For illustration purposes, we will use the small dataset in Table 1, where the last two observations (ID = 41 and 42) are added to the data as outliers. Based on the values of \({C}_{r,i}\), the data are sorted in increasing order.

Table 1 Example dataset to illustrate how a cutoff can be automatically found

Since we have 42 observations and there is only one parameter, \({\lambda }_{\rm cook}=4/\left(n-p\right)=4/\left(42-1\right)=0.098\). In this case, we detect IDs 38, 39, 40, 41, and 42 as outliers. Alternatively, since \({C}_{r,i}\) is univariate, we may try the common measure of univariate outliers, i.e., \(\mathrm{UL}= {Q}_{3}+1.5\times \mathrm{IQR}\), where \(\mathrm{UL}\) is the upper limit, \({Q}_{3}\) is the third quartile, and IQR is the inter-quartile range (Weiss, 2005, p.122). \({Q}_{3}\) of \({C}_{r,i}\) is 0.013 and the IQR of \({C}_{r,i}\) is 0.012; thus, \(\mathrm{UL}= 0.013+1.5\times 0.012=0.031\), and we detect IDs 36, 37, 38, 39, 40, 41, and 42 as outliers. Either way, we detect too many observations as outliers. These simple methods do not work, because they do not take the characteristics of the data into account.

If we calculate \({R}^{2}\) in \({\widehat{y}}_{i}=\widehat{\beta }{x}_{i}\) among the 42 observations in Table 1, \({R}^{2}=0.518\). Note that we are not interested in interpreting the model fit per se, but we are interested in how the model fit changes when we delete an observation with large \({C}_{r,i}\). Let \({R}_{k}^{2}\) be the coefficient of determination when we trim the \(k\) largest observations, where “largest” refers to the size of \({C}_{r,i}\). Since \({C}_{r,42}=0.529\) is the largest value of Cook’s distance, if we trim observation 42 from the data, \({R}_{1}^{2}=0.605\), which is larger than \({R}^{2}=0.518\) by 0.086. Since \({C}_{r,41}=0.527\) is the second largest, if we trim observation 41 from the data, \({R}_{2}^{2}=0.798\), which is larger than \({R}_{1}^{2}=0.605\) by 0.194. Since \({C}_{r,40}=0.239\) is the third largest, if we trim observation 40 from the data, \({R}_{3}^{2}=0.809\), which is larger than \({R}_{2}^{2}=0.798\) by 0.011. Since \({C}_{r,39}=0.143\) is the fourth largest, if we trim observation 39 from the data, \({R}_{4}^{2}=0.815\), which is larger than \({R}_{3}^{2}=0.809\) by 0.006. We can continue this process until the last two observations are left. See \({R}_{k}^{2}\) in Table 1.

Naturally, \({R}_{k}^{2}\) tends to go up, as we trim more and more outliers. However, the speed of growth in \({R}_{k}^{2}\) decreases as we trim outliers. This can be used as a method of determining where we should stop trimming outliers. Notice that the increase was 0.086, 0.194, 0.011, and 0.006, which means that, after trimming the two largest outliers, the speed of growth in \({R}_{k}^{2}\) dramatically decreased. See \({R}_{k}^{2}-{R}_{k+1}^{2}\) in Table 1, which is the difference between the two adjacent \({R}_{k}^{2}\).

Graphically, the left-hand panel in Fig. 1 plots \(1/{R}_{k}^{2}\) based on the 42 observations in Table 1 against the number of trimmed observations \(k\). Figure 1 is analogous to the scree plot in principal component analysis (Bartholomew et al., 2002, pp.124–125). In the left-hand panel of Fig. 1, there is an elbow at two trimmed observations. This means that trimming further observations yields similar values of \({R}_{k}^{2}\), which further means that the remaining observations each explain a similar proportion of the total variance of \({y}_{i}\). Therefore, graphically, we can decide that there are two outliers in the data. On the other hand, if Table 1 did not have IDs 41 and 42 in the first place, and if we calculated \({C}_{r,i}\) and \({R}_{k}^{2}\) based on the first 40 observations, the scree-like plot would be the right-hand panel in Fig. 1, which shows no elbows, meaning that there are no outliers in the data.

Fig. 1 Examples of the scree-like plot to detect the number of potential outliers for Table 1

Next, we calculate the vertical distance from one dot to the next in Fig. 1, so that we can numerically and automatically decide the number of outliers. This is done by calculating \({R}_{k}^{2}-{R}_{k+1}^{2}\). When these vertical distances, \({R}_{k}^{2}-{R}_{k+1}^{2}\), are close enough to zero, we have trimmed enough outliers. Most of these values are close to zero: the mean is \(-\)0.012, the median is \(-\)0.005, the first quartile (\({Q}_{1}\)) is \(-\)0.010, and the IQR is 0.008. Since \({R}_{k}^{2}-{R}_{k+1}^{2}\) is univariate, we can simply use the common measure of univariate outliers, i.e., \(\mathrm{LL}= {Q}_{1}-1.5\times \mathrm{IQR}\), where \(\mathrm{LL}\) is the lower limit and \({Q}_{1}\) is the first quartile (Weiss, 2005, p.122). Therefore, \(\mathrm{LL}= -0.010-1.5\times 0.008=-0.021\). Since \({R}_{0}^{2}-{R}_{1}^{2}=-0.086\) and \({R}_{1}^{2}-{R}_{2}^{2}=-0.194\) are smaller than \(\mathrm{LL}=-0.021\), we can numerically and automatically decide that the outliers are the observations up to the second largest \({C}_{r,i}\). On the other hand, if Table 1 did not have IDs 41 and 42, and if we calculated \({C}_{r,i}\) and \({R}_{k}^{2}\) based on the first 40 observations, \(\mathrm{LL}=-0.018\). None of \({R}_{k}^{2}-{R}_{k+1}^{2}\) would be smaller than \(-0.018\); thus, we would numerically and automatically conclude that there are no outliers in the data.

Additionally, in the actual implementation, a moving average of order 3 is used to smooth out haphazard idiosyncrasies (a large jump from \(k\) to \(k+1\)).

Therefore, this subsection has demonstrated that we can determine the number of outliers based on the speed of change in \({R}_{k}^{2}\). This mechanism allows the outlier detection step of the TC-ratio estimator to be fully automated, because no step involves a human decision.
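A sketch of this automatic rule, reusing the tc_ratio() helper from Sect. 6.1, is shown below. Two caveats: the \({R}^{2}\) of the no-intercept fit is computed in the conventional way (one minus the residual sum of squares over the total sum of squares about the mean), and the order-3 moving average mentioned above is omitted for brevity, so the actual implementation may differ in these details.

```r
# Automatic determination of the number of outliers (Sect. 6.2):
# trim observations one at a time in decreasing order of C_{r,i}, track R_k^2,
# and flag as outliers the leading run of drops R_k^2 - R_{k+1}^2 below LL.
r2_ratio <- function(x, y) {               # R^2 of the fit y_hat = beta_hat * x
  beta <- mean(y) / mean(x)
  1 - sum((y - beta * x)^2) / sum((y - mean(y))^2)
}

count_outliers <- function(x, y, k_max = 20) {   # k_max: illustrative default
  C_r <- tc_ratio(x, y, lambda = Inf)$C_r  # modified Cook's distance (Eq. 20)
  ord <- order(C_r, decreasing = TRUE)     # trim the largest C_{r,i} first
  r2 <- sapply(0:k_max, function(k) {
    keep <- setdiff(seq_along(x), ord[seq_len(k)])
    r2_ratio(x[keep], y[keep])
  })
  d <- head(r2, -1) - tail(r2, -1)         # R_k^2 - R_{k+1}^2, k = 0, 1, ...
  LL <- unname(quantile(d, 0.25)) - 1.5 * IQR(d)
  sum(cumprod(d < LL))                     # length of the leading run below LL
}
```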

7 Monte Carlo simulation: settings

Monte Carlo simulation is useful especially when assumptions of a model are violated, but there are no easy analytical solutions available (Mooney, 1997, p.1). Analyses in this study are carried out using R version 4.0.2. In this simulation study, the sample size \(n\) is set to 1000, and the number of simulation runs is set to 10,000. Since the means and totals are considered the most important products in official statistics (de Waal et al., 2011, p.245), the parameter of interest in the simulations is set to the mean of a target variable, \(\overline{y }\).

7.1 Settings of population data

The Monte Carlo simulations are carried out using five different artificially generated populations of values \(\left({x}_{i},{y}_{i}\right)\), whose values are generated by a gamma distribution, a normal distribution, or a uniform distribution.

A random variable \(X\) follows a gamma distribution with parameters \(\phi >0\) and \(\omega >0\) if its density function for \(x>0\) is given by Eq. (24), where \(\Gamma \left(\phi \right)\) is the gamma function defined in Eq. (25). Also, from Eqs. (26) and (27), the mean is \(\phi \omega\) and the variance is \(\phi {\omega }^{2}\) (DeGroot & Schervish, 2002, p.297; Ross, 2006, pp.237–239). A gamma distribution is one of the commonly used population settings for ratio imputation (Lee et al., 1994, p.236; Rao & Sitter, 1995, p.455; Sitter & Rao, 1997, p.69; Haziza & Vallée, 2020).

$$f\left(X=x\right)=\frac{1}{\Gamma \left(\phi \right){\omega }^{\phi }}{x}^{\phi -1}\mathrm{exp}\left(\frac{-x}{\omega }\right),$$
(24)
$$\Gamma \left(\phi \right)={\int }_{0}^{\infty }{x}^{\phi -1}\mathrm{exp}\left(-x\right)dx,$$
(25)
$$E\left(X\right)={\int }_{0}^{\infty }\frac{\phi \omega }{\Gamma \left(\phi +1\right)}{\left(\frac{x}{\omega }\right)}^{\phi }\mathrm{exp}\left(\frac{-x}{\omega }\right)\frac{1}{\omega }dx=\phi \omega ,$$
(26)
$$\mathrm{var}\left(X\right)=\phi {\omega }^{2}\left(\phi +1\right)-{\phi }^{2}{\omega }^{2}=\phi {\omega }^{2}.$$
(27)

Specifically, a set of 1,000 \(x\)-values is generated by a gamma distribution with mean \(\phi \omega =48\) and variance \(\phi {\omega }^{2}=768\). Then, for each fixed value of \(x\), the corresponding value of \(y\) is generated by a gamma distribution with mean \({\mu }_{y}=bx\) and variance \({\sigma }_{y}^{2}={d}^{2}{x}^{2g}\), where the values of \(b\), \(d\), and \(g\) are shown in Table 2. Also, \(\rho\) is the correlation between \(x\) and \(y\), and \({\mu }_{y}\) is the true population value of \(\overline{y }\). This follows the population settings used in Lee et al. (1994, p.236). Also, the online appendix A reports additional simulation runs based on a gamma distribution with mean \(\phi \omega =24\) and variance \(\phi {\omega }^{2}=768\), where the expected value of \(X\) is set to half.

Table 2 Characteristics of the three populations (gamma distribution)

Lee et al. (1994, p.236) show that \(\phi\) and \(\omega\) can be defined as in Eqs. (28) and (29).

$$\phi =\frac{{\left(bx\right)}^{2}}{{d}^{2}{x}^{2g}},$$
(28)
$$\omega =\frac{{d}^{2}{x}^{2g}}{bx}.$$
(29)

Therefore, the relation between \({x}_{i}\) and \({y}_{i}\) can be adequately captured by the ratio estimator model \({y}_{i}=\beta {x}_{i}+{\varepsilon }_{i}\), where \(\beta =1.5\) (\(b=1.5\) in Table 2) and \({\varepsilon }_{i}\sim N\left(0, {\sigma }^{2}\sqrt{{x}_{i}}\right)\). Also, the online appendix B reports additional simulation runs based on \(\beta =3.0\), where the true ratio is set to double.
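One population draw under these settings can be generated as follows; the parameter values are those of population 1 in Table 2 (\(b=1.5\), \(d=1.84\), \(g=0.75\)), and the shape and scale come from Eqs. (28) and (29).

```r
# Generate one population draw following Lee et al. (1994): x ~ gamma with
# mean phi * omega = 48 and variance phi * omega^2 = 768, then y | x ~ gamma
# with mean b * x and variance d^2 * x^(2g).
gen_population <- function(n = 1000, b = 1.5, d = 1.84, g = 0.75) {
  x_shape <- 48^2 / 768                     # phi = mean^2 / variance = 3
  x_scale <- 768 / 48                       # omega = variance / mean = 16
  x <- rgamma(n, shape = x_shape, scale = x_scale)
  phi <- (b * x)^2 / (d^2 * x^(2 * g))      # Eq. (28)
  omega <- (d^2 * x^(2 * g)) / (b * x)      # Eq. (29)
  y <- rgamma(n, shape = phi, scale = omega)
  data.frame(x = x, y = y)
}
```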

Sections 8.1 and 8.2 display the results for population 1. The results for populations 2 and 3 can be found in the Appendix (Sects. 11.1 and 11.2). Furthermore, in discussing the ratio estimator, some authors (Zou et al., 2010, p.871; Wada & Sakashita, 2017, p.3) assume that \(x\)-values are generated by a uniform distribution, and some authors (Zou et al., 2010, p.871; Lui, 2020, p.140) assume that \(x\)-values are generated by a normal distribution. Therefore, to make the simulations more general (free of distributional assumptions), the Appendix has extra results for population 4 (uniform distributions) and population 5 (normal distributions).

Under population 4, a set of 1,000 \(x\)-values is generated by a uniform distribution \(U\left(0.1, 2.1\right)\). Under population 5, a set of 1,000 \(x\)-values is generated by a normal distribution \(N\left(20, 16\right)\). Since \(x\)-values must be positive for the ratio estimator, if an \(x\)-value is generated as negative, it is replaced by the minimum value among the positive \(x\)-values. In both populations 4 and 5, \({y}_{i}=3.9{x}_{i}+\sqrt{{x}_{i}}{\varepsilon }_{i}\), where \({\varepsilon }_{i}\sim N\left(0, 1\right)\). All of these settings for populations 4 and 5 follow the simulation studies by Zou et al. (2010, p.871), slightly changing \({x}_{i}{\varepsilon }_{i}\) to \(\sqrt{{x}_{i}}{\varepsilon }_{i}\), because their simulations assume the population for the mean of ratios, not the ratio of means, which the current study assumes.

7.2 Settings of missing data

Let \({y}_{i}\) be the target incomplete variable for imputation, \({x}_{i}\) be completely observed in all of the situations to be used as the auxiliary variable, and \({u}_{1,i}\) and \({u}_{2,i}\) be two continuous uniform random variables ranging from 0 to 1 for the missingness mechanism. This means that missing occurs in \({y}_{i}\), the numerator in the ratio, \({\widehat{\beta }}_{\rm ratio}=\overline{y }/\overline{x }\).

Each of the artificially generated datasets is made incomplete using the following two types of missing data generation processes based on missing at random (MAR), where the missingness of \({y}_{i}\) depends on the values of \({x}_{i}\), \({u}_{1,i}\), and \({u}_{2,i}\), i.e., the conditional probability of missing data after controlling for observed data is the same as the probability of observed data (Allison, 2002, p.4; Enders, 2010, p.11; Little & Rubin, 2020, p.14). The average missing rate is set to 30%. It is reported that the family incomes and personal earnings in the National Health Interview Survey (1997–2004) have approximately 30% missingness (Schenker et al., 2006, p.925); therefore, 30% is a realistic missing rate. Note that, while any specific real survey may have a different missing rate, the specific setting should not be much of a concern: the missing rate is 30% on average across the 10,000 simulation runs, with some runs below 30% and others above it, so this setting covers a reasonable range of missing rates.

In the first type of missing data generation process under MAR, \({y}_{i}\) is missing if \({x}_{i}<\mathrm{med}({x}_{i})\) and \({u}_{1,i}<0.5\), and \({y}_{i}\) is missing if \({x}_{i}>\mathrm{med}({x}_{i})\) and \({u}_{2,i}<0.1\), where \(\mathrm{med}(\bullet )\) denotes the median. For example, suppose that \({y}_{i}\) is turnover (sales) and \({x}_{i}\) is the number of employees. The assumption in this setting is that more values are missing among small- and medium-sized enterprises than among large enterprises, because the missing values of turnover for large enterprises are collected through recontacts in official statistics. Therefore, \(\overline{y }\) based on missing data overestimates the true value of \(\overline{y }\). Let us call this MAR1.

In the second type of missing data generation process under MAR, \({y}_{i}\) is missing if \({x}_{i}<\mathrm{med}({x}_{i})\) and \({u}_{1,i}<0.1\), and \({y}_{i}\) is missing if \({x}_{i}>\mathrm{med}({x}_{i})\) and \({u}_{2,i}<0.5\). Again, for example, suppose that \({y}_{i}\) is turnover and \({x}_{i}\) is the number of employees. The assumption in this setting is that large enterprises are more likely to refuse to answer turnover than small- and medium-sized enterprises, possibly because of tax-related concerns. Therefore, \(\overline{y }\) based on missing data underestimates the true value of \(\overline{y }\). Let us call this MAR2.

Both scenarios sound intuitively plausible, and we do not know, a priori, which scenario is more realistic in a given survey of official statistics. Therefore, we use both types of missing data scenarios. These two types of missing data can be understood as MAR via censoring.
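A sketch of the two mechanisms is given below; with continuous \(x\), each half of the sample contributes half the observations, so the expected missing rate is \(0.5\times 0.5+0.5\times 0.1=0.3\) under both MAR1 and MAR2.

```r
# MAR1: y is more likely missing for small x (below med(x));
# MAR2: y is more likely missing for large x. Average missing rate: 30%.
make_missing <- function(x, y, type = c("MAR1", "MAR2")) {
  type <- match.arg(type)
  u1 <- runif(length(x))
  u2 <- runif(length(x))
  below <- x < median(x)
  miss <- if (type == "MAR1") {
    (below & u1 < 0.5) | (!below & u2 < 0.1)
  } else {
    (below & u1 < 0.1) | (!below & u2 < 0.5)
  }
  replace(y, miss, NA)                      # missing y-values become NA
}
```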

Under the assumption of missing completely at random (MCAR), the probability of missing data does not depend on the data, and the observed data are a simple random sub-sample of the complete data (Allison, 2002, p.3; Enders, 2010, p.7; Little & Rubin, 2020, p.13). Since MAR is a “less restrictive assumption than MCAR” (Little & Rubin, 2020, p.14), in reality, it is safer to assume MAR rather than MCAR, which takes us back to the case of MAR. Therefore, the current study does not consider the assumption of MCAR.

Under the assumption of not missing at random (NMAR, also known as missing not at random: MNAR), the missingness of \({y}_{i}\) depends on the values of \({y}_{i}\), \({u}_{1,i}\), and \({u}_{2,i}\), even after controlling for \({x}_{i}\), i.e., the conditional probability of missing data after controlling for observed data is not the same as the probability of observed data (Allison, 2002, p.5; Enders, 2010, p.11; Little & Rubin, 2020, p.14). Graham (2009, p.567) states that all missing data are a continuum between pure MAR and pure NMAR. The current study focuses on the case of pure MAR, because the current study is concerned with the influence of outliers on the imputation model under the situation where the imputation model can eliminate the bias due to missing data. In the case of pure NMAR, the literature recommends the use of the selection model and the pattern mixture model (Allison, 2002, pp.77–84; Enders, 2010, pp.290–301; Little & Rubin, 2020, pp.351–355). How the TC-ratio estimator can be extended by way of the selection model or the pattern mixture model is left for future research. Nevertheless, Scheuren (2005, p.317) contends that, in official statistics, about 10–20% are MCAR, about 50% are MAR, and the rest is NMAR. Thus, the assumption of MAR may cover the majority (up to 70%) of the situations that we may encounter in official statistics.

7.3 Settings of outliers

Figure 2 in the current study graphically shows the patterns of 5% outlier settings, where white circles represent usual observations and red triangles represent outliers generated by our outlier model, which is described below. Outliers in \({y}_{i}\) follow \(U\left(0.7\mathrm{max}\left[{y}_{i}\right],\mathrm{max}\left[{y}_{i}\right]\right)\), where the associated values of \({x}_{i}\) are less than \(\mathrm{med}\left({x}_{i}\right)\). Outliers in \({x}_{i}\) follow \(U\left(0.7\mathrm{max}\left[{x}_{i}\right],\mathrm{max}\left[{x}_{i}\right]\right)\), where the associated values of \({y}_{i}\) are less than \(\mathrm{med}\left({y}_{i}\right)\).
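A sketch of this outlier model is given below. How the coordinate paired with each outlier is drawn is not fully pinned down above, so resampling from the existing values below the respective median is an assumption made here for illustration.

```r
# Append n_y outliers in y (large y paired with x < med(x)) and n_x outliers
# in x (large x paired with y < med(y)), following the model in Sect. 7.3.
add_outliers <- function(data, n_y, n_x) {
  x <- data$x
  y <- data$y
  out_y <- data.frame(x = sample(x[x < median(x)], n_y, replace = TRUE),
                      y = runif(n_y, 0.7 * max(y), max(y)))
  out_x <- data.frame(x = runif(n_x, 0.7 * max(x), max(x)),
                      y = sample(y[y < median(y)], n_x, replace = TRUE))
  rbind(data, out_y, out_x)
}
```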

Fig. 2 Examples of four outlier patterns in the simulations (population 1). White circles represent usual observations, red triangles represent outliers, the vertical line is med(\(x\)), and the horizontal line is med(\(y\))

Furthermore, the cases where outliers exist in both \(x\) and \(y\) can be divided into three patterns: an equal percentage (50:50) in Fig. 2, fewer outliers in \(x\) than in \(y\) (25:75), and more outliers in \(x\) than in \(y\) (75:25) in Fig. 3. In official statistics, ratio imputation is applied to different subpopulations, which is known as group ratio imputation (de Waal et al., 2011, p.245). Some subpopulations may have outliers on the vertical axis, while other subpopulations may have outliers on the horizontal axis, or a combination of both.

Fig. 3 Examples of outliers for both axes in the simulations (population 1). White circles represent usual observations, red triangles represent outliers, the vertical line is med(\(x\)), and the horizontal line is med(\(y\))

Therefore, if a ratio imputation model is robust against outliers anywhere in the scatter plot, it will be beneficial.

The percentage of outliers is set to 1%, 5%, and 10%. This means that we add 10 outliers to 1,000 observations (\(n=1010\) in total for 1% outliers), 50 outliers to 1,000 observations (\(n=1050\) in total for 5% outliers), and 100 outliers to 1,000 observations (\(n=1100\) in total for 10% outliers). Note that these outliers will not be missing in the simulations, because we are interested in the influence of outliers on the parameter of the imputation model when outliers are indeed present in the data. Also, the online appendix C reports additional simulation runs where the outliers are also missing.

Therefore, there are 10 types of data without outliers (5 population types and 2 missingness types) and 150 types of data with outliers (5 population types, 2 missingness types, 5 types of outlier locations, and 3 types of outlier percentages). Additionally, in the online appendices, we have 194 types of data. Each of these 354 types of data is repeated 10,000 times. Thus, we have 3,540,000 different types of data in total.

7.4 Evaluation criteria for simulations

Let \(\theta\) be the true population parameter and \(\widehat{\theta }\) be an estimator of \(\theta\). If \(\mathrm{Bias}\left(\widehat{\theta }\right)=0\) in Eq. (30), the expected value of \(\widehat{\theta }\) is equal to the true \(\theta\). Then, this estimator \(\widehat{\theta }\) is an unbiased estimator of true parameter \(\theta\) (Mooney, 1997, p.59; Gujarati, 2003, p.899). Therefore, \(\mathrm{Bias}\left(\widehat{\theta }\right)\) indicates whether the method is good on average.

$$\mathrm{Bias}\left(\widehat{\theta }\right)=E\left(\widehat{\theta }\right)-\theta.$$
(30)

Oftentimes, however, one estimator has smaller bias and larger variance than another estimator. The root mean squared error (RMSE) in Eq. (31) measures the dispersion around the true value of the parameter, taking the balance between bias and efficiency into account (Mooney, 1997, p.59; Gujarati, 2003, pp.901–902; Carsey & Harden, 2014, pp.88–89). Therefore, \(\mathrm{RMSE}\left(\widehat{\theta }\right)\) indicates whether the method is good across the 10,000 runs, taking both bias and efficiency into account.

$$\mathrm{RMSE}\left(\widehat{\theta }\right)=\sqrt{E{\left(\widehat{\theta }-\theta \right)}^{2}}.$$
(31)

Thus, an estimator \(\widehat{\theta }\) in this study is considered good if it has \(\mathrm{Bias}\left(\widehat{\theta }\right)\) close to zero and \(\mathrm{RMSE}\left(\widehat{\theta }\right)\) close to zero.
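The two criteria are straightforward to compute over the simulation runs; in the sketch below, the helpers from Sects. 7.1 and 7.2 are reused, and n_sim and true_mean_y are placeholders.

```r
# Bias (Eq. 30) and RMSE (Eq. 31) of a vector of estimates against the truth.
evaluate <- function(theta_hat, theta) {
  c(bias = mean(theta_hat) - theta,
    rmse = sqrt(mean((theta_hat - theta)^2)))
}

# Example: the mean of y after (non-robust) ratio imputation under MAR1.
# est <- replicate(n_sim, {
#   d <- gen_population()
#   y_obs <- make_missing(d$x, d$y, "MAR1")
#   beta <- mean(y_obs, na.rm = TRUE) / mean(d$x[!is.na(y_obs)])
#   mean(ifelse(is.na(y_obs), beta * d$x, y_obs))
# })
# evaluate(est, theta = true_mean_y)
```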

7.5 Competing methods in the simulations

Table 3 displays the abbreviations of the competing methods used in the simulations. For each of the traditional robust ratio imputation models, see Sect. 4.

Table 3 List of the competing methods in the simulations

Comp is complete data, which are the ideal but unavailable in reality. LD is listwise deletion, which throws away all of the rows that contain missing values; this is the result we obtain if we do not deal with missing values at all. In the literature of missing data analysis, this method is also known as complete case analysis (Little & Rubin, 2020, pp.47–48). Neither Comp nor LD is affected by outliers, because neither involves an imputation model.

Ratio is the non-robust ratio imputation model. Ratio is expected to work best among imputation methods when outliers are not present, while it is expected not to work well when outliers are present.

M-1 and M-2 are the ratio imputation models by \(M\)-estimators. These are the methods implemented in the 2016 Economic Census in Japan (Wada & Sakashita, 2017). This study chooses two values for the tuning constant \(\psi\) that represent less robust (\(\psi =8\)) and more robust (\(\psi =4\)), respectively, because \(\psi\) cannot be predetermined. For the information on the choice of \(\psi\), see Wada and Tsubaki (2020, p.3).

Med is the ratio-of-medians imputation. Trim is the ratio-of-trimmed-means imputation. Wins is the ratio-of-Winsorized-means imputation. In trimming outliers, there are no clear rules about the percentage of trimming; we set 5% as the cutoff for the trimmed and the Winsorized means. Therefore, these two methods are expected to work well when outliers are 5% in the simulations, but less well under 1% and 10%.

C-1 is the proposed TC-ratio estimator, where outliers are detected by the modified Cook’s distance and the number of outliers is determined automatically by the inverse-\({R}_{k}^{2}\) method in Sect. 6.2. C-2 instead uses the conventional cutoff \(4/\left(n-p\right)\).

7.6 Motivating example for simulation settings

Populations 1, 2, and 3 follow the simulation settings by Lee et al. (1994, p.236). A natural question is to ask whether these settings are realistic. As a real-world example, this subsection uses the anonymized data of the 2004 Japanese National Survey of Family Income and Expenditure, which is based on the actual microdata of the survey and is offered for the purpose of academic analyses. As of this writing, 2004 is the latest version of the anonymized data of the National Survey of Family Income and Expenditure.

Figure 4 displays the distributions of net expenditure and yearly income. Note that yearly income is measured on a yearly basis in the unit of 10,000 Japanese yen, while net expenditure is measured on a monthly basis in the unit of 1 Japanese yen. To make them comparable, net expenditure is divided by 10,000 and multiplied by 12, so that net expenditure is also on a yearly basis in the unit of 10,000 Japanese yen. Since these are sensitive real data from official statistics, for the purpose of disclosure limitation, the axes in Fig. 4 are intentionally hidden. Please pay attention to the shapes of the distributions, not the values of each data point.

Fig. 4 Characteristics of expenditure and income. The vertical line is med(net expenditure) and the horizontal line is med(yearly income). Axes are intentionally hidden for disclosure limitation purposes

Suppose that some values of yearly income are missing and all of the values of net expenditure are observed. Then, we may predict the missing values of \({\mathrm{income}}_{i}\) by \({\mathrm{expenditure}}_{i}\), using \({\mathrm{income}}_{i}=\beta \times {\mathrm{expenditure}}_{i}+{\varepsilon }_{i}\), where \(\beta\) is estimated by the ratio of means. Note that the prediction in the imputation model does not require a causal specification (King et al., 2001, p.51), meaning that the imputation model does not claim that \({\mathrm{expenditure}}_{i}\) is the cause of \({\mathrm{income}}_{i}\). It simply states that missing values of \({\mathrm{income}}_{i}\) may be predicted by \({\mathrm{expenditure}}_{i}\).

Figure 4 shows that each variable is skewed to the right, and the bivariate distribution is also heteroskedastic. The mean of income is 669.5 and the mean of expenditure is 379.6. Let \(y\) be income and \(x\) be expenditure. Then, \(\beta =\overline{y }/\overline{x }=669.5/379.6=1.76\), which is slightly higher than the value of \(b\) defined in Table 2. The correlation between income and expenditure is 0.46, which is lower than \(\rho\) defined in Table 2, but this is still coherent in the sense that the correlation is positive. One reason why the correlation is low is that income is top-coded at the value of 2500, which makes the correlation lower than it otherwise would be. Note that the real data already contain some potential outliers. If we trim these potential outliers by the TC-ratio estimator, \(\beta =1.80\) and \(\rho =0.64\), which are quite close to the theoretically defined values in Table 2. Therefore, this subsection demonstrates that the simulation settings above are realistic.

Note that, based on the Statistics Act (Japan), the author obtained the anonymized data of the 2004 National Survey of Family Income and Expenditure from the National Statistics Center (NSTAC). Also, note that the analyses in this article are the author’s own and are different from the officially published results by the Japanese government. For the information on the extended use (secondary use) of official statistics in Japan, see https://www.soumu.go.jp/english/dgpp_ss/seido/2jiriyou.htm.

8 Monte Carlo simulation: results

8.1 MAR1 in population 1

Table 4 presents the results of the simulations for population 1 (Gamma distribution, \(d=1.84\), \(g=0.75\)) under MAR1.

Table 4 Population 1 (Gamma distribution: \(d=1.84\), \(g=0.75\)), MAR1

Although there are no absolute criteria to judge the size of bias, Schafer and Graham (2002, p.157) state, “A rule of thumb that we have found useful is that bias becomes problematic if its absolute size is greater than about one half of the estimate’s standard error.” In Table 4, one standard error of the mean in complete data is about 1.7, which can be found in the column of Comp under RMSE, because RMSE is \(\sqrt{\mathrm{variance}+{\mathrm{bias}}^{2}}\); thus, for an unbiased estimator, RMSE is the standard error. Therefore, if the absolute value of bias is smaller than 1.7/2 = 0.850, then we deem the method unbiased and put it in italics.

Also, there are no absolute criteria to judge the size of RMSE, which is meaningful only in comparative terms (Carsey & Harden, 2014, p.89). The smallest RMSE indicates that the estimator is comparatively the best among the competing estimators. Thus, the smallest value of RMSE is shown in italics for each outlier setting. Note that Comp is excluded from the comparison of RMSE, because Comp is always the best method but is unavailable.

Under all situations, listwise deletion (LD) is severely biased (bias = 9.005, 9.062). In fact, listwise deletion is always biased under MAR. Therefore, the task is to correct the bias of about 9 points by way of imputation.

When there are no outliers (%X = 0.00, %Y = 0.00), the regular ratio imputation model (Ratio) is unbiased and the most efficient (bias = − 0.012, RMSE = 1.852). All of the robust ratio imputation models are slightly more biased than the regular ratio imputation model, but most of them, except Med and C-2, can also correct the bias in listwise deletion to within half of one standard error. Thus, we consider M-1, M-2, Trim, Wins, and C-1 unbiased, using the rule of thumb by Schafer and Graham (2002, p.157).

In the case of an equal number of outliers in both \(x\) and \(y\) (%X = 0.50, %Y = 0.50), the bias in the regular ratio imputation model is small (bias = 0.261). However, as the percentage of outliers increases, the bias in the regular ratio imputation model becomes large (bias = 1.029 for %X = 2.50, %Y = 2.50; bias = 1.746 for %X = 5.00, %Y = 5.00).

In 13 out of 16 cases, the bias of the TC-ratio estimator (C-1) is smaller than half of one standard error. Most importantly, among the 13 cases where the bias of the regular ratio imputation model is larger than half of one standard error, the bias of the TC-ratio estimator is smaller than half of one standard error in 10. In the case of (%X = 7.50, %Y = 2.50), the bias of the TC-ratio estimator is larger than half of one standard error, but it is still smaller than the biases of the other competing methods. The two scenarios (%X = 0.00, %Y = 10.00; %X = 10.00, %Y = 0.00) prove too hard to deal with: no method handles them adequately.

Taking both bias and efficiency into account, RMSE shows that the TC-ratio estimator is almost always best among the competing robust ratio imputation methods. In fact, the TC-ratio estimator is judged best in 10 out of 16 patterns. For the remaining six patterns, the differences in RMSE are quite small. The remarkable characteristic of the TC-ratio estimator is that RMSE is quite stable under most situations, ranging from 1.841 to 2.085 in 14 patterns.

Furthermore, when outliers are present, the TC-ratio estimator outperforms the regular ratio imputation model; and when there are no outliers, the TC-ratio estimator (bias = − 0.326, RMSE = 1.878) performs approximately as well as the regular ratio imputation model (bias = − 0.012, RMSE = 1.852). Also, the TC-ratio estimator (C-1) outperforms C-2 in 12 out of 16 patterns, with 1 tie, in terms of both bias and RMSE. When the proportion of outliers is 1%, the performance of the TC-ratio estimator (C-1) and that of the usual criterion of \(4/\left(n-p\right)\) (C-2) are similar; therefore, if we are certain that the proportion of outliers is low, the usual criterion of \(4/\left(n-p\right)\) might be enough. However, when we want to automate the process of imputation, there is uncertainty as to the proportion of outliers. Therefore, in case the proportion of outliers is high, the TC-ratio estimator (C-1) is preferable to the usual criterion of \(4/\left(n-p\right)\) (C-2).

8.2 MAR2 in population 1

Table 5 presents the results of the simulations for population 1 (Gamma distribution, \(d=1.84\), \(g=0.75\)) under MAR2.

Table 5 Population 1 (gamma distribution: \(d=1.84\), \(g=0.75\)), MAR2

In Table 5, if the absolute value of bias is smaller than 0.850 (half of one standard error), then it is shown in italics. Also, the smallest value of RMSE is shown in italics. The overall conclusions are similar to the ones in Sect. 8.1.

Remember that, under MAR2, the missing rates of \({y}_{i}\) are higher when \({x}_{i}>\mathrm{med}({x}_{i})\). This means that larger values of \({y}_{i}\) tend to be missing. Also, \({x}_{i}\) and \({y}_{i}\) both follow gamma distributions, which are skewed to the right with long right tails. This further means that many missing values are scattered among very large values of \({y}_{i}\). Therefore, the situation is more difficult to handle than in Sect. 8.1. In fact, the absolute sizes of the biases of the regular ratio imputation model (Ratio) tend to be larger than under MAR1.

When there are no outliers, the biases of M-1, Trim, Wins, and C-1 are smaller than half of the standard error; thus, we consider them unbiased.

The bias of the regular ratio imputation model is large in 14 cases, and the TC-ratio estimator (C-1) corrects the bias to within half of one standard error in 8 of them. In the four cases (%X = 0.00, %Y = 5.00; %X = 5.00, %Y = 0.00; %X = 2.50, %Y = 7.50; %X = 7.50, %Y = 2.50), the biases of the TC-ratio estimator are comparatively smaller than those of the competing methods. The two scenarios (%X = 0.00, %Y = 10.00; %X = 10.00, %Y = 0.00) are, again, too hard to deal with: no method handles them adequately.

In terms of RMSE, the TC-ratio estimator is judged best in 9 out of 16 patterns. For the remaining seven patterns, the differences in RMSE are quite small. Again, the remarkable characteristic of the TC-ratio estimator is that RMSE is quite stable under most situations ranging from 1.937 to 2.579 in 14 patterns. Also, the TC-ratio estimator (C-1) outperforms C-2 in 12 out of 16 patterns in terms of bias, and in 11 out of 16 patterns in terms of RMSE.

9 Summary of the overall results

Table 6 summarizes the results of all 160 data patterns. The row "Unbiased" shows the number of times the bias of each method was less than half of one standard error. The row "RMSE" shows the number of times the RMSE of each method was the smallest among the competing methods.

Table 6 Summary of the overall results in 160 data patterns

The TC-ratio estimator is deemed unbiased in 114 out of 160 patterns, and its RMSE is the smallest in 106 out of 160 patterns. Therefore, the TC-ratio estimator is remarkably robust under a variety of outlier settings, missing data types, and distributional assumptions. Against C-2, in terms of RMSE, C-1 wins 121 times and loses 31 times, with 8 ties, in 160 patterns. For the results of specific data types, see the Appendix in Sect. 11.

10 Conclusion

This article proposed a new robust ratio imputation model based on the TC-ratio estimator, which extends Cook's distance to the ratio estimator. Simulation studies showed that the new robust ratio imputation model is robust against many types of outliers under a variety of settings. This method works better than the traditional robust methods (the ratio of medians, trimmed means, Winsorized means, and means by \(M\)-estimators) when outliers are on the vertical axis, and far better than the traditional robust methods when outliers are on the horizontal axis (high-leverage points). Also, this method works approximately as well as the non-robust method when there are no outliers. This is true regardless of the distributional assumptions (gamma, uniform, and normal distributions; see also the Appendix in Sect. 11). Therefore, the TC-ratio estimator is more robust than the traditional robust ratio estimators.

Furthermore, since \(M\)-estimators are iterative methods, whether the algorithm converges depends on the choice of parameter settings in \(M\)-estimators (Mulry et al., 2014, p.733). In case the algorithm does not converge, the literature suggests having a backup strategy (Mulry et al., 2014, pp.744–745). The TC-ratio estimator in the current study is not an iterative method. Therefore, even if \(M\)-estimators are chosen for a particular survey as a method of imputation, the TC-ratio estimator can serve as a reliable backup method of imputation for \(M\)-estimators. Also, it is reported that developing an automatic data-driven method for \(M\)-estimators is challenging due to the difficulties in setting the initial value of the tuning constant \(\psi\) (Mulry et al., 2018, p.483). The TC-ratio estimator in the current study is a fully automatic data-driven method. In this sense, too, the proposed method is highly useful.

The following is beyond the scope of this article. The current study proposed an outlier-resistant single imputation method, because the goal was to compute the means (or the totals). If the goal is to make an inference about the population parameters based on sample statistics, then we may need to consider either of the following two methods. One is multiple imputation (Carpenter & Kenward, 2013, p.35; van Buuren, 2018, p.25). For this, Takahashi (2017a) and Takahashi (2017b) proposed multiple ratio imputation based on expectation–maximization with bootstrapping, which is known to be a fast and reliable multiple imputation algorithm (Takahashi, 2017c). How multiple ratio imputation can be robustified by the TC-ratio estimator will be an important future research topic. The other is to use variance estimation procedures for singly imputed data (Deville & Särndal, 1994, p.389, p.392; Haziza & Vallée, 2020). How these variance estimation procedures can be applied to the TC-ratio estimator will also be an important future research topic.

Let us end this article with a final remark on a potential limitation of the proposed method. As Young and Mathew (2015, p.93) note, this article does not suggest a panacea for outlier treatments in all survey settings. While Cook's distance is one of the most well-established methods to detect individually influential observations, it may overlook mutually influential observations, that is, a group of jointly influential observations (Lawrance, 1995, p.181). This problem is known as masking. In cases where observations are jointly influential, Cook's distance can be applied sequentially (a minimal sketch is given at the end of this section), but even the sequential approach may not always be successful (Fox, 2020, p.51). There are two ways to deal with this problem. First, Lawrance (1995, p.184) proposed a conditional approach as a measure of the masking in Cook's distance. How the TC-ratio estimator can incorporate the conditional approach by Lawrance (1995) is left for future research. Second, due to the possibility of masking, the literature suggests complementing outlier detection techniques with graphical methods (Fox, 2020, pp.52–53; Filliben & Heckert, 2013). In fact, by examining outliers in detail, we may find “omitted variables, incorrect functional forms, …, or other neglected aspects of a study” (Bollen, 1989, p.31). Whenever possible, subject matter knowledge should be incorporated into statistical analysis in dealing with outliers (de Waal et al., 2011, p.230; Young & Mathew, 2015, p.77). The current study focused on the statistical side only; if subject matter knowledge is incorporated into the outlier treatments for imputation, the findings of this study will be further strengthened.
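The following minimal sketch illustrates the sequential application, reusing ratio_fit and cooks_distance_ratio from the sketch in Sect. 8.1. The cutoff \(4/\left(n-p\right)\) and the step limit are illustrative choices, and, as noted above, masking can still defeat the procedure.

```python
import numpy as np

def sequential_cooks_ratio(x, y, max_steps=20):
    # Sequential deletion: recompute Cook's distances after every
    # removal so that a point masked by a larger outlier can surface
    # on a later pass; even so, masking may persist (Fox, 2020, p.51).
    keep = np.ones(len(x), dtype=bool)
    for _ in range(max_steps):
        d = cooks_distance_ratio(x[keep], y[keep])
        if d.max() <= 4.0 / (keep.sum() - 1):  # 4 / (n - p), p = 1
            break
        keep[np.flatnonzero(keep)[np.argmax(d)]] = False
    return ratio_fit(x[keep], y[keep])
```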

11 Appendix

This appendix displays the results for the other four populations (gamma with \(d=5.13\), \(g=0.50\); gamma with \(d=13.78\), \(g=0.25\); uniform \(\left[0.1, 2.1\right]\); and normal with mean = 20, variance = 16). In the following tables, unbiased results (smaller than half of one standard error) are shown in italics; see Schafer and Graham (2002, p.157) for this rule of thumb to judge the size of bias. The information on the standard error in each table can be found in the Comp column under RMSE. Also, the smallest RMSE value is shown in italics.

11.1 Results of the simulations for populations 2–5 in MAR1

See Tables 7, 8, 9, 10.

Table 7 Population 2 (gamma distribution: \(d=5.13\), \(g=0.50\)), MAR1
Table 8 Population 3 (gamma distribution: \(d=13.78\), \(g=0.25\)), MAR1
Table 9 Population 4 (uniform distribution: [0.1, 2.1]), MAR1
Table 10 Population 5 (normal distribution: mean = 20, variance = 16), MAR1

11.2 Results of the simulations for populations 2–5 in MAR2

See Tables 11, 12, 13, 14.

Table 11 Population 2 (gamma distribution: \(d=5.13\), \(g=0.50\)), MAR2
Table 12 Population 3 (gamma distribution: \(d=13.78\), \(g=0.25\)), MAR2
Table 13 Population 4 (uniform distribution: [0.1, 2.1]), MAR2
Table 14 Population 5 (normal distribution: mean = 20, variance = 16), MAR2