1 Introduction

It is often the case that multivariate datasets include atypical data points, i.e. points that deviate from the main pattern. Such data units are usually called outliers. Outlier detection is important for two main reasons: on the one hand, outlying data points may be interesting on their own, since they can reveal nonconforming phenomena; on the other hand, the results of usual multivariate methods can be heavily influenced by outliers. Outlier detection may, however, be a tricky endeavor: spurious outliers are often reported, even for data sets containing none, if no distinction is made between outliers and distribution extremes (Filzmoser et al. 2005). Moreover, true outliers may not stand out as expected, since they may affect the estimates of location and scatter so much that they no longer look atypical, an effect known as the “masking effect” in the literature (Hubert et al. 2008). Also, usual rules based on an appropriate cut-off value for robustified Mahalanobis distances [see, for instance, Filzmoser et al. (2005) and Rousseeuw and Zomeren (1990)] suffer from the drawback that the cut-off does not depend on the sample size.

In this paper we address the problem of outlier detection in multivariate interval data, i.e., data where for each data unit and variable an interval rather than a single real value is recorded. Interval data may occur in different situations, e.g. when describing ranges of variable values like daily stock prices or temperature ranges. Another common and increasingly interesting source of interval data is the aggregation of large data bases, when interval observations result from aggregating the real values describing the individual units.

A common approach for multivariate outlier detection measures outlyingness by Mahalanobis distances. A point i is considered an outlier if its squared distance \(D^2_{m, C}(i)= (x_i-m)^tC^{-1}(x_i-m)\) from an appropriate estimate m of the multivariate mean exceeds a threshold, where C is an estimate of the covariance matrix. Under the assumption of d-variate normality, \(D^2_{m, C}(i)\) approximately follows a Chi-square distribution with d degrees of freedom. The threshold for outlier labeling may then be an upper quantile of \(\chi ^2_d\), e.g., the \(97.5 \%\) quantile.
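As a minimal illustration of this labeling rule (an R sketch with simulated data, using the classical, non-robust estimates discussed next):

```r
# Minimal sketch: flag points whose squared Mahalanobis distance exceeds
# the 97.5% quantile of the Chi-square distribution with d degrees of freedom.
set.seed(1)
d <- 3
x <- matrix(rnorm(100 * d), ncol = d)        # regular observations
x[1:5, ] <- x[1:5, ] + 4                     # a few shifted (outlying) points
m <- colMeans(x)                             # classical location estimate
C <- cov(x)                                  # classical covariance estimate
D2 <- mahalanobis(x, center = m, cov = C)    # squared distances D^2_{m,C}(i)
flagged <- which(D2 > qchisq(0.975, df = d)) # candidate outliers
```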

However, if m and C are chosen to be the classical sample mean vector and covariance matrix this procedure is not reliable, as \(D^2_{m, C}(i)\) may be strongly affected by atypical observations. Therefore, the Mahalanobis distances should be computed with robust estimates of location and scatter. Rousseeuw and Zomeren (1990) use these robust Mahalanobis distances (RD’s) for multivariate outlier detection. Many robust estimators for location and covariance have been proposed in the literature.

The minimum covariance determinant (MCD) estimator (Rousseeuw 1984, 1985) uses a subset of the original sample, consisting of the h points in the dataset for which the determinant of the sample covariance matrix is minimal; it is very frequently used in practice since a computationally fast algorithm is available (Rousseeuw and Driessen 1999). Since the Chi-square approximation to the distribution of Mahalanobis distances based on the MCD estimator may not work well, even for moderately large samples, finite-sample approximations have been proposed (Cerioli 2010).
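For standard real-valued data, MCD-based robust distances can be obtained, for instance, with the covMcd function of the R package robustbase; the sketch below only illustrates the general idea and is not part of the methodology developed in this paper.

```r
# Sketch: robust Mahalanobis distances from the (reweighted) MCD estimate,
# using the robustbase package (assumed to be installed from CRAN).
library(robustbase)
set.seed(1)
d <- 3
x <- rbind(matrix(rnorm(95 * d), ncol = d),
           matrix(rnorm(5 * d, mean = 5), ncol = d))  # 5% contamination
fit <- covMcd(x, alpha = 0.75)                # subset size h of about 0.75 * n
RD2 <- mahalanobis(x, fit$center, fit$cov)    # squared robust distances
flagged <- which(RD2 > qchisq(0.975, df = d))
```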

Trimmed likelihood estimators (Hadi and Luceño 1997) are also based on a sample subset. In this case, the subset of h points is obtained by maximizing the trimmed likelihood, keeping the multivariate observations that contribute most to the likelihood function. For multivariate Gaussian data, the two approaches, the MCD method and the trimmed likelihood method, lead to the same estimators of covariance (Hadi and Luceño 1997).

In either case, the proportion of data points to be used needs to be specified a priori. The choice \(h\approx n/2\), where n is the sample size, leads to the highest breakdown point but to low efficiency, while a larger value for h reduces the breakdown point and increases the efficiency; a trade-off is the choice \(h\approx 0.75\cdot n\) (Hubert et al. 2008).

While in classical statistics and multivariate data analysis the statistical units under analysis are single individuals, each taking a single value for each (numerical or categorical) variable, it often happens that the data under analysis do not correspond to single observations, but rather to sets of values, either related to groups of units gathered on the basis of some common properties, or observed repeatedly over time. The classical data-array model is then too restrictive to take into account the variability inherent in such data. In those situations, data is commonly summarized by central statistics, e.g., averages, medians or modes, and other relevant information may be disregarded. This is the case when we are facing huge sets of data, recorded in very large databases (often called nowadays “Big Data”), where the elements of interest are not the individual records but rather some second-level entities. For instance, in a database of individual phone calls, we are surely more interested in describing the behavior of some person (or some pre-defined class or group of persons) rather than each call by itself. The analysis then requires that the calls data for each person (or group) be somehow aggregated to obtain the information of interest; however, the aggregated data can no longer be properly described by the usual numerical and categorical variables without an unacceptable loss of information.

By defining appropriate variable types, which may assume new forms of realizations (multiple, possibly weighted, values for each case) that take into account the data's intrinsic variability, Symbolic Data Analysis (Billard and Diday 2003; Bock and Diday 2000; Brito 2014; Diday and Noirhomme-Fraiture 2008) provides a framework where the observed variability is explicitly considered in the data representation.

We focus here on interval-valued data, i.e., where for each entity under analysis an interval is observed. Many methods have by now been developed for the analysis of interval-valued data, ranging from univariate statistics to multivariate methods such as Clustering [see, e.g., De Carvalho et al. (2006) and De Carvalho and Lechevallier (2009)], Principal Component Analysis [see, e.g. Douzal-Chouakria et al. (2011) and Le-Rademacher and Billard (2012)], Discriminant Analysis (Duarte Silva and Brito 2015; Ramos-Guajardo and Grzegorzewski 2016), Regression Analysis (Dias and Brito 2017; Lima Neto and De Carvalho 2008, 2010; Lima Neto et al. 2011), etc. For a survey the reader may refer to Brito (2014) and Noirhomme-Fraiture and Brito (2011). Most methodologies rely, however, on non-parametric exploratory approaches; more recently, approaches based on parametric models have been proposed, see Brito and Duarte Silva (2012), Le-Rademacher and Billard (2011) and Lima Neto et al. (2011).

In this paper we address the problem of outlier detection in interval data, using the modeling proposed in Brito and Duarte Silva (2012). In the particular case of interval data, outlyingness may have different causes. At the univariate level, an observation may be considered an outlier due to its MidPoint, to its Range, to both, or to a particular relation between them resulting in an outlying interval. Furthermore, from a multivariate perspective, it may be important to distinguish outliers that stand out by their MidPoints, by their Ranges, by both, or by the global relation between all MidPoints and Ranges.

In Li et al. (2006) the authors present an algorithm for outlier detection in interval data, using a distance-based approach. However, their method fixes a priori the number of data points to be flagged as outliers, and therefore seems unable to distinguish between outlier-free datasets and contaminated ones. Viattchenin (2012) proposes a heuristic approach based on possibilistic clustering. This approach relies on an exhaustive search over all possible partitions of the given dataset into k clusters, for each fixed k, and is therefore only feasible for small datasets. To the best of our knowledge, the methodology we present here is the first statistical approach to this problem rooted in classical robustness theory.

The remainder of the paper is organized as follows. Section 2 briefly introduces interval data and parametric models for interval-valued variables. In Sect. 3 we address the problem of robust parameter estimation and its application to robust outlier detection in interval data. Section 4 presents a simulation study comparing alternative methods for outlier detection. An application is discussed in Sect. 5, where we illustrate the main issues arising in this context. Section 6 concludes the paper, highlighting its main contributions.

2 Models for interval data

Let \(S=\{s_{1},\ldots ,s_{n}\}\) be the set of n entities under analysis.

We are in the presence of interval data when, for each \(s_i \in S\), an interval of \(\mathbb {R}\) is recorded for each variable. Formally, an interval-valued variable [see Noirhomme-Fraiture and Brito (2011)] is defined by an application

$$\begin{aligned} Y : S \rightarrow T \ \hbox {such that} \\ s_i \rightarrow Y(s_i)=\left[ l_i, u_i \right] \end{aligned}$$

where T is the set of intervals of an underlying set \(O \subseteq {\mathbb {R}}\). Let \(I = [I_{ij}]\) be an \(n\times p\) matrix containing the values of p interval variables on S. Each \(s_{i} \in S\) may then be represented by a p-dimensional vector of intervals, \(I_{i}=(I_{i1},\ldots , I_{ip}), i=1,\ldots , n\), with \(I_{ij} = [l_{ij},u_{ij}], j=1,\ldots , p\) (see Table 1).

Table 1 Matrix I of interval data

The value of each interval-valued variable \(Y_j\) for each \(s_i \in S\) is usually defined by the lower and upper bounds \(l_{ij}\) and \(u_{ij}\) of \(I_{ij}=Y_j(s_i)\); alternatively \(Y_j(s_i)\) may also be represented by the MidPoint \(c_{ij}= \displaystyle \frac{l_{ij}+u_{ij}}{2}\) and Range \(r_{ij}=u_{ij}-l_{ij}\) of \(I_{ij}\).

2.1 Parametric models for interval data

In Brito and Duarte Silva (2012), parametric models for interval data were proposed, relying on Multivariate Normal or Skew-Normal distributions for the MidPoints and Log-Ranges of the interval-valued variables. The Gaussian model has the advantage of allowing for the application of classical inference methods, and is the model considered in this paper.

The model assumes a joint multivariate Normal distribution for the MidPoints C and the logarithms of the Ranges R, \(R^* = ln(R)\), with mean vector \(\mu =\left[ \mu _C^t \ \mu _{R^*}^t \right] ^t\) and covariance matrix \( \varSigma = \left( \begin{array}{cc} \varSigma _{CC} &{} \varSigma _{CR^*}\\ \varSigma _{R^*C} &{} \varSigma _{R^*R^*}\\ \end{array} \right) \) where \(\mu _C\) and \(\mu _{R^*}\) are p-dimensional column vectors of the mean values of, respectively, the MidPoints and Log-Ranges, and \(\varSigma _{CC}, \varSigma _{CR^*}, \varSigma _{R^*C}\) and \(\varSigma _{R^*R^*}\) are \(p \times p\) matrices with their variances and covariances.
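As a small illustration (a sketch with simulated interval bounds, not real data), the MidPoint/Log-Range representation and the unrestricted Gaussian estimates may be obtained as follows:

```r
# Sketch: from interval bounds to MidPoints and Log-Ranges, and
# unrestricted (Case C1) Gaussian parameter estimates.
set.seed(1)
n <- 100; p <- 2
l <- matrix(rnorm(n * p), ncol = p)          # lower bounds (simulated)
u <- l + matrix(rlnorm(n * p), ncol = p)     # upper bounds, u > l
C     <- (l + u) / 2                         # MidPoints  c_ij
Rstar <- log(u - l)                          # Log-Ranges ln(r_ij)
X     <- cbind(C, Rstar)                     # n x 2p data matrix
mu.hat    <- colMeans(X)                     # maximum likelihood estimate of mu
Sigma.hat <- cov(X) * (n - 1) / n            # maximum likelihood estimate of Sigma
```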

With this parametrization, the data in matrix I in Table 1 is equivalently represented by the matrix shown in Table 2:

Table 2 Matrix of MidPoints and Log-Ranges
Table 3 Different cases for the variance-covariance matrix

We remark that the MidPoint \(c_{ij}\) and the Range \(r_{ij}\) of the value of an interval-valued variable \(I_{ij}=Y_j(s_i)\) relate to the same variable, so that the link that might exist between them should be appropriately taken into account. This is done by considering particular configurations of the global variance-covariance matrix (Table 3). The following cases are of particular interest, and have been addressed:

  1. Non-restricted case: allowing for non-zero correlations among all MidPoints and Log-Ranges;

  2. Interval-valued variables \(Y_j\) are uncorrelated, but for each variable, the MidPoint may be correlated with its Log-Range;

  3. MidPoints (Log-Ranges) of different variables may be correlated, but no correlation between MidPoints and Log-Ranges is allowed;

  4. All MidPoints and Log-Ranges are uncorrelated, both among themselves and with each other.

Table 3 summarizes the different cases considered in this paper.

It should be remarked that in Cases C2, C3 and C4, \(\varSigma \) can be written as a block diagonal matrix, after a possible rearrangement of rows and columns; therefore maximum likelihood estimates under these cases can be obtained directly from the classical non-restricted estimates. Testing for the different models/configurations can be done in a straightforward manner, using the likelihood-ratio approach.
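The following sketch (an illustration under the Gaussian model above, not the MAINT.Data implementation) shows how the Case C3 estimate is obtained from the unrestricted one by zeroing the forbidden covariance blocks, and how the corresponding likelihood-ratio statistic can be formed; X is the n x 2p matrix of MidPoints and Log-Ranges built in the previous sketch.

```r
# Sketch: restricted (Case C3) ML estimate and likelihood-ratio test of C3 vs C1.
restrict.C3 <- function(Sigma, p) {
  S <- Sigma
  S[1:p, (p + 1):(2 * p)] <- 0                 # Sigma_{C R*} = 0
  S[(p + 1):(2 * p), 1:p] <- 0                 # Sigma_{R* C} = 0
  S
}
gauss.loglik <- function(X, mu, Sigma) {
  d      <- ncol(X)
  logdet <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  -0.5 * (nrow(X) * (d * log(2 * pi) + logdet) +
          sum(mahalanobis(X, center = mu, cov = Sigma)))
}
Sigma.C1 <- cov(X) * (nrow(X) - 1) / nrow(X)   # unrestricted ML estimate
Sigma.C3 <- restrict.C3(Sigma.C1, p)           # restricted ML estimate (Case C3)
LR <- 2 * (gauss.loglik(X, colMeans(X), Sigma.C1) -
           gauss.loglik(X, colMeans(X), Sigma.C3))
pv <- pchisq(LR, df = p^2, lower.tail = FALSE) # p^2 covariances constrained to zero
```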

This modeling has been implemented in the R package MAINT.Data (Duarte Silva and Brito 2017), available on CRAN. MAINT.Data introduces a data class for representing interval data and includes functions for modeling and analysing these data. In particular, maximum likelihood estimation and statistical tests for the considered configurations are addressed. Methods for (M)ANOVA and Discriminant Analysis of this data class are also provided.

3 Robust parameter estimation and outlier detection

The identification of outliers is based on robust Mahalanobis distances from each data point to the mean. These values are then compared with the \(97.5\%\) quantile of an appropriate distribution. Traditionally, the \(\chi ^2_{2p}\) distribution (where p is the number of interval variables and 2p the total number of MidPoints and Log-Ranges) is used. However, since the Chi-square approximation to the distribution of Mahalanobis distances based on the MCD estimator may not work well, even for moderately large samples, finite-sample approximations have been proposed (Cerioli 2010). This is detailed in Sect. 3.2.

3.1 Trimmed maximum likelihood estimation

The Trimmed Log-Likelihood (lnTL) estimator is a special case of the weighted trimmed likelihood estimator, defined in Hadi and Luceño (1997) and Vandev and Neykov (1998). The lnTL estimator is defined as

$$\begin{aligned} \hat{\theta }_{lnTL}:=\arg \min _{H ; \theta \in \varTheta ^{q}} \sum _{i\in H} (-\log \varphi (y_{i};\theta )), \end{aligned}$$
(1)

for an index set \(H\subset \{1,\ldots ,n\}\) of size h to be determined, with the unknown parameter \(\theta \in \varTheta ^{q} \subset {\mathbb {R}}^q \), and where \( y_i \in {\mathbb {R}}^d \) for \(i=1,\ldots , n\) are i.i.d. observations with probability density \(\varphi (y;\theta )\). Here, h is the trimming parameter; with the help of trimming it is possible to remove those \(n-h\) observations whose values would be highly unlikely to occur if the fitted model were true.

Solving problem (1) exactly is infeasible for larger data sets, with hundreds of observations or more, since it involves a combinatorial search over subsets. However, Neykov and Müller (2003) proposed a fast algorithm that finds an approximate solution.
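In the Gaussian case, this kind of fast algorithm relies on concentration steps in the spirit of FAST-MCD: starting from an initial subset, the h observations with smallest current Mahalanobis distances are retained and the estimates recomputed, until the criterion no longer improves. A simplified, single-start R sketch for the unrestricted Gaussian model (actual implementations use many random starts and further refinements):

```r
# Sketch: concentration steps for the Gaussian trimmed likelihood
# (equivalently, for the minimum trimmed covariance determinant criterion).
trimmed.mle <- function(X, h, max.iter = 100) {
  n   <- nrow(X)
  idx <- sample(n, h)                         # one random initial subset
  for (it in 1:max.iter) {
    mu    <- colMeans(X[idx, , drop = FALSE])
    Sigma <- cov(X[idx, , drop = FALSE]) * (h - 1) / h
    D2    <- mahalanobis(X, center = mu, cov = Sigma)
    new.idx <- order(D2)[1:h]                 # keep the h closest observations
    if (setequal(new.idx, idx)) break         # no change: a local optimum
    idx <- new.idx
  }
  list(mu = mu, Sigma = Sigma, subset = sort(idx))
}
# Usage: fit <- trimmed.mle(X, h = round(0.75 * nrow(X)))
```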

3.2 Robust model estimation for interval data

The trimmed Maximum Likelihood principle can be readily adapted to the problem of robust parameter estimation for the probabilistic models proposed in Brito and Duarte Silva (2012). In particular, in the case of the Gaussian models discussed in this paper, the block-diagonal structure of the restricted covariance matrix implies, for every configuration, that trimmed likelihood maximization is equivalent to the minimization of the determinant of the restricted trimmed sample covariance matrix.

That can be seen by noting that for all configurations considered the trimmed log-likelihood

$$\begin{aligned} ln \ TL(\mu ,\varSigma ) = -\frac{h}{2} \left( 2p \ ln(2\pi ) + ln\left| \varSigma \right| + tr \tilde{\varSigma } \varSigma ^{-1} + \left( \tilde{\mu }-\mu \right) ^t\varSigma ^{-1}\left( \tilde{\mu }-\mu \right) \right) \end{aligned}$$
(2)

where h is the number of observations kept in the trimmed sample, and \( \tilde{\mu } = \frac{1}{h} \sum \nolimits _{i=1}^h X_i, \ \tilde{\varSigma } = \frac{1}{h} \sum \nolimits _{i=1}^h (X_i-\tilde{\mu })(X_i-\tilde{\mu })^t\) are respectively the trimmed mean and trimmed sample covariance.

It follows [see Brito and Duarte Silva (2012)] that when \(\varSigma \) is block diagonal with blocks \(\varSigma _{1}, \varSigma _{2}, \ \dots , \varSigma _{b} \), ln TL is maximized by \( \hat{\mu } = \tilde{\mu }\) and

$$\begin{aligned} \hat{\varSigma } = \left( \begin{array}{cccc} \tilde{\varSigma }_{1} &{} 0 &{} \cdots &{} 0 \\ 0 &{} \tilde{\varSigma }_{2} &{} \cdots &{} 0 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \cdots &{} \tilde{\varSigma }_{b} \end{array} \right) , \ \hbox {i.e.} \ \hat{\varSigma }_{j} = \tilde{\varSigma }_{j}, \ j=1,\ldots ,b, \end{aligned}$$

with \(\tilde{\varSigma }_j\) being the block of \(\tilde{\varSigma }\) corresponding to \(\varSigma _{j}\). The maximal value of ln TL reduces to \(-\frac{h}{2} \left( 2p \ ln(2\pi ) + ln |\hat{\varSigma }| + 2p \right) \) so that the trimmed maximum likelihood and minimum trimmed covariance determinant principles remain equivalent in this setting.

A consequence of this equivalence is that known refinements of the traditional Minimum Covariance Determinant estimator can be readily adapted. In particular, following Hubert et al. (2008), we implemented a one-step re-weighted bias-corrected estimate given by

$$\begin{aligned} \hat{\mu }_1= & {} \frac{ \sum \nolimits _{i=1}^n w_i x_i}{m} \end{aligned}$$
(3)
$$\begin{aligned} \hat{\varSigma }_{1}= & {} \frac{l_{m, 2p} \ c^1_{m, h, n, 2p, Cf} \sum \nolimits _{i=1}^n w_i (x_i-\hat{\mu }_1)(x_i-\hat{\mu }_1)^t}{m} \end{aligned}$$
(4)
$$\begin{aligned} m = \displaystyle \sum _{i=1}^n w_i, \qquad w_i = \left\{ \begin{array}{cl}\ 1, &{} \text{ if } \ \ l_{h,2p} \ c_{h, n, 2p, Cf} \ D_{\hat{\mu },\hat{\varSigma }}(i) \le \sqrt{Q_{0.975}}\\ 0, &{} \text{ otherwise } \\ \end{array} \right. \end{aligned}$$

where \(Q_{0.975}\) is the \(97.5\%\) quantile of the distribution of \(D^2_{\hat{\mu },\hat{\varSigma }} \). The traditional approach is to use the Chi-square approximation, so that \(Q_{0.975}=\chi ^2_{2p,0.975}\), but Hardin and Rocke (2005) proposed using an F distribution with better finite-sample properties.

In expression (4), \(l_{\alpha ,2p} = \frac{\alpha /n}{P(\chi ^2_{2p+2} \le \chi ^2_{2p;\alpha /n})} (\alpha =m;h)\) are consistency correction factors, \(Cf=\{\)C1, C2, C3, C4\(\}\) are the alternative configurations of the covariance matrix, and \( c_{h, n, 2p, Cf}\), \(c^1_{m, h, n, 2p, Cf}, \) are raw (\( c_{h, n, 2p, Cf}\)) and one-step re-weighted (\(c^1_{m, h, n, 2p, Cf}\)) finite-sample bias-correction factors, found by a simulation and interpolation procedure, along the lines proposed in Pison et al. (2002), and described in the Appendix.
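Leaving aside the finite-sample bias-correction factors c (which require the simulation described in the Appendix), the consistency correction and the one-step reweighting can be sketched in R as follows, with fit a raw trimmed estimate such as the one returned by trimmed.mle above:

```r
# Sketch: consistency correction and one-step reweighting
# (the finite-sample factors c of (4) are omitted here).
cons.factor <- function(prop, d) {
  # l_{alpha,d} = prop / P(chi^2_{d+2} <= chi^2_{d; prop})
  prop / pchisq(qchisq(prop, df = d), df = d + 2)
}
reweight <- function(X, fit, h) {
  n <- nrow(X); d <- ncol(X)                        # here d = 2p
  Sigma.raw <- cons.factor(h / n, d) * fit$Sigma    # consistency-corrected raw estimate
  D2  <- mahalanobis(X, center = fit$mu, cov = Sigma.raw)
  w   <- as.numeric(D2 <= qchisq(0.975, df = d))    # weights w_i
  m   <- sum(w)
  mu1 <- colSums(w * X) / m
  S1  <- crossprod(sqrt(w) * sweep(X, 2, mu1)) / m
  list(mu1 = mu1, Sigma1 = cons.factor(m / n, d) * S1, w = w, m = m)
}
# Usage: rw <- reweight(X, fit, h = round(0.75 * nrow(X)))
```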

Illustrative values of finite sample correction factors are given in Table 4 (raw correction factors) and Table 5 (re-weighted correction factors).

Table 4 Raw finite sample correction factors: \(c_{h, n, 2p, Cf}\)
Table 5 Re-weighted finite sample correction factors: \(c^1_{m,[0.75\,n], n, 2p, Cf}\)

The final rule for outlier identification compares the robust Mahalanobis distances, based on covariance estimates obtained as discussed above, with the 97.5% quantile of either the \(\chi ^2_{2p}\) distribution or of the finite-sample approximations [see Cerioli (2010)]:

$$\begin{aligned} D^2_{\hat{\mu }_1,\hat{\varSigma }_1} \sim \frac{(m-1)^2}{m} Beta \left( p,\frac{m-2p-1}{2} \right) , \ \ \ \ if \ \ w_i=1 \end{aligned}$$
(5)
$$\begin{aligned} D^2_{\hat{\mu }_1,\hat{\varSigma }_1} \sim \frac{m+1}{m} \frac{(m-1)2p}{m-2p} F \left( 2p, m-2p \right) , \ \ \ \ if \ \ w_i=0 \end{aligned}$$
(6)
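In R, these cutoff values can be sketched as follows (qbeta and qf are the Beta and F quantile functions; m is the sum of weights from the reweighting step and p the number of interval-valued variables):

```r
# Sketch: 97.5% cutoffs for the squared robust distances, following (5)-(6).
cutoffs <- function(m, p, q = 0.975) {
  d <- 2 * p                                                          # data dimension
  cut.w1 <- (m - 1)^2 / m * qbeta(q, p, (m - d - 1) / 2)              # if w_i = 1
  cut.w0 <- ((m + 1) / m) * ((m - 1) * d / (m - d)) * qf(q, d, m - d) # if w_i = 0
  c(within = cut.w1, outside = cut.w0)
}
# An observation i is flagged as an outlier when its squared robust distance
# exceeds the cutoff corresponding to its weight w_i.
```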

3.3 Choice of the trimming parameter

The trimmed maximum likelihood estimation procedure, described in the previous section, relies on the choice of the value of the trimming parameter h. There is no consensus on how to choose this parameter. One has to consider the trade-off between the potential outlier influence and the efficiency of the resulting estimators. One extreme consists in maximizing the breakdown point, which leads to keeping roughly half the sample. The other extreme is maximum likelihood estimation, which maximizes efficiency but is also the most sensitive to outliers. Some authors (Hubert et al. 2008) suggest using \(75\%\) of the sample, which, they argue, usually results in a reasonable compromise. Ideally, one would set h to the true number of regular observations, but this is unknown in practice.

These considerations lead us to propose a two-step approach. In the first step, the outlier detection procedure is run to get an estimate of the outlier proportion. The second step repeats the procedure, fixing the trimming parameter at the value thus obtained.
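A sketch of this two-step idea, using the building blocks introduced above (trimmed.mle comes from the earlier sketches; flag.outliers is a hypothetical helper that returns a logical outlier indicator per observation, e.g., by comparing squared robust distances with a 97.5% cutoff):

```r
# Sketch: two-step choice of the trimming parameter h.
two.step.h <- function(X, first.prop = 0.75) {
  n <- nrow(X)
  # Step 1: run the procedure with a conventional trimming proportion
  fit1 <- trimmed.mle(X, h = round(first.prop * n))
  out1 <- flag.outliers(X, fit1)            # estimated outlier indicators (step 1)
  # Step 2: re-run with h set to the estimated number of regular observations
  h2   <- n - sum(out1)
  fit2 <- trimmed.mle(X, h = h2)
  list(h = h2, fit = fit2, outliers = flag.outliers(X, fit2))
}
```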

To assess the soundness of this proposal, a small simulation study was conducted, controlling for the sample size, the outlier proportion, the contamination level and the data covariance structure. This simulation considered the following alternatives: maximum likelihood, maximization of the breakdown point, setting the trimming parameter to \(75\%\) of the sample, the two-step approach with 50% and \(75\%\) in the first step, and the trimming parameter chosen by the Bayesian Information Criterion (BIC). The performance measure used to compare these methods is the F-measure, defined as the harmonic mean of Precision and Recall [see, for instance, Rijsbergen (1979)], where Precision is the fraction of identified data points that are indeed outliers, while Recall is the fraction of true outliers that are identified as outliers by the method. This measure is computed as:

$$\begin{aligned} F=\frac{2 \times \text{ true } \text{ positives }}{2 \times \text{ true } \text{ positives } + \text{ false } \text{ positives } + \text{ false } \text{ negatives } } \end{aligned}$$
(7)

The higher the F-measure the better the method’s performance.
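Equivalently, in terms of true/false positive and false negative counts (a small R helper for illustration):

```r
# Sketch: F-measure from the outlier indicators returned by a method
# (flagged) and the true outlier indicators (truth), both logical vectors.
f.measure <- function(flagged, truth) {
  tp <- sum(flagged & truth)
  fp <- sum(flagged & !truth)
  fn <- sum(!flagged & truth)
  2 * tp / (2 * tp + fp + fn)
}
```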

The results showed that, depending on the data condition, either maximum likelihood, the method that maximizes the breakdown point, or the two-step procedure with \(75\%\) in the first step provides the best results. These three methods will be further compared in a more extensive simulation study described in Sect. 4.

Results of this preliminary study are available from the authors upon request.

Furthermore, we found that these procedures scale well and we were able to apply them to problems with up to \(n\,=\,1000\) observations and \(\textit{p}\,=\,50\) interval-valued variables in less than 10 minutes of computer time.

4 Simulation studies

To better understand the factors affecting the relative performance of the methods under comparison, we performed a controlled simulation experiment.

Table 6 Methods with the largest F-measure, for the considered data settings–p = 4
Table 7 Methods with the largest F-measure, for the considered data settings–p = 10
Table 8 F-measure obtained by the best methods listed in Tables 6 and 7, for the corresponding data settings–p = 4: mean (standard error)
Table 9 F-measure obtained by the best methods listed in Tables 6 and 7, for the corresponding data settings–p = 10: mean (standard error)
Fig. 1 Performance (F-measure) as a function of the proportion of outlying observations (Noise), covariance configuration (C1, C2, C3, C4) and number of observations (n), for Mahalanobis distance = 3, with p = 4 interval variables, generated from Gaussian distributions

Fig. 2 Performance (F-measure) as a function of the proportion of outlying observations (Noise), covariance configuration (C1, C2, C3, C4) and number of observations (n), for Mahalanobis distance = 3, with p = 10 interval variables, generated from Gaussian distributions

Fig. 3 Performance (F-measure) as a function of the proportion of outlying observations (Noise), covariance configuration (C1, C2, C3, C4) and number of observations (n), for Mahalanobis distance = 6, with p = 4 interval variables, generated from Gaussian distributions

Fig. 4 Performance (F-measure) as a function of the proportion of outlying observations (Noise), covariance configuration (C1, C2, C3, C4) and number of observations (n), for Mahalanobis distance = 6, with p = 10 interval variables, generated from Gaussian distributions

Fig. 5 Performance (F-measure) as a function of the proportion of outlying observations (Noise), covariance configuration (C1, C2, C3, C4) and number of observations (n), for Mahalanobis distance = 12, with p = 4 interval variables, generated from Gaussian distributions

Fig. 6 Performance (F-measure) as a function of the proportion of outlying observations (Noise), covariance configuration (C1, C2, C3, C4) and number of observations (n), for Mahalanobis distance = 12, with p = 10 interval variables, generated from Gaussian distributions

4.1 Experimental design

We considered a factorial design with the following factors:

  • Number of interval variables (NV - 2 levels): \(\textit{p}\,=\,4\) and \(\textit{p}\,=\,10\).

  • Sample size (SS - 3 levels): Total number of sample observations, set at \(n\,=\,50, n\,=\,100, n\,=\,200\).

  • Data Generating Process (DGP - 2 levels): MidPoints and Log-Ranges jointly Normally distributed or generated from linear transformations of Gaussian and Uniform variables.

  • True configuration (TConf - 4 levels): Case of true covariance of MidPoints and Log-Ranges. Set at the levels C1 (unrestricted), C2 (Uncorrelated Interval Variables), C3 (MidPoints uncorrelated with Log-Ranges) and C4 (all MidPoints and Log-Ranges uncorrelated with each other).

  • Noise level (NL - 4 levels): Percentage of true outliers, set at Noise = \(0.0\%\), \(5.0\%\), \(10.0\%\), \(20.0\% \).

  • Contamination level (CL - 3 levels): Mahalanobis distance between the mean vectors (2p-dimensional) of the regular and the outlying observations, set at 3.0, 6.0, 12.0.

  • Method (M - 6 levels): Maximum likelihood (Alln (MLE)), maximum trimmed likelihood with \(50\%\) trimming, with quantiles from the Chi-square (0.50n) or from the F or Beta distributions (Fs0.50n), two-step maximum trimmed likelihood with \(75\%\) trimming in the first step, with quantiles from the Chi-square (TStp75n) or from the F or Beta distributions (FsTStp75n), the distance-based approach by Li et al. (2006), with four neighbors and the true number of outliers (LLLk4).

For each data condition with true outliers, defined by a combination of factors NV, SS, DGP, TConf, NL and CL we generated 1000 independent samples, and applied the six outlier detection methods. In each case, we computed the Precision, Recall and F-measure (see (7)), as defined in Sect. 3.3. For the conditions with no outliers, the distance-based approach by Li et al. (2006) was not applied; for all other methods we computed the proportion of observations incorrectly flagged as outliers.

Fig. 7 Performance (F-measure) of the considered methods for the different covariance configurations (C1, C2, C3, C4), with Mahalanobis distance = 6, \(10\%\) outliers, n = 50, p = 10, data generated from Gaussian distributions

Fig. 8 Percentage of false outliers Err for all methods, for the different covariance configurations (C1, C2, C3, C4) and number of observations (n), with p = 4

Fig. 9 Percentage of false outliers for all methods, for the different covariance configurations (C1, C2, C3, C4) and number of observations (n), with p = 10

4.2 Data generation

Data were generated as follows:

For each observation \(s_i, i=1,\ldots , n\), we first generated a 2p-dimensional random vector, \(V = \left[ V_1 V_2\right] \), with independent components.

For Case 4, where all MidPoints and Log-Ranges are uncorrelated, both among themselves and with each other, the \(V_1\) variables define the intervals’ MidPoints C, and the \(V_2\) variables the Log-Ranges \(R^*\).

In Case 3, MidPoints (Log-Ranges) of different variables may be correlated, but no correlation between MidPoints and Log-Ranges is allowed. Then, the MidPoints C are given by linear combinations of the \(V_1\) components, \(C = L_1 \ V_1^t\); likewise, the Log-Ranges \(R^*\) are given by linear combinations of the \(V_2\) components, \(R^* = L_2 \ V_2^t\). The coefficients in matrices \(L_1\) and \(L_2\) are obtained from independently generated Uniform values, which are then normalized to ensure that \(L_1\) and \(L_2\) are orthonormal.

Case 1 allows for non-zero correlations among all MidPoints and Log-Ranges. Therefore, MidPoints C are defined as linear combinations of both \(V_1\) and \(V_2\) components, \(C = [L_1 | L_3] \ [V_1^t V_2^t]^t\). Log-Ranges \(R^*\) are then given by linear combinations of the \(V_2\) components, \(R^* = L_2 \ V_2^t\). \(L=[L_1 | L_3]\) and \(L_2\) are orthonormal matrices obtained as in Case 3.

Finally, in Case 2 interval-valued variables \(Y_j\) are uncorrelated, but for each variable, the MidPoint may be correlated with its Log-Range. Accordingly, C and \(R^*\) are generated in the same way as in Case 1, but placing all required zeros in matrices \(L_1, L_2, L_3\) to ensure that MidPoints and Log-Ranges of different interval variables have null correlations.

In the first setup, both \(V_1\) and \(V_2\) are Gaussian, whereas in the second setup \(V_1\) is multivariate Normal and \(V_2\) is a vector of independent Uniform variables. In all cases, variances are set to unity. For the regular observations, all components of \(V_1\) and \(V_2\) are generated from distributions with null expected value, while for the outlying observations the expected values are set to ensure the desired level of Mahalanobis distance.
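This scheme can be sketched in R for, e.g., Case C3 and the Gaussian setup (the orthonormal coefficient matrices are obtained here via a QR decomposition of a matrix of Uniform draws, and the mean shift of the outlying observations is chosen so that the Mahalanobis distance between the two mean vectors equals the target contamination level):

```r
# Sketch: one simulated sample under Case C3, Gaussian setup,
# with a given outlier proportion (noise) and contamination level (delta).
gen.sample <- function(n, p, noise = 0.1, delta = 6) {
  n.out <- round(noise * n)
  shift <- delta / sqrt(2 * p)                 # gives Mahalanobis distance delta
  mu.V  <- rbind(matrix(0,     n - n.out, 2 * p),
                 matrix(shift, n.out,     2 * p))
  V  <- mu.V + matrix(rnorm(n * 2 * p), n, 2 * p)   # independent unit-variance components
  V1 <- V[, 1:p]; V2 <- V[, (p + 1):(2 * p)]
  L1 <- qr.Q(qr(matrix(runif(p * p), p)))      # orthonormal coefficient matrices
  L2 <- qr.Q(qr(matrix(runif(p * p), p)))
  C     <- V1 %*% t(L1)                        # MidPoints
  Rstar <- V2 %*% t(L2)                        # Log-Ranges
  list(X = cbind(C, Rstar),
       is.outlier = c(rep(FALSE, n - n.out), rep(TRUE, n.out)))
}
```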

4.3 Results

Tables 6 and 7 present the method resulting in the largest F-measure (see (7)) among all considered methods (except the distance-based approach by Li et al. (2006)), for the data settings with true outliers, and Tables 8 and 9 gather the corresponding F values, for all methods except the distance-based approach by Li et al. (2006). In each case the mean F value over the 1000 replications is given, along with the corresponding standard error. Our experiments showed that the method of Li et al. (2006) (denoted as LLLk4 in Figs. 1, 2, 3, 4, 5, 6, 7, 8 and 9), which requires knowledge of the true number of outliers, works best when the contamination level is either low or highest, but not in intermediate situations (see Figs. 1, 2, 3, 4, 5 and 6). As the number of outliers is never known in practice, and its discovery is one of the challenges, comparisons with this method are not fair and its performance should only be understood as a theoretical benchmark.

The results are very similar for data generated only from Gaussian or from Gaussian and Uniform distributions, so that we present graphical representations only for the former case, in Figs. 1, 2, 3, 4, 5 and 6.

Contamination level  = 3

With four interval-valued variables, the distance-based approach by Li et al. (2006) is by far the best, except in some cases with non-restricted covariances with \(5\%\) outliers. Among the methods not using the knowledge of the true number of outliers, the method that uses \(50\%\) of the sample with quantiles from the Chi-square distribution (denoted as 0.50n method) is often the best one, sometimes tied with the two-step procedure with Chi-square quantiles. When the number of interval variables is set to 10 the pattern is similar, but for unrestricted covariance matrix (C1), with either \(5\%\) or \(10\%\) outliers, both the 0.50n method and the two-step approach with Chi-square quantiles may perform better than the distance-based one.

Contamination level  = 6

Maximum likelihood (denoted Alln(MLE)) is almost always (with a few exceptions when p = 10) the worst method when the proportion of outliers is equal to or higher than \(10\%\), with remarkably bad performance in configurations involving a large number of parameters. For p = 4, and samples that are small relative to the number of parameters, the performance of the methods using quantiles from the F or the Beta distributions deteriorates as the true proportion of outliers increases. When we increase the number of interval variables to 10, for large outlier proportions the 0.50n method (using half the sample with Chi-square quantiles) performs better than the other parametric methods. When there are only \(5\%\) outliers, the FS0.50n method and the two-step approach using quantiles from the Chi-square distribution are usually the most competitive methods.

Contamination level  = 12

When p = 4, maximum likelihood always performs badly when the proportion of outliers is equal to or higher than \(10\%\). For small samples (n = 50), both the 0.50n and the two-step methods (using quantiles from the Chi-square distribution) do not perform well. For a larger dimension (p = 10), and for medium or large samples, all methods perform well for the most restricted covariance configurations (C2 and C4), except maximum likelihood in most cases. With these configurations, when there are only 5% outliers the maximum likelihood method performs best; for larger outlier proportions the two-step method using Chi-square quantiles is overall the most competitive. For configurations with more parameters (C1 and C3), depending on the particular condition, either the FS0.50n method or the two-step approach using Chi-square quantiles provides the best results.

Non-contaminated case

In the situations where there are no outliers, either maximum likelihood (Alln(MLE)) or the two-step approach with quantiles from the F or Beta distributions (FsTStp75n) are the methods that flag the lowest number of observations (see Table 10). We note that maximum likelihood performs better in relative terms for lower sample sizes and a larger number of parameters to be estimated. This general pattern holds both for data generated from Gaussian and from Gaussian and Uniform distributions. Furthermore, in all situations, the proportion of wrongly flagged observations is below the nominal significance level of 2.5% (see Table 11 and Figs. 8 and 9).

General conclusions

We notice that no method is uniformly best across data conditions. Maximum likelihood is competitive with very low numbers of true outliers, but quickly deteriorates as the proportion of outliers increases; the method that uses \(50\%\) of the sample provides the best results with high proportions of outliers and low or moderate contamination levels; the two-step method using Chi-square quantiles (TStp75n) is usually the best with moderate or large proportions of outliers and moderate or large contamination levels (see Figs. 1, 2, 3, 4, 5 and 6); this same approach with F or Beta quantiles (FsTStp75n) works best for small proportions of outliers but requires relatively large samples, particularly when there are many parameters to be estimated. This aspect may be relevant for interval data, where the number of parameters grows faster than is the case for standard real-valued data, unless highly restricted configurations are assumed.

As concerns the different covariance cases, we note that maximum likelihood and the methods using F or Beta quantiles perform better in Cases C2 and C4, where there are fewer parameters to estimate, especially when the samples are small. The two-step methods provide a more stable performance across covariance cases (see Fig. 7).

In practical cases, a preliminary inspection of the data may provide some insight into the proportion and severity of outliers; this information, together with the number of variables and the size of the dataset, allows identifying the appropriate setups and, therefore, the corresponding adequate methods.

Table 10 Best methods by data condition (no outliers data)
Table 11 Percentage of false outliers for best methods

5 Application

In this section we analyse a set of 27 car models described by four interval-valued variables–lnPrice, Engine Capacity, Top Speed, Acceleration (see Table 12). We use the natural logarithm of the Price, since the MidPoints of the original Price variable have a strong positive skewness. The dataset is available in the package MAINT.Data.

We proceed to outlier detection, based on Mahalanobis distances (from each car model to the mean estimate) and considering the \(97.5 \%\) quantiles of the relevant distributions. Preliminary likelihood ratio tests reject all restricted configurations against the full model at the 1% significance level. BIC values obtained by our robust procedure also point to the unrestricted configuration. To decide on the method to apply, we made a preliminary inspection of the data using classical Mahalanobis distances. Even with this non-robust procedure, it became clear that the dataset had at least 10% severe outliers. Therefore, the most appropriate simulation conditions for this data are the cases with a low number of variables (p = 4), a small sample, a high proportion of outliers and a large contamination level. For these cases, the conclusions from the simulation study suggest the use of the method based on half of the sample using F and Beta quantiles (see Fig. 5 and Table 6). Maximum likelihood, for which the Chi-square quantiles are used, is also applied for comparison purposes. We also identify outliers using the non-parametric approach of Li et al. (2006), with the authors’ recommendation of four neighbors.

For the proposed parametric methods, the results are displayed in Figs. 10, 11, 12, 13, 14, 15 and 16. In these figures, cutoffs from the F distribution are indicated with a dashed line, whereas those from the Beta distribution are indicated with a solid line. In Fig. 10 we note that whereas classical distances identify only Skoda Octavia and Honda NSK as outliers, the robust distances flag three other car models as outliers, while the Honda NSK is no longer flagged.

Table 12 ‘Car’ data set with four interval-valued variables

Figure 11 shows which car models are identified as outliers when we consider only the MidPoints (respectively, only the Log-Ranges). We observe that three of the car models identified as outliers in the global analysis are also identified as outliers in the separate analysis of MidPoints (Ferrari, Mercedes Class S, Porsche), of which Ferrari is also flagged as an outlier for the Log-Ranges; Skoda Octavia is identified as an outlier for the Log-Ranges but not for the MidPoints.

Fig. 10 ‘Car’ dataset: Classical vs robust Mahalanobis distances

Fig. 11 ‘Car’ dataset: Robust Mahalanobis distances on MidPoints vs Robust Mahalanobis distances on Log-Ranges

Fig. 12 ‘Car’ dataset: Robust Mahalanobis distances on variable lnPrice: a Robust Mahalanobis distances on lnPrice MidPoints vs Robust Mahalanobis distances on lnPrice Log-Ranges; b Robust Mahalanobis distances on lnPrice using both MidPoints and Log-Ranges

Fig. 13 ‘Car’ dataset: Robust Mahalanobis distances on variable Engine Capacity: a Robust Mahalanobis distances on Eng. Cap. MidPoints vs Robust Mahalanobis distances on Eng. Cap. Log-Ranges; b Robust Mahalanobis distances on Eng. Cap. using both MidPoints and Log-Ranges

Fig. 14 MidPoints vs. Log-Ranges for Engine Capacity

Fig. 15 ‘Car’ dataset: Robust Mahalanobis distances on variable Top Speed: a Robust Mahalanobis distances on Top Speed MidPoints vs Robust Mahalanobis distances on Top Speed Log-Ranges; b Robust Mahalanobis distances on Top Speed using both MidPoints and Log-Ranges

We then proceed to a variable by variable analysis, again considering MidPoints and Log-Ranges separately or jointly (see Figs. 12, 13, 14, 15 and 16).

Fig. 16 ‘Car’ dataset: Robust Mahalanobis distances on variable Acceleration: a Robust Mahalanobis distances on Acceleration MidPoints versus Robust Mahalanobis distances on Acceleration Log-Ranges; b Robust Mahalanobis distances on Acceleration using both MidPoints and Log-Ranges

Table 13 summarizes the results, indicating which car models are identified as outliers when the full (all MidPoints and Log-Ranges) or partial (only MidPoints, only Log-Ranges, individual variables) descriptions are considered. We note that the car models flagged as outliers vary according to the analysis performed, each providing different insights. Interval outliers may not stand out when MidPoints or Log-Ranges are analysed separately, which puts in evidence the need for their joint analysis. This may be seen, for instance, with variable Engine Capacity, for which no car model was flagged as an outlier either for the MidPoints or for the Log-Ranges (see Fig. 13a), but Porsche is flagged in the joint analysis (see Fig. 13b), because for this model the relation between MidPoints and Log-Ranges deviates from the usual pattern. This can be seen in Fig. 14; Honda NSK also deviates from the main pattern, but not enough to be flagged using the 97.5% quantile (it would indeed be flagged if the corresponding 95% quantile were used).

After removing all six car models identified as possible outliers, the MidPoints and Log-Ranges of the four interval-valued variables pass Mardia’s, Henze-Zirkler’s and Royston’s tests (Korkmaz et al. 2014) for joint multivariate normality, at the \(1\% \) significance level.

The results obtained with the non-parametric approach of Li et al. (2006), using four neighbors, are displayed in Table 14, for different values of k, the fixed number of outliers to be identified. We note that while the method first identifies Ferrari, Porsche and Honda NSK as outliers, when the fixed number of outliers is increased to \(k\,=\,4\) it indicates the Mercedes Class E, which the parametric approach did not flag; Mercedes Class S appears for \(k\,=\,5\), and for \(k\,=\,6\) the method includes the Audi A8, also not flagged by the parametric approach.

Table 13 Outliers in ‘Car’ data set
Table 14 Outliers identified by the distance-based method (Li et al. 2006) in the ‘Car’ data set

Summing up, we observe that six car models are identified as possible outliers and would deserve further attention before proceeding to data analysis. They stand out in different, complementary analyses. At the univariate level, an observation may be considered an outlier due to its MidPoint (Ferrari and Porsche are in this case), to its Range (as is the case for Alfa 166), to both, or to a particular relation between them resulting in an outlying interval (as happens for Porsche, and perhaps Honda NSK, for Engine Capacity). Furthermore, from a multivariate perspective, it is important to distinguish outliers that stand out by their MidPoints (Mercedes Class S), by their Log-Ranges (Skoda Octavia), by both (Ferrari), or by the global relation between all MidPoints and Log-Ranges.

6 Summary and conclusions

A multivariate outlier detection method has been introduced for interval data, which makes use of parametric modeling of the interval-valued variables according to Brito and Duarte Silva (2012). A joint Multivariate Normal distribution of the MidPoints and Log-Ranges was assumed, with different types of restrictions on the parameters. Maximum likelihood estimation of the parameters, as proposed in Brito and Duarte Silva (2012), leads to non-robust estimates which would not be useful for the purpose of outlier detection, since the estimates themselves are affected by the outliers. This is circumvented by using the weighted trimmed likelihood principle for parameter estimation (Hadi and Luceño 1997). A fast algorithm along the lines of Neykov and Müller (2003) has been implemented in the R package MAINT.Data (Duarte Silva and Brito 2017), which makes parameter estimation feasible for real data sets. Multivariate outlier detection is done by computing Mahalanobis distances for the observations, plugging in the robust estimates of location and covariance for MidPoints and Log-Ranges. Cutoff values from relevant distributions then indicate the outlyingness of the observations. These distributions may be the traditional Chi-square or finite-sample approximations following recent proposals (Cerioli 2010; Hardin and Rocke 2005).

The performance of the outlier detection procedure was evaluated by a simulation study, under various data conditions. In the presence of contamination, the advantages of the robust estimators over the non-robust maximum likelihood estimators were clearly demonstrated. The application also showed that robust estimation provides interesting new insights, and that diagnostic plots help in understanding the outlyingness behavior.

To the best of our knowledge, this is the first statistical outlier investigation for interval data. Further steps may consist in applying multivariate techniques, such as principal component analysis or discriminant analysis. It is straightforward to robustify such methods by plugging in the robust parameter estimates proposed in this paper.