1 Introduction

Statistical procedures for testing the equality of two populations of interval-valued data are developed in this paper. In interval-valued data, the variable of interest is a univariate random object on a probability space of intervals and has the form of an interval \((L, U]\) with lower and upper bounds L and U satisfying \(L<U\). More precisely, the lower and upper bounds L and U of the interval are real-valued random variables on the probability space of intervals.

Here, we write interval-valued data in the form of a half-open interval, but the interval could equally be open, closed, or half-open on the other side. In general, interval-valued data can be categorized into two types: min–max (MM) and measurement error (ME) types (Blanco-Fernández and Winker 2016). The MM type assumes that the lower and upper bounds of each interval-valued observation are the minimum and maximum values of the object of interest, respectively. In practice, MM-type data are generated when aggregating large datasets to their minimum and maximum values or when focusing on the range of variation of the variables. A typical example of MM-type data is blood pressure data, where the blood pressure is usually recorded as the minimum and maximum values during a heartbeat cycle. On the other hand, the ME type assumes that there exists a true value which is not directly observable but is only observed as an interval containing it. ME-type data occur when the exact value is unavailable due to confidentiality issues or the use of an insufficiently accurate measurement device. A typical example of ME-type data is the interval-censored data commonly encountered in clinical trials.

While the same notation is used for both the MM-type and ME-type interval-valued data, the analysis and inference for the two types should differ (Blanco-Fernández and Winker 2016; Grzegorzewski 2018). In the ME type, we deal with usual real-valued random variables, but each realization is imprecise and observed only as an interval. Thus, the statistical analysis is based on this imprecise information about the point data. On the other hand, in the MM type, we focus on the random interval itself, not a point value. Thus, the statistical analysis aims at developing probabilistic models for the interval itself by modeling the lower and upper bounds (or the center and half-range) of the interval.

In this paper, we assume that the observed interval-valued data are of MM type and develop statistical procedures to test the equality of two populations of interval-valued data. Among the many statistical procedures available, the comparison of two populations is one of the most fundamental statistical questions. There is a body of literature on testing the equality of two populations for ME-type interval-valued data, which is related to interval-censored data. Most of the existing methods are nonparametric test procedures such as the Wilcoxon test (Perolat et al. 2015; Grzegorzewski and Śpiewak 2017), the U-statistic (Choi et al. 2019), and the sign test (Grzegorzewski and Śpiewak 2019). However, little research has been done in the context of MM-type interval-valued data. The only method we are aware of is the combined test (CB) proposed by Grzegorzewski (2018). To develop more powerful testing procedures, we consider three additional tests. The first is Hotelling's \(T^2\) (HT) test, which exploits the bivariate nature (or bivariate real-valued representation) of interval-valued data. The CB and HT tests are direct applications of existing methods for bivariate data. The other two, newly suggested here, are based on a univariate marginalization (or univariate distributional representation) of interval-valued data that depends on kernelization. To this end, the uniform kernel (UK) and Gaussian kernel (GK) methods of Jeon et al. (2015) are used to estimate the marginal distribution. We suggest using the Kolmogorov–Smirnov (KS) distance between the kernel marginal distributions to test the equality of two populations. The null distribution of the KS distance is approximated by a permutation procedure (Præstgaard 1995).

The remainder of the paper is organized as follows. In Sect. 2, we precisely describe four methods to compare two-sample interval-valued data. In Sect. 3, we compare the performance of the four methods in various settings through a comprehensive simulation study. In Sect. 4, we apply the methods to the blood pressure data of female students in the US. In Sect. 5, we conclude the paper with a summary.

2 Methods

To verify whether there is a significant difference between two populations whose samples are observed as intervals, we study four methods: the CB, HT, UK, and GK tests. For the CB and HT tests, we transform the interval-valued data into a bivariate real-valued vector of the center C and the (logarithm of the) half-range R (\(\log R\)), where \(C = (L+U)/2\) and \(R = (U -L)/2\), in order to remove the constraint \(L<U\) on L and U.
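As a minimal sketch (in Python with NumPy; the function name is ours, not from the paper), the transformation can be written as follows. Since \(R>0\), the pair \((C, \log R)\) is unconstrained in \({\mathbb {R}}^2\).

```python
import numpy as np

def to_center_logrange(lower, upper):
    """Map intervals (L, U] with L < U to the unconstrained pair (C, log R)."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    center = (lower + upper) / 2.0
    log_half_range = np.log((upper - lower) / 2.0)  # R > 0, so log R ranges over all reals
    return np.column_stack([center, log_half_range])
```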

2.1 Combined (CB) test

Let \(F_{\scriptscriptstyle \mathrm C}\) and \(F_{\scriptscriptstyle \mathrm R}\) be the cumulative distribution functions (c.d.f.s) of the center and half-range (or log-transformed half-range), respectively, for one population. We define \(G_{\scriptscriptstyle \mathrm C}\) and \(G_{\scriptscriptstyle \mathrm R}\) similarly for the other population. Assume that random samples of sizes m and n, independent of each other, are observed from the two populations, i.e., \(\big \{(C_{1j}, R_{1j}), j=1,\ldots , m \big \}\) and \(\big \{(C_{2j}, R_{2j}), j=1,\ldots , n\big \}\). Grzegorzewski (2018) suggests verifying the equivalence of the two populations by testing the overall hypothesis below

$$\begin{aligned} {\mathcal {H}}_0: F_{\scriptscriptstyle \mathrm C} = G_{\scriptscriptstyle \mathrm C} ~\text{ and }~ F_{\scriptscriptstyle \mathrm R} = G_{\scriptscriptstyle \mathrm R}. \end{aligned}$$

Grzegorzewski (2018) proposes a KS goodness-of-fit test that applies the usual KS test to the center and half-range individually and combines the respective results.

The KS statistics for each hypothesis \({\mathcal {H}}_{0,C}: F_{\scriptscriptstyle \mathrm C} = G_{\scriptscriptstyle \mathrm C}\) and \({\mathcal {H}}_{0,R}: F_{\scriptscriptstyle \mathrm R} = G_{\scriptscriptstyle \mathrm R}\) are

$$\begin{aligned} T_C= & {} D_{m,n}({\widehat{F}}_{m, \scriptscriptstyle \mathrm C}\;, {\widehat{G}}_{n, \scriptscriptstyle \mathrm C}) = \left( \frac{mn}{m+n} \right) ^{1/2} \sup _{t \in {\mathbb {R}}} |{\widehat{F}}_{m, \scriptscriptstyle \mathrm C} (t)-{\widehat{G}}_{n, \scriptscriptstyle \mathrm C} (t)|, \\ T_R= & {} D_{m,n}({\widehat{F}}_{m, \scriptscriptstyle \mathrm R}\;, {\widehat{G}}_{n, \scriptscriptstyle \mathrm R}) = \left( \frac{mn}{m+n} \right) ^{1/2} \sup _{t \in {\mathbb {R}}} |{\widehat{F}}_{m, \scriptscriptstyle \mathrm R} (t)-{\widehat{G}}_{n, \scriptscriptstyle \mathrm R} (t)|, \end{aligned}$$

where \({\widehat{F}}_{m, \scriptscriptstyle \mathrm C} (t) = ({1}/{m}) \sum _{j=1}^m \mathrm{I}(C_{1j} \le t)\), \({\widehat{F}}_{m, \scriptscriptstyle \mathrm R} (t) = ({1}/{m}) \sum _{j=1}^m \mathrm{I} (R_{1j} \le t)\), \({\widehat{G}}_{n, \scriptscriptstyle \mathrm C} (t) = ({1}/{n}) \sum _{j=1}^n \mathrm{I} (C_{2j} \le t)\), and \({\widehat{G}}_{n, \scriptscriptstyle \mathrm R} (t) = ({1}/{n}) \sum _{j=1}^n \mathrm{I} (R_{2j} \le t)\). The asymptotic null distribution of \(T_C\) (or \(T_R\)) is the Kolmogorov–Smirnov distribution (Feller 1948), where, for every fixed \(z \ge 0\),

$$\begin{aligned} \mathrm{P}\left\{ T_C \le z \right\} \rightarrow ~ L(z) = 1-2 \sum _{j=1}^{\infty } (-1)^{j-1} e ^{-2j^2 z^2}, \end{aligned}$$

as \(m\rightarrow \infty\), \(n\rightarrow \infty\) so that \(m/n \rightarrow a \in (0, \infty )\). In the numerical study and data example, we use the permutation method to estimate the distribution of the test statistic \(T_C\) (or \(T_R\)) because the sample sizes are finite.

To test the overall hypothesis \({\mathcal {H}}_0\), Grzegorzewski (2018) exploits the Bonferroni procedure to combine the p values of \({\mathcal {H}}_{0,C}\) and \({\mathcal {H}}_{0,R}\). To be specific, let \(p_C\) and \(p_R\) be the p values associated with \(T_C\) and \(T_R\), respectively. Then, the overall p value is set as \(p =2 \min (p_C, p_R)\), and we reject \({\mathcal {H}}_0\) if p is small enough, i.e., \(p<\alpha\), where \(\alpha \in (0,1)\) is the significance level.
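The following sketch (Python/NumPy; function names are ours) illustrates the CB test as described above: a scaled two-sample KS statistic for each coordinate, permutation p values, and the Bonferroni combination.

```python
import numpy as np

def ks_stat(x, y):
    """Scaled KS statistic sqrt(mn/(m+n)) * sup_t |F_m(t) - G_n(t)|;
    the sup of the two empirical step functions is attained on the pooled sample."""
    m, n = len(x), len(y)
    grid = np.concatenate([x, y])
    F = np.searchsorted(np.sort(x), grid, side="right") / m
    G = np.searchsorted(np.sort(y), grid, side="right") / n
    return np.sqrt(m * n / (m + n)) * np.abs(F - G).max()

def perm_pvalue(x, y, stat, B=2000, seed=0):
    """Permutation p-value: reassign the pooled observations to groups B times."""
    rng = np.random.default_rng(seed)
    t_obs, z, m = stat(x, y), np.concatenate([x, y]), len(x)
    t_perm = np.empty(B)
    for b in range(B):
        idx = rng.permutation(len(z))
        t_perm[b] = stat(z[idx[:m]], z[idx[m:]])
    return np.mean(t_perm >= t_obs)

def cb_test(C1, R1, C2, R2, B=2000):
    """Combined test: Bonferroni combination p = 2 * min(p_C, p_R)."""
    p_C = perm_pvalue(C1, C2, ks_stat, B)
    p_R = perm_pvalue(R1, R2, ks_stat, B)
    return min(1.0, 2.0 * min(p_C, p_R))
```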

2.2 Hotelling’s \(T^2\) (HT) test

The two-sample HT test is one of the most popular procedures for testing the equality of the mean vectors of two populations. We apply it to the center and log-transformed half-range obtained by transforming the interval data, so the problem becomes two-dimensional. We assume that the bivariate random vectors \({\mathbf {X}}_i=(C_{1i}, \log R_{1i}), i=1,\ldots , m\) (\({\mathbf {Y}}_j=(C_{2j}, \log R_{2j}), j=1,\ldots , n\), respectively) are independently drawn from the population \(N_2(\mathbf {\mu _{{\mathbf {x}}}}, \varSigma _{{\mathbf {x}}})\) (\(N_2(\mathbf {\mu _{{\mathbf {y}}}}, \varSigma _{{\mathbf {y}}})\), respectively), where \(N_2(\mathbf {\mu }, \varSigma )\) denotes the bivariate normal distribution with mean vector \(\mu\) and covariance matrix \(\varSigma\).

2.2.1 Equal covariance case

We assume that the covariances of the two populations are equal, \(\varSigma _{{\mathbf {x}}}=\varSigma _{{\mathbf {y}}}\). Then, the null hypothesis \({\mathcal {H}}_0:\mathbf {\mu _{{\mathbf {x}}}}=\mathbf {\mu _{{\mathbf {y}}}}\) can be tested using \(\mathrm{HT}_\mathrm{eq}\):

$$\begin{aligned} \mathrm{HT}_\mathrm{eq}= \frac{mn}{m+n} {(\overline{{\mathbf {X}}}-\overline{{\mathbf {Y}}})}^{\top } S_p^{-1} (\overline{{\mathbf {X}}}-\overline{{\mathbf {Y}}}), \end{aligned}$$

where \(\overline{{\mathbf {X}}}\) and \(\overline{{\mathbf {Y}}}\) are the sample mean vectors of two samples, respectively, and \(S_p\) is the pooled covariance matrix calculated by

$$\begin{aligned} S_p = \frac{(m-1)S_{{\mathbf {x}}}+(n-1)S_{{\mathbf {y}}}}{m+n-2}, \end{aligned}$$

where \(S_{{\mathbf {x}}}\) and \(S_{{\mathbf {y}}}\) are the sample covariance matrices from \(\mathbf{X}_i\)s and \(\mathbf{Y}_j\)s, respectively. Under the null hypothesis, we know that

$$\begin{aligned} \frac{m+n-3}{2(m+n-2)} \mathrm{HT}_\mathrm{eq} \sim F(2, m+n-3), \end{aligned}$$

where \(F(2, m+n-3)\) is the F-distribution with 2 and \(m+n-3\) degrees of freedom.
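A compact sketch of \(\mathrm{HT}_\mathrm{eq}\) and its F-based p value (Python with NumPy/SciPy; the function name is ours) follows; with \(p=2\) the scaling reproduces the display above.

```python
import numpy as np
from scipy import stats

def hotelling_eq(X, Y):
    """Two-sample Hotelling T^2 with pooled covariance; X is (m x p), Y is (n x p)."""
    m, n, p = len(X), len(Y), X.shape[1]          # here p = 2 for (C, log R)
    d = X.mean(axis=0) - Y.mean(axis=0)
    Sp = ((m - 1) * np.cov(X, rowvar=False)
          + (n - 1) * np.cov(Y, rowvar=False)) / (m + n - 2)
    T2 = (m * n / (m + n)) * d @ np.linalg.solve(Sp, d)
    f_stat = (m + n - p - 1) / (p * (m + n - 2)) * T2
    return T2, stats.f.sf(f_stat, p, m + n - p - 1)
```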

2.2.2 Unequal covariance case

If \(\varSigma _{{\mathbf {x}}} \ne \varSigma _{{\mathbf {y}}}\), the HT statistic given by

$$\begin{aligned} \mathrm{HT}_\mathrm{un} = {(\overline{{\mathbf {X}}}-\overline{{\mathbf {Y}}})}^{\top } \left(\frac{S_{{\mathbf {x}}}}{m}+\frac{S_{{\mathbf {y}}}}{n}\right)^{-1} {(\overline{{\mathbf {X}}}-\overline{{\mathbf {Y}}})} \end{aligned}$$

approximately follows the null distribution below:

$$\begin{aligned} \frac{m+n-3}{2(m+n-2)} \mathrm{HT}_\mathrm{un} \sim F(2, \nu ), \end{aligned}$$

where \(\nu\) is an appropriately defined degrees of freedom.
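Since \(\nu\) is left unspecified here, the sketch below (Python/NumPy; the construction is our own) sidesteps the F approximation and instead approximates the null distribution of \(\mathrm{HT}_\mathrm{un}\) by permutation, as is done elsewhere in the paper when distributional assumptions fail.

```python
import numpy as np

def hotelling_un(X, Y):
    """Behrens-Fisher type statistic with unpooled covariances."""
    d = X.mean(axis=0) - Y.mean(axis=0)
    V = np.cov(X, rowvar=False) / len(X) + np.cov(Y, rowvar=False) / len(Y)
    return d @ np.linalg.solve(V, d)

def perm_pvalue_rows(X, Y, stat, B=2000, seed=0):
    """Permutation p-value obtained by reassigning rows of the pooled sample."""
    rng = np.random.default_rng(seed)
    t_obs, Z, m = stat(X, Y), np.vstack([X, Y]), len(X)
    t_perm = np.empty(B)
    for b in range(B):
        idx = rng.permutation(len(Z))
        t_perm[b] = stat(Z[idx[:m]], Z[idx[m:]])
    return np.mean(t_perm >= t_obs)
```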

2.3 Marginalization-based test

In this section, we propose two-step marginalization-based approaches to test the equality of two interval-valued samples. First, we find a univariate distributional representation, which we call marginalization, that summarizes the interval-valued sample with a single real-valued variable. Then, we adopt a procedure to compare two univariate distributions. Here, we should remark that the marginalization above is a univariate real-valued representation of an interval, not the usual marginalization of a bivariate real-valued representation of the interval, e.g., \((L, U)\) or \((C, R)\).

2.3.1 Two marginalizations

We first introduce two popular marginalization methods: an empirical histogram estimator (also known as a marginal histogram estimator or a kernel estimator with the uniform kernel) and a Gaussian kernel estimator. Suppose we observe n independent intervals \(\big \{ I_{i} =(\ell _i, u_i] , i=1, \ldots , n \big \}\). The estimator with the uniform kernel (Bertrand and Goupil 2000) for a univariate density of interval-valued data is

$$\begin{aligned} f_n^{\scriptscriptstyle UK}(g) = \frac{1}{n} \sum _{i=1}^n \frac{1}{u_i-\ell _i} \mathrm{I}(\ell _i < g \le u_i). \end{aligned}$$
(1)

The rationale behind (1) is that the value of a univariate representation of \(I_i\) is uniformly distributed on the interval \((\ell _i, u_i]\). Thus, the marginalization is represented as an equally weighted mixture of n uniform distributions. We refer to this estimator as the uniform kernel (UK) estimator.
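A direct implementation of (1) is short (Python/NumPy; the function name is ours): each interval contributes a uniform density on \((\ell _i, u_i]\), and the estimator averages these contributions.

```python
import numpy as np

def f_uk(g, lo, up):
    """Uniform kernel (UK) density (1) at points g, given interval bounds lo, up."""
    lo, up = np.asarray(lo, float), np.asarray(up, float)
    g = np.asarray(g, float)[:, None]          # shape (n_grid, 1) for broadcasting
    inside = (lo < g) & (g <= up)              # indicator of g in (lo_i, up_i]
    return (inside / (up - lo)).mean(axis=1)   # average of n uniform densities
```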

Jeon et al. (2015) improve the uniform kernel estimator by imposing some structure on the distribution of the data. The proposed estimator is a mixture of n univariate normal densities. That is,

$$\begin{aligned} f_n^{\scriptscriptstyle GK}(g;h) = \frac{1}{n} \sum _{k=1}^n \phi (g |{\hat{\mu }}_k(h), {\hat{\sigma }}_k(h)), \end{aligned}$$
(2)

where h is a bandwidth and \(\phi (\cdot |{\hat{\mu }}_k(h), {\hat{\sigma }}_k(h))\) is the univariate normal density with mean \({\hat{\mu }}_k(h)\) and standard deviation \({\hat{\sigma }}_k(h)\) computed by

$$\begin{aligned}&{\hat{\mu }}_{k}(h) = \sum _{i=1}^n w_{ki}(h)m_{i} , \;\;&{\hat{\sigma }}_{k}^2(h) = \sum _{i=1}^n w_{ki}(h) v_{i}, \\&m_{i} = ( \ell _i+u_i)/2 , \;\; v_{i} =(u_i-\ell _i)^2/12. \end{aligned}$$

For the given bandwidth h, the local weights \(w_{ki}(h)\) are determined as follows.

Using the centers of the intervals, we calculate the Euclidean distance between the kth and ith intervals, say \(d_{ki}(=d_{ik})\). Let \(R_{ki}\) be the rank of \(d_{ki}\) (in increasing order) among \(\big \{d_{k1},d_{k2},\ldots , d_{kn} \big \}\), with \(R_{kk}=1\). The weights are determined such that

$$\begin{aligned} w_{ki}(h) \propto \frac{1}{h} K\left( \frac{R_{ki}-1}{h} \right) ~~\text {and}~ \sum _{i=1}^n w_{ki}(h)=1, \end{aligned}$$

where K is the standard normal density. Jeon et al. (2015) propose to select h that minimizes the Kullback–Leibler loss between the uniform kernel estimator in (1) and the Gaussian kernel estimator in (2). Details can be found in Jeon et al. (2015). We refer to this estimator (2) as the Gaussian kernel estimator (GK).
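The construction above can be sketched as follows (Python with NumPy/SciPy; function names are ours, and the tie-breaking of equal distances is an implementation choice). The rank-based Gaussian weights, the component parameters \({\hat{\mu }}_k(h)\) and \({\hat{\sigma }}_k(h)\), and the mixture density (2) appear in turn.

```python
import numpy as np
from scipy.stats import rankdata, norm

def gk_components(lo, up, h):
    """Component means and sds for the GK estimator (2) at bandwidth h."""
    lo, up = np.asarray(lo, float), np.asarray(up, float)
    m = (lo + up) / 2.0                        # midpoints m_i
    v = (up - lo) ** 2 / 12.0                  # within-interval variances v_i
    d = np.abs(m[:, None] - m[None, :])        # distances d_ki between centers
    R = rankdata(d, axis=1, method="ordinal")  # ranks; R_kk = 1 since d_kk = 0
    w = norm.pdf((R - 1) / h) / h              # K((R_ki - 1)/h)/h, K standard normal
    w /= w.sum(axis=1, keepdims=True)          # each row of weights sums to one
    return w @ m, np.sqrt(w @ v)               # mu_k(h), sigma_k(h)

def f_gk(g, lo, up, h):
    """GK density (2): equally weighted mixture of n normal components."""
    mu, sigma = gk_components(lo, up, h)
    g = np.asarray(g, float)[:, None]
    return norm.pdf(g, loc=mu, scale=sigma).mean(axis=1)
```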

2.3.2 Test statistic

Let us consider two independent samples of random intervals: the first sample \({\mathbf {X}}_1, \ldots , {\mathbf {X}}_m\) is drawn from the population with c.d.f. \(F(\ell _1, u_1)\), where \(\ell _1\) and \(u_1\) denote the lower and upper bounds of the interval \({\mathbf {X}}\), respectively. The second sample \({\mathbf {Y}}_1, \ldots , {\mathbf {Y}}_n\) comes from the population with c.d.f. \(G(\ell _2,u_2)\), where \(\ell _2\) and \(u_2\) are defined similarly for \({\mathbf {Y}}\). We check the equality of the distributions, \({\mathcal {H}}_0 : F=G\), by using the univariate marginal estimators introduced previously. In other words, we compare \(F_{\scriptscriptstyle \mathrm M}\) and \(G_{\scriptscriptstyle \mathrm M}\), where \(F_{\scriptscriptstyle \mathrm M}\) and \(G_{\scriptscriptstyle \mathrm M}\) are the marginal distributions of \(F(\ell _1, u_1)\) and \(G(\ell _2, u_2)\), respectively.

We consider two types of test statistics, UK and GK, based on the UK and GK estimators, respectively. For the UK type, the test statistic \(T_{\scriptscriptstyle M}^{UK}\) is analogous to the KS statistic and is defined as follows:

$$\begin{aligned} T_{\scriptscriptstyle \mathrm M}^{UK} = D_{m,n}({\widehat{F}}_{{\scriptscriptstyle \mathrm M}, m}^{UK}\;, {\widehat{G}}_{{\scriptscriptstyle \mathrm M},n}^{UK}) = \left( \frac{mn}{m+n} \right) ^{1/2} \sup _{t \in {\mathbb {R}}} |{\widehat{F}}_{{\scriptscriptstyle \mathrm M}, m}^{UK} (t)-{\widehat{G}}_{{\scriptscriptstyle \mathrm M}, n}^{UK} (t)|, \end{aligned}$$
(3)

where \({\hat{F}}_{M,m}^{UK}\) and \({\hat{G}}_{M,n}^{UK}\) are the UK estimators of the marginal cumulative distribution functions \(F_M\) and \(G_M\) based on \((\mathbf{X}_1, \ldots, \mathbf{X}_m)\) and \((\mathbf{Y}_1,\ldots , \mathbf{Y}_n)\), respectively. The estimator \({\hat{F}}_{M,m}^{UK}\) of \(F_M\) is defined through the estimated density function \({\hat{f}}_m^{UK}\) as follows:

$$\begin{aligned} {\hat{F}}^{UK}_{M,m}(t) = \int _{-\infty }^{t} {\hat{f}}_m^{UK}(x) \,dx = \frac{1}{m} \sum _{i=1}^{m} \left\{ \mathrm{I}(U_{1i}< t) + \frac{t - L_{1i}}{U_{1i}-L_{1i}} \mathrm{I}(L_{1i} < t \le U_{1i})\right\} , \end{aligned}$$

where \(L_{1i}\) and \(U_{1i}\) are the lower and upper bounds of the ith observed interval of \(\mathbf{X}_i\).
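Because both UK c.d.f. estimators are piecewise linear with knots at the interval endpoints, the supremum in (3) is attained at the pooled endpoints, so the statistic admits an exact finite computation. A minimal sketch (Python/NumPy; function names are ours):

```python
import numpy as np

def F_uk(t, lo, up):
    """Closed-form UK marginal c.d.f. from the display above."""
    t = np.asarray(t, float)[:, None]
    part = np.clip((t - lo) / (up - lo), 0.0, 1.0)  # 0 below lo, linear inside, 1 above up
    return part.mean(axis=1)

def t_uk(lo1, up1, lo2, up2):
    """UK-type statistic (3); the sup is attained on the pooled interval endpoints."""
    m, n = len(lo1), len(lo2)
    grid = np.concatenate([lo1, up1, lo2, up2])
    diff = np.abs(F_uk(grid, lo1, up1) - F_uk(grid, lo2, up2))
    return np.sqrt(m * n / (m + n)) * diff.max()
```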

To develop the GK-type test statistic, we adopt only the structure of the Gaussian kernel estimator for interval-valued data proposed in Jeon et al. (2015). We define the test statistic \(T_{\scriptscriptstyle M}^{GK}\) as the maximal distance between \({\hat{F}}_{M,m}^{GK}(t;h)\) and \({\hat{G}}_{M,n}^{GK}(t;h)\) with respect to t, evaluated at the common bandwidth h that maximizes \(\sup _t | {\hat{F}}_{M,m}^{GK}(t;h) - {\hat{G}}_{M,n}^{GK}(t;h)|\). Here, \({\hat{F}}_{M,m}^{GK}(\cdot ;h)\) and \({\hat{G}}_{M,n}^{GK}(\cdot ;h)\) are the GK estimators of the marginal cumulative distribution functions \(F_M\) and \(G_M\) for a given common bandwidth h, based on \((\mathbf{X}_1, \ldots, \mathbf{X}_m)\) and \((\mathbf{Y}_1,\ldots , \mathbf{Y}_n)\), respectively. The estimator \({\hat{F}}_{M,m}^{GK}(\cdot ;h)\) of \(F_M\) is obtained from the estimated density function \({\hat{f}}_m^{GK}(\cdot ;h)\) as follows:

$$\begin{aligned} {\hat{F}}^{GK}_{M,m}(t;h) = \int _{-\infty }^{t} {\hat{f}}_m^{GK}(x;h) \,dx = \frac{1}{m} \sum _{i=1}^{m} \varPhi (t|{\hat{\mu }}_i(h), {\hat{\sigma }}_i(h)), \end{aligned}$$

where \(\varPhi (t | {\hat{\mu }}_i(h), {\hat{\sigma }}_i(h))\) is the cumulative distribution function of the normal distribution with mean \({\hat{\mu }}_i(h)\) and variance \({\hat{\sigma }}_i^2(h)\). To be specific, we first choose the common bandwidth \(h_{\max }\) such that

$$\begin{aligned} h_{\max } = \text {argmax}_{h} \sup _{t \in {\mathbb {R}}}|{\widehat{F}}_{{\scriptscriptstyle \mathrm M}, m}^{GK}(t;h)-{\widehat{G}}_{{\scriptscriptstyle \mathrm M}, n}^{GK}(t;h)|. \end{aligned}$$

Then, the test statistic \(T_{\scriptscriptstyle \mathrm M}^{GK}\) for the GK type is defined as follows:

$$\begin{aligned} T_{\scriptscriptstyle \mathrm M}^{GK}= \left( \frac{mn}{m+n} \right) ^{1/2} \sup _{t \in {\mathbb {R}}} |{\widehat{F}}_{{\scriptscriptstyle \mathrm M}, m}^{GK} (t;h_{\max })-{\widehat{G}}_{{\scriptscriptstyle \mathrm M}, n}^{GK} (t;h_{\max })|. \end{aligned}$$
(4)

We propose the GK-type test with the bandwidth \(h_{\max }\) because, in our numerical study, it attains power similar to the GK-type test with the bandwidth selection of Jeon et al. (2015) for center changes and higher power for range changes (see Appendix 2). However, it is worth noting that the proposed bandwidth \(h_{\max }\) does not guarantee better performance for density estimation than the bandwidth selection of Jeon et al. (2015). In addition, the common bandwidth selection for the GK-type test statistic does not require the calculation of the cross-validated Kullback–Leibler loss used in Jeon et al. (2015). Hence, we can considerably reduce the computational cost of evaluating the p value by the permutation procedure, whereas we would need to choose the optimal bandwidth h for every permutation if the test statistic were defined with the cross-validated optimal bandwidth.
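A sketch of (4) follows (Python with NumPy/SciPy; it reuses gk_components from the sketch in Sect. 2.3.1, and the bandwidth grid and evaluation grid are our own choices, since the paper leaves them unspecified).

```python
import numpy as np
from scipy.stats import norm

def F_gk(t, lo, up, h):
    """GK marginal c.d.f.: average of the n component normal c.d.f.s."""
    mu, sigma = gk_components(lo, up, h)     # defined in the earlier sketch
    return norm.cdf(np.asarray(t, float)[:, None], loc=mu, scale=sigma).mean(axis=1)

def t_gk(lo1, up1, lo2, up2, h_grid=tuple(range(1, 21)), n_grid=512):
    """GK-type statistic (4): maximize the sup distance jointly over t and h."""
    m, n = len(lo1), len(lo2)
    a, b = min(lo1.min(), lo2.min()), max(up1.max(), up2.max())
    t = np.linspace(a - (b - a), b + (b - a), n_grid)   # generous evaluation grid
    best = max(np.abs(F_gk(t, lo1, up1, h) - F_gk(t, lo2, up2, h)).max()
               for h in h_grid)
    return np.sqrt(m * n / (m + n)) * best
```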

2.3.3 Permutation procedure to approximate the null distribution

We use the permutation method to estimate the sampling distribution of the test statistic (3) under the null \({\mathcal {H}}_0\). The permutation procedure is straightforward and briefly described as follows. For the bth permutation, we combine all \(m+n\) observations from both groups and then randomly take m observations without replacement. These form the first group, and the remaining n observations are set as the second group. We compute the test statistic \(t_{{\scriptscriptstyle \mathrm M}, b}\) as in (3) using these permuted samples and repeat this procedure B times. The permutation distribution of the test statistic \(T_{\scriptscriptstyle \mathrm M}\) is given by the empirical distribution of \(t_{{\scriptscriptstyle \mathrm M},1}, \ldots ,t_{{\scriptscriptstyle \mathrm M},{\scriptscriptstyle \mathrm B}}\). Now, let \(t_{\scriptscriptstyle \mathrm M}^\mathrm{obs}\) be the observed test statistic from the original two samples. The permutation-based p value for the hypothesis \({\mathcal {H}}_0\) is

$$\begin{aligned} p = \frac{\sum _{b=1}^B \mathrm{I}( t_{{\scriptscriptstyle \mathrm M}, b} \ge t_{\scriptscriptstyle \mathrm M}^\mathrm{obs})}{B}. \end{aligned}$$

In the numerical study, since we know the underlying distribution (normal or t, say), the reference distribution can be better approximated by generating random samples from the known distribution under the null rather than by permuting the observed samples.
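For the interval-based statistics, the permutation keeps each \((L, U)\) pair intact while reassigning whole intervals to groups. A sketch (Python/NumPy; the function name is ours) applicable to t_uk or t_gk from the earlier sketches:

```python
import numpy as np

def perm_pvalue_intervals(lo1, up1, lo2, up2, stat, B=2000, seed=0):
    """Permutation p-value for an interval-based two-sample statistic."""
    rng = np.random.default_rng(seed)
    lo, up = np.concatenate([lo1, lo2]), np.concatenate([up1, up2])
    m = len(lo1)
    t_obs = stat(lo1, up1, lo2, up2)
    t_perm = np.empty(B)
    for b in range(B):
        idx = rng.permutation(len(lo))     # reassign whole intervals, not separate bounds
        t_perm[b] = stat(lo[idx[:m]], up[idx[:m]], lo[idx[m:]], up[idx[m:]])
    return np.mean(t_perm >= t_obs)
```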

3 Numerical study

In this section, we compare the finite-sample performance of the four methods described in the previous section. We generate interval-valued variables by generating a bivariate real-valued random vector \((C, \log R)\) under various situations. Each situation depends on different factor(s) that induce a difference between the two populations, where the magnitude of the difference is controlled by \(\delta =0, 0.5, 1, 1.5\). Under this setting, the null hypothesis is expressed as \({\mathcal {H}}_0: \delta = 0\) for all four tests. Thus, when \(\delta =0\), we examine the size of each test, while for \(\delta >0\), we assess the power of the competing tests. For the sample sizes, we consider the following four cases: \((m, n)=(30, 30), ~(30, 120), ~(50, 50), ~(50, 200)\). To investigate the effect of correlation between the center and range, we use three values of the correlation parameter, \(\rho =0,~0.4, ~0.8\). All other settings we consider for the study are summarized in Table 1. The generative model of each simulation is given at the beginning of the corresponding subsection.

For the test statistics \(T_C\), \(T_{R}\), and \(T_{\scriptscriptstyle \mathrm M}\), we numerically approximate their null distributions by generating m and n samples under the null and calculating the corresponding test statistics. We repeat this procedure 20,000 times to obtain the reference distributions. For \(\mathrm{HT}_\mathrm{eq}\) and \(\mathrm{HT}_\mathrm{un}\), the simulated distribution is obtained similarly if a setting does not meet the underlying assumptions of the HT test.

The significance level \(\alpha\) is set as \(5\%\). The size and power of each test are evaluated as the rejection rate through 2000 repetitions.

Table 1 Summary of the settings. In the first column, the character to the left of the hyphen (-) denotes the distribution of \((C, \log R)\): N for “normal”, T for “t with df 5”, and SN for “skew-normal”. The character to the right represents the source of the difference between the two populations: C for “mean of center”, R for “mean of range”, C.S for “mean and skewness of center”, COV for “covariance”, C.V for “mean and variance of center”, and R.V for “mean and variance of range”. Each population (\(i=1,2\)) is denoted by \(\varPi _i\) with parameters \(\mu _i\) (mean), \(\varSigma _i\) (covariance matrix), and \(\gamma _i\) (skewness). We define \(\varSigma =(1 ~~\rho ~; \rho ~~ 1)\)

3.1 Normal distribution with equal covariances

We set a bivariate normal distribution for the center and log-transformed half-range. We compare the rejection power of the four tests by varying the mean vector of the second population, assuming that the covariances of the two populations are equal. Denoting the first population by \(\varPi _1\) and the second by \(\varPi _2\), the setting is expressed as follows:

$$\begin{aligned} \varPi _1: \begin{pmatrix} C_1 \\ \log R_1 \end{pmatrix} ~~ \sim ~~ N_2(\mu _1, \varSigma _1), \quad \varPi _2: \begin{pmatrix} C_2 \\ \log R_2 \end{pmatrix} ~~ \sim ~~ N_2(\mu _2, \varSigma _2), \end{aligned}$$

where the mean and covariance parameters are

$$\begin{aligned}&\mu _1= (0,0)^{\top }, ~~ \varSigma _1= \varSigma _2= \begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix}, \\ \text {(N-C) }&\mu _2= (\delta , 0)^{\top }, \text { or } \text {(N-R) }\mu _2= (0, \delta )^{\top }. \end{aligned}$$

Note that the mean vector of the second population (\(\varPi _2\)) is set to either \((\delta , 0)\) or \((0, \delta )\). The reason for varying the mean of the center and of the half-range separately is that they affect the rejection power differently, as explained later.
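A sketch of this generative mechanism (Python/NumPy; function names are ours) draws \((C, \log R)\) from the bivariate normal model and maps back to interval bounds:

```python
import numpy as np

def draw_intervals(size, mu, rho, rng):
    """Draw (C, log R) ~ N_2(mu, Sigma) with unit variances and correlation rho,
    then recover the interval bounds (L, U) = (C - R, C + R)."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    C, logR = rng.multivariate_normal(mu, Sigma, size=size).T
    R = np.exp(logR)
    return C - R, C + R

rng = np.random.default_rng(0)
lo1, up1 = draw_intervals(30, [0.0, 0.0], rho=0.4, rng=rng)  # Pi_1
lo2, up2 = draw_intervals(30, [0.5, 0.0], rho=0.4, rng=rng)  # Pi_2, case (N-C), delta = 0.5
```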

Table 2 Size and power of each test in the case of the bivariate normal distribution with equal covariances. Numbers in bold denote the highest power among the CB, HT, UK, and GK tests for each simulation setting, except at δ = 0, where the numbers denote the size of the test

We first explain general trends across the methods. Looking at the null case \(\delta =0\) in Table 2, the size of each test is well controlled, since the rejection rate is close to the significance level \(\alpha =0.05\) in all cases. Under the alternative hypothesis (\(\delta >0\)), we see in every setting that the larger \(\delta\) is, the greater the probability of rejection. Similarly, each test becomes more powerful as more samples are available.

To summarize the winners in the case \(\rho =0\), the HT test shows the highest power among the four tests in both cases (N-C) and (N-R). This is natural considering that the other methods test the equality of distributions, while the HT test only compares the mean vectors of the two populations. In addition, the data-generating setting (a bivariate normal distribution with equal covariances) satisfies the underlying assumptions of the HT test. Note that in case (N-C), where the two distributions differ in the mean of the center, the two marginal tests are comparable to the HT test and perform better than the CB test. However, in case (N-R), where the mean vectors differ in the range, the result is reversed, i.e., the CB test performs better than the marginal tests.

Looking closely at the properties of each test, for the CB and HT tests the power in case (N-C) is almost the same as the power in case (N-R) under the same simulation parameters. This is natural because both tests are designed to treat the center and range with equal priority. On the other hand, for the marginal tests, the power in case (N-C) is much higher than in case (N-R), especially when \(\delta\) is small. This implies that the two marginalization methods, the UK and GK, are more sensitive to changes in the center than in the range. In addition, the marginalization-based tests show much lower power than the CB and HT tests. Thus, the two-dimensional test procedures are preferable to the marginalization-based tests when the difference between the two distributions is caused by a difference in the range of the interval. However, it is worth noting the performance of the marginalization-based tests in case (N-R) with \(\rho =0\): even though the range and center are independent, the power of the GK and UK tests approaches 1 as \(\delta\) grows. It should also be noted that the GK and UK perform similarly in case (N-C), but the GK performs much better than the UK in case (N-R).

Now, we examine the effect of the correlation on the power of each test. In general, a larger correlation results in higher power for each test. This phenomenon can be explained using the Mahalanobis distance between the two mean vectors of \(\varPi _1\) and \(\varPi _2\). In case (N-C), for instance, the distance is \((\delta , 0){\begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix}}^{-1} (\delta , 0)^{\top } = \delta ^2 /(1-\rho ^2)\), which increases as \(\rho\) gets larger. Specifically, when \(\rho\) is 0, 0.4, and 0.8, the corresponding distance is \(\delta ^2\), \(1.2\delta ^2\), and \(2.8\delta ^2\), respectively. Thus, the two population distributions are more easily distinguished from each other, especially when \(\rho = 0.8\). However, the effect of \(\rho\) on power differs across tests. The HT test shows the largest increase in power among the four tests as \(\rho\) increases, which is reasonable considering that the HT statistic takes the form of a Mahalanobis distance between two mean vectors. Next come the UK and GK tests, which show similar increases. On the other hand, the power of the CB test hardly changes. Hereafter, we omit further discussion of \(\rho\), since the interpretation of its effect is almost the same in most of the following settings. Thus, the case \(\rho = 0\) will be discussed mainly.

3.2 Non-normal cases

We examine the size and power in terms of tail thickness and skewness of an underlying bivariate distribution for the center and log-transformed half-range.

3.2.1 Thickness of the tail

We use a bivariate t-distribution with 5 degrees of freedom, denoted by \(t_5\), which has thicker tails than the normal distribution. We assume the two populations have equal covariance matrices. Other details of the setup are identical to the normal case. That is,

$$\begin{aligned} \varPi _1: \begin{pmatrix} C_1 \\ \log R_1 \end{pmatrix} ~~ \sim ~~ t_5(\mu _1, \varSigma _1), \quad \varPi _2: \begin{pmatrix} C_2 \\ \log R_2 \end{pmatrix} ~~ \sim ~~ t_5(\mu _2, \varSigma _2), \end{aligned}$$

where the mean and covariance parameters are

$$\begin{aligned}&\mu _1= (0,0)^{\top }, ~~ \varSigma _1=\varSigma _2= \begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix} \\ \text {(T-C) }&\mu _2= (\delta , 0)^{\top }, \text { or } \text {(T-R) } \mu _2= (0, \delta )^{\top }. \end{aligned}$$

Since the Gaussian assumption is broken, the null distribution of \(\mathrm{HT}_\mathrm{eq}\) is calculated by the permutation method as mentioned earlier.

Table 3 Size and power of each test in the case of the bivariate t-distribution with 5 degrees of freedom and equal covariances. Numbers in bold denote the highest power among the CB, HT, UK, and GK tests for each simulation setting, except at δ = 0, where the numbers denote the size of the test

First of all, Table 3 shows that the testing power decreases overall compared to the normal case. Next, in the case \(\rho =0\), the UK test outperforms the other three tests in case (T-C), while in case (T-R) the CB test is the most powerful, unlike the normal case where the HT test shows the highest power. The performance degradation of the HT test is expected since the Gaussian assumption is not satisfied. Third, in case (T-C), the power of the UK test uniformly dominates that of the GK test, in contrast to their similar performance in the normal case (N-C). The weaker performance of the GK test is attributed to its dependency on the Gaussian kernel. Finally, as in the previous results, in case (T-R) the performance of the marginal tests is much worse than that of the two other tests except for large \(\delta =1.5\). Meanwhile, when the center and range are highly correlated (\(\rho =0.8\)), the HT test performs better than the others. This is because, as \(\rho\) gets larger, the increase in power of the HT test is more substantial than that of the other tests, as explained before.

3.2.2 Skewness

We generate the center and log-transformed half-range from the following bivariate skew-normal distribution. We use a centered parameterization to fix the marginal parameters at prescribed values (Azzalini and Capitanio 1999). That is,

$$\begin{aligned} \begin{pmatrix} C \\ \log R \end{pmatrix} ~~ \sim ~~ SN\left[ \mathbf {\mu }=\begin{pmatrix} \mu _{\scriptscriptstyle \mathrm C} \\ \mu _{\scriptscriptstyle \mathrm R} \end{pmatrix}\;, \varSigma =\begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix} \;, \mathbf {\gamma }=\begin{pmatrix} \gamma _{\scriptscriptstyle \mathrm C} \\ \gamma _{\scriptscriptstyle \mathrm R} \end{pmatrix} \right] , \end{aligned}$$

where \((\gamma _{\scriptscriptstyle \mathrm C}, \gamma _{\scriptscriptstyle \mathrm R})^{\top }\) represents the skewnesses of the marginal distributions of the center and log-transformed half-range, respectively. For the sake of simplicity, we consider only two sample-size cases, \((m,n)=(30,30),(30, 120)\), and the case of a different mean of the center. We additionally include the case where the skewness and mean of the center vary together, which is motivated by the real data example described in the next section.

  • (SN-C) The mean of the center differs while the covariance and skewness are the same in the two populations:

    $$\begin{aligned}&\varPi _1 : \mu _1=(0,0)^{\top }, ~~\varSigma _1= \begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix}, ~~\gamma _1=(-0.6, -0.1)^{\top }&\\&\varPi _2 : \mu _2=(\delta , 0)^\top , ~~\varSigma _2= \varSigma _1, ~~ \gamma _2=\gamma _1.&\end{aligned}$$
  • (SN-C.S) The skewness as well as the mean of the center differ between the two populations, and the two covariances are equal:

    $$\begin{aligned}&\varPi _1 : \mu _1= (0,0)^{\top }, ~~\varSigma _1= \begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix}, ~~\gamma _1=(0, -0.1)^{\top }&\\&\varPi _2 : \mu _2= (\delta , 0)^\top , ~~\varSigma _2= \varSigma _1, ~~ \gamma _2=(-2\delta /5, -0.1)^{\top }.&\end{aligned}$$
Table 4 Size and power of each test in the case of the bivariate skew-normal distribution with equal covariances. Numbers in bold denote the highest power among the CB, HT, UK, and GK tests for each simulation setting, except at δ = 0, where the numbers denote the size of the test

Table 4 shows that case (SN-C) is similar to the normal case in that the HT test performs best and the power of the two marginal tests is better than that of the CB test. In case (SN-C.S), we let the skewness of the center of the second population vary with \(\delta\) so that its marginal distribution becomes more left-skewed. We find that when the correlation is small (\(\rho =0\)), the UK and GK tests are superior to the other two tests, unlike in the previous case (SN-C); under the highly correlated structure (\(\rho =0.8\)), however, the HT test is the most powerful, as before.

3.3 Normal distribution with unequal covariances

We again set a bivariate normal distribution for the center and log-transformed half-range, but this time we assume that the covariances of the two populations are unequal. We consider the following four cases, one of which reflects characteristics of the real data example. For simplicity, we use two sample-size cases: \((m,n)=(30,30), (30,120)\).

  • (N-COV) The covariance matrices are unequal while the mean vectors are equal:

    $$\begin{aligned}&\varPi _1 : \mu _1=(0, 0)^{\top }, ~~\varSigma _1= \begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix}&\\&\varPi _2 : \mu _2=(0, 0)^{\top }, ~~\varSigma _2= (1+\delta )\varSigma _1.&\end{aligned}$$
  • (N-C.V1) The mean and variance of the center are different in two populations. In the second population, both the mean and variance of the center increase:

    $$\begin{aligned}&\varPi _1 : \mu _1= (0, 0)^{\top }, ~~\varSigma _1= \begin{pmatrix} 1 &{} \rho \\ \rho &{} 1 \end{pmatrix}&\\&\varPi _2 : \mu _2 = (\delta , 0)^{\top }, ~~\varSigma _2= \begin{pmatrix} 1+2\delta &{} \sqrt{1+2\delta }\rho \\ \sqrt{1+2\delta }\rho &{} 1 \end{pmatrix} .&\end{aligned}$$
  • (N-C.V2) In the second population, the mean of center increases while the variance of center decreases:

    $$\begin{aligned}&\varPi _1 : \mu _1 = (0, 0)^{\top }, ~~\varSigma _1= \begin{pmatrix} 4&{} 2\rho \\ 2\rho &{} 1 \end{pmatrix}&\\&\varPi _2 : \mu _2 = (\delta , 0)^{\top }, ~~\varSigma _2= \begin{pmatrix} 4-2\delta &{} \sqrt{4-2\delta }\rho \\ \sqrt{4-2\delta }\rho &{} 1 \end{pmatrix} .&\end{aligned}$$
  • (N-R.V) The mean and variance of the range differ in two populations. In the second population, both mean and variance of the range increase:

    $$\begin{aligned}&\varPi _1 : \mu _1 = (0, 0)^{\top }, ~~\varSigma _1= \begin{pmatrix} 1&{} \rho \\ \rho &{} 1 \end{pmatrix}&\\&\varPi _2 : \mu _2 = (0, \delta )^{\top }, ~~\varSigma _2= \begin{pmatrix} 1 &{} \sqrt{1+2\delta }\rho \\ \sqrt{1+2\delta }\rho &{} 1+2\delta \end{pmatrix} .&\end{aligned}$$
Table 5 Size and power of each test in the case of the bivariate normal distribution with unequal covariances. Numbers in bold denote the highest power among the CB, HT, UK, and GK tests for each simulation setting, except at δ = 0, where the numbers denote the size of the test

As mentioned earlier, we interpret the cases with \(\rho = 0\) based on Table 5. The most interesting result is case (N-COV), where the marginal tests have much higher power than the two other tests and the GK outperforms the UK. This means that the marginal tests, especially the GK test, detect differences in covariance more effectively than the other tests. On the contrary, the HT test, which targets the difference between two mean vectors, is incapable of detecting covariance differences between the two populations, as its power equals its size. In cases (N-C.V1) and (N-C.V2), where the variance of the center in the second population varies (increases or decreases) together with the mean change, the marginal tests perform best, in contrast to case (N-C) where the HT test is the best. Finally, in case (N-R.V), where both the mean and variance of the range are controlled, the GK test shows much higher power than the other tests for \(\rho = 0, 0.4\), unlike its poor performance in case (N-R). For \(\rho =0.8\), the HT test still has the highest power, as in case (N-R). Indeed, when \(\rho = 0.8\), the HT test has the highest power in all cases but (N-COV), where there is no difference between the two mean vectors.

Table 6 Summary of the results for the significance levels 1%, 5%, and 10%. The best and worst tests are reported for each case. In the second column, the character to the left of the hyphen (-) denotes the distribution of \((C, \log R)\) and the character to the right represents the source of the difference between the two populations

We summarize the numerical study in Table 6, which shows the best and worst methods in each case. Our two major findings are as follows. First, when the center and range are highly correlated, the HT test performs best overall. Second, the marginal tests, the UK and GK, show higher power than the other methods if the two distributions differ in more than one factor (mean, covariance, skewness, etc.). In addition, the marginal tests tend to detect differences in the center better than differences in the range. Note that the results of the numerical study for the significance levels 1% and 10% are reported in Tables 10, 11, 12, 13, 14, 15, 16 and 17 in Appendix 1.

4 Data example

We conduct a real data analysis using the methods discussed in this paper. We use data from the National Heart, Lung, and Blood Institute Growth and Health Study (NGHS), a cohort study investigating temporal trends in cardiovascular risk factors, such as systolic and diastolic blood pressure (SBP, DBP), through up to ten annual visits of 2379 African–American and Caucasian girls. Blood pressure (BP), measured at two levels (DBP and SBP), is an example of MM-type interval-valued data. The goal of our real data analysis is to find differences in the BP distributions between African–American and Caucasian girls at the initial visit of the study.

After removing subjects with missing measurements, the total number of subjects remaining is \(N=2256\) (\(m= 1112\) Caucasians and \(n=1144\) African–Americans). Table 7 shows descriptive statistics of the BP data by race and the results of univariate t tests of whether the BP of African–Americans is higher than that of Caucasians. The mean values of SBP, DBP, and their center for African–American girls are significantly larger than those for Caucasians, but the range shows no significant difference between the two groups. The distributions of the center and log-transformed half-range of African–Americans are more skewed to the upper left than those of Caucasians (see Fig. 1). The correlation coefficients between the center and log-transformed half-range for the two groups are low, \(-0.26\) and \(-0.27\), respectively. Thus, the data roughly match the simulation settings (SN-C.S) and (N-C.V2) with small \(\rho\).

Table 7 Descriptive statistics of the BP data by race. The p value is from a univariate t test of the alternative hypothesis that the BP of African–Americans is higher than that of Caucasians

Table 8 shows the results when the two-sample comparison methods are applied to the BP data. In all tests, the p values are smaller than 0.001, confirming a significant difference between the two groups.

Table 8 Two-sample tests for the whole BP data
Fig. 1 Contour plots of the two groups of BP data

4.1 Sub-sampling

Since the two sample sizes (\(m = 1112\), \(n = 1144\)) are very large compared to typical sample sizes, the p value of each test is close to 0 and it is difficult to compare the performance of the four tests. Therefore, treating the original sample as a population, we draw sub-samples of sizes \(m^{\prime }\) and \(n^{\prime }\) and examine the corresponding powers at the significance levels 1%, 5%, and 10%.

Table 9 summarizes the rejection power over 2000 replicates, depending on the sub-sample sizes. The two marginal tests perform best with similar power, followed by the HT and CB tests. This result is consistent with our findings from the numerical study, especially for (SN-C.S) and (N-C.V2) with small \(\rho\).

Table 9 Powers of the four two-sample testing methods for different sub-sample sizes at significance levels 1%, 5%, and 10%. Numbers in bold denote the highest power among the CB, HT, UK, and GK tests for each setting

5 Conclusion

In this paper, we test the equality of two populations of MM-type interval samples by testing their real-valued representations. We first consider Hotelling's \(T^2\) test to examine the equality of the mean vectors of the center and range of the interval-valued data. We then propose marginalization-based test statistics, \(T_{\scriptscriptstyle M}^{UK}\) and \(T_{\scriptscriptstyle M}^{GK}\), which are based on two univariate distributional representations (termed marginalizations in this paper) of the interval-valued data.

The numerical study and real data analysis show that the marginalization-based tests perform better than the existing methods when the two population distributions differ in more than one factor, such as mean, covariance, and skewness. This implies that the marginal tests can be more suitable for testing real problems involving interval-valued data. Further, the power of the GK test is much higher than that of the UK test when the two populations differ in range and covariance.

However, we need to be cautious when applying the marginalization-based tests since they are informative only when the null hypothesis is rejected. That is, rejection of the equality test based on the marginalization implies that the two bivariate distributions are unequal, whereas acceptance of the null hypothesis does not imply the equality of the two bivariate distributions.

Finally, it is worth remarking that both the marginalization (or univariate real-valued representation) and the bivariate real-valued representation (e.g., \((L, U)\) or \((C, \log R)\)) are induced by the probability measure on intervals, but the converse does not hold. To be specific, the interval-valued data is a univariate random object on an appropriately defined probability space of intervals. For example, consider a sample space of intervals, say \(\varOmega\), equipped with a metric \(d(\omega _1, \omega _2)\) for \(\omega _1,\omega _2 \in \varOmega\). The metric induces the Borel \(\sigma\)-field, say \({\mathcal {F}}\), and the probability measure \({\mathcal {P}}\) is defined on \({\mathcal {F}}\). In this paper, we write the interval-valued data in the form \((L(\omega ), U(\omega )]\), where \(L(\omega )\) and \(U(\omega )\) are real-valued random variables on the above probability space and their joint distribution, say \(F(\ell , u)\), is induced by the probability measure \({\mathcal {P}}\). Thus, in this paper, we test the equality of two probability measures by testing the equality of their real-valued representations.