1 Introduction

In this article, we adapt the basic dynamical partitioning technique of Diday (1971a, b) and Diday and Simon (1976) for partitioning n observations into K clusters, to develop appropriate algorithms based on regression criteria for interval-valued observations. Interval data are examples of symbolic data first introduced by Diday (1987). Diday’s (1971b) dynamical partitioning technique extends the k-means method of MacQueen (1967) for classical data, with several authors (e.g., Chavent et al., 2002) developing k-means methods for selected classes of symbolic data. There are a few papers in the literature which use regression ideas as a basis for the partitioning criteria in the k-means approach for classical data, starting with the initial development in Charles (1977). We introduce regression based methods to the k-means technique for interval data.

We start in Section 2 with some background literature on the k-means method and its application to regression methodology. Then, in Section 3, the proposed algorithm for regression-based clustering of interval observations is described. Simulation studies covering several different data structures are presented in Section 4. Applications to real data in Section 5 illustrate and verify the integrity of the proposed algorithm; advantages and disadvantages are also discussed. Some concluding remarks are found in Section 6.

2 Background

Cluster analysis is a common statistical tool that divides a population (of size n, say) into different sub-populations such that the observations within the same sub-population are as homogeneous as possible while observations from different sub-populations are as non-homogeneous as possible. The fundamental and most well-known clustering method for the partitioning process is the k-means algorithm originally proposed by MacQueen (1967). For a fixed K, the k-means algorithm requires initial clusters to start the process. This initialization could be K seeds or K clusters; a detailed discussion can be found in Anderberg (1973) and Cormack (1971). The algorithm then partitions the n observations into K clusters under the rule that each observation belongs to the cluster with the nearest mean. A summary of the k-means algorithm and its extensions can be found in Jain et al. (1999), Bock (2007, 2008), and Jain (2010).

Similar to k-means clustering, the cluster-wise linear regression method tries to recover the data structure where the observations are clustered using multiple linear regression models. Cluster-wise linear regression partitions the n observations into K subsets where each observation belongs to its nearest linear regression model. For classical data, the cluster-wise linear regression method is one of the most developed clustering methods in statistics. Analogously with the k-means algorithm, Späth (1979, 1981, 1982) partitioned the data into K subsets and fitted K linear regressions such that the total sum of squares of the errors is locally minimized. DeSarbo and Cron (1988) utilized a maximum likelihood methodology using a mixture of conditional normal distributions to choose the appropriate partition that maximizes the likelihood function, which resulted in a fuzzy cluster-wise linear regression method. The assumptions for ordinary linear regression modeling apply to the cluster-wise linear regression. Wedel and Kistemaker (1989) proposed another maximum likelihood methodology by which a particular observation can belong to only one cluster. Later, Tibshirani et al. (2001) and Shao and Wu (2005) explored methods of determining the number of clusters for a cluster-wise linear regression clustering approach. Zhang (2003) introduced k-harmonic means clustering for the cluster-wise linear regression method, which is less sensitive to the choice of initialization. Rao et al. (2007) and Qian and Wu (2011) extended Späth’s (1982) method to one that is more robust by applying M-estimation to the linear regression modeling. Bougeard et al. (2017, 2018) used blocks in a partial least squares setting. A good review can be found in Brusco et al. (2008).

Our concern is with interval-valued observations. Intervals are examples of symbolic data, which can be defined broadly as hypercubes or Cartesian products of distributions in p-dimensional space \(\mathbb {R}^{p}\), in contrast to classical data which are points in \(\mathbb {R}^{p}\). While many observations are naturally symbolic (e.g., species measurements, temperatures recorded as daily minimum and maximum values), most symbolic data today arise from aggregating the large data sets accumulated by modern computers into more manageable forms for analysis. How any particular aggregation proceeds depends on the scientific questions underlying the investigation at hand. A key feature of such data is that they contain internal variation. For example, suppose an aggregation produced a range of individual values across an interval [10,20], say. Using a summary statistic such as the midpoint (here 15) for subsequent analyses results in a loss of critical information, since, e.g., this interval cannot be distinguished from a second interval [14,16] which has the same midpoint as the first. An extensive coverage of the types of symbolic data and their fundamental properties can be found in Bock and Diday (2000), Billard and Diday (2003, 2006), Diday and Noirhomme-Fraiture (2008), Noirhomme-Fraiture and Brito (2011), and Diday (2016), with non-technical introductions in Billard (2011, 2014).

It is important to remember that although observations are aggregated into intervals, each aggregated observation within an interval remains an observation from some underlying distribution, e.g., a normal distribution. Thus, descriptive statistics calculated from these intervals are still point estimates. It is assumed, however, that those normally distributed (say) observations are uniformly spread across the interval (or sub-intervals for histogram-valued data), analogous with the calculation of histograms (or sample means, etc.) of grouped data in elementary statistics courses. Thus, Bertrand and Goupil (2000) obtained the sample mean as a point value (see Eq. 3). Likewise, for interval-valued observations, a sample variance has a point value, as does each sample regression parameter value, based on n observations (see, e.g., Eq. 9 and Eq. 6, respectively).

Extensions of the k-means algorithm of MacQueen (1967) and the adaptive k-means algorithm of Diday and Simon (1976) to symbolic data are based on the traditional k-means criteria involving distances between observations and the centers that represent the obtained clusters; some use a median measure rather than the means. Thus, e.g., for interval-valued data, de Carvalho et al. (2006) consider an L2 distance; Chavent et al. (2002) use Hausdorff distances; de Souza and de Carvalho (2004), de Souza et al. (2004), and de Carvalho and Lechevallier (2009) apply city-block distances; de Carvalho et al. (2004a, b) use Chebychev distances; and de Carvalho and Lechevallier (2009) apply an adaptive k-means method. Also, for histogram data, Verde and Irpino (2007), Irpino et al. (2006), and Košmelj and Billard (2012) consider Mallows’ distance for histogram observations; and Korenjak-Černe et al. (2011) and Batagelj et al. (2015) extend the k-means leaders approach to discrete distributions.

3 Regression-Based k-Means for Interval Data

3.1 Regression Model

Suppose we have n observations in a data set with response variable Y and p predictor variables X = (X1,...,Xp). All the response and predictor variables are interval-valued random variables. Let Xij, i = 1,...,n, j = 1,...,p, be the i th realization for the j th predictor variable Xj, denoted by Xij = [xija,xijb] with \(x_{ija}, x_{ijb} \in \mathbb {R}\) and \(x_{ija} \le x_{ijb}\). Similarly, let Yi be the i th realization for the response variable Y, denoted by Yi = [yia,yib] with \(y_{ia} \le y_{ib}\), \(i = 1,\dots , n\). Then, the multiple linear regression model is

$$ Y = \beta_{0} + \beta_{1} X_{1} + {\dots} + \beta_{p} X_{p} + \boldsymbol{\epsilon} = \mathbf{X}^{\prime} \boldsymbol{\beta}^{\ast} + \boldsymbol{\epsilon}, $$
(1)

where \(\boldsymbol {\beta }^{\ast } = (\beta _{0}, \beta _{1}, \dots , \beta _{p})^{\prime }\) is the set of regression coefficients of the p + 1 variables \(\mathbf {X}^{\prime } = (1, X_{1},\dots ,X_{p})\), and 𝜖 is the error interval vector. In Eq. 1, given the data, Y is an n × 1 vector and \(\mathbf {X}^{\prime }\) is an n × (p + 1) matrix. Equivalently, Eq. 1 may be written as

$$ Y - \bar{Y} = (\mathbf{X} - \bar{\mathbf{X}})^{\prime} \boldsymbol{\beta} + \boldsymbol{\epsilon}, $$
(2)

where \(\boldsymbol {\beta } = (\beta _{1}, \dots , \beta _{p})^{\prime }\), \(\mathbf {X} = (X_{1}, \dots , X_{p})\) and where \(\bar {\mathbf {X}} = (\bar {X}_{1},\dots , \bar {X}_{p})\) and \(\bar {Y}\) are the sample means defined, respectively, by

$$ \bar{X}_j = \frac{1}{2n}\sum\limits_{i = 1}^n(x_{ija}+x_{ijb}),~ j = 1,\dots, p, ~~ \bar{Y} = \frac{1}{2n}\sum\limits_{i = 1}^n(y_{ia}+y_{ib}). $$
(3)

See any of the many elementary texts on regression for properties of the model (e.g., Johnson and Wichern, 2007; Draper and Smith, 1966); see also Xu (2010) for interval observations.

From Eq. 2, the error sum of squares can be written as

$$ S = \sum\limits_{i = 1}^{n} [Y_{i} - \bar{Y} - (X_{i1} - \bar{X}_{1})\beta_{1} - {\dots} - (X_{ip}- \bar{X}_{p})\beta_{p}]^{2}, $$
(4)

for given observations \(i = 1, \dots , n\). Then differentiating the right-hand-side of Eq. 4 with respect to \(\beta _{j}, ~ j = 1, \dots , p\), we have

$$ \frac{\partial S}{ \partial \beta_{j}} = -2 \sum\limits_{i = 1}^{n} [Y_{i} - \bar{Y} - (X_{i1} - \bar{X}_{1})\beta_{1} - {\dots} - (X_{ip}- \bar{X}_{p})\beta_{p}](X_{ij} - \bar{X}_{j}), $$
(5)

Setting these derivatives to zero and writing \(\beta _{j} = \hat {\beta }_{j}\), we can solve the set of p equations in Eq. 5 to obtain the least squares estimators of the βj. In matrix terms, this becomes

$$ \begin{array}{@{}rcl@{}} (\hat{\beta}_{1}, \dots, \hat{\beta}_{p})^{\prime} &=& ((\mathbf{X} - \bar{\mathbf{X}})^{\prime}(\mathbf{X} - \bar{\mathbf{X}}))^{-1} (\mathbf{X} - \bar{\mathbf{X}})^{\prime}(\mathbf{Y} - \bar{Y}), \\ \hat{\beta}_{0} &=& \bar{Y} - (\hat{\beta}_{1} \bar{X}_{1} + {\dots} + \hat{\beta}_{p}\bar{X}_{p}), \end{array} $$
(6)

where the estimator \(\hat {\beta }_{0}\) follows from Eq. 1.

To obtain the values of the elements in these matrices, let us re-write

$$ \begin{array}{@{}rcl@{}} (\mathbf{X}-\bar{\mathbf{X}})^{\prime}(\mathbf{X}-\bar{\mathbf{X}}) &=& \left( \begin{array}{ccc} {\sum}_{i=1}^{n}(X_{i1}-\bar{X}_{1})^{2}&\cdots&{\sum}_{i=1}^{n}(X_{i1}-\bar{X}_{1})(X_{ip}-\bar{X}_{p})\\ \vdots&\ddots&\vdots\\ {\sum}_{i=1}^{n}(X_{ip}-\bar{X}_{p})(X_{i1}-\bar{X}_{1})&\cdots&{\sum}_{i=1}^{n}(X_{ip}-\bar{X}_{p})^{2} \end{array} \right)_{p \times p} \\ &=&\left( \sum\limits_{i=1}^{n}(X_{ij}-\bar{X}_{j})(X_{ij^{\prime}}-\bar{X}_{j^{\prime}})\right)_{p \times p} \\ &=&\left( n \times \text{Cov}(X_{j},X_{j^{\prime}})\right)_{p \times p}, \quad j, j^{\prime} = 1,\dots, p, \end{array} $$
(7)

and

$$ \begin{array}{@{}rcl@{}} {(\mathbf{X}-\bar{\mathbf{X}})}^{\prime}(\mathbf{Y}-\bar{Y})&=& \left( \sum\limits_{i=1}^{n}(X_{ij}-\bar{X}_{j})(Y_{i}-\bar{Y})\right)_{p \times 1} \\ &=&\left( n \times \text{Cov}(X_{j},Y)\right)_{p \times 1} \quad j = 1,\dots, p. \end{array} $$
(8)

Thus, it is clear that these matrix elements are simply n times the sample covariances between two variables. The question now is to find expressions for these elements for given data; i.e., we need to estimate the elements of the p × p matrix \({(\mathbf {X} - \bar {\mathbf {X}})}^{\prime }(\mathbf {X} - \bar {\mathbf {X}}) \equiv \mathbf {M}_{xx}\) (say), and likewise the elements of the p × 1 matrix \(\mathbf {M}_{xy} \equiv (\mathbf {X} - \bar {\mathbf {X}})^{\prime }(\mathbf {Y} - \bar {Y})\).

Consider first the diagonal elements \(M_{xx}(j, j) = n \text {Cov}(X_{j}, X_{j}) \equiv n \text {Var}(X_{j}) \equiv \text {SS}_{X_{j}}\), say. Bertrand and Goupil (2000) showed that for given interval observations, this sum of squares is estimated by

$$ \text{SS}_{X_{j}} = (1/3)\sum\limits_{i = 1}^{n} [x_{ija}^{2} + x_{ija}x_{ijb} + x_{ijb}^{2}] - n \bar{X}_{j}^{2}, ~~ \bar{X}_{j} = (1/2n) \sum\limits_{i = 1}^{n}(x_{ija} + x_{ijb}); $$
(9)

this can be re-written as

$$ \text{SS}_{X_{j}} = (1/3) \sum\limits_{i = 1}^{n} [(x_{ija} - \bar{X}_{j})^{2} + (x_{ija} - \bar{X}_{j})(x_{ijb} - \bar{X}_{j}) + (x_{ijb} - \bar{X}_{j})^{2}]. $$
(10)

When there is only one observation, i.e., when n = 1, Eq. 10 becomes

$$ \begin{array}{@{}rcl@{}} \text{SS}_{X_{j}}(n = 1) &=& (1/3) [(x_{ija} - \bar{X}_{ij})^{2} + (x_{ija} - \bar{X}_{ij})(x_{ijb} - \bar{X}_{ij}) + (x_{ijb} - \bar{X}_{ij})^{2}], \\ \bar{X}_{ij} &=& (1/2) (x_{ija} + x_{ijb}), \end{array} $$
(11)

where the single i subscript in Eq. 11 is retained. From this, we see that \(\text {SS}_{X_{j}}(n = 1) \neq 0\) unless xija = xijb, i.e., unless the observation Xij is a classical point observation. That is, \(\text {SS}_{X_{j}}(n = 1)\) represents the internal variation for this interval Xij observation. Summing over all \(i = 1, \dots , n\) observations, we obtain the Within variation, \(\text {WithinSS}_{X_{j}}\), as

$$ \text{WithinSS}_{X_{j}} = (1/3)\sum\limits_{i = 1}^{n} [(x_{ija} - \bar{X}_{ij})^{2} + (x_{ija} - \bar{X}_{ij})(x_{ijb} - \bar{X}_{ij}) + (x_{ijb} - \bar{X}_{ij})^{2}]. $$
(12)

Re-arranging, we can show that Eq. 12 can be written as

$$ \text{WithinSS}_{X_{j}} = \sum\limits_{i = 1}^{n} [(x_{ijb} - x_{ija})^{2}/12]. $$
(13)

If we look at the variation between observations, viz., \(\text {BetweenSS}_{X_{j}}\), we have

$$ \text{BetweenSS}_{X_{j}} = \sum\limits_{i = 1}^{n} [(x_{ija} + x_{ijb})/2 -\bar{X}_{j}]^{2}. $$
(14)

From Eqs. 13 and 14, it is easy to show that

$$ \text{SS}_{X_{j}} \equiv \text{TotalSS}_{X_{j}} = \text{WithinSS}_{X_{j}} + \text{BetweenSS}_{X_{j}} = M_{xx}(j, j) $$
(15)

as given in Eq. 10. It is noted that the expressions on the right-hand-side of Eqs. 13 and 14 are conditional on an assumption that the aggregated values within an interval are uniformly spread across the interval. (As an aside, although this is a so-far universally accepted assumption, different assumptions about the internal interval spread would change the formulae of these two expressions).
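
As a quick numerical check of the decomposition in Eq. 15, the short Python sketch below computes the total sum of squares from Eq. 10 and the Within and Between components from Eqs. 13 and 14 for a handful of illustrative intervals (the interval values are ours, chosen only for demonstration).

```python
# Numerical check of Eq. 15: TotalSS = WithinSS + BetweenSS for interval data.
# The intervals below are illustrative values only, not data from the paper.
intervals = [(10.0, 20.0), (14.0, 16.0), (5.0, 9.0), (12.0, 18.0)]
n = len(intervals)

xbar = sum(a + b for a, b in intervals) / (2 * n)                  # Eq. 3

total_ss = sum((a - xbar) ** 2 + (a - xbar) * (b - xbar) + (b - xbar) ** 2
               for a, b in intervals) / 3                          # Eq. 10
within_ss = sum((b - a) ** 2 / 12 for a, b in intervals)           # Eq. 13
between_ss = sum(((a + b) / 2 - xbar) ** 2 for a, b in intervals)  # Eq. 14

print(total_ss, within_ss + between_ss)   # the two printed values agree (Eq. 15)
```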

The off-diagonal elements \(M_{xx}(j, j^{\prime })\) are obtained analogously, by using the sum of products (SP). Thus, we have the Within SP between the random variables Xj and \(X_{j^{\prime }}\), \(\text {WithinSP}_{X_{j},X_{j^{\prime }}}\), as

$$ \text{WithinSP}_{X_{j},X_{j^{\prime}}} = \sum\limits_{i = 1}^{n} [(x_{ijb} - x_{ija})(x_{ij^{\prime}b} - x_{ij^{\prime}a})/12] $$
(16)

and the Between SP between Xj and \(X_{j^{\prime }}\), \(\text {BetweenSP}_{X_{j},X_{j^{\prime }}}\), as

$$ \text{BetweenSP}_{X_{j},X_{j^{\prime}}} = \sum\limits_{i = 1}^{n} [(x_{ija} + x_{ijb})/2 -\bar{X}_{j}][(x_{ij^{\prime}a} + x_{ij^{\prime}b})/2 -\bar{X}_{j^{\prime}}]. $$
(17)

Hence, adding Eqs. 16 and 17 and rearranging gives, for \(j, j^{\prime } = 1,\dots , p\),

$$ \begin{array}{@{}rcl@{}} M_{xx}(j, j^{\prime}) &=& \text{TotalSP} = \text{WithinSP} + \text{BetweenSP}, \\ M_{xx}(j, j^{\prime}) &=& \frac{1}{6} \sum\limits_{i=1}^{n}[2(x_{ija} - \bar{X}_{j})(x_{ij^{\prime}a} - \bar{X}_{j^{\prime}})+(x_{ija} - \bar{X}_{j})(x_{ij^{\prime}b} - \bar{X}_{j^{\prime}})\\ &&+ (x_{ij^{\prime}a} - \bar{X}_{j^{\prime}})(x_{ijb} - \bar{X}_{j}) +2(x_{ijb} - \bar{X}_{j})(x_{ij^{\prime}b} - \bar{X}_{j^{\prime}})]. \end{array} $$
(18)

When \(j = j^{\prime }\), Eq. 18 simplifies to Eq. 10. For the special case that the observations are classical point values (with, e.g., a ≡ [a,a]), these derivations reduce to those for classical data, as they should. The elements of Mxy are obtained similarly. Hence, the least squares estimators for β are found by substituting the elements of Mxx and Mxy into Eq. 6.

In the sequel, regression model fits will use the estimators obtained from Eqs. 1–18, and will be referred to as the symbolic variation methodology.
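
To make the computation concrete, the following Python sketch (our own minimal illustration, not the authors' implementation; numpy and the function name symbolic_ls are our choices) assembles Mxx and Mxy from the Within and Between sums of products in Eqs. 13–18 and then solves Eq. 6 for the coefficient estimates.

```python
import numpy as np

def symbolic_ls(Xa, Xb, ya, yb):
    """Symbolic-variation least squares fit for interval data (Eqs. 3, 6, 13-18).

    Xa, Xb : (n, p) arrays of lower/upper bounds of the p predictors.
    ya, yb : (n,) arrays of lower/upper bounds of the response.
    Returns (beta0_hat, beta_hat), the estimators of Eq. 6.
    """
    n, p = Xa.shape
    Xbar = (Xa + Xb).sum(axis=0) / (2 * n)        # sample means, Eq. 3
    ybar = (ya + yb).sum() / (2 * n)

    Xc = (Xa + Xb) / 2 - Xbar                     # centred interval midpoints
    Xr = Xb - Xa                                  # interval ranges
    yc = (ya + yb) / 2 - ybar
    yr = yb - ya

    # M_xx = BetweenSP (Eq. 17) + WithinSP (Eq. 16); its diagonal is Eq. 15.
    Mxx = Xc.T @ Xc + (Xr.T @ Xr) / 12
    # M_xy: the analogous Between + Within sums of products with Y.
    Mxy = Xc.T @ yc + (Xr.T @ yr) / 12

    beta = np.linalg.solve(Mxx, Mxy)              # Eq. 6, first line
    beta0 = ybar - Xbar @ beta                    # Eq. 6, second line
    return beta0, beta
```

When every interval degenerates to a point (xija = xijb), the Within terms vanish and this fit reduces to ordinary least squares on classical data, as noted above.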

Earlier methods developed for fitting regression models to interval-valued data include the center method of Billard and Diday (2000), whereby a model is fitted to the midpoints of the intervals. Although their method used the range of the predictor variables to find the prediction intervals, the calculation of the model parameters is based on the interval midpoints only, and so omits important information contained in the internal variations of the intervals. In an attempt to redress that problem, in a series of papers, de Carvalho and his colleagues (e.g., Lima Neto et al., 2004; de Carvalho et al., 2004a, b; Lima Neto et al., 2005; Lima Neto and de Carvalho, 2008) transformed the original interval-valued variables into point-valued center and half-range variables; a classical regression analysis was then conducted on the center values and the half-range values separately. Use of the center-range variables is equivalent to using the interval end-points. We note that all these methods use classical surrogates based on one or two single point values of the interval, rather than using the entire interval as in the symbolic variation method. It is easy to show that the variations (SS/SP) of the centers equate to the Between variations of the symbolic variation methodology, but the variations of the ranges do not equate to the Within variations of the symbolic variation method. Thus, in their various ways, these methods do not fully and correctly utilize all the variation in the data. There are, moreover, other statistical issues here; e.g., the assumption that the center and range are independent entities is unsustainable. While each of these previous methods advanced the subject at the time, each has its limitations.

3.2 Partitioning Criteria

Our concern is with partitioning our data set into a fixed number K of clusters where the observations within each cluster are identified by a specific regression model. Assume that the response variable Y has K different linear relationships with the predictor variables X. Let (Xk,Yk), k = 1,...,K, be the set of observations that follow the k th regression model (i.e., belong to the k th cluster Ck). Then,

$$ Y_k = \boldsymbol{X}_k^{\prime} \boldsymbol{\beta}_k + \boldsymbol{\epsilon}_k,~~~k=1,...,K, $$
(19)

where βk is the set of linear coefficients of the predictor variables for the k th regression model, and 𝜖k is the error vector.

Let nk, k = 1,...,K, be the number of observations in the k th cluster with \({\sum }_{k=1}^{K} n_{k} = n\). We assume the following:

A1. The number of observations nk satisfies p < nk ≤ n, k = 1,...,K, where p is the number of predictor variables, and n is the total number of observations in the whole data set. It can be shown that nk = n only if K = 1.

A2. The individual errors in a particular cluster k are drawn independently from a normal distribution with mean 0 and variance \({\sigma _{k}^{2}}\), \(N(0,{\sigma _{k}^{2}})\), and after aggregation (along the lines of, e.g., Eq. 33) become intervals. The error intervals 𝜖k are independent of \(\boldsymbol {\epsilon }_{k^{\prime }}\), \(k \ne k^{\prime }\), for \(k, k^{\prime }=1,...,K\).

The first assumption (A1) avoids the situation where nk ≤ p, in which there is no unique least squares solution for the k th cluster; the second assumption (A2) reduces the computational complexity of the problem. Our goal is to find an optimal partition \(P^{*} = (C_{1}^{*},...,C_{K}^{*})\) that minimizes the overall residuals of the regression models given the number of clusters K.

Given a partition P = (C1,...,CK), we can fit a linear regression model for each cluster as in Eq. 19. Denote the coefficient estimator of βk by \(\hat {\boldsymbol {\beta }}_{k}\) for k = 1,...,K. Then, the regression residuals for the k th cluster are defined as, for iCk,

$$ r_{ki} = d(Y_{i},\hat{Y}_{i}), $$
(20)

where \(d(Y_{i},\hat {Y}_{i})\) is the distance between the observation Yi and its predicted value \(\hat {Y}_{i}\). Since \(\hat {Y}_{i}=\boldsymbol {X}_{i}^{\prime }\hat {\boldsymbol {\beta }}_{k}\), Eq. 20 can be rewritten as \(r_{ki}=d(Y_{i}, \boldsymbol {X}_{i}^{\prime }\hat {\boldsymbol {\beta }}_{k})\). The predictive interval \(\hat {Y}_{i}\) using the symbolic variation method is, for iCk,

$$ \hat{Y}_{i} = [\hat{Y}_{ia},\hat{Y}_{ib}] = [\min_{\boldsymbol{X}\in \mathbb{X}}\boldsymbol{X}_{i}^{\prime}\hat{\boldsymbol{\beta}}_{k}, \max_{\boldsymbol{X}\in \mathbb{X}}\boldsymbol{X}_{i}^{\prime}\hat{\boldsymbol{\beta}}_{k}], $$
(21)

where \(\mathbb {X}=\{\boldsymbol {X} = (x_{j}): x_{ija} \le x_{j} \le x_{ijb}, j=1,...,p\}\); see Xu (2010). Our goal is to find an optimal partition that minimizes the sum of squared residuals (SSR) given K, viz.,

$$ P^{*}=\underset{P;~ \hat{\boldsymbol{\beta}}_{k}}{\arg\min}~ \text{SSR}, \qquad \text{SSR} = \sum\limits_{k=1}^{K} \sum\limits_{i \in C_{k}} r_{ki}^{2} = \sum\limits_{k=1}^{K} \sum\limits_{i=1}^{n_{k}}r_{ki}^{2}. $$
(22)

Since rki in Eq. 22 is defined as the distance between two intervals, Yi and \(\hat {Y}_{i}\), different definitions of the distance between these two intervals will affect the clustering results of our k-regressions algorithm. For illustrative concreteness, we consider three different distance definitions between two interval variables, specifically, center distance, Hausdorff distance and city-block distance; other distances could be used.

The center distance between two p-dimensional interval observations Z1 = (Z11,…, Z1p) and Z2 = (Z21,…,Z2p) with Zij = [zija,zijb], i = 1,2, j = 1,…,p, is defined as

$$ d_{C}(\mathbf{Z}_{1}, \mathbf{Z}_{2})= \sum\limits_{j=1}^{p}|z_{1j}^{c} - z_{2j}^{c}|, $$
(23)

where \(z_{ij}^{c}=(z_{ija} + z_{ijb})/2\) is the midpoint of the interval Zij, i = 1,2, j = 1,…,p. The Hausdorff (1937) distance is defined as

$$ d_{H}(\mathbf{Z}_{1}, \mathbf{Z}_{2})= \sum\limits_{j=1}^{p} \max\{|z_{1ja} - z_{2ja}|, |z_{1jb} - z_{2jb}|\}. $$
(24)

The city-block distance is defined as

$$ d_{CB}(\mathbf{Z}_{1}, \mathbf{Z}_{2})= \sum\limits_{j=1}^{p} [|z_{1ja} - z_{2ja}| + |z_{1jb} - z_{2jb}|]. $$
(25)
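
For concreteness, here is a sketch (ours; the helper names are not from the paper) of the three distances in Eqs. 23–25 as applied to the one-dimensional residual intervals Yi and Ŷi, together with the prediction interval of Eq. 21, whose endpoints are obtained by taking each predictor at the interval bound that minimizes or maximizes X′β̂k according to the sign of the corresponding coefficient.

```python
import numpy as np

def predict_interval(xa, xb, beta0, beta):
    """Prediction interval [Yhat_a, Yhat_b] of Eq. 21 for one observation.

    xa, xb : (p,) arrays of lower/upper bounds of the predictors.
    The extremes of X'beta over the hyper-rectangle are attained at interval
    endpoints, chosen according to the sign of each coefficient.
    """
    lo = beta0 + np.where(beta >= 0, beta * xa, beta * xb).sum()
    hi = beta0 + np.where(beta >= 0, beta * xb, beta * xa).sum()
    return lo, hi

# Distances between two intervals y = (ya, yb) and yhat = (yhat_a, yhat_b);
# these are Eqs. 23-25 with the sum over j reduced to the single response.
def d_center(y, yhat):                    # center distance, Eq. 23
    return abs((y[0] + y[1]) / 2 - (yhat[0] + yhat[1]) / 2)

def d_hausdorff(y, yhat):                 # Hausdorff distance, Eq. 24
    return max(abs(y[0] - yhat[0]), abs(y[1] - yhat[1]))

def d_cityblock(y, yhat):                 # city-block distance, Eq. 25
    return abs(y[0] - yhat[0]) + abs(y[1] - yhat[1])
```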

We observe that there are effectively two optimizations occurring in this process. The first is designed to minimize the sum of squared residuals when fitting the regression model within each cluster (i.e., minimize the regression sum of squared errors \({\sum }_{i = 1}^{n_{k}}{\epsilon _{i}^{2}}\) for the k th cluster regression, \(k = 1, \dots , K\)). This is not the same as minimizing \({\sum }_{i = 1}^{n}{\epsilon _{i}^{2}}\) over all observations ignoring the clusters. The second optimization is to minimize the sum of the squared distances \(r_{ki} = d(Y_{i}, \hat {Y}_{i})\), where the predicted value \(\hat {Y}_{i}\) is determined by the regression equation for the k th cluster. We also note that the explained variation in the data is due to a mix of explained variation from the regression fits within clusters and variation due to heterogeneity across clusters. For the classical data setting, Brusco et al. (2008) give a nice example illustrating how the explained variation can sometimes be dominated by the within-cluster variation and sometimes by the cluster heterogeneity.

3.3 Partitioning Algorithm

The k-regressions algorithm for each of the three distance definitions of Eqs. 23–25 is the same, except that the distance \(d(Y_{i},\boldsymbol {X}^{\prime }_{i}\hat {\boldsymbol {\beta }}_{k})\) used in the allocation step (iii) changes appropriately. Analogously with the algorithm in Späth (1979), we propose a k-regressions cluster-wise algorithm for interval-valued data as follows:

(i) Initialization: Choose a partition \(P^{(0)}=(C_{1}^{(0)},...,C_{K}^{(0)})\) randomly from all the possible partitions, or partition the whole data set into K clusters based on some prior knowledge.

(ii) Representation: For k = 1,...,K, fit the regression \(Y_{k} = \boldsymbol {X}_{k}^{\prime }\boldsymbol {\beta }_{k} + \boldsymbol {\epsilon }\) to the observations in the k th cluster of the partition \(P^{(l)}=(C_{1}^{(l)},...,C_{K}^{(l)})\), where \(l = 0 , 1, \dots \) denotes the l th iteration.

(iii) Allocation: For each observation Yi, i = 1,...,n, calculate its distance to the prediction \(\hat {Y}_{i}\) obtained from the k th regression line, \(d(Y_{i},\boldsymbol {X}^{\prime }_{i}\hat {\boldsymbol {\beta }}_{k})\), k = 1,...,K, and allocate the observation to its closest line. The updated partition is now \(P^{(l+1)}=(C_{1}^{(l+1)},...,C_{K}^{(l+1)})\).

(iv) Stop: Repeat (ii) and (iii) until the improvement in SSR of Eq. 22 is smaller than a predetermined criterion, or the number of iterations reaches a predetermined maximum.

For the representation step, we apply the symbolic variation method to fit the linear regression model. For the allocation step, the observations are allocated such that, for k = 1,...,K,

$$ C_k = \{(\boldsymbol{X},Y) | d(Y,\boldsymbol{X}^{\prime}\hat{\boldsymbol{\beta}}_k) \le d(Y,\boldsymbol{X}^{\prime}\hat{\boldsymbol{\beta}}_{k^{\prime}}), \forall k \ne k^{\prime}\}. $$
(26)

Given a data set, the algorithm cannot guarantee a global minimum of SSR. Thus, we run steps (i)–(iv) a number of times with different initializations and select the solution which has the lowest value of SSR. The selected partition can be further iterated until the SSR cannot be reduced any more. An expanded description is available in Liu (2016).
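
The sketch below is one plausible reading of steps (i)–(iv) with random restarts, not the authors' code; it assumes the hypothetical helpers symbolic_ls, predict_interval, and a distance function such as d_center from the earlier sketches.

```python
import numpy as np

def k_regressions(Xa, Xb, ya, yb, K, dist, n_starts=20, max_iter=100,
                  tol=1e-8, seed=0):
    """k-regressions clustering of interval data: steps (i)-(iv) with restarts."""
    rng = np.random.default_rng(seed)
    n, p = Xa.shape
    best_ssr, best_labels, best_fits = np.inf, None, None
    for _ in range(n_starts):
        labels = rng.integers(0, K, size=n)            # (i) random initial partition
        prev_ssr = np.inf
        for _ in range(max_iter):
            fits = []
            for k in range(K):                          # (ii) representation
                idx = labels == k
                if idx.sum() <= p:                      # guard for assumption A1
                    fits.append(None)
                else:
                    fits.append(symbolic_ls(Xa[idx], Xb[idx], ya[idx], yb[idx]))
            D = np.full((n, K), np.inf)                 # (iii) allocation
            for k, fit in enumerate(fits):
                if fit is None:
                    continue
                b0, b = fit
                for i in range(n):
                    yhat = predict_interval(Xa[i], Xb[i], b0, b)
                    D[i, k] = dist((ya[i], yb[i]), yhat)
            labels = D.argmin(axis=1)
            ssr = (D.min(axis=1) ** 2).sum()            # SSR of Eq. 22
            if prev_ssr - ssr < tol:                    # (iv) stop
                break
            prev_ssr = ssr
        if ssr < best_ssr:                              # keep the lowest-SSR restart
            best_ssr, best_labels, best_fits = ssr, labels.copy(), fits
    return best_ssr, best_labels, best_fits
```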

3.4 Number of Clusters

The k-regressions clustering algorithm implements the cluster-wise regression method when K is known. However, if we do not have prior knowledge about K, a poor choice of K can give misleading results. Xu (2010) gave a symbolic R-square (R2) for the linear regression of interval-valued data under the symbolic variation method as

$$ R^2 = \frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}, $$
(27)

where \(\hat {Y}\) is the predicted value of the response variable Y, and Var(⋅) is the symbolic variance calculated from Eq. 9. Using the symbolic R2, we propose the following methods to determine the number of clusters K.

Given a predetermined maximum number of clusters Kmax, for each K = 1,…, Kmax, calculate the R2 for each cluster k = 1,…,K, denoted by \(R_{k}^{2^{(K)}}\). For the whole data set, the weighted average R2 for the n observations given K is defined as

$$ R^{2^{(K)}}= \sum\limits_{k=1}^K w_k^{(K)}R_k^{2^{(K)}}, $$
(28)

where \(w_{k}^{(K)}=n_{k}/n\) is the weight of the R2 for the k th cluster, and nk = |Ck| is the number of observations in the k th cluster. In the plot of \((1-R^{2^{(K)}})\) versus K, the elbow point indicates the optimal number of clusters K.

Determining the optimal number of clusters K by looking for the elbow point can be subjective, especially when the elbow point is not obvious. Analogously with the adjusted R2 for the linear regression model, we propose an adjusted R2 to determine the optimal K for the k-regressions algorithm. We know that R2 for ordinary least squares regression models corresponds to the proportion of variation explained by the model; i.e., R2 can be defined as

$$ R^{2} = 1- \text{SS}_{\text{res}}/\text{SS}_{\text{tot}} = \text{SS}_{\text{reg}}/\text{SS}_{\text{tot}}, $$
(29)

where \(\text {SS}_{\text {tot}} = {\sum }_{i}(Y_{i}-\bar {Y})^{2}\) is the total sum of squares, \(\text {SS}_{\text {reg}} = {\sum }_{i}(\hat {Y}_{i}-\bar {Y})^{2}\) is the sum of squares of the regression, \(\text {SS}_{\text {res}} = {\sum }_{i}({Y_{i}-\hat {Y}_{i}})^{2}\) is the sum of squares of the residuals, and \(\bar {Y} = {\sum }_{i} Y_{i}/n\) is the sample mean of Y. The R2 in Eq. 29 can be rewritten as

$$ R^{2} = 1- \text{Var}_{\text{res}}/\text{Var}_{\text{tot}}, $$
(30)

where Varres = SSres/n and Vartot = SStot/n. The Varres and Vartot terms are biased estimators of the residual variation and the population variation, respectively. The adjusted R2 replaces these two variance estimators with unbiased versions, so that the adjusted R2 is defined as

$$ \bar{R}^2=1-\frac{\text{SS}_{\text{res}}/\text{df}_{\epsilon}}{\text{SS}_{\text{tot}}/\text{df}_t}, $$
(31)

where df𝜖 = n − p − 1 is the degrees of freedom of the residuals, and dft = n − 1 is the degrees of freedom of the total variation. The adjusted R2 adjusts the R2 in Eq. 29 so that it does not automatically increase as more terms are added.

The k-regressions algorithm fits K regressions to the whole data set, so that the total number of slope coefficients is Kp. For each K = 1,…,Kmax, analogously with the adjusted R2 for ordinary least squares regression, we define the adjusted weighted R2 for the k-regressions clustering as

$$ \begin{array}{ll} \bar{R}^{2^{(K)}} &= R^{2^{(K)}}- (1-R^{2^{(K)}})\frac{Kp}{n-Kp-1} \\ &\equiv {R}^{2^{(K)}} - Q^{(K)}, \end{array} $$
(32)

where \(Q^{(K)} = (1-R^{2^{(K)}})Kp/(n-Kp-1)\) is the penalty term.

The adjusted weighted R2 in Eq. 32 penalizes the \(R^{2^{(K)}}\) of Eq. 28 by the factor Q(K) when the number of clusters increases. Since we fit K different linear regression models to the whole data set, the number of parameters is K(p + 1) for the cluster-wise regression methodology.

From Eq. 32, \(\bar {R}^{2^{(K)}}\) is always smaller than \(R^{2^{(K)}}\). The \(\bar {R}^{2^{(K)}}\) increases only if the increase of K improves \(R^{2^{(K)}}\) more than the penalized term Q(K). Usually when K increases, \(\bar {R}^{2^{(K)}}\) increases and reaches a maximum at a certain value of K, and decreases afterwards. The value of K that maximizes \(\bar {R}^{2^{(K)}}\), or equivalently minimizes \(1-\bar {R}^{2^{(K)}}\), is the optimal number of clusters K. We compare the two methods of determining the optimal K by simulation results in the next section.
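
As a small illustration of the two criteria, the sketch below (our own naming; the per-cluster symbolic R2 of Eq. 27 is assumed to have been computed already, e.g., from the fitted clusters returned by the k-regressions sketch above) evaluates Eqs. 28 and 32 for one candidate K.

```python
import numpy as np

def weighted_r2(r2_by_cluster, n_by_cluster):
    """Weighted average R^2 of Eq. 28 for one candidate K."""
    w = np.asarray(n_by_cluster) / np.sum(n_by_cluster)   # weights w_k = n_k / n
    return float(w @ np.asarray(r2_by_cluster))

def adjusted_weighted_r2(r2K, n, p, K):
    """Adjusted weighted R^2 of Eq. 32, with penalty term Q^(K)."""
    Q = (1.0 - r2K) * K * p / (n - K * p - 1)
    return r2K - Q

# Choosing K: plot 1 - weighted_r2 against K = 1,...,Kmax and look for the
# elbow, or take the K that maximizes adjusted_weighted_r2.
```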

4 Simulation Study

We describe how our interval-valued observations are simulated in Section 4.1. Then, in Section 4.2, we compare the k-regressions clustering with the traditional k-means method. How the new algorithm performs is studied in Section 4.3 for several different data set structures.

4.1 Methodology

In practice, most interval data sets arise from aggregating classical data. From this perspective, we propose the following simulation method. The intervals of X are simulated so that the interval midpoints X(c) come from a multivariate normal distribution and the interval ranges X(r) come from exponential distributions. The intervals of Xj, j = 1,...,p, are given by \(X_{j} = [X_{j}^{(c)}-0.5X_{j}^{(r)}, X_{j}^{(c)}+0.5X_{j}^{(r)}]\). The spread of observations within these intervals is assumed to be uniform. For a particular observation i, to obtain the interval Yi, we randomly draw m values from the uniform distribution U(xija,xijb) for each j = 1,...,p, denoted by xij1,...,xijm. Here, m is a predetermined number. Then, the interval Yi = [yia,yib] is determined by

$$ \begin{array}{ll} y_{ia} & = \min\limits_{l \in \{1,...,m\}}\{\beta_0 + \beta_1 x_{i1l}+...+\beta_p x_{ipl}+\epsilon_{il}\},\\ y_{ib} & = \max\limits_{l \in \{1,...,m\}}\{\beta_0 + \beta_1 x_{i1l}+...+\beta_p x_{ipl}+\epsilon_{il}\}, \end{array} $$
(33)

where \(\epsilon _{il} \stackrel {iid}{\sim } N(0,\sigma ^{2})\) for i = 1,...,n, and l = 1,...,m. This method is practically reasonable. For example, traffic at a particular intersection is recorded multiple times every day; the minimum and maximum values are recorded as the traffic interval for that day.
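
A sketch of this simulation scheme for a single cluster is given below (Python with numpy; the function name and the treatment of the exponential rate parameter are our own assumptions, and the parameters are placeholders rather than the settings of any particular data set in the paper).

```python
import numpy as np

def simulate_cluster(n, beta0, beta, mu_x, sigma_x, lam_x, sigma_eps,
                     m=25, seed=1):
    """Simulate n interval observations from one regression cluster via Eq. 33."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    xc = rng.normal(mu_x, sigma_x, size=(n, p))        # interval midpoints X^(c)
    xr = rng.exponential(1.0 / lam_x, size=(n, p))     # interval ranges X^(r), rate lam_x
    Xa, Xb = xc - 0.5 * xr, xc + 0.5 * xr
    ya, yb = np.empty(n), np.empty(n)
    for i in range(n):
        xs = rng.uniform(Xa[i], Xb[i], size=(m, p))    # m uniform draws within the intervals
        ys = beta0 + xs @ np.asarray(beta) + rng.normal(0.0, sigma_eps, size=m)
        ya[i], yb[i] = ys.min(), ys.max()              # Eq. 33
    return Xa, Xb, ya, yb
```

The more general case described next, with a random number of draws, is obtained by drawing mi for each observation before the loop and using it in place of m.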

A more general case for this method is to assume that the number m follows a certain distribution f(m;λ), say, such as a geometric distribution or a negative binomial distribution. For each observation i, the m’s are the same for all the predictors Xj, j = 1,...,p, but the m’s differ across observations. We have \(m_{i} \stackrel {\text {iid}}{\sim } f(m; \lambda )\) for i = 1,...,n. The interval Yi, i = 1,...,n, is now given by Eq. 33 but with m replaced by mi. By allowing a random value for m, this simulation method covers more general scenarios. For instance, the daily price for a particular stock is an interval where the lower bound is the minimum price and the upper bound is the maximum price. The prices for the stock are recorded on a per-transaction basis every trading day, but the number of transactions on each day is not fixed; rather, it is a random number that follows a certain distribution.

A problem with this simulation method is that it cannot guarantee that, within the obtained intervals Yi, i = 1,...,n, the aggregated observations are uniformly spread across the interval. However, it can be verified that the observations within an interval Yi obtained in this way are approximately uniformly distributed for a relatively large m, say, m ≥ 3000; see Xu (2010). The advantage is that this method is close to how interval data sets are collected in practice.

4.2 Comparison of the k-Regressions and k-Means Algorithms

The k-means algorithm is designed for spherical data structures. When the clusters in a data set are not spherical, the algorithm can fail. For example, if the variables are highly correlated within a cluster and the clusters overlap, it is difficult for the k-means algorithm to recover such clusters. In this section, we give two examples (with m small and large, respectively) where the k-means algorithm fails to recover the true clusters while the k-regressions algorithm succeeds. The k-means clustering method for interval-valued data is based on the algorithms in Chavent et al. (2002) and de Souza and de Carvalho (2004). In each case, we implement the k-means clustering method with the city-block distance and with the Hausdorff distance. For the k-regressions clustering method, we use the center distance for demonstration purposes. Like the k-means algorithm, the k-regressions algorithm does not guarantee convergence to the correct partition from a random starting partition. Therefore, multiple initial partitions are needed; the one with the minimum SSR (see Eq. 22) is deemed to be the convergence result for the k-regressions algorithm. We call an initial partition that converges to the minimum SSR a good initial partition.

Data I

Our first data set (I) is composed of three clusters that follow the equations:

$$ (1)~ Y = 142 + 5X + \epsilon_1,~~ (2)~ Y = 53 - 3X + \epsilon_2,~~ (3)~ Y = -43 + 0.6X + \epsilon_3, $$
(34)

respectively, where \(\epsilon _{1} \sim N(0,15^{2})\), \(\epsilon _{2} \sim N(0,12^{2})\), and \(\epsilon _{3} \sim N(0,7^{2})\). We apply the simulation method described in Section 4.1 and set m = 25. The observations of these three regression models are simulated separately with 200 observations for each, and then the three data sets are stacked into one data set.

Figure 1a shows the three true clusters together with the three linear regression lines of Eq. 34. Figure 1b shows the clustering results based on the k-means algorithm when the city-block distance was used, while Fig. 1c gives the k-means clustering results using the Hausdorff distance. From Fig. 1b and c, we see that both k-means algorithms form clusters around the intersection areas between the three clusters of Eq. 34, which are clearly not the correct clusters.

Fig. 1 Comparison between the clustering results of the k-means algorithm and the k-regressions algorithm for Data I

Figure 1d, e, and f show the clustering process of the k-regressions algorithm with a good initial partition. Figure 1d shows the first (initialization) iteration of the k-regressions algorithm, while Fig. 1e shows the third iteration, where the algorithm starts to converge to the true linear regression models in Eq. 34. Figure 1f shows the tenth and final iteration, where the algorithm has converged to the three true linear regression models in Eq. 34. The three linear regression models obtained by the k-regressions algorithm are, respectively,

$$ (1)~ Y = 138.80 + 4.94X,~~ (2)~ Y = 54.13 - 3.18X,~~ (3)~ Y = -41.7 + 0.65X. $$
(35)

These coefficients are close to the true coefficients in Eq. 34. In addition, by comparing the true data set in Fig. 1a and the k-regressions clustering results in Fig. 1f, it is safe to say that the k-regressions algorithm recovers the three true clusters for Data I in Eq. 34. A further investigation shows that all the misclassified observations come from the intersection areas between the three clusters.

Data II

Our second data set (II) contains three clusters. We set m = 3000 in Eq. 33. One hundred observations for each regression model are simulated and then all the observations are stacked into one data set. The three linear regression models between the two variables are as follows:

$$ (1)~ Y = 150.5 + 4.5X + \epsilon_1,~~ (2)~ Y = 53 - 3X + \epsilon_2,~~ (3)~ Y = -53 + 0.5X + \epsilon_3, $$
(36)

respectively, where \(\epsilon _{1} \sim N(0,15^{2})\), \(\epsilon _{2} \sim N(0,12^{2})\), and \(\epsilon _{3} \sim N(0,7^{2})\).

The simulated Data II with a total of 300 observations and 3 clusters is visualized in Fig. 2a where the three true regression lines are also plotted. Figure 2b and c give the clustering results by the k-means algorithm with the city-block distance and the Hausdorff distance. We can see clearly that the k-means algorithm with both the city-block distance and the Hausdorff distance fails to recover the correct clusters.

Fig. 2 Comparison between the clustering results of the k-means algorithm and the k-regressions algorithm for Data II

Figure 2d, e, and f show the progress of the k-regressions algorithm through nine iterations on Data II with a good initial partition. Figure 2d is the plot of the three clusters for the first iteration (initialization). Figure 2e shows the third iteration of the algorithm, where the three clusters are already close to the three true clusters. The ninth (final) iteration is presented in Fig. 2f and shows clearly the convergence of the algorithm to the correct three clusters. The linear regression models estimated by the symbolic variation method for the three converged clusters are, respectively,

$$ (1)~ Y = 151.13 + 4.49X,~~ (2)~ Y = 54.93 - 2.95X,~~ (3)~ Y = -53.74 + 0.54X. $$
(37)

The estimated coefficients in Eq. 37 and the true coefficients in Eq. 36 are quite close. In addition, by comparing the plot of the original three linear regression models in Fig. 2a and the plot of the converged three clusters in Fig. 2f, it is observed that the k-regressions algorithm successfully recovered the true structure of Data II. Both Data I and II are non-spherical; the k-means algorithm failed to recover the true structure, whereas our method succeeded.

In these simulated examples, we assume that the true number of clusters is known. To decide the optimal number of clusters, we calculate the weighted R-squared, \(R^{2^{(K)}}\), for K = 1,...,8, from Eq. 28. The elbow plot is the plot of \(1 - R^{2^{(K)}}\) versus the number of clusters K. Figure 3a and b show the elbow plots for Data I from Eq. 34 and Data II from Eq. 36, respectively. For both data sets, the elbow plots correctly show that the optimal number of clusters is K = 3.

Fig. 3 Determining the number of clusters K by an elbow plot

4.3 Performance of the k-Regressions Algorithm

4.3.1 Data Structures

In this section, we simulate data sets with different structures to investigate the performance of the k-regressions algorithm. We first consider three data sets with p = 1. Data A and Data B contain three clusters, while Data C contains four clusters.

Table 1 provides the parameter setup for the three data sets. In Table 1, n is the sample size for each of the clusters; β0 and β1 are the coefficients of the linear relation for each cluster. The values μx and σx are the parameters of the normal distribution \(N(\mu_{x},\sigma_{x}^{2})\) from which the interval midpoints of the predictor variable X are drawn. The value λx is the parameter of the exponential distribution \(\exp (\lambda _{x})\) from which the interval ranges of X are drawn. The error terms 𝜖i are drawn from a normal distribution \(N(0,\sigma^{2}_{\epsilon})\), where the values of σ𝜖 are shown in the row “σ𝜖” of Table 1.

Table 1 Parameter setup for the Data A, B, and C

Thus, the true linear regression equations for the three clusters of Data A have the structure:

$$ (1)~ Y = 1.0 + 1.3X,~~ (2)~ Y = 45 + 1.8X,~~ (3)~ Y = 45 - 2.5X; $$
(38)

those for Data B are as follows:

$$ (1)~ Y = 142 + 5X,~~ (2)~ Y = 33 - 3X,~~ (3)~ Y = -73 + 0.6 X; $$
(39)

and those for Data C are as follows:

$$ (1)~ Y = 2.0 + 0.8X,~~ (2)~ Y = 1.0 + 2.3X,~~ (3)~ Y = 3.0 - 1.8 X,~~ (4)~ Y = 1.0+4.3X. $$
(40)

The data structures for Data A, B, and C are shown in Fig. 4a, b, and c, respectively. The regression lines in each plot are the lines recovered by the k-regressions algorithm. We can see that the clusters overlap with each other for all three data sets; in particular, for Data C, a large proportion of the four clusters overlaps. In addition, within each data set, each cluster is concentrated around a linear regression line.

Fig. 4 Data structure for Data A (a), B (b), and C (c)

4.3.2 Application of the k-Regressions Algorithm

Consider first Data A. For this particular data set, we generate a random sample that follows the structure described in Table 1. Then, given the correct number of clusters K = 3, we use the k-regressions algorithm to recover the data structure. We try a number of random initial partitions; among these, the clustering result with the smallest sum of squared residuals of Eq. 22 is taken to be the converged result for that sample. For each simulation, we try 50 different random initial partitions. This whole process constitutes one replication. We then implement 100 such replications using different seeds to investigate the overall performance of the k-regressions algorithm.

For each distance and each cluster, the mean and standard deviation (std) of the estimated parameter values from these 100 replications are displayed in Table 2. For example, for the center distance, the mean estimated coefficients of β0 and β1 for cluster 1 are 0.93 and 1.32, respectively; the differences between the estimated values and the true values are small relative to the true values. The standard deviations of the estimated β0 and β1 for cluster 1 are 0.17 and 0.01, respectively. Small standard deviations of the coefficients indicate stable clustering results. Similar arguments apply to the other two clusters and to the other distances, though it is noted that the center distance gives considerably better fits when comparing the SSR values.

Table 2 k-regressions clustering results for Data A

For each replication, we tried 50 different initial partitions when applying the k-regressions algorithm to a particular simulation. The number of good initial partitions out of 50, n, gives an idea of how difficult it is for the algorithm to converge to the correct clusters from a random initial partition. Over the 100 replications, we calculate the mean and standard deviation of n, shown in the row “n out of 50” in Table 2 for the three distances. The SSR is also calculated for each replication; the mean and standard deviation of the SSR over the 100 replications are presented in the row “SSR”.

Table 3 presents the corresponding clustering results for Data B with 100 replications and can be interpreted in the same way as Table 2 for Data A. Given K = 3, for all three distances, the differences between the true coefficients and the mean estimated regression coefficients are small relative to the coefficient scales. Again, small standard deviations for all the estimated coefficients imply stable clustering results.

Table 3 k-regressions clustering results for Data B

The clustering results for Data C are presented in Table 4; its interpretation follows in a similar manner to that of Table 2 for Data A and Table 3 for Data B. Note that for Data C, a large proportion of the four clusters overlaps, which makes it more difficult for the k-regressions algorithm to converge to the correct clusters. For each replication, we tried 200 different initial partitions. The differences between the true coefficients and the mean estimated coefficients are small relative to the scales of the coefficients. The standard deviations are small for all the estimated coefficients and all three distances. However, the intercept estimates for clusters 2, 3, and 4 are not as accurate as for cluster 1. This is not surprising given that cluster 1 is more separated from the other three clusters.

Table 4 k-regressions clustering results for Data C

Now, let us use the same three data structures, Data A, B, and C, to investigate the performance of determining the optimal number of clusters by the elbow method and by the adjusted R2. For a particular data set, we generate a random sample based on its parameter setup and implement the k-regressions algorithm for K = 1,…,10. For each K, we try a number of different initial partitions and select the result with the smallest SSR as the clustering result. The \(R^{2^{(K)}}\) from Eq. 28 and \(\bar {R}^{2^{(K)}}\) from Eq. 32 are calculated for each K = 1,…,10. The elbow plot shows \(1-R^{2^{(K)}}\) versus K. We also plot \(\bar {R}^{2^{(K)}}\) versus K, where the maximum \(\bar {R}^{2^{(K)}}\) determines the optimal number of clusters. This whole process constitutes one replication, and we implement a total of 20 replications to test the performance of the elbow method and the adjusted R2.

Figure 5a, b, and c are the elbow plots of \(1-R^{2^{(K)}}\) against K for Data A, B, and C, respectively, where the grey lines are the elbow plots for the 20 individual replications, and the blue line is the average over the 20 replications. We can see that the elbow plots identify the correct optimal number of clusters for all three data sets, K = 3 for Data A and B, and K = 4 for Data C. It is relatively difficult to determine the optimal number of clusters for Data C due to the overlapping clusters; nevertheless, the elbow plots correctly determined the number of clusters for all 20 replications. Figure 5d, e, and f show the plots of \(\bar {R}^{2^{(K)}}\) for Data A, B, and C, respectively. For each of the three data structures, the optimal number of clusters determined by the largest \(\bar {R}^{2^{(K)}}\) is mostly larger than the true number of clusters across the 20 replications.

Fig. 5 Elbow plots by weighted R2 and adjusted R2 for Data A (a) and (d), Data B (b) and (e), and Data C (c) and (f)

Generally, the elbow method is a stable and reliable way to determine the optimal number of clusters. There could be cases where \(1-R^{2^{(K)}}\) decreases gradually and consistently so that an elbow point is hard to find. Usually such scenarios indicate that there is no number of clusters that separates the data well, and subjective judgment is needed for a decision; fixing a reasonable cutoff for \(R^{2^{(K)}}\) is a realistic option in practice. The adjusted R2, \(\bar {R}^{2^{(K)}}\), usually overestimates the optimal number of clusters and so is not a good method for determining it.

4.3.3 p > 1

Consider now three simulated data sets D, E, and F, with p = 3, 5, and 5, respectively. The parameter setups are given in Table 5; the tables and figures are shown in the Supplementary Materials S1. The sample size for each cluster in Data D is 200, while the sample size for each cluster in Data E and in Data F is 300. As before, there are 100 replications for each case. The estimated parameter values for each of the three distances are given in Tables 6, 7, and 8 for Data D, E, and F, respectively. The corresponding elbow plots are shown in Fig. 10, with panels a and d showing the weighted \(1-R^{2^{(K)}}\) plot and the adjusted \(\bar {R}^{2^{(K)}}\) plot for Data D, Fig. 10b and e for Data E, and Fig. 10c and f for Data F, respectively. As for the p = 1 cases, the estimated results compare well with the true values, and the fits are good as judged by the SSR values and the elbow plots, again corroborating the merits and usefulness of the proposed k-regressions algorithm. Tables 6–8 also show the time (in seconds) to run the 100 replications. From these, we see there is very little difference between these times as p increases; rather, if anything, any time differences are due to the different distance measures used, though the standard deviations are large relative to the respective sample means.

5 Real Data Application

The methodology introduced herein is applied to the faces data set of Leroy et al. (1996). The predictor variable is X = length between the outer corners of the eyes, i.e., eye-span, and the response variable is Y = length between the inner corners of the eyes, i.e., the length of the bridge of the nose. There are n = 27 interval-valued observations (see Table 9 in Supplementary Materials S2). The resulting K = 3 clusters, obtained from the k-regressions algorithm, are shown in Fig. 6. The maximum number of clusters was set at K = 5. We see from the elbow plot in Fig. 7 that the optimal number is the K = 3 of Fig. 6. The respective regression equations are as follows:

$$ \begin{array}{@{}rcl@{}} \text{Cluster 1 (black): ~~} Y &= -87.52 + 0.91 X + \epsilon, \end{array} $$
(41)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 2 (red): ~~~~~} Y &= -143.19 + 1.20 X + \epsilon, \end{array} $$
(42)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 3 (green): ~~} Y &= -80.17 + 0.89 X + \epsilon. \end{array} $$
(43)

For all three clusters, the length between the two outer corners of the eyes (X) is positively correlated with the length between the two inner corners of the eyes (Y). However, the relationship between the two lengths differs somewhat across the clusters.

Fig. 6 Three k-regressions clusters for the faces data

Fig. 7 Elbow plot for the faces data set clustering

The clusters consist of C1 = {7,8,9,19,20,21,25,26,27} faces, C2 = {4,5,6,10,11,12,13,14,15} faces, and C3 = {1,2,3,16,17,18,22,23,24} faces, respectively. As it happens, these faces are in fact three observations from each of nine individuals. The algorithm (naturally unaware of this fact) always puts the measurements from the same individual into the same cluster, which enhances the credibility of the algorithm.

A second data set is the cars data set discussed in de Carvalho et al. (2010). There are n = 33 cars. The k-regressions algorithm is applied with response variable Y = price and predictor variable X = engine capacity (the data are in Table 10 in S2). The k-regressions algorithm gave the K = 3 clusters shown in Fig. 8. The respective regression equations (where the data are standardized) are as follows:

$$ \begin{array}{@{}rcl@{}} \text{Cluster 1 (red): ~~~~~} Y &= -44.53 + 78.05 X + \epsilon, \end{array} $$
(44)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 2 (green): ~~} Y &= -66.78 + 58.98 X + \epsilon, \end{array} $$
(45)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 3 (blue): ~~~~} Y &= -68.24 + 73.00 X + \epsilon. \end{array} $$
(46)

The elbow plot is shown in Fig. 9 suggesting that the optimal K = 3. The clusters consist of C1 = {11,15,16,22} cars, C2 = {1,2,3,6,7,8,9,10,12,13,14,18,20,21,26, 29,30,31,32,33} cars, and C3 = {3,4,16,18,22,23,24,26,27} cars.

Fig. 8 Three k-regressions clusters for the cars data

Fig. 9 Elbow plot for the cars data set clustering

These cars data were also analysed by de Carvalho et al. (2010) using the same variables. Importantly, they also obtained three clusters. However, it is difficult to make any further comparison since they apply separate classical regressions to the interval center and half-range point values, and so (apart from the deficiencies discussed in Section 3.1) it is not possible to produce cluster-regression fits along the lines of Eqs. 44–46 above.

6 Conclusion

A regression-based approach to partitioning a data set into its relevant clusters, originally introduced by Charles (1977), exists for classical data. For interval data, we have introduced a k-regressions algorithm to partition a set of interval-valued observations into its underlying clusters. The algorithm was tested on a wide variety of simulated data sets and on two real interval data sets, and it worked well. The new k-regressions algorithm demonstrably excelled when compared with the well-known k-means partitioning methodology.