1 Introduction

In this article, we adapt the basic dynamical partitioning technique of Diday (1971a, b) and Diday and Simon (1976) for partitioning n observations into K clusters, to develop appropriate algorithms based on regression criteria for interval-valued observations. Interval data are examples of symbolic data first introduced by Diday (1987). Diday’s (1971b) dynamical partitioning technique extends the k-means method of MacQueen (1967) for classical data, with several authors (e.g., Chavent et al., 2002) developing k-means methods for selected classes of symbolic data. There are a few papers in the literature which use regression ideas as a basis for the partitioning criteria in the k-means approach for classical data, starting with the initial development in Charles (1977). We introduce regression based methods to the k-means technique for interval data.

We start in Section 2 with some background literature on the k-means method and its application to regression methodology. Then, in Section 3, the proposed algorithm for regression-based clustering of interval observations is described. Simulation studies covering several different data structures are presented in Section 4. Applications to real data in Section 5 illustrate and verify the integrity of the proposed algorithm; advantages and disadvantages are also discussed. Some concluding remarks are found in Section 6.

2 Background

Cluster analysis is a common statistical tool that divides a population (of size n, say) into different sub-populations such that the observations within the same sub-population are as homogeneous as possible while observations from different sub-populations are as non-homogeneous as possible. The fundamental and most well-known clustering method for the partitioning process is the k-means algorithm originally proposed by MacQueen (1967). For a fixed K, the k-means algorithm requires initial clusters to start the process. This initialization could be K seeds or K clusters; a detailed discussion can be found in Anderberg (1973) and Cormack (1971). The algorithm then partitions the n observations into K clusters under the rule that each observation belongs to the cluster with the nearest mean. A summary of the k-means algorithm and its extensions can be found in Jain et al. (1999), Bock (2007, 2008), and Jain (2010).

Similar to k-means clustering, the cluster-wise linear regression method tries to recover the data structure where the observations are clustered using multiple linear regression models. Cluster-wise linear regression partitions the n observations into K subsets where each observation belongs to its nearest linear regression model. For classical data, the cluster-wise linear regression method is one of the most developed clustering methods in statistics. Analogously with the k-means algorithm, Späth (1979, 1981, 1982) partitioned the data into K subsets and fitted K linear regressions such that the total sum of squares of the errors is locally minimized. DeSarbo and Cron (1988) utilized a maximum likelihood methodology using a mixture of conditional normal distributions to choose the appropriate partition that maximizes the likelihood function, which resulted in a fuzzy cluster-wise linear regression method. The assumptions for ordinary linear regression modeling apply to the cluster-wise linear regression. Wedel and Kistemaker (1989) proposed another maximum likelihood methodology by which a particular observation can belong to only one cluster. Later, Tibshirani et al. (2001) and Shao and Wu (2005) explored methods of determining the number of clusters for a cluster-wise linear regression clustering approach. Zhang (2003) introduced k-harmonic means clustering for the cluster-wise linear regression method, which is less sensitive to the choice of initialization. Rao et al. (2007) and Qian and Wu (2011) extended Späth’s (1982) method to one that is more robust by applying M-estimation to the linear regression modeling. Bougeard et al. (2017, 2018) used blocks in a partial least squares setting. A good review can be found in Brusco et al. (2008).

Our concern is with interval-valued observations. Intervals are examples of symbolic data, which can be defined broadly as hypercubes or Cartesian products of distributions in p-dimensional space \(\mathbb {R}^{p}\), in contrast to classical data which are points in \(\mathbb {R}^{p}\). While many observations are naturally symbolic (e.g., species measurements, temperatures recorded as daily minimum and maximum values), most symbolic data today arise from aggregating the large data sets accumulated by modern computers into more manageable forms for analysis. How any particular aggregation proceeds depends on the scientific questions underlying the investigation at hand. A key feature of such data is that they contain internal variation. For example, suppose an aggregation produced a range of individual values across an interval [10,20], say. Using a summary statistic such as the midpoint (here 15) for subsequent analyses results in a loss of critical information, since, e.g., this interval cannot be distinguished from a second interval [14,16] which has the same midpoint as the first. An extensive coverage of the types of symbolic data and their fundamental properties can be found in Bock and Diday (2000), Billard and Diday (2003, 2006), Diday and Noirhomme-Fraiture (2008), Noirhomme-Fraiture and Brito (2011), and Diday (2016), with non-technical introductions in Billard (2011, 2014).

It is important to remember that although observations are aggregated into intervals, each aggregated observation within an interval remains an observation from some underlying distribution, e.g., a normal distribution. Thus, descriptive statistics calculated from these intervals are still point estimates. It is assumed, however, that those normally distributed (say) observations are uniformly spread across the interval (or sub-intervals for histogram-valued data), analogous with the calculation of histograms (or sample means, etc.) of grouped data in elementary statistics courses. Thus, Bertrand and Goupil (2000) obtained the sample mean as a point value (see Eq. 3). Likewise, for interval-valued observations, a sample variance has a point value, as does each sample regression parameter value, based on n observations (see, e.g., Eq. 9 and Eq. 6, respectively).

Extensions of the k-means algorithm of MacQueen (1967) and the adaptive k-means algorithm of Diday and Simon (1976) to symbolic data are based on the traditional k-means criteria involving distances between observations and the centers that represent the obtained clusters; some use a median measure rather than the means. Thus, e.g., for interval-valued data, de Carvalho et al. (2006) consider an L2 distance; Chavent et al. (2002) use Hausdorff distances; de Souza and de Carvalho (2004), de Souza et al. (2004), and de Carvalho and Lechevallier (2009) apply city-block distances; de Carvalho et al. (2004a, b) use Chebychev distances; and de Carvalho and Lechevallier (2009) apply an adaptive k-means method. Also, for histogram data, Verde and Irpino (2007), Irpino et al. (2006), and Košmelj and Billard (2012) consider Mallows’ distance for histogram observations; and Korenjak-Černe et al. (2011) and Batagelj et al. (2015) extend the k-means leaders approach to discrete distributions.

3 Regression-Based k-Means for Interval Data

3.1 Regression Model

Suppose we have n observations in a data set with response variable Y and p predictor variables X = (X1,...,Xp). All the response and predictor variables are interval-valued random variables. Let Xij, i = 1,...,n, j = 1,...,p, be the i th realization for the j th predictor variable Xj, denoted by Xij = [xija,xijb] with \(x_{ija}, x_{ijb} \in \mathbb {R}\) and \(x_{ija} \le x_{ijb}\). Similarly, let Yi be the i th realization for the response variable Y, denoted by Yi = [yia,yib] with \(y_{ia} \le y_{ib}\), \(i = 1,\dots , n\). Then, the multiple linear regression model is

$$ Y = \beta_{0} + \beta_{1} X_{1} + {\dots} + \beta_{p} X_{p} + \boldsymbol{\epsilon} = \mathbf{X}^{\prime} \boldsymbol{\beta}^{\ast} + \boldsymbol{\epsilon}, $$
(1)

where \(\boldsymbol {\beta }^{\ast } = (\beta _{0}, \beta _{1}, \dots , \beta _{p})^{\prime }\) is the set of regression coefficients of the p + 1 variables \(\mathbf {X}^{\prime } = (1, X_{1},\dots ,X_{p})\), and 𝜖 is the error interval vector. In Eq. 1, given the data, Y is an n × 1 vector and \(\mathbf {X}^{\prime }\) is an n × (p + 1) matrix. Equivalently, Eq. 1 may be written as

$$ Y - \bar{Y} = (\mathbf{X} - \bar{\mathbf{X}})^{\prime} \boldsymbol{\beta} + \boldsymbol{\epsilon}, $$
(2)

where \(\boldsymbol {\beta } = (\beta _{1}, \dots , \beta _{p})^{\prime }\), \(\mathbf {X} = (X_{1}, \dots , X_{p})\) and where \(\bar {\mathbf {X}} = (\bar {X}_{1},\dots , \bar {X}_{p})\) and \(\bar {Y}\) are the sample means defined, respectively, by

$$ \bar{X}_j = \frac{1}{2n}\sum\limits_{i = 1}^n(x_{ija}+x_{ijb}),~ j = 1,\dots, p, ~~ \bar{Y} = \frac{1}{2n}\sum\limits_{i = 1}^n(y_{ia}+y_{ib}). $$
(3)

See any of the many elementary texts on regression for properties of the model (e.g., Johnson and Wichern, 2007; Draper and Smith, 1966); see also Xu (2010) for interval observations.

From Eq. 2, the error sum of squares can be written as

$$ S = \sum\limits_{i = 1}^{n} [Y_{i} - \bar{Y} - (X_{i1} - \bar{X}_{1})\beta_{1} - {\dots} - (X_{ip}- \bar{X}_{p})\beta_{p}]^{2}, $$
(4)

for given observations \(i = 1, \dots , n\). Then differentiating the right-hand-side of Eq. 4 with respect to \(\beta _{j}, ~ j = 1, \dots , p\), we have

$$ \frac{\partial S}{ \partial \beta_{j}} = -2 \sum\limits_{i = 1}^{n} [Y_{i} - \bar{Y} - (X_{i1} - \bar{X}_{1})\beta_{1} - {\dots} - (X_{ip}- \bar{X}_{p})\beta_{p}](X_{ij} - \bar{X}_{j}), $$
(5)

Setting these derivatives to zero and writing \(\beta _{j} = \hat {\beta }_{j}\), we can solve the set of p equations in Eq. 5 to obtain the least squares estimators of the βj. In matrix terms, this becomes

$$ \begin{array}{@{}rcl@{}} (\hat{\beta}_{1}, \dots, \hat{\beta}_{p})^{\prime} &=& ((\mathbf{X} - \bar{\mathbf{X}})^{\prime}(\mathbf{X} - \bar{\mathbf{X}}))^{-1} (\mathbf{X} - \bar{\mathbf{X}})^{\prime}(\mathbf{Y} - \bar{Y}), \\ \hat{\beta}_{0} &=& \bar{Y} - (\hat{\beta}_{1} \bar{X}_{1} + {\dots} + \hat{\beta}_{p}\bar{X}_{p}), \end{array} $$
(6)

where the estimator \(\hat {\beta }_{0}\) follows from Eq. 1.

To obtain the values of the elements in these matrices, let us re-write

$$ \begin{array}{@{}rcl@{}} (\mathbf{X}-\bar{\mathbf{X}})^{\prime}(\mathbf{X}-\bar{\mathbf{X}}) &=& \left( \begin{array}{ccc} {\sum}_{i=1}^{n}(X_{i1}-\bar{X}_{1})^{2}&\cdots&{\sum}_{i=1}^{n}(X_{i1}-\bar{X}_{1})(X_{ip}-\bar{X}_{p})\\ \vdots&\ddots&\vdots\\ {\sum}_{i=1}^{n}(X_{ip}-\bar{X}_{p})(X_{i1}-\bar{X}_{1})&\cdots&{\sum}_{i=1}^{n}(X_{ip}-\bar{X}_{p})^{2} \end{array} \right)_{p \times p} \\ &=&\left( \sum\limits_{i=1}^{n}(X_{ij}-\bar{X}_{j})(X_{ij^{\prime}}-\bar{X}_{j^{\prime}})\right)_{p \times p} \\ &=&\left( n \times \text{Cov}(X_{j},X_{j^{\prime}})\right)_{p \times p}, \quad j, j^{\prime} = 1,\dots, p, \end{array} $$
(7)

and

$$ \begin{array}{@{}rcl@{}} {(\mathbf{X}-\bar{\mathbf{X}})}^{\prime}(\mathbf{Y}-\bar{Y})&=& \left( \sum\limits_{i=1}^{n}(X_{ij}-\bar{X}_{j})(Y_{i}-\bar{Y})\right)_{p \times 1} \\ &=&\left( n \times \text{Cov}(X_{j},Y)\right)_{p \times 1} \quad j = 1,\dots, p. \end{array} $$
(8)

Thus, it is clear that these matrix elements are simply n times the sample covariances between two variables. The question now is to find expressions for these elements for given data; i.e., we need to estimate the elements of the p × p matrix \({(\mathbf {X} - \bar {\mathbf {X}})}^{\prime }(\mathbf {X} - \bar {\mathbf {X}}) \equiv \mathbf {M}_{xx}\) (say), and likewise the elements of the p × 1 matrix \(\mathbf {M}_{xy} \equiv (\mathbf {X} - \bar {\mathbf {X}})^{\prime }(\mathbf {Y} - \bar {Y})\).

Consider first the diagonal elements \(M_{xx}(j, j) = n \text {Cov}(X_{j}, X_{j}) \equiv n \text {Var}(X_{j}) \equiv \text {SS}_{X_{j}}\), say. Bertrand and Goupil (2000) showed that for given interval observations, this sum of squares is estimated by

$$ \text{SS}_{X_{j}} = (1/3)\sum\limits_{i = 1}^{n} [x_{ija}^{2} + x_{ija}x_{ijb} + x_{ijb}^{2}] - n \bar{X}_{j}^{2}, ~~ \bar{X}_{j} = (1/2n) \sum\limits_{i = 1}^{n}(x_{ija} + x_{ijb}); $$
(9)

this can be re-written as

$$ \text{SS}_{X_{j}} = (1/3) \sum\limits_{i = 1}^{n} [(x_{ija} - \bar{X}_{j})^{2} + (x_{ija} - \bar{X}_{j})(x_{ijb} - \bar{X}_{j}) + (x_{ijb} - \bar{X}_{j})^{2}]. $$
(10)

When there is only one observation, i.e., when n = 1, Eq. 10 becomes

$$ \begin{array}{@{}rcl@{}} \text{SS}_{X_{j}}(n = 1) &=& (1/3) [(x_{ija} - \bar{X}_{ij})^{2} + (x_{ija} - \bar{X}_{ij})(x_{ijb} - \bar{X}_{ij}) + (x_{ijb} - \bar{X}_{ij})^{2}], \\ \bar{X}_{ij} &=& (1/2) (x_{ija} + x_{ijb}), \end{array} $$
(11)

where the single i subscript in Eq. 11 is retained. From this, we see that \(\text {SS}_{X_{j}}(n = 1) \neq 0\) unless xija = xijb, i.e., unless the observation Xij is a classical point observation. That is, \(\text {SS}_{X_{j}}(n = 1)\) represents the internal variation for this interval Xij observation. Summing over all \(i = 1, \dots , n\) observations, we obtain the Within variation, \(\text {WithinSS}_{X_{j}}\), as

$$ \text{WithinSS}_{X_{j}} = (1/3)\sum\limits_{i = 1}^{n} [(x_{ija} - \bar{X}_{ij})^{2} + (x_{ija} - \bar{X}_{ij})(x_{ijb} - \bar{X}_{ij}) + (x_{ijb} - \bar{X}_{ij})^{2}]. $$
(12)

Re-arranging, we can show that Eq. 12 can be written as

$$ \text{WithinSS}_{X_{j}} = \sum\limits_{i = 1}^{n} [(x_{ijb} - x_{ija})^{2}/12]. $$
(13)

If we look at the variation between observations, viz., \(\text {BetweenSS}_{X_{j}}\), we have

$$ \text{BetweenSS}_{X_{j}} = \sum\limits_{i = 1}^{n} [(x_{ija} + x_{ijb})/2 -\bar{X}_{j}]^{2}. $$
(14)

From Eqs. 13 and 14, it is easy to show that

$$ \text{SS}_{X_{j}} \equiv \text{TotalSS}_{X_{j}} = \text{WithinSS}_{X_{j}} + \text{BetweenSS}_{X_{j}} = M_{xx}(j, j) $$
(15)

as given in Eq. 10. It is noted that the expressions on the right-hand-side of Eqs. 13 and 14 are conditional on an assumption that the aggregated values within an interval are uniformly spread across the interval. (As an aside, although this is a so-far universally accepted assumption, different assumptions about the internal interval spread would change the formulae of these two expressions).
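
As a quick numerical check of the decomposition in Eq. 15, the short Python sketch below computes the total sum of squares from Eq. 10 and the Within and Between components from Eqs. 13 and 14 for a handful of illustrative intervals (the interval values are ours, chosen only for demonstration).

```python
# Numerical check of Eq. 15: TotalSS = WithinSS + BetweenSS for interval data.
# The intervals below are illustrative values only, not data from the paper.
intervals = [(10.0, 20.0), (14.0, 16.0), (5.0, 9.0), (12.0, 18.0)]
n = len(intervals)

xbar = sum(a + b for a, b in intervals) / (2 * n)                  # Eq. 3

total_ss = sum((a - xbar) ** 2 + (a - xbar) * (b - xbar) + (b - xbar) ** 2
               for a, b in intervals) / 3                          # Eq. 10
within_ss = sum((b - a) ** 2 / 12 for a, b in intervals)           # Eq. 13
between_ss = sum(((a + b) / 2 - xbar) ** 2 for a, b in intervals)  # Eq. 14

print(total_ss, within_ss + between_ss)   # the two printed values agree (Eq. 15)
```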

The off-diagonal elements \(M_{xx}(j, j^{\prime })\) are obtained analogously, by using the sum of products (SP). Thus, we have the Within SP between the random variables Xj and \(X_{j^{\prime }}\), \(\text {WithinSP}_{X_{j},X_{j^{\prime }}}\), as

$$ \text{WithinSP}_{X_{j},X_{j^{\prime}}} = \sum\limits_{i = 1}^{n} [(x_{ijb} - x_{ija})(x_{ij^{\prime}b} - x_{ij^{\prime}a})/12] $$
(16)

and the Between SP between Xj and \(X_{j^{\prime }}\), \(\text {BetweenSP}_{X_{j},X_{j^{\prime }}}\), as

$$ \text{BetweenSP}_{X_{j},X_{j^{\prime}}} = \sum\limits_{i = 1}^{n} [(x_{ija} + x_{ijb})/2 -\bar{X}_{j}][(x_{ij^{\prime}a} + x_{ij^{\prime}b})/2 -\bar{X}_{j^{\prime}}]. $$
(17)

Hence, adding Eqs. 16 and 17 and rearranging gives, for \(j, j^{\prime } = 1,\dots , p\),

$$ \begin{array}{@{}rcl@{}} M_{xx}(j, j^{\prime}) &=& \text{TotalSP} = \text{WithinSP} + \text{BetweenSP}, \\ M_{xx}(j, j^{\prime}) &=& \frac{1}{6} \sum\limits_{i=1}^{n}[2(x_{ija} - \bar{X}_{j})(x_{ij^{\prime}a} - \bar{X}_{j^{\prime}})+(x_{ija} - \bar{X}_{j})(x_{ij^{\prime}b} - \bar{X}_{j^{\prime}})\\ &&+ (x_{ij^{\prime}a} - \bar{X}_{j^{\prime}})(x_{ijb} - \bar{X}_{j}) +2(x_{ijb} - \bar{X}_{j})(x_{ij^{\prime}b} - \bar{X}_{j^{\prime}})]. \end{array} $$
(18)

When \(j = j^{\prime }\), Eq. 18 simplifies to Eq. 10. For the special case that the observations are classical point values (with, e.g., a ≡ [a,a]), these derivations reduce to those for classical data, as they should. The elements of Mxy are obtained similarly. Hence, the least squares estimators for β are found by substituting the elements of Mxx and Mxy into Eq. 6.

In the sequel, regression model fits will use the estimators obtained from Eqs. 1–18, and will be referred to as the symbolic variation methodology.
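
To make the computation concrete, the following Python sketch (our own minimal illustration, not the authors' implementation; numpy and the function name symbolic_ls are our choices) assembles Mxx and Mxy from the Within and Between sums of products in Eqs. 13–18 and then solves Eq. 6 for the coefficient estimates.

```python
import numpy as np

def symbolic_ls(Xa, Xb, ya, yb):
    """Symbolic-variation least squares fit for interval data (Eqs. 3, 6, 13-18).

    Xa, Xb : (n, p) arrays of lower/upper bounds of the p predictors.
    ya, yb : (n,) arrays of lower/upper bounds of the response.
    Returns (beta0_hat, beta_hat), the estimators of Eq. 6.
    """
    n, p = Xa.shape
    Xbar = (Xa + Xb).sum(axis=0) / (2 * n)        # sample means, Eq. 3
    ybar = (ya + yb).sum() / (2 * n)

    Xc = (Xa + Xb) / 2 - Xbar                     # centred interval midpoints
    Xr = Xb - Xa                                  # interval ranges
    yc = (ya + yb) / 2 - ybar
    yr = yb - ya

    # M_xx = BetweenSP (Eq. 17) + WithinSP (Eq. 16); its diagonal is Eq. 15.
    Mxx = Xc.T @ Xc + (Xr.T @ Xr) / 12
    # M_xy: the analogous Between + Within sums of products with Y.
    Mxy = Xc.T @ yc + (Xr.T @ yr) / 12

    beta = np.linalg.solve(Mxx, Mxy)              # Eq. 6, first line
    beta0 = ybar - Xbar @ beta                    # Eq. 6, second line
    return beta0, beta
```

When every interval degenerates to a point (xija = xijb), the Within terms vanish and this fit reduces to ordinary least squares on classical data, as noted above.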

Earlier methods developed for fitting regression models to interval-valued data include the center method of Billard and Diday (2000), whereby a model is fitted to the midpoints of the intervals. Although their method used the range of the predictor variables to find the prediction intervals, the calculation of the model parameters is based on the interval midpoints only, and so omits important information contained in the internal variations of the intervals. In an attempt to redress that problem, in a series of papers, de Carvalho and his colleagues (e.g., Lima Neto et al., 2004; de Carvalho et al., 2004a, b; Lima Neto et al., 2005; Lima Neto and de Carvalho, 2008) transformed the original interval-valued variables into point-valued center and half-range variables; a classical regression analysis was then conducted on the center values and the half-range values separately. Use of the center-range variables is equivalent to using the interval end-points. We note that all these methods use classical surrogates based on one or two single point values of the interval, rather than using the entire interval as in the symbolic variation method. It is easy to show that the variations (SS/SP) of the centers equate to the Between variations of the symbolic variation methodology, but the variations of the ranges do not equate to the Within variations of the symbolic variation method. Thus, in their various ways, these methods do not fully and correctly utilize all the variation in the data. There are, moreover, other statistical issues here; e.g., the assumption that the center and range are independent entities is unsustainable. While each of these previous methods advanced the subject at the time, each has its limitations.

3.2 Partitioning Criteria

Our concern is with partitioning our data set into a fixed number K of clusters where the observations within each cluster are identified by a specific regression model. Assume that the response variable Y has K different linear relationships with the predictor variables X. Let (Xk,Yk), k = 1,...,K, be the set of observations that follow the k th regression model (i.e., belong to the k th cluster Ck). Then,

$$ Y_k = \boldsymbol{X}_k^{\prime} \boldsymbol{\beta}_k + \boldsymbol{\epsilon}_k,~~~k=1,...,K, $$
(19)

where βk is the set of linear coefficients of the predictor variables for the k th regression model, and 𝜖k is the error vector.

Let nk, k = 1,...,K, be the number of observations in the k th cluster with \({\sum }_{k=1}^{K} n_{k} = n\). We assume the following:

A1. The number of observations nk satisfies p < nk ≤ n, k = 1,...,K, where p is the number of predictor variables, and n is the total number of observations in the whole data set. It can be shown that nk = n only if K = 1.

A2. The individual errors in a particular cluster k are drawn independently from a normal distribution with mean 0 and variance \({\sigma _{k}^{2}}\), \(N(0,{\sigma _{k}^{2}})\), and after aggregation (along the lines of, e.g., Eq. 33) become intervals. The error intervals 𝜖k are independent of \(\boldsymbol {\epsilon }_{k^{\prime }}\), \(k \ne k^{\prime }\), for \(k, k^{\prime }=1,...,K\).

The first assumption (A1) avoids the situation where nk ≤ p, in which there is no unique least squares solution for the k th cluster; the second assumption (A2) reduces the computational complexity of the problem. Our goal is to find an optimal partition \(P^{*} = (C_{1}^{*},...,C_{K}^{*})\) that minimizes the overall residuals of the regression models given the number of clusters K.

Given a partition P = (C1,...,CK), we can fit a linear regression model for each cluster as in Eq. 19. Denote the coefficient estimator of βk by \(\hat {\boldsymbol {\beta }}_{k}\) for k = 1,...,K. Then, the regression residuals for the k th cluster are defined as, for iCk,

$$ r_{ki} = d(Y_{i},\hat{Y}_{i}), $$
(20)

where \(d(Y_{i},\hat {Y}_{i})\) is the distance between the observation Yi and its predicted value \(\hat {Y}_{i}\). Since \(\hat {Y}_{i}=\boldsymbol {X}_{i}^{\prime }\hat {\boldsymbol {\beta }}_{k}\), Eq. 20 can be rewritten as \(r_{ki}=d(Y_{i}, \boldsymbol {X}_{i}^{\prime }\hat {\boldsymbol {\beta }}_{k})\). The predictive interval \(\hat {Y}_{i}\) using the symbolic variation method is, for iCk,

$$ \hat{Y}_{i} = [\hat{Y}_{ia},\hat{Y}_{ib}] = [\min_{\boldsymbol{X}\in \mathbb{X}}\boldsymbol{X}_{i}^{\prime}\hat{\boldsymbol{\beta}}_{k}, \max_{\boldsymbol{X}\in \mathbb{X}}\boldsymbol{X}_{i}^{\prime}\hat{\boldsymbol{\beta}}_{k}], $$
(21)

where \(\mathbb {X}=\{\boldsymbol {X} = (x_{j}): x_{ija} \le x_{j} \le x_{ijb}, j=1,...,p\}\); see Xu (2010). Our goal is to find an optimal partition that minimizes the sum of squared residuals (SSR) given K, viz.,

$$ P^{*}=\underset{P;~ \hat{\boldsymbol{\beta}}_{k}}{\arg\min}~ \text{SSR}, \qquad \text{SSR} = \sum\limits_{k=1}^{K} \sum\limits_{i \in C_{k}} r_{ki}^{2} = \sum\limits_{k=1}^{K} \sum\limits_{i=1}^{n_{k}}r_{ki}^{2}. $$
(22)

Since rki in Eq. 22 is defined as the distance between two intervals, Yi and \(\hat {Y}_{i}\), different definitions of the distance between these two intervals will affect the clustering results of our k-regressions algorithm. For illustrative concreteness, we consider three different distance definitions between two interval variables, specifically, center distance, Hausdorff distance and city-block distance; other distances could be used.

The center distance between two p-dimensional interval observations Z1 = (Z11,…, Z1p) and Z2 = (Z21,…,Z2p) with Zij = [zija,zijb], i = 1,2, j = 1,…,p, is defined as

$$ d_{C}(\mathbf{Z}_{1}, \mathbf{Z}_{2})= \sum\limits_{j=1}^{p}|z_{1j}^{c} - z_{2j}^{c}|, $$
(23)

where \(z_{ij}^{c}=(z_{ija} + z_{ijb})/2\) is the midpoint of the interval Zij, i = 1,2, j = 1,…,p. The Hausdorff (1937) distance is defined as

$$ d_{H}(\mathbf{Z}_{1}, \mathbf{Z}_{2})= \sum\limits_{j=1}^{p} \max\{|z_{1ja} - z_{2ja}|, |z_{1jb} - z_{2jb}|\}. $$
(24)

The city-block distance is defined as

$$ d_{CB}(\mathbf{Z}_{1}, \mathbf{Z}_{2})= \sum\limits_{j=1}^{p} [|z_{1ja} - z_{2ja}| + |z_{1jb} - z_{2jb}|]. $$
(25)
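
For concreteness, here is a sketch (ours; the helper names are not from the paper) of the three distances in Eqs. 23–25 as applied to the one-dimensional residual intervals Yi and Ŷi, together with the prediction interval of Eq. 21, whose endpoints are obtained by taking each predictor at the interval bound that minimizes or maximizes X′β̂k according to the sign of the corresponding coefficient.

```python
import numpy as np

def predict_interval(xa, xb, beta0, beta):
    """Prediction interval [Yhat_a, Yhat_b] of Eq. 21 for one observation.

    xa, xb : (p,) arrays of lower/upper bounds of the predictors.
    The extremes of X'beta over the hyper-rectangle are attained at interval
    endpoints, chosen according to the sign of each coefficient.
    """
    lo = beta0 + np.where(beta >= 0, beta * xa, beta * xb).sum()
    hi = beta0 + np.where(beta >= 0, beta * xb, beta * xa).sum()
    return lo, hi

# Distances between two intervals y = (ya, yb) and yhat = (yhat_a, yhat_b);
# these are Eqs. 23-25 with the sum over j reduced to the single response.
def d_center(y, yhat):                    # center distance, Eq. 23
    return abs((y[0] + y[1]) / 2 - (yhat[0] + yhat[1]) / 2)

def d_hausdorff(y, yhat):                 # Hausdorff distance, Eq. 24
    return max(abs(y[0] - yhat[0]), abs(y[1] - yhat[1]))

def d_cityblock(y, yhat):                 # city-block distance, Eq. 25
    return abs(y[0] - yhat[0]) + abs(y[1] - yhat[1])
```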

We observe that there are effectively two optimizations occurring in this process. The first is designed to minimize the sum of squared residuals when fitting the regression model within each cluster (i.e., minimize the regression sum of squared errors \({\sum }_{i = 1}^{n_{k}}{\epsilon _{i}^{2}}\) for the k th cluster regression, \(k = 1, \dots , K\)). This is not the same as minimizing \({\sum }_{i = 1}^{n}{\epsilon _{i}^{2}}\) over all observations ignoring the clusters. The second optimization is to minimize the sum of the squared distances \(r_{ki} = d(Y_{i}, \hat {Y}_{i})\), where the predicted value \(\hat {Y}_{i}\) is determined by the regression equation for the k th cluster. We also note that the explained variation in the data is due to a mix of explained variation from the regression fits within clusters and variation due to heterogeneity across clusters. For the classical data setting, Brusco et al. (2008) give a nice example illustrating how the explained variation can sometimes be dominated by the within-cluster variation and sometimes by the cluster heterogeneity.

3.3 Partitioning Algorithm

The k-regressions algorithm for each of the three distance definitions of Eqs. 23–25 is the same, except that the distance \(d(Y_{i},\boldsymbol {X}^{\prime }_{i}\hat {\boldsymbol {\beta }}_{k})\) used in the allocation step (iii) changes appropriately. Analogously with the algorithm in Späth (1979), we propose a k-regressions cluster-wise algorithm for interval-valued data as follows:

(i) Initialization: Choose a partition \(P^{(0)}=(C_{1}^{(0)},...,C_{K}^{(0)})\) randomly from all the possible partitions, or partition the whole data set into K clusters based on some prior knowledge.

(ii) Representation: For k = 1,...,K, fit the regression \(Y_{k} = \boldsymbol {X}_{k}^{\prime }\boldsymbol {\beta }_{k} + \boldsymbol {\epsilon }\) to the observations in the k th cluster of the partition \(P^{(l)}=(C_{1}^{(l)},...,C_{K}^{(l)})\), where \(l = 0 , 1, \dots \) denotes the l th iteration.

(iii) Allocation: For each observation Yi, i = 1,...,n, calculate its distance to the prediction \(\hat {Y}_{i}\) obtained from the k th regression line, \(d(Y_{i},\boldsymbol {X}^{\prime }_{i}\hat {\boldsymbol {\beta }}_{k})\), k = 1,...,K, and allocate the observation to its closest line. The updated partition is now \(P^{(l+1)}=(C_{1}^{(l+1)},...,C_{K}^{(l+1)})\).

(iv) Stop: Repeat (ii) and (iii) until the improvement in SSR of Eq. 22 is smaller than a predetermined criterion, or the number of iterations reaches a predetermined maximum.

For the representation step, we apply the symbolic variation method to fit the linear regression model. For the allocation step, the observations are allocated such that, for k = 1,...,K,

$$ C_k = \{(\boldsymbol{X},Y) | d(Y,\boldsymbol{X}^{\prime}\hat{\boldsymbol{\beta}}_k) \le d(Y,\boldsymbol{X}^{\prime}\hat{\boldsymbol{\beta}}_{k^{\prime}}), \forall k \ne k^{\prime}\}. $$
(26)

Given a data set, the algorithm cannot guarantee a global minimum of SSR. Thus, we run steps (i)–(iv) a number of times with different initializations and select the solution which has the lowest value of SSR. The selected partition can be further iterated until the SSR cannot be reduced any more. An expanded description is available in Liu (2016).
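
The sketch below is one plausible reading of steps (i)–(iv) with random restarts, not the authors' code; it assumes the hypothetical helpers symbolic_ls, predict_interval, and a distance function such as d_center from the earlier sketches.

```python
import numpy as np

def k_regressions(Xa, Xb, ya, yb, K, dist, n_starts=20, max_iter=100,
                  tol=1e-8, seed=0):
    """k-regressions clustering of interval data: steps (i)-(iv) with restarts."""
    rng = np.random.default_rng(seed)
    n, p = Xa.shape
    best_ssr, best_labels, best_fits = np.inf, None, None
    for _ in range(n_starts):
        labels = rng.integers(0, K, size=n)            # (i) random initial partition
        prev_ssr = np.inf
        for _ in range(max_iter):
            fits = []
            for k in range(K):                          # (ii) representation
                idx = labels == k
                if idx.sum() <= p:                      # guard for assumption A1
                    fits.append(None)
                else:
                    fits.append(symbolic_ls(Xa[idx], Xb[idx], ya[idx], yb[idx]))
            D = np.full((n, K), np.inf)                 # (iii) allocation
            for k, fit in enumerate(fits):
                if fit is None:
                    continue
                b0, b = fit
                for i in range(n):
                    yhat = predict_interval(Xa[i], Xb[i], b0, b)
                    D[i, k] = dist((ya[i], yb[i]), yhat)
            labels = D.argmin(axis=1)
            ssr = (D.min(axis=1) ** 2).sum()            # SSR of Eq. 22
            if prev_ssr - ssr < tol:                    # (iv) stop
                break
            prev_ssr = ssr
        if ssr < best_ssr:                              # keep the lowest-SSR restart
            best_ssr, best_labels, best_fits = ssr, labels.copy(), fits
    return best_ssr, best_labels, best_fits
```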

3.4 Number of Clusters

The k-regressions clustering algorithm implements the cluster-wise regression method when K is known. However, if we do not have prior knowledge about K, a poor choice of K can give misleading results. Xu (2010) gave a symbolic R-square (R2) for the linear regression of interval-valued data under the symbolic variation method as

$$ R^2 = \frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}, $$
(27)

where \(\hat {Y}\) is the predicted value of the response variable Y, and Var(⋅) is the symbolic variance calculated from Eq. 9. Using the symbolic R2, we propose the following methods to determine the number of clusters K.

Given a predetermined maximum number of clusters Kmax, for each K = 1,…, Kmax, calculate the R2 for each cluster k = 1,…,K, denoted by \(R_{k}^{2^{(K)}}\). For the whole data set, the weighted average R2 for the n observations given K is defined as

$$ R^{2^{(K)}}= \sum\limits_{k=1}^K w_k^{(K)}R_k^{2^{(K)}}, $$
(28)

where \(w_{k}^{(K)}=n_{k}/n\) is the weight of the R2 for the k th cluster, and nk = |Ck| is the number of observations in the k th cluster. In the plot of \((1-R^{2^{(K)}})\) versus K, the elbow point indicates the optimal number of clusters K.

Determining the optimal number of clusters K by looking for the elbow point can be subjective, especially when the elbow point is not obvious. Analogously with the adjusted R2 for the linear regression model, we propose an adjusted R2 to determine the optimal K for the k-regressions algorithm. We know that R2 for ordinary least squares regression models corresponds to the proportion of variation explained by the model; i.e., R2 can be defined as

$$ R^{2} = 1- \text{SS}_{\text{res}}/\text{SS}_{\text{tot}} = \text{SS}_{\text{reg}}/\text{SS}_{\text{tot}}, $$
(29)

where \(\text {SS}_{\text {tot}} = {\sum }_{i}(Y_{i}-\bar {Y})^{2}\) is the total sum of squares, \(\text {SS}_{\text {reg}} = {\sum }_{i}(\hat {Y}_{i}-\bar {Y})^{2}\) is the sum of squares of the regression, \(\text {SS}_{\text {res}} = {\sum }_{i}({Y_{i}-\hat {Y}_{i}})^{2}\) is the sum of squares of the residuals, and \(\bar {Y} = {\sum }_{i} Y_{i}/n\) is the sample mean of Y. The R2 in Eq. 29 can be rewritten as

$$ R^{2} = 1- \text{Var}_{\text{res}}/\text{Var}_{\text{tot}}, $$
(30)

where Varres = SSres/n and Vartot = SStot/n. The Varres and Vartot terms are biased estimators of the residual variation and the population variation, respectively. The adjusted R2 replaces these two variance estimators with unbiased versions, so that the adjusted R2 is defined as

$$ \bar{R}^2=1-\frac{\text{SS}_{\text{res}}/\text{df}_{\epsilon}}{\text{SS}_{\text{tot}}/\text{df}_t}, $$
(31)

where df𝜖 = n − p − 1 is the degrees of freedom of the residuals, and dft = n − 1 is the degrees of freedom of the total variation. The adjusted R2 adjusts the R2 in Eq. 29 so that it does not automatically increase as more terms are added.

The k-regressions algorithm fits K regressions to the whole data set, so that the total number of slope coefficients is Kp. For each K = 1,…,Kmax, analogously with the adjusted R2 for ordinary least squares regression, we define the adjusted weighted R2 for the k-regressions clustering as

$$ \begin{array}{ll} \bar{R}^{2^{(K)}} &= R^{2^{(K)}}- (1-R^{2^{(K)}})\frac{Kp}{n-Kp-1} \\ &\equiv {R}^{2^{(K)}} - Q^{(K)}, \end{array} $$
(32)

where \(Q^{(K)} = (1-R^{2^{(K)}})Kp/(n-Kp-1)\) is the penalty term.

The adjusted weighted R2 in Eq. 32 penalizes the \(R^{2^{(K)}}\) of Eq. 28 by the factor Q(K) when the number of clusters increases. Since we fit K different linear regression models to the whole data set, the number of parameters is K(p + 1) for the cluster-wise regression methodology.

From Eq. 32, \(\bar {R}^{2^{(K)}}\) is always smaller than \(R^{2^{(K)}}\). The \(\bar {R}^{2^{(K)}}\) increases only if the increase of K improves \(R^{2^{(K)}}\) more than the penalized term Q(K). Usually when K increases, \(\bar {R}^{2^{(K)}}\) increases and reaches a maximum at a certain value of K, and decreases afterwards. The value of K that maximizes \(\bar {R}^{2^{(K)}}\), or equivalently minimizes \(1-\bar {R}^{2^{(K)}}\), is the optimal number of clusters K. We compare the two methods of determining the optimal K by simulation results in the next section.
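
As a small illustration of the two criteria, the sketch below (our own naming; the per-cluster symbolic R2 of Eq. 27 is assumed to have been computed already, e.g., from the fitted clusters returned by the k-regressions sketch above) evaluates Eqs. 28 and 32 for one candidate K.

```python
import numpy as np

def weighted_r2(r2_by_cluster, n_by_cluster):
    """Weighted average R^2 of Eq. 28 for one candidate K."""
    w = np.asarray(n_by_cluster) / np.sum(n_by_cluster)   # weights w_k = n_k / n
    return float(w @ np.asarray(r2_by_cluster))

def adjusted_weighted_r2(r2K, n, p, K):
    """Adjusted weighted R^2 of Eq. 32, with penalty term Q^(K)."""
    Q = (1.0 - r2K) * K * p / (n - K * p - 1)
    return r2K - Q

# Choosing K: plot 1 - weighted_r2 against K = 1,...,Kmax and look for the
# elbow, or take the K that maximizes adjusted_weighted_r2.
```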

4 Simulation Study

We describe how our interval-valued observations are simulated in Section 4.1. Then, in Section 4.2, we compare the k-regressions clustering with the traditional k-means method. How the new algorithm performs is studied in Section 4.3 for several different data set structures.

4.1 Methodology

In practice, most interval data sets arise from aggregating classical data. From this perspective, we propose the following simulation method. The intervals of X are simulated so that the interval midpoints X(c) come from a multivariate normal distribution and the interval ranges X(r) come from exponential distributions. The intervals of Xj, j = 1,...,p, are given by \(X_{j} = [X_{j}^{(c)}-0.5X_{j}^{(r)}, X_{j}^{(c)}+0.5X_{j}^{(r)}]\). The spread of observations within these intervals is assumed to be uniform. For a particular observation i, to obtain the interval Yi, we randomly draw m values from the uniform distribution U(xija,xijb) for each j = 1,...,p, denoted by xij1,...,xijm. Here, m is a predetermined number. Then, the interval Yi = [yia,yib] is determined by

$$ \begin{array}{ll} y_{ia} & = \min\limits_{l \in \{1,...,m\}}\{\beta_0 + \beta_1 x_{i1l}+...+\beta_p x_{ipl}+\epsilon_{il}\},\\ y_{ib} & = \max\limits_{l \in \{1,...,m\}}\{\beta_0 + \beta_1 x_{i1l}+...+\beta_p x_{ipl}+\epsilon_{il}\}, \end{array} $$
(33)

where \(\epsilon _{il} \stackrel {iid}{\sim } N(0,\sigma ^{2})\) for i = 1,...,n, and l = 1,...,m. This method is practically reasonable. For example, traffic at a particular intersection is recorded multiple times every day; the minimum and maximum values are recorded as the traffic interval for that day.
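
A sketch of this simulation scheme for a single cluster is given below (Python with numpy; the function name and the treatment of the exponential rate parameter are our own assumptions, and the parameters are placeholders rather than the settings of any particular data set in the paper).

```python
import numpy as np

def simulate_cluster(n, beta0, beta, mu_x, sigma_x, lam_x, sigma_eps,
                     m=25, seed=1):
    """Simulate n interval observations from one regression cluster via Eq. 33."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    xc = rng.normal(mu_x, sigma_x, size=(n, p))        # interval midpoints X^(c)
    xr = rng.exponential(1.0 / lam_x, size=(n, p))     # interval ranges X^(r), rate lam_x
    Xa, Xb = xc - 0.5 * xr, xc + 0.5 * xr
    ya, yb = np.empty(n), np.empty(n)
    for i in range(n):
        xs = rng.uniform(Xa[i], Xb[i], size=(m, p))    # m uniform draws within the intervals
        ys = beta0 + xs @ np.asarray(beta) + rng.normal(0.0, sigma_eps, size=m)
        ya[i], yb[i] = ys.min(), ys.max()              # Eq. 33
    return Xa, Xb, ya, yb
```

The more general case described next, with a random number of draws, is obtained by drawing mi for each observation before the loop and using it in place of m.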

A more general case for this method is to assume that the number m follows a certain distribution f(m;λ), say, such as a geometric distribution or a negative binomial distribution. For each observation i, the m’s are the same for all the predictors Xj, j = 1,...,p, but the m’s differ across observations. We have \(m_{i} \stackrel {\text {iid}}{\sim } f(m; \lambda )\) for i = 1,...,n. The interval Yi, i = 1,...,n, is now given by Eq. 33 but with m replaced by mi. By allowing a random value for m, this simulation method covers more general scenarios. For instance, the daily price for a particular stock is an interval where the lower bound is the minimum price and the upper bound is the maximum price. The prices for the stock are recorded on a per-transaction basis every trading day, but the number of transactions on each day is not fixed; rather, it is a random number that follows a certain distribution.

A problem with this simulation method is that it cannot guarantee that, within the obtained intervals Yi, i = 1,...,n, the aggregated observations are uniformly spread across the interval. However, it can be verified that the observations within an interval Yi obtained in this way are approximately uniformly distributed for a relatively large m, say, m ≥ 3000; see Xu (2010). The advantage is that this method is close to how interval data sets are collected in practice.

4.2 Comparison of the k-Regressions and k-Means Algorithms

The k-means algorithm is designed for spherical data structures. When the clusters in a data set are not spherical, the algorithm can fail. For example, if the variables are highly correlated within a cluster and the clusters overlap, it is difficult for the k-means algorithm to recover such clusters. In this section, we give two examples (with m small and large, respectively) where the k-means algorithm fails to recover the true clusters while the k-regressions algorithm succeeds. The k-means clustering method for interval-valued data is based on the algorithms in Chavent et al. (2002) and de Souza and de Carvalho (2004). In each case, we implement the k-means clustering method with the city-block distance and with the Hausdorff distance. For the k-regressions clustering method, we use the center distance for demonstration purposes. Like the k-means algorithm, the k-regressions algorithm does not guarantee convergence to the correct partition from a random starting partition. Therefore, multiple initial partitions are needed; the one with the minimum SSR (see Eq. 22) is deemed to be the convergence result for the k-regressions algorithm. We call an initial partition that converges to the minimum SSR a good initial partition.

Data I

Our first data set (I) is composed of three clusters that follow the equations:

$$ (1)~ Y = 142 + 5X + \epsilon_1,~~ (2)~ Y = 53 - 3X + \epsilon_2,~~ (3)~ Y = -43 + 0.6X + \epsilon_3, $$
(34)

respectively, where \(\epsilon _{1} \sim N(0,15^{2})\), \(\epsilon _{2} \sim N(0,12^{2})\), and \(\epsilon _{3} \sim N(0,7^{2})\). We apply the simulation method described in Section 4.1 and set m = 25. The observations of these three regression models are simulated separately with 200 observations for each, and then the three data sets are stacked into one data set.

Figure 1a shows the three true clusters together with the three linear regression lines of Eq. 34. Figure 1b shows the clustering results based on the k-means algorithm when the city-block distance was used, while Fig. 1c gives the k-means clustering results using the Hausdorff distance. From Fig. 1b and c, we see that both k-means algorithms form clusters around the intersection areas between the three clusters of Eq. 34, which are clearly not the correct clusters.

Fig. 1 Comparison between the clustering results of the k-means algorithm and the k-regressions algorithm for Data I

Figure 1d, e, and f show the clustering process of the k-regressions algorithm with a good initial partition. Figure 1d shows the first (initialization) iteration of the k-regressions algorithm, while Fig. 1e shows the third iteration, where the algorithm starts to converge to the true linear regression models in Eq. 34. Figure 1f shows the tenth and final iteration, where the algorithm has converged to the three true linear regression models in Eq. 34. The three linear regression models obtained by the k-regressions algorithm are, respectively,

$$ (1)~ Y = 138.80 + 4.94X,~~ (2)~ Y = 54.13 - 3.18X,~~ (3)~ Y = -41.7 + 0.65X. $$
(35)

These coefficients are close to the true coefficients in Eq. 34. In addition, by comparing the true data set in Fig. 1a and the k-regressions clustering results in Fig. 1f, it is safe to say that the k-regressions algorithm recovers the three true clusters for Data I in Eq. 34. A further investigation shows that all the misclassified observations come from the intersection areas between the three clusters.

Data II

Our second data set (II) contains three clusters. We set m = 3000 in Eq. 33. One hundred observations for each regression model are simulated and then all the observations are stacked into one data set. The three linear regression models between the two variables are as follows:

$$ (1)~ Y = 150.5 + 4.5X + \epsilon_1,~~ (2)~ Y = 53 - 3X + \epsilon_2,~~ (3)~ Y = -53 + 0.5X + \epsilon_3, $$
(36)

respectively, where \(\epsilon _{1} \sim N(0,15^{2})\), \(\epsilon _{2} \sim N(0,12^{2})\), and \(\epsilon _{3} \sim N(0,7^{2})\).

The simulated Data II with a total of 300 observations and 3 clusters is visualized in Fig. 2a where the three true regression lines are also plotted. Figure 2b and c give the clustering results by the k-means algorithm with the city-block distance and the Hausdorff distance. We can see clearly that the k-means algorithm with both the city-block distance and the Hausdorff distance fails to recover the correct clusters.

Fig. 2 Comparison between the clustering results of the k-means algorithm and the k-regressions algorithm for Data II

Figure 2d, e, and f show the progress of the k-regressions algorithm through nine iterations on Data II with a good initial partition. Figure 2d is the plot of the three clusters for the first iteration (initialization). Figure 2e shows the third iteration of the algorithm, where the three clusters are already close to the three true clusters. The ninth (final) iteration is presented in Fig. 2f and shows clearly the convergence of the algorithm to the correct three clusters. The linear regression models estimated by the symbolic variation method for the three converged clusters are, respectively,

$$ (1)~ Y = 151.13 + 4.49X,~~ (2)~ Y = 54.93 - 2.95X,~~ (3)~ Y = -53.74 + 0.54X. $$
(37)

The estimated coefficients in Eq. 37 and the true coefficients in Eq. 36 are quite close. In addition, by comparing the plot of the original three linear regression models in Fig. 2a and the plot of the converged three clusters in Fig. 2f, it is observed that the k-regressions algorithm successfully recovered the true structure of Data II. Both Data I and II are non-spherical; the k-means algorithm failed to recover the true structure, whereas our method succeeded.

In these simulated examples, we assume that the true number of clusters is known. To decide the optimal number of clusters, we calculate the weighted R-squared, \(R^{2^{(K)}}\), for K = 1,...,8, from Eq. 28. The elbow plot is the plot of \(1 - R^{2^{(K)}}\) versus the number of clusters K. Figure 3a and b show the elbow plots for Data I from Eq. 34 and Data II from Eq. 36, respectively. For both data sets, the elbow plots correctly show that the optimal number of clusters is K = 3.

Fig. 3 Determining the number of clusters K by an elbow plot

4.3 Performance of the k-Regressions Algorithm

4.3.1 Data Structures

In this section, we simulate data sets with different structures to investigate the performance of the k-regressions algorithm. We first consider three data sets with p = 1. Data A and Data B contain three clusters, while Data C contains four clusters.

Table 1 provides the parameter setup for the three data sets. In Table 1, n is the sample size for each of the clusters; β0 and β1 are the coefficients of the linear relation for each cluster. The values μx and σx are the parameters of the normal distribution \(N(\mu_{x},\sigma_{x}^{2})\) from which the interval midpoints of the predictor variable X are drawn. The value λx is the parameter of the exponential distribution \(\exp (\lambda _{x})\) from which the interval ranges of X are drawn. The error terms 𝜖i are drawn from a normal distribution \(N(0,\sigma^{2}_{\epsilon})\), where the values of σ𝜖 are shown in the row “σ𝜖” of Table 1.

Table 1 Parameter setup for the Data A, B, and C

Thus, the true linear regression equations for the three clusters of Data A have the structure:

$$ (1)~ Y = 1.0 + 1.3X,~~ (2)~ Y = 45 + 1.8X,~~ (3)~ Y = 45 - 2.5X; $$
(38)

those for Data B are as follows:

$$ (1)~ Y = 142 + 5X,~~ (2)~ Y = 33 - 3X,~~ (3)~ Y = -73 + 0.6 X; $$
(39)

and those for Data C are as follows:

$$ (1)~ Y = 2.0 + 0.8X,~~ (2)~ Y = 1.0 + 2.3X,~~ (3)~ Y = 3.0 - 1.8 X,~~ (4)~ Y = 1.0+4.3X. $$
(40)

The data structures for Data A, B, and C are shown in Fig. 4a, b, and c, respectively. The regression lines in each plot are the lines recovered by the k-regressions algorithm. We can see that the clusters overlap with each other for all three data sets; in particular, for Data C, a large proportion of the four clusters overlaps. In addition, within each data set, each cluster is concentrated around a linear regression line.

Fig. 4 Data structure for Data A (a), B (b), and C (c)

4.3.2 Application of the k-Regressions Algorithm

Consider first Data A. For this particular data set, we generate a random sample that follows the structure described in Table 1. Then, given the correct number of clusters K = 3, we use the k-regressions algorithm to recover the data structure. We try a number of random initial partitions; among these, the clustering result with the smallest sum of squared residuals of Eq. 22 is taken to be the converged result for that sample. For each simulation, we try 50 different random initial partitions. This whole process constitutes one replication. We then implement 100 such replications using different seeds to investigate the overall performance of the k-regressions algorithm.

For each distance and each cluster, the mean and standard deviation (std) of the estimated parameter values from these 100 replications are displayed in Table 2. For example, for the center distance, the mean estimated coefficients of β0 and β1 for cluster 1 are 0.93 and 1.32, respectively; the differences between the estimated values and the true values are small relative to the true values. The standard deviations of the estimated β0 and β1 for cluster 1 are 0.17 and 0.01, respectively. Small standard deviations of the coefficients indicate stable clustering results. Similar arguments apply to the other two clusters and to the other distances, though it is noted that the center distance gives considerably better fits when comparing the SSR values.

Table 2 k-regressions clustering results for Data A

For each replication, we tried 50 different initial partitions when applying the k-regressions algorithm to a particular simulation. The number of good initial partitions out of 50, n, gives an idea of how difficult it is for the algorithm to converge to the correct clusters from a random initial partition. Over the 100 replications, we calculate the mean and standard deviation of n, shown in the row “n out of 50” in Table 2 for the three distances. The SSR is also calculated for each replication; the mean and standard deviation of the SSR over the 100 replications are presented in the row “SSR”.

Table 3 presents the corresponding clustering results for Data B with 100 replications and can be interpreted in the same way as Table 2 for Data A. Given K = 3, for all three distances, the differences between the true coefficients and the mean estimated regression coefficients are small relative to the coefficient scales. Again, small standard deviations for all the estimated coefficients imply stable clustering results.

Table 3 k-regressions clustering results for Data B

The clustering results for Data C are presented in Table 4; its interpretation follows in a similar manner to that of Table 2 for Data A and Table 3 for Data B. Note that for Data C, a large proportion of the four clusters overlaps, which makes it more difficult for the k-regressions algorithm to converge to the correct clusters. For each replication, we tried 200 different initial partitions. The differences between the true coefficients and the mean estimated coefficients are small relative to the scales of the coefficients. The standard deviations are small for all the estimated coefficients and all three distances. However, the intercept estimates for clusters 2, 3, and 4 are not as accurate as for cluster 1. This is not surprising given that cluster 1 is more separated from the other three clusters.

Table 4 k-regressions clustering results for Data C

Now, let us use the same three data structures, Data A, B, and C, to investigate the performance of determining the optimal number of clusters by the elbow method and by the adjusted R2. For a particular data set, we generate a random sample based on its parameter setup and implement the k-regressions algorithm for K = 1,…,10. For each K, we try a number of different initial partitions and select the result with the smallest SSR as the clustering result. The \(R^{2^{(K)}}\) from Eq. 28 and \(\bar {R}^{2^{(K)}}\) from Eq. 32 are calculated for each K = 1,…,10. The elbow plot shows \(1-R^{2^{(K)}}\) versus K. We also plot \(\bar {R}^{2^{(K)}}\) versus K, where the maximum \(\bar {R}^{2^{(K)}}\) determines the optimal number of clusters. This whole process constitutes one replication, and we implement a total of 20 replications to test the performance of the elbow method and the adjusted R2.

Figure 5a, b, and c are the elbow plots of \(1-R^{2^{(K)}}\) against K for Data A, B, and C, respectively, where the grey lines are the elbow plots for the 20 individual replications, and the blue line is the average over the 20 replications. We can see that the elbow plots identify the correct optimal number of clusters for all three data sets, K = 3 for Data A and B, and K = 4 for Data C. It is relatively difficult to determine the optimal number of clusters for Data C due to the overlapping clusters; nevertheless, the elbow plots correctly determined the number of clusters for all 20 replications. Figure 5d, e, and f show the plots of \(\bar {R}^{2^{(K)}}\) for Data A, B, and C, respectively. For each of the three data structures, the optimal number of clusters determined by the largest \(\bar {R}^{2^{(K)}}\) is mostly larger than the true number of clusters across the 20 replications.

Fig. 5 Elbow plots by weighted R2 and adjusted R2 for Data A (a) and (d), Data B (b) and (e), and Data C (c) and (f)

Generally, the elbow method is a stable and reliable way to determine the optimal number of clusters. There could be cases where \(1-R^{2^{(K)}}\) decreases gradually and consistently so that an elbow point is hard to find. Usually such scenarios indicate that there is no number of clusters that separates the data well, and subjective judgment is needed for a decision; fixing a reasonable cutoff for \(R^{2^{(K)}}\) is a realistic option in practice. The adjusted R2, \(\bar {R}^{2^{(K)}}\), usually overestimates the optimal number of clusters and so is not a good method for determining it.

4.3.3 p > 1

Consider now three simulated data sets D, E, and F, with p = 3, 5, and 5, respectively. The parameter setups are given in Table 5; the tables and figures are shown in the Supplementary Materials S1. The sample size for each cluster in Data D is 200, while the sample size for each cluster in Data E and in Data F is 300. As before, there are 100 replications for each case. The estimated parameter values for each of the three distances are given in Tables 6, 7, and 8 for Data D, E, and F, respectively. The corresponding elbow plots are shown in Fig. 10, with panels a and d showing the weighted \(1-R^{2^{(K)}}\) plot and the adjusted \(\bar {R}^{2^{(K)}}\) plot for Data D, Fig. 10b and e for Data E, and Fig. 10c and f for Data F, respectively. As for the p = 1 cases, the estimated results compare well with the true values, and the fits are good as judged by the SSR values and the elbow plots, again corroborating the merits and usefulness of the proposed k-regressions algorithm. Tables 6–8 also show the time (in seconds) to run the 100 replications. From these, we see there is very little difference between these times as p increases; rather, if anything, any time differences are due to the different distance measures used, though the standard deviations are large relative to the respective sample means.

5 Real Data Application

The methodology introduced herein is applied to the faces data set of Leroy et al. (1996). The predictor variable is X = length between the outer corners of the eyes, i.e., eye-span, and the response variable is Y = length between the inner corners of the eyes, i.e., the length of the bridge of the nose. There are n = 27 interval-valued observations (see Table 9 in Supplementary Materials S2). The resulting K = 3 clusters, obtained from the k-regressions algorithm, are shown in Fig. 6. The maximum number of clusters was set at K = 5. We see from the elbow plot in Fig. 7 that the optimal number is the K = 3 of Fig. 6. The respective regression equations are as follows:

$$ \begin{array}{@{}rcl@{}} \text{Cluster 1 (black): ~~} Y &= -87.52 + 0.91 X + \epsilon, \end{array} $$
(41)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 2 (red): ~~~~~} Y &= -143.19 + 1.20 X + \epsilon, \end{array} $$
(42)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 3 (green): ~~} Y &= -80.17 + 0.89 X + \epsilon. \end{array} $$
(43)

For all three clusters, the length between the two outer corners of the eyes (X) is positively correlated with the length between the two inner corners of the eyes (Y). However, the relationship between the two lengths differs somewhat across the clusters.

Fig. 6 Three k-regressions clusters for the faces data

Fig. 7 Elbow plot for the faces data set clustering

The clusters consist of C1 = {7,8,9,19,20,21,25,26,27} faces, C2 = {4,5,6,10,11,12,13,14,15} faces, and C3 = {1,2,3,16,17,18,22,23,24} faces, respectively. As it happens, these faces are in fact three observations from each of nine individuals. The algorithm (naturally unaware of this fact) always puts the measurements from the same individual into the same cluster, which enhances the credibility of the algorithm.

A second data set is the cars data set discussed in de Carvalho et al. (2010). There are n = 33 cars. The k-regressions algorithm is applied with response variable Y = price and predictor variable X = engine capacity (the data are in Table 10 in S2). The k-regressions algorithm gave the K = 3 clusters shown in Fig. 8. The respective regression equations (where the data are standardized) are as follows:

$$ \begin{array}{@{}rcl@{}} \text{Cluster 1 (red): ~~~~~} Y &= -44.53 + 78.05 X + \epsilon, \end{array} $$
(44)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 2 (green): ~~} Y &= -66.78 + 58.98 X + \epsilon, \end{array} $$
(45)
$$ \begin{array}{@{}rcl@{}} \text{Cluster 3 (blue): ~~~~} Y &= -68.24 + 73.00 X + \epsilon. \end{array} $$
(46)

The elbow plot is shown in Fig. 9 suggesting that the optimal K = 3. The clusters consist of C1 = {11,15,16,22} cars, C2 = {1,2,3,6,7,8,9,10,12,13,14,18,20,21,26, 29,30,31,32,33} cars, and C3 = {3,4,16,18,22,23,24,26,27} cars.

Fig. 8 Three k-regressions clusters for the cars data

Fig. 9 Elbow plot for the cars data set clustering

These cars data were also analysed by de Carvalho et al. (2010) using the same variables. Importantly, they also obtained three clusters. However, it is difficult to make any further comparison since they apply separate classical regressions to the interval center and half-range point values, and so (apart from the deficiencies discussed in Section 3.1) it is not possible to produce cluster-regression fits along the lines of Eqs. 44–46 above.

6 Conclusion

A regression-based approach to partitioning a data set into its relevant clusters, originally introduced by Charles (1977), exists for classical data. For interval data, we have introduced a k-regressions algorithm to partition a set of interval-valued observations into its underlying clusters. The algorithm was tested on a wide variety of simulated data sets and on two real interval data sets, and it worked well. The new k-regressions algorithm demonstrably excelled when compared with the well-known k-means partitioning methodology.