ROOTCLUS: Searching for “ROOT CLUSters” in Three-Way Proximity Data

Bocci, Laura; Vicari, Donatella

doi:10.1007/s11336-019-09686-1

ROOTCLUS: Searching for “ROOT CLUSters” in Three-Way Proximity Data

Published: 13 September 2019

Volume 84, pages 941–985, (2019)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Psychometrika Aims and scope Submit manuscript

ROOTCLUS: Searching for “ROOT CLUSters” in Three-Way Proximity Data

Download PDF

262 Accesses
3 Citations
Explore all metrics

Abstract

In the context of three-way proximity data, an INDCLUS-type model is presented to address the issue of subject heterogeneity regarding the perception of object pairwise similarity. A model, termed ROOTCLUS, is presented that allows for the detection of a subset of objects whose similarities are described in terms of non-overlapping clusters (ROOT CLUSters) common across all subjects. For the other objects, Individual partitions, which are subject specific, are allowed where clusters are linked one-to-one to the Root clusters. A sound ALS-type algorithm to fit the model to data is presented. The novel method is evaluated in an extensive simulation study and illustrated with empirical data sets.

GINDCLUS: Generalized INDCLUS with External Information

Article 12 October 2016

Evaluating the Necessity of a Triadic Distance Model

Additive Trees for Fitting Three-Way (Multiple Source) Proximity Data

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In many research domains (psychometry, sociometry, marketing), it can be of interest to investigate how the perception or evaluation of the similarities between objects (products or stimuli) may differ across several subjects (sources, experimental conditions, situations, scenarios or times).

Our concern here is referred to three-way two-mode data arranged as a set of square symmetric matrices of pairwise proximities between N objects provided by H subjects.

Clustering three-way similarity data is a complex and challenging task since proximity matrices actually subsume different classifications of the objects due to the subject heterogeneity.

Different approaches have been introduced to handle and account for the subject heterogeneity: The three-way clustering should represent a consensus of the individual object classifications, but such a consensus is actually representative only when no differences due to the subject heterogeneity occur and such an assumption is often unrealistic.

In order to account for the subject heterogeneity of three-way similarity data, Carroll and Arabie (1983) introduce the INDCLUS (INdividual Differences CLUStering) model as a three-way generalization of ADCLUS (ADditive CLUStering), originally formulated by Shepard and Arabie (1979) for two-way similarity data. ADCLUS assumes that objects are grouped into overlapping clusters and the similarity between objects equals the sum of the weights of all the clusters they belong to. Mirkin (1987) discusses the use of Qualitative Factor Analysis (QFA) methods for fitting ADCLUS in the special case where the qualitative factors are clusters. In the three-way framework, INDCLUS allows for extracting overlapping clusters of N objects from the proximities provided by H subjects which are supposed to differently weigh each cluster. Therefore, INDCLUS deals with the subject heterogeneity by assuming that even though all subjects share the same clustering of objects, they assign different patterns of weights to clusters so that the effect of each cluster on the proximities is subject specific.

Several extensions of INDCLUS have been proposed which present more flexibility in accounting for the subject heterogeneity.

Giordani and Kiers (2012) provide a fuzzified version of the INDCLUS model (FINDCLUS) which allows objects to belong to a partition with a fuzzy membership degree which is common to all subjects.

In the context of clustering and multidimensional scaling for three-way similarity data, the CLUSCALE (simultaneous CLUstering and SCAL[E]ing) model (Chaturvedi and Carroll 2006) combines additively INDCLUS and INDSCAL (Carroll and Chang 1970) searching for a common both (discrete) clustering representation and continuous spatial configuration of the objects. Thus, what is not accounted for by the INDCLUS term is handled by the INDSCAL part.

As a further extension of the INDCLUS model, Bocci and Vicari (2017) propose GINDCLUS, a Generalized INDCLUS model where the subject heterogeneity is accounted for by determining classes of subjects in addition to the common classification of the objects. Furthermore, the availability of possible information which is external to the three-way data is exploited to better account for the subject heterogeneity and object clustering.

Several other methodologies searching for simultaneous classification of subjects and objects had been before proposed, aiming at accounting for the subject heterogeneity of three-way proximity data.

In a least-squares approach, POP (Partitions Of Partitions) and PARSCLA (PARtition and least-Squares Consensus cLassifications Analysis) have been proposed by Gordon and Vichi (1998) and Vichi (1999), respectively, by searching classes of subjects where a single consensus partition of objects is found, while Vicari and Vichi (2009) fit several classification structures within each class of subjects.

In a maximum likelihood framework, Wedel and DeSarbo (1998) and Bocci, Vicari and Vichi (2006) fit finite mixture models which deal with the three-way heterogeneity by identifying unobserved classes of subjects each having a different classification structure of the objects.

In this paper, we do not deal with clustering of subjects, but in order to extract more information from the three-way proximities by accounting for and taking advantage of the subject heterogeneity, we move on from the INDCLUS model by considering that a unique common object clustering is not enough and that each subject may only partially share a common partition of some objects but differ in how some other objects are classified.

Specifically, the approach proposed here assumes that the N objects can be divided into two subsets: one including $N_P$ objects, which subsume an underlying common perceptual structure of the pairwise proximities, while the other contains the leftover $N_M$ objects whose proximities are differently evaluated by each subject.

Hence, the ROOTCLUS (ROOT CLUStering) model proposed here aims at jointly identifying:

(A)
a ROOT partition of the $N_P$ objects (i.e., a common partition) into non-overlapping clusters denoted ROOT clusters;
(B)
H different subject-specific partitions of the remaining $N_M$ objects into clusters linked one-to-one to the common ones. From now on, such subject-specific partitions are termed Individual partitions because they are different and peculiar for each subject.

The idea is that some common object clusters exist which account for the homogeneity of the subjects and form the roots at which the individual clusters intersect.

The model is formalized in a least-squares framework and an appropriate Alternating Least-Squares-type algorithm is given.

The paper is organized as follows. In Sect. 2, a real motivating example is presented to give an intuition of the problem we are dealing with. In Sect. 3, the ROOTCLUS model is formalized in a general framework and an appropriate algorithm is proposed in Sect. 4. An extensive simulation study and applications to real data are presented in Sects. 5 and 6, respectively. Finally, the last section is devoted to some concluding remarks.

2 Motivating Example

In order to give a flavor of the situation where the ROOTCLUS model could fruitfully provide a useful insight into clustering heterogeneous proximity data among subjects, we present an example of a real application, before discussing in detail the statistical framework of the model.

The data considered here refer to the classification of $N=14$ sports (Volleyball, Horse-riding, Cycling, Athletics, Water Polo, Rugby, Martial Arts, Ski, Soccer, Fencing, Basketball, Swimming, Artistic Gymnastic, Tennis) provided by a number of college students in December 2017 at the Department of Statistical Sciences—Sapienza University of Rome (Italy). Specifically, the students were asked to sort the sports (given in random order) into three non-overlapping clusters on the basis of some common aspect not to be explicitly expressed. For each student, the partition provided was converted into a membership matrix and used to build a $14 \times 14$ similarity matrix for each student having ones only for the pairs of sports put together into the same cluster and zeros otherwise.

In order to study the gender differences in evaluating the sports, two matrices have been built from the similarity matrices of the students by summing up separately the single matrices of Males and Females and dividing them by their number (8 Males and 5 Females, respectively), so that, conditionally on gender, the percentage of times in which two sports have been assigned to the same cluster was assumed as a measure of their similarity (Table 1).

Table 1 Sport data by gender

Full size table

In this way, Males and Females are regarded as two “macro-subjects,” representative of the two genders.

By a first inspection, it is clear that a different pattern in evaluating the similarity among sports is present and both matrices exhibit a strong underlying structure in clusters.

Moreover, in the clustering process such a structure is differently exhibited by Males and Females and some points can be highlighted about the heterogeneity of their similarities:

1.
agreement for some sports generally allocated to the same cluster regardless of the gender (as for example Volleyball, Water Polo, Rugby, Basketball and Soccer which all Males and almost all Females assign to a cluster clearly identified as “team sports using a ball”);
2.
some sports quite clearly assigned to a cluster of similar sports by one gender but less clearly evaluated by the other gender: For example, Tennis is added to the “ball sports” and never classified with Gym and Athletics by Males, while Females put Tennis with the “ball sports” as many times as with Gym and Athletics;
3.
sports allocated to a cluster to some (more or less) limited extent according to possible gender preferences (as Martial Arts).

From such considerations, we may deduce that Males and Females share their evaluations for some common clusters of sports, but they also differ in evaluating some other sports, so that one single partition is not enough to explain all the differences.

The idea underlying the proposed methodology is to find jointly A) common “root” clusters including only a subset of sports which is subsumed by Males and Females and, in addition, B) different gender-specific clusters of the remaining sports which are linked one-to-one to the roots to account for the gender heterogeneity.

The ROOTCLUS model, which will be fully introduced in Sect. 3, has been applied to the Sport Data in Table 1.

The optimal solution (retained as the best one in 200 runs of the algorithm presented in Sect. 4 and attaining a relative^{Footnote 1} loss equal to 0.0824) gives 3 “Root” clusters containing a subset of 10 out of 14 sports that are the roots which the gender-specific partitions stem from (Table 2). For the sake of completeness, Table 2 also reports the optimal weights and constants from the ROOTCLUS model which will be formalized in Sect. 3 and which are used to estimate the similarities in Table 3 (see Sect. 3 for fully details).

Table 2 Optimal clusters and weights (in parentheses) from the ROOTCLUS model

Full size table

Actually, two additional complementary partitions are found which subsume the similarities expressed by Males and Females, respectively, by differently allocating the remaining four sports to Gender-specific Clusters which join the common Root Clusters (Table 2).

Note that all sports are assigned to some cluster, but while the Root Clusters underly the common perception of the similarities among sports, the gender-specific clusters provide a particular connotation of the clusters themselves.

Thus, for example, Cluster 1 identifies the “Ball Sports” including Tennis (Gender-specific Cluster 1) in addition to the “Team Sports” (Root Cluster 1) for both Males and Females. Note that Tennis is not part of the first Root Cluster even if it is a singleton in both gender-specific partitions. In fact, even though both Males and Females perceive Tennis similar to the sports in Root Cluster 1, actually they differ in how they evaluate the strength of such similarity. This is evident from the estimated similarities (Table 3) which confirm that Tennis (the only individual ball sport) belongs to Cluster 1 but it is perceived less similar to the other sports in the cluster (similarity equal to 57.67 and 44.83 for Males and Females, respectively).

The same considerations hold for Fencing: Male students put only Fencing with the sports requiring “Special Equipments” (Root Cluster 2), while Females also add Martial Arts and Swimming. Sports in Root Cluster 2 are very similar according to both Males and Females (similarity equal to 80.77 and 89.23, respectively), but their similarity with Fencing is lower and different for the two genders.

Lastly, Root Cluster 3 is complemented by Males with Martial Arts and Swimming forming the cluster of “Multidisciplinary Sports,” whereas Females do not add anything to the Root Cluster: This implies that for Males such two sports are similar to Athletics and Artistic Gym, whereas this is not true for Females.

Table 3 Sport data: estimated similarities from the ROOTCLUS model

Full size table

Moreover, the two sports in Root Cluster 3 are perceived much more similar according to Males than Females (similarity equal to 103.89 and 63.61, respectively).

From Table 3, it is evident that clusters of similar sports have been identified for each gender: A) sports within the same clusters are always more similar than sports in different clusters; B) sports belonging to Gender-specific Clusters are always more similar to the sports in their corresponding Root Clusters but to a lesser extent than the sports within Root Clusters are to each other.

Finally, as for the assessment of the goodness of the solution, we may consider that 91.76% of the total sum of squares is accounted for by the model and can be decomposed into the within and between clusters sum of squares (88.82% and 2.94%, respectively). Moreover, the within sum of squares accounted for can be, in turn, decomposed into the common within (67.27%) and gender within (21.55%) sum of squares, due to the similarities within Root and Gender-specific Clusters, respectively.

3 The ROOTCLUS Model

Let us assume that a two-mode three-way array ${\underline{\mathbf {S}}}$ consists of H square matrices $\mathbf {S}_h$ ($h=1,\ldots ,H$) of size N where the entry $s_{ilh}$ represents the pairwise nonnegative similarity between objects i and l ($i,l=1,\ldots ,N$) given by subject h ($h=1,\ldots ,H$).

In order to present the INDCLUS-type model, termed here ROOTCLUS, let us formally recall the INDCLUS model (Carroll and Arabie 1983), which can be written as

$$\begin{aligned} \mathbf {S}_h={{\tilde{\mathbf {P}}}}{{\tilde{\mathbf {W}}}}_h {{\tilde{\mathbf {P}}}}^\prime +{\tilde{c}}_h \mathbf {1}_N \mathbf {1}_N^\prime + {\tilde{\mathbf {E}}}_h , \qquad (h=1,\ldots ,H), \end{aligned}$$

(1)

where ${{\tilde{\mathbf {P}}}}=[{\tilde{p}}_{ij}] \, ({\tilde{p}}_{ij}=\{0,1\}$ for $i=1,\ldots ,N$ and $j=1,\ldots ,J)$ is the $N \times J$ binary matrix defining the common clustering of the N objects into J possibly overlapping clusters, ${{\tilde{\mathbf {W}}}}_h$ is the nonnegative diagonal weight matrix of order J for subject h, ${\tilde{c}}_h$ is the real-valued additive constant for subject h which can be interpreted as the weight of a “universal cluster” comprising the complete set of the N objects, $\mathbf {1}_N$ denotes the column vector with N ones and ${\tilde{\mathbf {E}}}_h$ is the square error matrix which is the part of $\mathbf {S}_h$ not accounted for by the model.

Actually, when the subjects present systematic differences, it is reasonable to think that, on the one hand, they can share the same partition only for a subset of the whole set of N objects and, on the other hand, they differ in partitioning the remaining ones.

Let us suppose that $N_P$ out of the N objects belong to a common partition (termed here Root partition) in JRoot non-overlapping clusters $\{R_1,\ldots ,R_j,\ldots ,R_J\}$, while the other $N_M=N-N_P$ objects remain not assigned. In addition, the latter $N_M$ objects are supposed to form HIndividual partitions, being differently assigned to JIndividual non-overlapping clusters $\{I_1^h,\ldots ,I_j^h,\ldots ,I_J^h\}$ ($h=1,\ldots ,H$) by the H subjects. Both $\{R_1,\ldots ,R_j,\ldots ,R_J\}$ and $\{I_1^h,\ldots ,I_j^h,\ldots ,I_J^h\}$ ($h=1,\ldots ,H$) define incomplete partitions of the N objects because not all the objects are assigned to either a Root or an Individual cluster. Consequently, for any subject h, the partition of the whole set of the $N=N_p+N_M$ objects can be obtained as the union of the Root and the Individual partition by assuming $C_j^h=\{ R_j \bigcup I_j^h \}$ ($j=1,\ldots ,J$ and $h=1,\ldots ,H$). Note that $\{C_1^h,\ldots ,C_j^h,\ldots ,C_J^h\}$ ($h=1,\ldots ,H$) defines a complete partition of the N objects, i.e., any object is assigned to one and only one cluster by each subject, which is possible because the number of Root clusters and Individual clusters is equal and they are linked one-to-one as detailed in the following.

In order to specify the model, let us set the necessary notation:

$\mathbf {P}=[{p}_{ij}] \, ({p}_{ij}=\{0,1\}$ for $i=1,\ldots ,N$ and $j=1,\ldots ,J)$ is the $N \times J$ incomplete binary membership matrix defining the Root partition of $N_P$$(\le N)$ objects in JRoot clusters. Note that $\mathbf {P}$ presents $N_M=N-N_P$ rows of zeros corresponding to the $N_M$ objects not allocated to the Root partition;
$\mathbf {M}_{h}=[m_{ijh}] \, (m_{ijh}=\{0,1\}$ for $i=1,\ldots ,N$ and $j=1,\ldots ,J)$ is the $N \times J$ incomplete binary membership matrix defining the Individual partition in JIndividual clusters of the $N_M=N-N_P$ objects not assigned to $\mathbf {P}$ by subject h ($h=1,\ldots ,H$). Note that $\mathbf {M}_h$ presents $N_P$ rows of zeros and defines a partition complementary to $\mathbf {P}$.

Therefore, the ROOT CLUStering (ROOTCLUS) model can be specified as follows:

$$\begin{aligned} \mathbf {S}_h = \mathbf {P} \mathbf {W} \mathbf {P}^\prime + \bigl (\mathbf {P}+\mathbf {M}_{h} \bigr ) \mathbf {V}_h \bigl (\mathbf {P}+\mathbf {M}_{h} \bigr )^\prime + c_{h} \mathbf {1}_N \mathbf {1}_N^\prime +\mathbf {E}_{h} , \quad (h=1,\ldots ,H) , \end{aligned}$$

(2)

where $\mathbf {W}$ and $\mathbf {V}_h$ are nonnegative diagonal weight matrices of order J, ${c}_h$ is a real-valued additive constant and $\mathbf {E}_{h}$ ($h=1,\ldots ,H$) is the error term.

Therefore, on the one hand, we suppose that there exists a subset $N_P$ of the N objects whose pairwise proximities subsume the same partition $\mathbf {P}$ and cluster weights $\mathbf {W}$; on the other hand, subjects are supposed to differ in partitioning the remaining $N_M$ objects providing Individual partitions$\mathbf {M}_h$ and weights $\mathbf {V}_h$ ($h=1,\ldots ,H$).

Accordingly, for any subject h ($h=1,\ldots ,H$), the binary membership matrices $\mathbf {P}$ and $\mathbf {M}_h$ identify pairs of incomplete and complementary partitions so that matrix $(\mathbf {P}+\mathbf {M}_h)$ is a complete membership matrix in the sense that no object remains non-assigned and any object is assigned to one and only one cluster of the partition $\{C_1^h,\ldots ,C_j^h,\ldots ,C_J^h\}$ ($h=1,\ldots ,H$).

The J common clusters $\{R_1,\ldots ,R_j,\ldots ,R_J\}$ identified by $\mathbf {P}$ can be considered as Root clusters in the sense that they are common to all subjects and are the roots at which the Individual clusters $\{I_1^h,\ldots ,I_j^h,\ldots ,I_J^h\}$ identified by $\mathbf {M}_h$ ($h=1,\ldots ,H$) intersect or branch.

For identifiability reasons, weights $v_{jh}$ ($j=1,\ldots ,J$; $h=1,\ldots ,H$) are constrained to be null when the corresponding Individual cluster is empty ($I_j^h=\emptyset $).

Remark 1

Up to the constants $c_h$, weight $w_j$ ($j=1,\ldots ,J$) represents the similarity between objects in the Root cluster $R_j$ for all subjects and measures the strength of the “baseline” similarity between the $N_P$ objects belonging to $R_j$, while weight $v_{jh}$ ($j=1,\ldots ,J$; $h=1,\ldots ,H$) measures the strength of the similarity between objects in the Individual cluster $I_j^h$. Thus, for any subject h the estimated similarities for objects in cluster $C_j^h=\{ R_j \bigcup I_j^h \}$ are finally larger between objects within $R_j$ (equal to $w_j+v_{jh}$), while they are weaker (and equal to $v_{jh}$) between pairs of objects a) both in $I_j^h$ or b) one in $R_j$ and one in $I_j^h$. The idea behind is that if subject h has object i in his Individual cluster $I_j^h$, whereas this is not true for object $i\,^\prime $, this implies that i is more similar than $i\,^\prime $ to objects in Root cluster $R_j$.

3.1 Additional Requirements

The key assumption here is that the subject heterogeneity of the three-way similarities cannot be explained only by a single partition of the objects common to all subjects, but subsumes also Individual partitions which link to the common one. In such a respect, we may always regard that for each subject h, any cluster $C_j^h$ of his complete partition is composed by the subset $R_j$, which is common to all subjects, and by the complementary subset $I_j^h$, which is subject specific.

For any $C_j^h=\{ R_j \bigcup I_j^h \}$ ($j=1,\dots ,J$ and $h=1,\ldots ,H$), different cases can be considered:

a)
$R_j\ne \emptyset $ and $I_j^h= \emptyset \Rightarrow C_j^h\ne \emptyset $: subject h assigns no objects to $I_j^h$, which means that no objects are regarded similar to those in $R_j$;
b)
$R_j\ne \emptyset $ and $I_j^h\ne \emptyset \Rightarrow C_j^h\ne \emptyset $: subject h assigns some objects to $I_j^h$ because they are similar to objects in $R_j$ but less similar than the objects within $R_j$ are to each other;
c)
$R_j= \emptyset $ and $I_j^h= \emptyset \Rightarrow C_j^h= \emptyset $: subject h does not assign any object to both Root and Individual clusters j;
d)
$R_j= \emptyset $ and $I_j^h\ne \emptyset \Rightarrow C_j^h\ne \emptyset $: subject h assigns some objects to $I_j^h$ even if $R_j$ is empty.

Note that, regardless of the assumption of non-overlapping clusters, when case a) is true for all subjects and clusters, it corresponds to fit the INDCLUS model (1) which searches for the same clusters $R_j$ ($j=1,\ldots ,J$) across subjects. Again, when case d) is true for all subjects and clusters, it corresponds to performing ADCLUS (Shepard and Arabie 1979) to each $\mathbf {S}_h$ ($h=1,\ldots ,H$) separately. On the one hand, the latter allows for the maximum flexibility because there are no objects in the common clusters and all objects belong to the individual ones; on the other hand, the resulting subject-specific clusters are obtained independently and are not linked across subjects.

The ROOTCLUS model covers here cases a), b), c) while case d) is assumed not admissible because the aim is to identify clusters $R_j$ ($j=1,\ldots ,J$) which are linked one-to-one to the corresponding Individual clusters $I_j^h$ ($j=1,\ldots ,J$; $h=1,\ldots ,H$). Thus, the common meaning of all $C_j^h$ ($j=1,\dots ,J$; $h=1,\ldots ,H$) is given by their intersection set $R_j$ which subsumes the underlying core of similarities shared by all subjects.

Thus, some additional constraints are required to specify the model. Hence, case d) is avoided by controlling that for any pair of subjects h and $h^\prime $ ($h\ne h^\prime $; $h,\, h^\prime =1,\ldots ,H$) for any cluster j ($j=1,\ldots ,J$): $C_j^h \bigcap C_j^{h^\prime }= R_j=\emptyset \, \Rightarrow \, I_j^h=I_j^{h^\prime }=\emptyset $.

This is equivalent to set the following additional constraint for the membership matrices $\mathbf {P}$ and $\mathbf {M}_h$,

$$\begin{aligned} \text {if} \quad \sum _{i=1}^{N} p_{ij}=0 \quad \text {then} \quad \sum _{i=1}^{N} m_{ijh}=0 , \quad (j=1,\ldots ,J \, ; \, h=1,\ldots ,H) \end{aligned}$$

(3)

where no Individual non-empty clusters are assumed admissible whenever the corresponding common cluster is empty.

Furthermore, the more objects belong to the Root cluster $R_j$, the more parsimonious the model is; conversely, when the common subset $R_j$ contains just a few objects, the Individual clusters $I_j^h$ are more flexible in accounting for the observed similarities. Thus, as a second requirement a trade-off is necessary between the need for parsimony (as many objects allocated to common partition as possible) and the goodness of fit (a large number of objects $N_M$ differently allocated by each subject). Without any other constraint, the best model may lead to find the more flexible solution by allocating the minimum number of objects to the Root clusters.

That is why, we need an additional requirement to control the non-sparsity of matrix $\mathbf {P}$ and, equivalently, the sparsity of matrices $\mathbf {M}_h$ ($h=1,\dots ,H$) in order to force toward solutions where the number of objects assigned to the Root partition is as large as possible:

$$\begin{aligned} G=\frac{N-N_P}{N_P^2}=\frac{1}{N_P}\bigg (\frac{N_M}{N_P}\bigg )\, \le \, s \end{aligned}$$

(4)

where s is a specified parameter.

Since G in (4) is a function of the ratio between the number of objects non-allocated $N_M$ and allocated $N_P$ to the Root clusters, respectively, it can be used to penalize the model. Note that the function G is 0 when all objects belong to the Root partition ($N_P = N$), while G increases as the number of objects in any Individual partition increases and, to accelerate such an increase, $N_P^2$ is used at the denominator instead of $N_P$.

Remark 2

Note that the ROOTCLUS model could be reformulated in terms of overlapping clusters (with appropriate modifications of the algorithm presented in the following Sect. 4), but with an additional burden in terms of model complexity. Actually, this is not a limitation in most situations because the ROOTCLUS model is rather flexible as it allows to assign any object to different Individual clusters. Since Root and Individual clusters are linked one-to-one, clusters $C_j^h$ ($h=1,\ldots ,H$) synthesize the subject heterogeneity by combining the common Root cluster $R_j$ with the Individual clusters $I_j^h$ ($h=1,\ldots ,H$). Thus, objects in Root clusters are evaluated similar by all subjects and assigned to only one cluster, while the remaining objects may belong to more than one cluster $C_j^h$ ($h=1,\ldots ,H$) according to how subjects differently evaluate or perceive the pairwise proximities (see illustrative example in Sect. 3.2 and applications in Sect. 6.2).

3.2 Illustrative Example: Artificial Data

In order to better illustrate the meaning of any part of the model, let us consider the example in Table 4, where artificial three-way similarity data are displayed for $N=10$ objects, $H=3$ subjects and $J=3$ clusters, together with the membership matrices $\mathbf {P}$ and $\mathbf {M}_h$ ($h=1,\ldots ,H$).

Table 4 Artificial three-way similarity data (Color table online)

Full size table

The pattern underlying the data is evident. The heat map of each similarity matrix shows 3 “green blocks” on the main diagonal having common (dark green) “roots” across matrices: namely, $R_1=(A,B,C)$, $R_2=(E,F)$, $R_3=(L,M)$, corresponding to the incomplete membership matrix $\mathbf {P}$.

Actually, the remaining three objects D, G, K differently frame the blocks across the subjects. For example, object D frames the first block for both subject $S_1$ and $S_3$, while for subject $S_2$ object D joins the second block together with object G (see also the membership matrices $\mathbf {M}_1$, $\mathbf {M}_2$, $\mathbf {M}_3$). The different shades of green show how strong the similarities are within Root and Individual clusters. Thus, up to the individual constants $c_h$ (which play the same role as in INDCLUS), the weight $w_j$ ($j=1,2,3$) provides the baseline similarity between objects within each block (Root cluster $R_j$) across subjects, which is possibly augmented by $v_{jh}$ depending on whether subject $S_h$ adds other objects to cluster j. The individual weight $v_{jh}$ represents the average similarity between objects due to the Individual cluster j. For instance, objects G and K belong to cluster 2 in $S_1$ and cluster 3 in $S_3$ with similarities given by $s_{GK1} = v_{21} + c_1 =3+1=4$ and $s_{GK3} = v_{33} + c_3 =1+1=2$, respectively; conversely, they belong to different clusters in $S_2$ with smaller similarity $s_{GK2} = c_2 =1$.

For the sake of completeness, while the ROOTCLUS model fits perfectly the data (relative loss equal to zero), the INDCLUS model fitted to such data attains a relative loss equal to 0.0722 corresponding to the optimal matrices ${{\tilde{\mathbf {P}}}}$ and ${{\tilde{\mathbf {W}}}}$ given in Table 5. It is evident that although the unique classification identified by matrix ${{\tilde{\mathbf {P}}}}$ is quite flexible because it allows for overlapping clusters, it does not manage to capture all the subject heterogeneity. Object D is correctly assigned to both clusters 1 and 2; object K remains not assigned to any cluster (which means that K is not similar to any other object), while G belongs to the second cluster only.

Table 5 Artificial data: results from INDCLUS

Full size table

Conversely, while ROOTCLUS does not allow for overlap of the Root clusters ($A, B, C, E, F, L, M$ belong to only one cluster), it actually captures the subject heterogeneity through matrices $\mathbf {M}_h$, so that matrix $\frac{1}{H}\sum _{h=1}^{H} \big (\mathbf {P} + \mathbf {M}_h \big )$ (see Table 4) synthesizes how each object either belongs to only one Root cluster or to several Individual clusters with different proportions due to subjects. For example, object D belongs to the first and second clusters with proportions 2 / 3 and 1 / 3, respectively, which correctly reproduces the data in Table 4. Note that this is meaningful because of the one-to-one link between Root and Individual clusters.

3.3 The Least-Squares Problem

In order to achieve our goal of partitioning the objects of the three-way similarity data, we specify the least-squares estimation of model (2) as the solution of the following constrained problem

$$\begin{aligned}&\min \, F(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h) \nonumber \\&\quad = \frac{\sum _{h=1}^{H} \left\| \mathbf {S}_h - \mathbf {P} \mathbf {W} \mathbf {P}^\prime - \bigl (\mathbf {P}+\mathbf {M}_{h} \bigr ) \mathbf {V}_h \bigl (\mathbf {P}+\mathbf {M}_{h} \bigr )^\prime - c_{h} \mathbf {1}_N \mathbf {1}_{N}^{\prime } \right\| ^2}{\sum _{h=1}^{H} \left\| \mathbf {S}_h \right\| ^2} \end{aligned}$$

(5)

subject to

$$\begin{aligned}&p_{ij} = \{ 0,1 \} \; \; (i=1,\ldots ,N; \, j=1,\ldots ,J), \; \sum _{j=1}^{J} p_{ij} \le 1 \; \; (i=1,\ldots ,N), \end{aligned}$$

(6)

$$\begin{aligned}&m_{ijh} = \{ 0,1 \} \quad (i=1,\ldots ,N; \, j=1,\ldots ,J; \, h=1,\ldots ,H), \nonumber \\&\quad \sum _{j=1}^{J} m_{ijh} \le 1 \quad (i=1,\ldots ,N; \, h=1,\ldots ,H), \end{aligned}$$

(7)

$$\begin{aligned}&\sum _{j=1}^{J} (p_{ij}+m_{ijh})=1 \quad (i=1,\ldots ,N; \, h=1,\ldots ,H), \end{aligned}$$

(8)

$$\begin{aligned}&\forall \, j \, : \, \sum _{i=1}^{N} p_{ij}=0 \, \Rightarrow \, \sum _{i=1}^{N} m_{ijh}=0 \quad (j=1,\ldots ,J; \, h=1,\ldots ,H), \end{aligned}$$

(9)

$$\begin{aligned}&w_{j} \ge 0 \quad (j=1,\ldots ,J), \end{aligned}$$

(10)

$$\begin{aligned}&v_{jh} \ge 0 \quad (j=1,\ldots ,J; \, h=1,\ldots ,H), \end{aligned}$$

(11)

$$\begin{aligned}&\forall \, j \, : \sum _{i=1}^{N} m_{ijh}=0 \, \Rightarrow \, v_{jh}=0 \quad (h=1,\ldots ,H), \end{aligned}$$

(12)

$$\begin{aligned}&G=\frac{N-\sum _{i=1}^{N} \sum _{j=1}^{J} p_{ij}}{(\sum _{i=1}^{N} \sum _{j=1}^{J} p_{ij})^2} \, \le \, s . \end{aligned}$$

(13)

Specifically, the set of constraints (6)–(8) specify that in the partitions determined by $\mathbf {P}$ and $\mathbf {M}_h$ ($h=1,\ldots ,H$) each object is assigned to only one cluster in either $\mathbf {P}$ or $\mathbf {M}_h$, while constraints (9) and (13) are equivalent to (3) and (4), respectively.

Since there is a one-to-one correspondence between s in (13) and the nonnegative parameter $\lambda $ in the following function:

$$\begin{aligned} \min f(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h)= \min \left\{ F(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h) +\lambda G \right\} , \end{aligned}$$

(14)

the constrained problem (5) can be equivalently formulated as the minimization of (14) subject to the set of constraints (6)–(13). Note that, the higher value of $\lambda $, the stronger the penalty G.

An appropriate efficient Alternating Least-Squares (ALS)-type algorithm is presented in Sect. 4.

4 The Algorithm

The constrained problem (14) can be solved by using an Alternating Least-Squares (ALS)-type algorithm, which is defined as a block-relaxation algorithm (de Leeuw 1994) applied to a least-squares loss function. Thus, the complex optimization problem (14) is solved by alternatingly updating each block of parameters while maintaining all the others fixed as detailed as follows.

Similar to the schemes employed in SINDCLUS (Chaturvedi and Carroll 1994) and SYMPRES (Kiers 1997) for fitting the INDCLUS model, given J and $\lambda $, an ALS-type algorithm for fitting the ROOTCLUS model is presented:

Step 0:: Initialization.
Step 1:: Updating matrices$\mathbf {P}$ and $\mathbf {M}_h \, (h = 1, \ldots , H)$: both matrices $\mathbf {P}$ and $\mathbf {M}_h \, (h = 1, \ldots , H)$ are updated together in a row-wise fashion by solving assignment problems.
Step 2:: Updating weight matrix$\mathbf {W}$: weight matrix $\mathbf {W}$ is estimated as the solution of a constrained regression problem.
Step 3:: Updating weight matrix$\mathbf {V}_h \, (h = 1, \ldots , H)$: weight matrices $\mathbf {V}_h \, (h = 1, \ldots , H)$ are estimated by solving H independent constrained regression models.
Step 4:: Updating constant$c_h \, (h=1,\ldots ,H)$: individual constants $c_h \, (h=1,\ldots ,H)$ are estimated by successive residualizations of the data matrix.
Step 5:: Stopping rule.

The four main steps 1 to 4 are alternated and iterated until convergence. The loss function (14) cannot increase at each step, and the algorithm stops when the loss decreases less than a fixed arbitrary positive and small threshold.

In order to increase the chance of finding the global minimum, the best solution over different random starting parameters is retained.

Without loss of generality and just for simplicity of the notation, in the following, we assume here that the diagonal entries of the H similarity matrices $\mathbf {S}_{h}$ are fitted; the case where the diagonal entries are not of interest can be derived straightforwardly (see “Appendix”).

A detailed description of the steps of the algorithm, implemented in MATLAB R2017b,^{Footnote 2} follows.

Step 0 Initialization.

Initial estimates of the parameters $\hat{\mathbf {P}}, \hat{\mathbf {W}}$, $\hat{\mathbf {M}}_h, \hat{\mathbf {V}}_h$ and ${\hat{c}}_h \; (h=1,\ldots ,H)$ are chosen randomly or in a rational way, but they are required to satisfy the set of constraints (6)–(13).

To improve the chance of finding a global minimum, in the simulation study and applications, a rational starting $\hat{\mathbf {P}}$ has been chosen by a) taking the absolute values of the first J eigenvectors of the mean similarity matrix $\frac{1}{H}\sum _{h=1}^{H}\mathbf {S}_h$ and b) setting all row-wise highest elements to 1 and all other elements to 0. Afterward, for $J<N-1$ a total of J rows have been randomly set to zero, while to get feasible starting solutions for either $J=N-1$ or $J=N$ a total of $N-2$ rows have been randomly set to zero.

Alternative rational starts could be differently obtained starting from the INDCLUS solution or from a separate ADCLUS analysis per subject, and deriving appropriate feasible starting solutions.

Step 1Updating matrices$\mathbf {P}$ and $\mathbf {M}_h \, (h = 1, \ldots , H)$.

Given the current estimates of $\hat{\mathbf {W}}$, $\hat{\mathbf {V}}_h$ and ${\hat{c}}_h \; (h=1,\ldots ,H)$, problem (14) is solved over $\mathbf {P}$ and $\mathbf {M}_h \, (h = 1, \ldots , H)$ for each row i ($i = 1, \ldots , N$) by solving assignment problems which minimize (14).

In the following, let $\mathbf {0}$ be the J-column vector of zeros, $\mathbf {u}_j$ be the J-dimensional vector with all entries equal to 0 except for the j-th which is 1, $n_j$ be the number of objects in Root cluster j, $\mathbf {p}_i$ be the i-th row of $\mathbf {P}$, $\mathbf {m}_{ih}$ be the i-th row of $\mathbf {M}_h$, f(a) be the loss function (14) computed using the entity within parentheses.

The updating of the i-th rows of $\mathbf {P}$ and $\mathbf {M}_h$ ($h=1,\dots ,H$), while keeping all the other rows fixed, is described in pseudo-code as follows.

Briefly, any object i may be either assigned to $\mathbf {P}$, i.e., to one Root cluster $\{R_1,\ldots ,R_j,\ldots ,R_J\}$, or remained non-assigned. In the latter case, the updating of the i-th row of $\mathbf {M}_h$ is achieved by solving $J^*$ ($\le J$) standard assignment problems (Inner loop), being $J^*$ the number of non-empty columns in the current $\mathbf {P}$. This guarantees that constraint (9) is fulfilled.

Before describing steps 2 to 4, it is worth noting that model (2) can be rewritten as

$$\begin{aligned} \mathbf {s}_h= \mathbf {T}\mathbf {w} + \bigl ( \mathbf {T}+\,\mathbf {Q}_h \bigr ) \mathbf {v}_h + c_h\mathbf {1}_{N^2}+ \mathbf {e}_h , \qquad (h=1, \ldots , H), \end{aligned}$$

(15)

where

$\mathbf {s}_h$ ($h=1, \ldots , H$) is the column vector of size $N^2$ of the vectorized matrix $\mathbf {S}_h$, i.e., $\mathbf {s}_h = vec(\mathbf {S}_h)=[s_{11h}, \ldots , s_{1Nh}, \ldots , s_{N1h}, \ldots , s_{NNh}]^\prime $;
$\mathbf {T}$ is the $N^2 \times J$ matrix formed by the Khatri–Rao product^{Footnote 3} (McDonald 1980; Rao and Mitra 1971) of $\mathbf {P}$ with itself, i.e., $\mathbf {T} = \mathbf {P}|\otimes |\mathbf {P}$;
$\mathbf {Q}_h$ ($h=1, \ldots , H$) is the $N^2 \times J$ matrix obtained as $\mathbf {Q}_h = \bigl (\mathbf {M}_h\,|\otimes |\,(\mathbf {P}+\mathbf {M}_h) \bigr )+ \bigl ( \mathbf {P}\,|\otimes |\mathbf {M}_h \bigr )$;
$\mathbf {w}$ is the column vector with the J diagonal entries of $\mathbf {W}$;
$\mathbf {v}_h$ ($h=1, \ldots , H$) is the column vector with the J diagonal entries of $\mathbf {V}_h$;
$\mathbf {e}_h$ ($h=1, \ldots , H$) is the column vector of size $N^2$ of the errors.

Therefore, by taking into account (15), function (5) becomes

$$\begin{aligned} \min \, F(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h) = \frac{\sum _{h=1}^{H} \left\| \mathbf {s}_h - \mathbf {T}\mathbf {w} - \bigl ( \mathbf {T}+\,\mathbf {Q}_h \bigr ) \mathbf {v}_h - c_h\mathbf {1}_{N^2} \right\| ^2}{\sum _{h=1}^{H} \left\| \mathbf {s}_h \right\| ^2}. \end{aligned}$$

(16)

Thus, given the current estimates of $\hat{\mathbf {P}}$ and $\hat{\mathbf {M}}_h$ ($h=1,\ldots ,H$), the minimization of (14) over $\mathbf {W}$, $\mathbf {V}_h$ and $c_h$ ($h = 1, \ldots , H$) is achieved by minimizing (16) subject to

$$\begin{aligned}&\mathbf {w} \ge \mathbf {0} \end{aligned}$$

(17)

$$\begin{aligned}&\mathbf {v}_h \ge \mathbf {0}\, ; \quad \forall j \, : \sum _{i=1}^{N} m_{ijh}=0 \, \Rightarrow \, v_{jh}=0 ,\qquad (h=1,\ldots ,H). \end{aligned}$$

(18)

Step 2Updating matrix$\mathbf {W}$.

Given the current estimates of $\hat{\mathbf {P}}$, $\hat{\mathbf {M}}_h, \hat{\mathbf {V}}_h$ and ${\hat{c}}_h$ ($h=1,\ldots ,H$), the estimation of $\mathbf {W}$ is obtained as the solution of the regression problem (16) over $\mathbf {w}$ subject to constraint (17) by using nonnegative least-squares (Lawson and Hanson 1974).

Step 3Updating matrix$\mathbf {V}_h \, (h = 1, \ldots , H)$.

Given the current estimates of $\hat{\mathbf {P}}$, $\hat{\mathbf {W}}$, $\hat{\mathbf {M}}_h$ and ${\hat{c}}_h \; (h=1,\ldots ,H)$, it is readily seen that minimizing (16) over $\mathbf {v}_h$$(h = 1, \ldots , H)$ is equivalent to minimize each term $F_h(\mathbf {v}_h) = \left\| \mathbf {s}_h - \hat{\mathbf {T}} \hat{\mathbf {w}} - \bigl ( \hat{\mathbf {T}}+\,\hat{\mathbf {Q}}_h \bigr ) \mathbf {v}_h - {\hat{c}}_h\mathbf {1}_{N^2} \right\| ^2$, subject to constraint (18), separately. The estimation is obtained as the solution of a constrained regression problem by using nonnegative least-squares (Lawson and Hanson 1974). This procedure is iterated H times, and constraint (18) is imposed in the case of an empty Individual cluster.

Step 4Updating constant$c_h \, (h = 1, \ldots , H)$.

Given the current estimates of $\hat{\mathbf {P}}, \hat{\mathbf {W}}$, $\hat{\mathbf {M}}_h, \hat{\mathbf {V}}_h \; (h=1,\ldots ,H)$, the estimate of $c_h$ is given by

$$\begin{aligned} c_h = \frac{\mathbf {1}_{N^2}^\prime \Bigl ( \mathbf {s}_h - \hat{\mathbf {T}}\hat{\mathbf {w}} - \bigl ( \hat{\mathbf {T}}+\,\hat{\mathbf {Q}}_h \bigr ) \hat{\mathbf {v}}_h \Bigr )}{N^2} \qquad (h=1,\ldots , H). \end{aligned}$$

Step 5Stopping rule.

The loss function value is computed for the current values of $\hat{\mathbf {P}}, \hat{\mathbf {W}}$, $\hat{\mathbf {M}}_h, \hat{\mathbf {V}}_h$ and ${\hat{c}}_h \; (h=1,\ldots ,H)$ and since $f\bigl (\hat{\mathbf {P}}, \hat{\mathbf {W}}, \hat{\mathbf {M}}_h, \hat{\mathbf {V}}_h, {\hat{c}}_h \bigr )$ is bounded from below, it converges to a point which is expected to be at least a local minimum. When the function f has not decreased considerably with respect to a convergence tolerance value, the process is assumed to be converged. Otherwise, steps 1 to 4 are repeated in turn.

4.1 Model Assessment and Selection

4.1.1 Choice of the Number of Clusters

It is important to notice that the algorithm to fit the ROOTCLUS model does not need to be run by varying the number of clusters, but setting J equal to the maximum number expected or preferred is enough, so that the choice of the optimal number of clusters is not a critical issue here. In fact, the presence of empty clusters (corresponding to columns of zeros both in $\mathbf {P}$ and $\mathbf {M}_h$, $h=1,\ldots ,H$) does not affect the goodness of the solution and the number of non-empty clusters $J_P$ is assumed to be the optimal one.

A general criterion to set the maximum number of clusters cannot be given, but it depends on the specific application. Generally, a number of clusters J large enough can be set to get $J_p < J$ non-empty clusters or, when $J_P$ is too large for practical use, the scree plot of the loss values of the model with different numbers of clusters can be analyzed.

4.1.2 Model Assessment

In order to assess different aspects of the fit of the ROOTCLUS model, it can be useful to provide a decomposition of the total variability where the different parts due to Root and Individual partitions are evaluated and can be possibly used to better analyze and choice the optimal solution. Without loss of generality, we consider here that once the ROOTCLUS model has been fitted, we may decompose the total variability starting from (15),

$$\begin{aligned} TSS&= RSS + ESS \nonumber \\ \sum _{h=1}^{H} \left\| \mathbf {s}_h \right\| ^2&= \sum _{h=1}^{H} \left\| \hat{\mathbf {T}} \hat{\mathbf {w}} + \bigl (\hat{\mathbf {T}}+\,\hat{\mathbf {Q}}_h\bigr ) \hat{\mathbf {v}}_h + {\hat{c}}_h\mathbf {1}_{N^2} \right\| ^2 \nonumber \\&\quad + \sum _{h=1}^{H} \left\| \hat{\mathbf {s}}_h - \hat{\mathbf {T}} \hat{\mathbf {w}} - \bigl (\hat{\mathbf {T}}+\,\hat{\mathbf {Q}}_h\bigr ) \hat{\mathbf {v}}_h - {\hat{c}}_h\mathbf {1}_{N^2} \right\| ^2 \end{aligned}$$

(19)

where TSS is the Total Sum of Squares, while RSS and ESS denote the ROOTCLUS and the Error Sum of Squares, respectively.

Note that, once the ROOTCLUS model has been estimated, given $\lambda $, RSS is actually the part of the observed similarities accounted for by the model and, specifically, the variability due to clusters $(\hat{\mathbf {P}}+\hat{\mathbf {M}}_h)$ ($h=1,\ldots ,H$). Additionally, since $(\hat{\mathbf {P}}+\hat{\mathbf {M}}_h)$ is a full standard membership matrix, for any h the variability accounted for by such complete partition can be further decomposed in its Within and Between parts, respectively

$$\begin{aligned} RSS&= WSS + BSS \nonumber \\&= \sum _{h=1}^{H} \left\| \hat{\mathbf {T}} \hat{\mathbf {w}} + \bigl (\hat{\mathbf {T}}+\,\hat{\mathbf {Q}}_h\bigr ) \bigl ( \hat{\mathbf {v}}_h+{\hat{c}}_h\mathbf {1}_J \bigr ) \right\| ^2 + \sum _{h=1}^{H} \left\| \Bigl ( \mathbf {1}_{N^2} - \bigl (\hat{\mathbf {T}}+\,\hat{\mathbf {Q}}_h\bigr )\mathbf {1}_J \Bigr ) {\hat{c}}_h \right\| ^2. \end{aligned}$$

(20)

Moreover, the WSS can be further decomposed by taking into account the part due to the Root (common) partition$\hat{\mathbf {P}}$ and the Individual partitions$\hat{\mathbf {M}}_h$ ($h=1,\ldots ,H$)

$$\begin{aligned} WSS&= CWSS + IWSS \nonumber \\&= \sum _{h=1}^{H} \left\| \hat{\mathbf {T}} \bigl (\hat{\mathbf {w}}+\hat{\mathbf {v}}_h+{\hat{c}}_h\mathbf {1}_J \bigr )\right\| ^2 + \sum _{h=1}^{H} \left\| \hat{\mathbf {Q}}_h \bigl ( \hat{\mathbf {v}}_h+{\hat{c}}_h\mathbf {1}_J \bigr ) \right\| ^2. \end{aligned}$$

(21)

Actually, CWSS is the part of the within variability due to the objects belonging to the partition which is common to all subjects, while IWSS is the part accounted for by the Individual partitions.

Thus, plugging (21) into (20) and then into (19), it holds

$$\begin{aligned} TSS=(CWSS+IWSS+BSS)+ESS . \end{aligned}$$

For the sake of interpretability, all the Sum of Squares can be related to the TSS of the similarity data so that ESS / TSS is actually the relative loss (5) and, similarly, CWSS / TSS and IWSS / TSS are relative measures to assess how well the Root and Individual partitions account for the observed similarities.

4.1.3 Model Selection

In order to control the non-sparsity of the (incomplete) membership matrix $\mathbf {P}$ (Sect. 3.1), the tuning parameter s has been introduced in (4) which controls the number of objects $N_P$ belonging to the Root partition. Note that because of (9), the solution where $N_P=0$ (empty matrix $\mathbf {P}$) produces also empty Individual partitions (empty matrices $\mathbf {M}_h$) and the ROOTCLUS model reduces to $\mathbf {S}_h=c_h \mathbf {1}_N\mathbf {1}^{\prime }_N+\mathbf {E}_h$ which is the trivial solution where each subject is modeled independently without any clustering structure. This trivial solution can be actually avoided by setting a finite value for s in (4) or equivalently $\lambda >0$ in (14). In fact, since G gets larger as the number of objects $N_P$ becomes smaller, the second term in (14) forces toward solutions with at least $N_P=1$ even for small values of $\lambda $. Thus, matrix $\mathbf {P}$ (matrices $\mathbf {M}_h$) will be non-sparse (sparse) for an appropriate choice of the tuning parameter $\lambda $.

When applied to a dataset as a descriptive method, in order to obtain an intuitive understanding of the data, one might simply fix the tuning parameter based on some appropriate criterion: For instance, one could select large values of $\lambda $ (small values of s), in order to obtain matrices $\mathbf {P}$ having a desirable level of non-sparsity, or determine the tuning parameter $\lambda $ such that $N_P$ equals a certain value by trial and error.

Conversely, in a confirmatory perspective, the question is which solution should be preferred across different values of $\lambda $ in order to take a trade-off between a large number of objects in the common partition and an acceptable lack of fit due to such a constraint. In such a respect, the analysis of the scree plot of the loss values across the $\lambda $ values may help to identify the minimum $\lambda $ which does not display a sizeable increase in the loss (usually corresponding to an elbow). In addition, to ease the model selection in terms of choice of $\lambda $ we propose to make use of an index which relies on the decomposition (18) and measures the quality of the solution taking into account the model complexity. Similar to the pseudo-F index, Calinski and Harabasz (1974) proposed and used in different contexts (Rocci and Vichi 2008; Schepers et al. 2008) just to select the correct model, we consider the ratio of the variability accounted for by the model (RSS) to the residual variability (ESS), both corrected for the degrees of freedom:

$$\begin{aligned} pF = \frac{RSS/dR}{ESS/dE} \end{aligned}$$

(22)

where the corresponding degrees of freedom are: 1) $dR= N_P+J_P+J_M+H+H(N-N_P)$ with $J_M$ equal to the total number of non-empty Individual clusters and 2) $dE= (HN(N + 1)/2)-dR$ when the diagonal entries are fitted, otherwise $dE= (HN(N - 1)/2)-dR$.

The performance of pF in the simulation study is reported in Tables 6 and 7 where it has achieved a good performance in recovering the underlying (known) clustering structure. Hence, pF has been also used in the application to select the $\lambda $ value maximizing (22).

5 Simulation Study

5.1 Design

In order to evaluate the performance of the ROOTCLUS model and the algorithm proposed in Sects. 3 and 4, an extensive simulation study has been carried on artificial data.

A number of three-way datasets have been generated by the true underlying model (2) by setting $H=10$ (subjects), $N=12, 24$ (objects), $J_{\text {true}}=2, 3$ (clusters), $\%N_P=50, 80$ (percentage of objects allocated to the Root partition). Then, each free-error similarity matrix $\mathbf {S}^*_h$ ($h=1,\ldots ,H$) has been built and perturbed as follows

$$\begin{aligned} \mathbf {S}_h&= \bigl (\mathbf {P}\mathbf {W}\mathbf {P}^\prime +\bigl (\mathbf {P}+\mathbf {M}_h\bigr )\mathbf {V}_h\bigl (\mathbf {P}+\mathbf {M}_h\bigr )^\prime + c_h\mathbf {1}_N\mathbf {1}_N\prime \bigr )+\delta \mathbf {E}_h \\&= \bigl ( \mathbf {S}^*_h + c_h\mathbf {1}_N\mathbf {1}_N^\prime \bigr )+\delta \mathbf {E}_h \end{aligned}$$

where

$\mathbf {P}$ is a random incomplete membership matrix having a percentage of nonzero rows equal to $\%N_P$ and generated with clusters of approximately equal sizes;
$\mathbf {M}_h$ ($h=1,\ldots ,H$) are random incomplete membership matrices complementary to $\mathbf {P}$, having a percentage of zero rows equal to $\%N_P$ and the objects are randomly assigned to one out of the $J_{\text {true}}$ Individual clusters with probability $1/J_{\text {true}}$;
$\mathbf {W}=diag(4, 8)$ for $J_{\text {true}}=2$ and $\mathbf {W}=diag(4, 8, 12)$ for $J_{\text {true}}=3$;
$\mathbf {V}_h=2 \mathbf {W}$ ($h=1,\ldots ,H$);
$\mathbf {E}_h$ ($h=1,\ldots ,H$) are symmetric matrices of random noise generated by independently drawing their upper triangular entries from a standard normal distribution; then, matrices $\mathbf {E}_h$ have been rescaled so that their joint sum of squares $\sum _{h=1}^{H} \left\| \mathbf {E}_h \right\| ^2$ was equal to $\sum _{h=1}^{H} \left\| \mathbf {S}^*_h \right\| ^2$;
$\delta $ has been set to 0.25, 0.5, 1 to allow for different error levels (Low, Medium, High) corresponding to 25%, 50%, 100% of noise-to-data ratio $\big (\sum _{h=1}^{H} \left\| \delta \mathbf {E}_h \right\| ^2 \setminus \sum _{h=1}^{H} \left\| \mathbf {S}^{*}_h \right\| ^2\big )$, respectively;
$c_h$ ($h=1,\ldots ,H$) have been chosen to ensure the nonnegativity of the final similarities.

For each cell of the experimental design, 100 datasets have been generated for a total of 2 (number of objects) $\times $ 2 (number of clusters) $\times $ 2 ($\%$ of objects in the Root partition) $\times $ 3 (error levels)= 2400 datasets.

The ROOTCLUS model has been fitted for a maximum number of clusters $J=5$ and different values of $\lambda $ (0–0.6 with increments of 0.1). Moreover, in order to prevent from falling into local optima, for any experimental cell and value of $\lambda $, the best solution in terms of relative loss (ESS / SS) has been retained starting from 100 different random solutions, so that the algorithm run 1,680,000 times in total.

Note that the simulation study was carried out on a personal computer with Intel(R) Core i7-7700 CPU 3.60 GHz processor and 16 GB of RAM.

The performance of ROOTCLUS according to different standpoints is assessed by computing several measures. Specifically, for each cell of the experimental design and for any value of $\lambda $ the following measures have been computed by averaging over the 100 datasets:

$\%P$: percentage of objects allocated to the Root optimal partition $\hat{\mathbf {P}}$; i.e., $100\sum _{ij} \hat{\mathbf {p}}_{ij}/N$;
MK: Kappa coefficient (Cohen 1960), to measure the similarity between true $\mathbf {P}$ and fitted $\hat{\mathbf {P}}$ Root partitions.^{Footnote 4} To take the permutational freedom of the (order of the) clusters into account, we defined the MK-statistic as the maximum value of the Kappa coefficient over all column permutations of the fitted $\hat{\mathbf {P}}$. As such, MK-statistic is the proportion of entries of the true $\mathbf {P}$, adjusted for chance, that have been recovered correctly; its maximum value is 1 meaning perfect recovery;
$\%(MK=1)$: percentage of successes in recovering the true partition $\mathbf {P}$, i.e., % of times where $MK=1$;
ARI: Adjusted Rand Index (Hubert and Arabie 1985) between true $\bigl (\mathbf {P}+\mathbf {M}_h\bigr )$ and fitted $\bigl (\hat{\mathbf {P}}+\hat{\mathbf {M}}_h\bigr )$ complete partitions averaged over the H subjects; it takes its maximum value equals to 1 when the two partitions are coincident;
$\%(ARI=1)$: percentage of successes in recovering the true partitions $\bigl (\mathbf {P}+\mathbf {M}_h\bigr )$, i.e., % of times where $ARI=1$;
L2: L2-distance between true $\mathbf {c}$ and fitted $\hat{\mathbf {c}}$ vectors of constants to evaluate the accuracy of the estimates;
VAF: Variance Accounted For index (Hubert et al. 2006) between true and fitted proximity matrices;
It: number of iterations before convergence;
Time: time per run (in seconds);
$\#\, sc$: number of non-empty clusters in the starting Root partition;
$\#\, fc$: number of non-empty clusters in the final optimal Root partition.

The simulation results are displayed in Tables 6 and 7 where the average measures of performance are reported across the $\lambda $ values for the two scenarios with $N=12$ and $N=24$, respectively.

Table 6 Simulation study results ($N=12$ and $H=10$)

Full size table

Table 7 Simulation study results ($N=24$ and $H=10$)

Full size table

5.2 Results: Goodness of Recovery

In order to investigate the importance of the manipulated factors of the simulation study and their interactions in determining the recovery performance, we performed an ANOVA with the ARI as dependent variable and the data-manipulated factors (N, $\%N_P$, $J_{\text {true}}$ and Error level) as independent variables. Effect sizes (partial ${\eta }^2$) for each of the main effects and first-order interactions are displayed in Table 8.

Table 8 Effect sizes (partial ${\eta }^2$) of the ARI for all main and first-order interaction effects of the manipulated factors. Large effect sizes in bold (partial ${\eta }^2$ > 0.10 ).

Full size table

In Figs. 1 and 2, the mean plots with confidence intervals of the ARI LS-means from post-hoc analyses have been reported to relate the results back to the experiment and better highlight how the algorithm performs as the manipulated factors with large effect sizes vary. LS-means and 95% confidence limits are displayed for each level of the main effects and interactions with large effect sizes in bold in Table 8. From the multiple-comparison analysis, the adjusted pairwise differences and their significance levels have shown that some interactions have similar effects (the corresponding intervals are connected by arrowed lines in Fig. 2b, d), while the main effects are all significant (see Figs. 1, 2a, c).

Thus, in this analysis, taking into account only effects with partial ${\eta }^2$ larger than 0.10 as relevant, it emerges that the Error level has a sizeable main effect across all $\lambda $ values (partial ${\eta }^2$ from 0.21 to 0.25): As expected and evident from Tables 6 and 7, the recovery performance decreases as the error becomes larger (Fig. 2a).

The $\%N_P$ has a sizeable main effect increasing with $\lambda >0.1$: When the percentage of objects in the Root partition is smaller ($\%N_P=50$), the decrease in recovery performance is more pronounced as $\lambda $ increases (Tables 6, 7 and Fig. 2c). Such a main effect is qualified by an Error level by $\%N_P$ interaction (partial ${\eta }^2 = 0.15$) for $\lambda =0.2,0.3$, which actually correspond to the $\lambda $s where the algorithm is mostly likely to return the optimal solution and is more pronounced for a large amount of error (Fig. 2d).

The number of objects N determines a main effect only for $\lambda =0$ (partial ${\eta }^2 = 0.12$), and it is qualified by an interaction between N and Error level (partial ${\eta }^2 = 0.14$) for $\lambda $ equals to 0 and 0.1. In fact, for the smaller number of objects it emerges that the optimal solution is less likely to be found for $\lambda =0$ (Fig. 1) and, again, this is more pronounced as the error level increases (Fig. 2b).

Finally, note that the true number of clusters $J_{\text {true}}$ and any interactions with it indicate no differences in effect size for any $\lambda $ value.

The recovery performance is generally good even when the error is quite high (Fig. 2a): it is better in terms of recovery of the true partitions when the number of objects is larger regardless of the $\lambda $ values (Fig. 2b).

In both settings, the recovery of the correct both Root and Individual classifications attains nearly comparable high values on average and the percentage of hitting the maximum agreement with the true Root partition ($MK=1$) remains generally very high for at least one of the values of $\lambda $, except for an high level of error where it gets worse, especially for $\%N_P=50$ than $\%N_P=80$ (Tables 6, 7).

Probably, the high Error level combined with the presence of high heterogeneity of the subjects (only the $50\%$ of the objects belonging to the common partition) makes the recovery of the true clustering structure more challenging.

Note that the results for $\lambda =0$ are generally worse and probably affected by the tendency toward trivial solutions (Sect. 4.1.3).

Finally, it can be noted that regardless of the amount of error, it generally takes a larger amount of average time per run for those $\lambda $ values where the correct classifications are identified, as expected and for $\%N_P=50$ which is due to more steps required by the allocation step of the algorithm.

5.3 Results: Model Selection

In order to assess the performance of the algorithm in terms of model selection (i.e., determining the optimal number of clusters), the average number of non-empty clusters, out of 5 requested, in the starting and optimal Root partitions is reported in Tables 6 and 7. From the comparison, it clearly emerges that the number of starting non-empty clusters is generally larger than the optimal one for all settings.

This confirms that the number of the starting non-empty clusters ($\#\, sc$) does not affect the number of non-empty clusters one ends up ($\# fc$). Note that $\# fc$ is generally close to $J_{\text {true}}$ for at least some $\lambda $ values, depending on the amount of error. To evaluate how well the algorithm retrieves the correct clusters, the $\%(MK=1)$ reports how often (in %) the perfect recovery of the true Root partition is achieved. For example, for the case ($J_{\text {true}}=2$, $\%N_P=50$, Error level$=25\%$) in Table 6, the algorithm retrieves optimal solutions with two non-empty clusters ($\#fc=2$) for $\lambda = 0, 0.1, 0.2$, whereas $\%(MK=1)$ is equal to 100, 99, 81, respectively, which indicates how many of such optimal solutions into two clusters coincide with the true ones.

Note that it is not expected to find the correct clusters for all $\lambda $s because of the penalty in the loss which forces toward larger and larger Root clusters as $\lambda $ increases. What we expect is to find large percentages of correct recovery $(\%(MK=1))$ for some $\lambda $ values which are often obtained depending on the amount of error. For example, the worst of the best cases across $\lambda $ is $\%(MK=1)=58$ corresponding to $\lambda =0$ for the setting $(N=12, J_{\text {true}}=2, \%N_P=50,\textit{Error level}=100\%)$. This confirms the capability of the algorithm to find the correct clusters for “at least” one $\lambda $ value in different conditions and regardless of the starting solution.

Moreover, the simulation setup has been also used to assess the performance of the maximum pF index (22) across the $\lambda $ values in selecting the most appropriate model. For any experimental cell and for any $\lambda $, Tables 6 and 7 report also:

$\%pF^*$: percentage of times the pF index assumes its maximum (for the $\lambda $ value in question);
$MK(pF^*)$: average MK computed for the solutions indicated by $pF^*$.

The larger the $MK(pF^*)$ values, the better the ability of the pF index in recovering the underlying common clustering structure when it is unknown as in real applications, because the pF index manages to indicate lambda values giving good solutions in terms of recovery of the true Root partition.

As expected, the performance generally decreases as the noise level increases but it always succeeds in indicating a good solution on average except for the cases with high noise level combined with $\%N_P=50$ where the model attains the worst fit as already observed.

5.4 Number of Starts

In order to analyze the stability of the solution and investigate on the sensitivity to local optima, additional results on the recovery performance are reported in Table 9 for the settings $H=10$, $J_{\text {true}}=3$, $N=12, 24$, $\%N_P=50, 80$ and Error level$=50\%, 100\%$, being the data generated as described in Sect. 5.1 and the ROOTCLUS model fitted for a maximum number of clusters $J=5$.

Table 9 Analysis of random starts ($H=10$, $J_{\text {true}}=3$)

Full size table

When the best solution is retained in an increasing number of random starts (10, 30, 50, 100), both the average ARI between true and fitted partitions and the % of times where $ARI=1$ show on average a fairly good improvement across $\lambda $ values, as expected. A good performance in terms of average ARI is already reached when the optimal solution is retained over a few random starts when it generally becomes stable. Nonetheless, the percentage of times where the correct partitions are perfectly recovered for all subjects needs a larger number of random starts for a more relevant improvement, especially for small datasets.

5.5 Scalability

A small simulation has been also carried out to evaluate the average runtime for larger data. In a setting where $J_{\text {true}}=3$, $\%N_P=50$ and Error level$=50\%$, ten datasets have been generated as described in Sect. 5.1 by varying the number of objects $N=12,24,48$ and the number of subjects $H=10,20,40$. As before, the model has been fitted for a maximum number of clusters $J=5$, $\lambda $ from 0 to 0.6 with increments of 0.1 and by retaining the best solution in 100 different random starts. The average total runtime (in seconds) to get the optimal solution is reported in Table 10.

Table 10 Average total runtime (in seconds) to get the optimal solution ($J_{\text {true}}=3$, $\%N_P=50$, Error level$=50\%$)

Full size table

Generally speaking, a large number of subjects are more burdensome than a large number of objects due to the larger number of the Individual partitions, as expected. Given the number of objects, the average runtime increases by about 4.7 times from $H=10$ to $H=20$ subjects and it increases by 3.5 times more for $H=40$. On the other hand, given the number of subjects, as the number of objects goes from 12 to 24 and then from 24 to 48, the runtime goes up a factor of 3.1 and 4.9 on average, respectively.

Note that such a total runtime has been obtained by retaining the best solution in 100 random starts, by considering 7 values for $\lambda $ and by searching for 5 possible non-empty clusters. In situations with larger data in terms of number of either objects or subjects, time can be possibly saved by reducing the increments of the $\lambda $ values or the number of random starts.

6 Illustrative Applications

6.1 Cola Data

We analyze the well-known proximity Cola data among all pairs of ten colas provided by ten subjects (Schiffman et al. 1981, pp. 33–34). In a sensory experiment, ten subjects (nonsmokers, aged 18–21 years) provided 45 dissimilarity judgments tasting ten different brands of cola: Diet Pepsi (DiP), RC Cola (RCC), Yukon (Yuk), Dr. Pepper (DrP), Shasta (Sha), Coca Cola (CoC), Diet Dr. Pepper (DDP), Tab (Tab), Pepsi Cola (PeC), Diet Rite (DiR). The dissimilarity judgements for each pair of colas were placed on a scale from 0 (representing “same”) to 100 (representing “different”).

In order to fit the ROOTCLUS model, the dissimilarity judgements have been converted into similarities by taking the complements to 100 of the original values.

The aim is here to investigate the existence of sensory perceptual heterogeneity among subjects using the proposed methodology. Actually, the ten subjects can show different perception of colas depending on two key elements: (1) Five subjects (A, D, E, F, I) have got one dominant allele on the human genome which affects their ability to taste a bitter compound called PTC (which is bitter to all of them but is tasteless to all the others), while the other five subjects (B, C, G, H, J) do not have this inheritable characteristic (Schiffman et al. 1981, p. 151); (2) Colas can be classified either diet (DiP, DDP, Tab, DiR) or non-diet (RCC, Yuk, DrP, Sha, CoC, PeC) and either cherry (DrP, DDP) or noncherry (DiP, RCC, Yuk, Sha, CoC, Tab, PeC, DiR).

Therefore, we wish to investigate the existence of common subsets of colas perceived as similar by all subjects and, at the same time, identify how each subject can differently link the remaining colas to such common clusters.

The best solution has been retained over 300 random starts. The ROOTCLUS model has been fitted to the off-diagonal entries of the ten similarity matrices for different values of $\lambda $ (0–1, with increments of 0.1) and setting the maximum number of clusters to $J=4$. The best solution results in two non-empty clusters for $\lambda =0$ (unconstrained solution) and in three non-empty clusters for larger $\lambda $ values.

From a first analysis of all the Root partitions, it is worth noticing that the Root clusters remain quite stable across the different values of $\lambda $ (Table 11), identifying actually the “roots” of different types of cola: $R_1=$ regular noncherry colas, $R_2=$ cherry colas (either DrP or DDP) and $R_3=$ diet noncherry colas.

Table 11 Cola data: results from ROOTCLUS

Full size table

The model selection in terms of the choice of the best solutions across the $\lambda $ values has been evaluated by inspecting the scree plot of both the relative loss and the pF index (Fig. 3).

The pF index attains its maximum for $\lambda \ge 0.6$, but also at $\lambda = 0$ it attains a local maximum: They identify the two extreme situations where either all colas but DDP belong to the common partition or only 5 out of 10 colas are assigned to the Root clusters.

In the first case ($\lambda \ge 0.6$), the (parsimonious) solution goes toward a unique partition and the heterogeneity of the subjects depends on how they differently taste the only diet cherry cola DDP (Table 12): the 5 non-PTC tasters add DDP to $R_2$ to form a cluster of cherry colas (Cluster 2), while the 5 PTC subjects put DDP with the diet noncherry ones in $R_3$ forming a cluster of diet colas (Cluster 3). Therefore, the PTC tasters recognize the DDP as diet, whereas this is not true for the non-PTC tasters: this implies that PTC tasters evaluate DDP more similar to the colas in $R_3$ (because it is diet), whereas such a resemblance is not perceived by the non-PTCs.

Table 12 Cola data: clusters for PTC (A, D, E, F, I) and non-PTC (B, C, G, H, J) tasters from ROOTCLUS ($\lambda =0.6$)

Full size table

The optimal weights of the Root clusters are $\mathbf {w}=(31.57, -, 17.29)$:^{Footnote 5} the regular noncherry colas ($R_1$) tasted much more similar than diet ones ($R_3$) for all subjects with a baseline similarity about 1.8 times larger. Moreover, the weights of the Individual clusters in Fig. 4 help to better qualify such a result: For PTC subjects, the diet colas in $R_3$ are more similar to each other than to DDP. Because of the estimated weights, the similarity between Tab, DiP and DiR in $R_3$ is given by the baseline similarity $w_3=17.29$ for non-PTCs. Conversely, for PTCs the similarity between Tab, DiP and DiR in $R_3$ is larger. For example, subject E evaluates the similarity between the three colas in $R_3$ equal to $w_3 + v_{3E} =17.29+33.19=50.48$, while DDP results less similar to the diet noncherry colas ($v_{3E}=33.19$). This is due to the additional cherry flavor which characterizes the diet DDP. The same reasoning and interpretation can be done for Cluster 2 with the regular cherry DrP for PTC tasters, while it additionally includes the diet cherry DDP according to the non-PTCs: The latter perceive only the cherry flavor (and not the diet one) common to this two colas and evaluate them very similar (large weights of the Individual clusters 2 for non-PTC tasters as evident from Fig. 4).

For the case $\lambda = 0$, the (most flexible) optimal solution provides two Root clusters (Table 11) including only five non-diet colas (4 noncherry in $R_1$ and 1 cherry in $R_2$), while the remaining five colas differentiate the Individual clusters (Table 13).

Table 13 Cola data: Root and Individual clusters from ROOTCLUS ($\lambda =0$)

Full size table

Hence, together with the Root clusters, the Individual partitions reflect the ability of the subjects to differently taste the colas because the PTC subjects actually distinguish non-diet (Cluster 1) vs diet (Cluster 2) colas, while non-PTC subjects mainly tend to separate noncherry (Cluster 1) from cherry (Cluster 2) colas. Therefore, this solution, as for $\lambda =0.6$, reveals that the non-PTC subjects are not able to perceive the different tastes of the diet colas.

The optimal weights of the Root clusters are $\mathbf {w}=(18.02,-)$, while the optimal weights of the Individual clusters are displayed in Fig. 5 where it is evident that colas within Individual cluster 2 are perceived more similar to each other than colas in the Individual cluster 1, except for subject C. Moreover, from the weights in Fig. 5 it emerges that for both subjects G and J the cherry cola DrP (in $I_2$) is the most similar to the diet cherry cola DDP (the only in $R_2$) because of the largest weights $v_{2G}=85$ and $v_{2J}=79.8\,$: This is due probably to the cherry taste of DrP and DDP which are indistinguishable to the two non-PTC tasters G and J.

The results from ROOTCLUS, interpretable in terms of the particular inheritable characteristic of the subjects, confirm the existence of the same perceptual heterogeneity among subjects (see, for instance, Schiffman et al. 1981; Wedel and DeSarbo 1998; Bocci and Vicari 2017).

In fact, the results from the ROOTCLUS model are consistent with the ones from INDSCAL (Schiffman et al. 1981, pp. 149–151) where for the PTC subjects the distinction between diet and non-diet colas is more important than that between cherry and regular colas, while for non-PTCs the reverse is true.

Moreover, the ROOTCLUS solution is also consistent with the one from INDCLUS with $J = 2$ clusters of colas. The first cluster consists of seven colas (four regular noncherry colas—RCC, Sha, CoC, PeC—plus their diet versions—DiR, Tab, DiP), the second cluster is formed only by regular noncherry colas (RCC, Sha, CoC, PeC, Yuk), while the two cherry colas (DrP and DDP) are not assigned to any cluster. In INDCLUS, the differences in judging similarities are given by the subject weights: All PTC tasters (A, D, E, F, I) tend to weigh much more than the second cluster of regular colas, while the non-PTC subjects (except for H) do the opposite (see Bocci and Vicari 2017, for numerical details).

6.2 Sport Data

The Sport data already analyzed by gender in Sect. 2 have been fully investigated by fitting the ROOTCLUS model to the 13 student pairwise similarity matrices having entries equal to either ones or zeros whether two sports belong to the same cluster or not.

It was chosen to analyze these data by setting the maximum number of clusters to 7 and the algorithm run for $\lambda $ values in [0; 0.5] with increments of 0.1 by retaining the best solution from 200 different starts.

In Fig. 6, the loss function F is plotted against the $\lambda $ values together with the pF index (22). The ROOTCLUS loss function assumes (almost) the same values for $\lambda \le 0.2$ before increasing quickly, the pF index attains its maximum for $\lambda = 0$ and stabilizes to a value close to the maximum at $\lambda = 0.1, 0.2$ before dropping down rapidly for larger $\lambda $s.

Since the optimal Root clusters for $\lambda = 0$ and $\lambda = 0.1, 0.2$ are equal up to one sport, the latter solution is analyzed in detail because of its parsimony.

It is worthwhile noticing that three non-empty Root clusters are always found across $\lambda $s and the number of sports allocated to the Root partitions varies from 6 to 9 in such solutions.

The solution for $\lambda =0.1$ finds three non-empty clusters which accommodate in total the 50% of the sports: $R_1=$(Volleyball, Water Polo, Rugby, Soccer), $R_2=$(Horse-riding), $R_3=$(Athletics, Artistic Gym).

The remaining sports are differently allocated by the 13 students as displayed in Fig. 7 where, for each of the seven sports not belonging to the Root clusters, the bars show the different compositions of the Individual clusters. The heights of the bars (in different patterns) indicate the percentage of students who have put each sport into each Individual cluster.

Thus, it is evident that Basketball has been considered very similar to the Team Sports in $R_1$ by almost all the students ($92\%$), while only half of them have added Tennis to the sports in $R_1$ to form a cluster interpretable as Ball Sports. The majority of students have put Ski, Cycling and Fencing with Horse-riding to form a cluster of sports requiring special equipment and a minority of students have also added Martial Arts, Swimming and Tennis with them. Finally, the third cluster of the Multidisciplinary Sports is formed mainly by adding Martial Arts and Swimming to Athletics and Artistic Gym in $R_3$ (only a few students include here also Fencing and the other sports).

Such partitionings reflect what found in the aggregated analysis in Sect. 2; thus, for the sake of completeness, in Fig. 8 the same results of Fig. 7 are shown by gender so that the different ways how Males and Females (in %) allocate the seven sports to the Individual clusters can be appreciated and compared with the results in Sect. 2.

The optimal weights of the Root clusters are $\mathbf {w}=(0.12, {-}, 0)$, confirming that the baseline similarity between sports is larger in $R_1$ than in $R_3$. The weight $w_3$ is null here indicating that there is no agreement in evaluating the two sports in $R_3$ as similar, but such a similarity is differently qualified by the subjects (Fig. 9). The optimal weights of the Individual clusters are displayed in Fig. 9 where it is clear that there is a general agreement in evaluating the similarities of the sports within clusters except for students 3 (Individual cluster 2), 6 (Individual cluster 1), 11 (Individual clusters 2 and 3), 12 and 13 (Individual cluster 3). Note that in this application, the proximity matrices are binary: Thus, since the similarities to be fitted are either zeros or ones, in this special case the smaller weights for such students indicate that their similarity matrices are fitted worse than the others.

As for the goodness of fit, the solution analyzed accounts for the 94.56% of the total sum of squares (the relative loss function value is $ESS/TSS=0.0544$) and can be decomposed into the Root and Individual Within, plus the Between Sum of Squares as follows: $CWSS/TSS=0.2709$, $IWSS/TSS=0.6725$, $BSS/TSS=0.0022$, which confirm how the model optimally fits the data.

For comparison, the INDCLUS model has been also fitted with $J=3$ and the best solution retained over 200 random starts gives a relative loss equal to 0.3401. The clusters of sports result to be $C_1=$(Volleyball, Water Polo, Rugby, Soccer, Basketball), $C_2$=(Volleyball, Water Polo, Rugby, Soccer, Basketball, Tennis), $C_3=$(Horse-riding, Cycling, Athletics, Martial Arts, Ski, Fencing, Swimming, Artistic Gym). The flexibility of INDCLUS in allowing for overlapping clusters permits the first two clusters to be identical up to Tennis confirming the double view of a Team Sport cluster and a Ball Sport cluster, while the third is a residual cluster of Other Sports.

From the individual weights in Table 14, two sets of profiles occur where the weights assigned to cluster 1 (Cluster 2) are either almost zero (one) or close to one (zero) but they cannot be explained by gender differences. The weights of cluster 3 are all smaller and probably reflect the heterogeneity of the sports it contains.

Table 14 Sport data: weights resulting from INDCLUS

Full size table

7 Discussion

A model for clustering objects in three-way proximity data (termed ROOTCLUS) has been proposed which jointly search for A) Root clusters which subsume the similarities on the common perception of the subjects about a subset of objects; B) Individual clusters of the remaining objects accounting for the heterogeneity of the subjects in evaluating the similarities.

The model is flexible and allows to account for the subject perception in evaluating and clustering the similarity between objects. As evident from the applications on real data, the complementary (Root and Individual) partitions define corresponding clusters with subject-specific meaning which reflect different profiles, provide an interpretively satisfying approach in several contexts where the subjects’ perceptions are differently expressed and allows to identify subsets of objects that are exclusive to certain subjects or categories of subjects.

Moreover, the results of the extensive simulation study demonstrate the effectiveness of the proposed method and its performance in different conditions.

As for the algorithm, different choices of the random starts and further analyses of its efficiency and capability to recover the correct partitions deserve more investigations in different and more complex situations. Moreover, the performance when different number of clusters J are chosen also deserves to be analyzed.

As for the model assumptions, the overlap of the clusters could be allowed (as in INDCLUS) to guarantee a major flexibility at the cost of an increase in model complexity. Conversely, a more parsimonious model may be derived by considering classes of subjects sharing the same “Individual” partitions to identify similar profiles of subjects.

There is still room for extensions regarding the use of possible supplementary information to better interpret and describe the Individual clusters of objects even for the prediction of the perceptual profile of additional subjects.

Notes

The relative loss is defined here as the ratio of the raw loss to the total sum of squares of the data.
The MATLAB code is available online on SpringerLink with the article.
Given two matrices $\mathbf {A}$ and $\mathbf {B}$ with the same number J of columns, the Khatri–Rao product of $\mathbf {A}$ and $\mathbf {B}$ is the column-wise Kronecker product, i.e., $\mathbf {A} |\otimes | \mathbf {B} = (\mathbf {a}_1 \otimes \mathbf {b}_1, \ldots , \mathbf {a}_j \otimes \mathbf {b}_j, \ldots , \mathbf {a}_J \otimes \mathbf {b}_J )$ where $\mathbf {a}_j$ and $\mathbf {b}_j$ are the j-th ($j=1,\ldots ,J$) column of $\mathbf {A}$ and $\mathbf {B}$, respectively, and $\otimes $ denotes the Kronecker product.
The Kappa coefficient (KC) between two binary matrices is equal to the proportion of agreement between the two matrices (i.e., the proportion of the corresponding cells having the same values), corrected for chance (Wilderjans et al. 2012):
$$\begin{aligned} KC=\frac{(p_{00} + p_{11}) - (p_{0.}p_{.0} + p_{1.}p_{.1})}{1 - (p_{0.}p_{.0} + p_{1.}p_{.1})}, \end{aligned}$$
with $p_{00}$$(p_{11})$ being the proportion of corresponding cells that both are zero (one) and $p_{0.}$ and $p_{1.}$ ($p_{.0}$ and $p_{.1}$) the marginal proportion of zero- and one-cells for the first (second) matrix. Note that $p_{00}+p_{11}$ equals the (uncorrected) proportion of corresponding cells that have the same value.
Note that $w_2$ is missing here because the Root cluster $R_2$ is a singleton and the diagonal entries of the similarity matrices are not fitted in this application.

References

Bocci, L., & Vicari, D. (2017). GINDCLUS: Generalized INDCLUS with external information. Psychometrika, 82, 355–381.
Article Google Scholar
Bocci, L., Vicari, D., & Vichi, M. (2006). A mixture model for the classification of three-way proximity data. Computational Statistics & Data Analysis, 50, 1625–1654.
Article Google Scholar
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1–27.
Google Scholar
Carroll, J. D., & Arabie, P. (1983). INDCLUS: An individual differences generalization of ADCLUS model and the MAPCLUS algorithm. Psychometrika, 48, 157–169.
Article Google Scholar
Carroll, J. D., & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-generalization of the Eckart–Young decomposition. Psychometrika, 35, 283–319.
Article Google Scholar
Chaturvedi, A., & Carroll, J. D. (2006). CLUSCALE (CLUstering and multidimensional SCAL[E]ing): A three-way hybrid model incorporating clustering and multidimensional scaling structure. Journal of Classification, 23, 269–299.
Article Google Scholar
Chaturvedi, A. J., & Carroll, J. D. (1994). An alternating combinatorial optimization approach to fitting the INDCLUS and generalized INDCLUS models. Journal of Classification, 11, 155–170.
Article Google Scholar
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Article Google Scholar
De Leeuw, J. (1994). Block-relaxation algorithms in statistics. In H. H. Bock, W. Lenski, & M. M. Richter (Eds.), Information systems and data analysis (pp. 308–325). Berlin: Springer.
Chapter Google Scholar
Giordani, P., & Kiers, H. A. L. (2012). FINDCLUS: Fuzzy INdividual Differences CLUStering. Journal of Classification, 29, 170–198.
Article Google Scholar
Gordon, A. D., & Vichi, M. (1998). Partitions of Partitions. Journal of Classification, 15, 265–285.
Article Google Scholar
Hubert, L. J., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Article Google Scholar
Hubert, L. J., Arabie, P., & Meulman, J. (2006). The structural representation of proximity matrices with MATLAB. Philadelphia: SIAM.
Book Google Scholar
Kiers, H. A. L. (1997). A modification of the SINDCLUS algorithm for fitting the ADCLUS and INDCLUS models. Journal of Classification, 14, 297–310.
Article Google Scholar
Lawson, C. L., & Hanson, R. J. (1974). Solving least squares problems. Englewood Cliffs: Prentice Hall.
Google Scholar
McDonald, R. P. (1980). A simple comprehensive model for the analysis of covariance structures: Some remarks on applications. British Journal of Mathematical and Statistical Psychology, 33, 161–183.
Article Google Scholar
Mirkin, B. G. (1987). Additive clustering and qualitative factor analysis methods for similarity matrices. Journal of Classification, 4, 7–31.
Article Google Scholar
Rao, C. R., & Mitra, S. (1971). Generalized inverse of matrices and its applications. New York: Wiley.
Google Scholar
Rocci, R., & Vichi, M. (2008). Two-mode multi-partitioning. Computational Statistics & Data Analysis, 52, 1984–2003.
Article Google Scholar
Shepard, R. N., & Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86, 87–123.
Article Google Scholar
Schepers, J., Ceulemans, E., & Van Mechelen, I. (2008). Selecting among multi-mode partitioning models of different complexities: A comparison of four model selection criteria. Journal of Classification, 25, 67–85.
Article Google Scholar
Schiffman, S. S., Reynolds, M. L., & Young, F. W. (1981). Introduction to multidimensional scaling. London: Academic Press.
Google Scholar
Vicari, D., & Vichi, M. (2009). Structural classification analysis of three-way dissimilarity data. Journal of Classification, 26, 121–154.
Article Google Scholar
Vichi, M. (1999). One mode classification of a three-way data set. Journal of Classification, 16, 27–44.
Article Google Scholar
Wedel, M., & DeSarbo, W. S. (1998). Mixtures of (constrained) ultrametric trees. Psychometrika, 63, 419–443.
Article Google Scholar
Wilderjans, T. F., Depril, D., & Van Mechelen, I. (2012). Block-relaxation approaches for fitting the INDCLUS model. Journal of Classification, 29, 277–296.
Article Google Scholar

Download references

Acknowledgements

The authors are grateful to the Associate Editor and referees for their valuable comments and suggestions which greatly improved the presentation and content of the first version.

Author information

Authors and Affiliations

Department of Communication and Social Research, Sapienza University of Rome, Rome, Italy
Laura Bocci
Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
Donatella Vicari

Authors

Laura Bocci
View author publications
You can also search for this author in PubMed Google Scholar
Donatella Vicari
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Donatella Vicari.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 279 KB)

Appendix

In order to solve the constrained problem (14) in those cases when the diagonal entries of the H similarity matrices $\mathbf {S}_{h}$ are not of interest, the steps 1 to 4 of the ALS-type algorithm presented in Sect. 4 can be modified straightforwardly as follows.

Since only the off-diagonal elements of matrices $\mathbf {S}_{h}$ ($h=1,\ldots ,H$) need to be considered, the loss function (14) becomes

$$\begin{aligned} f_{\text {off}}(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h)=\; F_{\text {off}}(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h) +\lambda G, \end{aligned}$$

(23)

where

$$\begin{aligned} \begin{aligned}&F_{\text {off}}(\mathbf {P}, \mathbf {W}, \mathbf {M}_h, \mathbf {V}_h, c_h) = \\&\quad \frac{\sum _{h=1}^{H} \left\| \mathbf {S}_h - \mathbf {P} \mathbf {W} \mathbf {P}^\prime - \bigl (\mathbf {P}+\mathbf {M}_{h} \bigr ) \mathbf {V}_h \bigl (\mathbf {P}+\mathbf {M}_{h} \bigr )^\prime - c_{h} \mathbf {1}_N \mathbf {1}_{N}^{\prime } \right\| _{\text {off}}^2}{\sum _{h=1}^{H} \left\| \mathbf {S}_h \right\| _{\text {off}}^2} \end{aligned} \end{aligned}$$

(24)

and $\left\| \mathbf {Z} \right\| _{\text {off}}^2 = \sum _{x=1}^{X}\sum _{\begin{array}{c} y=1\; ;\; \end{array}{y\ne x}}^{Y} z_{xy}^2$.

In step 1, the loss function (23), instead of (14), is minimized over $\mathbf {P}$ and $\mathbf {M}_h$ ($h=1,\ldots ,H$).

In steps 2 to 4, all the rows of $\mathbf {s}_{h}$, $\mathbf {T}$, $\mathbf {Q}_{h}$ and $\mathbf {1}_{N^2}$ in model (15), corresponding to the diagonal entries of the matrices in (14), need to be left out. Such reduced structures are obtained as follows:

$$\begin{aligned} {\tilde{\mathbf {s}}}_h&= \mathbf {s}_h \odot \mathbf {d} \qquad (h=1,\ldots ,H) \,, \end{aligned}$$

(25)

$$\begin{aligned} {\tilde{\mathbf {T}}}&= \mathbf {T} \odot \mathbf {D}\,, \end{aligned}$$

(26)

$$\begin{aligned} {\tilde{\mathbf {Q}}}_h&= \mathbf {Q}_h \odot \mathbf {D} \qquad (h=1,\ldots ,H)\,, \end{aligned}$$

(27)

$$\begin{aligned} {\tilde{\mathbf {1}}}_{N^2}&= \mathbf {1}_{N^2} \odot \mathbf {d} \,, \end{aligned}$$

(28)

where $\odot $ denotes the Hadamard product, $\mathbf {d}$ is the column vector of size $N^2$ of the vectorized matrix $\big (\mathbf {1}_N \mathbf {1}_N^\prime - \mathbf {I}_N\big )$, being $\mathbf {I}_N$ the identity matrix of size N, and $\mathbf {D}$ is the $N^2 \times J$ matrix having all its columns equal to $\mathbf {d}$.

Therefore, model (15) is rewritten in terms of (25)–(28) and Steps 2, 3 and 4 modified accordingly.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bocci, L., Vicari, D. ROOTCLUS: Searching for “ROOT CLUSters” in Three-Way Proximity Data. Psychometrika 84, 941–985 (2019). https://doi.org/10.1007/s11336-019-09686-1

Download citation

Received: 04 July 2018
Revised: 13 August 2019
Published: 13 September 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s11336-019-09686-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

ROOTCLUS: Searching for “ROOT CLUSters” in Three-Way Proximity Data

Abstract

Similar content being viewed by others

GINDCLUS: Generalized INDCLUS with External Information

Evaluating the Necessity of a Triadic Distance Model

Additive Trees for Fitting Three-Way (Multiple Source) Proximity Data

1 Introduction

2 Motivating Example

3 The ROOTCLUS Model

Remark 1

3.1 Additional Requirements

Remark 2

3.2 Illustrative Example: Artificial Data

3.3 The Least-Squares Problem

4 The Algorithm

4.1 Model Assessment and Selection

4.1.1 Choice of the Number of Clusters

4.1.2 Model Assessment

4.1.3 Model Selection

5 Simulation Study

5.1 Design

5.2 Results: Goodness of Recovery

5.3 Results: Model Selection

5.4 Number of Starts

5.5 Scalability

6 Illustrative Applications

6.1 Cola Data

6.2 Sport Data

7 Discussion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Electronic supplementary material

Supplementary material 1 (zip 279 KB)

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation