1 Introduction

The symbolic data approach makes it possible to represent huge amounts of data in a compact but complex form. Symbolic data can be considered as an aggregated distribution of a much larger amount of classical data. It summarises a given set of data by concentrating on the main common aspects of the original data and hides entry-specific details. Despite this aggregation, the distributional information still contains a wealth of fine detail that can be beneficial during data analysis but is mostly overlooked by current methods for analysing symbolic data. These small microscopic details are especially beneficial during clustering, when the goal is to find groups whose objects are similar within a group and clearly segregated between groups.

In this paper, we offer an approach to hierarchical conceptual clustering which takes into account small details in symbolic data by describing symbolic objects using quantiles. During clustering, groups are formed based on similarities at multiple quantile points between objects. Therefore, the comparison between objects is much more detailed than comparing only the area the objects cover or their start and end points. The proposed method makes it possible to find clusters whose objects are similar to each other at a microscopic level. Furthermore, the approach guarantees a monotone nesting structure during clustering.

In Sect. 2, we cover the previous works in the field of conceptual clustering for symbolic data.

In Sect. 3, we present the quantile representation of symbolic data. The quantile method transforms each of the n complex symbolic objects into d m-dimensional numeric vectors, called quantile vectors. The quantiles are obtained from the distributional information underlying the symbolic object. Here m is a preselected integer: the larger m is, the more microscopic properties are considered. If m equals two, only the start and end points of the symbolic data are used, reducing the quantile representation to an interval. Therefore, the selection of the integer m controls the granularity of the sub-concepts constructed for the given object in the representation space.

In Sect. 4, we propose an algorithm for hierarchical conceptual clustering based on the quantile representation of symbolic data that finds microscopic similarities between the compared objects. We then use several well known datasets in symbolic data analysis to show the benefits of the proposed method and to compare the results of the macroscopic and microscopic approaches.

2 Background

2.1 Symbolic data

Symbolic data analysis (SDA) (Billard and Diday 2006; Diday and Esposito 2003) is an approach to data analysis which permits describing and analysing complex data. Whereas classical data are described by a single value for each variable, symbolic data appear in many different and complex forms: for example, symbolic data can be an interval, a histogram, a categorical value or modal valued data. All of these types can be considered as distributions. This kind of data extends classical data by considering more complete and complex information.

Symbolic data can be extracted from many different sources and in many ways. The common feature is that larger traditional data sets are aggregated into more compact forms of data, which hides entry-specific information and provides a more general, summative overview of the source data. Thanks to these properties, SDA methods make it possible to analyse large data sets (big data) which are too large to be analysed by usual methods. Furthermore, the fact that aggregated data hides entry-specific information also makes symbolic data suitable for fields where privacy concerns are vital.

2.2 Hierarchical conceptual clustering

The aim of clustering is to form groups (clusters) with objects within a single cluster being similar and those between clusters being dissimilar according to some suitably defined dissimilarity or similarity criteria (Billard and Diday 2006). Many hierarchical clustering methods have been extensively developed.

Conceptual clustering is a paradigm that differs from ordinary data clustering by generating a concept description for each cluster of objects (Michalski and Stepp 1983). Therefore, not only are objects with common properties grouped, but the characteristics of each cluster over the set of data objects are extracted, describing the regularities in the achieved clusters (Hu 1992). This classification scheme can consist of a set of disjoint clusters, or a set of clusters organized into a hierarchy. Each cluster is associated with a generalized conceptual description of the objects within the cluster. Hierarchical clusterings are often described as classification trees (Jonyer et al. 2001).

Conceptual clustering can be used for a variety of tasks, from data exploration to model fitting. The aim of applying clustering to a dataset is to gain a better understanding of the data, in many cases by highlighting hierarchical topologies (Jonyer et al. 2001). Clustering aims to achieve clusters that are optimally "connected" or "compact" (Johnson 1967). Hubert (1972) offers extensions of Johnson's algorithms. Fisher proposed an incremental hierarchical conceptual clustering method designed to maximize inference ability (Fisher 1987). A comprehensive review of data clustering methods is offered in Jain et al. (1999). Goswami and Chakrabarti (2012) proposed a conceptual clustering algorithm based on comparing the values of a classical dataset against median and/or quartile values. Their algorithm is applied to classical data with single point values, not distributions. A single point is compared with the feature's median/quartiles and labeled according to its position inside the feature. They use the combination of labels directly as the cluster description, giving \(4^d\) possible combinations when the median and quartiles are used and d is the number of features. Therefore, the algorithm produces a large number of groups and may require additional merging to avoid clutter during decision making.

Because symbolic data has a complex structure, it also requires a more complicated approach to conceptual clustering than classical data. A thorough survey of symbolic data clustering methods has been compiled in de Carvalho and de Souza (2010). Generalized Minkowski metrics for mixed feature types based on the Cartesian system model have been defined in Ichino and Yaguchi (1994). Dendrograms obtained from the application of standard linkage methods are also presented by Ichino and Yaguchi. Different dissimilarity measures for symbolic data clustering are covered in Billard and Diday (2006). The SODAS software project produced much research into different aspects of symbolic data clustering, such as Bertrand and Mufti (2008), Brito and De Carvalho (2008) and De Carvalho et al. (2008). Irpino and Verde (2006) proposed a clustering approach based on the Wasserstein-Kantorovich metric that also uses quantile functions. Their approach differs from the one proposed here in that they still use the underlying histogram bin values and probabilities. Furthermore, as in Umbleja (2017), the merger of histograms is a complex process. Both cases suffer from a rising number of quantiles/bins when histograms are merged. The merger has \(O(n^3)\) complexity with respect to the histograms being merged, the features and the quantiles/bins.

All of these symbolic data conceptual clustering methods can be considered to compare "macroscopic" properties of symbolic objects. By "macroscopic" properties, we mean that objects are compared according to the general information embedded in their symbolic representation. For example, in Ichino and Umbleja (2018), concept size is used to determine the formation of clusters. Concept size is the area that two joined objects span in the Cartesian space. Therefore, essentially only the start and end values of each feature of an object have an impact during clustering. In reality, due to its complex structure and the way larger datasets are aggregated into symbolic form, a symbolic data object contains a wealth of fine details that are neither compared nor considered during this kind of clustering process. Those small fine details are called "microscopic" properties in this paper, reflecting the inner structure of symbolic objects.

There has been previous research into conceptual clustering of symbolic data described by quantiles, such as Brito and Ichino (2010) and Brito and Ichino (2011). The differences between these methods and the proposed method lie in how dissimilarity is measured and how concepts are merged and handled. The similarity is that all of these methods compare distributions at multiple points (quantiles). The main novelty proposed in this paper compared with previous works is the use of "quantile rectangles", which naturally produce conceptual descriptions for objects and merged concepts.

3 Quantile representation of symbolic data

3.1 Types of symbolic data

Symbolic data can be represented in many different forms. The most common types are interval valued data, histogram valued data, categorical data and modal multi-variable data.

Let a set of n objects U be represented by \(U=\{\omega _1,\omega _2,\ldots ,\omega _n\}\). Let \(F_1, F_2,\ldots ,F_d\) be d features of possibly different types describing each object \(\omega _i\), \(i=1,2,\ldots ,n\), with \(E_{ij}\) denoting the symbolic value of object \(\omega _i\) for feature \(F_j\).

Definition 1

Let feature \(F_j\) be an interval valued feature and let each object \(\omega _i \in U\) be represented by an interval:

$$\begin{aligned} E_{ij}=\{[a_{ij},b_{ij}]1 \} \end{aligned}$$
(1)

Definition 2

Let feature \(F_j\) be a histogram valued feature and let each object \(\omega _i \in U\) be represented by a histogram:

$$\begin{aligned} E_{ij}=\{[a_{ijk},b_{ijk}]p_{ijk}; k=1,2, \dots ,n_{ij} \} \end{aligned}$$
(2)

where \(\sum _{k=1}^{n_{ij}} p_{ijk}=1\), \(b_{ijk}=a_{ij(k+1)}\) and \(n_{ij}\) is the number of bins of histogram \(E_{ij}\). As can be seen, (1) is a special case of (2) where \(n_{ij}=1\).
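
As a concrete illustration of this common representation, the following is a minimal sketch, assuming a histogram is stored as a list of (a, b, p) bins; the function names and data layout are our own illustrative choices, not part of the formal definitions.

```python
# Illustrative sketch: a histogram as a list of (a, b, p) bins satisfying Definition 2.

def make_interval(a, b):
    """Represent an interval valued feature (1) as a one-bin histogram."""
    return [(a, b, 1.0)]

def check_histogram(bins, tol=1e-9):
    """Check the constraints of (2): probabilities sum to 1 and bins are contiguous."""
    assert abs(sum(p for _, _, p in bins) - 1.0) < tol
    for (_, b_k, _), (a_next, _, _) in zip(bins, bins[1:]):
        assert abs(b_k - a_next) < tol          # b_{ijk} = a_{ij(k+1)}
    return True

check_histogram([(1, 4, 0.0), (4, 5, 0.2), (5, 6, 0.4), (6, 7, 0.0), (7, 8, 0.1), (8, 9, 0.3)])
check_histogram(make_interval(2.0, 3.5))
```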

Definition 3

Let \(Y_j\) be the domain of possible outcomes for modal multi-valued feature \(F_j\) containing category values \(Y_j=\{c_{j1},c_{j2},...,c_{j|Y_j|}\}\). \(|Y_j|\) notates the size of \(Y_j\)—the number of category values in \(Y_j\). Each object \(\omega _i \in U\) takes categorical values from \(Y_j\) with probability \(p_{ijk}\) and is represented as:

$$\begin{aligned} E_{ij}=\{c_{jk},p_{ijk}; k=1,2,\dots ,|Y_j| \} \end{aligned}$$
(3)

where \(\{c_{j1},c_{j2},\ldots ,c_{j|Y_j|}\}\subseteq Y_j\) and category \(c_{jk}\) appears with probability \(p_{ijk}\).

Modal multi-valued data in (3) can be represented as a histogram where every category \(c_{jk}\) corresponds to a bin \([a_{jk},a_{jk}+l]\) of equal width l, with \(a_{j(k+1)}=a_{jk}+l\). Therefore, (3) is transformed into a histogram as:

$$\begin{aligned} E_{ij}=\{[a_{jk},a_{jk}+l]p_{ijk}; k=1,2,\dots ,|Y_j|, l=1/|Y_j| \} \end{aligned}$$
(4)

The order of the categorical values has an impact on the distribution and should therefore be considered carefully. One possibility is to use the sum of frequency and rank values for the ordering (Ichino 2011).

Definition 4

Let \(Y_j\) be the domain of possible outcomes for categorical feature \(F_j\) containing category values \(Y_j=\{c_{j1},c_{j2},...,c_{j|Y_j|}\}\). Each object \(\omega _i \in U\) takes categorical values for feature \(F_j\) from \(Y_j\) and is represented by the list of included categories \(Y_{ij}\subseteq Y_j\):

$$\begin{aligned} E_{ij}=\{c_{jk}; c_{jk}\in Y_{ij}\} \end{aligned}$$
(5)

Categorical data represented as (5) can be represented as modal multi-valued data (3) in the following way:

$$\begin{aligned} E_{ij}=\{c_{jk},p_{ijk}; k=1,2,\dots ,|Y_j| \} \end{aligned}$$
(6)

where \(p_{ijk}=1\) if the category is included in \(Y_{ij}\) and \(p_{ijk}=0\) if it is not. Therefore, categorical data (5) can also be represented as histogram data according to (4) as:

$$\begin{aligned} E_{ij}=\{[a_{jk},a_{jk}+l]{p'}_{ijk}; k=1,2,\dots ,|Y_j| \} \end{aligned}$$
(7)

where \({p'}_{ijk}={p_{ijk}}/{|Y_{ij}|}\) is the probability normalized by the number of categories taken by object \(\omega _i\) for feature \(F_j\).

It can be seen that the different types of symbolic data can all be represented as histograms and can therefore be considered distributions describable by a distribution function.
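
To make the chain of transformations (3)-(7) concrete, the following is a minimal sketch under the assumption of a fixed category order (see the remark on ordering below (4)); the function names and the list-of-bins layout are illustrative, not part of the definitions above.

```python
# Illustrative sketch of (4) and (7): categories become equal-width bins on [0, 1].

def modal_to_histogram(probabilities):
    """Transform modal multi-valued data (3) into an equal-width histogram as in (4)."""
    l = 1.0 / len(probabilities)                       # common bin width l = 1/|Y_j|
    return [(k * l, (k + 1) * l, p) for k, p in enumerate(probabilities)]

def categorical_to_histogram(categories, taken):
    """Transform a categorical value list (5) into a histogram via (6) and (7)."""
    normalized = [1.0 / len(taken) if c in taken else 0.0 for c in categories]
    return modal_to_histogram(normalized)

# An object taking categories "a" and "c" out of {"a", "b", "c"}:
print(categorical_to_histogram(["a", "b", "c"], {"a", "c"}))
# -> [(0.0, 0.33.., 0.5), (0.33.., 0.66.., 0.0), (0.66.., 1.0, 0.5)]
```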

Definition 5

Let an object \(\omega _i \in U\) for feature \(F_j\) be a histogram valued symbolic object as in (2), and write \(l=n_{ij}\) for the index of its last bin. Then the object \(\omega _i \in U\) for feature \(F_j\) can be described by the distribution function:

$$\begin{aligned} F(x)= {\left\{ \begin{array}{ll} 0 &{}\quad \text {if } x\le a_{ij1}\\ p_{ij1} \times (x-a_{ij1})/(b_{ij1}-a_{ij1}) &{}\quad \text {if } a_{ij1} \le x \le b_{ij1}=a_{ij2}\\ F(a_{ij2})+p_{ij2} \times (x-a_{ij2})/(b_{ij2}-a_{ij2}) &{}\quad \text {if } a_{ij2} \le x \le b_{ij2}\\ \dots \\ F(a_{ijl})+p_{ijl} \times (x-a_{ijl})/(b_{ijl}-a_{ijl}) &{}\quad \text {if } a_{ijl} \le x \le b_{ijl}\\ 1 &{}\quad \text {if } b_{ijl} \le x \end{array}\right. } \end{aligned}$$
(8)

3.2 Quantile representation

Based on the knowledge of distribution functions, the quantile method (Ichino 2008) offers a common way to represent symbolic data with features of different types. The basic idea is to express the observed feature values by some predefined quantiles of the underlying distribution (Ichino 2011). In the case of interval valued data, we assume a uniform distribution inside the interval. In the case of histogram valued data, we likewise assume a uniform distribution inside each histogram bin.

By using the distribution function (8), we can easily obtain m numeric values (quantile values) \(Q_1,Q_2,\dots ,Q_m\) matching probabilities \(p_1,p_2,\dots ,p_m\) (where \(p_1<p_2<\dots <p_m\)) in the distribution:

$$\begin{aligned}&F(Q_1)=p_1\\&F(Q_2)=p_2\\&\dots \\&F(Q_m)=p_m \end{aligned}$$

It should be noted that we use the term "quantile values" to cover the minimum and maximum values as well. If \(p_1=0\) (minimum value) then \(Q_1=a_{ij1}\), and if \(p_{m}=1\) (maximum value) then \(Q_{m}=b_{ijl}\). The chosen probabilities may include 0 and 1 but do not have to; in some cases it is beneficial to truncate the distributional data.

Definition 6

Let \(U=\{\omega _{1},\omega _2,\dots ,\omega _n\}\) be a set of n objects. Let \(F_1,F_2,\dots ,F_d\) be d features of different types describing each object \(\omega _i\). Let m be a preselected integer that determines the common number of quantiles for each feature. Then feature \(F_j\) of object \(\omega _i\) can be represented by an m-tuple quantile vector as:

$$\begin{aligned} E_{ij}=(Q_{ij1},Q_{ij2},\dots ,Q_{ijm}) \end{aligned}$$
(9)

where the quantile values \(Q_{ij1},Q_{ij2},\dots ,Q_{ijm}\) match the m probabilities \(p_1,p_2,\dots ,p_m\) (where \(0\le p_1<p_2<\dots <p_m \le 1\)).

By default, we can consider the quantile values \(Q_{ijk}\) to be single points, \(Q_{ijk}=\{q_{ijk}\}\). In some cases, as will be shown later, it is beneficial to think of the quantile values \(Q_{ijk}\) as intervals \(Q_{ijk}=[q_{ijkmin},q_{ijkmax}]\) where \(q_{ijkmin}=q_{ijk}\) and \(q_{ijkmax}=q_{ijk}\). The object \(\omega _i\) then becomes an m-series of d-dimensional quantile rectangles, and (9) can also be written as:

$$\begin{aligned} E_{ij}=([Q_{ij1},Q_{ij1}],[Q_{ij2},Q_{ij2}],\dots ,[Q_{ijm},Q_{ijm}]) \end{aligned}$$
(10)

The advantage of the quantile representation of distributional data is that even if the underlying symbolic data are represented by different numbers of bins (for example, in the case of histograms the number of bins \(n_{ij}\) may vary from object to object), we obtain the same number of quantiles for all distributions.

Using m-tuple quantile vectors to describe symbolic objects gives us a simple way to describe the small microscopic details in the underlying distribution that would otherwise be overlooked or remain hidden in the complex representation of distributional information.

Example 1

Let \(E_{ij}\) be a histogram described as:

$$\begin{aligned} E_{ij}=\{[1,4]0;[4,5]0.2;[5,6]0.4;[6,7]0;[7,8]0.1;[8,9]0.3\} \end{aligned}$$
(11)
Fig. 1 Histogram for Example 1

The histogram can be seen in Fig. 1. The corresponding distribution function is:

$$\begin{aligned} F(x)= {\left\{ \begin{array}{ll} 0 &{}\quad \text {if } x\le 4\\ 0.2 \times (x-4) &{}\quad \text {if } 4 \le x \le 5\\ 0.2+0.4 \times (x-5) &{}\quad \text {if } 5 \le x \le 6\\ 0.6 &{}\quad \text {if } 6 \le x \le 7\\ 0.6+0.1 \times (x-7) &{}\quad \text {if } 7 \le x \le 8\\ 0.7+0.3 \times (x-8) &{}\quad \text {if } 8 \le x \le 9\\ 1 &{}\quad \text {if } 9 \le x \end{array}\right. } \end{aligned}$$
(12)

We are looking for the quantile values \(Q_1,Q_2\), \(Q_3\), \(Q_4\) and \(Q_5\) matching the probabilities 0, 0.25, 0.5, 0.75 and 1. We can find these values by simply solving the equations \(F(Q_1)=0\), \(F(Q_2)=0.25\), \(F(Q_3)=0.5\), \(F(Q_4)=0.75\) and \(F(Q_5)=1\). We obtain the quantile values \(Q_1=4\), \(Q_2=5.125\), \(Q_3=5.75\), \(Q_4=8.17\) and \(Q_5=9\). Finally, we can produce the 5-tuple quantile vector describing \(E_{ij}\):

$$\begin{aligned} E_{ij}=(4,5.125,5.75,8.17,9) \end{aligned}$$
(13)

We can extend (13) to a 5-tuple interval-format quantile vector as in (10):

$$\begin{aligned} E_{ij}=([4,4],[5.125,5.125],[5.75,5.75],[8.17,8.17],[9,9]) \end{aligned}$$
(14)
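
The computation in Example 1 can be automated by inverting the piecewise linear distribution function (8). The following is a minimal sketch that reproduces the quantile vector (13); the function name and the list-of-bins representation are our own illustrative choices, not the authors' code.

```python
# Illustrative sketch: extract the m-tuple quantile vector (9) from a histogram.

def quantile_vector(bins, probs):
    """bins: list of (a, b, p) as in (2); probs: p_1 < ... < p_m, each in [0, 1]."""
    quantiles = []
    for target in probs:
        cumulative = 0.0
        q = bins[-1][1]                                   # fall-back: the maximum value
        for a, b, p in bins:
            if p > 0 and cumulative + p >= target:
                q = a + (target - cumulative) / p * (b - a)   # linear interpolation inside the bin
                break
            cumulative += p
        quantiles.append(q)
    return tuple(quantiles)

E_ij = [(1, 4, 0.0), (4, 5, 0.2), (5, 6, 0.4), (6, 7, 0.0), (7, 8, 0.1), (8, 9, 0.3)]   # histogram (11)
print(quantile_vector(E_ij, [0.0, 0.25, 0.5, 0.75, 1.0]))
# -> (4.0, 5.125, 5.75, 8.166..., 9.0), i.e. the quantile vector (13)
```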

4 Microscopic hierarchical conceptual clustering

In the following section, we present the algorithm for hierarchical conceptual clustering (HCC) based on quantile vectors.

We use the term "microscopic" for a clustering method that takes into account the underlying properties of the distribution: for example, in which part of the feature space is most of the probability concentrated? Is the probability confined to a compact region or spread widely? In the proposed algorithm, the dissimilarity between two objects is measured at multiple points (quantiles) in order to consider different aspects of the distributions. In the "macroscopic" approach, only limited characteristics of the data are considered, for example the start and end points or the join and meet of two objects. In the case of intervals, where a uniform distribution is assumed inside the interval, this kind of approach may be adequate, but with more complicated distributions such as histograms, many details are overlooked.

4.1 Dissimilarity between two quantile vectors

To compare two quantile vectors \(E_{\omega _{xj}}\) and \(E_{\omega _{yj}}\) described as m-tuple quantile vectors, the following metric to measure their dissimilarity is proposed.

Definition 7

The dissimilarity between two quantile values \(Q_{xjk}\) and \(Q_{yjk}\) is:

$$\begin{aligned} |Q_{xjk}-Q_{yjk}|=max(q_{xjkmax},q_{yjkmax})-min(q_{xjkmin},q_{yjkmin}) \end{aligned}$$
(15)

where \(Q_{xjk}\) is given by the interval \([q_{xjkmin},q_{xjkmax}]\) and \(Q_{yjk}\) is given by the interval \([q_{yjkmin},q_{yjkmax}]\).

Definition 8

The dissimilarity between the m-tuple quantile vectors \(E_{\omega _{xj}}\) and \(E_{\omega _{yj}}\) of objects \(\omega _{x}, \omega _{y} \in U\) for feature \(F_j\) is:

$$\begin{aligned} d_q( E_{\omega _{xj}},E_{\omega _{yj}})=\sum _{k=1}^{m} \frac{|Q_{\omega _{xjk}}-Q_{\omega _{yjk}}|}{|D_{jk}|}/ m \end{aligned}$$
(16)

where \(|D_{jk}|\) is the length of the domain \(D_{jk}=[Q_{min_{jk}},Q_{max_{jk}}]\) of the k-th (\(k=1 \dots m\)) quantile over all objects \(\omega _i \in U\) for feature \(F_j\).

Normalization among all the quantiles ensures that every quantile has an equal impact on the distance between two objects.

Definition 9

The dissimilarity between two objects \(\omega _{x}\) and \(\omega _{y}\) from the set U, described by m-tuple quantile vectors in the d-dimensional feature space, is:

$$\begin{aligned} d( \omega _{x},\omega _{y})=\sum _{j=1}^{d} \frac{d_q(E_{\omega _{xj}},E_{\omega _{yj}})}{d} \end{aligned}$$
(17)

Proposition 1

The dissimilarity between two quantile vectors \(E_{\omega _{xj}}\) and \(E_{\omega _{yj}}\) satisfies \(0\le d_q( E_{\omega _{xj}},E_{\omega _{yj}}) \le 1\), as can be seen from (15) and (16).

Proposition 2

The dissimilarity between two objects \(\omega _{x}\) and \(\omega _{y}\) satisfies \(0\le d( \omega _{x},\omega _{y}) \le 1\). This follows from Proposition 1 and the normalization by the number of features in (17).

The measure is essentially a join operation between two quantile values at m data points along the distribution function. Therefore, the proposed distance reflects not only the general differences between the two compared quantile vectors but also their inner dissimilarities, which gives it an advantage over currently available methods. The choice of m and of the m probabilities has an impact on the result.
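
For illustration, the three levels of the dissimilarity, (15)-(17), translate directly into code. The following is a minimal sketch under the assumption that each quantile value is stored as a (min, max) pair (degenerate for unmerged objects) and each per-quantile domain as a (min, max) pair; the names are our own.

```python
# Illustrative sketch of the dissimilarities (15)-(17).

def d_quantile(Qx, Qy):
    """(15): span of the join of two quantile value intervals."""
    return max(Qx[1], Qy[1]) - min(Qx[0], Qy[0])

def d_vector(Ex, Ey, domains):
    """(16): normalized by the domain length of each quantile and averaged over m."""
    m = len(Ex)
    return sum(d_quantile(Ex[k], Ey[k]) / (domains[k][1] - domains[k][0])
               for k in range(m)) / m

def d_object(x, y, domains):
    """(17): averaged over the d features; x, y and domains are indexed by feature."""
    d = len(x)
    return sum(d_vector(x[j], y[j], domains[j]) for j in range(d)) / d
```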

Definition 10

Cartesian join of two m-tuple quantile vectors \(E_{\omega _{xj}}=(Q_{xj1},Q_{xj2},\dots ,Q_{xjm})\) and \(E_{\omega _{yj}}=(Q_{yj1},Q_{yj2},\dots ,Q_{yjm})\) with respect to the j-th feature is defined as:

$$\begin{aligned} E_{\omega _{xj}} \boxplus E_{\omega _{yj}}&=([min(Q_{xj1min},Q_{yj1min}),max(Q_{xj1max},Q_{yj1max})], \nonumber \\&[min(Q_{xj2min},Q_{yj2min}),max(Q_{xj2max},Q_{yj2max})],\nonumber \\&\dots , \nonumber \\&[min(Q_{xjmmin},Q_{yjmmin}),max(Q_{xjmmax},Q_{yjmmax})]) \end{aligned}$$
(18)

It can be seen that the result of the Cartesian join of the given two quantile vectors is a vector of intervals. The Cartesian join of two objects, described by m-tuple quantile vectors in the d-dimensional feature space, is described by an m-series of d-dimensional rectangles.

The Cartesian join operation is used during HCC when concepts are merged. Lower and upper bounds are taken from both merged concepts at every quantile point. We can consider the area covered by the d-dimensional rectangles as the concept size at that specific quantile point.

Example 2

Let \(E_{pj}\) be the histogram described as a quantile vector in (13). Let \(E_{qj}\) be described by the quantile vector:

$$\begin{aligned} E_{qj}=(4,5,7,8,9) \end{aligned}$$
(19)

We can see that both distributions have the same start and end points. The difference between objects comes from the internal variation in the distribution.

The dissimilarity between \(E_{pj}\) and \(E_{qj}\) for quantiles \(k=1,\ldots ,5\) is, according to (15) :

$$\begin{aligned} |Q_{pj1}-Q_{qj1}|&=max(4,4)-min(4,4)=0\\ |Q_{pj2}-Q_{qj2}|&=max(5,5.125)-min(5,5.125)=0.125\\ |Q_{pj3}-Q_{qj3}|&=max(7,5.75)-min(7,5.75)=1.25\\ |Q_{pj4}-Q_{qj4}|&=max(8,8.17)-min(8,8.17)=0.17\\ |Q_{pj5}-Q_{qj5}|&=max(9,9)-min(9,9)=0\\ \end{aligned}$$

Assume the domains for 5 quantiles for feature j are: \(D_{j1}=[2,6]\), \(D_{j2}=[3,6]\), \(D_{j3}=[4,8]\), \(D_{j4}=[8,9]\) and \(D_{j5}=[8,12]\). Then, the dissimilarity between quantile vectors \(E_{pj}\) and \(E_{qj}\) according to (16) is:

$$\begin{aligned} d_q( E_{pj},E_{qj})&=\sum _{k=1}^{5} \frac{|Q_{pjk}-Q_{qjk}|}{|D_{jk}|}/ 5\nonumber \\&=\left( \frac{0}{4}+\frac{0.125}{3}+\frac{1.25}{4}+\frac{0.17}{1}+\frac{0}{4}\right) /5 \nonumber \\&=\frac{0.524}{5}=0.105 \end{aligned}$$
(20)

The Cartesian join of quantile vectors \(E_{pj}\) and \(E_{qj}\) according to (18) is:

$$\begin{aligned} E_{(p\boxplus q)j}=([4,4],[5,5.125],[5.75,7],[8,8.17],[9,9]) \end{aligned}$$
(21)
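
Example 2 can be checked numerically. The following sketch (illustrative names; quantile values stored as (min, max) pairs) reproduces both the dissimilarity (20) and the Cartesian join (21).

```python
# Illustrative check of Example 2.

def cartesian_join(Ex, Ey):
    """(18): per-quantile interval [min of the minima, max of the maxima]."""
    return [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(Ex, Ey)]

E_pj = [(4, 4), (5.125, 5.125), (5.75, 5.75), (8.17, 8.17), (9, 9)]    # (13) as intervals
E_qj = [(4, 4), (5, 5), (7, 7), (8, 8), (9, 9)]                        # (19) as intervals
domains = [(2, 6), (3, 6), (4, 8), (8, 9), (8, 12)]                    # domains assumed above

gaps = [max(a[1], b[1]) - min(a[0], b[0]) for a, b in zip(E_pj, E_qj)]  # (15) per quantile
d_q = sum(g / (D[1] - D[0]) for g, D in zip(gaps, domains)) / 5         # (16)
print(round(d_q, 3))                 # -> 0.105, as in (20)

print(cartesian_join(E_pj, E_qj))
# -> [(4, 4), (5, 5.125), (5.75, 7), (8, 8.17), (9, 9)], the merged description (21)
```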

4.2 Algorithm for hierarchical conceptual clustering

  1. For each pair of objects \(\omega _i\) and \(\omega _{i'}\), calculate the dissimilarity \(d(\omega _i,\omega _{i'})\) as in (17). Find the pair of objects \(\omega _p\) and \(\omega _q\) that minimizes the dissimilarity d.

  2. Generate a merged concept \(\omega _{pq}\) of \(\omega _p\) and \(\omega _q\) in U. Delete \(\omega _p\) and \(\omega _q\) from U. The new object \(\omega _{pq}\) (a concept) is described by the Cartesian join \(E_{pq}=E_p\boxplus E_q\).

  3. Repeat steps 1 and 2 until U contains only one object (the whole concept).

The merged concept in step 2 is the Cartesian join of the two objects \(\omega _p\) and \(\omega _q\) as described in (18).
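
The complete bottom-up loop can be summarized in a short sketch. The helper functions repeat the dissimilarity (15)-(17) and the Cartesian join (18) so that the snippet is self-contained; the data layout (quantile values as (min, max) pairs) and all names are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of the HCC algorithm (steps 1-3) on quantile vectors.
from itertools import combinations

def join(Ex, Ey):
    return [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(Ex, Ey)]

def d_vector(Ex, Ey, dom):
    return sum((max(a[1], b[1]) - min(a[0], b[0])) / (D[1] - D[0])
               for a, b, D in zip(Ex, Ey, dom)) / len(Ex)

def d_object(x, y, domains):
    return sum(d_vector(x[j], y[j], domains[j]) for j in range(len(x))) / len(x)

def hcc(objects, domains):
    """objects: dict name -> list (over features) of quantile vectors; returns the merge history."""
    concepts = dict(objects)
    history = []
    while len(concepts) > 1:
        # Step 1: the pair of current concepts with minimal dissimilarity.
        p, q = min(combinations(concepts, 2),
                   key=lambda pq: d_object(concepts[pq[0]], concepts[pq[1]], domains))
        dist = d_object(concepts[p], concepts[q], domains)
        # Step 2: replace the pair by the merged concept described by its Cartesian join.
        merged = [join(Ep, Eq) for Ep, Eq in zip(concepts[p], concepts[q])]
        history.append((p, q, dist))          # dendrogram heights; monotone by (23)
        concepts["(%s,%s)" % (p, q)] = merged
        del concepts[p], concepts[q]
    return history                            # step 3: the loop ends with the whole concept
```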

A notable property of steps 2 and 3 is the intentional dual monotone property:

The monotone property of the extension:

$$\begin{aligned} \{\omega _p\} \subseteq \{\omega _p,\omega _q\} \subseteq \{\omega _p,\omega _q,\omega _r\} \subseteq \dots \end{aligned}$$
(22)

And the monotone property of dissimilarity:

$$\begin{aligned} d(\omega _p,\omega _p) \le d(\omega _p,\omega _q) \le d(\{\omega _p,\omega _q\},\omega _r) \le ... \end{aligned}$$
(23)

From (15), it is clear that the dissimilarity (17) is 0 only if the objects match exactly. From Proposition 2 it is known that the dissimilarity cannot be less than 0. Therefore, the first part of (23) holds. For the second part, we use the notion of "concept size". We define the concept size \(P(Q_{xjk})\) as:

$$\begin{aligned} P(Q_{xjk})=|Q_{xjk}|/|D_{jk}| \end{aligned}$$
(24)

The concept size of an m-dimensional quantile vector is the average of the quantile values' concept sizes:

$$\begin{aligned} P(E_{xj})=\sum _{k=1}^{m} \frac{P(Q_{xjk})}{m} \end{aligned}$$
(25)

and the concept size of an object is the average concept size over all features:

$$\begin{aligned} P(\omega _{x})=\sum _{j=1}^{d} \frac{P(E_{xj})}{d} \end{aligned}$$
(26)


Concept size correlates with the area the object occupies in the Cartesian space and with the dissimilarities in definitions (15)-(17). From (18), we can see that when objects are merged, their spans in the Cartesian space are also merged. If a merged concept is merged further, as in \(\{\{\omega _p,\omega _q\},\omega _r\}\), there are two options. In the first case, the k-th quantile value of the j-th feature of \(\omega _r\) lies inside the corresponding quantile value of \(\{\omega _p,\omega _q\}\). In this case, the concept size (24) (and also the corresponding dissimilarity) remains the same. In the second case, the value lies outside the quantile value of \(\{\omega _p,\omega _q\}\); the merged quantile value will then have a larger concept size (and larger dissimilarity) than \(\{\omega _p,\omega _q\}\). This property carries over from the quantile value to the quantile vector and to the object level. Therefore, the second part also holds. Concept size can be used to prove the extension property (22) in the same way.
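
The argument above can be illustrated with a small numeric check of (24) and (25), reusing the quantile vectors and domains of Example 2; the third object below is hypothetical and the function names are our own.

```python
# Illustrative check: the concept sizes (24)-(25) never decrease under the Cartesian join.

def size_quantile(Q, D):                 # (24): |Q_xjk| / |D_jk|
    return (Q[1] - Q[0]) / (D[1] - D[0])

def size_vector(E, dom):                 # (25): average over the m quantiles
    return sum(size_quantile(Q, D) for Q, D in zip(E, dom)) / len(E)

def join(Ex, Ey):
    return [(min(a[0], b[0]), max(a[1], b[1])) for a, b in zip(Ex, Ey)]

dom = [(2, 6), (3, 6), (4, 8), (8, 9), (8, 12)]              # domains from Example 2
E_pq = [(4, 4), (5, 5.125), (5.75, 7), (8, 8.17), (9, 9)]    # merged concept (21)
E_r = [(3, 3), (5, 5), (6, 6), (8.5, 8.5), (10, 10)]         # a hypothetical further object

assert size_vector(join(E_pq, E_r), dom) >= size_vector(E_pq, dom)
print(size_vector(E_pq, dom), size_vector(join(E_pq, E_r), dom))   # -> 0.104..., 0.270...
```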

A nested structure, for example a tree structure, should be used during the bottom-up method to memorize the structure of the nested objects and the corresponding descriptions given by the Cartesian join regions. It should be noted that in each step of the agglomeration process, we not only know the distance between objects, but also know which features create the differences between concepts.

5 Application

To show the benefits of the proposed approach to HCC, it is applied to well known symbolic datasets, and the results of the proposed microscopic HCC are compared with macroscopic approaches. For macroscopic HCC on interval valued datasets, the dissimilarity between objects is found as in Ichino and Brito (2013) and Ichino and Umbleja (2018). For histogram valued data, its extension from Umbleja (2017) is used.

5.1 Oils data

The Oils dataset is an interval valued symbolic dataset introduced by Ichino and Yaguchi (1994). It has been used extensively to verify clustering methods for symbolic interval data [for example in Billard and Diday (2006), El-Sonbaty and Ismail (1998), Hardy and Lallemand (2002) and Guru and Nagendraswamy (2006)].

The dataset contains 8 oils. The first six are vegetable oils and the last two are animal based oils. Pairs of oils (1,2), (3,4) and so on are expected to have similar properties. Vegetable oils 3 to 6 are very similar to each other. The data is given in interval format, with the fifth feature being a categorical variable. The data is normalized according to feature length. The acids are treated as modal multi-valued data with 9 bins corresponding to the 9 possible acids; a probability of 0 or 1 indicates whether the category is present.

The following five features describe the Oils data:

  • \(F_1\): Specific gravity

  • \(F_2\): Freezing point

  • \(F_3\): Iodine value

  • \(F_4\): Saponification value

  • \(F_5\): Major fatty acids

The first step in microscopic HCC is to decide on a proper set of m quantiles. For the current example, the following 7 quantiles are used: 0%, 10%, 25%, 50%, 75%, 90% and 100%. As mentioned earlier, the choice of quantiles has an impact on the result. In this case, 7 quantiles have been chosen so that there are many points for comparison. The start and end points, which are self-explanatory, are included. The 10% and 90% quantiles are useful for ignoring outliers as they truncate the data; sometimes it may be beneficial to ignore the start and end points and use truncation instead. The other three chosen quantiles ensure that there are enough points for comparison inside the data.

The second step is finding the domains \(D_{jk}\) for every feature j and every quantile k; in the current example \(j=1\dots 5\) and \(k=1\dots 7\). The minimum and maximum values of \(F_1\) for all 7 quantiles can be seen in Table 1.
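
A minimal sketch of this step is given below; it assumes that, at this stage, each object is stored as a list over features of m-tuples of single quantile values, and the names are illustrative.

```python
# Illustrative sketch: per-feature, per-quantile domains D_jk over all objects.

def quantile_domains(objects):
    """objects: list over objects, each a list over features of m-tuples of quantile values."""
    d = len(objects[0])          # number of features
    m = len(objects[0][0])       # number of quantiles
    return [[(min(obj[j][k] for obj in objects), max(obj[j][k] for obj in objects))
             for k in range(m)]
            for j in range(d)]
```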

Table 1 Domain values for \(F_1\) in Oils dataset for 7 quantiles

Once the domain values are known, the dissimilarities between all 8 oils can be found. The results can be seen in Table 2. The smallest dissimilarity is between Cotton and Olive, at 0.106. The dissimilarity calculation for Cotton and Olive for feature 1 over the 7 quantiles can be seen in Table 3. The resulting merged concept description can be seen in Table 4.

Table 2 Dissimilarity between oils using 7 quantiles after first iteration of microscopic HCC
Table 3 Dissimilarity calculation for Cotton and Olive for \(F_1\)
Table 4 Concept description for merged concept Cotton–Olive

For macroscopic HCC, the merged concept with the smallest average concept size (span) is found. At the first iteration, the most similar pair is Olive and Camellia, different from the first pair in microscopic HCC. The merged concept can be seen in Table 5.

Table 5 Concept description for merged concept Olive–Camellia from macroscopic HCC

Figure 2a represents the result of microscopic HCC and (b) shows the result of the macroscopic approach to clustering. As can be seen, clear pairs of oils form in both dendrograms, but the pairs are slightly different. In both graphs the vegetable oils concept (Cotton, Sesame, Camellia and Olive) is very similar. In the macroscopic case (b), the pairs are formed as expected from the dataset description. In the microscopic case (a), the pairs are formed in a different order, reflecting the small differences in how the two algorithms compare objects. Based on the properties of microscopic HCC, it can be said that the results in (a) reflect the similarity between objects at more than 2 points (start and end) and can therefore be considered more precise. It should be noted that once three concepts are left, the clusters match. The seven quantile rectangles formed at three clusters, which are used in the concept descriptions, can be followed visually for the features Iodine value and Specific gravity in Fig. 3.

Fig. 2 Dendrograms of microscopic (a) and macroscopic (b) HCC for Oils dataset

If the dendrograms are cut at two clusters, clear differences can be followed. In (a), the two clusters clearly separate the vegetable and animal based oils. In (b), the vegetable oils Linseed and Perilla form their own cluster, while the animal fats are clustered together with the four similar vegetable oils. It could be argued that the result in (a) is more in accordance with real life, as the dataset contains two very different kinds of oils. In Fig. 3, it can also be followed that the two clusters of vegetable oils look visually closer to each other than to the fats. It should also be remembered that the Oils data is interval-valued; the proposed microscopic approach does have benefits over the macroscopic method, but the real advantages become visible with more complex types of symbolic data.

Fig. 3 Quantile values of objects and quantile rectangles at 3 clusters formed during MHCC for Oils dataset features iodine value and specific gravity

In addition, we can follow the cluster quality from the dendrograms in Fig. 2. In (a), the microscopic case, the dendrogram can be cut at 0.28 once 3 clusters are left. In (b), the macroscopic case, the dendrogram can be cut after 0.455. Therefore, in (a) the concept size (which correlates with the dissimilarity between objects) is smaller, and the clusters are thus more compact and more clearly separable.

5.2 Hardwood data

In this section, a small example using hardwood data from the U.S. Geological Survey (U.S. Geological Survey 2013) is presented. Ten groups of trees (5 species, each with west and east coast groups) were chosen, with eight features. The features are given as cumulative percentages.

The following eight features describe the Hardwood data:

  • \(F_1\): Annual temperature (ANNT) (\(^{\circ }\)C)

  • \(F_2\): January temperature (JANT) (\(^{\circ }\)C)

  • \(F_3\): July temperature (JULT) (\(^{\circ }\)C)

  • \(F_4\): Annual precipitation (ANNP) (mm)

  • \(F_5\): January precipitation (JANP) (mm)

  • \(F_6\): July precipitation (JULP) (mm)

  • \(F_7\): Growing degree days on 5\(^{\circ }\)C base * 1000 (GDC5)

  • \(F_8\): Moisture index (MITM)

The hardwood data is in histogram format; therefore the span and the concept size of an object are not detailed enough for adequate comparison. All objects tend to cover most of the feature space and therefore have a similar span, so the dissimilarities between objects are hidden in their distributions. The quantile values for \(F_1\) can be seen in Table 6.

Table 6 Quantile values for annual temperature

The structure of the Hardwood data can be seen in Fig. 4. The distributional data is represented using accumulated quantile values to show the fine details of the distributions (Ichino and Brito 2014). 7 monotone lines containing 8 points are used to visualize the accumulation of values for quantiles k, \(k=1,\dots ,7\), over features j, \(j=1,\dots ,8\). Different grey tones are used for different features. The shapes of those monotone lines allow the patterns in the data to be identified. For example, for all tree species the difference between east and west lies between the last two quantile values in the quantile vector, as the bottom parts of the figure have similar shapes for the respective species. It can also be seen in Fig. 4 that all east hardwoods, except Alnus East, are very similar. All of that information should be taken into account during HCC. The dendrograms achieved by microscopic and macroscopic HCC can be seen in Fig. 5. The same 7 quantiles as in the case of the Oils data were used. As noted previously, for histogram-valued data, the extension of the Ichino and Umbleja (2018) method is used. The original method is not suitable for histogram-valued data, as it calculates the join and meet by the span the data covers in the feature space; histograms tend to span over large parts of the feature space, but it is the regions with high probabilities that are important. The extension of the concept size method does consider the shape of the histogram. Therefore the concept size and dissimilarity will not reach 1 as would be expected in a dendrogram; that only occurs when the histogram has one bin (an interval). Despite that drawback, the proportions of the dendrogram and the merged concepts can be normalized and are comparable.

Fig. 4 Accumulation of quantiles over features for Hardwood data

Fig. 5 Dendrograms of microscopic (a) and macroscopic (b) HCC for Hardwood

There are some fundamental differences between (a) and (b) in Fig. 5 due to the underlying distributions shown in Fig. 4. In both cases 3 clear clusters emerge, but they are not identical. In (b), the clusters are formed following real-life expectations: east and west coast trees are grouped independently. In (a), Alnus East is mixed with Acer and Alnus West. In (b), the span has a huge impact on the compactness measure, and the dissimilarity between Acer and Alnus West is rather large. In (a), dissimilarity is measured at multiple points and the inner distribution of the histograms is analysed; those three trees actually have very similar patterns in Fig. 4. The concept descriptions at 3 clusters for the microscopic approach can be seen in Table 7.

Table 7 Concept descriptions for 3 clusters formed with MHCC

This can be further verified with the principal component analysis in Fig. 6. The main difference between the East and West coasts comes from the last quantile: in the case of the west trees, the difference between the last and penultimate quantile covers a very large span, meaning that there is a low concentration of probability. In (a), due to the normalization factor whereby differences among quantiles have equal weight, the dissimilarity along the last quantile does not override the similarities among the other 6 quantiles used. Therefore, (a) offers benefits over (b) by presenting new information that was not known from background knowledge about the dataset but can be verified with other methods. Furthermore, it makes sure that differences in the minimum and maximum values, which may have only low probabilities because natural data tends to be normally distributed, do not override the other similarities.

When comparing cluster quality, it can be followed visually that if we cut the dendrograms in Fig. 5 when 3 clusters remain (the most optimal point), the concepts formed in (a) have better compactness than in (b). A formal comparison of cluster quality between the formed microscopic and macroscopic HCC clusters can be followed in Table 8. Three measures were chosen: Calinski–Harabasz (CH), Davies–Bouldin (DB) and root-mean-square standard deviation (RMSSTD). The CH index uses the average between- and within-cluster sums of squares; the larger the value, the better the cluster. DB also uses within-group and between-group distances to validate the formed clusters; the smaller the value, the better the cluster. RMSSTD considers only the homogeneity within the cluster (Liu et al. 2010; Vendramin et al. 2010).

Most of the indexes for cluster validation consider two aspects, compactness and separation (Liu et al. 2010). As symbolic data is not a point value but a distribution that covers a large part of the feature space (this is especially the case with the American States' data), there is a lot of overlap between objects, and the formed clusters do not produce a clear separation over the whole distribution/span. Based on the distance calculation during microscopic HCC, the objects most similar over the majority of quantiles are merged into a cluster. The concept description is achieved at the quantile level (as could be followed from Table 7); separation over one quantile is adequate for achieving separation between clusters. RMSSTD was chosen as a quality measure because it considers only the compactness of the formed clusters, which is what is important for distributions.

Fig. 6 PCA results of Hardwood data using Centers method (Billard and Diday 2006)

Cluster quality measures are calculated at the quantile level and averaged over all quantiles. Even better results can be achieved when the best quantile is used instead of the average. The Calinski–Harabasz index is calculated based on Vendramin et al. (2010) with (27), where \(trace(W_q)\) and \(trace(B_q)\) are defined as in (28) and (29).

$$\begin{aligned}&\max _{\begin{array}{c} q \in [Q_1 \dots Q_m] \end{array}} \dfrac{trace(B_q)}{trace(W_q)} \times \dfrac{N-k}{k-1} \end{aligned}$$
(27)
$$\begin{aligned}&trace(W_q)=\sum _{l=1}^{k} \sum _{j=1}^{d} \sum _{ \omega _{i} \in C_l }(x_{ijq}-{\overline{x}}_{ljq})^{2} \end{aligned}$$
(28)
$$\begin{aligned}&trace(B_q)=\sum _{i=1}^{n} \sum _{j=1}^{d} (x_{ijq}-{\overline{x}}_{jq})^{2}- trace(W_q) \end{aligned}$$
(29)

Davies–Bouldin for a specific quantile is calculated in the usual way, with distances calculated as described previously in (17). Then the best or average value over all quantiles is used. The same applies to the root-mean-square standard deviation.
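
A sketch of the quantile-level Calinski–Harabasz computation (27)-(29) is given below; the data layout (x[i][j][k] is the k-th quantile value of object i for feature j) and the function name are assumptions made for illustration only.

```python
# Illustrative sketch of the quantile-level Calinski-Harabasz index (27)-(29).

def calinski_harabasz(x, labels, use_best=True):
    n, d, m = len(x), len(x[0]), len(x[0][0])
    clusters = sorted(set(labels))
    k = len(clusters)
    scores = []
    for q in range(m):
        overall = [sum(x[i][j][q] for i in range(n)) / n for j in range(d)]
        total = sum((x[i][j][q] - overall[j]) ** 2 for i in range(n) for j in range(d))
        within = 0.0                                   # trace(W_q), Eq. (28)
        for c in clusters:
            members = [i for i in range(n) if labels[i] == c]
            centre = [sum(x[i][j][q] for i in members) / len(members) for j in range(d)]
            within += sum((x[i][j][q] - centre[j]) ** 2 for i in members for j in range(d))
        between = total - within                       # trace(B_q), Eq. (29)
        scores.append((between / within) * (n - k) / (k - 1))
    # Eq. (27) takes the best quantile; the averaged variant is also reported in Table 8.
    return max(scores) if use_best else sum(scores) / m
```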

Table 8 Cluster quality evaluations for Hardwood data

As can be seen in Table 8, with the Calinski–Harabasz measure, 3 clusters are clearly the best result for all of the different approaches. With Davies–Bouldin, the results are not so clear, with the index value dropping for larger numbers of clusters. It can be followed from the dendrograms in Fig. 5 that with more than 3 clusters, individual objects start to form solo clusters that have good compactness (0, as there is only one object in the cluster) and separation. The best RMSSTD value is defined as the "elbow" (Liu et al. 2010), and in all cases it is clearly visible that there is a large drop in values before 3 clusters, while the index does not change much for larger numbers of clusters.

5.3 American States’ weather

In this section, the results of the proposed methodology are described on a climate dataset (National Climatic Data Center 2014). This dataset contains sequential monthly "time bias corrected" average temperature data for 48 states of the USA (Alaska and Hawaii are not represented in the dataset). The time period 1895–2009 is used for comparison purposes. The data provided was first transformed into histograms describing the average temperature for every state and every month.

The following twelve features describe the American States data:

  • \(F_1\): January

  • \(F_2\): February

  • \(F_3\): March

  • \(F_4\): April

  • \(F_5\): May

  • \(F_6\): June

  • \(F_7\): July

  • \(F_8\): August

  • \(F_9\): September

  • \(F_{10}\): October

  • \(F_{11}\): November

  • \(F_{12}\): December

The dendrograms achieved by microscopic and macroscopic HCC can be seen in Fig. 7. The dendrograms were cut at 5 clusters and plotted on a map, as can be seen in Fig. 8. The same 7 quantiles as before were used.

Fig. 7 Dendrograms of microscopic (a) and macroscopic (b) HCC for states data

Fig. 8 Five clusters achieved in microscopic (a) and macroscopic (b) HCC for states data on a map

Table 9 Differences in clustering between Microscopic and Macroscopic HCC—extreme summers
Table 10 Differences in clustering between Microscopic and Macroscopic HCC—extreme winters

As can be seen from Fig. 7, the dendrograms have different structures. It can be said that (a) would produce 5 compact clusters and (b) would produce 4 clusters. For comparison purposes, 5 clusters are chosen for Fig. 8. We can think of these clusters as "very warm", "warm", "mild", "colder" and "cold". Both (a) and (b) in Fig. 8 are reasonable clusterings as they follow expected real-life knowledge: the clusters follow latitudes, with coastal states being clustered into warmer latitude clusters. Inland states are usually clustered with colder groups than others at the same latitude.

It can be followed from Fig. 7 that in both cases the "cold" and "colder" clusters form their own branch. Furthermore, it should be noted that Florida has a very different weather pattern from the rest of the very warm states, having extremely warm winters. That property can be followed in the dendrograms, where Florida is an outlier that is grouped as an individual object.

When comparing (a) and (b) in Fig. 7, the clusters tend to be more compact in (a) than in (b), meaning better cluster quality. The concept descriptions of the five formed clusters using the microscopic approach can be seen in Table 11.

Table 11 Concept descriptions in Celsius for 5 clusters formed with microscopic HCC

The microscopic approach follows the average weather patterns, while occurrences of extreme maximums and minimums have a huge impact on the macroscopic approach. The differences in the clusters can be ascribed to that characteristic. The states clustered into different groups are listed in Tables 9 and 10. The states' average and extreme temperatures are given. For comparison, the cluster's average temperature according to microscopic HCC is also included. It can be followed that a state's average temperature is similar to its cluster's average, while the extreme temperature is very different from both the state's and the cluster's average.

In all cases listed in Tables 9 and 10, the differences are caused by extreme recorded temperatures that have a low probability of occurring. In the case of macroscopic HCC, those extremes (minimum and maximum temperatures) have a very strong impact on the clustering results. In the case of microscopic clustering with 7 quantile values, the extremes do not have such a huge impact: they are only 2 quantile values, and all quantiles' distances are treated equally. Therefore, if 5 out of 7 quantile values are very similar, they have a larger impact on the overall similarity than 1 or 2 large differences at the minimum and maximum extremes would have. The average weather pattern has more impact than the rare occurrence of a few unusual temperatures.

Arizona's concept description can be seen in Table 12. In microscopic HCC, Arizona is grouped with the warmer states, while in the macroscopic approach it is grouped with the warmest. When comparing Table 12 with the "warm" cluster description in Table 11, it can be followed that for most winter months and most quantiles, Arizona sets the maximum value. For example, for January, for the quantiles 10–100%, Arizona's quantile values are the maxima of the cluster's description. The same January quantile values for Arizona would be the minimum values for the "very warm" cluster, showing clearly that Arizona is a borderline case. The low probability high temperatures during summer (for example the 100% quantiles for F6–F9) are actually higher than in the "very warm" cluster. For June (F6), the 100% quantile has the value 38, while the 90% quantile has the value 31, meaning that 10% of Arizona's distribution for June is covered by 7 degrees. The value at the 75% quantile is 28, so 15% of the distribution is covered by only 3 degrees, showing that high temperatures are not that likely. During the clustering process, all quantiles and all months are considered. In the majority of cases, especially at the lower quantiles and in the summer months, Arizona is clearly colder than the "very warm" cluster description, and therefore the clustering result of the microscopic approach is more accurate than the result achieved with the macroscopic one.

Table 12 Concept descriptions for Arizona using 7 quantiles

A similar phenomenon can be followed with California, as seen in Table 13, but with winters. California has recorded some very cold winter temperatures, but on average its weather all year round is quite even, with a small difference between the 25% and 75% quantile values for all months (an oceanic climate). That makes it different from the average state's weather in the "warm" cluster, where summers are usually much warmer than in California. For the other seasons, the average weather in California is very similar to the rest of the "warm" weather states.

Table 13 Concept descriptions for California using 7 quantiles

Similar comparisons between the clusters' descriptions and states' individual quantile values can be made for the other borderline cases brought out in Table 10. In all cases, there have been extremely cold winters, but their likelihood is low: the gap in degrees between the 0% quantile and the 10% quantile is wide.

Results similar to Fig. 8b can be followed in Irpino and Verde (2006). Although that method considers the inner distribution of histograms, it produces results similar to the macroscopic method, with the differences in clusters appearing for California and Arizona, which are grouped similarly to (a). The reason for achieving results similar to the macroscopic approach is that the length of the histogram has a large impact on their method. With the proposed microscopic method, the distance at every quantile value has an equal impact, and therefore the average weather, as explained previously, is not overshadowed by extreme occurrences. The main differences between Fig. 8a and the results in Irpino and Verde (2006) concern inland states that record a few extremely cold years.

Table 14 Cluster quality evaluation for American States’ weather

Cluster quality measures are calculated in Table 14. The same three indexes as previously are used. Calinski–Harabasz and RMSSTD give similar results, as with the Hardwood data, indicating that the best number of clusters is 5. In addition, the Calinski–Harabasz index can be compared with Irpino and Verde (2006). They also achieved their best result with 5 clusters, with a best maximum value of 77.65. The best value using the macroscopic approach is 77.98, but this is achieved with 3 clusters (cold, mild and warm). The microscopic approach achieves its best result at 5 clusters, similar to Irpino and Verde (2006), with a maximum value of 79.53, which is higher than previously reported.

Davies–Bouldin, on the other hand, suggests that the best result with HCC is achieved at 3 clusters. The index averaged over all quantiles using MHCC gives a similar result, while using the best quantile suggests using only 2 groups. The DB index seems to be very sensitive to the large overlap that the weather dataset has.

Additionally, the clusters' compactness can be observed visually in Fig. 7. The dendrogram in (a) has a smaller average "distance" between most of the objects than in (b); the concepts in (a) are therefore more compact. When cutting the dendrogram at 5 clusters, (a) is cut after 32% (at point 0.320) while (b) is cut at 47% (at point 0.1). The dendrogram in Irpino and Verde (2006) uses the Ward criterion and is therefore not comparable other than by general shape.

6 Discussion

As shown above, the main benefit of the proposed algorithm is that the distributional information is considered in more detail, allowing additional data, overlooked in the macroscopic case, to be taken into account during conceptual clustering. In addition, the proposed approach does not suffer from the computationally expensive histogram merger.

The proposed approach does not re-invent the general algorithm for hierarchical conceptual clustering. The base of the algorithm is the same for both the macroscopic and the microscopic approach; the difference lies in how objects are compared to each other. In the macroscopic case, the size of the area where the probabilities can be found is considered, be it the span in the case of intervals or the sizes of the histograms' join and meet.

In the microscopic case, differences at specific distributional points (quantiles) are measured. Individually, we could think of the distance at a single quantile point as an interval's span (between the minimum and maximum values of the objects at that quantile), and therefore we could think of it as a "macroscopic" property. For the overall difference between objects, multiple quantiles in different parts of the distribution are considered. We could say that we apply a span-like approach many times in different parts of the distribution to get a more precise result.

Due to the use of multiple points for comparison, the mergers forming the main clusters tend to happen at smaller values of dissimilarity than with macroscopic methods. For example, the chaining effect seen in Fig. 5b tends to occur much less with the proposed algorithm than with macroscopic methods. This is because when objects are similar, the majority of the compared quantile values tend to be similar; some differences may appear in only a few individual quantiles, and the overall normalized result lowers the impact of those few differences. When objects are dissimilar, they tend to differ at all quantile values, each having an equal impact on the normalized final dissimilarity measure. Thanks to this phenomenon, it is easier to decide where to cut the dendrogram and how many clusters to keep.

There exist a few other clustering methods that do consider the underlying distributions. The proposed method has the benefit of low storage requirements during algorithm execution and a simple distance calculation, as single point values are used.

The storage requirements of the macroscopic algorithm are higher: the whole histogram has to be stored. In the general case, the histogram has many bins, and each bin requires storing three values (min, max and probability). During the merger phase, new histograms are produced. To keep the exact description of the merged distribution, that process produces more detailed histograms (with more bins). In the worst case, if the histograms overlap partially, even more bins than before can be produced during the merger (if the original histograms had g and l bins, then the merged histogram can in the worst case have \(g+l+1\) bins). In addition, as more mergers are done and more bins are generated, the probabilities of the individual bins become very small, which is another undesired side effect of the macroscopic merger for histograms. A similar undesired side effect also occurs in the quantile-based approach of Irpino and Verde (2006).

In the case of the microscopic approach, the underlying complexity of the data structure does not matter: only the fixed m quantile values of every histogram are stored, no matter how many bins there are. In addition, merging does not generate new values; only the minimums and maximums of the quantiles are stored (\(m \times 2\) values). Therefore, during clustering no new storage is required, with the maximum storage still being \(m \times d \times n \times 2\).

As we are dealing with symbolic data, the distance calculation between two objects is not doable with O(n) complexity with respect to the number of features, as in the classical data case. Instead of linear time, it becomes quadratic, \(O(n^2)\), with respect to the number of features and bins/quantiles. That remains true in both the macroscopic and the microscopic case. In the macroscopic case, the most expensive operation is the merger of histograms, with complexity \(O(n^3)\) as in Irpino and Verde (2006) and Umbleja (2017). In the proposed microscopic approach, the merger is reduced to \(O(n^2)\) complexity: over all features and quantiles, a simple list operation is performed to update the minimum and maximum values for a limited number of quantiles. In addition, no additional quantiles or bins are generated during the merger (unlike in the two previously mentioned methods); simply the current minimum and maximum values for each quantile are stored. Moreover, in the macroscopic case, the distance is calculated with a compactness measure that depends on the number of histograms already merged, \(\Vert \omega _{x}\Vert \), similar to the average histogram calculation in Irpino and Verde (2006) with complexity \(O(n^3)\).

Usually \(n \gg m\), and therefore the number of quantiles m does not have such a huge impact on time complexity, while the number of objects already merged, \(\Vert \omega _{x}\Vert \), and the number of bins/quantiles in the currently available algorithms tend to rise with every merger (the larger n is, the more mergers are required). With previous approaches, the procedures (like merging histograms or calculating their compactness) are complex, as they consider the histogram's characteristics (such as the assumption of a uniform distribution inside a bin and dividing it between bins when new bins/quantiles are generated). In the microscopic approach, the complex nature of the underlying data can be "ignored" thanks to using a simple fixed number of quantile values. In addition, simple individual quantile values are easier for a human analyser to follow than complex histograms with a large number of bins. In the macroscopic approach, the extra bin/quantile generation could be overcome by limiting the maximum number of bins/quantiles, but that would require extra actions and modifications to the current approaches.

The proposed approach can also be neatly modified to match the specific conditions that interest the analyser by modifying the set of quantiles used to compare objects. If only the two quantiles 0% and 100% are used, the algorithm becomes similar to the macroscopic approach in its analysis capacity, but instead of a single size it uses two measured differences. If the 0% and 100% quantiles are left out, truncated data is used. That can be beneficial for a clearer picture, as the low probability density areas at either side of the main data are not that important. The closer the first and last quantile values are, the more of the data is left out and the less of the original distribution information is used. As symbolic data is normally aggregated from large classical data, as in the case of the Hardwood data, discarding those small probability areas at the start and end of the density curve removes the outliers' impact on the aggregated data.

The number of quantiles used has an impact on the results. The more quantiles are used, the less impact the dissimilarity at any one specific quantile has. At some point, adding more quantiles will not have any further influence on the results and becomes counter-productive, as the impact of a single difference is reduced to almost nothing. The fewer quantiles are used, the more effect every single quantile has, but the fewer details about the data are considered. The optimal choice of quantiles can be made depending on the data and the goal of the analysis, but usually it would be between 4 and 10 quantiles. The number should reflect the goal of the analysis and the volume of the original data; if the unaggregated data was not large, not many quantiles are needed for adequate analysis. It may be beneficial to space the quantiles equally or to concentrate on probability values around the median.

The proposed algorithm not only forms clusters but also generates the concepts and their descriptions. Another important benefit of the proposed algorithm is that it is monotone.

7 Conclusion

In this paper, a method for microscopic hierarchical conceptual clustering based on quantile values is proposed. The main contribution of the proposed method is that the algorithm is simple and monotone, yet it allows small microscopic details in the underlying distribution of distributional data to be considered. This leads to more accurate conceptual clusters, as was demonstrated by applying the proposed method to three commonly used datasets in symbolic data analysis. The proposed method is especially beneficial for symbolic data described by complex distribution functions such as histograms, but it also has advantages for simpler distributions such as intervals. Furthermore, thanks to the use of quantiles, it allows comparison between different types of symbolic objects without adding complexity to the method. The algorithm can be modified, through the choice of quantiles, to match specific needs or prior knowledge about the data.