
1 Introduction

Clustering is one of the most studied fields in text mining. Text data are characterized by high dimensionality, which greatly affects the scalability and the effectiveness of clustering algorithms. One of the most common approaches to reduce dimensionality is to apply stemming and stop-word removal, and then use a vector representation of documents, where the presence of each term in a vector is weighted, for example, by TF-IDF. However, stemming and stop-word removal only alleviate the curse of dimensionality, due to the large vocabulary of terms mentioned in a typical document corpus. A technique pursued by seminal papers (see [1, 12]) is to exploit a Frequent Itemset Mining (FIM) algorithm to extract a reduced set of more meaningful features, with the aim of shrinking the vectorial representation of documents. In particular, in this paper we focus on the FIHC algorithm (Frequent Itemset-based Hierarchical Clustering), proposed by Fung et al. in their influential paper [3].

The idea of FIHC is to first mine the frequent itemsets, namely word-sets, that co-occur in the corpus with a frequency not smaller than a pre-determined minimum support threshold, thus leading to a set of top-k most frequent word-sets. FIHC exploits the extracted frequent word-sets to determine an initial clustering of documents, based on the documents that contain or overlap those patterns. The collection of mined word-sets naturally identifies a vocabulary of frequent terms, which determines the dimensionality of the vectors used to represent each document. This representation is then used to recursively merge the initial clusters until the desired number of clusters is obtained.

The quality of the top-k word-sets exploited by FIHC impacts the quality of the resulting clustering. For instance, changing the minimum support threshold changes the set of top-k most frequent word-sets, and thus affects the feature space into which documents are mapped, as well as the seed clusters initially identified by each word-set. Specifically, the smaller the minimum support, the larger the set of word-sets extracted. In fact, a large support threshold may result in a reduced set of word-sets that occur in many documents and thus have low discriminative power. On the other hand, a small threshold may result in too many patterns and, consequently, a large vocabulary of words used in the vectorial representation of documents, again suffering from the curse of dimensionality during clustering.

We propose to use clustering quality as a proxy measure to quantitatively evaluate the quality of different kinds of patterns used to feed the FIHC algorithm. In particular, we study approximate patterns, i.e., itemsets that are only approximately included in the corresponding sets of transactions [7, 9, 13]. This means that, given an approximate pattern, some false positives are allowed: some of the items included in the pattern may not occur in a few of the transactions supporting it. While exact patterns are commonly ranked according to the popular concept of frequency in the collection, the alternative approximate patterns we study in this paper are ranked according to different definitions of importance.

To limit the number of patterns used to model documents and identify the initial clusters, in both the exact and approximate cases we select the top-k ones. Whereas for frequent patterns the top-k are simply the most frequent ones, an approximate pattern algorithm aims at discovering the set of k patterns that best describes/models the input dataset. State-of-the-art algorithms differ in how they formalize this concept of dataset description. For instance, in [9] the goodness of the description is given by the number of occurrences in the dataset incorrectly modeled by the extracted patterns, while shorter and more concise patterns are promoted in [7, 13]. The goodness of a description is measured with some cost function, and the top-k mining task is cast as an optimization of such cost. In most of these formulations, the problem is proved to be NP-hard, and greedy strategies are therefore adopted. At each iteration, the pattern that best optimizes the given cost function is added to the solution. This is repeated until k patterns have been found or until the cost function can no longer be improved.

In this paper we study the quality of document clustering achieved by exploiting the approximate top-k patterns extracted by three state-of-the-art algorithms: Asso  [9], Hyper+  [13] and \(\textsc {PaNDa}^+\)  [8]. The cost functions adopted by Asso  [9] and Hyper+  [13] share important aspects that can be generalized into a unique formulation. The \(\textsc {PaNDa}^+\) framework can be plugged in with this generalized formulation, which makes it possible to greedily mine approximate patterns according to several cost functions, including the ones proposed in [7, 10]. \(\textsc {PaNDa}^+\) also allows the inclusion of maximum noise constraints [2].

Concerning the evaluation methodology for these patterns, we adopt the quality of the clustering obtained by FIHC as a proxy for the quality of the patterns extracted by the various algorithms. Specifically, we use the aforementioned mining algorithms to extract top-k approximate pattern sets, which are then used to feed FIHC in order to cluster the input documents. Several commonly used “external” measures are used to evaluate the goodness of the pattern-based clusters with respect to the true classes of the documents. Moreover, we also compare these methods with two baselines, i.e., K-Means and a version of FIHC exploiting classical frequent word-sets (exact patterns).

The main contribution of this paper is an extensive evaluation of approximate patterns. Our investigation shows that approximate patterns provide a better representation of the given dataset than exact patterns, and that \(\textsc {PaNDa}^+\) generates patterns of better quality than other state-of-the-art algorithms. In other words, our experiments show that \(\textsc {PaNDa}^+\) seems better able to capture the patterns/features characterizing the most salient topics discussed in the given corpus of documents.

The rest of the paper is organized as follows. Section 2 discusses exact and approximate pattern mining, and briefly introduces some algorithms for top-k approximate pattern mining. Section 3 discusses the clustering algorithm FIHC, and the possible exploitation within FIHC of either frequent or approximate patterns. In Sect. 4 we describe the experimental setting and the quality of document clustering achieved by the various versions of FIHC and the baselines. Finally, Sect. 5 draws some concluding remarks.

2 Approximate and Exact Patterns

The binary representation of a transactional dataset – i.e., a multi-set of itemsets, where each itemset is a subset of a given collection of items \(\mathcal{I}\) – is convenient for introducing the patterns mined from textual datasets. A transactional dataset of N transactions and M items – which is analogous to representing a corpus of N documents over a vocabulary of M terms as a collection of “sets of words”, thus ignoring the positions and the number of occurrences of each term in a document – can be represented by a binary matrix \(\mathcal {D} \in \left\{ 0,1 \right\} ^{N \times M}\), where \(\mathcal {D} (i,j)=1\) if the \(j^{th}\) item occurs in the \(i^{th}\) transaction, and \(\mathcal {D} (i,j)=0\) otherwise.
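As a concrete illustration, the following Python sketch (assuming scikit-learn is available; the toy corpus is invented for the example) builds such a binary document-term matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration).
docs = [
    "frequent pattern mining for text",
    "pattern based document clustering",
    "hierarchical clustering of text documents",
]

# binary=True keeps only the presence/absence of each term, yielding the
# N x M binary matrix D described above (positions and counts are ignored).
vectorizer = CountVectorizer(binary=True)
D = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # the M vocabulary terms (items)
print(D)                                   # D[i, j] = 1 iff term j occurs in document i
```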

A pattern P is thus identified by a set of items, along with the set of transactions where the items occur. In terms of text documents, P is a word-set occurring in a given set of documents. We represent these two sets as binary vectors \(P = \langle P_I, P_T\rangle \), where \(P_I \in \left\{ 0,1 \right\} ^M\) and \(P_T \in \left\{ 0,1 \right\} ^N\) are the indicator vectors of the subsets of items and transactions, respectively. The outer product \(P_T \cdot P_I^{\text{ T }}\in \left\{ 0,1 \right\} ^{N \times M}\) identifies a sub-matrix of \(\mathcal {D}\). These patterns are also called hyper-rectangles [13]: each pattern can be visualized as a rectangle if we properly reorder rows (transactions) and columns (items) to make them contiguous.

If a pattern is exact, the sub-matrix covers only 1-bits in \(\mathcal {D}\); \(\Vert P_I\Vert \) is the length of the pattern and \(\Vert P_T\Vert \) is its support, with \(\Vert \cdot \Vert \) being the \(L^{1}\)-norm (or Hamming norm), which simply counts the number of 1-bits in a binary vector. Conversely, if a pattern is approximate, it mostly covers 1-bits in \(\mathcal {D}\) (true positives), but it may also cover a few 0-bits (false positives). Still, \(\Vert P_I\Vert \) is the length of the pattern, and \(\Vert P_T\Vert \) is its approximate support.
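The following minimal numpy sketch (with an invented toy pattern and dataset) illustrates the indicator-vector representation, the hyper-rectangle given by the outer product, and how length, support, and false positives are computed:

```python
import numpy as np

# Hypothetical pattern over M = 5 items and N = 4 transactions.
P_I = np.array([1, 1, 0, 0, 1])   # indicator vector of the item set
P_T = np.array([1, 0, 1, 1])      # indicator vector of the supporting transactions

length = P_I.sum()                # ||P_I||: pattern length
support = P_T.sum()               # ||P_T||: (approximate) support

# Outer product: the N x M sub-matrix ("hyper-rectangle") identified by the pattern.
rect = np.outer(P_T, P_I)

# Toy dataset D: the pattern is exact iff the rectangle covers only 1-bits of D.
D = np.array([[1, 1, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [1, 1, 0, 1, 1],
              [1, 0, 0, 0, 1]])
false_positives = np.logical_and(rect == 1, D == 0).sum()
print(length, support, false_positives)  # a non-zero count means the pattern is approximate
```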

2.1 Exact Closed Patterns

Let \(\varPi ^\sigma =\left\{ P_1,\ldots ,P_{|\varPi ^\sigma |} \right\} \) be a set of exact frequent patterns, where \(\sigma \) is the minimum support ratio: \(\forall P \in \varPi ^\sigma \), \(P = \langle P_I, P_T\rangle \), we have that \(\frac{\Vert P_T\Vert }{N} \ge \sigma \), where N is the number of documents in the corpus, represented as \(\mathcal{D}\). These patterns may overlap, since they may share items or transactions.

In this paper, we exploit the popular concept of closed frequent patterns, by removing from \(\varPi ^\sigma \) some redundant patterns, since this also prevents the creation of redundant initial seed clusters used by FIHC. Specifically, a pattern \(P_i \in \varPi ^\sigma \), \(P_i = \langle P^i_I, P^i_T\rangle \), is said to be closed iff \(\not \exists P_j \in \varPi ^\sigma \), \(P_j = \langle P^j_I, P^j_T\rangle \), such that \(set(P^i_I) \subset set(P^j_I)\) and \(P^i_T = P^j_T\). In other words, we maintain in \(\varPi ^\sigma \) only the frequent itemsets for which there is no super-itemset occurring in exactly the same set of transactions.

The number of frequent closed itemsets may be orders of magnitude smaller than the number of all frequent ones, while still providing the same information: frequent itemsets can in fact be derived from closed ones. We denote by \(\widehat{\varPi }^\sigma \), where \(\widehat{\varPi }^\sigma \subseteq \varPi ^\sigma \), the set of closed patterns given a minimum support \(\sigma \). Several frequent closed itemset mining algorithms [5, 14] can be used to mine \(\mathcal{D}\).

Since we need to limit the set of (closed) patterns to the top-k most frequent ones, we first select the largest \(\sigma _k\) such that:

$$\begin{aligned} |\widehat{\varPi }^{\sigma _k}| \ \ge \ k \end{aligned}$$
(1)

and then select the top-k patterns in \(\widehat{\varPi }^{\sigma _k}\), denoted by \(\widehat{\varPi }_k^{\sigma _k}\), where the patterns in \(\widehat{\varPi }^{\sigma _k}\) are first sorted in decreasing order of support (and then of pattern length). This minimum support \(\sigma _k\), used to identify \(\widehat{\varPi }_k^{\sigma _k}\), is then employed by FIHC within specific similarity measures.

In this work, we thus exploit such top-k closed frequent itemsets as a baseline for the pattern-based features used to model the text documents that feed the FIHC algorithm.
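A naive sketch of this selection step is shown below (hedged: efficient closed-itemset miners such as those in [5, 14] would be used in practice; `itemsets_with_tids` is a hypothetical input listing every frequent itemset together with the ids of its supporting transactions):

```python
def topk_closed(itemsets_with_tids, k):
    """itemsets_with_tids: list of (frozenset of items, frozenset of supporting transaction ids),
    already filtered by the minimum support sigma_k."""
    # Keep only closed itemsets: no proper super-itemset with the same transaction set.
    closed = [
        (items, tids) for items, tids in itemsets_with_tids
        if not any(items < other_items and tids == other_tids
                   for other_items, other_tids in itemsets_with_tids)
    ]
    # Sort by decreasing support, then by decreasing pattern length, and take the top-k.
    closed.sort(key=lambda p: (len(p[1]), len(p[0])), reverse=True)
    return closed[:k]
```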

2.2 Approximate Patterns

Let \(\varPi =\left\{ P_1,\ldots ,P_{|\varPi |} \right\} \) be a set of approximate, possibly overlapping patterns that aim at best describing/modelling the input dataset \(\mathcal{D}\). This means that \(\varPi \) approximately covers the 1’s in the dataset \(\mathcal {D} \), except for some noisy item occurrences, identified by the matrix \(\mathcal{N} \in \left\{ 0,1 \right\} ^{N\times M}\):

$$\begin{aligned} \mathcal{N} \ =\ \bigvee _{P \in \varPi } (P_T \cdot P_I^\mathsf{T}) \quad \veebar \ \mathcal {D}. \end{aligned}$$
(2)

where \(\vee \) and \(\veebar \) are respectively the element-wise logical or and xor operators. Note that some 1-bits in \(\mathcal {D}\) may not be covered by any pattern in \(\varPi \) (false negatives).

Indeed, our formulation of noise (matrix \(\mathcal{N}\)) models both false positives and false negatives. If a cell \(\mathcal {D} (i,j)\) corresponds to either a false positive or a false negative, we have \(\mathcal{N}(i,j)=1\).
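In numpy terms, the noise matrix of Eq. 2 can be computed as in the following sketch (assuming, as above, that each pattern is a pair of 0/1 integer vectors and that \(\mathcal{D}\) is a 0/1 integer matrix):

```python
import numpy as np

def noise_matrix(patterns, D):
    """Eq. 2: N = (element-wise OR of all pattern rectangles) XOR D.
    patterns: list of (P_I, P_T) pairs of 0/1 integer vectors; D: 0/1 integer N x M matrix."""
    coverage = np.zeros_like(D)
    for P_I, P_T in patterns:
        coverage |= np.outer(P_T, P_I)   # logical OR of the hyper-rectangles
    return coverage ^ D                  # XOR with D: 1-cells are false positives or false negatives
```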

We define the top-k approximate pattern discovery problem as an optimization one, where the goal is to minimize a given cost function \(J(\varPi _k, \mathcal {D})\):

$$\begin{aligned} \varPi _k \ =\ \mathop {\mathrm {argmin}}\limits _{\varPi \,:\, |\varPi | \le k} \ J(\varPi , \mathcal {D}) \end{aligned}$$
(3)

A general formulation of the cost function J is the following:

$$\begin{aligned} J(\varPi _k, \mathcal {D})\ =\ \gamma _\mathcal{N}(\mathcal{N}) \ +\ \rho \cdot \sum _{P \in \varPi _k} \gamma _P(P) \end{aligned}$$
(4)

where \(\mathcal{N}\) is the noise matrix defined by Eq. 2, and \(\gamma _\mathcal{N}\) and \(\gamma _P\) are user-defined functions measuring the cost of encoding the noise and the pattern descriptions, respectively. The constant \(\rho \ge 0\) works as a regularization factor weighting the relative importance of the patterns cost. It is worth noting that the cost J grows with both the complexity of the pattern set and the amount of noise.
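To make the formulation concrete, here is a hedged Python sketch of the generalized cost of Eq. 4, reusing the `noise_matrix` helper sketched above; `gamma_N`, `gamma_P`, and `rho` are the user-supplied ingredients of each specialization in Table 1:

```python
import numpy as np

def generalized_cost(patterns, D, gamma_N, gamma_P, rho):
    """Eq. 4: J(patterns, D) = gamma_N(N) + rho * sum of gamma_P over the patterns."""
    N = noise_matrix(patterns, D)   # noise induced by the pattern set (Eq. 2)
    return gamma_N(N) + rho * sum(gamma_P(P) for P in patterns)

# Example instantiation: the cost J_A of Asso only counts the noisy cells (L1 norm of N, rho = 0):
# J_A = generalized_cost(patterns, D, gamma_N=np.sum, gamma_P=lambda P: 0, rho=0.0)
```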

The various algorithms for top-k pattern mining greedily optimize a specialization of the function in Eq. 4. In addition, they exploit some specific parameters whose purpose is to subject the pattern set \(\varPi _k\) to particular constraints, with the aim of (1) reducing the algorithm search space, or (2) preventing the greedy generation of patterns from converging to poor local minima. An example of the former type of parameter is the minimum frequency of a pattern; an example of the latter is the amount of false positives tolerated in each pattern. Table 1 summarizes the specializations of the generalized cost function.

Table 1. Objective functions for Top-k Pattern Discovery Problem.

In the following, we briefly discuss some state-of-the-art algorithms for top-k approximate pattern mining, in turn used to select a significant set of features modelling documents to be clustered by FIHC.

Asso  [9] is a greedy algorithm that minimizes function \(J_{A}\) in Table 1, which only measures the amount of noise in describing the input data matrix \(\mathcal {D}\). This noise, namely \(\gamma _\mathcal{N}(\mathcal{N})=\Vert \mathcal{N}\Vert \), is measured by the \(L^{1}\)-norm (or Hamming norm) \(\Vert \mathcal{N}\Vert \), which simply counts the number of 1-bits in matrix \(\mathcal{N}\). Indeed, Asso aims at finding a solution to the Boolean matrix decomposition problem, thus identifying two low-dimensional binary factor matrices of rank k whose Boolean product approximates \(\mathcal {D}\). The authors of Asso called this matrix decomposition problem the Discrete Basis Problem (DBP). It can be shown that the DBP is equivalent to the approximate top-k pattern mining problem when optimizing \(J_{A}\). Asso works as follows. First, it creates a set of candidate itemsets by extending each item with every other item having correlation greater than a given parameter \(\tau \). Then Asso iteratively selects a pattern from the candidate set by greedily minimizing \(J_{A}\).

Hyper+ [13] is a two-phase algorithm aiming at minimizing function \(J_{H}\) in Table 1, which only considers the pattern set complexity. Specifically, the complexity of each pattern \(P \in \varPi _k\) is measured by \(\gamma _P(P)=\Vert P_{T}\Vert +\Vert P_{I}\Vert \). In the first phase, the algorithm covers all the item occurrences in \(\mathcal {D}\), with neither false negatives nor false positives, and thus without any noise. The rationale is to promote the simplest description of the whole input data \(\mathcal {D}\), without any constraint on the number k of patterns. For this first phase Hyper+ uses a collection of frequent itemsets, for a given minimum support parameter \(\sigma \). In the second phase, pairs of previously extracted patterns are recursively merged as long as the resulting collection of approximate patterns does not generate an amount of false positive occurrences larger than a given budget \(\beta \). Finally, since the pattern set produced by Hyper+ is ordered (from most to least important), we can simply select \(\varPi _k\) as the top-listed k patterns, as done by the algorithm’s authors in Sect. 7.4 of [13]. Note that this also introduces false negatives, corresponding to all the occurrences \(\mathcal {D} (i,j)=1\) in the dataset that remain uncovered after selecting only the top-k patterns.

Finally, we considered \(\textsc {PaNDa}^+\), a pattern mining framework [8] that can be plugged in with all the cost functions in Table 1, including the last three, \(J_{P}\), \(J_{P}^{\overline{\rho }}\), and \(J_{E}\), which fully leverage the trade-off between pattern description cost and noise cost. In particular, \(J_{E}\), originally proposed in [10], realizes the MDL principle [11]. The regularities in \(\mathcal {D}\), corresponding to the discovered approximate patterns \(\varPi _k\), are used to losslessly compress the whole \(\mathcal {D}\), expressed as pattern model plus noise, as in Eq. 2. Hence, the best pattern set \(\varPi _k\) is the one that induces the smallest encoding of \(\mathcal {D}\), namely the one minimizing \(J_{E}\). \(\textsc {PaNDa}^+\) adopts a greedy strategy and exploits a two-stage heuristic to iteratively select each pattern: (a) discover a noise-less pattern that covers the yet uncovered 1-bits of \(\mathcal {D}\), and (b) extend it to form a good approximate pattern, thus allowing some false positives to occur within the pattern. Finally, in order to prevent the greedy search strategy from accepting excessively noisy patterns, \(\textsc {PaNDa}^+\) supports two maximum noise thresholds \(\epsilon _r, \epsilon _c \in [0,1]\), inspired by [2], aimed at bounding the maximum amount of noise along the rows and columns of each pattern.
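The greedy scheme shared by these algorithms can be summarized by the following skeleton (a simplified, hedged sketch: candidate generation, pattern extension, and the noise constraints are algorithm-specific and are abstracted away behind the hypothetical `generate_candidates` callback):

```python
def greedy_topk(D, k, cost, generate_candidates):
    """At each iteration, add the candidate pattern that most decreases the cost J;
    stop after k patterns or when no candidate improves the current cost."""
    patterns = []
    best_cost = cost(patterns, D)
    while len(patterns) < k:
        candidates = generate_candidates(patterns, D)   # algorithm-specific step
        if not candidates:
            break
        costs = [(cost(patterns + [c], D), i) for i, c in enumerate(candidates)]
        new_cost, best_i = min(costs)
        if new_cost >= best_cost:                       # no improvement: stop early
            break
        patterns.append(candidates[best_i])
        best_cost = new_cost
    return patterns
```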

3 Frequent Itemset-Based Hierarchical Clustering

In this section we discuss the FIHC framework [3] for document clustering. FIHC implements three steps:

  1. frequent itemsets are mined and transformed into an initial grouping of transactions;

  2. these groups are refined to produce a partitional clustering;

  3. these clusters are recursively merged until the desired number of clusters is obtained.

In the first step, FIHC transforms the input corpus of documents into a binary representation suitable for frequent pattern mining algorithms, where the vector dimensions state the presence/absence of a vocabulary term in a document. According to the framework of Sect. 2, each mined frequent (closed) pattern \(P \in \widehat{\varPi }_k^{\sigma _k}\), \(P = \langle P_I, P_T\rangle \), trivially identifies a group of documents – i.e., the dataset documents corresponding to \(P_T\) that support the word-set identified by \(P_I\). Since a transaction may support several frequent itemsets, by construction these document groups may overlap. We focus on closed frequent itemsets [5, 14] as they are a succinct representation of all the frequent itemsets and avoid redundancies by definition: they are maximal with respect to the set of supporting transactions, and therefore no two closed itemsets are supported by the same set of transactions.

We call candidate seed clusters the resulting groups of transactions supporting the various frequent (closed) itemsets extracted; they are used in the subsequent step of FIHC.

In the second step, FIHC enforces a partitional clustering, where each transaction is assigned to only one of the candidate clusters. To this end, FIHC uses a function \({\textsf {Score}(\cdot )}\), which measures how well a given transaction fits within a candidate cluster. Each transaction is then assigned to the best fitting cluster. The \({\textsf {Score}(\cdot )}\) function is defined in terms of the items’ global frequency and local frequency.

Given an item \(x \in \mathcal{I}\), the global frequency \(\phi _{\mathcal {D}}(x)\) is the ratio \(supp_\mathcal{D}(x)/|\mathcal{D}|\), computed over all the transactions in the input dataset \(\mathcal{D}\) supporting item x, while the local frequency \(\phi _\mathcal{C}(x)\) is the ratio \(supp_\mathcal{C}(x)/|\mathcal{C}|\), restricted to the transactions associated with the cluster \(\mathcal{C}\) under consideration.

On the basis of the extracted pattern set \(\widehat{\varPi }_k^{\sigma _k}\), we first prune from \(\mathcal{I}\) the infrequent items, thus obtaining \(\mathcal{I}' = \{x \in \mathcal{I} \ \mid \ \phi _{\mathcal {D}}(x) \ge {\sigma _k} \}\).

Besides the minimum global frequency threshold \(\sigma _k\), the same used to extract the top-k frequent (closed) patterns, FIHC also defines a minimum local frequency threshold \(\sigma _{loc}\), used to identify the set of locally frequent items of cluster \(\mathcal C\), defined by \(LF_{\mathcal{C}} = \{x \in \mathcal{I}' \ \mid \ \phi _{\mathcal{C}}(x) \ge \sigma _{loc} \}\).

It is worth recalling that, in order to compute \(\phi _{\mathcal{C}}(x)\) and \(\phi _{\mathcal{D}}(x)\), in this phase we ignore possible multiple occurrences of term x in each transaction/document. FIHC, however, combines this concept of global/local frequency with a typical word weighting scheme, where the term frequency is taken into account. Given a transaction t, which simply represents the presence/absence of the various words in a document, the algorithm builds an associated vector \(\overrightarrow{\omega }_{t}\), where \(\omega _{t}(x)\) weighs the importance of term x in the original document, measured by the usual \(TF\cdot IDF\) statistics. The matching of a transaction t to a cluster \(\mathcal C\) is thus defined as a function of this weight vector, considering only the items in the pruned set \(\mathcal{I}'\):

$$\begin{aligned} \mathsf{Score}(\mathcal{C} \leftarrow \overrightarrow{\omega }_{t}) = \sum _{x \in \mathcal{I}', x \in t, x \in LF_\mathcal{C}} \omega _{t}(x) \cdot \phi _\mathcal{C}(x) - \sum _{x \in \mathcal{I}', x \in t, x \not \in {LF}_\mathcal{C}} \omega _{t}(x) \cdot \phi _{\mathcal {D}}(x) \end{aligned}$$
(5)

where the first term of the function rewards cluster \(\mathcal C\) for every word x of t that is locally frequent in \(\mathcal C\), whereas the second term penalizes the same cluster for all the items of t that are not locally frequent. The latter term encapsulates the concept of dissimilarity into the score.

Intuitively, a cluster \(\mathcal C\) is good for t if there are relatively many items in t that appear in many other transactions assigned to \(\mathcal C\), and this happens when t is similar to these transactions because they share many common frequent items. Finally, terms with larger \(TF\cdot IDF\) values have a larger impact.
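A hedged sketch of the scoring function of Eq. 5 is given below; the dictionaries and sets used here (TF-IDF weights, local/global frequencies, the locally frequent set \(LF_\mathcal{C}\), and the pruned item set \(\mathcal{I}'\)) are assumed to have been computed as described above:

```python
def score(t_items, omega_t, LF_C, phi_C, phi_D, I_pruned):
    """Eq. 5: reward items of t that are locally frequent in cluster C,
    penalize (via global frequency) the items of t that are not.
    t_items: set of items of transaction t; omega_t: item -> TF-IDF weight;
    LF_C: locally frequent items of C; phi_C / phi_D: local / global frequencies;
    I_pruned: the pruned item set I'."""
    reward = sum(omega_t[x] * phi_C[x] for x in t_items if x in I_pruned and x in LF_C)
    penalty = sum(omega_t[x] * phi_D[x] for x in t_items if x in I_pruned and x not in LF_C)
    return reward - penalty

# Each transaction t is assigned to the candidate cluster C maximizing score(...).
```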

According to this scoring function, each transaction in the dataset is associated with one and only one of the candidate clusters identified in the previous step. At the end of this second stage, a partitional clustering is thus attained, where only a few of the original candidate clusters survive by attracting other transactions.

During the third and last phase, FIHC merges similar pairs of clusters, using an ad-hoc similarity measure. Merging is performed recursively until the desired number of final clusters is reached. The inter-cluster similarity is defined on top of the above \({\textsf {Score}(\cdot )}\) function. Given two clusters \(\mathcal{C}_i\) and \(\mathcal{C}_j\), all the transactions in the latter are combined and matched against the former. Let \(\overrightarrow{\omega }_{\mathcal{C}_j}\) be this combined weight vector, obtained by summing up the weight vectors of all the transactions in \(\mathcal{C}_j\), i.e., \(\overrightarrow{\omega }_{\mathcal{C}_j} = \sum _{t \in \mathcal{C}_j} \overrightarrow{\omega }_{t}\). Thus, the following cluster similarity is defined:

$$\begin{aligned} \textsf {Sim}(\mathcal{C}_i \leftarrow \mathcal{C}_j) = \frac{\textsf {Score}\left( \mathcal{C}_i \leftarrow \overrightarrow{\omega }_{\mathcal{C}_j} \right) }{\varOmega } +1 \end{aligned}$$
(6)

where \(\varOmega \) is a normalization factor. The similarity computed in Eq. 6 is asymmetric and normalized between 0 and 2. It is finally made symmetric by taking the geometric mean of \(\textsf {Sim}(\mathcal{C}_i \leftarrow \mathcal{C}_j)\) and \(\textsf {Sim}(\mathcal{C}_j \leftarrow \mathcal{C}_i)\):

$$\begin{aligned} \mathsf{Inter\_Sim}(\mathcal{C}_i \leftrightarrow \mathcal{C}_j) = \sqrt{\mathsf{Sim}(\mathcal{C}_i \leftarrow \mathcal{C}_j) \cdot \mathsf{Sim}(\mathcal{C}_j \leftarrow \mathcal{C}_i)} \end{aligned}$$
(7)

At each step of the recursive merging, the two most similar clusters \(\mathcal{C}_i\) and \(\mathcal{C}_j\) are replaced by a new cluster \(\mathcal{C}_{ij} = \mathcal{C}_i \cup \mathcal{C}_j\).
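The merging criterion of Eqs. 6 and 7 can be sketched as follows (hedged: `score_fn` stands for a scoring function analogous to Eq. 5 applied to a combined weight vector, `cluster_weights` maps each cluster to its summed TF-IDF weight vector, and the normalization factor \(\varOmega\) is left abstract):

```python
import math

def sim(C_i, C_j, cluster_weights, Omega, score_fn):
    """Eq. 6: asymmetric similarity of C_j towards C_i, normalized between 0 and 2."""
    return score_fn(C_i, cluster_weights[C_j]) / Omega + 1

def inter_sim(C_i, C_j, cluster_weights, Omega, score_fn):
    """Eq. 7: symmetric inter-cluster similarity, the geometric mean of the two directions."""
    return math.sqrt(sim(C_i, C_j, cluster_weights, Omega, score_fn) *
                     sim(C_j, C_i, cluster_weights, Omega, score_fn))

# At each merging step, the two clusters with the highest inter_sim are
# replaced by their union, until the desired number of clusters is reached.
```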

3.1 Exploiting Approximate Patterns in FIHC

The FIHC framework can be easily adapted to produce a clustering starting from the approximate patterns extracted by algorithms such as Asso, Hyper+, and \(\textsc {PaNDa}^+\). Indeed, they return patterns of the form \(P = \langle P_I, P_T\rangle \), where each pattern identifies not only a set of items (namely vector \(P_I\)), but also a set of related transactions (namely vector \(P_T\)). Therefore, these patterns are analogous to those returned by a frequent (or frequent closed) itemset mining algorithm. The only difference is that, due to noise, the itemset identified by \(P_I\) may be only approximately supported by the set of transactions corresponding to \(P_T\).

In order to apply the score function in Eq. 5, we need to redefine the concept of globally frequent item, since neither \(\textsc {PaNDa}^+\) nor Asso uses a frequency threshold to extract the patterns.

The original FIHC uses the minimum support threshold \(\sigma _k\), which works as a sort of a priori filter, by limiting the selected features to only the frequent items and disregarding the infrequent ones. Conversely, the patterns returned by \(\textsc {PaNDa}^+\) or Asso are not derived on the basis of any frequency threshold, although both algorithms extract significant patterns, and the single items occurring in each approximate pattern are likely to be well supported in the dataset.

In analogy with frequent itemsets, where the single items that occur in any pattern are globally frequent by definition, we consider all the items occurring in the various approximate patterns \(P \in \overline{\varPi }_k\) as the ones to be included in the pruned set of items \(\mathcal{I}'\), \(\mathcal{I}' \subseteq \mathcal{I}\). Thus, in order to allow FIHC to exploit an approximate pattern set, we replace the concept of global frequency of an item with the occurrence of that item in the pattern set, as illustrated in the sketch below.
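A minimal sketch of this replacement (assuming each approximate pattern is the usual pair of 0/1 indicator vectors) is:

```python
import numpy as np

def pruned_items_from_patterns(patterns):
    """Build I' for approximate patterns: an item is kept iff it occurs in at least one
    of the top-k approximate patterns (replacing the global-frequency filter used
    with exact frequent itemsets)."""
    I_pruned = set()
    for P_I, _P_T in patterns:                # P_I: 0/1 indicator vector over the M items
        I_pruned |= set(np.flatnonzero(P_I))  # indices of the items occurring in the pattern
    return I_pruned
```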

We argue that a high quality pattern set should boost the quality of the clustering generated by FIHC. This is confirmed by our experimental results, where the clustering quality obtained using top-k approximate patterns is better than that obtained using exact frequent (closed) patterns.

4 Experimental Evaluation of Approximate Patterns

We compared the quality of the approximate patterns extracted by \(\textsc {PaNDa}^+\), Asso, and Hyper+ by using as a proxy the quality of the clustering obtained by FIHC, which in turn uses such top-k patterns as described in Sect. 3. We ran our experiments on four categorized text collections (R52 and R8 of Reuters 21578, WebKB, and Classic-4). The main characteristics of the datasets used are reported in Table 2. As expected, these datasets have a very large vocabulary, with up to 19,241 distinct terms/items. The binary representation of these datasets, after class label removal, was used to extract patterns. The number L of class labels varies from 4 to 52.

Table 2. Datasets.
Table 3. Pattern-based clustering evaluation. Best results are highlighted in boldface.
Table 4. Pattern-based clustering evaluation. Best results are highlighted in boldface.

During the cluster generation step, the usual \(TF\cdot IDF\) scoring was adopted to instantiate \(\overrightarrow{\omega }_{t}\), and \(\sigma _{loc}=0.25\) was used. We forced FIHC to produce a number of clusters equal to L. Even if the goal of this work is to evaluate different solutions for pattern-based clustering, we also report as a reference the results obtained with the K-Means clustering algorithm, setting its parameter K equal to the number L of classes in the datasets and using cosine similarity to compare documents. This baseline is used only as a sanity check on the quality of the generated clusterings. All the pattern-based algorithms evaluated perform better than K-Means.

The quality of the clusters generated by each algorithm was evaluated with 5 different measures: the Jaccard index, the Rand index, the Fowlkes and Mallows index (F-M), the Conditional Entropy (the conditional entropy \(H_{K}\) of the class variable given the cluster variable), and the average F-measure (denoted \(F_1\)) [4]. For each measure, the higher the better, except for the conditional entropy \(H_{K}\), where the opposite holds. These quality measures reflect the matching of the generated clusters with respect to the true classification of the documents.

Tables 3 and 4 report the results of the experiments conducted on the four text categorization collections.

In order to evaluate the benefit of approximate patterns over exact frequent patterns, we also investigated the clustering quality obtained by FIHC with the 50 and 100 most frequent closed itemsets. As shown in Table 3, closed patterns provide a good improvement over K-Means. The best \(F_1\) is achieved when 100 patterns are extracted, with an improvement of 112.5% over K-Means, and similarly for all other measures. This validates the hypothesis of pattern-based text clustering algorithms, according to which frequent patterns provide a better feature space than raw terms.

For all the approximate pattern mining algorithms, we evaluated the clusters generated by feeding FIHC with L, 50, or 100 patterns.

The Asso algorithm has a minimum correlation parameter \(\tau \) which determines the initial candidate pattern set. We report results for \(\tau =0.6\), for which we observed the best average results after fine-tuning in the range [0.5, 1.0]. We always tested the best performing variant of the algorithm, named Asso  + iter in the original paper. Unfortunately, we were not able to include all Asso results, since this algorithm was not able to process all four datasets (we stopped the execution after 15 h). We highlight that Asso is however able to provide good performance on the datasets with a limited number of classes. The results on the other datasets are not of as high quality as those obtained by \(\textsc {PaNDa}^+\).

To get the best performance out of Hyper+, we used a minimum support threshold of \(\sigma =10\,\%\) and we fine-tuned its \(\beta \) parameter on every single dataset by choosing the best \(\beta \) in the set \(\{1\,\%,10\,\%\}\). The results obtained with only L patterns are poorer than the K-Means baseline, and 50 Hyper+ patterns do not improve over the 50 most frequent closed itemsets. However, some improvement is visible with 100 Hyper+ patterns. Both the \(F_1\) and the Rand index exhibit some improvement over closed itemsets, with improvements over K-Means of 113.6% and 125.9%, respectively.

Finally, we report the quality of \(\textsc {PaNDa}^+\) patterns in Table 4. We tested several settings for \(\textsc {PaNDa}^+\), and we achieved the best results with the \(J_P\) cost function, varying the noise tolerance \(\epsilon =\epsilon _r=\epsilon _c\). For the sake of space, we report only results for \(\epsilon \in \{0.75, 1.0\}\). Even in this case, L patterns are insufficient to achieve results at least as good as K-Means, and 50 patterns provide results similar to the other algorithms tested. The best results are observed with the top-100 patterns extracted. In this case, \(\textsc {PaNDa}^+\) patterns are significantly better, achieving improvements over the K-Means baseline in terms of \(F_1\) and Rand index of 121.1% and 140.9%, respectively. In fact, \(\textsc {PaNDa}^+\) patterns with \(\epsilon =0.75\) provide a better clustering according to all of the measures adopted. We thus highlight that imposing noise constraints \(\epsilon <1\) generally provides better patterns.

Tables 3 and 4 also report the average length and support of the patterns extracted by the various algorithms (see the last two columns). As expected, the most frequent closed itemsets are also very short, with at most 2.4 items on average. Hyper+ is better able to group related items together, mining slightly longer patterns, up to an average length of 5.4 for the WebKB dataset. Unlike all other algorithms, \(\textsc {PaNDa}^+\) provides much larger patterns, e.g., of length 14.19 for WebKB in the best setting. We conclude that \(\textsc {PaNDa}^+\) is more effective in detecting item correlations, even in the presence of noise, thus providing longer and more relevant patterns that are successfully exploited in the clustering step.

5 Conclusion

This paper analyzes the performance of approximate binary patterns for supporting the clustering of high-dimensional text data within the FIHC framework. The results of reproducible experiments conducted on publicly available datasets show that the FIHC algorithm fed with approximate patterns outperforms the same algorithm using exact closed frequent patterns. Moreover, we show that the approximate patterns extracted by \(\textsc {PaNDa}^+\) perform better than those of other state-of-the-art algorithms in detecting, even in the presence of noise, correlations among items/words, thus providing more relevant knowledge to exploit in the subsequent FIHC clustering phase. From our tests, one explanation is the higher quality of the patterns extracted by \(\textsc {PaNDa}^+\), which are longer than the ones mined by the other methods. These patterns are fundamental for FIHC, which exploits them for the initial document clustering, which is then refined in the following steps of the algorithm.