1 Introduction

The dramatic rise in the use of the web and the improvement in communications in general have transformed our society into one that strongly depends on information [4, 5]. Vast amounts of text documents are available in various fields, and the volume of documents accumulating daily in web databases is astonishing. This accumulation of available text documents has raised new challenges for information retrieval (IR) technology. Therefore, in order to facilitate the knowledge management process, various approaches and techniques applied to text classification (categorization) and text clustering are being compared and studied. In short, it is essential to manage unstructured and unorganized documents by labeling them automatically.

This paper proposes a novel approach to manage English text documents by clustering and labeling them automatically. Clustering is a frequently performed task and technique in machine learning, data mining, pattern recognition, image analysis and bioinformatics. It can be applied to various types of tasks related to improving the quality of the structure and usage of large, high-dimensional data. It is an unsupervised learning method that performs a descriptive task. Many potential applications and advantages accrue from being able to reliably and automatically cluster, categorize and label corpora of documents. Many document clustering works are based on supervised or unsupervised learning techniques that label particular web documents as belonging to a specific category, or group similar documents together into clusters.

In this paper, an unsupervised learning technique is used to implement the proposed algorithm, in which hierarchical clustering is applied to unlabeled documents. Hierarchical clustering is chosen instead of partitional clustering (K-means) [3] because it forms nested clusters, which is more suitable for this research. The resulting taxonomy (tree) shows the sub-clusters of each parent cluster, so the aggregation of concepts can be illustrated [9]. The two types of hierarchical clustering are agglomerative (bottom-up) and divisive (top-down). Hierarchical agglomerative clustering is applied in this research since the concepts obtained from sub-clusters can be aggregated [1]. Inter-cluster distance measures such as single linkage and complete linkage are also compared in order to study how the resulting clusters depend on the distance method used.

This paper is organized as follows. Section 2 introduces related work. Section 3 describes the concept labeling of document clusters using the hierarchical agglomerative clustering technique, together with the details of our approach. Section 4 presents the experimental design and the experimental results, evaluating the structures of hierarchical agglomerative clustering results produced by different types of inter-cluster distance measurements, namely Single, Complete and Average link. The paper concludes with future work in Sect. 5.

2 Related Works

Earlier works include a comparison of the Chi-Square (χ2), most frequent words, most predictive words, and combined most frequent and predictive words methods [6]. In this comparison, Popescul and Ungar found that the most frequent and predictive words method produced the best labels, capturing words that both occur frequently in a cluster and effectively discriminate the given cluster from the other clusters. Unfortunately, none of the methods gave uniformly satisfactory results when the disciplines corresponding to clusters were very diverse in the vocabulary used and encompassed very broad topics.

In other work, Lamirel et al. present a new approach that combines original hypertree construction techniques for visualizing multidimensional clustering results with novel cluster labeling techniques based on cluster content evaluation criteria, such as the F-measure on cluster properties [7]. Treeratpituk and Callan proposed a simple linear model that considers the structure of the hierarchy when automatically assigning labels to document clusters in a hierarchy [8]. They conducted a study showing the effectiveness of different statistical features in selecting cluster labels, and showed that such a simple model is likely to tolerate the type of noise in the cluster hierarchy that is normally generated by clustering algorithms.

Newman et al. [9] explored visualizations of document collections, which they call topic maps. In their work, they found that while topics are a useful way to organize an entire collection, a static global topic map of the collection may have limited value for exploring it; local topic maps may ultimately be more useful for understanding and navigating local structure in a collection.

Magatti et al. addressed the task of labeling topics induced by a hierarchical topic model [10]. Their label candidate set is the Google Directory (gDir) hierarchy, and label selection takes the form of ontological alignment with gDir. The disadvantages of this method are that it is only applicable to a hierarchical topic model and that it crucially relies on a pre-existing ontology and the class labels contained therein.

Automatic labeling of document clusters can also be associated with the problem of topic extraction. Topic extraction is attracting a great deal of attention due to its wide applicability: extraction of scientific research topics, author-topic analysis, opinion extraction and information retrieval. Several probabilistic models have proved effective at discovering topics [13].

In this paper, we propose a framework for concept aggregation based on hierarchical agglomerative clustering, and we explore and describe the interplay between inter-cluster distance methods and the aggregated concepts extracted from the hierarchical concept model.

3 Concept Labeling of Document Clusters

The document labeling problem can be tackled in two different ways: human labeling and computer labeling. The former maps a document into a set of pre-specified categories; usually such categories form a taxonomy or a topic hierarchy. The latter has recently emerged as well suited, i.e. efficient and effective, in several settings. While human labeling usually benefits from the availability of a domain-specific topic hierarchy agreed upon by experts, it is extremely time consuming, and in some situations a universally agreed labeling cannot be achieved. In contrast, computer labeling is economically attractive, although the achieved labeling must be carefully checked by domain experts to ensure that it is consistent. Furthermore, effective methods for automatically building a hierarchy of topics have recently been proposed in the specialized literature [17]. The approach we propose offers an efficient way to extract concepts automatically from document clusters. We apply the hierarchical agglomerative clustering technique to group documents. Hierarchical clustering was chosen because it provides data views at different levels of abstraction, making it ideal for visualizing the generated concepts and interactively exploring large document collections.

3.1 Clustering Parallel Corpora

In this experiment, we use the vector space model [12], in which a document is represented as a vector in n-dimensional space (where n is the number of distinct words in the collection). Documents are characterized by the words they contain and their frequencies. Before obtaining the weights for all the terms extracted from these documents, stemming and stopword removal are performed. Stopword removal eliminates unwanted terms (e.g., those from the closed vocabulary) and thus reduces the number of dimensions in the term space. Once these two steps are completed, the frequency of each term across the corpus is counted and weighted using term frequency-inverse document frequency (tf-idf) [12], as described in Eq. (21.1).

$$ \mathit{tf\text{-}idf}(t,d) = \mathit{tf}(t,d) \cdot \mathit{idf}(t) $$
(21.1)
$$ \mathit{idf}(t) = \log \left( \frac{|D|}{\mathit{df}(t)} \right) $$
(21.2)
$$ \mathrm{sim}\left( d_{i}, d_{j} \right) = \frac{d_{i} \cdot d_{j}}{\left\| d_{i} \right\| \, \left\| d_{j} \right\|} $$
(21.3)

Weights are assigned to indicate the importance of a word in characterizing a document as distinct from the rest of the corpus. In summary, each document is viewed as a vector whose dimensions correspond to the words or terms extracted from the document, and whose component magnitudes are the tf-idf weights of those terms. In this model, tf-idf, as described in Eq. (21.1), is the product of the term frequency tf(t,d), the number of times term t occurs in document d, and the inverse document frequency, Eq. (21.2), where |D| is the number of documents in the complete collection and df(t) is the number of documents in which term t occurs at least once. To account for documents of different lengths, each document vector is normalized to unit length [13].
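
As an illustration, the following minimal sketch shows how such a weighted document representation could be built. It assumes scikit-learn, which this paper does not name as its toolkit; note that scikit-learn's idf is a smoothed variant of Eq. (21.2), and stemming is omitted for brevity.

```python
# Hypothetical sketch of the tf-idf document representation (Sect. 3.1),
# assuming scikit-learn; the paper does not specify an implementation.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Stock markets rallied after the trade announcement.",
    "The film premiere drew large crowds of fans.",
    "Parliament debated the new education bill.",
]

# stop_words="english" performs the stopword removal described above;
# norm="l2" normalizes each document vector to unit length.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
X = vectorizer.fit_transform(docs)   # sparse documents-by-terms matrix

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```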

3.2 Hierarchical Agglomerative Clustering

In this work, we concentrate on hierarchical agglomerative clustering. Unlike divisive algorithms, which build the hierarchical solution from top to bottom by repeatedly splitting existing clusters, agglomerative algorithms build the solution by initially assigning each document to its own cluster and then repeatedly selecting and merging pairs of clusters until a single all-inclusive cluster is obtained, generating the cluster tree from leaves to root [11]. The main parameters of agglomerative algorithms are the metric used to compute the similarity of documents and the method used to determine the pair of clusters to merge at each step.

In these experiments, the cosine similarity, Eq. (21.3), is used to compute the similarity between two documents \(d_i\) and \(d_j\). This widely used document similarity measure equals one if the documents are identical and zero if they share no words. The two clusters to merge at each step are found using either the single-link, complete-link or average-link method: the clusters merged are those with the greatest minimum (single link), maximum (complete link) or average (average link) similarity between the documents in one cluster and those in the other [15, 16].
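
Continuing the earlier sketch, the clustering step could be realized with SciPy's hierarchy module (an assumption; the paper does not name an implementation):

```python
# Hedged sketch of agglomerative clustering over the tf-idf matrix X from
# the previous example, assuming SciPy; cosine distance is 1 - Eq. (21.3).
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

dense = X.toarray()
dists = pdist(dense, metric="cosine")   # condensed pairwise distance vector

# Build the dendrogram under each inter-cluster distance method compared
# in this paper, then cut it into a fixed number of flat clusters.
for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")   # k = 2 for this toy corpus
    print(method, labels)
```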

3.3 Extracting Concepts of Document Clusters

Concepts that characterize the English documents are generated from the hierarchical agglomerative clustering results. This is done by selecting, for each cluster, the terms with the largest weights, since a large weight indicates the importance of a word in characterizing a document as distinct from the rest of the corpus.
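
A minimal sketch of this step, under the same assumptions as the earlier examples, might average the tf-idf weights within each cluster and keep the highest-weighted terms:

```python
# Hypothetical concept extraction: rank terms by their mean tf-idf weight
# within each cluster and keep the top five as the cluster's concept label.
import numpy as np

terms = vectorizer.get_feature_names_out()
for k in np.unique(labels):
    centroid = dense[labels == k].mean(axis=0)   # mean weight per term
    top = np.argsort(centroid)[::-1][:5]         # indices of top-5 terms
    print(f"cluster {k}:", [terms[i] for i in top])
```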

4 Experimental Design and Evaluations

The experiment is designed to investigate and compare the effectiveness of clustering English text documents using three different types of inter-cluster distance measurement, namely minimum (Single link), maximum (Complete link) and average (Average link). For each type of inter-cluster distance, the clustering result with the lowest DBI value is used for the extraction of concepts that characterize the document clusters. Five categories of English news were collected from the Star Online (Malaysian local online news, thestar.com.my) in 2010. The details of the text documents used in this experiment are shown in Table 21.1.

Table 21.1 Five categories of English news extracted from the Star Online news

The experiment consists of two main stages: (a) clustering English documents using a hierarchical agglomerative clustering technique, and (b) concept extraction.

4.1 Clustering English Documents Using a Hierarchical Agglomerative Clustering Technique

In the first stage, we cluster the English texts using hierarchical agglomerative clustering. We compute the similarities of pairs of clusters based on the three inter-cluster distance methods (Single, Complete and Average link). We evaluate the resulting tree-like cluster structures by an objective function applied to the k cluster centers; in our case, we consider the cluster dispersion.

$$ d_{centroid}\left( Q_{k} \right) = \frac{\sum\nolimits_{i} \left\| x_{i} - c_{k} \right\|}{N_{k}} $$
(21.4)
$$ c_{k} = \frac{1}{N_{k}} \sum\nolimits_{x_{i} \in Q_{k}} x_{i} $$
(21.5)
$$ d_{between}\left( Q_{k}, Q_{l} \right) = \left\| c_{k} - c_{l} \right\| $$
(21.6)
$$ DBI = \frac{1}{K} \sum\limits_{k = 1}^{K} \max_{l \neq k} \left\{ \frac{d_{centroid}(Q_{k}) + d_{centroid}(Q_{l})}{d_{between}(Q_{k}, Q_{l})} \right\} $$
(21.7)

Typically a cluster dispersion metric such as the Davies-Bouldin Index (DBI) [14] is used. DBI uses both the intra-cluster and inter-cluster distances to measure cluster quality. Let \(d_{centroid}(Q_k)\), defined in Eq. (21.4), denote the average distance within cluster \(Q_k\), where \(x_i \in Q_k\), \(N_k\) is the number of samples in cluster \(Q_k\), \(c_k\) is the center of the cluster, defined in Eq. (21.5), and \(k \le K\) clusters. Let \(d_{between}(Q_k, Q_l)\), defined in Eq. (21.6), denote the distance between clusters \(Q_k\) and \(Q_l\), where \(c_k\) is the centroid of cluster \(Q_k\) and \(c_l\) is the centroid of cluster \(Q_l\). In this study, we also cluster the text documents based on the minimum (Single link), maximum (Complete link) and average (Average link) distances between clusters. Given a partition of the N points into K clusters, DBI is then defined in Eq. (21.7). This cluster dispersion measure can be incorporated into any clustering algorithm to evaluate a particular segmentation of the data.
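
As an illustration, and assuming scikit-learn's implementation of the index together with the quantities from the earlier sketches, a comparison in the spirit of Table 21.2 could be run along these lines (for a corpus with substantially more documents than clusters):

```python
# Hedged sketch of the DBI-based comparison (Eq. (21.7)); lower is better.
# Assumes the dense tf-idf matrix and distance vector from earlier sketches
# and a corpus with many more documents than k.
from sklearn.metrics import davies_bouldin_score

for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)
    for k in (5, 10, 15):   # the cluster counts compared in Table 21.2
        labels = fcluster(Z, t=k, criterion="maxclust")
        print(method, k, round(davies_bouldin_score(dense, labels), 3))
```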

4.2 Concepts Extraction

Next, in the second stage, we compute for each cluster the weights of the terms (tf-idf) used in clustering the English documents. The five terms with the highest weights in each cluster are extracted as the concepts that characterize it. The extracted concepts are then compared with the actual concepts derived manually.

4.3 Experimental Results

Table 21.2 shows the comparison of DBI values for the clustering results using Single, Complete and Average link with different numbers of clusters. Based on the results obtained, the quality of the clusters produced is on average better when the Complete link is used as the inter-cluster distance. This is because under the Complete link method the distance between two clusters \(C_i\) and \(C_j\) is the largest distance between any two elements \(e_a \in C_i\) and \(e_b \in C_j\), so the final clusters tend to be compact and well separated, which produces lower DBI values. The results also show that the quality of the clusters produced is better for smaller numbers of clusters. Another important finding from this experiment is that when the number of words is high, the DBI values are also high, as shown in Table 21.2 for the Entertainment and Politics categories.

Table 21.2 Comparison of DBI values for the clustering results when using Single, Complete and Average links with different numbers of clusters, k = 5, 10 and 15

Table 21.3 shows the concepts extracted from the English clusters using the hierarchical clustering technique with k = 10 and the Complete link method. We compare the actual concepts derived manually with the extracted concepts by referring to the tree-like document representation and computing the top five terms based on the tf-idf weights. Based on the results shown in Table 21.3, the percentage of alignment between the actual concepts derived manually and the concepts extracted by the HAC technique is quite encouraging, with the exception of the Entertainment and General news categories. This is probably because the Entertainment and General news are very diverse in the vocabulary used and encompass very broad topics.

Table 21.3 Comparison of actual concepts and extracted concepts for all five news categories for k = 10

5 Conclusion

In this paper we have presented a framework that clusters English documents into a tree-like representation using the Hierarchical Agglomerative Clustering (HAC) technique and automatically extracts concepts that characterize each cluster. We have shown empirically that better clustering results are produced when the Complete link method is used to compute the inter-cluster distance for merging two clusters. Using the Complete link method, the concepts extracted from each cluster should be more relevant and applicable for labeling the English clusters, as shown in the experimental results. To improve the results obtained, future work includes comparing actual concepts with concepts extracted using other weighting schemes, and implementing a semi-supervised HAC technique to extract concepts from English document clusters more effectively.