
1 Introduction

A document clustering algorithm finds groups of documents that share a common pattern [15]. It is an unsupervised technique: clusters are discovered automatically in a collection without any user supervision. The main goal of clustering is to find meaningful groups so that analyzing the documents within each cluster is much easier than viewing the collection as a whole.

The Vector Space Model (VSM) represents a document as a vector over the T unique terms of a collection (a T-dimensional vector). Each term in the vector is associated with a weight (term weight) [6, 7]. The term weight is based on the frequency with which the term appears in that document. A term weighting scheme measures the importance of a term with respect to a document and a collection; a term with a higher weight is more important than a term with a lower weight. Each document can thus be located as a point in a T-dimensional Euclidean space, where T is the number of unique terms in the collection. With documents represented as points in this space, we can compare any two documents by measuring the distance between them. In the same way, a user-supplied query can be represented as a vector and mapped into the same space. To find a set of documents relevant to a query, we can retrieve the documents that are closest to the query in this space. A document ranking model computes the similarities between documents and a query: the more relevant a document is to the query, the higher its rank. VSM and term weighting schemes are widely used in many research areas such as document clustering, classification, information retrieval and document ranking.
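As an illustration of query-document matching in the VSM, the following is a minimal Python sketch (not from the original work; the term weights are made up) that compares two sparse term-weight vectors using cosine similarity, one common choice of closeness measure in the vector space.

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine similarity between two sparse term-weight vectors (dicts of term -> weight)."""
    shared = set(doc_vec) & set(query_vec)
    dot = sum(doc_vec[t] * query_vec[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy example with made-up term weights for one document and one query.
doc = {"cluster": 0.8, "document": 0.5, "retrieval": 0.2}
query = {"document": 1.0, "retrieval": 1.0}
print(cosine_similarity(doc, query))
```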

2 Cluster-Based Retrieval

Cluster-based retrieval uses the cluster hypothesis to retrieve a set of documents relevant to a query [8]. Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs [9]. If a document in a cluster is relevant to a query, then the rest of the documents in that cluster are potentially relevant to the same query. There are two approaches to cluster-based retrieval. The first approach retrieves one or more clusters relevant to a query instead of retrieving individual documents. In other words, this approach retrieves and ranks relevant clusters instead of relevant documents. Based on the cluster hypothesis, documents from highly ranked clusters are more relevant to a query than documents from clusters with lower ranking. The main motivation of this approach is higher efficiency and faster search.

The second approach uses the generated clusters as a reference for improving the list of retrieved relevant documents. In this approach, the document collection is clustered beforehand (static clustering). When a set of documents is retrieved for a query, the generated (static) clusters of the collection are used as a reference to update the retrieved relevant document list. The main goal of this approach is to improve precision-oriented searches.
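The first, cluster-ranking approach can be sketched as follows. This is an illustrative Python fragment rather than the implementation used here; it assumes each cluster is summarized by a centroid vector and that a similarity function, such as the cosine similarity sketched above, is supplied.

```python
def rank_clusters(clusters, query_vec, similarity):
    """Rank clusters by the similarity of their centroid to the query.

    clusters: list of (centroid_vec, [doc_ids]) pairs; similarity: e.g. cosine.
    """
    scored = [(similarity(centroid, query_vec), doc_ids)
              for centroid, doc_ids in clusters]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

def cluster_based_retrieval(clusters, query_vec, similarity, top_clusters=2):
    """Return documents drawn only from the highest-ranked clusters."""
    ranked = rank_clusters(clusters, query_vec, similarity)
    results = []
    for _, doc_ids in ranked[:top_clusters]:
        results.extend(doc_ids)
    return results
```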

3 Term Weighting Scheme

The Boolean retrieval model assigns 1 or 0 based on the presence or absence of a term in a document. This model performs poorly when querying for documents because it cannot rank the matching documents. Later, the VSM was introduced for ranked retrieval [10]. It is widely used in document querying, clustering, classification and other information retrieval applications because it is simple and easy to understand. It uses a bag-of-words approach: each document \( {\text{d}}_{\text{i}} \) in the collection ζ is represented as a vector of weighted terms [11, 12].

One of the most commonly used term weighting schemes, TF-IDF, assigns a weight to each term using the term frequency (tf) and the inverse document frequency (idf). The term frequency of \( {\text{term}}_{\text{t}} \) is the number of times \( {\text{term}}_{\text{t}} \) occurs in a document. The inverse document frequency of \( {\text{term}}_{\text{t}} \) is the logarithm of the total number of documents in the collection (N) divided by the number of documents that contain \( {\text{term}}_{\text{t}} \). The document vector \( {\text{d}}_{\text{i}} \) is then represented as:

$$ {\text{d}}_{\text{i}} = \left\{ {\text{term}}_{1} :{\text{tf}}_{{\text{i}}1} \cdot \log \frac{\text{N}}{{{\text{N}}_{1} }};\;{\text{term}}_{2} :{\text{tf}}_{{\text{i}}2} \cdot \log \frac{\text{N}}{{{\text{N}}_{2} }};\; \ldots ;\;{\text{term}}_{\text{T}} :{\text{tf}}_{\text{iT}} \cdot \log \frac{\text{N}}{{{\text{N}}_{\text{T}} }} \right\} $$

The term weight \( {\text{w}}_{\text{it}} \) determines whether the \( {\text{term}}_{\text{t}} \) will be included in the further steps. Only certain terms extracted from a document can be used for identifying and scoring a document within the collection. The term weighting schemes are used to assign weight to each term in a document. These term weights represent the importance of a term with respect to a collection. Document clustering uses these term weights to compare two documents for their similarity.
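For concreteness, the following small Python sketch (illustrative only, not from the original work) builds the TF-IDF document vectors of the equation above from tokenized documents, using \( {\text{w}}_{\text{it}} = {\text{tf}}_{\text{it}} \cdot \log ({\text{N}}/{\text{N}}_{\text{t}}) \).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for tokenized documents: w_it = tf_it * log(N / N_t)."""
    N = len(docs)
    doc_freq = Counter()          # N_t: number of documents containing term t
    for doc in docs:
        doc_freq.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return vectors

# Toy collection of three tokenized documents.
docs = [["cluster", "document", "cluster"],
        ["document", "retrieval"],
        ["cluster", "retrieval", "query"]]
print(tfidf_vectors(docs))
```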

Table 1 shows some of the commonly used term weighting schemes. Here, TF is the Term Frequency, IDF is the Inverse Document Frequency and ICF is the Inverse Cluster Frequency.

Table 1 Term weighting schemes
  • \( {\text{w}}_{\text{itj}} \) is the weight of the term \( {\text{term}}_{\text{t}} \) in the document \( {\text{d}}_{\text{i}} \) of the cluster \( {\text{C}}_{\text{j}} \).

  • \( {\text{tf}}_{\text{it}} = {\text{f}}_{\text{it}} \) is the term frequency of the term \( {\text{term}}_{\text{t}} \) in the document \( {\text{d}}_{\text{i}} \).

  • \( {\text{idf}}_{\text{t}} = \log \frac{\text{N}}{{{\text{N}}_{\text{t}} }} \) is the inverse document frequency for the term \( {\text{term}}_{\text{t}} \) in the collection

where N is the total number of documents in the collection and \( {\text{N}}_{\text{t}} \) is the number of documents that contain the term \( {\text{term}}_{\text{t}} \).

  • \( {\text{icf}}_{\text{t}} = \log \frac{\text{K}}{{{\text{K}}_{\text{t}} }} \) is the inverse cluster frequency of the term \( {\text{term}}_{\text{t}} \) in the collection ζ, where K is the total number of clusters in the collection and \( {\text{K}}_{\text{t}} \) is the number of clusters that contain the term \( {\text{term}}_{\text{t}} \) (see the sketch below).
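A minimal sketch of how \( {\text{idf}}_{\text{t}} \) and \( {\text{icf}}_{\text{t}} \) can be computed is given below; the function names are illustrative, and documents and clusters are assumed to be represented as token lists.

```python
import math

def inverse_document_frequency(docs, term):
    """idf_t = log(N / N_t), where N_t is the number of documents containing the term."""
    N = len(docs)
    N_t = sum(1 for doc in docs if term in doc)
    return math.log(N / N_t) if N_t else 0.0

def inverse_cluster_frequency(clusters, term):
    """icf_t = log(K / K_t), where K_t is the number of clusters containing the term."""
    K = len(clusters)
    K_t = sum(1 for cluster in clusters if any(term in doc for doc in cluster))
    return math.log(K / K_t) if K_t else 0.0
```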

We present a new term weighting method based on the traditional TF-IDF term weighting scheme. Our motivation is the idea that the terms of the documents in the same cluster have similar importance compared to the terms of the documents in a different cluster. We concentrate on the terms that are important within a cluster and treat the other terms as irrelevant or redundant. We realize this idea by giving more weight to terms that are common within a cluster but uncommon in other clusters.

In our new term weighting scheme, we used unsupervised partitional (K-means) clustering algorithms [2, 13, 14]. First, we ran the K-means algorithm with the four term weighting schemes, to show that the CBT term weighting scheme improves the quality of the clusters generated by a partitional clustering algorithm.

From our experiments, we found that the new cluster-based term weighting scheme gives better results than the other well-known term weighting schemes traditionally used, for both partitional and hierarchical clustering algorithms.

3.1 The Proposed Term Weighting Scheme

We introduce our new term weighting scheme. For the term \( {\text{term}}_{\text{t}} \), document \( {\text{d}}_{\text{i}} \) and cluster \( {\text{C}}_{\text{j}} \), CBT is given as:

$$ \begin{aligned} {\text{w}}_{\text{itj}} &= {\text{tf}}_{\text{it}} \cdot {\text{idf}}_{\text{t}} \cdot {\text{df}}_{\text{tj}} \cdot {\text{icf}}_{\text{t}} \\ &= {\text{f}}_{\text{it}} \cdot \log \frac{\text{N}}{{{\text{N}}_{\text{t}} }} \cdot \frac{{{\text{df}}_{\text{j}} }}{{|{\text{C}}_{\text{j}} |}} \cdot \log \frac{\text{K}}{{{\text{K}}_{\text{t}} }} \\ \end{aligned} $$

Here, \( {\text{df}}_{\text{tj}} = \frac{{{\text{df}}_{\text{j}} }}{{|{\text{C}}_{\text{j}} |}} \) is the document frequency of the term \( {\text{term}}_{\text{t}} \) within the cluster \( {\text{C}}_{\text{j}} \), where \( {\text{df}}_{\text{j}} \) is the number of documents in the cluster \( {\text{C}}_{\text{j}} \) that contain the term \( {\text{term}}_{\text{t}} \), and \( |{\text{C}}_{\text{j}} | \) is the total number of documents in the cluster \( {\text{C}}_{\text{j}} \).
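Putting the four factors together, a small Python sketch of the CBT weight follows. It is illustrative only: it recomputes the collection-level and cluster-level statistics on the fly and assumes documents are token lists.

```python
import math

def cbt_weight(term, doc, cluster, docs, clusters):
    """CBT weight: w_itj = tf_it * idf_t * (df_j / |C_j|) * icf_t.

    doc and the members of cluster are token lists; docs is the whole
    collection; clusters is a list of clusters (each a list of token lists).
    """
    tf_it = doc.count(term)                                   # term frequency in d_i
    N, N_t = len(docs), sum(1 for d in docs if term in d)     # collection-level counts
    K, K_t = len(clusters), sum(1 for c in clusters if any(term in d for d in c))
    df_j = sum(1 for d in cluster if term in d)               # docs in C_j containing the term
    idf_t = math.log(N / N_t) if N_t else 0.0
    icf_t = math.log(K / K_t) if K_t else 0.0
    df_tj = df_j / len(cluster) if cluster else 0.0
    return tf_it * idf_t * df_tj * icf_t
```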

Our new term weighting scheme has four components. The first two are the term weighting components discussed in [15]; the last two are the cluster components, as shown in Table 2. In other words, CBT assigns a weight to a term that is:

Table 2 List of components in CBT term weighting scheme
  • Highest when the term occurs frequently in the documents of a cluster and is uncommon in other clusters.

  • Higher when the term occurs less frequently in the documents of a cluster and is uncommon in other clusters.

  • Lower when the term occurs often in a few clusters.

  • Lowest when the term occurs in most of the documents in the collection.

4 K-Means Algorithm with CBT

Initially, the K-means algorithm has no information about the cluster components, so we start the algorithm by setting \( {\text{df}}_{\text{tj}} \) and \( {\text{icf}}_{\text{t}} \) to 1 and update the inter- and intra-cluster components on each iteration. If a document contains terms that do not belong to its current cluster, their weights are reduced, so the document tends to move to another cluster on a later iteration. This is repeated until the document settles in a suitable cluster. The pseudocode is given below, and a Python sketch of the same loop follows it.

  • Require: An integer K ≥ 1, Document Collection \( \upzeta \)

  • 1: if K = 1 then

  • 2:  return ζ

  • 3: else

  • 4:  Initialize l = 0

  • 5:  \( \{ {\text{C}}_{1}^{\left( 0 \right)} , \ldots ,{\text{C}}_{\text{K}}^{\left( 0 \right)} \} \leftarrow {\text{RANDOM}}\;{\text{CLUSTERS}}(\upzeta,{\text{K}}) \)

  • 6:  repeat

  • 7:   for all \( {\text{d}}_{\text{i}} \in \upzeta ,{\text{i}}:1 \ldots {\text{N}} \) do

  • 8:    \( {\text{m}} = \arg \min_{\text{j}} |{\text{c}}_{\text{j}} - {\text{d}}_{\text{i}} | \)

  • 9:    \( {\text{C}}_{\text{m}}^{{\left( {{\text{l}} + 1} \right)}} \leftarrow {\text{C}}_{\text{m}}^{{\left( {{\text{l}} + 1} \right)}} \cup \{ {\text{d}}_{\text{i}} \} \)

  • 10:   end for

  • 11:   l ← l + 1

  • 12:   \( {\text{w}}_{\text{itj}} = {\text{tf}}_{\text{it}} \cdot {\text{idf}}_{\text{t}} \cdot {\text{df}}_{\text{tj}} \cdot {\text{icf}}_{\text{t}} \) for each term \( {\text{term}}_{\text{t}} \) in a document \( {\text{d}}_{\text{i}} \) of a cluster \( {\text{C}}_{\text{j}}^{{({\text{l}})}} ,\;{\text{t}}:1 \ldots {\text{T}},\;{\text{i}}:1 \ldots {\text{N}},\;{\text{j}}:1 \ldots {\text{K}} \)

  • 13:  for j = 1 to K do

  • 14:   \( {\text{c}}_{\text{j}} \leftarrow \frac{1}{{|{\text{C}}_{\text{j}}^{{({\text{l}})}} |}}\mathop \sum \limits_{{{\text{d}}_{\text{i}} \in {\text{C}}_{\text{j}}^{{({\text{l}})}} }} {\text{d}}_{\text{i}} \)

  • 15:  end for

  • 16: until No change in the K centroids

  • 17: return \( \left\{ {{\text{C}}_{1}^{{\left( {\text{l}} \right)}} , \ldots ,{\text{C}}_{\text{K}}^{{\left( {\text{l}} \right)}} } \right\} \)

  • 18: end if
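A compact Python sketch of this loop is given below. It is a simplified illustration rather than the exact implementation: documents are token lists, distances are Euclidean over sparse weight dictionaries, convergence is checked on the assignments, and the first pass uses plain TF-IDF by fixing \( {\text{df}}_{\text{tj}} \) and \( {\text{icf}}_{\text{t}} \) to 1, as described above. All function names are illustrative.

```python
import math
import random
from collections import Counter, defaultdict

def weight_vectors(docs, assignment, K, use_cluster_components):
    """Re-weight every document with w_itj = tf * idf * df_tj * icf_t.

    docs: list of token lists; assignment: cluster index per document.
    On the first pass the cluster components are fixed to 1 (plain TF-IDF).
    """
    N = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    cluster_docs = defaultdict(list)                 # members of each cluster
    for i, j in enumerate(assignment):
        cluster_docs[j].append(docs[i])
    cluster_term_df = {j: Counter() for j in cluster_docs}
    for j, members in cluster_docs.items():          # per-cluster document frequencies
        for doc in members:
            cluster_term_df[j].update(set(doc))
    vectors = []
    for i, doc in enumerate(docs):
        j, tf, vec = assignment[i], Counter(doc), {}
        for t, f in tf.items():
            idf = math.log(N / doc_freq[t])
            if use_cluster_components:
                df_tj = cluster_term_df[j][t] / len(cluster_docs[j])
                K_t = sum(1 for c in cluster_term_df.values() if c[t] > 0)
                icf = math.log(K / K_t) if K_t else 0.0
            else:
                df_tj, icf = 1.0, 1.0                # initial pass: plain TF-IDF
            vec[t] = f * idf * df_tj * icf
        vectors.append(vec)
    return vectors

def euclidean(u, v):
    """Euclidean distance between two sparse weight vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def centroid(vectors):
    """Mean of a list of sparse weight vectors."""
    c, n = defaultdict(float), len(vectors)
    for v in vectors:
        for t, w in v.items():
            c[t] += w / n
    return dict(c)

def kmeans_cbt(docs, K, max_iter=20, seed=0):
    """K-means driven by CBT weights: re-weight the documents after every reassignment."""
    random.seed(seed)
    assignment = [random.randrange(K) for _ in docs]     # random initial clusters
    vectors = weight_vectors(docs, assignment, K, use_cluster_components=False)
    for _ in range(max_iter):
        centroids = []
        for jj in range(K):
            members = [vectors[i] for i, j in enumerate(assignment) if j == jj]
            centroids.append(centroid(members) if members else {})
        new_assignment = [min(range(K), key=lambda jj: euclidean(centroids[jj], v))
                          for v in vectors]
        if new_assignment == assignment:                 # centroids stable
            break
        assignment = new_assignment
        vectors = weight_vectors(docs, assignment, K, use_cluster_components=True)
    return assignment
```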

4.1 Data Collections for CBT

We used the TREC [16], 20 Newsgroups and Reuters-21578 [17] data collections for our experiments. The TR11, TR12, TR23, TR31 and TR45 collections are taken from TREC-5, TREC-6 and TREC-7. 20 NG S1–S5 are five randomly chosen subsets of the 20 Newsgroups documents [18]. The RE S1 and RE S2 data sets are from the Reuters-21578 collection: RE S1 consists of the 4645 documents that belong to only one category, and RE S2 is the Reuters transcribed subset [19]. For all the data sets shown in Table 3, we removed the stop words and stemmed the remaining terms using the Porter stemming algorithm [20].
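The preprocessing step can be sketched as follows. Using NLTK for the stop-word list and the Porter stemmer is our assumption here; the text specifies only stop-word removal and Porter stemming [20].

```python
# Minimal preprocessing sketch; NLTK is an assumption, not prescribed by the text.
import re
from nltk.corpus import stopwords        # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and Porter-stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]
```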

Table 3 Data sets

5 Conclusion

We evaluated the importance of using the inter- and intra-cluster components in the term weights using the average entropy measure. Since the K-means clustering algorithm is unstable and sensitive to the initial centroids, we ran the algorithm 10 times, with a different random seed for the initial centroids on each run. We repeated this experiment for the four term weighting schemes on the data collections listed in Table 3.

We calculated the entropy for each term weighting scheme, as given in Eq. (2.13), for each run after the algorithm converged, and then computed the average of the entropies over the runs. Similarly, we computed the average F-measure and average purity. Table 4 shows the average entropy calculated for each data set. Table 5 shows the average entropy, average F-measure and average purity for the TF-IDF and CBT term weighting schemes with the K-means clustering algorithm. Both experiments show that the K-means clustering algorithm with the CBT term weighting scheme gives better results than the other term weighting schemes on each data set. According to the cluster-based term weighting scheme, a term is important to a cluster if it is unique to that cluster and occurs frequently within the documents of that cluster. The inter- and intra-cluster components try to identify these important terms by analyzing the term frequency distribution at three levels: document, cluster and collection. Our experimental results show that adding these cluster components to the term weighting scheme significantly improves the results on each data set. We believe that some of the deviations in the results are due to the clustering algorithms' inability to handle noise in the data collection.

Table 4 K-means clustering algorithm—avg. entropy measured for the norm TF, CBT, TF-IDF-ICF and TF-IDF term weighting schemes
Table 5 Bisecting K-means clustering algorithm—avg. entropy, avg. F-measure and avg. purity measured for the TF-IDF and CBT term weighting schemes
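For reference, the average entropy and purity reported above can be computed as in the following sketch, which follows the standard definitions (the exact form of Eq. (2.13) is defined earlier in the work and is not reproduced here); each cluster is represented by the gold category labels of its documents.

```python
import math
from collections import Counter

def cluster_entropy(clusters):
    """Weighted average entropy over clusters; each cluster is a list of class labels."""
    total = sum(len(c) for c in clusters)
    entropy = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((n / len(c)) * math.log(n / len(c), 2) for n in counts.values())
        entropy += (len(c) / total) * h
    return entropy

def cluster_purity(clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

# Toy example: two clusters with gold category labels per document.
clusters = [["sports", "sports", "politics"], ["politics", "politics"]]
print(cluster_entropy(clusters), cluster_purity(clusters))
```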