1 Introduction

Huge collections of electronic documents pose new challenges to researchers: developing automatic techniques to visualize, analyze, and summarize documents so that information can be retrieved accurately and in a computationally efficient manner. Topic models identify patterns that reflect the underlying semantics embedded in documents [1], which are needed to classify documents according to users' requirements. Topic modeling is applied to index documents using relevant terms, where a topic is a probability distribution over words [2]. Classical document clustering approaches [3] represent documents by a bag-of-words (BOW) model based on raw term frequency and are therefore poor at capturing semantics. Topic models are able to group words with similar semantics into the same topic.

The term frequency-inverse document frequency (tf-idf) method [4] can identify discriminative words for a document, but pays little attention to inter- or intra-document statistical structure [5]. To address this problem, first latent semantic indexing (LSI) [6] and later a generative probabilistic model [7] of text corpora were proposed. A significant step toward probabilistic modeling of text is probabilistic latent semantic analysis (PLSA), reported in [8]. The LDA model was developed to improve the formation of mixture models by capturing the exchangeability of both words and documents, which PLSA and LSA do not exploit [9]. Many LDA-based models exist, including temporal text mining, author-topic analysis, supervised topic models, latent Dirichlet co-clustering, and LDA-based bioinformatics applications [10]. Several improvements on LDA have also been proposed, such as hierarchical topic models [11] and correlated topic models [12].

A query consisting of several keywords can be viewed as a distribution of words over topics. The challenge is to develop an efficient retrieval mechanism for finding topics in the corpus related to the query submitted by the user. In this paper, we use the LDA method to extract topics from a large corpus of documents [2]. We then propose a sparse representation-based classifier [13] for classifying the query, which is a distribution of words over the topics. A term vocabulary (TRV) is constructed from the unique terms in the topic corpus, representing the document repository. Since a query contains very few terms, the query vector is highly sparse with respect to the TRV. In the sparse representation-based classifier (SRC), the query is represented in an overcomplete dictionary whose base elements are the training samples. In this paper, the topics are encoded using the TRV and serve as training samples in a topic-vocabulary matrix (TVM). Finally, we apply SRC to classify the query vector in a reduced search space.

This paper is organized into four sections. Section 2 describes the methodology, using statistical topic modeling and a sparse representation-based classifier; results are summarized in Sect. 3 and conclusions are drawn in Sect. 4.

2 Methodology

In the proposed method, we first obtain the topics and the word distribution within each topic using a traditional statistical topic model (the LDA model). In the second part, we apply SRC to classify the user's query.

2.1 Statistical Topic Modeling

LDA is a basic probabilistic model that describes a generative process for a large corpus of documents. In this model, each document is defined as a mixture of topics, and each topic is a distribution over words. The most dominant topics in a document are those with the highest probabilities.

The basic LDA model is built on certain assumptions. It assumes that (i) there are k topics in the corpus, (ii) each document d has topic proportions \( \theta_{d} \), (iii) each word w is generated from a topic z, and (iv) each topic is defined by word proportions β over the vocabulary.

The generative process is described below:

  1. For each document d in the corpus

     1. (a) Generate \( \theta_{d} \sim \text{Dir}\,(\alpha ) \)

     2. (b) For each word position n

        1. i. Draw a topic \( z_{d,n} \sim \text{Mult}\,(\theta_{d}) \)

        2. ii. Draw a word \( w_{d,n} \sim \text{Mult}\,(\beta_{z_{d,n}} ) \)

where N is the number of words in document d, α is the Dirichlet prior vector for θ, and β denotes the topic-word probabilities. The joint distribution of the topic mixture θ, a set of N topics \( z_{d} \), and a set of N words \( w_{d} \), given the parameters α and β, is given by Eq. (1):

$$ p\left( \theta ,z_{d} ,w_{d} \,|\,\alpha ,\beta \right) = p(\theta \,|\,\alpha )\prod\nolimits_{n = 1}^{N} p\left( z_{d,n} \,|\,\theta \right)\,p(w_{d,n} \,|\,z_{d,n} ,\beta ) $$
(1)

where \( p\left( z_{d,n} \,|\,\theta \right) \) is \( \theta_{i} \) for the unique i such that \( z_{d,n}^{i} = 1 \).
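As an illustration, the generative process above can be sketched with NumPy's Dirichlet and categorical samplers (the corpus sizes below are arbitrary toy values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, N = 3, 8, 20          # toy values: topics, vocabulary size, words per document
alpha = np.full(k, 0.1)     # symmetric Dirichlet prior on theta
beta = rng.dirichlet(np.full(V, 0.1), size=k)   # k topic-word distributions

def generate_document(alpha, beta, N, rng):
    """Sample one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                  # (a) theta_d ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(len(theta), p=theta)       # (i)  z_{d,n} ~ Mult(theta_d)
        w = rng.choice(beta.shape[1], p=beta[z])  # (ii) w_{d,n} ~ Mult(beta_{z_{d,n}})
        words.append(w)
    return words

doc = generate_document(alpha, beta, N, rng)
```

Each sampled document is a list of N integer word ids drawn from the vocabulary.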

Equation (2) shows the marginal distribution of a document:

$$ p\left( w_{d} \,|\,\alpha ,\beta \right) = \int p\left( \theta \,|\,\alpha \right)\left( \prod\nolimits_{n = 1}^{N} \sum\nolimits_{z_{d,n}} p\left( z_{d,n} \,|\,\theta \right)\,p(w_{d,n} \,|\,z_{d,n} ,\beta ) \right) d\theta $$
(2)

Finally, we obtain the probability of a corpus as given in Eq. (3):

$$ p\left( D \,|\,\alpha ,\beta \right) = \prod\nolimits_{d = 1}^{M} \int p(\theta_{d} \,|\,\alpha )\left( \prod\nolimits_{n = 1}^{N_{d}} \sum\nolimits_{z_{d,n}} p(z_{d,n} \,|\,\theta_{d} )\,p(w_{d,n} \,|\,z_{d,n} ,\beta ) \right) d\theta_{d} $$
(3)

In the learning step, the latent variables z and θ are estimated with the objective of maximizing the log-likelihood of the data; this problem is NP-hard. Several approximate inference algorithms, including Gibbs sampling [14] and variational inference [15], have been used for learning. Figure 1 shows the graphical representation of the LDA model, where each node represents a random variable and is labeled according to the generative process.
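To make the Gibbs sampling approach concrete, a minimal (unoptimized) collapsed Gibbs sampler for LDA can be sketched in pure NumPy; documents are lists of integer word ids, and the hyperparameter values are illustrative, not from the paper:

```python
import numpy as np

def lda_gibbs(docs, k, V, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), k))          # document-topic counts
    nkw = np.zeros((k, V))                  # topic-word counts
    nk = np.zeros(k)                        # total words assigned to each topic
    z = []                                  # current topic assignment of every word
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]                 # remove the current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional p(z = t | rest), up to normalization
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(k, p=p / p.sum())
                z[d][n] = t                 # resample and restore counts
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)  # doc-topic
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)      # topic-word
    return theta, phi

docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 1, 0]]
theta, phi = lda_gibbs(docs, k=2, V=6, iters=50)
```

The returned θ and φ are the smoothed document-topic and topic-word distributions; production systems would instead use an optimized implementation.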

Fig. 1
figure 1

Graphical representation of LDA model

In our experiment, we apply LDA with a variable number of topics. We set the initial value of the parameter α in the range [0, 1] and observe its effect on the distribution over the number of topics. We use \( 10^{-2} \) as the threshold on the change in α as the number of topics varies. We choose 25 topics because no significant change in α is observed beyond this point.

2.2 Sparse Classifier

Using LDA, we obtain n topics containing words from the corpus of documents, and a vocabulary TRV is prepared from the unique terms present in the different topics. Let the number of unique words across the n topics be k, so the dimension of the TRV is 1 × k. On the basis of the TRV, a feature vector (FV) of dimension 1 × k is defined for each topic in Eq. (4):

$$ {\mathbf{FV}}[i] = \begin{cases} P(w_{i} \,|\,T)\, \cdot \,P(w_{i}), & \text{if } w_{i} \text{ is present in the topic} \\ P(w_{i}), & \text{otherwise} \end{cases} $$
(4)

where \( P(w_{i}\,|\,T) \) denotes the probability of word \( w_{i} \) in topic T and \( P(w_{i}) \) is the probability of the word in the corpus.
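Eq. (4) can be sketched directly; the helper name and input layout below are illustrative, not from the paper:

```python
import numpy as np

def feature_vector(topic_word_prob, corpus_word_prob):
    """Build the 1 x k feature vector of a topic over the TRV, per Eq. (4).

    topic_word_prob:  dict {vocab index i: P(w_i | T)} for words present in the topic
    corpus_word_prob: length-k sequence giving P(w_i) over the whole corpus
    """
    fv = np.array(corpus_word_prob, dtype=float)  # FV[i] = P(w_i) by default
    for i, p in topic_word_prob.items():
        fv[i] = p * fv[i]                         # FV[i] = P(w_i|T) * P(w_i) if w_i in topic
    return fv

# toy example: a 4-word vocabulary, topic contains words 0 and 2
fv = feature_vector({0: 0.5, 2: 0.25}, [0.1, 0.2, 0.4, 0.3])
```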

Let us assume that there are c known pattern classes. Let \( \varvec{A}_{i} \) be the matrix whose columns are the training samples of class i, i.e., \( \varvec{A}_{i} = \left[ \varvec{y}_{i1} ,\varvec{y}_{i2} , \ldots ,\varvec{y}_{iS_{i}} \right] \in {\mathbb{R}}^{d \times S_{i}} \), where \( S_{i} \) is the number of training samples of class i.

Let us define the matrix \( \varvec{A} = [\varvec{A}_{1} ,\varvec{A}_{2} , \ldots ,\varvec{A}_{c} ] \in {\mathbb{R}}^{d \times S} \), where \( S = \sum\nolimits_{i = 1}^{c} S_{i} \). The matrix A is built from the entire set of training samples. Given a query test sample y, we represent y in an overcomplete dictionary whose basis elements are the training samples, so \( \varvec{y} = \varvec{Aw} \). If the system of linear equations is underdetermined (d < S), this representation is naturally sparse.

The sparsest solution can be obtained by solving the following \( L_{1} \) optimization problem given in Eq. (5):

$$ (L_{1} ) \; \widehat{{\varvec{w}_{1} }} = \arg \,\hbox{min} \left\| \varvec{w} \right\|_{1} ,\quad {\text{subject}}\;{\text{to}}\;\varvec{Aw} = \varvec{y} $$
(5)

This problem can be solved in polynomial time by a standard linear programming algorithm [16]. After the sparsest solution \( \widehat{\varvec{w}_{1}} \) is obtained, SRC proceeds as follows [17].
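For reference, the textbook reduction of the \( L_{1} \) problem in Eq. (5) to a linear program (a standard transformation, not specific to [16]) splits the coefficients into nonnegative parts, \( \varvec{w} = \varvec{u} - \varvec{v} \) with \( \varvec{u},\varvec{v} \ge 0 \), so that \( \left\| \varvec{w} \right\|_{1} = \mathbf{1}^{\text{T}} (\varvec{u} + \varvec{v}) \):

$$ \hbox{min} \;\mathbf{1}^{\text{T}} (\varvec{u} + \varvec{v}),\quad {\text{subject}}\;{\text{to}}\;[\varvec{A}, - \varvec{A}]\left[ \begin{matrix} \varvec{u} \\ \varvec{v} \end{matrix} \right] = \varvec{y},\;\varvec{u},\varvec{v} \ge 0 $$

Any generic LP solver then recovers \( \widehat{\varvec{w}_{1}} = \varvec{u} - \varvec{v} \).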

For each class i, let \( \partial_{i} :{\mathbb{R}}^{S} \to {\mathbb{R}}^{S} \) be the characteristic function that selects the coefficients associated with the ith class.

For \( \varvec{w} \in {\mathbb{R}}^{S} \), \( \partial_{i} (\varvec{w}) \) is a vector whose only nonzero entries are the entries of w associated with class i. Using only the coefficients associated with the ith class, a given test sample y can be reconstructed as \( \varvec{v}^{i} = \varvec{A}\partial_{i} (\widehat{{\varvec{w}_{1}}}) \); \( \varvec{v}^{i} \) is called the prototype of class i with respect to the sample y. Equation (6) gives the residual distance between y and its prototype \( \varvec{v}^{i} \) of class i:

$$ r_{i} (\varvec{y}) = \left\| \varvec{y} - \varvec{v}^{i} \right\|_{2} = \left\| \varvec{y} - \varvec{A}\partial_{i} (\widehat{{\varvec{w}_{1}}}) \right\|_{2} $$
(6)

The SRC decision rule is

if \( r_{l} (\varvec{y}) = {\text{min }}_{i} \,r_{i} (\varvec{y}), \) y is assigned to class l.
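A minimal sketch of this decision rule, assuming SciPy's `linprog` for the \( L_{1} \) program (the function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def src_classify(A, y, labels):
    """Sparse-representation classification: solve min ||w||_1 s.t. Aw = y
    (as a linear program with w = u - v, u, v >= 0), then assign the class
    whose coefficients give the smallest reconstruction residual (Eq. 6)."""
    _, S = A.shape
    res = linprog(c=np.ones(2 * S), A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None))
    w = res.x[:S] - res.x[S:]
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - A @ np.where(labels == c, w, 0.0))
                 for c in classes]          # r_i(y) = ||y - A d_i(w)||_2
    return classes[int(np.argmin(residuals))]

# toy dictionary: 8 training samples in R^6, two classes
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred = src_classify(A, A[:, 5], labels)     # query equals a class-1 sample
```

Since the query here is exactly a class-1 training sample, the sparse solution concentrates on that column and class 1 yields a near-zero residual.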

In the experiment, we consider the TVM as the training set. The TVM is built by stacking the feature vectors \( {\mathbf{FV}}_{i} \) of each topic i as described below.

$$ {\mathbf{TVM}} = [{\mathbf{FV}}_{{\mathbf{1}}} ,\,{\mathbf{FV}}_{{\mathbf{2}}} \, \ldots \,{\mathbf{FV}}_{n} ]^{\text{T}} $$

The dimension of the TVM is n × k. Now, we consider a user query q as a test sample and convert it into a feature vector \( {\mathbf{FV}}_{q} \) of dimension 1 × k using Eq. (4).

We apply SRC on the TVM to reconstruct \( {\mathbf{FV}}_{q} \) and assign the nearest topic to \( {\mathbf{FV}}_{q} \). The procedure is described in Algorithm 1.

Algorithm 1 Query classification using SRC

3 Results and Discussion

In the experiment, we used a subset of the TREC AP (Associated Press) corpus containing 2246 news articles with 10,473 unique terms. After preprocessing the documents, we apply LDA and obtain 25 topics, which is optimal for this experiment. Each topic is visualized as a set of words, where each element of the set is assigned the posterior topic measure. Table 1 shows five different topics with their 10 most frequent keywords. Test samples, in the form of users' queries, are classified by executing Algorithm 1.

Table 1 Topics with top 10 keywords

The performance of the classifier is evaluated using different statistical measures [18], considering 10 queries for each of the 25 topics, i.e., 10 × 25 = 250 queries. We varied the query length up to five keywords, and no significant change in the retrieved information was observed. It has been observed statistically that most users provide at most five keywords in their queries [19]. For instance, the query vectors [judge trial charges guilty]T, [oil higher market price]T, [government authorities reported official]T, and [aids patients doctors care]T, each of length 4, are classified as 'Judiciary', 'Economy', 'Administration', and 'Healthcare', respectively.

Accuracy improves with the number of topics, indicating the appropriateness of the sparse classifier in the proposed method. Figure 2 shows the improvement in accuracy with an increasing number of topics, which is not as remarkable for the other classifiers as for SRC. High accuracy and a high TP rate ensure good precision and recall for the retrieval method, which also guarantees a low misclassification rate. A detailed comparison of the different classifiers using statistical measures is summarized in Table 2, and the ROC curve for SRC is shown in Fig. 3.

Fig. 2
figure 2

Accuracy versus number of topics

Table 2 Comparison with different classifiers
Fig. 3
figure 3

ROC curve for SRC

4 Conclusion

LDA is a basic generative model for topic modeling, used in this paper for initial latent topic identification. We treat a query as a distribution of words obtained from topics, which is sparse with respect to the high-dimensional input space represented by the vocabulary. The proposed approach gives satisfactory results in terms of statistical measures, and its performance improves with the size of the training set. The Naïve Bayes classifier performs better for a small number of topics; however, the proposed approach using SRC outperforms it as the size of the training set increases, as shown in Fig. 2.