1 Introduction

With the rise of Web 2.0, ever more textual data, in ever more diverse forms, is unfolding before people's eyes. For example, a great variety of data is generated from queries and questions in web search, social networks, internet news, and so on. As a consequence, researchers are urged to address the problem that internet users are overwhelmed by a flood of noisy information while individual messages cover only limited content [19].

As this is an essential topic, many methods have been put forward for the above problem. Text categorization, used in information retrieval, news classification and spam filtering to improve the user experience, has been studied thoroughly [10]. However, the applicability of classification to high-dimensional and sparse data is often a weak point of many models. Like a see-saw, the efficiency of processing sparse data and the quality of the results are hard to balance. On the one hand, classification accuracy degrades if the dimensionality is cut down to an efficient level. On the other hand, for sparse and high-dimensional datasets, computational efficiency has to be sacrificed, since the dimensionality reaches thousands or even more [13].

Researchers usually characterize sparse data by building semantic associations or employing external knowledge bases to address the sparse-feature problem. For instance, Wikipedia was used in [15] as an external corpus to enrich the original corpus. Cataldi et al. [2] used semantic relation rules to build a relation-rule library and thereby enrich the feature corpus. Xia et al. [20] introduced multi-granularity topics, from which discriminative features are generated for sparse data classification. Nevertheless, in many situations it is hard to introduce an external corpus, or an appropriate semantic association, that can enhance sparse data classification [23]. What's more, accuracy and efficiency in classification are difficult to optimize at the same time [8].

A novel way to address the above problem is presented in this paper. To classify sparse text accurately and efficiently, the Biterm Topic Model (BTM) algorithm [21] is used to generate features, so that topic information can be utilized in the Vector Space Model (VSM). A Support Vector Machine (SVM) is then applied to obtain better classification results. Through experiments on the 20 Newsgroups dataset and a dataset of Tencent Microblogs, we found that the combination of BTM and SVM outperforms other classification models on sparse data. Moreover, the proposed method provides a novel way to process sparse data.

The rest of the paper is organized as follows: related work is reviewed in Sect. 2. Section 3 formalizes the problem, Sect. 4 details our BTM+SVM approach, and Sect. 5 presents the experimental evaluation. Finally, Sect. 6 concludes the paper.

2 Related Work

Text classification is an important task in natural language processing, and topic models are popular among researchers for processing natural language. Liu et al. [9] devised a semi-supervised learning-with-Universum algorithm based on the boosting technique; their method learns from a collection of non-examples that do not belong to any class of interest. Luss et al. [11] developed an analytic center cutting-plane method to solve the kernel learning problem efficiently; the method exhibits linear convergence and requires very few gradient evaluations. Lai et al. [7] applied a recurrent structure to capture contextual information as far as possible when learning word representations, and the proposed method is reported to outperform state-of-the-art methods on document-level tasks. By contrast, our method uses the generation of word co-occurrence patterns to keep the main information while reducing dimensionality. Landeiro et al. [8] estimated the underlying effect of a text variable on the class variable based on Pearl's back-door adjustment.

SVM is widely used in text classification. Yin et al. [22] combined semi-supervised learning with SVM to improve the traditional method; it can classify a large number of short texts to mine useful information from them, but its efficiency is not satisfactory. Song et al. [18] illustrated a Chinese text feature selection method based on category distinction and feature location information, although this method is limited because location information is not easy to obtain. Nguyen et al. [14] proposed an improved multi-class text classification method that combines the SVM classifier with OAO and DDAG strategies. In Seetha et al. [16], nearest-neighbour and SVM classifiers are chosen as text classifiers for their good classification accuracy. Luo et al. [10] presented a method that combines the Latent Dirichlet Allocation (LDA) algorithm with SVM; however, according to our experiments, that method is not good at dealing with sparse text data. Altinel et al. [1] proposed a novel semantic smoothing kernel for SVM based on a meaning measure.

3 Problem Formalization

Motivated by research on classification models, this study first formalizes the data collection to meet the prerequisites of the algorithms. As usual, we use a vector to represent a document, so the whole text collection can be regarded as a matrix. The problem is formalized as follows.

According to the VSM, every document and every extracted term is mapped into a vector [5], so that the text collection is represented as a document-term matrix:

$$\begin{aligned} d_j=(w_{1j},w_{2j},\ldots ,w_{tj}) \end{aligned}$$
(1)

Each dimension corresponds to a separate term, and its value is usually computed with the term frequency-inverse document frequency (TF-IDF) model. The weight vector for document d is

$$\begin{aligned} v_d=[w_{1,d},w_{2,d},\ldots ,w_{N,d}]^T \end{aligned}$$
(2)

where \(w_{t,d}=tf_{t,d}\cdot \log \dfrac{|D|}{|\{d' \in D \mid t\in d'\}|}\), \(tf_{t,d}\) is the term frequency of term t in document d, |D| is the total number of documents in the collection, and \(|\{d' \in D \mid t\in d'\}|\) is the number of documents containing the term t.
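As an illustration, the following sketch computes these TF-IDF weights for a toy corpus; the function name and the whitespace tokenization are illustrative assumptions, not part of our method.

```python
# A minimal sketch of the TF-IDF weighting in Eq. (2), assuming plain
# whitespace-tokenizable documents; names here are illustrative.
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return a list of {term: weight} dictionaries, one per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()                       # document frequency of each term t
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)             # term frequency tf_{t,d}
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

weights = tfidf_matrix(["the cat sat", "the dog sat", "the dog barked"])
```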

For dimensionality reduction, there are two general approaches. One is feature extraction: the original data is transformed into a reduced feature vector, so that the desired task can be solved using the reduced representation [13]. The transformation can be nonlinear, such as kernel principal component analysis, or linear, such as latent semantic indexing and linear discriminant analysis. The other approach is feature selection, for example the \(\chi ^2\) statistic and document frequency, which selects a subset of relevant features for use in model construction. A small sketch of both routes is given below.
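The sketch below contrasts the two routes using scikit-learn; the specific tools (TruncatedSVD as an LSI-style extractor, \(\chi ^2\)-based SelectKBest as a selector) and the toy data are illustrative choices, not the configuration used in our experiments.

```python
# Illustrative comparison of feature extraction vs. feature selection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD            # feature extraction (LSI-like)
from sklearn.feature_selection import SelectKBest, chi2   # feature selection

docs = ["the cat sat", "the dog sat", "the dog barked", "cats and dogs"]
labels = [0, 0, 1, 1]

X = TfidfVectorizer().fit_transform(docs)                      # document-term matrix (Eq. 2)
X_extracted = TruncatedSVD(n_components=2).fit_transform(X)    # dense, low-rank representation
X_selected = SelectKBest(chi2, k=3).fit_transform(X, labels)   # chi^2-selected term subset
```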

4 Novel Method for Sparse Data Classification

In this section, we describe our method for sparse data classification in detail. To begin with, an overview of the BTM and SVM models is presented. After that, we elaborate how to employ BTM to generate the document-topic matrix, and then explain how to use SVM to classify and predict the category of sparse data.

4.1 Matrix of Topic Distribution

BTM is a probabilistic model that learns topics over short texts by directly modelling the generation of biterms in the whole corpus [21]. A "biterm" is an instance of an unordered word pair: any two distinct words in a document compose a biterm. The graphical model is shown in Fig. 1. The key idea is that two words are more likely to belong to the same topic if they co-occur more frequently.
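To make the notion concrete, the snippet below enumerates the biterms of a toy document; the helper name is illustrative.

```python
# Every unordered pair of distinct words in a (short) document is one biterm.
from itertools import combinations

def extract_biterms(tokens):
    """Return all unordered word pairs in a tokenized document."""
    return [tuple(sorted(pair)) for pair in combinations(set(tokens), 2)]

print(extract_biterms(["visit", "apple", "store"]))
# e.g. [('apple', 'store'), ('apple', 'visit'), ('store', 'visit')]
```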

Fig. 1. BTM: a generative graphical model

Table 1. Notations in BTM

Given a corpus with \(N_D\) documents, we use a K-dimensional multinomial distribution \(\theta =\{\theta _k\}_{k=1}^K\) with \(\theta _k=P(z=k)\) and \(\sum _{k=1}^K \theta _k=1\) to model the prevalence of topics. Assuming each biterm is drawn independently from a specific topic, the generative process of the corpus in BTM is as follows [4]; a toy simulation is sketched after the list. The notations used in BTM are listed in Table 1.

1. For each topic \(\textit{z}\), draw a topic-specific word distribution \( \phi _z \sim \textit{Dir}(\beta )\).

2. Draw a topic distribution \( \theta \sim \textit{Dir}(\alpha )\) for the whole collection.

3. For each biterm b in the biterm set B, draw a topic assignment \(\textit{z} \sim \textit{Multi}(\theta )\), and draw two words \(w_i,w_j \sim \textit{Multi} (\phi _z)\).
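The following toy forward simulation of this generative process is only illustrative (it is not the inference procedure); the number of topics, vocabulary size and Dirichlet hyperparameters are arbitrary.

```python
# A toy forward simulation of BTM's generative story, not its inference.
import numpy as np

rng = np.random.default_rng(0)
K, M, n_biterms = 3, 50, 10               # topics, vocabulary size, biterms to draw
alpha, beta = 1.0, 0.01

phi = rng.dirichlet([beta] * M, size=K)   # step 1: topic-word distributions phi_z
theta = rng.dirichlet([alpha] * K)        # step 2: corpus-level topic distribution theta

biterms = []
for _ in range(n_biterms):                # step 3: draw each biterm
    z = rng.choice(K, p=theta)            # topic assignment z ~ Multi(theta)
    w_i, w_j = rng.choice(M, size=2, p=phi[z])   # two words drawn from phi_z
    biterms.append((z, w_i, w_j))
```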

The probability of a biterm \(b=(w_i,w_j)\), marginalized over the topics z, can be written as:

$$\begin{aligned} P(b)=\sum _zP(z)P(w_i|z)P(w_j|z) =\sum _z\theta _z\phi _{i|z}\phi _{j|z} \end{aligned}$$
(3)

Similar to LDA, Gibbs sampling can be adopted to perform approximate inference. In this process, the topic-word distribution \(\phi \) and the global topic distribution \(\theta \) are estimated as:

$$\begin{aligned} \phi _{w|z}&=\dfrac{n_{w|z} + \beta }{\sum _w n_{w|z}+M\beta } \end{aligned}$$
(4)
$$\begin{aligned} \theta _z&=\dfrac{n_z+\alpha }{|B|+K\alpha } \end{aligned}$$
(5)

where |B| is the total number of biterms. The matrix of topic distribution \(\theta \) obtained in this way is an essential part of our method.
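A minimal sketch of Eqs. (4) and (5) is shown below, assuming the count arrays come from a completed Gibbs sampling run; the function and argument names are illustrative.

```python
# Hedged sketch of Eqs. (4)-(5): turning Gibbs-sampling counts into the
# topic-word and global topic distributions.
import numpy as np

def estimate_distributions(n_wz, n_z, alpha, beta):
    """n_wz: (M, K) word-topic counts; n_z: (K,) biterm-topic counts."""
    M, K = n_wz.shape
    n_biterms = n_z.sum()                                  # |B|
    phi = (n_wz + beta) / (n_wz.sum(axis=0) + M * beta)    # Eq. (4), one column per topic
    theta = (n_z + alpha) / (n_biterms + K * alpha)        # Eq. (5)
    return phi, theta
```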

4.2 Support Vector Machine (SVM)

SVM plays an important role in many domains; when it performs classification tasks, hyperplanes are constructed in a multidimensional space. It is reported that SVM can generate better results than other learning algorithms in classification [6]. The basic theory of SVM is elaborated next:

Given a training dataset of n points of the form \(({\varvec{x}}_{\varvec{1}},{\varvec{y}}_{\varvec{1}}),\dots ,({\varvec{x}}_{\varvec{n}},{\varvec{y}}_{\varvec{n}})\), where \(y_i\) is either 1 or \({-1}\), the optimization problem is defined as:

$$\begin{aligned} \min _{w,b,\zeta }\ \dfrac{1}{2}w^T w+C\sum _{i=1}^{n}\zeta _i \quad s.t.\quad y_i(w^T\phi (x_i)+b)\ge 1-\zeta _i,\ \zeta _i\ge 0 \end{aligned}$$
(6)

where the function \(\phi \) maps the training vectors \(x_i\) into a higher-dimensional space. \(C>0\) is the penalty parameter for misclassified instances and should be chosen with care to avoid over-fitting. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. By Mercer's theorem [12], there always exists a function \( K(x_i,x_j)=\phi (x_i)^T\phi (x_j)\), called the kernel function. The decision function of problem (6) can then be written as:

$$\begin{aligned} f(x)=\sum _{i=1}^l a_iy_iK(x_i,x)+b \end{aligned}$$
(7)

By solving the optimization problem, the parameters of the maximum-margin hyperplane are derived. Note that the reason SVM handles high-dimensional data well is the kernel trick: inner products in the space of \(\phi (x_i)\) are computed directly from the original vectors \(x_i\), so the mapped space is never materialized. What's more, LIBSVM [3] has attractive training-time properties: each iteration takes linear time to read the training data, and the iterations have a Q-linear convergence property, which makes the algorithm extremely fast [17].
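As a usage sketch, the snippet below trains a kernel SVM with scikit-learn's SVC, which wraps LIBSVM; the RBF kernel, the value of C and the toy topic vectors are illustrative choices.

```python
# A minimal classification sketch with a LIBSVM-backed SVC.
from sklearn.svm import SVC
import numpy as np

X_train = np.array([[0.8, 0.1, 0.1],      # e.g. document-topic vectors as features
                    [0.1, 0.8, 0.1],
                    [0.2, 0.1, 0.7]])
y_train = np.array([0, 1, 2])

clf = SVC(kernel="rbf", C=1.0)            # C is the penalty parameter in Eq. (6)
clf.fit(X_train, y_train)
print(clf.predict([[0.7, 0.2, 0.1]]))
```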

4.3 Experimental Procedure for Enhancement

To reduce complexity and improve performance, our method retrieves an optimal set of features that reflects the original data distribution. The steps of document classification are as follows.

Step 1. Build a document-term matrix according to the vector space model.

Step 2. Analyse the topic distribution and build a document-topic matrix.

Step 3. Acquire the weights of the vector space model using the topic distribution values.

Step 4. Build the classifier and test it on the documents.

We first formalize the data collection so that it can be used by SVM, which is why a document-term matrix is built in Step 1. Since Step 2 uses the matrix \(\theta \) to indicate the relationship between documents and topics, we generate it by BTM estimation with Gibbs sampling. In Step 4, an SVM classifier is built upon the features obtained in Step 2. A compact sketch of the whole pipeline is given below.
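The sketch assumes that some BTM implementation exposes an infer_topics helper returning the document-topic matrix; that helper, its signature and the train/test split are hypothetical stand-ins, not part of the released BTM tool.

```python
# End-to-end sketch of Steps 1-4 under the stated assumptions.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def classify_with_btm(docs, labels, infer_topics, n_topics=180):
    # Steps 1-3: obtain the (n_docs, n_topics) document-topic matrix theta
    theta = infer_topics(docs, n_topics)          # hypothetical BTM helper
    X_tr, X_te, y_tr, y_te = train_test_split(theta, labels, test_size=0.2)
    # Step 4: train the SVM classifier on topic features and evaluate it
    clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```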

5 Experimental Evaluation

In this section, we conduct several experiments to evaluate our method; the results are presented below, followed by discussion.

Fig. 2. Category distribution of Tencent messages

5.1 Data Preparation

We evaluate our method on two popular datasets used in large-scale and sparse text classification studies. One is Tencent microblogs, which contains 11,285,538 messages from seven different micro-channels posted by users from July 2 to July 14, 2013 [19] on the Tencent microblog platform (http://t.qq.com/). The other dataset is 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), which has 20 categories and is widely used in text classification.

The raw data in these collections is very noisy. In preprocessing, punctuation marks, stop words, links and other non-words are removed from the raw microblogging data using a punctuation list and a stop-word dictionary. For Chinese word segmentation, ICTCLAS (http://www.ictclas.org/) is used in this paper.
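A rough sketch of this cleaning step is given below; the regular expressions and the tiny stop-word set are illustrative stand-ins, and the actual Chinese segmentation is done with ICTCLAS, which is not reproduced here.

```python
# Illustrative cleaning: drop links, punctuation, non-words and stop words.
import re

STOP_WORDS = {"the", "a", "of", "and"}            # stand-in stop-word dictionary

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)     # remove links
    tokens = re.findall(r"[A-Za-z]+", text.lower())   # keep word tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Check http://t.qq.com/ for the latest messages!"))
```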

To further describe the datasets used for classification, Fig. 2 shows the category distribution of Tencent messages, and Table 2 illustrates the data proportions on 20 Newsgroups.

Table 2. Data description for 20 Newsgroups

5.2 Evaluation Criteria

In our experiments, the \(Macro/Micro-Precision\), \(Macro/Micro-Recall\) and \(Macro/Micro-F1\) criteria are employed to evaluate the method. The definitions are shown below.

$$\begin{aligned} Micro-Precision&=\dfrac{\sum _{i=1}^m TP_i}{\sum _{i=1}^m (TP_i +FP_i)} \end{aligned}$$
(8)
$$\begin{aligned} Micro-Recall&=\dfrac{\sum _{i=1}^m TP_i}{\sum _{i=1}^m (TP_i +FN_i)} \end{aligned}$$
(9)
$$\begin{aligned} Micro-F1&=\dfrac{Micro-Precision\times Micro-Recall\times 2}{Micro-Precision+Micro-Recall} \end{aligned}$$
(10)
$$\begin{aligned} Macro-Precision&=\dfrac{1}{m} \sum _{i=1}^m P_i \end{aligned}$$
(11)
$$\begin{aligned} Macro-Recall&=\dfrac{1}{m} \sum _{i=1}^m R_i \end{aligned}$$
(12)
$$\begin{aligned} Macro-F1&=\dfrac{Macro-Precision\times Macro-Recall\times 2}{Macro-Precision+Macro-Recall} \end{aligned}$$
(13)

where m is the number of classes, and \(P_i\) and \(R_i\) denote the precision and recall of class i.
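These criteria can be computed, for example, with scikit-learn as in the sketch below; note that with exactly one correct label per instance the micro-averaged precision, recall and F1 coincide. The toy labels are illustrative.

```python
# Sketch of the criteria in Eqs. (8)-(13) via scikit-learn.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro")
```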

5.3 Results and Analysis

We choose two other methods, PCA+SVM and LDA+SVM, as baselines to verify the advantage of our approach. Documents used in our experiments are first mapped into a document-term matrix. Treating the topic model as a method of dimensionality reduction, we then trained the document vectors with LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html) and predicted the categories of new documents. Unlike the PCA method, which treats terms as the features of document vectors, the LDA and BTM methods use topics as features. To obtain the document-topic matrix, the widely used LDA tool GibbsLDA++ (http://gibbslda.sourceforge.net/) was employed in our experiments, and BTM (http://shortext.org/) was used to acquire the matrix of topic distribution for documents. The number of Gibbs sampling iterations in the following experiments is set to 1000 to ensure classification accuracy.

We use \(Macro-Precision\), \(Macro-Recall\), \(Macro-F1\) and \(Micro-F1\) to evaluate the classifiers PCA+SVM, LDA+SVM and BTM+SVM on 20 Newsgroups, as depicted in Figs. 3 and 4, respectively. Note that Micro-Precision and Micro-Recall equal Micro-F1 here, since each instance has exactly one correct label. From the results, we can see that the values in Fig. 3 reach their peak when the dimensionality is reduced to 400. By contrast, as shown in Fig. 4, when the number of topics is set to merely 180 for BTM+SVM, the \(Macro-Precision\), \(Macro-Recall\), \(Macro-F1\) and \(Micro-F1\) fluctuate slightly around 0.87, 0.86, 0.87 and 0.90, respectively. The values of these criteria for BTM+SVM are consistently higher than those of PCA+SVM and LDA+SVM.

Fig. 3. The values of the evaluation criteria under diverse numbers of features reduced by the PCA+SVM method on the 20 Newsgroups collection

Fig. 4. The values of the evaluation criteria under diverse numbers of features reduced by the LDA+SVM and BTM+SVM methods on the 20 Newsgroups collection

Comparison experiments were conducted to verify the performance of BTM for feature generation: we estimated the number of iterations needed to obtain high accuracy while spending less time on topic-matrix generation. The accuracy under 5-fold cross validation is reported in Fig. 5. It can be seen that 900 iterations is a relatively good choice on the Tencent dataset, and accuracy stays around 90% with 60 generated features. From Fig. 5(b), we can see that all methods work better as the training data size grows. This suggests that the LDA+SVM method is unable to overcome the sparsity problem, while BTM+SVM achieves better performance than LDA+SVM, which again shows the superiority of our method.
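For reference, the 5-fold cross-validation accuracy can be obtained as sketched below; the randomly generated stand-in topic matrix and labels only illustrate the call and are not our data.

```python
# Hedged sketch of 5-fold cross-validation over a document-topic matrix.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(60), size=200)   # stand-in 60-feature document-topic matrix
y = rng.integers(0, 7, size=200)               # stand-in labels for seven categories

scores = cross_val_score(SVC(kernel="rbf", C=1.0), theta, y, cv=5)
print(scores.mean())
```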

Fig. 5. Comparison of classification performance in different aspects between LDA+SVM and BTM+SVM on the Tencent dataset

BTM+SVM mitigates the over-fitting and feature redundancy problems, and yields better classification results than the other methods. Utilizing the topic model also accelerates the classification process. What's more, regarding the sparsity problem of conventional topic models, BTM is better at capturing topics because it uses word co-occurrence patterns over the whole corpus [21]. Table 3 presents the training speed of the three methods, which also shows the high efficiency of BTM+SVM by comparison. It takes only about 50 minutes to generate the topic matrix with 100 topics and 1000 Gibbs iterations, which saves about 30 minutes compared with LDA+SVM and is only one fifth of the time consumed by PCA+SVM.

Table 3. Time cost of dimensionality generation on 20 Newsgroups for the three methods (3.0 GHz CPU, 2 GB memory)

6 Conclusion

In this paper, we proposed a hybrid approach called BTM+SVM for sparse data classification. We explored the differences among BTM+SVM, PCA+SVM and LDA+SVM, and the results showed that our method is superior in both accuracy and efficiency when processing sparse text. We also determined an appropriate number of topics for approximating the topic matrix. Compared with traditional methods, we improved classification accuracy and measured the training speed in our experiments. Overall, our method copes with the sparsity problem properly, which is promising and can be used extensively in real applications.