
1 Introduction

Short texts are convenient in human communication and have become prevalent on social networks. Short text classification is challenging due to the natural sparsity of short texts, noisy words, irregular syntactic structure and colloquial terminology [1]. These problems have attracted considerable research attention in the field of short text expansion and classification.

Due to the limited number of words and the low frequency of terms in short texts, the bag-of-words (BOW) representation has limits in analyzing short texts [2]. One possible solution to sparsity is to expand short texts by appending new features based on semantic information extracted from Web search results, lexical databases, or machine translation [3]; these are called external resource-based approaches. Web search-based feature extension [4] needs to interact frequently with search engines, resulting in high communication overhead and low efficiency for data analysis. Knowledge bases or lexical databases, such as Wikipedia and HowNet for concept taxonomies [5,6,7] or topic models [8, 9], are used to enrich short text representations. However, these feature extension methods depend heavily on the integrity of external resources and are often time consuming. Moreover, the predefined topics and categories are domain-specialized or language-specific.

Using rules or statistical information hidden in the context of short texts is another kind of approach to extending features, called self-contained resource approaches [10, 11, 22,23,24, 27]. Mining the hidden information in short texts plays a key role in feature extension. A self-aggregation-based topic model (SATM) [22] was reported recently; it assumes that short texts are sampled from long pseudo-documents, and topic modeling is then conducted by finding the “document-ship” of each short text. U. K. Sikdar et al. [10] described a deep learning approach to recognize Amharic named entities from a large dataset annotated with six different classes, trained on various language-independent features together with word vectors, i.e., the semantic information obtained by the unsupervised learning algorithm word2vec. The word vectors were merged with a set of specifically developed language-independent features and fed together into a neural network model to predict the classes of the words. Zhang et al. [11] proposed a character-level convolutional network model for short text classification without any knowledge of the syntactic or semantic structures of a language. Nevertheless, these works ignore the relevance between the words in short texts. When words are limited, the associations between words can serve as additional information and an important basis for feature expansion, alleviating the problem of sparse features in short texts.

This paper considers two forms of information: the inter-type relationships between words and short texts, and the intra-type relationships among words and among short texts. Based on these two kinds of data relations, the feature space is obtained by dimension reduction of the word clustering indicator matrix, which is computed by non-negative matrix tri-factorization [12]. Then, according to the correlation between words, closely related features in the feature space are selected to expand the text feature vector, which effectively alleviates the problem of feature sparseness.

2 Related Works

Feature expansion is essential for classifying short texts, and work has so far focused mainly on two kinds of approaches: the Latent Dirichlet Allocation (LDA) topic model [40, 42, 43] and word embedding [29,30,31, 35,36,37,38, 42]. Y. Xu used LDA to cluster words or documents into “topics” and, based on a “topic-word” probability distribution model, found and selected closely related words to expand the feature space [42]. W. Xia et al. chose the liveness of each user as a feature and modelled it as a weight for the user; they improved the precision of topic detection and tracking by including the user feature in the LDA model to expand the features of short texts [40]. Yu et al. [43] used the Dirichlet Multinomial Mixture (DMM) model as the main framework and extended short texts with latent feature vector representations of words by combining the user-LDA topic model, achieving good performance as an external extension of short texts. However, the complexity of probabilistic graphical models hampers the development of LDA, and the computational cost of LDA imposes a penalty that outweighs the improvement the algorithm brings.

On the other hand, word embedding presents another kind of word representation, mapping each word into a continuous vector space of reduced dimensionality [32, 33]. Semantic expansion of words is then obtained by clustering the vectors. Recently, deep learning-based approaches have been widely employed for word embedding models. Google developed the Word2Vec tool, based on Bengio's neural language model, for word embedding [24]. Word2Vec predicts words from their context using one of two distinct neural models: CBOW [33, 35, 38, 39] and Skip-Gram [10, 29, 31, 34, 36, 37, 40].

P. Wang et al. proposed a framework to expand short texts based on the skip-gram model, learning word embeddings from large-scale unstructured text data. By applying additive composition over word embeddings from contexts with variable window width, they computed the representations of multi-scale semantic units in short texts [37]. In [36], distributed word embeddings were learned by the skip-gram algorithm through a neural network architecture and then combined into a sentence representation to predict the semantic relations between short texts. W. X. Liang et al. proposed a global and local word embedding-based topic model (GLTM) for short texts [34]. They trained global word embeddings from a large external corpus and employed the continuous skip-gram model with negative sampling (SGNS) to obtain local word embeddings. Utilizing both the global and local word embeddings, their method distills semantically related information between words, which is further leveraged by the Gibbs sampler in the inference process to strengthen the semantic coherence of topics.

G. X. Xun et al. used Continuous Bag of Words (CBOW) to provide additional semantics for a short text corpus and incorporated it into the model of each short document to establish Gaussian topics in the vector space [39]. In addition, a discrete background model over word types was added to complement the continuous Gaussian topic model. In [38], using word embedding features, L. Sang et al. expanded and enriched the word density in short texts, and semantic similarities of short texts were calculated for effective learning; this method combined external sources of word semantic information with the short text structure information. A. J. Pascual et al. presented a Contextual Specificity Similarity (CSS) algorithm [33] for document similarity measurement, in which documents are represented as arrays of their word vectors and the Inverse Document Frequency (IDF) of the words is incorporated to define the closeness between documents.

Although Word2Vec performs outstandingly in analyzing synonymous words, it still relies heavily on local context and lacks global statistical information about short texts. Accordingly, in 2014, Jeffrey Pennington et al. presented a new model, using the words ice and steam to illustrate how meaning can be generated from word co-occurrence and how a global word vector representing that meaning can be produced [23]. They named it GloVe; its training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations show interesting linear substructures of the word vector space [37]. A comparative study [41] showed its effectiveness for Arabic language processing and pointed out that the appropriate starting point for word vector learning might indeed be ratios of co-occurrence probabilities rather than the probabilities themselves. The shortcoming of GloVe was also mentioned in [25]: it demands a large-scale corpus and sufficiently large storage resources.

Both kinds of approaches mentioned above cannot work without the support of a huge corpus. In contrast to these large-scale learning algorithms, this paper studies feature expansion using the short texts themselves. Three kinds of relations are taken into consideration, namely word-to-word, word-to-text and text-to-text, to make use of more relatedness information within the short texts. We use this method as an alternative to the aforementioned approaches in cases where only limited amounts of training data are available.

3 Algorithm Framework

Given a short text set T = {t1,…, tm} and a word set W = {w1, …, wn}, the goal is to group the texts {t1, …, tm} into k clusters while also grouping the words {w1, …, wn} into k clusters. The relationship matrix R describes the inter-type relationships between texts and words. The correlation matrices At and Aw represent the intra-type relationships of texts and words, respectively. The clustering indicator matrix F represents the clustering result of the words; its element Fij represents the possibility that wi belongs to cluster kj. Similarly, the clustering indicator matrix G represents the clustering result of the short texts. Since the category labels of the training set are known, the matrix G can be obtained directly. In this way, feature expansion for short texts is transformed into the joint clustering of texts and words.

The overall framework of our algorithm is based on non-negative matrix factorization, including four steps: feature space establishment, feature expansion, feature space updating and short text classification, as shown in Fig. 1.

Fig. 1. Framework of the proposed algorithm

The feature space of the short text collection itself describes the possibility of each word belonging to each category. Based on the training texts, we construct a relationship matrix to describe the word-to-text membership, and two correlation matrices to describe the intra-type text-to-text and word-to-word relations, respectively. Under manifold regularization, the non-negative matrix factorization algorithm is used to build the word clustering indicator matrix. After removing some evenly distributed features from the indicator matrix, a dimension-reduced feature space is constructed. The features of a short text are extended according to the correlation between the features in the feature space and the text's own features. Feature space updating predicts the clustering indicator value of an unknown feature as the average clustering indicator value of the known features in the same text, and then adds the new feature to the feature space. The classifier divides the testing samples into different categories using an SVM algorithm.

4 Feature Space Construction Based on DNMTF

4.1 Non-negative Matrix Tri-Factorization

The feature space is constructed by factorizing the relationship matrix. Firstly, according to the label data of the short text training set, the clustering indicator matrix G can be obtained directly; it serves as one factor of the relationship matrix R in the non-negative matrix tri-factorization [13]. Then, with a manifold regularization constraint added, the word clustering indicator matrix F is obtained by decomposition.

The relation matrix R is decomposed into three matrices F, S and G, written as R ≈ FSG^T. Matrices F and G are the clustering indicator matrices of the two types of entities, respectively, and matrix S is a scaling (equilibrium) matrix that provides extra degrees of freedom to guarantee the accuracy of the low-dimensional representation.

4.2 Construction of Relationship and Correlation Matrix

The construction of the relationship matrix R follows the natural relationship between text and word. If the word wi appears in the text tj, then Rij = 1, otherwise Rij = 0.
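As a concrete illustration, R can be built in a few lines. The following sketch assumes the texts are already tokenized; the function and variable names are illustrative and not part of the original implementation.

```python
import numpy as np

def build_relationship_matrix(tokenized_texts, vocabulary):
    """Binary word-by-text incidence matrix R (n words x m texts): R[i, j] = 1
    if word w_i appears in text t_j, and 0 otherwise."""
    word_index = {w: i for i, w in enumerate(vocabulary)}
    R = np.zeros((len(vocabulary), len(tokenized_texts)))
    for j, text in enumerate(tokenized_texts):
        for w in set(text):                 # presence only, not frequency
            if w in word_index:
                R[word_index[w], j] = 1.0
    return R
```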

The construction of the correlation matrices At and Aw is based on statistical information about texts and words. The correlation strength between two samples xi and xj is calculated as shown in Eq. (1).

$$ A_{ij} = \frac{{B\left( {x_{i} ,x_{j} } \right)}}{{\mathop \sum \nolimits_{{x_{a} ,x_{b} \in T\left( W \right)}} B\left( {x_{a} ,x_{b} } \right)}} $$
(1)

Where \( B\left( {x_{i} ,x_{j} } \right) \) is the co-occurrence count of samples xi and xj: for two texts in T it is the number of words they share, and for two words in W it is the number of texts in which they co-occur.
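Continuing the sketch above, one plausible reading of B(xi, xj) is to count co-occurrences directly from R: two words co-occur in the texts that contain them both, and two texts share the words that appear in both. Whether self co-occurrence is included in the normalization is not specified in the paper, so excluding it below is an assumption.

```python
import numpy as np

def correlation_matrix(B):
    """Normalize pairwise co-occurrence counts B into a correlation matrix A (Eq. 1)."""
    B = B.astype(float).copy()
    np.fill_diagonal(B, 0.0)        # assumption: self co-occurrence is not counted
    total = B.sum()
    return B / total if total > 0 else B

# Using R from the previous sketch (n words x m texts):
# A_w = correlation_matrix(R @ R.T)   # word-word: number of texts containing both words
# A_t = correlation_matrix(R.T @ R)   # text-text: number of words shared by both texts
```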

4.3 Relationship Matrix Factorization with Manifold Regularization

According to the manifold hypothesis [14], if two samples xi and xj are similar in geometric structure, then their practical significance is also similar, which is reflected in their clustering labels. Therefore, we propose a novel algorithm based on the dual regularization non-negative matrix tri-factorization (DNMTF) algorithm [15] to capture the intra-type and inter-type relationships among entities. The relationship matrix factorization with manifold regularization is shown in Eq. (2).

$$ J_{1} = \left\| {R - FSG^{T} } \right\|^{2} + \mu tr\left( {F^{T} L_{w} F} \right) + \phi tr\left( {G^{T} L_{t} G} \right)\,\,\,\,\,s.t. F,S,G \ge 0 $$
(2)

Where μ, ϕ > 0 are regularization parameters used to balance the reconstruction error of DNMTF in the first term against the graph regularization in the second and third terms of Eq. (2). \( L_{w} = D_{w} - A_{w} \) is the graph Laplacian of the data graph, which reflects the label smoothness of the data points, and \( L_{t} = D_{t} - A_{t} \) is the graph Laplacian of the feature graph, which reflects the label smoothness of the features. Dw and Dt are diagonal matrices whose entries are the column sums of Aw and At, i.e., \( D_{ii}^{w} = \sum_{j} A_{ij}^{w} \) and \( D_{ii}^{t} = \sum_{j} A_{ij}^{t} \), respectively.
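For reference, the graph Laplacians used in Eq. (2) can be formed directly from the correlation matrices; this is a minimal numpy sketch under the same assumptions as above.

```python
import numpy as np

def graph_laplacian(A):
    """Graph Laplacian L = D - A, with D diagonal holding the column sums of A."""
    D = np.diag(A.sum(axis=0))
    return D - A

# L_w = graph_laplacian(A_w)
# L_t = graph_laplacian(A_t)
```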

Since the labels of the training set are already known, the clustering indicator matrix G can be obtained directly as a fixed input to J1. The objective function in Eq. (2) can then be rewritten as Eq. (3).

$$ J_{1} = tr\left( {\left( {R - FSG^{T} } \right)\left( {R - FSG^{T} } \right)^{T} } \right) + \mu tr\left( {F^{T} L_{w} F} \right) + \phi tr\left( {G^{T} L_{t} G} \right) $$
$$ = tr\left( {RR^{T} } \right) - 2tr\left( {RGS^{T} F^{T} } \right) + tr\left( {FSG^{T} GS^{T} F^{T} } \right) + \mu tr\left( {F^{T} L_{w} F} \right) + \phi tr\left( {G^{T} L_{t} G} \right) $$
(3)

We introduce Lagrange multipliers αn × k, βm × k and γk × k for the constraints F ≥ 0, G ≥ 0 and S ≥ 0, respectively. The Lagrangian function is then given in Eq. (4).

$$\begin{aligned} L & = tr\left( {RR^{T} } \right) - 2tr\left( {RGS^{T} F^{T} } \right) + tr\left( {FSG^{T} GS^{T} F^{T} } \right) + \mu tr\left( {F^{T} L_{w} F} \right) \\ & + \,\phi tr\left( {G^{T} L_{t} G} \right) + tr\left( {\alpha F^{T} } \right) + tr\left( {\beta G^{T} } \right) + tr\left( {\gamma S^{T} } \right) \\ \end{aligned} $$
(4)

To solve for the matrix S, we take the matrices F and G as given and set the partial derivative \( \frac{\partial L}{{\partial S}} = 0 \), which yields Eq. (5).

$$ \gamma = 2F^{T} RG - 2F^{T} FSG^{T} G $$
(5)

Using the KKT condition [16] \( \gamma_{ij} S_{ij} = 0 \), we obtain Eq. (6).

$$ [F^{T} RG - F^{T} FSG^{T} G]_{ij} S_{ij} = 0 $$
(6)

According to Eq. (6), matrix S is updated by the rule in Eq. (7).

$$ S_{ij} \leftarrow S_{ij} \frac{{[F^{T} RG]_{ij} }}{{[F^{T} FSG^{T} G]_{ij} }} $$
(7)

To solve for the matrix F, we take the matrices S and G as given and set the partial derivative \( \frac{\partial L}{{\partial F}} = 0 \), which yields Eq. (8).

$$ \alpha = 2RGS^{T} - 2FSG^{T} GS^{T} - 2\mu L_{w} F $$
(8)

Substituting \( L_{w} = D_{w} - A_{w} \) into Eq. (8) and using the KKT condition [16] \( \alpha_{ij} F_{ij} = 0 \), we obtain Eq. (9).

$$ [RGS^{T} - FSG^{T} GS^{T} - \mu D_{w} F + \mu A_{w} F]_{ij} F_{ij} = 0 $$
(9)

According to Eq. (9), matrix F is updated by the rule in Eq. (10).

$$ F_{ij} \leftarrow F_{ij} \frac{{[RGS^{T} + \mu A_{w} F]_{ij} }}{{[FSG^{T} GS^{T} + \mu D_{w} F]_{ij} }} $$
(10)
Algorithm 1. Feature space construction based on DNMTF (pseudocode figure, not reproduced here)
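The alternating multiplicative updates in Eqs. (7) and (10), with G held fixed because the training labels are known, can be sketched as follows. Random initialization, the iteration count, and the small epsilon added for numerical stability are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dnmtf_factorize(R, G, A_w, mu=0.6, n_iter=100, eps=1e-9, seed=0):
    """Approximate R (n x m) as F S G^T with G (m x k) fixed, returning the word
    clustering indicator F (n x k) and the scaling matrix S (k x k)."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    k = G.shape[1]
    D_w = np.diag(A_w.sum(axis=0))          # degree matrix of the word graph
    F = rng.random((n, k))
    S = rng.random((k, k))
    for _ in range(n_iter):
        # Eq. (7): S_ij <- S_ij * [F^T R G]_ij / [F^T F S G^T G]_ij
        S *= (F.T @ R @ G) / (F.T @ F @ S @ (G.T @ G) + eps)
        # Eq. (10): F_ij <- F_ij * [R G S^T + mu A_w F]_ij / [F S G^T G S^T + mu D_w F]_ij
        F *= (R @ G @ S.T + mu * (A_w @ F)) / (F @ S @ (G.T @ G) @ S.T + mu * (D_w @ F) + eps)
    return F, S
```

Since G is fixed here, the ϕ tr(G^T L_t G) term in Eq. (2) is constant and therefore does not appear in the updates, which is why only L_w (through A_w and D_w) enters Eq. (10).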

5 Feature Extension Based on Self-resources

5.1 Feature Expansion

Suppose there are p feature words in the feature space Hp×k, which is the output of Algorithm 1. From the space H, q features fi (i = 1, …, q), with p ≫ q, are chosen to form a subset of the feature space H, denoted H*q×k, which contains exactly those q features. Multiplying H* by the feature space H then gives the matrix Eq×p, as shown in Eq. (11).

$$ E = H^{*} \cdot H^{T} $$
(11)

Where the matrix E describes the correlation of each fi (i = 1, …, q) with all features in the space H.

To select features for expansion conveniently, the matrix E is compressed: the values in each column are summed and averaged to obtain the p-dimensional vector e, as shown in Eq. (12).

$$ e\left( j \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{q} E_{ij} }}{q},\quad j = 1 \cdots p $$
(12)

Vector e describes the relevance between each feature word in the feature space H and the feature representations fi (i = 1, …, q) in the subspace H*. The K features with the highest relevance in e are then selected to expand the short text, in addition to its existing features.
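Equations (11) and (12) and the subsequent top-K selection can be sketched as follows, assuming, as Sect. 5.2 suggests, that the q chosen features are the ones appearing in the text being expanded; excluding features already present in the text is one plausible reading of "in addition to the existing text features".

```python
import numpy as np

def expand_features(H, text_feature_idx, K):
    """Pick the K features in the feature space H (p x k) most relevant to the
    q features already present in a short text (Eqs. (11)-(12))."""
    H_star = H[list(text_feature_idx)]   # H*: q x k rows for the text's own features
    E = H_star @ H.T                     # Eq. (11): q x p relevance matrix
    e = E.mean(axis=0)                   # Eq. (12): mean relevance to each of the p features
    ranked = np.argsort(-e)              # most relevant first
    present = set(text_feature_idx)
    return [j for j in ranked if j not in present][:K]
```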

5.2 Feature Space Update

In the process of extending the features of a short text, it is possible that some features extracted from the short text are not included in the feature space H; in this case the feature space cannot provide sufficient feature expansion. Therefore, before the feature expansion of a short text, its features should first be checked to determine whether the space H needs to be updated to cover all new text features. A new feature needs to be added to the feature space when:

(1) the feature does not exist in the feature space H;

(2) the feature is not one that was deleted during dimension reduction of the clustering indicator matrix.

Suppose there are a features that need to be added, and their corresponding clustering indicator matrix is H**. Due to the correlation between input data, H** can be calculated from H*, as shown in Eq. (13).

$$ H_{i}^{**} \left( j \right) = \frac{{\mathop \sum \nolimits_{g = 1}^{q} \varvec{H}_{gj}^{*} }}{q},j = 1 \cdots k,i = 1 \cdots a $$
(13)

Finally, H** is incorporated into H to obtain an enlarged feature space, based on which feature expansion is carried out. Here, H* is a subset of the feature space H.
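Equation (13) assigns every new feature the same indicator row, namely the column-wise mean of H*; appending those rows to H is then the space update. A minimal sketch, with names chosen for illustration:

```python
import numpy as np

def update_feature_space(H, H_star, n_new):
    """Eq. (13): each of the n_new unseen features receives the column-wise mean
    of H* as its clustering indicator row, and the rows are appended to H."""
    row = H_star.mean(axis=0)              # 1 x k mean indicator over the q known features
    H_new = np.tile(row, (n_new, 1))       # H**: a identical rows (a = n_new)
    return np.vstack([H, H_new])           # enlarged feature space
```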

5.3 Algorithm Description

Algorithm 2. Feature expansion based on self-contained resources (pseudocode figure, not reproduced here)

6 Experiments and Discussion

6.1 Experimental Datasets

This paper verifies the effectiveness of the proposed method on three datasets. In the experiments, the open source tool libsvm is used as the text classifier. The first dataset, Web snippets, collected from Web search results by Phan et al. [17], is a commonly used short text classification benchmark. It contains 8 categories, with 10,060 training texts and 2,280 test texts and an average text length of 17.93 words. Specific information is listed in Table 1.

Table 1. Web snippets dataset

The second dataset is Twitter 100k, published by Hu et al. [18]. The texts are written by users in informal language and are limited in length. Since this dataset has no class labels, only sports-related data were selected and manually tagged for sport-item classification, resulting in 6 sport items with 3,000 training texts and 630 test texts and an average text length of 12.95 words. The specific information is listed in Table 2.

Table 2. Twitter sports dataset

The third dataset is the AGnews data collected by Zhang et al. [19]; the 4 classes with the largest amount of data are selected to construct the dataset, including 120,000 training texts and 7,600 test texts, with an average text length of 38.82 words. The specific information is listed in Table 3.

Table 3. AGnews dataset

6.2 Parameters Selection

In Eq. (2), the regularization parameters μ and ϕ are selected according to three evaluation indexes: Purity [20], Normalized Mutual Information (NMI) [21] and Adjusted Rand Index (ARI) [26]. Purity is the proportion of correctly clustered documents among all documents. NMI measures the degree of similarity between two clustering results, and ARI measures the agreement between the clustering result and the ground truth. In the relationship matrix factorization, the regularization parameters are set to μ = ϕ. For each value of μ, the DNMTF method with random initialization is run 50 times, and the comparison results are shown in Fig. 2.
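The three indexes can be computed from a run's predicted cluster labels and the ground-truth labels. The sketch below assumes scikit-learn for NMI and ARI and integer-coded labels; both are illustrative assumptions rather than details of the original experimental setup.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def purity(true_labels, cluster_labels):
    """Fraction of samples assigned to the majority ground-truth class of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = sum(np.bincount(true_labels[cluster_labels == c]).max()
                  for c in np.unique(cluster_labels))
    return correct / len(true_labels)

def evaluate_clustering(F, true_labels):
    """Score one DNMTF run: the cluster of each sample is the argmax of its indicator row."""
    pred = np.asarray(F).argmax(axis=1)
    return {"Purity": purity(true_labels, pred),
            "NMI": normalized_mutual_info_score(true_labels, pred),
            "ARI": adjusted_rand_score(true_labels, pred)}
```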

Fig. 2. Effect of the regularization parameter μ

From Fig. 2, we can see that the clustering accuracy reaches its highest value at μ = 0.6 under each of the three evaluation indexes. Accordingly, in the following matrix factorization experiments, we set the regularization parameter to μ = 0.6.

The Web snippets dataset has 4,775 features, the Twitter sports dataset has 1,248 features, and the AGnews dataset has 6,582 features. The number K of extended features directly affects the classification results. Therefore, different values of K are compared on the three datasets, and the results are shown in Fig. 3(a)–(c), respectively. We can see that on every dataset, even if only one feature is added, the classification accuracy increases rapidly to a value close to its optimum. The reason is that the feature with the strongest relevance to the short text, found in the feature space according to Eq. (12), is necessarily the most indicative feature of a certain category. Expanding with this feature allows other short texts of the same category to enlarge their feature representation if they did not contain it before. The similarity between the sparse feature vectors of the same category is thus greatly improved, which has a positive impact on the classification results.

Fig. 3. Results of parameter K on three datasets

As the number of extended features gradually increases, the classification accuracy increases relatively steadily until it reaches the peak for each dataset, and then begins to decline slightly, as shown in Fig. 3(a)–(c).

6.3 Compared Algorithms

To verify the effect of the NMFFE algorithm, we compare it with BOW and Char-CNN, i.e., the bag-of-words method and the character-level convolutional neural network method, neither of which considers semantic information. The results are shown in Table 4, with the best result for each dataset in bold. In study [11], the accuracies of the BOW and Char-CNN algorithms on the AGnews dataset were 88.81% and 87.18%, respectively. Owing to differences in our experimental environment and data processing, the results shown in Table 4 differ slightly from those reported in [11].

Table 4. Comparison results of classification accuracy on 3 datasets

From Table 4, we can see that, with respect to dataset size, the Char-CNN algorithm performs well on large datasets but worse on small ones, where the limited training data cannot cover the overall data distribution and leads to over-fitting of the convolutional neural network.

With respect to data integrity, the texts of the AGnews dataset are relatively long, and its sufficient corpus makes all three algorithms perform well in text classification; their classification accuracies differ only slightly. The similarity between the test and training sets of Web snippets (in terms of keyword co-occurrence) is not as high as that of the other two datasets, which makes the BOW algorithm, based on word frequency statistics, less effective on this dataset.

Overall, the proposed NMFFE algorithm achieves better classification results than the other two algorithms, and its robustness across datasets of different sizes is also better. The BOW and Char-CNN algorithms are more suitable for large-scale datasets.

The running times of the three algorithms are compared on the three datasets, and the results are shown in Fig. 4. The execution time of the BOW algorithm is shorter than that of the other two, especially on large datasets, mainly because its model is relatively simple. The NMFFE algorithm spends the longest time in the feature expansion process because it involves many matrix operations; as the number of extended features K increases, the running time also increases. The Char-CNN model consists of 6 convolutional layers and 3 fully connected layers.

Fig. 4. Comparison of running time

7 Conclusions

Different from vector-based feature expansion methods for short texts, we proposed a method that uses K relevant features as a self-contained subset to extend the feature space of short texts. Without relying on external resources, the word clustering indicator matrix is obtained from the text dataset itself through graph dual regularization non-negative matrix tri-factorization (DNMTF). After dimension reduction, the feature space is obtained as the basis for feature expansion, and the most relevant features extracted within the dataset itself are then selected to enlarge the feature space of short texts. Experimental results showed that the NMFFE algorithm outperformed the BOW and Char-CNN algorithms in classification accuracy. However, the datasets used in this paper are all open datasets that have already been pre-processed, while the main challenge of short-text feature expansion and classification is online and real-time data processing. We will therefore adapt our method to real-time online environments in future work.