Keywords

1 Introduction

There has been a lot of interests in extreme multi-label text classification (XMC) problem: given an input text instance, return the most relevant labels from an enormous label collection, where the number of labels could be in the millions or more, which becomes increasingly important due to the fast growing of internet contents and the urgent needs for organizational views of big data. Multi-label classification is fundamentally different from the traditional binary classification problems which have been intensively studied in the machine learning literature. Binary classifiers treat class labels as independent target variables, which is clearly sub-optimal for multi-label classification as the dependencies among class labels cannot be leveraged. Of the many related tasks, discovering relevant labels from document is of considerable practical importance.

In Ref. [1], XML-CNN is first adopted for XMC task which modifies the traditional TextCNN [2] architecture by using a dynamic max pooling scheme and adding a hidden bottleneck layer that captures richer information from different regions of the document. However, such a method admits the following disadvantages: (1) CNN model is unable to discover the dependency patterns due to the sparsity issue of the XMC datasets which typically exhibit a power-law distribution of labels, and the substantial proportion of the labels have very few training instances associated with them. (2) The computational costs in both training and testing of mutually independent classifiers would be practically prohibiting when the number of labels reaches hundreds of thousands or even millions.

In Ref. [5, 6], interactive methods are proposed for topic discovery in order to incorporate with user’s intention. However such interactive algorithms are for clustering, which introduces the uncertainty and randomness.

Specifically, representation learning is a fundamental problem in natural language processing and how to learn a structured representation for text classification is still challengeable. Unlike most existing representation models that either use no structure or rely on pre-specified structures, the reinforcement learning (RL) method that we apply is able to learn sentence representation by discovering optimized structures automatically [10]. However, there are few attempts to apply RL method into XMC tasks due to the intractable issues of XMC datasets.

The main contributions are summarized as follows:

  1. 1.

    We apply a reinforcement learning method and make a optimization for extreme multi-label text classification.

  2. 2.

    We re-examine the state of the art of XMC by conducting a comparative experimental evaluation of 5 methods which are most representative in text classification.

  3. 3.

    We develop a practical CMS online system for editors and perform extensive experimental validations for the proposed method.

The rest of this paper is organized as follows: Sect. 2 gives the related works and background of XMC. In Sect. 3 we give a detailed introduction about the RL method and Sect. 4 presents our extensive experiments and results, followed by conclusion and future work in Sect. 5.

2 Related Works and Background

2.1 Methods for Text Classification

Methods for comparison are outlined, including the most representative methods in XMC and some successful deep learning methods which are designed for mutli-class text classification but also applicable to XMC with minor adaptations.

a. FastText. FastText [7] is a simple yet effective baseline method for text classification, which is inspired by the recent work on efficient word representation learning, such as skip-gram and CBOW [8]. The representation of text is constructed by averaging the embeddings of the words, upon which a softmax layer is applied to map the document representation to class labels. This simplicity makes FastText very efficient to train and achieves state-of-the-art performances on both precision and time consuming. However, simply averaging input word embeddings with the shallow architecture for document-to-label mapping might limits its success in XMC. In XMC task, document presentations need to capture more high dimensional information for predicting multiple correlated labels and discriminating them from enormous numbers of irrelevant labels.

b. FastXML. FastXML [9] aims to develop an extreme multi-label classifier that is faster to train and more accurate at prediction and is considered as the state-of-the-art tree-based method for XMC. It learns a hierarchy of training instances and optimizes an NDCG-based objective at each node of the hierarchy. Specifically, a hyperplane parameterized by \(w \in {\mathbb {R}^D}\) is induced at each node, which splits the set of documents in the current node into two subsets; the ranking of the labels in each of the two subsets are jointly learned. The key idea is to have the documents in each subset sharing similar label distribution, and to characterize the distribution using a set-specific ranked list of labels. This is achieved by jointly maximizing NDCG scores of the ranked label lists in the two sibling subsets. In practice, an ensemble of multiple induced trees are learned to improve the robustness of predictions. At prediction time, each test document is passed from the root to a leaf node in each induced tree, and the label distributions in all the reached leaves are aggregated for the test document.

c. TextCNN. TextCNN [2] applys convolutional neural networks to text classification for the first time. TextCNN depicts three filter region sizes in the convolutional layer, each of which has two filters. Filters perform convolutions on the sentence matrix and generate feature maps which are fed to a max-pooling layer. By concatenating the outputs that generated from all six maps, a feature vector is generated. The final fully-connected layer with L softmax outputs corresponding to L labels uses it as inputs to classify the sentence representation. TextCNN obtains excellent performance in text classification, and is considered as a strong baseline comparative method. Considering the two distinct type of words embedding inputs, we distinguish TextCNN into two categories: Bow-CNN [3] and XML-CNN [1], which take bag of words feature and word vector as input respectively.

d. SLEEC. SLEEC [4] is most representative for target-embedding methods in XMC which extends embedding methods in multiple ways and uses KNN for classification stage.

2.2 Chinaso Application

The database “China Story” [12] is built by Chinaso of Xinhua News Agency with the aim of “telling China’s stories well, and making the voice of China heard”. A challenging problem at the database amounts to discovering relevant labels from an enormous output space of potential candidates for one story document: for example, suggesting keywords to editors labeling new stories, as well as to internet users starting new campaign on Chinaso website. Normally, this task was accomplished manually, which consumes a large amount of manpower and time.

3 Reinforcement Learning

In the RL [10] method, three components are interleaved together. The policy network adopts a stochastic policy and uses a delayed reward from the final classification network to guide the policy learning for structure discovery. While the state representation of policy network is derived from the representation models which contains two models: Information Distilled LSTM which selects important, task-relevant words to build sentence representation, and Hierarchical Structured LSTM which discovers phrase structures and builds sentence representation with a two-level LSTM. The final classification network makes prediction on top of structured sentence representation and facilitates reward computation for the policy network. Figure 1 gives the detailed illustration of the overall framework.

Generally, the most straightforward adaptation from the multi-class classification problems to multi-label ones would be to extend the traditional cross-entropy loss as follows:

$$\begin{aligned} \mathcal {L} = - \frac{1}{n}\sum \limits _{i = 1}^n {\sum \limits _{j = 1}^L {{y_{ij}}\log \left( {{{\hat{p}}_{ij}}} \right) } } \end{aligned}$$
(1)

where \(\varTheta \) denotes classification model parameters, \({\hat{p}}_{ij}\) is the model prediction for instance i on label j via a softmax activation. Specifically, in our RL method, we modify the loss function and introduce binary cross-entropy objective as a substitute which can be formulated as follows:

$$\begin{aligned} \mathop {\min }\limits _\Theta - \frac{1}{n}\sum \limits _{i = 1}^n {\sum \limits _{j = 1}^L {\left[ {{y_{ij}}\log \left( {\sigma \left( {{f_{ij}}} \right) } \right) + \left( {1 - {y_{ij}}} \right) \log \left( {1 - \sigma \left( {{f_{ij}}} \right) } \right) } \right] } } \end{aligned}$$
(2)

where \(\sigma \) is the sigmoid function .

Fig. 1.
figure 1

Overview of the overall RL network. The policy network (PNet) samples an action at each state. The structured representation model offers state representation to PNet and outputs the final sentence representation to the classification network (CNet). CNet performs text classification and provides reward to PNet.

4 Performance Evaluation

4.1 Datasets

We first use the well-known RCV1 [11] datasets to evaluate the above method, which contains manually categorized newswire stories made available by Reuters, Ltd.

Table 1. Datasets statistics of RCV1

In addition, based on the consideration of actual needs for online application and demand for evaluating the practical performance of XMC methods, we build our own practical Chinese datasets as well. The statistical details of the two datasets are displayed in Table 1.

4.2 Experimental Validation Results

a. Evaluation Metrics. To test the performance of different methods in XMC, we introduce several evaluation metrics. In XMC datasets, each instance only has very few relevant labels which means that how to present a short ranked list of potentially relevant labels for each test instance and evaluate the quality of such ranked lists with an emphasis on the relevance of the top portion of each list is far more important. As a result rank-based evaluation metrics have been commonly used for comparing XMC methods, including the precision at top K (P@K) and the Normalized Discounted Cummulated Gains (NDCG@K) [9]. We follow such convention and use these two metrics in our evaluation in this paper, with \(k = 1, 3, 5\). Denoting by \(\mathbf{{y}} \in {\left\{ {0,1} \right\} ^L}\) as the vector of true labels of an document, and \(\mathbf{{\hat{y}}} \in {{\mathop {\mathbb {R}}\nolimits } ^L}\) as the system-predicted score vector for the same document, the metrics are defined as:

$$\begin{aligned}&\mathrm{{P}}@k = \frac{1}{k}\sum \limits _{l \in {r_k}\left( {\mathbf{{\hat{y}}}} \right) } {{y_l}} \end{aligned}$$
(3)
$$\begin{aligned}&\mathrm{{DCG@}}k = \sum \limits _{l \in {r_k}\left( {\mathbf{{\hat{y}}}} \right) } {\frac{{{y_l}}}{{\log \left( {l + 1} \right) }}} \end{aligned}$$
(4)
$$\begin{aligned}&\mathrm{{NDCG@}}k = \frac{{\mathrm{{DCG@}}k}}{{\sum \nolimits _{l = 1}^{\min \left( {k,{{\left\| \mathbf{{y}} \right\| }_0}} \right) } {\frac{1}{{\log \left( {l + 1} \right) }}} }} \end{aligned}$$
(5)

where \({r_k}\left( {\mathbf{{\hat{y}}}} \right) \) is the set of rank indices of the truly relevant labels among the top-k portion of the system-predicted ranked list for a document, and \({{{\left\| \mathbf{{y}} \right\| }_0}}\) counts the number of relevant labels in the ground truth label vector \(\mathbf {y}\). \(\mathrm{{P}}@\mathrm{{K}}\) and \(\mathrm{{NDCG}}@\mathrm{{K}}\) are calculated for each test document and then averaged over all the documents.

b. Evaluation Results. Using these specific evaluation metrics, we conduct extensive experiments on the two datasets. Table 2 and 3 demonstrates the \(\mathrm{{P}}@\mathrm{{K}}\) and \(\mathrm{{G}}@\mathrm{{K}}\) results of the methods on RCV1 respectively, showing that RL method achieved competitive performance which consistently produced the best or the second best results on the datasets no matter when \(k = 1, 3\) or 5. Note that the RCV1 datasets have a higher number of training instances per class and the RL method is capable to learn accurate sentence representation, especially when \(k = 5\). Meanwhile, based on the actual needs for online application Fig. 2 displays the results on our Chinese datasets compared with other 5 state-of-the-art method. Note that owing to the small scale of practical datasets, all the methods obtained close performance, which strongly indicates that our RL method is able to discover the representation in small datasets and can be easily scaled to the larger datasets. However, due to the very limit manual scale of China Story database, all data-driven representation learning methods are hardly capable to capture the complete feature, while FastText works better under this circumstance on account of its simplify.

Table 2. Evaluation results in P@K on datasets RCV1
Table 3. Evaluation results in G@K on datasets RCV1
Fig. 2.
figure 2

Evaluation results on our ChinaStory datasets

5 Conclusion and Future Work

In this paper, we present reinforcement learning approach to discover the structured document representation for extreme multi-label text classification. Experimental validation in comparison with other state-of-the-art methods on datasets is conducted, demonstrating that on both public and practical datasets, expecting results have been obtained. However, There still exists a lot work to be further investigated. Firstly, we wish to discover more relevant label patterns from document for the end editors; Secondly, we hope to develop a more flexible graphic user interface (GUI) and integrate more high-level knowledge of the human’s intention as a feedback information into the model. Finally, we wish to discover more hierarchical structure of the label in the document in a coarse-to-fine manner.