1 Task Definition

This task aims to evaluate automatic classification techniques for very short texts, i.e., Chinese news headlines. Each news headline (i.e., news title) is required to be classified into one or more predefined categories. With the rise of the Internet and social media, the text data on the web is growing exponentially. Having humans analyze all of this data is impractical, whereas machine learning techniques suit this kind of task perfectly. After all, human brain capacity is too limited and precious for such tedious and non-obvious work.

Formally, the task is defined as follows: given a news headline \(x=(x_1, x_2, ..., x_n)\), where \(x_j\) represents the jth word in x, the objective is to find its most likely category or label \(c\in \mathcal {C}\). More specifically, we need to find a function that predicts which category x belongs to:

$$\begin{aligned} c^* = \mathop {\mathrm{argmax}}\limits _{c\in \mathcal {C}} f(x;\theta _c), \end{aligned}$$
(1)

where \(\theta _c\) is the parameter of the scoring function for category c.
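As a concrete illustration, the decision rule in Eq. (1) can be sketched in a few lines of Python; the scores below are hypothetical placeholders standing in for \(f(x;\theta _c)\), not real model outputs:

```python
# Sketch of the decision rule in Eq. (1): choose the category whose
# scoring function f(x; theta_c) returns the highest value.
# These scores are hypothetical placeholders, not real model outputs.
scores = {"entertainment": 0.7, "sports": 1.9, "tech": 0.3}

c_star = max(scores, key=scores.get)  # argmax over categories
print(c_star)  # -> sports
```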

Table 1. The information of categories.

2 Data

We collected news headlines (titles) from several Chinese news websites, such as toutiao and sina.

There are 18 categories in total. The detailed information of each category is shown in Table 1. All the sentences are segmented using the Python Chinese segmentation tool jieba.

Some samples from the training dataset are shown in Table 2.

Table 2. Samples from dataset. The first column is Category and the second column is news headline.
Fig. 1. Distributions of headline length in characters (blue line) and in words. (Color figure online)

Table 3. Statistical information of the dataset.

Length. Figure 1 shows that most headlines are shorter than 40 characters, with a mean of 21.05 characters. Headlines are even shorter in words: most contain fewer than 20 words, with a mean of 12.07 (Table 3).
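These length statistics are straightforward to compute from the segmented titles. The headlines below are hypothetical examples of whitespace-segmented titles, not samples from the released dataset:

```python
# Sketch: character- and word-level headline lengths for
# whitespace-segmented titles (hypothetical examples).
headlines = [
    "中国 经济 稳步 增长",    # 8 characters, 4 words
    "世界杯 决赛 今晚 打响",  # 9 characters, 4 words
]

char_lens = [len(h.replace(" ", "")) for h in headlines]
word_lens = [len(h.split()) for h in headlines]

mean_char = sum(char_lens) / len(char_lens)  # 8.5
mean_word = sum(word_lens) / len(word_lens)  # 4.0
```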

The dataset is released on GitHub at https://github.com/FudanNLP/nlpcc2017_news_headline_categorization, along with code that implements three basic models.

3 Evaluation

We use the macro-averaged precision, recall, and F1 score to evaluate the performance.

The Macro Avg. is defined as follows:

$$ Macro\_avg = \frac{1}{m}\sum _{i=1}^{m}{\rho _i}$$

and the Micro Avg. is defined as:

$$ Micro\_avg = \frac{1}{N}\sum _{i=1}^{m}{w_i\rho _i}$$

where m denotes the number of classes, which is 18 for this dataset; \(\rho _i\) is the accuracy on the ith category; \(w_i\) is the number of test examples in the ith category; and N is the total number of examples in the test set.
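Under these definitions, the two averages can be computed as follows; the per-category accuracies and counts below are made-up values for illustration only:

```python
# Macro average: unweighted mean of the per-category accuracies rho_i.
def macro_avg(rho):
    return sum(rho) / len(rho)

# Micro average: per-category accuracies weighted by category sizes w_i,
# normalized by the total number of test examples N.
def micro_avg(rho, w):
    n = sum(w)
    return sum(w_i * r_i for w_i, r_i in zip(w, rho)) / n

# Hypothetical values: two categories with accuracies 1.0 and 0.5,
# containing 10 and 30 test examples respectively.
print(macro_avg([1.0, 0.5]))            # -> 0.75
print(micro_avg([1.0, 0.5], [10, 30]))  # -> 0.625
```

Note that the micro average weights each category by its size, so a large category dominates the score, while the macro average treats all 18 categories equally.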

Table 4. Results of the baseline models.

4 Baseline Implementations

As a branch of machine learning, deep learning (DL) has gained much attention in recent years due to its prominent achievements in several domains such as computer vision and natural language processing.

We have implemented some basic DL models: neural bag-of-words (NBoW), convolutional neural networks (CNN) [3], and long short-term memory networks (LSTM) [2].
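The simplest of these, NBoW, averages the word embeddings of a headline and applies a linear layer to score the 18 categories. The following is a minimal sketch of that forward pass; the dimensions and random weights are placeholders, not the released implementation:

```python
import numpy as np

# Hypothetical dimensions: vocabulary size, embedding dim, 18 categories.
rng = np.random.default_rng(0)
V, d, C = 1000, 50, 18
E = rng.normal(size=(V, d))   # word embedding table (would be learned)
W = rng.normal(size=(d, C))   # linear classifier weights (would be learned)
b = np.zeros(C)               # classifier bias

def nbow_logits(word_ids):
    """Average the embeddings of the headline's words, then apply a linear layer."""
    h = E[word_ids].mean(axis=0)
    return h @ W + b

logits = nbow_logits([3, 17, 42])   # a headline encoded as word indices
predicted = int(np.argmax(logits))  # index of the predicted category
```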

Empirically, 2 gigabytes of GPU memory should be sufficient for most models; if not, set the batch size to a smaller number.

The results generated from baseline models are shown in Table 4.

5 Participants Submitted Results

A total of 32 participants actively took part and submitted their predictions on the test set. The predictions are evaluated and the results are shown in Table 5.

Table 5. Results submitted by participants.

6 Some Representative Methods

In this section, we describe three representative methods.

[4] proposed a novel method that enhances the semantic representation of headlines. It first adds keywords extracted from the most similar news to expand the word features. Then, it uses an in-domain news corpus to pre-train the word embeddings so as to enhance the word representations. Finally, it applies the fastText classifier, which uses a linear method to classify texts with high speed and accuracy.

[1] developed a voting system based on convolutional neural networks (CNN), gated recurrent units (GRU), and support vector machine (SVM).

[5] proposed an efficient approach for Chinese news headline classification based on a multi-representation mixed model with attention and ensemble learning. It first models the headline semantics at both the character and word level via a bidirectional long short-term memory network (BiLSTM), taking the concatenation of the hidden-layer output states as the semantic representation. Meanwhile, it adopts an attention mechanism to highlight the key characters or words relevant to the classification decision. Finally, it uses ensemble learning to determine the final category of each test sample by sub-model voting.

7 Conclusion

Since machine learning techniques such as deep learning require large amounts of data, we have collected a considerable amount of news headline data and contributed it to the research community. We also found that the performance of news headline classification still needs to be improved. We hope that our dataset provides valuable training data and a testbed for the text classification task.