Abstract
In this paper, we give an overview of the shared task at the CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2017): Chinese News Headline Categorization. The dataset of this shared task consists of 18 classes, with 12,000 short texts along with the corresponding labels for each class. The dataset and example code can be accessed at https://github.com/FudanNLP/nlpcc2017_news_headline_categorization.
1 Task Definition
This task aims to evaluate automatic classification techniques for very short texts, i.e., Chinese news headlines. Each news headline (i.e., news title) is required to be classified into one or more predefined categories. With the rise of the Internet and social media, the text data on the web is growing exponentially. Having human beings analyze all of this data is impractical, while machine learning techniques are well suited to this kind of task. After all, human brain capacity is too limited and precious to spend on tedious and non-obvious phenomena.
Formally, the task is defined as follows: given a news headline \(x=(x_1, x_2, ..., x_n)\), where \(x_j\) represents the jth word in x, the objective is to find its possible category or label \(c\in \mathcal {C}\). More specifically, we need to find a function that predicts which category x belongs to.
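One common way to write such a predictor, assuming a scoring function \(f\) that assigns a score to each candidate category (the symbol \(f\) and the arg-max form are illustrative notation, not taken from the task description), is

```latex
\[
\hat{c} = \mathop{\arg\max}_{c \in \mathcal{C}} f(x, c; \theta),
\]
```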
where \(\theta \) is the parameter for the function.
2 Data
We collected news headlines (titles) from several Chinese news websites, such as Toutiao and Sina.
There are 18 categories in total. The detailed information of each category is shown in Table 1. All the sentences are segmented using jieba, a Chinese word segmentation tool for Python.
Some samples from the training dataset are shown in Table 2.
Length. Figure 1 shows that most headlines contain fewer than 40 characters, with a mean of 21.05. Measured in words, the headlines are even shorter: most contain fewer than 20 words, with a mean of 12.07 (Table 3).
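As an illustration of how such length statistics can be computed, the following minimal sketch assumes that each headline is stored as one line of text whose words are separated by spaces after segmentation. The function name `length_stats` and the sample headlines are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def length_stats(segmented_headlines):
    """Return (mean character length, mean word length) for headlines
    whose words are separated by whitespace."""
    char_lengths = []
    word_lengths = []
    for line in segmented_headlines:
        words = line.split()
        word_lengths.append(len(words))
        # Characters are counted without the spaces added by segmentation.
        char_lengths.append(sum(len(word) for word in words))
    return mean(char_lengths), mean(word_lengths)


# Hypothetical, hand-segmented examples (not taken from the dataset).
samples = [
    "今天 天气 很 好",
    "股市 大涨",
]
mean_chars, mean_words = length_stats(samples)
print(mean_chars, mean_words)
```

Counting characters after removing the separators keeps the character statistic independent of the segmentation tool that was used.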
The dataset is released on GitHub at https://github.com/FudanNLP/nlpcc2017_news_headline_categorization, along with code that implements three basic models.
3 Evaluation
We use the macro-averaged precision, recall and F1 to evaluate the performance.
The Macro Avg. is defined as follows:
\[ \mathrm{Macro\ Avg} = \frac{1}{m} \sum_{i=1}^{m} \rho_i, \]
and the Micro Avg. is defined as:
\[ \mathrm{Micro\ Avg} = \frac{1}{N} \sum_{i=1}^{m} w_i \rho_i, \]
where m denotes the number of classes (18 for this dataset), \(\rho _i\) is the accuracy of the ith category, \(w_i\) is the number of test examples in the ith category, and N is the total number of examples in the test set.
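Reading the macro average as the unweighted mean of the per-class scores \(\rho_i\) and the micro average as their mean weighted by \(w_i / N\) (an interpretation of the definitions given here), the two can be sketched as follows. The function names and the example values are hypothetical and are not results from the shared task.

```python
def macro_average(per_class_scores):
    """Unweighted mean of the per-class scores."""
    return sum(per_class_scores) / len(per_class_scores)


def micro_average(per_class_scores, class_counts):
    """Mean of the per-class scores, weighted by the number of test
    examples in each class."""
    if len(per_class_scores) != len(class_counts):
        raise ValueError("scores and counts must have the same length")
    total = sum(class_counts)
    weighted = sum(w * s for w, s in zip(class_counts, per_class_scores))
    return weighted / total


# Hypothetical values for three classes (not results from the shared task).
rho = [0.9, 0.8, 0.5]
w = [100, 100, 50]
print(macro_average(rho))
print(micro_average(rho, w))
```

When all classes have the same number of test examples, the two averages coincide; they differ only when the class sizes are unbalanced.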
4 Baseline Implementations
As a branch of machine learning, deep learning (DL) has gained much attention in recent years due to its prominent achievements in several domains, such as computer vision and natural language processing.
We have implemented some basic DL models, such as neural bag-of-words (NBoW), convolutional neural networks (CNN) [3], and long short-term memory networks (LSTM) [2].
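To illustrate the simplest of these baselines, the following is a minimal sketch of an NBoW forward pass, assuming mean-pooled word embeddings followed by a single linear layer and a softmax. The parameters are random and untrained, all sizes other than the 18 classes are hypothetical, and the code is not the released baseline implementation.

```python
import math
import random


def nbow_forward(token_ids, embeddings, weights, bias):
    """Average the word vectors, then apply a linear layer and softmax."""
    dim = len(embeddings[0])
    # Bag-of-words representation: mean of the token embeddings.
    avg = [sum(embeddings[t][d] for t in token_ids) / len(token_ids)
           for d in range(dim)]
    # One logit per class.
    logits = [sum(w_d * a_d for w_d, a_d in zip(row, avg)) + b
              for row, b in zip(weights, bias)]
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
vocab_size, dim, num_classes = 10, 4, 18  # 18 classes as in the task
embeddings = [[random.uniform(-1, 1) for _ in range(dim)]
              for _ in range(vocab_size)]
weights = [[random.uniform(-1, 1) for _ in range(dim)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = nbow_forward([1, 3, 5], embeddings, weights, bias)
predicted = max(range(num_classes), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

Because the word vectors are averaged, the prediction does not depend on word order, which is the defining property of a bag-of-words model.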
Empirically, 2 GB of GPU memory should be sufficient for most models; if it is not, reduce the batch size.
The results of the baseline models are shown in Table 4.
5 Participants Submitted Results
In total, 32 participants took part and submitted their predictions on the test set. The predictions were evaluated, and the results are shown in Table 5.
6 Some Representative Methods
In this section, we describe three representative methods.
[4] proposed a novel method that enhances the semantic representation of headlines. It first adds keywords extracted from the most similar news to expand the word features. Then, it uses a news-domain corpus to pre-train the word embeddings so as to enhance the word representation. Finally, it uses the fastText classifier, a linear method that classifies texts quickly and with high accuracy.
[1] developed a voting system based on convolutional neural networks (CNN), gated recurrent units (GRU), and support vector machine (SVM).
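A voting system of this kind can be sketched as simple majority voting over the labels predicted by each classifier. The exact voting scheme of [1] is not described here, so the tie-breaking rule below (prefer the label of the earliest model in the list) is an assumption, and the function names and label values are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest model's label."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def vote_all(model_outputs):
    """model_outputs[k][i] is the label that model k assigns to example i."""
    return [majority_vote(list(labels)) for labels in zip(*model_outputs)]


# Hypothetical outputs of three classifiers on four headlines.
cnn = ["sports", "finance", "tech", "game"]
gru = ["sports", "tech", "tech", "car"]
svm = ["travel", "finance", "tech", "food"]
print(vote_all([cnn, gru, svm]))
```

Ordering the models from strongest to weakest makes the tie-breaking rule fall back on the most reliable single classifier.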
[5] proposed an efficient approach for Chinese news headline classification based on a multi-representation mixed model with attention and ensemble learning. It first models the headline semantics at both the character and word levels via bi-directional long short-term memory (BiLSTM), using the concatenation of the output states from the hidden layer as the semantic representation. Meanwhile, it adopts an attention mechanism to highlight the key characters or words related to the classification decision. Finally, it uses ensemble learning to determine the final category for all test samples by voting among sub-models.
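The attention step can be illustrated with a minimal sketch that pools a sequence of hidden states into a single vector. It assumes dot-product scoring against a query vector, which is one common choice; the scoring function actually used in [5] may differ, and the function name and toy values are hypothetical.

```python
import math


def attention_pool(hidden_states, query):
    """Weight each hidden state by softmax(h_t . query) and sum them."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    # Numerically stable softmax over the time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights


# Hypothetical 2-dimensional states for a three-token headline.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 2.0]
pooled, weights = attention_pool(states, query)
print(weights)
```

The token whose hidden state aligns best with the query receives the largest weight, so it contributes most to the pooled representation passed to the classifier.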
7 Conclusion
Since machine learning techniques like deep learning require large amounts of data, we have collected a considerable amount of news headline data and contributed it to the research community. We also found that the performance of news headline classification still needs to be improved. We hope that our dataset provides valuable training data and a testbed for the text classification task.
References
1. Zhu, F., Dong, X., Song, R., Hong, Y., Zhu, Q.: A multiple learning model based voting system for news headline classification. In: Proceedings of the CCF Conference on Natural Language Processing & Chinese Computing (2017)
2. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
3. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
4. Yin, Z., Tang, J., Chengsen, Luo, W., Luo, Z., Ma, X.: A semantic representation enhancement method for Chinese news headline classification. In: Proceedings of the CCF Conference on Natural Language Processing & Chinese Computing (2017)
5. Lu, Z., Liu, W., Zhou, Y., Hu, X., Wang, B.: An effective approach for Chinese news headline classification based on multi-representation mixed model with attention and ensemble learning. In: Proceedings of the CCF Conference on Natural Language Processing & Chinese Computing (2017)
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Qiu, X., Gong, J., Huang, X. (2018). Overview of the NLPCC 2017 Shared Task: Chinese News Headline Categorization. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science(), vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_85
DOI: https://doi.org/10.1007/978-3-319-73618-1_85
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73617-4
Online ISBN: 978-3-319-73618-1
eBook Packages: Computer Science, Computer Science (R0)