1 Task Definition

This task aims to evaluate automatic classification techniques for very short texts, i.e., Chinese news headlines. Each news headline (i.e., news title) is required to be classified into one or more predefined categories. With the rise of the Internet and social media, the text data on the web is growing exponentially. Having humans analyze all of this data is impractical, whereas machine learning techniques suit this kind of task perfectly. After all, human brain capacity is too limited and precious for such tedious and non-obvious work.

Formally, the task is defined as follows: given a news headline \(x=(x_1, x_2, ..., x_n)\), where \(x_j\) represents the jth word in x, the objective is to find its most likely category or label \(c\in \mathcal {C}\). More specifically, we need to find a function that predicts which category x belongs to:

$$\begin{aligned} c^* = \mathop {\mathrm{argmax}}\limits _{c\in \mathcal {C}} f(x;\theta _c), \end{aligned}$$
(1)

where \(\theta _c\) is the parameter of the scoring function for category c.
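As a concrete illustration, the decision rule in Eq. (1) can be sketched in a few lines of Python; the scores below are hypothetical placeholders standing in for \(f(x;\theta _c)\), not real model outputs:

```python
# Sketch of the decision rule in Eq. (1): choose the category whose
# scoring function f(x; theta_c) returns the highest value.
# These scores are hypothetical placeholders, not real model outputs.
scores = {"entertainment": 0.7, "sports": 1.9, "tech": 0.3}

c_star = max(scores, key=scores.get)  # argmax over categories
print(c_star)  # -> sports
```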

Table 1. The information of categories.

2 Data

We collected news headlines (titles) from several Chinese news websites, such as toutiao and sina.

There are 18 categories in total. The detailed information of each category is shown in Table 1. All the sentences are segmented using the Python Chinese segmentation tool jieba.

Some samples from the training dataset are shown in Table 2.

Table 2. Samples from dataset. The first column is Category and the second column is news headline.
Fig. 1. Distributions of headline length in characters (blue line) and in words. (Color figure online)

Table 3. Statistical information of the dataset.

Length. Figure 1 shows that most headlines are shorter than 40 characters, with a mean of 21.05 characters. Headlines are even shorter in words: most contain fewer than 20 words, with a mean of 12.07 (Table 3).
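These length statistics are straightforward to compute from the segmented titles. The headlines below are hypothetical examples of whitespace-segmented titles, not samples from the released dataset:

```python
# Sketch: character- and word-level headline lengths for
# whitespace-segmented titles (hypothetical examples).
headlines = [
    "中国 经济 稳步 增长",    # 8 characters, 4 words
    "世界杯 决赛 今晚 打响",  # 9 characters, 4 words
]

char_lens = [len(h.replace(" ", "")) for h in headlines]
word_lens = [len(h.split()) for h in headlines]

mean_char = sum(char_lens) / len(char_lens)  # 8.5
mean_word = sum(word_lens) / len(word_lens)  # 4.0
```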

The dataset is released on GitHub at https://github.com/FudanNLP/nlpcc2017_news_headline_categorization, along with code that implements three basic models.

3 Evaluation

We use the macro-averaged precision, recall, and F1 score to evaluate the performance.

The Macro Avg. is defined as follows:

$$ Macro\_avg = \frac{1}{m}\sum _{i=1}^{m}{\rho _i}$$

and the Micro Avg. is defined as:

$$ Micro\_avg = \frac{1}{N}\sum _{i=1}^{m}{w_i\rho _i}$$

where m denotes the number of classes, which is 18 for this dataset; \(\rho _i\) is the accuracy on the ith category; \(w_i\) is the number of test examples in the ith category; and N is the total number of examples in the test set.
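Under these definitions, the two averages can be computed as follows; the per-category accuracies and counts below are made-up values for illustration only:

```python
# Macro average: unweighted mean of the per-category accuracies rho_i.
def macro_avg(rho):
    return sum(rho) / len(rho)

# Micro average: per-category accuracies weighted by category sizes w_i,
# normalized by the total number of test examples N.
def micro_avg(rho, w):
    n = sum(w)
    return sum(w_i * r_i for w_i, r_i in zip(w, rho)) / n

# Hypothetical values: two categories with accuracies 1.0 and 0.5,
# containing 10 and 30 test examples respectively.
print(macro_avg([1.0, 0.5]))            # -> 0.75
print(micro_avg([1.0, 0.5], [10, 30]))  # -> 0.625
```

Note that the micro average weights each category by its size, so a large category dominates the score, while the macro average treats all 18 categories equally.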

Table 4. Results of the baseline models.

4 Baseline Implementations

As a branch of machine learning, deep learning (DL) has gained much attention in recent years due to its prominent achievements in several domains such as computer vision and natural language processing.

We have implemented some basic DL models: neural bag-of-words (NBoW), convolutional neural networks (CNN) [3], and long short-term memory networks (LSTM) [2].
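The simplest of these, NBoW, averages the word embeddings of a headline and applies a linear layer to score the 18 categories. The following is a minimal sketch of that forward pass; the dimensions and random weights are placeholders, not the released implementation:

```python
import numpy as np

# Hypothetical dimensions: vocabulary size, embedding dim, 18 categories.
rng = np.random.default_rng(0)
V, d, C = 1000, 50, 18
E = rng.normal(size=(V, d))   # word embedding table (would be learned)
W = rng.normal(size=(d, C))   # linear classifier weights (would be learned)
b = np.zeros(C)               # classifier bias

def nbow_logits(word_ids):
    """Average the embeddings of the headline's words, then apply a linear layer."""
    h = E[word_ids].mean(axis=0)
    return h @ W + b

logits = nbow_logits([3, 17, 42])   # a headline encoded as word indices
predicted = int(np.argmax(logits))  # index of the predicted category
```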

Empirically, 2 gigabytes of GPU memory should be sufficient for most models; if not, set the batch size to a smaller number.

The results generated from baseline models are shown in Table 4.

5 Participants Submitted Results

A total of 32 participants actively took part and submitted their predictions on the test set. The predictions are evaluated and the results are shown in Table 5.

Table 5. Results submitted by participants.

6 Some Representative Methods

In this section, we describe three representative methods.

[4] proposed a novel method that enhances the semantic representation of headlines. It first adds keywords extracted from the most similar news to expand the word features. Then, it uses an in-domain news corpus to pre-train the word embeddings so as to enhance the word representations. Finally, it applies the fastText classifier, which uses a linear method to classify texts with high speed and accuracy.

[1] developed a voting system based on convolutional neural networks (CNN), gated recurrent units (GRU), and support vector machine (SVM).

[5] proposed an efficient approach for Chinese news headline classification based on a multi-representation mixed model with attention and ensemble learning. It first models the headline semantics at both the character and word level via a bidirectional long short-term memory network (BiLSTM), taking the concatenation of the hidden-layer output states as the semantic representation. Meanwhile, it adopts an attention mechanism to highlight the key characters or words relevant to the classification decision. Finally, it uses ensemble learning to determine the final category of each test sample by sub-model voting.

7 Conclusion

Since machine learning techniques such as deep learning require large amounts of data, we have collected a considerable amount of news headline data and contributed it to the research community. We also found that the performance of news headline classification still needs to be improved. We hope that our dataset provides valuable training data and a testbed for the text classification task.