Multi-label Document Classification in Czech

Hrala, Michal; Král, Pavel

doi:10.1007/978-3-642-40585-3_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8082)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

This paper deals with automatic multi-label document classification in the context of a real application for the Czech News Agency. The main goal of this work is to compare and evaluate the three most promising multi-label document classification approaches on the Czech language. We show that the simple method based on a meta-classifier proposed by Zhu et al. significantly outperforms the other approaches. The improvement in classification error rate is about 13%. A further contribution of this work is the Czech document corpus, which is freely available for research purposes.

References

Hrala, M., Král, P.: Evaluation of the Document Classification Approaches. In: Burduk, R., Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds.) CORES 2013. AISC, vol. 226, pp. 875–885. Springer, Heidelberg (2013)
Bratko, A., Filipič, B.: Exploiting structural information for semi-structured document categorization. In: Information Processing and Management, pp. 679–694 (2004)
Della Pietra, S., Della Pietra, V., Lafferty, J.: Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 380–393 (1997)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3, 1289–1305 (2003)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning, ICML 1997, pp. 412–420. Morgan Kaufmann Publishers Inc., San Francisco (1997)
Galavotti, L., Sebastiani, F., Simi, M.: Experiments on the use of feature selection and negative evidence in automated text categorization. In: Borbinha, J.L., Baker, T. (eds.) ECDL 2000. LNCS, vol. 1923, pp. 59–68. Springer, Heidelberg (2000)
Lim, C.S., Lee, K.J., Kim, G.C.: Multiple sets of features for automatic genre classification of web documents. Information Processing and Management 41, 1263–1276 (2005)
Gomez, J.C., Moens, M.F.: PCA document reconstruction for email classification. Computational Statistics and Data Analysis 56, 741–751 (2012)
Yun, J., Jing, L., Yu, J., Huang, H.: A multi-layer text classification framework based on two-level representation model. Expert Systems with Applications 39, 2035–2046 (2012)
Novovičová, J., Malík, A., Pudil, P.: Feature selection using improved mutual information for text classification. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds.) SSPR&SPR 2004. LNCS, vol. 3138, pp. 1010–1017. Springer, Heidelberg (2004)
Novovičová, J., Somol, P., Haindl, M., Pudil, P.: Conditional mutual information based feature selection for classification task. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 417–426. Springer, Heidelberg (2007)
Forman, G., Guyon, I., Elisseeff, A.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289–1305 (2003)
Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3, 1–13 (2007)
Yaoyong, L., Shawe-Taylor, J.: Advanced learning algorithms for cross-language patent retrieval and classification. Information Processing & Management 43, 1183–1199 (2007)
Olsson, J.S.: Cross language text classification for MALACH (2004)
Wu, Y., Oard, D.W.: Bilingual topic aspect classification with a few training examples. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 203–210. ACM (2008)
Zhu, S., Ji, X., Xu, W., Gong, Y.: Multi-labelled classification using maximum entropy method. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 274–281. ACM (2005)
Hajič, J., Böhmová, A., Hajičová, E., Vidová-Hladká, B.: The Prague Dependency Treebank: A Three-Level Annotation Scenario. In: Abeillé, A. (ed.) Treebanks: Building and Using Parsed Corpora, pp. 103–127. Kluwer, Amsterdam (2000)

Author information

Authors and Affiliations

Dept. of Computer Science & Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
Michal Hrala & Pavel Král
NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
Pavel Král

Editor information

Editors and Affiliations

University of West Bohemia, 306 14, Pilsen, Czech Republic
Ivan Habernal & Václav Matoušek

About this paper

Cite this paper

Hrala, M., Král, P. (2013). Multi-label Document Classification in Czech. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_44

DOI: https://doi.org/10.1007/978-3-642-40585-3_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science, Computer Science (R0)
