Multi-Label Classification of Historical Documents by Using Hierarchical Attention Networks

Kim, Dong-Kyum; Lee, Byunghwee; Kim, Daniel; Jeong, Hawoong

doi:10.3938/jkps.76.368

Published: 13 March 2020

Volume 76, pages 368–377 (2020)

Journal of the Korean Physical Society


Abstract

The quantitative analysis of digitized historical documents has begun in earnest in recent years. Text classification is of particular importance for quantitative historical analysis because it helps researchers search the literature efficiently and identify the important subjects of a particular age. Although numerous historians have worked together to classify large-scale historical documents, consistent classification among individual researchers has not been achieved. In this study, we present a classification method for large-scale historical data that uses a recently developed supervised learning algorithm called the Hierarchical Attention Network (HAN). By applying various classification methods to the Annals of the Joseon Dynasty (AJD), we show that HAN is more accurate than conventional techniques based on word-frequency features. HAN also quantifies the extent to which a particular sentence or word contributes to the classification through a value called 'attention'. We extract the representative keywords of various categories by using the attention mechanism and show the evolution of these keywords over the 472-year span of the AJD. Our results reveal two broad groups of event categories in the AJD. In one group, the representative keywords of the categories were stable over long periods, whereas in the other group the keywords varied rapidly, indicating that the characteristics of those categories changed repeatedly. Observing such macroscopic changes in representative words may provide insight into how a particular topic changes over a historical period.
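To make the role of 'attention' concrete, the following is a minimal sketch of attention pooling over the words of one sentence and of ranking words by their attention weights. It is an illustration under stated assumptions, not the authors' implementation: the recurrent encoder and learned projection of a full HAN are omitted, and the function names (`softmax`, `attention_pool`, `top_keywords`), the toy word vectors, and the context vector are all hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Score each vector against a context vector and pool by the weights.

    Simplified: a full HAN would first pass each vector through a recurrent
    encoder and a learned tanh projection; both are omitted here.
    """
    scores = [sum(x * c for x, c in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(len(context))
    ]
    return weights, pooled


def top_keywords(words, weights, k=2):
    """Return the k words with the largest attention weights."""
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical toy inputs, not taken from the AJD.
words = ["king", "ordered", "troops", "border"]
vectors = [[0.1, 0.3], [0.0, 0.2], [2.0, 0.1], [1.0, 0.5]]
context = [1.0, 0.0]  # stands in for a learned context vector

weights, pooled = attention_pool(vectors, context)
print(top_keywords(words, weights))
```

In this toy setting, "troops" and "border" receive the largest weights because their vectors align most closely with the context vector. In a hierarchical model the same pooling step would be applied twice, once over words to build sentence vectors and once over sentences to build a document vector.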


Note: The ratio of people's names to verbs and nouns in each category is as follows: Royal 0.11, Military 0.13, Diplomacy 0.18, Finance 0.10, Agriculture 0.10, Science 0.01, Politics 0.58, Administration 0.30, Personnel 0.60, Jurisdiction 0.51, Rebellion 0.60, Philosophy 0.25 and History 0.42.

Acknowledgments

This work was supported by the National Research Foundation of Korea (Grant No. 2017R1A2B3006930).

Author information

Dong-Kyum Kim and Byunghwee Lee contributed equally to this work.

Authors and Affiliations

Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Korea
Dong-Kyum Kim, Byunghwee Lee & Hawoong Jeong
Merck Sharp and Dohme Korea, Seoul, 04637, Korea
Daniel Kim
Institute for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Korea
Hawoong Jeong
Asia Pacific Center for Theoretical Physics, Pohang, 37673, Korea
Hawoong Jeong

Corresponding author

Correspondence to Hawoong Jeong.

About this article

Cite this article

Kim, DK., Lee, B., Kim, D. et al. Multi-Label Classification of Historical Documents by Using Hierarchical Attention Networks. J. Korean Phys. Soc. 76, 368–377 (2020). https://doi.org/10.3938/jkps.76.368

Received: 03 September 2019
Accepted: 17 September 2019
Published: 13 March 2020
Issue Date: March 2020
DOI: https://doi.org/10.3938/jkps.76.368
