Abstract
In this paper, we propose an improved Hidden Markov Model (HMM) for extracting metadata from academic literature. We built a dataset of 458 papers from the VLDB conferences that records the visual features of each text block. Our approach rests on the assumption that text blocks on the same line share the same state (information type); this assumption holds in more than 98% of cases. Consequently, the probability of a transition between identical states is much larger for blocks on the same line than for blocks on different lines. Based on this observation, we add a second state transition matrix to the HMM and modify the Viterbi algorithm accordingly. Experiments show that our extraction accuracy is superior to that of existing approaches.
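The abstract does not spell out the modified Viterbi algorithm, so the following is only a minimal sketch of one way the idea could look, assuming the decoder switches between two transition matrices depending on whether a text block lies on the same line as its predecessor. The function name, the `same_line` flags, and all parameter names are hypothetical and are not taken from the paper.

```python
import math


def viterbi_two_matrices(obs, same_line, start, trans_same, trans_diff, emit):
    """Return the most likely state path for a sequence of text blocks.

    Assumed (hypothetical) inputs:
      obs        -- observation symbol index for each text block
      same_line  -- same_line[t] is True if block t is on the same line
                    as block t-1 (same_line[0] is ignored)
      start      -- start[s]: initial probability of state s
      trans_same -- trans_same[r][s]: transition r -> s within a line
      trans_diff -- trans_diff[r][s]: transition r -> s across lines
      emit       -- emit[s][o]: probability that state s emits symbol o
    """
    if not obs:
        return []

    def log(p):
        # Work in log space to avoid underflow; log(0) becomes -inf.
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(start)
    # score[s]: best log-probability of any path ending in state s.
    score = [log(start[s]) + log(emit[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t-1][s]: best predecessor of state s at step t

    for t in range(1, len(obs)):
        # The only change from standard Viterbi: pick the matrix per step.
        trans = trans_same if same_line[t] else trans_diff
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda r: score[r] + log(trans[r][s]),
            )
            new_score.append(
                score[best_prev] + log(trans[best_prev][s]) + log(emit[s][obs[t]])
            )
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In this sketch, a within-line matrix with a heavy diagonal discourages state changes between blocks on the same line, while the across-line matrix leaves the decoder free to change state at a line break.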
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cui, BG., Chen, X. (2010). An Improved Hidden Markov Model for Literature Metadata Extraction. In: Huang, DS., Zhao, Z., Bevilacqua, V., Figueroa, J.C. (eds) Advanced Intelligent Computing Theories and Applications. ICIC 2010. Lecture Notes in Computer Science, vol 6215. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14922-1_26
DOI: https://doi.org/10.1007/978-3-642-14922-1_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14921-4
Online ISBN: 978-3-642-14922-1
eBook Packages: Computer Science, Computer Science (R0)