An Improved Hidden Markov Model for Literature Metadata Extraction

Cui, Bin-Ge; Chen, Xin

doi:10.1007/978-3-642-14922-1_26


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6215)

Included in the following conference series:

International Conference on Intelligent Computing


Abstract

In this paper, we propose an improved Hidden Markov Model (HMM) for extracting metadata from academic literature. We built a dataset of 458 papers from the VLDB conferences that records the visual features of text blocks. Our approach rests on the assumption that text blocks in the same line share the same state (information type); this assumption holds in more than 98% of cases. Consequently, the probability of staying in the same state is much larger between text blocks within one line than between text blocks in different lines. Based on this observation, we add a second state transition matrix to the HMM and modify the Viterbi algorithm accordingly. Experiments show that our extraction accuracy is superior to that of existing work.
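The abstract does not give the details of the modified Viterbi algorithm, so the following is only a minimal sketch of one plausible reading: decoding picks between two transition matrices depending on whether the current text block lies on the same line as the previous one. The function name `viterbi_two_matrices`, the `same_line` flags, and all probabilities in the example are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def viterbi_two_matrices(obs, same_line, pi, A_same, A_diff, B):
    """Viterbi decoding with a line-dependent transition matrix (assumed design).

    obs       : observation indices, one per text block
    same_line : same_line[t] is True if block t is on the same line as block t-1
                (same_line[0] is ignored)
    pi        : initial state probabilities, shape (N,)
    A_same    : transition matrix used within a line, shape (N, N)
    A_diff    : transition matrix used across lines, shape (N, N)
    B         : emission probabilities, shape (N, M)
    """
    pi, A_same, A_diff, B = (np.asarray(x, dtype=float)
                             for x in (pi, A_same, A_diff, B))
    tiny = 1e-12  # avoids log(0) for zero probabilities
    log_pi, log_B = np.log(pi + tiny), np.log(B + tiny)
    log_A = {True: np.log(A_same + tiny), False: np.log(A_diff + tiny)}

    T, N = len(obs), len(pi)
    score = np.empty((T, N))            # best log-probability ending in each state
    back = np.zeros((T, N), dtype=int)  # best predecessor for each state
    score[0] = log_pi + log_B[:, obs[0]]

    for t in range(1, T):
        trans = log_A[bool(same_line[t])]     # choose the matrix by layout
        cand = score[t - 1][:, None] + trans  # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_B[:, obs[t]]

    # Trace the best path backwards from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example (made-up numbers): state 0 = title, state 1 = author;
# observation 0 = large font, observation 1 = small font.
pi = [0.9, 0.1]
A_same = [[0.98, 0.02], [0.02, 0.98]]  # within a line the state rarely changes
A_diff = [[0.30, 0.70], [0.10, 0.90]]  # across lines a change is likely
B = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 1, 1]

with_layout = viterbi_two_matrices(obs, [False, True, False], pi, A_same, A_diff, B)
without_layout = viterbi_two_matrices(obs, [False, False, False], pi, A_same, A_diff, B)
print(with_layout, without_layout)
```

In this toy setup the second block sits on the same line as the first, so the within-line matrix discourages a state change there even though its font looks like an author block; treating every block as starting a new line lets the decoder switch state earlier.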



Author information

Authors and Affiliations

College of Information Science and Engineering, Shandong University of Science and Technology, 266510, Qingdao, China
Bin-Ge Cui & Xin Chen


Editor information

Editors and Affiliations

Intelligent Computing Laboratory, Chinese Academy of Sciences, P.O. Box 1130, 230031, Hefei, Anhui, China
De-Shuang Huang
Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Avenue, Suite 600, 37203, Nashville, TN, USA
Zhongming Zhao
Electrical and Electronics Department, Polytechnic of Bari, Via Orabona 4, 70125, Bari, Italy
Vitoantonio Bevilacqua
Faculty of Engineering, District University Francisco José de Caldas, Cra. 7a, No. 40-53, Fifth Floor, Bogotá, Colombia
Juan Carlos Figueroa


About this paper

Cite this paper

Cui, BG., Chen, X. (2010). An Improved Hidden Markov Model for Literature Metadata Extraction. In: Huang, DS., Zhao, Z., Bevilacqua, V., Figueroa, J.C. (eds) Advanced Intelligent Computing Theories and Applications. ICIC 2010. Lecture Notes in Computer Science, vol 6215. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14922-1_26


DOI: https://doi.org/10.1007/978-3-642-14922-1_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14921-4
Online ISBN: 978-3-642-14922-1
eBook Packages: Computer Science, Computer Science (R0)
