Abstract
In recent years, huge volumes of research papers have become available on the World Wide Web. Metadata provides a good approach for organizing and retrieving these useful resources. Accordingly, automatic extraction of metadata from these papers and their bibliographies is meaningful and has been widely studied. In this paper, we utilize a bigram HMM (Hidden Markov Model) for automatic extraction of metadata (i.e. title, author, date, journal, pages, etc.) from bibliographies with various styles. Unlike the traditional HMM, which uses only word frequency, this model also considers the bigram sequential relation between words and their position information in text fields. We have evaluated the model on a real corpus downloaded from the Web and compared it with other methods. Experiments show that the bigram HMM yields the best results and seems to be the most promising candidate for metadata extraction from bibliographies.
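The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, under stated assumptions: the hidden states are metadata fields, and each emission probability mixes a word-frequency (unigram) term with a term conditioned on the preceding word. The interpolation weight `lam`, the add-one smoothing, and all function names are illustrative choices, not the authors' formulation, and the position information mentioned above is not modelled here.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical start symbol for both states and words


def train(tagged_refs):
    """Collect counts from references given as lists of (word, field) pairs."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_state][state]
    uni = defaultdict(lambda: defaultdict(int))    # uni[state][word]
    bi = defaultdict(lambda: defaultdict(int))     # bi[state][(prev_word, word)]
    ctx = defaultdict(lambda: defaultdict(int))    # ctx[state][prev_word]
    vocab = set()
    for ref in tagged_refs:
        prev_state, prev_word = START, START
        for word, state in ref:
            trans[prev_state][state] += 1
            uni[state][word] += 1
            bi[state][(prev_word, word)] += 1
            ctx[state][prev_word] += 1
            vocab.add(word)
            prev_state, prev_word = state, word
    return {"trans": trans, "uni": uni, "bi": bi, "ctx": ctx,
            "vocab": vocab, "states": sorted(uni)}


def log_trans(m, prev_state, state):
    """Add-one smoothed log P(state | prev_state)."""
    row = m["trans"].get(prev_state, {})
    return math.log((row.get(state, 0) + 1) / (sum(row.values()) + len(m["states"])))


def log_emit(m, state, prev_word, word, lam=0.5):
    """Log of an assumed mix of bigram and smoothed unigram emission estimates."""
    uni = m["uni"].get(state, {})
    unigram = (uni.get(word, 0) + 1) / (sum(uni.values()) + len(m["vocab"]) + 1)
    seen = m["ctx"].get(state, {}).get(prev_word, 0)
    bigram = m["bi"].get(state, {}).get((prev_word, word), 0) / seen if seen else 0.0
    return math.log(lam * bigram + (1 - lam) * unigram)


def decode(m, words):
    """Viterbi search for the most likely field label of each token."""
    if not words:
        return []
    states = m["states"]
    score = {s: log_trans(m, START, s) + log_emit(m, s, START, words[0]) for s in states}
    back = []
    for prev_word, word in zip(words, words[1:]):
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + log_trans(m, p, s))
            new_score[s] = (score[best] + log_trans(m, best, s)
                            + log_emit(m, s, prev_word, word))
            pointers[s] = best
        score = new_score
        back.append(pointers)
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy, hand-made training data (not from the paper's corpus).
toy_refs = [
    [("Smith", "author"), ("J", "author"), ("Parsing", "title"),
     ("citations", "title"), ("1999", "date")],
    [("Lee", "author"), ("K", "author"), ("Tagging", "title"),
     ("references", "title"), ("2001", "date")],
]
model = train(toy_refs)
labels = decode(model, ["Smith", "J", "Parsing", "citations", "1999"])
```

Because the previous word is observed rather than hidden, conditioning the emission on it leaves the Viterbi recursion unchanged; only the emission table grows. A real system would need a tokenizer for punctuation and a better treatment of unseen words than add-one smoothing.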
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yin, P., Zhang, M., Deng, Z., Yang, D. (2004). Metadata Extraction from Bibliographies Using Bigram HMM. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, Ep. (eds) Digital Libraries: International Collaboration and Cross-Fertilization. ICADL 2004. Lecture Notes in Computer Science, vol 3334. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30544-6_33
DOI: https://doi.org/10.1007/978-3-540-30544-6_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24030-3
Online ISBN: 978-3-540-30544-6
eBook Packages: Computer Science (R0)