Abstract
The replacement of textual units by synonymous canonical forms is an important prerequisite for many variants of automated text analysis. In scientific texts, one common normalization step is the consistent replacement of acronyms by their definitions. For many acronyms, the definition is found at the position in the text where the acronym is introduced and “expanded” to a synonymous sequence of full words. A recent approach by Park and Byrd [19] to detecting acronym-expansion pairs describes possible graphical correspondences between acronyms and expansions by means of fine-grained rules. Here we show how rule sets as used in [19] can be translated into hidden Markov models that abstract from details of the graphical correspondence and significantly improve recall. Stability in terms of precision is ensured by exploiting simple properties of the expansion, optionally reinforced by linguistic knowledge. With this extension of the original formalism, the introduction of large rule sets can be avoided, and a fixed model can be applied to a large variety of texts without retraining, with good recall and precision.
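To make the notion of a graphical correspondence between an acronym and its expansion concrete, the following is a minimal sketch only. The abstract does not specify the topology or parameters of the hidden Markov models, so the code substitutes a Viterbi-style dynamic program with hand-picked scores: each character of the expansion is either matched to the next acronym letter or skipped, and word-initial matches are rewarded. The function name `align` and all four score constants are assumptions made for illustration, not values from the paper or from [19].

```python
from typing import List, Optional, Tuple


def align(acronym: str, expansion: str) -> Optional[Tuple[float, List[int]]]:
    """Best in-order alignment of acronym letters to expansion characters.

    Returns (score, matched_positions), or None if the acronym letters do
    not occur in order in the expansion. Toy stand-in for a trained HMM.
    """
    a, e = acronym.lower(), expansion.lower()
    n, m = len(a), len(e)
    neg_inf = float("-inf")
    # Hypothetical hand-picked scores (assumptions, not trained probabilities).
    match_initial, match_inner = 2.0, 0.5
    skip_initial, skip_other = -2.0, 0.0

    def is_word_initial(j: int) -> bool:
        return e[j].isalnum() and (j == 0 or not e[j - 1].isalnum())

    # best[i][j]: best score after reading j expansion characters
    # with the first i acronym letters matched.
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(m):
        initial = is_word_initial(j)
        for i in range(n + 1):
            if best[i][j] == neg_inf:
                continue
            # Option 1: skip expansion character j.
            s = best[i][j] + (skip_initial if initial else skip_other)
            if s > best[i][j + 1]:
                best[i][j + 1] = s
                back[i][j + 1] = (i, False)
            # Option 2: match character j to the next acronym letter.
            if i < n and e[j] == a[i]:
                s = best[i][j] + (match_initial if initial else match_inner)
                if s > best[i + 1][j + 1]:
                    best[i + 1][j + 1] = s
                    back[i + 1][j + 1] = (i, True)
    if best[n][m] == neg_inf:
        return None
    # Trace back the matched positions along the best path.
    positions: List[int] = []
    i, j = n, m
    while j > 0:
        prev_i, matched = back[i][j]
        if matched:
            positions.append(j - 1)
        i, j = prev_i, j - 1
    positions.reverse()
    return best[n][m], positions
```

For example, `align("HMM", "hidden Markov model")` matches the three acronym letters to the three word-initial characters at positions 0, 7 and 14, while `align("XYZ", "hidden Markov model")` returns `None`. In a trained model, the fixed scores would be replaced by transition and emission probabilities.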
References
Andrade, M.A., Valencia, A.: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600–607 (1998)
Boguraev, B., Kennedy, C.: Applications of term identification terminology: domain description and content characterization. Nat. Lang. Eng. 5(1), 17–44 (1999)
Basili, R., Moschitti, A.: Intelligent NLP-driven text classification. Int. J. Artif. Intell. Tools 11(3), 389–423 (2002)
Teresa, C.M.: Terminology: Theory, Methods and Applications. John Benjamins Publishing Company, Amsterdam (1998)
Charniak, E.: Statistical Language Learning. MIT Press, Cambridge, MA (1993)
Cohen, J.D.: Highlights: language and domain independent automatic indexing terms for abstracting. J. Am. Soc. Inf. Sci. 46(3), 162–174 (1995)
Dagan, I., Church, K.W.: Termight: identifying and translating technical terminology. In: Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL’95), pp. 34–40 (1995)
Fung, P., McKeown, K.: A technical word and term translation aid using noisy parallel corpora across language groups. Mach. Transl. J. (Special Issue on New Tools for Human Translators) pp. 53–87 (1996)
Gaizauskas, R., Demetriou, G., Humphreys, K.: Term recognition in biological science journal articles. In: Proceedings of the Workshop on Computational Terminology for Medical and Biological Applications and 2nd International Conference on Natural Language Processing (NLP-2000), Patras, Greece, pp. 37–44 (2000)
Hirschman, L., Park, J.C., Tsuji, J., Wong, L., Wu, C.H.: Accomplishments and challenges in literature data mining for biology. Bioinformatics 18(12), 1553–1561 (2002)
Justeson, J.S., Katz, S.M.: Technical terminology: some linguistic properties and an algorithm for identification in text. Nat. Lang. Eng. 1(1), 9–27 (1995)
Larkey, L.S., Ogilvie, P., Price, M.A., Tamilio, B.: Acrophile: an automated acronym extractor and server. In: Proceedings of the 5th ACM International Conference on Digital Libraries (2000)
Lehnert, W., Soderland, S., Aronow, D., Feng, F.: Inductive text classification for medical applications. J. Exp. Theor. Artif. Intell. 7(1), 49–80 (1995)
Acronym/alias identification corpus of Brandeis University. http://www.medstract.org/gold-standards.html/ (2003)
Medline—Searchable with PubMed. http://www.ncbi.nlm.nih.gov/PubMed/. Service by the U.S. National Library of Medicine (2003)
Mikheev, A.: Periods, capitalized words, etc. Comput. Linguist. 28(3), 289–318 (2002)
Nenadić, G., Spasić, I., Ananiadou, S.: Automatic acronym acquisition and term variation management within domain-specific texts. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, vol. VI, pp. 2155–2162. European Language Resources Association (2002)
U.S. National Library of Medicine: Fact sheet Medline. http://www.nlm.nih.gov/pubs/factsheets/medline.html (2002)
Park, Y., Byrd, R.J.: Hybrid text mining for finding abbreviations and their definitions. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). http://citeseer.nj.nec.com/444674.html (2001)
Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: Proceedings of COLING’02 (2002)
Pustejovsky, J., Castaño, J., Cochran, B., Kotecki, M., Morrell, M., Rumshisky, A.: Linguistic knowledge extraction from medline: automatic construction of an acronym database. Updated version of a paper presented at Medinfo. http://medstract.org/publications.html (2001)
Paice, C.D., Jones, P.A.: The identification of important concepts in highly structured technical papers. In: Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 69–78 (1993)
Rabiner, L.R.: A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Swanson, D.R.: Medical literature as a potential source of new knowledge. Bull. Med. Libr. Assoc. 78(1), 29–37 (1990)
Taghva, K., Gilbreth, J.: Recognizing acronyms and their definitions. Technical Report 95-03, ISRI Information Science Research Institute, University of Nevada, Las Vegas (1995)
Taghva, K., Gilbreth, J.: Recognizing acronyms and their definitions. Int. J. Doc. Anal. Recog. 1(4), 191–198 (1999)
Wright, S.E., Budin, G. (eds.): Handbook of Terminology Management, vol. 1, Basic Concepts of Terminology Management. John Benjamins, Amsterdam (1997)
Yeates, S., Bainbridge, D., Witten, I.H.: Using compression to identify acronyms in text. In: Conference on Data Compression, p. 582 (2000)
Yeates, S.: Automatic extraction of acronyms from text. In: New Zealand Computer Science Research Students’ Conference, pp. 117–124 (1999)
Cite this article
Schumann, E.T., Schulz, K.U. Stable methods for recognizing acronym-expansion pairs: from rule sets to hidden Markov models. IJDAR 8, 1–14 (2006). https://doi.org/10.1007/s10032-005-0146-7