Abstract
Applications for domain-specific text processing often rely on glossaries and ontologies, and the main step in constructing such resources is term recognition. This paper surveys existing definitions of the term and its linguistic features, formulates the task of term recognition, and analyzes currently available methods for automatic term recognition: methods for candidate collection, methods based on statistics and on the contexts of term occurrences, methods using topic models, and methods based on external resources (such as text collections from other domains, ontologies, and Wikipedia). It also gives an overview of standard methodologies and datasets for experimental research.
Additional information
Original Russian Text © N.A. Astrakhantsev, D.G. Fedorenko, D.Yu. Turdakov, 2015, published in Programmirovanie, 2015, Vol. 41, No. 6.
Cite this article
Astrakhantsev, N.A., Fedorenko, D.G. & Turdakov, D.Y. Methods for automatic term recognition in domain-specific text collections: A survey. Program Comput Soft 41, 336–349 (2015). https://doi.org/10.1134/S036176881506002X
DOI: https://doi.org/10.1134/S036176881506002X