Abstract
The high complexity of natural language and the huge amount of human and temporal resources necessary for producing the grammars lead several researchers in the area of Natural Language Processing to investigate various solutions for automating grammar generation and updating processes. Many algorithms for Context-Free Grammar inference have been developed in the literature. This paper provides a survey of the methodologies for inferring context-free grammars from examples, developed by researchers in the last decade. After introducing some preliminary definitions and notations concerning learning and inductive inference, some of the most relevant existing grammatical inference methods for Natural Language are described and classified according to the kind of presentation (if text or informant) and the type of information (if supervised, unsupervised, or semi-supervised). Moreover, the state of the art of the strategies for evaluation and comparison of different grammar inference methods is presented. The goal of the paper is to provide a reader with introduction to major concepts and current approaches in Natural Language Learning research.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Adriaans PW (2001) Learning shallow context-free languages under simple distributions. In: Opestake A, Vermeulen K (eds) Algebras, diagrams and decisions in language, logic and computation, CSLI/CUP
Adriaans PW (1992) Language learning from a categorial perspective. PhD thesis, University of Amsterdam, Amsterdam
Adriaans PW, Vervoort M (2002) The EMILE 4.1 grammar induction toolbox. In: Adriaans P, Fernau H, van Zaanen M (eds) Grammatical inference: algorithms and applications: 6th international colloquium: ICGI 2002. Lecture notes in computer science, vol 2484. Springer, Heidelberg, pp 293–295
Angluin D (1982) Inference of reversible languages. J ACM 29: 741–765
Baker JK (1979) Trainable grammars for speech recognition. In: Klatt DH, Wolf JJ (eds) Speech communication papers for the 97th meeting of the acoustical society of America, pp 547–550
Black E, Abney S, Flickinger D, Gdaniec C, Grishman R, Harrison P, Hindle D, Ingria R, Jelinek F, Klavans J, Liberman M, Marcus M, Roukos S, Santorini B, Strzalkowski T (1991) A procedure for quantitatively comparing the syntactic coverage of English grammars. In: Proceedings of the DARPA speech and natural language workshop, pp 306–311
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with cotraining. In: Proceedings of the workshop on computational learning theory
Bonnema R, Bod R, Scha R (1997) A DOP model for semantic interpretation. In: ACL 1997, pp 159–167
Briscoe T (2000) Grammatical acquisition: inductive bias and coevolution of language and the language acquisition device. Language, pp 245–296
Charniak E, Johnson M (2005) Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In: Proceedings of the 43rd annual meeting of the ACL, Ann Arbor, pp 173–180
Charniak E (1997) Statistical parsing with a context-free grammar and word statistics. In: Proceedings of the fourteenth national conference on artificial intelligence, Menlo Park. AAAI Press/MIT Press
Chomsky N (1957) Syntactic Structures. The Hague Mouton.
Clark A (2001) Unsupervised induction of stochastic context-free grammars using distributional clustering. In: ConLL ‘01: Proceedings of the 2001 workshop on computational natural language learning, Morristown, NJ, USA. Association for Computational Linguistics, pp 1–8
Cramer B (2007) Limitations of current grammar induction algorithms. In: Proceedings of the 45th annual meeting of the ACL: student research workshop, June 25–26, 2007, Prague, Czech Republic
Déjean H (2000) ALLiS: a symbolic learning system for natural language learning. In: Cardie C, Daelemans W, N’edellec C, Tjong Kim Sang E (eds) Proceedings of the fourth conference on computational natural language learning and of the second learning language in logic workshop; Lisbon, Portugal. Held in cooperation with ICGI-2000, pp 95–98
de la Higuera C, Oncina J (2003) Identification with Probability One of Stochastic Deterministic Linear Languages. In: Proceedings of ALT 2003. Springer, Berlin, Heidelberg, pp 134–148
Denis F (1998) Pac learning from positive statistical queries. In: Proceedings of 9th international conference on algorithmic learning theory—ALT ‘98, Springer, pp 112–126
Edelman S, Solan Z, Horn D, Ruppin E (2005) Learning syntactic constructions from raw corpora. In: 29th Boston University conference on language development, Cascadilla Press
Emerald JD, Subramanian KG, Thomas DG (1996) Learning code regular and code linear languages. In: Proceedings of international colloquium on grammatical inference (ICGI-96), lecture notes in artificial intelligence 1147, Springer, pp 211–221
Garcia P, Vidal E (1990) Inference of K-testable languages in the strict sense and applications to syntactic pattern recognition. J IEEE Trans Pattern Anal Mach Intell 12(9): 920–925
Gold EM (1967) Language identification in the limit. Inform Control 10: 447–474
Hänig C, Bordag S, Quasthoff U (2008) UnsuParse: unsupervised parsing with unsupervised part of speech tagging. In: Proceedings of the sixth international language resources and evaluation (LREC 2008)
Hopcroft JE, Ullman JE (1979) Introduction to automata theory, languages, and computation. Addison-Wesley, New York
Horning JJ (1969) A study of grammatical inference. PhD thesis, Stanford University, Stanford:CA, USA
Kasami T (1965) An efficient recognition and syntax analysis algorithm for context-free languages. Science report, Air Force Cambridge Research Laboratory, Bedford
Koshiba T, Makinen E, Takada Y (1997) Inferring pure context-free languages from positive data. Technical report A-1997-14, Department of Computer Science, University of Tampere
Langley P, Stromsten S (2000) Learning context-free grammars with a simplicity bias. In: Proceedings of the eleventh European conference on machine learning (ECML 2000), lecture notes in artificial intelligence 1810, Springer, pp 220–228
Levenshtein VI (1965) Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSR 163(4): 845–848 (Original in Russian)
MacWhinney B (1991) The CHILDES project: tools for analyzing talk. Erlbaum, Mahwah
Marcus M, Santorini B, Marcinkiewicz M (1993) Building a large annotated corpus of English: the Penn treebank. Comput Linguist 19(2): 313–330
McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proceedings of HLT-NAACL 2006
Nakamura K (2003) Incremental learning of context free grammars by extended inductive cyk algorithm. In: Higuera C, Adriaans PW, Zaanen M, Oncina J (eds) ECML workshop on learning contex-free grammars. Ruder Boskovic Institute, Zagreb, pp 53–64
Nakamura K, Matsumoto M (2002) Incremental learning of context free grammars. In: ICGI ‘02: proceedings of the 6th international colloquium on grammatical inference (London, UK), Springer, pp 174–184
Nakamura K, Ishiwata T (2000) Synthesizing context free grammars from sample strings based on inductive cyk algorithm. In: ICGI ‘00: proceedings of the 5th international colloquium on grammatical inference, London, UK, Springer, pp 186–195
Petasis G, Paliouras G, Karkaletsis V, Halatsis C, Spyropoulos CD (2004) e-GRIDS: computationally efficient grammatical inference from positive examples. GRAMMARS 7: 69–110
Pullum GK (2003) Learnability. In: The Oxford International Encyclopaedia of Linguistics, 2nd edn. Oxford, Oxford University Press, pp 431–434
Rissanen J (1982) A universal prior for integers and estimation by minimum description length. Ann Statist 11: 416–431
Roberts A, Atwell E (2002) Unsupervised grammar inference systems for natural language. Research report number 2002.20. School of Computing, University of Leeds
Sakakibara Y (1997) Recent advances of grammatical inference. Theor Comput Sci 185: 15–45
Sakakibara Y, Brown M, Hughley R, Mian I, Sjolander K, Underwood R, Haussler D (1994) Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res 22: 5112–5120
Sakakibara Y, Muramatsu H (2000) Learning context-free grammars from partially structured examples. In: Proceedings of the 5th international colloquium on grammatical inference: algorithms and applications (ICGI), pp 229–240
Salvador I, Benedı JM (2002) RNA modeling by combining stochastic context-free grammars and n-Gram models. Int J Pattern Recogn Artif Intell 16(3): 309–316
Seginer Y (2007) Fast unsupervised incremental parsing. In: Proceedings of the ACL 2007, Prague
Solan Z, Horn D, Ruppin E, Edelman S (2005) Unsupervised learning of natural languages. Proc Natl Acad Sci USA 102(33): 11629–11634
Steedman M, Osborne M, Sarkar A, Clark S, Hwa R, Hockenmaier J, Ruhlen P, Baker S, Crim J (2003) Bootstrapping statistical parsers from small datasets. In: Proceedings of the annual meeting of the European chapter of the ACL, Budapest, Hungary
van Zaanen MV (2001) Bootstrapping structure into language: alignment-based learning. PhD thesis, School of Computing, University of Leeds, UK
van Zaanen M, Adriaans P (2001) Alignment-based learning versus EMILE: a comparison. In: Proceedings of the Belgian-Dutch conference on artificial intelligence (BNAIC), Amsterdam, The Netherlands
Watkinson S, Manandhar S (2001) A psychologically plausible and computationally effective approach to learning syntax. In: Proceedings of the workshop computational natural language learning (CoNLL-2001), pp 160–167
Yokomori T (1995) On polynomial-time learnability in the limit of strictly deterministic automata. J Mach Learn 19: 153–179
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
D’Ulizia, A., Ferri, F. & Grifoni, P. A survey of grammatical inference methods for natural language learning. Artif Intell Rev 36, 1–27 (2011). https://doi.org/10.1007/s10462-010-9199-1
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10462-010-9199-1