Abstract
In this paper, we present experiments with an unsupervised dependency parser that does not use any part-of-speech tags learned from manually annotated data. We use only unsupervised word classes and therefore propose a fully unsupervised approach to inducing sentence structure from raw text. We show that the results are not much worse than those obtained with supervised part-of-speech tags.
1 Introduction
In recent years, unsupervised methods for inducing the structure of natural language sentences (sometimes called grammar induction) have become very popular. Their advantage is that they need no manually annotated corpora (treebanks), whose creation is very time-consuming, and there are still many languages that have no, or very few, such resources.
Unsupervised methods, of course, do not reach the same results as supervised methods. However, it is not fair to evaluate them by comparing their output structures with manually annotated treebanks, because the outputs of unsupervised parsers can be linguistically plausible even when the structure is completely different. A better comparison would be extrinsic, in a practical application such as machine translation.
Many current unsupervised natural language parsers use supervised part-of-speech annotation. A part of speech is a category assigned to each word in the corpus, e.g. noun, verb, adjective, adverb, preposition, or conjunction. This means that the sentences on which the parser operates must be preprocessed by manual annotation of part-of-speech tags, or by a supervised tagger trained on another corpus in which the part-of-speech tags were annotated manually. An example of a sentence with manually assigned part-of-speech tags and a dependency structure is shown in Fig. 1.
In this work, we do not use any part-of-speech annotation. We use only unsupervised word classes, which are learned from a large raw (unannotated) corpus. We show that the unsupervised parser using word classes performs only slightly worse than the one using supervised part-of-speech tags. Its advantage, however, is complete independence of any annotated data for a particular language.
2 Related Work
We can divide unsupervised parsing research into several groups according to the degree of supervision.
In the first group, there are so-called delexicalised parsers. Such parsers are in fact supervised, but trained on another language for which a well-annotated treebank exists (for example, the English Penn Treebank [9]). The condition is to use the same part-of-speech (POS) tagset in both the source and the target language. The parser is then trained on the part-of-speech tags only, since the word forms are, of course, completely different [13]. The problem of different part-of-speech tagsets is often solved by mapping the original tags to a common (universal) tagset. The Google universal part-of-speech tagset [14] has become very popular, and a mapping to it exists for many languages, which makes delexicalised parsers very easy to use.
In the second group, the parsers are also trained on another, resource-rich language, but instead of a common tagset they use a projection of dependency structures from the source language to the target language. The procedure is as follows:
1. Parse the source-language side (e.g. English) of the parallel treebank using a supervised parser trained on a manually annotated treebank.
2. Project the tree structures of the source language onto the target language using the word-alignment links between the words.
3. Train a supervised parser on the projected target-language trees.
See for example [4] for more details.
The third group contains the truly unsupervised approaches, in which the grammar is induced from a monolingual corpus without annotated structure. Many of these approaches, however, use supervised part-of-speech tags. State-of-the-art dependency parsers of this kind are described in [10, 20]. Some of the parsers can be called “minimally supervised”, because they use some external knowledge of how individual part-of-speech tags behave in the structure; for example, they assume that tags beginning with the letter “V” represent verbs and therefore have a higher probability of appearing in the root of a tree and governing other words (e.g. [15]). Another work distinguishes only nouns, using the condition that the most frequent tag in a given part-of-speech tagset is the tag for nouns [11].
An example of a fully unsupervised parsing method that uses only raw input text is the work [16]. The most interesting work so far is [17], where word classes are used; however, only experiments on English are reported there. In this work, we also experiment with word classes, but we use a different parser and test it on 25 languages, in order to show that the method is language-universal.
3 Unsupervised Dependency Parser
In our experiments, we use the unsupervised parser described in [10]. This parser is based on the Dependency Model with Valence, introduced by Klein and Manning [7] and later improved by Headden et al. [6] and Spitkovsky et al. [18, 19]. The Dependency Model with Valence is a generative model that uses the following two types of probabilities:
- \(P_{choose}(w_d|w_g,dir)\) is the probability of the dependent word \(w_d\) given the governing word \(w_g\) and the direction dir, which represents left or right attachment. This generates the words (the labels of nodes) in the dependency tree.
- \(P_{stop}(\cdot |w_g,dir,adj)\) is the probability that a word \(w_g\) stops generating children in the direction dir; its complement is the probability of generating another child. This generates the edges in the tree. “STOP” means that no more dependency edges from \(w_g\) will be generated in direction dir.
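To make the two probability types concrete, a tree's score under the model can be sketched roughly as follows. This is a toy illustration only, not the parser's actual implementation: the fallback values for unseen events and the simplified root handling are our own illustrative assumptions.

```python
def dmv_tree_probability(words, heads, p_choose, p_stop):
    """Score one dependency tree under a simplified DMV.

    heads[i] is the index of the governor of words[i], or -1 for the
    technical root. Unseen events fall back to illustrative defaults.
    """
    prob = 1.0
    # P_choose: one factor per attachment decision.
    for i, w in enumerate(words):
        g = heads[i]
        if g == -1:
            gov, direction = "ROOT", "right"  # simplified root handling
        else:
            gov = words[g]
            direction = "left" if i < g else "right"
        prob *= p_choose.get((w, gov, direction), 1e-6)
    # P_stop: for every word and direction, one CONTINUE factor per
    # generated child and one final STOP factor (adj = no child yet).
    for g, gov in enumerate(words):
        for direction in ("left", "right"):
            n_children = sum(1 for i, h in enumerate(heads)
                             if h == g and (i < g) == (direction == "left"))
            adj = True
            for _ in range(n_children):
                prob *= 1.0 - p_stop.get((gov, direction, adj), 0.5)
                adj = False
            prob *= p_stop.get((gov, direction, adj), 0.5)
    return prob
```

With learned \(P_{choose}\) and \(P_{stop}\) tables, the inference task is to find trees that make this product large.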
The parser uses Gibbs sampling [5] to infer the dependency structures. The authors of the parser also show that it is beneficial to estimate the \(P_{stop}\) probabilities for individual part-of-speech tags (or word classes) using the so-called reducibility principle [12]. These probabilities are estimated on a large raw (unannotated) corpus. For more details, see [10].Footnote 1
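The core idea of the reducibility principle can be illustrated by a toy counting step: a token is considered reducible if deleting it yields a word sequence that also occurs in the corpus, and classes whose tokens are frequently reducible are good candidates for leaves, i.e. for high stop-probabilities. This sketch shows only that counting step; the actual estimation in [10, 12] is done per direction on a large corpus with smoothing.

```python
from collections import Counter

def reducibility_scores(sentences, classes):
    """sentences: list of token lists; classes: dict token -> class id.

    Returns, for each class, the fraction of its token occurrences
    whose deletion leaves a sequence that is itself in the corpus.
    """
    corpus = {tuple(s) for s in sentences}
    occ, red = Counter(), Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            c = classes.get(tok)
            occ[c] += 1
            if tuple(sent[:i] + sent[i + 1:]) in corpus:
                red[c] += 1
    return {c: red[c] / occ[c] for c in occ}
```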
4 Data
For testing the dependency parser, we use data from the HamleDT collection of treebanks [21]. This collection consists of 29 dependency treebanks, each divided into a training part and a testing part. We use the testing parts for the inference and evaluation of the dependency trees.Footnote 2
To estimate the \(P_{stop}\) probabilities, we use the large collections of Wikipedia articles provided by [8], containing between 10 and 80 million words for each language.
The Wikipedia articles must first be preprocessed: tokenized and encoded in the same way as the particular treebank. For many languages this is not a problem, since they use the Latin alphabet and segmentation into tokens can be done simply by splitting on whitespace and separating punctuation marks. However, a couple of problems arose:
- In the Arabic treebank, tokenization is part of the morphological analysis, which cannot be used here, since we do not allow any supervised methods trained on annotated data. The Arabic treebank is therefore not included in our experiments. A similar problem arose for Chinese: we did not find any automatic tokenizer that would tokenize the Wikipedia articles in the same way as the treebank, so we excluded Chinese from the experiments as well.
- The Tamil treebank contains texts transliterated into the Latin script, while the Wikipedia articles are not transliterated.
- Ancient Greek has very few resources on Wikipedia and its treebank also has a very hard-to-interpret structure, so we excluded it as well.
Because of the aforementioned problems, we perform our experiments on 25 languages only. The corpus sizes are listed in Table 1.
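The simple whitespace-plus-punctuation tokenization described above can be sketched as follows. This is only an illustrative approximation: the real preprocessing must mirror each treebank's conventions exactly.

```python
import re

def tokenize(text):
    # Split on whitespace and separate punctuation marks into their own
    # tokens, while keeping word-internal hyphens/apostrophes attached.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text, re.UNICODE)
```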
5 Word Clustering
Instead of using part-of-speech tags to categorize the words, we use the unsupervised word-clustering tool developed by Alex Clark [3].Footnote 3 The clustering is based on distributional and morphological information: it does not treat words only as atoms, but goes deeper to the level of letters and also takes into account the similarity of word prefixes and suffixes. Such morphological information mainly improves performance on rare words.
We infer the word clusters from both of our corpora: for each language, the testing part of the treebank is concatenated with the articles from Wikipedia, and this input is passed to Clark's part-of-speech induction tool. We run it several times with different numbers of clusters.
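As a rough illustration of the distributional part of such clustering, words can be represented by counts of their left and right neighbours and the resulting vectors clustered. This toy k-means is only a stand-in of our own: Clark's tool additionally uses morphological (character-level) features and a different clustering objective.

```python
from collections import Counter, defaultdict
import random

def cluster_words(sentences, k, n_context=50, iters=10, seed=0):
    """Cluster words by their left/right neighbour counts (toy k-means)."""
    # Context features: the n_context most frequent words, per side.
    freq = Counter(w for s in sentences for w in s)
    context = [w for w, _ in freq.most_common(n_context)]
    index = {w: i for i, w in enumerate(context)}
    dim = 2 * len(context)
    vecs = defaultdict(lambda: [0.0] * dim)
    for s in sentences:
        for i, w in enumerate(s):
            if i > 0 and s[i - 1] in index:          # left neighbour
                vecs[w][index[s[i - 1]]] += 1
            if i + 1 < len(s) and s[i + 1] in index:  # right neighbour
                vecs[w][len(context) + index[s[i + 1]]] += 1
    words = sorted(vecs)
    rng = random.Random(seed)
    centers = [vecs[w][:] for w in rng.sample(words, min(k, len(words)))]
    assign = {}
    for _ in range(iters):
        # Assign each word to the nearest centre (squared Euclidean).
        for w in words:
            assign[w] = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(vecs[w], centers[c])))
        # Recompute centres as means of their members.
        for c in range(len(centers)):
            members = [w for w in words if assign[w] == c]
            if members:
                centers[c] = [sum(vecs[w][d] for w in members) / len(members)
                              for d in range(dim)]
    return assign
```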
6 Experiments and Results
Each experiment comprises the following steps:
1. We tokenize the Wikipedia data in the same way as in the treebank.
2. We concatenate the Wikipedia and the testing data and run Clark's POS induction tool to obtain the clusters. We perform the experiments with 25, 50, 100, and 200 clusters. From this point on, each word is assigned to one cluster.
3. We run the script get_stop_priors.pl (see [10] for a detailed description) on the Wikipedia data with the assigned word clusters and extract the stop-probability estimates for the individual clusters and directions. The stop-probability says how likely it is that a word belonging to a particular cluster has a left or a right dependent.
4. We run the unsupervised parser [10] on the testing data with the inferred word classes, using the stop-probability estimates for the dependency tree inference.
5. We compare the resulting dependency structures with the original manually annotated treebank and measure the accuracy: the proportion of tokens that have the correct parent (the same parent as in the original treebank).
Table 2 shows the results for all 25 languages (treebanks) and the four different numbers of clusters used for unsupervised part-of-speech induction. The columns “with punctuation” use the treebanks in their original form. The results in the columns “without punctuation” are on the same texts, but with punctuation marks removed (full stops, commas, hyphens, colons, semicolons, quotation marks, etc.); the evaluation was then also done against the treebank with punctuation marks removed. If a punctuation node was not a leaf in the dependency structure, all its dependents were attached to its parent, so that they did not remain unattached.
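The punctuation-removal procedure just described can be sketched as follows, again assuming parent indices with -1 for the technical root; the is_punct predicate is left to the caller:

```python
def remove_punctuation(tokens, heads, is_punct):
    """Drop punctuation nodes, re-routing their dependents to the
    nearest non-punctuation ancestor so no token is left unattached."""
    def resolve(h):
        while h != -1 and is_punct(tokens[h]):
            h = heads[h]
        return h
    keep = [i for i, t in enumerate(tokens) if not is_punct(t)]
    new_index = {old: new for new, old in enumerate(keep)}
    new_tokens = [tokens[i] for i in keep]
    new_heads = [new_index.get(resolve(heads[i]), -1) for i in keep]
    return new_tokens, new_heads
```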
We can see that grammar induction on sentences including punctuation yields worse results. This is often caused by the full stop, which appears in almost all sentences and is easily treated as the root of the sentence, which is not correct with respect to the manual annotations. It is hard to say which number of clusters is ideal for unsupervised grammar induction. For many languages, the number of clusters does not affect the accuracy much; for example, Catalan and Russian both have a parsing accuracy of about 49 % for all class sizes. On the other hand, Turkish has only 27 % accuracy with 25 classes, but more than 51 % accuracy with 50, 100, and 200 classes. Hungarian and Portuguese, however, achieved their best results with just 25 classes, and their results with 50, 100, and 200 classes are much worse.
Table 3 shows the left-chain and the right-chain baselines and compares them with the unsupervised parsing results. In the left-chain baseline, each word is attached to the following word and the last word is attached to the technical root. In the right-chain baseline, each word is attached to the previous one.
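The two chain baselines can be written directly as head-index sequences, using -1 for the technical root as an illustrative convention:

```python
def left_chain_heads(n):
    # Each word depends on the following word; the last on the root.
    return [i + 1 for i in range(n - 1)] + [-1]

def right_chain_heads(n):
    # Each word depends on the previous word; the first on the root.
    return [-1] + [i for i in range(n - 1)]
```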
Our best results from Table 2 (inference without punctuation and with 50 word classes) are compared to the baselines. We also compare them to the unsupervised parsing with gold part-of-speech tags [10]. The fifth column (in italics) shows the best result over the numbers of classes for each particular language. This is the score that could be achieved if we were able to automatically infer the best number of classes for each language.
On average, unsupervised parsing with unsupervised part-of-speech tags has slightly lower accuracy than with gold part-of-speech tags. However, there are languages on which the unsupervised tags perform much better: Greek, Estonian, Finnish, Italian, Romanian, and Russian. Very good results, similar to those with gold part-of-speech tags, were achieved e.g. on Spanish (60.1 %), Swedish (54.9 %), and Turkish (55.0 %).
Out of curiosity, we computed the correlation coefficient between the parsing results with unsupervised POS tags and the results with gold POS tags. The highest correlation (0.45) was obtained for the best setting, i.e. 50 classes and no punctuation. The lowest correlation (0.04) was for 200 classes including punctuation.
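The figure reported here is an ordinary Pearson correlation coefficient over the per-language accuracy pairs; for completeness:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```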
There are languages on which the accuracy of the parser is very low, for example Hindi (13.1 %), Slovenian (18.3 %), Latin (24.1 %), and Telugu (29.8 %). One reason could be the smaller amount of Wikipedia text, from which the important stop-probability estimates are computed [10].
Some languages have a higher baseline score than any of the parser results. The baseline is hard to beat for strongly left-branching languages such as Bengali, Telugu, or Turkish (Table 3).
7 Conclusions
We described experiments with fully unsupervised dependency parsing using unsupervised word classes instead of supervised (gold) part-of-speech tags. We tested it on 25 different languages (treebanks) and four different numbers of classes (25, 50, 100, and 200). When compared to the manually annotated dependency structures, the results of the fully unsupervised parser are slightly worse than when gold part-of-speech tags are used; however, there are languages for which the unsupervisedly induced word classes perform better than the gold, manually designed part-of-speech tags.
8 Future Work
In future work, we would like to try other part-of-speech induction tools, for example Brown clusters [2] or the clusters inferred by a Pitman-Yor process [1]. It would also be beneficial to automatically detect the best number of classes; such a mechanism would bring the results up to those in the fifth column of Table 3. We would also like to bring unsupervised dependency parsing into practical applications, for example machine translation systems, since measuring the performance extrinsically is fairer than comparing it to manual annotations.
Notes
1. The software of the unsupervised parser described in [10] can be downloaded at http://ufal.mff.cuni.cz/udp.
2. We can do the inference and evaluation on the same data, since the correct annotations (labels and dependencies) are not used in the inference (unsupervised training).
3. Clark's tool for unsupervised POS induction can be downloaded at http://www.cs.rhul.ac.uk/home/alexc/pos2.tar.gz.
References
Blunsom, P., Cohn, T.: A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, HLT 2011, vol. 1, pp. 865–874. Association for Computational Linguistics, Stroudsburg (2011). http://dl.acm.org/citation.cfm?id=2002472.2002582
Brown, P.F., deSouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992). http://dl.acm.org/citation.cfm?id=176313.176316
Clark, A.: Combining distributional and morphological information for part of speech induction. In: Proceedings of 10th EACL, pp. 59–66 (2003)
Ganchev, K., Gillenwater, J., Taskar, B.: Dependency grammar induction via bitext projection constraints. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL 2009, vol. 1, pp. 369–377. Association for Computational Linguistics, Stroudsburg (2009). http://dl.acm.org/citation.cfm?id=1687878.1687931
Gilks, W.R., Richardson, S., Spiegelhalter, D.J.: Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics. Chapman & Hall, London (1996)
Headden III, W.P., Johnson, M., McClosky, D.: Improving unsupervised dependency parsing with richer contexts and smoothing. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2009, pp. 101–109. Association for Computational Linguistics, Stroudsburg (2009)
Klein, D., Manning, C.D.: Corpus-based induction of syntactic structure: models of dependency and constituency. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL 2004. Association for Computational Linguistics, Stroudsburg (2004)
Majliš, M., Žabokrtský, Z.: Language richness of the web. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012
Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1994)
Mareček, D., Straka, M.: Stop-probability estimates computed on a large corpus improve unsupervised dependency parsing. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, vol. 1 (Long Papers), pp. 281–290. Association for Computational Linguistics, Sofia, Bulgaria, August 2013
Mareček, D., Žabokrtský, Z.: Gibbs sampling with treeness constraint in unsupervised dependency parsing. In: Proceedings of RANLP Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing, pp. 1–8. Hissar, Bulgaria (2011)
Mareček, D., Žabokrtský, Z.: Exploiting reducibility in unsupervised dependency parsing. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, pp. 297–307. Association for Computational Linguistics, Stroudsburg (2012)
McDonald, R., Petrov, S., Hall, K.: Multi-source transfer of delexicalized dependency parsers. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, pp. 62–72. Association for Computational Linguistics, Stroudsburg, July 2011. http://dl.acm.org/citation.cfm?id=2145432.2145440
Petrov, S., Das, D., McDonald, R.: A universal part-of-speech tagset. In: Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012
Rasooli, M.S., Faili, H.: Fast unsupervised dependency parsing with arc-standard transitions. In: Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, ROBUS-UNSUP 2012, pp. 1–9. Association for Computational Linguistics, Stroudsburg (2012)
Seginer, Y.: Fast unsupervised incremental parsing. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 384–391. Association for Computational Linguistics, Prague, Czech Republic (2007)
Spitkovsky, V.I., Alshawi, H., Chang, A.X., Jurafsky, D.: Unsupervised dependency parsing without gold part-of-speech tags. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011) (2011)
Spitkovsky, V.I., Alshawi, H., Jurafsky, D.: Punctuation: making a point in unsupervised dependency parsing. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL-2011) (2011)
Spitkovsky, V.I., Alshawi, H., Jurafsky, D.: Three dependency-and-boundary models for grammar induction. In: Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012) (2012)
Spitkovsky, V.I., Alshawi, H., Jurafsky, D.: Breaking out of local optima with count transforms and model recombination: a study in grammar induction. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1983–1995. Association for Computational Linguistics, Seattle, October 2013
Zeman, D., Mareček, D., Popel, M., Ramasamy, L., Štěpánek, J., Žabokrtský, Z., Hajič, J.: HamleDT: to parse or not to parse? In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey (2012)
Acknowledgments
This research has been supported by the grant no. GPP406/14/06548P of the Grant Agency of the Czech Republic.
© 2015 Springer International Publishing Switzerland
Mareček, D. (2015). Multilingual Unsupervised Dependency Parsing with Unsupervised POS Tags. In: Sidorov, G., Galicia-Haro, S. (eds) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science(), vol 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_6