1 Introduction

In recent years, unsupervised methods for inducing the structure of natural language sentences (sometimes called grammar induction) have become very popular. Their advantage is that they require no manually annotated corpora (treebanks), whose creation is very time-consuming; moreover, there are still many languages that have no such resources, or only very few.

Unsupervised methods do not, of course, reach the same results as supervised methods. However, it is not fair to evaluate them by comparing their output structures with manually annotated treebanks, because the output of an unsupervised parser can be linguistically plausible even though its structure is completely different. A better comparison would be extrinsic, in a practical application, for example in machine translation.

Many current unsupervised natural language parsers rely on supervised part-of-speech annotation. A part of speech is a category assigned to each word in the corpus, e.g. noun, verb, adjective, adverb, preposition, or conjunction. This means that the sentences on which the parser operates must be preprocessed either by manual annotation of part-of-speech tags or by a supervised tagger, which is trained on another corpus in which the part-of-speech tags were annotated manually. An example of a sentence with manually assigned part-of-speech tags and its dependency structure is shown in Fig. 1.

Fig. 1. Example of an English dependency structure with manually annotated (gold) part-of-speech tags for the sentence “When the dollar is in a free-fall, even central banks can’t stop it.”

In this work, we do not use any part-of-speech annotation. We use only unsupervised word classes, which are learned from a large raw (unannotated) corpus. We show that an unsupervised parser using word classes performs only slightly worse than one using supervised part-of-speech tags. Its advantage, however, is complete independence from any annotated data for the particular language.

2 Related Work

We can divide the area of unsupervised parsing research into several groups according to the degree of supervision.

In the first group, there are so-called delexicalised parsers. Such parsers are in fact supervised, but trained on another language for which a well-annotated treebank exists (for example, the English Penn Treebank [9]). The condition is that the same part-of-speech (POS) tagset be used in both the source and the target language. The parser is then trained on the part-of-speech tags only, since the word forms are of course completely different [13]. The problem of different part-of-speech tagsets is often solved by mapping the original tags to a common (universal) tagset. Nowadays, the Google Universal part-of-speech tagset [14] has become very popular, and mappings to it exist for many languages. Delexicalised parsers are therefore very easy to use.

In the second group, the parsers are also trained on another, resource-rich language, but instead of a common tagset they use a projection of dependency structures from the source language to the target language. The procedure is as follows:

  1. Parse the source-language side (e.g. English) of the parallel corpus using a supervised parser trained on a manually annotated treebank.

  2. Project the tree structures of the source language onto the target language using the word-alignment links between the words.

  3. Train a supervised parser on the projected target-language trees.

See, for example, [4] for more details.
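In its simplest form, the projection step can be sketched as follows. This is a toy illustration assuming complete one-to-one word alignments; a real projection system such as [4] must also handle unaligned and many-to-many tokens:

```python
def project_tree(source_heads, alignment):
    """Project a source-language dependency tree onto target tokens
    through one-to-one word alignments.

    source_heads: head index of each source token (-1 = root).
    alignment: dict mapping each source index to a target index.
    """
    target_heads = [None] * len(alignment)
    for s, t in alignment.items():
        s_head = source_heads[s]
        # The root stays the root; otherwise follow the aligned head.
        target_heads[t] = -1 if s_head == -1 else alignment[s_head]
    return target_heads
```

For a three-word sentence with heads `[1, -1, 1]` and a reordering alignment `{0: 1, 1: 0, 2: 2}`, the projected target tree is `[-1, 0, 0]`: the target root is the token aligned to the source root, and every other dependency edge follows its aligned counterpart.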

The third group contains the truly unsupervised approaches, in which the grammar is induced from a monolingual corpus without annotated structures. Many of these approaches, however, use supervised part-of-speech tags. The state-of-the-art dependency parsers are described in [10, 20]. Some of the parsers can be called “minimally supervised”, because they use external knowledge of how the individual part-of-speech tags behave in the structure; for example, they assume that tags beginning with the letter “V” represent verbs and therefore have a higher probability of appearing at the root of trees and governing other words (e.g. [15]). Another work distinguishes only nouns, by assuming that the most frequent tag in a given part-of-speech tagset is the tag for nouns [11].

A fully unsupervised parsing method that uses only raw input text is, for example, the work [16]. The most interesting work so far is [17], where word classes are used; however, only experiments on English are reported there. In this work, we experiment with word classes as well, but we use a different parser and test it on 25 languages, so that we can show that the method is language-universal.

3 Unsupervised Dependency Parser

In our experiments, we use the unsupervised parser described in [10]. This parser is based on the Dependency Model with Valence, introduced by Klein and Manning [7] and later improved by Headden et al. [6] and Spitkovsky et al. [18, 19]. The Dependency Model with Valence is a generative model that uses the following two types of probabilities:

  • \(P_{choose}(w_d|w_g,dir)\) is the probability of the dependent word \(w_d\) given the governing word \(w_g\) and the direction dir, which indicates left or right attachment. This generates the words (the labels of the nodes) in the dependency tree.

  • \(P_{stop}(\cdot |w_g,dir,adj)\) is the probability that the word \(w_g\) does or does not take another dependent in the direction dir; the argument adj distinguishes whether \(w_g\) already has a dependent in that direction. This generates the edges of the tree: the outcome “STOP” means that no more dependency edges from \(w_g\) will be generated in the direction dir.
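The generative story can be illustrated with a small sketch. The probability tables below are invented toy values for illustration only, not parameters learned by the parser of [10]:

```python
# Toy DMV parameters (invented values for illustration).
# P_choose[(governor, direction)] -> {dependent class: probability}
P_choose = {
    ("ROOT", "right"): {"V": 1.0},
    ("V", "left"): {"N": 1.0},
    ("V", "right"): {"N": 1.0},
    ("N", "left"): {"D": 1.0},
}
# P_stop[(governor, direction, adjacent)] -> probability of generating
# no (further) dependent; `adjacent` is True while no dependent has yet
# been generated in that direction.
P_stop = {
    ("ROOT", "left", True): 1.0,
    ("ROOT", "right", True): 0.0, ("ROOT", "right", False): 1.0,
    ("V", "left", True): 0.3, ("V", "left", False): 0.9,
    ("V", "right", True): 0.4, ("V", "right", False): 0.9,
    ("N", "left", True): 0.5, ("N", "left", False): 0.95,
    ("N", "right", True): 0.8,
    ("D", "left", True): 0.99,
    ("D", "right", True): 0.99,
}

def tree_prob(tags, heads):
    """Probability of a dependency tree under the DMV.
    tags[i] is the word class of token i; heads[i] is its parent (-1 = root)."""
    p = 1.0
    for g in range(-1, len(tags)):
        g_tag = "ROOT" if g == -1 else tags[g]
        left = [d for d in range(len(tags)) if g != -1 and heads[d] == g and d < g]
        right = [d for d in range(len(tags)) if heads[d] == g and (g == -1 or d > g)]
        # Dependents are generated from the head outward.
        for direction, children in (("left", left[::-1]), ("right", right)):
            adjacent = True
            for d in children:
                p *= 1.0 - P_stop[(g_tag, direction, adjacent)]   # continue
                p *= P_choose[(g_tag, direction)][tags[d]]        # choose dependent
                adjacent = False
            p *= P_stop[(g_tag, direction, adjacent)]             # stop
    return p
```

For the toy sentence with classes `["D", "N", "V"]` and heads `[1, 2, -1]` (determiner under noun, noun under verb, verb under the root), the probability is the product of one continue, one choose, and one stop factor per head and direction.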

The parser uses the Gibbs sampling method [5] for the inference of the dependency structures. The authors of the parser also show that it is beneficial to estimate the \(P_{stop}\) probabilities for individual part-of-speech tags (or word classes) using the so-called reducibility principle [12]. These probabilities are estimated from a large raw unannotated corpus. For more details, see [10].Footnote 1

Table 1. Properties of the language resources used in our experiments. We used only the testing parts of the treebanks, both for inference and for evaluation.

4 Data

For testing the dependency parser, we use data from the HamleDT collection of treebanks [21]. This collection consists of 29 dependency treebanks, each of which is divided into a training part and a testing part. We used the testing parts for both the inference and the evaluation of the dependency trees.Footnote 2

To estimate the \(P_{stop}\) probabilities, we use the large collections of Wikipedia articles provided by [8], containing between 10 and 80 million words for each language.

The Wikipedia articles must first be preprocessed – tokenized and encoded in the same way as the particular treebank. For many languages this is not a problem, since they use the Latin alphabet and segmentation into tokens can be done simply by splitting on whitespace and separating the punctuation marks. However, a couple of problems arose:

  • In the Arabic treebank, tokenization is part of the morphological analysis, which cannot be used here, since we do not allow any supervised methods trained on annotated data. The Arabic treebank is therefore not included in our experiments. A similar problem arose for Chinese: we did not find any automatic tokenizer that would tokenize the Wikipedia articles in the same way as the treebank, so we excluded Chinese from the experiments as well.

  • The Tamil treebank has its texts transliterated into the Latin alphabet, while the Wikipedia articles are not.

  • Ancient Greek has very poor resources on Wikipedia, and its treebank also has a very hard-to-interpret structure, so we excluded it as well.

Because of the aforementioned problems, we perform our experiments on 25 languages only. The sizes of their resources are listed in Table 1.

5 Word Clustering

Instead of using part-of-speech tags to categorize the words, we use the unsupervised word-clustering tool developed by Alex Clark [3].Footnote 3 The clustering is based on distributional and morphological information. It does not treat words only as atoms; it goes deeper, to the level of letters, and also takes into account the similarity of word prefixes and suffixes. Such morphological information mainly improves the performance on rare words.

We infer the word clusters using both of our corpora. For each language, the testing part of the treebank is concatenated with the articles from Wikipedia, and this input is passed to Clark’s part-of-speech induction tool. We run it several times with different numbers of clusters.

Table 2. Parsing accuracy against the manually annotated data for different numbers of word classes. The best result for each language is in bold. The average accuracy over all languages is given in the last row.

6 Experiments and Results

Each experiment comprises the following steps:

  1. We tokenize the Wikipedia data in the same way as the treebank.

  2. We concatenate the Wikipedia and the testing data and run Clark’s POS induction tool to create the clusters. We perform the experiments with 25, 50, 100, and 200 clusters. From this point on, each word belongs to exactly one cluster.

  3. We run the script get_stop_priors.pl (see [10] for a detailed description) on the Wikipedia data with the assigned word clusters, and extract the stop-probability estimates for each cluster and direction. The stop-probability says how likely it is that a word belonging to a particular cluster has a left or a right dependent.

  4. We run the unsupervised parser [10] on the testing data with the inferred word classes, using the stop-probability estimates for the dependency tree inference.

  5. We compare the resulting dependency structures with the original manually annotated treebank. We measure accuracy as the percentage of tokens attached to the correct parent (the same parent as in the original treebank).
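The evaluation measure in the last step is simply unlabeled attachment accuracy, which can be sketched as:

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted parent index matches the
    gold (manually annotated) parent index; -1 denotes the root."""
    assert len(gold_heads) == len(predicted_heads)
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)
```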

Table 3. Comparison of the best results (without punctuation) with the parsing baselines and with unsupervised parsing using gold part-of-speech tags.

In Table 2, we present the results for all 25 languages (treebanks) and four different numbers of clusters used for unsupervised part-of-speech induction. The columns “with punctuation” use the treebanks in their original form. The results in the columns “without punctuation” are on the same texts, but with punctuation marks removed (full stops, commas, hyphens, colons, semicolons, quotation marks, etc.). The evaluation was then also done against the treebank with punctuation removed. If a punctuation node was not a leaf in the dependency structure, all its dependents were attached to its parent, so that they did not remain unattached.

Fig. 2. Example Spanish sentence “Tony volcó sobre la ruta una categoría que sólo Indurain posee en el ciclismo actual y que esta temporada juega con el misterio”, parsed and tagged using 50 word classes. Punctuation marks were removed.

We can see that grammar induction on sentences including punctuation gives worse results. This is often caused by the full stop, which appears in almost all sentences and can easily be treated as the root of the sentence, which is not correct with respect to the manual annotations. It is hard to say which number of clusters is ideal for unsupervised grammar induction. For many languages, the number of clusters does not affect the accuracy much. For example, Catalan and Russian both have a parsing accuracy of about 49 % for all numbers of classes. On the other hand, Turkish has only 27 % accuracy with 25 classes, but more than 51 % accuracy with 50, 100, and 200 classes. Hungarian and Portuguese, however, achieved their best results with 25 classes, and their results with 50, 100, and 200 classes are much worse.

Table 3 shows the left-chain and right-chain baselines and compares them with the unsupervised parsing results. In the left-chain baseline, each word is attached to the following word, and the last word is attached to the technical root. In the right-chain baseline, each word is attached to the previous one.
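The two chain baselines can be sketched as follows. Attaching the first word of the right-chain to the technical root is our assumption, since the text does not specify it:

```python
def left_chain(n):
    """Left-chain baseline: each word is attached to the following word;
    the last word hangs on the technical root (-1)."""
    return [i + 1 for i in range(n - 1)] + [-1]

def right_chain(n):
    """Right-chain baseline: each word is attached to the previous word;
    the first word hangs on the technical root (-1), by our assumption."""
    return [-1] + list(range(n - 1))
```

For a four-word sentence, the left-chain baseline predicts the heads `[1, 2, 3, -1]` and the right-chain baseline `[-1, 0, 1, 2]`.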

Our best overall setting from Table 2 (inference without punctuation, 50 word classes) is compared to the baselines. We also compare it to unsupervised parsing with gold part-of-speech tags [10]. The fifth column (in italics) shows the best result over all numbers of classes for each particular language. This is the score that could be achieved if we were able to automatically infer the best number of classes for each language.

On average, unsupervised parsing with unsupervised part-of-speech tags has slightly lower accuracy than with gold part-of-speech tags. However, there are languages on which the unsupervised tags perform much better: Greek, Estonian, Finnish, Italian, Romanian, and Russian. Very good results, similar to those with gold part-of-speech tags, were achieved on e.g. Spanish (60.1 %), Swedish (54.9 %), and Turkish (55.0 %).

Out of curiosity, we computed the correlation coefficient between the parsing results with unsupervised POS tags and the results with gold POS tags. The highest correlation (0.45) was obtained for the best setting, i.e. 50 classes and no punctuation. The lowest correlation (0.04) was obtained for 200 classes with punctuation included.

There are languages on which the accuracy of the parser is very low, for example Hindi (13.1 %), Slovenian (18.3 %), Latin (24.1 %), and Telugu (29.8 %). One reason could be the smaller amount of Wikipedia text from which the important stop-probability estimates are computed [10].

Some languages have a higher baseline score than any of the parser results. The baseline is hard to beat for strongly left-branching languages such as Bengali, Telugu, and Turkish (Table 3).

7 Conclusions

We described an experiment with fully unsupervised dependency parsing using unsupervised word classes instead of supervised (gold) part-of-speech tags. We tested it on 25 different languages (treebanks) and four different numbers of classes (25, 50, 100, and 200). When compared to the manually annotated dependency structures, the results of the fully unsupervised parser are slightly worse than when gold part-of-speech tags are used; however, there are languages for which the unsupervisedly induced word classes perform better than the gold, manually designed part-of-speech tags.

8 Future Work

In the future, we would like to try other part-of-speech induction tools, for example Brown clusters [2] or the clusters inferred by the Pitman-Yor process [1]. It would also be beneficial to automatically detect the best number of classes; such a mechanism would bring the results up to those in the fifth column of Table 3. We would also like to bring unsupervised dependency parsing into practical applications, for example machine translation systems. Measuring the performance extrinsically is fairer than comparing it to manual annotations.