1 Introduction

Wordnets are lexical-semantic knowledge bases, modelled after Princeton WordNet (PWN) [1]. They group synonyms in synsets, which represent concepts by their possible lexicalisations. Together with the synset glosses, different types of semantic relation, including hypernymy and meronymy, are established between synsets and help to describe their meaning. As the same meaning might be transmitted by different words, the same word might be in more than one synset, one for each of its senses.

Due to its machine-friendly structure, the wordnet became the standard model of a lexical knowledge base. We have seen the creation of wordnets for many languages, including Portuguese [2], though none is as consensual as PWN is for English. Given the overwhelming task of populating a wordnet from scratch, the open Portuguese wordnets are created automatically or semi-automatically, and rely heavily on the contents of other lexical resources, including wordnets of other languages. Automatic processes enable faster creation but, at the same time, the noise they introduce leads to less reliable resources.

In order to tackle existing limitations, we aim to go further in leveraging the advantages of automatic approaches, and to give users some control over coverage and reliability, depending on their needs. We believe in the potential of redundant information across open Portuguese lexical-semantic resources, which should enable the creation of a new broad-coverage wordnet where confidence degrees are assigned to the decisions taken, including the membership of words in synsets or the connection of two synsets by a semantic relation. This should enable users to select their own confidence cut-points, which will yield either larger but less reliable or smaller and more reliable wordnets. The result can be seen as a fuzzy wordnet, an idea that is not completely new (see [3]), but has not been much explored. Moreover, the fuzzy representation is less artificial, as we know that word senses are not discrete [4], but complex and overlapping structures, so their representation as crisp objects does not reflect human language.

This paper presents the first experiments towards the creation of a fuzzy Portuguese wordnet. The next section overviews the current Portuguese wordnet initiatives. Resources exploited in this work are then enumerated, and their contents and redundancy analysed. After that, the proposed approach for discovering fuzzy synsets and fuzzy semantic connections is described, together with some results and their evaluation. It follows the steps of ECO [5] – extraction, clustering and ontologising –, an abstract model tailored for the automatic creation of Onto.PT, one of the open Portuguese wordnets, but flexible enough for the creation of other resources of the same kind. This is also why this new wordnet is baptised as CONTO.PT – as in Confidence-enriched Onto.PT. The paper ends with the first conclusions of this approach and some lines for further work.

2 Portuguese Wordnets

There are at least six Portuguese lexical-semantic knowledge bases structured according to the wordnet model [2], created by independent teams, following different approaches, and with different licenses and usage restrictions. WordNet.PT Global [6] is the most recent instantiation of the first Portuguese wordnet, in development since 1998. It is essentially handcrafted and was created from scratch for Portuguese. It can be browsed online, but it is not available for download. WordNet.Br is a wordnet project for Brazilian Portuguese where synsets and antonymy relations were first manually produced, based on dictionaries and corpora, and released under the name TeP [7]. Synsets were then manually aligned with PWN, and semantic relations between Portuguese synsets with English equivalents were inherited [8]. To our knowledge, this part is not publicly available. MultiWordNet.PT is a Portuguese wordnet with synsets derived from the translation of PWN synsets. It can be browsed online and used upon payment of a license.

Besides the previous, there are four open Portuguese wordnets. Onto.PT [5] is created in a completely automatic fashion – both synset boundaries and the attachment of semantic relations are learned from the exploitation of available lexical-semantic resources, without any human supervision. Its development follows ECO, a three-step approach to integrate words and relations from different sources: (i) relation extraction between words; (ii) synset discovery from the synonymy relations; (iii) mapping of words in the remaining relations to the discovered synsets. OpenWordNet-PT [9] was originally developed as a syntactic projection of the Universal WordNet [10] for Portuguese. Its development is thus based on the translation of lexical information in PWN, across multiple languages of Wikipedia, open dictionaries, and also some information from corpora. It is aligned to PWN and a manual curation process is currently under way. PULO [11] is based on the probabilistic translation of open wordnets of other languages, with special focus on those included in the MCR project [12], where wordnets of the Iberian languages are aligned to PWN. UfesWN [13] is another Portuguese wordnet, based on the automatic translation of PWN.

With more than 168 k lexical items, 248 k word senses, 117 k synsets, and 340 k relation instances, Onto.PT is the largest Portuguese wordnet [2], and it additionally covers a broad range of relation types. On the other hand, it is not aligned to PWN or to any other wordnet, and it is far from being 100 % reliable. In a manual evaluation [5], 74 % of synsets were labelled as correct, in 18 % there was no agreement between two judges, and the remaining had at least one incorrect word. Moreover, considering that relations between incorrect synsets are also wrong, between 78 % and 82 % of relations were labelled as correct. This highlights the need for incorporating confidence information in large automatically-created wordnets, such as Onto.PT, which may allow users to define their own coverage vs reliability trade-off, depending on their needs.

3 Redundancy in Portuguese Lexical-Semantic Resources

This section overviews the contents of the lexical-semantic resources exploited in the reported work and analyses their redundancy, which can be useful for the computation of confidence measures, as shown in the following section.

3.1 Open Portuguese Lexical-Semantic Resources Used

Seven Portuguese lexical-semantic resources are exploited. All of them, listed here, are freely available for download:

  • Semantic relation instances of the network PAPEL [14], extracted automatically from a commercial Portuguese dictionary;

  • Additional semantic relation instances extracted from two dictionaries – Dicionário Aberto (DA) [15] and Wiktionary.PT (Wikt.PT) – using the same grammars as PAPEL, and included in the network CARTÃO [16];

  • Synonymy and antonymy instances from two handcrafted synset-based thesauri: TeP 2.0 [17] and OpenThesaurus.PT (OT.PT);

  • Semantic relation instances acquired from two open Portuguese wordnets: OpenWordNet-PT (OWN.PT) [9] and PULO [11].

All the obtained lexical-semantic information was converted to a suitable input format for the second and third steps of ECO – term-based triples (a related-to b), where words a and b are connected by a predicate (related-to) that is the name of a semantic relation. For that purpose, the synsets of thesauri and wordnets had to be deconstructed. For instance, a part-of relation between the synsets {porta, portão} and {automóvel, carro, viatura} would result in the triples: (porta synonym-of portão), (automóvel synonym-of carro), (automóvel synonym-of viatura), (carro synonym-of viatura), (porta part-of automóvel), (porta part-of carro), (porta part-of viatura), (portão part-of automóvel), (portão part-of carro), (portão part-of viatura). Relation types used were those covered by PAPEL, with a minor extension to include wordnet relations not extracted from dictionaries, such as hypernymy between verbs (hiperonimoAccaoDe) or entailment (accaoQueCausaAccao). Other wordnet relation names were adapted to the equivalent names in PAPEL. For instance, hypernymOf became hiperonimoDe and substanceHolonymOf became materialDe.
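The deconstruction of two related synsets into term-based triples can be sketched as follows. This is only a minimal illustration of the example above, not the code used in this work; the function names are hypothetical, and synonymy is emitted once per unordered word pair, as in the example.

```python
from itertools import combinations, product


def synonym_triples(synset):
    """One synonym-of triple for every unordered pair of words in a synset."""
    return [(a, "synonym-of", b) for a, b in combinations(synset, 2)]


def relation_triples(source, relation, target):
    """One triple for every (source word, target word) combination."""
    return [(a, relation, b) for a, b in product(source, target)]


def deconstruct(source, relation, target):
    """Term-based triples for a relation instance between two synsets."""
    return (synonym_triples(source)
            + synonym_triples(target)
            + relation_triples(source, relation, target))


triples = deconstruct(["porta", "portão"], "part-of",
                      ["automóvel", "carro", "viatura"])
for triple in triples:
    print(triple)
```

For the two synsets of the example, this yields one synonymy triple for the first synset, three for the second, and six part-of triples.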

Table 1. Number of lexical items and triples used from each exploited resource.

From all the resources, a lexical-semantic network was established with 355,026 lexical items and 1,139,243 triples (excluding inverse relations in the wordnets) respectively distributed according to Table 1.

3.2 Redundancy

As expected, most triples in the network occurred in only one resource. Still, about 109 k were in more than one, and 192 in all seven. Table 2 distributes the triples of covered types according to the number of resources they occur in.

Table 2. Occurrences of the same triples in different resources, per type.

A key intuition behind this work is that the more resources a triple is in, the more likely it is to transmit a consensual and useful relation, which is illustrated by selected examples in Table 3. On the other hand, triples that only occur in one resource are more likely either to be incorrect, resulting from noise in the automatic process, or to involve very specific and thus less useful meanings.
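Counting in how many resources each triple occurs, as summarised in Table 2, can be sketched as below. The resource names and their triples are toy placeholders rather than data from the exploited resources, and the sketch assumes that triples were already normalised (for instance, that symmetric relations such as synonymy always list their words in the same order).

```python
from collections import Counter


def triple_redundancy(resources):
    """Map each triple to the number of resources that contain it."""
    counts = Counter()
    for triples in resources.values():
        counts.update(set(triples))  # a triple counts once per resource
    return counts


def redundancy_distribution(counts):
    """Number of distinct triples that occur in exactly n resources."""
    return Counter(counts.values())


# Toy input: hypothetical resources R1..R3, not the real ones.
resources = {
    "R1": [("carro", "synonym-of", "automóvel"), ("porta", "part-of", "carro")],
    "R2": [("carro", "synonym-of", "automóvel"), ("porta", "part-of", "carro")],
    "R3": [("carro", "synonym-of", "automóvel"), ("roda", "part-of", "carro")],
}
counts = triple_redundancy(resources)
distribution = redundancy_distribution(counts)
```

The resulting count per triple is the weight later used as evidence of confidence.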

Table 3. Examples of redundant triples.

4 Computing Confidence from Redundancy

We aim at exploiting the potential of redundancy for computing confidence towards the creation of a fuzzy Portuguese wordnet. For this purpose, triples acquired from the seven resources are used as the input of a new implementation of the second and third steps of ECO [5], which should encompass the assignment of scores that transmit confidence. In the second step, fuzzy synsets are discovered from synonymy triples and, in the third, they are connected by different semantic relations, based on the exploitation of all available triples.

4.1 Discovering Fuzzy Synsets

Though not very explored, the idea of fuzzy synsets is not new. Fuzzy memberships of words to synsets have been obtained from manual judgements [18] or from the structure of synonymy networks [19]. In order to integrate domain knowledge, PWN has been extended with fuzzy memberships of words to synsets, as well as fuzzy semantic relations [3]. Fuzzy sets of highly related words have also been discovered from text, to represent word senses [20].

Despite its similarities with word sense disambiguation [21], this part of the work can be seen as a kind of word sense induction [22] because, instead of assigning words to senses in an inventory, word senses are drawn from scratch, based on the structure of the synonymy network.

Method: We have recently proposed an alternative approach for discovering fuzzy synsets from synonymy networks, in two steps [23]: (i) centroid discovery; (ii) fuzzy membership computation. It is applied to a weighted synonymy network \(N=(W,P)\), where W is a set of words and P a set of weighted synonym pairs, with a weight reflecting the number of times a synonym pair, \(P(W_i, W_j)\), occurs in the exploited sources. In the first step, Chinese Whispers [24] (CW), an efficient graph clustering algorithm, is run on the network. This results in a set of hard word clusters, used as centroids. In the second step, the membership degree of each word \(W_i\) to each centroid \(C_k\) is computed by Eq. 1, which considers the number of synonym pairs between \(W_i\) and each word in \(C_k\).

$$\begin{aligned} \small \mu (W_i, C_k) = \frac{\sum _{j=0}^{|C_k|} \#(W_i \text { synonym-of } [C_k]_j)}{|C_k|} \end{aligned}$$
(1)
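For readers unfamiliar with the first step, the following is a simplified sketch of Chinese Whispers on a weighted, undirected synonymy graph. It is not the implementation used in this work: the function name, the fixed number of iterations and the deterministic tie-breaking (the original algorithm breaks ties at random) are assumptions made to keep the example reproducible, and the words and weights are toy values.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """Hard-cluster a weighted undirected graph.

    edges: dict mapping (word_a, word_b) to a positive weight.
    Returns a list of clusters, each a set of words.
    """
    rng = random.Random(seed)
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    label = {word: word for word in neighbours}  # every word starts alone
    words = sorted(neighbours)
    for _ in range(iterations):
        rng.shuffle(words)
        for word in words:
            scores = defaultdict(float)
            for other, weight in neighbours[word].items():
                scores[label[other]] += weight
            # Adopt the strongest class among the neighbours;
            # ties are broken by label (a simplification).
            label[word] = max(scores, key=lambda c: (scores[c], c))

    clusters = defaultdict(set)
    for word, cluster_label in label.items():
        clusters[cluster_label].add(word)
    return list(clusters.values())


# Toy synonymy network with invented weights.
demo_edges = {
    ("diploma", "certificado"): 3, ("certificado", "título"): 2,
    ("diploma", "título"): 2, ("tubo", "cano"): 5,
    ("cano", "canudo"): 3, ("tubo", "canudo"): 4,
    ("canudo", "diploma"): 2,
}
demo_clusters = chinese_whispers(demo_edges)
```

The returned hard clusters play the role of the centroids \(C_k\) in Eq. 1. Because the outcome of CW can depend on the processing order, a fixed seed is used here.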

Example: The synset discovery approach is illustrated in Fig. 1, with the help of a weighted graph where two senses of the Portuguese word canudo arise: a tube/pipe, or, more informally, a diploma. If CW identifies the hard clusters \(C_A\) and \(C_B\), to compute the membership of canudo to the fuzzy cluster \(C'_A\), the weights of the connections between this word and words in \(C_A\) are summed and divided by the size of \(C_A\). Since \(\#({{ canudo } \text { synonym-of } { diploma}}) = 2,\) \(\mu (canudo, C'_A) = \frac{2}{4} = 0.5\). For the membership of canudo to \(C'_B\), the three connections between this word and words in \(C_B\) are considered, plus the word canudo itself, which belongs to \(C_B\) and has the maximum weight (7, if seven sources are exploited). So \(\mu (canudo, C'_B) = \frac{3+5+2+7}{6} = \frac{17}{6} \approx 2.83\). Note that, because weights are summed, memberships are not bounded by 1.

Fig. 1. Weighted lexical network, resulting hard clusters, and fuzzy synsets.
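The membership computation of Eq. 1 can be reproduced for the canudo example with a few lines of code. This is a sketch under stated assumptions: only canudo and diploma are named in the text, so the remaining cluster members (a1, a2, a3, b1, ..., b5) are placeholders, and the function name and data layout are hypothetical.

```python
def membership(word, centroid, weight, max_weight):
    """Eq. 1: summed synonymy weights between `word` and the words of a
    centroid, divided by the centroid size. If the word itself is in the
    centroid, it contributes the maximum weight."""
    total = 0
    for other in centroid:
        if other == word:
            total += max_weight
        else:
            total += weight.get(frozenset((word, other)), 0)
    return total / len(centroid)


# Placeholder clusters: sizes and weights follow the example in the text.
c_a = ["diploma", "a1", "a2", "a3"]
c_b = ["canudo", "b1", "b2", "b3", "b4", "b5"]
weight = {
    frozenset(("canudo", "diploma")): 2,
    frozenset(("canudo", "b1")): 3,
    frozenset(("canudo", "b2")): 5,
    frozenset(("canudo", "b3")): 2,
}

mu_a = membership("canudo", c_a, weight, max_weight=7)
mu_b = membership("canudo", c_b, weight, max_weight=7)
print(mu_a, round(mu_b, 2))
```

With these inputs, the two values match the memberships computed by hand in the example: 2/4 for \(C'_A\) and 17/6 for \(C'_B\).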

Results: A total of 20,315 fuzzy synsets (13,735 noun, 4,827 adjective, 1,126 verb, 627 adverb) were discovered from the synonymy network obtained from the seven exploited resources. On average, noun synsets had 9.4 words, adjective synsets 11.9 and verb synsets 59.3. Verb synsets are larger because the verb network has more connections, which can be interpreted as a higher ambiguity and/or more synonyms for Portuguese verbs. The resulting fuzzy thesaurus was baptised as CLIP 2.1 [23].

Evaluation: To assess the quality of the fuzzy synsets and computed memberships, random pairs of words from the same synset (240 nouns, 150 verbs, 150 adjectives), organised in sets of ten, were uploaded to the Crowdflower platform, where Portuguese-speaking volunteer contributors, living in Portuguese-speaking countries, manually labelled each pair either as possible synonyms or not. In the end, 59 % of the noun pairs, 46 % of the verb pairs and 55 % of the adjective pairs were labelled as correct. Each pair was labelled by two judges, with an agreement (IAA) of 87 %, 85 % and 75 %, respectively. At first, quality does not look very promising. However, it improves for increasing membership degrees. Figure 2 plots the evolution of the proportion of correct pairs for different cut-points – if the membership of one of the words in the pair is below the cut-point, the pair is ignored – and confirms that the computed memberships behave as a confidence measure, because they are positively correlated with quality. For instance, for a cut-point of 1.0, the proportion of correct noun and adjective pairs is 85 % and for verbs 89 %. Moreover, there is a point after which all the pairs are correct. Also in Fig. 2, the total number of words and their average number of senses is presented for each cut-point.

Fig. 2. Evolution of the correct synonymy pairs while increasing the cut-point. (Color figure online)
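The cut-point analysis behind Fig. 2 can be sketched as follows: a judged pair is kept only if both words have a membership at or above the cut-point, and the proportion of correct pairs is computed over the kept ones. The function name is hypothetical and the judged pairs are invented toy values, not the crowdsourced data.

```python
def accuracy_at_cut_points(judged_pairs, cut_points):
    """judged_pairs: iterable of (mu_first, mu_second, is_correct).

    Returns {cut_point: (proportion_correct or None, pairs_kept)}."""
    results = {}
    for cut in cut_points:
        # A pair is ignored if either membership is below the cut-point.
        kept = [ok for mu1, mu2, ok in judged_pairs if min(mu1, mu2) >= cut]
        proportion = sum(kept) / len(kept) if kept else None
        results[cut] = (proportion, len(kept))
    return results


# Invented judgements, for illustration only.
toy_pairs = [
    (0.2, 0.3, False),
    (0.5, 0.9, True),
    (0.4, 1.2, False),
    (1.0, 1.5, True),
    (2.0, 1.1, True),
]
curve = accuracy_at_cut_points(toy_pairs, [0.0, 0.5, 1.0, 3.0])
```

A user of the fuzzy thesaurus could apply the same filter to choose between a larger, noisier resource and a smaller, more reliable one.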

4.2 Discovering Fuzzy Synset Connections

After discovering the fuzzy synsets, some of them may be automatically connected by semantic relations. Possible attachment points can be discovered by exploiting the non-synonymy triples, which is done in this step.

Method: Each pair of synsets, \(S_a\) and \(S_b\), is analysed to set attachment points with a fuzzy score, computed by Eq. 2. For each relation type R, this equation considers the: (i) number of triples of type R between a word from each synset, \(a_i\) and \(b_j\); (ii) number of resources where each of the previous triples occurs, \(\#(a_i, R, b_j)\); (iii) membership of each word in the previous triples to their synset, \(\mu (a_i, S_a)\) and \(\mu (b_j, S_b)\).

$$\begin{aligned} \small c(S_a, R, S_b) = \frac{\sum _{i=0,j=0}^{|S_a|,|S_b|} \big ( \#(a_i, R, b_j) \times (\mu (a_i, S_a) + \mu (b_j, S_b))\big )}{|S_a| + |S_b|} : a_i \in S_a, b_j \in S_b \end{aligned}$$
(2)
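A direct reading of Eq. 2 can be sketched in code as below. The synsets, memberships and triple counts are invented for illustration (they are not the values of Fig. 3), and the function name and data layout are hypothetical.

```python
def connection_confidence(s_a, s_b, relation, triple_count):
    """Eq. 2 for one relation type.

    s_a, s_b: dicts mapping each word of a synset to its membership.
    triple_count: dict mapping (a, relation, b) to the number of
    resources where that triple occurs."""
    total = 0.0
    for a, mu_a in s_a.items():
        for b, mu_b in s_b.items():
            occurrences = triple_count.get((a, relation, b), 0)
            total += occurrences * (mu_a + mu_b)
    return total / (len(s_a) + len(s_b))


# Invented synsets and counts, for illustration only.
s_1 = {"animal": 1.0, "bicho": 0.5}
s_2 = {"cão": 1.0, "cachorro": 0.8}
triple_count = {
    ("animal", "hiperonimoDe", "cão"): 3,
    ("bicho", "hiperonimoDe", "cachorro"): 1,
}
confidence = connection_confidence(s_1, s_2, "hiperonimoDe", triple_count)
```

Here the sum is 3 × (1.0 + 1.0) + 1 × (0.5 + 0.8) = 7.3, which is divided by the four words of the two synsets. Pairs of words with no triple between them contribute nothing to the sum, but still count towards the denominator.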

Example: Figure 3 illustrates the computation of the proposed measure in two synsets with several hypernymy triples between their words. Hypernymy triples used are represented in a graph, where the only redundant triple has weight 3.

Fig. 3. Computing the confidence of the connection \(S_1\) hiperonimoDe \(S_2\).

Fig. 4. Examples of discovered synset connections, their computed confidence, and their rendering, used in the crowdsourced evaluation.

Results: The previous measure was computed between all pairs of discovered fuzzy synsets, with a cut-point of 0.1, for relation triples of any type that were in at least two resources. A total of 52,504 synset connections were discovered, with a score higher than 0. As those did not include triples between words without synonyms, and thus not in the discovered fuzzy synsets, in a second step, when a word w involved in a triple was not in any synset, a new synset \(S_w\) containing just that w was created, with \(\mu (w, S_w) = 1.0\). In the end, 406,751 additional synset connections were made, with at least one synset with a single word. Moreover, 13,542 new single-word synsets were added to the 20,315 multiword synsets discovered earlier.
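The second step described above, where words without synonyms get their own synset, can be sketched as below. The function name and data layout are assumptions, and the synsets and triples are toy values.

```python
def add_singleton_synsets(synsets, triples):
    """Return the synsets plus a new single-word synset, with membership
    1.0, for every word in a triple that is not yet in any synset.

    synsets: list of dicts mapping word -> membership."""
    covered = {word for synset in synsets for word in synset}
    extended = list(synsets)
    for first, _relation, second in triples:
        for word in (first, second):
            if word not in covered:
                extended.append({word: 1.0})
                covered.add(word)
    return extended


# Toy input, for illustration only.
synsets = [{"carro": 1.2, "automóvel": 1.0}]
triples = [("volante", "part-of", "carro"), ("volante", "part-of", "camião")]
extended = add_singleton_synsets(synsets, triples)
```

Each word receives at most one single-word synset, regardless of how many triples it appears in.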

Evaluation: To assess the quality of the discovered synset connections and the suitability of their computed confidence, we relied once again on Crowdflower, where a random selection of 930 synset connections was uploaded. These included only connections where at least one synset had more than one word. To make labelling faster for the contributors, the following was done before uploading: (i) only the first word of each synset was used, as we noticed that it is often the most representative of the underlying concept; (ii) each triple was rendered as a natural language sentence, depending on the relation type. Contributors could label each rendering as either: (i) correct; (ii) incorrect; or (iii) unsure. Figure 4 illustrates, at the same time, the output of the fuzzy attachments and the evaluation samples. It includes the first three words and respective memberships of several synset connections in the sample, their computed confidence, and the textual rendering shown to the contributors.

Figure 5 shows the results of the crowdsourced evaluation and the evolution of the correct connections for increasing cut-points. It also presents the proportion of answers where the contributors were unsure and insights on the size of the fuzzy wordnet for the same cut-points, namely the number of synsets and connections between them. Once again, the initial quality is far from impressive: 49.5 % of the renderings were labelled as correct and 44.3 % as incorrect. Agreement was also lower, at 70 %. It should still be noted that connections between two single-word synsets, with a higher chance of being correct, were not used. In addition, in some cases, the renderings used might be too limiting, as they show just one word per synset. Moreover, though less consistently than in the synonymy evaluation, quality still increased for higher cut-points, which indicates that the computed score behaves as a confidence measure. At the same time, the number of connections is drastically reduced each time the cut-point increases, especially from 0 to 0.25.

Fig. 5. Evolution of the correct triples while increasing the cut-point.

After a shallow error analysis, we noticed that there were several renderings that should have been labelled as correct, but were not. Those included connections with confidence higher than 1.8, such as (origem antonimoDe término), (dicéfalo dizSeDoQue ter_duas_cabeças), or (planta hiperonimoDe bisnaga). Although we asked the contributors to confirm their answers in electronic dictionaries and check for less known senses, or to mark unknown answers as unsure, most of them were probably less experienced or answered the questions too fast, thus not following the instructions strictly.

5 Conclusion and Further Work

The first experiments towards the automatic creation of a fuzzy Portuguese wordnet, through the exploitation of redundancy in available lexical-semantic resources, were presented. The projected wordnet combines the advantages of an automatic creation approach, including lower creation effort for a broad-coverage resource, with the option of controlling the quantity-quality trade-off, with a confidence cut-point. Synsets, discovered from synonymy networks, have words with variable memberships, and they can be connected, by semantic relations of different types, to other synsets, also with variable degrees.

A preliminary version of the resulting wordnet is available, in a non-standard format, from http://ontopt.dei.uc.pt, under the option CONTO.PT. We are still studying alternatives for representing CONTO.PT with standard formats, such as RDF/OWL.

Besides dealing with the previous issue, there is additional work to do. Alternative ways of computing confidence from redundancy should be explored, especially for the synset attachment, where the current measure seems to be biased towards smaller synsets. In order to measure progress, we can use the annotated data collected from crowdsourcing or, given its limitations, a more controlled evaluation might be performed by more experienced and trustworthy judges. It should also be analysed whether the synset memberships can be adjusted when connecting synsets. For instance, if several words of the same synset share a relation with another word, their memberships may increase.

It should be added that, although applied to Portuguese, this approach can be used to create fuzzy wordnets in other languages, as long as there are available computational lexical resources, whether they are dictionaries, thesauri, wordnets or even relations extracted from corpora.