1 Introduction

The Web is turning into a data-oriented platform. Collaborative initiatives like Linked Open Data (LOD) rely on accepted standards and provide a set of best practices to promote the sharing and consumption of data at large scale. A growing portion of this data consists of linguistic information describing lexicons and their usage. An important resource within this scope is Wiktionary,Footnote 1 arguably the leading source of lexical information available today. Wiktionary is an online collaborative project, based on the principle of the “wisdom of the crowd”, that aims to build an open multilingual dictionary available to everybody. Since its inception in 2002, Wiktionary has grown considerably and has therefore caught the attention of many researchers. Several studies have compared the usability of Wiktionary with traditional expert-edited lexicographical efforts (Fuertes-Olivera 2009; Meyer and Gurevych 2012). Others have reused the data provided by Wiktionary for information retrieval (IR) and natural language processing (NLP) tasks (Müller and Gurevych 2009; Zesch et al. 2008; Navarro et al. 2009). Additional approaches have focused on aligning Wiktionary with other available resources (Matuschek et al. 2013; Miller and Gurevych 2014).

In this paper we exploit the multilingual dimension of Wiktionary. We perform a quantitative analysis of the existing translations in order to measure their level of reliability and guarantee a minimum of quality during their consumption. Our analysis relies on ranking approaches such as random walks, which have been shown to provide successful results in IR and related scenarios. Part of our research focuses on the study of existing formats and mechanisms for exchanging linguistic data. We use the gained expertise to design an ontological model that copes with the interoperability issues associated with our scenario, and we use this model to share the generated data as part of the public open data cloud.

The paper is organized in two main parts. The first part focuses on our approach for associating confidence values with Wiktionary translations. In Sect. 2 we describe the multilingual structure of Wiktionary. Section 3 introduces our approach: we describe the formalisms of our algorithm in Sect. 3.1 and present a quantitative evaluation in Sect. 3.2. The second part, which starts in Sect. 4, deals with LOD approaches for modeling and sharing linguistic data. In Sect. 4.1 we describe the lemon vocabulary. The different efforts for handling multilingual data are described in Sect. 4.2. Section 4.3 describes our approach for adding confidence annotations as an extension to the lemon vocabulary. In Sect. 4.4 we show the dataset that results from representing the data generated in the evaluation with the proposed vocabulary. Finally, Sect. 5 concludes this work and enumerates possible extensions.

2 Wiktionary

The task of compiling lexical data to build dictionaries has traditionally been carried out by lexicographical experts at well-known educational institutions. Independently of the format chosen to distribute the dictionary (printed vs. digital, offline vs. online), the creation process is time-consuming and tedious. The social aspect introduced by the Web 2.0 pushed forward the collaborative nature of this task by embracing wiki-based initiatives that allow any Internet user to contribute in this area. Wiktionary was born out of this initiative. Although it is a relatively young project compared to established dictionary providers, its online presence gives an idea of its traction within the community. Figure 1 shows a comparison of the traffic metrics of Wiktionary and four other well-known dictionary providers.

Fig. 1
figure 1

Dictionary traffic metric comparison by Alexa

The advantages of Wiktionary in comparison to other expert-built resources stem from the following properties:

  • Collaborative the easy handling of the underlying wiki-based system allows any Web user to contribute. Thanks to the growing community and its contributions, the data keeps up to date with changes in the language. Additionally, a group of editors takes care of reviewing changes and preventing the loss of data due to vandalism.

  • Rich coverage Wiktionary offers an article for each lexical word, containing diverse information such as definitions, part of speech, etymology, senses, lexico-semantic relationships (homonyms, synonyms, antonyms, hyponyms, hyperonyms) and translations. This information is especially valuable for smaller languages, for which resources are otherwise hard to find.

  • Open data the content offered by Wiktionary can be reused under the Creative Commons CC-BY-SA 3.0Footnote 2 and GNU Free DocumentationFootnote 3 licenses. This makes Wiktionary particularly attractive to app developers who want to build ad hoc functionality on top of its content without dealing with subscription fees.

  • Multilingual Wiktionary supports different languages, organized in separate language editions, each with its own community of supporters and contributors. For instance, the Spanish edition of Wiktionary contains information about lexical words in Spanish (in this case, the wiki UI is also rendered to the user in Spanish). Language editions contain information not only about words in their own language but also about foreign words. In this way, for example, the Spanish Wiktionary contains information about English words described in Spanish. The idea behind this feature is to allow the description of foreign words in one’s native language.

    The available content in the different languages is interlinked by translation links and inter-wiki links. The former are links within the same language edition connecting words with equivalent meaning in two languages, e.g., “bank” (English) http://en.wiktionary.org/wiki/bank and “banco” (Spanish) http://en.wiktionary.org/wiki/banco#Spanish. The latter are links to the same word in a different language edition, e.g., “bank” (English) http://en.wiktionary.org/wiki/bank and “banco” (Spanish) http://es.wiktionary.org/wiki/bank.

    The current mechanism implemented in Wiktionary for adding a new translation to a certain word is the following: (a) the user navigates through the word’s senses until finding the required one within a certain language edition. (b) In the translations section of the chosen sense, the user must make sure that the translation she wants to add does not yet exist for the target language. If it does not, she can use the form shown at the bottom of the list of available translations to enter the details when using the Web UI (Fig. 2), or add the scripting code as described in the public guidelineFootnote 4 when using the wiki editor. (c) The added translation is reviewed by the responsible editor of the language edition and either made available to the public or rejected and removed from the resource.

    At the time of writing Wiktionary supports 171 languages and provides more than 21 million articles.Footnote 5

Even though these properties have steered Wiktionary in the right direction, the project is still a resource under construction and presents data deficiencies that make users doubt the reliability of the information. These deficiencies are more pronounced for some languages than for others, especially when dealing with translations, as has been pointed out in public criticism of Wiktionary.Footnote 6 In the following, we describe our attempt to annotate Wiktionary translations with confidence values in order to give users an idea of the reliability of the data.

Fig. 2
figure 2

Wiktionary—“Add translation” form

3 Confidence analytics

Several publications in the literature have addressed the problem of measuring the reliability of Wikipedia articles. Most approaches rely on the analysis of an article’s history to extract measures that correlate with its quality. In this direction, Lih (2004) suggested a correlation between the quality of an article and the number of editors and revisions. Lim et al. (2006) proposed a ranking of Wikipedia articles based on three metrics: the length of the article, the number of revisions and the reputation of the editors, which depends on the number of previous edits. Other approaches focus on analyzing the text content of the articles. Blumenstock (2008) used metrics like the number of words, characters and sentences within an article to classify articles as featured or non-featured. The authors achieved an accuracy of 97%, showing that the number of words is the best metric for performing such a classification.

Compared to the case of Wikipedia, assessing the quality of Wiktionary content is an area that has been explored far less. Fuertes-Olivera (2009) evaluates the adequacy of Wiktionary from a lexicographical point of view with regard to certain social needs, for instance the use case of learning business-related vocabulary. In this work, a qualitative analysis of the lexical information available for terms related to the business topic is performed manually. The analysis inspects the structure and the kind of lexical content (definitions, part of speech, phonetics, usage examples, translations, etc.) used to describe each entry. It focuses only on the English–Spanish scenario, but remarks on the English-centered nature of Wiktionary, i.e., the amount of content in the English Wiktionary is several orders of magnitude larger than in other language editions. Meyer and Gurevych (2012) perform another qualitative analysis for the English, German and Russian editions of Wiktionary and compare these resources with the work done by professional lexicographers. Weale et al. (2009) and Sajous et al. (2013) address the problem of detecting new synonyms for Wiktionary entries. Both works rely on the link structure of Wiktionary and the use of random walks to measure the degree of synonymy among the different candidates. These two approaches are the closest to our work that we could find in the literature. Despite these analyses, to the best of our knowledge no quantitative approaches have been published addressing the reliability of Wiktionary translations. In the following we describe our contribution.

3.1 Translation-confidence algorithm

The approach we present does not target the quality of individual Wiktionary articles as a whole, as has been done in previous research on Wikipedia, but focuses only on translations. Unlike history-based approaches, which mostly use text mining, we base our computation on link analysis.

Previously, we described how translations are organized in Wiktionary through two kinds of links, namely translation links and inter-wiki links. As already discussed, these links define the structure of the multilingual dimension of Wiktionary. We base our approach on the following hypothesis: the way users contribute, by adding new translations or modifying existing ones, has an impact on how the multilingual structure is organized. We argue that user contributions and the link structure formed by translations are strongly related, making it possible to extract measures that reflect the reliability of translations. We believe the wisdom of the crowd can be used to reflect the quality of the translations available in Wiktionary at a given point in time. If Wiktionary users add links to a certain word, there is a higher chance that this word is correct and that its translation pairs are relevant. In other words, we believe there is a relation between the popularity of words and translation quality.

Note that the heuristic we propose in this work does not itself decide whether a translation pair is correct or not; rather, it gives the user a confidence score as additional information that can support this decision.

Our approach uses PageRank (Brin and Page 1998) to estimate the quality of the translation pairs. PageRank is an algorithm originally conceived for ranking Web documents. Considering the structure of the Web as a directed graph, where the nodes are the Web documents and the edges are the links among them, the idea behind PageRank is to assign more relevance to documents with a larger number of incoming links. The authors consider this value a measure of the attention towards a certain document. From another perspective, this relevance represents the probability that a random surfer ends up at a Web document after following a certain number of links. The relevance of a Web document is higher when it is referenced by important documents. This implies a recursive definition of rank: the rank of a Web document is a function of the ranks of the Web documents referencing it.

Applying the previous concept, we build an input graph relying only on inter-wiki links and discarding translation links. The reason for this is that translation links are strongly dependent on each language edition and do not always contain rich information about the translations, i.e., there are many entries for which the translation link points to a to-be-done Wiktionary page.Footnote 7

There are other considerations to take into account that have a clear impact on the resulting ranking. First, due to the varying amount of contributions for each supported language, the different language editions must be kept apart when building the input graph. This means that only links within the same language pair are considered, which avoids penalizing translations that are popular in some language pairs but not available for others. Second, it is necessary to organize the translations in what we call an Isolated Semantic Graph (ISG). An ISG is a directed graph whose nodes are existing entries in a language edition and whose edges are inter-wiki links from a single source language edition to a single target language edition. A pair of nodes joined by an inter-wiki link forms a translation pair. The key point about ISGs is that they contain only translations of words that are semantically related, i.e., words connected by inter-wiki links within n hops. Figure 3 shows the ISG associated with the English word “able” and targeting the Spanish language edition.
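As an illustration of this step, the following sketch builds an ISG by breadth-first expansion over inter-wiki links starting from a seed word. It is a minimal sketch under the assumption that a helper interwiki_links(word) is available (e.g., wrapping a lookup against a Wiktionary dump); that helper and the variable names are hypothetical.

```python
from collections import deque

def build_isg(seed_word, interwiki_links):
    """Build an Isolated Semantic Graph (ISG) for a seed word.

    interwiki_links(word) is assumed to return the entries reached from
    `word` through inter-wiki links; since those links connect a source
    and a target language edition, successive calls hop between the two
    editions, yielding the n-hop expansion described in the text.
    """
    nodes, edges = {seed_word}, set()
    queue = deque([seed_word])
    while queue:
        word = queue.popleft()
        for linked in interwiki_links(word):
            edges.add((word, linked))      # directed inter-wiki link
            if linked not in nodes:
                nodes.add(linked)
                queue.append(linked)
    return nodes, edges
```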

Fig. 3
figure 3

\({ ISG}_{en, es}\left( able \right) \)

3.1.1 Notation

We use the following notation to present our algorithm. Let WLE be the set of all available language editions of Wiktionary. We denote every single language edition by \({ WLE}_{lang}\), where lang is the ISO code corresponding to the particular language. For instance, \({ WLE}_{en}\) represents the English language edition of Wiktionary. Every language edition contains a set of words in the same language, \(\{w^{en}_1, w^{en}_2, \ldots , w^{en}_n\}\).

We define the following function:

$$\begin{aligned} f_{source \rightarrow target}\left( w^{source}_i\right) = \left\{w^{target}_1, \ldots , w^{target}_m\right\} \end{aligned}$$

For a given word \(w^{source}_i\) in \({ WLE}_{source}\), f returns the set of words belonging to \({ WLE}_{target}\) that are connected with \(w^{source}_i\) through an inter-wiki link.

In the same way, we define:

$$\begin{aligned} arc_{source \rightarrow target}\left( w^{source}_i\right) = \left\{e^{source \rightarrow target}_{i,1}, \ldots , e^{source \rightarrow target}_{i,m}\right\} \end{aligned}$$

as the function returning the set of edges for a given word in \({ WLE}_{source}\) with inter-wiki links to \({ WLE}_{target}\). Note that this set represents in fact the set of inter-wiki links.

Let \(V^{source, target}_i\) and \(E^{source, target}_i\) be the sets of all words and links obtained after applying f and arc recursively to the word \(w^{source}_i\) and to each word in \(\{w^{target}_1, \ldots , w^{target}_m\}\). We then define:

$$\begin{aligned} ISG_{source, target}\left( w^{source}_i\right) = \left\{V^{source, target}_i, E^{source, target}_i\right\} \end{aligned}$$

\(w^{source}_i\) is known as the seed word for generating the ISG.

In order to build the network that is going to be used as input to the PageRank, we merge all ISGs under a Unified Semantic Graph (USG):

$$\begin{aligned} USG_{source, target} = \bigcup _{1\le i \le n} \left\{ISG_{source, target}\left( w^{source}_i\right) \right\} \end{aligned}$$

We define a translation pair containing a word \(w^{source}_i\) from \({ WLE}_{source}\) and a word \(w^{target}_j\) from \({ WLE}_{target}\) as:

$$\begin{aligned} t^{source, target}_{i,j} = \left( w^{source}_i, w^{target}_j\right) \end{aligned}$$

Note that \(( w^{source}_i, w^{target}_j) \equiv ( w^{target}_j, w^{source}_i) \), since “translation of” is a symmetric relationship.

We define the set of all translation pairs in \({ WLE}_{source}\) and \({ WLE}_{target}\) as:

$$\begin{aligned} T_{source, target} = \left\{ t^{source, target}_{i,j} | 1 \le i \le n; 1 \le j \le m \right\} \end{aligned}$$

T does not contain repeated translations, i.e., in the case of symmetric pairs we consider only one translation. Our proposed algorithm assigns a confidence value in the range [0, 1] to each translation pair within \(T_{source, target}\). To do so, we apply the PageRank algorithm to the nodes of the \({ USG}_{source, target}\) graph. This results in the association of a PageRank score with every node of the graph, i.e., \(w^{source}_i\) or \(w^{target}_j\), independently. As we are interested in a confidence score for the translation pairs, we combine both PageRank values using the following expression:

$$\begin{aligned} Score\left( t^{source, target}_{i,j}\right) = \frac{PR\left( w^{source}_i\right) \text { + } PR\left( w^{target}_j\right) }{2} \end{aligned}$$

This expression computes the mean of the PageRank values associated with each component of the pair. The resulting value is the confidence score our algorithm assigns to the translation pair.
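As a minimal sketch of this computation, assuming the USG is available as a list of inter-wiki edges (for instance, the union of the edge sets of all ISGs), the PageRank scores and the pairwise confidence values could be obtained with the networkx library as follows; the function and variable names are illustrative, not part of our released code.

```python
import networkx as nx

def score_translation_pairs(usg_edges, translation_pairs):
    """Assign a confidence score to each translation pair of a USG.

    usg_edges: iterable of (source_word, target_word) inter-wiki links.
    translation_pairs: iterable of (w_source, w_target) pairs to score.
    """
    usg = nx.DiGraph()
    usg.add_edges_from(usg_edges)
    # Standard PageRank; the damping factor 0.85 is the usual default.
    pr = nx.pagerank(usg, alpha=0.85)
    # Confidence of a pair = mean of the two individual PageRank values.
    return {(ws, wt): (pr[ws] + pr[wt]) / 2.0 for ws, wt in translation_pairs}
```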

3.2 Evaluation

In order to evaluate our approach we applied it to a subset of Wiktionary. We used Ogden’s basic English word list,Footnote 8 consisting of 850 selected words, as a seed for generating the USG containing the translation pairs to be ranked and assigned confidence values.

The aim of choosing a predefined list of words is to have a reference for comparison with future approaches. Nevertheless, our algorithm can be adapted to take as input a word list generated from a chosen Wiktionary dump containing the latest available data. Our experiment relies only on English and Spanish, but it can easily be extended to other language combinations.

3.2.1 Accuracy estimation

Given Ogden’s list of seed words, we build the \({ USG}_{en,es}\), which contains a total of 7366 nodes, and compute the confidence values for each translation pair as described previously. Figure 4 shows the histogram of the confidence values. The resulting dataset has approximately 12k translation pairs. As can be appreciated, the most frequent values lie near the mean (0.0001357).

We then label each pair using the following binary classification:

$$\begin{aligned} g\left( t\right) = {\left\{ \begin{array}{ll} 1, &{} \text{ if } Score\left( t\right) \ge \frac{1}{\#\,\text{nodes in } USG} \\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

In this expression, the value 1 represents a translation pair that is commonly accepted, while the value 0 represents a translation that is rare. Note that we use 1/(#nodes in USG) as the classification threshold because it is the average value computed by PageRank: the computed PageRank vector represents an N-dimensional probability distribution, so all probabilities must add up to 1. We use this binary classification with the aim of measuring the accuracy of our approach against manually generated judgments.
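A minimal sketch of this binarization, reusing the scores computed above, could look as follows:

```python
def binarize(pair_scores, n_nodes):
    """Label a pair 1 (common) if its confidence reaches the PageRank
    average 1/n_nodes, and 0 (rare) otherwise."""
    threshold = 1.0 / n_nodes
    return {pair: int(score >= threshold) for pair, score in pair_scores.items()}
```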

Fig. 4
figure 4

Histogram for the confidence values extracted from \({ USG}_{en, es}\)

We selected a total of 10 people, 5 native Spanish speakers and 5 native English speakers, with an intermediate or higher level in the respective foreign language (a minimum B2 level of the CEFRFootnote 9 was required). We prepared a survey consisting of 100 translation pairs extracted randomly from the approximately 12k generated pairs. We asked the volunteers to label each translation pair as either common or rare according to their knowledge. The surveys were made available online via Google Forms and submitted anonymously. For a survey to be considered valid, all 100 responses had to be filled in. From all received submissions we calculated the average response vector and used it for our comparison.

Additionally, we used a random baseline in our comparison, simulating 500 independent evaluators. Each evaluator is represented by a vector of 100 elements, each element being either 0 or 1. We computed the average of the 500 vectors to obtain the final vector used for labeling the translation pairs. Table 1 shows the 100 translation pairs together with the computed confidence. In addition, each pair shows the label produced by each of the previous approaches.
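The random baseline can be reproduced with a short simulation such as the one below. The binarization of the averaged vector (majority vote) is our assumption, since the text does not detail how the averaged responses were turned back into labels.

```python
import random

def random_baseline(n_evaluators=500, n_pairs=100, seed=42):
    """Simulate independent evaluators labeling each pair with 0 or 1
    and average their vectors into a single baseline label vector."""
    rng = random.Random(seed)
    votes = [[rng.randint(0, 1) for _ in range(n_pairs)]
             for _ in range(n_evaluators)]
    averages = [sum(column) / n_evaluators for column in zip(*votes)]
    # Assumed binarization: label 1 when the average reaches 0.5.
    return [int(avg >= 0.5) for avg in averages]
```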

Table 1 Evaluation results

Table 2 shows the precision and recall for our approach and the random baseline. While the precision of our system is higher than that obtained with the random approach, the recall is nearly the same. The reason for the low recall lies in the way we compute the confidence of a pair by combining the individual PageRank values of its words. As stated previously, we obtain the final score by computing the average of both values, which assumes that both words have the same weight in the final confidence score. We observed cases like the pair (hueso, bone), with individual PageRank values of (8.580229958181926e-05, 0.00011250231809933178), showing that bone is considerably more relevant than hueso. This means that the structure of one language edition can boost or penalize the final confidence value. A future line of improvement would consist in evaluating whether the structure of the ISGs could be used to extract weights that help tune the final confidence scores.

In any case, the binary classification was used only for the sake of measuring the accuracy of our system. The main purpose of the computed scores is to serve as a graded measure: instead of reducing each translation score to a binary value (common vs. rare), it can be used directly as an estimate of confidence.

Table 2 Precision-recall comparison for our approach and a random baseline

4 Sharing translations as Linked Open Data

Enabling interoperability among different NLP systems that perform complementary tasks in a LOD environment would facilitate the comparison of results, as well as the combination of tools to build more complex and reusable processes. One step towards NLP interoperability is the definition of shared vocabularies for handling linguistic data. In this section we focus on the definition of a common vocabulary for modeling linguistic data, with special attention to multilingual translations.

At the time of writing, lemon Footnote 10 (McCrae et al. 2012) can be considered the de facto vocabulary for modeling linguistic data on the LOD cloud. In order to avoid reinventing the wheel by proposing a new model, we reuse lemon as much as possible and extend only those components needed to fulfill our requirements.

4.1 Overview of the lemon model

The LExical Model for ONtologies (lemon) is an RDF model for describing linguistic information associated with ontologies. One of its main features is that it keeps the linguistic descriptions independent of the target ontology. This means that using an ontology is not compulsory in order to create linguistic descriptions. This design decision is pushing the lemon vocabulary towards becoming the de facto model for describing and exchanging linguistic resources on the Web.Footnote 11

The different modulesFootnote 12 that compose the lemon vocabulary are the following:

  • Core as the name indicates, this module is the central part of the lemon model. It provides the constructions required to represent lexical descriptions and associate them with concepts of an external ontology. Figure 5 shows a representation of the vocabulary constructions available in this module. The class Lexicon is used to encapsulate lexical descriptions sharing certain characteristics, for example lexical entries belonging to the English language. In this way, lemon allows sets of descriptions to be grouped independently.

  • Linguistic description this module provides properties for adding information to the lexicon such as part of speech, gender, number, tense, phonetics, etc.

  • Variation this module provides the vocabulary constructions necessary to build relationships among the elements of a lexicon, for example synonyms, antonyms and translations.

  • Phrase structure this module provides several constructions for dealing with multi-word expressions.

  • Syntax and mapping this module contains the needed vocabulary to establish syntactic rules between lexical components.

  • Morphology the target of this module is to handle the different forms of a lexical entry.

Fig. 5
figure 5

(source: http://www.lemon-model.net)

Lemon core model.

4.2 Coping with translations

Several approaches have been developed to annotate ontologies with natural language descriptions, such as the rdfs:label (Manola and Miller 2004) or skos:prefLabel (Miles and Bechhofer 2009b) properties. These properties allow the use of simple multilingual labels (e.g., rdfs:label “mountain”@en). However, the main limitation of this approach is the impossibility of creating explicit links between labels. To overcome this limitation, SKOS-XL (Miles and Bechhofer 2009a) introduced a skosxl:Label class that allows labels to be handled as RDF resources. In addition, a skosxl:labelRelation property was introduced to create links between instances of skosxl:Label, which can be used to establish translation relationships.
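A minimal sketch of this pattern with the rdflib library is shown below; the example.org resource URIs are invented for illustration, while the SKOS-XL terms are taken from the W3C specification.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SKOSXL = Namespace("http://www.w3.org/2008/05/skos-xl#")
EX = Namespace("http://example.org/labels/")   # illustrative namespace

g = Graph()
g.bind("skosxl", SKOSXL)

# Labels reified as resources, so they can be linked to one another.
g.add((EX.mountain_en, RDF.type, SKOSXL.Label))
g.add((EX.mountain_en, SKOSXL.literalForm, Literal("mountain", lang="en")))
g.add((EX.montana_es, RDF.type, SKOSXL.Label))
g.add((EX.montana_es, SKOSXL.literalForm, Literal("montaña", lang="es")))

# Generic label-to-label relation, usable to state a translation link.
g.add((EX.mountain_en, SKOSXL.labelRelation, EX.montana_es))

print(g.serialize(format="turtle"))
```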

The lemon core model allows the implicit representation of multilingual information, in the sense that several lexicon models using different natural languages can reference the same ontology. We call this an implicit representation because translation relations can only be inferred when they point to the same ontology entity. When more information about translations must be stated, or when references from the lexical entries to the ontology are not available, additional approaches must be considered in order to represent translations explicitly as RDF resources. The advantage of introducing explicit translations is that the model becomes more independent of the target ontology.

lemon does not contemplate the explicit handling of translations in its core model. Emerging efforts try to remain independent of any ontology and therefore implement an explicit model for translations. A first step in this direction has been implemented in the lemon variation module, which introduces a senseRelation property that can be used to create relationships between senses. As a specialization of this property, the isocat:translationOf property is used to create translation links between different senses of words. The translation link itself does not include information about the language pair that composes the translation; the only way to extract the languages is through the lexicons containing the terms.

Montiel-Ponsoda et al. (2011) introduce an extension module to lemon for modeling translations explicitly. The main part of this module is the Translation class, which represents the relation between lexical senses. The instances of the lemon:LexicalSense class composing the translation pair are referenced by two new properties, namely sourceLexicalSense and targetLexicalSense. The authors differentiate two kinds of translations, i.e., LiteralTranslation and CulturalEquivalenceTranslation. The first type is used when the semantics of the sense are equivalent in both languages, e.g., “mountain”@en and “Berg”@de. The second type is used when the lexical senses cannot be considered exactly equivalent, but only equivalent within a certain cultural context, e.g., “Prime Minister” and “Presidente del Gobierno” in the British and Spanish political systems, respectively. Figure 6 shows an overview of this model.

Fig. 6
figure 6

Translation handling model suggested in Montiel-Ponsoda et al. (2011)

Gracia et al. (2014) propose a modificationFootnote 13 of the work described in Montiel-Ponsoda et al. (2011). Figure 7 shows an overview of the model. The main difference lies in the introduction of a TranslationSet class, which groups the different translations in a similar way to how lemon:Lexicon groups lexical entries. The Translation class has been modified so that the original specialization classes, i.e., LiteralTranslation and CulturalEquivalenceTranslation, are removed. That information can still be modeled, however, by using a translationCategory property that points to an external registry of translation types.Footnote 14 At the time of writing, the authors suggest the following categories:

  • directEquivalent Typically, the two terms are semantically equivalent and refer to entities that exist in both cultures and languages. E.g., “surrogate mother”@en and “mère porteuse”@fr.

  • culturalEquivalent Typically, the two terms describe entities that are not semantically but pragmatically equivalent, since they describe similar situations in different cultures and languages. E.g., “Ecole Normal”@fr and “Teachers college”@en.

  • lexicalEquivalent This is said of terms in different languages that usually point to the same entity, where one of them verbalizes the original term using target-language words. E.g., “Ecole Normal”@fr and “Normal School”@en.

In addition, a context property has been added to Translation, pointing to extra information that specifies the concrete context in which the pair of lexical senses forms a valid translation. This information can be very important when disambiguating the senses of a word.
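The following rdflib sketch shows how a single translation pair could be stated under this model. The class and property local names follow the description above, but the namespaces and instance URIs are placeholders (the published specification should be consulted for the real ones), and the membership property linking the TranslationSet to the Translation is hypothetical.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

# Placeholder namespaces standing in for the ones defined by Gracia et al. (2014).
TR = Namespace("http://example.org/translation#")
LEX = Namespace("http://example.org/lexicon/")
CAT = Namespace("http://example.org/translation-categories/")

g = Graph()

t = URIRef("http://example.org/translation/mountain_montana")
g.add((t, RDF.type, TR.Translation))
g.add((t, TR.sourceLexicalSense, LEX.mountain_sense_en))
g.add((t, TR.targetLexicalSense, LEX.montana_sense_es))
g.add((t, TR.translationCategory, CAT.directEquivalent))

# Group the translation in a set, analogous to how lemon:Lexicon groups entries.
ts = URIRef("http://example.org/translation/set_en_es")
g.add((ts, RDF.type, TR.TranslationSet))
g.add((ts, TR.trans, t))   # hypothetical membership property

print(g.serialize(format="turtle"))
```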

Fig. 7
figure 7

Translation handling model suggested in Gracia et al. (2014)

Sérasset (2014) introduces DBnary, a multilingual LOD datasetFootnote 15 built using Wiktionary as data source. DBnary relies on lemon and adds an extension for dealing with the information in every Wiktionary page. Of all the lexical relations that appear in Wiktionary and are handled in DBnary, we focus only on translations. To make translations explicit, a Translation class is defined, which provides the following set of properties:

  • isTranslationOf Relates a Translation to a LexicalEntry or LexicalSense. It is important to note the difference from the models in Montiel-Ponsoda et al. (2011) and Gracia et al. (2014), where sourceLexicalSense and targetLexicalSense properties are used to describe the translation pair. In DBnary only one relation is needed to indicate the source of the translation, because the target is modeled with the string property writtenForm. This implies a redundancy of information in DBnary and requires consolidating the information pointed to by writtenForm with that included in the corresponding LexicalEntry or LexicalSense.

  • targetLanguage Points to one of the ISO 639-3 language codes available in the lexvo Footnote 16 namespace.

  • writtenForm Gives the string representation of the translation in the target language.

  • gloss Is a string containing information that determines the context of the lexical sense. It is similar to the context property introduced in Gracia et al. (2014).

  • usage Gives extra information about the translation as a string, e.g., gender, number, etc.

Figure 8 depicts an overview of the model.

Fig. 8
figure 8

Translation handling model suggested in Sérasset (2014)

4.3 Modeling confidence and provenance

Montiel-Ponsoda et al. (2011) and Gracia et al. (2014) remark on the need for modeling provenance and confidence scores associated with translations. The reasons for this are the ambiguity of lexical senses (a word in a certain language usually has several senses, which may be translated differently into a target language) and the subjectivity that characterizes the translation of lexical content, e.g., the limited accuracy of automatic methods, the lack of consensus among human contributors, etc.

The model in Montiel-Ponsoda et al. (2011) allows the specification of provenance information associated with translations by using the translationOrigin property to point to external resources. In Gracia et al. (2014), provenance is also taken into account through the use of DCMI metadata termsFootnote 17 associated with Translation and TranslationSet instances.

Additionally, in Montiel-Ponsoda et al. (2011) confidence values can be added to a translation by using the confidenceLevel property [renamed translationConfidence in Gracia et al. (2014)]. Confidence can be interpreted as a ranking score denoting how trustworthy the defined translation is. The authors state that the “confidence level will ultimately depend on the translation tools and translation resources employed to obtain translations”. However, they do not include any vocabulary construction in their approach to model information about the applied tool. Ranking scores by themselves do not provide enough information to be exploited during data consumption. With this idea in mind, we developed a vocabulary for ranking (vRank), which allows the reification of ranking data. We use vRank for modeling the confidence associated with lexical translations. vRank has been designed to facilitate reuse and is therefore independent of the targeted use case.

The purpose of vRank is to provide data consumers with a standardized, formal, unambiguous, reusable and extensible way of representing ranking computations. How data is consumed depends strongly on what is relevant for data consumers. When data providers offer a ranking service, they obviously cannot contemplate all possible relevance models of consumers. Hence the need for the functionality that vRank implements: associating different ranking models with the same data. The following requirements have guided the design of vRank:

  1. The need to unify the way ranking algorithms are developed, in order to promote reusability and evaluation.

  2. The need for a common and accepted model to homogenize the exploitation of ranking services.

  3. The need to isolate data from any kind of assumption regarding publication and consumption (data providers and consumers may not share the same interests).

Offering ranking computations as part of the data can facilitate its consumption in several ways:

  • Different relevance models computed by diverse ranking strategies can coexist within the same dataset. Consumers can adapt data requests to their relevance expectations.

  • Data ranking becomes open and shareable. Consumers can reuse a specific way of ranking a dataset. If existent ranking approaches do not suit consumers’ needs, they can extend the dataset with their own method.

  • Consumers can reuse ranking scores in order to evaluate and compare different strategies over a given dataset.

  • Consumers (and not data providers) have control over how they want to consume the data, giving preference to what is most relevant to them.

In the following we develop these aspects in more detail.

Ranking crystallization Ranking algorithms rely on data structures that are used to compute the final scores of data items. Traditionally, these data structures are kept internal and inaccessible to users. Through a service, data consumers can submit their queries and retrieve a list of results ordered according to the implemented relevance model. This behavior turns ranking algorithms into a black box, which makes it very difficult, if not impossible, to reuse and share computations over existing data.

The relevance models computed by ranking algorithms need to be materialized in a way that can be offered publicly and can be queried by data consumers. The publication can be done “easily” in RDF with a vocabulary that models the ranking domain. This is what vRank has been defined for.

SPARQLFootnote 18 is considered the standard language for querying the Web of Data; however, it does not support any kind of ranking apart from ORDER BY clauses. By adopting vRank it is not necessary to extend SPARQL with ranking support, as the ranking is made explicit within the dataset. Consumers do not need to learn a different query language or any kind of extension; they can still use ORDER BY clauses and simply adapt their queries to the corresponding vRank triples.

Ranking evaluation Due to the different policies used in ranking, it is very difficult to establish a technical comparison that analyzes the accuracy and precision of each algorithm relative to others. One of the main contributions of vRank is that it helps to homogenize the way ranking services are exploited, so that third parties can compare and evaluate them.

Evolution of data In an open environment like the Web, data is constantly undergoing modifications and revisions. When data is updated, the ranking scores associated with the data items have to be updated as well. A consumer may be interested in analyzing the ranking scores over time in order to predict future changes that might affect her consumption patterns. The mechanism implemented by vRank opens new possibilities for addressing the problem of measuring changes within data.

Multirelevancy Consumers can make use of the available ranking scores to combine and compose their own ranking functions. This approach is addressed in the literature under the name of rank aggregation (Dwork et al. 2001). Following the same pattern, the newly obtained scores can be materialized and shared by using vRank.

4.3.1 vRank model

vRank aims to model ranking information within datasets. We have tried to keep the design simple and have therefore reused existing vocabularies wherever possible. A full specification of vRank is available under the namespace http://purl.org/voc/vrank#. Figure 9 shows an overview of vRank. In the following, we describe the core components of the vocabulary.

Fig. 9
figure 9

vRank overview

  • Algorithm: In vRank an Algorithm is an entity that models metadata about a ranking implementation. The main purpose of this entity is to provide provenance information about the ranking scores. By knowing which settings produced certain ranking scores, a data consumer can decide which ranking approach should be applied to the requested data. In order to characterize a certain algorithm, vRank allows the use of features and parameters.

  • Feature: A Feature complements the description of an algorithm in terms of its functionality. Features should be specified by the authors of the ranking approach with the aim of helping data consumers understand it. As already mentioned, ranking algorithms offer diverse functionality, which in many cases is combined within the same implementation.

  • Parameter: A Parameter adds a finer level of description than a Feature. The main target of a Parameter is to capture the specific configuration of the algorithm that leads to the obtained ranking scores. An example of Parameter is the damping factor used by PageRank.

  • Rank: Rank is an entity that formalizes the ranking scores associated with a data item. Anything that can be modeled in RDF can have an associated Rank. The flexibility of the model resides in relating different instances of Rank to a particular data item. A Rank by itself is meaningless; therefore, Ranks are related to Algorithms and to concrete executions (defined by specifying different Parameters). In order to capture different executions with certain settings we have added a timestamp to the Rank entity.

Fig. 10
figure 10

Overview of the proposed lemon extension

4.3.2 Consolidated model

As already stated, our aim is to reuse models accepted by the community in order to avoid reinventing the wheel with yet another approach. Figure 10 shows a representation of the resulting vocabulary that we use for modeling multilingual translations with the associated confidence annotations. We use the prefix lemon-tmp to refer to the vocabulary related to translations. As can be appreciated, we have relied on previous work addressing the description of translations as LOD, especially Gracia et al. (2014). A translation is modeled through the entity lemon-tmp:Translation, which contains two relations to lemon:LexicalSense representing the source and the target senses of the translation pair. This information is modeled through the properties lemon-tmp:sourceLexicalSense and lemon-tmp:targetLexicalSense, respectively. The confidence information related to each translation pair is modeled through the property vrank:hasRank, which serves as the junction between lemon and vRank.
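The rdflib sketch below illustrates this junction for one translation pair. Only vrank:hasRank is mentioned in the text; the remaining vRank property names (rankValue, hasAlgorithm), the lemon-tmp namespace URI, the instance URIs and the score value are assumptions made for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

VRANK = Namespace("http://purl.org/voc/vrank#")
LEMON_TMP = Namespace("http://example.org/lemon-tmp#")   # placeholder namespace
EX = Namespace("http://example.org/data/")

g = Graph()

# The translation pair, modeled as in Fig. 10.
g.add((EX.t_able_capaz, RDF.type, LEMON_TMP.Translation))
g.add((EX.t_able_capaz, LEMON_TMP.sourceLexicalSense, EX.able_sense_en))
g.add((EX.t_able_capaz, LEMON_TMP.targetLexicalSense, EX.capaz_sense_es))

# Confidence attached through vRank; property names beyond hasRank are assumed.
g.add((EX.t_able_capaz, VRANK.hasRank, EX.rank_1))
g.add((EX.rank_1, RDF.type, VRANK.Rank))
g.add((EX.rank_1, VRANK.rankValue, Literal("1.36e-4", datatype=XSD.float)))
g.add((EX.rank_1, VRANK.hasAlgorithm, EX.pagerank_en_es))

print(g.serialize(format="turtle"))
```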

4.4 Resulting dataset

We have used the model introduced previously to represent all the generated information about translations and their confidence after running the evaluation described in Sect. 3.2. A generic overview of the dataset is shown in Fig. 11. For each pair of processed language editions we organize the data into a source lexicon, a target lexicon and a translation set. The source and target lexicons contain information about words in the source and target language, respectively. In the current version of the dataset we include only the lexical form of each word, discarding other lexical information like definitions, synonyms, etc. The translation set includes all the translation pairs, together with the associated confidence score computed with our approach. We made this English–Spanish dataset, containing more than 100k triples, publicly available on DydraFootnote 19 under the following SPARQL endpoint: http://dydra.com/narko/dict/sparql. An example of the dataset is described in the "Appendix".
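As a usage sketch, the translation pairs can be retrieved from the published endpoint ordered by confidence. The query below assumes the property names used in the previous examples (in particular vrank:rankValue), so it may need to be adjusted to the actual schema of the dataset.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dydra.com/narko/dict/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX vrank: <http://purl.org/voc/vrank#>

SELECT ?translation ?score WHERE {
  ?translation vrank:hasRank ?rank .
  ?rank vrank:rankValue ?score .        # property name assumed
}
ORDER BY DESC(?score)
LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["translation"]["value"], row["score"]["value"])
```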

Fig. 11
figure 11

Conceptual overview of the generated dataset

5 Conclusions and future work

This paper pointed out the need for measures that show the level of reliability of linguistic open data. More concretely, we studied the multilingual dimension of Wiktionary in order to assess the quality of translations as a preliminary step before using them in practice. We proposed a heuristic approach based on random walks that exploits the link structure existing between different Wiktionary language editions. This mechanism yielded confidence values associated with each translation pair found within the network formed by combining a source and a target language edition. We studied the precision and recall of our approach against human assessments and found that the heuristic performs well.

As a complementary part of this paper, we extended the state of the art by providing an extension to the lemon vocabulary for modeling translation data with associated confidence measures. We showed the flexibility of LOD for modeling new data needs and how it can overcome the heterogeneity issues present in fixed data schemas like those used in Wiktionary. As a proof of concept we published a dataset containing all the translation pairs we computed, represented with the described vocabulary.

In the short term, future work includes improving the weighting mechanism used to compute the confidence measures associated with the translation pairs. As described, we currently rely on the mean of the PageRank values of the individual translation components. However, this has been shown to have a strong impact when the ISG associated with one of the components of the pair is incomplete, an effect that appears for words that have not received enough contributions within a language edition. Our initial idea is to introduce weights that penalize the component of the pair suffering from this problem, while increasing the relevance of the other component. Additionally, we want to apply our approach to any combination of Wiktionary language editions in order to provide the community with one of the biggest datasets of bilingual translations. This resource can be quite useful for researchers in applied linguistics and semantics.

In the long term, we consider applying our heuristic to linguistic resources other than Wiktionary. Projects like UBYFootnote 20 and BabelNetFootnote 21 integrate other multilingual resources that could be explored, and they could use our approach to offer confidence values associated with the translations they store. As already stated throughout the paper, our approach relies strongly on the link structure built among the different Wiktionary editions. Applying our heuristic to a different resource in order to capture the crowd contribution would require implementing a different data analysis targeting that resource. This means that while the main idea behind our heuristic can be reused, the current technical implementation would need to be adapted to each targeted resource.