
1 Introduction and Related Work

Multilingual access to metadata and contents is of particular interest for international digital libraries (DL) in the area of Cultural Heritage (CH), which have collections in multiple languages and users from different countries and with different cultural backgrounds. However, Multilingual Information Retrieval (MLIR) is rarely implemented in this domain beyond the interface language [9, 15]. Only a few practical cases have been reported in the literature (see extensive reviews in Vassilakaki and Garoufallou [19], Diekema [3], and Chen [2]), and most of them use human translations and specialized vocabularies. This is the case, for example, of the World Digital Library [11] or the International Children’s Digital Library, where contents are manually translated. For query translation, Bonet et al. [5] obtained good results using specialized dictionaries, while Kools et al. [7] reported satisfactory results using machine translation. Matusiak et al. [9] report an experiment using Google Translate to translate a collection of Chinese artworks into English, but ultimately opted for human translation given the limitations they found. In other domains machine translation seems to work well for the most widely spoken languages [4], with a performance decrease of only 5–12% compared to the monolingual setting [13]. This limited use of machine translation in DLs could be explained by translation ambiguity and the insufficient coverage of lexical tools, considered to be among the most prominent problems in MLIR [12].

Europeana, a European digital library that aggregates content from libraries, archives and museums from all around Europe, is also a good example of this situation. It provides access to more than 60 million objects, from textual documents, like books or newspapers, to multimedia objects like audio, videos and paintings, which are primarily associated with 38 different languages. The data of these objects (i.e. metadata and content) is indexed in a search engine that provides search functionality over all collections; however, in most cases this data is only available in one language. Europeana performs data enrichment, adding persons, locations and concepts described in multiple languages to its metadata records. Yet the coverage of this approach is incomplete: there is no widespread translation of metadata, content and/or queries.

We have run an experiment using part of Europeana’s collections to assess the effectiveness of an MLIR system in this domain. We have focused on the content, not the metadata, and we have adopted a mixed approach where queries and object content are automatically translated to English as a pivot language, following the Europeana Multilingual Strategy [10]. Although document translation is considered more effective [12, 13, 17], this hybrid approach outperformed other strategies in a previous experiment [13], and it is more scalable when the number of different languages is considerable. Moreover, English is the most prevalent language in these collections, and machine translation into English is more effective [4, 13]. We have used the CEF translation service [1], as it is intended as a free, secure service for public bodies, which can be appealing to CH institutions, especially in Europe. The repository with the data of the experiment [8] and the client [6] used to obtain the translations are publicly available.
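To make the adopted strategy concrete, the following is a minimal, self-contained sketch of the pivot-language flow, in which both object content and queries are translated to English and matched in a single English index. The `translate_to_english` stub and its toy dictionary are purely illustrative assumptions; they do not reproduce the actual interface of the CEF translation service or of Europeana’s search engine.

```python
# Sketch of the pivot-language strategy: object content and queries are both
# translated to English and matched in a single English index.
# The translation function is a toy stub, NOT the CEF eTranslation API.

def translate_to_english(text: str, source_lang: str) -> str:
    """Stand-in for a machine translation call (hypothetical)."""
    toy_dictionary = {("it", "italia"): "italy", ("fr", "carnet"): "notebook"}
    return " ".join(toy_dictionary.get((source_lang, word), word)
                    for word in text.lower().split())


def build_english_index(objects: dict[str, tuple[str, str]]) -> dict[str, set[str]]:
    """Index objects (id -> (language, text)) by the terms of their English translation."""
    index: dict[str, set[str]] = {}
    for object_id, (lang, text) in objects.items():
        english = text if lang == "en" else translate_to_english(text, lang)
        for term in english.lower().split():
            index.setdefault(term, set()).add(object_id)
    return index


def search(index: dict[str, set[str]], query: str, query_lang: str) -> set[str]:
    """Translate the query to the English pivot and return matching object ids."""
    english_query = query if query_lang == "en" else translate_to_english(query, query_lang)
    hits: set[str] = set()
    for term in english_query.lower().split():
        hits |= index.get(term, set())
    return hits
```

In the real setting, the document translations are produced and indexed offline, while query translation happens at search time, as in the functions above.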

2 Data and Evaluation

We have selected a sample of 18,257 transcriptions of handwritten documents from the Europeana 1914–1918 thematic collection, obtained from the Transcribathon crowdsourcing platform [18]. This collection includes many World War I related objects contributed by members of the public all over Europe, such as soldiers’ diaries or letters. After removing 18 transcriptions that lacked an indication of the original language, as well as those originally in English, we submitted 13,996 transcriptions to the service for translation to English. We received errors for 404 of them (2.9%), either because the language was not supported or because the text was too long and would require a different interface (handling these cases is part of our future work). As a result, we obtained 13,592 transcriptions translated to English from 15 different languages (see Table 1).
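The filtering and error handling just described can be summarized in the sketch below. The `translate` callable and `TranslationError` are illustrative assumptions standing in for the actual translation client, which reports failures for unsupported languages and overly long texts.

```python
# Hedged sketch of the corpus preparation: skip transcriptions without a
# declared original language or already in English, translate the rest,
# and count failed calls. The client interface is assumed, not the real one.
from typing import Callable


class TranslationError(Exception):
    """Raised by the (hypothetical) translation client when a request fails."""


def prepare_corpus(
    transcriptions: list[dict],
    translate: Callable[[str, str], str],
) -> tuple[dict[str, str], int]:
    """Return {transcription id: English translation} and the number of errors."""
    translated: dict[str, str] = {}
    errors = 0
    for t in transcriptions:
        lang = t.get("language")
        if not lang or lang == "en":      # no declared language, or already English
            continue
        try:
            translated[t["id"]] = translate(t["text"], lang)
        except TranslationError:          # unsupported language or text too long
            errors += 1
    return translated, errors
```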

Regarding queries, we successfully translated a small sample of 68 queries issued in languages other than English from the logs of Europeana’s 1914–1918 collection between January and August 2019.

Table 1. Original language of the transcriptions and queries (assuming for the queries it is the same as the language of the portal), and number of successful English translations.

We manually assessed the quality of the translation of the queries, as they play a major role in the cross-lingual system. We also conducted a quantitative evaluation to answer the following research question: is it possible to obtain results similar to those obtained with the original query when searching the same collection using translations? Our assumption is that the results obtained in a monolingual system for a specific query and collection in that language should also appear when searching with the translated query in the same collection translated to English. In order to answer this question, we compare two lists of retrieval results per query q in original language l: a) the set \(s_{qo}\) obtained when searching with the original query \(q_{o}\) in the transcriptions in l, and b) the set \(s_{qt}\) obtained when searching with the English translation of \(q_{o}\), \(q_{t}\), in the transcriptions in l translated to English. The precision and recall of \(s_{qt}\) with respect to \(s_{qo}\) are then computed. Finally, we calculate the additional number of transcriptions retrieved when using \(q_{t}\) in the whole corpus of English transcriptions (translated or not).
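A minimal sketch of these measures, assuming each result list is available as a set of transcription identifiers, is given below. Reading the “additional” count as hits beyond those already retrieved for the query’s original language is our interpretation of the measure; the variable names are illustrative.

```python
# s_qo: original query on the original-language transcriptions
# s_qt: translated query on the English translations of those transcriptions
# s_qt_full: translated query on the whole English corpus (translated or not)

def precision_recall(s_qt: set[str], s_qo: set[str]) -> tuple[float, float]:
    """Precision and recall of the translated-query results against the original ones."""
    overlap = len(s_qt & s_qo)
    precision = overlap / len(s_qt) if s_qt else 0.0
    recall = overlap / len(s_qo) if s_qo else 0.0
    return precision, recall


def additional_transcriptions(s_qt_full: set[str], s_qt: set[str]) -> int:
    """Transcriptions retrieved from the whole English corpus but not for the original language."""
    return len(s_qt_full - s_qt)
```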

3 Results

After a manual assessment of the queries, we discovered that in a number of cases the input to the translation tool was wrong because the queries contained typos or had the wrong language assigned (i.e., our assumption that the query language is the language of the portal did not hold). The first issue affected 6 queries and the second 18, with two queries having both issues at the same time. After removing them, and an additional 3 for which the user’s intention was not clear to us, we manually analyzed the translation of the remaining 43 queries. In 37 cases the query was an entity whose correct translation was to leave it unchanged (e.g., ‘Bernhard Stiens’), as opposed to entities like ‘Italia’, which must be translated to ‘Italy’. The service correctly translated (that is, left unmodified) 20 of those entities (54%). In the remaining 6 cases, where the translation was supposed to differ from the original, the translation service got it right in 5 cases (83%).

The incorrect translation of named entities is the main source of problems since, setting aside other issues, there are more queries with entities than without: 42 of the 68 queries are (or include) named entities (62%). The problem is especially hard to solve because the named entities present and queried in the World War I context are very specialized (lesser-known authors, small villages) and sometimes incompletely referred to (e.g., ‘Tonale’ referring to ‘Passo del Tonale’), or are formulated with typos (e.g., ‘san elia’ referring to Antonio Sant’Elia). In some other cases they include common nouns that are not correctly disambiguated (e.g., ‘Antonio Sordi’ and ‘Fogliano’ are translated from Italian as ‘Antonio Deaf’ and ‘sheet’, respectively). This ambiguity issue is also observed in queries not involving named entities. For example, ‘carnet de route’ is correctly translated from French as ‘journey log’ in the transcriptions; however, the query ‘carnet’ alone is translated as ‘notebook’, so no relevant results are retrieved.

For the quantitative evaluation, we obtained precision, recall, and the number of newly retrieved transcriptions for the queries with search results, that is, 31 of the 68 queries originally considered (see Table 2). The recall indicates that 67% of the objects in \(s_{qo}\) are retrieved when using the translations. On the negative side, on average 49% of the results are not in \(s_{qo}\). Given the poor quality of the translation of the queries, we have to assume that those results are more likely to be noisy: in our case, on average 337 of the newly retrieved transcriptions per query are less likely to be relevant. This could, however, be compensated in some cases by the new transcriptions found. When using \(q_{t}\) in the whole corpus of English transcriptions we retrieve an average of 687 new transcriptions per query. A quick review shows that some of those new results are relevant. For example, for the query ‘domov’ in Czech (‘home’ in English) we retrieve only 2 results; however, if we search for ‘home’ in the English translations we retrieve more than 1,500 transcriptions in 9 additional languages.

Table 2. Precision and recall obtained when comparing \(s_{qt}\) and \(s_{qo}\) per language, as well as additional transcriptions retrieved when searching on the translations of any language.

4 Conclusions and Future Work

This experiment in a real scenario shows (or confirms) some of the benefits and challenges of deploying MLIR systems in this specific domain. Albeit focused on a rather small set of queries, our case illustrates the problem of performing query translation in the CH context: the number of queries that we are sure the service should actually translate is far smaller than the number of queries that it should leave unmodified, so the selection of a high-quality translation service is important. Additional techniques such as controlled vocabularies and named entity recognition tools are also needed [16], although they need to be adapted to the specific domain and updated regularly.

We have observed a significant number of cases where the queries had typos or there was a mismatch between the language of the query and the language assigned according to the language of the portal. These cases are especially harmful as the translation service was not given appropriate input. A spelling-correction system could mitigate the first problem, while for the second, language detection based on various signals [14] could improve the results.
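As an illustration of the second mitigation, the sketch below checks for a mismatch between the portal language assigned to a query and the language a detector actually sees, using the langdetect package as one possible detector. Short queries are a known weak spot for statistical language detection, so this should only be treated as one of the several signals mentioned in [14].

```python
# Flag queries whose detected language disagrees with the portal language
# before sending them for translation. langdetect is one possible detector;
# any detector with a similar interface would do.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def flag_language_mismatch(query: str, portal_lang: str) -> bool:
    """Return True when the detected query language disagrees with the portal language."""
    try:
        return detect(query) != portal_lang
    except LangDetectException:  # e.g. the query is too short to detect reliably
        return False
```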

This work shows that, without addressing these issues, the drawbacks of a multilingual system in a CH domain could easily exceed its benefits. The next step will be to address those challenges and to complement the evaluation with a sample of queries that is more balanced in terms of languages, in order to see the impact on the results. A qualitative analysis of the retrieval results is also needed to better account for the additional benefits of translation (e.g., synonyms).