1 Introduction

With advances in neural models for machine translation (MT) and Information Retrieval (IR), it is time to revisit the problem of Multilingual IR (MLIR). Soon after Cross-Language IR (CLIR) was proposed as an information retrieval task, research began on MLIR [34]. MLIR seeks to produce a total ordering over retrieved documents, regardless of language, such that the most useful documents appear at the top of the ranking. Assuming a searcher can consume multilingual information (either directly or using MT), the search engine should be able to return useful information regardless of the language of the document.

Much prior work on MLIR has involved subsetting documents by language, performing CLIR on each document set, and merging the results [37]. The advent of neural machine translation and neural IR using Multilingual Pretrained Language Models (MPLMs) creates new opportunities for MLIR that we study here.

If MT were perfect, translating all documents into the query language and searching monolingually might suffice. Indeed, our experiments confirm that for the high-resource languages with which we have experimented (English, French, German, Italian, and Spanish), using neural machine translation to convert each document into the query language is effective when used with neural ranking (in our experiments, ColBERT [26]) fine-tuned on MS MARCO [2]. However, using neural MT in that way incurs substantial indexing costs because a GPU is required first to translate the document and then again to encode it into dense vectors for neural IR. Alternatively, we can use translations of MS MARCO to fine-tune an MPLM; that approach is nearly as effective (the difference is not statistically significant) and considerably faster at indexing time. Our use of MS MARCO makes English a natural choice as the query language, but our approach is extensible to any query language for which suitable fine-tuning data exists.

This paper makes the following contributions: (1) Using a collection containing five high-resource European languages, we show that neural MT with neural IR achieves higher MAP and Precision at 10 scores than any other known MLIR technique, but that reliance on neural MT greatly increases the time required to index a collection. (2) We show that extending the ColBERT-X [32] Translate-Train (TT) CLIR model to multiple languages achieves equivalent retrieval effectiveness with less than half the indexing time when used with mixed-language fine-tuning. (3) We show that some language bias in favor of query-language documents is present with all approaches, but that query-language bias is smaller with our Multilingual Translate-Train (MTT) implementation of ColBERT-X.

2 Background

We provide an overview of MLIR, followed by a brief review of traditional and neural IR. The term “multilingual” has been used in several ways in IR. Hull and Grefenstette [22], for example, note that it has been used to describe monolingual retrieval in multiple languages, as in Blloshmi et al. [5], and it has also been used to describe CLIR tasks that are run separately in several languages [7,8,9, 27, 31]. We adopt the Cross-Language Evaluation Forum (CLEF)’s meaning of MLIR: using a query to construct one ranked list in which each document is in one of several languages [36]. We note that this definition excludes mixed-language queries and mixed-language documents, which are yet other cases to which “multilingual” has been applied.

Five broad approaches to MLIR have been tried. Among the earliest, Rehder et al. [39] represented English, German and French documents in a learned trilingual embedding space, represented the query in the same embedding space, and then computed query-document similarity in that space. The techniques and training data available at the time for creating multilingual embeddings were, however, too limited for that approach to produce good results. More recently, Sorg and Cimiano [44] garnered substantial attention by training embeddings on topically related Wikipedia pages in English, German, French and Spanish. This paper extends that line of work.

A second approach, by Nie and Jin [33], indexed terms from all documents in their original language and then created queries containing translations of the query terms in all target languages. With many document languages, this can lead to very long queries. A third approach is to translate indexed terms into the query language at indexing time; the original queries can then be used directly to find similar (translated) content [18, 29, 38]. We experiment with this approach as well. It is, however, practical only when just a few query languages must be supported. To address that limitation, the second and third approaches can be combined into a fourth approach in which documents and query terms are each converted into one of a small number of indexing languages. This has been called a “pivot language” approach because, in the limit, all documents and queries can be translated into a single language.

The fifth, and most widely studied, approach is to first use monolingual or bilingual retrieval to create a ranked list for each document language, and then to merge those ranked lists to construct a single result list [37, 43, 45]. While this approach is architecturally similar to collection sharding, a widely-used approach to address efficiency, differences in collection statistics result in incompatible scores that require normalization prior to late fusion. Unfortunately, normalizing scores for collections across languages has been shown to be challenging [37].

Finally, one can simply show one ranked list per language to the user, as is done in the 2lingual search engine. This approach does not scale well beyond a small number of languages, but it has the advantage of making it fairly clear to the searcher what the search engine has done.

Every MLIR ranked retrieval model must rank the indexed documents given a query. Traditional ranking methods, such as computing inner products between the query and each indexed document containing a query term using sparse BM25 [40] term weights, are fast, but neural IR methods yield better rankings [24, 26, 32], placing more relevant documents earlier in the ranked list.

This paper focuses on tradeoffs between effectiveness and efficiency. Each technique described in this paper achieves ranking latency sufficient for interactive use (below 300 ms) on the collections that we experiment with, but the time required to index the documents varies. Indexing time consists of three components: text processing (e.g., casing and tokenization), machine translation, and representation (e.g., McCarley [30] and Magdy and Jones [29]). Of these, neural MT is the slowest, so IR methods that do not require neural MT at indexing time have a substantial indexing time advantage (e.g., Aljlayl and Frieder [1]). Our principal MLIR result is that MPLMs can achieve MAP close to the best results while producing substantial savings in indexing time.

We achieve this by extending the ColBERT-X [32] CLIR model to perform MLIR. ColBERT-X combines three key ideas. First, drawing insight from BERT [15], it represents documents using contextual embeddings, which better represent meaning than simple term occurrence. Second, using both multilinguality and improved pretraining from either multilingual BERT [47] or XLM-R [11], ColBERT-X generates similar contextual embeddings for terms with similar meaning, regardless of language. Third, drawing its structure from ColBERT [26], ColBERT-X limits ranking latency by separating query and document transformer networks, allowing offline indexing. ColBERT scores documents by focusing query term attention on the most similar contextual embedding in each document. Our experiments confirm that this approach yields better MLIR MAP than does computation of inner products between classification tokens for the query and each document, an approach known as Dense Passage Retrieval (DPR) [24].
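
For concreteness, the following is a minimal sketch of the MaxSim late-interaction scoring described above, assuming `q_emb` and `d_emb` are L2-normalized contextual token embeddings produced offline by the query and document encoders; the names are illustrative and this is not the ColBERT-X code itself.

```python
# Minimal MaxSim sketch; q_emb: (|q|, dim), d_emb: (|d|, dim), both L2-normalized.
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    sim = q_emb @ d_emb.T               # cosine similarity of every token pair
    # Each query token contributes the similarity of its best-matching
    # document token; summing gives the query-document relevance score.
    return sim.max(dim=1).values.sum()
```

Because document embeddings can be computed and indexed offline, only the query encoding and this lightweight interaction happen at query time, which is what keeps ranking latency low.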

3 Fine-Tuning MPLMs for MLIR

Following Nair et al. [32] we consider two high-level approaches to fine-tuning for generalizing neural retrieval models to MLIR. Both approaches use existing MPLMs such as XLM-R [11] to encode queries and documents in multiple languages. We adapt the MPLM to MLIR via task-specific fine-tuning. These approaches are applicable to any retrieval model that is able to encode text using an MPLM.

Consider a set of queries in a source language \(\textbf{L}_s\) and a set of documents in m target languages \(\textbf{L}_t = \cup _{i=1}^m \textbf{L}_i\). We want to train a scoring function \(\mathcal {M}_\varTheta (q_{(s)}, d_{(t)}) \rightarrow \mathbb {R}\) for ranking documents with respect to a query. This paper denotes the language of an instance as a subscript \(\bullet _{(l)}\).

3.1 English Training (ET)

Since MPLMs can encode text from many languages, we follow Nair et al. [32] and fine-tune the model only monolingually, transferring it to MLIR zero-shot at query time. Specifically, consider a loss function \(\mathcal {L}\) (for example, cross-entropy),

$$\begin{aligned} \varTheta = \mathop {\mathrm {arg\,min}}\limits _{{\theta }} \sum _{q, d} \mathcal {L}_\theta (q_{(s)}, d_{(s)}, r_{q,d}) \end{aligned}$$

where \(q_{(s)}\) and \(d_{(s)}\) are MPLM-encoded representations of the query and the document, both in language \(\textbf{L}_s\), and \(r_{q,d}\) is the relevance judgment of document d for query q. We use English as our query language because that is the language of MS MARCO; we refer to this approach as “English Training” or ET. However, this approach could equally well use any language for which similarly extensive training data is available.
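
As an illustration of this objective, the sketch below shows one English-only fine-tuning step on an MS MARCO (query, positive document, negative document) triple, assuming `model(query, doc)` returns the scalar score \(\mathcal {M}_\varTheta (q, d)\); it is illustrative PyTorch, not the actual ColBERT-X or DPR-X training code.

```python
import torch
import torch.nn.functional as F

def et_training_step(model, optimizer, query, pos_doc, neg_doc):
    # Scores for the relevant (d+) and non-relevant (d-) documents.
    scores = torch.stack([model(query, pos_doc), model(query, neg_doc)])
    # Cross-entropy over the pair, with the positive document in position 0.
    target = torch.zeros(1, dtype=torch.long, device=scores.device)
    loss = F.cross_entropy(scores.unsqueeze(0), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```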

Despite only exposing the model to text in \(\textbf{L}_s\) during fine-tuning, the multilingual model can transfer the learned ranking task to other languages, as has been seen in prior CLIR work [32]. However, such zero-shot language transfer is suboptimal because of (1) the lack of alignment objectives between languages during pretraining [48]; and (2) differences in how each language is represented by the MPLM, which has been called the curse of multilinguality [11, 46]. As we show in Sect. 6.1, such zero-shot transfer not only produces suboptimal retrieval effectiveness, it can also lead to language bias.

3.2 Multilingual Translate Training (MTT)

To mitigate those issues, we propose a Multilingual Translate-Train (MTT) approach that generalizes the CLIR Translate-Train (TT) approach to MLIR [32, 42]. To expose the target languages \(\textbf{L}_1, \ldots , \textbf{L}_m\) to the model, we translate the monolingual training documents into each target language using MT. Specifically, the training objective can be expressed as

$$\begin{aligned} \varTheta = \mathop {\mathrm {arg\,min}}\limits _{{\theta }} \sum _{q, d} \sum _{l=1}^{m} \mathcal {L}_\theta (q_{(s)}, d_{(l)}, r_{q,d}) \end{aligned}$$

This objective exposes the retrieval model to the language pairs it might see when processing queries, resulting in a more effective, better-balanced model. We experiment with two batching approaches. In Mixed-language batching (MTT-M), each batch contains documents in multiple languages, which encourages the model to learn similarity measures for all languages simultaneously. With Single-language batching (MTT-S), each batch contains documents in only one language, helping the model to learn retrieval for one language pair at a time. We found that MTT-M yields better retrieval effectiveness; we therefore present MTT-M as our main result, and Sect. 5.1 compares the two approaches. In Sect. 6.1, we also demonstrate that MTT-M reduces language bias in MLIR. Implementation details can be found in Appendix A.
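
To make the batching distinction concrete, the sketch below contrasts the two strategies, assuming `triples[lang]` holds mMARCO training triples whose document sides are in language `lang` (the query side remains English); the function and variable names are illustrative, not our training code.

```python
import random

def mtt_mixed_batches(triples, languages, batch_size):
    """MTT-M: every batch mixes documents from all target languages."""
    pool = [t for lang in languages for t in triples[lang]]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

def mtt_single_batches(triples, languages, batch_size):
    """MTT-S: each batch draws documents from a single language."""
    schedule = [lang for lang in languages
                for _ in range(len(triples[lang]) // batch_size)]
    random.shuffle(schedule)
    cursor = {lang: 0 for lang in languages}
    for lang in schedule:
        start = cursor[lang]
        cursor[lang] = start + batch_size
        yield triples[lang][start:start + batch_size]
```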

Table 1. Dataset statistics of CLEF 2001, 2002, and 2003. CLEF 2001 and 2002 share the document collection but have different queries. Numbers in parentheses are the number of topics in each query set. We report the number of documents judged relevant over all the topics in a particular year.

4 Experiments

One of the few test collections that currently supports MLIR evaluation with relevance judgments across multiple languages is from the Cross-Language Evaluation Forum (CLEF). Following Rahimi et al. [38] we use five document languages in the CLEF 2001–2002 collections [7, 8] and four languages in the CLEF 2003 collection [9]. Table 1 shows collection statistics. We report performance for both title and title+description queries, also following Rahimi et al. [38]. Because the number of query elements (subwords) is limited when encoding a query for dense retrieval, we remove stop structure to ensure that no query exceeds the length limit. Stop structure includes phrases such as “Find documents” and a limited stop-word list including “on,” “the,” and “and.”
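
A small sketch of the stop-structure removal described above is shown below; the phrase list and stop-word list here are illustrative examples only, not the exact lists used in our experiments.

```python
import re

STOP_PHRASES = [r"\bfind documents\b", r"\bfind reports\b"]  # assumed examples
STOP_WORDS = {"on", "the", "and"}                            # from the limited list

def strip_stop_structure(query: str) -> str:
    q = query.lower()
    for pattern in STOP_PHRASES:
        q = re.sub(pattern, " ", q)
    return " ".join(tok for tok in q.split() if tok not in STOP_WORDS)
```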

4.1 Neural Retrieval Models

We evaluate our proposed training approaches on two retrieval models – ColBERT-X [32] and DPR-X [48, 49], which are multilingual variants of ColBERT [26] and DPR [24]. Nair et al. [32] generalized the ColBERT [26] model to CLIR, calling it ColBERT-X, by modifying the vocabulary space and replacing the monolingual pretrained language model with the MPLM XLM-RoBERTa (XLM-R) Large (550M parameters) [11]. With proper training, ColBERT-X achieves state-of-the-art effectiveness in CLIR. In this study, we integrate our proposed fine-tuning approaches with the ColBERT-X XLM-R implementation, which is based on the ColBERTv1 code base. We similarly adapted DPR [24, 48], a neural retrieval model that matches a single dense query vector to a single dense document vector. We name this model DPR-X. We use Tevatron [17], an open-source implementation of several neural end-to-end retrieval models in Python, for training, indexing, and retrieval.

For training data, we use MS MARCO-v1 [2], a commonly used English question-answering collection for fine-tuning neural retrieval models. For MTT, we use the publicly available mMARCO translations of MS MARCO [6], fine-tuning on the “small training triple” (query, positive document, negative document) file released by mMARCO’s creators. We trained all retrieval models on four GPUs (NVIDIA V100 with 32 GB memory in a DGX system) with a per-GPU batch size of 32 triples for 200,000 update steps. All models are trained in half-precision floating point and optimized with AdamW using a learning rate of \(5\times 10^{-6}\).
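
For reference, a minimal sketch of this optimization setup (AdamW at \(5\times 10^{-6}\), half precision, 32 triples per GPU, 200,000 steps) in plain PyTorch follows; it is illustrative only and not the Tevatron or ColBERT-X training scripts.

```python
import torch
from torch.optim import AdamW

PER_GPU_BATCH = 32        # triples per GPU, on 4 GPUs
UPDATE_STEPS = 200_000

def make_training_setup(model):
    optimizer = AdamW(model.parameters(), lr=5e-6)
    scaler = torch.cuda.amp.GradScaler()   # half-precision (mixed) training
    return optimizer, scaler
```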

During indexing, documents are separated into overlapping spans of 180 tokens with a stride of 90 [32]. We aggregate by MaxP [3, 13], which takes the maximum score among the passages in a document as the document score.
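
The following sketch shows the passage splitting and MaxP aggregation just described, assuming `tokens` is a tokenized document and `score_passage(query, passage)` returns a passage-level score (e.g., from MaxSim); the names are illustrative.

```python
def split_passages(tokens, length=180, stride=90):
    # Overlapping spans of `length` tokens, advancing `stride` tokens at a time.
    for start in range(0, max(len(tokens) - stride, 1), stride):
        yield tokens[start:start + length]

def maxp_score(query, tokens, score_passage):
    # MaxP: the document score is the maximum of its passage scores.
    return max(score_passage(query, passage)
               for passage in split_passages(tokens))
```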

4.2 Evaluation

We report previously published results for the state-of-the-art MULM [38] system as a baseline for models that do not perform MT on the full collection. MULM is essentially an MLIR version of Probabilistic Structured Queries (PSQ) [14]. PSQ maps term frequency vectors from document to query language using a matrix of translation probabilities generated using statistical machine translation. For MLIR, a translation matrix is created for each query-document language pair. The query likelihood model is used to score documents. Three key decisions led to good performance: (1) estimating collection statistics based on translation probabilities; (2) estimating document length based on the translation and using that for smoothing; and (3) truncating the translation list at three. As another baseline, we use BM25 (\(b=0.4\), \(k_1=0.9\)) as implemented in Patapsco [12] over neural machine translated documents (abbreviated ITD for Indexed Translated Documents). For BM25, English queries and documents are tokenized by spaCy [21] and stemmed by the NLTK [4] Porter stemmer (all supported by Patapsco).
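
To make the PSQ-style scoring concrete, here is a minimal sketch of Dirichlet-smoothed query likelihood over translated (expected) term frequencies. It is our own simplified reading of the approach, interpreting truncation as keeping each document-language term's top three translations, and it omits MULM's translation-based collection-statistic and document-length estimates; it is not the MULM implementation.

```python
import math
from collections import Counter

def psq_score(query_terms, doc_terms, trans_prob, coll_tf, coll_len, mu=1000.0):
    """trans_prob[d_term][q_term] = P(q_term | d_term), truncated to the top
    three translations per document-language term (an assumption)."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        # Expected frequency of query-language term q in this document.
        e_tf = sum(tf * trans_prob.get(t, {}).get(q, 0.0)
                   for t, tf in doc_tf.items())
        p_coll = coll_tf.get(q, 0) / max(coll_len, 1)   # background model
        p = (e_tf + mu * p_coll) / (doc_len + mu)
        if p > 0:
            score += math.log(p)
    return score
```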

For approaches that require document translation, we use directional MT models built on a transformer architecture (6-layer encoder/decoder) using Sockeye 2 [16, 19]. Measured by BLEU [35], Sockeye 2 achieves state-of-the-art effectiveness in each translation direction. Optimizations cut decoding time in half compared to Sockeye 1 [20]. We chose Sockeye 2 for its good trade-off between efficiency and effectiveness.

To evaluate effectiveness on multiple languages in CLEF 2001–2002 and CLEF 2003, we combine the relevance judgments (qrels) for all languages for each query. In general, different languages have different numbers of relevant documents for each query. To evaluate models trained with English training data, we also translate the document sets into English with MT for indexing. Our main effectiveness measures are Mean Average Precision (MAP) and Precision at 10 (P@10). Both measures focus on the top of the rankings, and both were used by Rahimi et al. [38], facilitating comparison between the neural approaches presented herein and prior state-of-the-art results.

To evaluate language bias, we count the number of relevant documents for a query across all languages, and calculate recall at that level. To compute the measure for a specific language, we keep this level constant, but ignore all documents in other languages (both in the MLIR results and in the relevance judgments). We call the mean of this measure over all queries Recall@MLIR-Relevant. When computing the mean, we omit from the calculation cases in which no relevant documents in that language are known (recall is undefined in such cases). This measure lies between 0 and 1, and values across that full range are achievable. We use the open source ir-measures [28] package to compute all effectiveness measures.
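
The following sketch computes Recall@MLIR-Relevant for one topic as defined above: the cutoff is the number of relevant documents across all languages, and the per-language version restricts both the ranking and the judgments to one language while holding that cutoff constant; the variable names are illustrative.

```python
def recall_at_mlir_relevant(ranking, qrels, doc_lang, language=None):
    """ranking: doc ids in score order; qrels: doc id -> judgment (>0 = relevant);
    doc_lang: doc id -> language code."""
    cutoff = sum(1 for j in qrels.values() if j > 0)   # relevant in any language
    if language is not None:
        ranking = [d for d in ranking if doc_lang[d] == language]
        relevant = {d for d, j in qrels.items()
                    if j > 0 and doc_lang[d] == language}
    else:
        relevant = {d for d, j in qrels.items() if j > 0}
    if not relevant:
        return None                      # undefined; omitted from the mean
    found = sum(1 for d in ranking[:cutoff] if d in relevant)
    return found / len(relevant)
```

When no language restriction is applied, the relevant set equals the cutoff and this value reduces to R-Precision over the MLIR ranking, consistent with the averaging line in Fig. 2.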

Table 2. Configurations of experiments, identifying the pretrained language model when applicable, the fine-tuning data and process, the retrieval model, and the language of the indexed documents. Under Fine-Tuning Data, MS MARCO refers to English MS MARCO-v1, while mMARCO includes the translations into the various languages as well as the original English MS MARCO-v1. A model that lists either under its Indexing Language can index either machine-translated documents (translation) or native documents in their various languages.

5 Results

We experiment with Multilingual Translate-Train (MTT) using two retrieval models and compare them to two strong baselines: BM25+ITD, which indexes translated documents, and MULM, which indexes native documents; these represent the state of the art on our test collections. Since per-query results for MULM have not been published, we perform significance tests only between our systems and the BM25+ITD baseline (the stronger of the two). Table 2 summarizes the experiments that facilitate this analysis. We first compare the effectiveness of our two batching strategies for MTT before examining their effectiveness relative to the baselines. Finally, we consider the trade-off between effectiveness and indexing time.

5.1 Multilingual Batching for Fine-Tuning

We compare the two batching alternatives for MTT fine-tuning and summarize the results with title+description queries in Table 3. In all cases, mixed-language batches (MTT-M) produce more effective retrieval models than single-language batches (MTT-S). This is likely because, in MLIR, the model must rank documents from different languages together, rather than transfer a model trained on one language pair at a time. The outcome might differ if our goal were to perform CLIR over monolingual document collections.

Table 3. ColBERT-X MTT for Multiple or Single language training batches, indexing documents in their native language using title+description queries. \(\dagger \) indicates significant improvement over MTT-S by paired t-test with 3-test Bonferroni correction (\(p<0.05\)).

5.2 Effectiveness Relative to Baselines

Our main effectiveness results are shown in Table 4. For ColBERT-X and DPR-X, MTT-M consistently improves effectiveness when retrieving documents in their native language (i.e., without document MT) compared to English Training (ET). Such improvements are seen in all three query sets, and for both Title (T) and Title+Description (T+D) queries. Differences are larger for MAP than P@10, indicating that MTT-M affects more than just the top ranks.

ColBERT-X MTT-M numerically outperforms MULM for both query types and over all collections in MAP and nearly all collections in P@10. With longer, more fluent title+description queries, ColBERT-X MTT-M gives a larger improvement over MULM in both MAP and P@10, indicating that XLM-R favors queries with more context. Since DPR-X is less effective [48], MTT-M only brings its performance up to par with MULM.

With modern MT models, we can improve MLIR effectiveness. A common, yet strong, baseline of using BM25 to search over translated documents yields substantial improvement over MULM in both MAP and P@10 with both query types. We argue that BM25+ITD is a proper baseline to which future MLIR experiments should be compared.

Table 4. MAP and P@10 on CLEF Title and Title+Description queries. Bold values are the best within a year; italic values are the best in a row (i.e., with or without neural machine translation). \(\dagger \) indicates a significant difference from BM25+ITD by paired t-test with 16-test Bonferroni correction (\(p<0.05\)).
Table 5. Monolingual ColBERT model using BERT-Large trained with ET and evaluated with translated documents.

We can also reduce neural IR to the monolingual case, training our retrieval model with English training and searching documents represented by their English machine translations. For both ColBERT and DPR, an English-trained (ET) model indexing translated documents often yields better effectiveness than MTT-M indexing translated documents (ITD). Furthermore, an English-trained model indexing translated documents yields better effectiveness than MTT-M indexing documents in their native language; however, these differences are statistically significant only for CLEF 2002 Title queries using a paired t-test with 3-test Bonferroni correction (\(p < 0.05\)). We observe similar results with ColBERT using the BERT-Large pretrained LM trained under the same conditions, except for a learning rate of \(3\times 10^{-6}\) (the value suggested by the authors); compare Table 5 to ColBERT-X with English training in Table 4.

5.3 Preprocessing and Indexing Time

Applying machine translation to entire document collections is expensive. Table 6 summarizes the cost for preprocessing and indexing the collection in GPU-hours for ColBERT-X and BM25. We omit consideration of query latency here since all of our systems are sufficiently fast at query time for interactive use on collections of this size. We refer the interested reader to Santhanam et al. [41].

This table reveals that differences in total indexing time between searching native and searching translated documents range from 4 to 6.5 times, depending on collection size and model. Although searching translated documents with monolingual retrieval models is more effective, the computational cost of MT at indexing time is significantly higher; one might choose not to bear this cost, given that the numerical gain in measured effectiveness over searching documents in their native language with MTT-M fine-tuning is small and not statistically significant for title+description queries.

We also see this trade-off on a per-document basis. Figure 1 shows that ColBERT-X with English training searching translated documents (ColBERT-X(ET)+ITD) achieves the best effectiveness with both title (0.375 MAP) and title+description (0.453 MAP) queries. However, it has a high preprocessing cost of 0.32 s per document, whereas ColBERT-X trained with MTT-M searching documents in their native languages (ColBERT-X(MTT-M)) requires under 0.05 s per document. This is an 84% reduction in preprocessing cost at an apparent (but not statistically significant) cost of only 2% in MAP with title+description queries.

Table 6. ColBERT-X GPU hours for translating and indexing. BM25 does not use a GPU.
Fig. 1. Effectiveness (MAP) vs. efficiency (per-document GPU indexing time in seconds) trade-off on CLEF 2001–2003. MAP scores (y-axis) for Title and Title+Description queries fall in disjoint ranges. The upper left is the optimal region of the chart.

6 Analysis

This section investigates our experimental results by breaking down the collection in two ways – by document language, and by topic.

Fig. 2. R@MLIR-Relevant of BM25 and ColBERT-X variants for each language in CLEF 2001–2002 with title+description queries. The yellow dashed line is the average over all languages, i.e., the R-Precision in MLIR. Outliers are defined as values beyond 1.5\(\times \) the interquartile range. Horizontal black bars indicate the median and white circles indicate the mean. (Color figure online)

6.1 Language Bias

Since MPLMs are known to exhibit language biases [10, 25], we investigate the extent to which retrieval models fine-tuned with our training schemes inherit or alleviate those biases. In MLIR, we consider a model biased if it systematically ranks one language’s documents higher or lower than another’s. While MLIR is not a new task, we are not aware of prior work that has examined language bias, so we introduce two approaches to studying this phenomenon. The first examines rates of relevant documents. Because relevant documents are unevenly distributed across languages (e.g., among the CLEF 2001 topics, Spanish has more than three times as many known relevant documents as English, averaging 54 vs. 17 relevant documents per topic), meaningful comparisons require us to focus on rates rather than counts. In this analysis we focus on Recall@MLIR-Relevant (see Sect. 4.2), illustrating our analysis with the 100 title+description queries from the CLEF 2001–2002 topics to characterize the coverage of relevant documents in each language (results on the CLEF 2003 topics are similar).

Figure 2 shows distributional statistics of Recall@MLIR-Relevant, by language and condition, over the topics that have at least one known relevant document in that language (96 for German, 97 for Spanish, 94 for Italian, 90 for French, 73 for English). When a ColBERT-X model fine-tuned only with English training (i.e., ColBERT-X(ET)) is transferred zero-shot to other languages, it favors English documents because of that fine-tuning condition, resulting in a strong language bias in the retrieval results. Such biases can be ameliorated by fine-tuning with MTT. MTT-M appears to behave more consistently across languages than MTT-S, although the small apparent difference is not statistically significant. When indexing translated documents, Recall@MLIR-Relevant tends to be lower for English than for the other languages (though again not significantly). Since documents were translated sentence by sentence, we hypothesize that indexing translated documents provides more synonym variety when decoding similar terms, resulting in a document expansion effect; this hypothesis requires further investigation, which we leave for future work.

An alternative approach to investigating language bias is to assume that in a bias-free approach to MLIR, the scores for relevant documents would be drawn from the same underlying distribution. Using the 2-sample Kolmogorov-Smirnov test, the null hypothesis is that the two samples are drawn from the same distribution. For this analysis, we chose English as a reference and tested each topic with at least three relevant documents in each language. We then adjusted the p-values to account for multiple comparisons. We found that we could reject the null hypothesis for all languages and all configurations, indicating the document scores are not drawn from the same distribution based on language. Although some of this difference could result from differences in collection statistics (i.e., with some languages better supporting the queries than others based on the numbers of relevant documents), the differences we observe across retrieval models indicate that there are retrieval model effects as well. Notably, ColBERT-X(ET) retrieving documents in the native language has the largest percentage of topics with bias (from 15% to 30% depending on language pair), while all other configurations have no more than 12% of topics exhibiting biased scores. This confirms the qualitative analysis above, which revealed that ColBERT-X(ET) over the documents in their native language had the most skewed rates of relevant documents. Future research will need to address language bias in document scores.
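
A sketch of this score-distribution test follows, assuming `scores_by_topic[topic][lang]` holds the retrieval scores of the known-relevant documents in each language for one topic; it uses SciPy's two-sample Kolmogorov-Smirnov test, and the Holm adjustment here stands in for whichever multiple-comparison correction is applied (an assumption, as the correction method is not our focus).

```python
from scipy.stats import ks_2samp
from statsmodels.stats.multitest import multipletests

def biased_topic_language_pairs(scores_by_topic, reference="en", alpha=0.05):
    """Return (topic, language) pairs whose relevant-document score
    distributions differ significantly from the reference language."""
    keys, pvals = [], []
    for topic, by_lang in scores_by_topic.items():
        ref = by_lang.get(reference, [])
        if len(ref) < 3:                 # require >= 3 relevant docs per language
            continue
        for lang, scores in by_lang.items():
            if lang == reference or len(scores) < 3:
                continue
            keys.append((topic, lang))
            pvals.append(ks_2samp(ref, scores).pvalue)
    if not pvals:
        return []
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return [k for k, r in zip(keys, reject) if r]
```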

Fig. 3. Average Precision (AP) of BM25 and ColBERT-X on selected topics using title queries.

6.2 Example Queries

For more insight into differences among the algorithms, we show effectiveness on individual queries in Fig. 3. Our query selection here is not meant to be representative, but rather illustrative of phenomena that we observe. For two topics on which ColBERT-X outperformed BM25 (topics 158 and 118), the queries include terms that likely benefit from ColBERT-X soft term matching – “soccer” and “commissioner,” respectively. This term expansion effect has also been observed in monolingual retrieval with ColBERT.

MT is particularly helpful for topics 63 and 88, likely due to the quality of the translation for documents on these topics. Especially for topic 88, English monolingual retrieval produces strong results. Such behaviors indicate that the multilingual term matching in ColBERT-X is still not as effective on less common concepts like “mad cow” as is machine translation.

Topic 58 is an outlier. The term “euthanasia” is tokenized as a single token for BM25 but separated into _eu, thana, and sia by the XLM-R tokenizer; combined with the minimal context provided by a query, this prevents ColBERT-X from matching properly across languages. Such diverse behaviors suggest room for further MLIR improvements using system combination.

7 Conclusion and Future Work

This paper proposes MTT, a training approach for MLIR that uses translated MS MARCO. When searching non-English documents, fine-tuning with MTT using mixed-language batches (MTT-M) enables neural models such as ColBERT and DPR to be more effective than when fine-tuned on English MS MARCO. ColBERT-X with MTT-M is not statistically different from monolingual English models applied to a neural indexing-time translation of the collection into English, yet it achieves substantially better indexing-time efficiency. These results may not hold for more diverse sets of languages or when MT is less effective; future work will examine the multilingual topics from the TREC 2022 NeuCLIR track, which judges the relevance of documents written in Chinese, Persian, and Russian. Our observation that the retrieval method yielding the best effectiveness is query-dependent suggests future work on system combination, but our focus on efficiency and on language bias also calls attention to issues beyond retrieval effectiveness that will merit consideration in such a study.