
1 Introduction

Ad hoc Cross-Language Information Retrieval (CLIR) has been studied for decades. Yet until the advent of high-quality machine translation, the usefulness of CLIR has been limited. Easy access to inexpensive or free machine translation has altered this landscape. If one can find a document of interest in a language one cannot read, machine translation is now often sufficient to make the majority of the document’s content accessible. Thus, the breadth of the audience for CLIR has increased dramatically in a short period of time.

As machine translation has increased the usefulness of CLIR, recently introduced deep neural methods have improved ranking quality [4, 29, 43, 45, 47]. By and large, these techniques appear to provide a large jump in the quality of CLIR output. Yet the evidence for these improvements is based on small, dated test collections [14, 15, 27, 36, 37]. Problems with existing collections include:

  • Some CLIR test collections are no longer available from any standard source.

  • They are typically small, often 100,000 or fewer documents, and some have few known relevant documents per topic.

  • Judgment pools were retrieved using older systems. New neural systems are thus more likely to systematically identify relevant unjudged documents [38, 40, 46].

  • Many of the early test collections have only binary judgments.

The increased importance of CLIR thus argues for the creation of new ad hoc CLIR collections that ameliorate these problems. A new CLIR collection should contain a large number of recent documents in a standard encoding, with distribution rights that foster broad use, sufficient numbers of relevant documents per topic to allow systems to be distinguished, and graded relevance judgments.

To this end, we have created HC4, the HLTCOE Common Crawl CLIR Collection. In addition to addressing the shortcomings described above and facilitating evaluations of new CLIR systems, this suite of collections has a few unique aspects. First, to mimic well-contextualized search sessions, topics are generally inspired by events in the news and written from the perspective of a knowledgeable searcher familiar with the background of the event. Each topic is associated with a date, and in most cases the topic is linked to Wikipedia page text written immediately prior to that date, generally contemporaneous with the event. This page serves as a proxy for a report that might have been written by the searcher prior to their search, reflecting their knowledge at that time; it is included in the collection to enable exploration of contextual search. Second, to maximize recall in the judged set, documents to be judged were identified with active learning rather than pooling [1]. This approach reduces judgment bias toward any specific automated retrieval system.

2 Related Work

The first CLIR test collection was created for Salton’s seminal work on CLIR in 1970, in which English queries were manually translated into German [35]. Relevance judgments were created exhaustively for those queries over several hundred abstracts in both languages. In 1995, the first large-scale CLIR test collection in which documents were selected for assessment using pooling was created by translating Spanish queries from the Fourth Text REtrieval Conference (TREC-4) Spanish test collection into English for CLIR experimentation [12]. The next year, TREC organizers provided standard English versions of queries for Spanish and Chinese collections [37]. The following year, CLIR became the explicit focus of a TREC track, with collections in German, French, and Italian; that track continued for three years [36]. One enduring contribution of this early work was the recognition that, to be representative of actual use, translations of topic fields in a test collection should not be made word-by-word, but rather should be fluent re-expressions in the query language.

With the start of the NACSIS Test Collection Information Retrieval (NTCIR) evaluations in Japan in 1999 [34], the Cross-Language Evaluation Forum (CLEF) in Europe in 2000 [15], and the Forum for Information Retrieval Evaluation (FIRE) in India in 2008 [27], the center of gravity of CLIR evaluation moved away from TREC. Over time, the research in each of these venues has become more specialized, so although CLIR tasks continue, the last large-scale CLIR test collection for ad hoc search of news that was produced in any of the world’s four major information retrieval shared-task evaluation venues was created in 2009 for Persian [14]. The decline in test collection production largely reflected a relative stasis in CLIR research, which peaked around the turn of the century and subsequently tailed off. Perhaps the best explanation for the decline is that the field had, by the end of the first decade of the twenty-first century, largely exhausted the potential of the statistical alignment techniques for parallel text that had commanded the attention of researchers in that period.

One consequence of this hiatus is that older test collections do not always age gracefully. As Lin et al. point out, “Since many innovations work differently than techniques that came before, old evaluation instruments may not be capable of accurately quantifying effectiveness improvements associated with later techniques” [25]. The key issue is that in large test collections, relevance judgments are necessarily sparse. TREC introduced pooling as a way to decide which documents (typically several hundred per topic) should be judged for relevance, leaving all other documents unjudged. Pools were constructed by merging highly ranked documents from a diverse range of fully automated systems, including some of the best systems of the time, sometimes augmented by documents found using interactive search. Zobel found that, using evaluation measures that treat unjudged documents as not relevant, judgments on such pools yield system comparisons that are not markedly biased against systems built with similar technology that did not contribute to the pools [48]. Contemporaneously, Voorhees found that comparisons between systems were generally insensitive to substituting judgments from one assessor for those of another [39]. A subsequent line of work found that some newly designed evaluation measures produced system comparisons robust to random ablation of those pools [5, 28, 33, 44]. However, these conclusions do not necessarily hold when new technology finds relevant documents that were not found by earlier methods, as can be the case for neural retrieval methods [25]. In such cases, three approaches might be tried:

  1. Re-pool and rejudge an older collection, or create a new collection over newer content using pooling.

  2. Select documents to be judged in a manner relatively insensitive to the search technology of the day, without necessarily judging all relevant documents.

  3. Use an approach that simply does a better job of finding most of the relevant documents, thus reducing the risk of bias towards any class of system.

We used the third of these approaches to select documents for judgment in HC4. Specifically, we used the HiCAL system [10] to identify documents for judgment using active learning. HiCAL was originally developed to support Technology Assisted Review (TAR) in E-Discovery, where the goal is to identify the largest practical set of relevant documents at a reasonable cost [3, 9, 31, 42]. Similar approaches have been used to evaluate recall-oriented search in the TREC Total Recall and Precision Medicine tracks [17, 22, 32]. The key idea in HiCAL is to train an initial classifier using a small set of relevance judgments, and then to use active learning with relevance sampling to identify additional documents for review. As Lewis found, relevance sampling can be more effective than the uncertainty sampling approach that is more commonly used with active learning when the prevalence of relevant documents in the collection being searched is low [24]. This low prevalence of relevant documents is often a design goal for information retrieval test collections, both because many real information retrieval tasks exhibit low relevance prevalence, and because (absent an oracle that could fairly sample undiscovered relevant documents) accurately estimating recall requires reasonably complete annotation of the relevant set. One concern that might arise with HiCAL is that if the document space is bifurcated, with little vocabulary overlap between two or more sets of relevant documents, then HiCAL could get stuck in a local optimum, exploiting one part of the document space well but missing relevant documents in another. Experience suggests that this can happen, but that such cases are rare. In particular, we expect such cases to be exceptionally rare in the news stories on which our HC4 test collections are built, since journalists typically go out of their way to contextualize the information that they present.
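To make the relevance-sampling idea concrete, the following is a minimal sketch of a HiCAL-style loop: train a classifier on the current judgments, score all documents, and ask the assessor to judge the highest-scoring unjudged ones. The TF-IDF features, logistic-regression classifier, and judge() callback are illustrative assumptions, not the actual HiCAL implementation.

```python
# Sketch of active learning with relevance sampling (HiCAL-style), not HiCAL itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

def active_learning_judgments(docs, seed_judgments, judge, batch_size=10, max_judgments=100):
    """docs: list of document texts; seed_judgments: dict doc index -> bool;
    judge: callback returning True if the assessor judges the document relevant."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    judgments = dict(seed_judgments)              # doc index -> relevant?
    while len(judgments) < max_judgments:
        ids = list(judgments)
        labels = [judgments[i] for i in ids]
        if len(set(labels)) < 2:                  # need at least one relevant and one not
            break
        clf = LogisticRegression(max_iter=1000).fit(X[ids], labels)
        scores = clf.predict_proba(X)[:, 1]       # P(relevant) for every document
        unjudged = [i for i in np.argsort(-scores) if i not in judgments]
        for i in unjudged[:batch_size]:           # relevance sampling: top-scored first
            judgments[i] = judge(docs[i])
    return judgments
```

The classifier is retrained after each batch of new judgments, so documents similar to those already found relevant are surfaced next; this is what drives high recall when relevance prevalence is low.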

Early TREC CLIR test collections all included binary relevance judgments, but the introduction of the Discounted Cumulative Gain (DCG) measure in 2000 [20], and the subsequent broad adoption of Normalized DCG (nDCG), increased the demand for relevance judgments with more than two relevance grades (e.g., highly relevant, somewhat relevant, and not relevant). Some of the early CLIR work with graded relevance judgments first binarized those judgments (e.g., either by treating highly and somewhat relevant as relevant, or by treating only highly relevant as relevant) [21]. However, Sakai has noted that using graded relevance in this way can rank systems differently than would more nuanced approaches that award partial credit for finding partially relevant documents [34]. In our baseline runs, we report nDCG using the graded relevance judgments, then binarize those judgments to report Mean Average Precision (MAP) by treating highly and somewhat relevant as relevant.
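As a concrete illustration of the two evaluation settings used here, the sketch below computes nDCG from graded gains and MAP from binarized judgments. The 2/1/0 gain assignment for very-valuable/somewhat-valuable/not-relevant is an assumption for illustration, not necessarily the official gain mapping.

```python
# Sketch of nDCG on graded gains and MAP on binarized judgments (illustrative gains).
import math

def dcg(gains):
    # Standard log2 discount: gain_i / log2(rank + 1), with ranks starting at 1.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains, all_relevant_gains, k):
    ideal = sorted(all_relevant_gains, reverse=True)[:k]
    return dcg(ranked_gains[:k]) / dcg(ideal) if dcg(ideal) > 0 else 0.0

def average_precision(ranked_rels, num_relevant):
    # ranked_rels: 1/0 relevance of the ranked list after binarization.
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

ranked_gains = [2, 0, 1, 0, 2]        # 2 = very-valuable, 1 = somewhat-valuable, 0 = not relevant
all_relevant_gains = [2, 2, 1, 1]     # gains of all judged relevant documents for the topic
print(ndcg(ranked_gains, all_relevant_gains, k=5))
print(average_precision([1 if g > 0 else 0 for g in ranked_gains], num_relevant=4))
```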

3 Collection Development Methodology

We adopted several design principles to create HC4. First, to develop a multilingual document collection that was easy to distribute, we chose the Common Crawl News Collection as the basis for the suite of collections. We applied automatic language identification to determine the language of each document. We then assembled Chinese, Persian, and Russian documents from August 2016 to August 2019 into ostensibly monolingual document sets. Finally, we automatically identified and eliminated duplicate documents.
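A rough sketch of this kind of pipeline is shown below; the langid package and MD5-based exact-duplicate removal are stand-ins chosen for illustration, not the tooling actually used to build HC4.

```python
# Illustrative pipeline: language identification plus exact-duplicate removal.
import hashlib
import langid

def build_monolingual_sets(docs, wanted=("zh", "fa", "ru")):
    """docs: iterable of (doc_id, text). Returns {lang: [(doc_id, text), ...]}."""
    seen_hashes = set()
    by_lang = {lang: [] for lang in wanted}
    for doc_id, text in docs:
        digest = hashlib.md5(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in seen_hashes:              # exact duplicate after whitespace normalization
            continue
        seen_hashes.add(digest)
        lang, _score = langid.classify(text)   # automatic language identification
        if lang in by_lang:
            by_lang[lang].append((doc_id, text))
    return by_lang
```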

The second design principle was to create topics that model the interests of a knowledgeable searcher who writes about world events. Such topics enable CLIR research that addresses complex information needs that cannot be answered by a few facts. Key attributes of a knowledgeable searcher include a relative lack of ambiguity in their information need and an increased interest in named entities. To support this goal, we used events reported in the Wikipedia Current Events Portal (WCEP) as our starting point for topic development. To support exploration of how additional context information could be used to improve retrieval, each topic was associated with a contemporaneous report.

A third design principle was to include topics with relevant documents in multiple languages. Once a topic was developed in one language, it was vetted for possible use with the document sets of other languages.

3.1 Topic Development

Starting from an event summary appearing in WCEP, a topic developer would learn about that event from the English document that was linked to it, and from additional documents about the event that were automatically identified as part of the WCEP multi-document summarization dataset [16]. Topic developers were bilingual, so they could understand how an English topic related to the event being discussed in the news in another language. After learning about the event, the topic developer searched a non-English collection to find documents about the event. After reading a few documents in their language, they were asked to write a sentence or question describing an information need held by the hypothetical knowledgeable searcher. They were then asked to write a three-to-five word summary of the sentence. The summary became the topic title, and the sentence became the topic description. Next, the topic developer would investigate the prevalence of the topic in the collection. To do this they would issue one or more document-language queries and judge ten of the resulting documents. Topic developers answered two questions about each document: (1) How relevant is the most important information on the topic in this document?; and (2) How valuable is the most important information in this document? Relevance was judged as central, tangential, not-relevant, or unable-to-judge. The second question was only posed if the answer to the first question was central. Allowable answers to the second question were very-valuable, somewhat-valuable, and not-valuable.
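The topic and judgment scheme described above can be summarized with a simple schema; the field and enum names in the sketch below are illustrative and do not reflect the released file format.

```python
# Sketch of the topic and per-document judgment records implied by the procedure above.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Relevance(Enum):
    CENTRAL = "central"
    TANGENTIAL = "tangential"
    NOT_RELEVANT = "not-relevant"
    UNABLE_TO_JUDGE = "unable-to-judge"

class Value(Enum):
    VERY_VALUABLE = "very-valuable"
    SOMEWHAT_VALUABLE = "somewhat-valuable"
    NOT_VALUABLE = "not-valuable"

@dataclass
class Judgment:
    doc_id: str
    relevance: Relevance
    value: Optional[Value] = None  # only asked when relevance is CENTRAL

@dataclass
class Topic:
    title: str        # three-to-five word summary
    description: str  # one sentence or question describing the information need
    date: str         # date associated with the inspiring event
```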

To develop topics with relevant documents in more than one language, the title and description, along with the event that inspired the topic, were shown to a topic developer for a different language. The topic developer searched for the presence of the topic in their language. As with the initial topic development, ten documents were judged to evaluate whether the document set supported the topic. Topic developers were allowed to modify the topic, which sometimes led to vetting the new topic in the initial language.

3.2 Relevance Judgments

After topic development, some topics were selected for more complete assessment. The titles and descriptions of selected topics were vetted by a committee comprising IR researchers and topic developers. The committee reviewed each topic to ensure that: (a) the title and description were mutually consistent and concise; (b) titles consisted of three to five non-stopwords; (c) descriptions were complete, grammatical sentences with punctuation and correct spelling; and (d) topics were focused and likely to have a manageable number of relevant documents. Corrections were made by having each committee member suggest new phrasing and a topic developer select a preferred alternative.

Given the impracticality of judging millions of documents, and because most documents are not relevant to a given topic, we followed the common practice of assessing as many relevant documents as possible, deferring to the evaluation measure decisions on how unassessed documents should be treated. Because we did not build this collection using a shared task, we did not have diverse systems to contribute to judgment pools, and thus could not use pooling [41, 48]. Instead, we used the active learning system HiCAL [10] to iteratively select documents to be judged. HiCAL builds a classifier from the known relevant documents using relevance feedback; as the assessor judges documents, the classifier is retrained using the new assessments. To seed HiCAL’s classifier, we used the ten documents judged during topic development. Because the relevance assessor was likely not the person who developed the topic, and because the topic might have changed during vetting, those documents were re-judged. At least one document must be judged relevant to initialize the classifier.

Once assessment was complete, assessors provided a translation of the title and description fields into the language of the documents, and briefly explained (in English) how relevance judgments were made; these explanations were placed in the topic’s narrative field. In contrast to the narrative in a typical TREC ad hoc collection, which is written prior to judging documents, these narratives were written after judgments were made; users of these collections must therefore be careful not to use the narrative field as part of a query on the topic.

Our target time for assessing a single topic was four hours. We estimated this would allow us to judge about one hundred documents per topic. According to the designers of HiCAL, one can reasonably infer that almost all findable relevant documents have been found if an assessor judges twenty documents in a row as not relevant. From this, we estimated that topics with twenty or fewer relevant documents were likely to be fully annotated after viewing 100 documents. Treating both central and tangential documents as relevant would have led to more than twenty relevant documents for most selected topics. Thus, to support topics that went beyond esoteric facts, we treated only documents deemed central to the topic as relevant.

We established three relevance levels, defined from the perspective of a user writing a report on the topic:

  • Very-valuable: Information in the document would be found in the lead paragraph of a report that is later written on the topic.

  • Somewhat-valuable: The most valuable information in the document would be found in the remainder of such a report.

  • Not-valuable: Information in the document might be included in a report footnote, or omitted entirely.

Fig. 1. Annotation interface for relevance judgments.

To map graded relevance values to the binary relevance required by HiCAL, documents judged as very-valuable or somewhat-valuable were treated as relevant, while documents judged not-valuable, and those that were not central to the topic, were considered not-relevant. The final collection maps the not-valuable category to not-relevant. This means that a document can mention a topic without being considered relevant to that topic if it lacks information that would be included in a future report. Because an assessor could judge a topic over multiple days, assessors took copious notes to foster consistency.
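A minimal sketch of this binarization rule is given below, with illustrative category strings; it simply restates the mapping described in the text.

```python
# Sketch of the graded-to-binary mapping used to feed HiCAL.
from typing import Optional

def hical_relevant(relevance: str, value: Optional[str]) -> bool:
    # A document counts as relevant only if it is central to the topic AND
    # its most important information is very- or somewhat-valuable.
    return relevance == "central" and value in ("very-valuable", "somewhat-valuable")
```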

To more quickly identify topics too broad to be annotated within our annotation budget, assessors were instructed to end a task early (eliminating the topic from inclusion in the collection) whenever any of the following conditions, sketched in code after the list, was met:

  • more than five very-valuable or somewhat-valuable documents were found among the first ten assessed;

  • more than fifteen very-valuable or somewhat-valuable documents were found among the first thirty assessed;

  • more than forty very-valuable or somewhat-valuable documents were found at any point; or

  • relevant documents were still being found after assessing 85 or more documents.
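The early-termination rules above can be expressed as a simple check over the sequence of binarized judgments; the sketch below is one reading of those rules, not the annotation tool’s actual logic.

```python
# Sketch of the "topic too broad" checks; returns a reason string or None.
def too_broad(relevant_flags):
    """relevant_flags: booleans in assessment order
    (True = very-valuable or somewhat-valuable)."""
    n = len(relevant_flags)
    if n >= 10 and sum(relevant_flags[:10]) > 5:
        return "more than five relevant among the first ten"
    if n >= 30 and sum(relevant_flags[:30]) > 15:
        return "more than fifteen relevant among the first thirty"
    if sum(relevant_flags) > 40:
        return "more than forty relevant found"
    if n >= 85 and any(relevant_flags[84:]):   # a relevant document at or after the 85th assessment
        return "relevant documents still being found after 85 assessments"
    return None
```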

Once assessment was completed, we dropped any topic with fewer than three relevant documents. We subsequently sought to refocus dropped topics to ameliorate the problems encountered during assessment; if this was deemed likely to produce a conforming topic, the refocused topic was added back into the assessment queue. Thus, a few similar but not identical topics are present in different languages.

We used the process described above to develop the topics in each of the three languages. Figure 1 shows the interface used to annotate the collection. Key features include: hot keys to support faster judgment; next document and previous document navigation; identification of near-duplicate documents that were not identified during deduplication; the ability to save progress and return to annotation in another session; counts of how many documents have been judged in different categories; and a button to end the annotation early.

Table 1. Collection statistics.
Table 2. Multilingual topic counts.
Table 3. Document annotation time in minutes with median of each class and Spearman’s \(\rho \) correlation between assessment time and the resulting binarized label.

3.3 Contemporaneous Reports

Contemporaneous reports are portions of Wikipedia page text written before a particular date. Each topic was associated with a date, which either came from the date of the event in WCEP that inspired the topic or, if after topic development there was no such event, from the earliest relevant document. The assessor was instructed to find the Wikipedia page most related to the topic and use the edit history of that page to view it as it appeared on the day before the date listed in the topic. The assessor selected text from this page to serve as the contemporaneous report. Because of the date restriction, some contemporaneous reports are less closely related to the topic, since a specific Wikipedia page for the event may not have existed on the day before the event.
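One way to retrieve such a dated page version is through the public MediaWiki revisions API, as sketched below; the parameters are drawn from that public API as an illustration and are not the tooling used to build the collection.

```python
# Sketch: fetch the latest Wikipedia revision at or before a given timestamp.
import requests

def wikipedia_text_before(title, iso_timestamp):
    """Return the page wikitext of the latest revision at or before iso_timestamp
    (e.g. '2019-01-14T00:00:00Z')."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query", "prop": "revisions", "titles": title,
            "rvlimit": 1, "rvdir": "older", "rvstart": iso_timestamp,
            "rvprop": "timestamp|content", "rvslots": "main",
            "format": "json", "formatversion": 2,
        },
        timeout=30,
    )
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]
```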

4 Collection Details

This section introduces collection details, discusses the annotation cost in terms of time, and reports on inter-assessor agreement. Table 1 describes the size of the collection in documents and topics, and presents counts of the number of annotations used in the final collection. Disjoint subsets of Train and Eval topics are defined to encourage consistent choices by users of the test collections. As in most information retrieval collections, the vast majority of the unjudged documents are not relevant. However, because we used active learning to suggest documents for assessment, and because of our desire to create topics with relatively few relevant documents, on average there are only about 50 judged documents per topic. This number ranges from 28 (when no additional relevant documents were discovered during the second phase) to 112 documents (when an assessor used the “Essentially the same” button shown in Fig. 1). Some of the topics have judged documents in multiple languages. Table 2 displays the number of topics with judgments in each pair of languages, and the subset of those with judgments in all three languages. While we sought to maximize the number of multilingual topics, we were constrained by our annotation budget.

The people who performed topic development and relevance assessment were all bilingual. A majority of them were native English speakers, although a few were native speakers in the language of the documents. While some were proficient in more than two languages, none was proficient in more than one of Chinese, Persian or Russian. Highly fluent topic developers verified that the human translations of topics were expressed fluently in the non-English language.

4.1 Development and Annotation Time

As a proxy for the cost of creating these test collections, we report the time spent on topic development and relevance assessment. The total time for developing candidate topics, including those not included in the final collection, is shown in Table 4. A total of about 570 h were spent by 30 developers to create the 559 topics in the three languages. The median time to develop a topic was about 36 min, with an average of about an hour, suggesting a long tail distribution.

As mentioned in Sect. 3.2, developed topics were filtered before assessment. As shown in Table 3, a total of about 540 h were spent by 33 assessors. These figures include documents rejudged for quality assurance and topics with incomplete assessments. The median annotation time per document suggests that relevant documents took longer to judge; here, we aggregated very-valuable and somewhat-valuable as relevant, and the remaining categories as not relevant. Although this observation is consistent across all three languages, Spearman’s \(\rho \) suggests only a weak correlation between judgment time and relevance, owing to the long-tailed distribution shown in Fig. 2. Not-relevant documents more often took a short time to assess, but as Fig. 2 shows the two distributions are similar, and the differences are not statistically significant under an independent-samples t-test.
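A sketch of these statistics on toy data, using SciPy, is shown below; the Welch variant of the independent-samples t-test is an assumption, and the synthetic times and labels merely stand in for the real annotation logs.

```python
# Sketch: Spearman correlation and independent-samples t-test on toy annotation data.
import numpy as np
from scipy.stats import spearmanr, ttest_ind

rng = np.random.default_rng(0)
relevant = rng.integers(0, 2, size=200)                      # binarized labels (toy)
times = rng.exponential(scale=1.0, size=200) + 0.5 * relevant  # long-tailed times (toy)

rho, p_rho = spearmanr(times, relevant)                      # rank correlation time vs. label
t_stat, p_t = ttest_ind(times[relevant == 1], times[relevant == 0], equal_var=False)
print(f"Spearman rho={rho:.2f} (p={p_rho:.3f}); Welch t-test p={p_t:.3f}")
```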

Table 4. Topic development time in minutes.
Fig. 2. Document annotation time.

Table 5. Example of intersection and union agreement.

4.2 Inter-assessor Agreement

Although all topics were assessed by a single assessor for consistency, several were additionally assessed by one or two other assessors for quality assurance. In Table 6 we report the raw agreement (i.e., the proportion of documents on which all assessors agreed) and Fleiss’ \(\kappa \) (i.e., agreement after chance correction for multiple assessors). Because active learning is path-dependent, each assessor judged a somewhat different set of documents; we therefore evaluate agreement on both the intersection and the union of the judged documents for a complete picture. Unjudged documents were considered not-relevant for the union agreements. Table 5 shows an example, in which only D1 and D4 are in the intersection, having been judged by all three assessors; D3 was not judged by any assessor and is thus not in the union.

Table 6. Inter-assessor agreement on binarized labels.

All three languages demonstrate at least fair agreement (\(\kappa \) between 0.20 and 0.40 [23]), with Chinese topics showing substantial agreement (\(\kappa \) between 0.60 and 0.80), for both the intersection and the union. The raw agreement indicates that 69% to 85% of the judged documents received the same binarized judgments. The small gap between intersection and union agreement supports our assumption that unjudged documents are not relevant.
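The sketch below illustrates how raw agreement and Fleiss’ \(\kappa \) can be computed on the intersection and union sets. The toy judgments mirror the structure described for Table 5 (D1 and D4 judged by all assessors; D3 by none) but are not the actual data in that table.

```python
# Sketch: raw agreement and Fleiss' kappa on intersection/union judgment sets.
def fleiss_kappa(table):
    """table[i][j]: number of raters assigning subject i to category j;
    every row must sum to the same number of raters."""
    n_subjects = len(table)
    n_raters = sum(table[0])
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]
    p_bar = sum(p_i) / n_subjects
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

def agreement(judgments, doc_ids):
    """judgments: list of dicts (one per assessor) mapping doc_id -> bool."""
    table, raw_agree = [], 0
    for doc in doc_ids:
        labels = [j.get(doc, False) for j in judgments]   # unjudged -> not-relevant
        table.append([labels.count(False), labels.count(True)])
        raw_agree += len(set(labels)) == 1
    return raw_agree / len(doc_ids), fleiss_kappa(table)

a1 = {"D1": True, "D2": True, "D4": False}                # toy assessor judgments
a2 = {"D1": True, "D4": False, "D5": True}
a3 = {"D1": False, "D2": False, "D4": False}
intersection = [d for d in a1 if all(d in a for a in (a2, a3))]   # D1, D4
union = sorted(set(a1) | set(a2) | set(a3))                       # D1, D2, D4, D5
print(agreement([a1, a2, a3], intersection))
print(agreement([a1, a2, a3], union))
```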

5 Baseline Runs

To demonstrate the utility of HC4 for evaluating CLIR systems, we report retrieval evaluation results for a set of baseline CLIR systems on the Eval sets in Table 7. Three retrieval approaches implemented with Patapsco [11] (human query translation, machine query translation, and machine document translation) use BM25 (\(k_1=0.9\), \(b=0.4\)) with RM3 pseudo-relevance feedback on title queries. Translation models were trained in-house using the Sockeye toolkit [18].
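For reference, a minimal BM25 scorer with the parameters used here (\(k_1=0.9\), \(b=0.4\)) is sketched below; RM3 expansion and the Patapsco pipeline itself are not reproduced.

```python
# Minimal BM25 scorer (Lucene-style IDF); illustrative, not the Patapsco implementation.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """doc_freq: dict mapping term -> number of documents containing that term."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in doc_freq:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score
```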

As examples of neural CLIR models, we evaluated vanilla reranking models [26] fine-tuned on MS-MARCO-v1 [2] for at most one epoch with several multilingual pretrained models, including multilingual BERT (mBERT) [13], XLM-RoBERTa-large (XLM-R) [8], and InfoXLM-large [6]. Model checkpoints were selected by nDCG@100 on HC4 dev sets. Each trained model reranks, in a zero-shot fashion [30], the top 1000 documents retrieved by the machine query translation BM25 model.
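A sketch of the zero-shot reranking step with a multilingual cross-encoder is shown below; the checkpoint name is a placeholder, and the MS MARCO fine-tuning stage is omitted.

```python
# Sketch: rerank first-stage BM25 results with a multilingual cross-encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-large"  # placeholder; in practice a reranker fine-tuned on MS MARCO
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def rerank(query, docs, top_k=1000):
    """docs: list of (doc_id, text) retrieved by the BM25 first stage."""
    scored = []
    with torch.no_grad():
        for doc_id, text in docs[:top_k]:
            inputs = tokenizer(query, text, truncation=True, max_length=512,
                               return_tensors="pt")
            score = model(**inputs).logits.squeeze().item()  # relevance score
            scored.append((doc_id, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```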

Table 7. Baseline results for title queries using BM25 with RM3 on the Eval sets (QT/DT: query/document translation).

For both nDCG and MAP, human query translation tends to provide the most effective results, usually indistinguishable from machine document translation and from XLM-R (both of which are effective but computationally expensive). In contrast, machine query translation is efficient; however, title queries are unlikely to be grammatically well-formed, so machine translation quality is lower, resulting in lower retrieval effectiveness. We report p-values for two-sided pairwise statistical significance tests. As expected with this number of topics [7], some differences that would be significant at \(p<0.05\) are observed.

The similar levels of Judged at 10 (the fraction of the top 10 documents that were judged) among the highest-scoring systems by nDCG and MAP suggest that our relevance judgments are not biased toward any of those systems, despite their diverse designs. mBERT yields a noticeably lower Judged at 10 because of its markedly worse effectiveness, which others have also observed [19].

6 Conclusion

Our new HC4 test collections provide a basis for comparing the retrieval effectiveness of both traditional and neural CLIR techniques. HC4 allows for wide distribution, since the documents are distributed as part of the Common Crawl and the topics and relevance judgments are being made freely available for research use. HC4 is among the first collections in which judged documents were principally identified using active learning. In addition to providing titles and descriptions in English and in the language of the documents, English contemporaneous reports are included to support research into using additional context for retrieval. HC4 will thus help enable the development of next-generation CLIR algorithms.