1 Introduction

Decision making and opinion formation are routine human tasks that often involve weighing pro and con arguments. Since the Web is full of argumentative texts on almost any topic, in principle, everybody has the chance to acquire the knowledge needed to come to informed decisions or opinions by simply using a search engine. However, a large share of the easily accessible arguments may be of low quality: they may be irrelevant, follow incoherent logic, provide insufficient support, or use foul language. Such arguments should rather remain “invisible” in search results, which implies several retrieval challenges, regardless of whether a query is about socially important topics or “only” about personal decisions. The challenges include assessing an argument’s relevance to a query, identifying an argument’s main “gist” in terms of its take-away, and estimating how well an implied stance is justified, but they also extend to finding images that help to illustrate a stance. Still, today’s popular web search engines do not really address these challenges and lack sophisticated support for searchers in argument retrieval scenarios, a gap we aim to close with the Touché lab.

In the spirit of the two successful Touché labs on argument retrieval at CLEF 2020 and 2021 [6, 7], we propose a third lab edition to again bring together researchers from the fields of information retrieval and natural language processing who work on argumentation. At Touché 2022, we organize the following three shared tasks, the last of which is completely new to this edition:

  1. Argumentative sentence retrieval from a focused collection (crawled from debate portals) to support argumentative conversations on controversial topics.

  2. Argument retrieval from a large collection of text passages to support answering comparative questions in the scenario of personal decision making.

  3. Image retrieval to corroborate and strengthen textual arguments and to provide a quick overview of public opinions on controversial topics.

As part of the previous Touché labs, we evaluated about 130 submissions from 44 teams; the majority submitted their software using the tira.io platform. Many of the submissions improved over the “official” argumentation-agnostic DirichletLM- and BM25-based baselines. In total, we manually assessed more than 11,000 argumentative texts and web documents for 200 search topics. All topics and judgments are publicly available at https://touche.webis.de.

While the first two Touché editions focused on retrieving complete arguments and documents, the third edition focuses on more refined problems. Three shared tasks explore whether argument retrieval can support decision making and opinion formation more directly by extracting the argumentative gist from documents, by classifying their stance as pro or con towards the issue in question, and by retrieving images that show support or opposition to some stance.

2 Task Definition

In the Touché lab, we follow the classic TREC-style methodology: documents and topics are provided to the participants, who then submit their ranked results (up to five runs) for every topic to be judged by human assessors. The third lab edition includes the three complementary tasks already sketched above and further detailed in the following: (1) argument retrieval for controversial questions, (2) argument retrieval for comparative questions, and (3) image retrieval for arguments. While the unit of retrieval in our previous tasks was always an entire document, we now focus on retrieving relevant argumentative sentences, passages, and images, as well as on detecting their stance.
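For illustration, the following is a minimal sketch of writing such a ranked result list, assuming the standard six-column TREC run-file format (topic id, the literal “Q0”, document id, rank, score, run tag); the topic ids, document ids, and scores are purely illustrative, and the exact submission format is specified in the task guidelines:

# A minimal sketch of writing a TREC-style run file, assuming the standard
# six-column format (topic id, "Q0", document id, rank, score, run tag);
# the ids and scores below are purely illustrative.
ranked_results = {
    1: [("doc-123", 12.4), ("doc-456", 11.9)],  # topic id -> ranked (doc id, score) pairs
    2: [("doc-789", 10.2)],
}

with open("run.txt", "w") as run_file:
    for topic_id, hits in ranked_results.items():
        for rank, (doc_id, score) in enumerate(hits, start=1):
            run_file.write(f"{topic_id} Q0 {doc_id} {rank} {score} myBaselineRun\n")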

2.1 Task Description

Task 1: Argument Retrieval for Controversial Questions. Given a controversial topic and a collection of arguments, the task is to retrieve sentence pairs that represent one argument’s gist (e.g., a claim in one sentence and a premise in the other), and to rank these pairs according to their relevance to the topic. The argument collection for Task 1 is the args.me corpus [1]. A pre-processed version of the args.me corpus with each argument split into its constituent sentences is provided and can be indexed easily by the participants.
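For a first orientation, the sketch below ranks the corpus’ sentences with BM25 via the rank_bm25 package; the file name and JSON layout are assumptions for illustration rather than the official distribution format, and pairing the top-ranked sentences into claim–premise pairs is left to the participants:

# A minimal sketch of ranking args.me sentences with BM25 via the rank_bm25
# package; the file name and JSON layout are assumptions for illustration,
# not the official distribution format.
import json
from rank_bm25 import BM25Okapi

with open("args_me_sentences.json") as f:          # hypothetical file name
    sentences = json.load(f)                       # assumed: list of {"id": ..., "text": ...}

tokenized = [s["text"].lower().split() for s in sentences]
bm25 = BM25Okapi(tokenized)

query = "Should teachers get tenure?".lower().split()
scores = bm25.get_scores(query)
top = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)[:20]
for sentence, score in top:
    print(f"{score:6.2f}  {sentence['text'][:80]}")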

The pairs retrieved by the participants will be evaluated by human assessors with respect to topical relevance and argument quality. As for quality, three key properties are required: (1) each sentence in the pair must be argumentative (e.g., a claim, a premise, or a conclusion), (2) the sentence pair must form a coherent text (e.g., the sentences in a pair must not contradict each other), and (3) the sentence pair must constitute a short summary of a single argument (i.e., the major claim of an argument and the best premise supporting this claim are good candidates).

The participants may use a number of previously compiled resources to lower the entry barrier of this task. These include the document-level relevance and quality judgments from the previous Touché editions, and a sample of sentence pairs from the snippet generation framework of Alshomary et al. [3], which enable a basic understanding of the task and of the evaluation during development. For the identification of claims and premises, the participants can use any existing argument tagging tool, such as the API of TARGER [9] hosted on our own servers, or develop their own method if necessary.
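As a hedged illustration of how such a tagger could be integrated, the sketch below posts a sentence to an argument tagging HTTP API and prints token-level labels; the endpoint URL and the response layout are placeholders only, so the official TARGER API documentation should be consulted for the actual interface:

# A hedged sketch of calling an argument tagging HTTP API such as TARGER;
# the endpoint URL and the response layout are placeholders, not the
# documented interface (see the official API link for the real one).
import requests

TAGGER_URL = "https://example.org/targer/classify"   # placeholder endpoint
text = "Teachers should get tenure because it protects academic freedom."

response = requests.post(TAGGER_URL, json={"text": text}, timeout=30)
response.raise_for_status()
for token in response.json():                # assumed: list of {"token": ..., "label": ...}
    print(token["token"], token["label"])    # e.g., claim vs. premise labels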

Task 2: Argument Retrieval for Comparative Questions. Given a comparison search topic with two comparison objects and a collection of text passages, the task is to retrieve relevant argumentative passages for one or both objects, and to detect the passages’ stances with respect to the two objects. The collection for Task 2 is a focused collection of 868,655 passages extracted from the ClueWeb12 for the 50 search topics of the task (cf. Sect. 2.2). Near-duplicates are already removed with CopyCat [12] to mitigate negative impacts [13, 14].

The relevance of the top-k ranked passages of a system (k ≥ 5, determined based on the assessors’ load) will be assessed by human annotators (‘not relevant’, ‘relevant’, or ‘highly relevant’), along with the rhetorical quality [22] (‘no arguments or low quality’, ‘average quality’, or ‘high quality’). Stance detection effectiveness will be evaluated in terms of the accuracy of distinguishing ‘pro first compared object’, ‘pro second compared object’, ‘neutral’, and ‘no stance’.
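For clarity, the stance-detection accuracy can be computed as in the following minimal sketch; the gold and predicted labels are made up for illustration:

# A minimal sketch of the stance-detection evaluation: accuracy over the four
# labels; the gold and predicted labels below are made up for illustration.
from sklearn.metrics import accuracy_score

gold      = ["pro first object", "neutral", "no stance", "pro second object"]
predicted = ["pro first object", "neutral", "neutral",   "pro second object"]

print(f"Stance accuracy: {accuracy_score(gold, predicted):.2f}")   # 0.75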

The participants may use a number of previously compiled resources to lower the entry barrier of this task. These include the document-level relevance and argument quality judgments from the previous Touché editions as well as passage-level relevance judgments for a subset of MS MARCO [19] whose comparative questions were identified with our ALBERT-based [17] classifier (about 40,000 of the questions are comparative) [5]; each of these comparative questions comes with 10 text passages that carry relevance labels. For stance detection, a dataset of 950 comparative questions and answers extracted from Stack Exchange is provided [5]. For the identification of arguments in texts (e.g., claims and premises), the participants can use any existing argument tagging tool, such as the TARGER API hosted on our own servers, or develop their own tools.
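As a hedged illustration of how such a comparative-question classifier could be applied, the sketch below runs a fine-tuned text classifier over candidate questions; the model name is a placeholder and not an officially released checkpoint:

# A hedged sketch of filtering comparative questions with a fine-tuned text
# classifier (e.g., ALBERT-based); the model name is a placeholder, not an
# officially released checkpoint.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="my-team/albert-comparative-questions")  # placeholder name

questions = ["What is the capital of France?",
             "Which is better for gaming, a laptop or a desktop PC?"]
for question in questions:
    print(question, "->", classifier(question)[0])   # e.g., {'label': ..., 'score': ...}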

Task 3: Image Retrieval for Arguments (New Task). Given a controversial topic and a collection of web documents with images, the task is to retrieve images that show support for each stance (pro/con the topic). The collection for Task 3 is a focused crawl of 10,000 images along with the documents that contain them; the textual content of the web documents can also be used for the retrieval.

A system’s results should provide a searcher with a visual overview of public opinions on a controversial topic; we envision systems that juxtapose images for each stance. The approaches will be evaluated in terms of precision, namely by the ratio of relevant images among 20 retrieved images, 10 per stance.
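In other words, a system’s score is the fraction of its 20 retrieved images (10 per stance) that are judged relevant, as in this minimal sketch with made-up judgments:

# A minimal sketch of the precision-based evaluation: the share of relevant
# images among the 20 retrieved images (10 per stance); judgments are made up.
def precision(relevance_flags):
    return sum(relevance_flags) / len(relevance_flags)

pro_relevant = [True] * 7 + [False] * 3    # judgments for the 10 "pro" images
con_relevant = [True] * 6 + [False] * 4    # judgments for the 10 "con" images

print(f"Precision@20: {precision(pro_relevant + con_relevant):.2f}")   # 0.65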

Participants may use our available image-level relevance judgments [16]; their format is aligned with the format of the task’s collection. Similar to the other Touché tasks, participants are free to use any additional existing tools and datasets or to develop their own. Moreover, our goal is to collect a software suite for extracting various features from both the images and the web documents; participants are encouraged to contribute Docker containers to this suite.

2.2 Search Topics

For the tasks on controversial questions (Task 1) and image retrieval (Task 3), we provide 50 search topics that represent a variety of debated societal matters. Each of these topics has a title in the form of a question on a controversial issue, a description specifying the particular search scenario, and a narrative that serves as a guideline for the human assessors:

[Figure a: an example search topic for Tasks 1 and 3, consisting of a title, description, and narrative]

For the task on comparative questions (Task 2), we provide 50 search topics that describe scenarios of personal decision making. Each of these topics has a title in the form of a comparative question, comparison objects for the stance detection of the retrieved passages, a description specifying the particular search scenario, and a narrative that serves as a guideline for the assessors:

[Figure b: an example search topic for Task 2, consisting of a title, comparison objects, description, and narrative]

3 Touché at CLEF 2021: Results and Findings

At Touché 2021, we received 36 registrations (compared to 28 registrations in the first year); in line with the lab’s fencing-related title, the participants were asked to select a real or fictional fencer or swordsman character (e.g., Zorro) as their team name upon registration. Of the 36 registered teams, 27 submitted results (after 17 active teams in the first year), which amounted to 88 valid runs (after 41 in 2020; participants were allowed to submit up to 5 result rankings in both years). Touché aims to foster the reproducibility of submissions by asking participants to submit their approaches via the TIRA platform [20], which allows easy software submission and automatic evaluation.

Task 1: Argument Retrieval for Controversial Questions. In the first two Touché editions, Task 1 was stated as follows: given a question on a controversial topic, retrieve relevant and high-quality arguments from a focused crawl of online debate portals, the args.me corpus [1]. The submissions in 2021 [7] partly continued the trend of Touché 2020 [6] by deploying “traditional” retrieval models, but with an increased focus on machine learning models (especially for query expansion and for argument quality assessment). Overall, there were two main trends in the participants’ retrieval pipelines: (1) reproducing and fine-tuning approaches from the previous year by increasing their robustness, and (2) developing new, mostly neural approaches for argument retrieval by fine-tuning pre-trained models for the domain-specific search task at hand.

Like in the first year, combining “traditional” retrieval models with various query expansion methods and domain-specific re-ranking features remained a frequent choice for Task 1. Not really surprising given its top effectiveness as the 2020 baseline, DirichletLM was employed most often as the initial retrieval model, followed by BM25. For query expansion (e.g., with synonyms), most participating teams continued to use WordNet [11], while Transformer-based approaches received increased attention [2]. Moreover, many approaches tried to use some form of argument quality estimation in the (re-)ranking.
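As an illustration of the WordNet-based expansion many teams used, the following minimal sketch collects synonym candidates for a query term with NLTK (requires the WordNet data, e.g., via nltk.download('wordnet')); how the candidates were filtered and weighted differed between the approaches:

# A minimal sketch of WordNet-based query expansion with NLTK; requires the
# WordNet data (e.g., nltk.download('wordnet')).
from nltk.corpus import wordnet

def expansion_candidates(term):
    synonyms = {lemma.name().replace("_", " ")
                for synset in wordnet.synsets(term)
                for lemma in synset.lemmas()}
    synonyms.discard(term)
    return synonyms

print(expansion_candidates("tenure"))   # e.g., {'term of office', 'incumbency', ...}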

The approaches in 2021 benefited from the relevance judgments collected at Touché in 2020. Many teams used them for general parameter optimization but also to evaluate intermediate results of their approaches, to select preprocessing methods, and to fine-tune or select the best configurations.

Task 2: Argument Retrieval for Comparative Questions. In the first two Touché editions, Task 2 was stated as follows: given a comparative question, retrieve documents from the ClueWeb12 that help to answer the comparative question. The participants’ approaches submitted in 2021 all used the ChatNoir search engine [4] for an initial document retrieval, either by submitting the original topic titles as queries, or by applying query preprocessing (e.g., lemmatization and POS-tagging) and query expansion techniques (e.g., synonyms from WordNet [11], or generation based on word2vec [18] or sense2vec embeddings [21]). Most teams then applied a document “preprocessing” (e.g., removing HTML markup) before re-ranking the ChatNoir results with feature-based or neural classifiers trained on the Touché 2020 judgments (e.g., using argumentativeness, credibility, or comparativeness scores as features). The teams predicted document relevance labels by using a random forest classifier, XGBoost [8], LightGBM [15], or a fine-tuned BERT [10].
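The following is a hedged sketch of such a feature-based re-ranking step: a classifier trained on the 2020 judgments predicts relevance labels for the initially retrieved documents, which are then re-ordered accordingly; the feature values and document ids are placeholders standing in for, e.g., argumentativeness, credibility, or comparativeness scores:

# A hedged sketch of feature-based re-ranking: a classifier trained on the
# 2020 judgments predicts relevance labels that are used to re-order the
# initially retrieved documents; feature values and document ids are
# placeholders (e.g., for argumentativeness, credibility, comparativeness).
from sklearn.ensemble import RandomForestClassifier

train_features = [[0.9, 0.7, 0.8], [0.1, 0.2, 0.3], [0.6, 0.4, 0.5]]
train_labels   = [2, 0, 1]          # 0 = not relevant, 1 = relevant, 2 = highly relevant
reranker = RandomForestClassifier(random_state=0).fit(train_features, train_labels)

candidates = {"clueweb12-doc-a": [0.8, 0.6, 0.9],   # doc id -> feature vector
              "clueweb12-doc-b": [0.2, 0.3, 0.1]}
scores = {doc: reranker.predict([features])[0] for doc, features in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))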

Overall, in both tasks, many more approaches submitted in 2021 could improve upon the argumentation-agnostic baselines (DirichletLM for Task 1 and BM25 for Task 2) than in the first year, indicating that progress was achieved.

4 Conclusion

At Touché, we continue our activities to establish a collaborative platform for researchers in the area of argument retrieval, and organize respective shared tasks for the third time. By providing submission and evaluation tools as well as by organizing collaborative events such as workshops, Touché aims to foster the accumulation of knowledge and development of new approaches in the field. All evaluation resources developed at Touché are shared freely, including search queries (topics), the assembled manual relevance and argument quality judgments (qrels), and the ranked result lists submitted by the participants (runs).