
1 Introduction

Informed decision making and opinion formation are routine tasks that generally involve weighing two or more options. A choice may be based on personal prior knowledge and experience, but it often also requires searching for and processing new knowledge. With ubiquitous access to various kinds of information on the web—from facts and opinions to anecdotes and arguments—everybody has the chance to acquire knowledge for decision making or opinion formation on almost any topic. However, the large amount of easily accessible information brings challenges, such as the need to assess its relevance to the specific topic of interest and to estimate how well an implied stance is justified; no matter whether it is about topics of social importance or “just” about personal decisions. In the simplest form, such a justification might be a collection of basic facts and opinions. More complex justifications, though, are often grounded in argumentation; for instance, in a relational aggregation of assertions and evidence pro or con either side, where different assertions or evidential statements support or refute each other.

Furthermore, while web resources such as blogs, community question answering sites, news articles, or social platforms contain an immense variety of opinions and argumentative texts, a notable proportion of these may be biased, faked, or populist in nature. This has motivated argument retrieval research to focus not only on the relevance of arguments but also on their quality. While conventional web search engines support the retrieval of factual information fairly well, they hardly address the deeper analysis and processing of argumentative texts in terms of mining argument units from these texts, assessing the quality of the arguments, or classifying their stance. To address this, the argument search engine args.me [51] was developed to retrieve arguments relevant to a given controversial topic and to account for the pro or con stance of individual arguments in the result presentation. So far, however, it is limited to a document collection crawled from a few online debate portals and largely disregards quality aspects. Other argument retrieval systems such as ArgumenText [45] and TARGER [13] take advantage of the large Common Crawl web document collection, but their ability to reliably retrieve arguments that support the sides in a decision process is limited. The comparative argumentation machine CAM [44], a system for argument retrieval in comparative search, tries to support decision making in comparison scenarios based on billions of individual sentences from the Common Crawl, but it still lacks a proper ranking of diverse, longer argumentative texts.

To foster research on argument retrieval and to establish more collaboration and exchange of ideas and datasets among researchers, we organized the second Touché lab on argument retrieval at CLEF 2021 [8, 9]. Touché is a collaborative platform to develop and share retrieval approaches that aim to support decisions at a societal level (e.g., “Should hate speech be penalized more, and why?”) and at a personal level (e.g., “Should I major in philosophy or psychology, and why?”). The second year of Touché featured two tasks:

  1.

    Argument retrieval for controversial questions from a focused collection of debates to support opinion formation on topics of social importance.

  2.

    Argument retrieval for comparative questions from a generic web crawl to support informed decision making.

Approaches to these two tasks, which consider not only the relevance of arguments but also facets of their argumentative quality, will help search engines deliver more accurate argumentative results. They will also be an important component of open-domain conversational agents that “discuss” controversial societal topics with humans—as showcased by IBM’s Project Debater [4, 5, 32].

The teams that participated in the second year of Touché were able to use the topics and relevance judgments from the first year to develop their approaches. Many trained and optimized learning-based rankers as part of their retrieval pipelines and employed a large variety of pre-processing methods (e.g., stemming, duplicate removal, query expansion), argument quality features, or comparative features (e.g., credibility, part-of-speech tags). In this paper, we report the results and briefly describe the most effective retrieval approaches that participants submitted to Touché 2021; a more comprehensive overview of each approach will be given in the forthcoming extended overview [9].

2 Previous Work

Queries in argument retrieval are often phrases that describe a controversial topic, questions that ask to compare two options, or even complete arguments themselves [53]. In the Touché lab, we address the first two types in two different shared tasks. Here, we briefly summarize the related work for both tasks.

2.1 Argument Retrieval

Argument retrieval aims to deliver arguments that support users in making a decision or that help to persuade an audience of a specific point of view. An argument is usually modeled as a conclusion with supporting or attacking premises [51]. While a conclusion is a statement that can be accepted or rejected, a premise is a more grounded statement (e.g., statistical evidence).

The development of an argument search engine is faced with challenges that range from mining arguments from unstructured text to assessing their relevance and quality [51]. Argument retrieval follows several paradigms that start from different sources and perform argument mining and retrieval tasks in different orders [1]. Wachsmuth et al. [51], for instance, extract arguments offline using heuristics that are tailored to online debate portals. Their argument search engine args.me uses BM25F to rank the indexed arguments, giving conclusions more weight than premises. Levy et al. [29] also use distant supervision to mine arguments offline for a set of topics from Wikipedia before ranking them. Following a different paradigm, Stab et al. [45] retrieve documents from the Common Crawl in an online fashion (no prior offline argument mining) and use a topic-dependent neural network to extract arguments from the retrieved documents at query time. With the two Touché tasks, we address the paradigms of Wachsmuth et al. [51] (Task 1) and Stab et al. [45] (Task 2), respectively.
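
The field weighting in such a setup can be illustrated with a simplified BM25F-style scoring function. This is a minimal sketch assuming a plain dictionary-based index with precomputed collection statistics; the field weights and parameters are placeholders, not args.me's actual configuration.

```python
import math
from collections import Counter

def bm25f_score(query_terms, doc_fields, field_weights, stats, k1=1.2, b=0.75):
    """Simplified BM25F-style scoring: term frequencies from the fields
    (e.g., 'conclusion' and 'premises') are combined with field weights
    before the usual BM25 saturation; per-field length normalization is
    omitted for brevity."""
    doc_len = sum(len(tokens) for tokens in doc_fields.values())
    score = 0.0
    for term in query_terms:
        # weighted pseudo term frequency across fields
        wtf = sum(w * Counter(doc_fields[f])[term] for f, w in field_weights.items())
        df = stats["df"].get(term, 0)
        idf = math.log((stats["N"] - df + 0.5) / (df + 0.5) + 1.0)
        norm = k1 * (1.0 - b + b * doc_len / stats["avg_len"])
        score += idf * wtf / (wtf + norm) if wtf > 0 else 0.0
    return score

# Giving the conclusion a higher weight than the premises (placeholder values):
weights = {"conclusion": 3.0, "premises": 1.0}
```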

Argument retrieval should rank arguments not only according to their topical relevance but also according to their quality. What makes a good argument has been studied since the time of Aristotle [3]. Recently, Wachsmuth et al. [48] categorized the different aspects of argument quality into a taxonomy that covers three dimensions: logic, rhetoric, and dialectic. Logic concerns the local structure of an argument, i.e., the conclusion, the premises, and their relations. Rhetoric covers the effectiveness of the argument in persuading an audience of its conclusion. Dialectic addresses the relations of an argument to other arguments on the topic. For example, an argument that has many attacking premises might be rather vulnerable in a debate. The relevance of an argument to a query’s topic is categorized by Wachsmuth et al. [48] under dialectic quality.

Researchers assess argument relevance by measuring an argument’s similarity to a query’s topic or by incorporating its support/attack relations to other arguments. Potthast et al. [40] evaluate four standard retrieval models for ranking arguments with regard to the quality dimensions relevance, logic, rhetoric, and dialectic. One of the main findings is that DirichletLM is better at ranking arguments than BM25, DPH, and TF-IDF. Gienapp et al. [21] extend this work by proposing a pairwise crowdsourcing strategy that reduces the cost of collecting argument retrieval annotations by 93% (i.e., only a small subset of argument pairs needs to be annotated).
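
Since DirichletLM appears as the strongest of these baselines, a minimal sketch of Dirichlet-smoothed query likelihood may make the model concrete; the statistics and the smoothing parameter mu are placeholders, not the implementation evaluated by Potthast et al. [40].

```python
import math
from collections import Counter

def dirichlet_lm_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000):
    """Query likelihood with Dirichlet smoothing:
    P(t|d) = (tf(t,d) + mu * P(t|C)) / (|d| + mu).
    Returns the sum of log probabilities over the query terms."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_t_c = collection_tf.get(t, 0) / collection_len  # collection language model
        if p_t_c == 0:
            continue  # term unseen in the collection; skip (or use a small epsilon)
        score += math.log((tf[t] + mu * p_t_c) / (doc_len + mu))
    return score
```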

Wachsmuth et al. [52] create a graph of arguments by connecting two arguments when one uses the other’s conclusion as a premise. They then exploit this structure to rank the arguments in the graph using PageRank scores [37]. This method is shown to outperform several baselines that only consider the content of an argument and its local structure (conclusion and premises). Dumani et al. [15] introduce a probabilistic framework that operates on semantically similar claims and premises. The framework utilizes support/attack relations between clusters of premises and claims and between clusters of claims and a query, and is found to outperform BM25 in ranking arguments. Later, Dumani et al. [16] extend the framework to include the quality of a premise as a probability, estimated as the fraction of premises that are worse with regard to the three quality dimensions cogency, reasonableness, and effectiveness. Using a pairwise quality estimator trained on the Dagstuhl-15512 ArgQuality Corpus [50], their probabilistic framework with the argument quality component outperformed the one without it on the 50 Task 1 topics of Touché 2020.
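
A minimal sketch of ranking arguments in such a reuse graph with PageRank, using networkx; the example edges and the chosen edge direction are illustrative assumptions, not the graph construction of Wachsmuth et al. [52].

```python
import networkx as nx

# Directed edge (a, b) in this sketch means: argument a reuses argument b's
# conclusion as a premise, so relevance flows from a to b.
graph = nx.DiGraph()
graph.add_edges_from([
    ("arg1", "arg2"),
    ("arg3", "arg2"),
    ("arg2", "arg4"),
])

# PageRank scores can then be combined with content-based retrieval scores.
scores = nx.pagerank(graph, alpha=0.85)
ranking = sorted(scores, key=scores.get, reverse=True)
```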

2.2 Retrieval for Comparisons

Comparative information needs in web search were first addressed by basic interfaces in which two to-be-compared products are entered separately into a left and a right search box [34, 46]. Comparative sentences in favor of or against one or the other entity are then identified and mined from product reviews using opinion mining approaches [23, 24, 26]. Recently, identifying the comparison preference (the “winning” entity) in comparative sentences has been tackled in a broader domain (not just product reviews) by applying feature-based and neural classifiers [31, 39]. Such preference classification forms the basis of the comparative argumentation machine CAM [44], which takes two entities and some comparison aspect(s) as input, retrieves comparative sentences in favor of one or the other entity using BM25, and classifies their preference for a final merged result table. A proper argument ranking, however, is still missing in CAM. Chekalina et al. [11] later extend the system to accept comparative questions as input and to return a natural language answer to the user. A comparative question is parsed by identifying the comparison objects, aspect(s), and predicate; the answer is either generated directly with Transformers [14] or retrieved from an index of comparative sentences.
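
The CAM-style aggregation step could be sketched roughly as follows; classify_preference is a hypothetical stand-in for the feature-based or neural preference classifier, and the sketch is not CAM's actual code.

```python
from collections import Counter

def aggregate_comparative_sentences(sentences, classify_preference, entity_a, entity_b):
    """Tally which entity 'wins' in each retrieved comparative sentence and
    keep the sentences as evidence for a pro/con result table."""
    tally = Counter()
    evidence = {entity_a: [], entity_b: []}
    for sentence in sentences:
        winner = classify_preference(sentence, entity_a, entity_b)  # entity_a, entity_b, or None
        if winner in evidence:
            tally[winner] += 1
            evidence[winner].append(sentence)
    return tally, evidence
```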

3 Lab Overview and Statistics

The second edition of Touché received 36 registrations (compared to 28 in the first year), with the majority coming from Germany and Italy, but also from other countries in the Americas, Europe, Africa, and Asia (16 from Germany, 10 from Italy, 2 each from the United States and Mexico, and 1 each from Canada, India, the Netherlands, Nigeria, the Russian Federation, and Tunisia). Aligned with the lab’s fencing-related title, the participants were asked to select a real or fictional swordsman character (e.g., Zorro) as their team name upon registration.

We received result submissions from 27 of the 36 registered teams (up from 20 submissions in the first year). As in the previous edition of Touché, we paid attention to fostering the reproducibility of the developed approaches by using the TIRA platform [41]. Upon registration, each team received an invitation to TIRA to deploy actual software implementations of their approaches. TIRA is an integrated cloud-based evaluation-as-a-service research architecture in which participants install their software in a dedicated virtual machine. By default, the virtual machines run the server version of Ubuntu 20.04 with one CPU (Intel Xeon E5-2620), 4 GB of RAM, and 16 GB of HDD space, but we adjusted the resources to the participants’ requirements when needed (e.g., one team asked for 30 GB of RAM, 3 CPUs, and 30 GB of HDD space). The participants had full administrative access to their virtual machines. Still, we pre-installed the latest versions of reasonable standard software (e.g., Docker and Python) to simplify the deployment of the approaches.

Using TIRA, the teams could create result submissions via a click in the web UI, which initiated the following pipeline: the respective virtual machine is shut down, disconnected from the internet, and powered on again in a sandbox mode, mounting the test datasets for the respective task and running the team’s deployed approach. The interruption of the internet connection ensures that the participants’ software works without external web services that may disappear or become incompatible—possible causes of reproducibility issues—but it also means that downloading additional external code or models during the execution was not possible. We offered our support when this connection interruption caused problems during the deployment, for instance, with spaCy, which tries to download models if they are not already available on the machine, or with PyTerrier, which, in its default configuration, checks for online updates. To simplify participation for teams that did not want to develop a fully-fledged retrieval pipeline on their end, we enabled two exceptions from the interruption of the internet connection for all participants: the APIs of args.me and ChatNoir were available even in the sandbox mode to allow accessing a baseline system for each of the tasks. The virtual machines that the participants used for their submissions will be archived so that the respective systems can be re-evaluated or applied to new datasets as long as the APIs of ChatNoir and args.me remain available; both are maintained by us.

In cases where a software submission in TIRA was not possible, the participants could submit plain run files. Overall, 5 of the 27 teams submitted traditional run files instead of software in TIRA. Per task, we allowed each team to submit up to 5 runs in the standard TREC-style format. We checked the validity of all submitted run files and asked participants to resubmit their run files (or software) if there were any validity issues—again, also offering our support in case of problems. All 27 teams submitted valid runs, resulting in 90 valid runs (more than doubling the 42 result submissions that we received in the first year).
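
For reference, a run file contains one line per retrieved document in the usual six whitespace-separated TREC columns (topic ID, the literal Q0, document ID, rank, score, run tag); the document IDs and run tag in this small example are made up.

```text
51 Q0 example-doc-001 1 17.89 mySwordsmanRun
51 Q0 example-doc-042 2 15.02 mySwordsmanRun
52 Q0 example-doc-007 1 21.45 mySwordsmanRun
```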

4 Task 1: Argument Retrieval for Controversial Questions

The goal of the Touché 2021 lab’s first task was to advance technologies that support individuals in forming opinions on socially important controversial topics such as “Should hate speech be penalized more?”. For such topics, the task was to retrieve relevant and high-quality argumentative texts from the args.me corpus [1], a focused crawl of online debate portals. In this scenario, relevant arguments should help users to form an opinion on the topic and to find arguments that are potentially useful in debates or discussions.

The results of last year’s Task 1 participants indicated that improving upon “classic” argument-agnostic baseline retrieval models (such as BM25 and DirichletLM) in ranking arguments from a focused crawl is difficult but, at the same time, that these baselines still leave some room for improvement. Also, detecting the degree of argumentativeness and assessing the quality of an argument were not “solved” in the first year, but were identified as potentially interesting contributions to the task’s second iteration.

4.1 Task Definition

Given a controversial topic formulated as a question, approaches to Task 1 needed to retrieve relevant and high-quality arguments from the args.me corpus, which covers a wide range of timely controversial topics. To enable approaches that leverage training and fine-tuning, the topics and relevance judgments from the 2020 edition of Task 1 were provided.

4.2 Data Description

Topics. We formulated 50 new search questions on controversial topics. Each topic consisted of (a) a title in the form of a question that a user might submit as a query to a search engine, (b) a description that summarizes the particular information need and search scenario, and (c) a narrative that guides the assessors in recognizing relevant results (an example topic is given in Table 1). We carefully designed the topics by clustering the debate titles in the args.me corpus and formulating questions for a balanced mix of frequent and niche topics—manually ensuring that at least some relevant arguments are contained in the args.me corpus for each topic.

Table 1. Example topic for Task 1: Argument Retrieval for Controversial Questions.

Document Collection. The document collection for Task 1 was the args.me corpus [1], which is freely available for download and also accessible via the args.me API. The corpus contains about 400,000 structured arguments (from debatewise.org, idebate.org, debatepedia.org, and debate.org), each with a conclusion (claim) and one or more supporting or attacking premises (reasons).

4.3 Submitted Approaches

Twenty-one participating teams submitted at least one valid run to Task 1. The submissions partly continued the trend of Touché 2020 [7] of deploying “classical” retrieval models, yet with an increased focus on machine learning models (especially for query expansion and for assessing argument quality). Overall, we observed two kinds of contributions: (1) reproducing and fine-tuning approaches from the previous year while increasing their robustness, and (2) developing new, mostly neural approaches for argument retrieval by fine-tuning pre-trained models for the domain-specific search task at hand.

As in the first year, combining “classical” retrieval models with various query expansion methods and domain-specific re-ranking features remained a frequent choice for Task 1. Not really surprisingly—given last year’s baseline results—DirichletLM was employed most often as the initial retrieval model, followed by BM25. For query expansion, most participating teams continued to leverage WordNet [17], but transformer-based approaches received increased attention, such as query hallucination, which had been used successfully by Akiki and Potthast [2] in the previous Touché lab. Similarly, deep semantic phrase embeddings for calculating the semantic similarity between a query and candidate result documents gained widespread adoption. Moreover, many approaches used some form of argument quality estimation as one of their features for ranking or re-ranking.
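
A minimal sketch of WordNet-based query expansion along these lines, using NLTK; stopword handling and term weighting are omitted, and the participating teams' exact expansion strategies may differ.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def expand_query(query_terms, max_synonyms_per_term=3):
    """Add a few WordNet synonyms per query term (simple, unweighted expansion)."""
    expanded = list(query_terms)
    for term in query_terms:
        synonyms = []
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                name = lemma.replace("_", " ").lower()
                if name != term and name not in synonyms:
                    synonyms.append(name)
        expanded.extend(synonyms[:max_synonyms_per_term])
    return expanded

print(expand_query(["hate", "speech"]))
```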

This year’s approaches benefited from the judgments released for Touché in 2020. Many teams used them for general parameter optimization but also to evaluate intermediate results of their approaches and to fine-tune or select the best configurations. For instance, comparing different kinds of pre-processing methods based on the available judgments from last year received much attention (e.g., stopword lists, stemming algorithms, or duplicate removal).

4.4 Task Evaluation

The teams’ result rankings had to be formatted in the “standard” TREC format, where document IDs are sorted by descending relevance score for each search topic (i.e., the most relevant argument/document occurs at rank 1). Prior to creating the assessment pools, we ran a near-duplicate detection on all submitted runs using the CopyCat framework [18], since near-duplicates might impact evaluation results [19, 20]. The framework found only 1.1% of the arguments in the top-5 results to be near-duplicates (mostly due to debate portal users reusing their arguments in multiple debate threads). We created duplicate-free versions of each result list by removing the documents for which a higher-ranked document is a near-duplicate; in such cases, the next-ranked non-near-duplicate simply moved up the ranked list. The top-5 results of the original and the deduplicated runs then formed the judgment pool—created with TrecTools [38]—resulting in 3,711 unique documents that were manually assessed with respect to their relevance and argumentative quality.
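
The top-k pooling over the original and deduplicated runs can be sketched as follows; parsing the run files and the near-duplicate detection itself are assumed to be given.

```python
def top_k_pool(runs, k=5):
    """Build a judgment pool from the top-k results of each run.

    runs: dict mapping run_tag -> dict mapping topic_id -> ranked list of doc_ids
    Returns: dict mapping topic_id -> set of doc_ids to be judged.
    """
    pool = {}
    for ranking_per_topic in runs.values():
        for topic_id, ranked_docs in ranking_per_topic.items():
            pool.setdefault(topic_id, set()).update(ranked_docs[:k])
    return pool
```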

For the assessment, we used the Doccano tool [35] and followed previously suggested annotation guidelines [21, 40]. Our eight graduate and undergraduate student volunteers (all with a computer science background) assessed each argument’s relevance to the given topic with four labels (0: not relevant, 1: relevant, 2: highly relevant, or -2: spam) and the argument’s rhetorical quality [50] with three labels (0: low quality, 1: sufficient quality, and 2: high quality). To calibrate the annotators’ interpretation of the guidelines (i.e., the topics including the narratives, and the instructions on argument quality), we performed an initial \(\kappa \)-test in which each annotator labeled the same 15 documents from three topics (5 documents per topic). The observed Fleiss’ \(\kappa \) values of 0.50 for argument relevance (moderate agreement) and 0.39 for argument quality (fair agreement) are similar to previous studies [21, 49, 50]. Nevertheless, we had a final discussion with all annotators to clarify potential misinterpretations. Afterwards, each annotator independently judged the results for disjoint subsets of the topics (i.e., each topic was judged by one annotator only).
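
Fleiss’ \(\kappa \) for such a calibration study can be computed directly from the per-document label counts; a small self-contained sketch (for illustration, not necessarily the tool used to produce the reported values).

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) matrix; counts[i, j] is the number of
    annotators who assigned category j to item i (all rows sum to the same
    number of raters)."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / (n_items * n_raters)            # category proportions
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()              # observed vs. chance agreement
    return (p_bar - p_e) / (1 - p_e)
```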

Table 2. Results for Task 1: Argument Retrieval for Controversial Questions. The left part (a) shows the evaluation results of a team’s best run according to the results’ relevance, while the right part (b) shows the best runs according to the results’ quality. An asterisk (\(^\star \)) indicates that the runs with the best relevance and the best quality differ for a team. The baseline DirichletLM ranking is shown in bold.

4.5 Task Results

The results of the runs with the best nDCG@5 scores per participating team are reported in Table 2. Below, we briefly summarize the best configurations of the teams ranked in the top-5 of either the relevance or the quality evaluation. A more comprehensive discussion including all teams’ approaches will be part of the forthcoming extended lab overview [9].
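
The reported measure is nDCG@5 over the graded judgments; a compact sketch of its computation for a single topic (official figures are typically produced with standard tools, so this is for illustration only).

```python
import math

def dcg(gains):
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, qrels, k=5):
    """ranked_doc_ids: system ranking for one topic; qrels: doc_id -> graded label.
    Unjudged documents and negative (spam) labels contribute a gain of 0 here."""
    gains = [max(qrels.get(doc_id, 0), 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0
```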

Team Elrond combined DirichletLM retrieval with a pre-processing pipeline consisting of Krovetz stemming [27], stopword removal using a custom list, removing terms with certain part-of-speech tags, and enriching the document representations using WordNet-based synonyms.

Team Pippin Took also used DirichletLM as their basic retrieval model (parameter optimization based on the Touché 2020 judgments) combined with WordNet-based query expansion.

Team Robin Hood combined RM3 [28] query expansion with phrase embeddings for retrieval. Their system represents the premise and the conclusion of each argument in two separate vector spaces using the Universal Sentence Encoder [10], and then ranks the arguments based on their cosine similarity to the embedded query.
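
Such an embedding-based ranking can be approximated as follows, assuming precomputed query, conclusion, and premise embeddings (e.g., from the Universal Sentence Encoder); the way the two similarities are combined here is a placeholder, not the team's exact formula.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_arguments(query_vec, arguments, w_conclusion=0.5):
    """arguments: list of (arg_id, conclusion_vec, premise_vec) triples."""
    scored = []
    for arg_id, conclusion_vec, premise_vec in arguments:
        score = (w_conclusion * cosine(query_vec, conclusion_vec)
                 + (1 - w_conclusion) * cosine(query_vec, premise_vec))
        scored.append((arg_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```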

Team Asterix combined BM25 as basic retrieval model with WordNet-based query expansion and a quality-aware re-ranking approach (linear regression model trained on the Webis-ArgQuality-20 dataset [21]). In their system, arguments are ranked based on a combination of the predicted quality score and a normalized BM25 score.

Team Dread Pirate Roberts trained a LambdaMART model on the Task 1 relevance labels of Touché 2020 to re-rank the top-100 results of an initial DirichletLM ranking. Using greedy feature selection, they identified the four to nine features with the best nDCG scores in a 5-fold cross-validation setup.
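
The team's exact features and LambdaMART implementation are not reproduced here; the following sketch uses LightGBM's lambdarank objective with placeholder data to illustrate the general re-ranking setup.

```python
import numpy as np
import lightgbm as lgb

# X: one feature vector per (topic, document) pair from the initial top-100;
# y: graded relevance labels; groups: number of candidates per topic.
X = np.random.rand(500, 8)                 # placeholder features
y = np.random.randint(0, 3, size=500)      # placeholder labels
groups = [100] * 5                         # 5 topics with 100 candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200, learning_rate=0.05)
ranker.fit(X, y, group=groups)

# At query time, re-rank the initial top-100 of a topic by predicted scores.
scores = ranker.predict(X[:100])
reranked = np.argsort(-scores)
```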

Team Heimdall represented arguments using k-means cluster centroids in a vector space constructed using phrase embeddings. Their system combines the cosine similarity of a query to a centroid with DirichletLM retrieval scores, and derives an argument quality score from an SVM regression model that uses \( tf\cdot idf \) features and was trained on the overall quality ratings from the Webis-ArgQuality-20 dataset.

Team Skeletor, finally, combined a fine-tuned BM25 model with the cosine similarity of passages calculated by a phrase embedding model fine-tuned for question answering. They included pseudo-relevance feedback using the 50 arguments that are most similar in the embedding space to the top-3 initially retrieved arguments. The final score of a candidate passage is approximated by its similarity to these feedback passages (determined with manifold approximation), and the passage scores are summed to obtain the argument’s score.

5 Task 2: Argument Retrieval for Comparative Questions

The goal of the Touché 2021 lab’s second task was to support individuals in making informed decisions in “everyday” or personal comparison situations—in the simplest form for questions such as “Is X or Y better for Z?”. Decision making in such situations benefits from finding balanced justifications for choosing one or the other option, for instance, in the form of pro/con arguments.

Similar to Task 1, the results of last year’s Task 2 participants indicated that improving upon an argument-agnostic BM25 baseline is quite difficult. The more promising approaches re-ranked the results based on features capturing “comparativeness” or “argumentativeness”.

5.1 Task Definition

Given a comparative question, approaches to Task 2 needed to retrieve documents from the general web crawl ClueWeb12 that help to come to an informed decision on the comparison. Ideally, the retrieved documents should be argumentative, with convincing arguments for or against one or the other option. To identify arguments in web documents, the participants were not restricted to any particular system; they could use their own technology or existing argument taggers such as MARGOT [30]. To lower the entry barrier for participants new to argument mining, we offered support for using the neural argument tagger TARGER [13], hosted on our own servers and accessible via an API.

Table 3. Example topic for Task 2: Argument Retrieval for Comparative Questions.

5.2 Data Description

Topics. For the second task edition, we manually selected 50 new comparative questions from the MS MARCO dataset [36] (questions from Bing’s search logs) and the Quora dataset [22] (questions asked on the Quora question answering website). We ensured that the questions cover diverse topics, for example, electronics, cooking, household appliances, and life choices. Table 3 shows an example topic for Task 2, consisting of a title (i.e., a comparative question), a description of the possible search context and situation, and a narrative describing what makes a retrieved result relevant (meant as a guideline for the human assessors). We manually ensured that relevant documents for each topic were actually contained in the ClueWeb12 (i.e., we avoided questions on comparison options not yet known at the ClueWeb12 crawling time in 2012).

Document Collection. The retrieval corpus was formed by the ClueWeb12 collection that contains 733 million English web pages (27.3 TB uncompressed) crawled by the Language Technologies Institute at Carnegie Mellon University between February and May 2012. For participants of Task 2 who could not index the ClueWeb12 on their side, we provided access to the indexed corpus through the BM25F-based search engine ChatNoir [6] via its API.
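
A request against the ChatNoir API might look roughly as follows; the endpoint, parameter names, and response fields reflect the publicly documented API as far as known here and should be verified against the current reference, and the API key is a placeholder.

```python
import requests

# Endpoint and parameters as documented for the ChatNoir API; verify against
# the current API reference before use.
response = requests.post(
    "https://www.chatnoir.eu/api/v1/_search",
    json={
        "apikey": "<your-api-key>",   # placeholder
        "query": "What is better, a laptop or a desktop?",
        "index": ["cw12"],            # ClueWeb12
        "size": 100,
    },
)
for hit in response.json().get("results", []):
    print(hit.get("uuid"), hit.get("score"), hit.get("title"))
```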

5.3 Submitted Approaches

For Task 2, six teams submitted approaches, all of which used ChatNoir for the initial document retrieval. Most teams then applied document “preprocessing” to the ChatNoir results (e.g., removing HTML markup) and re-ranked them with feature-based or neural classifiers trained on last year’s judgments. Commonly used techniques further included (1) query processing (e.g., lemmatization and POS tagging), (2) query expansion (e.g., with synonyms from WordNet [17] or with word2vec [33] or sense2vec [47] embeddings), and (3) calculating argumentativeness, credibility, or comparativeness scores used as features in the re-ranking. To predict document relevance labels, the teams used a random forest classifier, XGBoost [12], LightGBM [25], or a fine-tuned BERT [14].

5.4 Task Evaluation

Using the CopyCat framework [18], we found that on average 11.6% of the documents in the top-5 results of a run were near-duplicates—a non-negligible redundancy that might have negatively impacted the reliability and validity of an evaluation, since rankings containing multiple relevant duplicates tend to overestimate the actual retrieval effectiveness [19, 20]. Following the strategy used in Task 1, we pooled the top-5 documents from the original and the deduplicated runs, resulting in 2,076 unique documents that needed to be judged.

Our eight volunteer annotators (the same as for Task 1) labeled each document for its topical relevance (three labels; 0: not relevant, 1: relevant, and 2: highly relevant) and for whether it contains rhetorically well-written arguments [50] (three labels; 0: low quality or no arguments in the document, 1: sufficient quality, and 2: high quality). As for Task 1, the annotators went through an initial \(\kappa \)-test on 15 documents from three topics (five documents per topic). The observed Fleiss’ \(\kappa \) values of 0.46 for relevance (moderate agreement) and 0.22 for quality (fair agreement) are similar to previous studies [21, 49, 50]. Again, however, we had a final discussion with all annotators to clarify potential misinterpretations. Afterwards, each annotator independently judged the results for disjoint subsets of the topics (i.e., each topic was judged by one annotator only).

Table 4. Results for Task 2: Argument Retrieval for Comparative Questions. The left part (a) shows the evaluation results of a team’s best run according to the results’ relevance, while the right part (b) shows the best runs according to the results’ quality. An asterisk (\(^\star \)) indicates that the runs with the best relevance and the best quality differ for a team. The baseline ChatNoir ranking is shown in bold.

5.5 Task Results

The results of the runs with the best nDCG@5 scores per participating team are reported in Table 4. Below, we briefly summarize the best configurations of the teams. A more comprehensive discussion including all teams’ approaches will be part of the forthcoming extended lab overview [9].

Team Katana re-ranked the top-100 ChatNoir results using an XGBoost [12] approach (the overall most effective run in terms of relevance) or a LightGBM [25] approach (the team’s best run in terms of quality), respectively. Both approaches were trained on judgments from Touché 2020, employing relevance features (e.g., the ChatNoir relevance score) and “comparativeness” features (e.g., the number of identified comparison objects, aspects, or predicates [11]).

Team Thor re-ranked the top-110 ChatNoir results by locally creating an Elasticsearch BM25F index (fields: original and lemmatized document titles, bodies, and argument units (premises and claims) as identified by TARGER; BM25 parameters b and \(k_1\) optimized on the Touché 2020 judgments). This new index was then queried with the topic title expanded by WordNet synonyms [17].

Team Rayla re-ranked the top-120 ChatNoir results by linearly combining different scores, such as the relevance score, PageRank, and SpamRank (all returned by ChatNoir), and an argument support score (the ratio of argumentative sentences (premises and claims) in a document, identified with their own DistilBERT-based [43] classifier). The weights of the individual scores were optimized in a grid search on the Touché 2020 judgments.
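
The weight optimization for such a linear score combination can be sketched as a simple grid search against last year's judgments; evaluate_ndcg is a hypothetical stand-in for an evaluation routine (e.g., mean nDCG@5), and the grid and feature count are placeholders.

```python
import itertools

def rank_by_weights(doc_scores, weights):
    """doc_scores: doc_id -> tuple of feature scores; returns doc_ids sorted by
    the linear combination of the features with the given weights."""
    combined = {doc_id: sum(w * s for w, s in zip(weights, features))
                for doc_id, features in doc_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)

def grid_search(scores_per_topic, evaluate_ndcg, grid=(0.0, 0.25, 0.5, 0.75, 1.0), n_features=4):
    """Exhaustively try weight combinations and keep the best-performing one."""
    best_weights, best_score = None, float("-inf")
    for weights in itertools.product(grid, repeat=n_features):
        rankings = {topic: rank_by_weights(doc_scores, weights)
                    for topic, doc_scores in scores_per_topic.items()}
        score = evaluate_ndcg(rankings)  # e.g., mean nDCG@5 against the 2020 judgments
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```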

Team Mercutio re-ranked the top-100 ChatNoir results returned for the topic titles expanded with synonyms (word2vec [33] or nouns in GPT-2 [42] extensions when prompted with the topic title). The re-ranking was based on the relative ratio of premises and claims in the documents (as identified by TARGER).

Team Prince Caspian re-ranked the top-40 ChatNoir results using a logistic regression classifier (features: \( tf\cdot idf \)-weighted 1- to 4-grams; training on the Touché 2020 judgments) that predicts the probability of a result being relevant (final ranking by descending probability).
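
This setup corresponds closely to a standard scikit-learn pipeline; the tiny training and candidate sets below are placeholders, and the team's actual preprocessing details may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: document texts with binary relevance labels
# (in practice derived from the Touché 2020 judgments).
train_texts = ["laptops are more portable than desktops", "the weather is nice today"]
train_labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4)),   # tf-idf weighted 1- to 4-grams
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Re-rank candidate documents by the predicted relevance probability.
candidates = ["a desktop offers better performance per dollar", "cats are cute"]
probabilities = model.predict_proba(candidates)[:, 1]
reranked = sorted(zip(candidates, probabilities), key=lambda pair: pair[1], reverse=True)
```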

6 Summary and Outlook

From the 36 teams that registered for the Touché 2021 lab, 27 actively participated by submitting at least one valid run to one of the two shared tasks: (1) argument retrieval for controversial questions, and (2) argument retrieval for comparative questions. Most of the participating teams used the judgments from the lab’s first edition to train feature-based or neural approaches that predict argument quality or that re-rank an initial retrieval result. Overall, many more approaches improved upon the argumentation-agnostic baselines (DirichletLM or BM25) than in the first year, indicating that progress has been made. For a potential next iteration of the Touché lab, we currently plan to enrich the tasks by including further argument quality dimensions in the evaluation, by focusing on the most relevant/argumentative text passages in the retrieval, and by detecting the pro/con stance of the returned results.