
1 Introduction

The current coverage of the political landscape in both the press and social media has led to an unprecedented situation. Like never before, a statement in an interview, a press release, a blog note, or a tweet can spread almost instantaneously across the globe. This speed of proliferation leaves little time for double-checking claims against the facts, which has proven critical in politics. For instance, the 2016 US Presidential Campaign was arguably influenced by fake news in social media and by false claims. Indeed, some politicians were quick to notice that when it comes to shaping public opinion, facts were secondary, and that appealing to emotions and beliefs worked better. It has even been proposed that this marked the dawn of a post-truth age.

As the problem became evident, a number of fact-checking initiatives have started, led by organizations such as FactCheck and Snopes, among many others. Yet, this has proved to be a very demanding manual effort, which means that only a relatively small number of claims could be fact-checked. This makes it important to prioritize the claims that fact-checkers should consider first, and then to help them discover the veracity of those claims.

The CheckThat! Lab at CLEF-2018 aims at helping in that respect, by promoting the development of tools for computational journalism. Figure 1 illustrates the fact-checking pipeline, which includes three steps: (i) check-worthiness estimation, (ii) claim normalization, and (iii) fact-checking. The CheckThat! Lab focuses on the first and the last steps, while taking for granted (and thus excluding) the intermediate claim normalization step.
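To make the pipeline concrete, below is a minimal Python sketch of the three steps with placeholder logic only; the function names and the dummy scorer are illustrative assumptions, not part of any official implementation.

```python
# A minimal sketch of the pipeline in Fig. 1; all logic is placeholder only.
from typing import List, Tuple


def rank_check_worthy(sentences: List[str]) -> List[Tuple[float, str]]:
    """Step 1 (Task 1): rank the sentences by a check-worthiness score."""
    scored = [(len(s) / 100.0, s) for s in sentences]  # dummy scores
    return sorted(scored, reverse=True)


def normalize_claim(sentence: str) -> str:
    """Step 2 (not offered as a task): make the claim self-contained."""
    return sentence  # e.g., resolve "he" to "President Bill Clinton"


def fact_check(claim: str) -> str:
    """Step 3 (Task 2): label the claim as TRUE, HALF-TRUE, or FALSE."""
    return "HALF-TRUE"  # placeholder prediction


if __name__ == "__main__":
    debate = ["Well, he approved NAFTA...", "Thank you, and good evening."]
    for score, sentence in rank_check_worthy(debate):
        print(round(score, 2), fact_check(normalize_claim(sentence)), sentence)
```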

Fig. 1. The general fact-checking pipeline. First, the input document is analyzed to identify sentences containing check-worthy claims, then these claims are extracted and normalized (to be self-contained), and finally they are fact-checked.

Fig. 2. English debate fragments; check-worthy sentences are marked.

Task 1 (Check-Worthiness) aims to help fact-checkers prioritize their efforts. In particular, it asks participants to build systems that can mimic the selection strategies of a particular fact-checking organization: factcheck.org. The task is defined as follows:

Task 1: Given a political debate or a transcribed speech, segmented into sentences, identify which sentences should be prioritized for fact-checking.

Figure 2 shows examples of English debate fragments with annotations for Task 1. In example 2a, Hillary Clinton discusses the performance of her husband Bill Clinton while he was US president. Donald Trump fires back with a claim that is worth fact-checking: that Bill Clinton approved NAFTA. In example 2b, Donald Trump is accused of having filed for bankruptcy six times, which is also worth checking.

Task 1 is a ranking task. The goal is to produce a ranked list of sentences ordered by their worthiness for fact-checking. Each of the identified claims then becomes an input for the next step (after being manually normalized, i.e., edited to be self-contained with no ambiguous or unresolved references).

Task 2 (Fact-Checking) focuses on tools intended to verify the factuality of a check-worthy claim. The task is defined as follows:

Task 2: Given a check-worthy claim in the form of a (normalized) sentence, determine whether the claim is factually true, half-true, or false.

For example, the sentence “Well, he approved NAFTA...” from example 2a is normalized to “President Bill Clinton approved NAFTA.” and the target label is set to HALF-TRUE. Similarly, the sentence “And when we talk about your business, you’ve taken business bankruptcy six times.” from example 2b is normalized to “Donald Trump has filed for bankruptcy of his business six times.” and the target label is set to TRUE.

Task 2 is a classification task. The goal is to label each check-worthy claim with an estimated/predicted veracity. Note that we provide the participants not only with the normalized claim, but also with the original sentence it originated from, which is in turn given in the context of the entire debate/speech. Thus, this is a novel task for fact-checking claims in context, an aspect that has been largely ignored in previous research on fact-checking.
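For concreteness, one plausible way to represent such a Task 2 instance in code is sketched below; the field names are our own illustrative assumptions and do not reproduce the official data format of the lab.

```python
# Illustrative representation of a Task 2 instance; the field names are
# assumptions for this sketch, not the official file format of the lab.
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimInstance:
    claim: str             # normalized, self-contained claim
    source_sentence: str   # original sentence in the transcript
    transcript: List[str]  # the entire debate/speech, providing context
    label: str             # TRUE, HALF-TRUE, or FALSE (gold label at training time)


example = ClaimInstance(
    claim="President Bill Clinton approved NAFTA.",
    source_sentence="Well, he approved NAFTA...",
    transcript=["...", "Well, he approved NAFTA...", "..."],
    label="HALF-TRUE",
)
print(example.claim, "->", example.label)
```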

Note that the intermediate task of claim normalization is challenging and requires dealing with anaphora resolution, paraphrasing, and dialogue analysis, and thus we decided not to offer it as a separate task.

We produced data based on professional fact-checking annotations of debates and speeches from factcheck.org, which we modified in three ways: (i) we made some minor adjustments to which sentences were selected for fact-checking, (ii) we generated normalized versions of the claims in the selected sentences, and (iii) we generated veracity labels for each normalized claim based on the fact-checker’s free-text analysis. As a result, we created CT-C-18, the CheckThat! 2018 corpus, which combines two sub-corpora: CT-CWC-18, for predicting check-worthiness, and CT-FCC-18, for assessing the veracity of claims. We offered each of the two tasks in two languages: English and Arabic. For Arabic, we hired professional translators to translate the English data, and we also had a separate Arabic-only part for Task 2, based on claims from snopes.com.

Nine teams participated in the lab this year. The most successful systems relied on supervised models using a variety of representations. We believe that there is still large room for improvement, and thus we release the corpora, the evaluation scripts, and the participants’ predictions, which should enable further research on check-worthiness estimation and automatic claim verification.

The remainder of the paper is organized as follows. Section 2 presents an overview of related work. Section 3 describes the datasets. Section 4 discusses Task 1 (check-worthiness) in detail, including the evaluation framework and the setup, the approaches used by the participating teams, and the official results. Section 5 provides similar details for Task 2 (fact-checking). Finally, Sect. 6 draws some conclusions.

2 Related Work

Journalists, online users, and researchers are well aware of the proliferation of false information, and topics such as credibility and fact-checking are becoming increasingly important. For example, there was a 2016 special issue of the ACM Transactions on Information Systems journal on Trust and Veracity of Information in Social Media [24], and there was a SemEval-2017 shared task on Rumor Detection [7]. Moreover, there is the ongoing FEVER challenge on Fact Extraction and VERification, with an associated workshop at EMNLP’2018, the present CLEF’2018 Lab on Automatic Identification and Verification of Claims in Political Debates, and an upcoming task at SemEval’2019 on Fact-Checking in Community Question Answering Forums.

Automatic fact-checking was envisioned in [31] as a multi-step process that includes (i) identifying check-worthy statements [9, 14, 16], (ii) generating questions to be asked about these statements [18], (iii) retrieving relevant information to create a knowledge base [29], and (iv) inferring the veracity of the statements, e.g., using text analysis [6, 28] or external sources [18, 27].

The first work to target check-worthiness was the ClaimBuster system [14]. It was trained on data manually annotated by students, professors, and journalists, where each sentence was labeled as non-factual, unimportant factual, or check-worthy factual. The data consisted of transcripts of historical US election debates covering the period from 1960 until 2012, for a total of 30 debates and 28,029 transcribed sentences. In each sentence, the speaker was marked: candidate vs. moderator. ClaimBuster used an SVM classifier and a variety of features such as sentiment, TF.IDF word representations, part-of-speech (POS) tags, and named entities. It produced a check-worthiness ranking based on the SVM prediction scores. The ClaimBuster system did not try to mimic the check-worthiness decisions of any specific fact-checking organization; yet, it was later evaluated against CNN and PolitiFact [15]. In contrast, our dataset is based on actual annotations by a fact-checking organization, and we freely release all data and associated scripts (while theirs are not available).
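As a rough illustration of this kind of SVM-score-based ranking, the sketch below uses only TF.IDF features and toy data; it is a simplified stand-in, not the actual ClaimBuster system, whose feature set is much richer.

```python
# Simplified sketch of SVM-score-based check-worthiness ranking using only
# TF-IDF features; the toy training data below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_sentences = [
    "He approved NAFTA, the worst trade deal ever.",
    "Thank you very much, and good evening.",
    "Unemployment went down to five percent.",
    "Let us move on to the next question.",
]
train_labels = [1, 0, 1, 0]  # 1 = check-worthy factual, 0 = not

vectorizer = TfidfVectorizer()
classifier = LinearSVC().fit(vectorizer.fit_transform(train_sentences), train_labels)

test_sentences = ["You've taken business bankruptcy six times.", "Good evening."]
scores = classifier.decision_function(vectorizer.transform(test_sentences))
ranking = sorted(zip(scores, test_sentences), reverse=True)  # rank by SVM score
```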

More relevant to the setup of Task 1 of this Lab is the work of [9], who focused on debates from the US 2016 Presidential Campaign and used pre-existing annotations from nine respected fact-checking organizations (PolitiFact, FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington Post): a total of four debates and 5,415 sentences. In addition to many of the features borrowed from ClaimBuster, together with sentiment, tense, and some other features, their model pays special attention to the context of each sentence. This includes whether it is part of a long intervention by one of the actors and even its position within such an intervention. The authors predicted both (i) whether any of the fact-checking organizations would select the target sentence, and (ii) whether a specific one would select it.

In follow-up work, [16] developed ClaimRank, which can mimic the claim selection strategies of each of the nine fact-checking organizations individually, as well as of their union. Even though it is trained on English, it further supports Arabic, which is achieved via cross-language English-Arabic embeddings.

The work of [25] also focused on the 2016 US Election campaign, and also used data from nine fact-checking organizations (but a slightly different set from the above). They used general-election debates (three presidential and one vice-presidential) and primary debates (seven Republican and eight Democratic), for a total of 21,700 sentences. Their setup was to predict whether any of the fact-checking sources would select the target sentence. They used a boosting-like model in which SVMs focus on different clusters of the dataset and the final outcome is taken from the most confident classifier. The features considered ranged from LDA topic modeling to POS tuples and bag-of-words representations.

For Task 1, we follow a setup that is similar to that of [9, 16, 25], but we manually verify the selected sentences, e.g., to adjust the boundaries of the check-worthy claim, and also to include all instances of a selected check-worthy claim (as fact-checkers would only comment on one instance of a claim). We further have an Arabic version of the dataset. Finally, we chose to focus on a single fact-checking organization.

Regarding Task 2, which targets fact-checking a claim, there have been several datasets that focus on rumor detection. The gold labels are typically extracted from fact-checking websites such as PolitiFact, with datasets ranging in size from 300 claims for the Emergent dataset [8] to 12.8K claims for the Liar dataset [33]. Another fact-checking source that has been used is snopes.com, with datasets ranging in size from 1k claims [20] to 5k claims [26].

Less popular as a source has been Wikipedia, with datasets ranging in size from 100 claims [26] to 185k claims for the FEVER dataset [30]. These datasets rely on crowdsourced annotations, which allows them to reach large scale, but risks lower quality standards compared to the rigorous annotations by fact-checking organizations. Other crowdsourced efforts include SemEval-2017’s shared task on Rumor Detection [7], with 5.5k annotated rumorous tweets, and CREDBANK, with 60M annotated tweets [22]. Finally, there have been manual annotation efforts, e.g., for fact-checking the answers in a community question answering forum, with a size of 250 claims [21]. Note that while most datasets target English, there have also been efforts focusing on other languages, e.g., Chinese [20], Arabic [3], and Bulgarian [13].

Unlike the above work, our focus in Task 2 is on claims in both their normalized and unnormalized form and in the context of a political debate or speech.

3 Corpora

We produced the CT-C-18 corpus, which stands for CheckThat! 2018 corpus. It is composed of CT-CWC-18 (check-worthiness corpus) and CT-FCC-18 (fact-checking corpus). CT-C-18 includes transcripts from debates, together with political speeches, and isolated claims. Table 1 gives an overview.

The training sets for both tasks come from the first and the second Presidential debates and the Vice-Presidential debate in the 2016 US campaign. The labels for both tasks were derived from the manual fact-checking analysis published on factcheck.org. For Task 1, a claim was considered check-worthy if a journalist had fact-checked it. For Task 2, a judgment was generated based on the free-text discussion by the fact-checking journalists: true, half-true, or false. We followed the same procedure for the texts in the test set: two other debates and five speeches by Donald Trump, which occurred after he took office as US President. Note that there are cases in which the number of claims intended for predicting factuality is lower than the reported number of check-worthy claims. The reason is that some claims were formulated more than once in the debates and speeches; whereas we consider all such instances as positive for Task 1, we consider each claim only once for Task 2.

Table 1. Overview of the debates, speeches, and isolated claims in the CT-C-18 corpus. It includes the number of utterances, the number identified as check-worthy (Task 1), and the number of claims identified as factually true, half-true, and false. The debates/speeches that are available in Arabic are marked in the table. Note that the claims from snopes.com were released in Arabic only and are marked as such.

The Arabic version of the corpus was produced manually by professional translators, who translated some of the English debates/speeches into Arabic, as shown in Table 1. We used this strategy for all three training debates, for the two testing debates, and for one of the five speeches used for testing. In order to balance the number of examples for Task 2, we included fresh Arabic-only instances by selecting 150 claims from snopes.com that were related to the Arab world or to Islam. As the language of snopes.com is English, we translated these claims into Arabic as well, this time using Google Translate, and then some of the task organizers (native Arabic speakers) post-edited the result in order to produce proper Arabic versions. Further details about the construction of the CT-CWC-18 and the CT-FCC-18 corpora can be found in [2, 4].

Table 2. Task 1 (check-worthiness): overview of the learning models and of the representations used by the participants.

4 Task 1: Check-Worthiness

4.1 Evaluation Measures

As we cast this task as an information retrieval problem, in which check-worthy instances should be ranked at the top of the list, we opted for mean average precision (MAP) as the official evaluation measure. It is defined as follows:

$$\begin{aligned} MAP = \frac{\sum _{d=1}^D AveP(d)}{D} \end{aligned}$$
(1)

where D is the number of debates/speeches in the test set, d indexes them, and AveP(d) is the average precision for debate/speech d:

$$\begin{aligned} AveP = \frac{\sum _{k=1}^K (P(k)\times \delta (k))}{\# {\text {check-worthy claims}}} \end{aligned}$$
(2)

where K is the number of sentences in the ranked list, P(k) is the precision at rank k, and \(\delta (k)=1\) if and only if the claim at position k is check-worthy (and 0 otherwise).

Following [9], we further report the results for some other measures: (i) mean reciprocal rank (MRR), (ii) mean R-Precision (MR-P), and (iii) mean precision@k (P@k). Here mean refers to macro-averaging over the testing debates/speeches.
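For illustration, the sketch below computes average precision, reciprocal rank, precision@k, and their macro-averages over debates/speeches from ranked binary labels; it is not the official evaluation script.

```python
# Illustrative computation of the ranking measures for one debate/speech whose
# sentences are already ordered by the system (1 = check-worthy, 0 = not).
from typing import List


def average_precision(ranked_labels: List[int]) -> float:
    hits, precision_sum = 0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:                 # delta(k) = 1
            hits += 1
            precision_sum += hits / k  # P(k)
    return precision_sum / max(1, sum(ranked_labels))


def reciprocal_rank(ranked_labels: List[int]) -> float:
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / k
    return 0.0


def precision_at_k(ranked_labels: List[int], k: int) -> float:
    return sum(ranked_labels[:k]) / k


# The "mean" variants macro-average over the test debates/speeches:
debates = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]   # toy rankings
map_score = sum(average_precision(d) for d in debates) / len(debates)
mrr_score = sum(reciprocal_rank(d) for d in debates) / len(debates)
```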

4.2 Evaluation Results

The participants were allowed to submit one primary and up to two contrastive runs in order to test variations or alternative models. For ranking purposes, only the primary submissions were considered. A total of seven teams submitted runs for English, and two of them also did so for Arabic.

English. Table 4 shows the results for English. The best primary submission was that of the Prise de Fer team [35], which used a multilayer perceptron and a feature-rich representation. We can see that they had the best overall performance not only on the official MAP measure, but also on six out of nine evaluation measures (and they were 2nd or 3rd on the rest).

Interestingly, the top-performing run for English was an unofficial one, namely the contrastive 1 run by the Copenhagen team [12]. This model consisted of a recurrent neural network over three representations. As their primary submission, they submitted a system that combined their neural network with the model of [9], but their neural network alone (submitted as contrastive 1) performed better on the test set. This may be because the model of [9] relies on structural information, which was not available for the speeches included in the test set.

To put these results in perspective, the bottom of Table 4 shows the results for two baselines: (i) a random permutation of the input sentences, and (ii) an n-gram based classifier. We can see that all systems managed to outperform the random baseline on all measures by a margin. However, only two runs managed to beat the n-gram baseline: the primary run of the Prise de Fer team, and the contrastive 1 run of the Copenhagen team.
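For reference, the two baselines can be approximated roughly as in the sketch below; the exact configuration of the official n-gram baseline (n-gram range, weighting scheme, and classifier) is an assumption here.

```python
# Rough approximation of the two baselines; the official n-gram baseline's
# exact configuration is assumed, and the toy data is invented for illustration.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_sents = [
    "He approved NAFTA, the worst trade deal ever.",
    "Thank you very much, and good evening.",
    "Unemployment went down to five percent.",
    "Let us move on to the next question.",
]
train_y = [1, 0, 1, 0]
test_sents = ["You've taken business bankruptcy six times.", "Good evening."]

# Baseline 1: a random permutation of the input sentences.
random_ranking = random.sample(test_sents, k=len(test_sents))

# Baseline 2: an n-gram classifier whose probabilities induce the ranking.
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.fit_transform(train_sents), train_y)
scores = clf.predict_proba(vec.transform(test_sents))[:, 1]
ngram_ranking = sorted(zip(scores, test_sents), reverse=True)
```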

Arabic. Only two teams participated in the Arabic task [11, 34], using basically the same models they had for English. The bigIR team [34] automatically translated the test input into English and then ran their English system, whereas UPV–INAOE–Autoritas translated into Arabic the English lexicons on which their representation was based, trained an Arabic system on the Arabic training data, and finally ran it on the Arabic test input. It is worth noting that for English UPV–INAOE–Autoritas outperformed bigIR, but for Arabic it was the other way around. We suspect that a possible reason is the direction of machine translation, together with the presence or absence of context: translation into English tends to be better than into Arabic, and translating full sentences is easier because context is available, whereas such context is missing when translating lexicon entries in isolation.

Finally, similarly to English, all runs managed to outperform the random baseline by a margin, while the n-gram baseline was strong yet possible to beat.

5 Task 2: Factuality

5.1 Evaluation Measures

In Task 2 (factuality), the claims have to be labeled as true, half-true, or false. Note that, unlike in standard multi-way classification problems, here there is a natural ordering between the classes, and confusing one extreme class with the other is more harmful than confusing it with the neighboring class. This is known as an ordinal classification problem (aka ordinal regression), and it requires an evaluation measure that takes this ordering into account. Therefore, we opted for mean absolute error (MAE), which is standard for such problems, as the official measure. MAE is defined as follows:

$$\begin{aligned} MAE = \frac{\sum _{c=1}^C |y_c-x_c|}{C} \end{aligned}$$
(3)

where \(y_c\) and \(x_c\) are the gold and the predicted labels of claim c, respectively, C is the total number of claims, and \(|y_c-x_c|\) is the absolute difference between the two labels on the ordinal scale: either zero, one, or two.

Following [23], we also compute macro-averaged mean absolute error, accuracy, macro-averaged \(F_1\), and macro-averaged recall.
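As an illustration, MAE and a macro-averaged variant can be computed as sketched below, mapping the labels onto the ordinal scale false=0, half-true=1, true=2; this is a sketch, not the official scorer.

```python
# Sketch of MAE and macro-averaged MAE on the ordinal scale
# false=0, half-true=1, true=2 (not the official scorer).
LABEL_TO_INT = {"FALSE": 0, "HALF-TRUE": 1, "TRUE": 2}


def mae(gold, predicted):
    errors = [abs(LABEL_TO_INT[g] - LABEL_TO_INT[p]) for g, p in zip(gold, predicted)]
    return sum(errors) / len(errors)


def macro_mae(gold, predicted):
    per_class = []
    for label in LABEL_TO_INT:
        errors = [abs(LABEL_TO_INT[g] - LABEL_TO_INT[p])
                  for g, p in zip(gold, predicted) if g == label]
        if errors:                      # ignore classes absent from the gold data
            per_class.append(sum(errors) / len(errors))
    return sum(per_class) / len(per_class)


gold = ["HALF-TRUE", "TRUE", "FALSE", "FALSE"]
pred = ["TRUE", "TRUE", "HALF-TRUE", "FALSE"]
print(mae(gold, pred), macro_mae(gold, pred))
```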

5.2 Evaluation Results

When dealing with the factuality task, participants opted for retrieving evidence from the Web in order to assess the veracity of the claims. After retrieving a number of search engine snippets or full documents, they performed different operations, including calculating similarities or levels of contradiction and stance between the supporting documents and the claim. For example, the Copenhagen team [32] concatenated the representations of the claim and of the document in a neural network. Table 3 gives a brief overview. Refer to [4] and the corresponding participants’ reports for further details.

Table 3. Task 2 (factuality): overview of the learning models and of the representations used by the participants.
Table 4. Task 1 (check-worthiness): English results, ranked based on MAP, the official evaluation measure. The best score per evaluation measure is shown in bold.
Table 5. Task 1 (check-worthiness): Arabic results, ranked based on MAP, the official evaluation measure. The best score per evaluation measure is in bold.
Table 6. Task 2 (factuality): English results, ranked based on MAE, the official evaluation measure. The best score per evaluation measure is in bold.
Table 7. Task 2 (factuality): Arabic results, ranked based on MAE, the official evaluation measure. The best score per evaluation measure is in bold.

Note that the bigIR team [34] tried to identify the relevant fragments in the supporting documents by considering only those with high similarity to the claim. Several approaches [32, 34] are based to some extent on [17]. Only one team, Check it out [19], did not use external supporting documents (Table 5).
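The fragment-selection idea can be illustrated with a generic similarity filter, as sketched below; this is only an illustration of the general approach, not the actual implementation of any participating team.

```python
# Generic sketch of keeping only the fragments of retrieved documents that are
# highly similar to the claim (an illustration of the idea, not a team's system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

claim = "President Bill Clinton approved NAFTA."
snippet_sentences = [
    "NAFTA was signed into law by President Clinton in 1993.",
    "The weather in Washington was cold that week.",
]

vectorizer = TfidfVectorizer().fit([claim] + snippet_sentences)
similarities = cosine_similarity(
    vectorizer.transform([claim]),
    vectorizer.transform(snippet_sentences),
)[0]

THRESHOLD = 0.2  # illustrative value
relevant_fragments = [s for s, sim in zip(snippet_sentences, similarities)
                      if sim >= THRESHOLD]
```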

English. Table 6 shows the results on the English dataset. Overall, the top-performing system is the one by the Copenhagen team [32]. One aspect that might explain the relatively large difference in performance compared to the other teams is the use of additional training material: the Copenhagen team incorporated hundreds of labeled claims from PolitiFact into their training set. Their model combines the claim and the supporting texts to build representations. Their primary submission is an SVM, whereas their contrastive one uses a CNN.

Unfortunately, not much information is available about team FACTR, as no paper was submitted to describe their model. They used an approach similar to that of most other teams: converting the claim into a query for a search engine, computing stance, sentiment, and other features over the supporting documents, and using them in a supervised model.

Arabic. Table 7 shows the results of the two teams that participated in the Arabic task. To deal with the Arabic input, FACTR translated all the claims into English and performed the rest of the process in that language. In contrast, UPV–INAOE–Autoritas [10] translated the claims into English only in order to query the search engines, and then translated the retrieved evidence into Arabic in order to keep working in that language. Perhaps the noise introduced by using two imperfect translation steps caused their performance to decrease (the performance of the two teams on the English task was much closer).

Overall, the performance of the models in Arabic is better than in English. The reason is that the isolated claims from snopes.com—which were released only in Arabic (cf. Table 1)—were easier to verify.

6 Conclusions and Future Work

We have presented an overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. Task 1 asked to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 asked to assess whether a claim made by a politician is factually true, half-true, or false. We offered both tasks in English and Arabic, relying on comments and factuality judgments from factcheck.org and snopes.com, which we further refined to obtain the gold standard, and on translation to produce the Arabic versions of the corpus. A total of 30 teams registered to participate in the lab, and 9 of them actually submitted runs. The evaluation results showed that the most successful approaches used various neural networks (especially for Task 1) and evidence retrieved from the Web (especially for Task 2). The corpora and the evaluation scripts we have released as a result of this lab should enable further research in check-worthiness estimation and in automatic claim verification.

In future iterations of the lab, we plan to add more debates and speeches, both annotated and unannotated, which would enable semi-supervised learning. We further want to add annotations for the same debates/speeches from different fact-checking organizations, which would allow using multi-task learning [9].