
1 Introduction

The current coverage of the political landscape in both the press and social media has led to an unprecedented situation. Like never before, a statement in an interview, a press release, a blog note, or a tweet can spread almost instantaneously across the globe. This speed of proliferation leaves little time for double-checking claims against the facts, which has proven critical in politics. For instance, the 2016 US Presidential Campaign was arguably influenced by fake news in social media and by false claims. Indeed, some politicians were quick to notice that when it comes to shaping public opinion, facts were secondary, and that appealing to emotions and beliefs worked better. It has even been proposed that this marked the dawn of a post-truth age.

As the problem became evident, a number of fact-checking initiatives have started, led by organizations such as FactCheck and Snopes, among many others. Yet, this has proved to be a very demanding manual effort, which means that only a relatively small number of claims could be fact-checked. This makes it important to prioritize the claims that fact-checkers should consider first, and then to help them discover the veracity of those claims.

The CheckThat! Lab at CLEF-2018 aims at helping in that respect, by promoting the development of tools for computational journalism. Figure 1 illustrates the fact-checking pipeline, which includes three steps: (i) check-worthiness estimation, (ii) claim normalization, and (iii) fact-checking. The CheckThat! Lab focuses on the first and the last steps, while taking for granted (and thus excluding) the intermediate claim normalization step.
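To make the pipeline concrete, below is a minimal Python sketch of the three steps with placeholder logic only; the function names and the dummy scorer are illustrative assumptions, not part of any official implementation.

```python
# A minimal sketch of the pipeline in Fig. 1; all logic is placeholder only.
from typing import List, Tuple


def rank_check_worthy(sentences: List[str]) -> List[Tuple[float, str]]:
    """Step 1 (Task 1): rank the sentences by a check-worthiness score."""
    scored = [(len(s) / 100.0, s) for s in sentences]  # dummy scores
    return sorted(scored, reverse=True)


def normalize_claim(sentence: str) -> str:
    """Step 2 (not offered as a task): make the claim self-contained."""
    return sentence  # e.g., resolve "he" to "President Bill Clinton"


def fact_check(claim: str) -> str:
    """Step 3 (Task 2): label the claim as TRUE, HALF-TRUE, or FALSE."""
    return "HALF-TRUE"  # placeholder prediction


if __name__ == "__main__":
    debate = ["Well, he approved NAFTA...", "Thank you, and good evening."]
    for score, sentence in rank_check_worthy(debate):
        print(round(score, 2), fact_check(normalize_claim(sentence)), sentence)
```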

Fig. 1. The general fact-checking pipeline. First, the input document is analyzed to identify sentences containing check-worthy claims, then these claims are extracted and normalized (to be self-contained), and finally they are fact-checked.

Fig. 2. English debate fragments; check-worthy sentences are marked.

Task 1 (Check-Worthiness) aims to help fact-checkers prioritize their efforts. In particular, it asks participants to build systems that can mimic the selection strategies of a particular fact-checking organization: factcheck.org. The task is defined as follows:

Task 1: Given a political debate or a transcribed speech, segmented into sentences, identify which sentences should be prioritized for fact-checking.

Figure 2 shows examples of English debate fragments with annotations for Task 1. In example 2a, Hillary Clinton discusses the performance of her husband Bill Clinton while he was US president. Donald Trump fires back with a claim that is worth fact-checking: that Bill Clinton approved NAFTA. In example 2b, Donald Trump is accused of having filed for bankruptcy six times, which is also worth checking.

Task 1 is a ranking task. The goal is to produce a ranked list of sentences ordered by their worthiness for fact-checking. Each of the identified claims then becomes an input for the next step (after being manually normalized, i.e., edited to be self-contained with no ambiguous or unresolved references).

Task 2 (Fact-Checking) focuses on tools intended to verify the factuality of a check-worthy claim. The task is defined as follows:

Task 2: Given a check-worthy claim in the form of a (normalized) sentence, determine whether the claim is factually true, half-true, or false.

For example, the sentence “Well, he approved NAFTA...” from example 2a is normalized to “President Bill Clinton approved NAFTA.” and the target label is set to HALF-TRUE. Similarly, the sentence “And when we talk about your business, you’ve taken business bankruptcy six times.” from example 2b is normalized to “Donald Trump has filed for bankruptcy of his business six times.” and the target label is set to TRUE.

Task 2 is a classification task. The goal is to label each check-worthy claim with an estimated/predicted veracity. Note that we provide the participants not only with the normalized claim, but also with the original sentence it originated from, which is in turn given in the context of the entire debate/speech. Thus, this is a novel task for fact-checking claims in context, an aspect that has been largely ignored in previous research on fact-checking.
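For concreteness, one plausible way to represent such a Task 2 instance in code is sketched below; the field names are our own illustrative assumptions and do not reproduce the official data format of the lab.

```python
# Illustrative representation of a Task 2 instance; the field names are
# assumptions for this sketch, not the official file format of the lab.
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimInstance:
    claim: str             # normalized, self-contained claim
    source_sentence: str   # original sentence in the transcript
    transcript: List[str]  # the entire debate/speech, providing context
    label: str             # TRUE, HALF-TRUE, or FALSE (gold label at training time)


example = ClaimInstance(
    claim="President Bill Clinton approved NAFTA.",
    source_sentence="Well, he approved NAFTA...",
    transcript=["...", "Well, he approved NAFTA...", "..."],
    label="HALF-TRUE",
)
print(example.claim, "->", example.label)
```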

Note that the intermediate task of claim normalization is challenging and requires dealing with anaphora resolution, paraphrasing, and dialogue analysis, and thus we decided not to offer it as a separate task.

We produced data based on professional fact-checking annotations of debates and speeches from factcheck.org, which we modified in three ways: (i) we made some minor adjustments to which sentences were selected for fact-checking, (ii) we generated normalized versions of the claims in the selected sentences, and (iii) we generated veracity labels for each normalized claim based on the fact-checker’s free-text analysis. As a result, we created CT-C-18, the CheckThat! 2018 corpus, which combines two sub-corpora: CT-CWC-18, for predicting check-worthiness, and CT-FCC-18, for assessing the veracity of claims. We offered each of the two tasks in two languages: English and Arabic. For Arabic, we hired professional translators to translate the English data, and we also had a separate Arabic-only part for Task 2, based on claims from snopes.com.

Nine teams participated in the lab this year. The most successful systems relied on supervised models using a variety of representations. We believe that there is still large room for improvement, and thus we release the corpora, the evaluation scripts, and the participants’ predictions, which should enable further research on check-worthiness estimation and automatic claim verification.

The remainder of the paper is organized as follows. Section 2 presents an overview of related work. Section 3 describes the datasets. Section 4 discusses Task 1 (check-worthiness) in detail, including the evaluation framework and the setup, the approaches used by the participating teams, and the official results. Section 5 provides similar details for Task 2 (fact-checking). Finally, Sect. 6 draws some conclusions.

2 Related Work

Journalists, online users, and researchers are well aware of the proliferation of false information, and topics such as credibility and fact-checking are becoming increasingly important. For example, there was a 2016 special issue of the ACM Transactions on Information Systems journal on Trust and Veracity of Information in Social Media [24], and there was a SemEval-2017 shared task on Rumor Detection [7]. Moreover, there is the ongoing FEVER challenge on Fact Extraction and VERification, with an associated workshop at EMNLP’2018, the present CLEF’2018 Lab on Automatic Identification and Verification of Claims in Political Debates, and an upcoming task at SemEval’2019 on Fact-Checking in Community Question Answering Forums.

Automatic fact-checking was envisioned in [31] as a multi-step process that includes (i) identifying check-worthy statements [9, 14, 16], (ii) generating questions to be asked about these statements [18], (iii) retrieving relevant information to create a knowledge base [29], and (iv) inferring the veracity of the statements, e.g., using text analysis [6, 28] or external sources [18, 27].

The first work to target check-worthiness was the ClaimBuster system [14]. It was trained on data manually annotated by students, professors, and journalists, where each sentence was labeled as non-factual, unimportant factual, or check-worthy factual. The data consisted of transcripts of historical US election debates covering the period from 1960 until 2012, for a total of 30 debates and 28,029 transcribed sentences. In each sentence, the speaker was marked: candidate vs. moderator. ClaimBuster used an SVM classifier and a variety of features such as sentiment, TF.IDF word representations, part-of-speech (POS) tags, and named entities. It produced a check-worthiness ranking based on the SVM prediction scores. The ClaimBuster system did not try to mimic the check-worthiness decisions of any specific fact-checking organization; yet, it was later evaluated against CNN and PolitiFact [15]. In contrast, our dataset is based on actual annotations by a fact-checking organization, and we freely release all data and associated scripts (while theirs are not available).
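As a rough illustration of this kind of SVM-score-based ranking, the sketch below uses only TF.IDF features and toy data; it is a simplified stand-in, not the actual ClaimBuster system, whose feature set is much richer.

```python
# Simplified sketch of SVM-score-based check-worthiness ranking using only
# TF-IDF features; the toy training data below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_sentences = [
    "He approved NAFTA, the worst trade deal ever.",
    "Thank you very much, and good evening.",
    "Unemployment went down to five percent.",
    "Let us move on to the next question.",
]
train_labels = [1, 0, 1, 0]  # 1 = check-worthy factual, 0 = not

vectorizer = TfidfVectorizer()
classifier = LinearSVC().fit(vectorizer.fit_transform(train_sentences), train_labels)

test_sentences = ["You've taken business bankruptcy six times.", "Good evening."]
scores = classifier.decision_function(vectorizer.transform(test_sentences))
ranking = sorted(zip(scores, test_sentences), reverse=True)  # rank by SVM score
```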

More relevant to the setup of Task 1 of this Lab is the work of [9], who focused on debates from the US 2016 Presidential Campaign and used pre-existing annotations from nine respected fact-checking organizations (PolitiFact, FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington Post): a total of four debates and 5,415 sentences. In addition to many of the features borrowed from ClaimBuster, together with sentiment, tense, and some other features, their model pays special attention to the context of each sentence. This includes whether it is part of a long intervention by one of the actors and even its position within such an intervention. The authors predicted both (i) whether any of the fact-checking organizations would select the target sentence, and (ii) whether a specific one would select it.

In follow-up work, [16] developed ClaimRank, which can mimic the claim selection strategies of each of the nine fact-checking organizations individually, as well as of their union. Even though it is trained on English, it further supports Arabic, which is achieved via cross-language English-Arabic embeddings.

The work of [25] also focused on the 2016 US Election campaign, and also used data from nine fact-checking organizations (but a slightly different set from the above). They used general-election debates (three presidential and one vice-presidential) and primary debates (seven Republican and eight Democratic), for a total of 21,700 sentences. Their setup was to predict whether any of the fact-checking sources would select the target sentence. They used a boosting-like model in which SVMs focus on different clusters of the dataset and the final outcome is taken from the most confident classifier. The features considered ranged from LDA topic modeling to POS tuples and bag-of-words representations.

For Task 1, we follow a setup that is similar to that of [9, 16, 25], but we manually verify the selected sentences, e.g., to adjust the boundaries of the check-worthy claim, and also to include all instances of a selected check-worthy claim (as fact-checkers would only comment on one instance of a claim). We further have an Arabic version of the dataset. Finally, we chose to focus on a single fact-checking organization.

Regarding Task 2, which targets fact-checking a claim, there have been several datasets that focus on rumor detection. The gold labels are typically extracted from fact-checking websites such as PolitiFact, with datasets ranging in size from 300 claims for the Emergent dataset [8] to 12.8K claims for the Liar dataset [33]. Another fact-checking source that has been used is snopes.com, with datasets ranging in size from 1k claims [20] to 5k claims [26].

Less popular as a source has been Wikipedia, with datasets ranging in size from 100 claims [26] to 185k claims for the FEVER dataset [30]. These datasets rely on crowdsourced annotations, which allows them to reach large scale, but risks lower quality standards compared to the rigorous annotations by fact-checking organizations. Other crowdsourced efforts include SemEval-2017’s shared task on Rumor Detection [7], with 5.5k annotated rumorous tweets, and CREDBANK, with 60M annotated tweets [22]. Finally, there have been manual annotation efforts, e.g., for fact-checking the answers in a community question answering forum, with a size of 250 claims [21]. Note that while most datasets target English, there have also been efforts focusing on other languages, e.g., Chinese [20], Arabic [3], and Bulgarian [13].

Unlike the above work, our focus in Task 2 is on claims in both their normalized and unnormalized form and in the context of a political debate or speech.

3 Corpora

We produced the CT-C-18 corpus, which stands for CheckThat! 2018 corpus. It is composed of CT-CWC-18 (check-worthiness corpus) and CT-FCC-18 (fact-checking corpus). CT-C-18 includes transcripts from debates, together with political speeches, and isolated claims. Table 1 gives an overview.

The training sets for both tasks come from the first and the second Presidential debates and the Vice-Presidential debate in the 2016 US campaign. The labels for both tasks were derived from the manual fact-checking analysis published on factcheck.org. For Task 1, a claim was considered check-worthy if a journalist had fact-checked it. For Task 2, a judgment was generated based on the free-text discussion by the fact-checking journalists: true, half-true, or false. We followed the same procedure for the texts in the test set: two other debates and five speeches by Donald Trump, which occurred after he took office as US President. Note that there are cases in which the number of claims intended for predicting factuality is lower than the reported number of check-worthy claims. The reason is that some claims were formulated more than once in the debates and speeches; whereas we consider all such instances as positive for Task 1, we consider each claim only once for Task 2.

Table 1. Overview of the debates, speeches, and isolated claims in the CT-C-18 corpus. It includes the number of utterances, the number identified as check-worthy (Task 1), and the number of claims identified as factually true, half-true, and false. The debates/speeches that are available in Arabic are marked in the table. Note that the claims from snopes.com were released in Arabic only and are marked as such.

The Arabic version of the corpus was produced manually by professional translators, who translated some of the English debates/speeches into Arabic, as shown in Table 1. We used this strategy for all three training debates, for the two testing debates, and for one of the five speeches used for testing. In order to balance the number of examples for Task 2, we included fresh Arabic-only instances by selecting 150 claims from snopes.com that were related to the Arab world or to Islam. As the language of snopes.com is English, we translated these claims into Arabic as well, this time using Google Translate, and then some of the task organizers (native Arabic speakers) post-edited the result in order to produce proper Arabic versions. Further details about the construction of the CT-CWC-18 and the CT-FCC-18 corpora can be found in [2, 4].

Table 2. Task 1 (check-worthiness): overview of the learning models and of the representations used by the participants.

4 Task 1: Check-Worthiness

4.1 Evaluation Measures

As we cast this task as an information retrieval problem, in which check-worthy instances should be ranked at the top of the list, we opted for mean average precision (MAP) as the official evaluation measure. It is defined as follows:

$$\begin{aligned} MAP = \frac{\sum _{d=1}^D AveP(d)}{D} \end{aligned}$$
(1)

where D is the number of debates/speeches in the test set, d indexes them, and AveP(d) is the average precision for debate/speech d:

$$\begin{aligned} AveP = \frac{\sum _{k=1}^K (P(k)\times \delta (k))}{\# {\text {check-worthy claims}}} \end{aligned}$$
(2)

where K is the number of sentences in the ranked list, P(k) is the precision at rank k, and \(\delta (k)=1\) if and only if the claim at position k is check-worthy (and 0 otherwise).

Following [9], we further report the results for some other measures: (i) mean reciprocal rank (MRR), (ii) mean R-Precision (MR-P), and (iii) mean precision@k (P@k). Here mean refers to macro-averaging over the testing debates/speeches.
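For illustration, the sketch below computes average precision, reciprocal rank, precision@k, and their macro-averages over debates/speeches from ranked binary labels; it is not the official evaluation script.

```python
# Illustrative computation of the ranking measures for one debate/speech whose
# sentences are already ordered by the system (1 = check-worthy, 0 = not).
from typing import List


def average_precision(ranked_labels: List[int]) -> float:
    hits, precision_sum = 0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:                 # delta(k) = 1
            hits += 1
            precision_sum += hits / k  # P(k)
    return precision_sum / max(1, sum(ranked_labels))


def reciprocal_rank(ranked_labels: List[int]) -> float:
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / k
    return 0.0


def precision_at_k(ranked_labels: List[int], k: int) -> float:
    return sum(ranked_labels[:k]) / k


# The "mean" variants macro-average over the test debates/speeches:
debates = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]   # toy rankings
map_score = sum(average_precision(d) for d in debates) / len(debates)
mrr_score = sum(reciprocal_rank(d) for d in debates) / len(debates)
```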

4.2 Evaluation Results

The participants were allowed to submit one primary and up to two contrastive runs in order to test variations or alternative models. For ranking purposes, only the primary submissions were considered. A total of seven teams submitted runs for English, and two of them also did so for Arabic.

English. Table 4 shows the results for English. The best primary submission was that of the Prise de Fer team [35], which used a multilayer perceptron and a feature-rich representation. We can see that they had the best overall performance not only on the official MAP measure, but also on six out of nine evaluation measures (and they were 2nd or 3rd on the rest).

Interestingly, the top-performing run for English was an unofficial one, namely the contrastive 1 run by the Copenhagen team [12]. This model consisted of a recurrent neural network over three representations. As their primary submission, they submitted a system that combined their neural network with the model of [9], but their neural network alone (submitted as contrastive 1) performed better on the test set. This may be because the model of [9] relies on structural information, which was not available for the speeches included in the test set.

To put these results in perspective, the bottom of Table 4 shows the results for two baselines: (i) a random permutation of the input sentences, and (ii) an n-gram based classifier. We can see that all systems managed to outperform the random baseline on all measures by a margin. However, only two runs managed to beat the n-gram baseline: the primary run of the Prise de Fer team, and the contrastive 1 run of the Copenhagen team.
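For reference, the two baselines can be approximated roughly as in the sketch below; the exact configuration of the official n-gram baseline (n-gram range, weighting scheme, and classifier) is an assumption here.

```python
# Rough approximation of the two baselines; the official n-gram baseline's
# exact configuration is assumed, and the toy data is invented for illustration.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_sents = [
    "He approved NAFTA, the worst trade deal ever.",
    "Thank you very much, and good evening.",
    "Unemployment went down to five percent.",
    "Let us move on to the next question.",
]
train_y = [1, 0, 1, 0]
test_sents = ["You've taken business bankruptcy six times.", "Good evening."]

# Baseline 1: a random permutation of the input sentences.
random_ranking = random.sample(test_sents, k=len(test_sents))

# Baseline 2: an n-gram classifier whose probabilities induce the ranking.
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.fit_transform(train_sents), train_y)
scores = clf.predict_proba(vec.transform(test_sents))[:, 1]
ngram_ranking = sorted(zip(scores, test_sents), reverse=True)
```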

Arabic. Only two teams participated in the Arabic task [11, 34], using basically the same models they had for English. The bigIR team [34] automatically translated the test input into English and then ran their English system, whereas UPV–INAOE–Autoritas translated into Arabic the English lexicons on which their representation was based, trained an Arabic system on the Arabic training data, and finally ran it on the Arabic test input. It is worth noting that for English UPV–INAOE–Autoritas outperformed bigIR, but for Arabic it was the other way around. We suspect that a possible reason is the direction of machine translation, together with the presence or absence of context: translation into English tends to be better than into Arabic, and translating full sentences is easier because context is available, whereas such context is missing when translating lexicon entries in isolation.

Finally, similarly to English, all runs managed to outperform the random baseline by a margin, while the n-gram baseline was strong yet possible to beat.

5 Task 2: Factuality

5.1 Evaluation Measures

In Task 2 (factuality), the claims have to be labeled as true, half-true, or false. Note that, unlike in standard multi-way classification problems, here there is a natural ordering between the classes, and confusing one extreme class with the other is more harmful than confusing it with the neighboring class. This is known as an ordinal classification problem (aka ordinal regression), and it requires an evaluation measure that takes this ordering into account. Therefore, we opted for mean absolute error (MAE), which is standard for such problems, as the official measure. MAE is defined as follows:

$$\begin{aligned} MAE = \frac{\sum _{c=1}^C |y_c-x_c|}{C} \end{aligned}$$
(3)

where \(y_c\) and \(x_c\) are the gold and the predicted labels of claim c, respectively, C is the total number of claims, and \(|y_c-x_c|\) is the absolute difference between the two labels on the ordinal scale: either zero, one, or two.

Following [23], we also compute macro-averaged mean absolute error, accuracy, macro-averaged \(F_1\), and macro-averaged recall.
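As an illustration, MAE and a macro-averaged variant can be computed as sketched below, mapping the labels onto the ordinal scale false=0, half-true=1, true=2; this is a sketch, not the official scorer.

```python
# Sketch of MAE and macro-averaged MAE on the ordinal scale
# false=0, half-true=1, true=2 (not the official scorer).
LABEL_TO_INT = {"FALSE": 0, "HALF-TRUE": 1, "TRUE": 2}


def mae(gold, predicted):
    errors = [abs(LABEL_TO_INT[g] - LABEL_TO_INT[p]) for g, p in zip(gold, predicted)]
    return sum(errors) / len(errors)


def macro_mae(gold, predicted):
    per_class = []
    for label in LABEL_TO_INT:
        errors = [abs(LABEL_TO_INT[g] - LABEL_TO_INT[p])
                  for g, p in zip(gold, predicted) if g == label]
        if errors:                      # ignore classes absent from the gold data
            per_class.append(sum(errors) / len(errors))
    return sum(per_class) / len(per_class)


gold = ["HALF-TRUE", "TRUE", "FALSE", "FALSE"]
pred = ["TRUE", "TRUE", "HALF-TRUE", "FALSE"]
print(mae(gold, pred), macro_mae(gold, pred))
```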

5.2 Evaluation Results

When dealing with the factuality task, participants opted for retrieving evidence from the Web in order to assess the veracity of the claims. After retrieving a number of search engine snippets or full documents, they performed different operations, including calculating similarities or levels of contradiction and stance between the supporting documents and the claim. For example, the Copenhagen team [32] concatenated the representations of the claim and of the document in a neural network. Table 3 gives a brief overview. Refer to [4] and the corresponding participants’ reports for further details.

Table 3. Task 2 (factuality): overview of the learning models and of the representations used by the participants.
Table 4. Task 1 (check-worthiness): English results, ranked based on MAP, the official evaluation measure. The best score per evaluation measure is shown in bold.
Table 5. Task 1 (check-worthiness): Arabic results, ranked based on MAP, the official evaluation measure. The best score per evaluation measure is in bold.
Table 6. Task 2 (factuality): English results, ranked based on MAE, the official evaluation measure. The best score per evaluation measure is in bold.
Table 7. Task 2 (factuality): Arabic results, ranked based on MAE, the official evaluation measure. The best score per evaluation measure is in bold.

Note that the bigIR team [34] tried to identify the relevant fragments in the supporting documents by considering only those with high similarity to the claim. Several approaches [32, 34] are based to some extent on [17]. Only one team, Check it out [19], did not use external supporting documents (Table 5).
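The fragment-selection idea can be illustrated with a generic similarity filter, as sketched below; this is only an illustration of the general approach, not the actual implementation of any participating team.

```python
# Generic sketch of keeping only the fragments of retrieved documents that are
# highly similar to the claim (an illustration of the idea, not a team's system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

claim = "President Bill Clinton approved NAFTA."
snippet_sentences = [
    "NAFTA was signed into law by President Clinton in 1993.",
    "The weather in Washington was cold that week.",
]

vectorizer = TfidfVectorizer().fit([claim] + snippet_sentences)
similarities = cosine_similarity(
    vectorizer.transform([claim]),
    vectorizer.transform(snippet_sentences),
)[0]

THRESHOLD = 0.2  # illustrative value
relevant_fragments = [s for s, sim in zip(snippet_sentences, similarities)
                      if sim >= THRESHOLD]
```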

English. Table 6 shows the results on the English dataset. Overall, the top-performing system is the one by the Copenhagen team [32]. One aspect that might explain the relatively large difference in performance compared to the other teams is the use of additional training material: the Copenhagen team incorporated hundreds of labeled claims from PolitiFact into their training set. Their model combines the claim and the supporting texts to build representations. Their primary submission is an SVM, whereas their contrastive one uses a CNN.

Unfortunately, not much information is available about team FACTR, as no paper was submitted to describe their model. They used an approach similar to that of most other teams: converting the claim into a query for a search engine, computing stance, sentiment, and other features over the supporting documents, and using them in a supervised model.

Arabic. Table 7 shows the results of the two teams that participated in the Arabic task. To deal with the Arabic input, FACTR translated all the claims into English and performed the rest of the process in that language. In contrast, UPV–INAOE–Autoritas [10] translated the claims into English only in order to query the search engines, and then translated the retrieved evidence into Arabic in order to keep working in that language. Perhaps the noise introduced by using two imperfect translation steps caused their performance to decrease (the performance of the two teams on the English task was much closer).

Overall, the performance of the models in Arabic is better than in English. The reason is that the isolated claims from snopes.com—which were released only in Arabic (cf. Table 1)—were easier to verify.

6 Conclusions and Future Work

We have presented an overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. Task 1 asked to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 asked to assess whether a claim made by a politician is factually true, half-true, or false. We offered both tasks in English and Arabic, relying on comments and factuality judgments from factcheck.org and snopes.com, which we further refined to obtain the gold standard, and on translation to produce the Arabic versions of the corpus. A total of 30 teams registered to participate in the lab, and 9 of them actually submitted runs. The evaluation results showed that the most successful approaches used various neural networks (especially for Task 1) and evidence retrieved from the Web (especially for Task 2). The corpora and the evaluation scripts we have released as a result of this lab should enable further research in check-worthiness estimation and in automatic claim verification.

In future iterations of the lab, we plan to add more debates and speeches, both annotated and unannotated, which would enable semi-supervised learning. We further want to add annotations for the same debates/speeches from different fact-checking organizations, which would allow using multi-task learning [9].