
1 Introduction

The requirement to ensure that patients can understand their official, privacy-sensitive health information in their own Electronic Health Records (EHRs) is stipulated by policies and laws [16]. Patients’ better ability to understand their own EHR empowers them to take part in related healthcare decision-making, leading to increased independence from healthcare providers, better healthcare decisions, and decreased healthcare costs [16]. Improving patients’ ability to access and digest this content could mean, for example, paraphrasing the EHR text; enriching it with hyperlinks to term definitions, care guidelines, and further supportive information on patient-friendly and reliable websites; helping patients discover good search queries to retrieve more content; and allowing speech, in addition to text, as a query modality.

Information access conferences have organized evaluation labs on related Electronic Health (eHealth) Information Extraction (IE), Information Management (IM), and Information Retrieval (IR) tasks for almost 20 years. Yet, with rare exceptions, they have targeted healthcare experts’ information needs only [4, 5, 11]. The CLEF eHealth Evaluation-lab and Lab-workshop Series has been organized every year since 2012 as part of the Conference and Labs of the Evaluation Forum (CLEF) [7, 8, 10, 12, 13, 14, 19, 22, 23], with the primary goal of supporting laypersons, and their next-of-kin, in accessing medical information. This year, the lab proposes two tasks: one centered on Information Extraction (identifying and classifying Named Entities in written ultrasonography reports) and one centered on Information Retrieval (Consumer Health Search, CHS).

In this paper, we overview the interest in the CLEF eHealth evaluation lab series to date. We then consider recent advances in IE and IR that inform the CLEF eHealth 2021 IE and IR tasks, and describe these evaluation lab challenge tasks. The paper concludes with a vision for CLEF eHealth beyond 2021.

2 CLEF eHealth in 2012–2020

CLEF and other information access conferences have organized evaluation labs and shared tasks on eHealth IE, IR, and Information Management for approximately two decades. Yet, their primary focus has been on healthcare experts’ information needs, with limited consideration of laypersons’ difficulties in retrieving and digesting credible, topical, and easy-to-understand content in their preferred language to make health-centered decisions [4, 5, 11].

This niche of addressing the health information needs of patients, their families, health scientists, healthcare policy makers, and other laypersons in a range of languages, so that they can make health-centered decisions, began stimulating the annual CLEF eHealth Evaluation-lab and Lab-workshop Series in 2012. Its first workshop took place in 2012 with the aim of organizing an evaluation lab, and from 2013 to 2021 this lab, with up to three shared tasks annually, has preceded each campaign-concluding CLEF eHealth workshop [7, 8, 10, 12, 13, 14, 19, 22, 23].

3 CLEF eHealth 2021 Information Extraction Task

3.1 Preceding Efforts

In 2020, the CodiEsp task of the CLEF eHealth evaluation lab took on the challenge of building publicly available automatic clinical coding systems for Spanish documents, a step towards the application of natural language processing (NLP) technologies in non-English-speaking countries [10]. In contrast to previous clinical coding tasks using death certificates and non-technical summaries of animal experimentation [14, 20, 21], the 2020 task was able to use a collection of clinical case reports from a variety of medical disciplines chosen to constitute a corpus of electronic health records (EHRs; 1,000 documents from the Spanish clinical case reports (SPACCC) corpus). The CodiEsp shared tasks attracted 51 registered teams with diverse backgrounds, from both Spanish-speaking and non-Spanish-speaking countries. CodiEsp thus demonstrated that the language barrier (languages other than English) does not necessarily make such tasks more restrictive, but rather presents an opportunity to adapt well-known techniques to language-specific features. The diversity in profiles led to the development of heterogeneous resources, with 167 novel clinical coding systems developed. Finally, the 2020 task organizers showed that individual task results could be combined, leading to further performance gains.

The 2020 task on Spanish resources was popular enough to set the ground for the 2021 SpRadIE (Spanish Radiology Information Extraction) task, which focuses on further sub-aspects of the Spanish language: text from the radiology domain, namely image reports from a public hospital in South America written under time constraints and therefore containing misspellings and inconsistencies, as elaborated in the next subsection. These particularities pose an interesting challenge of domain and register adaptation for systems trained on general Spanish eHealth text when applied to this specific setting. With this objective, we are calling for submissions from hospitals and private companies to supplement academic participants.

3.2 The Task in 2021: Multilingual Information Extraction

In 2021, the SpRadIE task will target Named Entity Recognition and Classification in the domain of radiological image reports, more concretely, pediatric ultrasonography reports. These reports are written under time pressure in a public Argentinean hospital. They tend to be repetitive, probably due to extensive use of copy and paste. Nevertheless, they are actual free-text reports with no pre-determined structure, which results in great variation in size and content. No element is mandatory in a report except the age of the patient. The reports also contain misspellings and inconsistencies in the use of abbreviations, punctuation, and line breaks.

The corpus consists of a total of 513 sonography reports, with over 17,000 annotated named entities and some class imbalance (the smallest class is a sixth of the size of the majority class). Reports were manually annotated by clinical experts and then revised by linguists; annotation guidelines and training were provided for both rounds of annotation. Inter-annotator (dis)agreement, detailed for each type of entity, will be used to better assess the performance of automatic annotators: they will be expected to perform well in cases where human annotators agree strongly, and worse in cases that are difficult for human annotators to identify consistently.
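
As an illustration of how per-entity-type agreement can be quantified, the following is a minimal sketch (not the task's official agreement computation) that treats one annotator as the reference and the other as the prediction and reports exact-match pairwise F1 per entity type; the tuple layout used for the annotations is a hypothetical representation.

```python
from collections import defaultdict

def pairwise_f1_by_type(annotations_a, annotations_b):
    """Exact-match pairwise F1 agreement per entity type.

    Each annotation is a hypothetical (doc_id, entity_type, start, end)
    tuple; annotator A is treated as the reference, annotator B as the
    prediction.
    """
    spans = defaultdict(lambda: {"a": set(), "b": set()})
    for doc_id, etype, start, end in annotations_a:
        spans[etype]["a"].add((doc_id, start, end))
    for doc_id, etype, start, end in annotations_b:
        spans[etype]["b"].add((doc_id, start, end))

    scores = {}
    for etype, s in spans.items():
        tp = len(s["a"] & s["b"])                      # spans marked by both annotators
        precision = tp / len(s["b"]) if s["b"] else 0.0
        recall = tp / len(s["a"]) if s["a"] else 0.0
        scores[etype] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores
```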

The following classes of entities are distinguished: Finding, Anatomical Entity, Location, Measure, Degree, Type of Measure, and Abbreviation. Hedges are also identified, distinguishing Negation, Uncertainty, Condition, and Conditional Temporal. Entities can be embedded within other entities of different types. Moreover, entities can be discontinuous and can span sentence boundaries. The entity type Finding is particularly challenging, as it presents great variability in its textual forms: it ranges from a single word to more than ten words in some cases and comprises all kinds of phrases. However, it is also the most informative type of entity for the potential users of these annotations. Other challenging phenomena are the regular polysemy observed between Anatomical Entities and Locations, and the irregular use of Abbreviations. In the manual annotation process, we found that human annotators differ more on these categories than on the others, so we expect automatic annotators will also have difficulty classifying them consistently.
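
For illustration only, a discontinuous Finding co-occurring with an Anatomical Entity and a Location could be encoded in a brat-style standoff representation roughly as follows; the sentence, character offsets, and type spellings are hypothetical, and the actual distribution format of the corpus may differ.

```
Text: Se observa imagen quística en riñón derecho de aspecto simple.

T1	Finding 11 26;44 61	imagen quística de aspecto simple
T2	Anatomical_Entity 30 35	riñón
T3	Location 36 43	derecho
```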

For the SpRadIE 2021 task, submissions will be evaluated with different metrics, including exact and lenient match. The lenient evaluation will be carried out using a Jaccard index, similar to the one used in the BioNLP 2013 shared task [1]:

$$J_{(ref,\,pred)} = \frac{\mathrm{overlap}_{(ref,\,pred)}}{\mathrm{length}_{ref} + \mathrm{length}_{pred} - \mathrm{overlap}_{(ref,\,pred)}}$$

It takes into account the lengths (in character offsets) of the annotated reference concept and the predicted concept, as well as the overlap between them. The index equals 1 in the case of a perfect match and 0 if there is no overlap between reference and prediction.
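
A minimal sketch of this lenient match over character offsets is given below; it assumes that each entity is given as a list of half-open (start, end) fragments (so discontinuous entities are handled naturally) and is not the official evaluation script.

```python
def covered_chars(spans):
    """Set of character positions covered by (start, end) fragments."""
    return {i for start, end in spans for i in range(start, end)}

def jaccard(ref_spans, pred_spans):
    """Lenient (Jaccard) match between a reference and a predicted entity."""
    ref, pred = covered_chars(ref_spans), covered_chars(pred_spans)
    overlap = len(ref & pred)
    union = len(ref) + len(pred) - overlap
    return overlap / union if union else 0.0

# A prediction covering only part of the reference span:
# reference offsets (120, 140), prediction (120, 132) -> 12 / 20 = 0.6
print(jaccard([(120, 140)], [(120, 132)]))
```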

The official evaluation measures for the task are the Slot Error Rate (SER) [15], with the Jaccard index as the primary criterion for entity matching, and F1 for the classification of matching entities within each entity type.
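
For reference, SER is usually defined along the following lines (a restatement of the standard definition from the literature; how Jaccard-scored partial matches are weighted within it is determined by the official evaluation script):

$$\mathrm{SER} = \frac{S + I + D}{N}$$

where $S$ is the number of substitutions (entities found with an incorrect type or boundary), $I$ the number of insertions (spurious entities), $D$ the number of deletions (missed entities), and $N$ the total number of entities in the reference.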

4 CLEF eHealth 2021 Information Retrieval Task

4.1 Preceding Efforts

In 2020, the CHS task of CLEF eHealth was an extension of the 2018 task. The use case was similar to previous years: helping patients and their next-of-kin find relevant health information online. The topics were extracted from query logs of the Health on the Net website and were representative of real information needs. The organizers oversaw the generation of spoken queries for these topics and the transcription of these spoken queries. Participants could submit their runs to two subtasks: an ad-hoc IR subtask using the textual queries, and a spoken IR subtask using the spoken queries or their transcriptions. In each subtask, the effectiveness of the participants’ systems was evaluated along three dimensions of relevance: topical relevance, understandability, and credibility. Three teams took part in the challenge, and all of them submitted runs to both subtasks. However, none of them adapted their IR models to each subtask; only the input query changed (textual query or transcription). This tendency was also observed in the previous multilingual tasks (running from 2014 until 2018), where only a few teams went further than adding a translation layer before the IR pipeline. Given the workload necessary to record and transcribe the topics, the organizers have decided not to continue this task, which failed to bring together several communities and, in the end, did not really address the challenge of varying the input type for IR models.

A constant effort has been made in the task since 2014 to integrate relevance dimensions. This has led to many interesting publications on adapting IR models to these dimensions, as well as on the evaluation framework itself. Since 2020, the credibility dimension has also been considered. Integrating a dimension that is, in itself, already challenging to define, assess, and measure has led to a variety of interesting and exciting research questions. The 2021 CHS tasks reflect these new challenges.

4.2 The Task in 2021: Consumer Health Search

The 2018 CLEF eHealth CHS document collection will be used in the 2021 IR task. This collection consists of Web pages acquired from Common Crawl, augmented with additional pages collected from a number of known reliable health websites and other known unreliable health websites [9]. The topics for 2021 are manually created by medical professionals from realistic scenarios. Participants in the 2021 task are challenged with retrieving the relevant documents from the provided document collection. A number of distinct subtasks can be completed using the considered queries and the provided labeled dataset: ad-hoc search, credibility assessment, and personalized search based on multi-dimensional relevance assessment.

As in the 2020 IR task, the pool of documents to be assessed will be labelled with respect to three relevance dimensions: topicality, understandability, and credibility. The assessment guidelines will follow up on the 2020 guidelines: assessors will be asked to judge whether a document is on the same topic as the query, how readable/understandable it is to a layperson, and how credible it is. Credibility was introduced in the 2020 IR task. When assessing the credibility of online information, we consider credibility as an objective characteristic of an information item (it is either true, false, or partially true/false) [25], which is subjectively perceived by individuals [18]. Hence, the assessors are required to consider distinct aspects related to [24]: the source that disseminates the information (e.g., its trustworthiness [3]), characteristics of the message itself (e.g., syntactic, semantic, and stylistic aspects [17]), and social aspects if the information is disseminated through a virtual community (e.g., being part of an echo chamber [2]).

The official evaluation measures include classic IR measures such as Binary Preference, Mean Reciprocal Rank, and Normalized Discounted Cumulative Gain at ranks 1–10, which capture how well systems retrieve relevant documents at low ranks (in line with the CHS use case). In order to measure how well systems can adapt the retrieved content to the consumer’s knowledge, understandability-biased and credibility-biased Rank-biased Precision will also be considered as official metrics. For the credibility assessment subtask, measures such as Accuracy and the F-measure will be used to establish how well systems classify information as credible or not.
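
As a rough illustration of these measures, the sketch below computes nDCG@10, RBP, and a dimension-weighted RBP in the spirit of the understandability- and credibility-biased variants; the persistence parameter p = 0.8, the label scales, and the weighting scheme are assumptions rather than the official evaluation configuration.

```python
import math

def ndcg_at_k(gains, k=10):
    """Normalized Discounted Cumulative Gain at rank k.
    `gains` are graded topical relevance labels in ranked order."""
    def dcg(g):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def rbp(rels, p=0.8):
    """Rank-biased Precision with persistence p (binary relevance)."""
    return (1 - p) * sum(rel * p ** rank for rank, rel in enumerate(rels))

def dimension_weighted_rbp(rels, dim_gains, p=0.8):
    """RBP where each relevant document is further weighted by a second
    relevance dimension in [0, 1] (e.g., understandability or credibility);
    an illustrative stand-in for the biased RBP variants."""
    return (1 - p) * sum(rel * g * p ** rank
                         for rank, (rel, g) in enumerate(zip(rels, dim_gains)))

# Toy ranked list: topical labels (0-2), derived binary relevance,
# and hypothetical understandability scores in [0, 1].
topical = [2, 0, 1, 2, 0, 0, 1, 0, 0, 0]
binary = [1 if g > 0 else 0 for g in topical]
underst = [0.9, 0.2, 0.5, 0.8, 0.1, 0.3, 0.7, 0.2, 0.4, 0.6]
print(ndcg_at_k(topical), rbp(binary), dimension_weighted_rbp(binary, underst))
```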

5 A Vision for CLEF eHealth Beyond 2021

The general purpose of our lab throughout the years, as its 2021 IE and IR tasks demonstrate, has been to assist laypeople in finding and understanding health information in order to make informed decisions. Breaking language barriers has been our priority over the years, and this will continue in our multilingual tasks. Each year of the lab has enabled the identification of difficulties and challenges in IE, IM, and IR, which have shaped our tasks. For example, our IR tasks have considered multilingual, contextualized, and spoken queries, as well as query variants. However, further exploration of query construction and search scenario definition, aimed at a better understanding and management of CHS, is still needed. The task will also further explore relevance dimensions and work toward better assessment of understandability and credibility, as well as methods to take these dimensions into consideration. Moreover, by better defining the search scenarios and the topics, and by considering document relevance in all its aspects, the task will progress towards personalized and effective health search engines. As lab organizers, our purpose is to increase the impact and value of the resources, methods, and community built by CLEF eHealth. Examining the quality and stability of the lab contributions will help the CLEF eHealth series to better understand where it should be improved and how. As future work, we intend to continue our analysis of the influence of the CLEF eHealth evaluation series from the perspectives of publications and data/software releases [6, 20, 21].