1 Introduction

One or two decades ago, getting access to large visual data sets for research was difficult, and open data collections that allowed researchers to compare algorithms were rare. Access to data collections has since become easier, but it is still hard to obtain annotated data with a clear evaluation scenario and strong baselines to compare against. Motivated by this, ImageCLEF has for 16 years been an initiative that aims at evaluating multilingual or language-independent annotation and retrieval of images [5, 21, 23, 25, 39]. The main goal of ImageCLEF is to support the advancement of the field of visual media analysis, classification, annotation, indexing and retrieval. It proposes novel challenges and develops the necessary infrastructure for evaluating visual systems operating in different contexts, providing reusable resources for benchmarking. It is also linked to initiatives such as Evaluation-as-a-Service (EaaS) [17, 18].

Many research groups have participated in these evaluation campaigns over the years, and even more have acquired the datasets for experimentation. The impact of ImageCLEF is also reflected in its substantial scholarly footprint, indicated by the number of publications building on it and the citations they have received [36].

Other evaluation initiatives have a close relation to ImageCLEF. LifeCLEF [22] was formerly an ImageCLEF task, but because it needed to assess technologies for the automated identification and understanding of living organisms using not only images but also video and sound, it is now organised independently of ImageCLEF. Other CLEF labs linked to ImageCLEF, in particular to the medical tasks, are CLEFeHealth [14], which deals with processing methods and resources to enrich difficult-to-understand eHealth text, and BioASQ [4], which targets biomedical semantic indexing and question answering and originated in the Question Answering lab, although it is no longer run as a lab. Due to their medical orientation, their organisation is coordinated in close collaboration with the medical tasks in ImageCLEF. In 2017, ImageCLEF explored synergies with the MediaEval Benchmarking Initiative for Multimedia Evaluation [15], which focuses on exploring the “multi” in multimedia: speech, audio, visual content, tags, users, context. MediaEval was founded in 2008 as VideoCLEF, a track in the CLEF campaign.

This paper presents a general overview of the ImageCLEF 2018 evaluation campaign, which as usual was organised as part of the CLEF labs.

The remainder of the paper is organized as follows. Section 2 gives a general description of the 2018 edition of ImageCLEF, commenting on the overall organisation of and participation in the lab. The following sections are dedicated to the four tasks that were organised this year: Sect. 3 to the Caption Task, Sect. 4 to the Tuberculosis Task, Sect. 5 to the Visual Question Answering Task, and Sect. 6 to the Lifelog Task. For the full details and complete results of the participating teams, the reader should refer to the corresponding task overview papers [7, 11, 19, 20]. The final section concludes the paper with an overall discussion and points towards the challenges ahead and possible new directions for future research.

2 Overview of Tasks and Participation

ImageCLEF 2018 consisted of three main tasks and a pilot task that covered challenges in diverse fields and usage scenarios. In 2017 [21] the proposed challenges were almost all new compared to 2016 [40], the only exception being Caption Prediction, a subtask already offered in 2016 for which no participant had submitted results. After such a big change, the objective for 2018 was to continue most of the tasks from 2017. The only change was that the 2017 Remote Sensing pilot task was replaced by a new one on Visual Question Answering. The 2018 tasks are the following:

  • ImageCLEFcaption: Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines. Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. The task addresses the problem of bio-medical image concept detection and caption prediction from large amounts of training data.

  • ImageCLEFtuberculosis: The main objective of the task is to provide a tuberculosis severity score based on the automatic analysis of lung CT images of patients. Being able to extract this information from the image data alone makes it possible to limit lung washing and laboratory analyses to determine the tuberculosis type and drug resistances. This can lead to quicker decisions on the best treatment strategy, reduced use of antibiotics and lower impact on the patient.

  • ImageCLEFlifelog: An increasingly wide range of personal devices, such as smart phones, video cameras and wearable devices that allow capturing pictures, videos, and audio clips of every moment of life, is becoming available. Considering the huge volume of data created, there is a need for systems that can automatically analyse the data in order to categorize, summarize and also retrieve the information that the user may desire. Hence, this task addresses the problems of lifelog data understanding, summarization and retrieval.

  • ImageCLEF-VQA-Med (pilot task): Visual Question Answering is a new and exciting problem that combines natural language processing and computer vision techniques. With the ongoing drive for improved patient engagement and access to electronic medical records via patient portals, patients can now review structured and unstructured data associated with their healthcare utilization, from lab results and images to text reports. Such access can help them better understand their conditions in line with the details received from their healthcare provider. Given a medical image accompanied by a set of clinically relevant questions, participating systems are tasked with answering the questions based on the visual image content.

In order to participate in the evaluation campaign, the research groups first had to register by following the instructions on the ImageCLEF 2018 web page. To ease the overall management of the campaign, this year the challenge was organized through the crowdAI platform. To get access to the datasets, the participants were required to submit a signed End User Agreement (EUA) form. Table 1 summarizes the participation in ImageCLEF 2018, including the number of registrations (counting only the ones that downloaded the EUA) and the number of signed EUAs, indicated both per task and for the overall lab. The table also shows the number of groups that submitted results (runs) and the ones that submitted a working notes paper describing the techniques used.

The number of registrations can be interpreted as the initial interest of the community in the evaluation. However, it is somewhat misleading, because several persons from the same institution might register, even though in the end they count as a single participating group. The EUA explicitly requires all groups that get access to the data to participate, even though this is not enforced. Unfortunately, the percentage of groups that submit results is often limited. Nevertheless, as observed in studies of scholarly impact [36, 37], the datasets and challenges provided by ImageCLEF often get used in subsequent years, in part by researchers who for some reason (e.g., a lack of time or other priorities) were unable to participate in the original event or did not complete the tasks by the deadlines.

After a decrease in participation in 2016, participation increased again in 2017 and increased further in 2018. The number of signed EUAs is considerably higher, mostly because each task had an independent EUA this time. Also, due to the change to crowdAI, online registration became easier and attracted research groups beyond the usual participants, which made the registration-to-participation ratio lower than in previous years. Nevertheless, in the end, 31 groups participated and 28 working notes papers were submitted, a slight increase with respect to 2017. The following four sections are dedicated to the tasks. Only a short overview of each is reported, including general objectives, a description of the task and dataset, and a short summary of the results.

Table 1. Key figures of participation in ImageCLEF 2018.

3 The Caption Task

This task studies algorithmic approaches to medical image understanding. As a testbed for doing so, teams were tasked with automatically “guessing” fitting keywords or free-text captions that best describe an image from a collection of images published in the biomedical literature.

3.1 Task Setup

Following the structure of the 2017 edition, two subtasks were proposed. The first task, concept detection, aims to extract the main biomedical concepts represented in an image based only on its visual content. These concepts are UMLS (Unified Medical Language System®) Concept Unique Identifiers (CUIs). The second task, caption prediction, aims to compose coherent free-text captions describing the image based only on the visual information. Participants were, of course, allowed to use the UMLS CUIs extracted in the first task to compose captions from individual concepts. Figure 1 shows an example of the information available in the training set: an image accompanied by a set of UMLS CUIs and a free-text caption. Compared to 2017, the data set was modified substantially to respond to some of the difficulties with the task in the past [13].

Fig. 1. Example of an image and the information provided in the training set in the form of the original caption and the extracted UMLS concepts.

3.2 Dataset

The dataset used in this task is derived from figures and their corresponding captions extracted from biomedical articles on PubMed Central® (PMC). This data set was changed considerably compared to the same task in 2017 in order to reduce the diversity of the data and limit the number of compound figures. A subset of clinical figures was automatically obtained from the overall set of 5.8 million PMC figures using a deep multimodal fusion of Convolutional Neural Networks (CNNs), described in [2]. In total, the dataset comprises 232,305 image–caption pairs split into disjoint training (222,305 pairs) and test (10,000 pairs) sets. For the concept detection subtask, concepts present in the caption text were extracted using the QuickUMLS library [30]. After the wide breadth of concepts and image types observed in the 2017 edition of the task, this year’s continuation focused on radiology artifacts, introducing a greater topical focus to the collection.
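As a rough illustration of the caption-to-concept step, the following sketch assumes a local QuickUMLS installation and its Python matcher API (QuickUMLS(path).match(text)); the installation path, caption and returned field names are placeholders to be checked against the library version used, and the organizers' exact matching parameters are not reproduced here.

```python
# Minimal sketch of extracting UMLS CUIs from a figure caption with QuickUMLS.
# Assumes a local QuickUMLS data directory; the path below is hypothetical.
from quickumls import QuickUMLS

matcher = QuickUMLS("/path/to/quickumls_install")  # hypothetical installation path

def caption_to_cuis(caption: str) -> set:
    """Return the set of UMLS CUIs detected in a caption (sketch only)."""
    cuis = set()
    for candidate_group in matcher.match(caption, best_match=True, ignore_syntax=False):
        for candidate in candidate_group:
            cuis.add(candidate["cui"])  # field name as documented by QuickUMLS
    return cuis

print(caption_to_cuis("Axial CT of the chest showing a right pleural effusion."))
```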

3.3 Participating Groups and Submitted Runs

In 2018, 46 groups registered for the caption task, compared with 37 groups in 2017. Eight groups submitted runs, one fewer than in 2017. In total, 28 runs were submitted to the concept detection subtask and 16 to the caption prediction subtask. Although caption prediction may appear to be an extension of concept detection, only two groups participated in both subtasks, and four groups participated only in caption prediction.

3.4 Results

The submitted runs are summarized in Tables 2 and 3 for the two subtasks, respectively. Similar to 2017, two main approaches were used in the concept detection subtask: multi-modal classification and retrieval.

ImageSem [41] was the only group applying a retrieval approach this year, achieving a mean F1 score of 0.0928. They retrieved similar images from the training set and clustered the concepts of those images. The multi-modal classification approach was more popular [27, 28, 38]. The best results were achieved by UA.PT Bioinformatics [27] using a traditional bag-of-visual-words algorithm; they experimented with logistic regression and k-Nearest Neighbors (k-NN) for the classification step. Morgan State University [28] used a deep learning based approach, using both image and text (caption) features of the training set for modeling. However, instead of using the full 220K-image collection, they relied on a subset of 4K images, applying the Keras framework to generate deep learning based features. IPL [38] used an encoder of the ARAE [44] model to create a textual representation for all captions. In addition, the images were mapped to a continuous representation space with a CNN.
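To make the bag-of-visual-words idea concrete, the sketch below builds a visual vocabulary from local descriptors and trains one logistic-regression classifier per concept. It is only illustrative of this family of methods, not the participants' actual pipelines; image paths, label sets and all parameter values are hypothetical placeholders.

```python
# Bag-of-visual-words + per-concept logistic regression (illustrative sketch).
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

orb = cv2.ORB_create(nfeatures=300)

def descriptors(path):
    # Local ORB descriptors of a grayscale image (empty array if none found).
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 32), np.uint8)

train_paths = ["img1.png", "img2.png"]        # placeholder training images
train_cuis = [{"C0040405"}, {"C0032227"}]      # placeholder CUI label sets

# 1) Visual vocabulary: cluster all local descriptors into "visual words".
all_desc = np.vstack([descriptors(p) for p in train_paths]).astype(np.float32)
codebook = MiniBatchKMeans(n_clusters=256, random_state=0).fit(all_desc)

def bovw_histogram(path):
    # L1-normalised histogram of visual-word occurrences for one image.
    desc = descriptors(path).astype(np.float32)
    words = codebook.predict(desc) if len(desc) else []
    hist, _ = np.histogram(words, bins=np.arange(257))
    return hist / max(hist.sum(), 1)

# 2) One binary classifier per UMLS concept (multi-label setting).
X = np.array([bovw_histogram(p) for p in train_paths])
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_cuis)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# 3) Predict the concept set of a test image.
pred = clf.predict(bovw_histogram("test.png").reshape(1, -1))
print(mlb.inverse_transform(pred))
```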

Table 2. Concept detection performance in terms of \(F_1\) scores.
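The measure reported in Table 2 is an F1 score between predicted and ground-truth concept sets, averaged over the test images. A plausible minimal sketch of this computation is shown below; the official evaluation script may differ in preprocessing and tie-handling details.

```python
# Per-image F1 between predicted and reference CUI sets, averaged over images.
def f1(predicted: set, reference: set) -> float:
    if not predicted and not reference:
        return 1.0
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def mean_f1(pred_by_image: dict, gold_by_image: dict) -> float:
    return sum(f1(pred_by_image.get(i, set()), gold)
               for i, gold in gold_by_image.items()) / len(gold_by_image)

# Toy example: one image, one of two reference concepts found -> F1 = 0.667.
print(mean_f1({"img1": {"C0040405"}}, {"img1": {"C0040405", "C0032227"}}))
```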

In the caption prediction subtask, ImageSem [41] achieved the best results using an image retrieval strategy and tuning parameters such as the number of most similar images and the number of candidate concepts. The other four groups used different deep learning approaches in interesting ways, generating captions word by word or in sequences of words. Morgan State University [28] and WHU used long short-term memory (LSTM) networks, while UMass [33] and KU Leuven [32] applied different CNNs.

Table 3. Caption prediction performance in terms of BLEU scores.
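The scores in Table 3 are BLEU-based. A hedged sketch of such a computation with NLTK is given below on invented captions; the official evaluation applies its own preprocessing (e.g., lower-casing and stop-word handling), so this is only an approximation of the metric.

```python
# Illustrative sentence-level BLEU between a predicted and a reference caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "axial ct of the chest showing a right pleural effusion".split()
candidate = "ct of the chest with a pleural effusion".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 4))
```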

After discussions of the 2017 submissions, where groups used external data and possibly included part of the test data, no group augmented the training set in 2018. It is further noticeable that, despite the dataset being less noisy than in 2017, the results achieved in both subtasks were slightly lower than in the previous year.

3.5 Lessons Learned and Next Steps

Interestingly, and despite this year’s focus on radiology modalities, a large number of target concepts was extracted from the training set. Such settings with hundreds of thousands of classes are extremely challenging and fall into the realm of extreme classification. In future editions of the task, we plan to focus on detecting only the most commonly used UMLS concepts and to truncate the concept distribution, in order to shift the intellectual challenge away from extreme or one-shot classification settings, which were never meant to be the key challenge of this task.

The new filtering for selecting images with lower variability and fewer compound figures helped to make the task more realistic, and considering the difficulty of the task the results are actually fairly good.

Most of the techniques used relied on deep learning, but the best results were often obtained with other techniques, such as retrieval approaches and handcrafted features. This may be due to the large number of concepts and the limited amount of training data per concept. As PMC is growing very quickly, it should be easy to find more data for future editions.

4 The Tuberculosis Task

Tuberculosis (TB) remains a persistent threat and a leading cause of death worldwide, with multiple new strains appearing in recent years. Recent studies report a rapid increase of drug-resistant cases [29], in which the TB organisms become resistant to two or more of the standard drugs. One of the most dangerous forms of drug-resistant TB is so-called multi-drug-resistant (MDR) tuberculosis, which is simultaneously resistant to several of the most powerful antibiotics. Recently published reports show statistically significant links between drug resistance and multiple thick-walled caverns [42]. However, the discovered links are not sufficient for a reliable early recognition of MDR TB. Therefore, assessing the feasibility of MDR detection based on Computed Tomography (CT) imaging remains an important but very challenging task. The other tasks proposed in the ImageCLEF 2018 tuberculosis challenge are the automatic classification of TB types and TB severity scoring using CT volumes.

4.1 Task Setup

Three subtasks were proposed in the ImageCLEF 2018 tuberculosis task [11]:

  • Multi-drug resistance detection (MDR subtask);

  • Tuberculosis type classification (TBT subtask);

  • Tuberculosis severity scoring (SVR subtask).

The goal of the MDR subtask is to assess the probability of a TB patient having a resistant form of tuberculosis based on the analysis of a chest CT. Compared to 2017, the dataset for the MDR detection subtask was extended by adding several cases of extensively drug-resistant tuberculosis (XDR TB), a rare and the most severe subtype of MDR TB.

The goal of the TBT subtask is to automatically categorize each TB case into one of the following five types: Infiltrative, Focal, Tuberculoma, Miliary, and Fibro-cavernous. The SVR subtask is dedicated to assessing TB severity based on a single CT image of a patient. The severity score is a cumulative score of TB severity assigned by a medical doctor.

Table 4. Dataset for the MDR subtask.
Table 5. Dataset for the TBT subtask.
Table 6. Dataset for the SVR subtask.

4.2 Dataset

For all three subtasks, 3D CT volumes were provided with an in-plane size of \(512 \times 512\) pixels and a number of slices varying from 50 to 400. All CT images were stored in the NIFTI file format with the .nii.gz file extension (g-zipped .nii files). This file format stores raw voxel intensities in Hounsfield Units (HU) as well as the corresponding image metadata, such as image dimensions, voxel size in physical units and slice thickness. For all patients, automatically extracted lung masks were provided. The details of the lung segmentation used can be found in [9].
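As an illustration, such a volume and its lung mask can be read with the nibabel library as sketched below; the file names are placeholders rather than actual dataset files.

```python
# Reading a NIFTI CT volume and its lung mask (sketch; file names are placeholders).
import nibabel as nib
import numpy as np

ct = nib.load("patient_001.nii.gz")          # 512 x 512 x n_slices, HU values
mask = nib.load("patient_001_mask.nii.gz")   # lung segmentation on the same grid

volume = ct.get_fdata()                      # raw intensities in Hounsfield Units
spacing = ct.header.get_zooms()              # voxel size in physical units (mm)

lung_hu = volume[np.asarray(mask.get_fdata()) > 0]
print(volume.shape, spacing, float(lung_hu.mean()))
```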

Tables 4, 5 and 6 present, for each of the subtasks, the division of the datasets between training and test sets (columns) and the corresponding ground truth labels (rows). The dataset for the MDR subtask was composed of 262 MDR and 233 drug-sensitive (DS) patients, as shown in Table 4. In addition to the CT image data, age and gender were provided for each patient in this subtask. The TBT task contained in total 1,513 CT scans of 994 unique patients, divided as shown in Table 5; patient metadata includes only age. The dataset for the SVR subtask consisted of 279 patients with a TB severity score assigned to each case by medical doctors. The scores were given as numbers from 1 to 5, suitable for a regression task. In addition, for the 2-class prediction task the severity labels were binarized so that scores from 1 to 3 corresponded to “high severity” and 4–5 to “low severity” (see Table 6).

4.3 Participating Groups and Submitted Runs

In the second year of the task, 11 groups from 9 countries submitted at least one run to one of the subtasks: 7 groups participated in the MDR subtask, 8 in the TBT subtask, and 7 in the SVR subtask. Each group could submit up to 10 runs. In total, 39 runs were submitted in the MDR subtask, 39 in the TBT subtask and 36 in the SVR subtask. Deep learning approaches were employed by 8 of the 11 participating groups, based on 2D and 3D Convolutional Neural Networks (CNNs) for both classification and feature extraction, transfer learning and a few other techniques. In addition, one group used texture-based graph models of the lungs, one group used texture-based features combined with conventional classifiers, and one group used features based on image binarization and morphology.

4.4 Results

The MDR subtask is designed as a 2-class problem. The participants submitted, for each patient in the test set, the probability of belonging to the MDR group. The Area Under the ROC Curve (AUC) was chosen as the measure to rank the results; accuracy was provided as well. For the TBT subtask, the participants had to submit the tuberculosis type. Since the 5-class problem was not balanced, Cohen’s Kappa coefficient was used to compare the methods, and again accuracy was provided. Finally, the SVR subtask was considered in two ways: as a regression problem with scores from 1 to 5, and as a 2-class classification problem (low/high severity). The regression problem was evaluated using the Root Mean Square Error (RMSE), and AUC was used to evaluate the classification approaches. Tables 7, 8 and 9 show the final results for each run and its rank.
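The three ranking measures can be computed with scikit-learn as sketched below on toy values; this only illustrates the metrics themselves, not the official evaluation code or any real predictions.

```python
# AUC/accuracy (MDR), Cohen's kappa (TBT) and RMSE/AUC (SVR) on toy values.
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             cohen_kappa_score, mean_squared_error)

# MDR subtask: probability of MDR, ranked by AUC (accuracy also reported).
y_true = [1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.4]
print("AUC", roc_auc_score(y_true, y_prob),
      "ACC", accuracy_score(y_true, [int(p > 0.5) for p in y_prob]))

# TBT subtask: unbalanced 5-class problem compared via Cohen's kappa.
tbt_true = [0, 1, 2, 3, 4]
tbt_pred = [0, 1, 2, 4, 4]
print("Kappa", cohen_kappa_score(tbt_true, tbt_pred))

# SVR subtask: RMSE for the 1-5 regression (AUC for the binarised labels
# would be computed as in the MDR example above).
sev_true = [1, 2, 3, 4, 5]
sev_pred = [1.5, 2.0, 3.5, 3.5, 4.5]
print("RMSE", np.sqrt(mean_squared_error(sev_true, sev_pred)))
```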

Table 7. Results for the MDR subtask.
Table 8. Results for the TBT subtask.
Table 9. Results for the SVR subtask.

4.5 Lessons Learned and Next Steps

Similarly to 2017 [10], all participants achieved a relatively low performance in the MDR subtask, only slightly higher than that of a random classifier. The best accuracy achieved was 0.6144 and the best AUC was 0.6178. These results are better than in the previous year but still remain unsatisfactory for clinical use. The overall increase in performance compared to 2017 may be partly explained by the introduction of patient age and gender, and by the addition of more severe cases with XDR TB. For the TBT subtask, the results are slightly worse than in 2017 in terms of Cohen’s Kappa, with the best run scoring a 0.2312 Kappa value (0.2438 in 2017), and slightly better with respect to the best accuracy of 0.4227 (0.4067 in 2017). It is worth noting that none of the groups achieving the best performance in the 2017 edition participated in 2018. The group with the best results this year (the UIIP group) had obtained a 0.1956 Kappa value and 0.3900 accuracy in the 2017 edition, so this shows a strong improvement, possibly linked to the increased size of the dataset. In the newly introduced SVR subtask, participants demonstrated good performance in both the regression and classification settings. The best regression run achieved an RMSE of 0.7840, which is less than one grade of error in a 5-grade scoring system, and the best classification run reached an AUC of 0.7708. These results are promising considering that TB severity was scored by doctors using not only CT images but also additional clinical data. The good participation also highlights the importance of the task.

5 The VQA-Med Task

5.1 Task Description

Visual Question Answering is a new and exciting problem that combines natural language processing and computer vision techniques. Inspired by the recent success of visual question answering in the general domain [3], we proposed a pilot task focusing on visual question answering in the medical domain (VQA-Med). Given medical images accompanied by clinically relevant questions, participating systems were tasked with answering the questions based on the visual image content. Figure 2 shows a few example images with associated questions and ground truth answers.

5.2 Dataset

We considered medical images along with their captions extracted from PubMed Central articles (essentially a subset of the ImageCLEF 2017 caption prediction task [13]) to create the datasets for the proposed VQA-Med task.

We used a semi-automatic approach to generate question-answer pairs from the captions of the medical images. First, we automatically generated all possible question-answer pairs from the captions using a rule-based question generation (QG) system. The candidate questions generated via this automatic approach contained noise due to rule mismatches with clinical-domain sentences. Therefore, two expert human annotators manually checked all generated question-answer pairs associated with the medical images in two passes. In the first pass, syntactic and semantic correctness were ensured, while in the second pass, well-curated validation and test sets were generated by verifying the clinical relevance of the questions with respect to the associated medical images.
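As a purely illustrative toy example of what a rule-based generation step can look like (the actual QG system referenced above is a separate, far more elaborate tool; the single pattern below is invented for illustration only):

```python
# Toy caption-to-question rule, invented for illustration; not the actual QG system.
import re

def generate_qa(caption: str):
    # Rule: "<modality> ... showing <finding>" -> ("What does the <modality> show?", <finding>)
    match = re.search(r"^(.*?)\s+showing\s+(.*)$", caption, re.IGNORECASE)
    if match:
        subject, finding = match.group(1), match.group(2).rstrip(".")
        return f"What does the {subject.lower()} show?", finding
    return None

print(generate_qa("Axial CT of the chest showing a right pleural effusion."))
# ('What does the axial ct of the chest show?', 'a right pleural effusion')
```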

The final curated corpus was comprised of 6,413 question-answer pairs associated with 2,866 medical images. The overall set was split into 5,413 question-answer pairs (associated with 2,278 medical images) for training, 500 question-answer pairs (associated with 324 medical images) for validation, and 500 questions (associated with 264 medical images) for testing.

Fig. 2. Example images with question-answer pairs in the VQA-Med task.

5.3 Participating Groups and Runs Submitted

Out of 58 online registrations, 28 participants submitted signed end user agreement forms. Finally, 5 groups submitted a total of 17 runs, indicating a considerable interest in the VQA-Med task. Table 10 gives an overview of all participants and the number of submitted runs.

Table 10. Participating groups in the VQA-Med task.

5.4 Results

The evaluation of the participating systems in the VQA-Med task was conducted using three metrics: BLEU, WBSS (Word-based Semantic Similarity), and CBSS (Concept-based Semantic Similarity) [19]. BLEU [26] is used to capture the similarity between a system-generated answer and the ground truth answer; the overall methodology and resources for the BLEU metric are essentially the same as in the ImageCLEF 2017 caption prediction task. The WBSS metric is based on Wu-Palmer Similarity (WUPS) [43] with the WordNet ontology in the backend, following a recent algorithm for calculating semantic similarity in the biomedical domain [31]; it computes a similarity score between a system-generated answer and the ground truth answer based on word-level similarity. CBSS is similar to WBSS, except that instead of tokenizing the system-generated and ground truth answers into words, we use MetaMap via the pymetamap wrapper to extract biomedical concepts from the answers and build a dictionary of these concepts. We then build one-hot vector representations of the answers and calculate their semantic similarity using the cosine similarity measure.
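The sketch below illustrates the underlying ideas on toy inputs: one-hot concept vectors compared with cosine similarity for the CBSS-style comparison, and Wu-Palmer similarity via the NLTK WordNet interface for the word-based variant. The official scripts additionally rely on MetaMap and task-specific preprocessing, so treat this only as an approximation; the concept identifiers and words are invented examples.

```python
# Cosine similarity of one-hot concept vectors (CBSS idea) and Wu-Palmer
# similarity between words (WBSS idea). Requires the WordNet corpus:
# nltk.download("wordnet").
import numpy as np
from nltk.corpus import wordnet as wn

def cosine_one_hot(concepts_a, concepts_b):
    vocab = sorted(set(concepts_a) | set(concepts_b))
    a = np.array([c in concepts_a for c in vocab], float)
    b = np.array([c in concepts_b for c in vocab], float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# CBSS-style comparison of extracted concept identifiers (toy CUIs).
print(cosine_one_hot({"C0032227", "C0817096"}, {"C0032227"}))

# WBSS builds on Wu-Palmer similarity between answer words.
print(wn.synsets("effusion")[0].wup_similarity(wn.synsets("fluid")[0]))
```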

The overall results of the participating systems are presented in Table 11a to c for the three metrics, in descending order of the scores (the higher the better).

Table 11. Scores of all submitted runs in the VQA-Med task.

5.5 Lessons Learned and Next Steps

In general, participants used deep learning techniques to build their VQA-Med systems [19]. In particular, the systems leveraged sequence-to-sequence learning and encoder-decoder frameworks, using deep convolutional neural networks (CNNs) to encode the medical images and recurrent neural networks (RNNs) to encode the questions. Some participants used attention mechanisms to identify the image features relevant for answering the given questions. The submitted runs also varied in their use of VQA networks such as stacked attention networks (SAN), of advanced techniques such as multimodal compact bilinear (MCB) pooling or multimodal factorized bilinear (MFB) pooling to combine multimodal features, of different hyperparameters, etc. Participants did not use any additional datasets beyond the official training and validation sets to train their models.
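A minimal PyTorch sketch of this general recipe (CNN image encoder, RNN question encoder, simple fusion and answer classification) is given below. The dimensions, vocabularies and concatenation-based fusion are illustrative assumptions and do not correspond to any particular submission.

```python
# Minimal CNN + LSTM VQA sketch (illustrative architecture, hypothetical sizes).
import torch
import torch.nn as nn
from torchvision import models

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=5000, answer_size=1000, hidden=512):
        super().__init__()
        cnn = models.resnet18(weights=None)   # torchvision >= 0.13 API
        cnn.fc = nn.Identity()                # keep the 512-d image features
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, 300)
        self.rnn = nn.LSTM(300, hidden, batch_first=True)
        self.classifier = nn.Linear(512 + hidden, answer_size)

    def forward(self, image, question_ids):
        img_feat = self.cnn(image)                   # (B, 512)
        _, (h, _) = self.rnn(self.embed(question_ids))
        fused = torch.cat([img_feat, h[-1]], dim=1)  # simple concatenation fusion
        return self.classifier(fused)                # logits over the answer vocabulary

model = SimpleVQA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```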

The relatively low BLEU and WBSS scores of the runs in the results table reflect the difficulty of the VQA-Med task in generating answers similar to the ground truth, while the higher CBSS scores suggest that some participants were able to generate clinical concepts in their answers similar to those present in the ground truth answers. To leverage the power of advanced deep learning algorithms and improve the state of the art in visual question answering in the medical domain, we plan to increase the dataset size in future editions of this task.

6 The Lifelog Task

6.1 Motivation and Task Setup

An increasingly wide range of personal devices, such as smart phones, video cameras and wearable devices that allow capturing pictures, videos, and audio clips of every moment of life, have become inseparable companions. Considering the huge volume of data created, there is an urgent need for systems that can automatically analyze the data in order to categorize, summarize and also retrieve the information that the user may require. This kind of data, commonly referred to as lifelogs, has gathered increasing attention within the research community in recent years, above all because of the precious information that can be extracted from it and because of its remarkable effects in the technological and social fields.

Despite the increasing number of successful related workshops and panels (e.g., JCDL 2015, iConf 2016, ACM MM 2016, ACM MM 2017), lifelogging has rarely been the subject of a rigorous comparative benchmarking exercise, exceptions being the lifelog evaluation task at NTCIR-14 and last year’s edition of the ImageCLEFlifelog task [6]. With this second edition of the task we again aim to bring lifelogging to the attention of a wider audience and to promote research into some of its key challenges, such as the multi-modal analysis of large data collections. The ImageCLEF 2018 LifeLog task [7] aims to be a comparative evaluation of information access and retrieval systems operating over personal lifelog data. The task consists of two sub-tasks, which can be undertaken independently. These sub-tasks are:

  • Lifelog moment retrieval (LMRT);

  • Activities of Daily Living understanding (ADLT).

Lifelog Moment Retrieval Task (LMRT)

The participants had to retrieve a number of specific moments in a lifelogger’s life. “Moments” were defined as semantic events or activities that happened throughout the day. For example, participants should return the relevant moments for the query “Find the moment(s) when I was shopping for wine in the supermarket.” Particular attention should be paid to the diversification of the selected moments with respect to the target scenario. The ground truth for this subtask was created using manual annotation.

Activities of Daily Living Understanding Task (ADLT)

The participants should analyze the lifelog data from a given period of time (e.g., “From August 13 to August 16” or “Every Saturday”) and provide a summarization, based on concepts selected by the task organizers, of the Activities of Daily Living (ADL) performed and of the environmental settings/contexts in which these activities took place.

Some examples of ADL concepts are:

  • “Commuting (to work or another common venue)”

  • “Traveling (to a destination other than work, home or another common social event)”

  • “Preparing meals (include making tea or coffee)”

  • “Eating/drinking”

Some examples of contexts are:

  • “In an office environment”

  • “In a home”

  • “In an open space”

The summarization is expressed as the total duration and the number of times the queried concepts occur, for example:

  • ADL: “Eating/drinking: 6 times, 90 min”, “Traveling: 1 time, 60 min”.

  • Context: “In an office environment: 500 min”, “In a church: 30 min”.

6.2 Dataset Employed

This year a completely new multimodal dataset was provided to the participants, consisting of 50 days of data from a lifelogger. The data contain a large collection of wearable camera images (1,500–2,500 per day), visual concepts (automatically extracted, with varying rates of accuracy), semantic content (semantic locations and activities) based on sensor readings of mobile devices (via the Moves app), biometric information (heart rate, galvanic skin response, calorie burn, steps, etc.), and the music listening history. The dataset is built on the data available for the NTCIR-13 Lifelog 2 task [16]. A summary of the data collection is shown in Table 12.

Table 12. Statistics of ImageCLEFlifelog2018 Dataset.

Evaluation Methodology

For assessing performance in the lifelog moment retrieval task, classic metrics were employed. These metrics are:

  • Cluster Recall at X (CR@X): assesses how many different clusters from the ground truth are represented among the top X results;

  • Precision at X (P@X): measures the proportion of relevant photos among the top X results;

  • F1-measure at X (F1@X): the harmonic mean of the previous two measures.

Various cut-off points were considered, e.g., \(X = 5, 10, 20, 30, 40, 50\). The official ranking metric this year was the F1-measure at 10 (F1@10), which gives equal importance to diversity (via CR@10) and relevance (via P@10).
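A small sketch of these measures on hypothetical ranked results is shown below; the official evaluation additionally handles the topic-specific ground truth formats, so this only illustrates the definitions.

```python
# P@X, CR@X and F1@X for one topic, on invented ranked results and ground truth.
def p_at_x(ranked, relevant, x=10):
    return sum(1 for i in ranked[:x] if i in relevant) / x

def cr_at_x(ranked, image_to_cluster, n_clusters, x=10):
    found = {image_to_cluster[i] for i in ranked[:x] if i in image_to_cluster}
    return len(found) / n_clusters

def f1_at_x(ranked, relevant, image_to_cluster, n_clusters, x=10):
    p = p_at_x(ranked, relevant, x)
    cr = cr_at_x(ranked, image_to_cluster, n_clusters, x)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)

ranked = [f"img{i}" for i in range(1, 11)]            # top-10 submitted images
relevant = {"img1", "img3", "img7"}                   # relevant images (ground truth)
clusters = {"img1": "breakfast", "img3": "breakfast", "img7": "shopping"}
print(f1_at_x(ranked, relevant, clusters, n_clusters=3))
```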

Participants were allowed to undertake the sub-tasks in an interactive or automatic manner. For interactive submissions, a maximum of five minutes of search time was allowed per topic. In particular, the organizers wanted to emphasize methods that allow interaction with real users (e.g., via relevance feedback, RF): besides pure performance, the method of interaction (e.g., the number of relevance feedback iterations) and the level of innovation of the method (e.g., a new way of interacting with real users) were taken into account and encouraged.

In the activities of daily living understanding task, the evaluation metric measures the agreement between the ground truth and the submitted values, averaging the scores obtained for the number of occurrences and for the duration in minutes, as follows:

$$\begin{aligned} ADL_{score} = \frac{1}{2} \left( max(0, 1 - \frac{|n - n_{gt}|}{n_{gt}}) + max(0, 1 - \frac{|m - m_{gt}|}{m_{gt}})\right) \end{aligned}$$

where \(n, n_{gt}\) are the submitted and ground-truth values for how many times the events occurred, respectively, and \(m, m_{gt}\) are the submitted and ground-truth values for how long (in minutes) the events happened, respectively.
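A direct transcription of this score into code, evaluated on invented example values:

```python
# ADL score from the formula above; n/m are submitted counts/minutes,
# n_gt/m_gt the ground-truth values.
def adl_score(n, m, n_gt, m_gt):
    count_term = max(0.0, 1 - abs(n - n_gt) / n_gt)
    minutes_term = max(0.0, 1 - abs(m - m_gt) / m_gt)
    return 0.5 * (count_term + minutes_term)

# e.g. submitted "5 times, 80 minutes" against ground truth "6 times, 90 minutes"
print(adl_score(n=5, m=80, n_gt=6, m_gt=90))  # ~0.86
```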

Table 13. Submitted runs for ImageCLEFlifelog2018 LMRT task.
Table 14. Submitted runs for ImageCLEFlifelog2018 ADLT task.

6.3 Participating Groups and Runs Submitted

This year the number of participants was considerably higher than in 2017: we received in total 41 runs, 29 (21 official, 8 additional) for LMRT and 12 (8 official, 4 additional) for ADLT, from 7 teams from Brunei, Taiwan, Vietnam, Greece-Spain, Tunisia, Romania, and a multi-nation team from Ireland, Italy, Austria, and Norway. The submitted approaches range from fully automatic to fully manual, from using a single information source provided by the task to using all the information as well as integrating additional resources, and from traditional learning methods (e.g., SVMs) to deep learning and ad-hoc rules. The submitted runs and their results are summarized in Tables 13 and 14.

6.4 Lessons Learned and Next Steps

We learned that the majority of the approaches this year exploit and combine visual, text, location and other information to solve the tasks, in contrast to last year, when often only one type of data was analysed. Furthermore, we learned that lifelogging is following the general trend in data analytics, with participants using deep learning in many cases. However, there is still room for improvement, since the best results came from fine-tuned queries, which means that more advanced techniques are needed to bridge the gap between abstract human needs and the multi-modal data. Regarding the number of signed-up teams and submitted runs, we saw a significant improvement compared to last year. This shows how interesting and challenging lifelog data is and how much research potential it holds. As next steps, we do not plan to enlarge the dataset but rather to provide richer data and to narrow down the application of the challenges (e.g., extending them towards health-care applications).

7 Conclusions

This paper presents a general overview of the activities and outcomes of the ImageCLEF 2018 evaluation campaign. Four tasks were organised covering challenges in: caption prediction, tuberculosis type and drug resistance detection, medical visual question answering and lifelog retrieval.

Participation increased slightly compared to 2017, with over 130 signed user agreements and, in the end, 31 groups submitting results. This is remarkable, as three of the tasks were in only their second edition and one was in its first. Whereas several of the participants had taken part in the past, there was also a large number of groups totally new to ImageCLEF, as well as collaborations of research groups across several tasks.

As is now becoming commonplace, many of the participants employed deep neural networks to address all the proposed tasks. In the tuberculosis task, the results on multi-drug resistance detection are still too limited for practical use, though good performance was obtained in the new severity scoring subtask. In the visual question answering task the scores were relatively low, even though some approaches did manage to predict relevant concepts. In the lifelog task, in contrast to the previous year, several approaches used a combination of visual, text, location and other information.

The use of crowdAI was a change for many of the traditional participants; it raised many questions and also created much work for the task organizers. On the other hand, it is a much more modern platform that offers new possibilities, for example running the challenge continuously even beyond the workshop dates. The benefits of this will likely only be seen in the coming years.

ImageCLEF 2018 again brought together an interesting mix of tasks and approaches and we are looking forward to the fruitful discussions at the workshop.