1 Introduction

Crowdsourcing is gaining in popularity as a method for sourcing labels for the very large amounts of data required to train machine learning algorithms [7]. Previous experiments have shown that non-experts can cheaply and readily crowdsource medical imaging ground truth [3, 14], perhaps using gamification [1, 11], at least for reasonably straightforward problems.

This paper investigates whether it is feasible to commission non-experts to undertake a relatively specialist imaging annotation task — that of recognising and segmenting the pathological patterns which are seen in interstitial lung disease. To this end, a toy exercise was designed in which participants were recruited to annotate the same representative set of twenty scan slices. In order to render the task accessible to the layperson, we restricted it to be one of annotation rather than diagnosis. Each scan slice was provided with expert labels indicating the presence of the main patterns to be labelled, and participants were asked to annotate regions belonging to these patterns. These labels are usually noted in a radiology report; thus the objective was for the routine expert diagnosis to direct the non-expert in the rather time-consuming work of delineating the pathological regions. To assess performance, we quantitatively and qualitatively compared the annotations to those of an expert medical researcher (A.O.) and two experienced radiologists (J.M. and E.v.B.) respectively.

The contributions of this paper are as follows:

  • To demonstrate how a specialist medical imaging ground truth task may be simplified such that a non-expert (given some basic training) performs comparably to an expert.

  • To analyse which factors are predictive of good performance.

  • To demonstrate how (and how many) non-expert observers should be assigned and combined for each scan in a real world crowdsourcing task, in order to improve label robustness.

  • To provide practical recommendations for how this task might be better conducted in future.

2 Methodology

2.1 Ground Truth for Interstitial Lung Disease

Identification of the presence, volume and distribution of different pathological patterns is helpful for the diagnosis and prognosis of interstitial lung disease [8]. Training machine learning algorithms to recognise and segment such patterns requires large amounts of labelled data. Thus, for this paper, the ground truth exercise was to label regions representing each of the common lung disease patterns: consolidation, emphysema, ground glass opacity (GGO), ground glass opacity+reticulation, honeycombing, micronodules, and reticulation. This is the same labelling system as used by Anthimopoulos et al. [2] for the same publicly available data [4], but with the addition of an emphysema class. Examples of these patterns are shown in Fig. 1.

Fig. 1. Pathological lung patterns: (a) Consolidation (b) Emphysema (c) GGO (d) GGO+Reticulation (e) Honeycombing (f) Micronodules (g) Reticulation

2.2 Data

Twenty computed tomography (CT) scan slices were selected from twenty different subjects in the MedGift ILD database [4]. The slices were chosen to span the range of disease labels, and each was labelled with one or two key patterns to be annotated by participants. Table 1 shows the pattern labels and medical diagnosis of each scan.

Table 1. Scan diagnoses (3 unknown) and patterns to label (C = Consolidation, E = Emphysema, G = GGO, GR = GGO+Reticulation, H = Honeycombing, M = Micronodules, R = Reticulation)

2.3 Recruitment of Participants

The exercise was completed by 34 volunteers from a company which makes medical imaging software. The participants had a variety of roles and levels of expertise, including junior scientists and software engineers, senior managers, and clinical experts. Entry and exit questionnaires were completed by all participants. The entry questions were designed to ascertain each participant’s level of experience and the factors motivating their participation. The exit questionnaire gathered feedback on participants’ experience of the exercise, and suggestions for improvement.

2.4 Annotation Task

Prior to the annotation task, all participants received a one-hour long tutorial on interstitial lung disease and the patterns of interest (based on the Fleischner Society Glossary of Terms for Thoracic Imaging [5]), given by a biomedical sciences graduate (A.O.) who had recently attended a one-day hands-on training course on interstitial lung disease run by the British Institute of Radiology.

Participants were provided with the twenty pre-selected slices and asked to annotate regions belonging to the provided pattern labels. Each participant annotated the images in a random order, to allow measurement of any training effect over the course of annotating the scans. Annotations were created using a tool that allowed users to draw polygonal regions of interest (ROIs) and assign a pattern class label to each ROI. The task was expected to take approximately two hours to complete. The use of online resources such as Radiology Assistant and Google was allowed and even encouraged, although collaboration between participants was prohibited.
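For downstream analysis, the polygonal ROIs can be rasterised into per-pattern binary masks. The following is a minimal Python sketch, assuming each ROI is stored as an array of (row, column) vertices; the tool's actual export format is not specified here.

```python
import numpy as np
from skimage.draw import polygon  # rasterises polygon vertices to pixel indices


def rois_to_mask(rois, image_shape):
    """Convert a list of polygonal ROIs (each an (N, 2) array of
    (row, col) vertices) into a single binary mask."""
    mask = np.zeros(image_shape, dtype=bool)
    for vertices in rois:
        rr, cc = polygon(vertices[:, 0], vertices[:, 1], shape=image_shape)
        mask[rr, cc] = True
    return mask


# Hypothetical usage: one mask per (participant, scan, pattern) triple.
# masks[participant][scan]["GGO"] = rois_to_mask(ggo_rois, (512, 512))
```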

3 Results

3.1 Evaluation of Non-expert Versus Expert Performance

Each annotation was scored by comparison to those of the reference annotator (A.O.) using the Dice similarity coefficient (DSC). The overall DSC was computed for each participant by weighting scans equally, and weighting patterns equally within a scan. Per-pattern DSC metrics were calculated for each participant by averaging over all examples of a pattern. In addition, the reference annotator repeated the annotations 10 days later to assess repeatability (the overall repeatability DSC was 0.806). Figure 2 summarises the results.
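A minimal sketch of this scoring scheme, assuming the annotations have been rasterised into binary masks and stored per scan and per pattern (the nested-dictionary layout is an assumption for illustration only):

```python
import numpy as np


def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def overall_dsc(participant, reference):
    """Overall DSC for one participant: patterns are weighted equally
    within a scan, and scans are weighted equally overall.

    `participant` and `reference` are nested dicts of the (assumed) form
    {scan_id: {pattern_label: binary_mask}}."""
    scan_scores = []
    for scan_id, ref_patterns in reference.items():
        pattern_scores = [dice(participant[scan_id][p], ref_mask)
                          for p, ref_mask in ref_patterns.items()]
        scan_scores.append(np.mean(pattern_scores))
    return np.mean(scan_scores)
```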

Fig. 2. The box plots indicate the median, upper and lower quartiles, and minimum and maximum DSC compared to the reference. The circles indicate the reference repeat scores.

There is clear variation in performance between classes, showing that some were more straightforward than others. It was known in advance that the distinction between e.g. GGO, GGO+Reticulation, and Reticulation might be open to interpretation. Also, there were a few cases of mistaken identity, with participants labelling vessels (pulmonary vessels and aorta) as pathology.

Fig. 3. Some example results: (A) Emphysema (B) GGO (C) Consolidation (D) Honeycombing. Scan slices are shown on the left and annotations are shown on the right. The greyscale background is proportional to the number of participants who annotated the label, i.e. white = no annotations and black = all 34 annotations. The reference results are shown in magenta (dotted line for the repeat). The radiologists’ annotations are shown in blue and green. (Color figure online)

Following the exercise, interviews were held with two experienced pulmonary radiologists (J.M. and E.v.B.), who confirmed the veracity of the provided labels, and annotated the images with some obvious examples of each pattern. Figure 3 shows some qualitative results of four interesting cases, showing the radiologist and reference annotations overlaid on the results of the crowd.

It can be seen that for A (Emphysema) and B (GGO) in Fig. 3, the range of variation of the crowd is comparable to the agreement (or disagreement) between the two radiologists. In each case, one radiologist is more sensitive and the other more specific for the given pattern, and the crowd approximately ranges between the two.

Examples C and D illustrate where improvements could be made. In C (consolidation), it is difficult to distinguish vessels from consolidation. It can be seen that the radiologists were cautious with their labelling compared with the reference, who outlined both vessel and consolidation where they were adjacent and therefore not separable. The crowd generally followed the philosophy of the reference, but some of the crowd labelled regions that are clearly vessel as consolidation. In D (honeycombing), both radiologists were stricter than the reference in their definition of honeycombing, and both raised the differential diagnosis of bronchiectasis. Honeycombing and bronchiectasis lie on a spectrum [12], and the bronchiectasis label was not included in our labelling system.

In summary, it was observed that in many cases the variability of the crowd matched the variability between the two radiologists, and this variability was reflective of underlying ambiguity in the pattern definition — or the ambiguity of the boundary between patterns such as GGO versus GGO+reticulation. However, in future the whole volume should be provided to the annotator rather than single slices, such that vessels can be better tracked and distinguished from consolidation (with appropriate teaching examples). We should also consider adding further labels such as bronchiectasis and fibrosis (fibrosis not illustrated here).

3.2 Factors Predicting Performance

None of the participants had specific prior experience of interstitial lung disease images. However, it was predicted that there might be a correlation between prior imaging experience and performance, particularly if insufficient training was provided for the task. Participants rated their level of experience with medical imaging data, from level 0 (little to none) to level 4 (clinical researcher). Figure 4 shows a plot of performance versus experience level. There is no significant correlation, suggesting that adequate guidance was provided for this task. It was also hypothesised that a training effect might be observed; however, no correlation was found between the scan ordering (randomised between participants) and each participant’s performance.

Conversely, there is a weak correlation between the time spent on the task and performance (see Fig. 4). The times shown are self-reported estimates. It is likely that those participants who performed better took time to do more research and/or took more care with their annotations. Visible annotation behaviour (number of regions, number of polygon vertices, rate of polygon vertices) was also analysed and found to exhibit no correlation with performance.
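The correlation analysis can be reproduced along the following lines; the arrays below are illustrative placeholders rather than the study data, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative placeholder data (NOT the study values): one entry per participant.
experience_level = np.array([0, 1, 2, 4, 3, 1, 0, 2])             # ordinal scale 0-4
hours_spent = np.array([1.5, 2.0, 3.0, 2.5, 4.0, 1.0, 2.0, 3.5])  # self-reported hours
overall_dsc = np.array([0.55, 0.60, 0.62, 0.70, 0.72, 0.50, 0.58, 0.68])

# Spearman's rank correlation for the ordinal experience level...
rho, p_rho = spearmanr(experience_level, overall_dsc)
# ...and Pearson's correlation for the (continuous) self-reported time.
r, p_r = pearsonr(hours_spent, overall_dsc)

print(f"experience vs DSC: rho = {rho:.2f} (p = {p_rho:.2f})")
print(f"time spent vs DSC: r = {r:.2f} (p = {p_r:.2f})")
```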

Fig. 4. Factors predicting performance. Level of expertise and time spent are plotted against DSC compared to the reference. Correlation coefficients are shown (Spearman’s rank and Pearson’s for the first and second plots respectively).

3.3 Crowdtruthing in the Real World: Assigning and Combining Multiple Observers

The previous results have shown the range in annotation quality between observers. It is likely that more consistent results could be achieved by combining annotation results from multiple observers, and this is true also of expert annotations, since human error or variations in pattern interpretation might be identified and corrected. In a real world crowdsourcing exercise, some questions would thus arise. How many observers should be assigned to each scan? How are their annotations best combined to give an annotation of predictable and reasonable quality?

To investigate this, different odd numbers of observers between one and fifteen were combined using a majority vote at each voxel. For each number of observers, 200 combinations were randomly drawn from the 34 annotations, after omitting the few cases where the annotation was empty, i.e. the participant had forgotten or was unable to label the key pattern. As in the earlier DSC computations, the problem is treated as binary for simplicity (i.e. a one-vs-all approach is taken when evaluating each pattern), even where more than one pattern was labelled in a scan. The graphs in Fig. 5 show the median, minimum and maximum values, both overall and for each pattern, averaged across the twenty scans.
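The following is a minimal sketch of this sampling and combination procedure, assuming one binary mask per observer for a given scan and pattern; the function and variable names are illustrative, and dice() is repeated from the earlier sketch so that the block is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility of the sketch


def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def majority_vote(masks):
    """Combine binary masks (one per observer) by a strict per-voxel majority vote."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stack.sum(axis=0) * 2 > len(masks)


def consensus_dsc_range(masks, reference, k, n_draws=200):
    """Draw n_draws random subsets of k observers, combine each subset by
    majority vote, and score the consensus against the reference mask."""
    scores = []
    for _ in range(n_draws):
        subset = rng.choice(len(masks), size=k, replace=False)
        consensus = majority_vote([masks[i] for i in subset])
        scores.append(dice(consensus, reference))
    return np.min(scores), np.median(scores), np.max(scores)
```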

In summary, multiple observers give a better result than a single observer. The median increases and the range of DSC values narrows as more observers are added, with little improvement beyond \(k=9\) observers. Note that the minimum, maximum and median converge at the limit of \(n=34\) observers, where there is only one possible combination of observers. For 5 of the 7 patterns, the median DSC matches the repeat DSC and the range converges whilst \(k \ll n\), showing that the limit of accuracy is reached once sufficient observers are combined. For GGO and GGO+Reticulation, combining multiple observers does not bring the crowd into agreement with the reference, suggesting that observers generally had a different idea from the reference of where the threshold between ground glass opacity and healthy tissue lies. STAPLE [15] methods were also tried (results not shown), initialised using both uniform (0.99999) and learnt rater sensitivities and specificities (learnt from the first ten scans and applied to the second ten); STAPLE gave worse results than the majority vote in both cases. This is in line with the findings of other authors [9, 10].

Fig. 5. Graphs showing the number of observers (x-axis) versus the reference DSC (y-axis) for the consensus (combined) annotation, for different pathological patterns. The solid lines indicate the median and the grey shading indicates the span from minimum to maximum (figures are the mean minimum, median and maximum across all scans). The dashed lines indicate the reference repeatability score.

4 Discussion

Overall, the crowd performed well relative to the reference segmentations, with some observers for some patterns matching the reference repeatability. Where there was variation, this was often indicative of genuine ambiguity between patterns. The greater range of disagreement for e.g. ground glass opacity compared to emphysema in this exercise has been observed by other authors measuring agreement between radiologists [13]. In fact, the combined annotations displayed as greyscale values in Fig. 3 could be interpreted as probabilities associated with the respective labels, and even used as soft labels for a machine learning algorithm, in line with the “dark knowledge” idea promoted by Hinton et al. [6]. Note that agreement both between non-experts and between radiologists would be increased by a more stringent ground truth protocol (this might involve e.g. prescribing a Hounsfield Unit range for ground glass opacity).
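As a sketch of the soft-label idea mentioned above, the per-voxel fraction of observers who annotated a pattern could be retained as a probabilistic target rather than thresholded to a hard label; this is an illustration, not part of the study pipeline.

```python
import numpy as np


def soft_label(masks):
    """Turn per-observer binary masks into a per-voxel soft label:
    the fraction of observers who annotated each voxel (values in [0, 1])."""
    stack = np.stack([np.asarray(m, dtype=float) for m in masks])
    return stack.mean(axis=0)  # usable directly as a soft training target
```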

Experiments regarding the combination of observers showed that multiple observers outperformed a single observer. For many patterns, when sufficient observers are combined, the median DSC matches the reference repeatability DSC and the DSC range converges around it, showing that the limit of accuracy is reached. The improvements discussed earlier (additional teaching on distinguishing normal anatomy such as vessels from pathology, provision of three-dimensional context, additions to the labelling system, and a more stringent ground truth protocol) should both raise the repeatability DSC and reduce the number of observers required to achieve a consistent result.

In conclusion, given sufficient expert task supervision and a sufficient number of observers per scan, crowdsourcing with non-experts can yield ground truth fit for use in image analysis algorithms.