Introduction

ITK-SNAP is a software tool that provides a graphical user interface for manual and user-guided semi-automatic segmentation of 3D medical imaging datasets. ITK-SNAP was created to address image segmentation problems for which fully automated algorithms are not yet available. Automatic segmentation may be lacking because a given problem has not received sufficient attention from algorithm developers; because the problem is too complex to be solved without human input; or because there is not yet sufficient expert-annotated data to train automated algorithms. Current state-of-the-art automatic medical image segmentation algorithms often use machine learning (Litjens et al. 2017; Shen et al. 2017), multi-atlas label fusion (Iglesias and Sabuncu 2015), or statistical shape priors (Heimann and Meinzer 2009). These techniques require training data in the form of tens or even hundreds of manually or semi-automatically segmented example images. Furthermore, extending fully automatic techniques to new domains (e.g. from adult to pediatric subjects, from healthy subjects to subjects with extensive pathology, or from one scanner manufacturer to another) requires yet more expert-generated segmentations as additional training data and/or validation data. Consequently, there is a strong need for expert-guided medical image segmentation approaches that span multiple imaging modalities and application domains.

ITK-SNAP was first developed in the early 2000s to provide an interactive platform for segmenting anatomical structures in 3D images both manually (by painting outlines on 2D cross-sections of a 3D image) and semi-automatically (by manually setting the parameters and initial seeds for two active contour algorithms (Caselles et al. 1997; Zhu and Yuille 1996)). Since its introduction, ITK-SNAP has become a popular tool, as evidenced by a large number of citations in the scientific literature. We analyzed a sample of 50 articles from 2014 that cited ITK-SNAP, and found that 86% of the articles used ITK-SNAP for image segmentation, 4% used it as an image viewer, and 10% cited ITK-SNAP but did not use it. Among the first group, 42% used ITK-SNAP only for manual segmentation, 42% used the semi-automatic segmentation features, and 16% did not state which approach was used. The relatively modest utilization of semi-automatic segmentation prompted us to extend ITK-SNAP with more advanced and generally applicable semi-automatic segmentation capabilities. These new capabilities focused primarily on two areas: enabling the concurrent use of multiple image channels during semi-automatic segmentation (such as multiple MRI contrast mechanisms, e.g., T1-weighted and FLAIR, as is increasingly done in fully automatic segmentation) and leveraging machine learning to identify foreground and background image regions in the input to active contour segmentation (in contrast to the original ITK-SNAP (Yushkevich et al. 2006), which relies on simple thresholding and edge detection). Introducing these capabilities required not only integrating new algorithms into ITK-SNAP, but also incorporating a large set of new user interface capabilities, for example to support the display and management of multiple image channels.

The present paper serves several aims: (1) to describe the new features introduced in ITK-SNAP software since the original 2006 publication (Yushkevich et al. 2006); (2) to demonstrate that the new semi-automatic functionality in ITK-SNAP can be applied to problems where threshold and edge-based active contour tools are ineffective; (3) to quantitatively compare ITK-SNAP semi-automatic segmentation to state-of-the-art specialized automatic segmentation algorithms in a widely studied problem; (4) to show that semi-automatic segmentation in ITK-SNAP can reduce segmentation time over manual segmentation; and (5) to demonstrate that ITK-SNAP segmentation capabilities can be applied in multiple application domains and imaging modalities.

Toward addressing aims 2, 3 and 4, we apply the new semi-automatic segmentation capabilities in ITK-SNAP to the challenging problem of brain tumor segmentation in multi-modality MRI. This study leverages data from the MICCAI 2013 Multi-Modal Brain Tumor Segmentation Challenge (BRATS) (Menze et al. 2015), a well characterized benchmark dataset. ITK-SNAP was used by three neuroradiologists as well as one novice user to label a set of 20 glioma cases from multi-modality MRI (pre-contrast T1, post-contrast T1, T2, and FLAIR). We hypothesized that users (both novices and neuroradiologists) would label tumors using ITK-SNAP reliably and in less time than what is required for manual segmentation. Inter-rater and intra-rater evaluation, as well as online evaluation by the BRATS system against reference manual segmentations, were performed to quantify segmentation reliability and accuracy. Additionally, toward addressing aims 2, 4, and 5, we apply semi-automatic segmentation in ITK-SNAP to the very different problem of placenta segmentation in first-trimester 3D ultrasound images, and evaluate against manual segmentation in terms of accuracy and segmentation time.

Materials and Methods

This section describes the main features of ITK-SNAP software. The first part of the section briefly summarizes the software design principles and the user interface features for image navigation, visualization, and manual segmentation. The remainder of the section focuses on the semi-automatic segmentation workflow, including the machine learning approach for reducing multi-modality image information to object/background probability maps and active contour segmentation.

ITK-SNAP Software Design Principles

The software development for ITK-SNAP is guided by three simple principles: exclusive focus on segmentation; generality of purpose; and ease of use. These principles are applied by the developers to prioritize potential new features. Features are excluded if they do not directly support the needs of image segmentation, if they are tailored exclusively to specific segmentation problems, or if they are largely redundant with existing capabilities. The application of these principles has resulted in a feature set that is relatively contained, as seen in Table 1, which lists the primary features incorporated into the software since 2011. It is possible for new users to learn the primary features of ITK-SNAP in the course of a 90-minute training session.

Table 1 Major new features added in recent versions of ITK-SNAP

ITK-SNAP is open-source software and is distributed under the GNU General Public License (Free Software Foundation 2007). It is a cross-platform application written in the C++ language. It leverages the Insight Toolkit (www.itk.org) library for image processing functionality, the Visualization Toolkit (www.vtk.org) for image and surface visualization, and Qt (www.qt.io) for user interface functionality. The CMake, CTest and CDash tools (cmake.org) are used for cross-platform compilation, automated testing, and posting of compilation and test results to a web-based dashboard. A web portal (www.itksnap.org) provides source code, pre-compiled binaries for the Windows, MacOS and Linux platforms, documentation, and user support resources.

Image Navigation and Visualization in ITK-SNAP

ITK-SNAP allows the user to load image volumes using common 3D medical image formats, including DICOM, NIFTI, MetaImage and NRRD. ITK-SNAP recognizes the information encoded in the image header about the spatial position and orientation of image volumes relative to the scanner physical coordinate system. The first image loaded into ITK-SNAP is designated as the “main image” and all visualization is performed relative to the main image geometry. Additional images can be loaded into ITK-SNAP after the main image, and these images can have different dimensions, resolution, and spatial orientation from the main image.

As shown in Fig. 1, 3D volumes are visualized as three orthogonal slices (cross-sections). The slices are parallel to the axes of the main image. When the main image is acquired non-obliquely, the slices correspond to the axial, coronal and sagittal planes in physical space. The three slices intersect at the center of a single voxel in the main image; the position of this voxel is defined as the “3D cursor” position. Crosshairs displayed on each slice visualize the 3D cursor position. Moving this crosshair in one slice view adjusts the slices visualized in other views. This “linked crosshair” concept provides a convenient way to navigate through 3D volumes, with all three views focused on the same location in the 3D image.

Fig. 1 Screen shot of the ITK-SNAP user interface after completed brain tumor segmentation. Three orthogonal slices through the T1-weighted MRI scan are shown, with segmentation overlaid in color. A 3D rendering of the segmentation appears in the lower left quadrant. Small thumbnails in the top and bottom right quadrants represent other MRI scans loaded in ITK-SNAP (T2-weighted, FLAIR, contrast-enhanced T1)

Multiple images loaded in ITK-SNAP can be visualized in three ways: (1) a “tiled” layout, where the coronal, axial and sagittal slice views each display the same slice through all loaded modalities; (2) a “thumbnail” layout, where one modality occupies most of each slice view, while the others are shown as small thumbnails, and clicking a thumbnail switches the view to that modality; and (3) an “overlay” mode, in which selected images are shown as semi-opaque overlays on top of the main image and other images.

Images loaded into ITK-SNAP may be scalar images (each voxel holds a single intensity value) or multi-component images, such as RGB color images (each voxel holds a red, green and blue value), displacement fields (each voxel holds a displacement vector, e.g., from deformable registration), diffusion tensor images, or dynamic image sequences. For multi-component images, the user can choose to view a single selected component, or the maximum, average or magnitude of the components. Special visualization modes are provided for RGB color images and displacement fields.

The visualization of individual voxels is controlled by a user-adjustable intensity remapping function, which may be linear (e.g., window/level control) or spline-based, and a color map function that maps scalar intensities to display colors. Window and level can be set automatically based on the image histogram.

Navigation in 3D image space is accomplished by the repositioning of the crosshairs in the three slice views with the mouse or keyboard, as well as mouse-based zooming and panning. Multiple redundant user interface widgets are provided to support crosshair repositioning, exact zooming and panning.

The ITK-SNAP state, i.e., the set of images currently loaded, their layout, intensity remapping, color map function, and various other state variables, can be saved in lightweight XML workspace files.

Segmentation Representation and Visualization

Segmentations are represented in ITK-SNAP as 3D images with the same dimensions and orientation as the main image. Each voxel in the segmentation volume is assigned a discrete integer label, with the label 0 representing the clear label. Segmentations can be loaded and saved using popular 3D image file formats, such as NIFTI. A label description table is maintained internally that assigns a name, color, opacity, and other metadata to each label. The label table may be edited using a “label editor” UI, and saved in the XML format. Segmentations are visualized as color overlays rendered on top of the main and additional image slices; and as surfaces in the 3D render view (Fig. 1). These surfaces can be exported to file using common 3D geometry formats used in 3D printing and 3D visualization (STL, VTK). The 3D render view supports navigation by allowing the user to click on the rendered surfaces to reposition the 3D cursor.

The approach of representing segmentations as discrete label images limits the resolution of the segmentation to that of the main image and disallows partial volume segmentation. However, it simplifies three-dimensional editing of the segmentations, as a change made in one slice view is unambiguously translated into changes in the other slice views. The use of a common representation for both images and segmentations also facilitates analysis.
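
Because segmentations are stored as ordinary integer-valued 3D images, downstream analysis is straightforward. The sketch below (not part of ITK-SNAP; the filename is hypothetical) loads a segmentation exported from ITK-SNAP and reports the volume of each label using SimpleITK and NumPy:

```python
import numpy as np
import SimpleITK as sitk

# Filename is hypothetical; any NIFTI segmentation saved by ITK-SNAP works.
seg = sitk.ReadImage("segmentation.nii.gz")
voxel_volume_mm3 = float(np.prod(seg.GetSpacing()))  # mm^3 per voxel

labels = sitk.GetArrayFromImage(seg)  # integer labels; 0 is the clear label
for label in np.unique(labels):
    if label == 0:
        continue
    n = int(np.sum(labels == label))
    print(f"label {label}: {n} voxels, {n * voxel_volume_mm3:.1f} mm^3")
```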

Image Registration

ITK-SNAP provides a linear registration mode that makes it possible to correct for subject motion between scans, such as head motion between multiple MRI scans obtained in the same session. A manual registration mode allows the user to rotate and translate images relative to the main image using widgets displayed on top of the orthogonal slices and mouse-based panning. An automatic registration mode can be used to find locally optimal rigid and affine transformations between the main image and a given additional image using common image similarity metrics.

Manual Segmentation

ITK-SNAP provides simple tools for creating manual segmentations and editing semi-automatic segmentations. The “polygon” tool can be used to draw structure outlines in any of the slice views. Polygons can be edited by moving vertices in the slice plane. Once accepted, the polygon is assigned the current label and integrated into the 3D segmentation volume. The “paintbrush” tool allows quick drawing and touch-up editing using the mouse, with masks of different shapes and sizes. An adaptive paintbrush mask is also provided, wherein only the neighboring voxels similar in intensity to the voxel clicked on by the user are assigned the foreground label. Additionally, the 3D render view provides a “3D scalpel” tool that can be used to assign a different label to a part of a structure using a user-specified cut plane.

When applying the manual segmentation tools, users select the active label used to perform the drawing/painting operation as well as the way in which operations will affect existing labels. For the latter, the users may select to paint over all existing labels, only the clear label, or only a specific selected label. This allows the user to “protect” previously drawn labels during segmentation and editing.

Semi-automatic Segmentation

The semi-automatic workflow proceeds in five stages, which are detailed in the subsections below:

1. The user specifies the region of interest (ROI) in which to perform semi-automatic segmentation.

2. The user applies one of several available presegmentation modes to transform the input image volumes into a single synthetic image volume called the speed image. In most presegmentation modes, the speed image represents the difference between the probability that a voxel belongs to the object of interest and the probability that a voxel belongs to the background.

3. The user places one or more initial contours inside the object of interest.

4. The contours evolve in a manner governed by the speed image and a shape regularization term.

5. The semi-automatic segmentation result is incorporated into the main ITK-SNAP segmentation volume.

ROI Selection for Semi-automatic Segmentation

The first stage of semi-automatic segmentation involves defining the rectilinear image domain in which segmentation will be performed. It is often desirable for this domain to be smaller than the whole main image, so as to reduce computational and memory demands of the segmentation algorithm. It is also desirable for the images input to the active contour segmentation algorithm to have approximately isotropic voxel size, as noted in “Active Contour Segmentation”.

In the ROI selection stage, the user defines the corners of a rectilinear ROI that contains the object of interest, and can optionally set the voxel size for the ROI to be different from that of the main image. All images are then cropped and resampled to the space of the user-selected ROI. All subsequent operations are performed on these cropped and resampled images. We note that while the multiple image layers loaded in a single ITK-SNAP session may be in different voxel spaces (i.e., have different resolution and orientation from each other), they are brought into a common voxel space for the purpose of semi-automatic segmentation.
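
This crop-and-resample step can be approximated outside ITK-SNAP with a short SimpleITK sketch; this is not the tool's internal C++ code, and linear interpolation and a zero default value are assumptions made for illustration:

```python
import SimpleITK as sitk

def crop_and_resample(image, roi_index, roi_size, iso_spacing=1.0):
    """Crop a rectilinear ROI and resample it to isotropic voxels."""
    roi = sitk.RegionOfInterest(image, roi_size, roi_index)
    out_size = [int(round(sz * sp / iso_spacing))
                for sz, sp in zip(roi.GetSize(), roi.GetSpacing())]
    return sitk.Resample(roi, out_size, sitk.Transform(), sitk.sitkLinear,
                         roi.GetOrigin(), (iso_spacing,) * 3,
                         roi.GetDirection(), 0.0, roi.GetPixelID())
```

Applying the same function to every loaded image layer brings them all into the common voxel space used by the subsequent presegmentation and active contour stages.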

Intensity-Based Presegmentation

For each voxel x in the ROI, multiple intensity values may be available, e.g., if the user applies segmentation to multiple co-registered image volumes or to a multi-component image. Presegmentation reduces all the image intensity values available at a voxel to a single scalar value. The resulting scalar image g(x) is called the speed image. ITK-SNAP offers four presegmentation modes: supervised classification, unsupervised classification, soft thresholding, and edge detection.

  • In the supervised classification mode, presegmentation estimates the probability Pobj(x) of the voxel x belonging to the object of interest and the probability Pbkg(x) of it belonging to the background. These probabilities are estimated by training a random forest classifier (Breiman 2001; Criminisi et al. 2012) on a set of example voxels identified by the user via painting operations. The probabilities are estimated using all available image intensity values at x. The speed image has range [−1, 1] and is defined as the difference between object and background probabilities, g(x) = Pobj(x) − Pbkg(x). Figure 2 shows ITK-SNAP in this presegmentation mode.

  • In the unsupervised classification mode, the speed image also estimates the difference between object and background probabilities at each voxel. These probabilities are also estimated using all available image intensity values at each voxel. However, this estimation is obtained without training data using a Gaussian mixture model and the Expectation-Maximization (EM) algorithm (Dempster et al. 1977). The user specifies the number of distinct tissue classes in the ROI, and the initial parameters for each class are randomly seeded using the k-means++ algorithm (Arthur and Vassilvitskii 2007).

  • In the soft thresholding mode, the speed image is also of the form g(x) = Pobj(x) − Pbkg(x), but the foreground and background probabilities are estimated in a more rudimentary way. A soft binary threshold function with user-supplied upper and lower threshold values is applied to a single image intensity component selected by the user. Intensity values between the lower and upper thresholds are assigned positive speed values, and values outside the thresholds map to negative speed values (a minimal sketch of this mapping appears after this list). The soft thresholding mode corresponds to the “region competition” segmentation approach developed by Zhu and Yuille (1995), and its implementation within ITK-SNAP is described and evaluated in Yushkevich et al. (2006).

  • In the edge attraction mode, the speed image has the range [0,1] and the speed image is derived from the gradient magnitude of a single image intensity component selected by the user. Large gradient magnitude values (strong edges) are mapped to small speed values, and vice versa. The edge attraction mode corresponds to the “geodesic active contours” approach described by Caselles et al. (1993, 1997), and its implementation within ITK-SNAP is described in Yushkevich et al. (2006).
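
As a concrete illustration of the soft thresholding mode, the sketch below maps intensities to speed values in (−1, 1) using a product of two sigmoids. This is an assumption made for illustration; the exact smooth threshold function used internally by ITK-SNAP may be parameterized differently:

```python
import numpy as np

def soft_threshold_speed(intensity, lower, upper, smoothness=1.0):
    """Map intensities in [lower, upper] to positive speed values and
    intensities outside to negative ones, with smooth transitions.
    Built from two sigmoids; ITK-SNAP's exact parameterization may differ.
    """
    s_low = 1.0 / (1.0 + np.exp(-(intensity - lower) / smoothness))
    s_high = 1.0 / (1.0 + np.exp(-(upper - intensity) / smoothness))
    return 2.0 * s_low * s_high - 1.0  # speed values in (-1, 1)
```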

The remainder of this section focuses on the supervised classification mode, which is used in all evaluation experiments in this paper. In this mode, the user specifies examples of k ≥ 2 tissue classes present in the segmentation ROI. Examples are specified by painting brushstrokes in one or more orthogonal slice views (the polygon tool can also be used). Each voxel painted by the user is treated as a separate example {Fj, yj} for training the random decision forest classifier, where Fj denotes the vector of features associated with the j-th example voxel and yj ∈{1,…, k} is its tissue class.

Fig. 2 Screen shot of the ITK-SNAP user interface during brain tumor presegmentation (supervised classification mode). Axial, sagittal and coronal slices through four MRI modalities and the speed image are in the top left, top right and bottom right quadrants of the user interface, respectively. The speed image (blue-to-white color map) has range between −1 (blue) and 1 (white), with positive values indicating higher probability that a voxel belongs to the object of interest and negative values indicating higher probability of a voxel belonging to the background. The object of interest in this example consists of all tissue classes composing the complete tumor: edema, active tumor, enhancing tumor core and necrosis. The lower left quadrant shows a 3D rendering of the samples used for training the random forest classifier. Some of the samples (necrosis: green, normal gray and white matter: blue) are also seen as color overlays in the axial, sagittal and coronal views

By default, the feature vector Fj consists of all the image intensity values available for the j-th voxel (i.e., all the components of all the images). However, the feature vector can also be made to include all the intensity values in the rectangular patch centered on the j-th voxel, with the patch size set by the user. Including neighboring intensities as additional features makes it possible for the classifier to learn more complex intensity patterns that separate different tissue classes. As illustrated in Fig. 3, patch features make it possible to differentiate between image regions based on texture. The feature vector Fj may also be made to include the coordinates of the j-th voxel as features. This makes it possible to differentiate between image regions that have identical intensity characteristics but distinct locations in the image, as illustrated in Fig. 3.
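
A sketch of this feature construction is given below. This is illustrative NumPy code, not ITK-SNAP's implementation; note that np.roll wraps around at image borders, a boundary simplification that a real implementation would replace with padding:

```python
import numpy as np

def build_features(channels, patch_radius=0, use_coords=False):
    """Build per-voxel feature vectors from co-registered 3D channels.

    With patch_radius=0 the features are the per-channel intensities;
    a positive radius appends every intensity in the surrounding patch,
    and use_coords appends the voxel coordinates.
    """
    r = patch_radius
    feats = []
    for ch in channels:
        for dz in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    feats.append(np.roll(ch, (dz, dy, dx), axis=(0, 1, 2)))
    if use_coords:
        zz, yy, xx = np.meshgrid(*map(np.arange, channels[0].shape),
                                 indexing="ij")
        feats += [zz, yy, xx]
    return np.stack([f.ravel() for f in feats], axis=1)  # (n_voxels, n_feats)
```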

Fig. 3 Illustration of the “patch intensity” and “location” features available in the ITK-SNAP supervised classification presegmentation mode. A test 2D image on the left consists of four regions with similar mean intensity but different texture. A set of example voxels has been marked in each region (red, green, cyan and yellow circles). The 4 × 4 grid on the right consists of speed images generated by training a Random Forest classifier using the four circles as the training data under different conditions. The rows in the grid correspond to different sets of features used to train the classifier: the default features (the intensity of each voxel serving as its only feature), patch intensity features (the set of intensities in the 5 × 5 × 5 patch around the voxel used as features), location features (the coordinates of each voxel used as features), and patch and location features combined. The columns in the grid correspond to different objects (red, green, cyan and yellow) being selected as the object of interest. The addition of patch features improves the ability of the classifier to discriminate regions based on texture, while the location features allow geometric constraints to be imposed on the segmentation

The random forest algorithm (Breiman 2001; Criminisi et al. 2012) is applied to the training data. The algorithm trains an ensemble of decision tree classifiers. Each decision tree is trained using a random bootstrap sample of the training data, and a random sample of the features (Breiman 2001; Criminisi et al. 2012). The number of decision trees and the depth of each decision tree are user-adjustable parameters (defaulting to 50 and 30, respectively).

After training, the random forest classifier is applied to all voxels in the ROI. For each voxel x, the feature vector Fx is constructed, and each decision tree in the ensemble is applied to Fx, resulting in a set of posterior probability values \(P_{x,l}^{t}\), where l ∈{1,…, k} is the class index and t ∈{1,…, T} is the tree index. The ensemble posterior probability for voxel x and label l is computed as \(P_{x,l}={\sum }_{t = 1}^{T}{P}_{x,l}^{t}\).

For the computation of Pobj(x) and Pbkg(x), the user tags one or more segmentation labels as corresponding to the object of interest, and the remaining labels are assigned to the background. Then we set \(P_{\text {obj}}(x)={\sum }_{l\in \text {obj}}P_{x,l}\) and \(P_{\text {bkg}}(x)={\sum }_{l\in \text {bkg}}P_{x,l}\).
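
ITK-SNAP implements its own random forest classifier in C++; the sketch below reproduces the same train/classify/speed-image pipeline with scikit-learn as a stand-in. scikit-learn's predict_proba averages the per-tree posteriors rather than summing them; once the probabilities are normalized, the two differ only by a constant factor, and g = Pobj − Pbkg reduces to 2Pobj − 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def speed_from_forest(F, train_idx, train_y, obj_classes,
                      n_trees=50, max_depth=30):
    """Train on user-painted examples, classify every ROI voxel, and
    return speed values g = P_obj - P_bkg in [-1, 1].

    F: (n_voxels, n_features) feature matrix for all ROI voxels;
    train_idx, train_y: indices and class labels of the painted examples;
    obj_classes: class labels tagged by the user as "object".
    """
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth)
    rf.fit(F[train_idx], train_y)
    P = rf.predict_proba(F)                     # (n_voxels, k), normalized
    obj_cols = np.isin(rf.classes_, list(obj_classes))
    p_obj = P[:, obj_cols].sum(axis=1)          # P_obj per voxel
    return 2.0 * p_obj - 1.0                    # P_obj - (1 - P_obj)
```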

The user interface for the supervised classification mode is lightweight. It includes a set of buttons for selecting classes (labels) for painting examples, a button to train/retrain the classifier, a button to clear all examples, and a list of defined classes in which labels can be tagged as object or background. A separate window allows the user to specify how the feature vectors are constructed, to set classifier parameters, and to export and import examples.

The user interface is also highly responsive in order to allow interactive modification of the training data. During presegmentation, the random forest is applied selectively to the input image volume, so that only the slices visible to the user are classified. This is much faster than applying the classifier to the whole image. If the user moves the 3D crosshair (thus changing which slices are shown in the three ITK-SNAP views), the classification is recomputed on the fly. This allows the user to repeatedly retrain the classifier until a desired classification is accomplished. For example, if a particular area of the object of interest is mislabeled, the user can paint some voxels in that area with the object label and retrain the classifier. The availability of undo/redo functionality for the painting operations also speeds up classifier training.

Active Contour Segmentation

In the supervised classification, unsupervised classification, and soft thresholding modes, positive values of the speed image correspond to parts of the image that have a higher probability of being the object than the background. However, simply thresholding the speed image at 0 usually fails to provide a satisfactory segmentation. Firstly, there may be multiple objects of interest in the image (e.g., left and right caudate, or multiple lesions) that need to be assigned different labels, and there may be parts of the background that have nearly identical appearance to the object of interest. Secondly, the speed image may be noisy due to imaging noise and due to the fact that presegmentation is applied independently to each voxel. In ITK-SNAP, presegmentation is therefore followed by a geometric active contour segmentation step, in which seeds are placed inside the specific object of interest and grown in a way that balances adherence to the speed image with a geometric regularization term (Zhu and Yuille 1995; Caselles et al. 1997; Sethian 1999; Whitaker 1998).

The active contour evolution algorithm implementation in ITK-SNAP was described previously in Yushkevich et al. (2006), and we only provide a brief summary here for completeness. Let t be time, and let Ct be a smooth contour in \(\mathbb {R}^{3}\), i.e., there exists a continuous, smooth function \(\phi _{t}:\mathbb {R}^{3}\rightarrow \mathbb {R}\) such that \(C_{t} = \{x\in \mathbb {R}^{3}:\phi _{t}(x)= 0\}\). The contour evolves according to the differential equation

$$ \frac{\mathrm{d}C_{t}}{\mathrm{d}t} = \left[g(C_{t})+\alpha\cdot\kappa_{C_{t}}\right] \cdot N_{C_{t}}, $$
(1)

where g(x) is the speed function, \(\kappa _{C_{t}}\) is the mean curvature of the contour Ct, \(N_{C_{t}}\) is the unit outward normal to the contour Ct, and α is a scalar parameter set by the user. Under this evolution equation, the contour expands into regions where the speed function is positive (and contracts where the speed function is negative), while also contracting at points where curvature is high. The evolution equation (1) corresponds to the variational gradient flow that maximizes the energy

$$E[C] = {\int}_{\mathcal{C}}g(x)\,\mathrm{d}x - {\int}_{\mathbb{R}^{3} \backslash \mathcal{C}}g(x)\,\mathrm{d}x + \alpha{\int}_{C}\mathrm{d}A, $$

where \(\mathcal {C}\) denotes the interior of the contour C, and dA is the element of area. Numerically, the contour evolution (1) is solved using the level set method (Sethian 1999), which expresses all terms of Eq. 1 in terms of the function ϕt and uses a robust finite difference scheme to approximate derivatives. An efficient extreme narrow banding method that only computes ϕt at a set of nodes adjacent to the zero contour (Whitaker 1998) is used for computational efficiency. The requirement for approximately isotropic voxels in active contour segmentation stems from the fact that the surface normal and mean curvature of C are approximated from the partial derivatives of ϕ, and the approximation is inaccurate when voxels have large aspect ratios (e.g., 1:2 or greater).
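
To make the evolution concrete, the following deliberately simplified 2D sketch performs explicit time stepping of the level set equation ∂ϕ/∂t = (g + ακ)|∇ϕ|. It omits the upwind finite differences and the sparse-field narrow banding of the actual implementation, and adopts the convention that ϕ > 0 inside the contour, so the region {ϕ > 0} grows where the speed is positive:

```python
import numpy as np

def curvature(phi, eps=1e-8):
    """Mean curvature kappa = div(grad(phi) / |grad(phi)|) on a 2D grid."""
    gy, gx = np.gradient(phi)
    mag = np.sqrt(gx**2 + gy**2) + eps
    return np.gradient(gx / mag, axis=1) + np.gradient(gy / mag, axis=0)

def evolve_contour(phi, g, alpha=0.2, dt=0.1, n_iter=500):
    """Explicit time stepping of phi_t = (g + alpha * kappa) |grad(phi)|.

    Simplifications relative to the actual ITK-SNAP implementation:
    no upwind differencing, no narrow band, no reinitialization of phi.
    """
    for _ in range(n_iter):
        gy, gx = np.gradient(phi)
        mag = np.sqrt(gx**2 + gy**2)
        phi = phi + dt * (g + alpha * curvature(phi)) * mag
    return phi  # the final segmentation is the set {phi > 0}
```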

The workflow for active contour segmentation consists of selecting an active label for the segmentation (in the supervised classification mode, this label is pre-populated as the first “object” tissue class); placing spherical seeds in the ROI; and supervising the active contour evolution. The user can choose to advance the evolution by fixed step size, or continuously, until pressing “stop”. The evolving contour is visualized in real time in 2D slices and in 3D if enabled by the user.

As the last step, the user “accepts” the segmentation. The active contour segmentation is then resampled into the space of the ITK-SNAP main image and integrated into the overall segmentation image.

Segmentation of Multiple Structures in Supervised Classification Mode

The supervised classification mode can be used to define multiple tissue classes in the image (e.g., edema, non-enhancing tumor core, enhancing tumor core, necrosis, healthy tissue, etc. in the case of glioblastomas), whereas the active contour segmentation only segments a single object at a time. To facilitate the segmentation of all relevant objects in the image, ITK-SNAP retains the classifier training data after active contour segmentation is completed. To segment additional objects in the ROI, the user re-enters the semi-automatic segmentation mode, assigns a different combination of tissue classes as object and background, and applies active contour evolution; all without having to re-train the random forest classifier.

Experiments and Results

The overall goal of the evaluation experiments is to show that ITK-SNAP can be used to perform complex image segmentation tasks in multi-modality image data quickly and reliably.

Brain Tumor Segmentation in Multi-contrast MRI

The primary evaluation is carried out in the context of semi-automatic segmentation of high-grade and low-grade gliomas in multi-contrast MRI from the 2013 MICCAI Brain Tumor Segmentation (BRATS) challenge (Menze et al. 2015). BRATS challenge data has been used to evaluate dozens of brain tumor segmentation algorithms, so it offers a well-established benchmark for evaluating ITK-SNAP segmentation performance. The reliability of “ground truth” manual segmentation in BRATS data is also known (Menze et al. 2015, Figure 5).

Our evaluation uses data from the 25-subject “leaderboard” subset of the 2013 BRATS dataset (Menze et al. 2015). For each patient, four MRI scans are provided: pre-contrast T1-weighted, T2-weighted and FLAIR scans, as well as a gadolinium contrast-enhanced T1-weighted scan (T1CE). All four scans are co-registered by the BRATS organizers and resampled to 1 mm × 1 mm × 1 mm resolution. Most high-grade gliomas have four distinct tissue classes: edema, which appears bright on T2 and FLAIR; enhancing tumor core (EC), which appears bright on T1CE; non-enhancing tumor core (NEC), which is abnormal in T2 but appears as normal gray/white matter in T1CE; and necrosis, which appears dark in T1. However, not all classes are present in all subjects, and their appearance can be variable. Low-grade gliomas typically do not have EC or necrosis.

The BRATS leaderboard dataset includes 21 scans of patients with high-grade gliomas and 4 scans of patients with low-grade gliomas. These include 15 cases (11 high-grade, 4 low-grade) that were used for off-site evaluation both in BRATS 2012 and BRATS 2013, and 10 additional cases that were used only in the 2013 challenge (Fig. 4). The 15-case subset (subset B in Fig. 4) was used to compare 20 tumor segmentation methods in the report on BRATS 2012 and 2013 by Menze et al. (2015).

Fig. 4 Composition of the different subsets of the BRATS 2012/2013 data referenced in this paper. A: the “Leaderboard” dataset provided for off-site evaluation in BRATS 2013. The online BRATS system (virtualskeleton.ch) continues to use this dataset for evaluating and ranking segmentation methods. B: the subset of 15 cases used for off-site evaluation in both BRATS 2012 and BRATS 2013. It served as the primary dataset for the comparison of 20 algorithms from the two challenges in the BRATS evaluation paper by Menze et al. (2015, Figure 7). C: the subset of 20 cases that was segmented using ITK-SNAP by all three neuroradiologists in this study. D: the subset of five cases segmented twice by the three neuroradiologists

Gliomas were segmented in ITK-SNAP using the following protocol. In the supervised classification presegmentation mode, seven tissue classes were available: edema, EC, NEC, necrosis, normal brain tissue, cerebrospinal fluid (CSF), and air/background. In most segmentations the first five classes were marked by the raters, and CSF was marked when tumors were adjacent to the CSF. Active contour segmentation was performed repeatedly, starting from the whole tumor and working inwards, as illustrated in Fig. 5. First, the combined tumor region (edema + NEC + EC + necrosis) is segmented; then (NEC + EC + necrosis); then (EC + necrosis); and finally necrosis only. This sequence takes advantage of the fact that in most tumors necrosis lies within the EC, which is within the NEC, which in turn is within the edema, and minimizes the need to label structures with holes.

Fig. 5 Sequence of segmentation used by the tumor segmentation protocol. The input images for the example tumor dataset are shown on the left. The columns on the right show the speed images and the segmentations obtained in the four stages of the segmentation. Segmentation is performed proceeding from the largest object inwards, so that during each segmentation stage, the object being segmented does not have holes. The table in the bottom right describes how the different tissue classes in the image are assigned to the object of interest and background during each stage, as well as what label is assigned to the result of active contour segmentation during each stage. For example, in stage 1, the four tissue classes comprising the complete tumor are assigned to the object of interest, while healthy appearing gray/white matter and CSF are assigned to the background. This yields a speed image that is positive in the complete tumor and negative in the healthy tissue. After applying active contour segmentation to this speed image, the result is assigned the edema label

The whole BRATS 2013 leaderboard dataset was segmented twice by a non-expert rater (AP) who had no previous experience with image segmentation or brain tumor segmentation. After studying the BRATS manual segmentation protocol (Jakab 2012) and ITK-SNAP tutorials, this rater practiced on a set of 20 cases with available segmentations from the BRATS “training” subset for about one week. The rater then segmented the 25-case leaderboard dataset sequentially over the course of 10 days. Following a one-month delay, the rater segmented each dataset again. The total segmentation time (from loading a workspace file in ITK-SNAP to saving final segmentation) was recorded for each segmentation attempt. The strokes used to train the random forest classifier were also saved as an image volume.

Additional segmentation was performed independently by three expert neuroradiologists (JES, JMS, SM) who had no prior experience with ITK-SNAP. The neuroradiologists performed segmentation in a subset of 20 cases (16 high-grade, 4 low-grade), designated as subset C in Fig. 4. Subset C includes the 15-case subset B used for the comparison of methods in the BRATS 2012 and 2013 challenges (Menze et al. 2015, Figure 7). A smaller subset of 5 images (3 high-grade, 2 low-grade) was segmented twice by each neuroradiologist after a delay of at least two weeks. The neuroradiologists attended a two-hour training session from ITK-SNAP developers, watched ITK-SNAP training videos online, and practiced on the 20-subject training subset until they felt comfortable with the tool and the segmentation protocol.

Segmentations performed by different raters, as well as repeat segmentations by the same rater, were compared in terms of the Dice similarity coefficient (Dice 1945) and volume. These measurements were conducted in a manner consistent with evaluations in the BRATS challenge (Menze et al. 2015). Specifically, the Dice coefficient was computed and reported for the “complete tumor” (edema + NEC + EC + necrosis), “tumor core” (NEC + EC + necrosis) and “enhancing core” (EC) for high-grade gliomas, and for the “complete tumor” and “tumor core” for low-grade gliomas. The Dice coefficient is a measure of relative overlap between segmentations, defined as the ratio of the volume of the intersection of two segmentations to the average volume of the two segmentations, and ranges between 0 and 1.
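
For reference, the Dice computation over such composite regions can be written compactly as below; the integer label values are hypothetical placeholders for the edema, NEC, EC and necrosis labels:

```python
import numpy as np

# Hypothetical label values; in practice they come from the label table.
REGIONS = {
    "complete_tumor": {1, 2, 3, 4},   # edema, NEC, EC, necrosis
    "tumor_core":     {2, 3, 4},      # NEC, EC, necrosis
    "enhancing_core": {3},            # EC
}

def dice(seg_a, seg_b, labels):
    """Dice coefficient between two label volumes over a set of labels."""
    a = np.isin(seg_a, list(labels))
    b = np.isin(seg_b, list(labels))
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```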

Results: Intra-rater and Inter-rater Reliability

The intra-rater reliability for the non-expert rater is summarized in Table 2. There is a substantial difference between the mean and median Dice coefficient, driven in part by zero Dice coefficient for one subject (high-grade 137), for which the rater labeled a completely different part of the image as tumor. The intra-rater reliability for low-grade gliomas is much lower than for high-grade gliomas, particularly for the tumor core. Overall, the median intra-rater Dice coefficient for our non-expert rater in subset B (0.92 for complete tumor, 0.81 for tumor core, 0.78 for enhancing region) compares favorably with the inter-rater reliability of the ground truth BRATS manual segmentation reported in Menze et al. (2015, Figure 5) for the same set of cases (0.85 for complete tumor, 0.84 for tumor core, 0.72 for enhancing region). Mean intra-rater Dice for our non-expert rater (0.83 for complete tumor, 0.63 for tumor core, 0.64 for enhancing region) compares less favorably with Menze et al. (2015, Figure 5) (0.81 for complete tumor, 0.77 for tumor core, 0.68 for enhancing region), which is likely driven by the lower performance of the non-expert rater on low-grade cases and the outlier high-grade case 137.

Table 2 Intra-rater reliability of glioma segmentation by the non-expert rater

Average intra-rater reliability for each of the three radiologists and average inter-rater reliability between all pairs of radiologists are reported in Table 3. Intra-rater reliability is consistently higher than inter-rater reliability, as would be expected, since inter-rater disagreements may include differences in the interpretation of the underlying anatomy, while intra-rater disagreements, in principle, reflect the difficulty of applying a given set of anatomical rules consistently. Additionally, Table 4 compares the average inter-rater reliability of the three radiologists using ITK-SNAP with the average inter-rater reliability of the three raters who provided reference manual segmentations in Menze et al. (2015) on the same set of images. Compared with Menze et al. (2015), the ITK-SNAP inter-rater reliability is lower for high-grade cases and higher for low-grade cases; the average over all cases is lower for ITK-SNAP, particularly for the tumor core.

Table 3 Summary statistics for inter-rater and intra-rater reliability of the three neuroradiologist experts
Table 4 Comparison of mean and median inter-rater Dice coefficient between the three experts in the current study and manual segmentation by three raters in Menze et al. (2015)

The intra-class correlation coefficients (ICC; Shrout and Fleiss 1979) in Table 3 span a wide range, and higher ICC values do not always correspond to higher inter-rater and intra-rater Dice coefficients. Figure 6 uses Bland-Altman plots (Bland and Altman 2007) to visualize the between-rater and within-rater disagreement in volume. The large range of ICC values is likely explained by the very different ranges of volume for the different regions. For example, for the enhancing tumor region, the between-rater and within-rater errors are approximately the same in absolute terms, but the range of volumes is very small for the set of 3 high-grade cases in which intra-rater ICC is computed, resulting in a very low ICC (0.1).

Fig. 6 Bland-Altman plots showing agreement in segmentation volume between attempts by different pairs of neuroradiologists (inter-rater plot, left) and different attempts by the same neuroradiologist (intra-rater plot, right). The average volume in both attempts is plotted on the horizontal axis, and the difference in volume between attempts is plotted on the vertical axis. All sub-plots have an aspect ratio of 1

Results: Comparison to BRATS Reference Segmentation

Segmentations by the three experts and the non-expert were uploaded to the online BRATS evaluation system, which reports the Dice coefficient between each segmentation and the BRATS reference segmentation, which is a consensus segmentation derived from combining multiple manual segmentations (Menze et al. 2015). Table 5 reports the mean and median overlap for each rater on the set of 20 cases that were segmented by all four raters (Subset C in Fig. 4). The non-expert’s agreement with the BRATS reference is generally on par with the experts for the high-grade gliomas, but lower than that of the experts on the low-grade gliomas. Combining across low-grade and high-grade gliomas, Expert 2 has the highest agreement with the BRATS reference of all four raters.

Table 5 Evaluation of the ITK-SNAP segmentation by the three experts and the non-expert against the BRATS challenge reference segmentation on Subset C (Fig. 4), which was segmented by all four raters

Table 6 estimates how the ITK-SNAP segmentations by the four raters compare to the 20 brain tumor segmentation techniques evaluated in the BRATS 2012 and 2013 challenges (Menze et al. 2015, Figure 7). The “Rank” columns of Table 6 indicate where the segmentation by each expert would rank with respect to the 20 methods evaluated in Menze et al. (2015, Figure 7). The best overall ranking is achieved by Expert 2, with Experts 1, 3 and the non-expert having very similar ranking profiles. However, for low-grade gliomas, Expert 3 has the best ranking. If the Dice coefficient is averaged across all three regions, ITK-SNAP segmentation by Experts 2 and 3 ranks first relative to the 20 methods in Menze et al. (2015, Figure 7), while segmentation by Expert 1 and the non-expert ranks second.

Table 6 Agreement of ITK-SNAP segmentation by the three experts and the non-expert with the BRATS challenge reference segmentation, ranked against the 20 methods evaluated in the BRATS 2012/2013 challenges (Menze et al. 2015, Figure 7)

The BRATS online evaluation system is open, and new methods are continually added. The online system ranks methods based on average Dice coefficient relative to the BRATS reference segmentation across the 25-case leaderboard dataset (Subset A in Fig. 4). As of April 2017, the segmentation produced by the non-expert was given an overall rank of 5 out of 45 by the online system. The dataset combining segmentations by Expert 2 for subset C with non-expert segmentations for the remaining 5 leaderboard cases was given an overall rank of 4 out of 45.

Segmentation Effort

The distribution of segmentation times for each rater, separated by tumor grade, is plotted in Fig. 7. The mean segmentation time for the non-expert across all cases was 12.3 min, while the mean segmentation times for the three experts were higher: 24.1, 24.3, and 16.2 min, respectively. We used the Wilcoxon signed rank test to determine whether the differences in segmentation time between pairs of raters were statistically significant. Significant differences were found between Expert 1 and the non-expert (p < 0.001), Expert 2 and the non-expert (p < 0.001), and Expert 2 and Expert 3 (p = 0.005); a trend toward significance (p < 0.06) was found for one additional pair of raters.
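
For reference, a paired comparison of this kind can be run with SciPy's wilcoxon; the timing values below are made-up placeholders rather than the study data:

```python
from scipy.stats import wilcoxon

# Hypothetical per-case segmentation times (minutes) for the same cases,
# in the same order; the real values come from the recorded timings.
times_expert1 = [22.0, 30.5, 18.2, 27.9, 21.4]
times_nonexpert = [11.3, 14.0, 9.8, 15.2, 12.1]

stat, p_value = wilcoxon(times_expert1, times_nonexpert)  # paired test
print(f"W = {stat}, p = {p_value:.3f}")
```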

Fig. 7 Distribution of ITK-SNAP segmentation time for each rater, by tumor grade

Figure 8 plots the number of training voxels labeled by each rater for each tissue class during supervised classification. The average number of training voxels per case was greatest for Expert 1, followed by the non-expert, Expert 2, and Expert 3. Differences in the number of training voxels were statistically significant on the Wilcoxon signed rank test between all pairs of raters (p < 0.01), except between Expert 2 and the non-expert.

Fig. 8 Number of voxels labeled for each tissue class by each rater. The number of voxels is plotted on a logarithmic scale

For the non-expert rater, who has the most training data available for analysis, and whose training examples were consistently drawn on disjoint slices, we estimated the number of training strokes by performing 1-voxel morphological erosion on each x, y and z slice and counting the number of connected components. The number of training strokes correlated significantly with segmentation time (R = 0.51, p = 0.01) and with Dice coefficient on BRATS evaluation (R = 0.52, p = 0.01). These correlations remained significant if only high-grade gliomas were considered. However, segmentation time was not significantly correlated with Dice coefficient on BRATS evaluation (R = 0.25, p = 0.26).
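
A literal reading of this stroke-counting procedure is sketched below (our interpretation for illustration, not the exact analysis code):

```python
import numpy as np
from scipy import ndimage

def count_training_strokes(samples):
    """Estimate the number of brushstrokes in a 3D volume of training
    samples: erode each 2D slice by one voxel, then count connected
    components, summed over all x, y and z slices.
    """
    total = 0
    for axis in range(3):
        for i in range(samples.shape[axis]):
            sl = np.take(samples, i, axis=axis) > 0
            if not sl.any():
                continue
            eroded = ndimage.binary_erosion(sl)
            _, n_components = ndimage.label(eroded)
            total += n_components
    return total
```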

Placenta Segmentation

A secondary evaluation of ITK-SNAP semi-automatic segmentation was performed in the context of segmenting the placenta in 3D ultrasound (3DUS) images taken during the first trimester of pregnancy. The placenta is difficult to segment in 3DUS because of non-uniform intensity within the placenta, high levels of noise, and weak intensity contrast between the placenta and surrounding uterine tissue. A dataset of 25 first-trimester (11–14 weeks) 3DUS scans acquired with GE Voluson E8 ultrasound machines was used in this study. Images were exported in Cartesian format with isotropic resolution, ranging from 0.34 mm × 0.34 mm × 0.34 mm to 0.61 mm × 0.61 mm × 0.61 mm.

Manual segmentation was performed in ITK-SNAP in collaboration between a non-expert rater (NY) and an expert obstetrician with over 10 years of experience in prenatal ultrasound imaging and placenta segmentation (NS). The expert supervised training for the non-expert and inspected and approved each segmentation. The non-expert traced the placental outline in approximately every fifth slice in all three orthogonal slice planes. The resulting sparse segmentations were reviewed by the expert and, if needed, sent back to the non-expert rater for correction. Sparse segmentations were interpolated to create a smooth 3D placental volume, which was again reviewed and corrected if necessary. Although segmentation time was not recorded precisely, each case required over ten hours of total segmentation time, including correction and editing.

A semi-automatic placenta segmentation protocol was developed using a subset of 13 “training” images for which manual segmentations were available. Images were first processed using a 3 × 3 × 3 voxel median filter to reduce speckle noise. The supervised classification mode in ITK-SNAP was used to define examples of seven tissue classes: placenta, anterior uterine muscle, posterior muscle, amniotic fluid, fetus, fat and other. The feature set for supervised classification included both patch intensity features (with patch radius 2, i.e., 5 × 5 × 5 voxel patches) and coordinate features. The coordinate features helped define placental boundaries in regions where there was little or no contrast between the placenta and the adjacent structures. An example of placenta presegmentation is shown in Fig. 9.
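
The median filtering step corresponds to a radius-1 median filter in each dimension (a 3 × 3 × 3 neighborhood); an equivalent one-liner in SimpleITK, with a hypothetical filename, is:

```python
import SimpleITK as sitk

us = sitk.ReadImage("placenta_3dus.nii.gz")  # hypothetical filename
us_smooth = sitk.Median(us, [1, 1, 1])       # radius 1 => 3x3x3 kernel
sitk.WriteImage(us_smooth, "placenta_3dus_median.nii.gz")
```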

Fig. 9 ITK-SNAP during placenta presegmentation using the supervised classification mode. The orthogonal slice views show a median-filtered 3DUS image of the placenta and the fetus, together with the speed image. The 3D view shows the samples traced for training the classifier

A non-expert rater (AP) applied this protocol to a “testing” set of 12 placenta 3DUS images for which manual segmentations were available. The mean Dice coefficient between the manual and semi-automatic segmentations was 0.88 ± 0.04, and the median Dice coefficient was 0.89. The average segmentation time using the ITK-SNAP semi-automatic protocol was 27.8 ± 7.0 min per case.

Discussion

Brain Tumor Segmentation

The evaluation in BRATS data illustrates the ability of ITK-SNAP to label complex multi-label structures quickly and effectively. Semi-automatic segmentation by the non-expert rater required only 12.3 min per case on average, compared with 60 min per case for the manual segmentation used to generate the BRATS ground truth (Menze et al. 2015). Experts took longer to perform segmentation (from 16 to 24 min, on average), and Experts 2 and 3 had better overall accuracy relative to the BRATS reference than the non-expert. ITK-SNAP segmentation by Experts 2 and 3 (Table 6) was very competitive with the 20 methods evaluated in Menze et al. (2015), achieving first and second ranks in almost all categories in Table 6. Segmentation by the non-expert and Expert 1 was also competitive, ranking just behind the top methods in Menze et al. (2015). When methodology developed since 2013 is taken into account, tumor segmentation by ITK-SNAP remains competitive, ranking 4th or 5th (depending on the rater) in the online BRATS leaderboard evaluation system.

A caveat in the comparison with Menze et al. (2015) is that most of the methods participating in the BRATS challenges were either fully automatic or required minimal seeding to initialize (Menze et al. 2015), whereas ITK-SNAP is a highly interactive method that required raters to spend 10–15 min training the classifier. Indeed, the fact that intra-rater reliability of ITK-SNAP is in the 0.7–0.9 Dice coefficient range for most combinations of region/grade (Tables 2 and 3) indicates the significant role that user input plays in determining the ITK-SNAP segmentation result. This can be viewed as both a drawback and a strength. The drawback is operator bias: just as in manual segmentation, results produced by one user may not be reproduced well by another user. The strength, compared to fully automatic segmentation, is that the expert is largely in control of the anatomical definitions applied during segmentation. Differences between the experts’ segmentations are driven, at least in part, by differences in their interpretations of the underlying anatomy. As demonstrated in a typical example shown in Fig. 10, the experts and the non-expert frequently disagreed as to what constituted different parts of the tumor. Crucially, the neuroradiologists did not study and try to mimic the manual segmentation protocol used in Menze et al. (2015), but instead applied their individual understanding of what constitutes the edema, enhancing and non-enhancing core, and necrotic components of the tumors.

Fig. 10 Example high-grade case demonstrating areas of disagreement between raters. To a considerable degree, disagreement is driven by differences in the anatomical interpretation of complex cases (e.g., what constitutes enhancement, what constitutes edema vs. tumor), as opposed to fine-scale disagreements in tracing of agreed-upon boundaries

The successful methods in BRATS were specialized to the problem of brain tumor segmentation and incorporated machine learning classifiers trained on a dataset of 30 annotated cases. Menze et al. (2015) conclude that “a majority of the top ranking algorithms relied on a discriminative learning approach, where low-level image features were generated in a first step, and a discriminative classifier was applied in a second step, transforming local features into class probabilities with MRF regularization to produce the final set of segmentations”. By contrast, ITK-SNAP is a general-purpose tool, and the Random Forest (Breiman 2001; Criminisi et al. 2012) classifiers used in ITK-SNAP were trained using the input images themselves, without a need for a separate annotated dataset. This suggests that ITK-SNAP is a viable alternative to manual segmentation in “novel” problems where annotated training data has not yet been generated.

The inter-rater reliability between the experts using ITK-SNAP was below the inter-rater reliability of the BRATS reference segmentation (Menze et al. 2015, Figure 5) for high-grade tumors (Table 4). However, the BRATS segmentation itself was not purely manual: for example, segmentation was performed on every second or third slice and interpolated, and Gaussian smoothing was applied following the segmentation (Jakab 2012). Local threshold-based segmentation was used to segment the enhancing region and the necrosis (Jakab 2012). Interpolation, smoothing and threshold-based painting operations can produce more consistent segmentations between raters, and thus higher inter-rater Dice coefficients, than purely manual segmentation might. Furthermore, BRATS manual segmentation followed a detailed and procedural protocol that involved tracing edema in T2-weighted MRI and tumor sub-regions on T1CE MRI (Jakab 2012). The manual raters in BRATS also had access to native-resolution MRI scans. By contrast, the segmentation in ITK-SNAP used information from all modalities concurrently, and worked with images resampled to a common 1 mm3 isotropic resolution. The three radiologists defined the tissue classes in the image based on their individual interpretation of anatomy, which likely resulted in greater divergence between experts than in the BRATS reference segmentation.

Placenta Segmentation

In the placenta segmentation problem, the accuracy of semi-automatic segmentation in ITK-SNAP relative to manual segmentation (average Dice coefficient of 0.88 ± 0.04) compared favorably with previously published work on semi-automatic segmentation of the placenta in 11–13 week 3DUS scans using the random walker algorithm (Collins et al. 2013; Stevenson et al. 2015), where an average Dice coefficient of 0.86 ± 0.06 is reported, and with fully automatic multi-atlas segmentation results, where a Dice coefficient of 0.83 ± 0.05 is reported (Oguz et al. 2016). However, the segmentation time in ITK-SNAP (27.8 ± 7.0 min) was several times greater than the average time of 3.6 min reported for the random walker segmentation (Stevenson et al. 2015). One possible factor explaining the difference in time is that the initialization for the random walker algorithm is performed in a manner favorable for placenta identification (in slices taken radially around the placenta), whereas in ITK-SNAP the placenta is viewed in orthogonal planes that may make placental anatomy harder to interpret.

Conclusions from the ITK-SNAP Evaluation

The main conclusion of the evaluation in brain tumor and placenta segmentation is that while ITK-SNAP does not unequivocally improve on existing segmentation solutions in these domains, it offers segmentation performance competitive with the state of the art, without being specially tuned, optimized, or trained for these specific problems. The ability of a general-purpose semi-automatic segmentation tool to compete with highly specialized and/or heavily trained approaches in these two problems suggests that ITK-SNAP can be an effective tool in a broad range of complex image segmentation problems, including problems for which there currently do not exist specialized solutions or annotated data for training machine learning and multi-atlas based techniques. Even in problems where automated algorithms have been developed, they may not always be available in the form of ready-to-use software, making ITK-SNAP a preferred segmentation solution.

ITK-SNAP in Relation to Other Medical Image Analysis Software

A number of mature medical image analysis software tools provide automatic and semi-automatic segmentation capabilities. General-purpose segmentation functionality is offered in 3D Slicer (Gering et al. 2001; Fedorov et al. 2012), ImageJ (Abramoff et al. 2004; Barboriak et al. 2005), BioImage Suite (Duncan et al. 2004) and MIPAV (McAuliffe et al. 2001). In addition, many domain-specific tools support segmentation: FreeSurfer (Fischl et al. 2002), FSL (Smith et al. 2004) and SPM (Ashburner 2009), for example, provide excellent automated tools for brain MRI segmentation, but they are not general-purpose segmentation tools. To our knowledge, ITK-SNAP is the first interactive image segmentation tool to combine Random Forest classification and active contour segmentation in a single interactive segmentation workflow, and the first to extend such a workflow to concurrent semi-automatic segmentation of multiple image modalities. In the 2D microscopy segmentation domain, the tool Ilastik (Sommer et al. 2011), which also leverages Random Forest classifiers trained interactively by the user on multi-channel images, offers similar scope and functionality to ITK-SNAP.

3D Slicer is the general-purpose 3D medical image analysis tool that arguably comes closest to offering the semi-automatic segmentation functionality of ITK-SNAP. However, 3D Slicer follows a design philosophy distinct from that of ITK-SNAP, one that emphasizes “breadth of functionality [and] extensibility” (Fedorov et al. 2012, p. 1324) and that “separates Slicer from such task-oriented packages as ITK-Snap” (Fedorov et al. 2012, p. 1324). For example, 3D Slicer offers extensive functionality for image filtering, non-linear registration, surface model visualization and editing, diffusion MRI processing and tractography, and image-guided surgery (Fedorov et al. 2012). By contrast, ITK-SNAP focuses specifically on the task of image segmentation and eschews functionality not directly related to segmentation. 3D Slicer is a much larger tool than ITK-SNAP, with over 570,000 lines of C++ and Python code (not counting the 90+ extensions that do not form the core of the software); ITK-SNAP has 190,000 lines of code and does not currently offer extensions. 3D Slicer offers several built-in options for image segmentation, such as tissue classification based on the expectation-maximization algorithm (Pohl et al. 2007), a simple region growing algorithm,Footnote 3 a competitive region growing algorithm “GrowCut” (Egger et al. 2013), a more recent Fast GrowCut algorithm (Zhu et al. 2014), and active contour segmentation via robust statistics (Gao et al. 2012). The extensions of 3D Slicer supporting general-purpose segmentation include a tool for watershed segmentation,Footnote 4 and a tool called the “Segmentation Wizard”.Footnote 5 All of the above modules, with the exception of the Segmentation Wizard, take only a single-modality image as input.

Semi-automatic GBM segmentation using 3D Slicer was evaluated in Egger et al. (2013); the authors determined that the GrowCut module was the most suitable of the available tools and showed that it had good overlap with manual segmentations (88% Dice coefficient) while reducing segmentation time by 39% relative to manual segmentation. However, a direct comparison of the quantitative results in Egger et al. (2013) and the current paper cannot be made, for several reasons: (1) in Egger et al. (2013), segmentation was performed in a single modality (contrast-enhanced T1-MRI) and with only a single anatomical label (tumor vs. no tumor); (2) in Egger et al. (2013), the segmented tumors were considerably smaller than in the BRATS challenge (average tumor volume 27.7 cm³, as opposed to 82.4 cm³ for the complete tumor and 60.1 cm³ for the tumor core in the current paper); (3) the comparison of semi-automatic and manual segmentation in Egger et al. (2013) was between segmentations performed by the same rater, whereas in the current paper, we report accuracy relative to the BRATS consensus reference segmentation derived by different raters.
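To make the classification step of this combined workflow concrete, the sketch below illustrates the general idea in Python with scikit-learn: a Random Forest is trained on user-labeled voxels drawn from multiple co-registered channels, and its per-voxel foreground probability, rescaled to [-1, 1], plays the role of the speed image that drives the active contour. This is an illustrative sketch under assumed conventions, not ITK-SNAP's C++ implementation; the channel names, label encoding, and array shapes are hypothetical.

    # Illustrative sketch of classifier-driven speed image computation
    # (not ITK-SNAP code; names, shapes and label encoding are hypothetical).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Two hypothetical co-registered channels (e.g., T1-weighted and FLAIR)
    rng = np.random.default_rng(0)
    t1 = rng.normal(size=(64, 64, 64))
    flair = rng.normal(size=(64, 64, 64))
    features = np.stack([t1.ravel(), flair.ravel()], axis=1)  # (n_voxels, 2)

    # Hypothetical user scribbles: 1 = foreground, 0 = background, -1 = unlabeled
    scribbles = -np.ones(t1.size, dtype=int)
    scribbles[:500] = 1      # placeholder foreground samples
    scribbles[500:1000] = 0  # placeholder background samples

    # Train on labeled voxels only; all channels are used jointly as features
    labeled = scribbles >= 0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(features[labeled], scribbles[labeled])

    # Foreground probability per voxel, rescaled to [-1, 1]; a map of this
    # kind can serve as the speed image for the active contour evolution
    prob_fg = clf.predict_proba(features)[:, 1].reshape(t1.shape)
    speed = 2.0 * prob_fg - 1.0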

Limitations

A limitation of ITK-SNAP is that it is a general-purpose image segmentation tool, and thus not optimized for any specific segmentation problem. As a result, it is likely to produce segmentation results inferior to those of specialized tools. However, as noted in the Introduction, expert-guided segmentation is needed both to create and train new specialized tools and to address segmentation problems for which specialized tools have not yet been developed. A particular limitation of the active contour algorithm implemented in ITK-SNAP is that for thin tissue layers (e.g., thinner than 3-4 voxels), the active contour may not flow into the tissue layer, due to the strong action of the curvature term in (1) and due to the inherent limitation of the level set method in representing parallel contours separated by short distances. These limitations can be mitigated by reducing the weight of the curvature term and/or by supersampling the input image (effectively making the tissue layer thicker in units of voxels). However, these mitigation strategies add time to segmentation, and reducing the curvature weight may result in leakage and poor segmentation due to under-regularization. In the tumor segmentation problem, we took advantage of the adjacency of the thin layers composing the tumors by performing segmentation in a nested fashion, so that thin layers never had to be segmented directly. Such a nesting strategy may not always be feasible in other applications. In particular, the active contour algorithm as implemented in ITK-SNAP is not well suited for the segmentation of thin tubular structures, such as vessels.
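As an illustration of the supersampling workaround, the hypothetical Python sketch below (using scipy for resampling; not ITK-SNAP code) resamples an image containing a 2-voxel-thick layer to twice the resolution, so that the layer spans roughly 4 voxels; the eightfold growth in voxel count is the source of the added segmentation time noted above.

    # Hypothetical sketch of the supersampling mitigation (not ITK-SNAP code)
    import numpy as np
    from scipy.ndimage import zoom

    image = np.zeros((64, 64, 64), dtype=np.float32)
    image[:, :, 30:32] = 1.0  # a tissue layer only 2 voxels thick

    # Trilinear resampling to twice the resolution in each dimension
    upsampled = zoom(image, 2, order=1)

    # The layer now spans roughly 4 voxels, easing the curvature constraint
    # of the level set, but the volume to segment grows eightfold
    print(image.shape, upsampled.shape)  # (64, 64, 64) (128, 128, 128)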

A limitation of the evaluation in this paper is that ITK-SNAP was compared to manual segmentation and, in the case of brain tumors, to specialized brain tumor segmentation algorithms, but not to other general-purpose interactive image segmentation tools. It would have been difficult to conduct such an evaluation in a manner free of real or perceived bias, since in the current study the developers of ITK-SNAP participated in the evaluation (e.g., by training the neuroradiologists) while the developers of other interactive open-source tools did not. However, because the brain tumor evaluation was performed on a publicly available BRATS challenge dataset, the developers of other tools can carry out similar evaluations independently and compare their segmentation results directly to those reported in this paper.

Another limitation of the evaluations performed in this paper is the use of the Dice coefficient to characterize segmentation accuracy. The Dice coefficient captures only one aspect of segmentation accuracy, and its value is correlated with the shape of the segmented object: the same boundary displacement produces a much larger reduction in Dice overlap for thin shapes than for blob-like shapes. Boundary distance metrics are frequently reported alongside Dice coefficients when evaluating segmentation, but such metrics could not be computed for the BRATS challenge because the underlying reference manual segmentations are not publicly available.
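This shape dependence is easy to demonstrate numerically. In the hypothetical Python sketch below, the same 2-voxel displacement lowers the Dice coefficient of a 4-voxel-thick slab to 0.5, while a compact cube of the same in-plane extent retains a Dice coefficient of 0.9; shapes and sizes are arbitrary choices for the example.

    # Numeric illustration of the shape dependence of the Dice coefficient
    import numpy as np

    def dice(a, b):
        """Dice coefficient 2|A&B| / (|A| + |B|) between binary masks."""
        return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

    grid = np.zeros((100, 100, 100), dtype=bool)
    slab = grid.copy()
    slab[40:60, 40:60, 50:54] = True  # slab only 4 voxels thick along z
    cube = grid.copy()
    cube[40:60, 40:60, 40:60] = True  # compact 20-voxel cube

    shift = 2  # displace both shapes by the same 2 voxels along z
    print(dice(slab, np.roll(slab, shift, axis=2)))  # 0.5
    print(dice(cube, np.roll(cube, shift, axis=2)))  # 0.9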

Information Sharing Statement

The ITK-SNAP (RRID:SCR_002010) software (source code and binaries) is hosted on SourceForge at https://sourceforge.net/projects/itk-snap, and can also be accessed through the ITK-SNAP website (www.itksnap.org) and the Neuroimaging Tools & Resources Collaboratory (NITRC) at https://www.nitrc.org/projects/itk-snap.

The BRATS challenge data were obtained from the Swiss Institute for Computer Assisted Surgery (SICAS) Medical Image Repository at https://www.smir.ch/BRATS/Start2013. The ITK-SNAP segmentations of the BRATS datasets, as well as the placenta ultrasound scans, their manual segmentations, and the corresponding ITK-SNAP segmentations, are available in the ITK-SNAP NITRC repository (https://www.nitrc.org/projects/itk-snap) under the package “nein2018”. The R statistical analysis scripts and spreadsheets used to generate the tables and figures in this paper are hosted on GitHub at https://github.com/pyushkevich/nein2018_stats.