1 Introduction

Much like that of chronic liver disease, the incidence of chronic pancreas disease is rising rapidly, reflecting the increasing worldwide prevalence of obesity [20]. Obesity causes fat infiltration in organs such as the liver and pancreas and triggers a set of inflammatory responses that can ultimately lead, in the pancreas, to chronic pancreatitis and pancreatic cancer. Quantitative magnetic resonance imaging (MRI) has become the gold standard tool for early detection, diagnosis and monitoring of chronic liver disease due to its soft tissue contrast, lack of radiation, high accuracy and precision, even in the most obese patients. This has resulted in the development of imaging biomarkers such as corrected T1 (cT1) [12] and proton density fat fraction (PDFF) [15].

Similar considerations motivate the development of imaging biomarkers for assessing the pancreas, which is increasingly important as it is fundamentally implicated in obesity-related conditions such as the metabolic syndrome and diabetes. Pancreas fat infiltration is often heterogeneous, with person-specific patterns of infiltration, which rules out two-dimensional and local quantification approaches. It is also usually asymptomatic and only reported incidentally as part of abdominal imaging routines. Furthermore, evaluation of pancreas fat content is important in the context of transplant [5], but is often only subjectively assessed. Evidently, there is a need for quantitative tools to assess pancreas state, including fat content.

The volume of the pancreas has been reported to change under certain conditions, such as type I and type II diabetes [10, 17]. The advent of deep learning-based approaches – particularly convolutional neural networks (CNNs) – has substantially improved pancreas segmentation accuracy. However, the pancreas remains one of the most challenging organs to segment, with potentially heterogeneous pathology and boundaries that are often unclear. This uncertainty leads to high inter-rater variability that impairs training, even for state-of-the-art approaches. In addition, substantial variability in acquisition sequences and parameters, as well as patient-related variability in signal-to-noise ratio and image uniformity (even for a fixed acquisition), make it difficult to develop robust pancreas segmentation methods.

Pancreas morphology may also be altered through disease processes [10, 22]. Imaging biomarkers of pancreas morphology have thus far relied on manual approaches, such as subjective scoring systems of e.g. surface “irregularity” (1 to 5 score) [10]. Automated approaches have used carefully engineered, non-organ-specific metrics such as curvature [22]. Advances in machine learning and modern pattern recognition invite more data-driven approaches for pancreas shape characterization. To this end, there are methods used in brain imaging for group-wise registration and computational anatomy [6, 9] that may yield robust and useful imaging biomarkers for the pancreas.

In this work, we report the development of a pancreas shape characterization pipeline which has potential for disease stratification and automated quality control. We used deep learning-based segmentation, a computational anatomy method involving registration to a template, and manifold learning. We validate our pipeline using data from the UK Biobank imaging sub-study.

2 Materials and Methods

Our method is based on a deep learning model that computes a pancreas segmentation for each case. The focus of this work is not a comparison of deep learning architectures for pancreas segmentation but rather to establish the baseline performance of such a pipeline; nevertheless, we explored the difference in performance between two loss functions commonly used in the pancreas segmentation problem. Segmentations on a separate dataset are then used to extract total pancreas volume and to extract shape metrics via diffeomorphic image registration and non-linear dimensionality reduction.

Data. MRI acquisitions from the imaging sub-study of UK Biobank [19], aiming to scan 100,000 volunteers, were used to develop a deep learning based pancreas segmentation model. Specifically, we used the “Pancreas fat - DICOM” volumetric acquisition (Field ID 20202) which targets the abdominal location of the pancreas. Only datasets from the first imaging visit (Instance 2) were used. Imaging data was acquired with a Siemens Aera 1.5T (Siemens Healthineers AG, Erlangen, Germany) at the Stockport, Manchester, UK imaging center using the FLASH-3D acquisition (TE/TR = 1.15/3.11 ms, voxel size = 1.1875 \(\times \) 1.1875 \(\times \) 1.6 mm), with 10\(^{\circ }\) flip angle and fat suppression.

Manual Annotations. Manual annotations were performed on N = 217 cases using the 3D brush tool in ITK-SNAP [23]. Where pancreas coverage was incomplete, only the pancreas volume covered by the field of view was annotated (see Fig. 1).

Fig. 1.

Two coronal slices of a case with partial pancreas head coverage in the breath-hold acquisition, shown in ITK-SNAP. Arrows indicate pancreas head (yellow), body (blue), and tail (green). (Color figure online)

Model Training. Keras with TensorFlow 1.13 as backend was used for training U-Net CNN segmentation models [16] based on an available architecture [7], adapted from 2D to 3D convolutions and input size of our data [14]. A random affine transformation (up to 3\(^{\circ }\) rotation, 5% translation, 5% scaling) was applied to each case at training time, different for each case at every epoch, for data augmentation purposes. Adam optimization was used with learning rate = 5e-5, as well as batch size = 1 and 100 epochs.
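The augmentation ranges quoted above can be illustrated with a small sketch that samples one random affine transform per case and epoch. This is not the authors' implementation: the function name `random_affine_matrix`, the choice of a single rotation axis, isotropic scaling, and rotation about the volume centre are all assumptions made for brevity.

```python
import numpy as np

def random_affine_matrix(shape, rng, max_rot_deg=3.0, max_shift=0.05, max_scale=0.05):
    """Sample a 4x4 homogeneous affine matrix for a 3D volume of the given shape."""
    # Assumption: rotation about the first array axis only (the source does not say).
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[1.0, 0.0, 0.0],
                         [0.0, c, -s],
                         [0.0, s, c]])
    # Assumption: isotropic scaling within +/- 5 %.
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    # Translation of up to 5 % of the extent along each axis, in voxels.
    extent = np.asarray(shape, dtype=float)
    shift = rng.uniform(-max_shift, max_shift, size=3) * extent

    matrix = np.eye(4)
    matrix[:3, :3] = scale * rotation
    # Rotate and scale about the volume centre, then translate.
    centre = (extent - 1.0) / 2.0
    matrix[:3, 3] = centre - matrix[:3, :3] @ centre + shift
    return matrix

rng = np.random.default_rng(0)
# A new matrix would be drawn for every case at every epoch.
A = random_affine_matrix((64, 128, 128), rng)
```

The same matrix would have to be applied to the image and to its annotation (with nearest-neighbour interpolation for the latter) so that the two stay aligned.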

Of the annotated datasets, 195 were used for training the model, while a random selection of 22 cases (10%) formed our validation set and were not seen during training. The model was checkpointed every 10 epochs over the 100 training epochs, and the checkpoint with the lowest validation loss was chosen as the segmentation model to be used in subsequent stages.

Loss Function. The choice of loss function is reported to be an important step in pancreas segmentation, due to the substantial class-imbalance of a small structure in a relatively large field of view [11, 13]. We experimented with two loss functions, binary cross-entropy loss and soft Dice loss, in order to observe the potential improvement of soft Dice over binary cross-entropy in the pancreas segmentation problem, using the same network architecture. This yielded two separate models that were named ModelCE and ModelDSC, respectively.
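To make the class-imbalance argument concrete, the sketch below evaluates both losses on a toy volume in which only 1% of voxels are foreground. It uses NumPy rather than the Keras backend, and the soft Dice formulation shown (with an additive smoothing term `smooth`) is one common variant; the exact formulation used for ModelDSC is not stated, so treat this as an assumption.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean voxel-wise binary cross-entropy."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def soft_dice_loss(y_true, y_prob, smooth=1.0):
    """1 - soft Dice, computed on probabilities rather than thresholded masks."""
    intersection = np.sum(y_true * y_prob)
    denominator = np.sum(y_true) + np.sum(y_prob)
    return float(1.0 - (2.0 * intersection + smooth) / (denominator + smooth))

# Toy volume: 1000 voxels, of which only 10 are foreground.
y_true = np.zeros(1000)
y_true[:10] = 1.0

# A prediction that confidently says "background" everywhere.
all_background = np.full(1000, 0.01)

bce_empty = binary_cross_entropy(y_true, all_background)
dice_empty = soft_dice_loss(y_true, all_background)
```

In this toy case the cross-entropy stays small even though no foreground voxel is found, because the many background voxels dominate the mean, whereas the soft Dice loss is close to 1.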

ModelDSC was chosen for all further experiments and for deriving the metrics reported in the Results. The remainder of the pipeline places no restrictions on the nature of the segmentation model, though segmentation accuracy will naturally affect the downstream analysis.

Predictions. ModelDSC was used to make inferences on a new cohort of 3,909 subjects from UK Biobank drawn from the same pool of cases used for training that shared the same acquisition parameters. Segmentation volumes were calculated for all cases in this cohort using the number of voxels in the segmentation mask and the acquisition voxel size.
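The volume computation described here amounts to a voxel count multiplied by the voxel volume. A minimal sketch, using the voxel size reported in the Data paragraph; the function name `mask_volume_ml` and the toy mask are illustrative only.

```python
import numpy as np

# Acquisition voxel size in mm, as reported for the FLASH-3D acquisition.
VOXEL_SIZE_MM = (1.1875, 1.1875, 1.6)

def mask_volume_ml(mask, voxel_size_mm=VOXEL_SIZE_MM):
    """Volume of a binary mask in millilitres (1 ml = 1000 mm^3)."""
    voxel_volume_mm3 = float(np.prod(voxel_size_mm))
    n_voxels = int(np.count_nonzero(mask))
    return n_voxels * voxel_volume_mm3 / 1000.0

# Toy mask with exactly 1000 foreground voxels.
mask = np.zeros((40, 40, 40), dtype=np.uint8)
mask[10:20, 10:20, 10:20] = 1
volume_ml = mask_volume_ml(mask)
```

With this voxel size each voxel contributes 2.25625 mm³, so the toy mask corresponds to about 2.26 ml.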

Diffeomorphic Registration and Kernel Generation. The predicted volumetric segmentation masks for 600 of the 3,909 cases were iteratively registered towards a group average using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) via Geodesic Shooting approach [2] in the ‘Shoot’ Toolbox of SPM12, generating a set of average template images that became sharper as the algorithm converged (see Fig. 2).

The scalar momenta maps were generated with SPM12 using the last template image together with the deformation fields and the Jacobian determinant fields; scalar momenta images were smoothed using a 10 mm Gaussian kernel. A 600 \(\times \) 600 kernel similarity matrix (Gram matrix) was then computed with SPM12 using the smoothed scalar momenta maps.

A manifold learning procedure was applied to the kernel matrix by making use of the “kernel trick” [3]. The Principal Component Analysis (PCA) algorithm in scikit-learn was run on the kernel matrix to extract the 10 principal components of variation of the shapes of pancreas segmentation masks. The end goal of this step was to obtain a dimensionality-reduced space where we could explore the separation between pancreata from control subjects and pancreata with different conditions, e.g. with different pancreas fat infiltration patterns.
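One way to extract components from a precomputed Gram matrix is to double-centre it and eigendecompose it, which is what kernel PCA does. The sketch below uses plain NumPy on a synthetic linear kernel; it is an illustration of the idea, not necessarily the exact scikit-learn call used in this work (scikit-learn's `KernelPCA` with `kernel="precomputed"` is a comparable route). The function name `kernel_pca_scores` is hypothetical.

```python
import numpy as np

def kernel_pca_scores(K, n_components=10):
    """Project subjects onto the leading components of a precomputed kernel matrix."""
    n = K.shape[0]
    # Double-centre the kernel so that the implicit features have zero mean.
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    Kc = H @ K @ H
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by round-off.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Synthetic stand-in for the 600 x 600 Gram matrix of smoothed scalar momenta.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 200))
K = features @ features.T
scores = kernel_pca_scores(K, n_components=10)
```

Each row of `scores` holds one subject's coordinates in the reduced space, with columns ordered by decreasing explained variance.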

Fig. 2.

The template images became increasingly sharp as the LDDMM algorithm progressed. The coronal (top) and axial (bottom) views of Template0, i.e. the average of the segmentation masks (left), and Template4, i.e. the last template file produced by the algorithm (right), are shown.

Regress Out Volume. In the unmodified kernel matrix, we found pancreas volume to be among the strongest descriptors of the data (see Fig. 3). However, we are primarily interested in volume-invariant shape metrics, as there is already an independent, straightforward way of deriving volume from the segmentation masks. In order to treat shape metrics independently of pancreas volume, pancreas segmentation volumes were regressed out of the kernel matrix K by using the “residual-forming” matrix R:

$$\begin{aligned} X = \begin{bmatrix} 1 & v_1\\ \vdots & \vdots \\ 1 & v_n \end{bmatrix}, \quad R=I-XX^+, \quad K'=RKR \end{aligned}$$
(1)

where \(v_1\), \(v_2\), ..., \(v_n\) are the pancreas volumes for all \(n = 600\) subjects, I is the identity matrix and \(X^+=(X^{\top }X)^{-1}X^{\top }\) is the Moore–Penrose pseudoinverse of X. The resulting matrix \(K'\) (where the prime denotes the adjusted kernel, not a transpose) is the kernel matrix with pancreas volume regressed out.
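Equation (1) translates almost directly into NumPy. The sketch below builds the design matrix X from an intercept and the volumes, forms R with `np.linalg.pinv`, and applies it on both sides of a synthetic kernel; the data and the function name `regress_out_volume` are illustrative assumptions.

```python
import numpy as np

def regress_out_volume(K, volumes):
    """Return K' = R K R, with R the residual-forming matrix for [1, v]."""
    n = K.shape[0]
    # Design matrix: intercept column plus pancreas volume.
    X = np.column_stack([np.ones(n), np.asarray(volumes, dtype=float)])
    # R projects onto the orthogonal complement of the column space of X.
    R = np.eye(n) - X @ np.linalg.pinv(X)
    return R @ K @ R, R, X

# Synthetic example: a linear kernel whose features include volume itself.
rng = np.random.default_rng(0)
n = 30
volumes = rng.uniform(10.0, 120.0, size=n)
shape_features = rng.normal(size=(n, 5))
features = np.column_stack([shape_features, volumes])
K = features @ features.T

K_prime, R, X = regress_out_volume(K, volumes)
```

Because R annihilates both the constant vector and the volume vector, any component of the kernel that is an affine function of volume is removed, and \(K'\) remains symmetric.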

Fig. 3.

Plotting the 1st vs the 2nd component of PCA (run on the original kernel matrix) shows that volume is a strong descriptor of the data, with the largest variation along the 1st component. Both plots show the same data; left plot shows image projections of the segmentation masks as datapoints (blank images are zero-volume cases); right shows color-coding by segmentation volume (color bar from 0 to 126 ml). Volume was subsequently regressed out to yield volume-invariant shape metrics.

Labelling. Fat-infiltrated regions of the pancreas appear dark in our fat-suppressed images. The expectation from a segmentation algorithm working on this kind of acquisition is that fat-infiltrated portions of the pancreas will not be delineated, leaving only pancreas parenchyma. Thus, segmentation masks with missing portions of the pancreas may indicate different fat infiltration patterns.

We labelled a subset of segmentation masks with clearly missing portions of the pancreas, further identifying which portion was missing, either the head (label 1) or the body and tail of the pancreas (label 2). Labelling was performed under the assumption that the previous segmentation stage did not miss any parenchymal regions of the pancreas. A total of 56 cases were labelled, 28 missing most of the pancreatic head, and 28 missing most of the pancreatic body and tail.

3 Results

3.1 Pancreas Volume

Model Evaluation. Pancreas segmentations were computed for the 22 datasets of the validation set with both ModelCE and ModelDSC. More specifically, ModelCE was the checkpoint model at epoch 50 and ModelDSC was the checkpoint model at epoch 60 (lowest validation loss in each case). The performance on each dataset was assessed in terms of the Dice Similarity Coefficient (DSC) with respect to the manual annotations. Overall, ModelCE and ModelDSC had a median ± std DSC of 0.706 ± 0.243 and 0.837 ± 0.136, respectively. ModelDSC’s performance was statistically significantly higher than that of ModelCE (one-tailed two-sample t-test, p < 0.01), and ModelDSC gave a higher DSC on 21/22 cases. Examples from the two models are compared in Fig. 4.
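For reference, the DSC between a predicted mask A and a manual mask B is \(2|A \cap B| / (|A| + |B|)\). A minimal sketch on toy 3D masks follows; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the text.

```python
import numpy as np

def dice_similarity(pred, truth):
    """Dice Similarity Coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / total)

# Manual annotation: a 4 x 4 x 4 cube (64 voxels).
truth = np.zeros((8, 8, 8), dtype=bool)
truth[2:6, 2:6, 2:6] = True

# Prediction: the same cube shifted by one voxel, so 48 voxels overlap.
pred = np.zeros_like(truth)
pred[3:7, 2:6, 2:6] = True

dsc = dice_similarity(pred, truth)  # 2 * 48 / (64 + 64) = 0.75
```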

Quality Control. A total of 42 subjects had a reported volume of zero which, upon inspection, corresponded either to imaging artefacts or to no pancreas coverage. Wrap-around artefacts were the most common artefact among the zero-volume cases. There was a greater prevalence of high BMI cases among acquisitions with artefacts compared with the entire population (p < 0.01). Zero-volume cases with no pancreas coverage often imaged the heart, perhaps due to incorrect repositioning during the imaging routine. Moreover, a total of 198 subjects had non-zero volumes that were smaller than 10 ml. These were often cases where coverage was suboptimal, or cases where the segmentation failed to delineate the entire pancreas region. An arbitrary threshold of 10 ml was defined to exclude cases from subsequent group-wise comparisons, though more thorough quality control measures should be implemented in the future.
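The volume-based filter can be sketched as three boolean groups: zero volume, non-zero but below the threshold, and retained. The volumes below are made-up values for illustration and the function name `qc_groups` is hypothetical; only the 10 ml threshold comes from the text.

```python
import numpy as np

def qc_groups(volumes_ml, min_volume_ml=10.0):
    """Split cases into zero-volume, small-volume and retained groups."""
    volumes_ml = np.asarray(volumes_ml, dtype=float)
    zero = volumes_ml == 0
    small = (volumes_ml > 0) & (volumes_ml < min_volume_ml)
    keep = volumes_ml >= min_volume_ml
    return zero, small, keep

# Illustrative volumes in ml (not real data).
volumes = np.array([0.0, 4.2, 9.9, 10.0, 55.8, 126.0])
zero, small, keep = qc_groups(volumes)

# Cohort bookkeeping reported in the text: 3,909 cases, 42 zero-volume, 198 below 10 ml.
n_retained = 3909 - 42 - 198
```

The bookkeeping is consistent with the 3,669 subjects used for the population measurements.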

Fig. 4.

Examples showing qualitative performance of the segmentation models compared to manual annotations (ModelCE, yellow contour; ModelDSC, cyan contour; manual label, red overlay). The 2D slice shown for each dataset was selected based on largest cross-sectional area of the manual pancreas annotation in the 3D image. Dice Similarity Coefficient (DSC) scores for the 3D segmentation are included above each case (ModelCE: ModelCE DSC; ModelDSC: ModelDSC DSC). The last 6 of the 22 cases in the validation set are shown. (Color figure online)

Population Measurements. The mean pancreas segmentation-derived volume in the 3,669 remaining subjects was 55.8 ml. Pancreas volume determination allowed for relative comparisons between subjects (see Fig. 5). Overall, males had a larger pancreas than females (p < 0.01), while overweight (body mass index, BMI = [25, 30)) people had larger pancreata than normoweight subjects (p < 0.01), but not significantly smaller than obese subjects (p = 0.99).

Fig. 5.

Volume statistics on the UK Biobank cohort (N = 3,669) showing impact of gender and BMI. Left, volume comparisons between Male and Female subjects. Right, volume comparisons between subjects in different BMI groups, namely Normal, Overweight and Obese.

3.2 Pancreas Morphology

Preliminary results show that a two-dimensional space defined by the 1st and 4th modes of variation of pancreas segmentation shape offers potential for classification (see Fig. 6). In the decomposition of the subset kernel of manually labelled cases, simple linear discrimination using the 1st and 4th components (albeit without independent validation) showed 93% performance in determining the assigned labels. No clear separation was observed between the selected subset of labelled cases when including all data. A tendency towards shape differences between genders was also observed, as shown in Fig. 6.
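The text does not specify which linear discriminant was used. As one plausible reading, the sketch below fits a two-class Fisher linear discriminant on two component scores and reports resubstitution accuracy, i.e. accuracy on the same cases used for fitting, which mirrors the "without independent validation" caveat. The data are synthetic, the class sizes (28 and 28) follow the labelled subset, and classes are coded 0/1 here rather than 1/2.

```python
import numpy as np

def fit_fisher_lda(X, y):
    """Two-class Fisher linear discriminant; returns weights w and offset b."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter matrix.
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    w = np.linalg.solve(Sw, m1 - m0)
    # Decision boundary through the midpoint of the class means (equal priors).
    b = -0.5 * w @ (m0 + m1)
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Synthetic stand-in for (component 1, component 4) scores of the labelled cases.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(28, 2)),
               rng.normal([2.0, 0.0], 1.0, size=(28, 2))])
y = np.repeat([0, 1], 28)

w, b = fit_fisher_lda(X, y)
resubstitution_accuracy = float(np.mean(predict(X, w, b) == y))
```

Resubstitution accuracy is optimistic by construction; a cross-validated estimate would be needed before drawing conclusions about generalization.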

Fig. 6.

Running PCA on the kernel revealed components in the manifold space with potential capability to identify fat infiltration patterns, as well as for shape characterization. (A) Subset of labelled cases that had missing body/tail portions of the pancreas (blue dots) and missing pancreatic head (brown dots). (B) Male (magenta dots) and female (green dots) gender differences in shape over the entire cohort. Left and right plots in (A) and (B) rows show the same data; left plots show image projections of the segmentation masks as datapoints. (Color figure online)

4 Discussion and Conclusions

This paper presented a pipeline for pancreas imaging biomarkers of volume and shape, through advanced organ segmentation and representation learning. To our knowledge, this work is the first to attempt learning pancreas morphology, instead of using hand-engineered features. Advanced feature extraction from pancreas segmentation masks may have potential in discovery, diagnosis and monitoring of chronic disease. This is our first step towards the ambitious goal of developing a data-driven, machine learning-based approach to analysis of pancreas state and pathology.

This work did not focus in particular on the choice of deep learning network architecture for pancreas segmentation. Recent works have reported self-optimized U-Net configurations [8], as well as model architectures improving on U-Net for pancreas segmentation in CT and MRI data [4, 13]. Instead, the significant difference in performance between the two segmentation models shown in this work, which differ only in the loss function being optimized, demonstrates the importance of certain hyperparameters in pancreas segmentation in addition to the choice of model architecture, possibly due to the high class imbalance. Other more advanced loss functions incorporating surface distance measures in combination with soft Dice loss have been introduced [1].

The relatively low median DSC of the better performing model reaffirms pancreas segmentation as a challenging problem, on which our downstream segmentation-derived metrics depend. In particular, the distinction between small volumes and segmentation failures (i.e. missing pancreas structures) is unclear. A threshold of 10 ml was arbitrarily chosen to filter out what were considered segmentation failures. In the future, dedicated a priori quality control measures for outlier detection that are independent of volume estimates will be considered. This would increase our confidence in the reported volume measurements. UK Biobank provides an excellent big-data resource to approach quality control from a machine learning perspective [21].

The calculated mean volunteer pancreas volume is smaller than that reported in the literature [17], perhaps reflecting the prevalence of acquisitions with only partial pancreas coverage. The assumption that partial coverage is both random and equally distributed among the groups is reasonable; therefore, the relative comparisons between groups presented in this work are still valid. The comparison results were also consistent with previous literature [10, 18]. Future work should look into volume comparisons between groups with different conditions reported in the UK Biobank study, such as metabolic disease or type II diabetes. Absolute pancreas volume measurement in UK Biobank needs further consideration, and perhaps other acquisitions with more coverage (albeit lower resolution), such as whole-body scans, could be used to correct for the missing volume. In general, it was found that segmentation performance was accurate independently of field-of-view coverage.

Manual labelling of segmentations with missing portions of the pancreas as putative fat infiltrated regions (in fat-suppressed scans) assumed full pancreas coverage and acceptable pancreas segmentation performance, which does not always hold. However, the promising results shown in this work reflect the potential performance of such a method in an eventual fat-infiltrated pancreas population, and provide a biomarker that is complementary to PDFF, which reflects parenchymal fat deposition. It is worth noting that at this stage the classification of fat-infiltrated patterns relies on prior detection of fat-infiltrated cases. This was performed manually by visual inspection, but automated detection should be considered in future work.