Introduction

For organ segmentation, methods range from fully manual to fully automated, each with its own advantages and disadvantages. Artificial intelligence (AI) can combine the strengths of both: through AI, the precision of manual segmentation, unattainable with traditional automated methods, can be achieved alongside the consistency of an automated, iterative process. AI-based organ segmentation of MRI images is a wide-ranging, actively debated, and burgeoning field of research. Automated segmentations can be used for dosimetry and attenuation correction in nuclear medicine, computer-assisted image interpretation, and research-oriented mass image processing [1,2,3,4,5].

AI models previously used for classification [3, 4] cannot be directly employed for segmentation, and existing segmentation scripts are not sophisticated enough to delineate the boundaries of many organs, such as the liver and the tumors embedded within it. Currently, deep learning (DL) is the leading AI technology for segmentation. DL has previously been used for liver segmentation and the detection of liver tumors in CT scans [6, 7]; nevertheless, it has not been sufficiently applied to segmentation of MRI images (Table 1). Some studies have used 2D MRI slices [8, 9] or classical machine learning [10, 11]. Yet three-dimensional imaging data offer greater potential for the correct assessment of liver tumors and provide more features for improved segmentation.

In this respect, the liver is one of the less studied organs. Liver lesions, although very common, are also among the most difficult tumors to segment: it is challenging to isolate lesions from normal liver parenchyma, even for experienced radiologists working with a single imaging sequence (i.e., T1w or T2w). A human interpreter, however, can examine different sequences, scroll through the slices, and consult the patient's previous imaging and clinical history. We believe this procedure can be learned by AI, using DL, for segmenting the liver and extracting liver lesions. Given the aforementioned obstacles, efficient AI segmentation of the liver and liver lesions on T1- and T2-weighted MRI images is not widely documented, in contrast to comparable work on CT images (Table 1).

This study's key contribution is the application of deep learning for the precise segmentation of both the liver and liver lesions in 3D T1w and T2w MRI images, using a dataset specially prepared for this dual analysis. The network used here, although established for segmentation, had not previously been applied to segmenting the liver and liver lesions. Given its proven efficacy in brain tumor segmentation, its successful training for organ and tumor segmentation highlights its versatility for such tasks. The use of real, noisy clinical MRI data is also distinctive, underscoring the strengths and drawbacks of real-world DL organ segmentation. The results improve segmentation accuracy in 3D clinical MRI and set a precedent for DL in the detailed examination and differentiation of the liver and liver lesions.

Table 1 Previous approaches to segmentation of the liver using deep learning

Materials and methods

A brief visual overview of the procedures of the current study is presented in Fig. 1.

Data preparation

Patients

A total of 128 patients from the liver transplantation and general surgery wards of the Imam Khomeini Hospital Complex (IKHC), affiliated with Tehran University of Medical Sciences, Tehran, Iran, were included in the current study. In accordance with the guidelines of the Ethics Committee of Imam Khomeini Hospital, the anonymity of the images used in this study was guaranteed. Liver MRI images were collected from the picture archiving and communication system (PACS). Imaging was carried out on a MAGNETOM Trio 3T total imaging matrix (Tim) system (Siemens, Germany) and a GE Optima MR360 Advance 1.5T system (GE Healthcare, USA), using a standard protocol comprising axial and coronal T2w and axial T1w images.

Of the 128 patients, 110 were randomly allocated to the model training data and 18 to the hold-out test group. Within the training data, 20% of the subjects (n = 16) were randomly assigned to the validation group, and the remainder to the training cohort. Liver segmentation was carried out on both T1w and T2w sequences for 77 cases. Liver lesions were segmented on both sequences in 75 cases, 42 of which overlapped with the liver segmentation cases. The test group comprised 18 patients: 9 with liver tumors and 9 without.

Segmentation

The liver and liver lesions were segmented semi-automatically using 3D Slicer v4.10.0 on the T1w and T2w MRI images, separately, to create the training ground truth. Masks were created separately for the T1w and T2w sequences, for both liver- and tumor-labeled ground truths, and were saved as 3D NIfTI images. A value of 1 was assigned to the target regions of each image.
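As an illustration (not the authors' actual export pipeline), the following minimal sketch shows how such a binary ground-truth mask could be saved as a 3D NIfTI volume, assuming the nibabel library; the function and file names are hypothetical.

```python
import nibabel as nib
import numpy as np

def save_binary_mask(mask: np.ndarray, reference_img: nib.Nifti1Image,
                     out_path: str) -> None:
    """Save a 0/1 mask as a 3D NIfTI, reusing the reference image geometry."""
    binary = (mask > 0).astype(np.uint8)  # target region -> 1, background -> 0
    img = nib.Nifti1Image(binary, reference_img.affine, reference_img.header)
    img.set_data_dtype(np.uint8)
    nib.save(img, out_path)

# Hypothetical usage:
# ref = nib.load("patient01_T1w.nii.gz")
# save_binary_mask(liver_mask_array, ref, "patient01_T1w_liver_mask.nii.gz")
```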

Pre-processing procedures

The variation in voxel sizes among the MRI images necessitated resampling all images to a common grid. To this end, the geometry-based registration of 3D Slicer was used to resize the images to a common voxel size of 1.48 × 1.48 × 4 mm. This ensures that all input images share the same geometry, which is essential for the deep learning model to function properly.
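A minimal resampling sketch with SimpleITK (an assumption; the study itself used 3D Slicer) that brings a volume to the common voxel size reported above while preserving its physical extent:

```python
import SimpleITK as sitk

def resample_to_common_spacing(image: sitk.Image,
                               new_spacing=(1.48, 1.48, 4.0)) -> sitk.Image:
    old_spacing, old_size = image.GetSpacing(), image.GetSize()
    # Recompute the voxel grid so the physical extent stays the same.
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(old_size, old_spacing, new_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0.0, image.GetPixelID())
```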

Bias correction with the N4 bias field algorithm [12] was carried out in Python v3.7 to handle low-frequency intensity heterogeneity. Next, since the T1w and T2w images and their liver and lesion masks could be substantially misaligned, they were aligned with a Python script. Liver ground truth masks were used separately for the T1w and T2w images. To align them, the optimal shift between the two sets of images had to be identified. A score was defined to gauge the degree of alignment between two shifted images, and various shift values were explored to find the one that maximized this score. More precisely, the score is the voxel-wise logical AND of the two masks after shifting; if two shifts yield the same score, the smaller shift is chosen. The final images were aligned with the best shift found. This process did not change the characteristics of the images.
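A sketch of this shift search under two assumptions: the masks are NumPy arrays, and the ±10-voxel search range is illustrative. Note that np.roll wraps at the borders, which a production version would replace with padding. (For the N4 step, SimpleITK's N4BiasFieldCorrectionImageFilter is one commonly used implementation.)

```python
import numpy as np

def best_shift(mask_t1: np.ndarray, mask_t2: np.ndarray, max_shift: int = 10):
    """Integer shift of mask_t2 maximizing the voxel-wise logical AND with
    mask_t1; ties are broken in favor of the smaller shift."""
    best, best_score, best_mag = (0, 0, 0), -1, 0
    for dz in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                # np.roll wraps around the borders; padding would be stricter.
                shifted = np.roll(mask_t2, (dz, dy, dx), axis=(0, 1, 2))
                score = int(np.logical_and(mask_t1 > 0, shifted > 0).sum())
                mag = abs(dz) + abs(dy) + abs(dx)
                if score > best_score or (score == best_score and mag < best_mag):
                    best, best_score, best_mag = (dz, dy, dx), score, mag
    return best
```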

Without normalization, the heterogeneity of MR data prevents proper training, possibly because the training absorbs the bias. After loading the pre-processed data, voxel intensities were normalized by subtracting the mean and dividing by the standard deviation. A single mean and standard deviation were computed over all the training data and applied to normalize both the training and test datasets. The resulting values were in the range [-1, 1].
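A minimal sketch of this normalization, assuming the volumes are NumPy arrays: one global mean and standard deviation are computed from the training set only and then reused for the test set.

```python
import numpy as np

def fit_normalizer(train_volumes):
    """Compute a single mean/std over all training voxels."""
    voxels = np.concatenate([v.ravel() for v in train_volumes])
    return float(voxels.mean()), float(voxels.std())

def normalize(volume, mean, std):
    return (volume - mean) / std  # same statistics reused for the test set
```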

The typical field of view of medical images contains a black area around the body contour. Initial experiments indicated that cropping these black regions improved results. For lesion segmentation, we used the original cropped MRI images, rather than the segmented liver, as input to the deep learning model, consistent with the approach used for training the liver segmentation model. The reason is that, had liver masks been used as input for lesion segmentation, large liver lesions would not have been extracted efficiently because of tumor positions at the edge of the image.
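A sketch of the cropping step, assuming a NumPy volume and a simple intensity threshold (the exact criterion used in the study is not specified): the bounding box of above-threshold voxels is kept and the surrounding black area discarded.

```python
import numpy as np

def crop_to_body(volume: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Crop to the bounding box of voxels brighter than the threshold."""
    coords = np.argwhere(volume > threshold)  # assumes some foreground exists
    z0, y0, x0 = coords.min(axis=0)
    z1, y1, x1 = coords.max(axis=0) + 1
    return volume[z0:z1, y0:y1, x0:x1]
```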

In the next step, data augmentation was performed. A network trained on a small sample generally tends to overfit; data augmentation was applied to avoid this. It was carried out in two parts: (1) offline augmentation of the original datasets (a 16-fold expansion) with flips and 90-degree rotations, plus non-90-degree random rotations, elastic distortion, scaling, and random noise; and (2) "on-the-fly" augmentation in the generator module, to speed up the process and compensate for memory limitations. A sketch of the offline flip/rotation part is given below.
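In this sketch, the assumption that the reported 16-fold expansion arises from two flip axes combined with four in-plane rotations (2 × 2 × 4 = 16) is ours; the paper does not spell out the combination.

```python
import numpy as np

def flip_rot_augment(volume: np.ndarray, mask: np.ndarray):
    """Yield 16 variants of a (z, y, x) volume and its mask:
    2 z-flips x 2 y-flips x 4 in-plane 90-degree rotations."""
    for flip_z in (False, True):
        for flip_y in (False, True):
            v, m = volume, mask
            if flip_z:
                v, m = np.flip(v, axis=0), np.flip(m, axis=0)
            if flip_y:
                v, m = np.flip(v, axis=1), np.flip(m, axis=1)
            for k in range(4):  # 0, 90, 180, 270 degrees in the axial plane
                yield np.rot90(v, k, axes=(1, 2)), np.rot90(m, k, axes=(1, 2))
```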

Finally, for lesion segmentation we employed "white sampling," a technique in which the area of interest is specifically targeted during training. In liver segmentation, most of each trans-axial image carries a mask value of one (shown as a white area), because the liver occupies the major part of the slice; segmentation could therefore proceed without additional modification. In lesion segmentation, by contrast, only a small portion of the image belongs to the lesion, most of the mask outside the lesion is black, and the position of lesions within the liver varies from image to image. We therefore used random white sampling for training, which focuses on the regions surrounding randomly chosen voxels with a mask value of one.
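A sketch of random white sampling under stated assumptions (NumPy arrays, a volume at least as large as the patch, and a mask containing at least one lesion voxel); the patch size matches the lesion patch size given in the next section.

```python
import numpy as np

def sample_white_patch(volume, mask, patch_size=(64, 64, 32), rng=None):
    """Extract a training patch centered near a random mask==1 ("white") voxel."""
    rng = rng or np.random.default_rng()
    ones = np.argwhere(mask == 1)            # candidate white voxels
    center = ones[rng.integers(len(ones))]   # pick one at random
    starts = [int(np.clip(c - p // 2, 0, s - p))  # keep patch inside volume
              for c, p, s in zip(center, patch_size, volume.shape)]
    sl = tuple(slice(s, s + p) for s, p in zip(starts, patch_size))
    return volume[sl], mask[sl]
```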

Model architecture

The TensorFlow library in Python v3.7 was used for this purpose. The DL model employed throughout this paper is the one developed by Isensee in 2017, referred to here as "Isensee 2017." This model is grounded in a U-Net-based architecture and was specifically designed for segmentation. At its core, the network is a convolutional neural network (CNN), originally proposed for the brain tumor segmentation challenge BraTS [17], whose goal is to advance the state of the art in tumor segmentation by providing a large dataset of annotated low-grade gliomas and high-grade glioblastomas. We used the same parameters as Isensee 2017; the hyperparameters were as follows: optimizer, Adam; initial learning rate, 1e-4; loss function, weighted Dice coefficient loss; activation function, sigmoid; dropout rate, 0.2; batch size, 8; and number of epochs, 400. Patch sizes of 96 × 96 × 32 for the liver and 64 × 64 × 32 for liver tumors were used, with 16 base filters (the number of filters varying across layers) and a network depth of 5.
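A sketch of this configuration in TensorFlow/Keras terms; `build_isensee_unet` is a hypothetical constructor (the actual Isensee 2017 implementation was used in the study), and the soft Dice loss below is a simplified stand-in for the weighted Dice coefficient loss.

```python
import tensorflow as tf

def soft_dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice loss for (batch, z, y, x, channel) tensors; a simplified
    stand-in for the weighted Dice coefficient loss of Isensee 2017."""
    axes = tuple(range(1, 5))
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denom = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    return 1.0 - tf.reduce_mean(2.0 * intersection / (denom + eps))

config = dict(learning_rate=1e-4, dropout=0.2, batch_size=8, epochs=400,
              base_filters=16, depth=5,
              patch_size_liver=(96, 96, 32), patch_size_lesion=(64, 64, 32))

# Hypothetical model construction and compilation:
# model = build_isensee_unet(input_shape=(*config["patch_size_liver"], 2),
#                            base_filters=config["base_filters"],
#                            depth=config["depth"], dropout=config["dropout"])
# model.compile(optimizer=tf.keras.optimizers.Adam(config["learning_rate"]),
#               loss=soft_dice_loss)
```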

In the final training effort, the network was trained for 400 epochs on both the liver and lesion datasets. The T1w and T2w data of each patient were fed into the network via two channels. Training continued until the Dice coefficient on the validation data plateaued. The Dice coefficient [18] was calculated with the following equation:

$$ Dice\left({Y}_{true},\ {Y}_{pred}\right)=\frac{2\left|{Y}_{true}\cap {Y}_{pred}\right|}{\left|{Y}_{true}\right|+\left|{Y}_{pred}\right|+\epsilon}$$
(1)

where Ytrue is the image annotation (ground truth), Ypred is the resulting mask (i.e., liver or lesion mask), ϵ is a small value added to prevent a zero denominator, ∩ denotes intersection, and |·| denotes the cardinality of its argument (i.e., the number of non-zero elements in the mask). The T1w and T2w images were fed through two different CNN input channels to predict two different outputs: a segmentation for the T1w images and one for the T2w images. To quantify the final performance of the network, the Dice coefficient on the hold-out test dataset was calculated using the binary average of the T1w and T2w sequences.
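A direct NumPy rendering of Eq. (1) for two binary masks (an illustration, not the authors' code):

```python
import numpy as np

def dice(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient of two binary masks, per Eq. (1)."""
    y_true, y_pred = y_true > 0, y_pred > 0
    intersection = np.logical_and(y_true, y_pred).sum()
    return 2.0 * intersection / (y_true.sum() + y_pred.sum() + eps)
```

In a Keras training loop, the plateau criterion described above would correspond to an EarlyStopping-style callback monitoring a validation Dice metric.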

Due to the limited size of our MRI dataset and to make the most of our labels, we used the weights from the liver segmentation training as the initial weights for tumor segmentation, enabling the model to continue learning from the point where it previously stopped.
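A sketch of this weight transfer in Keras terms; the file name is hypothetical, and the tumor model is assumed to share the liver model's layer structure.

```python
import tensorflow as tf

def init_from_liver_weights(tumor_model: tf.keras.Model,
                            weights_path: str = "liver_weights.h5"):
    """Load weights saved from the trained liver model (e.g., via
    liver_model.save_weights(weights_path)) into the tumor model. This works
    across the different patch sizes (96 x 96 x 32 vs. 64 x 64 x 32) because
    convolution kernels are independent of the input's spatial extent."""
    tumor_model.load_weights(weights_path)
    return tumor_model
```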

The network was trained on a desktop PC (NVIDIA 2080 Ti GPU, 32 GB RAM).

Fig. 1

The visual representation of the method

Results

The code and weights of the trained network were uploaded to GitHub. The liver and lesion segmentations were tested on 18 cases. The average Dice coefficient for the binary average of the T1w and T2w sequences was 88% for the liver and 53% for liver tumors. Diagnostic accuracy and statistics for the performance of the trained network on the test data are presented in Table 2. Nine patients without liver tumors were included; however, since normal liver was not considered in the training, the segmentation results were not optimal for them. Figures 2 and 3 show the loss-validation curves for liver and lesion segmentation training. Figures 4 and 5 present the segmentation results for the liver and liver tumors on the T1w and T2w sequences, and Fig. 6 presents the binary averages. Training with the specifications described above, for either liver or lesion segmentation (excluding data preparation), took about three days. Segmenting the liver or a liver lesion on a previously unseen T1w or T2w image took almost five minutes.

Table 2 Diagnostic accuracy and statistical analysis of the trained network’s performance on the test data
Fig. 2

Training loss and validation loss of model training for liver segmentation

Fig. 3

Training loss and validation loss of model training for liver lesion segmentation

Fig. 4

A typical slice of the original MRI (a & e), ground truth labels (b & f), and network-generated liver segmentation (c & g); the upper row shows T1w and the lower row T2w of the same patient; d and h present the difference between the ground truth labels and the network predictions. The gray color (float values) in f results from the registration process

Fig. 5

A typical slice of the original MRI (a & e), ground truth labels (b & f), and network-generated lesion segmentation (c & g); the upper row shows T1w and the lower row T2w of the same patient as in Fig. 4; d and h present the difference between the ground truth label and the network prediction. The gray color (float values) in f results from the registration process

Fig. 6

A typical slice of the original T1w and T2w images (a & b), the corresponding binary-averaged annotated labels (c & e), and the binary-averaged prediction masks for the liver and tumor (d & f)

Discussion

The Isensee 2017 network performed better on liver segmentation than on lesion segmentation. Visually, the majority of lesions were detected, although their shapes and boundaries were not depicted accurately. The roughly consistent shape of the liver across individuals, together with the large white portion it occupies on the images relative to the variably sized and shaped lesions, plausibly makes liver boundaries easier for the network to detect than lesion boundaries.

Compared with other studies, the Dice score of the current U-Net deep CNN is not optimal (88% vs. generally 94-97%; Table 1). The reasons for the difference can be listed as follows:

(1) Segmentation of CT data, which provide higher resolution, is more accurate than segmentation of MRI data. Reasonably, the Dice similarity coefficients were higher in the studies cited in Table 1 that segmented CT images than in the current study.

(2) MRI data from low-noise public databases differ essentially from clinical MRI images, which carry more noise and uncontrolled clinical conditions, including varying voxel sizes.

(3) Using MRI and CT data together facilitates boundary detection but reduces clinical applicability, because MRI and CT images are not usually available simultaneously. Plausibly, the study of Wang K et al. [11], which employed both CT and MRI data, achieved better results for this reason: the high-resolution CT data dominate the noisy MRI data during training.

(4) Few studies of liver segmentation have worked on a 3D MRI dataset using DL [8,9,10,11]. In real clinical scenarios, the whole (i.e., 3D) liver must be segmented, and 2D segmentation causes boundary indentations and final inaccuracies. If AI is to segment the liver in real clinical applications, 3D segmentation should be targeted and its obstacles solved. The reason 2D datasets are more commonly used is that prepared 3D datasets, particularly of MRI images, are scarce in online libraries; the data of the current study could serve this purpose in future work. Furthermore, manual annotation of liver and liver lesion slices, as performed in the current work, is very cumbersome, and training on 3D datasets requires advanced systems with greater data processing capacity. Essentially, the results of training on 3D MRI data cannot be robustly compared with those from 2D datasets, since the corresponding networks have different numbers of parameters and convolutional layers.

A novel aspect of the current study is that liver and lesion segmentation on 3D MRI, applied simultaneously to T1w and T2w data, has not previously been reported for DL. The only available study using DL on MRI data for 3D liver and lesion segmentation was carried out by Christ et al. [19]. They reported Dice coefficients of 86% and 69% for the liver and liver lesions, respectively, which are comparable with the results of the current study. While their cascaded fully convolutional neural networks (CFCNs) surpassed our model in lesion detection, our method showed a slight advantage in liver segmentation. Since their CNN was trained on diffusion-weighted MR (DW-MR), in which liver tumors are better visualized than in T1w and T2w images, their superior lesion results are reasonable. It should also be noted that Christ et al. trained their MRI model on a smaller dataset of only 31 cases.

In the current study, the Isensee 2017 network was chosen because it is a pure 3D U-Net-based CNN developed for segmentation. Compared with previous machine learning (ML) based approaches on MRI images, the network's performance was superior to those reported by Pratondo et al. and Häme et al. [20, 21]. This is because DL generalizes across a wide spectrum of data and, unlike ML, requires no hand-crafted feature extraction [7, 20, 23, 24].

Preparing the MRI data for DL was challenging and time-consuming, since real hospital-based MRI images of patients were used (uploaded to https://www.cancerimagingarchive.net; accepted and available to the reviewer by correspondence). The use of MRI was important because it is currently the most accurate imaging modality for hepatic lesions. The method could be used for expedited research and for dosimetry purposes. Furthermore, the corresponding liver attenuation coefficient could be employed for liver masking during attenuation correction in PET/MR systems. Although poor lesion delineation hinders lesion dosimetry, such a failure would only negligibly affect attenuation correction-based applications of the proposed network.

A drawback of the current study is that the performance of the trained network for lesion detection on images without liver lesions was not assessed. In future work, the behavior of the lesion detection algorithm on lesion-free images should be investigated. Furthermore, both T1w and T2w images were used here; one might instead train the AI on a single sequence, for example T1w images only. Comparing such single-sequence applications of T1w and T2w images may provide further insights in future studies. Lastly, adding classifiers to determine the detection rate of tumor segmentation would give more insight into model performance.

Conclusion

In the current study, T1w and T2w 3D images were prepared and used to train the Isensee 2017 network via two channels. The results demonstrate the capability of DL to utilize T1w and T2w data of each patient simultaneously to segment the liver. This method can be applied in future studies to research on MRI images, dosimetry, and attenuation correction in PET/MRI scanners.