
1 Introduction

Liver cancer is one of the leading causes of cancer-related death worldwide. Early detection of liver cancer through the analysis of medical images is an effective way to reduce this mortality. High-definition medical images produced by modern imaging devices provide more detailed descriptions of tissue structures and thus facilitate more accurate diagnoses. However, high-definition images and large, unorganized medical datasets pose challenges to doctors in terms of analysis and review. Computer-aided diagnosis (CAD) systems can assist doctors by characterizing focal liver lesion (FLL) images.

Based on clinical observations, different types of liver lesions exhibit different visual characteristics at various time points after intravenous contrast injection. To capture the visual feature transitions of liver tumors over time, multi-phase contrast-enhanced computed tomography (CT) scanning is generally performed on patients suspected of having liver disease. In the multi-phase contrast-enhanced CT procedure, four phases of images are obtained: non-contrast-enhanced (NC) phase images are acquired before contrast injection, arterial (ART) phase images 25–40 s after injection, portal venous (PV) phase images 60–75 s after injection, and delayed (DL) phase images 3–5 min after injection.

Characterization of FLLs, including classification and retrieval, has attracted considerable research interest in recent years. Mir et al. [1] first applied texture analysis to liver characterization, illustrating the importance of gray-level distribution for distinguishing normal from malignant tissue. Yu et al. [2] developed a content-based image retrieval system to differentiate three types of hepatic lesions using global features derived from a nontensor product wavelet filter and local features based on image density and texture. Roy et al. [4] used four types of features derived from four-phase images, namely density, temporal density, texture, and temporal texture, to retrieve the most similar images among five types of liver lesions. Shape features were adopted in [5], in combination with density and texture features, for retrieving five types of FLLs. Compared with the low-level features introduced above, the mid-level bag-of-visual-words (BoVW) feature has proved considerably more effective for classifying and retrieving natural images. Diamant et al. [8] learned BoVW representations of the interior and boundary regions of FLLs to classify three types of FLLs from single-phase CT images. A variant of BoVW called bag of temporal co-occurrence words (BoTCoW) was proposed by Xu et al. [9], in which BoVW is applied to temporal co-occurrence images, constructed by concatenating the intensities of multi-phase images, to extract temporal features for retrieving five types of FLLs from triple-phase CT images. After a common codebook learning procedure, Diamant et al. [11] proposed a visual word selection method based on mutual information to select more meaningful visual words for each specific classification task. In addition to these variants and enhanced versions of BoVW based on the hard-assignment mechanism, Wang et al. [12] learned sparse representations of local structures of multi-phase CT scans, a soft-assignment BoVW method, for FLL retrieval. Research on learning high-level features with deep learning methods, in particular convolutional neural networks (CNNs), is growing rapidly; [13] surveyed the use of deep learning in medical image analysis tasks such as classification, detection, segmentation, and registration. However, because professionally annotated medical images are difficult to collect, current medical image databases are usually too small for deep learning methods, and most existing approaches use pre-trained CNNs only as feature extractors. Applications of deep learning to medical image feature extraction, especially for the classification of focal liver lesions, remain scarce; to our knowledge, BoVW is still the state-of-the-art approach in this field.

However, the conventional vector-based BoVW methods mentioned above analyze the multi-phase images separately and therefore neglect the temporal co-occurrence information. In this study, we explore a multilinear generalization of soft-assignment BoVW, namely a tensor sparse representation approach, for the joint analysis of multi-phase CT images, and we apply the proposed method to the classification of four classes of focal liver lesions.

2 Tensor Sparse Representation of Multi-phase Medical Images

2.1 Tensor Codebook Learning by the Proposed K-CP Algorithm

First, we introduce the notation used throughout this paper. A vector is denoted by a lowercase boldface letter, for example, \({\varvec{x}}\). A matrix is denoted by an uppercase boldface letter, for example, \({\varvec{X}}\). A tensor is denoted by a calligraphic letter, for example, \(\mathcal {X}\). We define tensor multiplication in a way similar to that in [14].

Given a set of tensor training samples \(\mathcal {Y}\), we propose a K-CP method to learn a tensor codebook \(\mathcal {D}\). The proposed K-CP method iterates between two stages: sparse coding, in which the sparse coefficients are computed with the codebook held fixed, and codeword update, in which the codewords are updated based on the computed sparse coefficients.

The first stage can be solved by a tensor generalization of the Orthogonal Matching Pursuit (OMP) algorithm. OMP is a greedy algorithm that finds sparse coefficients of vector-valued signals over a given codebook whose codewords (atoms) are also vectors. In tensor OMP, we are given a collection of samples \(\mathcal {Y} = [\mathcal {Y}_{1},\mathcal {Y}_{2},...,\mathcal {Y}_{N}]\), where each \(\mathcal {Y}_{i}\in \mathbb {R}^{I_{1}\times I_{2}\times ...\times I_{M}}, i=1,2,...,N,\) is an \(M^{th}\)-order tensor, so that \(\mathcal {Y}\in \mathbb {R}^{I_{1}\times I_{2}\times ...\times I_{M}\times N}\) is an \((M+1)^{th}\)-order tensor. Suppose the codebook \(\mathcal {D}\) comprises K tensor codewords \(\mathcal {D}_{k}\in \mathbb {R}^{I_{1}\times I_{2}\times ...\times I_{M}}\); then \(\mathcal {D}\) is an \((M+1)^{th}\)-order tensor. Tensor OMP can be formulated as follows:

$$\begin{aligned} \underset{{\varvec{x}}_{i}}{\min }\; \Vert \mathcal {Y}_{i}-\mathcal {D}\,\bar{\times }_{(M+1)}\, {\varvec{x}}_{i} \Vert ^{2}_{2}, \quad s.t.\; \Vert {\varvec{x}}_{i}\Vert _{0} \leqslant T, \quad i=1,2,...,N \end{aligned}$$
(1)

where the column vector \({\varvec{x}}_{i}\) of \({\varvec{X}}\) represents the combination of codewords that approximates sample \(\mathcal {Y}_{i}\), and T is the sparsity level, i.e., the maximum number of nonzero coefficients.
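For concreteness, a minimal NumPy sketch of this sparse-coding stage is given below. It exploits the fact that the mode-\((M+1)\) product in Eq. (1) is simply a linear combination of the codewords, so tensor OMP reduces to standard OMP on vectorized samples and codewords; the function name, interface, and the internal codeword normalization are our illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def tensor_omp(Y, D, T):
    """Sparse-code each tensor sample in Y over tensor codebook D (cf. Eq. 1).

    Y : array of shape (N, I1, ..., IM)  -- N tensor samples
    D : array of shape (K, I1, ..., IM)  -- K tensor codewords
    T : maximum number of nonzero coefficients per sample
    Returns X of shape (K, N) with at most T nonzeros per column.
    """
    N, K = Y.shape[0], D.shape[0]
    Yv = Y.reshape(N, -1)                     # vectorize the samples
    Dv = D.reshape(K, -1)                     # vectorize the codewords
    Dv = Dv / np.linalg.norm(Dv, axis=1, keepdims=True)
    X = np.zeros((K, N))
    for i in range(N):
        residual, support = Yv[i].copy(), []
        for _ in range(T):
            # pick the codeword most correlated with the current residual
            k = int(np.argmax(np.abs(Dv @ residual)))
            if k not in support:
                support.append(k)
            # least-squares fit of the sample on the current support
            coeffs, *_ = np.linalg.lstsq(Dv[support].T, Yv[i], rcond=None)
            residual = Yv[i] - Dv[support].T @ coeffs
        X[np.array(support), i] = coeffs
    return X
```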

In the codeword update stage, each tensor codeword is updated individually. To update codeword \(\mathcal {D}_{k}\), we first take the row vector \({\varvec{x}}_{k}^{T}\) of \({\varvec{X}}\), whose entries are the coefficients of the samples in \(\mathcal {Y}\) with respect to \(\mathcal {D}_{k}\). We then define the approximation error without codeword \(\mathcal {D}_{k}\) as follows:

$$\begin{aligned} \mathcal {E}_{k}=\mathcal {Y}-\sum _{j\ne k}^{K}{\mathcal {D}_{j}\circ {\varvec{x}}_{j}^{T}} \end{aligned}$$
(2)

The total reconstruction error can be written as follows:

$$\begin{aligned} \Vert \mathcal {Y}-\mathcal {D}\,\bar{\times }_{(M+1)}\, {\varvec{X}}\Vert ^{2}=\Vert \mathcal {E}_{k}-\mathcal {D}_{k}\circ {\varvec{x}}_{k}^{T}\Vert ^{2} \end{aligned}$$
(3)
Algorithm 1. The K-CP algorithm for overcomplete tensor codebook learning

Our aim is to find the optimal \(\mathcal {D}_{k}\) that best approximates the residual \(\mathcal {E}_{k}\) in Eq. (3), which can be obtained by applying the CP decomposition to \(\mathcal {E}_{k}\).

The CP (CANDECOMP/PARAFAC) decomposition factorizes a \(P^{th}\)-order tensor \(\mathcal {D}\) into a sum of rank-one tensors [14]:

$$\begin{aligned} \mathcal {D}\approx \sum _{r=1}^{R}\lambda _{r}({\varvec{d}}_{r}^{1}\circ {\varvec{d}}_{r}^{2}\circ ... \circ {\varvec{d}}_{r}^{P}) \end{aligned}$$
(4)

where \(\circ \) denotes the outer product, each vector \({\varvec{d}}^{p}_{r}\) is assumed to be normalized to unit length, and \(\lambda _{r}\) is the weight of the r-th rank-one tensor.
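As a toy illustration of the notation in Eq. (4) (shapes and values are arbitrary and not from the paper), the following snippet assembles a rank-R approximation of a 3rd-order tensor from unit-length factor vectors and weights:

```python
import numpy as np

# Toy rank-R CP reconstruction of a 3rd-order tensor (Eq. 4):
# D ~= sum_r lambda_r * (d_r^1 o d_r^2 o d_r^3)
R, I1, I2, I3 = 2, 4, 5, 3
d1 = np.random.randn(R, I1); d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
d2 = np.random.randn(R, I2); d2 /= np.linalg.norm(d2, axis=1, keepdims=True)
d3 = np.random.randn(R, I3); d3 /= np.linalg.norm(d3, axis=1, keepdims=True)
lam = np.random.rand(R)

# sum of weighted outer products via einsum; the result has shape (I1, I2, I3)
D = np.einsum('r,ri,rj,rk->ijk', lam, d1, d2, d3)
```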

However, applying CP to \(\mathcal {E}_{k}\) directly would fill in the coefficient vector \({\varvec{x}}_{k}^{T}\), destroying its sparsity. Therefore, we construct an index set \(\omega _{k} = \{i \mid 1\le i \le N, {\varvec{x}}_{k}^{T}(i)\ne 0\}\) that records the nonzero entries of \({\varvec{x}}_{k}^{T}\). According to \(\omega _{k}\), we restrict \(\mathcal {E}_{k}\) and \({\varvec{x}}_{k}^{T}\) to \(\mathcal {E}_{k}^{R}\) and \({\varvec{x}}_{k}^{R}\), respectively. By applying CP with a single rank-one component to \(\mathcal {E}_{k}^{R}\), \(\mathcal {D}_{k}\) can be updated from the decomposition result, and the restricted coefficients \({\varvec{x}}_{k}^{R}\), obtained from the weight \(\lambda \) in Eq. (4), are zero-padded back to the full-length row \({\varvec{x}}_{k}^{T}\).

This CP decomposition of the restricted residual tensor is performed K times in each iteration, once for each of the K tensor codewords; hence, the method is called the K-CP method.

The two stages above are iterated until a pre-specified reconstruction error is reached or the maximum number of iterations is exceeded. The details of the K-CP method for overcomplete tensor codebook learning are given in Algorithm 1.
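The sketch below outlines the codeword-update sweep of one K-CP iteration for 3rd-order samples, to be alternated with the sparse-coding step above. The rank-one CP of the restricted residual is computed here with a few alternating least-squares (power-iteration) updates, which is one common way to obtain a best rank-one approximation; all names, the ALS choice, and the fixed iteration count are our assumptions rather than the exact procedure of Algorithm 1.

```python
import numpy as np

def rank1_cp4(E, n_iter=30):
    """Rank-one CP approximation of a 4th-order tensor E via alternating updates."""
    s = np.random.randn(E.shape[0])
    a, b, c = (np.random.randn(E.shape[1]), np.random.randn(E.shape[2]),
               np.random.randn(E.shape[3]))
    for _ in range(n_iter):
        s = np.einsum('nijk,i,j,k->n', E, a, b, c); s /= np.linalg.norm(s)
        a = np.einsum('nijk,n,j,k->i', E, s, b, c); a /= np.linalg.norm(a)
        b = np.einsum('nijk,n,i,k->j', E, s, a, c); b /= np.linalg.norm(b)
        c = np.einsum('nijk,n,i,j->k', E, s, a, b); c /= np.linalg.norm(c)
    lam = np.einsum('nijk,n,i,j,k->', E, s, a, b, c)    # weight lambda
    return lam, s, np.einsum('i,j,k->ijk', a, b, c)     # sample factor, unit-norm codeword

def kcp_codeword_update(Y, D, X):
    """One codeword-update sweep of K-CP for 3rd-order samples.

    Y : (N, I1, I2, I3) training tensors, D : (K, I1, I2, I3) codewords, X : (K, N) codes.
    """
    for k in range(D.shape[0]):
        omega = np.nonzero(X[k])[0]              # samples that actually use codeword D_k
        if omega.size == 0:
            continue                             # unused codeword: leave unchanged
        # restricted residual E_k^R without the contribution of D_k (cf. Eq. 2)
        others = [j for j in range(D.shape[0]) if j != k]
        E_r = Y[omega] - np.einsum('jabc,jn->nabc', D[others], X[others][:, omega])
        lam, s, D[k] = rank1_cp4(E_r)            # rank-one CP of the restricted residual
        X[k, omega] = lam * s                    # zero-padded back into the full row
    return D, X
```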

Fig. 1. Learning spatiotemporal features from multi-phase images via the proposed tensor sparse coding method

2.2 FLL Classification Using Tensor Sparse Representations of Spatiotemporal Structures

For each patient in the dataset, triple-phase (NC/ART/PV) CT images are available, as described in detail in Sect. 3.1. Based on this structure, spatiotemporal features are extracted using BoVW models whose codebooks are learned with the proposed tensor sparse coding method.

To capture the temporal characteristics of the multi-phase CT images, corresponding slices from the triple-phase CT images are center-aligned according to the tumor masks and stacked to form three-layer volumes. This operation converts the temporal co-occurrence information into spatial information along the third dimension of the constructed volumes. A spatiotemporal codebook is then learned by applying the proposed method to the tensor training samples, which are local descriptors extracted from the three-layer volumes. The spatiotemporal feature of each medical case is computed by summarizing the sparse representations of its local descriptors with mean pooling. The spatiotemporal feature of a query case is computed in the same way using the learned codebook. Finally, the features of the query and of the dataset cases are fed into a support vector machine (SVM) classifier with a radial basis function (RBF) kernel to predict the class of the query case. The workflow is shown in Fig. 1.
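A hedged end-to-end sketch of this workflow is given below, assuming the tensor_omp routine sketched in Sect. 2.1 and a codebook already learned by K-CP; the patch size, stride, pooling, and the use of scikit-learn's SVC are our illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.svm import SVC

def extract_local_volumes(stacked_case, patch=8, stride=4):
    """Slide a window over a center-aligned 3-layer (NC/ART/PV) volume and
    return local 3rd-order descriptors of shape (patch, patch, 3)."""
    H, W, _ = stacked_case.shape
    blocks = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            blocks.append(stacked_case[r:r + patch, c:c + patch, :])
    return np.stack(blocks)                          # (n_blocks, patch, patch, 3)

def case_feature(stacked_case, codebook, T=5):
    """Mean-pool the tensor sparse codes of all local descriptors of one case."""
    codes = tensor_omp(extract_local_volumes(stacked_case), codebook, T)  # (K, n_blocks)
    return codes.mean(axis=1)                        # (K,) spatiotemporal feature

# Training and prediction (illustrative):
# `cases` is a list of center-aligned (H, W, 3) volumes, `labels` their lesion types.
# features = np.stack([case_feature(v, codebook) for v in cases])
# clf = SVC(kernel='rbf').fit(features, labels)
# prediction = clf.predict(case_feature(query_volume, codebook)[None, :])
```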

3 Experiments and Results

3.1 Multi-phase Medical Dataset

A multi-phase medical dataset was constructed with the help of radiologists to evaluate the performance of the proposed method. The dataset comprises four types of FLLs collected from 111 medical cases. For each case, triple-phase (NC/ART/PV) CT images were collected, with a voxel spacing of \((0.5{-}0.8)\times (0.5{-}0.8)\times (5\ \text {or}\ 7)\) mm\(^{3}\). The size of each CT slice was \(512\times 512\) pixels, while the number of slices varied depending on the scanned region (whole body or abdomen only). All tumors in each CT image were manually marked by an experienced medical doctor. In our experiments, however, only the major tumor, that is, the tumor with the largest volume, was considered. As a result, 111 FLLs were used in our experiments: 38 cysts, 19 cases of focal nodular hyperplasia (FNH), 26 cases of hepatocellular carcinoma (HCC), and 28 cases of hemangioma (HEM). Examples of the four types of FLLs are shown in Fig. 2.

Fig. 2. Examples of each lesion type in the three phases. Each row shows images from the same contrast phase, and each column shows the same lesion type: cyst, FNH, HCC, and HEM

3.2 Evaluation Method

Given the small size of the constructed dataset, leave-one-out cross-validation is used for performance evaluation. Classification accuracy is used as the quantitative measure, defined as follows:

$$\begin{aligned} Accuracy=TP/(TP+FP) \end{aligned}$$
(5)

where TP is the number of correctly classified cases, FP is the number of misclassified cases, and \(TP+FP\) is the total number of cases of the corresponding FLL type.
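A minimal sketch of this evaluation protocol with scikit-learn is shown below, assuming per-case features and labels computed as in Sect. 2.2; variable names and the classifier settings are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_per_class_accuracy(features, labels):
    """Leave-one-out cross-validation with per-class accuracy as in Eq. (5)."""
    labels = np.asarray(labels)
    predictions = np.empty_like(labels)
    for train_idx, test_idx in LeaveOneOut().split(features):
        clf = SVC(kernel='rbf').fit(features[train_idx], labels[train_idx])
        predictions[test_idx] = clf.predict(features[test_idx])
    # Eq. (5): per lesion type, correctly classified / total cases of that type
    return {c: float(np.mean(predictions[labels == c] == c)) for c in np.unique(labels)}
```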

Fig. 3. Classification performance of the proposed tensor sparse representation method and the conventional sparse representation method using single-/multi-phase CT images

3.3 Experimental Results

We compared the classification performance of the proposed tensor sparse representation method with that of the conventional sparse representation method on both single- and multi-phase medical images, as shown in Fig. 3. Following most related work, we used PV phase images in the single-phase experiments, since most liver lesion types can be visualized clearly in the PV phase. Interestingly, the two methods produced exactly the same results on single-phase images. With multi-phase images, however, the accuracy improves considerably more with the proposed tensor sparse representation method than with the conventional one, which indicates that the proposed method is more effective at capturing temporal information from multi-phase images. The detailed classification results of the proposed method are shown in Table 1. Owing to their clear texture and temporal enhancement patterns, cysts and FNH are much easier to distinguish from the other types when the temporal co-occurrence information captured by the proposed tensor sparse representation method is used.

A comparison of the proposed method with state-of-the-art methods is given in Table 2. As mentioned in the previous sections, considerable research effort has been invested in exploring variants and enhanced versions of the BoVW model for FLL characterization, and most state-of-the-art methods are based on the BoVW framework. Table 2 compares the proposed method with several such BoVW models. The proposed tensor sparse coding method outperforms the other methods by preserving the spatiotemporal features captured from multi-phase CT images, especially for FNH, which shows markedly different contrast enhancement patterns across phases.

Table 1. Performance of the proposed method
Table 2. Comparison of the classification accuracy (%) of the proposed method with that of state-of-the-art methods

4 Conclusion

In this paper, we proposed the K-CP method to learn tensor sparse representations of multi-phase medical images. We learned tensor codebooks with the proposed method and built BoVW models to extract the spatial features and temporal co-occurrence of multi-phase medical images. Experiments on focal liver lesion classification showed that the proposed method achieves a larger improvement from single-phase to multi-phase images than the conventional sparse representation method and that it outperforms state-of-the-art methods on this task.