1 Introduction

Ultrasound (US) imaging is commonly used in nephrology for diagnostic studies of the kidneys and urinary tract. Anatomic measures of the kidney computed from US data, such as renal parenchymal area, are correlated with kidney function [1], and pattern classifiers built upon US imaging features could aid kidney disease diagnosis [2, 3]. Recent deep learning studies have demonstrated that automated US data analysis could achieve promising performance in a variety of tasks, including segmentation and classification [4,5,6]. However, it remains challenging to automate kidney disease diagnosis based on clinical 2D US scans. In clinical practice, multiple 2D US scans of the same kidney are often collected in different views, and these multi-view scans have heterogeneous appearance, each providing only partial anatomic information of the kidney, as illustrated by Fig. 1. It is therefore desirable to develop a clinically useful diagnosis model that is robust to different views of US images of the same kidney.

Fig. 1. Multi-view 2D US scans of the same kidney. The images shown in the 1st column have abnormal appearance annotated by radiologists, while those shown in the 2nd and 3rd columns have heterogeneous appearance.

Multiple instance learning (MIL) is an ideal tool to build a robust classifier on multi-view 2D US scans of the same kidneys by treating the multi-view 2D US scans of the same kidney as multiple instances of a bag and predicting a bag-level classification label [7]. To effectively solve the MIL classification problem, a number of methods have been developed [7]. Among the existing MIL methods, neural network based methods have demonstrated promising performance in a variety of MIL problems, partially due to their end-to-end learning capability [8,9,10,11]. Particularly, neural networks could be used to estimate instance-level classification probabilities and fuse them with a log-sum-exp based max operator [8] or a max operator [9] to generate a bag-level classification probability in an end-to-end learning framework. Since the instance-level classification might be affected by the instance label instability problem [12], several embedded-space based deep MIL methods have been developed to learn informative features at the instance level, generate a bag mapping with a permutation-invariant MIL pooling operator, and build a bag-level classifier on the embedded space in an end-to-end learning framework [10, 11]. Particularly, an attention-based MIL pooling has been developed to learn a weighted average of instances [11].

However, the existing deep MIL methods ignore classification labels of instances that are often available in training data, such as those shown in the 1st column of Fig. 1, and that could potentially improve the MIL classification performance if properly integrated. Furthermore, potential correlation between instances of the same bag has not been well explored in the existing deep MIL methods, which may lead to suboptimal instance-level features. In order to overcome these limitations and further improve deep MIL methods, we develop a novel deep MIL method to learn a deep MIL classification model in an end-to-end learning framework, and apply it to kidney disease diagnosis based on multi-view 2D US images. Particularly, we build a MIL classifier to distinguish kidneys from patients with different kidney diseases based on their multi-view 2D US images. We adopt convolutional neural networks (CNNs) [13] to learn informative US image features, and adopt graph convolutional neural networks (GCNs) [14] as a permutation-invariant operator to further optimize the instance-level CNN features by exploring potential correlation among different instances of the same bag. We adopt the attention-based MIL pooling to learn an optimal permutation-invariant MIL pooling operator in conjunction with learning a bag-level classifier on the embedded space [11]. We further adopt instance-level supervision to enhance the learning of instance features with a focus on instances with reliable labels in the training data. We have validated our method based on clinical 2D US images collected from patients at a local hospital. Extensive comparison and ablation studies have demonstrated that the proposed method could improve upon existing deep MIL methods.

2 Methodology

We model the kidney disease diagnosis problem based on multi-view 2D US kidney images as a MIL classification problem. Particularly, given kidneys \( X_{i} , i = 1, \ldots ,N \), and their 2D US scans in different views, \( x_{ik} , k = 1, \ldots ,K, \) the class label of kidney \( X_{i} \) is \( Y_{i} = 0 \) if all \( x_{ik} \) are normal, and \( Y_{i} = 1 \) otherwise. We build a deep MIL network upon recent advances in deep MIL methods that facilitate end-to-end optimization of learning informative features at the instance level, generating a bag mapping with a permutation-invariant MIL pooling operator to embed bags into an embedded space, and building a bag-level classifier on the embedded space [10, 11]. As illustrated by Fig. 2, our network consists of CNNs to learn instance-level features from 2D US kidney images, GCNs as permutation-invariant operators to further improve instance-level features, the attention-based MIL pooling to learn a bag-level classifier using fully connected neural networks (FCNs), and instance-level supervision to enhance the instance-level feature learning and the bag-level classification.
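The bag-labelling rule above can be illustrated with a minimal Python sketch. The function name `bag_label` and the toy bags are hypothetical and only serve to show the rule: a kidney is labelled 1 as soon as at least one of its scans is abnormal.

```python
def bag_label(instance_labels):
    """Return 0 if every instance is normal (0), otherwise 1."""
    return int(any(label == 1 for label in instance_labels))


# Hypothetical bags: each list holds per-scan labels for one kidney.
bags = {
    "kidney_a": [0, 0, 0, 0],  # all views normal
    "kidney_b": [0, 1, 0, 0],  # one abnormal view makes the bag positive
}
labels = {name: bag_label(scans) for name, scans in bags.items()}
```

Note that the rule is permutation invariant: reordering the scans of a kidney never changes its label.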

Fig. 2. Network architecture of the proposed deep MIL method. The instance-level supervision is denoted by the yellow circles and the bag-level supervision is denoted by the red circle. (Color figure online)

2.1 Learning Image Features for Instances Using CNNs and GCNs with Instance-Level Supervision

To learn informative image features from 2D US kidney images, we adopt CNNs in conjunction with nonlinear activation functions, as illustrated by Fig. 2. Particularly, each 2D US kidney image, \( x_{ik} , \) is first fed into multiple layers of CNNs followed by nonlinear activation functions (in the present study, we use 3 CNN layers and ReLU activation). We denote the CNN output of \( x_{ik} \) by \( h_{ik} \).

As illustrated by Fig. 1, instances of the same bag are potentially correlated with each other. The CNNs cannot utilize such correlation information since they are applied to each instance of a bag separately. Graph-based modeling provides an effective means of modeling such unorganized instances. By modeling each instance as a graph node and connecting every pair of instances with an edge weighted by their similarity, we could model the instances of a bag as an undirected graph in a graph convolutional network (GCN) framework [14]. Particularly, new features on the graph nodes could be learned by optimizing weights of GCNs.

Given the CNN features of different instances of the same bag, \( h_{ik} , k = 1, \ldots ,K \), we build a bag-level graph \( G = \left\{ {V,E} \right\} \) by treating each instance as a graph node and connecting each pair of nodes with a weight measuring their similarity based on their CNN features. GCNs are then adopted to learn new features on the graph nodes [14]:

$$ H^{{\left( {l + 1} \right)}} = \sigma (\tilde{D}^{{ - \frac{1}{2}}} A\tilde{D}^{{ - \frac{1}{2}}} (H^{\left( l \right)} )^{T} W^{\left( l \right)} ), $$
(1)

where \( A \), with its elements denoted by \( a_{ij} \), is the symmetric adjacency matrix of the undirected graph \( G \), \( \tilde{D} \) with \( \tilde{D}_{ii} = \sum\nolimits_{j} {a_{ij} } \) is its diagonal degree matrix, \( W^{\left( l \right)} \) is a layer-specific trainable weight matrix of GCNs, \( \sigma \left( \cdot \right) \) is a nonlinear activation function, \( H_{i}^{\left( l \right)} = \left\{ {h_{i1}^{l} ,h_{i2}^{l} , \ldots ,h_{iK}^{l} } \right\}, h_{ik}^{l} \in R^{F} \) is the set of node features obtained by the \( l \)th GCN layer, \( H_{i}^{{\left( {l + 1} \right)}} = \left\{ {h_{i1}^{l + 1} ,h_{i2}^{l + 1} , \ldots ,h_{iK}^{l + 1} } \right\}, h_{ik}^{l + 1} \in R^{M} \) is the set of new node features obtained by the \( \left( {l + 1} \right) \)th GCN layer, \( K \) is the number of nodes, and \( F \) and \( M \) are the numbers of features on each node obtained by the \( l \)th and the \( \left( {l + 1} \right) \)th GCN layers, respectively.
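As a rough illustration of the propagation rule in Eq. (1), the following pure-Python sketch applies one GCN layer to a toy bag. It assumes that node features are stored one row per instance (a K x F list of lists) and that ReLU is used as the activation; the helper names `matmul` and `gcn_layer` and all numeric values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 A D^-1/2 H W).

    adj:    K x K symmetric adjacency matrix with positive row sums
    feats:  K x F node features, one row per instance
    weight: F x M trainable weight matrix
    """
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in adj]
    norm_adj = [
        [inv_sqrt_deg[i] * a_ij * inv_sqrt_deg[j] for j, a_ij in enumerate(row)]
        for i, row in enumerate(adj)
    ]
    out = matmul(matmul(norm_adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Toy bag with K = 2 fully connected instances, F = 2 input and M = 1 output features.
adj = [[1.0, 1.0], [1.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
weight = [[2.0], [1.0]]
out = gcn_layer(adj, feats, weight)  # each node averages both instances
```

Because every node aggregates the features of its neighbours before the linear map, the two instances in this toy bag end up with the same output feature.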

In the present study, we adopt a Euclidean distance function to obtain the adjacency matrix based on the input feature \( H^{\left( l \right)} \):

$$ a_{ij} = \exp \left( { - \left\| {h_{i}^{l} - h_{j}^{l} } \right\|^{2} } \right). $$
(2)
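Eq. (2) can be sketched in a few lines of Python. The function name `gaussian_adjacency` and the toy feature vectors are hypothetical; the computation itself follows the formula directly.

```python
import math


def gaussian_adjacency(feats):
    """Return the matrix a_ij = exp(-||h_i - h_j||^2) for a list of feature vectors."""

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [[math.exp(-sq_dist(h_i, h_j)) for h_j in feats] for h_i in feats]


# Three toy instance features of one bag.
feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
adj = gaussian_adjacency(feats)
```

With this definition every diagonal entry equals 1, so each node is connected to itself and every row sum is at least 1, which keeps the degree normalization in Eq. (1) well defined.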

To guide the feature learning, an instance-level supervision is adopted. Particularly, instances with reliable positive labels and all negative instances are used to optimize the feature learning using a softmax loss function. For a two-layer GCN, its forward model takes the simple form:

$$ Z = f\left( H^{\left( 0 \right)} \right) = A_{2} {\text{ReLU}}\left( A_{1} (H^{\left( 0 \right)} )^{T} W^{\left( 0 \right)} \right) W^{\left( 1 \right)} , $$
(3)

where \( W^{\left( 0 \right)} \in R^{F \times M} \) and \( W^{\left( 1 \right)} \in R^{M \times 2} \) are GCN weight matrices, \( H^{\left( 0 \right)} \) denotes the input CNN features, and \( A_{i} , i = 1,2, \) is the adjacency matrix of the ith layer, which is computed based on the input features of that layer. The second GCN layer yields the instance-level feature \( Z^{T} = \left\{ {z_{1} ,z_{2} , \ldots ,z_{K} } \right\} \) with \( z_{k} \in R^{2} \), and the instance-level classification probability \( P^{T} = \left\{ {p_{1} ,p_{2} , \ldots ,p_{K} } \right\} \) is obtained by applying a row-wise softmax activation function.
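The row-wise softmax that turns the two-dimensional instance features into instance-level class probabilities can be sketched as follows. The function name `row_softmax` and the toy values of `Z` are hypothetical; subtracting the row maximum is a common numerical-stability step and is an addition of this sketch, not something stated in the text.

```python
import math


def row_softmax(logits):
    """Apply softmax independently to each row of a K x C matrix."""
    probs = []
    for row in logits:
        top = max(row)  # subtracting the row maximum does not change the result
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


# Toy output of the second GCN layer for K = 3 instances and 2 classes.
Z = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
P = row_softmax(Z)
```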

2.2 Attention-Based MIL Pooling Layer with a Gating Mechanism

Once we obtain the instance-level features, we aggregate them to obtain an embedded-space representation using a MIL pooling operator. Instead of adopting simple mean or max MIL pooling, we adopt a gated attention-based MIL pooling layer [11]. Particularly, the attention-based MIL pooling layer learns a weighted average operator to aggregate instance features with a gating mechanism. Given a bag of \( K \) instances with GCN features \( H^{l + 1} = \left\{ {h_{1}^{l + 1} ,h_{2}^{l + 1} , \ldots ,h_{K}^{l + 1} } \right\}, h_{k}^{l + 1} \in R^{M} \), the gated attention-based MIL pooling weight \( a_{k} \) is computed as

$$ a_{k} = \frac{{\exp \left\{ {w^{T} \left( {\tanh \left( {\varvec{V}\left( {h_{k}^{l + 1} } \right)^{T} } \right) \bullet {\text{sigm}}\left( {\varvec{U}\left( {h_{k}^{l + 1} } \right)^{T} } \right)} \right)} \right\}}}{{\sum\nolimits_{j = 1}^{K} \exp \left\{ {w^{T} \left( {\tanh \left( {\varvec{V}\left( {h_{j}^{l + 1} } \right)^{T} } \right) \bullet {\text{sigm}}\left( {\varvec{U}\left( {h_{j}^{l + 1} } \right)^{T} } \right)} \right)} \right\}}} , $$
(4)

where \( w \in R^{L \times 1} \) and \( \varvec{V},\varvec{U} \in R^{L \times M} \) are parameters to be optimized, \( \bullet \) is an element-wise multiplication, tanh (·) is the hyperbolic tangent non-linear activation function, and sigm (·) is the sigmoid function that serves as the gating mechanism [11]. The embedded-space representation \( Z_{X} \) of a bag \( X \) is then defined as:

$$ Z_{X} = \sum\nolimits_{k = 1}^{K} {a_{k} z_{k} } . $$
(5)

Once the embedded-space representation of bags is obtained, a softmax operation is applied to obtain the bag positive score \( P_{X} \).
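The gated attention pooling of Eqs. (4) and (5) can be sketched in pure Python as below. The function takes the vectors used for scoring (`feats`) and the vectors to be averaged (`values`) as separate arguments, which is an assumption of this sketch; all names and parameter values are hypothetical.

```python
import math


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_pool(feats, values, V, U, w):
    """Return (attention weights, pooled vector) for one bag.

    feats:  K vectors of length M used to compute the attention scores
    values: K vectors that are averaged with the attention weights
    V, U:   L x M parameter matrices; w: parameter vector of length L
    """
    scores = []
    for h in feats:
        gated = [math.tanh(t) * sigmoid(g) for t, g in zip(matvec(V, h), matvec(U, h))]
        scores.append(sum(w_l * g_l for w_l, g_l in zip(w, gated)))
    top = max(scores)  # stabilise the softmax over instances
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    pooled = [sum(a * z[d] for a, z in zip(weights, values)) for d in range(dim)]
    return weights, pooled


# Toy bag with K = 3 instances, M = 2 features and L = 2 hidden units.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.0, 0.0], [0.0, 0.0]]  # all gates equal sigm(0) = 0.5
w = [1.0, 1.0]
weights, pooled = gated_attention_pool(feats, feats, V, U, w)
```

Since the weights are normalised over the instances of a bag and the pooled vector is a weighted sum, shuffling the instances leaves the bag representation unchanged up to floating-point rounding.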

2.3 Jointly Training the Instance-Level and Bag-Level Loss Functions

Once the bag positive score \( P_{X} \) is obtained, the bag-level loss function is defined as:

$$ L_{X} = - \left\{ {Y\ln P_{X} + \left( {1 - Y} \right)\ln \left( {1 - P_{X} } \right)} \right\} . $$
(6)

To utilize information of instances with reliable labels, we also supervise the instance-level classification by optimizing a cross-entropy loss function:

$$ L_{M} = - \sum\nolimits_{{n \in N_{Y} }} \sum\nolimits_{c = 1}^{2} {y_{nc} \ln p_{nc} } , $$
(7)

where \( p_{nc} \) is the classification probability of an instance, \( y_{nc} \) is its ground truth classification label, and \( N_{Y} \) is the set of node indices that have reliable classification labels in a bag \( X \). Finally, an overall loss function is defined as:

$$ L = L_{M} + L_{X} . $$
(8)
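A minimal sketch of the joint loss in Eqs. (6) to (8) for a single bag is given below. It assumes one-hot instance labels, so the inner sum over classes reduces to the log-probability of the labelled class; the function names and the toy probabilities are hypothetical.

```python
import math


def bag_loss(y, p_bag):
    """Binary cross-entropy of Eq. (6) for bag label y and bag positive score p_bag."""
    return -(y * math.log(p_bag) + (1 - y) * math.log(1.0 - p_bag))


def instance_loss(probs, labels, reliable):
    """Cross-entropy of Eq. (7), summed over instances with reliable labels only.

    probs:    K rows of [p_class0, p_class1]
    labels:   K class indices (0 or 1)
    reliable: indices of the instances whose labels are trusted
    """
    return -sum(math.log(probs[n][labels[n]]) for n in reliable)


def total_loss(y, p_bag, probs, labels, reliable):
    """Overall loss of Eq. (8): instance-level plus bag-level term."""
    return instance_loss(probs, labels, reliable) + bag_loss(y, p_bag)


# Toy positive bag with three instances, of which only the first has a reliable label.
probs = [[0.1, 0.9], [0.6, 0.4], [0.3, 0.7]]
labels = [1, 1, 1]
loss = total_loss(1, 0.8, probs, labels, reliable={0})
```

In this toy bag the second instance is assigned a low positive probability, but it does not contribute to the loss because its label is not in the reliable set.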

3 Experimental Results

3.1 Clinical US Kidney Scans

We evaluated our method based on a data set of clinical US kidney scans of kidney patients collected at the Children’s Hospital of Philadelphia (CHOP). The work described has been carried out in accordance with the Declaration of Helsinki. The study has been reviewed and approved by the institutional review board.

Participants were randomly sampled from two patient groups. One group consisted of children with mild hydronephrosis (MH), which does not affect the echogenicity, growth, or function of the affected or contralateral normal kidney. The other group consisted of children with congenital anomalies of the kidneys and urinary tract (CAKUT), with varying degrees of increased cortical echogenicity, decreased corticomedullary differentiation, and hydronephrosis. All images were obtained for routine clinical care. The first US scans after birth were used, and all identifying information was removed. In total, the data set comprised 105 MH patients with 2246 scans and 120 CAKUT patients with 2687 scans. All the MH scans were labeled as negative instances with reliable classification labels, all CAKUT scans were labeled as positive instances, and 335 CAKUT scans with noticeable abnormality from different patients were deemed instances with reliable classification labels. All the US scans were resized to 321 × 321, and their image intensities were linearly scaled to [0, 255].

3.2 Implementation Details

Our network consisted of 3 layers of CNNs, and their numbers of channels were set to 128, 64 and 32, respectively. All the CNN layers had the same kernel size of 5 × 5 and the same stride of 2. Our GCNs had 2 layers, and their numbers of hidden features were set to 64. In the attention-based MIL pooling network, the number of hidden nodes \( L \) was set to 64. The learning rate was 0.0001 and the batch size was set to 6. The maximum number of iteration steps was set to 20000. All the methods were implemented using TensorFlow and executed on a GeForce GTX 6.00 GB GPU.
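As a worked example of these settings, the sketch below computes the spatial size of the feature maps after each of the three convolutional layers for a 321 × 321 input. The padding scheme is not stated in the text, so both common conventions ("valid", i.e. no padding, and "same", i.e. zero padding) are shown as assumptions; the function names are hypothetical.

```python
import math


def conv_out_size(size, kernel, stride, padding):
    """Output length of a 1D convolution along one spatial axis."""
    if padding == "valid":
        return (size - kernel) // stride + 1
    if padding == "same":
        return math.ceil(size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


def feature_map_sizes(size, kernel=5, stride=2, layers=3, padding="valid"):
    """Spatial size after each of several identical convolutional layers."""
    sizes = []
    for _ in range(layers):
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


valid_sizes = feature_map_sizes(321, padding="valid")  # [159, 78, 37]
same_sizes = feature_map_sizes(321, padding="same")    # [161, 81, 41]
```

Under either assumption, the last layer with 32 channels yields a feature map of roughly 40 × 40 per channel for each scan.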

3.3 Ablation Studies and Comparisons with Alternative Methods

We compared the proposed network with its degraded versions to investigate how GCNs, the attention-based MIL pooling, and the instance-level supervision contribute to the overall classification based on validation datasets. All the networks had the same number of parameters. Particularly, we first implemented the proposed network with only the bag-level loss function (Bag-level MINN), i.e., without the GCNs (which were replaced with FCNs having the same number of hidden nodes), the attention-based MIL pooling, and the instance-level loss. Then, Bag-level MINN was enhanced by adding the attention-based MIL pooling (Bag-level MINN+attention), the GCNs (Bag-level MINN+GNN+attention), and the instance-level supervision (Bag-level MINN+GNN+attention+all instance supervised). In the implementation of Bag-level MINN+GNN+attention+all instance supervised, all instances of the positive bags were labelled as positive.

In the ablation studies, we randomly selected 79 MH and 99 CAKUT patients as a training data set, and 45 subjects randomly selected from the remaining data were used as a validation data set. This procedure was repeated twice to estimate the classification performance of different versions of the proposed method. Their classification results are summarized in Table 1, demonstrating that GCNs, the attention-based MIL pooling, and the instance-level supervision based on instances with reliable labels could improve the MIL classification performance. These results also indicated that the instance-level supervision based on all instances might be affected by the instance label instability problem [12].

Table 1. Comparison results of different versions of the proposed method (mean±std).

We further evaluated our method and compared it with state-of-the-art MIL methods, including CNN based instance-level classification with max MIL pooling (minet) [9], an embedded-space based deep MIL method with mean MIL pooling (Minet+mean) [10], and an embedded-space based deep MIL method with an attention-based MIL pooling (Gated-Attention) [11]. All the deep MIL methods under comparison had the same CNNs with the same numbers of parameters. In the minet, we labelled all instances of the positive bags as positive instances. The classification performance of these methods was estimated using 5-fold cross-validation. All the classification results are summarized in Table 2. These results further demonstrated that our method could improve the classification performance of the state-of-the-art MIL methods.

Table 2. Comparison results of different MIL methods (mean±std)
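The 5-fold cross-validation used for the comparison can be sketched as follows. The sketch assumes that the folds are formed at the patient (bag) level, so that all scans of a patient fall into the same fold, and that the split is a plain shuffled partition without stratification; neither detail is stated in the text, and the function name and seed are hypothetical.

```python
import random


def k_fold_bags(bag_ids, k=5, seed=0):
    """Split bag identifiers into k (train, test) pairs with disjoint test folds."""
    ids = list(bag_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [b for j, fold in enumerate(folds) if j != i for b in fold]
        splits.append((train, test))
    return splits


# 225 patients in total (105 MH + 120 CAKUT), identified here by integers.
splits = k_fold_bags(range(225), k=5)
```

Splitting by patient rather than by scan avoids having views of the same kidney in both the training and the test fold.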

Finally, we adopted Grad-CAM to identify informative image regions for the classification [15]. Figure 3 shows Grad-CAM maps of two randomly selected testing subjects with CAKUT. Particularly, instances with relatively larger weights learned by the attention-based MIL pooling are shown from left to right, indicating that our method could capture clinically meaningful image features.

Fig. 3. Two examples of the multi-instance Grad-CAM maps of abnormal subjects with relatively larger attention weights. The largest weight across all test subjects was about 0.06.

4 Conclusions

In this study, we develop a novel multi-instance deep learning method to build a robust classifier to aid kidney disease diagnosis using ultrasound imaging. Our method is built upon recent advances in deep MIL on the embedded space [10, 11] with novel components, including the GCNs to optimize the instance-level features learned by CNNs and the integrated instance-level and bag-level supervision to improve the classification. Extensive ablation studies and comparison experiments have demonstrated that our method could improve upon state-of-the-art deep MIL methods for kidney disease diagnosis. Our future work will be devoted to automatic network architecture optimization and extensive validation of the proposed method based on different data sets.