1 Introduction

Accurate classification of lymph nodes is of significant clinical value for the diagnosis and prognosis of numerous diseases, such as metastatic cancer, and can assist radiologists in locating abnormalities and diagnosing different diseases. Developing machine learning methods to automatically identify different lymph nodes in MRI scans is challenging because of the similar morphological structures of lymph nodes, the complicated relationships between different lymph node types, and the high cost of collecting large-scale labelled image datasets. Labelling lymph node types in MRI scans is time consuming and expensive, since it requires a well-trained radiologist to examine each MRI slice; it is therefore not realistic to collect a large labelled dataset to train a deep learning system.

Fig. 1. The framework of classical graph-based classification.

In this work, instead of manually labelling images, we extracted lymph node keywords from the clinical notes written by experienced radiologists for the corresponding MRI scans and used the extracted keywords as classification labels for the associated MRI images. We extracted 14 different types of lymph nodes from the clinical reports (as shown in Fig. 3) for 821 T2 MRI key slices. It is worth noting that our dataset is highly imbalanced across lymph node types: some types have more than 80 training images, while others have fewer than 10. Given this small and imbalanced training dataset, it is very challenging to train a classification model with high accuracy and good generalizability.

Motivated by recent work that successfully uses semantic information for image classification and image captioning [10, 11], we propose to leverage semantic features learned from clinical notes, together with a lymph node ontology graph predefined by a radiologist (shown in Fig. 2), for lymph node classification. We learn a semantic feature embedding from the clinical reports and use it to capture the semantic relationships between different lymph node types. We then combine this semantic embedding space with the ontology graph (shown in Fig. 1 (b)) to construct a knowledge graph that guides the classification task. In addition, most existing graph models assume that the graph structure is known or predefined; when it is unknown, they typically use K-nearest neighbors to construct a graph from image features and use it for the downstream image classification task [2, 8]. Defining the graph structure is critical for downstream graph node classification, and recent work shows that learning the graph structure can significantly boost the performance of graph-based classification [12]. We therefore also propose a joint graph structure learning and classification framework with prior information from the semantic features (learned from clinical reports) and the radiologist-defined lymph node ontology graph. We evaluated the proposed model on lymph node classification in T2 MRI scans and show consistent improvements over several state-of-the-art methods given a small and imbalanced training dataset.

Fig. 2. The lymph node ontology graph predefined by an experienced radiologist. We use this ontology graph as a prior to construct our knowledge graph. The 14 labels in our dataset are also shown here.

2 Method

Graph Learning for Classification. We define a graph \({G}= (\mathbf {V}, E)\), where \(\mathbf {V}\) is the set of vertices or nodes (we use nodes throughout the paper) and \(E\) is the set of edges. Let \(\mathbf {v}_i\in \mathbf {V}\) denote a node and \(e_{ij} = (\mathbf {v}_i,\mathbf {v}_j)\in E\) denote an edge pointing from \(\mathbf {v}_j\) to \(\mathbf {v}_i\). The adjacency matrix \(\mathbf {\Lambda }\) is an \(n \times n\) matrix with \(\mathbf {\Lambda }_{ij} = 1\) if \(e_{ij} \in E\) and \(\mathbf {\Lambda }_{ij} = 0\) if \(e_{ij} \notin E\). We only consider undirected graphs in this work, so \(\mathbf {\Lambda }\) is symmetric; it is straightforward to extend our framework to directed graphs with different graph structure regularization terms. Each node has a feature vector \(x_i \in \mathbb {R}^{d\times 1}\), and \(\mathbf {X} =[x_1,\cdots ,x_n] \in \mathbb {R}^{n\times d}\) is the node feature matrix for all nodes in the graph. Let \(\mathbf {y}_i\) denote the class label of node \(x_i\) and \(\mathbf {Y}= [\mathbf {y}_1,\cdots ,\mathbf {y}_n]\) denote the class labels of all nodes on the graph. Conventional graph learning methods define a function \(f(\mathbf {X},\mathbf {\Lambda })=\mathbf {Y}\), taking the node feature matrix \(\mathbf {X}\) and the graph adjacency matrix \(\mathbf {\Lambda }\) as input, to estimate the class label of each node; f can be a simple linear model or a graph neural network. Classical graph learning solves the following problem to learn f,
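For concreteness, the toy sketch below (our illustration, not the authors' code) sets up these quantities in NumPy: a symmetric adjacency matrix \(\mathbf {\Lambda }\), a node feature matrix \(\mathbf {X}\), and the graph Laplacian used in the objectives that follow.

```python
# A toy illustration (not the authors' code) of the graph quantities defined above.
import numpy as np

n, d = 5, 8                                 # hypothetical: 5 nodes, 8-dim features
X = np.random.randn(n, d)                   # node feature matrix X in R^{n x d}

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]    # toy undirected edge list
Lambda = np.zeros((n, n))
for i, j in edges:
    Lambda[i, j] = Lambda[j, i] = 1.0       # symmetric adjacency for an undirected graph

Delta = np.eye(n) - Lambda                  # graph Laplacian as defined in Eq. (1) below
```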

$$\begin{aligned} \begin{aligned} \L (\mathbf {X},\mathbf {Y})&=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {\Lambda })\Vert _p+\lambda f(\mathbf {X},\mathbf {\Lambda })^T \mathbf {f}(\mathbf {X},\mathbf {\Lambda })\\ \text{ s. } \text{ t. }&=\mathbf {I}-\mathbf {\Lambda }, \end{aligned} \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _p\) denotes the \(L_p\) norm, the first term is the label prediction term, the second term is the Laplacian regularization term, \(\mathbf {I}\) is the identity matrix, and \(\mathbf {\Delta }\) is the graph Laplacian. If a multi-layer deep graph neural network is used for f, the objective takes the same form,

$$\begin{aligned} \begin{aligned} \L (\mathbf {X},\mathbf {Y})&=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {\Lambda })\Vert _p + \lambda f(\mathbf {X},\mathbf {\Lambda })^T \mathbf {f}(\mathbf {X},\mathbf {\Lambda })\\ \text{ s. } \text{ t. }&=\mathbf {I}-\mathbf {\Lambda }\end{aligned} \end{aligned}$$
(2)
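As a minimal sketch (assuming p = 2 and a simple one-layer propagation model for f, both of which are our choices rather than the authors'), the objective in Eqs. 1-2 can be written as:

```python
# Sketch of the classical graph learning loss: a label-fitting term plus the
# Laplacian smoothness term f^T Delta f, which encourages connected nodes to
# receive similar predictions.
import torch

def classical_graph_loss(X, Y, Lambda, W, lam=0.1):
    F = Lambda @ X @ W                            # f(X, Lambda): one propagation step + linear map
    Delta = torch.eye(Lambda.shape[0]) - Lambda   # Laplacian from the constraint in Eq. (1)
    fit = torch.norm(Y - F)                       # label prediction term (Frobenius norm, p = 2)
    smooth = torch.trace(F.T @ Delta @ F)         # Laplacian regularization term
    return fit + lam * smooth
```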

These graph learning methods assume that the graph structure is predefined (the adjacency matrix \(\mathbf {\Lambda }\) is given), which is not the case in many applications. For example, the connections or similarities between different patients or different types of lymph nodes are hard to define. In practice, many previous works simply use K-nearest neighbors to build the graph adjacency matrix and then use it for label propagation on the graph. Recent works have shown that defining an optimal graph structure is crucial for label classification on the graph, and that jointly optimizing label propagation and the graph structure can significantly improve the performance and generalizability of graph models [2].
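The K-nearest-neighbor construction mentioned above can be sketched as follows (a common baseline rather than the proposed method; `k` is a hypothetical choice).

```python
# Baseline graph construction: connect each node to its K nearest neighbors in
# feature space and symmetrize the result so the graph is undirected.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_adjacency(X, k=5):
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    return np.maximum(A, A.T)   # symmetrize: keep an edge if either direction exists
```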

Jointly Learning Graph Structure and Classification. We propose to learn the graph structure \(\mathbf {\Lambda }\) jointly with graph label propagation. Let \(\mathbf {\Lambda }= g(\mathbf {X}) \) be the function that learns the graph structure; we define the new loss function as,

$$\begin{aligned} \mathcal {L}(\mathbf {X},\mathbf {\Lambda },\mathbf {Y})&=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {\Lambda })\Vert _p + \lambda _0 f(\mathbf {X},\mathbf {\Lambda })^T \mathbf {\Delta } f(\mathbf {X},\mathbf {\Lambda }) \\ \nonumber&+\lambda _g\Vert \mathbf {\Lambda }-g(\mathbf {X})\Vert _p, \text{ s. } \text{ t. }\ \mathbf {\Delta }=\mathbf {I}-\mathbf {\Lambda }\end{aligned}$$
(3)

There are many ways to construct the graph structure learning function g; for example, previous work [2] uses Bernoulli sampling to learn the graph structure from discrete nodes, or sparse and low-rank subspace learning. We follow [1, 6, 13] and use sparse and low-rank subspace learning in our work; it is straightforward to extend our framework to other graph structure learning methods. We reformulate our objective function as,

$$\begin{aligned} \mathcal {L}(\mathbf {X},\mathbf {\Lambda },\mathbf {Y})&=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {\Lambda })\Vert _p + \lambda _0 f(\mathbf {X},\mathbf {\Lambda })^T \mathbf {\Delta } f(\mathbf {X},\mathbf {\Lambda })\\ \nonumber&+\lambda _1\Vert \mathbf {\Lambda }\Vert _1 +\lambda _2\Vert \mathbf {\Lambda }\Vert _*, \text{ s. } \text{ t. }\ \mathbf {\Delta }=\mathbf {I}-\mathbf {\Lambda }, \mathbf {X}=\mathbf {X}\mathbf {\Lambda }, \end{aligned}$$
(4)

where \(\mathbf {\Lambda }\) is constrained to be low rank via the nuclear norm \(\Vert \cdot \Vert _*\) and sparse via the \(L_1\) norm.
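A sketch of these structure-learning penalties (with hypothetical weights, and with the self-expressiveness constraint \(\mathbf {X}=\mathbf {X}\mathbf {\Lambda }\) relaxed to a penalty as in Eq. 6) could look like the following.

```python
# Sketch of the sparse + low-rank structure learning terms in Eq. (4): an L1
# penalty encourages a sparse adjacency, the nuclear norm encourages low rank,
# and a self-expressiveness penalty keeps nodes representable by their neighbors.
import torch

def structure_regularizer(X, Lambda, lam1=0.01, lam2=0.01, lam3=0.1):
    sparse = lam1 * Lambda.abs().sum()                               # ||Lambda||_1
    low_rank = lam2 * torch.linalg.matrix_norm(Lambda, ord='nuc')    # ||Lambda||_*
    # with rows of X as samples, X = X Lambda corresponds to X ~ Lambda @ X here
    self_expr = lam3 * torch.norm(X - Lambda @ X) ** 2
    return sparse + low_rank + self_expr
```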

Predefined Knowledge Graph. Many previous works have shown that expert knowledge is tremendously helpful for medical data analysis, especially when the labelled training set is small. In this work we only have 821 labelled MRI slices as training data, which is very small for training a graph neural network. To further improve our model, we extract a knowledge graph from the radiologist-labelled lymph node ontology graph shown in Fig. 2: for all labelled training images/nodes, we construct undirected edges between them if they are connected in the predefined ontology graph.
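A rough sketch of this construction is given below (our illustration; connecting samples that share a label is an additional assumption on our part).

```python
# Build the knowledge graph adjacency Lambda_kg over training samples: connect two
# samples if their lymph node labels are adjacent in the predefined ontology graph
# (and, as an assumption, if they carry the same label).
import numpy as np

def ontology_adjacency(labels, ontology_edges):
    """labels: label id per training node; ontology_edges: (label_a, label_b) pairs."""
    connected = {frozenset(e) for e in ontology_edges}
    n = len(labels)
    A_kg = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            linked = frozenset((labels[i], labels[j])) in connected
            if same or linked:
                A_kg[i, j] = A_kg[j, i] = 1.0
    return A_kg
```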

Besides the ontology graph, since each MRI slice in our dataset has an associated clinical report, we also use the MRI reports (shown in Fig. 3) to train a report classification model based on pre-trained BERT [4] and apply attention pooling to extract a semantic vector for each lymph node type. Based on the similarity between these semantic features and the ontology graph, we construct a knowledge graph for node label classification. Denoting the adjacency matrix of the knowledge graph as \(\mathbf {\Lambda }_{kg}\), our problem is further formulated as,

$$\begin{aligned} \mathcal {L}(\mathbf {X},\mathbf {\Lambda },\mathbf {Y})&=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {\Lambda })\Vert _p + \lambda _0 f(\mathbf {X},\mathbf {\Lambda })^T \mathbf {\Delta } f(\mathbf {X},\mathbf {\Lambda })\\ \nonumber&+\lambda _1\Vert \mathbf {\Lambda }\Vert _1 +\lambda _2\Vert \mathbf {\Lambda }\Vert _*,\text{ s. } \text{ t. }\ \mathbf {\Delta }=\mathbf {I}-(\mathbf {\Lambda }+\beta \mathbf {\Lambda }_{kg}), \mathbf {X}=\mathbf {X}\mathbf {\Lambda }, \end{aligned}$$
(5)

where \(\beta \) is a hyper-parameter that weights the knowledge graph prior. In order to solve Eq. 5, we use a Lagrange multiplier to move the constraints into the objective function,

$$\begin{aligned} \mathcal {L}(\mathbf {X},\mathbf {\Lambda },\mathbf {Y})&=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {\Lambda })\Vert _p +\lambda _0 f(\mathbf {X},\mathbf {\Lambda })^T(\mathbf {I}-(\mathbf {\Lambda }+\beta \mathbf {\Lambda }_{kg})) f(\mathbf {X},\mathbf {\Lambda })\\\nonumber&+\lambda _1\Vert \mathbf {\Lambda }\Vert _1 +\lambda _2\Vert \mathbf {\Lambda }\Vert _*+\lambda _3\Vert \mathbf {X}-\mathbf {X}\mathbf {\Lambda }\Vert _2^2 \end{aligned}$$
(6)
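Assembling the pieces, a sketch of the unconstrained objective in Eq. 6 follows (assuming p = 2 and hypothetical hyper-parameter values; `F` denotes the model output \(f(\mathbf {X},\mathbf {\Lambda })\)).

```python
# Sketch of Eq. (6): the knowledge graph Lambda_kg is blended into the Laplacian
# with weight beta, and the constraints of Eq. (5) become penalty terms.
import torch

def joint_objective(X, Y, F, Lambda, Lambda_kg, beta=0.5,
                    lam0=0.1, lam1=0.01, lam2=0.01, lam3=0.1):
    n = Lambda.shape[0]
    Delta = torch.eye(n) - (Lambda + beta * Lambda_kg)                # knowledge-augmented Laplacian
    loss = torch.norm(Y - F)                                          # label prediction term
    loss = loss + lam0 * torch.trace(F.T @ Delta @ F)                 # Laplacian smoothness
    loss = loss + lam1 * Lambda.abs().sum()                           # sparsity
    loss = loss + lam2 * torch.linalg.matrix_norm(Lambda, ord='nuc')  # low rank
    loss = loss + lam3 * torch.norm(X - Lambda @ X) ** 2              # relaxed X = X Lambda constraint
    return loss
```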

Graph Convolutional Neural Networks. Graph convolutional neural networks have shown great success in many applications [3, 14], and our proposed framework can easily be combined with them. Let \(\mathbf {H}\) denote the hidden states of a graph neural network and \(\mathbf {W}_l\) the weights of hidden layer l; a standard graph convolutional neural network can be formulated as \( \mathbf {H}_{l+1} =f(\mathbf {W}_l,\mathbf {\Lambda },\mathbf {H}_{l}) \), where the input is the image feature matrix \(\mathbf {X}\) and the output is the predicted labels \(\mathbf {Y}\); thus \(\mathbf {H}_{0} = \mathbf {X}, \mathbf {H}_{last} = \mathbf {Y}, \mathbf {H}=[\mathbf {H}_0,\cdots ,\mathbf {H}_{last}]\). We can rewrite Eq. 6 using graph convolutional neural networks as,

$$\begin{aligned}&\mathcal {L}(\mathbf {X},\mathbf {\Lambda },\mathbf {Y},\mathbf {W},\mathbf {H})=\Vert \mathbf {Y}-f(\mathbf {X},\mathbf {W},\mathbf {\Lambda },\mathbf {H})\Vert _p \nonumber \\&+\lambda _0 f(\mathbf {X},\mathbf {W},\mathbf {\Lambda },\mathbf {H})^T(\mathbf {I}-(\mathbf {\Lambda }+\beta \mathbf {\Lambda }_{kg})) f(\mathbf {X},\mathbf {W},\mathbf {\Lambda },\mathbf {H})\nonumber \\&+\lambda _1\Vert \mathbf {\Lambda }\Vert _1 +\lambda _2\Vert \mathbf {\Lambda }\Vert _*+\lambda _3\Vert \mathbf {X}-\mathbf {X}\mathbf {\Lambda }\Vert _2^2. \end{aligned}$$
(7)
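For reference, a minimal two-layer GCN following the formulation \(\mathbf {H}_{l+1}=f(\mathbf {W}_l,\mathbf {\Lambda },\mathbf {H}_l)\) can be sketched as follows (a standard implementation, not necessarily the authors' architecture).

```python
# A two-layer GCN: each layer propagates features over the graph with Lambda and
# applies a learned linear map, i.e. H_{l+1} = sigma(Lambda H_l W_l).
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, X, Lambda):
        H1 = torch.relu(Lambda @ self.W0(X))   # H_1 = sigma(Lambda X W_0)
        return Lambda @ self.W1(H1)            # H_last = Y: class scores per node
```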

Optimization Method. Optimizing both the graph structure and the node classifier on the same dataset can easily lead to trivial, overfitted solutions. To solve Eq. 7, we propose a bi-level optimization method that learns the graph structure \(\mathbf {\Lambda }\), the GNN model weights \(\mathbf {W}\), and the labels \(\mathbf {Y}\) jointly: the graph adjacency matrix \(\mathbf {\Lambda }\) is learned on the validation set, while the GNN weights \(\mathbf {W}\) are learned on the training set. The hyperparameters \(\lambda _0,\lambda _1,\lambda _2,\lambda _3,\beta \) are also tuned on the validation set. The detailed optimization algorithm is shown in Algorithm 1.

Algorithm 1. Bi-level optimization for joint graph structure learning and classification.
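As a rough illustration of this bi-level scheme (a sketch under our own assumptions of optimizer, learning rates, regularization weights, and a fixed \(\beta \); not the exact Algorithm 1), the inner step updates the GNN weights on the training nodes and the outer step updates the graph structure on the validation nodes.

```python
# Bi-level optimization sketch: W is fitted on the training split, Lambda on the
# validation split; beta = 0.5 is a placeholder for the tuned hyper-parameter.
import torch

def bilevel_optimize(model, X, Y, train_idx, val_idx, Lambda_kg, beta=0.5, epochs=100):
    n = X.shape[0]
    Lambda = torch.eye(n, requires_grad=True)             # learnable graph structure
    opt_w = torch.optim.Adam(model.parameters(), lr=1e-3)
    opt_g = torch.optim.Adam([Lambda], lr=1e-2)

    for _ in range(epochs):
        # inner step: update GNN weights W on the training nodes (Lambda held fixed)
        F = model(X, Lambda.detach() + beta * Lambda_kg)
        loss_w = torch.norm(Y[train_idx] - F[train_idx])
        opt_w.zero_grad(); loss_w.backward(); opt_w.step()

        # outer step: update the graph structure Lambda on the validation nodes
        F = model(X, Lambda + beta * Lambda_kg)
        reg = 0.01 * Lambda.abs().sum() + 0.1 * torch.norm(X - Lambda @ X) ** 2
        loss_g = torch.norm(Y[val_idx] - F[val_idx]) + reg
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return Lambda.detach()
```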

3 Experiments

Dataset. For model development and validation, we collected large-scale MRI studies from **, performed between Jan 2015 and Sept 2019, along with their associated radiology reports. The majority (63%) of the MRI studies were from the oncology department. The dataset consists of a total of 821 T2-weighted MRI axial slices from 584 unique patients. The lymph node labels were extracted by a radiologist with 8 years of post-graduate experience. The study was retrospective and was approved by the Institutional Review Board with a waiver of informed consent. This dataset comprised the reference (gold) standard for our evaluation and comparative analysis.

Fig. 3. An example MRI slice with labelled bounding boxes for lymph nodes, graph-cut-based detailed annotations of the lymph nodes, the linked clinical report sentences, and the radiologist-labelled lymph node names.

Benchmark Methods. We implemented several benchmark methods in our experiments: 1) Support Vector Machine (SVM) [7]: classical SVM applied to the extracted multi-scale bounding box features; 2) Structured SVM: an SVM constrained to output structured labels following the knowledge graph structure in Fig. 2; 3) Standard Simple Graph model (SG); 4) SG with Graph Structure Learning (SG+SL); 5) SG with SL and the Predefined Knowledge Graph (SG+SL+KG); 6) Graph Convolutional Network (GCN); 7) GCN with Graph Structure Learning (GCN+SL); 8) GCN with SL and the Predefined Knowledge Graph (GCN+SL+KG); 9) Hyper-Graph Convolutional Network (HGCN) [9]; 10) HGCN with Graph Structure Learning (HGCN+SL); 11) HGCN with SL and the Predefined Knowledge Graph (HGCN+SL+KG). We use the same lymph node image feature embedding for all competing methods, so that the comparison isolates the differences in classification performance between our methods and the benchmarks.

Table 1. The top-k ACC & top-k F1 score of multi-class lymph node classification performance by different competing methods

Experiment Setting and Data Processing. We divided the dataset into 10 folds: one fold is used for validation, two folds for testing, and the remaining seven folds for training. We run this cross-validation 10 times and report the averaged top-k (k = 1, 2, 3) classification accuracy over the different lymph node types, as well as the F1-score and AUC for binary classification. In our dataset we have access to the clinical report of each MRI scan. The radiologist describes the lymph node information, including the label, size measurements, and slice numbers, in a sentence with a hyperlink (called a bookmark) referring to the related MRI slices; the bookmark links the annotation in the image to the corresponding description in the report. An experienced radiologist extracted the lymph node labels from the bookmark-linked sentences, and we use them as the ground truth labels for the lymph nodes in the connected MRI slices.

The size of each lymph node is measured with either four points or two points along its maximum dimensions. Based on these key points, we extract multi-scale bounding boxes around each lymph node and compute features within these bounding boxes using a CNN model pretrained on MRI slices. We further use graph-cut to extract the fine contours of the lymph nodes and compute features of the segmented lymph nodes with the same pretrained CNN. We concatenate all multi-scale bounding box features and lymph node features and use the result as the graph node feature representation; the length of the concatenated multi-scale feature vector is 25088. We use pre-trained bioBERT to train the clinical note classification model and use label attention to extract the semantic label embedding. More than 28,000 sentences from de-identified clinical reports from ** hospital were used to embed the distances between the different lymph node names in our dataset. Based on these semantic distances, we construct a semantic embedding graph and combine it with the predefined ontology graph in Fig. 2 to refine the final label predictions.
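A sketch of this feature extraction pipeline is shown below; the ImageNet-pretrained VGG16 backbone and the example scales are our assumptions standing in for the paper's pretrained CNN and exact settings, which are not specified here.

```python
# Crop multi-scale boxes around a lymph node, pass them through a pretrained CNN,
# and concatenate the flattened features as the graph node representation.
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

backbone = models.vgg16(weights="IMAGENET1K_V1").features.eval()

def node_features(slice_img, box, scales=(1.0, 1.5, 2.0)):
    """slice_img: (3, H, W) tensor; box: (x, y, w, h) around the lymph node."""
    x, y, w, h = box
    feats = []
    for s in scales:
        cw, ch = int(w * s), int(h * s)
        top, left = int(y - (ch - h) / 2), int(x - (cw - w) / 2)
        crop = TF.resized_crop(slice_img, top, left, ch, cw, [224, 224])
        with torch.no_grad():
            feats.append(backbone(crop.unsqueeze(0)).flatten())
    return torch.cat(feats)   # concatenated multi-scale feature vector
```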

Quantitative Results. We compared our proposed model to several benchmarks; Table 1 shows the top-k mean accuracy and F1 score for the 14-class classification over the 10-fold cross-validation. Top-k accuracy is widely used to measure multi-class classification performance [5]. The simple graph model generally outperforms the SVM methods, and the structured SVM improves over the classical SVM by more than 0.03 in both accuracy and F1 score by adding structured constraints on the classes (extracted from the predefined ontology graph). Learning the graph structure improves the top-k accuracy of the simple graph model by more than 0.03 and the F1 score by more than 0.02, and adding the knowledge graph improves the simple graph model by a further 0.03 or more in both top-k accuracy and F1 score. The graph convolutional model also improves the top-k accuracy and F1 score consistently compared to the simple graph model. With both graph structure learning and the knowledge graph under the graph convolutional framework, our proposed model achieves the best performance among these variants, with about 0.91 top-3 accuracy and 0.90 top-3 F1 score. Combining the graph learning method with the convolutional hyper-graph model further improves accuracy and F1 score by more than 2% over the convolutional graph model; the best top-3 accuracy and F1 score achieved by the hyper-graph model are 0.93 and 0.92, respectively.