Introduction

Ultrasound-guided synovial tissue biopsy (USSB) provides a better understanding of the pathophysiology of inflammatory arthritis, facilitating the discovery of new biomarkers for diagnostic and/or prognostic purposes and the identification of new therapeutic targets [1, 2]. As recently shown, USSB may allow treatment personalization in the near future [3], and it suits everyday clinical practice well because it is a reliable, safe, and relatively fast procedure, even in outpatient settings. A critical issue in the diagnostic work-up is the standardization of biopsy reading; validated technological support may therefore help pathologists provide faster and more informative biopsy reports. In an increasing number of social and clinical scenarios, artificial intelligence (AI) is proving to be a valuable tool for the generation and implementation of complex multi-parametric decision algorithms. In particular, there is rising interest in the application of image analysis and machine learning techniques to histopathology. Whatever the image acquisition method, whether traditional small-field capture on a microscope or modern whole-slide image digitization, computer vision may yield accurate diagnostic interpretations [4]. In 2012, convolutional neural networks (CNNs) outperformed previous machine learning approaches by classifying 1.2 million high-resolution images in the ImageNet LSVRC2010 contest into 1000 different classes. At the same time, CNNs were found to be superior to other methods in segmenting nerves in electron microscopy images and detecting mitotic cells in histopathology images. Since then, methods based on CNNs have consistently outperformed handcrafted methods in a variety of classification tasks in digital pathology [5].
The ability of CNNs to learn features directly from the raw data without the need for inputs from pathologists, together with the availability of annotated histopathology datasets, has also fuelled the explosion of interest in deep learning applied to histopathology [5]. Unlike earlier neural network architectures such as the multilayer perceptron, where each neuron in one layer connects to every neuron in the following layer, CNNs are supervised multidimensional algorithms consisting of a neural network with several hidden neuron layers, where connection and activation only take place between spatially close neurons, echoing the organization of the animal visual cortex in which neurons selectively respond to different stimuli. When presented with sufficient annotated training image data, CNNs can learn complex histological patterns from images by decomposing the image content into thousands of salient features and then selecting and aggregating the most meaningful ones [5]. These patterns may then be identified in previously unseen images [6].

In rheumatology, AI may enable a further step towards precision medicine, improving patient profiling and treatment personalization. AI has proven effective in predicting responses to TNF inhibitors, guiding treatment on the basis of patients’ clinical and genetic features [7]. Computerized digital analysis based on RGB video signal acquisition through a microscope has been used over the last two decades to quantify cells infiltrating the synovium, although this procedure has never been adopted in real-life patient management because it was time-consuming and led to a non-negligible level of disagreement between centres [8, 9]. The potential benefit of using computer vision for histopathologic analysis of synovitis remains largely unexplored. The aim of our study was to investigate whether CNNs have the potential to accurately identify the grade of synovitis according to Krenn’s synovitis score [10].

Materials and methods

Data acquisition

For training, validation and testing, we used a dataset of 150 photomicrographs of different synovitis slides, originally taken at 1280 × 1080 pixels (SONY® Sensor IMX185) at various magnifications (4× to 20×). The slides were obtained from patients with arthritis of the knee who underwent USSB during routine clinical practice in a single tertiary centre from January 2019 to January 2020. Multiple magnifications were necessary because the heterogeneity of inflammatory changes in the synovial membrane requires sample analysis at different magnifications, as originally proposed by Krenn et al. [10].

For each patient, demographic and clinical characteristics, including ultrasound features of synovitis, were recorded as part of routine clinical practice. Informed consent was obtained from all patients, and the local ethical committee approved this study as part of the Biopure registry (IRB approval n 5277/2017). According to the OMERACT standardization of the synovial tissue biopsy procedure, at least 6 specimens were obtained for each patient from an involved knee joint using 16G Tru-Cut needles and then embedded in paraffin [11]. In total, 78 specimens (mean length ± SD 0.83 ± 0.08 cm) were processed. For all slides, the biopsy surface area was greater than 2.5 mm². Haematoxylin and eosin (H&E) staining was used for histopathological analysis and, for each slide, Krenn’s score was calculated by a pathologist with 20 years of experience in synovial histopathology (AC). This score is obtained by semiquantitatively grading three features of chronic synovitis (enlargement of the lining cell layer, cellular density of the synovial stroma, and leukocytic infiltrate), each from 0 (absent) to 3 (strong), allowing discrimination between low-grade (i.e. Krenn’s score < 5) and high-grade synovitis (i.e. Krenn’s score ≥ 5) [10]. The lining layer was evaluable in all 150 images.
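The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names `krenn_score` and `synovitis_grade` and the example item values are hypothetical, and the cut-off is the one quoted in the text (score < 5 low grade, score ≥ 5 high grade).

```python
def krenn_score(lining, stroma, infiltrate):
    """Sum the three semiquantitative items, each graded 0 (absent) to 3 (strong)."""
    items = (lining, stroma, infiltrate)
    for value in items:
        if value not in (0, 1, 2, 3):
            raise ValueError("each item must be an integer from 0 to 3")
    return sum(items)


def synovitis_grade(score):
    """Apply the cut-off quoted in the text: < 5 is low grade, >= 5 is high grade."""
    return "high" if score >= 5 else "low"


# Hypothetical slide: lining enlargement 2, stromal density 2, infiltrate 1.
example = krenn_score(lining=2, stroma=2, infiltrate=1)
print(example, synovitis_grade(example))
```

The binary label produced this way is what the classifier is asked to predict; the three individual items are not modelled separately.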

The whole dataset was split by randomly allocating photomicrographs to the training, validation and test datasets according to a 3:1:1 ratio [12].
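As a minimal sketch of such a random 3:1:1 allocation, the snippet below shuffles 150 placeholder image identifiers and cuts them into three subsets. The function name `split_dataset`, the fixed seed and the integer identifiers are assumptions made for illustration; the text does not describe the tooling actually used for the split.

```python
import random


def split_dataset(items, ratios=(3, 1, 1), seed=0):
    """Randomly allocate items to train/validation/test subsets by the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_valid = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


image_ids = list(range(150))  # placeholder identifiers for the 150 photomicrographs
train, valid, test = split_dataset(image_ids)
print(len(train), len(valid), len(test))
```

With 150 images and a 3:1:1 ratio, the three subsets contain 90, 30 and 30 images.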

Theory

CNNs require large labelled image datasets to attain a high level of classification accuracy [13]. In several fields, though, acquiring such an image dataset is challenging and annotating it is expensive.

Transfer learning (TL) has emerged as a powerful tool to mitigate these issues: TL is a process in which a model trained on one problem is exploited to predict labels for a second, similar problem [14]. The first few layers in CNNs extract very general information such as colors, dots, lines and edges, while subsequent layers aggregate these features into complex patterns. The initial layers of a CNN trained on a large and varied dataset can hence be treated as general image feature extractors [13]. Because of this, it is possible to “freeze” the weights of a CNN’s initial layers and fine-tune or re-train only the last layers, which generally encode higher-level features. This allows the model to recognize features specific, in our case, to histopathologic images. Pre-trained models substantially ease the development of new models and also lead to lower generalization error. Moreover, there is robust evidence that, with suitable fine-tuning, pre-trained CNNs may outperform a CNN trained from scratch for biomedical applications [14,15,16].

We fine-tuned a specific CNN architecture called ResNet34 [17]. For reproducibility, we used a Python 3.6 environment with the PyTorch 1.4.0 and fast.ai 1.06 [18] libraries. ResNet34 was pre-trained on a subset of the ImageNet database [19] comprising approximately 1.2 million images in 1000 categories. ResNet34 consists of 5 convolutional layer groups ending with a pooling layer group for prediction (for details see Supplementary Materials and Fig. 1) [17, 18]. Briefly, as the input flows through ResNet34, the less complex feature maps in the earlier layers act as filters for the above-mentioned basic visual elements, whereas the later layers provide more complex features. For this reason, the initial layers of ResNet34 can be considered general image feature extractors. The architecture of ResNet34 is outlined in Fig. 1, where we annotated the layers that we fine-tuned during transfer learning (see Supplementary Materials). To improve the robustness of our model and its ability to generalize, and to further decrease the risk of overfitting [6], image augmentation (i.e. up to n.1680 different versions of the same item) was performed following the default augmentation protocol in fast.ai 1.06 [18].

Fig. 1

Outline of the ResNet34 architecture, showing which parts were fine-tuned in this work. The model consists of one convolution and pooling step (in yellow) followed by n.4 convolution groups of similar structure. Its last layers are a global average pooling layer and a 1000-way fully-connected layer with softmax for prediction. Layers in distinct convolution groups follow the same pattern, performing 3 × 3 convolutions with a fixed number of feature maps per group that increases across groups (64, 128, 256, 512), with Rectified Linear Unit (ReLU) activation and a shortcut connection bypassing the input every 2 layers. The initial layers of a CNN, trained on a large and varied dataset, can be treated as general image feature extractors. Because of this, it is possible to “freeze” the weights of a CNN’s initial layers and fine-tune or re-train only the last layers, to enable the model to recognize features specific to histopathologic images.

Parameters and metrics

In the training and validation phases, we relied on the train and validation losses to investigate the model’s goodness-of-fit.

The loss represents a quantitative measure of how much the model’s predictions differ from ground truth (i.e. the pathologist’s Krenn’s score). In general, losses are defined so that they decrease as the model’s predictions get closer to the ground truth, so that the training procedure can be framed as a loss minimization problem. In other words, the loss can be defined as an average of the errors made by the model on the images contained in a subset of the data. The loss is calculated on both the training and validation sets; it can therefore be interpreted as a number describing how accurately the model is predicting each of these two sets.
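To make the idea of a loss as an average per-image error concrete, the sketch below uses binary cross-entropy. This choice is an assumption: the text does not name the loss function used, and cross-entropy is simply a common option for classification. The function names and the probabilities are hypothetical.

```python
import math


def cross_entropy(prob_high, label):
    """Error for one image: -log of the probability assigned to the true class."""
    p_true = prob_high if label == 1 else 1.0 - prob_high
    return -math.log(p_true)


def mean_loss(probs, labels):
    """Average the per-image errors over a subset of the data."""
    return sum(cross_entropy(p, y) for p, y in zip(probs, labels)) / len(labels)


labels = [1, 0, 1, 0]                  # 1 = high grade, 0 = low grade (hypothetical)
confident_right = [0.9, 0.1, 0.8, 0.2]  # predicted probability of "high grade"
hesitant = [0.6, 0.4, 0.5, 0.5]

# Predictions closer to the ground truth give a lower average loss.
print(mean_loss(confident_right, labels) < mean_loss(hesitant, labels))  # True
```

Computing the same quantity on the training subset and on the validation subset gives the two curves that are compared during training.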

To rule out overfitting and underfitting during training, both losses should typically show a decreasing trend, with the training loss normally being the smaller of the two.

An “epoch” indicates one forward pass and one backward pass over all the training images. The number of epochs after which to stop training is a hyperparameter usually chosen to minimize the loss while avoiding overfitting [14].

The test phase performance of our CNN was assessed with the following metrics:

$$ \text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{true positives} + \text{true negatives} + \text{false positives} + \text{false negatives}} $$
$$ \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} $$
$$ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} $$
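The three metrics above can be computed directly from the counts of a confusion matrix, as in the following sketch. The function names and the ten example labels are hypothetical and are not taken from the study data; high-grade synovitis is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    return tp / (tp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


# Hypothetical labels: 1 = high-grade synovitis, 0 = low-grade synovitis.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, tn, fn), recall(tp, fn), precision(tp, fp))
```

In this toy example one high-grade image is missed and one low-grade image is over-called, so all three metrics equal 0.8.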

Calculation

Images were rescaled to 500 × 281 pixels and underwent pixel z-score normalization. Our training dataset included n.90 images (n.42/90 high-grade synovitis). The classification capability of CNNs is highly reliant on the size of the training data: if the dataset is small, the CNN model starts overfitting after only a few epochs [20]. Given that, ResNet34 was trained for 4 epochs on the aforementioned dataset using the fast.ai 1cycle policy, with fine-tuning restricted to the last convolutional layer, an approach that has been discussed and found to work well in cases similar to ours [21, 22] (see Supplementary Materials for details). Figure 1 graphically shows which parts of the network were fine-tuned.
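The z-score normalization mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the text does not say which mean and standard deviation were used (for example per image, per channel, or dataset-wide statistics), so the snippet simply normalizes one hypothetical list of pixel intensities by its own statistics. The function name `zscore` is illustrative.

```python
import statistics


def zscore(pixels):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.mean(pixels)
    std = statistics.pstdev(pixels)  # population standard deviation
    if std == 0:
        return [0.0 for _ in pixels]  # constant input: nothing to rescale
    return [(p - mean) / std for p in pixels]


channel = [0, 64, 128, 192, 255]  # hypothetical 8-bit intensities of one channel
normalized = zscore(channel)
print([round(v, 3) for v in normalized])
```

After this step the values are centred on zero, which generally helps gradient-based training behave consistently across images.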

Validation and test datasets included n.30 (n.14/30 high-grade synovitis) and n.30 items (n.16/30 high-grade synovitis), respectively. Following the training phase, validation and testing were carried out on these remaining datasets (see Supplementary Materials for details).

Grad-CAM algorithm

Similarly to other deep learning models, CNNs are considered “black box” methods, for which researchers cannot precisely explain what parts of the input image the network is “attending” to, or how the model arrived at its final output [23]. It is crucial to address these issues, particularly in biomedical contexts. To provide explainability we employed the Grad-CAM algorithm, which has been discussed and applied in recent literature for visually debugging CNNs and understanding which features or parts of the image are the most important for classification purposes [14, 23].

In brief, Grad-CAM uses the gradients of the score for a given label with respect to the feature maps of a convolutional layer, computed for one specific test image, to produce a heatmap highlighting the regions that are most relevant to the model for predicting that label.

With Grad-CAM we checked where in each test image the CNN was looking when evaluating a histopathological slide. This allowed us to further validate that the model works correctly, by verifying that it is indeed “attending” to intuitively correct patterns in the image and activating around those patterns. Examples are discussed in the Results section.
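The core Grad-CAM computation can be sketched in plain Python. In its usual formulation, each feature map of a chosen convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped to zero. The snippet below assumes the feature maps and gradients have already been extracted from a network; the 2 × 2 toy values and the function name `grad_cam` are hypothetical, and this is not the implementation used in the study.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one heatmap.

    gradients[k][i][j] is the gradient of the class score with respect to
    feature_maps[k][i][j].
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # One weight per channel: the spatial average of its gradient.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU: keep only regions that push the class score up.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]  # scale to [0, 1]
    return heatmap


# Toy input: channel 0 fires top-left and supports the class (positive gradient);
# channel 1 fires bottom-right and argues against it (negative gradient).
feature_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-0.5, -0.5], [-0.5, -0.5]]]
print(grad_cam(feature_maps, gradients))
```

In practice the resulting low-resolution map is upsampled to the input size and overlaid on the image, which yields heatmaps of the kind discussed in the Results section.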

Ablation study

In deep learning research, an ablation study typically refers to removing some features of the model or algorithm and seeing how that affects performance for the sake of explainability.

To observe the effect of fine-tuning on the CNN, we compared the performance of the model with and without fine-tuning.

Inter-rater reliability study

Krenn’s synovitis score was independently assessed on the aforementioned test dataset by a second pathologist (GC), who was blinded to the first pathologist’s classification. Using Cohen’s kappa (κ), we measured inter-rater reliability by comparing the ResNet34 outcomes with the second pathologist’s report.
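Cohen’s kappa compares the observed agreement between two raters with the agreement expected by chance from each rater’s label frequencies. The following sketch shows the calculation; the function name `cohen_kappa` and the example label lists are hypothetical and are not the study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of the two raters' marginal frequencies per category.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 1 = high-grade synovitis, 0 = low-grade synovitis.
model = [1, 1, 0, 0]
pathologist = [1, 0, 0, 0]
print(cohen_kappa(model, pathologist))
```

A value of 1 corresponds to complete agreement, while a value of 0 means agreement no better than chance.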

Results

Twelve patients (6/12 female, 50%) with a mean (± SD) age of 48.7 ± 12 years underwent USSB of the knee synovium during routine clinical practice in the time frame of the study. Of these, 6/12 patients (50%) had psoriatic arthritis, 5/12 (41.7%) had rheumatoid arthritis, and 1/12 (8.3%) had peripheral spondyloarthritis. All patients met the current classification criteria for their respective disease [24,25,26]. Detailed patient characteristics are shown in Table 1.

Table 1 Clinical demographic and histopathological characteristics of our cohort

Validation phase

The learning curves in the validation phase showed that the CNN learned steadily, with a rapid decrease in train loss and a concomitant increase in accuracy (Fig. 2). Fine-tuned (ft-) ResNet34 showed a good fit: train loss and validation loss, both plotted over epochs, displayed a decreasing trend, with the former being the smaller of the two (validation accuracy 96.67%). This was not the case for the plain ResNet34, trained without fine-tuning for ablation purposes, which showed signs of underfitting: its training loss continued to decrease until the end of training while remaining higher than the validation loss throughout the whole process (validation accuracy 86.67%, Fig. 2).

Fig. 2

Training and validation metrics. Train loss and validation loss are plotted over epochs. In the fine-tuned ResNet34 model (left panel) a good fit was observed: both train loss and validation loss showed a decreasing trend, with the former being the smaller of the two. The plain ResNet34 model (right panel) showed signs of underfitting, with the training loss continuing to decrease until the end of training while remaining higher than the validation loss throughout the whole process.

Test phase

Deploying the ft-ResNet34 on the test dataset yielded an accuracy of 100% (precision = 1, recall = 1, Fig. 3). As expected, the plain ResNet34 performed worse, with accuracy = 90%, precision = 0.90 and recall = 0.90 (Fig. 3).

Fig. 3

Confusion matrices for test phase metrics (high- vs low-grade synovitis) for the fine-tuned (left panel) and plain (right panel) ResNet34 models. The upper-left square shows true positive predictions (high-grade synovitis), and the lower-right square shows true negative ones (low-grade synovitis). The lower-left and upper-right squares show false positive and false negative predictions, respectively. For the fine-tuned model (left panel): accuracy = 100%, precision (true positives/predicted positives) = 1, recall (true positives/actual positives) = 1. For the plain model (right panel): accuracy = 90%, precision (true positives/predicted positives) = 0.90, recall (true positives/actual positives) = 0.90.

Grad-CAM analysis

Grad-CAM (Fig. 4) shows that the activation map of our ft-CNN focused on the cellularity in the synovial lining and the sublining layers, two fundamental items of Krenn’s score [8]. This further supports that the model is working correctly: it consistently focuses its activation map on areas of the image that human pathologists also consider very informative for synovitis grading.

Fig. 4

Two examples of Grad-CAM algorithm output on our fine-tuned model. The algorithm provides a heatmap highlighting the regions that are more relevant to the model for predicting the given label. Here we show each test example on the left, and the same image with the heatmap superimposed on the right. Red-yellow zones (highlighted by black arrows) are the most informative areas for the model’s classification of both high-grade synovitis (upper panels, 20× magnification) and low-grade synovitis (lower panels, 20× magnification). It can be noticed that cellularity in the lining and sublining layers is a salient image characteristic. Areas that are less informative for the model are darkened by the heatmap.

Reliability

The second pathologist identified high-grade synovitis in 16 of the 30 images in the test dataset (ground truth n.16/30). Cohen’s κ for inter-rater reliability between the ft-CNN output and this pathologist’s diagnosis was 1, indicating perfect agreement.

Discussion

To the best of our knowledge, this is the first report showing that a CNN trained with a TL approach can accurately discriminate between high-grade and low-grade synovitis in USSB specimens. H&E-stained slides, the basic tool of precision-based diagnostics, also represent the basis of personalized care in rheumatology. USSB is now poised to revolutionize the management of patients with rheumatic diseases by helping rheumatologists to extract large amounts of objective and multiparametric information about disease pathogenesis, prognosis and treatment outcomes [3, 27]. With the clear perspective of USSB-driven therapy, pathologists will play a key role in this revolution, becoming progressively involved in rheumatologists’ everyday clinical practice.

In the past years, interactive measurement of synovial layers and computer-assisted image analysis based on RGB signal nuances recognition had been used to count cells in lining and synovial stroma [8]. Nevertheless, this process was time-consuming as it required pathologists to mark the margins of synovial lining, whereas the final score showed only an adequate correlation with ground truth (r = 0.725) [8].

Moreover, RGB signal recognition deployed to quantify sublining infiltration by CD3+ lymphocytes and CD68+ macrophages across different European centers showed unsatisfactory agreement (intraclass correlation coefficients: 0.79 and 0.58, respectively) [9].

Conversely, the application of CNN-based computer vision may help to smooth and globally improve the workflow in the future. A benefit of CNNs is to provide a fast, reproducible and standardized tool to assist diagnosis. Indeed, the overall agreement between this model and pathologists is encouraging and suggests that CNNs have the potential to be employed in the clinical management of patients with rheumatic diseases when a USSB is needed.

Undoubtedly, this new technology needs to be integrated with the pathologist’s expertise to oversee and approve machine-based interpretations. Note that, at this stage, this algorithm does not answer all the questions rheumatologists may ask about synovial histological samples. Our model cannot yet identify “pathotypes” that are claimed to be informative for treatment personalization. Indeed, the implementation of this feature is conditional on the availability of larger datasets to train the algorithm. To accurately train the multi-class classifier that would be needed for this task, a researcher would conceivably need a large number of slides for each of the “lympho-myeloid”, “myeloid” and “pauci-immune” pathotypes [28]. Such data cannot be substituted by image augmentation, because of the peculiar features of individual pathotypes. This, once more, underlines the need to improve the sharing of visual content and to create public image databases for machine-learning research, as currently happens for melanoma detection. Given that CNNs seem to spontaneously attend to the classical Krenn’s score items found in the image (cells in the synovial lining, synovial stroma and inflammatory infiltrate), it is conceivable that, with adequate training and fine-tuning, CNNs could discriminate pathotypes based on distinctive histopathological patterns.

The ablation study indicated that the fine-tuning was of utmost importance for maximizing testing metrics. We observed that the fine-tuned ResNet34 seemed to rely on lining and sublining cellularity for discrimination. To explain the usefulness of fine-tuning, we hypothesize that the higher level feature maps found in the fine-tuned layers, which were learned during pretraining on the ImageNet dataset, needed to be adjusted specifically to improve classification accuracy on our dataset.

Finally, we must acknowledge that this was a pilot study with a small image dataset obtained from a low number of patients. Although promising, our model still needs to be prospectively validated in real-life cohorts.

Conclusions

This study shows that CNNs have the potential to accurately discriminate between low-grade and high-grade synovitis. The application of CNN-based computer vision may help to smooth and globally improve the workflow between rheumatologists and pathologists. Further research is necessary to evaluate performance in a real-world clinical setting and to test this technique across the sample distribution and spectrum of synovitis patterns detected in daily practice. However, such developments are conditional on the creation of large, open-access datasets, which historically have driven the development of machine learning.