Introduction

Rotator cuff (RC) tendon tears are associated with varied degrees of muscle atrophy, manifested by decreased muscle bulk and fatty infiltration [1, 2]. Atrophy of RC musculature is linked to higher rates of repair failure and overall worse clinical outcomes [3,4,5,6]. MRI is the reference standard for imaging RC tendons for tears, severity of cuff abnormalities, and postoperative healing [7, 8]. Further, MRI is the preferred method to evaluate RC muscles, enabling quantification and longitudinal assessment of atrophy [2, 9, 10]. Fatty infiltration and degree of atrophy of the supraspinatus muscle have received the most attention in studies correlating surgical decision-making and prognostic factors [5, 10, 11]. Importantly, tears of the subscapularis and infraspinatus tendons may also occur, and atrophy of their muscles carries important implications for function and postoperative outcome [4].

Although multiple approaches have been described for estimation of RC muscle atrophy, they are qualitative or semi-quantitative, with limitations in their reproducibility [1, 9, 10]. On the other hand, quantitative methods require accurate manual or semi-automated segmentation strategies that are time-consuming and may exhibit variability across operators [12,13,14]. Further, some of these techniques require water-fat separation sequences, which are not typically included in routine shoulder examinations [14, 15]. Consequently, their adoption for clinical management is limited, highlighting the need for reliable automated methodologies.

To date, automated deep learning segmentation of shoulder MRI has been limited to the supraspinatus muscle [16] and to shoulder girdle muscles in birth-related brachial plexus palsy [17]. Deep learning segmentation of other muscles has also been reported on ultrasound [18] and MRI [19]. However, prior studies have not evaluated an automated workflow that selects a specific shoulder image for segmentation and generates cross-sectional areas of multiple RC muscles. The purpose of our study was to develop a deep convolutional neural network (CNN) to identify a scapular Y-view (hereafter referred to as Y-view) from a routine sagittal T1-weighted shoulder MRI and another CNN to segment the subscapularis, supraspinatus, and infraspinatus/teres minor muscles on a Y-view. We hypothesized that Y-view selection using the Inception v3 architecture [20] and multi-class segmentation using a modified U-Net [21] would achieve high accuracy as compared with a reference standard of manual Y-view selection and manual muscle segmentation, respectively.

Materials and methods

Our study was IRB-approved and complied with Health Insurance Portability and Accountability Act (HIPAA) guidelines with exemption status for individual informed consent. MRI examinations obtained between October 2018 and January 2020 were collected retrospectively, regardless of indication. The shoulder MRIs were performed using 1.16-T, 1.5-T, and 3.0-T scanners (General Electric, Waukesha, WI, USA; Hitachi Medical Corporation, Tokyo, Japan; Siemens Healthcare, Erlangen, Germany; Philips Healthcare, Amsterdam, Netherlands) within our institution (hereafter referred to as “internal”). In addition, we separately collected shoulder MRIs performed at non-affiliated imaging facilities that were uploaded to our hospital’s database for clinical consultation (“external”).

Shoulder MRIs were obtained with the patient in supine position, head first, using a dedicated shoulder coil. The field-of-view was adapted to the patient’s body habitus. Only T1-weighted sagittal images were used for our study. They were prescribed parallel to the glenoid articular surface, with standard acquisition parameters: repetition time (TR) 400–775 ms, echo time (TE) 8–25 ms, field-of-view (FOV) 140–180 mm, number of excitations (NEX) 0.5–3, bandwidth 61–325 Hz, slice thickness 3–4.5 mm, and inter-slice gap 20–25% of slice thickness.

We defined the Y-view as the most lateral image showing contact between scapular spine and posterior glenoid, forming a Y-letter shape [3, 14] (Fig. 1a). This image was used as it is recognizable, provides a representative cross section of RC muscles, and has been used in previous studies [3, 14, 16, 17].

Fig. 1

a Definition of sagittal Y-view: we used the most lateral image that showed contact between scapular spine and posterior glenoid (arrow), forming a Y-letter shape (Refs. [3, 14]). b Grouping of sagittal images in 3 anatomic zones (1, 2, and 3) that served as ground truth labels for model A training and classification

No cases had intra-articular or intravenous contrast injection. One hundred ninety scans were excluded for the following reasons: motion artifacts that severely degraded anatomic detail (N = 42), inadequate field-of-view for RC muscles (N = 97), and T1 sagittal slice coverage that did not include a proper Y-view (N = 51).

Two models were developed:

  • Model A (classifier), for Y-view selection; and,

  • Model B (segmentation), for RC muscle segmentation at a Y-view.

Model A (Y-view selection)

Ground truth labeling

T1-weighted sagittal shoulder MRI images were grouped into 3 anatomical zones in order to balance the classification task (Fig. 1b):

  • Zone 1, from most lateral image to lateral acromioclavicular (AC) joint;

  • Zone 2, from AC joint up to Y-view; and

  • Zone 3, from Y-view to most medial image.

The AC joint and the Y-view were selected as boundaries for the anatomical zones because they are easily identifiable. These 3 zones served as the ground truth labels for their respective images. An equal number of images from each zone was used to create model A.

To better characterize model A’s cohort, two investigators classified Y-view images as either normal or pathologic using the Goutallier grading system modified by Fuchs et al. [2]: normal, grade 1 images; pathologic, grades 2 and 3 (moderate and advanced fatty infiltration). Images were scored by consensus of two investigators with 6 years of medical image analysis and 10 years of clinical orthopedic experience. These data were not used for model development.

Training and testing

Contrast limited adaptive histogram equalization (CLAHE) was performed on all grayscale images, which were then saved as Tag Image File Format (TIFF) files. For this classification task, we used the GoogLeNet Inception v3 CNN architecture developed by Szegedy et al. [20]. Briefly, this architecture comprises 42 layers, incorporating three varieties of Inception modules that help reduce computation time relative to other architectures [20]. Input images were 299 × 299 pixels and 8-bit 3-channel grayscale. All images were normalized to the training dataset mean and standard deviation. Model A was trained using Python 3.6 (Python Software Foundation, Beaverton, OR) and the Keras library (v2.2.4, https://keras.io) with Tensorflow 1.13.1 (Google, Mountain View, CA) backend [22]. The training dataset was split as 80% training and 20% validation images. Inline image augmentation was performed using the Keras built-in image generator, including rotation, magnification, cropping, horizontal flipping, and horizontal/vertical shifting. Batch size was 16 and we used the RMSprop optimizer (learning rate, 0.001; rho = 0.9). The model trained for 100 epochs on a Linux workstation (Ubuntu 14.04) with 4 NVIDIA Titan Xp Graphics Processing Units. We ran the training procedure in triplicate to produce 3 versions of model A and generate an average testing performance.
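The sketch below illustrates how a training setup of this kind could be assembled in Keras 2.2.x. The directory layout, augmentation ranges, and omission of the mean/standard deviation normalization step are our own illustrative assumptions, not the authors' code.

```python
from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator

# Inception v3 backbone with a 3-class softmax head (zones 1-3)
base = InceptionV3(include_top=False, weights=None, input_shape=(299, 299, 3))
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(3, activation='softmax')(x)
model_a = Model(inputs=base.input, outputs=outputs)

model_a.compile(optimizer=RMSprop(lr=0.001, rho=0.9),
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Inline augmentation approximating the transforms listed above
# (per-image mean/std normalization is omitted here for brevity)
datagen = ImageDataGenerator(rotation_range=10,
                             zoom_range=0.1,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True,
                             validation_split=0.2)

# 'zones/' is a hypothetical folder with one subdirectory per anatomic zone
train_gen = datagen.flow_from_directory('zones/', target_size=(299, 299),
                                        batch_size=16, subset='training')
val_gen = datagen.flow_from_directory('zones/', target_size=(299, 299),
                                      batch_size=16, subset='validation')

model_a.fit_generator(train_gen, validation_data=val_gen, epochs=100)
```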

Each version of model A was tested on sagittal T1-weighted series of internal and external MRI studies. Each full sagittal T1-weighted series was evaluated individually, with its images classified sequentially to predict the anatomic zone assignment. The most lateral slice predicted as zone 3 was considered the model’s prediction for a Y-view. Additionally, the slices immediately lateral and immediately medial to the predicted Y-view were added, yielding a 3-image prediction. The 3-image prediction was considered accurate if one of its images matched the ground truth Y-view (Fig. 2).
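As a concrete illustration of this selection rule, the short function below (an assumed sketch, not the authors' implementation) classifies slices ordered from lateral to medial and returns the indices forming the 3-image prediction.

```python
import numpy as np

def predict_y_view(model_a, slices_lateral_to_medial):
    """slices_lateral_to_medial: array of preprocessed 299 x 299 x 3 slices,
    ordered from most lateral to most medial."""
    probs = model_a.predict(np.asarray(slices_lateral_to_medial))
    zones = probs.argmax(axis=1) + 1              # predicted zone: 1, 2, or 3
    zone3 = np.where(zones == 3)[0]
    if zone3.size == 0:
        return None                                # no zone-3 slice predicted
    y = int(zone3[0])                              # most lateral zone-3 slice = Y-view
    n = len(slices_lateral_to_medial)
    return [i for i in (y - 1, y, y + 1) if 0 <= i < n]   # 3-image prediction
```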

Fig. 2

Workflow for testing of model A. Images from a test T1 sagittal series were sequentially presented to model A. The most lateral zone 3 prediction was considered the model’s choice for a Y-view. The immediately medial and lateral adjacent images were added to yield a 3-slice prediction, which was compared with the ground truth Y-view

Model B (muscle segmentation)

Ground truth labeling

Manual segmentation was performed using the Horos DICOM viewer (version 6.5.2, www.horosproject.com) by a single operator with 10 years of clinical experience, with all images and segmentations inspected by a second investigator with 23 years of clinical experience. As shown in Fig. 3, examinations were segmented manually into 4 classes, as follows: (1) background pixels (all pixels outside RC muscles; black); (2) subscapularis (blue); (3) supraspinatus (red); (4) infraspinatus/teres minor (yellow).

Fig. 3

Example of manual ground truth segmentation. The sagittal T1-weighted image at the scapular Y-view (a) is manually segmented into multiple muscles (b), resulting in a mask (c) with four classes: background pixels (all pixels outside RC muscles; black), subscapularis (blue), supraspinatus (red), and infraspinatus/teres minor (yellow)

Muscle segmentations comprised all pixels within the muscle boundary, including intramuscular fatty septa. If a muscle had severe fatty replacement, its fascial boundaries were traced rather than delineating individual preserved fibers. The teres minor muscle was not individually segmented given its limited surgical importance and poorly defined boundary with the infraspinatus muscle. All segmented images were saved as anonymized TIFF files, with color ground truth masks stored as 8-bit 3-channel RGB and corresponding grayscale images as 8-bit single-channel. All images were resized to 384 × 384 pixels.

To better characterize model B’s cohort and understand potential biases, images were classified for muscle status as done for model A. This information was not used for model development.

Training and testing

The training dataset was split as 80% training and 20% validation images. CLAHE was performed on all grayscale images, which were then saved as JPEG files, followed by augmentation of the training dataset with random rotation, horizontal flipping, cropping, and scaling to reach a total of N = 10,000 images. To further increase variability and improve generalizability, we applied Poisson noise to a randomly selected 50% of the augmented grayscale images. All images were normalized to the training dataset mean and standard deviation.
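A minimal sketch of these two preprocessing steps is shown below, assuming OpenCV for CLAHE and NumPy for Poisson (shot) noise; the CLAHE clip limit and tile size are illustrative choices, not values reported here.

```python
import cv2
import numpy as np

# CLAHE contrast enhancement for 8-bit grayscale images
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def apply_clahe(image_uint8):
    return clahe.apply(image_uint8)

def maybe_add_poisson_noise(image_uint8, probability=0.5):
    """Replace pixel values with Poisson-distributed counts in ~50% of images."""
    if np.random.rand() >= probability:
        return image_uint8
    noisy = np.random.poisson(image_uint8.astype(np.float64))
    return np.clip(noisy, 0, 255).astype(np.uint8)
```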

Model B employed a modified U-Net CNN architecture [21]. Briefly, images were input into our U-Net structure, which consisted of five layers with four down-sampling steps followed by four up-sampling steps. Each step consisted of two successive 3 × 3 padded convolutions, and in the down-sampling steps a dropout of 0.25 was applied. This was followed by a rectified linear unit (ReLU) activation function and a max-pooling operation with a 2 × 2 pixel kernel size. The up-sampling operations were performed using a 2 × 2 transposed convolution followed by a 3 × 3 convolution, after which the output was concatenated with the feature map from the corresponding down-sampling step. The final layer consisted of a 1 × 1 convolution followed by a sigmoid function, resulting in an output pixelwise prediction score for each class.
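To make the architecture description concrete, the sketch below builds a U-Net of this general shape in Keras; the filter counts are assumptions (not reported in the text), and the skip connections follow the standard U-Net pattern of concatenating encoder feature maps.

```python
from keras.layers import (Input, Conv2D, Dropout, MaxPooling2D,
                          Conv2DTranspose, concatenate)
from keras.models import Model

def conv_block(x, filters):
    # two successive 3 x 3 padded convolutions with ReLU activation
    x = Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x

def build_model_b(input_shape=(384, 384, 1), n_classes=4, base_filters=32):
    inputs = Input(input_shape)
    skips, x = [], inputs
    # encoder: four down-sampling steps with dropout 0.25
    for level in range(4):
        x = conv_block(x, base_filters * 2 ** level)
        x = Dropout(0.25)(x)
        skips.append(x)
        x = MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 16)            # bottleneck (fifth layer)
    # decoder: four up-sampling steps with skip concatenations
    for level in reversed(range(4)):
        x = Conv2DTranspose(base_filters * 2 ** level, 2,
                            strides=2, padding='same')(x)
        x = concatenate([x, skips[level]])
        x = conv_block(x, base_filters * 2 ** level)
    outputs = Conv2D(n_classes, 1, activation='sigmoid')(x)   # pixelwise class scores
    return Model(inputs, outputs)
```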

Model B was trained using the Python/Keras/Tensorflow stack described previously. During training, the 4 classes were weighted by prevalence to correct class imbalance, reducing the contribution of classes with the highest number of pixels (e.g., background). Batch size was 8 and we used the Adadelta optimizer (learning rate, 0.0001). The model trained for 25 epochs with early stopping enabled. Multi-class Dice loss was used as the cost function. Training was performed on a Linux workstation (Ubuntu 14.04) with 4 NVIDIA Titan Xp Graphics Processing Units. We ran the training procedure in triplicate to produce 3 versions of model B and generate an average testing performance.
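One common way to implement a multi-class (soft) Dice loss in this stack is sketched below; this is an assumed formulation for illustration, not the authors' exact cost function.

```python
from keras import backend as K

def multiclass_dice_loss(y_true, y_pred, smooth=1.0):
    # y_true, y_pred: tensors of shape (batch, height, width, n_classes)
    axes = (0, 1, 2)
    intersection = K.sum(y_true * y_pred, axis=axes)
    union = K.sum(y_true, axis=axes) + K.sum(y_pred, axis=axes)
    dice_per_class = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - K.mean(dice_per_class)
```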

Model B was tested on internal and external Y-view images to output predictions in 4 classes (background, subscapularis, supraspinatus, and infraspinatus/teres minor) that were compared with manual segmentations using Dice (F1) score [23].

Statistical analysis

Descriptive statistics are reported as percentages and means ± standard deviations (SD). For model A, a top-3 success rate was used to evaluate performance. The top-3 success rate was determined by comparing the manually selected ground truth Y-view to the 3-image prediction, which was considered accurate if one of its images matched the ground truth Y-view. For model B, the Dice (F1) score was used to assess similarity between the manual segmentations and the CNN-predicted segmentations [23]. A Dice score of 1.00 indicates perfect similarity. We also obtained mean precision (positive predictive value) and mean recall (sensitivity) for the model B tests.
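For reference, per-class Dice, precision, and recall can be computed from integer label masks as in the short sketch below (our illustrative formulation, with labels 0–3 corresponding to background, subscapularis, supraspinatus, and infraspinatus/teres minor).

```python
import numpy as np

def per_class_metrics(gt_mask, pred_mask, n_classes=4, eps=1e-8):
    """gt_mask, pred_mask: 2-D integer arrays with class labels 0..n_classes-1."""
    metrics = {}
    for c in range(n_classes):
        gt, pred = (gt_mask == c), (pred_mask == c)
        tp = np.logical_and(gt, pred).sum()
        dice = 2 * tp / (gt.sum() + pred.sum() + eps)
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        metrics[c] = {'dice': dice, 'precision': precision, 'recall': recall}
    return metrics
```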

Results

Model A

Model A was trained on 258 scans (N = 4320 images) from patients with a mean age of 56.2 ± 14.3 years. Each of the 3 shoulder anatomic zones was represented by an equal number of images (N = 1440 images per zone). Model A was tested on 100 internal scans (N = 3197 images; mean age, 56.0 ± 15.0 years) and separately on 50 external scans (N = 1205 images; mean age, 55.0 ± 17.2 years). Cohort characteristics regarding RC muscle status in the training and test datasets for model A are outlined in Table 1. Overall, the subscapularis muscle was most frequently normal, followed by the supraspinatus and infraspinatus/teres minor.

Table 1 Cohort characteristics: rotator cuff muscle status in training, internal testing, and external testing datasets

Training took 1 h 20 min per run (training was run 3 times). Mean top-3 success rates to detect a proper Y-view were 98.7 ± 1.0% (internal test dataset) and 99.7 ± 1.0% (external). The few errors observed were due to the predicted Y-view being 2 slices apart from the manually determined Y-view. Mean top-1 success rates to detect the singular ground truth Y-view were 80.0 ± 3.0% (internal) and 91.0 ± 3.0% (external). On our workstation, detecting a Y-view took 1.8 s per test scan (each scan contained a full T1 sagittal series).

Model B

A total of 1048 scans, with one Y-view image used per scan, were collected from 1030 patients (mean age, 56.1 ± 14.5 years). The images were divided into 90% training (N = 943) and 10% test (N = 105) datasets. Cohort characteristics regarding RC muscle status were similar to those of model A (i.e., a larger proportion of normal subscapularis muscles) (Table 2). External test cases for model B were the same as those used to test model A.

Table 2 Mean Dice, precision, and recall scores for model B segmentation on internal and external test datasets. Values are mean ± SD from testing on models generated in 3 distinct runs

Manual segmentations were accomplished in approximately 5–10 min per image. Training took 1 h per run (training was performed 3 times). Overall, mean muscle segmentation Dice scores for internal and external test datasets were > 0.93 and are outlined in Table 2. Figure 4 shows examples of accurate CNN muscle segmentations. Although overall accuracy was high on internal and external test datasets, minor prediction errors were seen especially along the inferior contour of the subscapularis, where teres major/latissimus dorsi and axillary artery produce challenging boundaries (Fig. 5). There was only one instance—from a total of 155 test cases—in which the model misclassified a larger muscle area (Fig. 6). On our workstation, each automated segmentation was accomplished in 0.02 s per test image.

Fig. 4

Examples of accurate muscle segmentation using model B, with each row containing test images from different patients, with normal rotator cuff muscle appearance (a), and varied degrees of muscle atrophy and fatty infiltration (b, c). In grayscale images from all 3 cases, note the challenging boundaries between infraspinatus and teres minor. Manual, manual tracing; Model B (U-Net), model prediction by CNN

Fig. 5

Prediction errors on test images from 3 different subjects (one per row). Errors at caudal contour of subscapularis muscle (arrowheads), due to challenging muscle boundaries with teres major/latissimus dorsi (a) and axillary artery (b). c Underestimation of supraspinatus segmentation (arrow) and overestimation of infraspinatus/teres minor (white arrowhead), with model misclassifying portion of trapezius (black arrowhead). Manual, manual tracing; Model B (U-Net), model prediction by CNN

Fig. 6

Prediction errors on test images from 3 different subjects (one subject per row). a Although most atrophied subscapularis (white arrowhead) was correctly predicted by the model, its cranial portion was missed and a stray incorrect prediction is noted in this area (red pixels, black arrowhead). Note the correct inclusion of atrophied teres minor (white arrows) in infraspinatus/teres minor segmentation (yellow). b Underestimation of subscapularis segmentation (arrowheads) in area of pulsation artifact from axillary artery. c Rare example of model uncertainty assigning an area of subscapularis as belonging to infraspinatus/teres minor class (arrowheads). Manual, manual tracing; Model B (U-Net), model prediction by CNN

Discussion

The findings of our study are twofold: (1) a CNN classification method is able to accurately select an appropriate shoulder Y-view and (2) another U-Net-based CNN is able to accurately segment multiple RC muscles at that level. Importantly, our results show the feasibility of these methods in a large cohort of randomly selected shoulder MRIs obtained both inside and outside our institution.

RC muscle atrophy affects the repairability of tendons, with greater muscle atrophy predisposing to higher rates of re-tear and unfavorable outcomes [3,4,5,6]. Prior qualitative [1, 9] and quantitative [10] studies have graded fatty infiltration and atrophy of RC muscles. A commonly used qualitative system is the Goutallier classification [1], originally described on non-contrast shoulder CT scans and later adapted for MRI [2] using Y-view T1-weighted images, with widely varying interobserver [2, 24,25,26,27] and intra-observer [24,25,26,27] reliability. Subsequently, the tangent sign was introduced as a binary qualitative assessment of supraspinatus muscle atrophy [9], and Thomazeau et al. [10] proposed an occupation ratio to determine supraspinatus muscle atrophy. Taken together, although these methods provide insight into RC muscle status, their limitations include subjective prescription of anatomic landmarks and manual tracings, which are time-consuming and show variability across operators [12,13,14]. These factors are drawbacks for broader implementation of quantitative RC muscle measures in clinical care.

In our study, we focused on the Y-view for its familiar bony landmarks, good representation of RC muscle status, and frequent use in RC muscle atrophy studies [3, 14]. Automatic slice selection methods have been previously described to identify anatomical landmarks using atlas-based approaches and deep learning [28,29,30]. For musculoskeletal applications, Zhou et al. [31] had success using a CNN to select a knee sagittal slice for anterior cruciate ligament tear classification with an accuracy of 0.98. Our results are novel in presenting accurate methods for Y-view selection that can be the initial step in a workflow for automated RC muscle segmentation at that level.

Our automated segmentation of RC muscles showed an accuracy comparable to or better than other deep learning methodologies. For example, Kim et al. [16] manually selected a Y-view on 240 patients and found Dice scores of 0.95 for the supraspinatus muscle and 0.97 for the supraspinatus fossa. Similarly, Conze et al. [17] used a CNN to obtain volumetric segmentations from 24 pediatric healthy and pathological shoulder exams and found Dice scores of 0.71, 0.83, and 0.82 for supraspinatus, subscapularis, and infraspinatus, respectively. They also noted improved performance when images of pathological shoulders were combined with images of unaffected shoulders [17].

In our study, both models were trained and tested on datasets containing a variety of RC muscle conditions (i.e., normal, moderate, and severe atrophy). Although the accuracy and short analysis time per image for model B are promising, areas of over- and underestimation were seen. Minor errors occurred most commonly at muscle boundaries with adjacent fat planes and likely represent low-impact quantitative issues. More prominent errors were noted along the inferior contour of the subscapularis (adjacent to the axillary vessels) and the inferior contour of the infraspinatus/teres minor. Despite high overall and per-muscle Dice scores, strategies to reduce these errors should include expanding training datasets with more cases containing confounding features in those areas. Inclusion of a larger variety of supraspinatus atrophy states may also improve segmentation performance. As noted by Kim et al. [16], a possible explanation for the lower supraspinatus muscle Dice score is variation in cross-sectional area caused by supraspinatus tendon tears and atrophy.

Strengths of our study include successful slice selection using a classification algorithm and demonstration of accurate automated RC muscle segmentations on routine T1 sagittal MRIs from a large and varied cohort. This simple yet robust technique has not been previously described and yielded excellent results. Importantly, both our models were tested on datasets from studies obtained outside our institution, yielding comparable accuracies. The size of our training and testing datasets is another advantage compared with prior studies [16, 17]. Further, whereas Conze et al. [17] examined a pediatric population, our work investigated a wider range of adult shoulder MRI images.

An important focus of our study was to separately validate methods for proper Y-view selection (model A) and accurate muscle boundary determination (model B). We did not test the performance of an integrated pipeline across both models; therefore, our top-3 performance for model A should not be taken to imply that a 3-image prediction would be passed to model B in such a workflow. An integrated pipeline, currently in development by our group, requires specific procedures and modifications to the training and testing datasets that were beyond the scope of the current study. Our algorithm was also not designed to quantify the degree of atrophy in each muscle, which will require an additional stage of thresholding muscle versus fat pixels within each segmentation. This desirable feature will be the subject of future development; it relies first on robust and reliable localization of muscle boundaries, which was the key effort in our study. Accordingly, our manual tracing included fatty septa and fat replacement within the boundaries of each cross-sectional area, with the expectation of separating muscle from fat pixels using dedicated methods in the future. In principle, a similar procedure could be adopted to more accurately separate the infraspinatus from the teres minor. Altogether, such developments may allow prompt determination of rotator cuff muscle cross-sectional area on clinical workstations, which could automatically provide overlays on specific images and data on dictation platforms.

Limitations of our study include model B being trained on a single standardized sagittal image. Volumetric (3D) muscle quantification using a CNN approach has been demonstrated in prior studies [17]. Three-dimensional measures of RC muscle volume are more accurate but require segmentation of multiple slices and longer imaging times to cover the entire shoulder girdle, which is rarely done in clinical practice [13]. Furthermore, previous studies have found that a single slice is appropriate for clinical assessment of fatty infiltration [13]. Another limitation is the relatively low proportion of severely atrophic RC muscles in our datasets. Although this reflected the typical patient population at our imaging and clinical services, future work will develop models on datasets with higher degrees of fatty infiltration. For this reason, performance of our method could vary in a cohort with a higher prevalence of severe RC muscle atrophy.

In conclusion, we demonstrate novel and accurate methods to select a Y-view image and segment multiple RC muscles using a combination of CNN models. Our work extends prior studies by examining a larger and more diverse cohort of patients. By offering automated and reliable muscle area quantification, our methods have potential use in surgical outcomes research and clinical assessment of RC pathology.