Introduction

Spondyloarthritides (SpA) are a group of diseases that mainly affect the spine and cause inflammatory changes, new bone formation, deformity, impaired mobility, and pain. In SpA, magnetic resonance imaging (MRI) allows both inflammatory activity and structural damage to be evaluated, and it is informative at the spine as well as at the sacroiliac (SI) joints. At present, MRI is considered the paramount imaging technique to detect inflammatory lesions in the axial skeleton, to the extent that the new Assessment of SpondyloArthritis International Society (ASAS) classification criteria for axial SpA include MRI in the classification algorithm [1, 2].

MRI is widely used to detect early active sacroiliitis [3,4,5,6]. According to ASAS, an SI MRI is considered positive when bone marrow edema (BME) is detected in the subchondral or juxta-articular area of the SI joints. A single lesion must be present on at least two consecutive slices, whereas if there are several lesions, it is sufficient for them to appear on the same MRI slice [7]. Isolated synovitis, capsulitis, or enthesitis without BME is compatible with, but not sufficient for, the diagnosis of active sacroiliitis [8]. A limitation of this definition is that BME is not quantified. This matters because BME in the SI joints has been observed in up to 23% of patients with nonspecific low back pain and in 6% of healthy individuals [8, 9]. In these cases, BME is usually low grade and, in general, unrelated to structural lesions (especially erosions). If any amount of BME satisfies the MRI definition of sacroiliitis, misclassification of SpA is inevitable, which underscores the need to quantify the magnitude of BME.
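To make this limitation concrete, the positivity rule described above can be written as a minimal Python sketch. It is our simplified reading of the rule, not an ASAS tool: the input is a hypothetical list of lesion counts on consecutive slices, and lesion identity across slices is not tracked (a lesion seen on two adjacent slices is assumed to be the same lesion).

```python
from typing import Sequence


def asas_mri_positive(lesions_per_slice: Sequence[int]) -> bool:
    """Simplified ASAS definition of active sacroiliitis on MRI (sketch).

    lesions_per_slice[k] is the number of subchondral or juxta-articular
    BME lesions seen on consecutive slice k. Assumption: lesions on two
    adjacent slices are treated as the same lesion.
    """
    counts = list(lesions_per_slice)
    # Several lesions on the same slice are sufficient.
    if any(c >= 2 for c in counts):
        return True
    # A single lesion must be visible on at least two consecutive slices.
    return any(a >= 1 and b >= 1 for a, b in zip(counts, counts[1:]))
```

A single faint lesion on two adjacent slices returns True exactly as extensive edema does: the rule is binary and carries no information on the amount of BME.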

Another reason to quantify BME at the SI joints is its role as a predictor of structural damage. In a prospective longitudinal study of 40 patients with inflammatory low back pain who underwent radiography and SI MRI at baseline and radiography again after 8 years, Bennett et al. showed that HLA-B27 positivity and the severity of BME on SI MRI were excellent predictors of ankylosing spondylitis [10]. In addition, the high cost of biological therapies demands their rational use; for this reason, any tool able to assess, and even predict, treatment response has attracted great interest. Several tools to quantify inflammatory changes on SI MRI have been developed [11, 12], including the Spondyloarthritis Research Consortium of Canada (SPARCC), Berlin, Aarhus-Puhakka, Aarhus-Madsen, Leeds, MR Imaging of Seronegative SpA (MISS), Sieper/Rudwaleit, and Hermann/Bollow scoring systems [13,14,15,16,17]. In general, all of them are based on the presence and extent of BME near the cartilage, although some also include inflammation in the joint space and the ligamentous portion. Some of these techniques use gadolinium-enhanced sequences and others short tau inversion recovery (STIR) sequences, and scores vary depending on whether the entire joint or joint quadrants are evaluated. However, these methods have shortcomings. The MISS and Leeds methods have not been fully validated. The Aarhus method reads inflammation at all levels and is very complex to use. All of them show very good to excellent intraobserver reliability but poor to moderate interobserver reliability, except for the SPARCC, whose interobserver reliability is acceptable [18]. The SPARCC also appears to be the most sensitive to change, having shown good discriminant ability in a randomized placebo-controlled trial of adalimumab in ankylosing spondylitis refractory to NSAIDs [19].
In general, the discriminant power of current quantification systems at the SI joint is poorly defined. In addition, these quantification methods are largely restricted to clinical trials; their use in clinical practice is limited by their complexity, the need for trained personnel, and the time required.

Advances in computers and data-processing software have led to significant progress in image analysis. With the aim of improving interobserver agreement in the quantification of sacroiliitis while keeping a practical perspective, our group has developed a fast, easy, and valid semi-automated method to quantify BME on MRI of the SI joints. Herein, we present its development and characteristics, and the successive studies conducted to analyze its validity and reliability.

Methods

Development of the procedure

Two rheumatologists expert in SpA (RA, PZ), a radiologist expert in musculoskeletal imaging (AB), and an engineer (LM) developed a semi-automated method to quantify BME observed on MRI images. After discussing the requirements, they searched for applicable software components. The software was then tuned and tested repeatedly until it was considered fit for the defined task. A dedicated data register program was designed for Windows operating systems, into which the digital imaging and communications in medicine (DICOM) images selected for each patient can be loaded. To evaluate the images, a plugin was designed that allows the program to interact easily with the ImageJ software (https://imagej.nih.gov/ij/index.html) [20]. Using a contour-detection algorithm, the program automatically delimits the target area when the user clicks on the region of interest (ROI), and calculates the area and intensity of the selected zones. The program stores the zones selected by the user and the measurements made in a database with a Firebird engine (https://www.firebirdsql.org/).
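The value of storing the selected zones, and not only the score, can be illustrated with a minimal Python sketch. It is not the SCAISS database: the schema, table, and function names are hypothetical, and Python's built-in sqlite3 module stands in for the Firebird engine only because it ships with Python.

```python
import json
import sqlite3

# Hypothetical schema: one row per ROI saved by a reader. SCAISS itself
# uses a Firebird database; sqlite3 is only a stand-in for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS roi (
    id             INTEGER PRIMARY KEY,
    patient_id     TEXT NOT NULL,
    reader_id      TEXT NOT NULL,
    plane          TEXT NOT NULL CHECK (plane IN ('axial', 'coronal')),
    is_reference   INTEGER NOT NULL DEFAULT 0,
    area           REAL NOT NULL,
    perimeter      REAL NOT NULL,
    mean_intensity REAL NOT NULL,
    outline_json   TEXT NOT NULL
)
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_roi(conn, patient_id, reader_id, plane, area, perimeter,
             mean_intensity, outline, is_reference=False):
    """Store one ROI together with its outline so it can be re-read later."""
    cur = conn.execute(
        "INSERT INTO roi (patient_id, reader_id, plane, is_reference, area,"
        " perimeter, mean_intensity, outline_json)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (patient_id, reader_id, plane, int(is_reference), area, perimeter,
         mean_intensity, json.dumps(outline)),
    )
    conn.commit()
    return cur.lastrowid


def load_rois(conn, patient_id, reader_id):
    """Return the saved ROIs of one reader for one patient, in saving order."""
    rows = conn.execute(
        "SELECT plane, is_reference, area, perimeter, mean_intensity,"
        " outline_json FROM roi WHERE patient_id = ? AND reader_id = ?"
        " ORDER BY id",
        (patient_id, reader_id),
    ).fetchall()
    return [(plane, bool(ref), area, perim, mean, json.loads(outline))
            for plane, ref, area, perim, mean, outline in rows]
```

Because each outline is kept, a later session can redraw exactly the same areas for re-reading or follow-up.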

Tests were conducted by the rheumatologists and the radiologist, who also established the adjustments needed to define the borders of the inflamed area. The adequacy of the chosen software was decided by consensus among the four researchers, based on ease of use and absence of errors.

Description of the technique

The technique, named SCAISS after its Spanish acronym (herramienta eSpañola para la Cuantificación semiAutomática de Inflamación de Sacroilíacas en resonancia magnética en eSpondiloartritis, that is, Spanish tool for the semi-automated quantification of sacroiliac inflammation on magnetic resonance imaging in spondyloarthritis), requires MRI images acquired with a STIR sequence and saved in DICOM format. Only two planes are selected, semiaxial and semicoronal (axial and coronal, perpendicular to the SI joint), showing hyperintense signal in the juxta-articular area. On the image displayed on screen, the physician marks with a mouse click, one by one, each area of visible (hyperintense) BME. The software automatically selects the areas adjacent to the clicked point whose intensity lies within a predetermined tolerance range. Once the area has been delimited by a closed perimeter, the software calculates its area, perimeter length, and mean signal intensity (brightness) (see Fig. 1). The lesions selected should lie in the juxta-articular area, as specified by the ASAS consensus on the anatomical features of sacroiliitis [9].
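The paper does not state which contour-detection algorithm the plugin uses. One plausible reading of "areas adjacent to the pointer-click with intensity within a predetermined tolerance range" is seeded region growing, that is, a flood fill with an intensity tolerance, sketched below in Python on a plain 2-D list of pixel values. The 4-connectivity, the comparison against the seed pixel value, and the pixel_area/pixel_side parameters are our assumptions.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # assumed 4-connectivity


def grow_region(image: List[List[float]], seed: Pixel,
                tolerance: float) -> Set[Pixel]:
    """Collect connected pixels within +/- tolerance of the clicked pixel."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:  # breadth-first flood fill from the mouse click
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


def region_stats(image: List[List[float]], region: Set[Pixel],
                 pixel_area: float = 1.0, pixel_side: float = 1.0):
    """Return (area, perimeter, mean intensity) of a grown region."""
    area = len(region) * pixel_area
    # Perimeter = number of pixel edges that border a pixel outside the region.
    exposed_edges = sum(1 for r, c in region for dr, dc in NEIGHBOURS
                        if (r + dr, c + dc) not in region)
    mean_intensity = sum(image[r][c] for r, c in region) / len(region)
    return area, exposed_edges * pixel_side, mean_intensity
```

Counting exposed pixel edges is a crude perimeter estimate; a real implementation would also convert pixel counts to millimetres using the pixel spacing stored in the DICOM header.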

Fig. 1

Sacroiliac MRI with various grades of bone marrow edema (BME) and the results of semi-automated scoring with the SCAISS. The user clicks in the bone marrow area next to the joint space where the signal is hyperintense (inflammatory changes). The system automatically outlines the area with similar signal (within a preset tolerance range), clearly different from the normal background (thick arrows). If this area corresponds reasonably to what the user considers pathological, it is saved by pressing the “G” key; if not, the user clicks again until satisfied with the area. After selecting all pathological areas, the user presses the “R” key to set a reference (background) area, for example in the center of the S1 vertebral body (thin arrows). Finally, pressing the “F” key makes the system display the score (right screens)

To correct for background signal intensity, the reader finally clicks on a previously agreed location, the middle area of the sacrum (thin arrows in Fig. 1). This area serves as the reference against which the mean signal intensity of the lesion areas is assessed.

From the selected images, the software automatically calculates the area (total and average) and the mean intensity, and provides an overall lesion score (lower panels of Fig. 1). The score is the sum, over all ROIs in both planes, of each ROI area weighted by its mean intensity, divided by twice the reference intensity; in other words, it is the mean of the intensity-weighted lesion areas of the two planes, normalized to the reference:

$$\frac{\sum_{i \in \text{Axial}} A_i \times I_i + \sum_{j \in \text{Coronal}} A_j \times I_j}{2 \times \text{Ref}},$$

where A_i and I_i are the area and mean intensity of axial ROI i, A_j and I_j those of coronal ROI j, and Ref is the mean intensity of the reference area.
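The formula can be checked with a short Python sketch. It assumes a single reference intensity shared by both planes, as the formula suggests; the Roi class and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Roi:
    area: float            # A: ROI area (e.g. in mm^2)
    mean_intensity: float  # I: mean STIR signal inside the ROI


def scaiss_score(axial: Iterable[Roi], coronal: Iterable[Roi],
                 reference: float) -> float:
    """Sum of intensity-weighted ROI areas in both planes over 2 x Ref."""
    if reference <= 0:
        raise ValueError("reference intensity must be positive")
    weighted = sum(r.area * r.mean_intensity for r in axial)
    weighted += sum(r.area * r.mean_intensity for r in coronal)
    return weighted / (2 * reference)
```

For example, two axial ROIs (area 40, intensity 300; area 10, intensity 250) and one coronal ROI (area 30, intensity 280) give a weighted sum of 12000 + 2500 + 8400 = 22900; with Ref = 100, the score is 22900 / 200 = 114.5.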

Figure 1 illustrates the SCAISS technique with three example SI MRI scans, each showing a different grade of BME. Selected areas can be reset and re-selected simply with another mouse click if the evaluator is not fully satisfied.

Validation study

A cross-sectional study was conducted to analyze the feasibility and convergent validity of the method. For this purpose, 23 patients diagnosed with axial SpA (according to the ASAS classification criteria) with available semiaxial and semicoronal SI MRI images were selected from the Picture Archiving and Communication System (PACS). Patients were selected consecutively using stratified sampling to obtain groups balanced by gender and by severity of BME (low, moderate, and severe). For each reader, the first patient was used as a test case, and the 2nd, 3rd, and 4th patients were evaluated again in a second round to assess intra-reader reliability (test–retest).

The images were then scored semi-automatically with the SCAISS, and manually with the Berlin and the SPARCC methods (Fig. 2). All 23 patients were evaluated by the developers in a first phase. In a second phase, 12 patients were randomly selected to be scored with the three methods by 20 readers (12 rheumatologists and 8 radiologists). In a third phase, the images of the 12 patients were scored with the SCAISS and Berlin methods by 203 attendees in workshops.

Fig. 2

Quantification of sacroiliitis by the Berlin and the SPARCC methods. Two additional screens were used to score the patients’ images with the Berlin and the SPARCC methods

The SCAISS method has been detailed above.

The SPARCC method assesses BME on six consecutive semicoronal slices acquired with STIR sequences. On each slice, the presence of BME is scored (0–1) in each of the four quadrants of the right and left SI joints; for each joint, 1 point is added if the lesion depth is greater than 1 cm and another point if the signal intensity is high. Each slice therefore contributes up to 12 points (8 + 2 + 2), and the total score ranges from 0 to 72 [16].
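The arithmetic behind the 0 to 72 range can be verified with a minimal Python sketch that follows only the rules summarized in this paragraph. The published SPARCC instrument has additional reading conventions that are not reproduced here, and the JointReading structure is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class JointReading:
    """One SI joint on one slice (hypothetical structure)."""
    quadrants_with_bme: Sequence[bool]  # four quadrants, True if BME present
    depth_over_1cm: bool                # lesion depth greater than 1 cm
    high_intensity: bool                # intense signal


def sparcc_slice(right: JointReading, left: JointReading) -> int:
    score = 0
    for joint in (right, left):
        if len(joint.quadrants_with_bme) != 4:
            raise ValueError("each joint needs exactly four quadrants")
        score += sum(joint.quadrants_with_bme)  # 0-4 points
        score += int(joint.depth_over_1cm)      # 0-1 point
        score += int(joint.high_intensity)      # 0-1 point
    return score                                # 0-12 per slice


def sparcc_total(slices: Sequence[Tuple[JointReading, JointReading]]) -> int:
    """slices: six (right, left) readings from consecutive semicoronal slices."""
    if len(slices) != 6:
        raise ValueError("six consecutive slices are required")
    return sum(sparcc_slice(right, left) for right, left in slices)
```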

The modified Berlin method is based on a single semicoronal STIR slice showing the most significant lesion. The right and left SI joints are each divided into two facets (sacral and iliac), and BME in each facet is scored from 0 to 3 (0 if absent; 1 if BME occupies < 25% of the facet; 2 if it occupies 25–50%; and 3 if > 50%). The total score therefore ranges from 0 to 12 [5, 21].
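A minimal Python sketch of this grading rule follows. Readers estimate the percentages visually, so a numeric input is an idealization, and assigning exactly 25% and 50% to grade 2 is our assumption based on the wording "between 25 and 50%".

```python
from typing import Sequence


def berlin_facet_grade(percent_bme: float) -> int:
    """Grade one facet (0-3) from the percentage of it occupied by BME."""
    if not 0 <= percent_bme <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percent_bme == 0:
        return 0
    if percent_bme < 25:
        return 1
    if percent_bme <= 50:  # assumption: 25% and 50% both map to grade 2
        return 2
    return 3


def berlin_total(facets: Sequence[float]) -> int:
    """facets: % BME in right sacral, right iliac, left sacral, left iliac."""
    if len(facets) != 4:
        raise ValueError("expected four facets")
    return sum(berlin_facet_grade(p) for p in facets)
```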

The readers analyzed all images with the three techniques, presented in random order, and were blinded to patient characteristics.

In addition, the system recorded feasibility parameters: the number of clicks per plane and the time spent evaluating each patient's set of images with each method.

Descriptive variables were collected from the rheumatology database at the visit closest to the MRI date: sex; age; disease activity, by the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI) and the Ankylosing Spondylitis Disease Activity Score (ASDAS-CRP); functional capacity, by the Bath Ankylosing Spondylitis Functional Index (BASFI); acute phase reactants (erythrocyte sedimentation rate, ESR, and C-reactive protein, CRP); and pain on a visual analogue scale (VAS). The specialty and years of experience of the readers were also recorded.

Statistical analysis

The median and interquartile range were used as summary statistics to describe the sample of patients and readers, followed by a calculation of the validation parameters.

Reliability

Inter-observer reliability was calculated as an intraclass correlation coefficient (ICC) with 95% confidence intervals (95% CI). All three scoring methods were evaluated in the phases with 3 and 20 observers. In the 20-observer study, intra-observer reliability (test–retest) was assessed in three patients whose images appeared a second time at random, using Pearson's r coefficient. In the 203-reader workshop study, only the reproducibility of the SCAISS and Berlin methods was tested.
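The text does not specify which ICC form was used. As an illustration only, the sketch below computes the Shrout and Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater) from a patients × readers matrix in plain Python; confidence intervals are not computed, and in practice a tested statistical package would be preferable.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score given to patient i by reader j.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two patients and two readers")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between readers
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

With ratings [[1, 2], [2, 3], [3, 4]], where the second reader scores every patient one point higher, the absolute-agreement ICC drops to 2/3 even though both readers rank patients identically; Pearson's r would be 1 for the same data, because it ignores such a systematic offset.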

Construct validity

Convergent validity of the SCAISS with the Berlin and SPARCC methods was tested with Pearson's r coefficient in the three-observer study. In addition, the discriminant ability of the SCAISS was tested by plotting the scores as box plots by grade of BME and by analysis of variance. The three BME categories (low, moderate, and high) were those used for the stratified sampling and were based on the radiologist's impression.
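The discriminant test amounts to comparing SCAISS scores across the three BME grades. A minimal Python sketch of the one-way ANOVA F statistic is shown below; obtaining the p value additionally requires the F distribution with (df_between, df_within) degrees of freedom, which in R is available through pf() or aov(). The example groups in the test are made up.

```python
from typing import Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    values = [x for g in groups for x in g]
    n_total, n_groups = len(values), len(groups)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = sum(values) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variability of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = n_groups - 1, n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```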

Feasibility

The median (interquartile range, IQR) time spent assessing the images with each method was calculated in the 3- and 20-observer studies.
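Reading times in the Results are reported as median (IQR), with the value in parentheses being the width of the interquartile range. A minimal Python sketch of this summary follows. The quartile definition is our assumption: method="inclusive" in Python's statistics.quantiles uses the same linear interpolation as R's default quantile type (type 7), but other definitions give slightly different values.

```python
import statistics
from typing import Sequence, Tuple


def median_iqr(times_s: Sequence[float]) -> Tuple[float, float]:
    """Median and interquartile-range width of reading times (seconds)."""
    if len(times_s) < 2:
        raise ValueError("need at least two readings")
    q1, _, q3 = statistics.quantiles(times_s, n=4, method="inclusive")
    return statistics.median(times_s), q3 - q1
```

With the Python default, method="exclusive", the readings 10, 20, 30, 40, and 50 s give an IQR of 30 s instead of 20 s, which is why the quartile definition matters when comparing reported values.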

All analyses were performed with R: a language and environment for statistical computing [22].

Results

Table 1 describes the 23 patients whose images were used to study the validity and reliability of the SCAISS. The 12 patients used for reproducibility purposes in the workshops did not differ from the larger sample, as they were randomly chosen from it. The proportion of men (50%) and of each grade of BME (33% each) were fixed by design. Otherwise, the sample consisted of typical patients with axial SpA, aged between 28 and 51 years, with a median disease duration of 7 years and moderate levels of disease activity and impact.

Table 1 Description of the sample used for validation

The interobserver reliability (ICC and 95% CI) for each method in the three observers’ study was similar across methods: SCAISS = 0.770 (0.580–0.889); Berlin = 0.725 (0.537–0.860); and SPARCC = 0.824 (0.671–0.916). In the 20 observers’ study, interobserver reliability was: SCAISS = 0.801 (0.653–0.927); Berlin = 0.702 (0.518–0.882); and SPARCC = 0.790 (0.623–0.923). In the 203 observers’ study, the ICC (95% CI) of the SCAISS was 0.810 (0.675–0.930), and that of the Berlin method 0.636 (0.458–0.843).

Intra-observer reliability, assessed in three patients in the 20-observer study with Pearson's correlation coefficient (r, 95% CI), was as follows: SCAISS = 0.965 (0.938–0.980); Berlin = 0.838 (0.725–0.907); and SPARCC = 0.949 (0.911–0.971).

Regarding construct validity, tested in the three-observer phase, the Spearman correlation coefficient between the SCAISS and the Berlin method was 0.747, 0.729, and 0.747 for the first, second, and third evaluator, respectively, and that between the SCAISS and the SPARCC was 0.772, 0.840, and 0.793, respectively.

Figure 3 shows the discriminant ability of the SCAISS by grades of BME (low, moderate, and severe) from the three observers reading exercise. The three groups are markedly different from each other (p value from the analysis of variance < 0.01).

Fig. 3

Summary statistics of the SCAISS by grades of bone marrow edema

The median (IQR) time, in seconds, spent by the three developers on each scoring system was SCAISS = 30 (32), Berlin = 18 (14), and SPARCC = 76 (59). In the 20-observer study, the median (IQR) time was SCAISS = 28 (27), Berlin = 14 (9), and SPARCC = 94 (68).

Table 2 shows the results of the validation with 20 evaluators by type of reader: rheumatologists (60%) or radiologists (40%). Experience, measured in years of practice, was 20.3 (SD 7.7) years for the radiologists and 17.1 (SD 7.5) years for the rheumatologists. No differences in reliability or feasibility were detected between the groups of specialists.

Table 2 Description of the 20 readers and results of the validation by groups

Discussion

Image reading remains a subjective procedure, with implications for clinical trials and epidemiological studies. Automated or semi-automated procedures for evaluating clinical images are of primary importance to reduce measurement error. Our group has developed a semi-automated technique to grade sacroiliitis, based on detecting signal intensity attributable to BME. Based on our results, the SCAISS appears to be a valid method for use in studies, and simple enough to be used in clinical practice as well, to assess and monitor sacroiliitis.

Other available techniques may suffer from subjectivity. In a study evaluating the efficacy of training in reading spinal and SI MR images with the Berlin method, agreement on sacroiliitis between participants and an expert was very low, with kappa values below 0.5 and no clear improvement after training [23]. Consistently, in our study the interrater agreement of the widely used Berlin method was lower than that of the SCAISS, and that of the SPARCC was similar. This reflects the greater reliance of the proposed method on the software and its lower dependence on the reader's skill. The developers were very satisfied with the areas selected by the software, reflecting acceptable face validity, and usability was good and similar for rheumatologists and radiologists. Because the procedure is semi-automated (the physician decides the ROI) and different specialists may apply different criteria, agreement was expected to be limited; nevertheless, although not perfect, it is better than that obtained with the manual procedures used so far for this purpose [23]. Despite some discordance, the discrepancies between evaluators appear to be due primarily to differences in criteria rather than to the technique, as no evaluator considered that the areas delineated by the software should have been different.

BME, used as the imaging target for inflammation, has been validated as a marker of future structural damage. Hermann et al. demonstrated a correlation between cellularity and the grade of BME on MRI [24], and Gong et al. evaluated 109 patients over 10 years and found that sacroiliitis by biopsy predicts ankylosis, an association that was even stronger in patients with high SPARCC scores [25]. The convergent validity hypothesis for the SCAISS was confirmed by its strong correlations with validated methods that also measure BME in the SI joints, such as the SPARCC.

When determining the utility of software designed to replace direct reading of the images, it is important to know the procedural and computing times. Most readings in our study took less than 30 s, showing very acceptable feasibility. It is important to note that (1) this is a new procedure, (2) the readers had no training, and (3) it has been applied to a small number of patients. For these reasons, although it is reasonable to assume that the SCAISS will be simple to use, its feasibility needs to be confirmed in future studies with larger numbers of patients.

Recently, Hededal et al. developed a semiaxial MRI scoring method for SI joint inflammation by modifying the Berlin and SPARCC methods [26]. This score divides each SI joint into four quadrants and scores BME as 0 if absent and 1 if present per quadrant per joint, with a total score ranging from 0 to 8, similar to the semicoronal and semiaxial Berlin and SPARCC methods. The authors concluded that there is no advantage in using semiaxial over semicoronal slices, owing to reliability and responsiveness issues, and proposed that future studies investigate the combined evaluation of semiaxial and semicoronal sequences, which is precisely what the SCAISS provides.

The greatest advantage of the SCAISS over other methods is its ease of use (the reader only needs to select each ROI with a mouse click); given that its validity and reliability have been demonstrated, we propose it for use in clinical practice.

The SCAISS is more reproducible than the Berlin method, and its concordance is as good as that of the SPARCC while requiring fewer images (only two planes, semicoronal and semiaxial, compared with six semicoronal slices for the SPARCC), thus permitting faster reading. In addition, the selected areas can be saved (as ROIs), not just the score, which allows better tracking of the measurement process, re-reading if needed, and easier follow-up of the same or new areas. Further advantages of the method are that only STIR sequences are needed and that both coronal and axial slices can be reliably read, whereas other techniques can only be assessed on coronal images.

Some limitations must be taken into account when interpreting these results. The sample, although sufficient to test reliability consistently, is small; this will be addressed in ongoing validation studies. In addition, we have not yet evaluated the sensitivity to change of the SCAISS, and therefore cannot yet propose it for use in clinical trials. Nevertheless, the SCAISS has a wide range of possible values, a characteristic favorable to responsiveness.

Future developments and validation of the SCAISS will include the selection of cut-offs, reproducibility with different machines, and sensitivity to change.

In summary, we have developed the SCAISS, a semi-automated technique that allows fast, reliable, and easy quantification of sacroiliitis on MRI images by objectively detecting areas of BME. Additional validation studies are needed to confirm whether the SCAISS can avoid the need for centralized readings and serve to monitor lesions and treatment response in more detail than other methods, in which only the global score is recorded.