Introduction

Systemic lupus erythematosus (SLE) is an autoimmune, multisystem disease with protean clinical and immunological manifestations. Moreover, even in a single patient, it has a very variable course with periods of flares and remissions [1, 2]. Therefore, assessment of disease activity is a great challenge for both physicians and researchers.

Measurement of disease activity has a central role in proper and prompt management of SLE in clinical daily practice. Moreover, it is important in observational and randomized clinical trials to evaluate whether the primary endpoint or efficacy of treatment has been reached [3,4,5]. Although serological markers and acute phase reactants have been used widely to assess lupus activity, their drawbacks and pitfalls should be considered [4]. For instance, anti-double-stranded DNA (anti-ds DNA) and complement levels may be abnormal despite a low disease activity state. Similarly, high erythrocyte sedimentation rate or C-reactive protein does not necessarily mean active SLE. Furthermore, differentiating reversible active disease from irreversible tissue/organ damage is another important issue in assessment of disease activity [1].

Although more than 60 indices have been developed to measure disease activity since the middle of the last century, no assessment tool has yet been established as the gold standard [1, 6].

Physician Global Assessment (PGA), or physician’s opinion, has also been used to capture disease activity [3, 7, 8]. However, it relies solely on physician’s judgment. In addition, laboratory results could confound its determination [9].

Among the several disease activity indices, some measure global disease activity, such as the SLE Disease Activity Index-2K (SLEDAI-2K), the European Consensus Lupus Activity Measurement (ECLAM), and the Revised Systemic Lupus Activity Measure (SLAM-R) [1]. They are useful for comparing patients with widely different clinical presentations, but they may overlook high activity in a single organ when the overall measured disease activity is mild. Conversely, other indices such as the British Isles Lupus Assessment Group-2004 (BILAG-2004) index score activity in individual organs [10]. Therefore, they may detect improvement or deterioration more precisely than the former group and, consequently, may be more informative for decisions to change treatment.

The validity and reliability of these indices have been studied before [11,12,13,14]. However, previous studies have reported inconsistent results on their sensitivity to detect disease activity. For instance, Yee et al. found BILAG to be more sensitive than SLEDAI-2K [15], whereas Gladman et al. found the reverse [16]. Likewise, although some studies found SLAM-R to be more sensitive than SLEDAI in recording clinical changes [8, 17], others demonstrated the opposite [16]. Moreover, most previous studies compared only two or three indices or had small sample sizes. We aimed to longitudinally compare SLEDAI-2K, BILAG-2004, SLAM-R, and ECLAM in detecting changes in disease activity and to assess how well they predict treatment alteration in routine clinical practice.

Patients and methods

Study design

In a longitudinal cohort study conducted from August 2014 to April 2015, 102 patients who fulfilled at least four of the American College of Rheumatology classification criteria [18] of SLE were recruited. The patients were referred from the Lupus Clinic affiliated to the Isfahan University of Medical Sciences. The regional Ethical Committee approved the study design. Informed consent was obtained from all patients.

Data collection

Clinical and laboratory parameters as well as prescribed medications were recorded during each visit. The interval between the two consecutive assessments was not determined in advance. Rather, patients were visited in routine clinical practice “as needed” according to their health status in order to either modify the dosage of medications, prescribe stronger/lighter medications, or follow the patients more closely. Activity indices including SLEDAI-2K, BILAG-2004, ECLAM, and SLAM-R were calculated during each visit.

We selected the measurement tools that are used most frequently in clinical settings and research investigations and that allow disease categorization. They differed in measurement format and definition of disease severity. A drawback common to all tools was that they overlook changes in the medications used by patients. A rheumatologist (the corresponding author) conducted the interview and physical examination during each visit and recorded his opinion about disease activity according to PGA. The PGA scale was as follows: 0 (no active disease), 1 (mildly active disease), 2 (moderately active disease), and 3 (severely active disease) [19].

BILAG-2004 included nine categories of symptoms and laboratory evaluations: constitutional, musculoskeletal, mucocutaneous, cardiorespiratory, neuropsychiatric, gastrointestinal, renal, ophthalmic, and hematological [11]. Each item was scored based on the presence, absence, recurrence, or progression of a symptom or laboratory parameter in the last 4 weeks in relation to the previous 4 weeks. BILAG-2004 grades form a nominal scale of letters A, B, C, D, and E, which stand for “requiring urgent disease-modifying therapy,” “demanding close attention, often symptomatic therapy,” “stable, controlled on current therapy,” “inactive disease but previously affected,” and “no involvement,” respectively. Active disease (increased disease activity) was defined as a change of any previous score to an A score, or a change of a C, D, or E score to a B score. Improvement (decreased disease activity) was defined as disappearance of A and B scores in all organs/systems with no new A or B scores [1].
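
The BILAG-2004 change rules described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the function name, the dictionary layout (organ system mapped to grade), the "unchanged" label, and the reading of "disappearance" as requiring at least one A or B score at the previous visit are all assumptions.

```python
ACTIVE_GRADES = {"A", "B"}

def bilag_change(previous, current):
    """Classify the change in BILAG-2004 between two consecutive visits.

    previous, current: dicts mapping each organ system to a grade "A".."E".
    Both dicts are assumed (for this sketch) to contain the same systems.
    """
    for system, grade_now in current.items():
        grade_before = previous[system]
        # Any system whose grade newly becomes A counts as increased activity.
        if grade_now == "A" and grade_before != "A":
            return "increased"
        # A change from C, D, or E to B also counts as increased activity.
        if grade_now == "B" and grade_before in {"C", "D", "E"}:
            return "increased"
    # Improvement: A/B scores were present before and none remain now.
    had_active = any(g in ACTIVE_GRADES for g in previous.values())
    has_active = any(g in ACTIVE_GRADES for g in current.values())
    if had_active and not has_active:
        return "decreased"
    return "unchanged"
```

For example, a renal grade moving from B to A is classified as increased activity, whereas a renal grade moving from B to C with no other A or B score is classified as decreased activity.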

SLEDAI-2K included 16 clinical manifestations and eight laboratory parameters (four urinary and four hematologic/immunologic items) [14]. The presence of any item during the past 10 days was recorded, and the weighted item scores were summed. The final score ranged from 0 (no disease activity) to 105 (the most severe disease activity). An increase of more than 3 points and a decrease of more than 3 points in SLEDAI-2K were considered increased and decreased disease activity, respectively [1].

ECLAM is also a global index of SLE disease activity in the past month [20]. It scored disease activity from 0 to 10 using 15 weighted clinical, hematological, and serological findings. The latter included ESR and complement levels.

SLAM-R involved 23 clinical symptoms and signs (active vs. non-active) plus 7 laboratory parameters [12]. They were weighted on a Likert scale of 1 (mild) to 3 (severe) and grouped into 10 categories, leading to a maximum score of 81. The categories included laboratory, cardiovascular, pulmonary, ophthalmic, gastrointestinal, dermatologic, neuromotor, rheumatologic, and reticuloendothelial systems. As in BILAG-2004, immunologic parameters are excluded from SLAM-R. ECLAM and SLAM-R were treated as continuous variables, and a change of one unit was defined as a change in the index.
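
For the three numerically scored indices, the change rules can be written as simple threshold checks. The sketch below is illustrative only. The function names and the "unchanged" label are hypothetical, and reading "one unit change" as a change of at least one unit in either direction is an assumption.

```python
def sledai_2k_change(previous_score, current_score, threshold=3):
    """SLEDAI-2K: the change must exceed 3 points to count."""
    delta = current_score - previous_score
    if delta > threshold:
        return "increased"
    if delta < -threshold:
        return "decreased"
    return "unchanged"

def unit_change(previous_score, current_score, unit=1):
    """ECLAM or SLAM-R: a change of at least one unit counts (assumed reading)."""
    delta = current_score - previous_score
    if delta >= unit:
        return "increased"
    if delta <= -unit:
        return "decreased"
    return "unchanged"
```

Under these rules, a SLEDAI-2K score rising from 4 to 7 is not counted as increased activity, while a rise from 4 to 8 is.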

Treatment change was classified as decrease/no change vs. increase in treatment. The classification was based on the difference between the treatment prescribed after patient evaluation and the treatment before evaluation. Any increase in any of the medications of interest, irrespective of a concurrent decrease in other medications, was considered a treatment increase. Any decrease in any of the medications of interest, without a concurrent increase in other medications, was considered a treatment decrease. The following were considered the medications of interest: glucocorticoids, immunosuppressive and anti-malarial agents, immunoglobulins, and plasmapheresis. Treatment change was used as the reference standard for disease activity. PGA and the four aforementioned indices were compared to this reference standard. Treatment change was also used as the response (dependent) variable to build the models for each index (see below).
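
The precedence rule in this classification (any increase outweighs concurrent decreases) can be made explicit with a short sketch. Representing each treatment as a dictionary of medication names mapped to numeric doses is an assumption made for illustration; the function name and the example doses are hypothetical and not taken from the study data.

```python
def classify_treatment_change(before, after):
    """Compare the treatment before and after a visit.

    before, after: dicts mapping a medication of interest to its dose
    (a missing medication is treated as dose 0; a procedure such as
    plasmapheresis could be coded as 0 or 1).
    """
    medications = set(before) | set(after)
    any_increase = any(after.get(m, 0) > before.get(m, 0) for m in medications)
    any_decrease = any(after.get(m, 0) < before.get(m, 0) for m in medications)
    # An increase in any medication wins, even if another one was reduced.
    if any_increase:
        return "increase"
    if any_decrease:
        return "decrease"
    return "no change"

def is_increased_treatment(before, after):
    """Binary grouping used in the analyses: increase vs. decrease/no change."""
    return classify_treatment_change(before, after) == "increase"
```

For instance, tapering prednisolone while raising azathioprine is still classified as a treatment increase under this rule.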

Statistical methods

Data analyses were conducted using SPSS (SPSS, Chicago, IL). At each visit, patients were divided into two groups: increased treatment and decreased/not-changed treatment. The Spearman correlation was used to assess the correlation between treatment change and each measurement index. For each index, the frequency distributions of increased disease scores relative to the previous visit were compared between the two treatment-change groups using the chi-square test. The probability of increased treatment vs. decreased/not-changed treatment was evaluated by longitudinal analysis using generalized linear mixed-effects models (GLMM) and generalized estimating equations (GEE), adjusted for the fixed effects of sex, age at disease diagnosis, current age, and disease duration, and for the random effects of subjects. Models were compared using Akaike’s information criterion (AIC) for GLMM and the quasi-likelihood under the independence model criterion (QIC) for GEE. Among significant models, those with smaller AIC/QIC values fit better. Although some patients were visited more than three times, we included only the first three visits in the current study to obtain a balanced data matrix, which favors optimal model building by GEE. Adjusted odds ratios (ORs) and 95% confidence intervals (CIs) were reported. The predictive power of the indices was compared using receiver operating characteristic (ROC) curves, which plot sensitivity against 1-specificity, and the area under the curve (AUC). The larger the AUC, the higher the predictive ability of the index. A p value less than 0.05 was considered significant.
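
As an illustration of the first analysis step, the Spearman correlation between a binary treatment change and an index score can be computed by ranking both variables (ties share their average rank) and taking the Pearson correlation of the ranks. The study itself used SPSS; the pure-Python sketch below, with hypothetical function names and made-up example values, only shows the calculation.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks.

    Undefined (division by zero) if either variable is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: treatment increased (1) or not (0), and an index score.
treatment_increased = [0, 0, 1, 1]
index_score = [1, 2, 3, 4]
rho = spearman(treatment_increased, index_score)
```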

Results

This study comprised 89 women and 13 men. Patients were followed up for 2 to 8 months and were visited at least 3 times. There was no significant difference between women and men in mean age (35.5 vs. 34.8 years, respectively), average disease duration (95.8 vs. 79.4 months, respectively), average follow-up duration, or average number of visits. Other demographic characteristics of the patients are presented in Table 1. High anti-ds DNA and low complement levels were seen in 57 and 30.5% of patients at baseline, respectively. Additional laboratory indices are shown in Table 1. At the first visit, the kidney was the most frequently treated organ (72%), followed by skin (15%), joints (6%), blood (6%), and the central nervous system (1%). The same pattern, with minimal changes in percentages, was observed at the second and third visits. Hydroxychloroquine was the most frequently prescribed medication (84%), followed by prednisolone (83%), azathioprine (25%), mycophenolate mofetil (21%), cyclophosphamide (14%), tacrolimus (12.5%), methotrexate (6%), and cyclosporine (3%). The mean (SD) dosages of prescribed prednisolone at the first, second, and third visits were 8.8 (12), 7.2 (7.8), and 6.1 (6.5) mg/day, respectively. The correlations between treatment change and all indices are shown in Table 2. BILAG-2004 and SLEDAI-2K correlated substantially with treatment change, whereas SLAM-R and ECLAM demonstrated moderate correlation with treatment change.

Table 1 Demographics of patients recruited into the study and their baseline scores (n = 102)
Table 2 Correlation among the indices and treatment change. Significant correlations at the 0.01 level (two-tailed) are characterized by an asterisk

The distribution of the change in indices at each visit relative to the previous visit, according to treatment change, is presented in Table 3. Concordant cases of increased treatment/increased index are shown in italics. Similarly, concordant cases of decreased/not-changed treatment and decreased/not-changed index are in italics. At the second visit, 15 patients had increased treatment. An ideal index would show a concurrent increase in disease activity relative to the first visit in all 15 patients. The proportion of concordant cases (increased index with increased treatment) ranged from 40% (6/15) by SLEDAI-2K to 73% (11/15) by SLAM-R. At the third visit, treatment was increased for 13 patients. This treatment change was again detected differently by the indices, from 38% (5/13) by SLAM-R to 85% (11/13) by BILAG-2004. Similarly, 87 patients had decreased/not-changed treatment at the second visit, of whom the concordant cases (concurrent decreased/not-changed index) ranged from 73% (64/87) by SLAM-R to 84% (73/87) by BILAG-2004. Also, 89 patients had decreased/not-changed treatment at the third visit, of whom the concordant cases ranged from 77.5% (69/89) by SLAM-R to 89% (79/89) by SLEDAI-2K. No index captured all concordant pairs at both the second and third visits.
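
The concordance percentages above are simple proportions within each treatment group. A minimal sketch of the calculation follows. The function name and the five-patient example are hypothetical and serve only to show how the counts are formed.

```python
def concordant_fraction(treatment_increased, index_increased, group):
    """Fraction of patients in one treatment group whose index agrees.

    treatment_increased, index_increased: parallel lists of booleans.
    group=True selects patients with increased treatment;
    group=False selects patients with decreased/not-changed treatment.
    """
    pairs = [(t, i) for t, i in zip(treatment_increased, index_increased) if t == group]
    concordant = sum(1 for t, i in pairs if i == t)
    return concordant / len(pairs)

# Hypothetical visit with five patients.
treatment = [True, True, True, False, False]
index = [True, False, True, False, True]
among_increased = concordant_fraction(treatment, index, group=True)
among_not_increased = concordant_fraction(treatment, index, group=False)
```

Applied to the second visit, 6 concordant patients out of the 15 with increased treatment gives the 40% reported for SLEDAI-2K.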

Table 3 Distribution of treatment increased vs. treatment decreased/not changed according to each index. The significant distributions are marked by an asterisk

Treatment was considered the dependent variable and each index the independent variable in the GLMM and GEE models of the longitudinal analyses (Table 4). The models were adjusted for sex, age at disease diagnosis, current age, and disease duration. Aside from PGA, BILAG-2004 followed by SLEDAI-2K had the smallest AIC among the GLMM models and the smallest QIC among the GEE models. Sensitivity analyses according to AUC and ROC are presented in Table 5 and Fig. 1. Aside from PGA, BILAG-2004 followed by SLEDAI-2K demonstrated the largest AUC, indicating the highest ability to distinguish patients who needed increased treatment among the four main indices. The difference between BILAG-2004 and SLEDAI-2K was less than 1% (77.9 vs. 77.1%, respectively; Table 5).

Table 4 Longitudinal analyses using GLMM and GEE. Treatment was the dependent variable and each index was the independent variable
Table 5 Area under curves (AUC) for all indices
Fig. 1 Sensitivity analyses of all indices according to AUC and ROC

Discussion

In this study, we evaluated the changes in four SLE disease activity indices in daily clinical practice and their relationships to the physician’s assessment and treatment change. We sought to identify the index that best predicts treatment changes in longitudinal analyses.

All indices correlated substantially with PGA. SLEDAI-2K ranked highest, followed by BILAG-2004, ECLAM, and SLAM-R in decreasing order. This finding was broadly consistent with previous studies [8, 21, 22]. In addition, PGA, BILAG-2004, and SLEDAI-2K showed substantial correlation with treatment change. BILAG-2004 seems to be the only index whose grades (A, B, C, D, E) incorporate the treatment decision. It also has more items than the other three indices. This may explain its stronger correlation with treatment change compared with the others. Also, SLEDAI-2K includes objective items such as immunologic markers, which might explain its substantial correlation with treatment change [1]. On the other hand, some indices such as SLAM-R include more subjective items, like fatigue or arthralgia, that might be unrelated to SLE activity and, consequently, to treatment modification [1].

Table 3 showed an inconclusive pattern in how the five indices detected disease activity requiring treatment change. We therefore built models to take into account other factors, such as the age and sex of patients. The GLMM models in Table 4 give the odds of increased treatment relative to decreased/not-changed treatment in the current study. For example, the odds of requiring increased treatment instead of decreased/not-changed treatment for patients in BILAG-2004 category A are estimated to be 23.65 times the corresponding odds for patients in BILAG-2004 category E, all other variables being equal. Patients with BILAG-2004 = B have about 8 times the odds of needing increased treatment relative to those with BILAG-2004 = E. Similarly, patients with BILAG-2004 = C have roughly 2.5 times the odds of needing increased treatment relative to patients with BILAG-2004 = E. The GEE models in Table 4 provide information similar to that of the GLMM models, but at the population level. In other words, GLMM models provide subject-specific results, whereas GEE models provide population-averaged information.
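
Because an odds ratio multiplies odds rather than probabilities, it can help to translate it into a probability for a given reference group. The sketch below does this for the reported OR of 23.65 (category A vs. E). The baseline probability of 0.05 for category E is purely hypothetical, chosen for illustration and not taken from the study, and the function name is likewise an assumption.

```python
def probability_from_odds_ratio(baseline_probability, odds_ratio):
    """Probability in the comparison group implied by an OR and a baseline."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    comparison_odds = baseline_odds * odds_ratio
    return comparison_odds / (1 + comparison_odds)

# Hypothetical baseline: 5% of category E patients need increased treatment.
baseline = 0.05
p_category_a = probability_from_odds_ratio(baseline, 23.65)
risk_ratio = p_category_a / baseline
```

With this assumed baseline, the implied probability for category A is roughly 0.55, so the ratio of probabilities is about 11 rather than 23.65. The two measures only approach each other when the outcome is rare in both groups.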

The AUC measures discrimination. It reflects the ability of an index to correctly distinguish between patients who needed a treatment increase and those who did not, based on the sensitivity and specificity of each index. An index with perfect discrimination (100% sensitivity and 100% specificity) has a ROC curve that passes through the upper left corner. Thus, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the index. For instance, an AUC of 0.779 for BILAG-2004 in Table 5 means that, on average, a patient with increased treatment had a higher BILAG-2004 score than 77.9% of those with no increased treatment. In other words, BILAG-2004 was able to discriminate patients who needed increased treatment from those who did not. This was true for all indices, since all AUCs were significant. However, the detection ability differed, from a minimum AUC of 71% for SLAM-R to a maximum AUC of 78% for BILAG-2004. The difference between BILAG-2004 and SLEDAI-2K in terms of AUC was minimal. The ability of indices to capture disease activity and the need to increase treatment has differed in previous studies. Although BILAG-2004 performed better in this respect in the study of Yee et al. [15], there are reports of better performance of SLEDAI [16], SLAM [8, 23], or ECLAM [24]. This inconsistency among studies may result from differences in statistical analysis, sample size, or study design (for instance, cross-sectional vs. longitudinal study).
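
The pairwise interpretation of the AUC given above can be computed directly: compare every patient with increased treatment against every patient without, count the pairs in which the first has the higher score, and count ties as half. This is a minimal sketch with a hypothetical function name and made-up scores, not the SPSS procedure used in the study.

```python
def auc_pairwise(scores_increased, scores_not_increased):
    """AUC as the fraction of pairs ranked correctly (ties count as 0.5)."""
    wins = 0.0
    for a in scores_increased:
        for b in scores_not_increased:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_increased) * len(scores_not_increased))

# Hypothetical index scores for two small groups of patients.
perfect = auc_pairwise([3, 4], [1, 2])
uninformative = auc_pairwise([1, 2], [1, 2])
partial = auc_pairwise([2, 3], [1, 3])
```

A value of 1.0 corresponds to the ROC curve passing through the upper left corner, and 0.5 corresponds to an index that discriminates no better than chance.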

This study had some limitations. Our results could have been improved if the patients’ own assessment of disease status had been considered. In addition, one rheumatologist decided on all treatments, and there was no second check. If the patients had been evaluated by a second rheumatologist, fewer errors would probably be expected.

In summary, the current study showed that BILAG-2004 followed by SLEDAI-2K had the highest predictability of the need to increase the treatment in three consecutive visits among the four disease activity indices in our study. It may be expected that the outcome of BILAG-2004 in terms of treatment will be more meaningful.