Introduction

Low back pain (LBP) is a highly prevalent problem [1, 2] and a major cause of disability worldwide [1, 3], with a tendency to recur or persist, leading to chronicity [4, 5]. Often, no definitive pathology or underlying mechanism can be identified [6]. To account for uncertain and potentially multiple contributing factors, chronic LBP is commonly labeled as Non-Specific Chronic LBP (NS-CLBP) [1, 7,8,9]. NS-CLBP is a general diagnosis encompassing a wide range of conditions, symptoms, and clinical features. However, this diagnosis does not distinguish the characteristics needed to guide specific clinical decision-making [10].

There is currently no optimum treatment strategy for NS-CLBP [11]. This is partially due to a lack of standardized, valid, and reliable methods for classifying specific characteristics. An effective classification system is expected to be based on studies reporting selection criteria in clear and standardized terms [12], a requirement that has been proposed as a key research priority [13].

There are many approaches to classify NS-CLBP, such as identifying symptom sources [14,15,16], functional characteristics, and/or psychosocial risk [17]. Functional evaluation, which generally assesses how people move and perform movement tasks, is typically designed to both inform treatment and be used as a clinical outcome. Functional classification is promising, in part because it identifies characteristics of a condition and offers information about potential mechanisms contributing to chronicity. Active treatments designed to address functional deficits may also positively influence self-efficacy, reduce fear avoidance behaviors, and promote symptom self-management [18, 19].

Despite the potential importance of functional classification as a diagnostic process, to the authors’ knowledge, no previous comprehensive evaluation of the reliability and validity of existing classification systems has been performed. Therefore, the aims of this review were to describe and critically appraise the psychometric properties of functionally oriented NS-CLBP diagnostic classification systems.

Methods

This systematic review was registered on PROSPERO (CRD42015023958) [20] and conducted according to PRISMA guidelines [21].

Search strategy

A systematic electronic and manual search of the literature published in English, from inception until January 2020, was conducted in the following electronic databases and journals: PubMed, EMBASE, Cochrane, PEDro, CINAHL, Index to Chiropractic Literature, ProQuest, Physical Therapy, Journal of Physiotherapy, Canadian Physiotherapy and Physiotherapy Theory and Practice.

The search strategy consisted of keywords and Medical Subject Headings (MeSH) related to “Non specific,” “mechanical,” “low back pain,” “simple backache,” “lumbar strain,” “spinal degeneration,” “classification,” “clinical test,” “clinical examination,” “clinical sign,” “valid*,” and “reliabl*.” Search strategy details are available in Appendix 1. Reference lists of eligible articles were also searched for relevant publications.

Eligibility criteria

Inclusion and exclusion criteria are summarized in Table 1.

Table 1 Study eligibility criteria

Data collection and analysis

Selection of studies

Studies from electronic databases and manual searches were imported into EndNote X5.0.1 and checked for eligibility by two reviewers (AA and NFM) independently; first by title, abstract, and finally by full text. Discordance was resolved through discussion with co-authors (ARY and RV).

Data extraction

Two reviewers (AA and NFM) independently extracted all relevant information into an Excel spreadsheet. All discrepancies were resolved through discussion.

Risk of bias assessment

Risk of bias was assessed independently by two authors (AA and NFM) using the Critical Appraisal Tool (CAT) developed by Brink and Louw [22] (Appendix 2). This tool consists of 13 items designed to appraise the quality of validity and/or reliability studies; 4 items assess reliability studies, 4 items evaluate validation studies, and the remaining 5 items evaluate both validity and reliability studies. Each item is scored as “Yes,” “No,” or “Not Applicable (N/A)” [22]. Disagreement was resolved through consensus discussion. Studies were considered of low risk of bias if they scored ≥ 60% [23,24,25].
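The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the review does not say how “N/A” items enter the percentage, so the sketch assumes they are excluded from the denominator, and the function names `cat_score` and `risk_of_bias` are hypothetical.

```python
LOW_RISK_THRESHOLD = 60.0  # percent; cut-off stated in the review


def cat_score(responses):
    """Percentage of applicable CAT items answered "Yes".

    Assumption (not stated in the review): "N/A" items are dropped
    from the denominator before the percentage is computed.
    """
    answers = [r.strip().lower() for r in responses]
    applicable = [a for a in answers if a != "n/a"]
    if not applicable:
        raise ValueError("no applicable items to score")
    yes_count = sum(a == "yes" for a in applicable)
    return 100.0 * yes_count / len(applicable)


def risk_of_bias(responses):
    """Label a study "low" or "high" risk using the 60% cut-off."""
    return "low" if cat_score(responses) >= LOW_RISK_THRESHOLD else "high"


# Hypothetical reliability study: the 4 validity-only items are N/A.
example = ["Yes"] * 6 + ["No"] * 3 + ["N/A"] * 4
print(round(cat_score(example), 1), risk_of_bias(example))
```

Under a different convention (for example, counting “N/A” as “No”), the same study could fall on the other side of the threshold, so the handling of non-applicable items should be confirmed against the original tool.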

Results

Selection of studies

Database and manual searching identified a total of 2899 articles. After full article screening, 22 studies published in 21 articles (one article reported 2 studies) were included. The article screening process is outlined in Fig. 1.

Fig. 1

PRISMA flow diagram of literature search and studies included

Characteristics of included studies

The methodological approach of each included study is depicted in Tables 2 and 3. Of the 22 included studies, 5 investigated inter-rater reliability [26,27,28,29], with one study reporting both intra- and inter-rater reliability [27] (Table 2).

Table 2 Description of included reliability studies
Table 3 Description of included validity studies

Validity was assessed in 17 studies: 15 evaluated O’Sullivan’s Classification System (OCS) [30,31,32,33,34,35,36,37,38,39,40,41,42,43,44], one assessed the 10-item Motor Control Impairment (MCI) test battery [45], and one investigated the Pain Behavior Assessment (PBA) classification system [46]. All validity studies used a cross-sectional design, except for 2 studies [42, 45] that used a cross-sectional case–control design (Table 3).

Demographic data of participants in eligible studies

Sixteen studies enrolled asymptomatic participants as controls [27, 29,30,31,32,33,34,35,36,37,38, 40,41,42,43, 45]. Sample size ranged from 12 [39] to 200 [46]. Participant mean age ranged from 28.4 [41] to 55.1 years [27]. The mean body mass index (BMI) ranged from 20.8 [42, 43] to 26.9 kg/m² [45], although four studies did not report BMI [26, 27, 29, 46].

Classification systems

The eligible studies described three different classification systems (Tables 2, 3):

  1. OCS (n = 18 studies) classifies NS-CLBP as predominantly centrally (e.g., central sensitization) or peripherally mediated (e.g., injury, inflammation of peripheral tissues). OCS also includes a psychosocial assessment step and separates pain presumed to be of lumbar origin from pain presumed to be of pelvic origin. Functional testing of lumbar and pelvic girdle pain evaluates presumed motor control impairment by identifying specific postural and movement characteristics [26] (Appendix 3).

  2. MCI Test Battery (n = 3 studies) uses specific movements/positions to differentiate participants with MCI from normal individuals. This battery consists of 10 individual tests that identify possible flexion, extension and rotational dysfunction. Assessment is dichotomous (impairment or no impairment) with the severity described on 3 levels (none, mild, moderate/severe) [45].

  3. PBA classification (n = 1 study) rates: (1) pain perception; (2) overt pain behavior (e.g., guarding movements); (3) effort during physical test performance; and (4) consistency of behavior across different situations of clinical testing. Categories include no pain, low pain or high pain behaviors [46].

Reliability of different classification systems

Inter-rater reliability of the OCS and the MCI test battery was assessed in 5 studies (Table 4).

Table 4 Overview of the results of reliability studies

Inter-rater reliability of the entire OCS classification system (all steps) was moderate (kappa (K) > 0.4) [26]. For levels 1–4, the mean agreement was excellent (96%; range 75–100%). For the fifth level, agreement between 4 testers was strong, with a mean K of 0.82 (range 0.66–0.90) and a mean agreement of 86% (range 73–92%). The final classification level had a moderate mean K of 0.65 (range 0.57–0.74) and excellent agreement of 87% (range 85–92%) [26]. Within the OCS-MCI subcategory, the most reliably identified subgroup was the passive extension pattern (PEP) (K = 0.90; strong), while the least reliable was the active extension pattern (AEP) (K = 0.66; moderate) [26].

The OCS-MCI subcategory was also tested in 2 studies reported within a single article [28]. In the first study, the inter-rater agreement was excellent (K = 0.96; agreement = 97%). In the second study, the inter-rater agreement was moderate on average (mean K = 0.61; range 0.47–0.80), with a mean agreement of 70% (range 60–84%), among 13 examiners who assessed 25 cases that included subjective information from participants and video-recorded functional tests [28].
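Because these results are reported both as percentage agreement and as K, a short sketch of how the two statistics relate may be useful. It assumes two raters and Cohen’s unweighted kappa; the included studies with more than two examiners may have used pairwise or multi-rater variants, which the review does not specify. The ratings below are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of cases on which two raters assign the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's unweighted kappa: agreement corrected for chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b) / 100.0
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement from the product of each rater's marginal proportions.
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical subgroup labels from two examiners for 10 cases.
rater_1 = ["FP", "FP", "AEP", "PEP", "FP", "AEP", "PEP", "FP", "AEP", "FP"]
rater_2 = ["FP", "FP", "AEP", "PEP", "AEP", "AEP", "PEP", "FP", "FP", "FP"]
print(percent_agreement(rater_1, rater_2), round(cohens_kappa(rater_1, rater_2), 2))
```

Kappa is lower than raw agreement whenever some agreement is expected by chance, which is why a study can report high percentage agreement alongside only moderate K.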

MCI test battery

In one study, four examiners independently rated video recordings of 27 participants with NS-CLBP and 13 controls performing 10 MCI tests. K values for inter-rater reliability ranged from minimal to moderate (0.24–0.71). Six out of 10 tests showed substantial inter-rater reliability (K > 0.6). The most reliable tests (for both rater pairs) were the “pelvic tilt” test for extension dysfunction, the “one leg stance” test for rotational dysfunction, and the “sitting knee extension” and “waiter’s bow” tests for flexion dysfunction. The poorest reliability was reported for the “abduction in crook lying” test for rotational dysfunction, where both rater pairs had low K values (K = 0.44, 95% CI 0.18–0.70 and K = 0.32, 95% CI 0.10–0.54). Intra-tester reliability ranged from 0.51 to 0.96. All tests, except abduction in crook lying, showed substantial reliability (K > 0.6) [27].

The second study reported the inter-rater reliability between two examiners who independently examined 25 participants with NS-CLBP and 15 asymptomatic controls using five MCI clinical tests. Intra-class correlation coefficients (ICCs) were excellent for all tests: 0.90 for repositioning (RPS), 0.96 for sitting forward lean (SFL), 0.96 for sitting knee extension (SKE), 0.94 for bent knee fall out (BKFO) and 0.98 for leg lowering (LL) [29].
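For readers less familiar with ICCs, the sketch below shows one common way such a coefficient is computed from continuous scores. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the review does not state which ICC form the cited study used, so this is illustrative only. The measurements are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table of scores.

    `scores` holds one row per participant and one column per rater.
    Assumption: two-way random effects, absolute agreement, single rater.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 participants and 2 raters, no gaps")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)             # between-participant mean square
    ms_cols = ss_cols / (k - 1)             # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical measurements (arbitrary units) from two examiners.
measurements = [[12.0, 11.5], [8.0, 8.5], [15.0, 14.0], [10.0, 10.5], [6.0, 6.5]]
print(round(icc_2_1(measurements), 2))
```

Unlike a plain correlation, this form penalizes systematic offsets between raters, so two examiners whose scores differ by a constant amount would not reach a coefficient of 1.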

Validity of different classification systems

All three diagnostic systems underwent some aspect of validation testing (Table 5):

  1. OCS: Fifteen studies assessed OCS validity (12 for OCS-MCI subcategories and 3 for sacroiliac joint dysfunction (SIJD)).

    OCS-MCI construct validity was reported by measuring lumbosacral kinematics and trunk muscle activation in 33 participants with NS-CLBP (20 Flexion Pattern (FP) and 13 AEP) and 34 asymptomatic controls. The biomechanical model used lower lumbar kinematics in sitting and forward bending and two trunk muscle activation variables (lack of flexion relaxation of the superficial lumbar multifidus in slump sitting and end range of forward bending). The model correctly classified 96.4% of cases and distinguished between individuals with No LBP, AEP and FP [38].

    Discriminant validity of MCI subcategories through spinal kinematics testing:

    Sitting postures were tested to distinguish participants with NS-CLBP with AEP (lordotic lumbar posture) and FP (kyphotic lumbar posture) from asymptomatic controls (P < 0.001). Participants with NS-CLBP had less ability to consciously alter posture when asked to slump from usual sitting (P < 0.001) [36]. Similar findings were reported during cycling, as participants with NS-CLBP (FP) exhibited greater lumbar region flexion compared to asymptomatic controls (p = 0.018) and reported a marked increase in pain over 2 hours of cycling (p < 0.001) [41]. Further, cyclists with NS-CLBP (FP) showed an increased, although non-significant, tendency toward lumbar flexion and rotation compared to controls (P > 0.05) [35]. In another study, participants with NS-CLBP (FP) sat with less hip flexion (P = 0.05), suggesting a relative posterior pelvic tilt. During “usual” sitting, the FP group positioned the lumbar spine significantly closer to end range lumbar flexion compared to asymptomatic controls [30].

    Functional tasks: In a single study, spinal kinematics during functional tasks did not differ between the AEP group and asymptomatic controls [43]. However, the AEP group distinctively adopted more upper lumbar and lower thoracic (T6–L3) extension compared to the FP group, which adopted more flexion during these activities (p < 0.05). The FP group also exhibited greater thoraco-lumbar kyphosis than asymptomatic controls [43].

    Spinal position sense (SPS): Lumbar repositioning accuracy was assessed in 3 studies [31, 33, 40]. Participants with NS-CLBP, compared to asymptomatic controls, showed substantially greater Absolute Error (AE) [31, 40] and Variable Error (VE) [40]. The FP group underestimated lumbar target positions [31, 33, 40], while the AEP group overestimated lumbar and underestimated thoracic target positions compared to FP [40]. The Cardiff Dempster–Shafer Theory (DST) Classifier method, based on objective measures of repositioning sense during sitting and standing, discriminated the No LBP from NS-CLBP (pooled and in subsets) with an accuracy ranging between 93.83 and 98.15%. Further, the DST classifier method distinguished different NS-CLBP subgroups with an accuracy of 96.8%, 87.7% and 70.27% for FP from PEP, FP from AEP and AEP from PEP subtypes, respectively. Finally, ranking analysis showed that lumbar AE in sitting could distinguish participants with NS-CLBP from No LBP and FP from No LBP, while lumbar constant error in standing consistently discriminated LBP extension subsets (AEP and PEP) from No LBP [44].

    Discriminant validity of MCI subcategories through trunk muscle activity testing:

    Surface electromyography (sEMG) recorded from five trunk muscles during unsupported “usual” and “slumped” sitting postures could not distinguish trunk muscle activity between asymptomatic controls and a pooled NS-CLBP group. However, compared to controls, participants classified with AEP presented with significantly higher co-contraction of lumbar multifidus, ilio-costalis lumborum pars thoracis and transverse fibers of internal oblique muscles (p < 0.05) [37]. Burnett et al. [35] reported less co-contraction of the lower lumbar multifidus in the FP group compared to controls during a cycling task. Sheeran et al. [40] reported that the NS-CLBP group (FP and AEP combined) produced significantly higher abdominal activity (p < 0.01) compared to controls during usual sitting and standing postures. Hemming et al. [42] also reported significantly greater muscle activation in right-sided superficial lumbar multifidus muscles during the functional tasks of step up, reach up and box replace (p < 0.05). External oblique muscle contraction during box lift differed significantly between participants with AEP and asymptomatic controls (p = 0.016). Significant differences between participants with FP and asymptomatic controls were also reported for left-sided transversus abdominis/internal oblique and superficial lumbar multifidus activity during stand-to-sit tasks (p = 0.009) [42].

    SIJD subcategory: Two studies reported decreased diaphragmatic excursion, altered respiratory patterns and depression of the pelvic floor (PF) in participants with NS-CLBP during the active straight leg raise (ASLR) test compared to controls [32, 39]. When the examiner added manual pelvic compression to the ASLR test, there were no differences between the two groups. Manual pelvic compression during the ASLR theoretically improves load transfer by enhancing passive stability of the sacroiliac joints (SIJs) and motor control (MC) patterns/force closure [32]. Another study reported delayed activation of obliquus internus abdominis (OI), multifidus and gluteus maximus muscles in patients with SIJD compared to controls. Delayed OI and multifidus activation occurred on both the symptomatic and the asymptomatic sides in the SIJD group. Biceps femoris activation occurred earlier in the SIJD group [34].

  2. MCI Test Battery: One study assessed the clinical validity of the MCI test battery for classifying participants with NS-CLBP [45]. For both the two-class (impairment or not) and the three-class (none, mild/moderate and severe) categorization, the ideal number of MCI tests was 10. The overall discrimination potential for two-class categorization was good (Area Under the Curve (AUC) > 0.8, sensitivity = 0.75, specificity = 0.82, Youden = 0.57, LR+ = 3.40, LR− = 0.20, effect size = 1.45), with an optimal cutoff of three tests. To classify MCI, at least four failed items are needed. The overall discrimination potential for the three-class categorization was fair (volume under the surface > 0.5, sensitivity = 0.48, sensitivity = 0.50, specificity = 0.82, Youden = 0.40, effect size = 1.56), with an optimal cutoff of three and six tests. At least four failed MCI tests are needed to classify mild/moderate MCI and six or more failed tests classify severe cases [45].

  3. PBA: The internal consistency (reliability) of PBA showed a good person separation index (0.83). Construct validity evaluated by Rasch analysis resulted in 41 items. PBA convergent validity was supported by a significant correlation with other questionnaires [46].
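The accuracy indices reported for the MCI test battery (Youden index and likelihood ratios) follow from sensitivity and specificity through standard definitions, sketched below. The function name `diagnostic_indices` is hypothetical. Values recomputed from the rounded sensitivity and specificity quoted above need not reproduce every published figure, since the original analysis may have used unrounded or cutoff-specific data.

```python
def diagnostic_indices(sensitivity, specificity):
    """Derive Youden's J and likelihood ratios from sensitivity/specificity.

    Standard definitions:
      Youden's J = sensitivity + specificity - 1
      LR+        = sensitivity / (1 - specificity)
      LR-        = (1 - sensitivity) / specificity
    """
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")

    youden = sensitivity + specificity - 1.0
    # Guard the two divisions that are undefined at the extremes.
    lr_pos = sensitivity / (1.0 - specificity) if specificity < 1.0 else float("inf")
    lr_neg = (1.0 - sensitivity) / specificity if specificity > 0.0 else float("inf")
    return {"youden": youden, "lr_pos": lr_pos, "lr_neg": lr_neg}


# Rounded two-class values quoted in the text for the MCI test battery.
result = diagnostic_indices(sensitivity=0.75, specificity=0.82)
for key, value in result.items():
    print(key, round(value, 2))
```

A Youden index near 0 indicates a test that performs no better than chance, while values approaching 1 indicate that both sensitivity and specificity are high at the chosen cutoff.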

Table 5 Overview of results of validity studies

Risk of bias assessment

Risk of bias assessment showed excellent inter-assessor agreement (92.2% and K = 0.84) [47]. Nine studies [30,31,32, 34, 39,40,41, 45] did not clarify evaluators’ characteristics (Item 2). Reference standard tests (Item 3) were reported only in two studies [38, 44] and were performed independently (Item 9). All inter-rater reliability studies used raters blinded to each other’s findings [26,27,28,29] (Item 4). All studies reported clear descriptions of measurement procedures (Item 10). Appropriate reliability and validity statistical methods (Item 13) were employed in 21 studies, while one study used the Kolmogorov–Smirnov test, a less than ideal choice for confirming a normal distribution in a small sample (n < 50) [41]. Overall, all five reliability studies [26,27,28,29] and two OCS validity studies [38, 44] were rated as low risk of bias. The remaining validity studies were rated as high risk [30,31,32,33,34,35,36,37, 39,40,41,42,43, 45, 46] (Table 6).

Table 6 Quality assessment of the included studies with the Critical Appraisal Tool (CAT)

Discussion

This systematic review identified and critically appraised studies reporting reliability and validity of functionally oriented NS-CLBP diagnostic classification systems; specifically, the OCS, MCI test battery, and PBA systems. Of the 3 systems evaluated through studies included in this review, the OCS is the most reliable and valid. All included reliability studies were consistently rated as high quality. However, validity in this context is limited to the capacity to systematically identify different muscle activation patterns and spinal kinematic changes from each other and controls (construct and discriminant validity) as demonstrated in two high-quality studies. The remaining reviewed studies that assessed some aspects of OCS validation had high risk of bias. Limited evidence supports acceptable inter- and intra-rater reliability for clinical use of the following MCI test battery tests: "sitting knee extension" (to identify flexion dysfunction), “one leg stance” (for rotational dysfunction) and “pelvic tilt” (for extension dysfunction). Evidence supporting validity of the PBA is inconclusive.

The OCS system

The OCS is currently the most studied, functionally based classification system with inter-rater reliability among various stages ranging from moderate to excellent [26]. For the OCS-MCI subcategory, three low risk of bias studies reported strong reliability (FP, AEP, PEP, flexion/lateral shift pattern and multidirectional pattern) [26, 28], with PEP as the most reliable subgroup, and AEP as the least reliable [26]. This review identified two low risk of bias studies demonstrating construct [38] and discriminant validity [44] of OCS-MCI subcategories based on determining and explaining aberrant muscle activity and spinal kinematic changes in participants with NS-CLBP. These studies generally adhered to guidelines for developing and validating classification systems [12, 48,49,50,51,52,53,54,55,56].

MCI test battery

Based on 2 low risk of bias reliability studies [27, 29], 3 individual tests included in the MCI test battery show good to excellent reliability to classify people with NS-CLBP with or without MCI. Two previous systematic reviews concluded similarly [57, 58]. Current evidence suggests that the “sitting knee extension” test to identify flexion dysfunction, the “one leg stance” test for rotational dysfunction and the “pelvic tilt” test for extension dysfunction are suitable for clinical use, based on good to excellent intra- and inter-rater reliability [27, 29]. Because validity of the 10-test battery is based on a single study [45] with a high risk of bias, recommending routine clinical use is premature.

PBA

The PBA consistently recognizes and classifies pain behavior into three categories (none, low, and high). However, evidence is limited to a single study with a high risk of bias [46]. People with no or low levels of pain behavior are likely to benefit from a physically oriented rehabilitation program with little emphasis on psychological and behavioral approaches. Conversely, those with high pain behavior may benefit from programs that emphasize psychological and behavioral factors. This reasoning incorporates a biopsychosocial approach similar to stratified care informed by the 9-item STarT Back questionnaire [59]. Unlike the STarT Back questionnaire, the PBA also includes observing movement tasks and behaviors.

Risk of bias assessment of individual studies

All reliability studies for OCS and MCI test battery were rated as high quality. The main risk of bias in reliability studies was the lack of randomizing test order.

Only two OCS validation studies [38, 44] were rated as high quality, largely because they employed expert opinion as a reference standard. However, no currently available objective tests classify function [54]. When no such tests are available, expert opinion, though limited, represents the best available reference standard [60,61,62]. The main risk of bias for the MCI test battery and PBA was the absence of a reference standard. The choice of statistical methods was considered appropriate for all studies, although one study [41] used the Kolmogorov–Smirnov test, which is not recommended for testing normality [63, 64].

Implications for clinical practice

The findings of this review suggest that clinicians can use OCS to reliably classify functional characteristics of patients with NS-CLBP. Upper lumbar and lower thoracic spine kinematic studies offer mechanistic evidence supporting the rationale for assessing MCI. However, evidence supporting the validity of the 10-item MCI test battery is inconclusive because it is available from only 1 high risk of bias study. Because the effectiveness of therapies informed by functional classification is generally unknown, it is unclear whether such a diagnosis can be used to inform effective care and/or as an objective measure of condition severity or response to care.

Implications for future research

Standardized assessment protocols for determining MCI require well-defined procedures, operational definitions and quantifiable values. Standardizing these will facilitate more clinically useful findings and the ability to pool data from clinical trials [29]. Included classification systems used sagittal plane MCI assessment. Future studies should consider frontal and transverse planes to more comprehensively assess complex movement strategies. Further studies with lower risk of bias are needed to confirm the clinical usefulness of PBA classification. Finally, RCTs validating the clinical effectiveness of treatments based on functional assessment are needed.

Review strengths and limitations

This review employed methodology consistent with PRISMA guidelines. However, as with all systematic reviews, articles may have been missed in the database searches. A meta-analysis was not feasible, due to heterogeneity in the methodological design and the statistical analyses employed. Only articles in the English language were included. Another limitation is the inclusion of low-quality studies (small sample sizes and no reference standards).

Conclusions

Evidence from multiple studies with low risk of bias demonstrates OCS as a reliable classification method. Strong inter-rater reliability also exists for using 3 tests of the 10-item MCI test battery. Evidence for the reliability and validity of the PBA is limited to one study with high risk of bias. While clinicians are encouraged to categorize the functional capacity of patients with NS-CLBP using reliable methods, research evidence is not yet available to answer questions about the effectiveness of care informed by such classification.