Introduction

The measuring method according to Cobb [1] is the standard method to quantify spinal curvature. Although the Cobb angle measures only the tilt of the most tilted vertebrae on an anterior–posterior (AP) or posterior–anterior (PA) radiograph in the coronal plane, and is not an objective measurement of all three-dimensional aspects of the spinal deformity, it is used to guide decisions regarding progression, physiotherapeutic and orthotic options, as well as surgical interventions [2, 3]. Thus, an accurate and reproducible measuring procedure is crucial. Reproducibility concerns the degree to which repeated measurements yield similar results. Differences between repeated measurements may result from the subjects under investigation (e.g. biological variance), the equipment (e.g. lack of calibration), or the observers (e.g. lack of technical competence). To interpret measurement results correctly, the orthopaedic examiner should be cognizant of the reproducibility parameters of the measurement procedure used.

Generally, in the technique developed by Cobb, lines are superimposed manually onto a hardcopy radiographic film of the vertebrae, and the angles are determined using a protractor [1]. Morrissy et al. [4] reported decreased measurement error when a standardised pencil and protractor were used to measure pre-selected (previously marked) end-vertebrae.

Even when the same end-vertebrae are selected, measurements may vary by up to 5° [4, 5]. Thus, a change of 5° or more is considered progression of the scoliotic curvature [5]. To improve reproducibility, and in response to technical advancements, new measurement procedures to determine the Cobb angle have emerged.

Aside from the manual procedure, there are so-called digital computer-assisted (semi-automatic) [6–12], automatic [13, 14] and smartphone app procedures [15, 16].

Included in the digital computer-assisted measurement group were all studies in which the Cobb angle was calculated independently by a software program.

The landmarks used to calculate the angle must be manually entered using a computer mouse. The angle is then calculated automatically.

Depending on the software program, the number of landmarks to be placed and/or the positioning of the radiograph (e.g. horizontal rotation) varies (see Fig. 1).

Fig. 1 Different measuring procedures
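
To make the landmark principle concrete, the following is a minimal sketch of how a digital two-landmark procedure could derive the angle from observer-placed endplate points; the function name and coordinates are illustrative and not taken from any of the included software programs:

```python
import math

def cobb_angle(upper_endplate, lower_endplate):
    """Acute angle (degrees) between two endplate lines, each given as
    two (x, y) landmark points placed by the observer in image coordinates."""
    (x1, y1), (x2, y2) = upper_endplate
    (x3, y3), (x4, y4) = lower_endplate
    a1 = math.degrees(math.atan2(y2 - y1, x2 - x1))  # slope of upper endplate
    a2 = math.degrees(math.atan2(y4 - y3, x4 - x3))  # slope of lower endplate
    diff = abs(a1 - a2) % 180
    return min(diff, 180 - diff)                     # acute angle between lines

# Landmarks on the superior endplate of the upper end-vertebra and the
# inferior endplate of the lower end-vertebra (pixel coordinates)
print(cobb_angle(((100, 120), (180, 100)), ((105, 400), (185, 430))))  # ~34.6
```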

Automatic procedures still require regions of interest, as in the fuzzy Hough transform procedure [14], or initial points, as in the active shape procedure [13], to be input on the radiograph by the observer. The computer itself then recognises the exact profile of the vertebrae or the landmarks and calculates the angle automatically.
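
The primary articles do not describe the automatic procedures in implementable detail. Purely to illustrate the idea of detecting an endplate line within an observer-defined region of interest, a sketch using plain probabilistic Hough line detection in OpenCV (not the fuzzy Hough transform of [14]; the file name and region coordinates are invented) might look as follows:

```python
import cv2
import numpy as np

# Illustrative sketch only: detect a candidate endplate line inside an
# observer-defined region of interest on a digitised radiograph.
img = cv2.imread("radiograph.png", cv2.IMREAD_GRAYSCALE)  # invented file name
x, y, w, h = 100, 80, 200, 60      # region of interest around one end-vertebra
roi = img[y:y + h, x:x + w]
edges = cv2.Canny(roi, 50, 150)    # edge map of the region
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                        minLineLength=30, maxLineGap=5)
if lines is not None:
    x1, y1, x2, y2 = lines[0][0]   # strongest line candidate
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
    print(angle)                   # orientation of the detected endplate line
```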

As a result of the new technical capabilities of smartphones (S), applications are available to calculate the Cobb angle. To use the app described by Qiao et al. [16], the smartphone must be aligned with the upper and the lower vertebral endplate in turn, and the software automatically displays the angle. The smartphone app described by Shaw et al. [15] uses an accelerometer.
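
As a hedged sketch of the accelerometer principle only (not the actual code of the app in [15]): the gravity components measured along the device axes can be converted into a tilt angle, and the difference between two tilt readings taken while the device edge is aligned with each endplate approximates the Cobb angle.

```python
import math

def tilt_degrees(ax, ay):
    """Tilt of the device edge relative to the vertical, from the gravity
    components (any consistent unit) along the device x and y axes."""
    return math.degrees(math.atan2(ax, ay))

# Hypothetical gravity readings with the device edge aligned first with the
# upper, then with the lower end-vertebral endplate
upper = tilt_degrees(1.2, 9.7)
lower = tilt_degrees(-3.1, 9.3)
print(abs(upper - lower))  # difference of the two tilts ~ Cobb angle
```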

The present review aims to systematically investigate the reliability and agreement of various techniques used to determine the Cobb angle in idiopathic scoliosis, and to assess whether new technical procedures are reasonable alternatives when compared to manual measurement of the Cobb angle.

Methods

Search strategy

A search strategy was developed in collaboration with a spine specialist and a radiological specialist. The following computerised databases were searched from the earliest available date to 30 June 2012: PubMed (including MEDLINE), Medpilot, and the Cochrane Library (including Cochrane Reviews).

Languages were restricted to English, Dutch, and German.

The search was performed using the following keywords and their combinations: idiopathic scoliosis, Cobb angle, measurement, automatic, digital, computer-assisted, validity, accuracy, reliability, variability, reproducibility, agreement, X-ray, radiography.

Titles were independently assessed for eligibility by two investigators. The articles were divided into two groups: “definitely exclude”, or “possibly include”. The articles to “possibly include” were assessed using the abstracts. Studies with abstracts not fulfilling the inclusion criteria were excluded. In the remaining articles, the entire text was assessed. Duplicates were deleted.

At each step, disagreements among the reviewers were resolved by discussion. In addition, the references of all included articles were checked, and potentially relevant articles were obtained as a full-text publication, rated, and included when fulfilling the inclusion criteria.

Selection criteria

Studies were included if the patients had idiopathic scoliosis and if the Cobb angle was measured using computer-assisted, automatic, or smartphone application procedures on an AP/PA radiograph.

Studies were excluded if: (a) measurements were performed on non-human subjects (e.g. spine models), (b) only manual procedures were reported, (c) the study subjects measured had undergone instrumented fusions, (d) three-dimensional measuring methods were used, (e) rasterstereography and/or topographical methods were used, or (f) radiographs were compared to digital printouts.

Data extraction and management

One aim of this review is to investigate the reproducibility of various new measuring procedures. The term “reproducibility” has been deliberately chosen, as it encompasses both agreement and reliability [17, 18] (see Fig. 2). Reliability relates the measurement error to the variability between study objects: in this case, radiographs of patients with varying degrees of spinal deformity. If the measurement error is small compared to the variability in the degrees of spinal deformity, the reliability parameter approaches 1.

Fig. 2 Example of reproducibility parameters

In most of the included studies, the Intraclass Correlation Coefficient (ICC) was used as a measure of reliability [19]. Two variants of the ICC can be distinguished, ICC(agreement) and ICC(consistency) [17]; ICC(agreement) incorporates the systematic difference between the observers, whereas ICC(consistency) ignores it.

Various guidelines exist for the interpretation of ICC, but one reasonable interpretation is that an ICC value of less than 0.40 indicates poor reliability, 0.40–0.75 indicates fair to good reliability, and values greater than 0.75 reflect excellent reliability [20].
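
The distinction between the two variants can be made concrete with the standard two-way ANOVA formulations of the single-measure ICC. The following is a minimal sketch, assuming the McGraw and Wong conventions, with invented example data (not an analysis of the included studies):

```python
import numpy as np

def icc_two_way(scores):
    """Single-measure ICC(agreement) and ICC(consistency) from an
    n_subjects x k_observers matrix, via two-way ANOVA mean squares."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects (rows)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # observers (cols)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))             # residual
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return agreement, consistency

# Invented Cobb angles (degrees) for five radiographs read by two observers;
# observer 2 reads systematically ~3 degrees higher.
angles = [[22, 25], [41, 44], [15, 18], [60, 63], [33, 36]]
print(icc_two_way(angles))  # agreement is lower than consistency here
```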

Additional correlations (r) are presented as shown in the studies.

For the presentation of agreement, different terms are given according to those used in the original studies.

For studies attempting to evaluate validity through inter-methodological comparison, the reported statistical values were recorded.

The details of the selected studies are shown in Table 1.

Table 1 Characteristics of included studies

The data presented are taken directly from the Result sections of the included articles. The authors were not able to perform independent analysis of the source data.

Risk of bias assessment

The “Quality Assessment of Diagnostic Accuracy Studies” (QUADAS) checklist was used to assess the internal validity of the included studies [21]. Generally, this checklist is used to evaluate a new test (index test) against a reference test (gold standard), not to evaluate the reproducibility of studies. For reproducibility, the “Quality Appraisal of Reliability Studies” (QAREL) checklist is used [22]. For this review, items from both checklists appeared relevant. To assess studies examining the reproducibility of two different measuring procedures, we elected to modify the QUADAS checklist. The new procedure under investigation was considered the index test, and the comparative procedure the reference test. Items that appeared relevant in the context of reproducibility were added.

These items evaluate whether

  • observers were representative,

  • observers were blinded to their own prior findings,

  • observers were blinded to the findings of other observers,

  • the order of examination was randomized,

  • all assessments described in the Method section of the article were presented in the Result section.

Items without contextual relevance were deleted.

The items were scored as “yes”, “no” or “unclear” as described by Whiting et al. [21].

Not applicable (NA) was used when studies had no comparative design. Because the study by Allen et al. [13] examined three different measuring procedures, we divided the reference column into two parts. The manual and digital procedures were then used as the reference test.

Two reviewers assessed the risk of bias of the included studies independently, and disagreements regarding study quality were resolved by a third reviewer.

Individual checklist items were not weighted, and no summary score was used to exclude studies not attaining a certain level of quality.

Results

Search results

The electronic searches generated a total of 2,915 records, with 47 remaining after exclusion by title. Twenty-two abstracts did not fulfil the inclusion criteria mentioned in the Method section; therefore, 25 full-text articles were obtained. Included articles, and reasons for the exclusion of 14 of these 25, are listed in Fig. 3.

Fig. 3 Flow chart of selected articles

Three papers appeared to report the same cohort but with different numbers of patients and observers [14, 23, 24]. To avoid duplication of data, we chose to include only the study with the most complete information [14]. In total, 11 studies were included in our review.

Risk of bias assessment

The risk of bias of the included studies is presented in Table 2. The initial agreement of the reviewers was 90 % (130 of 145 items). Disagreements were resolved by consensus. All studies presented a representative spectrum and relevant clinical information (modified QUADAS items 1 and 11). In most of the studies, poor description of the selection criteria, lack of detail regarding the reference test, uninterpretable data, lack of blinding to the observers' own and to others' prior findings, lack of randomization, and incompleteness of results were evident (modified QUADAS items 3, 5, 6, 7, 8, 9, 10, 13). These shortcomings may introduce selection, reviewer, and information bias, and may distort the assessment of test characteristics. Three studies [11, 12, 14] with a high risk of bias were included.

Table 2 Risk of bias of included studies

Measuring methods

We found several new procedures to determine the Cobb angle. They can be divided methodologically into three categories: digital computer-assisted (D2L, D4L, D6L, or DH), automatic (A), and smartphone app (S) techniques (see Fig. 1).

All methods expressly designated as automatic by the authors of the primary articles were assigned to the automatic measurement category. The difference between the digital computer-assisted procedures and the automatic procedures is that no manual entry of landmarks is needed; however, the regions of interest [14] or initial points [13] still have to be marked manually by the observers.

In recent years, the manual procedure described and used by Cobb has served in the literature as the reference test. Thus, most of the included studies used the manual procedure as the reference test. In cases where the technical description of the reference test was lacking in the Method section of the paper, we assumed that the authors used the best-known reference standard, and we denote such a reference test as manual.

The different technical procedures and their use in the studies are displayed in Fig. 1 and Table 1.

Study description

Of the 11 studies, 8 reported in their Methods section that they evaluated both agreement and reliability. One study assessed reliability only and two examined agreement only.

Inter-methodological comparison between individual measuring procedures (e.g. smartphone vs. manual procedure) was carried out in five of the studies [8, 9, 11, 13, 15].

The statistical methods used to calculate observer agreement, reliability, or inter-methodological comparison were chosen and presented by the individual studies in various ways.

Findings: agreement

The results of the different procedures are divided into intra- and inter-observer agreement and are presented in Table 3. The original publications presented varied statistics. When appropriate, statistical terms have been unified for clarity (e.g. absolute average difference (AAD) = mean absolute difference (MAD); technical error of measurement (TEM) = standard error of measurement (SEM)). Nevertheless, a summary estimate was not possible.

Table 3 Results agreement: data presented as taken from the Result section of the article
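
For illustration, the following is a minimal sketch of one common way the unified quantities are obtained from paired repeated measurements; it assumes the test-retest relation SEM = SD of the paired differences divided by the square root of 2, and the readings are invented:

```python
import math
import statistics

def mean_absolute_difference(first, second):
    """MAD (= AAD): mean of the absolute paired differences."""
    return statistics.mean(abs(a - b) for a, b in zip(first, second))

def sem_from_pairs(first, second):
    """SEM (= TEM) from test-retest pairs: SD of the paired
    differences divided by sqrt(2)."""
    diffs = [a - b for a, b in zip(first, second)]
    return statistics.stdev(diffs) / math.sqrt(2)

# Invented repeated Cobb angle readings (degrees) by one observer
first_reading  = [22.0, 41.5, 15.0, 60.5, 33.0]
second_reading = [24.0, 43.0, 17.5, 62.0, 35.5]
print(mean_absolute_difference(first_reading, second_reading))  # 2.0
print(sem_from_pairs(first_reading, second_reading))            # ~0.35
```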

Findings: reliability

Reliability was also analysed using various procedures. All studies presenting the ICC showed excellent intra- and inter-observer reliability, with ICCs >0.75. Further differentiation between ICC(agreement) and ICC(consistency) was not possible due to lack of information.

Studies calculating a correlation coefficient showed values above r = 0.96. Only the study by Chockalingam et al. [12] showed a lower correlation (r = 0.781) among the observers using the manual procedure (see Table 4).

Table 4 Results reliability: data presented as taken from the Result section of the article

Inter-methodological comparison between procedures

Comparing the manual and digital two-landmark procedures, Jones et al. [8] reported mean differences ranging from 0.15° to 1.69°, with standard deviations (SD) ranging from 2.81° to 3.32°. Tanure et al. [9] identified no statistically significant difference in variability between the manual and the digital two-landmark method (t test, p = 0.51 > 0.05). Comparing the two- and six-landmark procedures, Cheung et al. [11] demonstrated that the Cobb angle measured with the two-landmark procedure is 0.3° greater than with the six-landmark digital procedure (p < 0.05).

Allen et al. [13] reported poor inter-methodological reliability (ICC range 0.25–0.35), with a standard error of measurement (SEM) of 8.64°–9.54°. After eliminating four outliers, both the reliability and the SEM improved (ICC range 0.68–0.74; SEM 4.13°–5.08°). Comparing smartphone app measurements with manual measurements, Shaw et al. [15] identified a MAD of 2.1°, with limits of agreement of ±3.3°.

Discussion

In the past, the Cobb angle was measured manually, mostly on AP X-rays. Various studies confirmed higher intra-observer agreement than inter-observer agreement. A positive effect was evident if the end-vertebrae to be measured were previously marked, and if the observers used the same measuring instruments [4, 5, 25–27].

New technical procedures to determine the Cobb angle must establish themselves as at least equivalent to the manual procedure. It would be more informative to ascertain within a study the reliability and agreement of the new procedure alone and then compare these results with those of the reference test. This was not done, or the data were not presented, in a few of the available studies [6–8, 12].

Even more reasonable would be the application of the Bland–Altman method [28]. To further assess these studies, agreement should be prioritized over reliability, because agreement is expressed on the original measurement scale and does not depend on the heterogeneity of the patient sample.

Agreement

Studies evaluating the agreement of digital and manual procedures have shown only minor differences in the extent of agreement on direct comparison. Detailed analysis indicates lower inter-observer than intra-observer agreement.

Allen et al. [13] reported the lowest SEM for intra- and inter-observer agreement using the automatic procedure, compared with the digital and manual procedures.

Zhang et al. [14] also reported a lower MAD for the automatic procedure than for the digital one. All automatic procedures showed smaller intra-observer measurement error than inter-observer error. Even the so-called “automatic” procedures require some observer input prior to calculation, which might explain the inter-observer difference.

The smartphone studies showed contradictory results. Qiao et al. [16] found higher intra- and inter-observer agreement for the smartphone app, while Shaw et al. [15] found an advantage for the manual procedure.

The main difference in study design was the use of observer training and the pre-selection of vertebrae. Thus, it appears that these variables increase intra- and inter-observer agreement for the manual and smartphone procedures. The app described by Qiao et al. [16] calculates the angle automatically, whereas in the app described by Shaw et al. [15] the angle must be calculated by the observer. This might also be a potential source of error.

Furthermore, for a more appropriate comparison, agreement should be presented using the SEM.

Reliability

The reliability was remarkably high in all analysed studies (ICC range 0.83–0.99). There was no clear trend favouring any of the procedures, or distinguishing intra- from inter-observer reliability. Evaluation was limited by the variety of statistical methods applied. It would have been beneficial had the type of ICC used, i.e. agreement or consistency, been reported.

Inter-methodological comparison between procedures

A number of the evaluated studies tried to compare the validity of the procedures through inter-methodological comparison.

To achieve this, a number of statistical techniques were used. Tanure et al. [9] and Cheung et al. [11] assessed statistical significance and observed no difference between procedures. This result is unremarkable, as both procedures are designed to measure the same quantity; a test of significance is irrelevant to the question of agreement. Shaw et al. [15] used the Bland–Altman method [28] and presented the MAD and limits of agreement.

The Bland–Altman method is the only one that determines the actual agreement between two measurement methodologies. In the Bland–Altman plot, the differences between the methods are plotted against the means of the paired measurements. This gives an impression of the agreement as well as of systematic differences between the investigated methods.
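
A minimal sketch of the underlying computation (with invented example data; the 1.96 multiplier assumes approximately normally distributed differences):

```python
import statistics

def bland_altman(method_a, method_b):
    """Mean difference (bias) and 95 % limits of agreement for
    paired measurements obtained with two methods."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Invented Cobb angles (degrees): smartphone app vs. manual measurement
app    = [23.0, 42.5, 16.0, 61.0, 34.0, 28.5]
manual = [21.5, 44.0, 15.0, 63.0, 32.5, 30.0]
bias, limits = bland_altman(app, manual)
print(bias, limits)
# For the plot, the differences are shown against the pairwise means
# (a - b vs. (a + b) / 2), revealing bias and any trend with magnitude.
```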

Factors affecting interpretation

Due to the broad search strategy, it can be assumed that no relevant articles in English, German, or Dutch were overlooked.

One limitation of this review is the high variability of the statistical tests implemented in the included studies, which makes generalisation of the results impossible. Only half of the studies presented sufficient data to allow conclusions regarding reproducibility. The risk of bias assessment, as judged by the modified QUADAS, revealed only one study that clearly fulfilled all parameters.

Another issue affecting interpretation of the data is that terms such as reproducibility, reliability, agreement, variability, variance, and validity are not used consistently. In different studies, the same statistics were calculated but given different names (e.g. average absolute difference = mean absolute difference; limits of agreement = coefficient of repeatability). In some cases, the SD reflected pure measurement deviation; in other cases, it included variability between radiographs. Therefore, no overall assessment was possible.

Conclusion

All of the measurement procedures investigated here showed high degrees of reliability.

The agreement of digital procedures tends to be slightly better than that of manual ones. For the other measurement procedures (automatic or smartphone), results were variable. Notably, studies implementing vertebral pre-selection and observer training achieved better agreement. Therefore, we cannot currently favour any one measuring procedure to determine the Cobb angle, but emphasis should be placed on observer training. The cost–benefit ratio of the measurement procedure used should also be considered.

For the future, we recommend the establishment of a widely recognised standard for the conduct and reporting of Cobb angle measurement studies, to enable an overall assessment.