Introduction

Surveys are tools for collecting data on the beliefs, attitudes and behaviour of both patients and doctors [16]; they allow researchers to gather and analyse the quantitative data that are essential in clinical epidemiology and health services research [9].

A web-based multi-page survey was developed to investigate the attitudes of a large community of orthopaedic surgeons (members of the European Society of Sports Traumatology, Knee Surgery and Arthroscopy, ESSKA) in terms of the treatments of choice and the management of some of the most common joint diseases.

The survey consisted of eight clinical scenarios (four of shoulder pathology and four of knee pathology); each scenario was displayed on a separate page and was associated with a series of closed-ended questions and qualitative scales on which respondents were asked to assess the appropriateness of a list of alternative options.

The preferences of orthopaedic surgeons for the scenarios relating to the management of clinical shoulder cases have been published in a previous paper [26].

The aim of this paper was to report the preferences relating to clinical scenarios of knee disease management, which were defined in order to investigate the degree of inclusion of evidence-based medicine in daily clinical practice. A variety of diseases can affect the knee joint and their management can be a challenge for orthopaedic surgeons, especially in the borderline cases that we purposely defined to address situations in which choosing the right option is not a trivial task.

In order to enable hidden common preferences to emerge and make surgeons aware of potential differences in practice, we had to assess the degree of agreement reached by the surgeons involved in the survey. Agreement assessment relates to the degree to which surgeons agree about a specific treatment. One important purpose of this work is to report agreement measurements based on the experimentation and consolidation of heuristics proposed elsewhere in the literature and to introduce a novel measurement of agreement that is more sensitive to a large number of respondents and easier to calculate. In doing so, this paper contributes to unravelling the potential of web-based surveys administered to large communities of practitioners as an effective tool to help surgeons develop consensus-based and truly practice-based guidelines.

Materials and methods

Questionnaire design

The survey under consideration in this paper consisted of 4 clinical knee scenarios (Tables 1, 2, 3, and 4), each displayed on a different page in order to focus attention on its peculiarities.

Table 1 Survey results for Scenario 1
Table 2 Survey results for Scenario 2
Table 3 Survey results for Scenario 3
Table 4 Survey results for Scenario 4

An international association of surgeons specialising in sports traumatology and knee surgery (ESSKA) was contacted. All the people on the official mailing list of the ESSKA association were regarded as eligible and 1,084 personalised invitations were sent (target population).

For each scenario, the respondents were asked to state whether they would opt for either surgical or conservative treatment on the basis of the available information. Then, according to this first choice, each respondent was asked either to select his/her treatment of choice or to rank alternative treatments; more precisely:

  1. Respondents were invited to choose only one of the treatments made available on a list;

  2. Respondents were invited to quantify, on an ordinal scale, how appropriate they found the treatments for the scenario under consideration.

Two null hypotheses, closely related to the corresponding research questions, were formulated. A categorical variable (nominal) was defined for each item and called ‘treatment of choice for the scenario under consideration’ and the following null hypothesis was formulated:

H0(A): all treatments for the scenario under consideration were regarded as equivalent.

Rejecting this hypothesis (chi-square test of the equality of frequencies) is equivalent to claiming that respondents displayed a preference among the treatments made available on a list.
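The test of H0(A) can be illustrated with a short sketch. It assumes the standard chi-square goodness-of-fit statistic against equal expected frequencies and compares it with the tabulated 5 % critical value for k − 1 degrees of freedom; the function names and the example counts are hypothetical and are not taken from the survey.

```python
# Critical values of the chi-square distribution at alpha = 0.05,
# indexed by degrees of freedom (standard table values).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_equal_frequencies(counts):
    """Chi-square statistic for H0: every option is chosen equally often."""
    n = sum(counts)
    k = len(counts)
    expected = n / k  # same expected count for each option under H0
    return sum((observed - expected) ** 2 / expected for observed in counts)


def rejects_h0_a(counts):
    """True if equality of frequencies is rejected at the 5 % level (df = k - 1)."""
    statistic = chi_square_equal_frequencies(counts)
    return statistic > CHI2_CRIT_05[len(counts) - 1]


# Hypothetical counts for three treatments (illustration only).
example_counts = [120, 40, 20]
print(chi_square_equal_frequencies(example_counts), rejects_h0_a(example_counts))
```

With these illustrative counts the statistic is far above the critical value, so the respondents would be said to display a preference; perfectly balanced counts give a statistic of zero and no rejection.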

In order to establish whether a proposed treatment was deemed as being significantly relevant to the clinical scenario under consideration, a variable called ‘appropriateness of a treatment for the scenario under consideration’ was defined. This variable was quantified on an ordinal scale as follows: totally inappropriate, −2; inappropriate, −1; appropriate, 1; and very appropriate, 2. In this way, a negative number would be assigned to a treatment that was not considered appropriate for the clinical scenario under consideration. The following null hypothesis was then formulated:

H0(B): the proposed treatment is not deemed appropriate for the scenario under consideration.

Rejecting this hypothesis (one-sided sign test on the median being less than or equal to −1) means that the proposed treatment was deemed appropriate by the respondents.
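The sign test behind H0(B) can be sketched as follows. This is a minimal illustration, assuming the exact binomial form of the test and assuming that responses equal to the threshold are dropped as ties; neither detail is specified in the text. The function name and the scores are hypothetical.

```python
from math import comb


def sign_test_greater(scores, threshold=-1):
    """One-sided exact sign test of H0: median <= threshold.

    Returns P(X >= n_above) for X ~ Binomial(m, 0.5), where m counts the
    scores that differ from the threshold. Dropping ties is an assumption
    of this sketch, not a detail reported in the paper.
    """
    n_above = sum(1 for s in scores if s > threshold)
    n_below = sum(1 for s in scores if s < threshold)
    m = n_above + n_below
    if m == 0:
        return 1.0  # no informative responses: H0 cannot be rejected
    return sum(comb(m, i) for i in range(n_above, m + 1)) / 2 ** m


# Hypothetical appropriateness scores on the -2, -1, 1, 2 scale.
scores = [2] * 12 + [1] * 20 + [-1] * 5 + [-2] * 3
p_value = sign_test_greater(scores)
print(p_value < 0.05)  # a small p-value means the treatment is deemed appropriate
```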

Members of the ESSKA association were contacted by email twice: the first time to present the research initiative and to invite each member to participate in the initiative and complete the online questionnaire; the second time to send a reminder to join the initiative only to those members who had not responded to the survey by that time.

The survey was kept open for 24 days, from 12 May to 4 June 2010; on 26 May (2 weeks after the first invitation), the reminder email was sent and this produced a further 63 % of responses. The date on which the reminder was sent was decided on the basis of considerations aimed at reducing non-response bias [26].

An open-source platform (http://www.limesurvey.org/) was configured to collect the responses anonymously.

Overall, at the end of the survey, 374 fully completed questionnaires (36 % of the target population) and 38 partially completed questionnaires were collected.

Agreement assessment

In order to measure the strength of consensus among the raters involved, an inter-rater agreement score was calculated for each question.

Inter-rater agreement relates to the extent to which different evaluating surgeons, each assessing the same clinical scenario, come to the same decision, that is, either select the same treatment or assign the same assessment category (appropriate and not appropriate) to the treatment under consideration.

In the specialist literature, several methods for measuring inter-rater agreement have been proposed [8, 17]. The simplest and most common measurement of inter-rater agreement is the observed proportion of agreement (percentage of overall agreement, Po; hereafter POOA). POOA provides an estimate of the probability of two (randomly selected) raters assigning the same appropriateness grade to a given treatment. However, this statistical measurement does not take account of the agreement that would have been expected due solely to chance [13] and it usually overestimates agreement. To assess agreement between multiple respondents, several coefficients of concordance have been proposed [1]. However, common statistical software packages do not provide functions to calculate these scores, and they can be perceived as difficult by non-specialist users, as surgeons usually are.

Consequently, to make the assessment of agreement easier, we created the nomogram depicted in Fig. 1. The ‘Normalised Chi-square based Agreement’ (NX2A) makes it possible to obtain a univocal level of agreement (the dependent variable) as a simple linear function of an independent variable, which is obtained by normalising the chi-square score associated with the proportions of responses across the k available categories with respect to the number of respondents involved (namely, n).

Fig. 1 This figure illustrates the relationship between the agreement measurement and the normalised chi-square score, for a set of significant numbers of categories (k = 2, 3, 4, 5 and 6). Values of agreement range from 0 to 1 and increase linearly with the chi-square score. The coefficient of proportionality (slope of the lines) is 1/(k−1). Values of agreement less than 0.4 are associated with a ‘poor agreement’ label, values between 0.4 and 0.75 with ‘moderate agreement’ and values above 0.75 with ‘excellent agreement’ [13]. The figure shows the inter-rater agreement for scenario 4 (question 2).

The nomogram was obtained by correlating scores from the free-marginal multirater kappa, which is an indicator of collective agreement that has recently been developed and validated [30], with the ratio between the chi-square score associated with the proportions of responses (over k categories) and the number of respondents (n). The functional relationship has been verified for numbers of respondents (n) larger than 30. The reason why we based our nomogram on the chi-square is that this is a common score in the medical community. All the main statistical software packages calculate a score of this kind (even common spreadsheets like MS Excel) and a number of online calculators are also available (e.g. http://www.jspearson.com/Science/chiSquare.html), as well as a number of precomputed look-up tables.
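The correspondence between the two scores can be checked numerically. The sketch below assumes the commonly cited definition of the free-marginal multirater kappa for a single item, (Po − 1/k)/(1 − 1/k), and the chi-square statistic against equal expected frequencies; the function names and the counts are hypothetical.

```python
def free_marginal_kappa(counts):
    """Free-marginal multirater kappa for one item rated by n respondents."""
    n = sum(counts)
    k = len(counts)
    # Observed agreement: probability that two distinct raters chose the same category.
    p_o = sum(c * (c - 1) for c in counts) / (n * (n - 1))
    p_e = 1 / k  # chance agreement when all categories are equally likely
    return (p_o - p_e) / (1 - p_e)


def normalised_chi_square_agreement(counts):
    """Chi-square score divided by n, then scaled by the slope 1/(k - 1)."""
    n = sum(counts)
    k = len(counts)
    expected = n / k
    chi_square = sum((c - expected) ** 2 / expected for c in counts)
    return chi_square / n / (k - 1)


# Hypothetical response counts for a three-option question (illustration only).
large_sample = [250, 60, 38]
small_sample = [8, 1, 1]
for counts in (large_sample, small_sample):
    gap = normalised_chi_square_agreement(counts) - free_marginal_kappa(counts)
    print(sum(counts), round(gap, 4))
```

Under these definitions the gap between the two scores shrinks as the number of respondents grows, which is consistent with the relationship having been verified only for n larger than 30.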

For question 2 of scenario 4, the NX2A is obtained as follows:

  • k = 3 (no. of categories from which respondents could make their choice); n = 348 (no. of respondents).

  • Chi-square score = 315.7

  • Chi-square score/n = 315.7/348 ≈ 0.9

  • Reading off the line associated with k = 3 (Fig. 1), NX2A = [1/(3−1)] × 0.9 = 0.45 (moderate).
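The procedure above can be reproduced with a few lines of code. This is a minimal sketch: the function names are hypothetical, the inputs are the values reported for question 2 of scenario 4, and the labels follow the thresholds given in the caption of Fig. 1.

```python
def nx2a(chi_square, n_respondents, n_categories):
    """Normalised chi-square based agreement: (chi-square / n) * 1/(k - 1)."""
    return (chi_square / n_respondents) / (n_categories - 1)


def agreement_label(score):
    """Map a score to the ordinal labels used in Fig. 1."""
    if score < 0.4:
        return "poor"
    if score <= 0.75:
        return "moderate"
    return "excellent"


score = nx2a(chi_square=315.7, n_respondents=348, n_categories=3)
print(round(score, 2), agreement_label(score))  # about 0.45, moderate
```

Without the intermediate rounding of the ratio to 0.9, the score is approximately 0.454, which leads to the same ‘moderate’ label.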

Statistical analysis

The responses collected during the survey have been summarised in terms of the proportions of respondents for each alternative option with respect to the total number of actual respondents; the associated 95 % confidence intervals are also reported. A finite population correction factor was used when the sample size was large enough in comparison with the population size. Statistical analyses were carried out using the SPSS package. Equality of proportions has been tested using a chi-square test. For these and all the other tests mentioned below, a conventional confidence level of 95 % (p value < 0.05) has been adopted in order to regard the results as being statistically significant.
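As an illustration of the finite population correction, the sketch below computes a 95 % confidence interval for a proportion. It assumes a normal-approximation (Wald) interval, which is not necessarily the method implemented in the SPSS analysis; the function name and the count of 299 responses are hypothetical, whereas n = 374 and the population of 1,084 come from the text.

```python
from math import sqrt


def proportion_ci(successes, n, population=None, z=1.96):
    """Wald confidence interval for a proportion, optionally with a
    finite population correction (FPC)."""
    p = successes / n
    standard_error = sqrt(p * (1 - p) / n)
    if population is not None:
        # FPC shrinks the standard error when the sample is a large
        # fraction of the population.
        standard_error *= sqrt((population - n) / (population - 1))
    return p - z * standard_error, p + z * standard_error


# Hypothetical count: 299 of 374 respondents choosing one option.
plain = proportion_ci(299, 374)
corrected = proportion_ci(299, 374, population=1084)
print(plain, corrected)
```

Because the respondents make up roughly a third of the target population, the corrected interval is noticeably narrower than the uncorrected one.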

Results

Inferential statistics

Scenario 1

The surgical approach was preferred to the conservative approach to a significant degree (p < 0.001). Procedures based on biological response and microfractures were not considered to be appropriate (n.s.). A cycle of intra-articular cortisone injections was not an appropriate conservative option (n.s.).

Among the surgeons who chose a conservative approach, a significantly higher proportion of respondents preferred to re-evaluate surgery within 6 months (p = 0.001).

Scenario 2

Surgical treatment was preferred to conservative treatment (p < 0.001). Tibial osteotomy (closing wedge) and a unicompartmental knee prosthesis were not regarded as appropriate surgical options for the management of this clinical scenario (n.s.). The surgeons who preferred a conservative approach chose to re-evaluate surgery in 6 months to a significant degree (p = 0.001).

Scenario 3

No significant preference emerged between the surgical and conservative approaches (n.s.). All the proposed conservative treatments were regarded as adequate for the management of this scenario. Among the surgical treatments, arthroscopic ACL reconstruction with an allograft was not regarded as an appropriate option (n.s.).

Scenario 4

Respondents preferred a surgical approach to a significant degree and chose to consider a total knee prosthesis for the management of this clinical scenario (p < 0.001).

They preferred a cemented prosthesis (p < 0.001), but no preference emerged between a posterior-stabilized implant and a posterior cruciate-retaining design (n.s.).

Agreement assessment

Proportions of responses are summarised in Tables 1, 2, 3 and 4. Values of agreement based on the POOA and NX2A nomogram are also shown.

Scenario 1

Although there was a high observed proportion in favour of surgical management, the level of agreement among surgeons was poor. It is interesting to observe that almost perfect agreement was obtained for the appropriateness of aerobic exercise.

Scenario 2

A high tibial osteotomy (closing wedge) was not regarded as an appropriate treatment, with an excellent degree of agreement among surgeons.

Scenario 3

There was an almost perfect lack of agreement (poor agreement) for the management of this scenario, as respondents showed no preference at all. Interestingly, almost perfect agreement with regard to the appropriateness of arthroscopic single-bundle ACL reconstruction with a semitendinosus/gracilis graft was reached by the surgeons.

Scenario 4

Almost perfect agreement in favour of a total knee prosthesis was obtained for the management of this scenario.

Discussion

The main aim of this study was to analyse the preferences of a homogeneous community of surgeons with respect to clinical scenarios by means of a specifically designed web-based questionnaire.

Scenario 1

Cartilage repair is only indicated for focal cartilage defects, which can be seen as a precursor of osteoarthritis [27]. In accordance with the literature, biological procedures were not considered appropriate for the management of this clinical scenario, and in this respect, the respondents achieved a moderate degree of agreement.

A cycle of intra-articular cortisone injections was not regarded as appropriate conservative treatment. This finding is supported by the literature that shows that corticosteroid injections are effective in the short term, whereas the benefits in relation to pain and function have not been confirmed in the long term [2, 3].

According to the literature, while strengthening appears to be a better treatment than aerobic exercise in the short term for specific impairment-related outcomes (e.g. pain), aerobic exercise appears to be more effective for functional outcomes in the longer term [4]; an excellent degree of agreement was obtained among surgeons for the appropriateness of the latter conservative treatment.

Scenario 2

The use of autologous chondrocyte implantation (ACI) and other chondral resurfacing techniques is becoming increasingly widespread. However, at the present time, there is insufficient evidence to draw conclusions about the use of ACI for treating articular cartilage defects in the knee [29, 31].

In line with these findings, the degree of agreement among the respondents who regarded this treatment as appropriate was low (i.e. ‘poor agreement’); this suggests that further randomised controlled trials should be conducted on this topic.

Osteotomy is one of the treatment options for unicompartmental osteoarthritis of the knee. With regard to the kind of techniques, we detected excellent agreement among the surgeons who did not consider the traditional closed-wedge tibial osteotomy appropriate for the management of this scenario.

This is hardly surprising, as the classic lateral closing-wedge procedure requires a fibular osteotomy that is associated with a risk of damage to the peroneal nerve that is reported to be as high as 11 % [27]. Additional drawbacks include limb shortening, extensive surgical dissection, additional morbidity of a fibular osteotomy and complications in relation to the subsequent placement of the tibial component of a total knee replacement.

On the other hand, although poor agreement was obtained among the respondents, surgeons considered the opening-wedge treatment appropriate, in agreement with several authors [20, 28] who suggest this technique for the valgisation of varus osteoarthritis.

Scenario 3

The evidence from randomised trials to determine whether surgery or conservative management is preferable for ACL injury is insufficient [19]. In line with the literature, the preferences of respondents to the survey were distributed fairly evenly between the conservative and surgical options (54 vs. 46 %) and, accordingly, the agreement score was very low.

Among the surgical treatments, arthroscopic ACL reconstruction with an allograft was not regarded as an appropriate option (n.s.). One possible explanation of this finding may be related to the concern regarding potential complications from the allograft tissue. However, a recent literature review has not found a statistically significant difference in terms of failure rate between autograft and allograft tissue (4.7 vs. 8.2 %) [11].

Although a recent meta-analysis has concluded that the current evidence is insufficient to recommend whether a bone-patellar tendon-bone graft or a semitendinosus/gracilis graft is better for ACL reconstruction [23], an excellent degree of agreement on the appropriateness of treatment has been obtained for the latter graft. The lower donor-site morbidity seen in the literature in the case of hamstring autografts could be a factor driving the preferences of the surgeons [7].

Almost perfect agreement on the appropriateness of single-bundle treatment was observed among the surgeons. On the other hand, agreement was poor when it came to the double-bundle reconstruction. Two factors can be considered as playing a role in favour of this finding: on the one hand, there is a lack of evidence relating to the clinical advantages of a double-bundle ACL reconstruction compared with a single-bundle ACL reconstruction [22]; on the other hand, a double-bundle reconstruction can add unnecessary complexity to the surgical procedure [5, 6].

Scenario 4

This is one of the scenarios in which the respondents reached a clear consensus in favour of a surgical approach, although they were obviously unaware of one another’s views. The high polarisation of preferences is associated with an almost perfect agreement score.

With regard to the kind of implant, the respondents expressed a clear preference for the cemented total knee implant. There is evidence in the literature of an improved survival rate for the cemented implants compared with the uncemented ones (odds ratio: 4.2, p < 0.001) [12], and this could be the main reason for the collective preferences of the surgeons with regard to the management of this clinical scenario.

With regard to the choice of whether to use a posterior cruciate-retaining design or a posterior-stabilized design for total knee arthroplasty, no explicit recommendation can be found in the literature [15]. In line with this lack of evidence, the respondents were split fairly evenly between the two techniques and a poor level of agreement was reached; this could suggest that further clinical studies need to be conducted on this specific topic in the near future.

Agreement assessment

Our study has confirmed other work [16] relating to the fact that online survey systems are a flexible tool for collecting the preferences and attitudes of doctors towards appropriate treatments in medical practice.

The ideal ‘best treatment’ or ‘treatment of choice’ for the varying levels of severity of joint injuries is a challenging topic, as clear treatment recommendations are frequently either not available, not agreed on or only unevenly spread among the orthopaedic community. This condition can be found in almost any medical setting [24, 32].

In the literature, there are some reports of investigations in which web-based questionnaires have aimed to investigate consensus on complex and borderline cases and to determine whether surgeons agree on the kind of patient who needs surgery and the type of surgery that should be recommended for treatment in clinical scenarios of this kind [18, 21].

Our work is intended to act as an advance with respect to those contributions where agreement is identified on the basis of a merely conventional threshold (or cut-off value) of the proportion of respondents in favour of a given option, where agreement is only limited to two alternative options, where no measurement of agreement among the respondents is given or where measurements of agreement that have been acknowledged to be prone to overestimation or other systematic biases are adopted [10, 14, 21].

The proposed NX2A nomogram makes it possible to overcome naïve considerations that are based on the mere proportion of preferences and to obtain access to indications with the same strength as those based on the free-marginal multirater kappa [30]. For instance, in Table 1, it would be wrong to label the agreement obtained by a community in which 80 % of respondents prefer one option (i.e. surgical) as ‘good agreement’: doing so would neglect other considerations (the number of possible alternatives, the number of respondents and the influence of chance on the preferences, to mention only the main factors) that actually make a percentage of this kind indicative of ‘poor agreement’.
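The 80 % example can be verified directly from the proportions. The sketch below relies on the algebraic identity that, for equal expected frequencies, the chi-square score divided by n equals k × Σp² − 1; it assumes two alternatives (surgical vs. conservative), and the function name is hypothetical.

```python
def nx2a_from_proportions(proportions):
    """NX2A computed from response proportions that sum to 1."""
    k = len(proportions)
    # For equal expected frequencies, chi-square / n = k * sum(p_i^2) - 1.
    normalised_chi_square = k * sum(p * p for p in proportions) - 1
    return normalised_chi_square / (k - 1)


score = nx2a_from_proportions([0.8, 0.2])
print(round(score, 2))  # 0.36, below the 0.4 threshold for 'moderate'
```

An 80/20 split between two options therefore falls in the ‘poor agreement’ band, whereas an even split gives a score of zero and unanimity gives a score of one.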

The NX2A nomogram provides doctors with both an ordinal and an interval evaluation of the detected inter-rater agreement (see the vertical axis in Fig. 1). The former kind of indication is useful in terms of making sense, the appropriateness of the indicator and communication with patients; the latter is useful in order to compare the agreement reached in different groups (independent samples) or to look for any within-group variation in a pretest and post-test fashion (paired samples).

Furthermore, this kind of reliable information on collective agreement could be used by surgeons when counselling patients on the treatments that are available for their joint pathologies. For variables on which agreement is poor or moderate, surgeons may want to advise patients about the objective variability (or uncertainty) of choice among a large number of orthopaedic surgeons. Conversely, for variables on which excellent agreement is achieved, surgeons may confidently advise patients that there is consensus among their peers about the treatment that should be chosen in cases similar to the one under discussion. In this respect, it should, however, be made clear that excellent agreement does not necessarily imply a good outcome per se; it can instead be presented as a simple, pragmatic (and understandable for the patient as well, we believe) indication of the technique almost any referred surgeon would prefer to use. Obviously, areas of significant clinical uncertainty (or disagreement) should be the focus of future research, or more intensive medical education and training for orthopaedic surgeons who treat the kind of injuries that were the subject of the survey.

Our point is that, when backed up by an objective and quantitative assessment of their strength within a large community of practitioners (as in the case of the NX2A measurement), collective indications could be related to a kind of ‘four-and-a-half’ level of evidence, that is, a level of ‘evidence’ that is based on the ‘consensus of many experts’, rather than on the opinion of a few, albeit respected, experts (level of evidence 5); as such, we believe this evidence can correctly summarise the indications of choice that are actually applied within even wide communities of practitioners or, as in our case, within a whole medical association/society.

Conclusion

The assessment of agreement by the proposed NX2A nomogram has a twofold aim. On the one hand, it can foster discussion among the surgeons in those areas in which adequate evidence exists in the literature to support clinical decision-making. In this case, a certain degree of disagreement among orthopaedic surgeons could be attributed to a lack of adequate peer-reviewed literature on the topic; to an existing controversy among available scientific publications; to the inadequate dissemination and adoption of this evidence; to a conservative attitude and preference to rely on personal experience and tradition; or to a combination of these factors. On the other hand, a tool of this kind, when it is employed in the interpretation of even complex (i.e. multi-page, multi-item and multi-option) online surveys, could help scientific committees discuss and propose indications for practice. These considerations provide the rationale for a possible extension of the classical taxonomy of levels of evidence that is adopted in orthopaedic research [25].