Introduction

Behcet’s disease (BD) has been recognized as a disease entity since the fifth century BC, and its description can be found in the third book of Hippocrates on endemic diseases [1]. In the context of modern medicine, it was first described by a Greek ophthalmologist, Adamantiades, in 1930 and 1931 [2, 3] and then as a separate entity in 1937 by Behçet, a Turkish dermatologist [4]. The disease is called BD, but some authors also call it Adamantiades–Behcet’s disease (ABD) [5]. There is no specific laboratory test for the diagnosis of the disease, with the exception of the pathergy test, which exploits the pathergy phenomenon of the disease [6]. This test was first described in 1937 by Blobner [7] and quickly became an important tool for the diagnosis of the disease [8, 9]. Different methods are used to perform the pathergy test. However, the basic technique consists of puncturing the skin with a needle. In the case of a positive pathergy test (PPT), a papulopustular reaction surrounded by erythema appears at the site of the needle puncture within 24–48 h [6, 9–13]. The size (21- to 25-gauge) and sharpness of the needle, the angle of penetration into the skin, and the time delay for reading the reaction vary depending on the center performing the test. The pathergy test can measure disease progress, such as attacks and remissions, as well as other manifestations of the disease [14]. The incidence of PPT varies depending on the country. Overall, the incidence is higher in the ‘Silk Road’ countries [15], but it is declining gradually in some countries [16]. Although the sensitivity of the test has decreased, its specificity has increased along with its positive predictive value (PPV), positive likelihood ratio (PLR), and diagnostic odds ratio (DOR) [17]. The pathergy test is used in many sets of classification/diagnosis criteria [18].

BD, although considered to be a ‘young’ disease, already has 16 sets of classification/diagnosis criteria. Curth presented the first set of criteria in 1946 [19], less than 10 years after the official description of the disease appeared. Other sets of classification/diagnosis criteria have followed: in 1969 by Hewitt et al. [20] and Mason and Barnes [21], respectively, in 1971 by Hewitt and colleagues (revision of their original criteria [22]), in 1972 by the Japan Research Committee for BD (Japan criteria [23]), in 1974 by Hubault and Hamza [24] and O’Duffy [25], respectively, in 1980 by Cheng and Zhang [26], in 1986 by Dilsen et al. [27], and in 1988 by the Japan Research Committee for BD (Japan revised criteria [28]). In nearly 42 years since the first description of BD, ten sets of classification/diagnosis criteria were proposed for BD, with no consensus on any of them. In 1990, the International Study Group on BD, formed by seven countries (France, Iran, Japan, Tunisia, Turkey, UK, and USA), proposed the ISG criteria [29]. However, due to their low sensitivity and high specificity [30–36], no consensus was reached on these criteria, and other criteria have continued to be presented, including the Iran criteria in 1993 [30], the Classification Tree in 1993 [37], the Dilsen revised criteria in 2000 [38], and the Korean criteria in 2003 [39, 40]. In 2004, an International Team from 27 countries (Austria, Azerbaijan, China, Egypt, France, Germany, Greece, India, Iran, Iraq, Israel, Italy, Japan, Jordan, Libya, Morocco, Pakistan, Portugal, Russia, Saudi Arabia, Singapore, Spain, Taiwan, Thailand, Tunisia, Turkey, and USA) was formed. This Team presented the International Criteria for Behcet’s disease (ICBD) in 2006 at the 12th International Conference on BD [41, 42]. The ICBD was presented again at the 2007 American College of Rheumatology (ACR) congress [43], and has since been validated in Germany [44], China [45], and Iran [46].

The pathergy test is a critical criterion/component of 12 of the 16 sets of criteria used for diagnosing BD (the Hewitt, Mason and Barnes, Hewitt revised, and O’Duffy criteria do not use the pathergy test criterion). The aim of this study was to determine the impact of the pathergy test on the performance of those 12 criteria sets.

Patients and methods

Patients

All patients on the BD registry of the BD Unit, Rheumatology Research Center, Tehran University of Medical Sciences (Tehran, Iran) were enrolled in the study. They were diagnosed by expert clinicians as having BD based on their clinical manifestations in the absence of other diseases that could explain them. The diagnosis was based essentially on the presence of oral aphthosis, genital aphthosis, skin manifestations, ocular lesions, vascular lesions, and neurological symptoms. The combination of two or more manifestations led to the diagnosis of BD when no other disease could explain their joint presence and their appearance seemed to be related. Each patient was classified by at least one of the known diagnosis/classification criteria, and the large majority of patients were classified by several. Controls were selected from patients referred to the BD unit due to suspected BD, but who after a thorough evaluation were found not to have the disease.

Pathergy test

Three tests were performed on the forearm of the patient, all of which consisted of an intradermal disposable needle prick after thorough asepsis of the skin with povidone iodine 10 % (Betadine®). One prick was done with a 21-gauge needle, the second with a 25-gauge needle, and the third with a 25-gauge needle, all with the injection of one to two drops of normal saline. The test was read 24 h later by one of the two dermatologists of the BD clinic and cross-checked by one of the rheumatologists (the same rheumatologist throughout the study). A test was considered to be positive (PPT) when a papule or pustule surrounded by erythema formed at the site of the needle prick [6, 11, 13, 16, 17].

Methods

All criteria (except the Hewitt original criteria) were checked in all patients. The percentages of patients and controls fulfilling the criteria were calculated. The result of the pathergy test was then set to negative for all patients and controls, and the calculations were redone.
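The recalculation described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Subject` record, the six-person toy cohort, and the simplified ISG-style rule (oral aphthosis plus at least two of genital aphthosis, skin lesions, eye lesions, or a positive pathergy test) are illustrative assumptions, not the registry data or the exact rules applied in the study.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Subject:
    """Hypothetical record: reference diagnosis plus a few manifestations."""
    has_bd: bool      # expert clinical diagnosis (reference standard)
    oral: bool
    genital: bool
    skin: bool
    eye: bool
    pathergy: bool


def meets_isg(s):
    """Simplified ISG-style rule (an assumption for illustration only)."""
    minor = sum([s.genital, s.skin, s.eye, s.pathergy])
    return s.oral and minor >= 2


def sensitivity_specificity(subjects, criterion):
    """Share of patients fulfilling, and of controls not fulfilling, a criterion."""
    patients = [s for s in subjects if s.has_bd]
    controls = [s for s in subjects if not s.has_bd]
    sens = sum(criterion(s) for s in patients) / len(patients)
    spec = sum(not criterion(s) for s in controls) / len(controls)
    return sens, spec


def without_pathergy(subjects):
    """Set the pathergy result to negative for every patient and control."""
    return [replace(s, pathergy=False) for s in subjects]


# Toy cohort: four patients and two controls (invented values).
toy = [
    Subject(True, True, True, False, False, True),
    Subject(True, True, True, True, False, False),
    Subject(True, True, False, False, True, True),
    Subject(True, True, True, False, True, True),
    Subject(False, True, False, True, False, True),
    Subject(False, True, False, False, False, False),
]

with_test = sensitivity_specificity(toy, meets_isg)
no_test = sensitivity_specificity(without_pathergy(toy), meets_isg)
print("with pathergy:", with_test)
print("without pathergy:", no_test)
```

In this toy cohort, two patients and one control fulfil the rule only because of the pathergy result, so disregarding it lowers sensitivity and raises specificity, which is the pattern the procedure is designed to quantify.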

Statistical analysis

Sensitivity, specificity, and accuracy (percentage agreement) were calculated [47] together with the respective 95 % confidence interval (95 % CI). The PPV, the negative predictive value (NPV), the PLR, the negative likelihood ratio (NLR), the DOR, and the Youden index were calculated [48–51]. For comparison purposes, the chi-square test (Pearson’s test) was used [52].

Results

A total of 6,727 BD patients and 4,648 controls were enrolled in the study. The pathergy test was positive for 51.1 % of the BD patients (95 % CI 49.9–52.3) and for 5.9 % of the control patients (95 % CI 5.2–6.6).
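The text does not state how the 95 % CIs were obtained. As a sketch, assuming the usual normal-approximation (Wald) interval for a proportion, the function below (a hypothetical helper, `proportion_ci`) applied to the rounded percentages and group sizes gives intervals in line with those reported for the pathergy test.

```python
import math


def proportion_ci(p, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    p: observed proportion (0..1); n: group size; z: 1.96 for a 95 % CI.
    The choice of the Wald method is an assumption, not stated in the text.
    """
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Pathergy positivity: 51.1 % of 6,727 patients, 5.9 % of 4,648 controls.
for label, p, n in [("BD patients", 0.511, 6727), ("controls", 0.059, 4648)]:
    low, high = proportion_ci(p, n)
    print(f"{label}: {100 * low:.1f}-{100 * high:.1f} %")
```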

The BD and control patients were examined using each of the 15 sets of diagnosis/classification criteria for BD in terms of the sensitivity, specificity, and accuracy of the sets of criteria. To check the impact of the pathergy test on the performance of the criteria, we then set the results of the pathergy test to negative for all patients and controls and determined once again the sensitivity, specificity, and accuracy of the sets of criteria.

Sensitivity refers to the number of BD patients correctly classified by the criteria and in our study is expressed as the percentage of patients fulfilling the criteria. The sensitivity was 81.7 % for the Dilsen criteria, 86 % for the Japan revised criteria, 77.7 % for the ISG criteria, 97.2 % for the Classification Tree, 86.2 % for the Korean criteria, and 98.3 % for the ICBD. Full details, including the results for the Curth, Mason and Barnes, Hewitt, Japan’s original, Hubault and Hamza, O’Duffy, Cheng and Zhang, Iran, and Dilsen revised sets of criteria, are given in Table 1.

Table 1 Sensitivity

The sensitivity without the pathergy test fell to 64.4 % for the Dilsen criteria, 78.4 % for the Japan revised criteria, 61.7 % for the ISG criteria, 89.7 % for the Classification Tree, 76.4 % for the Korean criteria, and 91.8 % for the ICBD. Full details of all criteria are given in Table 1. The loss of sensitivity was 17.3 % for the Dilsen criteria, 7.6 % for the Japan revised criteria, 16 % for the ISG criteria, 7.5 % for the Classification Tree, 9.8 % for the Korean criteria, and 6.5 % for the ICBD.

Specificity is the number of controls correctly classified as not having the disease and is expressed as the percentage of controls not fulfilling the criteria. The specificity was 95 % for the Dilsen criteria, 97.7 % for the Japan revised criteria, 99.1 % for the ISG criteria, 97.4 % for the Classification Tree, 98.2 % for the Korean criteria, and 96 % for the ICBD. Full details of all criteria are given in Table 2.

Table 2 Specificity

The specificity without the pathergy test improved to 99.7 % for the Dilsen criteria, 97.8 % for the Japan revised criteria, 99.8 % for the ISG criteria, 98.1 % for the Classification Tree, 98.9 % for the Korean criteria, and 96.8 % for the ICBD. Full details are given in Table 2. The increase in specificity was 4.7 % for the Dilsen criteria, 0.1 % for the Japan revised criteria, 0.7 % for the ISG criteria, 0.7 % for the Classification Tree, 0.7 % for the Korean criteria, and 0.8 % for the ICBD.

Accuracy (percentage agreement) is the overall number of patients and controls correctly classified by the criteria and is expressed by the percentage of patients fulfilling the criteria and the percentage of controls not fulfilling the criteria. It shows the overall performance of the criteria. The accuracy was 87.1 % for the Dilsen criteria, 90.8 % for the Japan revised criteria, 86.5 % for the ISG criteria, 97.3 % for the Classification Tree, 91.1 % for the Korean criteria, and 97.4 % for the ICBD. Full details for all criteria are given in Table 3.

Table 3 Accuracy

The accuracy without the pathergy test fell to 78.8 % for the Dilsen criteria, 86.3 % for the Japan revised criteria, 77.3 % for the ISG criteria, 93.2 % for the Classification Tree, 85.6 % for the Korean criteria, and 93.8 % for the ICBD. Full details are given in Table 3. The loss in accuracy was 8.3 % for the Dilsen criteria, 4.5 % for the Japan revised criteria, 9.2 % for the ISG criteria, 4.1 % for the Classification Tree, 5.5 % for the Korean criteria, and 3.6 % for the ICBD.
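Because accuracy pools patients and controls, it weights sensitivity and specificity by the size of each group. The sketch below reconstructs it from the rounded percentages and the group sizes given in this section (6,727 patients, 4,648 controls); the function name is hypothetical, and values rebuilt from rounded inputs may differ from the tabulated ones in the last digit.

```python
def accuracy(sens, spec, n_patients, n_controls):
    """Overall share of correctly classified subjects.

    sens and spec are proportions (0..1); the weighting by group size
    follows from the definition of percentage agreement.
    """
    correct = sens * n_patients + spec * n_controls
    return correct / (n_patients + n_controls)


N_PATIENTS, N_CONTROLS = 6727, 4648

# Dilsen criteria, with and without the pathergy test.
acc_with = accuracy(0.817, 0.950, N_PATIENTS, N_CONTROLS)
acc_without = accuracy(0.644, 0.997, N_PATIENTS, N_CONTROLS)
print(f"with: {100 * acc_with:.1f} %, without: {100 * acc_without:.1f} %")
```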

The PPV, NPV, PLR, NLR, DOR, and Youden index are indicators of the different performance aspects of the sets of criteria and are all calculated from the sensitivity and the specificity of the criteria. These were calculated for each of the criteria sets, with and without the pathergy test.

PPV was calculated as (number of positive BD patients) divided by (number of positive BD patients + number of positive controls). The PPV, taking into account the result of the pathergy test or not, was 94.2 versus 99.5 for the Dilsen criteria, 97.4 versus 97.3 for the Japan revised criteria, 98.8 versus 99.7 for the ISG criteria, 97.4 versus 97.9 for the Classification Tree, 97.9 versus 98.6 for the Korean criteria, and 96.1 versus 96.6 for the ICBD. NPV was calculated as (number of negative controls) divided by (number of negative controls + number of negative BD patients). The NPV, taking into account the result of the pathergy test or not, was 83.8 versus 73.7 for the Dilsen criteria, 87.5 versus 81.9 for the Japan revised criteria, 81.6 versus 72.3 for the ISG criteria, 97.2 versus 90 for the Classification Tree, 87.7 versus 80.7 for the Korean criteria, and 98.3 versus 92.2 for the ICBD. The full results are given in Table 4.

Table 4 Indicators of the different performance aspects of the sets of criteria

PLR was calculated as (sensitivity) divided by (1 − specificity). The PLR, taking into account the result of the pathergy test or not, was 16.3 versus 215 for the Dilsen criteria, 37 versus 36 for the Japan revised criteria, 86 versus 308 for the ISG criteria, 37 versus 47 for the Classification Tree, 48 versus 69 for the Korean criteria, and 25 versus 29 for the ICBD. NLR was calculated as (1 − sensitivity) divided by (specificity). The NLR, taking into account the result of the pathergy test or not, was 0.19 versus 0.36 for the Dilsen criteria, 0.14 versus 0.22 for the Japan revised criteria, 0.23 versus 0.38 for the ISG criteria, 0.03 versus 0.10 for the Classification Tree, 0.14 versus 0.24 for the Korean criteria, and 0.02 versus 0.08 for the ICBD. The full results are given in Table 4.

DOR was calculated as (sensitivity × specificity) divided by [(1 − sensitivity) × (1 − specificity)]. The DOR, taking into account the result of the pathergy test or not, was 85 versus 601 for the Dilsen criteria, 261 versus 161 for the Japan revised criteria, 384 versus 804 for the ISG criteria, 1,300 versus 450 for the Classification Tree, 341 versus 291 for the Korean criteria, and 1,388 versus 339 for the ICBD. The full results are given in Table 4.

Youden’s index was calculated as (sensitivity) + (specificity) − 1. The Youden’s index, taking into account the result of the pathergy test or not, was 0.77 versus 0.64 for the Dilsen criteria, 0.84 versus 0.76 for the Japan revised criteria, 0.77 versus 0.62 for the ISG criteria, 0.95 versus 0.88 for the Classification Tree, 0.84 versus 0.75 for the Korean criteria, and 0.94 versus 0.89 for the ICBD. The full results are given in Table 4.
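The six indicators can be sketched as one small function of sensitivity and specificity. Two points are assumptions on our part: the PLR is taken as sensitivity / (1 − specificity), the standard definition, and the PPV and NPV are computed from the rates alone, which amounts to weighting patients and controls equally. Both choices were made because they reproduce the values reported for the Dilsen criteria with the pathergy test (sensitivity 81.7 %, specificity 95 %). The function name and dictionary keys are hypothetical.

```python
def performance_indices(sens, spec):
    """Derived indicators from sensitivity and specificity (proportions 0..1).

    PPV and NPV here assume equal weighting of patients and controls,
    an assumption made for illustration.
    """
    plr = sens / (1 - spec)            # positive likelihood ratio
    nlr = (1 - sens) / spec            # negative likelihood ratio
    return {
        "ppv": sens / (sens + (1 - spec)),
        "npv": spec / (spec + (1 - sens)),
        "plr": plr,
        "nlr": nlr,
        "dor": plr / nlr,              # equals (sens * spec) / ((1 - sens) * (1 - spec))
        "youden": sens + spec - 1,
    }


# Dilsen criteria with the pathergy test taken into account.
dilsen = performance_indices(0.817, 0.950)
for name, value in dilsen.items():
    print(f"{name}: {value:.3f}")
```

Rounded, these come to a PPV of 94.2 %, an NPV of 83.8 %, a PLR of 16.3, an NLR of 0.19, a DOR of 85, and a Youden index of 0.77.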

Discussion

The sensitivity of all sets of BD criteria in which the pathergy test is a criterion fell significantly when the results of the pathergy test were not accounted for, with the exception of the Cheng and Zhang criteria (Table 1). The greatest loss in sensitivity was seen with the Hubault and Hamza criteria (35 %), followed by the Dilsen criteria (17.3 %), Dilsen revised criteria (16.1 %), ISG criteria (16 %), and Iran criteria (13.5 %), thereby showing the dependence of these sets of criteria on the pathergy test. These criteria will not work effectively in countries where the pathergy test is rarely positive, such as Western countries [12, 15, 35, 53, 54]. In these countries, the criteria with the smallest decrease in sensitivity will be the most suitable, such as the Cheng and Zhang criteria (0 %), Curth criteria (1.9 %), ICBD (6.5 %), Classification Tree (7.5 %), Japan revised criteria (7.6 %), and Japan original criteria (7.8 %) (Table 1). The best overall sensitivity, without taking into account the result of the pathergy test, was obtained for the Curth criteria (97.8 %), Cheng and Zhang criteria (94 %), and ICBD (91.8 %).

The specificity, which is the converse of sensitivity, improved when the results of the pathergy test were disregarded. However, the improvement was smaller than the decrease in sensitivity. The largest gain was obtained by the Dilsen criteria (4.7 %) and Curth criteria (3.1 %), while the smallest improvement was obtained for the Cheng and Zhang criteria (0 %) and the Japan original and Japan revised criteria (each 0.1 %). The other criteria were improved by 0.7–0.8 % (Table 2). The best overall specificity, without the pathergy test, was 99.9 % and was obtained by the Hubault and Hamza criteria, followed by 99.8 % for the Mason and Barnes criteria, Hewitt criteria, and ISG criteria.

The accuracy (percentage agreement), which is the overall result of sensitivity and specificity, decreased when the results of the pathergy test were disregarded. The loss was 24 % for the Hubault and Hamza criteria, followed by the Dilsen revised criteria (9.3 %), the ISG criteria (9.2 %), the Dilsen criteria (8.3 %), and the Iran criteria (7.7 %). The least affected sets of criteria were those of Cheng and Zhang (0 % decrease), Curth (0.3 %), the ICBD (3.6 %), the Classification Tree (4.1 %), and the Japan revised criteria (4.5 %) (Table 3). The best overall accuracy, without the pathergy test, was obtained with the Curth criteria (94.8 %), ICBD (93.8 %), and the Cheng and Zhang criteria (93.7 %).

The predictive value (PV) is the probability that a test result is a true result, and the PPV is the probability that a positive test result is a true positive. It is highly influenced by the prevalence of the disease in the population in which it is tested. The prevalence of a disease will change depending on where and in which setting the patients are seen. Consequently, the results obtained in different settings will differ. In 2010, the PPV of ICBD in Iran was 95.7 % when the prevalence of BD was not taken into account [46]; this dropped to 91.7 % in the BD clinic (Rheumatology Research Center) where one-third of new patients were true BD patients (prevalence 33 %). The PPV was calculated to fall to 71.3 % in a setting with 10 % BD prevalence, to 18.4 % in a setting with 1 % BD prevalence, and to only 1.8 % if the test was done randomly in a population where the prevalence is 80 per 100,000 inhabitants [46]. The PPV improves substantially with any improvement, even a minor one, of the specificity. Specificity improved in the absence of the pathergy test, and the PPV improved in the same manner. The best improvement, 5.3 %, was obtained using the Dilsen criteria, followed by the Curth criteria (2.4 %) and the Hubault and Hamza criteria (1.1 %). A PPV not improved or only slightly improved by disregarding the pathergy result indicates a better reliability of the criteria. The NPV is the probability that a negative test is truly a negative result. The NPV is also influenced by the prevalence of the disease. The higher the NPV, the more reliable is the test; a lower NPV means that fewer true patients were diagnosed. Sets of criteria that were relatively more dependent on the pathergy test lost relatively more in NPV than the other sets of criteria, with the highest loss found for the Dilsen criteria (10.1 %), followed by the Iran criteria (10.1 %), Dilsen revised criteria (9.6 %), and the ISG criteria (9.3 %). The smallest loss was for the Cheng and Zhang criteria (0 %) and Curth criteria (2.1 %).
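The dependence of the PPV on prevalence follows from Bayes’ rule and can be sketched as below. The inputs are the ICBD sensitivity (98.3 %) and specificity (96 %) found in the present study with the pathergy test, so the outputs are illustrative and are not expected to equal the figures quoted from [46], which rest on that study’s own values. The function name is hypothetical.

```python
def ppv_at_prevalence(sens, spec, prevalence):
    """PPV for a given disease prevalence (all arguments are proportions)."""
    true_positives = sens * prevalence
    false_positives = (1 - spec) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Settings mentioned in the text: equal groups, BD clinic, 10 %, 1 %,
# and a general population with 80 cases per 100,000 inhabitants.
for prevalence in (0.5, 0.33, 0.10, 0.01, 0.0008):
    ppv = ppv_at_prevalence(0.983, 0.960, prevalence)
    print(f"prevalence {100 * prevalence:g} %: PPV {100 * ppv:.1f} %")
```

At a prevalence of 50 % the function returns the same value as the prevalence-free PPV; as prevalence falls, false positives from the much larger disease-free group dominate and the PPV drops sharply.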

The likelihood ratio (LR) shows how much the odds of having the disease may change with a positive or negative result. Prevalence does not influence the LR and, therefore, the figures can be used in any disease setting. The PLR shows the odds of having the disease. When the PLR exceeds 5, the test is related to the disease. Both sensitivity and specificity improve the PLR, but specificity more than sensitivity. When the results of the pathergy test were disregarded, the sensitivity decreased and the specificity improved; the decrease in the former was much greater than the improvement in the latter, but not so great as to decrease the PLR. All criteria using the pathergy test showed an improvement in their PLR, with the exception of the Japan original and Japan revised criteria. The ISG criteria showed the highest improvement, with an odds improvement of 222, followed by the Dilsen (199) and Hubault and Hamza (187) criteria. A higher improvement means that those criteria are more dependent on the result of the pathergy test. The Curth criteria, ICBD, and Classification Tree (odds improvement 23, 4, and 4, respectively) showed the lowest dependency on the pathergy test. The Cheng and Zhang criteria did not change when the pathergy results were disregarded, while the Japan criteria (original and revised) lost 2 and 1 relative units, thereby showing their independence of the pathergy test. The NLR shows the odds of not having the disease. The NLR worsened for all criteria when the pathergy test results were disregarded because the NLR is more sensitive to sensitivity than to specificity. The least impairment was for the Curth criteria (0.02), ICBD (0.06), and Classification Tree (0.07), while the worst impairment was for the Hubault and Hamza (0.35), Dilsen (0.17), and ISG criteria (0.15).

The DOR shows the power of discrimination of the criteria. A value of 1 means the criteria do not discriminate between patients and controls, while higher values indicate better discrimination. DOR is evenly influenced by sensitivity and specificity. The problem with the DOR is that a high sensitivity can mask a dangerous lack of specificity, and vice versa, especially when the value is between 99 and 100 %. The DOR decreased in the majority of criteria in the absence of the pathergy test, with the largest loss shown by the Curth criteria (1,850), ICBD (1,049), and Classification Tree (850). In contrast, the DOR improved by 516 with the Dilsen criteria, by 420 with the ISG criteria, and by 191 with the Dilsen revised criteria.

Youden’s index is a simple calculation combining the results of sensitivity and specificity to show the precision or accuracy of the test. Sensitivity and specificity equally influence the Youden index. The results can vary from zero to one, with the latter being the most precise. In our study, the Youden index of nearly all sets of criteria decreased when the pathergy test results were disregarded, especially for the Hubault and Hamza criteria (0.45), the ISG criteria (0.15), and the Dilsen revised criteria (0.15). The least impaired criteria were the ICBD (0.05) and the Classification Tree (0.07). As expected, the Cheng and Zhang criteria did not change, and the Curth criteria improved by 0.01.

Conclusion

The pathergy test is the only paraclinical test that is available for diagnosing BD. Without it, the sensitivity of the majority of sets of classification/diagnosis criteria decreases by 1.9–35 %. In contrast, the specificity improves by 0.1–4.7 %. Overall, the sets of criteria show a loss of performance (accuracy), demonstrating that this parameter is necessary to improve the power of existing classification/diagnosis criteria.