Introduction

Monitoring the status and spread of citrus canker (Xanthomonas citri subsp. citri) on citrus requires estimation of the incidence and severity of the disease (Gottwald et al. 1993; Gottwald and Timmer 1995). Incidence is a more straightforward measure of disease and relates to the number of units infected, while the severity of disease is a measure of the amount of disease per unit, and has greater subjectivity in estimation (Kranz 1988; Madden et al. 2007; Nita et al. 2003; Bock et al. 2008a, b, 2009). Various studies including epidemiological analyses and the ranking of plant breeding material benefit from accurate and precise assessments of disease severity, and should be based on methods that reflect the true disease (Kranz 1988; Nutter et al. 1993; Nita et al. 2003; Parker et al. 1995; Shaw and Royle 1989; Forbes and Korva 1994; Bock et al. 2008a, b, 2009).

Apart from estimating area diseased to the nearest percent, various scales have been developed to address perceived problems with direct estimation, or to simplify or speed the process of disease assessment (Madden et al. 2007). The most widely used and cited scale is that developed by Horsfall and Barratt (1945). The Horsfall and Barratt (H-B) scale was developed in the early 1940s as the science of plant pathology was becoming more quantitative (Horsfall 1986). The scale has been used widely, and is still used by plant pathologists (Mila et al. 2005; Lang et al. 2007; Stevenson et al. 2007), and has been adopted in other disciplines (Pernezney et al. 2008; Bussotti et al. 2003; Chappelka et al. 2003). The impact of the H-B scale can be further established by it being noted as a “Citation Classic” in Current Contents (Horsfall 1986). However, over the last 25 years it has been subject to controversy regarding the rationale for its psychophysical basis, and what it actually achieves in accuracy and precision (Herbert 1982; Forbes and Korva 1994; Nutter and Schultz 1995; Nita et al. 2003; Nutter and Esker 2006; Madden et al. 2007).

The H-B scale is an interval scale–dividing percent severity into categories. Horsfall and Barratt (1945) developed the scale in the presence of tested theory (the Weber law), and untested theory (the Fechner law; Baird and Norma 1978, Birnbaum 1994; Nutter and Esker 2006). Weber’s law states that the physical size of a just noticeable difference is a constant proportion of the value of the standard for a given dimension; Fechner’s law, which assumes Weber’s law, adds that the subjective value is a logarithmic function of the physical value. On this basis Horsfall and Barratt divided the percent area disease from 1–100% into a series of logarithmically increasing, then decreasing categories symmetrical around 50% (Table 1). The symmetry was based on the assumption that the eye reads infected tissue below 50% disease, and healthy tissue above 50% disease (Horsfall and Barratt 1945). Thus the H-B scale was designed to take into account the (unproven) assumption that estimated disease was logarithmically related to true disease (Horsfall and Barratt 1945; Horsfall 1945; Horsfall and Cowling 1978; Herbert 1982; Nutter and Esker 2006).

Table 1 The Horsfall-Barratt scale showing the categories, the ranges for the categories and the category mid-points

The way the H-B scale was to be applied is described by Horsfall and Barratt (1945). This was later expanded on (Horsfall 1945; Horsfall and Cowling 1978), and by Redman et al. (1969). First an appropriate category on the scale (Table 1) is allotted to the sample leaf or plant based on the estimated percent area diseased, and in a similar manner several replicate units (“several plants (20 or more)”) are assessed for that plant or plot. For each estimate, the percent range mid-point is taken (Table 1) directly, or using the Elanco conversions (Redman et al. 1969). Finally, the percent mid points for all replicates are summed and averaged (sum of midpoint percents ÷ number of replicates) to arrive at the estimated mean percent disease severity for that plant, plot or field. The mid-point conversion is necessary prior to taking the arithmetic mean as averaging the category numbers leads to unacceptable bias due to the geometric nature of the scale. The H-B scale continues to be used (Mila et al. 2005; Keinath et al. 2006; Lang et al. 2007; Stevenson et al. 2007) and the claims that it provides an accurate estimate of the mean disease (Horsfall 1945; Horsfall and Cowling 1978; Redman et al. 1969) remain to be thoroughly explored.

There are two facets to the interpretation of the H-B scale. Firstly the relationship between the estimated and true disease was logarithmic—which has been established not to be the case as estimated disease is linearly related to true disease (Nutter and Esker 2006; Nita et al. 2003; Bock et al. 2008a and 2009). Secondly, variance of an estimate of disease severity increases (i.e., precision and accuracy decrease) as magnitude of true severity increases. Thus according to Horsfall (1945), a “difference” of 3% at the end of the scale was equivalent to telling the “difference” between 25% in the middle of the scale, resulting in a process that gave each grade equal weight. There is evidence that variance of direct estimates of disease is not constant with magnitude of true disease severity, but the relationship is not necessarily logarithmic (Redman and Brown 1964; Koch and Hau 1980; Forbes and Korva 1994; Forbes and Jeger 1987; Hau et al. 1989; Bock et al. 2008a, b and 2009). Furthermore, individuals vary in characteristics of assessment (Nutter and Schultz 1995; Bock et al. 2009), and in at least one study estimation error was only weakly related to true disease severity (Nita et al. 2003).

There are few studies that have ascertained effects of the H-B scale (or other scales) on the reliability, precision and accuracy of disease severity estimates in different pathosystems, and the approaches used in these studies have not been the same. Using visual estimates as the “standard”, Slopek (1989) found the H-B scale gave an accurate estimate of severity of barley diseases. However, Forbes and Korva (1994) found that both precision and accuracy were greater when estimating severity of Phytophthora infestans on potato directly using the percent scale compared to the H-B scale. Furthermore, Nita et al. (2003) reported the H-B scale did not improve reliability, accuracy or precision of individual estimates of Phomopsis on strawberry. Considering the widespread use of disease scales as tools in estimating disease severity it is desirable to thoroughly quantify their reliability, accuracy and precision, and establish advantages, or disadvantages of their use. They have not been used widely to estimate severity of citrus canker, but some characteristics of disease scales, such as the speed with which they can be applied, can be advantageous in some situations.

The objectives of this paper were to compare effects of direct estimates of the percent area infected with citrus canker to H-B scale estimates (i) on inter-rater reliability, (ii) agreement between individual raters and true disease values, (iii) on mean estimates of severity, and the characteristics of the mean estimates, (iv) on the frequency of estimates closer to the true value, and (v) assess the relationship and agreement between mean estimated severity of each leaf (averaged across raters) and true disease severity.

Materials and methods

Leaves and leaf images

Images of diseased leaves have been demonstrated to be realistic surrogates for actual leaves in various studies and are valuable and useful tools for disease assessment studies: computer-generated leaf images are used to train raters (Nutter and Schultz 1995); standard area diagrams are constructed from digitised computer images used to aid assessments (Belasque et al. 2005; Pethybridge et al. 2004); yet other studies have used computer-generated, drawn or digitised images as a basis to explore disease assessment and found them to be realistic (Amanat 1976; Forbes and Jeger 1987; Hau et al. 1989; Bock et al. 2008a). Based on these studies, and for practical reasons, images of diseased leaves, rather than actual diseased leaves, were used. The collection and processing of the images of diseased leaves was described in detail previously (Bock et al. 2008a, b, 2009). The leaves were from grapefruit trees in south Florida and displayed a natural range of severity (0 to approx. 38%). They were photographed using a 6.5 megapixel Canon EOS 10D digital camera (24–85 mm AF lens). The computer image analysis software ASSESS V1.0 (APS Press, St. Paul, MN) was used to measure the percent area with symptoms of citrus canker (necrosis and chlorosis) accurately on a leaf-by-leaf basis. Image size was standardised and printed (3.25’ × 2.25’) on letter-size paper and approximately reflected the dimensions of the original leaf, with six images per page. Each of the images was assessed once visually by each of twenty-eight raters (Bock et al. 2009). The data from the image analysis assessments were assumed to be the true values—the method is accurate and repeatable when leaves are assessed individually (Bock et al. 2008a). Assessments by raters were made individually, with no formal resting period although brief breaks of a few minutes could be taken.

The H-B scale was applied to each direct estimate of each leaf by each rater by placing the percent disease severity in the equivalent H-B category as described by Nita et al. (2003), assuming that raters would base the H-B category on the perceived percent area infected. The mid-point value of that category was applied to the individual leaf. For each individual, the directly estimated percent severity data was compared to the H-B scaled data to determine whether there was any effect of scaling over the range of citrus canker severity commonly found in the field.

Data analysis

Characteristics of reliability among individual raters were estimated. The correlation coefficient (r) was calculated for each pair of raters to assess inter-rater reliability for both direct estimates and H-B scaled data. The average correlation coefficient was calculated to compare overall reliability for each method. Lin’s concordance correlation (LCC, Lin 1989; Nita et al. 2003; Madden et al. 2007; Bock et al. 2008a) was used to gauge agreement of the true disease severity and disease assessments for each rater for direct estimated values and the H-B scaled data. LCC provides a quantifiable method to evaluate the characteristics of accuracy and precision (Madden et al. 2007). Accuracy is the closeness of the estimate to the true value, and precision is the variation or scatter associated with the estimates (Madden et al. 2007). Thus complete accuracy in relation to true values is only possible if there is complete precision. The LCC coefficient (ρ c ) is the product of elements of precision (the correlation coefficient, r) and accuracy (C b ). C b in this case is a generalised bias correction factor that measures accuracy and comprises two measures that account for scale or slope shifts (coefficient υ, with a perfect relation of 1 between X and Y) and a location or elevation shift (coefficient μ, where 0 is a perfect relation between X and Y). For the individual raters, cumulative leaf sample means and variances were calculated and compared on a leaf-by-leaf basis for both direct estimates and the H-B scaled data in relation to the true values from image analysis. For these cumulative leaf means, leaves were ordered arbitrarily, but in the same sequence for each rater. The final means and the standard deviations (a measure of precision) were calculated for all 200 leaves assessed for each rater using direct estimates and H-B scaled data. Using each individual leaf estimate, and the cumulative means, the direct estimates and H-B scaled data for each rater were compared to the true values, and the total proportions that were closest to the true value calculated.

To further investigate the effects of the H-B scale on disease estimates, average severity estimates for each leaf were calculated for 3, 5, 10, 20 and 28 raters (rater-means). Raters were ordered arbitrarily and averaged consecutively for the various analyses. Effects of H-B scaling on these more precise and/or accurate data were investigated. Mean disease is generally not based on multiple rater estimates of single leaves, but averaging several rater estimates of the same leaflet is useful in understanding and providing insight into error that might be involved in using direct estimation compared to H-B scaling, as well as other general sources of error. LCC was to gauge agreement of the rater-means and true values. Regression analysis examined the relationship between true and estimated mean-rater severity by direct estimation and H-B scaling, and that of the relationship between variance of the rater-means and true values for direct estimates and H-B scale data. The slope, intercept, coefficient of variation, and coefficient of determination were calculated, and residual plots generated. Thus residuals (d) for the i th data point were calculated as d i  = y i – (a + (b × x i)), where x is the true area infected, a is the intercept, and b is the slope parameter for the regression solution. Regression solutions for rater-means and variance of the rater-means using direct estimates or H-B scaling were compared using general linear modelling (GLM). In this analysis the regression slopes are tested to determine if there are differences in the dependent variable (ds, disease severity) due to assessment method (am) when compared to the continuous independent variable, true disease severity (tds). A test for interaction between regressor and class (ds = tds, am, tds*am) gauges whether there are differences due to assessment method which are reflected in slope for the two classes. If the interaction term is significant the classes are different. If the interaction is not different a further test for differences in the elevation of the regression lines (a test of parallelism) is required, which is achieved by running GLM on the reduced model (no interaction term, ds = tds, am). Data were analysed using SAS (SAS Systems, Cary, NC).

Results

Effect of H-B scaling on reliability of estimates among raters

Overall, inter-rater reliability measured by the correlation coefficient r, ranged from 0.25–0.93 for the direct estimates and 0.20–0.89 for the H-B scaled data. Although not all inter-rater reliabilities were best by direct estimation, the frequency of correlation coefficients showed that they were most frequently higher for direct estimates compared to H-B scaled data (Fig. 1). The mean inter-rater reliability (r) was 0.75 for the direct estimates and 0.71 for the H-B scaled data.

Fig. 1
figure 1

Comparison of the frequency of inter-rater reliability for estimates of citrus canker severity by direct estimation or using the Horsfall-Barratt scale for twenty-eight raters assessing two-hundred canker-infected grapefruit leaves. Inter-rater reliability measured by the correlation coefficient (r = 0.0–1.0). Total number of correlations performed = 378 for direct estimates and H-B scaled data

Effect of H-B scaling on agreement with individual estimates

As a measure of agreement comparing the individual estimate to the true value the components of the LCC coefficient most often showed loss of agreement for all raters with the H-B scale (Table 2). This was in large part due to a loss of precision shown by the correlation coefficient for all but two raters (raters 18 and 23) when the data were converted to the H-B scale. Only rater 8 precision was the same for both methods. The bias correction factor (C b ), a measure of generalised bias or accuracy, showed that both methods were similar for most raters, although overall H-B scaling might offer a very slight improvement in the C b (the mean C b for direct estimation = 0.87, and for H-B scaled data = 0.89). The components of C b , the scale (slope, υ) and location (elevation, μ) were variable among raters. Of the twenty-eight raters, the scale was improved for eighteen raters, and the location improved for fifteen raters by using H-B scaling, respectively. The overall result in the LCC coefficients (ρ c ) showed H-B scaled estimates generally had lower measures for most raters (except rater 18). Thus according to the components of agreement of the LCC coefficient, loss in precision (r) of the estimate appeared to be the major effect of H-B scaling.

Table 2 Correlation analysisa and Lin’s concordance correlation analysisb of the severity of citrus canker symptoms on two-hundred leaves of grapefruit assessed by direct estimates, and by scaling the estimated percent area to the Horsfall-Barratt scale

The cumulative sample mean calculated on a leaf-by-leaf basis for each rater for both direct estimates and the H-B scaled data is shown for the first 100 leaves and 14 raters (Fig. 2). The true mean severity stabilised at a sample size of approximately 40, and the mean estimates of each of the raters within 15–40 consecutive leaves. The tendency to over—or underestimate depended on the individual rater, and was the major source of error. Rater 13 underestimated, and rater 5 overestimated disease (Bock et al. 2009). Both direct estimation and H-B scaling gave similar estimates of cumulative mean severity most of the time, but there was no benefit to using H-B scaling. The variance of these mean estimates for each rater for the first one-hundred leaves showed that magnitude of the variance was rater-dependent (Fig. 3). The variance of the true mean and that of several raters stabilised when the mean was based on a sample size of approx. fifty leaves, although some raters (e.g., rater 8) required a larger sample. The variance of some rater estimates was also higher or lower than the true sample variance. H-B scaling had neither a consistent nor beneficial effect on the variance of the cumulative mean for the true value or the individual rater estimates. These results were reflected in the final sample means of all 200 leaves for the true values and each rater (Fig. 4A). With this sample of 200 leaves (true mean disease 6.56%), H-B scaling had only a slight effect on the mean estimate, and in most raters cases caused a very slight increase (<1%) in the mean estimate compared to direct estimation. The grand mean estimate was close to the true value for both assessment methods. The standard deviation of the sample for each rater was similar for both direct estimated data and H-B scaled data (Fig. 4B), but was not consistently greater or smaller with either method. The magnitude of the standard deviation was dependent on the rater ability compared to the true value (true sample standard deviation = 7.95). The proportion of the total estimates that were closer to the true values was greater for direct estimation (59.5% of estimates) compared to H-B scaling, suggesting a greater discrepancy was more frequent when using the H-B scale. With cumulative mean estimates of disease for each rater, direct estimation was closer to the true disease 53.7% of the time, and the variance of these data was most often closer to the true variance (63.6% of the time).

Fig. 2
figure 2

The cumulative leaf-by-leaf true mean citrus canker severity by image analysis, and for direct estimation and Horsfall-Barratt (H-B) scaled data for the first fourteen raters (only leaves 1 to 100 shown)

Fig. 3
figure 3

The variance of the cumulative mean of the leaf sample of the true citrus canker severity for image analysis, direct estimation (dir. est.) and for Horsfall-Barratt (H-B) scaled data for the first fourteen raters (only leaves 1 to 100 shown, means shown in Fig. 2)

Fig. 4
figure 4

The mean severity of a sample of two hundred canker-infected grapefruit leaves and the standard deviation \( \left( {\sigma = \sqrt {\frac{{s^2 }}{n}} } \right) \) of the mean for true values (by image analysis, IA), direct estimates and Horsfall-Barratt (H-B) scaled data for twenty-eight raters

Effect of H-B scaling on the rater-mean estimates of single diseased leaves

A linear relationship existed between all rater-mean estimates based on different numbers of raters and the true values (Table 3 and Fig. 5A–E). The coefficient of determination (r 2) and the coefficient of variation (CV) suggested H-B scaled data were less precise compared to direct estimates for rater-means based on small samples sizes, but improved as individual leaf severities were based on estimates by more raters. The GLM analysis was unable to discern significant differences in the mean in slope or elevation at any sample size using either method (data not shown). An overall tendency to overestimate was pronounced at low disease severity (<10%) and is evident in the residual plots. The data for rater-means based on different numbers of raters was heteroscedastic using either rating method. Estimates of severity on leaves with >20% true area infected were not as accurate as those of less severely infected. To check the heteroscedasticity was not due to inaccurate image analysis, five of the outlying leaves with means based on 28 raters (23.4, 25.9, 30.4, 30.9 and 35.8% true area infected) were printed on paper and the areas diseased cut out and weighed (Nita et al. 2003) and compared to the true area by image analysis. True disease by weighing was within 0.8–3.7% of the true area measured by image analysis, while the mean estimate of 28 raters compared less well to the true value by image analysis (5.7–8.7%) or by weighing (2.0–9.5%).

Table 3 Linear regressiona analysis between the true citrus canker severity and severity using either direct estimation or Horsfall-Barratt scaled data for two-hundred grapefruit leaves using mean severity for each leaf based on 3, 5, 10, 20 or 28 raters
Fig. 5
figure 5

The relationship and the residuals of the relationship between true citrus canker severity and the mean estimate of symptom severity by direct estimation (dir. est.) and for Horsfall-Barratt (H-B) scaled data for two-hundred leaves based on means of 3 (A), 5 (B), 10 (C), 20 (D) and 28 (E) raters. The residuals (e) for the i th leaf were calculated as e i  = y i–(a + (b × x i)), where x is the true area infected, a is the intercept, and b is the slope parameter. See regression parameters in Table 3

The LCC coefficient and its component parameters (Table 4) showed that rater-means based on more raters had better agreement with true values compared to individual estimates, but there were only minor improvements when the rater-mean was based on five or more raters. Rater-means based on a few raters (3–5 raters) were moderately more precise using direct estimated data compared to H-B scaled data (means of 3 raters, r = 0.94 and 0.91, respectively). Basing the rater-mean on more raters (≥10 raters) resulted in small improvements in precision for both assessment methods (means of 10 raters, r = 0.95). The bias correction factor, C b , for mean severities based on 5 or more raters using direct estimation or H-B scaling was >0.98. Regardless of method, the scale shift (υ) showed there was little deviation of the variance of the estimate compared to that of the true values for means based on ≥ 5 raters (0.97–1.00), but overall the location shift (µ) showed a small tendency to overestimate disease (means of ≥5 raters, µ = 0.08–0.19). Overall, the measure of agreement by the LCC coefficient was the same for both assessment methods with means based on ten or more raters (ρ c  = 0.95).

Table 4 Correlation analysisa and Lin’s concordance correlation analysisb of the severity of citrus canker on two-hundred grapefruit leaves either estimated directly or for Horsfall-Barratt scaled data using mean canker severity for each leaf based on 3, 5, 10, 20 or 28 raters

The logarithm of the variance for all rater-mean estimates had a linear relationship with the logarithm of the true disease severity for both methods (Fig. 6A–E). Rater-mean estimates based on more raters had less dispersed variance for both assessment methods (Table 5). The coefficient of determination (r 2) showed that variability of mean estimates based on 3 raters was least for direct estimates (r 2 = 0.63) compared to H-B scaled data (r 2 = 0.41), but with means based on 10 or more raters, variability was similar (10-rater-means, r 2 = 0.83 and 0.82, respectively). GLM analysis showed no difference in the slope between the two methods with severity means based on 3–10 raters (Table 6), but with means based on more raters (≥20) the variance based on direct estimates was significantly lower compared to H-B scaled data. Ability to discern difference in variance when rater-means are based ≥20 raters might be due to the more precise nature of these values (Lin 1989).

Fig. 6
figure 6

The relationship and the residuals of the relationship between the log true citrus canker severity and the log variance of the cumulative rater-mean by direct estimation (dir. est) and for Horsfall-Barratt (H-B) scaled data for two-hundred leaves based on means of 3 (A), 5 (B), 10 (C), 20 (D) and 28 (E) raters. The residuals (e) for the i th leaf were calculated as e i  = y i–(a + (b × x i)), where x is the true area infected, a is the intercept, and b is the slope parameter. See regression parameters in Table 5

Table 5 Linear regression analysisa of the relationship between log true disease severity and the log variance of the direct estimate and log variance of the H-B scaled data for severity of citrus canker on two-hundred grapefruit leaves based on mean canker severity for each leaf by 3, 5, 10, 20 or 28 raters
Table 6 General linear modelling analysisa of the relationship between true citrus canker severity and the variance of the mean directly estimated (dir. est.) severity or for Horsfall-Barratt (H-B) scaled data for two-hundred grapefruit leaves based on means of 3, 5, 10, 20 and 28 raters

Discussion

Effect of H-B scaling on reliability and agreement of individual estimates

The inter-rater reliability of estimates between raters was diverse using both direct estimates and H-B scaled data, but the H-B scaled data was most often less reliable, which agrees with previous observations (Nita et al. 2003), suggesting that there was no gain in inter-rater reliability from using the H-B scale to assess symptoms of citrus canker on grapefruit leaves.

Compared to the true values, the H-B scale did not improve agreement of the estimate for individual raters. The H-B scale most often resulted in lower precision, which has been noted before (Forbes and Korva 1994; Nita et al. 2003). In this study generalised bias (accuracy) was not demonstrably different for direct estimation of citrus canker symptoms compared to the H-B scale—and there is no prior way of determining whether a single disease severity estimate using the H-B scale or direct estimation will be closest to the true disease value. When the true disease happens to be the same as, or very close to an H-B category mid-point (0, 1.5, 4.5, 9.0, 18.5, 37.5…% etc) the H-B scale will provide good agreement, but if the true disease, and direct estimate, is close to a category boundary (0, 3, 6, 12, 25, 50…% etc), then the H-B scale will be less accurate, which is more likely to occur where the intervals in the scale are wider (in the midrange of the scale) and if the estimate is in the wrong category error might be further exacerbated.

The estimate of cumulative mean disease severity on a leaf-by-leaf basis by each rater explored the effect of sample size in relation to the true values of disease using both direct estimation and H-B scaled data. The effect of rater bias on the mean estimate was much greater than any effect of the H-B scale on the mean estimate. Some raters consistently over—or underestimated the cumulative true mean, but the mean estimates (and the variance of the means) by both methods tracked each other closely, although the direct estimated mean was most often closer to the true mean, and the variance of the direct estimated mean most often closest to the variance of the true mean suggesting there was no advantage in terms of accuracy or precision in using the H-B scale.

The effect of rater-means and H-B scaling on agreement with the true values

The H-B scale did not improve agreement of the rater-mean estimates with the true values compared to direct estimates, but all rater-estimates had a linear relationship with true disease (Slopek 1989; Nutter and Schultz 1995; Nita et al. 2003). Rater-mean values were more accurate and precise when based on data from a larger sample of raters. Mean-rater estimates based on ≥5 raters showed that both methods had similar, good agreement with the true values, but a few rater-mean estimates of individual leaves were still inaccurate with 28 raters suggesting inherent and widespread rater bias to over—or underestimate severity on those particular leaves.

Both direct estimation and the H-B scale gave similar sample means for individual raters for the 200 leaves, with H-B means being <1% different to those by direct estimation, reflecting previous data on means estimated by both methods (Nita et al. 2003). H-B scaled data appeared to have a greater effect on precision of an individual’s estimates compared to the effect on the accuracy (bias), and this imprecision was reflected in larger and more scattered variance of the rater-mean estimates with H-B scaling. Variance (s 2) is a measure of the dispersion of the individual sample estimates, and is used to determine the standard deviation \( \left( {\sigma = \sqrt {\frac{{s^2 }}{n}} } \right) \) which describes the dispersion of the data about the sample mean value, and is thus a measure of precision of the sample mean and relevant to sample size estimation (Everitt 1998; Madden et al. 2007). Based on these data, a major effect of the H-B scale was on precision of the estimates, and the effect of less precise and/or inaccurate visual assessment data has already been demonstrated (Parker et al. 1995) to increase the probability of a Type II error falsely accepting the null hypothesis (Everitt 1998). Furthermore, superimposing additional imprecision on an already subjective assessment method might exacerbate error (Forbes and Korva 1994).

Variance of the rater-mean estimates was not constant with magnitude of true disease, which is in agreement with some previous observations (Forbes and Korva 1994; Forbes and Jeger 1987, Hau et al. 1989; Bock et al. 2008b, 2009). Even with rater-means based on estimates by ten or more raters, specific leaves were over—or underestimated more than others, particularly at true severities >20%. The reason for this is unknown, but might be associated with widespread bias in delineation of more complex symptoms encountered on more severely cankered grapefruit leaves, causing raters to overestimate (or underestimate) disease on those leaves. Raters are known to overestimate disease, particularly at low severity (<10%), and the number of lesions relative to area infected also influence rater estimates (Amanat 1976; Sherwood et al. 1983; Hau et al. 1989; Bock et al. 2008b; 2009). Other uncharacterised illusions or factors of symptom topography might result in rater error.

Advantages and disadvantages of the H-B scale

These results suggest the H-B scale does not improve reliability, accuracy or precision of severity estimates of citrus canker. Indeed, the estimates were generally less reliable or precise. Estimates of disease severity in plots are most often mean values based on a predetermined sample size of diseased leaves or plants (Madden et al. 2007), but the effects of sample size on the accuracy and precision of the mean value using direct estimation or H-B scaling has yet to be explored. Large sample size was an original stipulation for use of the H-B scale (Horsfall and Barratt 1945; Horsfall 1945; Horsfall and Cowling 1978) and the data presented here suggest that for individual raters the mean estimate of disease for 200 leaves was similar (within 1%) using either method, but dependent on rater accuracy (bias). There are circumstances where H-B type-scales, or other interval scales (Slopek 1989; Nita et al. 2003) might be useful or warranted for reasons of practicality. Interval scales are relatively easy to learn, and can be faster to apply in the field allowing for large sample sizes where time is a limiting factor, but if they are to be used, scales with equal increments might be preferable to logarithmic-based ones (Nita et al. 2003; Nutter and Esker 2006; Madden et al. 2007).

The psychophysical basis of the perception of plant disease remains poorly defined across the full range of disease (Nutter and Esker 2006), but based on these results H-B scaling gave no predictable or consistent benefit to estimates of canker severity. The H-B scale continues to be used in plant pathology (Hagan et al. 2003; Keinath et al. 2006; Stevenson et al. 2007), and has been adopted in other disciplines (Pernezney et al. 2008; Bussotti et al. 2003; Chappelka et al. 2003). Choosing to use an H-B type disease scale will depend on a decision based on the trade-offs between time taken to perform the assessment, sample size, precision and accuracy, and reliability of the data.