Synonyms

Receiver operating characteristic analysis

Definition

ROC analysis investigates and employs the relationship between sensitivity and specificity of a binary classifier. Sensitivity, or true positive rate, measures the proportion of positives correctly classified; specificity, or true negative rate, measures the proportion of negatives correctly classified. Conventionally, the true positive rate (tpr) is plotted against the false positive rate (fpr), which is one minus the true negative rate. If a classifier outputs a score proportional to its belief that an instance belongs to the positive class, decreasing the decision threshold – above which an instance is deemed to belong to the positive class – will increase both true and false positive rates. Varying the decision threshold from its maximal to its minimal value results in a piecewise linear curve from (0, 0) to (1, 1), such that each segment has a nonnegative slope (Fig. 1). This ROC curve is the main tool used in ROC analysis. It can be used to address a range of problems, including: (1) determining a decision threshold that minimizes error rate or misclassification cost under given class and cost distributions; (2) identifying regions where one classifier outperforms another; (3) identifying regions where a classifier performs worse than chance; and (4) obtaining calibrated estimates of the class posterior.

ROC Analysis. Figure 1

The table on the left gives the scores assigned by a classifier to ten positive and ten negative examples. Each threshold on the classifier’s score results in particular true and false positive rates: e.g., thresholding the score at 0.5 results in three misclassified positives (tpr = 0.7) and three misclassified negatives (fpr = 0.3); thresholding at 0.65 yields tpr = 0.6 and fpr = 0.1. Considering all possible thresholds gives the ROC curve on the right; this curve can also be constructed without explicit reference to scores, by going down the examples sorted on decreasing score and making a step up (to the right) if the example is positive (negative)
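The threshold-free construction described in the caption is straightforward to implement. The sketch below is illustrative and not part of the original entry; the function name roc_points and the toy data are assumptions, and tied scores are handled by grouping them into a single (possibly diagonal) step.

```python
# A minimal sketch: constructing an empirical ROC curve by stepping through the
# examples sorted on decreasing score, as described in the caption above.
def roc_points(scores, labels):
    """Return the (fpr, tpr) points of the empirical ROC curve.

    scores: classifier scores, higher means 'more positive'.
    labels: True for positives, False for negatives (both classes must be present).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    paired = sorted(zip(scores, labels), key=lambda p: -p[0])
    i = 0
    while i < len(paired):
        s = paired[i][0]
        # Group tied scores so that they produce a single step.
        while i < len(paired) and paired[i][0] == s:
            if paired[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```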

Motivation and Background

ROC analysis has its origins in signal detection theory (Egan, 1975). In its simplest form, a detection problem involves determining the value of a binary signal contaminated with random noise. In the absence of any other information, the most sensible decision threshold would be halfway between the two signal values. If the noise distribution is zero-centered and symmetric, sensitivity and specificity at this threshold have the same expected value, which means that the corresponding operating point on the ROC curve is located at the intersection with the descending diagonal tpr + fpr = 1. However, we may wish to choose different operating points, for instance because false negatives and false positives have different costs. In that case, we need to estimate the noise distribution.

A slight reformulation of the signal detection scenario clarifies its relevance in a machine learning setting. Instead of superimposing random noise on a deterministic signal, we can view the resulting noisy signal as coming from a mixture distribution consisting of two component distributions with different means. The detection problem is now to decide, given a received value, from which component distribution it was drawn. This is essentially what happens in a binary classification scenario, where the scores assigned by a trained classifier follow a mixture distribution with one component for each class. The random variations in the data are translated by the classifier into random variations in the scores, and the classifier’s performance depends on how well the per-class score distributions are separated. Figure 2 illustrates this for both discrete and continuous distributions. In practice, empirical ROC curves and distributions obtained from a test set are discrete because of the finite resolution supplied by the test set. This resolution is further reduced if the classifier only assigns a limited number of different scores, as is the case with decision trees; the histogram example illustrates this.

ROC Analysis. Figure 2

(left) Artificial classifier “scores” for two classes were obtained by sampling 25 points each from two Gaussian distributions with means 0 and 2 and unit variance. The figure shows the raw scores on the x-axis and normalized histograms obtained by uniform five-bin discretization. (right) The jagged ROC curve was obtained by thresholding the raw scores as before. The histogram gives rise to a smoothed ROC curve with only five segments. The dotted line is the theoretical curve obtained from the true Gaussian distributions

Solutions

For convenience we will assume henceforth that score distributions are discrete, and that decision thresholds always fall between actual scores (the results easily generalize to continuous distributions using probability density functions). There is a useful duality between thresholds and scores: decision thresholds correspond to operating points connecting two segments in the ROC curve, and actual scores correspond to segments of the ROC curve connecting two operating points. Let f(s | + ) and f(s | − ) denote the relative frequency of positive (negative) examples from a test set being assigned score s. (Note that s itself may be an estimate of the likelihood p(x | + ) of observing a positive example with feature vector x. We will return to this later.)

Properties of ROC Curves

The first property to note is that the true (false) positive rate achieved at a certain decision threshold t is the proportion of the positive (negative) score distribution to the right of the threshold; that is, \(\mathrm{tpr}(t) = \sum_{s>t} f(s\vert +)\) and \(\mathrm{fpr}(t) = \sum_{s>t} f(s\vert -)\). In Fig. 2, setting the threshold at 1 using the discretized scores gives a true positive rate of 0.72 and a false positive rate of 0.08, as can be seen by summing the bars of the histogram to the right of the threshold. Although the ROC curve does not display thresholds or scores, this allows us to reconstruct the range of thresholds yielding a particular operating point from the score distributions.
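As a sketch of this property, the snippet below computes the two rates from per-class score frequencies. The five-bin frequencies are hypothetical (chosen only so that thresholding at 1 reproduces the rates 0.72 and 0.08 quoted above; they are not the actual data behind Fig. 2).

```python
# Illustrative sketch: tpr(t) and fpr(t) are the mass of each per-class score
# distribution strictly above the threshold t.
def rates_at_threshold(f_pos, f_neg, t):
    """f_pos, f_neg map each score s to its relative frequency f(s|+), f(s|-)."""
    tpr = sum(freq for s, freq in f_pos.items() if s > t)
    fpr = sum(freq for s, freq in f_neg.items() if s > t)
    return tpr, fpr

# Hypothetical five-bin score distributions (bin centres used as scores).
f_pos = {-1: 0.04, 0: 0.08, 1: 0.16, 2: 0.40, 3: 0.32}
f_neg = {-1: 0.28, 0: 0.44, 1: 0.20, 2: 0.08, 3: 0.00}
tpr, fpr = rates_at_threshold(f_pos, f_neg, 1)
print(round(tpr, 2), round(fpr, 2))  # 0.72 0.08
```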

If we connect two distinct operating points on an ROC curve by a straight line, the slope of that line segment is equal to the ratio of positives to negatives in the corresponding score interval; that is,

$$\mathrm{slope}({t}_{1},{t}_{2}) = \frac{\mathrm{tpr}({t}_{2}) - \mathrm{tpr}({t}_{1})}{\mathrm{fpr}({t}_{2}) - \mathrm{fpr}({t}_{1})} = \frac{\sum_{{t}_{1}<s<{t}_{2}} f(s\vert +)}{\sum_{{t}_{1}<s<{t}_{2}} f(s\vert -)}$$

Choosing the score interval small enough to cover a single segment of the ROC curve corresponding to score s, it follows that the segment has slope f (s | + ) ∕ f (s | − ).

This can be verified in Fig. 2: e.g., the top-right segment of the smoothed curve has slope 0 because the leftmost bin of the histogram contains only negative examples. For continuous distributions the slope of the ROC curve at any operating point is equal to the ratio of probability densities at that score.
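A small self-contained sketch of these two slope properties, using made-up relative frequencies rather than the data of Fig. 2:

```python
# The segment for score s has slope f(s|+)/f(s|-); a chord between two thresholds
# has slope equal to the positive-to-negative frequency ratio in that score interval.
f_pos = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # f(s|+), hypothetical
f_neg = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}   # f(s|-), hypothetical

def segment_slope(s):
    return f_pos[s] / f_neg[s]

def chord_slope(t1, t2):
    num = sum(p for s, p in f_pos.items() if t1 < s < t2)
    den = sum(n for s, n in f_neg.items() if t1 < s < t2)
    return num / den

print(segment_slope(3))                 # 4.0: steepest segment, mostly positives
print(round(chord_slope(0.5, 3.5), 2))  # 1.5: chord covering the three highest scores
```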

It can happen that slope(t₁, t₂) < slope(t₁, t₃) < slope(t₂, t₃) for t₁ < t₂ < t₃, which means that the ROC curve has a “dent” or concavity. This is inevitable when using raw classifier scores (unless the positives and negatives are perfectly separated), but can also be observed in the smoothed curve in the example: the rightmost bin of the histogram has a positive-to-negative ratio of 5, while the next bin has a ratio of 13. Consequently, the two leftmost segments of the ROC curve display a slight concavity. It means that performance can be improved by combining the two bins, leading to one large segment with slope 9. In other words, ROC curve concavities demonstrate locally suboptimal behavior of a classifier. An extreme case of suboptimal behavior occurs if the entire curve is concave, or at least below the ascending diagonal: in that case, performance can simply be improved by assigning all test instances the same score, resulting in an ROC curve that follows the ascending diagonal. A convex ROC curve is one without concavities.

The AUC Statistic

The most important statistic associated with ROC curves is the area under (ROC) curve or AUC. Since the curve is located in the unit square, we have 0 ≤ AUC ≤ 1. AUC = 1 is achieved if the classifier scores every positive higher than every negative; AUC = 0 is achieved if every negative is scored higher than every positive. AUC = 1 ∕ 2 is obtained in a range of different scenarios, including: (1) the classifier assigns the same score to all test examples, whether positive or negative, and thus the ROC curve is the ascending diagonal; (2) the per-class score distributions are similar, which results in an ROC curve close (but not identical) to the ascending diagonal; and (3) the classifier gives half of a particular class the highest scores, and the other half, the lowest scores. Note that, although a classifier with AUC close to 1/2 is often said to perform randomly, there is nothing random in the third classifier: rather, its excellent performance on some of the examples is counterbalanced by its very poor performance on some others. (Sometimes a linear rescaling 2AUC − 1 called the Gini coefficient is preferred, which has a related use in the assessment of income or wealth distributions using Lorenz curves: a Gini coefficient close to 0 means that income is approximately evenly distributed. Notice that this Gini coefficient is often called the Gini index, but should not be confused with the impurity measure used in decision tree learning).

AUC has a very useful statistical interpretation: it is the expectation that a (uniformly) randomly drawn positive receives a higher score than a randomly drawn negative. It is a normalized version of the Wilcoxon–Mann–Whitney sum of ranks test, which tests the null hypothesis that two samples of ordinal measurements are drawn from a single distribution. The “sum of ranks” epithet refers to one method to compute this statistic, which is to assign each test example an integer rank according to decreasing score (the highest scoring example gets rank 1, the next gets rank 2, etc.); sum up the ranks of the \(n^-\) negatives, which have to be high; and subtract \(\sum_{i=1}^{n^-} i = n^-(n^- + 1)/2\) to achieve 0 if all negatives are ranked first. The AUC statistic is then obtained by normalizing by the number of pairs of one positive and one negative, \(n^+ n^-\). There are several other ways to calculate AUC: for instance, we can calculate, for each negative, the number of positives preceding it, which basically is a columnwise calculation and yields an alternative view of AUC as the expected true positive rate if the operating point is chosen just before a randomly drawn negative.
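A minimal sketch of this interpretation is given below; it is not from the original entry. The pairwise count it uses is equivalent to the normalized rank sum, and the tie handling (ties count as half) is the usual convention, assumed here.

```python
# AUC as the probability that a randomly drawn positive outranks a randomly drawn negative.
def auc(scores, labels):
    """labels: True for positive, False for negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.7, 0.6, 0.5], [True, True, False, True, False]))  # 5/6 ≈ 0.83
```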

Identifying Optimal Points and the ROC Convex Hull

In order to select an operating point on an ROC curve, we first need to specify the objective function that we aim to optimize. In the simplest case this will be accuracy, the proportion of correctly predicted examples. Denoting the proportion of positives by pos, we can express accuracy as a weighted average of the true positive and true negative rates: pos · tpr + (1 − pos)(1 − fpr). It follows that points with the same accuracy lie on a straight line with slope (1 − pos) ∕ pos; these parallel lines are the isometrics for accuracy (Flach, 2003). In order to find the optimal operating point for a given class distribution, we can start with an accuracy isometric through (0, 1) and slide it down until it touches the ROC curve in one or more points (Fig. 3 (left)). In the case of a single point this uniquely determines the operating point and thus the threshold. If there are several points in common between the accuracy isometric and the ROC curve, we can make an arbitrary choice, or interpolate stochastically. We can read off the achieved accuracy by intersecting the accuracy isometric with the descending diagonal: on that diagonal tpr = 1 − fpr, so the true positive rate at the intersection point equals the accuracy associated with the isometric.
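The following sketch illustrates the same selection numerically; the operating points and the function name best_point are hypothetical, not taken from the figure.

```python
# Sliding an accuracy isometric down onto the ROC curve amounts to picking the
# operating point that maximizes pos*tpr + (1 - pos)*(1 - fpr).
def best_point(points, pos):
    """points: (fpr, tpr) operating points; pos: proportion of positives."""
    def accuracy(point):
        fpr, tpr = point
        return pos * tpr + (1 - pos) * (1 - fpr)
    best = max(points, key=accuracy)
    return best, accuracy(best)

points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.7), (0.5, 0.9), (1.0, 1.0)]
print(best_point(points, pos=2/3))  # twice as many positives: ((0.5, 0.9), ~0.77)
print(best_point(points, pos=1/2))  # uniform classes: ((0.1, 0.6), 0.75)
```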

ROC Analysis. Figure 3

(left) The slope of accuracy isometrics reflects the class ratio. Isometric A has slope 1/2: this corresponds to having twice as many positives as negatives, meaning that an increase in true positive rate of x is worth a 2x increase in false positive rate. This selects two optimal points on the ROC curve. Isometric B corresponds to a uniform class distribution, and selects optimal points which make fewer positive predictions. In either case, the achieved accuracy can be read off on the y-axis after intersecting the isometric with the descending diagonal (slightly higher for points selected by A). (right) The convex hull selects those points on an ROC curve which are optimal under some class distribution. The slope of each segment of the convex hull gives the class ratio under which the two end points of the segment yield equal accuracy. All points under the convex hull are nonoptimal

We can generalize this approach to any objective function that is a linear combination of true and false positive rates. For instance, let predicting class i for an instance of class j incur cost cost(i | j), so for instance the cost of a false positive is cost( + | − ) (profits for correct predictions are modeled as negative costs, e.g., cost( + | + ) < 0). Cost isometrics then have slope (cost( + | − ) − cost( − | − )) ∕ (cost( − | + ) − cost( + | + )). Nonuniform class distributions are simply taken into account by multiplying the class and cost ratio, giving a single skew ratio expressing the relative importance of negatives compared to positives.
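A small worked sketch of this skew ratio follows; the cost values and the proportion of positives are made up for illustration.

```python
# The skew ratio multiplies the cost isometric slope from the text by the class ratio neg/pos.
def skew_ratio(cost_fp, cost_tn, cost_fn, cost_tp, pos):
    cost_ratio = (cost_fp - cost_tn) / (cost_fn - cost_tp)
    class_ratio = (1 - pos) / pos
    return class_ratio * cost_ratio

# False positives cost 1, false negatives cost 5, correct predictions are free,
# and one in five examples is positive: skew = 4 * (1/5) = 0.8.
print(skew_ratio(cost_fp=1, cost_tn=0, cost_fn=5, cost_tp=0, pos=0.2))  # 0.8
```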

This procedure of selecting an optimal point on an ROC curve can be generalized to select among points lying on more than one curve, or even an arbitrary set of points (e.g., points representing different categorical classifiers). In such scenarios, it is likely that certain points are never selected for any skew ratio; such points are said to be dominated. For instance, points on a concave region of an ROC curve are dominated. The nondominated points are optimal for a given closed interval of skew ratios, and can be joined to form the convex hull of the given ROC curve or set of ROC points (Fig. 3 (right)). (In multiobjective optimization, this concept is called the Pareto front.) This notion of the ROC convex hull (sometimes abbreviated as ROCCH) is extremely useful in a range of situations. For instance, if an ROC curve displays concavities, the convex hull represents a discretization of the scores which achieves higher AUC. Alternatively, the convex hull of a set of categorical classifiers can be interpreted as a hybrid classifier that can reach any point on the convex hull by stochastic interpolation between two neighboring classifiers (Provost & Fawcett, 2001).
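As a sketch of how the convex hull of a set of ROC points can be computed, the snippet below uses a standard monotone-chain scan; the function name roc_convex_hull and the example points are assumptions, not taken from the entry or the cited work.

```python
# Upper convex hull of a set of (fpr, tpr) operating points; dominated points are dropped.
def roc_convex_hull(points):
    """points: (fpr, tpr) pairs; returns the vertices of the convex hull."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop the last vertex while it would create a concavity (non-right turn).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# (0.1, 0.3) and (0.4, 0.75) are dominated and do not appear on the hull.
print(roc_convex_hull([(0.1, 0.3), (0.2, 0.7), (0.4, 0.75), (0.6, 0.9)]))
# [(0.0, 0.0), (0.2, 0.7), (0.6, 0.9), (1.0, 1.0)]
```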

Obtaining Calibrated Estimates of the Class Posterior

Recall that each segment of an ROC curve has slope slope(s) = f (s | + ) ∕ f (s | − ), where s is the score associated with the segment, and f(s | + ) and f(s | − ) are the relative frequencies of positives and negatives assigned score s. Now consider the function \(\mathrm{cal}(s) = \mathrm{slope}(s)/(\mathrm{slope}(s) + 1) = f(s\vert +)/(f(s\vert +) + f(s\vert -))\): the calibration map s ↦ cal(s) adjusts the classifier’s scores to reflect the empirical probabilities observed in the test set. If the ROC curve is convex, slope(s) and cal(s) are monotonically nonincreasing with decreasing s, and thus replacing the scores s with cal(s) does not change the ROC curve (other than merging neighboring segments with different scores but the same slope into a single segment).
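A minimal sketch of the calibration map, using made-up per-class score frequencies (a convex case, so the map is monotonic):

```python
# cal(s) = f(s|+)/(f(s|+) + f(s|-)) for each score s.
f_pos = {0.1: 0.1, 0.3: 0.2, 0.7: 0.3, 0.9: 0.4}   # f(s|+), hypothetical
f_neg = {0.1: 0.4, 0.3: 0.3, 0.7: 0.2, 0.9: 0.1}   # f(s|-), hypothetical

def cal(s):
    return f_pos[s] / (f_pos[s] + f_neg[s])

for s in sorted(f_pos):
    print(s, round(cal(s), 2))   # 0.1 -> 0.2, 0.3 -> 0.4, 0.7 -> 0.6, 0.9 -> 0.8
```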

Consider decision trees as a concrete example. Once we have trained (and possibly pruned) a tree, we can obtain a score in each leaf l by taking the ratio of positive to negative training examples in that leaf: score(l) = p( + | l) ∕ p( − | l). These scores represent posterior odds, taking into account the class prior in the training set. Each leaf gives rise to a different segment of the ROC curve; by the nature of how the scores were calculated, this curve will be convex on the training data. The calibrated scores cal(score(l)) then represent an adjusted estimate of the positive posterior that replaces the training set prior with a uniform prior. To see this, notice that duplicating all positive training examples would amplify all uncalibrated scores score(l) by a factor of 2, but the ROC curve and therefore the calibrated probability estimates remain unchanged.

If the ROC curve is not convex, the mapping s ↦ cal(s) is not monotonic; while the scores cal(s) would lead to improved performance on the data from which the ROC curve was derived, this is very unlikely to generalize to other data, and thus leads to overfitting. This is why, in practice, a less drastic calibration procedure involving the convex hull is applied (Fawcett & Niculescu-Mizil, 2007). Let s₁ and s₂ be the scores associated with the start and end segments of a concavity, i.e., s₁ > s₂ and slope(s₁) < slope(s₂). Let slope(s₁s₂) denote the slope of the line segment of the convex hull that repairs this concavity, which implies slope(s₁) < slope(s₁s₂) < slope(s₂). The calibration map will then map any score in the interval [s₂, s₁] to slope(s₁s₂) ∕ (slope(s₁s₂) + 1) (Fig. 4).

ROC Analysis. Figure 4

The piecewise constant calibration map derived from the convex hull in Fig. 3. The original score distributions are indicated at the top of the figure, and the calibrated distributions are on the right. We can clearly see the combined effect of binning the scores and redistributing them over the interval [0, 1]

This ROC-based calibration procedure, which is also known as isotonic regression (Zadrozny & Elkan, 2002), not only produces calibrated probability estimates but also improves AUC. This is in contrast with other calibration procedures, such as logistic calibration, which do not bin the scores and therefore do not change the ROC curve. Among all probability estimators that do not reverse pairwise rankings, ROC-based calibration can be shown to achieve the lowest Brier score (Brier, 1950), which measures the mean squared error in the probability estimates as compared with the ideal probabilities (1 for a positive and 0 for a negative). On the other hand, being a nonparametric method it typically requires more data than parametric methods in order to estimate the bin boundaries reliably.
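As an illustration, the sketch below computes such a calibration with the pool-adjacent-violators (PAV) scheme, one standard way to carry out isotonic regression. It is not taken from the cited papers; it assumes the examples are already sorted on decreasing score and, for simplicity, ignores tied scores (which would first have to be pooled).

```python
# Pool-adjacent-violators sketch: adjacent examples are pooled into blocks, and blocks
# are merged while their calibrated values would increase with decreasing score
# (i.e., while there is a concavity to repair).
def pav_calibrate(labels):
    """labels: 1 for positive, 0 for negative, sorted on decreasing score.
    Returns one calibrated probability per example, non-increasing down the ranking."""
    blocks = []                                  # each block is [sum of labels, size]
    for y in labels:
        blocks.append([y, 1])
        while len(blocks) >= 2 and blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    probs = []
    for s, n in blocks:
        probs.extend([s / n] * n)
    return probs

# A negative ranked above two positives creates a concavity; the three examples
# are pooled and each receives the calibrated estimate 2/3.
print(pav_calibrate([1, 0, 1, 1, 0, 0]))  # [1.0, 0.667, 0.667, 0.667, 0.0, 0.0] (approx.)
```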

Future Directions

ROC analysis in its original form is restricted to binary classification, and its extension to more than two classes gives rise to many open problems. c-class ROC analysis requires c(c − 1) dimensions, in order to distinguish each possible misclassification type. Srinivasan proved that basic concepts such as the ROC polytope and its linearly interpolated convex hull generalize to the c-class case (Srinivasan, 1999). In theory, the volume under the ROC polytope can be employed for assessing the quality of a multiclass classifier (Ferri, Hernández-Orallo, & Salido, 2003), but this volume is hard to compute as – unlike the two-class case, where the segments of an ROC curve can simply be enumerated in O(n log n) time by sorting the n examples on their score (Fawcett, 2006; Flach, 2004) – there is no simple way to enumerate the ROC polytope. Mossman considers the special case of 3-class ROC analysis, where for each class the two possible misclassifications are treated equally (the so-called one-versus-rest scenario) (Mossman, 1999). Hand and Till propose the average of all one-versus-rest AUCs as an approximation of the area under the ROC polytope (Hand & Till, 2001). Various algorithms for minimizing a classifier’s misclassification costs by reweighting the classes are considered in Bourke, Deng, Scott, Schapire, and Vinodchandran (2008) and Lachiche and Flach (2003).

Other research directions include the explicit visualization of misclassification costs (Drummond & Holte, 2006), and using ROC analysis to study the behavior of machine learning algorithms and the relations between machine learning metrics (Fuernkranz & Flach, 2005).

Cross References

Accuracy

Class Imbalance Problem

Classification

Confusion Matrix

Cost-Sensitive Learning

Error Rate

False Negative

False Positive

Gaussian Distribution

Posterior Probability

Precision

Prior Probability

Recall

Sensitivity

Specificity

True Negative

True Positive