1 Conformal Predictors and Measurement Errors

Machine-learning methods often work very well and have found many applications in both the public and the private sector. On the other hand, the reliability of their performance is typically proven only asymptotically, which is of limited use in practice. Conformal predictors, which perform competitively in terms of success rates, include from their conception simple and useful measures of confidence [1]. Conformal prediction can be based on any technique of point prediction for classification or regression, including support-vector machines, decision trees, neural networks and Bayesian methods. Starting from the point prediction tool, a conformal predictor is built by defining a nonconformity measure, which determines how unusual an example is relative to previous examples. The conformal algorithm, based on the statistical concept of p-values, turns this nonconformity measure into prediction regions. Given a nonconformity measure, the conformal algorithm produces a prediction region Uε for every probability of error ε. The region Uε is a (1−ε)-prediction region: it contains the label of the next example with probability at least 1−ε. Conformal predictors are therefore conservatively valid, which means that the probability that they make a mistake when their output is at confidence level 1−ε is not greater than ε.

In most of the nonconformity measures utilised by conformal predictors, the Euclidean distance is implicitly assumed to be the proper metric for calculating the nonconformity scores and the p-values. The Euclidean distance has a precise geometrical meaning and a very long historical pedigree. However, it implicitly treats all data as single, infinitely precise values. This assumption can be appropriate in other applications but clearly does not hold in physics, where measurements typically come with an error bar. An alternative is to use a distance between data that takes the measurement uncertainties into account. The sources of uncertainty in a measurement are typically many, and from a statistical point of view they can be considered random variables. As a consequence, their global contribution can often be modelled as normally distributed noise. The idea behind the approach proposed in this paper is therefore to consider the measurements not as points but as Gaussian distributions [2], which requires defining a distance between Gaussians. The distance adopted is the Geodesic Distance on the Gaussian Manifold (GDGM) of the measurements, which can be expressed in closed form (see Sect. 3) [3]. As shown in the rest of the paper, adopting this geodesic distance can significantly increase the accuracy of traditional conformal predictors, even when the data are affected by a very limited level of noise.

With regard to the structure of the paper, the next section provides a short introduction to the general framework of conformal prediction. The mathematical background of the main tool introduced in the paper, the Geodesic Distance on Gaussian Manifolds, is the subject of Sect. 3. The proposed method is assessed with a series of numerical tests using a toy model described in Sect. 4. Section 5 reports the results of the numerical tests in detail. Conclusions and lines of future work are provided in the last section of the paper.

2 The Framework of Conformal Prediction for Classification

The task of classification basically consists of attributing objects to different classes. Mathematically, this can be formalised by considering successive ordered pairs (x1, y1), (x2, y2), …, which are called examples. Each example consists of an object xi and its label yi, where the former is the feature vector that describes object i. The objects are elements of a measurable space X called the object space; the labels are elements of a measurable space Y called the label space. It is common practice to adopt a more compact notation, according to which zi indicates the ordered pair (xi, yi), and Z := X × Y is defined as the example space.

Many machine learning tools are available to perform classification. However, as mentioned earlier, the vast majority of them cannot easily quantify the quality of their predictions. Conformal predictors, by contrast, have been conceived explicitly to quantify the reliability of their predictions, and they do so on the basis of past examples. To this end, for each new sample to classify, it is necessary to measure how different it is from the old examples. A nonconformity measure is therefore defined, which yields a nonconformity score estimating how different a new example is from a bag of old ones. A bag of size n ∈ N is a collection of n elements, some of which may be identical. In this paper, the notation < z1,…, zn > indicates a bag of n elements.

Given a nonconformity measure A and a bag < z1,…,zn > , the nonconformity score can be calculated as:

$$ \alpha_{i} \text{ := }A\left( {\left\langle {\text{z}_{1} , \ldots ,\text{z}_{i - 1} ,\text{z}_{i + 1} , \ldots ,\text{z}_{n} } \right\rangle ,\text{z}_{i} } \right) $$
(1)

for each example zi in the bag. Because nonconformity measures are not absolute but relative, the numerical value of \( \alpha_{i} \) does not, by itself, determine how unusual zi is according to the measure A. To really quantify how unusual a sample is, it is necessary to compare \( \alpha_{i} \) with the nonconformity scores \( \alpha_{j} \) of the other members of the bag. The p-value is a convenient and statistically sound way of calculating how anomalous a new example is. By definition, the p-value is the fraction

$$ Pval = \frac{{ \# \,\{ j = 1, \ldots ,n:\alpha_{j} \ge \alpha_{i} \} }}{n} $$
(2)

This indicator, which lies between 1/n and 1, is the fraction of the examples in the bag that are at least as nonconforming as zi; in the literature it is called the p-value of the element zi, pval(zi). The symbol “#” denotes the number of elements j in the collection whose nonconformity score is greater than or equal to that of element i. The lower bound is 1/n because j ranges over the whole bag, including i itself. The lower the p-value, i.e. the closer it is to 1/n for large n, the more nonconforming zi is and the more likely it is to be an outlier, meaning that zi is not representative of the typical member of the bag. If the p-value is large, i.e. close to its upper bound 1, then zi is very conforming, or very representative of the typical member of the bag. For classification, a p-value is computed for each candidate label of the new sample, and the sample is attributed to the class with the highest p-value.
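As an illustration of Eq. (2), the following minimal Python sketch computes the p-value of one element from a list of nonconformity scores. The function name `p_value` and the numerical scores are assumptions introduced here for illustration only; they do not come from the tests reported in the paper.

```python
def p_value(scores, i):
    """Eq. (2): fraction of scores in the bag that are >= scores[i]."""
    n = len(scores)
    # The element i itself is always counted, so the result is at least 1/n.
    count = sum(1 for a in scores if a >= scores[i])
    return count / n


# Hypothetical nonconformity scores for a bag of four examples.
scores = [0.2, 0.5, 0.1, 3.0]
print(p_value(scores, 3))  # 0.25: the most nonconforming element, p-value = 1/n
print(p_value(scores, 2))  # 1.0: the most conforming element
```

The element with the largest score attains the lower bound 1/n, while the element with the smallest score attains the upper bound 1.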

On the basis of the p-values, conformal predictors provide, for each new classification, two indicators, confidence and credibility, which quantify the reliability of the prediction. Credibility is defined as the largest p-value; confidence is defined as one minus the second-largest p-value. Confidence can be interpreted as the probability that the prediction, corresponding to the maximal p-value, is correct. A low credibility, typically less than 0.05, intuitively means that either the training set is not random or the test object is not representative of the training set. If the maximum p-value appears in more than one class, an ambiguity is present and the algorithm is not able to classify the sample. It is important to emphasise that the confidence and credibility of the prediction play a role analogous to that of the observed level of significance in statistical parameter tests.
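The two indicators can be obtained from the per-class p-values as in the following minimal sketch. The function name, the dictionary layout and the p-values are assumptions made for illustration; at least two candidate classes are assumed.

```python
def credibility_and_confidence(p_values):
    """Return (predicted label, credibility, confidence).

    p_values maps each candidate label to its p-value (at least two labels).
    """
    ranked = sorted(p_values.items(), key=lambda kv: kv[1], reverse=True)
    (label, largest), (_, second) = ranked[0], ranked[1]
    if largest == second:
        # The maximum p-value occurs in more than one class: no classification.
        label = None
    return label, largest, 1.0 - second


# Illustrative p-values only, not results from the paper.
p_values = {"A": 0.0625, "B": 0.125, "C": 0.75}
print(credibility_and_confidence(p_values))  # ('C', 0.75, 0.875)
```

Returning `None` as the label in the ambiguous case is a design choice of this sketch, mirroring the statement that the algorithm is then unable to classify the sample.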

3 Geodesic Distance on Gaussian Manifolds

As mentioned in the previous section, in the natural sciences the data available are typically the result of experimental measurements. In this context, all measurements are affected by uncertainties, referred to as error bars. The sources of this uncertainty are normally numerous, so it is reasonable to assume that the noise is normally distributed. Each measurement can therefore be modelled as a probability density function (pdf) of the Gaussian type, determined by its mean μ and its standard deviation σ:

$$ p(x;\mu ,\sigma ) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ - \frac{(x - \mu )^{2}}{2\sigma^{2}} \right] $$
(3)

It is normal practice to assume that the experimentally measured value is the mean of the pdf, since the mean is the most probable value of a Gaussian. The standard deviation can be determined independently, from knowledge of the instrumentation.

The set of normal distributions can therefore be modelled as a two-dimensional space, or more precisely a two-dimensional manifold, parameterised by \( \mu \) and \( \sigma \). Modelling measurements not as point values but as Gaussian distributions requires defining a distance between Gaussians. The most appropriate definition is the geodesic distance (GDGM) on the probabilistic manifold containing the data, which is not a Euclidean but a Riemannian space. This geodesic distance on the Gaussian manifold can be calculated using the Fisher-Rao metric [3, 4]. For two univariate Gaussian distributions \( p_{1}(x|\mu_{1},\sigma_{1}) \) and \( p_{2}(x|\mu_{2},\sigma_{2}) \), parameterised by their means \( \mu_{i} \) and standard deviations \( \sigma_{i} \) (i = 1, 2), the GDGM is given by:

$$ GD(p_{1} \,||\, p_{2}) = \sqrt{2}\,\ln\frac{1 + \delta}{1 - \delta} = 2\sqrt{2}\,\tanh^{-1}\delta , \quad \text{where} \quad \delta = \left[ \frac{(\mu_{1} - \mu_{2})^{2} + 2(\sigma_{1} - \sigma_{2})^{2}}{(\mu_{1} - \mu_{2})^{2} + 2(\sigma_{1} + \sigma_{2})^{2}} \right]^{\frac{1}{2}} $$
(4)
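Equation (4) translates directly into code. The following minimal Python sketch is one possible transcription; the function name `gdgm` and the numerical values in the usage lines are assumptions made for illustration, and strictly positive standard deviations are required so that δ < 1.

```python
import math


def gdgm(mu1, sigma1, mu2, sigma2):
    """Geodesic distance of Eq. (4) between two univariate Gaussians."""
    if sigma1 <= 0.0 or sigma2 <= 0.0:
        raise ValueError("standard deviations must be strictly positive")
    dmu2 = (mu1 - mu2) ** 2
    num = dmu2 + 2.0 * (sigma1 - sigma2) ** 2
    den = dmu2 + 2.0 * (sigma1 + sigma2) ** 2
    delta = math.sqrt(num / den)
    # 2*sqrt(2)*atanh(delta) is the same as sqrt(2)*ln((1 + delta)/(1 - delta)).
    return 2.0 * math.sqrt(2.0) * math.atanh(delta)


# Same separation of the means, narrow versus wide error bars (hypothetical values).
narrow = gdgm(0.0, 0.1, 1.0, 0.1)
wide = gdgm(0.0, 1.0, 1.0, 1.0)
print(narrow > wide)  # True: the wider pair is closer in the geodesic sense
```

Identical distributions give δ = 0 and therefore zero distance, and the function is symmetric in its two arguments, as expected of a distance.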

As will be shown in detail in the next sections, replacing the Euclidean distance with the GDGM significantly improves the robustness of the classification. Fig. 1 shows a graphical example of the effect of the metric in Eq. (4). In a Cartesian coordinate system (μ, σ), each point represents a Gaussian distribution. Considering the four points of the figure, and hence the four distributions, the Euclidean distance is larger between the two wider distributions. With the geodesic distance, on the other hand, the smaller distance is obtained for the wider distributions. This behaviour reflects the physical interpretation according to which physical quantities with larger error bars are to be considered closer and more similar than those with narrower error bars.
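The dependence on the error bars can be read directly from Eq. (4). As an illustrative special case (an assumption made here for simplicity, not necessarily the configuration of Fig. 1), consider two Gaussians with the same standard deviation σ and means separated by \( \Delta\mu = \mu_{1} - \mu_{2} \). The term in \( (\sigma_{1} - \sigma_{2}) \) vanishes and \( 2(\sigma_{1} + \sigma_{2})^{2} = 8\sigma^{2} \), so that:

$$ \delta = \left[ \frac{(\Delta\mu)^{2}}{(\Delta\mu)^{2} + 8\sigma^{2}} \right]^{\frac{1}{2}}, \qquad GD = 2\sqrt{2}\,\tanh^{-1}\delta \approx 2\sqrt{2}\,\frac{|\Delta\mu|}{2\sqrt{2}\,\sigma} = \frac{|\Delta\mu|}{\sigma} \quad \text{for } |\Delta\mu| \ll \sigma $$

For a fixed separation of the means, δ and hence the geodesic distance decrease monotonically as σ grows, whereas the Euclidean distance \( |\Delta\mu| \) between the means is unchanged. In the limit of small separations the geodesic distance reduces to the separation of the means expressed in units of the error bar.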

Fig. 1. Examples to illustrate how the GDGM determines the distance between two Gaussians. The two couples of pdf in the figure have the same mean but different σ. The geodesic distance between the two with higher σ is much smaller.

4 A Toy Model

To illustrate and demonstrate the usefulness of the method proposed in this paper, a series of numerical tests has been performed. They are based on a toy model already introduced in [5]. The simplicity of the model makes it easy to appreciate both the nature of the problem and the advantages of adopting the proposed metric, the GDGM. The classification task consists of classifying points on a straight line, on which three classes have been defined. The problem is represented graphically in Fig. 2. The aim is to classify the new point Q with confidence and credibility.

Fig. 2. The toy model. A new point Q must be classified as belonging to class A, B or C.

For the purpose of this example, the classification is based on the nearest neighbour. Mathematically, consider a bag < z1,…, zn−1 >, where each zi consists of a feature vector xi and a non-numerical label yi. When a new example zn = (xn, yn) becomes available for classification, its feature vector xn is known but its label yn is not. The nearest-neighbour method finds the xi closest to xn, and its label yi becomes the prediction of yn. A natural way to measure the nonconformity of an example with respect to the others consists of comparing its distance to the objects with the same label with its distance to the objects with a different label. For example, the nonconformity scores can be defined as:

$$ \alpha_{i} = \frac{{\text{d}}_{\text{sl}}}{{\text{d}}_{\text{dl}}} $$
(5)

$$ {\text{d}}_{\text{sl}} = \min \left\{ \left| x_{j} - x_{i} \right| : 1 \le j \le n,\; j \ne i,\; y_{j} = y_{i} \right\} $$
$$ {\text{d}}_{\text{dl}} = \min \left\{ \left| x_{j} - x_{i} \right| : 1 \le j \le n,\; j \ne i,\; y_{j} \ne y_{i} \right\} $$
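The nearest-neighbour nonconformity score of Eq. (5), combined with the p-value of Eq. (2), can be sketched in Python as follows. The function names, the pluggable `dist` argument and the training bag are assumptions made for illustration; in particular, the bag below is hypothetical and is not the data of Table 1. The sketch assumes that at least two classes are present in the bag.

```python
import math


def nonconformity(bag, i, dist=lambda a, b: abs(a - b)):
    """Eq. (5) for example i: nearest same-label distance divided by
    nearest different-label distance. The default dist is Euclidean in 1-D."""
    x_i, y_i = bag[i]
    others = [(x, y) for j, (x, y) in enumerate(bag) if j != i]
    d_sl = min((dist(x, x_i) for x, y in others if y == y_i), default=math.inf)
    d_dl = min((dist(x, x_i) for x, y in others if y != y_i), default=math.inf)
    if d_dl == 0.0:
        return math.inf  # coincides with an object of another class
    return d_sl / d_dl


def label_p_values(bag, x_new, labels):
    """p-value (Eq. 2) of the new object under each candidate label."""
    p = {}
    for label in labels:
        extended = bag + [(x_new, label)]  # tentatively assign the label
        scores = [nonconformity(extended, i) for i in range(len(extended))]
        p[label] = sum(a >= scores[-1] for a in scores) / len(scores)
    return p


# Hypothetical one-dimensional training bag with three classes.
bag = [(1.0, "A"), (1.5, "A"), (2.0, "A"),
       (7.0, "B"), (7.5, "B"), (8.0, "B"),
       (14.0, "C"), (14.5, "C"), (15.0, "C")]
p = label_p_values(bag, 14.85, ["A", "B", "C"])
print(max(p, key=p.get), p)
```

With this illustrative bag the new point receives its largest p-value under label C. Passing a different `dist` function is one way in which another metric could be substituted for the absolute difference.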

For the new point Q = 14.85 shown in Fig. 2, the nonconformity scores are presented in Table 1. The resulting credibility and confidence are 1 and 0.9844, respectively, and point Q is assigned to Class C.

Table 1. Non-conformity measurements for point Q = 14.85

In the previous example, the nonconformity measure of Eq. (5) has been calculated using the Euclidean distance between the various points. All the derived quantities are therefore also based on this metric. In the case of measurements affected by noise, the Euclidean metric is not adequate, and adopting the GDGM provides several improvements, as discussed in the next section.

5 Results of the Numerical Tests

In order to assess the potential of the GDGM metric to counteract the effect of noise, a series of systematic tests has been performed using the toy model introduced in the previous section. To this end, a series of points has been automatically generated along the straight line of Fig. 2. These are to be considered the true values of the physical quantity to be measured. Gaussian noise, with zero mean and standard deviation equal to a percentage (10 %, 20 %, …) of the value itself, has then been added to the generated points. The resulting values are taken as the available measurements, affected by additive Gaussian noise. These points have then been classified with the nonconformity measure based on the nearest-neighbour criterion, using both the Euclidean distance and the GDGM. The results are reported in Table 2 for the Euclidean distance and in Table 3 for the GDGM.
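The noise-generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported tests: the function name, the seed and the example true values are hypothetical, and the standard deviation is taken as a fraction of the true value, as described above.

```python
import random


def noisy_measurements(true_values, noise_fraction, seed=0):
    """Return a list of (measured value, sigma) pairs.

    Each measurement is the true value plus zero-mean Gaussian noise whose
    standard deviation is noise_fraction times the true value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    measurements = []
    for value in true_values:
        sigma = noise_fraction * abs(value)
        measured = rng.gauss(value, sigma)
        measurements.append((measured, sigma))
    return measurements


# Hypothetical true values on the line; 10 % noise level.
true_values = [1.0, 7.5, 14.0]
for measured, sigma in noisy_measurements(true_values, 0.10):
    print(f"measured = {measured:.3f}, sigma = {sigma:.3f}")
```

Keeping σ alongside each measured value allows every measurement to be treated as a Gaussian with its own mean and standard deviation when the GDGM is used, while the Euclidean case only uses the measured value.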

Table 2. Classification using the Euclidean distance to calculate the nearest neighbour. The first column reports the accuracy (Acc.), the second the credibility (Cred.) and the third the confidence (Conf.). The following columns report the same quantities for different levels of noise. The top of the table reports the average values for all the 50 points.

Table 3. Classification using the Geodesic Distance on Gaussian Manifolds to calculate the nearest neighbour. The first column reports the accuracy (Acc.), the second the credibility (Cred.) and the third the confidence (Conf.). The following columns report the same quantities for different levels of noise. The top of the table reports the average values for all the 50 points.

The results reported in Tables 2 and 3 indicate that the GDGM provides a clear improvement in the success rate of the classification. Table 2 shows how the performance of conformal predictors degrades with increasing levels of noise. It is also important to notice how the indicators of the quality of the prediction, confidence and credibility, tend to overestimate the reliability of the classification when a significant level of noise is present. Table 3 reports the clear improvement in both the performance and the reliability of the quality indicators when the Euclidean distance is replaced with the GDGM. Another important consideration is that adopting the GDGM metric does not cause any degradation of performance when the data are not affected by noise.

6 Conclusions

In many applications of conformal predictors, the Euclidean distance is explicitly or implicitly adopted as the right metric. In the case of experimental measurements typical of the physical sciences, the data are affected by normally distributed noise. In this situation, the GDGM proves to be a better metric for the definition of the nonconformity measure. Calculating the nonconformity measure and the p-values with the GDGM provides significantly more reliable classifications, by reducing the adverse effects of the noise. The reported results using the GDGM have been obtained on a desktop computer with two Xeon E5520 @2.27 GHz processors and 24 GB of RAM, and required an average of one minute for each test performed, for a total of 50 min for all 50 points. The computational cost is very similar to that required to perform the calculations with the Euclidean distance.

With regard to future developments, it would be important to apply the same approach to different pdfs. Particularly relevant would be the case of the Poisson distribution, since in practice many detectors work in photon-counting or particle-counting mode. Another very interesting application would be the case in which the pdf of the noise is not known. This situation is of practical relevance because in many experiments the uncertainties in the measurements can be quantified with an interval but without any additional specification: the real value is expected to fall in a certain interval, but no further information is available. In this case, the implementation of an appropriate form of uncertain probability is expected to produce improvements in the classification of conformal predictors comparable to those obtained with the GDGM for measurements affected by Gaussian noise.

In terms of practical applications, the mathematics of conformal predictors can be applied to most classifiers, including fuzzy ones [6]. The approach can therefore be of great help in all cases, such as disruptions in tokamaks, where classification is particularly difficult, partly because of the uncertainties in the measurements [7, 8].