1 Introduction

1.1 Context

Conformal predictions (CPs) (Shafer and Vovk 2008) are gaining increasing importance in machine learning (ML) since they validate algorithms in terms of the confidence of the prediction. Although it is a fairly recent field of study, there has been an astonishing production of scholarly papers, from the definition of new score functions to different methodologies for constructing conformal sets and, of course, a wide variety of applications. In fact, scientific research in this field is so active that even the father of this theory, V. Vovk, continues to contribute to its development, as in Vovk et al. (2017), where he and his colleagues investigate the concept of validity under nonparametric hypotheses, or in the introduction of Venn predictors in Vovk et al. (2022). We refer the reader to the surveys (Angelopoulos and Bates 2023; Fontana et al. 2023; Toccaceli 2022), which largely cover recent publications and discussions on uncertainty quantification (UQ) through CP for machine learning models.

Under canonical CP theory, the definition of a score function is highly specific to either the classifier or the application at hand. For example, Forreryd et al. (2018) defines a special conformity measure (corresponding to a score function) based on the residual between the calibration points and the classification hyperplane of an SVM model. Other examples, also SVM-based, can be found in Forreryd et al. (2018), Shafer and Vovk (2008) and Balasubramanian et al. (2009), where different definitions of score function (or conformity/non-conformity measure) are given. One of the strengths of our approach, as will become clear later, is the unique definition of such a score function, which, given any classifier, allows the conformal prediction framework to be applied in the most natural way. The work in Narteni et al. (2023) defines a score function for rule-based models. The softmax function is used as the score function in most image classification problems, as in Angelopoulos et al. (2020); Park et al. (2019); Andéol et al. (2024), and many other examples can be provided (see, e.g., the above cited surveys). As a matter of fact, those definitions come after the setting of the classifier and do not outline a common methodology.

1.2 Contribution

By focusing on binary classification, our goal is to introduce to the CP community a way to link ML classifiers with a natural definition of score function that embeds the conformal guarantee by construction.

We exploit the concept of scalable classifiers \(f_{\varvec{\theta }}(\varvec{x},\rho )\) (Sect. 2.1) introduced in Carlevaro et al. (2023) to develop a new class of score functions that rely on the geometry of the problem and that are naturally built from the classifier itself, inheriting its properties (Sect. 3.1). This allows CP theory to derive the relationship between the input space and the conformity guarantee explicitly. By introducing the new concept of conformal safety region, we provide an analytical form of the specific subsets of the input space in which marginal coverage guarantees on the prediction (Sect. 3.2) can be ensured. Controlling the misclassification rate (either false positives or false negatives) then follows naturally from a few quantities: the confidence level given by the conformal framework, the binary output \(y\in \{+1,-1\}\), the confidence error \(\varepsilon \in (0,1)\), and the new notion of conformal safety set \(\mathcal {S}_\varepsilon\) that satisfies

$$\begin{aligned}\Pr \{y = -1 \ \text {and} \ \varvec{x}\in \mathcal {S}_\varepsilon \}\le \varepsilon .\end{aligned}$$

In short, the paper defines a methodology in which the optimal shape of a classifier is derived, where the optimality criterion is embedded in the classifier by the conformal guarantee. The proposed methodology thus places itself in the recent and still largely unexplored field of set-valued classification (Chzhen et al. 2021), a broad theory that studies predictors that have both good prediction properties and specific performance requirements, two points that underlie the proposed research.

The remainder of the article is organized as follows: we first provide a brief recall of the concepts of scalable classifiers and conformal prediction, and then delve into the details of the definition of the scalable score function and of the conformal safety region. The whole procedure is then validated on an application use case related to cyber-security, namely the identification of DNS tunneling attacks (Sect. 4).

2 Background: scalable classifiers and conformal prediction

The background of the theory proposed in this research consists of two ingredients: scalable classifiers, a new interpretation of classical classification algorithms, and conformal prediction, a rather recent theory on trustworthy AI. Both techniques belong to the field of reliable AI, which seeks models, procedures or bounds that can make a learning algorithm probabilistically robust and reliable.

2.1 Scalable classifiers

Given an input space \(\mathcal {X}\subseteq \mathbbm {R}^d\), \(d\in \mathbbm {N}^+\), and an output space \(\mathcal {Y} = \{-1,+1\}\), scalable classifiers (SCs) were introduced in Carlevaro et al. (2023) as a family of (binary) classifiers parameterized by a scale factor \(\rho \in \mathbbm {R}\)

$$\begin{aligned} \phi _{\varvec{\theta }}(\varvec{x},\rho ) \doteq {\left\{ \begin{array}{ll} +1 \quad \quad \text {if } \, f_{\varvec{\theta }}(\varvec{x},\rho ) < 0, \\ -1 \quad \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)

where the function \(f_{\varvec{\theta }}: \mathcal {X}\times \mathbbm {R}\longrightarrow \mathbbm {R}\) is the so-called classifier predictor and the subscript \(\varvec{\theta }\) indicates that the classifier also depends on a set of hyperparameters \(\varvec{\theta }=[\varvec{\theta }_1,\cdots ,\varvec{\theta }_{n_{\varvec{\theta }}}]^\top\) to be set in the model (e.g. different choices of kernel, regularization parameters, etc.). To give a meaningful interpretation of this classifier, we refer to the class \(+1\) as a “safe” situation we want to target and to the class \(-1\) as an “unsafe” situation. Examples might be distinguishing whether a patient will or will not develop a certain disease (Lenatti et al. 2022), or understanding which input parameters lead an autonomous car to a collision or non-collision (Carlevaro et al. 2022), among many other applications.

SCs rely on the main assumption that for every \(\varvec{x}\in \mathcal {X}\), \(f_{\varvec{\theta }}(\varvec{x},\rho )\) is continuous and monotonically increasing in \(\rho\), and that \(\lim \limits _{\rho \rightarrow -\infty }f_{\varvec{\theta }}(\varvec{x},\rho )<0<\lim \limits _{\rho \rightarrow \infty } f_{\varvec{\theta }}(\varvec{x},\rho )\) [Carlevaro et al. (2023) Assumption 1]. These assumptions imply that there exists a unique solution \(\bar{\rho }(\varvec{x})\) to the equation

$$\begin{aligned} f_{\varvec{\theta }}(\varvec{x},{\rho })=0. \end{aligned}$$
(2)

The proof of this claim is available in [Carlevaro et al. (2023) Property 2]. In words, a scalable classifier is a classifier that satisfies some crucial properties: i) given \(\varvec{x}\), there is always a value of \(\rho\), denoted \(\bar{\rho }(\varvec{x})\), that establishes the border between the two classes, ii) increasing \(\rho\) forces the classifier to predict the \(-1\) class and iii) the target \(+1\) class of a given feature vector \(\varvec{x}\) is maintained when \(\rho\) decreases. Moreover, [Carlevaro et al. (2023) Property 3] shows how any standard binary classifier can be made scalable by simply including the scaling parameter \(\rho\) in an additive way with the classifier predictor. That is, given the function \(\hat{f}:\mathcal {X}\longrightarrow \mathbbm {R}\) and its corresponding classifier \(\hat{\phi }(\varvec{x})\), the function \(f_{\varvec{\theta }}(\varvec{x},\rho ) = \hat{f}(\varvec{x}) + \rho\) provides the scalable classifier \(\phi _{\varvec{\theta }}(\varvec{x},\rho )\). Thus, examples of classifiers that can be rendered scalable are the support vector machine (SVM), the support vector data description (SVDD) and logistic regression (LR), but also artificial neural networks. In more detail, given a learning set

$$\begin{aligned} \mathcal {Z}_\ell \doteq \left\{ \left( \varvec{x}_i,y_i\right) \right\} _{i=1}^n \subseteq \mathcal {X}\times \left\{ -1,+1\right\} \end{aligned}$$

containing observed feature points and corresponding labels, \(\varvec{z}_i=\left( \varvec{x}_i,y_i\right)\), and assuming that \(\varphi :\mathcal {X}\longrightarrow \mathcal {V}\) represents a feature map (where \(\mathcal {V}\) is an inner product space) that allows kernels to be exploited, some examples of scalable classifier predictors are:

  • SVM: \(f_{\varvec{\theta }}(\varvec{x},\rho ) = \varvec{w}^\top \varphi (\varvec{x}) - b + \rho\),

  • SVDD: \(f_{\varvec{\theta }}(\varvec{x},\rho ) = \left\Vert \varphi (\varvec{x})-\varvec{w}\right\Vert ^2 - R^2 + \rho\),

  • LR: \(f_{\varvec{\theta }}(\varvec{x},\rho ) = \dfrac{1}{2}-\dfrac{1}{1+e^{\left( \varvec{w}^\top \varphi (\varvec{x})-b\right) + \rho }}\),

where the classifier elements \(\varvec{w},b\) and R can be obtained as the solution of suitable optimization problems. The interested reader can refer to [Carlevaro et al. (2023) Section II c] for a more in-depth discussion.
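To fix ideas, the following minimal sketch shows how a standard binary classifier can be rendered scalable through the additive construction described above. It is only an illustration, not the code of Carlevaro et al. (2023): it assumes a scikit-learn SVC is available and uses the negated decision function as \(\hat{f}\), so that the convention of (1) (class \(+1\) when \(f_{\varvec{\theta }}<0\)) is respected; all names and data are illustrative.

```python
# Sketch of an additively scalable classifier f_theta(x, rho) = f_hat(x) + rho,
# in the spirit of [Carlevaro et al. (2023) Property 3]. Illustrative only.
import numpy as np
from sklearn.svm import SVC

class ScalableClassifier:
    def __init__(self, base):
        self.base = base  # any fitted estimator exposing decision_function

    def predictor(self, X, rho=0.0):
        # f_hat is minus the SVM margin, so that f_theta < 0 <=> predicted class +1
        return -self.base.decision_function(X) + rho

    def predict(self, X, rho=0.0):
        # phi_theta(x, rho) = +1 if f_theta(x, rho) < 0, -1 otherwise (Eq. (1))
        return np.where(self.predictor(X, rho) < 0, +1, -1)

# toy data: "safe" class +1 around (-1, -1), "unsafe" class -1 around (+1, +1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.7, (200, 2)), rng.normal(+1, 0.7, (200, 2))])
y = np.hstack([np.ones(200), -np.ones(200)])

sc = ScalableClassifier(SVC(kernel="linear").fit(X, y))
for rho in (-1.0, 0.0, 1.0):
    # increasing rho shrinks the region predicted as +1 (nested safety regions)
    print(rho, np.mean(sc.predict(X, rho) == 1))
```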

Different values of the parameter \(\rho\) correspond to different classifiers, which can be regarded as level sets of the classifier predictor with respect to \(\rho\). In particular, since we are interested in predicting the class \(+1\), which, we recall, encodes a safe condition, we introduce

$$\begin{aligned} \mathcal {S}(\rho ) = \{ \; \varvec{x}\in \mathcal {X} \;:\;f_{\varvec{\theta }}(\varvec{x},\rho )<0\;\}, \end{aligned}$$
(3)

that is the set of points \(\varvec{x}\in \mathcal {X}\) predicted as safe by the classifier with the specific choice of \(\rho\), i.e. the safety region of the classifier \(f_{\varvec{\theta }}\) for given \(\rho\). It is easy to see that these sets are decreasingly nested with respect to \(\rho\), i.e.

$$\begin{aligned} \rho _1 > \rho _2 \Longrightarrow \mathcal {S}(\rho _1) \subset \mathcal {S}(\rho _2). \end{aligned}$$

2.2 Conformal prediction

Conformal Prediction is a relatively recent framework developed in the late nineties by V. Vovk. We refer the reader to the surveys (Angelopoulos and Bates 2023; Shafer and Vovk 2008; Fontana et al. 2023) for a gentle introduction to this methodology. CP is mainly an a-posteriori verification of the designed classifier, and in practice returns a measure of its “conformity” to the calibration data. We consider the particular implementation of CP discussed in Angelopoulos and Bates (2023), namely the so-called “inductive” CP: in this setting, starting from a given predictor and a calibration set, CP allows the construction of a new predictor with given probabilistic guarantees.

To this end, the first key step is the definition of a score function \(s:\mathcal {X}\times \mathcal {Y} \longrightarrow \mathbbm {R}\). Given a point \(\varvec{x}\in \mathcal {X}\) and a candidate label \(\hat{y}\in \{-1,1\}\), the score function returns a score \(s(\varvec{x},\hat{y})\). Larger scores encode worse agreement between the point \(\varvec{x}\) and the candidate label \(\hat{y}\). Then, assume that a second set of \(n_c\) observations, usually referred to as the calibration set, is available, defined as follows

$$\begin{aligned} \mathcal {Z}_c \doteq \left\{ (\varvec{x}_i,y_i)\right\} _{i=1}^{n_c} = \mathcal {X}_c\times \mathcal {Y}_c{\subseteq \mathcal {X}\times \mathcal {Y}}, \end{aligned}$$
(4)

that is, pairs of points \(\varvec{x}\) with their corresponding true labels y.

We assume that the observations \(\varvec{x}_i\in \mathcal {X}_c\) come from the same distribution \(\Pr\) as the observations in the test set \(\mathcal {Z}_{ts} = \{(\varvec{x}_{i},y_i)\}_{i=1}^{n_{ts}} = \mathcal {X}_{ts}\times \mathcal {Y}_{ts}{\subseteq \mathcal {X}\times \mathcal {Y}}\). Additionally, CP requires that the data are exchangeable, which is a weaker assumption than i.i.d. sampling. Exchangeability means that the joint distribution of the data \(\varvec{z}_1, \varvec{z}_2, \dots , \varvec{z}_n\) is unchanged under permutations:

$$\begin{aligned}(\varvec{z}_1,\varvec{z}_2,\dots ,\varvec{z}_n) \sim (\varvec{z}_{\sigma (1)},\varvec{z}_{\sigma (2)}, \dots , \varvec{z}_{\sigma (n)}), \ \text {for all permutations } \sigma . \end{aligned}$$

Then, given a user-chosen confidence rate \((1-\varepsilon )\in (0,1)\), a conformal set \(C_\varepsilon (\varvec{x})\) is defined as the set of candidate labels whose score does not exceed the \((\lceil (n_c+1)(1-\varepsilon )\rceil /n_c)\)-quantile, denoted as \(s_\varepsilon\), computed on the calibration scores \(s_1, \dots , s_{n_c}\). That is, to every point \(\varvec{x}\), CP associates a set of “plausible labels”

$$\begin{aligned} C_\varepsilon (\varvec{x}) = \{ \; \hat{y}\in \{-1,1\} \;:\;s(\varvec{x},\hat{y}) \le s_\varepsilon \;\}. \end{aligned}$$

The usefulness of the conformal set is that, according to Vovk et al. (1999), \(C_\varepsilon (\varvec{x})\) possesses the so-called marginal conformal coverage guarantee property, that is, given any (previously unseen) observation \((\tilde{\varvec{x}},\tilde{y})\), the following holds

$$\begin{aligned} \Pr \left\{ \tilde{y}\in C_\varepsilon (\tilde{\varvec{x}})\right\} \ge 1-\varepsilon . \end{aligned}$$
(5)

In other words, the true label \(\tilde{y}\) belongs to the conformal set with probability at least \(1-\varepsilon\).
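As an illustration of the recipe just described, the following minimal sketch implements inductive CP for binary labels. It assumes that a generic score function \(s(\varvec{x},\hat{y})\) is available (larger scores meaning worse agreement); it is only a sketch under that assumption, not a reference implementation, and the toy score used in the usage example is purely illustrative.

```python
# Minimal sketch of inductive (split) conformal prediction for binary labels.
import numpy as np

def conformal_quantile(cal_scores, eps):
    # the (ceil((n_c + 1)(1 - eps)) / n_c)-quantile of the calibration scores
    n_c = len(cal_scores)
    level = min(np.ceil((n_c + 1) * (1 - eps)) / n_c, 1.0)
    return np.quantile(cal_scores, level, method="higher")

def conformal_set(x, s, s_eps):
    # C_eps(x): candidate labels whose score does not exceed the quantile
    return {y_hat for y_hat in (-1, +1) if s(x, y_hat) <= s_eps}

# usage with a toy scalar feature and score s(x, y_hat) = -y_hat * x
rng = np.random.default_rng(1)
x_cal = rng.normal(size=500)
y_cal = np.where(x_cal + rng.normal(scale=0.5, size=500) > 0, +1, -1)
s = lambda x, y_hat: -y_hat * x
s_eps = conformal_quantile(s(x_cal, y_cal), eps=0.1)
print(conformal_set(0.3, s, s_eps))   # set of plausible labels for x = 0.3
```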

3 Notion of score function for scalable classifiers and conformal safety sets

In this section we introduce two concepts: i) a definition of score function for scalable classifiers (see Definition 1) and ii) the notion of conformal safety region (see Definition 2).

3.1 Natural definition of score function for scalable classifiers

In this paragraph, we show how scalable classifiers allow for a natural definition of the score function, based on their own classifier predictor.

Definition 1

[Score Function for Scalable Classifier] Given a scalable classifier \(\phi _{\varvec{\theta }}(\varvec{x},\rho )\) with classifier predictor \(f_{\varvec{\theta }}(\varvec{x},\rho )\), and given a point \(\varvec{x}\) with an associated candidate label \(\hat{y}\), the score function associated with the scalable classifier is defined as

$$\begin{aligned}s(\varvec{x},\hat{y}) = -\hat{y}\bar{\rho }(\varvec{x})\end{aligned}$$

with \(\bar{\rho }(\varvec{x})\) such that \(f_{\varvec{\theta }}(\varvec{x},\bar{\rho }(\varvec{x}))=0\).

We notice that, since \(f_{\varvec{\theta }}\) is a SC predictor, the existence and uniqueness of such a \(\bar{\rho }(\varvec{x})\) are guaranteed (Sect. 2.1), and consequently s is well defined.

In practice, the score function evaluates how much the original classification boundary \(f_{\varvec{\theta }}(\varvec{x},0)\) must be varied so that the point \(\varvec{x}\) falls on the classification boundary of the new classifier \(f_{\varvec{\theta }}(\varvec{x},\bar{\rho }(\varvec{x}))\), starting from class \(\hat{y}\). Alternatively, the score function can be thought of as a measure of the “difficulty” of making the classifier predict a certain class: very large values of \({\bar{\rho }(\varvec{x})}\) imply that it is difficult to render \(f_{\varvec{\theta }}(\varvec{x},\rho )\) positive, or equivalently that the class \(-1\) is not conformal (thus, when \(\hat{y}=-1\), the score function is \({\bar{\rho }(\varvec{x})}=-\hat{y}{\bar{\rho }(\varvec{x})}\)). Very negative values of \({\bar{\rho }(\varvec{x})}\) imply that it is difficult to render the output equal to \(+1\); thus, when \(\hat{y}=+1\), the score function is \(-{\bar{\rho }(\varvec{x})} = -\hat{y}{\bar{\rho }(\varvec{x})}\).
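A minimal sketch of Definition 1 is reported below. For a generic scalable predictor, \(\bar{\rho }(\varvec{x})\) can be obtained by one-dimensional root finding (its existence and uniqueness follow from the SC assumptions of Sect. 2.1), while for the additive construction it is simply \(-\hat{f}(\varvec{x})\). The SVDD predictor used here is the one listed in Sect. 2.1 with a linear feature map; all names and numerical values are illustrative, not taken from the paper.

```python
# Sketch of the score function of Definition 1: s(x, y_hat) = -y_hat * rho_bar(x),
# with rho_bar(x) the unique root of rho -> f(x, rho). Illustrative only.
import numpy as np
from scipy.optimize import brentq

def rho_bar(f, x, lo=-1e3, hi=1e3):
    # unique root of f(x, .), guaranteed by continuity and monotonicity in rho
    return brentq(lambda rho: f(x, rho), lo, hi)

def score(f, x, y_hat):
    return -y_hat * rho_bar(f, x)

# example: scalable SVDD predictor f(x, rho) = ||x - w||^2 - R^2 + rho
w, R = np.zeros(2), 1.0
f_svdd = lambda x, rho: np.sum((np.asarray(x) - w) ** 2) - R**2 + rho

print(score(f_svdd, [0.5, 0.0], +1))  # point inside the sphere: negative score for +1
print(score(f_svdd, [2.0, 0.0], -1))  # point outside the sphere: negative score for -1
```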

Example 1

Scalable SVDD provides the most straightforward example for understanding this definition of the score function. In this case the score function takes the form

$$\begin{aligned}s(\varvec{x},\hat{y}) = -\hat{y}\left( R^2-\left\Vert \varvec{w}-\varphi (\varvec{x})\right\Vert ^2\right) .\end{aligned}$$

This represents exactly the quantity that needs to be removed from (\(\hat{y}=+1\), point inside the sphere, \(\left\Vert \varvec{w}-\varphi (\varvec{x})\right\Vert ^2-R^2 < 0\)) or added to (\(\hat{y}=-1\), point outside the sphere, \(\left\Vert \varvec{w}-\varphi (\varvec{x})\right\Vert ^2-R^2 > 0\)) the squared radius such that \(\varvec{x}\) falls on the boundary of the classifier.

For example, consider two classes of points, “safe” (\(+1\), in blue in the following figure) and “unsafe” (\(-1\), in red in the following figure), sampled from two two-dimensional Gaussian distributions with means and covariance matrices, respectively,

$$\begin{aligned}\mu _\textrm{S} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \, \sigma _\textrm{S} = \frac{1}{2}\textrm{I} \, \,; \, \, \mu _\textrm{U} = \begin{bmatrix} +1 \\ +1 \end{bmatrix}, \, \sigma _\textrm{U} = \frac{1}{2}\textrm{I} \end{aligned}$$

where \(\textrm{I}\) is the identity matrix. We trained a linear SVDD classifier (Fig. 1a) and plotted the corresponding score function (Fig. 1b). Exactly the behavior described above can be observed: the score function assigns values according to the geometry provided by the classifier. In this case, points on the boundary of the circle have a score value of 0 (dashed green line), while the score is negative or positive depending on whether the point lies inside or outside the circle. It is worth noting that the classifier can be interpreted as a level set of the score function, and this interpretation is crucial, as will become clear in the following.

Fig. 1

Relationship between the SVDD classifier and the corresponding score function: the absolute value of the score function assigns to a sample its distance to the circle boundary. The color bar on the right helps to understand the behavior of the score function: darker colors indicate regions with less conformity with the target class, warmer colors the opposite. The zero value of the score function is obtained exactly on the boundary

3.2 Conformal safety regions

Classical CPs define subsets of the output space that satisfy the probabilistic marginal coverage constraint, but it is equally important to understand the relationship between the input space and the conformal sets. In other words, it is meaningful to define regions of the input space, classified on the basis of the conformal sets of their samples, so as to identify for which inputs the classifier is most reliable in making a certain prediction. For example, one might be interested in finding the region of classification uncertainty (\(C_\varepsilon (\varvec{x})=\{-1,+1\}\)), the region in which the conformal classifier predicts a specific label (\(C_\varepsilon (\varvec{x})=\{+1\}\) or \(C_\varepsilon (\varvec{x})=\{-1\}\)), or the region in which it has no guess at all (\(C_\varepsilon (\varvec{x}) = \varnothing\)).

In particular, since the goal is to find the input values that bring the classification to a “safe” situation (i.e., in our notation, \(y=+1\)) with a certain level of confidence, we introduce the concept of conformal safety region.

Definition 2

[Conformal Safety Region] Consider a calibration set \(\mathcal {Z}_{c} = \{(\varvec{x}_i,y_i)\}_{i=1}^{n_c}\) from the same data distribution as the test set \(\mathcal {Z}_{ts}\). Given a level of error \(\varepsilon \in (0,1)\), a score function \(s:\mathcal {X}\times \mathcal {Y}\longrightarrow \mathbbm {R}\), and its corresponding \((\lceil (n_c+1)(1-\varepsilon )\rceil /n_c)\)-quantile \(s_\varepsilon\) computed on the calibration set, the conformal safety region (CSR) of level \(\varepsilon\) is defined as follows

$$\begin{aligned} \begin{aligned} \Sigma _\varepsilon = \{ \; \varvec{x}\in \mathcal {X} \;:\; s(\varvec{x},+1)\le s_\varepsilon , \ s(\varvec{x},-1)>s_\varepsilon \;\}. \end{aligned} \end{aligned}$$
(6)

In words, a conformal safety region (CSR) is the subset of the input space where the conformal set contains only the safe label, \(C_\varepsilon (\varvec{x})=\{+1\}\), as can be inferred directly from the definition. Note that the above definition is independent of the choice of the score function s. What we will prove in what follows is that, using the score function defined for SCs (Definition 1), it is possible to give an analytical form to \(\Sigma _\varepsilon\).

Example 2

Consider the same configuration as in Example 1, but with covariance matrices \(\sigma _S = \sigma _U = I\) and with a probability \(p_O = 0.1\) of sampling an outlier for each class. Consider the LR classifier and its corresponding score function

$$\begin{aligned}s(\varvec{x},\hat{y}) = -\hat{y}(b-\varvec{w}^\top \varphi (\varvec{x})){,}\end{aligned}$$

which is the same as for the SVM, since the solution of the equation \(f_{\varvec{\theta }}(\varvec{x},\rho )=0\) is in both cases \(\bar{\rho }(\varvec{x}) = b-\varvec{w}^\top \varphi (\varvec{x})\).

We trained the LR classifier with a cubic polynomial kernel on a training set of 3000 samples (Fig. 2a) and then computed the score values on a calibration set of 5000 samples. We computed the quantiles for varying \(\varepsilon\) (0.05, 0.1 and 0.5) and plotted (on a test set of 10000 samples) the scatter of the points according to their conformal set. Green points belong to the CSR \(\Sigma _\varepsilon\), and it is readily seen that the smaller \(\varepsilon\), the smaller \(\Sigma _\varepsilon\). This behavior is in line with CP theory, since small values of \(\varepsilon\) mean that the conformal prediction must be very precise, and this is achievable only if the classifier itself is “very confident” in assigning the true label to a sample. Also, it should be noted that the smaller \(\varepsilon\), the larger the region of uncertainty of the conformal prediction (\(C_\varepsilon (\varvec{x}) =\{-1,+1\}\), in yellow in Fig. 2a, b). Again, since for small \(\varepsilon\) high levels of marginal coverage must be satisfied, conformal prediction tends to assign both labels to a point when it is uncertain. Conversely, for high values of \(\varepsilon\) (Fig. 2c) the conformal sets of uncertain points tend to be empty (in black), because the score is too high and no label meets the requirement for belonging to \(C_\varepsilon\). Finally, it is worth noting that the regions into which the points scatter have a well-defined shape: as introduced in Example 1, and as will become clear in the next section, these regions correspond to level sets of the score function.

Fig. 2

Scatter plots of the conformal sets for varying \(\varepsilon\) for cubic LR. Green and red points correspond to singleton conformal sets (\(C_\varepsilon (\varvec{x})=\{+1\}\) and \(C_\varepsilon (\varvec{x})=\{-1\}\), respectively), yellow points to double predictions (\(C_\varepsilon (\varvec{x})=\{+1,-1\}\)) and black points to empty predictions (\(C_\varepsilon (\varvec{x})=\varnothing\))

3.3 Analytical form of conformal safety regions for scalable classifiers

The definition of score given for SCs in Definition 1 identifies a particular value of the scaling parameter, namely the one corresponding to the quantile \(s_\varepsilon\), which we can define formally as

$$\begin{aligned} \rho _\varepsilon \doteq |s_\varepsilon |. \end{aligned}$$
(7)

To this value, we can associate a level set \(\mathcal {S}(\rho _\varepsilon )\) defined as in (3), i.e. the \(\rho _\varepsilon\)-safe set

$$\begin{aligned} \mathcal {S}_\varepsilon = \{ \; \varvec{x}\in \mathcal {X} \;:\;f_{\varvec{\theta }}(\varvec{x},\rho _\varepsilon )<0\;\}. \end{aligned}$$
(8)

We can prove that non-trivial relationships link \(\mathcal {S}_\varepsilon\) to the CSR \(\Sigma _\varepsilon\). Before doing so, let us split \(\Sigma _\varepsilon\) into two contributions:

$$\begin{aligned} \Sigma _\varepsilon = \Sigma _\varepsilon ^a \cup \Sigma _\varepsilon ^b, \end{aligned}$$
(9)

where

$$\begin{aligned} \Sigma _\varepsilon ^a=\{ \; \varvec{x}\in \mathcal {X} \;:\;s(\varvec{x},+1)<s_\varepsilon , s(\varvec{x},-1)>s_\varepsilon \;\}, \end{aligned}$$
(10)

and

$$\begin{aligned} \Sigma _\varepsilon ^b=\{ \; \varvec{x}\in \mathcal {X} \;:\;s(\varvec{x},+1)=s_\varepsilon , s(\varvec{x},-1)>s_\varepsilon \;\}. \end{aligned}$$
(11)

The relationship between \(\mathcal {S}_\varepsilon\) and \(\Sigma _\varepsilon\) is explored in the following results, whose final and major outcome is the fact that \(\mathcal {S}_\varepsilon \subseteq \Sigma _\varepsilon\).

Proposition 1

$$\begin{aligned}\mathcal {S}_\varepsilon = \Sigma _\varepsilon ^a \subseteq \Sigma _\varepsilon .\end{aligned}$$

Proof

$$\begin{aligned} \varvec{x}\in \mathcal {S}_\varepsilon&\iff f_{\varvec{\theta }}(\varvec{x},|s_\varepsilon |)<0,\\ &\iff f_{\varvec{\theta }}(\varvec{x},|s_\varepsilon |)<f_{\varvec{\theta }}(\varvec{x},\bar{\rho }(\varvec{x})),\\ &\iff |s_\varepsilon |<\bar{\rho }(\varvec{x}),\\ &\iff -s_\varepsilon<\bar{\rho }(\varvec{x}) \ \text {and} \ s_\varepsilon < \bar{\rho }(\varvec{x}),\\ &\iff -s_\varepsilon<-s(\varvec{x},+1) \ \text {and} \ s_\varepsilon <s(\varvec{x},-1),\\ &\iff s(\varvec{x},+1)<s_\varepsilon \ \text {and} \ s(\varvec{x},-1)>s_\varepsilon ,\\ &\iff \varvec{x}\in \Sigma _\varepsilon ^a\subseteq \Sigma _\varepsilon . \end{aligned}$$

\(\square\)

Corollary 1

$$\begin{aligned} \mathcal {S}_\varepsilon = \Sigma _\varepsilon \ \text {if and only if} \ \Sigma _\varepsilon ^b = \varnothing . \end{aligned}$$

Proof

Trivial: since \(\mathcal {S}_\varepsilon = \Sigma _\varepsilon ^a\) and \(\Sigma _\varepsilon ^a\cap \Sigma _\varepsilon ^b=\varnothing\) by construction, the claim follows from

$$\begin{aligned}\Sigma _\varepsilon = \Sigma _\varepsilon ^a\cup \Sigma _\varepsilon ^b = \mathcal {S}_\varepsilon \cup \Sigma _\varepsilon ^b.\end{aligned}$$

\(\square\)

Proposition 2

$$\begin{aligned}\Sigma _\varepsilon ^b\ne \varnothing \Longrightarrow s_\varepsilon <0.\end{aligned}$$

Proof

$$\begin{aligned} \varvec{x}\in \Sigma _\varepsilon ^b&\iff s(\varvec{x},+1) = s_\varepsilon \ \text {and} \ s(\varvec{x},-1)>s_\varepsilon ,\\ &\iff -\bar{\rho }(\varvec{x})=s_\varepsilon \ \text {and} \ \bar{\rho }(\varvec{x})>s_\varepsilon ,\\ &\Longrightarrow -s_\varepsilon >s_\varepsilon ,\\ &\iff s_\varepsilon <0. \end{aligned}$$

Hence, if \(\Sigma _\varepsilon ^b\ne \varnothing\), then \(s_\varepsilon <0\).

\(\square\)

We can then summarize all this information in a single theorem that defines the “analytical form” of the CSR, i.e. that shows that it is possible to express \(\Sigma _\varepsilon\) in terms of a single scalar parameter.

Theorem 3

[Analytical Representation of the Conformal Safety Region via Scalable Classifiers] Consider the classifier (1) and suppose that [Carlevaro et al. (2023) Assumption 1] holds and that \(\Pr \{\varvec{x}\in \mathcal {X}\} = 1\). Consider then a calibration set \(\mathcal {Z}_c = \left\{ (\varvec{x}_i,y_i)\right\} _{i=1}^{n_c}\) (\(n_c\) exchangeable samples), a level of error \(\varepsilon \in (0,1)\), a score function \(s:\mathcal {X}\times \mathcal {Y}\longrightarrow \mathbbm {R}\) as in Definition 1 with \(\lceil (n_c+1)(1-\varepsilon )\rceil /n_c\)-quantile \(s_\varepsilon\) computed on the calibration set. Define the conformal scaling of level \(\varepsilon\) as follows

$$\begin{aligned} \rho _\varepsilon = |s_\varepsilon |, \end{aligned}$$
(16)

and define the corresponding \(\rho _\varepsilon\)-safe set

$$\begin{aligned} \mathcal {S}_\varepsilon = \{ \; \varvec{x}\in \mathcal {X} \;:\;f_{\varvec{\theta }}(\varvec{x},\rho _\varepsilon )<0\;\}. \end{aligned}$$
(17)

Then, given the conformal safety region of level \(\varepsilon\), \(\Sigma _\varepsilon\), we have

  i) \(\mathcal {S}_\varepsilon \subseteq \Sigma _\varepsilon\);

  ii) \(\mathcal {S}_\varepsilon = \Sigma _\varepsilon\) if \(s_\varepsilon \ge 0\);

that is, \(\mathcal {S}_\varepsilon\) is a CSR.

Proof

The proof follows directly from Propositions 1 and 2 and Corollary 1. \(\square\)

Fig. 3

CSR computed with a Gaussian SVM at \(\varepsilon = 0.05\). The scattered CSR \(\Sigma _\varepsilon\) (a) coincides with the analytical CSR \(\mathcal {S}_\varepsilon\) (b), which in turn coincides with a level set of the score function (c). Panel (d) is the planar representation of the score function on the \(x_1\)–\(x_2\) plane

In its classical definition, conformal prediction is a local property, that is, the conformal coverage guarantee is valid only pointwise. However, conformal labels map each point into a subset of the input space, depending on the composition of the respective conformal set. Theorem 3 then provides a new classifier that maps the samples contained in \(\mathcal {S}_\varepsilon\) to the target class \(+1\). Once \(\rho _\varepsilon\) has been computed, it is then possible to write

$$\begin{aligned} {\mathcal {S}_\varepsilon = \phi _{\varvec{\theta }}(\cdot ,\rho _\varepsilon )^{-1}(y=+1)}, \end{aligned}$$

identifying a unique relationship between the target class of the classification and the CSR.
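In the additive case \(f_{\varvec{\theta }}(\varvec{x},\rho ) = \hat{f}(\varvec{x}) + \rho\), the construction of Theorem 3 reduces to a few operations: compute the calibration scores, take their quantile, set \(\rho _\varepsilon = |s_\varepsilon |\) and test \(\hat{f}(\varvec{x}) + \rho _\varepsilon < 0\). The sketch below illustrates this under that assumption; it is not the authors' implementation, and all names and data are illustrative.

```python
# Sketch of Theorem 3 for the additive construction f(x, rho) = f_hat(x) + rho,
# for which rho_bar(x) = -f_hat(x) and the calibration scores are y_i * f_hat(x_i).
import numpy as np

def conformal_scaling(f_hat_cal, y_cal, eps):
    scores = y_cal * f_hat_cal                      # s_i = -y_i * rho_bar(x_i)
    n_c = len(scores)
    level = min(np.ceil((n_c + 1) * (1 - eps)) / n_c, 1.0)
    s_eps = np.quantile(scores, level, method="higher")
    return abs(s_eps)                               # rho_eps = |s_eps|, Eq. (16)

def in_csr(f_hat_test, rho_eps):
    # S_eps = {x : f_hat(x) + rho_eps < 0}, Eq. (17)
    return f_hat_test + rho_eps < 0

# illustrative usage with synthetic predictor values and labels
rng = np.random.default_rng(2)
f_hat_cal = rng.normal(size=1000)
y_cal = np.where(f_hat_cal + rng.normal(scale=0.8, size=1000) < 0, +1, -1)
rho_eps = conformal_scaling(f_hat_cal, y_cal, eps=0.1)
print(rho_eps, np.mean(in_csr(f_hat_cal, rho_eps)))
```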

Example 3

In the same configuration as in Example 2, we trained a Gaussian SVM and calculated the score values on the calibration set. Figure 3 shows exactly what Theorem 3 claims: CSRs are level sets of the score function that correspond to a specific quantile and thus to a specific confidence level. Specifically, in this example the CSR is shown at confidence level \(1-0.05 = 0.95\), which results in a quantile equal to 2.8113 and corresponding conformal scaling \(\rho _{0.05} = 2.8113\). The hyperplane \(\text {``score''}= -\rho _{0.05} = -2.8113\) exactly cuts the score function at the level set corresponding to the CSR.

Remark 1

[On the Usefulness of Conformal Safety Regions] The introduction of the concept of CSR inevitably raises the question of how this instrument can be useful in practice. First of all, it allows reliable prediction regions to be identified and uncertainty to be quantified: in decision-making problems where a certain amount of confidence in the prediction is required (as, for example, in medical applications), CSRs can suggest the set of input features that guides the prediction reliably, minimizing the presence of misclassified samples. Moreover, CSRs provide an interpretable way to understand the model’s behavior in different regions of the input space. This can be useful for model explanation and for possible improvements and corrections to the model. Finally, CSRs are very “regulatory compliant”: in applications with regulatory requirements, CSRs help ensure compliance by providing a clear understanding of where the model’s predictions are reliable.

In addition, CSRs can provide strong information about the prediction of the points belonging to them. Indeed, it can be proved that the rate of false positives is bounded by the error level \(\varepsilon\).

Theorem 4

Consider the classifier (1) and the corresponding CSR developed as in Theorem 3 with a level of error \(\varepsilon \in (0,1)\). Then, it can be stated that

$$\begin{aligned} \Pr \left\{ y=-1 \ \text {and} \ \varvec{x}\in \mathcal {S}_\varepsilon \right\} \le \varepsilon . \end{aligned}$$
(18)

Proof

Since \(\mathcal {S}_\varepsilon \subseteq \Sigma _\varepsilon\):

$$\begin{aligned} \begin{aligned} \Pr \{y = -1 \ \text {and} \ \varvec{x}\in \mathcal {S}_\varepsilon \}&\le \Pr \{y = -1 \ \text {and} \ \varvec{x}\in \Sigma _\varepsilon \}\\&= \Pr \{y = -1 \ \text {and} \ \varvec{x}\in \{\varvec{x}: C_\varepsilon (\varvec{x})=\{+1\}\} \}\\&\le \Pr \{y = -1 \ \text {and} \ \varvec{x}\in \{\varvec{x}: -1 \notin C_\varepsilon (\varvec{x})\} \}\\&= \Pr \{y = -1 \ \text {and} \ y\notin C_\varepsilon (\varvec{x}) \}\\&\le \Pr \{y \notin C_\varepsilon (\varvec{x}) \} \le \varepsilon , \end{aligned} \end{aligned}$$

where the last inequality holds by the marginal coverage property of CP (5). \(\square\)

The significance of this statement cannot be overstated, as it implies that, thanks to CSRs, it becomes feasible to identify regions in feature space where the conformal coverage of the target class is assured. Consequently, these regions identify feature points with a high degree of certainty, thereby enhancing the reliability, trustworthiness, and robustness of (any) classification algorithm, especially with regard to safety considerations. Specifically, the final output of the proposed method is a region, \(\mathcal {S}_\varepsilon\), in which with high probability the chance of finding the unwanted label is small (and as small as desired). This means that the scalable classifier together with the conformal prediction can handle the natural uncertainty arising both from the data (to the extent that the data are representative of the information they provide, i.e. aleatoric uncertainty) and from the model (to the extent that it is accurate in modeling, i.e. epistemic uncertainty), providing “safety” sets whose size depends on \(\varepsilon\), i.e., on the confidence of the prediction (Hüllermeier and Waegeman 2021). This is very much in line with recent and ongoing literature in the field of geometric uncertainty quantification, as in Sale et al. (2023), where the authors propose the idea of “credal sets” (Abellán et al. 2006) that, as our CSR does, guarantee the correctness of the prediction by bounding the input set in polytopes. In this regard, the idea of quantifying uncertainty through functions that give a measure of distance (such as the score function proposed here) is attracting growing interest in the UQ community, enabling future comparisons with other methods such as the “second order UQ” discussed in Sale et al. (2023).
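In practice, the bound of Theorem 4 can also be checked empirically on a held-out test set. A minimal sketch, again under the assumption of the additive construction of the scalable predictor, is the following (illustrative only).

```python
# Empirical check of Theorem 4: estimate Pr{ y = -1 and x in S_eps } on a test set,
# assuming the additive construction f(x, rho) = f_hat(x) + rho. Illustrative only.
import numpy as np

def false_positive_rate_in_csr(f_hat_test, y_test, rho_eps):
    in_s = f_hat_test + rho_eps < 0          # membership in S_eps, Eq. (17)
    return np.mean((np.asarray(y_test) == -1) & in_s)

# Theorem 4 predicts that this rate stays below eps, up to sampling fluctuations.
```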

Remark 2

[On the link with Probably Approximately Correct theory] Probably approximately correct (PAC) learning is a theory developed in the 1980s by Valiant (2013) for quantifying uncertainty in learning processes, with a focus on the case of undersampled data. PAC learning has been used to define sets of predictions that can satisfy probabilistic guarantees under nonparametric probabilistic assumptions (see, for example, Park et al. (2022)), with similarities to our approach (and to CP theory in general). Specifically, PAC learning is a broad theory within which the research presented in this paper on uncertainty quantification of machine learning classifiers with conformal prediction can be placed. For example, the confidence bounds on which conformal prediction theory is based (and so is our research) are inherited from PAC learning theory. As shown in [Vovk (2012) Prop 2a], the concept of \((\varepsilon ,\delta )\)-validity (i.e. the marginal coverage guarantee of equation (5) together with the randomness of the calibration set) is a PAC-style guarantee on (inductive) conformal prediction. As reported in our previous work (Carlevaro et al. 2023), there are nontrivial relationships between the number of samples in the calibration set and the probabilistic guarantees on the prediction. All these relationships can be read within the PAC learning formalism, and future work will focus on this topic.

In the next section we report some numerical examples of Theorem 4, see Fig. 6.

4 A real world application: detection of SSH-DNS tunnelling

The dataset chosen for the example application deals with covert channel detection in cybersecurity (Aiello et al. 2015). The aim is to detect the presence of secure shell domain name server (SSH-DNS) intruders through aggregation-based monitoring that avoids packet inspection, in the presence of silent intruders and quick statistical fingerprint generation. By modulating the quantity of anomalous packets in the server, we are able to modulate the difficulty of the inherent supervised learning problem addressed via canonical classification schemes (Carlevaro and Mongelli 2021; Vaccari et al. 2022).

Let q and a be the packet sizes of a query and the corresponding answer, respectively (the answer related to a specific query can be identified from the packet identifier), and let Dt be the time interval elapsing between them. The information vector is composed of the statistics (mean, variance, skewness and kurtosis) of q, a and Dt, for a total of 12 input features:

$$\begin{aligned}\varvec{x}=[m_A, m_{Q}, m_{Dt}, v_A, v_{Q}, v_{Dt}, s_A, s_{Q}, s_{Dt}, k_A, k_{Q}, k_{Dt}],\end{aligned}$$

and an overall size of 10000 examples. High-order statistics give a quantitative indication of the asymmetry (skewness) and heaviness of tails (kurtosis) of a probability distribution, helping to improve the detection inference. The output space \(\mathcal {Y} = \{-1,+1\}\) is generated by associating each sample \(\varvec{x}\) with the label \(-1\) when “no tunnel” is detected and \(+1\) when “tunnel” is detected. In this sense, the idea of safety should be interpreted as an indication that the system has detected the presence of a “tunnel” or abnormal behavior, i.e., the system believes that there is a potential security threat or intrusion. This could trigger various security countermeasures, such as blocking incoming traffic or applying filters to the connection.
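For concreteness, a possible way of assembling this 12-dimensional statistics vector from already-matched query sizes, answer sizes and query–answer delays is sketched below; this is an assumption on the preprocessing, not the authors' pipeline.

```python
# Possible construction of the 12-feature vector
# [m_A, m_Q, m_Dt, v_A, v_Q, v_Dt, s_A, s_Q, s_Dt, k_A, k_Q, k_Dt]
# from arrays of matched query sizes q, answer sizes a and delays dt. Illustrative.
import numpy as np
from scipy.stats import skew, kurtosis

def feature_vector(q, a, dt):
    cols = (np.asarray(a), np.asarray(q), np.asarray(dt))   # order: A, Q, Dt
    stats = (np.mean, np.var, skew, kurtosis)
    return np.array([f(c) for f in stats for c in cols])
```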

Fig. 4

Trend of the average error as \(\varepsilon\) varies in [0.05, 0.5] for different classifiers. The errors vary in [0, 0.6] for SVM, [0, 0.8] for SVDD and [0, 0.6] for LR

Fig. 5

Trend of the average size of conformal sets as \(\varepsilon\) varies in [0.05, 0.5] for different classifiers. The size varies from 0 (empty) to 1 (full)

Conformal predictions assess the goodness of an algorithm through two basic evaluation metrics: accuracy and efficiency. Accuracy is measured by the average error, over the test set, of the conformal prediction sets, considering points of both classes (err), only class \(y=-1\) points (\(\text {err }_-\)) and only class \(y=+1\) points (\(\text {err }_+\)). We recall that an error occurs whenever the true label is not contained in the prediction set. Efficiency is quantified through the rate of test-point prediction sets with no predictions (empty), two predictions (double) and a single prediction (single), the latter also divided by class (\(single _-\) and \(single _+\)). The obtained results (as the classifier varies) are reported in Figs. 4 and 5 for accuracy and efficiency, respectively.
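A minimal sketch of how these accuracy and efficiency metrics can be computed is reported below; it assumes that the conformal sets of the test points are available as Python sets and is meant only as an illustration of the definitions above, not as the evaluation code used for the figures.

```python
# Sketch of the CP evaluation metrics: average error (overall and per class)
# and rates of empty, double and singleton prediction sets. Illustrative only.
import numpy as np

def cp_metrics(conf_sets, y_test):
    y = np.asarray(y_test)
    err = np.array([yt not in c for c, yt in zip(conf_sets, y)])
    size = np.array([len(c) for c in conf_sets])
    single_lab = np.array([next(iter(c)) if len(c) == 1 else 0 for c in conf_sets])
    return {
        "err": err.mean(),
        "err-": err[y == -1].mean(),
        "err+": err[y == +1].mean(),
        "empty": (size == 0).mean(),
        "double": (size == 2).mean(),
        "single": (size == 1).mean(),
        "single-": (single_lab == -1).mean(),
        "single+": (single_lab == +1).mean(),
    }
```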

The overall metrics computed on the benchmark dataset outline the expected behavior of the conformal prediction, with slight differences between the example classifiers. For all values of \(\varepsilon\), the average error is indeed bounded by \(\varepsilon\) in all cases. Also, err increases linearly with \(\varepsilon\). This means that the classification is reliable. As for the size of the conformal sets, the overall results point out that for small values of \(\varepsilon\) the model produces more double-sized prediction sets, since in this way it is “almost certain” that the true label is contained in the conformal set. The size then decreases as \(\varepsilon\) increases, allowing for the presence of more empty prediction sets. The number of singleton conformal sets always remains sufficiently high (it increases as double conformal sets decrease and decreases as empty conformal sets increase), meaning that the classification is efficient.

Regarding the example classifiers, it is interesting to note that LR is the most stable with respect to \(\varepsilon\) and to the error conditional on the classes: the error rate for both classes is nearly linear in \(\varepsilon\), suggesting that the prediction is reliable even conditional on the single class or, better, that the classifier is able to clearly separate the classes while maintaining the expected confidence. The same behavior is also observed for SVM, although the errors per class deviate more from the average error. The error for class “tunnel” is always lower than that for class “no tunnel”, suggesting that the classifier is more prone to minimizing the number of false positives, losing accuracy on true negatives. The opposite behavior is observed for SVDD, which instead tries to classify negative instances better, resulting in a lower expected classification error for class “no tunnel”. The most interesting aspect, however, is that the algorithm is less conformal when conditioned on the error of a single class, with the spread with respect to the average error increasing as \(\varepsilon\) increases.

Conformal prediction together with scalable classifiers thus defines a new framework for dealing with uncertainty quantification in classification-based scenarios. The results shown in this application considerably improve upon those obtained on the same dataset in Carlevaro and Mongelli (2021). The previous approach relied on an iterative procedure to control the number of misclassified points that could only be used with a specific algorithm (SVDD) and without a priori confidence bounds, relying instead on a smart trial-and-error scheme. The point that the reader should observe is precisely this: the presented theory allows the uncertainty naturally introduced by machine learning approaches to be handled in a simple and probabilistically grounded way, allowing the confidence of the prediction to be set by design. Finally, Fig. 6 shows the behavior of the coverage error of the CSR for the example classifiers. As stated in Theorem 4, the probability of observing the wrong label \(-1\) for the points belonging to \(\mathcal {S}_\varepsilon\) stays below the expected error \(\varepsilon\).

Fig. 6

Error coverage plot as \(\varepsilon\) varies in [0.05, 0.5] for the example classifiers. The probability varies in [0, 0.5] for SVM, [0.05, 0.55] for SVDD and [0, 0.6] for LR

5 Conclusions

Scalable classifiers allow for the development of new techniques to assess safety and robustness in classification problems. With this research we explored the connections between scalable classifiers and conformal prediction. Through the definition of a score function that naturally derives from the scalable classifier, it is possible to define the concept of conformal safety region, a region that possesses a crucial property known as error coverage, which implies that the probability of observing the wrong label for data points within this region is guaranteed to be no more than a predefined confidence error \(\varepsilon\). Moreover, ongoing studies on the conformal coverage (that is, the property that the probability of observing the true safe label in the CSR is no less than \(1-\varepsilon\)) suggest that a mathematical proof for this property is conceivable. The idea is to exploit results on class-conditional conformal prediction, as in Vovk (2012). In addition, future work will include the possibility of extending the formulation of scalable classifiers, and thus of the conformal safety region, to the multi-class and multi-label settings.

The exploration of conformal and error coverages introduces a novel and meaningful concept that holds great promise for applications in the field of reliable and trustworthy artificial intelligence. It has the potential to enhance the assessment of safety and robustness, contributing to the advancement of AI systems that can be trusted and relied upon in critical applications.