
1 Introduction

The continuous expansion of the application fields of Machine Learning (ML) into safety-critical domains, such as autonomous vehicles, entails an increasing need for suitable safety assurance approaches. One key aspect in this regard is getting a grasp on the confidence associated with the output of an ML component. While some ML models provide a probabilistic output that can be interpreted as a level of confidence, such output alone is not sufficient to establish overall trust. Significant progress has been made towards addressing this question, with approaches that introduce more sophisticated evaluation of a given model’s outputs. Model-specific approaches base their evaluation on understanding of the internals of the given ML model, e.g. [23] focuses on the second-to-last layer of a given deep neural network. On the other hand, model-agnostic approaches treat models as black boxes, basing their evaluation on properties that can be examined externally, e.g. in [16], surrogate models are constructed during training to later provide uncertainty estimates of the ML model in question. An additional concern for evaluating ML models is that the evaluation must also satisfy the application requirements, in particular with regard to performance. For instance, the authors of [25] propose auxiliary networks for evaluation, but the computational capacity these networks require hinders their roll-out into real-time systems. On a general note, a safety argument for a system with ML components will typically be very specific to a given application and its context, and will comprise a diverse range of measures and assumptions, typically including both development-time and runtime approaches; ours falls under the latter category.

SafeML, proposed in [2] and improved in [1], is a runtime approach for evaluating ML model outputs. In brief, SafeML compares training and operational data of the ML model in question and determines whether they are statistically ‘too distant’ to yield a trustworthy answer. The work in [1] further demonstrates a bootstrap-based p-value estimation extension to improve confidence in measurements. However, the existing literature does not explain how to address specific challenges for practical application of SafeML.

Our contribution is to identify these limitations and to propose an approach that overcomes them, enabling a systematic application of SafeML. In the remainder of Sect. 1, we provide a more detailed description of previous work on SafeML. We then discuss its practical limitations, provide the motivation behind our approach, and further detail our contributions.

1.1 SafeML

SafeML is a collection of measures that estimate the statistical distance between training and operational datasets based on the Empirical Cumulative Distribution Function (ECDF). In [2], the estimated distance was shown to correlate negatively with the corresponding ML model’s accuracy. The same paper also proposed a plausible workflow for applying SafeML to ML monitoring. The workflow divides an ML task into two phases, an offline/training phase and an online/application phase. In the training phase, it is assumed that we have a trusted dataset and that there is no uncertainty associated with its labels. An ML model, such as a deep neural network or a support vector machine, can be trained on the trusted data for classification or regression tasks.
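
To make the measures concrete, the following minimal Python illustration computes the kind of ECDF-based two-sample distances SafeML builds on, using standard SciPy routines; this is a sketch for intuition, not the authors' SafeML implementation, and the synthetic data are placeholders.

```python
# Minimal illustration of ECDF-based two-sample distances between one training
# feature and one buffered operational feature. Synthetic data; not SafeML itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)   # trusted training data
op_feature    = rng.normal(loc=0.4, scale=1.2, size=300)    # buffered operational data

ks   = stats.ks_2samp(train_feature, op_feature)             # Kolmogorov-Smirnov
cvm  = stats.cramervonmises_2samp(train_feature, op_feature) # Cramer von Mises
ad   = stats.anderson_ksamp([train_feature, op_feature])     # Anderson-Darling (k-sample)
wass = stats.wasserstein_distance(train_feature, op_feature) # Wasserstein-1

print(f"KS statistic:  {ks.statistic:.3f} (p={ks.pvalue:.3g})")
print(f"CvM statistic: {cvm.statistic:.3f} (p={cvm.pvalue:.3g})")
print(f"AD statistic:  {ad.statistic:.3f}")
print(f"Wasserstein-1: {wass:.3f}")
```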

After validation, in the online/application phase, the trained model is deployed together with a buffer that gathers input samples. The number of buffered samples should be large enough for the distance determination to be relied upon, but the existing approach does not provide further guidance on how this number should be specified. When a sufficiently large number of samples has been obtained, the ECDF of each feature and each class is calculated based on the trained classifier’s decisions. The ECDF-based statistical distance measures are then used to evaluate the differences between the trusted dataset and the buffered data. To ensure that the statistical measures are valid, a bootstrap-based p-value evaluation method is added to the measurements, as in [1]. The user of the method must then specify a minimal distance threshold (and optionally additional ones) for the distance measures. The proposed workflow suggests that if the outcome is slightly above the minimal threshold, additional data can be requested. On the other hand, if the outcome is significantly above the threshold value (or a specified additional threshold), alternative actions can be taken, e.g. operator intervention. If the outcome is below the minimal threshold (or a specified additional threshold), the decision of the ML algorithm can be trusted and the statistical distance measures can be stored for reporting.
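
The following hedged Python sketch shows one plausible reading of this runtime decision logic for a single feature, using a permutation-style bootstrap to estimate the p-value of the observed distance; the threshold values, buffer handling, and action names (TRUST, REQUEST_MORE_DATA, ESCALATE) are illustrative assumptions, not part of the published SafeML interface.

```python
# Hedged sketch of the runtime decision logic, single feature, Wasserstein distance.
import numpy as np
from scipy.stats import wasserstein_distance

def bootstrap_p_value(train, buffered, n_boot=1000, seed=0):
    """Estimate how often a distance at least as large as the observed one arises
    when both samples are drawn from the same pool (permutation/bootstrap style)."""
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(train, buffered)
    pooled = np.concatenate([train, buffered])
    count = 0
    for _ in range(n_boot):
        rng.shuffle(pooled)
        d = wasserstein_distance(pooled[:len(train)], pooled[len(train):])
        if d >= observed:
            count += 1
    return observed, count / n_boot

def evaluate_buffer(train, buffered, min_threshold, high_threshold, alpha=0.05):
    distance, p = bootstrap_p_value(train, buffered)
    if p > alpha:                    # distance not statistically meaningful
        return "TRUST", distance
    if distance < min_threshold:     # statistically valid but acceptably small
        return "TRUST", distance
    if distance < high_threshold:    # slightly above the minimal threshold
        return "REQUEST_MORE_DATA", distance
    return "ESCALATE", distance      # e.g. operator intervention
```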

As SafeML is model-agnostic, it can be flexibly deployed in numerous applications. In [1, 2], Aslansefat et al. presented experimental applications of SafeML for security attack detection [27] and for the German Traffic Sign Recognition Benchmark (GTSRB) [29]. For security intrusion detection, SafeML measures were used to compare the statistical distances against the accuracy of the classifier. In the GTSRB example, the model was trained, and the set of incorrectly classified images was compared against randomly selected input images from the training set.

1.2 Motivation

As mentioned in Sect. 1.1, applying SafeML requires specifying the number of runtime samples that need to be acquired, as well as at least a minimal distance threshold for acceptance/rejection. Both parameters must be defined at development time, as they need to be known by the time the ML model is in operation. Existing work on SafeML neither investigates nor provides guidance for establishing these parameters, leaving it up to the user to find reasonable values.

However, this is not a trivial matter, as identifying appropriate thresholds has application-related implications. As will be highlighted further in Sect. 3, an inadequate number of runtime samples may result in low statistical power of the SafeML-based evaluation, whereas collecting too many samples can be inefficient and limit application performance. Addressing these limitations is the focus of this publication.

Statistical power is the probability that a hypothesis test correctly rejects the null hypothesis when it is false, i.e., the probability of a true positive, given a large enough population [7]. Conversely, by presetting a required level of statistical power, the population size needed to correctly distinguish two distributions can be calculated through power analysis (a minimal power-analysis sketch follows the research questions below). Similarly, distance thresholds that are too low can lead to flooding the host application with false positive alarms, whereas distance thresholds that are too high can lead to potentially critical conditions being overlooked. Concretely, we establish the following research questions:

  • RQ1: Dissimilarity-Accuracy Correlation. Can we confirm that data points seen during operation that are dissimilar to training data impact the model’s performance in terms of accuracy?

  • RQ2: Sample Size Dependency. Can we determine whether the sample size affects the accuracy of the SafeML distance estimation?
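
As referenced above, the following is a minimal power-analysis sketch: it computes Cohen's d between two samples of one feature and solves for the sample size needed to reach a target power. The choice of statsmodels, the target power of 0.8, and the synthetic data are illustrative assumptions, not values prescribed by the paper.

```python
# Hedged sketch: sample-size determination via power analysis with Cohen's d.
import numpy as np
from statsmodels.stats.power import TTestIndPower

def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
train_feature = rng.normal(0.0, 1.0, 5000)   # placeholder training feature
test_feature  = rng.normal(0.3, 1.0, 1000)   # placeholder testing feature

d = abs(cohens_d(train_feature, test_feature))
n_required = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8)
print(f"Cohen's d = {d:.3f}, minimum samples per group: {int(np.ceil(n_required))}")
```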

1.3 Paper Contribution and Outline

The contribution of this paper is three-fold. First, we use power analysis to specify sampling requirements for SafeML monitoring at runtime. Second, we systematically determine appropriate SafeML distance thresholds. Finally, we apply the above method in the context of an example automotive simulation.

The remainder of the paper is structured as follows: In Sect. 2, we discuss background and related work, including approaches both similar to and different from SafeML. In Sect. 3, we describe our approach for systematically applying SafeML and determining relevant thresholds, as well as our experimental setup. In Sect. 4, we discuss our experimental results, before recapping our key points and discussing future work in Sect. 5.

2 Background and Related Work

To briefly recap, in [1, 2] the authors propose statistical distance measures to compare the distributions of the training and operational datasets; the measures are based on established two-sample statistical testing methods, including the Kolmogorov-Smirnov, Anderson-Darling, Cramer von Mises [8], and Wasserstein methods [24]. The statistical distance measures used by SafeML capture the dissimilarity between two different distributions, but the approach itself does not propose an explicit threshold at which those distributions are not equivalent, nor a means for determining one systematically.

Setting meaningful thresholds is a recurring problem in ML and data-driven applications. A method based on the 3-sigma rule was shown to provide suitable threshold criteria for Hidden Markov Models under the assumption of normally distributed data [6]. Our approach is similar in that it follows the same principle, but we do not assume that our datasets are normally distributed. Therefore, instead of a fixed 3-sigma rule, we opt for a gradual increase of the threshold in increments of the standard deviation. We elaborate on this further in Sect. 3.

A prerequisite for the transition of AI applications to safety- and security-critical systems is the existence of guarantees and guidelines to assure underlying system dependability. A method was proposed in [25] to assure a model’s operation within the intended context in a model-agnostic manner, with an additional autoencoder-based network being used to detect semantic novelty.

However, the innate problem of using neural networks, including autoencoders, is their black-box nature with respect to explainability, which inhibits the establishment of dependability guarantees. Hence, the use of a more explainable statistical method could serve as a solution to this issue. This includes our proposed approach, as the ECDF-based distance to the training set could provide additional insight into the model’s decision.

In [23], the authors propose a commonality metric that inspects the second-to-last layer of a Deep Neural Network (DNN). The proposed metric expresses the ratio between the activation of that layer’s neurons during training (across all training instances) and their activation during operation, for the given operational input. The approach shares common ideas with SafeML, but diverges in being model-specific, as the metric directly samples that layer’s neurons. In contrast, SafeML does not consider model internals and makes no assumption about the distribution of the training and operational data.

Efforts have been made to ensure dependable and consistent behavior in AI-based applications. These have taken various forms, from generative models, whose outputs can be interpreted as confidence in the predictions, to the aforementioned novelty detection. Design-time safety measures are introduced in [28], where the robustness of neural networks can be certified through a novel abstract domain before deployment. Similarly, a feature-guided safety testing method is proposed in [30] to evaluate the robustness of neural networks by subjecting them to adversarial examples. Pairing neural networks with Markov decision processes has also been proposed to verify their robustness through statistical model checking [12].

Uncertainty wrappers are another notable concept [13,14,15,16]. This mathematical concept separates ML uncertainty into three layers: (I) model performance, (II) input quality, and (III) scope compliance, and provides a set of useful functions for evaluating the uncertainties at each layer. The uncertainty wrapper can be compared to SafeML at the third layer (scope compliance); both approaches are model-agnostic.

Safeguard AI [17] proposes calculating the likelihood of out-of-distribution (OOD) inputs and adding it to the loss function of the ML/DL model. The approach also uses a Generative Adversarial Network (GAN) to produce boundary data in order to achieve more accurate OOD detection. In comparison to SafeML, the approach is model-specific and cannot be evaluated at runtime.

Another common theme across approaches for safeguarding ML models is the investigation of all conceivable input perturbations to produce robust, safe, and abstract interpretable solutions and certifications for ML/DL models [9, 10, 18,19,20, 26]. These approaches are also model-specific and do not provide runtime solutions. Similar to previous approaches, DeepImportance is a model-specific solution that presents a new Importance-Driven Criteria (IDC) as a layer-wise function to be assessed during the test procedure and provides a systematic framework for ML testing [11]. Regarding the reliability evaluation of ML models, only a small number of solutions have been provided so far. One of these is ReAsDL, which divides the input space into tiny cells and evaluates the ML/DL reliability based on the cells’ robustness and operational profile probability [31, 32]. This solution is model-agnostic and focuses on classification tasks similar to SafeML. The NN-Dependability-kit suggests a new set of dependability measures to assess the impact of uncertainty reduction in the ML/DL life cycle. The authors also included a formal reasoning engine to ensure that the ML/DL dependability is guaranteed. The approach can be used for runtime purposes [3].

Fig. 1. Process flowchart

3 Methodology

In this section, we present our refined approach for applying SafeML, in the form of a proposed workflow, and address the question of how to determine the sampling and distance thresholds. To validate our approach, we applied SafeML to ML monitoring during simulation and also evaluated it on an existing dataset, the GTSRB. In Sect. 3.2, we describe the experimental design for our empirical evaluation of the proposed approach.

3.1 Process Workflow

The process workflow for determining the needed number of samples as well as the distance threshold is divided into three stages, as shown in Fig. 1.

  • Acquisition: In this stage, two datasets are involved, a training dataset and a testing dataset. In our empirical experiments (see Sect. 3.2), these datasets are generated from the simulation, but they would generally be derived during development. At this point, power analysis is used to find the number of samples needed to determine the difference between the operational dataset and the training set. An application-specific factor can additionally be applied to this minimum, collecting samples beyond those strictly needed to achieve the determined test power. The effect size for the power analysis is established between the training set and the testing set, using Cohen’s d coefficient [4].

  • Training: The training dataset is processed and split into a smaller training set and a testing set. A sub-sample of the smaller training set is drawn uniformly to represent the Training Scope Set (TSS) used in the calculation of statistical distances; this sub-sampling preserves the feature distribution while reducing computational complexity at runtime. A model is then built from the smaller training set and used to predict the outputs of the testing set. The predictions are further distinguished into correctly and incorrectly classified outputs, and the SafeML measures evaluate the statistical distance between the incorrectly classified outputs and the TSS. The resulting distances are used as the initial distance threshold, which is then increased gradually in increments of the standard deviation until a user-defined safety performance level is met (a sketch of this calibration follows the list).

  • Operation: Once the trained model is in operation, the sample size obtained in the ‘Acquisition’ stage is used to aggregate operational data points into an operational set. The SafeML measures evaluate the statistical distance between this operational set and the TSS. If the value falls within the defined threshold, the model continues its operation normally; otherwise, a signal is sent to trigger a user-defined action.
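
As referenced in the ‘Training’ item above, the following hedged Python sketch shows one plausible reading of the sigma-increment threshold calibration and of the ‘Operation’ check, assuming a single feature and the Wasserstein distance; all names, the batch construction, and the choice of a single distance measure are illustrative assumptions, not the authors’ implementation.

```python
# Hedged sketch: sigma-increment threshold sweep ('Training') and runtime check ('Operation').
import numpy as np
from scipy.stats import wasserstein_distance

def threshold_sweep(dist_incorrect, dist_correct, step=0.1, max_k=10.0):
    """Sweep candidate thresholds, starting from the mean distance of incorrectly
    classified batches to the TSS and adding k*sigma increments; report the true/false
    positive rates per candidate, mirroring the curves in Figs. 3 and 4."""
    mu, sigma = dist_incorrect.mean(), dist_incorrect.std()
    rows = []
    for k in np.arange(0.0, max_k + step, step):
        thr = mu + k * sigma
        tpr = float(np.mean(dist_incorrect >= thr))  # misclassified batches still flagged
        fpr = float(np.mean(dist_correct >= thr))    # acceptable batches wrongly flagged
        rows.append((k, thr, tpr, fpr))
    return rows  # the user picks the smallest k meeting the required safety performance

def check_operational_batch(batch, tss, threshold):
    """'Operation' stage: flag the batch if its distance to the TSS exceeds the threshold."""
    return wasserstein_distance(batch, tss) > threshold
```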

3.2 Experiment Setup

We performed experiments on the German Traffic Sign Recognition Benchmark (GTSRB) [29] and on a synthetic example dataset generated in the CARLA simulator [5] to evaluate our approach. CARLA is an automotive simulator used for the development, training, and validation of autonomous driving systems. The dataset generated from CARLA was used to evaluate the confidence level of SafeML predictions and the autopilot decisions of the simulated vehicle. The GTSRB dataset, first used in 2011, is a collection of traffic sign images along with their labels, used for benchmarking ML algorithms. The dataset is a good representative of safety-critical applications of ML components and was hence also considered in this work for the evaluation of the presented approach.

The CARLA setup allows us to identify a systematic method for estimating the minimum number of required samples and the distance acceptance threshold through a fixed-point iteration, as well as to determine their implications for the model’s predictions and how they correlate with the model’s performance. It also offers multiple maps, called Towns, with different sizes and properties, which allows the experiment to be repeated. A simple model was built from a dataset sampled from CARLA, using a vehicle autopilot with varying driver profiles (shown in Table 1). This corresponds to the ‘Acquisition’ step in Sect. 3.1. Three types of driving profiles were considered: safe, moderate, and dangerous. We note that the profiles (and the model) were not designed to provide an accurate risk behavior estimation, but rather to serve as a source of plausible ground truth for evaluating SafeML. A collection of classifiers was trained as the subject ML models for the CARLA dataset, with results shown in Table 2. The models’ inputs are the three location coordinates and the outputs are ordinally-encoded speed levels at the given coordinates (0: slow, 1: moderate, 2: fast).
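
For illustration only, a classifier of the kind described above could be set up as follows; the value of k, the synthetic coordinates, and the labels are assumptions, not the configuration used in the paper.

```python
# Minimal sketch of a kNN subject model mapping (x, y, z) coordinates to speed levels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.uniform(-200, 200, size=(3000, 3))   # placeholder (x, y, z) positions
y = rng.integers(0, 3, size=3000)            # 0: slow, 1: moderate, 2: fast

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[10.0, -42.0, 0.3]]))     # predicted speed level at a position
```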

As the GTSRB dataset is already available, the ‘Acquisition’ phase was considered complete. A simple convolutional neural network was then built to classify the GTSRB dataset, as such networks are known for their strong performance on image applications. We then applied the above-mentioned approach to obtain the minimum number of required samples and the distance acceptance threshold for this application.

Table 1. Properties of driver profiles
Table 2. Performance of trained models on the simulated CARLA dataset

The trained CNN achieved an accuracy of around 99.73%. We remind readers that SafeML is model-agnostic, and other ML models could also have been used. This high accuracy left very few incorrect samples for testing SafeML. Thus, one of the minority classes was excluded from training and treated as an out-of-scope class, reducing accuracy to 97.5%. This added greater disparity and enabled validation of SafeML.
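
A minimal sketch of this class-exclusion step is shown below, assuming generic array inputs; the function name and variables are placeholders, not the authors’ code.

```python
# Hedged sketch: hold out one minority class from training so it acts as
# out-of-scope data at evaluation time.
import numpy as np

def split_out_of_scope(X, y, excluded_class):
    in_scope = y != excluded_class
    X_train, y_train = X[in_scope], y[in_scope]   # used for training
    X_oos = X[~in_scope]                          # later presented to SafeML as novel input
    return X_train, y_train, X_oos
```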

In [2], SafeML distance measures have been shown to negatively correlate with the accuracy of the model. From this fact, and according to the first research question established in Sect. 1.2, we hypothesize that misclassified points would have a higher distance than correctly classified data points due to their dissimilarity to the training set.

Furthermore, from principles of statistical analysis, it is established that, if an insufficient number of samples is used during hypothesis testing, there is a risk of the statistical tests not achieving sufficient power. According to our second research question in Sect. 1.2, our corresponding hypothesis is that the number of samples correlates with confidence of dissimilarity (the magnitude of the distance).

The experiment concluded by following the ‘Operation’ step of the process workflow explained in Sect. 3.1. In the CARLA example, the same experiment was reproduced in different environment setups to ensure consistency of the results. For GTSRB, this step was performed on the test set, which would be replaced by runtime data during actual operation.

Table 3. Mean and standard deviation of the statistical distances of the entire test set (CVM: Cramer von Mises, AD: Anderson-Darling, KS: Kolmogorov-Smirnov, WS: Wasserstein)

4 Results

4.1 Preliminary Findings

Before continuing with the workflow of the simulation, we analyzed the trained model to test the hypotheses defined in Sect. 3.2, namely:

  • RQ1: Dissimilarity-Accuracy Correlation was tested by calculating the statistical distance between the correctly classified data points and the TSS, as well as between the incorrectly classified data points and the TSS. Table 3 shows the mean and standard deviation of each of the statistical distance measures used. It shows that the incorrectly classified points are highly dissimilar to the TSS (higher distance), supporting the corresponding hypothesis.

  • RQ2: Sample Size Dependency: Due to the model’s accuracy of 95%, the number of correctly classified data points was significantly larger than that of incorrectly classified points when the distances in Table 3 were calculated. To account for the number of samples, the distances were calculated over a varying number of randomly sampled points from each group. As shown in Fig. 2, the distance of incorrectly classified points is always larger than the distance of correctly classified points and increases with an increasing number of samples. This can be attributed to several factors, such as: (a) increased distinction between the distributions, and (b) a shift of the average value of the distances as the number of available samples increases, which removes skewness in the distribution.
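
The following hedged sketch illustrates how a distance-versus-sample-size curve such as the one in Fig. 2 can be produced: repeatedly subsample one group of points at increasing sizes and record its distance to the TSS. Sizes, repetitions, and the single-feature assumption are illustrative.

```python
# Hedged sketch: mean distance to the TSS over varying subsample sizes (single feature).
import numpy as np
from scipy.stats import wasserstein_distance

def distance_curve(points, tss, sizes, repeats=20, seed=3):
    rng = np.random.default_rng(seed)
    curve = []
    for n in sizes:
        n = min(n, len(points))
        dists = [wasserstein_distance(rng.choice(points, size=n, replace=False), tss)
                 for _ in range(repeats)]
        curve.append(np.mean(dists))
    return curve
```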

Fig. 2. Statistical distance over varying sampling sizes

4.2 Experiment Results

Following the process workflow presented in Sect. 3.1, each stage produced its corresponding values after being executed on the “Town 1” standard map from CARLA. In the ‘Acquisition’ stage, power analysis was applied to each of the driver profiles; the highest number of samples returned was 91. Multiplying this by an additional factor of 1.3 and rounding to align with our sampling batches yielded a final sample size of 120; operational samples were collected in batches spanning 4 s at a simulation resolution of 30 frames per second. The performance of the trained models is shown in Table 2; the kNN model was used in the evaluation of the results due to its simplicity and high reported performance. The resulting threshold values for SafeML are shown in Table 4.

Table 4. Threshold parameters used for Town 1 (CVM: Cramer von Mises, AD: Anderson-Darling, KS: Kolmogorov-Smirnov, WS: Wasserstein)
Fig. 3. SafeML performance on Town 1 with moderate driver profiles

The acceptable performance of the ML model is a design decision derived from the specified application requirements. In our example, let us consider correctness over a batch. Since each batch contains multiple frames, let us assume a batch is considered correctly classified if its overall accuracy is at least 0.8 (96 correct points out of 120). Conversely, a batch is considered incorrectly classified if its overall accuracy is 0 (focusing on worst-case scenarios), i.e., all of its members are misclassified. This strict criterion was chosen to represent an extreme scenario that minimizes the number of false alarms.

The performance of each of the distance measures in SafeML was evaluated on different driver profiles, as shown in Figs. 3 and 4, where the true positive rate (batches with 0 accuracy that were above the threshold) and the false positive rate (batches with at least 0.8 accuracy that were above the threshold) were plotted against the threshold, increased in increments of 0.1 of the standard deviation.

Figure 3 shows the standard deviation factor by which the threshold should be increased to yield reliable identification by SafeML. The plot compares incorrect SafeML alarms (false positive rate) against correct ones (true positive rate), with the target set to 0.8 (as mentioned previously, this target can be determined based on application-level requirements). Through this method, a suitable factor was found for the distance measures, with the exception of the Kolmogorov-Smirnov distance; the remaining measures achieved similar false positive rates.

The same process was repeated for the dangerous driver profile, shown in Fig. 4, where similar curves were observed and the threshold points could be established following the same steps as for the moderate profile. However, the ratio between true and false positive rates is exceptionally poor in this case. The experiment was repeated on “Town 2” and “Town 4” with similar results.

Repeating the process workflow on the GTSRB shows quite a similar trend, where the correct and incorrect classifications become completely separable by setting a suitable distance threshold, as shown in Fig. 5. The number of required samples (each sample being an image) can be seen on the x-axis. In this case, the majority of the incorrect classifications represent an out-of-scope class. The distance was calculated using features derived from the last layer of the CNN instead of the raw pixels. More detailed results can be found in the accompanying repository (see Sect. 6).

Fig. 4. SafeML performance on Town 1 with dangerous driver profiles

Fig. 5. Statistical distance over varying sampling sizes for GTSRB

5 Conclusion and Future Work

In this paper, we addressed the challenge of determining sampling and distance thresholds for SafeML, a model-agnostic assessment tool for scope compliance. Our approach incorporates power analysis during the development stage of the subject ML model in order to determine the number of samples necessary to achieve sufficient statistical power when applying the SafeML distance evaluation during the runtime stage. Furthermore, we proposed means of identifying appropriate distance thresholds, based on the observed performance of the ML model during development-time simulation. We validated our approach experimentally, using a scenario developed in the CARLA automotive simulator as well as the publicly available GTSRB dataset.

Apart from the SafeML applications discussed earlier in Sect. 2, at the time of writing, additional examples are being researched, such as using SafeML for cancer detection via x-ray imaging as well as for pedestrian detection, financial investment, and predictive maintenance.

Regarding future work, we are considering further directions to improve SafeML, including investigating the effect of outlier data and of dataset characteristics (see [22]), using dimensionality reduction, accounting for uncertainty in the dataset labels (see [21]), and expanding the scope towards graph, quantum, and time-series datasets.

6 Code Availability

To support the reproducibility of our research, the code and functions supporting this paper have been published online at: https://tinyurl.com/4a76z2xs.