
1 Introduction

The continuous expansion of the application fields of Machine Learning (ML) into safety-critical domains, such as autonomous vehicles, entails an increasing need for suitable safety assurance approaches. One key aspect in this regard is getting a grasp on the confidence associated with the output of an ML component. While some ML models provide a probabilistic output that can be interpreted as a level of confidence, such output alone is not sufficient to establish overall trust. Significant progress has been made towards addressing this question, with approaches that introduce more sophisticated evaluation of a given model’s outputs. Model-specific approaches base their evaluation on understanding of the internals of the given ML model, e.g. [23] focuses on the second-to-last layer of a given deep neural network. On the other hand, model-agnostic approaches treat models as black boxes, basing their evaluation on properties that can be examined externally, e.g. in [16], surrogate models are constructed during training to later provide uncertainty estimates of the ML model in question. An additional concern for evaluating ML models is that the evaluation must also satisfy the application requirements, in particular with regard to performance. For instance, the authors of [25] propose auxiliary networks for evaluation, but the computational capacity these networks require hinders their roll-out into real-time systems. On a general note, a safety argument for a system with ML components will typically be very specific to a given application and its context, and will comprise a diverse range of measures and assumptions, typically including both development-time and runtime approaches; ours falls under the latter category.

SafeML, proposed in [2] and improved in [1], is a runtime approach for evaluating ML model outputs. In brief, SafeML compares training and operational data of the ML model in question and determines whether they are statistically ‘too distant’ to yield a trustworthy answer. The work in [1] further demonstrates a bootstrap-based p-value estimation extension to improve confidence in measurements. However, the existing literature does not explain how to address specific challenges for practical application of SafeML.

Our contribution is to identify these limitations and to propose an approach that overcomes them, enabling a systematic application of SafeML. In the remainder of Sect. 1, we provide a more detailed description of previous work on SafeML. We then discuss its practical limitations, provide the motivation behind our approach, and further detail our contributions.

1.1 SafeML

SafeML is a collection of measures that estimate the statistical distance between training and operational datasets based on the Empirical Cumulative Distribution Function (ECDF). In [2], the estimated distance was shown to correlate negatively with the corresponding ML model’s accuracy. The same paper also proposed a plausible workflow for applying SafeML to ML monitoring. The workflow divides an ML task into two phases, an offline/training phase and an online/application phase. In the training phase, it is assumed that we have a trusted dataset and that there is no uncertainty associated with its labels. An ML model, such as a deep neural network or a support vector machine, can be trained on the trusted data for classification or regression tasks.
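
To make the measures concrete, the following minimal Python illustration computes the kind of ECDF-based two-sample distances SafeML builds on, using standard SciPy routines; this is a sketch for intuition, not the authors' SafeML implementation, and the synthetic data are placeholders.

```python
# Minimal illustration of ECDF-based two-sample distances between one training
# feature and one buffered operational feature. Synthetic data; not SafeML itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)   # trusted training data
op_feature    = rng.normal(loc=0.4, scale=1.2, size=300)    # buffered operational data

ks   = stats.ks_2samp(train_feature, op_feature)             # Kolmogorov-Smirnov
cvm  = stats.cramervonmises_2samp(train_feature, op_feature) # Cramer von Mises
ad   = stats.anderson_ksamp([train_feature, op_feature])     # Anderson-Darling (k-sample)
wass = stats.wasserstein_distance(train_feature, op_feature) # Wasserstein-1

print(f"KS statistic:  {ks.statistic:.3f} (p={ks.pvalue:.3g})")
print(f"CvM statistic: {cvm.statistic:.3f} (p={cvm.pvalue:.3g})")
print(f"AD statistic:  {ad.statistic:.3f}")
print(f"Wasserstein-1: {wass:.3f}")
```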

After validation, in the online/application phase, the trained model is deployed together with a buffer that gathers input samples. The number of buffered samples should be large enough for the distance determination to be relied upon, but the existing approach does not provide further guidance on how this number should be specified. When a sufficiently large number of samples has been obtained, the ECDF of each feature and each class is calculated based on the trained classifier’s decisions. The ECDF-based statistical distance measures are then used to evaluate the differences between the trusted dataset and the buffered data. To ensure that the statistical measures are valid, a bootstrap-based p-value evaluation method is added to the measurements, as in [1]. The user of the method must then specify a minimal distance threshold (and optionally additional ones) for the distance measures. The proposed workflow suggests that if the outcome is slightly above the minimal threshold, additional data can be requested. On the other hand, if the outcome is significantly above the threshold value (or a specified additional threshold), alternative actions can be taken, e.g. operator intervention. If the outcome is below the minimal threshold (or a specified additional threshold), the decision of the ML algorithm can be trusted and the statistical distance measures can be stored for reporting.
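
The following hedged Python sketch shows one plausible reading of this runtime decision logic for a single feature, using a permutation-style bootstrap to estimate the p-value of the observed distance; the threshold values, buffer handling, and action names (TRUST, REQUEST_MORE_DATA, ESCALATE) are illustrative assumptions, not part of the published SafeML interface.

```python
# Hedged sketch of the runtime decision logic, single feature, Wasserstein distance.
import numpy as np
from scipy.stats import wasserstein_distance

def bootstrap_p_value(train, buffered, n_boot=1000, seed=0):
    """Estimate how often a distance at least as large as the observed one arises
    when both samples are drawn from the same pool (permutation/bootstrap style)."""
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(train, buffered)
    pooled = np.concatenate([train, buffered])
    count = 0
    for _ in range(n_boot):
        rng.shuffle(pooled)
        d = wasserstein_distance(pooled[:len(train)], pooled[len(train):])
        if d >= observed:
            count += 1
    return observed, count / n_boot

def evaluate_buffer(train, buffered, min_threshold, high_threshold, alpha=0.05):
    distance, p = bootstrap_p_value(train, buffered)
    if p > alpha:                    # distance not statistically meaningful
        return "TRUST", distance
    if distance < min_threshold:     # statistically valid but acceptably small
        return "TRUST", distance
    if distance < high_threshold:    # slightly above the minimal threshold
        return "REQUEST_MORE_DATA", distance
    return "ESCALATE", distance      # e.g. operator intervention
```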

As SafeML is model-agnostic, it can be flexibly deployed in numerous applications. In [1, 2], Aslansefat et al. presented experimental applications of SafeML for security attack detection [27] and for the German Traffic Sign Recognition Benchmark (GTSRB) [29]. For security intrusion detection, SafeML measures were used to compare the statistical distances against the accuracy of the classifier. In the GTSRB example, the model was trained, and the set of incorrectly classified images was compared against randomly selected input images from the training set.

1.2 Motivation

As mentioned in Sect. 1.1, applying SafeML requires specifying the number of runtime samples that need to be acquired, as well as at least a minimal distance threshold for acceptance/rejection. Both parameters must be defined at development time, as they need to be known by the time the ML model is in operation. Existing work on SafeML neither investigates nor provides guidance for establishing these parameters, leaving it up to the user to find reasonable values.

However, this is not a trivial matter, as identifying appropriate thresholds has application-related implications. As will be highlighted further in Sect. 3, an inadequate number of runtime samples may result in low statistical power of the SafeML-based evaluation, whereas collecting too many samples can be inefficient and limit application performance. Addressing these limitations is the focus of this publication.

Statistical power is the probability that a hypothesis test correctly rejects the null hypothesis when it is false, i.e., the probability of a true positive, given a large enough population [7]. Conversely, by presetting a required level of statistical power, the population size needed to correctly distinguish two distributions can be calculated through power analysis (a minimal power-analysis sketch follows the research questions below). Similarly, distance thresholds that are too low can lead to flooding the host application with false positive alarms, whereas distance thresholds that are too high can lead to potentially critical conditions being overlooked. Concretely, we establish the following research questions:

  • RQ1: Dissimilarity-Accuracy Correlation. Can we confirm that data points seen during operation that are dissimilar to training data impact the model’s performance in terms of accuracy?

  • RQ2: Sample Size Dependency. Can we determine whether the sample size affects the accuracy of the SafeML distance estimation?
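
As referenced above, the following is a minimal power-analysis sketch: it computes Cohen's d between two samples of one feature and solves for the sample size needed to reach a target power. The choice of statsmodels, the target power of 0.8, and the synthetic data are illustrative assumptions, not values prescribed by the paper.

```python
# Hedged sketch: sample-size determination via power analysis with Cohen's d.
import numpy as np
from statsmodels.stats.power import TTestIndPower

def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
train_feature = rng.normal(0.0, 1.0, 5000)   # placeholder training feature
test_feature  = rng.normal(0.3, 1.0, 1000)   # placeholder testing feature

d = abs(cohens_d(train_feature, test_feature))
n_required = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8)
print(f"Cohen's d = {d:.3f}, minimum samples per group: {int(np.ceil(n_required))}")
```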

1.3 Paper Contribution and Outline

The contribution of this paper is three-fold. First, we use power analysis to specify sampling requirements for SafeML monitoring at runtime. Second, we systematically determine appropriate SafeML distance thresholds. Finally, we apply the above method in the context of an example automotive simulation.

The remainder of the paper is structured as follows: In Sect. 2, we discuss background and related work, including approaches both similar to and different from SafeML. In Sect. 3, we describe our approach for systematically applying SafeML and determining relevant thresholds, as well as our experimental setup. In Sect. 4, we discuss our experimental results, before recapping our key points and discussing future work in Sect. 5.

2 Background and Related Work

To briefly recap, in [1, 2] the authors propose statistical distance measures to compare the distributions of the training and operational datasets; the measures are based on established two-sample statistical testing methods, including the Kolmogorov-Smirnov, Anderson-Darling, Cramer von Mises [8], and Wasserstein methods [24]. The statistical distance measures used by SafeML capture the dissimilarity between two different distributions, but the approach itself does not propose an explicit threshold at which those distributions are not equivalent, nor a means for determining one systematically.

Setting meaningful thresholds is a recurring problem in ML and data-driven applications. A method based on the 3-sigma rule was shown to provide suitable threshold criteria for Hidden Markov Models under the assumption of normally distributed data [6]. Our approach is similar in that it follows the same principle, but we do not assume that our datasets are normally distributed. Therefore, instead of a fixed 3-sigma rule, we opt for a gradual increase of the threshold in increments of the standard deviation. We elaborate on this further in Sect. 3.

A prerequisite for the transition of AI applications to safety- and security-critical systems is the existence of guarantees and guidelines to assure underlying system dependability. A method was proposed in [25] to assure a model’s operation within the intended context in a model-agnostic manner, with an additional autoencoder-based network being used to detect semantic novelty.

However, the innate problem of using neural networks, including autoencoders, is their black-box nature with respect to explainability, which inhibits the establishment of dependability guarantees. Hence, the use of a more explainable statistical method could serve as a solution to this issue. This includes our proposed approach, as the ECDF-based distance to the training set could provide additional insight into the model’s decision.

In [23], the authors propose a commonality metric that inspects the second-to-last layer of a Deep Neural Network (DNN). The proposed metric expresses the ratio between the activation of that layer’s neurons during training (across all training instances) and their activation during operation, for the given operational input. The approach shares common ideas with SafeML, but diverges in being model-specific, as the metric directly samples that layer’s neurons. In contrast, SafeML does not consider model internals and makes no assumption about the distribution of the training and operational data.

Efforts have been made to ensure dependable and consistent behavior in AI-based applications. These have taken various forms, from generative models, whose outputs can be interpreted as confidence in the predictions, to the aforementioned novelty detection. Design-time safety measures are introduced in [28], where the robustness of neural networks can be certified through a novel abstract domain before deployment. Similarly, a feature-guided safety testing method is proposed in [30] to evaluate the robustness of neural networks by subjecting them to adversarial examples. Pairing neural networks with Markov decision processes has also been proposed to verify their robustness through statistical model checking [12].

Uncertainty wrappers are another notable concept [13,14,15,16]. This mathematical concept separates ML uncertainty into three layers: (I) model performance, (II) input quality, and (III) scope compliance, and provides a set of useful functions for evaluating the uncertainties at each layer. The uncertainty wrapper can be compared to SafeML at the third layer (scope compliance); both approaches are model-agnostic.

Safeguard AI [17] proposes calculating the likelihood of out-of-distribution (OOD) inputs and adding it to the loss function of the ML/DL model. The approach also uses a Generative Adversarial Network (GAN) to produce boundary data in order to achieve more accurate OOD detection. In comparison to SafeML, the approach is model-specific and cannot be evaluated at runtime.

Another common theme across approaches for safeguarding ML models is the investigation of all conceivable input perturbations to produce robust, safe, and abstract interpretable solutions and certifications for ML/DL models [9, 10, 18,19,20, 26]. These approaches are also model-specific and do not provide runtime solutions. Similar to previous approaches, DeepImportance is a model-specific solution that presents a new Importance-Driven Criteria (IDC) as a layer-wise function to be assessed during the test procedure and provides a systematic framework for ML testing [11]. Regarding the reliability evaluation of ML models, only a small number of solutions have been provided so far. One of these is ReAsDL, which divides the input space into tiny cells and evaluates the ML/DL reliability based on the cells’ robustness and operational profile probability [31, 32]. This solution is model-agnostic and focuses on classification tasks similar to SafeML. The NN-Dependability-kit suggests a new set of dependability measures to assess the impact of uncertainty reduction in the ML/DL life cycle. The authors also included a formal reasoning engine to ensure that the ML/DL dependability is guaranteed. The approach can be used for runtime purposes [3].

Fig. 1. Process flowchart

3 Methodology

In this section, we present our refined approach for applying SafeML, in the form of a proposed workflow, and address the question of how to determine the sampling and distance thresholds. To validate our approach, we applied SafeML to ML monitoring during simulation and also evaluated it on an existing dataset, the GTSRB. In Sect. 3.2, we describe the experimental design for our empirical evaluation of the proposed approach.

3.1 Process Workflow

The process workflow for determining the needed number of samples as well as the distance threshold is divided into three stages, as shown in Fig. 1.

  • Acquisition: In this stage, two datasets are involved, a training dataset and a testing dataset. In our empirical experiments (see Sect. 3.2), these datasets are generated from the simulation, but they would generally be derived during development. At this point, power analysis is used to find the number of samples needed to determine the difference between the operational dataset and the training set. An application-specific factor can additionally be applied to this minimum, collecting samples beyond those strictly needed to achieve the determined test power. The effect size for the power analysis is established between the training set and the testing set, using Cohen’s d coefficient [4].

  • Training: The training dataset is processed and split into a smaller training set and a testing set. A sub-sample of the smaller training set is drawn uniformly to represent the Training Scope Set (TSS) used in the calculation of statistical distances; this sub-sampling preserves the feature distribution while reducing computational complexity at runtime. A model is then built from the smaller training set and used to predict the outputs of the testing set. The predictions are further distinguished into correctly and incorrectly classified outputs, and the SafeML measures evaluate the statistical distance between the incorrectly classified outputs and the TSS. The resulting distances are used as the initial distance threshold, which is then increased gradually in increments of the standard deviation until a user-defined safety performance level is met (a sketch of this calibration follows the list).

  • Operation: Once the trained model is in operation, the sample size obtained in the ‘Acquisition’ stage is used to aggregate operational data points into an operational set. The SafeML measures evaluate the statistical distance between this operational set and the TSS. If the value falls within the defined threshold, the model continues its operation normally; otherwise, a signal is sent to trigger a user-defined action.
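
As referenced in the ‘Training’ item above, the following hedged Python sketch shows one plausible reading of the sigma-increment threshold calibration and of the ‘Operation’ check, assuming a single feature and the Wasserstein distance; all names, the batch construction, and the choice of a single distance measure are illustrative assumptions, not the authors’ implementation.

```python
# Hedged sketch: sigma-increment threshold sweep ('Training') and runtime check ('Operation').
import numpy as np
from scipy.stats import wasserstein_distance

def threshold_sweep(dist_incorrect, dist_correct, step=0.1, max_k=10.0):
    """Sweep candidate thresholds, starting from the mean distance of incorrectly
    classified batches to the TSS and adding k*sigma increments; report the true/false
    positive rates per candidate, mirroring the curves in Figs. 3 and 4."""
    mu, sigma = dist_incorrect.mean(), dist_incorrect.std()
    rows = []
    for k in np.arange(0.0, max_k + step, step):
        thr = mu + k * sigma
        tpr = float(np.mean(dist_incorrect >= thr))  # misclassified batches still flagged
        fpr = float(np.mean(dist_correct >= thr))    # acceptable batches wrongly flagged
        rows.append((k, thr, tpr, fpr))
    return rows  # the user picks the smallest k meeting the required safety performance

def check_operational_batch(batch, tss, threshold):
    """'Operation' stage: flag the batch if its distance to the TSS exceeds the threshold."""
    return wasserstein_distance(batch, tss) > threshold
```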

3.2 Experiment Setup

We performed experiments on the German Traffic Sign Recognition Benchmark (GTSRB) [29] and on a synthetic example dataset generated in the CARLA simulator [5] to evaluate our approach. CARLA is an automotive simulator used for the development, training, and validation of autonomous driving systems. The dataset generated from CARLA was used to evaluate the confidence level of SafeML predictions and the autopilot decisions of the simulated vehicle. The GTSRB dataset, first used in 2011, is a collection of traffic sign images along with their labels, used for benchmarking ML algorithms. The dataset is a good representative of safety-critical applications of ML components and was hence also considered in this work for the evaluation of the presented approach.

The CARLA setup allows us to identify a systematic method for estimating the minimum number of required samples and the distance acceptance threshold through a fixed-point iteration, as well as to determine their implications for the model’s predictions and how they correlate with the model’s performance. It also offers multiple maps, called Towns, with different sizes and properties, which allows the experiment to be repeated. A simple model was built from a dataset sampled from CARLA, using a vehicle autopilot with varying driver profiles (shown in Table 1). This corresponds to the ‘Acquisition’ step in Sect. 3.1. Three types of driving profiles were considered: safe, moderate, and dangerous. We note that the profiles (and the model) were not designed to provide an accurate risk behavior estimation, but rather to serve as a source of plausible ground truth for evaluating SafeML. A collection of classifiers was trained as the subject ML models for the CARLA dataset, with results shown in Table 2. The models’ inputs are the three location coordinates and the outputs are ordinally-encoded speed levels at the given coordinates (0: slow, 1: moderate, 2: fast).
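
For illustration only, a classifier of the kind described above could be set up as follows; the value of k, the synthetic coordinates, and the labels are assumptions, not the configuration used in the paper.

```python
# Minimal sketch of a kNN subject model mapping (x, y, z) coordinates to speed levels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.uniform(-200, 200, size=(3000, 3))   # placeholder (x, y, z) positions
y = rng.integers(0, 3, size=3000)            # 0: slow, 1: moderate, 2: fast

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[10.0, -42.0, 0.3]]))     # predicted speed level at a position
```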

As the GTSRB dataset is already available, the ‘Acquisition’ phase was considered complete. A simple convolutional neural network was then built to classify the GTSRB dataset, as such networks are known for their strong performance on image applications. We then applied the above-mentioned approach to obtain the minimum number of required samples and the distance acceptance threshold for this application.

Table 1. Properties of driver profiles
Table 2. Performance of trained models on the simulated CARLA dataset

The trained CNN achieved an accuracy of around 99.73%. We remind readers that SafeML is model-agnostic, and other ML models could also have been used. This high accuracy left very few incorrect samples for testing SafeML. Thus, one of the minority classes was excluded from training and treated as an out-of-scope class, reducing accuracy to 97.5%. This added greater disparity and enabled validation of SafeML.
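
A minimal sketch of this class-exclusion step is shown below, assuming generic array inputs; the function name and variables are placeholders, not the authors’ code.

```python
# Hedged sketch: hold out one minority class from training so it acts as
# out-of-scope data at evaluation time.
import numpy as np

def split_out_of_scope(X, y, excluded_class):
    in_scope = y != excluded_class
    X_train, y_train = X[in_scope], y[in_scope]   # used for training
    X_oos = X[~in_scope]                          # later presented to SafeML as novel input
    return X_train, y_train, X_oos
```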

In [2], SafeML distance measures have been shown to negatively correlate with the accuracy of the model. From this fact, and according to the first research question established in Sect. 1.2, we hypothesize that misclassified points would have a higher distance than correctly classified data points due to their dissimilarity to the training set.

Furthermore, from principles of statistical analysis, it is established that, if an insufficient number of samples is used during hypothesis testing, there is a risk of the statistical tests not achieving sufficient power. According to our second research question in Sect. 1.2, our corresponding hypothesis is that the number of samples correlates with confidence of dissimilarity (the magnitude of the distance).

The experiment concluded by following the ‘Operation’ step of the process workflow explained in Sect. 3.1. In the CARLA example, the same experiment was reproduced in different environment setups to ensure consistency of the results. For GTSRB, this step was performed on the test set, which would be replaced by runtime data during actual operation.

Table 3. Mean and standard deviation of the statistical distances of the entire test set (CVM: Cramer von Mises, AD: Anderson-Darling, KS: Kolmogorov-Smirnov, WS: Wasserstein)

4 Results

4.1 Preliminary Findings

Before continuing with the workflow of the simulation, we analyzed the trained model to test the hypotheses defined in Sect. 3.2, namely:

  • RQ1: Dissimilarity-Accuracy Correlation was tested by calculating the statistical distance between the correctly classified data points and the TSS, as well as between the incorrectly classified data points and the TSS. Table 3 shows the mean and standard deviation of each of the statistical distance measures used. It shows that the incorrectly classified points are highly dissimilar to the TSS (higher distance), supporting the corresponding hypothesis.

  • RQ2: Sample Size Dependency: Due to the model’s accuracy of 95%, the number of correctly classified data points was significantly larger than that of incorrectly classified points when the distances in Table 3 were calculated. To account for the number of samples, the distances were calculated over a varying number of randomly sampled points from each group. As shown in Fig. 2, the distance of incorrectly classified points is always larger than the distance of correctly classified points and increases with an increasing number of samples. This can be attributed to several factors, such as: (a) increased distinction between the distributions, and (b) a shift of the average value of the distances as the number of available samples increases, which removes skewness in the distribution.
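
The following hedged sketch illustrates how a distance-versus-sample-size curve such as the one in Fig. 2 can be produced: repeatedly subsample one group of points at increasing sizes and record its distance to the TSS. Sizes, repetitions, and the single-feature assumption are illustrative.

```python
# Hedged sketch: mean distance to the TSS over varying subsample sizes (single feature).
import numpy as np
from scipy.stats import wasserstein_distance

def distance_curve(points, tss, sizes, repeats=20, seed=3):
    rng = np.random.default_rng(seed)
    curve = []
    for n in sizes:
        n = min(n, len(points))
        dists = [wasserstein_distance(rng.choice(points, size=n, replace=False), tss)
                 for _ in range(repeats)]
        curve.append(np.mean(dists))
    return curve
```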

Fig. 2. Statistical distance over varying sampling sizes

4.2 Experiment Results

Following the process workflow presented in Sect. 3.1, each stage produced its corresponding values after being executed on the “Town 1” standard map from CARLA. In the ‘Acquisition’ stage, power analysis was applied to each of the driver profiles; the highest number of samples returned was 91. Multiplying this by an additional factor of 1.3 and rounding to align with our sampling batches yielded a final sample size of 120; operational samples were collected in batches spanning 4 s at a simulation resolution of 30 frames per second. The performance of the trained models is shown in Table 2; the kNN model was used in the evaluation of the results due to its simplicity and high reported performance. The resulting threshold values for SafeML are shown in Table 4.

Table 4. Threshold parameters used for Town 1 (CVM: Cramer von Mises, AD: Anderson-Darling, KS: Kolmogorov-Smirnov, WS: Wasserstein)
Fig. 3. SafeML performance on Town 1 with moderate driver profiles

The acceptable performance of the ML model is a design decision derived from the specified application requirements. In our example, let us consider correctness over a batch. Since each batch contains multiple frames, let us assume a batch is considered correctly classified if its overall accuracy is at least 0.8 (96 correct points out of 120). Conversely, a batch is considered incorrectly classified if its overall accuracy is 0 (focusing on worst-case scenarios), i.e., all of its members are misclassified. This strict criterion was chosen to represent an extreme scenario that minimizes the number of false alarms.

The performance of each of the distance measures in SafeML was evaluated on different driver profiles, as shown in Figs. 3 and 4, where the true positive rate (batches with 0 accuracy that were above the threshold) and the false positive rate (batches with at least 0.8 accuracy that were above the threshold) were plotted against the threshold, increased in increments of 0.1 of the standard deviation.

Figure 3 shows the standard deviation factor by which the threshold should be increased to yield reliable identification by SafeML. The plot compares incorrect SafeML alarms (false positive rate) against correct ones (true positive rate), with the target set to 0.8 (as mentioned previously, this target can be determined based on application-level requirements). Through this method, a suitable factor was found for the distance measures, with the exception of the Kolmogorov-Smirnov distance; the remaining measures achieved similar false positive rates.

The same process was repeated for the dangerous driver profile, shown in Fig. 4, where similar curves were observed and the threshold points could be established following the same steps as for the moderate profile. However, the ratio between true and false positive rates is exceptionally poor in this case. The experiment was repeated on “Town 2” and “Town 4” with similar results.

Repeating the process workflow on the GTSRB shows quite a similar trend, where the correct and incorrect classifications become completely separable by setting a suitable distance threshold, as shown in Fig. 5. The number of required samples (each sample being an image) can be seen on the x-axis. In this case, the majority of the incorrect classifications represent an out-of-scope class. The distance was calculated using features derived from the last layer of the CNN instead of the raw pixels. More detailed results can be found in the accompanying repository (see Sect. 6).

Fig. 4. SafeML performance on Town 1 with dangerous driver profiles

Fig. 5. Statistical distance over varying sampling sizes for GTSRB

5 Conclusion and Future Work

In this paper, we addressed the challenge of determining sampling and distance thresholds for SafeML, a model-agnostic assessment tool for scope compliance. Our approach incorporates power analysis during the development stage of the subject ML model in order to determine the number of samples necessary to achieve sufficient statistical power when applying the SafeML distance evaluation during the runtime stage. Furthermore, we proposed means of identifying appropriate distance thresholds, based on the observed performance of the ML model during development-time simulation. We validated our approach experimentally, using a scenario developed in the CARLA automotive simulator as well as the publicly available GTSRB dataset.

Apart from the SafeML applications discussed earlier in Sect. 2, at the time of writing, additional examples are being researched, such as using SafeML for cancer detection via x-ray imaging as well as for pedestrian detection, financial investment, and predictive maintenance.

Regarding future work, we are considering further directions to improve SafeML, including investigating the effect of outlier data and of dataset characteristics (see [22]), using dimensionality reduction, accounting for uncertainty in the dataset labels (see [21]), and expanding the scope towards graph, quantum, and time-series datasets.

6 Code Availability

To support the reproducibility of our research, the code and functions supporting this paper have been published online at: https://tinyurl.com/4a76z2xs.