
Sample-Efficient Safety Assurances Using Conformal Prediction

Conference paper in Algorithmic Foundations of Robotics XV (WAFR 2022)

Abstract

When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e. of the situations that are unsafe, fewer than a fraction \(\epsilon \) should occur without an alert. In this work, we present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics, in order to tune warning systems to provably achieve an \(\epsilon \) false negative rate using as few as \(1/\epsilon \) data points. We apply our framework to a driver warning system and a robotic grasping application, and empirically demonstrate the guaranteed false negative rate while also observing a low false detection (positive) rate.

The NASA University Leadership Initiative (grant #80NSSC20M0163) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not any NASA entity.


References

  1. Lyft motion prediction dataset. https://www.kaggle.com/c/lyft-motion-prediction-autonomous-vehicles/data (2020).

  2. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: a multimodal dataset for autonomous driving (2019). arXiv:1903.11027

  3. Cai, F., Koutsoukos, X.: Real-time out-of-distribution detection in learning-enabled cyber-physical systems. In: 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS), pp. 174–183 (2020)

  4. Calafiore, G., Campi, M.: The scenario approach to robust control design. IEEE Trans. Autom. Control 51(5), 742–753 (2006)

  5. Chen, Y., Rosolia, U., Fan, C., Ames, A., Murray, R.: Reactive motion planning with probabilistic safety guarantees. In: Conference on Robot Learning (2020)

  6. Correll, N., Bekris, K.E., Berenson, D., Brock, O., Causo, A., Hauser, K., Okada, K., Rodriguez, A., Romano, J.M., Wurman, P.R.: Analysis and observations from the first Amazon Picking Challenge. IEEE Trans. Autom. Sci. Eng. 15(1), 172–188 (2016)

  7. Crestani, D., Godary-Dejean, K., Lapierre, L.: Enhancing fault tolerance of autonomous mobile robots. Robot. Auton. Syst. 68, 140–155 (2015)

  8. Ding, S.X.: Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms and Tools, pp. 3–11. Springer, London (2013)

  9. Eppner, C., Höfer, S., Jonschkowski, R., Martín-Martín, R., Sieverling, A., Wall, V., Brock, O.: Lessons from the Amazon Picking Challenge: four aspects of building robotic systems. In: Robotics: Science and Systems (2016)

  10. Feldman, S., Bates, S., Romano, Y.: Improving conditional coverage via orthogonal quantile regression (2021). arXiv:2106.00394

  11. Foody, G.M.: Sample size determination for image classification accuracy assessment and comparison. Int. J. Remote Sens. 30(20), 5273–5291 (2009)

  12. Gammerman, A., Nouretdinov, I., Burford, B., Chervonenkis, A., Vovk, V., Luo, Z.: Clinical mass spectrometry proteomic diagnosis by conformal predictors. Stat. Appl. Genet. Mol. Biol. 7(2), 1–12 (2008)

  13. Harirchi, F., Ozay, N.: Model invalidation for switched affine systems with applications to fault and anomaly detection. Anal. Design Hybrid Syst. (ADHS) 48(27), 260–266 (2015)

  14. Harirchi, F., Ozay, N.: Guaranteed model-based fault detection in cyber-physical systems: a model invalidation approach (2017). arXiv:1609.05921

  15. Hernandez, C., Bharatheesha, M., Ko, W., Gaiser, H., Tan, J., van Deurzen, K., de Vries, M., Van Mil, B., van Egmond, J., Burger, R., et al.: Team Delft’s robot winner of the Amazon Picking Challenge 2016. In: Robot World Cup, pp. 613–624. Springer (2016)

  16. Khalastchi, E., Kalech, M.: On fault detection and diagnosis in robotic systems. ACM Comput. Surv. (CSUR) 51(1), 1–24 (2018)

  17. von Luxburg, U., Schölkopf, B.: Statistical learning theory: models, concepts, and results. In: Gabbay, D.M., Hartmann, S., Woods, J. (eds.) Inductive Logic, Handbook of the History of Logic, vol. 10, pp. 651–706. North-Holland (2011)

  18. Mahler, J., Liang, J., Niyaz, S., Laskey, M., Doan, R., Liu, X., Ojea, J.A., Goldberg, K.: Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In: Robotics: Science and Systems (RSS) (2017)

  19. Mahler, J., Matl, M., Liu, X., Li, A., Gealy, D., Goldberg, K.: Dex-Net 3.0: computing robust robot suction grasp targets in point clouds using a new analytic model and deep learning (2017). arXiv:1709.06670

  20. Mahler, J., Matl, M., Satish, V., Danielczuk, M., DeRose, B., McKinley, S., Goldberg, K.: Learning ambidextrous robot grasping policies. Sci. Robot. 4(26), eaau4984 (2019)

  21. Muradore, R., Fiorini, P.: A PLS-based statistical approach for fault detection and isolation of robotic manipulators. IEEE Trans. Ind. Electron. 59(8), 3167–3175 (2011)

  22. Nouretdinov, I., Costafreda, S.G., Gammerman, A., Chervonenkis, A., Vovk, V., Vapnik, V., Fu, C.H.: Machine learning classification with confidence: application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. NeuroImage 56(2), 809–813 (2011)

  23. Patton, R., Chen, J.: Observer-based fault detection and isolation: Robustness and applications. Control Eng. Pract. 5(5), 671–682 (1997)

  24. Perdomo, J., Zrnic, T., Mendler-Dünner, C., Hardt, M.: Performative prediction. In: International Conference on Machine Learning, pp. 7599–7609. PMLR (2020)

  25. Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data (2020)

  26. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. (JMLR). https://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf (2008)

  27. Tibshirani, R.J., Barber, R.F., Candès, E.J., Ramdas, A.: Conformal prediction under covariate shift (2019). arXiv:1904.06019

  28. Vemuri, A.T., Polycarpou, M.M., Diakourtis, S.A.: Neural network based fault detection in robotic manipulators. IEEE Trans. Robot. Autom. 14(2), 342–348 (1998)

  29. Visinsky, M.L., Cavallaro, J.R., Walker, I.D.: Expert system framework for fault detection and fault tolerance in robotics. Comput. & Electr. Eng. 20(5), 421–435 (1994)

  30. Visinsky, M.L., Cavallaro, J.R., Walker, I.D.: Robotic fault detection and fault tolerance: a survey. Reliab. Eng. & Syst. Safety 46(2), 139–158 (1994)

  31. Visinsky, M.L., Cavallaro, J.R., Walker, I.D.: A dynamic fault tolerance framework for remote robots. IEEE Trans. Robot. Autom. 11(4), 477–490 (1995)

  32. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic learning in a random world (2005)

  33. Vovk, V., Lindsay, D., Nouretdinov, I.: Mondrian confidence machine (2003)

  34. Yu, K.T., Fazeli, N., Chavan-Dafle, N., Taylor, O., Donlon, E., Lankenau, G.D., Rodriguez, A.: A summary of Team MIT’s approach to the Amazon Picking Challenge 2015 (2016). arXiv:1604.03639

  35. Zeng, A., Yu, K.T., Song, S., Suo, D., Walker, E., Rodriguez, A., Xiao, J.: Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1386–1393. IEEE (2017)


Author information

Correspondence to Marco Pavone.

Appendix

1.1 Proofs

Proposition 1: Under Assumption 1, Algorithm 1 is \(\epsilon + 1/(1+|{\mathcal {A}}|)\)-safe (with respect to \({\hat{Y}}, {\hat{Z}}\)).

Proof

Given a sequence of data points \((Y_1, Z_1), \cdots , (Y_T, Z_T)\), denote the subsequence of “unsafe” data as \((Y_{c_1}, Z_{c_1}), \cdots , (Y_{c_M}, Z_{c_M})\), where \(Z_{c_t}\) is the t-th unsafe example (i.e. \(f(Z_{c_t}) < f_0\)), so \(M = |{\mathcal {A}}|\). Suppose that \(\hat{Z}\) is also unsafe, i.e. \(f(\hat{Z}) < f_0\). Let \(\lbrace \!| \cdot |\!\rbrace \) denote an unordered bag (i.e. a set that can have repeated elements). We can bound the probability that no alert is issued for the unsafe \(\hat{Z}\) by comparing \(g(\hat{Y})\) against the scores \(g(Y_{c_1}), \cdots , g(Y_{c_M})\) from which Algorithm 1 computes its alert threshold.

By the assumption of exchangeability, we are equally likely to observe any permutation of the bag \(\lbrace \!| g(Y_{c_1}), \cdots , g(Y_{c_M}), g(\hat{Y}) |\!\rbrace \). Intuitively, it is equally likely for \(g(\hat{Y})\) to be the largest, 2nd largest, etc., among \(g(Y_{c_1}), \cdots , g(Y_{c_M}), g(\hat{Y})\). Formally, the random variable \(|\lbrace t \mid g(Y_{c_t}) < g(\hat{Y}) \rbrace | + U\) takes on all values in \(\lbrace 0, 1, \cdots , M \rbrace \) with equal probability. Therefore, for every \(k \in \lbrace 0, 1, \cdots , M \rbrace \),

$$\begin{aligned} \Pr \left[ \, |\lbrace t \mid g(Y_{c_t}) < g(\hat{Y}) \rbrace | + U \le k \, \right] = \frac{k+1}{M+1}. \end{aligned}$$

Combining this with the bound above, the probability of observing the unsafe \(\hat{Z}\) without an alert is at most \(\epsilon + 1/(M+1) = \epsilon + 1/(1+|{\mathcal {A}}|)\), which is the claimed guarantee.
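
To make the mechanics concrete, the following is a minimal Python sketch of this style of conformal calibration (a sketch only: the helper names calibrate_threshold and should_alert, the convention that larger scores \(g\) indicate more danger, and the exact quantile index are our illustrative assumptions, not a transcription of Algorithm 1).

```python
import numpy as np

def calibrate_threshold(unsafe_scores, epsilon):
    """Choose an alert threshold from the scores g(Y) of M unsafe calibration examples.

    Under exchangeability, alerting whenever a new score reaches this threshold
    misses at most roughly a fraction epsilon + 1/(M + 1) of unsafe cases,
    mirroring the epsilon + 1/(1 + |A|) guarantee of Proposition 1.
    """
    scores = np.sort(np.asarray(unsafe_scores, dtype=float))
    m = len(scores)
    # Index of an (approximately) epsilon-quantile of the unsafe scores.
    k = int(np.floor(epsilon * (m + 1)))
    k = max(0, min(k, m - 1))
    return scores[k]

def should_alert(score, threshold):
    """Raise a warning when the danger score reaches the calibrated threshold."""
    return score >= threshold

# Usage sketch with hypothetical scores: about 1/epsilon unsafe examples suffice.
rng = np.random.default_rng(0)
unsafe_scores = rng.normal(loc=2.0, scale=1.0, size=100)
tau = calibrate_threshold(unsafe_scores, epsilon=0.05)
print(tau, should_alert(2.5, tau))
```

Note that the guarantee comes entirely from the rank argument above, so the score \(g\) can be any heuristic (for example, the output of a learned model) without affecting validity; only the false positive rate depends on how informative \(g\) is.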

1.2 Lower Bound on the False Positive Rate

Consider a function w that maps a dataset \({\mathcal {D}}= (g(X_1), Y_1), \cdots , (g(X_T), Y_T)\) of unsafe examples, and a new data point \(g({\hat{X}})\), to \(\lbrace 0, 1\rbrace \). We argue that any w that gives a distribution-free false negative rate guarantee should depend only on the ordering between \(g(X_1), \cdots , g(X_T), g({\hat{X}})\), and not on their specific values. In other words, w should take the form defined by

$$\begin{aligned} w({\mathcal {D}}, g({\hat{X}})) = \left\{ \begin{array}{ll} \phi \left( \# \lbrace t : g({\hat{X}}) < g(X_t) \rbrace \right) &{} \text { with probability } \gamma \\ 1 &{} \text { with probability } 1-\gamma \end{array} \right. \end{aligned}$$
(6)

for some deterministic function \(\phi \) and real number \(\gamma \in [0, 1]\). We know that when the data is exchangeable, \(\# \lbrace t : g({\hat{X}}) < g(X_t) \rbrace \) is uniformly distributed on \(\lbrace 0, 1, \cdots , T \rbrace \).
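
As a quick sanity check on this uniformity claim, the following small Monte Carlo sketch (our own illustration, assuming i.i.d. continuous scores, which are in particular exchangeable) verifies that the rank statistic is uniform on \(\lbrace 0, 1, \cdots , T \rbrace \).

```python
import numpy as np

# Empirically check that #{t : g(X_hat) < g(X_t)} is uniform on {0, ..., T}
# when all T + 1 scores are exchangeable (here: i.i.d. standard normal).
rng = np.random.default_rng(0)
T, trials = 9, 200_000
scores = rng.normal(size=(trials, T + 1))
rank = np.sum(scores[:, :T] > scores[:, T:], axis=1)  # count of g(X_t) above g(X_hat)
freq = np.bincount(rank, minlength=T + 1) / trials
print(np.round(freq, 3))  # every entry is close to 1 / (T + 1) = 0.1
```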

Case 1 Suppose \(\phi \) takes the value 0 for at least one possible input. Since the rank statistic \(\# \lbrace t : g({\hat{X}}) < g(X_t) \rbrace \) is uniform over its \(1+T\) possible values and the \(\phi \) branch is taken with probability \(\gamma \), the false negative rate satisfies

$$\begin{aligned} \text {FNR} \ge \gamma /(1+T) \end{aligned}$$
(7)

and, since the system alerts with probability at least \(1-\gamma \) regardless of the input, the false positive rate satisfies

$$\begin{aligned} \text {FPR} \ge 1-\gamma \end{aligned}$$
(8)

so combined we have

$$\begin{aligned} \text {FPR} \ge 1-\gamma \ge 1 - (1+T) \text {FNR} \ge 1 - (1+T) \epsilon \end{aligned}$$
(9)

Case 2 Suppose \(\phi \) takes the value 0 for none of its inputs; then the warning system always alerts, so

$$\begin{aligned} \text {FNR} = 0, \text {FPR} = 1 \end{aligned}$$
(10)

so we would still (trivially) have \(\text {FPR} \ge 1-(1+T)\epsilon \).

So far we have shown that if w takes the form of Eq. (6) and guarantees a false negative rate of at most \(\epsilon \), then its false positive rate must be lower bounded by \(1-(1+T)\epsilon \). In other words, if \(\epsilon = o(1/T)\), the false positive rate tends to 1 as T grows.
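
The snippet below simply tabulates this floor on the false positive rate for a few calibration-set sizes (a purely numerical illustration of Eq. (9); the particular sizes and the value of \(\epsilon \) are arbitrary choices of ours).

```python
# Illustration of the lower bound FPR >= 1 - (1 + T) * epsilon from Eq. (9):
# with too few unsafe samples T relative to 1/epsilon, any distribution-free
# warning system of the form of Eq. (6) is forced to alert almost always.
epsilon = 0.01
for T in [10, 50, 100, 500, 1000]:
    fpr_floor = max(0.0, 1.0 - (1 + T) * epsilon)
    print(f"T = {T:4d}  ->  FPR >= {fpr_floor:.2f}")
# The bound becomes vacuous once T is on the order of 1/epsilon, consistent
# with needing roughly 1/epsilon unsafe samples to avoid constant alerts.
```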

1.3 Additional Experimental Details: Driver Alert System

Safety score: We define the safety score as the Mahalanobis distance between the ego-vehicle and the agent, where the first eigenvector is aligned with the ego-vehicle’s velocity vector and the second eigenvector is orthogonal to it; the scale along the first eigenvector is the magnitude of the velocity, and the scale along the second eigenvector is approximately half of a car width (we use 1 m). Intuitively, this means that agents lying along the ego-vehicle’s velocity vector appear closer than agents at the same Euclidean distance in the perpendicular direction. This metric is similar to time to collision (TTC), but it is continuous, whereas TTC is not: TTC is infinite unless two vehicles are exactly on a collision course.
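
A small Python sketch of one way to compute such a score is given below (the function name safety_score, the planar 2-D setting, and the numerical guard on the speed are our illustrative assumptions; the scales mirror the description above, namely the speed along the velocity direction and 1 m laterally).

```python
import numpy as np

def safety_score(ego_pos, ego_vel, agent_pos, lateral_scale=1.0):
    """Mahalanobis-style distance between the ego-vehicle and another agent.

    The first axis is aligned with the ego velocity and scaled by the speed;
    the second axis is perpendicular and scaled by ~half a car width (1 m).
    Agents ahead along the direction of travel therefore appear closer
    (less safe) than agents the same Euclidean distance away to the side.
    """
    ego_pos, ego_vel, agent_pos = map(np.asarray, (ego_pos, ego_vel, agent_pos))
    speed = max(float(np.linalg.norm(ego_vel)), 1e-6)
    e_long = ego_vel / speed                       # unit vector along velocity
    e_lat = np.array([-e_long[1], e_long[0]])      # perpendicular unit vector
    delta = agent_pos - ego_pos
    d_long = np.dot(delta, e_long) / speed         # whitened longitudinal offset
    d_lat = np.dot(delta, e_lat) / lateral_scale   # whitened lateral offset
    return float(np.hypot(d_long, d_lat))

# Example: an agent 10 m ahead scores 1.0, while one 10 m to the side scores 10.0,
# so the agent ahead "appears closer" even though the Euclidean distances match.
print(safety_score([0.0, 0.0], [10.0, 0.0], [10.0, 0.0]))
print(safety_score([0.0, 0.0], [10.0, 0.0], [0.0, 10.0]))
```

In this setup an alert would presumably be raised when the score drops below a conformally calibrated threshold, since smaller distances correspond to more dangerous configurations.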

Dataset details: The nuScenes dataset includes 952 scenes collected across Boston and Singapore, divided into a 697/105/150 train/val/test split (the same split used for the original Trajectron++). Each scene is 20 s long. The Kaggle Lyft Motion Prediction dataset is a subset of the full Lyft Level 5 dataset (chosen over the full dataset for computational reasons). It includes approximately 16k scenes, divided into a 70%/15%/15% train/val/test split. Each scene is 25 s long. Both datasets include labeled ego-vehicle trajectories as well as labeled detections and trajectories for other agents in the scene. Note that for both of these datasets, because the training split was used to train the Trajectron++ model, we used the validation split as the input training data for Algorithm 1.

Additional experimental results: We demonstrate empirically on the nuScenes dataset that the sum of \(\epsilon \) and the false positive rate must be high when there are few (e.g. fewer than \(1/\epsilon \)) samples, which is consistent with what our theory from Sect. 3.2 would predict. Figure 4 plots the epsilon bound as well as the false negative and false positive rates vs. the number of unsafe samples in the validation dataset; we see that when \(\epsilon \) decreases as 1/T, the false positive rate remains relatively flat and low.

Fig. 4. Epsilon bound, false negative rate, and false positive rate on the nuScenes dataset while varying the number of unsafe samples. Consistent with our theory from Sect. 3.2, the sum of \(\epsilon \) and the false positive rate is high when there are few samples.

We also demonstrate empirically on the Kaggle Lyft dataset that the variance of the false negative rate over different train/test splits is low. Table 1 displays the variance of the false negative rate calculated over the 100 trials at each \(\epsilon \) value. All of the variances are well below 0.003, suggesting that the test sequence false negative rates are clustered around \(\epsilon \) (rather than having some sequences that fail on zero examples and others with catastrophic failures). As further evidence, in Fig. 5 we provide a representative box plot of the false negative rates over the 100 trials with \(\epsilon = 0.04\); the rates are indeed clustered around 0.04.

Table 1. Variance on the test sequence false negative rates at different \(\epsilon \).
Fig. 5. Box plot of the 100 false negative rates calculated over randomized train/test splits with \(\epsilon = 0.04\).

1.4 Additional Experimental Details: Robotic Grasping Experiments

Model and dataset details: The Grasp Quality Convolutional Neural Network (GQ-CNN) from [18] is a model that classifies whether a candidate robotic grasp will be successful. The inputs to a GQ-CNN are a point cloud representation of an object, \(\textbf{y}\), and a candidate grasp, \(\textbf{u}\). A GQ-CNN outputs the predicted probability, \(Q_{\theta }(\textbf{y}, \textbf{u})\), that the candidate grasp will be able to successfully pick and transport the object. We use this predicted probability as the safety score, \(g = Q_{\theta }(\textbf{y}, \textbf{u})\). We consider a candidate grasp “unsafe” if it will not be able to successfully pick the object (i.e. the true label is \(Z = 0\)). Note that this is exactly the ROC curve threshold tuning setup. We use the Dex-Net dataset of synthetic objects grasped with a parallel jaw gripper [18], which includes approximately 500k pick attempts not used in training the GQ-CNN model. These are divided into a 50%/50% train/test split. Each example is labeled a success if the robot successfully picks and places the object, and a failure otherwise.
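
To illustrate how this threshold tuning could look in code, here is a hedged sketch (the function conformal_grasp_threshold, the synthetic Beta-distributed scores, and the random labels are our own stand-ins for the GQ-CNN outputs \(Q_{\theta }\) and the Dex-Net labels; the calibration rule mirrors the conformal quantile idea rather than reproducing the authors' exact implementation).

```python
import numpy as np

def conformal_grasp_threshold(q_unsafe, epsilon):
    """Threshold on the predicted success probability g = Q_theta(y, u).

    q_unsafe holds predicted probabilities for held-out grasps that actually
    failed (label Z = 0). Flagging any grasp whose score is at most the
    returned threshold keeps the false negative rate (failed grasps that are
    not flagged) near epsilon under exchangeability.
    """
    scores = np.sort(np.asarray(q_unsafe, dtype=float))[::-1]  # descending
    k = int(np.floor(epsilon * (len(scores) + 1)))
    k = max(0, min(k, len(scores) - 1))
    return scores[k]                                           # ~(1 - eps) quantile

rng = np.random.default_rng(1)

def synthetic_split(n):
    """Hypothetical (score, label) pairs: successful grasps tend to score higher."""
    labels = rng.integers(0, 2, size=n)            # 1 = successful pick
    scores = np.where(labels == 1,
                      rng.beta(5, 2, size=n),      # successes: high Q
                      rng.beta(2, 5, size=n))      # failures: low Q
    return scores, labels

cal_scores, cal_labels = synthetic_split(5000)
tau = conformal_grasp_threshold(cal_scores[cal_labels == 0], epsilon=0.05)

test_scores, test_labels = synthetic_split(5000)
flagged = test_scores <= tau                       # alert: grasp predicted unsafe
fnr = np.mean(~flagged[test_labels == 0])          # failed grasps not flagged
fpr = np.mean(flagged[test_labels == 1])           # successful grasps flagged
print(f"threshold={tau:.3f}  FNR={fnr:.3f}  FPR={fpr:.3f}")
```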


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Luo, R. et al. (2023). Sample-Efficient Safety Assurances Using Conformal Prediction. In: LaValle, S.M., O’Kane, J.M., Otte, M., Sadigh, D., Tokekar, P. (eds) Algorithmic Foundations of Robotics XV. WAFR 2022. Springer Proceedings in Advanced Robotics, vol 25. Springer, Cham. https://doi.org/10.1007/978-3-031-21090-7_10
