1 Introduction

In this paper, we focus on building a trustworthy detector of railway light signals via machine learning (ML). We apply uncertainty quantification (UQ) to the problem of object detection (OD) [21] via the distribution-free, non-asymptotic, and model-agnostic framework of conformal prediction (CP). CP is computationally lightweight, can be applied to any (black-box) predictor with minimal hypotheses and effort, and has rigorous statistical guarantees.

We give a brief overview of UQ for OD and the CP method we apply to our use case. After selecting a pre-trained model among the state-of-the-art architectures, we give some insights on how UQ techniques can quantify the trustworthiness of OD models, which could be part of a critical artificial intelligence (AI) system, and their potential role in certifying such technologies for future industrial deployment.

2 The industrial problem

AI can work as a complementary tool to enhance existing technologies, as in the automotive industry [1]. We can draw a parallel with the railway sector. While main lines (e.g., high-speed lines) already have in-cab signaling and can be automated [19], this is too costly to apply to the whole network. Consequently, on secondary lines, drivers can be subject to a larger cognitive load, and therefore strain, from signals and the environment. Assisting drivers with AI-based signal recognition could facilitate operations.

With respect to Fig. 1, our application corresponds to point (1) of a process based on multiple ML tasks: trustworthiness concerns grow with the number of ML components. For a recent overview of the technical and regulatory challenges raised by the safety of ML systems in the railway and automotive industries, see Alecu et al. [1].

Fig. 1 Example of pipeline where an AI system acts following ML-based predictions. Source: Alecu et al. [1]

2.1 Building an exploratory dataset

Currently, there is no standard benchmark for railway signaling detection. Building on the FRSign dataset of Harb et al. [12] and the insights they provide, our own iterations made us aware of some additional needs. Notably, our new dataset aims for images that are independent and identically distributed (i.i.d.); therefore, raw sequences cannot be considered. Moreover, our new dataset features increased variability (more railway lines, environmental and weather conditions, etc.), which could enable more accurate predictions in real-world scenarios. Finally, we generalize the task from single- to multi-object detection, laying the foundations for future work on instance segmentation.

We would expect a deployable system to work whenever a human operator is successful: at night, in the rain, or even when the signals are partially occluded by foliage. Our exploratory dataset includes different lighting conditions, but it is far from exhaustive; these issues will be taken into account in future iterations of the dataset.

Table 1 Characteristics of our dataset

As source data, we used video footage taken on French railway lines, with the approval of the creators. To extract the samples, we proceeded as follows. From 25 videos with an average duration of about 1.5 h, we extracted on average 55 images per video by running a pre-trained object detector with a low objectness threshold, keeping a minimum interval of 5 s between detections to prevent excessive dependence between images. We kept the images but discarded the associated detections, and then manually annotated all visible railway signals. Table 1 reports the statistics of our dataset.
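As an illustration, a minimal sketch of this sampling loop is given below; the `detector` callable, its output format, and the OpenCV-based frame reading are hypothetical placeholders rather than the exact pipeline used to build the dataset.

```python
import cv2  # hypothetical choice of video library for this sketch

MIN_INTERVAL_S = 5.0        # minimum spacing between kept frames
OBJECTNESS_THRESHOLD = 0.1  # deliberately low, to favor recall

def sample_frames(video_path, detector):
    """Keep frames where a pre-trained detector fires, at least
    MIN_INTERVAL_S seconds apart, discarding the detections themselves."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # guard against missing metadata
    kept, last_kept_t, frame_idx = [], -float("inf"), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = frame_idx / fps
        if t - last_kept_t >= MIN_INTERVAL_S:
            boxes, scores = detector(frame)  # hypothetical signature
            if len(scores) > 0 and max(scores) >= OBJECTNESS_THRESHOLD:
                kept.append(frame)  # image only; boxes are dropped
                last_kept_t = t
        frame_idx += 1
    cap.release()
    return kept
```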

3 Related works

Among OD architectures, we point out YOLO [18] and its variants: they propose one-stage detection combining convolutional layers with regression and classification tasks. Howard et al. [13], with MobileNets, introduced the concept of depth-wise separable convolutions to reduce the number of parameters and accelerate inference. As for the recent introduction of transformer layers, we find DETR [5] and ViT [9], among others. These networks reach state-of-the-art performance and seem well-suited to transfer learning. Finally, DiffusionDet by Chen et al. [6] recently formulated OD as a denoising diffusion problem, from random noisy boxes to real objects, with state-of-the-art performance.

3.1 Uncertainty quantification in object detection

UQ is an important enabler for deploying OD in transport systems. We find probabilistic OD [10], where the probability distributions of the bounding boxes and classes are predicted; Bayesian models, as in Harakeh, Smart, and Waslander [11], and Bayesian approximations such as Monte Carlo dropout (Deepshikha et al.) are also found in the literature. We point out the distribution-free approach of Li et al. [15]: they build probably approximately correct prediction sets via concentration inequalities, estimated on a held-out calibration set. They control the coordinates of the boxes, but also the proposal and objectness scores, resulting in more and larger boxes. Finally, de Grancey et al. [7] proposed an extension of CP to OD, which will be the framework of choice in our exploratory study.

4 Conformalizing object detection

Conformal prediction (CP) [2, 20] is a family of methods to perform UQ with guarantees under the sole hypothesis of data being independent and identically distributed (or more generally exchangeable). For a specified (small) error rate \(\alpha \in (0,1)\), at inference, the CP procedure will yield prediction sets \(C_{\alpha }(X)\) that contain the true target values Y with probability:

$$\begin{aligned} \mathbb {P} \big ( Y_{new} \in C_{\alpha }(X_{new}) \big ) \ge 1 - \alpha . \end{aligned}$$
(1)

This guarantee holds true, on average, over all images at inference and over many repetitions of the CP procedure. However, it is valid for any distribution of the data \(\mathbb {P}_{XY}\), any sample size and any predictive model \(\widehat{f}\), even if it is misspecified or a black box. The probability \(1-\alpha\) is the nominal coverage and the empirical coverage on \(n_{\text {test}}\) points is \(\sum _{i=1}^{n_{\text {test}}} \mathbb {I}\{Y_i \in {C}_{\alpha }(X_i)\} / n_{\text {test}}\).

We focus on split CP [14, 17], where one needs a set of calibration data \(D_{cal}\) drawn from the same \(\mathbb {P}_{XY}\) as the test data, with no need to access the training data. At conformalization, we compute the nonconformity scores, to quantify the uncertainty on held-out data. At inference, CP adds a margin around the bounding box predicted by a pre-trained detector \(\widehat{f}\).

4.1 Split conformal object detection

We follow de Grancey et al. [7]. Let \(k=1, \dots , n_{box}\) index every ground-truth box in \(D_{cal}\) that was detected by \(\widehat{f}\), disregarding the source image. Let \(Y^{k} = (x_{\text {min}}^{k}, y_{\text {min}}^{k}, x_{\text {max}}^{k}, y_{\text {max}}^{k})\) be the coordinates of the k-th box and \(\widehat{Y}^{k} = (\hat{x}_{\text {min}}^{k}, \hat{y}_{\text {min}}^{k}, \hat{x}_{\text {max}}^{k}, \hat{y}_{\text {max}}^{k})\) its prediction.

In de Grancey et al. [7], their nonconformity score, which we refer to as additive, is defined as:

$$\begin{aligned} \small R_{k} = ( \hat{x}_{\text {min}}^{k} - {x}_{\text {min}}^{k}, \hat{y}_{\text {min}}^{k} - {y}_{\text {min}}^{k}, {x}_{\text {max}}^{k} - \hat{x}_{\text {max}}^{k}, {y}_{\text {max}}^{k} - \hat{y}_{\text {max}}^{k}). \end{aligned}$$
(2)

We further define the multiplicative one as:

$$\begin{aligned} \small R_{k} = \Big ( \frac{\hat{x}_{\text {min}}^{k} - {x}_{\text {min}}^{k}}{\widehat{w}^{k}} , \frac{\hat{y}_{\text {min}}^{k} - {y}_{\text {min}}^{k}}{\widehat{h}^{k}} , \frac{ {x}_{\text {max}}^{k} - \hat{x}_{\text {max}}^{k}}{\widehat{w}^{k}}, \frac{ {y}_{\text {max}}^{k} - \hat{y}_{\text {max}}^{k}}{\widehat{h}^{k}} \Big ), \end{aligned}$$
(3)

where the prediction errors are scaled by the predicted width \(\widehat{w}^{k}\) and height \(\widehat{h}^{k}\). This is similar to Papadopoulos et al. [17], and a natural extension of de Grancey et al. [7].
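For concreteness, a minimal NumPy sketch of both scores (Equations (2) and (3)) on already-paired boxes is shown below; the array layout (one (x_min, y_min, x_max, y_max) row per box) and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nonconformity_scores(gt_boxes, pred_boxes, multiplicative=False):
    """Box-wise scores for paired ground-truth/predicted boxes,
    each array of shape (n_box, 4) in (x_min, y_min, x_max, y_max) order."""
    x_min, y_min, x_max, y_max = gt_boxes.T
    px_min, py_min, px_max, py_max = pred_boxes.T
    # A positive score means the prediction does not fully cover the truth
    # on that side.
    scores = np.stack([px_min - x_min,
                       py_min - y_min,
                       x_max - px_max,
                       y_max - py_max], axis=1).astype(float)
    if multiplicative:
        w = px_max - px_min  # predicted widths
        h = py_max - py_min  # predicted heights
        scores = scores / np.stack([w, h, w, h], axis=1)
    return scores
```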

Split conformal object detection goes as follows (a minimal sketch of steps 3 and 4 is given after the list):

1. Choose a nonconformity score, e.g., Equation (2) or (3);

2. Run a pairing mechanism to associate predicted boxes with true boxes (see the following paragraphs);

3. For every coordinate \(c \in \{x_\text {min},y_\text {min},x_\text {max},y_\text {max}\}\), let \(\bar{R}^{c} = \left( R_{k}^{c}\right) _{k=1}^{n_{box}}\);

4. Compute \(q_{1 - \frac{\alpha }{4}}^{c}\) as the \(\lceil (n_{box}+1)(1-\frac{\alpha }{4}) \rceil\)-th element of the sorted sequence \(\bar{R}^{c}\), for all \(c \in \{x_\text {min},y_\text {min},x_\text {max},y_\text {max}\}\).
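A minimal NumPy sketch of steps 3 and 4 follows, assuming the score array produced by the previous snippet; names are illustrative, and the \(1-\frac{\alpha }{4}\) level is motivated in the next paragraph.

```python
import numpy as np

def conformal_quantiles(scores, alpha=0.1):
    """Per-coordinate conformal quantiles at level 1 - alpha/4 (Bonferroni).
    `scores` has shape (n_box, 4), one column per coordinate."""
    n_box = scores.shape[0]
    # Rank of the conformal quantile: ceil((n+1) * (1 - alpha/4)).
    rank = int(np.ceil((n_box + 1) * (1.0 - alpha / 4.0)))
    rank = min(rank, n_box)  # guard for very small calibration sets
    sorted_scores = np.sort(scores, axis=0)
    return sorted_scores[rank - 1, :]  # one quantile per coordinate
```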

Since we work with four coordinates, for statistical reasons, we adjust the quantile order from \((1-\alpha )\) to \((1- \frac{\alpha }{4})\) using the Bonferroni correction. We conformalize box-wise: we want to be confident that, when we correctly detect an object (“true positive”), we capture the entire ground-truth box with a frequency of at least \((1-\alpha )\), on average. During calibration (step 2 above), for all the true positive predicted boxes (\(\widehat{Y}^{TP}_{cal}\)), we compute the nonconformity score between the true box and the prediction. The pairing mechanism is the same as the NMS used in OD: for each ground-truth bounding box, the predicted bounding boxes (not already assigned to a ground truth) are tested in decreasing order of their confidence scores, and the first predicted box with an IoU above a set threshold is assigned to the ground-truth box (a sketch is given below). Note that while building \(\widehat{Y}^{TP}_{cal}\), we do not consider false negatives (objects missed by \(\widehat{f}\)), which cannot be taken care of by box-wise CP.
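A minimal sketch of this greedy pairing is given below; the IoU helper and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def pair_boxes(gt_boxes, pred_boxes, pred_scores, iou_threshold=0.5):
    """Greedy pairing: each ground truth gets the highest-confidence,
    not-yet-assigned prediction with IoU above the threshold."""
    order = np.argsort(-np.asarray(pred_scores))  # decreasing confidence
    assigned, pairs = set(), []
    for g, gt in enumerate(gt_boxes):
        for j in order:
            if j not in assigned and iou(gt, pred_boxes[j]) >= iou_threshold:
                assigned.add(j)
                pairs.append((g, j))  # (ground-truth index, prediction index)
                break
    return pairs
```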

At inference, the additive split conformal object detection prediction box is built as:

$$\begin{aligned} \widehat{C}_{\alpha }(X) = \{ \widehat{x}_{\text {min}}&- q_{1-\frac{\alpha }{4}}^{x_\text {min}},\, \widehat{y}_{\text {min}} - q_{1-\frac{\alpha }{4}}^{y_\text {min}}, \\ \widehat{x}_{\text {max}}&+ q_{1-\frac{\alpha }{4}}^{x_\text {max}},\, \widehat{y}_{\text {max}} + q_{1-\frac{\alpha }{4}}^{y_\text {max}} \}. \end{aligned}$$
(4)

The multiplicative conformal prediction box is:

$$\begin{aligned} \widehat{C}_{\alpha }(X) = \{ \widehat{x}_{\text {min}}&- \widehat{w} \cdot q_{1-\frac{\alpha }{4}}^{x_\text {min}},\, \widehat{y}_{\text {min}} - \widehat{h} \cdot q_{1-\frac{\alpha }{4}}^{y_\text {min}}, \\ \widehat{x}_{\text {max}}&+ \widehat{w} \cdot q_{1-\frac{\alpha }{4}}^{x_\text {max}},\, \widehat{y}_{\text {max}} + \widehat{h} \cdot q_{1-\frac{\alpha }{4}}^{y_\text {max}} \}. \end{aligned}$$
(5)
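As a sketch, applying Equations (4) and (5) at inference amounts to enlarging each predicted box by the calibrated margins; the code below assumes the quantile vector and box layout from the earlier snippets.

```python
import numpy as np

def conformalize_boxes(pred_boxes, quantiles, multiplicative=False):
    """Enlarge predicted boxes of shape (n, 4) with the per-coordinate
    margins q = (q_xmin, q_ymin, q_xmax, q_ymax) from calibration."""
    px_min, py_min, px_max, py_max = pred_boxes.T
    q = np.asarray(quantiles, dtype=float)
    if multiplicative:
        w, h = px_max - px_min, py_max - py_min
        margins = np.stack([w * q[0], h * q[1], w * q[2], h * q[3]], axis=1)
    else:
        margins = np.broadcast_to(q, pred_boxes.shape)
    signs = np.array([-1.0, -1.0, 1.0, 1.0])  # expand the box outward
    return pred_boxes + signs * margins
```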

5 Experiments

We split our dataset into three subsets: validation, calibration, and test (of size 300, 700, and 395, respectively). Using the validation set, we compare the performance of YOLOv5, DETR and DiffusionDet, pre-trained on COCO [16], as candidate base predictors \(\widehat{f}\), restricting the detection to the class “traffic light”. Commonly, OD models are evaluated using the average precision (AP), which is the area under the precision–recall curve. AP incorporates the precision–recall trade-off, and the best value is reached when precision and recall are maximized for all objectness thresholds. In our application (Table 2), the AP for YOLOv5 is low, DETR gives better results, and DiffusionDet is significantly superior to the others. We therefore use the DiffusionDet model for our conformalization. However, AP metrics alone do not give a complete picture when selecting an OD architecture.

Table 2 Comparing models via AP for two IoU threshold levels
Fig. 2 Bounding boxes as predicted by the ML predictor (blue), their conformalized version with additive scores (green) and the ground truth (red). Example of images used in conformalization and testing, cropped for readability

5.1 Detector evaluation with box-wise conformalization

We evaluate OD performance (Table 3) via box-wise conformal prediction, reporting the estimated quantiles \(q_{1-\frac{\alpha }{4}}^{c}\) at the desired risk level. A higher quantile reveals a higher uncertainty of the OD predictions. In order to compare additive and multiplicative quantiles, which are qualitatively different, we report the stretch: the ratio of the areas of the conformalized and predicted boxes. We display examples of additive and multiplicative conformalized boxes in Fig. 2 and Fig. 3, respectively.
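As a minimal sketch, under the same box layout as before, the stretch of a single box can be computed as follows; the function name is illustrative.

```python
def stretch(pred_box, conf_box):
    """Ratio of the conformalized box area to the predicted box area,
    boxes given as (x_min, y_min, x_max, y_max)."""
    pred_area = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    conf_area = (conf_box[2] - conf_box[0]) * (conf_box[3] - conf_box[1])
    return conf_area / pred_area
```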

Table 3 Conformalization margins on DiffusionDet

6 Results

Empirically, CP works as expected: in Table 4, we see that the coverage after conformalization is close to the nominal level of \((1 - \alpha ) = 0.9\). Remark that the guarantee holds on average over multiple repetitions; hence, we cannot expect to attain exactly the nominal coverage with just one dataset.
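For reference, a minimal sketch of how this empirical coverage can be computed on paired test boxes is given below, where “covered” means the conformalized box entirely contains the ground-truth box, in line with the box-wise criterion above.

```python
import numpy as np

def empirical_coverage(gt_boxes, conf_boxes):
    """Fraction of ground-truth boxes entirely contained in their
    conformalized boxes; both arrays have shape (n, 4)."""
    covered = ((conf_boxes[:, 0] <= gt_boxes[:, 0]) &
               (conf_boxes[:, 1] <= gt_boxes[:, 1]) &
               (conf_boxes[:, 2] >= gt_boxes[:, 2]) &
               (conf_boxes[:, 3] >= gt_boxes[:, 3]))
    return covered.mean()
```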

Table 4 Empirical coverage before and after conformalization, nominal coverage: \(1 - \alpha = 0.90\)
Fig. 3 Bounding boxes as predicted by the ML predictor (blue), their conformalized version with multiplicative scores (green) and the ground truth (red). Example of images used in conformalization and testing, cropped for readability

6.1 Interpretation of conformal bounding boxes

A conformalized predictor is only as good as its base predictor \(\widehat{f}\). If the latter misses many ground-truth boxes, guaranteeing that \((1 - \alpha ) \, 100 \%\) of the few detected boxes are correct still yields only a small number of covered objects. That is, conformalization is not a substitute for careful training and fine-tuning of a detection architecture, but a complementary tool for certification.

The interest of capturing the whole box can be operational: our ML pipeline could rely on a conservative estimation of the ground truth to carry out control operations (e.g., running an ML subcomponent on the detection area).

7 Conclusion and perspectives

Given the insights from this exploratory study, we plan to build and publish an augmented version of the dataset. The objective is to provide a dedicated, high-quality benchmark for the scientific community and the transport industry. As mentioned above, CP works with exchangeable data. In the long term, if trustworthy AI components are to be deployed, the UQ guarantees will need to be adapted to streams of data: this will pose both a theoretical challenge and a challenge in the construction and validation of the dataset.

So far, the underlying criterion for successful prediction has been whether the ground-truth box is entirely covered by the predicted box. This is strict: a system that guarantees covering a large part of the ground truth would seem equally useful in practice. Bates et al. [4], with their risk-controlling prediction sets, and Angelopoulos et al. [3], with conformal risk control, go in this direction. They extend the guarantee of CP to arbitrary losses that can incorporate other operational needs.