
1 Introduction

Traffic signs are important in both city and highway driving, supporting road safety and managing the flow of traffic. Therefore, traffic sign classification (recognition) is an integral part of any vision system for autonomous driving. It consists of: a) isolating the traffic sign in a bounding box, and b) classifying the sign into a specific traffic class. This work focuses on the second task.

Building a traffic sign classifier is challenging, as it needs to cope with complex real-world traffic scenes. A well-known problem of such classifiers is the lack of robustness to adversarial examples [29] and to occlusions [30]. Adversarial examples are traffic signs taken as input which produce erroneous outputs; together with occlusions, they occur naturally because traffic scenes are unique in terms of weather conditions, lighting, and aging.

One way to alleviate the lack of robustness is to formally verify that the trained classifier is robust to adversarial and occluded examples. For constructing the trained model, binarized neural networks (BNNs) have shown promising results [14], even on computationally limited and energy-constrained devices such as those appearing in the context of autonomous driving. BNNs are neural networks (NNs) whose weights and/or activations are binarized and constrained to \(\pm 1\). Compared to full-precision NNs, they reduce the model size and simplify the convolution operations used in image recognition tasks.

Our long-term goal, which also motivated this work, is to give formal guarantees that properties (e.g., robustness) hold for a trained classifier. The formal verification problem is formulated as follows: given a trained model and a property to be verified for the model, does the property hold for that model? To answer it, the model and the property are translated into a constraint satisfaction problem which can, in principle, be solved with existing tools [22]. However, the problem is NP-complete [17] and thus, in practice, beyond the reach of general-purpose tools.

This work attempts to arrive at BNN architectures specifically for traffic sign recognition by carrying out an extensive study of the variation in accuracy, model size, and number of parameters of the produced architectures. In particular, we are interested in BNN architectures with high accuracy and small model size, so as to be suitable for computationally limited and energy-constrained devices, and, at the same time, with a reduced number of parameters, so as to make the verification task easier. A bottom-up approach is adopted to design the architectures by studying the characteristics of the constituent layers of the internal blocks. These constituent layers are studied in various combinations and with different values of the kernel size, the number of filters, and the number of neurons, using the German Traffic Sign Recognition Benchmark (GTSRB) for training. For testing, similar images from GTSRB, as well as from the Belgian and Chinese datasets, were used.

As a result of this study, we propose network architectures (see Sect. 6) which achieve an accuracy of more than \(90\%\) for GTSRB [13] and an average accuracy greater than \(80\%\) when also considering the Belgian [1] and Chinese [3] datasets, and whose number of parameters varies from 100k to 2M.

2 Related Work

Traffic Sign Recognition Using CNNs. Traffic sign recognition (TSR) consists in predicting a label for the input based on a series of features learned by the trained classifier. CNNs have been used for traffic sign classification for a long time [8, 27]. These works used GTSRB [13], which is still maintained and used on a large scale today. Paper [8] obtained an accuracy of 99.46% on the test images, which is better than the human performance of 98.84%, while [27], with 98.31%, was very close. These accuracies were obtained either by modifying traditional models for image recognition (e.g., ResNet [27]) or by proposing new ones (e.g., a multi-column deep neural network [8]). The architecture from [8] (see Fig. 1) contains a much higher number of parameters than the models trained by us and is not amenable to verification, even if its convolutional layers were quantized. The work of [8] is still the state of the art for TSR using CNNs.

Fig. 1. Architecture for recognizing traffic signs [8]. Image size: 48 \(\times \) 48 px.

Binarized Neural Networks Architectures. Quantized neural networks (QNNs) are neural networks that represent their weights and activations using low-bit integer variables. There are two main strategies for training QNNs: post-training quantization and quantization-aware training (QAT) [18]. The drawback of post-training quantization is that it typically results in a drop in the accuracy of the network, with a magnitude that depends on the specific dataset and network architecture. In our work, we use the second approach, which is implemented in the Larq library [11]. In QAT, the imprecision of the low-bit fixed-point arithmetic is modeled already during the training process, i.e., the network can adapt to quantized computation during training. The challenge for QNNs is that they cannot be trained directly with stochastic gradient descent (SGD) like classical NNs. This was solved by the straight-through estimator (STE) approach [15], which, in the forward pass of a training step, applies rounding operations to the computations involved in the QNN (i.e., weights, biases, and arithmetic operations), while in the backward pass the rounding operations are removed so that the error can backpropagate through the network.
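To make the STE idea concrete, the following is a minimal TensorFlow sketch of sign binarization with a straight-through gradient. It only illustrates the principle discussed above; it is not Larq's actual implementation, which provides an equivalent quantizer out of the box.

```python
import tensorflow as tf

@tf.custom_gradient
def ste_sign(x):
    """Forward: binarize to +/-1. Backward: pass the gradient straight through."""
    y = tf.where(x >= 0, tf.ones_like(x), -tf.ones_like(x))

    def grad(dy):
        # The rounding (sign) operation is ignored in the backward pass;
        # the gradient is passed through, clipped to the region |x| <= 1.
        return dy * tf.cast(tf.abs(x) <= 1.0, dy.dtype)

    return y, grad
```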

BinaryConnect [9] is one of the first works which uses 1-bit quantization of the weights during forward and backward propagation, but not during the parameter update, in order to maintain accurate gradient calculations during SGD. We note that the models used in conjunction with BinaryConnect use only linear layers, which is sufficient for the MNIST [20] dataset, but convolutional layers for CIFAR-10 [19] and SVHN [24]. Paper [14] binarizes the activations as well. Similarly, for the MNIST dataset they use linear layers, while for CIFAR-10, SVHN, and ImageNet [10] they use ConvNet variants inspired by VGG [28], with binarized activations.

In XNOR-Net [25], both the weights and the inputs to the convolutional and fully connected layers are approximated with binary values, which allows an efficient implementation of the convolution operations. The paper uses the ImageNet dataset in its experiments. We use XNOR-Net architectures in our work, but for a new dataset, namely traffic signs.

Research on BNNs for traffic sign detection and recognition is scarce. Paper [7] uses binarized versions of RetinaNet [21] and ITA [6] for traffic sign detection in a first phase, followed by recognition. In contrast, we focus only on recognition, hence the architectures used have different underlying principles.

Verification of Neural Networks. Properties of neural networks are subject to verification. The latest verification competition includes various benchmarks [2]; however, none of them involves traffic signs. We believe this is because a model with reasonable accuracy for the classification task must contain convolutional layers, which leads to an increase in the number of parameters. To the best of our knowledge, there is only one paper which deals with a traffic sign dataset [12], namely GTSRB. However, they considered only subsets of the dataset, and their trained models consist only of fully connected layers, with the number of ReLU activation functions ranging from 70 to 1300. They do not mention the accuracy of their trained models. BNNs [5, 23] are also subject to verification, but we did not find works involving traffic sign datasets.

3 Binarized Neural Networks

A BNN [14] is a feedforward network whose weights and activations are mainly binary. The work [23] describes BNNs as a sequential composition of blocks, each block consisting of linear and non-linear transformations. One can distinguish between internal and output blocks.

There are typically several internal blocks. The layers of the blocks are chosen such that the resulting architecture fulfills requirements such as accuracy, model size, and number of parameters. Typical layers in an internal block are: 1) linear transformation (LIN), 2) binarization (BIN), 3) max pooling (MP), 4) batch normalization (BN). The linear transformation of the input vector can be based on a fully connected layer or a convolutional layer. In our case it is a convolutional layer, since our experiments have shown that a fully connected layer cannot synthesize the features of traffic signs well, and therefore the accuracy is low. The linear transformation is followed either by a binarization or a max pooling operation. Max pooling helps in reducing the number of parameters. One could swap binarization with max pooling; the result would be the same. We use this order because Larq [11], the library used in our experiments, implements convolution and binarization in the same function. Finally, scaling is performed with a batch normalization operation [16].

There is one output block which produces the predictions for a given image. It consists of a dense layer that maps its input to a vector of integers, one for each output label class. It is followed by an argmax function, which outputs the index of the largest entry of this vector as the predicted label.

We observe that, if the MP and BN layers are omitted, then the inputs and outputs of the internal blocks are binary, and consequently so is the input to the output block. The input of the first block is never binarized, as binarizing it drastically decreases the accuracy.
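To make the block structure concrete, below is a minimal Keras/Larq sketch of a BNN with two internal blocks (QConv, MP, BN) and a dense output block. The filter counts, kernel sizes, and input shape are illustrative placeholders, not one of the architectures proposed later in the paper; following the usual Larq convention, the first layer binarizes only its weights, so the input image itself is not binarized.

```python
import tensorflow as tf
import larq as lq

# Binarization settings shared by the fully binarized layers (Larq's STE sign).
bin_kwargs = dict(input_quantizer="ste_sign",
                  kernel_quantizer="ste_sign",
                  kernel_constraint="weight_clip",
                  use_bias=False)

model = tf.keras.Sequential([
    # First internal block: the input image is not binarized, only the weights.
    lq.layers.QuantConv2D(32, 3, kernel_quantizer="ste_sign",
                          kernel_constraint="weight_clip", use_bias=False,
                          input_shape=(30, 30, 3)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.BatchNormalization(),

    # Second internal block: both activations and weights are binarized.
    lq.layers.QuantConv2D(64, 3, **bin_kwargs),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.BatchNormalization(),

    # Output block: one output per class; the predicted label is the argmax.
    tf.keras.layers.Flatten(),
    lq.layers.QuantDense(43, **bin_kwargs),
])
```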

4 Datasets and Experimental Setting

We use GTSRB [4] for training and testing the various BNN architectures. These architectures were also tested on the Belgian [1] and Chinese [3] datasets.

GTSRB is a multi-class, single-image dataset. It consists of images of German road signs in 43 classes, ranging in size from 25 \(\times \) 25 to 243 \(\times \) 225 pixels, and not all of them are square. Each class comprises 210 to 2250 images, including prohibitory signs, danger signs, and mandatory signs. The training folder contains 39209 images; the remaining 12630 images are used as the testing set. For training and validation, an 80:20 split was applied to the images in the training folder. GTSRB is a challenging dataset even for humans, due to perspective changes, shade, color degradation, and lighting conditions, to name just a few factors.

The Belgian Traffic Signs dataset is divided into two folders, training and testing, comprising in total 7095 images of 62 classes, out of which only 23 match the ones from GTSRB. The testing folder contains only a few images for each of the remaining classes; hence, we have used only the images from the training folder, 4533 in total. The Chinese Traffic Signs dataset contains 5998 traffic sign images for testing, belonging to 58 classes, out of which only 15 match the ones from GTSRB. For our experiments, we performed the following pre-processing steps on the Belgian and Chinese datasets, without which the accuracy of the trained model would be very low: 1) we relabeled the classes from the Belgian, respectively Chinese, dataset such that their common classes with GTSRB have the same label, and 2) we eliminated the classes not appearing in GTSRB.
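A minimal sketch of this relabeling and filtering step is given below. The mapping values are purely hypothetical placeholders; the real mapping covers the 23 (respectively 15) classes shared with GTSRB.

```python
# Hypothetical mapping from Belgian (or Chinese) class ids to GTSRB class ids.
# The entries below are placeholders, not the real correspondence.
SHARED_CLASSES = {17: 14, 19: 13, 22: 25}

def relabel_and_filter(images, labels, mapping=SHARED_CLASSES):
    """Keep only images whose class also exists in GTSRB and remap their labels."""
    kept = [(img, mapping[lab]) for img, lab in zip(images, labels) if lab in mapping]
    if not kept:
        return [], []
    new_images, new_labels = zip(*kept)
    return list(new_images), list(new_labels)
```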

In the end, for testing, we used 1818 images from the Belgian dataset and 1590 from the Chinese dataset.

For this study, the following points are taken into consideration.

1. Training of the networks is done on an Intel Iris Plus Graphics 650 GPU using Keras v2.10.0, TensorFlow v2.10.0 and Larq v0.12.2.

2. From the open-source Python library Larq [11], we used the function QuantConv2D in order to binarize the convolutional layers, except the first. Subsequently, we denote it by QConv. The bias is set to False, as we observed that this does not negatively influence the accuracy, while it reduces the number of parameters.

3. The input shape is fixed to either \(30 \times 30\), \(48 \times 48\), or \(64 \times 64\) (px \(\times \) px). Due to lack of space, most of the experimental results included are for \(30 \times 30\); however, all the results are available at https://github.com/apostovan21/BinarizedNeuralNetwork.

4. Unless otherwise stated, the number of epochs used in training is 30.

5. Throughout the paper, for max pooling, the kernel is fixed to a non-overlapping \(2 \times 2\) dimension.

6. Accuracy is measured with variation in the number of layers, the kernel size, the number of filters, and the number of neurons of the internal dense layer. The combinations considered draw on the following values: (a) number of blocks: 2, 3, 4; (b) kernel size: 2, 3, 5; (c) number of filters: 16, 32, 64, 128, 256; (d) number of neurons of the internal dense layer: 0, 64, 128, 256, 512, 1024.

7. ADAM is chosen as the default optimizer for this study, as it is reported to be the best overall choice for the initial training of deep learning networks [26]. A minimal training sketch under these settings is given after this list.
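For concreteness, the following is a minimal training sketch under the settings above; the batch size and the loss function are our assumptions, as they are not stated in this paper, and the sketch presumes a `model` built as in Sect. 3 and GTSRB training data loaded into `x_train` and `y_train`.

```python
import tensorflow as tf

# Assumed context: `model` is a Keras/Larq BNN (see Sect. 3) and
# (x_train, y_train) holds the GTSRB training images with integer labels.
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(
    x_train, y_train,
    validation_split=0.2,  # the 80:20 train/validation ratio from Sect. 4
    epochs=30,             # default number of epochs used in this study
    batch_size=64,         # assumption: the batch size is not reported
)
```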

The following section discusses the systematic progression of the study.

5 Proposed Methodology

We recall that the goal of our work is to obtain a set of BNN architectures with high accuracy but, at the same time, with a small number of parameters, for the scalability of formal verification. To this aim, we proceed in two steps. First, we propose two simple XNOR architectures with two internal blocks each (Sect. 5.1). We train them on a set of images from the GTSRB dataset and test them on similar images from the same dataset. We learned that MP drastically reduces the accuracy, while the composition of convolutional and binarization layers (QConv) learns the features of traffic sign images well. In Sect. 5.2.1, we restore the accuracy lost by adding a BN layer after the MP one. At the same time, we try to increase the accuracy of the architecture composed only of QConv-layer blocks by adding a BN layer after the QConv layer.

Second, based on the lessons learned in Sects. 5.1 and 5.2.1, as well as on the fact that a higher number of internal layers typically increases the accuracy, we propose several architectures (Sect. 5.2.2). Notable are those with an accuracy greater than \(90\%\) for GTSRB and an average accuracy greater than \(80\%\) when also considering the Belgian and Chinese datasets, and for which the number of parameters varies from 100k to 2M.

5.1 XNOR Architectures

We consider the two XNOR architectures from Fig. 2. Each is composed of two internal blocks and an output dense (fully connected) layer. Note that these architectures have only binary parameters. For GTSRB, the results are in Table 1. One can observe that a simple XNOR architecture gives an accuracy of at least \(70\%\) as long as MP layers are not present, but the number of parameters and the model size are high. We can conclude that QConv synthesizes the features well; however, MP layers reduce the accuracy tremendously.

Fig. 2. XNOR architectures

Table 1. XNOR(QConv) and XNOR(QConv, MP) architectures. Image size: 30px  \(\times \)  30px. Dataset for train and test: GTSRB.

5.2 Binarized Neural Architectures

5.2.1 Two Internal Blocks

As shown in Table 1, the number of parameters of an architecture with MP layers is at least 15 times smaller than that of one without, while the size of the binarized models is approximately 30 times smaller than that of the 32-bit equivalents. Hence, to benefit from these two sweet spots, we propose a new architecture (see Fig. 3b) which adds a BN layer in the second block of the XNOR architecture from Fig. 2b. The increase in accuracy is considerable (see Table 2). However, a BN layer following a binarized convolution (see Fig. 3a) typically leads to a decrease in accuracy (see Table 3). The BN layer introduces a few real-valued parameters into the model, as well as a slight increase in the model size, since only one BN layer was added. Note that the architectures from Fig. 3 are not XNOR architectures.

5.2.2 Several Internal Blocks

Based on the results obtained in Sects. 5.1 and 5.2.1, we first trained an architecture in which each internal block contains a BN layer only after the MP layer (see Fig. 4a). This is motivated by the results from Table 2 (the BN layer after MP is crucial for accuracy) and Table 3 (a BN layer after QConv degrades the accuracy). There is an additional internal dense layer, for which the number of neurons varies in the set \(\{64, 128, 256, 512, 1024\}\). The results are in Table 4. One can observe that the conclusions drawn from the two-block architecture do not persist. Hence, motivated also by [14], we propose the architecture from Fig. 4b.

Fig. 3. BNN architectures which are not XNOR

Table 2. XNOR(QConv, MP) enhanced. Image size: 30px \(\times \)30px. Dataset for train and test: GTSRB.
Fig. 4. Binarized neural architectures

6 Experimental Results and Discussion

The best accuracy for the GTSRB and Belgian datasets is 96.45% and 88.17%, respectively, and was obtained for the architecture from Fig. 5, with input size 64 \(\times \) 64 (see Table 5). The number of parameters is almost 2M, and the model size is 225.67 KiB (for the binary model) and 6932.48 KiB (for the Float-32 equivalent). It is no surprise that the same architecture gave the best results for GTSRB and Belgium, since both belong to the European area. The best accuracy for the Chinese dataset (\(83.9\%\)) is obtained by another architecture, namely the one from Fig. 6, with input size 48 \(\times \) 48 (see Table 6). This architecture is more efficient from the point of view of computationally limited devices and formal verification, having 900k parameters and a size of 113.64 KiB (for the binary model) and 3532.8 KiB (for the Float-32 equivalent). Moreover, this second architecture gave the best average accuracy, and its decrease in accuracy for GTSRB and Belgium is small, namely \(1.17\%\) and \(0.39\%\), respectively.

Table 3. XNOR(QConv) modified. Image size: 30px  \(\times \)  30px. Dataset for train and test: GTSRB.
Table 4. Results for the architecture from the column Model Description. Image size: 30px \(\times \)30px. Dataset for train and test: GTSRB.

If we investigate both architectures based on the confusion matrix results, for GTSRB we observe that the model failed to predict, for example, End of speed limit (80 km/h) and Bicycle crossing. The first was confused mostly with Speed limit (80 km/h), the second with Children crossing. One reason for the first confusion could be that End of speed limit (80 km/h) might be considered an occluded version of Speed limit (80 km/h).
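The per-class failures discussed in this section can be reproduced from a confusion matrix. The following is a minimal sketch using scikit-learn; this is our choice of tooling for illustration, not necessarily the one used in the study, and it presumes a trained `model` and a labeled test set `(x_test, y_test)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted label = index of the largest output of the trained model.
y_pred = np.argmax(model.predict(x_test), axis=1)

# Rows are true classes, columns are predicted classes; off-diagonal peaks
# reveal which signs are systematically confused with which.
cm = confusion_matrix(y_test, y_pred)
```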

For the Belgian test set, the worst results were obtained, for example, for Bicycle crossing and Wild animals crossing, because the images differ considerably from the images in the GTSRB training set (see Fig. 7a). Another bad prediction is for Double curve, which was equally confused with Slippery road and Children crossing.

Fig. 5. Accuracy-efficient architecture for the GTSRB and Belgian datasets

In the Chinese test set, the Traffic signals class failed to be predicted at all by our model and was assimilated to the General caution class from GTSRB; however, General caution is not a class in the Chinese test set (see Fig. 7b, top). Another bad prediction is for Speed limit (80 km/h), which was equally confused with Speed limit (30 km/h), Speed limit (50 km/h) and Speed limit (60 km/h), but not with Speed limit (70 km/h). One reason could be the quality of the training images compared to the test ones (see Fig. 7b, bottom).

Table 5. Results for the architecture from Fig. 5. Dataset for train: GTSRB.
Fig. 6. Accuracy-efficient architecture for the Chinese dataset

Table 6. Results for the architecture from Fig. 6. Dataset for train: GTSRB.
Fig. 7. Differences between traffic signs in the datasets

In conclusion, there are only a few cases in which the prediction failures can be explained; this makes the need for formal verification guarantees of the results pressing, and we will address it in future work.