1 Introduction

Motor vehicle accidents are increasing day by day, causing a large number of deaths and disabilities [44]. The main causes of these accidents are human error, the unavailability of timely treatment, and secondary accidents [35]. Typically, these accidents involve vehicle-to-vehicle collisions, collisions with pedestrians, single-vehicle accidents, and unnoticed accidents on the road ahead.

Hence, it is crucial to develop automatic accident detection methods to reduce fatalities and injuries. Such systems can help rescue agencies and vehicles near the accident by providing timely information, thereby avoiding road congestion and secondary accidents. Road perception, surrounding object understanding, and traffic detection are fundamental functions that an autonomous vehicle relies on to apply various driving rules. Thus, many companies and researchers are developing algorithms that allow driverless systems to drive safely by identifying the obstacles present in the trajectory.

The path planning and traffic-driving directions of autonomous vehicles are built upon image recognition technologies. These technologies help detect lanes and locate the obstacles and/or objects present in different road segments. Notably, vehicle accident and rollover detection using image recognition algorithms is reliable and easy to deploy in real time. A highly reliable accident prediction model can be developed by analyzing previously available accident data. Thus, traffic accident prediction based on image recognition is a hot research area among academics and corporate companies. According to researchers, the six major factors that cause the majority of crashes are driver mistakes, traffic congestion, road geometry, weather conditions, vehicle conditions, and the time of the crash [1].

Despite high-end sensors and advanced machine learning algorithms for segmentation and object localization [10, 39, 40], the safety of autonomous vehicles remains a public concern. Many incidents show that automated vehicles can fail to recognize objects on the road; one example is the Tesla sedan that ran into an overturned truck in Taiwan. Road segmentation and object recognition algorithms should therefore improve the recognition of cases such as rollovers, vehicle collisions, and accident vehicles.

The principal contributions of this research are as follows: (1) We propose a car crash detection system that detects car crashes in various scenarios, making it suitable for the navigational decision-making process of autonomous vehicles. (2) We design a novel CNN8L network with fewer training parameters to detect accident vehicles, improve the accident detection rate, and reduce false alarms. (3) We evaluate different state-of-the-art models using a transfer learning-based approach for feature extraction on a customized car crash dataset.

The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 discusses the proposed object recognition and detection technique in detail; Section 4 describes the dataset and the experimental set-up, evaluates the recognition and detection results of the individual modules, and discusses the final performance of the detection system; Section 5 presents our conclusions and future work.

2 Related works

Ijjina et al. [22] proposed a framework for dealing with traffic accidents. The accident probability is calculated based on the observed speed and the anomalous pattern. The Mask R-CNN with centroid-based tracking is used for accident detection. Presented in [4] is a real-time algorithm to predict the pedestrian path to avoid accidents. Using Bayesian inference, the model identifies and learns the features and patterns of 2D trajectory data using both global and local characteristics. Their framework performed well on noisy and sparse data extracted from dense indoor and outdoor crowd videos.

In [11], the authors suggested two deep learning techniques, the gated recurrent unit (GRU) and the convolutional neural network (CNN). These methods apply an ensemble technique with a weighted average of multiple classifiers to the video and audio signals from dashboard cameras and analyze the model's prediction ability on this unstructured data. However, the model efficiency was validated and compared only with video or audio data from single classifiers. DeepCrash, a deep learning model for the Internet of Vehicles (IoV) proposed by Chang et al. [7], includes different components and software modules in a single platform named the in-vehicle infotainment (IVI) telematics platform. The self-collision detection sensor and the front camera data are passed to a cloud-based deep learning server for pattern analysis, and a cloud-based platform manages the hardware and software components in the IVI platform.

A Cooperative Vehicle Infrastructure System (CVIS) combined with machine vision, proposed by Tian et al. [48] to detect small objects, utilizes dynamic weights in the loss function and Multi-Scale Feature Fusion (MSFF) to improve the performance of automatic car accident detection systems. Rahim et al. [37] proposed a traffic crash severity prediction model based on deep learning techniques with a customized f1-loss function and compared its performance with other machine learning models using precision and recall indicators.

Lu et al. [29] proposed a deep learning framework to extract features from urban traffic video records using residual neural networks (ResNet) and attention modules. This feature fusion-based model achieved a high detection speed on par with its accuracy on systems with constrained computing resources. The authors of [36] developed a deep learning system capable of spotting accidents captured on video. The proposed method assumes that visual cues appearing over time can adequately describe traffic collision events; therefore, the model architecture comprises a visual feature extraction phase, after which a temporal pattern is identified. The model is trained on various public datasets, enabling it to learn diverse visual and temporal characteristics during the training phase using custom-built convolutional and recurrent layers, and it achieves 98% accuracy on public traffic accident datasets. Independent road structures are also detected efficiently in [26].

The existing car crash detection models are efficient at predicting multiple-vehicle collisions but fall short in identifying single-vehicle accidents. Many autopilot failures have caused crashes involving repaired and previously damaged vehicles. In [11], the authors proposed a car crash detection system using an ensemble method based on multiple classifiers for video and audio data from dashboard cameras, and they validated their study using YouTube videos of auto accidents. Bakheet et al. [3] proposed temporal templates of moving objects to extract local features. A deep neural network (DNN) model is trained on the extracted features to detect abnormal vehicle behavioral patterns and predict accidents before they happen.

In [53], the authors propose an accident detection approach based on spatiotemporal feature encoding with a multilayer neural network, which clusters the video frames effectively and efficiently detects accidents in driving videos. The spatial relationships of the objects detected in potential accident frames are then captured and encoded to determine whether those frames are accident frames. In [20], the authors suggest a machine learning framework based on multimodal in-car sensors for automated car accident detection, using CNN and SVM classifiers to identify real-world driving collisions in the Strategic Highway Research Program (SHRP2) Naturalistic Driving Study (NDS) crash data set; five different feature extraction methods, including ones based on feature engineering and on deep feature learning, are assessed. In [34], the authors suggest an intelligent accident detection and rescue system that uses the Internet of Things (IoT) and Artificial Intelligence (AI) to mimic the cognitive functions of the human mind. It uses a customized dataset created from a variety of online videos, gathers all accident-related data such as position, pressure, gravitational force, and speed, and sends it to the cloud. When the DL module notices an accident, it immediately alerts all nearby emergency services, including the hospital, police station, and mechanics.

An in-depth analysis of action recognition methods, taxonomies, and algorithms for autonomous driving and accident detection is presented in [2]. The analysis covers different types of traffic video capture, including dashcams, drone cameras, highway monitoring cameras, and stationary surveillance cameras at traffic intersections. The authors advise future research to concentrate on scaling up accident detection systems for spontaneous detection, alerting first responders about traffic accidents and providing a quick response to victims. Such model designs in AVs support collision avoidance, automatic brake control, path planning, and speed reduction. Fully autonomous vehicles need to plan their trajectory on their own, based on obstacles and accidents that have already occurred ahead, to avoid collisions. Several existing studies based on the machine learning techniques above consider car crash prevention and identify near-crash scenarios to alert drivers. In contrast, the car crash detection approach proposed in this paper uses ensemble learning for feature extraction without prior training and concentrates on detection rather than prevention of accidents. Aimed at the requirements of autonomous vehicles, we have developed a car crash detection system for detecting diverse car crashes, including single-car crash scenarios.

3 Methodology

Our approach uses three different neural networks to perform the car crash detection task: a feature extraction network (VGG16), a Region Proposal Network (RPN), and a region-based detection network (CNN8L). The feature extractor of the car crash detection system is evaluated with different transfer learning networks, allowing us to compare their performance on the crash detection task. The RPN identifies regions of interest (RoI) in an image by proposing bounding boxes, each with a score indicating the likelihood that an object is present in the current sliding window. The region-based detection network, named CNN8L, is a novel neural network architecture with a regressor and a classifier for car crash classification and detection. Figure 1 illustrates our proposed car crash detection system, and the flowchart of the proposed methodology is shown in Fig. 2.

Fig. 1

Schematic representation of the Car Crash detection CNN model. A. Feature Extractor, B. Feature Map, C. Region of Interest (RoI), D. Region Proposal Network, E. RoI Pooling, and F. Region-Based Detection Network

Fig. 2

Flowchart of the proposed methodology

3.1 Feature extractor

CNNs are built from multiple interconnected neural network layers, from which powerful low-, middle-, and high-level features are extracted hierarchically to create more complex networks [7, 8]. The InceptionResNetV2 network, created by Google and available for download [12, 24], is one such example. The two popular CNN families, ResNet [13, 19] and Inception [45], are integrated in InceptionResNetV2, where batch normalization replaces the traditional summation process on the top layer of the network. The numerous residual blocks within the network increase the training complexity. To address this training complexity efficiently [30] and minimize the residuals, Mahdianpari et al. applied more than 1,000 filters in the network layers.

Following its predecessor AlexNet [28], the VGG network [43] won the localization track and took second place in the classification track of the ILSVRC-2014 competition. The VGG network is distinguished by a deep network structure and small 3 × 3 convolution filters compared with AlexNet. In the competition, the VGG-VD group introduced six deep CNNs, two of which, VGG16 and VGG19, were more successful than the others. VGG16 uses thirteen convolution layers and three fully connected layers [13], while VGG19 has sixteen convolution layers and three fully connected layers. Both networks use stacks of 3 × 3 convolution filters with stride 1, followed by multiple non-linear layers in a stacked configuration. This contributes to learning more complex features by increasing the depth and breadth of the network. Owing to the impressive VGG results, network depth has been found to be an essential factor in achieving high classification accuracy [21].

The MobileNet architecture can be summarized as follows: the structure is built on depthwise separable convolutions, which factorize a standard convolution into a depthwise convolution followed by a 1 × 1 pointwise convolution [18, 36]. The 1 × 1 pointwise convolution combines the outputs of the depthwise layers, and each convolution is followed by batch normalization and a rectified linear unit (ReLU).

ResNet, the winner of the ILSVRC-2015 classification task, has a very deep network of 152 layers [47, 52]. In particular, the residual module in ResNet employs a direct path between the input and output. Each layer in the stack fits a residual mapping instead of directly matching the desired underlying mapping [14].

The feature extractor module in the detection system extracts high-level features from the image and transfers them into feature maps. Many object detection models use transfer learning techniques to extract object features. Transfer learning (TL) is a fundamental machine learning method that applies knowledge learned from a source domain to a different but related target domain. It can produce excellent results while also encouraging progress in areas that are difficult to advance due to a lack of training data.

To illustrate the advantages of using TL as the feature extractor, we compare the performance of several relevant models, including InceptionResNetV2 [15], VGG16, MobileNetV2 [9], and ResNet50. In our proposed transfer learning approach, all the convolutional layers of the pre-trained model, excluding the top fully connected layers, are used as the backbone of the feature extraction block.
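A minimal sketch (not the authors' exact code) of this feature extraction set-up in TensorFlow/Keras, with the pre-trained backbone loaded without its fully connected top; the input size and the decision to freeze the backbone are assumptions:

```python
# Hedged sketch: pre-trained backbone as the feature extraction block.
import tensorflow as tf

def build_feature_extractor(input_shape=(100, 100, 3), backbone="vgg16"):
    """Return a convolutional backbone without its fully connected top."""
    if backbone == "vgg16":
        base = tf.keras.applications.VGG16(include_top=False,
                                           weights="imagenet",
                                           input_shape=input_shape)
    elif backbone == "resnet50":
        base = tf.keras.applications.ResNet50(include_top=False,
                                              weights="imagenet",
                                              input_shape=input_shape)
    else:
        raise ValueError("unsupported backbone")
    base.trainable = False  # transfer learning: reuse the pre-trained features
    return base

# Usage: produce feature maps that later feed the RPN
extractor = build_feature_extractor()
feature_map = extractor(tf.random.uniform((1, 100, 100, 3)))  # e.g. shape (1, 3, 3, 512)
```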

3.2 Region proposal network (RPN)

The RPN takes an input of any size from the feature extractor (the base layers of the transfer learning models) and proposes rectangular bounding boxes together with an objectness score for each box. The model requires anchors, i.e., proposal boundaries of different scales and ratios. Three scales and three ratios are used, yielding nine anchor boxes at each sliding position of the image. The number of anchors depends on the image's width (W) and height (H), given as 9 × W × H. The scales 8, 16, and 32 and the ratios 0.5, 1, and 2 are used. Anchor generation and image scaling are explained in Algorithm 1.
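The following sketch illustrates the anchor generation step in the spirit of Algorithm 1; the base anchor size and feature stride are illustrative assumptions, not values taken from the paper:

```python
# Illustrative anchor generation: 3 scales x 3 ratios = 9 anchors per position.
import numpy as np

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchor boxes as (x1, y1, x2, y2)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # anchor centre in image coords
            for scale in scales:
                for ratio in ratios:
                    # ratio = height / width; area grows with the scale
                    w = scale * stride * np.sqrt(1.0 / ratio)
                    h = scale * stride * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors, dtype=np.float32)

# 9 anchors per position: a 50 x 50 feature map yields 22,500 candidate boxes
boxes = generate_anchors(50, 50)
```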

The RPN receives the input feature vector from the feature extractor block and uses multiple convolutional blocks to produce a lower-dimensional feature vector of size 50 × 50 × 512. This output is passed to two fully connected layers performing two tasks: offset regression (a box regression layer) and objectness score calculation (a box classification layer). The box regression layer outputs the bounding boxes, and the box classification layer outputs objectness scores that determine the probability of an object being present within the bounded area. The anchors with the top objectness scores are selected as RoIs.

The RPN model is trained to make predictions for each anchor box. It predicts the Intersection over Union (IoU) between the anchor box and the ground truth, as well as the class score for the object inside the box. The best anchor box is selected based on the IoU and assigned to the ground truth with class label = 1. If no such match is found, the anchor box is given class label = 0. This helps separate the image background from the foreground. The training loss function is a weighted sum of the regression and classification losses: the smooth L1 loss is used as the regression loss, and binary cross-entropy as the classification loss. Figure 3 illustrates the top 20 anchor boxes proposed by the RPN; the red boxes locate the predicted RoIs, and the green boxes mark the ground truths.
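A hedged sketch of the RPN training loss described above, combining binary cross-entropy for the objectness label with a smooth L1 (Huber) term for the box offsets; the regression weight `lambda_reg` is an assumption, not a value reported in the paper:

```python
# Hedged sketch of the weighted RPN loss: classification + box regression.
import tensorflow as tf

huber = tf.keras.losses.Huber(delta=1.0)  # Huber with delta=1 behaves as smooth L1
bce = tf.keras.losses.BinaryCrossentropy()

def rpn_loss(cls_true, cls_pred, box_true, box_pred, lambda_reg=1.0):
    cls_loss = bce(cls_true, cls_pred)               # foreground / background label
    # regress offsets only for anchors matched to a ground truth (label == 1)
    pos_mask = tf.cast(tf.equal(cls_true, 1.0), tf.float32)
    reg_loss = huber(box_true * pos_mask[..., None], box_pred * pos_mask[..., None])
    return cls_loss + lambda_reg * reg_loss
```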

Fig. 3

Top 20 anchor boxes proposed by the RPN network

Algorithm 1:

Anchor box generation and image scaling

3.3 Region-based detector

We propose the CNN8L neural network to classify the RoIs proposed by the RPN. CNN8L consists of 4 convolutional and 4 pooling layers, summing up to 8 layers, hence the name CNN8L. We replaced the fully connected layers of the RPN and FRPN with CNN8L, since CNN8L is a detector with prior knowledge about car crashes and is trained independently of the VGG backbone and the RPN. This allows multiple functionalities, such as the alarm, speed alert, and brake control, to be integrated without retraining the whole system. The region-based detector CNN8L consists of a feature extractor block, an image classifier block, and a bounding box regressor block, as shown in Fig. 4. The feature extractor block integrates low-level and high-level features and provides comprehensive feature information by combining four convolution layers. The classifier block classifies the image, and the bounding box regressor block locates the object. The feature extractor block is the base of the proposed CNN8L model, with four convolution layers and three max pooling layers. The convolution layers bl_1, bl_3, and bl_5 are connected to the max pooling layers bl_2 (size 3 × 3), bl_4 (size 3 × 3), and bl_6 (size 2 × 2), respectively. The last convolution layer bl_7 receives input from the max pooling layer bl_6 and passes its output to the global average pooling layer bl_8, which ends the feature extractor block and converts each feature map from bl_7 into one value. The extracted feature map size from the convolution layers (bl_1, bl_2, ..., bl_8) varies with the width and height as given in Eq. (1) and Eq. (2).

$${bl}_{w} = (w - {f}_{w} + 2p)/{s}_{w} + 1$$
(1)
$${bl}_{h} = (h - {f}_{h} + 2p)/{s}_{h} + 1$$
(2)

where padding \(p\) defines the border of the input sample. The input to this block is an image with width \(w\), height \(h\), and three RGB channels. The feature map size and the required parameters for a convolution layer bl with N filters, kernel width \({f}_{w}\), and kernel height \({f}_{h}\) define the field of view of the convolution. The stride width \({s}_{w}\) and height \({s}_{h}\) determine the step size of the kernel, as given in Eq. (3) and Eq. (4).

Fig. 4

CNN8L Network Architecture

The final output size of

$$bl = (N, {bl}_{w}, {bl}_{h})$$
(3)

Parameters needed for

$$bl = N\times {bl}_{w}\times {bl}_{h }$$
(4)
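For illustration, a minimal Keras sketch of the CNN8L feature extractor block (layers bl_1 to bl_8) described above; the filter counts and kernel sizes per convolution layer are assumptions, and only the layer ordering and pool sizes follow the text:

```python
# Hedged sketch of the CNN8L feature extractor block (bl_1 ... bl_8).
import tensorflow as tf
from tensorflow.keras import layers

def cnn8l_feature_extractor(input_shape=(100, 100, 3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu", name="bl_1")(inputs)
    x = layers.MaxPooling2D(pool_size=3, name="bl_2")(x)            # 3 x 3 pooling
    x = layers.Conv2D(64, 3, padding="same", activation="relu", name="bl_3")(x)
    x = layers.MaxPooling2D(pool_size=3, name="bl_4")(x)            # 3 x 3 pooling
    x = layers.Conv2D(128, 3, padding="same", activation="relu", name="bl_5")(x)
    x = layers.MaxPooling2D(pool_size=2, name="bl_6")(x)            # 2 x 2 pooling
    x = layers.Conv2D(256, 3, padding="same", activation="relu", name="bl_7")(x)
    # bl_8: global average pooling collapses each feature map to one value
    x = layers.GlobalAveragePooling2D(name="bl_8")(x)
    return tf.keras.Model(inputs, x, name="cnn8l_features")

# Per Eqs. (1)-(2): bl_w = (w - f_w + 2p)/s_w + 1 and bl_h = (h - f_h + 2p)/s_h + 1
# give the width/height of each feature map produced along this stack.
```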

3.3.1 Image classifier block

The image classifier block is the classification layer of the proposed CNN8L model, consisting of a dropout layer and a SoftMax activation function. To reduce overfitting, the label_1 dropout layer drops 30% of the features. The dropout layer is connected to a three-output dense layer with the SoftMax activation function to obtain the probability distribution over the classes [51]. The loss function of the image classifier block is the label loss \({l}_{l}\) given in Eq. (5).

$$Label\;loss\;l_l=lw_l\times l_l$$
(5)

where \(l{w}_{l}\) is defined as the label weight loss.

3.3.2 Bounding box regressor block

The bounding box regressor block is the object detection layer of the proposed CNN8L model, with three dense layers. The first two dense layers, bbox_1 and bbox_2, use the ReLU activation function and output features of sizes 128 and 64, respectively. The final dense layer gives four outputs for locating the object in the image and is activated with the sigmoid activation function. The loss function of the bounding box regressor block is \({bb}_{l}\), given in Eq. (6), and the overall model loss is shown in Eq. (7).

The bounding box loss

$${bb}_{l} ={bbw}_{l}\times {bb}_{l}$$
(6)

where \(bb{w}_{l}\) is the bounding box weight loss.

The overall model loss

$$L = {l}_{l} + {bb}_{l}$$
(7)
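A hedged Keras sketch of the image classifier block and bounding box regressor block described in Sections 3.3.1 and 3.3.2, attached to the feature extractor sketched after Eq. (4); the helper `cnn8l_heads` and the output-layer names are illustrative, not taken from the paper:

```python
# Hedged sketch of the CNN8L classifier and bounding-box regressor heads.
from tensorflow.keras import layers, Model

def cnn8l_heads(feature_extractor, num_classes=3):
    features = feature_extractor.output
    # image classifier block: 30% dropout, then softmax over the three classes
    x = layers.Dropout(0.3, name="label_1")(features)
    class_out = layers.Dense(num_classes, activation="softmax", name="class_label")(x)
    # bounding-box regressor block: 128- and 64-unit ReLU layers, 4 sigmoid outputs
    b = layers.Dense(128, activation="relu", name="bbox_1")(features)
    b = layers.Dense(64, activation="relu", name="bbox_2")(b)
    bbox_out = layers.Dense(4, activation="sigmoid", name="bounding_box")(b)
    return Model(feature_extractor.input, [class_out, bbox_out], name="cnn8l")

# model = cnn8l_heads(cnn8l_feature_extractor())  # feature extractor from the sketch above
```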

CNN8L often detects the same class with overlapping areas. Therefore, a computer vision technique called non-maximum suppression (NMS) is applied to the detected bounding boxes: the detection with the highest probability is repeatedly chosen as the prediction, and redundant bounding boxes of the same class whose IoU with it exceeds the threshold Th = 0.4 are removed. Algorithm 2 explains the NMS technique, and an illustrative sketch is given after it below.

Algorithm 2:

Non-Maximum Suppression
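An illustrative per-class NMS sketch matching the description above (IoU threshold Th = 0.4); this is a generic implementation, not the authors' Algorithm 2 verbatim:

```python
# Illustrative non-maximum suppression for one class of detections.
import numpy as np

def nms(boxes, scores, iou_threshold=0.4):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]   # highest-probability detections first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        order = rest[iou <= iou_threshold]   # drop redundant overlapping boxes
    return keep
```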

3.4 Model validation

Our experiments utilized different parameters [25, 42] to measure the efficiency and performance of the models. In computer vision, the research community has converged on accuracy, average precision, and mean average precision as metrics for object recognition and detection models. The label is classified based on the probability score. If the score of the predicted class is greater than or equal to 0.5 and the prediction matches the annotated class, it is considered a True Positive (\({T}_{P}\)). If the score is greater than or equal to 0.5 but the prediction does not match the annotated class, it is a False Positive (\({F}_{P}\)). If the score is less than 0.5 and the prediction matches the annotated class, it is a False Negative (\({F}_{N}\)). If the score is less than 0.5 and the prediction does not match the annotated class, it is a True Negative (\({T}_{N}\)). The accuracy of the classification model is defined in Eq. (8).

$$Accuracy = \frac{{T}_{p}+{T}_{N}}{{T}_{P}+{F}_{P}+{T}_{N}+{F}_{N}}$$
(8)

The object detection measurement is based on two important metrics, the Intersection over Union (IoU) and the class label. The object in the image is located using a rectangular bounding box (bb). Using this representation, the IoU is defined in Eq. (9) [24].

$$\mathrm{IoU}=\frac{Area\;of\;Intersection\;of\;{bb}_{g}\;and\;{bb}_{p}}{Area\;of\;Union\;of\;{bb}_{g}\;and\;{bb}_{p}}$$
(9)

where \({bb}_{g}\) is the ground truth bounding box, and \({bb}_{p}\) is the Predicted bounding box.
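A direct implementation of Eq. (9) for two axis-aligned boxes, useful for checking predicted detections against annotations:

```python
# IoU per Eq. (9) for boxes given as (x1, y1, x2, y2); bb_g is ground truth, bb_p predicted.
def iou(bb_g, bb_p):
    x1, y1 = max(bb_g[0], bb_p[0]), max(bb_g[1], bb_p[1])
    x2, y2 = min(bb_g[2], bb_p[2]), min(bb_g[3], bb_p[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_g = (bb_g[2] - bb_g[0]) * (bb_g[3] - bb_g[1])
    area_p = (bb_p[2] - bb_p[0]) * (bb_p[3] - bb_p[1])
    return inter / (area_g + area_p - inter)

# Example: iou((0, 0, 10, 10), (5, 5, 15, 15)) = 25 / 175 ≈ 0.143
```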

The precision \(({p}_{class})\) and recall \(({r}_{class})\) of a class are calculated for detections with an IoU greater than or equal to the 0.5 threshold, as shown in Eq. (10) and Eq. (11) [41]; a detection whose IoU falls below the 0.5 threshold is counted as \({F}_{P}\) rather than \({T}_{P}\).

$${p}_{class} = \frac{{T}_{P}}{{T}_{P}+{F}_{P}}$$
(10)
$${r}_{class}=\frac{T_P}{Total\;number\;of\;annotations}$$
(11)

The Average Precision of the class (\({\text{AP}}_{class})\) is the weighted sum of Precision for the class and each IoU threshold, where the weight is the difference between the current and next recall. \({\text{AP}}_{class}\) is defined in Eq. (12).

$${AP}_{class}=\sum_{t=0}^{t=th-1}[{r}_{class}\left(t\right)-{r}_{class}(t+1)]{p}_{class}(t)$$
(12)

Further, \(th\) is the total number of IoU thresholds, with \({r}_{class}\left(th\right)=0\) and \({p}_{class}\left(th\right)=1\). The Mean Average Precision (mAP) is given in Eq. (13).

$$mAP = \frac{1}{n} {\sum }_{class = 1}^{n}{\text{AP}}_{class}$$
(13)

where \(n\) is the number of classes.
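A short sketch of Eqs. (12) and (13), treating precision and recall as per-threshold arrays for each class; the helper names are illustrative:

```python
# AP per Eq. (12) and mAP per Eq. (13).
import numpy as np

def average_precision(precision, recall):
    """precision, recall: per-threshold arrays of equal length th."""
    p = np.append(precision, 1.0)   # p_class(th) = 1
    r = np.append(recall, 0.0)      # r_class(th) = 0
    return float(np.sum((r[:-1] - r[1:]) * p[:-1]))

def mean_average_precision(per_class_pr):
    """per_class_pr: list of (precision, recall) array pairs, one pair per class."""
    return float(np.mean([average_precision(p, r) for p, r in per_class_pr]))
```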

The performance of the car crash accident detection framework is based on two major parameters that have been used in most comparative studies of accident detection models [17]: the Accident Detection Rate (ADR) and the False Alarm Rate (FAR), given in Eqs. (14) and (15) [6, 16, 46].

$$ADR = \frac{\text{Correctly detected accidents}}{\text{Total accidents}} \times 100$$
(14)
$$FAR = \frac{\text{Falsely identified classes}}{\text{Total number of classes}} \times 100$$
(15)

4 Experiment results and discussion

4.1 Dataset

The model used in this paper classifies three different categories, background, car, and car crash, as shown in Fig. 4, where the categories are represented as 0 for background, 1 for car, and 2 for car crash. For autonomous vehicles, locating the boundary of each object is critical for path navigation. The car and car crash classes easily overlap without high confidence in either class; adding a background class therefore helps locate the objects clearly without embedding the image background in the detection process. We used different datasets for each category: background images were collected from the Pothole dataset compiled by the Electrical and Electronic Department, Stellenbosch University, 2015 [31, 32], car images from the Stanford Cars Dataset [27], and car crash images from Unsplash [50] and iStock [23] photos. In addition, we used the car crash and car images from [33]. The car crash dataset is released at: https://github.com/vanisaravanarajan/Car-Crash-Dataset.git.

Our experiment applied four types of augmentation to the collected images: RandomCrop, Rotate, HorizontalFlip, and VerticalFlip.
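The four augmentation names above match the Albumentations transform names; a plausible pipeline (the crop size, rotation limit, and probabilities are assumptions) that keeps the bounding boxes consistent with the augmented images could look like this:

```python
# Hedged augmentation pipeline using Albumentations; parameters are illustrative.
import albumentations as A

augment = A.Compose(
    [
        A.RandomCrop(width=90, height=90, p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
    ],
    # keep bounding boxes and labels aligned with the transformed image
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

# augmented = augment(image=image, bboxes=bboxes, class_labels=labels)
```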

Moreover, we used the LabelImg tool [49] to draw an axis-aligned rectangular bounding box around each object when labeling the newly generated accident detection dataset. We constructed a car accident detection dataset, and sample images are shown in Fig. 5.

Fig. 5

The labeled images

4.2 Data preprocessing

We applied five preprocessing steps to the dataset. First, all images were resized to 100 × 100 pixels. Second, we checked the image type and file format. Third, we checked for missing annotations. Fourth, we labeled the background category as '0', the car category as '1', and the car crash category as '2'. Finally, we divided the dataset into three sets of 1700, 267, and 607 images for training, validation, and testing, respectively. The categories of the car accident detection dataset are shown in Table 1.
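A small sketch of these preprocessing steps in TensorFlow; the file layout and helper names are assumptions made for illustration:

```python
# Hedged preprocessing sketch: resize to 100 x 100, integer class labels.
import tensorflow as tf

CLASS_IDS = {"background": 0, "car": 1, "car_crash": 2}

def preprocess(image_path, label):
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, (100, 100)) / 255.0   # resize and normalize
    return image, label

# split sizes reported in the paper: 1700 training, 267 validation, 607 testing
```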

Table 1 Dataset Information

4.3 Experiment

CNN8L and the transfer learning models are evaluated on the car accident detection dataset. The experiments were conducted on an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz processor with an NVIDIA GeForce RTX 3050 Ti GPU, Python 3.8, TensorFlow 2.6.2, CUDA 11.2, and cuDNN 8.1. The models are compiled using the Adam optimizer with two losses: Sparse Categorical Cross-Entropy and Mean Squared Error (MSE). We tested the model with various epochs (50, 100, 150, and 200) and batch sizes (8, 16, and 32); the model validation stabilized at 200 epochs and a batch size of 32. The hyper-parameters are listed in Table 2. In the proposed methodology, we applied the sparse categorical loss function to the image classification output and MSE to the bounding box regressor, with Adam optimization. The model learns from the loss function and optimizes the parameters.
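A hedged sketch of this training configuration (Adam, sparse categorical cross-entropy for the class label, MSE for the bounding box, 200 epochs, batch size 32); the learning rate, loss weights, and the `model`/array names carried over from the earlier sketches are assumptions:

```python
# Hedged training set-up matching Section 4.3 / Table 2.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss={"class_label": tf.keras.losses.SparseCategoricalCrossentropy(),
          "bounding_box": tf.keras.losses.MeanSquaredError()},
    loss_weights={"class_label": 1.0, "bounding_box": 1.0},  # play the role of lw_l and bbw_l
)
model.fit(x_train, {"class_label": y_labels, "bounding_box": y_boxes},
          validation_data=(x_val, {"class_label": y_val_labels, "bounding_box": y_val_boxes}),
          epochs=200, batch_size=32)
```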

Table 2 Hyper-Parameter Settings

4.4 Results and discussion

We performed an ablation study to analyze the system mAP as per Eq. 13 by combining the state-of-the-art models as feature extractors with the CNN8L model. The VGG16 and CNN8L combination performed the best, as shown in Table 3.

Table 3 Ablation Study

Table 4 lists the ablation study performed to identify the number of layers for the CNN model used as the region-based detector. We tested the CNN model accuracy by varying the number of layers. Considering the trade-off between accuracy and the number of parameters, we selected the CNN with 4 convolution and 4 pooling layers (8 layers in total), named CNN8L, as the region-based detector, even though the CNN with 8 convolution and 8 pooling layers showed a slightly (0.5%) higher accuracy than CNN8L.

Table 4 Ablation study by varying the layers

We compared the processing time and the number of parameters of CNN8L with the other state-of-the-art models, as shown in Table 5. CNN8L has the lowest number of parameters and the shortest processing time: it has 36 times fewer parameters and is 6 times faster than the second-best model, VGG16. In autonomous vehicles, model speed and accuracy are equally important.

Table 5 List of Model Parameters and Processing Time

A real-time accident detection system should identify accidents across a wide variety of accident data, even when those patterns did not appear during the model training process. Such models are preferred for practical applications, as they can generalize well and detect patterns in any condition. Hence, to study the performance of the detection system, we tested a wide range of crash scenarios. The system showed excellent generalization by identifying many previously unseen car crashes, as shown in Fig. 6. The detected car crash and car classes are masked with green and blue colors, respectively. A detection confidence score threshold greater than 0.7 is used for identifying and locating classes in the tested images.

Fig. 6

Masked Output of Car Crash Detection System from the (a) images and (b) video

In some cases, both the car and car crash classes had a good confidence score and overlapped, as shown in Fig. 6(b). We selected 125 car crash frames and 75 normal frames without a car crash for testing. Out of the 125 frames, 105 were correctly detected as car crashes, and the other 15 were detected as cars. A total of 225 classes were identified in the 200 frames selected for testing, as some frames contained multiple classes. The final detection output showed 75 false alarms out of the 225 classes. The ADR and FAR of our framework are compared with other frameworks in Table 6.

Table 6 Accident Detection framework comparison

5 Conclusion

A novel system for detecting car crashes with a new deep learning convolutional neural network model named CNN8L is proposed. For widely used models such as R-CNN, Faster R-CNN, and Mask R-CNN, training is computationally expensive due to the large number of required parameters. Our proposed car crash detection system improves on this through faster training, easier convergence, and reduced overfitting. CNN8L is trained with smaller filter sizes and fewer filters, with 449,927 parameters and 101.67 s of processing time. The car crash detection system takes advantage of ensemble learning techniques of neural networks. We also applied transfer learning with the state-of-the-art models VGG16, InceptionResNetV2, MobileNetV2, and ResNet50 for object recognition and detection on the customized car accident dataset. The sequential convolutional neural networks CNN8L and VGG16 performed well. The overall system, integrating VGG16 for feature extraction, the RPN for region proposals, and CNN8L for region-based detection, recognized accidents accurately with an 86.25% Accident Detection Rate and a 33.00% False Alarm Rate.

Due to the extreme lack of data for AVs, the first limitation of our study is the use of a single data set, which restricts the generalizability of our findings. The second limitation is the difficulty of comparing our results with those of existing works, owing to different accident definitions and input data types. Future work will extend the CNN8L model to research areas such as road scene segmentation, navigational planning, audiovisual alarm generation, speed-limit suggestion, and automatic brake control to lower the speed and reduce the possibility of a collision for AVs.