Introduction

The emergence of machine learning and its substitution for several classical statistical models have led to better problem-solving, which in turn has led various fields of study to redirect their research to take advantage of this new method. Transportation systems have been influenced by the growth of machine learning, particularly in intelligent transportation systems (ITS). With the proliferation of data and advancements in computational hardware such as graphical processing units (GPUs), a specific class of machine learning known as deep learning (DL) has gained popularity. The capability of DL models to handle large amounts of data and extract knowledge from complex systems has made them a powerful and viable solution in the domain of ITS. The variety of network architectures in DL has helped researchers formulate their problems in ways that can be solved with one of these neural network techniques. Traffic signal control for better traffic management, improved transportation security via surveillance sensors, traffic rerouting systems, health monitoring of transportation infrastructure, and several other problems now have a strong new approach, and new solutions have been created for several challenging problems in transportation engineering.

There have been several surveys of the literature on the application and enhancement of ITS using DL techniques. However, most of these have tended to focus on a specific aspect of DL or a specific aspect of ITS. For instance, Zhu et al. (2018a) conducted a survey of big data analytics in ITS, and Loce et al. (2013) reviewed the key role computer vision plays in roadway transportation systems. While Nguyen et al. (2018) reviewed DL models across the transportation domain, theirs is not a comprehensive survey encompassing all current research publications on the ITS domain and DL. One dedicated review on enhancing transportation systems via DL was conducted by Wang et al. (2018a); it included substantial research but focused primarily on traffic state prediction and traffic sign recognition tasks. The ITS domain includes other tasks, such as public transportation, ride-sharing, vehicle re-identification, and traffic incident prediction and inference, all of which are represented in this paper to make its extent more comprehensive and holistic. The transportation research community has long taken notice of pivotal research directions; the earliest review of neural networks applied to transportation (Dougherty 1995) critically spanned the classes of problems, the neural networks applied, and the challenges in addressing various problems. This motivates the question we address in this paper: How effective and efficient are current DL research applications for the domain of ITS? To the best of the authors’ knowledge, the literature in this field has suffered from the lack of a holistic survey that takes a broader perspective of ITS as a whole and its enhancement using DL models.

The purpose of this paper is, therefore, to present the systematic review we have conducted on the existing state of research on ITS and its foray into DL. In “Research Approach and Methodology”, we discuss the approach taken to identify relevant literature. In “Background on Techniques in Deep Learning”, we describe different classes of DL networks and breakthrough research on those methods. In “Applications in Transportation”, we discuss different applications of DL methods in transportation engineering, specifically six major application categories in ITS.

In “Discussion and Conclusion”, we investigate the available embedded systems and devices that can facilitate the running of neural network experiments, and we provide a summary and an outlook for future research.

The research methodology followed in this paper is PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) (Moher et al. 2009). Following this method, we first produced a questionnaire, and in each paper we reviewed, we looked for answers to its questions. These questions focus on the gap each paper tries to address, the proposed solution, and the performance of that solution on its datasets.

Research Approach and Methodology

This paper performs a detailed analysis of existing studies on intelligent transportation systems (ITS) and deep learning (DL). Articles were searched in multiple databases using the search strategy described below. The collected articles were then reviewed and organized. The scope of this review was restricted to conference proceedings and journal articles, including existing literature reviews.

Relevant articles were primarily obtained by querying the TRID TRB database (Home—transport research international documentation 2017), where the search terms included “deep learning” and “convolutional”. These terms were sought in the title, abstract, and notes. The references of the identified papers were then examined to trace other trusted journals and papers. Online searches of other databases, such as Scopus, Science Direct, IEEE, and ArXiv, were also conducted. All papers obtained were included in this review if they met the following criteria:

  • Describe solutions to ITS problems using DL, as identified by methodology sections that include DL-based model development

  • Published between January 2015 and October 2019 (during which period the majority of research so far using DL in ITS has been conducted)

  • Not a book, book chapter, dissertation, thesis or technical report

  • Not a general introduction to ITS

  • Not in the domain of autonomous vehicles

Though the DL boom was spawned by the ImageNet project in 2012 (Russakovsky et al. 2015) and applications of DL in ITS first appeared in 2013, substantial growth in ITS research by means of DL methodologies did not start until 2015, as illustrated in Fig. 1. Since then, there has been steady growth in the prominence of DL-based ITS studies across journals and conferences. In 2019, up until October, 43 papers had been published across various ITS applications. In light of the markedly increasing importance of DL as an ITS research method, in the following sections we discuss and review the various DL structures and then their key applications in the ITS domain.

Fig. 1: Year-wise publication growth in ITS domains

Background on Techniques in Deep Learning

Deep Neural Networks (DNN)

Deep learning (DL) is a specific subcategory of machine learning in which several stacked layers of parameters are used for the learning process (Ketkar 2017). These parameters are component representations of the different aspects that can affect the result of the network. Each layer contains several perceptrons (also known as neurons or hidden units), which carry the layer's weights. The input of each layer is multiplied by these parameters, so the output represents the impact of each parameter on the input. Usually, after each layer or every few layers of neurons, a nonlinear function such as tanh, sigmoid, or the rectified linear unit (ReLU) (Glorot et al. 2011) is applied to generate the layer's output. All these layers combine to form a deep neural network (DNN) (Schmidhuber 2015). There are two major challenges in building a DNN: first, designing the structure of the network, which includes the number of layers, the number of neurons in each layer, and the type of nonlinear function; and second, adjusting the weights of the parameters to train the network on how it should perceive the input data and calculate the output. For the first challenge, what usually helps most is trial and error and overall experience. For the second challenge, back-propagation is the most popular method for training the weights in a supervised manner; more details can be found in Schmidhuber (2015). Although all the techniques discussed in the rest of this paper can be classified as subcategories of DNNs, here DNN is defined as the simplest structure of a network, in other words, fully connected layers. In this fully connected model, there is a connection from every neuron in one layer to every neuron in the next layer, and each connection has a weight that should be determined through the back-propagation method.
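To make this concrete, the following is a minimal sketch of such a fully connected network and one back-propagation training step, written in PyTorch; the layer sizes and learning rate are illustrative assumptions, not values prescribed by any of the surveyed work.

```python
import torch
import torch.nn as nn

# A fully connected (dense) network: every neuron in one layer connects
# to every neuron in the next. Layer sizes here are hypothetical.
model = nn.Sequential(
    nn.Linear(10, 64),   # input features -> first hidden layer
    nn.ReLU(),           # nonlinearity (ReLU; Glorot et al. 2011)
    nn.Linear(64, 32),   # second hidden layer
    nn.ReLU(),
    nn.Linear(32, 1),    # output layer
)

# One supervised training step: back-propagation computes the gradient of
# the loss with respect to every connection weight, and the optimizer
# adjusts the weights accordingly.
x, y = torch.randn(8, 10), torch.randn(8, 1)   # dummy batch of inputs/targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()    # back-propagate the error through all layers
optimizer.step()   # update the connection weights
```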

Convolutional Neural Networks (CNN)

One of the major applications of neural networks has been computer-aided detection (CAD), which aims to improve classification accuracy and inference time. A revolutionary method called convolutional neural networks (CNN) was proposed in LeCun et al. (1989). Inspired by the cat visual system, whose cells are locally sensitive and orientation-selective (Hubel and Wiesel 1962), LeCun et al. (1989) suggested that instead of using fully connected layers, a single kernel with shared weights can sweep the entire image and extract its local features. The proposed method enhanced detection effectiveness both in terms of accuracy and memory requirements when compared with traditional methods, which required handcrafted feature extraction (LeCun et al. 1998).

A CNN is a detection architecture that automatically learns spatial hierarchies of features using back-propagation through the network. A schematic of this architecture is presented in Fig. 2a. These networks usually contain three types of layers: convolution, pooling, and fully connected, where the first two are used to extract features and the last is used as a classifier (Bengio et al. 2015).

Fig. 2: Figures depicting CNN and RNN schematics

The convolution layer consists of a convolution kernel, which constitutes the linear part of the layer, combined with a nonlinear activation function. The main advantage of using a kernel with shared weights is that it extracts local features and learns the spatial hierarchies of features efficiently while reducing the number of required parameters. The nonlinear activation function then maps the results onto the feature map. To reduce the number of parameters further, a pooling layer usually follows a few convolutional layers to downsample the data, taking the maximum unit (max pooling) or the average (average pooling) of a collection of units and substituting it as a representative of that collection. After the convolution and pooling layers have extracted features and downsampled the data, fully connected layers map them onto the final output. The output of these layers usually has the same size as the number of classes, and each element indicates the probability of the input belonging to that class. Finally, this vector is mapped onto the final result by an activation function, which can be a sigmoid for binary classification, a softmax for multiclass classification, or the identity for continuous values in the case of regression (Yamashita et al. 2018).
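As an illustration of this convolution, pooling, and fully connected pipeline, a minimal image classifier might look as follows; this is a sketch in PyTorch, and the input resolution, channel counts, and number of classes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Convolution + pooling layers extract features; a fully connected layer classifies."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # shared-weight kernel
            nn.ReLU(),                                   # nonlinear activation
            nn.MaxPool2d(2),                             # downsample by max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected classifier; output size equals the number of classes.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                        # x: (batch, 3, 32, 32)
        feats = self.features(x).flatten(1)      # feature maps -> feature vector
        return self.classifier(feats)            # one score per class

logits = SmallCNN()(torch.randn(4, 3, 32, 32))
probs = logits.softmax(dim=1)                    # per-class probabilities
```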

Because training a deep model requires a large amount of data, the popularity of CNNs and other models only began to rise when a large quantity of labeled data was provided for the ImageNet challenge (Russakovsky et al. 2015). Since then, many architectures using these CNN blocks have been proposed to enhance the efficiency of CAD, including AlexNet, Inception, VGGNet-16/19, and ResNet. To increase detection accuracy further, other concepts have been incorporated into the process. One of these is transfer learning, which uses the knowledge a network gains from pretraining on a large dataset in order to train it on a smaller dataset (Yamashita et al. 2018). Another is training with an equal prior instead of a biased prior when the dataset is biased towards one of the classes (an imbalanced dataset); in this case, different sampling or resampling rates are applied to balance the dataset. The effects of changing the architecture, using transfer learning, and balancing the dataset are investigated across various datasets in Shin et al. (2016).

Recurrent Neural Networks (RNN)

Recurrent neural networks (RNNs), another class of supervised DL models, are typically used to capture dynamic sequences of data. RNNs can store representations of recent inputs and capture the sequence of the data by introducing a feedback connection. This ability acts as a memory, passing information selectively across sequence steps while processing the data at a given time. Thus, each state depends on both the current input and the state of the network at the previous time step; in this respect, a traditional, simple RNN resembles a Markov model (Lipton et al. 2015). In 1982, the first algorithm for recurrent networks was used by Hopfield (1982) for pattern recognition. In 1990, Elman (1990) introduced his architecture, which is known as the most basic RNN. A schematic of this architecture is presented in Fig. 2b. In this architecture, each hidden unit has an associated context unit, which takes the exact state of the corresponding unit at the previous time step as input and re-feeds it, with a learned weight, to the same unit at the next step.

Although training RNNs seems straightforward, vanishing and exploding gradients remain the two main difficulties. These problems can occur while learning from previous states once the chain of dependencies becomes long, making it difficult to choose which information should be learned from past states. To solve the problem of exploding gradients in recurrent networks, which can result in oscillating weights, Williams and Zipser (1989) suggested Truncated Back-Propagation Through Time (TBPTT), which sets a fixed number of time steps as a propagation limit. Here, to prevent the gradient from exploding, only a small portion of the previously analyzed data is used during the training phase. However, this means that in cases with long-range dependencies, the earlier information related to those dependencies ends up lost.

The Long Short-Term Memory (LSTM) architecture was suggested by Hochreiter and Schmidhuber (1997) to solve both of these problems together. Its primary idea is a memory cell with only two gates: an input gate, which decides when to keep information in the cell, and an output gate, which decides when to allow the memory cell to affect other units. In recent years, several corrections and improvements have been made to the LSTM architecture.

As described above, an LSTM contains a memory cell that holds its state over time and, through its gating, controls how this cell affects the network. The most common type of LSTM cell was suggested by Graves and Schmidhuber (2005); several gates and components added to this cell differentiate it from the basic LSTM of Hochreiter and Schmidhuber (1997). A logistic sigmoid function is usually used as the gate activation, while, following the design of Graves and Schmidhuber (2005), a tanh function is usually used as the block input and block output activation. The forget gate and peephole connections were first suggested by Gers and Schmidhuber (2001); these enable the cell to reset itself by forgetting its current state, and to pass the internal state directly to all gates without passing it through an activation function.

Finally, it is notable that Cho et al. (2014) proposed the gated recurrent unit (GRU), inspired by the LSTM block, in which the peephole connections and output activation function are eliminated. They also coupled the input and forget gates into a single gate called the update gate, and the output gate passes only recurrent connections to the block input. This architecture is much simpler than the LSTM and, despite what it eliminates, avoids a significant reduction in performance, which has made it popular.
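To contrast the two recurrent units in code, the sketch below (PyTorch; the input and hidden sizes are arbitrary choices for illustration) runs the same sequence through an LSTM, which maintains a separate memory cell, and a GRU, which does not:

```python
import torch
import torch.nn as nn

seq = torch.randn(5, 1, 8)   # (time steps, batch, input features)

# LSTM: returns the outputs plus both a hidden state and a memory cell state.
lstm = nn.LSTM(input_size=8, hidden_size=16)
out_lstm, (h_n, c_n) = lstm(seq)

# GRU: input and forget gates are merged into an update gate,
# so there is no separate memory cell, only a hidden state.
gru = nn.GRU(input_size=8, hidden_size=16)
out_gru, h_g = gru(seq)

# Both produce one 16-dimensional state per time step; the GRU does so
# with fewer parameters per unit.
```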

Autoencoders (AE)

One of the most important requirements in DL is access to a large amount of data to train the model. Usually, such a dataset is not readily available, and producing a rich dataset is expensive. In this situation, unsupervised methods show their value: instead of training models using labeled data, they extract features from unlabeled data and use these extracted features to train the model. Autoencoders (AEs) are one such method; they aim to reconstruct the input data and in this manner resemble principal component analysis. AEs are composed of two networks concatenated to each other. The first network extracts and encodes the input data into its main features, and the second network uses these features to reconstruct something similar to the input data. A schematic of this architecture is presented in Fig. 3a. Although the concept of AEs had previously been used for denoising (Vincent et al. 2008) and data construction (Tan and Eswaran 2008), it found a new application in variational AEs (Kingma and Welling 2013). To minimize the difference between input and output, Kingma and Welling (2013) used the variational inference method: they introduced a lower bound on the marginal likelihood and maximized it to minimize the error between input and output. Doersch (2016) and Le (2015) explain exactly how a variational AE can be built.
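A minimal version of this encoder–decoder pair is sketched below in PyTorch; the input and bottleneck sizes are assumptions. Note that the reconstruction loss requires no labels, which is what makes the method unsupervised.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())      # compress input to 32 features
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())   # reconstruct from those features

x = torch.rand(16, 784)           # e.g., a batch of flattened unlabeled images
x_hat = decoder(encoder(x))       # encode, then decode

loss = nn.MSELoss()(x_hat, x)     # reconstruction error between output and input
loss.backward()                   # trains both networks without any labels
```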

Fig. 3: Figures depicting AE and GAN schematics

Usually, an AE’s hidden layer is smaller than its input layer, although the opposite can occur as well. The horizontal composition of AEs, defined as combining two or more AEs side by side, can have different motivations, such as different learning algorithms (e.g., RBM, neural network, or Boolean) or different initializations and learning rates. These configurations, along with linear and nonlinear AEs, have been studied by Baldi (2012). It has been shown that a Boolean AE, as a nonlinear type, has the ability to cluster data, and that an AE layer on top can be used as a pretrainer for a supervised regression or classification task.

Deep Reinforcement Learning (DRL)

Reinforcement learning (RL) attempts to train a machine to act as an agent that interacts with the environment and learns to optimize these interactions from the responses it receives (Arulkumaran et al. 2017). In RL, the agent observes the environment, receives a state signal, and chooses an action that impacts the environment to produce a new state. In the next step, a reward from the environment, along with the new state, is fed to the agent to help it decide more intelligently. The goal of the agent in this setup is to gain the maximum reward over the long term by following an optimal policy. RL algorithms are usually based on the Markov Decision Process (MDP) (Silver 2015). The problems that can be solved by RL algorithms are divided into episodic and non-episodic MDPs. In an episodic MDP, the state resets at the end of each episode and the return (the accumulation of rewards over the episode) is calculated. In a non-episodic MDP, there is no end of episode, and a discount factor is vital to prevent the return values from exploding (Arulkumaran et al. 2017).

Two functions are commonly used in RL: the state-value function, also known as the value function, which is the expected return if the agent starts at a given state (with no action restriction), and the action-value function, also known as the quality function (Q-function), which is the expected return of starting at a given state and taking a particular action. Usually, one of two methods is implemented to solve an RL problem. In the first approach, the Q-function is estimated using temporal-difference control methods such as state–action–reward–state–action (SARSA), which iteratively improves the estimate of Q. The second approach is Q-learning, which directly approximates the optimal Q. Both of these methods use bootstrapping and learn from incomplete episodes.
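For reference, the tabular versions of these two updates can be written in a few lines; this is a generic sketch (NumPy, with assumed state/action counts and hyperparameters), the table being what the deep models in the next paragraphs replace with a network:

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # Q-table: expected return per (state, action)
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    """Q-learning: bootstrap from the best action available in the next state."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    """SARSA: bootstrap from the action actually taken in the next state."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```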

Deep reinforcement learning (DRL) is an approach to solving the RL problem using a DNN. Although the history of DRL began in the 1990s, when Tesauro (1995) developed a neural network that reached an expert level in backgammon, its rebirth can be attributed to Mnih et al. (2015), who introduced Deep Q-Networks (DQNs): DNNs that approximate Q instead of reading its value from a Q-table that lists, for each state, the Q value of taking each action. With this new method, complex and high-dimensional problems can potentially be addressed more easily (Mnih et al. 2015). The model of Mnih et al. (2015) extracted images from Atari games and used a combination of a CNN and a fully connected layer on the data extracted from the images to estimate the Q value.

However, because of its complexity, DRL can be unstable, so much research has focused on solutions able to defeat this instability. Experience replay (Lin 1992) and target networks (Mnih et al. 2015) are the two most widely used techniques for making RL stable. Other techniques, such as Double Q-learning (Hasselt 2010) and dueling DQN (Wang et al. 2015), have also been proposed to make DRL more robust and stable. In Double Q-learning, a second estimator is used to estimate an extra Q′ in order to approximate the Q value more precisely. Dueling DQN (Wang et al. 2015), on the other hand, learns a state-value baseline together with the relative advantage of each action instead of calculating the Q value directly.
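The two main stabilizing techniques can be sketched as follows (PyTorch; the network shape, buffer, and hyperparameters are illustrative assumptions): experience replay decorrelates updates by sampling random past transitions, and the target network is a periodically refreshed frozen copy that supplies stable bootstrap targets.

```python
import copy
import random
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # state -> Q per action
target_net = copy.deepcopy(q_net)      # frozen copy used only for bootstrap targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = []                            # experience replay: (s, a, r, s_next, done) tensors
gamma = 0.99

def train_step(batch_size=32):
    # Sample a random, decorrelated minibatch of past transitions.
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():   # targets come from the frozen network
        target = r + gamma * target_net(s2).max(1).values * (1 - done)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Every few hundred steps, refresh the frozen copy:
# target_net.load_state_dict(q_net.state_dict())
```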

Generative Adversarial Networks (GAN)

Generative adversarial networks (GANs) are a specific class of deep learning networks that learn to extract the statistical distribution of training data in order to synthesize new data similar to real-world data. These synthetic data can be used for several applications, such as producing high-resolution images (Ledig et al. 2017), denoising low-quality images, and image-to-image translation (Isola et al. 2017). Most generative models use the maximum likelihood principle to build a model that estimates the probability distribution of the training data and synthesizes a dataset that maximizes the likelihood of the training data (Dougherty 1995). Although computing the maximum likelihood directly can yield the best model, these calculations are sometimes so difficult that it is more practical to estimate this quantity implicitly. In the case of explicit density calculation, three main types of models are popular:

  • Fully visible belief networks

  • Variational AEs

  • Markov chain approximations

All of these models, however, suffer from the problems of low speed, low quality, and early stoppage (Goodfellow 2016). To overcome these problems, Goodfellow et al. (2014a) suggested a method that does not require an explicit definition of the density function. This model can generate samples in parallel, no Markov chain is needed to train it, and no variational bound is needed to make it asymptotically consistent.

This method has two models: the generative model, which is responsible for passing random noise through a multilayer network to synthesize samples, and the discriminative model, which is responsible for passing real and artificial data through a multilayer network to detect whether the input is fake or real. A schematic of this architecture is presented in Fig. 3b. Both models use back-propagation and dropout algorithms: the generative model to create more realistic data, and the discriminative model to distinguish better between real and fake data.
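A bare-bones version of this adversarial loop is sketched below (PyTorch; the noise dimension, network sizes, and the stand-in "real" distribution are assumptions): the discriminator is pushed to score real data as 1 and generated data as 0, while the generator is pushed to make the discriminator output 1 on its samples.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))               # noise -> sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # sample -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 2) * 0.5 + 3.0      # stand-in for real training data
    fake = G(torch.randn(32, 16))              # generator maps noise to samples

    # Discriminator step: real -> 1, fake -> 0 (fake detached so G is not updated).
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into labeling fakes as real.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```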

When GANs were first proposed, fully connected networks were used in both the generative and discriminative models. Later, in 2015, Radford et al. (2015) suggested a new architecture named the deep convolutional GAN (DCGAN), which uses batch normalization in all layers of both models except the last layer of the generator and the first layer of the discriminator; no pooling or unpooling layers are used in this architecture. A DCGAN allows the model to understand operations in the latent space meaningfully and to respond to these operations by acting on the semantic attributes of the input (Goodfellow 2016).

Another improvement on the GAN architecture has been the conditional GAN (Mirza and Osindero 2014), in which both networks are class-conditional: the generator tries to generate image samples for a specific class, and the discriminator is trained to distinguish real data from fake data conditional on that particular class. The advantage of this architecture is better performance in multimodal data generation (Creswell et al. 2018).

In the next section, we discuss and review the applications of deep learning models to transportation.

Applications in Transportation

Performance Evaluation

Before reviewing papers that have used DL methods to investigate ITS applications, it is necessary to make clear the model evaluation criteria used. The classification metrics are accuracy (AC), precision (PR), recall (RL), top-1 accuracy, and top-5 accuracy; the detection metrics are mean average precision (mAP) and intersection over union (IoU); and the regression metrics are mean absolute error (MAE), mean absolute percentage error (MAPE), root mean squared error (RMSE), and mean squared relative error (MSRE):

$$\mathrm{AC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$$
(1)
$$\mathrm{PR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(2)
$$\mathrm{RL}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(3)

where TP = true positive, TN = true negative, FP = false positive, FN = false negative.

Top-1 accuracy means the model’s highest-probability answer must match the expected answer, while top-5 accuracy means at least one of the model’s five highest-probability answers must match it.

mAP is the mean of the average precision (AP) scores over all queries, where AP is the area under the PR vs. RL curve.

IoU is the ratio between the area of overlap and the area of union between the predicted and the ground-truth bounding boxes. The regression metrics are computed as:

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i}-\bar{y}_{i}\right|$$
(4)
$$\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{{y}_{i}-\bar{y}_{i}}{{y}_{i}}\right|$$
(5)
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-\bar{y}_{i}\right)}^{2}}$$
(6)
$$\mathrm{MSRE}=\frac{1}{n}\sum_{i=1}^{n}{\left(\frac{{y}_{i}-\bar{y}_{i}}{{y}_{i}}\right)}^{2}$$
(7)

where $y_i$ is the actual value of observed travel time, $\bar{y}_i$ is the predicted value of travel time, and $n$ is the number of observations.
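For reference, these metrics translate directly into code; the following is a minimal NumPy sketch of Eqs. 1–7 and the IoU definition above:

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    ac = (tp + tn) / (tp + fp + tn + fn)   # Eq. 1: accuracy
    pr = tp / (tp + fp)                    # Eq. 2: precision
    rl = tp / (tp + fn)                    # Eq. 3: recall
    return ac, pr, rl

def regression_metrics(y, y_pred):
    err = y - y_pred
    mae = np.mean(np.abs(err))             # Eq. 4
    mape = np.mean(np.abs(err / y))        # Eq. 5
    rmse = np.sqrt(np.mean(err ** 2))      # Eq. 6
    msre = np.mean((err / y) ** 2)         # Eq. 7
    return mae, mape, rmse, msre

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2): area of overlap divided by area of union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```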

We now discuss different applications of deep learning in ITS. The included topics have been selected based on the functional areas in ITS as mentioned in Sussman (2008) and have been studied substantially over the period of 2012–2019.

Traffic Characteristics Prediction

One of the most widely considered applications of DL in transportation is traffic characteristics prediction. Traffic characteristics information can help drivers choose their routes more wisely and traffic management agencies manage traffic more efficiently. The main characteristics of interest are traffic flow, traffic speed, and travel time. Since these characteristics are not mutually exclusive, methods used to predict one of them can also be used to predict the others; for this reason, the methods used to make these predictions are discussed together below.

Based on the duration of the prediction for each traffic characteristic, a forecast is usually classified as short-term (S) for predictions of less than 30 min, medium-term (M) for a prediction window between 30 and 60 min, and long-term (L) for more than 60 min (Yu et al. 2017a). Since driving behavior and traffic characteristics vary across locations, results from one dataset are difficult to transfer to other datasets (Wang et al. 2018a). Traffic feature prediction has traditionally relied on parametric and statistical methods, such as autoregressive integrated moving average (ARIMA) models, but these methods have mostly been incapable of predicting irregular traffic flows (Wang et al. 2018a). With the emergence of machine learning and, subsequently, DL methods, nonparametric methods are now being used in traffic characteristics prediction to achieve higher accuracy.

Some of the first attempts to predict traffic characteristics used deep belief networks (DBNs) as unsupervised feature learners. Chen et al. (2017a), Huang et al. (2014), and Khajeh Hosseini and Talebpour (2019) implemented DBNs for traffic flow prediction, while Siripanpornchana et al. (2016) and Hou and Edara (2018) used the same concept for predicting travel time and traffic speed. Along with traffic data, weather data have been fed into DBNs using data fusion techniques to predict traffic flow more accurately (Koesdwiady et al. 2016).

However, because the traffic features mentioned above depend on past traffic conditions, several studies have used RNNs to discover these correlations and predict traffic characteristics. For instance, Zhang and Kabuka (2018) used a gated RNN unit to predict traffic flow with respect to weather conditions, whereas Jia et al. (2016) used an LSTM to address the same challenge. Liu et al. (2017) and Tian and Pan (2015) used LSTMs to predict travel time as well as traffic flow while also taking weather conditions into account. Finally, Ma et al. (2015) implemented a combination of a deep RBM and an RNN to predict congestion on transportation network links.

Polson and Sokolov (2017) tried to increase the accuracy of traffic flow prediction, especially for nonrecurrent traffic congestion such as special events or harsh weather, by paying more attention to the spatiotemporal features of traffic. This approach is grounded in the assumption that predicting any traffic characteristic requires both historical data for that particular location and current traffic data from the neighboring areas. To accomplish this, Wang et al. (2016a) combined an RNN with a CNN to attend to both the temporal and spatial aspects of traffic, and Fouladgar et al. (2017), Du et al. (2017), and Goudarzi et al. (2018) combined LSTM and CNN to capture both temporal and local dependencies when predicting different traffic characteristics. Yao et al. (2018a) considered two further challenges. The first is the dynamic dependency of traffic on temporal features: at different hours of the day, this dependency may differ from one direction of traffic flow to another. The second is the possibility of shifting time periods in relation to traffic density; in other words, a periodic temporal dependency may shift from one time to another (e.g., on different days of the week). As a result, Yao et al. (2018a) designed a network consisting of a flow-gated local CNN to capture the dynamics of the spatial dependencies and an LSTM with a periodically shifted attention mechanism to handle the periodic dependencies. Another approach to accounting for both types of dependencies was taken by Ma et al. (2017), who converted their data matrices into images representing the two dimensions of time and space; this let them use a CNN model to extract image features and predict network-wide traffic speed. Yu et al. (2019) later improved this approach by adding a temporal gated convolution layer to extract temporal features.
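The CNN-plus-LSTM pattern recurring in these studies can be sketched generically as follows (PyTorch). This is not the architecture of any specific paper above: the grid size, history length, and layer widths are illustrative assumptions. A CNN summarizes the spatial traffic map at each time step, and an LSTM models the sequence of those summaries.

```python
import torch
import torch.nn as nn

class SpatioTemporalNet(nn.Module):
    """CNN per time step for spatial features; LSTM across steps for temporal ones."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(             # encodes one traffic "image"
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.lstm = nn.LSTM(input_size=8 * 4 * 4, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 1)          # e.g., predicted flow/speed for the next interval

    def forward(self, x):                     # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))     # apply the CNN to every time step
        feats = feats.flatten(1).view(b, t, -1)
        out, _ = self.lstm(feats)             # temporal dependencies across the steps
        return self.head(out[:, -1])          # predict from the last time step

pred = SpatioTemporalNet()(torch.randn(2, 12, 1, 20, 20))  # 12 past 20x20 traffic maps
```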

To extract both spatial and temporal features, Cui et al. (2018a) used a deep model called the stacked bidirectional and unidirectional LSTM (SBU-LSTM), in which the bidirectional LSTM considers both the backward and forward dependencies in time-series data. Since traffic conditions are periodic, analyzing both backward and forward features can increase the accuracy.

Another model able to consider the spatiotemporal properties of traffic has been the AE, first proposed for this purpose by Lv et al. (2014) and improved by Duan et al. (2016) using a denoising stacked AE (dSAE) and by Yu et al. (2017a), who combined LSTM and AE to predict traffic conditions at peak hours and in post-accident situations. To predict post-accident situations, they extracted a latent representation of the static features common to all accidents from stacks of AEs and combined it, using a linear regression (LR) layer, with the temporal correlation of traffic flow obtained from stacks of LSTMs.

Table 1 summarizes all these papers, with the columns from left to right describing, for each study, the traffic characteristics investigated, the DL model, the dataset, the experiment results (best results achieved), the baseline model, the baseline model’s best results, the prediction window length, a hyperlink to the given paper, and its year of publication.

Table 1 Overview of papers using deep learning techniques for traffic characteristic prediction

To the best of the authors’ knowledge, all studies matching the meta-analysis criteria described in “Research Approach and Methodology” that relate to travel time, traffic speed, traffic flow, traffic conditions, and traffic density have been tabulated here. For traffic conditions, the goal is to predict whether the road is congested. Results obtained on multiple datasets are also represented in Table 1. For uniformity, the best results reported are those achieved when the window length is ‘S’ (short-term). This table structure is followed across all tables in this paper.

Traffic Incident Inference

The goals of predicting traffic incident risk for a given location, and of detecting incidents based on traffic features, are to help traffic management agencies reduce incident risk in hazardous areas and traffic jams at incident locations. Although some parameters, such as driver behavior, are not very predictable, several key features can help predict traffic incidents.

Human mobility (Chen et al. 2016), traffic flow, geographical position, weather, time period, and day of the week [97] are some of the features that can be investigated as indicators of a traffic incident. However, a single model cannot generally be used in different places, because accident factors in metropolitan areas, where population and vehicles are generally dense, are completely different from accident factors in a small town with a scattered population (Yuan et al. 2017). The prediction and detection of incidents themselves is more challenging than the prediction of incident risk, since the data for the former are usually highly imbalanced (traffic incidents happen rarely compared to the amount of data for cases with no incident). To overcome this issue, Yuan et al. (2017) changed only one feature of the data at each step (hour, day, or location) and then checked whether the resulting data point was negative; negative cases were added to the pool of data to be considered.

Different approaches have been used to measure traffic incident risk based on surveillance camera data. For example, Chen et al. (2016) used a stacked denoising AE (SDAE) to learn the hierarchical features of human mobility and their correlation with traffic incidents. In contrast, Ren et al. (2017) and Bao et al. (2019) implemented LSTM models to evaluate risk, with Ren et al. (2017) achieving better performance by learning from more features.

To predict traffic incidents in a macroscopic manner, Yuan et al. (2017) and Pan et al. (2017) implemented DNN models, with Yuan et al. (2017) considering the curvature of the road, the number of intersections, and the density of the area in order to overcome the spatial heterogeneity problem. With the same concern, Dong et al. (2018) used an AE that considers both continuous and categorical variables, and Yuan et al. (2018) used a Conv-LSTM that breaks regions into smaller subregions to overcome spatial heterogeneity.

If, following Yuan et al. (2018), we consider the macroscopic prediction of traffic incidents as predicting the probability of an accident between any pair of vehicles in a wider region, rather than focusing on any single vehicle, then microscopic incident prediction studies can also be introduced: using data about the location, speed, and direction of each vehicle in the surrounding area, they predict the probability of an incident between a particular pair of vehicles in the near future. In this regard, Chen et al. (2018b) and Theofilatos et al. (2019) trained DNNs to predict likely collisions. Theofilatos et al. (2019) used a simple NN with four layers which, though it does not compare well with the baseline machine learning (ML) techniques, is still preferred, as the ML techniques have poor sensitivity.

Suzuki et al. (2018) have annotated their large dataset of near-miss traffic accidents to train a quasi-RNN model. The innovation of their work was introducing an adaptive loss function for early anticipation (AdaLEA), which gives their model the ability to predict a collision 3.65 s before it happens.

Another challenge in traffic incident inference is detecting an accident by processing only raw data. To address this, Hatri and Boumhidi (2018) and Singh and Mohan (2018) used stacked AEs (SAEs) to extract the features of traffic patterns in the context of an accident; Hatri and Boumhidi (2018) also used a fuzzy DNN to control the learning of traffic-incident-related parameters. Zhang et al. (2018a) trained a DBN model on a dataset that includes tweets related to traffic accidents, showing that non-traffic features can be used alongside traffic features to validate traffic incident detection.

Incident severity prediction based on recorded incident features has been studied in Wang et al. (2016a), Sameen and Pradhan (2017), and Alkheder et al. (2017). The artificial neural network (ANN) trained in Alkheder et al. (2017) showed an improvement over baseline performance compared with the LSTM model with fully connected layers in Sameen and Pradhan (2017).

Table 2 summarizes all these papers, showing each model, the dataset on which it was trained, its evaluation on the testing dataset, and a comparison of its performance to that of the baseline model. The first section of this table lists studies on the parameters effective in predicting increased incident risk and the manner in which incident risk is affected. In the next section, macroscopic studies on incident prediction are categorized as “traffic incident prediction,” whereas microscopic studies are categorized as “collision prediction.” In the incident detection section, all studies focused on detecting incidents by analyzing raw traffic data have been gathered, and, finally, the last section lists investigations predicting the severity of the incident.

Table 2 Overview of papers using deep learning techniques for traffic incident inference

Vehicle Identification

Applications of vehicle re-identification (Re-ID) vary from calculating travel time to automatic ticketing. Since license plates are unique to each vehicle, the first task in Re-ID is recognizing them.

Zang et al. (2015) and Abedin et al. (2017) implemented DL models to recognize license plates using a visual attention model that first generates a feature map from a combination of the colors most commonly used in license plates, extracts data from the plates using a CNN model, and ultimately runs an SVM on the extracted data. However, bad lighting, blurriness due to vehicle movement, low camera quality, and even traffic occlusion, where the plate is hidden behind other cars, can make reading license plate characters impossible. To overcome this, Liu et al. (2016) proposed a CNN layer to extract conspicuous features such as the color and model of the vehicle and used a Siamese neural network to distinguish similar plates (this network had previously been used in signature verification tasks). Note that for some feature extraction tasks, such as vehicle color recognition, solutions like that of Hu et al. (2015), which combined a CNN for feature extraction with an SVM for categorization, are also available. Tang et al. (2018) similarly used a histogram-based adaptive appearance model, as Zheng et al. (2017) did for target re-identification, detecting and saving other features of each car besides the scheme of the license plate to perform Re-ID. Yu et al. (2017b) used Faster RCNN to detect vehicles in images. In addition, Arabi et al. (2020) employed a modified version of the Single Shot Detection (SSD) method, with MobileNet as the feature extraction network, to localize and classify different types of construction equipment. Wu et al. (2018b) worked on the same idea but trained their model more on spatiotemporal data, pruning their results with the facts that (1) a vehicle cannot be in two places at one time and (2) a vehicle that has already passed a section is unlikely to pass it again. However, their model could not compete with the model of Tang et al. (2018), which proposed Markov chain random fields to prepare several queries based on a visual spatiotemporal path and then used a combined Siamese-CNN and path-LSTM model.

Table 3 summarizes all these papers, showing each model, the dataset on which it was trained, its performance on that dataset, and a comparison to the baseline model.

Table 3 Overview of papers using deep learning techniques for vehicle id tasks

Traffic Signal Timing

One of the main tasks of ITS management based on multiple types of data is controlling traffic via traffic signal lights. For many years, optimizing signal light timing for the best performance has been one of the great challenges in the transportation field. Studies in this area have endowed traffic agencies with analytical models that use mathematical methods to address this optimization problem. With emerging DL studies, however, modeling the dynamics of traffic to achieve the best performance has taken a new path, because the nature of RL makes it well suited to finding the best traffic signal timing.

Li et al. (2016) used DRL to tackle traffic light timing. In DRL, a DL model is usually used to implement the Q-function in a complex system so as to capture the dynamics of traffic flow. A dSAE network takes the state as input and gives the Q-function for each possible action as the output of the network. Li et al. (2016) showed a 14% reduction in cumulative delay when using an SAE to predict the Q-function instead of conventional prediction.

Gao et al. (2017) suggested an alternative idea for choosing RL states. They argue that instead of taking raw data as the state, it can be more effective to have a CNN extract important features from the raw data (e.g., the positions of the cars and their speeds) and feed them to a DRL network with a fully connected head that predicts the Q-value for each of the four states of green, yellow, red, and protected left-turn light, with cumulative staying time as the reward. They also used the experience replay and target network techniques to stabilize the algorithm and converge it to the optimal policy, as suggested in Tan and Eswaran (2008).
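As an illustration of this formulation (a sketch only, not Gao et al.'s exact network; the grid size and layer widths are assumptions), the state can be encoded as position and speed matrices of the intersection, the four signal phases form the action set, and the reward is the reduction in cumulative staying time:

```python
import torch
import torch.nn as nn

N_PHASES = 4   # green, yellow, red, protected left turn

class SignalQNet(nn.Module):
    """CNN over position/speed matrices, fully connected head producing Q-values."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Sequential(
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(), nn.Linear(64, N_PHASES),
        )

    def forward(self, state):   # state: (batch, 2, H, W) position and speed maps
        return self.head(self.cnn(state).flatten(1))   # one Q-value per phase

def reward(prev_cumulative_wait, new_cumulative_wait):
    """Reward sketch: how much the action reduced cumulative staying time."""
    return prev_cumulative_wait - new_cumulative_wait

q_values = SignalQNet()(torch.randn(1, 2, 24, 24))   # pick the phase with max Q
```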

Liang et al. (2018) also used a CNN to map states, along with several state-of-the-art techniques, such as target networks, experience replay, double Q-learning, and dueling networks, to increase the performance of the network and make it stable. Their results showed a large reduction in waiting time (more than 30%) relative to a fixed-time scenario.

Genders and Razavi (2018) investigated the importance of choosing the states used to estimate delay time. The main goal of this study was to determine whether data from conventional sensors, such as occupancy and average speed, are satisfactory, or whether more precise data, such as vehicle density and queue length, are needed, or even data at the highest resolution, obtained by discretizing each incoming lane into cells and considering the presence of a vehicle in each cell separately. The results showed that using high-resolution data is not substantially more effective and that conventional data are good enough for their model. However, one reason that may have contributed to this conclusion is that they used a simple fully connected model that could not extract deep features from the more precise states very well.

Finally, Wei et al. (2018) tested their model on real-world traffic data to see how effective its results could be. They suggest that instead of studying only the reward, we need to consider the different policies that may result in the same reward and then take the most feasible one. The final results of this study showed great performance in reducing queue length, delay time, and trip duration compared with other methods.

Table 4 summarizes all these papers, showing each model, the dataset on which it was trained, its performance on the testing dataset, and a comparison of its performance to that of the baseline model.

Table 4 Overview of papers using deep learning techniques for traffic signal timing

Ride Sharing and Public Transportation

Public transportation systems (including bus and metro systems, taxis, etc.) are one of the main means of moving passengers within cities. To improve city planning performance and passenger satisfaction, the nature of DNNs has endowed companies with increasingly optimal routing maps that take into account data such as passenger demand for a given mode of travel at particular places and times. DL has been adopted to make these predictions even more accurate than existing ML techniques.

Saadi et al. (2017) investigated the performance of several ML techniques alongside a fully connected DL model with only two hidden layers and showed that their very simple DL model outperforms almost all the other techniques except a boosted decision tree. Besides the simple DNN models in Dominguez-Sanchez et al. (2017), Jung and Sohn (2017), Wan et al. (2018), and Zhu et al. (2018b), a hybrid model containing a stacked AE and a DNN was implemented by Liu and Chen (2017) to predict hourly passenger flow.

To capture all the related features, including the spatial, temporal, and exogenous features impacting passenger demand, a fusion convolutional LSTM network (FCL-Net) has been proposed (Ke et al. 2017). This network includes stacked Conv-LSTM layers to analyze spatiotemporal variables, such as historical demand intensity and travel time, and LSTM layers to evaluate nonspatial time-series variables, such as weather, day of the week, and time of day. With the same idea, Zhang et al. (2017) proposed a spatiotemporal ResNet (ST-ResNet) consisting of several convolutional layers. Liao et al. (2018) implemented both of these techniques on a New York City taxi record dataset, and their comparison showed that ST-ResNet achieves better performance with a faster training time. The authors suggest two reasons for this. First, the LSTM captures fine temporal dependencies, which are not as fundamental as the coarse-grained dependencies from the convolutional layers. Second, spatial features may be more important than temporal ones, and since ST-ResNet focuses more on spatial features, it outperforms the FCL-Net. Zheng et al. (2017) and Lin et al. (2018b) work directly on graph structures to leverage structural information, considering the nodes as stations and the edges as dependencies among stations. Finally, Yao et al. (2018b) and Ma et al. (2018) proposed deep multiview spatiotemporal networks to capture all the dependencies separately.

Another research area related to public transportation deals with travel mode selection. Nam et al. (2017) implemented a simple fully connected DNN on Swiss Metro data to reveal demand by mode. Another issue for transportation network companies is scheduling routes for their drivers to pick up passengers so as to minimize passenger waiting time as well as costs for the driver and the company. Shi et al. (2018) suggested a DRL model aiming to give drivers the best route; it considers factors such as the current location of vehicles, time of day, and competition between drivers, resulting in significantly shorter search times and more long-term revenue for drivers.

Table 5 summarizes all these papers, showing each model, the dataset on which it was trained, its evaluation on the testing dataset, and a comparison of its performance to that of the baseline model. (In this table, “travel mode” refers to studies that tried to predict the mode of transportation that passengers would choose at each time point, and “passenger flow” is defined as the number of passengers flowing into or out of a given location at a certain time point.)

Table 5 Overview of papers using deep learning techniques for ride sharing and public transportation

Visual Recognition Tasks

One of the most significant applications of DL is in nonintrusive recognition and detection systems, such as camera-image-based systems. These applications range from providing suitable roadway infrastructure for driving vehicles to endowing autonomous vehicles with a safe and reliable driving strategy.

One of the first visual recognition challenges tackled was obstacle detection by exploiting vehicle sensors, for which a variety of networks with unique architectures have been implemented. Kim and Ghosh (2016) merged data from an RGB camera and LIDAR sensors to increase obstacle detection performance under different illumination conditions. Dairi et al. (2018a, b), on the other hand, treated obstacle detection as an anomaly detection problem. They used a hybrid encoder model, extracting features with a Deep Boltzmann Machine (DBM) and then using an autoencoder to reduce the dimensionality and obtain vertical disparity (V-disparity) map data from the images. The key property of V-disparity data is that they are mostly stable, with only small variations from noise, and change drastically only when an obstacle appears in an image.

Wang et al. (2016b) and Cai et al. (2016) used data from far-infrared sensors to improve vehicle detection at night. While the former used only far-infrared data, the latter used both camera and far-infrared data in order to decrease the false-positive percentage. Wang et al. (2016c) addressed requirements related to vehicle following, which include detecting brake lights. They used the Histogram of Oriented Gradients (HOG) approach implemented with LIDAR and camera data, along with the vanishing point technique to decrease the false-positive rate and speed up the process. They then used AlexNet to detect whether the rear middle brake light was on or off.

Another important task in navigating safely is traffic sign detection; these signs obligate, prohibit, or alert drivers. One of the most common DL models for detecting traffic signs is the CNN. Qian et al. (2015), Yang et al. (2015), Lin et al. (2016, 2019), Lim et al. (2017), Zeng et al. (2016), Hu et al. (2017b), Yuan et al. (2016), Arcos-Garcia et al. (2018), Natarajan et al. (2018), Lee and Kim (2018), Li et al. (2018b), and You et al. (2018) have all used a CNN as their main feature extractor, each tuning their model to get the best results. Qian et al. (2015) used an RCNN to derive regions of interest from RGB images. Lim et al. (2017) focused on low-illumination images: they used a classifier to detect regions of interest and an SVM to verify whether any traffic signs were present inside each region, after which a CNN model using the Byte-MCT technique classified the traffic sign. Experiments have shown that this method is robust in deficient lighting, outperforming other methods in cases of low illumination.

Zeng et al. (2016) suggested that the RGB space cannot provide as much useful data as the perceptual Lab color space. Therefore, after changing the color space, they extracted deep perceptual features using a CNN and fed these features to a kernel-based ELM classifier to identify the traffic sign. This classifier used a radial basis function to map the features into a higher-dimensional space in order to make them separable and obtain the best outcome.

Arcos-Garcia et al. (2018) tried different optimization methods on a CNN model containing several convolutional layers and spatial transformer networks (STNs), which make the CNN spatially invariant, removing the need for supervised training, data augmentation, or even normalization. In contrast, Li and Yang (2016), instead of using a CNN, used a DBM boosted with canonical correlation analysis for feature extraction and then an SVM for classification. They also used certain conventional image-processing techniques, such as image drizzling and gray-scale normalization, to reduce noise.

Weber et al. (2016), Behrendt et al. (2017), and Kim et al. (2018a) have focused more on traffic light detection and classification. This task plays a very significant role in managing traffic, and correct detection is highly correlated with reduced risk. Weber et al. (2016) proposed their deep traffic light recognition (DeepTLR) model, which first classifies each fine-grained pixel of the input data, calculating a probability for each class; then, for the regions with a higher probability of containing a traffic light, a CNN is used to classify the status of the light. (In this model, temporal data were not used, and each frame was analyzed separately.) Behrendt et al. (2017), by contrast, used traffic speed information as well as stereovision data to track detected traffic lights. Lin et al. (2016) used a combination of region-of-interest (ROI) extraction, CNN feature extraction, and an SVM classifier to detect arrow signs on the roadway and classify their direction. Gurghian et al. (2016) used a CNN to detect lane position on the road.

Finally, the monitoring of civil infrastructure has always been a focus for engineers and researchers. Various monitoring techniques have been used for infrastructure performance evaluation, ranging from conventional short-term (Arabi et al. 2018) and long-term (Arabi et al. 2019, 2017; Constantinescu et al. 2018) sensor-based monitoring to nondestructive and noncontact techniques (Moll et al. 2018). Among the applications of nondestructive damage detection, pavement crack detection in particular has received attention, due to its importance in civil infrastructure management. For instance, Hosseini et al. (2020) and Hosseini and Smadi (2020) developed pavement prediction models that can help agencies devise more accurate maintenance and rehabilitation activities. Zhang et al. (2018c) proposed a unified pavement crack detection approach that can distinguish between cracks, sealed cracks, and background regions; through this approach, they were able to effectively separate different cracks having similar intensity and width. Moreover, Bang et al. (2019) proposed pixel-level pavement crack detection in black-box images using an encoder-decoder network and found that ResNet-152 with transfer learning outperformed other networks. Additionally, CrackNet, which performs pixel-level pavement crack detection on laser-based 3D asphalt images, was introduced by Zhang et al. (2018d). In a separate study, Zhang et al. (2018d) extended their previous work to CrackNet-R, which utilizes an RNN with a gated recurrent multilayer perceptron (GRMLP) to update the memory of the network, showing that their model outperforms models based on LSTM and GRU. Also, Nhat-Duc et al. (2018) investigated pavement crack detection performance using metaheuristic-optimized Canny and Sobel edge detection algorithms, comparing these algorithms with their proposed CNN and confirming the superior performance of DL over conventional edge detection models.

Table 6 summarizes all these papers, showing each model, the dataset on which it was trained, its evaluation on the testing dataset, and a comparison of its performance to that of the baseline model.

Table 6 Overview of papers using deep learning techniques for visual recognition tasks

Discussion and Conclusion

Hardware

Generally, there are two types of intelligent decision-making: cloud-computing-based and edge-computing-based. Whereas computing services are delivered over the internet in the cloud-computing approach, they are performed at the edge of the network in the edge-computing approach. Edge computing offers several advantages, such as efficient and fast intelligent decision-making and reduced data transfer costs, and emerging technologies such as DL have significantly increased the importance of edge computing devices. Though a detailed discussion of edge computing devices is beyond the scope of this paper, we briefly overview and compare the devices popularly used for DL. Figure 4 illustrates the edge computing platforms discussed in this section, and Table 7 summarizes the technical specifications of the covered hardware.

Fig. 4 Hardware (left to right): NVIDIA Jetson Xavier (Jetson AGX Xavier Developer Kit 2020), NVIDIA Jetson TX2 (Jetson TX2 - Elinux.Org 2020), NVIDIA Jetson Nano (Jetson Nano Developer Kit 2020), Raspberry Pi (Raspberry 2020), Intel NCS 2 (Intel® Neural Compute Stick 2 Product Specifications 2020)

Table 7 Detailed specifications of the popular edge-computing devices used for DL

The Jetson Xavier is the high-end system-on-a-chip (SoC) computing unit in the Jetson family and exploits the Volta GPU. An integrated GPU with Tensor Cores and dual Deep Learning Accelerators (DLAs) makes this module well suited to deploying computationally intensive DL-based solutions. The NVIDIA Jetson Xavier delivers 32 TeraOPS of computing performance with a configurable power consumption of 10, 15, or 30 W.

Another widely used embedded SoC is the NVIDIA Jetson TX2, which takes advantage of the NVIDIA Pascal GPU. Although it delivers less computing performance than the NVIDIA Xavier, it can be a reliable edge computing device for certain applications, providing more than 1 TFLOPS of FP16 computing performance at under 7.5 W of power consumption. The Jetson Nano, which utilizes the Maxwell GPU, is the newest product in the Jetson family introduced by NVIDIA. It is suitable for deploying computer vision and other DL models and delivers 472 GFLOPS of FP16 computing performance with 5–10 W of power consumption.

Another family of edge computing devices is the Raspberry Pi family, which offers affordable SoCs capable of high performance on basic computing tasks. The Raspberry Pi 3 Model B+ is the latest version of the Raspberry Pi; it uses a 1.4-GHz 64-bit quad-core processor and can be paired with deep learning accelerators to achieve high performance on computationally expensive tasks.

Finally, the Intel Neural Compute Stick 2 (NCS 2) is a USB-sized fanless unit that utilizes the Myriad X Vision Processing Unit (VPU), which is capable of accelerating computationally intensive inference at the edge. Very low power consumption, along with support for popular DL frameworks such as TensorFlow and Caffe, has made the NCS 2 ideal for use with resource-restricted platforms such as the Raspberry Pi 3 B+. There have been limited studies investigating the inference speed of this hardware, though Arabi et al. (2020) compared the inference speed of an SSD-MobileNet model on the abovementioned embedded devices using a construction vehicle dataset, achieving 47 FPS on the Jetson TX2 and 8 FPS on a Raspberry Pi and NCS combination.
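
A minimal sketch of how such a throughput benchmark can be set up is shown below, using torchvision's SSDlite-MobileNet (randomly initialized, since only speed is measured) as a stand-in for the SSD-MobileNet model of Arabi et al. (2020); the figures they report come from their own model, hardware, and dataset.

```python
import time
import torch
import torchvision

# Randomly initialized detector: no pretrained weights are needed
# because only inference speed, not accuracy, is being measured.
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights=None)
model.eval()

frames = [torch.rand(3, 320, 320) for _ in range(100)]  # stand-in camera frames
with torch.no_grad():
    model([frames[0]])                       # warm-up pass
    start = time.perf_counter()
    for frame in frames:
        model([frame])                       # one frame per forward pass
    elapsed = time.perf_counter() - start
print(f"Throughput: {len(frames) / elapsed:.1f} FPS")
```

A warm-up pass is included because the first forward pass typically incurs one-time setup costs that would otherwise skew the measured frame rate.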

Summary

Below, we provide a summary of the studies cited in this paper, classified according to our six ITS application categories in relation to the DL models they use (see Fig. 5). The following are our observations:

  • Traffic characteristics: CNN, RNN, and CNN-RNN hybrid models are used most frequently, undoubtedly because traffic has two main dependencies: spatial and temporal. Because various datasets and performance evaluation metrics have been used, it is hard to compare studies of traffic characteristics, though in traffic flow studies the PeMS dataset has been widely used. The majority of research has used hybrid CNN-RNN models, which can identify both long temporal dependencies and local trend features (a minimal sketch of this hybrid pattern follows Fig. 5). Although most papers have defined their own CNN model rather than using an existing architecture, CNN has generally shown better performance across papers than RNN, along with lower computation/training time.

  • Traffic incidents: the most widely used model is RNN, since the effects of an incident appear at specific times and require a model that can capture temporal dependencies. Autoencoders are also popular, since they can learn normal traffic patterns and then detect and isolate accident conditions from regular conditions (an autoencoder sketch also follows Fig. 5).

  • Vehicle ID: CNN is the most widely used model, given its power in inference from images, as detection and tracking are the main tasks in license plate and vehicle type/color identification. The existing CNN architectures most commonly utilized are AlexNet and VGG models pretrained on ImageNet.

  • Traffic signal timing: RL has been the most commonly used approach, given the control-strategy nature of the traffic signal timing task. Hybrids of CNN and SAE have been used to approximate or learn Q-values to improve DRL performance.

  • Ride-sharing and public transportation: CNN, RNN, and DNN have been the most frequently used models in this domain. Most researchers have built their own DL architectures for tasks in this category, and public transportation demand and traffic flow prediction tasks have generally been addressed with either CNN or hybrid CNN models.

  • Visual recognition tasks: CNN has been the most commonly used DL model for visual recognition tasks, again because detection and tracking are efficient via CNN. Especially in traffic sign recognition tasks, the GTSRB dataset has been one of the most frequently used benchmarks. Existing architectures such as ResNet, AlexNet, VGG, and YOLO have been used extensively, with AlexNet and ResNet being the most popular to build on. This can be attributed to the fact that visual recognition tasks are not limited to ITS, so research done in other domains can be utilized to accomplish ITS-related visual recognition tasks.

Fig. 5 ITS vs DL models: (a) traffic characteristics, (b) traffic incidents, (c) vehicle ID, (d) traffic signal, (e) public transport, (f) visual recognition
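
As referenced in the traffic characteristics bullet above, the CNN-RNN hybrid pattern can be made concrete with a short sketch. The following PyTorch model is an illustrative assumption of such an architecture (the layer sizes and sensors-as-channels layout are ours, not any cited paper's): 1-D convolutions extract local spatial features across sensors at each time step, and an LSTM captures the temporal dependencies.

```python
import torch
import torch.nn as nn

class ConvLSTMForecaster(nn.Module):
    """Illustrative CNN-RNN hybrid for traffic state prediction."""
    def __init__(self, n_sensors: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32 * n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_sensors)  # next-step speed/flow

    def forward(self, x):
        # x: (batch, time, sensors), e.g. 5-min aggregated PeMS readings
        b, t, s = x.shape
        z = self.conv(x.reshape(b * t, 1, s))   # per-step spatial features
        z = z.reshape(b, t, -1)
        out, _ = self.lstm(z)                   # temporal dependencies
        return self.head(out[:, -1])            # forecast for the next step
```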
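
Similarly, as referenced in the traffic incidents bullet, autoencoder-based detection can be sketched as follows; the layer sizes and thresholding rule are illustrative assumptions. The model is trained to reconstruct incident-free traffic data only, so a high reconstruction error flags a potential incident.

```python
import torch
import torch.nn as nn

class TrafficAutoencoder(nn.Module):
    """Autoencoder trained on incident-free traffic feature vectors
    (e.g., speed/flow/occupancy readings); anomalies reconstruct poorly."""
    def __init__(self, n_features: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, 8))
        self.decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def flag_incident(model, x, threshold: float) -> torch.Tensor:
    # The threshold would be calibrated on held-out incident-free data.
    error = ((model(x) - x) ** 2).mean(dim=-1)
    return error > threshold
```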

Based on all the studies reviewed in this paper, deep learning has undeniably achieved better results than existing techniques for addressing intelligent transportation problems. The major growth has occurred in the past three years, which account for more than 70% of all ITS-related DL research performed so far.

Future Work and Challenges

In recent years, DL methods have achieved state-of-the-art results in various visual recognition and traffic state prediction tasks. The majority of visual recognition work, such as vehicle and pedestrian detection and traffic sign recognition, has focused on autonomous driving or in-vehicle cameras. However, a significant number of overhead cameras have been installed by city traffic agencies and state Departments of Transportation, and these are mostly used for human-evaluated surveillance. To date, only a few studies have used these cameras for determining traffic volumes and speeds on freeways and arterials, or for surveillance purposes such as automatically detecting anomalies or traffic incidents (particularly at a large-scale, citywide level). Currently, the majority of traffic intersections rely on loop detectors for vehicle counting and for actuating traffic signals, but installing loop detectors is intrusive in that it requires road closures. Cameras, on the other hand, can serve as a cheap, nonintrusive detection technology for counting traffic volume in all directions as well as turning movements, the presence of pedestrians, etc., thereby facilitating smart traffic signal control strategies. Two main challenges, however, must be addressed in developing DL techniques that treat cameras as sensors. First, such methods need to handle the large volume of data collected from the hundreds or thousands of cameras installed at a citywide or statewide level; efficiently providing real-time or near-real-time inference from this volume of data is currently one of the primary challenges of using cameras as sensors. Second, the methods developed need to perform with minimal or no calibration so that they are feasible to apply and maintain at scale.

The ITS community also needs to focus on creating more benchmark datasets for the different research tasks related to DL applications. Although PeMS has been a popular dataset for traffic state prediction, as shown in Table 1, the absence of comparable benchmark datasets for traffic incident inference and ride-sharing studies has led most of these studies to use original datasets, making it difficult to compare algorithms and determine the state-of-the-art model. Indeed, the lack of a recognized benchmark dataset is likely one of the reasons these research areas have not yet been significantly explored using DL models.

While this study has shown that DL models have been successfully applied to traffic state prediction, vehicle ID, and visual recognition tasks, significant improvements are needed in the use of DL models for other research topics, such as traffic incident inference, traffic signal timing, ride-sharing, and other public transportation concerns. These topics have not yet been fully explored using DL models, and there remains significant scope for improving detection and prediction accuracy in these areas.

While DL models are becoming increasingly popular among researchers as the most effective classification method for visual recognition tasks in the ITS domain, privacy and security are extremely important, and the potential for adversarial attacks, and thus the need to robustify DL models, has been receiving greater attention. (Adversarial attacks in this domain are, in most cases, small changes to the input that are imperceptible to the human eye but cause the classifier to misclassify.) For example, self-driving cars use DL algorithms to recognize traffic signs (Cireşan et al. 2012), other vehicles, and related objects for navigation. If a DL model fails to detect a stop sign because a few pixels have been slightly modified, this can create a serious impediment to the adoption of self-driving cars. Adversarial attacks are, therefore, an increasing area of focus across DL application research topics such as natural language processing, computer vision, speech recognition, and malware detection (Najafabadi et al. 2015; Collobert and Weston 2008; LeCun et al. 2010; Deng et al. 2013; Hardy et al. 2016; Tan et al. 2020).

Biggio et al. (2013) have called into question the advisability of using neural networks and SVMs in security-sensitive applications, demonstrating the legitimacy of this concern by attacking arbitrary PDF files and the MNIST dataset with a gradient-descent evasion attack algorithm of their own design; their suggested defense is to employ regularization terms in classifiers. In the same vein, Szegedy et al. (2013) have shown that accuracy on adversarially perturbed input is far lower than on input corrupted by high-magnitude random noise. Another downside of DL classification methods is that adversarial attacks can be independent of the classification model, meaning that an attacker can fool a machine learning system without any access to the model. These are called black-box attacks, a concept first introduced by Papernot et al. (2016); in white-box attacks, by contrast, the attacker has access to all relevant information, such as the training dataset and the model. Madry et al. (2017), for example, have used a projected gradient descent (PGD) attack, in contrast to related work that has mostly used attacks based on the Fast Gradient Sign Method (FGSM). Moosavi-Dezfooli et al. (2017) have devised a systematic way to compute universal attacks: small, image-agnostic perturbations with a high probability of breaking most classifiers. Concurrently with research on designing attacks and understanding the vulnerability of neural networks to them, researchers have studied ways to defend against adversarial attacks and make DNNs robust to them. One of the most popular defenses is to add the adversarial examples generated by an attack algorithm to the training set and train the neural network on the augmented dataset (Fawcett 2003). Goodfellow et al. (2014b) have shown that although this method works for specific perturbations, networks trained this way are not robust to all adversaries. For example, while working to mitigate the effect of adversaries using denoising autoencoders (DAEs), Gu and Rigazio (2014) found that the resulting DNN became even more sensitive to perturbed input data.
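
To make the attack concrete: FGSM, mentioned above, perturbs an input by a single signed-gradient step, and PGD iterates such steps with projection back onto the ε-ball. Below is a minimal PyTorch sketch of FGSM, a generic formulation rather than the exact implementation of any cited paper.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps: float) -> torch.Tensor:
    """Fast Gradient Sign Method: one signed-gradient step that increases
    the classification loss within an L-infinity ball of radius eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range
```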

Around the same time, Bastani et al. (2016) designed metrics to measure the robustness of networks, approximating them by encoding robustness as a linear program, which can then be used to improve the robustness of the overall DNN. Defense against adversarial attacks can also be framed as a robust optimization problem: Shaham et al. (2018) have shown that adversarial training with their proposed algorithm, derived from robust optimization theory, increases both the accuracy and the robustness of the DNN. Likewise viewing robust learning through a robust optimization lens, Esfandiari et al. (2019) developed an algorithm that achieves accuracy comparable to state-of-the-art methods while avoiding much of the computational overhead of computing worst-case adversarial attacks. Another recent method for hardening DNNs against adversarial attacks is defensive distillation, which showed outstanding preliminary results, reducing the adversarial attack success rate from 95% to 0.5% (Papernot et al. 2016); however, Carlini and Wagner (2017) defeated this defense by designing a more powerful attack able to break it. Thus, defense and design against adversarial attacks remain an open problem in DL applications.
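
The min-max structure of adversarial training discussed above can be sketched schematically as follows. This single-step variant reuses the FGSM sketch from earlier as the inner maximization, whereas Madry et al. (2017) use multi-step PGD; the function and parameter names are our own illustrations.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps: float) -> float:
    """One min-max training step: attack the batch (inner maximization),
    then descend on the loss of the perturbed batch (outer minimization)."""
    x_adv = fgsm(model, x, y, eps)   # fgsm defined in the sketch above
    optimizer.zero_grad()            # clear gradients left by the attack
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```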

As mentioned above, most studies on the application of DL models in transportation have paid no attention to robustness. In light of emerging malicious attacks, however, defending models against such attacks has become increasingly important. These attacks typically corrupt the input data by adding noise and can thus disturb the control unit by causing it to infer wrong information from the data, potentially resulting in serious accidents. Weather conditions such as rain or snow are another source of noise. Increasing the robustness of detection models will enable ITS models to operate better in severe conditions and thus improve their performance.

In summary, though much research is underway across the various domains of ITS using a variety of DL models, future research in DL for ITS should focus on the following: how to develop DL models that can efficiently use the heterogeneous data ITS generates, how to build robust detection models, and how to ensure security and privacy in the use of these models.