
1 Introduction

Traffic congestion has become a frustrating and complicated issue in most urban areas. A smart and efficient traffic control mechanism is the solution to this problem. Moreover, such a system provides substantial advantages, such as smoother traffic flow and reduced unwanted waiting time at traffic junctions. As traffic demand rises, better management of traffic at bottleneck junctions becomes essential; failure to manage them is sure to cause congestion. Congestion mostly occurs at a junction when most vehicles are waiting for the signal to turn green. Unfortunately, current traffic systems fail to consider the real-time parameters that affect traffic congestion.

Thus, much research on traffic regulatory systems is ongoing to address the challenge of traffic congestion. The automation of traffic regulatory systems draws on many fields, such as image processing (IP), machine learning (ML), and the Internet of Things (IoT). Earlier traffic signal control (TSC) models did not adequately address the inconveniences caused by over-saturation, delays due to unexpected events, and climate change. Data collected from traffic networks at different times were used to control green signals based on the Webster formula [1]; however, this was not adequate for controlling fast-moving traffic. Scientific and technical studies consistent with the fact that queue size plays a vital role in traffic control [2,3,4] have also failed to address green-signal wastage, cross-blocking, and occlusion.

Nowadays, the world is witnessing exciting research efforts that strive to automate an optimized traffic signal, overcoming the shortcomings of existing ones by considering all the real-time facts learned from the system’s surroundings, including driver behavior [5]. Research in traffic automation has accelerated since the introduction of Reinforcement Learning (RL). Using RL, the traffic regulatory system can be adapted effectively; for example, the green-light duration can be shortened, lengthened, or even skipped according to the dynamic traffic conditions [6]. RL adapts well to the dynamicity of traffic conditions irrespective of the time of day. This property increases the possibility of realizing such a system in practice.

Figure 1 shows the typical framework of an RL scenario. Here, an agent takes an action (say A) in an environment. This action is interpreted into a reward (say R) and a representation of the state (say S), and the result is fed back to the agent. The main difficulty lies in choosing an algorithm that suits the situation at hand; one must have a clear idea of the available algorithms to select an appropriate one for the case under study. Moreover, when RL algorithms exploit the advantages of other techniques, the results obtained are remarkable.

Fig. 1 Framework of the RL scenario

This paper focuses on different RL algorithms. The discussion introduces some of the well-known algorithms and the environments in which they can be used. The paper also addresses the advantages and disadvantages of the algorithms under consideration.

The study is organized as follows. Section 2 focuses on the preliminary knowledge of the RL algorithms. Section 3 provides a foundation for a better understanding of the existing systems, which acts as the basis for this paper. A detailed analysis of RL models (RLMs) is presented in Sect. 4. Section 5 gives an overview of the datasets, simulation platforms, and performance metrics used in RL-based vehicular traffic control models (RL-VTCMs). The open challenges and recommendations, and the conclusion, are given in Sects. 6 and 7, respectively.

2 Reinforcement Learning

Reinforcement learning (RL) refers to a class of ML methods that analyzes how software agents ought to take actions in their environments so as to maximize the cumulative reward. Nowadays, it is used in various software systems and machines to find the most suitable path or state to take given the present scenario. This section provides the preliminaries of RL.

The environment is the object on which the agent acts, and the agent is the RL algorithm. Initially, without any prior knowledge of how to behave, the agent starts interacting with its environment. The environment sends an input, a state, to the agent. The agent then takes an action, based on the knowledge it has gained, in response to the received state. Via an interpreter, the environment sends a pair consisting of the next state and a reward back to the agent. The reward, which is either positive or negative, depends solely on the agent’s action; a negative reward is usually referred to as a punishment. To evaluate its last action, the agent updates its knowledge using the reward obtained. By learning iteratively, the agent reaches the optimal condition.
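
The interaction just described can be summarized in a short sketch. The code below is only an illustrative skeleton of the loop: the `env` object with `reset`/`step` methods and the `agent` object with `select_action`/`update` methods are hypothetical placeholders, not part of any specific library.

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `agent` are assumed, illustrative objects.

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # environment sends the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)               # agent acts on its current knowledge
        next_state, reward, done = env.step(action)       # interpreter returns (next state, reward)
        agent.update(state, action, reward, next_state)   # knowledge update from the reward
        total_reward += reward                            # a negative reward acts as punishment
        state = next_state
        if done:
            break
    return total_reward
```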

Even now, some confuse Supervised Learning (SL) with RL. Table 1 compares RL and SL to bring more clarity.

Table 1 Reinforcement learning versus supervised learning

The sequential nature of RL can be explained as follows. The output depends on the state of the current input, which in turn depends on the previous output; i.e., the input at a particular time always takes the output of the previous cycle into account, forming a chain. In contrast, SL algorithms apply the knowledge gained from labeled examples to new data in order to predict future outputs.

The continuous interaction with the environment helps software agents and machines automatically determine the quintessential behavior for a specific context so as to maximize performance. After sufficient training, the ideal output is attained for any new input, provided the system is given a suitable dataset.

RL requires a reward feedback mechanism known as the reinforcement signal, from which the agent learns. The action corresponding to each reward is analyzed to find the best one. The learning algorithm compares the obtained output with the correct, intended output and thereby calculates the error. Accordingly, the necessary modifications are made to the model.

2.1 Classification of Reinforcement Learning Algorithms

RL algorithms can be classified according to different factors, such as the reward, the model, the action space, and the policy. These classifications are explained below:

Based on Reward The reward-based classification depends mainly on the nature of the reward. In practice, RL is categorized into positive RL and negative RL. An RL algorithm exhibits positive reinforcement when an event increases the strength of a behavior; i.e., an event occurs because of a particular behavior of the agent, and a reward is assigned to the agent. If that reward helps to maximize performance, it is positive reinforcement; in other words, in positive RL the reward has a positive effect on the behavior. In negative RL (negative reinforcement), behavior is strengthened by training the system to stop or avoid an unfavorable condition; however, it only enforces a minimum standard of performance.

Based on Model Model-free and model-based are the two classifications of RL algorithms based on the model. The transition probability distribution (TPD) is also known as the transition model, and the model of the environment comprises both the TPD and the reward function (RF). Model-free algorithms do not make use of the TPD or the RF associated with the environment.

Let the current state be \(s_0\) and the action be a. By performing a, the model reaches \(s_1\) from \(s_0\). A model-based agent analyzes and learns the transition probability function T, in this case \(T(s_1 \mid s_0, a)\). By learning it successfully, the agent can determine the chance of entering a particular state from the current state by taking a specific action. Model-free algorithms, in contrast, rely on the peculiar trial-and-error method of RL. In each trial, the model gains some knowledge: a correct action helps the model optimize the output, while a wrong action helps it update itself to stay away from unfavorable states. Because the model earns knowledge from all its trials, there is no need to store the transitions.

Because model-free RL algorithms do not require the state-space and action-space model to be stored, they are in greater demand than model-based ones. Model-based RL algorithms are used in cases where the transitions have to be saved; however, as the state space and action space grow, effective utilization of storage space becomes impractical. Model-based RL algorithms are preferred in scenarios where the system can decide the next move from a trained model, without interacting with the current environment. Conversely, if a decision requires continuous interaction with the environment, as in a real-time traffic regulation system, model-free RL algorithms perform well.
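
As a concrete illustration of the model-free case, the following tabular Q-learning sketch never stores the transition probabilities; it learns only from sampled (state, action, reward, next state) experience obtained by trial and error. The `env` interface (`reset`, `step`, `actions`) is an assumed placeholder.

```python
import random
from collections import defaultdict

# Tabular Q-learning, a representative model-free algorithm: no transition
# model T(s'|s, a) is stored; the agent learns purely from sampled experience.
def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                     # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:      # trial-and-error exploration
                action = random.choice(env.actions(state))
            else:                              # exploit current knowledge
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            # temporal-difference update: good actions raise the estimate, bad ones lower it
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```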

Based on Action Space RL agents can have two categories of action spaces, namely discrete and continuous. If the agent chooses its next action from a finite action set, the algorithm is said to have a discrete action space. In a continuous action space, by contrast, a single real-valued vector represents the entire action space, so differences between individual actions cannot be expressed explicitly. In a discrete action space, the selection of actions can be fine-tuned, and discrete action spaces are better suited to value-based approaches. A discrete action space approach is employed when only a small action space is required, whereas a continuous action space is required when the size of the action space grows towards infinity.
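
The distinction can be made concrete with a small, purely illustrative example for a signal controller: a discrete action space is a finite set of phases to pick from, whereas a continuous action space is a single real-valued vector (here, assumed green durations in seconds).

```python
# Illustrative (assumed) action representations for a traffic signal controller.

# Discrete action space: the agent selects one element of a finite set.
DISCRETE_ACTIONS = ["NS_green", "EW_green", "NS_left_green", "EW_left_green"]

# Continuous action space: a single real-valued vector, e.g. green durations
# in seconds for each phase, drawn from an infinite set of possibilities.
continuous_action = [32.5, 18.0, 12.25, 9.75]
```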

Based on Policy In policy-based RL algorithms, the main objective is to maximize the reward. The policy defines the behavior of an agent at a particular time; in other words, it is a mapping from learned states to the actions to be taken when the agent reaches those states. These algorithms try to determine the action to take in a state so as to attain the maximum reward in the forthcoming steps.

The algorithm fine-tunes a vector of parameters to attain this objective. For example, to select the best action under the policy \( \pi \), a vector of parameters, say \( \theta \), is adjusted. This is represented mathematically as follows:

$$\begin{aligned} \pi (a|s, \theta ) = Pr \left\{ A_{t} = a | S_{t} = s, \theta _{t} = \theta \right\} \end{aligned}$$
(1)

The right-hand side (RHS) of Eq. 1 gives, for time step t, the probability that action a is taken in state s under the parameter vector \( \theta \); the left-hand side (LHS) is the policy that the agent learns. The agent tunes \( \theta \) so that the best action is selected from each state, and the entire learning or training phase follows the same policy.
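
One common way to realize Eq. 1 is a softmax (Gibbs) policy over action preferences, sketched below; the feature function `features(state, action)` and the parameter vector `theta` are illustrative assumptions.

```python
import numpy as np

# Illustrative softmax policy pi(a | s, theta), one possible realization of Eq. 1.
# `features(state, action)` is an assumed feature extractor returning a numpy vector.
def policy(theta, features, state, actions):
    prefs = np.array([theta @ features(state, a) for a in actions])
    prefs -= prefs.max()                                 # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return dict(zip(actions, probs))                     # Pr{A_t = a | S_t = s, theta}
```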

Policy-based RL algorithms are of two types: on-policy and off-policy. In on-policy algorithms, the agent learns the Q-function so that the goodness of each action is determined, and the best action is then selected stochastically from these results. In off-policy algorithms, by contrast, a greedy decision is taken to perform the action with the best Q-value, where the Q-value is learned using a different policy or algorithm. Owing to their nature, these two kinds of reinforcement algorithms are sometimes referred to as stochastic and deterministic reinforcement algorithms, respectively.
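
The contrast is visible in the one-line update rules of SARSA (a standard on-policy method: the bootstrap uses the action the behavior policy actually chose) and Q-learning (a standard off-policy method: the bootstrap greedily uses the best Q-value). In this sketch, Q is assumed to be a dictionary keyed by (state, action) pairs.

```python
# On-policy (SARSA) update: uses a_next, the action actually selected in s_next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Off-policy (Q-learning) update: greedily uses the best Q-value in s_next,
# regardless of which action the behavior policy will actually take.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```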

Policy-based algorithms can exhibit better convergence and remain suitable even in higher-dimensional action spaces. Their most attractive characteristic is their stochastic nature. Despite these advantages, they still have some disadvantages: rather than converging to the global optimum, these algorithms often converge to a local optimum. In mathematics and computer science, the global optimum is the optimal solution among every possibility, whereas a local optimum is the best solution only within a small neighborhood of possible solutions, which is why it is not preferred.

Policy-based algorithms also have higher variance, whereas a small variance characterizes an efficient estimator. The important reinforcement algorithms with the properties mentioned above have been compared, and their main traits are given in Table 2.

Table 2 Reinforcement learning algorithms: comparison

3 Related Work

Researchers have proposed several solutions to solve conventional TSC problems. This section discusses the most important of the solutions put forward by researchers.

An RLM that uses the Q-learning algorithm with action-value approximation has been used to build an online, model-free traffic signal controller [7]. This work focuses on both average delay and queue length, rather than considering only the average delay. It also utilizes an ANN to train the model according to the temporal difference. However, [7] fails to consider unexpected dynamic scenarios in real time.

A model-free RL algorithm can contribute greatly to the traffic signal control problem by combining prior traffic knowledge with a deep RL approach, as shown in [8]. Here, Q-learning is used to train another approach named the Mixed Q-network (MQN). Such a model can learn traffic patterns and then find the most suitable agent. The biggest shortcoming of such a system is finding a traffic pattern detector that suits the dynamic traffic structures.

Most of the earlier studies have succeeded in developing traffic signal controllers (with restricted action selection) with the help of the peculiar properties of basic RL algorithms. Two RL adaptive traffic signal controllers were designed so that their learned policies could be analyzed and compared to a Webster’s controller [9]. The controllers were implemented using asynchronous Q-learning and advanced adaptive actor-critic algorithms, with neural network function approximation added to the design. The control interval remained constant because of the fixed green-signal duration in the scenario under observation; if the action selection were made dynamic, the agent could control the environment better. Also, in the testing scenario, each intersection was controlled by an isolated RL agent, so the model cannot be considered a multiagent RL system.

Q-learning techniques have been used to maximize the number of vehicles passing a junction and to adjust the road signals by observing the variation of queue lengths and throughput as the key parameters [10]. However, this system fails to evaluate the accuracy of the model on roads with multiple intersections. Also, data transfer between the traffic islands has not been considered in this study.

The time delay, the number of idle vehicles, and the combined saturation were estimated from experience to learn and determine the optimal actions, preserving the traffic signal timing efficiently [11]. The work modularized the actual continuous traffic states for simplification purposes.

The spectacular properties of the Deep Q-Network (DQN) have much to offer TSC models [12]. Further, DQN is used in learning models on modern ride-sharing platforms [13], where the model-free DQN learns optimal vehicle dispatch policies from its interaction with the environment. However, some crucial details are missing in this study: scalability, fault tolerance, reliability, and availability of shared data also have to be considered.

European countries are well acquainted with the advantage of group-based signal control, which provides flexible phase structures, but most existing systems use simple timing logic in their implementation. Jin and Ma [14] try to reformulate the existing system as an adaptive multi-agent system by incorporating Q-learning and SARSA. Nevertheless, the work lacks handling of real-time scenarios and the associated issues.

The R-Markov average reward technique (RMART) is suited for signal controllers in a connected vehicle environment [15]. The research implemented the idea on eighteen signalized intersections in a hypothetical network, assuming arbitrary values for the learning parameter and the discount factor. Aragon-Gómez and Clempner [16] address a multiagent continuous-time scheduling problem and propose a learning scheme for it; they introduce an RLM (based on the temporal-difference method) that treats traffic signal control problems as continuous-time Markov games (CTMG), with transition rates and reward points calculated accordingly. However, the shortcomings of this work include the lack of a collaborative approach and questions about the method’s robustness when exposed to a real-time environment.

Vehicles are used not only for travel but also for goods transportation, so traffic control is one of the primary demands for manufacturing companies too. In the future, more emphasis will be placed on automation; the timely delivery of a product to the consumer therefore affects both its quality and its production cost. A deep reinforcement learning (DRL) model offers a solution via a dynamic routing strategy [17]. Traffic states and actions can be predicted using DRL combined with a Q-learning step and a recurrent neural network (RNN), and different combinations of states, actions, and rewards are utilized in the modeling to reduce delivery time and delay. Still, the model fails to consider a few other dependent factors and causes of traffic congestion.

Lack of proper traffic control not only creates traffic congestion but also adversely affects safety, time, efficiency, and energy. These problems are intensifying with the advent of autonomous and electric cars. Therefore, ongoing research has begun to use RL techniques to address these issues [18,19,20]. RL techniques can also be used intelligently and appropriately to facilitate learning and problem-solving in many other traffic-related areas [21, 22].

4 RL Models in Vehicular Traffic

This section presents some of the RLMs that are used in vehicular traffic for traffic automation purposes. A review of RLMs and their strengths to address traffic control challenges is given in Table 3. Also, Table 4 reports RLMs and the attributes for the vehicular traffic regulation system.

4.1 Multiagent Reinforcement Learning

Most intelligent systems nowadays depend heavily on multiple agents competing with each other to improve the system’s overall behavior. Such a process that incorporates RL algorithms is known as multiagent reinforcement learning (MARL). The number of possible joint state-action pairs (SAPs) increases exponentially with the number of agents; in other words, the number of SAPs grows rapidly as agents are added. In MARL, the agents exchange information, and based on all the available and received data, they coordinate their actions to achieve global Q-value optimization. The most attractive feature of MARL is its scalability, i.e., new agents can be added easily [12, 23,24,25].

Table 3 Summary of RL models
Table 4 Summary of RL models for vehicular traffic regulation systems

The main challenge for agents in a shared dynamic environment lies in learning the situation and making better decisions, and the same holds for traffic. In a vehicular traffic scenario, the action of an agent at one intersection can affect, and vary with, the decision of the agent at a neighboring intersection, which may in turn affect the agent’s own performance. A wrong decision therefore carries a high probability of causing heavy congestion at nearby intersections. Hence, each agent should take optimal actions, communicate them, and coordinate with the others. MARL is a helpful model in such cases [2, 26]. A MARL that tries to optimize the global Q-value has been used for the traffic regulation system [2], and the inappropriateness of the traffic phase is tackled using a distributed model [26].
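
As a rough, simplified illustration (not the exact formulations of [2] or [26]), each intersection can be handled by its own learner that augments its local state with the queue information received from neighboring agents before selecting and learning an action.

```python
import random
from collections import defaultdict

# Simplified multiagent sketch: one independent learner per intersection that
# appends neighbors' queue lengths to its local state (assumed coordination scheme).
class IntersectionAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.Q = defaultdict(float)
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def observe(self, local_state, neighbour_queues):
        # coordination via shared data: neighbors' queues become part of the state
        return (local_state, tuple(neighbour_queues))

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def learn(self, s, a, r, s_next):
        best = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best - self.Q[(s, a)])
```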

4.2 Multistep Backup Reinforcement Learning

The optimal action decided by a typical RL algorithm depends strongly on the present state, but an action usually affects a consecutive series of states. In multistep backup reinforcement learning (MBRL), the average outcomes of the temporal differences are calculated within an episode (a series of time instants), and based on these data the agent updates the Q-values. MBRL focuses on the long-term payoff of an action taken in a state. The averaged effect of the temporal differences is obtained using eligibility traces.

The MBRL model cuts down the average hold-up time by considering the phase sequence and phase split of traffic at an intersection in a single-lane traffic network [27]. Grouping individual traffic movements into traffic phases gives the traffic phase split for processing. An episode is the duration between the activation and the termination of the green signals for a combination of traffic movements. Each time a SAP is visited, its eligibility value is set to one; this value is updated on every visit, so the eligibility trace gives more credit to recent SAPs. The temporal difference is weighted using the eligibility traces, and the Q-value of an episode is updated using this weighted temporal difference. A trace-decay parameter exponentially decays the eligibility trace of an unvisited SAP.
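
The mechanism can be sketched as a generic eligibility-trace update in the style of SARSA(\(\lambda\)) (an illustrative form, not the exact formulation of [27]): the visited SAP’s trace is set to one, the temporal-difference error is applied to every SAP in proportion to its trace, and all traces then decay by the trace-decay parameter.

```python
# Generic eligibility-trace update (SARSA(lambda) style). Q and E are assumed to be
# dictionaries (e.g. defaultdict(float)) keyed by (state, action) pairs.
def trace_update(Q, E, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95, lam=0.8):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # temporal difference
    E[(s, a)] = 1.0                                       # visited SAP gets full credit
    for sap in list(E):
        Q[sap] += alpha * delta * E[sap]                  # update weighted by eligibility
        E[sap] *= gamma * lam                             # unvisited SAPs decay exponentially
    return Q, E
```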

4.3 Max-Plus Reinforcement Learning

Agents in a coordination graph are interconnected. A max-plus algorithm calculates and exchanges local and global payoffs among these agents, and, as part of the optimal joint action, each agent uses the payoff values to determine its corresponding action. Max-plus reinforcement learning (MPRL) follows a top-down approach, and this modularization helps it confront the challenge of dimensionality. The probability of better results in an oversaturated network is assessed by incorporating MPRL into the reward structure of a Q-learning agent in the design of a traffic signal control [3].

Agent i sends locally optimized payoffs to its neighbor j via the edge connecting them; the payoff depends on the action taken by j. After a finite number of iterations, the algorithm converges to a fixed point. In this way, it is possible to increase the throughput of the traffic signal and reduce the number of stops per vehicle to some extent [3].
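
The payoff exchange can be sketched as max-plus message passing on the coordination graph: for every action of neighbor j, agent i sends the best payoff it can contribute, and after a fixed number of iterations each agent picks the action with the largest accumulated payoff. All inputs below (`agents`, `edges`, `actions`, local payoffs `f_i`, pairwise payoffs `f_pair`) are assumed, illustrative structures.

```python
# Illustrative max-plus message passing on a coordination graph.
# f_i[i][a_i]                : local payoff of agent i for action a_i (assumed given)
# f_pair[(i, j)][(a_i, a_j)] : pairwise payoff on edge (i, j)          (assumed given)
def max_plus(agents, edges, actions, f_i, f_pair, iterations=10):
    msg = {}
    for (i, j) in edges:                                  # one message per directed edge
        msg[(i, j)] = {a: 0.0 for a in actions}
        msg[(j, i)] = {a: 0.0 for a in actions}

    def pair(i, j, a_i, a_j):
        # pairwise payoffs are stored for one orientation of each edge
        return f_pair[(i, j)][(a_i, a_j)] if (i, j) in f_pair else f_pair[(j, i)][(a_j, a_i)]

    for _ in range(iterations):
        for (i, j) in list(msg):
            for a_j in actions:
                # best payoff i can offer j if j plays a_j, plus the messages
                # i has received from its other neighbours
                msg[(i, j)][a_j] = max(
                    f_i[i][a_i] + pair(i, j, a_i, a_j)
                    + sum(m[a_i] for (k, t), m in msg.items() if t == i and k != j)
                    for a_i in actions)

    # each agent finally selects the action with the largest accumulated payoff
    return {i: max(actions, key=lambda a: f_i[i][a]
                   + sum(m[a] for (k, t), m in msg.items() if t == i))
            for i in agents}
```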

4.4 Reinforcement Learning with Function Approximation

Commonly, in a shared dynamic space, the number of SAPs can be huge, and it increases exponentially as the number of agents increases, diminishing the scope for scalability. Thus, RL faces the challenge of dimensionality. This issue can be solved to some extent by introducing function approximation (FA) logic into RL. Instead of storing many SAPs, FA stores and attends to an appreciably smaller number of features, which reduces the memory/storage requirement, improves scalability, and shortens learning time. In RL with FA (RLFA), Q-values are represented using tunable weight vectors and feature vectors [1, 4, 28, 29].
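
With linear function approximation, for instance, the Q-value is the inner product of a tunable weight vector and a feature vector of the state-action pair, and the weights are adjusted by a semi-gradient temporal-difference update. The feature extractor `phi` below is an assumed placeholder.

```python
import numpy as np

# Linear function approximation: Q(s, a) = w . phi(s, a).
# `phi(s, a)` is an assumed feature extractor returning a fixed-length numpy vector.
def q_value(w, phi, s, a):
    return float(w @ phi(s, a))

def fa_td_update(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.95):
    best_next = max(q_value(w, phi, s_next, b) for b in actions)
    td_error = r + gamma * best_next - q_value(w, phi, s, a)
    return w + alpha * td_error * phi(s, a)   # gradient of a linear Q is simply phi(s, a)
```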

Consider a centralized model applied to a real-world traffic network based on Bangalore, to \(2\times 2\) and \(3\times 3\) grids, and to a traffic network of sixteen trivial streets. Here, the RLFA approach addresses the challenge of the traffic phase sequence at a two-way intersection by optimizing the global system performance; RLFA helps to increase throughput and reduce waiting time [4, 28].

5 Datasets, Simulation Platforms, and Performance Metrics Analysis of RL-VTCMs

This section analyzes the performance metrics used in traffic-related research and the simulation platforms used in such studies. It also investigates the datasets used in RL-VTCMs. Table 5 gives a summary of the performance metrics.

Table 5 Summary of performance measures

5.1 Benchmarked Datasets for RL-VTCMs

Some of the benchmark datasets that focus on autonomous navigation are ADE20K [30], Berkeley Deep Drive (BDD) [31], Cityscapes [32], CamVid [33], Daimler [34], IDD [35], KITTI [36], Leuven [37], and Mapillary Vistas [38]. The varied lighting conditions and the multiple cameras and sensors deployed in the cities help Cityscapes provide a large amount of data. The Mapillary Vistas dataset covers street-scene imagery: images of the road and its surroundings taken from different angles are present in this dataset, irrespective of the cameras that captured them, but it contains no video data. The Berkeley Deep Drive dataset concentrates on autonomous navigation. For ADE20K, general scene parsing is the main area of interest. Dashboard cameras are used in BDD100K to capture images; the glass in front of the cameras adversely affects image quality, and this can worsen in rainy conditions. IDD can be used to ensure security and reliability in unusual and extreme cases.

5.2 Simulation Platforms

Some discrete-event simulators are developed using programming languages such as C/C++ and tools such as MATLAB. Traffic simulators with a graphical user interface (GUI) follow either a macroscopic or a microscopic approach; most, including VISSIM, SUMO, TSIS, and ITSUMO, embrace the microscopic approach.

5.3 Performance Measures

Appropriate performance measures are required to assess the merits of any traffic control system. These parameters are essential in RL-based TSC because an agent needs to assess its own performance in order to learn from experience. Some of the performance measures used in vehicular traffic are the reduction of fuel consumption, the reduction of emissions, the number of stops in a journey, the percentage of stopped vehicles, the average delay, the average trip waiting time (ATWT), the vehicle density at different parts of the network, the queue length, and the average vehicle speed. Table 5 reports some of the performance measures achieved by the RLMs and algorithms.
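
A few of these measures are straightforward to compute from simulation logs, as in the sketch below; the per-vehicle record fields (`wait_time`, `stops`) and the per-step queue samples are hypothetical names used only for illustration.

```python
# Illustrative computation of three common measures from simulated trip logs.
# `trips` is assumed to be a list of dicts with 'wait_time' (seconds) and 'stops';
# `queue_samples` is a list of queue lengths sampled at each simulation step.
def average_trip_waiting_time(trips):
    return sum(t["wait_time"] for t in trips) / len(trips)          # ATWT

def percentage_stopped(trips):
    return 100.0 * sum(1 for t in trips if t["stops"] > 0) / len(trips)

def average_queue_length(queue_samples):
    return sum(queue_samples) / len(queue_samples)
```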

6 Open Challenges and Recommendations

Having discussed the major algorithms and models in RL, we now examine the various challenges that need to be addressed during their use. This section throws light on the important hurdles in applying RLMs and algorithms in ITS and includes suggestions for handling these challenges.

  • Injecting RL in unfitting circumstances RL is a promising and fast-advancing technique in a variety of fields, such as resource management in computer clusters, traffic light control, robotics, games, and chemistry. However, too much reinforcement leads to an overload of states, followed by diminishing results. Inappropriate parameters and assemblage of payoff messages lead to poor system performance, even during the initial learning phase.

  • Availability of data When enough data are available, SL methods are preferred, because RL algorithms become time-consuming when the action space is large.

  • Real-time environment In a shared and dynamic environment like traffic regulation, RL algorithms and models have to include the recent advances in ITS to exhibit their full strength. A better traffic regulatory system comprises almost all the dynamic parameters such as traffic density, road utilization, and vehicles.

  • Self adaptiveness The aim of current research is to build an automated traffic regulatory system that self-configures its dependent parameters to adapt to the dynamicity of traffic. Interoperability is usually affected by communication overhead; hence, the exchange of control messages must be limited by eliminating unwanted ones, which also improves the learning rate of the system. The agent is expected to learn new and unexpected actions and states in the operating environment.

  • External impediments Weather conditions such as rain, flood, and fog pull down the hope of a fully automated, self-paced traffic regulatory system. In addition, the traffic flow (in and out) and disturbances in the traffic flow make the problem worse. Upcoming traffic regulation proposals have to take care of all such situations.

RL enhances system performance in scenarios with scarce data, such as traffic regulatory systems. Hence, in developing countries with very few publicly available traffic datasets, RL has a huge impact on developing better VTCMs. Integrating RL with advanced technologies such as fuzzy logic, game theory, and AI accelerates progress towards an extremely self-paced traffic regulatory system; these technologies help to include prior knowledge and obtain optimal actions. The analysis of prior traffic data, gained knowledge, approximation, and conventional control systems is required for a better traffic control model. Agents in the model use the traffic observer’s information to increase the learning rate in the (re)learning phase and thus achieve enhanced system performance.

7 Conclusion

In this paper, we have reviewed RLMs and algorithms with an emphasis on their applicability to traffic regulation systems. The ability of RL algorithms to determine actions that yield the highest rewards can be regarded as the prime reason for their wide acceptance. Consequently, a study of RL algorithms can reveal intrinsic features that can in turn be utilized effectively for handling traffic regulation issues. The paper provides such a detailed review of RL algorithms, but it is not limited to that.

In addition to providing an in-depth analysis of various RL algorithms, the paper also discusses the issues that need to be rectified for their hurdle-free application in traffic regulation systems. These issues demand the immediate attention of researchers, especially considering that we are fast progressing towards a world which is 'smart' in all aspects.