
1 Introduction

Population growth and urbanization have accelerated the demand for a good transportation system. At the same time, the existing road infrastructure and its management are not adequate to bear the increasing traffic volume [1]. This has led to several problems such as congestion, which causes delays and in turn affects the environment through noise and air pollution [2]. Proper traffic management is therefore essential. Numerous research ideas have been put forward; however, most of them consist of hand-crafted programs. These programs are not only computationally complex but also rest on fixed assumptions, and as a consequence they are not effective in real-time scenarios. Some researchers have tried to employ techniques such as fuzzy logic and genetic algorithms, but these methods also failed to achieve the desired output because they do not consider real-time data inputs. Therefore, a good traffic system is required that continuously monitors the traffic in real time and takes its decisions after analysing all the roads at the intersection.

In order to achieve this goal, we employ an RFID reader for continuous monitoring (assuming that every vehicle carries an RFID tag). For analysis and decision-making we use reinforcement learning [3, 4]. The reason for using reinforcement learning is that it does not depend on heuristic assumptions and equations, and it is well suited to learning optimal control from previous experience of handling the traffic. Reinforcement learning is based on the Markov decision process [5] and has been used extensively as a computational tool. Traditional reinforcement learning algorithms are less than optimal and have limited scalability, so Q-learning, a popular RL algorithm, combined with a deep neural network was chosen for its better learning capacity [6, 7].

The use of information and communication technology to improve the quality of living is what defines a smart city. Smart cities have evolved as a result of continual development and population growth, and they seek to address the resulting issues in order to improve inhabitants’ quality of life. A city is considered smart when it can develop and implement creative solutions based on cutting-edge technologies and scientific knowledge. To put it another way, a city gets “smarter” by developing and implementing data-driven solutions that make it easier to monitor, understand, analyse, plan, and optimize its operations, activities, services, and policies.

A. Outline of Chapter

The remainder of the chapter is organized as follows: Sect. 2 gives a synopsis of existing research on traffic light control, while Sect. 3 outlines the preliminaries. Section 4 describes the learning mechanism of the model and its corresponding state, action, and reward in detail. Section 5 discusses the experimental setup and training of the whole model, together with an analysis of the results and their performance. Finally, the conclusion and future work are described in Sect. 6.

2 Related Work

In [8] the authors put forward an adaptive traffic signal control system trained by deep Q-learning that reduces travel time by 20% and waiting time by 82%. In another attempt, the authors tried to reduce vehicle waiting time by using a multi-agent reinforcement learning algorithm, and showed that their model continued to outperform the alternatives even as the traffic load increased.

In [9], the author constructed a two-way street model with a controllable traffic light that depends on a time function, so that each cycle time differs from the previous one. In [8] the author applied a deep reinforcement learning method to propose an adaptive traffic signal control system (simulator) and trained the model using Q-learning. In [10] Juntao Gao et al. modelled a deep reinforcement learning algorithm that takes its attributes from real-time data and masters an excellent policy for adaptive signal control. In [11] the authors attempted to adjust the duration of traffic lights dynamically; they divided the whole scenario into smaller sections and employed a convolutional neural network to map the reward to the state, with the state defined by vehicle position and speed information.

Centralized reinforcement learning is not feasible for adaptive traffic signal control, so [12] presents a fully scalable and decentralized MARL algorithm based on a novel A2C formulation. The authors of [13] put forward an intelligent driving model that works on the input of an adaptive traffic signal control system and employs a multi-agent framework that dynamically learns driver behaviour, using a Bayesian interpretation of probability for decision-making. The paper [14] puts forward a load-balancing approach that relies on predicting the traffic situation; this is achieved through efficient cooperation of the micro base stations, with the prediction obtained from spatiotemporal correlation using a CNN. In [15] the authors integrated deep learning with a Bayesian model (proposed by Wang) to predict traffic flow.

One advantage of this approach is that it overcomes the error-magnification phenomenon. Another paper [16] also employs deep learning, with a model specifically designed to learn traffic speed; to predict real-time traffic speed, the authors used LTE (Long Term Evolution) data.

The paper [17] presents an efficient scheduling method that is able to control a dynamic and complex traffic environment with the help of reinforcement learning techniques. In another paper [18], the authors designed an adaptive signal control system called RHODES. For smoother operation of the adaptive signal control, real-time information is taken from detectors. To predict the arrival of vehicles at the intersection, the authors used the PREDICT algorithm proposed by Head, which uses the detector output, the traffic state, and the planned phase timing of each upstream intersection. This adaptive model can predict both short-term and medium-term fluctuations of the traffic and can even set phases so as to maximize performance.

3 Background

Reinforcement learning [19] is a branch of machine learning that is quite distinct from the other two categories (i.e. supervised and unsupervised learning) [20, 21]. The main idea is to keep learning and producing better results. This is done by continually monitoring the environment and analysing the situation so as to take a suitable action; a reward is earned based on the effectiveness of the actions, and the goal is to maximize these rewards. A reinforcement learning model can be denoted by the four-tuple (S, A, R, T) of states, actions, rewards, and transitions; the state, action, and reward used in our model are described in the following subsections.

A. RFID

RFID is a technology that employs radio-frequency signals to identify, pinpoint, and track a target without human intervention [22, 23]. RFID systems consist of transponders, transceivers, and back-end storage, commonly called tags, readers, and databases, respectively. A tag carries a microchip with limited memory that stores a unique tag identifier and other related information. Tags are of two types: active and passive. Active tags have their own battery source, while passive tags do not and instead use the electromagnetic waves received from the reader to transmit the collected data. Since active tags have a longer transmission range, they are preferred where objects need to be identified over long distances, such as roadside units in traffic management, health-care applications, animal tracking, object locating in logistics markets, etc. Readers are the devices that read the tags for object identification and stored information and can transfer these data to the back-end systems.

B. Reinforcement Learning Model

    For building a traffic light control system using deep Q-learning [17], we need to define the states, actions, and rewards.

    Here we present how the three elements are defined in our model.

State: The state of the agent is a representation of the situation of the environment at a given timestep t and is denoted by st. For the agent to learn effectively how to optimize the traffic on each road, it must have sufficient information about the distribution of vehicles in the current environment. The objective of this representation is to let the agent know where vehicles are located at timestep t. The approach proposed in this paper is inspired by [24], because its simple state representation makes the use of RFID readers straightforward. In particular, this state design includes only spatial information about the vehicles inside the environment, and the cells used to discretize the continuous environment are not of uniform size. In this paper we explore whether good results can be obtained from such a simple state representation. In each arm of the intersection, the incoming lanes are discretized into cells, and each cell indicates whether or not a vehicle is present inside it (Fig. 1).

Fig. 1 State representation

(1) Discrete state representation: The DSR (discrete state representation) is a mathematical representation of the state space in the form of a vector, where every element is computed by the following equation:

$$\text{DSR}[i] = c$$
(1)

c = 1 if the ith cell contains at least one vehicle and c = 0 otherwise, as explained in Fig. 2.

Fig. 2 DSR vector representation with respect to the cells of the intersection

It must be noted that the DSR vector does not represent only a single road of the intersection. As shown in Fig. 1, we have divided a single road of the intersection into three lanes. As described in Fig. 3, lane 1 is dedicated to going straight or left, while lane 2 is dedicated to turning right only. Lane 1 and lane 2 therefore have different DSR vectors, and each contains 10 cells, which means that every arm of the intersection has 20 cells and the whole intersection has 80 cells.

Fig. 3 State representation of the west road of the intersection, with cell lengths

So, the proposed state space is composed of 80 boolean cells. This means that the number of possible states is \(2^{80}\). The choice of boolean cells for the environment representation is also crucial because the agent has to explore just the most significant subset of the state space in order to learn the best behaviour.

When the agent samples the environment at a timestep t, it receives a vector DSRt containing the discretized representation of the environment at that timestep. This is the principal information about the environment that the agent receives, so it is designed to be as precise as possible without being excessively detailed, in order not to increase the computational complexity of the neural network’s training [25].
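To make the construction of DSRt concrete, the following Python sketch builds the 80-element vector of Eq. (1) from the set of cells in which at least one vehicle has been detected. The function and argument names (build_dsr, occupied_cells) are illustrative only, not part of the actual implementation.

```python
import numpy as np

NUM_CELLS = 80  # 4 arms x 2 lane groups x 10 cells, as described above

def build_dsr(occupied_cells):
    """Build the discrete state representation of Eq. (1).

    occupied_cells: iterable of cell indices (0..79) in which at least
    one vehicle has been detected at the current timestep.
    """
    dsr = np.zeros(NUM_CELLS, dtype=np.float32)
    for k in occupied_cells:
        dsr[k] = 1.0  # DSR[k] = 1 if the k-th cell contains a vehicle
    return dsr

# Example: vehicles detected in cells 0, 3 and 41 of the intersection
state_t = build_dsr([0, 3, 41])
```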

Action space: The action set identifies the possible actions that the agent can take. The agent is the traffic light system, so taking an action translates into turning the traffic lights green for a set of lanes and keeping them green for a fixed amount of time. In this paper the green time is set to 10 s and the yellow time to 4 s. In other words, the task of the agent is to initiate a green phase chosen from the predefined ones. The action space is defined as

$$A = \{\text{NSSL},\ \text{NSR},\ \text{EWSL},\ \text{EWR}\}$$
(2)

The set represents every possible action that the agent can take. Each action in the set is described below:

  • North-South Straight and Left (NSSL): the green phase is activated for vehicles in the north and south arms that want to proceed straight or turn left.

  • North-South Right (NSR): the green phase is activated for vehicles in the north and south arms that want to turn right.

  • East-West Straight and Left (EWSL): the green phase is activated for vehicles in the east and west arms that want to proceed straight or turn left.

  • East-West Right (EWR): the green phase is activated for vehicles in the east and west arms that want to turn right (Fig. 4).

Fig. 4 The four possible actions

If the same action is chosen at timestep t as at timestep t − 1 (i.e. the traffic light orientation is the same), then there is no yellow phase and the current green phase simply continues. If the action chosen at timestep t differs from the previous action, a 4 s yellow phase is activated between the two actions, as explained in Fig. 5; a short sketch of this phase logic is given after the figure.

Fig. 5 Possible simulation time difference between two actions
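The phase logic described above can be sketched as follows. Here set_phase and step are placeholders for whatever interface drives the traffic light (TraCI in simulation), and the phase names are illustrative assumptions rather than the actual identifiers used in our setup.

```python
# Predefined green phases of Eq. (2): indices 0..3 map to NSSL, NSR, EWSL, EWR
ACTIONS = ["NSSL", "NSR", "EWSL", "EWR"]
GREEN_DURATION = 10   # seconds of green per chosen action
YELLOW_DURATION = 4   # seconds of yellow between two different actions

def apply_action(action_t, action_prev, set_phase, step):
    """Activate the chosen green phase, inserting a yellow phase when needed.

    set_phase(phase_name) and step(seconds) are placeholders for the
    interface that controls the traffic light and advances the simulation.
    """
    if action_prev is not None and action_t != action_prev:
        # The traffic light orientation changes: run a 4 s yellow phase first
        set_phase("YELLOW_" + ACTIONS[action_prev])
        step(YELLOW_DURATION)
    # Keep (or start) the green phase for the selected lanes for 10 s
    set_phase("GREEN_" + ACTIONS[action_t])
    step(GREEN_DURATION)
```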

Rewards: In reinforcement learning, feedback on the action chosen by the agent is generated by the environment in the form of a reward [26, 27]. Based on the reward, the agent updates its intuition for future actions, so the reward is a crucial aspect of the learning process. The reward usually takes one of two kinds of values: positive or negative; a good action leads to a positive reward, while a bad action leads to a negative one. In this scenario, the objective is to maximize the traffic flow through the intersection over time. To achieve this goal, the reward should be derived from some performance measure of traffic efficiency, so that the agent is able to know whether or not its action increased efficiency. In traffic analysis several measures are used, such as throughput, mean delay, and travel time [28]. In this paper, the agent measures the total waiting time, defined as follows:

Total waiting time: the sum of the individual waiting times of all cars in the environment at timestep t. The waiting time of a vehicle is defined as the time during which it moves with a speed of less than 0.1 m/s. The total waiting time is computed by the following equation:

$$\text{twt}_{t} = \sum_{veh=1}^{n} \text{wt}_{(veh,\,t)}$$
(3)

where twtt is the total waiting time at timestep t, wt(veh,t) is the time (in seconds) for which vehicle veh has had a speed of less than 0.1 m/s at timestep t, and n represents the total number of vehicles in the environment at timestep t.

Reward function: The reward function that generates a reward for the agent is defined as

$$r_{t} = \text{twt}_{t-1} - \text{twt}_{t}$$
(4)

where rt represents the reward at timestep t, and twtt and twtt−1 denote the aggregate waiting time of all the vehicles at the intersection at timesteps t and t − 1, respectively.
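A minimal Python sketch of Eqs. (3) and (4) is given below; vehicle_speeds and waiting_time are hypothetical containers fed by the simulator or the RFID back end, not the actual data structures of our implementation.

```python
STOP_SPEED = 0.1  # m/s: below this threshold a vehicle counts as waiting

def total_waiting_time(vehicle_speeds, waiting_time, dt=1.0):
    """Accumulate per-vehicle waiting times and return their sum (Eq. 3).

    vehicle_speeds: dict mapping vehicle id -> current speed in m/s
    waiting_time:   dict mapping vehicle id -> accumulated waiting time in s
    """
    for veh, speed in vehicle_speeds.items():
        if speed < STOP_SPEED:
            waiting_time[veh] = waiting_time.get(veh, 0.0) + dt
    return sum(waiting_time.values())

def reward(twt_prev, twt_curr):
    """Reward of Eq. (4): positive when the total waiting time decreased."""
    return twt_prev - twt_curr
```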

(2) Positioning of the RFID readers for the DSR: As explained earlier, the DSR vector is a discrete representation of the state of a continuous environment. In simulation, SUMO provides functionality to obtain this vector, but for real-time operation we propose the use of RFID readers and RFID tags. Here we assume that each vehicle carries an RFID tag and that an RFID reader is placed at the start of each cell, as shown in Fig. 6. The readers at the starting cells have a range of 2–5 m, while the rest have a range of about 10 m. As explained earlier, if there is at least one vehicle in the kth cell then DSR[k] is set to one, so essentially we are trying to detect the vehicle at the start of the cell.

Fig. 6 RFID positions in the cells of an arm

RFID reader for the first 4 small cells: For the first four small cells, the RFID reader is positioned at the beginning of the cell, facing diagonally across the lane, with a range of 5 m, as shown in Fig. 7.

Fig. 7 RFID positions for the first 4 cells of the arm

RFID reader for the other cells: The RFID reader for the other cells is positioned 5 m from the beginning of the cell, facing towards the road, with a range of 2 m. Here we assume a road width of 3.75 m, as used in India.
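Assuming each reader reports the identifiers of the tags it has seen during the current timestep, the detections can be mapped onto the DSR vector as in the following sketch; the READER_TO_CELL mapping and the reader naming scheme are assumptions made for illustration.

```python
# Hypothetical mapping from an RFID reader id to the cell index it monitors;
# in the proposed layout there is one reader per cell, 80 readers in total.
READER_TO_CELL = {"reader_%d" % k: k for k in range(80)}

def dsr_from_rfid_reads(tag_reads):
    """Convert a batch of RFID reads into the DSR vector.

    tag_reads: iterable of (reader_id, tag_id) pairs collected during the
    current timestep; the tag id itself is not needed for the state, only
    the fact that some vehicle was seen by that reader.
    """
    dsr = [0] * 80
    for reader_id, _tag_id in tag_reads:
        cell = READER_TO_CELL.get(reader_id)
        if cell is not None:
            dsr[cell] = 1
    return dsr
```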

4 The Agent’s Learning Mechanism

We have employed deep Q-learning for the purpose of learning; it combines Q-learning with a deep neural network [25].

A. Q-Learning

Q-learning is a form of reinforcement learning that is model-free [29]. Based on the state of the environment, a value is assigned to the action that is about to be taken. This value is called the Q-value and is defined by the following equation (Fig. 8).

$$Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \alpha \left( r_{t+1} + \gamma \max_{A} Q(s_{t+1}, a_{t+1}) - Q(s_{t}, a_{t}) \right)$$
(5)
Fig. 8 RFID positions for the other cells of the arm

Where:

  • Q(st, at) represents the value obtained by taking action at after analysing the state st.

  • The above equation updates the present Q-value with a correction term scaled by the learning rate α.

  • rt+1 denotes the reward.

  • t + 1 highlights the relationship between taken action at and the subsequently received reward.

  • The Q-value of the immediate future is denoted by Q(st+1, at+1), where st+1 represents the state reached after implementing action at.

  • Among all the possible actions, the most valuable one is selected, as represented by maxA.

  • γ is the discount factor, which takes a value between 0 and 1 and lowers the importance of the future reward compared to the immediate reward.

Here we use a slightly modified version of the equation, given below (a short sketch of this update follows the list of terms):

$$Q(s_{t}, a_{t}) = r_{t+1} + \gamma \max_{A} Q'(s_{t+1}, a_{t+1})$$
(6)

where

  • rt+1 represents the reward.

  • The term Q'(st+1, at+1) is the Q-value associated with taking action at+1 in state st+1, i.e. the next state reached after taking action at in state st.

  • As seen in the above equation, γ slightly penalizes the future reward compared to the immediate reward.
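As a small illustration, the target value of Eq. (6) can be computed as below; q_values_next stands for the network's predictions for the next state, and the default γ = 0.25 is the value reported in Sect. 5. The function name is illustrative only.

```python
import numpy as np

GAMMA = 0.25  # discount factor; the experiments in Sect. 5 use gamma = 0.25

def q_target(r_next, q_values_next, gamma=GAMMA):
    """Target Q-value of Eq. (6): immediate reward plus discounted best future value.

    r_next:        reward r_{t+1} observed after taking a_t in s_t
    q_values_next: array of Q'(s_{t+1}, a) for every action a, as predicted
                   by the neural network for the next state
    """
    return r_next + gamma * np.max(q_values_next)
```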

B. Deep Q-Learning

In order to map a state of the environment st to the Q-values of the actions at, a deep neural network [25] is built. The input of the network is the vector DSRt, the state of the environment at timestep t. The outputs of the network are the Q-values of the possible actions from state st.

The input layer of the neural network Sin is defined as:

$$S^{in}_{k,t} = \text{DSR}_{k,t}$$
(7)

where \(S^{in}_{k,t}\) is the kth input of the neural network at timestep t and DSRk,t is the kth element of the vector DSR at timestep t, as shown in Fig. 9. This means that |Sin| = |DSR| = 80, which is the input size of the neural network.

The output layer of the neural network Sout is defined as:

$$S^{out}_{j,t} = Q(s_{t}, a_{j,t})$$
(8)

where \(S^{out}_{j,t}\) is the jth output of the neural network at timestep t and Q(st, aj,t) is the Q-value of the jth action taken from state st at timestep t. This means that the output cardinality of the neural network is |A| = 4, where A is the action space.

The neural network is a fully connected deep neural network with a rectified linear unit (ReLU) activation function and 5 hidden layers.
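A possible PyTorch sketch of this architecture is shown below. The width of 400 neurons per hidden layer is taken from the hyper-parameter table in Sect. 5; the text specifies 5 hidden layers while that table lists 8 layers, so the exact configuration shown here is only indicative.

```python
import torch.nn as nn

STATE_SIZE = 80    # |DSR|, the input size (Eq. 7)
ACTION_SIZE = 4    # |A|, the output size (Eq. 8)
HIDDEN = 400       # width taken from the hyper-parameter table in Sect. 5

# Fully connected network with ReLU activations and 5 hidden layers
q_network = nn.Sequential(
    nn.Linear(STATE_SIZE, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, ACTION_SIZE),  # one Q-value per possible action
)
```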

Figure 9 shows the vector DSR as the input of the network, then the hidden layers of the network itself, and finally the output layer with 4 neurons representing the 4 Q-values associated with the 4 possible actions.

Fig. 9 Structure of the deep neural network

5 Experimental Setup and Training

We have already defined the specification of the agent: the state, the possible actions, and the reward. Figure 10 shows how all these components work together to establish the workflow of the agent during a single timestep t.

Fig. 10 The workflow of the agent in a timestep

After a fixed number of simulation steps, the timestep t of the agent begins. First, the agent retrieves the environment state and the delay times; next, using the delay times of this timestep t and of the last timestep t − 1, it calculates the reward associated with the action taken at t − 1. Then the agent packs the gathered information and saves it to a memory that is used for training purposes. Finally, the agent chooses and sets the new action in the environment, and a new sequence of simulation steps begins.
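The workflow of Fig. 10 can be summarized by the following illustrative sketch, in which env, agent, memory and the helper methods are placeholders rather than the actual implementation.

```python
def agent_timestep(env, agent, memory, prev):
    """One timestep of the workflow in Fig. 10 (illustrative only).

    prev holds the state, action and total waiting time of timestep t-1.
    """
    state_t = env.get_dsr()                       # 1. retrieve the environment state
    twt_t = env.get_total_waiting_time()          # 2. retrieve the delay times
    reward_t = prev["twt"] - twt_t                # 3. reward of the action taken at t-1 (Eq. 4)
    memory.add((prev["state"], prev["action"], reward_t, state_t))  # 4. store the sample
    action_t = agent.choose_action(state_t)       # 5. choose the new action
    env.set_green_phase(action_t)                 # 6. apply it and let the simulation run
    return {"state": state_t, "action": action_t, "twt": twt_t}
```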

A. Experience Replay

To improve the behaviour of the agent and the learning efficiency, a procedure called experience replay [30] is adopted during the training phase. It supplies the agent with all the information essential for learning in the form of a group of samples called a batch. The agent draws each batch from a data structure, intuitively called the memory, which stores all collected samples. A sample m is formally defined as the quadruple

$$m = \{ s_{t},\ a_{t},\ r_{t+1},\ s_{t+1} \}$$
(9)

where rt+1 is the reward obtained after carrying out action at from state st, which leads to the succeeding state st+1. A training instance involves gathering a group of samples from the memory and training the neural network on them.
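A minimal memory suitable for experience replay could look like the following sketch; the capacity of 50,000 samples is the value listed in the hyper-parameter table of Sect. 5, and the class name is illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of samples (s_t, a_t, r_{t+1}, s_{t+1})."""

    def __init__(self, capacity=50_000):          # memory size from Sect. 5
        self.samples = deque(maxlen=capacity)     # oldest samples are discarded first

    def add(self, sample):
        self.samples.append(sample)

    def get_batch(self, batch_size):
        """Return a random batch B of at most batch_size samples."""
        n = min(batch_size, len(self.samples))
        return random.sample(list(self.samples), n)
```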

B. The Training Process

Given the description of experience replay, the training process is now explained in detail. This process is executed every time a training instance of the agent is initiated.

  • A sample m containing the most recent information, described in Eq. (9), is added to the memory.

  • A fixed number of samples (a light sampling strategy is used) is picked randomly from the memory, constituting the batch B.

A single sample bk ∈ B contains the initial state st, the selected action at, the corresponding reward rt+1, and the next immediate state st+1. For every sample bk the following operations are performed (a sketch of this training step is given after the list).

(1) Computation of the Q-value Q(st, at) by submitting the vector DSR representing st to the neural network and obtaining the predicted Q-value relative to action at, as shown in Fig. 11.

Fig. 11 Computation of the Q-value Q for one sample

(2) Computation of the Q-values Q′(st+1, at+1) by submitting the vector DSR representing the next state st+1 to the neural network and obtaining the predicted Q-values relative to the actions at+1. These represent how the environment will evolve and what values the next actions will probably have, as shown in Fig. 12.

Fig. 12 Computation of the Q-values Q′ for one sample

(3) Update of the Q-value using Eq. (6): among the possible future Q-values computed in stage two, maxA Q′ selects the best possible Q-value, representing the maximum expected future reward. It is discounted by the factor γ, which gives more importance to the immediate reward, as shown in Fig. 13.

Fig. 13 Update of the Q-value using Eq. (6) for one sample

(4) Training of the neural network: the input is the vector DSR representing the state st, while the desired output is the updated Q-value Q(st, at), which now includes the maximum expected future reward due to Eq. (6). The next time the agent encounters the state st, or a similar one, the neural network will be likely to output a Q-value for action at that accounts for the best future situation.
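The four steps above can be combined into a single training routine, sketched here in PyTorch under the assumption of a single Q-network and a generic mean-squared-error loss; the chapter does not specify the optimizer or loss function, so these choices are illustrative.

```python
import torch

def train_on_batch(q_network, optimizer, batch, gamma=0.25):
    """One training instance over a batch B of samples (steps 1-4 above).

    Each sample is a tuple (s_t, a_t, r_next, s_next), with states given
    as 80-element DSR vectors (lists or arrays of 0/1 values).
    """
    states = torch.tensor([s for s, _, _, _ in batch], dtype=torch.float32)
    actions = torch.tensor([a for _, a, _, _ in batch], dtype=torch.int64)
    rewards = torch.tensor([r for _, _, r, _ in batch], dtype=torch.float32)
    next_states = torch.tensor([s2 for _, _, _, s2 in batch], dtype=torch.float32)

    q_current = q_network(states)                 # step 1: Q(s_t, .) for every sample
    with torch.no_grad():
        q_next = q_network(next_states)           # step 2: Q'(s_{t+1}, .)

    # Step 3: build the target Q-values using Eq. (6), only for the taken actions
    targets = q_current.detach().clone()
    idx = torch.arange(len(batch))
    targets[idx, actions] = rewards + gamma * q_next.max(dim=1).values

    # Step 4: fit the network so that it outputs the updated Q-values
    loss = torch.nn.functional.mse_loss(q_current, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```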

C. Simulation of Urban MObility (SUMO)

SUMO stands for Simulation of Urban MObility [31]. It is a microscopic traffic simulator that provides a software package allowing users to design the road infrastructure and related elements. Among the packages that SUMO offers, the following were used in this paper (a minimal usage sketch follows the list).

  • NetEdit was used to design the static elements of the intersection, such as the characteristics of the roads, the distribution of traffic lights, and the lane connections across the intersection.

  • The TraCI package is used to define the type, characteristics, and generation of the vehicles that are going to be in the simulation.
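A minimal TraCI usage sketch is shown below; the configuration file name ("intersection.sumocfg") and the traffic light id ("TL") are assumptions made for illustration, and only well-known TraCI calls are used.

```python
import traci  # TraCI ships with SUMO; sumo/tools must be on the Python path

GREEN_TIME = 10  # seconds of green phase, as defined in Sect. 3

# Assumed names: configuration file "intersection.sumocfg", traffic light id "TL"
traci.start(["sumo", "-c", "intersection.sumocfg"])

# Activate one phase of the traffic light programme designed in NetEdit
traci.trafficlight.setPhase("TL", 0)

# Let the simulation advance while the chosen phase is active
for _ in range(GREEN_TIME):
    traci.simulationStep()

# Read back the vehicle information needed for the state and the reward
for veh in traci.vehicle.getIDList():
    speed = traci.vehicle.getSpeed(veh)        # used for the 0.1 m/s waiting threshold
    lane = traci.vehicle.getLaneID(veh)        # incoming lane the vehicle is on
    pos = traci.vehicle.getLanePosition(veh)   # distance along the lane -> cell index

traci.close()
```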

D. Results

The model is trained with the following hyper-parameters.

Parameters | Values
Neural network | 8 layers, 400 neurons each
Memory size | 50,000
Episodes | 300
RFID range | 2 and 5 m

A high gamma means that the agent aims at maximizing the expected cumulative reward of multiple consecutive actions. This can be considered the behaviour of a true RL agent that has a long look-ahead and searches for the policy that gives the best overall performance with respect to the reward obtained at every step.

Figure 14 shows the reward gained by the RL agent in each episode. As we can see, it increases continuously, which means that after each episode the agent learns to take better actions.

Fig. 14 Reward while training in high traffic

Figure 15 shows the queue length of vehicles, i.e. the number of cells occupied by vehicles. It decreases continuously, which means that the waiting time of the vehicles decreases after each episode.

Fig. 15 Queue length of the vehicles while training in high traffic

Figure 16 shows the delay of the vehicles, which represents the amount of time a vehicle spends in the red-light phase; as we can see, this delay decreases continuously over the episodes.

Fig. 16 Delay of the vehicles while training in high traffic

E. Performance Metrics

To evaluate the agent's performance, after the agent has been trained for 300 episodes we run 10 more episodes and observe the following performance metrics (a short computation sketch follows the list).

(1) Average negative reward (anr): the average of the rewards generated over the last 10 episodes, after the agent has been trained.

$$\text{anr}=\mathrm{avg}_{ep} \sum\limits_{ep=1}^{10} r_{ep}$$
(10)
(2) Total waiting time (twt): the sum of the delay times of the last 10 episodes.

$$\text{twt}= \sum\limits_{ep=1}^{10} \text{wt}_{ep}$$
(11)
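Given the per-episode rewards and waiting times logged during the 10 evaluation episodes, the two metrics can be computed as in the following sketch; the function and list names are illustrative.

```python
def evaluation_metrics(episode_rewards, episode_waiting_times):
    """Compute anr (Eq. 10) and twt (Eq. 11) over the evaluation episodes.

    episode_rewards:       list of cumulative rewards, one per evaluation episode
    episode_waiting_times: list of summed waiting times, one per evaluation episode
    """
    anr = sum(episode_rewards) / len(episode_rewards)   # average negative reward
    twt = sum(episode_waiting_times)                    # total waiting time
    return anr, twt

# Example with the last 10 evaluation episodes:
# anr, twt = evaluation_metrics(rewards_last_10, waiting_last_10)
```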

A baseline is needed to compare the agent's performance, so we run a simulation that behaves like the current traffic scenario, i.e. every traffic light phase is always activated in the same fixed order [NSR-NSSL-EWR-EWSL], and we observe the following performance metrics.

Parameters | Values
anr | −202,871
twt | 942,652

As Figs. 14 and 16 show, γ = 0.25 performs best, so we run the baseline scenario with the same value and in the high-traffic scenario.

The comparison between the baseline scenario and the proposed method is shown below.

Parameter | Baseline | Proposed | Reduced (%)
anr | −202,871 | 44,155 | 78.23
twt | 942,652 | 129,353 | 86.27

6 Conclusion

Reinforcement learning is an environment-driven approach that has attracted growing interest in the traffic control domain. In this chapter, we have presented a traffic light control model using a deep Q-learning method (i.e., Q-learning with a neural network) and RFID. The performance of the proposed model was evaluated using SUMO and compared against a fixed-order baseline; the results show that the proposed model notably outperforms the conventional approach.

For future work, we would like to investigate other machine learning algorithms for traffic light control systems. Furthermore, we would like to test the model proposed in this study on a more realistic traffic simulator and to give priority to important and emergency vehicles such as ambulances and police vans.