1 Problem

Traffic lights are introduced to resolve conflicts between road users by dynamically changing the priority given to cars approaching an intersection from different directions. In practice, the underlying optimization problem is not always clearly defined by policy makers. Several objectives are possible: minimize the average delay, minimize pollution by cars (e.g., CO2 emissions), or a combination of these. In addition, (political) constraints may apply, such as giving public transport traveling in a separate lane the highest priority over all other traffic flows. In practice, traffic engineers aim to set a good (hopefully nearly optimal) control scheme using simulation software, queueing delay formulas, and experience. For an overview of existing methods to control road traffic, see [9].

Many road users experience traffic lights as a source of delay. This chapter presents a model for the underlying Markov Decision Process (MDP) of minimizing the expected delay or waiting time of cars. The principle of the MDP model is to dynamically adjust the traffic lights depending on the number of cars in each queue and the current state of the traffic lights. Accurate information on the number of queued cars is available to the controller from magnetic loop detectors cut into the surface of the road, or other sensors or cameras positioned along the road [1, 7, 12].

A number of dynamic control policies reported in the literature, like SCOOT and SCATS [8], dynamically switch between off-line calculated fixed cycles (FC). The approach that we present in this chapter has more freedom: it does not have to choose between pre-calculated time plans; instead, it requires only one FC to be calculated off-line. Another class of dynamic control is vehicle-actuated (VA) control policies, which are characterized by a minimum and a maximum green period for each traffic flow, and a gap-out time to dynamically decide when to end a green period. The optimization of VA policies is complicated and usually requires heuristics, because of the number of parameters to be set jointly for all traffic flows (see [14]).

Solving an MDP for infrastructures with many traffic flows also requires heuristic solution procedures. Because of the curse of dimensionality, the number of states grows exponentially in the number of queues. Parallelization of algorithms to solve MDPs, as in [6], provides (at best) a linear reduction of the solution time, which is not enough to solve many complex infrastructures in practice. Two general techniques to solve large-scale MDPs by reducing the state space are aggregation and decomposition. Decomposition has been applied successfully to other problem settings; see [13] and [2]. Another method is Approximate Dynamic Programming, which is based on approximating the value function [10]. In this chapter, we present an approach based on decomposition.

Section 13.2 describes the MDP model. Section 13.3 describes a way to approximately solve the MDP. The resulting policy is evaluated by simulation in Sect. 13.4. A short discussion and some conclusions follow in Sect. 13.5. The Appendix of this chapter summarizes the notation.

2 Markov Decision Process (MDP)

The model includes the flows of cars at an intersection, excluding other participants in traffic, such as pedestrians and public transport. The applied definition of waiting time is the time a car spends in the queue regardless of the color of the light. The corresponding optimization is modeled in terms of an MDP. Approximate solutions of the model are derived via policy iteration in Sect. 13.3. The resulting traffic control tables are then simulated in Sect. 13.4.

2.1 Examples: Terminology and Notations

To get familiar with the notation, consider the infrastructures in Fig. 13.1. Figure 13.1a depicts a simple intersection with only F = 4 traffic flows leading to four queues. The flows and queues are numbered clockwise: 1–4. Flows 1 and 3 are grouped into combination 1 (C1), as they receive green simultaneously. Combination 2 (C2) consists of flows 2 and 4. At most one combination at a time has right of way (when its lights are green or yellow). When switching from green for one combination to green for the other, the green lights are first changed to yellow (for two time slots), followed by a clearance period (of one time slot) during which all lights are red, after which the lights of the other combination are changed to green. The clearance time is included to safely clear the intersection. The recurring questions are when to end a green period, and which combination to serve next, given the actual color of the traffic lights, the actual number of cars present at each queue, and the probabilistic arrival process of new cars.

Fig. 13.1 Two typical infrastructures. (a) “F4C2” serves four flows in two symmetric combinations. (b) “F12C4” serves 12 flows in four asymmetric combinations

Figure 13.1b shows a more complex intersection where F = 12 flows are grouped into four combinations, C1–C4. C1 and C3 consist of twice as many flows as C2 and C4. As long as cars are waiting at all queues, giving priority to C1 or C3 results in more departures per unit time than giving priority to C2 or C4. The decisions ‘when-to-end-a-green-period’ and ‘which-combination-to-serve-next’ should depend on the number of cars waiting at each queue: $\mathbf{q} = (q_1, q_2, \ldots, q_{12})$. An optimal policy prescribes for each possible state of the queues, i.e., each value of the vector $\mathbf{q}$, the action to take given the current state of the lights $l$. Such an optimal policy can be obtained from the solution of the underlying MDP.

2.2 MDP Model

To model the decision problem, time is divided into slots of fixed length, equal to the safe headway between two consecutive cars. Hence we set the length of a time slot to 2 s, which corresponds roughly to the time between two departures from a queue.

2.2.1 State

The state (at the start of a slot) is $\mathbf{s} = (l, \mathbf{q})$, where $l \in L$ is the state of the traffic lights. $S$ denotes the state space, which is composed of the state space $L$ of the traffic lights and the state space related to the queue lengths. $L$ is a finite set of $|L|$ elements. The number of cars queued is in theory unbounded. However, for computational reasons, we limit the length of each queue to $Q - 1$ cars. The number of queue states is $Q^F$. The total number of states is $|S| = |L| \cdot Q^F$, which grows exponentially in the number of traffic flows $F$.
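
To make the exponential growth concrete, a minimal sketch (Python; the function name and the example value $|L| = 12$ for F4C2 are our own illustration, while $|L| = 16$ for F12C4 is the figure used in Sect. 13.4):

```python
def num_states(num_light_states: int, Q: int, F: int) -> int:
    """Total number of MDP states |S| = |L| * Q^F."""
    return num_light_states * Q ** F

# F4C2 with Q = 10 and, say, |L| = 12 light states: a modest MDP.
print(num_states(12, 10, 4))    # 120000
# F12C4 with Q = 10: the queue part alone contributes 10**12 states.
print(num_states(16, 10, 12))   # 16000000000000 -- intractable
```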

2.2.2 Action

The action $a \in A(\mathbf{s}) \subseteq L$, taken at the start of a slot, changes the traffic light state instantaneously, i.e., immediately after observing the state $\mathbf{s}$. As switching from green for one (combination of) flow(s) to green for another takes time to clear the intersection, the choice of light adjustments is limited to $A(\mathbf{s}) \subseteq L$.

2.2.3 State Transition Probabilities

The state changes through the action applied to the traffic lights, and through cars departing from or arriving at the queues. From each queue that has one or more cars waiting and that has priority according to light state $l$, exactly one car leaves within a time slot. Within a time slot, new cars arrive at the queues according to $F$ independent Bernoulli processes: with probability $\lambda_f$, a new car arrives at flow $f$ (and with probability $1 - \lambda_f$ no car arrives). The car either forms or joins a queue, or, if its flow has priority and no car is queued on its lane, it crosses the intersection in the same slot without delay.

The state transition probability from state $\mathbf{s}$ to state $\mathbf{s}' = T(\mathbf{s}, a) = (a, \mathbf{q}')$ under action $a$ is

$$\displaystyle{ P(\mathbf{q}'\vert \mathbf{s},a) =\prod _{ f=1}^{F}p_{ f}(q_{f}'\vert (l,q_{f}),a) }$$
(13.1)

where $p_f(q_f' \mid (l, q_f), a)$ depends on $l$ and on whether a car arrives (w.p. $\lambda_f$) or not, as follows:

  • when l implies priority (green or yellow) to flow f:

    $$\displaystyle{ p_{f}(q_{f}'\vert (l,q_{f}),a) = \left \{\begin{array}{ll} \lambda _{f} &\text{if }q'_{f} = q_{f}, \\ 1 -\lambda _{f}\qquad \qquad &\text{else if }q'_{f} = \text{max}\{0,q_{f} - 1\} \\ 0 &\text{otherwise.}\end{array} \right. }$$
    (13.2)
  • when l implies no priority (=red) to flow f:

    $$\displaystyle{ p_{f}(q_{f}'\vert (l,q_{f}),a) = \left \{\begin{array}{ll} \lambda _{f} &\text{if }q'_{f} = q_{f} + 1 \\ 1 -\lambda _{f}\qquad \qquad &\text{else if }q'_{f} = q_{f}\qquad \qquad \qquad \\ 0 &\text{otherwise.}\end{array} \right. }$$
    (13.3)
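
As an illustration, Eqs. (13.1)–(13.3) transcribe directly into code. A minimal sketch in Python (the function names, and passing priority as a boolean derived from the light state $l$, are our own, not code from the chapter):

```python
from math import prod

def p_f(q_next: int, q: int, has_priority: bool, lam: float) -> float:
    """Per-flow transition probability p_f(q'_f | (l, q_f), a).

    has_priority: True if the light state gives flow f green or yellow.
    lam: the Bernoulli arrival probability lambda_f of flow f.
    """
    if has_priority:                     # Eq. (13.2): one car may depart
        if q_next == q:
            return lam                   # an arrival compensates the departure
        if q_next == max(0, q - 1):
            return 1.0 - lam             # departure, no arrival
        return 0.0
    else:                                # Eq. (13.3): red, no departures
        if q_next == q + 1:
            return lam                   # an arrival joins the queue
        if q_next == q:
            return 1.0 - lam             # no arrival
        return 0.0

def P(q_next, q, priorities, lams) -> float:
    """Joint transition probability: the product over flows, Eq. (13.1)."""
    return prod(p_f(qn, qc, pr, lm)
                for qn, qc, pr, lm in zip(q_next, q, priorities, lams))
```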

2.2.4 Contribution: Waiting Costs

The objective is to minimize the expected number of cars queued at the start of a time slot; by Little’s law, this also minimizes the expected waiting time. The contribution of a time slot to this objective is the so-called one-period or direct cost function:

$$\displaystyle{ c(\mathbf{q}) =\sum _{ f=1}^{F}q_{ f}. }$$
(13.4)

2.2.5 Bellman Equation

A stationary dynamic control policy $\pi^{\ast}$ that minimizes the expected number of queued cars fulfils

$$\displaystyle{ \pi ^{{\ast}}(\mathbf{s}) =\arg \text{min}_{ a\in A(\mathbf{s})}\sum _{\mathbf{q}'}P(\mathbf{q}'\vert \mathbf{s},a)v^{{\ast}}(\mathbf{s}'). }$$
(13.5)

where there exist a constant $g^{\ast}$ and a value function $v^{\ast}$ solving the Bellman equation:

$$\displaystyle{ \forall \mathbf{s} \in S: \quad v^{{\ast}}(\mathbf{s}) + g^{{\ast}} = c(\mathbf{q}) + \text{min}_{ a\in A(\mathbf{s})}\sum _{\mathbf{q}'}P(\mathbf{q}'\vert \mathbf{s},a)v^{{\ast}}(\mathbf{s}') }$$
(13.6)

The constant $g^{\ast}$ represents the expected number of cars waiting at the start of a time slot when an optimal policy $\pi^{\ast}$ is followed.

At this point, we have to make a technical remark: to get a finite state space $S$, we have limited each queue length to at most $Q - 1$ cars. However, the Bellman equation includes transitions to states $\mathbf{s}'$ that lie beyond the state space (e.g., when a car arrives at a queue that already contains $Q - 1$ cars). To determine $v^{\ast}(\mathbf{s}')$ for these states we apply quadratic extrapolation. For details see [3].
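
For illustration, one plausible implementation fits a quadratic through the last three in-range values and evaluates it beyond the boundary; this is our reading of the idea, the scheme actually used is detailed in [3]:

```python
import numpy as np

def v_beyond_boundary(v_row: np.ndarray, q: int) -> float:
    """Approximate the relative value for a queue length q beyond the
    truncation level, given the known values v_row[0..Q-1] of one queue.

    A sketch: fit a quadratic through the last three known points and
    evaluate it at q (our reading of the quadratic extrapolation; see
    [3] for the scheme actually used).
    """
    Q = len(v_row)
    if q < Q:
        return float(v_row[q])
    coeffs = np.polyfit([Q - 3, Q - 2, Q - 1], v_row[-3:], deg=2)
    return float(np.polyval(coeffs, q))
```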

2.2.6 Computational Complexity

Solving the Bellman equation involves solving a set of $|S|$ equations in $|S| + 1$ unknowns; one therefore has one degree of freedom and may fix one of the elements of $v^{\ast}$. Hence $v^{\ast}(\mathbf{s})$ is a relative value of state $\mathbf{s}$ when an optimal policy $\pi^{\ast}$ is applied. The Bellman equation can be solved using fixed-point algorithms, such as value iteration, or by exact algebraic methods. A detailed discussion of the theory and methods to solve MDPs is found in [11].

However, solving the MDP is only possible for infrastructures with a ‘small’ number of traffic flows, and under non-saturated conditions such that a reasonable bound on the queue lengths can be set. For example, for infrastructure F12C4, when all 12 queues may hold up to 9 cars, each element $q_f$ can take 10 possible values (0–9). Consequently, the number of possible vectors $\mathbf{q}$ is $10^{12}$. Hence the state space of the MDP easily exceeds the computationally acceptable limit of, say, 10 million states. The computational limit can be extended a bit by parallelization [6]. However, that only partly solves the computational issue, as the number of states grows exponentially with the number of traffic flows $F$. Large MDPs that cannot be solved to optimality require an approximate solution method, like the one we discuss in the next section.

3 Approximation by Policy Iteration

3.1 Policy Iteration (PI)

Instead of applying exact algebraic methods to solve Eq. (13.6), we apply a policy iteration (PI) algorithm to approximate an optimal policy. PI successively repeats the following two steps.

  1. Step 1

    Policy evaluation step: for an initial policy $\pi$, determine for all $\mathbf{s} \in S$ the associated relative state values $v^{\pi}(\mathbf{s})$ that satisfy:

    $$\displaystyle{ v^{\pi }(\mathbf{s}) + g^{\pi } = c(\mathbf{q}) +\sum _{\mathbf{q}'}P(\mathbf{q}'\vert \mathbf{s},\pi (\mathbf{s}))v^{\pi }(\mathbf{s}'). }$$
    (13.7)
  2. Step 2

    Policy improvement step: next, policy $\pi$ is improved by computing:

    $$\displaystyle{ \pi '(\mathbf{s}) =\arg \text{min}_{a\in A(\mathbf{s})}\sum _{\mathbf{q}'}P(\mathbf{q}'\vert \mathbf{s},a)v^{\pi }(\mathbf{s}'). }$$
    (13.8)

    If $\pi' = \pi$, then stop (as $\pi = \pi^{\ast}$); otherwise set $\pi := \pi'$ and return to Step 1.

Just like any other method for solving the Bellman equations, PI suffers from the computational burden of the large state space. The advantage of PI is that a single iteration may already give a good approximation when the initial policy is reasonably good.
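
For small instances, both PI steps can be written down directly. The dense-matrix sketch below (Python/NumPy; names are ours) fixes $v(\mathbf{s}_0) = 0$ to remove the free constant, and for brevity ignores the restriction to feasible actions $A(\mathbf{s})$, which could be added by masking infeasible actions with $+\infty$; it is usable only when $|S|$ is small:

```python
import numpy as np

def policy_iteration(P: np.ndarray, c: np.ndarray):
    """Average-cost policy iteration for a unichain MDP (a sketch,
    not the chapter's decomposed algorithm).

    P: transition probabilities, shape (A, S, S), P[a, s, s'].
    c: one-period costs c(q), length S (independent of the action).
    Returns (policy, relative values v with v[0] = 0, average cost g).
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy
    while True:
        # Step 1 (evaluation): solve v + g*1 = c + P_pi v with v[0] = 0.
        P_pi = P[pi, np.arange(S), :]
        M = np.eye(S) - P_pi
        M[:, 0] = 1.0                            # column 0 now multiplies g
        sol = np.linalg.solve(M, c)
        g, v = sol[0], np.concatenate(([0.0], sol[1:]))
        # Step 2 (improvement): pi'(s) = argmin_a sum_s' P(s'|s,a) v(s').
        pi_new = (P @ v).argmin(axis=0)          # c(q) is action-independent
        if np.array_equal(pi_new, pi):
            return pi, v, g                      # pi' = pi, so pi is optimal
        pi = pi_new
```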

3.2 Initial Policy: Fixed Cycle (FC)

A well-studied policy is a static policy for which one pre-fixes the time intervals at which flows get green and the cyclic order in which combinations of flows are served. The cycle has a fixed length of $D$ time units, which is the sum of the green periods and the time periods needed to switch between combinations. Such a policy we call FC. The slots within a cycle are numbered $t = 1, 2, \ldots, D$. The slot number provides all relevant information about the state of the traffic lights: from $t$ one can derive, for each flow $f$, the color of the light and the time until it turns green, yellow, or red. Thus, for $\pi = FC$, the state of the traffic lights $l$ is given by $t$.

FC can act as an initial policy for PI. To this end, one first needs to configure FC: i.e., set $D$, (nearly) optimal lengths of the green periods, and the cyclic order in which combinations get priority. An optimization algorithm for setting an initial FC is presented in [3] and [4].

3.3 Policy Evaluation Step of FC

The relative state values of FC, $v^{FC}(\mathbf{s})$, can be decomposed into relative state values $v_f^{FC}(t, q_f)$ per traffic flow $f$:

$$\displaystyle{ v^{FC}(\mathbf{s}) =\sum _{ f=1}^{F}v_{ f}^{FC}(t,q_{ f}). }$$
(13.9)

Relative state values $v_f^{FC}(t, q_f)$ are determined by value iteration:

  1. Step 1a.

    Define $V_0^f(t, q) = 0$ for all $t \in \{1, \ldots, D\}$ and $q \in \{0, \ldots, Q\}$, and let $\varepsilon$ take a very small value (compared to the expected number of cars waiting at the start of any time slot), e.g., $10^{-5}$.

  2. Step 1b.

    For all $t \in \{1, \ldots, D\}$ and $q \in \{0, \ldots, Q\}$, recursively compute $V_{n+1}^f(t, q)$ as follows (with $V_n^f(D + 1, \cdot)$ read as $V_n^f(1, \cdot)$):

    Start with $n = 0$ and $V_0^f = \mathbf{0}$.

    Repeat

    • if flow f is having priority (green or yellow) during time slot t:

      $$\displaystyle{ V _{n+1}^{f}(t,q):= q +\lambda _{ f} \cdot V _{n}^{f}(t + 1,q) + (1 -\lambda _{ f}) \cdot V _{n}^{f}(t + 1,(q - 1)^{+}), }$$
      (13.10)

      where $x^+ = \max\{0, x\}$,

    • if flow f is not having priority during time slot t (its light is red):

      $$\displaystyle{ V _{n+1}^{f}(t,q):= q +\lambda _{ f} \cdot V _{n}^{f}(t + 1,q + 1) + (1 -\lambda _{ f}) \cdot V _{n}^{f}(t + 1,q). }$$
      (13.11)
    • $n := n + 1$;

    until $\max_{t,q}\bigl(V_n^f(t,q) - V_{n-D}^f(t,q)\bigr) -\min_{t,q}\bigl(V_n^f(t,q) - V_{n-D}^f(t,q)\bigr) < \varepsilon$.

    Set $N := n$.

    For all $t \in \{1, \ldots, D\}$ and $q \in \{0, \ldots, Q\}$, the difference $V_N^f(t, q) - V_{N-D}^f(t, q)$ is at most $\varepsilon$ off from the long-run average cost per cycle, $g(f)$, as the $D$-step Markov chains are all irreducible.

  3. Step 1c.

    We set the relative state values relative to a fixed reference state, $(D, 0)$. Thus $\mathbf{v}_f^{FC}$ is:

    $$\displaystyle{ \mathbf{v}_{f}^{FC}\, \equiv \, \frac{\mathbf{V}_{N-D+1}^{f} + \cdots + \mathbf{V}_{ N}^{f}} {D} -\frac{V _{N-D+1}^{f}(D,0) + \cdots + V _{N}^{f}(D,0)} {D} \cdot \mathbf{1}\, }$$
    (13.12)

    where 1 is the all-ones vector.

    A discussion of the relative value definition in (13.12) is found in [3]. The differences $V_N^f(t, q) - V_N^f(D, 0)$ cannot be used for this purpose, as FC is a periodic policy and consequently these differences change periodically with $N$.

    An alternative criterion is to compare $\sum_{d=0}^{D-1}\mathbf{V}_{N-d}^{f}/D$ against the average cost over $N$ slots:

    $$\displaystyle{ \mathbf{v}_{f}^{FC}\, \equiv \, \frac{\mathbf{V}_{N-D+1}^{f} + \cdots + \mathbf{V}_{ N}^{f}} {D} - N \cdot g\, }$$
    (13.13)

    where $g$ may be approximated by any element of $\frac{\mathbf{V}_{N}^{f}-\mathbf{V}_{N-D+1}^{f}}{D}$, provided $N$ is sufficiently large.
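
Steps 1a–1c translate into a short program per flow. A sketch in Python/NumPy (0-based slot indices and all names are ours); at the truncation boundary it simply clamps $q + 1$ to $Q$, whereas the chapter extrapolates (see the technical remark in Sect. 13.2):

```python
import collections
import numpy as np

def fc_relative_values(green, lam, Q, eps=1e-5):
    """Relative values v_f^FC(t, q) of a single flow under FC.

    green[t]: True if the flow has priority (green/yellow) in slot t,
              t = 0..D-1 (the chapter numbers slots 1..D).
    lam:      per-slot Bernoulli arrival probability of the flow.
    Q:        queue lengths are evaluated for q = 0..Q.
    """
    D = len(green)
    q = np.arange(Q + 1)
    V = np.zeros((D, Q + 1))                       # Step 1a: V_0 = 0
    past = collections.deque([V.copy()], maxlen=D + 1)
    while True:                                    # Step 1b
        V_next = np.roll(V, -1, axis=0)            # V_n(t+1, .), slot D+1 -> 1
        V_new = np.empty_like(V)
        for t in range(D):
            if green[t]:                           # Eq. (13.10)
                V_new[t] = (q + lam * V_next[t, q]
                            + (1 - lam) * V_next[t, np.maximum(q - 1, 0)])
            else:                                  # Eq. (13.11), q+1 clamped
                V_new[t] = (q + lam * V_next[t, np.minimum(q + 1, Q)]
                            + (1 - lam) * V_next[t, q])
        V = V_new
        past.append(V.copy())
        if len(past) == D + 1:
            diff = past[-1] - past[0]              # V_n - V_{n-D}
            if diff.max() - diff.min() < eps:      # stopping criterion
                break
    # Step 1c: average the last D iterates, anchor at reference state (D, 0)
    avg = sum(list(past)[1:]) / D                  # (V_{N-D+1}+...+V_N)/D
    return avg - avg[D - 1, 0]                     # v_f^FC, with v(D, 0) = 0
```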

For example, Fig. 13.2 shows, for a particular queue state $\mathbf{q} = (4, 2, 2, 1)$ at infrastructure F4C2, the valuation of the traffic light state $t$. Figure 13.2a shows, for each of the four flows, the individual preference for time slot $t$. Figure 13.2b shows the sum of these preferences. On top of the x-axis (which shows the time slots of FC), labels indicate which of the two combinations gets green (G1/G2) or yellow (Y1/Y2), or whether all lights are red (R). In this case, with $\mathbf{q} = (4, 2, 2, 1)$, the best time slot is slot 1 (i.e., the first slot of green for combination 1), as it gives the lowest relative (cost) value. FC loops over all time slots: after slot 12 (R = all red) it continues at slot 1 (G1 = green to combination 1).

Fig. 13.2 Relative state values of FC. (a) Relative value curves for flows 1–4 when q = (4, 2, 2, 1) cars are waiting. (b) Sum of the relative value curves when q = (4, 2, 2, 1) cars are waiting at flows 1–4

3.4 Single Policy Improvement Step: RV1 Policy

The relative value $v_f^{FC}(\tau, q_f)$ quantifies the preference of flow $f$ for time slot $\tau$ when $q_f$ cars are waiting at queue $f$. Assuming all flows are equally important, the overall relative appreciation of time slot $\tau$ is the sum $\sum_{f=1}^{F} v_f^{FC}(\tau, q_f)$, as depicted in Fig. 13.2b.

By applying a single policy improvement step, one finds a new policy $\pi'$ that may interrupt or break FC by dynamically deciding to shorten or lengthen a green period. That is, in state $(t, \mathbf{q})$, policy $\pi'$ switches the state of the traffic lights to the best time slot reachable from the current slot $t$:

$$\displaystyle{ \pi '(t,\mathbf{q}) =\arg \text{min}_{\tau \in A(t,\mathbf{q})}\sum _{f=1}^{F}v_{ f}^{FC}(\tau,q_{ f})\, }$$
(13.14)

where A(t, q) is the set of time slots to which one may switch safely from the current traffic light state t.

This process is illustrated by Fig. 13.2b: one selects the time slot that can be reached from the current time slot $t$ and that yields the lowest sum of relative values. That is, under cyclic control, if $t = 1, 2$, or 3, then $\pi'(t, \mathbf{q}) = 1$. If $t = 9$ or 10, then $\pi'(t, \mathbf{q}) = 10$. For $t = 8$, $\pi'(t, \mathbf{q}) = 9$ holds, as lights cannot switch from red to yellow without first granting green to one combination. For all other values of $t$, $\pi'(t, \mathbf{q}) = t$ holds, i.e., one may not interrupt FC during switching.

We call this policy in the rest of this chapter the RV1 policy as it follows from a 1-step policy improvement using the relative values of FC.
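
Online, Eq. (13.14) is a cheap computation. A minimal sketch (Python; the names are ours, and `reachable`, encoding $A(t, \mathbf{q})$, is an assumed helper, since the safe-switching rules are infrastructure-specific):

```python
import numpy as np

def rv1_action(t: int, queues, v_fc, reachable) -> int:
    """One RV1 decision, Eq. (13.14): jump to the reachable FC slot
    with the lowest summed relative value.

    queues:    observed queue lengths q_f, one entry per flow.
    v_fc:      per-flow tables v_f^FC(t, q), e.g. produced off-line by
               fc_relative_values() sketched above.
    reachable: reachable(t) -> iterable of slots A(t, q) that may be
               switched to safely from slot t.
    """
    candidates = list(reachable(t))
    scores = [sum(v[tau, qf] for v, qf in zip(v_fc, queues))
              for tau in candidates]
    return candidates[int(np.argmin(scores))]
```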

3.5 Computational Complexity of RV1

The computational complexity of determining RV1 is very low, as the state space under FC is effectively decomposed into a state space per traffic flow. The computation of the relative values for each flow can be done quickly off-line. It allows a high bound on the queue length that is practically not restrictive. The summation of the relative value curves can be done quickly online: based on the actual number of cars queued at each queue and the actual state of the traffic lights, the best time slot to jump to can be computed in real time using Eq. (13.14). This is an important characteristic that allows applying RV1 to large and complex infrastructures.

3.6 Additional Iterations of PI

For problems with a large state space $S$, additional iterations of PI are not considered, since it is not possible to determine relative state values for policy RV1: $v^{RV1}(\mathbf{s})$ cannot be computed in reasonable time, as its state space cannot be decomposed. For problems with a small state space that allows additional iterations of PI (without decomposition of the state space), one may continue PI by following the procedure sketched in Sect. 13.3.1 until an optimal MDP policy is found.

4 Results

To assess the quality of RV1, its performance is evaluated by simulation and compared against FC and some exhaustive control policies. Exhaustive control (XC) grants green to a combination as long as cars are queued. Policy XC-1 switches to yellow as soon as at most one car is waiting at each ‘green’ queue, anticipating its departure during the next (= first) yellow slot. XC-2 switches to yellow as soon as at most two cars are waiting at each ‘green’ queue, and thus anticipates their departure during the next two yellow slots.

For both infrastructures F4C2 and F12C4 we consider symmetric cases: i.e. the arrival rates are identical for all flows. For F4C2 an optimal MDP policy is evaluated. For F12C4, the optimal MDP policy could not be determined due to the large state space. Acyclic control, non-identical arrival rates, and other infrastructures are reported in [3] and [5].

4.1 Simulation

The simulation model relies on the same assumptions as the MDP model. The accuracy of the results presented in the next subsections is primarily determined by the number of simulation runs and the length of each run. All reported simulation results are based on 100 runs of 72,000 slots per run, corresponding to 4000 h in the real system. At the start of each run, a warming-up period of 450 slots (= 15 min) is applied. The reported mean waiting times are accurate up to (at least) 2–3 digits. To be concise, we do not report confidence intervals.
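
For reference, a run under the chapter’s assumptions can be simulated in a few lines. A sketch (Python/NumPy; function and argument names are ours, not from the chapter):

```python
import numpy as np

def simulate_run(policy, green_in_slot, D, lam,
                 n_slots=72_000, warmup=450, seed=0):
    """One simulation run under the chapter's model assumptions.

    policy(t, queues) -> slot in 0..D-1 to jump to; for FC it is the
    identity in t, for RV1 it is rv1_action() sketched above.
    green_in_slot(t) -> set of flows with priority (green/yellow) in slot t.
    Returns the mean number of queued cars per slot after the warm-up,
    which by Little's law is proportional to the mean waiting time.
    """
    rng = np.random.default_rng(seed)
    F = len(lam)
    queues = np.zeros(F, dtype=int)
    t = 0
    total, counted = 0.0, 0
    for slot in range(n_slots):
        t = policy(t, queues)                    # dynamic jump within cycle
        if slot >= warmup:
            total += queues.sum()                # contribution c(q), Eq. (13.4)
            counted += 1
        arrivals = rng.random(F) < np.asarray(lam)
        for f in range(F):
            if f in green_in_slot(t):            # Eq. (13.2) dynamics
                queues[f] = queues[f] if arrivals[f] else max(queues[f] - 1, 0)
            else:                                # Eq. (13.3) dynamics
                queues[f] = queues[f] + 1 if arrivals[f] else queues[f]
        t = (t + 1) % D                          # FC advances one slot
    return total / counted
```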

4.2 Intersection F4C2

Table 13.1 shows the results for the fully-symmetric F4C2 intersection at varying workloads $\rho$. For workloads $\rho = 0.4$, 0.6, and 0.8, all flows have identical arrival probabilities of $\lambda_f = 0.2$, 0.3, and 0.4, respectively. The workload $\rho$ should be well below 1, to accommodate time for switching between combinations.

Table 13.1 Overall mean waiting time (in s) for the fully-symmetric F4C2

The Optimize-fixed-cycle algorithm (see [3, 4]) suggests cycle lengths of 8, 12 and 22 slots respectively, as reported in the next-to-last row (in seconds). The effective green time of a combination is the length of the related green and yellow period under FC. RV1 is based on FC with these cycle lengths and effective green times.

The performance of the cyclic RV1 strategy, obtained by the one-step policy improvement algorithm, is close to that of the optimal cyclic MDP strategy. Next to the average waiting times, we report the relative difference compared to RV1. The cyclic MDP policy performs only slightly better. FC and XC yield on average 15% and 27% larger waiting times; when the load is high ($\rho = 0.8$) the biggest differences are observed. On average, RV1 is just slightly better than the anticipating exhaustive variants (XC-1 and XC-2). However, the differences are small for this simple fully-symmetric case.

4.3 Complex Intersection F12C4

For infrastructure F12C4, the optimal MDP strategy cannot be computed, because the number of states is prohibitively large. Even if the MDP model truncated queues at the unrealistic level of two cars, the total number of states would still be quite large: $8.5 \cdot 10^6$ states ($= (1+2)^{12}$ queue states times 16 traffic light states). F12C4 is not only computationally more complex because of the number of states, but also because the combinations are asymmetric: C1 and C3 consist of four flows, whereas C2 and C4 consist of two flows only. It seems profitable to under-serve combinations C2 and C4, since serving the other two combinations results in more departures per time slot as long as cars are present at the respective queues.

The asymmetry in the number of flows per combination makes it more difficult to define simple rules that perform well. We keep the same definition of exhaustive control: all queues must be empty before the green signals are turned to yellow. Under XC-2, green lights are turned to yellow as soon as fewer than three cars are queued at each of the flows that have right of way.

Table 13.2 presents the results for varying workloads at the F12C4 intersection for the case where all flows have identical arrival intensities $\lambda_f = 0.1$, 0.15, and 0.2, providing workloads of $\rho = 0.4$, 0.6, and 0.8, respectively. RV1 outperforms all other strategies. When the workload of the intersection is low (0.4), XC-2 performs equally well. At a high load, XC-2 is too simplistic. At $\rho = 0.8$, FC performs even better than XC-2: XC-2 yields an average waiting time that is 28% higher than under RV1.

Table 13.2 Overall mean waiting time (in s) at different loads for F12C4 ($\lambda_f = \rho/4$)

Equal Allocation of Waiting Time

In Table 13.3, we study the waiting times at the different flows under a heavy workload of $\rho = 0.8$. Although the arrival rates $\lambda_f$ are identical for all flows, the mean waiting time differs per combination. Combinations C1 and C3 are ‘thicker’ than combinations C2 and C4: the latter two combinations have only two flows each, whereas C1 and C3 consist of four flows each. Therefore C1 and C3 experience a lower waiting time than combinations C2 and C4 under all policies except FC. The average waiting time per car under FC is identical for each flow, as each flow experiences an effective green time of 20 s (10 slots) per cycle.

Table 13.3 Mean waiting times (in s) for symmetric F12C4 at $\rho = 0.8$

Although FC seems the most fair, in the sense that all flows experience similar average waiting times, RV1 performs much better: the average waiting time for the flows of C1 and C3 is 13 s (or 26%) lower than under FC, while the average waiting times for C2 and C4 are virtually the same as under FC. Notice further that both XC and anticipative exhaustive control (XC-1 and XC-2) perform worse than FC when the workload is high.

5 Discussion and Conclusions

The dynamic control of traffic lights can be formulated as an MDP. In practice, for many infrastructures the state space is too large to determine an optimal MDP policy. Nevertheless, this chapter shows that the principles and theory of MDPs can still be applied to obtain good approximate solutions by executing a single iteration of a policy improvement (PI) algorithm. Key to this approach is to start PI with a well-structured policy for which relative values of the states can be determined. For the problem of controlling traffic lights, such a well-structured policy is fixed-cycle control (FC). Under FC, the computation of relative values decomposes into computations per traffic flow, given the state of the traffic lights.

For a single intersection, an approximate solution is provided that is based on policy iteration (PI) and decomposition of the state space. The approach starts with a Markov chain analysis of a pre-timed control policy, called fixed cycle (FC). The computation of relative state values for FC is fast, as under FC the multi-dimensional state space can be decomposed into sub-spaces per traffic flow. The policy obtained by executing a single iteration of PI using the relative values of FC is called RV1.

Numerical results obtained by simulation show that RV1 greatly reduces the average waiting time compared to FC (and other policies). As the relative values of states under policy RV1 cannot be computed using decomposition, additional PI steps can be executed only for infrastructures for which the state space is not too large.

RV1 seems to be a promising policy for practical application as:

  • RV1 is robust to changes in traffic volumes: RV1 performs well even when the underlying FC is sub-optimal,

  • RV1 is thus easy to maintain: if traffic conditions change, the performance of RV1 deteriorates less rapidly than that of FC,

  • RV1 is fast: evaluating a change of the traffic lights is done online within a microsecond; also, the off-line calculation of the relative values for each flow can be done quickly,

  • RV1 can be extended to include information on the predicted arrival time of a car approaching the intersection,

  • RV1 can be scaled up to complex intersections and networks of intersections.

Future research may be devoted to the above aspects. For bringing RV1 to practice, it is relevant to include other road users and traffic flows, such as public transport, bicycles, and pedestrians. These additional traffic flows do not hamper the use of RV1 policies. In addition, multiple criteria, like vehicle speed and CO2 emissions, can be included. This chapter has shown that insights and approximations obtained using an MDP framework are of practical importance for improving the control of traffic lights.