
1 Introduction

Wireless communication networks have evolved to their fifth generation (5G), which supports much faster data rates, ultra-dense deployments, and lower latency. 5G networks rely on a number of emerging technologies to achieve these capabilities. Software-defined networking (SDN) is one such technology, deployed in 5G networks to reduce both latency and cost. The SDN architecture separates the data plane from the control plane and provides centralized network control. The routers of a traditional network are replaced with high-speed switches that have limited control plane capability, and the routing functionality is shifted to a centralized SDN controller that shares routing information with the switches. Standard protocols such as OpenFlow are used for the communication between the SDN controller and the switches, and routing decisions are communicated to the switches through this protocol. Centralized control of routing eliminates routing overhead from the network and also allows a centralized policy to be enforced. SDN networks therefore provide much higher reliability and scalability than the traditional model of distributed routing.

The use of the SDN architecture improves network performance significantly and also provides much more scalability than conventional routing protocols. Rego et al. [1] demonstrated a performance improvement when OSPF routing was used in an SDN network as compared to traditional distributed networks: packet delays and jitter for video streaming applications were significantly lower in the SDN network than in the OSPF routing network. Zhang et al. [2] demonstrated that SDN routing is much more scalable and that routing convergence is faster for large networks with higher link delays. OSPF networks provided better response times than SDN networks for a small, 16-node topology, but the response time of the SDN network was 20% faster than that of the conventional OSPF network for a large, 120-node topology. Gopi et al. [3] compared the routing convergence times of conventional routing networks and SDN networks and demonstrated that, in an 80-node topology, conventional networks took three times longer to converge than SDN networks.

This architecture, however, requires the routing function to be implemented in the SDN controller and merely shifts the problem from routers to the SDN controller.

An alternative to the use of routing protocols in the control plane is to use machine learning techniques for traffic optimization and for learning routing paths in SDN networks. Machine learning techniques can capture multiple features in routing, such as bandwidth, delay, energy efficiency, and QoS, and their use can eliminate the overheads of traditional routing algorithms.

2 Existing Work in This Area

The existing literature includes work on using machine learning techniques to learn routing in SDN networks. Reinforcement learning is one of the techniques commonly used for this purpose.

Lin et al. [4] proposed QoS-aware adaptive routing for SDN networks based on reinforcement learning. The specific RL algorithm they used was State-Action-Reward-State-Action (SARSA). SARSA is conservative and learns a near-optimal policy, as opposed to the optimal policy learned by Q-learning.

Tang et al. [5] suggested the use of deep CNN-based learning to automatically learn routing in SDN networks. The deep learning approach was found to be faster than traditional routing: at a packet generation rate of 480 Mbps, the packet loss with deep learning was 50% of that observed with traditional routing. Deep learning, however, carries the overhead and drawbacks of supervised learning and is not dynamic.

Stampa et al. [6] proposed a deep reinforcement learning approach for learning routing information from the network using the Deep Deterministic Policy Gradient (DDPG) algorithm. It was demonstrated that the algorithm provided optimized delays in the network as compared to an initial benchmark. DDPG, however, suffers from an overestimation bias that also needs to be addressed.

Yu et al. [7] also experimented with the DDPG algorithm to optimize routing in SDN networks. Minimization of delay was used as a performance metric and the benchmark used for comparison was OSPF routing data. Under a 70% traffic load condition, the delay performance of the proposed algorithm improved by 40.4% as compared to OSPF.

Xu et al. [8] integrated the DDPG learning algorithm into the routing process of SDN to facilitate routing. A comparison was made with the traditional OSPF routing protocol.

Tu et al. [9] also used the DDPG algorithm to make real-time routing decisions in SDN networks. The reward was chosen as a function of bandwidth, delay, jitter, and packet loss. A comparison was again made with the OSPF routing protocol, and significant improvement was seen with DDPG-based routing.

Pham et al. [10] proposed a knowledge plane in SDN networks to manage the routing decisions. This knowledge plane used the DDPG algorithm to learn QoS-aware routing decisions. Latency and packet loss rate were considered as the criteria for optimizing routing. Improvement in packet loss rate as well as latency was observed in the case of the DDPG algorithm as compared to traditional routing mechanisms.

Kim et al. [11] implemented a DDPG-based deep reinforcement learning (DRL) agent for routing in SDN networks. The DDPG-based agent demonstrated better performance than naive methods.

Sun et al. [12] proposed the Time-Relevant Deep Learning Control (TIDE) algorithm, based on DDPG, for QoS guarantees in SDN networks. TIDE required less running time than Shortest Path (SP) algorithms and provided better results.

The DDPG algorithm, used by several of these researchers, suffers from overestimation bias. Any estimation error is propagated through the Bellman equation and can make the algorithm unstable, causing it to miss the target value or settle in a local optimum.
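
As a brief illustration (not taken from the cited works), consider the bootstrapped target used by a DDPG-style critic,

$$y = r + \gamma Q_{\theta^{\prime}}\left(s^{\prime}, \pi_{\phi^{\prime}}(s^{\prime})\right).$$

If the target critic overestimates the true value by some error ε > 0, the target y is inflated by γε; because every update bootstraps from the previous estimate, such errors can accumulate across updates rather than average out, which is the source of the instability referred to above.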

3 DDPG and TD3 Deep Learning Algorithms

Lillicrap et al. [13] proposed a deep reinforcement learning algorithm for continuous action spaces and called it Deep Deterministic Policy Gradient (DDPG). DDPG is an off-policy, model-free, online learning algorithm that uses the Actor-Critic method for learning. The DDPG agent looks for an optimal policy that maximizes the cumulative long-term reward. It uses four function approximators (Actor, Target Actor, Critic, and Target Critic) for learning. The Actor selects the action to maximize the reward, and the Critic returns the expected value of the long-term reward. The target networks are used to improve the stability of the optimization. DDPG also uses a replay buffer from which past transitions are randomly sampled.
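
A minimal sketch of such a replay buffer in Python is shown below; the capacity, batch size, and uniform sampling are illustrative assumptions, not the exact data structure used in [13]:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer (illustrative sketch)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def store(self, state, action, reward, next_state):
        # Each entry is one transition (s, a, r, s') observed by the agent
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Random sampling breaks the temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```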

DDPG also uses soft updates of the target networks for both the Actor and the Critic. This means that the target network weights are not simply overwritten with the main network weights; instead, only a fraction of the main network weights is blended into the target networks at each step, which provides stability.

This can be mathematically shown as:

$$\theta^{\prime} \leftarrow \tau \theta + (1 - \tau )\theta^{\prime},$$
(1)
$$\phi^{\prime} \leftarrow \tau \phi + (1 - \tau )\phi^{\prime},$$
(2)

where θ and ϕ are the weights of the main Critic and Actor networks, θ′ and ϕ′ are the weights of the corresponding target networks, and τ is the soft-update parameter. The value of τ is much smaller than 1 (0.001 in [13]), so only a small fraction of the main network weights is blended into the target networks at each step.
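
A minimal sketch of the soft update in Eqs. (1) and (2), assuming the main and target networks are Keras models with identical architectures (the function name and the default value of τ are illustrative):

```python
import tensorflow as tf

def soft_update(target_net, main_net, tau=0.001):
    """Blend a fraction tau of the main network weights into the target network."""
    for target_var, main_var in zip(target_net.trainable_variables,
                                    main_net.trainable_variables):
        # theta' <- tau * theta + (1 - tau) * theta'
        target_var.assign(tau * main_var + (1.0 - tau) * target_var)
```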

The DDPG algorithm, however, still suffers from the overestimation bias problem and can become unstable or converge to a local optimum. To avoid this problem, several changes have been suggested in an alternative to DDPG known as the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. The changes from DDPG to TD3 are:

• A TD3 agent learns two Q values instead of the single value used by DDPG; the minimum of the two is used when computing the update target.

• Unlike DDPG, the policy is not updated after every iteration; it is updated less frequently (usually once every two or three critic updates).

• When computing the target value, the TD3 agent adds noise to the target action, which prevents the agent from exploiting sharply peaked (and possibly overestimated) value estimates. The noise used in TD3 is clipped Gaussian noise.

  • TD3 Algorithm

Initialization:

Two Critic networks \(Q_{\theta_{1}}\) and \(Q_{\theta_{2}}\) are initialized with random parameters \(\theta_{1}\) and \(\theta_{2}\).

An Actor network \(\pi_{\phi}\) is initialized with random parameter ϕ.

Target networks are initialized as \(\theta^{\prime}_{1} \leftarrow \theta_{1}, \theta^{\prime}_{2} \leftarrow \theta_{2}, \phi^{\prime} \leftarrow \phi\).

The replay buffer Ɓ is also initialized.

for t in range(1, T):

Select an action with exploration noise \(a \sim \pi_{\phi}(s) + \varepsilon, \; \varepsilon \sim N(0,\sigma)\).

Observe the reward r and the new state s′.

Save the transition (s, a, r, s′) in the replay buffer.

Sample N random transitions (s, a, r, s′) from the replay buffer, then compute the target action and target value using the target networks:

$$\tilde{a} \leftarrow \pi_{\phi^{\prime}}(s^{\prime}) + \varepsilon, \quad \varepsilon \sim \text{clip}\left(N(0,\sigma), -c, c\right)$$
$$y \leftarrow r + \gamma \min_{i=1,2} Q_{\theta^{\prime}_{i}}\left(s^{\prime}, \tilde{a}\right)$$

Update the Critics:

$$\theta_{i} \leftarrow \arg\min_{\theta_{i}} N^{-1} \sum \left(y - Q_{\theta_{i}}(s,a)\right)^{2}$$

if t mod d == 0:

Update ϕ using the deterministic policy gradient

$$\nabla_{\phi} J(\phi) = N^{-1} \sum \nabla_{a} Q_{\theta_{1}}(s,a)\big|_{a = \pi_{\phi}(s)} \, \nabla_{\phi} \pi_{\phi}(s)$$

Update the target networks

$$\theta^{\prime}_{i} \leftarrow \tau \theta_{i} + (1 - \tau)\theta^{\prime}_{i}$$
$$\phi^{\prime} \leftarrow \tau \phi + (1 - \tau)\phi^{\prime}$$
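
For concreteness, the update step above can be sketched in TensorFlow (the framework used for the experiments below). This is a minimal, illustrative implementation: the network sizes, hyperparameters, and the assumed state and action dimensions (a flattened traffic matrix as the state and one weight per link as the action) are assumptions for illustration, not the configuration used in the reported experiments.

```python
import tensorflow as tf

# Illustrative hyperparameters (not taken from the paper)
GAMMA, TAU = 0.99, 0.005             # discount factor and soft-update rate
POLICY_NOISE, NOISE_CLIP = 0.2, 0.5  # target-policy smoothing noise and its clip range
POLICY_DELAY = 2                     # delayed policy update interval d

def make_mlp(in_dim, out_dim, out_activation=None):
    """Small fully connected network; layer sizes are illustrative."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(in_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(out_dim, activation=out_activation),
    ])

# Assumed sizes: flattened 14 x 14 traffic matrix in, one weight per link out
STATE_DIM, ACTION_DIM = 14 * 14, 21

actor = make_mlp(STATE_DIM, ACTION_DIM, "tanh")   # pi_phi, actions in [-1, 1]
critic_1 = make_mlp(STATE_DIM + ACTION_DIM, 1)    # Q_theta1
critic_2 = make_mlp(STATE_DIM + ACTION_DIM, 1)    # Q_theta2
target_actor = tf.keras.models.clone_model(actor)
target_critic_1 = tf.keras.models.clone_model(critic_1)
target_critic_2 = tf.keras.models.clone_model(critic_2)
for tgt, src in [(target_actor, actor), (target_critic_1, critic_1),
                 (target_critic_2, critic_2)]:
    tgt.set_weights(src.get_weights())            # theta' <- theta, phi' <- phi

actor_opt = tf.keras.optimizers.Adam(1e-3)
critic_opt = tf.keras.optimizers.Adam(1e-3)

def td3_update(s, a, r, s_next, step):
    """One TD3 update from a mini-batch (r must have shape (batch, 1))."""
    # Target action with clipped Gaussian noise (target-policy smoothing)
    noise = tf.clip_by_value(
        tf.random.normal(tf.shape(a), stddev=POLICY_NOISE), -NOISE_CLIP, NOISE_CLIP)
    a_next = tf.clip_by_value(target_actor(s_next) + noise, -1.0, 1.0)

    # Clipped double-Q target: minimum of the two target critics
    sa_next = tf.concat([s_next, a_next], axis=1)
    q_next = tf.minimum(target_critic_1(sa_next), target_critic_2(sa_next))
    y = r + GAMMA * q_next

    # Critic update: minimise the mean squared TD error of both critics
    with tf.GradientTape() as tape:
        sa = tf.concat([s, a], axis=1)
        critic_loss = (tf.reduce_mean(tf.square(y - critic_1(sa))) +
                       tf.reduce_mean(tf.square(y - critic_2(sa))))
    critic_vars = critic_1.trainable_variables + critic_2.trainable_variables
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic_vars), critic_vars))

    # Delayed actor update and soft target updates (every POLICY_DELAY steps)
    if step % POLICY_DELAY == 0:
        with tf.GradientTape() as tape:
            # Deterministic policy gradient: maximise Q_theta1(s, pi_phi(s))
            actor_loss = -tf.reduce_mean(critic_1(tf.concat([s, actor(s)], axis=1)))
        actor_opt.apply_gradients(
            zip(tape.gradient(actor_loss, actor.trainable_variables),
                actor.trainable_variables))
        for tgt, src in [(target_actor, actor), (target_critic_1, critic_1),
                         (target_critic_2, critic_2)]:
            for tv, sv in zip(tgt.trainable_variables, src.trainable_variables):
                tv.assign(TAU * sv + (1.0 - TAU) * tv)  # soft update, Eqs. (1)-(2)
```

In a training loop, td3_update would be called once per mini-batch sampled from the replay buffer, with the step counter driving the delayed policy and target updates.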

4 Experiments and Results

The experiments were carried out using the Omnet++ simulator, which is a discrete event simulator written in C++ that also provides support for Python. The hardware environment consisted of an 8 × NVIDIA A100 system with 40 GB GPUs (5 GB of GPU memory was allocated for this setup). The network topology simulated using Omnet++ was the standard 14-node NSFNet topology with 21 full-duplex links, as shown in Fig. 1.

Fig. 1

Deep learning agent in SDN network (NSFNet topology)

The nodes were simple forwarding switches connected to a central SDN controller. The routing decisions were taken by the SDN controller using the OpenFlow control protocol. The central controller application for routing was an agent based on reinforcement learning (RL agent) that controlled the routing entries on all switches. Two versions of RL agents were implemented using TensorFlow libraries in Python. The first version was based on the DDPG algorithm and the second version was based on the TD3 algorithm (Table 1).

Table 1 Parameters used in TD3 algorithm

A traffic matrix was created for testing the performance, with 1000 different traffic configurations. To check the performance of the RL agents, 100,000 random routing configurations in which all nodes were reachable were used. The same set of routing configurations was used for all the traffic configurations (Fig. 2).

Fig. 2

Block diagram of TD3 RL agent

Latency is an important criterion for QoS and is also an indicator of congestion. Since the target is to minimize latency, negative latency was chosen as the reward function to be maximized by the RL agents.
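
As a minimal sketch of how such a reward signal might be formed from measured latencies (the exact measurement and aggregation used in the simulator are not specified here; the mean over flows is an assumption for illustration):

```python
import numpy as np

def compute_reward(flow_latencies):
    """Negative latency reward: lower average latency gives a higher reward."""
    # flow_latencies: iterable of measured end-to-end latencies for the current step
    return -float(np.mean(np.asarray(flow_latencies, dtype=float)))
```

Maximizing this reward is then equivalent to minimizing the average latency across the measured flows.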

The performance results obtained for the two models are shown in Figs. 3 and 4. Figure 3 compares the latency in the network as training progresses. It can be seen that beyond 50 epochs, the latency is higher for DDPG than for TD3, and the trend remains consistent as training progresses. This demonstrates the effectiveness of the TD3 algorithm in learning SDN routing with minimal delay as compared to the DDPG algorithm.

Fig. 3

Comparison of training of DDPG and TD3 RL agents

Fig. 4

Runtime performance comparison of DDPG and TD3 RL agents

Figure 4 shows the runtime performance of the agents after the training is completed. Again, it can be seen that the performance of TD3 is better as compared to DDPG.

5 Conclusion and Future Directions

As demonstrated by the simulation results, TD3 is more effective and efficient than DDPG in learning SDN routing in 5G networks, and it reduces the overhead of a routing protocol in the network. It is therefore preferable to use a TD3-based RL agent for SDN routing in 5G networks.

This study has focused on static network topologies, where learning takes place in a fixed network environment; if the topology changes, the training may need to be repeated. A subject for future study is to explore transfer learning so that the trained models can be reused even when the network topology changes.