1 Introduction

Over the past few decades, cloud computing has developed into an important paradigm for offloading computation, control, and storage of information to geographically distributed data centers. However, cloud computing faces growing challenges such as high and unpredictable communication latency, privacy gaps, network load from connected end devices, and costs related to head-on connections [1, 2]. Most of these problems stem from the large physical distance between the cloud provider’s data centers and the end-users. This geographical distance results in high latency and reduced quality of service (QoS). These problems act as barriers to time-sensitive requests, making it difficult to support real-time processing and fast response times for end-users [1].

Using middleware between end nodes and the cloud can help address some of these challenges. Consequently, different studies have proposed and examined middleware such as Cloudlets, MEC, Micro Data Centers, and similar solutions such as Nano data centers, which fall under the umbrella of fog computing or edge computing [4,5,6].

Cisco introduced fog computing in 2014 to extend cloud computing to the network edge. It is an entirely virtualized platform that provides computing, storage, and network services between end IoT devices and traditional cloud data centers [5]. Fog computing aims to decentralize computing and integrate cloud data centers with heterogeneous edge devices to perform ubiquitous distributed computing [7]. Moreover, lightweight or time-sensitive IoT tasks can be processed at the fog base station, whereas heavyweight tasks, or any task that cannot be handled in the fog layer for any reason, are executed on cloud computing platforms.

Although fog computing is a promising approach for solving the problems of cloud computing and current networks, some problems remain unsolved. Most importantly, there is a need for an intelligent distributed platform at the edge to manage the computational, network, and storage resources of the fog layer. Currently, the uncertainty arising from the relation between future requests and the limited resources of fog nodes creates multiple obstacles to proper decision-making in resource allocation.

Moreover, if all requests are forwarded from the IoT layer to the cloud, responding to requests with low-latency requirements becomes difficult. On the other hand, fog computing, as an alternative, faces resource scarcity when responding to all requests. Therefore, intelligent resource management at the fog layer can address this problem.

Resolving this problem is critical for IoT applications that cannot tolerate such latencies. IoT applications contain tasks with different latency requirements; some are more sensitive to latency, while others can tolerate more delay [8, 9]. Therefore, a fog node (FN) must intelligently allocate its limited and valuable resources in heterogeneous IoT environments with different latency requirements.

1.1 Contributions

The current study presented a new framework for resource allocation in SDN-based fog using reinforcement learning methods. The framework aimed to use limited FN resources effectively while meeting the low-latency requirements of IoT applications. Moreover, an SDN-based architecture was used to turn the resource allocation problem into a decision-making process driven by reinforcement learning. Consequently, this decision-making could increase the total utility of requests serviced at the fog layer. In addition, the proposed approach formulated the resource allocation problem as a Markov decision process, which allows the controller to make the best decision under uncertainty. Since input requests and resource availability change dynamically, the system cannot precisely predict the transition probabilities and rewards. Therefore, model-free RL methods such as SARSA, Q-learning, and Monte Carlo were used to learn optimal policies and make correct decisions.

The key contributions of this paper are listed as follows:

  1. Providing the possibility of using all fog layer resources to respond to time-sensitive requests in the shortest possible time.

  2. Presenting a special multi-layered SDN-based architecture (Fig. 1) that uses reinforcement learning algorithms.

  3. Using reinforcement learning to enable the responsible controller to act as an intelligent agent and choose the optimal method for responding to a request without requiring prior knowledge of the environment.

The rest of this paper is organized as follows. The remainder of this section reviews the related studies. The architecture and system model are then introduced in Section 2. The problem formulation proposed for solving the resource allocation problem is presented in Section 3, together with the reinforcement learning methods and related algorithms. The simulation results are presented in Section 4, and the conclusions are given in Section 5. In addition, a list of the notation and abbreviations used throughout the paper is provided in Table 1.

Fig. 1 Multi-Controller Flat SDN-based Fog Architecture: the environment includes multiple controllers. The controller that receives a service request from one of its FBSs at moment t is named the responsible controller and decides how to provide the requested resources

1.2 Fog computing and SDN-based architecture

SDN technology was used in the proposed model of this study to increase fog computing efficiency, network scalability, and programmability. The effects of SDN on fog computing have been examined in several studies. For example, the survey by Bakhtir et al. [10] presented a complete description of how fog computing can use SDN to address its challenges. Their study thoroughly investigated the problems of fog computing in different environments, analyzed them in a categorized fashion, and identified SDN as a promising means of facing these challenges. Moreover, they presented a transparent cooperation model for SDN-based fog computing using a practical architecture and showed how SDN-related mechanisms can be applied in fog computing infrastructure.

In another study, Gupta et al. [11] presented an SDN-based fog computing system named SD-Fog. The proposed system provided intelligent QoS by managing and controlling the flows between services and performing orchestration. They used network function virtualization (NFV) and SDN technologies to show the effectiveness of the proposed system in a smart-home case study.

In addition, Sun et al. [12] presented edge IoT with a hierarchical fog and cloud computing structure to effectively manage IoT device data at the network edge. In this structure, the SDN-based cellular core was located above the fog servers to transfer information between these servers. The cooperation between fog computing and SDN created efficiency in IoT data stream collection, categorization, and analysis.

Table 1 A Summary of notation and abbreviations

As one of the earliest studies, Truong et al. [13] presented an integrated fog computing and SDN architecture for vehicular ad hoc networks. In their study, the SDN controller was placed between the fog and cloud layers for fog orchestration and resource management. They also demonstrated the benefits of this integrated system through two scenarios.

In addition, Yong et al. [14] employed SDN technology to operationalize the connection between the fog and the cloud and to improve service quality. In another study, Lin et al. [15] introduced a distributed network architecture that leveraged SDN in vehicular networks for scalable network management and support of intelligent data computing policies in fog computing. Besides, Liang et al. [16] presented an integrated architecture for software-defined and virtualized radio access networks with fog computing. They suggested using a Software-as-a-Service (SaaS) solution named Open Pipe, which enables network-layer virtualization.

1.3 Resource management at the fog

Task offloading and resource management are other topics in fog computing that have received significant attention. For example, resource allocation has been studied based on cooperative edge computing to achieve ultra-low latency in fog radio access networks (F-RAN) [17,18,19]. Sahni et al. [17] proposed a meshing algorithm for edge computing to distribute decision-making tasks among the edge devices instead of the cloud server.

In another study, heterogeneous F-RAN structures such as small cells and macro base stations were considered to present an F-RAN node selection algorithm for proper heterogeneous resource allocation [18, 19]. Moreover, Mukherjee et al. [20] studied computational offloading for multiple user tasks with different latency needs. In their scenario, the end-user offloaded the task data onto its responsible fog node.

Zhou et al. [21] proposed a solution for coping with the challenges of offloading computational tasks onto the resources of surrounding vehicles, including the lack of effective incentive and assignment mechanisms. Besides, Zhang et al. [22] considered a particular fog computing network consisting of a set of data service operators (DSOs), each of which controlled a set of fog nodes to provide the required data services to its subscribers.

In another study, Gu et al. [23] analyzed the radio and computational resource allocation problem to optimize system performance and improve user satisfaction. Their method aimed at a distributed approach to the shared resource allocation problem: instead of continuous centralized optimization, it used a matching-game framework, specifically a student project allocation (SPA) game, and applied the SPA-(S, P) algorithm to find a stable solution to the resulting SPA problem.

1.4 Fog computing and reinforcement learning

Reinforcement learning methods have been widely used in fog computing due to their benefits in solving resource management and load balancing problems. Other machine learning methods, such as deep learning, have also been used to address the challenges of fog computing [24,25,26]. However, reinforcement learning techniques generally have broader applications in this field, mainly because model-free reinforcement learning does not require a model of the environment.

For example, the work of Dutreilh et al. [27] was one of the first attempts to use reinforcement learning for automatic resource allocation in the cloud, shaping a proper dynamic model for resource allocation, which is essential in cloud computing. Furthermore, Le and Tham [28] proposed a deep reinforcement learning-based offloading scheme for ad-hoc mobile cloud users.

In addition, Gao et al. [29] considered a multi-user mobile edge computing (MEC) system. This system allowed users to perform their computational offloading using wireless channels to a MEC server. Moreover, the total costs of latency and energy consumption for all users were considered the optimization objective. Additionally, the two decisions, including offloading and computing resource allocation, were optimized to minimize the overall system cost. For this purpose, an optimization framework based on reinforcement learning (Q-learning and deep reinforcement learning) was proposed for resource allocation in wireless MEC. The simulation results showed that the proposed method significantly reduced the total cost compared to other base methods.

In another study, Parent et al. [30] applied reinforcement learning to distributed static load balancing of data-intensive applications in a heterogeneous fog computing environment. Moreover, Gazori et al. [31] focused on scheduling fog-based IoT application tasks to minimize long-term service delay and computational cost under limited resources and execution-time constraints, using reinforcement learning with a double deep Q-learning (DDQL) algorithm. Their evaluation showed that the proposed method outperformed baseline algorithms in service latency, computational cost, energy consumption, and task completion while addressing the single point of failure and load management challenges.

Besides, Liu et al. [32] presented a trade-off between energy consumption and service latency in mobile vehicular networks, considering network traffic and the computational load it causes. Their study formulated a cost-minimization problem using a Markov decision process (MDP) and proposed dynamic reinforcement learning and deep dynamic programming algorithms to solve the offloading decision problem.

Table 2 compares some of the reviewed methods in terms of whether they use SDN, the artificial intelligence methods and algorithms applied, and their goals. The current study intended to present a novel resource management method that improves service quality by introducing an SDN-based fog computing architecture with multiple flat controllers. Furthermore, this study used reinforcement learning algorithms for optimized decision-making about the precious and limited fog layer resources.

Table 2 Comparison between some solutions presented in fog computing

2 Architecture and system model

This study presented an SDN-based fog computing architecture (shown in Fig. 1) to solve the previously mentioned problems. In this multi-controller flat architecture, the environment is divided into multiple partitions. Each partition includes at least one local fog-based base station (FBS) managed and controlled by the responsible SDN controller.

Each IoT device in a partition, using any of several access technologies (Wi-Fi, WiMAX, cellular), can direct its connection and service requests to the partition’s local FBS. The service request information is forwarded by the local FBS to the responsible controller of that partition, which analyzes it and determines the serving method. Therefore, end nodes connect to the controller indirectly.

Moreover, in environments with mobile IoT devices that move between partitions over time, the joining and leaving of devices are recorded by their local FBS at the responsible controller to reflect user density. Therefore, every controller has a partition view of its local network situation [37].

In addition, the partition views include the request traffic, the local resource status, and the available resources of neighboring controllers, and they can be updated and shared between neighboring controllers at certain intervals depending on network traffic. Fog base stations are equipped with caching, computing, and signal processing capabilities in addition to networking facilities. This equipment optimizes communication bandwidth consumption between the FBSs and the cloud and helps cope with the growing number of IoT devices and their low-latency requirements. However, FBS resources are limited, so efficient use is essential.

To simplify the modeling, the current study assumed that FBS caching, computing, and processing capacity could be represented by a single comprehensive quantitative indicator expressed as resource blocks (RBs), limited to N [2]. Furthermore, all FBSs were presumed to have constant resources; thus, their RBs would not change over time.

Additionally, we supposed that the end node could not supply its own needs over time, which is why it tried to access network resources by sending requests to the local FBS in its domain. Each EN must calculate its latency requirement (maximum acceptable delay), its computational cost, and the amount of data to be transferred from the EN before sending the request [2]. Depending on the intended IoT environment, the FBS can calculate some or all of these parameters instead. After receiving the request from the IoT environment, the FBS calculated the request utility by the method presented later and then forwarded the request to the responsible controller.

The request utility was calculated based on the maximum acceptable delay of the requesting EN, the amount of data that must be transferred from the EN, and the computational cost required by the EN. The responsible controller must then decide how to respond to the received request based on the status of the fog layer resources and the request utility. The decision is one of the following:

  1. Sending the request to one of its own FBSs,

  2. Sending the request to a neighboring controller (to receive service on one of the FBSs under its control), or

  3. Sending the request to the cloud.

The SDN-based architecture (introduced in Fig. 1) allows the controller to use the resources of other neighbors, in addition to its own resources and the cloud resources, to respond to requests. In other words, the controller must choose the best of these three options, which increases the necessity of optimized decision-making.

The SDN controller implements computing and network traffic-related policies using a southbound API (such as OpenFlow [38]) and east-west APIs (such as BGP and OSPF [39]). As mentioned, the FN’s computing and processing capacity is limited to N resource blocks (RBs). Furthermore, since user requests arrived sequentially and decisions were made quickly, no queuing occurred [2].

IoT applications have different levels of latency requirements. Therefore, the SDNC gives higher priority to servicing low-latency applications. Besides, computational cost and data transfer are taken into account in determining the utility, in order to differentiate between requests with similar latency requirements and to penalize requests with higher computing costs and data transfers.

Thus, we considered the utility of an IoT end-node request proportional to the inverse of the acceptable delay D (in milliseconds), the inverse of the processing time P (in milliseconds), and the inverse of the data transfer time T (in milliseconds). Consequently, we would have: \( U \approx \frac{1}{D \times P \times T} \). With \( T = \frac{A}{S} \), where T is the transfer time, A the amount of data that must be transferred from the EN (the data size of a task), and S the transfer speed or rate, we would have: \( U \approx \frac{S}{D \times P \times A} \). Notably, the items involved in determining the utility of a request (other than the latency requirement) are highly dependent on the environment, and the priority of each item can change accordingly. Therefore, more flexibility can be provided by applying a coefficient or exponent to each item:

$$\begin{aligned} U = \delta \left( \left( \frac{1}{D} \right) ^\beta \times \left( \frac{S}{P\times A}\right) ^\sigma \right) \end{aligned}$$
(1)

where \( \delta ,\beta \) and \(\sigma > 0 \) are mapping parameters.
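For concreteness, a minimal Python sketch of Eq. (1) is given below; the function name, parameter values, and units are illustrative assumptions rather than part of the proposed system.

```python
def request_utility(D_ms, P_ms, A_bits, S_bps, delta=1.0, beta=2.0, sigma=1.0):
    """Sketch of Eq. (1): U = delta * (1/D)^beta * (S / (P * A))^sigma.

    D_ms   : maximum acceptable delay of the request (ms)
    P_ms   : required processing time (ms)
    A_bits : amount of data to be transferred from the EN
    S_bps  : transfer rate between the EN and the FBS
    The mapping parameters delta, beta, and sigma are illustrative; a larger
    beta gives the latency requirement a higher weight.
    """
    return delta * (1.0 / D_ms) ** beta * (S_bps / (P_ms * A_bits)) ** sigma

# A latency-sensitive request (small D) scores higher than a tolerant one.
assert request_utility(D_ms=10, P_ms=5, A_bits=1e6, S_bps=1e7) > \
       request_utility(D_ms=100, P_ms=5, A_bits=1e6, S_bps=1e7)
```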

We obtained the desired range of U and the importance levels of latency, computational cost, and data transfer by selecting \( \delta ,\beta \), and \( \sigma \). Typically, latency is given more importance by selecting larger \( \beta \) values. Considering the three options available to the SDNCs, the responsible controllers must act intelligently to make the correct decision for each IoT request. Their intelligent decisions help achieve the conflicting objectives of maximizing the average total utility of EN requests served at the fog layer over time and minimizing the idle time of fog resources. Therefore, the system objective can be described as a constrained optimization problem as follows:

$$\begin{aligned} \begin{aligned} Max \displaystyle \sum _{t=0}^{T} (u_t \mid a_t = Local\,Service\,Or\,Service\,in\,Fog ) \\ And \\ Min \displaystyle \sum _{t=0}^{T} (FRB_{Local} + FRB_{Fog} \mid a_t = Service\,in\,Cloud ) \end{aligned} \end{aligned}$$
(2)

where \(a_t\) is the action performed at time t and T is the end time when all RBs at the fog layer, including the local responsible SDNC resource blocks and neighbor SDNC resource blocks, are occupied. In addition, \( FRB_{Fog} \) is the number of unoccupied RBs related to FBSs under the management of neighbor SDNCs, and \(FRB_{Local}\) is the number of unoccupied RBs related to the FBSs under the management of the responsible SDNC.

3 Problem formulation

Reinforcement learning can be applied as a mathematical framework for autonomous learning through interaction with the environment. In the standard reinforcement learning model, an independent learner agent interacts with the environment through a sequence of observations, actions, and rewards. At each time step t, the agent first observes a state, \(s_t\), from its environment. Then, it performs an action \(a_t\) and receives a scalar reward \(r_t\) as feedback. After the action \(a_t\) is applied, the environment transitions to a new state \(s_{t+1}\) (Fig. 2). The process continues, and the agent aims to learn a policy, i.e., an action selection strategy, that maximizes the reward in the long run. Moreover, reinforcement learning primarily focuses on learning without knowledge of the environment model [4]. Accordingly, it is appropriate for the fog computing setting addressed in the previous section.

Fig. 2 Agent-environment interaction in an MDP [40]

Formally, reinforcement learning can be described as a Markov decision process (MDP), an idealized mathematical form of the reinforcement learning problem, because a detailed theoretical description can be given in which the environment’s response, the subsequent state \(S_{t+1}\), depends only on the state \( S_t \) in which the action \(A_t\) is taken. Furthermore, the MDP and the agent create a sequence that starts as follows: \(S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,\ldots \). In a finite MDP, the sets of states, actions, and rewards (S, A, and R) all have a finite number of elements. The resource allocation problem at the fog layer can then be defined as a finite MDP through the following quadruple.

3.1 Set of states

S is the set of feasible states, e.g., \( s_t \in S\). In addition, the utility values are quantized to model the environment; therefore, \( u_t \in \{0,1,2,3,\ldots ,U\}\). Accordingly, the state \(S_t\) at any time t is defined as follows:

$$\begin{aligned} S_t = (10^{m+n} \times FRB_{Fog}) + ( 10^n \times FRB_{Local} ) + U_{t+1} \end{aligned}$$
(3)

which allows any state to be represented by a single number. Here, \( FRB_{Local}\) is the total number of free RBs associated with FBSs controlled by the responsible SDNC at time t, and \(FRB_{Fog}\) is the total number of free RBs associated with FBSs controlled by neighboring SDNCs at time t. Besides, n and m are positive integers defined as follows: n is the smallest integer such that \(U<10^n\), and m is the smallest integer such that \( NRB_{Local}<10^m\). Given this description, Eq. (2), and assuming that when a request is referred to the fog layer it is always referred to the neighboring SDNC with the largest number of free resources, the number of feasible states is:

$$\begin{aligned} \text {Number of states} = ( NRB_{Local}+1 ) \times ( NRB_{Fog} + 1 ) \times U \end{aligned}$$
(4)
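As an illustration of the digit-packing in Eq. (3), the following sketch encodes a state as a single number; the helper name is hypothetical, and the chosen values match the hypothetical example given later in Sect. 3.4.

```python
import math

def state_id(frb_fog, frb_local, utility, nrb_local, u_levels):
    """Encode (FRB_Fog, FRB_Local, u) as one number following Eq. (3)."""
    n = math.ceil(math.log10(u_levels + 1))    # smallest n with U < 10^n
    m = math.ceil(math.log10(nrb_local + 1))   # smallest m with NRB_Local < 10^m
    return (10 ** (m + n)) * frb_fog + (10 ** n) * frb_local + utility

# With NRB_Local = NRB_Fog = 4 and U = 9, Eq. (4) gives (4+1)*(4+1)*9 = 225
# feasible states, and a fully free system observing a utility-9 request is
# encoded as 449.
print(state_id(frb_fog=4, frb_local=4, utility=9, nrb_local=4, u_levels=9))
```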

3.2 Set of actions

A set of possible actions was defined in our model as follows. For a user request with utility \( U_t\) at time t, the controller must select the optimal action from the following three actions (assuming not all resources are occupied in the fog layer):

  • Local Service: One of the SDNC’s own FBSs is assigned to respond to the request. As a result, one RB of the resources managed by the SDNC is occupied, and the immediate reward \(r_t\) is received (\(a_t\) = Local Service).

  • Service in Fog Layer: The request is forwarded to one of the neighboring controllers. For action \(a_t\) = Service in Fog Layer, one RB of the resources under the control of the neighboring SDNC is occupied, the RBs of the local FBSs are retained, and the reward \( r_t\) is received.

  • Service in Cloud Layer: The request is forwarded to the cloud (\(a_t\) = Service in Cloud Layer), all existing resource blocks of the FBSs in the fog layer are retained, and the reward \(r_t\) is received.

Based on the descriptions mentioned above, the set of actions can be defined as follows:

$$\begin{aligned} A = \{ \text {Service in Fog Layer}, \text {Service in Cloud Layer}, \text {Local Service}\} \end{aligned}$$

3.3 Probability function

\( P^{a}_{SS^{'}} \) is the transition probability from state S to \(S^{'}\) when choosing action a, that is, \( P^ {a}_{SS^{'}} = P (S^{'} \mid s,a) \), where \(S^{'}\) denotes the successor state. This function is unknown owing to the lack of information about the IoT environment and about how requests arrive from it. Methods to overcome this problem are presented in the following sections.

3.4 Set of rewards

In reinforcement learning, the purpose or goal is expressed as a special signal, named the reward, passed from the environment to the agent. At each time step, the reward is a real number, \(r_t \in R\). \( R^{a}_{SS^{'}} \) is the immediate reward received when action a is taken in state S and the process ends in state \(S^{'}\). The reward mechanism \( R^{a}_{SS^{'}} \) is usually chosen by the system designer according to the goal, given the system’s unknown nature.

In our model, rewards were determined in a way that leads to the intended policies. In our environment, the goal was to give priority to requests with higher \(U_t\) at the fog layer and to prefer the local resources of the responsible SDNC: if the responsible SDNC had more unoccupied resources than the average number of resources in the fog layer, it used its own resources. To achieve the stated goal (Eq. 2), \( r_t\) was defined based on the received utility \(U_t\) and the system state \(S_t\), as shown in Table 3. According to that table:

$$\begin{aligned} \begin{aligned} r_t \in \{ R_{LH1},R_{LH2},R_{LL1},R_{LL2},R_{LM1},R_{LM2},R_{FH1},R_{FH2},R_{FL1},R_{FL2},R_{FM1},\\ R_{FM2},R_{CH1},R_{CH2},R_{CL1},R_{CL2},R_{CM1},R_{CM2} \} \end{aligned} \end{aligned}$$

\(U_h\) and \(U_l\) of a request are among the design parameters and depend on the utility distribution in the IoT environment. For example, \(U_h\) and \(U_l\) can be a certain percentile of utilities in the environment.

Table 3 The rewarding method based on the selected action, the received request utility (\(U_t\)), and the state condition (\(S_t\)), including local free blocks (\(FRB_{local}\)), free blocks in the fog layer (\(FRB_{fog}\)), the total number of existing resources in the environment (total RB), the total number of neighboring controllers (\(N_i\)), and the thresholds \(U_h\) and \(U_l\)

Example: Consider a hypothetical environment with two controllers, each having an FBS with four resource blocks \((NRB_{Local} = NRB_{Fog} = 4)\). Nine levels of request utility are considered in the environment \((U=9)\), and the upper and lower utility thresholds are \(U_h=7\) and \( U_l = 3 \), respectively. A hypothetical sequence of IoT requests with utilities \( u_t\) and random actions is shown in Table 4, and the corresponding state transition sequence is:

$$\begin{aligned}{} & {} 4491\rightarrow 3461\rightarrow 3321\rightarrow 3221\rightarrow 3241\rightarrow 2271\rightarrow 2231\\{} & {} \quad \rightarrow 1251\rightarrow 0281\rightarrow 0211\rightarrow 0181\rightarrow 061 \end{aligned}$$
Table 4 Actions and states for the hypothetical environment

3.5 Cumulative reward

According to the definition presented for the reward, the agent’s goal can be considered the maximization of the total received reward. Thus, the optimal action in each state is the action that maximizes the cumulative reward, defined as follows [40]:

$$\begin{aligned} G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots +R_T \end{aligned}$$
(5)

where T is the terminal time step. The terminal state is reached in this problem when all local and fog layer resource blocks are occupied. To discriminate between immediate and future rewards, \(\gamma \in [0,1]\) is defined as the discount rate, which “determines the present value of future rewards” [40]. Moreover, \(\gamma = 0 \) means future rewards have no significance, and \(\gamma = 1 \) means future and immediate rewards have similar significance. Therefore, MDP problems aim to maximize the cumulative discounted reward from the starting point \((G_0)\), which can be expressed as follows [40]:

$$\begin{aligned} G_t = R _{t+1} + \gamma R _{t+2} + \gamma ^2 R _{t+3} +\cdots = \displaystyle \sum _{k=0}^{\infty } \gamma ^ k R_{t+k+1} \end{aligned}$$
(6)

The reward received k time steps in the future is worth \(\gamma ^{k-1}\) times what it would be worth if received immediately. For episodic tasks, Eq. (5) can be rewritten as follows:

$$\begin{aligned} G_t = \displaystyle \sum _{k=t+1}^{T} \gamma ^ {k-t-1}R_{k} \end{aligned}$$
(7)

Returns at successive time steps are interrelated, a property that is important for RL theory and algorithms and can be expressed as follows:

$$\begin{aligned} \begin{aligned} G_t&= R _{t+1} + \gamma R _{t+2} + \gamma ^2 R _{t+3} + \gamma ^3 R _{t+4} + \cdots \\&= R _{t+1}+ \gamma (R _{t+2} + \gamma R _{t+3} + \gamma ^2 R _{t+4} + \cdots ) \\&= R _{t+1}+ \gamma G_{t+1} \end{aligned} \end{aligned}$$
(8)
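As a small numerical illustration of Eqs. (6)-(8), the return can be accumulated backwards through an episode; the reward sequence and discount rate below are hypothetical.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for each step using the recursion G_t = R_{t+1} + gamma * G_{t+1} (Eq. 8)."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0]))   # approximately [2.62, 1.8, 2.0]
```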

3.6 Policies and value functions

Almost all RL algorithms involve estimating the state value function V(s) and the action value function Q(s, a), which estimate how good it is for the agent to be in a certain state (or how good a given action is in a certain state) [40]. The notion of “how good” is defined in terms of expected future rewards, that is, the expected return of Eq. (6).

The rewards the agent expects to receive in the future depend on the actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies. Formally, a policy is a mapping from states to probabilities of selecting each possible action.

The state value function of a state s under policy \(\pi \), denoted \(v_\pi (s)\), is the expected return when starting in s and following policy \(\pi \) thereafter. For MDPs, \(v_\pi \) can be formally defined as:

$$\begin{aligned} v_\pi (s) = {\mathbb {E}}_\pi [G_t \mid S_t = s ] = {\mathbb {E}}_\pi \left[ \displaystyle \sum _{k= 0}^{\infty } \gamma ^ {k}R_{t+k+1} \mid S_t = s\right] , \quad \text {for all } s \in S \end{aligned}$$
(9)

where \({\mathbb {E}}_\pi \) is the expected value of a random variable given that the agent follows policy \(\pi \), and t is the time step. Note that the value of the terminal state, if any, is always zero. The present study refers to \(v_\pi \) as the state value function for policy \(\pi \). Similarly, this study defines the value of taking action a in state s under policy \(\pi \), denoted \(q_\pi (s,a)\), as the expected return when starting from s, taking action a, and thereafter following policy \(\pi \):

$$\begin{aligned} q_\pi (s,a) = {\mathbb {E}}_\pi [G_t \mid S_t = s , A_t = a ] ={\mathbb {E}}_\pi \left[ \displaystyle \sum _{k= 0}^{\infty } \gamma ^ {k}R_{t+k+1} \mid S_t = s, A_t = a\right] \nonumber \\ \end{aligned}$$
(10)

This study named \(q_\pi \) as the action value function following policy \(\pi \).

A fundamental property of value functions used throughout reinforcement learning is that they satisfy recursive relationships similar to the one established for the return (Eq. 8). For any policy \(\pi \) and any state s, the following consistency condition holds between the value of s and the values of its possible successor states [40]:

$$\begin{aligned} v_\pi (s)= & {} {\mathbb {E}}_\pi [G_t \mid S_t = s ] \nonumber \\= & {} {\mathbb {E}}_\pi [R_{t+1} + \gamma G_{t+1} \mid S_t = s ] \quad (\text {by Eq.}~8{:}\; G_t = R _{t+1} + \gamma G_{t+1} ) \nonumber \\= & {} \displaystyle \sum _a \pi ( a \mid s) \displaystyle \sum _{S^{'}} \displaystyle \sum _r p(S^{'},r \mid s , a )[ r+ \gamma {\mathbb {E}}_\pi [ G_{t+1 } \mid S_{t+1}=S^{'}]] \nonumber \\= & {} \displaystyle \sum _a \pi ( a \mid s) \displaystyle \sum _{S^{'}, r} p(S^{'},r \mid s , a )[ r+ \gamma v_\pi (S^{'})]\quad \text {for all } s \in S \end{aligned}$$
(11)

where action a is taken from A, the successor state \(S^{'}\) from S, and the reward r from R. Equation (11) is the Bellman equation for \(v_\pi \) [40]. It is a widely used and crucial formula that relates the value of a state to the values of its successor states. Moreover, solving a reinforcement learning task means finding a policy that yields the maximum reward in the long run, named the optimal policy. For finite MDPs, an optimal policy can be precisely defined as follows:

$$\begin{aligned} v_* (s) = \max _\pi v_\pi (s) \quad \text {for all } s \in S \end{aligned}$$
(12)

Optimal policies also have an optimal action value function, which is represented by \(q_*\) and defined as follows:

$$\begin{aligned} q_* (s,a) = \max _\pi q_\pi (s,a) \quad \text {for all } s \in S,\, a \in A \end{aligned}$$
(13)

We can rewrite \(q_*\) in terms of \(v_*\):

$$\begin{aligned} q_* (s,a) = {\mathbb {E}} [ R _{t+1}+\gamma v_* (S_{t+1}) \mid S_t=s , A_t = a ] \end{aligned}$$
(14)

Since \(v_*\) is the state value function of an optimal policy, we have:

$$\begin{aligned} \begin{aligned} v_* (s)&= \max _{a \in A} q_{\pi _*} (s,a) \\&=\max _a {\mathbb {E}}[R_{t+1}+ \gamma v_* (S_{t+1} ) \mid S_t = s,A_t = a ] \quad (\text {by Eq.}~14) \end{aligned} \end{aligned}$$
(15)

Equation (15) is the Bellman optimality equation for \(v_*\). In addition, the Bellman optimality equation for \(q_*\) is expressed as follows:

$$\begin{aligned} q_* (s,a)={\mathbb {E}}[R_{t+1} + \gamma \max _{a^{'}} q_* (S_{t+1} ,a^{'}) \mid S_t = s,A_t = a] \end{aligned}$$
(16)

The Bellman optimality equation states that the value of a state under an optimal policy must equal the expected return for the best action in that state. In other words, the optimal policy results from choosing the best action in every state. The function \(q_* (s,a)\) is even more convenient for choosing optimal actions because the best action in each state is the one that maximizes this function. Based on Eq. (14), the following equation can be applied to determine the optimal action:

$$\begin{aligned} a_* = \arg \max _{a \in A} q_* (s,a)=\arg \max _{a \in A} {\mathbb {E}}[R _{t+1}+ \gamma v_* (S_{t+1}) \mid S_t = s ,A_t = a] \end{aligned}$$
(17)

The following backup diagrams show the domain of future states and actions considered in Bellman optimality equations for \(v_*\) and \(q_*\) (Fig. 3).

Fig. 3 Backup diagrams for \( v_*\) and \(q_*\) [40]

Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, and thus to solving the reinforcement learning problem. However, this solution is rarely directly useful. It is akin to an exhaustive search, looking ahead at all possibilities, computing their probabilities of occurrence and desirabilities in terms of expected rewards. This solution relies on at least three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment, (2) we have enough computational resources to complete the computation of the solution and (3) the Markov property [40].

A technique for solving Bellman equations and computing optimal policies is dynamic programming (DP). DP algorithms update state values based on estimates of the successor state values; that is, estimates are updated from other estimates. This idea is named bootstrapping, and many reinforcement learning techniques perform bootstrapping.

The present study assumed that precise information about the IoT environment in which the agent operates is not available. Thus, considering the limited number of states, this study applied model-free RL techniques, which can be regarded as approximate DP techniques, to solve the problem: instead of exact DP, the optimal value functions were estimated. The related algorithms are presented in the following sections.

In this study’s problem, a request is first sent from the IoT layer to the FBS, and the FBS passes it to the responsible SDNC for a decision about its execution. The SDNC then chooses the optimal action among the allowable ones based on the current state and the utility of the received request. Therefore, according to Eq. (17), the optimal action in the MDP problem is defined as follows:

$$\begin{aligned}{} & {} \small a_* = \arg \max \{ R_{Local} +\gamma {\mathbb {E}}_u[ V^*(S^{'}_{Local})],\; R_{Fog} + \gamma {\mathbb {E}}_u[ V^*(S^{'}_{Fog})], \nonumber \\{} & {} \qquad \qquad R_{Cloud} + \gamma {\mathbb {E}}_u[V^*(S^{'}_{Cloud})]\} \end{aligned}$$
(18)

According to Eq. (3), successor states are defined as follows:

$$\begin{aligned} \begin{aligned} S^{'}_{Local}&=(10^{m+n} \times FRB_{Fog}) +( 10^n \times (FRB_{Local} - 1)) + u_{t+1}\\ S^{'}_{Fog}&=(10^{m+n} \times (FRB_{Fog} - 1) ) +( 10^n \times FRB_{Local} ) + u_{t+1}\\ S^{'}_{Cloud}&=(10^{m+n} \times FRB_{Fog}) +( 10^n \times FRB_{Local} ) + u_{t+1}\\ \end{aligned} \end{aligned}$$

where \(S^{'}_{Local}\) is the successor state when a = Local Service, \(S^{'}_{Fog}\) is the successor state when a = Service in Fog Layer, \(S^{'}_{Cloud}\) is the successor state when a = Service in Cloud Layer, and \({\mathbb {E}}_u\) is the expectation with respect to the utilities u in the IoT environment.
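The decision rule of Eq. (18) can be sketched as follows. The value table V, the per-action rewards, the state-encoding function, and the utility distribution passed in are illustrative assumptions, not the paper's exact implementation.

```python
ACTIONS = ("local", "fog", "cloud")

def successor(encode, frb_fog, frb_local, u_next, action):
    """Successor state for each action, following the definitions under Eq. (18)."""
    if action == "local":
        return encode(frb_fog, frb_local - 1, u_next)   # one local RB becomes occupied
    if action == "fog":
        return encode(frb_fog - 1, frb_local, u_next)   # one neighbor RB becomes occupied
    return encode(frb_fog, frb_local, u_next)           # cloud: all fog RBs retained

def optimal_action(V, reward, encode, frb_fog, frb_local, utilities, p_u, gamma=0.9):
    """Choose argmax_a { R_a + gamma * E_u[ V*(S'_a) ] }, as in Eq. (18)."""
    values = {}
    for a in ACTIONS:
        expected_v = sum(p * V[successor(encode, frb_fog, frb_local, u, a)]
                         for u, p in zip(utilities, p_u))
        values[a] = reward[a] + gamma * expected_v
    return max(values, key=values.get)
```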

3.7 Monte Carlo method

This section presents the first learning method for estimating value functions and discovering optimal policies, which helps solve the fog resource allocation problem. As mentioned, the lack of thorough knowledge of the environment is one of the assumptions here.

A popular method to compute the optimal state values is value iteration with Monte Carlo (MC) estimation, as presented in Eq. (13). MC methods require only experience, that is, sequences of sampled states, actions, and rewards obtained from actual or simulated interaction with an environment. These methods perform reinforcement learning by averaging sample returns.

The value of a state is the expected return, i.e., the cumulative discounted future reward starting from that state. A straightforward way to estimate it is therefore from experience: observe many returns and average the returns obtained after each visit to the state. As more returns are observed, this average converges to the expected value. This idea underlies all MC methods.

For example, to estimate \(v_\pi (s)\) with the MC method, we use \(v_\pi (s)={\mathbb {E}}_\pi [G_t \mid S_t = s]\). Figure 4A shows the backup diagram for the MC method, and Algorithm 1 presents how the optimal policy for the MDP problem is learned based on the MC method. Each occurrence of state s in an episode is called a visit to s; in an episode, s may be visited many times. The first-visit MC method estimates v(s) as the average of the returns following the first visit to s, whereas the every-visit MC method averages the returns following all visits to s. The two methods are very similar, differing only in some theoretical properties. This study used the first-visit MC method.

Algorithm 1 Learning the optimal policy with the first-visit Monte Carlo method
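The sketch below illustrates first-visit Monte Carlo control with an epsilon-greedy behaviour policy, in the spirit of Algorithm 1; the environment interface (env.reset, env.step) and the hyper-parameters are assumptions for illustration, not the paper's exact implementation.

```python
import random
from collections import defaultdict

def mc_first_visit_control(env, actions, episodes=5000, gamma=0.9, eps=0.1):
    """First-visit Monte Carlo control with an epsilon-greedy policy."""
    Q = defaultdict(float)       # action-value estimates Q[(s, a)]
    visits = defaultdict(int)    # number of first visits per (s, a)

    def policy(state):
        if random.random() < eps:                                 # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])          # exploit

    for _ in range(episodes):
        # Generate one episode following the current policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk backwards to accumulate returns G_t = R_{t+1} + gamma * G_{t+1}.
        g, returns = 0.0, []
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, action, g))
        returns.reverse()

        # First-visit update: only the first occurrence of each (s, a) counts.
        seen = set()
        for state, action, g in returns:
            if (state, action) in seen:
                continue
            seen.add((state, action))
            visits[(state, action)] += 1
            Q[(state, action)] += (g - Q[(state, action)]) / visits[(state, action)]

    return Q
```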

3.8 Temporal difference learning methods

Temporal difference (TD) learning combines MC and DP ideas: TD methods combine the sampling of MC with the bootstrapping of DP. While Monte Carlo methods must wait until the end of the episode to determine the increment to \(V(S_t)\) (only then is \(G_t\) known) [40], TD methods need to wait only until the next time step. TD methods perform an update at time t+1 using the observed reward \(R_{t+1}\) and the estimate \( V(S_{t+1})\). The simplest TD method leads to the following update:

$$\begin{aligned} V(S_t) \leftarrow V(S_t) + \alpha [ \, R_{t+1} + \gamma V(S_{t+1}) - V({S_t}) ] \, \end{aligned}$$
(19)

The target of the update in the MC method is \(G_t\), while in the TD method it is \(R_{t+1} + \gamma V(S_{t+1})\). This is named TD(0), or the one-step TD method. Since TD(0), like DP, bases its update on an existing estimate, it is called a bootstrapping method. Figure 4B shows the backup diagram for the TD(0) method. Furthermore, TD methods have an advantage over DP methods in that they do not require a model of the environment, i.e., the reward and next-state probability distributions. Moreover, the most notable advantage of TD methods over MC methods is that they are online, fully incremental, and wait only one time step. Similar to the MC methods, these approaches are divided into on-policy and off-policy methods. The current study learned the optimal policy from the IoT environment using the on-policy SARSA and off-policy Q-learning methods.
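A minimal sketch of the TD(0) update of Eq. (19), assuming a dictionary-based value table and an illustrative step size:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(S_t) <- V(S_t) + alpha * [ R_{t+1} + gamma * V(S_{t+1}) - V(S_t) ]  (Eq. 19)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```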

3.8.1 On-policy temporal difference learning method

In the TD(0) method, the present study considered transitions from one state to another and learned the values of states. Here, this study considers transitions from one state-action pair to another and learns the values of state-action pairs. To this end, Eq. (19) can be rewritten for state-action pairs:

$$\begin{aligned} Q(S_t,A_t ) \leftarrow Q(S_t,A_t ) + \alpha [ \, R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t ) ]\, \end{aligned}$$
(20)

This update is carried out after each transition from a non-terminal state \(S_t\). If \(S_{t+1}\) is the terminal state, \(Q(S_{t+1},A_{t+1})\) is defined as zero.

This formula uses all five elements of the quintuple \((S_t,A_t,R_{t+1},S_{t+1},A_{t+1})\) that forms the transition from one state-action pair to the next. These five elements give the algorithm its name, SARSA. The backup diagram for SARSA is presented in Fig. 4C.

3.8.2 Q-learning (off-policy temporal difference learning method)

One of the most successful reinforcement learning methods is the off-policy TD control algorithm known as Q-learning, which is defined as:

$$\begin{aligned} Q(S_t,A_t ) \leftarrow Q(S_t,A_t ) + \alpha [ \, R_{t+1} + \gamma max_a Q(S_{t+1},a) - Q(S_t,A_t ) ]\, \end{aligned}$$
(21)

In this case, the learned action value function Q directly approximates the optimal action value function, independently of the policy being followed. This property significantly simplifies the analysis of the algorithm and enabled early convergence proofs. The backup diagram is shown in Fig. 4D.

Fig. 4 Backup diagrams for the TD(0), SARSA, Q-learning, and Monte Carlo methods

In addition, the integrated application of the SARSA and Q-learning methods for solving the optimal resource allocation problem is shown in Algorithm 2.

Algorithm 2 Integrated SARSA and Q-learning for optimal resource allocation
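The following sketch shows how the SARSA update of Eq. (20) and the Q-learning update of Eq. (21) can be combined in one training loop, in the spirit of Algorithm 2; the epsilon-greedy policy, environment interface, and hyper-parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    """Behaviour policy used by both SARSA and Q-learning."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def td_control(env, actions, method="q_learning",
               episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """SARSA (on-policy) or Q-learning (off-policy) control, Eqs. (20)-(21)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, actions, eps)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, eps)
            if done:
                target = reward                          # Q of terminal successor is zero
            elif method == "sarsa":
                target = reward + gamma * Q[(next_state, next_action)]              # Eq. (20)
            else:
                target = reward + gamma * max(Q[(next_state, a)] for a in actions)  # Eq. (21)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```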

4 Evaluation and simulation

The goal of the proposed solution is to maximize the total utility of the serviced requests and minimize the idle time of local and fog layer resources (Eq. 2). Therefore, a performance metric named R was defined to evaluate the proposed algorithms and compare them with a constant-threshold algorithm.

The total utility of requests serviced in the fog is the main factor that increases this metric. However, R is defined such that a larger utility of requests served in the cloud, together with a greater number of unoccupied resource blocks available to the responsible controller at the time of servicing in the cloud, reduces the R-value.

$$\begin{aligned} \begin{aligned} R&= E\left[ \displaystyle \sum _{t = 0}^T ( u_t \mid a_t = \text {Local or Fog Service}) \right. \\&\quad \left. - \theta \displaystyle \sum _{t = 0}^T \left( u_t+ \frac{NRB_{Local }+NRB_{Fog }}{\text {All Resources}} \mid a_t = \text {Cloud} \right) \right] \end{aligned} \end{aligned}$$
(22)

where \(u_t\) is the utility of the request at time t, \( NRB_{Fog}\) is the number of unoccupied RBs of the FBSs managed by neighboring SDNCs, \(NRB_{Local}\) is the number of unoccupied RBs of the FBSs managed by the responsible SDNC, T is the duration of the episode, and \(\theta \in [0,1]\) is a discount factor that acts as a penalty for idle time.
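A sketch of how the R metric of Eq. (22) could be accumulated over one episode; the log format and the penalty factor are illustrative assumptions.

```python
def r_metric(episode_log, total_rb, theta=0.5):
    """Accumulate Eq. (22) over one episode.

    episode_log : list of (action, utility, free_local_rb, free_fog_rb) tuples,
                  where action is 'local', 'fog', or 'cloud'.
    total_rb    : all resource blocks in the environment.
    """
    gain, penalty = 0.0, 0.0
    for action, u, free_local, free_fog in episode_log:
        if action in ("local", "fog"):
            gain += u                                    # utility served at the fog layer
        else:                                            # request served in the cloud
            penalty += u + (free_local + free_fog) / total_rb
    return gain - theta * penalty
```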

In the next step, simulation results are presented to evaluate the performance of the SDN controller while executing the RL methods SARSA, Q-learning, and MC (as presented in Algorithms 1 and 2) within the introduced architecture. The performance of the RL-based SDN controller was then compared with that of an SDN controller using a network slicing approach with various slicing thresholds (2–8 slices) in IoT environments with different compositions of latency requirements.

In this simulation, 10 utility levels with different latency requirements were considered to represent a variety of IoT applications, with utilities calculated according to Eq. (1). Requests at the first utility level (u = 1) have the least importance, and requests at the last level (u = 10) have the most importance. By changing the combination of utility classes, we produced 10 IoT environment scenarios, which are presented in Table 5. As Table 5 shows, in the first environment most requests have high utility and are especially latency-sensitive (similar to a hospital or military IoT environment); in the following environments the number of high-utility requests decreases, and the E10 environment has the fewest high-utility requests (similar to an entertainment or smart-home IoT environment).

Table 5 Utility distribution for different IoT environments with various latency requirements

The simulation parameters listed in Table 6 were used in this section, and the rewards defined in Table 3 were assigned the values given in Table 7. The greedy policy was selected as the optimal policy to facilitate and accelerate the learning process.

A local SDN controller equipped with computational, signal processing, and storage resources comprising 5 RBs was employed, i.e., \( NRB_{Local}=5\). Besides, the intended environment included three neighboring SDN controllers with a total of nine resource blocks: \( NRB_{Fog}=9\). The locally serving threshold \((U_h)\) and the serving threshold in the cloud layer \((U_l)\) were expressed in terms of the utility levels and the ratio of local and fog layer resources to the total resources as:

$$\begin{aligned} \begin{aligned} U_l&= \frac{U_{max}}{3} + ( {\mathbb {E}}(u) -\frac{U_{max} + U_{min}}{2})\\ U_h&= U_l+ \left( U_{max} \times \frac{2}{3}\times \frac{NRB_{Fog}}{NRB_{Local} + NRB_{Fog} } \right) \end{aligned} \end{aligned}$$
(23)
Table 6 A summary of simulation parameters and their value
Table 7 Rewards considered in the simulations
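Under the stated parameters, the thresholds of Eq. (23) can be reproduced with the short sketch below; the uniform utility distribution assumed for E(u) is an illustrative choice.

```python
def serving_thresholds(u_max, u_min, expected_u, nrb_local, nrb_fog):
    """Cloud-serving threshold U_l and local-serving threshold U_h of Eq. (23)."""
    u_l = u_max / 3 + (expected_u - (u_max + u_min) / 2)
    u_h = u_l + u_max * (2 / 3) * nrb_fog / (nrb_local + nrb_fog)
    return u_l, u_h

# With 10 utility levels (1..10), NRB_Local = 5, NRB_Fog = 9, and a uniform
# utility distribution (E(u) = 5.5), this gives U_l ~ 3.33 and U_h ~ 7.62.
print(serving_thresholds(u_max=10, u_min=1, expected_u=5.5, nrb_local=5, nrb_fog=9))
```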

To perform the simulation, dedicated software was designed and implemented; the simulation was repeated 300 times for each test environment, and the results of each test stage were stored in a database. The analysis was then carried out on the data obtained from these tests. Figure 5 and Table 8 compare the performance of the RL methods, in terms of R, with that of utility-filtering-based network slicing methods with various slicing thresholds in 10 different IoT environments. The utility filtering algorithm uses a constant threshold for network slicing regardless of the environment; 2 to 8 possible slices were investigated, and ten levels of utility (0–9) were considered in our discretization. The SARSA and Q-learning methods showed the best performance. The results are based on the average of 300 simulations for each environment and each method.

Fig. 5 Performance of the RL methods in terms of the R metric

Table 8 Performance of the RL methods in terms of the R metric

Based on Table 8, the performance of the utility filtering algorithms was close to that of SARSA and Q-learning in some environments. For example, in the E2 environment, the NetworkSlice5 and NetworkSlice6 methods performed similarly to Q-learning. The important point is that determining which threshold achieves maximum performance in each environment is not simple, especially in complex environments where precise prior knowledge is unavailable and the state of the environment changes over time. In contrast, the RL-based methods provide the best performance across different environments merely through learning, without requiring background knowledge of the environment.

5 Conclusions

This paper presented a new resource allocation framework using a software-defined networking (SDN) architecture and reinforcement learning techniques. The aim was to use the limited resources of the fog layer optimally. The framework was built on a precise SDN-based architecture with multiple controllers and reinforcement learning algorithms. Thus, the following achievements were reached:

  1. Optimal use of fog layer resources and flexibility in managing these resources were achieved by exploiting the resources of other SDN controllers in addition to the resources of the FBSs under the responsible controller. This is considered an innovation and a key advantage for efficient resource management.

  2. The use of reinforcement learning methods made the responsible SDN controller behave like an intelligent agent that, without background knowledge of the environment and only through learning in different environments, chooses the optimal method of serving the received request from the possible options.

This framework was simulated with three reinforcement learning techniques, SARSA, Q-learning, and Monte Carlo, and examined in different IoT environments with heterogeneous latency demands. The simulation results showed the superiority of Q-learning and SARSA over network slicing approaches with various slicing thresholds and demonstrated that these methods adapt to the IoT environment. Furthermore, the RL techniques struck an appropriate compromise between two opposing objectives: maximizing the average utility of served requests and minimizing the idle time of fog resources. It is worth noting that this maximization reflects the optimal utilization of resources in the fog layer.

For future work, it is recommended to generalize the proposed resource allocation framework to environments with high request density, where queues form at the SDN controllers. In this situation, priority queues become meaningful, and decision-making for the SDN controllers becomes more complex. Moreover, shared resource allocation for a request, where part of the required resources is provided by resources under the control of the responsible SDNC and the remainder by other SDNCs, can be examined as a more complicated extension.