1 Introduction

Future IoT applications may be served effectively by fog radio access networks (F-RANs) with the help of edge caching and edge computing. Health monitoring, low-latency services, and analytics on large volumes of IoT data are examples of such applications (Xiang et al. 2020). In an F-RAN, each piece of user equipment (UE) may switch between several communication modes, such as device-to-device (D2D), fog radio access point (F-AP), and cloud radio access network (C-RAN) modes. Performance analysis, radio resource allocation, collaborative design of cloud and edge processing, the effect of cache size, and many other aspects of F-RANs have recently been studied (Cao et al. 2021a).

Network slicing, a novel technique studied in the context of 5G, has the potential to accommodate a wide range of use cases and business models. The idea behind network slicing is to create customised services by orchestrating and chaining network slice instances. Network slicing supports 5G networks economically by enabling flexible support of multiple applications (Iturria-Rivera et al. 2022). Slicing of the radio access network (RAN) is researched as a crucial component of network slicing to further enhance end-to-end network performance. Although network slicing is an effective way to address 5G service needs, it still faces substantial difficulties. First, traditional core network slicing techniques are driven solely by business considerations and ignore RAN characteristics, yet network slicing differs depending on the network architecture, for example in heterogeneous networks or cloud RANs (C-RANs) (Fang et al. 2022); it may therefore be advantageous to consider RAN characteristics and network slicing jointly. Second, emerging applications have stricter performance requirements. In 4G, many services with different requirements are offered over the same network, but delivering heterogeneous services over one undifferentiated network is inefficient. For instance, Internet of Things services demand an extremely large number of connections but do not require a high data rate, whereas VR applications require a high data rate but not massive connectivity. As a result, 5G employs network slicing (NS) to create, within slices, networks that are appropriate for different services. Slices are defined based on criteria such as throughput, latency, and reliability, and each slice is given access to the network resources needed to meet these requirements. To realise the NS concept, the network resources must be separated into slices, with each slice receiving the resources it needs. Because radio resources are limited, the RAN is confronted with developing a technique that meets the slice requirements without lowering efficiency (Shi et al. 2020). To do this, it is crucial to take into account changes in the slice state, such as the traffic volume and the number of attached UEs to be controlled. Additionally, the number of slices processed by a base station (BS) fluctuates with service utilisation and with UEs entering and leaving the BS coverage area. Therefore, a mechanism that dynamically distributes radio resources in accordance with the slice status is required.

Deep learning (DL), a family of artificial intelligence methods, can model the functioning of biological neural systems and extract patterns that can be used for decision-making tasks.
The smallest component of such a system, the neuron, is present in each layer in a predetermined number, and the depth of the structure is determined by the number of hidden layers. The term "deep learning" was coined because DL contains several hidden layers, as opposed to the single hidden layer of shallow learning. The development of DL has gone hand in hand with the development of technology that can organise vast amounts of data, and it has been used successfully across a broad range of disciplines, including natural language processing (NLP), wireless networking, and computer vision (Vimal et al. 2020).

2 Related works

In RAN slicing, the radio resources, which include resource blocks (RBs), are split across the frequency and time domains. Shirmohamadi et al. (2022) determined the RB allocation to each slice using an extension of the RB scheduling algorithm, and throughput was shown to be higher with the extension than without it. However, when assigning RBs to a slice, that technique does not take the fulfilment of the slice requirements into account, and meeting the slice requirements is critical in NS. Murti et al. (2021) therefore suggested a mechanism that distributes RBs to slices while taking the slice requirements into account; it satisfies slice requirements by reallocating RBs from slices without requirements. The efficiency of the RB distribution, however, was not assessed, so slices may receive more RBs than they need. Chen et al. (2018) suggested a solution that considers both the slice requirements and the efficiency of RB consumption to overcome this issue: the slices are abstracted into four categories and RBs are allocated accordingly. Their analysis revealed a 12% increase in allocation efficiency, with respect to the slice requirements, over the scenario without abstraction. The problem is that, because the slices are abstracted into four types before RBs are assigned, RBs cannot be entirely segregated per slice; interference from RBs of other slices may then reduce the extent to which the slice requirements are met (Shahjalal et al. 2023). It is therefore crucial to distribute to each slice only the required amount of RBs, without influence from other slices.

Regarding computation offloading, Du et al. (2020) created an effective one-dimensional search method to identify the optimal solution to the delay-optimal computation task offloading problem within a Markov decision process (MDP) framework; its dependence on statistical information about channel quality variations and computation task arrivals is a limitation, though. Yan et al. (2020) used a Lyapunov optimisation approach to study a dynamic computation offloading policy for a MEC system with mobile devices capable of wireless energy harvesting. A similar methodology was used by Chang et al. (2022) and Deka and Sharma (2022) to investigate the power-delay tradeoff in the context of computation task offloading; however, Lyapunov optimisation can only produce an approximately optimal solution. Zhou et al. (2023) created an algorithm that uses reinforcement learning and does not require prior knowledge of network data in order to discover the best computation offloading strategy. When MEC is deployed in an ultra-dense sliced RAN, multiple BSs with different data transmission quality are available for offloading a computation workload, and the resulting expansion of the state space renders traditional reinforcement learning techniques (Cao et al. 2021b) impractical. Lu et al. (2019) suggested RL with multi-pointer networks (Mptr-Net) to address the offloading problem in MEC, and the results revealed that their method achieved greater than 98% optimality; to tackle the placement problem for virtual network functions (VNFs), the authors in Lu et al. (2019) also developed a deep RL technique with a sequence-to-sequence model in an effort to reduce power consumption. A deep RL technique for dynamic computation and radio resource control in a vRAN was suggested in recent work (Filali et al. 2022) as the vrAIn framework.
Despite the promise of such techniques for tackling complex combinatorial problems for zero-touch optimisation in wireless networks (Koudouridis et al. 2022; Jiang et al. 2019), there is currently no prior effort to use them for functional split optimisation in vRAN.

3 System model

In this paper, we consider an ultra-dense service area served by a virtualised RAN with a set B = {1, …, B} of BSs, as shown in Fig. 1. Both MEC services and conventional communication services are supported over the same physical network infrastructure. A MEC server is deployed at the network edge, providing the MUs with powerful computational resources. By carefully offloading the generated computation tasks via the BSs to the MEC server for execution, the MUs can expect a significantly enhanced computing experience. The wireless radio resources are divided into slices for conventional communication and for MEC in order to provide inter-slice isolation.

Fig. 1

Three cells in the conventional C-RAN architecture, one RRH in every cell, and a centralised connection to the BBU pool

Consider a typical single base station (BS) downlink cellular network. Time is divided into transmission time intervals (TTIs) of 1 ms, indexed by t = 1, 2, …. In each TTI, the bandwidth is split into a number of PRBs, denoted F = {1, 2, …, F}. The cellular network is divided into a collection of N network slices, denoted N = {1, 2, …, N}.

3.1 Signal transmission model

For traditional services such as eMBB, which transmit large packets, Shannon's capacity formula can be used directly to evaluate the data rate. The new uRLLC and MTC services, however, differ from typical services in that they transmit small packets (between 32 and 200 bytes in size), and Shannon's capacity theory cannot adequately describe the data rate of short-packet transmissions. Instead, finite blocklength theory may be used to approximate the achievable data rate of a short-packet transmission; the associated channel dispersion term is given by Eq. (1)

$$V_{i,j,t} = 1 - \left( {1 + p_{i,j,t} \left| {h_{i,j,t} } \right|^{2} /N_{0} } \right)^{ - 2}$$
(1)

The instantaneous data rate of UE i ∈ Un can therefore be expressed as in Eq. (2)

$$r_{i,t} = \sum_{j = 1}^{F} s_{i,j,t} \cdot r_{i,j,t} ,({\text{ bits per TTI)}}$$
(2)

where the binary variable si,j,t is set to 0 if UE i is not assigned the j-th PRB and to 1 otherwise, and each PRB may be assigned to at most one UE at a time. Under a first-come-first-served (FCFS) policy, every UE has a data queue at the BS in which incoming packets are buffered before transmission. At the t-th TTI, the queue length of UE i ∈ Un is denoted qi,t and evolves according to Eq. (3)

$$q_{i,t + 1} = {\text{max}}\left\{ {q_{i,t} - r_{i,t} /Z_{n} ,0} \right\} + A_{i,t} ,$$
(3)

where Ai,t is the instantaneous packet arrival of UE i during the t-th TTI and Zn is the packet size (in bits) in slice n. The active UEs in slice n are defined as the set of UEs with a nonzero queue length. The packet delay is composed mainly of a queuing (scheduling) delay and a transmission delay: as reflected in formula (4), the transmission delay is determined by the data rate of the UE, whereas the queuing delay is determined by the scheduling strategy. The packet delay in our system model is therefore the sum of the transmission delay and the queuing delay. The delay of the m-th (m = 1, 2, …) packet arriving at the i-th UE's buffer is modelled by Eq. (4)

$$D_{i,m} = W_{i,m} + \delta_{i,m} ,{\text{ (in TTI) }}$$
(4)

When the average packet arrival rate of a UE in slice n is low, the queuing delay is almost nil and the transmission delay dominates the packet delay. From the perspective of service provisioning, an application regards a data packet as dropped when its delay exceeds a preset maximum tolerable packet delay, and packet losses are typically what characterise the reliability of a transmission. Accordingly, the probability that the packet delay exceeds a predetermined maximum packet delay threshold is defined as the packet drop rate (PDR) of the m-th incoming packet at the i-th UE's buffer, as given by Eq. (5)

$$\beta_{i,t} = {\text{Pr}}\left\{ {D_{i,m} > D_{n}^{{{\text{max}}}} } \right\},i \in U_{n} ,$$
(5)

where \(D_{n}^{{{\text{max}}}}\) is the maximum tolerable packet delay of every UE in slice n. The packet delay of a UE, as given in formula (4), should be considered one crucial QoS parameter during service provisioning, while the PDR of a UE, as given in formula (5), measures the communication reliability, which is another crucial QoS parameter. We will therefore use the packet delay and the PDR as the two key metrics to assess the QoS performance of a service in the following sections.
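To make the preceding definitions concrete, the following Python sketch ties Eqs. (1)–(5) together: it computes a finite-blocklength rate per PRB using the standard normal approximation (the full rate expression is not reproduced in this section, so this particular form is an assumption), evolves the FCFS queue of Eq. (3), and tracks the packet delay and PDR of Eqs. (4)–(5). All numerical parameters (blocklength, SNR, arrival probability, delay budget) are illustrative assumptions rather than values taken from this paper.

```python
# Illustrative sketch (not the authors' implementation) of Eqs. (1)-(5).
import math
import random
from collections import deque
from statistics import NormalDist

def short_packet_rate(snr, blocklength=168, error_prob=1e-5):
    """Normal-approximation rate (bits per channel use) for one PRB."""
    cap = math.log2(1.0 + snr)                                   # Shannon capacity term
    disp = (1.0 - (1.0 + snr) ** -2) * (math.log2(math.e)) ** 2  # dispersion V, cf. Eq. (1)
    qinv = NormalDist().inv_cdf(1.0 - error_prob)                # Q^{-1}(epsilon)
    return max(cap - math.sqrt(disp / blocklength) * qinv, 0.0)

def simulate_ue(num_tti=2000, arrival_prob=0.3, packet_bits=256,
                prbs_per_tti=2, channel_uses_per_prb=168,
                mean_snr=4.0, d_max_tti=5, seed=1):
    """One UE in slice n: returns (mean packet delay in TTI, PDR)."""
    rng = random.Random(seed)
    queue = deque()                 # FCFS buffer of (arrival_tti, remaining_bits)
    delays, dropped, finished = [], 0, 0
    for t in range(num_tti):
        # Packet arrival A_{i,t}
        if rng.random() < arrival_prob:
            queue.append((t, packet_bits))
        # Rate r_{i,t}: sum over assigned PRBs (Eq. (2)), Rayleigh-faded SNR
        budget = sum(short_packet_rate(rng.expovariate(1.0 / mean_snr))
                     * channel_uses_per_prb for _ in range(prbs_per_tti))
        # Serve head-of-line packets with this TTI's bit budget (Eq. (3))
        while queue and budget > 0:
            arr, rem = queue.popleft()
            served = min(rem, budget)
            budget -= served
            if served < rem:
                queue.appendleft((arr, rem - served))
            else:
                delay = t - arr + 1          # queuing + transmission delay, Eq. (4)
                delays.append(delay)
                finished += 1
                if delay > d_max_tti:        # drop event counted in the PDR, Eq. (5)
                    dropped += 1
    pdr = dropped / max(finished, 1)
    return sum(delays) / max(len(delays), 1), pdr

if __name__ == "__main__":
    mean_delay, pdr = simulate_ue()
    print(f"mean packet delay = {mean_delay:.2f} TTI, PDR = {pdr:.3f}")
```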

3.1.1 Photonic integrated circuits with dynamic optimization

By controlling light for lasing, switching, and optical filtering, as well as for the trapping and emission of photons, ultra-small cavities play a crucial part in photonic integrated circuits. Ring resonators and photonic crystal micro-cavities are the most commonly used structures; one-dimensional (1D) micro-cavities are ideal for very dense packing since they have a very small footprint. To date, high-performance electro-optical modulation in silicon has mostly been demonstrated with Mach–Zehnder modulators, whose lengths are in the millimetre range. Recently, ring resonator-based electro-optic modulators have been demonstrated; these small modulators reach modulation rates greater than 10 Gb/s with ring diameters as small as 12 μm. Cavities an order of magnitude smaller than ring resonators are possible, but only a modulation rate of 250 Mb/s has been demonstrated with them so far, so further research is necessary to boost the modulation speed. The creation of such miniature electro-optic modulators is our key priority. To optimise the arrangement of the modulators for modulation frequency, loss reduction, and extinction ratio, we systematically iterated over a number of design parameters. With a unique diode arrangement that decreases absorption and offers extremely low energy consumption per bit, the goal was to attain a 10 GHz modulation frequency. The cavity's waveguide is part of a p-i-n diode: by applying a voltage to the diode, free carriers are injected into or drained from the cavity, and through the so-called free-carrier plasma dispersion effect the voltage modifies the refractive index of the silicon waveguide. The change in the cavity's refractive index shifts the spectral location of the transmission peak, which enables modulation of the intensity of the transmitted light. This architecture has the smallest footprint demonstrated so far and could support a data throughput of 25 GBd.
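To illustrate the modulation mechanism described above, the short sketch below shifts a Lorentzian cavity resonance by a free-carrier-induced index change and evaluates the transmitted intensity at a fixed probe wavelength. All numbers (linewidth, group index, index change, residual transmission) are generic assumptions for illustration, not measured parameters of the device discussed in this work.

```python
# Rough illustration of resonance-shift modulation via the plasma dispersion effect.
import math

def notch_transmission(wavelength_nm, resonance_nm, fwhm_nm, floor=0.01):
    """Lorentzian notch; `floor` is the residual on-resonance transmission."""
    x = 2.0 * (wavelength_nm - resonance_nm) / fwhm_nm
    return floor + (1.0 - floor) * x * x / (1.0 + x * x)

laser_nm  = 1550.00        # probe wavelength
res_nm    = 1550.00        # cold-cavity resonance
fwhm_nm   = 0.05           # linewidth (quality factor of roughly 3e4)
group_idx = 4.0            # assumed group index of the silicon waveguide
delta_n   = -1.5e-4        # carrier-injection index change (plasma dispersion)

shift_nm = res_nm * delta_n / group_idx                                 # first-order resonance shift
t_off = notch_transmission(laser_nm, res_nm, fwhm_nm)                   # carriers absent
t_on  = notch_transmission(laser_nm, res_nm + shift_nm, fwhm_nm)        # carriers injected
extinction_db = 10.0 * math.log10(t_on / t_off)
print(f"resonance shift = {shift_nm * 1e3:.1f} pm, extinction = {extinction_db:.1f} dB")
```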

The binary indicator xij(t) denotes whether UE i's request j is served at time t. The constraints for allocating resources to the network slices are given by Eq. (6)

$$\begin{aligned} & \sum_{i,j} F_{ij} x_{ij} \left( t \right) \le F\left( t \right),\left( {i,j} \right) \in A\left( t \right), \\ & \sum_{i,j} P_{ij}^{C} x_{ij} \left( t \right) \le P^{C} \left( t \right),\left( {i,j} \right) \in A\left( t \right), \\ & \sum_{i,j} P_{ij}^{T} x_{ij} \left( t \right) \le P^{T} \left( t \right),\left( {i,j} \right) \in A\left( t \right), \\ \end{aligned}$$
(6)

where F(t), \(P^{C}(t)\), and \(P^{T}(t)\) denote, respectively, the available communication (frequency-time block), computing, and transmit-power resources at the gNodeB at time t. Some of the gNodeB's resources may already be allotted to requests that have not finished processing, so not all resources are accessible. The myopic aim of resource allocation at time t is to pick Fij(t), \(P_{ij}^{C}\)(t), and \(P_{ij}^{T}\)(t) so as to maximise Eq. (7)

$${\text{max}}\sum_{ij} w_{ij} x_{ij} \left( t \right),\left( {i,j} \right) \in A\left( t \right)$$
(7)

We then consider the optimisation problem over a time horizon. From time t − 1 to time t, the resources are updated according to Eq. (8)

$$\begin{aligned} F\left( t \right) & = F\left( {t - 1} \right) + F_{{\text{r}}} \left( {t - 1} \right) - F_{{\text{a}}} \left( {t - 1} \right). \\ P^{C} \left( t \right) & = P^{C} \left( {t - 1} \right) + P_{r}^{C} \left( {t - 1} \right) - P_{a}^{C} \left( {t - 1} \right), \\ P^{T} \left( t \right) & = P^{T} \left( {t - 1} \right) + P_{{\text{r}}}^{T} \left( {t - 1} \right) - P_{a}^{T} \left( {t - 1} \right){,} \\ \end{aligned}$$
(8)

where the frequency, CPU, and transmit-power resources occupied at time t − 1 are released. Every request has a lifespan lij: if service begins at time t to fulfil it, the request will finish at time t + lij. Define R(t) as the collection of requests that have ended at time t. The released and allocated resources at time t, together with the time-horizon objective, are given by Eq. (9)

$$\begin{aligned} & F_{r} \left( t \right) = \sum_{{\left( {i,j} \right) \in R\left( t \right)}} F_{ij} , \\ & P_{r}^{C} \left( t \right) = \sum_{{\left( {i,j} \right) \in R\left( t \right)}} P_{ij}^{C} , \\ & P_{r}^{T} \left( t \right) = \sum_{{\left( {i,j} \right) \in R\left( t \right)}} P_{i,j}^{T} \\ & F_{a} \left( t \right) = \sum_{i,j} F_{ij} , \\ & P_{a}^{C} \left( t \right) = \sum_{i,j} P_{ij,}^{C} , \\ & P_{{\text{a}}}^{T} \left( t \right) = \sum_{i,j} P_{i,j}^{T} \\ & {\text{max}}\sum_{t} \sum_{ij} w_{ij} x_{ij} \left( t \right),\left( {i,j} \right) \in A\left( t \right) \\ \end{aligned}$$
(9)

This problem could be solved offline under the unrealistic assumption that the gNodeB is aware of all upcoming requests.
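As an illustration of the per-slot allocation problem in Eqs. (6)–(9), the sketch below releases the resources of finished requests, admits new requests greedily by weight while the capacity constraints of Eq. (6) hold, and tracks occupied resources over time. The greedy rule and all request/capacity numbers are assumptions made for the example; they are not the optimiser developed in this paper.

```python
# Minimal greedy sketch of the slot-by-slot allocation in Eqs. (6)-(9).
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: tuple          # (i, j): UE i, request j
    weight: float       # w_ij
    f: float            # frequency-time blocks F_ij
    p_c: float          # computing resource P^C_ij
    p_t: float          # transmit power P^T_ij
    lifespan: int       # l_ij, in slots

@dataclass
class GNodeB:
    F: float
    P_C: float
    P_T: float
    running: list = field(default_factory=list)   # (finish_time, Request)

    def release(self, t):
        """Free resources of requests ending at time t (F_r, P^C_r, P^T_r in Eq. (9))."""
        done = [(ft, r) for ft, r in self.running if ft <= t]
        self.running = [(ft, r) for ft, r in self.running if ft > t]
        for _, r in done:
            self.F += r.f; self.P_C += r.p_c; self.P_T += r.p_t

    def allocate(self, t, arrivals):
        """Greedy slot-t allocation: admit requests by weight while Eq. (6) holds."""
        admitted = []
        for r in sorted(arrivals, key=lambda r: r.weight, reverse=True):
            if r.f <= self.F and r.p_c <= self.P_C and r.p_t <= self.P_T:
                self.F -= r.f; self.P_C -= r.p_c; self.P_T -= r.p_t
                self.running.append((t + r.lifespan, r))
                admitted.append(r.rid)
        return admitted

if __name__ == "__main__":
    gnb = GNodeB(F=10.0, P_C=8.0, P_T=6.0)
    slot_arrivals = {
        0: [Request((1, 1), 5.0, 4, 3, 2, 2), Request((2, 1), 3.0, 6, 4, 3, 3)],
        1: [Request((1, 2), 4.0, 5, 4, 2, 1)],
    }
    for t in range(4):
        gnb.release(t)
        served = gnb.allocate(t, slot_arrivals.get(t, []))
        print(f"t={t}: admitted {served}, free (F, P_C, P_T)=({gnb.F}, {gnb.P_C}, {gnb.P_T})")
```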

3.1.2 Fog edge model based multi-agent self-organizing reinforcement learning

Figure 1 depicts the scenario considered in this work. It is based on an F-RAN architecture with three layers: a cloud computing layer, a network access layer, and a terminal layer. The BBU pool enables centralised signal processing at the cloud computing layer. At the network access layer there are L1 distributed single-antenna RRHs connected to the BBU pool, as well as M0 F-APs, each equipped with L0 antennas. Fog computing allows collaborative radio signal processing to be executed at the distributed F-APs in addition to the centralised BBU pool.

Fig. 2

The RAN slicing architecture’s single antenna system concept, which creates network slices for conventional UEs and F-UEs

As shown in Fig. 2, the terminal layer has K1 single-antenna F-UEs and K0 single-antenna conventional UEs, whose sets are denoted K1 and K0, respectively. Traditional UEs, such as industrial monitoring devices and sensors deployed in agricultural fields, aim for low power consumption and exhibit unpredictable bursty traffic arrivals. F-UEs can be laptops or cell phones, both of which always have a sizeable buffer. A network slice instance, consisting of several modes and the related physical resources, is built to offer each F-UE a high data rate. In C-RAN mode, the RRHs collaborate to receive uplink data, while the BBU pool offers centralised baseband processing and signal detection; in addition, F-APs are set up for local services to lessen the load on the fronthaul. Similarly, the network slice instance tailored for conventional UEs offers both C-RAN mode and F-AP mode, but the goal there is to keep the transmission latency of traditional UEs consistent and their power consumption low. F-UEs can also assist both network slice instances by using D2D mode: they aggregate data in the slice instance for conventional UEs to enable more traditional UEs to connect at once, and they relay the data traffic of other F-UEs to increase the coverage of the slice instance for F-UEs.

There are N subchannels available for allocation, each with a bandwidth of W0. Both orthogonal and multiplexed subchannel techniques are considered in this article. In the former, strict isolation between slice instances is achieved by allocating subchannel n to at most one conventional UE i or F-UE j. In contrast, with the latter, several conventional UEs and F-UEs can share subchannel n, and isolation between slice instances must then be ensured through careful mode selection and resource allocation. Although orthogonal subchannel allocation primarily ensures slice isolation in existing efforts, it is still important to investigate multiplexed subchannel allocation in order to increase spectrum utilisation. The resulting network state is modelled as a time-homogeneous Markov process, as expressed in Eq. (10)

$${\mathbb{P}}\left[ {S_{t + 1} = s^{\prime}{\mid }S_{t} = s} \right] = {\mathbb{P}}\left[ {S_{t} = s^{\prime}{\mid }S_{t - 1} = s} \right]$$
(10)

In RL, decisions must be made over time so as to maximise the expected return, i.e., to choose the best course of action. The return and the corresponding action-value function under a policy π are defined in Eq. (11)

$$\begin{aligned} & G_{t} = R_{t + 1} + \gamma R_{t + 2} + \cdots = \sum_{k = 0}^{\infty } \gamma^{k} R_{t + k + 1} \\ & q_{\pi } \left( {s,a} \right) = {\mathbb{E}}_{\pi } \left[ {G_{t} {\mid }S_{t} = s,A_{t} = a} \right] \\ \end{aligned}$$
(11)

Similarly, the action-value function is decomposed as \(q_{\pi } \left( {s,a} \right) = {\mathbb{E}}_{\pi } \left[ {R_{t + 1} + \gamma q_{\pi } \left( {S_{t + 1} ,A_{t + 1} } \right){\mid }S_{t} = s,A_{t} = a} \right]\). Additionally, the connection between vπ(s) and qπ(s, a) is given by Eqs. (12)–(14):

$$v_{\pi } \left( s \right) = \sum_{a \in {\mathcal{A}}} \pi \left( {a{\mid }s} \right)q_{\pi } \left( {s,a} \right)$$
(12)
$$q_{\pi } \left( {s,a} \right) = {\mathcal{R}}_{s}^{a} + \gamma \sum\limits_{{s^{\prime } \in S}} {\mathcal{P}}_{{ss^{\prime } }}^{a} v_{\pi } \left( {s^{\prime } } \right)$$
(13)
$$v_{{\uppi }} \left( s \right) = \sum_{a \in A} \pi \left( {a{\mid }s} \right)\left( {{\mathcal{R}}_{s}^{a} + \gamma \sum_{{s^{\prime} \in S}} {\mathcal{P}}_{{ss^{\prime}}}^{a} v_{{\uppi }} \left( {s^{\prime}} \right)} \right)$$
(14)

The Bellman equation relates the state-value function of one state to those of other states. The corresponding Bellman equation for qπ(s, a) is given in Eq. (15),

$$q_{{\uppi }} \left( {s,a} \right) = {\mathcal{R}}_{s}^{a} + \gamma \sum_{{s^{\prime} \in S}} {\mathcal{P}}_{{ss^{\prime}}}^{a} \sum_{{a^{\prime} \in {\mathcal{A}}}} \pi \left( {a^{\prime}{\mid }s^{\prime}} \right)q_{\pi } \left( {s^{\prime},a^{\prime}} \right).$$
(15)

Using the relations in Eq. (16), an optimal policy can readily be obtained by maximising q∗(s, a) over all actions.

$$\begin{aligned} v_{*} \left( s \right) & = {\text{max}}_{\pi } v_{\pi } \left( s \right) \\ q_{*} \left( {s,a} \right) & = \mathop {{\text{max}}}\limits_{\pi } q_{\pi } \left( {s,a} \right) \\ \end{aligned}$$
$$\pi \ge \pi^{\prime}\quad {\text{if}}\quad v_{\pi } \left( s \right) \ge v_{{\pi^{\prime}}} \left( s \right),\forall s$$
$$\pi_{*} \left( {a{\mid }s} \right) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\;a = \arg \mathop {\max }\limits_{{a \in {\mathcal{A}}}} q_{*} \left( {s,a} \right)} \hfill \\ 0 \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(16)

The remaining problem is to find the optimal value function, and this is resolved using the Bellman optimality equation. Equation (17) shows the relationship between the optimal state-value function and the optimal action-value function.

$$\begin{aligned} v_{*} \left( s \right) & = \mathop {{\text{max}}}\limits_{a} q_{*} \left( {s,a} \right) \\ q_{*} \left( {s,a} \right) & = {\mathcal{R}}_{s}^{a} + \gamma \mathop \sum \limits_{{s^{\prime} \in S}} {\mathcal{P}}_{{ss^{\prime}}} v_{*} \left( {s^{\prime}} \right) \\ \end{aligned}$$
(17)

To obtain a Bellman optimality equation for v∗ and q∗ individually, q∗(s, a) is expressed in terms of v∗(s) (and vice versa), yielding Eq. (18).

$$\begin{aligned} v_{*} \left( s \right) & = \mathop {{\text{max}}}\limits_{a} \left( {{\mathcal{R}}_{s}^{a} + \gamma \mathop \sum \limits_{{s^{\prime} \in S}} {\mathcal{P}}_{{ss^{\prime}}}^{a} v_{*} \left( {s^{\prime}} \right)} \right) \\ q_{*} \left( {s,a} \right) & = {\mathcal{R}}_{s}^{a} + \gamma \mathop \sum \limits_{{s^{\prime} \in S}} {\mathcal{P}}_{{ss^{\prime}}}^{a} \mathop {{\text{max}}}\limits_{{a^{\prime}}} q_{*} \left( {s^{\prime},a^{\prime}} \right). \\ \end{aligned}$$
(18)
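A compact sketch of Q-value iteration, i.e., repeated application of the Bellman optimality backup in Eqs. (17)–(18), on a toy three-state deterministic MDP. The MDP itself (transitions, rewards, discount factor) is an assumption chosen purely for illustration.

```python
# Toy Q-value iteration for the Bellman optimality recursion in Eqs. (17)-(18).
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9
# Deterministic transition function f(x, u) and reward rho(x, u) (assumed values)
NEXT = np.array([[1, 2],
                 [2, 0],
                 [0, 1]])
REWARD = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [0.5, 0.0]])

def q_iteration(tol=1e-8):
    """Repeatedly apply T(Q)(x,u) = rho(x,u) + gamma * max_u' Q(f(x,u), u')."""
    q = np.zeros((N_STATES, N_ACTIONS))
    while True:
        v = q.max(axis=1)                   # v_*(s) = max_a q_*(s, a), Eq. (17)
        q_new = REWARD + GAMMA * v[NEXT]    # Bellman optimality backup, Eq. (18)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

q_star = q_iteration()
policy = q_star.argmax(axis=1)              # greedy policy, cf. Eq. (16)
print("q*:\n", np.round(q_star, 3))
print("greedy policy:", policy)
```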

For the purpose of clarity, this work only considers the deterministic case; alternatively, the problem may be formulated stochastically, which would require working with expected returns over probabilistic transitions. Our method and conclusions carry over to stochastic MDPs provided the expectations can be computed accurately. The controller selects actions according to its policy h: X → U, i.e., uk = h(xk). From any initial state x0 at time k = 0, the controller's goal is to find a policy that maximises the discounted return in Eq. (19).

$$R = \sum_{k = 0}^{\infty } \gamma^{k} r_{k + 1} = \sum_{k = 0}^{\infty } \gamma^{k} \rho \left( {x_{k} ,u_{k} } \right)$$
(19)

The discounted return captures the reward accumulated by the controller over the long run; the aim of learning is to improve this long-term performance while using only feedback about the immediate, one-step performance. The policy Q-function, the optimal Q-function, and the Q-iteration mapping T are given in Eq. (20)

$$\begin{aligned} Q^{h} \left( {x,u} \right) & = \rho \left( {x,u} \right) + \sum_{k = 1}^{\infty } \gamma^{k} \rho \left( {x_{k} ,h\left( {x_{k} } \right)} \right) \\ Q^{*} \left( {x,u} \right) & = \rho \left( {x,u} \right) + \gamma {\text{max}}_{{u^{\prime} \in U}} Q^{*} \left( {f\left( {x,u} \right),u^{\prime}} \right) \\ \left[ {T\left( Q \right)} \right]\left( {x,u} \right) & = \rho \left( {x,u} \right) + \gamma {\text{max}}_{{u^{\prime} \in {\mathcal{U}}}} Q\left( {f\left( {x,u} \right),u^{\prime}} \right) \\ \end{aligned}$$
(20)

The Lipschitz continuity of Q∗ is established via the Q-value iteration algorithm (Q-iteration), which makes use of an a priori task model in the form of the transition function f and the reward function ρ. There exists a finite LQ with the property given in Eq. (21)

$$\begin{aligned} & \left| {Q^{*} \left( {x,u} \right) - Q^{*} \left( {\underline {x} ,\underline {u} } \right)} \right| \le L_{Q} \left( {\parallel x - \underline {x} \parallel + \parallel u - \underline {u} \parallel } \right) \\ & \left| {\left[ {T\left( {Q_{\ell } } \right)} \right]\left( {x,u} \right) - \left[ {T\left( {Q_{\ell } } \right)} \right]\left( {\underline {x} ,\underline {u} } \right)} \right| \\ & = \left| {\rho \left( {x,u} \right) + \gamma \mathop {{\text{max}}}\limits_{{u^{\prime } }} Q_{\ell } \left( {f\left( {x,u} \right),u^{\prime } } \right) - \rho \left( {\underline {x} ,\underline {u} } \right) - \gamma \mathop {{\text{max}}}\limits_{{u^{\prime } }} Q_{\ell } \left( {f\left( {\underline {x} ,\underline {u} } \right),u^{\prime } } \right)} \right| \\ & \le \left| {\rho \left( {x,u} \right) - \rho \left( {\underline {x} ,\underline {u} } \right)} \right| + \gamma \left| {\mathop {{\text{max}}}\limits_{{u^{\prime } }} \left[ {Q_{\ell } \left( {f\left( {x,u} \right),u^{\prime } } \right) - Q_{\ell } \left( {f\left( {\underline {x} ,\underline {u} } \right),u^{\prime } } \right)} \right]} \right| \\ \end{aligned}$$
(21)

The first term is bounded by the Lipschitz continuity of ρ, i.e., \(\left| {\rho \left( {x,u} \right) - \rho \left( {\underline {x} ,\underline {u} } \right)} \right| \le L_{\rho } \left( {\parallel x - \underline {x} \parallel + \parallel u - \underline {u} \parallel } \right)\); the second term is bounded by Eq. (22)

$$\begin{aligned} & \gamma \left| {\mathop {{\text{max}}}\limits_{{u^{\prime}}} \left[ {Q_{\ell } \left( {f\left( {x,u} \right),u^{\prime}} \right) - Q_{\ell } \left( {f\left( {\underline {x} ,\underline {u} } \right),u^{\prime}} \right)} \right]} \right| \\ & \le \gamma \mathop {{\text{max}}}\limits_{{u^{\prime}}} L_{{Q_{\ell } }} \parallel f\left( {x,u} \right) - f\left( {\underline {x} ,\underline {u} } \right)\parallel \\ & = \gamma L_{{Q_{\ell } }} \parallel f\left( {x,u} \right) - f\left( {\underline {x} ,\underline {u} } \right)\parallel \\ & \le \gamma L_{{Q_{\ell } }} L_{f} \left( {\parallel x - \underline {x} \parallel + \parallel u - \underline {u} \parallel } \right) \\ \end{aligned}$$
(22)

which follows from the Lipschitz continuity of \(Q_{\ell }\) and f. Therefore, \(L_{{Q_{\ell + 1} }} = L_{\rho } + \gamma L_{{Q_{\ell } }} L_{f} = L_{\rho } + \gamma L_{f} L_{\rho } \sum_{k = 0}^{\ell } \gamma^{k} L_{f}^{k} = L_{\rho } \sum_{k = 0}^{\ell + 1} \gamma^{k} L_{f}^{k}\) and the induction is complete. Taking the limit as \(\ell\) → ∞, it follows that \(L_{Q} = L_{\rho } \sum_{k = 0}^{\infty } \gamma^{k} L_{f}^{k}\).
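Assuming \(\gamma L_{f} < 1\) (an assumption added here for illustration; it is not stated explicitly above), the geometric series converges and the bound admits the closed form

$$L_{Q} = L_{\rho } \sum_{k = 0}^{\infty } \left( {\gamma L_{f} } \right)^{k} = \frac{{L_{\rho } }}{{1 - \gamma L_{f} }}.$$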

Algorithm for MASORL

1: Initialise the replay memory \({\mathcal{O}}^{k}\) with size \(U\), the mini-batch size \(S\), and the \(Q\)-function with two sets of random weights \({\theta }^{k}\) and \(\widehat{\theta }^{k}\), for \(k=1\)

2: repeat

3: Select an action \({\mathbf{y}}^{k}\) for the current network state \({\mathbf{x}}^{k}\in \mathcal{X}\) according to the exploration policy derived from the \(Q\)-function with weights \({\theta }^{k}\)

4: After deploying \({\mathbf{y}}^{k}\), observe the cost \(p\left({\mathbf{x}}^{k},{\mathbf{y}}^{k}\right)\) and the new network state \({{\text{x}}}^{k+1}\in \mathcal{X}\)

5: Store \({\mathbf{m}}^{k}=\left({\mathbf{x}}^{k},{\mathbf{y}}^{k},p\left({\mathbf{x}}^{k},{\mathbf{y}}^{k}\right),{\mathbf{x}}^{k+1}\right)\) in \({\mathcal{O}}^{k}\)

6: Sample a mini-batch of \(S\) transitions from \({\mathcal{O}}^{k}\)

7: Update \({\theta }^{k+1}\) with the gradient given by (19)

8: Regularly set \(\widehat{\theta }^{{k + 1}} = \varvec{\theta }^{k}\)

9: Update the epoch index by \(k\leftarrow k+1\)

10: until a predefined stopping condition is satisfied
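The following Python sketch mirrors the structure of the algorithm above: action selection, deployment, storage of the transition in the replay memory, mini-batch updates, and periodic target-weight synchronisation. To keep the example self-contained, a linear Q-function stands in for the deep network, the toy environment and all hyper-parameters are assumptions, and simple cost-minimising TD targets replace the gradient of (19), which is not reproduced here.

```python
# Sketch of the replay-memory training loop in the MASORL algorithm above
# (linear Q-function as a stand-in for the DQN; toy environment assumed).
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3
GAMMA, LR, MEMORY_U, BATCH_S, TARGET_SYNC = 0.9, 0.05, 500, 32, 50

theta = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))   # online weights  theta^k
theta_hat = theta.copy()                                      # target weights  hat{theta}^k
memory = []                                                   # replay memory O^k

def q_values(weights, x):
    return weights @ x

def step(x, a):
    """Toy network-state transition: returns (cost, next state)."""
    x_next = np.clip(x + rng.normal(scale=0.1, size=STATE_DIM), 0.0, 1.0)
    cost = float(x @ x) + 0.1 * a        # smaller state norm / action index => lower cost
    return cost, x_next

x = rng.random(STATE_DIM)
for k in range(2000):
    # Step 3: select action y^k (epsilon-greedy over the online Q-function)
    eps = max(0.05, 1.0 - k / 1000)
    a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmin(q_values(theta, x)))
    # Step 4: deploy y^k, observe cost and x^{k+1}
    cost, x_next = step(x, a)
    # Step 5: store m^k in the replay memory O^k (bounded by U)
    memory.append((x, a, cost, x_next))
    if len(memory) > MEMORY_U:
        memory.pop(0)
    # Steps 6-7: sample a mini-batch and do one semi-gradient update of theta
    if len(memory) >= BATCH_S:
        batch = [memory[i] for i in rng.choice(len(memory), BATCH_S, replace=False)]
        for xb, ab, cb, xnb in batch:
            target = cb + GAMMA * float(np.min(q_values(theta_hat, xnb)))
            td_err = q_values(theta, xb)[ab] - target
            theta[ab] -= LR * td_err * xb
    # Step 8: periodic target update hat{theta} <- theta
    if k % TARGET_SYNC == 0:
        theta_hat = theta.copy()
    x = x_next

print("learned Q(x, .) at the final state:", np.round(q_values(theta, x), 3))
```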

 

4 Experimental analysis

This section summarises the performance of the suggested system. The suggested system is implemented in Java; the physical setup consists of an Intel i5/i7 processor with a 3.20 GHz clock speed and 4 GB of RAM. A mathematical model is suggested in the design concept to increase the security of cloud storage; in this strategy the security model works in tandem with the end user and the data owner, so that even if the cloud storage is problematic, the owner's data is safeguarded during data uploading and during transmission to the intended user. We assume that no inter-cell interference is created. The default file size is 1 Mbit, and for simplicity the transmission rate of the backhaul link is set to R = 100 Mbps. The reward decay is set to 0.9 and the learning rate to 0.001. Unless otherwise stated, we set U = 50, F = 500, and N = 5. The benchmark schemes in the simulations are the standard (non-learning) method and existing learning schemes.

The operational states of a particular processor and UE are altered, if necessary, based on a greedy action selection. The controller then adjusts the precoding and cache state in line with the transition matrix for each D2D transmitter transition. Whenever the HPN receives QoS violation reports from UEs, it assists the UEs with unsatisfactory QoS in D2D mode to access the C-RAN, and the controller turns on all of the processors. The state change, the action, and the resulting decrease in system power consumption, which serves as the reward, are then recorded in the controller's replay memory. To reduce the mean squared error (MSE) between the target Q-values and the predicted Q-values of the DQN, the controller updates the DQN after a number of interactions by training over a batch of interaction data randomly sampled from the replay memory. Additionally, the controller copies the DQN weights to the target DQN at a longer interval. The adopted DQN is a dense neural network built from an input layer, two hidden layers, and an output layer; the input layer has 14 neurons, the output layer has 96 neurons, every hidden layer has 24 neurons, and ReLU is used as the activation function. All other simulation-related parameters are listed in Table 1.

Table 1 Simulation parameters

It is evident that a lower value of the temperature τ results in higher performance. This is because a larger τ yields nearly identical selection probabilities for the different actions, even as the gap between their Q-values widens over the course of learning. Additionally, it is demonstrated that a logarithmically decreasing temperature τ = τ0/log(1 + tepi) improves performance compared with fixed values τ = 0.1 and τ = 0.5. This is because the logarithmic decrease reduces τ as the episode index tepi rises, which gradually selects the best actions with a larger likelihood. As can be seen, the value of the power-minus-rate objective is severely constrained when the overall amount of computing resources is restricted. The power-minus-rate drops dramatically as the computing resources grow, which may be the result of the following factors. First, more conventional UEs/F-UEs can be served locally as the amount of computing resources grows, and the increased flexibility of mode selection reduces the power-minus-rate. Second, with higher computational power, UEs/F-UEs that would typically choose D2D mode can switch to F-AP mode; since the F-UE is then not relaying data, it uses less energy, which lowers the power-minus-rate even further.
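The temperature behaviour discussed above can be reproduced with a few lines of Python: Boltzmann (softmax) action selection with a fixed temperature versus the logarithmically decreasing schedule τ = τ0/log(1 + tepi). The Q-values and τ0 below are assumed solely for illustration.

```python
# Softmax action-selection probabilities under fixed vs. decaying temperature.
import math

def softmax_probs(q_values, tau):
    """Boltzmann selection probabilities exp(Q/tau) / sum exp(Q/tau)."""
    scaled = [q / tau for q in q_values]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

q = [1.0, 1.2, 0.8]                          # assumed Q-values of three candidate actions
tau0 = 0.5
for t_epi in (1, 10, 100, 1000):
    tau_dec = tau0 / math.log(1 + t_epi)     # logarithmically decreasing temperature
    print(f"episode {t_epi:4d}: fixed tau=0.5 -> {softmax_probs(q, 0.5)}, "
          f"decaying tau={tau_dec:.3f} -> {softmax_probs(q, tau_dec)}")
```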

4.1 Comparative analysis

Table 2 displays the results of a topological study of the network. Here we vary two network parameters, the number of UEs and the number of devices, and conduct a parametric study of the throughput, scalability, network efficiency, quality of service, and energy consumption.

Table 2 Analysis based on network parameters

The analysis with respect to the number of UEs is shown in Fig. 3. For comparison, the existing Q-Learning scheme achieved a throughput of 91%, scalability of 45%, network efficiency of 81%, quality of service (QoS) of 41%, and energy consumption of 36%, while MDP achieved a throughput of 93%, scalability of 48%, network efficiency of 83%, QoS of 43%, and energy consumption of 39%.

Fig. 3

a–e Analysis for the number of UEs

Figure 4 shows the analysis carried out based on the number of devices. The proposed technique attained a throughput of 96%, scalability of 59%, network efficiency of 91%, QoS of 51%, and energy consumption of 49%, while the existing Q-Learning attained a throughput of 92%, scalability of 53%, network efficiency of 86%, QoS of 46%, and energy consumption of 43%, and MDP attained a throughput of 94%, scalability of 55%, network efficiency of 89%, QoS of 49%, and energy consumption of 45%.

Fig. 4

a–e Analysis for the number of devices

Since the number of slices controlled by the agent was fixed in previous techniques, the model had to be retrained whenever the number of slices changed between training and evaluation. Under the suggested technique, by contrast, one agent allocates RBs to a single slice, and the agent is invoked several times when there are numerous slices. RB allocation that is independent of the number of slices is thereby realised by this design. Additionally, the agent learns to maximise the number of slices whose requirements are fulfilled while improving the efficiency of RB utilisation; this is done by satisfying the slice requirement with the smallest necessary RB allocation.

Two different slice designs are considered here: one creates slices by loosely categorising services, while the other creates a slice for each type of service. In this study, we employ the design that establishes a slice for every service category. When slices are defined by broadly categorising services, multiple criteria may be set for the same item within one slice; the suggested solution can still be used in that situation by designating the tightest criteria among the services in a slice as the slice requirements. Additionally, because each service has its own slice, there are fewer users per slice than there would be if the slices were defined by broadly categorising the services. A slice is created when there are one or more UEs in it, and it is terminated when there are zero UEs. This means that slices are created and terminated more often than when the slices are defined by broadly categorising services.
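A minimal sketch of the per-slice invocation described above: the same agent is called once for every active slice, so the design does not depend on the number of slices, and empty slices are simply skipped (terminated). The allocation rule inside the agent is a simple proportional placeholder, not the trained RL policy, and the slice data are assumed for illustration.

```python
# Slice-count-independent RB allocation: one agent, called once per active slice.
def rb_agent(slice_state, available_rbs):
    """Return the number of RBs requested for one slice (placeholder policy)."""
    demand = slice_state["num_ues"] * slice_state["load_per_ue"]
    return min(available_rbs, max(1, round(demand)))

def allocate_all_slices(slices, total_rbs):
    """Call the same agent once per active slice; skip terminated (empty) slices."""
    remaining = total_rbs
    allocation = {}
    for name, state in slices.items():
        if state["num_ues"] == 0:            # slice terminated when no UEs remain
            continue
        rbs = rb_agent(state, remaining)
        allocation[name] = rbs
        remaining -= rbs
    return allocation, remaining

slices = {
    "uRLLC": {"num_ues": 3, "load_per_ue": 2.0},
    "eMBB":  {"num_ues": 5, "load_per_ue": 4.5},
    "mMTC":  {"num_ues": 0, "load_per_ue": 0.5},   # empty -> terminated
}
alloc, left = allocate_all_slices(slices, total_rbs=50)
print("RB allocation:", alloc, "| unused RBs:", left)
```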

5 Conclusion

This study suggests a unique technique for radio access network data transmission that uses photonic integrated circuits (PICs) with dynamic optimisation and multi-agent self-organizing reinforcement learning (MASORL) based on the fog edge model. To discover the best course of action without prior knowledge of the network dynamics, we first propose a double-DQN-based computation offloading technique. A Q-function decomposition approach, motivated by the additive nature of the utility function, is then integrated with the double DQN, which results in a novel learning method for solving the stochastic computation offloading problem. Numerical studies demonstrate that, compared with the baseline policies, our suggested learning techniques significantly enhance computation offloading performance. The relationship between the hidden layer size and the processing speed still has to be investigated, and examining the factors that influence the training outcome, such as the learning rate and batch size, is also crucial. In this study, we demonstrated that the simulator is capable of performing the ideal RB allocation. The slice state could, however, change in ways that are unique to a real environment, which the simulator cannot accurately replicate; yet, because training a model exclusively in a real environment takes a lot of time and money, doing so is not practicable.