1 Introduction

With the rise of cutting-edge technologies such as artificial intelligence [1], the Internet of things (IoT), and sensors, smart IoT devices are becoming increasingly common in the medical field. These devices are poised to make healthcare services more intelligent and precise, expanding the possibilities for improving human health. Wearable electrocardiogram, blood pressure, and blood glucose detectors, for instance, connect to mobile devices (MDs) and can accurately monitor physiological parameters such as electrocardiogram signals, blood pressure changes, and blood glucose levels in real time [2, 3]. By thoroughly analyzing these data, doctors can better understand their patients' health status, promptly identify potential health risks, and offer tailored medical care and treatment plans. However, MDs are limited in battery life and processing power. Power-hungry applications such as sleep analysis, motion tracking, epilepsy detection, and blood pressure monitoring can drain an MD quickly, and some of them, such as epilepsy detection warnings and blood pressure tracking, are also time-sensitive; with insufficient local computing resources, executing them on MDs takes too long. Mobile edge computing (MEC) can move computationally intensive applications from MDs to edge servers (ESs), effectively relieving the computational pressure and energy consumption of MDs [4, 5].

MEC is a cutting-edge computing architecture that shifts data processing and storage away from conventional centralized data centers to edge devices or nodes closer to the data source. Offloading computing tasks from MDs to nearby base stations (BSs) equipped with ESs reduces data transmission latency and network congestion, meeting user demand for low-latency, high-bandwidth services [6]. However, a single MEC server typically has limited computing power and can fulfill the computing requirements of only a limited number of MDs. An ultra-dense network (UDN), on the other hand, provides an efficient and flexible large-scale wireless connectivity solution by deploying numerous small BSs to serve the many MDs accessing the network, which is particularly vital in densely populated settings such as schools, hospitals, and residential areas. To address these challenges, this paper introduces a fusion of UDN and MEC, referred to as ultra-dense edge computing (UDEC), within the 5G architecture. In this framework, mobile network operators deploy numerous micro-BSs, each equipped with an ES, that can reuse the spectrum resources of the macro-BS [7, 8]. In this way, the surge in MDs' demand for computational and spectral resources can be met, and the latency and energy consumption of task offloading can be reduced [9].

In a UDN collaborative network, MDs in a healthcare system can choose their offloading methods more flexibly, and different types of computation tasks can be processed on local devices, micro-BSs, or a macro-BS [10, 11]. The literature on computational offloading [12,13,14] reduces the total system cost by jointly optimizing offloading and computational resource allocation. Other work optimizes the performance of MEC networks through advanced network access techniques. Liu and Yang [15] used an orthogonal frequency division multiple access (OFDMA) architecture to access the unmanned aerial vehicle (UAV) cloud network, ensuring latency awareness in UAV-assisted MEC systems. Xing et al. [16] used time division multiple access (TDMA) transport protocols during user task offloading, minimizing computational latency by jointly optimizing task-assignment time, transmission rate, result downloading, and computational offloading. However, this literature pays little attention to the application of non-orthogonal multiple access (NOMA) technology in UDEC. NOMA enables multiuser access in wireless communication systems and is a promising multi-access technology for next-generation wireless networks [17, 18]. NOMA allows multiple users to transmit over the same time and frequency resources, enhancing spectral efficiency and system capacity, and thus effectively boosts wireless network capacity and coverage. Moreover, NOMA proves advantageous in managing high-density user scenarios and enhancing system energy efficiency [19]. As a promising radio access technology, NOMA protocols have been employed in [20] to enhance the efficiency of MEC offloading. In our pursuit of intelligently managing computational tasks in a healthcare system, we merge UDN and NOMA to provide users with a more efficient and comprehensive service. The key contributions are summarized below:

  1.

    In the system model of the NOMA-based UDN, each micro-BS deploys an MEC server and employs NOMA technology to cater to its associated users. Subsequently, the paper introduces an optimization problem that addresses joint decisions on offloading, computational resources, and sub-channel allocation. This optimization problem aims to minimize the weighted sum of energy consumption and computational task latency for all users, all while ensuring that the maximum tolerated user latency is met.

  2.

    The objective problem is a mixed-integer nonlinear program, which is non-deterministic polynomial (NP)-hard and cannot be solved optimally in polynomial time. To tackle the original mixed problem, the neural network within deep deterministic policy gradient (DDPG) is employed to approximate the policy function. This approach utilizes experience replay to train the neural networks, resulting in improved training outcomes.

  3.

    The scheme we put forth is evaluated through simulation experiments, comparing it with various offloading methods and orthogonal multiple access (OMA) techniques. The simulation results indicate that our proposed scheme converges more quickly than other offloading methods. Furthermore, the NOMA technique adopted in this paper yields a lower overall system cost than OMA.

The rest of this article is arranged as follows. Section 2 introduces the research status of UDEC and NOMA technology and the significance of their combination. Section 3 describes the intelligent medical system model and the objective optimization. In Sect. 4, a Markov decision process (MDP) is constructed and solved by deep reinforcement learning (DRL). In Sect. 5, the advantages and disadvantages of each offloading strategy are analyzed by comparing system cost and algorithm convergence performance. Finally, Sect. 6 concludes the paper. Table 1 lists the acronyms used in this research.

2 Related work

Numerous scholars have conducted extensive research on the computational offloading challenge within MEC. Their primary focus is on reducing energy consumption and delays and enhancing quality of service (QoS) for users by jointly optimizing offloading, transmission power, and resource allocation. In [21], research explores how, within MEC, mobile network operators address the challenge of accommodating a large number of users accessing the network while dealing with increasing response times. They achieve this by deploying numerous edge-side micro-BSs and a cloud-side macro-BS equipped with ample computational resources to enhance network coverage. This strategic deployment reduces transmission delays by diverting users' computational tasks to the densely positioned edge-side micro-BSs or the cloud-side BS. Gao et al. [9] introduced the deployment of multiple edge-side macro-BSs in UDN. These macro-BSs are supported by MEC servers, facilitating the transfer of user computational tasks to ESs located in closer proximity. This approach effectively reduces transmission delays and lowers energy consumption, thereby minimizing system overhead. Lu et al. [22] demonstrated that in the context of UDEC, system delay and energy consumption can be significantly reduced through coordinated resource scheduling, task offloading, and BS selection. Lin et al. [23] and Ahmed and Elmokashfi [24] delved into the development of a model-free offloading mechanism within the framework of MEC-supported UDN. This mechanism efficiently minimizes the computational latency of user tasks while meeting energy consumption requirements.

Table 1 List of acronyms

Various communication methods, such as TDMA and frequency division multiple access (FDMA), have seen extensive use in resource allocation within MEC. These methods effectively address the issue of transmission delays during the offloading of MEC computing tasks. Utilizing NOMA technology in UDN proves highly effective in reducing both system energy consumption and the latency of computational tasks [25]. Sun et al. [26] and Gupta et al. [27] applied NOMA technology to the uplink and downlink of MEC networks and confirmed that introducing NOMA can effectively reduce the latency and energy consumption of user offloading. In [28], the research delves into the MEC-NOMA system within a UDN. In this setup, all users served by each micro-BS are organized into separate clusters, and these users communicate with the micro-BSs through NOMA technology. The study introduces a comprehensive iterative optimization algorithm aimed at minimizing the overall system cost by jointly considering user clustering, power allocation, and computational resource allocation.

In MEC environments, complex joint optimization challenges arise, encompassing offloading decisions, resource allocation, and energy management. These complexities emerge from the collaborative efforts involving multiple MDs, ESs, and central cloud resources. To tackle this issue, some researchers have turned to DRL algorithms for solving intricate joint optimization problems within MEC. In [29], the authors introduced a DRL algorithm based on Asynchronous Advantage Actor–Critic, tailored for addressing the two-layer architecture outlined in this paper. This algorithm demonstrates strong performance in managing expansive decision spaces, demands fewer computational resources, and achieves quicker convergence compared to strategies based on deep Q-learning network (DQN). Wang et al. [30] presented an algorithm grounded in DDPG to handle high-dimensional state spaces and the non-convex nature of sequential actions, which characterize the problem in UAV-assisted MEC. The algorithm effectively reduces task processing latency by determining the optimal task offloading decisions. In [31], researchers focus on the challenge of joint computational offloading and resource scheduling, considering that user computational tasks are randomly generated and the environment is dynamically changing. Given the multiple optimization goals involved, the authors propose a double DQN approach based on reinforcement learning (RL) to tackle this issue. Jiang et al. [32] delved into the matter of ES placement within a dynamic MEC setup, where it is crucial to meet user computational task requirements. Since the network exhibits time-varying characteristics, ES placement requires dynamic adjustments. The researchers propose a proximal policy optimization algorithm to enable a limited number of ESs to efficiently serve all BSs, with results indicating a noticeable reduction in system costs. While these studies showcase the applicability of DRL for offloading optimization and resource scheduling, they primarily introduce improved algorithms for specific environments and lack generalizability. Therefore, this article adopts a DDPG algorithm to more accurately model and optimize the system's processing latency and energy consumption, aiming for more efficient task processing and resource utilization.

These works ignore the scenario of collaborative computing at the micro-BS, where users' computing tasks are transferred to other micro-BSs for collaborative execution. Collaborative execution among BSs enhances resource utilization and evens out the load on BSs across regions, keeping them in a relatively stable state and enabling each user's computing task to be processed within a reasonable time frame. While some research has explored task offloading in UDN and MEC settings, most of it has overlooked the influence of the communication method between mobile users and BSs. With the widespread deployment of UDN and the evolution of IoT devices, conventional communication methods are insufficient for managing the extensive device connectivity and data transmission within hospitals. Consequently, we explore a NOMA-based UDEC network and construct a novel offloading model to strike a harmonious equilibrium between energy consumption and latency.

3 System model

A UDEC system model based on NOMA is illustrated in Fig. 1. The model includes multiple micro-BSs, each outfitted with an ES that provides computing services to its associated users, and one macro-BS (assumed to be a remote cloud). Each user within the coverage area of the BSs is associated with exactly one BS, and the ESs can be either physical servers with computational power or virtual machines. As in [33], we assume that the macro-BS acts as the central controller responsible for collecting information about mobile users' tasks, the computing resources of the BSs, and the network status. Assume that the system has S BSs and K sub-channels, denoted by the sets \(S=\{1,2, \ldots , S\}\) and \(K=\{1,2, \ldots , K\}\), respectively, and that the set of users covered by each BS is denoted by \(N_{s}=\{1,2, \ldots , N\}\). We assume that user n in BS s generates one computation request \(\psi _{n, s}\) at a time, denoted by the tuple \(<D_{n, s}, T_{n, s}, C_{n, s}>\), where \(D_{n, s}\) denotes the size of the task data requested for computation, \(T_{n, s}\) denotes the maximum tolerable delay of the task, and \(C_{n, s}\) denotes the computational resources required to complete the request. We further assume that each user's task cannot be divided into sub-tasks and that the user offloads the task according to an offloading decision. \(x_{n, s}^{k}\) is the offloading decision variable, indicating whether user n offloads a task to its covering BS s:

$$\begin{aligned} x_{n, s}^{k}=\left\{ \begin{aligned} 1,\ {}&\text{ if }\ \psi _{n, s} \text{ is } \text{ offloaded } \text{ from }\ \text{ user } \text{ n } \text{ to }\ \textrm{MEC},\\ 0,\ {}&\text {otherwise. } \end{aligned} \right. \end{aligned}$$
(1)
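For concreteness, the request tuple can be mirrored in code. Below is a minimal Python sketch (Python is also the simulation language used in Sect. 5); the class and field names are our own illustrative choices, not notation from the paper.

```python
# Illustrative container for a computation request psi_{n,s} = <D, T, C>.
from dataclasses import dataclass

@dataclass
class Task:
    data_bits: float    # D_{n,s}: size of the task input data (bits)
    max_delay: float    # T_{n,s}: maximum tolerable delay (s)
    cpu_cycles: float   # C_{n,s}: CPU cycles required to complete the task
```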
Fig. 1 IoT-enabled healthcare system model diagram

3.1 Communication model

MDs incur a certain communication cost (i.e., bandwidth) when offloading their computational tasks to an edge server; the communication scenarios for user task offloading are shown in Fig. 2. In scenario (a), user n1 transmits its computation task to its associated MEC server s1 using NOMA technology. In scenario (b), when MEC server s1 cannot provide the required service to its associated user n2, the computation task is transmitted to the neighboring MEC server s2 via the X2 link. In scenario (c), if the ESs cannot serve user n3, the user's computation task is transmitted to the edge cloud for processing via the fiber link. The computational task transfer delay in each scenario is calculated as follows.

Fig. 2 Offloading decisions for mobile devices

Scenario (a): assume that NOMA technology is used in BS \(s \in S\) so that different users \(n \in N_{s}\) can occupy the same channel; mobile users in the same area transmit data to the BS simultaneously at the cost of mutual interference, and different BSs can multiplex the same channel. The sub-channel occupancy of a BS is defined by \(c_{s}^{k}\): when \(c_{s}^{k}=1\), the BS occupies sub-channel k; otherwise \(c_{s}^{k}=0\). Since interference between users may degrade the uplink transmission rate, the maximum number of BSs connected to a sub-channel is limited to \(M_{\max }\). On sub-channel k, for the set of users \(N_{s}\) served by BS s, the channel gains are assumed to follow the order \(g_{1,s}^k \ge g_{2,s}^k \ge \ldots \ge g_{n,s}^k\). Thus, the signal-to-interference-plus-noise ratio (SINR) of the n-th user in BS s on sub-channel k is expressed as follows:

$$\begin{aligned} \gamma _{n, s}^{k}=\frac{p_{n, s}^{k} g_{n, s}^{k}}{I_{n, s}^{k}+I_{s, s'}^{k}+\sigma ^{2}}, \end{aligned}$$
(2)

where \(p_{n,s}^k\) denotes the transmit power of the n-th user in BS s on sub-channel k, \(I_{n,s}^{k} = \sum _{i = n + 1}^{N} x_{i,s}^{k} p_{i,s}^{k} g_{i,s}^{k}\) denotes the intra-BS interference from users decoded after user n under successive interference cancellation, \(I_{s, s'}^{k}=\sum _{r\in S}\left( c_{r}^{k}\sum _{n \in U_{r}} x_{n}^{r} p_{n, r}^{k} g_{n, s}^{k}\right)\) denotes the interference between BSs, \(\sigma ^2\) denotes the power of the Gaussian white noise, and B denotes the uplink system bandwidth. According to Shannon's formula, the data transmission rate of user n on sub-channel k is:

$$\begin{aligned} R_{n,s}^k = B{\log _2}\left( {1 + \gamma _{n,s}^k} \right) , \end{aligned}$$
(3)

Therefore, the transmission time for offloading user n's task to BS s on sub-channel k is:

$$\begin{aligned} t_{n,s}^{{\textrm{tran}}} = \frac{{x_{n,s}^k{D_{n,s}}}}{{R_{n,s}^k}},\quad \forall n \in {N_s}. \end{aligned}$$
(4)
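To make Eqs. (2)-(4) concrete, the following Python sketch computes per-user SINR, rate, and transmission delay for one NOMA cluster on a single sub-channel; all numerical values in the usage example are illustrative placeholders rather than parameters from the paper.

```python
import numpy as np

def noma_uplink_delays(D, p, g, I_inter_bs, sigma2, B):
    """Transmission delays for N users offloading on one sub-channel.

    Users are indexed by decreasing channel gain (g[0] >= g[1] >= ...),
    so user n suffers intra-BS interference only from users n+1..N-1.
    """
    N = len(g)
    delays = np.empty(N)
    for n in range(N):
        intra = np.sum(p[n + 1:] * g[n + 1:])                # I_{n,s}^k
        sinr = p[n] * g[n] / (intra + I_inter_bs + sigma2)   # Eq. (2)
        rate = B * np.log2(1.0 + sinr)                       # Eq. (3)
        delays[n] = D[n] / rate                              # Eq. (4)
    return delays

# Toy usage: a two-user NOMA cluster (placeholder values).
d = noma_uplink_delays(D=np.array([8e5, 6e5]),   # task sizes (bits)
                       p=np.array([0.2, 0.2]),   # transmit powers (W)
                       g=np.array([1e-6, 4e-7]),
                       I_inter_bs=1e-9, sigma2=1e-10, B=1e6)
```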

Scenario (b): when BS s receives a user request from within its area but its computational resources are insufficient to process it, it can forward the request via the X2 link to another BS \(s^{'}\) for processing. We use the binary variable \(y_{n,s}^{s \rightarrow {s^{'}}}\) to indicate whether the user's computational task \({\psi _{n,s}}\) is forwarded from its associated BS s to a neighboring BS:

$$\begin{aligned} y_{n, s}^{s \rightarrow s^{\prime }}=\left\{ \begin{aligned} 1,\ {}&\text{ if }\ \psi _{n,s} \text{ of } \text{ user } \textrm{n} \text{ is } \text{ offloaded } \\&\text {from BS s to a}\text { neighbor BS }\mathrm {s^{'}}, \\ 0,\ {}&\text{ otherwise. } \end{aligned} \right. \end{aligned}$$
(5)

Therefore, the offload delay from BS s to BS \(s^{'}\), where \(\Gamma _{s}^{s^{\prime }}\) denotes the transmission rate of the X2 link, is:

$$\begin{aligned} t_{s, s^{\prime }}^{{\textrm{tran}}}=\frac{\sum _{n \in N_{s}} y_{n, s}^{s \rightarrow s^{\prime }} D_{n, s}}{\Gamma _{s}^{s^{\prime }}}, \quad \forall s, s^{\prime } \in S. \end{aligned}$$
(6)

Scenario (c): when the other BSs around BS s also have no remaining computing resources, the user's task will be offloaded to the macro-BS via a wired backhaul link. \(y_{n,s}^{s \rightarrow c}\) is a decision variable indicating that the user's computational task is offloaded by BS s to the macro-BS:

$$\begin{aligned} y_{n,s}^{s \rightarrow c} = \left\{ \begin{aligned} 1,\ {}&\text {if}\ \psi _{n,s}\ \text{ is } \text{ offloaded } \text{ from } \text{ BS } \textrm{s} \\&\text {to macro-BS, }\\ 0,\ {}&\text {otherwise. } \end{aligned} \right. \end{aligned}$$
(7)

Therefore, the offload delay from BS s to the macro-BS, where \(\Omega _{s}^{c}\) denotes the transmission rate of the wired backhaul link, is:

$$\begin{aligned} t_{s, c}^{{\textrm{tran}}}=\frac{\sum _{n \in N_{s}} y_{n, s}^{s \rightarrow c} D_{n, s}}{\Omega _{s}^{c}}, \quad \forall s \in S. \end{aligned}$$
(8)

3.2 Computation offloading model

3.2.1 Local execution

When the user's task \(\psi _{n,s}\) is executed locally, the computing delay \(T_{n,s}^{{\textrm{loc}}}\) and local energy consumption \(E_{n,s}^{{\textrm{loc}}}\) of the MD can be expressed as follows:

$$\begin{aligned} T_{n,s}^{{\textrm{loc}}}= & {} \frac{{{C_{n,s}}}}{{f_{n,s}^l}}, \end{aligned}$$
(9)
$$\begin{aligned} E_{n,s}^{{\textrm{loc}}}= & {} \kappa \cdot {\left( {f_{n,s}^l} \right) ^2} \cdot T_{n,s}^{{\textrm{loc}}}. \end{aligned}$$
(10)

where \(f_{n,s}^l\) represents the CPU frequency of user n in BS s, and \(\kappa\) is an energy coefficient determined by the chip architecture of the MD [13, 34].
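As a small illustration of Eqs. (9)-(10), the local delay and energy can be computed as follows; the value of \(\kappa\) and the CPU frequency in the usage line are placeholder values of the kind commonly used in the literature, not parameters from this paper.

```python
def local_cost(cpu_cycles, f_local, kappa=1e-27):
    """Return (delay, energy) for executing cpu_cycles locally at f_local Hz."""
    t_loc = cpu_cycles / f_local            # Eq. (9)
    e_loc = kappa * f_local ** 2 * t_loc    # Eq. (10)
    return t_loc, e_loc

# Toy usage: a 1-Gcycle task on a 1-GHz mobile device.
t_loc, e_loc = local_cost(cpu_cycles=1e9, f_local=1e9)
```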

3.2.2 Edge execution

In our model, if BS s associated with the user has enough resources, it executes the user's task \(\psi _{n,s}\). We define a decision variable \(y_{n,s}^s \in \{ 0,1\}\) to indicate whether BS s computes the user's offloaded task:

$$\begin{aligned} y_{n,s}^s = \left\{ \begin{aligned} 1,\ {}&\text {if}\ \psi _{n,s}\ \text {offloaded by user n}\\&\text {is computed at BS s},\\ 0,\ {}&\text {otherwise.} \end{aligned} \right. \end{aligned}$$
(11)

The execution delay of task \(\psi _{n,s}\) on BS s is:

$$\begin{aligned} {\textrm{t}}_{n,s}^{{\textrm{comp}}} = \frac{{{C_{n,s}}}}{{f_{n,s}^{{\textrm{mec}}}}}, \end{aligned}$$
(12)

Similar to other studies [31, 35], the transmission delay of BS s sending the computation results back to the MD is ignored because the output data are much smaller than the input data. Therefore, the total execution time of the task offloaded by user n to BS s is:

$$\begin{aligned} T_{n,s}^{{\textrm{mec}}} = t_{n,s}^{{\textrm{tran}}} + t_{n,s}^{{\textrm{comp}}}, \end{aligned}$$
(13)

The energy consumed by transmitting the data of user computing task \(\psi _{n,s}\) to its associated BS s can be calculated as:

$$\begin{aligned} E_{n,s}^{{\textrm{tran}}} = p_{n,s}^{k}\,t_{n,s}^{{\textrm{tran}}}. \end{aligned}$$
(14)

When BS s associated with the user does not have enough computing resources to meet the user's computing requirements, it needs to offload the task to a BS \(s^{'}\) that has sufficient computing resources. The computing delay of the task on BS \(s^{'}\) is expressed as:

$$\begin{aligned} \textrm{t}_{n, s^{\prime }}^{{\textrm{comp}}}=\frac{C_{n, s}}{f_{n, s^{\prime }}^{{\textrm{mec}}}}, \end{aligned}$$
(15)

Therefore, the total execution time when user n's task is offloaded to BS \(s^{'}\) is:

$$\begin{aligned} T_{s, s^{\prime }}^{{\textrm{mec}}}=t_{n, s}^{{\textrm{tran}}}+t_{s, s^{\prime }}^{{\textrm{tran}}}+\textrm{t}_{n, s^{\prime }}^{{\textrm{comp}}}. \end{aligned}$$
(16)

3.2.3 Cloud execution

When no other BS is available, the user's computing task will be executed at the macro-BS, where the computing delay is expressed as:

$$\begin{aligned} T_n^c = t_{n,s}^{{\textrm{tran}}} + t_{s,c}^{{\textrm{tran}}} + \frac{{C_{n,s}}}{{f_n^c}}, \end{aligned}$$
(17)

In summary, the total latency for completing the user's task \(\psi _{n,s}\) is:

$$\begin{aligned} \begin{aligned} T_{n, s}^{{\textrm{total}}}&=\left( 1-x_{n, s}^{k}\right) T_{n, s}^{{\textrm{loc}}}+x_{n, s}^{k}\left( y_{n, s}^{s} T_{n, s}^{{\textrm{mec}}}\right. \\&\quad \left. +\sum _{s' \in S} y_{n, s}^{s \rightarrow s'} T_{s, s'}^{{\textrm{mec}}} +y_{n, s}^{s \rightarrow c} T_{n}^{c}\right) , \end{aligned} \end{aligned}$$
(18)

The energy consumed by the user to complete the task \(\psi _{n,s}\) is:

$$\begin{aligned} E_{n, s}^{{\textrm{total}}}= & {} \left( 1-x_{n, s}^{k}\right) E_{n, s}^{{\textrm{loc}}}+x_{n, s}^{k} \left( y_{n, s}^{s}+\sum _{s' \in S} y_{n, s}^{s \rightarrow s'} +y_{n, s}^{s \rightarrow c}\right) E_{n, s}^{{\textrm{tran}}}. \end{aligned}$$
(19)
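The following sketch assembles Eqs. (18)-(19) from the per-path delays and energies computed above; the decision variables are assumed to satisfy the one-location constraints introduced next, and all argument names are our own.

```python
def total_cost_terms(x, y_s, y_nb, y_c,
                     t_loc, e_loc, t_mec, t_nb_mec, t_cloud, e_tran):
    """Total latency (Eq. 18) and energy (Eq. 19) of one user's task.

    x selects offloading vs. local execution; exactly one of
    (y_s, y_nb, y_c) is 1 when x == 1.
    """
    t_total = (1 - x) * t_loc + x * (y_s * t_mec
                                     + y_nb * t_nb_mec
                                     + y_c * t_cloud)
    e_total = (1 - x) * e_loc + x * (y_s + y_nb + y_c) * e_tran
    return t_total, e_total
```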

To ensure that the user’s computing tasks are performed in only one location, namely on the user’s device, BS, or macro-BS, the following constraints are imposed:

$$\begin{aligned}{} & {} \left( {1 - x_{n,s}^k} \right) + x_{n,s}^k\left( {y_{n,s}^s + \mathop \sum \limits _{s' \in S} y_{n,s}^{s \rightarrow {s^{'}}} + y_{n,s}^{s \rightarrow c}} \right) = 1, \end{aligned}$$
(20)
$$\begin{aligned}{} & {} \max \left\{ {y_{n,s}^s,y_{n,s}^{s \rightarrow {s^{'}}},y_{n,s}^{s \rightarrow c}} \right\} \le x_{n,s}^k,\quad \forall n \in N_s. \end{aligned}$$
(21)

3.3 Problem formation

The optimization problem, where "CORS" stands for computational offloading and resource scheduling, is expressed as follows:

$$\begin{aligned}&CORS:\min \mathop \sum \limits _{s \in S} \mathop \sum \limits _{n \in {N_s}} \left( {\lambda T_{n,s}^{{\textrm{total}}} + (1 - \lambda )E_{n,s}^{{\textrm{total}}}} \right) \end{aligned}$$
(22a)
$$\begin{aligned}&\text{ s.t. } T_{n, s}^{{\textrm{total}}}\leqslant T_{n,s}, \quad s \in S,\quad n \in N_{s} \text{, } \end{aligned}$$
(22b)
$$\begin{aligned}&E_{n, s}^{{\textrm{total}}} \leqslant E_{\max }, \quad \forall s \in S,\quad n \in N_{s}\text{, } \end{aligned}$$
(22c)
$$\begin{aligned}&0 \leqslant x_{n, s}^{k} p_{n, s}^{k} \leqslant P_{n, s}^{k}\text{, } \end{aligned}$$
(22d)
$$\begin{aligned}&\sum _{k \in K} c_{s}^{k} \leqslant 1,\quad \forall s \in S\text{, } \end{aligned}$$
(22e)
$$\begin{aligned}&\sum _{s \in S} c_{s}^{k} \leqslant M,\quad \forall k \in K \text{, } \end{aligned}$$
(22f)
$$\begin{aligned}&c_{s}^{k} \in \{0,1\},\quad \forall s \in S,\quad k \in K\text{, } \end{aligned}$$
(22g)
$$\begin{aligned}&\left( {1 - x_{n,s}^k} \right) + x_{n,s}^k\left( {y_{n,s}^s + \mathop \sum \limits _{s' \in S} y_{n,s}^{s \rightarrow {s^{'}}} + y_{n,s}^{s \rightarrow c}} \right) = 1, \end{aligned}$$
(22h)
$$\begin{aligned}&\max \left\{ {y_{n,s}^s,y_{n,s}^{s \rightarrow {s^{'}}},y_{n,s}^{s \rightarrow c}} \right\} \le x_{n,s}^k,\quad \forall n \in N_s . \end{aligned}$$
(22i)

where \(\lambda\) is the weight balancing energy consumption and computation delay. Constraint (22b) ensures that the computing task delay of a user's MD cannot exceed its maximum tolerable delay. Constraint (22c) caps each user's energy consumption. Constraint (22d) is the power constraint for all offloading users. Constraint (22e) ensures that each BS multiplexes at most one sub-channel, constraint (22f) ensures that each sub-channel can be multiplexed by at most M BSs, and constraint (22g) ensures that the sub-channel allocation is a binary decision variable. Constraints (22h) and (22i) ensure that each user's computing task is performed in exactly one location. Table 2 lists the important symbols used in the text.
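For reference, the per-user term of objective (22a) and the budget constraints (22b)-(22c) translate directly into code; `lam` plays the role of \(\lambda\), and the function names are illustrative.

```python
def weighted_cost(t_total, e_total, lam=0.5):
    """Per-user weighted cost term in objective (22a)."""
    return lam * t_total + (1.0 - lam) * e_total

def feasible(t_total, e_total, t_max, e_max):
    """Latency budget (22b) and energy budget (22c)."""
    return t_total <= t_max and e_total <= e_max
```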

Table 2 Symbolic representation and description

4 Algorithm for solving problem

RL can make optimal decisions in a specific environment through self-learning. It models realistic problems as an interaction between an agent and its environment: in each period of interaction, the agent receives the state of the environment and chooses an appropriate action according to that state, then obtains a reward value and a new state from the environment's feedback [31, 36]. Although RL has many strengths, it is not scalable, and the problems it can handle are limited to low-dimensional state spaces [37].

Different from RL, DRL combines the perceptual capabilities of deep learning with the decision-making capabilities of RL to solve environmental problems with high-dimensional state and action spaces [38]. In this section, the MDP model of this paper is built for the actual MEC scenario. In our system, the current state of a BS is related only to the state and actions of the previous moment (for example, whether the remaining resources of the BS are available and whether users' tasks are offloaded to the BS). The DDPG algorithm is adapted to solve the computing task offloading and resource scheduling problems of mobile users. Although the numbers of MDs and ESs constantly change, the task processing flow does not: computing tasks are randomly generated at the mobile user end; some are processed locally on the MD, while the rest are offloaded, according to the policy, to the BS associated with the user for processing; finally, the results are returned to the MD.

4.1 Problem formulation based on DDPG

MDP [29] is a mathematical framework describing a discrete-time stochastic control process, in which outcomes are partly random and partly controlled by an agent or decision maker. It is usually defined by a five-tuple \((\mathcal{S}, \mathcal{A}, p(\cdot ,\cdot ), R, \gamma )\) [39, 40], where \(\mathcal{S}\) denotes a finite set of states and \(\mathcal{A}\) denotes the finite set of actions describing the agent's behavior. \(p\left( {{s_{i + 1}}\mid {s_i},{a_i}} \right)\) denotes the transition probability of the system from state \({s_i} \in \mathcal{S}\) to state \({s_{i + 1}} \in \mathcal{S}\) after executing action \({a_i} \in \mathcal{A}\), \(R:\mathcal{S} \times \mathcal{A} \rightarrow \mathcal{R}\) is the instantaneous reward function, and \(\gamma\) is the discount factor used to calculate the cumulative return of the whole process. When the agent interacts with the environment, the cumulative return from state \({s_t} \in \mathcal{S}\) is:

$$\begin{aligned} {G_t} = {R_{t + 1}} + \gamma {R_{t + 2}} + {\gamma ^2}{R_{t + 3}} + \cdots = \sum \limits _{k = 0}^\infty {{\gamma ^k}} {R_{t + k + 1}}, \end{aligned}$$
(23)

To evaluate the value of a state s and the optimal strategy, the state value function \({v_{ \pi }}(s)\) and the state-action value function \({Q_{ \pi }}(s,a)\) are introduced. In practical application, the Bellman equations are adopted:

$$\begin{aligned} v_{\pi }(s)= & {} E_{\pi }\left[ R_{t+1}+\gamma v_{\pi }\left( S_{t+1}\right) \mid S_{t}=s\right] , \end{aligned}$$
(24)
$$\begin{aligned} Q_{\pi }(s, a)= & {} E_{\pi }\left[ R_{t+1}+\gamma Q_{\pi }\left( S_{t+1}, A_{t+1}\right) \mid S_{t}=s, A_{t}=a\right] , \end{aligned}$$
(25)

State space: we use the state \({S_t} = \left( D_{n,s},T_{n,s},{C_{n,s}},\phi ,{U_1},{U_2}, \ldots ,{U_{2 + s}}\right)\) to characterize the tasks generated by users in slot t and the computational resources available on the ESs, where \(D_{n,s}\) is the size of the data requested by the MD, \(T_{n,s}\) is the maximum tolerable delay of the computing task on the user's MD, and \({C_{n,s}}\) is the computing resources required to complete the requested task. \(\phi = \left( {{\phi _1},{\phi _2}, \ldots ,{\phi _i}, \ldots ,{\phi _s}} \right)\) collects the remaining available computing resources on the ESs, with \(\phi _{i} = f_{i}^{{\textrm{mec}}} - \sum _{n \in N_i} x_{n,i}^{k} f_{n,i}^{{\textrm{mec}}}\), i.e., the capacity of ES i minus the resources already allocated to tasks offloaded to it, and \({U_i}\) indicates whether the i-th computing device is in use in slot t. To ensure that the tasks of MDs can only be computed locally or offloaded, only \(2 + s\) computing devices need to be considered for the tasks generated by MDs: one macro-BS, one user equipment, and \({\textrm{s}}\) BSs.
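An illustrative construction of the state vector \(S_t\) is sketched below, reusing the Task container from Sect. 3; the layout simply concatenates the task descriptor, the residual ES resources \(\phi\), and the device-usage flags \(U_1, \ldots , U_{2+s}\).

```python
import numpy as np

def build_state(task, phi, device_in_use):
    """Assemble S_t = (D, T, C, phi_1..phi_s, U_1..U_{2+s})."""
    return np.concatenate((
        [task.data_bits, task.max_delay, task.cpu_cycles],
        phi,                          # residual resources of the s ESs
        device_in_use.astype(float),  # usage flags of the 2+s devices
    ))
```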

Action space: to offload the computing tasks of MDs onto appropriate computing devices, the action space in DRL corresponds to the collection of available computing devices. \((0/1)_i^j\) indicates whether the computing task of user i is offloaded to device j, so the action space is \({\textbf{A}}=\left( a_{1}, a_{2}, \ldots , a_{2+s}\right)\).


Reward: after executing action A in a given state, the agent receives the corresponding reward R. In general, the reward function is positively correlated with the objective function; however, since the objective of this paper is to minimize the system cost, the reward is defined as the negative cost, \(R = - \left( {\lambda T_{n,s}^{{\textrm{total}}} + (1 - \lambda )E_{n,s}^{{\textrm{total}}}} \right)\).
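Accordingly, the reward is just the negative of the weighted cost defined in Sect. 3.3; a one-line sketch reusing the `weighted_cost` helper from above:

```python
def reward(t_total, e_total, lam=0.5):
    """R = -(lam * T_total + (1 - lam) * E_total)."""
    return -weighted_cost(t_total, e_total, lam)
```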

4.2 Computational offloading and resource scheduling algorithm based on DDPG

The DDPG algorithm consists of a policy-learning network (actor) and an action-value-learning network (critic). The actor defines a parameterized policy based on the observed environment state and generates an action, while the critic evaluates the reward obtained under the current policy [30, 41]. The critic network in DDPG uses experience replay, saving each trajectory record as a tuple \(\left\{ s_{t}, a_{t}, r_{t},s_{t+1}\right\}\) and using mini-batches of tuples to update the network parameters. To minimize the loss function, the critic computes the Q value of the current network and of the target network based on the current state and action. Next, the policy is updated through the current network using the policy gradient, and then the target network parameters are updated. The current action \(a_{t}\) can be calculated using the following formula

$$\begin{aligned} a_{t}=\mu \left( s_{t} \mid \theta ^{\mu }\right) +{\mathcal {N}}_{t}, \end{aligned}$$
(26)

Then, the target Q value is given by

$$\begin{aligned} y_{i}=r_{i}+\gamma Q^{\prime }\left( s_{i+1}, \mu ^{\prime }\left( s_{i+1} \mid \theta ^{\mu ^{\prime }}\right) \mid \theta ^{Q^{\prime }}\right) , \end{aligned}$$
(27)

The loss function in the evaluation process is calculated as follows:

$$\begin{aligned} L=\frac{1}{N} \sum _{i}\left( y_{i}-Q\left( s_{i}, a_{i} \mid \theta ^{Q}\right) \right) ^{2}, \end{aligned}$$
(28)

Next, the actor updates the current policy via the policy gradient, using \({\theta ^Q}\) and the sampled tuples.

$$\begin{aligned} {\nabla _{{\theta ^\mu }}}J \approx \frac{1}{N} \sum \limits _i {{\nabla _a}} Q\left( s,a \mid \theta ^Q \right) {\mid _{s = {s_i},a = \mu \left( {{s_i}} \right) }}\,{\nabla _{{\theta ^\mu }}}\mu \left( s \mid \theta ^\mu \right) {\mid _{s = {s_i}}}, \end{aligned}$$
(29)

In contrast to DQN, DDPG increases learning stability by using soft updates to refresh the target network parameters. The updates of the target critic parameters \({\theta ^{Q'}}\) and the target actor parameters \({\theta ^{\mu '}}\) can be represented as

$$\begin{aligned}{} & {} {\theta ^{Q'}} \leftarrow \tau {\theta ^Q} + \left( {1 - \tau } \right) {\theta ^{Q'}}, \end{aligned}$$
(30)
$$\begin{aligned}{} & {} {\theta ^{\mu '}} \leftarrow \tau {\theta ^\mu } + \left( {1-\tau } \right) {\theta ^{\mu '}}. \end{aligned}$$
(31)
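A minimal PyTorch sketch of the critic update (Eqs. (27)-(28)) and the soft target updates (Eqs. (30)-(31)) is given below; the actor/critic network objects, the optimizer, and the value \(\tau = 0.005\) are illustrative assumptions, not settings reported in this paper.

```python
import torch
import torch.nn.functional as F

def critic_step(critic, critic_tgt, actor_tgt, batch, gamma, optimizer):
    """One critic update: target value per Eq. (27), MSE loss per Eq. (28)."""
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * critic_tgt(s_next, actor_tgt(s_next))  # Eq. (27)
    loss = F.mse_loss(critic(s, a), y)                         # Eq. (28)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Eqs. (30)-(31): theta_tgt <- tau * theta + (1 - tau) * theta_tgt."""
    for tgt, src in zip(target_net.parameters(), online_net.parameters()):
        tgt.mul_(1.0 - tau).add_(tau * src)
```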

This section presents a collaborative computing offloading algorithm that leverages DRL techniques to facilitate efficient offloading. The proposed algorithm, CORS-DDPG, aims to identify an optimal offloading destination for users, determining whether to process tasks locally or offload them to an ES or the macro-BS. CORS-DDPG stores previous experiences in an experience replay buffer, allowing the mechanism to learn from past decisions and make offloading choices tailored to the computational requirements of different users. The goal is to strike a balance between energy consumption and processing latency. The step-by-step operation of the CORS-DDPG algorithm is as follows (a high-level sketch of one training episode is given after the list):

  1.

    The agent acquires actions based on the current state of the environment using the policy.

  2.

    For each step of the episode, MDs within the coverage area of the micro-BS first perform NOMA channel occupancy based on cooperative offloading and resource allocation policies to accomplish data transmission and task execution.

  3.

    Next, energy consumption and transmission delay are calculated based on the current action and state. Based on these calculations, the reward for the step is obtained, and the environment state is updated.

  4.

    Store the reward and the next state in the experience pool. Periodically, a portion of the samples from the experience pool is used to train the actor and critic networks.

  5.

    When the execution step satisfies the policy update condition, equations (28) and (29) are used to update the current critic network and the current actor network.
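The five steps above can be summarized in the following high-level training-episode sketch; `env`, `agent`, and `buffer` are assumed wrappers around the MEC simulator, the actor-critic networks, and the replay memory, so the interface shown is illustrative rather than the paper's implementation.

```python
import numpy as np

def run_episode(env, agent, buffer, max_steps=200, batch_size=64):
    state = env.reset()
    for _ in range(max_steps):
        # Step 1: action from the policy plus exploration noise, Eq. (26).
        action = agent.act(state) + np.random.normal(0.0, 0.1, agent.act_dim)
        # Steps 2-3: apply the offloading / sub-channel decision, observe
        # the resulting latency and energy; reward is the negative cost.
        next_state, reward, done = env.step(action)
        # Step 4: store the transition and train on sampled mini-batches.
        buffer.add(state, action, reward, next_state)
        if len(buffer) >= batch_size:
            # Step 5: update critic (Eq. 28), actor (Eq. 29),
            # and soft-update the targets (Eqs. 30-31).
            agent.update(buffer.sample(batch_size))
        state = next_state
        if done:
            break
```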

5 Performance evaluation

First, the simulation parameter settings and the different experimental methods are presented. Then, the effects of different parameter settings on the algorithm's convergence performance are compared. Lastly, the scheme's feasibility is demonstrated by comparison with benchmark offloading methods and communication methods.

5.1 Simulation setup

In this section, we verify the performance of the proposed joint optimization algorithm and the advantages of the NOMA communication method through simulation experiments. In the simulation, we use Python and, based on our system model, set the coverage area of the macro-BS as an area with a radius of 1000 m. Micro-BSs are randomly placed in this area. Following Wang et al. [30], the number of micro-BSs ranges from 4 to 8. Each NOMA cluster contains two MDs, the number of MDs ranges from 8 to 16, and the number of sub-channels is 4. The size of a user task is between 600 and 1000 kbits. The remaining experimental parameters are shown in Table 3.

Table 3 Simulation parameter setting

5.2 Contrast experiment

5.2.1 Comparison of offloading methods

To compare the impact of the proposed offloading method on the total system cost, we compare it with other offloading methods; all three are described in detail below.

  1.

    Local-Execution-only: the user’s computational tasks are all processed locally without requesting a computational offload to the MEC server.

  2.

    Edge-Execution-only: User-generated computing tasks are executed only on the ES associated with them.

  3.

    The scheme proposed in this paper (CORS-DDPG): the user's computing tasks can be processed locally, offloaded to the BS associated with the user, or offloaded to the macro-BS for processing.

5.2.2 Comparison of communication methods

To compare the impact of the communication approach presented in this article on the overall system cost, we compare against OMA communication methods; the compared schemes are described in detail below.

  1.

    Local-Execution-only: the user’s computational tasks are all processed locally without requesting a computational offload to BS.

  2.

    Edge-Execution-only: all users' computing tasks are offloaded to the BS for execution (using NOMA as the communication method).

  3.

    The scheme proposed in this paper (CORS-DDPG): each user's task can be executed locally, offloaded to its associated BS, or offloaded to the macro-BS, where the communication method uses NOMA or OMA, respectively (CORS-DDPG-NOMA and CORS-DDPG-OMA).

The next simulations evaluate the convergence performance of the different DDPG-based offloading approaches against the number of iterations. The metric used in this article is the system cost, i.e., the weighted sum of task completion latency and user energy consumption.

Fig. 3 Convergence performance of the CORS-DDPG algorithm at different learning rates

Fig. 4 Effect of different discount factors on the performance of the CORS-DDPG algorithm

5.3 Algorithm performance analysis

In Fig. 3, we illustrate the impact of different network hyperparameters on the algorithm's convergence performance, assessed by the system cost. Specifically, when \({\alpha _{{\textrm{Actor}}}} = 0.01,{\alpha _{{\textrm{Critic}}}} = 0.02\), the algorithm fails to converge within a specific time frame. This failure can be attributed to the neural network's large update step: the high learning rates of the actor and critic networks make it impossible to learn the behavior that minimizes the system cost. Upon resetting \({\alpha _{{\textrm{Actor}}}} = 0.0001,{\alpha _{{\textrm{Critic}}}} = 0.0002\), the algorithm still does not converge. Here the absence of convergence results from the reduced learning rates, which decelerate the network updates and thereby require more iterations to attain convergence; this demands additional computational resources and escalates the system cost. Finally, setting \({\alpha _{{\textrm{Actor}}}} = 0.001,{\alpha _{{\textrm{Critic}}}} = 0.002\) allows the algorithm to converge within a specific time interval, reducing the system cost.

Fig. 5 Convergence performance of the CORS-DDPG algorithm

As shown in Fig. 4, we explore the impact of different discount factors \(\gamma\) on the algorithm's performance, using the system cost as the evaluation metric to determine the optimal setting. The system cost reaches its lowest point when \(\gamma = 0.95\): the computational offloading strategy is optimized, and the best offloading decision can be found within a time interval. In the other two sets of experiments, however, the chosen \(\gamma\) values neither converge quickly nor effectively reduce the system cost. When larger or smaller values of \(\gamma\) are employed, the algorithm treats the data gathered in the current period as having a prolonged influence on the entire training process, resulting in time intervals that are either excessively extended or too brief and thereby impairing generalization. Consequently, the algorithm fails to reduce the system cost effectively. To improve the training strategy, we set \(\gamma =0.95\) in the subsequent experiments.

As shown in Fig. 5, CORS-DDPG and Local-Execution-only converge over time, while the other offloading methods fluctuate strongly; however, Local-Execution-only does not minimize the total system cost, so the CORS-DDPG offloading method is better.

5.4 Performance comparison

Figure 6 shows the system cost obtained with different offloading methods. The system cost increases with the number of BSs. As shown in Fig. 6, the CORS-DDPG method obtains a lower system cost than the two baseline methods (Edge-Execution-only and Local-Execution-only), because these two methods cannot handle dynamic environmental changes.

Fig. 6 Total system reward in relation to the number of BSs

Fig. 7 Total system reward in relation to the number of BSs under different numbers of sub-channels

Fig. 8 Total system reward in relation to the number of users

Figure 7 displays the relationship between the system cost and the number of BSs when the number of sub-channels varies. In Fig. 7, the system cost increases gradually with the number of BSs, and it increases as the number of sub-channels decreases: with fewer sub-channels in the system, more BSs occupy the same channel, which leads to severe inter-BS interference and in turn increases the overall cost of the user offloading process.

Figure 8 shows the system cost versus the number of users under the CORS-DDPG algorithm. As the number of users increases, the total system cost grows gradually, but the proposed CORS-DDPG scheme achieves a lower system cost than the other offloading schemes.

Fig. 9 Total system reward in relation to task size

Fig. 10 Total system reward in relation to the number of BSs under different communication methods

Figure 9 displays the correlation between the system cost and the input data size of user computing tasks. In Fig. 9, the total system cost increases with the data volume, because a larger computing task load leads to a significant increase in the cost of local computing and, in addition, more users choose to offload their tasks to the BSs.

In Fig. 10, the impact of the two different communication approaches on the system cost is compared. The total system cost of CORS-DDPG-NOMA is lower than that of the other solutions, which shows the effectiveness of the communication method we applied.

6 Conclusion

In this paper, we propose a cloud-edge-end cooperative offloading model that combines UDEC and NOMA to balance the energy consumption and latency of MDs during the offloading of computing tasks. The model has important applications in the field of medical IoT, as it can provide energy-efficient and low-latency services according to patients' different medical needs. The study takes energy consumption, latency, and QoS into account, aiming to minimize the total cost incurred by user offloading tasks through joint offloading decisions, computational resource allocation, and sub-channel allocation. Although the objective problem is non-convex, this paper proposes a DRL cooperative offloading algorithm to determine the optimal offloading strategy for users. Simulation results show that the proposed offloading scheme achieves a lower total system cost than schemes that process tasks only locally or offload them only to ESs. In addition, NOMA leads to near-optimal system performance compared with OMA.

However, the research in this paper has some shortcomings. For example, ESs cannot handle the offloading of users' computing tasks in real time in a dynamic environment. Future research will consider user mobility, such as the sudden movement of a user from the coverage area of one micro-BS to that of another.