Introduction

The advent of the internet of things (IoT) has revolutionized the landscape of information systems, mobile technologies, and wireless communication, providing the foundation for ubiquitous computing [1]. IoT enables wireless connectivity among various devices, including radio frequency identification (RFID) tags, actuators, wireless sensors, smartphones, and physical objects with sensing capabilities, allowing physical objects and people to connect and participate in the digital world. IoT has found applications in diverse domains such as smart homes, environmental monitoring, disaster management, asset tracking, smart cities, and smart healthcare systems [2].

At the core of IoT are wireless sensor networks (WSNs), which comprise four critical components: communication, processing, power, and sensing units. Ensuring energy efficiency in IoT networks is vital for prolonging network functionality and supporting many IoT devices. This is particularly crucial for devices deployed in harsh environments, where charging and battery replacement may not be feasible. As a result, the development of energy-efficient routing protocols that control the energy usage of devices and enhance the network's longevity is imperative [3, 4].

Reinforcement learning (RL) presents an opportunity to improve energy efficiency in IoT-enabled WSNs by offering adaptive and dynamic routing solutions. In RL, an agent interacts with an unknown environment and progressively improves its performance through trial-and-error interactions [5]. The agent takes actions and receives rewards, which can be positive or negative, depending on the appropriateness of the action. This approach offers superior flexibility in data routing and network communication compared to static routing alternatives [6].

RL can be leveraged to address challenges in IoT networks, such as changing network topology due to device mobility, energy consumption, and fluctuating transmission characteristics like bandwidth, signal strength, and distance that may impact network performance [7]. Therefore, the proposed deep reinforcement learning energy-efficient routing (DRLEER) technique aims to optimize routing decisions for IoT devices while balancing energy dissipation across the network. This research focuses on extending the network lifespan and scalability by incorporating a hop count factor that minimizes end-to-end latency. In addition, DRLEER employs a feedback mechanism to provide local knowledge as a reward, further enhancing its ability to identify optimal pathways based on hop count and residual energy.

The deep reinforcement learning agent in DRLEER is trained on a comprehensive dataset that includes network topology, traffic patterns, and energy consumption. By learning from experience and adapting to dynamic network conditions, the agent can make optimal routing decisions that minimize energy consumption and prolong the lifespan of IoT devices. This approach ensures efficient energy consumption and increased network durability, making it more scalable for large-scale IoT networks.

To evaluate the performance of DRLEER, simulations were conducted in comparison to existing standard protocols. The results demonstrate that the proposed DRLEER method outperforms current protocols by offering a superior energy balance and extending the network lifespan. This indicates that DRLEER is a promising approach for achieving energy efficiency and improved performance in IoT networks.

The subsequent sections of this paper are organized as follows: “Literature survey” provides a review of existing literature on energy-efficient routing protocols in IoT networks; “Proposed method” describes the proposed DRLEER method, its architecture, and the learning algorithm; “Result and discussion” presents the experimental setup and evaluation of the DRLEER method against existing protocols in terms of scalability, network durability, and energy efficiency; and “Conclusion” concludes the paper with a discussion of the findings, implications, and potential future research directions.

Literature Survey

Energy efficiency has been a significant challenge in IoT-enabled wireless sensor networks (WSNs). Researchers have proposed various clustering techniques to address this issue and extend network lifetimes. This literature survey discusses prominent methods in the field, focusing on their approaches to enhance energy efficiency in IoT networks.

An energy-efficient routing approach based on reinforcement learning (EER-RL) was presented to tackle energy efficiency challenges in IoT networks [8]. The RL technique enabled devices to react to network changes, such as shifts in energy levels and mobility, thus improving routing decisions. By comparing two implementations of EER-RL, one flat-based (FlatEER-RL) and the other cluster-based (EER-RL), the researchers demonstrated that the cluster-based technique was more scalable, and they suggested employing the flat-based implementation in small-scale networks. The simulation results showed that EER-RL outperformed commonly used routing protocols, such as power-efficient gathering in sensor information systems (PEGASIS) [9] and low-energy adaptive clustering hierarchy (LEACH) [10], in terms of network lifetime and energy usage.

The authors of [11] presented a deep learning-based routing protocol for efficient data transmission in 5G wireless sensor network communication. They used a reinforcement learning (RL) algorithm to organize the nodes of the entire network into clusters and distribute rewards to cluster members. Subsequently, the Manta ray foraging optimization (MRFO) technique was employed to select the cluster head (CH) for effective data transfer [12]. The data were sent to the sink node through the chosen CH using an efficient deep learning method. The proposed deep belief network (DBN) routing protocol outperformed existing methods in network lifespan when assessed against metrics such as the number of effective nodes, network lifespan, energy usage, and packet delivery rate [13].

A distributed energy-efficient clustering (DEEC) technique was introduced to extend the lifetime of WSNs [14]. The authors presented a three-tiered heterogeneous network architecture defined by a single model factor. Energy heterogeneity increased the total network energy, and using that energy effectively extended the network's lifespan. Their modified level-3 version (HetDEEC-3) improved the network energy by 100% over the original version and increased the network lifetime by 182.67% [15].

To address the issue of random CH selection, a cluster-based technique called energy-aware centralized control clustering (EACCC) was developed [16]. It utilized the centralized control cluster (3C) algorithm as its foundation. Each sensor node first transmitted data on its energy and position to the base station (BS), which determined the nodes' locations in the network based on the received information. EACCC operated in rounds, with data transmission occurring after CH selection in each round.

Additional methods in the field of energy-efficient routing for IoT networks include group-based energy-efficient clustering (GEEC) [17], enhanced distributed energy-efficient clustering (E-DEEC) [18, 19], and several others. GEEC utilizes grouping mechanisms for energy-efficient clustering to improve the overall network lifespan. E-DEEC is the improved version of DEEC that further optimizes energy consumption and prolongs the network lifespan by refining the clustering and data transmission processes. These methods further demonstrate the diversity of approaches researchers have taken to address energy efficiency challenges in IoT-enabled WSNs.

In summary, the literature highlights the importance of energy efficiency in IoT networks, and the various approaches researchers have taken to address this challenge. The methods surveyed here, including EER-RL, FlatEER-RL, PEGASIS, LEACH, MRFO, GEEC, DBN, DEEC, HetDEEC-3, E-DEEC, and EACCC, offer valuable insights into the development of energy-efficient routing protocols.

Proposed Method

In this section, we elaborate on the proposed deep reinforcement learning energy-efficient routing (DRLEER) method, which includes the system model, operation flow of the RL agent, and the genetic optimization algorithm.

Network Model

Consider the IoT-based WSN structure shown in Fig. 1, comprising base stations (BS), various sensor cluster groups classified as strong or weak according to their channel coefficients, and cluster pairs, each consisting of n clusters of deployed nodes. To communicate with the base station, the cluster head (CH) establishes connections with other group heads within the network. The cluster pairs can determine whether the radio block (RB) spectrum is occupied or free, thereby preventing interference between strong and weak cluster groups. The formation of cluster pairs relies on the disparity in channel gains: if this difference exceeds a predefined threshold, cluster-pair formation is allowed; otherwise, it is restricted. The cloud server stores the collected data, which is then used by various social communities and industries, sorting the information based on traffic flow, behavioral patterns, and shared content.

Fig. 1. IoT-enabled WSN architecture
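To make the pairing rule concrete, the following is a minimal sketch of the channel-gain test described in the network model above; the function name and the threshold value are illustrative assumptions, since the paper does not specify them.

```python
# Minimal sketch of the cluster-pair formation rule: a pair is allowed only
# when the disparity in channel gains exceeds a predefined threshold.
# The default threshold below is an assumption; the paper leaves it unspecified.

def can_form_cluster_pair(strong_gain: float, weak_gain: float,
                          threshold: float = 0.5) -> bool:
    """Return True if the channel-gain disparity permits pair formation."""
    return abs(strong_gain - weak_gain) > threshold
```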

System Model

The system model for DRLEER consists of IoT-enabled wireless sensor networks (WSNs) with numerous sensor nodes, as seen in Fig. 2. Each node is equipped with a sensing unit, processing unit, communication unit, and power unit. The nodes are distributed across the network and have different energy levels. The network's primary goal is to efficiently transmit data to a central base station (BS) or sink node while minimizing energy consumption and prolonging the network's lifespan.

Fig. 2. IoT single-hop cluster-based communication with WSN assistance

The DRLEER protocol is an energy-efficient routing system that employs reinforcement learning (RL) for IoT wireless sensor networks (WSNs). The method aims to maximize successful hop selection and conserve energy by exchanging local information with neighboring devices. DRLEER follows a three-step structure, like other cluster-based routing protocols, which includes network configuration and cluster head (CH) selection, cluster construction, and data transfer. The fundamental assumptions for the network in our model are:

  • Each sensor node and the base station are stationary and identifiable by a unique ID after deployment.

  • Nodes lack GPS-capable antennas and location awareness.

  • Nodes have similar processing and communication capabilities but differ in terms of energy in heterogeneous networks.

  • After deployment, nodes are left unattended, making it impossible to recharge batteries.

  • The central base station has a steady power supply and no energy, memory, or processing limitations.

  • Each node can aggregate data, facilitating the compression of multiple data packets into a single packet.

  • The distance between nodes is estimated by measuring the intensity of the received signal (a sketch of one such estimate follows this list).

  • Nodes can adjust their transmission power according to the distance to the receiving node. Node failure is considered only when a node's energy runs low.

  • Data transmission between two nodes requires the same energy level, indicating a symmetrical radio connection.

  • Nodes are uniformly and randomly distributed within a 100 × 100 square unit area.

  • A “dead node” refers to a node with a battery level that has dropped to zero.
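As referenced in the list above, inter-node distance can be inferred from received signal strength. A minimal sketch follows, assuming the standard log-distance path-loss model with a 1 m reference; the calibration constants are illustrative assumptions, since the paper does not state which propagation model its nodes use.

```python
# Distance estimation from received signal strength (RSSI), assuming the
# log-distance path-loss model: Pr(d) = Pr(d0) - 10*n*log10(d/d0), d0 = 1 m.
# The reference RSSI and path-loss exponent are assumed calibration values.

def estimate_distance(rssi_dbm: float,
                      rssi_at_1m_dbm: float = -40.0,
                      path_loss_exponent: float = 2.0) -> float:
    """Return the estimated transmitter-receiver distance in meters."""
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10.0 * path_loss_exponent))
```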

The DRLEER protocol extracts data contained in the packet header, and each nearby device that can hear the packet receives the local information. The sender then updates the routing table based on the local data, which includes the device's identity number, residual energy, hop count, and positional coordinates. In the proposed model, clustering is performed to facilitate energy-efficient transmission, improve network longevity, and minimize energy consumption. The RL method is used for clustering the nodes, and a deep neural network approach to machine learning is employed for creating an effective path routing system for the WSN-supported IoT.

Energy Constraint

Both the sender and the receiver consume energy during a packet transfer; the sender requires more energy because it must amplify the signal across the network. After computing the energy lost during packet transmission or reception, the energy consumption model updates the residual energy. This work adopts the radio energy dissipation model of [8], in which the transmitter powers both the radio electronics and the amplifier, while the receiver powers only the radio electronics. Multi-path fading is the fading model employed when the distance d from the transmitter to the receiver exceeds the measured threshold.

The following equations present the energy consumption model:

$${E}_{\mathrm{Tx}}\left(k,d\right)={E}_{\mathrm{elec}}\times k+{E}_{\mathrm{amp}}\times k\times {d}^{m}$$
(1)
$${E}_{\mathrm{Rx}}\left(k,d\right)={E}_{\mathrm{elec}}\times k$$
(2)

where \({E}_{\mathrm{Tx}}\left(k,d\right)\) and \({E}_{\mathrm{Rx}}\left(k,d\right)\) denote the energy used by the transmitter and the receiver, respectively. At each transmission (or reception), \({E}_{\mathrm{elec}}\), equal to 50 nJ/bit, powers the transmitter or receiver circuit, while \({E}_{\mathrm{amp}}\), which strengthens the signal across space (with exponent m = 2 or 4 according to the distance), is estimated at 100 pJ/bit/m². \(k\) represents the packet size in bits. Figure 3 shows the energy model used for calculating the energy consumption in this research work.

Fig. 3. Energy model used for calculating the energy consumption
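The following is a direct transcription of Eqs. (1) and (2) using the constants stated above; treat it as a sketch for reproducing the energy bookkeeping, not the authors' simulation code.

```python
# Radio energy dissipation model of Eqs. (1)-(2).
# E_elec = 50 nJ/bit powers the transmitter/receiver electronics;
# E_amp = 100 pJ/bit/m^2 powers the amplifier.

E_ELEC = 50e-9    # J/bit
E_AMP = 100e-12   # J/bit/m^2

def tx_energy(k_bits: int, d: float, m: int = 2) -> float:
    """Energy (J) to transmit k bits over d meters (Eq. 1).
    m = 2 below the distance threshold, m = 4 (multi-path) above it."""
    return E_ELEC * k_bits + E_AMP * k_bits * d ** m

def rx_energy(k_bits: int) -> float:
    """Energy (J) to receive k bits (Eq. 2)."""
    return E_ELEC * k_bits
```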

Operation Flow of RL Agent

Figure 4 depicts the process flow of the DRLEER technique, illustrating the steps involved in selecting the CHs and determining the optimal routing strategy using the reinforcement learning agent.

Fig. 4. Optimal routing strategy using the reinforcement learning agent

The clustering process is performed centrally by the base station (BS) or sink node, which assigns each sensor node (SN) to a specific cluster based on its location. After the SNs are distributed within a cluster, the cluster heads (CHs) are selected through an optimization process. In a hierarchically clustered WSN, collecting and delivering data from every member node places a significant burden on each CH, so proper CH selection is crucial to increasing network lifetime. The DRLEER method selects the CH for each cluster while considering constraints such as time, energy, and distance.

Optimal route selection is critical for sensor networks to perform well on metrics such as data integrity, latency, throughput, and energy efficiency, since wireless channels are fluctuating, dynamically unstable, unreliable, and asymmetric. Once the CH is selected for effective data transport, a deep belief network-based routing technique provides the process flow for the suggested approach. The neural network analyzes variables such as link distance, residual energy, the number of neighbor nodes, and distance from the CH to perform this routing procedure. By gaining a thorough understanding of the nodes' communication behavior, the proposed routing creates an energy-efficient routing strategy.

The operation flow of the RL agent in DRLEER involves the following steps:

  • Initialization: the RL agent is initialized with the network topology, traffic patterns, and energy consumption data. The agent's initial knowledge of the network state and the set of possible actions is defined.

  • Exploration and exploitation: the RL agent explores the network to discover the best routing decisions that minimize energy consumption. It balances exploration (discovering new routing paths) with exploitation (leveraging the best-known paths); a minimal selection sketch follows this list.

  • Interaction with the environment: the RL agent interacts with the IoT network by taking actions (i.e., selecting routing paths) and receiving rewards or penalties based on the energy consumption and performance of the selected paths.

  • Learning and updating: the RL agent continuously updates its knowledge by learning from its experiences and adapting to changing network conditions. It employs a deep learning model to identify patterns and relationships in the data, which aids in making better routing decisions.

  • Policy improvement: the RL agent iteratively refines its decision-making policy by considering the accumulated rewards and penalties. It converges to an optimal policy that minimizes energy consumption and extends the network's lifespan.
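As a concrete illustration of the exploration-exploitation step above, here is a minimal epsilon-greedy next-hop selection sketch; the Q-table layout and the exploration rate are assumptions, not details taken from the paper.

```python
import random

# Epsilon-greedy next-hop selection: with probability epsilon the agent
# explores a random candidate hop, otherwise it exploits the hop with the
# highest learned Q-value. q_table maps (state, action) pairs to values.

def select_next_hop(q_table: dict, state, candidate_hops: list,
                    epsilon: float = 0.1):
    if random.random() < epsilon:
        return random.choice(candidate_hops)           # explore
    return max(candidate_hops,                         # exploit
               key=lambda hop: q_table.get((state, hop), 0.0))
```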

Each sensor node (SN) incorporates the reinforcement learning (RL) concept for clustering: it initially calculates the path cost and provides that information to the cluster head (CH) based on the most recent Q-value. The reward parameter represents the connection cost between the current node and the next-hop node. The Markov decision process (MDP) is defined by the tuple [\(S\)—states, \(T\)—transition function, \(A\)—actions, \(R\)—reward function].

For each state in \(S\), the learning agent evaluates the available actions in \(A\) and estimates the energy consumption of each cluster under those actions. An appropriate choice is made by evaluating the reward \(R\) obtained from the predicted energy usage. The state and action indices then advance from \({S}_{i}\) to \({S}_{i+1}\) and from \({A}_{i}\) to \({A}_{i+1}\). The learning agent derives the best policy from the learning process, which maximizes the accumulated reward, and this optimal policy is employed for the best CH selection. In an MDP, the state transition and reward depend only on the current state and action. The learning agent's goal is policy improvement, \(\pi\):

\(S\to A\): based on the current state \({S}_{i}\), the learning agent selects action \({A}_{i}\) (i.e., \(\pi \left({S}_{i}\right)={A}_{i}\)). The cumulative value function \({V}_{\pi }\left({S}_{i}\right)\), with discount factor \(\gamma\), is then calculated starting from state \({S}_{i}\):

$${V}_{\pi }\left({S}_{i}\right)={r}_{i}+\gamma {r}_{i+1}+{\gamma }^{2}{r}_{i+2}+\cdots$$
(3)
$$={r}_{i}+\gamma {V}_{\pi }\left({S}_{i+1}\right)$$
(4)
$$=\sum_{j=0}^{\infty }{\gamma }^{j}{r}_{i+j}$$
(5)

The optimal policy is the one that maximizes this value function:

$${\pi }^{*}={\mathrm{arg}}\,\underset{\pi }{\mathrm{max}}\,{V}^{\pi }\left({S}_{i}\right)$$
(6)

Finally, the Q-value is updated using Eq. (7):

$${Q}_{t+1}\left({S}_{t},{a}_{t}\right)=\left(1-\alpha \right){Q}_{t}\left({S}_{t},{a}_{t}\right)+\alpha \left[{r}_{t+1}+\gamma \,\underset{{a}^{\prime}}{\mathrm{max}}\,{Q}_{t}\left({S}_{t+1},{a}^{\prime}\right)\right]$$
(7)

The equation above continuously updates the Q-table, where \(\alpha\) is the learning rate, \(\gamma\) the discount factor, \({r}_{t+1}\) the reward received, and \(\underset{{a}^{\prime}}{\mathrm{max}}\,{Q}_{t}\left({S}_{t+1},{a}^{\prime}\right)\) the maximum Q-value over the actions \({a}^{\prime}\) available in the next state. In this way, effective CH selection is achieved within IoT networks, enhancing energy efficiency and routing performance.
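The update in Eq. (7) can be written compactly as follows; the learning rate and discount factor values are illustrative assumptions.

```python
# Q-value update of Eq. (7):
# Q(s,a) <- (1 - alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a')).

def update_q(q_table: dict, state, action, reward: float,
             next_state, next_actions: list,
             alpha: float = 0.5, gamma: float = 0.9) -> None:
    best_next = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old_value = q_table.get((state, action), 0.0)
    q_table[(state, action)] = ((1 - alpha) * old_value
                                + alpha * (reward + gamma * best_next))
```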

Genetic Optimization Algorithm

A genetic algorithm (GA) is a randomized search and optimization technique frequently used to solve optimization problems with many potential solutions. GA is based on the survival of the fittest hypothesis and consists of an initial population of potential solutions called chromosomes. The fitness function evaluates the fitness value of each chromosome, and through selection, crossover, and mutation processes, the population evolves towards more fit solutions.

The genetic optimization algorithm is integrated into the DRLEER method to further enhance the routing decisions made by the RL agent. This algorithm is inspired by the principles of natural selection and evolution, and it works as follows (a minimal sketch of one generation follows the list):

  • Population initialization: a population of candidate routing solutions is generated, where each individual represents a possible routing path.

  • Fitness evaluation: the fitness of each candidate solution is assessed based on energy consumption, network performance, and other relevant criteria.

  • Selection: individuals with higher fitness values are selected for reproduction, favoring those that contribute to better routing decisions.

  • Crossover and mutation: genetic operators, such as crossover and mutation, are applied to the selected individuals to generate offspring, which inherit characteristics from their parents.

  • Replacement: the offspring replace some of the less fit individuals in the population, promoting the evolution of better routing solutions over time.
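The following is a minimal sketch of one GA generation over candidate CH assignments, assuming each chromosome is a list of node IDs chosen as cluster heads and that the fitness of Eq. (8) below is a cost to be minimized; the tournament selection, one-point crossover, and mutation rate are illustrative choices, not details specified by the paper.

```python
import random

# One GA generation: tournament selection, one-point crossover, per-gene
# mutation, and replacement keeping the fittest individuals. `fitness` is a
# cost function (lower is better, e.g. CH_fit of Eq. 8); chromosomes are
# lists of candidate cluster-head node IDs (length >= 2 assumed).

def evolve(population: list, fitness, all_nodes: list,
           mutation_rate: float = 0.05) -> list:
    def tournament():
        a, b = random.sample(population, 2)
        return a if fitness(a) < fitness(b) else b

    offspring = []
    while len(offspring) < len(population):
        p1, p2 = tournament(), tournament()
        cut = random.randrange(1, len(p1))             # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [random.choice(all_nodes)              # mutation
                 if random.random() < mutation_rate else gene
                 for gene in child]
        offspring.append(child)

    # Replacement: the fittest individuals among parents and offspring survive.
    return sorted(population + offspring, key=fitness)[:len(population)]
```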

Residual energy is denoted as \({E}_{\mathrm{res}}\), and the number of cluster heads is represented by \({N}_{\mathrm{CH}}\). The total intra-cluster communication distance is indicated as \({C}_{\mathrm{intra}}\), while the total distance from CHs to the base station is denoted as \({C}_{\mathrm{base}}\). These two distances trade off against each other: with fewer cluster heads, the total distance from the cluster heads to the BS is shorter, but the total intra-cluster communication distance is longer; conversely, a higher density of cluster heads lowers the total intra-cluster distance but increases the total distance from the CHs to the BS. The fitness function, denoted as \({\mathrm{CH}}_{\mathrm{fit}}\), is defined as follows:

$${\mathrm{CH}}_{\mathrm{fit}}=\frac{1}{{E}_{\mathrm{res}}}+\frac{1}{N-{N}_{\mathrm{CH}}}+\frac{{C}_{\mathrm{intra}}}{N}+\frac{{C}_{\mathrm{base}}}{N}+\frac{{C}_{\mathrm{intra}}}{{C}_{\mathrm{base}}}$$
(8)

where \(N\) represents the total number of network nodes. The fitness function emphasizes the importance of reducing the overall distance between cluster heads and the base station; since higher residual energy and shorter distances both shrink its terms, lower values of \({\mathrm{CH}}_{\mathrm{fit}}\) indicate fitter configurations.
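A direct transcription of Eq. (8) is given below; the argument names mirror the symbols defined above.

```python
# Fitness function CH_fit of Eq. (8); lower values indicate fitter
# cluster-head configurations. Requires n_ch < n_total and c_base > 0.

def ch_fitness(e_res: float, n_ch: int, c_intra: float,
               c_base: float, n_total: int) -> float:
    return (1.0 / e_res
            + 1.0 / (n_total - n_ch)
            + c_intra / n_total
            + c_base / n_total
            + c_intra / c_base)
```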

The DRLEER method combines the learning capabilities of the RL agent with the optimization power of the genetic algorithm to identify optimal routing paths that minimize energy consumption and prolong the network lifespan. The proposed method is evaluated against various existing routing protocols, demonstrating significant improvements in network efficiency, energy consumption, and overall performance.

By incorporating RL with GA, the clustering process in WSNs can be further optimized. The RL agent aids in selecting the most effective cluster head for each cluster, while GA optimizes the number and choice of cluster heads. This combination results in a more energy-efficient and reliable routing solution, ultimately enhancing the network lifetime and performance.

In the proposed DRLEER method, nodes communicate with the base station by sending energy and position data. The base station optimizes the number and selection of cluster heads using the GA. The base station then informs all nodes about the chosen cluster heads, and a time-division multiple access (TDMA) schedule is created for each cluster. Nodes only awaken to relay sensed data to the appropriate cluster head during their time slot in the TDMA schedule; otherwise, they remain in sleep mode. Clustering is performed again once the round-time is over. The proposed DRLEER method pseudocode is provided as follows.

Algorithm 1. DRLEER pseudocode
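Since the original pseudocode appears only as a figure, the following is a hedged Python reconstruction of the round structure described in this section; all object and method names (e.g., select_cluster_heads, build_tdma_schedule) are hypothetical stand-ins, not the authors' code.

```python
# Hedged reconstruction of the DRLEER round loop, based on the prose above.
# All interfaces here are hypothetical: nodes report state, the BS runs the
# GA (guided by RL-learned Q-values) to pick CHs, members follow a TDMA
# schedule, and the agent learns from the observed energy costs.

def drleer_rounds(nodes, base_station, ga, rl_agent, rounds: int) -> None:
    for _ in range(rounds):
        # 1. Alive nodes report energy and position to the BS.
        reports = [(n.id, n.energy, n.position) for n in nodes if n.energy > 0]

        # 2. BS optimizes the number and choice of CHs via the GA,
        #    guided by the RL agent's Q-values (Eqs. 7 and 8).
        cluster_heads = ga.select_cluster_heads(reports, rl_agent)

        # 3. BS broadcasts the CH list and builds a TDMA schedule per cluster.
        schedule = base_station.build_tdma_schedule(nodes, cluster_heads)

        # 4. Members wake only in their slot to send data to their CH and
        #    sleep otherwise; CHs aggregate and forward packets to the BS.
        for slot, node in schedule:
            node.transmit_to_ch(slot)
        for ch in cluster_heads:
            ch.aggregate_and_forward(base_station)

        # 5. The RL agent updates its Q-values from the round's energy costs,
        #    and clustering is repeated in the next round.
        rl_agent.update_from_round(nodes, cluster_heads)
```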

Result and Discussion

This section evaluates the performance of the proposed DRLEER algorithm in terms of network lifetime, energy consumption, and throughput. The proposed algorithm employs RL with the GA optimization technique to select the optimal CH, enhancing network longevity while minimizing energy usage. During the simulation process, we consider networks with varying numbers of nodes (N = 100, 200) distributed across an area A in square meters (A = 100 × 100). Initially, various network node factors such as node coordinates, energy consumption, collision rate, and transmission delay are considered. Table 1 presents detailed information on the simulation parameters and their corresponding values, which have been carefully selected for the specific simulation settings.

Table 1 Simulation specification

Figure 5a, b illustrate the cluster formation in wireless sensor network (WSN) simulations for scenarios with 100 nodes and 200 nodes, respectively. These figures display how nodes in the network are organized into clusters, with each cluster having a cluster head (CH) responsible for collecting and aggregating data from its member nodes. Forming these clusters is a crucial aspect of WSN design as it helps reduce individual nodes' energy consumption and enhances overall network efficiency. The simulation results demonstrate that the WSN cluster formation is successful for both 100-node and 200-node scenarios, which is a vital prerequisite for any WSN-based application.

Fig. 5. Cluster formation scenarios: a N = 100, b N = 200

This study compares the proposed protocol with other clustering protocols, such as group-based energy-efficient clustering (GEEC), distributed energy-efficient clustering (DEEC), and enhanced distributed energy-efficient clustering (E-DEEC), to evaluate its energy efficiency and network lifespan. Several metrics were considered for comparison, including (i) the number of active nodes in each round, which also serves as an indicator of network longevity, (ii) energy consumption in each round, (iii) total throughput, and (iv) end-to-end delay.

Figure 6a, b display the network lifespan of the proposed protocol for varying numbers of sensor devices (N = 100, N = 200) and compare its performance to GEEC, DEEC, E-DEEC, and DNN to demonstrate its effectiveness. The proposed protocol outperforms the existing ones, keeping more nodes alive over extended network lifetimes. To maximize the network lifespan, both the hop count and the residual energy were considered, as large distances require more energy for data transmission.

Fig. 6. Number of alive sensor nodes vs. number of rounds: a N = 100, b N = 200

Figure 7a, b evaluate the effectiveness of the proposed protocol in terms of the number of rounds before a node dies, comparing it to GEEC, DEEC, E-DEEC, and DNN. The results indicate that the proposed protocol surpasses the other protocols, with fewer nodes dying throughout the network's lifetime. In simulations with both N = 100 and N = 200 sensor devices, the proposed DRLEER demonstrates superior performance in terms of network longevity compared to other protocols.

Fig. 7. Number of rounds vs. the number of dead sensor nodes: a N = 100, b N = 200

Network delay, a performance metric, measures the average end-to-end delay of data packet transmission. The end-to-end delay is the typical time elapsed between a packet's initial transmission by the source and its successful reception at the destination. This delay measurement considers queuing and packet propagation delays. Figure 8a, b present simulation results for two different scenarios of node deployment, comparing the network delay performance of the proposed algorithm with four other existing techniques mentioned earlier. The findings indicate that DRLEER achieves the lowest delay compared to the existing algorithms.

Fig. 8. End-to-end delay scenarios: a N = 100, b N = 200

Figure 9 plots energy usage against the number of rounds. It shows that the proposed method consumes less energy than the four existing approaches, demonstrating the effectiveness of the proposed algorithm in terms of network delay and energy efficiency and making it a viable option for wireless sensor networks.

Fig. 9. Energy consumption vs. number of rounds

Figure 10 demonstrates the average improvement achieved by the proposed method in transmitting data packets to the sink. The progress in data packet transfer made by other existing algorithms, such as GEEC, DEEC, and E-DEEC, is slower. In contrast, the efficient clustering of the RL with GA algorithm has resulted in a more effective data transfer outcome. Moreover, the proposed DRLEER has accomplished packet transfer at a faster rate without incurring any data transmission loss.

Fig. 10. Data packets transferred to the base station vs. the number of rounds

The study accounts for both the hop count and the residual energy to maximize the network's lifespan, as high energy consumption can lead to a shorter network lifespan. The findings suggest that the proposed DRLEER significantly enhances the network's energy efficiency and extends its lifespan, making it a promising option for wireless sensor networks.

Conclusion

This paper presented a deep reinforcement learning energy-efficient routing (DRLEER) technique as a low-redundancy prediction model for IoT-enabled WSNs. In this approach, a cluster-based, energy-efficient, reinforcement learning method identifies the optimal path for data transmission, considering both low energy consumption and increased network lifespan. The DRLEER technique consists of three stages: network configuration and cluster head selection, cluster construction, and data transfer.

In the first stage, both hop count and initial energy are considered to calculate the initial Q-value used for cluster head selection. During the second stage, each cluster head sends an invitation to any device within its transmission range. Nodes that are distant from the base station connect with the nearest cluster head. In the final phase, the learning-driven data transmission process offers an energy-efficient routing that considers both the hop count and the remaining energy of the devices for determining routing selections. Moreover, a threshold for energy is established for cluster head replacement.

Evaluation outcomes demonstrate that DRLEER outperforms existing systems in terms of energy usage and network lifetime. The lightweight RL approach employed in this study enables faster performance and reduced energy consumption. The results indicate that the DRLEER algorithm significantly improves the overall energy efficiency and longevity of IoT networks, making it a promising solution for various IoT applications. Future research could explore additional optimization techniques and reinforcement learning algorithms to further enhance the proposed method and adapt it to various IoT application scenarios. Moreover, considering other factors for creating a more effective routing system is a potential avenue for future work.