1 Introduction

In general terms, the main objective of an energy-efficient coverage-connectivity algorithm in a wireless sensor network (WSN) is to monitor the area of interest and transmit the sensing results to the base station with a minimum number of sensor nodes [1, 2]. To accomplish this objective, three types of solutions have been proposed in the literature: node deployment optimization [3, 4], node scheduling [4,5,6], and cover-set formation [7, 8]. Node deployment optimization controls the number of sensor nodes to be deployed. A WSN comprises two types of sensor nodes: static and mobile. A WSN built from static sensor nodes is inflexible because it cannot handle topology changes caused by sensor node failure. In contrast, a WSN with mobile sensor nodes is flexible because of its dynamic behavior. In real-time applications of WSNs such as border surveillance, harsh geographic surveillance, and disaster management, the number of deployed nodes cannot be controlled. To handle such situations, node activity scheduling protocols have been proposed [4,5,6,7,8]. According to these protocols, a node can be in any one of three modes: active, hibernate, or sleep. The active nodes in each scheduling round form a cover set. The objective of each cover set is to provide maximum coverage and connectivity with a minimum number of sensor nodes.

These node scheduling protocols have disadvantages in terms of unwanted computational energy consumption and resource utilization: they tolerate the overhead of inter-node communication before the actual scheduling phase and must store the scheduling information so that nodes can enter sleep mode completely; otherwise the nodes have to remain in hibernate mode. In hibernate mode, a node can turn off its processing unit but cannot turn off its radio (transceiver). Therefore, there is a need for a node scheduling protocol that can intelligently manage node activity scheduling. This problem can be solved efficiently with the help of machine learning algorithms, which enable a sensor node to learn its action without being explicitly programmed. This is one application of computational intelligence in WSNs. Computational intelligence is also a boon for other WSN challenges such as coverage and connectivity maintenance [9,10,11,12], data aggregation [13,14,15], cluster formation [15, 16], energy management [17], routing [18], node duty cycling at the MAC layer [19, 20], and security [21]. It provides autonomy, flexibility, and robustness to WSNs.

Kulkarni et al. [22] have conducted an extensive survey of computational intelligence in WSNs and presented the work of various researchers in this field in a concise manner. First, the challenges in WSNs, such as the wireless ad hoc nature, energy limitation, physical distribution, localization, quality-of-service management, and security, are explained well. A brief overview of various computational intelligence models, such as artificial neural networks (ANN), fuzzy logic, evolutionary algorithms, swarm intelligence, and reinforcement learning, is then presented. The paper also explains how computational intelligence has been hybridized to mitigate the various challenges of WSNs. To the best of our knowledge, it is the most suitable survey for beginners who want to pursue research in the field of computational intelligence in WSNs.

A wireless sensor network comprises a set of sensor nodes, where each node is considered an agent for intelligent computation and has a learning ability [10, 22]. Therefore, a WSN is regarded as a multiagent system. Reinforcement learning is particularly suitable in such a scenario because it does not require any model of the environment. Reinforcement learning (RL) is a type of machine learning in which an agent learns and takes actions during its learning process to gain a long-term reward [23]. A reward can be either positive or negative. The reward function optimizes the correctness of action selection in a given state. The identification of relevant actions in a particular state is only possible after executing all state-action combinations. Out of n executions, only the m combinations that maximize the global reward are retained. The execution of these combinations is modeled as a Markov Decision Process (MDP). An MDP is a discrete-time stochastic control process that provides mathematical modeling for decision making where outcomes are completely or partially random [24]. The environment of RL is modeled as an MDP.

Reinforcement learning applies to both single-agent and multiagent systems. As stated earlier, a WSN is a multiagent system because of the autonomous distribution of sensor nodes, with each sensor node acting as an agent. The potential for extensive learning and the decomposition of a complex problem for better rewards are the key advantages of multiagent reinforcement learning [25]. However, in a multiagent system the agents initially have little knowledge about their neighboring agents; they first learn about each other and then behave appropriately. Reinforcement learning has been tailored to various WSN algorithms such as data aggregation, routing, dynamic channel allocation, and resource management [26]. Yau et al. [27] have briefly summarized the applications of reinforcement learning in wireless sensor networks. This paper presents an implementation of one reinforcement learning algorithm, Nash Q-learning, to solve the problem of coverage and connectivity. The overall objective of reinforcement learning in a WSN is to take the right decision at the right time.

1.1 Contribution

This paper focuses on the problem of energy-efficient coverage and connectivity maintenance. From the literature, it can be concluded that relatively little research has been done on coverage and connectivity maintenance using computational intelligence, although a considerable amount of work is currently being directed at embedding intelligence into diverse issues and applications of WSNs. Keeping these objectives in view, this paper proposes a coverage and connectivity maintenance protocol based on reinforcement learning (CCM-RL). The main contributions of this paper are listed below.

  1. The sensor nodes are deployed to achieve coverage up to a threshold level while ensuring at least one communication path among the active nodes. The protocol enables the sensor nodes to take the right decision at the right time.

  2. The learning ability embedded in each sensor node allows it to learn its best action to maximize the coverage rate and maintain connectivity among active nodes. The purpose is to activate only a subset of sensor nodes so that energy consumption is minimized. To fulfill this objective, the sensor nodes do not need to exchange notification messages; they autonomously learn their best action using the Nash Q-learning algorithm.

  3. The paper provides a mechanism to eliminate coverage redundancy by customizing the sensing range of overlapping sensor nodes after they learn their best action. This preserves WSN resources such as battery power and memory by mitigating redundant packet generation for the same spatial points, thereby avoiding network congestion and contention.

  4. CCM-RL preserves the desired coverage rate, maintains connectivity among sensor nodes, and minimizes the total energy consumption.

The rest of the paper is organized as follows. Section 2 presents a concise review of the algorithms proposed in the literature. The WSN model and related assumptions are explained in Sect. 3. Section 4 presents a concise overview of reinforcement learning, Nash Q-learning, and definitions related to reinforcement learning in WSNs. The proposed Coverage Connectivity Maintenance based on Reinforcement Learning (CCM-RL) protocol is explained in Sect. 5. Section 6 presents the performance evaluation of CCM-RL and its comparison with existing protocols. The conclusions are discussed in Sect. 7.

2 Literature review

Over the past two decades, a substantial amount of research has been done to improve the performance of wireless sensor networks in terms of various parameters such as coverage [1,2,3,4,5,6,7,8, 28], connectivity [1, 2, 28], network congestion [29], packet delivery rate [20], energy consumption [30], and security [31]. This paper emphasizes the coverage, connectivity, and energy consumption parameters. Coverage is categorized as target coverage, area coverage, and barrier coverage [32]. Network connectivity is considered by only a few coverage protocols, whereas energy consumption is computed for each protocol. This is because sensor nodes are tiny, resource-constrained devices with limited memory and battery power. The energy consumed by sensor nodes for sensing, computational processing, and communication has a direct and indirect impact on the WSN’s lifetime [33]. Many researchers have identified that random deployment, coverage redundancy, redundant communication, and idle listening are responsible for unwanted energy consumption and reduce the network lifetime. Sensor node activity scheduling is the prime solution, in which a subset of sensor nodes is active and ensures the coverage rate up to a threshold level as well as connectivity among sensor nodes, while the rest of the sensor nodes are in sleep mode to prevent unwanted energy consumption. Many researchers have contributed solutions to this problem. More et al. [34] have reviewed the key coverage protocols and illustrated various open challenges such as heterogeneous nodes with obstacles, node failure probability, limited node mobility, coverage degree, and optimization of the wake-up rate of sleeping nodes for energy-efficient coverage protocols. Cardei et al. [35] have proposed a solution to the problem of coverage redundancy in target coverage, stated as the Adjustable Range Set Cover (AR-SC) problem. According to them, coverage redundancy still exists after successful node activity scheduling.

Nowadays, to design smart networks and communications with resource optimization and energy management, intelligence is embedded in the sensor nodes. Although sensor nodes are resource-constrained devices, artificial intelligence and data mining techniques have given them the potential to reach a new level of computation, learning, and reasoning [20]. The first step of a learning algorithm requires information about the environment and the internal structure, and this information is provided by the sensor nodes. Several surveys provide a succinct understanding of how to hybridize reinforcement learning with wireless sensor networks [22, 26, 27, 36]. Seah et al. [9] have considered two main parameters, coverage and energy consumption, for a large sensor network. This is the first paper in which the authors solve coverage and energy consumption issues using a reinforcement learning algorithm. The authors propose a Q-learning based coordinated algorithm (COORD) to identify the coordination among sensor nodes so that area coverage can be maximized and total energy expenditure minimized. The deployed sensor nodes form a multiagent system. Each autonomous agent has its own reward function and thus aims at maximizing its discounted reward. The learning steps are rewarded based on the action taken by a sensor node to cover the maximum number of grid points.

The Probabilistic Coverage Protocol (PCP) [37] has been proposed for node activity scheduling, where the sensor nodes are assumed to be located at a separation s from each other and to have a probabilistic sensing range. The main advantage of PCP is that it can perform coverage computation with both binary and probabilistic sensing models. The Scheduling Algorithm based on Learning Automata (SALA) [10] has been proposed to cover periodic events and dynamic target points with a minimum number of sensor nodes, where the periodic event distribution follows a Poisson distribution. In SALA, each sensor node is equipped with a set of learning automata that enable it to learn the maximum sleep time without compromising the dynamic target detection rate; the learned sleep time depends on the target trajectory. The key emphasis of SALA is to cover dynamic events and target points with a minimum number of sensor nodes while maximizing the detection rate and reducing energy consumption. However, SALA does not show whether the active sensor nodes in each scheduling round are connected among themselves and whether network partitioning is avoided.

Mohamadi et al. [38] have proposed a scheduling algorithm based on learning automata in which the sensor nodes learn the appropriate sensing range to cover all target points in each cover set and maximize the network lifetime. They state this problem as Maximum Network Lifetime with Adjustable Sensing Range (MNLAR). The sensor nodes are equipped with learning automata based on pruning rules to learn their appropriate sensing range and form a cover set in each scheduling round. Sleep scheduling with Predictive Coverage Redundancy Check (SPRC) [39] has been proposed to address the coverage redundancy caused by the random deployment of abundant sensor nodes to achieve maximum coverage; in practice this wastes network resources and increases the network cost, and the scheduling algorithms have to perform repeated coverage redundancy checks. Therefore, SPRC proposes an analytical model to predict the need for a coverage redundancy check before each scheduling round and thus minimize the computational scheduling energy consumption. Reinforcement Learning based Sleep Scheduling for Coverage (RLSSC) [11] has been proposed to optimize the selected actions of the sensor nodes in each scheduling round using Q-learning, following the conventional approach to scheduling node activities. RLSSC is a two-step scheduling algorithm: in the first step it identifies the coverage redundancy among sensor nodes, and in the second step the sensor nodes learn their best action using Q-learning. These algorithms focus entirely on maximizing coverage and minimizing energy consumption; however, the connectivity among nodes required for successful communication is completely ignored.

Mostafaei et al. [12] have proposed a Partial Coverage with Learning Automata (PCLA) protocol, which focuses on issues such as partial coverage and continuous monitoring of target points. The main objective of PCLA is to minimize the number of active sensor nodes while providing the desired area coverage and connectivity among sensor nodes. PCLA provides a probabilistic framework to select the most eligible sensor nodes for the desired area coverage and network connectivity. Although PCLA has been designed for large, scalable wireless sensor networks, it does not present any solution for handling coverage redundancy.

Coverage Contribution Area (CCA) [5] is a centralized and distributed coverage protocol proposed to solve the k-coverage problem for randomly deployed sensor nodes. The residual energy and the spatial density of a sensor node are the eligibility criteria for it to become active in a scheduling round. However, CCA neither incorporates any learning mechanism for node activity scheduling nor provides any solution to preserve connectivity among sensor nodes. A brief parametric comparison between CCM-RL and existing machine learning based protocols that solve the problem of coverage and connectivity is presented in Table 1.

Table 1 Parametric comparisons between protocols

3 Network model

In this paper, the sensor nodes are deployed to cover a hostile region up to a threshold level and maintain network connectivity. Figure 1 represents the network model of the WSN. The sensor nodes are randomly deployed inside a remote hostile region. At the time of deployment, the sensor nodes have uniform sensing and communication ranges. The hostile region also contains obstacles that can disrupt the sensing and communication capability of the sensor nodes. The nodes nearer to the base station communicate with the base station. This network model is also applicable to barrier coverage, where the scheduling algorithm divides the sensing task among the sensor nodes [40].

Fig. 1 Wireless sensor network model

This paper proposes a node activity scheduling protocol in which each sensor node autonomously learns its best action through reinforcement learning in each scheduling round. To do so, each sensor node is equipped with learning ability. In Algorithm 1, for a scheduling round, the sensing range of the sensor nodes is customized according to the requirement and the resultant sensing range is denoted Rfinal. To achieve this objective, the network is designed with the following assumptions.

  1. A wireless sensor network is built to perform monitoring and surveillance tasks in a remote hostile region. It comprises a set of N static sensor nodes that are randomly deployed with high density inside a sensing region of area As.

  2. Each sensor node Si (i = 1, 2, 3, …, N) is aware of its location through a predefined localization algorithm [41].

  3. At the beginning of the algorithm, the sensor nodes have a uniform sensing range Rs and communication range Rc, where Rc is at most twice the sensing range (Rc ≤ 2Rs). The coverage area CAi provided by Si is represented as a circle.

  4. The sensing region of area As is divided into G square grids of side length l (l = 2 × Rs), where \(G = \frac{A_{s}}{l \times l}\).

  5. Initially, all deployed sensor nodes are active and have the same initial energy E. This paper adopts a simple energy consumption model [42] to process and communicate b-bit packets over a distance d, d \(\le\) Rc. A minimal sketch of these assumptions is given after this list.
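The following Python sketch illustrates how this network model might be encoded; the `Node` fields, helper names, and radio-energy constants are illustrative assumptions rather than values taken from [42].

```python
import random
from dataclasses import dataclass

# Illustrative radio-energy constants (assumptions, not taken from [42])
E_ELEC = 50e-9    # J/bit consumed by transmitter/receiver electronics
E_AMP = 100e-12   # J/bit/m^2 consumed by the transmit amplifier

@dataclass
class Node:
    node_id: int
    x: float
    y: float
    rs: float          # sensing range, initially the uniform Rs
    rc: float          # communication range, Rc <= 2*Rs
    energy: float      # residual energy, initially E
    active: bool = True

def deploy(n, side, rs, rc, energy):
    """Randomly deploy N static nodes inside a side x side sensing region."""
    return [Node(i, random.uniform(0, side), random.uniform(0, side),
                 rs, rc, energy) for i in range(n)]

def grid_count(area, rs):
    """Assumption 4: G = A_s / (l * l) with grid side l = 2 * Rs."""
    l = 2 * rs
    return area / (l * l)

def tx_energy(bits, d):
    """First-order radio model: energy to send b bits over distance d <= Rc."""
    return bits * E_ELEC + bits * E_AMP * d ** 2

def rx_energy(bits):
    """Energy to receive b bits."""
    return bits * E_ELEC
```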

4 Reinforcement learning

The objective of reinforcement learning in a WSN is to autonomously maintain the coverage and connectivity of randomly deployed sensor nodes. The best combination of state and actions is selected from a defined set of states and actions. Reinforcement learning is applicable because of the lack of training data and of sufficient expertise about the constraints and issues of a WSN. Therefore, the sensor nodes learn what to do and how to form an optimal combination of state and action to maximize the numerical reward. Learning depends on two notions, exploration and exploitation. Exploration means going through the entire environment to capture more information that can help in reward maximization, whereas exploitation means utilizing already known information to maximize the rewards. In a WSN, the network designers have a clear idea about the desired output, and RL achieves that desired output in terms of rewards.

In reinforcement learning, the environment presents states to the agents, and the agents then perform suitable actions to maximize the global reward. Reinforcement learning is modelled as a Markov Decision Process (MDP). The schematic diagram of reinforcement learning is shown in Fig. 2.

Fig. 2 Reinforcement learning and its environment

4.1 Nash Q-learning

Q-learning is a model-free reinforcement learning algorithm. The purpose of the Q-value in Q-learning is to learn an optimal policy for an agent to choose the best action that maximizes the overall reward. The change in the state of the environment is entirely based on the agent’s current action. The Q-learning algorithm has proven to be a boon for single-agent systems because of its simple updating approach, in which an agent starts with random initial values of Q(s,a). The Q-values at any discrete time step t (t = 1,2,3,…,n) are updated using Eq. 1 and stored in the form of a matrix.

$$Q_{t + 1} (s_{t}, a_{t}) = (1 - \alpha_{t})\, Q_{t} (s_{t}, a_{t}) + \alpha_{t} \left[ r_{t} + \gamma \max_{a} Q_{t} (s_{t + 1}, a) \right]$$
(1)

In Eq. 1, s \(\in\) S is a state from the set of states, a \(\in\) A is an action from the set of actions, and rt is the reward obtained while transiting from state st to st+1. α is the learning rate (0 < α \(\le\) 1) and γ is the discount factor (0 < γ \(\le\) 1). The learning rate defines the extent to which new information overrides previous information: at α = 0 the agent learns nothing, and at α = 1 the agent considers only the most recent information. A small value of α (α > 0) favours exploration; therefore, in a stochastic environment a value of α = 0.1 is used for exploration. The discount factor defines the importance of future rewards; a larger value of γ indicates that the agent explores the entire environment.
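For reference, the following is a minimal tabular sketch of the single-agent update in Eq. 1; the state/action encodings and the chosen α and γ values are illustrative assumptions.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (0 < alpha <= 1); a small value favours exploration
GAMMA = 0.9   # discount factor (0 < gamma <= 1); larger values weight future rewards

Q = defaultdict(float)   # Q[(state, action)], implicitly initialised to 0

def q_update(state, action, reward, next_state, actions):
    """One step of Eq. 1: Q(s,a) <- (1-a)Q(s,a) + a[r + g * max_a' Q(s',a')]."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] \
                         + ALPHA * (reward + GAMMA * best_next)
```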

In real time, there are many complex problems that require multiple agents to achieve global optimality. Therefore, extended Q-learning algorithms such as Distributed Q-learning [43], Friend-or-Foe Q-learning [44], Nash Q-learning [23], Correlated Q-learning [45], Minimax Q-learning [46], and Optimal Adaptive Learning (OAL) [47] have been proposed to tackle multiagent systems. In multiagent reinforcement learning (MARL), the agents learn by interacting dynamically with their environment, as shown in Fig. 2. The global state of the environment is governed by the actions of all agents. One challenge in MARL relates to the behavior of agents, which can be cooperative, competitive, or neutral [25].

Hu et al. [23] have extended single-agent Q-learning to multiagent systems in a stochastic environment. The authors propose a Q-learning based distributed non-cooperative multiagent reinforcement learning algorithm in which each agent simultaneously and independently tries to maximize its expected sum of discounted rewards. These discounted rewards are represented by Nash Q-values, and each agent follows Nash equilibrium strategies for the next time interval. For a multiagent system, the very first step is to recognize the joint actions for a given state, so the Q-function is Q(s, a1, a2,…,an) instead of Q(s,a). The future rewards are based on the agents’ joint optimal strategy.

In the Nash Q-learning algorithm, the agents are indexed by i. The first step is to assume random Q-values at t = 0. At time t, agent i identifies its current state and takes an action accordingly to obtain a positive reward. Afterward, it observes the actions taken by the other agents, their rewards, and the next state s*. Agent i then updates its Q-value using Eq. 2.

$$Q_{t + 1}^{i} (s, a_{1}, a_{2}, \ldots, a_{n}) = (1 - \alpha_{t})\, Q_{t}^{i} (s, a_{1}, a_{2}, \ldots, a_{n}) + \alpha_{t} \left[ r_{t}^{i} + \gamma\, NQ_{t}^{i} (s^{*}) \right]$$
(2)

where α is the learning rate (0 < α \(\le\) 1) and γ is the discount factor (0 < γ \(\le\) 1). NQit(s*) is given by Eq. 3.

$$NQ_{t}^{i} (s^{*}) = \pi^{1} (s^{*}) \cdots \pi^{n} (s^{*}) \cdot Q_{t}^{i} (s^{*})$$
(3)

The Nash Q-learning algorithm incorporates joint actions and updates the Q-values according to the Nash equilibrium strategy specified by all agents over the current Q-values. Using Nash Q-learning, the agents achieve global optimality faster than with single-agent Q-learning.
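A hedged sketch of the update in Eqs. 2 and 3 for one agent i is given below; the equilibrium strategies π1,…,πn are assumed to be supplied by a separate stage-game solver, which is not reproduced here.

```python
import numpy as np

def nash_q_update(Q_i, state, joint_action, reward_i, next_state,
                  strategies, alpha=0.1, gamma=0.9):
    """
    Eq. 2: Q_i(s, a1..an) <- (1 - alpha) Q_i(s, a1..an) + alpha [r_i + gamma NQ_i(s*)]
    Eq. 3: NQ_i(s*) = pi_1(s*) ... pi_n(s*) . Q_i(s*)

    Q_i        : dict {(state, joint_action_tuple): value} for agent i
    strategies : per-agent dicts pi_k mapping each action to its probability at s*
                 (assumed to form a Nash equilibrium of the stage game at s*)
    """
    # NQ_i(s*): expectation of Q_i over the product of the agents' strategies
    nq = 0.0
    for (s, actions), value in Q_i.items():
        if s != next_state:
            continue
        prob = np.prod([strategies[k].get(a, 0.0) for k, a in enumerate(actions)])
        nq += prob * value

    old = Q_i.get((state, joint_action), 0.0)
    Q_i[(state, joint_action)] = (1 - alpha) * old + alpha * (reward_i + gamma * nq)
```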

4.2 Formal definitions

Definition 1

An agent takes a suitable action in discrete time steps t = 0, 1, 2, …, n. Each sensor node autonomously acts as an agent.

Definition 2

The entire sensing area As is the environment. It provides state information to the agents (sensor nodes) residing inside the sensing area As. The environment takes an agent’s current state and action as input and returns a numerical reward as output, as shown in Fig. 2.

Definition 3

A local state SL represents information that is local to an individual sensor node. For coverage, there are two local states of a sensor node: {SL1 = coverage redundancy, SL2 = isolated}; a sensor node can be in only one of these states. For network connectivity maintenance, the two local states of a sensor node are {CSL1 = connected to a 1-hop neighbor, CSL2 = not connected}.

Definition 4

Global state SG represents the global objective of the multi-agent system. For coverage {SG = coverage rate} and for connectivity {CSG = Connectivity among sensor nodes}.

Definition 5

Action A is the set of all possible activities an agent can perform (A \(\in\) Ai). For coverage the possible set of actions is {A1 = active, A2 = sleep, A3 = customize the sensing range}, and for connectivity it is {CA1 = active, CA2 = hibernate}. Actions are selected using either the \(\epsilon\)-greedy method or Boltzmann exploration [25]. \(\epsilon\)-greedy chooses the best action with probability 1 − \(\epsilon\) or a random action with probability \(\epsilon\), whereas Boltzmann exploration uses a temperature parameter T to balance exploration and exploitation [24]. In this paper, the actions are selected using the \(\epsilon\)-greedy method, as sketched below.
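A minimal sketch of \(\epsilon\)-greedy selection over the coverage and connectivity action sets defined above; \(\epsilon\) is an assumed tuning parameter, and the Q-table layout follows the sketch in Sect. 4.1.

```python
import random

COVERAGE_ACTIONS = ["active", "sleep", "customize_sensing_range"]   # A1, A2, A3
CONNECTIVITY_ACTIONS = ["active", "hibernate"]                      # CA1, CA2

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick the best known action with probability 1 - epsilon, else explore."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit
```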

Definition 6

Reward r is the feedback by which the success or failure of an agent’s selected action is measured; for example, the coverage rate provided by the active sensor nodes is a reward. If the coverage rate Cr is greater than or equal to the threshold coverage rate \(\tau\) (Cr \(\ge \tau\)), the reward is positive; otherwise it is negative. Rewards can be categorized as local and global: the coverage area provided by an agent is the local reward, and the coverage rate provided by the active sensor nodes in one scheduling round is the global reward. The illustration of local and global rewards is shown in Fig. 3.

Fig. 3 Representation of local reward and global reward

Definition 7

Coverage rate Cr is the ratio of the total area covered by the active sensor nodes to the sensing area As.

$$C_{r} = \frac{\sum\nolimits_{i = 1}^{N_{active}} \pi R_{final_{i}}^{2}}{A_{s}}$$
(4)
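A sketch of Eq. 4 and of the reward sign in Definition 6 is shown below; it reuses the `Node` fields from the network-model sketch in Sect. 3 and, like Eq. 4, sums the individual disc areas, so overlap between sensing discs is ignored.

```python
import math

def coverage_rate(nodes, area):
    """Eq. 4: area covered by the active nodes divided by the sensing area A_s."""
    return sum(math.pi * n.rs ** 2 for n in nodes if n.active) / area

def global_reward(nodes, area, tau):
    """Definition 6: positive reward if Cr >= tau, negative otherwise."""
    cr = coverage_rate(nodes, area)
    return (1.0 if cr >= tau else -1.0), cr
```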

Definition 8

Network connectivity Nc states that there must exist at least one communication link between active sensor nodes.

Definition 9

For coverage maintenance, if the distance d between Si and Sj is less than 2Rs (d(Si, Sj) < 2Rs), then Sj is said to be the neighboring node SNN of Si. For connectivity maintenance, if the distance d between Si and Sj is less than 2Rc (d(Si, Sj) < 2Rc), then Sj is said to be the 1-hop neighboring node of Si.

Definition 10

Convergence time CT is the time taken by the learning algorithm to achieve its global optimality.

Definition 11

The active node ratio is defined as the ratio of the number of active nodes Nactive to the total number of deployed sensor nodes N.

Definition 12

Policy π is a probabilistic mapping from states to actions. It is the approach an agent applies to identify the next best action based on the current state.

Definition 13

A threshold value sets a level for achieving a particular goal. In this paper, the threshold value \(\tau\) represents the minimum coverage rate to be achieved; the active sensor nodes must provide coverage up to this threshold level.

5 Coverage connectivity maintenance based on reinforcement learning (CCM-RL) protocol

This section presents the complete description of the proposed CCM-RL protocol, an energy-efficient, reinforcement learning based node activity scheduling algorithm for randomly deployed sensor nodes. CCM-RL helps in maintaining the desired coverage rate and the connectivity provided by the active sensor nodes in each scheduling round. Nash Q-learning [23], an algorithm for multiagent reinforcement learning, is applied for node activity scheduling. A WSN comprises N static sensor nodes, each acting as an agent; because of these N agents, the WSN is a multiagent system. It is difficult and time-consuming to achieve a globally optimal solution with single-agent Q-learning; therefore, the Nash Q-learning algorithm is applied to minimize the convergence time. Along with node activity scheduling, CCM-RL also addresses two other issues of random deployment, namely coverage redundancy and partial coverage.

5.1 CCM-RL protocol description

The CCM-RL protocol comprises two phases: (1) a learning phase for coverage maintenance, and (2) a learning phase for connectivity maintenance. The complete description of these two phases is presented in Sects. 5.1.1 and 5.1.2.

5.1.1 Reinforcement learning for coverage maintenance

The learning process in a sensor node starts immediately after the random deployment of the sensor nodes, and the agents explore the entire environment. At time t = 0, the learning rate α, the discount factor γ, and random Q-values are initialized. The learning process then starts at time t from a randomly selected sensor node Si. Si identifies all of its neighboring nodes SNN and observes its local state SL (SL \(\in\) {SL1, SL2}). SL = SL1 indicates that Si is in the coverage redundancy state; Si then computes the amount of coverage redundancy with its neighboring nodes using the Euclidean distance formula. Afterward, Si chooses the best action A \(\in\) {A1 = active, A2 = sleep, A3 = customize the sensing range} to mitigate the coverage redundancy. If the selected action is either A1 or A3, the coverage area CAi \((CA_{i} = \pi R_{final_{i}}^{2})\) provided by Si is assigned as a positive reward. SL = SL2 indicates that Si is in an isolated state; in this case Si need not perform any action, and its coverage area CAi \((CA_{i} = \pi R_{s_{i}}^{2})\) is assigned as a positive reward. The Q-values are updated iteratively using Eq. 2 to achieve global optimality. Figure 4 represents the flow chart of this description.

Fig. 4 Learning phase for coverage maintenance using Nash Q-learning

If the global reward is greater than the threshold level after scanning the N sensor nodes, the learning is said to have converged, and the time required to achieve this is the convergence time; otherwise, the entire process is repeated. Achieving a coverage rate Cr ≥ τ is the global optimality criterion of the learning algorithm. The output of this phase is the total number of active sensor nodes and their coverage rate, which becomes the input for the connectivity maintenance phase, as shown in Fig. 5.

Fig. 5 Learning procedure for connectivity maintenance using Nash Q-learning

The pseudocode of node activity scheduling for coverage maintenance is presented in Algorithm 1, which describes the step-by-step procedure for one scheduling round; similar steps are executed for the remaining scheduling rounds. The sensor nodes that are active in one scheduling round are put into sleep mode for the next scheduling round. At the beginning of the algorithm, all important parameters are initialized and each sensor node is assigned a unique ID S_ID. The sensor nodes Si with their unique IDs are stored in active_sensors and uncovered_set, and initially all sensor nodes are active. At the beginning of a scheduling round, a random sensor node Si is selected and all of its neighbors are identified. The neighboring nodes SNN are stored in a list in ascending order of their distance from Si, and Si picks the first node from the list. Coverage redundancy exists because of the small distance between Si and SNN. The algorithm then computes a value x by which the sensing range is customized, which helps in mitigating the coverage redundancy. If the total coverage area provided by Si and SNN after reducing their sensing ranges by x units is larger than max(coverage(Si), coverage(SNN)), then the sensing ranges of both sensor nodes are reduced by x units and Si and SNN are kept in active mode; otherwise, the sensor node that provides the minimum coverage is put into sleep mode. This process is repeated for all deployed sensor nodes. At the end of the scheduling round, the coverage rate is computed. If the coverage rate provided by the active sensor nodes is greater than the threshold coverage rate \(\tau\), the scheduling round is saved; otherwise, Algorithm 1 is executed again starting from another randomly selected sensor node.

Algorithm 1 Node activity scheduling for coverage maintenance
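Since the pseudocode figure is not reproduced here, the following is a hedged Python sketch of the scheduling loop described above; the reduction step x, the restart limit, and the redundancy test are simplifying placeholders and not the authors' exact pseudocode. The `Node` fields follow the network-model sketch in Sect. 3.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two nodes."""
    return math.hypot(a.x - b.x, a.y - b.y)

def disc_area(node):
    return math.pi * node.rs ** 2

def coverage_schedule(nodes, area, tau, x=1.0, max_restarts=10):
    """
    Hedged sketch of Algorithm 1 for one scheduling round: resolve coverage
    redundancy between each node and its nearest active neighbour, either by
    shrinking both sensing ranges by x or by putting the weaker node to sleep,
    until the active nodes provide a coverage rate >= tau.
    """
    original_rs = [n.rs for n in nodes]
    cr = 0.0
    for _ in range(max_restarts):
        for n, rs0 in zip(nodes, original_rs):        # start each attempt fully active
            n.active, n.rs = True, rs0
        for si in random.sample(nodes, len(nodes)):   # random starting node Si
            if not si.active:
                continue
            neighbours = sorted(
                (sj for sj in nodes
                 if sj is not si and sj.active and distance(si, sj) < 2 * si.rs),
                key=lambda sj: distance(si, sj))
            if not neighbours:
                continue                              # isolated node: no action needed
            snn = neighbours[0]                       # closest redundant neighbour
            reduced = (math.pi * max(si.rs - x, 0.0) ** 2
                       + math.pi * max(snn.rs - x, 0.0) ** 2)
            if reduced > max(disc_area(si), disc_area(snn)):
                si.rs = max(si.rs - x, 0.0)           # customise both sensing ranges
                snn.rs = max(snn.rs - x, 0.0)
            else:
                min(si, snn, key=disc_area).active = False   # sleep the weaker node
        cr = sum(disc_area(n) for n in nodes if n.active) / area
        if cr >= tau:                                 # scheduling round accepted
            break
    return [n for n in nodes if n.active], cr
```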

5.1.2 Reinforcement learning for connectivity maintenance

The total number of active sensor nodes acquired from Algorithm 1 is the input for the connectivity maintenance phase. The main purpose of this phase is to maintain at least one communication link between active sensor nodes and to mitigate network partitioning. Figure 5 presents the flow chart for connectivity maintenance.

At time t = 0, the Q-values, the learning rate α, and the discount factor γ are initialized. At time t, the learning process starts with a randomly selected sensor node from active_sensors. The node identifies its 1-hop neighbors and performs a suitable action from the action set CA \(\in\) {CA1 = active, CA2 = hibernate}. If the active sensor node does not have any 1-hop neighbor, the most eligible node Seligible is identified from uncovered_set and a suitable action from CA is performed on it. The total number of active sensor nodes is the global reward obtained from this phase. Algorithm 2 presents the pseudocode of node activity scheduling for connectivity maintenance.

Algorithm 2 Node activity scheduling for connectivity maintenance
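A hedged sketch of the connectivity phase described above is given below; the paper does not fully specify the eligibility rule for Seligible, so the nearest sleeping node is used here as an assumed stand-in (the `distance` helper comes from the Algorithm 1 sketch).

```python
def connectivity_maintenance(active_nodes, all_nodes):
    """
    Hedged sketch of Algorithm 2: ensure every active node has at least one
    1-hop neighbour (d < 2*Rc, Definition 9); otherwise wake an eligible node.
    """
    uncovered = [n for n in all_nodes if not n.active]
    for si in list(active_nodes):
        one_hop = [sj for sj in active_nodes
                   if sj is not si and distance(si, sj) < 2 * si.rc]
        if one_hop or not uncovered:
            continue                                 # connectivity already satisfied
        eligible = min(uncovered, key=lambda sj: distance(si, sj))
        eligible.active = True                       # action CA1 on the eligible node
        active_nodes.append(eligible)
        uncovered.remove(eligible)
    return active_nodes        # the number of active nodes is the global reward
```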

Algorithm 3 presents the pseudocode for coverage connectivity maintenance based on reinforcement learning (CCM-RL). The rewards are observed using Algorithm 1 and Algorithm 2, and the Q-table is then updated using Eq. 2. Nash Q-learning for a multiagent system is explained in Sect. 4.1. The Q-table is updated iteratively to achieve the global optimality of rewards.

Algorithm 3 Coverage connectivity maintenance based on reinforcement learning (CCM-RL)
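A hedged, top-level sketch of how Algorithms 1 and 2 could be tied together and the Q-values updated (Eq. 2) follows; the per-node state and joint-action bookkeeping is reduced to a single entry per phase, and the Nash-value term is a placeholder rather than a full stage-game solution.

```python
def ccm_rl_round(nodes, area, tau, q_tables, alpha=0.1, gamma=0.9):
    """Sketch of Algorithm 3: run both learning phases, then update Q-values."""
    active, cr = coverage_schedule(nodes, area, tau)        # Algorithm 1
    active = connectivity_maintenance(active, nodes)        # Algorithm 2
    rewards = {"coverage": 1.0 if cr >= tau else -1.0,      # Definition 6
               "connectivity": float(len(active))}          # phase-2 global reward
    for node in nodes:
        q = q_tables.setdefault(node.node_id,
                                {"coverage": 0.0, "connectivity": 0.0})
        for phase, r in rewards.items():
            nash_value = q[phase]        # placeholder for NQ_i(s*) in Eq. 3
            q[phase] = (1 - alpha) * q[phase] + alpha * (r + gamma * nash_value)
    return active, cr
```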

5.2 Computational complexity

Computational complexity is defined in terms of the resources and computation time required to execute an algorithm. The computational complexity of Algorithm 1 and Algorithm 2 depends on the number of deployed sensor nodes N, which is the prime metric for coverage and connectivity maintenance. The computational complexity of Algorithm 1 is \(O(N^{3})\) and that of Algorithm 2 is \(O(N^{2})\), where N is the number of deployed sensor nodes. Section 6.6 presents the convergence time of Algorithm 1, Algorithm 2, and Algorithm 3 with respect to the number of sensor nodes.

6 Performance evaluation

In this section, the performance of CCM-RL is evaluated through extensive simulations performed in OMNeT++ and Python. The simulations are performed to ensure the accuracy and reliability of the CCM-RL protocol. The performance of CCM-RL is evaluated using seven parameters for both small- and large-scale WSNs: coverage rate, number of cover sets formed, average number of active sensor nodes for coverage, average number of active sensor nodes for connectivity maintenance, number of iterations to achieve global optimality, percentage of wrong decisions, and convergence time. The proposed protocol is compared with Coverage Contribution Area (CCA) [5], Partial Coverage with Learning Automata (PCLA) [12], and the Probabilistic Coverage Protocol (PCP) [37] using three parameters: average number of active sensor nodes, coverage rate, and average power consumption in each cover set. Figures 16, 17, and 18 show that CCM-RL performs better than CCA, PCLA, and PCP. Table 2 presents the list of parameters and their values used for simulation.

Table 2 List of parameters

6.1 Random deployment of sensor nodes and implementation of CCM-RL

A set of 100 sensor nodes with IDs S1, S2,…, S100 is randomly deployed inside a 2500 m2 sensing region. These sensor nodes have a uniform sensing range Rs = 15 m, and the coverage area of each sensor node is considered a circle. Figure 6 presents the deployment of these sensor nodes. Figure 7 shows the resulting number of active sensor nodes in one scheduling round after the implementation of Algorithm 1, Algorithm 2, and Algorithm 3: out of 100 nodes, only 25 are active. These nodes provide a coverage rate up to the threshold level and also ensure network connectivity. The activation of only 25% of the nodes in one scheduling round reduces the energy consumption rate. Moreover, Fig. 7 shows that after the execution of the algorithms, the sensing ranges of S17, S20, S21, S24, S36, S37, S38, S44, S47, S49, S61, S73, S77, S78, S81, S87, S90, and S96 are customized to mitigate the coverage redundancy among sensor nodes. The customization of sensing power also reduces the energy consumed for processing and communicating the data bits. The sensing ranges of S10, S41, S64, S67, S69, S76, and S97 remain the same.

Fig. 6 Deployment of 100 sensor nodes inside 2500 m2 sensing region

Fig. 7 Active number of sensor nodes after the implementation of Algorithm 1, 2, and 3

6.2 Analysis of number of cover sets, average number of active sensor nodes, and coverage rate for small and large scale WSN

Sets of 50, 100, 150, and 200 sensor nodes are randomly deployed in a small sensing region of area 100*100 m2. Figure 8 shows the number of cover sets formed, the average number of active sensor nodes in one scheduling round, and the coverage rate for the small-scale sensing region; these results are evaluated for 10 m, 20 m, and 25 m sensing ranges. Similarly, Fig. 9 shows the results for a large-scale WSN, where sets of 250, 500, 1000, and 1500 sensor nodes are randomly deployed in a large sensing area of 1000*1000 m2 and the results are evaluated for three sensing ranges of 25 m, 30 m, and 35 m. Figures 8(a), 9(a), 8(c), and 9(c) show that the number of cover sets and the coverage rate increase with the sensing range. Figures 8(b) and 9(b) show that the average number of active sensor nodes decreases with increasing sensing range for a fixed number of sensor nodes. This analysis helps in estimating the number of sensor nodes with a homogeneous sensing range required to cover a sensing region up to a threshold level.

Fig. 8 Analysis of parameters a number of cover sets, b average number of active sensor nodes, and c coverage rate for fixed number of sensor nodes and small sensing area 100*100 m2

Fig. 9 Analysis of parameters a number of cover sets, b average number of active sensor nodes, and c coverage rate for fixed number of sensor nodes and large sensing area 1000*1000 m2

6.3 Average number of active sensor nodes for network connectivity

It is not necessary that the active sensor nodes that provide the threshold coverage rate are also connected among themselves. Algorithm 2 has therefore been proposed to analyze and maintain the connectivity among the sensor nodes. Figure 10 shows the average number of active sensor nodes required to preserve network connectivity for different numbers of deployed sensor nodes. The total number of active sensor nodes increases for connectivity maintenance compared with coverage maintenance. For a fixed sensing area of 1000*1000 m2, the average number of active sensor nodes decreases with increasing sensing range.

Fig. 10 Average number of active sensor nodes for network connectivity versus number of sensor nodes

6.4 Analysis of number of iterations for small and large scale WSN

This section examines the number of iterations required to achieve the global reward for a given number of sensor nodes. Figures 11 and 12 show the number of iterations for small- and large-scale WSNs, with the same parameter values as in Sect. 6.2. The results in Figs. 11 and 12 show that, for a fixed number of sensor nodes, the number of iterations decreases with increasing sensing range.

Fig. 11 Number of iterations occurred for fixed number of sensor nodes deployed inside small sensing area 100*100 m2

Fig. 12 Number of iterations occurred for number of sensor nodes deployed inside large sensing area 1000*1000 m2

6.5 Percentage of wrong decisions versus number of sensor nodes

Reinforcement learning is an iterative process of learning to achieve global rewards. In each learning step, the agent generates a reward as output, which can be either positive or negative. Only those rewards that help in achieving global optimality are considered; the remaining rewards are regarded as wrong decisions. This section reports the percentage of wrong decisions with respect to the total number of sensor nodes. Figure 13 shows that, for a fixed large sensing area of 1000*1000 m2, the percentage of wrong decisions decreases with increasing number of sensor nodes and sensing range. The results are evaluated for 25 m, 30 m, and 35 m sensing ranges.

Fig. 13 Percentage of wrong decision versus number of sensor nodes for large sensing area 1000*1000 m2

6.6 Convergence time versus number of sensor nodes

Convergence time is the time required by Algorithms 1, 2, and 3 to achieve global optimality. Figure 14 shows the convergence time with respect to the total number of deployed sensor nodes. This time includes the time consumed for node activity scheduling, coverage rate computation, and connectivity maintenance. For a fixed sensing area of 1000*1000 m2, the convergence time is directly proportional to the number of sensor nodes: it increases as the number of sensor nodes increases.

Fig. 14 Convergence time versus number of sensor nodes for large sensing area 1000*1000 m2

6.7 Effect of learning rate (α) on convergence time

The learning rate reflects the efficiency with which an agent learns the optimal policy that helps in achieving global optimality. The value of the learning rate lies between 0 and 1 (0 < α \(\le\) 1). A larger value of α reduces the learning ability of an agent, and the results generated with a large value of α are not reliable; in this paper, a learning rate of α = 0.1 is used. Figure 15 shows the effect of the learning rate on the convergence time (in seconds): the convergence time decreases as the learning rate increases. Nevertheless, a smaller value of α helps in achieving better learning outcomes.

Fig. 15 Effect of learning rate (α) on convergence time

6.8 Comparison of CCM-RL with CCA, PCLA, and PCP

The proposed CCM-RL protocol is a node activity scheduling protocol whose main objective is to enable the deployed sensor nodes to learn the best action that provides the maximum coverage rate and maintains network connectivity. Node activity scheduling activates a subset of sensor nodes, instead of the entire set, in each scheduling round, and CCM-RL proposes three algorithms to fulfill this objective. However, the accuracy of the algorithm can only be demonstrated by comparison with existing protocols. Therefore, the performance of CCM-RL is compared with CCA, PCLA, and PCP on the basis of three parameters: the average number of active sensor nodes, the coverage rate, and the average power consumption in each cover set (in Watts). Figure 16 shows that CCM-RL activates fewer sensor nodes in one scheduling round than CCA, PCLA, and PCP. Fewer active sensor nodes in each scheduling round increases the number of cover sets and the network lifetime.

Fig. 16 Average number of active sensor nodes versus number of sensor nodes

Figure 17 compares the performance of CCM-RL in terms of coverage rate and shows that CCM-RL provides the maximum coverage rate with fewer active sensor nodes. Moreover, CCM-RL reduces coverage redundancy, thereby mitigating the problem of redundant packet generation. Figure 18 shows that CCM-RL consumes less power for processing and communicating the data bits; the lower power consumption enhances the network lifetime, so the WSN can operate for a longer duration than with CCA, PCLA, and PCP. These comparisons show that CCM-RL is efficient in terms of maximizing the coverage rate and minimizing the power consumption.

Fig. 17 Coverage rate versus average number of active sensor nodes

Fig. 18 Average power consumed by active sensor nodes in each cover set

7 Conclusion

Coverage and connectivity are two fundamental issues of wireless sensor networks, and they become more critical for randomly deployed sensor nodes. Random deployment is an unplanned deployment strategy that imposes many challenges such as uneven deployment, coverage redundancy, partial coverage, and unwanted resource consumption. To resolve these issues, this paper has proposed a reinforcement learning based Coverage and Connectivity Maintenance (CCM-RL) protocol. The protocol autonomously and intelligently chooses the best activity for a sensor node so as to maximize the overall coverage rate and maintain network connectivity; the sensor nodes learn their best action using Nash Q-learning. The protocol also provides a mechanism to reduce coverage redundancy among sensor nodes by customizing their sensing ranges. The performance assessment of CCM-RL demonstrates its accuracy and reliability with respect to the average number of active sensor nodes, the number of cover sets, the coverage rate, and the energy consumption.

Nash Q-learning is a learning algorithm for multiagent systems that enables the agents to learn their best action in a particular state. Machine learning offers many other multiagent learning algorithms, such as Distributed Q-learning, Minimax Q-learning, Friend-or-Foe Q-learning, Correlated Q-learning, and Optimal Adaptive Learning (OAL). The future scope of this paper is to implement these algorithms for training the agents and to compare their performance in terms of convergence time, rewards, and global optimality.