1 Introduction

The wireless sensor networks (WSNs) have been paid enormous attention for their potential use in monitoring environment, disaster management, combat field reconnaissance, military surveillance, health, home applications, etc. [1, 2]. A WSN is composed of a large number of sensor nodes, which are randomly or manually deployed in some target area. The sensor nodes collect local information, process them, and send it to a remote base station called sink. The sensor nodes are low power battery operated and their replacement is almost impossible in a harsh environment. Therefore, energy consumption of the sensor nodes is the most important issue in the design of a WSN. Reducing energy consumption is thus considered as the most critical challenge in order to maximize the network lifetime. Many research schemes [3, 4] have been addressed on this issue. However, efficient clustering algorithm for WSN is the most focused area, which has drawn much attention in the research community. In a cluster based WSN, sensor nodes are grouped into distinct clusters. Every cluster has a leader, called cluster head (CH). Each sensor node belongs to only one cluster. The sensor nodes inside a cluster collect data and communicate with their CH. The CHs collect and process the local data and send them to the sink by single hop or multihop communication through other CHs. An example of a cluster based WSN is shown in Fig. 1.

Fig. 1
figure 1

A model of cluster based WSN. An arrow from a sensor node towards a cluster head indicates that the sensor node has been assigned to the cluster head

A cluster based WSN has many advantages as follows [5]:

  1. (1)

    It can reduce energy consumption significantly as only one representative (i.e., CH) per cluster needs to be involved in data aggregation and routing process.

  2. (2)

    It can considerably conserve communication bandwidth as the sensor nodes need to communicate with their CHs only, and thus can avoid exchange of redundant messages.

  3. (3)

    The clusters can be more easily managed as they can localize the route set up and require small routing tables for the sensor nodes. This in turn improves the scalability of the network significantly.

However, in a cluster based WSN, CHs bear some extra work load which is contributed by their member sensor nodes as follows. (1) CHs communicate with all the sensor nodes within their cluster, (2) they perform data fusion to discard redundant and uncorrelated data sent by their member sensor nodes, and finally (3) they send the processed data to the sink. Moreover, in many WSNs the CHs are usually selected from the normal sensor nodes, which can die quickly owing to this extra work load. In this context, many researchers [610] have proposed the use of some special nodes called gateways or relay nodes, which are provisioned with extra energy. These gateways are treated as the cluster heads and, therefore, can be used interchangeably for the rest of the paper.

It is important to note that the gateways are also battery operated, and hence power constrained. Life time of the gateways is very crucial for long run operation of the network. Therefore, improper cluster formation may cause some CHs (gateways) overloaded. Such overload may increase latency in communication, consumes high energy of the CH and degrade the overall performance of the WSN. Therefore, load balancing of the CHs is the most important issue for clustering sensor nodes. Particularly, this is a pressing issue when the sensor nodes having unequal load are not distributed uniformly. It is also noteworthy that for a WSN with n sensor nodes and m gateways, the number of possible clusters is m n. This implies that the computational complexity of finding the optimal load balanced clustering for a large WSN seems to be very high by a brute force approach. In fact, load balanced clustering with unequal load of the sensor nodes is a NP-hard problem.

In this paper, we are concerned with assigning the sensor nodes to the CHs to form clusters such that the maximum load of each CH is minimized. By the load of the CHs, we mean that the load contributed by the member sensor nodes due to data generation and communication. In order to search faster and improved algorithms, we first propose here a load balanced clustering algorithm that produces optimal solution in O(nlogn) time for a special case in which all sensor nodes have equal load. We then show that the same algorithm can work for the general case, i.e., sensor nodes having unequal loads. We prove that this is actually a 2-approximation algorithm. Finally, we present an improved algorithm for the general case which is 1.5-approximation. This algorithm is also shown to run in O(nlogn) time. The experimental results show that the proposed algorithms are better than the existing algorithms in terms of load balancing, execution time, and the network life in rounds. Therefore, our contributions in this paper can be summarized as follows:

  • A load balanced clustering algorithm for both equal and unequal load of the sensor nodes in O(nlogn) time for n number of sensor nodes.

  • Proof of the algorithm for optimal solution for equal load of the sensor nodes and proof of the 2-approximation algorithm for unequal load.

  • A 1.5-approximation load balanced clustering algorithm having same time complexity and its proof.

  • Simulation results to demonstrate that the proposed algorithms are superior to existing algorithms.

The rest of the paper is organized as follows. The related work is presented in Sect. 2. System model and problem formulation are described in Sect. 3. Section 4 presents the used terminologies of the algorithms. The proposed algorithm for equal loads and the 1.5-approximation algorithm for unequal loads are presented in Sect. 5 and Sect. 6, respectively. Experimental results are given in Sect. 7 and we conclude our paper in Sect. 8.

2 Related work

A number of clustering algorithms have been developed for WSN [5, 1113]. LEACH [14] is a popular technique that forms clusters by using a distributed algorithm. It dynamically rotates the work load of the CH among the sensor nodes, which is useful for load balancing. However, the main disadvantage of this approach is that a node with very low energy may be selected as a CH which may die quickly. Moreover, the CHs communicate with base station via single-hop, which is impractical for WSNs with large coverage area. Therefore, a large number of algorithms have been developed to improve LEACH such as PEGASIS [15], AEEC [16], etc. Compared to LEACH, PEGASIS improves network lifetime, but it requires dynamic topology adjustment and the data delay is significantly high which is unsuitable for large scale networks. Buyanjargal et al. [16] have presented AEEC, a modified algorithm of LEACH. Although AEEC performs clustering phase similar to LEACH, it selects the CHs based on the remaining energy of the sensor nodes and thus overcomes the problem of random selection of CHs. However, the main drawback of this protocol is a high volume of control massage exchange between elected CH and the sensor nodes to form a cluster. Bandyopadhyay et al. [17] have developed a multihop hierarchical clustering algorithm. Their method is based on probabilistic approach that provides optimal value for CH selection and maximum number of hops from a sensor node to cluster head. But their approach does not take into account the residual energy of the sensor nodes and the CH selection may result in faster death of some sensor nodes. Xue et al. [18] have proposed a clustering algorithm by determining the optimal cluster size and presented a location aware hybrid transmission scheme. But it is difficult to control the actual size of the clusters when the number of dead nodes is increased. It is also cumbersome to maximize the network lifetime through control distribution of the cluster heads. Recently, we have proposed a GA based load balanced clustering algorithm for WSNs in [19]. The algorithm forms clusters in such way that the maximum load of each gateway is minimized. As it is a meta-heuristic approach, it may not provide an optimal solution for the equal load of the sensor nodes. We have also developed an energy efficient load-balanced clustering algorithm (EELBCA) [6] that runs in O(nlogm) time. EELBCA addresses energy efficiency as well as load balancing. EELBCA is a min-heap based clustering algorithm. A min-heap is build using cluster heads (CHs) on the number of sensor nodes allotted to the CHs. However, it does not consider unequal load of the sensor nodes. Gupta et al. [7] proposed a load balanced clustering algorithm called LBC, where the authors define cardinality of a cluster as the number of sensor nodes associated with the cluster and attempts to minimize the variance of the cardinality of each cluster in the system. LBC takes O(mnlogn) time in worst case for n number of sensor nodes and m number of CHs in the networks. Tarachand et al. [20] presented a centralized energy efficient load balancing algorithm called CELBA, which takes care of both load balancing and energy consumption of the sensor nodes. The algorithms presented in [7] and [20] assume that all sensor nodes generates equal traffic load to its CHs. Unfortunately, the algorithms may not be effective for the network that generates unequal traffic load to the CHs. Low et al. [8] proposed two different clustering algorithms. Their first algorithm called LBCA (load balanced clustering algorithm) assumes a special case in which the traffic load contributed by all sensor nodes is equal. To form cluster, LBCA constructs an individual BFS (Breadth First Search) tree for each sensor node to find out the least loaded CH for assigning the sensor node to the least loaded CH. However, the time complexity of LBCA is O(mn 2), which may not be effective for large scale WSNs. Moreover, for a large WSN, building, a BFS tree for individual sensor node takes a substantial amount of memory space. They also proposed an approximation algorithm called GLBCA (greedy load balanced-clustering algorithm) for the case where sensor nodes have unequal loads and this algorithm is shown to be 1.5-approximation. To form a cluster, GLBCA considers a bipartite graph of the sensor nodes and the CHs to find out the maximum matching for assigning a sensor node to a CH. The time complexity of the algorithm is O(n[n+m+q]) where q is the cardinality of the bipartite graph. The algorithms proposed in this paper have the following improvements over the Low’s algorithm [8].

  • The algorithms work for both equal and unequal load of the sensor nodes.

  • Both the algorithms run in O(nlogn) time in contrast to O(n[n+m+q]) and O(mn 2) time for unequal and equal loads, respectively [8].

  • The algorithm requires maintaining only liner array of sensor nodes in contrast to a BFS tree and bipartite graph as required by [8]. Thus, it has improved space complexity.

  • The proposed 1.5 approximation algorithm for unequal load shows better load balancing and execution time.

3 System model and problem formulation

We assume a WSN model where all the sensor nodes are randomly deployed along with a few gateways and once they are deployed, they become stationary. A sensor node can be assigned to any gateway if it is within the communication range of the sensor node. Therefore, there are some prespecified gateways onto which a particular sensor node can be assigned. Thus each sensor node has a list of gateways and it can be assigned to only one gateway amongst them. Similar to LEACH, the data gathering operation is divided into rounds. In each round, all sensor nodes sense local data and send it to their CH. Then CHs perform data aggregation to discard the redundant and uncorrelated data and send the aggregated data to the base station. Between two adjacent rounds, all nodes turn off their radios to save energy. All communication is over wireless link. A wireless link is established between two nodes only if they are within the communication range of each other. Current implementation supports TDMA [2123] protocol to provide MAC layer communication. Gateways use slotted carrier-sense multiple access (CSMA) MAC [23] protocol to communicate with base station. We use the following notations for the problem formulation as follows:

  • Let S={s 1,s 2,…,s n } be the set of sensor nodes and ζ={g 1,g 2,…,g m } be the set of gateways where, n>m.

  • d max denotes the maximum communication range of the sensor nodes and dist(i,j) denotes the Euclidian distance between sensor node s i and gateway g j .

  • d i denotes the traffic load contributed by a sensor node s i , s i S, \(d_{i} \in \mathbb{Q}\) where \(\mathbb{Q}\) is the set of rational numbers. This may be noted that the sensor nodes may have practically different processing and communication capabilities for WSNs, which are heterogeneous in nature. Therefore, the traffic load contributed by the sensor nodes may vary depending on rate of data generation and communication. We assume that the traffic load contributed by each sensor node is estimated prior to the formation of clusters as assumed in [8].

  • G i denotes the set of gateways, which are within communication range of node s i . Therefore, s i can be assigned to any one of the gateway from G i , where G i ζ. For example, G r ={g 1,g 3,g 7} means that s r can be assigned to any one of the CHs, g 1, g 3, g 7.

  • Let W j be the load assigned to the cluster head g j . In other words,

$$ W_j =\sum_{i=1}^n d_i \times \alpha _{ij} $$
(3.1)

where α ij is a Boolean variable such that

$$\alpha_{ij} = \left \{ \begin{array}{l@{\quad}l} 1, & \mbox{if sensor node}\ s_{i}\ \mbox{is assigned to cluster head}\ g_{j} \\ 0, & \mbox{otherwise} \end{array} \right . $$

Then the overall maximum load of cluster heads is W=max{W i |∀g i ζ}. Now, we address the problem of clustering, where our main objective is to minimize the overall maximum load of the CHs. Then the Integer Linear Programming (ILP) of the load-balanced clustering problem can be formulized as follows:

$$\begin{aligned} &\mbox{Minimize}\quad W = \max \{ W_{i}|\forall g_{i} \in \zeta \} \\ &\mbox{Subject to} \\ &\quad \sum_{j = 1}^{m} a_{ij} = 1 |\forall s_{i} \in S \end{aligned}$$
(3.2)
$$\begin{aligned} &\quad \sum_{s_{i} \in S} d_{i} \times a_{ij} \le W|\forall g_{j} \in G_{i} \end{aligned}$$
(3.3)
$$\begin{aligned} &\quad \sum_{j = 1}^{m} \mathit{dist}(i,j) \times a_{ij} \le d_{\max} |\forall s_{i} \in S \end{aligned}$$
(3.4)

The constraint (3.2) states that a sensor node can be assigned to one and only one gateway and (3.3) indicates that the total load of all the sensor nodes assigned to a gateway must not exceed the overall maximum load of the gateway. The constraint (3.4) ensures that the sensor nodes are assigned to the gateway within their communication range.

4 Terminologies

We use the following terminologies in the proposed algorithms. Depending on the communication range between the sensor nodes and the gateways, we define two kinds of sensor nodes in the system: the restricted node and open node as follows [6].

  1. (1)

    Restricted Node and Restricted Set: Restricted nodes are those sensor nodes, which can communicate with one and only one gateway. Restricted set is the set of all restricted nodes in the WSN. We refer this set as “R set.” It is obvious to note that a sensor node s i belongs to R set, if it satisfies the following criteria:

    $$s_{i} \in R_{\mathrm{set}} \Leftrightarrow \bigl[\bigl\{ \exists g_{j}\vert g_{j} \in G_{i} \wedge g_{j} \in \zeta \bigr\} \wedge \bigl\{ \exists (\neg g_{k})\vert g_{k} \in G_{i} \wedge g_{k} \in (\zeta - g_{j})\bigr\} \bigr] $$

    where G i is the set of all those gateways, which are within communication range of s i and ζ is the set of all gateways as mentioned above.

  2. (2)

    Open Node and Open Set: Open nodes are those sensor nodes, which can communicate with more than one gateway. Open set is the collection of all open nodes in the WSN. We refer this set as “O set.” A sensor node s i belongs to O set, if it satisfies the following criteria:

    $$s_{i} \in O_{\mathrm{set}} \Leftrightarrow [s_{i} \notin R_{\mathrm{set}}] $$
  3. (3)

    Assigning a sensor node to a CH depends on various factors such as its distance from the CH, its rate of data generation and communication to the CH and its residual energy. We use P i (g x ) to denote the probability of assigning s i to the cluster head g x . Therefore, we have

    $$ \sum_{x = 1}^{m} P_{i} (g_{x}) = 1|\forall s_{i} \in S,\quad g_{x} \in \zeta $$
    (4.1)
  4. (4)

    The expected load (EL) of a cluster head g x is defined as the summation of mean loads of all the sensor nodes with the probability that they can be assigned to g x . Therefore, the EL can be expressed as follows:

    $$ \mathit{EL}(g_{x}) = \sum_{i = 1}^{n} P_{i} (g_{x}) \times d_{i}|\forall s_{i} \in S,\quad g_{x} \in G_{i} $$
    (4.2)

    We illustrate it with the following example. Let us assume that there are three sensor nodes s 1, s 2 and s 3 and their set of possible CHs are G 1={g 1,g 4,g 5}, G 2={g 2,g 4,g 7,g 10} and G 3={g 2,g 4}. Here, each of G 1, G 2, and G 3 has g 4 as a common member. Let g 4 is not a member of any other set of CHs except G 1, G 2, and G 3. Therefore, P 1(g 4)=1/3. Similarly, P 2(g 4)=1/4 and P 3(g 4)=1/2. Then using Eq. (4.2), we have

    $$\mathit{EL}(g_{4}) = \biggl(\frac{1}{3}\biggr) \times d_{1} + \biggl(\frac{1}{4}\biggr) \times d_{2} + \biggl(\frac{1}{2}\biggr) \times d_{3} $$

    If any sensor node, say s i is finally assigned to the CH, say g x . Then the EL of the cluster head g x and the EL of any other CH, say g y can be updated as follows:

    $$\begin{aligned} \mathit{EL}(g_{x}) &= \mathit{EL}(g_{x}) + \bigl(1 - P_{i}(x)\bigr) \times d_{i} \end{aligned}$$
    (4.3)
    $$\begin{aligned} \mathit{EL}(g_{y}) &= \mathit{EL}(g_{y}) - P_{i}(y) \times d_{i},\quad \forall g_{y} \in G_{i} - \{g_{x}\} \end{aligned}$$
    (4.4)
  5. (5)

    Maximum Possible Load (MPL) of a cluster head g x is the summation of the loads contributed by all the sensor nodes prior their allotment to g x . This can be noted that g x must be within the range of all such sensor nodes. Therefore, the MPL of a cluster head g x can be expressed as follows:

    $$ \mathit{MPL}(g_{x}) = \sum_{i = 1}^{n} d_{i} \vert \forall s_{i} \in S \wedge g_{x} \in G_{i} $$
    (4.5)
  6. (6)

    The Current Load (CL) of a cluster head g x is the summation of the loads contributed by all sensor nodes after their allotment to g x .

5 Proposed load balanced clustering algorithms

Network setup is performed in two phases: bootstrapping and clustering. During the bootstrapping process, all the sensor nodes and gateways are assigned unique IDs. Then the sensor nodes broadcast their IDs using CSMA/CA MAC layer protocol. Therefore, the gateways within the communication range of these sensor nodes can collect the sensor IDs, and finally send the local network information to the base station. Thus, for each sensor node, the number of gateways within its communication range can be calculated by the base station. In clustering phase base station executes the clustering algorithm. When the clustering is over, all the sensor nodes are informed about the ID of the gateway they belong to.

5.1 Algorithm for equal load

We consider here that each sensor node has equal traffic load, i.e., d i =α (say), ∀s i S, 1≤in. Therefore, minimizing the overall maximum load of each CH is equivalent to minimizing the maximum number of sensor nodes that can be assigned to each CH. The basic idea of our algorithm is as follows. We first sort the set of sensor nodes S in nondecreasing order on |G i | of s i , ∀s i S, 1≤in. Let, S={s a ,s b ,s c ,…,s p } be this sorted sensor list. We now successively consider each sensor node from this list (starting with s a ) to assign it to the correct CH. In order to assign s a , we consult its corresponding set of possible CHs, i.e., G a and calculate the EL values using Eq. (4.2) for all the CHs belongs to G a . The sensor node s a is assigned to that CH, which has the minimum EL value. In other words, s a selects the cluster head say g x only if

$$ \mathit{EL}(g_{x}) = \min\bigl\{ \mathit{EL}(g_{k}) \vert \forall g_{k} \in G_{a} \bigr\} $$
(5.1)

If there are two or more CHs with the same EL value then, select that CH for which the probability of assigning the sensor node is maximum. If the probability also ties, then select the CH that has minimum number of sensor nodes already assigned to it. After each assignment of sensor, the EL value of the CHs is updated by Eqs. (4.3) and (4.4) for the assignment of the next sensor from the sorted list. The same procedure is continued until all the sensor nodes are allotted to their correct CH. The algorithm is formally presented in Fig. 2.

Fig. 2
figure 2

2-Approximation load-balanced clustering algorithm

5.2 An illustration

Consider a WSN of 15 sensor nodes and 4 gateways, i.e., S={s 1,s 2,…,s 15} and ζ={g 1,g 2,g 3,g 4}. Without loss of generality, let us assume that d i =1. Table 1 shows the possible gateways to which the sensor nodes may be assigned. In our algorithm, sensor nodes are sorted in nondecreasing order on G i . The sorted list of the sensor nodes is shown in Table 2.

Table 1 Sensor nodes with the list of possible gateways
Table 2 Sorted list of sensor nodes with possible gateways

Initially, every CH has no load. As seen from the Table 2, s 2 is the first sensor node, which can be assigned to any one of the cluster heads g 1 and g 2. Therefore, we calculate the EL values of g 1 and g 2 as follows. We observe from Table 2 that the possible sensor nodes that may be assigned to g 1 are s 1, s 2, s 4, s 6, s 7, s 8, s 9, s 11, s 12, s 14, and s 15. It is also clear from Table 2 that s 1 has four possible CHs. So, the probability of assigning s 1 to g 1 is 1/4, i.e., P 1(g 1)=1/4. Similarly, the probability of assigning s 2, s 4, s 6, s 7, s 8, s 9, s 11, s 12, s 14, and s 15 to g 1 are 1/2, 1/4, 1/4, 1/3, 1/3, 1/2, 1/4, 1/4, 1/2, and 1/3, respectively. Therefore, by Eq. (4.2), we obtain

$$\begin{aligned} &\mathit{EL}(g_{1}) = \biggl(\frac{1}{4}.1 + \frac{1}{2}.1 + \frac{1}{4}.1 + \frac{1}{4}.1 + \frac{1}{3}.1 + \frac{1}{3}.1 + \frac{1}{2}.1 + \frac{1}{4}.1 + \frac{1}{4}.1 + \frac{1}{2}.1 + \frac{1}{3}.1\biggr) \\ &\quad [\because d_{i} = 1] = 3.75 \end{aligned}$$

Similarly, we calculate EL(g 2)=4.583. As g 1 has lesser EL value than g 2, we assign s 2 to g 1. After assigning, EL(g 1) is updated by Eq. (3.3) as follows:

$$\begin{aligned} \mathit{EL}(g_{1}) &= \mathit{EL}(g_{1}) + \bigl(1 - P_{2}(g_{1})\bigr).1 \\ &= 3.75 + 0.5 \quad \bigl[\because P_{2}(g_{1}) = 1 / 2 \bigr] \\ &= 4.25 \end{aligned}$$

The EL value of g 2 is also updated by Eq. (4.4) as follows:

$$\begin{aligned} \mathit{EL}(g_{2}) &= \mathit{EL}(g_{2}) - P_{2}(g_{2}).1 \\ &= 4.583 - 0.5\quad \bigl[\because P_{2}(g_{2}) = 1 / 2 \bigr] \\ &= 4.083 \end{aligned}$$

Next, we consider s 3 for its assignment. From Table 2, it can be noted that s 3 may be assigned to g 2 or g 3. Now we calculate EL value of g 3 as follows:

$$\begin{aligned} \mathit{EL}(g_{3}) &= \biggl(\frac{1}{4}.1 + \frac{1}{2}.1 + \frac{1}{4}.1 + \frac{1}{3}.1 + \frac{1}{4}.1 + \frac{1}{3}.1 + \frac{1}{2}.1 + \frac{1}{3}.1 + \frac{1}{4}.1 + \frac{1}{4}.1 + \frac{1}{2}.1\biggr) \\ &= 4.083 \end{aligned}$$

As the EL values of g 2 and g 3 are same, P 3(g 2) and P 3(g 3) are also same and both g 2 and g 3 are zero loaded, s 3 can be assigned to any one of g 2 and g 3. We assign it to g 2. Now we update EL values of g 2 and g 3 and continue the same method for assigning the remaining sensor nodes. The successive assignment of the sensor nodes s 3, s 9, and finally for s 12 to their CHs along with their calculated EL values are shown in Table 3.

Table 3 Assignment of successive sensor nodes to the CHs

Lemma 5.1

The above 2-approx-load-balanced-clustering algorithm (2ALBC) produces optimal solution for equal load of the sensor nodes.

Proof

First we note that the sensor list S is initially sorted in non-decreasing order on the number of CHs. Let S={s 1,s 2,s 3,…,s n } be this sorted list. Now, the algorithm assigns s 1,s 2,s 3,s 4,…,s n successively. For the assignment, the following approach is followed for load balancing. The algorithm assigns that sensor node first, which has the least chance of assigning to a CH. As a result any other sensor node having more chance can select a CH with the least load.

Let s k be the last assigned sensor node to the maximum loaded cluster head g r after its assignment. Let current load on g r be l r . Suppose there is a cluster head g s G k with total load l s such that l s <l r . Therefore, s k can be assigned to g s and current load of g s will be l s +α, where α is the load of each sensor node. Then the algorithm is not optimal if (l s +α)<l r . But at the time of assignment of s k , the selected cluster head g r had the minimum EL value, i.e., (l r α) was minimum. So, after assigning of s k to any g s G k −{g r }, (l s +α)≥l r . Hence, the following proof. □

Lemma 5.2

The time complexity of the algorithm 2ALBC is O(nlogn).

Proof

Step 1 requires O(nlogn) time for sorting n sensor nodes. In for loop of the step 2, a CH list is created where the average possible load of each CH is calculated. In the worst case, step 2.1 can iterate m times. So, step 2 can take O(mn) time in worst case. Step 3 iterates n times in which step 3.1 and step 3.3 are the dominating ones requiring O(m) time for each in the worst situation. Therefore, step 3 can be executed in O(mn) time. Thus, the above algorithm requires O(mn) time. If m<logn, it requires O(nlogn) time. □

Theorem 1

The algorithm 2ALBC produces optimal solution in O(nlogn) time assuming equal loads of the sensor nodes.

Proof

Follows from Lemmas 5.1 and 5.2. □

5.3 Approximation algorithm for unequal load

We consider here the following scenario of the WSN in which the traffic load contributed by the sensor nodes are variable. This is due to the fact that the sensor nodes may generate or process the data at different rate and their rate of communication with their CHs is also different.

Lemma 5.3

Load balanced clustering problem with unequal load for sensor nodes is NP-hard.

Proof

Please refer [8]. □

Lemma 5.4

The algorithm 2ALBC is a 2-approximation algorithm for load balanced clustering problem with unequal load of the sensor nodes.

Proof

Let OPT be the maximum load of a CH in an optimal solution. Then it is obvious to note that OPTd i , ∀i,1≤in. Let I be the smallest instance of the problem for which the proposed algorithm produces results not as good as the optimal solution. Let g i be a CH with maximum load after complete run of the load- balanced- clustering algorithm, i.e., W i =max{W j |∀j,1≤jn}. Let s r be the last sensor node assigned to g i . The crucial property of our algorithm is that, at the time of the assignment of s r , g i was the minimum loaded CH from G r . So, before assignment of s r , load of g i was (W i d r ), which is less than or equal to OPT(I). Therefore, W i d r OPT(I), i.e., W i OPT(I)+OPT(I) (as, d r OPT(I)). Hence, W i ≤2OPT(I). □

6 Proposed 1.5-approximation algorithm

We assume here that the traffic loads contributed by the sensor nodes are unequal and already calculated prior cluster formation. The basic idea is as follows. We first assign all the sensor nodes s i , ∀s i R set, to their corresponding gateway. Then all the sensor nodes s j , ∀s j O set, are sorted in nonincreasing order on their contributing traffic load. Now, the basic principle of the proposed algorithm is to assign a sensor node, which contributes maximum load to a minimum loaded CH within its range. After the assignment, current load and maximum possible load of the CHs are updated. Then the sensor node contributing next maximum load is assigned to the current minimum loaded CH within its range. This process is continued until all the sensors are assigned to the CHs. The algorithm is now formally presented in Fig. 3.

Fig. 3
figure 3

1.5-Approximation load-balanced clustering algorithm

Lemma 6.1

The time complexity of 1.5-approx-load-balanced-clustering (1.5ALBC) algorithm is O(nlogn).

Proof

Step 1 requires O(m) time in worst case to assign the R set to their corresponding CH. In step 2 a CH list is created where the maximum possible load of each CH is calculated. In the worst case, the step 2 can take O(mn) time. Step 3 requires O(nlogn) time in worst case for sorting sensor nodes of O set. In step 4, a sensor node is assigned to a CH. Step 4.1 through step 4.5 can be run in O(m) time. Outer for loop runs in O(n) time. So step 4 can be executed in O(mn) time. Therefore the above algorithm requires O(mn) time. If m<logn, it requires O(nlogn) time. □

Lemma 6.2

The 1.5ALBC is a 1.5-approximation algorithm of load balanced clustering problem (LBCP) with unequal loads of the sensor nodes.

Proof

Let OPT be the maximum load of a CH in an optimal solution. It is obvious to note that OPTd i , ∀i,1≤in. Let I be the smallest instance of the problem for which this algorithm conflicts with optimal solution. Let g i be a CH with maximum load after complete run of the algorithm approx-load-balanced, i.e., W i =max{W j |∀j,1≤jn}.

Let s r be the last sensor node assigned to g i . Therefore, s r is the sensor node which has the minimum load between all sensor nodes. Now at the time of the assignment of s r , g i was the minimum loaded CH from G r . As we are assuming that I is the smallest instance, so the algorithm conflicts with optimal solution only after assignment of s r . Without loss of generality, we may assume that each CH can be assigned with at least two sensor nodes. Therefore, all the CHs except g i are assigned by at least two sensor nodes. This implies that d r OPT(I)/2, as s r is the least loaded sensor node. Therefore, before allotment of s r , load of g i was (W i d r ), which is less than or equal to OPT(I). In other words, W i d r OPT(I), i.e., W i OPT(I)+OPT(I)/2 (as, d r OPT(I)/2). Hence, W i ≤1.5OPT(I). □

7 Experimental results

The proposed algorithms were experimented extensively using MATLAB (version 7.5) and C programming language on an Intel Core 2 Duo processor with T9400 chipset, 2.53 GHz CPU and 2 GB RAM running on the platform Microsoft Windows Vista. For the experiments, we assumed two different scenarios. In the first scenario, we considered a 300×300 square meter area in which 200 to 500 sensor nodes are randomly deployed along with 35 gateways. We next considered a 500×500 square meter area in which 600 to 1200 sensor nodes are randomly deployed along with 75 gateways. For both of the scenarios, base station was assumed to be located at the centre of the area. Each sensor node was assumed to have an initial energy of 2 Joules and gateways with 10 Joules. In the simulation run, the energy model and the typical parameter values were set same as LEACH. For the sake of comparison, we also ran various algorithms including GLBCA [8], GA-based load balanced clustering algorithm [19], LBC [7], and LDC [24].

7.1 Load balancing

To judge the quality of the load balancing, we measured the standard deviation of the gateway load and plotted against the number of sensor nodes. The standard deviation (σ) of the gateway load gives even distribution of the load per gateway. If there are m gateways and n sensor nodes, the standard deviation of gateway load is given by

$$\sigma = \sqrt{\frac{1}{m}\sum_{j = 1}^{m} (\mu - W_{j})^{2}} $$

where \(\mu (\mbox{average load}) = \frac{1}{m}\sum_{i = 1}^{n} d_{i}\), and W j is the overall load of the gateway g j .

The results of load balancing for unequal load of the sensor nodes are shown in Figs. 4(a)–4(b). Our proposed algorithms perform better in load balancing issue in this case. However, LBC and LDC perform poorly as they are not proposed for unequal load of the sensor nodes.

Fig. 4
figure 4

Comparison of load balancing for unequal load of the sensor nodes using (a) 35 gateways and (b) 75 gateways

We next ran the algorithms for equal load of the sensor nodes, the comparison results of which are depicted in Figs. 5(a)–5(b). This can be noted that although our proposed algorithms 2ALBC and 1.5ALBC perform same as GLBCA, they perform far better than GA, LBC, and LDC. The rationale behind is that all three algorithms 2ALBC, 1.5ALBC, and GLBCA are known to be optimal for the equal load of the sensor nodes whereas the GA, LBC, and LDC are known to be suboptimal in terms of load balancing. Note that LBC can be executed separately for load balancing and energy consumption issues but not considering both of them together. In the simulation, we ran LBC by considering load balancing of the CHs only. LDC performs poorly in load balancing as it does not consider any issue of load balancing of the CHs for cluster formation.

Fig. 5
figure 5

Comparison of load balancing for equal load of the sensor nodes using (a) 15 gateways and (b) 30 gateways

7.2 Execution time

We also obtained the execution time of the algorithms. We can see from Figs. 6(a)–6(b) that the proposed algorithms1.5ALBC and 2ALBC are better than LBC and far better than GLBCA in terms of execution time. The rationale behind this is that in order to assign a sensor node, GLBCA builds a bipartite graph of the sensor nodes and the CHs to find out the maximum matching for assigning a sensor node to a CH. The time complexity of the algorithm is O(n[n+m+q]) in worst case (m number of gateways and n number of sensor nodes and q is the cardinality of the bipartite graph). Thus, GLBCA takes considerably high time for large number of sensor nodes. LBC takes time to expands e-Set based on the distance from the sensor node to gateway and in worst case this process takes O(mn) time. Then LBC assigns the e-Sets which also takes O(mn) time and finally assign all remaining sensor nodes which takes O(mnlogn) time. On the other hand, 2ALBC and 1.5ALBC use only sorting and the overall time complexity of these algorithms is O(nlogn) in the worst case. Thus, 2ALBC and 1.5ALBC consume much less execution time than GLBCA and LBC. It can be observed that LDC executes faster than all the algorithms. This is because LDC simply assigns the sensor nodes to its nearest CH, and thus it takes less execution time.

Fig. 6
figure 6

Comparison of execution time using (a) 15 gateways and (b) 30 gateways

7.3 Network life

We now present the experimental results of the algorithms in terms of network life time, which can be defined in various ways as follows. This is the time duration/number of rounds until the first/last node dies or until certain percentage of nodes die [24]. In our experiment, we assumed the network life time as the number of rounds until first gateway dies. Therefore, network life can be extended if we can prevent the death of the first gateway. This can be achieved through load balancing. This implies, better the load balancing, higher is the network life. Figures 7(a)–7(b) shows the comparison of the algorithms in terms of the network life time. It is obvious to note that the proposed algorithms perform better. Obviously, LDC performs poorly in this case, also.

Fig. 7
figure 7

Comparison of network life time using single hop communication for (a) 35 gateways and (b) 75 gateways

In all of the above experiments, we assumed single hop communication between CHs and the base station, which may not be realistic for a large area WSN. In order to test how the proposed algorithms behave for WSN with multihop communication, we also performed experiments to measure the network life time. The results are depicted in Fig. 8. We used the same routing scheme for all of the algorithms as proposed in [9]. Note that the proposed algorithms perform better than GLBCA, GA, LBC, and LDC in this scenario, also.

Fig. 8
figure 8

Comparison of network life time using multi-hop communication for (a) 35 gateways and (b) 75 gateways

8 Conclusion

In this paper, we have presented two load balanced clustering algorithms for wireless sensor networks. The first algorithm has been shown to be optimal in assigning sensor nodes to the CHs in the case of equal load of the sensor nodes. The algorithm has been shown to run in O(nlogn) time. We have shown that the same scheme can also work as a 2-approximation algorithm for the situation where the loads of the sensor nodes can vary. We have next presented an improved algorithm that has 1.5-approximation ratio for the general case, which is shown to run in O(nlogn) time too. We have shown that the proposed algorithms outperform the existing algorithms in terms of load balancing, execution time and the network life time for both equal and unequal load of the sensor nodes. However, our proposed clustering algorithms consider only the load balancing issue with respect to data generation and communication of the member sensor nodes. We do not consider the residual energy of CHs and the communication distance. Our next attempt will be made to develop a clustering algorithm covering such issues in our future research.