Keywords

1 Introduction

Wireless Sensor Network (WSN) is an infrastructure-less, self-configured network of sensor nodes that communicate with each other via radio signals. Each node in a WSN is laden with sensors of various kinds and these are often deployed in terrains that are dangerous and inaccessible for humans. The sensors on these nodes send back relevant sensed data via an ad-hoc network of nodes, that constitutes a WSN, to a back end cloud for analysis. A sensor node once deployed in such terrains is on its own with limited energy and computational resources with no means of replenishing these. The aim, therefore, is to minimise energy expenditure and prolong the useful life of nodes. In such circumstances, localization of sensor nodes in WSN is an important issue. This is because the usual localization approach in outdoor locations using Global Positioning Systems (GPS) is infeasible. GPS comprises modules that are resource intensive and deploying these over WSN nodes shortens the latter’s life significantly. In addition to this, the geographical locations in which such nodes are deployed often do not facilitate the proper functioning of GPS modules.

Outdoor localization without the use of GPS is broadly classified into range-free [1], and range-based [2] localization. In range-free localization, the approach is to utilise simple data like the ‘number of hops’ between the anchor nodes and the node being localized, to get a rough idea on the location of the node. Range-based localization on the other hand requires additional hardware for transmission and reception of signals at each node.

In this paper, we utilise a range-based technique, more specifically the Received Signal Strength Indicator (RSSI) technique for localization. Anchor nodes whose locations are known in advance transmit signals that are received by the node to be localized. The strength of the received signals from different anchor nodes are analysed using various algorithms. Based on this, the position of the node is determined. The algorithms used to analyse RSSI values and localize nodes are broadly classified into those employing Machine Learning (ML) techniques and those based on more conventional techniques like multilateration [3].

In a realistic scenario, where anchor nodes are few and there are a large number of unknown nodes far away (not within communication range) from the anchor node, Machine Learning techniques are not effective for localisation. Algorithms based on multilateration techniques are more useful in this regard and can be used for localizations that require multiple iterations. Here, one set of unknown nodes are localized first and these ‘newly’ localized unknown nodes are subsequently used to localize further nodes. While multilateration enables localization of nodes far away from anchor nodes through multiple iterations, its major drawback is lack of accuracy.

In this paper, we overcome the issues of both ML based localization techniques and multilateration based techniques by adopting a ‘hybrid’ approach wherein the ML and multilateration techniques are combined. The remainder of this paper is organized as follows: Sect. 2 is a detailed discussion of the method proposed in this paper; the proposed method is validated through experiments in Sect. 3; and finally Sect. 4 concludes the paper with pointers to future work.

2 Proposed Method

The method proposed in this paper is meant for localization of unknown nodes, without the use of a GPS device, in a WSN that is spread over a large area. ‘Large area’ here implies that most nodes in the WSN are not within the communication range of most other nodes owing to the large size of the area of interest. It is important to specify this as most existing localization techniques work on the assumption that each node in the WSN is within the communication range of every other node.

In this large area, we start with the assumption that the locations of a few sensor nodes, called anchor nodes, are known in advance. These anchor nodes are located at the periphery of the area of interest. This is a realistic assumption as the sensor nodes at the periphery of the WSN are usually accessible and within the reach and range of a GPS device. The sensor nodes located deep within the area of interest are usually not accessible by a GPS device because of a hostile geographical terrain and/or the presence of disrupting structures like trees, and tall buildings. It is these nodes that need to be localized.

This paper proposes a hybrid approach to localize such sensor nodes that comprises a Machine Learning (ML) approach combined with a more conventional multilateration approach. The ML algorithm harnessed here is random forest and it localizes a large number of unknown nodes by analysing the RSSI values of communication signals received at the unknown nodes from one or more anchor nodes. Subsequently, these newly localized unknown nodes now serve as the ‘new’ anchor nodes and are used to localize nodes deeper inside the area using multilateration. The multilateration approach is usually harnessed for more than one iteration until all unknown nodes are localized. Figure 1 is a high-level depiction of the steps followed for localization.

Fig. 1.
figure 1

Proposed hybrid localization approach

We now discuss the proposed approach, comprising localization using RSSI in general, analysis of RSSI using an ML algorithm (random forest), and the use of multilateration with RSSI for localization, in more detail.

2.1 Localization Using RSSI

Localization through RSSI values comprises sending low power signals from the transmitter at an anchor node (a node whose location is known) and receiving the signal using a receiver at an unknown node. The strength of the signal as received at the unknown node is assessed and analysed and conclusions are drawn on the position of the unknown node relative to the anchor node that sends the signal. Usually signals sent from multiple anchor nodes are received and analysed at the unknown node and more accurate localization is achieved. The intensity of signals received at the unknown node decreases with increasing distance from the transmitting anchor node i.e., an unknown node close to the anchor node receives a strong signal, while a distant unknown node receives a weak signal.

Equation 1 is Frii’s Free Space Transmission Equation [4] and shows that the received signal strength decreases quadratically with distance from the transmitter.

$$\begin{aligned} P_{r} = \frac{P_{t}G_{t}G{r}\lambda ^{2}}{4\pi d^2} \end{aligned}$$
(1)

where

\(P_{r}\): power of signal as received at unknown node

\(P_{t}\): power of signal as transmitted at anchor node

\(G_{t}\): gain of transmitter at anchor node

\(G_{r}\): gain of receiver at unknown node

d: distance between the anchor and unknown node

\(\lambda \): wavelength of signal.

The power of the signal received at the unknown node is roughly interpreted as the Received Signal Strength Indicator (RSSI) value after incorporating factors specific to the communication technology in use.

The RSSI values for signals received at unknown nodes from the various anchor nodes are collected and stored in a database. A matrix template for RSSI values obtained at each node from every other node in the region of interest is shown in the matrix in Eq. 2. Most of these RSSI values are assigned values of \(-200\) db indicating that the receiving node is beyond the communication range of the sending node.

$$\begin{aligned} \begin{array}{cccc} R = \left[ \begin{array}{cccc} RSSI_{11}&{}RSSI_{12}&{}\ldots &{}RSSI_{1 k} \\ RSSI_{21}&{}RSSI_{22}&{} \ldots &{} RSSI_{2 k}\\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ RSSI_{n1}&{}RSSI_{n2}&{} \ldots &{} RSSI_{n k} \end{array}\right] \\ \end{array} \end{aligned}$$
(2)

The RSSI values so collected are subsequently analysed by an ML algorithm (random forest in this case) and a multilateration technique for localization.

2.2 Localization Using Machine Learning

The Machine Learning (ML) approach to localization involves training an algorithm with data on a large number of sensor nodes. The data comprises the RSSI values of signals received at each node and the relative location of the node. The algorithm is trained in such a manner that it is able to accurately localize a node that receives relevant signals from at least three anchor nodes (anchor nodes, as mentioned earlier, are nodes whose locations are known). The larger the number of anchor nodes, better the accuracy of localization. The algorithm is trained in an ‘off-line’ manner such that it is trained before it is put to use for localizing sensor nodes.

There are a large number of ML algorithms that can be employed for the task of localization. We assessed several algorithms and based on experiments chose to use random forest in our work as it gave the best localization accuracy. A comparison of the localization accuracies of the ML algorithms that we experimented with is shown in Sect. 4 that discusses the experiments conducted.

Random forest [5] is an ensemble technique that can perform both regression and classification tasks [6]. A random forest comprises several decision trees which are tree-like structures that divide a dataset on the basis of decisions taken at each node. The decision point or split value at a node is determined as one that provides the maximum information gain. Intuitively, a large information gain implies splitting the data in a manner that the data subsets formed as a result of splitting are more homogeneous i.e. datapoints in each subset formed are closer to each other in terms of attribute values. The decision tree establishes the best split amongst its variables with the intent of maximising information gain. This process of splitting begins at the root and each node applies its own split function to the new input. This is repeated recursively until a terminal node is reached.

Once trained, a decision tree is able to provide an appropriate value to a new datapoint. The random forest comprises several such decision trees and an average of the value assigned by each decision tree is assigned to the new point.

Data for the Random Forest. The first step in localization using the random forest algorithm is collection of data for training the model. The training entails teaching the model to correctly map RSSI values of signals received at a node with the 2D coordinates expressing the location of the node. The 2D coordinates of the nodes constitute the output of the random forest model. The input data consists of the RSSI values at the unknown nodes from various anchor nodes. At each unknown node \(N_i\), we represent the RSSI value of the signal received from anchor node \(A_j\) as \(RSSI_{ij}\). The input data and output of the model are in the formats shown in matrices 2 and 3 respectively for k unknown sensor nodes and n anchor nodes.

$$\begin{aligned} \begin{array}{cccc} O = \left[ \begin{array}{cccc} x_{1}&{}x_{2}&{}\ldots &{}x_{n} \\ y_{1}&{}y_{2}&{}\ldots &{}y_{n} \\ \end{array}\right] \end{array} \end{aligned}$$
(3)

In the matrix 2, the input data \(RSSI_{nk}\) corresponds to the RSSI value of the signal received at the \(n^{th}\) sensor node from the \(k^{th}\) anchor node; whereas in the matrix 3, \((x_{n},y_{n})\) represent the coordinates of the \(n^{th}\) sensor node.

Data Preprocessing. Prior to creation of the random forest, the data collected goes through a quick step of preprocessing. Here a new parameter called \(\gamma \) is considered for each unknown node. The \(\gamma \) parameter indicates the number of anchor nodes for which the RSSI value at the node is not \(-200\) db. In other words, the \(\gamma \) parameter provides information on the number of anchor nodes whose signals reach the particular sensor node. For example, \(\gamma = 4\) indicates that the sensor node is within the communication range of 4 anchor nodes.

Only datapoints whose \(\gamma \ge 3\) are considered for creation of the random forest. Those with smaller values are removed from consideration. This is because, at least 3 legitimate RSSI values are required for accurate localization with random forest.

Creation of the Random Forest. To create a random forest, small bootstrap samples from the input data with \(\gamma \ge 3\) are taken and a decision tree is developed with each sample. A small subset of the RSSI values at a node is considered for each tree. From this small subset of RSSI values, one RSSI value is randomly selected for the root node of the decision tree. A split point of this RSSI value is so selected that it gives the best improvement in terms of variance. For brevity, we do not dwell into the procedure for variance calculation.

Based on the ‘best’ split point of the feature, the data is divided into two or more parts and these form the child nodes of the root. At each child node again a feature value (in this case RSSI value) is randomly chosen from the small sample and the best split point for this feature value further divides the data. This is continued until a certain number of iterations or until the data is exhausted, whichever comes first. The decision tree so created is combined with a larger random forest that comprises all such decision trees created.

The number of decision trees created in the random forest, called n-estimator is an important parameter and impacts the performance of the model. We experimented with using n-estimators values of 1000, 2000, and 3000. We got the best results with 2000 decision trees and used this value for further computations.

Testing Phase. Of the legitimate RSSI data with values of \(\gamma \ge 3\), \(90\%\) was allocated for training the model whereas \(10\%\) was kept aside for testing the efficacy.

To test the model as well as use it with our real world implementation, the test point is made to go through each of the 2000 decision trees in the random forest. As the test data point moves through each tree and converges at a node in the tree, the xy coordinates of the datapoint at the node are allocated as the coordinates of the test point.

This is repeated for all 2000 decision trees and finally an average of all the 2000 x and y coordinates is computed and is allocated to the test point.

2.3 Localization Through Multilateration

Multilateration is a localization technique popularly used to localize vehicles in a GPS system. Multilateration depends on the relation between the distance of nodes and their relative location coordinates. To localize one node using multilateration, at least three nodes with known locations (anchor nodes in our case) within the communication range of the unknown node are required. The distance between an anchor node and the unknown is calculated using Frii’s Free Space Transmission Equation [4] shown in Eq. 1 that relates the received signal strength value at the unknown node with the distance from the anchor node from which the signal was sent. This distance (which is not the exact distance but a computed approximate distance) is calculated between all the anchor nodes within the communication range of the unknown node and the unknown node. The calculated distance along with the 2D coordinates of the anchor nodes are together employed in the Least Squares Method [7].

Equation 4 shows the expression that needs to be minimised to compute the location of the unknown node. \(\tilde{d_{i}}\) is the distance between the unknown node and the \(i^{th}\) anchor node as computed. The bar above d indicates that the value of the distance is not necessarily exact and is diluted by channel noise, obstacles, and other shadowing effects.

$$\begin{aligned} \text{ Minimize } \varepsilon =\,\mid \sum _{i=1}^{M} \sqrt{\left( x_{i}-x\right) ^{2}+\left( y_{i}-y\right) ^{2}}-\bar{d_{i}}^{2} \end{aligned}$$
(4)

M denotes the number of anchor nodes within the communication range of the unknown node. M needs to be at least 3 for proper localization. The square of the computed distance between the unknown node and the anchor node is computed as follow:

$$ \begin{array}{l} \left( x_{i}-x\right) ^{2}+\left( y_{i}-y\right) ^{2}=\tilde{d_{i}}^{2}\\ \forall i=1, \ldots , M \\ \left( x_{i}-x\right) ^{2}-\left( x_{j}-x\right) ^{2}+\left( y_{i}-y\right) ^{2}-\left( y_{j}-y\right) ^{2} \\ \quad =\tilde{d_{i}}^{2}-\tilde{d_{j}}^{2}\\ \forall i=1, \ldots , M; i \ne j \\ 2 x\left( x_{j}-x_{i}\right) +2 y\left( y_{j}-y_{i}\right) \\ =\left( \tilde{d_{j}}^{2}-\tilde{d_{i}}^{2}\right) -\left( x_{j}^{2}-x_{i}^{2}\right) -\left( y_{j}^{2}-y_{i}^{2}\right) \\ \quad \forall i=1, \ldots , M; i \ne j \end{array} $$

Expressing the equation in the form of a matrix:

$$ \left[ \begin{array}{cc} 2\left( x_{j}-x_{1}\right) &{} 2\left( y_{j}-y_{1}\right) \\ \vdots &{} \vdots \\ 2\left( x_{j}-x_{M}\right) &{} 2\left( y_{j}-y_{M}\right) \end{array}\right] \left[ \begin{array}{c} x \\ y \end{array}\right] =\left[ \begin{array}{c} \tilde{b_{j}} \\ \vdots \\ \tilde{b_{M}} \end{array}\right] $$

where

$$ \left[ \begin{array}{c} \tilde{b_{j}} \\ \vdots \\ \tilde{b_{M}} \end{array}\right] =\left[ \begin{array}{c} \left( \tilde{d}_{j}^{2}-\tilde{d}_{i}^{2}\right) -\left( x_{j}^{2}-x_{i}^{2}\right) -\left( y_{j}^{2}-y_{i}^{2}\right) \\ \vdots \\ \left( \tilde{d}_{j}^{2}-d_{M}^{2}\right) -\left( x_{j}^{2}-x_{M}^{2}\right) -\left( y_{j}^{2}-y_{M}^{2}\right) \end{array}\right] $$

The form of the above equation is \(A.\bar{x}=\bar{b}\). Using this, the location of the unknown node can be computed by minimising \(\Vert A . \bar{x}-\bar{b}\Vert ^{2}\). Using the Least Squares equation, the solution to the equation becomes \(\hat{x}=\left( A^{T} A\right) ^{-1} A^{T} \tilde{b}\).

2.4 The Hybrid Approach to Localization

We take a hybrid approach to localization owing to limitations in the ML approach as well as the multilateration approach. The ML approach is effective in accurately localizing a large number of sensor nodes harnessing the locations of just a few anchor nodes. The limitation of the ML approach, however, is that it needs to be trained in advance and can only be employed for one iteration. It cannot be easily trained with the locations of the newly localized nodes and thus cannot be used for further iterations. The ML approach, therefore, is useful when all the unknown nodes are within the communication range of at least 3 anchor nodes. This is usually possible in an indoor setting and is seldom the case with large outdoor locations.

The multilateration approach to localization on the other hand can be readily employed for multiple iterations. Multiple iterations imply that the unknown nodes localized in an iteration become the new anchor nodes for subsequent iterations. The iterations continue until the entire area is covered. This is useful but has the drawback that localizations through multilateration are not very precise and this imprecision increases at every iteration. A very large number of iterations of multilateration localization is therefore not advised.

The hybrid approach proposed in this paper takes the best of both approaches. One iteration of ML localization is first conducted. This results in significant number of unknown nodes getting accurately localized. These newly localized nodes become the new anchor nodes for subsequent localizations using multilateration. A combination of the two approaches enables the coverage of most of the outdoor region of interest.

3 Evaluation

In this section we experimentally assess the working of the random forest algorithm, the multilateration approach to localization separately first, and subsequently as a hybrid combination. We first create a simulated environment to comprehensively validate the approach; and subsequently demonstrate its efficacy on a real-world set-up.

3.1 Dataset and Simulated Environment

To demonstrate the effectiveness of the proposed localization approach, we create a simulated environment and a synthetic dataset. We need to synthesize the data as standard datasets for localization over large areas do not exist. Also, we do not have access to real world deployments of this scale.

We consider a \(130\,\times \,130\) m\(^{2}\) region. A dataset comprising anchor nodes (nodes whose locations are known in advance) and sensor nodes (unknown nodes that need to be localized) deployed within this region was synthesized. A total of 12, 321 sensor nodes were created whose positions are along a \(1\times 1\) m\(^{2}\) grid starting from a position of 10 m from the periphery of the region of interest and extending to a distance of 110 m. This is done along both the horizontal and vertical axes. 8 anchor nodes, whose locations are known, are placed at the periphery of the region of interest. This is a realistic scenario as nodes along the peripheries of real world regions of interest are accessible and their locations can be determined. The locations of the anchor nodes are as follows: (0, 0), (60, 0), (130, 0), (0, 60), (0, 130), (60, 130), (130, 60), and (130, 130). Each anchor node has a defined range over which it can communicate with other sensor nodes.

Table 1. Sensor nodes within communication range of anchor nodes

Table 1 shows the number of sensor nodes that are within the range of communication of different number of anchor nodes.

3.2 Machine Learning (Random Forest) Localization

We choose random forest as the ML algorithm for the first iteration of localization. Of the total of 3914 sensor nodes that are within the communication range of 3 anchor nodes (you may recall that for localization, a node needs to be receiving signals from at least 3 anchor nodes), \(90\%\) of the nodes or 3523 nodes are set aside for training of the random forest and \(10\%\) or 391 is used for testing.

Localization Accuracy. Table 2 shows the localization results of the random forest algorithm for 10 randomly selected datapoints. In this table (Xactual, Yactual) are the actual coordinates of the datapoints; (Xpred, Ypred) are the predicted coordinates using random forest localization; and Deviation indicates the distance between the actual and predicted locations.

Table 2. X-Y coordinates, actual vs predicted by random forest

3.3 Multilateration Localization

The other major localization approach employed in this paper is multilateration, as discussed earlier. Multilateration utilises the Least Squares Error technique to accurately localize nodes with distances computed from RSSI values. The advantage of the multilateration approach, in contrast to the ML localization, is that it can be used for multiple iterations.

The downside of localization with multilateration, however, is the inferior localization accuracy as the iterations progress. The first iteration usually returns acceptable accuracy results. This deteriorates because the error in localization at earlier iteration propagates through subsequent iterations.

Localization over Iterations. We conducted experiments to understand the extent of deterioration in localization accuracy as the iterations of localization with multilateration progress. To conduct this experiment, we use a \(50\times 50\) m\(^{2}\) sized simulation environment with 8 anchor nodes positioned respectively at (0, 0), (25, 0), (50, 0), (25, 50), (50, 50), (0, 25), (0, 50), and (50, 25). The sensor nodes localized in the first iteration become the new anchor nodes for the next iteration and localise more sensor nodes. In this way, the nodes over the entire region of interest are localized in three iterations. Tables 3, and 4 respectively show the localization of sensor nodes after the first and second iterations of multilateration. (Xactual, Yactual) are the actual coordinates of the nodes localized and (Xpred, Ypred) are the coordinate values computed using multilateration. Deviation indicates the distance between the actual locations of the sensor nodes and the locations predicted by multilaterion. The values of deviation in the three tables indicate a trend towards deteriorating localization accuracy as the iterations progress.

Table 3. X–Y coordinates, actual vs predicted by first iteration of multilateration
Table 4. X–Y coordinates, actual vs predicted by second iteration of multilateration

3.4 The Hybrid Localization Approach

In this paper, we combine the localization potential of random forest localization and multilateration localization seeking to harness the strengths of both. Random forest is utilised in the first iteration and it localizes a large number of sensor nodes with a high degree of accuracy. These newly localized sensor nodes serve as the anchor nodes for the subsequent iterations of localization which is done using multilateration.

Table 5 shows the localization results for 10 random sensor nodes in terms of the predicted coordinates (Xpred, Ypred) and actual coordinates (Xactual, Yactual). The Deviation column shows the distance between the actual locations of the nodes and the locations predicted by the hybrid approach. The results indicate acceptable localization with small deviations from actual locations owing to the initial boost provided to multilateration in terms of a large number of anchor nodes provided by random forest. The hybrid approach, therefore, is seen to be quite useful for localization of nodes in large outdoor spaces.

Table 5. X–Y coordinates, actual vs predicted by hybrid approach

4 Conclusion

In this paper, we proposed a hybrid technique for localization of nodes in a Wireless Sensor Network (WSN) without the use of GPS. The major contribution of our approach is that it overcomes the simplifying assumption that every node in the WSN deployment is within the communication range of every other node. Our hybrid approach combines the capability of random forest, a Machine Learning (ML) algorithm, with a more conventional multilateration algorithm. The random forest algorithm is trained in advance and is able to accurately localize a large number of unknown nodes using just a small number of anchor nodes (nodes whose locations are known in advance). It is difficult to train random forest ‘on the go’ and hence it cannot be used for subsequent iterations. The nodes localized by random forest, however, are utilised as new anchor nodes and employed for localization of the remaining nodes by the multilateration approach. Multilateration is not as accurate as ML algorithms but can be repeated several time and hence is effective in covering a large deployments. In spite of being a little compromised in terms of accuracy of localization, multilateration does a fairly decent job within the hybrid set-up owing to the initial boost provided by random forest wherein a large number of anchor nodes are created.

We validated the efficacy of the proposed technique using a simulated set-up and with synthetic data. This is because standard data sets for WSN deployments are not available and we were unable to get access to a WSN deployment large enough to validate the idea proposed. The results of localization on the simulated set-up clearly demonstrate the efficacy of the proposed idea.