
1 Introduction

Recent innovations in materials science and semiconductor technology have allowed electronic devices to become smaller and smarter. Advances in networking technology make it possible to build wireless networks that are self-configuring and infrastructure-free, known as Wireless Sensor Networks (WSNs) [1]. Typically, a WSN comprises hundreds of thousands of low-cost, resource-constrained, autonomous sensor nodes deployed over an area. The sensor nodes monitor environmental and physical phenomena and send the data over the network to the sink [2]. The sink has greater resources, and it is where the data are observed and analyzed. The technology was first inspired by military applications such as battlefield monitoring and surveillance. More recently, application-driven WSN technology has been used in a range of civilian applications, including monitoring of the environment, agricultural fields, infrastructure, transportation, and the health sector [3]. WSNs face several challenges, such as efficient deployment, energy management, and security. The deployment strategy is usually determined by the scenario and environment of the application, which may include harsh fields, disaster areas, or toxic surroundings [4]. For efficient energy management, the hardware and software must be designed to minimize energy usage. The energy consumed by radio transmission can be reduced by using data compression. Energy consumption also depends on the application; in some applications, the nodes need not be active continuously [5]. A WSN and the construction of a sensor node are shown in Fig. 1.

Fig. 1. WSN and construction of a sensor node

In this age of information and communication technology, networks are interconnected through a variety of communication media. Security is a mandatory requirement for both wired and wireless communication [6]. WSNs use wireless communication media to construct the network. Moreover, WSNs are characterized by the open nature of the wireless medium, unattended operation, and resource-constrained nodes with short communication range, low processing capability, and limited energy, memory, communication bandwidth, and computation power. These characteristics make the network vulnerable to faults and to different kinds of attacks, so securing a WSN is a challenging and critical task [7]. Securing the network is essential for a deployed WSN to function efficiently and to collect meaningful data. One of the most important tasks is to find anomalous nodes in the network so that they can be fixed or, if required, isolated.

Researchers have developed numerous algorithms to secure WSNs and to collect meaningful data from a functioning network. Most of this work focuses on attack defense, authentication, and pairwise key establishment. Currently available approaches rely mostly on cryptography and on authentication of the sensor data to establish relationships and trust among sensors.

However, unreliable communication over the wireless medium makes it easy for an adversary to compromise sensor nodes [8]. If an anomalous or compromised node exists in the network, it is almost impossible to maintain a secure, functional WSN and to extract meaningful data for the specific application scenario. When WSNs are deployed in sensitive applications such as battlefield monitoring or nuclear plant monitoring, detecting every adversary is necessary because an anomalous node acts like a legitimate node. Once a node is compromised by an adversary, it behaves abnormally: it may tamper with the original message, drop packets, or send excessive packets.

Detecting anomalous nodes is therefore essential for obtaining secure and accurate information from the network. In this research, a supervised machine learning technique called the decision tree is applied. The algorithm mimics human thinking when making a decision, and the logic behind the decision is easy to understand because it follows a tree structure. Moreover, the algorithm has the benefit of running from different initial points, which can better approximate a near-optimal classifier [9]. The method begins at the root of the tree and compares the root attribute with the recorded attributes to predict a class label. After comparing the value, it follows the corresponding branch and moves on to the next node [10]. The algorithm can work with resource-constrained nodes and with any sample data.

The rest of the paper is organized as follows. An overview of recent related work is given in Sect. 2, followed by the assumptions and the architecture of the network in Sect. 3. The decision tree methodology and its implementation are explained in Sect. 4. The results and evaluation are presented in Sect. 5. Section 6 concludes the paper with a summary of the work.

2 Related Works

In a WSN, sensor nodes are dispersed over an area, generally called the sensor field, to collect specific data for a dedicated application. Securing the communication network is obligatory for both wired and wireless media. WSNs use wireless radio links within the network. The construction of the nodes and the features and operating nature of WSNs make the network more vulnerable. Securing WSNs is therefore critical for their efficient functioning and has attracted the attention of researchers, several of whom have developed algorithms to secure the network. Some recent works are discussed in this section.

In developing detection algorithms for anomalous nodes, researchers mainly address the high cost of communication, network topology, resource-constrained nodes, distributed streaming data, and high-dimensional data. Existing data-driven techniques can be categorized by detection approach: (a) statistical-based, (b) classification-based, (c) artificial intelligence-based, (d) distance-based, and (e) cluster-based.

Statistical-based models use a representation-based process. In these techniques, an anomaly or outlier is normally detected from irregular data points: if the occurrence probability of a data instance is low, the node is considered compromised, even though a compromised node acts as a legitimate node. Smarpathi et al. [11] studied a non-parametric statistical methodology. They considered the streaming data collected from the sensor nodes and calculated the data density; an outlier in the network is declared on the basis of a specified degenerate data density value. Kernel density estimators are used to compute and analyse the data from the streaming sensors. Normally, only univariate data can be analysed with this method. Wu et al. [12] designed a parametric method that can classify abnormal sensor nodes in the network as well as the events associated with the sensor network. The method has good computational accuracy and a low false alarm rate, but it does not consider time-based association.

The classification-based approach uses training data and compares freshly received data with the training data set to decide whether a node in the WSN is anomalous. Poornima et al. [13] presented a classification-based method that uses Online Locally Weighted Projection Regression (OLWPR) to detect anomalous nodes in WSNs. Principal component analysis (PCA) is used to reduce unrelated and unwanted data at the input level, and the estimated value is established from the PCA outcome. Mario et al. [14] worked on wearable sensor networks and formulated a deep recurrent combined method to find outliers in the sensor network. Their research considered human activities and used two data sets. The method produces good results and works with multivariate data, but its computational cost is high. Classification-based approaches depend on good training sets, which are difficult to obtain, and high-dimensional and mixed-structure data have not been well studied.

Numerous researchers have studied artificial intelligence (AI)-based methods to secure the network. These methods use prediction together with decision-making theory to detect anomalous nodes in WSNs. Thangaramya et al. [15] introduced a fuzzy rule-based methodology that detects outliers or anomalous nodes on the basis of the routing decision, using key and trust management techniques. In this approach, assigning a degree of trust to every member of the sensor network is difficult. Nauman et al. [16] presented an extended support vector machine (SVM), known as the Quarter-Sphere formulation of One-Class SVM, to classify data online and find outliers in wireless sensor networks. The method does not consider data that are contaminated with noise and faults. AI-based methods need training, and they are hard to train: a high-dimensional data set is necessary to train the network well, and the learning algorithm requires a good learning rate.

Distance-based methods are the most common approach for detecting anomalous nodes in wireless sensor networks. The approach works with nearest-neighbour distance measurements. Amel et al. [17] formulated a method that combines nearest-neighbour techniques with game theory to deal with outliers; its computation is complex. Tianwei et al. [18] proposed an outlier detection approach with low computation and memory requirements. It detects or eliminates forged sensor measurements using a weighted average distance factor. The method has the advantage of running on the independent nodes of the network, but obtaining the best results remains a challenge. Asmaa et al. [19] presented a method that uses in-network knowledge discovery to detect outliers and cluster the data simultaneously, based on the nearest neighbour. Its computational complexity is high. In distance-based approaches, selecting a suitable distance for a real-time application is difficult, so selecting the correct neighbours is a problem for this type of method.

In the data mining research community, the cluster-based approach is considered the most modern approach. Xiang et al. [20] worked on a cluster-based methodology to identify outliers in the network, developing an unsupervised contextual model for the purpose. They found that the method was able to detect anomalies as well as abnormalities in the network. This type of method has some drawbacks, including the fact that the same set of data may show higher density in one specific area of the collection field. Nikos et al. [21] presented a method that separates multidimensional and unidimensional outliers. The method uses an in-network scheme that deals with both kinds of data (unidimensional and multidimensional) and assumes that the erroneous data are generated by elements of the individual sensors. Because the method uses a hashing technique, it has a high computational cost and needs a large amount of memory. Clustering-based approaches use distance metrics (Euclidean as well as Mahalanobis), and neither metric has been well studied for high-dimensional data or for mixed-type structures.

The mechanisms presented in the literature have some limitations. Most of them use a fixed threshold value and need sensitive parameter selection, and many involve complex computation and high energy consumption. Moreover, the algorithms implemented in a distributed structure did not take into account an appropriate reference model for data transmission among the nodes, and they cannot distinguish events from errors. Considering these facts and the characteristics of WSN nodes, we conclude that a new method is needed that can deal with small data sets and can keep a network built from resource-constrained sensor nodes sustainable. In this paper, a decision tree-based technique is presented to find anomalous nodes and secure the network.

3 Assumptions and Architecture of the Network

The sensor network is assumed to be deployed in an area in order to measure environmental phenomena. The parameters to be measured or detected in the deployed area are provided by the users. The area of interest is a network grid Ω of Nx × Ny points. Once the network is fully deployed and in operation, the communication channels and the nodes are static. The static sensor nodes observe the data and send them to the sink over the wireless medium. During operation, some sensor nodes may be compromised by an adversary and become anomalous nodes. Such a node compromises the security of the overall network and prevents it from functioning properly. To reflect this scenario, we simulated our network for environmental monitoring. We hypothetically deployed the network on the Sunshine Coast, Australia, and simulated it with January to March weather data for temperature and humidity. According to weather and climate information for the Sunshine Coast during this period, the temperature stays between 20 and \(30^\circ {\text{C}}\), and the humidity is about 66%. We accept temperature data that lie within the 2 sigma range, that is, within two standard deviations of the mean of the Gaussian distribution. According to the research conducted by Holder et al. [22], the 2 sigma range yields a good collection of data. Hence, about 95.45% of the sensor data are retained in our implementation.
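The 2 sigma acceptance rule described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `within_two_sigma`, the use of the population standard deviation, and the sample temperatures are hypothetical and are not taken from the simulation.

```python
from statistics import mean, pstdev


def within_two_sigma(readings):
    """Return the readings that lie within mean +/- 2 standard deviations."""
    mu = mean(readings)
    sigma = pstdev(readings)  # population standard deviation (an assumption)
    lower, upper = mu - 2 * sigma, mu + 2 * sigma
    return [x for x in readings if lower <= x <= upper]


# Hypothetical temperature samples in degrees Celsius; 45.0 is an injected outlier.
temperatures = [24.1, 25.3, 26.0, 24.8, 25.5, 23.9, 26.2, 25.0, 24.6, 45.0]
accepted = within_two_sigma(temperatures)
```

In this toy sample the injected reading of 45.0 falls outside the 2 sigma band and is dropped, while the other nine readings are kept.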

4 Method

The decision tree is a supervised machine learning algorithm that derives a strategy for accomplishing a specific goal. It can be used to solve both regression and classification problems. The data are repeatedly split according to a specific parameter. The algorithm uses a tree-like decision-making structure in which internal nodes represent tests on attributes, branches represent the outcomes of the tests, and leaf (terminal) nodes store the class labels. The most important decision tree terms are [23]:

  a) Root Node: The root node sits at the very top of the decision tree. It represents the entire sample, which can be subdivided into several homogeneous sets.

  b) Splitting: The process of dividing a node into two or more sub-nodes.

  c) Decision Node: A node that divides the data into further sub-nodes.

  d) Leaf Node: A node that does not split; leaf nodes provide the final output of the decision tree.

  e) Pruning: The opposite of splitting. Pruning removes sub-nodes from a decision node.

  f) Sub-tree: A branch, or sub-tree, is a sub-division of the entire tree.

  g) Parent Node: A node that is divided into sub-nodes.

  h) Child Node: A sub-node of a parent node.

The method starts at the root node of the tree to predict the class of a given record. The value of the root attribute is compared with the value of the corresponding attribute in the record, and, based on the comparison, the method follows the matching branch and moves to the next node. At that node, the method again compares the attribute value with those of the sub-nodes and moves on. The process is repeated until a leaf node of the tree is reached (Fig. 2).

Fig. 2. The decision tree method
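The root-to-leaf traversal can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code: the `Node` structure, the binary threshold tests, and the hand-built `toy_tree` (which checks only the upper limits for temperature and humidity) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure)."""
    attribute: Optional[str] = None    # attribute tested at an internal node
    threshold: Optional[float] = None  # split value for that attribute
    left: Optional["Node"] = None      # followed when value <= threshold
    right: Optional["Node"] = None     # followed when value > threshold
    label: Optional[str] = None        # class label, set only on leaf nodes


def predict(root, record):
    """Walk from the root to a leaf, testing one attribute at each node."""
    node = root
    while node.label is None:
        if record[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hand-built toy tree: the thresholds are illustrative, not learned from data.
toy_tree = Node(
    attribute="temperature", threshold=30.0,
    left=Node(attribute="humidity", threshold=70.0,
              left=Node(label="NF"), right=Node(label="FH")),
    right=Node(label="FT"),
)
```

For example, `predict(toy_tree, {"temperature": 25.0, "humidity": 65.0})` follows the left branch twice and ends at the leaf labeled `NF`.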

An attribute selection measure (ASM) is used to select the attribute at each node. The entropy of a dataset is a measure of the uncertainty, or disorder, in it, and the entropy of an individual node reflects its randomness. When entropy is high, the margin between possible outcomes is thin and the model has little confidence in the accuracy of its prediction. In general, the higher the entropy, the more random the dataset [24, 25]. A decision tree algorithm therefore prefers low entropy. Entropy is calculated as shown in Eq. (1):

$$ Entropy \left( S \right) = - \sum\nolimits_{i = 1}^{n} {P_{i} \,log\,P_{i} } $$
(1)

In Eq. (1), \(P_{i}\) represents the probability of data class i, and n is the number of classes. Information gain is used to decide which attribute belongs at which node of the decision tree and whether a specific feature should be used to split a node. The information gain is simply the change in entropy after the dataset has been split on an attribute; it measures how much information a feature provides about the class. Building a decision tree involves splitting each node according to the information gain obtained. The decision tree method first splits on the attribute with the highest information gain, which maximizes the information gain. The information gain is calculated as shown in Eq. (2):

$$ Information \,Gain, \;I = {\text{S}} - \sum\nolimits_{i} {w_{i} \,{*}\,{\text{S}}_{i} } $$
(2)

where S is the entropy of the total sample space, \(w_{i}\) is the weight of the i-th subset produced by the split (the fraction of samples that fall into it), and \({\text{S}}_{i}\) is the entropy of that subset.
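Entropy and information gain can be computed directly from class labels, as in the following sketch. It assumes a base-2 logarithm and takes the weight of each subset to be the fraction of samples it contains; both choices, and the function names, are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, subsets):
    """Parent entropy minus the weighted entropy of the subsets after a split."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Hypothetical labels: four fault-free nodes and four temperature-fault nodes.
parent = ["NF"] * 4 + ["FT"] * 4
perfect_split = [["NF"] * 4, ["FT"] * 4]                      # pure subsets
useless_split = [["NF", "NF", "FT", "FT"], ["NF", "NF", "FT", "FT"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A split that separates the two classes completely recovers all of the parent entropy, whereas a split that leaves each subset as mixed as the parent gains nothing, which is why the attribute with the highest gain is chosen first.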

The method is well suited to anomaly detection in WSNs because the data collected by the nodes do not need to be normalized or standardized. Moreover, missing values in the data do not need to be imputed. A decision tree model therefore requires less coding and analysis in the preprocessing steps.

5 Result

The wireless sensor network was hypothetically simulated in Python with temperature and humidity as input data. In the simulation, the temperature is taken to vary between 20 and \(30^\circ {\text{C}}\) and the humidity between 60% and 70%, based on the Sunshine Coast data. Packet forwarding should be 5 to 8 packets per minute. For this scenario, we implemented the decision tree algorithm to find anomalous nodes and secure the network. The following assumptions are made while constructing the decision tree:

  I. In the beginning, the whole dataset is considered as the root node, and the decision tree process starts from there.

  II. Decision trees are best constructed using categorical feature values. Continuous values must be discretized before the model is constructed.

  III. Records are distributed recursively based on attribute values. The attribute values are important for making the decision.

  IV. A statistical approach is used to determine which attributes are placed at the root node or at the internal nodes of the tree. Attributes are not chosen at random.

The hypothetical wireless sensor network comprises 300 sensor nodes that measure temperature and humidity. In the Python simulation, 75% of the original data were used for training and 25% for testing. A sample extracted from the dataset is shown in Table 1. In the table, the possible predicted scenarios are labeled No Fault (NF), Fault Temperature (FT), Fault Humidity (FH), Fault Packet (FP), Fault Temperature and Humidity (FTH), Fault Temperature and Packet (FTP), Fault Humidity and Packet (FHP), and Fault in All (FTHP).
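One way the eight labels could be derived from a reading is sketched below. This is an assumption for illustration only: the paper does not state how the dataset was labeled, so the function `label_reading`, the inclusive range checks, and the letter-concatenation scheme are hypothetical. The ranges themselves are the ones given for the simulation (20 to 30 °C, 60% to 70% humidity, 5 to 8 packets per minute).

```python
def label_reading(temperature, humidity, packets_per_minute):
    """Map one node reading to a fault label such as NF, FT, or FTHP."""
    faults = ""
    if not 20 <= temperature <= 30:      # degrees Celsius
        faults += "T"
    if not 60 <= humidity <= 70:         # percent relative humidity
        faults += "H"
    if not 5 <= packets_per_minute <= 8:
        faults += "P"
    # No out-of-range value means no fault; otherwise prefix the letters with F.
    return "F" + faults if faults else "NF"


# Hypothetical readings: (temperature, humidity, packets per minute).
samples = [(25, 65, 6), (35, 65, 6), (25, 80, 2), (35, 80, 12)]
labels = [label_reading(*s) for s in samples]
```

Under this scheme the four hypothetical readings map to `NF`, `FT`, `FHP`, and `FTHP`, and every combination of out-of-range attributes yields one of the eight labels listed above.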

Table 1. Sample extracted from the dataset

Fig. 3. The fault data count

Figure 3 shows the count of each type of fault in the dataset. Figure 4 shows the confusion matrix for the test dataset, which consists of 75 nodes. Confusion matrices are a very popular tool for evaluating classification problems, and they work with multiclass as well as binary classification. In the simulation, only 6 nodes appear in the wrong category. Although they are assigned the wrong fault type, they still lie within the fault groups, and no error appears in the “No Fault” category. The 6 misclassified nodes therefore do not affect the isolation of anomalous nodes, as they are still treated as faulty nodes. In multi-class classification, the Hamming loss is a good measure of model performance: the smaller the Hamming loss, the better the model performs. It is calculated by dividing the number of wrong labels by the total number of labels. In this case study, the Hamming loss is 6/75 = 0.080, which shows that the performance of the proposed algorithm is more than satisfactory. The simulation is available at https://github.com/trmyo/Aii2022.git.

Fig. 4. Confusion matrix of the data
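The Hamming loss reported for the test set can be reproduced with a short sketch. The label lists below are hypothetical and only mirror the reported counts (75 test nodes, 6 of them assigned to the wrong fault class, none confused with the fault-free class); they are not the actual simulation output.

```python
def hamming_loss(true_labels, predicted_labels):
    """Fraction of labels that are predicted incorrectly."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Hypothetical test set of 75 nodes: 6 faulty nodes land in the wrong fault class.
true_labels = ["NF"] * 50 + ["FT"] * 25
predicted_labels = ["NF"] * 50 + ["FT"] * 19 + ["FH"] * 6

loss = hamming_loss(true_labels, predicted_labels)

# Count faulty nodes that were predicted as fault-free (missed anomalies).
missed = sum(t != "NF" and p == "NF"
             for t, p in zip(true_labels, predicted_labels))
```

With 6 wrong labels out of 75 the loss is 6/75 = 0.08, and because none of the errors moves a faulty node into the `NF` class, the count of missed anomalous nodes stays at zero.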

6 Conclusion

WSNs provide a platform for collecting data and monitoring desired phenomena. Security is a difficult and critical task if effective and meaningful data are to be extracted from the network. Anomalous nodes create security issues by undermining the integrity of the WSN. In this paper, we investigated the detection of anomalous nodes in order to secure the network. To do so, we used a decision tree, a supervised machine learning technique. The model consumes less energy and needs less computation power. Moreover, it uses minimal memory, because a decision tree does not need the data to be scaled, and the preprocessing steps take less time. The simulation results show a Hamming loss of only 0.08, which indicates that the proposed approach is promising. In future work, we would like to combine it with another algorithm to create a hybrid method with better performance.