1 Introduction

The smart environment of the IoT in the modern real world consists of tiny devices equipped with sensors, actuators, and computational elements. These devices are connected through a mostly wireless network to collect data from the environment and infer its status from them [1]. The smart environment usually consists of heterogeneous devices providing diverse services, as shown in Fig. 1 [2]. The heterogeneous devices may generate imprecise or noisy data that deteriorate the inference accuracy, while the events and data produced in the smart environment are related to each other. It is therefore necessary to implement a sophisticated data integration method that deals with various sources. Note that it is challenging to efficiently fuse a large amount of possibly noisy data and then infer an accurate result. Moreover, the data must be processed according to different contexts and inference conditions. The smart environment needs context-aware operation to achieve high performance with minimal energy consumption and networking overhead.

Fig. 1 The smart IoT environment requiring data fusion [2]

Wireless sensor networks (WSNs) are commonly used to monitor the target area and gather the required data from it. A WSN consists of a number of sensor nodes which are distributed in high density to reliably cover the target area [3, 4]. The sensor nodes are limited in communication and computation power, and therefore collaboration between them is required to collect and transmit data to the base station (BS). The dense distribution in the target area, however, causes the data redundancy problem due to the spatial and temporal correlation of the nodes. Outliers in the sensory data are another problem, which is aggravated by the instability of the communication environment. They reduce the integrity of the data and the performance of the entire system. Considering such unstable and erroneous characteristics of WSN, machine learning techniques are expected to be effective in exploiting the collected data and improving the performance of the system.

The decision-making system is the core component of the smart IoT environment, and its accuracy relies on the integrity of the data obtained from the sensor nodes. The sensor data might be corrupted due to sensory deprivation, restricted coverage, imprecision, and uncertainty, which significantly degrade the quality of decisions. Also, the spatial and temporal redundancy of the data decreases the performance of the WSN [3], and the transmission of redundant data consumes a large amount of energy, which eventually shortens the lifetime of the entire network. By minimizing redundant data, a significant amount of resources can be saved and the network performance can be enhanced. In addition, the uncertainty and inconsistency in sensor data may result in wrong inferences about the environment. Therefore, enhancing the data integrity is the key to increasing the accuracy of decision-making. Data fusion is a discipline concerned with how multi-source data are merged to increase the integrity of data. It makes it possible to deal effectively with the noisy data of a dynamic environment, and supports the decision-making process based on the available information [4]. Fusion of sensor data is one of the crucial tasks in WSN, and numerous data fusion mechanisms have been proposed to filter and merge the sensor data before sending them to the sink and the decision-making system [5,6,7]. Figure 2 shows the structure of a decision-making system consisting of data filtering, fusion, and processing.

Fig. 2 The structure of decision-making system consisting of data filtering, fusion, and processing

Intelligent data fusion is important for improving the accuracy of the decision-making process for the following reasons:

  • The IoT system usually operates in a dynamic real-time environment, and thus it is necessary to establish a smart network which can efficiently adjust its operation according to the operational condition.

  • WSN is often used for gathering data from unreachable, dangerous, or critical locations [8], for example for fire or water leakage detection. The system designers need to utilize a robust technique that is able to make correct and reliable decisions based on the available knowledge, and also gain new knowledge from experience.

  • WSN is usually deployed in a complicated environment, and thus it is quite hard to build an accurate mathematical model of the target operation, e.g. event or outlier detection. Data fusion based on various techniques including machine learning is imperative to efficiently handle such complicated problems and situations.

  • In the machine-to-machine (M2M) communication of the IoT environment, smart decision-making and control are required [9]. With artificial intelligence techniques [10], different levels of knowledge can be used to make a decision, and the tasks are dynamically performed based on the contextual information.

  • It is not easy to extract the important correlations between the data and accurately fuse them if the amount of data is large. Here machine learning techniques are expected to be effective.

Data fusion is applied to combine the data of multiple sources in an effective and accurate way. In the WSN environment it is used to integrate the multi-sensor data and transmit them to the BS [11]. Due to the spatial and temporal correlation of adjacent sensor nodes, a significant amount of redundant data is generated which needs to be reduced. The outliers in the data are caused by unexpected events or malicious attacks on the network, while noise and errors reduce the integrity of the data [12]. Without cleansing or filtering the redundant or erroneous data, the fused data might not be useful. Data fusion is classified mainly into two approaches based on the employed network structure, the centralized approach and the cluster-based approach, as shown in Fig. 3.

Fig. 3 The two approaches employed for data fusion

The centralized approach filters and fuses the data at the sink level so that the end-to-end data transmission delay can be reduced, since the data of the highest priority must be transmitted with minimum delay. However, centralized filtering of the sensed data may limit the inference accuracy, and it increases the network load by sending noisy and redundant data to the sink. The cluster-based approach has been developed to reduce the temporal and spatial data redundancy and the outliers in the collected data at the cluster head (CH). The objective of the data fusion techniques is to collect data using minimal resources. Figure 4 describes the data fusion operation in a clustered WSN using a machine learning technique [13, 14]. Here the researchers attempted to efficiently filter and merge the sensor data at the CH before sending them to the sink.

Fig. 4 The fusion of data with clustered WSN using machine learning technique

This survey paper investigates various data fusion schemes employed in WSN and compares their features. The main contributions of the paper are as follows:

  • The multi-sensor data fusion schemes based on various techniques or theories for the WSN and IoT environment are introduced, which can help researchers in developing a smart cognitive system.

  • The challenges and opportunities for data fusion are explained considering the characteristics of sensor data, including uncertainty, noise, inconsistency, redundancy, and outliers.

  • The mathematical models applicable to multi-sensor data fusion are discussed for applications to WSN and IoT.

The rest of the paper is organized as follows: in Sect. 2, the data fusion techniques are discussed. The opportunities and challenges are explained in Sect. 3, and the conclusion is made in Sect. 4.

2 Approaches for Data Fusion

In a smart environment the data from a single source may not be sufficient for making an accurate decision. Hence, multi-sensor data fusion and inference that can handle heterogeneous data are required. The data fusion techniques are classified into three categories with respect to the employed method [15]: the probability-based, AI-based, and evidence-based techniques. They are summarized in Table 1.

  • Probability-based: recursive operators and Bayesian analysis

  • Artificial Intelligence (AI)-based: neural networks (NN) and fuzzy logic

  • Evidence theory-based: Dempster–Shafer theory

Table 1 The approaches and nature of the techniques developed for data fusion

2.1 Probability-Based Method

In this subsection various probabilistic techniques proposed for data fusion in the IoT are reviewed. Bayesian inference is one of the most popular probabilistic methods developed for data fusion [41,42,43,44,45,46]. It requires a relatively small number of samples to train the system, and allows dealing with the heterogeneity of information based on the probabilistic occurrence of the events in the environment. In [47] a data fusion scheme based on hard and soft sensors is proposed. It presents a cloud-enabled Bayes network for consolidating heterogeneous, real-time data streams from the target region to obtain actionable intelligence from a computer-based decision support network. The data fusion information group (DFIG) model is shown in Fig. 5.

Fig. 5 The data fusion information group (DFIG) model [47]

The DFIG model supports various control functions based on the spatial/temporal/spectral differences of the sensors. The levels of DFIG model are as follows [47]:

  • Level 0 Data Assessment (DA):

  • Level 1 Object Assessment (OA):

  • Level 2 Situation Assessment (SA):

  • Level 3 Impact Assessment (IA):

  • Level 4 Process Refinement (PR):

  • Level 5 User Refinement (UR):

A dynamic Bayesian network (DBN)-based adaptive data fusion scheme [20] was proposed for various applications such as the identification and detection of objects. It combines the previous belief with the current observation of the phenomenon to obtain the subsequent estimate. Figure 6 is an example of the model.

Fig. 6 An example of DBN [20]

In [48] information aggregation and image data mining are achieved using the Bayesian technique. Bayesian inference is applied to estimate a given physical parameter based on the observations obtained with various sensors. The paradigm of Bayesian information and knowledge fusion is shown in Fig. 7. Here the fusion of two data sets, \(D_{1}\) and \(D_{2}\), requires the combination of knowledge in the form of the a priori models, \(M_{1}\) and \(M_{2}\), as in Eq. (1). A Bayesian methodology for data fusion can be formulated to maximize the a posteriori probability [48]:

$$p\left( {\Theta |D_{1} ,D_{2} ,M_{1} ,M_{2} } \right) = \frac{{p\left( {D_{1} |\Theta ,M_{1} } \right)p\left( {D_{2} |\Theta ,M_{2} } \right) \cdot P\left\{ {p\left( {\Theta |M_{1} } \right)p\left( {\Theta |M_{2} } \right)} \right\}}}{{p\left( {D_{1} ,D_{2} |M_{1} ,M_{2} } \right)}}$$
(1)

where P denotes the prior data in the hypothesis of two distinct models.
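As a rough illustration of Eq. (1), the sketch below evaluates the posterior on a grid of candidate values of \(\Theta\). It is a minimal sketch under stated assumptions, not the method of [48]: both likelihoods are assumed Gaussian, a single flat prior stands in for the combined prior term \(P\{\cdot\}\), and the function names, readings, and noise levels are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fuse_posterior(d1, std1, d2, std2, grid, prior):
    """Grid approximation of p(theta | D1, D2): likelihood 1 x likelihood 2 x prior."""
    unnorm = [
        gaussian_pdf(d1, theta, std1) * gaussian_pdf(d2, theta, std2) * p
        for theta, p in zip(grid, prior)
    ]
    evidence = sum(unnorm)  # plays the role of the denominator p(D1, D2 | M1, M2)
    return [u / evidence for u in unnorm]

# Hypothetical example: two temperature sensors observing the same quantity.
grid = [15.0 + 0.01 * i for i in range(1001)]   # candidate values from 15 to 25
prior = [1.0 / len(grid)] * len(grid)           # assumed flat prior
posterior = fuse_posterior(20.4, 0.5, 19.8, 1.0, grid, prior)

# Maximum a posteriori estimate: the grid value with the highest posterior mass.
map_estimate = grid[max(range(len(grid)), key=posterior.__getitem__)]
```

With a flat prior the estimate lies between the two readings and closer to the one with the smaller assumed noise, which is the behavior expected from maximizing the posterior.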

Fig. 7 The paradigm of Bayesian information and knowledge fusion [48]

Nowadays monitoring for animal health care is improving rapidly. In [49] an animal health monitoring scheme using a Bayesian algorithm was proposed to enhance productivity and monitor the health of the animals. A data mining approach based on Bayesian networks (BN) was introduced in [50], which integrates quantitative and qualitative knowledge into comprehensive probabilistic information prototypes and inference in WSN. Similarly, a Bayesian-based model [51] was proposed to fuse the temperature data measured in a smart building. It extracts knowledge from a few sensor measurements, and then predicts the spatial temperature distribution for posterior estimation. In [17] three filtering approaches, Pre-Filtering, Post-Filtering, and Pre-Post-Filtering, were proposed to fuse the sensor data. The sensor data are filtered and combined using a modified Bayesian fusion algorithm with a Kalman filter to effectively handle the uncertainty and inconsistency problems.
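The pre-filtering idea can be sketched as one scalar Kalman filter per sensor followed by a Bayesian (inverse-variance) fusion of the filtered estimates. This is only an illustrative sketch and not the modified algorithm of [17]: the constant-state model, the noise variances, and the sample readings are all assumptions.

```python
def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1e3):
    """Scalar Kalman filter for a nearly constant state; returns (estimate, variance)."""
    x, p = x0, p0
    for z in measurements:
        p = p + process_var          # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (z - x)          # correct the estimate with the innovation
        p = (1.0 - k) * p            # updated estimate variance
    return x, p

def fuse_gaussian(estimates):
    """Fuse independent Gaussian estimates given as (mean, variance) pairs."""
    precision = sum(1.0 / var for _, var in estimates)
    mean = sum(m / var for m, var in estimates) / precision
    return mean, 1.0 / precision

# Hypothetical readings from a precise sensor (a) and a noisy sensor (b).
sensor_a = [21.2, 20.7, 21.1, 20.9, 21.0]
sensor_b = [22.0, 20.1, 21.6, 20.4, 21.3]
est_a = kalman_1d(sensor_a, meas_var=0.05, process_var=1e-4)
est_b = kalman_1d(sensor_b, meas_var=0.60, process_var=1e-4)
fused_mean, fused_var = fuse_gaussian([est_a, est_b])
```

The fused variance is never larger than that of either input, which is the sense in which fusion reduces uncertainty under these assumptions.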

2.2 Artificial Intelligence-Based Method

The artificial intelligence-based data aggregation and fusion techniques can effectively classify and abstract the information, and extract important features and knowledge from the data [52, 53]. The sink nodes can handle the fusion and classification of the data extracted from multiple sources using the back-propagation network (BPN) technology. Here the location and time limitation are considered to reduce the data gathering latency (Fig. 8).

Fig. 8 The data fusion model for decision making based on BPN [53]

With the fuzzy-based data fusion algorithm [27], a variable fusion weight is assigned to the CH. The weight is computed using fuzzy logic dealing with various parameters such as delay, amount of data, and reliability. The schematic model and the structure of the Back-Propagation Networks Data Aggregation (BPNDA) scheme [29] are shown in Figs. 9 and 10, respectively. This data aggregation scheme for WSN was proposed to reduce the communication traffic, save energy, and improve the accuracy of information gathering. The data collected from the sensors are processed at the CH using a back-propagation neural network before being transmitted to the sink.

Fig. 9 The schematic model of BPNDA [29]

Fig. 10 The structure of BPNDA [29]

In [28] fuzzy logic is used to separate the occurrences of failure in the data based on the existing false-positive instances. It explores the use of various context information to statistically estimate the network condition with negligible overhead. An energy-efficient context monitoring framework is presented in [54], which adjusts the monitoring policy by learning the associations between the attributes. The schemes in [13, 55, 56] employ the self-organizing map (SOM) as a clustering approach, which is a three-layer neural network consisting of an input, a middle, and an output layer. The input layer, \(X = \left( {x_{1} ,x_{2} , \ldots ,x_{d} } \right)^{T}\), is fully connected to the middle layer, which feeds the output layer, \(Y = \left( {y_{1} ,y_{2} , \ldots ,y_{m} } \right)\), as shown in Fig. 11. The training process of SOMDA iteratively updates the synaptic weights of the winner neuron and its neighbors. At each training step, a sample vector, \({x}_{i,d}\), is randomly selected from the input dataset, and the algorithm calculates the Euclidean distance between every weight vector and the input vector \({x}_{d}\). The node whose weight vector is closest to the input vector is tagged as the best-matching unit (BMU), \({j}^{*}\):

$$j^{*} = \mathop {\arg \min }\limits_{j} \left( {\sqrt {\mathop \sum \limits_{i = 1}^{d} \left( {x_{i} - w_{ij} } \right)^{2} } } \right)$$
(2)

Fig. 11 The diagram of the SOM network [13]

The synaptic weight vector, \(W_{k} = \left( {w_{k,1} ,w_{k,2} , \ldots ,w_{k,m} } \right)\), represents the directed links between the input layer \(X\) and the output layer \(Y\), where \(k \in \left\{ {1,2, \ldots ,m^{2} } \right\}\) is the index of the kth node of the output layer as shown in Fig. 11. The synaptic weight at time \(t + 1\), \(w_{j} \left( {t + 1} \right)\), is obtained as follows:

$$w_{j} \left( {t + 1} \right) = w_{j} \left( t \right) + \alpha \left( t \right) \cdot h_{ci} \left( t \right)\left[ {x_{i} - w_{j} \left( t \right)} \right]$$
(3)

where α and t represent the learning rate factor and the iteration of the training process, respectively. The Gaussian neighborhood function, \({h}_{ci}\left(t\right)\), indicates how strongly the neighbor neurons are connected around the winner during the learning process, and all the neurons close to each other are arranged in the two-dimensional grid as shown in Fig. 12. It is specified as:

$$h_{ci} \left( t \right) = \exp \left( { - \frac{{\left\| {r_{c} - r_{i} } \right\|^{2} }}{{2\sigma^{2} \left( t \right)}}} \right)$$
(4)

where \({r}_{c}\) and \({r}_{i}\) represent the locations of the winner neuron \(c\) and neuron \(i\) in the grid, and \({\Vert {r}_{c} - {r}_{i}\Vert }^{2}\) is the squared distance between them.

Fig. 12 The grid representation of the SOM-NN [13]
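The SOM training step described by Eqs. (2) to (4) can be sketched as follows. This is a minimal illustration rather than the SOMDA implementation of [13]: the exponential decay of the learning rate and neighborhood width, the grid size, and the sample readings are assumptions, and all function names are hypothetical.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index j* of the node whose weight vector is closest to x (Eq. 2)."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def som_step(weights, positions, x, alpha, sigma):
    """One update: every node moves toward x, scaled by its grid distance to the BMU."""
    c = best_matching_unit(weights, x)
    for j, w in enumerate(weights):
        grid_dist_sq = sum((a - b) ** 2 for a, b in zip(positions[c], positions[j]))
        h = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))                   # Eq. (4)
        weights[j] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]  # Eq. (3)
    return c

def train_som(data, rows, cols, steps=500, alpha0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols map; alpha and sigma decay exponentially (assumed schedule)."""
    rng = random.Random(seed)
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in data[0]] for _ in positions]
    for t in range(steps):
        decay = math.exp(-t / steps)
        som_step(weights, positions, rng.choice(data), alpha0 * decay, sigma0 * decay)
    return weights, positions

# Hypothetical two-feature sensor readings forming two groups.
readings = [[0.10, 0.20], [0.15, 0.10], [0.90, 0.80], [0.85, 0.95]]
weights, positions = train_som(readings, rows=3, cols=3)
clusters = [best_matching_unit(weights, x) for x in readings]
```

After training, readings that map to the same or neighboring BMUs can be treated as one cluster, so a cluster head could forward a single representative value per BMU instead of every raw reading.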

The reinforcement learning technique allowing a sensor node (an agent) to interact with its environment using Q-learning [57] is shown in Fig. 13. The efficiency of data collection from the sensor nodes depends on the movement policy of the mobile element (ME), with which the best position of the ME is decided. Figure 14 depicts how the policy is applied in accessing the reward. In an uncertain environment, the data gathering process of the ME is dynamically modeled as a Markov decision process to improve the movement of the ME [58, 59]. The authors integrated the reinforcement learning algorithm with the data fusion process to develop an adaptive system. It employs a kernel-based learning method which enhances the efficiency of data integration and fusion.

Fig. 13 The structure of the Q-learning method

Fig. 14 The movement policy of ME
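The Q-learning idea behind the movement policy of the ME can be illustrated with a small tabular sketch. It is not the scheme of [57, 58, 59]: the one-dimensional line of positions, the reward placed at a single position, and the hyperparameters are assumptions chosen only to show the update rule.

```python
import random

ACTIONS = (-1, +1)  # move the mobile element one position left or right

def q_update(q, s, a, r, s_next, alpha, gamma):
    """Standard Q-learning update for one observed transition (s, a, r, s_next)."""
    q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])

def train(n_positions, reward_at, episodes=500, steps=20,
          alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table for an ME moving along a line of positions (epsilon-greedy)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_positions)]
    for _ in range(episodes):
        s = rng.randrange(n_positions)
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))                       # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])   # exploit
            s_next = min(max(s + ACTIONS[a], 0), n_positions - 1)     # stay on the line
            q_update(q, s, a, reward_at.get(s_next, 0.0), s_next, alpha, gamma)
            s = s_next
    return q

# Hypothetical setting: position 4 is where most data can be collected.
q_table = train(n_positions=5, reward_at={4: 1.0})
policy = [ACTIONS[max(range(2), key=lambda i: row[i])] for row in q_table]
```

Reading the greedy action from each row of the Q-table gives the learned movement policy, which in this toy setting steers the ME toward the rewarding position.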

The Mahalanobis distance-based radial basis function (MDRBF)-based Extreme Learning Machine (MELM) [14, 60] is a two-stage data aggregation scheme consisting of a projection stage and a clustering stage. Figure 15 shows the structure of the MDRBF-based ELM neural network. In the projection stage the weights of the links, \({w}_{n}\), are adjusted with the center of the neuron, \({\mu }_{k}\), at the intermediate layer. The primary objective of training the neurons in the intermediate layer is to place the centers of their Gaussian functions. In the clustering stage, the values of the neurons are adjusted with the output weight, \({w}_{i}^{^{\prime}}\), through the training and tuning process to achieve the target output. The output weight of the clustering stage is determined analytically. As a result, the scheme can improve the accuracy of clustering with small computation overhead.

Fig. 15 The structure of the MDRBF-based ELM neural network [14]

2.3 Evidence Theory-Based Method

Evidence theory is a powerful and concrete method of fusion which extracts precise information from multiple sensor nodes [37]. It transforms multi-source subjective and conflicting information into a decision-making result by combining the mass functions of different sources. Dempster–Shafer (DS) theory is an evidence-based theory, and it is regarded as one of the effective approaches for data fusion. The combination rule of DS theory can effectively merge the measures of evidence from different sources, as shown in Fig. 16. The relationship between the belief, disbelief, unknown, and plausibility functions in the DS theory is shown in Fig. 17.

Fig. 16 The DS process with n sensors

Fig. 17 The relationship between belief, disbelief, unknown, and plausibility function

A generic evidence fusion scheme [61] was proposed using the DS theory to deal with the uncertainty in the sensor readings and capture the features of the environment. A two-step technique is used to build a belief function from the sensor data, and the rule of combination for the data of three sensors is expressed as:

$$m^{1,2,3} \left( E \right) = \frac{{\mathop \sum \nolimits_{{s_{1} \cap s_{2} \cap s_{3} = E }} m^{1} \left( {s_{1} } \right) \cdot m^{2} \left( {s_{2} } \right) \cdot m^{3} \left( {s_{3} } \right)}}{{\mathop \sum \nolimits_{{s_{1} \cap s_{2} \cap s_{3} \ne \emptyset }} m^{1} \left( {s_{1} } \right) \cdot m^{2} \left( {s_{2} } \right) \cdot m^{3} \left( {s_{3} } \right)}}$$
(5)

Here \(m^{1,2,3} \left( E \right)\) is the combined evidence obtained from the mass functions of three sensor nodes, \({m}^{1}\left({s}_{1}\right)\), \({m}^{2}\left({s}_{2}\right)\), and \({m}^{3}\left({s}_{3}\right)\), and \(\emptyset\) denotes the empty set. The generalized combination rule of DS theory for n sensor nodes is defined as follows:

$$m^{{\left( {1,2 \ldots n} \right)}} = m^{1} \oplus m^{2} \oplus \cdots \oplus m^{n}$$
$$m^{{\left( {1,2 \ldots n} \right)}} \left( E \right) = \frac{1}{1 - K} \mathop \sum \limits_{{ \cap_{i} s_{i} = E}} \left( {\mathop \prod \limits_{1 \le i \le n} m^{i} \left( {s_{i} } \right)} \right), \quad E \ne \emptyset$$
(6)
$$K = \mathop \sum \limits_{{ \cap_{i} s_{i} = \emptyset }} \left( {\mathop \prod \limits_{1 \le i \le n} m^{i} \left( {s_{i} } \right)} \right)$$
(7)

where \(K\) is the total mass assigned to conflicting (empty) intersections.
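The combination rule can be sketched in a few lines for a small frame of discernment. The sketch below is illustrative only: hypotheses are represented as frozensets, the sources are combined pairwise (Dempster's rule is associative, so folding it over n sources matches the n-source form), and the occupancy hypotheses and mass values are hypothetical.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset hypotheses."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb      # contributes to K, the conflicting mass
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    fused = {h: w / (1.0 - conflict) for h, w in combined.items()}
    return fused, conflict

def combine_all(masses):
    """Fold the pairwise rule over any number of sources."""
    result = masses[0]
    for m in masses[1:]:
        result, _ = combine(result, m)
    return result

# Hypothetical frame: a room is either occupied or empty.
OCC = frozenset({"occupied"})
EMP = frozenset({"empty"})
ANY = OCC | EMP                      # mass on the whole frame models ignorance

m_temp = {OCC: 0.6, ANY: 0.4}
m_light = {OCC: 0.7, EMP: 0.1, ANY: 0.2}
m_humid = {EMP: 0.3, ANY: 0.7}

fused_two, k_two = combine(m_temp, m_light)
fused_three = combine_all([m_temp, m_light, m_humid])
```

Mass placed on the whole frame lets a sensor express ignorance, which is the main practical difference from assigning a probability to each hypothesis directly.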

A multi-sensor data fusion system [62] based on the DS theory was proposed to detect occupancy (residence) in a room from various sources such as temperature, humidity, and light. It assigns a mass to the sensor data using the mass function, combines all the masses with the combination rule, and then makes a decision. Here the occupancy sensing problem is expressed as a classification problem, and each class is described by a separate set of characteristics. Before the mass of the data obtained from a sensor is computed, the data are sent to the data fusion center (DC), shown in Fig. 18, to compute the probability density function. The DC is located within the building premises to increase the accuracy and reduce the cost.

Fig. 18 The IoT structure for residence sensing [62]

In [63] a DS theory-based fusion scheme was proposed for event detection on Twitter. Two types of data are involved in the fusion: the features extracted from the text using the bag-of-words technique, and the visual features extracted by applying the scale-invariant feature transform. The DS theory of evidence is applied to combine the data from the two sources, and the method is depicted in Fig. 19. A feature belongs to either text, \(t\), or image, \(\overline{t}\), while \(\theta\) refers to the uncertainty inherent in the theory of evidence. Together these constitute the frame of discernment, \(\Theta\):

$${\Theta } = \left\{ { t,{ }\overline{t},\theta } \right\}$$

Fig. 19 The block diagram of fusion for Twitter data [63]

Various techniques proposed for data fusion are compared in Table 2 regarding the employed machine learning approach, complexity, and purpose.

Table 2 The comparison of different techniques proposed for data fusion

3 Opportunities and Challenges

A huge amount of data are continuously generated in IoT environment, and it is very challenging to efficiently handle them since the data generated by the sensors are not precise and contain many outliers. Extracting reliable and accurate information is critical because the low-quality data may negatively affect the result of the overall data fusion operation [67, 68]. The opportunities and challenges with data fusion using various techniques are as follows.

3.1 Opportunities

  • Filtering of data: Sensor data are noisy and imprecise, and thus filtering of data is needed to make data more intelligent, decisive, sensible, and precise. Various filters including Kalman filter and Moving-average filter (MAF) could be employed for pre-processing of data [69, 70]. An adaptive approach is also needed to improve the filtering operation of sensor data for real-time IoT environment.

  • Data analysis: Analysis of the fused data needs to be accurate and fast to provide timely service. The probabilistic technique such as Bayesian decision network might be effective for analyzing heterogeneous data. The Bayesian approach for the estimation of the covariance of data [23] and Bayesian inference-based data fusion [41,42,43,44,45,46] are expected to be effective for the integration of sensor data.

  • Power consumption: Data fusion and classification need to be efficient to increase the lifetime of the WSN and IoT devices by removing outliers and redundant data. Clustering of the nodes based on data similarity and density would improve the power efficiency, and various machine learning techniques can support such clustering [13, 14].

  • Security and information: The data fusion operation needs to take security into account by hiding and encrypting the information. A new approach integrating the fusion and encryption of data would be important.

  • Knowledge and decision-making: Data fusion needs to help extract knowledge from multi-source data to make accurate decisions. Evidence theory is a powerful and concrete method of fusion which extracts precise information from multiple sensor nodes and makes decisions based on the fused data [37]. Data mining based on Bayesian networks [50] is expected to be effective for integrating quantitative and qualitative knowledge into comprehensive probabilistic information.

  • Self-organized system: Different contexts may require different sensory capabilities, and it is not desirable to determine a priori the subset of sensors to use. In a real-world scenario, the context conditions may change over time, implying the need for a system capable of dynamically selecting the subset of sensory devices. The SOM-based approaches will be effective for implementing context-aware self-organized system.

  • Clustering and classification of data: Since sensors generate uncertain, imperfect data containing outliers, a new efficient approach for data fusion is needed to maximize the performance of the fusion and of the hosting network. Here node clustering based on the data density and similarity will play an important role.
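The moving-average pre-filtering mentioned in the first item of the list above can be sketched as a causal sliding-window filter. This is a minimal sketch; the window length and the sample readings are assumptions, and the function name is hypothetical.

```python
from collections import deque

def moving_average(samples, window):
    """Causal moving-average filter; early outputs average the samples seen so far."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for s in samples:
        if len(buf) == window:
            total -= buf[0]      # the oldest sample is about to leave the window
        buf.append(s)
        total += s
        out.append(total / len(buf))
    return out

# Hypothetical noisy temperature readings with one spike.
raw = [20.1, 20.3, 19.9, 24.0, 20.2, 20.0, 20.1]
smoothed = moving_average(raw, window=3)
```

A longer window suppresses more noise but reacts more slowly to real changes, which is one reason an adaptive filter may be preferable in a real-time IoT setting.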

3.2 Challenges

  • Multivariate data analysis: Due to the complexity of the data, analysis and visualization of multivariate data are imperative. IoT environments are heterogeneous due to disparate sources of data and devices. There are few studies on the covariance and multivariate analysis of sensor data. The effectiveness of distributed multivariate outlier detection also needs to be enhanced in terms of data communication and energy efficiency.

  • Optimization with machine learning model: Researchers have proposed employing ELM to dramatically reduce the computation time of training. However, instability may occur due to the random selection of the weights and biases of the model. A systematic approach needs to be developed to decide the optimal values for the target problem.
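As an illustration of the multivariate outlier detection mentioned in the first item above, the sketch below flags two-feature readings by their squared Mahalanobis distance from the sample mean. It is a simple centralized baseline and not a distributed scheme from the surveyed literature: the readings are hypothetical, and the cutoff of 5.99 (the 95% chi-square quantile for two degrees of freedom) is an assumed choice.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix (unbiased, divisor n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix is singular")
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T S^-1 d, written out with the closed-form inverse of a 2x2 matrix.
        out.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

# Hypothetical (temperature, humidity) readings; the last one breaks the pattern.
readings = [(20.1, 45.0), (20.4, 44.1), (19.8, 46.2), (20.9, 43.0),
            (20.2, 45.3), (20.6, 43.8), (19.9, 45.9), (20.3, 52.0)]
distances = mahalanobis_sq_2d(readings)
flagged = [i for i, d in enumerate(distances) if d > 5.99]
```

Because the distance accounts for the correlation between the two features, a reading can be flagged even when each of its values looks plausible on its own.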

4 Conclusion

A tremendous amount of data is continuously generated in the smart IoT environment and is usually transmitted through wireless networks including WSN. Such data need to be efficiently collected and analyzed to make decisions on the service. This poses various challenges, and timely, accurate fusion and analysis of sensor data is one of the key issues. The performance of data fusion in the IoT and WSN environment can be significantly improved if the errors and uncertainty in the sensor data are reduced by proper fusion considering the context.

Numerous research and development efforts have been made on data fusion, utilizing various approaches to face the challenges of big data analysis in WSN and IoT. In this article we have presented a literature survey on the data fusion schemes proposed for reliable and accurate operation. These schemes combine the data obtained from various sources and extract meaningful information to support the decision process. The opportunities and challenges of data fusion in the IoT and WSN environment are also summarized. There still exist numerous challenges and issues that need the attention of researchers in the future.