1 Introduction

The concept of the Internet of Things (IoT) was originally introduced by the Auto-ID Centre at the Massachusetts Institute of Technology (MIT) in 1999. With the development of technology and the expansion of its application scope, the IoT has advanced significantly (Wang et al. 2015). In 2005, the International Telecommunication Union (ITU) formally recognized the concept of IoT at the World Summit on Information Society held in Tunisia. Following the summit, the ITU Internet Report 2005 was released (Strategy and Unit 2005). The report provides an in-depth introduction to the IoT and its effect on businesses and individuals around the world. It contains information on key emerging technologies, market opportunities and policy implications. In the report, the IoT is described as a technological phenomenon in which connections multiply and create an entirely new dynamic network of networks. To date, many IoT applications have been developed in different fields, such as intelligent agriculture (Taylor et al. 2013), smart grids (Severini et al. 2013), environmental protection, intelligent medical care (Yang et al. 2014) and smart homes (Hernandez et al. 2014).

The volume of data on the Internet and the World Wide Web is overwhelming and continues to grow at a stunning pace. Approximately 2.5 quintillion bytes of data are created each day, and it is estimated that 90% of all data were generated in the past 2 years. It is also estimated that approximately 25 billion devices will be connected to the Internet by 2015 and 50 billion by 2020 (Evans 2011). Such a large number of highly distributed and heterogeneous devices will need to interconnect and communicate autonomously in different scenarios. The suite of technologies developed for the Semantic Web, such as ontologies, semantic annotation, Linked Data and Semantic Web services, can be used as principal solutions for realising the IoT.

Big data solutions, such as network convergence (Yang et al. 2009) and cloud platforms (Wei et al. 2014), can provide infrastructure and tools to handle, process and analyse IoT data. However, efficient methods and solutions that can structure, annotate, share and make sense of IoT data have not yet been developed. Heterogeneity issues in the IoT result from different software platforms, network technologies and communication protocols. The focus of this paper is on bridging the heterogeneity of the data representations in use. To address these issues, in view of the meaning and characteristics of the IoT, we compare the features of data and information in the IoT with those of existing wireless sensor networks and the Internet. To the best of our knowledge, this is the first attempt to present such a comparison. Then, we design a framework for multi-source heterogeneous information fusion in the IoT and set up an experimental simulation platform for environmental monitoring to verify the proposed framework.

The remainder of the paper is organized as follows. Related work is presented in Sect. 2. The features of data and information in the IoT are compared to an existing wireless sensor network and the Internet in Sect. 3. To the best of our knowledge, this is the first time such a comparison has been presented. Implementation of the proposed architecture for IoT information fusion is described in Sect. 4. Our validation results are presented in Sect. 5, and we conclude our paper and suggest avenues for future work in Sect. 6.

2 Related work

Data fusion is an important tool for the manipulation and management of IoT data to improve processing efficiency and provide advanced intelligence. The general definition of data fusion (Wald 1999) is that it is a formal framework that contains expressed means and tools for the alliance of data originating from different sources. It aims at obtaining information of greater quality; the exact definition of “greater quality” depends on the application. Information fusion is commonly used in detection and classification tasks in different application domains, such as aggregation techniques that reduce data traffic to save energy, overall estimation of sensor data to improve information quality, intrusion detection and robotics.

There has been little research into general architectures for information fusion for the IoT; however, there has been a great deal of research into methods and algorithms for information fusion for wireless sensor networks. Several studies (Nakamura et al. 2007; Kreibich et al. 2014) focused on how to obtain more useful information from similar types of raw sensor data effectively and make decisions via computations based on professional rules for several types of sensor data.

Information fusion in the IoT is an active research area that has recently emerged. The focus of this research is to discover potential relevance and knowledge from large amounts of perceptual information. Numerous studies are primarily based on Semantic Web technologies.

In Zhao et al. (2015), the authors present a retrieval system based on topic discovery and semantic awareness in the IoT environment. Semantically aware retrieval is achieved by parsing a query and ranking the relevance of the content. Some studies have focused on enabling technologies to add semantics to the IoT. In Su et al. (2015), the authors analysed data formats that enable IoT applications to consume semantic IoT data in a straightforward and general fashion, and evaluated resource usage of different alternatives with a sensor system. In Ryu et al. (2015), the researchers proposed an integrated semantic service platform to support ontological models in various IoT-based service domains in a smart city. In particular, they addressed three main problems associated with providing integrated semantic services with IoT systems: semantic discovery, dynamic semantic representation and a semantic data repository for IoT resources. Hasan and Curry (2014) proposed an approach in which participants only agree on a distributional statistical model of semantics represented in a corpus of text to derive semantic similarity and relatedness and proposed an approximate model for relaxing the semantic coupling dimension via an approximation-enabled rule language and an approximate event matcher. Jara et al. (2014) analysed the Semantic Web of Things, and reviewed trends for capillary networks and for cellular networks with standards such as IPSO, ZigBee, OMA and the oneM2M initiative. In addition, they summarised the impact of semantic annotations/metadata on the performance of the resources.

In addition to semantics, context awareness is frequently investigated in relation to information fusion in the IoT. Perera et al. (2014) proposed a context-aware sensor search and selection and ranking model, which they refer to as CASSARAM, to address the challenge of efficiently selecting a subset of relevant sensors out of a large set of sensors with similar functionality and capabilities.

Another popular research topic is big data management and mining (Zhou et al. 2013) to glean useful information from massive amounts of multi-factor IoT data generated by perspective networks. These studies primarily focused on computational efficiency. The study of interpretability and integration (Le-Phuoc et al. 2012) of multi-source heterogeneous IoT data has become increasingly popular. Such studies have focused on IoT data abstraction (Henson et al. 2012) and access, linked sensor data (Barnaghi et al. 2013), resource/service search and discovery (Rinne et al. 2012) and semantic reasoning and interpretation (Perera et al. 2014). The primary goal of these studies was to integrate and create inter-understandable heterogeneous IoT data from different sources.

3 Preliminaries for data and information in the internet of things

3.1 The internet of things

Wireless sensor networks and RFID are fundamental technologies that make it possible for IoT “Things” to send messages to the Internet. Human–machine interaction technologies identify a variety of human behaviours that achieve customer-centric services.

The Constrained Application Protocol (CoAP) allows these devices to communicate and publish data on the Web. CoAP, which is essentially a lightweight counterpart of HTTP for resource-constrained devices, enables RESTful services down to the sensor level. The ontologies proposed by the World Wide Web Consortium (W3C) Semantic Sensor Networks Incubator Group (SSN-XG) (Compton et al. 2012) give formal guidance for establishing virtual representations of ‘Things’. Linked data and the SPARQL Protocol and RDF (Resource Description Framework) Query Language (SPARQL) can be used to query and integrate the ‘Things’ and their information.

3.2 Data and information in the IoT

The main sources of IoT information differ from conventional internet sources. The information model of traditional internet applications in which a human is the primary source of information has been changing.

A message on the Internet can be divided into control information and data information according to the given functions and can be divided into content generated by human edits and protocol information for communication between different software systems. Thus, it can be considered that all data transmitted over the Internet have a semantic nature. This semantic nature is derived from the human editing or communication protocol software system requirements in the upper layer. The semantics transmitted over the Internet are complex and completely ignored during the process of transmission. The internet network layer is only used for data storage and forwarding functions.

The primary information in the IoT is derived from the real physical world, e.g. a variety of terminal equipment status information from the physical world, collected by various types of sensors, converges to the Internet. Through pre-designed application-specific environments, these data are used to identify a certain state in the physical world, such as fire warning, pest detection on crops in intelligent agriculture and food safety monitoring. However, these data can only be applied to a pre-designed specific application environment. Thus, it is difficult to reuse the data.

The data collected through the device layer in the IoT possess the inherent property of semantics because it is a description of the physical state of the world, which is very different from the data transmitted over the Internet. Furthermore, the semantics contained in the data transmitted over the Internet are produced from the upper layer of the Internet architecture. In contrast, in the IoT, such semantics are produced by the bottom layer. The semantics in the IoT have the characteristic of simplicity.

Information should provide semantics, and the data should be a type of information coding. Thus, data alone do not have a semantic nature. In the network layer of the Internet, transmission should be considered ‘data’, and in the network layer of the IoT, transmission should be considered ‘information’. A comparison of the differences among data transmission in the Internet, wireless sensor networks and the IoT is shown in Table 1. There is a huge difference between the input mode of the traditional internet information and that of the IoT. This is one of the fundamental problems in the design of the IoT architecture. The major problem is to integrate these two types of input information and design a compatible architecture for the current Internet and the IoT to fully realize the value of information.

Table 1 Comparison of differences in data transmission

3.3 New features of information fusion in the IoT

The multi-sensor data fusion discussed above focuses on algorithms to improve data quality and accuracy. Information fusion in the IoT is far beyond multi-sensor data fusion. Based on such multi-sensor data fusion, information fusion in the IoT should satisfy new data and information features in the IoT.

In the IoT environment, information fusion is a framework that comprises theories, methods and algorithms. Such information fusion will improve accuracy and provide more specific inferences that combine and mine measurement data from multiple sensors and related information obtained from the associated databases as compared to results obtained using only a single sensor.

The interpretability and integration of multi-source heterogeneous IoT data is a new and fundamental function of information fusion. Without this function, the related data and information cannot be integrated. In addition, it is impossible to process information fusion computation using a variety of algorithms because heterogeneous data cannot be compared and inter-understood. Therefore, inference and reasoning cannot be performed.

4 New architecture of information fusion for the IoT

Based on multi-sensor data fusion, information fusion in the IoT has been developed to solve new problems appearing in IoT data. Information fusion in the IoT inherits the advantages of multi-sensor data fusion. It also has the abilities of interoperability and integration of multi-source heterogeneous IoT data and semantic inference. The architecture of information fusion in the IoT is a general guide that can illustrate how IoT data and information are processed. From the perspective of this function, it can be seen as five constituent phases, as shown in Fig. 1.

Fig. 1 The architecture of information fusion in the IoT

4.1 Raw data annotation and abstraction

The W3C has developed a semantic sensor network (SSN) (Compton et al. 2012) ontology that can model sensor devices, systems, processes and observations. This research is conducted by the W3C Semantic Sensor Network Incubator Group (SSN-XG). The SSN ontology enables expressive representation of sensors, sensor observations and knowledge of the environment. The SSN ontology is encoded in the Web Ontology Language and has begun to achieve broad adoption and application within the IoT community. It is currently being used by various organizations, from academia to government and industry, to improve management of sensor data on the Web, including annotation, publishing and search.
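As a rough illustration of what annotating a raw reading can look like, the following minimal sketch turns one sensor record into N-Triples lines. It is an assumption-laden example rather than the annotation pipeline used in this work: the `ex` namespace, the resource names under it, the `hasValue`, `hasUnit` and `observedAt` properties and the record layout are all hypothetical, and the SSN terms used here (`Observation`, `observedBy`, `observedProperty`) should be checked against the ontology version in use.

```python
# SSN-XG namespace; the terms used below should be verified against the
# ontology version actually deployed.
SSN = "http://purl.oclc.org/NET/ssnx/ssn#"
# Hypothetical namespace for lab resources and helper properties.
EX = "http://example.org/lab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def iri(namespace, local):
    """Format an IRI term in N-Triples syntax."""
    return "<" + namespace + local + ">"


def literal(value, datatype=None):
    """Format a plain or typed literal in N-Triples syntax."""
    text = '"' + str(value) + '"'
    return text + "^^<" + datatype + ">" if datatype else text


def annotate(reading):
    """Turn one raw reading (a dict) into a list of N-Triples lines."""
    obs = iri(EX, "obs/" + str(reading["id"]))
    triples = [
        (obs, "<" + RDF_TYPE + ">", iri(SSN, "Observation")),
        (obs, iri(SSN, "observedBy"), iri(EX, "node/" + str(reading["node_id"]))),
        (obs, iri(SSN, "observedProperty"), iri(EX, reading["property"])),
        # The three properties below are hypothetical, not SSN terms.
        (obs, iri(EX, "hasValue"), literal(reading["value"], XSD_DOUBLE)),
        (obs, iri(EX, "hasUnit"), literal(reading["unit"])),
        (obs, iri(EX, "observedAt"), literal(reading["time"])),
    ]
    return [s + " " + p + " " + o + " ." for s, p, o in triples]


if __name__ == "__main__":
    raw = {"id": 1, "node_id": 7, "property": "AirTemperature",
           "value": 23.5, "unit": "Cel", "time": "2015-01-01T00:00:00"}
    for line in annotate(raw):
        print(line)
```

Each output line links the observation to its sensor node, observed property and value, so that readings from different nodes become comparable once they share the same vocabulary.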

4.2 Data integration

Data integration systems are generally defined as a triple (G, S, M), where G is the global (or mediated) schema, S is the heterogeneous set of source schemas and M is a mapping that maps queries between the source and global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and S. When users pose queries using a data integration system, they pose queries over G. Then, the mapping asserts connections between the elements in the global schema and the source schemas.
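The triple (G, S, M) can be made concrete with a small sketch. The example below is only illustrative: the two sources `lab_a` and `lab_b`, their column names and the Fahrenheit-to-Celsius conversion are hypothetical, and only simple projection queries over G are supported.

```python
# Global schema G: the attribute names users query against.
G = {"node_id", "time", "temperature_c"}

# Source schemas S: two hypothetical sources with different names and units.
S = {
    "lab_a": {"nid", "ts", "temp_c"},
    "lab_b": {"sensor", "timestamp", "temp_f"},
}

# Mapping M: per source, global attribute -> (source attribute, converter).
M = {
    "lab_a": {
        "node_id": ("nid", str),
        "time": ("ts", str),
        "temperature_c": ("temp_c", float),
    },
    "lab_b": {
        "node_id": ("sensor", str),
        "time": ("timestamp", str),
        "temperature_c": ("temp_f", lambda f: (float(f) - 32.0) * 5.0 / 9.0),
    },
}


def query_global(attributes, sources):
    """Answer a projection query over G by rewriting it for each source."""
    unknown = set(attributes) - G
    if unknown:
        raise KeyError("not in global schema: " + ", ".join(sorted(unknown)))
    rows = []
    for name, records in sources.items():
        mapping = M[name]
        for record in records:
            row = {}
            for attribute in attributes:
                source_attribute, convert = mapping[attribute]
                row[attribute] = convert(record[source_attribute])
            rows.append(row)
    return rows


if __name__ == "__main__":
    data = {
        "lab_a": [{"nid": 1, "ts": "t0", "temp_c": "21.5"}],
        "lab_b": [{"sensor": "x", "timestamp": "t1", "temp_f": 212}],
    }
    print(query_global(["node_id", "temperature_c"], data))
```

The user never sees `temp_f` or `nid`; the query is phrased over G only, and M carries both the renaming and the unit conversion.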

In information fusion in the IoT, data integration involves combining relevant data residing in a number of heterogeneous data sources that may conflict by structure and context or value. Such data integration provides users with a unified view of these data. Its fundamental enabling technology is semantic technology. Ontology is used to specify the models of the raw IoT data, which are schema levels. The raw IoT data are annotated and abstracted following these schema levels. Thus, the raw IoT data are transformed into a massive amount of linked data from which the instance level can be found. Based on the general idea of data integration discussed above, when users require heterogeneous relevant data integration at the instance level, the different schema-level models must be merged. Therefore, the data at the instance level will be presented in a unified view and will accomplish data integration. Different schema-level models can be merged using mappings. There are several ways to obtain such mappings. The first method is pre-defined mappings. This method may yield high accuracy but is inefficient. The second method involves mappings determined by computation following some principles, e.g. using a linked open data cloud.

The core function of data integration is schema-level mapping. BLOOMS (Jain et al. 2010) is a tool for schema-level mapping based on the idea of bootstrapping information already present in the linked open data cloud. BLOOMS shows good matching results owing to the rich knowledge contained in the linked open data cloud.

SPARQL (Rinne et al. 2012) is an RDF query language (i.e. a query language for databases) that can retrieve and manipulate data stored in the RDF format. There are tools that allow one to connect and semi-automatically construct a SPARQL query for a SPARQL endpoint. Thus, the SPARQL endpoint can be used to query the relevant IoT data in RDF format.
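To give a feel for what a SPARQL basic graph pattern does, the sketch below matches triple patterns against a tiny in-memory triple list. It is a plain-Python stand-in under simplifying assumptions, not a SPARQL engine: terms are bare strings, names beginning with `?` are variables, and the predicates and resource names are hypothetical.

```python
def match(triples, patterns):
    """Return the variable bindings that satisfy every (s, p, o) pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        solutions = extended
    return solutions


if __name__ == "__main__":
    # Roughly analogous to:
    #   SELECT ?o ?v WHERE { ?o property AirTemperature . ?o value ?v }
    store = [
        ("obs1", "type", "Observation"),
        ("obs1", "property", "AirTemperature"),
        ("obs1", "value", "23.5"),
        ("obs2", "property", "Humidity"),
        ("obs2", "value", "40"),
    ]
    query = [("?o", "property", "AirTemperature"), ("?o", "value", "?v")]
    print(match(store, query))
```

Because the variable `?o` is shared by both patterns, only observations that are air temperature readings contribute a value, which is the join behaviour a SPARQL endpoint provides over RDF data.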

4.3 Data fusion

In this phase, data fusion is similar to multi-sensor data fusion, which focuses on computation of structured and comparable IoT data to improve data quality to obtain appropriate decisions.

According to the relations among sources (Nakamura et al. 2007), information fusion can be classified as follows.

Complementary. When information provided by sources represents different portions of a broader scene, information fusion can be applied to obtain a broad piece of information.

Redundant. If two or more independent sources provide the same piece of information, these pieces can be fused to increase the associated confidence.

Cooperative. Two independent sources are cooperative when the information they provide is fused into new information (usually more complex than the original data), which, from an application perspective, better represents reality.
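Two of these relations can be sketched in a few lines. The example is a minimal illustration under stated assumptions, not a method from the cited survey: redundant readings are fused by inverse-variance weighting (which assumes independent, unbiased sources with known variances), and complementary readings are simply merged into one broader record. The function names and sample values are hypothetical.

```python
def fuse_redundant(estimates):
    """Fuse (value, variance) pairs that describe the same quantity.

    Uses inverse-variance weighting; every variance must be positive.
    Returns the fused value and its variance.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total


def fuse_complementary(*partial_views):
    """Merge readings that cover different portions of a scene."""
    scene = {}
    for view in partial_views:
        scene.update(view)
    return scene


if __name__ == "__main__":
    # Two equally reliable thermometers observing the same room.
    print(fuse_redundant([(20.0, 1.0), (22.0, 1.0)]))
    # Sensors that observe different properties of the room.
    print(fuse_complementary({"temperature_c": 21.0}, {"humidity_pct": 40.0}))
```

In the redundant case the fused variance is smaller than that of any single source, which is the increase in confidence described above.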

Depending on the method (Nakamura et al. 2007), information fusion can be processed with several goals, such as inference and estimation.

Inference methods are often applied in decision fusion. In this case, a decision is made based on the knowledge of the perceived situation. Here inference refers to the transition from one likely true proposition to another, whose truth is believed to result from the previous one. Classical inference methods are based on Bayesian inference and Dempster–Shafer Belief Accumulation theory. Estimation methods are inherited from control theory and use the laws of probability to compute a process state vector from a measurement vector or a sequence of measurement vectors. Estimation methods include maximum likelihood, maximum a posteriori, least squares, moving average filter, Kalman filter and particle filter.
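As an example of an estimation method, the following sketch implements a scalar Kalman filter. It is deliberately simplified and rests on assumptions not made in the text: the state is a single value modelled as a random walk, and the measurement and process noise variances are known constants chosen by the caller.

```python
def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a slowly varying quantity.

    x0 and p0 are the initial state estimate and its variance.
    Returns the list of state estimates, one per measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed constant, so only uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    readings = [21.3, 20.8, 21.1, 20.9, 21.2]
    print(kalman_1d(readings, meas_var=0.25, process_var=0.01, x0=20.0))
```

A small `process_var` makes the filter trust its running estimate and smooth the readings strongly; a larger value lets the estimate follow rapid changes at the cost of more noise.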

In this paper, we propose multi-source heterogeneous information fusion in the IoT based on semantics. Here, we provide the basic definitions and a description of the problem.

Definition 1

(Concept) A concept refers to a Wikipedia article in the form of a web page. We leverage the uniform resource identifier to refer to a concept.

Definition 2

(Entity) An entity, which includes both attributes and values, refers to a type of sensor in a mobile phone or other devices. The attributes usually represent the type of sensor, such as temperature, pressure and light. Values represent the state description for the real world. The attributes are also considered a concept and can be interpreted in the Wikipedia articles.

The procedure for establishing relations for semantics-based multi-source heterogeneous information fusion in the IoT is given formally as follows.

  1. Setup. To set up the relations between entities and concepts, we leverage TF-IDF to analyse the attributes and values of sensors that people have obtained. As a result, several concepts can be abstracted from Wikipedia articles, which are denoted as \(E_s\) and \(C_s\).

  2. Related concept selection. Here, existing concepts and entities (including \(C_i\), \(E_i\), \(C_s\) and \(E_s\)) in Wikipedia are located to confirm their category. The concepts of category are denoted as \(C_c\) and \(E_c\). By traversing the subcategories and parent categories of \(C_c\) and \(E_c\), we obtain a new concept set \(C_{sp}\).

  3. Related concept classification. Associated concepts and entities in the same Wikipedia category are dispatched into a classification.

  4. Relation construction.

    (a) Within the same classification, we apply explicit semantic analysis (ESA) (Gabrilovich and Markovitch 2007) to find relations between concepts and establish similarity between concepts \(c_1\) and \(c_2\) based on Wikipedia. If the relatedness resulting from ESA exceeds a threshold value, the relation between concepts \(c_1\) and \(c_2\) will be established.

    (b) In different classifications, the relation will be established based on indirect correlation, such as temporal–spatial information. If the relatedness resulting from ESA exceeds a threshold value, the relation between concepts \(c_1\) and \(c_2\) will be established (in which \(c_1\) belongs to \(C_s\) and its \(C_{sp}\)). Note that temporal–spatial information is obtained from a sensor, and the time range and scope of space are predetermined. Thus, if the temporal–spatial information of the concepts is within the time range or space, the relation between them will be established.
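The TF-IDF weighting used in the setup step can be sketched as follows. This is a minimal illustration with assumed details: each sensor description is treated as a pre-tokenised document, term frequency is normalised by document length, and the unsmoothed logarithmic inverse document frequency is used. The sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights.

    documents maps a name to a list of tokens.
    Returns a dict mapping each name to {term: weight}.
    """
    n = len(documents)
    document_frequency = Counter()
    for tokens in documents.values():
        document_frequency.update(set(tokens))
    weights = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n / document_frequency[term])
            for term, count in counts.items()
        }
    return weights


def top_terms(term_weights, k=2):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(term_weights, key=lambda t: (-term_weights[t], t))[:k]


if __name__ == "__main__":
    sensors = {
        "s1": ["temperature", "sensor", "celsius"],
        "s2": ["pressure", "sensor", "pascal"],
    }
    weights = tf_idf(sensors)
    print(top_terms(weights["s1"]))
```

Terms shared by every description, such as "sensor" here, receive zero weight, so the surviving high-weight terms are the ones worth looking up as candidate concepts.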

Algorithm 1

As shown in Algorithm 1, after several steps to establish the relation, we obtain a type of multi-source heterogeneous information fusion in the IoT based on semantics.
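The relation construction step can be approximated by the sketch below. It is not a reproduction of Algorithm 1 and makes several assumptions: cosine similarity over sparse term vectors stands in for ESA relatedness, concepts in different classifications are linked purely by the temporal–spatial window, the window test uses "or" as in the description above, and all field names, thresholds and sample values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def within_window(a, b, max_seconds, max_metres):
    """True if two concepts are close in time or close in space."""
    close_in_time = abs(a["time"] - b["time"]) <= max_seconds
    close_in_space = math.dist(a["pos"], b["pos"]) <= max_metres
    return close_in_time or close_in_space


def build_relations(concepts, threshold, max_seconds, max_metres):
    """Return sorted pairs of concept names for which a relation holds."""
    relations = []
    names = sorted(concepts)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a, b = concepts[first], concepts[second]
            if a["category"] == b["category"]:
                # Same classification: relatedness must exceed the threshold.
                related = cosine(a["vector"], b["vector"]) > threshold
            else:
                # Different classifications: use indirect correlation.
                related = within_window(a, b, max_seconds, max_metres)
            if related:
                relations.append((first, second))
    return relations


if __name__ == "__main__":
    sample = {
        "temperature": {"category": "climate", "time": 0, "pos": (0.0, 0.0),
                        "vector": {"heat": 1.0, "weather": 1.0}},
        "humidity": {"category": "climate", "time": 5, "pos": (1.0, 0.0),
                     "vector": {"weather": 1.0, "water": 1.0}},
        "smoke": {"category": "safety", "time": 10, "pos": (0.0, 2.0),
                  "vector": {"fire": 1.0}},
    }
    print(build_relations(sample, 0.4, max_seconds=60, max_metres=10))
```

Replacing `cosine` with a real ESA implementation and the window test with the deployment's own time range and spatial scope would leave the control flow unchanged.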

5 Simulation verification platform

We deploy a wireless sensor network to monitor a lab environment (Fig. 2). Then, we collect the sensed data into a relational database (Fig. 3). Based on the SSN ontology developed by the W3C, we model our schema level of observation data based on the instance level.

Fig. 2 Wireless sensor network to monitor the lab environment

Fig. 3 Sensor data in relational database

Fig. 4 System architecture

As shown in Fig. 4, the system comprises three main parts.

  1. 1.

    Sensor network and corresponding middleware: Various sensors deployed in labs collect real-time data. The data are sent to sensor network middleware through a wireless gateway. Then, the sensor network middleware stores the data in SQL Server 2008.

  2. 2.

    D2RQ system: The collected sensor network data are abstracted and modelled using the SSN ontology. Then, customized mappings to the tables in SQL Server 2008 are written in the D2RQ Mapping Language to expose the relational data as RDF.

  3. 3.

    Domain knowledge base construction: According to the objects and lab environments, a domain knowledge base is constructed with Protégé. Then, the Jena reasoner is used to perform intelligent recognition and assessment of the lab environments.
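The kind of rule-based assessment performed in the third part can be illustrated with a small forward-chaining sketch. It is a plain-Python stand-in rather than Jena rule syntax or the knowledge base built for this platform; the rule names, fact keys and threshold values are hypothetical.

```python
# Each rule is (label, condition). A condition sees the raw facts plus the
# labels derived so far, so rules can build on one another.
RULES = [
    ("overheated", lambda s: s.get("air_temperature_c", 0.0) > 30.0),
    ("too_humid", lambda s: s.get("relative_humidity_pct", 0.0) > 70.0),
    ("fire_risk", lambda s: "overheated" in s["labels"]
        and s.get("smoke_detected", False)),
]


def assess(facts):
    """Apply the rules repeatedly until no new label can be derived."""
    state = dict(facts)
    state["labels"] = set()
    changed = True
    while changed:
        changed = False
        for label, condition in RULES:
            if label not in state["labels"] and condition(state):
                state["labels"].add(label)
                changed = True
    return state["labels"]


if __name__ == "__main__":
    print(assess({"air_temperature_c": 35.0, "smoke_detected": True}))
```

The `fire_risk` rule depends on a label produced by another rule, which is why the loop runs to a fixed point instead of making a single pass.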

Transforming raw sensor data into RDF format is a critical step. After careful comparison and analysis, we apply the D2RQ platform to construct the mapping between the relational database and the triple store based on the schema level modelled above.

We used the SPARQL endpoint to query the RDF observation data (Fig. 5). This example provides all the information about air temperature, such as time, unit, value and node_id.

Fig. 5 A SPARQL query in endpoint

The proposed framework for multi-source heterogeneous information fusion in the IoT is divided into the following phases. In raw data annotation and abstraction, raw IoT data are annotated and explained by metadata, which can be linked and used to facilitate integration and interoperability. Data integration involves combining relevant data residing in a number of heterogeneous data sources, which may conflict by structure and context or value. The combined data provide a unified view of the data. Data fusion focuses on the computation of the structured and comparable IoT data to improve data quality or obtain appropriate decisions. Feature abstraction and inference enriches the fused data results with true meaning by semantics, abstraction and reasoning.

6 Conclusion

In this study, in view of the meaning and characteristics of the IoT, we have analysed the connotation and characteristics of information fusion in the IoT. We compared the features of data and information in the IoT with those of existing wireless sensor networks and the Internet. To the best of our knowledge, this is the first attempt to present such a comparison. We have proposed an architecture for information fusion in the IoT that can provide guidance for the development of information fusion in the IoT. We have designed a framework for multi-source heterogeneous information fusion in the IoT and set up an experimental simulation platform for environmental monitoring to verify the proposed framework.

There are still numerous challenges in information fusion in the IoT, such as ontology model mapping, computation and management of massive amounts of data, and practical IoT applications. Traditional multi-sensor data fusion can handle the same kind of data effectively. However, as new characteristics emerge in the IoT, interoperable service-oriented technologies are required to share real-world data among heterogeneous devices and to integrate and fuse such multi-source heterogeneous IoT data. The IoT can offer only marginal practical benefits if it lacks the ability to integrate, fuse and glean useful information from the data generated by a world of interconnected devices. Considerations for future IoT networks include data network integration of heterogeneous networks. Therefore, network layer routing protocol design that considers fine-grained semantic-level fusion is a worthwhile undertaking, and methods that address the challenges listed above still need to be developed. Only in this way can we enjoy a much more intelligent IoT.