
1 Introduction

The Internet of Things (IoT) is a new revolution of the Internet. It refers to the billions of physical devices around the world that are now connected to the Internet, all collecting and sharing data. Sensors are one of the main components that enable the IoT; they send their observations as stream data.

Furthermore, sensor data are made available on the web through the Sensor Web. By incorporating Semantic Web technologies, the Sensor Web becomes the Semantic Sensor Web (SSW). In this way, a sensor data stream can be annotated with semantics by providing machine-interpretable descriptions of what the data represents, where it originates from, how it relates to its surroundings, who is providing it, and what its quality, technical, and non-technical attributes are [3]. Integrating semantics into sensor data as they arrive (dynamic data) is defined as real-time semantic annotation, while integrating semantics into sensor data that are already stored in a repository or data store (static data) is defined as non-real-time semantic annotation [16].

Organizations such as the Open Geospatial Consortium (OGC) and the World Wide Web Consortium (W3C) have proposed several standards for sensor data. The OGC defines a standardization for the Sensor Web named Sensor Web Enablement (SWE). It is a framework and a set of standards that allow the exploitation of sensors and sets of sensors connected to a communication network. It is founded on the concept of the “Web Sensor”, using standard protocols and application interfaces [13].

This paper is an extended version of the conference proceedings paper [15]. In [15], we investigated how to integrate semantic annotations into sensor stream data. In particular, we discussed annotation techniques for real-time integration of semantics into heterogeneous sensor observation data and sensor metadata with context in the IoT. In contrast to [15], in this paper the system architecture is extended with new features such as archival data streams, an ontology, OGC standards, and XLink-annotated SOS. The implemented system is also extended, as shown in Fig. 7.

The paper is organized as follows: Sect. 2 reviews the literature on semantic annotation of sensor stream data. Section 3 gives an overview of the differences between sensor streaming and traditional streaming, the concepts of semantic annotation, and the selection of technologies and standards for semantic annotation. An overview of the system architecture is presented in Sect. 4, while Sect. 5 presents the implemented system, including the format of the received sensor data, the integration of semantic annotations into the sensor data, and the system outputs. Finally, Sect. 6 concludes the paper and identifies some future perspectives for semantic integration into sensor stream data.

2 Literature Review

Several researchers have recently investigated the semantic enrichment of sensor stream data. The authors in [1] brought together the Semantic Web and data mining in the context of the IoT, with a focus on sensors as interconnected devices, and concluded that practical data mining applications can be built using the available real-world sensor ontologies, query mechanisms, and linked sensor data. The Semantic Sensor Web (SSW) is described in [17] as a synthesis of sensor data and semantic metadata. It represents an approach by the OGC and the Semantic Web Activity of the W3C to provide meaning for sensor data. The construction of a Semantic Sensor Observation Service (SemSOS) based on the SWE standards is discussed in [7], which adds semantic annotations to sensor data and uses ontology models to reason over sensor observations.

An extension of the SWE framework to support standardized access to sensor data is described in [11]. As future work, the authors list the extension of the SOS server with semantics, since the lack of a semantically rich mechanism is seen as a significant issue that makes it hard to explore related concepts, subgroups of sensor types, or other dependencies between the sensors and the data collected.

3 Background

Currently, billions of interconnected IoT devices continuously produce sensed data as data streams and transmit these data to a centralized server. Due to the dramatic increase in streaming data, their management and exploitation have become increasingly important, and it has become difficult to process sensor data streams and integrate semantics into them in real time. Therefore, the selection of technologies and standards for developing techniques for real-time integration of semantics into heterogeneous sensor observation data and sensor metadata with context in the IoT is highly important. The proposed real-time semantic annotation system utilizes Spark Streaming, Apache Kafka, the Apache Cassandra database, and standards such as the OGC Sensor Web Enablement standards, which are discussed below.

3.1 Sensor Streaming Versus Traditional Streaming

The distinction between traditional data stream processing and sensor data stream processing is important because sensor data streams have their own features [5]:

  • Sensor data streams are only samples of the entire population, while traditional streams such as network streams, web logs, and stock market data represent the entire population of the data.

  • Sensor data streams are considered noisy compared with traditional streaming data, which is regarded as exact and error-free. Environmental effects on the deployed sensor networks can also distort the sensed values, whereas web logs and web click streams are considered accurate compared with data generated by sensor networks.

  • Sensor data streams are typically of moderate size compared with the overwhelming storage and processing demands of the huge data volumes in traditional streams.

3.2 Semantic Annotations

IoT applications are enabled by heterogeneous sensors, which send observational data, referred to as sensor stream data, to a remote server. Raw sensor stream data is useless unless properly annotated. Therefore, researchers proposed the Semantic Sensor Web (SSW), which is a combination of the Sensor Web and Semantic Web technologies. According to the study in [14], most of the reviewed publications adopt the proposed industry standards, such as SWE, and the techniques that can be used for annotating sensor data, such as Resource Description Framework in attributes (RDFa), XML Linking Language (XLink), and Semantic Annotations for WSDL and XML Schema (SAWSDL), proposed by organizations such as the OGC and the W3C [18]. However, how to advance techniques for integrating semantic annotations in real time is still an open issue that should be addressed.

3.3 Technologies

The proposed system for real-time integration of semantic annotations into sensor stream data for the IoT utilizes:

  • Spark Streaming: Several stream data processing systems, including Spark Streaming, Storm, Google Data Flow, and Flink, have emerged to support real-time analytics for streaming data sets [10]. Most studies conclude that Spark Streaming performs best, with high throughput, when the incoming volume is huge [6]. Therefore, we have chosen Spark Streaming to develop our system for real-time integration of semantic annotations into sensor stream data. Spark Streaming is an extension of Apache Spark that enables building scalable, fault-tolerant IoT applications for real-time processing of sensor stream data. It can receive data from different input sources such as Apache Kafka, TCP sockets, Flume, Kinesis, Hadoop Distributed File System (HDFS), or Twitter, and the data can be processed using complex algorithms expressed with high-level functions like map, join, reduce, and window. Finally, the processed streaming data can be published in IoT real-time applications or pushed out to databases or file systems.

  • Apache Kafka: A distributed streaming platform with capabilities to publish and subscribe to streams of records (similar to a message queue or enterprise messaging system), to store streams of records in a fault-tolerant, durable way, and to process streams of records as they occur. Kafka is generally used for building real-time streaming data pipelines that reliably move data between systems or applications [9]. In our system, Kafka is used as middleware between the sensor stream data and Spark Streaming.

  • Apache Cassandra database: A free and open-source distributed store for structured data that scales out on cheap commodity hardware or cloud infrastructure, which makes it a suitable platform for mission-critical data. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Spark Streaming interacts well with the Cassandra database. Therefore, in our system, the sensor stream data and their semantic annotations processed by Spark Streaming are stored in the Cassandra database.
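To make the "window" operation mentioned for Spark Streaming more concrete, the following is a minimal plain-Python analogy, not Spark code and not part of the implemented system. It computes a sliding-window mean over a small stream of readings; the function name `windowed_mean` and the sample values are assumptions introduced only for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_mean(stream: Iterable[float], size: int) -> Iterator[float]:
    """Yield the mean of the last `size` readings after each new reading."""
    window = deque(maxlen=size)  # oldest reading is dropped automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical readings arriving one by one.
means = list(windowed_mean([10, 20, 30, 40], size=2))
```

In a stream processing engine the same idea is applied to micro-batches distributed across a cluster, which is what makes windowed aggregation practical for high-volume sensor streams.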

3.4 Standards

The OGC defines a standardization for the Sensor Web named Sensor Web Enablement (SWE), which is divided into two parts [12]:

  • SWE Information Model: Comprises conceptual language encodings that make sensor observations visible on the Internet. The SWE Information Model includes the following specifications: Sensor Model Language (SensorML), Observation and Measurement (O&M), and Transducer Model Language (TransducerML).

  • SWE Service Model: A set of Web Service specifications that allow a client to search for and find the required information. The SWE Service Model includes the following specifications: Sensor Observation Service (SOS), Sensor Alert Service (SAS), Sensor Planning Service (SPS), and Web Notification Services (WNS).

In this paper, SOS O&M is used to encode the semantic annotations and the data gathered by sensors, as discussed in Sect. 5.2.

4 An Overview of the System Architecture

Fig. 1 presents an overview of the system architecture for real-time integration of semantics into heterogeneous sensor stream data with context in the Internet of Things. As mentioned above, the proposed real-time semantic annotation system utilizes Apache Kafka, Spark Streaming, the Apache Cassandra database, and the SOS O&M standards.

The heterogeneous sensor stream data from the IoT-based sensor devices are transmitted wirelessly to a “producer” client of the Kafka server. The “producer” client publishes streams of data to Kafka “topics” distributed across one or more cluster nodes/servers called “brokers”. The streams of data published by Kafka are then processed by Apache Spark Streaming in parallel and in real time.

The Kafka server is used to receive sensor data streams in various formats (e.g. text, binary, JSON, XML) and to transform them into a particular format that will be processed by Spark Streaming.
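The transformation of heterogeneous input formats into one common format can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implemented transformation: the function name `normalize`, the flat key/value target format, and the `key=value;key=value` text layout are all hypothetical, and binary payloads are left out.

```python
import json
import xml.etree.ElementTree as ET


def normalize(payload: str) -> dict:
    """Convert a JSON, XML, or 'key=value;key=value' text payload to a flat dict."""
    text = payload.strip()
    if text.startswith("{"):
        # JSON object: keep keys, store every value as a string.
        return {key: str(value) for key, value in json.loads(text).items()}
    if text.startswith("<"):
        # XML: one child element per field (assumed layout).
        root = ET.fromstring(text)
        return {child.tag: (child.text or "").strip() for child in root}
    # Fallback: semicolon-separated key=value pairs (assumed layout).
    pairs = (item.split("=", 1) for item in text.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


# The same hypothetical reading in three input formats.
samples = [
    '{"station": "S1", "t": -3.8}',
    "<reading><station>S1</station><t>-3.8</t></reading>",
    "station=S1; t=-3.8",
]
records = [normalize(sample) for sample in samples]
```

After normalization, all three inputs yield the same record, so the downstream processing step only has to handle a single format.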

Fig. 1. An overview of the system architecture [15].

Spark Streaming enables real-time integration of semantics into heterogeneous sensor stream data with context in the IoT by using sensor metadata, archival data streams, data stream mining, and association rules to add semantic annotations with concept definitions from ontologies or other semantic sources, which allows the sensor data and metadata elements to be understood. The semantic annotations are implemented in SOS O&M using techniques such as XLink (without XPath) and Embedded (only a single scalar value of a semantic annotation) to add annotations to XML files. These annotations can point to extra sources of information (e.g. a file) or to a Uniform Resource Name (URN).

The sensor stream data enriched with semantic annotations are stored in the Cassandra database and displayed in the IoT real-time monitoring system. It is worth mentioning that Spark Streaming processes the sensor data stream in the format of the OGC SWE standards, specifically version 2.0 of the SOS standard, to encode the semantic annotations and the data gathered by sensors [4].

A detailed description is presented in Sect. 5.2, where an example of the integration of semantic annotations into sensor stream data with context in the IoT is given.

5 System Implementation

A web-based IoT real-time air quality monitoring system was developed to visualize sensor stream data and their semantic annotations. Sensor data of the Hydrometeorological Institute of Kosovo (HMIK) are used, obtained through the World Air Quality Index API (AQI API). The AQI API can be used for advanced programmatic integration, such as: access to more than 11000 station-level and 1000 city-level data, station name and coordinates, search for a station by name, geo-location query based on latitude/longitude, individual Air Quality Index (AQI) for all pollutants, current weather conditions, etc. [2].

5.1 Input Sensor Stream Data

The system receives raw sensor stream data from the AQI API in JSON format, as presented in Fig. 2, which supports real-time measurement of the following parameters:

  • Carbon Monoxide (co),

  • Humidity (h),

  • Nitrogen Dioxide (no2),

  • Ozone (o3),

  • Pressure (p),

  • PM10 (pm10),

  • PM2.5 (pm25),

  • Sulphur Dioxide (so2),

  • Temperature (t),

  • Wind (w), and

  • Water Gauge (wg).

As shown in Fig. 2, the JSON data also contain attributes such as: data (station data: idx - unique id for the city monitoring station, aqi - real-time air quality information, time - measurement time information, s - local measurement time, and tz - station time zone), city (information about the monitoring station: name - name of the monitoring station, geo - latitude/longitude of the monitoring station, and url - URL for the attribution link), attributions (EPA attribution for the station), and iaqi (individual AQI values: pm25 - individual AQI for PM2.5, v - individual AQI value for PM2.5).

Data are received from the sensors every 6 min through the AQI API and are represented in the corresponding numerical formats, e.g. −3.8 (°C) for the temperature parameter.
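As an illustration of how such a JSON message could be read, the sketch below parses a payload built from the attribute names described above. The payload is hypothetical: the nesting of the fields under `data`, the station name, the coordinates, and all values except the −3.8 temperature example are assumptions, and `parse_reading` is a name introduced only for this sketch.

```python
import json

# Hypothetical payload; field names follow the description above, values are invented.
raw = """
{
  "data": {
    "idx": 1,
    "aqi": 72,
    "time": {"s": "2019-01-15 10:00:00", "tz": "+01:00"},
    "city": {"name": "Example Station", "geo": [42.66, 21.16],
             "url": "https://example.org/station"},
    "attributions": [{"name": "Example Agency", "url": "https://example.org"}],
    "iaqi": {"pm25": {"v": 72}, "pm10": {"v": 40}, "o3": {"v": 18.5}, "t": {"v": -3.8}}
  }
}
"""


def parse_reading(text: str) -> dict:
    """Flatten one station message into the fields needed for annotation."""
    data = json.loads(text)["data"]
    return {
        "station_id": data["idx"],
        "station": data["city"]["name"],
        "lat": data["city"]["geo"][0],
        "lon": data["city"]["geo"][1],
        "time": data["time"]["s"],
        "tz": data["time"]["tz"],
        "aqi": data["aqi"],
        # One numeric value per measured parameter, e.g. {"pm25": 72, "t": -3.8}.
        "values": {name: entry["v"] for name, entry in data["iaqi"].items()},
    }


reading = parse_reading(raw)
```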

5.2 Processing Sensor Stream Data by Integrating Semantic Annotations

In our system, an ontology named ‘onto-core.owl’ was created (see Fig. 3). It defines different semantic annotations for sensor stream data, such as:

  • #AIQ_Index,

  • #Air_Pollution_Level, and

  • #Health_Implications

#AIQ_Index annotation: an index for reporting daily air quality, which tells how clean or polluted the air is. The United States Environmental Protection Agency (EPA) calculates the AQI for five major air pollutants regulated by the Clean Air Act: ground-level ozone, particle pollution (also known as particulate matter), carbon monoxide, sulfur dioxide, and nitrogen dioxide. AQI values range from 0 to 500.

According to the EPA, the higher the AQI value, the greater the level of air pollution and the greater the health concern. The overall AQI is the maximum of all individual AQI values, as presented in Eq. 1:

$$ \text{AQI} = \max \left( \text{AQI}_{\text{PM}2.5}, \text{AQI}_{\text{PM}10}, \text{AQI}_{\text{O}3}, \ldots \right) $$
(1)
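Eq. 1 can be sketched in a few lines of code. This is a minimal illustration rather than the implemented system: the function name `overall_aqi`, the set of pollutant keys considered, and the sample values are assumptions. Non-pollutant parameters such as temperature or pressure are excluded before the maximum is taken.

```python
# Keys treated as pollutant sub-indices (assumed, based on the parameter list in Sect. 5.1).
POLLUTANTS = ("pm25", "pm10", "o3", "no2", "so2", "co")


def overall_aqi(iaqi: dict) -> float:
    """Return the maximum individual AQI over the available pollutants (Eq. 1)."""
    values = [iaqi[name] for name in POLLUTANTS if name in iaqi]
    if not values:
        raise ValueError("no individual pollutant AQI available")
    return max(values)


# Hypothetical individual values; "t" and "p" are not pollutants and are ignored.
sample = {"pm25": 72, "pm10": 40, "o3": 18.5, "t": -3.8, "p": 1021}
aqi = overall_aqi(sample)
```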

#Air_Pollution_Level annotation: based on the AQI value, air quality is divided into six ‘Air Quality Index Levels of Health Concern’ categories:

  • Good (AQI is 0 to 50)

  • Moderate (AQI is 51 to 100)

  • Unhealthy for Sensitive Groups (AQI is 101 to 150)

  • Unhealthy (AQI is 151 to 200)

  • Very Unhealthy (AQI is 201 to 300)

  • Hazardous (AQI is 301 to 500)
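The six categories above amount to a simple range lookup, sketched below. The function name `air_pollution_level` is hypothetical, and the treatment of fractional values (a value between two listed ranges, such as 50.5, falls into the higher category) is an assumption, since the categories are given for whole numbers only.

```python
# Upper bound of each AQI range and the corresponding category, as listed above.
LEVELS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
    (500, "Hazardous"),
]


def air_pollution_level(aqi: float) -> str:
    """Map an AQI value in the range 0 to 500 to its level of health concern."""
    if not 0 <= aqi <= 500:
        raise ValueError(f"AQI out of range: {aqi}")
    for upper, label in LEVELS:
        if aqi <= upper:
            return label
    raise AssertionError("unreachable: 500 is the last upper bound")


level = air_pollution_level(72)
```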

Fig. 2. Sensor stream data - JSON format [15].

Fig. 3. ‘Ont-core.owl’ ontology.

#Health_Implications annotation: each of the six categories described above corresponds to a different level of health concern. The #Health_Implications annotation explains what each category means. For example, the “Unhealthy for Sensitive Groups” category means: ‘Although general public is not likely to be affected at this AQI range, people with lung disease, older adults and children are at a greater risk from exposure to ozone, whereas persons with heart and lung disease, older adults and children are at greater risk from the presence of particles in the air.’, and the “Moderate” category means: ‘Air quality is acceptable; however, for some pollutants there may be a moderate health concern for a very small number of people who are unusually sensitive to air pollution.’

The annotations described above are defined in the ontology named ‘ont-core’.

Having described the different types of semantic annotations for sensor stream data, we now present the semantic annotation process.

The sensor stream data may arrive at the Kafka server in different formats (JSON in our case), and the server transforms them into a specific format that will be processed by Spark Streaming. After that, Spark Streaming semantically annotates the sensor data stream based on the measured values and converts it into the SOS O&M format. A fragment of an example of semantic annotations integrated into the SOS O&M format using techniques such as XLink and Embedded is presented in Fig. 4.

An SOS O&M observation document comprises zero or more observationData entries, each of which stores an instance of an observation. The common observation properties are listed below (the prefix gml indicates that the element is defined in OGC 07-033, while the prefix om indicates that the element is defined in OGC 10-025r1) [8]:

  • gml:identifier (mandatory): identifies or refers to a specific observation.

  • om:phenomenonTime (mandatory): describes the time instant or time period for which the observation contains sensor data.

  • om:resultTime (mandatory): provides the time when the result became available (often this is identical to the phenomenonTime).

  • om:procedure (mandatory): the identifier of the sensor instance that has generated the observation.

  • om:observedProperty (mandatory): the identifier of the phenomenon that was observed.

  • om:featureOfInterest (mandatory): an identifier of the geometric feature (e.g. sensor station) to which the observation is associated.

  • om:result (mandatory): the observed value, the type of the result is restricted to the types shown in Table 1.
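The properties listed above can be assembled into a small O&M-style XML fragment as sketched below. This is an illustrative sketch and not the fragment shown in Fig. 4: the helper names, all identifiers and URIs (the `urn:example` and `http://example.org` values), and the choice to attach the XLink annotation to om:observedProperty are assumptions, and the output is not validated against the O&M schema. The namespace URIs are those of O&M 2.0, GML 3.2, and XLink.

```python
import xml.etree.ElementTree as ET

NS = {
    "om": "http://www.opengis.net/om/2.0",
    "gml": "http://www.opengis.net/gml/3.2",
    "xlink": "http://www.w3.org/1999/xlink",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep readable prefixes in the output


def q(prefix: str, name: str) -> str:
    """Build a namespace-qualified tag or attribute name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def build_observation(obs_id: str, time_iso: str, sensor: str, concept: str,
                      feature: str, value: float, uom: str) -> str:
    href = q("xlink", "href")
    obs = ET.Element(q("om", "OM_Observation"), {q("gml", "id"): obs_id})
    ident = ET.SubElement(obs, q("gml", "identifier"), {"codeSpace": "urn:example:obs"})
    ident.text = obs_id
    phen = ET.SubElement(obs, q("om", "phenomenonTime"))
    instant = ET.SubElement(phen, q("gml", "TimeInstant"), {q("gml", "id"): f"{obs_id}-t"})
    ET.SubElement(instant, q("gml", "timePosition")).text = time_iso
    # resultTime reuses the phenomenon time through a local XLink reference.
    ET.SubElement(obs, q("om", "resultTime"), {href: f"#{obs_id}-t"})
    ET.SubElement(obs, q("om", "procedure"), {href: sensor})
    # XLink annotation: the observed property points to an ontology concept.
    ET.SubElement(obs, q("om", "observedProperty"), {href: concept})
    ET.SubElement(obs, q("om", "featureOfInterest"), {href: feature})
    # Embedded annotation: a single scalar value carried in the result.
    ET.SubElement(obs, q("om", "result"), {"uom": uom}).text = str(value)
    return ET.tostring(obs, encoding="unicode")


xml_doc = build_observation(
    "obs-1", "2019-01-15T10:00:00+01:00", "urn:example:sensor:1",
    "http://example.org/ont-core.owl#AIQ_Index", "urn:example:station:1", 72, "AQI",
)
```

Keeping the ontology reference as a link rather than copying the concept definition into every observation keeps each message small, which matters when observations arrive continuously.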

We have developed a new observation type named ‘SemObservation’, with the result type ‘gml:Sem MeasureType’, as shown and described in Table 2.

Fig. 4. An example of integrated semantic annotations to the sensor stream data [15].

Table 1. Overview of observation types [8].
Table 2. The developed SemObservation observation type [15].

For a clearer explanation of the semantic integration into sensor observation data, Fig. 5 illustrates (a) the concept of O&M and the relationships between the entities involved in observations, (b) data streams generated by wireless sensor networks, (c) the sensor data integrated with sensor metadata, archival data streams, and ontological knowledge, and finally, (d) the semantically annotated data with the attributes: sem-annotations data, the observed value, unit, metadata, location, timestamp, result type, and gml:id of the observation.
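The attributes of the semantically annotated data listed for part (d) can be pictured as a simple record. The sketch below is only an illustration: the class name `AnnotatedObservation`, the field names, the helper `annotate`, and the sample values are hypothetical, and only the “Moderate” health implication text (quoted from Sect. 5.2) is included.

```python
from dataclasses import dataclass, field
from typing import Optional

# Only one category text is reproduced here; the remaining categories are omitted.
HEALTH_IMPLICATIONS = {
    "Moderate": (
        "Air quality is acceptable; however, for some pollutants there may be a "
        "moderate health concern for a very small number of people who are "
        "unusually sensitive to air pollution."
    ),
}


@dataclass
class AnnotatedObservation:
    gml_id: str
    value: float
    unit: str
    timestamp: str
    location: tuple
    result_type: str
    metadata: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def annotate(obs: AnnotatedObservation, aqi: float, level: str) -> AnnotatedObservation:
    """Attach the three semantic annotations to an observation record."""
    implication: Optional[str] = HEALTH_IMPLICATIONS.get(level)
    obs.annotations = {
        "#AIQ_Index": aqi,
        "#Air_Pollution_Level": level,
        "#Health_Implications": implication,
    }
    return obs


# Hypothetical observation values.
obs = annotate(
    AnnotatedObservation("obs-1", 72, "AQI", "2019-01-15 10:00:00",
                         (42.66, 21.16), "SemObservation"),
    aqi=72,
    level="Moderate",
)
```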

Fig. 5. Integrating semantics to sensor observation data [15].

5.3 System Outputs

To display the heterogeneous sensor stream data and their semantic annotations, a real-time IoT application was developed in ASP.NET Core MVC, a cross-platform, high-performance, open-source framework for building modern, cloud-based, Internet-connected applications. The ‘DataStax C# Driver for Apache Cassandra’, a modern, feature-rich, and highly tunable C# client library, is used to read data from the Apache Cassandra database. To display the data on the map, Leaflet is used, an open-source JavaScript library for interactive web maps. Leaflet is designed with simplicity, performance, and usability in mind. It works efficiently across all major desktop and mobile platforms out of the box, taking advantage of HTML5 and CSS3 on modern browsers while remaining accessible on older ones.

Fig. 6. System outputs: map view [15].

Fig. 7. System outputs: table view.

As shown in Fig. 6 and Fig. 7 (map and table view), users can observe the air quality at certain geographical points on the map, marked as measuring nodes. Each node (marker) has an AQI index that indicates the air pollution. Clicking on any marker shows the latest measurement values obtained for that point, such as PM2.5, PM10, O3, NO2, SO2, CO, Temperature, Pressure, Humidity, Wind, and Water Gauge, together with semantic annotations such as #AQI Index, #Air Pollution Level, and #Health Implications.

6 Conclusions and Future Work

Wireless sensor networks (WSNs) are one of the main components of the IoT. They produce observed data in continuous form, known as sensor stream data, and transmit them to a server for further processing. Raw sensor data are useless unless properly annotated. Adding semantic annotations with concept definitions from ontologies makes it possible to interpret and understand sensor data streams.

First, this study provides a literature review on the integration of semantics into sensor data for the IoT. Next, it describes the distinction between traditional data stream processing and sensor data stream processing. Then, it presents the selected technologies and standards (such as Spark Streaming, Apache Kafka, Apache Cassandra, and OGC standards) that are used to develop the proposed system. After that, an overview of the system architecture for real-time integration of semantics into heterogeneous sensor stream data with context in the Internet of Things is presented. Finally, the implementation of an IoT real-time air quality monitoring system is presented, including:

  • input sensor stream data in JSON format for the following measured parameters: carbon monoxide, humidity, nitrogen dioxide, ozone, pressure, pm10, pm25, sulphur dioxide, temperature, wind, and water gauge;

  • processing of sensor stream data by integrating semantic annotations into the sensor stream data in SOS O&M format, for which a new observation type, SemObservation (with the gml:Sem MeasureType result type), was developed;

  • system outputs to display the heterogeneous sensor stream data and their semantic annotations in map and table view format.

This paper is an extended version of the conference proceedings paper [15]. In contrast to [15], in this paper the system architecture is extended with new features such as archival data streams, an ontology, OGC standards, and XLink-annotated SOS.

Extending the system with more advanced real-time semantic annotation techniques such as XPath annotations, developing techniques for real-time interpretation of semantic annotations, and evaluating the system performance are left for future work.