
1 Introduction

Big data techniques target system-level problems that cannot be solved by conventional methods and technologies. With the emergence of new networked sensor technologies (e.g. large scale wireless sensor networks, body sensor networks [1–4], the Internet/Network/Web/Vehicle-of-Things [5]), the next generation of big data systems will need to deal with machine-generated data from these forms of networked sensor systems [6–9]. This paper focuses on big sensor data systems (a term that conceptualizes the application of big data models to networked sensors) and surveys their applications towards smart cities and urban informatics, which is becoming an important research area [10]. IBM estimates that machine-generated data sources will grow to 42 % of all data by 2020, up from 11 % in 2005 (Footnote 1). Much less research has been conducted on big sensor data systems compared with conventional big data systems [11].

Recently, there has been much research on smart cities, and many have been established around the world, such as Santander [12], Barcelona [13], and Singapore [14]. In Australia, the state of Tasmania is experimenting with the world's first economy-wide intelligent sensor network technology (Sense-T) (Footnote 2). Smart city applications generate huge amounts of data from a wide variety of sources, ranging from environmental sensors, mobile phones, and localization sensors to data generated by people on social networks. While there is general agreement that the use of these big sensor data would lead to improved services in smart cities, many research challenges remain on how to integrate and utilize them. In this survey paper, we take a bottom-up approach: we use representative use-cases for big sensor data research, deconstruct the studies to identify key techniques and challenges, and summarize the common strategies and solutions adopted by different researchers. The use-cases selected for study have all been implemented in the real world.

Figure 1 illustrates the rapid growth of the global data volume. As large datasets proliferate, they also bring many challenging problems that demand prompt solutions.

Fig. 1. The continuously increasing volume of big data

2 Domains of Big Sensor Data

A definition of big data (Footnote 3) is: “Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”. An extended definition is that big data systems involve five Vs: (1) big volume of data (e.g. datasets of terabytes or petabytes), (2) variety of data types, (3) high velocity of data generation and updating, (4) veracity (uncertainty and noise) of the acquired data, and (5) big value, as shown in Fig. 2. The first four Vs are concerned with data collection, preprocessing, transmission, and storage; the final V focuses on extracting value from the data using statistical and analytical methods.

Fig. 2. The four Vs of big data

3 Big Sensor Data Generation and Acquisition

If we regard data as a raw material, then data generation and data acquisition are an exploitation process, data storage is a warehousing process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data Generation

The data collection process consists of three modules: (1) data acquisition, (2) information extraction and cleaning, and (3) data integration, aggregation and representation. The inputs to the process are the raw sensor data values s(x,y,t) harvested from (multiple) sensor farms, and the output is the set of cleaned and aggregated data values s'(x,y,t). Because of data sampling and aggregation, the number of output data points is never more than, and may be less than, the number of input data points. Another characteristic is that the data collection process does not involve other data sources.
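As a concrete illustration of this pipeline, the following minimal sketch (with hypothetical field names and thresholds, not taken from the surveyed systems) cleans raw samples s(x,y,t) by dropping out-of-range readings and then aggregates duplicates at the same location and time step, so that the output s'(x,y,t) never contains more points than the input:

```python
from collections import defaultdict

def collect(raw_samples, valid_range=(-40.0, 60.0)):
    """Clean and aggregate raw sensor samples.

    raw_samples: iterable of (x, y, t, value) tuples from one or more sensor farms.
    Returns a dict mapping (x, y, t) -> aggregated (mean) value.
    """
    lo, hi = valid_range
    buckets = defaultdict(list)
    for x, y, t, value in raw_samples:
        if lo <= value <= hi:  # extraction/cleaning: drop implausible readings
            buckets[(x, y, t)].append(value)
    # integration/aggregation: one value per (x, y, t), so |output| <= |input|
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

raw = [(0, 0, 1, 21.5), (0, 0, 1, 21.9), (1, 0, 1, 999.0)]  # the last reading is noise
print(collect(raw))  # {(0, 0, 1): 21.7}
```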

Data generation is the first step of big data. Take Internet data as an example: huge amounts of data, such as search entries, Internet forum posts, chat records, and microblog messages, are generated. These data are closely related to people's daily lives and share the features of high value but low value density [15, 16]. Such Internet data may be valueless individually, but by exploiting the accumulated big data, useful information such as users' habits and hobbies can be identified, and it is even possible to forecast users' behaviors and emotional moods [17–20].

Moreover, datasets generated through longitudinal and/or distributed data sources are larger in scale, more diverse, and more complex. Such data sources include sensors, videos, clickstreams, and all other available sources. At present, the main sources of big data are operational and trading information in enterprises, logistic and sensing information in the IoT [21, 22], human interaction and position information on the Internet, and data generated in scientific research. This information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirements also greatly stress existing computing capacity [23].

3.2 Big Data Acquisition

As the second phase of a big data system, big data acquisition includes data collection, data transmission, and data pre-processing. Once the raw data have been collected, an efficient transmission mechanism is needed to send them to a proper storage management system that supports different analytical applications. The collected datasets may include considerable redundant or useless data, which unnecessarily increases storage space and hampers subsequent data analysis; for example, high redundancy is very common among datasets collected by sensors for environment monitoring, and data compression can be applied to reduce it. Data pre-processing operations are therefore indispensable to ensure efficient data storage and exploitation:

  1. Data Collection: Data collection uses specialized techniques to acquire raw data from a specific data generation environment. In addition to the three data acquisition methods for the main data sources mentioned above, there are many other data collection methods and systems. For example, in scientific experiments, special tools such as magnetic spectrometers and radio telescopes can be used to collect experimental data. Data collection methods can be classified from different perspectives; from the perspective of data sources, they fall into two categories: methods that record directly from data sources and methods that record through other auxiliary tools.

    Participatory sensing is an emerging application scenario for efficient data crowdsourcing from ordinary citizens equipped with smart devices. Liu et al. [24–26] presented a novel resource negotiation scheme bridging dynamic sensing tasks and heterogeneous sensors. Liu et al. [27–29] proposed a novel framework, with participant selection and incentive mechanisms, for participatory crowdsourcing involving smart device users, a central platform, and multiple task publishers. In [30], existing incentive mechanisms are extensively surveyed and future research directions are given. Liu et al. [31] analyzed the relationship between energy consumption and smart device user behaviors, and then proposed an approach to select the optimal number of participants while accounting for possible user rejections. Song et al. [32] introduced an energy consumption index to quantify how disturbed participants feel, on average, by the energy cost, and proposed a suboptimal approach for participant selection in a multi-task sensing environment. Liu et al. [33] presented a family-based healthcare monitoring system for long-term chronic disease care. Event detection systems and energy-efficient approaches based on participatory sensing are given in [34, 35], including both a centralized optimal approach and fully distributed suboptimal solutions. Furthermore, Zhang et al. [36] focused on the privacy leakage issues of participatory sensing and presented a participant-coordination-based architecture and workflow to protect user privacy. Yurur et al. [37] presented several posture detection schemes using sensor-equipped smart devices. Finally, Liu et al. [38] introduced a quality of service (QoS) index that integrates multi-dimensional QoS requirements to ensure a desired degree of QoS satisfaction.

  2. Data Transportation: Once raw data collection is complete, the data are transferred to a data storage infrastructure for processing and analysis. Big data are mainly stored in data centers, where the data layout may be adjusted to improve computing efficiency or facilitate hardware maintenance [39]; in other words, data transmission also occurs within the data center. Data transmission therefore consists of two phases: inter-DCN transmissions and intra-DCN transmissions [40–42].

  3. Data Pre-processing: Because of the wide variety of data sources, the collected datasets vary in noise, redundancy, and consistency, and it is undoubtedly wasteful to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, to enable effective data analysis, the data must often be pre-processed to integrate data from different sources, which not only reduces storage expense but also improves analysis accuracy. A minimal redundancy-reduction sketch is shown after this list.
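The following minimal sketch illustrates one simple redundancy-reduction rule (a hypothetical change-threshold filter, not one prescribed by the surveyed work): consecutive readings that differ from the last stored value by less than a threshold are treated as redundant and dropped before storage.

```python
def deduplicate(readings, threshold=0.1):
    """Drop consecutive readings that change by less than `threshold`,
    a simple redundancy-reduction step before storage."""
    kept = []
    for value in readings:
        if not kept or abs(value - kept[-1]) >= threshold:
            kept.append(value)
    return kept

stream = [20.00, 20.01, 20.02, 20.50, 20.51, 21.00]
print(deduplicate(stream))  # [20.0, 20.5, 21.0]
```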

4 Big Data Storage and Analysis

4.1 Data Storage

The explosive growth of data places stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliable and available data access. We review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide reliable storage space for information; on the other hand, it must provide a powerful access interface for querying and analyzing large amounts of data.

Traditionally, data storage devices have been auxiliary equipment of servers, used to store, manage, look up, and analyze data with structured Relational DataBase Management Systems (RDBMSs). With the sharp growth of data, storage is becoming increasingly important, and many Internet companies pursue large storage capacity to remain competitive. Therefore, there is a compelling need for research on data storage.
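As an illustration of how large sensor datasets are commonly laid out for scalable access (a minimal sketch under assumed conventions, not the design of any system cited above), the following code partitions incoming readings by sensor and by day so that later queries only touch the relevant files:

```python
import csv
import os
from datetime import datetime, timezone

def partition_path(root, sensor_id, timestamp):
    """Directory partitioned by sensor and day, e.g. root/sensor=42/date=2023-11-14."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    return os.path.join(root, f"sensor={sensor_id}", f"date={day}")

def append_reading(root, sensor_id, timestamp, value):
    """Append one reading to its partition, creating directories on demand."""
    path = partition_path(root, sensor_id, timestamp)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "readings.csv"), "a", newline="") as f:
        csv.writer(f).writerow([timestamp, value])

# Example: store two readings for a hypothetical sensor 42.
append_reading("datastore", 42, 1700000000, 21.5)
append_reading("datastore", 42, 1700003600, 21.7)
```

Partitioning by key and time is the basic idea behind the sharded, distributed storage systems discussed above; a production system would additionally replicate partitions across nodes for reliability.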

4.2 Data Analysis

The analysis of big data mainly involves analytical methods for traditional data and for big data, analytical architectures for big data, and software for mining and analyzing big data. Data analysis is the final and most important phase in the big data value chain, with the purpose of extracting useful value and providing suggestions or decisions [43]. Different levels of potential value can be generated by analyzing datasets from different fields. However, data analysis is a broad, rapidly changing, and extremely complex area. In this section, we introduce the methods, architectures, and tools for big data analysis.

Traditional data analysis uses appropriate statistical methods to analyze massive data: to concentrate, extract, and refine the useful information hidden in otherwise chaotic datasets and to identify the inherent laws of the subject matter, so as to maximize the value of the data. Data analysis plays a major guiding role in making national development plans, understanding customer demand in commerce, and predicting market trends for enterprises. Big data analysis can be regarded as the analysis of a special kind of data, so many traditional data analysis methods can still be used for big data analysis.

At the dawn of the big data era, the main concern is how to rapidly extract key information from massive data so as to create value for enterprises and individuals. Although parallel computing systems and tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, high-level parallel programming tools and languages are being developed on top of these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as SCOPE and DryadLINQ for Dryad.
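To make the MapReduce programming model concrete, here is a minimal single-process sketch (plain Python, not a distributed runtime) in which the map phase emits (region, reading) pairs, a shuffle groups them by key, and the reduce phase averages the readings for each region; high-level languages such as Pig or Hive generate equivalent jobs from declarative queries:

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs; here the key is the region of each reading."""
    for region, reading in records:
        yield region, reading

def reduce_phase(pairs):
    """Group by key (the shuffle step) and reduce each group to its mean."""
    groups = defaultdict(list)
    for region, reading in pairs:
        groups[region].append(reading)
    return {region: sum(vals) / len(vals) for region, vals in groups.items()}

records = [("north", 21.5), ("north", 22.1), ("south", 19.8)]
print(reduce_phase(map_phase(records)))  # {'north': 21.8, 'south': 19.8}
```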

5 Big Data Applications

The works in [44–47] used technical sensors embedded in the environment. The work in [44] for Barcelona aimed to expose the daily routines and patterns of people using the city bicycling program; the system collected data on when a bike was picked up or parked. The work in [45] embeds sensors into the city infrastructure; the collected data are useful for studying the impact of air pollution on respiratory disease and for informing cycleway development. The work in [46] for Amsterdam used 2400 vehicle detector stations and 60 number plate recognition cameras to decrease vehicle loss hours. The works in [48–52] used humans as sensors and collected data as people went about their daily routines. The work in [48] used Call Detail Records (CDRs) from a cellular network to characterize human mobility. The work in [49] collected fine-grained environmental information in the city using data mined from crowdsourced bicycles. The work in [50] used the geolocation of photos on the Flickr social networking website to uncover the movements of tourists in Rome. The work in [51] optimized 5G small cell networks by accounting for the big data generated by user mobility in urban regions. The works in [53, 54] are examples where both technical and human sensors are used. The work in [53] used a combination of GPS data and radio channel measurements from a cellphone network to give the instantaneous position of each mobile element. The work in [54] addresses flood risk management in Brazil and used a combination of in-situ water sensors and human participatory sensing to give the water level height.
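As a small illustration of the kind of mobility characterization enabled by CDR-style location traces (a generic metric sketch, not the specific method of [48]), the radius of gyration summarizes how far a user typically roams around their centre of mass:

```python
import math

def radius_of_gyration(points):
    """Radius of gyration of a user's visited locations (in coordinate units;
    a real analysis would project latitude/longitude to metres first).
    `points` is a list of (lat, lon) tuples, e.g. cell tower positions from CDRs."""
    n = len(points)
    lat_c = sum(p[0] for p in points) / n
    lon_c = sum(p[1] for p in points) / n
    sq = sum((p[0] - lat_c) ** 2 + (p[1] - lon_c) ** 2 for p in points)
    return math.sqrt(sq / n)

visits = [(41.89, 12.49), (41.90, 12.50), (41.85, 12.47)]  # hypothetical positions
print(radius_of_gyration(visits))
```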

6 Conclusion, Open Issues, and Outlook

6.1 Data Collection

There are three important challenges for data gathering or collection in big sensor data systems: (1) Besides the five Vs of conventional big data, big sensor data systems also need to satisfy an E, energy efficiency [55]. This requirement applies at every stage of the big data pipeline whenever a non-rechargeable power source has to be managed [56]. (2) Big sensor data systems require collecting data from hundreds of thousands of sensors, which may be embedded in “Things” (e.g. garbage cans, street lights) scattered anywhere in the environment; it is prohibitively expensive to deploy a sensor relay infrastructure everywhere in a city for data gathering [57]. (3) The third challenge is to find ways to reuse available sensing infrastructures for new tasks or applications.
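To make the energy efficiency requirement concrete, the following back-of-the-envelope sketch (with purely hypothetical current-draw figures) estimates the lifetime of a battery-powered node under duty cycling; lowering the duty cycle trades data velocity for node lifetime:

```python
def node_lifetime_days(battery_mah, active_ma, sleep_ma, duty_cycle):
    """Rough battery lifetime estimate for a duty-cycled sensor node.
    `duty_cycle` is the fraction of time the node is awake (0..1)."""
    avg_current_ma = duty_cycle * active_ma + (1 - duty_cycle) * sleep_ma
    return battery_mah / avg_current_ma / 24  # hours -> days

# Hypothetical node: 2400 mAh battery, 20 mA when active, 0.01 mA asleep.
for dc in (0.10, 0.01, 0.001):
    print(f"duty cycle {dc:.3f}: ~{node_lifetime_days(2400, 20.0, 0.01, dc):.0f} days")
```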

6.2 Data Inference

From the surveyed papers, we see a strong emphasis in big sensor data systems on using a variety of heterogeneous data sources or historical data to infer missing data or to predict future trends in the spatio-temporal sensing field. The challenge is to find suitable models and techniques to integrate the various data sources. One representative approach, drawn from systems biology, showed the diverse data modalities that can be extracted from a single instance of DNA (e.g. high-dimensional expression data, sparse protein interaction data, sequence data, annotation data, and text mining data). The approach used the kernel trick, whereby data with diverse structures are all transformed into kernel matrices of the same size so that they can be combined.
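The following minimal sketch (hypothetical data and weights, not the cited study) illustrates the kernel-combination idea: each modality, whatever its structure, becomes an n-by-n kernel matrix over the same n samples, and the matrices are then combined, here by a simple weighted sum:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix for one data modality (rows are samples)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_expr = rng.normal(size=(5, 20))                    # dense, expression-like features
X_inter = (rng.random((5, 50)) < 0.1).astype(float)  # sparse, interaction-like features

# Both modalities describe the same 5 samples, so each kernel is 5x5 and
# they can be combined regardless of their original structure.
K = 0.5 * rbf_kernel(X_expr, gamma=0.05) + 0.5 * rbf_kernel(X_inter, gamma=0.5)
print(K.shape)  # (5, 5)
```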

6.3 Value Generation

We find an emphasis towards data-driven research, especially using machine learning and complex network techniques. Whereas mathematical models and simulation techniques have been useful for studying the characteristics and behaviors of smaller-scale systems, the move to study large-scale systems necessitates the development of new data-driven modeling techniques, because conventional mathematical and simulation models have difficulty acquiring the correct parameters and dealing with unpredictable or unknown factors. Among the machine learning approaches, the emergence of deep learning and cross-domain techniques shows potential to discover hidden insights and trends in big sensor data systems [58].

6.4 Real-World Applications

Many applications for smart cities (e.g. earthquake/disaster early warning systems, air pollution monitoring) require (near) real-time performance to serve their function [59]. Table 1 presents typical big sensor data applications. This need will drive the “Velocity” characteristic of big sensor data systems. Currently, most (if not all) research on big sensor data systems does not consider this aspect: analyses are performed offline on historical or past data. In the future, we anticipate the research and development of big sensor data systems in which real-time analytics is performed on large volumes of recently acquired data from multiple sensor farms, combined with a number of diverse and historical sources.
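As a minimal illustration of such real-time analytics (a generic sketch, not a system from Table 1), a sliding-window aggregate over an incoming stream can deliver low-latency summaries as each reading arrives:

```python
from collections import deque

class SlidingWindowMean:
    """Streaming mean over the most recent `size` readings; the kind of
    low-latency aggregate a real-time pipeline would maintain per sensor."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

monitor = SlidingWindowMean(size=3)
for reading in [10.0, 12.0, 11.0, 30.0]:  # e.g. incoming pollution readings
    print(monitor.update(reading))
```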

Table 1. Typical big sensor data applications