1 Introduction

With the rapid growth of smartphones, Internet access has become easy, resulting in large amounts of unstructured data being collected and stored from different areas of society [1]. In addition, real-time systems such as sensor-based technologies typically generate streams of data that require rapid storage and processing of incoming data. Such data streams, however, exhibit new characteristics compared with traditional stream data. Applications such as traffic management, log data from Web search engines, Twitter, and electronic mail also generate high volumes of stream data arriving at high velocity, which is difficult to handle with existing data stream techniques [2, 3]. Over the past decade, big data has come into the picture; it can be defined in terms of its characteristics: volume, velocity, variety, value, and veracity [3,4,5]. Many researchers have proposed techniques, tools, and complex processes for gaining insight into these characteristics of big data. Exploring stream data arriving at high velocity is a key challenge in big data research; it has been addressed by many researchers but still leaves room for exploration in many applications and domains. Moreover, processing stream data in real time differs from non-real-time processing, since incoming data must be analyzed against historical stored data before it is itself stored for further analysis and for prediction of upcoming streams.

Motivation

Data is generated and acquired at a rapid pace and in large volumes, which creates the challenge of developing methods that are automated and can respond quickly enough to make decisions within a specified time. Because the data is too big, it must be moved to and stored in a distributed environment for further computation, as traditional data warehouses are ill suited to it. Analysis of such data using classic OLAP cubes also does not work; they are replaced by distributed storage environments such as Hadoop, which uses a master–slave architecture for storing data and the MapReduce technique for processing data in batches, and by NoSQL databases, which use different storage models (column, graph, document, and key-value stores) and can run on top of Hadoop to make it suitable for streaming big data. Several assumptions made by traditional systems when dealing with stream data mining no longer hold once distributed data processing frameworks and tools are used to handle the problems of big data:

  • It becomes possible to collect and store the whole data stream, rather than only samples or summaries of the data.

  • Data can be integrated and indexed in real time irrespective of the format in which it arrives.

  • Data arriving at high velocity can be processed in real time using distributed streaming algorithms and stored in a distributed fashion, improving the existing model for further analysis.

  • Analyzing existing or past data is crucial for making targeted future predictions, but decisions based on operational or transactional data require real-time analysis, processed in parallel with low latency.

2 Related Work

A data stream can be conceived as a continuous, evolving sequence of data items that arrive continuously at a system to be stored and processed. Stream data processing treats some or all of the input as one or more continuous data streams.

2.1 Data Stream Mining

The literature describes several windowing models that are essential for mining information from stream data. The landmark window takes the whole new data stream as a window rather than a sample and treats all items as equally important, which can make it difficult to build a model with limited memory. The sliding window, one of the most widely used windowing techniques in stream mining, keeps only the most recent data and discards the old; its flexibility with respect to the accuracy required by the model makes it popular. The fading window assigns weights to data according to arrival time, with newer items receiving higher weights than older ones. Finally, the tilted-time window lies between the sliding and fading windows, keeping recent data at fine granularity and older data at progressively coarser granularity. A minimal sketch of the sliding-window model is shown below.
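As an illustration of the sliding-window model, the following Python sketch maintains a running mean over only the most recent items, so memory stays bounded no matter how long the stream runs. The window size and the mean statistic are illustrative choices, not taken from the surveyed works.

```python
from collections import deque

class SlidingWindowMean:
    """Maintains the mean of the most recent `size` stream items.

    Older items are discarded as new ones arrive, so memory use
    stays bounded regardless of the stream's length.
    """

    def __init__(self, size: int):
        self.window = deque(maxlen=size)  # oldest item drops out automatically
        self.total = 0.0

    def update(self, value: float) -> float:
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # subtract the value about to be evicted
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

# Example: process an unbounded stream one item at a time.
stats = SlidingWindowMean(size=1000)
for x in [3.0, 5.0, 4.0]:          # stand-in for an incoming stream
    current_mean = stats.update(x)
```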

2.2 Big Data Stream Mining

In the past few years, owing to the growth of Web 2.0 technologies and increased bandwidth for data transfer, most research has focused on collecting and storing data. Many new sources of data have evolved, most of which require real-time analysis of stream data and information extraction algorithms to analyze it. Hadoop was used by Yahoo in its early days to collect, store, and analyze large volumes of click-stream data; e-commerce enterprises later used such data to help customers choose products and to identify the follow-on products users typically purchase. In addition, recommending similar products that customers may want to buy in the future is based on their purchase history. Many data mining algorithms have been proposed to overcome the challenges of stream data, as seen in the section above; to address the volume, velocity, and volatility challenges [3], a standard framework based on the Lambda architecture [6], with phases ranging from stream data collection to data visualization, is reviewed.

Nowadays, the concept of the “smart city” is still evolving, and data is gathered using technologies such as sensors covering many aspects: user locations, social gathering information, intelligent transportation systems (ITS), and temperature changes. Similarly, social media sites such as Twitter, Facebook, and LinkedIn generate large amounts of data and are among the main sources of real-world big stream data [7]. After acquiring data from different sources, the most important phase in mining any unstructured data is preprocessing; unfortunately, it has not yet been fully explored in the context of big data. Since real-world data is dirty and noisy, the same applies to real stream data, which makes analyzing data that has not been preprocessed even harder.

Stream processing paradigms originated from the idea that incoming data carries hidden information that can be extracted to obtain useful analytical results. Because data arrives continuously and in huge volumes, only a small fraction of the stream is held in limited-memory stores and processed using stream processing systems such as Storm or Kafka, as shown in Fig. 1. The data can then be stored in large distributed file systems such as HDFS for future use. Many machine learning algorithms have been used to extract hidden information from stream data. The different approaches presented by the authors are discussed below and summarized in Table 1.
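As a rough sketch of the ingestion step in Fig. 1, the snippet below consumes a stream from Kafka while keeping only a bounded buffer in memory. It assumes the kafka-python client; the topic name, broker address, and buffer size are hypothetical placeholders.

```python
from collections import deque
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; replace with your deployment's values.
consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

buffer = deque(maxlen=10_000)  # bounded in-memory fraction of the stream

for message in consumer:
    buffer.append(message.value)  # analyze here, then persist (e.g., to HDFS)
```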

Fig. 1 Architecture for handling stream big data

Table 1 Summary of various approaches proposed on stream big data

Rutkowski et al. [8] argued that algorithms based on Hoeffding’s bound, which underlies one of the most widely used decision-tree techniques in data stream mining, need to be revised, and proposed a method using McDiarmid’s inequality to pick the correct attribute when splitting a tree node. They performed several experiments and evaluated their results using the Gini index and information gain as splitting measures.
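For context, the Hoeffding bound states that after n observations of a random variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the sample mean with probability at least 1 − δ; VFDT-style trees split a node once the observed gap between the two best attributes exceeds ε. A minimal sketch of this split test follows; the gains and parameters are illustrative only.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon such that |true mean - sample mean| <= epsilon
    holds with probability at least 1 - delta after n observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Information gain has range log2(k) for a k-class problem.
R = math.log2(2)                      # two classes -> R = 1
eps = hoeffding_bound(R, delta=1e-7, n=5000)

gain_best, gain_second = 0.32, 0.25   # hypothetical observed gains
if gain_best - gain_second > eps:
    print("split on best attribute")  # the gap is statistically reliable
```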

A limitation of this method is that to split a node, it must scan a huge number of data elements before selecting the right attribute. This limitation was later overcome in [9], where a statistical method based on Taylor’s theorem and the properties of the normal distribution was used to evaluate the splitting criterion and select the splitting attribute, and a Gaussian decision-tree algorithm was proposed to improve the performance of stream data mining. In [10, 11], the authors proposed, first, the mDT algorithm, based on a splitting criterion that combines the misclassification error with the Gini index for creating tree nodes and deciding the correct attribute for existing and incoming stream data, and, second, decision trees based on a hybrid split measure (hDT), which they tested on UCI repository datasets. Agerri et al. [12] presented a new distributed, highly scalable architecture for analyzing textual news streams using natural language processing (NLP). They performed experiments using different distributed pipeline modules on virtual machines and evaluated the system’s performance on original incoming news streams, in which documents were processed in a reproducible manner. Some limitations of the proposed system remain, which, they suggest, can be addressed using distributed NoSQL databases such as MongoDB.
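To make the split measures above concrete, the sketch below computes the Gini index and the misclassification error from per-class counts at a node. The summed combination at the end is only a naive illustration of a hybrid measure, not the exact formulation used in [10, 11].

```python
def gini(counts: list[int]) -> float:
    """Gini impurity of a node given per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def misclassification_error(counts: list[int]) -> float:
    """Error if the node predicts its majority class."""
    total = sum(counts)
    return 1.0 - max(counts) / total

counts = [40, 10]                       # hypothetical class counts at a node
print(gini(counts))                     # 0.32
print(misclassification_error(counts))  # 0.2
# Naive hybrid measure (illustrative only, not the exact hDT formula):
hybrid = gini(counts) + misclassification_error(counts)
```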

Vu et al. [13] proposed a stream algorithm based on adaptive model rules (AMRules) in a distributed environment. This was the first experiment with adaptive rules on a distributed platform; they used the open-source SAMOA software, built to handle large-scale data streams, and their main focus was understanding different decision rules in terms of regression. Fegaras [14] proposed a framework based on an incremental approach for distributed stream data; it mainly aims to improve the traditional batch processing used by MapReduce in Hadoop, turning it into iterative incremental batch processing that keeps the data being processed in memory. The framework was tested on a dataset consisting of complex arbitrary values, and the evaluation showed that its results are exact rather than approximate. Marron [15] presented the use of traditional classification algorithms such as random forests and VFDT for mining large amounts of stream data on GPUs, making these algorithms run in parallel to deal with the volume of big data. They compared the performance of the Very Fast Decision Tree on GPU (GVFDT) and random forest algorithms with similar platforms such as MOA and VFML, which provide both algorithms, and found that their results were better in terms of both speed and accuracy. Yun et al. [16] worked on frequent pattern mining using the sliding-window technique and proposed WEPS (Weighted Erasable Pattern mining algorithm on Sliding window-based data streams), in which weights are assigned to tree nodes for creation and pruning purposes. The proposed architecture is divided into two parts: the first phase concentrates on the sliding window, in which tree creation and restructuring are performed, and in the second phase, patterns are pruned based on the weights assigned to them. Zliobaite and Gabrys [17] proposed an automated, adaptive preprocessing technique for three different cases; in each case, a different adaptive model is used for preprocessing the data as well as for prediction, using techniques such as an incremental ensemble classifier (Table 2).

Table 2 Summary of various processing models for stream big data
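Several of the algorithms above (VFDT, GVFDT) are variants of the Hoeffding tree. A minimal single-machine sketch of training such a tree incrementally, using the open-source river library (an assumption for illustration; the surveyed systems used SAMOA, MOA, VFML, or GPU implementations instead), might look like this:

```python
from river import tree

# Hoeffding tree: learns one example at a time, using the Hoeffding
# bound to decide when an observed split is statistically reliable.
model = tree.HoeffdingTreeClassifier()

stream = [
    ({"temp": 30.0, "humidity": 0.7}, True),   # hypothetical labeled events
    ({"temp": 12.0, "humidity": 0.3}, False),
    ({"temp": 28.0, "humidity": 0.8}, True),
]

for x, y in stream:
    y_pred = model.predict_one(x)  # test-then-train evaluation style
    model.learn_one(x, y)
```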

3 Challenges in Various Domains

Various challenges have been highlighted by different authors [18,19,20,21]; some of them are discussed below.

3.1 GIS

Existing technology in geographic information systems (GIS) concentrates mainly on conventional databases and normally deals with static data, which limits it when it comes to analyzing big data. New tools and techniques for GIS are required to meet the changing environment of spatial databases in the big data era. Liu et al. [22] show how big data can revolutionize the world of GIS.

3.2 Human Mobility Patterns

Owing to the widespread use of smartphones, sensor-equipped mobile devices are in the pockets of millions, so the traveling habits of individuals can be explored; such data comes not only with volume but also with velocity and, because of its spatial nature, with variety as well. Gonzalez et al. [23] show that individual humans exhibit a high degree of temporal and spatial regularity in their movement patterns, which can be used in epidemic prevention, emergency response, and urban planning, as well as in agent-based modeling [24].

3.3 Space Technology

The Sloan Digital Sky Survey has collected data comprising around 500 million photometric observations of objects in the sky, which makes the job of space scientists easier by giving them data without having to make their own observations. Still, extracting knowledge from such a vast collection of big data is a tedious job. Business enterprises are also using big space data to carry out their operations in remote parts of the world [25, 26].

4 Conclusion and Unsolved Issues

Big data means opportunities for different sections of society to grow, and most of this data is continuous in nature, which creates research challenges in extracting information from such huge volumes of stream data. In this paper, we have presented the concept of stream big data, highlighted a general architecture that can be used for stream big data, and provided a literature survey of the numerous techniques and mechanisms for extracting information from stream big data. Many issues still need attention for the effective usage of big data:

  • Decision science

Building systems for real-time analytics that can turn data science into decision science is vital to cope with the enormous needs of today’s information systems.

  • Distributed algorithms

Many frameworks have been developed for distributed computing, but analyzing and predicting accurate information for such applications requires distributed data mining algorithms.