Keywords

1 Introduction

Any large sized and complex data, which limits the capabilities of traditional data processing applications and tools to handle it, can be termed as Big Data. It has certain characteristics that are defined as several Vs by various researchers [1], however, significant ones are Volume, Variety, and Velocity. Big data can comprise of all of these characteristics or any one of them. Volume defines substantial amount of data that challenges storage and processing capacity of currently used tools and processors. Variety refers to different types of data present for analysis like multimedia data (audio, video) and textual data, images, social media data [2], etc. Velocity on the other hand, deals with the pace at which big data is getting generated and propagated for analysis. Streaming data analysis is one example of high-velocity big data where the challenge lies in the rapid analysis of constantly generating streams of data via different sources. Similarly, researchers in [3] have introduced HACE theorem suggesting major characteristics of big data as heterogeneous and diverse data sources, autonomous with distributed control, complex, and evolving in knowledge.

Analysis of big data by incorporating its characteristics to discover and predict valuable information from it is termed as Big Data Analytics. The basic aim of analysis is to extrapolate big data, interpret it, and assist in prediction, diagnosis, and decision-making. It is one of the major research areas in parallel and distributed systems in future generation due to its wide application areas like health welfare, government and public sector, e-commerce, social networking, and natural processes [4] etc.

KDD, Knowledge Discovery, and Data Mining, is a process comprising of basic three steps, i.e. input, analysis, and output. Currently, numerous techniques are available for normal-sized data analysis. However, new techniques and algorithms are required for big data analysis to match highly complex nature of big data. Currently available tools for big data concentrate on two types of analysis, batch processing and streaming data analysis [5]. Most established batch processing tool is Apache Hadoop platform. Apache Hadoop consists of two basic components, HDFS and MapReduce Paradigm. HDFS is Hadoop-Distributed File System and is used for storing voluminous big data in master–slave fashion. MapReduce Paradigm is based on the parallel computation of big data and taking into consideration computation time, parallel processing is the most important future trend [6]. Other tools for batch processing include Apache Mahout, Dryad, Tableau, etc.

On the other hand, streaming data tools include Apache Spark, Storm, S4, Kafka, etc. It is also known as real-time processing in which ongoing data processing requires a very low latency rate for efficient analysis. Apache Storm is mostly used tool for this purpose. It is scalable, fault tolerant, and easy to use and operate. Other tools include S4, Splunk, Kafka, SAP Hana, etc. [5] has very well demonstrated the differences between batch processing and real-time tools.

One needs to decide which platform is most suitable for which type of data according to the application area it is being generated from. For this purpose, various factors are important to consider like data size, throughput optimization, training of model, scalability, etc. [7]. Rest of the paper is as follows: Sect. 2 describes various technologies in big data domain covering almost all available techniques and areas along with a discussion of some case studies. Section 3 discusses trends and forecasts for big data technologies according to Forbes. Section 4 is conclusion and future scope of this chapter.

2 Technologies in Big Data Domain

Big data, consisting of some unique features, gets generated from diverse sources in various formats like batch and streaming. It is observed that most of the big data that is generated is unstructured. Challenge lies in gaining insights into such data directly or by first converting it to structured data and then analyzing it. Storage and analysis of structured or unstructured data are always a challenge when size of data increases enormously. With the evolution of big data, techniques related to it are also developing to efficiently analyze it and extract important information from it. This section discuses top-level technologies in Fig. 1 that are currently being worked on in the big data domain. These include both for storage and analysis purpose.

Fig. 1
figure 1

Big data technologies

Each type of big data technologies described in Fig. 1 such as Predictive Analytics, NoSQL Databases, Real-Time Analytics, Data Virtualization, and Cloud Computing are discussed next.

2.1 Predictive Analytics

Predictive analytics is an advanced level of analytics in which predictions are made for future events based upon previously available data. It is also known as Classification. For this purpose, machine learning algorithms are used to train the machine based on historic data and then it is tested for the quality of predictions it can make. Many practitioners are working toward applying existing machine learning techniques in big data domain or scaling them to map big data level.

There are multiple platforms for machine learning in big data scenario as discussed in [8] like Mahout, H2O, SAMOA, Apache Spark, etc. Mahout is the most well-known platform with a wide range of machine learning algorithms that are basically focused on classification, clustering, and collaborative filtering. It has some shortcomings also like its inefficient runtimes due to slow MapReduce principle and difficulty to set it on a Hadoop cluster. On the other hand, Apache Spark is a recent platform which is based upon MLlib, machine learning library. MLlib implements all algorithms of Mahout along with some regression models, feature extraction, etc., which is the lacking factor for Mahout.

H2O is an enterprise edition with a notable feature of GUI, graphical user interface, and deep neural network toolkits. Users can work on this platform via programming or web-based interface. SAMOA stands for Scalable Advanced Massive Online Analysis and is a machine learning platform for streaming data. It is highly scalable and extensible but with low coverage of algorithms.

Apart from this, one more promising platform for machine learning in big data is R [9]. Though R provides more than 70 packages but they lack scaling to big data. Therefore, R has provided specialized packages for big data like bigrf, biglm, etc. Bigrf provides Random Forest functionality on big data while biglm deals with linear regression for data that is too large to fit in memory. R also provides an interface with Hadoop known as Radoop, which works in the form of map reduce jobs on the data stored in HDFS. Researchers have discussed and compared various machine learning platforms on big data in detail which is in [9].

In addition to this, a general-purpose framework, Petuum is introduced in [10] that systematically addresses data and works efficiently in comparatively less time.

Case Study: For instance, started as an online bookstore, Amazon has led its journey to top-level online retail store. Product recommendation is used based on customers profile and their past experiences in amazon which serves as a major big data challenge. It uses the concepts of content-based and collaborative filtering to group customers and analyzes their preferences.

2.2 NoSQL Databases

In this era of big data analytics, a variety of data in different formats are available and traditional relational databases are becoming obsolete to manage them. In their place, NoSQL databases are providing promising and efficient data storage and analysis platform to handle big data. Some popular NoSQL databases include Redis, MongoDB, Couchbase, Cassandra, HBase, etc. These can be grouped based on their characteristics in four categories as Key-value Store, Document Store, Column Family Store and Graph databases [11].

Key-value Store is where data is stored in the form of key-value pairs as hash tables where values can be easily accessed with the help of corresponding keys. In document store, semi-structured documents are stored and managed in formats like XML or JSON. Column Family Store is where data is stored in the form of rows and columns and based on similarities of columns, related columns formulate column family. Graph databases store information in the form of graph data structure and make use of graph algorithms to analyze them. It is mainly used to represent recommendation system or social networking data.

Many researchers are working in this field to analyze the capabilities and performance of NoSQL databases to handle big data. In [12], it is concluded that Redis which is a key value data store has performed better in terms of efficiency and is found well suited for loading and executing workloads. Apart from this, practitioners have also proposed a fuzzy-based methodology in [13] to store consistent fuzzy geospatial data based on certain validations and constraints for semantics. Researchers in [14] have addressed the problem of data migration in heterogeneous NoSQL databases by providing a fault-tolerant approach to migrate data. In some applications, a hybrid architecture of relational and NoSQL databases can prove useful. In [15], a data adapter system to support hybrid database model supporting both types of databases with three modes for query approach.

Case Study: EHR system as discussed in [16] replaced existing system to NoSQL database for primary data store and to prepare a local cache at each site to improve request latency and availability.

2.3 Real-Time Analytics

Real-time big data analytics is when the timely prediction of streaming data is provided. This is a very challenging task in big data scenario as storage and analysis of big data at the same time requires large in-memory and correct platform. Its applications include majorly sensor related data and social web data. Examples of real-time analytics tools are S4, Storm, and Apache Spark.

In recent times, the role of streaming analytics has been increased in the field of sentiment analysis. A distributed system based on Vertical Hoeffding Tree, a decision tree classifier for same has been presented in [17]. Apache Samoa, a tool for distributed streaming classification along with evaluation criteria to measure the performance of streaming classifiers is depicted in paper [18].

Case Study: A case study on finding the effectiveness of Storm to analyze streaming data in real time from two applications to predict trending news is given in [19].

2.4 Data Visualization

Visualization is the concept of representing the results and analysis part of big data in a graphical manner. It is becoming a concerned technology because traditional visualization approaches do not map well to big data scenario. In case of a large-scale data visualization, processing methods such as feature extraction and geometric modeling are also used. In the current scenario, researchers are actively working on searching for new visualization platforms to scale them for big data. Here, augmented reality techniques are proving to be next milestone in this scenario. In [20], practitioners have discussed utilizing virtual reality and augmented reality techniques for big data visualization. A new algorithm for visualization of big data from high dimension to a two-dimensional space is presented in [21] and found to perform noticeably well. Similarly, in [22], high dimensional big data is visualized in three dimension using unsupervised dimension reduction techniques.

Case Study: To maintain the consistency and integrity of constantly increasing data in Qualcomm, a data virtualization solution was produced to make data available fast and in the virtual view.

2.5 Cloud Computing

Distributed computing along with Internet in an extended form can be conceptualized as cloud computing. Here, cloud refers to a network which can be accessed via the Internet anytime and anywhere to use it as any service. With the evolution of big data, storing and analyzing big data in clouds seems to be perfect solution to almost all data storage problems. Still, there exist some issues in this solution out of which security is most important. In [23], an efficient approach to handle secure storage of big data in a cloud environment is presented which is based on compressing and decompressing of data.

One more problem associated with distributed cloud computing is heavy data traffic among data centers. This leads to high communication costs in query evaluation. An online query evaluation system to deal with this problem of big data in distributed clouds is discussed in [24].

Case Study: Several case studies in this research area are discussed in [25] namely SwiftKey, Nokia, redBus, etc. SwiftKey is a language technology that uses Apache Hadoop on Amazon Cloud for scalable and multilayered model along which applies artificial intelligence component to manage constantly increasing data in prediction problems. redBus is responsible for internet bus ticketing system in India and has implemented its system using Google Query for large data analysis. On the other hand, Nokia makes use of Hadoop Distributed File System to store multi-structured data in cloud environment.

3 Trends and Forecasts for Big Data Technologies

Big data market is constantly expanding in mainstream business and there is a need to analyze which big data technologies are in demand. According to Forbes, an analysis of growing requirements of big data technologies is done and a trajectory of more than 20 techniques in big data scenario is presented as Fig. 2.

Fig. 2
figure 2

Big data technologies described by Forbes

As shown in this figure, a wide range of big data technologies are shown as per their business value in coming time. Predictive analytics is the technology that is highly popular in today’s business needs and the only one with maximum time for next phase. Apart from it, NoSQL databases, Stream analytics, search and knowledge discovery are some other technologies that are currently in demand. No big data technology is achieving the minimum success line on graph showing the promising nature of the technologies in this area. Very few technologies namely data science tools and machine learning libraries are achieving moderate success and are likely to be popular in less than 1 year. Based on TechRadar methodology of Forrester, above-described technologies in this paper are most popular and likely to remain persistent for more than 5 years [26].

4 Conclusion

Big data technologies are trending in almost all present business requirements and their understanding has become a burning need in this field. In this paper, a brief study on some popular big data technologies like predictive analytics, NoSQL databases, Real-time analytics, cloud computing, and data virtualization is discussed. Prediction for future trends of popular techniques and their lifespan is also presented as described by Forbes. Sample case studies for each discussed technology are also given to understand the scope and purpose of this study.