
1 Introduction

The term big data refers to collections of data sets that are diverse in format and complexity. Because of this diversity, such huge data sets are very difficult to store and process using traditional data processing tools or applications, so techniques and concepts are needed that allow these data sets to be used easily for various purposes. Big data analytics involves collecting data from different sources, transforming it into a format ready for use by analysts and finally providing it to the organizations that need it. Together, big data and machine learning enhance the performance of industries such as finance and healthcare, because the price of data storage has fallen and access to high-end, high-performance computing has become easy. Thus, theoretical concepts of big data, when implemented using machine learning tools, bring improvements to many industries and business organizations.

Nowadays, data is generated at a very fast rate. Approximately 90% of the data present in the world today has been created in the previous two years. In recent decades, huge amounts of data have been generated from various sources, for example:

  1. Walmart handles more than one million customer transactions every hour.

  2. The popular social media platform Facebook uses, stores and analyzes more than 30 petabytes of data, all of it generated by its millions of users.

  3. Approximately 48 h of new video are uploaded to YouTube every hour.

  4. Amazon handles nearly fifteen million user activity clicks per day, which play an important role in recommending products to its customers.

  5. Mail servers analyze around 294 billion emails to detect spam.

  6. Modern vehicles have more than 100 different types of sensors to monitor things such as fuel consumption and tire pressure, so every vehicle generates a large amount of sensor data that can be stored and processed in the cloud.

2 Big Data Characteristics

2.1 Volume

Volume refers to the enormous amount of information generated on a daily basis, which grows exponentially and is typically measured in terabytes, petabytes or, in some cases, even zettabytes. This data is so big that it cannot be handled, managed or controlled using traditional data management techniques. An example is the data generated by the interaction between humans and machines on various social media platforms.

2.2 Velocity

Velocity refers to the speed at which various sources generate data on a daily basis. This flow of data is enormous and continuous in nature. For example, Facebook has around 1.03 billion daily active users, a number that grows by approximately 22% each year, which shows how fast the number of users on social media platforms is increasing. These users are responsible for the rapid growth of data every day. Simply put, if you can handle the velocity, you can generate insights and take decisions based on this real-time data.

2.3 Variety

In big data, many different types of data sources produce different types of data, which together form big data. The data generated from these sources can be structured, semi-structured or unstructured in nature. Traditionally, data was mainly stored in spreadsheets and databases, but nowadays it is collected in many formats such as images, audio, video and sensor data. This variety of semi-structured and unstructured data creates problems in storing the data, collecting it, extracting information from it and analyzing it.

2.4 Veracity

Veracity [1] means that the data is in doubt or its availability is uncertain because the data is incomplete or inconsistent (Fig. 1).

Fig. 1 Data with missing values

Often the data is messy and difficult to handle. Since big data occurs in many forms, quality and accuracy always remain a major concern, and the sheer volume is largely responsible for this lack of quality and accuracy.

2.5 Value

Along with volume, velocity, variety and veracity, one more V related to big data must be discussed: value, which is essentially the usefulness of the data. The features and functions of big data include security, storage, analysis, exploration, visualization [2], modification and transactions. In today's world, various technologies and techniques [3] can be used with big data to make processing faster and more efficient. Parallelism increases the speed at which big data can be processed and also improves the analysis of the data, and distributed computing [4] systems can be used to process big data efficiently, mainly in real time.

The leading Apache big data frameworks are described briefly in the following sections.

3 Apache Hadoop

Apache Hadoop [5, 6] is an open source framework written in Java. It is a fault-tolerant and scalable framework that provides efficient batch processing. It performs better than many other techniques because it is capable of processing large volumes of data in different forms on a group of commodity hardware. Hadoop is often misunderstood as merely a system for storing data; in fact, it is a technology capable of both storing and processing large volumes of data.

Hadoop is designed to process big data that is a combination of structured and semi-structured data available in huge volumes. It also provides the analytical techniques and computational power required to work with large and diverse forms of data.

A Hadoop cluster comprises one master node and many worker nodes. The master node consists of both the Name Node and the Job Tracker, whereas a worker node acts as both a Data Node, responsible for storing data, and a Task Tracker, responsible for executing and monitoring the tasks assigned to it. The cluster also contains a Secondary Name Node, which keeps a copy of the Name Node's metadata. Its responsibility is to take snapshots of the primary Name Node's directory information at regular intervals of time; these snapshots can be used to restart a faulty or failed Name Node (Fig. 2).

Fig. 2 Apache Hadoop and Yarn

Hadoop consists of two main components: Hadoop Distributed File System (HDFS) and Map Reduce [7].

4 Hadoop Distributed File System (HDFS)

HDFS is used for storage; it is a fault-tolerant mechanism that stores large files, from terabytes to petabytes, across different machines in a distributed manner. The default replication factor is 3, which can be increased according to the sensitivity of the data being stored. HDFS splits a big file into large blocks of 64 MB (configurable, for example, to 128 MB) that are stored independently on multiple nodes. Its main responsibility is to ensure the availability of data even when host machines fail. It is also used to store intermediate processing results. HDFS is mainly suitable for distributed storage and processing, and Hadoop provides a command interface to interact with HDFS for streaming access to file system data (Fig. 3).

Fig. 3 HDFS architecture. Source hadoop.apache.org

HDFS provides an automatic fault detection mechanism, which improves the recovery process during disasters. Because HDFS runs on a large number of machines, the failure of individual components is expected, so it provides an efficient recovery system to keep the Hadoop system working. HDFS also tries to schedule processing on the node that holds the data locally, which reduces network traffic and increases throughput.
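
To illustrate how an application typically interacts with HDFS, the following is a minimal sketch in Java against the standard Hadoop FileSystem API (org.apache.hadoop.fs). The Name Node address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders and must be adapted to an actual deployment; the sketch only demonstrates writing and reading a file through the distributed file system.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the Name Node; the address is a placeholder.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    conf.set("dfs.replication", "3");          // default replication factor discussed above

    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits large files into blocks (64 MB or 128 MB)
    // and replicates each block across several Data Nodes.
    Path path = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back as a stream.
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}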

5 Map Reduce

Hadoop Map Reduce [8, 9] is a software framework for easily writing applications that process huge amounts of data. It facilitates the parallel execution of applications on large clusters in a fault-tolerant manner. Map Reduce [10, 11] is a programming model for writing tasks that can be executed in parallel on multiple nodes, and it also provides analytical capabilities for complex data. The traditional model cannot process huge volumes of scalable data, and a centralized system creates too much of a bottleneck when processing multiple files simultaneously. Google developed an algorithm to solve this issue, called Map Reduce [11, 12], which divides a task into small parts and assigns them to different nodes; after processing, the individual results are combined to give the integrated output. The Map Reduce algorithm consists of two important activities: Map and Reduce [5, 13]. Map converts the input data set into individual key-value pairs, and Reduce collects the output from each mapper and combines it. The most important benefit of Map Reduce is that it provides an easy mechanism to distribute data processing across many different computing nodes.
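
As a concrete illustration of the Map and Reduce activities described above, the following is a minimal word count sketch written against Hadoop's Java Map Reduce API (org.apache.hadoop.mapreduce). The input and output paths are taken from the command line and are placeholders; the sketch is a simplified example, not a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: turn each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted by the mappers for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}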

6 Apache Storm

Apache Storm is a framework that mainly focuses on low latency and is an efficient option for workloads that require real-time processing. It works on huge amounts of data and reduces latency in comparison with other frameworks. Storm supports real-time analytics, online machine learning, continuous computation and ETL; it is scalable, fault tolerant and guarantees that data is processed. Certain features make Storm a more powerful tool than Hadoop for such workloads: it is fault tolerant, scalable and fail-fast with an auto-restart approach, it supports multiple languages and JSON, and its topologies are directed acyclic graphs (DAGs).
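
To give a feel for how a Storm topology is wired as a DAG of spouts and bolts, here is a minimal sketch in Java using Storm 1.x-style core API names (org.apache.storm). TestWordSpout is a test spout bundled with Storm that emits random words; the bolt, the component names and the local-mode run are illustrative only, and a real deployment would use StormSubmitter instead of LocalCluster.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StormSketch {

  // A trivial bolt: reads the word emitted by the spout and re-emits it decorated.
  public static class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      collector.emit(new Values(input.getString(0) + "!!!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    // Wire the topology as a directed acyclic graph of a spout and a bolt.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new TestWordSpout());
    builder.setBolt("exclaim", new ExclaimBolt()).shuffleGrouping("words");

    // Run in-process for a few seconds, then shut down.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("demo-topology", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}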

7 Apache Samza

Apache Samza is a stream processing framework that is strongly associated with the Apache Kafka messaging system and is designed specifically to take advantage of Kafka's architecture. Like the other technologies, it uses a fault-tolerant mechanism for buffering and storage. For resource negotiation, it relies on YARN [14] and its rich feature set.
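
To hint at what a Samza job looks like at the code level, the following is a minimal sketch of a task written against Samza's low-level StreamTask interface in Java. The Kafka system name "kafka", the output topic "filtered-output" and the overall wiring are assumptions: in an actual job they would be declared in the job's configuration, where the input topic and the YARN-based deployment are also specified.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// A minimal Samza task: forwards non-empty messages from the configured input
// topic to an output Kafka topic. The system name and topic are placeholders.
public class FilterTask implements StreamTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-output");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String message = (String) envelope.getMessage();
    if (message != null && !message.isEmpty()) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
    }
  }
}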

8 Apache Spark

Apache Spark [15, 16] is a general purpose cluster computing system with a large number of libraries and APIs for programming languages such as R, Python, Scala and Java. Unlike Hadoop, it is very fast and efficient in processing and accessing data from storage, and it can be deployed with or without Hadoop. It focuses on quick execution of tasks by keeping batch processing workloads in memory. Spark can be run as a standalone cluster or used with Hadoop as an alternative to Map Reduce. The main components of Spark [17, 18] are the driver program, the cluster manager and the worker nodes: the driver program starts the execution of an application, the cluster manager allocates the resources, and the worker nodes do the processing. Properties that make Spark better than Hadoop for many workloads are its high speed, high performance and query optimization; it can run on any platform, has a large library set and supports data pipelining.
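
As a sketch of how Spark expresses the same word count as an in-memory transformation pipeline, the following uses the Java RDD API. The local master URL and the input path are placeholders; on a real cluster the master would be YARN or a standalone cluster manager and the input would typically be an HDFS path.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" runs Spark inside this JVM for demonstration purposes.
    SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {

      // Load the input (placeholder path; an HDFS path in a real deployment).
      JavaRDD<String> lines = sc.textFile(args.length > 0 ? args[0] : "input.txt");

      // Classic word count expressed as an in-memory pipeline of transformations.
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);

      // Print a small sample of the result to the driver console.
      counts.take(10).forEach(pair -> System.out.println(pair._1() + " -> " + pair._2()));
    }
  }
}

Because the intermediate results stay in memory rather than being written back to disk between stages, the same pipeline can be extended or re-run much faster than an equivalent chain of Map Reduce jobs.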

9 Apache Flink

Apache Flink is an open source platform; unlike many other frameworks, it has a streaming dataflow engine that also provides communication, fault tolerance and data distribution for distributed computations over data streams. This data analytics framework is fully compatible with Hadoop, and Flink can execute both stream processing and batch processing [19, 20] without any difficulty.

Because of Spark's micro-batch architecture, Spark is not suitable for some use cases, whereas Flink offers both batch and stream processing capabilities with low latency, high throughput and true real-time processing. The Kappa architecture forms the basis of how Flink works. The benefit of the Kappa architecture is that it has only a single processor, the stream processor, which treats all input as a stream; the streaming engine processes the incoming data in real time, and batch processing is treated as a special case of streaming. The diagram below gives the architecture of Flink (Fig. 4; Table 1).
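
To illustrate the stream-first model described above, here is a minimal streaming word count sketch in Java using Flink's DataStream API (roughly Flink 1.x). The socket source on localhost:9999 is only a convenient placeholder for a real stream such as a Kafka topic; each record flows through the pipeline as it arrives, with no micro-batching.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkStreamingWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Placeholder source: read lines from a local socket (e.g. started with `nc -lk 9999`).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    DataStream<Tuple2<String, Integer>> counts = lines
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              if (!word.isEmpty()) {
                out.collect(new Tuple2<>(word, 1));
              }
            }
          }
        })
        .keyBy(tuple -> tuple.f0)   // group the stream by word
        .sum(1);                    // maintain a running count per word

    counts.print();
    env.execute("flink-streaming-word-count");
  }
}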

Fig. 4 Apache Flink architecture

Table 1 Comparison of various tools

10 Conclusion

This paper compares various tools that can be used in big data analytics. According to the comparison chart given above, the Map Reduce technique is suitable only for batch processing, whereas Spark and Flink can work efficiently on batch processing as well as on streaming data. Fault tolerance is provided by all the techniques, but Map Reduce does not support in-memory processing or low latency. According to the survey of these technologies, Spark is the most efficient framework and can give efficient and accurate results.