
1 Introduction

There are several conventional corporate tools which can be used to process data efficiently. This brings about useful insights for every department, ranging from profit forecasting in sales to inventory management. The conventional architecture consists of a data warehouse, where the data typically resides, and Structured Query Language (SQL)-based Business Intelligence (BI) tools, which process and analyse this data [1]. To perform more advanced data analysis, companies have moved the data to dedicated servers where data and text mining, statistical analysis and predictive analytics are carried out. The criticality of business analysis and profit making has led to the wide use [2] of tools such as R [3] and Weka [4] for deploying machine learning algorithms and models. However, the evolution and growth of traditional data into what is nowadays commonly termed big data has rendered the conventional tools inadequate, owing to the major challenges resulting from the characteristics of big data.

Volume, variety and velocity, the three main characteristics of big data, have pushed it beyond the ingestion, storage, processing and analysis capabilities of conventional systems. Big data refers to very large data sets, containing both structured and unstructured data, that arrive at high speed from different sources. In addition, not all incoming data are valuable; some may be of poor quality and carry a certain level of uncertainty. The analysis conducted in the International Data Corporation’s annual Digital Universe study [5] estimated that data volume will reach around 44 zettabytes (4.4 × 10²² bytes) by the year 2020, approximately ten times larger than it was in 2013. This growth can be explained by the increasing adoption of social media (e.g. Facebook, Twitter), Internet of Things (IoT) sensors, click streams from websites, video camera recordings and smartphone communications.

Handling huge amounts of data requires equipment and skills. Many organizations already possess the equipment and expertise for managing and analysing structured data only. However, the fast flows and increasing volumes of data leave them unable to “mine” the data and extract actionable intelligence in a timely manner. In addition to the rapid growth in data volume, which traditional analytics cannot handle, the variety of data types and the speed at which data arrives require novel solutions for data processing and analytics [2]. Compared to the conventional structured data of organisations, big data does not always fit into neat tables of rows and columns. New structured and unstructured data types have come into existence and must be processed to yield business insights. Common machine learning toolkits such as Weka [4] or R [3] were not conceived for handling such workloads. Although distributed implementations of some of its algorithms exist, Weka does not match tools purpose-built for handling data at the terabyte scale. Therefore, traditional analytics tools are not well adapted to capture the full value of large data sets such as big data.

Machine learning allows computers to learn without being explicitly programmed; the programmes have the ability to change when exposed to different sets of data. Machine learning techniques are therefore well suited to capturing hidden insights from big data. The whole process is data-driven and executes at the machine layer by applying complex mathematical calculations to big data. Machine learning requires large data sets for learning and for discovering patterns in data. Commonly used machine learning algorithms are: Linear Regression, Logistic Regression, Decision Tree, Naive Bayes and Random Forest. Several tools have been developed for working with big data, for example Jaspersoft, Pentaho, Tableau and Karmasphere. These tools alleviate the problem of scalability by providing solutions for parallel storage and processing [6].

Businesses are increasingly adopting big data technologies to remain competitive and, given the advantages associated with analytics, the demand for data management is on the rise. Emerging big data application areas include Healthcare, Manufacturing, Transportation, Traffic management, Media and Sports, among others. Extensive research and development is also being carried out in the field of big data, analytics and machine learning. In [7], the authors highlight a scalable mining technique using the k-nearest neighbour (k-NN) classification approach to build training data from reduced data sets; the reduced data set ensures faster data classification and easier storage management. In [8], a methodology to clean raw data using the Fisher's Discrimination Criterion (FDC) is proposed; the objective was to ease the data flow across a social network such as Facebook and enhance user satisfaction using machine learning techniques. Extremely large data sets also arise in the biological sciences (e.g. DNA information), which adds to the difficulty of extracting meaningful information. A system to reduce similar information from large data sets was therefore proposed; the study analysed processing techniques such as NeuralSF and Bloom Filter applied to biological data [9]. Another biological study trains a neural network using nine parameters to categorise breast tumours as cancerous or non-cancerous [10]; this study applies machine learning techniques to a breast cancer dataset and classifies this type of cancer with an accuracy of 99.27%. Another example of neural network training is a performance comparison of NN-GA with NN, SVM and Random Forest classifiers for pixel image classification [11]. In the field of computing and intelligence, related works include cryptographic key management [12] and intelligent activities of daily living (ADLs) assistance systems [13].

Big data undoubtedly brings many benefits to businesses. However, there are many challenges associated with big data and related concepts. First of all, big data changes the whole concept of data storage and manipulation: businesses must now handle a constant inflow of data, which involves modifying organisations' data management strategies. Secondly, privacy issues are associated with big data since information is captured from many different sources; in the process, sensitive data may be processed without the knowledge of users. Thirdly, businesses have to invest in new technologies since the whole software environment is changing with big data. This implies substantial computing power for intensive data storage and processing, and new software or hardware to handle analytics and machine learning. Overall, big data brings forward a new approach to handling organisational data, and businesses must therefore properly balance the challenging aspects of big data.

1.1 Big Data Analytic Tools

An overview of some existing corporate-level big data platforms is given in the following sub-sections.

1.1.1 IBM Big Data Platform

The IBM solution is currently helping businesses achieve tangible benefits by analysing all of their available data. IBM enables rapid return on investment across several areas such as Financial Services, Transportation, Healthcare/Life Sciences, Telecommunications, Energy and Utilities, Digital Media, Retail and Law Enforcement [14]. A few challenges and benefits of IBM in three areas are summarised in Table 1.

Table 1 Big data challenges and benefits [14]

IBM gives enterprises the ability to manage all their data with a single integrated platform, multiple data engines and a consistent set of tools & utilities to operate on big data. IBM delivers its big data platform through a “Lego block” approach [15] and the platform has four core capabilities [16]:

  1. Analytics using Hadoop: It involves the analysis and processing of data in any format using clusters of commodity hardware.

  2. Stream Computing: It involves real-time analysis of huge volumes of data on the fly, with response times of the order of milliseconds.

  3. Data Warehousing: It involves advanced analytics on data stored in databases and the delivery of actionable insights.

  4. Information Integration and Governance: It involves the processes of understanding, transforming, cleansing, delivering and governing reliable information for critical business initiatives [16].

Supporting Platform Services include Visualization & Discovery, Application Development, Systems Management and Accelerators [16]:

  • Visualization & Discovery: It involves the exploration of complex and huge sets of data.

  • Application Development: It involves the development of applications pertaining to big data.

  • Systems Management: It involves securing big data systems and optimizing their performance.

  • Accelerators: These are analytic applications that accelerate the development and implementation of big data solutions (Fig. 1).

Fig. 1 IBM big data platform [17]

1.1.2 Hewlett-Packard Enterprise (HPE) Big Data Platform

HPE big data is based on an open architecture which improves existing business intelligence and analytics assets. The platform offers scalable and flexible deployment models, including on-premises, as-a-service, hosted, and private cloud solutions [18]. HPE offers comprehensive solutions to help an organisation bridge to tomorrow through a data-driven foundation. HPE big data software, solutions and services are highlighted in Table 2.

Table 2 Big data software, solutions and services [19]

HP Haven OnDemand enables exploration and analysis of massive volumes of data including unstructured information such as text, audio, and video. Figure 2 highlights the components of the Haven big data platform which integrates analytics engines, data connectors and community applications. The main objective of this platform is further analysis and visualisation of data.

Fig. 2 Haven big data platform [20]

1.1.3 SAP HANA Platform

SAP HANA Vora [21] is a distributed, in-memory computing solution which can help organisations process big data to obtain insights that benefit business growth. Organisations use it for the quick and easy running of interactive analytics on both Hadoop and enterprise data. The SAP HANA Vora single solution [21] for computation on all data supports data science, predictive analytics, business intelligence and visualisation applications. The capabilities of the SAP HANA Platform can be categorised under the following groups [22]:

  • Application Services: Making use of open standards for development, such as JavaScript, HTML5, SQL and JSON, to build applications on the web.

  • Database Services: Performing advanced analytics and transactions on different types of data stored in-memory.

  • Integration Services: Managing data collected from various sources and developing strategies to integrate all of them together.

According to [23], the SAP HANA Platform is most suitable for performing analytics in real time and for developing and deploying applications which process real-time information. The SAP HANA in-memory database is different from other database engines currently available on the market: it provides high-speed interactivity and fast response times for real-time applications such as sales and operational planning, cash forecasting, manufacturing scheduling optimisation, and the analysis of point-of-sale and social media data [23] (Fig. 3).

Fig. 3 SAP HANA platform [23]

1.1.4 Microsoft Azure

Microsoft Azure is an enterprise-level cloud platform which is flexible and open and allows for the fast development, deployment, and management of applications across Microsoft data centers [24]. Azure complements existing IT environments by providing services in the cloud. Azure big data and analytics services and products consist of [25]:

  • HDInsight: R Server, Hadoop, HBase, Storm, and Spark clusters.

  • Data Lake Analytics: A service for performing analytics in a distributed environment which eases the processing of big data.

  • Machine Learning: Very powerful tools on the cloud platform which allow predictive maintenance.

The important constituents of the Azure Services Platform are as follows [26]:

  • Windows Azure: Networking and computation, storage with scalability, management and hosting services.

  • Microsoft SQL Services: Services for storage and reporting using databases.

  • Microsoft.NET Services: Services for implementing .NET Frameworks.

  • Live Services: Services for distributing, maintaining and synchronizing files, photos, documents, and other information.

  • Microsoft Dynamics CRM Services and Microsoft SharePoint Services: Fast solution development for Business collaboration in the cloud (Fig. 4).

Fig. 4 Azure services platform [26]

World-class companies are doing notable things with Azure. In its promotional campaign, Heineken used the Azure cloud to target 10.5 million consumers [27]. Rolls-Royce and Microsoft collaborate to create new digital capabilities by focusing not on infrastructure management but on the delivery of real value to customers [27]. The Microsoft Cloud has helped Real Madrid C.F. reach over 450 million fans worldwide with its stadium experience [27]. NBC News chose Microsoft Azure to host push notifications to its mobile applications [27]. 3M accelerates mobile application development and benefits from real-time insight with the Azure cloud [27].

1.1.5 Oracle Big Data

Oracle provides the power of both Spark and Hadoop, ready to be integrated with the existing data of enterprises already using Oracle applications and databases. The service provided is highly performant, secure and fully automated. Oracle big data offers an integrated set of products to organise and analyse various data sources, find new insights and take advantage of hidden relationships. Oracle big data features include [28, 29]:

  • Provisioning and Elasticity: On demand Hadoop and Spark clusters

  • Comprehensive Software: big data Connectors, big data Spatial and Graph, Advanced Analytics

  • Management: Security, Automated life-cycle

  • Integration: Object Store Integration, big data SQL (Fig. 5).

Fig. 5 Oracle solution for big data [30]

According to [31], Oracle helps the Dubai Smart Government Establishment (DSG) provide smart government services in the region by proposing new strategies and overseeing processes at the level of government entities. Using Oracle big data, the public health service of England has developed a fraud and error detection solution by analysing hidden patterns in physician prescription records [31].

1.1.6 Other Big Data Platforms

Other big data platforms and big data analytics software include Jaspersoft BI Suite, Talend Open Studio, Dell Big Data Analytics, Pentaho Business Analytics, Red Hat big data, Splunk, Tableau, Karmasphere Studio and Analyst, and Skytree Server.

1.2 Machine Learning Tools for Big Data

Several open-source tools have emerged in recent years for handling massive amounts of data. In addition to managing the data, obtaining insights from it is of utmost importance. As such, machine learning techniques add value in all fields having huge amounts of data ready to be processed. In this chapter, two of these open-source tools, namely H2O [32] and MLlib [33], which facilitate the collaboration of data and computer scientists, are described in detail.

The framework for big data analytics with machine learning is shown in Fig. 6. The structured data from the big data source is imported into H2O, or into Spark on Databricks. The data frame obtained is then divided into a training set and a test set. The training set is used to train the machine learning model. Depending on the algorithm used, the model is then tested on the test set by predicting the labels from the features. The corresponding accuracy is finally computed; a minimal code sketch of this flow is given after Fig. 6.

Fig. 6 Framework for big data analytics with machine learning
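
The same flow can be expressed programmatically. The following is a minimal Scala sketch of the Spark side of this framework, not the exact code used in this chapter: the file path, the response column name “label” and the hyperparameter values are placeholders, and a SparkSession named spark (as provided on Databricks) is assumed.

  import org.apache.spark.ml.classification.RandomForestClassifier
  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

  // Import the structured big data source as a DataFrame (path is a placeholder).
  val df = spark.read.option("header", "true").option("inferSchema", "true")
    .csv("/data/mhealth/data1.csv")

  // Index the response column and assemble the remaining columns into a feature vector.
  val indexed = new StringIndexer().setInputCol("label").setOutputCol("labelIdx")
    .fit(df).transform(df)
  val featureCols = df.columns.filter(_ != "label")
  val assembled = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
    .transform(indexed)

  // Divide the data frame into a training set and a test set.
  val Array(train, test) = assembled.randomSplit(Array(0.7, 0.3), seed = 42L)

  // Train a model on the training set and predict the labels of the test set.
  val model = new RandomForestClassifier()
    .setLabelCol("labelIdx").setFeaturesCol("features")
    .fit(train)
  val predictions = model.transform(test)

  // Compute the corresponding accuracy on the test set.
  val accuracy = new MulticlassClassificationEvaluator()
    .setLabelCol("labelIdx").setPredictionCol("prediction").setMetricName("accuracy")
    .evaluate(predictions)
  println(s"Test accuracy = $accuracy")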

1.2.1 H2O

H2O is open-source software for big data analytics [32]. It was brought into existence by the start-up company H2O.ai, launched in 2011 in Silicon Valley [34]. H2O can fit several types of models as part of mining patterns in data. It is fast open-source software that supports parallel in-memory machine learning, predictive analytics and deep learning. It is based on pure Java and the Apache v2 open-source license, and provides simple deployment with a single jar and automatic cloud discovery. H2O allows data to be used without sampling and provides reliable predictions more quickly. The main aim of the H2O project is to provide a cloud computing interface for analytics, giving users the resources needed for data analytics. H2O is open source and has a free license; the profits of the company originate mainly from the services provided to customers and from tailored extensions. According to VentureBeat, in November 2014 H2O's clients included Nielsen, Cisco, eBay, and PayPal [35,36,37].

The chief executive of H2O, Sri Satish Ambati, had assisted in setting up a big-data firm called Platfora, which implements software for the Apache Hadoop distributed file system [38]. However, the R programming language did not provide satisfactory performance on large data sets, and Ambati therefore began developing the H2O software. John Chambers, the pioneer of the S language at Bell Labs and a member of R's core development team, supported the work [37, 39, 40].

0xdata was co-founded by Ambati and Cliff Click. Click held the post of Chief Technical Officer (CTO) of H2O and had a significant input into H2O's products. Click had helped write the HotSpot Server Compiler and had collaborated with Azul Systems on the implementation of a big-data Java virtual machine (JVM) [41]. In February 2016, Click left H2O [42]. The author of Grammar of Graphics, Leland Wilkinson, serves as Chief Scientist and provides visualization leadership [32].

Three professors at Stanford University [43] are listed as mathematical scientists on H2O's Scientific Advisory Council. One of them is Professor Robert Tibshirani, who partnered with Bradley Efron on bootstrapping [44] and who also specialises in generalized additive models and statistical learning theory [45, 46]. The second is Professor Stephen P. Boyd, who specialises in convex minimization and its applications in statistics and electrical engineering [47]. The third is Professor Trevor Hastie, a collaborator of John Chambers on S [40] and also an expert on generalized additive models and statistical learning theory [45, 46, 48].

1.2.1.1 H2O Architecture and Features

The main features of the H2O software are given in Table 3 [34].

Table 3 Features of the H2O software [34]

The functionalities for Deep Learning using H2O are as follows [35]:

  • Undertaking of classification and regression tasks using supervised training models.

  • Operations in Java using a refined Map/Reduce model, which increases efficiency in terms of speed and memory.

  • Use of a multi-node cluster, or of a single node, running distributed and multi-threaded computations simultaneously.

  • Availability of options to set annealing, momentum, and the learning rate. In addition, a fully adaptive and automatic per-neuron learning rate is available to achieve convergence as quickly as possible.

  • Availability of model averaging, and of L1 and L2 loss functions and regularization to avoid overfitting of the model.

  • Possibility of specifying the model from a user-friendly web-based interface as well as from R scripts equipped with the H2O package.

  • Availability of checkpoints to help lower run times and tune the models.

  • Automatic discarding of missing values in numerical and categorical data during both the pre-processing and post-processing stages.

  • An automatic feature which balances computation and communication over the cluster to achieve the best performance. Additionally, the designed model can be exported as Java code for use in production environments.

  • Availability of additional features and parameters for model adjustment, as well as autoencoders for unsupervised learning models and anomaly detection.

A typical architecture of the H2O software stack is shown in Fig. 7 [49].

Fig. 7 H2O software architecture [49]

A socket connection is used by all the REST API clients to communicate with H2O. The embedded H2O Web UI is written in JavaScript and uses the standard REST API. R scripts can use the H2O R package (“library(h2o)”); users can write their own R functions to run on H2O instead of using ‘apply’ or ‘ddply’. At present, Python scripts must use the REST API directly. An H2O worksheet is also available for Microsoft Excel, which enables importing large volumes of data into H2O and running algorithms such as GLM directly from Excel. Visualization in Tableau is also possible by pulling results directly from H2O.

An H2O cloud is made up of one or more nodes. Each node is a single JVM process, and this process is divided into three layers: language, algorithms, and core infrastructure.

The language layer consists of an expression evaluation engine for R and the Shalala Scala layer. The R REST client front-end acts as the master to the R evaluation layer. The Scala layer differs from the R layer in that algorithms and native programs using H2O can be written in it directly. The algorithms layer contains the algorithms supplied with H2O by default [49]; for example, datasets are imported using the parse algorithm, and the layer also includes GLM and algorithms for predictive analytics and model evaluation. The bottom (core) layer manages memory and CPU resources.

The elementary unit of data storage which the users are exposed to is an H2O Data Frame. The engineering term referring to the capability of adding, removing, and updating columns fluidly in a frame is “Fluid Vector”. It is the column-compressed store implementation.

The distributed Key-Value store is essentially an atomic, distributed in-memory store spread across the cluster; a non-blocking hash map is used in its implementation. CPU management is based on three main parts. First, there are jobs, which are large pieces of work with progress bars that can be monitored through the web user interface; model creation is one example of a job. Second, there is the MapReduce Task (MRTask), an H2O in-memory MapReduce task rather than a Hadoop MapReduce task. Finally, there is Fork/Join, a modified version of the JSR166y lightweight task execution framework.

H2O can support billions of data rows in memory even on a relatively small cluster, thanks to sophisticated in-memory compression techniques. The H2O platform has its own built-in Flow web interface, which makes analytic workflows accessible to users without an engineering background, and it also includes interfaces for Java, Python, R, Scala, JSON and CoffeeScript/JavaScript. The platform was built alongside and on top of both Hadoop and Spark clusters, and deployment is typically done within minutes [35, 36, 43]. The software layers involved when an R program starts a GLM on H2O are shown in Fig. 8.

Fig. 8 R script starting H2O GLM

Figure 8 shows two different sections. The section on the left depicts the steps which run in the R process and the section on the right depicts the steps which run in the H2O cloud. The TCP/IP network code allows the two processes to communicate with each other. The arrows with solid lines show a request originating from R to H2O and the arrows with dashed lines show the corresponding response from H2O to R. The different components of the R program are: the R script, the R package it uses, the RCurl and rjson packages, and the R core runtime. Figure 9 shows the R program retrieving the resulting GLM model [49].

Fig. 9 R Script retrieving the resulting GLM model

The H2OParsedData class is an S4 class used to represent an H2O data frame in R. The big data object residing in H2O is referenced by the @key slot of the S4 object. The R package uses the H2OParsedData class to perform operations such as summarisation and merging. These operations are transmitted to the H2O cloud over an HTTP connection, the operation is performed on the H2O cloud platform, and the result is returned as a reference which is in turn stored in another S4 object in R.

1.2.1.2 H2O Algorithms

Several common data analysis and machine learning algorithms are supported by H2O. These algorithms can be classified as per Table 4 [36, 49].

Table 4 Overview of H2O algorithms [49]

1.2.1.3 H2O Deployment and Application Example

H2O can be launched by opening a command prompt terminal, moving to the location of the jar file, and executing the command: java -jar h2o.jar, as shown in Fig. 10.

Fig. 10 Launching of H2O through the command prompt

To access the web interface, type the URL https://localhost:54321/flow/index.html into the web browser. The start-up page shows a list of the functionalities supported in H2O; this list can also be viewed by using the assist option, as depicted in Fig. 11.

Fig. 11 Assistance list for operations available in H2O

In this section, the Mobile Health (MHEALTH) dataset from [50,51,52] has been used to demonstrate classification machine learning algorithms with H2O.

The dataset consists of recordings of body motions for volunteers while performing different physical activities. The twelve activities recorded are:

L1: Standing still
L2: Sitting and Relaxing
L3: Lying down
L4: Walking
L5: Climbing stairs
L6: Waist bends forward
L7: Frontal elevation of arms
L8: Knees bending (crouching)
L9: Cycling
L10: Jogging
L11: Running
L12: Jump front and back.

In addition to these known classes, during which experimentation is performed, there is also a Null class with a value of ‘0’ for data recorded while none of the above activities is being performed. The feature values are obtained from sensors: chest acceleration, an electrocardiogram sensor, and accelerometer, gyroscope and magnetometer sensors. The accelerometer, gyroscope, and magnetometer sensors are placed on the arms and ankles of the subjects.

The common dataset to be used with the classification algorithms can be imported using the importFiles option in the list on the homepage of H2O as depicted in Fig. 12.

Fig. 12 Import data set on the cluster to be processed in H2O

As can be observed from Fig. 12, two csv-files are loaded in H2O. The file ‘data1.csv’ is used for training of the different models and ‘data2.csv’ is used for testing.

  • Illustration 1: Random Forest Classification in H2O

The model is then built using the training data set as input. The Distributed Random Forest (DRF) is used with the “label” column selected as the response column. These are shown in Fig. 13. The equation being modelled is: label ~ chestAccX + chestAccY + ….

Fig. 13 Building a distributed random forest model

The output metrics for the training model are shown in Fig. 14.

Fig. 14 Output metrics for training model

With the training model obtained, prediction can be performed on the test set data. A comparison can then be performed on the predicted and already known “label”. Figure 15 shows the step where the model is used to predict the “label” for the test set.

Fig. 15 Prediction using training model on test data set

The predicted “label” can be merged with the test data set and compared with that already known. Figure 16 shows the confusion matrix for the test data set.

Fig. 16 Confusion matrix for test data set

An analysis of the percentage difference between the predicted and already known “label” for the test data set can be performed as shown in Table 5.

Table 5 Analysis of percentage difference between predicted and already known “label”

  • Illustration 2: Naïve Bayes Classification in H2O

The same principle as with the Random Forest model is applied to the Naïve Bayes model. The “label” column is selected as the response column, as shown in Fig. 17. The output metrics for the training model are shown in Fig. 18.

Fig. 17 Building a Naïve Bayes model

Fig. 18 Output metrics for training model

Figure 19 shows the step where the model is used to predict the “label” for the test set.

Fig. 19 Prediction using training model on test data set

The predicted “label” can be merged with the test data set and compared with that already known. Figure 20 shows the confusion matrix for the test data set.

Fig. 20 Confusion matrix for test data set

An analysis of the percentage difference between the predicted and already known “label” for the test data set can be performed as shown in Table 6.

Table 6 Analysis of percentage difference between predicted and already known “label”

  • Illustration 3: Deep Learning Classification in H2O

The same principle as with the Random Forest and Naïve Bayes models is applied to the Deep Learning model. The “label” column is selected as the response column, as shown in Fig. 21. The output metrics for the training model are shown in Fig. 22.

Fig. 21 Building a deep learning model

Fig. 22 Output metrics for training model

Figure 23 shows the step where the model is used to predict the “label” for the test set.

Fig. 23 Prediction using training model on test data set

The predicted “label” can be merged with the test data set and compared with that already known. Figure 24 shows the confusion matrix for the test data set.

Fig. 24 Confusion matrix for test data set

An analysis of the percentage difference between the predicted and already known “label” for the test data set can be performed as shown in Table 7.

Table 7 Analysis of percentage difference between predicted and already known “label”

1.2.2 MLlib

Spark MLlib is an open-source machine learning library that is part of the Apache Spark project of the Apache Software Foundation. Spark MLlib and DataFrames provide several tools that ease the integration of existing workflows, developed in tools such as R and Python, with Spark. For example, users can use the R syntax they are familiar with to call the appropriate MLlib algorithms [53].

The UC Berkeley AMPLab pioneered Spark, which was made open source in 2010. The major aim behind the design of Spark is efficient iterative computing, and machine learning algorithms were packaged with the early releases of Spark. With the creation of MLlib, the lack of scalable and robust learning algorithms was overcome. MLlib was developed in 2012 as part of the MLbase project [54] and was made open source in September 2013. When MLlib was first launched, it was incorporated in Spark's version 0.8. Spark and MLlib are packaged as an open-source Apache project under the Apache 2.0 license. Moreover, as of Spark's version 1.0, Spark and MLlib have been released on a three-month cycle [5].

Eleven contributors helped develop the initial version of MLlib at UC Berkeley. The original version provided only a limited group of common machine learning algorithms. Since the initiation of MLlib, the number of contributors to the project has grown rapidly: by the Spark 1.4 release, it had grown to more than 140 contributors from more than 50 organizations, in a span of less than two years. Figure 25 depicts the number of MLlib contributors against the release versions. The adoption rate of MLlib motivated the implementation of several other functionalities [55].

Fig. 25 Growth in MLlib contributors [55]

1.2.2.1 Architecture and Features

Several components within the Spark ecosystem are beneficial for MLlib. Figure 26 illustrates the Spark ecosystem.

Fig. 26 The Spark ecosystem

As observed in Fig. 26, Spark supports programming interfaces such as R, SQL, Python, Scala and Java. Additionally, it has several libraries which provide functionalities such as graph computation, stream data processing, and real-time interactive query processing in addition to machine learning [53].

A general execution engine for data transformation is provided by Spark Core at the lowest level. These transformations involve more than 80 operators for performing feature extraction and data cleaning. Other high-level libraries are included in Spark and leveraged by MLlib. Data integration functionality is provided by Spark SQL, which aids in simplifying data preprocessing and cleaning. Spark SQL also provides the fundamental DataFrame abstraction used by the spark.ml package.

Large-scale graph processing is supported by GraphX [56], which also supports the implementation of learning algorithms that can be modelled as large, sparse graph problems, e.g. LDA [57, 58]. Additionally, Spark Streaming [59] provides users with the ability to handle and process live data streams, thereby enabling the development of online learning algorithms, as in Freeman [60]. Performance enhancements in Spark Core and these high-level libraries also translate into improvements in MLlib.

Efficient distributed learning and prediction are supported by several optimizations included in MLlib. The ALS algorithm for recommendations makes careful use of blocking in order to reduce JVM garbage collection overhead and to leverage higher-level linear algebra operations. Decision trees borrow ideas from the PLANET project [61], such as data-dependent feature discretization to reduce communication costs, and parallelized learning both within and across trees when building ensembles. Generalized linear models are learned via optimization algorithms which parallelize gradient computation, using C++-based linear algebra libraries for worker computations. Efficient communication primitives are very beneficial for many algorithms: tree-structured aggregation prevents the driver from becoming a bottleneck, and Spark broadcasts quickly distribute large models to workers.
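
As an illustration of the tree-structured aggregation mentioned above, the following toy Scala sketch (not MLlib's internal code) sums a per-record statistic with a depth-2 aggregation tree, so partial results are combined on the executors rather than all being sent to the driver; an existing SparkContext named sc is assumed.

  // Toy statistic: sum of squared values across an RDD of Doubles.
  // treeAggregate combines partial sums in a tree of depth 2, preventing the
  // driver from becoming a bottleneck (a pattern MLlib uses when aggregating
  // gradients for generalized linear models).
  val values: org.apache.spark.rdd.RDD[Double] = sc.parallelize(1 to 1000000).map(_.toDouble)

  val sumOfSquares = values.treeAggregate(0.0)(
    (acc, v) => acc + v * v,   // fold a record into a partition-local accumulator
    (a, b) => a + b,           // merge accumulators from different partitions
    depth = 2)                 // two-level aggregation tree

  println(s"Sum of squares = $sumOfSquares")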

Performing machine learning in practice involves a set of stages executed in an appropriate sequence: pre-processing of data, extraction of features, fitting of models, and validation. Most machine learning libraries do not provide native support for the full set of functionalities needed to construct such a pipeline, and putting together an end-to-end pipeline has severe implications in terms of cost and labour, particularly the network overhead incurred when dealing with large-scale datasets. A package was added to MLlib with the aim of addressing these concerns, inspired by previous work [62,63,64,65] and leveraging Spark's rich ecosystem. The spark.ml package supports the implementation and tuning of cascaded learning pipelines via a standard set of high-level APIs [66].

To ease the combination of several algorithms in a single workflow, MLlib standardizes the APIs for machine learning algorithms. The pipeline concept was brought forward by the scikit-learn project. The key concepts introduced by the Pipelines API are DataFrame, Transformer, Estimator, Pipeline and Parameter. These are discussed next.

  • DataFrame:

DataFrame from Spark SQL is used by the ML API as the ML dataset. The dataset can contain several data types, and a DataFrame can consist of several columns comprising feature vectors, text, labels, and predicted labels.

  • Transformer:

A Transformer is an algorithm with the capability of acting upon a DataFrame and converting it to another DataFrame. A machine learning model, for example, is a Transformer which converts a DataFrame consisting of features into predictions.

  • Estimator:

An Estimator is an algorithm whose role is to learn from the training DataFrame and produce a model; fitting an Estimator on a DataFrame produces a Transformer.

  • Pipeline:

The role of a Pipeline is to chain several Estimators and Transformers together to specify a workflow.

  • Parameter:

The parameters of all Transformers and Estimators are specified through a common API [67]. In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a simple text document processing workflow might include the following stages:

  • Splitting of each document’s text into words.

  • Converting each word into a numerical feature vector.

  • Learning of a prediction model by making use of the feature vectors and labels.

This type of workflow is represented as a Pipeline by MLlib. The Pipeline consists of a sequence of Estimators and Transformers to be executed in a predefined and specific order. A transformation operation is performed on the input DataFrame as it flows through each stage of the Pipeline. The transform() method is called on the DataFrame at the Transformer stages while the fit() method is called on the DataFrame for the Estimator stages. Consider for example the simple text document workflow. Figure 27 illustrates the training time usage of a Pipeline [67].

Fig. 27 Training time usage of a pipeline

A Pipeline with three stages is depicted in Fig. 27. The first and second stages (Tokenizer and HashingTF) are Transformers and the third stage (LogisticRegression) is an Estimator. The data flow through the Pipeline is represented in Fig. 27, where the cylinders indicate DataFrames. The Pipeline.fit() method is applied to the original DataFrame comprising the raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new words column to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with these vectors to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline calls LogisticRegression.fit() to produce the machine learning model. If the Pipeline had further stages, it would call the model's transform() method on the DataFrame before passing it to the next stage [67]. A Pipeline is itself an Estimator: running the Pipeline's fit() method produces a PipelineModel, which is a Transformer. Figure 28 depicts the PipelineModel used for the test phase.
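
A minimal Scala version of this training-time flow is sketched below. The tiny in-line training DataFrame is invented purely for illustration, and a SparkSession named spark is assumed.

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  // Toy training data: (id, text, label).
  val training = spark.createDataFrame(Seq(
    (0L, "spark mllib pipeline", 1.0),
    (1L, "hadoop map reduce", 0.0),
    (2L, "spark machine learning", 1.0),
    (3L, "relational database table", 0.0)
  )).toDF("id", "text", "label")

  // Stage 1 (Transformer): split raw text into words.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  // Stage 2 (Transformer): hash words into feature vectors.
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  // Stage 3 (Estimator): learn a logistic regression model from features and labels.
  val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

  // Chain the three stages and fit the whole Pipeline on the training DataFrame.
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  val model = pipeline.fit(training)   // model is a PipelineModel (a Transformer)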

Fig. 28 Pipeline transformer

Figure 28 shows the PipelineModel, which has the same number of stages as the original Pipeline, except that all the Estimators have become Transformers. When the model's transform() method is executed on the test dataset, the dataset is updated at each stage before being passed to the next. Pipelines and PipelineModels help ensure that the training and test data undergo the same feature processing steps [67].
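
At test time, the fitted PipelineModel (the value model from the sketch above) can be applied to new documents; the test rows are again invented for illustration.

  // Toy test data without labels.
  val test = spark.createDataFrame(Seq(
    (4L, "spark pipeline example"),
    (5L, "sql table join")
  )).toDF("id", "text")

  // The PipelineModel re-applies the same tokenization and hashing before
  // scoring, so the test data goes through exactly the same feature processing.
  model.transform(test)
    .select("id", "text", "probability", "prediction")
    .show(false)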

1.2.2.2 MLlib Algorithms

The different categories of algorithms supported by MLlib are as follows:

  A. Extracting, transforming and selecting features:

The algorithms for working with features are roughly divided into the following groups [68]:

  • Extraction: Extracting features from “raw” data.

  • Transformation: Scaling, converting, or modifying features.

  • Selection: Selecting a subset from a larger set of features.

  • Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
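
As a small example from the transformation group, the sketch below standardises a vector column with StandardScaler, an Estimator that fits scaling statistics and returns a Transformer. The DataFrame name assembled and its "features" column are assumptions (for example, the output of a VectorAssembler as in the framework sketch given earlier).

  import org.apache.spark.ml.feature.StandardScaler

  // Fit scaling statistics on the data, then transform it: each feature is
  // divided by its standard deviation (mean-centring is left off, which keeps
  // sparse vectors sparse).
  val scaler = new StandardScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")
    .setWithStd(true)
    .setWithMean(false)

  val scaledDf = scaler.fit(assembled).transform(assembled)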

  B. Classification and regression:

The classification and regression algorithms are given as follows [69]:

Classification:

  • Naive Bayes

  • Decision tree classifier

  • Random forest classifier

  • One-vs-rest classifier

  • Multilayer perceptron classifier

  • Gradient-boosted tree classifier.

Regression:

  • Isotonic regression

  • Logistic regression

    • Binomial logistic regression

    • Multinomial logistic regression

  • Linear regression

  • Generalized linear regression

  • Survival regression

  • Decision tree regression

  • Random forest regression

  • Gradient-boosted tree regression.

The Clustering algorithms are given as follows [70]:

  • K-means

  • Bisecting k-means

  • Latent Dirichlet allocation (LDA)

  • Gaussian Mixture Model (GMM).

MLlib also supports collaborative filtering [71] and ML tuning: model selection and hyperparameter tuning [72].
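
For model selection, a Pipeline such as the text-classification sketch in the previous sub-section can be wrapped in a CrossValidator together with a parameter grid. The sketch below is illustrative only and reuses the hypothetical pipeline, hashingTF, lr and training values defined there.

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Grid of hyperparameters to explore: feature dimensionality and regularisation.
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1000, 10000))
    .addGrid(lr.regParam, Array(0.1, 0.01))
    .build()

  // 3-fold cross-validation over the grid; the best model is refit on all the data.
  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  val cvModel = cv.fit(training)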

1.2.2.3 MLlib Deployment and Application Example

MLlib can be used locally in a Java programming IDE such as Eclipse by importing the corresponding jar file into the workspace. In this work, the environment available in the online Databricks Community Edition has been used [53]. Users simply need to sign up and use the platform.

In this section, the Mobile Health (MHEALTH) dataset from [50,51,52] has been used to demonstrate classification machine learning algorithms with MLlib.

The classification algorithms used in this section are: Random Forest, Decision Tree, and Multinomial Logistic Regression [36]. Naïve Bayes is not used in this case because the current version does not support negative values for input features. The sensors in the case of the MHEALTH dataset [51, 52, 73] output both positive and negative values. The common dataset to be used with the classification algorithms can be imported using the Create Table option on the Databricks homepage as depicted in Fig. 29.

Fig. 29 Import data set on the cluster to be processed in Databricks

Given that the Community Edition is being used, the resources provided are limited and do not allow modelling with a data set as large as the one used with H2O on the local machine. Therefore, part of the MHEALTH dataset is uploaded to Databricks to demonstrate the modelling with the different classification algorithms.

The MLlib algorithms used in this section require the input data to be in the form of a Resilient Distributed Dataset (RDD) of LabeledPoints, so the data must be read in a very specific way before the modelling can proceed. Figure 30 shows the necessary library imports and the Scala code to read data from the csv file as an RDD. The header row is located and removed so that the RDD contains only numerical data for further processing and the application of machine learning algorithms, as shown in Fig. 31. The data is then parsed to obtain the label and feature vectors (an RDD of LabeledPoints), as shown in Fig. 32. Finally, the data is split into the training and test sets to be used for each classification model; the training and test sets receive 60 and 40% of the dataset respectively, allocated by random shuffling, as shown in Fig. 33.

Fig. 30 Import the libraries

Fig. 31 Filter header from RDD

Fig. 32 Parse RDD to obtain RDD of LabeledPoints with label and feature vector

Fig. 33 Split into training and test sets

This process is identical for all the algorithms shown in the following sub-sections.
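
The data-preparation steps of Figs. 30-33 follow the pattern sketched below. The file path and the position of the label column are assumptions about the uploaded csv file, and sc is the SparkContext provided by Databricks.

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Read the uploaded csv file as an RDD of text lines (path is hypothetical).
  val raw = sc.textFile("/FileStore/tables/mhealth_part.csv")

  // Locate and filter out the header so only numerical rows remain.
  val header = raw.first()
  val rows = raw.filter(_ != header)

  // Parse each row into a LabeledPoint, assuming the activity label is the
  // last column and all preceding columns are sensor features.
  val data = rows.map { line =>
    val values = line.split(',').map(_.trim.toDouble)
    LabeledPoint(values.last, Vectors.dense(values.dropRight(1)))
  }

  // Randomly shuffle the data into 60% training and 40% test sets.
  val Array(trainingData, testData) = data.randomSplit(Array(0.6, 0.4), seed = 12345L)
  trainingData.cache(); testData.cache()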

  • Illustration 1: Random Forest Classification with MLlib and Spark on Databricks

The model is then built using the training data set as input as shown in Fig. 34.

Fig. 34 Training of random forest model

With the training model obtained, prediction can be performed on the test set data. A comparison can then be performed on the predicted and already known “label”. Figure 35 shows the step where the model is used to predict the “label” for the test set and compute the test error. The computation of the mean squared error is shown in Fig. 36.

Fig. 35 Prediction using training model on the test data set

Fig. 36 Computation of the mean squared error

The confusion matrix can also be obtained as shown in Fig. 37.

Fig. 37 Confusion matrix

The overall statistics can be computed as shown in Fig. 38.

Fig. 38 Overall statistics

The precision by label can be obtained as shown in Fig. 39.

Fig. 39 Precision by label

The recall by label can be obtained as shown in Fig. 40.

Fig. 40 Recall by label

The false positive rate by label can be obtained as shown in Fig. 41.

Fig. 41 False positive rate by label
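
The steps illustrated in Figs. 34-41 correspond to the following Scala sketch. The hyperparameter values are illustrative choices rather than the settings used for the reported results, and trainingData and testData are the RDDs from the preparation sketch above.

  import org.apache.spark.mllib.evaluation.MulticlassMetrics
  import org.apache.spark.mllib.tree.RandomForest

  // Train a random forest classifier: 13 classes (activities L1-L12 plus the
  // Null class), no categorical features, illustrative tree settings.
  val numClasses = 13
  val categoricalFeaturesInfo = Map[Int, Int]()
  val rfModel = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
    numTrees = 20, featureSubsetStrategy = "auto", impurity = "gini",
    maxDepth = 10, maxBins = 32)

  // Predict the label for every test point and pair it with the known label.
  val predictionAndLabels = testData.map(p => (rfModel.predict(p.features), p.label))

  // Test error and mean squared error.
  val testErr = predictionAndLabels.filter { case (pred, lab) => pred != lab }.count.toDouble / testData.count()
  val mse = predictionAndLabels.map { case (pred, lab) => math.pow(pred - lab, 2) }.mean()

  // Confusion matrix, overall accuracy and per-label metrics.
  val metrics = new MulticlassMetrics(predictionAndLabels)
  println(metrics.confusionMatrix)
  println(s"Accuracy = ${metrics.accuracy}")
  metrics.labels.foreach { l =>
    println(s"Label $l: precision = ${metrics.precision(l)}, " +
      s"recall = ${metrics.recall(l)}, FPR = ${metrics.falsePositiveRate(l)}")
  }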

  • Illustration 2: Decision Tree Classification with MLlib and Spark on Databricks

The model is built using the training data set as input as shown in Fig. 42.

Fig. 42 Training of decision tree model

With the training model obtained, prediction can be performed on the test set data. A comparison can then be performed on the predicted and already known “label”. Figure 43 shows the step where the model is used to predict the “label” for the test set and compute the test error. The computation of the mean squared error is shown in Fig. 44.

Fig. 43 Prediction using training model on the test data set

Fig. 44 Computation of the mean squared error

The confusion matrix can also be obtained as shown in Fig. 45.

Fig. 45 Confusion matrix

The overall statistics can be computed as shown in Fig. 46.

Fig. 46 Overall statistics

The precision by label can be obtained as shown in Fig. 47.

Fig. 47 Precision by label

The recall by label can be obtained as shown in Fig. 48.

Fig. 48 Recall by label

The false positive rate by label can be obtained as shown in Fig. 49.

Fig. 49 False positive rate by label
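
Illustration 2 differs from the random forest example only in the model-training call; a corresponding sketch, again with illustrative hyperparameters, is given below.

  import org.apache.spark.mllib.tree.DecisionTree

  // Train a single decision tree classifier on the same prepared training data.
  val dtModel = DecisionTree.trainClassifier(trainingData, numClasses = 13,
    categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
    maxDepth = 10, maxBins = 32)

  // Evaluation proceeds exactly as in the random forest sketch, e.g.:
  val dtPredictionAndLabels = testData.map(p => (dtModel.predict(p.features), p.label))
  val dtMetrics = new org.apache.spark.mllib.evaluation.MulticlassMetrics(dtPredictionAndLabels)
  println(dtMetrics.confusionMatrix)
  println(s"Accuracy = ${dtMetrics.accuracy}")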

  • Illustration 3: Multinomial Logistic Regression Classification with MLlib and Spark on Databricks

The model is built using the training data set as input as shown in Fig. 50. With the training model obtained, prediction can be performed on the test set data. A comparison can then be performed on the predicted and already known “label”.

Fig. 50 Training of multinomial logistic regression model

The confusion matrix can also be obtained as shown in Fig. 51.

Fig. 51 Confusion matrix

The overall statistics can be computed as shown in Fig. 52.

Fig. 52 Overall statistics

The precision by label can be obtained as shown in Fig. 53.

Fig. 53 Precision by label

The recall by label can be obtained as shown in Fig. 54.

Fig. 54 Recall by label

The false positive rate by label can be obtained as shown in Fig. 55.

Fig. 55 False positive rate by label

The F-measure by label can be computed as shown in Fig. 56.

Fig. 56 F-measure by label

The weighted statistics can be computed as shown in Fig. 57.

Fig. 57 Weighted statistics
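
A sketch of the multinomial logistic regression illustration, including the per-label F-measure and the weighted statistics of Figs. 56 and 57, is given below; the same assumed trainingData and testData are reused.

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.evaluation.MulticlassMetrics

  // Train a multinomial logistic regression model with 13 output classes.
  val lrModel = new LogisticRegressionWithLBFGS()
    .setNumClasses(13)
    .run(trainingData)

  val lrPredictionAndLabels = testData.map(p => (lrModel.predict(p.features), p.label))
  val lrMetrics = new MulticlassMetrics(lrPredictionAndLabels)

  // Confusion matrix, per-label F-measure and weighted statistics.
  println(lrMetrics.confusionMatrix)
  lrMetrics.labels.foreach(l => println(s"F-measure($l) = ${lrMetrics.fMeasure(l)}"))
  println(s"Weighted precision = ${lrMetrics.weightedPrecision}")
  println(s"Weighted recall    = ${lrMetrics.weightedRecall}")
  println(s"Weighted F-measure = ${lrMetrics.weightedFMeasure}")
  println(s"Weighted FPR       = ${lrMetrics.weightedFalsePositiveRate}")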

1.2.3 Comparative Analysis of Machine Learning Algorithms

The machine learning techniques used in this work with the two open-source tools are: Naïve Bayes, Random Forest, Deep Learning, Decision Tree, and Multinomial Logistic Regression. Their descriptions, advantages, and limitations are given in Table 8.

Table 8 Description, advantages, and limitations of machine learning algorithms used

1.3 Conclusions and Discussions

The aim of this chapter was to present the various analytic tools which can be used for processing big data, and the machine learning tools for analysing and obtaining insights from big data. Essentially, the use of the open-source tools H2O and Spark MLlib has been illustrated by developing classification-based machine learning models for M-Health data. Random Forest, Naïve Bayes, and Deep Learning algorithms have been applied to the data in H2O, while Random Forest, Decision Tree, and Multinomial Logistic Regression classification algorithms have been applied to the data in Spark MLlib. The main contribution of this chapter is that the illustrations demonstrate the workflows for developing, training, and testing models in order to obtain insights from M-Health data using open-source tools built for handling big data.

H2O, being configured locally, allows the use of a larger dataset than MLlib on Databricks, since the online Community Edition of the latter has been used. The M-Health dataset used with H2O has 982,273 data rows used for model training and 233,472 data rows used for model testing. Based on these data sets, the accuracies of the Random Forest, Naïve Bayes, and Deep Learning models are 75.99, 75.47, and 75.81% respectively.

The MHEALTH data set used with MLlib on Databricks consists of 233,472 data rows, out of which 70% is used for model training and the remainder for model testing. Based on these data sets, the accuracies of the Random Forest, Decision Tree, and Multinomial Logistic Regression models are 97.12, 97.28, and 75.98% respectively.