Keywords

1 Introduction

Modern processes are frequently monitored using information systems that record large amounts of information given rise to Big Data bases. For a decision making environment, better mining the available data is a challenge. It is the major issue for the current data mining and machine learning algorithms. Regardless the Big Data 3V (variety, velocity and volume), data mining requires powerful knowledge discovery techniques that are able to deal with the related challenges such complexity and learning cost. Thus, the use of evolutionary data mining procedures becomes a hot trend. Thanks to their global search in the solution space, evolutionary computation techniques help in the information retrieval from a voluminous pool of data in a better way compared to traditional retrieval techniques [10].

Genetic Programming (GP) [21], as a particular EDMA, is considered as a universal machine learning tool. Its paradigm has shown a great potential when applied to supervised learning, especially classification and regression problems [3, 32]. However, like other machine learning techniques, GP computational overhead increases with large datasets and its efficiency is affected. Some improvements are needed to scale GP when learning from big datasets. The adaptation of EDMA, specially GP, for big data problems may require redesigning the algorithms and/or their inclusion in parallel environments. Both tracks are discussed in this paper.

An efficient strategy to redesign a machine learning technique to handle large data sets is to introduce a special learning paradigm or a special parallelization paradigm without parallel hardware. In the first case, the algorithm of the data mining method is modified in order to improve its efficiency. With the second case, a parallel data-intensive computing scenario is introduced to the method. This strategy has become very popular in the last few years thanks to the appearance of some open-source tools such as Hadoop/MapReduce.

As discussed in [25], learning paradigms could bring some solutions for big data mining. For example, Ensemble Learning can help alleviate the “Concept Drift” and “Curse of Modularity” issues associated with Big Data. Similarly, Local Learning can help alleviate “Data locality” and “Variance and Bias” issues. In this work, we are interested in the Active Learning paradigm. Active learning is implemented essentially with active data sampling. In previous works [15, 17], we demonstrated how active sampling could be an efficient solution to learn from large data sets by decreasing the size of the training set. The idea is to introduce some components in the algorithm that allows it to select its training data according to the evolution of the learning process. Section 4 explains in details how an active sampling can be introduced in the GP engine.

The second strategy to scale an EA (Evolutionary Algorithm) to Big Data Mining is the parallelization in a distributed environment. Vertical parallelization (scaling up) paradigm was widely explored. It includes multicore CPUs, supercomputers, hardware acceleration including graphic processing units (GPUs) and field-programmable gate arrays (FPGAs) [25]. In this work, we are interested in the horizontal parallelization (scaling out) paradigm where processing is dispersed over networked nodes. This paradigm do not require detailed knowledge of the underlying hardware architecture. The most known framework implementing the horizontal parallelization is the MapReduce paradigm and its distributed file system [11], originally introduced by Google. However, MapReduce encounters difficulties when dealing with iterative algorithms. Several new frameworks have been proposed to provide better support for iterative processing such as HaLoop [6] or Apache SparkFootnote 1. It is this last framework that we propose to scale up GP to Big Data Mining with processing manipulation. Our goal is the implementation of GP in a distributed ecosystem by adapting existing libraries. In a previous work, we proposed a solution to port a known GP implementation to the Spark context in order to take the most of its proven potential [16]. The source code of this model is available at GitHubFootnote 2. This implementation is reused in the present work.

The main purpose of this paper is to summarize the different strategies to address Big Data challenges with machine learning (Sect. 2). It then presents the horizontal parallelization and the active learning paradigm, two suited approaches for EDMA (Sects. 3 and 4). The general purpose and some implementation details of the proposed techniques in each strategy are then given and explained in details (Sects. 3 and 4). The efficiency of GP obtained with the different extensions are studied on the HIGGS classification problem and the pulsar detection problem in Sect. 5.

2 Some Approaches to Address Big Data Learning Cost

In modern Information Systems, the amount of stored data has reached a huge volume and their complexity has widely increased. New challenges had then arisen due to the data characteristics.

Approaches to countering the Big Data challenges for machine learning are quite diverse. They are summarized in [25] and classified according to the data analytics pipeline (see Fig. 1). Another review, published in [3], identifies methods and techniques to accelerate evolutionary algorithms when applied to learning tasks. Scaling approaches focus essentially on manipulating data, processing and algorithms. We summarize below the main categories of the different approaches and paradigms that can be developed to alleviate some issues associated with Big Data.

Fig. 1.
figure 1

Manipulations through Data Analytics pipeline [25].

  • Data manipulation: It is applied in the pre-processing phase before administering the data to the learning process. Its goal is to reduce the size of the learning data base, for example by applying techniques of sampling or feature selection. Sampling, in this case, is independent of the learning process (evolution process for GP).

  • Processing manipulation: it relies on processing manipulations to handle the additional computational cost by vertical or horizontal scaling. It includes parallelization approaches using the parallel nature of some population based algorithms such as GP and EA. Parallelization can be done with:

    \(\bullet \):

    Vertical scaling (scaling up): It is based on increasing resources for a single node. For example, hardware acceleration based on graphics processors called General Purpose Graphics Processing Units (GPGPU) and parallelization using multicore CPUs have been used in several jobs [14, 23, 27].

    \(\bullet \):

    Horizontal scaling (scaling out): It refers to distributed systems where computations are deployed on a cluster of nodes [27]. This is the most common form of distribution with Big Data problems. Several frameworks have been put in place for two types of parallelization: by batch or flow-oriented. The most used are Hadoop/MapReduce [11] and Apache Spark [34].

  • Algorithm Manipulation: it relies on algorithm manipulations in order to optimize its running time or to improve the learning quality. It involves the introduction of new machine learning paradigms or the adaptation of some existing machine learning paradigms. As new paradigms, the Online learning and Transfer learning are promising approaches. As existing paradigms that could be adapted to deal with Big Data, we find the Ensemble learning, the Local learning and the Active learning.

In the context of GP scaling, two paths are considered in this work: the acceleration of evaluations through the processing manipulation or the reduction of their number through the learning paradigm. In the first case, the GP engine is ported over a parallel framework to distribute the fitness computation. In the second case, the GP algorithm is extended with an active learning paradigm implemented with an active data sampling.

3 Scaling by Processing Manipulation: Horizontal Parallelization

The evolution of the Big Data ecosystem has allowed the development of new approaches and tools such as Hadoop MapReduceFootnote 3 and Apache Spark (see Footnote 1) that implement a new programming model over a distributed storage architecture. They are the de facto tools for any data intensive application. This section shows how an existing GP implementation is adapted to the Spark context.

3.1 Spark and MapReduce

MapReduce is a parallel programming model introduced by Dean et al. in [11] for Google. It was made popular with its Apache Hadoop implementation. The fundamental idea that gave rise to this model is to move computations to the data by reducing data traffic between the different nodes. MapReduce operates over a distributed file system.

It applies the “divide and rule” technique to break down a process into multiple tasks performed concurrently on different machines on the portions of data they store. These tasks are of two types: Map and Reduce, that are the functional programming origin. Despite the simplicity of this model, which has allowed to solve several large scale problems, it is not suited to iterative algorithms such as EDMA that are penalized by the large number of I/Os and network latency.

Apache Spark is one of many frameworks intended to neutralize the limitations of MapReduce while keeping its scalability, data locality and fault tolerance. The keystone of Spark is the Resilient Distributed Datasets (RDD) [34]. A RDD is a typical immutable parallel data structure that can be cached. These RDDs support two types of operations: transformations (map. filter, join, ...) that produce a modified RDD and actions (count, reduce, ...) that generate non-RDD-type results (an integer, a table, etc.) and require all the RDD partitions to be performed. Spark is up to 100 times faster than MapReduce with iterative algorithms (see Footnote 1).

3.2 Parallelizing GP over a Distributed Environment

The implementation of evolutionary algorithms over a distributed environment has taken up light since the emergence of the Big Data ecosystem. Several works have been published in this context. For example, Rong-Zhi Qi et al. [31] and Padaruru et al. [29] propose solutions for parallelizing the entire evolutionary process (fitness evaluation and genetic operators) with Spark. This solution has been applied to automatic software test game generation.

In Chávez et al. [7], the well-known EA library ECJ is modified in order to use MapReduce for fitness evaluations. This new tool is tested to resolve a face recognition problem over around 50 MB of data. Only time measure was considered in this work. Peralta et al. [30] applied the MapReduce model to implement an EA for the pre-processing of big datasets (up to 67 million records and attributes from 631 to 2000). It’s a feature selection application where each map creates a vector of attributes on disjoint subsets of the original data base. The Reduce phase aggregates all the vectors obtained.

The implementation used in this paper, introduced in [16], is inspired by the works of Chavez et al. [7] Peralta et al. [30]. It is a solution for modifying an existing tool (DEAP) for the distribution of the training data base using the Spark engine for distributing GP evaluation. The GP loop with the different distribution steps are illustrated by the Algorithm 1. Steps 1, 4, 5, and 6 (blue lines) concern the distribution of the training data set for the population fitness computation. In step 1, we start by creating a RDD containing the training set. Then, at each generation, a map transformation (step 4) is performed by sending individuals code to be executed (step 5) on RDD and then get results to compute each individual fitness in step 6. Afterwards, GP pursues its standard evolutionary steps.

figure a

4 Scaling by the Learning Paradigm: Active Learning

Active Learning [2, 8] could be defined as: ‘any form of learning in which the learning program has some control over the inputs on which it trains.’

Active learning is implemented essentially with active sampling techniques. The goal of any sampling approach is first to reduce the original size of the training set, and thus the computational cost, and second to enhance the learner performance. Sampling training dataset has been first used to boost the learning process and to avoid over-fitting [20]. Later, it was introduced for Genetic learners as a strategy for handling large input databases. With active sampling, the training subset is changed periodically across the learning process. The data selection strategy is based on some dynamic criteria, such as random selection, weighted selection, incremental selection, etc. We distinguish one-level sampling methods using a single selection strategy and multi-level sampling (hierarchical sampling) methods using multiple sampling strategies associated in a hierarchical way.

4.1 One-Level Active Sampling

One-level sampling methods use a single selection strategy based on dynamic criteria. Records in the training subset S are selected before the application of the genetic operators each generation.

To select a training subset \(\mathcal S\) from a data base \(\mathcal B\), the simplest technique is to select randomly \(T_S\) records from B with a uniform probability. It is the basic approach for the Random Subset Selection method [13] (Sect. 4.1) and some variants like the Stochastic Sampling [28] and Fixed Random Selection [35]. Several other techniques use more sophisticated criteria in order to address some learning difficulties like over-fitting or imbalanced data. For example, weighted sampling techniques [13] use information about the current state of the training data such as difficulty and the number of solved fitness cases. However data-topology based sampling techniques [15, 24] use information about data topology in order to avoid to select similar fitness cases in the same training subset. Balanced sampling [19] and incremental sampling [22] techniques use information about the data classes to handle the problem of imbalanced data. For this work, we are interested in the random (RSS) and balanced (SBS) sampling methods. In this work, the training data base is divided on balanced blocks and SBS and RSS are applied the resulting blocks (Algorithm 2).

Random Sampling (RSS). The simplest method to choose fitness cases and build the training sample is random. The selection of fitness cases is based on a uniform probability among the training subset. This stochastic selection helps to reduce any bias within the full dataset on evolution. Random Subset Selection (RSS) is the first implementation given by Gathercole et al. [13]. In RSS, at each generation g, the probability of selecting any case i is equal to \(P_{i}(g)\) such that:

$$\begin{aligned} \forall i : 1\le i\le T_B, P_{i}(g)=\frac{T_S}{T_B}. \end{aligned}$$
(1)

where \(T_B\) is the size of the full dataset B and \(T_S\) is the target subset size. The sampled subset has a fluctuating size around \(T_S\).

Balanced Sampling (SBS). The main purpose of balanced sampling is to overcome imbalance in the original data sets. The well known techniques in this category are those proposed by Hun et al. [19] aiming to improve classifiers accuracy by correcting the original dataset imbalance within majority and minority class instances. Some of these methods are based on the minority class size and thus reduce the number of instances. In this paper, we focus on the Static Balanced Sampling (SBS). SBS is an active sampling method that selects cases with uniform probability from each class without replacement until obtaining a balanced subset. This subset contains an equal number of majority and minority class instances of the desired size.

figure b

4.2 Multi-level Active Sampling

Multi-level sampling combines several sampling algorithms applied at different levels. Its objective is to deal with large data sets that do not fit in the memory, and simultaneously provide the opportunity to find solutions with a greater generalization ability than those given by the one-level sampling techniques. The data subset selections at each level are independent. The usual schema is made up of three levels. The first one (level 0) consists in creating blocks with a given size from the original data set which are recorded in the hard disk. The remaining two levels are a combination of two active sampling methods. In the present work, a balanced sampling is applied at levels 0 and 1 and a random sampling is applied at level 2.

Algorithm 3 details how the hierarchical sampling is added to the GP loop. At level 0, the data set \(\mathcal{B}\) is first partitioned into blocks that are selected randomly for the sampling step. Then, at level 1, an intermediate sample is extracted from the current block using a balanced random selection SBS. Finally, at level 2, the training subset \(\mathcal{S}\) is generated randomly from the intermediate sample.

figure c

In addition to the known GP parameters, two new parameters are used by the algorithm: \(T_{l1}\) and \(T_{l2}\). \(T_{l1}\) and \(T_{l2}\) are respectively the number of GP iterations for the first-level sampling and for the second-level sampling. Thus, after each \(T_{l1}\) iterations, the level-one training data set is replaced with an other set with the SBS method. This set is used to generate the training sub set \(\mathcal{S}\) at each \(T_{l2}\) iterations with the RSS approach.

5 Example of Application

This section presents an example of application of the proposed solutions in solving two real world problems. The aim is not to present a detailed experimental study but to give an example of the GP’s behaviour when extended with an active sampling or a parallelization over Spark. For this purpose, we have chosen two applications: The Higg’s Boson classification and the Pulsar detection problems.

5.1 Application Data Bases

The Higgs Data Base for Boson Detection. A Higgs or Z boson is a heavy state of matter resulting from a small fraction of the proton collisions at the Large Hadron Collider[5]. This heavy state quickly decays into more stable particles, so the intermediate states of matter are not observable by the detectors surrounding the point of collision. Highly faithful collisions are then simulated by the ATLAS Simulator [9] to reproduce the essential measurements provided by the detectors to reproduce the essential measurements provided by the detectors.

ATLAS experiment a portion of the simulated data to optimize the analysis of the Higgs Bosons by machine learning techniques. A subset of this data was presented as a challenge in 2014 [1] in the kaggle platformFootnote 4\(^,\)Footnote 5.

From the machine learning point of view, the problem can be formally cast into a binary classification problem. The task is to classify events as a signal (event of interest) or a background (event produced by already known processes). Baldi et al. [4] published for benchmarking machine-learning classification algorithms a big data set of simulated Higgs Bosons that contains 11 million simulated collision events and the 28 features. In this work, we propose to handle a subset of the published data to test th different GP implementations.

HTRU Data Set for Pulsar Detection. Pulsars are a rare type of Neutron star that produce radio emission detectable on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter [26]. As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals. The HTRU(2) dataset, is the publicly available data setFootnote 6 that describes a sample of pulsar candidates collected during the High Time Resolution Universe (HTRU) Survey. The goal of the classification process is to classify the given candidates as pulsars or non pulsars.

Some Related Works on Higgs and HTRU Data Sets. The first machine learning technique applied to detect pulsars on the Lyon et al. HTRU data set is the Gaussian Hellinger very fast Decision Tree with a precision equal to \(89.9\%\) [26]. These results are improved in the study published in [33] where the precision and recall reach respectively \(95\%\) and \(87.3\%\) with the XG-Boost classifier and \(96\%\) and \(87\%\) with Random Forest classifier.

Otherwise, the Boson classification problem using the Higgs data set was first handled with Deep Neural Network (DNN) proposed by Baldi et al.[5]. Their method was compared the boosted decision tree and the Shallow Neural Network. They used a subset of Higgs data base of 2.6 million examples and 100K validation examples. They demonstrated how DNN can be trained on such data set with a high degree of accuracy. The whole Kaggle data set with 11 millions patterns has been studied in [16, 18] where the best accuracy reach \(66.93\%\).

5.2 Experimental Settings

Software Framework. DEAP (Distributed Evolutionary Algorithms in Python)Footnote 7 is presented as a rapid prototyping and testing framework [12] that supports parallelization and multiprocessing. It implements several Evolutionary Algorithms: Genetic Algorithm, Evolution Strategies and GP. The basic module contains objects and data structures frequently used in Evolutionary Computation. We decided to use this framework for the following reasons: (1) it implements standard GP with tree based representation (1), it is a Python package which is one of the 3 languages supported by Spark and (3) it is distributed ready. The third point means that DEAP is structured in a way that facilitates distribution of computing tasks.

Configurations and Performance Metrics. For each data set, five series of tests are performed with different configurations according to the sampling strategy or parallelization strategy, as follows:

  • Standard GP: GP is run without active learning or parallelization.

  • GP + Spark: GP is parallelized over Spark.

  • GP + 2 Level Sampling: GP is extended with a two level sampling using SBS at level 0 and RSS at level 1.

  • GP + 3 Level Sampling: GP is extended with a three level sampling using SBS at level 0 and 1 and RSS at level 2.

By the end of each run, the best individual based on the fitness function is evaluated on the test data set. Results are recorded in a confusion matrix from which accuracy is calculated according to the following formula.

$$\begin{aligned} Accuracy = \frac{True~Positives + True~Negatives}{Total~patterns}. \end{aligned}$$
(2)

where True Positives and True Negatives are the numbers of exemplars correctly classified in respectively class 1 and 0. Experiments are performed on an Intel i7 (4 Core) workstation with 16 GB RAM running under Windows 64-bit Operating System.

The additional parameters needed for the different configurations are summarized in the Table 1. For these first experiments, the training size is limited 40000 for the case of Boson classification.

Table 1. Additional parameters for the GP runs.

GP Settings. The GP terminal set includes the features of data set benchmarks studied in this work. The GP function set includes basic arithmetic, comparison and logical operators reaching 17 functions. The objective is the minimisation of the classification error. The main GP parameters are summarized in Table 2.

Table 2. GP parameters.

5.3 Results and Discussion

The results of these preliminary experiments are illustrated in Table 3 for the HTRU base and in Table 4 for the HIGGS data base. The recorded metrics are average and best accuracy and average computing time over 10 runs.

Table 3. Preliminary results on HTRU data set.
Table 4. Preliminary results on HIGGS data set.

The key observation from the obtained results is that the proposed solutions are able not only to reduce the computation cost according to Standard GP results, but also to improve the performance of the classifiers. The reduction in the computational cost is about \(50\%\) with active learning and can be reduced up to 10 times with GP+RDD in the case of HTRU. Likewise, the average and best accuracies are improved with parallelization and active learning in the majority of test cases. According to these first series of tests and considering the best measures, we can state that the extended GP is able to generate better results than the standard GP either for Higgs data set or HTRU data set.

Results for 1M Train Instances from Higgs Database. The GP behavior observed in the first experiments should remain unchanged with the increase of data size. To demonstrate it, some tests were carried out with a train sample of 1million instances from the Higgs data base and a test set of 200000 instances. Due to the limited computational capacity, only few runs are performed and the number of generations is limited to 25 for GP+Spark and to 100 for PG+2/3 L Sampling. The remaining GP parameters still unchanged.

GP+Spark, after 25 generations and about 5 h of computing time, the best test accuracy reach 57,15%. The parallelization of GP over Spark has not only allowed GP to be applied to a very large data set that is impossible to do with a Standard GP, but also to slightly improve the overall performance. For active learning, as for previous experiments, the training set is first divided into blocks of 100,000 instances (level 1). RSS is then applied on these blocks to generate samples with 10000 instances in the case of 2L sampling. For the 3L Sampling, SBS and RSS are applied respectively at level 1 and 2 where the sample sizes are equal of 10000 and 3000. The learning time reach 5h30mn, however the results are quite improved. The accuracy of the resulting classifiers is about \(63\%\) for 2L and \(61\%\) for 3L. This performance exceeds not only the first results in Table 4 but also some previous published results in the state of the art. However, it is not possible to compare the results of this paper with those of baldi et al. [5] since we use a smaller sample size.

Discussion. This paper presents some trends to address difficulties related to complexity and volume for big data mining with GP as an EDMA. Two solutions are explored, the active sampling with SBS and RSS and the horizontal parallelization with Spark, that are proved to be promising directions to improve classifiers’ quality. Indeed, according to the first series of tests summarized above, and considering the best measures, we can state that the extended GP is able to generate better results than the standard GP either for Higgs data set or HTRU data set. Otherwise, when comparing these results with those of the state of the art, it is clear that GP, in the different explored configurations, is a competitive technique able to accomplish similar or better performance.

These preliminary experiments allow us to conclude that the solutions proposed in this work are effective and promising. However, additional experiments are needed to compare the different techniques according to a complete set of metrics such as the recall, precision and F2-score measures. Otherwise, further studies aim to better explore the different solutions and the different possibilities of combination.

6 Conclusion

With the incremental demand to analyze huge amounts of data, resulting from variant sources and generated at very high rates, researchers at different domains have studied the expansion of the existing data mining techniques to cope with the evolved nature of data. In this paper, we provide some trends to address complexity and large data size with evolutionary machine learning. Active sampling and parallelization over Spark are explained and we detailed how they can be implemented into GPs. A first experimental study to detect pulsar and Higgs Boson demonstrates how these extensions help GP to better deal with complex data. Further works aim to propose a framework implementing the different trends discussed in this paper with more comprehensive studies. In particular, implementing active sampling on top of Spark RDDs is an interesting path towards combining the discussed techniques in the same process. Another direction is studying the effect of cluster size on GP performance.