
1 Introduction

Artificial neural networks (ANNs) are established and thoroughly researched classifiers applied in a wide range of real-life applications, including computer-aided diagnosis [13, 28], stock market index prediction [16], chemical engineering [9], weather forecasting [1], face and skin detection [2, 8], and many other pattern recognition tasks [6, 22]. ANNs belong to the supervised classification engines, which require labeled training sets to be supplied to a training procedure—commonly the back-propagation routine [26]—whose aim is to build a model that generalizes well to unseen data. Collecting high-quality training sets (i.e., balanced, representative, and containing only correctly labeled examples) is often a fairly user-dependent and expensive process. Low- or questionable-quality data provided as a training set may easily hamper learning and deteriorate the performance of any supervised classifier, and ANNs are no exception [18]. Therefore, selecting reduced (or surrogate) training data from difficult datasets (i.e., very large, imbalanced, or weakly-labeled ones) is a vital research topic and should be considered while building a processing engine for the classification task at hand [18,19,20].

1.1 Related Work

Training set selection algorithms have been successfully applied for supervised classifiers, including—among others—support vector machines (SVMs) [12, 18, 19] and random forests [14]. Such algorithms are commonly divided into several groups reflecting their underlying ideas. In basic random training set selection techniques, reduced sets are sampled from the entire training set \(\varvec{T}\) without building any surrogate model of the solution space \(\mathcal {S}\) [3]. Although there exist algorithms for intensifying the search in specific parts of \(\mathcal {S}\) (which seem to encompass important \(\varvec{T}\) vectors), their convergence abilities are still very limited.
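
A minimal sketch of such a random selection baseline is given below (Python with NumPy; the function name and signature are illustrative and not taken from the cited works, which may additionally bias the sampling toward specific regions of \(\mathcal {S}\)).

```python
import numpy as np

def random_training_set_selection(X, y, t_prime, seed=None):
    """Sample a reduced training set of t' examples uniformly at random from T.

    Illustrative baseline only; the cited techniques may further intensify the
    sampling in promising regions of the solution space.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=t_prime, replace=False)
    return X[idx], y[idx]
```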

In geometry-based approaches, the spatial characteristics of examples in \(\varvec{T}\) are analyzed in search of important feature vectors, i.e., those influencing the resulting classification model (e.g., vectors which will likely be selected as support vectors in SVMs). Geometry-based training set selection techniques are built upon the assumption that the data geometry reflects the importance of specific examples in \(\varvec{T}\). This information can be extracted before learning a classifier, hence “useless” vectors can be removed from the refined sets. Such methods encompass clustering-based and clustering-free approaches—the former techniques exploit unsupervised algorithms to group the \(\varvec{T}\) data in the first step. Then, the clusters are further analyzed (e.g., examples positioned inside a cluster will most likely not influence the learning process and can therefore be safely removed [18]). Similarly, neighborhood-based techniques traverse local relations between \(\varvec{T}\) examples to find distinctive vectors [4]. Geometry- and neighborhood-based algorithms depend on the size of \(\varvec{T}\) (denoted \(t\)), which is an important downside: they may not be applicable in real-life scenarios in which the cardinality of \(\varvec{T}\) easily exceeds millions. This problem usually does not arise in randomized techniques, which sample a subset of \(\varvec{T}\) of a size selected prior to the optimization.
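
To make the clustering-based idea concrete, the sketch below keeps only the examples lying farthest from their cluster centroid. This is a hedged illustration using k-means; the specific heuristics, cluster counts, and retention rules of the cited methods differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_based_selection(X, y, n_clusters=50, keep_fraction=0.2, seed=0):
    """Retain, per cluster, the examples positioned farthest from the centroid.

    Follows the intuition from the text: vectors deep inside a homogeneous
    cluster are unlikely to shape the decision boundary and can be dropped,
    while the outer ("boundary") vectors are kept. Class balance is ignored
    here for brevity.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        n_keep = max(1, int(np.ceil(keep_fraction * len(members))))
        # farthest-from-centroid members of cluster c
        keep.extend(members[np.argsort(dist[members])[-n_keep:]])
    keep = np.array(sorted(keep))
    return X[keep], y[keep]
```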

Evolutionary algorithms (EAs), with genetic and memetic techniques (the latter being hybrids of evolutionary algorithms and local-search procedures) constituting the mainstream, have recently gained interest in the context of selecting refined training sets, especially for SVMs [17, 18]. In such techniques, which are often independent of \(t\), a population of candidate solutions (i.e., surrogate sets) evolves in time in a biologically-inspired process. Since optimizing training sets using EAs requires learning an underlying model with the reduced set represented by each individual in the population (such approaches are called wrapper techniques), they may become fairly computationally intensive for larger sets. Nevertheless, they were shown to outperform other methods from the literature and to deliver high-quality training sets (i.e., datasets from which an effective classifier can be learned). Importantly, such methods are able to build surrogate sets that contain vectors that would have been rejected by geometry-based approaches (e.g., those positioned within groups of one-class examples), but turned out to be important (i.e., affecting the classifier performance) in the course of the evolution.

Although the use of EAs for optimizing ANNs is not novel, such approaches were mainly applied to optimizing the weights, learning rules, and network topologies [7]. Evolutionary augmentation of “minimal” neural topologies formed the basis of the well-known NEAT algorithm (Neuro-Evolution of Augmenting Topologies), which was shown to outperform fixed-topology ANNs in reinforcement learning tasks. Importantly, candidate solutions become more complex in the course of the optimization in NEAT, hence this technique also showed that strengthening the analogy with biological evolution is an important aspect of well-designed EAs. A very interesting research pathway involves evolving neural network ensembles [27] (importantly, such classifier committees are gaining research interest nowadays [15, 23]). Such evolutionary ensembles aim at extracting diverse groups of ANNs which are learned separately and are able to provide an agreed classification outcome when classifying challenging datasets (in line with the divide-and-conquer strategy). There exist methods for evolving surrogate training sets for ANNs [5, 10, 24]; however, they have not been extensively investigated in the literature.

Reducing the size of a training set is slightly counter-intuitive and may even appear incorrect, especially for larger ANN topologies (with a huge number of neurons in the hidden layers), because such networks are prone to memorizing small sets (they can be easily overfitted to \(\varvec{T}\)). In this paper, we address this issue and carefully verify whether it is possible to train a well-generalizing ANN with reduced \(\varvec{T}\)’s extracted using a genetic algorithm.

1.2 Contribution

The contribution of this work is multifold. We build upon a genetic algorithm (GA) for evolving reduced training sets for SVMs [11], and exploit an analogous approach for extracting reduced sets for ANNs (Genetic Algorithm for Training Set selection, GA-TS), in which each candidate solution (an individual) in a population represents a surrogate set, and its fitness is quantified as the classification performance of the underlying model trained using the corresponding surrogate training set. These surrogate sets can be effectively used for learning other classifiers as well.

Although GA-TS is a wrapper technique (i.e., calculating the fitness value involves performing the full training procedure), our extensive and fairly rigorous experiments (multifold cross-validation backed up with statistical tests) showed that it offers decent convergence capabilities for benchmark and real-life datasets (the latter was extracted from the ECU database in the context of skin detection and segmentation). In the sensitivity analysis, we showed how the most important parameters of GA-TS influence the optimization process and the classifiers trained using the evolved surrogate training sets. The following insights can be learned from our experimental study:

  • Surrogate sets evolved using GA-TS can be effectively used to improve performance of not only ANNs but other state-of-the-art classifiers as well (we investigated several classification engines known from the literature).

  • ANNs of a predefined topology can be learned from small training sets, and they can generalize well to the unseen data.

  • Increasing the number of generated children (beyond two) and population size in GA-TS slows down the convergence and does not offer a visible increase in the quality of extracted training sets.

  • Reducing the cardinality of \(\varvec{T}\) helps improve all investigated classifiers, which is in line with the conclusions reported in recent works on SVMs [17].

1.3 Paper Structure

This paper is structured as follows. Section 2 discusses our genetic algorithm for evolving surrogate training sets for ANNs. In Sect. 3, we report and analyze the results of our experiments performed using benchmark and real-life datasets. Section 4 concludes the paper and highlights the future research pathways.

2 Genetic Evolution of Surrogate Training Sets

In this paper, we build upon a GA proposed in our previous work [11], and present how to exploit an analogous algorithm for evolving training sets for ANNs (GA-TS). Also, we show that \(\varvec{T}\)’s extracted using our technique can be effectively used for learning other state-of-the-art classifiers as well.

Fig. 1. Evolution of training sets using a genetic algorithm.

In GA-TS, a population of \(N\) solutions (surrogate training sets) undergoes evolution in search of a high-quality reduced \(\varvec{T}\). The size of individuals \(t'\) (i.e., the cardinality of the reduced sets) is kept constant across the population and is set a priori (see the flowchart in Fig. 1). The parents are then paired for mating and undergo crossover (in Fig. 1, operations that are inherently parallelizable are rendered as stacked steps). In this step, a child inherits distinct vectors (from both classes) from both parents—if the number of vectors in an offspring is smaller than \(t'\), random examples are drawn from \(\varvec{T}\) to keep the size of individuals constant in the course of the optimization. An analogous approach of replacing random examples of a surrogate set with examples selected from \(\varvec{T}\) is used in mutation (performed with the probability \(\mathcal {P}_m\)). The fitness of an individual is quantified as the classification performance (e.g., accuracy or area under the receiver operating characteristic curve), over the validation set \(\varvec{V}\), of a classifier learned using the corresponding surrogate set (in this work, we exploit the accuracy of a trained ANN).

The GA is stopped once a termination condition is met: a surrogate set of the desired quality has been obtained, the maximum number of generations has been processed, or the maximum allocated execution time has elapsed. Finally, the best individual from the last generation is returned and considered the final surrogate training set (in fact, we return a model trained using this surrogate set, ready to be applied to new data). For more details on the GA for evolving surrogate training sets, we refer to [11].
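
The listing below is a compact, self-contained sketch of the loop described above, assuming an index-set encoding of individuals and scikit-learn's MLPClassifier as the underlying ANN. It is not the implementation from [11]: parameter defaults, the pairing and replacement scheme, and all helper names are illustrative assumptions, and class balance within surrogate sets is not enforced here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def fitness(individual, X_T, y_T, X_V, y_V):
    """Train an ANN on the surrogate set and return its accuracy on V."""
    ann = MLPClassifier(hidden_layer_sizes=(6, 3), max_iter=500)
    ann.fit(X_T[individual], y_T[individual])
    return ann.score(X_V, y_V)

def fill_up(idx, t_prime, t):
    """Pad an index set with random T examples so that its size stays t'."""
    idx = np.unique(idx)
    missing = t_prime - len(idx)
    if missing > 0:
        pool = np.setdiff1d(np.arange(t), idx)
        idx = np.concatenate([idx, rng.choice(pool, size=missing, replace=False)])
    return idx[:t_prime]

def crossover(p1, p2, t_prime, t, p_c=0.7):
    """With probability p_c, the child inherits distinct vectors from both
    parents (padded from T if needed); otherwise it copies one parent."""
    if rng.random() > p_c:
        return p1.copy()
    inherited = np.union1d(p1, p2)
    child = rng.choice(inherited, size=min(t_prime, len(inherited)), replace=False)
    return fill_up(child, t_prime, t)

def mutate(idx, t_prime, t, p_m=0.1):
    """With probability p_m, replace a random example with one drawn from T."""
    if rng.random() < p_m:
        idx = idx.copy()
        idx[rng.integers(len(idx))] = rng.choice(np.setdiff1d(np.arange(t), idx))
    return fill_up(idx, t_prime, t)

def ga_ts(X_T, y_T, X_V, y_V, t_prime, N=20, n_children=2,
          max_gen=100, patience=5, p_c=0.7, p_m=0.1):
    """Evolve a surrogate training set of t' indices into X_T/y_T."""
    t = len(X_T)
    pop = [rng.choice(t, size=t_prime, replace=False) for _ in range(N)]
    fit = [fitness(ind, X_T, y_T, X_V, y_V) for ind in pop]
    best, stalled = max(fit), 0
    for _ in range(max_gen):
        # pair the parents at random; each pair produces n_children offspring
        order = rng.permutation(len(pop))
        children = []
        for i in range(0, len(pop) - 1, 2):
            p1, p2 = pop[order[i]], pop[order[i + 1]]
            for _ in range(n_children):
                children.append(
                    mutate(crossover(p1, p2, t_prime, t, p_c), t_prime, t, p_m))
        # wrapper step: every child is assessed by training an ANN on it
        child_fit = [fitness(c, X_T, y_T, X_V, y_V) for c in children]
        # generational replacement (no elitism): the fittest children survive
        top = np.argsort(child_fit)[-N:]
        pop = [children[i] for i in top]
        fit = [child_fit[i] for i in top]
        # stop when the best fitness has not improved for `patience` generations
        if max(fit) > best:
            best, stalled = max(fit), 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return pop[int(np.argmax(fit))]  # indices of the final surrogate training set
```

The returned index set can then be used to train the final ANN, or any other learner, as discussed in Sect. 3.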

3 Experimental Validation

GA-TS was implemented in C++. We analyzed several classifiers (implemented with scikit-learn in Python): artificial neural networks (ANNs), logistic regression (LR), AdaBoost (AB), Gaussian Naive Bayes (GNB), quadratic discriminant analysis (QDA), k-nearest neighbors (k-NN), decision trees (DTs), random forests (RFs), and SVMs. The crossover and mutation probabilities in GA-TS were tuned experimentally to \(\mathcal {P}_c=0.7\) and \(\mathcal {P}_m=0.1\), respectively, and remained constant across the entire experimental study. Similarly, the learning rate of our ANN was kept constant and equal to 0.6.
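
For illustration, the investigated scikit-learn models could be instantiated as sketched below. Apart from the ANN topology quoted in Sect. 3.2 (hidden layers of 6 and 3 neurons) and the learning rate of 0.6 (which we map onto MLPClassifier's learning_rate_init as an assumption), all hyperparameters are library defaults rather than the exact settings used in the study.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Illustrative configuration only; except for the ANN topology and learning
# rate quoted in the text, the hyperparameters are library defaults.
classifiers = {
    "ANN": MLPClassifier(hidden_layer_sizes=(6, 3), solver="sgd",
                         learning_rate_init=0.6, max_iter=1000),
    "LR": LogisticRegression(max_iter=1000),
    "AB": AdaBoostClassifier(),
    "GNB": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "k-NN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}

# Any of these models can be trained on a surrogate set and scored on the test
# set, e.g.: clf.fit(X_surrogate, y_surrogate); clf.score(X_test, y_test)
```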

3.1 Datasets

We focused on one UCI benchmark (Wisconsin breast cancer: 699 examples with 10 attributes, imbalanced, with 458 benign and 241 malignant examples), and one real-life set from the field of skin segmentation (Skin: \(10^5\) examples of RGB values, balanced). The latter set contains examples of skin and non-skin pixels sampled from the ECU images (see the example images rendered in Fig. 2—the very high variability of inner-class vectors representing human skin can be noticed there). Both datasets were divided into a training set \(\varvec{T}\), from which surrogate sets are evolved, a validation set \(\varvec{V}\), on which the fitness of individuals is calculated during the evolution, and a test set \(\varPsi \), used to quantify the generalization capabilities of a learned model (this set was not used during the evolution and contains unseen data). For Wisconsin, we followed the 5-fold cross-validation scenario, whereas Skin was divided randomly into \(\left| \varvec{T}\right| =8\cdot 10^4\) training, \(\left| \varvec{V}\right| =10^4\) validation, and \(\left| \varPsi \right| =10^4\) test examples.
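
A sketch of how such splits could be produced with scikit-learn is shown below; the random seeds, the use of stratification, and the way \(\varvec{V}\) is carved out within each Wisconsin fold are our assumptions rather than details given in the text.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def split_skin(X, y, seed=0):
    """Random split of Skin into |T| = 80000, |V| = 10000, and |Psi| = 10000."""
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, test_size=10_000, random_state=seed, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tv, y_tv, test_size=10_000, random_state=seed, stratify=y_tv)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

def wisconsin_folds(X, y, seed=0):
    """5-fold cross-validation indices for the Wisconsin benchmark; within each
    fold, a further hold-out from the training portion can serve as V."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(X, y))
```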

Fig. 2. Example ECU images used to elaborate the Skin dataset.

3.2 Experiment 1: Skin Dataset

In this experiment, we verify the impact of the GA parameters, i.e., the population size (\(N\)), the number of children generated for each pair of parental solutions during crossover (\(N_\mathrm{ch}\)), and the size of the evolved surrogate sets (\(t'\)), on its search capabilities and on the quality of the final solutions (ANNs learned using the evolved \(\varvec{T}\)’s). Each variant of GA-TS was executed 10\(\times \). We focused on Skin, and our ANN was fairly small (built with two hidden layers containing 6 and 3 neurons, respectively). The evolution of training sets was terminated if there was no improvement in the best fitness across five consecutive generations.

Table 1. Classification performance of ANNs trained using Skin surrogate sets extracted using GA-TS for all investigated configurations: Acc—accuracy obtained for the test set \(\varPsi \), \(\Delta \eta \)—gain in the average fitness (note that the gain can be negative because we do not exploit any elitism), \(\tau \)—GA convergence time; for all measures we report the minimum, average, and maximum values (the best values are boldfaced).

In Table 1, we gather the results (alongside the convergence time of our genetic technique) obtained using ANNs learned from surrogate sets extracted with GA-TS run in all investigated configurations. For very small sets (containing \(0.1\%\) of samples from the entire training set \(\varvec{T}\)), GA-TS prematurely converges to the final solutions, and these surrogate training sets are of low quality (they are much worse than the other sets, and the differences are statistically significant at \(p<0.01\) in the Wilcoxon test—the null hypothesis saying that these GA-TS variants deliver surrogate sets of the same quality can be safely rejected). On the other hand, increasing the cardinality of the evolved training sets does not necessarily improve their quality (the differences are not statistically significant; the p-value equals \(0.52\) for the Wilcoxon test), but notably slows down the convergence (see sets with \(5\%\) of \(\varvec{T}\) examples compared with those containing \(10\%\) of \(\varvec{T}\) vectors). Similarly, the population size should be kept as small as possible to speed up exploration of the solution space (candidate solutions with relatively small fitness values need more time to be improved in larger populations—see the minimum \(\varPsi \) accuracy for \(N=40\)). Finally, generating two children (instead of one) for each parental pair can help boost the exploitation and create well-fitted offspring solutions. However, it increases the execution time of GA-TS, because all generated children must be assessed during the evolution (which requires training a classifier using the surrogate set represented by each new individual). Therefore, selecting training sets can be seen as a multi-objective optimization problem: on the one hand, the performance of the final classifier is to be maximized; on the other hand, the selection algorithm should work as fast as possible (also, the training time of the classifier should be minimized).

Fig. 3. Example ECU images segmented using selected classifiers learned with: a set of \(0.25\cdot \left| \varvec{T}\right| \) vectors evolved using GA, randomly sampled from \(\varvec{T}\), and the entire \(\varvec{T}\).

Table 2. Accuracy for selected ECU images (see a–c in Fig. 3) elaborated using all investigated classifiers trained with surrogate sets obtained using GA-TS and random sampling, and with the entire training set \(\varvec{T}\), alongside their training and classification (of all pixels in the corresponding image) times (\(\tau ^\mathrm{T+C}\), in seconds). The best results for each image are boldfaced.

Example ECU images segmented using selected classifiers trained with the entire training set and with reduced training sets (evolved using GA-TS or sampled randomly from \(\varvec{T}\)) are rendered in Fig. 3 (we also present a manual ground-truth segmentation of these images). It can be seen that the sets evolved using the genetic technique allowed for learning well-performing models for very heterogeneous data (see the inner-class variability in Fig. 2). It is especially visible for QDA, where the classification accuracy was improved by a margin greater than \(30\%\) (many false positives were filtered out). Hence, GA-TS can be effectively applied to other learners as well (even if the quality of surrogate sets during the evolution was quantified using a different classifier) to help boost their classification performance over difficult training sets. Also, the underlying classification engine used to elaborate the fitness value can be easily changed.

In Table 2, we report the classification accuracy alongside the processing time (training and classification) over the selected ECU images for all investigated classifiers (learned using the entire set \(\varvec{T}\), and surrogate training sets extracted with our GA-TS or sampled randomly from \(\varvec{T}\)—both reduced sets contained \(0.25\cdot \left| \varvec{T}\right| \) vectors). The results show that the evolved sets not only allow for obtaining the highest-quality classification scores, but also help decrease the execution time of the classification engine (especially when compared with learners trained using the entire training set). Interestingly, removing samples from \(\varvec{T}\) helped improve the accuracy—this may indicate the presence of “noisy” vectors in \(\varvec{T}\) which should be removed from the training set before training to maximize the generalization capabilities of a model.

3.3 Experiment 2: Wisconsin Breast Cancer Dataset

In this experiment, we compared fairly small ANNs (two hidden layers with three neurons each) trained using surrogate sets (of various sizes) with state-of-the-art classifiers learned with the entire \(\varvec{T}\) for the Wisconsin benchmark dataset (as already mentioned, we performed 5-fold cross-validation and report the results averaged across the folds). As in the previous experiment, each variant of GA-TS was executed 10\(\times \) for each fold.

The classification accuracies reported in Table 3 confirm that ANNs outperform the other approaches and deliver very fast training and classification (see \(\tau ^\mathrm{T+C}\)). Also, the results are fairly consistent across the folds (see the ANN standard deviation \(\sigma \)). The increase in the classification abilities of ANNs shows that GAs can be successfully used for evolving \(\varvec{T}\)’s not only for SVMs [17]. It is interesting to note that the training of ANNs did not vary significantly across different surrogate set cardinalities (the boost in classification accuracy was not notable either). This indicates that this dataset is not very challenging, and even for the entire training set the learning process remains affordable.

Table 3. Accuracy for Wisconsin breast cancer using the investigated classifiers, alongside their training and classification (of the test set) times (\(\tau ^\mathrm{T+C}\), in seconds). The best results in each column are boldfaced.

4 Conclusions and Future Work

In this paper, we exploited a genetic algorithm for evolving surrogate training sets (of significantly lower cardinalities than the original set) for ANNs. Also, we showed that such evolved sets can be effectively used for other state-of-the-art classifiers without introducing any changes in the core algorithm. Experimental validation (coupled with statistical tests) performed on real-life and benchmark datasets confirmed the applicability of this evolutionary training set selection scheme in practice. Classifiers learned with the evolved surrogate sets were confronted with those trained using the entire set and using random surrogate sets, and were shown to outperform them. This indicates that training set selection should be used in practice for large, heterogeneous, and potentially noisy datasets in order to select only those examples which are distinctive.

It would be interesting to verify whether a surrogate training set evolved for a given classifier can be directly used to improve a different learner—this could help decrease the fitness evaluation time in the course of the optimization by exploiting the fastest classifiers to quantify the quality of surrogate sets during the evolution. Although the results reported in this paper strongly suggest that it is possible, this issue requires further investigation. We are currently working on a fully hands-free algorithm for evolving SVMs, in which not only the surrogate training sets are optimized, but also the features and kernel hyperparameters evolve. This algorithm will be applied to segment brain lesions from FLAIR MRI sequences and other medical images [25]. Finally, we aim to tackle the problem of selecting training sets for deep neural networks by means of evolutionary computation.