
1 Introduction

Genetic Programming (GP) is an evolutionary algorithm-based methodology inspired by biological evolution. It uses tree-based structures and a suite of genetic operators to generate and evolve a population of solutions to a given problem [3]. GP has produced many novel and outstanding results in areas such as optimization, searching, sorting, quantum computing, electronic design, game playing, and cyberterrorism prevention [4, 6]. One of the main application areas of GP is Machine Learning.

In Machine Learning, generalization and over-fitting are two central challenges that need to be addressed. The generalization error of a learner is directly related to over-fitting, and the generalization problem is often referred to as the problem of over-fitting [7]. Many studies in Machine Learning, including in GP, try to improve the generalization ability of learners by reducing over-fitting [2, 8–12].

Over-fitting can be controlled through the Bias-Variance trade-off [2], where bias is the error on the training data set and variance is the variability of the error across different data sets encountered in the future. Over-fitting is reduced when bias and variance are small simultaneously. Because bias and variance are implicitly contained in L2-norm loss functions (e.g. RMSE, MSE, ...), many studies in Machine Learning have used these functions for learning [13–16]. However, combining bias and variance inside a single L2 error function sometimes makes it difficult to optimize them simultaneously, since bias and variance are two conflicting objectives. To overcome this issue in GP, Alexandros et al. proposed the BVGP method [2], which divides the fitness function into two components, variance and squared bias, in order to bring the variance component into the evolutionary process more directly. However, this method still faces the over-fitting issue on limited training samples, which reduces the generalization ability of the learned model and can make it very sensitive to noise.

In this paper, we propose a variation on the fitness function for GP, called BVGP*, which aims at overcoming the above limitations of BVGP. Through experiments, we demonstrate that BVGP* has two advantages: (1) it can help reduce over-fitting on problems where GP over-fits; (2) it runs faster and finds simpler solutions. The main contribution of this paper is therefore a variation on the fitness function that improves the effectiveness of GP based on the bias-variance decomposition of training errors.

The remainder of this paper is organized as follows. Section 2 briefly presents the background knowledge and related work. The proposed variation on the fitness function is presented in Sect. 3. Section 4 describes the experimental settings and the test problems. Experimental results are given in Sect. 5. Finally, Sect. 6 summarizes the results and presents some directions for future work.

2 Background and Related Work

2.1 Bias-Variance Decomposition

In this section we introduce the statistical background on loss functions and the Bias-Variance Decomposition for regression. The material is based on the book by Trevor Hastie [17].

If we assume that \(Y=f(x)+\varepsilon \), where \(\varepsilon \) is a noise term with \(E(\varepsilon )=0\) and \(Var(\varepsilon )=\sigma ^{2}_{\varepsilon }\), we can derive an expression for the expected prediction error of \(\widehat{f}(x)\) at an input point \(X=x_{0}\) under the L2 loss as follows:

$$\begin{aligned} Err(x_{0}) &= E[(Y-\widehat{f}(x_{0}))^2 \mid X=x_{0}] \\ &= \sigma ^{2}_{\varepsilon }+[E\widehat{f}(x_{0})-f(x_{0})]^2+E[\widehat{f}(x_{0})-E\widehat{f}(x_{0})]^2 \\ &= \sigma ^{2}_{\varepsilon }+Bias^2(\widehat{f}(x_{0}))+Var(\widehat{f}(x_{0})) \\ &= \mathrm{Irreducible\ Error} + Bias^2 + Variance \end{aligned}$$
(1)

The first term is the variance of the target around its true mean \(f(x_{0})\) and cannot be avoided no matter how well we estimate \(f(x_{0})\). The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. The last term is the variance, the expected squared deviation of \(\widehat{f}(x_{0})\) around its mean. The last two terms must be kept small for the prediction model to perform well.
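To make the decomposition concrete, the following small simulation (our own illustrative sketch, not taken from [17] or [2]; the target function, noise level and linear learner are assumptions) estimates the three terms of Eq. (1) for a simple learner and checks that they approximately sum to the expected prediction error at a point \(x_{0}\):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true underlying function (assumed)
sigma = 0.3                               # noise level, Var(eps) = sigma^2
x0 = 0.25                                 # fixed test input
n_train, n_trials = 30, 5000

preds, sq_errs = [], []
for _ in range(n_trials):
    x = rng.uniform(0.0, 1.0, n_train)                # fresh training set
    y = f(x) + rng.normal(0.0, sigma, n_train)
    coeffs = np.polyfit(x, y, deg=1)                  # deliberately simple (biased) learner
    fhat_x0 = np.polyval(coeffs, x0)                  # prediction at x0
    preds.append(fhat_x0)
    y0 = f(x0) + rng.normal(0.0, sigma)               # fresh noisy target at x0
    sq_errs.append((y0 - fhat_x0) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2                   # squared bias
var = preds.var()                                     # variance of the estimator
print("Err(x0)                ~", np.mean(sq_errs))
print("sigma^2 + Bias^2 + Var ~", sigma**2 + bias2 + var)
```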

Generalization error is the prediction error over an independent test sample:

$$\begin{aligned} Err_{T}=E[L(Y,\widehat{f}(X)) \mid T] \end{aligned}$$
(2)

where both X and Y are drawn randomly from their joint distribution (population). Here, the training set T is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error:

$$\begin{aligned} Err=E[L(Y,\widehat{f}(X))]=E[Err_{T}] \end{aligned}$$
(3)

The decomposition in Eq. (1) is known as the Bias-Variance Decomposition.

2.2 Bias-Variance Genetic Programming (BVGP)

Bias-Variance Genetic Programming, proposed by Alexandros et al. [2], is a method for addressing the over-fitting issue based on the Bias/Variance Error Decomposition, which aims at relaxing the sensitivity of an evolved model to a particular training dataset. The method uses a fitness function that combines bias and variance as follows:

$$\begin{aligned} fitness = w_{b}Bias(D)+w_{v} Var(D^{*}) \end{aligned}$$
(4)

where \(w_{b}, w_{v}\) are the coefficients for bias and variance, respectively; D is the training data set of size n; \(D^{*}\) consists of B bootstrap datasets randomly drawn from D by the bootstrap re-sampling method; Bias(D) is the mean error on the original dataset (bias); and \(Var(D^{*})\) is the variance of the error over the bootstrap datasets.

The authors separate the regression error into two components, bias and variance, in order to bring the variance error into the evolutionary process more directly.
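A minimal sketch of how the fitness in Eq. (4) can be computed is shown below. This is our own illustration, not the authors' code: `model` is assumed to be a callable mapping inputs to predictions (e.g. a compiled GP tree), RMSE is used as the error function, and the weight and B values are placeholders.

```python
import numpy as np

def rmse(model, X, t):
    """Root mean squared error of an evolved model on a dataset (X, t)."""
    return np.sqrt(np.mean((model(X) - t) ** 2))

def bvgp_fitness(model, X, t, w_b=0.5, w_v=0.5, B=30, rng=None):
    """Eq. (4): w_b * Bias(D) + w_v * Var(D*)."""
    rng = rng or np.random.default_rng()
    n = len(t)
    bias_D = rmse(model, X, t)                    # Bias(D): error on the original training set
    boot_errs = []
    for _ in range(B):                            # B bootstrap resamples of D
        idx = rng.integers(0, n, size=n)          # draw n indices with replacement
        boot_errs.append(rmse(model, X[idx], t[idx]))
    var_Dstar = np.var(boot_errs, ddof=1)         # Var(D*): variance of the bootstrap errors
    return w_b * bias_D + w_v * var_Dstar
```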

3 The Improved Method: BVGP*

In this section, we propose a variation on the fitness function for GP that aims at overcoming the disadvantage of BVGP. The function is based on BVGP and defined as follows:

$$\begin{aligned} fitness = w_{b}Bias(D^{*})+w_{v} Var(D^{*}) \end{aligned}$$
(5)

where bias and variance are both calculated using the bootstrap re-sampling method. We consider f(x) to be the model trained on a dataset \(D = \{(x_{1},t_{1}),..., (x_{N}, t_{N})\}\) and use the bootstrap re-sampling method to randomly draw B datasets with replacement from D, each of the same size as D. We denote by \(D^{*}\) the collection of B bootstrap sample sets, \(D^{*}=\{D^{*b}, b = 1, ..., B \}\). The estimated bias (\(\mu \)) and variance (\(\sigma ^2\)) of the stochastic fitness are computed as follows:

$$\begin{aligned} Bias(D^{*}) = \sum _{b=1}^BBias^{*b}/B \end{aligned}$$
(6)

where \(Bias^{*b}\), the bias on bootstrap sample \(D^{*b}\), is calculated using the RMSE error function over the N pairs \((x_{i}, t_{i})\) of \(D^{*b}\):

$$\begin{aligned} Bias^{*b} = \sqrt{\frac{1}{N}\sum _{i=1}^N(f(x_{i})-t_{i})^2} \end{aligned}$$
(7)

Thus, here we use \(Bias(D^{*})\) rather than the mean error on the original dataset, Bias(D). The variance \(Var(D^{*})\) is the sample variance of the bootstrap errors:

$$\begin{aligned} \sigma ^2=\frac{1}{B-1}\sum _{b=1}^B(Bias^{*b}-Bias(D^{*}))^2 \end{aligned}$$
(8)

As shown in [5], given a data sample, statistical inference is the process of assessing how systems will behave in untested situations. It permits generalizing conclusions beyond the sample to an unseen population from which the sample is drawn. This is inference from statistics to parameters, where statistics are functions on samples and parameters are functions on populations. Note that Bias(D) is a statistic on D, while \(Bias(D^{*})\) is an estimate of the corresponding parameter inferred from this statistic. The bootstrap re-sampling method is used to construct an empirical sampling distribution for estimating \(Bias(D^{*})\) without making any troubling assumptions about sampling models and population distributions. BVGP* optimizes a fitness function based on \(Bias(D^{*})\), while the fitness function of BVGP is based on Bias(D). Therefore, we believe that the generalization ability of BVGP* is better than that of BVGP. The experimental results confirm this for most of the tested problems.
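For comparison with the BVGP sketch above, the following sketch (again our own illustration with assumed helper names and placeholder weights, not the authors' implementation) computes the BVGP* fitness of Eqs. (5)-(8); the only change is that the bias term is now the mean of the bootstrap error distribution instead of the error on D:

```python
import numpy as np

def bvgp_star_fitness(model, X, t, w_b=0.5, w_v=0.5, B=30, rng=None):
    """Eq. (5): w_b * Bias(D*) + w_v * Var(D*)."""
    rng = rng or np.random.default_rng()
    n = len(t)
    boot_errs = []
    for _ in range(B):                                  # B bootstrap resamples of D
        idx = rng.integers(0, n, size=n)                # sample n points with replacement
        resid = model(X[idx]) - t[idx]
        boot_errs.append(np.sqrt(np.mean(resid ** 2)))  # Bias^{*b}: RMSE on D^{*b}, Eq. (7)
    bias_Dstar = np.mean(boot_errs)                     # Bias(D*), Eq. (6)
    var_Dstar = np.var(boot_errs, ddof=1)               # Var(D*) = sigma^2, Eq. (8)
    return w_b * bias_Dstar + w_v * var_Dstar           # fitness, Eq. (5)
```

Note that in this sketch both terms are computed from the same B bootstrap error values, so no separate evaluation on the original dataset D is required.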

4 Experimental Setting

4.1 Problems

In this paper, we use the benchmark regression problems from [2], shown in Table 1. In addition, we use three UCI data sets, shown in Table 2, to test the generalization ability of BVGP*. For the UCI data sets, we randomly divide each original dataset into two parts with a ratio of \(\langle \text {Train sample:Test sample} \rangle = \langle 1:2 \rangle \), as sketched below.
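The following is a hypothetical sketch of such a 1:2 random split; the function name and argument layout are our own assumptions, not part of the original experimental code.

```python
import numpy as np

def split_one_to_two(X, t, rng=None):
    """Randomly split a dataset into <Train:Test> = <1:2>."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(t))
    n_train = len(t) // 3                        # one third for training
    train, test = idx[:n_train], idx[n_train:]   # remaining two thirds for testing
    return X[train], t[train], X[test], t[test]
```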

Table 1. GP benchmark regression problems
Table 2. UCI data sets

4.2 GP System Setup

The evolutionary parameter values for the GP systems are shown in Table 3. These are typical settings often used by GP researchers and practitioners [1].

Table 3. GP systems setup

5 Results and Discussion

In this section we compare the performance of BVGP* with that of GP and BVGP. We evaluate the effectiveness of BVGP* on three aspects: (1) generalization ability; (2) model complexity; and (3) time complexity.

5.1 Generalization Error (Fittest)

For each GP system, we performed one hundred independent runs. The generalization error is the median testing error of the best individual over these runs. Table 4 shows the generalization (testing) error of the fittest individual for GP, BVGP and BVGP*; bold values indicate the best result. For most problems (BEN_1, BEN_2, BEN_3, BEN_4, BEN_5, BEN_7, UCI_1), the fittest error of BVGP* is smaller than that of GP and BVGP, i.e., the generalization ability of BVGP* is better. However, on UCI_2 the generalization ability of BVGP* is much worse than that of GP and BVGP. A possible cause is that the model learned by BVGP* under-fits this problem.

Note that both GP and BVGP use the bias on the original training dataset, Bias(D), as the optimization goal. This leads to over-fitting when the training sample is small, when the training data are noisy, or when the sampling is poor. BVGP*, rather than using Bias(D), uses the mean of the empirical bootstrap error distribution, \(Bias(D^{*})\), as one of its optimization goals. It can therefore avoid the sampling bias issues that lead to over-fitted solutions, as discussed above. This explains why the results of BVGP* are better than those of GP and BVGP on most problems.

Table 4. Summary of fittest error (median). Statistics based on 100 independent runs. Bold values indicate that the method is the best.
Table 5. Evaluation time (median, in milliseconds) and model complexity (median), where model complexity is the number of nodes of the best individual. Statistics based on 100 independent runs. Bold values indicate that the method is the best.
Fig. 1. The evaluation time of genotype (a) is similar to that of genotype (b), although their model complexities are different.

5.2 Model Complexity and Evaluation Time

As in Sect. 5.1, statistics are based on one hundred independent runs for each GP system. Table 5 shows the evaluation time and model complexity of the best individual found by GP, BVGP and BVGP*. Bold values indicate that the corresponding method is the best.

Here, the evaluation time is measured in milliseconds and is affected mainly by model complexity. Similar to the fittest error, on all problems (10/10, see the bold lines) BVGP* is faster than GP because it learned smaller models (see the corresponding lines in the Evaluation time column). Compared to BVGP, BVGP* also learned models of smaller complexity on most problems (7/10, see the bold lines), so its evaluation time is smaller than that of BVGP. Note that on BEN_6 the model complexity of BVGP* is larger while its evaluation time is smaller than that of the other methods. This can happen because the genotype of the model learned by BVGP* contains operators with different evaluation costs: for the two genotypes shown in Fig. 1, the model complexities are different, yet the evaluation times are similar.

6 Conclusion and Future Work

In this paper, we proposed a variation on the fitness function for GP, called BVGP*. It is based on the bias-variance decomposition and on the BVGP method. Analyses of the empirical results show that this approach has two advantages: (1) BVGP* can help reduce over-fitting on problems where GP and BVGP over-fit; (2) it runs faster and finds simpler solutions.

Several future research directions arise from this paper. First, we need a more natural way of representing the fitness that brings the two components directly into the evolutionary process. Second, we need a new selection mechanism corresponding to this fitness representation.