
1 Introduction

Recently, researchers have successfully applied Semantic Genetic Programming (SGP) to different domains, showing promising results [1,2,3]. While the classic GP operators (e.g., selection, crossover, and mutation) act at the syntactic level, blind to the semantics (behavior) of the individuals (e.g., programs), the key idea of SGP is to apply semantic evaluations [1]. More specifically, classic GP operators ignore the behavioral characteristics of the offspring, focusing only on improving the fitness of the individuals. In contrast, SGP uses richer feedback during the evolution that incorporates semantic awareness, which has the potential to improve the power of genetic programming [1].

In this paper, we consider the Symbolic Regression domain, and thus assume the availability of training cases (defined as m pairs of inputs and desired outputs). Following the most popular SGP approaches [1], we define the “semantics” of a program as its set of output values on the training cases [4]. Such an approach obtains richer feedback during the evolution by relying on the evaluation of the individuals on the training cases. More formally, the semantics of an individual \(\mathcal {I}\) is a vector \(\textit{sem}(\mathcal {I})=\langle y_1,y_2,\cdots ,y_m \rangle \) of responses to the m inputs of the training cases. Let \(\textit{sem}(\hat{y})=\langle \hat{y_1},\hat{y_2},\cdots ,\hat{y_m}\rangle \) denote the semantic vector of the target (as defined in the training set), where \(\hat{y_1},\hat{y_2},\cdots ,\hat{y_m}\) are the desired outputs. SGP defines a semantic space [1] with a metric that characterizes the distance between the semantic vector of an individual \(\textit{sem}(\mathcal {I})\) and that of the target \(\textit{sem}(\hat{y})\). SGP often relies on such a distance to compute the fitness score, inducing a unimodal fitness landscape, which avoids local optima by construction [5].

The effectiveness of SGP depends on the availability of GP operators that can move in the semantic space towards the global optimum. An example of a semantic operator is the geometric crossover proposed by Moraglio et al. [5], which produces an offspring whose semantic vector lies on the line connecting the parents in the semantic space. Thus, it guarantees that the offspring is no worse than the worst of the parents [5]. However, this crossover operator has the major drawback of producing individuals of exponentially increasing size (i.e., exponential bloat) [1, 5]. To avoid exponential bloat, researchers proposed variants of this operator that minimize bloating [2], but at the cost of dropping the important guarantee of non-worsening crossover operations.

In this paper, we present a new SGP approach called SGP-DT (Semantic Genetic Programming based on Dynamic Targets) that minimizes the exponential bloat problem and at the same time gives a bound on the worsening of the offspring. SGP-DT divides the search problem into multiple GP runs. Each run is guided by a different dynamic target, which SGP-DT updates at each run based on the residual errors of the previous run. Then, SGP-DT combines the results of each run into an “optimized” final solution.

In a nutshell, SGP-DT works as follows. SGP-DT runs the GP algorithm (see Algorithm 1) a fixed number of times (\(N_{\text {ext}}\)), depending on the available budget. We call these runs external iterations, as opposed to the internal iterations (i.e., generations) that the GP algorithm performs to evolve the individuals. Each GP run performs a fixed number of internal iterations and returns a model (i.e., the best solution) that we call partial model. The next external iteration runs the GP algorithm with a modified training set, where SGP-DT replaces the m desired outputs \(\langle \hat{y_1},\hat{y_2},\cdots ,\hat{y_m} \rangle \) with the residual errors of the partial model returned by the previous iteration, that is, the difference between the previous target and \(\textit{sem}(\mathcal {I}_{i-1})\), where \(\mathcal {I}_{i-1}\) is the partial model of the \((i-1)^{th}\) iteration. Thus, at each external iteration, the fitness function evaluates the individuals differently (because the fitness function predicates on a different training set). As such, each partial model focuses on a different portion of the problem, the one that most influences the fitness value. As a result, our approach leads to dynamic targets that change at each external iteration, incorporating the semantic information. SGP-DT obtains the final solution after \(N_{\text {ext}}\) iterations with a linear combination of the form \(\sum _{i = 1}^{N_{\text {ext}}}{(a_i + b_i \cdot \mathcal {I}_i)}\), where \(a_i\) and \(b_i\) are computed with the well-known linear scaling [6]. There is a key advantage of using linear scaling: Keijzer showed that linear scaling gives a bound on the error of the individuals that are linearly scaled [6]. Therefore, SGP-DT entails a bound on the worsening of the offspring at each internal and external iteration.
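To make this iteration explicit, one can write the dynamic target recursively; here \(\hat{y}^{(i)}\) denotes the target of the \(i^{th}\) external iteration and \(\mathcal {I}_{i,ls}=a_i + b_i \cdot \mathcal {I}_i\) the linearly scaled partial model (this notation is introduced here only for illustration):

$$\begin{aligned} \hat{y}^{(1)} = \hat{y}, \qquad \hat{y}^{(i+1)} = \hat{y}^{(i)} - \textit{sem}(\mathcal {I}_{i,ls}(\overline{x})), \qquad \textit{finalModel} = \sum _{i=1}^{N_{\text {ext}}} \mathcal {I}_{i,ls} \end{aligned}$$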

To reduce the exponential bloat problem, SGP-DT performs the internal GP iterations relying on classic mutation operators only. It does not rely on any form of crossover, neither geometric nor classic, thus avoiding their fundamental limitations: geometric crossover leads to exponential bloat, and classic crossover decreases the chance of obtaining a fitness improvement because it exchanges random functionalities at random points [7]. Despite the absence of crossover, SGP-DT implicitly recombines different functionalities, similarly to a geometric crossover [5]. This is because each partial model focuses on a different characteristic of the problem, the one that the fitness function recognized as important (at that iteration). This makes the search more efficient because the evolution focuses on a single characteristic at a time, leaving the other (already optimized) characteristics unaltered.

We evaluated our approach on eight well-known regression problems. We compared SGP-DT with two baselines: lasso, a least-squares regression technique by Efron et al. [8], and \(\epsilon \)-lexicase, a state-of-the-art SGP approach by La Cava et al. [9]. The results show that our approach obtains a median RMSE over 50 runs that is, on average, 51.47% and 23.19% smaller than that of lasso and \(\epsilon \)-lexicase, respectively. Moreover, SGP-DT requires up to 9.26\(\times \) fewer tree computations than \(\epsilon \)-lexicase (4.81\(\times \) on average).

The remainder of this paper is organized as follows. Section 2 describes our approach. Section 3 discusses the related work. Section 4 reports our experimental evaluation and discusses the results. Section 5 concludes the paper.

2 Methodology

Algorithm 1. The SGP-DT approach (pseudocode not reproduced here).

Algorithm 1 overviews the SGP-DT approach. Given the values of the independent (\(\overline{x}\)) and dependent (\(\hat{y}\)) variables of the training cases, and the number of external (\(N_{\text {ext}}\)) and internal (\(N_{\text {int}}\)) iterations, it returns the final solution (finalModel).
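Since the pseudocode listing itself is not reproduced here, the following Python sketch summarizes the control flow of Algorithm 1 as described in this section. It is an illustration, not the authors' implementation: the helper functions (split, get_random_initial_population, select, mutate, elite, linear_scaling, validate_and_select) and the representation of individuals are our assumptions, while the line-number comments refer to the steps of Algorithm 1 discussed in the text.

```python
import numpy as np

def sgp_dt(x, y_hat, n_ext, n_int):
    """Sketch of Algorithm 1: repeated GP runs on dynamic (residual) targets."""
    x_train, y_train, x_val, y_val = split(x, y_hat)        # lines 1-3: hold out a validation set
    target = y_train                                        # line 4: the first dynamic target is y_hat
    models = []                                             # line 5: best (scaled) model per internal iteration
    for _ in range(n_ext):                                  # line 6: external iterations
        population = get_random_initial_population()        # line 7: fresh ramped half-and-half population
        for _ in range(n_int):                              # line 8: internal iterations (generations)
            scaled = [linear_scaling(ind, x_train, target)  # line 10: Eqs. (1)-(2)
                      for ind in population]
            fitness = [np.var(ind.eval(x_train) - target)   # line 11: Eq. (3)
                       for ind in scaled]
            best = scaled[int(np.argmin(fitness))]          # line 12: fittest individual (partial model)
            models.append(best)                             # line 14: collect the best of this generation
            survivors = elite(population, fitness)          # line 16: elitism (best individual survives)
            population = survivors + [mutate(ind)           # lines 17-19: mutation only, no crossover
                                      for ind in select(population, fitness)]
        target = target - best.eval(x_train)                # line 21: Eq. (5), residuals become the new target
    best_models = validate_and_select(models, x_val, y_val) # line 22: rolling-mean validation MSE
    return lambda x_new: sum(m.eval(x_new) for m in best_models)  # line 23: final model is the sum
```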

SGP-DT considers tree-like individuals with the usual symbols: the non-terminals \(+, -, \cdot , /\) (the protected division) and ERCs (ephemeral random constants between −1 and 1) as terminals. In addition, SGP-DT considers the functions Min and Max, which return the minimum and maximum of two numbers, respectively. The rationale for adding these two symbols is to inject discontinuity and make the linear combinations more adaptable. Although the protected division also adds discontinuity in the form of asymptotes, such discontinuity often promotes overfitting [6, 10]. With the Min and Max functions, we introduce valid discontinuity alternatives that do not suffer from the limitations of the protected division.
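As an illustration, this primitive set could be realized as follows. This is a sketch under our own assumptions: in particular, the paper does not specify which guard the protected division uses, so returning 1 on a zero divisor is only one common convention.

```python
import random

def protected_div(a, b):
    # Protected division: guard against division by zero.
    # Returning 1.0 on a zero divisor is a common convention (an assumption here).
    return a / b if b != 0 else 1.0

# Min and Max inject discontinuities without the asymptotes of protected division.
FUNCTIONS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '/': protected_div,
    'min': min,
    'max': max,
}

def random_erc():
    # Ephemeral random constant (terminal) in [-1, 1].
    return random.uniform(-1.0, 1.0)
```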

Algorithm 1 holds out a portion of the training cases for validation (lines 1–3). SGP-DT uses this validation set to construct the final solution (line 22). Lines 4–5 initialize the current target with \(\hat{y}\) and the list of the best models with the empty list. Line 6 starts the external loop, which re-assigns \(\mathcal {P}\) to a fresh, randomly generated population built with the ramped half-and-half approach (function get-random-initial-population of Algorithm 1). Starting every external iteration with a new population alleviates the overfitting problem: the syntactic structures of already evolved individuals can be too complex to adapt to a new fitness landscape or to generalize on unseen data. To further reduce overfitting and the cost of fitness evaluation, SGP-DT generates the initial population with individuals of low complexity (i.e., a few nodes).

At line 8, SGP-DT starts the \(N_{\text {int}}\) internal iterations, which resemble classic GP but with the addition of linear scaling and the absence of crossover. Before line 11 computes the fitness of each individual \(\mathcal {I}\) in \(\mathcal {P}\), line 10 performs the linear scaling of \(\mathcal {I}\) [6]. Linear scaling has the advantage of transforming the semantics of individuals so that their potential fit with the current target is immediately given: we do not need to wait for GP to produce a partial model that reaches the same result [6]. Thus, linear scaling reduces the number of both external and internal iterations. Fewer iterations mean populations with simpler structural complexity and less computational cost. Reducing the complexity of the solutions may reduce overfitting [11].

Linear scaling has another important property: it gives an upper bound on the error [6]. Recall that SGP-DT considers errors on dynamic targets, which change at each iteration (at the first iteration the dynamic target is \(\hat{y}\)). To exploit this property, we propose a fitness function based on this upper bound. Following Keijzer [6], we compute the linear scaling of an individual \(\mathcal {I}\) as follows:

$$\begin{aligned} \mathcal {I}_{ls} = a + b \cdot \mathcal {I} \end{aligned}$$
(1)
$$\begin{aligned} \text {where} \quad a=\overline{\hat{y}} -b \cdot \overline{y} \quad \text {and} \quad b=\dfrac{\sum _{i=1}^{m} [(\hat{y}_i- \overline{\hat{y}})\cdot (y_i- \overline{y}) ]}{\sum _{i=1}^{m}[(y_i- \overline{y})^2]} \end{aligned}$$
(2)

We define the following fitness function of an individual \(\mathcal {I}\):

$$\begin{aligned} \textit{fitness}(\mathcal {I}) = \sigma ^2(\textit{sem}(\mathcal {I}_\text {ls}(\overline{x})) - \hat{y}) \end{aligned}$$
(3)

The rationale of this function is that the Mean Square Error (MSE) of \(\mathcal {I}_{ls}\) has the variance (\(\sigma ^2\)) of the current target as an upper bound [12]:

$$\begin{aligned} \textit{MSE} = \dfrac{ \sum _{i=1}^{m} (y_i - \hat{y}_i)^2 }{m} \le \sigma ^2(\hat{y}) \end{aligned}$$
(4)

where m is the number of training cases and \(y_i\) is the output of the linearly scaled individual \(\mathcal {I}_{ls}\) on the \(i^{th}\) case.
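The following is a compact sketch of how Eqs. (1)–(4) translate into code for a single individual, given its raw outputs on the training cases and the current target (NumPy-based; the function names are ours):

```python
import numpy as np

def scaling_coefficients(y, target):
    """Eq. (2): slope b and intercept a that best fit the raw outputs y to the target."""
    y_mean, t_mean = y.mean(), target.mean()
    denom = np.sum((y - y_mean) ** 2)
    # Guard against constant outputs (our addition, not specified in the paper).
    b = np.sum((target - t_mean) * (y - y_mean)) / denom if denom != 0 else 0.0
    a = t_mean - b * y_mean
    return a, b

def fitness(y, target):
    """Eq. (3): variance of the residual error of the linearly scaled individual."""
    a, b = scaling_coefficients(y, target)
    y_scaled = a + b * y                 # Eq. (1)
    return np.var(y_scaled - target)     # by Eq. (4), the MSE of y_scaled is at most var(target)
```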

At each new external iteration the residual error becomes the new target (line 21):

$$\begin{aligned} \textit{target}= \hat{y} - \textit{sem}(\mathcal {I}^\star _{\textit{ls}}(\overline{x})) \end{aligned}$$
(5)

where \(\textit{sem}(\mathcal {I}^\star _{\textit{ls}}(\overline{x}))\) is the evaluation of the best individual at the current iteration, which we call partial model.

Inequality (4) does not guarantee that the external iterations converge to a lower MSE, because we do not know whether \(\sigma ^2(\textit{error}) \le \sigma ^2(\hat{y})\), where \(\textit{error} = \textit{target} - \textit{sem}(\mathcal {I}^\star _{\textit{ls}}(\overline{x}))\). Thus, by optimizing the variance of the error as in Eq. 3, we act directly on the minimization of this upper bound, so that the next external iteration can benefit from a lower bound.

At lines 17–19, Algorithm 1 runs a classic GP algorithm without crossover, using only mutations. We use a tree-based mutation operator because SGP-DT uses trees as syntactic structures for the individuals. The operator randomly generates a subtree from a randomly chosen node. To increase the synergy with linear scaling, we set two constraints on mutation. First, the node selection is biased towards the leaves of the tree, so that the mutated tree does not diverge too much from the original semantics (locality principle). Producing a mutation that is close to the original semantics of the tree preserves the validity of the selection performed after linear scaling, and thus we only allow minor changes to improve the fitness. Second, for the same reason, the mutation is biased towards replacing the selected node with a sub-tree of limited depth. Note that we decided not to limit the maximum size (number of nodes) or depth of an individual. By doing so, GP can grow and choose the right solution complexity for the problem at hand. These two constraints help us to mitigate the overfitting and bloat problems without preventing SGP-DT from effectively searching for competitive individuals, as linear scaling helps GP to find useful individuals (thanks to the upper bound). Moreover, additional external iterations will further refine other aspects of the problem not yet addressed.
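A sketch of this biased subtree mutation is shown below, assuming a simple node-based tree representation. The 70% leaf bias and the maximum replacement depth of five are the values reported in Sect. 4.2; all helper names (leaves, internal_nodes, replace_with, random_subtree) are ours.

```python
import random

LEAF_BIAS = 0.70          # probability of mutating at the leaf level (Sect. 4.2)
MAX_SUBTREE_DEPTH = 5     # maximum depth of the randomly generated replacement subtree

def mutate(tree):
    """Replace one node of `tree` with a small random subtree, biased towards the leaves."""
    new_tree = tree.copy()
    if random.random() < LEAF_BIAS or not new_tree.internal_nodes():
        node = random.choice(new_tree.leaves())           # small semantic change (locality principle)
    else:
        node = random.choice(new_tree.internal_nodes())   # occasionally allow larger changes
    depth = random.randint(1, MAX_SUBTREE_DEPTH)
    node.replace_with(random_subtree(depth))              # fresh random subtree of limited depth
    return new_tree
```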

We decided to exclude the classic crossover operator from the internal iterations, as several researchers have questioned the effectiveness of crossover in relation to the problem of modularity in GP [13]. There is a consensus that an effective GP algorithm needs a crossover that preserves the semantics of the parts swapped among individuals, respecting the boundaries of a useful functionality within the individual’s structure [2, 7, 14]. According to McPhee et al. [4] and Ruberto et al. [11], most classic crossover operators do not obtain a meaningful variation (or any variation at all) in the program semantics when dealing with Boolean and real-valued symbolic regression domains. The main issue is that classic crossover operators do not preserve a common context [4] among the building blocks of the individuals exchanged during crossover, which is important to increase the chance of obtaining a semantically meaningful offspring [14]. The idea of determining a common context was introduced by Poli and Langdon with the one-point crossover operator [7], but how to identify a meaningful common context among tree structures is still an open problem.

Instead, SGP-DT exchanges functionalities among individuals by relying on the linear combination of the partial models (i.e., the fittest individuals at each external iteration, line 12 of Algorithm 1) and on a specific mechanism for selecting and mutating the individuals during the GP runs. In light of this, we exclude the crossover operators in the presence of these semantic recombination alternatives. To have an effective exchange of functionalities among individuals we need to: (i) preserve the semantics of building blocks; (ii) preserve the context of building blocks; (iii) make the exchange of functionalities directed towards producing new and interesting semantics. SGP-DT achieves these objectives by (i) mapping each building block to a single partial model (which avoids arbitrary fragmentation of the blocks); (ii) preserving the context of the building blocks, because in our scenario the partial models obtained at previous iterations represent the context; and (iii) using mutation only, which promotes diversity in the population. Despite the absence of crossover, SGP-DT exchanges building blocks because each partial model is a building block. Differently from classic crossover, which exchanges random fragments, SGP-DT obtains the final model by summing the linearly scaled partial models. This approach makes the exchange of functionalities more effective, as each partial model (building block) characterizes a specific functionality.

The for-loop at line 6 terminates when SGP-DT concludes all external iterations. We decided not to introduce a different stopping criterion based on the stagnation of fitness improvement, because it is difficult to predict whether the fitness will escape stagnation in future iterations. After all the external iterations, the function validate-and-select at line 22 of Algorithm 1 returns the partial models that will be combined into the final solution. Such models are selected as follows. The validation takes as input the ordered sequence of best individuals (models) collected after each internal iteration (line 14 of Algorithm 1) and the validation sets (\(\overline{x}_\textit{val}\) and \(\hat{y}_\textit{val}\)) obtained at line 1. Note that SGP-DT saves the computed linear scaling parameters (a and b of Eq. (2)) at line 10 and does not recompute them during the validation and test phases. Internally, the validation scans the sequence of models and progressively computes the MSE by evaluating the individuals on the validation set, to find the point in the sequence where the MSE is smallest. SGP-DT finds the smallest MSE using the rolling mean of the validation-set error with a fixed window size, to minimize short-term fluctuations. The function validate-and-select returns the sequence (bestModels) of the partial models that were produced before the smallest MSE. Such a sequence represents the transformation chain of the dynamic targets. In case SGP-DT obtained the model with the smallest MSE during the internal iterations, it appends this individual at the end of bestModels. Line 23 of Algorithm 1 computes the final model by summing all the models in bestModels.
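One way to read the validate-and-select step in code is the following sketch: it progressively sums the (already scaled) partial models, tracks the validation MSE of the running sum, smooths it with a rolling mean (window size 20, as in Sect. 4.2), and keeps the prefix of models up to the minimum. The names and the cumulative-sum interpretation are ours.

```python
import numpy as np

def validate_and_select(models, x_val, y_val, window=20):
    """Return the prefix of `models` whose summed prediction minimizes the smoothed validation MSE."""
    prediction = np.zeros_like(y_val, dtype=float)
    mse = []
    for model in models:                            # models keep the a, b saved at line 10
        prediction = prediction + model.eval(x_val)
        mse.append(np.mean((y_val - prediction) ** 2))
    # Rolling mean over a fixed window to smooth out short-term fluctuations.
    kernel = np.ones(window) / window
    smoothed = np.convolve(np.asarray(mse), kernel, mode='valid')
    best = int(np.argmin(smoothed)) + window - 1    # index back into the original sequence
    return models[:best + 1]
```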

3 Related Work

This section divides the related work of SGP-DT into three groups. Each group refers to techniques that are relevant to a main characteristic of SGP-DT: (i) having dynamic or semantic objectives, (ii) using linear combinations or geometric operators, (iii) using an iterative approach on residual errors.

Dynamic or Semantic Objectives. The GP techniques proposed by Krawiec et al. [15] and Liskowski et al. [16] present semantic approaches that consider interactions between individuals and the training set. These approaches cluster such interactions to derive new targets for a multi-objective GP.

Otero et al. proposed an approach with dynamic objectives that combines intermediate solutions in a final Boolean tree [17]. This technique progressively eliminates from the training cases the ones perfectly predicted from the current intermediate solution and operates exclusively in a Boolean domain.

BPGP, by Krawiec and O’Reilly [18], explicitly models the semantic behavior of a solution during the computation of training cases. BPGP proposes an operator that mutates an individual by replacing a randomly selected sub-tree with a random one. According to Krawiec and O’Reilly, this “mutation-like” [18] operator is intended as a “form of crossover”. We think this is similar in principle to our design choice of dropping crossover altogether and instead choosing among mutated alternatives in the population. However, Krawiec and O’Reilly still use the traditional crossover alongside this new mutation [18].

We differ from all of these techniques because we build our solution progressively, crystallizing the intermediate achievements. Most of these approaches use auxiliary objectives during their search and use a single GP run. Conversely, SGP-DT uses a non-predetermined number of objectives in subsequent GP runs. The approach of Otero et al. [17] is the only one that progressively builds the solution, but it uses a strategy that works for Boolean trees only.

Linear Combinations. MRGP [19] uses multiple linear regression to combine the semantics of sub-programs (subtrees) to form the semantic of an individual.

Ruberto et al. proposed ESAGP [20], which derives the target semantics by relying on a specific linear combination between two “optimally aligned” individuals in the error space. Leveraging such geometric alignment property, Vanneschi et al. proposed NA-GP [21], which performs linear combinations between two aligned chromosomes belonging to the same individual.

Gandomi et al. proposed MGGP [22], where each individual is composed of multiple trees. MGGP produces the final solution with a linear combination of the tree’s semantics, deriving the values of the coefficients from the training data with a classic least squares method. However, the number of trees in the linear combination is fixed and the fitness landscape is not dynamic.

Moraglio et al. proposed the Geometric Semantic GP (GSGP) crossover operator [5], which uses linear combinations to guarantee offspring that is not worse than the worst of the parents. Unfortunately, GSGP suffers from the exponential bloat problem and requires many generations to converge, especially if the target is not in the convex hull spanned by the initial population [5].

Notably, all the approaches described in this second group use a single run to search for the final solution. Differently from SGP-DT, they fix the number of components in advance (the only exception is GSGP, but it suffers from the exponential bloat problem [5]). In addition, all of the techniques in the first and second groups have a static target, and thus they continuously evolve a population without re-initialization. This limits the diversity of the genetic alternatives when the population converges at later generations. Conversely, SGP-DT has a dynamic target and it starts with a fresh population at each external iteration (see Algorithm 1).

Iterative Approaches Based on Residual Errors. Sequential Symbolic Regression (SSR) [23] uses the GSGP crossover operator [5] to iteratively transform the target using a semantic distance that resembles the classical residual approach. However, no statistical difference (in the errors) from the classical GP approach was found [23]. Differently from SGP-DT, SSR considers residuals without optimizing the linear combinations with a least-squares method. Although SSR overcomes the exponential bloat, it weakens the advantage of using residuals.

Medernach et al. presented the wave technique [24, 25], which, similarly to SGP-DT, executes multiple GP runs using the same definition of residual errors (Eq. 5) and obtains the final model by summing the intermediate models. wave produces a sequence of short and heterogeneous GP runs, obtained by “fuzzing” the settings of system parameters (e.g., population size, number of internal iterations) and by alternating the use of linear scaling. However, SGP-DT drastically differs from wave. The heterogeneous nature of wave emulates a dynamic evolutionary environment by simulating periods of rapid change [24, 25]. The effectiveness of such an approach requires specific combinations of system parameters that converge to a fitter solution. Due to the huge space of possible system parameters, finding such combinations often requires a large number of iterations [24, 25]. Conversely, SGP-DT steers the evolution with a novel approach that gradually evolves the building blocks of the final solution without exploring the huge space of possible combinations of system parameters.

All the techniques of this group use residuals differently from SGP-DT. Moreover, they rely on the classic or geometric crossover. Conversely, one of the key novel aspects of SGP-DT is to avoid crossover altogether.

Table 1. Data sets of regression problems.

4 Evaluation

Data Sets. We performed our experiments on eight well-known data sets of regression problems that have been used to evaluate most of the techniques discussed in Sect. 3 [9, 19, 21, 22, 24, 25]. Table 1 shows the name, number of attributes, and number of instances for each data set. For uball5d we followed the same configuration used by La Cava et al. [28].

4.1 Methods

We compared SGP-DT with two techniques (lasso  [8] and \(\epsilon \)-lexicase [9]) and two variants of SGP-DT (DT-EM and DT-NM).

LASSO. Both SGP-DT and lasso [8] use the least-squares regression method to linearly combine solution components. More specifically, lasso incorporates a regularization penalty into least-squares regression using an \( \ell _1\) norm of the model coefficients and uses a tuning parameter \(\lambda \) to specify the weight of this regularization [8]. We relied on the lasso implementation by Efron et al. [8], which automatically chooses \(\lambda \) using cross-validation.

\(\epsilon \)-LEXICASE. This evolutionary technique adapts the lexicase selection operator for continuous domains [9]. The idea behind \(\epsilon \)-lexicase selection is to promote candidate solutions that perform well on unique subsets of samples in the training set, and thereby maintain and promote diverse building blocks of solutions [9]. Each parent selection begins with a randomized ordering of both the training cases and the solutions in the selection pool (i.e., population). Individuals are iteratively removed from the selection pool if they are not within a small threshold (\(\epsilon \)) of the best performance in the pool on the current training sample. The selection procedure terminates when only one individual is left in the pool, or when the remaining individuals have tied performance; in the latter case, a random one is chosen. The recent study of Orzechowski et al. shows that \(\epsilon \)-lexicase [9] outperforms many GP-inspired algorithms [29]. We relied on the publicly available implementation of \(\epsilon \)-lexicase, ellyn, which uses stochastic hill climbing to tune the scalar values of each generated individual. It also relies on a 25% validation hold-out from the training data to choose the final model from a bi-dimensional Pareto archive, which ellyn constantly updates during the evolution. The two dimensions are the number of nodes and the fitness.
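For illustration, a single \(\epsilon \)-lexicase parent selection can be sketched as follows. The error-matrix layout and the names are ours; how \(\epsilon \) is computed per training case follows La Cava et al. [9] and is taken as an input here.

```python
import random
import numpy as np

def epsilon_lexicase_select(error, epsilon):
    """error[i, j]: error of individual i on training case j; epsilon[j]: threshold for case j."""
    pool = list(range(error.shape[0]))               # start from the whole selection pool
    cases = list(range(error.shape[1]))
    random.shuffle(cases)                            # randomized ordering of the training cases
    for j in cases:
        best = min(error[i, j] for i in pool)
        pool = [i for i in pool if error[i, j] <= best + epsilon[j]]  # keep near-elite individuals
        if len(pool) == 1:                           # a single winner remains
            return pool[0]
    return random.choice(pool)                       # remaining individuals are tied: pick at random
```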

DT-EM. We considered a variant of SGP-DT (called DT-EM) whose only difference from SGP-DT is a modified fitness function:

$$\begin{aligned} \textit{fitness}(\mathcal {I}) = \textit{MSE} = \dfrac{ \sum _{i=1}^{m} (y_i - \hat{y}_i)^2 }{m} \end{aligned}$$
(6)

While the original fitness function of SGP-DT (Eq. 3) minimizes the upper bound of the MSE, this function directly minimizes the MSE (Eq. 6). This variant helps to evaluate the impact of a direct error minimization with respect to a more qualitative and indirect measure of the error, such as the variance (\(\sigma ^2\)).

DT-NM. We considered another variant, called DT-NM, that excludes the Min and Max non-terminal symbols (as the only difference with SGP-DT), thus evaluating the advantage of different discontinuity types during the evolution.

4.2 Evaluation Setup

Following the setup of Orzechowski et al. [29] for \(\epsilon \)-lexicase, we set for all four GP techniques (SGP-DT, \(\epsilon \)-lexicase, DT-EM, and DT-NM) a population size of 1,000 and a budget of 1,000 generations. We ran 50 trials for every technique on each data set, using 25% of the data for testing and 75% for training.

SGP-DT and its two variants share the same configuration: we divided the 1,000 generations into 20 external iterations (\(N_{\text {ext}}= 20\)), and thus the number of internal iterations (\(N_{\text {int}}\)) is 50. We used ramped half-and-half initialization up to a maximum depth of four (function get-random-initial-population at line 7 of Algorithm 1). The probability of mutation is 100% and the maximum depth of the sub-trees generated by the mutation operator is five. The probability that a sub-tree mutation happens at the leaf level is 70%. We set no limits on the number of nodes in the trees or on the depth of the trees. We set elitism to keep only the best individual at each internal iteration (function elite at line 16 of Algorithm 1). We obtained the validation set by extracting 10% of the training cases (function split at line 1 of Algorithm 1). The fixed window size for the rolling mean is 20. We chose this configuration after a preliminary tuning phase and kept it uniform across all eight data sets.
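For reference, the configuration above can be summarized as a single dictionary (a plain restatement of the values reported in this subsection; the key names are ours):

```python
SGP_DT_CONFIG = {
    "population_size": 1000,
    "external_iterations": 20,       # N_ext
    "internal_iterations": 50,       # N_int (20 x 50 = 1,000 generations in total)
    "init_method": "ramped half-and-half",
    "init_max_depth": 4,
    "mutation_probability": 1.0,
    "mutation_subtree_max_depth": 5,
    "mutation_leaf_bias": 0.70,      # probability of mutating at the leaf level
    "tree_size_limit": None,         # no limit on the number of nodes or on the depth
    "elitism": 1,                    # keep only the best individual per internal iteration
    "validation_holdout": 0.10,      # fraction of training cases held out for validation
    "rolling_mean_window": 20,
}
```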

4.3 Results and Discussion

Errors’ Comparison. Following previous work, we use the Root Mean Square Error (RMSE) to evaluate the final solution on the test set. The first five columns of Table 2 show, for each technique, the median RMSE of the 50 trials. The last four columns of Table 2 indicate the percentage decrease of the RMSE medians with respect to the competitor techniques. A positive percentage value means that the RMSE median of SGP-DT is lower (i.e., better), while a negative value means a worse median RMSE. Figure 1 shows the box plots of the RMSE values of the 50 trials. When comparing the RMSE values, we performed a non-parametric pairwise Wilcoxon rank-sum test with Holm correction for multiple testing, with a confidence level of 95% (p-value < 0.05).
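As an illustration of this statistical procedure (not the authors' code), the pairwise comparisons can be carried out as follows, assuming rmse maps each technique to its 50 per-trial RMSE values:

```python
from itertools import combinations
from scipy.stats import ranksums
from statsmodels.stats.multitest import multipletests

def pairwise_wilcoxon_holm(rmse, alpha=0.05):
    """rmse: dict mapping technique name -> list of per-trial RMSE values on one data set."""
    pairs = list(combinations(rmse.keys(), 2))
    pvalues = [ranksums(rmse[a], rmse[b]).pvalue for a, b in pairs]
    # Holm correction for multiple testing at the 95% confidence level.
    reject, corrected, _, _ = multipletests(pvalues, alpha=alpha, method="holm")
    return {pair: (p, bool(sig)) for pair, p, sig in zip(pairs, corrected, reject)}
```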

Table 2. Median RMSE of the 50 trials.

SGP-DT achieves a smaller RMSE than lasso for all the data sets, always with statistical significance. The decrease of the RMSE medians ranges from 9.06% for housing to 88.67% for yacht (51.47% on average). SGP-DT has smaller RMSE medians than \(\epsilon \)-lexicase for all data sets but housing (decrease −4.48%). This is the only comparison between SGP-DT and \(\epsilon \)-lexicase without statistical significance. The decrease of the RMSE medians ranges from −4.48% for housing to 57.07% for ench (23.19% on average). This is a remarkable result considering that \(\epsilon \)-lexicase outperforms many GP-inspired algorithms [29]. Compared with the variant DT-EM, SGP-DT achieves statistically significant differences only on the data sets uball5d and yacht, with percentage decreases of 6.63% and 20.45%, respectively. For these data sets SGP-DT performs better than DT-EM, indicating that our fitness function that minimizes the upper bound achieves a better final solution. SGP-DT has statistically significant differences in the median RMSE with DT-NM only on the data sets airfoil, tower, and uball5d. SGP-DT performs better than DT-NM on the airfoil and tower data sets: 3.94% and 10.12% percentage decrease, respectively. This means that the Min and Max non-terminal symbols provide an advantage only in these two data sets. However, Fig. 1 indicates that using such non-terminal symbols does not penalize the outcome in any other data set, except for uball5d, where the difference is statistically significant (the decrease is −7.87%).

Fig. 1. RMSE on the test set for all the techniques and all eight data sets.

Error Comparison with Related Work. Unfortunately, the implementation of wave [24, 25] is not publicly available, and thus a direct comparison would be difficult. We extracted the median RMSE from the GECCO 2016 paper [25] for our two common subjects: 4.1 (concrete) and 8.7 (yacht). SGP-DT achieves a median RMSE percentage decrease of 25.17% (concrete) and 75.12% (yacht); see Table 2 for the reference values. Note that the computational cost reported in the GECCO paper has the same order of magnitude as that of SGP-DT.

From the paper of Vanneschi et al. [21], we extracted the median RMSE on the concrete data set for the following GP techniques: 10.44 (NA-GP [21]), 8.1 (NA-GP-50 [21]), 12.50 (GSGP [5]), and 9.43 (GSGP-LS [30]). SGP-DT has a percentage decrease of 37.64%, 19.62%, 47.92%, and 30.96%, respectively. These results are only indicative because their evaluation setup differs from ours.

Computational Effort. To evaluate the computational effort of the evolutionary techniques, we decided not to rely on execution time because it depends on implementation details. Instead, we relied on the total number of evaluated nodes (lasso not being a GP technique, this metric is not applicable to it). Both SGP-DT and \(\epsilon \)-lexicase operate on nodes: SGP-DT on tree-like data structures, \(\epsilon \)-lexicase on stack-based ones. Following Ruberto et al. [11], we count a node operation every time a technique evaluates a node, regardless of the purpose of the operation (e.g., mutation, fitness computation). We excluded the computational effort of linear scaling because it does not perform operations on nodes; it has a linear computational cost of \(\mathcal {O}(m\cdot P)\), where m is the size of the training set and P the population size. For comparing the numbers of evaluated nodes, we used the Wilcoxon rank-sum test with Holm correction for multiple testing, with a confidence level of 95% (p-value < 0.05). The test shows that all the comparisons between each pair of techniques are statistically significant, except the comparison between SGP-DT and DT-NM on subject uball5d.

Table 3. Median number of evaluated nodes and reduction ratio of SGP-DT.

Table 3 reports the median number of nodes (over the 50 runs) that the GP techniques evaluate to produce the final solution. The last three columns of Table 3 report the ratio of the number of node evaluations of \(\epsilon \)-lexicase, DT-EM, and DT-NM to that of SGP-DT. A ratio greater (lower) than one means that SGP-DT evaluates a lower (higher) number of nodes. Compared with \(\epsilon \)-lexicase, SGP-DT reduces the number of node evaluations by a factor between 4.01\(\times \) and 9.26\(\times \), while obtaining statistically significantly better RMSE values than \(\epsilon \)-lexicase for seven out of eight data sets. This result can be explained by the facts that (i) SGP-DT computes only a fraction of the entire solution (a partial model) at a time; and (ii) the size of the individuals is kept to a minimum (see Sect. 2).

The numbers of evaluated nodes of SGP-DT and DT-EM are almost identical (0.99\(\times \) on average). This indicates that guiding the evolution with the fitness function of SGP-DT or with that of DT-EM yields the same computational cost, but SGP-DT achieves a better median RMSE (5.39% on average). DT-NM always evaluates fewer nodes than SGP-DT (0.77\(\times \) on average).

Size of the Final Solutions. SGP-DT has no limit on the maximum complexity of the individuals, while \(\epsilon \)-lexicase has a limit of 50 nodes, because at higher limits the computational effort of \(\epsilon \)-lexicase becomes prohibitively expensive [9]. SGP-DT produces solutions with sizes ranging from 442 to 1,184 nodes (760 on average), which is on average 15\(\times \) larger than those produced by \(\epsilon \)-lexicase, but not large enough to be considered (exponential) bloat. This extra complexity of the final solutions contributes positively to the performance of the algorithm. We are investigating a post-processing phase to simplify the final solutions.

On average, DT-EM produces solutions with 806 nodes and DT-NM with 591. DT-NM generates smaller solutions than DT-EM; this could be due to the fact that DT-NM has a smaller search space (DT-NM omits the Min and Max symbols). Evaluating smaller solutions requires less computation, which explains why DT-NM requires less computation than SGP-DT and DT-EM (see Table 3).

Overfitting. Figure 2 plots, for each data set, the median of the best RMSE by computational effort (number of evaluated tree nodes) for SGP-DT and its two variants. Unfortunately, the implementation of \(\epsilon \)-lexicase that we used does not report the intermediate RMSE on the test set. We use the computational effort, rather than the number of generations, for a fair comparison of the three techniques, because the number of evaluated nodes is not uniform across the generations.

Fig. 2. Median RMSE of the best-so-far solution on the test set, by computational effort.

The eight plots indicate that SGP-DT slightly overfits on the data sets tower and yacht, while on housing it produces substantial overfitting, which is comparable to that of DT-EM but less severe than that of DT-NM. DT-EM overfits four data sets: airfoil (Fig. 2a), housing (Fig. 2e), tower (Fig. 2f), and yacht (Fig. 2h). The worst performance is from DT-NM, which shows severe overfitting on airfoil (Fig. 2a), housing (Fig. 2e), tower (Fig. 2f), and yacht (Fig. 2h). Note that all three techniques overfit on the data sets yacht (Fig. 2h) and housing (Fig. 2e). This can be explained by their relatively low number of instances (see Table 1).

For the data sets concrete (Fig. 2b), enc (Fig. 2c), and enh (Fig. 2d), none of the three techniques manifests overfitting (yet). Interestingly, in these three cases DT-NM reaches a low RMSE with less computation than SGP-DT and DT-EM. We conjecture that this is because concrete, enc, and enh are problems that do not need the additional expressiveness of the Min and Max symbols.

DT-NM is the technique that yields the smallest individuals, so we would expect less overfitting. Surprisingly, this is not the case. We believe that, to compensate for the absence of the discontinuities that Max and Min introduce, DT-NM uses the protected division more frequently. This may lead to many asymptotic discontinuities, which are known to increase overfitting [6].

When considering each data set individually, SGP-DT and DT-EM mostly manifest similar overfitting, while DT-NM manifests overfitting much earlier. This suggests that (i) the non-terminal symbols Max and Min help to alleviate the overfitting problem; and (ii) relying on the variance (SGP-DT) rather than the MSE (DT-EM) in the fitness function indeed contributes to reducing the RMSE (5.39% on average, see Table 2) but does not influence overfitting.

5 Conclusion

In this paper, we proposed SGP-DT, a new evolutionary technique that dynamically discovers and resolves intermediate dynamic targets. Our key intuition is that the synergy of linear scaling and mutation helps to exchange good genetic material during the evolution. Notably, SGP-DT does not rely on any form of crossover, and thus does not suffer from its intrinsic limitations [2, 7]. Our experimental results confirm our intuitions and show that SGP-DT outperforms \(\epsilon \)-lexicase with both lower RMSE and less computational cost. This is a promising result, as \(\epsilon \)-lexicase outperforms many GP-inspired algorithms [29].

This paper sparks interesting future work:

We do not perform any type of post-processing of the final solutions to reduce their size. Indeed, the solutions may contain redundant elements. We are currently investigating a post-processing step to minimize the size of the final solutions.

A possible future research direction is to automatically identify the proper number of iterations of SGP-DT. Indeed, problems with different complexity and nature may require a different number of external and internal iterations.