1 Introduction

One of the main objectives of soil classification is to determine the suitability of a soil for the construction of different structures such as dams, embankments, and subgrades. A range of complex factors affects the naming of soils, because soils rarely occur in nature as pure sand, gravel, or any other single component; they are mostly found as mixtures with varying proportions of particles of different sizes [1]. For instance, sandy clay has most of the properties of clay but contains a significant amount of sand. A soil is given the name of the constituent that appears to have the most significant influence on its behavior. The behavior of a soil mass under load depends strongly on the constituents present in the mass, its density, its degree of saturation, and the environmental conditions. Accordingly, it is desirable to develop predictive models that are able to evaluate the classification of a soil and overcome the limitations of existing classification systems by considering all factors related to soil formation.

Genetic programming (GP) [2, 3] is a developing subarea of evolutionary algorithms [4] inspired by Darwin’s theory of evolution. GP may generally be defined as a supervised machine learning technique that searches a program space instead of a data space [2]. Recently, a particular variant of GP that uses a linear representation of chromosomes, namely multi expression programming (MEP) [5], has been proposed. MEP has the special ability to encode multiple computer programs for a problem in a single chromosome. Based on numerical experiments, the MEP approach has been reported to significantly outperform similar techniques and can be utilized as an efficient alternative to traditional tree-based GP [6]. Despite the significant advantages of MEP, little scientific effort has been directed at applying it to civil engineering tasks [7–9].

The main purpose of this paper is to utilize the MEP technique to obtain formulas for the determination of soil classification. A comparison between the results of the proposed formulas and those of existing models in the literature was conducted. A reliable database including previously published soil classification test results was utilized to develop the models.

2 Review of previous studies

Artificial neural networks (ANNs) are a branch of computational intelligence techniques [10] that have been applied successfully to the soil classification problem [11–13]. Despite their successful performance, ANNs have some fundamental disadvantages that limit their wider use. ANNs are black-box models that usually do not provide an explicit function for calculating the outcome from the input values. Hence, they do not provide a better understanding of the nature of the derived relationship between the interrelated input and output data. An ANN offers only its final synaptic weights, from which the outcome is obtained in a parallel manner. The ANN approach is therefore appropriate as part of a computer program but is not suitable for practical hand calculations.

There has been only limited research with the specific objective of introducing explicit formulas for soil classification by means of ANNs. Rajasekaran and Amalraj [13] built an empirical model using a sequential learning approach (SLA) for single hidden radial basis function (RBF) neuron neural networks proposed by Zhang and Morris [14]. They developed a sequential learning neural network (SLNN) model for the prediction of soil classification. They introduced the following equations based on experimental results and using the values of the weights obtained from neural network training to predict the soil classification (SC):

$$ {\text{SC}}_{\text{SLNN}} = 2,653.92e^{{ - W_{1} }} $$
(1)

and

$$ W_{1} = {\frac{1}{{0.80439^{2} }}}\left[ {\left( {x_{1} - 1.1843} \right)^{2} + \left( {x_{2} - 1.0463} \right)^{2} + \left( {x_{3} - 0.8604} \right)^{2} + \left( {x_{4} - 1.0218} \right)^{2} + \left( {x_{5} - 0.2890} \right)^{2} + \left( {x_{6} - 2.3476} \right)^{2} } \right] $$
(2)

where \( x_{1}, \ldots, x_{6} \) are the six input parameters to the model: \( x_{1} \) is the color of soil; \( x_{2} \) the percentage of gravel; \( x_{3} \) the percentage of sand; \( x_{4} \) the percentage of fine-grained particles; \( x_{5} \) the liquid limit (%); and \( x_{6} \) the plastic limit (%). For inputs to the SLNN network, the following rule was used for the color of the soil. 0.1: brown; 0.2: brownish gray; 0.3: grayish brown; 0.5: reddish yellow; 0.7: yellowish red.
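Equations 1 and 2 can be evaluated directly once the six inputs are available. The following is a minimal sketch, not the implementation used in [13]: the constants are copied from Eqs. 1 and 2, while the function name `slnn_sc` and the assumption that the inputs arrive already scaled as the network expects are illustrative only.

```python
import math

# RBF centre, width, and amplitude copied from Eqs. 1 and 2.
# Assumption: the inputs x1..x6 are already scaled the way the SLNN expects;
# the scaling itself is not reproduced here.
CENTRE = (1.1843, 1.0463, 0.8604, 1.0218, 0.2890, 2.3476)
WIDTH = 0.80439
AMPLITUDE = 2653.92


def slnn_sc(x):
    """Evaluate Eqs. 1 and 2 for a sequence of six inputs x1..x6."""
    if len(x) != len(CENTRE):
        raise ValueError("expected six inputs")
    # W1: squared Euclidean distance to the RBF centre, divided by width^2.
    w1 = sum((xi - ci) ** 2 for xi, ci in zip(x, CENTRE)) / WIDTH ** 2
    return AMPLITUDE * math.exp(-w1)
```

The output decays exponentially with the squared distance between the input vector and the single RBF centre, which is why the whole model reduces to one closed-form expression.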

The output of the network is the classification of soil, which is given as: 0.1: clayey soil; 0.2: clay with medium compressibility; 0.3: clay of low compressibility; 0.6: silt with medium compressibility. It should be noted that the data used for training and testing the SLNN model described above were taken from Suresh [12]. The same database has been utilized in the present study to develop the models.

3 Genetic programming

GP is one of the branches of evolutionary methods that creates computer programs to solve a problem using the principle of Darwinian natural selection. GP was introduced by Koza [2] as an extension of genetic algorithms, in which programs are represented as tree structures and expressed in the functional programming language LISP [2]. A comprehensive description of GP can be found in [2, 3]. GP has been applied successfully to a number of civil engineering problems [15–19].

3.1 Multi expression programming

MEP is a subarea of GP that was developed by Oltean and Dumitrescu [5]. MEP uses linear chromosomes for solution encoding and has the special ability to encode multiple solutions (computer programs) of a problem in a single chromosome. According to the fitness values of the individuals, the best encoded solution is chosen to represent the chromosome. Compared with other GP variants that store a single solution in a chromosome, MEP does not increase the complexity of the decoding process, except in cases where the set of training data is not known a priori [6]. The steady-state MEP algorithm starts by creating a random population of individuals. To evolve the best expression from a data file of inputs and outputs over a specified number of generations, MEP repeats the following steps until a termination condition is reached [20]:

  • Selecting two parents by using a binary tournament procedure and recombining them with a fixed crossover probability.

  • Obtaining two offspring by the recombination of two parents.

  • Mutating the offspring and replacing the worst individual in the current population with the best of them (if the offspring is better than the worst individual in the current population).
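The three steps above can be sketched as a single steady-state iteration. This is an illustrative outline under stated assumptions, not the modified C++ code used in the study: lower fitness is taken to be better, recombination is assumed to be one-point crossover at gene boundaries, the crossover probability `p_cross=0.9` is a placeholder, and `toy_fitness` and `toy_mutate` are hypothetical stand-ins operating on lists of integers rather than real MEP genes.

```python
import random


def binary_tournament(population, fitness, rng):
    """Pick two distinct random individuals and return the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) <= fitness(b) else b


def one_point_crossover(p1, p2, rng):
    """Assumed recombination operator: swap the tails after a random cut."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def steady_state_step(population, fitness, mutate, rng, p_cross=0.9):
    """One iteration: select, recombine, mutate, replace the worst."""
    parent1 = binary_tournament(population, fitness, rng)
    parent2 = binary_tournament(population, fitness, rng)
    if rng.random() < p_cross:
        child1, child2 = one_point_crossover(parent1, parent2, rng)
    else:
        child1, child2 = list(parent1), list(parent2)
    # Keep only the better of the two mutated offspring.
    best_child = min(mutate(child1, rng), mutate(child2, rng), key=fitness)
    worst = max(range(len(population)), key=lambda i: fitness(population[i]))
    # Replace the worst individual only if the offspring is strictly better.
    if fitness(best_child) < fitness(population[worst]):
        population[worst] = best_child
    return population


def toy_fitness(individual):
    """Hypothetical fitness: sum of absolute gene values (0 is ideal)."""
    return sum(abs(g) for g in individual)


def toy_mutate(individual, rng, p_mut=0.1):
    """Hypothetical mutation: resample each gene with probability p_mut."""
    return [rng.randint(-5, 5) if rng.random() < p_mut else g
            for g in individual]
```

Because the worst individual is replaced only by a strictly better offspring, the worst fitness in the population can never deteriorate from one iteration to the next. Cutting at gene boundaries also keeps every function gene pointing at lower indices, so offspring remain valid chromosomes.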

The MEP representation is similar to the way in which C and Pascal compilers translate mathematical expressions into machine code [21]. The number of MEP genes per chromosome is constant and specifies the length of the chromosome. Each gene encodes a terminal (an element of the terminal set T) or a function symbol (an element of the function set F). A gene that encodes a function includes pointers to the function arguments. Function arguments always have indices lower than the position of the function itself in the chromosome. Consequently, the first symbol in a chromosome must be a terminal symbol, as required by the representation scheme.

An example of a MEP chromosome is given below. Note that the numbers on the left are gene labels and do not belong to the chromosome. Using the set of arithmetic operators F = {+, ×, /} and the set of terminals T = {x 1, x 2, x 3, x 4}, the example is given as follows:

  • 0: x 1

  • 1: x 2

  • 2: × 0, 1

  • 3: x 3

  • 4: + 2, 3

  • 5: x 4

  • 6: / 4, 5

MEP individuals are translated into computer programs by reading the chromosome top–down, starting with the first position. A terminal symbol defines a simple expression, and each function symbol specifies a complex expression obtained by connecting the operands at the given argument positions with the current function symbol [20]. In the present example, genes 0, 1, 3, and 5 encode simple expressions formed by a single terminal symbol. These expressions are

$$ \begin{aligned} E_{0} &= x_{1} , \\ E_{1} &= x_{2} , \\ E_{3} &= x_{3} , \\ E_{5} &= x_{4} , \\ \end{aligned} $$

Gene 2 indicates the operation × on the operands located at positions 0 and 1 of the chromosome. Therefore, gene 2 encodes the expression

$$ E_{2} = x_{1} \times x_{2} . $$

Gene 4 indicates the operation + on the operands located at positions 2 and 3. Therefore, gene 4 encodes the expression

$$ E_{4} = (x_{1} \times x_{2} ) + x_{3} . $$

Gene 6 indicates the operation / on the operands located at positions 4 and 5. Therefore, gene 6 encodes the expression

$$ E_{6} = \left( {\left( {x_{1} \times x_{2} } \right) + x_{3} } \right)/x_{4} . $$
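The top–down decoding of this example can be reproduced in a few lines. The sketch below is only an illustration of the scheme: the tuple-based gene layout, the names `decode` and `to_strings`, and the unprotected division are assumptions made for brevity, not details of the MEP source code.

```python
import operator

# The example chromosome. A terminal gene is a variable name; a function
# gene is (symbol, index_of_arg1, index_of_arg2), and both indices are
# lower than the gene's own position.
CHROMOSOME = [
    "x1",
    "x2",
    ("*", 0, 1),
    "x3",
    ("+", 2, 3),
    "x4",
    ("/", 4, 5),
]

FUNCTIONS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def decode(chromosome, inputs):
    """Return the values of all expressions E_0..E_k in one top-down pass."""
    values = []
    for gene in chromosome:
        if isinstance(gene, str):
            values.append(inputs[gene])  # simple expression (terminal)
        else:
            symbol, a, b = gene
            # Arguments were already evaluated because a, b < current index.
            values.append(FUNCTIONS[symbol](values[a], values[b]))
    return values


def to_strings(chromosome):
    """Return the symbolic form of every encoded expression."""
    exprs = []
    for gene in chromosome:
        if isinstance(gene, str):
            exprs.append(gene)
        else:
            symbol, a, b = gene
            exprs.append(f"({exprs[a]} {symbol} {exprs[b]})")
    return exprs
```

Since every argument index points backwards, a single pass evaluates all encoded expressions, reusing the sub-results of earlier genes. Note that this sketch does not guard against division by zero.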

Because multiple solutions are encoded in a single chromosome, one of these expressions (\( E_{0}, \ldots, E_{6} \)) must be chosen to represent the chromosome. Each MEP chromosome encodes a number of expressions equal to the chromosome length (the number of genes). Because of its multi-expression representation, each MEP chromosome may be viewed as a forest of trees rather than as a single tree, which is the case in GP. Figure 1 shows the forest of expressions encoded by the MEP chromosome presented above. Each of these expressions can be considered a possible solution of a problem. The fitness of a MEP chromosome is defined as the fitness of the best expression encoded by that chromosome. For solving symbolic regression problems, the fitness of a MEP chromosome (f) may be computed using the following equation [6]:

Fig. 1 Expressions encoded by a MEP chromosome represented as trees

$$ f = \mathop {\min }\limits_{i = 1,m} \left\{ {\sum\limits_{j = 1}^{n} {\left| {E_{j} - O_{j}^{i} } \right|} } \right\} $$
(3)

where n is the number of fitness cases, \( E_{j} \) is the expected value for fitness case j, \( O_{j}^{i} \) is the value returned for the jth fitness case by the ith expression encoded in the current chromosome, and m is the number of chromosome genes.
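Equation 3 amounts to computing the summed absolute error of every encoded expression and keeping the smallest. A minimal sketch follows; the function name `chromosome_fitness` and the nested-list layout of the outputs are assumptions made for illustration.

```python
def chromosome_fitness(outputs, expected):
    """Eq. 3: smallest summed absolute error over all encoded expressions.

    outputs[i][j] is the value of expression i on fitness case j;
    expected[j] is the target value for fitness case j.
    Returns (fitness, index_of_best_expression).
    """
    errors = [
        sum(abs(e - o) for e, o in zip(expected, row)) for row in outputs
    ]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best], best
```

Returning the index of the best expression alongside the fitness makes it straightforward to report which gene represents the chromosome.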

4 Model development

The details of developing the MEP-based models including database description and comparison of performance of the models are presented in the following subsections.

4.1 Database

In the present study, the unified soil classification or IS classification system is considered to verify the obtained results. In practice, soil classification is determined on the basis of existing experimental results. The Bureau of Indian Standards classifies soils based on the color of soil (CS), the percentages of gravel (%G), sand (%S), and fine-grained particles (%F), the liquid limit (LL), and the plastic limit (PL). These six properties are utilized as the input parameters of the MEP models to predict the soil classification (SC). The following values were assigned to the color of soils:

  • 0.1: Brown;

  • 0.2: Brownish gray;

  • 0.3: Grayish brown;

  • 0.5: Reddish yellow;

  • 0.7: Yellowish red.

Similar to the SLNN network, the output of MEP is the classification of soil given as below:

  • 0.1: Clayey soil (SL)

  • 0.2: Clay with medium compressibility (CI)

  • 0.3: Clay of low compressibility (CL)

  • 0.6: Silt with medium compressibility (MI)

The database used for model development contains soil classification test results reported by Suresh [12]. The tests were conducted on 17 undisturbed soil samples obtained from different parts of India. The soil samples were collected from trial pits at 1.5–2.0 m depth below ground level. To determine the index properties, disturbed but representative soil samples were also collected from the trial pits. The samples were collected using thin-walled samplers satisfying the requirements of IS: 2132-1986 [22]. Atterberg limits and grain size distribution characteristics were determined according to the relevant IS codes of practice [23, 24]. The soil samples were classified in accordance with the IS classification system [25]. All these test results are summarized in Table 1. Before the learning process, the input and output parameters were normalized between 0.1 and 0.9. The statistics of the input and output parameters involved in the model development are given in Table 2.
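The text does not state which normalization formula was applied. One common choice that maps a variable onto the interval from 0.1 to 0.9 is linear min-max scaling, sketched below as an assumption; the function name `normalise` is hypothetical, and any values passed to it in examples are illustrative rather than taken from Table 1.

```python
def normalise(values, lo=0.1, hi=0.9):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi.

    Assumption: plain min-max scaling; the exact formula used in the
    study is not given in the text.
    """
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot normalise a constant variable")
    scale = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * scale for v in values]
```

Under this assumption, a predicted value is mapped back to the original scale by inverting the same linear relation with the stored minimum and maximum.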

Table 1 Database used for model development
Table 2 The variables used in model development

4.2 Model development using MEP

The main goal is to obtain explicit formulas for soil classification (SC) as a function of variables given as follows:

$$ {\text{SC}} = f\left( {{\text{CS}},\,{\text{G}},\,{\text{S}},\,{\text{F}},\,{\text{LL}},\,{\text{PL}}} \right) $$
(4)

The six parameters are used as the input variables of the MEP models. Two MEP-based formulas (SC1, SC2) were obtained for soil classification by considering two different function sets for the MEP runs. The first function set, which consists of nearly all operators, was used for developing SC1. The second includes only addition, subtraction, division, and multiplication in order to obtain a short and very simple formula (SC2).

Various parameters are involved in the MEP predictive algorithm, such as the population size, chromosome length, number of generations, and tournament size. The parameter selection affects the generalization capability of the MEP models. The parameters were selected based on previously suggested values [8] and after a trial-and-error approach. For the analysis, the C++ source code of MEP [26] was modified by the authors to suit the present problem. The parameter settings for the MEP algorithm are shown in Table 3.

Table 3 Parameter settings for MEP

For the analysis, the database was divided into training and testing subsets. Of the 17 data sets, the first nine were taken for training the MEP algorithm and the remaining eight were used for testing the generalization capability of the models. To evaluate the capabilities of the proposed MEP models, the correlation coefficient (R), mean squared error (MSE), and mean absolute error (MAE) were used as follows:

$$ R = {\frac{{\sum\nolimits_{i = 1}^{n} {\left( {h_{i} - \bar{h}} \right)\left( {t_{i} - \bar{t}} \right)} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {\left( {h_{i} - \bar{h}} \right)^{2} } \sum\nolimits_{i = 1}^{n} {\left( {t_{i} - \bar{t}} \right)^{2} } } }}} $$
(5)
$$ {\text{MSE}} = {\frac{{\sum\nolimits_{i = 1}^{n} {\left( {h_{i} - t_{i} } \right)^{2} } }}{n}} $$
(6)
$$ {\text{MAE}} = {\frac{{\sum\nolimits_{i = 1}^{n} {\left| {h_{i} - t_{i} } \right|} }}{n}} $$
(7)

where \( h_{i} \) and \( t_{i} \) are, respectively, the actual and calculated output values for the ith output, \( \bar{h} \) and \( \bar{t} \) are the averages of the actual and calculated outputs, and n is the number of samples.
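For reference, the three measures can be computed as follows. This sketch uses the standard definitions, in which R is the Pearson correlation between actual and calculated outputs and MSE and MAE are averages of the squared and absolute differences between them; the function names are illustrative.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation coefficient R between actual and predicted."""
    n = len(actual)
    h_bar = sum(actual) / n
    t_bar = sum(predicted) / n
    cov = sum((h - h_bar) * (t - t_bar) for h, t in zip(actual, predicted))
    var_h = sum((h - h_bar) ** 2 for h in actual)
    var_t = sum((t - t_bar) ** 2 for t in predicted)
    if var_h == 0 or var_t == 0:
        raise ValueError("R is undefined for a constant series")
    return cov / math.sqrt(var_h * var_t)


def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((h - t) ** 2 for h, t in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(h - t) for h, t in zip(actual, predicted)) / len(actual)
```

A constant offset between predictions and targets leaves R at 1 while MSE and MAE are nonzero, which is why the correlation coefficient is reported together with the two error measures.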

4.3 Explicit formulation of soil classification and analysis using MEP

The formulations of soil classification obtained by the MEP algorithm for the best test R values are given below:

$$ {\text{SC}}_{1} = \frac{{\text{PL}}^{2}}{1{,}156}\left( \left( \frac{\text{PL}}{34}\left( \sin\left( \frac{{\text{F}} \times {\text{PL}}}{2{,}856}\left( 118\,\frac{{\text{CS}} \times \frac{\text{PL}}{84} - \frac{\text{G}}{18}}{\text{LL}} \right) - \frac{\text{G}}{18} \right)^{2} \right) + 2{\text{CS}} \right) + {\text{CS}} \right) $$
(8)
$$ {\text{SC}}_{2} = {\text{CS}}\left( {{\frac{{59{\text{PL}}^{3} }}{{9,826{\text{LL}}}}} - {\frac{\text{S}}{82}}} \right) $$
(9)

where SC1 and SC2 are the soil classification predictive equations obtained using, respectively, the first and second function sets shown in Table 3. CS, G, S, F, LL, and PL, respectively, denote the color of soil and percentages of gravel, sand, fine-grained particles, liquid limit, and plastic limit. The comparison of MEP predicted and actual soil classification for Eq. 8 is shown in Fig. 2. It can be seen from this figure that Eq. 8 yielded high R values equal to 0.9931 and 0.9823 for the training and testing data, respectively. Figure 3 shows the relevant results obtained by Eq. 9. It can be observed from this figure that Eq. 9 yielded R values equal to 0.9932 and 0.9871 for the training and testing data, respectively.
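Both formulas are short enough to be coded directly. The sketch below transcribes Eqs. 8 and 9 under two assumptions that should be checked before use: the squared sine term in Eq. 8 is read as the square of the sine value (the typeset form could also be read as the sine of a squared argument), and the inputs are supplied in whatever units and scaling the formulas were fitted on, which this sketch does not enforce. The function names `sc1` and `sc2` are illustrative.

```python
import math


def sc1(cs, g, f, ll, pl):
    """Eq. 8. Assumption: sin(...)^2 is read as (sin(...)) squared."""
    inner = (f * pl / 2856.0) * (118.0 * (cs * pl / 84.0 - g / 18.0) / ll) \
        - g / 18.0
    return (pl ** 2 / 1156.0) * (
        (pl / 34.0) * math.sin(inner) ** 2 + 2.0 * cs + cs
    )


def sc2(cs, s, ll, pl):
    """Eq. 9, which uses only CS, S, LL, and PL."""
    return cs * (59.0 * pl ** 3 / (9826.0 * ll) - s / 82.0)
```

Written this way, it is easy to see that Eq. 8 does not involve S, whereas Eq. 9 does not involve G or F.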

Fig. 2 Results of MEP predicted and actual soil classification obtained by Eq. 8

Fig. 3 Results of MEP predicted and actual soil classification obtained by Eq. 9

To evaluate the importance of the input parameters, that is, how often each input appears in a way that contributes to the fitness of the MEP programs containing it, the frequency values of the input parameters of the predictive models were obtained and are presented in Fig. 4. A value of 1.00 in this figure indicates that the input variable appeared in 100% of the best 30 programs evolved by MEP. The frequency values were obtained for the best test R values of the MEP runs. Figure 4 indicates that, in both proposed models, soil classification is more sensitive to CS and PL than to the other inputs.

Fig. 4 Frequency values of input parameters of soil classification predictive models

5 Discussion of results

In the present study, two MEP-based formulas were obtained for the classification of soil in terms of CS, %G, %S, %F, LL, and PL. As mentioned previously, R, MSE, and MAE were considered as the target statistical parameters to evaluate the performance of the models. Figure 5 presents a comparison of the ratio between actual and predicted soil classification for the different models. The statistical performance of the MEP-based formulations and of the SLNN model is summarized in Table 4. Comparing the MEP-based equations, it can be observed that the best performance is achieved by Eq. 9 on the training, testing, and entire data sets. On the other hand, Eq. 8, which was developed using the first function set, takes into account the effects of more parameters than Eq. 9.

Fig. 5 A comparison of the ratio between actual and predicted soil classification for different models

Table 4 Statistical performance of soil classification prediction models

Comparing the results of MEP and SLNN, it can be seen that both formulas obtained by the MEP approach outperform the SLNN model on the testing data and on the entire data set. Table 5 shows a comparison between the results of the proposed MEP formulations, the SLNN model, and the actual experimental values.

Table 5 Comparative analysis of the proposed MEP models with experimental and SLNN results

6 Conclusions

In this paper, an application of a particular variant of GP, namely MEP, to soil classification prediction is presented along with performance comparisons. Two formulas for the classification of soil were obtained by means of MEP using two different function sets. A reliable database of previously published soil classification test results was used for training and testing the prediction models. The results of the MEP-based formulations were compared with the experimental results and with an existing model in the literature, namely SLNN (RBF). The values of the performance measures indicate that the proposed MEP models are able to predict the target values to a high degree of accuracy. The results also demonstrate that the formulas evolved by MEP outperform the SLNN model. In addition to their considerable accuracy, the MEP-based prediction equations are short and simple and appear to be more practical to use than the equations produced by SLNN. Overall, this investigation indicates that MEP is a very promising approach for the formulation of many civil engineering tasks.