1 Introduction

Since the advent of concrete as a human-made artificial stone, numerous construction projects have been carried out successfully with this cost-effective material. However, estimating the properties of hardened concrete has always been one of the principal challenges in concrete technology, mainly because many predictable or unpredictable factors may significantly affect the properties of concrete [1, 2]. Compressive strength is one of the properties of concrete that plays a prominent role in the design of engineering structures. In current practice, to determine the compressive strength of concrete, several cubic or cylindrical samples are fabricated and tested at different ages. However, these tests are not only costly but also time-consuming [3, 4]. Moreover, changes in the mix design of concrete can result in concrete with completely different properties; thus, the tests need to be repeated whenever the content of any ingredient varies [5]. This issue is even more pronounced for concrete in which cement has been partially replaced by pozzolanic powders.

Fly ash (FA) and furnace slag (FS), as types of pozzolanic powders, are commonly used as partial replacements for cement [6, 7]. FA is generally obtained as a residue from the combustion of pulverized coal in the furnaces of thermal power plants, and its properties depend on the combustion process through which it is produced [8]. Dry processing leads to FA that is homogeneous in particle size, while wet processing can result in FA with highly separated aggregates [9]. Adding FA to concrete as a partial replacement for cement reduces the compressive strength and slump values; however, it increases the workability and integrity of the concrete [10]. FS is a by-product of iron and steel production in blast furnaces, and its chemical composition depends on the raw materials from which it is produced. The slag floats on top of the iron in blast furnaces and is decanted for separation [11]. Quick cooling of the molten slag converts it into non-crystalline constituents with hydraulic properties [12]. Partial replacement of cement with FS results in concrete with higher compressive strength and durability; however, a higher dosage of FS can cause thermo-hygral (TH) damage and cracks, which adversely affect the strength and mechanical properties of concrete [13, 14]. The use of FA and FS in concrete not only provides a means of disposing of these waste materials but also offers an alternative to cement, whose production emits a large amount of carbon dioxide (CO2) and thereby contributes to global warming [7, 15, 16]. However, because compressive strength depends on several parameters and is highly sensitive to the mixture proportions, more advanced methods should be employed to eliminate the need for experiments as much as possible and to provide engineers with a simpler tool for predicting experimental results.

Soft computing (SC) is an efficient approach that can be used in this regard. Its most significant advantage is generating solutions for linear or nonlinear problems in which mathematical models cannot easily express the relations among the parameters involved [17]. Moreover, SC methods use human-based knowledge, recognition, understanding, and learning in computation [18]. In recent years, several researchers have employed artificial intelligence (AI) methods and machine learning (ML) techniques, as sub-branches of SC, to predict the properties of different types of concrete.

Oztas et al. [19] investigated the potential of an artificial neural network (ANN) for predicting the slump value and compressive strength of high strength concrete (HSC) and revealed the high capability of ANN in predicting the properties of HSC. Alshihri et al. [20] used an ANN model for the compressive strength prediction of lightweight concrete and concluded that using ANN can reduce cost and save time. In another study, the satisfying performance of ANN was reported for compressive strength prediction of self-compacting concrete (SCC) and high-performance concrete (HPC) [21]. Khademi et al. [22] employed an ANN model, an adaptive neuro-fuzzy inference system (ANFIS) model, and a multiple linear regression (MLR) model to predict the compressive strength of normal strength concrete. They concluded that the MLR model is not feasible enough for compressive strength prediction, as the problem appears nonlinear, while the ANN and ANFIS models are reliable enough to be used in this area. Yaseen et al. [23] compared the performance of an extreme learning machine (ELM), a support vector regression (SVR), a multivariable adaptive regression spline (MARS), and an M5 tree model for compressive strength prediction of lightweight foamed concrete, reporting the superior performance of the ELM. Alshamiri et al. [24] also examined the capability of ELM in the compressive strength prediction of HSC and compared it with an ANN model, observing better performance from the ELM model than from the ANN model. Han et al. [25] combined an ANN with the particle swarm optimization (PSO) algorithm to estimate the compressive strength of ground granulated blast furnace slag (GGBFS) concrete and compared the hybrid ANN-PSO model with an ANN model, reporting improvements in the performance of ANN after combining it with PSO. Golafshani et al. [26] combined an ANN and an ANFIS model with the grey wolf optimizer (GWO) and showed that hybridization with GWO improves the training and generalization capability of both models. Sun et al. [27] predicted and optimized the factors influencing the compressive strength of concrete containing silica fume (SF) and fly ash (FA) using a hybrid ANN-ABC (artificial bee colony) model. In another study, Ashrafian et al. [28] combined a typical MARS model with a water cycle algorithm (WCA) and proposed a hybrid MARS-WCA model for predicting the compressive strength of foamed cellular lightweight concrete; this research also revealed that hybridizing standard models with metaheuristic algorithms can improve their performance.

The main objective of this study is to predict the compressive strength of concrete in which cement has been partially replaced with fly ash (FA) and furnace slag (FS). To achieve this goal, an SC approach is adopted: an extreme learning machine (ELM) is combined with a metaheuristic algorithm known as the grey wolf optimizer (GWO), and a hybrid ELM-GWO model is proposed. Next, the capability of this model for compressive strength prediction is investigated. For this purpose, well-known and powerful machine learning (ML) models are also developed, including an artificial neural network (ANN), an adaptive neuro-fuzzy inference system (ANFIS), an extreme learning machine (ELM), a support vector regression with a radial basis function kernel (SVR-RBF), and another SVR with a polynomial kernel (SVR-Poly). The mixture proportions of the concrete together with the age of the samples are considered as the inputs of the models, and the compressive strength of concrete is predicted as the output. Finally, the results of the developed ML models are compared with each other in terms of different statistical performance indices and the time required for training.

2 Methodologies

2.1 Artificial neural network (ANN)

The artificial neural network (ANN) is the most well-known and widely implemented methodology for function approximation. ANN is an intelligent tool inspired by the biological neural networks of humans and animals [29, 30]. This model, with its layer-to-layer structure, is able to learn patterns and predict results in the high-dimensional space of the problem [31, 32]. The multilayer perceptron (MLP) is a simple and reliable class of feed-forward ANN. A typical MLP contains an input layer, at least one hidden layer, and an output layer [33, 34]. The input layer takes the values of the predictors and sends them to the neurons in the next layer. A typical neuron in the structure of an ANN is shown in Fig. 1. Inside each neuron, a weighted sum of the inputs plus a bias value, called Net, is computed. In the next step, the Net value passes through an activation function. The most popular and well-known activation functions are tanh and sigmoid. The reason for using activation functions is that, without them, the output of a neural network would simply be a linear combination of the input values; accordingly, activation functions create a nonlinear relation between the inputs and outputs of the network. The number of neurons, the number of layers, the activation functions, and the definition of the error and evaluation criteria play a vital role in the performance of a neural network. Therefore, these free parameters should be selected so that the network reaches an acceptable outcome. This mathematical process can be formulated as follows [21, 35]:

$${\text{Net}} = \mathop \sum \limits_{i = 1}^{n} w_{ij} x_{i} + b_{j}$$
(1)

where \(x_{i}\) are the nodal values received from the previous layer; \(n\) is the total number of these values; and \(w_{ij}\) and \(b_{j}\) are the weights and biases of the network in the current layer.

Fig. 1

A typical neuron in an ANN

Finally, the Net value is transformed by an activation function and the output signal is transferred to the neurons in the next layer. The hyperbolic tangent function generally leads to more accurate results [36]; thus, it is used as the activation function in this study. This function varies between − 1 and 1 and is defined as follows:

$$y = f\left( {Net} \right) = \frac{2}{{1 + e^{ - 2 \cdot Net} }} - 1$$
(2)

where \(y\) is the output signal and \(f\) is the activation function applied to the calculated network value (Net).

This process is performed for each layer of the MLP until the output signals, or predicted values, in the last layer are determined. The error of the neural network is then calculated and minimized by adjusting the weights and biases throughout the MLP. This process, defined as training, can be conducted with different optimization algorithms; however, the fast convergence rate and appropriate precision of backpropagation (BP) algorithms have made them the standard choice for training standard ANNs [35].
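To make Eqs. (1) and (2) concrete, the following minimal sketch shows a forward pass through a small MLP. It is written in Python/NumPy purely for illustration (the models in this study were implemented in MATLAB), and the layer sizes, random weights, and inputs are placeholders rather than the values used in this work.

```python
import numpy as np

def forward_layer(x, W, b):
    """One MLP layer: Net = W x + b (Eq. 1), y = tanh(Net) written as in Eq. (2)."""
    net = W @ x + b                                   # weighted sum plus bias
    return 2.0 / (1.0 + np.exp(-2.0 * net)) - 1.0     # hyperbolic tangent

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=8)                        # 8 normalized inputs (illustrative)
W1, b1 = rng.normal(size=(8, 8)), rng.normal(size=8)  # hidden layer with 8 neurons
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)  # single output neuron
hidden = forward_layer(x, W1, b1)
prediction = W2 @ hidden + b2                         # linear output node
```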

2.2 Adaptive neuro-fuzzy inference system (ANFIS)

An adaptive neuro-fuzzy inference system (ANFIS) is a specific sub-branch of ANN which benefits from the combined features of neural networks and fuzzy logic principles [37,38,39,40]. ANFIS was developed by Jang [41] in 1993 to model nonlinear functions, identify nonlinear components, and predict chaotic time series. ANFIS is capable of constructing an input–output mapping, based on the Takagi–Sugeno fuzzy inference system (in the form of fuzzy IF-THEN rules) [42, 43]. Many advantages of ANFIS such as the ability to capture the nonlinear structure of a process, adaptation capability, and rapid learning have made it very popular among engineers [44,45,46].

ANFIS architecture has five layers, as shown in Fig. 2. The core of ANFIS is a fuzzy inference system (FIS). The first layer receives the inputs (x and y in Fig. 2) and converts them into fuzzy values through membership functions (MFs). The rule base contains two fuzzy IF-THEN rules of the Takagi–Sugeno type:

Fig. 2

Layers of a typical ANFIS model

Rule 1: if x is \(A_{1}\) and \(y\) is \(B_{1}\), then \(f_{1} = p_{1} x + q_{1} y + r_{1}\),

Rule 2: if x is \(A_{2}\) and \(y\) is \(B_{2}\), then \(f_{2} = p_{2} x + q_{2} y + r_{2}\).

Every node in this layer (i.e., the first layer) is selected as an adaptive node with a node function,

$$O_{i}^{1} = \mu A_{i} \left( x \right)$$
(3)

where \(A_{i}\) is a linguistic label and \(O_{i}^{1}\) is the membership grade of the input \(x\) in \(A_{i}\).

Bell-shaped membership functions (or Gaussian functions) are usually used in ANFIS as they have a higher capacity in the regression of nonlinear data [30, 47,48,49,50,51,52,53]. A bell-shaped membership function with the maximum value of one and minimum value of zero is defined as follows:

$$\mu \left( x \right) = {\text{bell}}\left( {x;a_{i} ,b_{i} ,c_{i} } \right) = \frac{1}{{1 + \left[ {\left( {\frac{{x - c_{i} }}{{a_{i} }}} \right)^{2} } \right]^{{b_{i} }} }}$$
(4)

where \(\{ a_{i} ,b_{i} ,c_{i} \}\) is the parameter set and x is the input. The parameters of this layer are known as premise parameters.

The second layer multiplies the incoming signals and sends their product to the next layer. For instance:

$$w_{i} = \mu A_{i} \left( x \right) \times \mu B_{i} \left( y \right),\quad i = 1, 2.$$
(5)

Every output of the nodes exhibits the firing strength of a rule.

The third layer is the normalization layer. In this layer, the ratio of the firing strength of the ith rule to the sum of the firing strengths of all rules is calculated. This means that:

$$w_{i}^{*} = \frac{{w_{i} }}{{w_{1} + w_{2} }} \quad i = 1, 2.$$
(6)

The outcomes \(w_{i}^{*}\) are known as normalized firing strength.

The fourth layer is the defuzzification layer in which every node has a node function as follows:

$$O_{i}^{4} = w_{i}^{*} f_{i} = w_{i}^{*} \left( {p_{i} x + q_{i} y + r_{i} } \right)$$
(7)

where \(w_{i}^{*}\) is the output of the third layer and \(\{ p_{i} ,q_{i} ,r_{i} \}\) are the parameters of this layer known as consequent parameters.

The output layer is the fifth layer. In this layer, the overall output is computed by summing all the incoming signals. This means that:

$$O_{1}^{5} = f = \mathop \sum \limits_{i} w_{i}^{*} f_{i}$$
(8)

In this process, a threshold for the error between the actual and predicted outputs is set. The consequent parameters are then obtained by the least-squares method, and an error is computed for each data point. If this error is larger than the threshold, the premise parameters are updated using a gradient descent algorithm. This process continues until the error becomes smaller than the threshold. Since the parameters are determined by two algorithms (i.e., least squares and gradient descent), the algorithm used in this process is known as a hybrid algorithm.
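As an illustration of the five-layer computation in Eqs. (3)–(8), the following sketch evaluates a two-input, two-rule ANFIS of the kind shown in Fig. 2. It is a hedged Python example with arbitrary premise and consequent parameters, not the trained system of this study.

```python
import numpy as np

def bell(x, a, b, c):
    """Generalized bell membership function, Eq. (4)."""
    return 1.0 / (1.0 + (((x - c) / a) ** 2) ** b)

def anfis_two_rules(x, y, premise, consequent):
    """Forward pass of the two-rule ANFIS of Fig. 2, Eqs. (3)-(8)."""
    A1, A2, B1, B2 = premise                               # each entry is (a, b, c)
    # Layer 1: fuzzification, Eq. (3)
    muA = [bell(x, *A1), bell(x, *A2)]
    muB = [bell(y, *B1), bell(y, *B2)]
    # Layer 2: firing strengths, Eq. (5)
    w = [muA[0] * muB[0], muA[1] * muB[1]]
    # Layer 3: normalized firing strengths, Eq. (6)
    w_star = [wi / (w[0] + w[1]) for wi in w]
    # Layers 4-5: weighted consequents and overall output, Eqs. (7)-(8)
    f = [p * x + q * y + r for (p, q, r) in consequent]
    return sum(ws * fi for ws, fi in zip(w_star, f))

premise = [(2.0, 2.0, 0.0), (2.0, 2.0, 5.0), (3.0, 2.0, 1.0), (3.0, 2.0, 4.0)]
consequent = [(0.5, 1.0, 0.1), (-0.2, 0.8, 0.3)]           # (p_i, q_i, r_i), illustrative
print(anfis_two_rules(1.5, 2.5, premise, consequent))
```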

2.3 Support vector regression (SVR)

The fundamental principle of support vector regression (SVR) is to map the input data to a high-dimensional feature space and perform a linear regression in this space so that the empirical risk is minimized [54,55,56,57]. The flexible nature of SVR is attributed to the kernel functions (k), which implicitly map the data to the feature space. This process is known as the kernel trick, because a linear regression in the feature space corresponds to a nonlinear regression in the original space of the problem [58]. Several kernel functions have been defined for this purpose; however, the radial basis function (RBF) kernel and the polynomial (Poly) kernel usually lead to better results than other kernels [59]. These kernel functions are defined as follows:

$${\text{RBF}}\;{\text{kernel}} \to k = k\left( {x,x_{i} } \right) = \exp \left( { - \frac{{ \left\| {x - x_{i} } \right\|^{2} }}{{2\sigma^{2} }}} \right) = \exp \left( { - \gamma \left\| {x - x_{i} } \right\|^{2} } \right),\quad \gamma = \frac{1}{{2\sigma^{2} }}$$
(9)
$${\text{Poly}}\;{\text{kernel}} \to k = k\left( {x,x_{i} } \right) = \left( {x . x_{i} + 1} \right)^{d}$$
(10)

where \(x\) and \(x_{i}\) are vectors in the input space; \(\gamma\) and \(\sigma\) are the parameters which define how far the influence of a single sample reaches; \(d\) is the degree of the polynomial kernel function.
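For illustration, the two kernels of Eqs. (9) and (10) can be evaluated directly; the vectors and parameter values below are arbitrary placeholders, and the sketch is in Python/NumPy rather than the MATLAB used in this study.

```python
import numpy as np

def rbf_kernel(x, xi, gamma):
    """RBF kernel, Eq. (9): exp(-gamma * ||x - xi||^2)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def poly_kernel(x, xi, d):
    """Polynomial kernel, Eq. (10): (x . xi + 1)^d."""
    return (np.dot(x, xi) + 1.0) ** d

x  = np.array([0.2, -0.5, 0.9])
xi = np.array([0.1,  0.4, -0.3])
print(rbf_kernel(x, xi, gamma=0.25), poly_kernel(x, xi, d=4))
```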

The SVR theory can be defined as follows:

Assume that there is a training set of \(N\) samples in the \(d\)-dimensional space of the problem; these samples can be represented as follows:

$$\left( {x_{1} , y_{1} } \right), \ldots , \left( {x_{i} ,y_{i} } \right), \ldots , \left( {x_{N} ,y_{N} } \right) \quad x_{i} , y_{i} \in R^{d}$$
(11)

where \(x_{i}\) is the ith sample of the input vector \(x\) containing \(N\) training points, and \(y_{i}\) is the corresponding output value of that sample.

As mentioned before, SVR performs a linear regression in the feature space of the problem. Thus, if the data points are transferred to this space by the kernel functions (\(k\)), the SVR model can be defined by the following linear equation:

$$\hat{y}_{i} = f\left( x \right) = \omega^{T} \phi \left( x \right) + b$$
(12)

where \(\hat{y}_{i}\) is the predicted output vector; \(\omega\) is the weight vector; \(\phi \left( x \right)\) is the mapping function applied for feature extraction; and \(b\) is the bias term.

As described previously, SVR minimizes the empirical risk of the problem. Hence, if the empirical risk is denoted by \(R_{{{\text{emp}}}}\), it can be written that:

$$R_{{{\text{emp}}}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left| {y_{i} - \hat{y}_{i} } \right|_{\varepsilon }$$
(13)

where \(\left| {y_{i} - \hat{y}_{i} } \right|_{\varepsilon }\) is Vapnik's \(\varepsilon\)-insensitive loss function, defined by:

$$\left| {y_{i} - \hat{y}_{i} } \right|_{\varepsilon } = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\;\left| {y_{i} - \hat{y}_{i} } \right| \le \varepsilon } \hfill \\ {\left| {y_{i} - \hat{y}_{i} } \right| - \varepsilon } \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(14)

The weight vector \(\omega\) and the bias term \(b\) can then be determined by minimizing the cost function \(J(\omega ,\zeta_{i} ,\zeta_{i}^{*} )\), described as follows:

$$J\left( {\omega ,\zeta_{i} ,\zeta_{i}^{*} } \right) = \frac{1}{2}\omega^{T} \omega + C\mathop \sum \limits_{i = 1}^{N} \left( {\zeta_{i} + \zeta_{i}^{*} } \right)$$
(15)

The constraints of this function are also as follows:

$$\left\{ {\begin{array}{*{20}l} {y_{i} - \hat{y}_{i} \le \varepsilon + \zeta_{i} } \hfill \\ { - y_{i} + \hat{y}_{i} \le \varepsilon + \zeta_{i}^{*} } \hfill \\ {\zeta_{i} \ge 0} \hfill \\ {\zeta_{i}^{*} \ge 0} \hfill \\ \end{array} \quad i = 1, 2, 3, \ldots , N} \right.$$
(16)

where \(\zeta_{i}\) and \(\zeta_{i}^{*}\) are non-negative slack variables, and \(C\) is a positive cost (regularization) parameter.
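In practice, the constrained problem of Eqs. (15)–(16) is solved with an off-the-shelf \(\varepsilon\)-SVR solver. The sketch below uses scikit-learn's SVR as one such solver, purely for illustration (the present study used its own MATLAB code); the data are synthetic and the hyperparameter values are placeholders, not the tuned values reported in Sect. 5.1.3.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 8))   # dummy normalized mix proportions + age
y = rng.uniform(-1, 1, size=100)        # dummy normalized compressive strength

# Two SVR variants, as in the paper: an RBF kernel and a polynomial kernel.
svr_rbf  = SVR(kernel="rbf",  C=1.0, gamma=0.25, epsilon=0.1).fit(X, y)
svr_poly = SVR(kernel="poly", C=1.0, degree=4,   epsilon=0.1).fit(X, y)

print(svr_rbf.predict(X[:3]), svr_poly.predict(X[:3]))
```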

2.4 Extreme learning machine (ELM)

The extreme learning machine (ELM) was proposed by Huang et al. [60] in 2006 for single-layer feed-forward neural network (SLFN) architectures. ELM originated from observations proving that an SLFN with random weights and biases can approximate any continuous function on any compact input set [61]. Hence, it was realized that the input weights and biases of an SLFN can be selected randomly, and the output weights of the SLFN can then be calculated analytically. Employing this idea to find the weights and biases of the SLFN resulted in an algorithm with an extremely fast learning speed. Moreover, ELM determines all the network factors systematically, thus preventing unnecessary human interference [35, 62,63,64,65,66,67].

Developing an ELM model involves a three-step procedure: (I) an SLFN is created; (II) the input weights and biases of the network are randomly selected; (III) the output weights are estimated by inverting the hidden layer output matrix [24, 68].

For a dataset containing \(N\) training samples with \(n\)-dimensional input vectors and \(m\)-dimensional target vectors, the SLFN with \(L\) hidden nodes can mathematically be defined as follows:

$$\mathop \sum \limits_{i = 1}^{L} \beta_{i} G\left( {w_{i} .x_{j} + b_{i} } \right) = o_{j} \quad j = 1, 2, 3, \ldots , N$$
(17)

where \(G\) is the activation function (any of the activation functions used in neural networks can be employed here); \(w_{i} = \left[ {w_{i1} , w_{i2} , \ldots , w_{in} } \right]^{T}\) is the weight vector connecting the ith hidden neuron to the input neurons; \(x_{j} = \left[ {x_{j1} , x_{j2} , \ldots , x_{jn} } \right]^{T}\) is the input vector; \(\beta_{i} = \left[ {\beta_{i1} ,\beta_{i2} , \ldots , \beta_{im} } \right]^{T}\) is the weight vector connecting the ith hidden neuron to the output neurons; \(b_{i}\) is the bias of the ith hidden neuron; and \(o_{j} = \left[ {o_{j1} ,o_{j2} , \ldots , o_{jm} } \right]^{T}\) is the output vector.

If it is assumed that an SLFN with \(L\) hidden neurons and activation function \(G\) can approximate the \(N\) targets (\(t_{j} )\) with zero error, i.e., \(\mathop \sum \nolimits_{j = 1}^{N} \left\| {o_{j} - t_{j} } \right\| = 0\), Eq. (17) can be transformed to:

$$\mathop \sum \limits_{i = 1}^{L} \beta_{i} G\left( {w_{i} .x_{j} + b_{i} } \right) = t_{j}\quad j = 1, 2, 3, \ldots , N$$
(18)

where \(t_{j} = \left[ {t_{j1} ,t_{j2} , \ldots , t_{jm} } \right]^{T}\) is the target vector. Also, the above \(N\) equations can be compactly written as:

$$H\beta = T$$
(19)

in which:

$$H = \left[ {\begin{array}{*{20}c} {G\left( {w_{1} .x_{1} + b_{1} } \right)} & \ldots & {G\left( {w_{L} .x_{1} + b_{L} } \right)} \\ \vdots & \ldots & \vdots \\ {G\left( {w_{1} .x_{N} + b_{1} } \right)} & \ldots & {G\left( {w_{L} .x_{N} + b_{L} } \right)} \\ \end{array} } \right]_{N \times L}$$
(20)

and

$$\beta = \left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ \vdots \\ {\beta_{L}^{T} } \\ \end{array} } \right]_{L \times m} {\text{and}}\; T = \left[ {\begin{array}{*{20}c} {t_{1}^{T} } \\ \vdots \\ {t_{N}^{T} } \\ \end{array} } \right]_{N \times m}$$
(21)

The output weights are obtained when the difference between the left side (predicted values) and the right side (target values) of Eq. (19) is minimal, i.e., \(\min \left\| {H\beta - T} \right\|\). Although backpropagation (BP) algorithms could minimize this fitness function, as in an ANN, ELM instead relies on mathematical theory, which proves that the minimum error between the predicted and target values occurs when the output weight vector is determined as follows [60]:

$$\hat{\beta } = H^{\dag } T$$
(22)

where \(\hat{\beta }\) is the output weight vector; \(H^{\dag }\) is the Moore–Penrose generalized inverse of \(H\); and \(T\) is the target matrix.

As shown above, in contrast to models that obtain their weights and biases through iterative error minimization, a standard ELM involves no minimization or iteration; the SLFN is tuned by calculating the output weights through the Moore–Penrose generalized inverse.
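The whole ELM procedure of Eqs. (17)–(22) reduces to a few matrix operations. The following sketch, assuming a tanh activation and synthetic data, shows the idea in Python/NumPy; the number of hidden nodes, the data, and the activation are illustrative choices, not the settings of this study.

```python
import numpy as np

def elm_train(X, T, L, rng):
    """Standard ELM (Eqs. 17-22): random input weights and biases,
    output weights from the Moore-Penrose pseudoinverse."""
    n = X.shape[1]
    W = rng.uniform(-1, 1, size=(L, n))   # input weights, fixed at random
    b = rng.uniform(-1, 1, size=L)        # hidden biases, fixed at random
    H = np.tanh(X @ W.T + b)              # hidden-layer output matrix, Eq. (20)
    beta = np.linalg.pinv(H) @ T          # output weights, Eq. (22)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W.T + b) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 8))     # dummy normalized inputs
T = rng.uniform(-1, 1, size=(200, 1))     # dummy normalized targets
W, b, beta = elm_train(X, T, L=110, rng=rng)
P = elm_predict(X, W, b, beta)
```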

2.5 Grey wolf optimizer (GWO)

The grey wolf optimizer (GWO) is a metaheuristic algorithm proposed by Mirjalili et al. [69]. The algorithm is inspired by the leadership hierarchy and hunting mechanism of grey wolves. Grey wolves live in packs and follow a very strict social dominance hierarchy, as illustrated in Fig. 3. The leaders of the pack are called alpha \((\alpha )\), as they are responsible for making decisions. The second-level wolves are the betas (\(\beta\)), who assist the alphas in their responsibilities. The lowest level in this hierarchy is the omega (\(\omega\)), which plays the role of scapegoat. A wolf that belongs to none of these levels is known as a delta (\(\delta\)) wolf [67, 70]. According to this well-defined leadership hierarchy, grey wolves encircle a prey, attack, hunt, and search for other prey, as depicted in Fig. 4.

Fig. 3

Hierarchy of grey wolves [69]

Fig. 4

Hunting strategy of grey wolves: a searching and tracking, b–d pursuit, harassment, and encircling, e final stationary configuration at the end of the hunt [70]

The encircling behavior of grey wolves in hunting can be mathematically expressed as follows [69]:

$$\vec{D} = \left| {\vec{C} \cdot \overrightarrow {{X_{p} }} \left( t \right) - \vec{X}\left( t \right)} \right|$$
(23)
$$\vec{X}\left( {t + 1} \right) = \vec{X}_{p} \left( t \right) - \vec{A}\cdot\vec{D}$$
(24)

where \(\vec{X}\) describes the position of a grey wolf; \(\vec{X}_{p}\) is the position vector of the prey; \(t\) is the current iteration; and \(\vec{A}\) and \(\vec{C}\) are coefficient vectors defined as follows:

$$\vec{A} = 2\vec{a} \cdot \overrightarrow {{r_{1} }} - \vec{a}$$
(25)
$$\vec{C} = 2 \cdot \overrightarrow {{r_{2} }}$$
(26)

in which the component \(a\) is linearly decreased from 2 to 0; \(\overrightarrow {{r_{1} }}\) and \(\overrightarrow {{r_{2} }}\) are also random vectors uniformly distributed between 0 and 1.

Since the location of the prey (the optimum) is not known in advance, it is assumed that the \(\alpha\), \(\beta\), and \(\delta\) wolves have better knowledge about it [69]. Therefore, the average of the positions derived from these wolves is used to determine the location of the prey. It can thus be written that:

$$\overrightarrow {{D_{\alpha } }} = \left| {\overrightarrow {{C_{1} }} \cdot \overrightarrow {{X_{\alpha } }} - \vec{X}} \right|,\quad \overrightarrow {{D_{\beta } }} = \left| {\overrightarrow {{C_{2} }}\cdot \overrightarrow {{X_{\beta } }} - \vec{X}} \right|, \quad \overrightarrow {{D_{\delta } }} = \left| {\overrightarrow {{C_{3} }} \cdot \overrightarrow {{X_{\delta } }} - \vec{X}} \right|$$
(27)
$$\vec{X}_{1} = \vec{X}_{\alpha } - \overrightarrow {{A_{1} }}\cdot \overrightarrow {{D_{\alpha } }} , \quad \vec{X}_{2} = \vec{X}_{\beta } - \overrightarrow {{A_{2} }} \cdot \overrightarrow {{D_{\beta } }} , \quad \vec{X}_{3} = \vec{X}_{\delta } - \overrightarrow {{A_{3} }} \cdot \overrightarrow {{D_{\delta } }}$$
(28)
$$\vec{X}\left( {t + 1} \right) = \frac{{\vec{X}_{1} + \vec{X}_{2} + \vec{X}_{3} }}{3}$$
(29)

After approximating the location of the prey, the next step is to hunt it (exploitation). This is achieved through the vector \(\vec{A}\): as the value of \(a\) in Eq. (25) decreases from 2 to 0, the positions of the wolves approach the location of the prey according to Eq. (24). Moreover, both parameters \(C\) and \(A\) contribute to maintaining the searching capability (exploration) of the algorithm and avoiding local minima. On the one hand, the parameter \(C\) may change the apparent location of the prey and the difficulty of hunting; on the other hand, values of \(A\) greater than 1, i.e., \(\left| A \right| > 1\), force the grey wolves to diverge from the current prey and search for a fitter one [69]. If this process is repeated for a population of grey wolves over a specified number of iterations, Eq. (29) finally yields the location of the prey, i.e., the global optimum point.
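A minimal implementation of the update rules in Eqs. (23)–(29) can look as follows; the population size, iteration count, bounds, and test function are arbitrary, and details such as boundary handling are simplified relative to the original algorithm [69].

```python
import numpy as np

def gwo_minimize(fitness, dim, n_wolves=30, max_iter=200, lb=-1.0, ub=1.0, seed=0):
    """Minimal grey wolf optimizer sketch implementing Eqs. (23)-(29)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_wolves, dim))        # wolf positions
    scores = np.array([fitness(x) for x in X])
    for t in range(max_iter):
        order = np.argsort(scores)
        alpha, beta, delta = X[order[0]], X[order[1]], X[order[2]]
        a = 2.0 - 2.0 * t / max_iter                      # a decreases linearly from 2 to 0
        for i in range(n_wolves):
            X_new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2             # Eqs. (25)-(26)
                D = np.abs(C * leader - X[i])             # Eq. (27)
                X_new += leader - A * D                   # Eq. (28)
            X[i] = np.clip(X_new / 3.0, lb, ub)           # Eq. (29), clipped to bounds
            scores[i] = fitness(X[i])
    best = np.argmin(scores)
    return X[best], scores[best]

# Example: minimize the sphere function in 5 dimensions.
pos, val = gwo_minimize(lambda x: np.sum(x ** 2), dim=5)
```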

2.6 Hybrid ELM-GWO

As mentioned in Sect. 2.4, ELM calculates the output weights of the model based on mathematical theory, without using any optimization algorithm. This approach gives ELM the following advantages over other models [60, 65, 71]:

1. Unlike backpropagation algorithms, which generally cannot reach the exact values of the weights and biases, ELM computes the output weights precisely, since they are obtained analytically.

2. In contrast to backpropagation algorithms, which do not consider the magnitude of the weights and biases and only try to minimize the fitness function, ELM tends to yield small output weights for the SLFN.

3. This approach makes ELM an extremely fast algorithm for calculating the output weights and tuning the SLFN.

However, the most considerable deficiency of ELM lies in the initial input weights and biases, which are randomly assigned to the SLFN. Although it has been argued that the initial input weights do not significantly affect the efficiency of the ELM model, it is quite possible that these weights and biases do not lead to the best output weights. Moreover, the input weights and biases are never updated during a standard ELM algorithm, which can in turn reduce the capability of the model.

To address these issues and improve the efficiency of ELM, the algorithm can be combined with other optimization algorithms. In this study, ELM is combined with GWO, since GWO has an appropriate convergence rate and few parameters to tune; however, ELM can be combined with other optimization algorithms as well. A hybrid ELM-GWO algorithm can be constructed with the following steps:

1. Considering a number of neurons in the hidden layer, develop an SLFN.

2. Initially assign random input weights and biases to the SLFN; these values can lie in the range [0, 1] or [− 1, 1].

3. Reshape the weights and biases so that they represent the location of a wolf in the \(D\)-dimensional space of the problem, where \(D\) is the total number of weights and biases.

4. For each wolf in every iteration, calculate the output weights by employing the ELM algorithm.

5. With the output weights available, the output values can be predicted and the ELM-GWO model can be trained. For this purpose, define a fitness function to minimize the error of the model. Herein, the fitness function (\(E\)), in terms of root mean squared error (RMSE), is defined as follows:

    $$E\left( {\vec{w},\vec{b}} \right) = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {O_{i} - P_{i} } \right)^{2} }}{n}}$$
    (30)

    where \(\vec{w}\) and \(\vec{b}\) are the vectors of initial weights and biases; \(n\) is the total number of training samples; and \(O_{i}\) and \(P_{i}\) are the observed (actual) value and the predicted value for sample \(i\), respectively.

6. Repeat steps 4 and 5 for the specified number of wolves and iterations until the stopping criteria are satisfied (a sketch of this loop is given below). In this study, the maximum number of iterations and an acceptable performance level were considered as the stopping criteria.

As can be realized, this process provides the opportunity to evaluate the performance of the ELM model for different initial weights and biases. As a result, the input weights and biases can be updated, and the output weights are subsequently determined in a way that leads to a more robust model.
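The core of the hybrid model is the fitness evaluation of step 5: each wolf's position is decoded into input weights and biases, the ELM step supplies the output weights analytically, and the training RMSE of Eq. (30) is returned to GWO. A hedged sketch of this coupling is given below, reusing the gwo_minimize helper sketched in Sect. 2.5; all names and dimensions are illustrative rather than the implementation used in this study.

```python
import numpy as np

def elm_gwo_fitness(position, X, T, L):
    """Fitness of one wolf (Eq. 30): decode its position into input weights and
    biases, compute the ELM output weights analytically, return the training RMSE."""
    n = X.shape[1]
    W = position[: L * n].reshape(L, n)   # first L*n genes: input weights
    b = position[L * n :]                 # remaining L genes: hidden biases
    H = np.tanh(X @ W.T + b)              # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T          # ELM step, Eq. (22)
    P = H @ beta                          # predictions on the training set
    return np.sqrt(np.mean((T - P) ** 2)) # RMSE, Eq. (30)

# Illustrative usage with the gwo_minimize sketch (dim = L*n + L):
# best_pos, best_rmse = gwo_minimize(
#     lambda p: elm_gwo_fitness(p, X_train, T_train, L=110),
#     dim=110 * X_train.shape[1] + 110, n_wolves=75, max_iter=1000)
```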

3 Data and preparation

The data used in this investigation were obtained from the literature [72,73,74,75,76,77]. In total, a dataset containing 798 data points was collected. The contents of cement (C), water (W), fly ash (FA), furnace slag (FS), fine aggregates (FAG), coarse aggregates (CAG), superplasticizer (SP), and age (A) are considered as the inputs of the models, and the compressive strength of concrete (\(f_{c}^{^{\prime}}\)) is predicted as the output. Table 1 shows the variation range, average, and standard deviation of each input variable, while the distribution of the variables is illustrated in Fig. 5.

Table 1 Details of the data
Fig. 5

Distribution pattern of: a cement content (C), b furnace slag (FS), c fly ash (FA), d water content (W), e superplasticizer (SP), f coarse aggregate (CAG), g fine aggregate (FAG), h age (A), i compressive strength (\(f_{c}^{^{\prime}}\))

Although the raw data could be used in all the proposed models, the performance of the models can be improved if the data are normalized to the range of [0, 1] or [− 1, 1]. This preprocessing is beneficial because, as noted earlier, the ranges of activation functions and membership functions usually coincide with these intervals; consequently, the functions are more sensitive to input variables that have been normalized prior to training [78]. To normalize the inputs in the range of [− 1, 1], the following formulas can be used [21]:

$$X_{i} = \frac{{X_{{{\text{io}}}} - X_{\min } }}{{X_{\max } - X_{\min } }} \times 2 - 1$$
(31)
$$Y_{i} = \frac{{Y_{{{\text{io}}}} - Y_{\min } }}{{Y_{\max } - Y_{\min } }} \times 2 - 1$$
(32)

where \(X_{{{\text{io}}}}\) and \(X_{i}\) are the ith component of each input vector before and after normalization, respectively, and \(Y_{{{\text{io}}}}\) and \(Y_{i}\) are the ith component of the output vector before and after normalization, respectively. \(X_{\min }\), \(X_{\max }\), \(Y_{\min }\), and \(Y_{\max }\) are the minimum and maximum values of each input and output vector, respectively.

It is also worth mentioning that, since actual values are more tangible, postprocessing was also conducted in this study: the results were denormalized after the training process and before reporting the final results.
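Eqs. (31) and (32) and the subsequent denormalization amount to a simple min-max transform; a small illustrative sketch with dummy strength values follows (Python/NumPy, whereas this study was implemented in MATLAB).

```python
import numpy as np

def normalize(v):
    """Min-max normalization to [-1, 1], Eqs. (31)-(32)."""
    return (v - v.min()) / (v.max() - v.min()) * 2.0 - 1.0

def denormalize(v_norm, v_min, v_max):
    """Inverse transform used after training, before reporting results."""
    return (v_norm + 1.0) / 2.0 * (v_max - v_min) + v_min

strength = np.array([20.0, 35.5, 52.3, 79.99])   # dummy compressive strengths, MPa
scaled = normalize(strength)
restored = denormalize(scaled, strength.min(), strength.max())
```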

4 Model performance indicators

To evaluate the performance of the models, 70% of the data were randomly devoted to the training phase and the remaining 30% were assigned to the testing phase. As primary criteria, statistical model performance indicators including the Pearson correlation coefficient (r), determination coefficient (R2), root mean squared error (RMSE), and mean absolute error (MAE) are used. These indicators are defined as follows:

$$r = \frac{{N\left( {\mathop \sum \nolimits_{i = 1}^{N} O_{i} \cdot P_{i} } \right) - \left( {\mathop \sum \nolimits_{i = 1}^{N} O_{i} } \right) \cdot \left( {\mathop \sum \nolimits_{i = 1}^{N} P_{i} } \right)}}{{\sqrt {\left( {N\mathop \sum \nolimits_{i = 1}^{N} O_{i}^{2} - \left( {\mathop \sum \nolimits_{i = 1}^{N} O_{i} } \right)^{2} } \right) \cdot \left( {N\mathop \sum \nolimits_{i = 1}^{N} P_{i}^{2} - \left( {\mathop \sum \nolimits_{i = 1}^{N} P_{i} } \right)^{2} } \right)} }}$$
(33)
$$R^{2} = \frac{{\left[ {\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - \overline{O}} \right) \cdot \left( {P_{i} - \overline{P}} \right)} \right]^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - \overline{O}} \right)^{2} \cdot \mathop \sum \nolimits_{i = 1}^{N} \left( {P_{i} - \overline{P}} \right)^{2} }}$$
(34)
$${\text{RMSE}} = \sqrt {\mathop \sum \limits_{i = 1}^{N} \frac{1}{N}\left( {O_{i} - P_{i} } \right)^{2} }$$
(35)
$${\text{MAE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left| {O_{i} - P_{i} } \right|$$
(36)

where \(N\) is the number of training or testing samples; \(O_{i}\) and \(P_{i}\) are the observed and predicted values in the sample \(i\), respectively; \(\overline{O}\) and \(\overline{P}\) are also the mean observed and predicted values, respectively.

In addition to the aforementioned formulations, the relative root mean squared error (RRMSE) and relative mean absolute error (RMAE) are also defined to express the difference between the observed and predicted values as a percentage of the mean observed value:

$${\text{RRMSE}} = \frac{{{\text{RMSE}}}}{{\overline{O}}} \cdot 100$$
(37)
$${\text{RMAE}} = \frac{{{\text{MAE}}}}{{\overline{O}}} \cdot 100$$
(38)

Since Eqs. (33)–(38) are based on linear relations between the observed values (\(O\)) and predicted values (\(P\)), they are overly sensitive to outliers and extreme values and cannot account for additive or proportional differences between the observed (\(O\)) and predicted (\(P\)) values [79]. To address this issue, Willmott's index [80] \((0 \le {\text{WI}} \le 1.0)\) and the Nash–Sutcliffe coefficient [81] (\(- \infty \le E_{NS} \le 1\)) are also defined as follows:

$$E_{NS} = 1 - \left[ {\frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - P_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - \overline{O}} \right)^{2} }}} \right]$$
(39)
$${\text{WI}} = 1 - \left[ {\frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - P_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {\left| {P_{i} - \overline{O}} \right| + \left| {O_{i} - \overline{O}} \right|} \right)^{2} }}} \right]$$
(40)
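All of the indicators in Eqs. (33)–(40) can be computed directly from the observed and predicted vectors; the helper below is an illustrative Python version (the study's implementation was in MATLAB), with R2 taken as the square of the Pearson coefficient.

```python
import numpy as np

def performance_indices(O, P):
    """Indicators of Eqs. (33)-(40) for observed O and predicted P vectors."""
    O, P = np.asarray(O, float), np.asarray(P, float)
    r = np.corrcoef(O, P)[0, 1]                                # Eq. (33)
    R2 = r ** 2                                                # Eq. (34)
    rmse = np.sqrt(np.mean((O - P) ** 2))                      # Eq. (35)
    mae = np.mean(np.abs(O - P))                               # Eq. (36)
    rrmse, rmae = 100 * rmse / O.mean(), 100 * mae / O.mean()  # Eqs. (37)-(38)
    nse = 1 - np.sum((O - P) ** 2) / np.sum((O - O.mean()) ** 2)            # Eq. (39)
    wi = 1 - np.sum((O - P) ** 2) / np.sum(
        (np.abs(P - O.mean()) + np.abs(O - O.mean())) ** 2)                 # Eq. (40)
    return dict(r=r, R2=R2, RMSE=rmse, MAE=mae,
                RRMSE=rrmse, RMAE=rmae, NSE=nse, WI=wi)
```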

To have a reasonable comparison among the models, all the codes were developed in the MATLAB environment and no external toolbox or compiler was used. The codes were run on a computer with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz and 16.0 GB of RAM.

5 Results and discussion

Six different models have been considered in this investigation. These models include an ANN, an ANFIS, an ELM, an SVR-Poly, an SVR-RBF, and a hybrid ELM-GWO. Figure 6 briefly shows a flowchart of the models with a focus on the ELM-GWO model. In what follows, the development of the models and results of each model are presented and discussed comprehensively.

Fig. 6

Flowchart of the models

5.1 Models development

5.1.1 ANN development

The performance of an ANN model depends significantly on the architecture of the model, i.e., the number of hidden layers and the number of neurons in each layer. To determine the architecture of the ANN, a trial and error process was conducted. Different architectures with various numbers of hidden layers and neurons were developed, and each model was run three times with 1000 epochs. To tune the weights and biases of the ANN models, the Levenberg–Marquardt algorithm (LMA) was used, as it is often the fastest BP algorithm for training [65, 82, 83]. Finally, the mean RMSE over the runs was used as the statistical indicator of model performance. Table 2 summarizes the RMSE values recorded throughout the trial and error process.

Table 2 Results of the trial and error process to find the optimum ANN architecture

As can be seen in this table, architecture number 3, i.e., a single hidden layer with 8 neurons, shows the lowest RMSE in the testing phase. Note that although some models reached lower RMSE values in the training phase, they could not achieve the same performance in the testing phase. Therefore, the architecture with the smallest difference between the training and testing RMSE values is selected as the most reliable model. Considering these observations, architecture number 3 was adopted as the final architecture, as depicted in Fig. 7.

Fig. 7

Considered ANN architecture

5.1.2 ANFIS development

To develop an ANFIS model, an initial fuzzy inference system (FIS) must first be created and then trained. For this purpose, a FIS of the Takagi–Sugeno type was adopted, as it generally leads to more robust results [47, 84,85,86]. In addition, the default hybrid algorithm of ANFIS was employed in the training phase. To select the number of MFs for each input variable (number of clusters), a trial and error process was carried out and the RMSE values were recorded, as shown in Table 3. It was also observed in this procedure that an initial FIS of degree 3 trained for 100 iterations leads to a model with better performance. According to Table 3, model number 4 yields not only the lowest RMSE in the testing phase but also the smallest difference between the training and testing RMSE values. Therefore, an ANFIS architecture with 8 MFs for each input variable was selected, as illustrated in Fig. 8.

Table 3 Performance evaluation of ANFIS with different numbers of MFs
Fig. 8

Considered ANFIS architecture

5.1.3 SVR development

The performance of an SVR model depends on the values of its parameters; thus, these parameters need to be determined precisely. As mentioned, two SVR models were considered in this study: an SVR with an RBF kernel, whose unknown parameters are \(C\), \(\gamma\), and \(\epsilon\), and an SVR with a polynomial kernel, whose unknown parameters are \(C\), \(d\), and \(\epsilon\). To determine these parameters, the grid search technique was used in both cases: for constant values of \(\epsilon\), the impact of the two remaining parameters on model performance was evaluated in terms of RMSE, and the grid point with the lowest RMSE was selected as optimal. The results of the grid search showed that the best performance of the SVR-RBF model is obtained for \(C = 1024\) and \(\gamma = 0.25\), while the SVR-Poly model performs best when \(C = 0.125\) and \(d = 4\). The 3D surfaces of the grid search are depicted in Fig. 9.
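The grid search described above can be reproduced, for illustration only, with scikit-learn's GridSearchCV; the search ranges, the fixed \(\epsilon\) value, and the synthetic data below are assumptions for the sketch, not the grids or data used in this study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Dummy normalized data standing in for the training set of this study.
rng = np.random.default_rng(0)
X, y = rng.uniform(-1, 1, size=(200, 8)), rng.uniform(-1, 1, size=200)

grid_rbf = GridSearchCV(
    SVR(kernel="rbf", epsilon=0.1),
    {"C": 2.0 ** np.arange(-3, 11), "gamma": 2.0 ** np.arange(-6, 3)},
    scoring="neg_root_mean_squared_error", cv=5).fit(X, y)

grid_poly = GridSearchCV(
    SVR(kernel="poly", epsilon=0.1),
    {"C": 2.0 ** np.arange(-3, 11), "degree": [2, 3, 4, 5]},
    scoring="neg_root_mean_squared_error", cv=5).fit(X, y)

print(grid_rbf.best_params_, grid_poly.best_params_)
```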

Fig. 9

3D graphical diagrams for tuning the parameters of SVR with: a RBF kernel, b polynomial kernel

5.1.4 ELM development

The ELM algorithm only deals with single-layer feed-forward neural network (SLFN) architectures; thus, the number of hidden layers does not need to be determined, and the only unknown is the number of neurons in the SLFN. To specify this number, a trial and error process was conducted and the performance of the ELM model was evaluated in the training and testing phases by the RMSE indicator. Table 4 shows the models considered in the trial and error process together with their RMSE values. As can be seen, the lowest RMSE was obtained for model number 11, with 110 neurons in the hidden layer. Therefore, an SLFN with 110 hidden neurons was developed, as shown in Fig. 10.

Table 4 Results of the trial and error process to determine the number of neurons in the ELM model
Fig. 10
figure 10

Considered ELM architecture

5.1.5 ELM-GWO development

The architecture of the ELM-GWO was kept the same as that of the ELM (i.e., 110 hidden neurons in the SLFN) so that a fair comparison between the ELM and the hybrid ELM-GWO could be made. One of the most considerable advantages of the GWO algorithm is that it depends on few parameters; the only parameter to be tuned is the size of the grey wolf population. For this purpose, the ELM-GWO algorithm was run for different population sizes, and the convergence rate and the time required for convergence were recorded. Figure 11a illustrates the convergence curves for different wolf populations over 1000 iterations. As can be seen, the population size did not significantly affect the fitness value (RMSE). Figure 11b shows the variation of the best fitness value (i.e., the fitness value after 1000 iterations) and the time required to reach it for different population sizes. For a population of 75 wolves, not only was the lowest RMSE obtained, but the time required for convergence was also less than that of the larger populations. Based on these observations, a population of 75 wolves was used in the ELM-GWO model.

Fig. 11

ELM-GWO development: a the impact of number of population on the fitness value, b comparison of the required time for convergence and the best fitness value

5.2 Comparison of the results and discussion

After tuning the parameters of all the models, each model was run and its performance was evaluated in terms of the previously mentioned metrics. Table 5 shows the performance indices obtained in the training phase for all six models. As can be seen, all the models reached satisfactory performance, with high values (close to one) of r, R2, NSE, and WI, and low values (close to zero) of RMSE, MAE, RRMSE, and RMAE. However, the best performance metrics were obtained by the ELM-GWO model, which shows that it was the most successful model in the training phase. After the ELM-GWO model, the other models were in very close competition. If the remaining five (standard) models are compared in terms of r, R2, RMSE, RRMSE, and WI, the SVR-RBF model achieves the best performance; if they are compared in terms of MAE, RMAE, and NSE, the ANFIS model is more successful in the training phase. This implies that although the magnitude of the errors in the SVR-RBF model is lower, the mean of the errors in the ANFIS model is closer to zero. A similar situation is observed when comparing the ANN with the SVR-Poly model. Another noteworthy point concerns the performance metrics of the ELM model, which indicate its weaker performance compared with the other models. Comparing the ELM model (i.e., the worst model) with the ELM-GWO model (i.e., the best model) leads to the conclusion that tuning the initial weights and biases of the SLFN in the ELM algorithm can be highly effective in improving the performance of the model.

Table 5 Performance evaluation of the ML models in the training phase

Figure 12 demonstrates the regression diagrams of all the models in the training phase. The horizontal axis of each diagram represents the observed values of the training samples, while the vertical axis shows the values predicted by the model. The blue line in each diagram is the line of 100% agreement between the observed and predicted values, and the other radial lines deviate from it by 15% and 30%. Accordingly, if all the points lay on the blue line with the equation \(y = x\), the model would have predicted the actual values without any error. As can be observed in this figure, the ELM-GWO model not only has the highest value of R2 but also the fitted line closest to \(y = x\). After this model, the next best R2 values were obtained by the ANFIS, SVR-RBF, ANN, SVR-Poly, and ELM models, in that order.

Fig. 12

Regression diagrams of the models in the training phase: a ANN model, b ANFIS model, c ELM model, d SVR-Poly model; e SVR-RBF model, f ELM-GWO model

Although a proper performance in the training phase helps models in predicting the targets, it does not guarantee their performance in the testing phase. In other words, models may not be able to repeat their strong training results on unseen data; thus, their performance needs to be tested. Table 6 illustrates the performance indices of the models in the testing phase. As can be observed, the ELM-GWO model reached the best performance indices, with higher values of r, R2, NSE, and WI, and lower values of RMSE, MAE, RRMSE, and RMAE. After this model, and in contrast to the training phase, the ANN model showed the best performance with superior indices among the remaining models. This means that after the ELM-GWO model, the ANN model is the most reliable model for predicting new targets. It is worth mentioning that the ELM-GWO is essentially an ANN with a single hidden layer whose weights and biases have been determined by another optimization algorithm (i.e., GWO instead of a commonly used backpropagation (BP) algorithm) combined with the Moore–Penrose generalized inverse. Therefore, an ELM-GWO model not only incorporates an ANN in its structure but also uses mathematical theory to better set the values of the weights and biases.

Table 6 Performance evaluation of the ML models in the testing phase

Figure 13 shows the regression diagrams of the models in the testing phase. The better performance of the ELM-GWO model is apparent, as it has the highest value of R2 and its data points are gathered most compactly along the blue line (the line of 100% agreement). After the ELM-GWO model, the best performance in terms of R2 is found in the diagrams of the ANN, ANFIS, SVR-RBF, SVR-Poly, and ELM models, respectively.

Fig. 13

Regression diagrams of the models in the testing phase: a ANN model, b ANFIS model, c ELM model, d SVR-Poly model, e SVR-RBF model, f ELM-GWO model

To show the performance of the models in the testing phase more clearly, Fig. 14 has also been drawn, in which the capability of the models in predicting each of the testing samples is shown. As can be seen, all the models predict most of the testing samples closely; however, the better performance of the ELM-GWO and the weaker performance of the ELM are evident. To evaluate the models more comprehensively, graphical Taylor diagrams are presented in Fig. 15. The horizontal and vertical axes of these diagrams show the values of standard deviation, which are connected by circular lines. The blue radial lines drawn from the origin show the value of the correlation coefficient as one performance indicator, and the green circular lines show the value of RMSE as another. In these diagrams, the observed data are taken as a base model with zero error (i.e., RMSE = 0), the highest correlation coefficient (i.e., r = 1), and a calculated standard deviation. The performance of the other models, in terms of standard deviation, RMSE, and r, is then compared with that of the observed data. Accordingly, the best model is the one most similar to the base model of the observed data. As can be seen, the ELM-GWO model takes the position closest to the observed data in both the training and testing phases, which again illustrates its more successful performance. It can also be observed that, on the one hand, the performance of the SVR-RBF model is highly similar to that of the SVR-Poly model and, on the other hand, the performance of the ANN and ANFIS models is almost alike. This can be related to the origins of the models, as the SVR-RBF and SVR-Poly models belong to the SVR family and the ANFIS model is an ANN combined with fuzzy rule principles. However, a significant difference can be found between the ELM and ELM-GWO models. To exhibit this difference, the ratio of the values predicted by both models to the observed values was calculated for each training and testing sample, and the results were prepared in the form of diagrams, as shown in Fig. 16. The vertical axis of these diagrams is the ratio of the predicted to the observed values, and the horizontal axis shows the training or testing samples. The more the data points concentrate on the blue line with a ratio of one, the more precise the corresponding model is. As can be seen, the data points of the ELM-GWO model are concentrated compactly around the blue line, while less concentration can be seen for the ELM model. Taken together, these results reveal that the performance of an ELM algorithm can be efficiently improved if the initial weights and biases of the SLFN are tuned by another algorithm.

Fig. 14

Compressive strength prediction in the testing phase by different ML models

Fig. 15

Graphical Taylor diagrams for comparison of the ML models: a training phase, b testing phase

Fig. 16

Comparison of the ELM model with the ELM-GWO model: a training phase of ELM, b training phase of ELM-GWO, c testing phase of ELM, d testing phase of ELM-GWO

Another important parameter for comparing the models is the time they require for training. As mentioned in the previous sections, no external compiler or toolbox was used in developing the models, and the same computer system was used to run them; therefore, the training times of the models can be compared under equal conditions. Table 7 shows the recorded times from the start of each program to its end. As can be seen, the fastest training belongs to the ELM, while the slowest belongs to the ELM-GWO model. In other words, the time required by the ELM-GWO model is roughly 300 times that of the ELM model. This shows that using an evolutionary algorithm in the structure of ELM can severely increase the training time.

Table 7 Required time for training

6 Conclusion

Concrete is a cost-effective material that plays a significant role in the construction industry. Partial replacement of cement in concrete with pozzolans such as fly ash (FA) and furnace slag (FS) provides not only a way to dispose of waste materials but also a means to reduce the adverse by-products of cement production. However, estimating the properties of hardened concrete, above all its compressive strength, is not easy and requires more advanced techniques. The main motivation of the current paper was to employ a soft computing approach to predict the compressive strength of hardened concrete in which cement has been partially replaced with FA and FS. For this purpose, an extreme learning machine (ELM) was combined with a metaheuristic algorithm inspired by the social behavior of grey wolves, known as the grey wolf optimizer (GWO), and a novel ELM-GWO model was proposed. The predictive capability of the proposed model was validated against well-established nonlinear predictive models, namely an artificial neural network (ANN), an adaptive neuro-fuzzy inference system (ANFIS), a standard ELM, and two support vector regression models with different kernel functions (i.e., radial basis function (RBF) and polynomial (Poly)). Different performance indices were defined and calculated for each of the models, and the models were then assessed statistically. The results indicated that all the proposed models can predict the compressive strength of concrete satisfactorily, thus reducing the need for costly experiments and saving time. In addition, although the ELM model had the worst performance among the models, the ELM-GWO model reached the best performance indices. Therefore, the hybridization of ELM with GWO can be very efficient in improving the performance of the ELM model. It was also observed that the ANN provided a more reliable model than the other standard ML models, as it showed better performance in the testing phase.

Although the application of a hybrid ELM-GWO model was evaluated for the first time in this research and considerable improvements in the accuracy of the ELM model were observed, the time required to train the model increased severely. This increase was principally due to employing an evolutionary algorithm (i.e., GWO) in the structure of ELM; the repeated evaluations of such an algorithm take time, so an increase in training time is inevitable with evolutionary algorithms. In future studies, the hybridization of the ELM algorithm with more advanced algorithms will be investigated so that the time requirement of the model can be addressed. The application of this hybrid model to the behavior prediction of other structural components will also be examined.