Introduction

The construction industry contributes to the social and economic wealth of developed and developing countries (Myers, 2016; Owusu-Manu et al., 2019). As a result, numerous researchers have studied how to enhance the performance of construction projects (Al-Dhaheri & Burhan, 2022; Barnes, 1988; Bryde, 2008; Salim & Mahjoob, 2020; Wateridge, 1998). Successful performance means the construction project is finished within the three critical criteria: cost, duration, and quality (Abbas & Burhan, 2023; Mohammad et al., 2021; Pollack et al., 2018). Project quality can be controlled during the construction phases, while cost and duration must be estimated at the beginning of a project (Azman et al., 2013). The project cost is considered a determinant of project success and owners' satisfaction due to its impact on their financial decisions (Huo et al., 2018; Matel et al., 2019). Estimating the project cost accurately helps decision-makers perform sound feasibility studies and monitor the cash flows of construction projects (Shehu et al., 2014). Estimating construction cost is a complex problem characterized by incomplete information, risks, and uncertainties that lead to inaccurate results (Ahiaga-Dagbui & Smith, 2012; Fadhil & Burhan, 2022; Jing et al., 2019). An underestimated cost leads to cost overruns and financial problems for all parties in construction projects (N. N. Abbas & Burhan, 2022; Akintoye, 2000). To reduce these problems and achieve project objectives, several methods have been proposed in past studies to estimate construction cost accurately (Araba et al., 2021; Elhegazy et al., 2022; Sharma et al., 2021). Researchers in these studies have focused on two approaches: qualitative and quantitative methods. The qualitative approach depends on expert opinion, which may lead to bias and inaccurate outcomes (Alex et al., 2010). Accordingly, recent studies have developed statistical approaches such as regression analysis (Al-Momani, 1996; Lowe et al., 2006) and artificial intelligence techniques for construction cost estimation (Shutian et al., 2017; Son et al., 2012). Several factors affect cost estimation, such as project characteristics and external economic parameters. Most studies have focused on project characteristics and ignored economic parameters, because there is no agreement among researchers on the impact of economic factors on project cost, and little attention has been paid to incorporating these variables into the cost estimation process (Baloi & Price, 2003; Elhag et al., 2005; Gunduz & Maki, 2018; Zhao et al., 2020). This disagreement can be attributed to the use of inappropriate approaches to investigate the influencing factors (Zhao et al., 2020). Consequently, there is a need to develop a suitable method to explore the impact of the significant parameters of construction cost estimation (Wang et al., 2022).

Based on a questionnaire survey covering six influencing groups, Elhag et al. (2005) concluded that market conditions ranked fourth among cost-influencing factors. Other studies stated that economic variables have a strong effect on the final cost of a construction project (Akinci & Fischer, 1998; Shane et al., 2009). In contrast, Hatamleh et al. (2018) indicated that market conditions had the least impact on cost estimation performance. Several studies revealed that market conditions critically affect the cost estimation problem (Doloi, 2013; Iyer & Jha, 2005; Zhang et al., 2017). According to Zhao et al. (2019), market conditions have the most significant weight among the affecting parameters. Wang et al. (2022) stated that economic variables are more important than the project's parameters and play an essential role in improving cost estimation accuracy. One of the significant market conditions that affect cost estimation is inflation. Inflation significantly impacts the construction industry, especially the cost estimation process, through its effects on material prices, labor wages, and equipment costs. These effects lead to disputes among project parties and cost overruns (Musarat et al., 2021).

For the estimation process, several scholars have used regression analysis as a popular method for cost estimation (Al-Momani, 1996; Lowe et al., 2006). The advantages of this method are its simplicity and its interpretable results. However, it has drawbacks: it requires a predefined mathematical expression, and it cannot handle nonlinear relationships between input and output variables. In recent studies, soft computing algorithms have been used efficiently in construction management research and have proven their ability to deal with complex systems and capture the nonlinear relationship between input and output parameters (Aljawder & Al-Karaghouli, 2022; Pan & Zhang, 2021). Artificial intelligence (AI) models help decision-makers exploit historical data and deal with incomplete information in the early phases of a construction project (Almusawi & Burhan, 2020; Altaie & Borhan, 2018; Kaveh et al., 2008; Wang et al., 2022; Yaseen et al., 2020). Al-Momani (1996) used a linear regression model to estimate the cost of a construction project based on three project characteristics. An artificial neural network (ANN) was used by Kaveh and Khalegi (1998) to estimate the compressive strength of concrete; the study revealed the capacity of the ANN to predict plain and admixture concrete with acceptable results. Another study investigated an improved neural network, the counterpropagation neural net, to analyze and optimize large-scale structures (Kaveh & Iranmanesh, 1998). The study showed that the improved algorithm produced better results than the traditional backpropagation neural network.

Three AI models, namely decision tree (DT), support vector machine (SVM), and ANN, were developed to estimate construction cost in Turkey (Erdis, 2013). The AI models were built on 575 datasets collected from public construction projects and three input parameters: the rate of price-cut, location, and duration of a construction project. Shutian et al. (2017) used a Kalman filter with an SVM model and multi-linear regression (MLR) to estimate construction cost in China. The study showed that the presented methods are useful for estimating the cost of building projects. A study by Mahalakshmi and Rajasekaran (2019) proposed an ANN model for 52 highway construction projects and demonstrated that an ANN with a backpropagation algorithm could predict construction cost with acceptable accuracy. Linear regression was hybridized with a random forest (RF) model to predict the labor cost of a BIM project (Huang & Hsieh, 2020). The authors concluded that the hybrid model effectively improves the prediction performance of labor cost in BIM projects. Three prediction models, namely multivariate adaptive regression spline (MARS), extreme learning machine (ELM), and partial least squares regression (PLS), were applied to estimate the cost of field canal improvement (Shartooh Sharqi & Bhattarai, 2021). The researchers concluded that the MARS algorithm obtained the best cost prediction accuracy, with a high R-squared and low estimation error. The performance of three AI models, namely RF, SVM, and MLR, was investigated by Shoar et al. (2022) to predict the cost overrun of engineering services in 95 construction projects. The study revealed that the RF model performed better than the other two models in cost estimation.

To investigate the impact of influencing parameters on the construction cost problem, most researchers have used the relative importance index, correlation statistics, structural equation modeling, and factor analysis (Cheng, 2014; Gunduz & Maki, 2018; Iyer & Jha, 2005). However, bias can occur in these techniques because the collected data depend on opinions and questionnaire surveys. Also, traditional statistical methods rely on the linear correlation between input and output parameters, which leads to errors in capturing the nonlinear relationships of a complex system. As a result, errors appear in the ranking of influencing parameters and in the cost estimation results. It can be seen that traditional approaches cannot deal with the uncertainties and complexity of construction projects. Consequently, there is a necessity to develop an effective tool that can produce accurate cost estimation results. In recent years, a new AI algorithm called extreme gradient boosting (XGBoost) has been adopted to handle the complex nature of engineering problems. It is an efficient AI algorithm and has been used effectively as both a feature selector and a predictor by civil engineering researchers (Chakraborty et al., 2020; Chen & Guestrin, 2016; Falah et al., 2022; Tao et al., 2022).

The current research investigates the efficiency of AI techniques in feature selection and prediction for construction cost estimation. The research objectives are to: (1) evaluate the ability of the XGBoost, ELM, and MARS models to predict construction cost, and (2) examine the efficiency of the XGBoost algorithm in selecting the influencing parameters of the cost estimation process, incorporating both inflation and project characteristics. This study contributes to the body of knowledge by helping decision-makers identify and monitor the crucial parameters of cost estimation in a quantitative approach and by enabling project parties to compare the planned and estimated cost during the construction phase. The outcome of this study helps the project's stakeholders decrease errors in cost estimation and make appropriate decisions to reduce these deficiencies.

Construction cost dataset description

The construction cost dataset was gathered from building projects in Iraq. The data were collected through a survey of building documents for nineteen construction projects executed between 2016 and 2021. The collected data include seven parameters: area of ground floor (GFA), total floor area (TFA), duration (D), number of elevators (EN), number of floors (FN), type of footing (FT), and inflation (F). Project characteristics were gathered from the survey and the review of project documents, while inflation information was taken from the open-source data of the Central Bank of Iraq (https://cbiraq.org/). The statistical measures of the cost dataset, including minimum, maximum, mean, median, standard deviation, skewness, and kurtosis, are illustrated in Table 1. The statistical measures show that the mean project cost is $2,177,699. The minimum and maximum values of the duration are 122 and 787 days, respectively. It can be recognized that the kurtosis of most of the gathered parameters is less than 3, which indicates that the collected data are approximately normally distributed.

Table 1 Statistical characteristics of the collected datasets
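As a concrete illustration, the measures reported in Table 1 can be reproduced with a few lines of R (the language used later in this study). The sketch below is illustrative only, not the authors' script; the file name projects.csv and its column layout are assumptions.

```r
# Minimal sketch for a Table 1-style summary; "projects.csv" and its
# columns (GFA, TFA, D, EN, FN, FT, F, Cost) are assumed for illustration.
library(moments)  # provides skewness() and kurtosis()

data <- read.csv("projects.csv")

describe_col <- function(x) {
  c(min = min(x), max = max(x), mean = mean(x), median = median(x),
    sd = sd(x), skewness = skewness(x), kurtosis = kurtosis(x))
}

# Summarize every numeric parameter; moments::kurtosis() returns raw
# kurtosis, so values below 3 match the near-normality remark above.
t(round(sapply(data[sapply(data, is.numeric)], describe_col), 3))
```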

Methodology

Extreme gradient boosting (XGBoost)

The extreme gradient boosting algorithm is a recent development of tree-based boosting models, introduced as an algorithm that can meet the demands of prediction problems (Chen & Guestrin, 2016; Friedman, 2002). It is a flexible model whose hyperparameters can be tuned using soft computing algorithms (Eiben & Smit, 2011; Probst et al., 2019), whereas traditional methods rely on trial and error and personal experience to choose the optimal parameters. The most important reasons behind the success of XGBoost are its flexibility and its ability to scale to billions of examples in distributed systems; these properties make the algorithm more accurate and faster than existing algorithms. Gradient boosting aims to produce more robust models by combining weak learners in an iterative process. In every iteration, the loss function is reduced using the residuals of the previous trees (Zhang et al., 2019): every new tree is fitted to the residuals of the previous predictors and is then added to the developed model to update the residual value. XGBoost has proven successful among tree models such as random forest, gradient boosting tree, and AdaBoost. The reason behind this effectiveness is its scalability across prediction scenarios and the fast running of the system on a single machine.
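To make the residual-fitting idea concrete, the following minimal R sketch grows a small boosted ensemble by hand, fitting each new tree to the residuals of the current ensemble. It is illustrative only, not the study's model; the toy data and the rpart weak learner are assumptions.

```r
# Hand-rolled gradient boosting with squared loss: each tree fits the
# residuals of the ensemble so far, and its scaled prediction is added in.
library(rpart)

set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- sin(df$x) + rnorm(200, sd = 0.2)   # toy nonlinear target

eta  <- 0.3                                # learning rate
pred <- rep(mean(df$y), nrow(df))          # start from a constant prediction
for (m in 1:100) {                         # 100 boosting rounds
  df$res <- df$y - pred                              # current residuals
  tree   <- rpart(res ~ x, data = df, maxdepth = 2)  # shallow weak learner
  pred   <- pred + eta * predict(tree, df)           # additive update
}
sqrt(mean((df$y - pred)^2))                # training RMSE shrinks with rounds
```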

The regularized objective loss function \(f(L)\) at the Lth iteration of the XGBoost model can be expressed as shown below:

$$f\left(L\right)=\sum_{i=1}^{n} l\left({y}^{\left(i\right)},{\widehat{y}}_{L}^{\left(i\right)}\right)+\sum_{j=1}^{L}\Omega \left({f}_{j}\right),$$
(1)

where n represents the number of observations; \({\widehat{y}}_{L}^{(i)}\) is the estimate of the ith observation at iteration L; \(l(\cdot)\) represents the loss function; and \(\Omega\) is the regularization term, which is computed using the following expression:

$$\Omega (f)=\gamma N+\frac{1}{2}\lambda \sum_{j=1}^{N} {\omega }_{j}^{2},$$
(2)

where N denotes the number of leaves in the tree, \({\omega }_{j}\) is the weight of leaf j, and \(\gamma\) and \(\lambda\) are two coefficients used to manage the regularization.

The XGBoost model is trained additively; at iteration L, the objective used to fit the new tree \({f}_{L}\) is expressed as follows:

$$f\left(L\right)=\sum_{i=1}^{n} l\left({y}^{\left(i\right)},{\widehat{y}}_{L-1}^{\left(i\right)}+{f}_{L}\left({x}^{\left(i\right)}\right)\right)+\Omega \left({f}_{L}\right)+\sum_{j=1}^{L-1}\Omega \left({f}_{j}\right).$$
(3)

Furthermore, a second-order Taylor expansion is applied to approximate the objective function, as shown in the following equation:

$$f\left(L\right)=\sum_{i=1}^{n} \left[l\left({y}^{\left(i\right)},{\widehat{y}}_{L-1}^{\left(i\right)}\right)+{g}_{i}\,{f}_{L}\left({x}^{\left(i\right)}\right)+\frac{1}{2}{h}_{i}\,{f}_{L}^{2}\left({x}^{\left(i\right)}\right)\right]+\Omega \left({f}_{L}\right)+K,$$
(4)

where \({g}_{i}={\partial }_{{\widehat{y}}_{L-1}}l\left({y}^{(i)},{\widehat{y}}_{L-1}\right)\) represents the first-order derivative of the loss function; \({h}_{i}={\partial }_{{\widehat{y}}_{L-1}}^{2}l\left({y}^{(i)},{\widehat{y}}_{L-1}\right)\) reflects the second-order derivative of the loss function; and \(K\) is a constant.
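For completeness, the standard step that turns Eq. (4) into trainable leaf weights (from Chen & Guestrin, 2016, and not spelled out above) groups the summation by leaf \(j\) with instance set \({I}_{j}\) and substitutes Eq. (2); minimizing per leaf gives the optimal leaf weight and the resulting objective:

$${\omega }_{j}^{*}=-\frac{\sum_{i\in {I}_{j}}{g}_{i}}{\sum_{i\in {I}_{j}}{h}_{i}+\lambda },\qquad {f}^{*}\left(L\right)=-\frac{1}{2}\sum_{j=1}^{N}\frac{{\left(\sum_{i\in {I}_{j}}{g}_{i}\right)}^{2}}{\sum_{i\in {I}_{j}}{h}_{i}+\lambda }+\gamma N.$$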

For selecting input parameters, the XGBoost model is considered a robust algorithm for such problems. XGBoost efficiently builds the boosting trees in parallel to choose the essential parameters based on their weight (Friedman, 2002). Gain, cover, and frequency are the popular measures used by XGBoost for ranking features. The gain evaluates the contribution of each feature to the developed prediction model; the cover reflects the relative number of observations related to each feature; and the frequency shows how often a feature occurs in the splits of the gradient boosted trees. The ranking evaluation can be expressed mathematically as below:

$${N}_{v}=\sum_{L}\sum_{l=1}^{N}I\left({V}_{L}^{l},v\right),$$
(5)

where the outer sum runs over the boosting iterations \(L\), N is the number of nodes in each tree, \({V}_{L}^{l}\) is the feature used at node \(l\), and \(I(\cdot)\) is an indicator term calculated using the following expression. The graphical scheme of the XGBoost algorithm is presented in Fig. 1.

Fig. 1 Graphical scheme of XGBoost model

$$I\left({V}_{L}^{l},v\right)=\left\{\begin{array}{ll}1, & {\text{if}}\ {V}_{L}^{l}=v,\\ 0, & {\text{otherwise}}.\end{array}\right.$$
(6)
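As an illustration of this ranking procedure, the short R sketch below fits an XGBoost model with the settings reported later in the Methodology (max.depth = 10, eta = 0.3, nrounds = 100) and extracts the gain, cover, and frequency measures. The data object and column names are assumptions carried over from the Table 1 sketch, not the authors' script.

```r
# Hedged sketch of feature ranking with the xgboost package; the `data`
# object with Table 1 column names is assumed from the earlier sketch.
library(xgboost)

X <- as.matrix(data[, c("GFA", "TFA", "D", "EN", "FN", "FT", "F")])
y <- data$Cost

bst <- xgboost(data = X, label = y, max.depth = 10, eta = 0.3,
               nrounds = 100, objective = "reg:squarederror", verbose = 0)

# The Gain, Cover and Frequency columns correspond to the three
# ranking measures described above.
imp <- xgb.importance(model = bst)
print(imp)
```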

Extreme learning machine (ELM)

An extreme learning machine is a powerful ANN method characterized by simplicity and a non-iterative procedure for training a single-layer neural network (Kardani et al., 2021; Shi-fan et al., 2021). The ELM algorithm can reach optimum performance more efficiently than a traditional ANN. A linear activation function is used for the input and output layers, while a sigmoid activation function is applied in the hidden layer (Hou et al., 2018). In the training process, ELM assigns random weights to the hidden neurons and then uses the Moore–Penrose pseudo-inverse to determine the weights of the output layer. This process makes the ELM model fast and enables it to work with many different transfer functions (Huang et al., 2004, 2006). The training of the ELM model is expressed mathematically as below:

$$\sum_{i=1}^{N}\left\Vert {t}_{i}-{\widetilde{t}}_{i}\right\Vert =\sum_{i=1}^{N}\left\Vert {t}_{i}-\sum_{l=1}^{L}{\beta }_{l}\,g\left({w}_{l}\cdot {x}_{i}+{b}_{l}\right)\right\Vert =0,$$
(7)

where \({t}_{i}\) represents the target output vector, \({x}_{i}\) refers to the input vector, \({w}_{l}\) and \({b}_{l}\) are the randomly assigned weight and bias of hidden neuron \(l\), \({\beta }_{l}\) is the corresponding output weight, and \(g(\cdot)\) is the activation function. Equation (7) can be written in the following form:

$$H\beta =T,$$
(8)
$$\begin{gathered} H = \left[ {\begin{array}{*{20}c} {g\left( {w_{1} \cdot x_{1} + b_{1} } \right)} & \cdots & {g\left( {w_{L} \cdot x_{1} + b_{L} } \right)} \\ \vdots & \ddots & \vdots \\ {g\left( {w_{1} \cdot x_{N} + b_{1} } \right)} & \cdots & {g\left( {w_{L} \cdot x_{N} + b_{L} } \right)} \\ \end{array} } \right]_{N \times L} , \hfill \\ \begin{array}{*{20}c} {\beta = \left[ {\begin{array}{*{20}c} {\beta_{1,1} } & \cdots & {\beta_{1,m} } \\ \vdots & \ddots & \vdots \\ {\beta_{L,1} } & \cdots & {\beta_{L,m} } \\ \end{array} } \right]_{L \times m} ,} \\ {T = \left[ {\begin{array}{*{20}c} {t_{1,1} } & \cdots & {t_{1,m} } \\ \vdots & \ddots & \vdots \\ {t_{N,1} } & \cdots & {t_{N,m} } \\ \end{array} } \right]_{N \times m} ,} \\ \end{array} \hfill \\ \end{gathered}$$
(9)

where \(H\) represents the output matrix of the hidden layer, \(\beta\) is the matrix of connection weights between the hidden and output layers, and \(T\) is the matrix of target outputs over the N training samples. The ELM model is developed as follows: first, create random weights for the hidden layer; then generate \(H\) and \(T\), the matrices of the hidden and output layers; and finally compute the weights of the output layer using the equation below:

$$\tilde{\beta } = H^{\dag } T,$$
(10)

where \(H^{\dag }\) refers to the Moore–Penrose Pseudo-inverse function. The graphical scheme of ELM model is illustrated in Fig. 2.

Fig. 2 Graphical scheme of ELM model
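A minimal from-scratch R sketch of Eqs. (7)–(10) is given below for illustration; the study itself uses the elmNNRcpp package described later. The random hidden weights, sigmoid activation, and Moore–Penrose pseudo-inverse follow the procedure above, while the function names and weight ranges are assumptions.

```r
# From-scratch ELM: random hidden layer, sigmoid activation, and a
# Moore-Penrose pseudo-inverse for the output weights (Eq. 10).
library(MASS)  # ginv() computes the Moore-Penrose pseudo-inverse

elm_fit <- function(X, y, n_hidden = 100) {
  W <- matrix(runif(ncol(X) * n_hidden, -1, 1), ncol(X), n_hidden)  # w_l
  b <- runif(n_hidden, -1, 1)                                       # b_l
  H <- 1 / (1 + exp(-(X %*% W +
         matrix(b, nrow(X), n_hidden, byrow = TRUE))))   # hidden output H
  beta <- ginv(H) %*% y                                  # beta = H^+ T
  list(W = W, b = b, beta = beta)
}

elm_predict <- function(model, X) {
  H <- 1 / (1 + exp(-(X %*% model$W +
         matrix(model$b, nrow(X), length(model$b), byrow = TRUE))))
  H %*% model$beta
}
```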

Multivariate adaptive regression spline model (MARS)

The MARS model is a nonlinear machine learning algorithm that has been introduced to explore the nonlinearity of complex systems using piecewise segments (Friedman, 1991; Ikeagwuani, 2021; Naser et al., 2022). The MARS model is a nonparametric method and is called a curve-based algorithm (Wu & Fan, 2019). The algorithm is similar to tree-based models in using an iterative approach in the learning process and in selecting the critical features of the prediction problem. The MARS model has revealed better efficiency than other machine learning algorithms such as the ELM and SVM models (Guo et al., 2022; Shartooh Sharqi & Bhattarai, 2021; Wu & Fan, 2019). The MARS model is developed as follows. First, the nonlinear regression problem is converted into a piecewise multiple linear regression over the training dataset: the training data are divided into several sections, and a linear regression model is developed for each section. The boundaries of each section are called knots, which are identified by the adaptive regression algorithm. In each section of the divided data, the MARS model creates a basis function (BF) to represent the relationship between the input and predicted parameters, as shown in the mathematical expression below:

$$\mathrm{BF}=\max\left(0,x-t\right)=\left\{\begin{array}{ll}x-t, & {\text{if}}\ x\ge t, \\ 0, & {\text{otherwise}},\end{array}\right.$$
(11)

where \(x\) is the value of the input variable and \(t\) represents the threshold value.

This process is called the forward phase, where the algorithm chooses the optimum input variables of the predicted model. The final phase of the MARS model is called the backward phase, in which the algorithm eliminates unused parameters selected in the early phase to improve the performance of the prediction process. The elimination of unnecessary parameters is achieved using a pruning algorithm based on generalized cross-validation (GCV), which is calculated as shown below:

$$\begin{array}{*{20}l} {{\text{GCV}}\left( M \right) = \frac{{\left( {1/N} \right)\sum\nolimits_{i = 1}^{N} {\left( {O_{i} - f\left( {x_{i} } \right)} \right)^{2} } }}{{(1 - \left( {C\left( M \right)/N} \right))^{2} }},} \\ {C\left( M \right) = \left( {d + 1} \right) \times M,} \\ \end{array}$$
(12)

where \({O}_{i}\) represents the real value; \(N\) is the number of data points; \(f\left({x}_{i}\right)\) is the estimated value; \(M\) is the number of basis functions; and \(C(M)\) represents the penalty factor, in which \(d\) ranges between 2 and 4 and represents the optimization cost of the BFs. The final step in the MARS model is combining the basis functions to obtain the predicted outcome of the developed model. Figure 3 provides the structure of the MARS model.

Fig. 3 Systematic scheme of MARS model
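For illustration, fitting a MARS model with the earth package (the library named later in the Methodology) takes only a few lines; the train_data object and the Cost formula are assumptions, not the authors' script.

```r
# Hedged sketch of a MARS fit with earth(); degree and nprune control the
# forward phase (interactions) and the backward pruning pass, respectively.
library(earth)

mars_fit <- earth(Cost ~ ., data = train_data,
                  degree = 1,    # additive basis functions only
                  nprune = 10)   # max terms retained after pruning
summary(mars_fit)   # lists the selected basis functions max(0, x - t)
evimp(mars_fit)     # variable importance from the pruning (GCV) pass
```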

Modeling process and performance evaluation

This study uses the R programming language to develop the presented AI models. XGBoost was adopted as a feature selector due to its ability to handle nonlinear and complex relationships. The libraries xgboost, ggplot2, and Matrix were used to ease the selection of input parameters. To run the XGBoost algorithm, the xgboost function was called with max.depth = 10, eta = 0.3, and nrounds = 100, and the xgb.importance function was applied to illustrate the best input selection. For the prediction process, XGBoost's hyperparameters were tuned using the expand.grid function as follows: nrounds set as 75:150 to determine the number of iterations; eta set as 0.001, 0.01, 0.1 to control the learning rate of the algorithm; xgbTree as the boosting method; max_depth set as 5, 8, 10; gamma set as 0, 1, 2; min_child_weight = 2; subsample = 0.6; and colsample_bytree = 0.8. For the MARS model, two libraries named plotrix and earth were applied. The expand.grid function was used to control the hyperparameters of the algorithm, namely degree and nprune; degree was set as 1:8, and nprune as 1:100 with length.out = 10. For the ELM model, the libraries kernlab, elmNNRcpp, and Matrix were applied. The ELM parameters were set as nhid = 100 to specify the number of hidden neurons, actfun = sin to set the activation function, and init_weights = uniform_positive to choose the initial weights. The integration of XGBoost with the AI models is presented in Fig. 4.

Fig. 4 Processing phases of the applied models
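The tuning setup described above can be sketched as follows, reading the expand.grid and xgbTree settings as caret's train interface (a reasonable interpretation of the text, not confirmed by it); the train_data object and the cross-validation scheme are assumptions.

```r
# Sketch of hyperparameter tuning over the grid stated in the text;
# caret's xgbTree method expects exactly these seven tuning parameters.
library(caret)

grid <- expand.grid(
  nrounds          = 75:150,            # number of boosting iterations
  eta              = c(0.001, 0.01, 0.1),  # learning rate
  max_depth        = c(5, 8, 10),
  gamma            = c(0, 1, 2),
  min_child_weight = 2,
  subsample        = 0.6,
  colsample_bytree = 0.8
)

xgb_tuned <- train(Cost ~ ., data = train_data,
                   method    = "xgbTree",
                   tuneGrid  = grid,
                   trControl = trainControl(method = "cv", number = 5))
xgb_tuned$bestTune   # the winning hyperparameter combination
```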

The dataset was split into two subsets: 70% for training and 30% for testing. The outcome of the AI models was evaluated using statistical metrics, including the coefficient of determination (R2), mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE) (Shehu et al., 2014), as shown in the following equations:

$$R^{2} = \left( {\frac{{\sum\nolimits_{i = 1}^{N} {\left( {y_{p} - \overline{y}_{p} } \right)} \cdot \left( {y_{a} - \overline{y}_{a} } \right)}}{{\sqrt {\sum\nolimits_{i = 1}^{N} {\left( {y_{p} - \overline{y}_{p} } \right)^{2} } \sum\nolimits_{i = 1}^{N} {\left( {y_{a} - \overline{y}_{a} } \right)^{2} } } }}} \right)^{2} ,$$
(13)
$$\mathrm{MAPE}=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{{y}_{\mathrm{p}}-{y}_{\mathrm{a}}}{{y}_{\mathrm{a}}}\right|,$$
(14)
$$\mathrm{RMSE}=\sqrt{\frac{{\sum }_{i=1}^{N}{({y}_{\mathrm{p}}-{y}_{\mathrm{a}})}^{2}}{N}},$$
(15)
$$\mathrm{MAE}=\frac{{\sum }_{i=1}^{N}\left|{y}_{\mathrm{p}}-{y}_{\mathrm{a}}\right|}{N},$$
(16)

where \({y}_{\mathrm{p}}\) and \({y}_{\mathrm{a}}\) represent the predicted and actual values of construction cost; \(\overline{y}_{{\text{p}}}\) and \(\overline{y}_{{\text{a}}}\) are the average predicted and actual values of construction cost; and N signifies the number of construction projects.
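Equations (13)–(16) translate directly into R; the sketch below is a plain transcription, with the commented usage lines resting on the assumed objects from the earlier sketches.

```r
# Direct transcription of Eqs. (13)-(16); y_p = predicted, y_a = actual.
r_squared <- function(y_p, y_a) cor(y_p, y_a)^2          # Eq. (13)
mape      <- function(y_p, y_a) mean(abs((y_p - y_a) / y_a))  # Eq. (14)
rmse      <- function(y_p, y_a) sqrt(mean((y_p - y_a)^2))     # Eq. (15)
mae       <- function(y_p, y_a) mean(abs(y_p - y_a))          # Eq. (16)

# Example with the 70/30 split described above (objects are assumptions):
# idx   <- caret::createDataPartition(data$Cost, p = 0.7, list = FALSE)
# preds <- predict(xgb_tuned, newdata = data[-idx, ])
# c(R2 = r_squared(preds, data$Cost[-idx]), MAPE = mape(preds, data$Cost[-idx]))
```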

Results and discussion

Statistical evaluation

In this study, XGBoost was applied as a robust algorithm for prediction and input selection. The feature combinations for construction cost prediction are presented in Table 2. It can be seen that the variable most correlated with cost estimation is inflation. The second input combination comprises inflation and total floor area, and the third combination comprises inflation, total floor area, and ground floor area. The remaining feature combinations are reported in Table 2.

Table 2 Feature combinations selected by XGBoost

Statistical indicators of the introduced machine learning algorithms for the training and testing phases are presented in Tables 3 and 4. The results show that the XGBoost model achieved outstanding performance in the training phase, where all input combinations attained R2 greater than 0.8. The ELM model showed a clear enhancement in prediction accuracy for all model combinations as the number of input variables increased. Among all the developed AI models, the best training results were attained by XGBoost-M6, where R2 = 0.97822, RMSE = 268,500.0294, MAE = 166,301.5715, and MAPE = 0.22701. For the testing division, the XGBoost model shows a noticeable cost prediction performance for all combinations, with R2 greater than 0.8 except for M1, where R2 drops to 0.66. The best prediction accuracy was achieved by XGBoost-M5, where R2 = 0.95216, RMSE = 590,609.7821, MAE = 332,157.171, and MAPE = 0.0875. The MARS model indicated lower prediction performance than the other AI models, with R2 less than 0.78 except for M3 and M4, which attained R2 > 0.8; the best performance was attained by MARS-M3, with R2 = 0.86203 and RMSE = 730,717.4588. For the ELM model, the best combination used five input variables, where ELM-M5 achieved R2 = 0.86005 and MAPE = 0.26184.

Table 3 Performance measurements of the applied models for the training division
Table 4 Performance measurements of the applied models for the testing division

Figures 5, 6, and 7 depict the scatter diagrams of the applied models for the testing part. It can be seen that the XGBoost model revealed an excellent improvement in prediction performance as the input variables increased, and the best results were gained by XGBoost-M5 with R2 = 0.95. For the ELM model, the developed algorithms indicated good prediction accuracy for all input combinations except M1, where R2 is less than 0.7, as shown in Fig. 6. The MARS model shows lower prediction accuracy than the other applied models, and MARS-M3 gained the best performance with a coefficient of determination equal to 0.862, as depicted in Fig. 7.

Fig. 5 Scatter plot of XGBoost model over the testing phase

Fig. 6 Scatter plot of ELM model over the testing phase

Fig. 7 Scatter plot of MARS model over the testing phase

Box and spider plots are further graphical tools used to illustrate the performance metrics of the introduced AI algorithms, as presented in Figs. 8 and 9. Figure 8 indicates that the minimum relative error was attained by XGBoost-M5, followed by MARS-M3 and ELM-M5. The graphical results show that the XGBoost-M5 model presented a significant reduction in residual error and the minimum positive error without a negative outlier. ELM-M5 revealed the maximum negative error, with one outlier point, using five input variables. Figure 9 shows the comparison of the AI models using statistical metrics in the form of a spider plot. The findings show that XGBoost-M5 gained the highest R2 and the lowest performance errors among the models. Figure 9 also shows that although MARS-M3 and ELM-M5 have an equal R2, MARS-M3 attained a higher absolute error than ELM-M5.

Fig. 8 Relative error plot for the developed models over the testing phase

Fig. 9 Spider plot for the presented AI models over the testing phase

A Taylor diagram (Taylor, 2001) was constructed to demonstrate the relationship between the developed models and the actual cost based on three statistical metrics (i.e., RMSE, correlation, and standard deviation), as depicted in Fig. 10. The diagram shows that the closest position to the actual point was achieved by XGBoost-M5, with a correlation coefficient exceeding 0.95. The visualization results reveal that the XGBoost model achieved a shorter distance to the observation point than the ELM and MARS models, which reflects the efficiency of the XGBoost approach in cost estimation problems.

Fig. 10 Taylor plot for the applied algorithms over the testing phase

Validation against past studies

Over the past years, numerous studies have been conducted on cost estimation. A study by Juszczyk (2018) evaluated the ability of the SVM model in the construction cost estimation of residential buildings. The researcher revealed that the presented model attained a low MAPE, ranging between 7% and 8.19%. In another study, the SVM model was also investigated for cost estimation of bridge construction (Juszczyk, 2019); the developed model showed its ability in cost estimation with a coefficient of determination equal to 0.94. A genetic algorithm (GA) was coupled with an ANN by Hashemi et al. (2019) to enhance the results of the estimated cost. The study concluded that ANN-GA achieved good prediction performance with an accuracy equal to 0.94. The XGBoost model was compared with twenty prediction models to estimate the cost of field canal projects by Elmousalami (2020). The researcher indicated that the presented model achieved high prediction accuracy with R2 = 0.929. It can be recognized that previous studies made good efforts to enhance the effectiveness of the cost estimation process; however, they focused on exploring project characteristics while ignoring other factors, such as economic factors. They also paid little attention to feature selection algorithms and hybrid models. This study differs from previous studies by (1) investigating both project characteristics and an economic factor represented by inflation, (2) using a recent AI algorithm (XGBoost) for feature selection and prediction and validating it against the ELM and MARS models, and (3) achieving an excellent prediction accuracy with the hybrid model using only five input parameters, with R2 equal to 0.952.

Discussion

Applying hybrid models in complex prediction processes like cost estimation enhances prediction accuracy and reduces estimation error. The results of XGBoost in the input selection process revealed that the variable most correlated with cost estimation is inflation. This result agrees with Wang et al. (2022), who concluded that economic factors are more important than project characteristics. The analysis of the prediction models showed that all the AI algorithms are able to estimate construction cost, since all the developed models achieved acceptable prediction performance. The XGBoost model exhibited excellent performance in cost estimation using five input variables (i.e., inflation, total floor area, ground floor area, duration, and floor number), where R2 = 0.952 and MAPE = 0.087, as reported in Table 4. The poorest accuracy was achieved using one input parameter, where XGBoost attained R2 < 0.7, as illustrated in Fig. 5. For the ELM model, the best statistical indicators were gained using five input parameters, where RMSE = 732,387.351 and MAE = 476,386.9595. The scatter plot diagrams revealed that applying the models on datasets with 2–5 input parameters achieved high prediction outcomes, with R2 greater than 0.8, as shown in Fig. 6. The lowest performance was revealed by ELM-M1, with R2 = 0.625 and a high mean absolute error, as shown in Table 4. The MARS model attained good prediction accuracy using three variables, with a coefficient of determination exceeding 0.8, as reported in Table 4. The comparison results revealed that the XGBoost model outperformed the ELM and MARS models, with R2 greater than 0.9 using four and five input variables. The spider plot revealed that the XGBoost algorithm gained excellent performance metrics and outperformed the ELM and MARS models, as shown in Fig. 9. The visualization results showed the reliability of the XGBoost model in cost estimation by having the least residual error and the nearest distance to the actual point, as demonstrated in Figs. 8 and 10.

Conclusion

Estimating construction cost accurately is an important issue in construction management studies. This study introduced the XGBoost model as an input selector and a predictor to enhance cost estimation accuracy. The XGBoost model was compared with two well-known AI algorithms, the ELM and MARS models. The study was conducted on datasets collected from nineteen construction projects. Tabulated metrics and graphical schemes were constructed to examine the applied AI models. The feature selection results revealed that inflation is the parameter most correlated with project cost, followed by the project characteristics. The comparison between the predictive models showed that all the developed models exhibited more efficient predictability as the number of input parameters increased. The tabulated results showed that the XGBoost model gained an excellent performance in all input combinations, with R2 exceeding 0.8 except for M1, where the coefficient of determination was reduced to 0.663. The study revealed that incorporating inflation with project characteristics enhances the accuracy of the estimated cost and found that XGBoost gained the highest prediction results using five input variables. Furthermore, the study showed that the XGBoost model provided an excellent capacity in the feature selection and prediction processes within a complex cost estimation system. This study focused on the impact of project characteristics and inflation on the cost estimation modeling of building projects; studying the impact of other influencing variables can enhance the accuracy of the cost estimation process. In future studies, more influencing variables, such as the characteristics of the client and other stakeholders, can be explored to increase cost estimation accuracy. Also, other recent AI models, such as deep neural networks, can be investigated to reduce the error of construction cost estimation.