1 Introduction

The demand for concrete has increased drastically relative to other construction materials. With the growth of the construction industry and the urbanization of cities, concrete must be produced at a large scale. The main constituents of conventional concrete are cement, fine aggregates, and coarse aggregates. Traditional concrete performs far better under compression than under tension. Its tensile performance can be improved by several well-established methods, e.g., steel bars or various types of fiber reinforcement. The demand for and consumption of high-strength concrete have also increased significantly across construction industry applications. However, existing literature reports that the brittleness of concrete increases as its behavior under compression is enhanced, which limits its structural applications. Such applications can instead be supported by highly ductile cementitious materials, e.g., engineered cementitious composites (ECC) [1,2,3,4,5,6,7].

Research has been conducted on enhancing the performance of ECC under compressive and tensile loading through the addition of different fibers, e.g., polyvinyl alcohol (PVA) [6, 8], basalt fiber [9, 10], polyethylene (PE) [11, 12], polypropylene (PP) [13,14,15], natural fiber [16], steel fibers [17,18,19], and hybrid fibers [17, 20]. The formulation of ECC predominantly involves constituents such as cement, fly ash, silica fume, ground granulated blast furnace slag (GGBFS), metakaolin, and micro sand. These constituents not only enhance the mechanical performance of concrete but also contribute significantly to green and low-carbon development in civil engineering [21, 22]; however, a higher amount of industrial waste products reduces the strength. Optimizing the proportions is therefore both key and a challenge in developing suitable binders. By reducing the carbon footprint associated with cement production, these sustainable practices align with global efforts toward environmentally conscious construction. The further inclusion of various fiber contents enhances the mechanical performance of ECC, but it also makes the accurate prediction of its compressive strength (CS) and tensile strength (TS) more complex. Accurately predicting the CS and TS of ECC is not merely a theoretical concern but a practical requirement for the operational integrity and longevity of structures built from ECC. Accurate prediction could in turn lead to a substantial decrease in experimental workload and cost. The material components of green PVA-ECC are highly complex and play a crucial role (e.g., reducing the carbon footprint and improving strength, ductility, and toughness). The difficulty of accurately modeling mechanical characteristics from mixture design parameters through conventional regression studies highlights the need for innovative approaches that support the green and low-carbon development of civil engineering [23, 24].

However, the heterogeneity introduced by the diverse fiber contents, coupled with the interactions among the primary constituents, often makes prediction complex and challenging [25,26,27,28]. Previous research has evaluated the CS of ECC experimentally as a function of water/binder ratio (W/B), sand/binder ratio (S/B), particle size grading, age of samples, type of cement, and shape of samples [1, 4, 7]. Supplementary cementitious materials (SCMs) (e.g., fly ash, ground granulated blast furnace slag, rice husk ash (RHA), metakaolin, volcanic ash, silica fume, calcined clays, recycled glass, and natural pozzolans) are commonly used as a replacement for Portland cement in concrete. Using SCMs to replace 5–20% of clinker enhances long-term mechanical properties and durability and reduces CO2 emissions. However, high-volume clinker replacement may cause early-age performance losses, prompting research to balance sustainability and performance by optimizing the mixture. Several studies have been conducted on SCM hydration [29, 30], workability [31, 32], strength [33, 34], and long-term durability [35]. In ultra-high-performance concrete (UHPC), fillers such as GGBFS, nano-silica, and RHA have shown promising results: 28-day compressive strengths of 100 MPa and 150 MPa were achieved with RHA mean particle sizes of 8 μm and 3.6 μm, respectively [36,37,38]. Achieving optimal particle packing and reactivity requires extensive trial and error, making it time-consuming and resource-intensive. Machine learning (ML) can identify new potential SCMs and fillers by analyzing vast amounts of data from literature and experiments, accelerating the discovery of effective materials for ECC or UHPC.

Examining the durability of concrete materials is costly and time-consuming because of the intricate nature of curing and construction procedures [39]. Researchers have therefore increasingly turned to alternative techniques for constructing prediction models of the mechanical performance of concrete materials, motivated by cost-effectiveness and time efficiency. However, conventional numerical analysis, which frequently involves complex non-linear equations, presents a significant obstacle to accurate prediction [40]. Recognition of how difficult it is to anticipate the performance of concrete accurately with traditional procedures has prompted a shift towards modern soft computing techniques.

In recent years, machine learning (ML)-based algorithms have gained significant popularity for developing prediction models of the CS and TS of concrete materials while reducing time and cost. Naser et al. [41] described how ML can be applied to structural members and to optimizing concrete mixtures for construction-site application. Tapeh et al. [42] extensively reviewed AI, ML, and DL applications in structural sections and in earthquake, wind, and fire engineering, and showed their future potential in those sectors. Techniques such as fuzzy logic (FL) [43,44,45,46], particle swarm optimization (PSO) [47], artificial neural network (ANN) [48,49,50], genetic algorithm (GA) [51, 52, 53], gene expression programming (GEP) [52, 54], and random forest (RF) [55, 56] are commonly used in the concrete field. For instance, Khashman et al. [57] developed an ANN-based prediction model to forecast the CS of concrete. Table 1 shows several ML algorithms that have previously been used to predict the CS and TS of HPC and several cementitious composite materials. There is a paucity of literature on fiber-based concrete that focuses on predicting the mechanical properties of ECC predominantly reinforced with PVA and steel fibers [58,59,60]. Uddin et al. [28] used ANN to predict the CS of ECC containing PVA fiber; however, an extensive study has not been conducted.

Table 1 Summary of ML-based studies to predict the CS and TS behavior of concrete

Random forests (RF) [61] have been applied to forecast the strength of concrete materials; RF is an advanced ML algorithm that is very effective in solving non-linear regression and classification problems [62]. Chen et al. [63] reported that XGBoost, a more recently developed algorithm, is more efficient than other ML algorithms. XGBoost is highly recommended for avoiding overfitting in civil engineering problems, e.g., predicting shear strength [64], interface shear strength [65], dynamic modulus [66], CS, and creep [67, 68]. Furthermore, it has been successfully applied to predict, e.g., seismic drift [69], concrete strength [70], the shear capacity of FRP-RC walls and flat slabs [71,72,73], and the buckling of steel beams [74]. To a limited extent, the SHAP method has been used to study risk factors [75], failure modes of RC walls [76], and concrete strength [70]; these studies illustrate how feature importance and the global and local relationships among features can be used to interpret complex problems. The latest ML-based algorithms have been widely used in earlier research to predict concrete properties for varied materials. Mahjoubi et al. [77] used ML to predict the properties of, and optimize the mixture for, strain-hardening cementitious composites (SHCC) with different plastic fibers. However, to the best of the authors' knowledge, few studies have predicted the CS and TS behavior of PVA-ECC materials using the XGBoost and SHAP algorithms.

This paper develops a systematic set of ML-based models to predict the CS and TS of ECC containing only PVA fiber. For training and testing, data from 81 mixed samples were collected. Among all the ML models considered, the XGBoost method yields the most accurate predictive model. The data are randomly split into 80% for training and 20% for testing. The SHAP algorithm is employed to explain the model's key features and their complex correlations. The prediction outcomes of the suggested model are compared against earlier research, and the best ML algorithm is chosen.

2 Machine learning algorithm

2.1 Material and database creation

A thorough dataset was collected to develop machine learning (ML) models for predictive analysis. This dataset consisted of 820 experimental data points related to polyvinyl alcohol engineered cementitious composite (PVA-ECC), which were obtained from various literature sources [83,84,85,86,87,88,89,90,91], considering only mixtures with PVA fiber and Class F and Class C fly ash. Table 2 presents the statistical attributes of the eleven input parameters derived from the selected experimental data.

Table 2 Statistical variables of the different data used in the database for the ECC mixture

Data visualization is a pivotal part of data science and machine learning because it clarifies the underlying data distribution and inherent patterns. The statistical distribution of each numerical parameter in the database is shown in Fig. 1, which presents histograms of the CS and TS of PVA-ECC along with the other input variables. As shown in Table 2, the CS ranges from 21.30 to 75.20 MPa, while the TS ranges from 3.08 to 5.82 MPa. The other input variables are described in terms of mean, standard deviation, median, minimum, and maximum to give a comprehensive picture of the data distribution.

Fig. 1

Histograms of the parameters: a Cement (kg/m3); b FA (kg/m3); c W/B; d sand (kg/m3); e W/S; f HRWR (kg/m3); g AR; h NS (MPa); i YM (GPa); j CS (MPa); k TS (MPa)

Figures 2 and 3 display heatmaps that visually represent the relationship between the input features and the CS and TS, respectively. Each feature is presented in a separate subplot within the heatmaps. In Fig. 2, CS (in MPa) exhibits a positive correlation with Cement (in kg/m3), Nominal strength (NS) (in MPa), Aspect Ratio (AR), and Water-to-Sand ratio (W/S), with correlation coefficients of 0.80, 0.49, 0.31, and 0.20, respectively; the correlation with Cement is the strongest. On the other hand, Fly ash (FA) (in kg/m3), Sand (kg/m3), and High-Range Water Reducer (HRWR) (in kg/m3) show a negative correlation with CS. The AR of the fiber in ECC is very important because it helps to improve crack control, durability, workability, fiber distribution and orientation, and permeability. Optimizing the AR could improve the mechanical performance of PVA-ECC.
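As an illustration of how such correlation coefficients can be obtained, the following minimal Python sketch computes a correlation coefficient between two variables. It assumes that the heatmap values are Pearson coefficients, which the text does not state explicitly; the function name `pearson_r` and the sample values are hypothetical and are not taken from the database.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not from the PVA-ECC database).
cement = [300.0, 400.0, 500.0, 600.0]   # kg/m3
fly_ash = [900.0, 800.0, 700.0, 600.0]  # kg/m3
cs = [30.0, 38.0, 52.0, 60.0]           # MPa

print("cement vs CS :", round(pearson_r(cement, cs), 3))
print("fly ash vs CS:", round(pearson_r(fly_ash, cs), 3))
```

A coefficient close to +1 or -1 indicates a strong linear association, whereas values near 0 (such as the 0.20 reported for W/S) indicate a weak one.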

Fig. 2

Heatmap of the CS: a Cement (kg/m3); b FA (kg/m3); c W/B; d sand (kg/m3); e W/S; f HRWR (kg/m3); g AR; h NS (MPa); i YM (GPa); j CS (MPa)

Fig. 3

Heatmap of the TS: a Cement (kg/m3); b FA (kg/m3); c W/B; d sand (kg/m3); e W/S; f HRWR (kg/m3); g AR; h NS (MPa); i YM (GPa); j TS (MPa)

Furthermore, in Fig. 3, Sand content and HRWR show positive correlations with TS, with coefficients of 0.31 and 0.76, respectively, while NS and Young's Modulus (YM) (in GPa) show positive correlations with coefficients of 0.59 and 0.83, respectively. The sizes of the cube and cylinder specimens were accounted for using the Nevilar approach, and the strengths were converted to cylinder strengths [92]. The dog-bone test was considered in accordance with the Japan Society of Civil Engineers (JSCE) [93].

Figure 4 presents a comprehensive representation of the machine learning model, illustrating the data distribution, the approaches employed for predicting CS and TS, and the entire workflow. The study used the XGBoost and SHAP modelling methodologies, implemented through the Python packages 'xgboost' and 'shap', respectively. The analysis was conducted in a Python programming environment, which ensured a methodical and replicable analysis.

Fig. 4

Flow chart for the ML model for predicting CS

The methodological framework covers data curation, visualization, and analytical modeling. These steps enable a thorough predictive analysis of the mechanical strength of PVA-ECC. Advanced machine learning models and data visualization techniques help clarify the intricate relationships between the input parameters and the mechanical properties of ECC, and thereby contribute to the growing knowledge base on predictive analysis for engineered composite materials.

2.2 Random Forest (RF)

The RF algorithm is a type of ensemble learning technique commonly employed for classification and regression tasks. It achieves this by creating a multitude of decision trees throughout the training process. The process of aggregating predictions from multiple trees enhances the model's generalizability and robustness. The bagging technique mitigates the issue of overfitting by introducing a randomization component in selecting data for each tree. Incorporating randomization at every split in decision trees improves the model's resilience against outliers and noise. Missing data can be addressed using imputation techniques or weighted splitting methods.

The Out-of-Bag (OOB) Error Estimation method calculates the average prediction error for each training sample by utilizing trees that were not trained on that particular sample [94].
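The OOB idea can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions, not the implementation used in this study: a 1-nearest-neighbour regressor stands in for each decision tree, the data are synthetic, and all function names are hypothetical. Each model is fitted on a bootstrap sample, and every training sample is predicted only by the models whose bootstrap sample did not contain it.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbour regressor; a simple stand-in for a decision tree."""
    best = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[best]


def oob_predictions(xs, ys, n_models=200, seed=0):
    """Average, for each sample, the predictions of models that never saw it."""
    rng = random.Random(seed)
    n = len(xs)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement.
        bag = [rng.randrange(n) for _ in range(n)]
        in_bag = set(bag)
        bag_x = [xs[k] for k in bag]
        bag_y = [ys[k] for k in bag]
        # Only out-of-bag samples contribute to the OOB estimate.
        for i in range(n):
            if i not in in_bag:
                sums[i] += nearest_neighbor_predict(bag_x, bag_y, xs[i])
                counts[i] += 1
    # A sample that was in every bag has no OOB prediction (None).
    return [sums[i] / counts[i] if counts[i] else None for i in range(n)]


def oob_mse(xs, ys, **kwargs):
    """Mean squared OOB error over the samples that have an OOB prediction."""
    preds = oob_predictions(xs, ys, **kwargs)
    errors = [(p - y) ** 2 for p, y in zip(preds, ys) if p is not None]
    return sum(errors) / len(errors)


# Synthetic data for illustration only.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
print("OOB mean squared error:", oob_mse(xs, ys))
```

Because each sample is evaluated only by models that did not train on it, the OOB error behaves like a built-in validation estimate and does not require a separate hold-out set.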

2.3 Support Vector Machine (SVM)

The SVM is a supervised learning technique that is predominantly employed for tasks such as classification, regression, and outlier detection. The algorithm identifies a hyperplane that achieves maximum separation between classes in the feature space. Additionally, it uses the kernel trick to handle both linear and non-linear data by mapping the data to a higher-dimensional space. SVMs are resilient when confronted with high-dimensional data and offer a range of kernel functions, including linear, polynomial, and radial basis function (RBF). The essential hyperparameters are the regularization parameter (C) and the parameters of the kernel. The classifier is intrinsically designed for binary classification; however, it can accommodate multiclass classification through approaches such as one-vs-rest or one-vs-one [95]. SVMs are versatile and robust, as they can also perform regression (known as Support Vector Regression, SVR) and detect outliers. Using a kernel function, an SVM maps a low-dimensional problem into a high-dimensional space, so that a non-linear problem can be solved with a linear method [96].
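To make the kernel idea concrete, the following minimal Python sketch evaluates the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), for a few points. The sample points and the helper names are hypothetical, and gamma = 0.5 is used only as an illustrative value.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma):
    """Pairwise kernel values; this matrix is what a kernel SVM works with."""
    return [[rbf_kernel(x, z, gamma) for z in samples] for x in samples]


# Hypothetical two-dimensional points for illustration only.
samples = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
K = gram_matrix(samples, gamma=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

The kernel value equals 1 for identical points and decays towards 0 as the points move apart; a larger gamma makes this decay faster, so each training sample influences a smaller neighbourhood.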

2.4 Artificial Neural Network (ANNs)

ANNs are employed in regression analysis to predict continuous values from input data. The neural network architecture consists of three types of layers: input, hidden, and output. The output layer comprises a single neuron that uses a linear activation function to predict continuous values. Training adjusts the weights and biases through backpropagation to decrease the error, which is typically quantified using metrics such as Mean Squared Error (MSE) or Mean Absolute Error (MAE). Hidden layers use activation functions such as Rectified Linear Units (ReLU) to capture and represent non-linear relationships. The primary hyperparameters are the number of hidden layers, the number of neurons, and the learning rate. ANNs are used in several domains, such as finance and engineering, where the prediction of continuous values is crucial. Their performance in these domains is typically assessed using evaluation metrics such as R-squared, MSE, or MAE.
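The forward pass of such a regression network can be sketched as follows. This is a minimal illustration with one ReLU hidden layer and a single linear output neuron; the weights, the network size (two inputs, three hidden neurons), and the function names are hypothetical and do not correspond to the trained model of this study.

```python
def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j][i] links input i to neuron j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One ReLU hidden layer followed by a single linear output neuron."""
    hidden = relu(dense(x, w_hidden, b_hidden))
    return dense(hidden, [w_out], [b_out])[0]


def mse(y_true, y_pred):
    """Mean squared error, the loss minimized during training."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical weights for a 2-input, 3-hidden-neuron network.
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [1.0, -1.0, 0.5]
b_out = 0.2

prediction = forward([1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
print("prediction:", round(prediction, 4))
print("MSE:", mse([1.0, 2.0], [1.0, 4.0]))
```

Backpropagation would then adjust `w_hidden`, `b_hidden`, `w_out`, and `b_out` in the direction that reduces the MSE; that step is omitted here for brevity.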

The neural network architecture includes inputs, weights, a function, an activation function, and outputs. Fig. 5(a) shows the neural anatomy, and Fig. 5(b) describes the neural network architecture with eleven input parameters and ten hidden layers.

Fig. 5

a Neural anatomy, b The neural network architecture

2.5 Extreme Gradient Boosting (XGBoost)

A predictive ML model may be trained with the available database using the supervised family of machine learning techniques. There are several algorithms in this category; however, XGBoost, a recently created ML-based regression technique, is used to build the model because of its success in other regression-like applications [97, 98]. XGBoost is a more advanced variant of the ensemble learning method gradient boosted decision tree (GBDT) that improves loss function aspects [99]. The mathematical fundamentals will be briefly discussed in Sect. 2.
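The boosting principle underlying GBDT can be illustrated with a short Python sketch. The code below is plain gradient boosting with squared loss and one-split trees (stumps) on a single feature; it is not the XGBoost algorithm, which additionally uses regularization and second-order gradient information. The data and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature (least squared error)."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean


def fit_gradient_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Each stump is fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared loss, the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [
            p + learning_rate * stump_predict(stump, x)
            for p, x in zip(preds, xs)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic non-linear data for illustration only.
xs = [float(i) for i in range(1, 9)]
ys = [x ** 2 for x in xs]

model = fit_gradient_boosting(xs, ys)
mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_after = mse(ys, [predict(model, x) for x in xs])
print("training MSE before boosting:", mse_before)
print("training MSE after boosting :", round(mse_after, 4))
```

The learning rate and the number of estimators play the same roles as the `learning_rate` and `n_estimators` hyperparameters of boosting libraries: a smaller learning rate generally requires more estimators to reach the same training error.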

2.5.1 Shapley Additive Explanations (SHAP)

Input and output features have a significant impact on model accuracy, and their relationships are highly complex. Model interpretability methods are therefore needed to explain how a model arrives at its predictions. Lundberg and Lee [100] introduced SHapley Additive exPlanations (SHAP) to evaluate ML model predictions using Shapley values. The Shapley value of a feature is the average of its marginal contributions over all feature permutations, and it shows the impact of that feature on the generated output [99]. The SHAP explanation model \(g\left(a^\ast\right)\) is a linear addition of input features, expressed as follows [101].

$$g\left(a^\ast\right)=\phi_0+\sum_{i=1}^N\phi_i a_i^\ast$$
(1)

where \(a^\ast\in\left\{0,1\right\}^N\) is the simplified binary input vector, N is the number of input features, \(\phi_0\) is the model output without any input features, and \(\phi_i\) is the attribution value of feature i [98].

$$\phi_i=\sum_{S\subseteq\left\{x_1,\ldots,x_N\right\}\setminus\left\{x_i\right\}}\frac{\left|S\right|!\left(N-\left|S\right|-1\right)!}{N!}\left[f\left(S\cup\left\{x_i\right\}\right)-f\left(S\right)\right]$$
(2)

where \(x_i\) denotes feature i, S is a subset of the features that does not contain \(x_i\), and N is the number of features. \(f\left(S\cup\left\{x_i\right\}\right)\) and \(f\left(S\right)\) are the predictions of the model trained with and without feature \(x_i\), respectively.
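The Shapley value formula can be evaluated exactly for a small number of features. The following Python sketch enumerates all feature subsets for a toy value function; it is an illustration only, since the `shap` package relies on efficient algorithms rather than brute-force enumeration. The value function `toy_model` and its coefficients are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset of indices)."""
    features = range(n_features)
    phis = []
    for i in features:
        others = [j for j in features if j != i]
        phi = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (N - |S| - 1)! / N! for subsets of this size.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to subset S.
                phi += weight * (value_fn(s | {i}) - value_fn(s))
        phis.append(phi)
    return phis


def toy_model(subset):
    """Hypothetical model output when only the features in `subset` are known."""
    out = 10.0                           # base value (no features)
    if 0 in subset:
        out += 5.0                       # feature 0 raises the output
    if 1 in subset:
        out -= 3.0                       # feature 1 lowers the output
    if 0 in subset and 2 in subset:
        out += 2.0                       # interaction of features 0 and 2
    return out


phis = shapley_values(toy_model, 3)
phi_0 = toy_model(frozenset())
print("base value:", phi_0)
print("Shapley values:", [round(p, 6) for p in phis])
print("base + sum:", round(phi_0 + sum(phis), 6), "vs full model:", toy_model(frozenset({0, 1, 2})))
```

In this toy case the interaction term of 2.0 is shared equally between features 0 and 2, and the base value plus the sum of the Shapley values reproduces the full model output, which is the additive property expressed by Eq. (1).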

3 Results and discussion

3.1 Simulation results of machine learning models

Hyperparameters for the RF, SVM, ANN, and XGBoost models were tuned with GridSearchCV using fivefold cross-validation in order to achieve higher accuracy on the testing dataset. In this study, z-normalization is used to standardize the continuous input parameters. Naser et al. [102] explained a wide range of performance fitness and error metrics (PFEMs), and their importance, that are commonly used in evaluating ML models, especially in regression and classification tasks within engineering and science applications. The error metrics in this paper follow [103]; both MSE and RMSE are sensitive to larger errors because they square the prediction errors before averaging them. Many alternative metrics (e.g., Mean Absolute Percentage Error (MAPE)) might provide similar insights but do not fundamentally change the understanding of model performance [104, 105].
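For clarity, the normalization and the four error metrics reported in this section can be written out in a few lines of Python. This is a minimal sketch: the use of the population standard deviation in the z-normalization is an assumption (the text does not specify it), and the sample values and function names are hypothetical.

```python
import math


def z_normalize(values):
    """Standardize to zero mean and unit variance (population std assumed)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    # RMSE is the square root of MSE, so it has the same unit as the target.
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted strengths (MPa), illustration only.
y_true = [30.0, 40.0, 50.0, 60.0]
y_pred = [32.0, 38.0, 51.0, 57.0]

print("z-normalized:", [round(v, 3) for v in z_normalize(y_true)])
print("R2  :", round(r2(y_true, y_pred), 3))
print("MAE :", mae(y_true, y_pred))
print("MSE :", mse(y_true, y_pred))
print("RMSE:", round(rmse(y_true, y_pred), 3))
```

Because the errors are squared before averaging, the single 3 MPa error in this example contributes half of the MSE, whereas it contributes only three eighths of the MAE.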

3.2 Predicting Random forest (RF)

Figures 6(a) and (b) depict the actual and predicted results for the CS and TS using the RF model. After tuning with GridSearchCV, the hyperparameters max_depth, n_estimators, min_samples_split, min_samples_leaf, and max_features were set to 10, 150, 10, 15, and 'auto' for the CS and to 4, 150, 10, 10, and 'auto' for the TS, respectively. In terms of CS, the RF model yields R2, MAE, MSE, and RMSE values of 0.95, 1.50, 0.017, and 2.57, respectively, for the training data, and 0.87, 3.08, 22.26, and 4.71 for the testing data. The R2 value of RF is comparable to that of the ANN, as shown in Table 3. However, the XGBoost model outperforms RF in terms of CS predictions.

Fig. 6

a and b show the compressive strength and tensile strength using the RF model

Table 3 Performance evaluation of the ML model

Regarding the prediction of TS, the RF model achieves an R2 value, MAE, MSE, and RMSE of 0.94, 0.13, 0.03, and 0.19, respectively, for the training data, and 0.78, 0.37, 0.21, and 0.46, respectively, for the testing data. In this case, RF performs better than the ANN model but is still less accurate than XGBoost.

3.3 Predicting Support Vector Machine (SVM)

The SVM is widely used for regression and classification tasks in machine learning. In this study, SVM is employed to predict the CS and TS of PVA-ECC after 28 days. To optimize the SVM model, the 'rbf' kernel was selected after being tested against the 'linear' and 'poly' kernels. For the CS, the values of gamma and C were set to 0.5 and 10, respectively. Similarly, for the TS, the values of gamma and C were chosen as 0.3 and 5, respectively.

Figure 7(a) and (b) depict the performance metrics, namely R2, MAE, MSE, and RMSE, for the training and testing phases. Notably, the SVM model exhibits lower values for these metrics compared to the other machine learning models when predicting CS. However, for the TS, the performance of the SVM model appears to be similar to the ANN model, indicating that the RF and XGBoost models yield more precise predictions than SVM when considering the dataset.

Fig. 7

a and b show the compressive strength and tensile strength using the SVM model

3.4 Predicting Artificial Neural Network (ANNs)

Table 3 highlights the accuracy of various ML models in predicting the CS. The results indicate that the XGBoost model outperforms the RF and SVM models in terms of accuracy. The RMSE values for the XGBoost model remain consistent between the training and testing sets, which suggests its superior accuracy compared to RF and SVM in predicting the CS.

However, it should be noted that the ANN model does not accurately predict the TS due to overfitting. Thus, based on the findings presented in Table 3, it can be concluded that the XGBoost model is the most accurate among all ML models used in this research. Figure 8(a) and (b) illustrate the use of the MLPRegressor in the Python script to predict the CS and TS. The hyperparameters selected for optimization in this case are hidden_layer_sizes of (150), learning_rate_init of 0.001, learning_rate set to 'constant', activation set to 'relu', solver set to 'adam', alpha equal to 0.0002, batch_size set to 'auto', power_t of 0.6, and a maximum of 500 iterations.

Fig. 8

a and b show the compressive strength and tensile strength using the ANN model

3.5 Predicting Extreme Gradient Boosting (XGBoost)

The XGBoost model was optimized using GridSearchCV with a fivefold cross-validation strategy. Through cross-validation, the hyperparameters max_depth, learning_rate, and n_estimators were tuned to 5, 0.01, and 250, respectively, for the CS and to 5, 0.03, and 500 for the TS, giving a high-performance model as evidenced by the RMSE. Other parameters, such as min_child_weight = 3 and reg_lambda = 0.2, were set to the same values for both models. The testing set was then used to assess the accuracy of the XGBoost model in predicting the CS and TS. Figure 9(a) and (b) show the predicted CS and TS values obtained with the XGBoost model.

Fig. 9

a and b show the CS and TS using the XGBoost model

Table 3 summarizes the evaluation metrics. For CS on the training set, the R2, MAE, MSE, and RMSE values were 0.99, 0.80, 1.33, and 1.15, respectively; on the testing set, these values were 0.92, 2.29, 14.57, and 3.81, respectively. Similarly, for TS on the training set, the corresponding values were 0.98, 0.08, 0.01, and 0.10, respectively, and on the testing set, they were 0.97, 0.17, 0.11, and 0.26, respectively. Figure 8 shows the x = y line used to assess the training and testing sets within the model. The data indicate that, among the machine learning models explored, XGBoost is the most accurate, which highlights its potential for the predictive analysis of the mechanical strength of ECC.

These results highlight the important role of optimized machine learning models, particularly XGBoost, in predicting the mechanical strength properties of ECC with both computational robustness and predictive precision.

Table 4 presents the experimental data along with the corresponding data predicted by the model to evaluate their accuracy. The errors for the CS are found to be lower for the XGBoost model compared to the SVM model. Similarly, for the TS, the error is 0.01 for the XGBoost model and 0.12 for the SVM model.

Table 4 Models prediction of the testing set

Figures 10 and 11 present a comparative analysis of the prediction performance of the machine learning (ML) models for CS and TS on both the training and testing datasets. Figure 10 shows that XGBoost closely aligns with the experimental datasets and outperforms SVM in both the training and testing phases for CS. On the other hand, Fig. 11 illustrates that XGBoost yields nearly identical outcomes for both the training and testing datasets in TS. However, upon closer examination of the testing results, it becomes evident that XGBoost aligns better with the experimental datasets than SVM.

Fig. 10

The CS prediction performances of XGBoost and SVM models

Fig. 11

The TS prediction performances of XGBoost and SVM models

In summary, the figures visually represent the predictive capabilities of ML models for CS and TS. It is apparent that XGBoost consistently outperforms SVM, particularly when considering the testing datasets.

3.6 SHAP interpretation

SHAP offers a comprehensive framework for feature importance analysis in ML (e.g., SHAP summary plots), giving specific insights into the individual and combined contributions of each feature to a model's predictions of mechanical strength [106, 107]. In ML applications, this type of analysis is crucial for verifying model behavior, guaranteeing equity, and fostering transparency, as well as for understanding the relationships among features and their influence on the strength prediction (Naser et al. [108]).

3.7 Generating SHAP summary plots

Figure 12 presents the mean SHAP value, which represents the average impact on the magnitude of the model output for the compressive strength (CS). Cement is a crucial parameter that contributes mainly to CS, as evidenced by its highest SHAP value. Conversely, sand has the lowest SHAP value, indicating its minimal contribution to CS. Further, FA is another dominant parameter affecting CS, while the remaining ingredients (HRWR, NS, W/S, AR, W/B) have comparatively less effect on CS.

Fig. 12

SHAP's value (average impact on output) for CS

Although Fig. 12 provides an overview of the significance of the different features, it does not describe the relationship between the features or whether they influence the output positively or negatively. To address this limitation, the SHAP summary plot is employed to determine the correlation between the parameters and CS (Fig. 13). Each point on the summary plot carries the Shapley value of the corresponding feature. Using SHAP, the model's output can be compared against the global mean to identify the reason for a specific prediction, and the features are ranked according to their importance. Additionally, a dotted line is used to indicate the negative (left) and positive (right) effects of the features, thereby illustrating how they impact both sides. The feature values are plotted as dotted lines on the x-axis and y-axis in the model to emphasize their importance.

Figure 13 demonstrates that FA is highly correlated with the CS of PVA-ECC. The dotted line for FA turns towards the negative (left) side, indicating that an increase in FA significantly reduces the CS of PVA-ECC. On the other hand, cement, NS, and AR advance towards the right side, positively influencing CS, as shown in Fig. 13. In contrast, the addition of HRWR in the mixture leads to a decrease in the CS of PVA-ECC, while adding or removing W/B, sand, and YM does not have any significant influence on CS.

Fig. 13

SHAP's summary plot for the XGBoost model for CS

The SHAP summary plot for the splitting tensile strength (TS) of the PVA-ECC mixture, as depicted in Figs. 14 and 15, presents the impact on the model output. Notably, the feature with the highest SHAP value in the plot is NS, indicating its significant influence on TS, although its impact on CS is not considerable. Similarly, the other parameters show similar effects on TS, implying that the substances that result in higher SHAP values for TS exhibit opposite trends for CS. Specifically, Figs. 14 and 15 reveal that NS, YM, and AR positively affect TS, while HRWR, cement, and W/B have a negative impact on TS, signifying that an increase in these substances will decrease TS.

Fig. 14

SHAP's value (average impact on output) for TS

Fig. 15

SHAP's summary plot for the XGBoost model for TS

Furthermore, Fig. 15 demonstrates that sand has both positive and negative influences on the resulting TS of the concrete material. It is worth noting that existing literature has shown both negative and positive impacts on the TS of concrete [109,110,111,112]. Moreover, the nominal strength and length of PVA fiber have been found to significantly positively impact the development of TS in concrete materials [113,114,115].

3.8 SHAP feature instances

The SHAP instance values for the compressive strength (CS) and tensile strength (TS) of PVA-ECC are illustrated in Figs. 16 and 17, respectively. As discussed earlier, FA and Cement significantly impact CS, while NS and HRWR greatly influence TS. The effects of the other parameters are less noticeable. Figure 16 presents the instances where FA positively impacts CS in datasets 40 to 80, whereas it negatively impacts datasets 1 to 40. Similar trends can be observed for cement.

Fig. 16

SHAP's instances for the CS

Fig. 17

SHAP's instances for the TS

On the other hand, Fig. 17 shows that the impact of HRWR is positive in datasets 40 to 80, but it has a higher negative intensity in datasets 1 to 10. AR exhibits higher intensities in datasets 75 to 80. These findings help us understand the specific data points significantly influencing the model's accuracy. In future investigations, focusing on these data points or mixtures may help improve the strength of the mixtures.

3.9 Interaction values

The SHAP interaction values, also known as pairwise SHAP values, provide insights into the interaction between pairs of features in a model's prediction. These values help us understand how one feature's presence or absence can influence another feature's effect. In the interaction value plot, red indicates high values for both features, while blue represents low values for both features. The right side of the dots represents a positive impact, while the left side represents a negative effect on the feature's value.
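For reference, SHAP interaction values are commonly defined as follows for tree-based models. The expressions below are a sketch in the notation of Eq. (2), under the assumption that the plots in this study follow this standard definition; the text itself does not state the formula. For two distinct features \(x_i\) and \(x_j\):

$$\Phi_{i,j}=\sum_{S\subseteq\left\{x_1,\ldots,x_N\right\}\setminus\left\{x_i,x_j\right\}}\frac{\left|S\right|!\left(N-\left|S\right|-2\right)!}{2\left(N-1\right)!}\,\nabla_{ij}\left(S\right)$$

$$\nabla_{ij}\left(S\right)=f\left(S\cup\left\{x_i,x_j\right\}\right)-f\left(S\cup\left\{x_i\right\}\right)-f\left(S\cup\left\{x_j\right\}\right)+f\left(S\right)$$

The main effect of feature \(x_i\) is then the part of its Shapley value that is not attributed to interactions, \(\Phi_{i,i}=\phi_i-\sum_{j\neq i}\Phi_{i,j}\). Under this definition, the interaction between two features is split equally between \(\Phi_{i,j}\) and \(\Phi_{j,i}\).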

Figure 18 illustrates the strong relationship between cement and the CS of PVA-ECC. The dots for cement extend toward the positive (right) side, indicating that an increase in cement content significantly increases the CS of PVA-ECC. Conversely, the dots for FA and HRWR extend toward the left side, suggesting a negative impact on CS, while increasing the NS and W/B content in the mixture leads to an increase in the CS of PVA-ECC.

Fig. 18 SHAP interaction value for CS

The interaction values for several features used to predict the TS are shown in Fig. 19. They indicate that an increase in NS content significantly increases TS, whereas HRWR and W/B show the opposite trend, shifting toward the negative (left) side. The interaction value of YM shifts slightly to the right, which means that an increase in YM raises TS, while changes in the other parameters (YM, W/S, FA) do not affect TS, as their SHAP interaction values remain in the middle range. Cement shows both high and low values and has an overall positive impact on TS, as shown in Fig. 19.

Fig. 19 SHAP interaction value for TS

3.10 Dependence plot

A SHAP dependence plot is used to examine the relationship between the SHAP value of a feature (y-axis) and the corresponding feature value (x-axis). These plots provide a clearer understanding of the SHAP interpretations, as demonstrated in Fig. 20. The dependence plot also illustrates how two features interact with each other to influence the model's predicted outcome.
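The data behind a dependence plot can be sketched with a small example. The code below is an illustration under stated assumptions, not the model of this study: `toy_model`, the zero baseline, and the sample points are hypothetical. For each sample it returns the feature value (x-axis), the exact Shapley value of that feature (y-axis), and the value of the second feature, which is normally used to colour the dots.

```python
def shap_two_features(model, x1, x2, base1=0.0, base2=0.0):
    """Exact Shapley values of a two-feature model relative to a baseline point."""
    f00 = model(base1, base2)
    f10 = model(x1, base2)
    f01 = model(base1, x2)
    f11 = model(x1, x2)
    # Average the marginal contribution over both orderings of the two features.
    phi1 = 0.5 * (f10 - f00) + 0.5 * (f11 - f01)
    phi2 = 0.5 * (f01 - f00) + 0.5 * (f11 - f10)
    return phi1, phi2


def toy_model(x1, x2):
    """Hypothetical model with a main effect of x1 and an x1*x2 interaction."""
    return 2.0 * x1 + x1 * x2


def dependence_points(model, samples):
    """Return (x1, SHAP value of x1, x2) triples: x-axis, y-axis, colour."""
    return [(x1, shap_two_features(model, x1, x2)[0], x2) for x1, x2 in samples]


# Hypothetical sample points (x1, x2); not taken from the PVA-ECC database.
samples = [(1.0, 0.0), (1.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
for x1, phi1, x2 in dependence_points(toy_model, samples):
    print(f"x1={x1:.1f}  shap(x1)={phi1:.1f}  x2={x2:.1f}")
```

At a fixed `x1`, the SHAP value of `x1` differs between the samples with `x2 = 0` and `x2 = 2`. This vertical spread at a given x-position is how an interaction between two features appears in a dependence plot.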

Fig. 20 The SHAP dependence plot for the CS

In Fig. 20(a), the CS correlates strongly with the W/S ratio in the range of 0.70 to 0.75, particularly when the FA content is between 900 and 1000 kg/m³, which corresponds to the higher CS. Conversely, in Fig. 20(b), a negative linear correlation is observed between the W/B ratio and the cement at a ratio of 0.265. Figure 20(c) and (d) also show a negative linear correlation with the CS.

Furthermore, Fig. 21(b) reveals that an FA content of 700 kg/m³ combined with an HRWR content of 7 kg/m³ has a positive impact. Conversely, increasing the HRWR to 8 or 9 kg/m³ leads to a lower correlation at an FA content of 700 kg/m³.

Fig. 21 The SHAP dependence plot for the TS

Fig. 21(d) shows that, at an AR of 200, the cement content exhibits a strong correlation when it is below 230 kg/m³. In contrast, at an AR of 300, a cement content in the range of 400 to 500 kg/m³ shows a higher correlation. However, increasing the cement content beyond this range results in a decrease in the TS.

4 Conclusions

This paper presents an interpretable ML model for predicting the CS and TS of PVA-ECC. Data from 81 mixture samples were collected and randomly split into 80% and 20% subsets for training and testing, respectively. Hyperparameters were optimized using GridSearchCV. Among all the ML models considered, the XGBoost method produced the most accurate predictive model. The SHAP algorithm was then employed to explain the model's notable features and their complex correlations. The proposed model's predictions were compared against earlier research and the selected best-performing ML algorithms. The following conclusions can be drawn from the present study:

  1. The established XGBoost model demonstrated extremely high accuracy in predicting the CS and TS of PVA-ECC, with R² values of 0.99 and 0.98 for CS and TS, respectively.

  2. The XGBoost model was compared with other machine learning models, namely RF, SVM, and ANN. XGBoost and RF outperformed ANN and SVM, with XGBoost achieving the best overall performance.

  3. The SHAP interpretation shows that cement and FA are the crucial parameters affecting CS, as evidenced by their having the highest SHAP values. Increasing the cement content significantly increases CS, whereas increasing the FA content reduces it. Although changes in NS do not influence CS, increasing NS significantly increases TS. Other parameters that positively affect CS have adverse effects on TS. The SHAP feature instances suggest that FA and cement positively affect CS when the instance number exceeds 35. For HRWR, the impact is moderate on CS and significant on TS in the mentioned instance ranges. Similar trends are found in the SHAP interaction analysis for cement and FA in the case of CS, while for TS, HRWR has a negative impact and NS a positive one.

  4. SHAP dependence plots demonstrate that the interaction between FA and W/S can increase CS within a particular range (W/S of 0.70–0.75 and FA of 900–1000 kg/m³). In contrast, the other correlations show negative trends for CS. FA and HRWR negatively impact TS when FA is 700 kg/m³ and HRWR is 7 kg/m³.

  5. Cement and NS have a positive influence on TS, which shows that increasing these features within certain limits can yield high performance. The sand size is not specified in the database, and sand shows both positive and negative trends on TS. A dedicated database for sand of a single size could give a more accurate prediction for improving the TS using XGBoost and SHAP. Therefore, more research is needed to specify the influence of sand on TS, which will require further study of the mechanical performance of fiber-based concrete in the future.

  6. The SHAP values reveal, in a straightforward and precise manner, how each feature affects the CS and TS predictions. By establishing meaningful value ranges for the different features, this information can be used to build better CS and TS models. Further improvements to the database are also required; adding more data on PVA-ECC would help improve prediction accuracy.
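The random 80%/20% split and the R² metric referred to in these conclusions can be sketched as follows. This is a standard-library illustration, not the study's pipeline: the helper names, the seed, and the rule of rounding the test set size up are assumptions, and neither the GridSearchCV search nor the XGBoost model is reproduced here.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly split range(n_samples) into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the test set size is rounded up to the next whole sample.
    n_test = math.ceil(test_fraction * n_samples)
    return indices[n_test:], indices[:n_test]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# 81 samples, as in the database described above.
train_idx, test_idx = train_test_split_indices(81)
print(len(train_idx), len(test_idx))
```

An R² of 1 corresponds to predictions that match the measured strengths exactly, while a model that always predicts the mean of the measured values scores 0.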

Finally, it can be concluded that the XGBoost model can quickly estimate the CS and TS of concrete for various mix proportions and can be used to check whether a designed mix proportion meets the target strength requirements. By forecasting the required CS and TS at 28 days, it can help engineers carry out approval analysis during the building stage and safety analysis during the service stage. Moreover, the model helps to understand the mechanical interactions within the mixture and their intercorrelation, which supports better-informed mixture design. SHAP analysis is complex, and further features need to be explored for different fiber-reinforced concretes. This research helps to understand feature importance, adopt optimized mixture properties, and achieve high strength, and it plays a crucial role in promoting green and low-carbon development, in line with global efforts toward sustainability and environmental conservation. These aspects will shortly be studied, based on the work presented in this paper, for fiber-based 3D concrete, where the mixtures are costly and complicated.