1 Introduction

The exponential growth and advancement of the computational sciences have brought new technologies such as artificial intelligence (AI) and machine learning (ML) (Tapeh and Naser 2023a; Ma et al. 2021; Tiwari et al. 2023). As computational power has become more democratised and widely available for research purposes, new paradigms have evolved across various fields of engineering, including structural health monitoring, computer-aided design development, and the assessment of structural longevity (Tapeh and Naser 2023b; Dimiduk et al. 2018; Salehi and Burgueño 2018; Baduge et al. 2022; Sacks et al. 2019; Hatami et al. 2022; Pan and Zhang 2021; Palsara et al. 2023). These computational sciences have permeated various sectors of the economy, including the real estate industry. The design of modern structures focuses on factors such as resilience, safety, sustainability, reliability, economy, and aesthetics. One of the most fundamental building blocks of any infrastructure is the reinforced concrete beam, which is widely used in construction to support loads and distribute them to columns or walls (Rahman et al. 2021a; Truong-Hong and Lindenbergh 2022). The use of RC deep beams has increased substantially since the advent of taller structures (Fujino et al. 2019; Hao et al. 2023). However, these tall structures face various failure modes, including tensile, bending, and shear failures, which can be mitigated by embedding steel reinforcing bars in concrete beams (Jin et al. 2019). Shear failure, caused by shear force combined with axial loads and moments, is one of the most dangerous failure types, as it can occur without warning (Zhang et al. 2022). In contrast, flexural failure develops gradually through the yielding of rebars (Al-Osta et al. 2017). Once shear fractures begin, the shear transmission process becomes random (Zhang et al. 2022).

Studies have shown that computational science plays a vital role in engineering, and specifically in civil engineering. Ensemble learning, a technique from computer science, is widely used in various disciplines such as biology, engineering, and sociology (Tiwari et al. 2022; Zhang and Lu 2021; Xu et al. 2021; Gundersen and Kjensmo 2018; Aristodemou and Tietze 2018). Machine learning (ML) is commonly used in building structural design and performance assessment, enhancing predictions of concrete properties, and improving the finite element modelling of structures (Selvaraj and Sivaraman 2019; Castelli et al. 2013; Sun et al. 2021a; Naranjo-Pérez et al. 2020; Abuodeh et al. 2020b; Sun et al. 2021b). Ensemble learning is a powerful ML technique that improves the accuracy of predictions made by a model (Dong et al. 2020; Tiwari et al. 2022). It is particularly useful for large datasets with many features, as it trains a group of models on different subsets of the data and combines their predictions into a final prediction. This technique can also reduce the variance of the model's predictions (Dong et al. 2020).

Predicting the shear strength of reinforced concrete beams is a complex problem owing to the nature of the materials involved, and ensemble learning can be beneficial in reducing the variance of the model's predictions (Zhang et al. 2020b). The most common type of shear reinforcement in concrete beams is stirrups, which transfer shear forces between the concrete and steel. There are several ways to predict the shear strength of reinforced concrete beams, but empirical code provisions are the most common. Every country has its own empirical codes for estimating shear strength: in the US, the American Concrete Institute (ACI) 318 code governs the design of reinforced concrete beams (Committee 2008), while in Europe, Eurocode 2 (Bethlehem 2004) is prominent for designing concrete structures.

This study aims to formulate and compare various boosting machine learning algorithms to predict the shear strength of Reinforced Concrete (RC) deep beams, which is a complex task due to uncertain factors. The study explores ensemble learning methods like Adaptive Boosting, Extreme Gradient Boosting, Random Forest, Gradient Boosting, and Voting Regressor. In addition, conventional ML algorithms including ANN, DT and SVM are also compared. The objective is to identify the most effective machine learning approach, one that outperforms traditional mechanics-driven models based on the codes of the United States (Committee 2008), China (Standard 2002), Europe (Bethlehem 2004), Britain (CIRIA) (Arup and Partners 1977), and Canada (Darwin et al. 2016), by evaluating their performance using metrics like \({R}^{2}\), MAE, RMSE, and MAPE. This comparison aims to assess whether the machine learning-based approach, particularly the XGBoost model, can surpass the accuracy and performance of traditional mechanics-driven models in predicting the shear strength of RC deep beams across various regions.

The study also incorporates the use of the explainable machine learning (EML) technique, SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017), to gain interpretability and insights into the developed best performing model. This step is crucial for understanding the factors that contribute to the predictions and enhancing transparency in the decision-making process. The authors have also performed feature selection analysis to understand how varied parameters affect the prediction of shear strength while using ML Models.

In summary, the objectives of this study are to advance the understanding and prediction capabilities of RC deep beam shear strength through the application of state-of-the-art machine learning methods. By comparing these models against established mechanics-driven models, the study aims to provide engineers and researchers with a more accurate and reliable tool for designing and assessing the structural behaviour of RC deep beams, ultimately contributing to advancements in the field of civil engineering and construction.

The simulations were coded in Python, and the relevant models were built on colab.research.google.com. Microsoft Excel was used for the comparison and calculations of the mechanics-driven models.

The remainder of this paper is organised as follows: Sect. 2 presents the literature review; Sect. 3 describes the methodology of the ML models; Sect. 4 covers the model structure, dataset collection, dataset limitations, model selection, model evaluation, and hyper-parameter optimisation; and Sect. 5 presents the comparison between conventional and ensemble models, the comparison between ML models and mechanics-driven models, the SHAP analysis of the XGBoost model, and the feature importance analysis.

2 Literature review

The use of artificial intelligence (AI) and machine learning (ML) techniques in structural engineering dates back to the 1980s, when researchers realised that conventional approaches, e.g., finite element models and analytical models, have difficulties in accurately and efficiently predicting structural behaviour (Tapeh and Naser 2023a). With the rapid development and democratisation of computer science, new paradigms came into play. As computational power became widely available, more powerful algorithms were proposed, which widened the scope of structural engineering. Three major fields have shown significant progress: Structural Health Monitoring (SHM), performance evaluation, and the modelling of mechanical properties (Tapeh and Naser 2023a).

SHM lends itself to machine learning because it collects huge amounts of data from various sensors and then processes these data into data-driven models. In addition, unsupervised algorithms and clustering techniques can also be used (He et al. 2022; Gordan et al. 2022; Sarmadi and Yuen 2022).

Performance evaluation is another major area that has improved with the implementation of ML. Conventional performance evaluation methods, including fragility and reliability assessment, require large amounts of data to account for the uncertainty and randomness in the structure. As in SHM, the data are collected using sensors, which can even provide real-time measurements. Applying ML to formulate dynamic models from situational data reduces the computational cost, as seen in various studies (Dubourg et al. 2011, 2013; Kersaudy et al. 2015; Dai and Cao 2017; Lagaros and Fragiadakis 2007; Lagaros et al. 2009; Mangalathu et al. 2018a; Wang et al. 2018; Jalayer et al. 2017; Mangalathu et al. 2018b; Mahmoudi and Chouinard 2016).

Over the last two decades, modelling the mechanical behaviour of structures has become prevalent, alongside exploration of the diverse uses of concrete and its dynamic behaviour in various structures. Concrete has been widely employed in the construction of structures owing to its advantageous engineering characteristics, including abundant raw materials, low cost, strong compressive strength, and exceptional durability. Long spans and structures without intermediary columns both benefit from the use of deep beams. Deep beams are employed as girders to support the carriageway in bridges, as side walls in RCC water tanks, and as connections for the pile caps in pile foundations. Deep beams, which have a greater depth than commonly used beams, are classified by their shear span-to-depth ratio, and they are defined differently by different codes (Arup and Partners 1977; Committee 2008; Standard 2002; Darwin et al. 2016; Bethlehem 2004). Per Eurocode (Bethlehem 2004), a beam is called deep when the ratio of effective span to overall depth is less than 3. Per the ACI code (Committee 2008), special shear design is required when the clear span-to-effective depth ratio is less than 5. Leonhardt and Walter (1966) experimentally proved that elastic design is not valid for such deep beams; their investigation further highlighted the significance of accurate steel detailing in deep beams. The distribution of strain in a section of a deep beam is not linear and cannot be determined by elasticity theory. In general, the impact of shear in beam design is handled by longitudinal reinforcement provisions; in the case of excess shear, transverse reinforcement is designed separately (Ismail et al. 2018). For deep beams, such simplifying assumptions are not adequate, and various approaches, such as compression field theory and tension field theory, have been proposed by researchers, on the basis of which different country codes prescribe their procedures for shear design (Arup and Partners 1977; Committee 2008; Standard 2002; Darwin et al. 2016; Bethlehem 2004). The amounts of reinforcement and concrete required depend directly on an accurate prediction of the shear capacity of a section (Mihaylov et al. 2010).

Machine learning (ML) is one of the widely accepted methods for tackling structural problems (Yaseen et al. 2018; Zhang et al. 2020c; Biswas et al. 2022; Asteris et al. 2019; Armaghani et al. 2019; Basilio and Goliatt 2022; Esteghamati and Flint 2021; Wakjira et al. 2022b; Salman and Kadhum 2022; Farrar and Worden 2012). In Sandeep et al. (2023), the authors thoroughly discuss the implementation of ML approaches for predicting the shear strength of RC deep beams, covering in-depth procedures, various algorithms, and the basics of modelling, training, testing, underfitting, and overfitting; however, no real-time implementation or results are shown. In Fu and Feng (2021), the authors formulate ML algorithms to predict the shear strength of corroded reinforced concrete beams using a gradient-boosting regression tree. They use 158 shear tests of corroded reinforced concrete beams, showing how empirical models cannot account for the influence of corrosion, and also compute the time-dependent corrosion extent and lifetime shear strength predictions. However, the issue of interpretability is not discussed, and the dataset is very small. In Chou et al. (2020b), the authors integrate the smart firefly algorithm and least-squares support vector regression to build a hybrid model on a multi-source dataset sourced from North America, Australia and America. The hybrid model shows promising results, with a MAPE of 18.95%; however, the authors present no feature correlations or any method for interpretability of the models employed. In Naik and Kute (2013), the authors implement an artificial neural network for predicting the shear strength of high-strength steel fibre-reinforced concrete deep beams, validated using the residual sum of squares. The developed ANN8 establishes the relationships between the various parameters affecting the complex behaviour of steel fibre-reinforced concrete deep beams. In Concha et al. (2023), the authors develop a hybrid neuro-swarm model to predict the shear strength of steel fibre-reinforced concrete deep beams, built on 116 experimental datasets; the analysis-of-variance test showed prominent results. The authors also present various models used for shear strength calculation and prediction in conventional approaches (Committee 2008; Vamdewalle and Mortelmans 1994; Al-Ta'an and Al-Feel 1990; Sharma 1986; Khuntia et al. 1999; Cho and Kim 2003). However, the experimental data size is small, which may result in overfitting. In Pak et al. (2023), the authors propose a novel transfer ensemble neural network (TENN) model to increase predictive performance on small shear capacity datasets, incorporating both ensemble learning and transfer learning to control the high variability of ML models. Although the results are impressive, the black-box nature of the approach and the risk of overfitting remain open issues. In Almasabha et al. (2023), the authors work on a new dataset of 102 instances of synthetic fibre-reinforced concrete (SyFRC) for reinforced concrete structures, predicting the shear strength of SyFRC beams without stirrups using the ACI code and the ML algorithms LightGBM, XGBoost, and Gene Expression.
The study shows that, apart from the ACI equation, all considered models effectively capture the effects of the shear span-to-depth ratio. In Ly et al. (2020), the authors implement real-coded genetic algorithms and the animal-inspired firefly algorithm to predict the shear strength of reinforced concrete deep beams on a dataset of 463 instances; the authors then compare the obtained results with neural networks, with promising outcomes. In Olalusi and Awoyera (2021), the authors implement Gaussian process regression (GPR) and random forest (RF) to predict the shear resistance of steel fibre-reinforced concrete slender beams without stirrups. The results are compared with statistical models and German guidelines, and the authors also report the inconsistencies in prediction observed during the study. In Hossain et al. (2017), the authors formulate an ANN approach to predict shear strength on an experimental database of 173 steel fibre-reinforced concrete (SFRC) beams without stirrups; additionally, the approach is tested on data from 36 experimental beams. The authors show how the ANN outperforms empirical equations for high- and ultra-high-strength SFRC beams, although other possible techniques are not explored in this scenario. In Tapeh and Naser (2023a), the authors conduct a state-of-the-art review of AI, ML, and deep learning (DL) implementations in structural engineering, particularly earthquake, wind, and fire engineering. The authors introduce a wide range of techniques and their varied implications and benefits, covering more than 4000 scholarly works to identify best practices. They also cover shear strength prediction for RC deep beams, although the scholarly works on this specific issue are limited to only two; overall, the paper surveys how AI, ML, and DL have shaped structural engineering over the last decades. In Marie et al. (2022), the authors present a framework for predicting the shear strength of reinforced concrete beam-column connections subjected to cyclic loading. They apply classical prediction models, namely k-nearest neighbour regression (KNN), multivariate adaptive regression splines (MARS), ordinary least squares (OLS), support vector machines (SVM), artificial neural networks (ANN), and kernel regression with mixed data types, to a dataset of 98 instances, and show that kernel regression predicted the joint shear strength with the highest accuracy. However, neither model interpretability nor feature importance is addressed. In Wakjira et al. (2022c), existing predictive models had shown unsatisfactory results; in response, the research proposed machine learning (ML)-based models, considering all important variables, for predicting shear capacity. The analyses demonstrated successful predictions using the ML models, with extreme gradient boosting (XGBoost) showing the highest capability. Comparisons with existing models revealed the superiority of XGBoost in terms of accuracy, safety, and economy, although limitations concerning model interpretability were not addressed. Finally, reliability analysis was performed to calibrate resistance reduction factors, improving the confidence and applicability of the proposed model.
Further research is needed to address this issue and to explore additional avenues for enhancing ML techniques in structural engineering. In Liu et al. (2022), the authors aimed to establish an accurate prediction model for the direct shear strength (DSS) of precast concrete joints (PCJ) using support vector regression (SVR), a machine learning algorithm. They assembled a comprehensive database of 304 test results with 23 input parameters and employed a novel correlation-matrix-based feature selection method to improve the SVR model's performance. Experimental validation showed that the SVR model outperformed traditional mechanical models in predicting DSS for PCJs. Additionally, the study provided insights into the SVR model's results using partial dependence and individual conditional expectation plots. Another study addressed the challenges of accurately predicting the shear strength of fibre-reinforced sand (FRS) owing to the complex soil-fibre interaction mechanism. To tackle this, the authors compiled a high-quality database of triaxial and direct shear tests on FRS from 1983 to 2015, including crucial information on sand properties, fibre characteristics, soil-fibre interface properties, and stress parameters. This database serves as a solid foundation for further analysis and future development of improved mechanical models for predicting FRS shear strength.

3 Methodology

3.1 Ensemble learning

Ensemble learning is a machine learning technique that combines multiple models to improve the accuracy and robustness of predictions. The core principle of this approach is to combine various foundational models, or "learners", to generate a more powerful prediction model. The technique rests on the long-standing observation that the combined knowledge and insights of a group frequently exceed those of an individual; within the domain of machine learning, ensemble learning applies this principle to algorithms, achieving enhanced predictive results through the collaborative integration of several models.

At its fundamental essence, ensemble learning aims to enhance predictive accuracy, strengthen generalisation, and reinforce model stability. The objective is to mitigate the inherent constraints of individual models through diversity and collaboration among the constituent learners. Several crucial notions recur in the academic discourse on ensemble learning.

The first consideration pertains to diversity within the foundational models. Diversity plays a fundamental role in ensemble learning and is achieved through a range of strategies, including the use of diverse algorithms, the incorporation of distinct subsets of data, and the variation of hyperparameters during training. The underlying idea is that diverse models make distinct errors on different portions of the data, and this collective diversity ultimately raises the probability of accurate predictions.

Another crucial factor to consider is the consolidation of forecasts generated by individual models. Ensemble methods utilise many aggregation approaches, such as majority voting, weighted averaging, and stacking, each of which is based on distinct mathematical concepts and possesses distinct features.

The selection of base learners is a critical aspect of the ensemble learning procedure. Base learners range from elementary models, such as decision trees, to more intricate ones, such as neural networks. The choice of suitable base learners depends on the distinct attributes of the data and the inherent nature of the problem under consideration.

Ensemble learning comprises a range of ensemble forms, including bagging, boosting, and stacking, each characterised by unique methodologies for aggregating base models. The scholarly literature has exhaustively examined these many sorts of ensembles, providing insights into their individual merits and limitations.

Ensemble learning offers a structured approach to managing the bias-variance trade-off, a crucial consideration within the field of machine learning. By integrating various models, ensembles can address overfitting, which is characterised by large variance, while simultaneously capturing detailed patterns in the data and thereby minimising bias.

Model interpretability is another subject of scholarly inquiry in the field of ensemble learning. Ensemble approaches frequently augment prediction performance but concomitantly add complexity to the overarching model. Researchers are currently investigating methods that achieve a harmonious equilibrium between the precision and the interpretability of models, so that the knowledge obtained from the model remains lucid and comprehensible.

Finally, scholarly discourse surrounding ensemble learning encompasses its practical implementation in various fields, such as banking, healthcare, image identification, and natural language processing. Researchers continually strive to illustrate the capacity of ensemble methodologies to offer more effective solutions to real-world situations, thus emphasising the practical significance of ensemble learning.

The mathematical notation for ensemble learning involves defining a set of base models, and then combining them to produce a final prediction (Dietterich 2000).

Let X be the input data, and Y the target variable we wish to predict. We define a set of N base models, denoted by \(M_1, M_2, \ldots , M_N\). Each base model takes X as input and produces a predicted output, denoted by \(M_i(X)\).

The ensemble model then combines the predictions of the base models to produce a final prediction, denoted by F(X). There are many ways to combine the predictions of the base models, but one common approach is the weighted average defined by Eq. (1).

$$\begin{aligned} F(X) = w_1 M_1(X) + w_2 M_2(X) + \cdots + w_N M_N(X), \end{aligned}$$
(1)

where \(w_1, w_2,\ldots , w_N\) are the weights assigned to each base model. The weights can be learned from the data or set manually based on prior knowledge.
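
As an illustration, Eq. (1) translates directly into a few lines of Python. The following is a minimal sketch in which the fitted model objects, the weight values, and the input matrix `X` are assumed to be supplied by the user.

```python
import numpy as np

def ensemble_predict(models, weights, X):
    """Weighted-average ensemble of Eq. (1): F(X) = w1*M1(X) + ... + wN*MN(X)."""
    preds = np.stack([m.predict(X) for m in models])   # shape (N, n_samples)
    return np.average(preds, axis=0, weights=weights)  # normalises the weights internally
```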

Ensemble learning is a popular machine learning technique that combines multiple models to achieve better accuracy and generalization performance than using a single model. In the context of classification, ensemble learning involves constructing a set of base classifiers that make predictions on a given dataset, and then combining these predictions using a specified aggregation method to obtain the final classification result (Tiwari et al. 2022).

3.1.1 Boosting

Boosting is a common ensemble learning method that sequentially trains a set of weak classifiers on re-weighted versions of the training data, such that the misclassified samples in each iteration receive higher weights in the subsequent iterations. The final classification is then obtained by weighted voting of the individual classifier outputs (Freund et al. 2003, 1996). Mathematically, the boosting algorithm can be formulated as follows, with a from-scratch sketch after the listing:

Given a training dataset \(D = {(x_i, y_i)}_{i=1}^n\), where \(x_i\) denotes the feature vector of the i-th sample and \(y_i\in {-1, +1}\) represents its class label, and a set of weak classifiers \(h_m(x)\), \(m=1,\ldots ,M\), the boosting algorithm aims to learn a strong classifier H(x) as follows:

  1. Initialize sample weights \(w_i=1/n\), \(i=1,\ldots ,n\).

  2. For each iteration \(m=1,\ldots ,M\):

    • Train the m-th weak classifier \(h_m(x)\) on the weighted training dataset \(D_m={(x_i,y_i,w_i)}_{i=1}^n\).

    • Compute the error rate \(\epsilon _m=\sum _{i=1}^n w_i I(y_i \ne h_m(x_i))\), where \(I(\cdot )\) is the indicator function.

    • Compute the classifier weight \(\alpha _m=\frac{1}{2}\log \frac{1-\epsilon _m}{\epsilon _m}\).

    • Update the sample weights as \(w_i \leftarrow w_i\exp (-\alpha _m y_i h_m(x_i))\), then renormalise them so that \(\sum _{i=1}^n w_i=1\).

  3. Output the final classifier, defined by Eq. (2).

    $$\begin{aligned} H(x)=\text {sign}\left( \sum _{m=1}^M \alpha _m h_m(x)\right) . \end{aligned}$$
    (2)
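
The loop above maps almost line-for-line onto code. The following is a from-scratch sketch, assuming labels in \(\{-1,+1\}\) and decision stumps as the weak classifiers \(h_m(x)\); the explicit renormalisation keeps \(\epsilon _m\) a valid error rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Boosting loop of Sect. 3.1.1; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 1: w_i = 1/n
    learners, alphas = [], []
    for _ in range(M):                         # step 2: m = 1, ..., M
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # error rate
        alpha = 0.5 * np.log((1 - eps) / eps)  # classifier weight
        w *= np.exp(-alpha * y * pred)         # re-weight the samples
        w /= w.sum()                           # renormalise
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Step 3, Eq. (2): H(x) = sign(sum_m alpha_m * h_m(x))."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(learners, alphas)))
```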

3.1.2 Stacking

Stacking is another popular ensemble learning technique that combines the outputs of multiple base classifiers using a higher level meta-classifier, which is trained on the predictions of the base classifiers. Specifically, stacking consists of the following steps:

  1. Split the training dataset D into k disjoint subsets, or folds, \({D_1,\ldots ,D_k}\).

  2. For each fold \(i=1,\ldots ,k\): train the M base classifiers on the \(k-1\) folds other than \(D_i\); obtain the predicted class probabilities for the samples in \(D_i\) from each base classifier; concatenate the predicted probabilities from all base classifiers to form a new feature vector for each sample in \(D_i\); and store the new feature vectors with the corresponding true class labels as a new training dataset \(D'_i\).

  3. Train a meta-classifier, such as logistic regression or SVM, on the augmented training dataset \({D'_1,\ldots ,D'_k}\).

  4. Combine the base classifiers and the meta-classifier to form the final stacked classifier.

Mathematically, the stacking algorithm can be represented as follows: given a training dataset \(D = {(x_i, y_i)}_{i=1}^n\), where \(x_i\) denotes the feature vector of the i-th sample and \(y_i\in {-1, +1}\) represents its class label, and a set of base classifiers \(h_m(x)\), \(m=1,\ldots ,M\), the meta-classifier is trained on the out-of-fold predictions \(\left( h_1(x_i),\ldots ,h_M(x_i)\right) \) assembled in steps 1 and 2 above.
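
In practice, the fold-splitting and out-of-fold bookkeeping rarely need to be written by hand. The sketch below uses scikit-learn's built-in stacking implementation; the choice of base learners, the meta-classifier, and k=5 folds are illustrative assumptions rather than the configuration used in this study.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=4)),
                ("svm", SVC(probability=True))],   # base classifiers
    final_estimator=LogisticRegression(),          # meta-classifier (step 3)
    stack_method="predict_proba",                  # pass class probabilities upward (step 2)
    cv=5,                                          # k disjoint folds (step 1)
)
# stack.fit(X_train, y_train) builds the out-of-fold features and trains the meta-model;
# stack.predict(X_test) applies the final stacked classifier (step 4).
```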

3.1.3 Bootstrap aggregating algorithm

Bagging, short for Bootstrap Aggregating, is another popular ensemble learning method that trains multiple base classifiers on different bootstrap samples of the training data and combines their outputs by majority voting to obtain the final classification. Mathematically, the bagging algorithm can be represented as follows, with a brief code sketch after the listing:

Given a training dataset \(D = {(x_i, y_i)}_{i=1}^n\), where \(x_i\) denotes the feature vector of the i-th sample and \(y_i\in {-1, +1}\) represents its class label, and a set of weak classifiers \(h_m(x)\), \(m=1,\ldots ,M\), the bagging algorithm aims to learn a strong classifier H(x) as follows:

  1. For each iteration \(m=1,\ldots ,M\): generate a bootstrap sample \(D_m\) of size n by randomly sampling n samples from D with replacement, and train the m-th weak classifier \(h_m(x)\) on \(D_m\).

  2. Output the final classifier, defined by Eq. (3).

    $$\begin{aligned} H(x)=\text {sign}\left( \sum _{m=1}^M h_m(x)\right) . \end{aligned}$$
    (3)
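
As a minimal sketch, the bagging procedure above corresponds to scikit-learn's `BaggingClassifier` with decision trees as the weak classifiers; the number of rounds M=50 is an illustrative choice.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagger = BaggingClassifier(
    DecisionTreeClassifier(),  # weak classifier h_m
    n_estimators=50,           # M bootstrap rounds
    bootstrap=True,            # each D_m: n samples drawn from D with replacement
)
# bagger.fit(X_train, y_train) trains each h_m on its own bootstrap sample D_m;
# bagger.predict(X_test) combines the h_m by majority vote, mirroring Eq. (3).
```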

3.2 Overview of the ML models

3.2.1 Artificial neural network

An Artificial Neural Network is a computational model inspired by the structure and functionality of biological neural networks in the human brain. It is a type of machine learning algorithm designed to recognize patterns, solve complex problems, and make decisions based on input data (Yegnanarayana 2009).

3.2.2 Decision tree

A decision tree is a non-linear predictive model and a popular supervised learning algorithm used for classification and regression tasks. It is a graphical representation of a set of rules and decisions based on input features that recursively partition the data into subsets, leading to a hierarchical tree-like structure (Maimon and Rokach 2014).

Fig. 1
figure 1

Evolution of XGBoost

3.2.3 Support vector machine

The concept behind Support Vector Machine (SVM) is to find the best decision boundary (hyperplane) that separates the data points of different classes with the largest margin possible. The data points closest to the hyperplane, known as support vectors, play a crucial role in defining the optimal hyperplane. These support vectors are used to determine the margin and influence the overall performance of the SVM (Hamel 2011).

3.2.4 Random forest

Random forest is a supervised learning algorithm that can be used for both classification and regression. It works by building multiple decision trees (hence the "forest") and aggregating their outputs. Each tree is built using a random subset of the features, and the final prediction is made by majority vote for classification or by averaging the trees' predictions for regression (Bakouregui et al. 2021; Belgiu and Drăguţ 2016).

3.2.5 Gradient boosting

Gradient boosting is a machine learning technique that can be used for both regression and classification problems. It creates a prediction model as an ensemble of weak prediction models, often decision trees. Like other boosting methods, it builds the model incrementally in a stage-wise fashion. It also allows for the optimization of an arbitrary differentiable loss function, which helps to generalize the model. (Natekin and Knoll 2013).

3.2.6 Adaptive boosting

The Adaptive Boosting algorithm is a classification technique used to improve the accuracy of a model by combining a set of weak models. The algorithm adaptively re-weights the training samples so that those misclassified in one iteration receive more weight in the next, while each weak model is weighted in the ensemble according to its accuracy. The algorithm then continues to iteratively train models and update the weights until the desired accuracy is achieved (Wu et al. 2010).

3.2.7 Extreme gradient boosting (XGBoost)

The extreme gradient boosting algorithm is a powerful machine learning algorithm that is often used for classification tasks. It is a modification of the gradient boosting algorithm designed to be more efficient and to better handle data with a large number of features. The algorithm builds a model in a stage-wise fashion: in each stage, a new tree is added to the model, and its predictions are combined with those of the existing trees. The trees are added in a way that minimises the loss function of the model. The elaborated model is shown in Fig. 1 (Chen et al. 2015).

The extreme gradient boosting algorithm is very effective at handling data with a large number of features, because it can choose which features to use in each stage of the model, focusing on the most important features and ignoring the less important ones. The algorithm is also effective at handling imbalanced data, because it can learn from the mistakes it makes on the minority class and use this knowledge to improve its predictions for that class. XGBoost is also used in image classification (Jiang et al. 2019), malware detection (Wu et al. 2020), predicting patient mortality during COVID-19 treatment (Kivrak et al. 2021), and detecting fraudulent activities (Hancock and Khoshgoftaar 2020).

Table 1 Statistical Information of parameters in deep beam database

3.2.8 Voting regressor

A voting regressor is an ensemble learning method for regression that combines the predictions of multiple individual regressors. The individual regressors can be any type of regression algorithm, such as linear regression, support vector regression, or decision tree regression, and their predictions are combined by (optionally weighted) averaging. The voting regressor is a powerful tool because it can reduce the variance of the predictions, making them more robust and accurate. In addition, it can help avoid overfitting, because it is less likely to overfit the training data than a single regressor (Chen et al. 2019).
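
A compact sketch of such a regressor is given below. The constituent models and their settings are illustrative; the XGBoost hyperparameters echo the values tuned later in Sect. 4.3, and the `xgboost` package is assumed to be installed.

```python
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from xgboost import XGBRegressor

voter = VotingRegressor([
    ("rf",  RandomForestRegressor(n_estimators=200)),
    ("gbr", GradientBoostingRegressor()),
    ("xgb", XGBRegressor(n_estimators=600, learning_rate=0.1, max_depth=10)),
])
# voter.fit(X_train, y_train); voter.predict(X_test) returns the average of the
# three individual predictions, which dampens the variance of any single model.
```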

4 Model structure

4.1 Data collection

The dataset of RC beams is compiled from the published literature (Feng et al. 2021). In this study, a total of 271 test data samples of RC deep beams were collected from the literature and used: 52 samples from (Smith and Vantsiotis 1982), 25 from (Kong et al. 1970), 37 from (Clark 1951), 53 from (Oh and Shin 2001), 4 from (Aguilar et al. 2002), 12 from (Quintero-Febres et al. 2006), 19 from (Tan et al. 1995), 12 from (Ramakrishnan and Ananthanarayana 1968), and 39 from (Shaoxi 1982).

The database includes a wide range of RC deep beams so that the model can generalise more effectively. It contains four different types of deep beams: beams without web reinforcement (WOR), beams with horizontal web reinforcement (WHR), beams with vertical web reinforcement (WVR), and beams with both horizontal and vertical web reinforcement (WHVR). In the dataset, this classification is encoded through parameters such as the area/spacing of vertical web reinforcement and the area/spacing of horizontal web reinforcement.

The input variables for these beams are 16 design features that fall into four groups: (1) geometric dimensions: beam span \(l_{0}\), height h, effective height \(h_{0}\), width b, and shear span a; (2) longitudinal reinforcement information: reinforcement ratio \(\rho _{l}\) and strength \(f_{y l}\); (3) web reinforcement information: horizontal reinforcement ratio \(\rho _{\textrm{h}}\), spacing \(s_{\textrm{h}}\) and strength \(f_{y \textrm{h}}\), and vertical reinforcement ratio \(\rho _{\textrm{v}}\), spacing \(s_{\textrm{v}}\) and strength \(f_{y \textrm{v}}\); (4) concrete property: concrete strength \(f_{\textrm{c}}^{\prime }\). The output is the beam's shear strength, denoted by \(V_{u}\). The value ranges for these variables, together with statistical information (mean, standard deviation, etc.), are listed in Table 1. Figure 2 also plots the frequency distributions of the deep beam parameters.
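
For orientation, a loading-and-splitting sketch of such a database is shown below; the file name and column labels are hypothetical stand-ins for the parameters of Table 1, and the 80/20 split matches the one used in Sect. 5.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("deep_beam_database.csv")   # hypothetical file name
features = ["l0", "h", "h0", "b", "a",       # geometric dimensions
            "rho_l", "f_yl",                 # longitudinal reinforcement
            "rho_h", "s_h", "f_yh",          # horizontal web reinforcement
            "rho_v", "s_v", "f_yv",          # vertical web reinforcement
            "f_c"]                           # concrete strength
X, y = df[features], df["V_u"]               # V_u is the shear-strength target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
```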

4.1.1 Limitations

The dataset used for the analysis of shear strength in RC (Reinforced Concrete) deep beams poses certain limitations that need to be considered when implementing machine learning algorithms. Firstly, the dataset contains only 271 samples, which might not be sufficient to fully capture the wide variability of RC deep beams in practice. A small sample size could lead to reduced statistical significance and limit the generalizability of the machine learning models.

Secondly, the data are retrieved from old construction sites, potentially introducing bias and representativeness issues. Construction practices, materials, and design standards may have evolved, making the dataset less relevant to current scenarios. This temporal difference might affect the accuracy of the predictions.

Thirdly, the limited sample size can result in a lack of diversity within the dataset. As a result, the machine learning algorithms might not adequately capture the variations in beam configurations, reinforcement details, and loading conditions, which are crucial factors influencing shear strength.

To address some of these limitations, researchers should interpret the results with caution.

4.2 Model selection

In this study, the authors have utilised Random Forest, Adaptive Boosting, Gradient Boosting, XGBoost, Support Vector Machine (SVM), Decision Tree (DT), and ANN models. The authors have also implemented a Voting Regressor over the best-performing algorithms to ensure better, more generalised results. Figure 3 shows the step-by-step modelling approach taken in this study.

Fig. 2
figure 2

Deep beam parameters frequencies from the database (Feng et al. 2021)

Fig. 3
figure 3

Step by step ensemble learning modelling approach with interpretations

4.3 Hyper-parameter optimization

Once the data preprocessing is complete, the next task is to tune the hyperparameters. To discover suitable hyperparameters, a grid search is paired with k-fold cross-validation (CV), as shown in Fig. 4. Parameter optimisation is a crucial phase in ensemble learning, involving decisions on factors such as the number of weak learners, the learning rates, and the maximum tree depths. A methodical procedure is employed, commencing with the determination of parameter bounds derived from previous research and scholarly sources; this frequently involves constructing a parameter grid of candidate values for every hyperparameter.

The succeeding stage encompasses numerous iterations of model training, wherein different combinations of hyperparameters inside the specified grid are examined. Nevertheless, the effectiveness and dependability of this procedure are contingent upon the manner in which we assess the performance of the model. K-fold cross-validation (CV) assumes a crucial function in this context.

The K-fold cross-validation technique involves dividing the dataset into ‘k’ folds of equal size. The model is subsequently trained ‘k’ times, where each fold is utilised as the validation set once, while the remaining ‘\(k-1\)’ folds are employed as training data. K-fold cross-validation (CV) is considered to be of utmost importance for various reasons.

Firstly, the practise of evaluating the model on several data subsets helps mitigate bias in performance estimates. This approach enhances the robustness of the results and reduces their dependence on specific data divisions. Additionally, the utilisation of k-fold cross-validation (CV) offers a more accurate estimation of the variability in the performance of the model. This aids in evaluating the consistency and reliability of the model when applied to diverse subsets of data.

Furthermore, the selection of the value ‘k’ in k-fold cross-validation has an impact on the determination of the optimal hyperparameters. A higher value of ‘k’ (for example, 10) provides a more extensive investigation of hyperparameters, but at the expense of increased processing burdens. On the other hand, a reduced value of ‘k’ (such as 5) exhibits computational efficiency, although it may result in estimations that are comparatively less reliable. Therefore, the selection of ‘k’ is determined by balancing the available computational resources with the desired level of reliability.

The average of the ‘k’ rounds of training and validation is commonly used to describe the overall model performance. This metric offers a thorough evaluation of the model’s ability to generalise across various subsets of data.

It is generally advised to choose a value of ‘\(k=10\)’ in most search scenarios, since this choice achieves a suitable compromise between computational feasibility and accurate performance estimation. However, the precise value of the ‘k’ parameter may differ based on factors like as the size of the dataset, the computational resources at hand, and the desired level of confidence in the obtained results.

Fig. 4
figure 4

K-fold cross-validation method (Rathakrishnan et al. 2022)

The k-fold CV is a better technique for dealing with the bias introduced by the random selection of the training set. A loop of k rounds is conducted, in which the training set is divided into k equal-sized subsets; in each round, one subset is used to test the model and the remaining \(k-1\) subsets are used to train it. The random forest algorithm involves optimising three parameters: the total number of trees, the number of features chosen at random, and the maximum tree depth.

For XGBoost, separate value ranges are specified using grids for the number of trees, the learning rate, and the maximum tree depth: [0: 20: 600], [0.02, 0.05, 0.1, 0.2], and [2, 4, 8, 12, 14], respectively. When the tree number is low, the \({R}^{2}\) score rises quickly with it; once it reaches a certain value, the trend becomes progressively stable. The learning rate has a large impact on performance: to achieve the same \({R}^{2}\) score on a training set, a model trained with a smaller learning rate requires more trees than one trained with a larger learning rate. Increasing the number of trees, however, is not essential for improving the \({R}^{2}\) score at a high learning rate. For instance, when the learning rate is between 0.1 and 0.2, the score drops once the number of trees exceeds 100 to 200; for learning rates of 0.02 and 0.05, the score does not peak until the number of trees exceeds 400, at about 0.8. Maximum tree depths of 8 and 16 yield scores that are quite close. Based on this analysis, the optimal values for the number of trees, learning rate, and maximum depth are 600, 0.1, and 10, respectively.
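
A sketch of this tuning procedure is given below, pairing a grid search with 10-fold CV as described above; it assumes the grid "[0: 20: 600]" denotes tree counts from 20 to 600 in steps of 20.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": list(range(20, 601, 20)),  # assumed reading of [0: 20: 600]
    "learning_rate": [0.02, 0.05, 0.1, 0.2],
    "max_depth": [2, 4, 8, 12, 14],
}
search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                      param_grid, cv=10, scoring="r2", n_jobs=-1)
# search.fit(X_train, y_train); search.best_params_ then holds the selected
# combination, reported above as 600 trees, a 0.1 learning rate, and depth 10.
```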

4.4 Model evaluation

This study used four statistical measurement parameters to assess the prediction accuracy of the various ensemble learning models. These evaluation parameters compare the accumulated error of the predictions against the actual observations. The statistical parameters used are the coefficient of determination (R-squared), mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). These metrics provide information about the accuracy and precision of the predictions made by the ensemble learning models. Their mathematical formulations are defined as follows, with a computational transcription after the list:

  • Coefficient of determination \({R}^{2}\) (Di Bucchianico 2008)

    $$\begin{aligned} {R}^{2}=1-\frac{\sum _{i=1}^{m}\left( P_{i}-T_{i}\right) ^{2}}{\sum _{i=1}^{m}\left( T_{i}-\bar{T}\right) ^{2}} \end{aligned}$$
    (4)
  • Mean Absolute Error (MAE) (Maragos 1989)

    $$\begin{aligned} \textrm{MAE}=\frac{\sum _{i=1}^{m}\left| P_{i}-T_{i}\right| }{m} \end{aligned}$$
    (5)
  • Root Mean Squared Error (RMSE) (Chai and Draxler 2014)

    $$\begin{aligned} \hbox {RMSE}=\sqrt{\frac{\sum _{i=1}^{m}\left( P_{i}-T_{i}\right) ^{2}}{m}} \end{aligned}$$
    (6)
  • Mean Absolute Percentage Error (MAPE) (De Myttenaere et al. 2015)

    $$\begin{aligned} {\hbox {MAPE}}=\frac{100\%}{m} \sum _{i=1}^{m}\left| \frac{P_{i}-T_{i}}{T_{i}}\right| \end{aligned}$$
    (7)

    where \(P_{i}\) and \(T_{i}\) are the predicted and tested values, respectively; \(\bar{T}\) is the mean value of all the samples in the database.
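
The four metrics of Eqs. (4)-(7) are simple to compute directly; the sketch below is a plain NumPy transcription, with P the predicted and T the tested values.

```python
import numpy as np

def evaluate(P, T):
    """Return the R^2, MAE, RMSE and MAPE of Eqs. (4)-(7)."""
    P, T = np.asarray(P, float), np.asarray(T, float)
    r2 = 1 - np.sum((P - T) ** 2) / np.sum((T - T.mean()) ** 2)  # Eq. (4)
    mae = np.mean(np.abs(P - T))                                 # Eq. (5)
    rmse = np.sqrt(np.mean((P - T) ** 2))                        # Eq. (6)
    mape = 100 * np.mean(np.abs((P - T) / T))                    # Eq. (7)
    return {"R2": r2, "MAE": mae, "RMSE": rmse, "MAPE": mape}
```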

Fig. 5
figure 5

Comparison of \({R}^{2}\), MAE, RMSE and MAPE values between Ensemble ML algorithms

Clearly, the four metrics provide a thorough assessment of the model's performance. \(R^{2}\), which is better when closer to 1 (Renaud and Victoria-Feser 2010), assesses the linear relationship between predicted and actual values. The first-order and second-order errors between the predicted and actual values (measured by MAE, MAPE, and RMSE) are better when smaller (Hackeling 2017).

5 Results and discussion

5.1 Comparison between ML algorithms

The performance of the ensemble learning techniques is contrasted with traditional single learning techniques, namely decision trees (DT), support vector machines (SVM), and artificial neural networks (ANN). To ensure a fair comparison, the hyperparameters of the single learning methods are also established through grid search and 10-fold cross-validation.

The authors have compared seven machine learning models on the testing dataset, i.e. three conventional ML models and four ensemble learning models. Figure 5 compares the performance of the four ensemble models on the testing dataset. It is clear that, compared with the single learning models, the ensemble learning models exhibit significant improvements. For instance, the worst ensemble learning model, random forest (RF), has an \({R}^{2}\) value of 0.906, whereas the best single learning model, DT, has an \({R}^{2}\) value of 0.887. As shown in Tables 2 and 3, the root mean squared error (RMSE) in the shear strength predictions of the single learning models ranges from 63 to 72 kN, whereas that of the four ensemble models is around 55 kN. The MAE of the single learning models is greater than 40 kN, whereas that of the ensemble models is less than 38 kN. The mean absolute percentage error (MAPE) of the ANN model is higher than 18%, whereas that of the ensemble models is lower than 14%, and that of the XGBoost model is only about 10%.

Table 2 Comparison of \({R}^{2}\), MAE, RMSE and MAPE values between conventional ML models

As shown in Table 3, the root mean squared error (RMSE) in the shear strength predictions of the ensemble learning models ranges from 47 to 57 kN. The MAE of the ensemble models is less than 38 kN. The mean absolute percentage error (MAPE) of the ensemble models is lower than 14%, and that of the XGBoost model is only about 10%. Overall, the ensemble models, and the XGBoost model in particular, perform better than the conventional machine learning models.

Table 3 Comparison of \({R}^{2}\), MAE, RMSE and MAPE values in ensemble learning models
Fig. 6
figure 6

Comparison between ML algorithms with training and testing data

The dataset is split into two parts, i.e. an 80% training set and a 20% testing set. The performance of the four ensemble learning models and the voting regressor is shown in Fig. 6, where the models are evaluated on the testing dataset against the experimental data. The diagonal line \((y = x)\) indicates where prediction and experiment are identical. As can be observed from the scatter plots' proximity to the diagonal, all four ensemble models generally achieve good results. The voting regressor shows more generalised results than the other models.

5.2 Overview of mechanics-driven models

In contrast to normal beams, the structural analysis of deep beams is more complex: the assumption that a plane section remains plane before and after bending is invalid, because the strain is not distributed linearly. The applied pressure has a greater impact on the stress than on the strain. Shear deformation, which can be ignored in normal beams, cannot be ignored in deep beams, where shear is a major factor in failure. With larger depths, the conventional procedure no longer yields a linear stress distribution in the elastic stage, and the ultimate stress does not take the parabolic shape, which is another important factor in the shear failure of deep beams. The European guideline states that a beam is considered deep if its effective span to overall depth ratio is less than 3.0 (Bethlehem 2004).

In accordance with ACI-318 clause 10.7.1, deep beams are members loaded on one face and supported on the opposite face so that compression struts can form between the loads and the supports, with a clear span equal to or less than four times the overall member depth, or regions of beams where concentrated loads act within twice the member depth from the face of the support (Committee 2008).

Five expressions for determining the shear strength of RC deep beams are taken from the design codes of China, Britain (CIRIA), the United States, Canada, and Europe. The Chinese and British (CIRIA) expressions are semi-empirical, semi-analytical equations, while the other three are based on the strut-and-tie model. The detailed expressions are listed below, followed by a short computational sketch of the Chinese-code expression:

  • British (CIRIA Guide) (Arup and Partners 1977)

    $$\begin{aligned} V_{u, \textrm{CIRIA}} = C_{1}\left( 1-0.35 \frac{a}{h_{o}}\right) f_{t} b h_{o} + C_{2} \sum _{1}^{n} A_{1} \frac{y_{1}}{h_{o}} \sin ^{2} \alpha \end{aligned}$$

    where \(C_{1}\) and \(C_{2}\) are constants depending on the grade of concrete and steel; \(f_{t} = 0.5\sqrt{f_{\textrm{c}}}\); \(A_{1}\) is the area of reinforcement; \(y_{1}\) is the depth from the top of the beam to the point where the bar intersects the critical diagonal crack line; and \(\alpha \) is the angle between the bar considered and the critical diagonal crack.

  • US code: ACI 318 (Committee 2008)

    $$\begin{aligned} V_{u, \textrm{ACI}}=0.85 \beta _{\textrm{s}} f_{\textrm{c}}^{\prime } b w_{\textrm{s}} \sin \theta \end{aligned}$$

    with

    $$\begin{aligned} w_{\textrm{s}}= & {} \left[ 1.85 w_{t} \cos \theta +\left( l_{p E}+l_{p \textrm{P}}\right) \sin \theta \right] / 2 \\ \theta= & {} \arctan \frac{d_{\textrm{b}}}{a} \ge 25^{\circ } \end{aligned}$$

    where \(\beta _{\textrm{s}}\) is the strut coefficient; \(\theta \) is the angle between the strut and the longitudinal axis; \(w_{\textrm{s}}\) is the width of the strut; \(w_{t}\) is the height of the nodal region; \(l_{p E}\) and \(l_{p \textrm{P}}\) are the widths of the top loading and bottom supporting plates, respectively; and \(d_{\textrm{b}}\) is the distance between the top and bottom nodal regions.

  • Chinese code: GB50010-2010 (Standard 2002)

    $$\begin{aligned} V_{u, G B}= & {} \frac{1.75}{\lambda +1} f_{t} b h_{0}+\frac{l_{0} / h-2}{3} f_{y \textrm{v}} \frac{A_{s v}}{s_{\textrm{h}}} h_{0}\\{} & {} +\frac{5-l_{0} / h}{6} f_{y \textrm{h}} \frac{A_{s h}}{s_{\textrm{v}}} h_{0} \end{aligned}$$

    where \(f_{t}\) is the concrete tensile strength and \(\lambda =a / h_{0}\) is the shear span-to-depth ratio. Other variables are as defined in Table 1.

  • Canadian code: CSA A23.3-04 (Darwin et al. 2016)

    $$\begin{aligned} V_{u, \textrm{CSA}}=\frac{f_{\textrm{c}}^{\prime }}{0.8+170 \epsilon _{1}} b w_{\textrm{s}} \sin \theta \end{aligned}$$

    with

    $$\begin{aligned} w_{\textrm{s}} = \left[ 1.88 w_{t} \cos \theta +\left( l_{p E}+l_{p \textrm{P}}\right) \sin \theta \right] / 2, \quad \epsilon _{1}=\epsilon _{\textrm{s}} +\left( \epsilon _{\textrm{s}}+0.002\right) \cot ^{2} \theta \end{aligned}$$

    where \(\epsilon _{\textrm{s}}=0.75 \lambda f_{\textrm{c}}^{\prime } w_{t} b / E_{\textrm{s}} A_{\textrm{s}}\) is the tensile strain of the tie.

  • European code: EN 1-1-1992:2004 (Bethlehem 2004)

    $$\begin{aligned} V_{u, \textrm{EU}}=0.85 \beta _{\textrm{s}} f_{\textrm{c}}^{\prime } b w_{\textrm{s}} \sin \theta . \end{aligned}$$
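
As the promised sketch, the semi-empirical GB50010-2010 expression above transcribes directly into a function. Symbol names follow Table 1, and consistent units (N and mm, giving \(V_{u}\) in N) are assumed.

```python
def shear_strength_gb(f_t, b, h0, lam, l0_over_h, f_yv, A_sv, s_h, f_yh, A_sh, s_v):
    """Chinese-code (GB50010-2010) deep-beam shear strength V_u,GB."""
    concrete = 1.75 / (lam + 1.0) * f_t * b * h0                    # concrete term
    vertical = (l0_over_h - 2.0) / 3.0 * f_yv * A_sv / s_h * h0     # vertical web steel
    horizontal = (5.0 - l0_over_h) / 6.0 * f_yh * A_sh / s_v * h0   # horizontal web steel
    return concrete + vertical + horizontal
```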

5.3 Comparison between ML algorithms and mechanics-driven models

In this section, various statistical metrics are compared between the ML algorithms and the mechanics-driven models.

Table 4 compares the mean, maximum, minimum, standard deviation, and covariance of the ratio of predicted to experimentally tested shear strength for the five codal provisions and the best-performing ensemble learning model, i.e. the XGBoost algorithm. Here, a low standard deviation of this ratio indicates that the predictions are consistent and reliable, and that conclusions drawn from them are likely to be accurate. A high standard deviation, in contrast, suggests that the predictions are more variable and less dependable, and that conclusions based on them should be interpreted with caution.

Table 4 Performance comparison between mechanics-driven models and best performing ensemble model

In general, a low standard deviation is desirable in many applications because it indicates that the data are well behaved and can be easily analysed and understood. The evolution of XGBoost is depicted in Fig. 1; it is one of the most capable boosting ensemble learning models because it combines a linear model solver with tree learning algorithms. As shown in Table 4, the mean of the predicted-to-tested ratio for XGBoost is close to 1, and the standard deviation is very low.

Fig. 7
figure 7

Predicted to test shear strength ratio for different RC beams by mechanics-driven models

A comparison of the predicted-to-tested shear strength ratio plotted against various a/d ratios for the mechanics-driven models (Fig. 7) versus the ensemble learning models (Fig. 8) clearly shows better shear strength predictions for all types of RC deep beams with the XGBoost algorithm. In this study, the authors also implemented a voting regressor over the top-performing boosting algorithms to obtain a more generalised view of the ML models, as shown in Fig. 8. Because its output is a simple combination of its constituent models' predictions, the voting regressor is also more transparent than a single black-box model.

Fig. 8
figure 8

Predicted to test shear strength ratio for different RC beams by XGBoost model and Voting Regressor model

5.4 SHapley additive exPlanations for XGBoost

The key idea behind SHapley Additive exPlanations (SHAP) is to evaluate the contribution of each feature to a prediction by considering all possible combinations of features and how they affect the XGBoost model's output. It calculates the average marginal contribution of each feature across all possible feature permutations. This process provides a more robust and balanced measure of feature importance than methods that suffer from issues such as feature interdependence or lack of consistency (Lundberg and Lee 2017). SHAP thereby addresses a major drawback of ML models, namely their black-box nature.
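
In code, this analysis reduces to a few calls to the `shap` package; the sketch below assumes `model` is the fitted XGBoost regressor and `X_train` the feature matrix from the earlier training step.

```python
import shap

explainer = shap.TreeExplainer(model)         # fast, exact explainer for tree ensembles
shap_values = explainer.shap_values(X_train)  # one contribution per feature per sample
shap.summary_plot(shap_values, X_train)       # summary view of the kind shown in Fig. 9
```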

The authors have interpreted the SHAP values for all the features (\(l_{0}\), h, \(h_{0}\), b, a, \(\rho _{l}\), \(f_{y l}\), \(\rho _{\textrm{h}}\), \(s_{\textrm{h}}\), \(f_{y \textrm{h}}\), \(\rho _{\textrm{v}}\), \(s_{\textrm{v}}\), \(f_{y \textrm{v}}\), \(f_{\textrm{c}}^{\prime }\)), as shown in Fig. 9. Concrete compressive strength \(f_{\textrm{c}}^{\prime }\) affects the model the most, and the vertical web reinforcement strength \(f_{y \textrm{v}}\) affects it the least. Concrete compressive strength \(f_{\textrm{c}}^{\prime }\), shear span a, width b, and height h dominate the model's predictions.

Fig. 9
figure 9

SHAP feature importance

In summary, SHAP values provide an interpretable way to understand how each feature affects the model’s output. They can help identify which features are driving the model predictions and the direction of their impact. Understanding these feature contributions can be valuable in gaining insights into the XGBoost model’s behavior and making data-driven decisions.

5.5 Feature importance analysis

Feature importance is important in machine learning models because it helps identify which features are most important for making predictions. This is useful for a number of reasons. First, understanding the relative importance of each feature can help build simpler, more interpretable models. By only using the most important features, it is possible to build a model that is easier to understand and explain to others. This can be especially useful in domains where interpretability is important, such as in healthcare or finance. Second, feature importance can help identify features that are redundant or irrelevant. These features can be removed from the model, which can improve its performance by reducing overfitting and increasing generalization. Third, understanding feature importance can help guide feature engineering efforts. By focusing on the most important features, it is possible to create new features that are more predictive and improve the performance of the model (König et al. 2021).

Overall, feature importance analysis is a valuable tool for understanding and improving machine learning models: it helps identify the most influential features, remove redundant or irrelevant ones, and build simpler, more interpretable models.

Concrete compressive strength \(f_{\textrm{c}}^{\prime }\), standardised to a relative importance of 100%, was found to be the most influential parameter for predicting the shear strength of RC deep beams, as shown in Fig. 10. Shear span (a) and vertical web reinforcement spacing, with importance values between one-fourth and one-third of that of the concrete strength, are the second and third most influential parameters, respectively. This makes sense given that these parameters directly affect the shear mechanism of deep beams. Other parameters, with around 18% of the importance of the concrete strength, include section width, shear span-to-depth ratio, and horizontal web reinforcement spacing. Web and longitudinal reinforcement ratios are less important, with importance values of only about 10% of that of the concrete strength. The remaining features were of minor significance, with combined influences below 10% of the most significant ones. A minimal sketch of extracting such normalised importances is given after Fig. 10.

Fig. 10 Feature importance analysis result
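A minimal sketch of producing such a normalised ranking from a fitted XGBoost model follows; the feature labels and data are illustrative assumptions, and recent XGBoost versions report gain-based importances by default.

```python
import numpy as np
from xgboost import XGBRegressor

# Placeholder data with four assumed features: concrete strength, shear span,
# vertical web reinforcement spacing, and section width.
rng = np.random.default_rng(0)
X, y = rng.random((200, 4)), rng.random(200) * 500.0
names = ["f'c", "a", "s_v", "b"]

model = XGBRegressor(n_estimators=600, learning_rate=0.1, max_depth=10).fit(X, y)

importance = model.feature_importances_           # gain-based importances
relative = 100.0 * importance / importance.max()  # scale top feature to 100%

for name, value in sorted(zip(names, relative), key=lambda t: -t[1]):
    print(f"{name}: {value:.1f}%")
```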

5.6 Conclusion

This paper presents an ML technique-based approach with SHAP to predict the shear strength of RC deep beams. A total of 271 test samples of RC deep beams, collected from the literature, were divided into four groups: beams without web reinforcement (WOR), beams with horizontal web reinforcement (WHR), beams with vertical web reinforcement (WVR), and beams with both horizontal and vertical web reinforcement (WHVR); these data were used to train and test the models. The models were trained on 16 parameters using three machine learning and four ensemble learning algorithms, which were compared on the coefficient of determination, mean absolute error, root mean squared error, and mean absolute percentage error; the XGBoost algorithm performed best. The XGBoost algorithm was then compared with the mechanics-driven models of the British (CIRIA), United States, European, Chinese, and Canadian codes. From the results, the following conclusions can be drawn:

  • The ML models provide a superior approach to predicting the shear strength of RC deep beams. The approach is robust, can be replicated easily, and is easier to understand than the numerical and theoretical derivations of mechanics-driven modelling. The only fundamental requirement is the dataset, which can be collected easily and used for long-term structural health monitoring systems.

  • The XGBoost algorithm performed the best among ANN, Decision Tree, Support Vector Machine, Random Forest, Gradient Boosting, and Adaptive Boosting, with a coefficient of determination of 0.92 (testing) and 0.99 (training), mean absolute error of 29.65 (testing) and 2.47 (training), root mean squared error of 47.76 (testing) and 1.45 (training), and mean absolute percentage error of 9.79 (testing) and 0.78 (training), which are far superior to the mechanics-driven models.

  • The hyperparameters for all the models were selected based on their k-fold cross-validation performance. The XGBoost model performed optimally after multiple iterations over the learning rate, number of trees, and maximum depth, with the most suitable parameters being 600 trees, a learning rate of 0.1, and a maximum depth of 10 (a minimal tuning sketch is given after this list).

  • The standard deviation, mean, and coefficient of variation (%) of the predicted-to-test ratio for the XGBoost model were 0.06, 1.00, and 6.38, respectively, compared with the mechanics-driven models: British (CIRIA Guide): 0.47, 1.23, 38.38; United States code: 0.69, 1.57, 44.25; Chinese code: 0.39, 1.43, 27.01; Canadian code: 0.56, 1.56, 35.71; and European code: 0.54, 1.42, 38.05. This validates the superiority of the ensemble learning approach, particularly the XGBoost model, over traditional mechanics-driven models, highlighting its potential for accurate shear strength prediction.

  • SHapley Additive exPlanations are applied to the XGBoost results to interpret the inner workings of the model, removing the black-box nature of these ML algorithms, and feature importance analysis is used to identify the parameters that affect the shear strength of RC deep beams the most.

  • From the SHapley Additive exPlanations and feature importance analysis, the study concludes that the compressive strength of the concrete and the geometry of the beam are the most influential parameters in predicting the shear strength of RC deep beams, while the properties of the steel reinforcement affect it the least.
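The bullet on hyperparameter selection can be illustrated with a k-fold grid search; the grid and data below are assumptions, though the reported optimum (600 trees, learning rate 0.1, maximum depth 10) lies inside the search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBRegressor

# Placeholder data standing in for the beam-parameter matrix and shear strengths.
rng = np.random.default_rng(0)
X, y = rng.random((200, 14)), rng.random(200) * 500.0

grid = {
    "n_estimators": [200, 400, 600],
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [6, 8, 10],
}
search = GridSearchCV(
    XGBRegressor(),
    grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)              # evaluates each grid point with 5-fold CV
print(search.best_params_)    # the configuration with the best CV score
```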

5.7 Discussion

This study finds that ensemble learning models, specifically the XGBoost model, are the best choice for predicting the shear strength of RC deep beams: the predicted-to-experimentally-tested shear strength ratios have the mean closest to 1 and the lowest standard deviation compared with the codal methods. The XGBoost model’s predictions of the shear strength ratio for different RC beams indicate that the WHR predictions are closest to the mean, followed by WVR, WHVR, and WOR, in that order. In general, the use of ensemble learning for shear strength prediction may lead to a reliance on black-box algorithms that are difficult to interpret and understand. This could make it hard for engineers to comprehend the rationale behind ensemble predictions and evaluate their reliability, and the resulting lack of trust might impede widespread adoption within the construction industry. To address this, the authors have utilized SHapley Additive exPlanations (SHAP) to interpret the internal mechanisms of the model and identify correlations among the parameters that influence its predictions, which effectively addresses the challenges associated with black-box algorithms.

Predicting the shear strength of reinforced concrete (RC) deep beams using ensemble learning can have several implications and potential problems. One potential implication is that the use of ensemble learning for shear strength prediction could improve the accuracy of structural design in the construction industry. By combining the predictions of multiple models, ensemble learning can provide more reliable estimates of shear strength, which can help engineers design safer and more efficient structures. This could ultimately lead to a reduction in structural failures and improve the safety of buildings and other infrastructure (Fathipour-Azar 2022).

However, there are also potential problems associated with the use of ensemble learning for shear strength prediction. One potential problem is that the accuracy of ensemble learning models depends on the quality and diversity of the individual models that are combined. If the models used in the ensemble are not sufficiently diverse or are based on limited or biased data, the predictions of the ensemble may not be accurate. This could lead to incorrect design decisions and potentially unsafe structures (Alobaidi et al. 2018; Seni and Elder 2010).

Drawing direct comparisons between different studies in the literature on the prediction of shear strength in RC deep beams can be challenging for several reasons. One major obstacle is the variation in the datasets used across studies. Each study may use different experimental data or numerical simulations, resulting in disparities in dataset size, composition, and quality that can significantly affect the performance and reliability of the predictive models. Moreover, the studies involve a wide range of parameters affecting shear strength prediction, such as the concrete mix design, steel reinforcement, beam geometry, loading conditions, and boundary conditions; differences in these parameters among studies can lead to divergent outcomes and hinder the establishment of a consistent comparison framework (Chou et al. 2020b; Fu and Feng 2021; Olalusi and Awoyera 2021; Ly et al. 2020; Sandeep et al. 2023; Oh and Shin 2001; Kwak et al. 2002; Rahman et al. 2021b; Zhang et al. 2020a; Wakjira et al. 2022a; Abuodeh et al. 2020a; Mangalathu and Jeon 2018; Chou et al. 2020a; Prayogo et al. 2020). Furthermore, researchers adopt various methodologies, including analytical approaches, experimental investigations, empirical equations, and machine learning techniques, each with its own assumptions, limitations, and uncertainties. Given these variations in datasets, parameters, and methodologies, straightforward comparisons across the literature are impractical.

Overall, the use of ensemble learning for predicting the shear strength of RC deep beams has the potential to improve the accuracy and efficiency of structural design. However, it is important to carefully consider the potential problems and challenges associated with this approach and to address them in order to ensure that it is used safely and effectively in the future.