1 Introduction

The characteristic compressive strength of concrete (\(f_{ck}\)) is one of the most critical variables in the design of concrete structures, and knowledge of this attribute is essential for the quality and safety control of structural systems (Rauecker et al. 2019). As a rule, \(f_{ck}\) is defined as the compressive strength exceeded with 95% confidence (the 5% fractile) at an age of 28 days. This measurement plays a crucial role in various stages of building design, including (a) determining when to remove formwork supports during the construction planning of a structure and (b) verifying the strength assumed in the structural design.

This measurement is determined experimentally through compression testing of cylindrical specimens, as described in NBR 5739 (2018). It is influenced by several factors (Erdal 2013; Abbass et al. 2019), such as the water–cement ratio, type of cement, specimen geometry, and loading rate. Because of the large number of parameters involved, concrete quality control is a complex task, especially for variations of conventional concrete in which additional parameters come into play: in fiber-reinforced concrete (FRC), for example, the fiber volume, and in high-performance concrete (HPC), the types of additives become significant factors. In this context, Machine Learning techniques have shown great potential for prediction in various applications. The fundamental idea behind machine learning-based concrete mix design is to optimize the mix proportions and reduce the time spent in the semi-empirical design process by exploiting the relationships between design factors and concrete compressive strength. Several studies have explored Machine Learning techniques to predict the engineering properties of various construction materials, including recycled aggregate concrete (RCA) (Zhang et al. 2020a; Behnood and Golafshani 2020), normal strength concrete (Feng et al. 2020; Chou et al. 2014), and ultra-high-performance concrete (UHPC) (Al-Shamiri et al. 2020; Fan et al. 2020; Alabduljabbar et al. 2023).

Computational learning capability (Mirjalili et al. 2020) has been highlighted in various application areas (Madabhushi and Lee 2016; Komura and Ishikawa 2018; Milhomem and Dantas 2020; Zhang et al. 2020b; Isinkaye et al. 2015). This article aims to contribute to popularizing these techniques for predicting the mechanical properties of concrete and mortars, as such techniques are still under active development. Additionally, the dataset of concrete mixtures and their compressive strengths made available here, the result of a data curation effort, can support and improve other authors' work, since the data were consolidated and cleaned.

We can summarize the main contributions of this paper as follows:

  (a) The creation and consolidation of an extensive dataset comprising data from various works in the literature that utilize concrete mixture design procedures and their respective compressive strength (\(f_{c}\));

  (b) The demonstration and analysis of results from experiments applying different regression models to a range of concrete mixtures, from low- to high-strength, and their respective compressive strengths.

2 Theoretical reference

In addition to traditional data sources, the advent of different technologies for data extraction, such as sensors and Web scraping, has increased the amount of data available for processing. This scenario is welcome to machine learning scientists and professionals who intend to create and improve their models. From another perspective, this unprecedented amount of data can also bring problems well known to the database community, such as noise and data inconsistency.

In this sense, it is essential that these data be pre-processed and analyzed before being submitted to the subsequent phases of an ML pipeline. However, the most critical aspect before constructing a pipeline is identifying the problem and the most adequate task. Different tasks can be applied to a class of problems, such as clustering, classification, and regression. Our approach is based on the regression task, whose main objective is to make predictions about the future state of data, assuming that unknown outcomes follow patterns identified in previous observations.

In Fig. 1, we present an overview of the machine learning pipeline, in which we first define the task, then validate the scenario, and, finally, decide which model to employ. Between steps 1 and 2, we also had to integrate and process all the data collected from the literature. All data were submitted to pre-processing and then to feature engineering and exploratory data analysis.

Fig. 1 Machine learning (ML) workflow and main ML models for predictive tasks (Hild-Aono et al. 2022)

The regression task is a classical supervised learning problem, involving the modeling of a predictive function that generates a response, also known as the target or dependent variable, from a combination of one or more independent variables. In addition to predictions, it also allows analyzing the behavior of the data, i.e., the relation between the response and the independent variables (Al-Shamiri et al. 2020; Igual and Seguí 2017).

These models fit the data points to the curve generated by the predictive function and minimize the distance between the data points and that curve, typically by least-squares methods. In this way, prediction problems in Machine Learning can be described generically by Eq. (1), where \(y\) is the vector containing the observed measures or responses and \(\hat{y}\) is the vector containing the numerical measures obtained by the prediction model

$${\text{minimize}} \left| {\hat{y} - y} \right|.$$
(1)

The following sections present the formulations of the Machine Learning methods used in this article’s context.

2.1 Machine learning

The concept of Machine Learning has stood out in recent years due to significant advances obtained in different areas of computing, such as databases, artificial intelligence, and distributed computing. Advances in other areas that, directly or indirectly, were also essential for this evolution contributed to the opportunity to generate, make available, and access ever-increasing volumes of data.

Learning is a broad concept; in this work, it concerns the ability of machines to learn automatically, performing or simulating such a skill. Since computers perform all their procedures through calculations, it is natural to model the learning process mathematically. The main idea behind this modeling is the pattern recognition process, since incorporating prior knowledge is the main influencing factor in learning.

In this case, tools from areas such as statistics, optimization, and information theory are essential to “train” the algorithms based on the patterns observed in previously extracted data. A central theme of machine learning theory is developing solutions to express knowledge of a particular domain based on a learning process (Shalev-Shwartz and Ben-David 2014).

In the scientific literature, one can find several proposals for categorizing types of machine learning. In the vast majority, a division into three main categories prevails (Russell and Norvig 2016):

  (i) Supervised Learning: the agent is presented with labeled input–output pairs, enabling it to learn a mapping function from input to output. This method is crucial in tasks like classification and regression, where the algorithm is trained to make accurate predictions based on the provided examples.

  (ii) Unsupervised Learning: unlike supervised learning, the agent does not receive explicit input–output pairs. Instead, it aims to identify patterns, structures, or intrinsic relationships within the input data. Clustering techniques, for instance, enable the algorithm to automatically detect clusters or segments within the data without explicit guidance.

  (iii) Reinforcement Learning: the agent learns through rewards and punishments while interacting with an environment. It makes sequential decisions and learns to maximize cumulative rewards over time. This method is often employed in gaming, robotics, and process optimization, allowing the agent to learn optimal behaviors through direct interaction and experimentation within its environment.

The following sections present the formulations of the supervised Machine Learning methods used in this article’s context. Supervised methods were chosen because of the characteristics of the problem at hand: the measurement techniques that originated the datasets produce continuous numerical data, and the task performed is the prediction of values.

2.2 Regressions

Regression is a technique to investigate the relationship between the input space (independent variables) and the output space (dependent variable). According to Kang et al. (2021), regressions are one of the most common techniques for performing prediction tasks. In these models, numerical values, known as regression coefficients (\(w\)), are used as parameters in the predictive functions to describe the relationship between a predictor variable and its corresponding response. Thus, this method seeks to produce a linear or non-linear predictive function (\(\hat{y}\)) that minimizes the loss function given by Eq. (2) (Tai 2021). In this process, \(y\) represents the vector of observations and \(w\) the vector of weights that minimize the loss function. The loss function is alternatively known as Mean Squared Error (MSE) for this scenario (Igual and Seguí 2017; Russell and Norvig 2016)

$${\text{minimize}}\,\text{Loss}\left( {x,w} \right) = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (\hat{y}_{i} (x, w) - y_{i} )^{2}$$
(2)
$$\hat{y}_{i} \left( {x, w} \right) = w_{0} + \mathop \sum \limits_{j} w_{j} x_{ij} .$$
(3)

Despite their predictive capacity, the generated models are likely to be affected by overfitting, a situation characterized by the difficulty of generalizing the model, causing it to fit the training data too closely. The possibility of overfitting can be reduced by adding regularization terms to the loss function. The most commonly employed techniques are Ridge (L2) and Lasso (L1) regularization (Géron 2019), both of which were employed in this article.
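
As an illustration, the following minimal sketch shows how L2 (Ridge) and L1 (Lasso) regularized regressions can be fitted with the Scikit-learn library used in this work; the synthetic data and the alpha values are illustrative assumptions, not the settings adopted later in the paper.

```python
# Minimal sketch of L2 (Ridge) and L1 (Lasso) regularized regressions;
# the synthetic data and alpha values are illustrative only.
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2 penalty shrinks the weights
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1 penalty can drive weights to zero

print("Ridge R2:", ridge.score(X_test, y_test))
print("Lasso R2:", lasso.score(X_test, y_test))
```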

2.3 Regressive decision tree

Regressive decision tree models, or regression trees, were first introduced by Breiman (1998). In general terms, the decision tree procedure divides the data space hierarchically. According to Géron (2019), Decision Trees comprise nodes representing the attributes and branches originating from these nodes, each descending branch corresponding to a possible value of the attribute. Leaf nodes represent the different outcomes of the training set; that is, each leaf is associated with a class (or, in regression, a value). Each path in the tree (from root to leaf) corresponds to a rule and can be represented as a set of if–then statements written by following the path from the root node to a leaf.

In the same way that the regression process minimizes the residuals (Scikit-Learn 2024), the Regression tree will seek to reduce the impurities in each subset formed. For a Regression tree problem, the cost function is given by Eq. (4)

$${\text{Loss}}\left( {Q_{m} ,\theta } \right) = \frac{{n_{m}^{{{\text{left}}}} }}{{n_{m} }}.H\left( {Q_{m}^{{{\text{left}}}} \left( \theta \right)} \right) + \frac{{n_{m}^{{{\text{right}}}} }}{{n_{m} }}.H\left( {Q_{m}^{{{\text{right}}}} \left( \theta \right)} \right)\quad \theta = (j,t_{m} )$$
(4)
$$H\left( {Q_{m} \left( \theta \right)} \right) = \frac{1}{{n_{m} }}\mathop \sum \limits_{{y \in Q_{m} }}^{{}} (y - \overline{{y_{m} }} )^{2}$$
(5)
$$\overline{{y_{m} }} = \frac{1}{{n_{m} }}\mathop \sum \limits_{{y \in Q_{m} }}^{{}} y,$$
(6)

where \(m\) represents a node, \(j\) is a feature, \(t_{m}\) represents the threshold, \(H()\) measures the impurity of the subsets, \(n_{m}\) is the number of instances at the node (\(n_{m}^{{{\text{left}}}}\) is the number of instances in the left subset, and \(n_{m}^{{{\text{right}}}}\) is the number of instances in the right subset), \(Q_{m}\) is a subset (left subset \(Q_{m}^{{{\text{left}}}}\) and right subset \(Q_{m}^{{{\text{right}}}}\)), \(y\) represents an observed value at node \(m\), and \(\overline{{y_{m} }}\) represents the average value in each region. The partitions are given by Eqs. (7) and (8)

$$Q_{m}^{{{\text{left}}}} \left( \theta \right) = \left\{ {\left( {x,y} \right)|x_{j} \le t_{m} } \right\}$$
(7)
$$Q_{m}^{{{\text{right}}}} \left( \theta \right) = \left\{ {\left( {x,y} \right)|x_{j} > t_{m} } \right\}.$$
(8)

In regression trees, the main difference is that, instead of predicting a class at each node, the tree predicts a value. An example regression tree is presented in Fig. 2. To predict a new instance with \(x_{1} = 0.6\), for example, the tree is traversed starting at the root until the leaf node that predicts value = 0.1106 is reached. This prediction is simply the average target value of the 110 training instances associated with that leaf node and results in a Mean Squared Error (MSE) of 0.0151 over those 110 instances (Géron 2019).

Fig. 2 Example of a regression tree (Géron 2019)

Therefore, these trees can analyze the data and find patterns that can be organized into different series of prediction rules (Kang et al. 2021). Such a model is usually used as an alternative when linear models fail to reach an acceptable level of accuracy (Güçlüer et al. 2021).
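
As a minimal illustration of the mechanism described above, the sketch below fits a shallow regression tree to synthetic data (not the dataset of Fig. 2) and predicts a new instance; the tree depth and the data are assumptions for demonstration only.

```python
# Minimal sketch of a regression tree on synthetic data; max_depth is illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((200, 1))                              # single feature x1 in [0, 1]
y = (X[:, 0] - 0.5) ** 2 + rng.normal(0, 0.02, 200)   # noisy quadratic target

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Traversing the tree for a new instance (e.g., x1 = 0.6) returns the average
# target value of the training instances that fall in the corresponding leaf.
print(tree.predict([[0.6]]))
```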

2.4 Ensembles

Ensemble-type learning methods train combinations of models, which can be decision trees, neural networks, or others traditionally used in supervised learning. Ensemble methods have gained popularity, because many researchers have demonstrated superior prediction performance over single models in various problems (Oza 2000).

In this article, the Gradient Boosting technique is used. Its general idea is to sequentially train the predictive model (\(F\)), with each iteration correcting its predecessor (\(F_{m - 1}\)) (Géron 2019; Natekin and Knoll 2013). The model correction is given by Eq. (9), where \(h_{m}\) represents the result of training a new tree; the contribution \(h_{m}\) is obtained by minimizing the loss function

$$F_{m} (x) = F_{m - 1} (x) + h_{m} (x).$$
(9)
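
A minimal sketch of this boosting scheme using Scikit-learn's GradientBoostingRegressor is given below; the synthetic data and hyperparameters are illustrative assumptions and do not correspond to the GB10 configuration adopted later in this paper.

```python
# Minimal sketch of gradient boosting for regression: each stage fits a new tree
# h_m(x) to the residual errors of the current ensemble F_{m-1}(x).
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=7, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                               random_state=0)
gb.fit(X_train, y_train)
print("Test R2:", gb.score(X_test, y_test))
```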

3 Machine learning workflow

This section demonstrates the necessary procedures for constructing predictive models of the material's mechanical strength. The Machine Learning workflow in this paper is similar to several other AI papers, such as Yassen et al. (2018), Pakzad et al. (2023), and Alabduljabbar et al. (2023).

3.1 Task definition and features analysis

Following the procedure shown in Fig. 1, the first part of this work begins with constructing a database related to concrete mixtures and defining the task related to regression. Therefore, the dataset studied results from the collection and integration of data from different sources (Beck 2009—24 samples, Bilim et al. 2009—225 samples, Bouzoubaâ and Fournier 2003—68 samples, Chopra et al. 2016—228 samples, Demirboğa et al. 2004—29 samples, Duran Atiş 2005—69 samples, Durán-Herrera et al. 2011—114 samples, Jiang and Malhotra 2000—54 samples, Lee et al. 2006—53 samples, Oner and Akyuz 2007—224 samples, Pala et al. 2007—90 samples, Pitroda 2014—10 samples, Sonebi 2004—62 samples, Yeh 1998—1030 samples, and Yen et al. 2007—80 samples), forming a new base with 2358 records related to very-low- to high-strength concrete (1.76–113.2 MPa). Before generating the statistics of the data set, the Database (BD) was cleaned to form a single reference of samples with seven input attributes referring to the concrete mix proportioning and an output attribute referring to the cylinder compressive strength (\(f_{c}\)).

The cleaning consisted of two changes to the database. The first was the creation of the water/cement column, since this ratio is critical in concrete technology and directly influences the strength variable (Singh et al. 2015). The second was merging the filler additions (Fly Ash and Blast Furnace slag) into a single variable; additions are an important factor, especially in high-strength concrete (Abbass et al. 2019), but the sources used do not give details about these variables, so this combination was chosen. In addition, possible duplicates were eliminated; the database also presented missing values before the cleanup.
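
A hypothetical sketch of these cleaning steps with Pandas is shown below; the file name and column names ("water", "cement", "fly_ash", "blast_furnace") are assumptions, not the actual headers of the consolidated database.

```python
# Hypothetical sketch of the cleaning steps described above; file and column
# names are assumptions, not the actual headers of the consolidated database.
import pandas as pd

df = pd.read_csv("concrete_mixtures.csv")        # consolidated raw database (assumed file)

df["w/c"] = df["water"] / df["cement"]           # create the water-cement ratio column
df["add"] = df["fly_ash"] + df["blast_furnace"]  # merge filler additions into one variable
df = df.drop(columns=["water", "fly_ash", "blast_furnace"])

df = df.drop_duplicates().dropna()               # remove duplicates and rows with missing values
```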

Table 1 presents a brief statistical description of the dataset, and Fig. 3 shows the histograms of the attributes that form the database.

Table 1 Statistical characteristics of the analyzed dataset attributes
Fig. 3 Distribution of the database variables: (a) attribute \(c\) histogram, (b) attribute \(sp\) histogram, (c) attribute \(cg\) histogram, (d) attribute \(fg\) histogram, (e) attribute \(t\) histogram, (f) attribute \(w/c\) histogram, (g) attribute add histogram, and (h) attribute \(f_{c}\) histogram

Mixture design is essential for all types of concrete and involves defining the proportions of materials that constitute the concrete composite. The objective of mixture design is to determine the quantities of materials needed to achieve specific mechanical properties, such as \(f_{c}\).

Mixture design plays a critical role in projects ranging from small buildings to large-scale structures. Tutikian and Helene (Tutikian and Helene 2011) explain that mixture design aims to find the ideal and most cost-effective mix proportions using available materials while meeting design requirements. Various mixture design procedures exist, including semi-empirical methods like those provided by the American Concrete Institute (ACI) and Brazilian Portland Cement Association (BPCA), as well as more complex procedures like the packaging method used for Ultra-High-Performance Concrete (UHPC).

The proportions of cement, mineral admixtures, and water content directly influence chemical reactions within the paste, such as the formation of Calcium Silicates (C–S–H). Therefore, successful mixture design is crucial for controlling the durability of concrete in its hardened state (Ribeiro et al. 2021).

To verify the correlation between the database attributes, the problem’s correlation matrix was computed (Fig. 4). The highest correlation is between compressive strength (\(f_{c}\)) and age (\(t\)), with a coefficient of 0.46. There is also a negative correlation of −0.40 between the attribute \(w/c\) and \(f_{c}\), which aligns with experimental findings showing that mechanical strength decreases as the water content in the mixture increases. Another experimentally proven factor is the strong positive correlation (0.46) between the amount of cement and the compressive strength.

Fig. 4 Database correlation matrix
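
For reference, a correlation matrix such as the one in Fig. 4 can be produced with the Pandas and Seaborn libraries used in this work; the sketch below assumes a cleaned DataFrame df containing only the numeric attributes.

```python
# Sketch of how a correlation matrix like Fig. 4 can be plotted; "df" is the
# cleaned DataFrame from the previous step (assumed to hold only numeric columns).
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()                                 # Pearson correlation between attributes
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
```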

The sensitivity of the database attributes with respect to the response variable \(f_{c}\) was also assessed (Table 2), using the standardized regression coefficient (SRC) and standardized rank regression coefficient (SRRC) methods. The SRC method is widely used to check the sensitivity of linear models, and its ranked version (SRRC) is often used for non-linear models (Homma and Saltelli 1996).

Table 2 Result of the sensitivity analysis of the database attributes concerning the response variable \(f_{c}\) (value ± standard deviation)

Using the SRC method, the three most important variables concerning \(f_{c}\) are \(c\) (cement), add (additions), and \(t\) (age). When the SRRC method is applied, however, the order shifts to \(w/c\), \(t\), and add. The prominence of attribute \(c\) in the SRC method aligns with its approximately linear trend (Fig. 5a), which is consistent with experimental observations.

Figure 5f illustrates why the attribute \(w/c\) (water–cement ratio) ranked first in importance by the SRRC method: it does not present a linear behavior, consistent with experimental observations based on Abrams’ law. Additionally, Fig. 5 shows the regression lines that describe the relationship between each input and the compressive strength \(f_{c}\).

Fig. 5 Scatter plot with a regression line for each database attribute as a function of \(f_{c}\): (a) \(f_{c}\) versus \(c\), (b) \(f_{c}\) versus \(sp\), (c) \(f_{c}\) versus \(cg\), (d) \(f_{c}\) versus \(fg\), (e) \(f_{c}\) versus \(t\), (f) \(f_{c}\) versus \(w/c\), and (g) \(f_{c}\) versus add

The add attribute appeared among the top three in importance in both the SRC and SRRC sensitivity analyses despite its very low correlation with \(f_{c}\) (−0.09, as seen in Fig. 4). This unexpected finding may be explained by the correlation between add and the attributes \(c\) and \(w/c\), indicating an existing interaction among these variables.

Therefore, sensitivity analysis is justified in multiple ways, including assessing response variable behavior concerning input attributes and their interrelationships through plotting and visual inspection.
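
A hedged sketch of how SRC and SRRC indices can be computed is shown below: the SRCs are the coefficients of a linear regression fitted to z-scored data, and the SRRCs repeat the procedure on rank-transformed data. The DataFrame df and the target column name "fc" are assumptions; the standard deviations reported in Table 2 would additionally require resampling, which is omitted here.

```python
# Hedged sketch of SRC/SRRC sensitivity indices; "df" and the target column
# name "fc" are assumptions, not the actual identifiers used by the authors.
import pandas as pd
from sklearn.linear_model import LinearRegression

def std_reg_coefficients(data: pd.DataFrame, target: str) -> pd.Series:
    z = (data - data.mean()) / data.std()        # z-score every column
    X, y = z.drop(columns=[target]), z[target]
    model = LinearRegression().fit(X, y)
    return pd.Series(model.coef_, index=X.columns)

src = std_reg_coefficients(df, "fc")             # standardized regression coefficients
srrc = std_reg_coefficients(df.rank(), "fc")     # standardized rank regression coefficients
print(src.sort_values(key=abs, ascending=False))
```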

To apply the numerical Data Mining modeling, the Python 3 language was used together with the following libraries: (a) Scikit-learn, (b) Pandas, (c) Numpy, (d) Matplotlib, and (e) Seaborn. The studied problem consists of a prediction case using regression.

In this step, corrupted records and duplicates were eliminated from the database, which reduced the original database by 1.21%. The pre-processing phase is one of the most important in machine learning, since it can improve, or even guarantee, model performance (McCabe et al. 2012; Oliveri et al. 2019).

Since the database attributes are expressed in different units and magnitudes, it was necessary to normalize them. This process places the attributes in a standard range of values, reducing model accuracy errors caused by the weight of the units in the analysis (Sola and Sevilla 1997). Following this reasoning, z-score normalization was applied to the database, as described in Eq. (10), where \(\mu\) is the mean of the analyzed attribute, \(\sigma\) is the standard deviation of the analyzed attribute, \(x\) is the vector containing the original values of the analyzed attribute, and \(z\) is the vector containing the normalized attribute values

$$z = \frac{x - \mu }{\sigma }.$$
(10)

This is relevant in this work, since the attributes involve material consumption scales, time, and the dimensionless water–cement ratio.
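
A minimal sketch of Eq. (10) using Scikit-learn's StandardScaler is shown below; the column name "fc" is an assumption.

```python
# Minimal sketch of Eq. (10); the response column name "fc" is an assumption.
from sklearn.preprocessing import StandardScaler

features = df.drop(columns=["fc"]).to_numpy()    # input attributes only
scaler = StandardScaler()                        # applies z = (x - mu) / sigma per attribute
features_z = scaler.fit_transform(features)
```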

3.2 Validation scenario

Twelve machine learning models were selected, chosen due to their effectiveness on learning problems of this nature. Table 3 presents the characteristics of each adopted model; the remaining settings follow the defaults of the Scikit-learn library.

Table 3 Computational models applied

The complete machine learning pipeline followed the guideline of Fig. 6. In this article, the data were separated in the proportions of 80%/20% for training and testing, respectively. This split was defined based on the learning curve in Fig. 7, which shows that the Root Mean Square Error (RMSE) already reaches a satisfactory convergence value when 80% of the data are used for training.

Fig. 6 AI training and testing pipeline

Fig. 7 Learning curves as a function of the training/testing split percentage: (a) tree models and (b) linear regression model

Once the learning models were defined, cross-validation tests were applied before the final training, verifying whether the models performed satisfactorily on the learning task. Figure 8 presents the cross-validation strategy used in this work: k-fold. In this article, a division with ten folds (\(cv = 10\)) was applied, and the cross-validation process was repeated 30 times to verify the consistency of the models with respect to the dataset.

Fig. 8 K-fold cross-validation strategy
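
The sketch below illustrates, under assumed column names, the 80%/20% train/test split and the 10-fold cross-validation repeated 30 times; the estimator shown is illustrative, standing in for any of the twelve models of Table 3.

```python
# Sketch of the validation scenario: 80/20 split and 10-fold cross-validation
# repeated 30 times; "df" and the "fc" column name are assumptions, and the
# estimator stands in for any model of Table 3.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score

X = df.drop(columns=["fc"]).to_numpy()
y = df["fc"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for rep in range(30):                            # 30 repetitions of the k-fold procedure
    cv = KFold(n_splits=10, shuffle=True, random_state=rep)
    model = GradientBoostingRegressor(random_state=0)
    scores.append(cross_val_score(model, X_train, y_train, cv=cv, scoring="r2"))
print("Mean cross-validation R2:", np.mean(scores))
```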

The accuracy measure for both procedures performed in this work was the Coefficient of Determination (\(R^{2}\)) as shown in Eq. (11). In this equation, \(\hat{y}\) is the vector that includes the numerical measurements estimated by the model, \(\overline{y}\) is the average of the observations in the database, and \(y\) is the vector of the observations in the database. In addition, other evaluation metrics, such as Model Error (\(\varepsilon_{\bmod }\)) and Mean Absolute Error (MAE) given by Eqs. (12) and (13) are used

$$R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {\hat{y}_{i} - y_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \overline{y}} \right)^{2} }}$$
(11)
$$\varepsilon_{\bmod } = y/\hat{y}$$
(12)
$${\text{MAE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\hat{y}_{i} - y_{i} } \right|.$$
(13)
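
A minimal sketch of Eqs. (11)–(13) evaluated on the held-out test set is shown below; "model" stands for any of the trained estimators of Table 3.

```python
# Sketch of the evaluation metrics of Eqs. (11)-(13) on the held-out test set;
# "model" stands for any trained estimator from the previous step.
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

y_pred = model.fit(X_train, y_train).predict(X_test)

r2 = r2_score(y_test, y_pred)                    # Eq. (11), coefficient of determination
eps_mod = np.mean(y_test / y_pred)               # Eq. (12), mean model error (bias factor)
mae = mean_absolute_error(y_test, y_pred)        # Eq. (13), mean absolute error
print(f"R2 = {r2:.4f}, model error = {eps_mod:.4f}, MAE = {mae:.2f} MPa")
```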

At the end of the simulation and analysis of the results, a Python algorithm was built that loads the trained model into a Jupyter notebook, allowing users of this platform to download and use the Artificial Intelligence model.
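
A hedged sketch of how a trained estimator can be persisted and re-loaded in a notebook with joblib is shown below; the file name is hypothetical.

```python
# Hedged sketch of persisting and re-loading a trained estimator; the file name
# "gb10_model.joblib" is hypothetical.
import joblib

joblib.dump(model, "gb10_model.joblib")          # export the trained model
loaded_model = joblib.load("gb10_model.joblib")  # re-load it elsewhere (e.g., a notebook)
print(loaded_model.predict(X_test[:5]))
```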

3.3 Model selection

The model selected for deployment on the World Wide Web must have an accuracy greater than 80%, and only one machine learning model will be chosen. In Civil Engineering problems, for a model to be considered accurate, it must present a Coefficient of Determination greater than 70%, and the closer to 100%, the more precise it is (Arroyo et al. 2020, 2023; Montgomery 2013).

4 Results

The cross-validation results obtained with the strategy of Fig. 8 are presented in Table 4. It is possible to notice that the data exhibit non-linearity, since the linear regressions could not consistently represent them: for these models, the coefficient \(R^{2}\) was lower than 65%. The curvilinear versions of the regressions presented an \(R^{2}\) greater than 75%, showing that non-linear terms were necessary to improve the accuracy of the predictive model.

Table 4 \(R^{2}\) of the models in the iterative cross-validation phase

In the test stage, the three models with the highest accuracy were the Ensemble-type GB10 model and the RL3 and RR3 regressions. The model with the best accuracy for the tested examples was Gradient Boosting GB10, with an accuracy of 86.33%. Although other models presented an \(R^{2}\) not much lower than these, overfitting was detected: some models, such as the AR20 tree model, presented an accuracy greater than 99% in training, while in the test, this value dropped to the 75% range. The GB10 model was therefore selected for uploading and made available online (https://wmpjrufg.github.io/Concreta/).

Figure 9 shows the comparison between the forecasted values and the actual values based on trained models. It is possible to observe that the 1st-degree linear regression (RL1) has the most dispersed values when compared to the Gradient Boosting model (GB10), which has a concentration of values around the diagonal line \(\hat{y} = y\) (\(\varepsilon_{\bmod } = 1.00\)), showing the efficiency of the latter model in predicting the data.

Fig. 9 Predicted values versus observed values: (a) 1st-degree linear regression (RL1) and (b) Gradient Boosting (GB10)

To assess the bias of the predictive model, the model error variable (Eq. (12)) was calculated for the predictions. For this Artificial Intelligence model, the model error (\(\varepsilon_{\bmod }\)) has a mean value of 0.9989. Therefore, it can be concluded that the predictive model tends to overestimate the compressive strength (\(f_{c}\)) slightly compared to the observed results.

Still evaluating the predictive model, it is possible to calculate the Mean Absolute Error (MAE). This analysis categorized the concretes into strength classes ranging from 5 to 115 MPa. The MAE was higher in the higher strength classes of concrete, particularly in the 70 MPa to 90 MPa range. However, the MAE values did not exceed 2 MPa for these concrete classes, which is significantly lower than the compressive strength. Figure 10 illustrates these MAE values across all concrete classes.

Fig. 10 MAE as a function of concrete strength class

The confidence interval of the predictive model response was calculated to present the result of the strength prediction on the platform. For this, a confidence level of 95% was used, reaching an error of only ± 0.48 MPa.
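
One possible construction of such an interval, based on the test-set residuals, is sketched below; this is an assumption about the procedure, not necessarily the one used to obtain the ± 0.48 MPa figure.

```python
# Hedged sketch of a 95% confidence interval on the mean prediction error,
# built from the test-set residuals of the previous evaluation sketch.
import numpy as np
from scipy import stats

residuals = y_test - y_pred
sem = stats.sem(residuals)                       # standard error of the mean residual
half_width = sem * stats.t.ppf(0.975, df=len(residuals) - 1)
print(f"95% CI half-width: +/- {half_width:.2f} MPa")
```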

Additionally, an analysis of the importance of the variables in the Gradient Boosting model was carried out. Gradient Boosting models capture complex non-linear connections among variables, and their variable importance scores are based on how much each variable contributes to reducing the model's loss function. The importance analysis of the GB10 model is presented in Fig. 11 and shows results similar to Table 2, which classified the variables \(t\), \(w/c\), and add as the most important.

Fig. 11 Feature importance in the GB10 model

In addition to the numerical results presented above, a web platform was built using the Python Streamlit framework. On this website, it is possible to access the AI created in this article: users can enter their mixture design and obtain the \(f_{c}\) value predicted by the model, along with the model's error rate. Figure 12 shows the program's interface. The program is online and has desktop and mobile versions.
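
A minimal Streamlit sketch of this kind of interface is given below; the input fields, the three-feature vector, and the model file name are hypothetical simplifications (the actual model expects all seven mixture attributes) and do not reproduce the CONCRETA code.

```python
# Hypothetical, simplified Streamlit sketch of the kind of interface described
# above; field names, feature vector, and model file are assumptions.
import joblib
import streamlit as st

st.title("Concrete compressive strength prediction")
cement = st.number_input("Cement (kg/m3)", min_value=0.0)
wc = st.number_input("Water-cement ratio", min_value=0.0)
age = st.number_input("Age (days)", min_value=1)

if st.button("Predict"):
    model = joblib.load("gb10_model.joblib")      # assumed model trained on these 3 inputs
    fc = model.predict([[cement, wc, age]])[0]    # simplified feature vector
    st.write(f"Predicted fc: {fc:.1f} MPa")
```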

Fig. 12 CONCRETA prediction framework interface

5 Concluding remarks

This work aimed to evaluate data mining methods for studying compressive concrete strength. It was possible to observe that non-linear models were more effective in extracting information from the concrete database, which comprised 2358 records.

The experimental data used in this research have good coverage; however, the values are unbalanced with respect to compressive strength, as seen in the \(f_{c}\) histogram in Fig. 3. Furthermore, the age values are concentrated below 50 days (Fig. 3e). This situation may imply a larger error for input conditions outside these regions, since the model does not have good accuracy there.

The initial data treatment necessitated prior cleaning of the database, enabling the creation of crucial variables in the mixture design, such as the water–cement ratio. Furthermore, visualizing the data before training allowed the validation of the authenticity of the database by confirming the negative correlation between the water–cement ratio and compressive concrete strength, as observed in the experiments.

During the application of the methods, the iterative cross-validation technique was used to ensure comprehensive testing of the dataset, ensuring that the selected model possesses the ability to generalize. This factor is of paramount importance in Artificial Intelligence (AI) tasks. In this case, 30 repetitions were used in the validation phase.

Simple models were utilized in this work, and the results proved satisfactory. The model with the highest generalization ability was an Ensemble-type model called Gradient Boosting. With this model, an accuracy greater than 85% was achieved, and subsequently, a predictive model was developed, which is available for download on the portal: https://wmpjrufg.github.io/Concreta/.

This research contribution provides individuals and organizations access to state-of-the-art technology based on Machine Learning, enabling them to analyze mixture designs even before conducting experimental tests with cylinder specimen ruptures. Consequently, this work streamlines and reduces the time and resources expended in semi-empirical mixture design. It is important to note that AI should not replace the traditional compression test regulated by NBR 5739 (2018) but rather be used as an additional tool to increase productivity in concrete production.

As a suggestion for future work, we recommend adding new databases to broaden the coverage of the predictive model, especially in the strength ranges with the highest mean absolute error, as shown in Fig. 10. Additionally, creating specific models for cement mortars and permeable concrete could be valuable. Including data on new cementitious materials could expand the potential applications of these techniques in civil engineering.