Introduction

Bainitic steels are in high demand in many application areas owing to their unique combination of properties, such as high tensile strength, impact toughness and weldability, with ultimate tensile strengths up to 2.2 GPa and toughness values up to 130 \(\mathrm{MPa\cdot m^{1/2}}\) [1, 2]. Bainitic steels derive their strength mainly from fine bainite plates, while the ductile austenite phase provides sufficient ductility. Therefore, bainitic steels are often used to manufacture pipelines for gas and oil transportation [3, 4]. The transformation behavior of bainite has been studied by many scholars; the common methods for measuring the phase transformation temperature are thermal expansion (dilatometry) [5], thermal analysis [6] and metallographic methods [7]. In addition, the continuous cooling transformation (CCT) diagram of undercooled austenite and the corresponding transformation points (bainite transformation start temperature, martensite transformation start temperature) can be obtained using the MUCG 83 [8] and J-MatPro [9] software.

The transformation kinetics and morphological characteristics of bainite are closely related to the elemental mass fractions, the cooling rate and atomic-scale characteristics [10]. Previous studies of Fe-9Ni-C alloys found that the Bs value decreases with increasing carbon content, which delays the onset of transformation [11]. As the transformation temperature changes, the microstructure formed in the early stage of transformation changes and the morphological characteristics of bainite are altered [12]. The addition of silicon delays the transformation kinetics of bainite. It was found that increasing the silicon content in the range of 1.0 to 2.0 wt% produced more film-like retained austenite and less carbide in bainitic steels, thereby increasing their strength and total elongation [13, 14]. The continuous cooling rate also affects the transformation behavior of low-carbon bainitic steels. Previous studies have shown that the formation of leaf-shaped bainite during the continuous cooling stage accelerates the subsequent bainite transformation [15]. As the cooling rate increases (1–30 °C/s), the bainite transformation start temperature first increases and then decreases [16]. In addition, because the alloying elements in a material come from different periods of the periodic table, their valence electron numbers and atomic radii differ [17]. Previous studies found that the variation of the transformation temperature of NiTi-based alloys is correlated with the number of valence electrons per atom (ev/a): the transformation temperature tends to decline with increasing valence electron number when ev/a < 6.8 or ev/a > 7.2. It was also found that the transformation temperature hysteresis spans a relatively wide range of values, which is considered to be related to the atomic radius; during the transformation, an increase in atomic size may lead to more energy dissipation and thus a larger hysteresis [18].

Previously, Bs was estimated by combining thermal simulation experiments with microstructural observation, which is inaccurate and limited [12]. So far, most research related to the bainite phase transformation has focused on experiments clarifying the influence of the phase field on microstructural aspects. The lack of a large body of experimental Bs data and of related machine-learning studies considerably hinders further work on revealing the mechanisms that influence the bainite phase transformation. Therefore, simulating the needed data is significant both for machine-learning studies in materials science and for the application of general physical methods. The powerful data analysis and processing tools of machine learning can significantly reduce errors caused by inaccurate experimental operation, exclude noisy data, and extract important information [19]. Such methods can automatically adjust the weight of each factor according to the target value of the model and perform ensemble learning.

Machine learning has been widely used in the field of steel materials [20,21,22,23,24,25]. For example, Van Bohemen et al. [20] extracted model parameters from the best fit to published time–temperature transformation data and used the model to describe the start curve of bainite formation and to predict bainite transformation kinetics. Moreover, the martensite transformation start temperature of steel has been predicted based on a thermodynamic model including factors such as chemical composition, austenite grain size, and driving force [21]. Wang et al. [22] used an artificial neural network model to predict CCT diagrams for a class of steels. In addition, Zhang et al. [23] used a Gaussian process regression model based on the alloying elements of a steel to predict the martensite transformation start temperature and analyzed the intrinsic link between the alloying elements and the martensite transformation point. Bs is closely related to specific alloying elements and process parameters [24, 25]. Therefore, it is crucial to investigate the intrinsic correlation between the important alloying elements and Bs by collecting elemental content data for bainite-containing steels.

In this study, Bs is predicted by machine learning algorithms combined with the Pearson and Spearman correlation coefficients and the random forest feature importance method, using alloying elements (C, Si, etc.) and cooling rate (CR) as input features. The relationships between the alloying elements Si and C and Bs were analyzed. The dataset was divided based on the concentration of C, and the model's performance was evaluated for low-carbon and medium-carbon steels separately. Meanwhile, considering the model's generalization capability, new atomic-scale features based on the alloying elements and cooling rates were added to enable the model to predict Bs better. Finally, the model is validated on randomly selected data.

Methodology

In the present work, the random forest (RF) algorithm [26], which is based on the idea of ensemble learning, is used as the dominant machine learning method, as illustrated in Fig. 1. First, the training sets (labeled 'Decision Tree-1, 2, …, N') are generated by bootstrapping, and a decision tree is constructed for each training set (green and blue points). Then, whenever a node has to be split, a random subset of features is selected and the optimal split is found among them (green dots in Fig. 1). Because the algorithm uses bagging, each tree sees only part of the samples and of the corresponding features, which helps avoid overfitting [27, 28]. Finally, the results of all trees (labeled 'Result-1, 2, …, N') are combined, and the final prediction is obtained by voting or averaging (labeled 'Majority Voting/Averaging') [29].

Figure 1
figure 1

Structure diagram of Random Forest algorithm
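As a minimal illustration of the workflow in Fig. 1, the sketch below trains a random forest regressor with scikit-learn; the file name and column names are placeholders rather than the actual dataset used in this work.

```python
# Minimal sketch of the random forest regression workflow (scikit-learn).
# "bainite_dataset.csv" and the column name "Bs" are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("bainite_dataset.csv")            # composition, cooling rate, Bs
X, y = df.drop(columns=["Bs"]), df["Bs"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Each tree is grown on a bootstrap sample of the training set and considers a
# random subset of features at every split; the tree predictions are averaged.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                           bootstrap=True, random_state=42)
rf.fit(X_train, y_train)
print("Test R2:", rf.score(X_test, y_test))
```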

In machine learning modeling, it is necessary to compare several methods and select the optimal one as the actual prediction model [30]. Therefore, the final dataset (see Table S4 in Supplementary Note 1) was divided into training and testing sets in a ratio of 4:1, and four different algorithms (RF, DT (decision tree), GBDT (gradient boosting decision tree) and Bagging) were adopted to build the models. The steps of this study include data collection, data processing, feature analysis, and model building and selection, which are summarized in the flowchart in Fig. 2. The black arrows represent the workflow in which chemical composition and cooling rate serve as input features and the prediction scores of models Y1, Y2, Y3 and Y4 are obtained; the prediction scores correspond to the values of three evaluation metrics used for model evaluation, and finally the optimal model is selected for Bs prediction. The red arrows represent the integration of atomic-scale features with the original features, after which the above steps are repeated to obtain a new Bs prediction.

Figure 2
figure 2

Diagram of machine learning-based alloy design system for the steels with desired phase transformation point
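The split-and-compare step described above can be sketched as follows; synthetic regression data stand in for the real Bs dataset, and the default hyperparameters are only illustrative.

```python
# Sketch of the 4:1 split and the comparison of the four algorithms (RF, DT,
# GBDT, Bagging). Synthetic regression data stand in for the real Bs dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              BaggingRegressor)
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=700, n_features=20, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)  # 4:1 split

models = {
    "RF": RandomForestRegressor(random_state=42),
    "DT": DecisionTreeRegressor(random_state=42),
    "GBDT": GradientBoostingRegressor(random_state=42),
    "Bagging": BaggingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R2 = {model.score(X_test, y_test):.3f}")
```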

Dataset

In this study, 738 samples were collected from the experimental results reported in previous literature [31,32,33,34,35,36,37,38,39]. The dataset contains 20 features (the chemical compositions of C, Si, Mn, P, S, Al, etc., and the cooling rate) and the target value Bs. The values of unrecorded alloying elements in the samples were set to zero. All duplicate samples and the outliers identified by the box plot method [40] were removed. Moreover, scatter plots were used to visualize the data and to identify features with large deviations.

Data processing

Data processing mainly refers to the handling of null values, duplicate values, discrete values, etc. (see Tables S1–S3 in Supplementary Note 1).
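A minimal sketch of these cleaning steps is given below, assuming the collected samples are stored in a CSV file; the file and column names are placeholders.

```python
# Sketch of the cleaning steps: fill unrecorded elements with zero, drop
# duplicates, and remove box-plot (IQR) outliers. Names are placeholders.
import pandas as pd

df = pd.read_csv("collected_samples.csv")
df = df.fillna(0.0)            # unrecorded alloying elements treated as 0 wt.%
df = df.drop_duplicates()

def iqr_mask(series, k=1.5):
    """True for values inside the box-plot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

df = df[iqr_mask(df["Bs"])]    # repeat for other features such as C and P
```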

The Pearson correlation coefficient [41], denoted r, is used for correlation analysis of the data and reflects the degree of linear correlation between two variables (X and Y). The value of r lies between −1 and 1, and a larger absolute value corresponds to a stronger correlation between the variables. It is calculated as Eq. (1):

$$r=\frac{\sum_{i=1}^{n}({X}_{i}-\overline{X })({Y}_{i}-\overline{Y })}{\sqrt{\sum_{i=1}^{n}{({X}_{i}-\overline{X })}^{2}}\sqrt{\sum_{i=1}^{n}{({Y}_{i}-\overline{Y })}^{2}}}$$
(1)

where n is the number of samples, Xi and Yi are the observations at point i corresponding to the variables X and Y, and \(\overline{X}\), \(\overline{Y}\) are the mean values of the X and Y variables, respectively.
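Equation (1) can be implemented directly and cross-checked against SciPy, as in the short sketch below; the sample values are purely illustrative.

```python
# Direct implementation of Eq. (1), cross-checked against scipy.stats.pearsonr.
import numpy as np
from scipy.stats import pearsonr

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = [0.10, 0.20, 0.30, 0.45, 0.60]   # illustrative feature values (e.g., wt.% C)
y = [586, 560, 538, 512, 491]        # illustrative target values (e.g., Bs in °C)
print(pearson_r(x, y), pearsonr(x, y)[0])   # both give the same r
```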

The Spearman correlation coefficient, generally denoted by \({\rho }_{s}\), is used to assess the monotonic relationship and correlation between two continuous variables [42]. When there are no repeated values in the data and the two variables are completely monotonically correlated, the coefficient is 1 or -1. The formula is calculated according to Eq. (2).

$${\rho }_{s}=\frac{{\sum }_{i=1}^{n}({R}_{i}-\overline{R })({S}_{i}-\overline{S })}{{\left[{\sum }_{i=1}^{n}{\left({R}_{i}-\overline{R }\right)}^{2}{\sum }_{i=1}^{n}{\left({S}_{i}-\overline{S }\right)}^{2}\right]}^\frac{1}{2}}$$
(2)

where n is the total number of samples, \({R}_{i}\) and \({S}_{i}\) are the ranks of the values for sample i, and \(\overline{R }\) and \(\overline{S }\) are the average ranks of the independent and dependent variables, respectively.
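Equation (2) is simply the Pearson formula applied to the ranks of the data, as the sketch below illustrates; the values are again illustrative.

```python
# Eq. (2): Pearson correlation of the ranks, cross-checked with scipy.stats.spearmanr.
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_rho(x, y):
    r, s = rankdata(x), rankdata(y)            # ranks R_i and S_i
    rc, sc = r - r.mean(), s - s.mean()
    return np.sum(rc * sc) / np.sqrt(np.sum(rc ** 2) * np.sum(sc ** 2))

x = [0.5, 1.0, 5.0, 10.0, 30.0]     # illustrative values (e.g., cooling rate, °C/s)
y = [620, 605, 560, 548, 530]       # illustrative Bs values (°C)
print(spearman_rho(x, y), spearmanr(x, y)[0])   # should agree
```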

Random forest feature importance [43]

Random forests have the useful ability to calculate the importance of individual feature variables. A prediction model may contain many features, and it is desirable to identify the feature variables that are highly correlated with the target value, so that prediction accuracy can be guaranteed with as few features as possible. Therefore, it is necessary to calculate the importance of each feature and rank them.
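A sketch of this ranking step is shown below; it assumes a random forest `rf` fitted on a feature matrix `X`, as in the earlier sketches.

```python
# Sketch of impurity-based feature importance ranking with a fitted random forest.
# Assumes `rf` (fitted RandomForestRegressor) and the DataFrame `X` from above.
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X.columns)
# feature_importances_ is already normalized (the values sum to 1)
print(importances.sort_values(ascending=False).head(10))
```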

The extended new atomic features

To improve the generalizability of the model, atomic features such as atomic radius and valence electron number are added to the original features. In this study, nine new atomic features are constructed: the electronegativity relative to iron (\({EN}_{Fe}\)) and carbon (\({EN}_{C}\)) atoms, the radius change rate relative to iron (\({\alpha }_{Fe}\)) and carbon (\({\alpha }_{C}\)) atoms, the first ionization energy relative to iron (\({IP}_{Fe}\)) and carbon (\({IP}_{C}\)), and the total number of valence electrons (\(Ven\_all\)). The electronegativity is further divided into the Pauling electronegativity relative to iron (\({PEN}_{Fe}\)) and carbon (\({PEN}_{C}\)) atoms and the Allen electronegativity relative to iron (\({AEN}_{Fe}\)) and carbon (\({AEN}_{C}\)) atoms (calculation formulas in Supplementary Note 4) [44].
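The exact definitions are given in Supplementary Note 4; purely as an illustration, the hypothetical sketch below builds a composition-weighted Pauling electronegativity difference relative to iron, which is one plausible form such a feature could take.

```python
# Hypothetical sketch of a composition-weighted atomic feature (relative Pauling
# electronegativity vs. Fe). The actual formulas are in Supplementary Note 4.
PAULING_EN  = {"Fe": 1.83, "C": 2.55, "Si": 1.90, "Mn": 1.55, "Cr": 1.66, "Ni": 1.91}
ATOMIC_MASS = {"Fe": 55.85, "C": 12.01, "Si": 28.09, "Mn": 54.94, "Cr": 52.00, "Ni": 58.69}

def pen_rel_fe(wt_pct):
    """Mole-fraction-weighted Pauling EN difference relative to Fe (assumed form)."""
    wt = dict(wt_pct)
    wt["Fe"] = 100.0 - sum(wt.values())                  # balance is iron
    moles = {el: w / ATOMIC_MASS[el] for el, w in wt.items()}
    total = sum(moles.values())
    return sum((n / total) * (PAULING_EN[el] - PAULING_EN["Fe"])
               for el, n in moles.items())

print(pen_rel_fe({"C": 0.3, "Si": 1.2, "Mn": 1.5}))      # example composition (wt.%)
```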

Model performance evaluation metrics

The performance of the constructed models is evaluated using the root-mean-square error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R2) [30] (Eqs. (1)–(3) in Supplementary Note 2).
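These three metrics can be computed with scikit-learn as follows; the values are illustrative.

```python
# RMSE, MAE and R2 for a set of predictions (illustrative values).
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [560, 540, 515, 600, 470]    # illustrative experimental Bs (°C)
y_pred = [555, 548, 510, 590, 480]    # illustrative predicted Bs (°C)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"RMSE = {rmse:.2f} °C, MAE = {mae:.2f} °C, R2 = {r2:.3f}")
```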

Empirical formula

The results of the present study are compared with previously published formulae. Four empirical formulas exist for the calculation of Bs; formula (3) [45] is shown below, and the others are given as Eqs. (4)–(6) in Supplementary Note 3.

$$\begin{aligned} B_{\text{S}} = & \; 720 - 585.63\text{C} + 126.6\text{C}^{2} - 66.34\text{Ni} + 6.06\text{Ni}^{2} - 0.232\text{Ni}^{3} \\ & - 31.66\text{Cr} + 2.17\text{Cr}^{2} - 91.68\text{Mn} + 7.82\text{Mn}^{2} - 0.3378\text{Mn}^{3} \\ & - 42.37\text{Mo} + 9.16\text{Co} - 0.1255\text{Co}^{2} + 0.00284\text{Co}^{3} \\ & - 36.02\text{Cu} - 46.15\text{Ru} \end{aligned}$$
(3)
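For reference, formula (3) translates directly into code (element contents in wt.%); this is only the published empirical relation, not the machine-learning model.

```python
# Direct implementation of empirical formula (3); all contents in wt.%.
def bs_empirical(C=0.0, Ni=0.0, Cr=0.0, Mn=0.0, Mo=0.0, Co=0.0, Cu=0.0, Ru=0.0):
    return (720 - 585.63 * C + 126.6 * C ** 2
            - 66.34 * Ni + 6.06 * Ni ** 2 - 0.232 * Ni ** 3
            - 31.66 * Cr + 2.17 * Cr ** 2
            - 91.68 * Mn + 7.82 * Mn ** 2 - 0.3378 * Mn ** 3
            - 42.37 * Mo
            + 9.16 * Co - 0.1255 * Co ** 2 + 0.00284 * Co ** 3
            - 36.02 * Cu - 46.15 * Ru)

print(bs_empirical(C=0.3, Mn=1.5, Cr=0.5))   # Bs in °C for an example composition
```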

Results and discussions

The result of the processed data

Based on the above dataset, the outliers of Bs were calculated and removed using box plots. Figure 3a shows the calculation of the Bs outliers: the upper and lower limits of Bs are 765 °C and 295 °C, respectively, and three outliers lie beyond these limits. Figure 3b shows the overall distribution of the Bs data. Based on previous experimental and literature data, values of Bs greater than 700 °C are rare; only two such values appear in this dataset, so they are classified as abnormal. The five Bs outliers (red pentagrams in Fig. 3b) were therefore removed. Figure 3c and d shows the distributions of the chemical components C and P against Bs, respectively, and the outliers of C and P (red pentagrams) were also removed. The outliers of the other features were treated in the same way (see Fig. S1 in Supplementary Note 1). The ranges of the individual feature values in the final dataset are listed in Table 1.

Figure 3
figure 3

a Outliers of Bs calculated by the box plot. Data distributions of b Bs, c chemical element C and d element P. Red pentagrams represent the discrete data

Table 1 Range of compositions (wt.%) and cooling rate (°C/s) in the dataset

The dependence of Bs on the elements Si and C

Correlation analysis of the different chemical components with Bs was performed (Fig. 4). The results show that the highest correlation between features was found for the chemical components S and P, with a correlation coefficient r of 0.67. Meanwhile, the elemental features that play an important role in Bs are C, Si, etc. (green lines), indicating that Bs gradually decreases with increasing C and Si content. In addition, features such as Cu and Nb have a positive effect on Bs (brown lines), which means that Bs gradually increases with increasing Cu content. The results indicate that higher C content or lower Cu content reduces Bs, but the dependence of Bs on C is stronger than on Cu. The correlation results also show that there is no significant linear interaction among the individual elemental features.

Figure 4
figure 4

Correlation heat map. The squares represent correlations between features, with red and purple squares representing positive and negative correlations, respectively. Lines represent correlations between features and Bs; green and brown lines represent a weakening or an enhancement of Bs, respectively

To explore the effects of specific elements on Bs, the averaged values for the C and Si element contents were taken as representative. When the C concentration lies in the range of 0–0.6 wt.%, Bs ranges from 586 °C down to 491 °C, as clearly shown in Fig. 5a, where the carbon concentration is chosen as the x-axis. Bs decreases steadily as the carbon concentration increases within the 0–0.6 wt.% range. This is because an increase in the carbon concentration of austenite reduces the carbon concentration gradient within the austenite grains prior to transformation, and this gradient is the effective driving force for bainite growth. Reducing the driving force increases the time required for carbon atoms to diffuse away from the interface and thus inevitably slows the bainite growth kinetics [46]. Moreover, a higher carbon concentration decreases the Gibbs energy difference between bainite and austenite that drives bainite nucleation [47]. When the Si content lies in the range of 0–1.6 wt.%, Bs ranges from 596 °C down to 463 °C. Figure 5b illustrates the relationship between Bs and Si content. The results demonstrate that Bs decreases significantly with increasing Si content when the Si content is 0–0.2 wt.%; when the Si content exceeds 0.2 wt.%, Bs decreases only gradually and slowly. Figure 5b thus indicates that if the Si concentration of the steel is more than 0.2 wt.%, Si has only a weak influence on the bainite reaction. The reason is that Si, as a non-carbide-forming element, can inhibit carbide precipitation and acts as a solid-solution element in the steel that stabilizes the austenite, thereby shifting the C-curve to the right and lowering the bainite transformation start temperature [48]. The addition of Si can delay the bainite transformation by affecting the nucleation and growth rates of bainitic ferrite [49].

Figure 5
figure 5

Relationship between a C, b Si and Bs. The squares represent the calculated values, and the error bars represent the calculated error at each point. The pentagrams correspond to the experimental data for C [50] and Si [51], respectively

The results of the model based on the dataset

In this study, different algorithms were used for modeling, and the prediction results are listed in Table 2. The results show that the model built with the RF algorithm has the highest prediction accuracy, 90.5%, followed by the GBDT algorithm at 90.1%; these two algorithms are therefore the most suitable for predicting the phase transformation point. However, Table 2 also clearly shows that the prediction accuracies of the four algorithms differ little (by about 3%), which further indicates that all four algorithms are feasible. In addition, the parameters of the RF algorithm were optimized in this work. The results show that the optimal parameter values are n_estimators = 171, max_depth = 17, min_samples_leaf = 1 and min_samples_split = 2. Therefore, this set of parameters was chosen to build the final Bs prediction model.

Table 2 Model results according to R2, RMSE and MAE
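A sketch of such a parameter search is given below; the grid is illustrative, but it contains the optimal values reported above, which are then used for the final model.

```python
# Sketch of the RF hyperparameter search (illustrative grid) and the final model
# built with the optimal values reported in the text.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 150, 171, 200],
    "max_depth": [10, 15, 17, 20],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 8],
}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring="r2", cv=5, n_jobs=-1)
search.fit(X_train, y_train)                 # X_train, y_train as in earlier sketches
print(search.best_params_, search.best_score_)

final_rf = RandomForestRegressor(n_estimators=171, max_depth=17,
                                 min_samples_leaf=1, min_samples_split=2,
                                 random_state=42).fit(X_train, y_train)
```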

The results of the model based on the low-carbon and medium-carbon datasets

Based on the importance of element C for Bs, the entire dataset was divided into low-carbon (Wc ≤ 0.25%) and medium-carbon (0.25% < Wc ≤ 0.6%) subsets. Figure 6 examines the deviation of the predicted values from the actual values for the low- and medium-carbon data with the corresponding scatter plots, and also reports the prediction results (R2, RMSE, and MAE) for both subsets based on the random forest. Figure 6a and c shows that the errors of the low-carbon training set (RMSE = 14.7311 and MAE = 9.0564) are larger than those of the medium-carbon training set (9.1569 and 2.6751). Figure 6b and d shows that R2 = 0.8907 for the low-carbon test set is smaller than the value of 0.9653 for the medium-carbon test set, which indicates that the model performs better in predicting Bs for the medium-carbon data.

Figure 6
figure 6

Fitting between the predicted value and experimental value. a low-carbon training set, b low-carbon testing set, c medium-carbon training set, d medium-carbon testing set
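A sketch of this carbon-based split and per-subset evaluation is given below; the column name "C" and the dataframe `df` are assumptions carried over from the earlier sketches.

```python
# Sketch of splitting the dataset by carbon content and evaluating each subset.
# Assumes the cleaned DataFrame `df` with a carbon column "C" and target "Bs".
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

subsets = {
    "low carbon": df[df["C"] <= 0.25],
    "medium carbon": df[(df["C"] > 0.25) & (df["C"] <= 0.60)],
}
for name, subset in subsets.items():
    Xs, ys = subset.drop(columns=["Bs"]), subset["Bs"]
    X_tr, X_te, y_tr, y_te = train_test_split(Xs, ys, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
    print(name, "test R2 =", round(model.score(X_te, y_te), 4))
```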

Addition of new features

The correlation analysis of the new features is shown in Fig. 7. The analysis indicates that \({\alpha }_{Fe}\), \({\alpha }_{C}\), and \(Ven\_all\) are positively correlated with Bs, while \({PEN}_{Fe}\) and \({PEN}_{C}\) show the opposite trend. In addition, \({IP}_{Fe}\) and \({IP}_{C}\) have the same correlation coefficient of −0.1 with Bs, indicating that Bs depends on them in a similar way. Figure 7 further shows that there are strong correlations among the new atomic-parameter features: the correlation coefficients between \({PEN}_{Fe}\), \({PEN}_{C}\), \({AEN}_{Fe}\), \({AEN}_{C}\), \({IP}_{Fe}\), \({IP}_{C}\), \({\alpha }_{Fe}\) and \({\alpha }_{C}\) are all above 0.9. Therefore, some of these highly correlated features can be removed without affecting the model performance.

Figure 7
figure 7

Correlation analysis between the new features and Bs

Figure 8 shows the ranking of the importance of the new features for Bs; note that the importance values are normalized. The results show that \(Ven\_all\) ranks first in feature importance. This feature is related to the number of valence electrons of the elements and thus accounts for the effect of electronic stability on the bainite transformation. In addition, \({PEN}_{Fe}\), \({PEN}_{C}\), \({\alpha }_{Fe}\) and \({\alpha }_{C}\) are also ranked high and may reflect the influence of the alloying elements on the stability of iron and carbon. Therefore, adding such atomic features can improve the performance of the trained model.

Figure 8
figure 8

Result of feature importance ranking based on RF

Figure 9 shows the results when each new feature is added individually. The addition of new features such as \({PEN}_{C}\), \({IP}_{C}\), and \(Ven\_all\) reduced the MAE values (green dots). In addition, \(Ven\_all\) also improved the RMSE values (red dots) without significantly worsening the other evaluation metric (blue dots). Therefore, \(Ven\_all\), \({PEN}_{C}\) and \({AEN}_{Fe}\) may be used as beneficial features in addition to the chemical composition and cooling rate. Figure 10 shows the results after sequentially adding each of the remaining eight features in combination with \(Ven\_all\). The results show that the model with \(Ven\_all\) + \({\alpha }_{Fe}\) has the smallest MAE and RMSE errors and also performs well on the R2 index.

Figure 9
figure 9

Results of adding new features

Figure 10
figure 10

Results of adding the remaining features based on \(Ven\_all\)
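The feature-addition experiments of Figs. 9 and 10 can be sketched as a simple loop; the augmented dataframe `df_atomic` and its column names are assumptions matching the feature symbols defined above.

```python
# Sketch of the feature-addition experiment: each atomic feature is appended to
# the base features in turn and the model is re-scored by cross-validated RMSE.
# `df_atomic` and its column names are assumed to match the features defined above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

base_cols = ["C", "Si", "Mn", "Cr", "Ni", "Mo", "CR"]      # illustrative base set
atomic_cols = ["PEN_Fe", "PEN_C", "AEN_Fe", "AEN_C",
               "IP_Fe", "IP_C", "alpha_Fe", "alpha_C", "Ven_all"]

for col in atomic_cols:
    scores = cross_val_score(RandomForestRegressor(random_state=42),
                             df_atomic[base_cols + [col]], df_atomic["Bs"],
                             scoring="neg_root_mean_squared_error", cv=5)
    print(col, "CV RMSE =", round(-scores.mean(), 2))
```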

In the present study, the model with the original features is denoted as Model A, and the model with the added new features is denoted as Model B. The comparison of the results of Model A and Model B is reported in Table 3. The results indicate only one exception: for the Bagging algorithm, Model A (R2 = 0.893) gives slightly better results than Model B (R2 = 0.891). For the remaining three algorithms, the predictive ability of Model B exceeds that of Model A. The prediction results of all four models are excellent (R2 above 0.85), which indicates that the bainite transformation start temperature can be accurately predicted by an effective regression model based on the existing data. Furthermore, the random forest model has the highest R2 (0.913) and the smallest RMSE (24.67) and MAE (17.34), denoting the best performance. Therefore, in building the Bs prediction model, this work first preprocessed and standardized the data, analyzed the features, and then adopted the random forest method to predict this specific phase transformation point.

Table 3 Comparison of the results of Model A and Model B

Modeling validation

To test the model's accuracy, the prediction results of the random forest model were validated in the present work. Fifty data points were randomly selected, and scatter plots were drawn with the experimental values as the horizontal coordinate and the predicted values as the vertical coordinate (Fig. 11). The predictions of the model with the original features (Model A, Fig. 11a) and of the model with the new features (Model B, Fig. 11b) are both in good agreement with the experimental values, and the predicted values of Model B match the experimental values better than those of Model A. Figure 11c and d reveals that the calculated Bs values deviate significantly from the experimental values: the vast majority of the Bs values calculated using J-MatPro (Fig. 11c) are higher than the experimental values and have larger errors, while the Bs values calculated by the empirical formula (3) (Fig. 11d) are scattered and show no clear linear trend against the experimental values. Note that formula (3) is the best fit among the four empirical formulas; the results of the remaining three are shown in Fig. S2 of Supplementary Note 3.

Figure 11
figure 11

Fitting between the predicted and experimental values of the validation set. a Model A, b Model B, c J-MatPro, d Empirical formula. The red shaded area represents the error

Conclusion

In this study, experimental data for the bainite transformation start temperature (Bs) were collected and preprocessed. The features were analyzed using the Pearson correlation coefficient, and the influence of the chemical compositions Si and C on Bs was further quantified and investigated by means of the random forest model. The obtained accuracy was as high as 91%, with errors within ± 25 °C, and a random forest prediction model for Bs was obtained. In addition, new atomic features were added for prediction, and the results are closer to the experimental values; these features can be extended to the case of an unknown new element, which also provides new ideas for studying other factors that affect Bs. The following conclusions can be drawn:

(1) Bs decreases consistently from 586 °C to 491 °C with increasing C content from 0.1 wt.% to 0.6 wt.%. The model prediction results based on the carbon-concentration division show that the prediction for medium-carbon steel (R2 = 0.9653) is better than that for low-carbon steel (R2 = 0.8907), i.e., the model performs better in predicting Bs for medium-carbon steel.

(2) Bs varies significantly with increasing Si content. As the silicon content increases to 0.2 wt.%, Bs decreases considerably from 596 °C to 513 °C, with a turning point at 0.2 wt.%; when the Si content exceeds 0.2 wt.%, the trend remains almost unchanged.

(3) Among the added atomic features, the number of valence electrons ranks first in importance. In addition, the radius change rate relative to iron and the number of valence electrons both show a similar, positive correlation with Bs. The addition of atomic features improves the performance of the model and enhances its generalization ability.