1 Introduction

Ensuring product and process quality is a constant challenge in industrial organizations. One of the main impurities found in the steelmaking process is phosphorus. A high phosphorus content can considerably reduce the quality of steel alloys. Therefore, a method for process analysis and control to ensure high process reliability and quality is necessary (Barella et al. 2017).

Regression models can be used to predict output data from various input data and to explain the underlying phenomenon behind the collected data. They are used to monitor response variables as functions of one or more input variables.

Most statistical methods are parametric in that they make assumptions about the distributional properties and autocorrelation structure of the process parameters. Several distribution-free or nonparametric methods based on machine learning techniques have been proposed in the literature. These methods are nonparametric in that they do not need to assume specific probability distributions for implementation (Camci et al. 2008). Artificial neural networks (ANNs), support vector machines (SVMs), and relevance vector machines (RVMs) are the most commonly used machine learning techniques.

ANNs can be defined as information processing systems based on the behavior of the human nervous system (Vapnik 1998). Haykin (2009) suggested a decision rule to minimize the error in the training data based on the general induction principle. Mazumdar and Evans (2009) described modern steelmaking processes along with physical modelling, mathematical modelling, and applications of ANNs and genetic algorithms. Ghaedi and Vafaei (2017) reviewed the applications of ANN, SVM, and the adaptive neuro-fuzzy inference system (ANFIS) for the adsorption removal of dyes from aqueous solution.

In recent years, SVMs have been introduced as one of several kernel-based techniques available in the field of machine learning for classification, prediction, and other learning tasks (Vapnik 1998). Kernel-based methods are based on mapping data from the original input feature space to a kernel feature space of higher dimensionality and then solving a linear problem in the feature space (Schölkopf and Smola 2002). SVMs were first introduced by Vapnik for solving classification problems. Support vector techniques have since been extended to the domain of regression. These techniques are called support vector regression (SVR) (Vapnik 1998).

Applications of SVR to model different chemical and industrial processes have been presented in recent years. Ghaedi et al. (2014) proposed multiple linear regression (MLR) and least squares support vector machine (LS-SVM) methods for modeling methylene blue dye adsorption using copper oxide loaded on activated carbon. Zaidi (2015) proposed a unified data-driven model for predicting the boiling heat transfer coefficient in a thermosiphon reboiler using SVR as the modeling method. Ghaedi et al. (2016b) studied the predictive ability of a hybrid SVR and genetic algorithm optimization model for the adsorption of malachite green onto multiwalled carbon nanotubes. Cheng et al. (2016) performed a study which suggested that the SVR model can provide an important theoretical and practical guide for experimental design and for controlling the tensile strength of graphene nanocomposites via rational process parameters. Ghaedi et al. (2016a) presented the application of least squares support vector regression (LS-SVR) and MLR for modeling the removal of methyl orange onto tin oxide nanoparticles loaded on activated carbon and activated carbon prepared from Pistacia atlantica wood. Jia et al. (2017) proposed a mathematical model for optimizing the dividing wall column process with a combination of SVM and a particle swarm optimization algorithm. Ghugare et al. (2017) performed a study that utilized genetic programming, ANN, and SVR for developing nonlinear models to predict the carbon, hydrogen, and oxygen fractions of solid biomass fuels.

Tipping (2000) introduced the relevance vector machine (RVM), a Bayesian sparse kernel technique for regression and classification with a functional form identical to that of the SVM. The RVM for regression, called relevance vector regression (RVR), constitutes an approximation that can be used to solve nonlinear regression models.

In recent years, some applications of RVR to model and predict industrial processes applied to different areas of engineering have been reported. Zhang et al. (2015) utilized an RVM to estimate the remaining useful life of a lithium-ion battery based on denoised data. He et al. (2017) presented a new fault diagnosis method based on RVM to handle small-sample data. The results showed the validity of the proposed method. Liu (2017) utilized just-in-time (JIT) and RVM for soft-sensor modeling. The proposed methodologies were successfully applied for predicting hard-to-measure variables in wastewater treatment plants. Verma et al. (2017) utilized three different kernel-based models (SVR, RVR, and Gaussian process regression) to predict the compressive strength of cement. The performance of SVR and RVR was found to be comparable to that of ANN. Imani et al. (2018) examined the applicability and capability of extreme learning machine (ELM) and RVR models for predicting sea level variations and compared their performances with the SVR and ANN models. The results showed that the ELM and RVR models outperformed the other methods.

The performance of RVR and SVR models depends heavily on the choice of the hyperparameters. In actual applications, many practitioners select the hyperparameters in RVR and SVR empirically by trial and error or use a grid search technique together with a cross-validation method. Apart from consuming enormous amounts of time, such procedures for selecting the hyperparameters may not result in the best performance. Here, we use a differential evolution algorithm to optimize the RVR and SVR hyperparameters with different kernel functions. Differential evolution (DE) is a variant of evolutionary algorithms proposed by Storn and Price (1997).

There is no consistent methodology for determining the control parameters in DE (scale factor F, crossover rate Cr, and population size Np). These parameters are frequently set arbitrarily within predefined ranges. The control parameters are, in general, key factors affecting the convergence of DE (Price et al. 2006). Das et al. (2016) summarized and organized the current developments in DE, and presented recent proposals for parameter adaptation in DE algorithms.

This work is an applied study to predict the phosphorus concentration levels in the manufacture of medium-carbon ferromanganese (FeMnMC) in a Brazilian steelmaking company. For the same process, Pedrini and Caten (2010) developed seven MLR models, and Acosta et al. (2016) developed a multilayer perceptron (MLP) neural network to predict the phosphorus concentration levels.

The main objectives of this study are to apply RVR and SVR techniques to predict the phosphorus concentration level in the steelmaking process. In addition, we applied a DE algorithm to optimize the RVR and SVR hyperparameters with different kernel functions. To the best of our knowledge, no previous studies have analyzed the use of RVR and SVR combined with a self-adaptive DE approach to predict phosphorus concentration levels in the steelmaking process.

The main contributions of this paper can be summarized as follows: (i) RVR and SVR models are proposed for the predictive modeling of phosphorus concentration levels in a steelmaking process with actual data, (ii) a self-adaptive DE algorithm is utilized to optimize the RVR and SVR hyperparameters with different kernel functions, and (iii) the performance of the RVR and SVR models is compared with ridge regression, MLR, model trees, ANN, and the random vector functional link (RVFL) neural network.

The rest of this paper is organized as follows: Section 2 presents the fundamental theory of SVR, RVR, and DE; Section 3 presents the proposed monitoring strategy; and Section 4 presents an applied study with a brief description of the steelmaking process, the implementation of the models, and a comparison with the ridge regression, MLR, model trees, ANN, and RVFL. Finally, Section 5 presents the conclusions of the study.

2 Background

2.1 Support vector regression

This section describes the fundamental theory of SVR. For more details on SVM, please refer to Vapnik (1998), Kecman (2001), Schölkopf and Smola (2002), Smola and Schölkopf (2004) and Cherkassky and Mulier (2007). Basak et al. (2007) reviewed the existing theory, methods, developments, and scope of SVR. The principal idea of SVR is to compute a linear regression function in a high-dimensional feature space into which the input data are mapped via a nonlinear function.

The construction of SVR uses the \({\upvarepsilon }\)-insensitive loss function proposed by Vapnik (1998),

$$ \left| y - f\left( x \right) \right|_{\varepsilon } = \begin{cases} 0 & \text{if } \left| y - f\left( x \right) \right| \le \varepsilon \\ \left| y - f\left( x \right) \right| - \varepsilon & \text{otherwise} \end{cases} $$
(1)

where \(y\) is the measure (target) value and \(f\left( x \right)\) is the predicted value.
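
For illustration only, Eq. (1) can be written as a small R function; this is a sketch (the function and variable names are ours, not from the original work):

```r
# Vapnik's epsilon-insensitive loss of Eq. (1):
# zero inside the epsilon-tube, linear outside it.
eps_insensitive_loss <- function(y, y_hat, eps) {
  pmax(abs(y - y_hat) - eps, 0)
}

# Example: deviations of 0.05 and 0.30 around y_hat = 1 with eps = 0.1
eps_insensitive_loss(c(1.05, 1.30), c(1.00, 1.00), eps = 0.1)  # 0.0 0.2
```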

Vapnik's \({\upvarepsilon }\)-insensitive loss function in Eq. (1) defines a tube with radius \({\upvarepsilon }\) fitted to the data, called the \({\upvarepsilon }\)-tube. Consider training data \(\left\{ \left( x_{1}, y_{1} \right), \ldots, \left( x_{N}, y_{N} \right) \right\}\), where the inputs \(x_{n}\) lie in the input space ℵ. A linear function \(f\left( x \right)\) can be written in the form of (Smola and Schölkopf 2004)

$$ f\left( x \right) = \langle \omega \;,x \rangle \; + b\,\,with\,\,\omega \in \aleph,\;b\; \in {\Re}$$
(2)

where \(\langle .,. \rangle\) denotes the dot product in ℵ and b is the bias.

The problem can be written as a convex optimization problem formulated using slack variables (\(\xi_{i}\) and \(\xi_{i}^{*}\)) to measure the deviation of the training samples outside the \({\upvarepsilon }\)-insensitive zone (Vapnik 1998),

$$ \text{minimize } \frac{1}{2} \left\| \omega \right\|^{2} + C\sum_{i = 1}^{N} \left( \xi_{i} + \xi_{i}^{*} \right) \quad \text{subject to } \begin{cases} y_{i} - \langle \omega, x_{i} \rangle - b \le \varepsilon + \xi_{i} \\ \langle \omega, x_{i} \rangle + b - y_{i} \le \varepsilon + \xi_{i}^{*} \\ \xi_{i},\; \xi_{i}^{*} \ge 0 \end{cases} $$
(3)

The regularization parameter C influences the tradeoff between the approximation error and the weight vector norm \(||\omega||\). Fig. 1 illustrates the SVR model. According to Smola and Schölkopf (2004), only the points outside the \({\upvarepsilon }\)-tube (shaded region) contribute to the cost, insofar as the deviations are penalized linearly.

Fig. 1 SVR model: (a) \({\upvarepsilon }\)-tube (shaded region) and slack variables \(\xi_{i}\), (b) \({\upvarepsilon }\)-insensitive loss function (Schölkopf and Smola 2002)

The optimization problem in Eq. (3) can be transformed into a dual problem utilizing Lagrange multipliers \(\left( {\alpha_{n} ,\alpha_{n}^{*} } \right)\) (Vapnik 1998; Smola and Schölkopf 2004)

$$ \begin{aligned} & \text{maximize} \begin{cases} -\dfrac{1}{2}\displaystyle\sum_{n,j = 1}^{N} \left( \alpha_{n} - \alpha_{n}^{*} \right)\left( \alpha_{j} - \alpha_{j}^{*} \right) k\left( x_{n}, x_{j} \right) \\ -\,\varepsilon \displaystyle\sum_{n = 1}^{N} \left( \alpha_{n} + \alpha_{n}^{*} \right) + \displaystyle\sum_{n = 1}^{N} y_{n} \left( \alpha_{n} - \alpha_{n}^{*} \right) \end{cases} \\ & \text{subject to } \sum_{n = 1}^{N} \left( \alpha_{n} - \alpha_{n}^{*} \right) = 0 \;\; \text{and} \;\; \alpha_{n}, \alpha_{n}^{*} \in \left[ 0, C \right] \end{aligned} $$
(4)
$$ f\left( x \right) = \mathop \sum \limits_{n = 1}^{N} \left( {\alpha_{n} - \alpha_{n}^{*} } \right)k\left( {x_{n} ,x} \right) + b $$
(5)

where \(\alpha_{n} \) and \(\alpha_{n}^{*}\) are Lagrange multipliers, \(k\left( {x_{n} ,x} \right)\) is a kernel function and \(b\) is the bias. The support vectors (SVs) are the points that appear with nonzero coefficients in Eq. (5). Therefore, SVR has a sparse solution (Schölkopf and Smola 2002).

The kernel function \(k\left( {x_{n} ,x} \right)\) in Eq. (5) is a symmetric function satisfying Mercer’s conditions and is defined as a linear dot product of the nonlinear mapping (Vapnik 1998). A nonlinear function is learned by a linear learning machine in the kernel-induced feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space.

Table 1 shows the kernel functions and their parameters used in this study: u is the parameter needed for the polynomial and sigmoid-type kernels, d is the degree of the polynomial, \(\gamma = 1/(2r^{2})\), and \(r > 0\) is a parameter that defines the kernel width. The kernel parameters are determined during the training phase.

Table 1 Kernel functions
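
As a hedged illustration, the kernels of Table 1 can be sketched in R as follows; we use one common parameterization, which may differ in detail from the exact forms adopted in this paper:

```r
# Common parameterizations of the kernels in Table 1 (a sketch, not the
# paper's exact definitions): gamma is the width parameter, u the offset
# for the polynomial/sigmoid kernels, and d the polynomial degree.
linear_kernel     <- function(x, z)              sum(x * z)
polynomial_kernel <- function(x, z, gamma, u, d) (gamma * sum(x * z) + u)^d
sigmoid_kernel    <- function(x, z, gamma, u)    tanh(gamma * sum(x * z) + u)
rbf_kernel        <- function(x, z, gamma)       exp(-gamma * sum((x - z)^2))
laplacian_kernel  <- function(x, z, gamma)       exp(-gamma * sqrt(sum((x - z)^2)))
```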

The generalization performance of SVR depends on the correct specification of the free hyperparameters, namely, the value of the \({\upvarepsilon }\)-insensitivity, the regularization parameter C, and the kernel parameters. Usually, the kernel type is first selected by the user based on the properties of the application data, and then the SVR hyperparameters are selected using some computational or analytic approach.

Cherkassky and Ma (2004) summarized many practical approaches for setting the values of the regularization parameter C and the \({\upvarepsilon }\)-insensitivity. They proposed the analytical selection of the C parameter directly from the training data, the analytical selection of the \({\upvarepsilon }\) parameter based on the (known or estimated) level of noise in the training data and the (known) number of training samples, and the selection of the RBF kernel width parameter to reflect the input range of the training/test data.
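
A minimal R sketch of these analytic starting values, assuming an estimate of the noise standard deviation (sigma_noise) is available, might read as follows; the formulas reflect our reading of Cherkassky and Ma (2004) and should be checked against the original paper:

```r
# Analytic starting values for C and epsilon in the spirit of
# Cherkassky and Ma (2004); sigma_noise is an assumed noise estimate.
cherkassky_ma_start <- function(y, sigma_noise) {
  n   <- length(y)
  C   <- max(abs(mean(y) + 3 * sd(y)), abs(mean(y) - 3 * sd(y)))
  eps <- 3 * sigma_noise * sqrt(log(n) / n)
  list(C = C, epsilon = eps)
}
```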

2.2 Relevance vector regression

In this section, the fundamental theory of relevance vector regression is introduced. For more details on RVM, readers can refer to Tipping (2000, 2001), Schölkopf and Smola (2002), Tipping and Faul (2003), and Bishop (2006).

The approach uses a dataset of input and output (target) pairs \(\left\{ {x_{n} ,t_{n} } \right\}_{n = 1}^{N}\), follows a probabilistic formulation, and assumes \(p\left( {t_{n} {|}x} \right) = {\mathcal{N}} \left( {t_{n} {|}f\left( {x_{n} } \right), \sigma^{2} } \right)\), where the notation specifies a Gaussian distribution over \(t_{n}\) with mean \(f\left( {x_{n} } \right)\) and variance \(\sigma^{2}\). The approach considers functions similar in type to those implemented by the SVM, i.e. (Tipping 2000),

$$ f\left( x \right) = \mathop \sum \limits_{n = 1}^{N} \omega_{n} k\left( {x_{n} ,x} \right) + \omega_{0} $$
(6)

where \(\omega_{n}\) are the model weights, \(k\left( {x_{n} ,x} \right)\) is a kernel function and \(\omega_{0}\) is the bias. In this study, we use the kernel functions given in Table 1.

The RVR is a Bayesian treatment of Eq. (6). RVR adopts a fully probabilistic framework and introduces a prior on the model weights governed by a set of hyperparameters, each of which is associated with a weight and whose most probable values are iteratively estimated from the data. The likelihood estimation of the dataset can then be written as (Tipping 2001),

$$ p\left( \mathbf{t} \,|\, \boldsymbol{\upomega}, \sigma^{2} \right) = \left( 2\pi \sigma^{2} \right)^{-N/2} \exp \left\{ -\frac{1}{2\sigma^{2}} \left\| \mathbf{t} - \boldsymbol{\Phi} \boldsymbol{\upomega} \right\|^{2} \right\} $$
(7)

where \(\mathbf{t} = \left( t_{1}, \ldots, t_{N} \right)^{\mathrm{T}}\), \({\varvec{\upomega}} = \left( \omega_{0}, \ldots, \omega_{N} \right)^{\mathrm{T}}\), and \(\boldsymbol{\Phi}\) is the \(N \times \left( N + 1 \right)\) 'design' matrix with \(\boldsymbol{\Phi} = \left[ \phi\left( x_{1} \right), \phi\left( x_{2} \right), \ldots, \phi\left( x_{N} \right) \right]^{\mathrm{T}}\), wherein \(\phi\left( x_{n} \right) = \left[ 1, k\left( x_{n}, x_{1} \right), k\left( x_{n}, x_{2} \right), \ldots, k\left( x_{n}, x_{N} \right) \right]^{\mathrm{T}}\).

According to Tipping (2000), the maximum-likelihood estimation of \({{\varvec{\upomega}}}\) and \(\sigma^{2}\) from Eq. (7) would result in severe overfitting. To avoid this, Tipping prefers smoother (less complex) functions, obtained by defining a zero-mean Gaussian prior distribution over the weights. The introduction of an individual hyperparameter \(\alpha_{n}\) for each weight parameter \(\omega_{n}\) is the key feature of RVR. Thus, the weight prior takes the form of

$$ p\left( {{{\varvec{\upomega}}}{|}{{\varvec{\upalpha}}}} \right) = \mathop \prod \nolimits_{n = 0}^{N} {\mathcal{N}} \left( {\omega_{n} {|}0,\alpha_{n}^{ - 1} } \right) $$
(8)

where \({{\varvec{\upalpha}}}\) is a vector of \(N + 1\) hyperparameters and \(\alpha_{n}\) represents the precision of the corresponding parameter \(\omega_{n}\) (Bishop 2006). The marginal likelihood for the hyperparameters is obtained by integrating the weights (Tipping 2001)

$$ p\left( \mathbf{t} \,|\, {\varvec{\upalpha}}, \sigma^{2} \right) = \left( 2\pi \right)^{-N/2} \left| \sigma^{2} \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \right|^{-1/2} \exp \left\{ -\frac{1}{2} \mathbf{t}^{\mathrm{T}} \left( \sigma^{2} \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \right)^{-1} \mathbf{t} \right\} $$
(9)

where \(\mathbf{A} = \mathrm{diag}\left( \alpha_{0}, \alpha_{1}, \ldots, \alpha_{N} \right)\).

The values of \(\alpha\) and \(\sigma^{2}\) are determined using type-II maximum likelihood, in which the marginal likelihood function is maximized after integrating out the weight parameters. In the RVR method, a proportion of the hyperparameters \(\left\{ {\alpha_{n} } \right\}\) is driven to large values. The weight parameters \(\omega_{n}\) corresponding to these hyperparameters thus have posterior distributions with means and variances both equal to zero (Bishop 2006). These parameters are therefore removed from the model, and sparsity is realized. In the case of models with the form of Eq. (6), the inputs \(x_{n}\) corresponding to the remaining nonzero weights are called the relevance vectors (RVs) and are analogous to the support vectors (SVs) of an SVR.

According to Tipping (2000), some advantages of RVR over SVR are: (i) it can produce probabilistic output, (ii) there is no need to define the regularization parameter C and the insensitivity parameter \(\varepsilon\), and (iii) non-Mercer kernel functions can be used. The most compelling feature of RVR is that it is capable of generalization performance comparable to that of an equivalent SVR while using, in most cases, a significantly smaller number of RVs than the number of SVs used by an SVR to solve the same problem. More significantly, in RVR the parameters governing complexity and noise variance (the \(\alpha\)'s and \(\sigma^{2}\)) are automatically estimated by the learning procedure, whereas in SVR it is necessary to tune the hyperparameters C and \(\varepsilon\) (Tipping 2001; Tipping and Faul 2003; Bishop 2006).

2.3 Differential evolution

The differential evolution (DE) algorithm, proposed by Storn and Price (1997), is an evolutionary algorithm (EA) for global optimization, which has been widely applied in many scientific and engineering fields (Qin et al. 2009).

The DE algorithm involves the three main operations of mutation, crossover, and selection (Storn and Price 1997). DE is a scheme for generating trial parameter vectors. Mutation and crossover are used to generate new vectors (trial vectors), and selection then determines which of the vectors will survive into the next generation.

The original version of DE can be defined by the following constituents (Storn 2008):

  • Population: DE is a population-based optimizer that attacks the starting point problem by sampling the objective function at multiple, randomly chosen initial points. DE aims to evolve the population of Np D-dimensional vectors, which encode the gth-generation candidate solutions, towards the global optimum (Price et al. 2006).

  • Mutation: Once the population is initialized, DE mutates and recombines it to produce a population of Np trial vectors. The scale factor, \(F \in \left(\mathrm{0,1}+\right)\), is a positive real number that controls the rate at which the population evolves (Price et al. 2006).

  • Crossover: Following the mutation operation, crossover is applied to the population. The crossover probability, \(Cr \in \left[\mathrm{0,1}\right]\), is a user-defined value that controls the fraction of parameter values copied from the mutants (Price et al. 2006).

  • Selection: If the trial vector has an equal or lower objective function value than that of its target vector it replaces the target vector in the next generation g+1; otherwise, the target retains its place in the population for at least one more generation.

The mutation strategies vary according to the type of individual modified to form the donor vector, the number of weighted difference vectors used, and the type of crossover employed. A mutation strategy is denoted by \( {\text{DE}}/\xi/\beta /\delta \), where (Santos et al. 2012):

  • \( \xi \) denotes the vector to be disturbed,

  • \(\beta \) determines the number of weighted differences,

  • \( \delta \) denotes the crossover type.

The setting of the DE control parameters is crucial for the performance of the algorithm. According to Storn and Price (1997), DE is much more sensitive to the choice of scale factor F than it is to the choice of crossover probability Cr.

According to Eiben et al. (2007), there are two major approaches for setting the parameter values: parameter tuning and parameter control. Parameter tuning is a commonly practiced approach that tries to find good values for the parameters before the algorithm runs and then runs the algorithm using these values, which remain fixed during the run. In parameter control, the values of the parameters are changed during the run. The methods for changing the parameter values can be classified into one of three categories: deterministic parameter control, adaptive parameter control, and self-adaptive parameter control.

In self-adaptive parameter control, the parameters to be adapted are encoded into chromosomes and undergo mutation and recombination. Better values of these encoded parameters lead to better individuals, which are in turn more likely to survive and produce offspring and hence propagate these better parameter values.

The algorithm proposed in Brest et al. (2006), the jDE algorithm, employs a self-adaptive scheme to perform the automatic setting of the scale factor F and crossover rate Cr control parameters. The control parameter population size Np does not change during the run. The algorithm implements the DE/rand/1/bin mutation strategy.

In our study, we use the jDE algorithm proposed by Brest et al. (2006) and implemented by Conceição and Mächler (2015). The latter implementation differs from the DE algorithm proposed by Brest et al. (2006) most notably in the use of the DE/rand/1/either-or mutation strategy (Price et al. 2006) and a combination of jitter with dither (Storn 2008), and the immediate replacement of each worse parent in the current population by its newly generated better or equal offspring (Babu and Angira 2006) instead of updating the current population with all the new solutions simultaneously as in classical DE.
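
For illustration, a minimal call to the self-adaptive jDE implementation of the DEoptimR package (Conceição and Mächler 2015) on a toy objective could look as follows; the argument names reflect our reading of the package documentation:

```r
# Sketch: minimizing a toy sphere function with the self-adaptive jDE
# optimizer (DEoptimR); F and Cr are adapted internally, Np stays fixed.
library(DEoptimR)

sphere <- function(x) sum(x^2)

res <- JDEoptim(lower = rep(-5, 3), upper = rep(5, 3), fn = sphere,
                NP = 10 * 3, maxiter = 200 * 3, tol = 1e-7)
res$par    # best parameter vector found
res$value  # objective value at the best solution
```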

3 Proposed modeling strategy

In actual applications, many practitioners select the RVR and SVR hyperparameters empirically by trial and error or by using a grid search (exhaustive search) technique with a cross-validation method. These procedures are computationally intensive and may not result in the best performance. Choosing the optimal values for the RVR and SVR hyperparameters is important in obtaining accurate modeling results. In this study, we apply a self-adaptive DE algorithm to optimize the RVR and SVR hyperparameters for modeling the phosphorus concentration levels in the steelmaking process.

Fig. 2 shows the flowchart for implementing the RVR and SVR models optimized by the DE algorithm. The procedure is as follows (a code sketch of Steps 2 to 7 is given after the list):

  • Step 1: Collect the database of the process, select the variables, normalize the observations, and divide the dataset into the training and test datasets.

  • Step 2: Select the training dataset. Select the kernel function and set the initial RVR kernel parameters or free SVR hyperparameters (C, ε, and kernel parameters).

  • Step 3: Set the population size Np (Np = 10 × np) in the self-adaptive DE algorithm, where np is the number of RVR or SVR parameters, and set the stopping criteria: the maximum number of iterations (200 × np) and the tolerance (\(1 \times 10^{-7}\)).

  • Step 4: Train the RVR or SVR and calculate the fitness function value. The fitness function is defined as the root mean square error (RMSE) and the objective is to minimize the RMSE

$$ RMSE = \sqrt{ \frac{1}{n} \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} } = \sqrt{ \frac{1}{n} \sum_{i = 1}^{n} e_{i}^{2} } $$
(10)
where \(y_{i}\) is the observed value measured during the process, \(\hat{y}_{i}\) is the predicted value estimated by the model, \(e_{i}\) is the residual, and n is the number of observations used in fitting the model.

Fig. 2 Flowchart to implement the RVR and SVR models optimized by the DE algorithm

  • Step 5: If the maximum number of iterations or tolerance for the stopping criterion is reached, the smallest RMSE is selected, and the best parameter estimates of RVR or SVR are output.

  • Step 6: Train RVR or SVR with the best parameter estimates and obtain the optimized model.

  • Step 7: Obtain the predicted values for the training and test datasets using the optimized RVR or SVR model.

  • Step 8: Using the predicted values and residuals, perform a performance analysis of the model.

  • Step 9: If the model is valid, the RVR or SVR model to predict the phosphorus concentration levels in the steelmaking process is obtained.
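
The following R sketch illustrates Steps 2 to 7 for the DE-SVR model with the RBF kernel; it is indicative only, and the data frame, column names, and search bounds are assumptions based on the description above (e1071::svm wraps LIBSVM, and DEoptimR provides the self-adaptive jDE optimizer):

```r
library(e1071)     # SVR via LIBSVM
library(DEoptimR)  # self-adaptive jDE

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

# Step 1 (assumed done): normalized data split into 'train' and 'test'
# data frames, predictors in columns 1:5 and response in column "P".
x_train <- as.matrix(train[, 1:5]); y_train <- train$P
x_test  <- as.matrix(test[,  1:5]); y_test  <- test$P

# Step 4: fitness = training RMSE of an SVR with candidate
# hyperparameters theta = (C, epsilon, gamma).
fitness <- function(theta) {
  fit <- svm(x_train, y_train, type = "eps-regression", kernel = "radial",
             cost = theta[1], epsilon = theta[2], gamma = theta[3])
  rmse(y_train, predict(fit, x_train))
}

# Steps 3 and 5: search space and self-adaptive DE run
# (Np = 10 * np, 200 * np iterations, tolerance 1e-7).
np  <- 3
opt <- JDEoptim(lower = c(1, 0.001, 0.001), upper = c(50, 1, 1),
                fn = fitness, NP = 10 * np, maxiter = 200 * np, tol = 1e-7)

# Steps 6 and 7: refit with the best hyperparameters and predict.
best <- svm(x_train, y_train, type = "eps-regression", kernel = "radial",
            cost = opt$par[1], epsilon = opt$par[2], gamma = opt$par[3])
rmse(y_test, predict(best, x_test))
```

An analogous fitness function built around kernlab::rvm, with the kernel width as the only decision variable, would give the DE-RVR variant.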

We also use the grid search technique and cross-validation method to select the optimal parameters for the RVR and SVR models with the Gaussian RBF and Laplacian kernel functions. Because SVR has three free hyperparameters (C, ε, and kernel parameter \({\upgamma }\)) to tune, the use of a grid search technique with a cross-validation method allows the selection of the optimal parameters that have the smallest mean squared error (MSE). RVR has a kernel parameter \({\upgamma }\) to be tuned, for which we use a cross-validation method to select the optimal parameter that has the smallest MSE.

Regression models with good fits show small discrepancies between the observed and predicted values. The adequacy of a model is also an essential aspect because the relation between the response and the factors should be significant and independent of the number or type of input variables. The standard regression model assumes that the residuals are independent and identically distributed (i.i.d.) normal random variables with zero mean and constant variance.

To evaluate the generalization capacity of the models, the following error minimization strategies are used: the RMSE (Eq. 10), the mean squared error (MSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). The latter three are given by Eq. (11) to Eq. (13), where \({\text{y}}_{i}\) is the observed value, \({\hat{\text{y}}}_{i}\) is the predicted value and n is the number of observations.

$$ MSE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \hat{y}_{i} } \right)^{2} $$
(11)
$$ MAE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {y_{i} - \hat{y}_{i} } \right| $$
(12)
$$ MAPE = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {\left| {\frac{{y_{i} - \hat{y}_{i} }}{{\hat{y}_{i} }}} \right|} { \times }100 $$
(13)
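
For reference, a small R helper computing the measures of Eqs. (10) to (13) might look as follows (the MAPE denominator follows Eq. (13) as written):

```r
# Error measures of Eqs. (10)-(13) for observed y and predicted y_hat.
error_measures <- function(y, y_hat) {
  e <- y - y_hat
  c(MSE  = mean(e^2),
    RMSE = sqrt(mean(e^2)),
    MAE  = mean(abs(e)),
    MAPE = mean(abs(e / y_hat)) * 100)  # denominator as in Eq. (13)
}
```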

4 Applied study

This section presents the implementation of the models. First, we briefly describe the case study of the phosphorus concentration levels in the steelmaking process. Next, we present the implementation of the RVR and SVR models described in Section 3. We also compare the RVR and SVR models developed in this study with the ridge regression, MLR, model tree, ANN, and RVFL models. Simulations and calculations were performed with the open-source software R® (R 2018). The SVR implementation used LIBSVM, a library for support vector machines (Chang and Lin 2011). All programs ran on a personal computer with an Intel Core i7-2670QM, 2.2 GHz, 8 GB DDR3-1333 SDRAM, Windows 7 Professional 64-bit.

4.1 Case study

The implementation of the models is illustrated through an applied study for modeling the phosphorus concentration levels in the steelmaking process for Medium-Carbon Ferromanganese (FeMnMC). The study was carried out in a Brazilian steelmaking company. One of the main factors affecting the product quality in steelmaking companies is the existence of contaminants in alloy steel.

The refining process in this study uses high-purity oxygen to reduce the carbon level of High-Carbon Ferromanganese (FeMnHC), yielding FeMnMC, which has a higher market value. During this process, changes occur in the proportions of several elements, including that of phosphorus in the final product.

Phosphorus (P) is one of the main contaminants that interfere with the steelmaking process. Ferromanganese alloys are the major sources of P contamination during the steelmaking process, which requires limiting the use of this type of alloy during the process (Um et al. 2014). Increased phosphorus levels can significantly affect the physical aspects of alloy steel and severely compromise its quality. P-rich steel compounds usually exhibit: (i) increased hardness, (ii) decreased ductility, (iii) ghost lines in carbon-rich alloy steels, and (iv) increased brittleness of steel bonds at high and low temperatures (Chaudhary et al. 2001).

The FeMnMC steelmaking process has 21 initial input variables that are relevant for modeling the dephosphorization process. These input variables were grouped as follows: (i) composition of the FeMnHC alloys used as raw materials for the converter, (ii) composition of the slag, (iii) composition of the loads, and (iv) levels of alkalinity: binary, quaternary, and optical basicity. The output variable is the proportion of phosphorus (P) in the final process of FeMnMC steelmaking. The selected database covers a sample of 257 observations. Table 2 shows the variables related to the steelmaking process.

Table 2 Variables related to the steelmaking process

Pedrini and Caten (2010) developed seven MLR models to predict the phosphorus concentration level in this process. The refining process of ferromanganese consists of a decarburization reaction between liquid metal and oxygen injected in the metallic bath. To realize the dephosphorization process, CaO is dissolved during decarburization to reduce the proportion of phosphorus in the final product.

Pedrini and Caten (2010) adopted the suggestion of an engineer from the company to develop the MLR model called Model 8. This model uses the natural logarithm of the difference between the phosphorus concentration in the final process for FeMnMC and the phosphorus concentration of FeMnHC, the raw material in the refining process. The MLR model (Model 8) found to predict the phosphorus concentration levels is

$$ \ln \left( {P - P^{*} } \right) = - 0.804\ln \left( {Fe^{*} } \right) + 0.371\ln \left( {MnO} \right) - 0.656\ln \left( {CaO} \right) $$
(14)
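
As a hedged illustration, Model 8 could be refitted in R roughly as follows; the data frame and column names are hypothetical, and the formula omits an intercept because Eq. (14) has none:

```r
# Sketch of refitting Model 8, Eq. (14), by least squares (no intercept);
# assumes P > P_star for every observation so the logarithm is defined.
model8 <- lm(log(P - P_star) ~ 0 + log(Fe_star) + log(MnO) + log(CaO),
             data = train_raw)   # hypothetical data frame with raw values
summary(model8)                  # t-tests of the coefficients at the 5% level
```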

For the same process, Acosta et al. (2016) developed an MLP neural network to predict the phosphorus concentration levels. For the ANN model, they used an MLP network with 11 neurons in one hidden layer, the logistic activation function, a learning rate of 0.01, and a momentum rate of 0.1. The ANN gave an RMSE of 0.0151986 on the training dataset.

4.2 Implementation of the models

To develop this study, we used a database created from the information system of the company. This database contains all the variables related to the FeMnMC steelmaking process, as shown in Table 2.

The data preprocessing phase for the RVR and SVR models consisted of a correlation analysis between the process input variables and the phosphorus concentration in the final process, performed by applying the lasso method (Tibshirani 1996). The input variables selected are as follows: percentage of initial phosphorus in the alloy composition (P*), percentage of initial carbon in the alloy composition (C*), percentage of manganese oxide in the slag composition (MnO), percentage of calcium oxide in the slag composition (CaO), and liquid volume in the load composition (Liquid). The output variable is the phosphorus (P) concentration in the final process.
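
A sketch of such a lasso-based selection with the glmnet package, assuming a data frame named steel that holds the 21 candidate inputs and the response P, is:

```r
# Lasso variable selection (Tibshirani 1996) with cross-validated lambda;
# the data frame 'steel' and the response column 'P' are assumptions.
library(glmnet)

x_all <- as.matrix(steel[, setdiff(names(steel), "P")])
cv    <- cv.glmnet(x_all, steel$P, alpha = 1)  # alpha = 1 -> lasso penalty
coef(cv, s = "lambda.min")                     # nonzero rows = selected inputs
```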

The 257 observations were normalized into the interval [0, 1]. The observation set was then randomly divided into two parts: a training dataset composed of 205 (80%) observations and a test dataset composed of the remaining 52 (20%) observations. The training dataset was used to estimate the regression models representing the phosphorus concentration in the actual steelmaking process, and the test dataset was used to evaluate and compare the predictive power of the regression models.
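
A minimal sketch of this normalization and random split in R (the data frame name and the seed are assumptions) is:

```r
# Min-max normalization to [0, 1] followed by a random 80/20 split.
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
dat <- as.data.frame(lapply(steel_selected, min_max))  # assumed data frame of selected variables

set.seed(1)  # assumed seed, for reproducibility of the split
idx   <- sample(seq_len(nrow(dat)), size = round(0.8 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]
```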

For the RVR and SVR models, we used the kernel functions in Table 1. A flowchart of the RVR and SVR parameter selection is presented in Fig. 2. For the optimization tasks using the self-adaptive DE algorithm (Conceição and Mächler 2015), we used the RMSE as the fitness function of the training dataset, Eq. (10).

The search space of the control parameters for the RVR is \(\gamma \in \left[ {0.001;1} \right]\), \(u \in \left[ {0;10} \right]\) and \(d \in \left[ {1;5} \right]\). To tune the SVR parameters with the RBF kernel using the training data, we first used the procedure proposed by Cherkassky and Ma (2004) to obtain the three free hyperparameters (C, \(\varepsilon\) and RBF kernel parameter \(\gamma\)) and identify the best search region. The search space of the control parameters for the SVR is: \(C \in \left[ {1;50} \right]\), \(\varepsilon \in \left[ {0.001;1} \right]\), \(\gamma \in \left[ {0.001;1} \right]\), \(u \in \left[ {0;10} \right]\) and \(d \in \left[ {1;5} \right]\).

Table 3 shows the best RVR and SVR model parameters for the kernel functions, where DE represents optimization by the self-adaptive DE algorithm and GS represents selection by the 10-fold cross-validation method. We used the CPU running time (in seconds) to evaluate the speed of selecting the hyperparameters in the RVR and SVR models. According to the results listed in Table 3, the CPU time was reduced when we used the DE algorithm to optimize the RVR and SVR hyperparameters. The tuning of RVR involves only the kernel parameters, whereas SVR has more parameters to tune (C, \(\varepsilon\) and the kernel parameters). Because of this, the CPU times of the RVR models are smaller than those of the SVR models.

Table 3 Best RVR and SVR model parameters

From Table 3, the DE-RVR Laplacian kernel has a smaller RMSE value than the other RVR models, and the DE-SVR RBF kernel has a smaller RMSE value than the other SVR models. We observe that the DE-RVR Laplacian kernel performed slightly better than the DE-SVR RBF kernel and also produced a much smaller number of RVs (10) than the number of SVs (135) in the DE-SVR RBF kernel. The number of SVs corresponds to 65.8% of the training dataset; the proportion of SVs can be regarded as an indicator of the goodness of fit and the adequacy of the model, because a large number of SVs can cause overfitting. The number of RVs corresponds to only 4.9% of the training dataset, and the number of SVs is nearly thirteen times greater than the number of RVs.

From Table 3, it can also be seen that: (i) the DE-RVR Laplacian and DE-RVR RBF kernels have smaller RMSE values than the GS-RVR Laplacian and GS-RVR RBF kernels, (ii) the DE-SVR Laplacian and DE-SVR RBF kernels have smaller RMSE values than the GS-SVR Laplacian and GS-SVR RBF kernels, (iii) the GS-RVR Laplacian and GS-RVR RBF kernels performed slightly better than the GS-SVR Laplacian and GS-SVR RBF kernels, and (iv) the RVR linear kernel has the largest RMSE value.

Based on the error indices (Table 3), the selected RVR model is the DE-RVR Laplacian kernel with the optimal kernel parameter \(\gamma\) = 0.002224. The number of RVs is 10, and the RMSE on the training dataset is 0.0137368. The SVR model selected is the DE-SVR RBF kernel with the optimal parameters C = 3.3651, ε = 0.3449 and \(\gamma\) = 0.04175. The number of SVs is 135, and the RMSE on the training dataset is 0.0140385. These optimized RVR and SVR models were used to model the phosphorus concentration levels in the steelmaking process.

In this study, the performance of the regression models was evaluated using both residual analysis and error minimization measures. We tested the normality of the residuals of the fitted models using the Shapiro-Wilk test on the training data and obtained a p-value higher than 0.4 for both models, which indicates that the residuals are normally distributed. We examined the autocorrelation of the residuals using the Durbin-Watson test on the training data, and the results indicate no significant correlations in the residuals of the models. We used the Levene test to check homoscedasticity and obtained a p-value higher than 0.5, which means that the residuals can be considered to have constant variances. After these tests, we concluded that the residuals of the RVR and SVR models are independent and identically distributed (i.i.d.) normal random variables with constant variances. This is evidence of the goodness of the fits and shows that the models are appropriate for the observations. Therefore, the models can be utilized to predict the phosphorus concentration levels in the final process.
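
The residual checks described above can be sketched in R as follows; res and fit_vals denote the residuals and fitted values of a trained model (assumed objects), and the grouping used for the Levene test is our own choice:

```r
# Residual diagnostics on the training data (a sketch).
shapiro.test(res)                    # normality (Shapiro-Wilk)

dw <- sum(diff(res)^2) / sum(res^2)  # Durbin-Watson statistic; values near 2
dw                                   # suggest no first-order autocorrelation

library(car)
grp <- factor(fit_vals > median(fit_vals))  # two groups by fitted value (our choice)
leveneTest(res, grp)                 # homoscedasticity (Levene)
```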

Table 4 shows the statistical properties of the phosphorus concentration levels obtained by applying the models on the training and test datasets. The statistical properties obtained from the models were found to be similar to those obtained experimentally. Fig. 3 shows the predicted values against the observed values for the training and test datasets for these models. These goodness of fit graphs confirm the good predictive performance of the models.

Table 4 Statistical properties of the phosphorus concentration levels predicted from models
Fig. 3 Predicted values against observed values for the training and test datasets: (a) RVR model, (b) SVR model

Compared with traditional ANNs, SVR possesses some advantages: it has high generalization capability and avoids local minima, it always has a solution, it does not need the network topology to be determined in advance, and it has a simple geometric interpretation and provides a sparse solution. SVR provides good performance when the model parameters are well tuned. The disadvantages of SVR are that it requires the tuning of many model parameters and that the results obtained are not probabilistic (Wang et al. 2003; Tipping 2000).

There are some advantages associated with RVR. The generalization performance of RVR is comparable to an equivalent SVR. Furthermore, RVR produces probabilistic output, and there is no need to tune the regularization parameter C and the insensitivity parameter \(\varepsilon\) necessary in SVR. RVR yields sparse models with fewer relevance vectors (Tipping 2000).

Ridge regression (RR) is one of the methods to shrink the coefficients of correlated predictors towards each other (Marquardt and Snee 1975). The lambda (λ) parameter is the regularization penalty, and a cross-validation method can be used to select λ (Friedman, Hastie and Tibshirani 2010). We also used RR to model the phosphorus concentration levels using the training dataset. The λ is 0.001811864, and the RMSE on the training dataset is 0.0151184. The test dataset was used with the RR model to predict the future values of the phosphorus concentration levels.
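
A hedged sketch of such a ridge fit with the glmnet package (alpha = 0 selects the ridge penalty) is:

```r
# Ridge regression with cross-validated lambda (a sketch).
library(glmnet)

cv_rr   <- cv.glmnet(x_train, y_train, alpha = 0)  # alpha = 0 -> ridge
rr_fit  <- glmnet(x_train, y_train, alpha = 0, lambda = cv_rr$lambda.min)
pred_rr <- predict(rr_fit, newx = x_test)
```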

Pao et al. (1994) proposed the random vector functional link (RVFL) neural network. The RVFL is an extension of single-layer feedforward neural (SLFN) networks with additional direct connections from the input layer to the output layer (Qiu et al. 2018). The RVFL network has a set of nodes called enhancement nodes, which are equivalent to the neurons in the hidden layer of a conventional SLFN. In the RVFL, the values of the weights from the input layer to the hidden layer are randomly generated in a suitable domain and kept fixed during the learning stage (Zhang and Suganthan 2015). The number of enhancement nodes (hidden neurons) was determined by a cross-validation method in order to avoid overfitting. The number of enhancement nodes is 8, and the RMSE on the training dataset is 0.0146759.

Model trees (MT) use recursive partitioning to build a piecewise linear model in the form of a tree (Quinlan 1992). The idea is to split the training cases in much the same way as when growing a decision tree, using a criterion that minimizes the intra-subset variation of the class values rather than maximizing the information gain. M5 (Quinlan 1992) builds tree-based models but, whereas regression trees (Breiman et al. 1984) have constant values at their leaves, the trees constructed by M5 can have multivariate linear models. In this work, we used the M5 rule-based model with boosting and corrections based on nearest neighbors in the training dataset (Quinlan 1993; Fernández-Delgado et al. 2019). The M5 tunable hyperparameters were selected by a cross-validation method. The number of training committees is 2, the number of neighbors for prediction is 0, and the RMSE on the training dataset is 0.0152490.
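
A sketch of such a rule-based fit, assuming the Cubist package as the implementation of the boosted M5 rules with nearest-neighbor correction, is:

```r
# M5-style rule-based model (a sketch, assuming the Cubist package).
library(Cubist)

mt_fit  <- cubist(x = as.data.frame(x_train), y = y_train, committees = 2)
pred_mt <- predict(mt_fit, as.data.frame(x_test), neighbors = 0)
```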

In order to compare the models with the MLR model proposed by Pedrini and Caten (2010), the test dataset was used to predict future values of the phosphorus concentration levels. The model coefficients were estimated by the least squares method, with terms retained based on Student's t-test at the 5% significance level, Eq. (14). The RMSE on the training dataset is 0.0161804.

Table 5 summarizes the statistical measures of the error minimization results. Fig. 4 shows the predicted values against the observed values on the test dataset for the RVR, SVR, ANN, RVFL, RR, MLR, and MT models. Fig. 4 confirms that the RVR and SVR models have better performance than the other models.

Table 5 Statistical measures of error minimization results
Fig. 4 Predicted values against observed values for the test dataset for the models

We analyzed the results from Table 3, Table 4, Table 5, Fig. 3, and Fig. 4. We can observe that the RVR, SVR, ANN, RVFL, RR, MLR, and MT models achieved good performance in predicting the phosphorus concentration levels. In Fig. 3, we note that there is substantial agreement between the training results and the test results, indicating that there are no overfitting problems with the RVR and SVR models.

Analyzing the results in Table 5 for the test dataset, it can be seen that:

  (i) the RVR model has smaller values of MSE and RMSE than the other models;

  (ii) the SVR model has smaller values of MAE and MAPE than the other models;

  (iii) the ascending order of RMSE is RVR < SVR < ANN < RVFL < RR < MLR < MT;

  (iv) the ascending order of MAE is SVR < RVR < ANN < RVFL < RR < MLR < MT;

  (v) the machine learning techniques (RVR and SVR) have better performance than the statistical methods (RR and MLR).

Statistical tests are employed to give a detailed analysis of the performance differences among the regression models. Parametric tests assume a series of hypotheses on the data to which they are applied (independence, normality, and homoscedasticity). If such assumptions do not hold, the reliability of the tests is not guaranteed. Nonparametric tests do not assume particular characteristics for the underlying data distribution. Nonparametric tests can perform two classes of analysis: pairwise comparisons and multiple comparisons (Derrac et al. 2011; Latorre et al. 2020).

The Friedman test is a nonparametric analogue of the parametric two-way analysis of variance (Garcia et al. 2010). To calculate the statistic, the Friedman test ranks the performance of each model on each problem and computes the average rank of each model across problems (Carrasco et al. 2020). The null hypothesis states that all the models have the same performance. Once Friedman's test rejects the null hypothesis, we can proceed with the Bergmann-Hommel post-hoc test in order to find the pairs of models that produce differences (N × N comparisons) (Derrac et al. 2011).

We used the Friedman test with the seven models (RVR, SVR, ANN, RVFL, RR, MLR, and MT); the p-value reported by this test is 0.0033, which is significant at the α = 0.05 level. We then performed the Bergmann-Hommel post-hoc test in order to determine the location of the differences between these models.
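
A minimal sketch of this test in R, assuming a matrix abs_err of per-observation absolute test errors with one column per model, is:

```r
# Friedman test across the seven models (rows = test observations as
# blocks, columns = models); abs_err is an assumed error matrix.
friedman.test(abs_err)
# If the null hypothesis is rejected, a Bergmann-Hommel post-hoc analysis
# can be carried out, e.g., with the scmamp package.
```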

Fig. 5 shows the adjusted p-values obtained using the Bergmann-Hommel post-hoc test for multiple comparisons. The null hypothesis is rejected if the adjusted p-value is less than the significance level (α = 0.05); p-values below 0.05 indicate that the respective models differ significantly in prediction errors. We can observe that there are significant differences between RVR and MLR, RVR and MT, SVR and MLR, and SVR and MT. The differences among RVR, SVR, ANN, and RVFL are not significant.

Fig. 5 Adjusted p-values using the Bergmann-Hommel post-hoc test for multiple comparisons

5 Conclusions

The impurities in the metal alloys interfere with the steelmaking process. High levels of phosphorus can severely affect the physical integrity of steel bonds and threaten the quality of the final product.

In this work, we applied relevance vector machine for regression (RVR) and support vector machine for regression (SVR) optimized by a self-adaptive differential evolution algorithm to the predictive modeling of phosphorus concentration levels in a steelmaking process based on actual data.

In the past decade, relevance vector machines have gained the attention of many researchers. The relevance vector machine (RVM) is a Bayesian sparse kernel technique for regression and classification with a functional form identical to that of the support vector machine (SVM). The generalization performance of RVR and SVR depends on the correct specification of the hyperparameters. One of the most widely used approaches for selecting the RVR and SVR hyperparameters is the grid search technique with a cross-validation method. Differential evolution (DE) has also been used to optimize the RVR and SVR hyperparameters. It is essential to choose the best control parameters for DE to achieve optimal algorithm performance. Thus, we used a self-adaptive scheme to tune the DE parameters automatically.

We used five kernel functions and applied a self-adaptive DE algorithm to optimize the RVR and SVR hyperparameters. Based on the error indices, the RVR model selected is the DE-RVR Laplacian kernel and the SVR model selected is the DE-SVR RBF kernel.

We compared the performance of the RVR and SVR models with the RR, MLR, ANN, MT, and RVFL models. The comparative analysis shows that RVR and SVR perform better than the RR, MLR, ANN, MT, and RVFL models in predicting the phosphorus concentration levels in the steelmaking process.

We used the Friedman test and the Bergmann-Hommel post-hoc test. We observe that there are significant differences between RVR and MLR, RVR and MT, SVR and MLR, and SVR and MT. The differences among RVR, SVR, ANN, and RVFL are not significant.

RVR has slightly better performance than the other models. RVR has nearly the same performance as SVR, but RVR produced nearly thirteen times fewer RVs than the SVs produced by SVR. Furthermore, the tuning of RVR involves only the kernel parameters, whereas SVR has more parameters for tuning (C, \(\varepsilon\) and kernel parameters).

The results of this study indicate that the RVR and SVR models are adequate tools for predicting the phosphorus concentration levels in the steelmaking process. The proposed approach provides an effective strategy to support practitioners in modeling other chemical and industrial processes.