1 Introduction

The support vector machine (SVM) is a learning method that avoids the over-fitting, under-fitting and local-minimum problems associated with the network structure of artificial neural networks. As a binary classification algorithm, it has widespread applications in various fields of engineering and medical science (Subasi 2013; Dukart et al. 2013; Alajlan et al. 2012; Haddi et al. 2013; Roy et al. 2016). Time series prediction using SVM has been one of the popular research areas in recent years. The SVMs proposed for time series prediction cover many practical application areas, from financial market prediction to electric load forecasting and other scientific fields (Sapankevych and Sankar 2009). Pattern recognition is another field to which SVM has been applied frequently (Byun and Lee 2002; Bashbaghi et al. 2017; Christlein et al. 2017; Liu et al. 2018; Shah et al. 2017; Solera-Urena et al. 2012). For example, automatic speech recognition using SVM has been presented in Solera-Urena et al. (2012). In communication systems, various problems can be solved by SVM. For example, in wireless sensor networks, SVM can be used to solve localization problems (Zhu and Wei 2017). In Garcia et al. (2006), a robust SVM has been proposed for channel estimation in orthogonal frequency division multiplexing (OFDM) systems.

From a mathematical point of view, SVM is based on structural risk minimization using the maximum-margin idea (Ozer et al. 2011). In fact, a convex objective function is optimized to find the classifier (decision boundary). The kernel function plays a dominant role in the generalization performance of SVM. To be more precise, the kernel function maps the input data into a higher-dimensional feature space in which the data can be separated linearly, so the classification can be performed more accurately.

Nowadays, pattern recognition plays an important role in industry. Many industrial robots and other machines, such as sorters, perform classification. The food industry is one of the industries in which classification accuracy is critical. For example, a 1% misclassification rate for agricultural products such as pistachios, olives and walnuts can decrease the profit significantly. Thus, improving the accuracy of classifiers such as SVM is necessary and directly applicable in today's modern industry.

Since SVM was proposed, extensive studies have been devoted to enhancing its performance using different strategies. Reducing the number of support vectors is one of these strategies. For example, the approach proposed in Downs et al. (2001) recognizes and eliminates unnecessary support vectors while leaving the solution unchanged. Also, \(\nu\)-support vector classification introduces a regularization parameter \(\nu\) that controls the number of support vectors and margin errors (Gu and Sheng 2017). A robust SVM has been proposed in Song et al. (2002) that addresses the over-fitting problem caused by outliers in the training set and reduces the number of support vectors. Feature extraction is another way to enhance SVM performance. In Caoa et al. (2003), three approaches, namely kernel principal component analysis (KPCA), principal component analysis (PCA) and independent component analysis (ICA), have been compared for feature extraction in SVM. Feature normalization is another solution proposed in the literature (Steinwart et al. 2004; Bi et al. 2005). Modifying or changing the kernel function with the aim of enhancing SVM performance is another idea that has been studied extensively (Ye and Suganthan 2012; Zhang et al. 2004; Kuo et al. 2014; Izquierdo-Verdiguier et al. 2013; Moghaddam and Hamidzadeh 2016).

This paper presents a novel SVM based on semi-parametric linear models. Regression analysis is a statistical approach for studying the relationship between independent variables and a dependent variable. Regression methods are usually classified into two main categories: parametric and nonparametric. Parametric models are very helpful for studying the relationship between variables; however, they may be applied at the risk of introducing modeling biases. Nonparametric models require no prior model structure and can provide useful insight for further parametric fitting, but they suffer from drawbacks such as the curse of dimensionality, difficulty of interpretation, and lack of extrapolation capability. Semi-parametric linear models are therefore more useful and are applied in many applications, since parametric and nonparametric components can coexist in the model (Hesamian et al. 2017; Zarei et al. 2020).

This paper is organized as follows. Section 2 presents a brief review of SVM. Section 3 describes the proposed SVM based on semi-parametric linear regression. The results of experiments on typical classification problems are presented in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Support vector machine

2.1 Linear support vector machines

Without loss of generality, the classification problem in SVM is described for two-class problems. The aim is to classify the two groups using a function obtained from the available data. The main objective is to design a classifier that performs well on unseen data. Consider the data in Fig. 1. There are many acceptable linear decision boundaries that separate the data, but only one maximizes the margin, i.e., the distance between the classifier and the closest data point of each group. This linear classifier is known as the optimal separating hyper-plane. Intuitively, we expect this boundary to generalize better than the other possible boundaries (Gunn 1998).

Fig. 1 Optimal separating hyper-plane

Consider the training dataset \({\mathbf{x}}_{i} \in R^{d},\ i = 1,2,\ldots,N\), where each training sample has a label \(y_{i} \in \left\{ { - 1, + 1} \right\}\) and \(d\) is the dimension of the input space. Consider the hyper-plane described by

$$ {\mathbf{\theta }}^{T} .{\mathbf{x}} + b = 0 $$
(1)

The classification problem is to obtain the hyper-plane such that \({{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i} + b \ge + 1\) for the positive class and \({{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i} + b \le - 1\) for the negative class. According to Gunn (1998), in order to find the hyper-plane with the largest margin, the following minimization problem should be solved:

$$ \mathop {\min }\limits_{{{{\varvec{\uptheta}}},b}} \,J({{\varvec{\uptheta}}}) = \frac{{\left\| {{\varvec{\uptheta}}} \right\|^{2} }}{2} $$
(2)

The constraint of this minimization problem is

$$ y_{i} ({{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i} + b) - 1 \ge 0 $$
(3)

This quadratic programming (QP) problem can be solved by standard techniques such as Lagrange multipliers (Gutschoven and Verlinde 2000; Osuna et al. 1997) or intelligent optimization techniques such as particle swarm optimization (PSO) and genetic algorithm (GA) (Fateh and Khorashadizadeh 2012; Zadeh et al. 2016).
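As an illustration of how (2)-(3) can be handled numerically, the following sketch solves the primal problem with SciPy's SLSQP solver on a hypothetical, linearly separable toy set; it is only a sketch and not the solver used in the cited references.

```python
# Sketch: solving the hard-margin primal (2)-(3) with SciPy's SLSQP.
# The toy data below are hypothetical and serve only as a linearly
# separable illustration.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.5], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
d = X.shape[1]

def objective(v):
    theta = v[:d]                     # J(theta) = ||theta||^2 / 2, cf. (2)
    return 0.5 * np.dot(theta, theta)

def margin_constraints(v):
    theta, b = v[:d], v[d]            # y_i (theta^T x_i + b) - 1 >= 0, cf. (3)
    return y * (X @ theta + b) - 1.0

res = minimize(objective,
               x0=np.zeros(d + 1),
               constraints={"type": "ineq", "fun": margin_constraints},
               method="SLSQP")

theta, b = res.x[:d], res.x[d]
print("theta =", theta, " b =", b)
print("margins:", y * (X @ theta + b))   # all should be >= 1
```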

2.2 Nonlinear support vector machines and kernels

Real-life classification problems may be difficult to solve with a linear SVM. Therefore, its extension to nonlinear decision boundaries is inevitable. In nonlinear problems, the kernel function maps the input space into a high-dimensional feature space through a nonlinear transformation. For example, consider Fig. 2. According to the Cover theorem, the input space can be converted into a new feature space in which the patterns are linearly separable with high probability, provided the dimensionality of the feature space is high enough (Haykin 1999). This nonlinear transformation is performed implicitly through so-called kernel functions.

Fig. 2 Transforming the input data to the feature space using the nonlinear function \(\varphi\)

2.2.1 Inner-product kernels

In order to deal with nonlinear classification problems using SVM, a mapping \(\varphi :\,R^{n} \to H\) is required. This mapping transforms the input data into the Euclidean space \(H\), which has a significantly higher dimension, and the linear SVM is then applied in this new space. As a result, the training algorithm uses the data only through dot products in \(H\) of the form \(\varphi ({\mathbf{x}}_{i} ) \cdot \varphi ({\mathbf{x}}_{j} )\). If the number of training vectors \(\varphi ({\mathbf{x}}_{i} )\) is very large, the calculation of these dot products becomes time-consuming and computationally expensive. Moreover, \(\varphi\) is not known a priori. In this situation, the Mercer theorem (Burges 1998) is used to replace \(\varphi ({\mathbf{x}}_{i} ) \cdot \varphi ({\mathbf{x}}_{j} )\) by a positive definite symmetric kernel function \(K({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} )\), i.e., \(K({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = \varphi ({\mathbf{x}}_{i} ) \cdot \varphi ({\mathbf{x}}_{j} )\). To be more precise, kernel substitution paves the way for designing nonlinear methods from algorithms previously limited to linearly separable input data (Campbell 2000). In addition, it avoids the so-called curse of dimensionality (Vapnik 1995). Some typical kernel functions are listed in Table 1.

Table 1 Some typical kernels for nonlinear mapping in SVM
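Table 1 itself is not reproduced here; as a rough illustration, the sketch below implements the polynomial, Gaussian RBF and sigmoid kernels that are typically listed for SVM. The parameter values are hypothetical.

```python
# Sketch of typical inner-product kernels used with SVM (standard
# polynomial, Gaussian RBF and sigmoid choices; parameters are illustrative).
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """K(x, z) = (x^T z + c)^degree"""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, delta=-1.0):
    """K(x, z) = tanh(kappa * x^T z + delta)"""
    return np.tanh(kappa * np.dot(x, z) + delta)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```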

2.3 Designing SVM based on Lagrange optimization

The optimization problem described by (2) and (3) is a typical quadratic programming problem, since the objective function \(J({{\varvec{\uptheta}}})\) is quadratic and the constraints are linear. Based on this constrained optimization problem and considering the training set \(\left\{ {\left( {{\mathbf{x}}_{i} ,d_{i} } \right)} \right\}_{i = 1}^{N}\) and the Lagrange multipliers \(\left\{ {\alpha_{i} } \right\}_{i = 1}^{N}\), another problem called the dual problem can be formulated in the form of

$$ \mathop {\max }\limits_{\alpha } \,Q(\alpha ) = \sum\limits_{i = 1}^{N} {\alpha_{i} - \frac{1}{2}} \sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{N} {\alpha_{i} \alpha_{j} d_{i} d_{j} {\mathbf{x}}_{i}^{T} {\mathbf{x}}_{j} } } $$
(4)

subject to the constraints

$$ \sum\limits_{i = 1}^{N} {\alpha_{i} d_{i} } = 0 $$
(5)
$$ \alpha_{i} \ge 0\,\,\,\,\,\,\,i = 1,2,...,N $$
(6)

For non-separable patterns, the constraint (6) should be modified as (Byun and Lee 2002)

$$ 0 \le \alpha_{i} \le C\,\,\,\,\,\,\,\,\,\,\,\,i = 1,2,...,N $$
(7)

where C is a user-defined positive parameter. This optimization problem can be solved by several methods mentioned in Byun and Lee (2002).
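For concreteness, the following sketch solves the dual problem (4)-(7) with a linear kernel using SciPy's SLSQP routine; the toy data, the value of C and the bias computation are illustrative assumptions, not part of the methods cited above.

```python
# Sketch: the dual problem (4)-(7) with a linear kernel, solved by SLSQP.
# Toy data are hypothetical; C = 10 is an illustrative choice.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
d_lab = np.array([1, 1, 1, -1, -1, -1], dtype=float)   # labels d_i
N, C = len(X), 10.0

K = X @ X.T                                   # x_i^T x_j (linear kernel)
H = (d_lab[:, None] * d_lab[None, :]) * K

def neg_Q(alpha):                             # maximize Q(alpha) -> minimize -Q, cf. (4)
    return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

res = minimize(neg_Q,
               x0=np.zeros(N),
               bounds=[(0.0, C)] * N,                     # box constraint (7)
               constraints={"type": "eq",
                            "fun": lambda a: a @ d_lab},  # equality constraint (5)
               method="SLSQP")

alpha = res.x
theta = (alpha * d_lab) @ X                   # theta = sum_i alpha_i d_i x_i
sv = alpha > 1e-6                             # support vectors
b = np.mean(d_lab[sv] - X[sv] @ theta)        # bias estimated from support vectors
print("alpha =", np.round(alpha, 4))
print("theta =", theta, " b =", b)
```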

3 The proposed SVM based on semi-parametric linear regression

Consider the hyper-plane that classifies the training dataset in the form of

$$ {{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i} - b = 0\,\,\,\,\,\,\,\,\,\,\,y_{i} \in \{ - 1,1\} \,\,\,\,i = 1,2,...,m $$
(8)

in which \(y_{i} \in \{ - 1,1\},\ i = 1,2,\ldots,m\) is the label of each sample \({\mathbf{x}}_{i}\). Suppose that \({{\varvec{\uptheta}}},{\mathbf{x}}_{i} \in {\mathbb{R}}^{p}\), i.e., \({{\varvec{\uptheta}}}^{T} = \left[ \theta_{1}\ \ \theta_{2}\ \ \cdots\ \ \theta_{p} \right]\) and \({\mathbf{x}}_{i}^{T} = \left[ x_{i1}\ \ x_{i2}\ \ \cdots\ \ x_{ip} \right]\). The proposed method starts by rewriting (8) as

$$ {{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i} - f({\mathbf{x}}_{i} ) = 0 $$
(9)

where \(f({\mathbf{x}}_{i} )\) is a term that will be specified later. In fact, instead of obtaining a constant value for the parameter \(b\) in (8), we make it dependent on the sample \({\mathbf{x}}_{i}\). In other words, \(f({\mathbf{x}}_{i} )\) is the term that results in the best classification performance with the smallest error. Since \(f({\mathbf{x}}_{i} )\) is unknown, we estimate it by

$$ \hat{f}({\mathbf{x}}_{i} ) = \sum\limits_{j = 1}^{m} {w_{j} } ({\mathbf{x}}_{i} )({{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{j} ) $$
(10)
$$ w_{j} ({\mathbf{x}}_{i} ) = \frac{{K\left( {\frac{{\left\| {{\mathbf{x}}_{j} - {\mathbf{x}}_{i} } \right\|}}{h}} \right)}}{{\sum\nolimits_{j = 1}^{m} {K\left( {\frac{{\left\| {{\mathbf{x}}_{j} - {\mathbf{x}}_{i} } \right\|}}{h}} \right)} }} $$
(11)

where \(K(\cdot)\) is the kernel function and \(h\) is the smoothing parameter, obtained by cross-validation (Campbell 2000). The kernel function has the following properties (Wasserman 2006):

$$ \int {K(u)du = 1} ,\,\,\,\,\,\int {uK(u)du = 0,\,\,\,\,\,\,\int {u^{2} K(u)du = \sigma_{k}^{2} > 0} } $$
(12)

Some commonly used kernels are illustrated in Fig. 3.
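As a small illustration of (10)-(11), the sketch below computes the smoothing-weight matrix with the Gaussian kernel that is used later in (24); the sample matrix X and the bandwidth h are hypothetical placeholders.

```python
# Sketch: smoothing weights w_j(x_i) of Eq. (11) with the Gaussian kernel of Eq. (24).
import numpy as np

def smoothing_weights(X, h):
    """Return W with W[i, j] = w_j(x_i): Gaussian kernel values normalized over j."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / h)                     # kernel of Eq. (24)
    return K / K.sum(axis=1, keepdims=True)       # normalization of Eq. (11)

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])   # placeholder samples
W = smoothing_weights(X, h=2.0)
print(np.round(W, 3))
print(W.sum(axis=1))    # each row sums to 1 by construction
```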

Fig. 3 Some commonly used kernels for the calculation of weights (https://en.wikipedia.org/wiki/Kernel_(statistics))

Substitution of \(\hat{f}({\mathbf{x}}_{i} )\) from (10) into (9) results in:

$$ {{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i} - \sum\limits_{j = 1}^{m} {w_{j} } ({\mathbf{x}}_{i} )({{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{j} ) = 0\, $$
(13)

In other words, we have \({{\varvec{\uptheta}}}^{T} {\mathbf{x}}_{i}^{*} = 0\,\) in which

$$ {\mathbf{x}}_{i}^{*} = {\mathbf{x}}_{i} - \sum\limits_{j = 1}^{m} {w_{j} } ({\mathbf{x}}_{i} ){\mathbf{x}}_{j} $$
(14)
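A minimal sketch of the transformation (14), assuming the Gaussian kernel of (24) and the normalized weights of (11); the sample matrix and bandwidth are placeholders, not values from the paper.

```python
# Sketch: computing x_i^* of Eq. (14) from the smoothing weights of Eq. (11).
import numpy as np

def transform(X, h):
    """Return X_star with row i equal to x_i - sum_j w_j(x_i) x_j, cf. Eq. (14)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / h)                 # Gaussian kernel, Eq. (24)
    W = K / K.sum(axis=1, keepdims=True)      # normalized weights, Eq. (11)
    return X - W @ X

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0], [5.0, 4.0]])   # placeholder samples
print(transform(X, h=2.0))
```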

Now, in order to obtain the optimal values of the vector \({{\varvec{\uptheta}}}\) that yield the best classification performance, the following optimization problem should be solved:

$$ L = \mathop {\min }\limits_{{{\varvec{\uptheta}}}} \frac{1}{2}\left\| {{\varvec{\uptheta}}} \right\|^{2} \quad \text{subject to}\quad y_{i} ({{\varvec{\uptheta}}}^{T} .{\mathbf{x}}_{i}^{*} ) - 1 \ge 0 $$
(15)

Using Lagrange optimization, the cost function (15) is rewritten in the form of

$$ L_{p} = \mathop {\min }\limits_{{{\varvec{\uptheta}}}} \frac{1}{2}\left\| {{\varvec{\uptheta}}} \right\|^{2} - \sum\limits_{i = 1}^{m} {\alpha_{i} \left( {y_{i} ({{\varvec{\uptheta}}}^{T} .\,{\mathbf{x}}_{i}^{*} ) - 1} \right)} $$
(16)

where the \(\alpha_{i}\) are the Lagrange multipliers. Differentiating \(L_{p}\) with respect to \({{\varvec{\uptheta}}}\) and setting the result equal to zero \(\left( {i.e.,\frac{\partial }{{\partial {{\varvec{\uptheta}}}}}L_{p} = 0} \right)\), we get:

$$ {{\varvec{\uptheta}}} = \sum\limits_{i = 1}^{m} {\alpha_{i} y_{i} \,{\mathbf{x}}_{i}^{*} } $$
(17)

Due to the duality theorem (Bertsekas 1995), we can reformulate the cost function (16) as

$$ L_{D} = \sum\limits_{i = 1}^{m} {\alpha_{i} \left( {y_{i} ({{\varvec{\uptheta}}}^{T} .\,{\mathbf{x}}_{i}^{*} ) - 1} \right)} - \frac{1}{2}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{m} {\alpha_{i} \alpha_{j} y_{i} y_{j} \left( {\,{\mathbf{x}}_{i}^{{{\mathbf{*}}T}} .{\mathbf{x}}_{j}^{{\mathbf{*}}} } \right)} } $$
(18)

Substitution of (17) into (18) results in

$$ L_{D} = \sum\limits_{i = 1}^{m} {\alpha_{i} \left[ {y_{i} \left( {\sum\limits_{j = 1}^{m} {\alpha_{j} y_{j} \left( {\,{\mathbf{x}}_{i}^{{{\mathbf{*}}T}} .{\mathbf{x}}_{j}^{{\mathbf{*}}} } \right)} } \right) - 1} \right]} - \frac{1}{2}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{m} {\alpha_{i} \alpha_{j} y_{i} y_{j} \left( {\,{\mathbf{x}}_{i}^{{{\mathbf{*}}T}} .{\mathbf{x}}_{j}^{{\mathbf{*}}} } \right)} } $$
(19)

For classification problems in which the training set (input space) is not linearly separable, a nonlinear map is used to produce linearly separable data (feature space) (Haykin 2007). Suppose that \(\{ \varphi_{j} ({\mathbf{x}})\}_{j = 1}^{{m_{1} }}\) is a set of nonlinear transformations from the input space to the feature space and \(m_{1}\) is the dimension of the feature space. As a result, the hyper-plane acting as the decision surface is given by:

$$ \sum\limits_{j = 1}^{{m_{1} }} {\theta_{j} \varphi_{j} ({\mathbf{x}})} + b = 0 $$
(20)

Assuming \(\theta_{0} = b\) and \(\varphi_{0} ({\mathbf{x}}) = 1\), the hyper-plane (20) can be simply converted to

$$ \sum\limits_{j = 0}^{{m_{1} }} {\theta_{j} \varphi_{j} ({\mathbf{x}})} = 0 $$
(21)
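To make (20)-(21) concrete, the sketch below uses an explicit low-dimensional feature map whose monomials mirror the terms appearing in the decision boundaries of Examples 2-4; this particular map and the weight values are illustrative assumptions, not quantities from the paper.

```python
# Sketch: an explicit feature map phi and the decision surface of Eq. (21).
import numpy as np

def phi(x):
    """Map a 2-D input to (1, x1, x2, x1^2, x2^2, x1*x2); phi_0 = 1 absorbs b as theta_0."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

theta = np.array([0.5, -1.0, 2.0, 0.3, -0.7, 1.2])   # hypothetical weights, theta_0 = b
x = np.array([1.0, 2.0])
# Decision surface (21): sum_j theta_j phi_j(x) = 0; classify by the sign.
print(theta @ phi(x), np.sign(theta @ phi(x)))
```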

Adapting the optimal weight vector given in (17) to this new situation, in which we seek linear separability in the feature space, yields

$$ {{\varvec{\uptheta}}} = \sum\limits_{i = 1}^{m} {\alpha_{i} y_{i} \,\varphi ({\mathbf{x}}_{i}^{*} )} $$
(22)

As a result, the cost function of the dual problem given in (19) is rewritten as

$$ L_{D} = \mathop {\max }\limits_{{{\varvec{\upalpha}}}} \left\{ {\sum\limits_{i = 1}^{m} {\alpha_{i} \left[ {y_{i} \left( {\sum\limits_{j = 1}^{m} {\alpha_{j} y_{j} \psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right)} } \right) - 1} \right]} - \frac{1}{2}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{m} {\alpha_{i} \alpha_{j} y_{i} y_{j} \psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right)} } } \right\} $$
(23)

where \(\psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right) = \varphi ({\mathbf{x}}_{i}^{*} )\varphi ({\mathbf{x}}_{j}^{*} )\) is the inner-product kernel.
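Putting the pieces of this section together, the following sketch centres the samples with (11), (14) and (24) and then fits a kernel SVM on the transformed samples. scikit-learn's SVC is used here as a stand-in for the dual optimization (23), and the toy data are hypothetical; it is a sketch of the pipeline under these assumptions, not the authors' implementation.

```python
# Sketch of the SP-SVM pipeline: centre the samples with the semi-parametric
# weights (Eqs. 11, 14, 24), then fit a polynomial-kernel SVM on x_i^*.
import numpy as np
from sklearn.svm import SVC

def transform(X, h):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / h)                    # Gaussian kernel, Eq. (24)
    W = K / K.sum(axis=1, keepdims=True)         # weights, Eq. (11)
    return X - W @ X                             # x_i^*, Eq. (14)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

X_star = transform(X, h=40.0)
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)   # psi(u, v) = (u^T v + 1)^2
clf.fit(X_star, y)
print("training accuracy:", clf.score(X_star, y))
# Note: a new test point would need its weights computed against the training samples.
```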

4 Experimental results

In order to investigate the performance of the proposed method in classification problems, some examples are presented. Then, the Iris dataset is classified using the proposed method (SP-SVM), and the results of some previous related works on the Iris dataset are presented for comparison.

Example 1

Consider the training data and the corresponding labels presented in Fig. 4. In other words, this training data can be described as given in Table 2.

Fig. 4 Training data in Example 1

Table 2 Training data in Example 1

The kernel used for the semi-parametric regression is a Gaussian kernel:

$$ K\left( {\frac{{\left\| {{\mathbf{x}}_{j} - {\mathbf{x}}_{i} } \right\|}}{h}} \right) = \exp \left( { - \frac{{\left\| {{\mathbf{x}}_{j} - {\mathbf{x}}_{i} } \right\|^{2} }}{h}} \right) $$
(24)

and the polynomial kernel

$$ \psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right) = ({\mathbf{x}}_{j}^{*T} {\mathbf{x}}^{*}_{i} + 1) $$
(25)

has been selected for the nonlinear mapping in SVM. In this example, we have assumed that \(h^{ - 1} = 0.005\). The weights calculated by the proposed method are \(\theta_{1} = 3.09602,\theta_{2} = - 2.64302\), and the predicted outputs (labels) are

$$ y = \{ 0.999996,\; - 1.52512,\; 4.05512,\; 1.53007,\; - 0.999996,\; - 3.53841,\; 2.04821\} $$
(26)

It is obvious that all the data have been correctly classified, since the signs of the predicted values in (26) are the same as the signs of the labels in the training data (the third row in Table 2). In order to plot the decision boundary, one can use (14) to get

$$ \begin{gathered} {\mathbf{x}}_{i}^{*} = \left( {x_{i1}^{*} ,x_{i2}^{*} } \right)\quad i = 1,2,\ldots,8 \hfill \\ \left\{ \begin{gathered} x_{i1}^{*} = x_{i1} - \sum\limits_{j = 1}^{8} {\exp \left[ { - 0.005 \times \left( {\left( {x_{i1} - x_{j1} } \right)^{2} + \left( {x_{i2} - x_{j2} } \right)^{2} } \right)} \right] \times x_{j1} } \hfill \\ x_{i2}^{*} = x_{i2} - \sum\limits_{j = 1}^{8} {\exp \left[ { - 0.005 \times \left( {\left( {x_{i1} - x_{j1} } \right)^{2} + \left( {x_{i2} - x_{j2} } \right)^{2} } \right)} \right] \times x_{j2} } \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered} $$

The values of \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\) are presented in Table 3. Moreover, the decision boundary in the \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\) plane is plotted in Fig. 5.

Table 3 Transformed values in Example 1
Fig. 5 Decision boundary of Example 1 in the \(\left( {x_{i1}^{*} ,x_{i2}^{*} } \right)\) plane

Thus, using the aforementioned optimal weights and values obtained for \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) and the equation \({{\varvec{\uptheta}}}^{T} {\mathbf{x}}_{i}^{*} = 0\,\), the decision boundary in the \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) plane is given by

$$ 3.096x_{i1}^{*} - 2.64x_{i2}^{*} = 0\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,i = 1,2,...,8 $$
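A sketch of the Example 1 decision rule, using the reported weights and \(h^{-1} = 0.005\); the coordinates in X are placeholders standing in for Table 2, which is not reproduced in the text, so only the procedure is illustrated.

```python
# Sketch: Example 1 transformation and sign-based prediction with the reported weights.
import numpy as np

theta = np.array([3.09602, -2.64302])      # weights reported in Example 1
inv_h = 0.005                              # h^{-1} = 0.005 as assumed in the example

def transform_example1(X):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-inv_h * sq_dists)          # Gaussian kernel of Eq. (24)
    return X - K @ X                       # transformation displayed in Example 1

X = np.array([[1.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])  # placeholder data
X_star = transform_example1(X)
print(np.sign(X_star @ theta))             # predicted labels sign(theta^T x_i^*)
```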

Example 2

Consider the training data and the corresponding labels presented in Fig. 6. In other words, this training data can be described as given in Table 4.

Fig. 6 Training data in Example 2

Table 4 Training data in Example 2

The kernel used for the semi-parametric regression is the same as given in (24) with \(h = 40\), and the polynomial kernel

$$ \psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right) = ({\mathbf{x}}_{j}^{*T} {\mathbf{x}}^{*}_{i} + 1)^{2} $$
(27)

has been adopted for the nonlinear mapping in SVM. The weights calculated by the proposed method are \(\theta _{1} = - 33.9996,\theta _{2} = 21.9998,\theta _{3} = 9.99987, \theta _{4} = - 5.9996,\theta _{5} = 2.0001\) and the predicted outputs (labels) are

$$ y = \{ 0.99982,\; 6.99987,\; 0.9994,\; - 0.99987,\; - 10.9999,\; 9.9997,\; 0.9993,\; 0.9994\} $$
(28)

As can be seen, the proposed method classifies these data correctly. In order to obtain the decision boundary, one can use (14) to get

$$ \begin{gathered} {\mathbf{x}}_{i}^{*} = \left( {x_{i1}^{*} ,x_{i2}^{*} } \right)\quad i = 1,2,\ldots,8 \hfill \\ x_{ik}^{*} = x_{ik} - \sum\limits_{j = 1}^{8} {0.5 \times I\left[ {D_{ij} \le 40} \right] \times x_{jk} } \quad k = 1,2 \hfill \\ D_{ij} = \left( {x_{i1} - x_{j1} } \right)^{2} + \left( {x_{i2} - x_{j2} } \right)^{2} + \left( {x_{i1}^{2} - x_{j1}^{2} } \right)^{2} + \left( {x_{i2}^{2} - x_{j2}^{2} } \right)^{2} + \left( {x_{i1} x_{i2} - x_{j1} x_{j2} } \right)^{2} \hfill \\ I\left( A \right) = \left\{ \begin{gathered} 1\quad A \hfill \\ 0\quad A^{c} \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered} $$

Thus, using the aforementioned optimal weights and values obtained for \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) and the equation \({{\varvec{\uptheta}}}^{T} {\mathbf{x}}_{i}^{*} = 0\,\), the decision boundary in the \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) plane is given by

$$ - 34x_{{i1}}^{*} + 22x_{{i2}}^{*} + 10\left( {x_{{i1}}^{2} } \right)^{*} - 6\left( {x_{{i2}}^{2} } \right)^{*} + 2\left( {x_{{i1}} x_{{i2}} } \right)^{*} = 0\;\;i = 1,2,...,8 $$
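A sketch of the indicator-based transformation used in Examples 2-4, assuming (as the starred monomials in the boundary equations suggest) that the same indicator weights are applied to the higher-order monomial coordinates as well; the data are placeholders.

```python
# Sketch: indicator-weighted transformation of Examples 2-4 on the monomial features.
import numpy as np

def monomials(X):
    """Stack the monomials (x1, x2, x1^2, x2^2, x1*x2) used in Examples 2-4."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def transform_indicator(X, threshold=40.0, weight=0.5):
    F = monomials(X)
    sq_dists = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    ind = (sq_dists <= threshold).astype(float)   # indicator I(A)
    # Assumption: the same weights are applied to every monomial coordinate,
    # which is what the starred monomials in the boundary equations suggest.
    return F - weight * (ind @ F)

X = np.array([[1.0, 0.5], [0.5, 1.0], [-1.0, -0.5], [-0.5, -1.0]])   # placeholder data
print(np.round(transform_indicator(X), 3))
```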

Example 3

Consider the training data and the related classes presented in Fig. 7. This training data can be described as given in Table 5.

Fig. 7 Training data in Example 3

Table 5 Training data in Example 3

The kernel used for the semi-parametric regression is the same as given in (24) with \(h = 40\), and the polynomial kernel \(\psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right) = ({\mathbf{x}}_{j}^{*T} {\mathbf{x}}^{*}_{i} + 1)^{2}\) has been adopted for feature transformation. The weights obtained by the proposed method are θ1 = − 28.33, θ2 = − 5.77, θ3 = 22.1099, θ4 = 9.11057, θ5 = − 19.5545 and the predicted outputs (labels) are

$$ y = \{ 15.22,\; - 0.99,\; 0.99,\; - 3.77,\; - 39.55,\; - 3.55,\; - 0.99,\; 43.1,\; 20.22,\; 0.99\} $$
(29)

As can be seen, the proposed method classifies these data correctly. In order to obtain the decision boundary, one can use (14) to get

$$ \begin{gathered} {\mathbf{x}}_{i}^{*} = \left( {x_{i1}^{*} ,x_{i2}^{*} } \right)\quad i = 1,2,\ldots,11 \hfill \\ x_{ik}^{*} = x_{ik} - \sum\limits_{j = 1}^{11} {0.5 \times I\left[ {D_{ij} \le 40} \right] \times x_{jk} } \quad k = 1,2 \hfill \\ D_{ij} = \left( {x_{i1} - x_{j1} } \right)^{2} + \left( {x_{i2} - x_{j2} } \right)^{2} + \left( {x_{i1}^{2} - x_{j1}^{2} } \right)^{2} + \left( {x_{i2}^{2} - x_{j2}^{2} } \right)^{2} + \left( {x_{i1} x_{i2} - x_{j1} x_{j2} } \right)^{2} \hfill \\ I\left( A \right) = \left\{ \begin{gathered} 1\quad A \hfill \\ 0\quad A^{c} \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered} $$

Thus, using the aforementioned optimal weights and values obtained for \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) and the equation \({{\varvec{\uptheta}}}^{T} {\mathbf{x}}_{i}^{*} = 0\,\), the decision boundary in the \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) plane is given by

$$ - 28.33x_{i1}^{*} - 5.78x_{i2}^{*} + 22.11\left( {x_{i1}^{2} } \right)^{*} + 9.11\left( {x_{i2}^{2} } \right)^{*} - 19.55\left( {x_{i1} x_{i2} } \right)^{*} = 0\,\,\,\,\,\,\,\,i = 1,\,...,11 $$

Example 4

Consider the training data and the related classes presented in Fig. 8. This training data can be described as given in Table 6.

Fig. 8 Training data in Example 4

Table 6 Training data in Example 4

The kernel used for the semi-parametric regression is the same as given in (24) with \(h = 40\), and the polynomial kernel \(\psi \left( {{\mathbf{x}}_{i}^{*} ,{\mathbf{x}}_{j}^{*} } \right) = ({\mathbf{x}}_{j}^{*T} {\mathbf{x}}^{*}_{i} + 1)^{2}\) has been used for feature transformation. The weights achieved by the proposed method are \(\theta_{1} = - 6.15384,\theta_{2} = - 11.6923,\theta_{3} = 5.84615,\theta_{4} = 6.15383,\theta_{5} = - 4.76922\), and the predicted outputs (labels) are

$$ y = \{ 9.3,\,\, - 0.99,\,\,\,0.99,\,\,\,3.61,\,\, - 11.46,\,\,\, - 2.076,\,\,\, - 0.99,12.3,13.53,1,0.99\} $$
(30)

As can be seen, the proposed method classifies these data correctly. In order to obtain the decision boundary, one can use (14) to get

$$ \begin{gathered} {\mathbf{x}}_{i}^{*} = \left( {x_{i1}^{*} ,x_{i2}^{*} } \right)\quad i = 1,2,\ldots,11 \hfill \\ x_{ik}^{*} = x_{ik} - \sum\limits_{j = 1}^{11} {0.5 \times I\left[ {D_{ij} \le 40} \right] \times x_{jk} } \quad k = 1,2 \hfill \\ D_{ij} = \left( {x_{i1} - x_{j1} } \right)^{2} + \left( {x_{i2} - x_{j2} } \right)^{2} + \left( {x_{i1}^{2} - x_{j1}^{2} } \right)^{2} + \left( {x_{i2}^{2} - x_{j2}^{2} } \right)^{2} + \left( {x_{i1} x_{i2} - x_{j1} x_{j2} } \right)^{2} \hfill \\ I\left( A \right) = \left\{ \begin{gathered} 1\quad A \hfill \\ 0\quad A^{c} \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered} $$

Thus, using the aforementioned optimal weights and values obtained for \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) and the equation \({{\varvec{\uptheta}}}^{T} {\mathbf{x}}_{i}^{*} = 0\,\), the decision boundary in the \(\left( {x_{i1}^{*} \,,\,x_{i2}^{*} } \right)\,\) plane is given by

$$ - 6.15x_{i1}^{*} - 11.698x_{i2}^{*} + 5.85\left( {x_{i1}^{2} } \right)^{*} + 6.15\left( {x_{i2}^{2} } \right)^{*} + 4.77\left( {x_{i1} x_{i2} } \right)^{*} = 0\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,i = 1,2,...,11 $$

Example 5 In order to test the performance of the proposed SVM on real data, the Iris dataset from the UCI Machine Learning Repository is used. Since there are 3 classes (+1, 0 and -1) in this dataset, the proposed method is applied twice. Since there are 4 inputs in this dataset, it is impossible to plot them and visualize the decision boundary directly. However, in order to decide which class should be separated in the first stage, consider Figs. 9, 10, 11, 12, 13 and 14, which show some two-dimensional plots of the input features. As seen in these figures, the green cluster with the desired output "+1" is completely separated from the other clusters. Therefore, in the first step, the data are classified into 2 groups: the green patterns are labeled "+1," and the black and red patterns are labeled "−1." In order to obtain the decision boundary, one can use (14) to obtain:

$$ \begin{gathered} {\mathbf{x}}_{i}^{*} = \left( {x_{i1}^{*} ,x_{i2}^{*} ,x_{i3}^{*} ,x_{i4}^{*} } \right)\quad i = 1,2,\ldots,150 \hfill \\ x_{ik}^{*} = x_{ik} - \sum\limits_{j = 1}^{11} {10 \times I\left[ {D_{ij} \le 40} \right] \times x_{jk} } \quad k = 1,\ldots,4 \hfill \\ D_{ij} = \sum\limits_{k = 1}^{4} {\left( {x_{ik} - x_{jk} } \right)^{2} } + \sum\limits_{k = 1}^{4} {\left( {x_{ik}^{2} - x_{jk}^{2} } \right)^{2} } + \sum\limits_{k < l} {\left( {x_{ik} x_{il} - x_{jk} x_{jl} } \right)^{2} } \hfill \\ I\left( A \right) = \left\{ \begin{gathered} 1\quad A \hfill \\ 0\quad A^{c} \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered} $$
Fig. 9 Input features of Iris dataset in the \(\left( {x_{2} ,x_{1} } \right)\) plane

Fig. 10 Input features of Iris dataset in the \(\left( {x_{3} ,x_{1} } \right)\) plane

Fig. 11 Input features of Iris dataset in the \(\left( {x_{4} ,x_{1} } \right)\) plane

Fig. 12 Input features of Iris dataset in the \(\left( {x_{3} ,x_{2} } \right)\) plane

Fig. 13 Input features of Iris dataset in the \(\left( {x_{4} ,x_{2} } \right)\) plane

Fig. 14 Input features of Iris dataset in the \(\left( {x_{4} ,x_{3} } \right)\) plane

After running the optimization problem, the optimal weights are obtained and the optimal hyper-plane is given by

$$ \begin{gathered} - 0.104x_{i1}^{*} - 0.042x_{i2}^{*} + 0.0072x_{i3}^{*} + 0.0316x_{i4}^{*} + 0.0246\left( {x_{i1}^{2} } \right)^{*} \hfill \\ - 0.106\left( {x_{i2}^{2} } \right)^{*} + 0.227\left( {x_{i3}^{2} } \right)^{*} + 0.096\left( {x_{i4}^{2} } \right)^{*} - 0.0796\left( {x_{i1} x_{i2} } \right)^{*} \hfill \\ + 0.0743\left( {x_{i1} x_{i3} } \right)^{*} + 0.19\left( {x_{i1} x_{i4} } \right)^{*} + 0.1778\left( {x_{i2} x_{i3} } \right)^{*} \hfill \\ + 0.146\left( {x_{i2} x_{i4} } \right)^{*} + 0.188\left( {x_{i3} x_{i4} } \right)^{*} = 0\quad i = 1,2,\ldots,150 \hfill \\ \end{gathered} $$

This hyper-plane can classify the patterns correctly without any error.

Now, a hyper-plane should be calculated to classify the red and black patterns. Therefore, the red group is labeled by “+1” and the black group is labeled by “−1.” In order to obtain the decision boundary, one can use (14) to obtain:

$$ \begin{gathered} {\mathbf{x}}_{i}^{*} = \left( {x_{i1}^{*} ,x_{i2}^{*} ,x_{i3}^{*} ,x_{i4}^{*} } \right)\quad i = 1,2,\ldots,150 \hfill \\ x_{ik}^{*} = x_{ik} - \sum\limits_{j = 1}^{11} {8 \times I\left[ {D_{ij} \le 40} \right] \times x_{jk} } \quad k = 1,\ldots,4 \hfill \\ D_{ij} = \sum\limits_{k = 1}^{4} {\left( {x_{ik} - x_{jk} } \right)^{2} } + \sum\limits_{k = 1}^{4} {\left( {x_{ik}^{2} - x_{jk}^{2} } \right)^{2} } + \sum\limits_{k < l} {\left( {x_{ik} x_{il} - x_{jk} x_{jl} } \right)^{2} } \hfill \\ I\left( A \right) = \left\{ \begin{gathered} 1\quad A \hfill \\ 0\quad A^{c} \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered} $$

After performing the optimization procedure, the optimal weights are obtained and the optimal hyper-plane is given by:

$$ \begin{gathered} 2.726x_{i1}^{*} + 15.808x_{i2}^{*} - 3.713x_{i3}^{*} - 7.835x_{i4}^{*} + 0.042\left( {x_{i1}^{2} } \right)^{*} \hfill \\ + 8.681\left( {x_{i2}^{2} } \right)^{*} + 1.687\left( {x_{i3}^{2} } \right)^{*} - 0.061\left( {x_{i4}^{2} } \right)^{*} - 7.84\left( {x_{i1} x_{i2} } \right)^{*} \hfill \\ + 0.811\left( {x_{i1} x_{i3} } \right)^{*} + 5.774\left( {x_{i1} x_{i4} } \right)^{*} - 3.98\left( {x_{i2} x_{i3} } \right)^{*} \hfill \\ - 8.94\left( {x_{i2} x_{i4} } \right)^{*} + 1.486\left( {x_{i3} x_{i4} } \right)^{*} = 0\quad i = 1,2,\ldots,150 \hfill \\ \end{gathered} $$

Using these two hyper-planes, only 4 patterns of the 150 patterns in Iris dataset will be classified wrongly. Therefore, the accuracy of the proposed method is 97.33%.
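As a rough illustration of the two-stage scheme on Iris, the sketch below first separates the "+1" class (setosa) from the rest and then separates the remaining two classes; a standard polynomial-kernel SVC is used as a stand-in for the SP-SVM hyper-planes reported above, so it will not reproduce the 97.33% figure exactly.

```python
# Sketch of the two-stage classification scheme used on the Iris data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # y in {0, 1, 2}; class 0 is setosa

# Stage 1: setosa (+1) versus the other two classes (-1).
y1 = np.where(y == 0, 1, -1)
stage1 = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y1)

# Stage 2: versicolor (+1) versus virginica (-1) on the remaining samples.
mask = y != 0
y2 = np.where(y[mask] == 1, 1, -1)
stage2 = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X[mask], y2)

# Combine the two binary decisions into a three-class prediction.
pred = np.where(stage1.predict(X) == 1, 0,
                np.where(stage2.predict(X) == 1, 1, 2))
print("training accuracy:", np.mean(pred == y))
```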

In order to compare the performance of the proposed SVM with previous related works, consider the results in Siswantoro et al. (2016), Zahiri and Seyedin (2007) and Tanveer et al. (2021). In Siswantoro et al. (2016), a combination of a Kalman filter and a multi-layer perceptron (NN-LMKF) has been presented; a linear model based on the Kalman filter is used as a post-processing unit after the multi-layer perceptron. In Zahiri and Seyedin (2007), a swarm intelligence-based classifier (IPS) has been presented. In Tanveer et al. (2021), several classification algorithms based on K-nearest neighbors (KNN) have been presented; for example, the accuracies of K-nearest-neighbor-based weighted multi-class twin support vector machines (KWMTSVM), the support vector classification-regression machine for K-class classification (K-SVCR) and twin multi-class classification support vector machines (twin-KSVC) are reported in Tanveer et al. (2021). In Wu et al. (2019), the results of a random forest classifier on the Iris dataset have been presented. Table 7 compares the results of the aforementioned algorithms with those of the proposed method (SP-SVM). As shown in Table 7, the proposed method outperforms the aforementioned algorithms.

Table 7 Comparison of Iris dataset classification accuracy of different algorithms

5 Conclusion

In this paper, a new version of SVM has been proposed based on a semi-parametric linear model. The similarity between the hyper-plane in SVM and parametric or semi-parametric models in statistics has been the main motivation of this paper. In other words, analogous to semi-parametric regression models, in which the coefficients are functions of the input data, a new version of SVM has been developed in which the parameters of the hyper-plane are functions of the input data. As a result, the proposed classifier is more flexible than the conventional SVM, since borderline data can be assigned to the correct class owing to the fact that the parameters of the hyper-plane are not constant. The kernels used for the data transformation are simple kernels such as polynomial kernels, and the kernel used in the semi-parametric model is the Gaussian kernel. The numerical results show that the proposed method can successfully be used in classification problems.