1 Introduction

Support vector regression (SVR), which is based on the theory of structural risk minimization in statistical learning, has been applied to many problems (Smola 2002, 2004; Cristianini and Shawe-Taylor 2000). Traditionally, support vector regression finds an estimating function by solving a quadratic programming problem, and this formulation is known as QPSVR (Smola 2004; Cristianini and Shawe-Taylor 2000; Muller et al. 2001). Subsequently, Smola proposed linear programming support vector regression (LPSVR) (Smola 2002; Smola et al. 1999). Both LPSVR and QPSVR adopt the \(\varepsilon \)-insensitive loss function and a kernel function in feature space. However, LPSVR is advantageous over QPSVR in terms of model sparsity, the ability to use more general kernel functions and fast learning (Lu et al. 2009; Lu and Sun 2009; Smola 2004; Zhao and Sun 2011).

Support vector regression aims at learning an unknown function from a set of training data samples. However, in some practical applications, it is complex and costly to obtain sufficient experimental data, and with only a few data samples it is difficult to obtain an accurate data-driven model. Moreover, many functions in engineering problems comprise both steep and smooth variations, which makes it even more difficult to obtain a satisfactory model from a small amount of data (Clarke et al. 2005). In this paper, we focus on the problem of how to obtain an accurate model from a limited amount of experimental data.

In order to improve the modeling accuracy, Lanckriet proposed a multi-kernel support vector regression based on conic combinations of kernel matrices, and formulated the algorithm as a convex quadratically constrained quadratic program (QCQP) (Lanckriet et al. 2004; Bach et al. 2004). Although the formulation yields globally optimal solutions, it is computationally inefficient and requires a commercial solver. Subsequently, the multi-kernel learning algorithm was reformulated as a semi-infinite linear program to obtain a more general and efficient algorithm (Sonnenburg et al. 2006; Lanckriet et al. 2004). Based on the principles of kernel–target alignment and predictive accuracy, Qiu proposed three heuristic methods to speed up the computation of the QCQP formulation (Qiu and Lane 2009). In Nguyen and Tay (2008), a multi-kernel semi-parametric support vector regression was proposed by using a quadratic programming solver and a semi-parametric algorithm. Instead of a single kernel, multi-scale support vector regression has been presented by using the same kernel with multiple scales (Zheng et al. 2006; Yu and Qian 2008). All of these multi-kernel support vector regressions can establish an accurate model if a sufficient amount of training data is available (Subrahmanya and Shin 2010; Mingqing et al. 2009). However, in some practical applications the training data are so scarce that the models developed by these algorithms cannot meet the desired requirements.

In some engineering applications, a certain amount of knowledge about the problem is usually known beforehand (Sanchez 2003). This prior knowledge can take many forms, such as a simulation model of the practical application, the shape of the function on a particular region, and equality or inequality constraints (Lauer and Bloch 2008a; Trnka and Havlena 2009). By utilizing the prior knowledge, one can improve the predictive accuracy of support vector regression. In Bloch et al. (2008) and Lauer and Bloch (2008a), the authors reviewed three methods of incorporating prior knowledge into support vector machines for classification, comprising sample methods, kernel methods and optimization methods. In Lauer and Bloch (2008b), the authors explore the incorporation of different kinds of prior knowledge into support vector regression by modifying the problem formulation. In addition, prior knowledge over an arbitrary region has been incorporated into a kernel approximation problem, where the region has to be discretized so that the prior knowledge enters the learning framework as a finite set of inequalities (Olvi et al. 2004; Mangasarian and Wild 2007). Although prior knowledge can compensate for a small amount of measured data, all of these algorithms exploit a single kernel and do not employ the advantages of multi-kernel functions.

The motivation of this investigation is to provide a regression algorithm which incorporates multi-kernel functions and prior knowledge into the LPSVR. The proposed algorithm can improve the data-based modeling accuracy when the amount of measured data is insufficient. The prior knowledge comes from a physical simulator which generates simulation data for arbitrarily chosen inputs in order to compensate for the lack of measured data in some regions of the input space. This problem, although particular, is representative of numerous situations met in engineering, where physical models, more or less accurate, exist and provide prior knowledge in the form of simulation data, and where the measured data are difficult or expensive to obtain. This paper therefore focuses on how to utilize the prior knowledge and multi-kernel functions to approximate complex functions from an insufficient amount of measured data.

In this paper, multiple kernels and prior knowledge from a physical simulator are incorporated into the framework of the LPSVR to improve the modeling accuracy. The contribution of this paper is a novel algorithm, termed multi-kernel prior knowledge linear programming support vector regression (MKPLPSVR). In the algorithm, multiple feature spaces are utilized to incorporate multi-kernel functions into the framework of the LPSVR, and the prior knowledge from a physical simulator, which may be exact or biased, is then incorporated into the LPSVR by modifying the optimization objective and the inequality constraints. In the end, prior knowledge and multi-kernel functions are simultaneously incorporated into the framework of the LPSVR, and a new formulation is presented to solve the regression problem from a limited amount of measured data. The MKPLPSVR can be easily solved by linear programming and differs from other multi-kernel support vector regressions in its basic principle and solution method. In addition, a strategy for parameter selection is presented to facilitate the practical application of the proposed MKPLPSVR algorithm.

The rest of the paper is organized as follows. The LPSVR is introduced in the next section. Section 3 describes the proposed algorithm which incorporates prior knowledge and multiple kernels into the framework of the LPSVR. Section 4 gives experimental results on a synthetic example and two practical applications. Finally, Sect. 5 concludes the paper.

The following generic notations will be used throughout this paper. Lower case symbols such as \(y,x_{ij} ,y_{ij},\ldots \) refer to scalar valued objects and lower case boldface symbols such as \(\varvec{x},\varvec{y},\varvec{\alpha },\ldots \) refer to column vectors. Matrices are boldface and uppercase. The 1-norm \(\left\| \varvec{x} \right\| _1 \) denotes \(\sum _{i=1}^n {\left| {x_i } \right| } \). The matrix \(\varvec{X}\in \text{ R }^{N\times d}\) contains all training samples \(\varvec{x}_i ( {i=1,\ldots ,N})\) as rows. The notation \(\varvec{A}\in \text{ R }^{N\times n}\) signifies a real \(N\times n\) matrix, and the \(j\)th column of a matrix \(\varvec{A}\) is denoted as \(\varvec{A}_{.j} \). The matrix \(\varvec{A}^{\mathrm{T}}\) denotes the transpose of \(\varvec{A}\). For \(\varvec{A}\in \text{ R }^{n\times N}\) and \(\varvec{B}\in \hbox {R}^{n\times L}\), a kernel \(\varvec{K}( {\varvec{A},\varvec{B}})\) maps \(\hbox {R}^{n\times N}\times \hbox {R}^{n\times L}\) into \(\hbox {R}^{N\times L}\). In particular, if \(\varvec{x}\) and \(\varvec{y}\) are column vectors in \(\hbox {R}^N\), then the mapping \(k( {\varvec{x},\varvec{y}})\) is a real number. The vectors \(\varvec{o}\) and \(\varvec{e}\) are vectors of appropriate dimensions with all their components, respectively, equal to 0 and 1. The superscript \(p\) refers to the prior knowledge, and the subscript \(k\) indexes the data samples from the prior knowledge.

2 Review of LPSVR

Let \(\text{ M }=\left\{ {({\varvec{x}}_i ,y_i ),i=1,\ldots ,N} \right\} \) be an experimental dataset, where the input is \({\varvec{x}}_i \in \text{ R }^d\) and the output is \(y_i \in \text{ R }\). The regression is considered as a linear function in the feature space induced by a nonlinear mapping \(\varphi ( {\varvec{x}})\). The regression function is written as:

$$\begin{aligned} f( {\varvec{x}})={\varvec{\omega }}\cdot \varphi ( {\varvec{x}})+b \end{aligned}$$
(1)

where \({\varvec{\omega }}\) is a normal vector in the feature space, and \(b\) is a bias term.

The normal vector \({\varvec{\omega }}\) can be considered as a linear combination of the training patterns, i.e., \({\varvec{\omega }}=\sum _{i=1}^N {\alpha _i \varphi ( {{\varvec{x}}_i })} \). Therefore, the regression function in the original space is expressed as:

$$\begin{aligned} f( {\varvec{x}})=\sum \limits _{i=1}^N {\alpha _i k( {{\varvec{x}},{\varvec{x}}_i })} +b \end{aligned}$$
(2)

where \(k( {{\varvec{x}},{\varvec{x}}_i })=\varphi ( {{\varvec{x}}_i })\cdot \varphi ( {\varvec{x}})\) is the kernel function; common choices include the Gaussian radial basis function, the polynomial kernel, and even non-Mercer kernels (Smola et al. 1999; Lu and Sun 2009; Lu et al. 2009).

Instead of choosing the flattest function, LPSVR seeks the sparsest combination of training patterns. According to statistical learning theory (Vapnik 1995; Smola 2004), the coefficients \(\alpha _i \) and the bias term \(b\) can be found by minimizing the regularized risk function:

$$\begin{aligned} \text{ Min: } Q( {\varvec{\alpha }})+2C\sum \limits _{i=1}^N {L( {y_i -f( {{\varvec{x}}_i })})} \end{aligned}$$
(3)

where \(Q( {\varvec{\alpha }})\) is a regularization term, defined as \(Q( {\varvec{\alpha }})=\left\| {\varvec{\alpha }} \right\| _1 =\sum _{i=1}^N {\left| {\alpha _i } \right| } \). The vector \({\varvec{\alpha }}=\left[ {\alpha _1 ,\alpha _2 ,\ldots ,\alpha _N } \right] ^{\mathrm{T}}\) in \(Q( {\varvec{\alpha }})\) determines the function complexity. A hyper-parameter \(C>0\) is introduced to tune the trade-off between the error minimization and the function sparsity. \(L( {y_i -f( {{\varvec{x}}_i })})\) denotes the \(\varepsilon \)-insensitive loss function:

$$\begin{aligned} L( {y_i -f( {{\varvec{x}}_i })})=\left\{ {\begin{array}{ll} 0,&{} \left| {y_i \!-\!f( {{\varvec{x}}_i })} \right| \!\le \!\varepsilon \\ \left| {y_i -f( {{\varvec{x}}_i })} \right| -\varepsilon ,&{} \mathrm{otherwise} \\ \end{array}} \right. \end{aligned}$$
(4)
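For illustration only, the loss (4) can be evaluated as in the following minimal sketch (our own code, not part of the original formulation); the residual argument stands for \(y_i -f( {{\varvec{x}}_i })\):

```python
import numpy as np

def eps_insensitive_loss(residual, eps):
    """epsilon-insensitive loss (4): zero inside the eps-tube, linear outside."""
    return np.maximum(np.abs(residual) - eps, 0.0)
```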

By introducing slack variables \(\xi _i \) and using the \(\varepsilon \)-insensitive loss function, the LPSVR is formulated as:

$$\begin{aligned} \begin{array}{l} \text{ Find }\!:\,\alpha _j ,\xi _i ,b \\ \text{ Min }\!:\,\left\| {\varvec{\alpha }} \right\| _1 +2C\sum \limits _{i=1}^N {\xi _i } \\ \text{ s }.\text{ t }.\,\left\{ {{\begin{array}{l} {y_i -\sum \limits _{j=1}^N {\alpha _j k( {{\varvec{x}}_i ,{\varvec{x}}_j })-b\le \varepsilon +\xi _i } }\\ {\sum \limits _{j=1}^N {\alpha _j k( {{\varvec{x}}_i ,{\varvec{x}}_j })+b-y_i \le \varepsilon +\xi _i } }\\ {\xi _i \ge 0}\\ {\forall i=1,2,\ldots ,N}\\ \end{array} }} \right. \\ \end{array} \end{aligned}$$
(5)

In order to solve the optimization above, we can decompose \(\alpha _i \) and \(\left| {\alpha _i } \right| \) as follows:

$$\begin{aligned} \begin{array}{l} \alpha _i =\alpha _i^+ -\alpha _i^- \\ \left| {\alpha _i } \right| =\alpha _i^+ +\alpha _i^- \\ \end{array} \end{aligned}$$
(6)

where \(\alpha _i^+ ,\alpha _i^- \ge 0\). Due to the nature of the constraints, typically only a subset of \(\alpha _i \) is non-zero, and the associated training data are called support vectors (Lu and Sun 2009; Lu et al. 2009).

Substituting (6) into (5), the LPSVR can be expressed as:

$$\begin{aligned} \begin{array}{l} \text{ Find: } \alpha _j^+ ,\alpha _j^- ,\xi _i ,b \\ \text{ Min: } \sum \limits _{j=1}^N {( {\alpha _j^+ +\alpha _j^- })} +2C\sum \limits _{i=1}^N {\xi _i } \\ \text{ s.t. }\left\{ {\begin{array}{l} y_i -\sum \limits _{j=1}^N {( {\alpha _j^+ -\alpha _j^- })k( {{\varvec{x}}_i ,{\varvec{x}}_j })-b\le \varepsilon +\xi _i } \\ \sum \limits _{j=1}^N {( {\alpha _j^+ -\alpha _j^- })k( {{\varvec{x}}_i ,{\varvec{x}}_j })+b-y_i \le \varepsilon +\xi _i } \\ \alpha _j^+ \ge 0, \alpha _j^- \ge 0 \\ \xi _i \ge 0.\quad ( {\forall i=1,2,\ldots ,N}). \\ \end{array}} \right. \\ \end{array} \end{aligned}$$
(7)

The coefficients \(\alpha _j^+ ,\alpha _j^- ,\xi _i \) and \(b\) in (7) can be solved by using linprog in Matlab. Substituting (6) into (2), Eq. (2) can be expressed as:

$$\begin{aligned} f( {\varvec{x}})=\sum \limits _{j=1}^N {( {\alpha _j^+ -\alpha _j^- })k( {{\varvec{x}},{\varvec{x}}_j })} +b \end{aligned}$$
(8)
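The paper solves (7) with linprog in Matlab. As a hedged illustration, the sketch below assembles the same linear program and solves it with SciPy's linprog for a single Gaussian kernel; the function names, the kernel choice and the default parameter values are ours and only serve as an example.

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X1, X2, sigma):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); X1, X2 are (n, d) arrays
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lpsvr_fit(X, y, C=100.0, eps=0.01, sigma=0.1):
    """Solve formulation (7); the decision vector is [alpha+, alpha-, xi, b]."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    K = gaussian_kernel(X, X, sigma)
    # objective: sum(alpha+ + alpha-) + 2C sum(xi); b has zero cost
    c = np.concatenate([np.ones(2 * N), 2.0 * C * np.ones(N), [0.0]])
    # first block:   K a+ - K a- - xi + b <= eps + y  (second constraint of (7))
    # second block: -K a+ + K a- - xi - b <= eps - y  (first constraint of (7))
    A_ub = np.block([[ K, -K, -np.eye(N),  np.ones((N, 1))],
                     [-K,  K, -np.eye(N), -np.ones((N, 1))]])
    b_ub = np.concatenate([eps + y, eps - y])
    bounds = [(0, None)] * (3 * N) + [(None, None)]   # alpha+, alpha-, xi >= 0; b free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    alpha = res.x[:N] - res.x[N:2 * N]                # alpha = alpha+ - alpha-
    return alpha, res.x[-1]

def lpsvr_predict(Xnew, X, alpha, b, sigma=0.1):
    # regression function (8): f(x) = sum_j alpha_j k(x, x_j) + b
    return gaussian_kernel(Xnew, X, sigma) @ alpha + b
```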

3 Proposed algorithms

In some problems of science and engineering, it is complex and costly to obtain sufficient measured data samples by experiments. On the other hand, a simulation model built from physical knowledge is often available. In this section, we present an algorithm which incorporates multiple kernels and prior knowledge into the learning framework of the LPSVR. Figure 1 shows the basic idea of developing the algorithm.

Fig. 1
figure 1

Block diagram of MKPLPSVR development

As shown in Fig. 1, multiple feature spaces are utilized to develop multi-kernel linear programming support vector regression (MKLPSVR). Subsequently, the optimization objective and inequality constraints of the MKLPSVR are modified to incorporate the prior knowledge from a simulation model or a simulator, which leads to multi-kernel prior knowledge linear programming support vector regression (MKPLPSVR). By incorporating the prior knowledge from a calibrated simulator into the MKLPSVR, we can reduce the effect of the biased data from the simulation model on the accuracy of the data-based model. The development of the MKPLPSVR is explained in the following.

3.1 Multi-kernel linear programming support vector regression

As far as the LPSVR is concerned, the function in a single feature space is expressed as (1). However, a non-flat function or a complicated data trend cannot be described properly in a single feature space. It is promising to seek a representation which can utilize the advantages of different feature spaces (Zheng et al. 2006). Therefore, it may be a better choice to consider the regression in multiple feature spaces with normal vectors \({\varvec{\omega }}_1 ,\ldots ,{\varvec{\omega }}_L \), and the function in multiple feature spaces can be written as:

$$\begin{aligned} f( {\varvec{x}})=\sum \limits _{r=1}^L {{\varvec{\omega }}_r } \cdot \varphi _r ( {\varvec{x}})+b \end{aligned}$$
(9)

Substituting \({\varvec{\omega }}_r =\sum _{i=1}^N {\alpha _{ri} \varphi _r ( {{\varvec{x}}_i })} \) into (9), one can express the regression function as:

$$\begin{aligned} f( {\varvec{x}})=\sum \limits _{r=1}^L {\sum \limits _{i=1}^N {\alpha _{ri} k_r ( {{\varvec{x}},{\varvec{x}}_i })+b} } \end{aligned}$$
(10)

where \(L\) denotes the number of kernels, which are induced by the set of different feature spaces with normal vectors \({\varvec{\omega }}_1 ,\ldots ,{\varvec{\omega }}_L \). The function \(k_r ( {{\varvec{x}},{\varvec{x}}_i })=\varphi _r ( {{\varvec{x}}_i })\cdot \varphi _r ( {\varvec{x}})\) denotes the \(r\)th kernel, and \(\alpha _{ri} \) is the coefficient of the corresponding kernel function.

Utilizing the method in (6), we can reformulate Eq. (10) as:

$$\begin{aligned} f( {\varvec{x}})=\sum \limits _{r=1}^L {\sum \limits _{i=1}^N {( {\alpha _{ri}^+ -\alpha _{ri}^- })k_r ( {{\varvec{x}},{\varvec{x}}_i })+b} } \end{aligned}$$
(11)

Equation (11) can be estimated by minimizing the risk (3) as in the previous section. Since the target to be estimated is a function with a complicated data trend, minimizing the regularization term alone corresponds to maximizing the flatness of the function, which may result in under-fitting (Zheng et al. 2006). To avoid this problem, we introduce non-negative constants \(C_r \) to control the regularization term. Therefore, analogous to (3), the risk function in a multi-kernel framework is expressed as:

$$\begin{aligned} \text{ Min: } \sum \limits _{r=1}^L {C_r } \left\| {{\varvec{\alpha }}_r } \right\| _1 +2C\sum \limits _{i=1}^N {L( {y_i -f( {{\varvec{x}}_i })})} \end{aligned}$$
(12)

where \(Q=\sum _{r=1}^L {C_r } \left\| {{\varvec{\alpha }}_r } \right\| _1 \) is the regularization term, and the constant \(C_r \) penalizes the non-zero coefficients in \({\varvec{\alpha }}_r \). The vector \({\varvec{\alpha }}_r =\left[ {\alpha _{r1} ,\alpha _{r2} ,\ldots ,\alpha _{rN} } \right] ^{\mathrm{T}}\) contains the coefficients of the \(r\)th kernel, and the training samples associated with its non-zero elements are the support vectors.

Utilizing the method in (6) and (7), the MKLPSVR is expressed as:

$$\begin{aligned} \begin{array}{l} \text{ Find: } \alpha _{ri}^+ ,\alpha _{ri}^- ,\xi _i ,b \\ \text{ Min: } \sum \limits _{r=1}^L {C_r } \sum \limits _{i=1}^N {( {\alpha _{ri}^+ +\alpha _{ri}^- })} +2C\sum \limits _{i=1}^N {\xi _i } \\ \text{ s.t. }\left\{ {\begin{array}{l} y_i -\sum \limits _{r=1}^L {\sum \limits _{j=1}^N {( {\alpha _{rj}^+ -\alpha _{rj}^- })} k_r ( {{\varvec{x}}_i ,{\varvec{x}}_j })-b\le \varepsilon +\xi _i } \\ \sum \limits _{r=1}^L {\sum \limits _{j=1}^N {( {\alpha _{rj}^+ -\alpha _{rj}^- })} k_r ( {{\varvec{x}}_i ,{\varvec{x}}_j })+b-y_i \le \varepsilon +\xi _i } \\ \alpha _{rj}^+ \ge 0,\alpha _{rj}^- \ge 0 \\ \xi _i \ge 0 \quad ( {\forall i=1,2,\ldots ,N}) \\ \end{array}} \right. \\ \end{array} \end{aligned}$$
(13)

where \(C_r \) depends on the kernel parameter of the corresponding kernel function. The coefficients \(\alpha _{ri}^+ ,\alpha _{ri}^- \) satisfy \(\alpha _{ri} =\alpha _{ri}^+ -\alpha _{ri}^- \) and \(\left| {\alpha _{ri} } \right| =\alpha _{ri}^+ +\alpha _{ri}^- \).

Solving Eq. (13) by linear programming yields the function in Eq. (11). The MKLPSVR is thus a generalized version of the LPSVR, which is recovered when \(L=1\). Compared with other multi-kernel support vector regressions (Bach et al. 2004; Sonnenburg et al. 2006; Nguyen and Tay 2008; Pozdnoukhov and Kanevski 2008; Mingqing et al. 2009; Qiu and Lane 2009; Varma and Babu 2009), the MKLPSVR differs in the following respects. Firstly, the basic principle and solution of the MKLPSVR are different from those of existing multi-kernel support vector regressions, which exploit convex quadratically constrained quadratic programs. Secondly, the MKLPSVR can exploit non-Mercer kernels, which provides more flexibility in designing the kernel functions.
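To make the multi-kernel expansion concrete, the sketch below (our own illustration, not the paper's code) evaluates (11) for a list of \(L\) kernels, here a Gaussian and a polynomial kernel with arbitrarily chosen parameters; solving (13) for the coefficients proceeds as in the single-kernel case, with one block of columns per kernel in the constraint matrix.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def polynomial_kernel(X1, X2, degree):
    return (1.0 + X1 @ X2.T) ** degree

def multi_kernel_predict(Xnew, X, alphas, b, kernels):
    """Evaluate (11): f(x) = sum_r sum_i alpha_{ri} k_r(x, x_i) + b.

    alphas  : list of L coefficient vectors alpha_r, each of length N
    kernels : list of L callables k_r(X1, X2) returning kernel matrices
    """
    f = np.full(len(Xnew), float(b))
    for alpha_r, k_r in zip(alphas, kernels):
        f += k_r(Xnew, X) @ alpha_r
    return f

# example kernel set with L = 2 (parameter values are illustrative only)
kernels = [lambda A, B: gaussian_kernel(A, B, sigma=0.1),
           lambda A, B: polynomial_kernel(A, B, degree=2)]
```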

3.2 Incorporating prior knowledge into MKLPSVR

In practice, it is complex and costly to obtain sufficient measured data. On the other hand, a simulation model built from physical knowledge is often available. Using a calibrated simulator, one can obtain enough prior data, but these data may be biased relative to the measured results. In order to reduce the effect of the biased prior data, this subsection presents an approach to incorporate the prior data from a calibrated simulator into the MKLPSVR.

Let \(\text{ P }=\big \{ ( {{\varvec{z}}_k^p ,y_k^p }),{\varvec{z}}_k^p\in \text{ R }^d,y_k^p \in \text{ R },k=1,2,\ldots ,N_k \big \}\) be the prior dataset from a calibrated simulator. Obviously, the prior data satisfy the equation of the simulator:

$$\begin{aligned} f( {{\varvec{z}}_k^p })=y_k^p ( {k=1,2,\ldots ,N_k }). \end{aligned}$$
(14)

These equality constraints could be added to formulation (13) without changing its linear programming nature. However, this would force an exact fit to the prior data points, which is not advisable if the prior data are biased relative to the measured results. Moreover, the full set of equality constraints may lead to an infeasible problem if they cannot be satisfied simultaneously (Lauer and Bloch 2008b; Bloch et al. 2008; Zhou et al. 2011). Therefore, the constraints (14) are softened by introducing a positive slack vector \({\varvec{u}}=\left[ {u_1 ,u_2 ,\ldots ,u_{N_k} } \right] ^{\mathrm{T}}\). The slack variables bound the errors between the prior data \(( {{\varvec{z}}_k^p ,y_k^p })\) and the regression function \(f( {{\varvec{z}}_k^p })\) through the following inequality:

$$\begin{aligned} \left| {y_k^p -f( {{\varvec{z}}_k^p })} \right| \le u_k ( {\forall k = 1{,}2,\ldots ,N_k }). \end{aligned}$$
(15)

In order to accommodate almost exact or biased knowledge from the prior simulator, violations of the constraints (15) smaller than a threshold \(\varepsilon ^p\) are tolerated. Therefore, by applying the \(\varepsilon \)-insensitive loss function to the error \(u_k \), one obtains the following inequality:

$$\begin{aligned} \left| {y_k^p -f( {{\varvec{z}}_k^p })} \right| \le u_k +\varepsilon ^p ( {\forall k = 1,2,\ldots ,N_k }). \end{aligned}$$
(16)

In order to minimize the error vector \({\varvec{u}}=\left[ {u_1 ,u_2 ,\ldots ,u_{N_k} } \right] ^{\mathrm{T}}\), its \(l_1 \) norm is added to (12) through a trade-off parameter \(\lambda \) which tunes the influence of the prior data on the regression function. Therefore, by adding the inequality constraints (16) and the \(l_1 \) norm of the slack vector, the MKLPSVR in (13) is modified to reduce the influence of biased prior data from the simulator on the modeling accuracy. The modified algorithm, called MKPLPSVR in this paper, is expressed as:

$$\begin{aligned} \begin{array}{l} \text{ Find: } \alpha _{ri}^+ ,\alpha _{ri}^- ,\xi _i ,u_k ,b \\ \text{ Min: } \sum \limits _{r=1}^L {C_r } \sum \limits _{i=1}^N {( {\alpha _{ri}^+ +\alpha _{ri}^- })} +2C\sum \limits _{i=1}^N {\xi _i } +\lambda \sum \limits _{k=1}^{N_k } {u_k } \\ \text{ s.t. }\left\{ {\begin{array}{l} y_i -\sum \limits _{r=1}^L {\sum \limits _{j=1}^N {( {\alpha _{rj}^+ -\alpha _{rj}^- })} k_r ( {{\varvec{x}}_i ,{\varvec{x}}_j })-b\le \varepsilon +\xi _i } \\ \sum \limits _{r=1}^L {\sum \limits _{j=1}^N {( {\alpha _{rj}^+ -\alpha _{rj}^- })} k_r ( {{\varvec{x}}_i ,{\varvec{x}}_j })+b-y_i \le \varepsilon +\xi _i } \\ y_k^p -\sum \limits _{r=1}^L {\sum \limits _{j=1}^N {( {\alpha _{rj}^+ -\alpha _{rj}^- })} k_r ( {{\varvec{z}}_k^p ,{\varvec{x}}_j })-b\le \varepsilon ^p+u_k } \\ \sum \limits _{r=1}^L {\sum \limits _{j=1}^N {( {\alpha _{rj}^+ -\alpha _{rj}^- })} k_r ( {{\varvec{z}}_k^p ,{\varvec{x}}_j })+b-y_k^p \le \varepsilon ^p+u_k } \\ \alpha _{rj}^+ \ge 0, \alpha _{rj}^- \ge 0 \quad ( {\forall i=1,2,\ldots ,N}) \\ \xi _i \ge 0, u_k \ge 0 \quad ( {\forall k=1,2,\ldots N_k }). \\ \end{array}} \right. \\ \end{array} \end{aligned}$$
(17)

In order to facilitate the solution, we reformulate Eq. (17) in the following vector form:

$$\begin{aligned} \begin{array}{l} \text{ Find: } {\varvec{s}} \\ \text{ Min: } {\varvec{h}}^{\mathrm{T}}{\varvec{s}} \\ \text{ s.t. } \left\{ {\begin{array}{l} {\varvec{As}}\le {\varvec{B}} \\ {\varvec{s}}\ge {\varvec{l}} \\ \end{array}} \right. \\ \end{array} \end{aligned}$$
(18)

where

$$\begin{aligned} \begin{array}{l} {\varvec{s}}=\left[ {{\varvec{\alpha }}_1^+ ,\ldots ,{\varvec{\alpha }}_L^+ ,{\varvec{\alpha }}_1^- ,\ldots ,{\varvec{\alpha }}_L^- ,{\varvec{\xi }},{\varvec{u}},b} \right] ^{\mathrm{T}} \\ {\varvec{h}}=\left[ {C_1 {\varvec{e}},\ldots ,C_L {\varvec{e}},C_1 {\varvec{e}},\ldots ,C_L {\varvec{e}},2C{\varvec{e}},\lambda {\varvec{e}}^p,0} \right] ^{\mathrm{T}} \\ {\varvec{B}}=\left[ {\varepsilon {\varvec{e}}+{\varvec{y}},\varepsilon {\varvec{e}}-{\varvec{y}},\varepsilon ^p{\varvec{e}}^p+{\varvec{y}}^p ,\varepsilon ^p{\varvec{e}}^p-{\varvec{y}}^p } \right] ^{\mathrm{T}} \\ {\varvec{l}}=\left[ {{\varvec{o}}_1 ,\ldots ,{\varvec{o}}_L ,{\varvec{o}}_1 ,\ldots ,{\varvec{o}}_L ,{\varvec{o}},{\varvec{o}}^p,-\infty } \right] ^{\mathrm{T}} \\ {\varvec{A}}\!=\!\left[ {\begin{array}{l} {\varvec{K}}_1 ,\ldots , {\varvec{K}}_L ,-{\varvec{K}}_1 ,\ldots ,-{\varvec{K}}_L ,-{\varvec{E}}, {\varvec{Z}}^k, {\varvec{e}} \\ -{\varvec{K}}_1 ,\ldots ,-{\varvec{K}}_L , {\varvec{K}}_1 , \ldots , {\varvec{K}}_L ,-{\varvec{E}}, {\varvec{Z}}^k,-{\varvec{e}} \\ {\varvec{K}}_1^p ,\ldots , {\varvec{K}}_L^p ,-{\varvec{K}}_1^p ,\ldots ,-{\varvec{K}}_L^p , {\varvec{Z}}^p,-{\varvec{E}}^p, {\varvec{e}}^p \\ -{\varvec{K}}_1^p ,\ldots ,-{\varvec{K}}_L^p , {\varvec{K}}_1^p ,\ldots , {\varvec{K}}_L^p , {\varvec{Z}}^p,-{\varvec{E}}^p,-{\varvec{e}}^p \\ \end{array}} \right] \!. \\ \end{array} \end{aligned}$$

In the optimization, the slack vectors \({\varvec{\xi }}\) and \({\varvec{u}}\) are \({\varvec{\xi }}=\left[ {\xi _1 ,\xi _2 ,\ldots ,\xi _N } \right] ^{\mathrm{T}}\) and \({\varvec{u}}=\left[ {u_1 ,u_2 ,\ldots ,u_{N_k} } \right] ^{\mathrm{T}}\), respectively. The vectors \({\varvec{\alpha }}_r^+ \) and \({\varvec{\alpha }}_r^- \) denote \({\varvec{\alpha }}_r^+ =\left[ {\alpha _{r1}^+ ,\alpha _{r2}^+ ,\ldots ,\alpha _{rN}^+ } \right] ^{\mathrm{T}}\) and \({\varvec{\alpha }}_r^- =\left[ {\alpha _{r1}^- ,\alpha _{r2}^- ,\ldots ,\alpha _{rN}^- } \right] ^{\mathrm{T}}\), respectively. The vectors \({\varvec{e}}=\left[ {1,1,\ldots ,1} \right] ^{\mathrm{T}}\) and \({\varvec{o}}=\left[ {0,0,\ldots ,0} \right] ^{\mathrm{T}}\) are \(N\times 1\) column vectors, and the vectors \({\varvec{e}}^p=\left[ {1,1,\ldots ,1} \right] ^{\mathrm{T}}\) and \({\varvec{o}}^p=\left[ {0,0,\ldots ,0} \right] ^{\mathrm{T}}\) are \(N_k \times 1\) column vectors. The matrix \({\varvec{K}}_r ( {r=1,2,\ldots ,L})\) denotes an \(N\times N\) kernel matrix whose elements are calculated by the \(r\)th kernel function \(k_r ( {{\varvec{x}}_i ,{\varvec{x}}_j })\). \({\varvec{E}}\) is an \(N\times N\) identity matrix, \({\varvec{E}}^p\) is an \(N_k \times N_k \) identity matrix, \({\varvec{Z}}^k\) is an \(N\times N_k \) zero matrix, and \({\varvec{Z}}^p\) is an \(N_k \times N\) zero matrix. \({\varvec{K}}_r^p ( {r=1,2,\ldots ,L})\) denotes an \(N_k \times N\) kernel matrix whose elements are calculated by the \(r\)th kernel function \(k_r ( {{\varvec{z}}_k^p ,{\varvec{x}}_j })\). The vector \({\varvec{y}}\in \text{ R }^{N\times 1}\) contains all training outputs \(y_i ( {i=1,\ldots ,N})\), and the vector \({\varvec{y}}^p\in \text{ R }^{N_k \times 1}\) contains all prior outputs \(y_k^p ( {k=1,\ldots ,N_k })\).

Using linear programming to solve the formulation, one will obtain a regression function as shown in (11). The solution procedure is summarized below.

figure a
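As a concrete illustration of this procedure, the following sketch assembles the vector \({\varvec{h}}\), the right-hand side \({\varvec{B}}\) and the matrix \({\varvec{A}}\) of (18) and solves the linear program with SciPy's linprog. It is a minimal example under our own naming conventions; the kernel functions are assumed to be supplied as callables, and no claim is made that it reproduces the authors' original implementation.

```python
import numpy as np
from scipy.optimize import linprog

def mkplpsvr_fit(X, y, Zp, yp, kernels, C_r, C, lam, eps, eps_p):
    """Solve the MKPLPSVR linear program (17)/(18).

    X, y   : N measured samples;   Zp, yp : N_k prior samples
    kernels: list of L callables k_r(A, B) returning kernel matrices
    C_r    : list of L regularization constants, one per kernel
    """
    y, yp = np.asarray(y, float), np.asarray(yp, float)
    N, Nk, L = len(y), len(yp), len(kernels)
    K  = [k(X,  X) for k in kernels]          # N   x N blocks K_r
    Kp = [k(Zp, X) for k in kernels]          # N_k x N blocks K_r^p
    e, ep = np.ones((N, 1)), np.ones((Nk, 1))
    E, Ep = np.eye(N), np.eye(Nk)
    Zk, Zp0 = np.zeros((N, Nk)), np.zeros((Nk, N))

    # decision vector s = [alpha_1^+ .. alpha_L^+, alpha_1^- .. alpha_L^-, xi, u, b]
    h = np.concatenate([np.repeat(C_r, N), np.repeat(C_r, N),
                        2.0 * C * np.ones(N), lam * np.ones(Nk), [0.0]])
    A = np.block([
        [ np.hstack(K),  -np.hstack(K),  -E,   Zk,   e],
        [-np.hstack(K),   np.hstack(K),  -E,   Zk,  -e],
        [ np.hstack(Kp), -np.hstack(Kp),  Zp0, -Ep,  ep],
        [-np.hstack(Kp),  np.hstack(Kp),  Zp0, -Ep, -ep]])
    B = np.concatenate([eps + y, eps - y, eps_p + yp, eps_p - yp])
    bounds = [(0, None)] * (2 * L * N + N + Nk) + [(None, None)]   # b is free

    res = linprog(h, A_ub=A, b_ub=B, bounds=bounds, method="highs")
    s = res.x
    alphas = [s[r * N:(r + 1) * N] - s[(L + r) * N:(L + r + 1) * N]
              for r in range(L)]              # alpha_r = alpha_r^+ - alpha_r^-
    return alphas, s[-1]                      # coefficients of (11) and bias b
```

The returned coefficient vectors and bias can then be plugged directly into the multi-kernel regression function (11).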

3.3 Strategy for parameter selections in MKPLPSVR

In this subsection, a strategy is presented to select the hyper-parameters of the MKPLPSVR. Figure 2 shows the flowchart of the parameter selection procedure.

Fig. 2
figure 2

Flowchart of parameter selections procedure

In Fig. 2, we assume that the kernel functions are given beforehand based on experience. Generally, \(L=2\) or \(3\) kernel functions are sufficient for most practical problems (Pozdnoukhov and Kanevski 2008). The error thresholds \(\varepsilon \) and \(\varepsilon ^p\) are proportional to the noise level of the corresponding dataset, and the empirical tuning from Cherkassky and Yunqian (2004) is applied to determine \(\varepsilon \) and \(\varepsilon ^p\) as:

$$\begin{aligned} \begin{array}{l} \varepsilon =3\sigma \sqrt{{\ln N}/N} \\ \varepsilon ^p=3\sigma ^p\sqrt{{\ln N_k }/ {N_k }} \\ \end{array} \end{aligned}$$
(19)

where the standard deviations \(\sigma \) and \(\sigma ^p\) are estimated from the \(N\) training samples and the \(N_k \) prior samples, respectively.
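A direct transcription of (19) is shown below; how the noise standard deviation is estimated from the data is left to the user, so it is passed in as an argument:

```python
import numpy as np

def eps_threshold(sigma, n):
    """Empirical threshold (19): 3 * sigma * sqrt(ln(n) / n)."""
    return 3.0 * sigma * np.sqrt(np.log(n) / n)

# eps   = eps_threshold(sigma_train, N)     # from the N training samples
# eps_p = eps_threshold(sigma_prior, N_k)   # from the N_k prior samples
```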

The constant \(C_r \) penalizes the non-zero variables \(\alpha _{ri}^+ ,\alpha _{ri}^- \) in the MKPLPSVR. To avoid over-fitting, this paper chooses \(C_r =1 / {\delta _r }\), where \(\delta _r \) denotes the kernel parameter of the \(r\)th kernel function. The remaining parameters, namely the kernel parameters \(\delta _r \) and the hyper-parameters \(C\) and \(\lambda \), are searched by fivefold cross-validation (Phienthrakul and Kijsirikul 2010; Pasolli et al. 2012; Huang 2012). The search procedure is described below (a code sketch follows the list):

  1. Specify the search ranges of \(C\), \(\lambda \) and \(\delta _r\).
  2. Initialize \(C=\hat{C}\), \(\lambda =\hat{\lambda }\) and \(\delta _r =\hat{\delta }_r \).
  3. Divide the training dataset into five subsets.
  4. Use four subsets to train the MKPLPSVR, and apply the remaining subset to evaluate the model accuracy (cross-validation error, test error, etc.).
  5. Search \(C\), \(\lambda \) and \(\delta _r \) by using an optimization algorithm such as PSO or GA.
  6. Check whether the model accuracy satisfies a predefined termination condition; if so, stop.
  7. Otherwise, update the search direction and go to step 3.
  8. Output the optimal hyper-parameters \(C\), \(\lambda \) and \(\delta _r \).
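The paper suggests PSO or GA for the search; as a simpler, hedged stand-in, the sketch below performs the same fivefold cross-validation with a plain grid search. The fit and predict arguments are assumed to be user-supplied wrappers around the MKPLPSVR training and prediction routines of Sect. 3.2; their interface here is hypothetical.

```python
import numpy as np

def cv_error(params, X, y, Zp, yp, fit, predict, n_folds=5):
    """Fivefold cross-validation RMSE for one setting of (C, lambda, delta_r)."""
    C, lam, deltas = params
    folds = np.array_split(np.random.permutation(len(y)), n_folds)
    errs = []
    for k in range(n_folds):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[tr], y[tr], Zp, yp, C=C, lam=lam, deltas=deltas)
        pred = predict(model, X[te])
        errs.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
    return np.mean(errs)

def grid_search(grid_C, grid_lam, grid_deltas, X, y, Zp, yp, fit, predict):
    """Exhaustive search over the specified ranges of C, lambda and delta_r."""
    best, best_err = None, np.inf
    for C in grid_C:
        for lam in grid_lam:
            for deltas in grid_deltas:
                err = cv_error((C, lam, deltas), X, y, Zp, yp, fit, predict)
                if err < best_err:
                    best, best_err = (C, lam, deltas), err
    return best, best_err
```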

4 Experimental results

In this section, we validate the proposed algorithm by using a synthetic example, a microstrip antenna and a six-pole microwave filter. The following two criteria are used to evaluate the generalization performance:

$$\begin{aligned} \mathrm{RMSE}=\sqrt{N^{-1}\sum \limits _{i=1}^N {( {y_i -f( {{\varvec{x}}_i })})^2} } \end{aligned}$$
(20)
$$\begin{aligned} \mathrm{MAE}=\text{ max }( {\left| {y_i -f( {{\varvec{x}}_i })} \right| }) \end{aligned}$$
(21)

where \(f( {{\varvec{x}}_i })\) is the predicted value, \(y_i \) is the corresponding measured value, and \(N\) is the number of testing samples. RMSE denotes the root mean squared error, and MAE denotes the maximum absolute error.
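For reference, both criteria can be computed as follows (note that MAE here is the maximum, not the mean, absolute error):

```python
import numpy as np

def rmse(y, f):
    # (20): root mean squared error over the testing samples
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(f)) ** 2))

def mae(y, f):
    # (21): maximum absolute error over the testing samples
    return np.max(np.abs(np.asarray(y) - np.asarray(f)))
```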

4.1 Complex function approximation

In this subsection, we validate the proposed algorithm on a synthetic example, where five different algorithms are employed to approximate the following function:

$$\begin{aligned} y=\left\{ {\begin{array}{ll} -4x-8, &{} -3\le x<-1 \\ -3x^3-5x^2+5x+3, &{} -1\le x<1 \\ 2\sin ( {\exp ( {1.2x})})+0.3552, &{} 1\le x\le 3. \\ \end{array}} \right. \end{aligned}$$
(22)

Over the interval \(\left[ {-3,3} \right] \), we generated 13 training samples by adding Gaussian noise \(N( {0,0.1^2})\) to the function values. Then, in order to simulate the prior knowledge, we used the same function to generate 35 data samples with Gaussian noise \(N( {0,0.2^2})\) and took them as the prior data. Finally, 201 data points were taken uniformly from the same function as testing data. Figure 3 shows the testing data, the training data and the prior data.
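The data can be reproduced along the following lines; since the paper does not state how the 13 training inputs are placed, the sketch simply spaces all inputs uniformly over \([-3,3]\), which is an assumption on our part.

```python
import numpy as np

def target(x):
    """Piecewise target function (22)."""
    return np.where(x < -1, -4 * x - 8,
           np.where(x < 1, -3 * x**3 - 5 * x**2 + 5 * x + 3,
                    2 * np.sin(np.exp(1.2 * x)) + 0.3552))

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 13)                  # placement of training inputs assumed uniform
y_train = target(x_train) + rng.normal(0, 0.1, 13)
x_prior = np.linspace(-3, 3, 35)                  # prior data from the same function
y_prior = target(x_prior) + rng.normal(0, 0.2, 35)
x_test = np.linspace(-3, 3, 201)                  # noise-free testing data
y_test = target(x_test)
```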

Fig. 3
figure 3

Data samples

Utilizing these data, we approximate the function separately by using the LPSVR, the MKLPSVR and the MKPLPSVR. In addition, the SimpleMKL in Rakotomamonjy et al. (2008) and the PLPSVR in Zhou and Huang (2010) are used for comparison with the MKPLPSVR. The strategy in Sect. 3.3 is applied to determine the hyper-parameters of the MKPLPSVR.

In order to validate the proposed algorithm, we designed two groups of experiments. In the first group, the 13 training samples are used to develop a regression function, and the 35 prior data samples are only used in the constraints of the optimization formulation. In the second group, the 35 prior data samples are used to extend the 13 training samples, and the extended dataset is then used to develop a regression function. After obtaining a regression model, we verify it on the 201 testing data.

In the first group of experiment, we have chosen \(C=100\), \(\varepsilon =0.01\), \(\varepsilon ^p=0.04\) for all the algorithms. Both the LPSVR and the PLPSVR exploited only a Gaussian kernel with the kernel parameter \(\sigma =0.0803\). However, the MKLPSVR, the SimpleMKL and the MKPLPSVR employed a Gaussian kernel, a polynomial kernel and a wavelet kernel (Lu et al. 2009) with the kernel parameters 2, 0.058 and 0.0125, respectively. In the second group, we have chosen \(C=150\), \(\varepsilon =0.01\), \(\varepsilon ^p=0.04\) for all the algorithms. Similarly, both the LPSVR and the PLPSVR only exploited a Gaussian kernel with the kernel parameter \(\sigma =0.013\). The MKLPSVR, the SimpleMKL and the MKPLPSVR utilized a Gaussian kernel, a polynomial kernel and a wavelet kernel with the kernel parameters 2, 0.055 and 0.0122, respectively.

Utilizing the data samples and parameters above, we establish the models separately with the five algorithms. Figure 4 shows the approximation results of the first group of experiments. It shows that none of the algorithms can accurately approximate the steep variation of the actual function, because only 13 training samples are available. However, comparing Fig. 4a and b, we can find that the multi-kernel algorithms, namely the MKLPSVR, the MKPLPSVR and the SimpleMKL, approximate the flat variation more accurately than the other algorithms.

Fig. 4
figure 4

Comparison of predicted results in the first group of experiment. a Results of LPSVR, PLPSVR, MKPLPSVR and measurement. b Results of MKLPSVR, MKPLPSVR, SimpleMKL and measurement

In order to show the performance more clearly, Table 1 presents some statistical results evaluated on the 201 testing data. From Table 1, we can find that the number of support vectors (NSV) is almost the same among the models. However, the model calculated by the MKPLPSVR has the smallest RMSE and MAE among the five models.

Table 1 Errors and number of support vectors

Figure 5 shows the approximation results of the second group of experiments. From Fig. 5a, we can find that all of the algorithms can approximate the steep variation of the function. However, the results in Fig. 5b show that, apart from the MKPLPSVR and the MKLPSVR, the algorithms cannot accurately approximate the flat variation. A possible explanation is that incorporating multiple kernels into the LPSVR improves the accuracy of approximating a function with both steep and smooth variations.

Fig. 5
figure 5

Comparison of predicted results in the second group of experiment. a Results of LPSVR, PLPSVR, MKPLPSVR and measurement. b Results of MKLPSVR, MKPLPSVR, SimpleMKL and measurement

Table 2 Errors and number of support vectors

Table 2 shows some statistical results of the five regression functions, each evaluated on the 201 testing data. From Table 2, we can find that the function developed by the MKPLPSVR is the most accurate among all of the functions. The MKPLPSVR is also advantageous over the SimpleMKL in terms of model sparsity and generalization performance. Comparing Tables 1 and 2, we find that the functions developed in the second group of experiments are more accurate than those developed in the first group. The reason is that the prior data have been utilized to extend the few training data samples.

Fig. 6
figure 6

Experimental devices and structural sketch for microstrip antennas. a Experimental devices. b Structural sketch of microstrip antennas

From these comparisons, we can find that the function approximated by the MKPLPSVR is the most accurate among all of the functions when few measurements are available. The MKPLPSVR is more effective at approximating a complex function with both steep and smooth variations than the LPSVR or the PLPSVR with a single kernel. The reason is that incorporating multiple kernels into the LPSVR improves the accuracy of approximating a complex function. In terms of generalization performance, the MKPLPSVR is also superior to the multi-kernel algorithms such as the SimpleMKL and the MKLPSVR. A possible explanation is that the MKPLPSVR utilizes the prior data to extend the few training data samples and thus improves the modeling accuracy. It follows that the introduction of prior knowledge and multi-kernel functions into the framework of the LPSVR can improve the modeling accuracy for a small dataset.

4.2 Bandwidth calculation in microstrip antennas

During the course of designing a microstrip antenna, the antenna bandwidth can be calculated by empirical formulas or numerical techniques. Although empirical formulas facilitate the design, their accuracy is limited. The numerical techniques based on electromagnetic theory can obtain accurate results, but their solution is relatively time-consuming. In order to accurately predict the bandwidth and reduce the computational effort, this subsection applies the proposed algorithm to build a hybrid model of bandwidth calculation which can be integrated into a microwave CAD tool.

Consider a rectangular patch of width \(d\) and length \(L\) over a ground plane with a substrate thickness \(h\) and a substrate dielectric permittivity \(\varepsilon _r \), as shown in Fig. 6. The rectangular antenna bandwidth BW\(_{\exp } \) can be evaluated by using the empirical formula from Sagiroglu et al. (1999):

$$\begin{aligned} \mathrm{BW}_{\exp } =\left[ {89 \left( {\frac{hd}{\varepsilon _r \lambda _0^2 }}\right) ^{0.45}+91( {\frac{h}{\lambda _0 }})} \right] \% \end{aligned}$$
(23)

where \(\lambda _0 \) is the free space wavelength at the resonant frequency \(f_r \). The dielectric permittivity \(\varepsilon _r =\left( {\frac{c}{f_r \lambda _d }}\right) ^2\) is related to the dielectric loss tangent \(\tan \delta \), and \(c\) is the velocity of electromagnetic waves in free space, and \(\lambda _d \) is the wavelength in the dielectric substrate.
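A direct transcription of (23) is shown below; the returned value is in percent, and all lengths must be in the same unit as the free-space wavelength, which is computed here from the resonant frequency in Hz:

```python
def bw_empirical(h, d, eps_r, f_r):
    """Empirical bandwidth (23); the returned value is in percent."""
    c = 299792458.0          # velocity of electromagnetic waves in free space (m/s)
    lam0 = c / f_r           # free-space wavelength at the resonant frequency
    return 89.0 * (h * d / (eps_r * lam0 ** 2)) ** 0.45 + 91.0 * (h / lam0)
```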

The empirical formula can be used to calculate the bandwidth quickly, but its results are not in full agreement with the experimental ones. This paper presents a hybrid model for the antenna bandwidth, expressed as:

$$\begin{aligned} \Delta \mathrm{BW}=h_s ( {\varvec{x}}) \end{aligned}$$
(24)
$$\begin{aligned} \mathrm{BW}=\mathrm{BW}_{\exp } +\Delta \mathrm{BW} \end{aligned}$$
(25)

Equation (24) is a support-vector model which corrects the difference between the empirical formula and the experimental results. The structural parameters \(h,d,\varepsilon _r ,\lambda _0 ,f_r \) and \(\tan \delta \) influence the bandwidth \(\mathrm{BW}\). However, the research in the literature (Sagiroglu et al. 1999) shows that the bandwidth depends on three independent variables \({\varvec{x}}=\left[ {h /{\lambda _d },d,\tan \delta } \right] ^{\mathrm{T}}\). From Sagiroglu et al. (1999), 27 measured data samples have been obtained and taken as the measured dataset \(\text{ S }=\big \{ ({\varvec{x}},\Delta \mathrm{BW}),{\varvec{x}}\in \text{ R }^{27\times 3},\Delta \mathrm{BW}\in \text{ R }^{27\times 1} \big \}\). In order to improve the modeling accuracy, a calibrated electromagnetic simulator has been utilized to generate 8 prior data samples, shown in Table 3.

Table 3 Data samples from prior knowledge
Table 4 Comparisons of predicted results in the first group of experiment
Table 5 Comparisons of predicted results in the second group of experiment

Based on the dataset, this subsection establishes the model separately by using the LPSVR, the MKLPSVR and the MKPLPSVR. Moreover, the SimpleMKL in Rakotomamonjy et al. (2008) and the PLPSVR in Zhou et al. (2010) are also utilized to establish the model. Two groups of experiments were designed to validate the model. In the first group, the 27 training samples are used to establish the model, while the 8 prior data samples are only used in the constraints of the MKPLPSVR. In the second group, we extend the 27 training data with the 8 prior data, and then apply the extended data to establish the model.

In the first group, we have chosen the hyper-parameters \(C=100\), \(\varepsilon =0.01\), \(\varepsilon ^p=0.001\) for all the algorithms. Both the LPSVR and the PLPSVR exploited a Gaussian kernel with the kernel parameter \(\sigma =0.0803\), and the MKLPSVR, the SimpleMKL and the MKPLPSVR exploited a Gaussian kernel and a polynomial kernel with the kernel parameters \(0.0803\) and 2, respectively. In the second group, we have chosen the hyper-parameters \(C=150\), \(\varepsilon =0.01\), \(\varepsilon ^p=0.001\). Both the LPSVR and the PLPSVR used a Gaussian kernel with the kernel parameter \(\sigma =0.09\), and the MKLPSVR, the SimpleMKL and the MKPLPSVR exploited a Gaussian kernel and a polynomial kernel with the kernel parameters 0.09 and 0.155, respectively.

Using the data samples and the parameters above, different hybrid models were developed by the five algorithms. Finally, six testing data samples were applied to verify the models. Tables 4 and 5 give some comparisons between the measured results and the predicted ones. Table 6 presents the number of support vectors and the statistical errors calculated on the same testing data in the two groups of experiments. From Tables 4 and 5, we can find that, among the five algorithms, the results predicted by the MKPLPSVR are in the best agreement with the measurements.

Table 6 Statistical errors and number of support vectors

As seen in Table 6, the model developed by the MKPLPSVR in the first group is more accurate than the ones developed by the other algorithms. Similarly, the model developed by the MKPLPSVR in the second group is also more accurate than the ones developed by the other algorithms. Moreover, the MKPLPSVR uses the fewest support vectors among the algorithms. The results indicate that the MKPLPSVR can improve the modeling accuracy when the available measurement data are scarce.

Comparing the two groups of experiments, we find that the results calculated in the second group are more accurate than those in the first group, due to the incorporation of the prior knowledge in the second group. Moreover, the model developed by the MKPLPSVR in the second group is more accurate than the result BW\(_{\mathrm{EDBD}}\) from the literature (Sagiroglu et al. 1999). From these comparisons, we can conclude that simultaneously incorporating prior knowledge and multiple kernels into the framework of the LPSVR is effective in improving the modeling accuracy from an insufficient amount of measurement data.

4.3 Application in a microwave filter tuning device

In this subsection, the proposed algorithm is applied to develop a model which is particularly suited to an automatic tuning device for microwave filters (Zhou et al. 2010; Zhou and Huang 2013). A six-pole microwave filter with a center frequency of 1,810 MHz and a bandwidth of 10 MHz is tuned by an automatic tuning device, as shown in Fig. 7. In this example, we develop a filter-tuning model which reveals the influence of the insertion depth of the six tunable screws on the filter response.

Fig. 7
figure 7

Experimental system for filter-tuning devices. a Six-pole microwave filter. b Automated tuning device

In order to formulate the problem, we assume that the insertion depth of the six tunable screws at the benchmark is \({\varvec{L}}_0 =\left[ {t_1 ,t_2 ,\ldots ,t_6 } \right] ^\mathrm{T}\), where the corresponding ideal coupling matrix \({\varvec{M}}_{\varvec{0}} \) is obtained from the initial design stage of the microwave filter. When the filter is adjusted, the tunable screws are rotated by \(\Delta {\varvec{D}}=[\Delta d_1 ,\Delta d_2 ,\ldots ,\Delta d_6 ]^\mathrm{T}\) degrees, which alters the insertion depth of the tunable screws by \(\Delta {\varvec{L}} \varvec{=} \Delta {\varvec{D}}^\mathrm{T}{\varvec{R}}\) for the given thread pitch \({\varvec{R}}\) of the tunable screws. As a result, the actual coupling matrix \({\varvec{M}}\) of the filter changes by an amount \(\Delta {\varvec{M}}\), and the filter response is also affected. In order to formulate this influence, we assume that there is a mapping between the insertion depth (or the rotation degrees) of the six tunable screws and the change \(\Delta {\varvec{M}}\) of the coupling matrix (Zhou et al. 2010; Zhou and Huang 2013), which can be formulated as:

$$\begin{aligned} \Delta {\varvec{M}}={\varvec{f}}(\Delta {\varvec{D}}). \end{aligned}$$
(26)

Once the mapping in (26) has been obtained, the actual coupling matrix \(\varvec{M}\) can be expressed as:

$$\begin{aligned} \varvec{M}=\varvec{M}_{\varvec{0}} +\Delta \varvec{M}. \end{aligned}$$
(27)

According to the topology of the filter in Fig. 7a, the coupling matrix \(\varvec{M}\) is expressed as:

$$\begin{aligned} \varvec{M}=\left[ {{\begin{array}{l@{\quad }l@{\quad }l@{\quad }l@{\quad }l@{\quad }l} {m_{11} } &{} {m_{12} } &{} 0 &{} 0 &{} 0 &{} 0 \\ {m_{12} } &{} {m_{22} } &{} {m_{23} } &{} 0 &{} {m_{25} } &{} 0 \\ 0 &{} {m_{23} } &{} {m_{33} } &{} {m_{34} } &{} 0 &{} 0 \\ 0 &{} 0 &{} {m_{34} } &{} {m_{44} } &{} {m_{45} } &{} 0 \\ 0 &{} {m_{25} } &{} 0 &{} {m_{45} } &{} {m_{55} } &{} {m_{56} } \\ 0 &{} 0 &{} 0 &{} 0 &{} {m_{56} } &{} {m_{66} } \\ \end{array} }} \right] . \end{aligned}$$
(28)

In this example, the ideal coupling matrix \(\varvec{M}_0 \) at the benchmark is determined at the initial design stage by using the synthesis of microwave filters according to a predefined specification (Zhou et al. 2010; Zhou and Huang 2013), and it is given as follows.

$$\begin{aligned} \varvec{M}_0 =\left[ {{\begin{array}{llllll} {0.0001} &{} {0.8720} &{} 0 &{} 0 &{} 0 &{} 0 \\ {0.8720} &{} {0.0001} &{} {0.6048} &{} 0 &{} {0.109} &{} 0 \\ 0 &{} {0.6048} &{} {0.0008} &{} {-0.6781} &{} 0 &{} 0 \\ 0 &{} 0 &{} {-0.6781} &{} {-0.0032} &{} {0.6048} &{} 0 \\ 0 &{} {0.109} &{} 0 &{} {0.6048} &{} {0.001} &{} {0.872} \\ 0 &{} 0 &{} 0 &{} 0 &{} {0.872} &{} {0.001} \\ \end{array} }} \right] \!. \end{aligned}$$
(29)

Following the analysis in Zhou and Huang (2013) and Zhou et al. (2010), the filter response is a function of the coupling matrix, and it is expressed as:

$$\begin{aligned} \begin{array}{l} S_{21} ( f)\!=\!-\!2\hbox {j}\sqrt{R_1 R_2 } \left[ {\left( {\frac{f_0 }{BW}\left( {\frac{f}{f_0 }-\frac{f_0 }{f}}\right) \varvec{I}-\hbox {j}\varvec{R}+\varvec{M}}\right) ^{-1}} \right] _{21} \\ S_{11} (f)\!=\!1\!+\!2\hbox {j}\sqrt{R_1 } \left[ {\left( {\frac{f_0 }{BW}\left( {\frac{f}{f_0 }-\frac{f_0 }{f}}\right) \varvec{I}-\hbox {j}\varvec{R}+\varvec{M}}\right) ^{-1}} \right] _{11} \\ \end{array} \end{aligned}$$
(30)

where the scattering parameters \(S_{21} ( f)\) and \(S_{11} ( f)\) denote the transfer and reflection characteristics of the filter response, respectively. \({\varvec{I}}\) is an identity matrix. \({\varvec{R}}\) is a diagonal matrix with all elements equal to zero except \(R_{11} =R_1 \), \(R_{nn} =R_2 \) and \(R_{ii} =\frac{f_0 }{BW\cdot Q} \quad ( {1<i<n})\). The operational frequency is denoted by the variable \(f\). For this filter, the unloaded quality factor \(Q=3804\), the desired central frequency \(f_0 =1810\) MHz and the desired bandwidth \(BW=10\) MHz are given at the design stage. Both the input coupling \(R_1 \) and the output coupling \(R_2 \) are equal to 1.043 in this example.
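For illustration, (30) can be evaluated as follows; the sketch takes the matrix entries of the inverse exactly as indexed in (30) and uses the filter constants quoted above, with frequencies in MHz:

```python
import numpy as np

def filter_response(M, freqs, f0=1810.0, BW=10.0, Q=3804.0, R1=1.043, R2=1.043):
    """Evaluate S21(f) and S11(f) from the coupling matrix M via (30); freqs in MHz."""
    n = M.shape[0]
    R = np.zeros((n, n))
    R[0, 0], R[-1, -1] = R1, R2
    R[np.arange(1, n - 1), np.arange(1, n - 1)] = f0 / (BW * Q)   # loss term, 1 < i < n
    I = np.eye(n)
    s21, s11 = [], []
    for f in freqs:
        lam_f = (f0 / BW) * (f / f0 - f0 / f)            # frequency variable in (30)
        Ainv = np.linalg.inv(lam_f * I - 1j * R + M)
        s21.append(-2j * np.sqrt(R1 * R2) * Ainv[1, 0])  # entry (2,1) as written in (30)
        s11.append(1 + 2j * np.sqrt(R1) * Ainv[0, 0])    # entry (1,1)
    return np.array(s21), np.array(s11)
```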

From the formulation above, the key to developing a filter-tuning model is to establish the relationship in (26). In this example, we build this relationship by using different algorithms. Firstly, some training data are required. According to the data acquisition method in the literature (Zhou et al. 2010), 45 measured data samples have been collected by a skilled operator. This number of measured data is too small to obtain a satisfactory model, and the measurement is costly. Therefore, 30 prior data samples from a calibrated simulator have been used to compensate for the small amount of measured data. Finally, we obtain a training dataset \(\text{ S }=\big \{ {(\Delta \varvec{D},\Delta \varvec{M}),\Delta \varvec{D} \in \hbox {R}^{75\times 6},\Delta \varvec{M}\in \hbox {R}^{75\times 12}} \big \}\) which consists of a measured dataset \(\hbox {M}=\big \{ (\Delta \varvec{D} ,\Delta \varvec{M}),\Delta \varvec{D}\in \text{ R }^{45\times 6},\Delta \varvec{M}\in \text{ R }^{45\times 12} \big \}\) and a prior dataset \(\text{ P }=\big \{ {(\Delta \varvec{D}^\mathrm{p} ,\Delta \varvec{M}^\mathrm{p} ),\Delta \varvec{D}^\mathrm{p} \in \text{ R }^{30\times 6},\Delta \varvec{M}^{\mathrm{p}} \in \text{ R }^{30\times 12}} \big \}\).

Based on the dataset, this subsection establishes the relationship in Eq. (26) separately by using the LPSVR, the MKLPSVR, the MKPLPSVR, the SimpleMKL in Rakotomamonjy et al. (2008) and the PLPSVR in Zhou and Huang (2010). Because these algorithms can only produce a multi-input, single-output function, this example applies the meta-model method in Zhou et al. (2010) and Zhou and Huang (2013) to build the multi-input, multi-output model in Eq. (26). Every meta-model is established independently by the algorithms, and all of the meta-models are then combined into the coupling matrix in (28).

Two groups of experiments, analogous to those in the previous two examples, were designed to validate the proposed algorithm. In the first group, both the LPSVR and the PLPSVR exploited a Gaussian kernel with the kernel parameter \(\sigma =0.0803\). The MKLPSVR, the SimpleMKL and the MKPLPSVR used a Gaussian kernel and a polynomial kernel with the kernel parameters 0.1 and 1.2, respectively. The other hyper-parameters were \(C=150\), \(\lambda =100\), \(\varepsilon =0.08\) and \(\varepsilon ^p=0.05\). In the second group, both the LPSVR and the PLPSVR used a Gaussian kernel with the kernel parameter \(\sigma =0.22\). The MKLPSVR, the SimpleMKL and the MKPLPSVR exploited a Gaussian kernel and a polynomial kernel with the kernel parameters 0.22 and 1.2, respectively. The other hyper-parameters were \(C=135\), \(\lambda =100\), \(\varepsilon =0.05\) and \(\varepsilon ^p=0.03\). Once the relationship in (26) was developed separately by the five algorithms, 10 testing data samples were applied to validate them. Tables 7 and 8 present the maximum absolute errors of the five algorithms in the two groups of experiments. Figure 8 shows the corresponding average root mean square errors.

Table 7 Maximum absolute error in the first group of experiment
Table 8 Maximum absolute error in the second group of experiment
Fig. 8
figure 8

Average root mean square error. a The first group of experiment, b the second group of experiment

As seen in Tables 7 and 8, the maximum absolute error of the MKPLPSVR is smaller than those of the other algorithms. Moreover, the results in the second group of experiments are more accurate than those in the first group, because the prior data are utilized to extend the few training data. Figure 8 also clearly shows that the results predicted by the MKPLPSVR are the most accurate among all of the algorithms. Compared with the MKLPSVR and the PLPSVR, the MKPLPSVR shows a better performance, because the prior knowledge and multiple kernels are simultaneously incorporated into the LPSVR, whereas the MKLPSVR and the PLPSVR separately use multiple kernels or prior knowledge. Compared with the SimpleMKL and the MKLPSVR, the MKPLPSVR is superior in terms of generalization performance. A possible explanation is that the MKPLPSVR utilizes the prior data from a calibrated simulator to extend the few training data samples and thus improves the data-based modeling accuracy. It follows that introducing prior knowledge from a calibrated simulator and multiple kernels into the LPSVR can improve the data-based modeling accuracy when the measured data are scarce.

The electrical performance was also evaluated by combining Eqs. (27) and (30). Figure 9 presents a comparison between the five predicted responses and the measured result.

Fig. 9
figure 9

Comparison of the electrical performance. a Transfer characteristics, b reflection characteristics

The results in Fig. 9 show that the model developed by the MKPLPSVR is much closer to the measurements than the others. Comparing the transfer characteristics with the reflection characteristics, we find that the transfer characteristics calculated by the MKPLPSVR are closer to the measurement than its reflection characteristics. From these comparisons, we can find that the proposed MKPLPSVR is effective in solving the modeling problem from a small amount of measured data.

5 Conclusions

In order to obtain an accurate model from an insufficient amount of measurement data, this paper has presented a multi-kernel linear programming support vector regression with prior knowledge. In the algorithm, multiple feature spaces are utilized to incorporate multi-kernel functions into the framework of the LPSVR, and the prior knowledge from a physical simulator is then incorporated into the framework of the LPSVR by modifying the optimization objective and the inequality constraints. In the end, prior knowledge and multi-kernel functions are simultaneously incorporated into the framework of the LPSVR. In addition, a strategy for parameter selection has been presented to facilitate the practical application of the proposed MKPLPSVR algorithm. Experiments on a synthetic example, a microstrip antenna and a six-pole microwave filter have been carried out, and the experimental results show that the model developed by the MKPLPSVR is more accurate than the ones developed by the other algorithms.

The proposed MKPLPSVR provides an approach to address the scarcity of measured data in practice, and it shows great potential in problems where sufficient measurement data are difficult and costly to obtain but prior knowledge data from a physical simulator are available. The proposed MKPLPSVR algorithm can be applied to the fields of computer-aided modeling and system identification. By incorporating prior knowledge into the MKLPSVR, one can reduce the effect of biased data from a calibrated physical simulator on the modeling accuracy. Although this paper focuses on regression from a limited amount of measured data, the same technique can be applied to classification problems with scarce measurement data, provided that prior data from a physical simulator are available. A possible future extension is to address model selection for the proposed algorithm, i.e., to find an efficient method to automatically determine the type and number of kernel functions and the hyper-parameters of the support vector regression.