1 Introduction

RBF neural networks have shown excellent performance in a number of problems of practical interest. In [24], brine reservoirs are analyzed for their physicochemical properties using RBF neural networks combined with genetic algorithms. The proposed model, called the GA-RBF model, has shown good results compared to previous approaches. In [12], the RBF kernel is used to predict the pressure gradient with high accuracy. In the context of nuclear physics, RBF has been used effectively to model the stopping power data of materials, as in [15]. A comprehensive discussion of various applications can be found in [6].

In recent years, considerable advancement has been made in the field. In [23], two new RBF construction algorithms are proposed with the aim of increasing error convergence rates while using fewer computational nodes. The first method extends the popular Incremental Extreme Learning Machine algorithms with Nelder-Mead simplex optimization. The second algorithm uses the Levenberg-Marquardt algorithm to optimize the positions and heights of the RBFs. The results show better error performance compared to previous research. A new architecture of the optimized RBF neural network classifier is developed with the aid of fuzzy clustering and data preprocessing techniques in [19]. In [7], a bee-inspired algorithm, called cOptBees, is used with heuristics to automatically select the number, locations and dispersions of the basis functions to be used in RBF networks. The resulting BeeRBF is shown to be competitive and has the advantage of automatically determining the number of centers. To accelerate learning for large-scale data sequences, an incremental learning algorithm is proposed in [2]. The merits of fuzzy and crisp clustering are effectively combined in [18].

In [5], an orthogonal least-squares-based learning procedure is proposed, in which the centers of the RBF are selected one by one in a rational way until an adequate network has been constructed. In [10], a novel RBF network with a multikernel is proposed to obtain an optimized and flexible regression model. The unknown centers of the multikernels are determined by an improved k-means clustering algorithm, and an orthogonal least squares (OLS) algorithm is used to determine the remaining parameters. Another learning algorithm, proposed in [13], simplifies neural network training through the use of an adaptive computation algorithm (ACA), whose convergence is analyzed by the Lyapunov criterion. In [3], a sequential framework, the Meta-Cognitive Radial Basis Function network (McRBFN) with Projection-Based Learning (PBL), referred to as PBL-McRBFN, is proposed. The PBL-McRBFN is inspired by human meta-cognitive learning principles and is evaluated on two practical problems, namely acoustic emission signal classification and mammogram-based cancer classification. In [4], a nonparametric supervised classifier based on neural networks, referred to as the Self-Adaptive Growing Neural Network (SAGNN), is proposed. The SAGNN allows a neural network to adapt its size and structure according to the training data. The performance of the method is evaluated for fault diagnosis and compared with various nonparametric supervised neural networks. A hybrid optimization strategy, named the HPSOGA, is proposed in [26] by incorporating the adaptive optimization of particle swarm optimization (PSO) into a genetic algorithm (GA). The strategy is used to determine the parameters of radial basis function neural networks automatically (e.g., the number of neurons and their respective centers and radii).

Fig. 1 Architecture of an RBF neural network

Essentially, the architecture of an RBF network consists of three layers: (1) an input layer, (2) a nonlinear hidden layer, and (3) a linear output layer; refer to Fig. 1. Let \(\mathbf {x} \in \mathbb {R}^{m_0}\) be the input vector; the overall mapping of the RBF network, \(s:\mathbb {R}^{m_0}\rightarrow \mathbb {R}^{1}\), is then given as:

$$\begin{aligned} y=s(\mathbf {x})=\sum _{i=1}^{m_1}w_i\phi _i(\left\| \mathbf {x}-\mathbf {x}_i\right\| )+b \end{aligned}$$
(1)

where \(m_1\) is the number of neurons in the hidden layer, \(\mathbf {x}_i \in \mathbb {R}^{m_0}\) are the centers of the RBF network, \(w_i\) are the synaptic weights connecting the hidden layer to the output neuron, b is the bias term of the output neuron, and \(\phi _i\) is the basis function of the ith hidden neuron. Without loss of generality and for simplicity, a single output neuron is considered. Conventional RBF networks employ a number of kernels, such as multiquadrics, inverse multiquadrics and the Gaussian [14]. The Gaussian kernel, owing to its versatility, is considered the most commonly used kernel [25]:

$$\begin{aligned} \phi _i(\left\| \mathbf {x}-\mathbf {x}_i\right\| )=\exp \left( \frac{-\left\| \mathbf {x}-\mathbf {x}_i\right\| ^2}{\sigma ^{2}}\right) \end{aligned}$$
(2)

where \(\sigma \) is the spread of the Gaussian kernel. In one form or another, these kernels rely on a distance measure between the input and the centers of the network. Conventionally, the Euclidean distance has been used as an efficient distance metric. Recently, it has been argued that the cosine distance metric offers some complementary properties compared to the Euclidean distance measure [1]:

$$\begin{aligned} \phi _{i1}(\mathbf {x}.\mathbf {x}_i)=\frac{\mathbf {x}.\mathbf {x}_i}{\left\| \mathbf {x}\right\| \left\| \mathbf {x}_i\right\| + \gamma } \end{aligned}$$
(3)

where the term \(\gamma > 0\), a very small constant, is added to the denominator to avoid the indeterminate form of (3) in case \(\left\| \mathbf {x}\right\| \) or \(\left\| \mathbf {x}_i\right\| \) is zero. Accordingly, a novel kernel has been proposed to fuse the cosine and Euclidean distances [1]:

$$\begin{aligned} \phi _i(\mathbf {x},\mathbf {x}_i)=\alpha _1\phi _{i1}(\mathbf {x}.\mathbf {x}_i)+\alpha _2\phi _{i2}(\left\| \mathbf {x}-\mathbf {x}_i\right\| ) \end{aligned}$$
(4)

where \(\phi _{i1}(\mathbf {x}.\mathbf {x}_i)\) and \(\phi _{i2}(\left\| \mathbf {x}-\mathbf {x}_i\right\| )\) are the cosine and Euclidean kernels, respectively, with corresponding fusion weights \(\alpha _{1}\) and \(\alpha _{2}\).
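To make the fused kernel concrete, the following minimal NumPy sketch implements the Gaussian (Euclidean) kernel of (2), the cosine kernel of (3) and the manually weighted fusion of (4), together with the network mapping of (1). The function names and default parameter values are our own illustrative choices, not part of [1].

```python
import numpy as np

def gaussian_kernel(x, c, sigma=1.0):
    """Euclidean (Gaussian) kernel of (2)."""
    return np.exp(-np.sum((x - c) ** 2) / sigma ** 2)

def cosine_kernel(x, c, gamma=1e-50):
    """Cosine kernel of (3); gamma guards against zero norms."""
    return np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c) + gamma)

def fused_kernel(x, c, alpha1=0.5, alpha2=0.5, sigma=1.0, gamma=1e-50):
    """Manually weighted fusion of the two kernels, as in (4)."""
    return alpha1 * cosine_kernel(x, c, gamma) + alpha2 * gaussian_kernel(x, c, sigma)

def rbf_output(x, centers, w, b, **kernel_args):
    """RBF network mapping of (1), here using the fused kernel."""
    return sum(w_i * fused_kernel(x, c, **kernel_args)
               for w_i, c in zip(w, centers)) + b
```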

Harnessing the distinctive properties of the cosine and Euclidean kernels, the formulation in (4) has shown good results compared to the conventional Euclidean kernel [1]. We argue, however, that the fusion of the two kernels is manual and that the weights \(\alpha _{1}\) and \(\alpha _{2}\) are adjusted by trial and error. Without any prior information, a common practice is to assign equal weights to the two kernels, i.e., \(\alpha _1=\alpha _2=0.5\). As such, there is no dynamic method of optimizing these weights for a given data set. We therefore propose a novel framework that adaptively optimizes the weight assignment using the steepest descent method [22].

The rest of the paper is organized as follows: In Sect. 2, the proposed novel adaptive kernel is thoroughly discussed. This is followed by extensive experiments in Sect. 3. The paper is finally concluded in Sect. 4.

2 Proposed Method

We consider \(\alpha _1\) and \(\alpha _2\) in (4) to be dynamically adaptive variables:

$$\begin{aligned} \alpha _1\equiv \frac{|\alpha _1(n)|}{|\alpha _1(n)|+|\alpha _2(n)|}\end{aligned}$$
(5)
$$\begin{aligned} \alpha _2\equiv \frac{|\alpha _2(n)|}{|\alpha _1(n)|+|\alpha _2(n)|} \end{aligned}$$
(6)

where the normalization of the mixing weights ensures that \(\alpha _1+\alpha _2=1\). The new kernel is therefore defined as:

$$\begin{aligned} \phi _i(\mathbf {x},\mathbf {x}_i)=\frac{|\alpha _1(n)|\phi _{i1}(\mathbf {x}.\mathbf {x}_i)+|\alpha _2(n)|\phi _{i2}(\left\| \mathbf {x}-\mathbf {x}_i\right\| )}{|\alpha _1(n)|+|\alpha _2(n)|} \end{aligned}$$
(7)

The overall mapping at the nth learning iteration (within a given epoch) can now be written as:

$$\begin{aligned} y(n)=\sum _{i=1}^{m_1}w_i(n)\phi _i(\mathbf {x},\mathbf {x}_i)+b(n) \end{aligned}$$
(8)

where the synaptic weights \(w_i(n)\) and bias b(n) are adapted at each iteration. We define a cost function \(\mathcal {E}(n)\) as:

$$\begin{aligned} \mathcal {E}(n)=\mathcal {E}\left( \alpha _1(n),\alpha _2(n)\right) =\frac{1}{2}(d(n)-y(n))^{2} \end{aligned}$$
(9)

where d(n) is the desired output at the nth iteration and \(e(n)=d(n)-y(n)\) is the instantaneous error between the desired and actual outputs of the neuron. The update rule for the kernel weight \(\alpha _1(n)\) is given by:

$$\begin{aligned} \varDelta \alpha _1(n)=-\eta \frac{\partial \mathcal {E}(n)}{\partial \alpha _1(n)} \end{aligned}$$
(10)

Using the chain rule of differentiation for the cost function in (9) yields:

$$\begin{aligned} \frac{\partial \mathcal {E}(n)}{\partial \alpha _1(n)}=\frac{\partial \mathcal {E}(n)}{\partial e(n)}\frac{\partial e(n)}{\partial y(n)}\frac{\partial y(n)}{\partial \phi _i(\mathbf {x},\mathbf {x}_i)}\frac{\partial \phi _i(\mathbf {x},\mathbf {x}_i)}{\partial \alpha _1(n)} \end{aligned}$$
(11)

which, upon simplification of the partial derivatives in (11), results in:

$$\begin{aligned} \frac{\partial \mathcal {E}(n)}{\partial \alpha _1(n)}=-e(n)\,w_i(n)\,\frac{|\alpha _1(n)||\alpha _2(n)|}{\alpha _1(n)\left[ |\alpha _1(n)|+|\alpha _2(n)|\right] ^{2}}\left[ \phi _{i1}(\mathbf {x}.\mathbf {x}_i)-\phi _{i2}(\left\| \mathbf {x}-\mathbf {x}_i\right\| )\right] \end{aligned}$$
(12)

and using (10) and (12), the update rule for \(\alpha _1(n)\) is found to be:

$$\begin{aligned} \alpha _1(n+1)=\alpha _1(n)+\eta e(n)w_i(n)\frac{|\alpha _1(n)||\alpha _2(n)|}{\alpha _1(n)\left[ |\alpha _1(n)|+|\alpha _2(n)|\right] ^{2}}\left[ \phi _{i1}(\mathbf {x}.\mathbf {x}_i)-\phi _{i2}(\left\| \mathbf {x}-\mathbf {x}_i\right\| )\right] \end{aligned}$$
(13)

Similarly, the update rule for \(\alpha _2(n)\) can be shown to be:

$$\begin{aligned} \alpha _2(n+1)=\alpha _2(n)+\eta e(n)w_i(n)\frac{|\alpha _1(n)||\alpha _2(n)|}{\alpha _2(n)\left[ |\alpha _1(n)|+|\alpha _2(n)|\right] ^{2}}\left[ \phi _{i2}(\left\| \mathbf {x}-\mathbf {x}_i\right\| )-\phi _{i1}(\mathbf {x}.\mathbf {x}_i)\right] \end{aligned}$$
(14)

The update equations of the weight and bias are given as:

$$\begin{aligned} w_i(n+1)=w_i(n)+\eta e(n)\phi _i(\mathbf {x},\mathbf {x}_i)\end{aligned}$$
(15)
$$\begin{aligned} b(n+1)=b(n)+\eta e(n) \end{aligned}$$
(16)

The proposed approach is dynamic and does not require prior assignment of the weights for the participating kernels.
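A minimal sketch of the resulting training procedure, implementing the adaptive kernel (7), the mapping (8) and the updates (13)-(16), is given below. The code is our own illustration rather than the authors' MATLAB implementation; in particular, the summation over the hidden units, which is left implicit in (12)-(14), is written out explicitly, and the default parameter values are assumptions.

```python
import numpy as np

def train_adaptive_rbf(X, d, centers, eta=1e-3, epochs=100,
                       sigma=0.1, gamma=1e-50):
    """Sketch of the adaptive-kernel RBF training of Sect. 2.

    X: (N, m0) training inputs, d: (N,) desired outputs,
    centers: (m1, m0) RBF centers. All names are ours.
    """
    m1 = centers.shape[0]
    w = np.zeros(m1)                       # synaptic weights, initialized to zero
    b = 0.0                                # output bias
    a1, a2 = 0.5, 0.5                      # initial mixing variables

    for _ in range(epochs):
        for x, target in zip(X, d):
            # per-neuron Gaussian and cosine responses, (2)-(3)
            diff = x - centers
            phi2 = np.exp(-np.sum(diff ** 2, axis=1) / sigma ** 2)
            phi1 = centers @ x / (np.linalg.norm(x)
                                  * np.linalg.norm(centers, axis=1) + gamma)

            s = abs(a1) + abs(a2)
            phi = (abs(a1) * phi1 + abs(a2) * phi2) / s        # kernel (7)

            y = w @ phi + b                                    # mapping (8)
            e = target - y                                     # error e(n)

            # gradient factor of (12); summed over hidden units
            g = w @ (phi1 - phi2)
            common = eta * e * g * abs(a1) * abs(a2) / s ** 2
            a1, a2 = a1 + common / a1, a2 - common / a2        # (13), (14)

            w = w + eta * e * phi                              # (15)
            b = b + eta * e                                    # (16)

    norm = abs(a1) + abs(a2)
    return w, b, abs(a1) / norm, abs(a2) / norm                # normalized weights
```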

3 Experimental Results

The proposed novel kernel for the RBF network is evaluated on three important tasks: (1) nonlinear system identification, (2) pattern classification and (3) function approximation. All experiments were conducted in MATLAB on an Intel(R) Core(TM) i7-3770 CPU @ 3.4 GHz machine with 4 GB of memory.

3.1 Nonlinear System Identification

Fig. 2 Nonlinear system identification using RBF neural network

Complex control systems and industrial processes can be effectively modeled using nonlinear systems [17]. Nonlinear system identification is a method for estimating the mathematical model of a nonlinear system from its inputs and outputs. RBF neural networks have been shown to achieve good performance in this context [8, 9, 16]. To evaluate the efficacy of the proposed kernel, we consider a highly nonlinear system, shown in Fig. 2:

$$\begin{aligned} y(t)=a_1r(t)+a_2r(t-1)+a_3r(t-2)+a_4\left[ \cos (br(t))+\exp (-|r(t)|)\right] +n(t) \end{aligned}$$
(17)

where r(t) and y(t) are the input and output of the system, respectively, n(t) is the disturbance, assumed to be \(\mathcal {N}(0,\sigma _d^2)\), the \(a_i\) are the polynomial coefficients describing the zeros of the system, and \(b>0\) is a constant. For the purpose of this experiment, r(t) is taken to be a step function. In Fig. 2, the system is defined by its impulse response h(t), while \(\hat{y}(t)\), \(\hat{h}(t)\) and e(t) are the estimated output, the estimated impulse response and the estimation error, respectively. The simulation parameters chosen for the experiments are: \(a_1=2\), \(a_2=-0.5\), \(a_3=-0.1\), \(a_4=-0.7\), \(b=3\) and \(\sigma _d^2=0.0025\).
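As an illustration, the system of (17) with the stated parameters can be simulated as in the following sketch; the sequence length and random seed are our own assumptions.

```python
import numpy as np

def nonlinear_system(T=2000, sigma_d2=0.0025, seed=0):
    """Generate input/output pairs from the system in (17) (our sketch)."""
    rng = np.random.default_rng(seed)
    a1, a2, a3, a4, b = 2.0, -0.5, -0.1, -0.7, 3.0
    r = np.ones(T)                                   # step input r(t)
    y = np.zeros(T)
    for t in range(T):
        r1 = r[t - 1] if t >= 1 else 0.0
        r2 = r[t - 2] if t >= 2 else 0.0
        y[t] = (a1 * r[t] + a2 * r1 + a3 * r2
                + a4 * (np.cos(b * r[t]) + np.exp(-abs(r[t])))
                + rng.normal(0.0, np.sqrt(sigma_d2)))   # disturbance n(t)
    return r, y
```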

Fig. 3 Comparison of the output of the nonlinear system

Fig. 4 Nonlinear system: the MSE curves for various approaches

Fig. 5 Nonlinear system: adaptation of the mixing parameters with respect to time

For the RBF structure, the number of neurons was set to 401, with the centers uniformly spaced between −50 and 50 with a step size of 0.25. The initial weights and bias were set to zero. For the Gaussian kernel, the spread \(\sigma \) was set to 0.1, and for the cosine kernel, a small value of \(\gamma = 10^{-50}\) was used. For the proposed approach, the initial values of \(\alpha _1\) and \(\alpha _2\) were taken to be 0.5. Figure 3 shows the estimated output of the proposed approach compared to the actual output, the Euclidean kernel (\(\alpha _1=0, \alpha _2=1\)), the cosine kernel (\(\alpha _1=1, \alpha _2=0\)) and the manual fusion of the two kernels (\(\alpha _1=\alpha _2=0.5\)). Note that, owing to its highly accurate estimation, the Euclidean kernel overlaps the actual output and therefore cannot be distinguished. The mean square error (MSE) curves are depicted in Fig. 4. The Euclidean kernel produces the best performance, achieving a minimum MSE of −6.1943 dB in 1379 epochs, while the cosine kernel performs poorly with an MSE of 2.7887 dB. Without any prior information, the proposed approach dynamically gives more weight to the Euclidean kernel, attaining a minimum MSE of −6.1547 dB in 1447 epochs, which is quite comparable to the Euclidean kernel. The final values of the weights were found to be \(\alpha _1=0.002\) and \(\alpha _2=0.998\). The proposed approach is substantially better than the manual fusion of the kernels, which achieved a minimum MSE of −5.5176 dB in 1992 epochs. The variation of the mixing parameters with respect to the epochs is depicted in Fig. 5. To compare the time complexity of the proposed method with the manual fusion of the two kernels, we measured the training time for 2000 epochs: the proposed method takes 550.78 s, whereas the manual fusion takes 537.74 s. The experiment clearly shows that, in the absence of any prior knowledge, the proposed approach adaptively emphasizes the more effective Euclidean kernel and achieves comparable performance.
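For reference, the configuration described above can be set up along the following lines. This sketch assumes the nonlinear_system and train_adaptive_rbf helpers from the earlier sketches are in scope; the learning rate is our own assumption, as it is not stated for this experiment.

```python
import numpy as np

# 401 centers uniformly spaced between -50 and 50 with a step size of 0.25
centers = np.linspace(-50.0, 50.0, 401).reshape(-1, 1)

r, y = nonlinear_system()                 # system data from the sketch above
X = r.reshape(-1, 1)

# sigma = 0.1, gamma = 1e-50, initial alpha1 = alpha2 = 0.5, as in the text;
# eta is an assumed value
w, b, alpha1, alpha2 = train_adaptive_rbf(X, y, centers, eta=1e-3,
                                          epochs=2000, sigma=0.1, gamma=1e-50)
```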

3.2 Pattern Classification

Machine learning methods have been used with great success in bioinformatics [21]. One of the important applications is the prediction of cancer from gene microarray data. In this experiment, we target the prediction of leukemia using the standard Leukemia ALL/AML data set [11]. The data set consists of 38 training samples from bone marrow specimens (27 ALL and 11 AML) and 34 test samples (20 ALL and 14 AML) prepared under different experimental conditions, including 24 bone marrow and 10 blood specimens. The data set comprises 7129 genes. Minimum Redundancy Maximum Relevance (mRMR) is an established technique for selecting the most significant genes [21]; it was used to select only the top five genes for our experiments. For the RBF structure, the number of neurons was set to 38, and the centers were chosen using the subtractive clustering method of [20] with an influence factor of 0.1. The initial weights and bias were set to zero. For the Gaussian kernel, the spread \(\sigma \) was set to 0.2, and for the cosine kernel, a small value of \(\gamma = 10^{-50}\) was used. For the proposed approach, the initial values of \(\alpha _1\) and \(\alpha _2\) were taken to be 0.5. For the training phase, the MSE curves of the different approaches are shown in Fig. 6.

Fig. 6 The MSE curves for the training phase of the pattern classification problem

Fig. 7 Pattern classification: adaptation of the mixing parameters with respect to time

The Euclidean kernel outperforms the cosine kernel, achieving a minimum MSE of −279.9331 dB. The proposed method dynamically gives more weight to the Euclidean kernel, achieving an MSE of −122.4990 dB with \(\alpha _1=0.3121\) and \(\alpha _2=0.6879\). Note that, although the Euclidean kernel achieves the minimum MSE on the training data, this is simply a case of overfitting, in which a classifier achieves the best performance on the training set but fails on the test data. The variation of the mixing parameters with respect to the epochs is depicted in Fig. 7. In Fig. 6, after the 65th epoch, the MSE of the Euclidean kernel becomes lower than that of the cosine kernel; noteworthy is the corresponding flip in the weights adaptively assigned by the proposed approach in Fig. 7. Note that the weights become stable after 400 epochs. The manual fusion of the two kernels (\(\alpha _1=\alpha _2=0.5\)) results in an MSE of −73.6652 dB, which is inferior to the proposed method. The training accuracies of all approaches are presented in Table 1; note that all approaches achieve 100 % accuracy on the training samples. The total training time for the proposed method is 12.98 s, whereas the manual fusion of the two kernels takes 12.65 s.

Table 1 Results for the pattern classification problem

The true evaluation of any predictive system is on unseen samples, i.e., the testing phase. Although the Euclidean kernel achieves the minimum MSE during the training phase, the proposed approach achieves the best performance in the testing stage, with an accuracy of 97.06 %. The Euclidean kernel was trained "too well" on the training samples and therefore suffered from overfitting, attaining a test accuracy of only 58.82 %. The proposed dynamic fusion of the two kernels outperformed the manual fusion (\(\alpha _1=\alpha _2=0.5\)) by a margin of 2.94 %.

We provide an intuitive understanding of the proposed approach using this pattern classification problem. Data that are not linearly separable in the original space pose a challenging task in classification theory. Cover's theorem states that such data can be mapped into a high-dimensional space using a nonlinear mapping (kernel) function, resulting in linearly separable data in the transformed space.

The selection of an appropriate kernel is an important issue. A good kernel results in optimal separation of the classes in the transformed space, thereby improving the performance on unseen test samples. Fusing multiple kernels is often a good way to harness the complementary properties of the individual kernels, and the weights of the combined kernels play an important role in such cases; selecting the weights at random may result in an inefficient fusion. The proposed adaptive fusion framework automatically selects the best weights for the combined kernels, resulting in maximum separation of the classes. We demonstrate this through clustering of the Leukemia data set consisting of 38 samples (27 Class A and 11 Class B) and 5 attributes. For demonstration purposes, we choose two centers, \(c_1\) and \(c_2\), which are the means of classes A and B, respectively. The mapping of the samples into the 2D space using the various kernels is shown in Fig. 8.

Fig. 8 Clustering of the Leukemia data using various kernels. Note the class separation for the proposed kernel

It can be seen that the cosine kernel efficiently separates the two classes in the 2D space, while the Euclidean kernel maps all samples to the origin (the overlapping samples appear as a single green circle). The manual fusion of the kernels (with equal weights) results in decreased class separation compared to the cosine kernel. The proposed adaptive fusion of the two kernels automatically assigns more weight to the cosine kernel, thereby resulting in better clustering than the manual fusion.
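The 2D mapping underlying Fig. 8 can be reproduced, up to plotting details, with a sketch of the following form; the function name and the exact visualization choices are ours.

```python
import numpy as np

def kernel_map_2d(X, labels, sigma=0.2, gamma=1e-50):
    """Map samples to 2D via their kernel responses to the two class means.

    X is (N, 5) after mRMR gene selection; labels are 0 (class A) or 1 (class B).
    Returns the cosine-kernel and Euclidean-kernel 2D coordinates.
    """
    c1 = X[labels == 0].mean(axis=0)      # class A mean
    c2 = X[labels == 1].mean(axis=0)      # class B mean

    def cosine(x, c):
        return np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c) + gamma)

    def gauss(x, c):
        return np.exp(-np.sum((x - c) ** 2) / sigma ** 2)

    cos_2d = np.array([[cosine(x, c1), cosine(x, c2)] for x in X])
    euc_2d = np.array([[gauss(x, c1), gauss(x, c2)] for x in X])
    return cos_2d, euc_2d
```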

3.3 Function Approximation

We consider the problem of approximating a nonlinear function defined by:

$$\begin{aligned} f(x,y)=\exp (x^2-y) \end{aligned}$$
(18)

The function in (18) is approximated using the various kernels. For all experiments, 121 centers were used and the learning rate was set to \(\eta =1\times 10^{-3}\). The centers were chosen using the subtractive clustering method of [20] with an influence factor of 0.1. The initial weights and bias were set to zero. For the Gaussian kernel, the spread \(\sigma \) was set to 0.2, and for the cosine kernel, a small value of \(\gamma = 10^{-50}\) was used. For the proposed approach, the initial values of \(\alpha _1\) and \(\alpha _2\) were taken to be 0.5. A total of 121 (x, y) pairs, ranging from −1 to 1 with a step size of 0.2, were used for training. Testing was conducted on 100 data points ranging from −0.9 to 0.9 with a step size of 0.2. For the test data, Fig. 9 shows the estimated output of the proposed approach compared to the actual output, the Euclidean kernel (\(\alpha _1=0, \alpha _2=1\)), the cosine kernel (\(\alpha _1=1, \alpha _2=0\)) and the manual fusion of the two kernels (\(\alpha _1=\alpha _2=0.5\)), shown in a reduced dimension.
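For completeness, the training and test grids described above, together with the target values of (18), can be generated as in the following sketch; the centers obtained by the subtractive clustering of [20] are not reproduced here.

```python
import numpy as np

# training grid: x and y from -1 to 1 in steps of 0.2 -> 11 x 11 = 121 points
xs = np.round(np.arange(-1.0, 1.0 + 1e-9, 0.2), 10)
X_train = np.array([[x, y] for x in xs for y in xs])
d_train = np.exp(X_train[:, 0] ** 2 - X_train[:, 1])   # f(x, y) of (18)

# test grid: x and y from -0.9 to 0.9 in steps of 0.2 -> 10 x 10 = 100 points
xt = np.round(np.arange(-0.9, 0.9 + 1e-9, 0.2), 10)
X_test = np.array([[x, y] for x in xt for y in xt])
d_test = np.exp(X_test[:, 0] ** 2 - X_test[:, 1])
```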

Fig. 9 Comparison of the output of the nonlinear function for various kernels

Fig. 10 The MSE curves for the training phase of the function approximation problem

Fig. 11 Function approximation: adaptation of the mixing parameters with respect to time

The MSE curves are depicted in Fig. 10. The Euclidean kernel produces the best performance, achieving a minimum MSE of −18.6619 dB, while the cosine kernel performs poorly with an MSE of −4.9277 dB. Without any prior information, the proposed approach dynamically gives more weight to the Euclidean kernel and attains a minimum MSE of −18.4076 dB, which is comparable to the Euclidean kernel. The proposed approach is substantially better than the manual fusion of the kernels, which achieved a minimum MSE of −15.6181 dB. The variation of the mixing parameters with respect to the epochs is depicted in Fig. 11. The final values of the weights were found to be \(\alpha _1=0.0060\) and \(\alpha _2=0.9940\). The experiment clearly shows that, in the absence of any prior knowledge, the proposed approach adaptively emphasizes the more effective Euclidean kernel and achieves better performance. To compare the time complexity of the proposed method with the manual fusion of the two kernels, we measured the training time for 10,000 epochs: the proposed method takes 586.3 s, whereas the manual fusion takes 578.2 s.

4 Conclusion

In this research, a novel kernel for the RBF neural network is proposed. The proposed framework adaptively fuses the Euclidean and cosine distance measures, thereby harnessing their complementary properties. The algorithm is dynamic and adaptively learns the optimum weights of the participating kernels for a given problem. The efficacy of the proposed kernel is demonstrated on three important problems, namely nonlinear system identification, pattern classification and function approximation. The proposed algorithm is shown to comprehensively outperform the manual fusion of the two kernels. For the nonlinear system identification problem, the proposed framework adaptively assigns a higher fusion weight to the Euclidean kernel, achieving performance comparable to the Euclidean kernel and better than the manual fusion of the two kernels. Therefore, in the absence of any prior knowledge, the proposed method is shown to emphasize the most effective kernel. For the pattern classification problem, the proposed method dynamically assigns more weight to the Euclidean kernel and achieves a comparable training accuracy of 100 %. For the more challenging testing phase, the proposed optimized fusion attains the best accuracy of 97.06 %; note that the proposed approach outperformed the best conventional kernel, i.e., the Euclidean kernel, by meaningfully exploiting the complementary properties of the cosine kernel. For the function approximation problem, the Euclidean kernel produces the best performance, achieving a minimum MSE of −18.6619 dB, while the cosine kernel performs poorly with an MSE of −4.9277 dB. Without any prior information, the proposed approach dynamically gives more weight to the Euclidean kernel and achieves a minimum MSE of −18.4076 dB. In all of our experiments, the proposed fusion of kernels performed comparably to or better than the best participating kernel.