1 Introduction

Support Vector Machine (SVM) (Cortes and Vapnik 1995) utilizes the principle of structural risk minimization and solves a convex quadratic programming problem (QPP) to find the optimal hyperplane, making it an effective machine learning algorithm for pattern recognition problems. Due to its theoretical advantages and excellent generalization performance, SVM is widely used in many fields. However, traditional SVM has high computational complexity, which makes it difficult to process large and complex data rapidly. To alleviate this problem, Khemchandani and Chandra (2007) proposed the Twin Support Vector Machine (TSVM). Unlike conventional SVM, TSVM seeks two non-parallel classification hyperplanes, each of which lies close to one class and as far as possible from the other. Furthermore, the single large QPP in SVM is replaced by two smaller QPPs, so that the computational time of TSVM is approximately a quarter of that of traditional SVM. When dealing with large-scale classification problems (Xie et al. 2023a, b), TSVM exhibits shorter training times and lower training costs, which overcome the shortcomings of existing SVMs. Moreover, TSVM is also superior to some existing models in terms of classification performance (Tanveer et al. 2022b). Therefore, TSVM has been widely used in many fields, such as Alzheimer’s disease prediction (Ganaie et al. 2023; Sharma et al. 2022), EEG signal classification (Ganaie et al. 2022a; Hazarika et al. 2023), and text recognition (Francis and Sreenath 2022).

It is worth noting that both TSVM and SVM ignore the distribution of the data within each class, which makes them sensitive to noise (Liang and Zhang 2022). To address this issue, researchers have combined them with fuzzy set theory. Different fuzzy membership assignment strategies have been proposed to describe the influence of individual samples on the construction of the optimal hyperplane, thereby reducing the negative impact of noise and improving classification performance (Ganaie et al. 2022, 2020). For example, Yu et al. (2019) utilized a K-nearest-neighbor-based probability density estimation strategy to compute the relative density of each training instance and proposed a fuzzy support vector machine with relative density information. Borah and Gupta (2022) incorporated fuzzy membership values, computed from transformed class probability and class affinity, into the objective function of a least squares support vector machine formulation, yielding an affinity and transformed class probability-based fuzzy least squares support vector machine. Kung and Hao (2023) proposed a fuzzy least squares support vector machine with a fuzzy hyperplane; its two key characteristics are that it assigns a fuzzy membership degree to every data vector according to its importance, and that the parameters of the hyperplane, such as the elements of the normal vector and the bias term, are treated as fuzzified variables.

In order to reduce the impact of outliers, Richhariya and Tanveer (2018) proposed a new fuzzy membership function that accounts for both the importance of samples and the data imbalance ratio, and combined it with the least squares twin support vector machine to address class imbalance problems effectively. Chen and Wu (2018) employed fuzzy membership functions from fuzzy neural networks to weight the margin of each training sample; by design, the impact of samples with high uncertainty is mitigated, which improves the generalization ability of the model. Gupta et al. (2019) proposed a fuzzy membership assignment strategy based on information entropy and combined it with TSVM for class imbalance learning. Hao et al. (2021) evaluated which fuzzy hyperplane each sample lies closest to by defining a fuzzy partial ordering relation, and developed a novel fuzzy TSVM to merge the large volume of information in online news for predicting stock price trends. Ganaie et al. (2021) proposed a novel fuzzy least squares projection twin support vector machine, which seeks projections such that the samples of each class cluster around their class mean and assigns fuzzy weights to each sample to reduce the effect of outliers. Motivated by angle-based algorithms, Richhariya et al. (2021) proposed an efficient angle-based universum least squares twin support vector machine (AULSTSVM), which is capable of handling heteroscedastic noise in large-scale datasets. Richhariya et al. (2021a) proposed a fuzzy universum least squares twin support vector machine that assigns fuzzy membership to the universum data, aiming to provide appropriate data-distribution information to the classifier; this approach was applied to Alzheimer’s disease and breast cancer detection.

Recently, Rezvani et al. (2019) combined intuitionistic fuzzy sets with TSVM to address its sensitivity to noise, resulting in an extension of FTSVM known as the intuitionistic fuzzy twin support vector machine (IFTSVM). In IFTSVM, each training sample is assigned a membership degree and a non-membership degree, from which a score function characterizing the sample’s importance is constructed to reduce the impact of noise. On this basis, Rezvani and Wang (2021, 2022) used fuzzy Adaptive Resonance Theory and a weighting strategy, respectively, in conjunction with IFTSVM to tackle class imbalance learning, explicitly addressing large-scale class imbalance problems containing noise. Tanveer et al. (2022a) proposed a novel intuitionistic fuzzy weighted least squares twin support vector machine that exploits local neighborhood information among the data points and uses both membership and non-membership weights to reduce the effect of noise and outliers; it was applied to the diagnosis of schizophrenia. Ju et al. (2021) combined interval-valued fuzzy sets with TSVM to address multi-class problems: an interval-valued fuzzy membership is assigned to each sample, and the resulting interval-valued fuzzy twin support vector machine effectively reduces the influence of noise and improves classification performance. In conclusion, constructing an appropriate fuzzy membership assignment strategy is a crucial way to alleviate the sensitivity of TSVM to noise.

Centered kernel alignment (CKA) is a method that measures the degree of similarity between two kernels (or kernel matrices). Owing to its effectiveness and low computational complexity, it has been applied to improve the performance of machine learning algorithms. For example, Lu et al. (2014) employed CKA to unify the two tasks of clustering and multiple kernel learning into a single optimization framework, yielding a novel multiple kernel clustering method. Cárdenas et al. (2016) utilized CKA to assess the affinity between the kernel matrix of resonance imaging data and the label target matrix, and proposed an improved artificial neural network algorithm for the diagnosis of Alzheimer’s disease. Wang et al. (2020) combined CKA with SVM to propose a classification algorithm robust to noise, in which CKA is employed to calculate the dependence between a data point and its associated label. It is therefore worth investigating the use of CKA to address the sensitivity to noise.

In this paper, a fuzzy twin support vector machine based on CKA is proposed to address the sensitivity of TSVM to noise. The method uses a heuristic function derived from CKA to calculate the dependence between each data point and its corresponding label and then assigns a fuzzy membership to each sample point, yielding a fuzzy membership assignment strategy that effectively alleviates the sensitivity of TSVM to noise. To prevent weakly correlated or irrelevant features from dominating this calculation, a feature-weighted kernel function is constructed using the information gain and applied in the computation of the centered kernel alignment; strongly correlated features thus receive larger weights than weakly correlated ones, reflecting the classification importance of different features. Combining this strategy with TSVM yields a new fuzzy twin support vector machine (FTSVM-CKA). Moreover, to speed up the training of the model, we employ a coordinate descent strategy with shrinking by active set to reduce the computational complexity, which effectively improves the training speed while maintaining the classification performance. Experiments were conducted on an artificial data set and 15 UCI data sets to validate the performance of FTSVM-CKA. The results show that FTSVM-CKA can efficiently and rapidly solve binary classification problems with noise.

In summary, the main contributions of this paper are as follows:

(1) The idea of feature weighting is integrated into the centered kernel alignment method. A feature-weighted kernel function is constructed and applied to the calculation of the centered kernel alignment, which prevents the calculation from being dominated by weakly correlated or uncorrelated features.

(2) A fuzzy membership assignment strategy based on the centered kernel alignment method is given. This strategy can significantly reduce the negative impact of noise.

(3) By combining the CKA-based fuzzy membership assignment strategy with TSVM, this paper proposes the FTSVM-CKA model, which can effectively solve classification problems with noise.

(4) The computational complexity brought by high-dimensional input is addressed by the coordinate descent strategy with shrinking by active set, which effectively improves the training speed of the model.

(5) For the nonlinear case, the kernel trick is applied directly and hence the exact formulation is solved.

(6) Experimental results on benchmark datasets demonstrate the ability of the proposed FTSVM-CKA to reduce the negative impact of noise.

The remainder of this paper is organized as follows: Section 2 reviews some preliminaries. Section 3 describes the structure of the proposed FTSVM-CKA model in detail. The experimental results are reported in Sect. 4. Finally, conclusions and future work are presented in Sect. 5.

2 Related works

In this section, the model structure of TSVM is introduced, and then the concepts of centered kernel alignment and information gain are elaborated. Let \(S = \left\{ {({x_1},{y_1}),({x_2},{y_2}),\ldots ,({x_l},{y_l})} \right\} \) be the training sample set, where l is the total number of training samples, \({x_i} \in {R^d}\) and \({y_i} \in \left\{ { - 1, + 1} \right\} \), \(i = 1,2,\ldots ,l\), denote the ith training sample and its corresponding class label, respectively, and d is the feature dimension of the samples.

2.1 Twin support vector machine

Different from conventional SVM, TSVM aims to generate two non-parallel planes \(w_1^Tx + {b_1} = 0\) and \(w_2^Tx + {b_2} = 0\), each of which is closer to one of the two classes and as far away from the other as possible. The optimization problem of TSVM can be modeled as the following two smaller-scale QPPs:

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{w_1},{b_1},{\xi _2}} \mathrm{{ }}\frac{1}{2}{(A{w_1} + {e_1}{b_1})^T}(A{w_1} + {e_1}{b_1}) + {C_1}{e_2}^T{\xi _2}\\ s.t.\mathrm{{ }} - (B{w_1} + {e_2}{b_1}) + {\xi _2} \ge {e_2},\mathrm{{ }}{\xi _2} \ge 0 \end{array} \end{aligned}$$
(1)

and

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{w_2},{b_2},{\xi _1}} \mathrm{{ }}\frac{1}{2}{(B{w_2} + {e_2}{b_2})^T}(B{w_2} + {e_2}{b_2}) + {C_2}{e_1}^T{\xi _1}\\ s.t.\mathrm{{ }}(A{w_2} + {e_1}{b_2}) + {\xi _1} \ge {e_1},\mathrm{{ }}{\xi _1} \ge 0 \end{array} \end{aligned}$$
(2)

where A and B denote the matrices of samples belonging to the positive and negative classes, respectively, \({\xi _1}\) and \({\xi _2}\) are slack variables, \({e_1}\) and \({e_2}\) are vectors of ones of appropriate dimensions, and \({C_1}\) and \({C_2}\) are penalty parameters.

By solving the dual problems of Eq. (1) and Eq. (2), two optimal hyperplanes can be obtained. For any input sample \({x^ * }\), its classification decision function is as follows:

$$\begin{aligned} {y^ * } = \mathop {\arg \min }\limits _{i \in 1,2} \frac{{\left| {w_i^T{x^ * } + {b_i}} \right| }}{{\left\| {{w_i}} \right\| }} \end{aligned}$$
(3)
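For illustration, the decision rule of Eq. (3) amounts to the following short Python sketch (NumPy only; `w1`, `b1`, `w2`, `b2` are assumed to have been obtained already by solving the two dual problems):

```python
import numpy as np

def tsvm_predict(x, w1, b1, w2, b2):
    """Assign x to class +1 or -1 according to Eq. (3): the class whose
    hyperplane is nearer in perpendicular distance."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)   # distance to the positive-class plane
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)   # distance to the negative-class plane
    return +1 if d1 <= d2 else -1
```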

2.2 Centered kernel alignment

Centered kernel alignment (CKA) (Cortes et al. 2012) measures the degree of similarity between two kernels (or kernel matrices) and has been widely used for kernel learning and selection due to its effectiveness and low computational complexity.

For the data set \(S = \left\{ {({x_1},{y_1}),({x_2},{y_2}),\ldots ,({x_l},{y_l})} \right\} \), the kernel matrix K derived from a kernel function k is given by \({K_{i,j}} = k\left( {{x_i},{x_j}} \right) \). Given two kernel functions \({k_1}\) and \({k_2}\), let their corresponding kernel matrices be \({K_1}\) and \({K_2}\), respectively. The Frobenius inner product between \({K_1}\) and \({K_2}\) is expressed as follows:

$$\begin{aligned} {\left\langle {{K_1},{K_2}} \right\rangle _F} = \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {{k_1}({x_i},{x_j}){k_2}({x_i},{x_j})} } \end{aligned}$$
(4)

Let \(e = {(1,1,\ldots ,1)^T} \in {R^l}\) and let \(I \in {R^{l \times l}}\) be the identity matrix; then the centering matrix H and the centered kernel matrix \(\overline{K} \) are calculated as follows:

$$\begin{aligned} H = I - \frac{{e{e^T}}}{l} \in {R^{l \times l}} \end{aligned}$$
(5)
$$\begin{aligned} \overline{K} = HKH \end{aligned}$$
(6)

The CKA of \({k_1}\) and \({k_2}\) on data set S is defined as

$$\begin{aligned} CKA({K_1},{K_2}) = \frac{{{{\left\langle {\overline{{K_1}} ,\overline{{K_2}} } \right\rangle }_F}}}{{\sqrt{{{\left\langle {\overline{{K_1}} ,\overline{{K_1}} } \right\rangle }_F}{{\left\langle {\overline{{K_2}} ,\overline{{K_2}} } \right\rangle }_F}} }} \end{aligned}$$
(7)
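For concreteness, Eqs. (4)–(7) translate into the following Python sketch, where `K1` and `K2` are precomputed kernel matrices of the same size (the function names are ours, chosen for illustration):

```python
import numpy as np

def center(K):
    """Center a kernel matrix: K_bar = H K H with H = I - e e^T / l (Eqs. (5)-(6))."""
    l = K.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return H @ K @ H

def cka(K1, K2):
    """Centered kernel alignment between two kernel matrices (Eq. (7))."""
    K1c, K2c = center(K1), center(K2)
    num = np.sum(K1c * K2c)   # Frobenius inner product, as in Eq. (4), of the centered matrices
    den = np.sqrt(np.sum(K1c * K1c) * np.sum(K2c * K2c))
    return num / den
```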

2.3 Information gain

Information gain (Han et al. 2022) is often used for feature correlation analysis.

Suppose the sample set S has m category labels \({C_i},i = 1,2,\ldots ,m\), and let \({S_i}\) denote the set of all samples in S with label \({C_i}\); then the information entropy of S is defined as follows:

$$\begin{aligned} Info\left( S \right) = - \sum \limits _{i = 1}^m {{p_i}{{\log }_2}\left( {{p_i}} \right) } \end{aligned}$$
(8)

where \({p_i} = \frac{{\left| {{S_i}} \right| }}{{\left| S \right| }}\) denotes the proportion of samples with label \({C_i}\) in the sample set S and \(| \cdot |\) denotes the cardinality of a set.

For a feature F, suppose it takes v distinct values \({f_i},i = 1,2,\ldots ,v\), so that the sample set S is correspondingly split into \({S_i},i = 1,2,\ldots ,v\), where \({S_i}\) contains all the samples in S whose feature F takes the value \({f_i}\). Then the information gain \(IG\left( {S,F} \right) \) is defined as follows:

$$\begin{aligned} IG\left( {S,F} \right) = Info\left( S \right) - \sum \limits _{i = 1}^v {\frac{{\left| {{S_i}} \right| }}{{\left| S \right| }} \cdot Info\left( {{S_i}} \right) } \end{aligned}$$
(9)
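Eqs. (8)–(9) can be computed per feature as in the sketch below, which assumes discrete-valued features (continuous features would first be discretized, e.g. by binning; the discretization scheme is not specified in the text):

```python
import numpy as np

def entropy(y):
    """Information entropy of a label vector (Eq. (8))."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, y):
    """Information gain IG(S, F) of one discrete feature column (Eq. (9))."""
    gain = entropy(y)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(y[mask])   # |S_i| / |S| * Info(S_i)
    return gain
```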

3 A novel fuzzy twin support vector machine based on centered kernel alignment

In this section, we first propose a fuzzy membership assignment strategy based on centered kernel alignment. Then we elaborate the model structure of FTSVM-CKA in the linear and nonlinear cases. Finally, a coordinate descent strategy with shrinking by active set is introduced.

3.1 A fuzzy membership assignment strategy based on centered kernel alignment

Firstly, a feature-weighted kernel function is constructed by using the information gain, and it is applied to the calculation of the centered kernel alignment. This gives more weight to the strongly correlated features than to the weakly correlated features, in order to describe the classification importance of different features. Secondly, the centered kernel alignment method is employed to derive a heuristic function that calculates the dependency between sample points and their corresponding labels. This function assigns fuzzy membership degrees to different sample points, effectively mitigating the detrimental effects of noise.

Let a feature-weighted matrix P derived from the information gain be represented as follows:

$$\begin{aligned} P = \left[ {\begin{array}{cccc} {w_1}&{}&{}&{}\\ {}&{w_2}&{}&{}\\ {}&{}&\ddots &{}\\ {}&{}&{}&{w_d} \end{array}} \right] \end{aligned}$$
(10)

where \({w_i},i = 1,2,\ldots ,d\), denotes the weight of the ith feature calculated by the information gain. The feature-weighted kernel function can then be defined as \({k_p}\left( {{x_i},{x_j}} \right) = k\left( {x_i^TP,x_j^TP} \right) \). Three typical kernels with feature weights are as follows:

(1) Linear kernel:

$$\begin{aligned} {k_p}\left( {{x_i},{x_j}} \right) = \left( {x_i^TP} \right) \cdot \left( {x_j^TP} \right) = x_i^TP{P^T}{x_j} \end{aligned}$$
(11)

(2) Polynomial kernel:

$$\begin{aligned} \begin{array}{l} {k_p}\left( {{x_i},{x_j}} \right) = {\left[ {\gamma \left( {x_i^TP} \right) \cdot \left( {x_j^TP} \right) + r} \right] ^d}\\ \mathrm{{ }} = {\left( {\gamma x_i^TP{P^T}{x_j} + r} \right) ^d},\gamma > 0 \end{array} \end{aligned}$$
(12)

(3) Gaussian kernel:

$$\begin{aligned} \begin{array}{l} {k_p}({x_i},{x_j}) = \exp \left( - \gamma {\left\| {x_i^TP - x_j^TP} \right\| ^2}\right) \\ = \exp ( - \gamma ({({x_i} - {x_j})^T}P{P^T}({x_i} - {x_j}))) \end{array} \end{aligned}$$
(13)
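As an illustration, the feature-weighted Gaussian kernel of Eq. (13) can be evaluated as follows, with the diagonal of P taken to be the raw per-feature information-gain values (any rescaling of these weights is an implementation choice left unspecified here):

```python
import numpy as np

def weighted_gaussian_kernel(X1, X2, ig_weights, gamma=1.0):
    """Feature-weighted Gaussian kernel of Eq. (13).
    X1: (n1, d), X2: (n2, d); ig_weights: length-d information-gain weights
    forming the diagonal of P; gamma is the kernel width parameter."""
    P = np.diag(ig_weights)
    Z1, Z2 = X1 @ P, X2 @ P                       # scale each feature by its weight
    sq = (np.sum(Z1**2, axis=1)[:, None] + np.sum(Z2**2, axis=1)[None, :]
          - 2.0 * Z1 @ Z2.T)                      # pairwise squared distances
    return np.exp(-gamma * sq)
```

The weights would, for example, be the `information_gain` values computed feature by feature as sketched in Sect. 2.3.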

For a binary classification problem, let \(K,G \in {R^{l \times l}}\) be kernel matrices defined by \({K_{i,j}} = k({x_i},{x_j})\) and \({G_{i,j}} = g({y_i},{y_j})\), respectively, where \(g({y_i},{y_j})\) is defined as follows:

$$\begin{aligned} g({y_i},{y_j}) = \left\{ \begin{array}{l} + 1\mathrm{{ , }}{y_i} = {y_j}\\ - 1\mathrm{{ , }}{y_i} \ne {y_j} \end{array} \right. \end{aligned}$$
(14)

where the similarities between samples from the same class are set to \(\mathrm{{ + }}1\), and those between samples from different classes are set to \( - 1\). This definition reflects the ideal pairwise similarities between samples. Let \(y = {({y_1},{y_2},\ldots ,{y_l})^T}\); then

$$\begin{aligned}&CKA(K,G)\nonumber \\ {}&= \frac{{{{\left\langle {\overline{K} ,\overline{G} } \right\rangle }_F}}}{{\sqrt{{{\left\langle {\overline{K} ,\overline{K} } \right\rangle }_F}{{\left\langle {\overline{G} ,\overline{G} } \right\rangle }_F}} }} = \frac{{{{\left\langle {\overline{K} ,G} \right\rangle }_F}}}{{\sqrt{{{\left\langle {\overline{K} ,K} \right\rangle }_F}{{\left\langle {\overline{G} ,G} \right\rangle }_F}} }}\nonumber \\&= \frac{{{{\left\langle {\overline{K} ,y{y^T}} \right\rangle }_F}}}{{\sqrt{{{\left\langle {\overline{K} ,K} \right\rangle }_F}{{\left\langle {\overline{G} ,G} \right\rangle }_F}} }} = \frac{{\sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {{y_i}{y_j}\overline{k} \left( {{x_i},{x_j}} \right) } } }}{{\sqrt{{{\left\langle {\overline{K} ,K} \right\rangle }_F}{{\left\langle {\overline{G} ,G} \right\rangle }_F}} }}\nonumber \\&= \frac{1}{{\sqrt{{{\left\langle {\overline{K} ,K} \right\rangle }_F}{{\left\langle {\overline{G} ,G} \right\rangle }_F}} }}\left[ {\sum \limits _{{y_i} = {y_j}} {\overline{k} ({x_i},{x_j})} - \sum \limits _{{y_i} \ne {y_j}} {\overline{k} ({x_i},{x_j})} } \right] \end{aligned}$$
(15)

where, \(\overline{k} ({x_i},{x_j}) = {\overline{K} _{i,j}}\) is the centered kernel function.

For a given data set \(S = \left\{ {({x_1},{y_1}),({x_2},{y_2}),\ldots ,({x_l},{y_l})} \right\} \) and kernel functions k and g, we get

$$\begin{aligned} \overline{K} = HKH,\mathrm{{ }}\overline{G} = HGH \end{aligned}$$
(16)

where \({K_{i,j}} = k({x_i},{x_j})\) and \({G_{i,j}} = g({y_i},{y_j})\), \(H = I - \frac{{e{e^T}}}{l} \in {R^{l \times l}}\). Then,

$$\begin{aligned} {\left\langle {\overline{K} ,K} \right\rangle _F} = \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {\overline{k} } } \left( {{x_i},{x_j}} \right) k\left( {{x_i},{x_j}} \right) \end{aligned}$$
(17)

and

$$\begin{aligned} {\left\langle {\overline{G} ,G} \right\rangle _F} = \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {\overline{g} } } \left( {{y_i},{y_j}} \right) g\left( {{y_i},{y_j}} \right) \end{aligned}$$
(18)

are obtained, where \(\overline{k} \) and \(\overline{g} \) are the centered kernel functions corresponding to \(\overline{K} \) and \(\overline{G} \), respectively. Since \(\frac{1}{{\sqrt{{{\left\langle {\overline{K},K} \right\rangle }_F}{{\left\langle {\overline{G},G} \right\rangle }_F}} }}\) is a constant for a given data set, the CKA contribution of a sample \({x_t}\) can be expressed, up to this constant, as follows:

$$\begin{aligned} {d_t} = CKA(K,G,{x_t}) = \sum \limits _{{y_t} = {y_i}} {\overline{k} ({x_t},{x_i})} - \sum \limits _{{y_t} \ne {y_i}} {\overline{k} ({x_t},{x_i})} \end{aligned}$$
(19)

Since \(\overline{k} ({x_t},{x_i})\) is the centered kernel function that measures the similarity between the points \(x_t\) and \(x_i\), the value of \(d_t\) is large when \(x_t\) is highly similar to the samples of its own class and dissimilar to the samples of the other class. In other words, a sample with a larger \(d_t\) value contributes more to the construction of the optimal classification hyperplane, whereas a sample with a smaller \(d_t\) value is more likely to be noise. Thus the fuzzy membership function based on CKA, which measures the importance of each sample point for classification, can be expressed as follows:

$$\begin{aligned} {s_t} = \frac{{{d_t} - {d_{\min }}}}{{{d_{\max }} - {d_{\min }}}} \end{aligned}$$
(20)

where \({d_{\max }}\) and \({d_{\min }}\) denote the largest and smallest CKA values among all sample points, respectively. Therefore, the larger the value of \({s_t}\), the greater the contribution of the sample \({x_t}\) to the construction of the optimal classification hyperplane; conversely, a small \({s_t}\) indicates that \({x_t}\) is likely to be noise. Different from existing fuzzy membership functions based on distance, relative density, or entropy, the proposed strategy utilizes the CKA method to derive a heuristic function for calculating the dependency between samples and their corresponding labels, and assigns fuzzy memberships to the samples accordingly. In addition, the proposed strategy incorporates the idea of feature weighting, which effectively reduces the influence of weakly correlated features. The input data set S is accordingly extended to \(S = \left\{ {({x_1},{y_1},{s_1}),({x_2},{y_2},{s_2}),\ldots ,({x_l},{y_l},{s_l})} \right\} \).
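Putting Eqs. (19)–(20) together, a minimal sketch of the proposed membership assignment, taking a precomputed (feature-weighted) kernel matrix and labels in {−1, +1} as input:

```python
import numpy as np

def cka_fuzzy_membership(K, y):
    """Assign fuzzy memberships s_t from a (feature-weighted) kernel matrix K
    and labels y in {-1, +1}, following Eqs. (19)-(20)."""
    l = K.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    Kc = H @ K @ H                               # centered kernel matrix, Eq. (6)
    # d_t = sum_{y_i = y_t} k_bar(x_t, x_i) - sum_{y_i != y_t} k_bar(x_t, x_i)
    #     = y_t * sum_i y_i * k_bar(x_t, x_i)   for labels in {-1, +1}   (Eq. (19))
    d = y * (Kc @ y)
    return (d - d.min()) / (d.max() - d.min())   # min-max scaling, Eq. (20)
```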

3.2 Linear FTSVM-CKA

In the linear case, FTSVM-CKA finds the optimal classifier by solving the following two QPPs:

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{w_1},{b_1},{\xi _2}} \mathrm{{ }}\frac{1}{2}{\left\| {A{w_1} + {e_1}{b_1}} \right\| ^2} + \frac{1}{2}{C_1}{\left\| {{w_1}} \right\| ^2} + {C_2}{S_2}^T{\xi _2}\\ s.t.\mathrm{{ }} - (B{w_1} + {e_2}{b_1}) + {\xi _2} \ge {e_2},\mathrm{{ }}{\xi _2} \ge 0 \end{array} \end{aligned}$$
(21)

and

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{w_2},{b_2},{\xi _1}} \mathrm{{ }}\frac{1}{2}{\left\| {B{w_2} + {e_2}{b_2}} \right\| ^2} + \frac{1}{2}{C_3}{\left\| {{w_2}} \right\| ^2} + {C_4}{S_1}^T{\xi _1}\\ s.t.\mathrm{{ }}(A{w_2} + {e_1}{b_2}) + {\xi _1} \ge {e_1},\mathrm{{ }}{\xi _1} \ge 0 \end{array} \end{aligned}$$
(22)

where \({C_1}\), \({C_2}\), \({C_3}\) and \({C_4}\) are penalty parameters, \({\xi _1}\) and \({\xi _2}\) are slack variables, and \({e_1}\) and \({e_2}\) are vectors of ones of appropriate dimensions. \({S_1} \in {R^{{l_ + }}}\) and \({S_2} \in {R^{{l_ - }}}\) denote the fuzzy memberships of the positive and negative class samples, respectively, where \(l_+\) and \(l_-\) are the numbers of positive and negative samples.

We take the solution of problem (21) as an example. The Lagrangian of problem (21) is written as

$$\begin{aligned}&L({w_1},{b_1},{\xi _2},\alpha ,\beta ) \nonumber \\ {}&\quad = \frac{1}{2}{\left\| {A{w_1} + {e_1}{b_1}} \right\| ^2} + \frac{1}{2}{C_1}{\left\| {{w_1}} \right\| ^2}\nonumber \\&\qquad + {C_2}{S_2}^T{\xi _2} + {\alpha ^T}[(B{w_1} + {e_2}{b_1}) - {\xi _2} + {e_2}] - {\beta ^T}{\xi _2} \end{aligned}$$
(23)

where \(\alpha \) and \(\beta \) are the vectors of Lagrange multipliers. Applying the Karush-Kuhn-Tucker (KKT) conditions, we get

$$\begin{aligned}{} & {} \frac{{\partial L}}{{\partial {w_1}}} = {A^T}(A{w_1} + {e_1}{b_1}) + {C_1}{w_1} + {B^T}\alpha = 0\end{aligned}$$
(24)
$$\begin{aligned}{} & {} \frac{{\partial L}}{{\partial {b_1}}} = e_1^T(A{w_1} + {e_1}{b_1}) + e_2^T\alpha = 0\end{aligned}$$
(25)
$$\begin{aligned}{} & {} \frac{{\partial L}}{{\partial {\xi _2}}} = {C_2}{S_2} - \alpha - \beta = 0 \end{aligned}$$
(26)

Neglecting for the moment the regularization term \({C_1}{w_1}\) in Eq. (24) (it is restored below), from Eq. (24) and Eq. (25),

$$\begin{aligned} \left( {\begin{array}{*{20}{l}} {{A^T}}\\ {e_1^T} \end{array}} \right) \left( {\begin{array}{*{20}{l}} A&{{e_1}} \end{array}} \right) \left( {\begin{array}{*{20}{l}} {{w_1}}\\ {{b_1}} \end{array}} \right) + \left( {\begin{array}{*{20}{l}} {{B^T}}\\ {e_2^T} \end{array}} \right) \alpha = 0 \end{aligned}$$
(27)

can be obtained. Let \({H_1} = \left( {\begin{array}{*{20}{l}} A&{{e_1}} \end{array}} \right) \), \({G_2} = \left( {\begin{array}{*{20}{l}} B&{{e_2}} \end{array}} \right) \), \({u_1} = \left( {\begin{array}{*{20}{l}} {{w_1}}\\ {{b_1}} \end{array}} \right) \), \({u_2} = \left( {\begin{array}{*{20}{l}} {{w_2}}\\ {{b_2}} \end{array}} \right) \), then, \(H_1^T{H_1}{u_1} + G_2^T\alpha = 0\). Further, we can get

$$\begin{aligned} {u_1} = - {(H_1^T{H_1})^{ - 1}}G_2^T\alpha \end{aligned}$$
(28)

Since \(H_1^T{H_1}\) may be singular or ill-conditioned, and to restore the effect of the regularization term \(\frac{1}{2}{C_1}{\left\| {{w_1}} \right\| ^2}\) in Eq. (21), \({(H_1^T{H_1} + {C_1}I)^{ - 1}}\) is used in place of \({(H_1^T{H_1})^{ - 1}}\) in Eq. (28), where I is the identity matrix of appropriate dimension. Thus,

$$\begin{aligned} {u_1} = - {(H_1^T{H_1} + {C_1}I)^{ - 1}}G_2^T\alpha \end{aligned}$$
(29)

Similarly,

$$\begin{aligned} {u_2} = {(G_2^T{G_2} + {C_3}I)^{ - 1}}H_1^T\beta \end{aligned}$$
(30)

According to the KKT conditions, the dual problems of Eq. (21) and Eq. (22) are as follows:

$$\begin{aligned} \begin{array}{l} \mathop {\max }\limits _\alpha \mathrm{{ }}{e_2}^T\alpha - \frac{1}{2}{\alpha ^T}{G_2}{(H_1^T{H_1} + {C_1}I)^{ - 1}}G_2^T\alpha \\ s.t.\mathrm{{ }}0 \le \alpha \le {C_2}{S_2} \end{array} \end{aligned}$$
(31)

and

$$\begin{aligned} \begin{array}{l} \mathop {\max }\limits _\beta \mathrm{{ }}{e_1}^T\beta - \frac{1}{2}{\beta ^T}{H_1}{(G_2^T{G_2} + {C_3}I)^{ - 1}}H_1^T\beta \\ s.t.\mathrm{{ }}0 \le \beta \le {C_4}{S_1} \end{array} \end{aligned}$$
(32)

We get the optimal \(u_1^ * = \left( {\begin{array}{*{20}{c}} {w_1^ * }\\ {b_1^ * } \end{array}} \right) \) and \(u_2^ * = \left( {\begin{array}{*{20}{c}} {w_2^ * }\\ {b_2^ * } \end{array}} \right) \) by solving the two corresponding dual problems. For any input sample \({x^ * }\), its class label \({y^ * }\) can be determined as follows:

$$\begin{aligned} {y^ * } = \mathop {\arg \min }\limits _{i \in 1,2} \left\{ {\frac{{\left| {w{{_i^ * }^T}{x^ * } + b_i^ * } \right| }}{{\left\| {w_i^ * } \right\| }}} \right\} \end{aligned}$$
(33)
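The two dual problems (31)–(32) are box-constrained quadratic programs. The sketch below builds the matrices of Eqs. (29)–(30) and solves the duals with a simple projected-gradient loop; this solver is only an illustrative stand-in for the coordinate descent method of Sect. 3.4, and all function names are ours:

```python
import numpy as np

def solve_box_qp(Q, q, upper, iters=2000):
    """Minimize 0.5 a^T Q a - q^T a  subject to  0 <= a <= upper,
    by projected gradient descent (illustrative stand-in for Sect. 3.4)."""
    a = np.zeros_like(upper, dtype=float)
    step = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)    # 1 / largest eigenvalue of Q
    for _ in range(iters):
        a = np.clip(a - step * (Q @ a - q), 0.0, upper)
    return a

def linear_ftsvm_cka_fit(A, B, S1, S2, C1, C2, C3, C4):
    """Fit the two linear FTSVM-CKA planes via Eqs. (29)-(32).
    A: positive samples (l+ x d), B: negative samples (l- x d),
    S1, S2: fuzzy memberships of the positive / negative samples."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    H1 = np.hstack([A, e1])                         # H1 = [A  e1]
    G2 = np.hstack([B, e2])                         # G2 = [B  e2]
    dim = H1.shape[1]

    # first hyperplane: dual (31), then u1 = -(H1^T H1 + C1 I)^{-1} G2^T alpha, Eq. (29)
    R1 = np.linalg.solve(H1.T @ H1 + C1 * np.eye(dim), G2.T)
    alpha = solve_box_qp(G2 @ R1, np.ones(G2.shape[0]), C2 * S2)
    u1 = -R1 @ alpha

    # second hyperplane: dual (32), then u2 = (G2^T G2 + C3 I)^{-1} H1^T beta, Eq. (30)
    R2 = np.linalg.solve(G2.T @ G2 + C3 * np.eye(dim), H1.T)
    beta = solve_box_qp(H1 @ R2, np.ones(H1.shape[0]), C4 * S1)
    u2 = R2 @ beta

    w1, b1 = u1[:-1], u1[-1]
    w2, b2 = u2[:-1], u2[-1]
    return w1, b1, w2, b2
```

A new sample is then classified with the decision rule of Eq. (33), e.g. by the `tsvm_predict` sketch given after Eq. (3).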

3.3 Nonlinear FTSVM-CKA

In the nonlinear case, the kernel function \(k({x_1},{x_2}) = \left\langle {\phi ({x_1}),\phi ({x_2})} \right\rangle \) is introduced, where \(\phi \) is the mapping into a high-dimensional Hilbert feature space. The classification hyperplanes in the nonlinear case can thus be represented as \(k(x,{X^T}){w_1} + {b_1} = 0\) and \(k(x,{X^T}){w_2} + {b_2} = 0\), where \(X = \left[ {A;B} \right] \). The nonlinear FTSVM-CKA is formulated in the primal form as

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{w_1},{b_1},{\xi _2}} \mathrm{{ }}\frac{1}{2}{\left\| {k(A,{X^T}){w_1} + {e_1}{b_1}} \right\| ^2} + \frac{1}{2}{C_1}{\left\| {{w_1}} \right\| ^2} + {C_2}{S_2}^T{\xi _2}\\ s.t.\mathrm{{ }} - (k(B,{X^T}){w_1} + {e_2}{b_1}) + {\xi _2} \ge {e_2},\mathrm{{ }}{\xi _2} \ge 0 \end{array} \end{aligned}$$
(34)

and

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{w_2},{b_2},{\xi _1}} \mathrm{{ }}\frac{1}{2}{\left\| {k(B,{X^T}){w_2} + {e_2}{b_2}} \right\| ^2} + \frac{1}{2}{C_3}{\left\| {{w_2}} \right\| ^2} + {C_4}{S_1}^T{\xi _1}\\ s.t.\mathrm{{ }}(k(A,{X^T}){w_2} + {e_1}{b_2}) + {\xi _1} \ge {e_1},\mathrm{{ }}{\xi _1} \ge 0 \end{array} \end{aligned}$$
(35)

The Lagrangian function of Eq. (34) is written as

$$\begin{aligned} \begin{array}{l} L({w_1},{b_1},{\xi _2},\alpha ,\beta ) = \\ \frac{1}{2}{\left\| {k(A,{X^T}){w_1} + {e_1}{b_1}} \right\| ^2} + \frac{1}{2}{C_1}{\left\| {{w_1}} \right\| ^2} + {C_2}{S_2}^T{\xi _2}\\ + {\alpha ^T}[(k(B,{X^T}){w_1} + {e_2}{b_1}) - {\xi _2} + {e_2}] - {\beta ^T}{\xi _2} \end{array} \end{aligned}$$
(36)

Following the same procedure as in the linear case, we get

$$\begin{aligned} {u_1} = - {(H_1^{*T}H_1^* + {C_1}I)^{ - 1}}G_2^{*T}\alpha . \end{aligned}$$
(37)

and

$$\begin{aligned} {u_2} = {(G_2^{*T}G_2^* + {C_3}I)^{ - 1}}H_1^{*T}\beta . \end{aligned}$$
(38)

where \(H_1^ * = \left( {\begin{array}{*{20}{c}} {k(A,{X^T})}&{{e_1}} \end{array}} \right) \), \(G_2^ * = \left( {\begin{array}{*{20}{c}} {k(B,{X^T})}&{{e_2}} \end{array}} \right) \), \({u_1} = \left( {\begin{array}{*{20}{c}} {{w_1}}\\ {{b_1}} \end{array}} \right) \), \({u_2} = \left( {\begin{array}{*{20}{c}} {{w_2}}\\ {{b_2}} \end{array}} \right) \). Then, the dual problems of Eq. (34) and Eq. (35) are as follows:

$$\begin{aligned} \begin{array}{l} \mathop {\max }\limits _\alpha \mathrm{{ }}{e_2}^T\alpha - \frac{1}{2}{\alpha ^T}G_2^*{(H_1^{*T}H_1^* + {C_1}I)^{ - 1}}G_2^{*T}\alpha \\ s.t.\mathrm{{ }}0 \le \alpha \le {C_2}{S_2} \end{array} \end{aligned}$$
(39)

and

$$\begin{aligned} \begin{array}{l} \mathop {\max }\limits _\beta \mathrm{{ }}{e_1}^T\beta - \frac{1}{2}{\beta ^T}H_1^*{(G_2^{*T}G_2^* + {C_3}I)^{ - 1}}H_1^{*T}\beta \\ s.t.\mathrm{{ }}0 \le \beta \le {C_4}{S_1} \end{array} \end{aligned}$$
(40)

We get the optimal \(u_1^ * = \left( {\begin{array}{*{20}{c}} {w_1^ * }\\ {b_1^ * } \end{array}} \right) \) and \(u_2^ * = \left( {\begin{array}{*{20}{c}} {w_2^ * }\\ {b_2^ * } \end{array}} \right) \) by solving the dual problems. For any input sample \({x^ * }\), its classification decision function is as follows:

$$\begin{aligned} {y^ * } = \mathop {\arg \min }\limits _{i \in 1,2} \frac{{\left| {w_i^{*T}k({x^ * },{X^T}) + b_i^ * } \right| }}{{\sqrt{w_i^{*T}k(X,{X^T})w_i^ * } }} \end{aligned}$$
(41)
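Prediction in the nonlinear case therefore needs only kernel evaluations against the training matrix X = [A; B]. A sketch of the decision rule of Eq. (41) is given below; the `kernel` argument can be, for example, the feature-weighted Gaussian kernel sketched in Sect. 3.1, and `w1`, `w2` are assumed to come from the duals (39)–(40):

```python
import numpy as np

def kernel_tsvm_predict(x_new, X, w1, b1, w2, b2, kernel):
    """Nonlinear decision rule of Eq. (41). X = [A; B] stacks all training samples,
    w1 and w2 are the l-dimensional coefficient vectors, and kernel(X1, X2)
    returns the kernel matrix between two sample sets."""
    Kxx = kernel(X, X)                               # l x l training kernel matrix
    kx = kernel(x_new.reshape(1, -1), X).ravel()     # kernel row k(x*, X^T)
    d1 = abs(kx @ w1 + b1) / np.sqrt(w1 @ Kxx @ w1)
    d2 = abs(kx @ w2 + b2) / np.sqrt(w2 @ Kxx @ w2)
    return +1 if d1 <= d2 else -1
```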

3.4 The coordinate descent strategy with active set shrinking

To speed up the training, FTSVM-CKA employs a coordinate descent strategy with shrinking by active set, which handles the computational complexity brought by high-dimensional inputs (Gao et al. 2015). Since the dual problems involved in FTSVM-CKA can be solved in the same way, we take Eq. (31) as an example. Let \(R = {(H_1^T{H_1} + {C_1}I)^{ - 1}}G_2^T\) and \(\widetilde{R} = {G_2}R\); then Eq. (31) can be rewritten as the following problem:

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _\alpha \mathrm{{ g}}(\alpha ) = - {e_2}^T\alpha + \frac{1}{2}{\alpha ^T}\widetilde{R}\alpha \\ s.t.\mathrm{{ }}0 \le \alpha \le {C_2}{S_2} \end{array} \end{aligned}$$
(42)

A coordinate descent strategy with shrinking by active set is adopted to solve Eq. (42); its pseudo-code is shown in Algorithm 1. \({g_{\nabla i}}(\alpha )\) is the projected gradient, defined as follows:

$$\begin{aligned} {g_{\nabla i}}(\alpha ) = \left\{ {\begin{array}{ll} {\min (0,{g_{\partial i}}(\alpha )),}&{\ \mathrm{if}\ {\alpha _i} = 0}\\ {{g_{\partial i}}(\alpha ),}&{\ \mathrm{if}\ 0< {\alpha _i} < {C_2}{s_i}}\\ {\max (0,{g_{\partial i}}(\alpha )),}&{\ \mathrm{if}\ {\alpha _i} = {C_2}{s_i}} \end{array}} \right. \end{aligned}$$
(43)

where \({g_{\partial i}}\) is the ith component of the gradient \({g_\partial }\). Refer to Chang and Lin (2011), Chang et al. (2008), and Shao and Deng (2012) for further details.
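A simplified sketch of this coordinate descent with active-set shrinking for problem (42) is given below; the shrinking rule used here, dropping variables that sit at a bound with zero projected gradient, is a simplification of the tolerance-based rule in Chang and Lin (2011), and the function name is ours:

```python
import numpy as np

def coordinate_descent_shrinking(R_tilde, upper, max_iter=200, tol=1e-6):
    """Minimize g(a) = -e^T a + 0.5 a^T R_tilde a  s.t.  0 <= a <= upper (Eq. (42))
    by coordinate descent, shrinking bounded variables whose projected
    gradient (Eq. (43)) vanishes."""
    l = R_tilde.shape[0]
    a = np.zeros(l)
    grad = -np.ones(l)                     # gradient R_tilde a - e at a = 0
    active = np.arange(l)
    for _ in range(max_iter):
        max_viol, keep = 0.0, []
        for i in np.random.permutation(active):
            # projected gradient of coordinate i, Eq. (43)
            if a[i] <= 0.0:
                pg = min(0.0, grad[i])
            elif a[i] >= upper[i]:
                pg = max(0.0, grad[i])
            else:
                pg = grad[i]
            if pg == 0.0 and (a[i] <= 0.0 or a[i] >= upper[i]):
                continue                   # shrink: variable is optimal at a bound
            keep.append(i)
            max_viol = max(max_viol, abs(pg))
            if R_tilde[i, i] > 0.0:
                new_ai = np.clip(a[i] - grad[i] / R_tilde[i, i], 0.0, upper[i])
                if new_ai != a[i]:
                    grad += (new_ai - a[i]) * R_tilde[:, i]   # rank-one gradient update
                    a[i] = new_ai
        if max_viol < tol:
            if active.size == l:
                break                      # converged on the full variable set
            active = np.arange(l)          # converged on the shrunk set: re-check all
        else:
            active = np.array(keep, dtype=int) if keep else np.arange(l)
    return a
```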

Algorithm 1 The coordinate descent strategy with active set shrinking

Fig. 1 The hyperplanes by TSVM at 0% noise rate (a) and 10% noise rate (b)

Fig. 2 The hyperplanes by FTSVM-CKA at 0% noise rate (a) and 10% noise rate (b)

Fig. 3 Accuracy of FTSVM-CKA and TSVM in linear (a) and nonlinear (b) cases

Table 1 Details of UCI data sets

4 Experimental results

In this paper, different experiments are designed on an artificial data set, i.e., the Ripley data set (Ripley 2007), and 15 real-world data sets from the UCI machine learning repository (Dua et al. 2017) to evaluate the performance of FTSVM-CKA. TSVM (Khemchandani and Chandra 2007), CDFTSVM (Gao et al. 2015), IFTSVM (Rezvani et al. 2019), AULSTSVM (Richhariya et al. 2021), CatBoost (Prokhorenkova et al. 2018), LightGBM (Ke et al. 2017), XGBoost (Chen et al. 2015), SVM (Cortes and Vapnik 1995), and Random Forest (RF) (Breiman 2001) are used as comparison algorithms. For the parameters \({C_i},i = 1,2,3,4\), we set \({C_1} = {C_3},{C_2} = {C_4}\) for FTSVM-CKA, as is done for CDFTSVM and IFTSVM; these parameters are selected from \(\left\{ {{{10}^i}|i = - 5, - 4,\ldots ,4,5} \right\} \). For AULSTSVM (Richhariya et al. 2021), the parameters are set as \({c_1} = {c_2}\), \({c_3} = {c_5} = {c_1} \cdot {c_4}\), \({c_4} = {c_6}\). In addition, the Gaussian kernel function, i.e., \(k({x_1},{x_2}) = \exp (\frac{{ - {{\left\| {{x_1} - {x_2}} \right\| }^2}}}{{{\sigma ^2}}})\), is used in this paper, and \(\sigma \) is selected from \(\left\{ {{2^i}|i = - 5, - 4,\ldots ,4,5} \right\} \). Ten-fold cross-validation is performed for all algorithms, and all samples are normalized. To simulate label noise, we randomly select a given proportion of samples and flip their labels; this proportion is called the noise rate. The experimental environment is as follows: Intel Core i5-11500 CPU, 8 GB RAM, Windows 10, MATLAB R2018b.
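The label-noise protocol described above can be reproduced in a few lines (the fixed random seed is our choice for illustration):

```python
import numpy as np

def add_label_noise(y, noise_rate, seed=0):
    """Flip the labels of a randomly chosen fraction of samples (the noise rate)."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(round(noise_rate * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = -y_noisy[idx]             # labels are in {-1, +1}
    return y_noisy
```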

4.1 Parameter effect

In this subsection, the effect of different values of \({C_i}\) and \(\sigma \) is examined on the Horse dataset to identify the optimal parameters, i.e., \({C_i}\) for the linear case and both \({C_i}\) and \(\sigma \) for the nonlinear case, that produce the best performance. First, \({C_i}\), which varies in \(\left\{ {1,2,\ldots ,10} \right\} \), is optimized for the linear case. FTSVM-CKA generates better outcomes when \({C_1} = 5\) and \({C_2} = 2\).

Similarly, for the nonlinear case, \({C_i}\) and \(\sigma \) are optimized over \(\left\{ {i \cdot \frac{1}{2}|i = 1,2,\ldots ,10} \right\} \). FTSVM-CKA with \({C_1} = 5\), \({C_2} = 2\) and \(\sigma = 1.5\) produces better outcomes. After obtaining the optimal parameter settings, the performance of the model is evaluated on the remaining testing parts.

4.2 Artificial data sets

The Ripley data set is a mixture of two Gaussian distributions; it comprises two categories, and each sample consists of two features. Figures 1 and 2 show the linear separating hyperplanes generated by TSVM and FTSVM-CKA at noise rates of 0% and 10%, respectively. Figure 1 reveals that the hyperplanes generated by TSVM exhibit noticeable variation across the different noise rates. From Fig. 2, one can observe that the disparity between the hyperplanes generated by FTSVM-CKA at the two noise rates is significantly smaller than that of TSVM.

Figure 3 illustrates the accuracy of FTSVM-CKA and TSVM under varying noise rates in the linear and nonlinear cases. It can be observed that the accuracy of FTSVM-CKA is higher than that of TSVM, and the classification performance of both algorithms shows a decreasing trend as the noise rate increases. From Fig. 3, we can also see that the accuracy of TSVM fluctuates significantly with increasing noise rate, suggesting its sensitivity to noise. It is worth noting that, compared with TSVM, the accuracy of FTSVM-CKA is less susceptible to noise and changes more gently. This indicates that FTSVM-CKA can effectively mitigate the sensitivity of TSVM to noise. In conclusion, the experimental results show that the proposed FTSVM-CKA suppresses the adverse effects of noise, owing to the fuzzy membership assignment strategy based on CKA.

Table 2 Experimental results on UCI data sets at 0% noise rate in the linear case
Table 3 Experimental results on UCI data sets at 5% noise rate in the linear case
Table 4 Experimental results on UCI data sets at 10% noise rate in the linear case
Table 5 Experimental results on UCI data sets at 0% noise rate in the nonlinear case
Table 6 Experimental results on UCI data sets at 5% noise rate in the nonlinear case
Table 7 Experimental results on UCI data sets at 10% noise rate in the nonlinear case
Table 8 Win-Tie-Loss accuracy comparison from Tables 2, 3 and 4
Table 9 Win-Tie-Loss accuracy comparison from Tables 5, 6 and 7
Table 10 The pairwise significant difference between the proposed FTSVM-CKA and existing algorithms in the linear case
Table 11 The pairwise significant difference between the proposed FTSVM-CKA and existing algorithms in the nonlinear case

4.3 UCI data sets

Table 1 shows the details of the 15 UCI datasets selected in this paper. In the experiments, the noise rate is set to 0, 5, and 10%. The average accuracy, the standard deviation (SD), and the computational time are reported to evaluate the experimental results.

We implement TSVM and TSVM-related methods, including CDFTSVM and IFTSVM. Tables 2, 3 and 4 present the experimental results of FTSVM-CKA, TSVM, IFTSVM, and CDFTSVM in the linear case, with noise rates of 0, 5, and 10%, respectively. Bold values in all tables indicate the best result. The results in Tables 2, 3 and 4 demonstrate that, out of the 13 UCI datasets mentioned earlier, the proposed FTSVM-CKA achieves the highest classification accuracy on 11, 10, and 10 datasets, respectively. The average accuracy ranks of FTSVM-CKA under the different noise rates are 1.15, 1.31 and 1.46, respectively, which are superior to those of the existing algorithms. This indicates that FTSVM-CKA outperforms the other three algorithms in terms of classification performance in the linear case. FTSVM-CKA utilizes the CKA-based fuzzy membership assignment strategy to mitigate the adverse impact of noise during classification, thereby significantly enhancing classification performance. In addition, the computation times of FTSVM-CKA and CDFTSVM are very close, and both are significantly shorter than those of TSVM and IFTSVM, showing that the proposed FTSVM-CKA trains faster than the other algorithms. This can be attributed to the fact that both FTSVM-CKA and CDFTSVM employ the coordinate descent strategy with shrinking by active set. However, the training time of FTSVM-CKA is slightly higher than that of CDFTSVM, mainly because CDFTSVM uses the simplest fuzzy membership calculation method.

Tables 5, 6 and 7 present the experimental results of FTSVM-CKA, TSVM, CDFTSVM, and IFTSVM in the nonlinear case, with noise rates of 0, 5, and 10%, respectively. Under the different noise rates, the proposed model achieves the best results on 12, 11 and 12 of the 13 datasets, respectively. In the nonlinear case, the average accuracy rank of FTSVM-CKA is 1.23 at each noise rate, which is better than those of the existing algorithms. It can be observed that the classification performance of FTSVM-CKA is better than that of the other algorithms in the nonlinear case. Similar to the linear case, FTSVM-CKA exhibits significantly shorter computational time than both TSVM and IFTSVM. The experimental results demonstrate that FTSVM-CKA outperforms the other algorithms in terms of both classification performance and training speed.

In order to compare the proposed FTSVM-CKA with the other algorithms in terms of classification performance, we utilize the Win-Tie-Loss (Xu et al. 2016) statistical analysis and record the number of datasets on which FTSVM-CKA outperforms, ties with, or performs worse than each competing algorithm in both the linear and nonlinear cases. The corresponding results are presented in Tables 8 and 9. One can observe that FTSVM-CKA gives better performance than the other methods on the majority of datasets. Furthermore, it is evident that FTSVM-CKA exhibits a clear advantage in the presence of noise.

Table 12 Experimental results with machine learning algorithms at 0% noise rate
Table 13 Experimental results with machine learning algorithms at 5% noise rate
Table 14 Experimental results with machine learning algorithms at 10% noise rate

We perform the Friedman test with a post-hoc test to assess the statistical significance of the generalization performance of the proposed FTSVM-CKA. The Friedman test uses the average ranks of the algorithms, which are recorded for all settings in Tables 2, 3, 4, 5, 6 and 7. Under the null hypothesis that all algorithms have equal ranks, the Friedman statistic is distributed as \(\chi _F^2\) with \(\kappa - 1\) degrees of freedom, where \(\kappa \) is the number of algorithms. The performance of two algorithms is significantly different if their average ranks differ by at least the critical difference defined by \(CD = {q_\alpha }\sqrt{\frac{{\kappa \left( {\kappa + 1} \right) }}{{6N}}} \), where N is the number of datasets and \({q_\alpha }\) is computed by using the Studentized range statistic. The critical difference for our case at the \(\alpha = 0.10\) significance level is \(CD = 2.241\sqrt{\frac{{5\left( {5 + 1} \right) }}{{6 \times 13}}} \approx 1.39\). Tables 10 and 11 show the pairwise significant differences between the algorithms in the linear and nonlinear cases, respectively. It can be found that the proposed FTSVM-CKA is significantly different from most of the algorithms.
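The quoted critical difference follows directly from the formula; as a quick check (with κ = 5 algorithms, N = 13 datasets, and the q_α = 2.241 used in the text):

```python
import math

kappa, N, q_alpha = 5, 13, 2.241            # algorithms, datasets, critical value from the text
CD = q_alpha * math.sqrt(kappa * (kappa + 1) / (6 * N))
print(round(CD, 2))                         # 1.39
```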

In addition, FTSVM-CKA is compared with several well-established machine learning algorithms, including CatBoost, LightGBM, XGBoost, SVM, and RF. The experiment is performed on 5 UCI datasets with 10-fold cross-validation, and the comparison algorithms use their default parameters. Tables 12, 13 and 14 show the classification accuracy at 0, 5, and 10% noise rates, respectively. FTSVM-CKA achieves the best classification results in all cases, which shows that its classification performance is significantly better than that of these classical and effective machine learning algorithms. It is worth noting that FTSVM-CKA is superior to traditional SVM in both generalization performance and training speed. One reason is that SVM cannot distinguish the importance of different samples for classification, which makes it sensitive to noise, whereas the proposed FTSVM-CKA utilizes the CKA method to derive a heuristic function for calculating the dependency between samples and their corresponding labels and assigns fuzzy memberships accordingly. This effectively identifies noise and reduces its negative impact on classification.

5 Conclusion

To address the sensitivity of traditional TSVM to noise, we proposed a novel and efficient fuzzy twin support vector machine based on centered kernel alignment, termed FTSVM-CKA. FTSVM-CKA utilizes the CKA method, which incorporates the idea of feature weighting, to assign fuzzy memberships to different samples. We conducted experiments on an artificial dataset and 15 UCI datasets, adding noise to the original data sets to verify the noise robustness of the proposed FTSVM-CKA. The experimental results demonstrate that FTSVM-CKA outperforms several existing learning models and exhibits excellent classification performance, and statistical tests confirm the significance of the proposed algorithm. Nevertheless, FTSVM-CKA does not take class-imbalanced data sets into account; our future work is to extend FTSVM-CKA to class imbalance learning.