1 Introduction

The support vector machine (SVM) is a supervised machine learning model that has been presented for classification and regression analysis (Vapnik 1995; Yang and Xu 2017; Mao et al. 2014; Santhanama et al. 2016). SVM training is formulated as a convex quadratic optimization problem and then solved by a quadratic programming (QP) technique (Vapnik 1999). SVM can also classify nonlinearly separable samples by using a kernel function (Cortes and Vapnik 1995; Xue et al. 2018; Moghaddam and Hamidzadeh 2016).

In real-world applications, complex information is often subject to uncertainty, which reduces the efficiency and accuracy of decision-making systems. Machine learning can support such decision-making through classification algorithms, but this requires algorithms with high tolerance to uncertainty, notably noise. The classification hyperplane obtained by SVM is determined by the support vectors, so the presence of noise degrades standard SVM training and deviates the decision boundary from the optimal hyperplane. In machine learning, many techniques have been proposed to improve classification (Alcantud et al. 2019; Hamidzadeh and Moradi 2018, 2020; Hamidzadeh et al. 2014, 2017; Hamidzadeh and Namaei 2018; Hamidzadeh and Ghadamyari 2019; Javid and Hamidzadeh 2019). In particular, many methods improve SVM classification by identifying uncertain samples, such as noisy and outlier ones, in order to discard them (Han et al. 2016; Nguyen et al. 2018; Xu et al. 2016). Other solutions reduce the effect of unimportant samples by weighting them within weighted support vector machine classification methods (Karal 2017; Sheng et al. 2015; Zhou et al. 2016; Yang et al. 2005). However, the previous weighted SVM methods based on probabilities cannot accurately reduce the effect of noisy samples in SVM training.

Fuzzy aspects play a crucial role in real-world applications, owing to the sophisticated information present in domains such as medical diagnosis, pattern recognition, and big data analysis. Given this importance, probabilistic and fuzzy methods are effective (Singh et al. 2020; Sivasankar et al. 2020). The efficiency and speed of a classification-based model also matter, so a method that classifies both precisely and quickly is needed; to this end, fuzzy rough set theory is redesigned here. This paper presents a method to lessen the effect of noise in soft-margin SVM training using fuzzy rough set theory (Dubois and Prade 1990). In the proposed method, a weight coefficient is added to the penalty term of the Lagrangian formulation of the optimization problem. This coefficient, called the entropy degree, is built from the lower and upper approximation membership functions of fuzzy rough set theory. As a result, in the proposed method, WSVM-FRS (Weighted SVM-Fuzzy Rough Set), noisy samples receive a low degree and important samples a high degree. The results show good classification accuracy, precision, and recall for SVM training.

The rest of this paper is organized as follows: In Sect. 2, a survey of weighted support vector machine algorithms is presented. In Sect. 3, primary concepts of support vector machine and fuzzy rough set theory are presented. In Sect. 4, the proposed method is introduced. The experimental results are shown in Sect. 5. Finally, Sect. 6 contains conclusions and future works.

2 Related works

Several methods have been proposed for weighted support vector machines. This section surveys some important ones that deal with noisy and outlier samples.

The weighted support vector machine (WSVM) (Yang et al. 2005) was presented to mitigate the outlier sensitivity problem of SVM. Its basic idea is possibilistic c-means (PCM), extended into kernel space to generate different weight values for the main training data points and the outliers according to their relative importance in the training set. In Du et al. (2017), a fuzzy compensation multiclass SVM method was introduced to improve the outlier and noise sensitivity problem; it gives dual effects to the penalty term by treating every data point as belonging to both the positive and the negative class, but with different memberships. In Sheng et al. (2015), a method was presented to reduce noise sensitivity based on the fuzzy least squares support vector machine; by applying fuzzy inference and nonlinear correlation measurement, the effects of samples with low confidence can be reduced. WDRSVM (Li et al. 2016) is a weighted doubly regularized support vector machine that deals with noise by using the distance information both between classes and within each class. The lncosh loss (Karal 2017) was introduced to obtain support vector regression (SVR) models that cope with different noise distributions and is optimal in the maximum likelihood sense for hyper-secant error distributions. In Ding et al. (2017), WLMSVM, a weighted linear loss multiple birth support vector machine based on information granulation, was presented as a new classifier for multiclass classification that enhances the performance of multiple WLTSVM.

BWSVM (Sun et al. 2017) is a band-weighted support vector machine that quantifies the divergent contributions of different bands when implementing SVM; BWSVM adds an L1-norm penalty term on the band weights to the original SVM. In Li et al. (2017), new weighting mechanisms for both the loss and the penalty were introduced; the resulting weight partly adaptive elastic net handles the binary classification of noisy microarray data by using the distances from the sample points to both class centers. In Lu et al. (2017), a probabilistic weighted least squares SVM method was presented to model such processes under noise; this method increases robustness and accuracy even with outliers or non-Gaussian noise. RLS-SVM (Yang et al. 2014) is based on the truncated least squares loss function for regression and classification with noise. DS-RLSSVM (Zhou et al. 2016) was developed to model complex systems in the presence of various types of random noise; it integrates the distributed LS-SVM with fuzzy clustering to construct the evidence for the LS-SVM parameters. In Zhang et al. (2018), a method called feature weighted confidence with SVM was presented to incorporate prior knowledge into SVM through sample confidence, which it computes directly from the weights of prior features provided by SVM. In Xu et al. (2015), a new support vector weighted quantile regression approach, closely built upon the idea of the support vector machine, was introduced; it is estimated by solving a Lagrangian dual quadratic programming problem and implements nonlinear quantile regression by introducing a kernel function. In Tang et al. (2019), an approach integrating piecewise linear representation and a weighted support vector machine was introduced to forecast stock turning points. K-SRLSSVCR (Ma et al. 2019) is a robust least squares version of K-SVCR based on the squared ε-insensitive ramp loss and the truncated least squares loss, which partially suppress the impact of outliers on the model through their nonconvexity.

Overall, in real-world applications the productivity and speed of a classification-based model cannot be ignored, given the complexity of the information involved. It is therefore essential to introduce a method that decides and classifies precisely, and fuzzy aspects and probability distributions can be effective here. Although the previous methods have increased the precision and accuracy of the SVM classifier, they have not been able to handle uncertainty aspects such as noisy and outlier samples based on a probability distribution. To realize this aim with a more precise and quick method, the fuzzy rough set theory is redesigned. Following this strategy, this paper proposes a new weighted support vector machine method based on fuzzy rough set theory in order to decrease uncertainty.

3 Preliminaries

In this section, a brief overview of the constructive concepts of the proposed method is presented. In Sect. 3.1, the concepts of the support vector machine are reviewed. In Sect. 3.2, the basic concepts of fuzzy rough set theory are presented.

3.1 Support vector machine

Suppose that there is a set of training samples \(\{ (x_{i} ,y_{i} ),\;x_{i} \in R^{d} ,\;y_{i} \in \{ + 1, - 1\} ,\;i = 1, \ldots ,N\} ,\) where \(x_{i}\) represents the ith sample and \(y_{i}\) its corresponding class label. SVM aims to find a hyperplane with normal vector W that separates the positive training samples from the negative ones while maximizing the margin between the two classes. SVM can also be extended from two-class to multiclass problems (Hsu and Lin 2002), with \(y_{i} \in \{ 1, \ldots ,C_{i} \} ,\) where \(C_{i}\) represents the number of classes. Two common implementations of multiclass SVM classification are the one-against-all and the one-against-one method; in this paper, the one-against-one method is used. Maximizing the margin amounts to minimizing \(\left\| W \right\|\), which leads to the primal quadratic program of SVM (Vapnik 1995; Yang and Xu 2017; Mao et al. 2014; Santhanama et al. 2016):

$$ \begin{aligned} & \min \;\frac{1}{2}\,\left\| W \right\|^{2} + C\sum\limits_{i = 1}^{N} {\xi_{i} } \\ & s.t\;y_{i} (w^{\tau } \phi (x_{i} ) + b) \ge 1 - \xi_{i} ; \\ & \quad \xi_{i} \ge 0;\;i = 1, \ldots ,N \\ \end{aligned} $$
(1)

where \(\xi_{i}\) is the slack (error) term and \(C > 0\) is the regularization parameter. The above optimization problem can be converted into the form of (2) by introducing Lagrange multipliers:

$$ \begin{aligned} & \min \;\frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{N} {\xi_{i} } - \sum\limits_{i = 1}^{N} {\alpha_{i} \left[ {y_{i} (w^{\tau } \phi (x_{i} ) + b) - 1 + \xi_{i} } \right]} - \sum\limits_{i = 1}^{N} {\xi_{i} } \mu_{i} \\ & s.t\;\alpha_{i} \ge 0;\mu_{i} \ge 0,i = 1, \ldots ,N \\ \end{aligned} $$
(2)

where w is the weight vector, b is the bias, \(\alpha_{i}\) are the Lagrangian coefficients, and \(\phi (x_{i} )\) is the kernel mapping of the training samples.

Setting the partial derivatives of (2) to zero and substituting the results back, the dual problem is constructed in the form of (3); it can be solved by quadratic programming.

$$ \begin{aligned} & \min \;\frac{1}{2}\sum\limits_{i} {\sum\limits_{j} {\alpha_{i} } } \alpha_{j} y_{i} y_{j} x_{i}^{\tau } x_{j} - \sum\limits_{i} {\alpha_{i} } = \frac{1}{2}\sum\limits_{i} {\sum\limits_{j} {\alpha_{i} \alpha_{j} y_{i} y_{j} k(x_{i} ,x_{j} )} } - \sum\limits_{i} {\alpha_{i} } \\ & s.t\;\sum\limits_{i} {\alpha_{i} } y_{i} = 0;\;0 \le \alpha_{i} \le C,\;\forall i \\ \end{aligned} $$
(3)

Hence, the solution has the form

$$ w = \sum\limits_{i = 1}^{n} {\alpha_{i} y_{i} \phi (x_{i} )} = \sum\limits_{i \in sv} {\alpha_{i} y_{i} \phi (x_{i} )} $$
(4)

where sv denotes the index set of support vectors:

$$ \begin{aligned} & sv = \{ i\,|\,0 < \alpha_{i} < C\} \\ & \forall i \in sv,\;w^{\tau } \phi (x_{i} ) + b = y_{i} \\ & y_{i} \in \{ + 1, - 1\} \\ & b_{i} = y_{i} - w^{\tau } \phi (x_{i} ) \\ \end{aligned} $$
(5)

where \(x_{i}\) is a support vector. The average of all these \(b_{i}\) defines the bias:

$$ b = \frac{1}{{\left| {sv} \right|}}\sum\limits_{i \in sv} {b_{i} = } \frac{1}{{\left| {sv} \right|}}\sum\limits_{i \in sv} {(y_{i} - w^{\tau } \phi (x_{i} ))} $$
(6)

Once the optimal pair (w, b) is determined, the decision function is obtained by

$$ f(z) = {\text{sign}}\left( {(w,\phi (z)) + b} \right) = {\text{sign}}\left( {\sum\limits_{i = 1}^{{N_{sv} }} {\alpha_{i} } y_{i} K(x_{i} ,z) + b} \right) $$
(7)

where \(N_{sv}\) is the number of support vectors.
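To make (3)-(7) concrete, the following Python sketch solves the dual problem with a generic quadratic programming solver and recovers the bias and decision function. It is a minimal illustration, assuming the cvxopt package as the QP backend and a precomputed kernel matrix; it is not the authors' implementation.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed QP backend

def svm_dual(K, y, C):
    """Solve the dual problem (3): K is the (N, N) kernel matrix, y in {-1, +1}."""
    N = len(y)
    P = matrix(np.outer(y, y) * K)                    # alpha_i alpha_j y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(N))                           # the -sum_i alpha_i term
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))    # box constraint 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))        # equality constraint sum_i alpha_i y_i = 0
    return np.ravel(solvers.qp(P, q, G, h, A, matrix(0.0))["x"])

def svm_bias(K, y, alpha, C, tol=1e-6):
    """Average b over the margin support vectors, Eqs. (5)-(6)."""
    sv = np.where((alpha > tol) & (alpha < C - tol))[0]
    return float(np.mean([y[i] - (alpha * y) @ K[:, i] for i in sv]))

def svm_predict(K_test, y, alpha, b):
    """Decision function (7); K_test[i, j] = K(x_i, z_j) for test points z_j."""
    return np.sign((alpha * y) @ K_test + b)
```

The three functions mirror (3), (6), and (7) in order: the QP returns the multipliers, the bias is averaged over the margin support vectors, and prediction only needs kernel evaluations against the support vectors.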

3.2 Fuzzy rough set

Fuzzy rough set theory (Dubois and Prade 1990) is built upon two other theories, namely fuzzy set theory (Zadeh 1965) and rough set theory (Pawlak 1982). Rough set theory divides the universe of discourse into a positive region (the lower approximation set), a boundary region, and a negative region. For each sample, fuzzy rough set theory returns a pair of memberships: the lower approximation membership expresses a degree of certainty, and the upper approximation membership a degree of possibility, of being included in the target set.

The lower and upper approximation memberships are constructed from the indiscernibility relationship (IR) between samples and the degree of membership in the target set, \(\mu_{F}\) (Verbiest et al. 2013b). The IR, shown in (8), measures the similarity of each pair of samples: it equals 1 when two samples are identical and 0 when they are completely different.

$$ {\text{IR}}(x_{i} ,x_{j} ) = \mathop \tau \limits_{{\alpha \in {\text{Dimension}}}} \left( {1 - \left| {x_{i} (\alpha ) - x_{j} (\alpha )} \right|^{2} } \right) $$
(8)

where \(\tau\) is a triangular norm (t-norm), \(\tau :[0,1]^{2} \to [0,1]\), aggregated over the dimensions (attributes) \(\alpha\).
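For illustration, a minimal Python sketch of (8) follows, assuming attributes normalized to [0, 1] and the minimum t-norm as the aggregator (the product t-norm would be an equally valid choice); it is an illustrative reading, not the authors' implementation.

```python
import numpy as np

def indiscernibility(xi, xj, tnorm=np.min):
    """Eq. (8): aggregate per-attribute similarities with a t-norm."""
    sims = 1.0 - np.abs(xi - xj) ** 2   # per-dimension similarity in [0, 1]
    return float(tnorm(sims))
```

For example, two samples differing by 0.1 in a single attribute yield a per-dimension similarity of 0.99 there and 1.0 elsewhere, so the minimum t-norm returns 0.99.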

According to relation (8), the differences between two samples in every dimension are aggregated by a t-norm; the result is the indiscernibility relation between the two samples under the decision boundary DB. The membership of each sample in the target class, \(\mu_{F} ,\) can be expressed in different ways, for example as a binary membership. With these two concepts, the lower and upper approximation memberships are defined in (9) and (10), respectively:

$$ \mu_{{\underline{{F_{{{\text{IR,}}\mu_{F} }} }} }} (x_{i} ) = \mathop {\inf }\limits_{{x_{j} \in T,x_{i} \ne x_{j} }} I({\text{IR}}(x_{i} ,x_{j} ),\mu_{F} (x_{j} )) $$
(9)
$$ \mu_{{\overline{{F_{{{\text{IR}},\mu_{F} }} }} }} (x_{i} ) = \mathop {\sup }\limits_{{x_{j} \in T,x_{i} \ne x_{j} }} \tau ({\text{IR}}(x_{i} ,x_{j} ),\mu_{F} (x_{j} )). $$
(10)

As shown in the above equations, the fuzzy operators, an implicator (I) and a t-norm (τ), combine two basic elements and produce several outputs, from which "inf" and "sup" select the final result. Hence, outlier and noise data can change the lower and upper approximation memberships over a wide range. Therefore, Verbiest et al. (2013a) propose adjusted versions of the memberships using the ordered weighted average (OWA) instead of "inf" and "sup." These forms of the memberships are presented as follows:

$$ \mu_{{\underline{{F_{{{\text{IR,}}\mu_{F} }} }} }} (x_{i} ) = \mathop {{\text{OWA}}_{\min } }\limits_{{x_{j} \in T,x_{i} \ne x_{j} }} I({\text{IR}}(x_{i} ,x_{j} ),\mu_{F} (x_{j} )) $$
(11)
$$ \mu_{{\overline{{F_{{{\text{IR}},\mu_{F} }} }} }} (x_{i} ) = \mathop {{\text{OWA}}_{\max } }\limits_{{x_{j} \in T,x_{i} \ne x_{j} }} \tau ({\text{IR}}(x_{i} ,x_{j} ),\mu_{F} (x_{j} )). $$
(12)

Many methods, like the popular fuzzy belief function framework (Shafer 1976), follow a similar trend to handle conflicting, incomplete, and uncertain information (Liu et al. 2011, 2015, 2016). It explains the probability of occurrence of a subset of the universe of discourse through the belief and plausibility functions, given in (13) and (14), respectively.

$$ \forall A \in T:{\text{Bel}}(A) = \sum\limits_{B:B \subseteq A} {P_{\alpha } } (B) $$
(13)
$$ \forall A \in T:{\text{Pl}}(A) = \sum\limits_{B:B \cap A \ne \phi } {P_{\alpha } } (B) $$
(14)

where \(P_{\alpha } (B)\) denotes the probability of B occurring, with α corresponding to the α-cut in the fuzzy concept (Chen et al. 2008). The relation between these two frameworks, as explained in (Dubois and Prade 1990; Chen et al. 2008; Yao and Lingras 1998; Wu et al. 2002; Liu et al. 2015), is as follows:

$$ \forall A \in T:{\text{Bel}}(A) = \mu_{{\underline{{F_{{{\text{IR,}}\mu_{F} }} }} }} (A) $$
(15)
$$ \forall A \in T:{\text{Pl}}(A) = \mu_{{\overline{{F_{{{\text{IR}},\mu_{F} }} }} }} (A). $$
(16)

Consequently, fuzzy rough set theory is powerful in handling vagueness and conflict among data; its effectiveness has been proved by remarkable results in fields such as KNN improvement (Verbiest et al. 2013a; Bian and Mazlack 2003; Derrac et al. 2013), fuzzy decision tree expansion (Zhai 2011), and the solid multiple traveling salesman problem (Changdar et al. 2016). However, it stands on common fuzzy operators, which may be improvable in some settings.

4 Proposed method

In this section, a novel method, namely WSVM-FRS (Weighted SVM-Fuzzy Rough Set), is proposed for SVM training. It introduces a novel data characteristic to reduce the effect of noise in soft-margin SVM training by favoring important samples over the others. In this method, a weight coefficient, built from the lower and upper approximation membership functions of fuzzy rough set theory, is added to the penalty term of the Lagrangian formulation of the optimization problem. Consequently, in WSVM-FRS noisy samples receive a low degree. The simplest case, samples with discrete attributes, is reviewed in the following paragraph.

In Fig. 1, assume that a curved line divides the universe of discourse as the boundary of the target region. The whole area is sectioned by the indiscernibility relation (IR) into equivalent squares, so each square contains samples with similar attributes. According to rough set theory, the upper approximation includes all squares having at least one sample inside the target region, whereas the squares lying entirely inside the target region form the positive region (the lower approximation set) of the universe segregated by the curved boundary and the IR. Consequently, as shown in Fig. 1, considering both approximation sets simultaneously, samples that are certainly inside the region gain higher values than those on its boundary, and negative samples take the value zero.

Fig. 1

An example for entropy degree (ED) expression

The approximation memberships constructing ED use (8) as the IR and (17) as the degree of belonging of samples to the target set, \(t_{{\text{s}}}.\) \(t_{{\text{s}}}\) expresses the reverse distance of samples from the center of the class:

$$ t_{{\text{s}}} = 1 - \left( {\frac{{\sum\limits_{{\alpha \in {\text{Dimension}}}} {\left| {\sum\nolimits_{{x_{i} \in T}} {x_{i} (\alpha )} - c^{*} (\alpha )} \right|} }}{{\left| {c^{*} } \right|}}} \right)^{2} $$
(17)

where T is the target set, α a dimension (attribute), and |·| denotes the absolute value. \(c^{*}\) is the center of the class, defined as follows:

$$ c^{*} = \frac{1}{N}\sum\limits_{{x_{i} \in y_{i} }} {x_{i} } $$
(18)

where N is the total number of training samples in the class.
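Read per sample, (17)-(18) can be sketched as follows; since (17) is stated over the whole target set, the per-sample form and the clipping to [0, 1] below are assumptions made for illustration only.

```python
import numpy as np

def class_center(X_class):
    """Eq. (18): the mean of the training samples of one class."""
    return X_class.mean(axis=0)

def target_membership(x, c_star):
    """Eq. (17), assumed per-sample form: reverse normalized distance from the class center."""
    d = np.sum(np.abs(x - c_star)) / np.linalg.norm(c_star)
    return float(np.clip(1.0 - d ** 2, 0.0, 1.0))
```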

Therefore, the lower and upper approximation memberships of (9) and (10) are reconstructed as follows:

$$ \mu_{{\underline{{F_{{{\text{IR}},\mu_{F} }} }} }} (x_{i} ) = \inf I({\text{IR}}(x_{i} ,c^{*} ),t_{{\text{s}}} ) $$
(19)
$$ \mu_{{\overline{{F_{{{\text{IR}},\mu_{F} }} }} }} (x_{i} ) = \sup \tau ({\text{IR}}(x_{i} ,c^{*} ),t_{{\text{s}}} ). $$
(20)

A measurement parameter is thus introduced that quantifies the relation among samples based on their entropy. Furthermore, WSVM-FRS sets the penalty and kernel parameters by grid search. This new data characteristic discovers the certainty of a sample: each sample \(x_{i}\) is recognized as a certain one if the entropy of its attributes is similar to that of the others.

Following these considerations, WSVM-FRS creates a map of data importance by computing the ED in the form of (21):

$$ {\text{ED}}(x_{i} ) = - \sum\limits_{{x_{i} \in R^{d} }} {\mu_{{\overline{{F_{{{\text{IR}},\mu_{F} }} }} }} (x_{i} )\log_{2} \mu_{{\underline{{F_{{{\text{IR}},\mu_{F} }} }} }} (x_{i} )} $$
(21)

Therefore, in this paper, ED combines both the lower and upper approximation memberships into a certainty index. ED gains a general perspective of a sample's role through the upper approximation membership; the lower approximation membership, which captures the restricted relation between a sample and the decision boundary, is then added to raise the level of samples that are certainly inside the decision region. Finally, ED is mapped to the range [0, 1].
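Combining the earlier sketches (indiscernibility, class_center, target_membership), a hedged draft of the ED map of (19)-(21) is given below; the Lukasiewicz implicator, the minimum t-norm, and the closing min-max normalization (the text's mapping to [0, 1]) are assumed choices rather than the paper's exact recipe.

```python
import numpy as np

def entropy_degree(X_class, eps=1e-12):
    """Per-sample certainty map, Eqs. (19)-(21), reusing the sketches above."""
    c = class_center(X_class)
    ts = np.array([target_membership(x, c) for x in X_class])
    ir = np.array([indiscernibility(x, c) for x in X_class])
    lower = np.minimum(1.0, 1.0 - ir + ts)      # Eq. (19) with the Lukasiewicz implicator
    upper = np.minimum(ir, ts)                  # Eq. (20) with the minimum t-norm
    ed = -upper * np.log2(lower + eps)          # Eq. (21), elementwise
    return (ed - ed.min()) / (ed.ptp() + eps)   # map to [0, 1] as stated in the text
```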

In addition to assigning a certainty value ED to each sample, a kernel function is used to map the data to a higher-dimensional space φ(.). The advantage of this mapping appears when an inner product occurs in the optimization formula, where it can be replaced via the kernel trick. WSVM-FRS discovers the boundary by solving the following optimization problem:

$$ \begin{aligned} & \min \;\frac{1}{2}\,\left\| W \right\|^{2} + C\sum\limits_{i = 1}^{N} {{\text{ED}}(x_{i} )\xi_{i} } \\ & s.t\;y_{i} (w^{\tau } \phi (x_{i} ) + b) \ge 1 - \xi_{i} ; \\ & \quad \xi_{i} \ge 0;\;i = 1, \ldots ,N. \\ \end{aligned} $$
(22)

This optimization turns into the differentiable form below by adding \(\alpha_{i} \ge 0\) and \(\mu_{i} \ge 0\) as nonnegative Lagrange multipliers:

$$ \begin{aligned} & L(w,b,\alpha ) = \min \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{N} {{\text{ED}}(x_{i} )\xi_{i} } - \sum\limits_{i = 1}^{N} {\alpha_{i} \left[ {y_{i} (w^{\tau } \phi (x_{i} ) + b) - 1 + \xi_{i} } \right]} - \sum\limits_{i = 1}^{N} {\xi_{i} } \mu_{i} \\ & s.t\;\alpha_{i} \ge 0;\;\mu_{i} \ge 0,i = 1, \ldots ,N. \\ \end{aligned} $$
(23)

Since the solution of (23) is reached by minimization over \(w,b,\xi_{i}\) and maximization over \(\alpha_{i} ,\mu_{i} ,\) the partial derivatives with respect to \(w,b,\xi_{i}\) are set to zero as follows:

$$ \frac{\partial L(w,b,\alpha )}{{\partial w}} = 0 \Rightarrow w - \sum\limits_{i = 1}^{N} {\alpha_{i} y_{i} \phi (x_{i} )} = 0 \Rightarrow w = \sum\limits_{i = 1}^{N} {\alpha_{i} y_{i} \phi (x_{i} )} $$
(24)
$$ \frac{\partial L(w,b,\alpha )}{{\partial b}} = 0 \Rightarrow \sum\limits_{i = 1}^{N} {\alpha_{i} y_{i} = 0} $$
(25)
$$ \begin{aligned} \frac{\partial L(w,b,\alpha )}{{\partial \xi_{i} }} & = 0 \Rightarrow C \times {\text{ED}}(x_{i} ) - \alpha_{i} - \mu_{i} = 0\mathop \Rightarrow \limits^{{\alpha_{i} \ge 0,\mu_{i} \ge 0}} C \times {\text{ED}}(x_{i} ) = \alpha_{i} + \mu_{i} \\ & \Rightarrow 0 \le \alpha_{i} \le C \times {\text{ED}}(x_{i} ),\;0 \le \mu_{i} \le C \times {\text{ED}}(x_{i} ). \\ \end{aligned} $$
(26)

By substituting (24) and (25) together with (26) into (23), the dual form of the decision boundary problem is constructed as (27):

$$ \begin{aligned} & \min \;\frac{1}{2}\sum\limits_{i} {\sum\limits_{j} {\alpha_{i} \alpha_{j} y_{i} y_{j} k(x_{i} ,x_{j} )} } - \sum\limits_{i} {\alpha_{i} } \\ & s.t\;\sum\limits_{i} {\alpha_{i} y_{i} } = 0;\;0 \le \alpha_{i} \le C \times {\text{ED}}(x_{i} ),\forall i. \\ \end{aligned} $$
(27)

The above optimization can be solved by well-known quadratic optimization techniques. The samples with positive \(\alpha_{i}\) in (27) are the support vectors (SVs); these special samples determine the boundary and the bias.

Therefore, if z is a new sample, its decision function takes the form of (28):

$$ f = (w,\phi (z)) + b = \sum\limits_{i = 1}^{N} {\alpha_{i} y_{i} K(x_{i} ,z)} + b = \sum\limits_{i \in sv} {\alpha_{i} y_{i} K(x_{i} ,z) + b} . $$
(28)
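In practice, the box constraint \(0 \le \alpha_{i} \le C \times {\text{ED}}(x_{i} )\) of (27) can be realized with a standard libsvm-style solver by passing per-sample weights, which rescale C for each instance. Below is a minimal sketch with scikit-learn, used here only as an illustrative stand-in for the authors' MATLAB/libsvm setup; the parameter values are placeholders.

```python
from sklearn.svm import SVC

def train_wsvm_frs(X, y, ed, C=1.0, gamma=0.5):
    """Per-sample weights multiply C inside libsvm, reproducing Eq. (27)."""
    clf = SVC(C=C, kernel="rbf", gamma=gamma, decision_function_shape="ovo")
    clf.fit(X, y, sample_weight=ed)   # effective penalty for sample i: C * ed[i]
    return clf
```

Here sample_weight multiplies the penalty C per training sample, so a noisy sample with a low ED contributes only a small slack penalty, which is exactly the effect intended by (22).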

In WSVM-FRS, the boundary is shaped by the most effective SVs, so the decision boundary becomes more accurate. In the proposed method, grid search is used to find the penalty and kernel parameter values. Overall, a step-by-step implementation flowchart of the proposed method is shown in Fig. 2. The next section demonstrates the superiority of WSVM-FRS over state-of-the-art methods through experiments conducted on real data sets.

Fig. 2

Flowchart of the proposed method

5 Results and discussion

Several experiments have been conducted in terms of accuracy and the area under the receiver operating characteristic (ROC) curve to demonstrate the superiority of the proposed method (WSVM-FRS) over alternative classification methods: probabilistic weighted least squares SVM (Lu et al. 2017), DS-RLSSVM (Zhou et al. 2016), Fuzzy-LSSVM (Sheng et al. 2015), RLS-SVM (Yang et al. 2014), PLR-WSVM (Tang et al. 2019), and K-SRLSSVCR (Ma et al. 2019). After the implementation details in Sect. 5.1, the results of the experiments are shown in Sect. 5.2.

5.1 Implementation details

In order to validate WSVM-FRS, experiments have been carried out on real-world data sets taken from the UCI repository (Lichman 2013). These data sets are appropriate because they concern classification under uncertainty in decision-based real-world applications, where accurate decisions are crucial, e.g., medical data, time series data, and letter data (Lichman 2013).

For all the experiments, tenfold cross-validation has been used: each data set was divided into ten mutually exclusive blocks, the proposed method was trained on nine of the ten blocks, and the remaining block was used as the testing set. Each block served once as the testing set, and the average of the ten tests is reported. The selected data sets and their related parameters are listed in Table 1, where #samples, #features, and #classes denote the number of data samples, attributes, and classes, respectively.

Table 1 Selected data sets of UCI data repository (Lichman 2013) in the experiments

Experiments have been carried out to evaluate the proposed method against six state-of-the-art weighted SVM methods: probabilistic weighted least squares SVM (Lu et al. 2017), DS-RLSSVM (Zhou et al. 2016), Fuzzy-LSSVM (Sheng et al. 2015), RLS-SVM (Yang et al. 2014), PLR-WSVM (Tang et al. 2019), and K-SRLSSVCR (Ma et al. 2019). In these experiments, grid search has been used for tuning the regularization parameters; a typical soft-margin SVM classifier with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data. In the grid search scheme, the regularization constant C (penalty parameter) is tuned within the range \(\{ 0,2^{ - 1} ,2^{0} ,2^{1} ,2^{2} ,2^{3} , \ldots \}\), and the kernel parameter γ within the range \(\{ 2^{ - 1} ,2^{0} ,2^{1} ,2^{2} ,2^{3} ,2^{4} , \ldots \}\). The experiments have been executed on a computer with an Intel Core i3 2.4 GHz CPU and 8 GB DDR III memory, using MATLAB R2015b on Microsoft Windows 7. libsvm (Chang and Lin 2011) is employed to implement the base SVM classifier.

The kernel used in all experiments is a radial basis function:

$$ f(x,z) = \exp \left( { - \frac{{\left\| {x - z} \right\|^{2} }}{{2\sigma^{2} }}} \right). $$
(29)
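The tuning scheme above can be sketched as a tenfold grid search; the ranges below are truncated for brevity, C = 0 from the stated range is skipped because libsvm requires C > 0, and the mapping \(\gamma = 1/(2\sigma^{2} )\) links the library parameter to (29). This is an illustrative Python stand-in for the MATLAB setup, with X and y assumed to be a loaded UCI data set.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** k for k in range(-1, 6)],       # 2^-1 ... 2^5, truncated range
    "gamma": [2.0 ** k for k in range(-1, 6)],   # RBF parameter; gamma = 1 / (2 sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring="accuracy")
# search.fit(X, y)   # X, y: one of the UCI data sets loaded beforehand
```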

5.2 Experimental results

The outcomes of SVM classification can be summarized in four groups: true positives (TP) are positive samples correctly identified; false positives (FP) are negative samples incorrectly classified as positive; true negatives (TN) are negative samples classified correctly; and false negatives (FN) are positive samples incorrectly classified as negative. These sample categories are summarized in Table 2.

Table 2 Confusion matrix

Accuracy is a popular index displaying the percentage of samples that are correctly classified. It can be computed using Eq. (30):

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{({\text{TP}} + {\text{FN}}) + ({\text{FP}} + {\text{TN}})}}. $$
(30)

The accuracy results of the experiments are described in Table 3. In most cases, the classification accuracy of WSVM-FRS is better than that of the other methods.

Table 3 Result of average SVM classification accuracy

A statistical significance analysis was performed using the nonparametric Wilcoxon signed-rank test (Demsar 2006) to analyze the results and derive strong conclusions. This test determines whether the improvement of the proposed method (WSVM-FRS) is relevant. The last two rows of Tables 3, 4 and 5 present the results of the Wilcoxon test, with a significance level of 0.05. The entry "1" indicates that WSVM-FRS significantly improves over the corresponding competing method in terms of the respective measure.

Table 4 Result of average SVM classification precision
Table 5 Result of average SVM classification recall

Table 3 shows the average SVM classification accuracy of the competing methods, in percentage (%). Where the proposed WSVM-FRS method achieves a better classification accuracy than the other methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019), the value is shown in bold. For every data set, the rank of each method is given in Table 3 together with the average rank, and the best rank for each data set is highlighted.

The proposed method has the best performance and the top average rank, considerably better than those of the other methods. WSVM-FRS achieves the best classification accuracy in six out of the 20 data sets, namely Sonar, Heart, E. coli, Diabetes, Yeast, and Letter. The proposed method ranks second or third in eight data sets and between fourth and sixth in the remaining six. Overall, these ranks distinguish WSVM-FRS from the other methods as the best performer.

For some data sets, other methods perform better in terms of classification accuracy. For example, on the Ionosphere, Transfusion, and Vowel data sets, DS-RLSSVM (Zhou et al. 2016) obtained better results. Although DS-RLSSVM has the second-best average rank, on individual data sets its rank varies between second and sixth. PLR-WSVM (Tang et al. 2019) has the third-best average rank, with the best results on the Glass and Wdbc data sets and ranks between second and seventh elsewhere. K-SRLSSVCR (Ma et al. 2019) has the fourth-best average rank; it wins on the Haberman and Segment data sets but earns varying ranks on the others. Probabilistic weighted LS-SVM (Lu et al. 2017) performs best on the Musk, Pendigits, and Satimage data sets, yet its average rank is only fifth, with a considerable gap from WSVM-FRS, and it shows the worst accuracy on four data sets. Fuzzy-LSSVM (Sheng et al. 2015) has the sixth-best average rank; it performs best only on the Iris data set and stands far from the proposed method in average rank. RLS-SVM (Yang et al. 2014) has the seventh and worst average rank, despite ranking first on the Liver and Vehicle data sets, and shows the biggest gap from WSVM-FRS in average rank.

The last two rows of Table 3 show the results of the Wilcoxon test comparing the classification accuracy of WSVM-FRS against the other competing methods at a significance level of 0.05; the entry "1" indicates that WSVM-FRS significantly improves over the corresponding method.

In pattern recognition and information retrieval, precision (also called positive predictive value) is the fraction of retrieved samples that are relevant, while recall (also known as sensitivity) is the fraction of relevant samples that are retrieved. Precision and recall are computed by (31) and (32), respectively. The precision results are listed in Table 4, and Table 5 shows the recall results.

$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} $$
(31)
$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}. $$
(32)

Table 4 shows the average SVM classification precision of the competing methods, in percentage (%). Where the proposed WSVM-FRS method achieves a better classification precision than the other methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019), the value is shown in bold. For every data set, the rank of each method is given in Table 4 together with the average rank, and the best rank for each data set is highlighted.

The proposed method again has the best performance and the top average rank, considerably better than those of the other methods. WSVM-FRS achieves the best classification precision in six out of the 20 data sets, namely Sonar, Heart, E. coli, Diabetes, Yeast, and Letter. The proposed method ranks second or third in nine data sets and between fourth and sixth in the remaining five.

For some data sets, other methods perform better in terms of classification precision. For example, on the Transfusion, Musk, Pendigits, Satimage, Vowel, and Shuttle data sets, probabilistic weighted LS-SVM (Lu et al. 2017) obtained better results. Although probabilistic weighted LS-SVM has the second-best average rank, on individual data sets its rank varies between second and seventh. DS-RLSSVM (Zhou et al. 2016) performs best on the Ionosphere data set; it has the third-best average rank, with a considerable gap from WSVM-FRS, and does not perform acceptably on some data sets. PLR-WSVM (Tang et al. 2019) has the fourth-best average rank, with the best results on the Glass and Wdbc data sets and ranks between second and seventh elsewhere. K-SRLSSVCR (Ma et al. 2019) has the fifth-best average rank; it wins on the Haberman and Segment data sets but earns varying ranks on the others.

Similarly, Fuzzy-LSSVM (Sheng et al. 2015) has the sixth-best average rank; it performs best only on the Iris data set and stands far from the proposed method in average rank. RLS-SVM (Yang et al. 2014) has the seventh and worst average rank in precision, despite ranking first on the Liver and Vehicle data sets, and shows the biggest gap from WSVM-FRS in average rank.

The last two rows of Table 4 show the results of the Wilcoxon test comparing the classification precision of WSVM-FRS against the other competing methods at a significance level of 0.05; the entry "1" indicates that WSVM-FRS significantly improves over the corresponding method.

Table 5 shows the average SVM classification recall of the competing methods, in percentage (%). Where the proposed WSVM-FRS method achieves a better classification recall than the other methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019), the value is shown in bold. For every data set, the rank of each method is given in Table 5 together with the average rank, and the best rank for each data set is highlighted.

The proposed method again has the best performance and the top average rank. WSVM-FRS achieves the best classification recall in six out of the 20 data sets, namely Sonar, Heart, E. coli, Diabetes, Yeast, and Letter. The proposed method ranks second or third in 11 data sets and between fourth and sixth in the remaining three.

For some data sets, other methods perform better in terms of classification recall. For example, on the Ionosphere, Transfusion, and Vowel data sets, DS-RLSSVM (Zhou et al. 2016) obtained better results. Although DS-RLSSVM has the second-best average rank, on individual data sets its rank varies between second and seventh. RLS-SVM (Yang et al. 2014) has the third-best average rank in recall, performing best on the Liver and Vehicle data sets. K-SRLSSVCR (Ma et al. 2019) has the fourth-best average rank; it wins on the Haberman and Segment data sets but earns varying ranks on the others. Fuzzy-LSSVM (Sheng et al. 2015) performs best on the Iris data set, yet its average rank lies considerably behind that of WSVM-FRS, and it does not perform acceptably on some data sets. Probabilistic weighted LS-SVM (Lu et al. 2017) has the fifth-best average rank; it performs best on the Musk, Pendigits, Satimage, and Shuttle data sets but stands far from the proposed method in average rank. Finally, PLR-WSVM (Tang et al. 2019) also ranks near the bottom of the average ranking, with the best results on the Glass and Wdbc data sets and ranks between second and seventh elsewhere.

The last two rows of Table 5 show the results of the Wilcoxon test comparing the classification recall of WSVM-FRS against the other competing methods at a significance level of 0.05; the entry "1" indicates that WSVM-FRS significantly improves over the corresponding method.

As the number of positive samples is smaller than the number of negative ones, the performance of SVM classification in segregating positive samples is especially important. Therefore, the area under the receiver operating characteristic (ROC) curve, known as AUC, is computed as another metric. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR); a larger AUC, computed as in (33), indicates a better ability to distinguish positive samples:

$$ {\text{AUC}} = (1 + {\text{TPR}} - {\text{FPR}})/2. $$
(33)

Based on Fig. 3, the AUC metric shows the particular power of WSVM-FRS in capturing the data characteristics of positive samples, which leads to better results for the proposed method than for the others. Overall, the experiments on data sets taken from the UCI repository demonstrate the superiority of WSVM-FRS over the state-of-the-art methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019).

Fig. 3

Result of AUC metric

Overall, in terms of the accuracy, precision, and recall metrics, WSVM-FRS performs well on data sets with many classes. In several cases, the proposed method also satisfies all metrics on large data sets, and on high-dimensional data sets WSVM-FRS performs reasonably well.

5.2.1 Noise analysis

The presence of noise in training data has a strong negative impact on the performance of learning algorithms; methods should therefore be sufficiently resistant to it. In order to present a deeper discussion and show that the proposed method outperforms the competing methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019), the most common type of artificial noise, uniform random addition (Zhu and Wu 2004), has been used to add both class noise and attribute noise. The noise level has been raised from 0% (original data sets) to 30%. To evaluate the impact of the noise level on accuracy, the proposed method and all the competing methods have been run on each data set. The results are shown in Fig. 4, where the x-axis indicates the noise level and the y-axis the classification accuracy of the trained classifiers.
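Since the text does not spell out the exact corruption mechanism, the following sketch of uniform random addition is an illustrative assumption: labels of a random subset are reassigned (class noise) and a random subset of attribute entries is replaced with uniform values (attribute noise).

```python
import numpy as np

def add_uniform_noise(X, y, level, seed=0):
    """Inject class and attribute noise at the given level (0.0 to 0.3 in the text)."""
    rng = np.random.default_rng(seed)
    Xn, yn = X.copy(), y.copy()
    flip = rng.random(len(y)) < level                  # class noise: relabel a subset
    yn[flip] = rng.choice(np.unique(y), size=int(flip.sum()))
    mask = rng.random(X.shape) < level                 # attribute noise: uniform values
    Xn[mask] = rng.uniform(X.min(), X.max(), size=int(mask.sum()))
    return Xn, yn
```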

Fig. 4

Experimental results of noise effect on classification accuracy

Looking at Fig. 4, as the noise level increases, the classification accuracy of all methods decreases dramatically. Furthermore, it is clear that the more noise is added to the data sets, the more resistant the proposed method is compared to the other methods, owing to the strong classifier built on the fuzzy rough set strategy. The results indicate the superiority of the proposed method over the others in dealing with noisy data.

5.2.2 Real-world data set analysis

In this section, two real-world data sets are considered to illustrate the performance of the proposed method (WSVM-FRS) in comparison with some state-of-the-art methods.

MNIST data set To continue the evaluation of the proposed method, the MNIST data set (LeCun et al. 2010) has been considered. MNIST is a data set of simple gray-scale handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples; it is a subset of a larger set available from NIST. The digits have been size-normalized and centered in fixed-size images. The MNIST data set is described in Table 6, and examples are shown in Fig. 5.

Table 6 Files contained in the MNIST dataset
Fig. 5

Demonstration of the MNIST data set

Experimental results on MNIST data set An experiment on the MNIST data set has been carried out to evaluate the proposed method against the six state-of-the-art weighted SVM methods: probabilistic weighted least squares SVM (Lu et al. 2017), DS-RLSSVM (Zhou et al. 2016), Fuzzy-LSSVM (Sheng et al. 2015), RLS-SVM (Yang et al. 2014), PLR-WSVM (Tang et al. 2019), and K-SRLSSVCR (Ma et al. 2019). The results are shown in Table 7, where three evaluation measures are considered: classification accuracy, recall, and precision.

Table 7 Result of the proposed method performance compared to other competitors regarding MNIST data set

Table 7 shows the average SVM classification accuracy, recall, and precision of the competing methods, in percentage (%). On the MNIST data set, the proposed WSVM-FRS method achieves a better classification performance than the other methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019), shown in bold. The rank of each method and the average rank are given in Table 7, and the best rank is highlighted, confirming the best performance of the proposed method.

Fashion-MNIST data set The other real-world data set used in the evaluations is the Fashion-MNIST data set (Cohen et al. 2017). Fashion-MNIST is based on the assortment on Zalando's website. Every fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e., front and back looks, details, and looks with a model and in an outfit. The original pictures have a light-gray background (hexadecimal color #fdfdfd) and are stored in 762 × 1000 JPEG format. For efficiently serving different frontend components, the original pictures are resampled at multiple resolutions, e.g., large, medium, small, thumbnail, and tiny.

The front look thumbnail images of 70,000 unique products are used to build Fashion-MNIST. The products come from different gender groups: men, women, kids, and neutral. White-color products are not included in the data set, as they have low contrast with the background. The thumbnails (51 × 73) are then fed into the conversion pipeline below, which is visualized in Fig. 6 (a hedged code sketch follows the list):

1. Converting the input to a PNG image.

2. Trimming any edges that are close to the color of the corner pixels; "closeness" is defined as a distance within 5% of the maximum possible intensity in RGB space.

3. Resizing the longest edge of the image to 28 by subsampling the pixels, i.e., skipping some rows and columns.

4. Sharpening pixels using a Gaussian operator with a radius and standard deviation of 1.0, with increasing effect near outlines.

5. Extending the shortest edge to 28 and putting the image at the center of the canvas.

6. Negating the intensities of the image.

7. Converting the image to 8-bit grayscale pixels.
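A hedged Pillow sketch of this pipeline is given below; the trimming threshold, sharpening parameters, and centering logic approximate the described steps and are not the original conversion code.

```python
from PIL import Image, ImageChops, ImageFilter, ImageOps

def to_fashion_mnist(path):
    img = Image.open(path).convert("RGB")                    # step 1: decode input
    bg = Image.new("RGB", img.size, img.getpixel((0, 0)))
    diff = ImageChops.difference(img, bg)
    bbox = diff.point(lambda p: 255 if p > 255 * 0.05 else 0).getbbox()
    img = img.crop(bbox)                                     # step 2: trim near-corner colors
    img.thumbnail((28, 28))                                  # step 3: longest edge -> 28
    img = img.filter(ImageFilter.UnsharpMask(radius=1))      # step 4: sharpen
    canvas = Image.new("RGB", (28, 28), "white")
    canvas.paste(img, ((28 - img.width) // 2, (28 - img.height) // 2))  # step 5: center
    return ImageOps.invert(canvas).convert("L")              # steps 6-7: negate, grayscale
```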

Fig. 6

Diagram of the conversion process used to generate Fashion-MNIST dataset. Two examples from dress and sandals categories are depicted, respectively. Each column represents a step described previously

The data set is divided into a training and a test set, with the training set receiving 6000 randomly selected examples from each class. Images and labels are stored in the same file format as the MNIST data set, which is designed for storing vectors and multidimensional matrices. The resulting files are listed in Table 8. Examples are sorted by their labels before storing, which yields smaller label files after compression compared to MNIST and makes it easier to retrieve examples with a certain class label.

Table 8 Files contained in the Fashion-MNIST dataset

For the class labels, the silhouette code of the product has been used. The silhouette code is manually labeled by in-house fashion experts and reviewed by a separate team at Zalando; each product carries exactly one silhouette code. Table 9 summarizes all class labels in Fashion-MNIST with example images for each class.

Table 9 Class names and example images in Fashion-MNIST dataset

Experimental results on Fashion-MNIST data set An experiment on the Fashion-MNIST data set has been conducted to evaluate the proposed method against the six state-of-the-art weighted SVM methods: probabilistic weighted least squares SVM (Lu et al. 2017), DS-RLSSVM (Zhou et al. 2016), Fuzzy-LSSVM (Sheng et al. 2015), RLS-SVM (Yang et al. 2014), PLR-WSVM (Tang et al. 2019), and K-SRLSSVCR (Ma et al. 2019). The results are shown in Table 10, where three evaluation measures are considered: classification accuracy, recall, and precision.

Table 10 Result of the proposed method performance compared to other competitors regarding Fashion-MNIST data set

Table 10 shows the average SVM classification accuracy, recall, and precision of the competing methods, in percentage (%). On the Fashion-MNIST data set, the proposed WSVM-FRS method achieves a better classification performance than the other methods (Sheng et al. 2015; Zhou et al. 2016; Lu et al. 2017; Yang et al. 2014; Tang et al. 2019; Ma et al. 2019), shown in bold. The rank of each method and the average rank are given in Table 10, and the best rank is highlighted, confirming the best performance of the proposed method.

6 Conclusions and future works

In this paper, a novel method, namely WSVM-FRS, has been introduced to reduce the effect of uncertainty, specifically noise, in SVM training. The primary motivation is the sophisticated, uncertainty-laden information found in real-world applications: the method should retain training effectiveness on challenging and large-scale data sets without losing satisfactory speed.

WSVM-FRS is a novel weighted support vector machine that alleviates the noise sensitivity problem of the standard support vector machine for multiclass data classification. Keeping the basic idea, a weight coefficient called the entropy degree, built from the lower and upper approximation membership functions of fuzzy rough set theory, has been added to the penalty term of the Lagrangian formulation of the optimization problem. Consequently, noisy samples receive a low degree and important samples a high degree. The performance of WSVM-FRS has been examined on 20 data sets taken from the UCI repository as well as on real-world data sets, and the results have been compared to six other algorithms from the recent literature. The experimental results demonstrate that the proposed method achieves good classification accuracy, precision, and recall by handling uncertainty aspects, including noisy samples. The Wilcoxon test shows that the method is statistically different from its competitors in terms of the accuracy, precision, and recall metrics.

In the future, we would endeavor not only to enhance the effectiveness of WSVM-FRS in dealing with data streams, but also to introduce a more accurate kernel based on the ED concept for more challenging real-world data sets. Furthermore, we intend to explore other aspects of weighted SVM in order to reduce noise more effectively and quickly.