1 Introduction

Significant information is used to solve many practical problems in social, economic, biological, medical, and technical domains. Problems with multiple objectives are often non-commensurable and conflict with each other. Such problems are solved using the theory of multi-objective optimization (MOO), a generalization of traditional single-objective optimization (SOO). In principle, the presence of multiple objectives in a problem gives rise to a set of optimal solutions (widely known as Pareto-optimal solutions, POSs) instead of a single optimal solution. In the absence of further information, none of these POSs can be said to be better than the others, which demands that a user find as many POSs as possible. Classical optimization methods (including multi-criterion decision-making methods) suggest converting the MOO problem into a SOO problem by emphasizing one particular POS at a time; when such a method is used to find multiple solutions, it has to be applied many times, hopefully finding a different solution at each simulation run. Over the years, a number of MOO methods have been suggested by different researchers [15] for different purposes, primarily because of their ability to find multiple POSs in a single simulation run. With an emphasis on moving toward the true Pareto-optimal region, MOO can be used to find multiple POSs in a single run.

In this paper, we address these issues and propose an improved version of Pareto optimality, which is used to generate a distinguished class (DC)/subclass from an existing classification. Subclasses are inevitable in biological data. For example, under the different classes of blood cancer (leukaemia, myeloma and lymphoma) there are subclasses such as acute lymphoblastic leukaemia (ALL), acute myeloid leukaemia (AML), chronic lymphocytic leukaemia (CLL), chronic myeloid leukaemia (CML), Hodgkin lymphoma, and non-Hodgkin lymphoma. From simulation results on different real-world data sets, we find that the DCs are classified more accurately by Pareto-based MOO than by the other existing optimization methodologies; thus our method outperforms the other optimization techniques. Constrained MOO is therefore important for solving practical problems, but researchers on Pareto optimization have so far paid little attention to it. In this paper, we suggest a simple constraint-handling strategy with Pareto sets that suits any MOO problem. The results encourage the application of Pareto optimization to more complex, real-world MOO problems.

Different optimization models are applied in our work alongside Pareto-based MOO, and appropriate solutions are obtained for the respective models. Based on their classification results, the performances of all these optimization models are close to each other, but the Pareto-based model generates a novel class/subclass from an existing class using the optimization technique. Some features in the feature set (FS) have sensitive components (called sub-features); they play a major role in leading to a new class, and their frequency may be low in the feature data [6]. In this paper, the associated sub-features generate the novel classes.

Biological data values in particular are sensitive in nature, and a sensitive sub-feature leads to a novel class.

In the remainder of the paper, we briefly review related work in Sect. 2. Section 3 defines the problem statement of the proposed work. Section 4 explains how the data are prepared for classification. In Sect. 5, the different optimization models are illustrated. The optimization model based on fuzzy sets is explained in Sect. 6. Section 7 describes the optimization-model-based classification techniques. The experimental details are discussed and analyzed over several datasets for our proposed model in Sect. 8. Finally, Sect. 9 ends with conclusions and future directions.

2 Related work

In real-life multi-objective design, a decision-maker (DM) has to take into account many different criteria, such as low cost and good performance, which cannot all be satisfied simultaneously. Sometimes the data mining task becomes even more complicated because of additional constraints, which always exist in practice. This leads to the MOO problem. In contrast to SOO, there is still no universally accepted definition of an optimum [7]. Usually, it is possible to consider a trade-off among all (or almost all) criteria. This trade-off analysis can be formulated as a vector nonlinear optimization problem under constraints. Generally speaking, the solution of such a problem is not unique. It is natural to exclude from consideration any design solution that can be improved without deteriorating any discipline or violating the constraints; this leads to the notion of a POS [8]. Mathematically, each Pareto point is a solution of the MOO problem.

In the literature, there are two broad categories of search algorithms in the MOO context: point-based and population-based. The point-based approaches are the classical generating methods, in which all the objectives are summed up in one way or another (scalarization [9]) to form an aggregate objective function (AOF). The AOF, subject to the constraints, serves as a SOO problem: a single point is used as an initial guess and is improved in each step of the algorithm to optimize the AOF. In population-based approaches, on the other hand, several points are initialized and improved in parallel during the search. By applying stochastic operators, the initial population is evolved during the course of optimization; highly fit solutions remain in the population while less fit ones are discarded [10, 11]. These categories each have their pros and cons for the problem under study, which we treat as a MOO problem. A large number of single-point-based search methods are regarded as line search methods over the optimum region [12].

Fig. 1 Processing of data for selected associated sub-features for novel classes

As is evident in the literature [8, 10, 13–15], both single-point and population-based search methods are competent in generating part or the whole of the Pareto frontier. However, the relative performance of the different algorithms essentially depends on the problem under consideration. In the context of real-world optimization, researchers often face characteristics that cause difficulties for the generating algorithms, such as (a) discrete search spaces, (b) noise and non-smoothness of functions, (c) nonlinearity and multi-modality of the functions and constraints, and (d) a large number of design variables and objective functions (high dimensionality).

The basic idea of dominance is an essential component of the Pareto-based MOO model and plays an important role in the development of Pareto-based optimization. Pareto-based MOO is a kind of MOO whose selection operator is based on Pareto dominance; examples include the multi-objective genetic algorithm [16], the non-dominated sorting genetic algorithm (NSGA) [17], NSGA-II [18], the non-dominated Pareto genetic algorithm [19], the strength-Pareto evolutionary algorithm [20], the Pareto-archived evolution strategy [21], and the territory-defining evolutionary algorithm [22]. In the development of Pareto-based multi-objective evolutionary algorithms (MOEAs), non-dominated sorting remains a key topic, because most of the computational cost of Pareto-based MOEAs comes from non-dominated sorting. Non-dominated sorting is the process of assigning solutions to different ranks; all non-dominated sorts output the same result for the same input solution set. In 2002, Deb [18] proposed a fast non-dominated sort, which lowers the complexity, although some unnecessary comparisons remain. A many-objective optimization problem is a special kind of MOP with more than three objectives [23]. Its large number of objectives increases the difficulty for Pareto-based MOEAs because of their weak Pareto-dominance selection pressure [24–27]. Objective reduction [28, 29] and scalarization methods [30] are strategies for reducing the difficulties of the original problems. Many MOEAs also aim to solve many-objective optimization problems, such as the indicator-based evolutionary algorithm [31], HypE (the hypervolume estimation algorithm for MOO) [32], and multi-objective directed evolutionary line search [33]. Given the characteristics of many-objective optimization problems, non-dominated sorting faces new challenges because of the large number of objective comparisons.
None of these authors has discussed the generation of a DC/subclass from an existing classification based on selected sub-features, which play an important role in sensitive data. Utmost care has been taken during the simulation stage to show that the proposed method works well.

3 Problem statements

In this section, we present the problem setting for classification based on sub-features. The sub-features are generated based on the frequency of the feature data; a detailed description of sub-features is available in [6]. The role of sub-features is important for generating a novel class/subclass from an existing class. A traditional class is always based on the respective feature values, but the feature values of a biological/economic/physical system change under different environmental conditions or social demands. For example, blood cancer is divided into three classes: leukemia, myeloma and lymphoma. After examining the different feature values of the blood, a case can be assigned to one of these classes. However, if a physician minutely scans certain sub-feature values, they may point to one of the subclasses (ALL, CLL, AML, CML). The detection is always ambiguous, so the physician is always in a difficult position when trying to take the right decision. Under this condition, Pareto-based MOO and the fuzzy Pareto technique are useful for detecting the subclass effectively. Hence our objective is to find the class/subclass from an existing class based on sub-feature data. Sometimes feature values also change with the climate, but detection then takes a long time. Thus we are interested in discussing the importance of sensitive feature values (sub-feature values) for classification, which generate a novel (distinguished) class from an existing class. The question may arise: "What kind of sub-features generate a novel class from the respective database?" Since sub-features are highly essential, we have developed different optimization models to find the appropriate solution for each respective optimization problem.

4 Preparing the data for classification

We consider only sub-feature data for our classification model under the different optimization methods. The collected sub-feature data are used to generate a DC from an existing class in the corresponding database; a detailed description of sub-feature data is given in [6]. Figure 1 depicts the different optimization models (SOO, MOO, Pareto MOO and convex fuzzy optimization) through which the cleaned data are processed for optimization. The following preprocessing steps help to improve the accuracy, efficiency, and scalability of the classification or prediction of novel classes.

4.1 Data cleaning

This refers to the preprocessing of data in order to remove or reduce noise and to treat missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanism for handling noisy or missing data, this step helps to reduce confusion during learning. We identify missing or irrelevant values and remove them from the database. Since our model collects only sub-feature data, almost all noisy data may be removed from the database.
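The mode-based imputation mentioned above (replacing a missing value with the attribute's most commonly occurring value) can be sketched as follows; the column values are hypothetical:

```python
from collections import Counter

def impute_most_common(values, missing=None):
    """Replace missing entries with the most frequently occurring value."""
    observed = [v for v in values if v is not missing]
    if not observed:
        return values  # nothing to impute from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

# Example: a sub-feature column with two missing readings.
column = ["A", "B", None, "A", None, "A"]
print(impute_most_common(column))  # ['A', 'B', 'A', 'A', 'A', 'A']
```

For numeric attributes, the same pattern applies with the mean or a statistically most probable value in place of the mode.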

4.2 Relevance analysis

Many of the sub-features in the dataset may be redundant. Correlation analysis can be used to identify whether any two given sub-features are statistically related. For example, a strong correlation between sub-features A\(_{1}\) and A\(_{2}\) would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant sub-features. The low-fuzzy-frequency sub-feature techniques [6] can be used in these cases to find a reduced set of sub-features such that the resulting probability distribution of the data can generate a possible DC from the original database. Hence, relevance analysis, in the form of correlation analysis and sub-feature selection, is used to detect sub-features that do not contribute to the classification or prediction task. Including such sub-features may otherwise slow down, and possibly mislead, the prediction, and also affect classification efficiency and scalability.
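A minimal sketch of the correlation screening described here, flagging sub-feature pairs whose absolute Pearson correlation exceeds a threshold (the column values and the 0.9 cutoff are illustrative assumptions, not values from the paper):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sqrt(sum((a - mx) ** 2 for a in x))
    vy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def redundant_pairs(columns, names, threshold=0.9):
    """Flag sub-feature pairs whose |correlation| exceeds the threshold;
    one sub-feature of each flagged pair may be dropped."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[i], columns[j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs

# Toy columns: A2 is a linear copy of A1, while A3 is unrelated.
a1 = [0.1, 0.4, 0.35, 0.8, 0.15, 0.6, 0.9, 0.25]
a2 = [2 * v + 0.05 for v in a1]
a3 = [0.7, 0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.6]
print(redundant_pairs([a1, a2, a3], ["A1", "A2", "A3"]))
```

Only the (A1, A2) pair is flagged here, so one of the two sub-features could be removed before classification.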

4.3 Data transformation and reduction

After the above two steps, the collected data are transformed for evaluation by the different optimization models phase by phase, which reduces the irrelevant data in each covering phase.

5 Optimization model for classification

Before elaborating the optimization models, we develop statistical measurements of sub-features based on an inverse probability model for classification.

5.1 Use of inverse probability model for classification

We consider applications of conditional probability in which unknown probabilities are computed on the basis of the information supplied by past records. The additional information supplied by past data is of great help in generating a new classification model and arriving at valid decisions in the face of the uncertainties of sub-feature data. Let the generated class data (A) be an event that occurs in conjunction with one of n mutually exclusive and exhaustive events E\(_{1},\,\mathrm{E}_{2},\ldots ,\mathrm{E}_\mathrm{n}.\) Then the probability that such a generated class was preceded by the particular event E\(_\mathrm{i}\) is defined by

$$\begin{aligned} \text {P}\left( {\text {E}_\text {i} |\text {A}} \right) =\frac{P( {E_i } )P(A|E_i )}{\mathop \sum \nolimits _{k=1}^n P( {E_k } )P(A|E_k )}, \end{aligned}$$
(1)

where the exhaustive events are defined over the sub-feature data. This inverse probability helps to generate a new class on the basis of sub-feature data in the optimization models that follow. Under this setting, the inverse probability model is used in the optimization problem to find an appropriate DC: since our objective is to select sub-features for the DC, we apply the above probability to find the number of sub-features drawn from the existing class.
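As a small illustration of Eq. (1), the posterior over candidate events can be computed directly from priors and likelihoods; the numbers below are hypothetical, chosen only for demonstration:

```python
def inverse_probability(prior, likelihood):
    """Bayes' rule, Eq. (1): P(E_i | A) for mutually exclusive,
    exhaustive events E_1..E_n.
    prior[i] = P(E_i), likelihood[i] = P(A | E_i)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)  # the denominator: sum_k P(E_k) P(A | E_k)
    return [j / total for j in joint]

# Hypothetical numbers: three candidate sub-feature events.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.6]
post = inverse_probability(prior, likelihood)
print([round(p, 3) for p in post])
```

The posterior always sums to one, and the event with the largest product of prior and likelihood receives the highest posterior probability.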

An optimization problem maximizes or minimizes a real function by choosing values of the variables that satisfy the problem's constraints. Optimization problems arise in data mining tasks such as classification, clustering, and feature selection. We build different optimization models, namely SOO, MOO, Pareto-based MOO, and fuzzy convex optimization.

5.2 Single-objective optimization

In the SOO method, we develop a model for novel classification in which only one objective function is optimized. A SOO typically minimizes the objective value on the training data. Mathematically, it is defined as

$$\begin{aligned}&\text {Min f}( \text {x}) \\&\text {Subject to P}( \text {x})\text { }\ge \text {a}, \nonumber \\&\text {x}^{(\mathrm {\ell } )}\le \text {x}\le \text {x}^{(\text {u})}, \nonumber \end{aligned}$$
(2)

where f: \(\mathrm{R}^\mathrm{m}\rightarrow \mathrm{R}\) is the objective function, x \(\in \) R\(^\mathrm{m}\) is an m-dimensional input vector within a certain range, and P: \(\mathrm{R}^\mathrm{m}\rightarrow \mathrm{R}\) defines a constraint. This single objective function generates an optimized classification based on sub-feature data, where the role of the sub-features is to extract a DC from an existing class; the corresponding DC is recognized as a new, unique class within the same database. The model relating sub-features and classes has been developed using different kinds of optimization techniques; the concept of a sub-feature is clearly defined in [6]. The single objective function generates a new class by performing different kinds of sub-feature evaluation over the feature space.

The inverse probability model is used in the optimization problem (f\(_\mathrm{x}\)) to find an appropriate solution for the required task. The constraints emerge from certain conditions on the sub-feature data, i.e., the probability of the sub-feature data must satisfy certain threshold values; otherwise it is difficult to generate a new class from the database. Thus Eq. (2) is modified for the new class as

$$\begin{aligned}&\text {Min f}_{\text {nc}} ( \text {x}) \\&\text {Subject to max }\left\{ \text {P}_\text {s} ( \text {x} )\right\} \text { }\ge \text {a }+{\upbeta } _\text {i}, \nonumber \\&\text {x}^{(\mathrm {\ell })}\le \text {x}_\text {s} \le \text {x}^{(\text {u})}, \nonumber \end{aligned}$$
(3)

where f\(_\mathrm{nc}\)(x) is the classification function, P\(_\mathrm{s}\)(x) is the probability of the available covered sub-features (i.e., the maximally covered sub-features), a is the threshold value, x\(_\mathrm{s}\) is the sub-feature range between x\(^{(\mathrm {\ell })}\) and x\(^\mathrm{(u)}\), and \(\upbeta _\mathrm{i}\) denotes the number of additional sub-FSs. In this case we consider associated sub-features and generate a sub-FS from the feature space to solve the optimization problem, i.e., to maximize the probability of sub-features generating a distinguished (new) class. The role of sub-features in generating a novel class is derived in Algorithm 1.
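A minimal sketch of the constrained formulation in Eq. (3), using `scipy.optimize.minimize`. The quadratic objective and the smooth stand-in for the sub-feature coverage probability, as well as all numbers, are hypothetical illustrations, not the paper's actual f\(_\mathrm{nc}\) or P\(_\mathrm{s}\):

```python
from scipy.optimize import minimize

# Hypothetical stand-ins for Eq. (3): f_nc is the classification
# objective, P_s a smooth probability-like sub-feature coverage.
f_nc = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2
P_s = lambda x: 1.0 / (1.0 + x[0] ** 2)          # values in (0, 1]
a, beta = 0.3, 0.1                               # threshold a + beta_i

res = minimize(
    f_nc,
    x0=[0.0, 0.0],
    bounds=[(-2.0, 2.0), (-2.0, 2.0)],           # x^(l) <= x_s <= x^(u)
    constraints=[{"type": "ineq", "fun": lambda x: P_s(x) - (a + beta)}],
)
print(res.x, P_s(res.x) >= a + beta)
```

Here the unconstrained minimizer already satisfies the coverage constraint, so the solver returns it; tightening `a + beta` would push the solution toward the constraint boundary.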


Algorithm 1 generates different new classes from an existing class based on the associated sensitive sub-features.

Definition 1

The sub-feature is said to be sensitive if it generates new class from existing class.

To achieve the overall target of new classification, the model first has to build the sub-features efficiently. When forming a new classification, the model normally relies on the sub-features and intuition to make decisions, rather than on a quantitative prediction of the classifier's performance. Model-based design is best used coupled with sensitivity analysis, to gauge the influence of variables, or coupled with optimization, to balance conflicting design objectives; thus both sensitivity analysis and model-based optimization inform design decisions. In our research, applying sensitivity methods directly to the solutions obtained from a model-based optimization process helps data miners find the set of optimum solutions and better understand the influence of variables around the optimum solution(s), especially when there is uncertainty in the choice of design solutions.

5.3 Multi objective optimization

A sub-feature model may not classify every database exactly into the correct class, but the same problem can be handled well using statistical data. This target is not achieved by minimizing the single objective as in Eq. (3): maximizing the associated sub-feature data minimizes the misclassification. [For example, the subclasses ALL, AML, CLL, CML, etc. fall under the classes leukemia, myeloma and lymphoma. Some feature data are common to all subclasses, but a few feature data (the sub-features) lead to a subclass, which helps the physician toward a better diagnosis.] When associated sub-feature data are considered, the complexity of the model needs to be controlled. Another common objective that often needs to be taken into account is the comprehensibility or interpretability of the sub-feature model, which is particularly important when optimality is used for knowledge discovery from a dataset. The interpretability depends strongly on the complexity of the model: the lower the complexity, the easier the model is to understand. Thus a second objective reflecting the complexity of the model must also be considered. To control the complexity, the two objectives can be aggregated into a scalar objective function.

Moreover, MOO, also known as multi-criteria or multi-attribute optimization, is the process of simultaneously optimizing two or more possibly conflicting objectives subject to certain constraints [52, 53]. MOO arises in any situation where optimal decisions are guided not by a single objective but by multiple, possibly conflicting, objectives. Keeping complexity and better classification (as discussed above) in mind, MOO is stated mathematically as follows:

$$\begin{aligned}&\text {Min F }( \text {x} )=\text { w}_{1} \text {f}_{1} ( \text {x} )+\text {w}_{2} \text {f}_{2} ( \text {x}), \\&\text {Subject to P}_\text {s} ( \text {x})\ge \text {a}, \nonumber \\&\text {Q}_\text {c} ( \text {x} )\le \text { b}, \nonumber \\&\text {w}_{1} +\text { w}_{2} ={1}, \nonumber \\&\text {x}^{(\mathrm {\ell } )}\le \text {x}_\text {s} \le \text {x}^{(\text {u})}, \nonumber \end{aligned}$$
(4)

where f\(_{1}(\mathrm{x}) = \mathrm{f}_\mathrm{nc}(\mathrm{x})\) and f\(_{2}(\mathrm{x}) = \mathrm{f}_\mathrm{c}\) (the complexity function for generating the new class), Q\(_\mathrm{c}\)(x) is a measurement of the complexity of generating the new class, and b is the threshold value for complexity. Here w\(_{1}\) and w\(_{2}\) are two weights that shape the optimal classification. With this formulation, the different data mining techniques are able to optimize over two objectives, although the objective function is still scalar.
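The weighted-sum scalarization of Eq. (4) can be sketched as follows. The two one-dimensional objectives below are hypothetical stand-ins for f\(_\mathrm{nc}\) (classification quality) and f\(_\mathrm{c}\) (model complexity), minimized over a simple candidate grid:

```python
def weighted_sum(f1, f2, w1, grid):
    """Scalarize two objectives as in Eq. (4): F = w1*f1 + w2*f2
    with w1 + w2 = 1, minimized over a grid of candidate solutions."""
    w2 = 1.0 - w1
    return min(grid, key=lambda x: w1 * f1(x) + w2 * f2(x))

# Hypothetical objectives: accuracy loss versus model complexity.
f1 = lambda x: (x - 2.0) ** 2       # stand-in for f_nc
f2 = lambda x: x                    # stand-in for f_c: grows with x
grid = [i * 0.01 for i in range(301)]

for w1 in (0.2, 0.5, 0.9):
    print(w1, round(weighted_sum(f1, f2, w1, grid), 2))
```

Sweeping w\(_{1}\) traces out different trade-off points: larger weight on f1 moves the minimizer toward the accuracy optimum, while larger weight on f2 favors low complexity. Each weight choice yields one point of the trade-off, which is exactly the limitation the Pareto-based formulation below removes.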

5.3.1 Pareto-based multi-objective optimization

Using a Pareto approach to solve data mining problems, specifically for addressing multiple objectives, is a natural idea, but this approach has not previously been used for the sub-feature selection that leads to classification. As noted in [10], traditional learning algorithms and most traditional optimization algorithms are incapable of handling multi-objective problems with a Pareto-based approach. Since the objective of this work is to select sub-features for classification, we formulate the problem as MOO based on POSs. In a Pareto-based approach to MOO, the objective function is no longer a scalar value but a vector; consequently, a number of POSs are obtained instead of one single solution.

Generally, the MOO problem with m objectives to be minimized is described as

$$\begin{aligned} \text {Minimize}\left\{ \text {f}_{1}( \text {x}),\,\text { f}_{2} ( \text {x}),\,\ldots ,\text { f}_\text {m} ( \text {x})\right\} , \end{aligned}$$

subject to the decision vector x = (x\(_{1},\,\mathrm{x}_{2},\ldots ,\mathrm{x}_\mathrm{n})^\mathrm{T}\) belonging to the feasible region S. The objective vector of x is then

$$\begin{aligned} \text {f}( \text {x} )\text { }=\text { }\left( \text {f}_{1} ( \text {x}),\,\text { f}_{2} ( \text {x}),\ldots \text { },\text { f}_\text {m} ( \text {x} )\right) . \end{aligned}$$

Pareto-optimality is the central concept in Pareto-based MOO. Consider the following multi-objective minimization problem:

$$\begin{aligned} \text {Min F}(\text {X})=\left\{ \text {f}_{1} (\text {X}),\,\text {f}_{2} (\text {X}),\text { }\ldots ,\text {f}_\text {m}(\text {X})\right\} . \end{aligned}$$
(5)

A solution X is said to dominate a solution Y if \(\forall \) j = 1, 2,...,m, f\(_\mathrm{j}(X) \le \mathrm{f}_\mathrm{j}\)(Y), and there exists k \(\in \) {1, 2,...,m} such that f\(_\mathrm{k}(X) < \mathrm{f}_\mathrm{k}\)(Y). A solution X is called Pareto-optimal if no other feasible solution dominates it. If the objectives conflict with each other, more than one POS exists. The curve or surface composed of the POSs is known as the Pareto front [10]. In practice, we often do not know where the global Pareto front of a real-world optimization problem lies; therefore, the non-dominated solutions achieved by different multi-objective algorithms are not necessarily Pareto-optimal. Nevertheless, the non-dominated solutions achieved by MOO algorithms are loosely called POSs. Pareto-based multi-objective tasks follow this Pareto-based MOO approach to handle data mining problems.
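The dominance test and the extraction of non-dominated solutions described above can be sketched directly (minimization convention, as in Eq. (5)); the objective vectors below are illustrative:

```python
def dominates(x, y):
    """x dominates y (minimization): no worse in every objective
    and strictly better in at least one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

def pareto_front(points):
    """Keep only the mutually non-dominated objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Bi-objective example: (f1, f2) pairs to be minimized.
pts = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(pareto_front(pts))
```

Here (3, 4) is dominated by (2, 3) and (5, 5) is dominated by every other point, so the remaining three vectors form the non-dominated front. This brute-force filter is quadratic in the number of points; the fast non-dominated sort of [18] reduces the comparison cost for larger populations.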

For example, the scalarized bi-objective optimized problem in Eq. (4) can be formulated as a Pareto-based MOO as follows:

$$\begin{aligned} \text {min}\left\{ \text {f}_{1},\,\text { f}_{2}\right\} , \end{aligned}$$
(6)

where f\(_{1}(x) = \mathrm{f}_\mathrm{nc}\)(x) and f\(_{2}(x) = \mathrm{f}_\mathrm{c}\) (complexity function for generating new class).

Comparing the scalarized MOO of Eq. (4) with the Pareto-based MOO of Eq. (6), we find that the weighting function no longer needs to be specified in the Pareto-based MOO. On one hand, this spares the user the burden of determining and evaluating that function before optimization; on the other hand, the user must pick one or more solutions from the achieved POSs according to the user's preference after optimization. As we will show in the next sections, Pareto-based MOO is able to achieve a number of POSs, from which the user can extract knowledge about the problem and make a better decision when choosing the final solution.

5.3.2 Pareto multi-objective optimization in cluster computing

Clustering is the partitioning of data into subgroups and is one of the fundamental tasks in unsupervised classification. Determining the appropriate number of clusters in a given data set is an important consideration in clustering. Many different formulations of the clustering problem exist, the best known of which are based on minimizing intra-cluster variance [41]. But no single existing clustering criterion can capture all of the aspects perceived by humans as the properties of a good clustering, such as the compactness of clusters, the spatial separation between them, and compliance with local density distributions [40]. If the clustering criterion employed is inappropriate, the clustering algorithm fails to solve the problem. In this situation, ensemble techniques are needed to integrate the results of a variety of different clustering methods [42, 43]. An alternative to the a posteriori integration of different clustering results is the direct optimization of a partitioning with respect to a number of complementary clustering criteria. Multi-objective approaches to clustering can indeed yield improved, robust performance across data exhibiting a range of different properties and may be superior to some a posteriori integration approaches [44]. Researchers [44] have also shown that good clustering solutions tend to give rise to distinct "knees" in the Pareto front and may be identified automatically through comparison to random control data. Feature selection is likewise part of clustering: algorithms for feature selection can be used as preprocessing for the subsequent application of any clustering method. One particular problem is the comparison of feature subspaces of different cardinality, as existing measures are usually biased toward small or large feature subspaces [48]. MOO has been introduced as a potential solution to this problem, as it allows one to optimize one of these objectives and to counterbalance its bias through the simultaneous minimization or maximization of the feature cardinality [49–51].

Many researchers have proposed MOO techniques for clustering in situations where the clustering criterion is biased with respect to the number of clusters [45], or where multiple sources of data, in the form of multiple dissimilarity matrices, must be integrated into a single clustering [46, 47]. Such data may be tackled through (1) an a priori fusion of the data followed by a standard clustering algorithm, (2) ensemble techniques for the a posteriori fusion of the different partitionings obtained, or (3) the selection of a primary clustering objective with all others defined as constraints in a constrained optimization problem. Some work [46, 47], however, argues that a MOO approach may provide more information and choice to a DM.

Most existing clustering techniques are based on a single criterion reflecting a single measure of the goodness of a partitioning. However, a single cluster-quality measure is seldom equally applicable to data sets with different characteristics. Hence, it may become necessary to simultaneously optimize several cluster-quality measures that capture different data characteristics. To achieve this, the problem of clustering a data set is posed as one of MOO, and the application of sophisticated meta-heuristic MOO techniques therefore seems appropriate and natural.

Since the clustering problem is posed as MOO, it gives rise to a set of optimal solutions (the POSs) instead of a single optimal solution. We therefore consider the POSs for the MOO problem; POSs are used when conflicting objectives are present. The optimization model is used to find the POSs and the Pareto-optimal set, and when the model is tested over a substantial number of data sets, the POSs are obtained.

5.4 Pareto dominance

The concept of Pareto dominance (the Pareto optimum) has been used extensively to establish superiority between solutions in MOO. Under Pareto dominance, a solution x is considered better than a solution \(\mathbf{x}^\mathbf{*}\) if and only if the objective vector of x dominates the objective vector of \(\mathbf{x}^\mathbf{*}.\) Let S be the set of all solutions. Based on this, the different types of dominance are defined as follows.

5.4.1 Pareto dominance

A solution x \(\in \) S dominates a solution \(\mathbf{x}^\mathbf{*} \in \mathrm{S}\,(\mathbf{x} \succcurlyeq \mathbf{x}^\mathbf{*})\) if and only if x is not worse than \(\mathbf{x}^\mathbf{*}\) in all objectives (for minimization, f\(_\mathrm{i}\,(\mathbf{x}) \le \mathrm{f}_\mathrm{i}\,(\mathbf{x}^\mathbf{*})\,\forall \mathrm{i} = 1,\ldots ,\mathrm{m})\) and x is strictly better than \(\mathbf{x}^\mathbf{*}\) in at least one objective (f\(_\mathrm{i}\,(\mathbf{x}) < \mathrm{f}_\mathrm{i}\,(\mathbf{x}^\mathbf{*})\) for at least one i \(=\) 1,\(\ldots \),m) [38].

From the definition of Pareto dominance, it is evident that there are weak and strong forms; hence we need to distinguish between weak dominance and strong dominance [10, 12].

5.4.2 Weak dominance

This is often simply referred to as Pareto dominance. A solution x weakly dominates a solution \(\mathbf{x}^\mathbf{*}\,(\mathbf{x} \succcurlyeq \mathbf{x}^\mathbf{*})\) if x is better than \(\mathbf{x}^\mathbf{*}\) in at least one objective and is as good as \(\mathbf{x}^{*}\) in all other objectives.

5.4.3 Strong dominance

A solution x strongly dominates a solution \(\mathbf{x}^\mathbf{*}\,(\mathbf{x }\succ \mathbf{x}^\mathbf{*})\) if x is strictly better than \(\mathbf{x}^\mathbf{*}\) in all objectives.

5.4.4 Non-dominance

If neither x dominates \(\mathbf{x}^\mathbf{*}\) nor \(\mathbf{x}^\mathbf{*}\) dominates x (weakly or strongly), then both solutions are said to be incomparable or mutually non-dominated. In this case, no solution is clearly preferred over the other.

The Pareto-optimal front is the set F of all non-dominated solutions x in the whole solution (search) space. Hence, a solution \(\mathbf{x} \in \mathrm{F}\) if there is no solution \(\mathbf{x}^\mathbf{*}\in \mathrm{S}\) that dominates x, i.e., if x is non-dominated with respect to S. A set of non-dominated solutions that approximates the Pareto-optimal front is usually called the current (or known) Pareto front.

We apply Pareto optimality with the dominance criteria in our optimization model. The solutions x and x\(^{\star }\) are defined over the associated sub-feature data for classification. As the associated sub-features increase for generating novel classes, the sub-features play an important role in the association, because different associated sub-features generate different DCs. For clarity, let x be the solution for one associated sub-feature of class set S and x\(^{\star }\) the solution for another associated sub-feature of class set S\(^{\star }\). If \(\mathrm{x} \le \mathrm{x}^{\star }\) and \(\mathrm{S} \subseteq \mathrm{S}^{\star },\) then x\(^{\star }\) is the better solution for S\(^{\star }\) and x\(^{\star }\) dominates x, i.e., the second associated sub-feature yields the better solution for the novel class.

5.4.5 Fuzzy Pareto dominance

In MOO, the optimization goal is specified by more than one objective to be optimized. Formally, given a domain as a subset of R\(^\mathrm{n},\) there are m assigned functions f\(_{ 1}(\mathrm{x}_{1},\ldots ,\mathrm{x}_\mathrm{n}),\ldots ,\mathrm{f}_\mathrm{m}(\mathrm{x}_{ 1},\ldots ,\mathrm{x}_\mathrm{n}).\) Usually, there is not a single optimum but rather a so-called Pareto set of non-dominated solutions.

In this section, we study the fuzzification of the Pareto dominance relation. The objective of this study is a practically usable numerical representation of the dominance relation between two vectors that can be employed in MOO. The issue is studied in more detail in [34], which showed the principal problems related to the specification of such a degree of dominance. Fuzzy dominance degrees can be computed once the following two conditions are taken into account.

  1. The measure is not symmetric. For two vectors 'a' and 'b', the two measures (a) 'a' dominates 'b' by degree \(\upalpha \) and (b) 'a' is dominated by 'b' to degree \(\upalpha \) have to be distinguished. Moreover, if 'a' dominates 'b', either one measure is numerically 0 and the other lower-or-equal to 1, or one is greater-or-equal to 0 and the other 1 [39].

  2. The dominance degrees are set-dependent and can never be assigned in an absolute manner to single vectors alone [39].

We consider vectors with lower numerical ranking values to be in a higher ranking position. The max or the min operator can be used as well, depending on whether the ranking is to be favored in increasing or decreasing order. When using the bounded-division comparison function and the algebraic (or product) norm, the ranking scheme fulfills several useful properties such as scale independence in the data. The fuzzification of the Pareto dominance relation can be written as follows: vector 'a' is said to dominate vector 'b' by degree \(\upmu _\mathrm{a}\) with

$$\begin{aligned} \upmu _\text {a} (\text {a},\,\text {b})=\frac{\mathop \prod \nolimits _i \text {min}(a_i,\,b_i )}{\mathop \prod \nolimits _i a_i }, \end{aligned}$$
(7)

and that vector a is dominated by vector b at degree \(\upmu _\mathrm{b}\) with

$$\begin{aligned} \upmu _\text {b} (\text {a},\,\text {b})=\frac{\mathop \prod \nolimits _i \text {min}(a_i,\,b_i )}{\mathop \prod \nolimits _i b_i}. \end{aligned}$$
(8)

For a vector a Pareto-dominating b, \(\upmu _\mathrm{a}\)(a, b) = 1 and \(\upmu _\mathrm{b}\)(b, a) = 1, but \(\upmu _\mathrm{b}\)(a, b) < 1 and \(\upmu _\mathrm{a}\)(b, a) < 1. Note that the case of an a\(_\mathrm{i}\) or b\(_\mathrm{i}\) equal to 0 is handled by excluding the corresponding index from the products in the numerator and denominator.

We may use the dominance degrees of Eq. (8) to rank the set M of multivariate data (vectors) given by the fitness values of an MOO problem. Each element of M is assigned the maximum degree of being dominated by any other element of M, and the elements of M are sorted according to the ranking values in increasing order:

$$\begin{aligned} \text {R}_\text {M} ( \text {a})=\mathop {\max }\limits _{b\in M{\setminus }\{a\}} \mu _b (a,\,b). \end{aligned}$$

Note again that this definition is related to a set. A ranking value of a within M can only be assigned with reference to a set M containing a. By sorting the elements of M according to the ranking values in increasing order (FPD ranking, FPD for fuzzy Pareto dominance), we obtain a partial ranking of the elements of M. From the definition of the ranking scheme, it can be seen that an individual has two ways to reduce its comparison values: by increasing the objectives (thus increasing the denominator in the comparison values), or/and by being larger in some components than other vectors, i.e., being diverse from other vectors. Thus, both goals of MOO can be met by using such a measure: to approach the Pareto front, and to maintain a diverse set of solutions.
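Equations (7)–(8) and the ranking R\(_\text {M}\) can be sketched in a few lines. The following is an illustrative Python reading of the FPD scheme only (it assumes non-negative components and excludes zero components from both products, as noted above; all names are ours):

```python
from math import prod

def fpd_dominated_degree(a, b):
    """Degree to which vector a is dominated by vector b (Eq. 8);
    indices where a_i or b_i is zero are excluded from both products."""
    idx = [i for i in range(len(a)) if a[i] != 0 and b[i] != 0]
    num = prod(min(a[i], b[i]) for i in idx)
    den = prod(b[i] for i in idx)
    return num / den

def fpd_rank(M):
    """FPD ranking: each vector gets the maximum degree of being dominated
    by any other element of M; lower values rank higher (better)."""
    ranks = {a: max(fpd_dominated_degree(a, b) for b in M if b is not a)
             for a in M}
    return sorted(M, key=lambda a: ranks[a])
```

For M = [(1, 2), (2, 1), (2, 2)], the vector (2, 2) is fully dominated (degree 1) and is ranked last, while the two incomparable vectors share the same, smaller degree.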

6 Fuzzy sets on sub-feature

Let U be the universe of discourse, with the generic element of U denoted by u. A fuzzy subset F of U is characterized by a membership function \(\upmu _\text {F}{\text {:}}\,\text {U}\rightarrow [ 0,\,1],\) which associates with each element u of U a number \(\upmu _\text {F}( \text {u})\) representing the grade of membership of u in F. F is denoted as \(\{ {( {\text {u},\,\text { }\upmu _\text {F} ( \text {u})} )|\text {u}\in {U}}\}.\) Mathematically, F is defined in two ways as follows,

$$\begin{aligned} \begin{array}{l} ( \text {a})\,{\text {F}}=\smallint \upmu _{{\text {F}}}({\text {u}})/{\text {u}}\quad {\text {when}}\,{\text {u}}\in {\text {U}}\,{\text {is}}\,{\text {a}}\,{\text {continuum}}, \end{array} \end{aligned}$$
(9)
$$\begin{aligned}&( \text {b} )\,\text {F}=\upmu _\text {F} \left( {\text {u}_{1} } \right) /\text {u}_{1} +\cdots +\upmu _\text {F} \left( {\text {u}_\text {n} } \right) /\text {u}_\text {n} \nonumber \\&\quad =\mathop \sum \limits _{i=1}^n \mu _F \left( u_i \right) /u_i\nonumber \\&\qquad \text {when U is a finite or countable set of n elements}. \end{aligned}$$
(10)

Definition (fuzzy support) The fuzzy support of F is the set of points in U at which \(\upmu _\text {F} ( \text {u} )\) is positive.

Definition (fuzzy length) The fuzzy length of F is the least upper bound of \(\upmu _{F}\)(u) over U.

$$\begin{aligned} \text {Lnth}( \text {F} )=\mathop {lub}\limits _{u\in U} \mu _F (u). \end{aligned}$$
(11)

Definition (fuzzy normal) A fuzzy set F is said to be normal if its height (its fuzzy length, as defined above) is unity, that is, if

$$\begin{aligned} \mathop {lub}\limits _{u\in U} \mu _F (u)={1}. \end{aligned}$$
(12)

Definition (fuzzy subset) Let A and B be two fuzzy subsets of U. Then \(\text {A}\subseteq \text {B}\) iff \(\upmu _\text {A} ( \text {u} )\le \upmu _\text {B}( \text {u})\) for all u, i.e.,

$$\begin{aligned} \text {A}\subseteq \text {B}\leftrightarrow \upmu _\text {A} ( \text {u})\le \text { }\upmu _\text {B} ( \text {u}). \end{aligned}$$
(13)

The above fuzzy notions are defined on the sub-feature data set, and they generate fuzzy sub-feature data from the feature space.

6.1 Operations on fuzzy sub-feature sets

We define the max and min operations on fuzzy sub-FSs based on the two symbols # and &. Let A and B be two sub-FSs; then

$$\begin{aligned} \text {A}\# \text {B}=\text {max}( {\text {A},\,\text { B}} )\text { }=\left\{ {{\begin{array}{ll} A &{}\quad if \,{A\supseteq B}, \\ B &{}\quad if \,{A\subset B}, \\ \end{array} }} \right. \end{aligned}$$
(14)

and

$$ \begin{aligned} \text {A} \& \text {B}=\text {min}( {\text {A},\,\text {B}} )=\left\{ {{ \begin{array}{ll} A &{}\quad if\, A\subseteq B, \\ B &{}\quad if \, A\supset B. \\ \end{array}}} \right. \end{aligned}$$
(15)

Thus, consistent with this notation, the symbol # generates the maximum sub-FS and & generates the minimum sub-FS from the feature space for classification. Based on the membership functions of the fuzzy sets, the union and intersection of two sub-FSs A and B are defined as follows.

Definition (union of fuzzy sets) The union of fuzzy sets A and B is denoted by \(\text {A}+\text {B}\,(or\,A\cup B)\) and is defined by

$$\begin{aligned} \text {A}+\text {B}=\int \left[ \mu _A(u)\#\mu _B (u)\right] /u,\quad \text {u} \in \text {U}. \end{aligned}$$
(16)

The union corresponds to the connective OR on the two fuzzy sets A and B.

Definition (intersection of fuzzy sets) The intersection of fuzzy sets A and B is denoted by \(A \cap B\) and is defined by

$$ \begin{aligned} A\cap B = \smallint \left[ \mu _A(u) \& \mu _B(u)\right] /u. \end{aligned}$$
(17)

The intersection corresponds to the connective AND on the two fuzzy sets A and B.
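Read pointwise over membership grades, the # (max) and & (min) operations of Eqs. (14)–(17) reduce to elementwise max and min. A minimal sketch, assuming finite universes represented as membership dictionaries (the representation and names are ours, not from the paper):

```python
def fuzzy_union(mu_a, mu_b):
    """Pointwise union (#, i.e., max) of two membership dicts over U (Eq. 16)."""
    U = set(mu_a) | set(mu_b)
    return {u: max(mu_a.get(u, 0.0), mu_b.get(u, 0.0)) for u in U}

def fuzzy_intersection(mu_a, mu_b):
    """Pointwise intersection (&, i.e., min) of two membership dicts (Eq. 17)."""
    U = set(mu_a) | set(mu_b)
    return {u: min(mu_a.get(u, 0.0), mu_b.get(u, 0.0)) for u in U}

A = {'a': 0.3, 'b': 0.7}
B = {'a': 0.8, 'b': 0.4}
# union -> {'a': 0.8, 'b': 0.7}; intersection -> {'a': 0.3, 'b': 0.4}
```

Elements absent from one set are treated as having grade 0, so the union keeps them and the intersection drops them to 0, matching the OR/AND reading above.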

Now we consider the fuzzy relation in the optimization model. The fuzzy sets generate the dominance relation on two vectors A and B, called the Pareto dominance relation. Thus the Pareto dominance relation is defined on two fuzzy sets A and B based on the # symbol as follows. The fuzzy set A dominates B by degree \(\upmu _\mathrm{A}\)(A, B) as

$$\begin{aligned} \begin{array}{l} \upmu _\text {A} ( {\text {A},\,\text {B}})\text { }=\text { }({\text {A}\# \text {B}})/\text {U}_\text {A}, \\ \end{array} \end{aligned}$$
(18)

where \(\text {A}\# \text {B}\) is defined as in Eq. (14), and fuzzy set B dominates A by degree \(\upmu _\text {B}({\text {A},\,\text { B}})\) as

$$\begin{aligned} \upmu _\text {B} ( {\text {A},\,\text {B}})\text { }=\text { }( {\text {A}\# \text {B}})/\text {U}_\text {B}, \end{aligned}$$
(19)

where U\(_\mathrm{A }\)and U\(_\mathrm{B}\) are the two universes of discourse of A and B. Thus the Pareto-dominating set based on the fuzzy sets A and B generates the POSs.

For the union and intersection of the two fuzzy sets A and B, we consider the combination of sub-features from each feature. The combination of sub-features for the two fuzzy sets A and B generates the distinguished classification.

6.2 Fuzzy convex optimization

We consider the combination of fuzzy sets that generates DC from existing class based on convex combination of fuzzy subsets.

Definition (convex combination) If \(\text {A}_{1},\ldots ,\text {A}_\text {n}\) are fuzzy subsets of \(\text {U}_{1},\text {}\ldots ,\text {U}_\text {n}\) (not necessarily distinct) and \(\text {w}_{1},\text { }\ldots ,\text {w}_\text {n}\) are nonnegative weights such that \(\mathop \sum \nolimits _{i=1}^n w_i ={1},\) then the convex combination of \(\text {A}_{1} ,\text { }\ldots ,\text {A}_\text {n}\) is a fuzzy set A whose membership function is defined by

$$\begin{aligned} \mu _A =\text {w}_{1} \mu _{A_1}+\cdots +\text {w}_\text {n} \mu _{A_n}, \end{aligned}$$
(20)

where + denotes the arithmetic sum. The concept of a convex combination is useful in the representation of linguistic hedges such as essentially, typically, etc., which modify the weights associated with the components of a fuzzy set. To be clear, let U\(_{1}\) and U\(_{2}\) be two universes of discourse defined over the availability of fuzzy sub-FSs and the computational complexity of the corresponding sub-FSs as

$$\begin{aligned} \text {U}_{1} =\left\{ {\text {u}_{1}{\text {:}}\,0 \le \text {u}_{1} \le \text {g}_{1}}\right\} \quad \text {and}\quad \text {U}_{2} =\left\{ {\text {u}_{2}{\text {:}}\,0 \le \text {u}_{2} \le \text {g}_{2} }\right\}. \nonumber \\ \end{aligned}$$
(21)

Here g\(_{1}\) and g\(_{2}\) are the two upper bounds of the fuzzy sets. The membership function for the availability of a sub-FS is defined as

$$\begin{aligned} \text {FS}=\mathop \int \nolimits _{x_1 }^{x_2 } \left[ 1+\left( \frac{u_1 -b_1 }{c_1 }\right) ^{-2}\right] ^{-1}/u_1, \end{aligned}$$
(22)

where x\(_{1}\) and x\(_{2}\) are the range of the sub-feature data, u\(_{1}\) varies within the sub-feature range, and b\(_{1}\) and c\(_{1}\) are fixed parameters of the membership function. Further, the membership function for the computational cost (FC) is defined as

$$\begin{aligned} \text {FC}=\mathop \int \nolimits _{y_1 }^{y_2 } \left[ 1+\left( \frac{u_2 -b_2 }{c_2 }\right) ^{-2}\right] ^{-1}/u_2, \end{aligned}$$
(23)

where y\(_{1}\) and y\(_{2}\) are the range of the computational cost, u\(_{2}\) varies within the computational range, and b\(_{2}\) and c\(_{2}\) are fixed parameters of the membership function. The crossover point is defined on the fuzzy bounded set as

$$\begin{aligned} \text {X}=\text { }\left( \text {x}_{1} +\text {x}_{2} \right) \text { }/2\,\text {for fuzzy sub-feature}, \end{aligned}$$

and

$$\begin{aligned} \text {Y}=\text { }\left( \text {y}_{1}+\text {y}_{2} \right) /2\,\text {for computational cost}. \end{aligned}$$

We consider the crossover point on membership function as

$$\begin{aligned} \mu _{FS} ( X)=0.5\quad \text {and}\quad \mu _{FC} ( Y )=0.5. \end{aligned}$$

The fuzzy set labeled for DC may be defined as a convex combination of fuzzy sets FS and FC as

$$\begin{aligned} \text {DC}=\text {w}_{1}\,*\,\text {FS}+\text {w}_{2}\, *\,\text {FC}, \end{aligned}$$
(24)
$$\begin{aligned} {\mu _{DC}}\left( u_1,\,u_2\right)= & {} {\smallint _{x_{1}}^{x_{2}}} {\smallint _{y_{1}}^{y_{2}}}\big [w_{1}\,*\,\mu _{FS}\left( u_1\right) \nonumber \\&\quad +w_2\, *\,\mu _{FC}\left( u_2\right) \big ]/\left( u_1,\,u_2\right) , \end{aligned}$$
(25)

where \(\mu _{DC} (u_1,\,u_2 )\) is defined over u\(_{1}\) and u\(_{2}\) based on the associated sub-FS and the complexity of the classification computation. Based on the convex combination of fuzzy sets for the optimization model, the DC is defined by the following optimization model.

$$\begin{aligned}&\text {Max}\,\upmu _{\text {DC}} =\text {w}_{1}\,*\,\upmu _{\text {FS}} +\text {w}_{2}\,*\,\text { }\upmu _{\text {FC}} \\&\text {Subject to}\,\upmu _{\text {FS}} \ge \text { g}, \nonumber \\&\upmu _{\text {FC}} \le \text {h},\nonumber \\&\text {w}_{1} +\text {w}_{2} ={1}. \nonumber \end{aligned}$$
(26)

For a convex combination, \(\text {w}_{1} + \text {w}_{2} = 1\) is mandatory, i.e., \(\text {w}_{2} = 1 - \text {w}_{1}.\) Notice that variations in the sub-feature have a stronger influence on the values of the membership function of the fuzzy set DC than variations in the computational cost. This is due to the greater importance of the sub-feature component in determining the fuzzy set DC using the convex combination of sub-feature and computational cost.
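As a numerical illustration of Eqs. (22)–(26), the sketch below evaluates the Zadeh-style membership term \([1 + ((u-b)/c)^{-2}]^{-1}\) and its convex combination with w\(_{2}\) = 1 − w\(_{1}\). All parameter values here are hypothetical, and note the term is undefined at u = b:

```python
def mu(u, b, c):
    """Membership [1 + ((u - b)/c)**(-2)]**(-1), as in Eqs. (22)-(23).
    Undefined at u == b; tends to 1 as u moves far from b."""
    return 1.0 / (1.0 + ((u - b) / c) ** -2)

def mu_dc(u1, u2, w1, b1, c1, b2, c2):
    """Convex combination w1*mu_FS + w2*mu_FC with w2 = 1 - w1 (Eq. 24)."""
    w2 = 1.0 - w1
    return w1 * mu(u1, b1, c1) + w2 * mu(u2, b2, c2)

# Crossover behavior: at u = b + c the membership equals 0.5, matching
# the crossover points X and Y with mu(X) = mu(Y) = 0.5 in the text.
half = mu(6, 5, 1)   # b=5, c=1, u=b+c=6 -> 0.5
```

Since each membership lies in [0, 1] and the weights sum to 1, \(\mu _{DC}\) is itself a valid membership grade in [0, 1].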

Definition (k-level set) If A is a fuzzy subset of U, then a k-level set of A is a non-fuzzy set, denoted by A\(_\mathrm{k}\), which comprises all elements of U whose grade of membership in A is greater than or equal to k.

Mathematically, it is defined as

$$\begin{aligned} \text {A}_\text {k} =\text { }\left\{ {\text {u}{\text {:}}\,\text { }\upmu _\text {k} ( \text {u})\ge \text {k}} \right\} . \end{aligned}$$
(27)

A fuzzy set A may be decomposed into its level sets through the resolution identity

$$\begin{aligned} \text {A}=\mathop \int \nolimits _0^1 kA_k, \end{aligned}$$

or

$$\begin{aligned} \text {A}=\mathop \sum \limits _k kA_k, \end{aligned}$$

where \(\text {kA}_\text {k} \) is the product of the scalar k with the set \(\text {A}_\text {k},\) and \(\mathop \smallint \limits _0^1 kA_k \) (or \(\mathop \sum \nolimits _k kA_k )\) is the union of the A\(_{k}\) sets with k ranging from 0 to 1. We can generate the substitution expression by the union of constituent fuzzy singletons \((\mu _i/u_i).\) For \(\text {u}_\text {i} =\text {u}_\text {j},\) the substitution is expressed by

$$\begin{aligned} \frac{\mu _i }{u_i }+\frac{\mu _j }{u_j }=\left( \mu _i +\mu _j \right) /u_i. \end{aligned}$$

For example,

$$\begin{aligned} \mathrm{A} = .3/\mathrm{a} + .8/\mathrm{a} + .7/\mathrm{b} +.4/\mathrm{b}. \end{aligned}$$

Thus A may be rewritten as

$$\begin{aligned} \mathrm{A}= & {} (.3\#.8)/\mathrm{a} + (.7\#.4)/\mathrm{b}\\= & {} .8/\mathrm{a} + .7/\mathrm{b}. \end{aligned}$$

Further,

$$\begin{aligned} \frac{\mu _i }{u_i }=\frac{(\# \mu _j )}{u_i }, \quad \upmu _\text {j}\in \left[ {\text {t},\,\upmu _\text {i} } \right] ,\quad 0 \le \text {t}\le \upmu _\text {i}. \end{aligned}$$

For example

$$\begin{aligned} .4/\mathrm{a} = (.1\#.2\#.3\#.4)/\mathrm{a},\quad \mathrm{t} = .1. \end{aligned}$$

Thus the resolution identity may be viewed as the result of combining the terms which fall into the same level sets.
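The machinery above can be illustrated directly: fuzzy singletons over the same element combine by # (max), and A\(_\mathrm{k}\) collects the elements at grade k or above. A small Python sketch under these assumptions (names are illustrative):

```python
from collections import defaultdict

def combine_singletons(singletons):
    """Combine fuzzy singletons mu_i/u_i over the same element by # (max),
    e.g. .3/a + .8/a + .7/b + .4/b -> .8/a + .7/b."""
    out = defaultdict(float)
    for grade, u in singletons:
        out[u] = max(out[u], grade)
    return dict(out)

def level_set(mu, k):
    """k-level set A_k = {u : mu(u) >= k} (Eq. 27)."""
    return {u for u, g in mu.items() if g >= k}

A = combine_singletons([(0.3, 'a'), (0.8, 'a'), (0.7, 'b'), (0.4, 'b')])
# A == {'a': 0.8, 'b': 0.7}; level_set(A, 0.75) == {'a'}
```

This reproduces the worked example A = .3/a + .8/a + .7/b + .4/b = .8/a + .7/b, and shows how the level sets thin out as k rises toward 1.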

7 Optimization model based classification techniques

We describe the general approach to classification as a two-step process. In the first step, we build a classification model based on previous data. In the second step, we determine whether the model’s accuracy is acceptable, and, if so, we use the model to classify new data or to distinguish data from existing data. The model thus broadly generates two types of classes: (a) a safe class and (b) a risk class. The model needs to analyze existing data to learn which classes are “safe” and which are “risky”. For example, a medical researcher wants to analyze breast cancer data to predict which one of three specific treatments a patient should receive. The data analysis task is thus classification, where a model or classifier is constructed to predict class (categorical) labels, such as “safe” or “risky” (i.e., the patient is in a safe/risky state) based on the sub-feature data; “yes” or “no” for marketing data; or “treatment A,” “treatment B,” or “treatment C” for medical data.

In our model, a classifier is built describing a predetermined set of classes or concepts, where a classification algorithm builds the classifier by analyzing a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional feature vector, \(\text {X}=({\text {x}_{1},\,\text {x}_{2},\text { }\ldots \text { },\text {x}_\text {n}}),\) depicting n measurements made on the tuple from n database features, respectively, \(\text {A}_{1},\,\text {A}_{2} ,\text { }\ldots \text { },\text {A}_\text {n}.\) Each tuple, X, is assumed to belong to a predefined class as determined by another database feature called the class label feature.

The classification process can also be viewed as the learning of a mapping or function, y = f(X), that can predict the associated class label y of a given tuple X. In this view, we wish to learn a mapping or function that separates the data for different classes. Typically, this mapping is represented in the form of classification rules, decision trees, or mathematical formulae. The rules are used to categorize future data tuples, as well as provide deeper insight into the data contents. They also provide a compressed data representation.

The predictive accuracy of the classifier is estimated for classification. If we were to use the training set to measure the classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to overfit the data (i.e., during learning it may incorporate particular anomalies of the training data that are not present in the general data set). Therefore, a test set is used, made up of test tuples and their associated class labels. These are independent of the training tuples, meaning that they were not used to construct the classifier.

The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier. The associated class label of each test tuple is compared with the learned classifier’s class prediction for that tuple. If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known (in the predicted sub-feature data, such data are also referred to as “unknown” or “previously unseen” data). The accuracy is measured as

$$\begin{aligned} \text {Accuracy}=\frac{TP+TN}{P+N}, \end{aligned}$$
(28)

where TP is the number of true positives, TN the number of true negatives, P the number of positive tuples, and N the number of negative tuples. It reflects how well the classifier recognizes tuples of the various classes. For a detailed description of accuracy, the reader can refer to [35].
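Eq. (28) is a one-line computation once the confusion counts are available. The sketch below (names are ours) also includes a helper that derives the counts from label lists for a chosen positive class:

```python
def accuracy(tp, tn, p, n):
    """Eq. (28): fraction of the P positive and N negative test tuples
    that the classifier labels correctly."""
    return (tp + tn) / (p + n)

def accuracy_from_labels(y_true, y_pred, positive):
    """Derive TP, TN, P, N from true/predicted label lists, then apply Eq. (28)."""
    tp = sum(t == positive and q == positive for t, q in zip(y_true, y_pred))
    tn = sum(t != positive and q != positive for t, q in zip(y_true, y_pred))
    p = sum(t == positive for t in y_true)
    n = len(y_true) - p
    return accuracy(tp, tn, p, n)

# E.g., with "safe" as the positive class:
acc = accuracy_from_labels(['safe', 'risky', 'safe', 'risky'],
                           ['safe', 'safe', 'safe', 'risky'], 'safe')
```

In the example, one risky tuple is mislabeled safe, so three of the four test tuples are classified correctly.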

In order to classify the sub-feature data, we need to discuss the different kinds of classification techniques that have been used for our optimization model, such as naive Bayesian classification, classification and regression trees (CARTs), multilayer perceptron (MLP), etc. Here we explain only naive Bayesian classification for the proposed model; details of the other classification techniques are available in [35]. The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

  1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, \(\text {X}=\text { }(\text {x}_{1},\,\text {x}_{2} ,\text { }\ldots \text { },\text {x}_\text {n}),\) depicting n measurements made on the tuple from n attributes, respectively, \(\text {A}_{1},\,\text {A}_{2} ,\text { }\ldots ,\text {A}_\text {n}.\)

  2. Suppose that there are m classes, \(\text {C}_{1},\,\text {C}_{2} ,\text { }\ldots \text { },\text {C}_\text {m}.\) Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class C\(_{i}\) if and only if

    $$\begin{aligned} \text {P}\left( {\text {C}_\text {i} |\text {X}} \right) \text {}>\text {P}(\text {C}_\text {j}|\text {X})\quad \text {for}\quad 1\le \text {j}\le \text {m},\quad \text {j}\ne \text {i}. \end{aligned}$$

    Thus, we maximize P(C\(_\mathrm{i}\)| X). The class C\(_\mathrm{i}\) for which P(C\(_\mathrm{i}\)|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,

    $$\begin{aligned} \text {P}\left( {\text {C}_\text {i}|\text {X}} \right) =\frac{\text {P}(\mathrm {X|C}_\text {i})\text {P}(\text {C}_\text {i})}{P(X)}. \end{aligned}$$
    (29)
  3. As P(X) is constant for all classes, only \(\text {P}(\text {X}|\text {C}_\text {i})\text {P}( {\text {C}_\text {i} })\) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, \(\text {P}(\text {C}_{1} )\text { }=\text {P}(\text {C}_{2} )\text { }=\text { }\cdots =\text {P}(\text {C}_\text {m}),\) and we would therefore maximize \(\text {P}(\text {X}|\text {C}_\text {i}).\) Otherwise, we maximize \(\text {P}(\text {X}|\text {C}_\text {i})\text {P}( {\text {C}_\text {i} }).\) Note that the class prior probabilities may be estimated by \(\text {P}( {\text {C}_\text {i} } )\text { }=\text { }| {\text {C}_\text {i} ,_\text {D} |/\text { }}|\text {D}|,\) where \(|\text {C}_\text {i} ,_\text {D} |\) is the number of training tuples of class C\(_\mathrm{i}\) in D.

  4. Given data sets with many attributes, it would be extremely computationally expensive to compute \(\text {P}({\text {X}|\text {C}_\text {i} }).\) To reduce computation in evaluating \(\text {P}( {\text {X}|\text {C}_\text {i} } ),\) the naive assumption of class-conditional independence is made. This presumes that the attribute values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

    $$\begin{aligned} \begin{array}{l} \text {P}\left( {\text {X}|\text {C}_\text {i} } \right) =\mathop \prod \limits _{k=1}^n P\left( x_k |C_i\right) \\ =\text {P}\left( {\text {x}_{1} |\text {C}_\text {i} } \right) \times \text {P}\left( {\text {x}_{2}|\text {C}_\text {i} } \right) \times \cdots \times \text {P}\left( {\text {x}_\text {n} |\text {C}_\text {i} } \right) . \\ \end{array} \end{aligned}$$
    (30)

We can easily estimate the probabilities \(\text {P}( {\text {x}_{1} |\text {C}_\text {i} }),\text {P}( {\text {x}_{2} |\text {C}_\text {i} }),\text { }\ldots ,\text {P}( {\text {x}_\text {n} |\text {C}_\text {i} } )\) from the training tuples. Recall that here x\(_\mathrm{k}\) refers to the value of attribute A\(_\mathrm{k}\) for tuple X. For each attribute, we check whether the attribute is categorical or continuous-valued.
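For categorical attributes, steps (1)–(4) amount to counting relative frequencies. A toy Python sketch of such a naive Bayesian classifier follows (no smoothing is applied, so an unseen attribute value zeroes out the product; the weather-style data and all names are hypothetical, not from the paper):

```python
from collections import Counter, defaultdict

def train_nb(tuples, labels):
    """Estimate P(C_i) and P(x_k | C_i) by relative frequency counting."""
    prior = Counter(labels)                 # class -> count
    cond = defaultdict(Counter)             # (class, attr index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            cond[(c, k)][v] += 1
    return prior, cond, len(labels)

def predict_nb(x, prior, cond, total):
    """Pick the class maximizing P(C_i) * prod_k P(x_k | C_i) (Eq. 30)."""
    best, best_p = None, -1.0
    for c, nc in prior.items():
        p = nc / total
        for k, v in enumerate(x):
            p *= cond[(c, k)][v] / nc       # relative frequency of v in class c
        if p > best_p:
            best, best_p = c, p
    return best

X = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'hot')]
y = ['no', 'no', 'yes', 'yes']
prior, cond, total = train_nb(X, y)
label = predict_nb(('rain', 'mild'), prior, cond, total)
```

For the query tuple, P(rain | no) = 0 eliminates class 'no', so the classifier predicts 'yes'; a practical implementation would add Laplace smoothing to avoid such hard zeros.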

Different classification techniques are used with our optimization model. Since we consider an objective model based on a classification technique, each classification technique shows its own optimal performance for the concerned model, and the performance of a classification technique varies across the different optimization models. Different experiments have been conducted to compare the performance of the classification techniques in each optimization model and to obtain the best model for classification.

8 Experiments

In this section, the performance of the proposed method is evaluated with different classification techniques, data sets and parameters. The proposed method is primarily evaluated to find sub-feature data within a certain computational interval for predicting a specific or new class, using a set of public-domain datasets from the University of California at Irvine machine learning repository [36]. The different parts of this section describe the nature and characteristics of the data sets, the parameter setup for the experiments, and the performance of the model on sub-feature data.

8.1 Description of the data sets

For our experimental purpose, six datasets are considered. The following data sets are briefly introduced for our experimental setup.

8.1.1 Audiology

This data set contains records of patients having audiological problems. We consider the 24-class distribution data. Eleven features out of 69 and 200 instances are used for our experiments.

8.1.2 Arrhythmia

This data set covers cardiac arrhythmia patients and has 16 classes. Sixteen features out of 279 and 452 instances are taken into consideration for our experiments.

8.1.3 Dermatology

This database contains 34 attributes, 33 of which are linear-valued and 1 nominal. The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little difference. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. A further difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and develop its own characteristic features only at later stages. Patients were first evaluated clinically on 12 features; afterwards, skin samples were taken for the evaluation of 22 histopathological features, whose values were determined by an analysis of the samples under a microscope. In the dataset constructed for this domain, the family-history feature has the value 1 if any of these diseases has been observed in the family and 0 otherwise, while the age feature simply represents the age of the patient. The dataset contains 366 instances.

8.1.4 Yeast

This dataset concerns protein localization sites based on biological protein-related measures such as the amino acid content of vacuolar, mitochondrial and non-mitochondrial proteins, localization signals of nuclear and non-nuclear proteins, etc. We use nine attributes with 1484 instances and 10 classes from this data set.

8.1.5 Glass

This dataset is used for identification of different kinds of glass, i.e., broadly window glass versus non-window glass. It contains 11 features, including the class, and 214 instances used for our experiments.

8.1.6 Ecoli

This dataset is used for predicting protein localization sites. It contains 8 features, 8 classes and 336 instances used for the experiments.

Feature values that repeat across a large number of instances are not considered in our experiments, because such data are involved in all classes and thus do not discriminate among them.

Table 1 The available number of instances for classes in Audiology dataset

8.2 Environments

The experiments are implemented on a personal computer with an Intel Pentium IV 2.40 GHz CPU and 1.00 GB RAM, running the Microsoft Windows XP Professional version 2002 operating system with the Matlab 7.0.1 development environment. We also use the WEKA tool [37] for the different classification techniques. The data sets for the proposed model have been processed under a fuzzy environment for sub-feature selection.

8.3 Experimental results

8.3.1 Results based on single objective optimization

The above datasets are used to generate the required sub-feature data for our experiment. The sub-features are determined by the fuzzy frequencies described in [6]. Our objective is now to generate associations of sub-FSs from the feature space. On the basis of the sub-FSs, an increased number of associated sub-features can generate a DC from an existing class. How the association of sub-features takes place in our proposed model is explained as follows.

Initially, the sub-features are generated from each feature as discussed in [6]. As per the algorithm, we have considered a threshold value with the objective of limiting the number of sub-features. If the number of associated sub-features increases, the number of leading classes decreases, and vice versa. Under this condition we have assumed the threshold value \(\uptheta = 4\). This threshold value is specific to our databases; it may vary for different databases. Sub-feature sets above the threshold value are discarded from the feature space; in other words, the sub-FSs below the threshold are taken into consideration for our purpose. After removing all irrelevant sub-features from the feature space, the existing classes in the database need to be sorted with the objective of distinguishing classes. At this stage, some sub-features may or may not remain in the corresponding features after sorting the classes. If no sub-feature remains in a feature, the corresponding feature is removed from the feature space. It is effective if the number of associated sub-features increases; hence the maximum number of associated sub-features is equal to the total number of features in the database. For example, let the total number of features in a database be 10 (A\(_{1},\ldots ,\mathrm{A}_{10}).\) The associated sub-features are generated as

$$\begin{aligned} \{\mathrm{a},\,\mathrm{b}\}\,\mathrm{for\,only\,two\,features},\\ \{\mathrm{a},\,\mathrm{b},\,\mathrm{c}\}\,\mathrm{for\,only\, three\,features},\\ \{\mathrm{a},\,\mathrm{b},\,\mathrm{c},\,\mathrm{d}\}\,\mathrm{for\, only\,four\,features},\\ \{\mathrm{a},\,\mathrm{b},\,\mathrm{c},\,\mathrm{d},\,\mathrm{e},\ldots \},\ldots ,\mathrm{up\,to\,all}\,10\,\mathrm{features}, \end{aligned}$$

where a \(\in \) A\(_{1},\) b \(\in \) A\(_{2},\) c \(\in \) A\(_{3}\) and so on. Here {a, b, c, ...} are sub-feature data for the corresponding features (A\(_{1},\ldots ,\mathrm{A}_{10}).\)

If the number of associated sub-features equals the total number of features, no further associations of sub-features can be generated, although existing ones may be repeated. Repetition of associated sub-features should always be avoided; such repetition occurs rarely. When the maximum number of sub-features is reached, even if it equals the total number of features, the model generates a number of distinguished classes based on those associated sub-features. The associated sub-features for the DCs of the six datasets are explained below.

Table 2 The available number of instances for classes in Arrhythmia dataset

As per the datasets considered for our models, the different statistical results are shown in Tables 1, 2, 3, 4, 5 and 6. Consider Table 1. The classes are AN, BP, CA, CAAN, CAPPM, CNAH, CPN, CU, CD, MCAOM, MCASO, MCUD, MPNO, NE, OM, PM and RU; these are abbreviations of the classes in the audiology dataset. The decimal values below each class give the probability of instances for that class; in other words, Tables 1, 2, 3, 4, 5 and 6 record the probability of each class with respect to the number of instances of each dataset. In Table 1, X \(\ge \) 7 means that the number of associated sub-features is greater than or equal to 7; similarly, X \(\ge \) 8 and X \(\ge \) 9 mean that the number of associated sub-features is at least 8 and at least 9, respectively. From the tables it is found that an increase in the number of associated sub-features generates more effective DCs, i.e., the more associated sub-features, the fewer (and more specific) the resulting classes. In some cases only one class appears, which is why we have considered at least two effective DCs for every dataset in our model. The probability of the available classes in the different datasets using different threshold values is shown in Table 7, where X denotes the number of available sub-features for the corresponding classes. The probability values for the corresponding threshold values indicate the total number of possible classes with respect to the available instances. Table 7 shows that decreasing probability values reflect an increasing number of associated sub-features, which is depicted in Figs. 2, 3, 4, 5, 6 and 7. From these figures it is seen that in all cases, when the threshold value increases, the corresponding probability of available classes decreases. Based on Table 7, we obtain the possible DCs from the different datasets, shown in Table 8.
The DCs obtained from the different datasets are recognized as distinct novel classes.
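The class probabilities reported in Tables 1, 2, 3, 4, 5 and 6 are relative frequencies: the number of instances of a class divided by the total number of instances in the dataset. A minimal sketch of this computation, in which the class names and counts are illustrative placeholders rather than the actual dataset figures:

```python
# Probability of each class = instances of the class / total instances.
# The counts below are illustrative placeholders, not the real dataset values.
def class_probabilities(instance_counts):
    total = sum(instance_counts.values())
    return {cls: n / total for cls, n in instance_counts.items()}

counts = {"AN": 24, "BP": 12, "CA": 48, "NE": 36}  # hypothetical counts
probs = class_probabilities(counts)
print(probs["CA"])  # 48 / 120 = 0.4
```

Thresholding on the number of associated sub-features restricts which classes enter this count, which is why the tabulated probabilities shrink as X grows.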

From Tables 1, 2, 3, 4, 5 and 6, it is observed that the probability value of the selected class decreases as the threshold value increases for each dataset. This indicates that the associations of sub-features are approximately distinct in nature, i.e., repetitions are very rare or even absent. The probability of associated sub-features for DCs is shown in Table 9, from which it is understood that an increase in the threshold value increases the probability of associated sub-features for DCs, as depicted in Figs. 8, 9, 10, 11, 12 and 13.

8.3.2 Pareto multi-objective optimization

The idea for the MOO problem arises naturally from the POS. To our knowledge, this approach has not previously been adopted in a data-mining model by the cited researchers. Instead of the single solution of an SOO model, the set of POSs provides more effective and efficient solutions to an MOO problem. In our model, we apply the POS over the associated sub-features to generate novel classes. When associated sub-features for DCs are generated from threshold values, conflicts can arise among the associated sub-features for a DC; in such situations, the POS over the sub-feature sets yields an effective solution for the required classification model. In the Pareto-optimal model, let X and Y be solutions over the associated sub-features: for solution X, the model finds the number of classes with respect to the associated sub-features at a given threshold value. If the threshold value changes, the associated sub-features change accordingly, giving solution Y, and the number of classes changes with the corresponding associated sub-features. Suppose solution Y generates fewer classes than solution X; in other words, solution Y dominates solution X. If the same number of classes is generated for X and Y, the optimal classification model using the POS is unaffected. Since, as already seen, increasing the threshold value reduces the number of classes, the POS is applied to our classification model. The POS based on the probability of associated sub-features for DCs is shown in Table 10 for the different datasets. Using the POS, we obtain optimal classes whose associated sub-features dominate the other associated sub-features.
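The dominance argument above can be sketched as a standard Pareto comparison. As an assumption for illustration, solutions are represented as tuples of minimisation objectives (e.g., number of generated classes and number of repeated sub-feature associations); the function name and example values are hypothetical:

```python
# Pareto dominance for minimisation objectives.
# A solution is a tuple of objective values, all to be minimised
# (hypothetically: number of classes, number of repetitions).
def dominates(y, x):
    """True if solution y Pareto-dominates solution x."""
    return (all(yi <= xi for yi, xi in zip(y, x))
            and any(yi < xi for yi, xi in zip(y, x)))

X = (17, 3)  # e.g. threshold X >= 7: 17 classes, 3 repetitions (illustrative)
Y = (9, 1)   # e.g. threshold X >= 9: 9 classes, 1 repetition (illustrative)
print(dominates(Y, X))  # True: Y is no worse in all objectives, better in one
```

When both solutions yield the same objective values, neither dominates the other, matching the remark that equal class counts do not affect the optimal classification model.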

Table 3 The available number of instances for classes in Dermatology dataset
Table 4 The available number of instances for classes in Yeast dataset
Table 5 The available number of instances for classes in Glass dataset
Table 6 The available number of instances for classes in Ecoli dataset
Table 7 Probability of available classes using threshold values

Here X denotes the associated sub-features exceeding the given threshold values for the different datasets. Since the stated threshold value is the optimal one, its corresponding associated sub-features form a POS that dominates the other solutions for the DCs of the different datasets, as shown in Fig. 14, where POS stands for Pareto optimal solution and DC for distinguished class.

8.3.3 Fuzzy Pareto dominance set

Consider two fuzzy sets A and B. The Pareto dominance relation over A and B is defined using the operator #. The degree \(\upmu _\mathrm{A}\)(A, B) to which fuzzy set A dominates fuzzy set B is defined as

$$\begin{aligned} \upmu _\text {A} ( {\text {A},\,\text {B}})=( {\text {A}\# \text {B}})/\text {U}_\text {A}. \end{aligned}$$

If fuzzy set B dominates fuzzy set A, then

$$\begin{aligned} \upmu _\text {B} ( {\text {A},\,\text {B}})=( {\text {A}\# \text {B}})/\text {U}_\text {B}. \end{aligned}$$

Let A(X \(\ge \) 11) and B(X \(\ge \) 12) be two fuzzy sets defined on the arrhythmia dataset. By the above definitions, fuzzy set B covers the maximum associated sub-features for DCs, as listed in Table 11; in other words, fuzzy set B dominates fuzzy set A. The same dominance of B over A is found in all the other datasets.

By the resolution identity, a fuzzy set can be expressed as the union of its constituent fuzzy singletons \((\mu _i/u_i)\), where \(\mu _i\) is the degree of membership and \(u_i\) is the point of the universe of discourse (here, the total number of considered sub-features). The fuzzy set A is thus generated from its singletons as

$$\begin{aligned} \text {A} = \left( \# \mu _i \right) /u_i. \end{aligned}$$

For example, the fuzzy set for the associated sub-features of the arrhythmia dataset (A\(_{1})\) is

$$\begin{aligned}&\mathrm{A}_{1} = 0.375/16 + 0.4375/16 + 0.5/16 \\&\quad + \,0.625/16 + 0.6875/16 + 0.75/16,\\&\mathrm{A}_{1} = (0.375\#0.4375\#0.5\#0.625\#0.6875\#0.75)/16,\\&\mathrm{A}_{1} = 0.75/16. \end{aligned}$$

Similarly we have considered fuzzy sets A\(_{2},\) A\(_{3},\) A\(_{4},\) A\(_{5},\) A\(_{6}\) for all other data sets {Audiology, Dermatology, Yeast, Glass, Ecoli}. Thus the fuzzy sets for our datasets are shown in Table 12.
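The collapse of A\(_{1}\) shown above can be reproduced directly by reading the # operator as the maximum over membership degrees that share the same support point (here u = 16); this reading of # is inferred from the worked example:

```python
# Resolution identity: singletons mu_i / u on the same support point u
# combine through the # operator, which here acts as the maximum degree.
def combine_singletons(degrees):
    return max(degrees)

# Membership degrees of A1 for the arrhythmia dataset (support u = 16):
degrees_A1 = [0.375, 0.4375, 0.5, 0.625, 0.6875, 0.75]
print(combine_singletons(degrees_A1))  # 0.75, i.e. A1 = 0.75/16
```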

Fig. 2 Arrhythmia dataset
Fig. 3 Dermatology dataset
Fig. 4 Audiology dataset
Fig. 5 Yeast dataset
Fig. 6 Glass dataset
Fig. 7 Ecoli dataset

Table 8 The distinguished class from different datasets
Table 9 Probability of associated sub-features for distinguished class
Fig. 8 Arrhythmia dataset
Fig. 9 Dermatology dataset
Fig. 10 Audiology dataset
Fig. 11 Yeast dataset
Fig. 12 Glass dataset
Fig. 13 Ecoli dataset

Table 10 Pareto optimal solution based on probability of associated sub-features for distinguished class
Fig. 14 Pareto optimal solution and distinguished classes

Table 11 Pareto dominance solution
Table 12 Fuzzy set for different datasets

Based on the resolution identity, the above fuzzy sets yield the required DCs.

8.3.4 Distinguished classes based on fuzzy convex optimization

We have also considered our model on the basis of fuzzy convex optimization, where the problem is solved through a weighted combination of fuzzy sets with their degrees of membership. The convex combination is formed from two objectives, namely the membership function for the availability of sub-features (FS) and the membership function for computational cost (FC), with weights w\(_{1}\) and w\(_{2}\), for our experiments. A detailed description of the model is given in Sect. 6. The fuzzy convex optimization model produces effective results for different kinds of classification techniques and their complexities. For this model, the values of the user-defined variables are chosen to generate effective results; moreover, it serves as a prototype for our model.

The availability of sub-features for DCs is captured by the degree of membership \(\upmu _\mathrm{FS}\), defined as

$$\begin{aligned} \upmu _{\text {FS}} =\text {FS}=\mathop \int \nolimits _{x_1 }^{x_2 } \left[ 1+\left( \frac{u_1 -b_1 }{c_1 }\right) ^{-2}\right] ^{-1}/u_1, \end{aligned}$$

and its computational cost is defined as

$$\begin{aligned} \upmu _{\text {FC}} =\mathop \int \nolimits _{y_1 }^{y_2 } \left[ 1+\left( \frac{u_2 -b_2 }{c_2 }\right) ^{-2}\right] ^{-1}/u_2. \end{aligned}$$

The variables {u\(_{1},\) b\(_{1},\) c\(_{1}\)} for FS and {u\(_{2},\) b\(_{2},\) c\(_{2}\)} for FC are defined by user. The optimization model for DCs is defined as

$$\begin{aligned}&\mu _{DC} \left( u_1,\,u_2\right) =\mathop \int \nolimits _{x_1 }^{x_2 } \mathop \int \nolimits _{y_1 }^{y_2 } \left[ w_1 \,*\,\mu _{FS} \left( {u_1 } \right) \right. \\&\quad +\left. w_2\, *\,\mu _{FC} \left( {u_2 } \right) \right] /\left( u_1,\,u_2\right) , \end{aligned}$$

where x\(_{1}\) and x\(_{2}\) delimit the optimal range of associated sub-features. The variables u\(_{1}\) and u\(_{2}\) generate the effective optimal results for a DC; they vary with the number of associated sub-features and the computational complexity. The variable b\(_{1}\) is defined as the lower bound of the range and c\(_{1} = (\mathrm{x}_{2}-\mathrm{x}_{1})/2\); b\(_{2}\) and c\(_{2}\) are defined analogously for FC. For example, x\(_{1} = 6,\) x\(_{2} = 24,\) b\(_{1} = 6,\) c\(_{1} = 9\) and y\(_{1} = 0.02,\) y\(_{2} = 0.30,\) b\(_{2} = 0.02,\) c\(_{2} = 0.14.\) The computational complexities of the different classification techniques are shown in Table 13.

Table 13 Computational complexity of different classification techniques for different data sets

We now evaluate \(\mu _{DC} (u_1,\,u_2 )\) for the different classification techniques, using different values of u\(_{1}\) and u\(_{2}\) and different weights, as follows.

For Arrhythmia dataset:

(a) Classification by MLP:

$$\begin{aligned}&\upmu _\mathrm{DC} (12,\,0.8) = 0.6\,*\,0.307 + 0.4\,*\,0.968 \\&\quad = 0.1842 + 0.3872 = 0.5714. \end{aligned}$$

(b) Classification by linear regression (LR):

$$\begin{aligned}&\upmu _\mathrm{DC} (12,\,0.06) = 0.6\,*\,0.307 + 0.4\,*\,0.075 \\&\quad = 0.1842 + 0.03 = 0.2142. \end{aligned}$$
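These worked values follow from the bell-shaped (Cauchy-type) membership function \([1+((u-b)/c)^{-2}]^{-1}\) and the weighted combination with w\(_{1} = 0.6,\) w\(_{2} = 0.4\) and the parameters b\(_{1} = 6,\) c\(_{1} = 9,\) b\(_{2} = 0.02,\) c\(_{2} = 0.14\) listed earlier. A sketch that reproduces the arrhythmia figures (note the text rounds the memberships to 0.307 and 0.968 before combining, so the final digits differ slightly):

```python
# Bell-shaped (Cauchy-type) membership: [1 + ((u - b) / c)^(-2)]^(-1).
# Assumes u != b (the exponent -2 would otherwise divide by zero).
def bell(u, b, c):
    return 1.0 / (1.0 + ((u - b) / c) ** -2)

def mu_dc(u1, u2, w1=0.6, w2=0.4, b1=6, c1=9, b2=0.02, c2=0.14):
    """Weighted convex combination of sub-feature (FS) and cost (FC) memberships."""
    return w1 * bell(u1, b1, c1) + w2 * bell(u2, b2, c2)

# Arrhythmia dataset, MLP: u1 = 12 sub-features, cost u2 = 0.8.
print(round(mu_dc(12, 0.8), 4))   # 0.5721 (text: 0.5714 with pre-rounded memberships)
# Arrhythmia dataset, LR: cost u2 = 0.06.
print(round(mu_dc(12, 0.06), 4))  # 0.2148 (text: 0.2142 with pre-rounded memberships)
```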

Similarly, the evaluations of \(\mu _{DC} (u_1,\,u_2 )\) for the other datasets and classification techniques are given in Table 14 and depicted in Fig. 15.

Table 14 The results of \(\mu _{DC} (u_1,\,u_2 )\) for different datasets

Although computational complexity plays no important role in sub-feature selection for novel classification, it does play different roles in the optimality test across datasets. Having already discussed the results based on the optimal associated sub-features for novel classes, it remains only to discuss the optimality test across the different classification techniques. Since all variables are treated as prototypes for the different datasets, Fig. 15 shows that MLP performs better than the other classification techniques, though its performance varies across datasets; MLP is therefore the most suitable classification technique for our model. This experiment can also be conducted with different values of the considered variables; as the setup is a prototype, the observed variation in performance might be reduced by choosing other suitable values.

Fig. 15 Classification results on \(\mu _{DC} (u_1,\,u_2)\)

9 Conclusions

This paper explores optimization approaches to data-mining tasks, providing basic concepts for studying classification problems. Throughout the simulation process the original data remain unchanged and are maintained without divulging the exact values and characteristics of the features. The sub-feature values within each feature play an important role in generating a new class from an existing class. The experimental results demonstrate that our optimization models can be successfully applied to novel classification based on associated sub-features, fuzzy sets, Pareto dominance and classification techniques. By means of Pareto-based optimization, we gain deeper insight into different aspects of classification and can thus develop new classification models. The results and discussion show that the Pareto-based approach predicts the new class/subclass efficiently, and its power is made more attractive by the successful application of the different optimization models to novel classes. We also observed that a larger number of associations of sub-features predicts a smaller number of classes, which is naturally expected. We illustrated the generation of a new class/subclass from an existing class on real-world sensitive data, using different optimization models to verify our methodology. Since our experiments used a prototype with fixed values of the variables, the method can also be tested with other suitable values; this is left as an open problem for further research. Finally, many issues remain to be resolved, and new areas may open up in the field of Pareto-based multi-objective classification.