1 Introduction

Privacy-preserving distributed data mining has been developed by many prominent authors, who have analyzed the problem and proposed several strategies for data privacy. Feature selection has become an active research area with many industrial applications. Feature selection techniques are generally used for clustering, classification, and related tasks, and are recognized as data mining techniques for identifying relevant features. Many authors have proposed algorithms and models for feature selection, e.g., [17], but they have not considered sub-features. Sub-features play a vital role in correctly predicting the output (the target or class functions) in many special tasks; a sub-feature with no predictive capability can be removed from consideration. Hence the existence and role of sub-features come to the fore, and we consider sub-feature selection in the proposed work. Sub-feature selection is carried out through an appropriate representation of fuzzy probabilities, which differs from traditional feature selection techniques. To maintain the privacy of sub-feature data, alias values are used. Throughout this paper, fuzzy probability plays a central role in all processing tasks. Fuzzy random variables take fuzzy numbers as values; unlike real numbers, fuzzy numbers are only vaguely defined and carry a degree of acceptability. Each fuzzy random variable handles the true values through a membership function.

Information is shared by different parties in a distributed network environment. The key challenge is to apply fuzzy probabilities to multiparty collaborative distributed data mining so as to securely unify the perturbation used by the different data providers, while each party still obtains a satisfactory privacy guarantee and the utility of the collected data remains well preserved for building the sub-feature selection model [8]. Several important factors affect the quality of the sub-feature selection model, such as the frequency of each sub-feature within its feature set, the utility of the fuzzy probability data for privacy, and the data mining technique used (here, the gain ratio). These factors shape the algorithms and the model for sub-feature selection: fuzzy probability for sub-feature selection, estimation of the upper and lower bounds of the gain ratio, and fuzzy privacy for sub-feature selection. The analytical and experimental results show that the fuzzy privacy algorithm for sub-feature selection is efficient and effective while guaranteeing privacy, and that the gain ratio estimation yields feature selection within the expected interval.

This paper is organized as follows. Section 2 provides the background, covering related work and the preliminary study for the proposed model. Section 3 defines the problem statement of the proposed work. Section 4 illustrates the fuzzy model for data processing. Section 5 discusses the estimation of the upper and lower bounds of the gain ratio, with an approximate solution based on fuzzy random variables, while Sect. 6 explains the privacy preservation model for sub-feature selection together with its algorithm. Section 7 presents the experimental details and analyzes the datasets, parameters, proposed algorithms, and privacy preservation for sub-feature selection in the proposed model. Section 8 ends with concluding remarks and open directions for future work.

2 Background

This section discusses the background, covering related work and some mathematical preliminaries that describe the concepts needed for a better understanding of the problem. Both parts address the related concepts of privacy preservation in distributed data mining and feature selection in a fuzzy environment.

2.1 Related work

Distributed data mining applications have been studied in several areas, such as large-scale distributed data mining, privacy preservation of data, and peer-to-peer network systems, where each node computes an exact solution over the combined database [9]. Different distributed network algorithms, such as a standard centralized algorithm for decision trees [10], sharing of computation and information in peer-to-peer networks [11], centralized Bayesian networks [12], criminal network discovery [13], and incentive-compatible distributed data mining [14], have been evaluated on different databases in several computational experiments. In a distributed environment, however, the role of the participants is very important for the computation and communication of individual data: no party wants to release its data without protection. Many researchers have therefore developed models and standard algorithms to protect individual or organizational data. Data privacy, or privacy preservation of data, has been pursued using secure multiparty computation [15], game theory [16], k-anonymity and l-diversity approaches in social networks [17], privacy-preserving data publishing [18], horizontally partitioned data [19], etc. To maintain high privacy for the participating parties and the network coordinator, a model has been described for better computation [20].

Since features play an important role in distributed data mining, approaches to feature selection are needed. Several such approaches exist, including decision border features [21], mutual-information-based greedy selection [7], feature selection algorithms for large peer-to-peer networks [22], and feature wrappers and filters [3]. Similarly, different fuzzy techniques are used for feature selection, such as fuzzy clustering [23], conventional search over a fuzzy space [6], fuzzy support vector machines [4], fuzzy-rough-set-assisted attribute selection [5], fuzzy classification systems based on multi-objective evolutionary algorithms [24], and construction of fuzzy knowledge bases for feature selection [25]. Moreover, additional methodologies such as higher-order models for fuzzy random variables [26], upper and lower probabilities induced by fuzzy random variables [27], fuzzy rule-based classifiers [28], genetic fuzzy systems [29], and fuzzy linguistic models [30] are also used to strengthen fuzzy techniques for feature selection. None of the cited papers discusses or proposes privacy preservation for fuzzy sub-feature selection. Our paper differs from [4, 6, 23] in several respects: first, we propose fuzzy sub-feature selection for better class prediction; second, based on alias values, we develop an algorithm to maintain privacy; third, several solutions are presented to evaluate the performance of the selection. These results provide fundamental insights into the problem.

2.2 Preliminaries

This section discusses the basic concepts of fuzzy random variables, the data mining technique used for feature selection (the gain ratio), and privacy preservation in distributed data mining, for a better understanding of the problem. The primary focus is on the scenario in which multiple parties release their perturbed data to a data miner for mining purposes. Fuzzy random variables are used both for privacy and for the data evaluation underlying sub-feature selection.

2.2.1 Fuzzy random variables

Following [31], the notion of a fuzzy random variable is introduced as follows. Let (\(\Omega \),\(\mathcal{F}\),\( \mathcal {P} )\) be a probability triple and let U be a random variable defined on this triple. Assume that we perceive this random variable through a set of windows W\(_{i}\), i \(\epsilon \) J, with J a finite or countable set, each window being an interval of the real line such that W\(_{i} \cap \) W\(_{j} = \Phi \) for i \(\ne \) j and \( \bigcup \nolimits _{{i\epsilon J}} {W_{i} = R} \) (perceiving the random variable through these windows means that for each \(\omega \) we can only establish \(U_\omega \epsilon \) W\(_{i}\) for some i \(\epsilon \) J; note that the windows are pairwise disjoint since they partition R). Let \(\mathcal{F}_i \): R \(\rightarrow \) [0, 1] be the characteristic function of the window W\(_{i}\). Let S be the space of all piecewise continuous functions mapping R \(\rightarrow \) [0,1]. The perception of the random variable U is then defined, per the above description, by the mapping X: \(\Omega \rightarrow \) S given by

$$\begin{aligned} \omega \;\mathop \rightarrow \limits ^X \;X_\omega \end{aligned}$$
(1)

with \(X_\omega =\mathcal{F}_i\) iff \(U_\omega \epsilon \) W\(_{i}\) (the W\(_{i}\) being the perceiving windows). This means that with each \(\omega \epsilon \Omega \) we associate not a number \(U_\omega \), as for an ordinary random variable, but a characteristic function \(X_\omega \), which is an element of S; the mapping X : \(\Omega \rightarrow \) S characterizes a special type of fuzzy random variable. The random variable U of which X is a perception is called an original of the fuzzy random variable (FRV), and for a given FRV there may exist many originals. More generally, a FRV is defined as a map \(\xi \): \(\Omega \rightarrow \) F, where F is the set of all fuzzy numbers (i.e., fuzzy random variables are random variables whose values are not real numbers but fuzzy numbers). Fuzzy numbers are numbers whose values are only vaguely defined: a fuzzy number may assume different real values, each associated with a degree of acceptability. The fuzzy random variable X is said to be discrete if \(\Omega \) is countable or finite. When dealing with a single discrete fuzzy random variable X, we may take \(\Omega \) = N, the set of natural numbers, and \(\mathcal{F}\) the sigma algebra of subsets of N. We denote the probabilities P ({i}) = P\(_{i}\), i \(\epsilon \) N, and write \(i\;\mathop \rightarrow \limits ^X \;X^i\;\forall \) i \(\epsilon \) N. The fuzzy random variable supports feature selection in the data mining task via the gain ratio technique.
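To make the windows construction concrete, here is a minimal Python sketch (the paper's experiments use Matlab; the window boundaries and names below are illustrative assumptions): perceiving an outcome of U yields the characteristic function of the window containing it, not a number.

```python
import random

# A minimal sketch of perceiving U through disjoint windows W_i that
# cover the range of U; boundaries are illustrative.

WINDOWS = [(0.0, 2.0), (2.0, 7.0), (7.0, 10.0)]   # disjoint, union = range of U

def characteristic(window):
    """The characteristic function F_i : R -> {0, 1} of window W_i."""
    lo, hi = window
    return lambda x: 1.0 if lo <= x < hi else 0.0

def perceive(u):
    """Eq. (1): map an outcome U(omega) = u to X_omega, the characteristic
    function of the unique window containing u."""
    for w in WINDOWS:
        if w[0] <= u < w[1]:
            return characteristic(w)
    raise ValueError("u lies outside every window")

u = random.random() * 10.0     # an original U(omega)
x_omega = perceive(u)          # the perception: an element of S, not a number
print(u, x_omega(u))           # the true value has membership 1.0
```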

2.2.2 Evaluation of gain ratio for feature selection

Since feature selection is an important task in data mining [32, 33], many techniques have been developed for it, such as entropy, the Gini index, the gain ratio, and mutual information. In this paper we adopt the gain ratio technique for feature selection. The gain ratio is generally calculated as follows.

Let D be a set of tuples partitioned by class; P\(_{i}\) the probability that an arbitrary tuple in D belongs to class C\(_{i}\); x\(_{i}\) a feature; D\(_{j}\) the partition of the feature data belonging to class C\(_{i}\); info(D) the class information of D; info\(_{xi}\) (D) the information of feature x\(_{i}\) on D for the class; and SplitInfo\(_{xi}\) the split information of x\(_{i}\) on D, computed from the feature data alone, without class labels. The following steps find the gain ratio and the best feature with the maximum value (a Python sketch follows the list).

1. Calculate \(\text{ Info }(D) = - \mathop \sum \nolimits _{i=1}^m P_i \log _2 (P_i)\).

2. Calculate \(\text{ info }_{xi}(D) = \mathop \sum \nolimits _{j=1}^{n\,partition} \frac{\vert D_j\vert }{\vert D\vert }\, \text{ info }(D_j)\).

3. Calculate \(\text{ gain }_{xi} = \text{ Info }(D) - \text{ Info }_{xi}(D)\).

4. Calculate \(\text{ SplitInfo }_{xi}(D) = - \mathop \sum \nolimits _{j=1}^{n\,partition} \frac{\vert D_j\vert }{\vert D\vert } \times \log _2 \frac{\vert D_j\vert }{\vert D\vert }\) (the leading negative sign keeps the split information non-negative).

5. Calculate \(\text{ Gain } \text{ ratio }_{xi} = \frac{\text{ Gain }_{xi}}{\text{ SplitInfo }_{xi}}\), where \(\text{ SplitInfo }_{xi} \ne 0\).

6. Select the feature with the maximum gain ratio as the best feature.
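As a concrete illustration of steps (1)–(6), the following minimal Python sketch computes the gain ratio of a single categorical feature; the toy data and helper names are ours, not the paper's implementation.

```python
import math
from collections import Counter

# A minimal sketch of steps (1)-(6) for one categorical feature.

def info(labels):
    """Step (1): Info(D) = -sum p_i log2 p_i over the class probabilities."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(pairs):
    """Gain ratio of a feature given (feature_value, class_label) pairs."""
    labels = [y for _, y in pairs]
    info_d = info(labels)
    n = len(pairs)
    info_x, split_info = 0.0, 0.0
    for v in set(x for x, _ in pairs):          # the partitions D_j
        part = [y for x, y in pairs if x == v]
        w = len(part) / n                       # |D_j| / |D|
        info_x += w * info(part)                # step (2)
        split_info -= w * math.log2(w)          # step (4)
    gain = info_d - info_x                      # step (3)
    return gain / split_info if split_info else 0.0   # step (5)

# Step (6): the best feature is the one with the largest gain ratio.
toy = [("sunny", "yes"), ("rain", "no"), ("sunny", "yes"), ("rain", "yes")]
print(gain_ratio(toy))   # ~0.311
```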

2.3 Privacy preservation in distributed data mining

Several perturbation techniques have been widely used for privacy preservation of individual data. Generally two types of perturbation are used: (1) multiplicative and (2) additive. We adopt additive perturbation in this work, as discussed in [15, 34, 35]. This approach produces the perturbed data (Y) by adding random noise data (Z) to the original data (X), i.e., Y = X + Z, where X, Y, Z are N-dimensional vectors,

where N is the number of attributes in X. The original data X follow a probability distribution with mean vector \(\mu _{x}\) and covariance matrix K\(_{x}\). The noise Z is assumed to be independent of X and jointly Gaussian with zero mean and covariance matrix K\(_{z}\). It is easy to verify that the mean vector of Y is \(\mu _{x}\) and that its covariance matrix is K\(_{y}\) = K\(_{x}\) + K\(_{z}\). It is essential to choose K\(_{z}\) proportional to K\(_{x}\), i.e., K\(_{z}=\sigma _z^2 \) K\(_{x}\) for some constant \(\sigma _z^2 \) denoting the perturbation magnitude [36].
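The following minimal Python sketch illustrates this additive perturbation scheme; the synthetic data and the value of \(\sigma _z^2\) are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the additive perturbation Y = X + Z with
# K_z = sigma_z^2 * K_x; data and sigma_z^2 are illustrative.

rng = np.random.default_rng(0)

X = rng.multivariate_normal(mean=[5.0, 3.0],
                            cov=[[1.0, 0.3], [0.3, 0.5]], size=1000)
K_x = np.cov(X, rowvar=False)          # sample covariance of the original data
sigma_z2 = 0.25                        # perturbation magnitude
K_z = sigma_z2 * K_x                   # noise covariance proportional to K_x

Z = rng.multivariate_normal(mean=np.zeros(2), cov=K_z, size=len(X))
Y = X + Z                              # released (perturbed) data

# The mean is preserved and K_y ~ K_x + K_z = (1 + sigma_z2) K_x.
print(X.mean(axis=0), Y.mean(axis=0))
print(np.cov(Y, rowvar=False) / K_x)   # each entry near 1 + sigma_z2
```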

3 Problem statement

In this section we consider a decentralized network and distributed data mining setting in which the coordinator collects data from each party indirectly and evaluates the whole data to predict the classes. Although each party trusts the coordinator of the network system, every party may still wish to maintain the privacy of its individual data; accordingly, the coordinating data miner collects only perturbed data, with some amount of noise added to each party's data, for privacy purposes. By using additive perturbation to release the dataset, each party allows the coordinator to perform statistical evaluation without releasing the exact values of individual data, as discussed in Sect. 2.3. The assumption on the system is that the data miner always tries to construct an appropriate model of the original data from the given perturbed data. How well the perturbed data Y hides the original data X measures how well the privacy of X is preserved.

Many traditional methods have been used for feature selection. Although feature data serve different purposes, they still need refinement for better classification. Within a feature set, some individual features contain sensitive values, called sub-features; they play a major role in leading to a new class, and their frequency in the feature data may be low. We view all biological feature values as random variables. Since the perception of biological data is always fuzzy, randomness is inevitable, and the fuzzy random variable comes into the picture. Randomness occurs because it is not known which response to expect from any given individual; once a response is available, there is still uncertainty about its precise meaning, and this latter uncertainty is characterized by fuzziness. Thus the feature data are modeled as fuzzy random variables, and we focus on privacy preservation for sub-feature selection. The following definitions aid understanding.

Definition 1

A value of feature F\(_{i}\) defines a sub-feature S\(_{j}\) if its frequency is greater than zero, i.e., \(\vert \)S\(_{j}\vert >\) 0. Each unique feature value is recognized as a sub-feature, so a feature may have many sub-features.

Definition 2

A feature (F\(_{i})\) is a set of sub-features (S\(_{j})\) satisfying the following conditions:

1. \( \left| {{\text {F}}_{{\text {i}}} } \right| = \sum \nolimits _{j} {|S_{j} |} \)

2. \( {\text {F}}_{{\text {i}}} = \bigcup \nolimits _{j} {S_{j} } \)

3. \(\bigcap \nolimits _{j} {S_j } = \Phi \)

where i indexes the i\(^{th}\) feature and j the sub-features. Example: the following IRIS data set from the UCI machine learning repository illustrates the concepts discussed in the definitions.

From Table 1, the features are sepal length, sepal width, petal length, and petal width, with the classes Iris-setosa, Iris-versicolor, and Iris-virginica. In the feature sepal width, the sub-features are S\(_{1}\)(2.3,1), S\(_{2}\) (2.7,1), S\(_{3}\) (3.0,1), S\(_{4}\) (3.1,1), S\(_{5}\) (3.2,3), S\(_{6}\) (3.3,2), S\(_{7}\) (3.7,1), where the arguments of each sub-feature are the feature value and its frequency in the data set. For the feature sepal width, F\(_{sw}\) = { S\(_{1}\), S\(_{2}\), S\(_{3}\), S\(_{4}\), S\(_{5}\), S\(_{6}\), S\(_{7}\)}. Hence \(\vert \)F\(_{sw}\vert =\mathop \sum \nolimits _{j=1}^7 \vert S_j \vert \), F\(_{sw}=\bigcup \nolimits _j {S_j } \) and \(\bigcap \nolimits _j {S_j } =\Phi \).

Table 1 Consideration data of Iris
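As a minimal illustration of Definitions 1 and 2, the following Python sketch extracts the sub-features of the sepal-width example above; the ten values are only the illustrative rows behind Table 1, not the full Iris data.

```python
from collections import Counter

# A minimal sketch of Definitions 1 and 2: sub-features as
# (value, frequency) pairs of one feature column.

sepal_width = [2.3, 2.7, 3.0, 3.1, 3.2, 3.2, 3.2, 3.3, 3.3, 3.7]

sub_features = Counter(sepal_width)       # S_j: unique value -> frequency
print(sorted(sub_features.items()))
# [(2.3, 1), (2.7, 1), (3.0, 1), (3.1, 1), (3.2, 3), (3.3, 2), (3.7, 1)]

# Definition 2 conditions:
assert len(sepal_width) == sum(sub_features.values())   # |F_i| = sum |S_j|
assert set(sepal_width) == set(sub_features)            # F_i = union of S_j
# distinct values are disjoint by construction, so their intersection is empty
```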

4 Fuzzy model for data processing

This section discusses both the network design and the data model. Since the data are distributed, they must be shared among all parties to select the best features and sub-features; data sharing is thus important for solving the local problems of the individual parties, and privacy preservation of individual data is equally essential.

Initially the model collects data from each party in a decentralized manner. Since each party's data cover different ranges of each feature, a global range of each feature must be constructed. Because the data are geographically decentralized, the values of each feature vary from place to place; for example, the UCI machine learning repository shows that breast cancer incidence among women varies from rural to urban to metropolitan areas. Using alias techniques and fuzzy random variables, the model maintains privacy preservation of the data. The concept of the fuzzy random variable was discussed in Sect. 2.2.1; here the fuzzy random variable (FRV) is defined as X = \((\text{ T }-\text{ a })/\text{ T }\), where 'a' is the frequency of the sub-feature value and T is the total number of records in the database. For example, if a = 1 for a particular sub-feature and T = 150, then X = 0.993, and the coordinator collects the sub-feature values whose fuzzy random variable value is 0.993. In other words, the coordinator collects sub-feature data according to the values of the fuzzy random variables. The original data and their alias values for the Iris dataset are presented in Tables 10 and 9 of Appendix 1. Data collection and the processing of data at the coordinator's end are depicted in Figs. 1 and 2.

Fig. 1 Decentralized network with coordinator

Fig. 2 Coordinator database for processing of task

Based on Fig. 2, the coordinator can easily recover the original data from the alias data and maintain its own database for processing, as illustrated below.

(a) Collection of the data as alias values

As data flow from party to party, each sub-feature value is assigned an alias value for individual privacy. Each party sends its own data as alias values, based on the fuzzy random variable X, to the next party (as depicted in Fig. 1) in the decentralized environment, i.e., from party P\(_{1}\) to party P\(_{2}\), ..., to party P\(_{n}\). Since only alias data move from party to party, privacy is maintained, and finally the alias data reach the coordinator's jurisdiction. To obtain alias data, each feature value is assigned a natural number, which serves as the alias of the original feature value. For example, if the feature value is 3.2, the alias would be 1 (one); if the data range is larger, the alias range is set accordingly, as shown in Table 9 of Appendix 1. The conversion process is known only to the coordinator.

(b) Conversion of alias value to original data

Since the coordinator knows the conversion process, it can easily convert alias data back to the actual data for further processing once the alias data reach its control. The coordinator thus obtains a new database, presented in Table 10 of Appendix 1.
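A minimal Python sketch of steps (a) and (b) follows; it assumes the coordinator has already fixed the global range, and the ten-value range is illustrative (the experiments use all 35 distinct sepal-length values).

```python
# A minimal sketch of the alias scheme: natural-number aliases over
# the global feature range, invertible only by the coordinator.

global_values = [4.3, 4.4, 4.5, 4.7, 4.9, 5.0, 5.3, 5.5, 5.7, 5.8]
to_alias = {v: i + 1 for i, v in enumerate(sorted(global_values))}
to_value = {a: v for v, a in to_alias.items()}   # held by the coordinator

party_data = [4.3, 4.4, 4.7, 5.8]        # a party's local sub-feature values
released = [to_alias[v] for v in party_data]
print(released)                          # [1, 2, 4, 10]: aliases, not values

recovered = [to_value[a] for a in released]
print(recovered)                         # [4.3, 4.4, 4.7, 5.8]
```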

(c) Making the coordinator's own database of each feature

The database is now organized by feature and sub-feature, as shown in Table 8 of Appendix 1.

(d) Selection of feature and sub-feature

Here the data processing task is crucial: the coordinator selects features and sub-features using fuzzy probability. The general sub-feature selection procedure, using the fuzzy random variable, is presented in Algorithm 1.

Algorithm 1: General sub-feature selection using the fuzzy random variable

Algorithm 1 derives the general sub-feature selection from the coordinator's database. The fuzzy privacy sub-feature selection algorithm is presented subsequently.

5 Fuzzy random variable for feature selection

This section presents the theoretical derivation of the gain ratio using fuzzy random variables.

5.1 Gain ratio based on fuzzy random variable

Before discussing the gain ratio technique, consider the mutual information between discrete random variables, which measures the statistical dependence between variables. The definition of mutual information between two random variables is given in [37]. For feature selection, mutual information is useful for assessing the quality of a discretization [38]. In contrast, the natural definition of mutual information between a fuzzy variable and a crisp variable is a fuzzy number, not a numerical value. Since mutual information is a component of the gain ratio for feature selection, we can derive the gain ratio through mutual information using random variables and also fuzzy random variables.

A fuzzy random variable is a perception whose underlying random variable is called an original; it can be regarded as a family of random sets (\(\psi _{\text {u}})_{\text {u} \epsilon [0,1]}\), each associated with a confidence level 1 \(-\) u. A random set is a mapping whose images are crisp sets. A random variable X is a selection of a random set \(\Gamma \) when the image of each outcome under X is a member of the image of the same outcome under \(\Gamma \) [39]. In other words, if X is a random variable and \(\Gamma \) a random set, we can define

$$\begin{aligned}&\text{ X } :\Omega \rightarrow \text{ R }\end{aligned}$$
(2)
$$\begin{aligned}&{\text {and }}\quad \Gamma :\Omega \rightarrow {\mathcal {P}}\left( {\text {R}} \right) \end{aligned}$$
(3)

where X is a selection of \(\Gamma \) (i.e., X \(\epsilon \) A(\(\Gamma ))\) and X(e) \(\epsilon \,\Gamma \)(e) for all e \(\epsilon \, \Omega \). Conversely, \(\Gamma \) is associated with a family of random variables: a random set can be observed as a family of random variables. The gain ratio between a random variable X and a random set \(\Gamma \) can be defined over the set of all selections of \(\Gamma \). Thus the gain ratio between a random variable and a random set is

$$\begin{aligned} \text{ GR }( {\text{ X },\Gamma })=\left\{ {\text{ GR }( {\text{ X },\text{ Z }})\vert \text{ Z } \epsilon \text{ A }( \Gamma )} \right\} \end{aligned}$$
(4)

where X is a selection of \(\Gamma \) and A(\(\Gamma )\) is the family of selections (random variables) of \(\Gamma \).

The fuzzy random variable is used as a nested family of random sets (\(\Psi _{u})\), u \(\epsilon \) (0,1), each associated with a certain confidence level. Thus we define the gain ratio between a random variable X and a fuzzy random variable \(\Psi \) as the fuzzy set given by the membership function

$$\begin{aligned} \widetilde{{GR}}\left( {{\text {X}},\Psi } \right) \left( {\text {v}} \right) = {\text { max}}\left\{ {{\text { u }}|{\text { v }}\epsilon {\text { GR}}\left( {{\text {X}},\Psi _{{\text {u}}} } \right) } \right\} \end{aligned}$$
(5)

Similarly, assume we are given two paired standard random-variable samples X(X\(_{1}\), X\(_{2}\), ... X\(_{N})\) and Y(Y\(_{1}\), Y\(_{2}\), ..., Y\(_{N})\) whose universes of discourse are finite. Let a\(_{1}\), a\(_{2}\), ..., a\(_{n}\) and b\(_{1}\), b\(_{2}\), ...b\(_{m}\) be the relative frequencies (probabilities) of the values of the samples of X and Y respectively, and c\(_{1}\), c\(_{2}\), ...c\(_{s}\) the relative frequencies (probabilities) of the values of the joint sample X \(\times \) Y. The gain ratio between the variables X and Y is then evaluated in the following steps.

1. Since a\(_{i}\), b\(_{i}\), and c\(_{i}\) are the relative frequencies (probabilities) of the arbitrary tuples in database D belonging to class C\(_{i}\), the information needed to classify a tuple in D is

    $$\begin{aligned} \text{ Info }( {\text{ X }, \text{ Y }})( \text{ D })&= -\mathop \sum \nolimits _{i=1}^n a_i \,\log a_i \\&- \mathop \sum \nolimits _{i=1}^m b_i \log b_i+\mathop \sum \nolimits _{i=1}^s c_i \,\log c_i \end{aligned}$$

2. Let x\(_{i}\) be a feature and D\(_{j}\) the partition of D whose feature values belong to class C\(_{i}\); then info\(_{xi }\)(D) is the information of feature x\(_{i}\) of the database D for the above class. Thus

    $$\begin{aligned} \text{ Info }_{\text{ xi }} ( {\text{ X }, \text{ Y }}) ( \text{ D })=\;\mathop \sum \nolimits _{j=1}^{n\,partition} \frac{\left| {D_j } \right| }{\left| D \right| }\,\text{ Info }(\text{ X },\,\text{ Y })(\text{ D}_j)\, \end{aligned}$$

3. Hence the information gain from such a partitioning is

    $$\begin{aligned} \text{ Gain }_{\text{ xi }} \left( {\text{ X }, \text{ Y }} \right) = \text{ Info }\left( {\text{ X }, \text{ Y }} \right) \left( \text{ D } \right) - \text{ Info }_{\text{ xi }} \left( {\text{ X }, \text{ Y }} \right) \left( \text{ D } \right) . \end{aligned}$$

    The split information can be derived analogously from info(D) as given in [33]. Hence the gain ratio is defined as

    $$\begin{aligned} \text{ Gain } \text{ ratio }_{\text{ xi }} ( {\text{ X },\text{ Y }})=\;\frac{\text{ Gain }_{\text{ x }_{\text{ i }\,} } \,(\text{ X },\,\,\text{ Y })}{Splitinfo_{x_i } \,(X,\,\,Y)} \end{aligned}$$
    (6)

    where \(Splitinfo_{x_i } \,(X,\,Y) \ne \) 0.

To estimate the gain ratio, consider two paired samples X{X\(_{1}\),X\(_{2}\),...,X\(_{N}\)} and \(\Psi \){\(\Psi _{1}\),\(\Psi _{2}\),...,\(\Psi _{N}\)} of a crisp random variable X and a fuzzy random variable \(\Psi \). The estimated gain ratio between X and \(\Psi \) is the fuzzy set

$$\begin{aligned}&\widehat{{GR}}\left( {\left( {{\text {X}}_{{\text {1}}} , \ldots ,{\text {X}}_{{\text {N}}} } \right) ,\left( {\Psi _{{\text {1}}} , \ldots ,\Psi _{{\text {N}}} } \right) } \right) \left( {\text {v}} \right) \nonumber \\&\quad = {\text {max}}\left\{ {\text {u}} \,\Big \vert \, {\text {v}}\; \epsilon \; {\text {GR}}\left( \left( {{\text {X}}_{{\text {1}}} , \ldots ,{\text {X}}_{{\text {N}}} } \right) ,\left( {{\text {Z}}_{{\text {1}}} , \ldots ,{\text {Z}}_{{\text {N}}} } \right) \right) \right. \nonumber \\&\qquad \left. {\text {for some }} \left( {{\text {Z}}_{{\text {1}}} , \ldots ,{\text {Z}}_{{\text {N}}} } \right) \epsilon \, {\text {A}}\left( {\left( {\Psi _{{\text {1}}} , \ldots ,\Psi _{{\text {N}}} } \right) _{{\text {u}}} } \right) \right\} \end{aligned}$$
(7)

Here the gain ratio determines the feature value based on its maximum membership.

5.2 Estimation of upper bounds and lower bounds of gain ratio

A fuzzy random variable is used to find the upper and lower bounds of the gain ratio; equivalently, a probability distribution is defined on a class of random variables. A fuzzy random variable X is thus a mapping from \(\Omega \) to R, i.e.,

$$\begin{aligned} \text{ X } :\Omega \rightarrow \text{ R } \end{aligned}$$

where \(\Omega \) is the feature space and R denotes the set of fuzzy numbers.

The corresponding probability distribution P\(_{F}\) is defined on the class of random variables as

$$\begin{aligned} {\text {P}}_{{\text {F}}} \left( {\text {z}} \right) = {\text { max}}\left\{ {{\text { u }}|{\text { z }}\epsilon \Psi _{{\text {u}}} } \right\} \end{aligned}$$
(8)

where z is a member of the random set \(\Psi _{u}\).

By induction, this induces a probability distribution on the values of the gain ratio:

$$\begin{aligned} \text{ P }( {\text{ GR }( {\text{ X },\Psi })= \text{ t }})=\;\mathop \sum \nolimits _{Z\vert GR( {X,Z})=t} P_F (z) \end{aligned}$$
(9)

Using the estimated bounds P\(_{F}^{u}\)(z) and P\(_{Fl}\)(z), we can estimate the upper and lower bounds of P(GR(X,\(\Psi ))\), and finally the expected value of the gain ratio via fuzzy optimization. Since the probability of a sample of any fuzzy random variable Z is the product of the probabilities of the Z\(_{i}\), the model can be represented as

$$\begin{aligned} {\text {P}}_{{\text {F}}} \left( {{\text {Z}}_{{\text {1}}} ,{\text {Z}}_{{\text {2}}} , \ldots \ldots {\text {Z}}_{{\text {m}}} } \right) = \mathop \prod \nolimits _{{i = 1}}^{m} P_{{\text {F}}} \left( {{\text {Z}}_{{\text {i}}} } \right) \end{aligned}$$
(10)

Then the estimation of gain ratio is defined by above probability distribution as

$$\begin{aligned}&{\text {P }}\left( {\text {GR}}\left( \left( \mathop \bigcup \nolimits _{{i = 1}}^{m} X_{i} ,\;\mathop \bigcup \nolimits _{{i = 1}}^{m} \Psi _{i}\right) = {\text { t}}\right) \right) \nonumber \\&\quad =\mathop \sum \nolimits _{{GR\left( {\mathop \bigcup \nolimits _{{i = 1}}^{m} Xi~,\mathop \bigcup \nolimits _{{i = 1}}^{m} Zi}\right) = t}} P_{{\text {F}}} \left( {{\text {Z}}_{{\text {1}}} ,{\text {Z}}_{{\text {2}}} , \ldots ..{\text {Z}}_{{\text {m}}}}\right) \end{aligned}$$
(11)

The above probability provides an avenue for a general formulation of fuzzy optimization, with expected value and constraints:

$$\begin{aligned}&{\text {Max E}}\left( {{\text {GR}}} \right) = \mathop \sum \nolimits _{{i = 1}}^{m} P_{{\text {i}}} *{\text { GR}}\left( {{\text {X}},{\text { Z}}_{{\text {i}}} } \right) \end{aligned}$$
(12)
$$\begin{aligned}&{\text {Subject to}}\;\mathop \sum \nolimits _{{i = 1}}^{m} P_{{\text {i}}} = {\text {1}}\end{aligned}$$
(13)
$$\begin{aligned}&{\text {P}}_{{\text {l}}} \le {\text { P}}_{{\text {i}}} \le {\text { P}}^{{\text {u}}} \end{aligned}$$
(14)

where P\(_{i}\) is the probability of each sample and (P\(_{l}\), P\(^{u})\) are the lower- and upper-bound probabilities. In practice an exact solution cannot be found; for an approximate solution, the problem is relaxed as

$$\begin{aligned}&{\text {Max E}}\left( {{\text {GR}}} \right) = \;\frac{{\mathop \sum \nolimits _{{i = 1}}^{m} P^{'}_{i} \, GR\left( X, Z_{i} \right) }}{{\mathop \sum \nolimits _{{i = 1}}^{m} P^{'} _{i} }}\end{aligned}$$
(15)
$$\begin{aligned}&\text{ Subject } \text{ to } \text{ max }\;P_j^{'l} \,\le \;P'_i \,\le \text{ max }\;P_j^{'u} \end{aligned}$$
(16)

where max P’\(^{l}\) selects the maximum value among all lower-bound probabilities and max P’\(^{u}\) the maximum among all upper-bound probabilities. For the approximate solution we consider the two cases below (a short numerical sketch follows the list).

Case 1: Upper bound estimation

$$\begin{aligned}&{\text {Max E}}^{{\text {u}}} \left( {{\text {GR}}} \right) = \;\frac{{\mathop \sum \nolimits _{{i = 1}}^{m} q^{'}_{i} \, GR(X,Z_{i} )}}{{\mathop \sum \nolimits _{{i = 1}}^{m} q^{'} _{i} }}\end{aligned}$$
(17)
$$\begin{aligned}&{\text {Subject to min }}\{ {\text {max}}P_{j}^{{'l}} \} \le \;q'_{i} ~ \le {\text { max }}\{ {\text {max}}P_{j}^{{'u}} \} \end{aligned}$$
(18)

Case 2: Lower bound estimation

$$\begin{aligned}&{\text {Max E}}_{{\text {l}}} \left( {{\text {GR}}} \right) = \;\frac{{\sum \nolimits _{{i = 1}}^{m} {q^{\prime \prime }_{i} \, GR(X,~Z_{i} )} }}{{\mathop \sum \nolimits _{{i = 1}}^{m} q^{\prime \prime }_{i} }}\end{aligned}$$
(19)
$$\begin{aligned}&{\text {Subject to min }}\{ {\text {max}}P_{j}^{{'u}} \} \le \;q^{\prime \prime }_{i} ~ \le {\text { max }}\{ {\text {max}}P_{j}^{{'l}} \} \end{aligned}$$
(20)

Thus, from the above two cases, the approximate expected value with upper and lower bound estimation is

$$\begin{aligned} {\text {Appx E}}\left( {{\text {GR}}} \right) = {\text { }}\left[ {{\text {E}}_{{\text {l}}} \left( {{\text {GR}}} \right) ,{\text { E}}^{{\text {u}}} \left( {{\text {GR}}} \right) } \right] \end{aligned}$$
(21)

Several examples may be considered for feature selection within the above expected interval. In the next section, sub-feature selection is derived based on fuzzy probability.
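The following minimal Python sketch evaluates this approximation numerically; the gain-ratio values GR(X, Z\(_{i}\)) and the probability bounds are illustrative assumptions, and the constraint sets of Eqs. (18) and (20) are simplified to a single interval of admissible weights.

```python
from itertools import product

# A minimal numerical sketch of Eqs. (15)-(21). The optimum of a
# normalized weighted average over a box of weights is attained at a
# vertex, so we scan the extreme weight assignments.

gr = [0.42, 0.31, 0.55]          # GR(X, Z_i) for three selections Z_i
q_lo, q_hi = 0.10, 0.50          # admissible weight interval (simplified)

def weighted(q):
    """Normalized expectation sum(q_i * GR_i) / sum(q_i), Eq. (15)."""
    return sum(qi * g for qi, g in zip(q, gr)) / sum(q)

values = [weighted(q) for q in product([q_lo, q_hi], repeat=len(gr))]
e_lower, e_upper = min(values), max(values)   # Cases 2 and 1

print([e_lower, e_upper])        # Appx E(GR) = [E_l(GR), E^u(GR)], Eq. (21)
```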

6 Privacy preservation model for fuzzy sub-feature selection

In this section, three criteria are considered for sub-feature selection: less frequent (LF), medium frequent (MF), and very large frequent (VLF) feature values in the database. The criteria are characterized as fuzzy numbers with the membership functions sketched in Fig. 3. Using this information, the fuzzy random variable is a mapping from feature elements to criterion levels, i.e.,

$$\begin{aligned} \text{ X }:\Omega \rightarrow \text{ L } \end{aligned}$$
(22)

where each \(\omega \epsilon \Omega \) represents a sub-feature element and X (\(\omega )\) is the label (L) of the criterion (LF, MF, or VLF). Here the values of the fuzzy random variables are fuzzy numbers and hence only vaguely defined: a fuzzy number may assume different real values, each with a degree of acceptability handled according to the rules of fuzzy logic. The fractions of feature values under each criterion are given in Table 2.

Table 2 Fractional criteria

Since \(\Omega \) is countable, the fuzzy random variable X is discrete. As we are dealing with a single discrete fuzzy random variable X, we can take \(\Omega \) = N, where N is the set of natural numbers. Thus

$$\begin{aligned} {\text {X}}:{\text { N }} \rightarrow {\text { P}}_{{\text {n}}}\,\, {\text {and X }}:{\text { N }} \rightarrow {\text { Z}}_{{\text {N}}} \end{aligned}$$
(23)

where P\(_{n}\) is a probability and Z\(_{N}\) the membership function corresponding to the criteria LF, MF, VLF; i.e., a single discrete fuzzy random variable X is essentially characterized by a set of pairs (P\(_{n}\), Z\(_{N})\). From Table 2, the probabilities are P\(_{1}\)=0.2, P\(_{2}\)=0.5, P\(_{3}\) = 0.3, and P\(_{n}\)=0 for n \(>\) 3 (since \(\Sigma \)P\(_{n}\)=1), and the membership functions LF, MF, VLF are depicted in Fig. 3.
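A minimal Python sketch of this discrete fuzzy random variable follows; the trapezoidal shapes are an illustrative reading of Fig. 3, not the exact membership functions used in the experiments.

```python
# A minimal sketch of the criteria as trapezoidal membership functions
# over the frequency axis 0-10, based on the break points LF (0-2),
# MF (2-7), VLF (7-10); exact shapes are assumptions.

def trapezoid(a, b, c, d):
    """Membership rising on [a, b], flat on [b, c], falling on [c, d]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

LF  = trapezoid(-1, 0, 1, 2)     # less frequent
MF  = trapezoid(1, 3, 6, 7)      # medium frequent
VLF = trapezoid(6, 8, 10, 11)    # very large frequent

# A single discrete FRV as pairs (P_n, Z_N): probability and membership.
frv = [(0.2, LF), (0.5, MF), (0.3, VLF)]
print([(p, mu(1)) for p, mu in frv])   # memberships at frequency 1
```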

To continue, let S be the set of piecewise continuous functions mapping R \(\rightarrow \) [0, 1]. We define X: \(\Omega \rightarrow \) S as the special type of fuzzy random variable given by the perception described above, here called a selected fuzzy random variable; for a given selected fuzzy random variable there may exist many original random variables. Under this condition we generalize and define the discrete fuzzy random variable X as a mapping \(\Omega \rightarrow \) F\(_{N}\), where F\(_{N}\) is the set of fuzzy numbers. For \(\omega \epsilon \Omega \), its image in F\(_{N}\) is denoted X\(_{\omega }\) and satisfies the following conditions.

For each membership level \(\mu \epsilon \) [0,1], the two selections P\(_{\mu }\) and Q\(_{\mu }\) on the sub-feature values, defined by

$$\begin{aligned}&\text{ P }_\mu ( \omega )= \text{ inf }\left\{ {\text{ i } \epsilon \text{ R } \vert \text{ X }_\omega ( \text{ i })\ge \mu } \right\} \end{aligned}$$
(24)
$$\begin{aligned}&\text{ Q }_\mu ( \omega )= \text{ sup }\left\{ {\text{ i } \epsilon \text{ R } \vert \text{ X }_\omega ( \text{ i })\ge \mu } \right\} \end{aligned}$$
(25)

are finite real-valued random variables satisfying, for all \(\omega \epsilon \Omega \),

$$\begin{aligned} \text{ X }_\omega ( {\text{ P }_\mu ( \omega )})\ge \mu , \quad \text{ X }_\omega ( {\text{ Q }_\mu ( \omega )})\ge \mu \end{aligned}$$
(26)
Fig. 3 Membership functions of less frequency (0–2), medium frequency (2–7), and very large frequency (7–10)

The above conditions give finite support for selecting the sub-features; the support must be chosen large enough for all purposes. In addition, each random variable should be normal, i.e., X\(_{\omega }\)(i) = 1 for some i \(\epsilon \) R. To find the selected sub-features, we consider the level sets corresponding to the given fuzzy sets; an algorithm determines the fuzzy expectation of a fuzzy random variable with its membership function. The family of level sets F\(_{\mu }\) with \(\mu \epsilon \) [0,1] is

$$\begin{aligned} \text{ F }_\mu =\left\{ {\text{ x } \epsilon \text{ G } \vert \text{ g }( \text{ x })\ge \mu } \right\} \end{aligned}$$
(27)

where G is the basic space and g: G \(\rightarrow \) [0, 1]. Conversely, the membership function g is recovered as

$$\begin{aligned} \text{ g }( \text{ x })= \text{ sup }\left\{ {\mu \epsilon \left[ {0,\text{1 }} \right] \vert \text{ x } \epsilon \text{ F }_\mu } \right\} , \text{ x } \epsilon \text{ G } \end{aligned}$$
(28)

Thus a fuzzy set (G, g) is defined on the basic space G. The algorithms presented here evaluate the level sets F\(_{\mu }\) only at discrete values of \(\mu \epsilon \) [0, 1], which in turn allows an approximate evaluation of the membership function g with the above equation. The level sets of the expectation E(X) of a discrete fuzzy random variable X are used here to determine the sub-features from the feature space.

In the decentralized network model we consider, each party provides its converted fuzzy data to the coordinator, so individual privacy is maintained. In other words, the coordinator collects the individual alias data, with randomized data added for privacy, as discussed in Algorithm 2; subsequently the coordinator evaluates the whole data and selects the sub-features within the expected range (Algorithm 3). The alias data, as fuzzy frequencies, are transferred from party to party in the decentralized system, so no party is able to learn another party's data; privacy is thus maintained at each party's level.

Algorithm 2: Collection of alias data with randomization for privacy

Algorithm 3: Sub-feature selection within the expected range

The above two algorithms select the sub-feature values for a class while maintaining privacy among the participating parties in the decentralized network.

7 Experimental details

This section discusses the application of the proposed algorithms, the data sets, and the parameters considered for implementation, along with a performance evaluation.

7.1 Description of datasets

Although the proposed algorithm is primarily intended for privacy-preserving sub-feature selection, it can also be used on conventional data sets. To show this, we evaluated the algorithm on the IRIS plant data set [40] from the University of California Irvine (UCI) machine learning repository. The data set contains 150 records in three flower classes (Setosa, Versicolor, Virginica), with 50 records per class and four features, of which the petal and sepal measurements are the most important. IRIS is a popular and simple classification data set based on multivariate characteristics of a plant species (the length and width of its petal and sepal) divided into three distinct classes of 50 instances each; one class is linearly separable from the other two. Four features are predictive and one is the goal feature; all predictive features are real-valued. The length and width of each part are important for selecting sub-features, using feature values from the four-dimensional measurement space.

7.2 Environments and parameters

7.2.1 Environments

The proposed method was implemented on a personal computer with an Intel Pentium IV 2.40 GHz CPU and 1.00 GB RAM, running Microsoft Windows XP Professional version 2002, using the Matlab 7.0.1 development environment. The data sets were processed in a fuzzy environment for sub-feature selection.

7.2.2 Parameters

The predicted data of each feature from the feature space are measured by fuzzy random variables with membership functions. The interpretation of the user-defined parameters used to evaluate the proposed algorithm is given in Table 3. Although the parameters are quite restricted, there is no standard rule for assigning parameter values systematically.

Table 3 Parameters used in proposed algorithm

A brief description of the parameter values follows.

The frequency parameter 'a' and the total dataset size 'T' are used to measure the value of the fuzzy random variable for each sub-feature as X\(_{i}\) = \(\frac{T-a}{T}\). It is observed from the computation (Table 8 in Appendix 1) that sixteen frequencies in the IRIS data set lie within the fuzzy frequency interval [0.993, 0.806]. The probability P\(_{i}\) of each fractional criterion value for all features is determined by (X\(_{i}\) * \(\mu _{k})\), as explained in Algorithm 1. For example, for the feature "sepal length", the extreme probabilities are 0.993 *\(\frac{9}{35}\) = 0.993 * 0.257 = 0.255 and 0.933 * \(\frac{1}{35}\) = 0.026. Thus ten fuzzy frequencies for the feature "sepal length" lie in the interval [0.255, 0.026]. Similarly, fourteen fuzzy frequencies for the feature "sepal width" lie within [0.215, 0.035], nine for "petal length" within [0.230, 0.041], and eleven for "petal width" within [0.089, 0.036]. A detailed description of the experimental data obtained using Algorithm 1 is presented in Table 5.

Again, the two fuzzy random variables P\(_{\mu }\) and Q\(_{\mu }\) are important for determining the best sub-feature selection within a certain expected range ER. The probability P\(_{i}\) measures the share of the total data set covered by the sub-features of a given frequency. For example, P\(_{i}\) for frequency one of "sepal length" is \(\frac{(1*9)}{150}\) = 0.06 and for frequency two it is \(\frac{(2*2)}{150}\) = 0.026. Since only less frequent features are considered (i.e., frequency one and two), P\(_{\mu }\) = 0.986 and Q\(_{\mu }\) = 0.993. Sub-feature selection is restricted to the expected range [\(\mathop \sum \nolimits _i P_i \)P\(_{\mu }\), \(\mathop \sum \nolimits _i P_{i }\)Q\(_{\mu }\)]. Thus the expected range for the feature sepal length is [0.026 * 0.986, 0.06 * 0.993] = [0.0256, 0.0595]. The expected ranges of sub-feature selection for all features are shown in Table 4 (a short sketch reproducing this computation follows the table).

Table 4 Expected range of sub-feature selection in each feature of IRIS dataset
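The following minimal Python sketch reproduces the sepal-length computation above; the counts (nine sub-features of frequency one and two of frequency two among 35 sub-features, T = 150) follow the worked numbers in the text, and small differences from the published interval come from intermediate rounding.

```python
# A minimal sketch of the expected range for the feature sepal length.

T = 150
x = lambda a: (T - a) / T        # fuzzy random variable X = (T - a)/T

p1 = (1 * 9) / T                 # P_i for frequency one: 0.06
p2 = (2 * 2) / T                 # P_i for frequency two: ~0.0267

P_mu, Q_mu = x(2), x(1)          # 0.986... and 0.993...

expected_range = [p2 * P_mu, p1 * Q_mu]
print([round(v, 4) for v in expected_range])
# [0.0263, 0.0596]; the text rounds P_i and the FRV values first,
# giving [0.0256, 0.0595].
```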

7.3 Results and analysis of proposed algorithms for sub-feature selection

The first part of the experiment analyzes the coordinator database using Algorithm 1. The fuzzy frequency variable, which varies from one feature to another, is used to collect the sub-feature values from the different parties, and the sub-feature values are arranged in order of fuzzy frequency. The probability of the sub-feature set covered by a fuzzy frequency is determined by the number of corresponding sub-features relative to the total number of sub-features. A frequency covered by the threshold value is used as a fractional criterion; otherwise it is set to zero, and such a sub-feature is never predicted. The probability of each fractional criterion is exhibited in Table 5; the probability differs for each frequency according to the available sub-features. The first two probabilities are considered for sub-feature selection because those sub-features have low frequency; if we considered sub-features with frequency greater than two, the same sub-feature values would occur in more than one class, making selection challenging. The values of the fuzzy random variable for the individual frequencies of the IRIS database are shown in Table 5, and the increase in fuzzy random variable values as frequencies decrease is shown in Fig. 6 of Appendix 2.

Table 5 Probability of each fraction of criteria for all features

The maximum and minimum probabilities of the sub-feature data of each feature are shown in Table 6. Since we select sub-features with frequency one or two, this (max–min) range is not directly helpful: a sub-feature with low frequency can lead to a unique class. Hence Table 7 serves our purpose.

Table 6 Max and min probability value of sub-feature value
Table 7 Best sub-feature based on values of fuzzy random variable

In the second part of the experiment, the application of the fuzzy random variable to sub-feature selection is elaborated using the Iris data set. We consider six parties participating in a peer-to-peer network; each party holds 25 records with the four features. Each party provides its own feature data to the data miner using two privacy mechanisms (alias data and secure multiparty computation). No party's local sub-feature data range exactly matches the total sub-feature data range, because each party holds only its own local data. For example, the first party holds the sub-feature data range (4.3 – 5.8) whereas the second party holds (4.4 – 5.5), and so on, while the global range for the feature "sepal length" is (4.3 – 7.9), which no single party's range equals. After collecting the data from each party, the data miner observes that the total data set has 150 records and that the numbers of sub-feature values for the features (sepal length, sepal width, petal length, petal width) are {35, 23, 43, 22} respectively. These sub-feature data sets are used to select the sub-features predicting a new class.

Following the probabilities of the fractional criteria in Table 5, the sub-feature data of each feature and their corresponding probabilities are depicted in Fig. 4. The figure shows that sub-features of different features with the same frequency have different probabilities, because the frequency profiles of the features differ. It is further observed that as frequency increases, the corresponding probability decreases: sub-features with low probability are involved in many classes and do not lead to a particular class, whereas sub-features with the highest probabilities lead to a new class. As we consider only the two selected fractional criteria, the probabilities of the sub-features of each feature at the fuzzy random variable values 0.993 and 0.986 are exhibited in Fig. 4e for the features SL, SW, PL, and PW. It is interesting to note that any sub-feature lying between the curves for the fuzzy random variable values 0.993 and 0.986 leads to a new class, for every feature. The sub-features selected for all features, with frequency one and two, are depicted in Fig. 5 and are considered the best sub-features.

Fig. 4 Feature values versus corresponding values of fuzzy random variable (a–d) and selected fractional criteria with corresponding probabilities (e)

7.4 Results and analysis for privacy preservation

The coordinator collects sub-feature or feature data from each party using two privacy mechanisms (the alias data technique and secure multiparty computation). For each fuzzy frequency, each party provides its available data as alias data to the coordinator. The fuzzy frequencies are computed as \(\frac{T-a}{T}\), where T is the total data set size and 'a' is the frequency. While sub-feature data are collected from the different parties, the data size changes at the current party after it merges its own sub-feature data, because the same sub-feature values are not available at all parties. For example, the coordinator (as first party) sends the sub-feature data set {4.3, 4.4, 4.7, 5.8} with fuzzy frequency value 0.993 and {4.9, 5.0, 5.7} with fuzzy frequency value 0.986 to the second party. At the second party, after its own sub-feature data are merged, the sub-feature data sets become {4.3, 4.4, 4.5, 5.3, 5.8} and {4.7, 5.5}, with fuzzy frequencies 0.993 and 0.986 respectively, and are then sent to the third party.

As this is a secure multiparty computation problem, the coordinator, as first party, collects the data ranges of all features and builds the global range before any original data circulate. The coordinator then builds the alias range over the data range by assigning natural numbers starting from 1. Now the first party sends its sub-feature data in the form {1, 2, 5, 16} and {7, 8, 15}, with fuzzy frequency values 0.993 and 0.986 respectively, to the second party. Similarly, the second party sends its alias data {1, 2, 3, 11, 16} and {5, 13}, with fuzzy frequency values 0.993 and 0.986, to the third party without learning the original data of the first party, and so on until the data return to the first party. Finally all alias data reach the first party in the decentralized environment. Hence both the fuzzy frequency values and the alias data help preserve the privacy of each participating party's data during computation. The whole process is implemented with Algorithms 2 and 3, so privacy is maintained twice over. Since each party maintains the privacy of its individual data set, measuring how much privacy is preserved for each data set would be necessary, but is beyond the scope of this paper.
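A minimal Python sketch of one such ring pass is given below; party 2's local aliases are hypothetical, chosen only so that alias 5 gains a second occurrence and migrates from fuzzy frequency 0.993 to 0.986, mirroring the example above (the text's 0.986 set also contains alias 13, contributed by other local data).

```python
from collections import Counter

# A minimal sketch of one ring pass under the alias scheme: each party
# merges its own aliased sub-feature occurrences before forwarding.

incoming = Counter({1: 1, 2: 1, 5: 1, 16: 1})   # from party 1, frequency one
party2_local = Counter({3: 1, 5: 1, 11: 1})     # hypothetical local aliases

merged = incoming + party2_local                # occurrence counts add up

ones = sorted(al for al, n in merged.items() if n == 1)
twos = sorted(al for al, n in merged.items() if n == 2)
print(ones)   # [1, 2, 3, 11, 16]: forwarded at fuzzy frequency (T-1)/T
print(twos)   # [5]: migrates to fuzzy frequency (T-2)/T
```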

Fig. 5 Selected number of sub-features having frequency one and two for all features (F1: frequency one, F2: frequency two)

8 Conclusion

This paper explores the use of fuzzy probability and a perturbation technique to select sub-features while preserving privacy in a distributed data mining environment. The results and analysis show that when the frequency count is higher, the probability of each sub-feature is lower: a sub-feature with a high frequency count is involved in many classes and does not lead to a new class, whereas a low frequency count can lead a sub-feature to a unique class. Moreover, the fuzzy random variable approach confines the expected range, making the selection of sub-features from the feature database easy. At the same time, the privacy of the original data is well maintained, without divulging the exact data values of any party during the secure multiparty computation, thanks to perturbation and fuzzification; in this distributed data mining environment, the data values of each party are doubly secured. The experimental results demonstrate that the notion of the fuzzy random variable for sub-feature selection can be successfully applied to different kinds of data mining tasks, including clustering, decision making, and gain-ratio computation. This technique also offers an interesting direction for extending unique association rules to predict a new class. Although the IRIS data set was used for this experiment, the approach scales to medical data sets, which are more sensitive than IRIS data.