1 Introduction

The rapid development of computer technology has led to an exponential growth of data. According to a study cited in (Witten and Frank 2005), the amount of data in the world doubles every 20 months. Such rapidly growing big data is inevitably redundant, with many repetitions, gaps, inconsistencies, vagueness and uncertainties, and is therefore difficult to interpret and analyze. This low data quality creates serious problems for areas such as classification, clustering, feature selection, feature extraction, knowledge discovery, image processing, prediction and text mining. Among these areas, data mining is the one most affected by missing and hidden data, and the focus of this paper is therefore to reduce classification problems caused by missing and hidden data.

This study focuses on classification and proposes a new classification method. Classification, in its simplest definition, is the process of finding the class that an object belongs to. The number and labels of the classes are known in advance. Moreover, each object has features, and the relation between these features and the classes is what classification exploits. Nevertheless, some problems make correct classification difficult. For example, classifying within a dataset that is inconsistent, incomplete and contains redundant data is hard. To address these problems in data mining, preprocessing procedures such as data cleaning, relevance analysis, data transformation and reduction are applied. Recently, however, rough sets theory (RST) has been widely and effectively used for this purpose in the literature. Some researchers have also proposed classification methods for inconsistent and incomplete data. For example, Khoo et al. (1999) proposed a novel approach for the classification and rule induction of incomplete information systems. Xiang-Wei and Yian-Fang (2012) proposed a novel and effective preprocessing algorithm using RST. Chakhar and Saad (2012) proposed a methodology to support groups in multi-criteria classification problems.

RST makes a dataset more consistent by working on the incomplete and inconsistent data. Additionally, RST can find which features of an object are more effective than the others in the classification of the object. The experimental results as well as related works in the literature show that more effective classification can be achieved by using these aspects of RST.

2 Rough sets theory

Rough sets theory (RST) (Pawlak 1982, 1991, 1998; Pawlak and Skowron 2007) was proposed by Pawlak in 1982 as a mathematical method for generating and evaluating rules in expert systems for tasks such as pattern recognition, machine learning and knowledge discovery. RST has recently attracted attention due to its ability to obtain useful information from missing and inconsistent data, which has opened up very wide application areas (Kryszkiewicz 1998; Dubois and Prade 1990; Pawlak and Sowinski 1994; Jensen and Shen 2004; Duntsch and Gediga 1998; Pawlak 2002; Swiniarski and Skowron 2003; Hu and Cercone 1995; Slowinski and Zopounidis 1995; Pawlak 1984), and to resolve problems related to vague, uncertain and incomplete information. RST is also widely used for feature extraction and reduction in data mining.

Constructing a wide and comprehensive information system (IS) related to the problem is the first step of RST (Cekik and Telceken 2014).

$$\begin{aligned} \hbox {IS} = (U, A, D) \end{aligned}$$
(1)

U is a non-empty finite set of objects \((U=\{x_1,x_2,\ldots ,x_n\})\), A is a non-empty finite set of features \((A=\{a_1,a_2,\ldots ,a_m\})\), and D is a non-empty finite set of class decisions \((D=\{d_1,d_2,\ldots ,d_k\})\). Every \(a\in A\) defines an information function \(f_a:U \rightarrow V_a\), where \(V_a\) is the set of values of a, called the domain of feature a.
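As a concrete illustration, an information system can be represented as a simple lookup table. The following Python sketch uses hypothetical objects, features and decision values; none of them come from the paper's datasets:

```python
# A toy information system IS = (U, A, D). All objects, features and values
# below are hypothetical, not drawn from the paper's datasets.
U = ["x1", "x2", "x3", "x4"]
A = ["color", "size"]
values = {
    "x1": {"color": "red",  "size": "small"},
    "x2": {"color": "red",  "size": "large"},
    "x3": {"color": "blue", "size": "small"},
    "x4": {"color": "blue", "size": "large"},
}
D = {"x1": "yes", "x2": "yes", "x3": "no", "x4": "no"}

def f(a, x):
    """The information function f_a: U -> V_a is simply a table lookup."""
    return values[x][a]

print(f("color", "x3"))  # blue
```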

$$\begin{aligned} Z_{i,j}=\{ \forall a\in A |\;\; a(x_i )\ne a(x_j )\} \quad i,j = 1,2,\ldots ,n \end{aligned}$$
(2)

The core and reducts of the features are computed using a discernibility matrix of size \(n\times n\), where n denotes the number of objects. \(Z_{i,j}\) is defined as the set of all features which discern objects \(x_i\) and \(x_j\). The discernibility function is a Boolean function of m Boolean variables \(a_1^*,a_2^*,\ldots ,a_m^*\), which correspond to the features \(a_1,a_2,\ldots ,a_m\), defined as below:

$$\begin{aligned} f_{\mathrm{A}} \left( a_1^{*},a_2^{*},\ldots ,a_m^{*}\right) ={\wedge }\{{\vee } Z_{i,j}^{*} \;\;| \;\; 1\le j\le i\le n, \;\;Z_{i,j}\ne \varnothing \}\nonumber \\ \end{aligned}$$
(3)

where

$$\begin{aligned} Z_{i,j}^*=\{ a^* |\;\;a\in Z_{i,j} \}. \end{aligned}$$

The set of all prime implicants of \(f_{\mathrm{A}}\) determines the set of all reducts of A (Komorowski 1999). In this study, the core and reduct attributes of every object are obtained using the discernibility matrix and discernibility functions of RST. The goal is to identify, for each object, the characteristics that determine its class relative to the objects of the other classes. For example, if the first and the third attributes are obtained as the discernibility attributes of an object, these are the attributes that discern this object from the other objects. Let \(d_k\) denote a class; then,

$$\begin{aligned} i,j= & {} 1,2,\ldots ,n\nonumber \\ Z_{i,j}= & {} \left\{ \forall a\in A |\;\; a(x_i )\ne a(x_j ) \;\; \mathrm{and} \;\;d(x_i )\ne d(x_j )\right\} \end{aligned}$$
(4)
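The discernibility matrix of Eq. 2 and a search for the reducts determined by Eq. 3 can be sketched as follows. This is an illustrative brute-force implementation on hypothetical data, not the authors' code; practical RST toolkits compute prime implicants far more efficiently than exhaustive enumeration:

```python
from itertools import combinations

# Hypothetical information system: 4 objects, 3 features.
features = ["a1", "a2", "a3"]
objects = {
    "x1": ("red",  "small", "round"),
    "x2": ("red",  "large", "round"),
    "x3": ("blue", "small", "square"),
    "x4": ("blue", "large", "round"),
}

def discernibility_matrix(objs):
    """Z[(i, j)] = set of features whose values differ between objects i and j (Eq. 2)."""
    names = list(objs)
    Z = {}
    for i, j in combinations(range(len(names)), 2):
        xi, xj = objs[names[i]], objs[names[j]]
        Z[(i, j)] = {features[k] for k in range(len(features)) if xi[k] != xj[k]}
    return Z

def reducts(objs):
    """Minimal feature subsets hitting every non-empty Z_{i,j}: the prime
    implicants of the discernibility function (Eq. 3), found by brute force."""
    Z = [s for s in discernibility_matrix(objs).values() if s]
    found = []
    for r in range(1, len(features) + 1):
        for cand in combinations(features, r):
            if any(set(f) <= set(cand) for f in found):
                continue  # a smaller reduct is contained in cand: not minimal
            if all(set(cand) & s for s in Z):
                found.append(cand)
    return found

print(reducts(objects))  # [('a1', 'a2')]
```

Here both \(a_1\) and \(a_2\) appear as singleton entries in the matrix, so they form the core, and together they already discern every pair of objects.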

Another concept in RST is upper/lower approximation sets. Let \(R \subseteq A \) and \(X \subseteq U \). Set X can be approximated by using the features contained in R by defining the lower and upper approximations of X, denoted \(\underline{R}(X)\) and \(\overline{R}(X)\), respectively, where \(\underline{R}(X)=\{x \in U |\;\; [x]_R \subseteq X\}\) and \(\overline{R}(X)=\{x \in U | \;\;[x]_R \cap X \ne \emptyset \}\). The objects in \(\underline{R}(X)\) can be classified as members of X on the basis of the knowledge in R. On the other hand, the objects in \(\overline{R}(X)\) can only be classified as possible members of X. The boundary set of X is defined as \(\hbox {BN}_R(X)= \overline{R}(X)-\underline{R}(X)\) and thus consists of those objects that cannot be classified exactly into X on the basis of the knowledge in R. Four basic classes of rough sets can be defined as follows:

  • X is roughly R-definable, \(\Leftrightarrow \underline{R}(X) \ne \emptyset \;\; \textit{and} \;\;\overline{R}(X) \ne U. \)

  • X is internally R-undefinable, \(\Leftrightarrow \underline{R}(X) = \emptyset \;\; \textit{and} \;\;\overline{R}(X) \ne U.\)

  • X is externally R-undefinable, \(\Leftrightarrow \underline{R}(X) \ne \emptyset \;\; \textit{and} \;\;\overline{R}(X) = U. \)

  • X is totally R-undefinable, \(\Leftrightarrow \underline{R}(X) = \emptyset \;\; \textit{and} \;\;\overline{R}(X) = U. \)

The last RST concept we discuss is the positive, negative and boundary regions of features. Let P and Q be sets of attributes inducing equivalence relations over U. Then the positive, negative and boundary regions are defined as \(\hbox {POS}_P(Q)=\bigcup _{X \in U/Q}\underline{P}X\), \(\hbox {NEG}_P(Q)= U - \bigcup _{X \in U/Q}\overline{P}X\) and \(\hbox {BND}_P(Q)= \bigcup _{X \in U/Q}\overline{P}X-\bigcup _{X \in U/Q}\underline{P}X\), respectively.
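The approximations and regions above can be illustrated in a few lines of Python. The data, names and feature indices here are hypothetical:

```python
from collections import defaultdict

def partition(objs, R):
    """U / IND(R): group objects with identical values on the feature indices in R."""
    blocks = defaultdict(set)
    for name, vals in objs.items():
        blocks[tuple(vals[k] for k in R)].add(name)
    return list(blocks.values())

def lower(objs, R, X):
    """R-lower approximation: union of equivalence classes fully contained in X."""
    return set().union(*([b for b in partition(objs, R) if b <= X] or [set()]))

def upper(objs, R, X):
    """R-upper approximation: union of equivalence classes intersecting X."""
    return set().union(*([b for b in partition(objs, R) if b & X] or [set()]))

# Hypothetical data: the single feature cannot distinguish x2 from x3.
objs = {"x1": ("a",), "x2": ("b",), "x3": ("b",)}
X = {"x1", "x2"}
print(lower(objs, [0], X))  # {'x1'}
print(upper(objs, [0], X))  # all of U
# BN_R(X) = {'x2', 'x3'}; since the lower approximation is non-empty and the
# upper approximation equals U, X is externally R-undefinable in this example.
```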

Fig. 1 FWRSC classification model

3 A new rough sets classification method

The data mining discipline emerged to analyze big data and extract meaningful information in order to cope with the problem of data explosion. Data mining draws on statistics, machine learning and artificial intelligence, and it has attracted great interest from industry, the economy, academia and the business community. Data mining methods, which comprise numerous algorithms, are used in the health industry, basic sciences, banking, finance and market research. In the most general sense, data mining methods can be classified as unsupervised clustering, supervised learning and market basket analysis. Supervised learning can in turn be divided into classification, estimation and prediction.

Classification is the process of putting an object into a predefined class using its features. The classification technique is a systematic approach, which builds classification models from input datasets. Each technique relies on a learning algorithm, which tries to construct a relation between features of the input data and the class labels (Han et al. 2011; Tan et al. 2006; Cios et al. 2012).

Classification is accomplished by computing relations between features and classes according to the common and distinct features among the classes. Classification becomes easier when this relation is simple and the number of common features is small. Unfortunately, the reverse holds for most real-life applications: a feature can be common to more than one class, which makes classification hard and complex. Additionally, some attributes or attribute values can be redundant and unnecessary for the classification of an object. In practice, this means that an object can be classified even though the values of some attributes are missing, since the remaining attributes or attribute values are sufficient for classification or for making a decision.

In this study, discernible features were determined for each class. Hence, the relationship between a class and a feature value is simplified avoiding the problem of hidden classification. Consequently, the performance of the classification process is increased. In order to handle the problems explained above, a new rough-set-based classification method, named feature weighted rough set classification (FWRSC), is proposed in this paper. Steps of FWRSC are shown in Fig. 1. The preprocessing phase shown in the figure is used to fill missing values in the dataset using discretization, interpolation, etc. This step was not used in this paper since there are no missing values in the used datasets. In the second step of the processing pipeline in Fig. 1, i.e., the split data phase, the dataset is first divided into two parts as training and testing. The training set is then divided into several classes based on the class label. In this step, FWRSC processes the objects in the training dataset to determine the best attributes that discriminate the objects in their classes. To achieve this, FWRSC creates what is called the weight matrix scoring system. The rows of this matrix are the objects, and the columns are the attributes. Each cell of the matrix signifies the weight of the attribute in determining the particular object’s class label. The higher the weight of an attribute, the higher its contribution toward determining the class of that object. Similarly, the attributes with smaller values have smaller effect on determining the class label of the object. The details of how the weight matrix scoring system is calculated are given in Sect. 3.1. After the feature vectors are determined using the training set, they are used to classify the objects in the test set. The classification of a test object works as follows: Each attribute value of the test object is compared with the attributes of a training object. 
If the attribute values are the same, a score of 1 is assigned to that attribute; otherwise, 0 is assigned. The score values of all attributes are then added up to compute a cumulative score for the object. The test object is then assigned to the class of the training object for which this score is maximized. The details of this procedure are discussed in Sect. 3.2.

3.1 Weight matrix calculation

The discernibility relations are closely related to indiscernibility and belong to the most important relations considered in RST. The ability to discern between perceived objects is important for constructing many structures such as reducts, classifications, decision rules or algorithms (Pawlak and Skowron 2007). In particular, the efficiency of this relation affects the performance of the whole classifier in data mining. It reveals the core and reduct attributes, so the important attributes of an object are made clearly visible. Based on this information, a scoring system for each attribute of an object becomes applicable. The scoring process works similarly to term frequency (TF) (Manning et al. 2008) in text document classification. Here the discernibility function can be thought of as the text document, and the reducts, which are combined with the logical OR operator, as the sentences of the document. The weight of a feature is then computed like a term frequency; the only difference is that the number of OR operators plus one is used instead of the number of terms. For example, in an information system with five attributes, if the core attributes of an object are the first and the third attributes, then the weights of these two attributes must be increased. In this way, the influence of the core attributes, which better define the class of an object, is increased while the influence of the non-core attributes is reduced. Thus, a more accurate classification can be carried out.

In this study, a matrix for scoring the attributes of the system, called the weight matrix (\(W_{i,j}\)), is created. Let \(f_i\) be the discernibility function of the ith object, and let \(\hbox {Count}(f_i, a_j)\) be the frequency of attribute \(a_j\) in \(f_i\). If \(\bigvee \) denotes the Boolean \(\hbox {OR}\) operator in \(f_i\), then \(\hbox {Count}(f_i, \bigvee )\) is the number of \(\hbox {OR}\) operators in \(f_i\). Formally:

$$\begin{aligned} N= & {} \hbox {Count}\left( f_i,\bigvee \right) +1 \end{aligned}$$
(5)
$$\begin{aligned} d_{i,j}= & {} \frac{\hbox {Count}(f_i,a_j )}{N} \end{aligned}$$
(6)

Weight matrix (\(W_{i,j}\)):

$$\begin{aligned} \left( \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} d_{1,1}&{}d_{1,2}&{}...&{}d_{1,n}\\ d_{2,1}&{}d_{2,2}&{}...&{}d_{2,n}\\ ...&{}...&{}...&{}...\\ ...&{}...&{}...&{}...\\ d_{m,1}&{}d_{m,2}&{}...&{}d_{m,n}\\ \end{array} \right) \end{aligned}$$

Let \(\hbox {IS} = (U, A)\) be an information system with 10 objects and four attributes. The discernibility function for the fifth object is:

Fig. 2 Sets for classes

$$\begin{aligned} f_5= (a_1 \wedge a_2) \vee (a_2\wedge a_3 \wedge a_4) \end{aligned}$$

The attribute counts for the fifth object are:

$$\begin{aligned} \hbox {Count}(f_5,a_1 )= & {} 1,\;\; \hbox {Count}(f_5,a_2 )=2,\;\;\\ \hbox {Count}(f_5,a_3 )= & {} 1,\;\;\hbox {Count}(f_5,a_4 )=1,\;\;\\ N= & {} \hbox {Count}\left( f_5,\bigvee \right) +1=2 \end{aligned}$$

Accordingly,

$$\begin{aligned} d_{5,1}= & {} \frac{\hbox {Count}(f_5,a_1)}{N}= \frac{1}{2}=0.5,\\ d_{5,2}= & {} \frac{\hbox {Count}(f_5,a_2)}{N}= \frac{2}{2}=1.0,\\ d_{5,3}= & {} \frac{\hbox {Count}(f_5,a_3)}{N}= \frac{1}{2}=0.5, \\ d_{5,4}= & {} \frac{\hbox {Count}(f_5,a_4)}{N}= \frac{1}{2}=0.5, \end{aligned}$$

The weight of each attribute for the fifth object:

$$\begin{aligned} W_{5,j}=[0.5\quad 1\quad 0.5\quad 0.5] \end{aligned}$$

Here, attribute \(a_1\) appears only in \(a_1\wedge a_2\); hence, \(\hbox {Count}(f_5,a_1 )\) is equal to 1. Attribute \(a_2\) appears in both \(a_1\wedge a_2\) and \(a_2\wedge a_3\wedge a_4\); hence, \(\hbox {Count}(f_5,a_2 )\) is equal to 2. Vector \(W_{5,j}\) shows the weight of each attribute for the fifth object. As seen from the example, attribute \(a_2\) is more important than the other attributes for the fifth object.
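The worked example above can be checked with a short sketch. Representing the discernibility function as a list of conjunctions is our own illustrative choice, not notation from the paper:

```python
# f_5 = (a1 AND a2) OR (a2 AND a3 AND a4), encoded as a list of conjunctions.
f5 = [["a1", "a2"], ["a2", "a3", "a4"]]
attributes = ["a1", "a2", "a3", "a4"]

def weights(f, attrs):
    """Row of the weight matrix for one object (Eqs. 5 and 6)."""
    # N = number of OR operators + 1 = number of conjunctions.
    N = len(f)
    counts = {a: sum(conj.count(a) for conj in f) for a in attrs}
    return [counts[a] / N for a in attrs]

print(weights(f5, attributes))  # [0.5, 1.0, 0.5, 0.5]
```

This reproduces the vector \(W_{5,j}\) computed above, with \(a_2\) receiving the largest weight.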

3.2 Determining the class of an object

The first step in the classification process is to split the training data into its classes. Let \(C_t\) denote the set of objects belonging to the \(t{\mathrm{th}}\) class, and let D be the set of class decisions (Fig. 2).

Formal definitions of each class are as follows:

$$\begin{aligned} r= & {} \{d_i,d_j \in D \;\; |\;\; \forall d_i \not = \forall d_j,\;\; 0<i\le j \le S(D)\}\nonumber \\ t= & {} 1,2,\ldots ,S(r)\nonumber \\ C_t= & {} \{x_k\in U;\;\; d_i,d_j \in D \;\;| \;\; d_i= d_j , \;\;x_k\} \end{aligned}$$
(7)

Here, X is a finite set, and the function S(X) gives the number of elements in X. Accordingly, S(D) denotes the total number of objects in \(\hbox {IS}\), r denotes the set of distinct classes in \(\hbox {IS}\), and S(r) the number of classes.

Let \(\hbox {IS} = (U,A,D)\) be an information system, and let \(\beta , \gamma \subset \hbox {IS}\) be the test and the training sets, respectively. Then the class score for a test object for class ‘t’ is calculated as follows:

$$\begin{aligned}&R^*= S(U) \quad \hbox { and } \quad r^*=S(A)-1\nonumber \\&\hbox {Class}^t_{\mathrm{score}}=\frac{\sum _{\begin{array}{c} i=1,\\ \gamma _i \in C_t \end{array}}^{R^*}\sum _{\begin{array}{c} j=1,\\ \beta (a_j) = \gamma (a_j) \end{array}}^{r^*}(1+W_{i,j})}{S(C_t)} + \frac{S(C_t)}{S(U)} \end{aligned}$$
(8)

where \(S(C_t)\) is the number of objects (or elements) in the \( t{\mathrm{th}} \) class set, S(U) is the total number of objects in \(\hbox {IS}\), and the ratio of \(S(C_t)\) to S(U) is the class rate.

As seen from Eq. 8, the object to be classified in the test dataset is compared with all objects in the training dataset and a score is computed for each class. The computation of this score proceeds as follows: When the value of an attribute is the same for the object in question and for the object in the training set, its score value for similarity is 1, and this score is added to the weight of that attribute in \(W_{i,j}\). The reason for adding the feature weight to the similarity score is to prevent hidden classification and thus enable more sensitive classification by increasing the accuracy. Furthermore, it exposes the important and unimportant attributes that determine the class of an object. The total weight thus computed becomes the feature weight for the object for the particular class. Finally, the class rate is added to the score obtained thus far to come up with the overall score for the class.

To summarize, the computation of the class score for a test object consists of three parts: the similarity value, the feature value and the class rate. The similarity value incorporates how similar an object in test dataset is to each object in the training set, but it does not prevent hidden classification and does not expose hidden pattern in data. At this point, the feature value comes into play and assigns a weight to each attribute. By giving more weight to the important attributes, the hidden classification and pattern problems are avoided. Finally, the class rate ensures internal balance.

Having computed the class scores, the class of an object in the test dataset is determined by:

$$\begin{aligned} \hbox {Calculated class}=\hbox {max} \left( \hbox {Class}^t_{\mathrm{score}}\right) \end{aligned}$$
(9)

Clearly, an object in the test dataset is included in the class having the highest score value. If two or more classes have the same score value, then one of the classes is randomly chosen for the object.
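Under these definitions, Eqs. 8 and 9 can be sketched as follows. The helper names and the toy training set are hypothetical, and the precomputed weight vectors stand in for the weight matrix of Sect. 3.1:

```python
import random

def class_score(test_obj, train, labels, W, cls, n_total):
    """Sketch of Eq. 8: average weighted attribute-match score for one class,
    plus the class rate S(C_t)/S(U). `train` maps object id -> feature tuple,
    `labels` maps object id -> class, `W` maps object id -> weight vector."""
    members = [o for o in train if labels[o] == cls]
    total = 0.0
    for o in members:
        for j, v in enumerate(train[o]):
            if test_obj[j] == v:      # matching attribute value scores 1 ...
                total += 1 + W[o][j]  # ... plus that attribute's weight
    return total / len(members) + len(members) / n_total

def classify(test_obj, train, labels, W):
    """Eq. 9: the class with the highest score wins; ties are broken at random."""
    scores = {c: class_score(test_obj, train, labels, W, c, len(train))
              for c in set(labels.values())}
    best = max(scores.values())
    return random.choice([c for c, s in scores.items() if s == best])

# Hypothetical two-class training set with precomputed weight vectors.
train = {"x1": ("a", "b"), "x2": ("a", "c"), "x3": ("d", "c")}
labels = {"x1": "p", "x2": "p", "x3": "q"}
W = {"x1": [0.5, 0.5], "x2": [1.0, 0.5], "x3": [0.5, 1.0]}
print(classify(("a", "b"), train, labels, W))  # p
```

For the test object ("a", "b"), class p scores (3 + 2)/2 + 2/3 against roughly 0.33 for class q, so the object is assigned to p.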

The algorithm for the proposed FWRSC method is given below:

figure a
Table 1 Datasets
Table 2 Accuracy results for different classification methods

3.3 Experimental results

In this section, we quantitatively evaluate the performance of the proposed classification method. The datasets used in the experiments were taken from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html). The classification performance of FWRSC is compared with as many classification methods as possible from the WEKA workbench (Garner 1995; Markov and Russell 2006; Bouckaert 2013) on five different datasets. The datasets were chosen to cover a wide variety of application areas; for example, if one dataset is from the health industry, the other datasets belong to areas outside of it. Table 1 gives the details of the selected datasets.

Several criteria exist for the evaluation of a classification method. In this paper, accuracy, F-score (F-measure) and mean absolute error (MAE) are used. Accuracy is the proportion of test objects assigned to their correct class. In the statistical analysis of binary classification, the F-score (also F1 score or F-measure) is a measure of a test's accuracy that considers both the precision p and the recall r: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F-score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0. The mean absolute error (MAE) measures how close predictions are to the eventual outcomes.
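As a quick illustration of the three criteria, they can be computed by hand on a small binary test set (the labels below are illustrative only, not from the experiments):

```python
# Hand-computed accuracy, F-score and MAE on a hypothetical binary test set.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correct positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # spurious positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed positives

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)                     # p in the text
recall = tp / (tp + fn)                        # r in the text
f1 = 2 * precision * recall / (precision + recall)
mae = sum(abs(t - p) for t, p in pairs) / len(pairs)

print(round(accuracy, 3), round(f1, 3), round(mae, 3))  # 0.667 0.75 0.333
```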

Table 3 Classification methods having an average accuracy percentage above 60%
Table 4 MAE for different classification methods

The performance of the classification methods is reported in Tables 2, 4 and 5 for accuracy, MAE and F-score, respectively. The best results for each dataset are highlighted in all tables, and the results for the proposed FWRSC method are given in the last row of each table. Consider first the accuracy results in Table 2. While FWRSC has the highest accuracy on the Balloons dataset, its performance on the other datasets is close to that of the best classification method. For example, on the Tic-Tac-Toe dataset, the best accuracy, 63.55%, is obtained with the Rules-OneR classification method, while FWRSC produces the second highest accuracy of 62.30%. To summarize the accuracy results of Table 2 more compactly, only the classification methods with an average accuracy greater than 60% are listed in Table 3. It can clearly be observed from Table 3 that FWRSC produces the best average classification accuracy of 67.74% over all datasets, while giving the best performance on the Balloons dataset.

Table 5 F-measure results for different classification methods

Analyzing the MAE and F-score results given in Tables 4 and 5, respectively, we see that FWRSC gives the best MAE performance on the Haberman's Survival and Abalone datasets and the best F-score performance on the Balloons and Tic-Tac-Toe datasets.

We can also say that the performance of FWRSC on all datasets for MAE and F-score remains only slightly below the overall average of all classifiers. Moreover, FWRSC is among the classifiers that perform best after the OneR classifier for the MAE criterion on the Tic-Tac-Toe dataset in Table 4, which shows that FWRSC performs classification with a low error rate. Additionally, although OneR shows the best performance on the Tic-Tac-Toe dataset in Table 2, FWRSC shows the best performance on this dataset in Table 5. This means that FWRSC produces a more balanced classification than OneR and the other classifiers in the tables. All in all, it can be concluded that FWRSC ranks among the best classification methods found in the literature.

4 Conclusions and future work

In this article, a new classification method named feature weighted rough set classification (FWRSC) was proposed to alleviate classification problems caused by hidden patterns in data. The most important advantages of FWRSC are its reduced complexity, its handling of inconsistency in the dataset and its treatment of hidden classification. RST can find the core and reduct attributes; thus, the core attributes that influence the class of an object more than the others, as well as the attributes that have little impact on classification performance, are determined. As a result, the adverse effects of these classification problems are minimized, and a more efficient classification is made possible. FWRSC was compared with the classification methods in WEKA on five different datasets. The experimental results show that FWRSC gives higher performance than most of the methods in WEKA for many of the datasets. Furthermore, FWRSC gives the best average accuracy, with an overall average score of 67.74% over the five datasets. We believe that FWRSC will find its rightful place among the best classification methods in the literature. Our future goal is to improve the performance of FWRSC by incorporating hybrid RST techniques.