1 Introduction

Information granulation is the process by which a collection of information granules is synthesized, a granule being a collection of values (in the data space) that are drawn towards the center object(s) (in the object space) by an underlying indistinguishability, similarity or functionality mechanism. Note that the data and object spaces may actually coincide [141]. The Granular Computing (GrC) paradigm [7, 183] encompasses several computational models based on fuzzy logic, Computing With Words, interval computing, rough sets, shadowed sets, near sets, etc.

The main purpose behind Granular Computing is to find a novel way to synthesize knowledge in a more human-centric fashion and from vast, unstructured, possibly high-dimensional raw data sources. Not surprisingly, GrC is closely related to Machine Learning [83, 95, 257]. The aim of a learning process is to derive a rule or system for either the automatic classification of the system objects or the prediction of the values of the system control variables. The key challenge with prediction lies in modeling the relationships among the system variables in such a way that the value of the control (target) variable can be inferred.

Rough set theory (RST) [1] was developed by Zdzisław Pawlak in the early 1980s [179] as a mathematical approach to intelligent data analysis and data mining [180]. This methodology is based on the premise that lowering the degree of precision in the data makes the data pattern more visible, i.e., the rough set approach can be formally considered as a framework for pattern discovery from imperfect data [220]. Several reasons are given in [34] to employ RST in knowledge discovery, including:

  • It does not require any preliminary or additional information about the data

  • It provides a valuable analysis even in the presence of incomplete data

  • It allows the interpretation of large amounts of both quantitative and qualitative data

  • It can model highly nonlinear or discontinuous functional relations to provide complex characterizations of data

  • It can discover important facts hidden in the data and represent them in the form of decision rules, and

  • At the same time, the decision rules derived from rough set models are based on facts, because every decision rule is supported by a set of examples.

Mert Bal [3] brought up other RST advantages, such as: (a) it provides a clear interpretation of the results and an evaluation of the meaningfulness of the data; (b) it can identify and characterize uncertain systems; and (c) the patterns discovered using rough sets are concise, strong and sturdy.

Among the main components of the knowledge discovery process we can mention:

  • PREPROCESSING

    • Discretization

    • Training set edition (instance selection)

    • Feature selection

    • Characterization of the learning problem (data complexity, metalearning)

  • KNOWLEDGE DISCOVERY

    • Symbolic inductive learning methods

    • Symbolic implicit learning methods (a.k.a. lazy learning)

  • KNOWLEDGE EVALUATION

    • Evaluation of the discovered knowledge

All of the above stages have witnessed the involvement of rough sets in their algorithmic developments. Some of the RST applications are as follows:

  • Analysis of the attributes to consider

    • Feature selection

    • Inter-attribute dependency characterization

    • Feature reduction

    • Feature weighting

    • Feature discretization

    • Feature removal

  • Formulation of the discovered knowledge

    • Discovery of decision rules

    • Quantification of the uncertainty in the decision rules.

RST’s main components are an information system and an indiscernibility relation. An information system is formally defined as follows. Let \(A = \{A_1, A_2, \ldots , A_n\}\) be a set of attributes characterizing each example (object, entity, situation, state, etc.) in a non-empty set U called the universe of discourse. The pair \((U, A)\) is called an information system. If there exists an attribute \(d \notin A\), called the decision attribute, that represents the decision associated with each example in U, then a decision system \((U,~A \cup \{d\})\) is obtained.

The fact that RST relies on the existence of an information system allows establishing a close relationship with data-driven knowledge discovery processes given that these information or decision systems can be employed as training sets for unsupervised or supervised learning models, respectively.

A binary indiscernibility relation \(I_B\) is associated with each subset of attributes \(B \subseteq A\). This relation contains the pairs of objects that are inseparable from each other given the information expressed in the attributes in B, as shown in Eq. (1).

$$\begin{aligned} I_B = \{(x, y) \in U\times U : f (x, A_i) = f(y, A_i)~~\forall A_i \in B\}. \end{aligned}$$
(1)

where \(f(x, A_i)\) returns the value of attribute \(A_i\) for object \(x \in U\).
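
To make Eq. (1) concrete, the short sketch below builds the equivalence classes induced by \(I_B\) over a toy decision system; the attribute names and the table itself are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

# A toy decision system (U, A ∪ {d}); attribute names and values are made up.
U = [
    {"Headache": "Yes", "Weakness": "Yes", "Temp": "High",   "Influenza": "Yes"},
    {"Headache": "Yes", "Weakness": "No",  "Temp": "High",   "Influenza": "Yes"},
    {"Headache": "Yes", "Weakness": "Yes", "Temp": "High",   "Influenza": "Yes"},
    {"Headache": "No",  "Weakness": "Yes", "Temp": "Normal", "Influenza": "No"},
    {"Headache": "No",  "Weakness": "No",  "Temp": "Normal", "Influenza": "No"},
    {"Headache": "Yes", "Weakness": "Yes", "Temp": "High",   "Influenza": "No"},
]

def equivalence_classes(universe, B):
    """Partition of the universe induced by I_B (Eq. 1): two objects fall in
    the same block iff they agree on every attribute in B."""
    blocks = defaultdict(list)
    for i, x in enumerate(universe):
        blocks[tuple(x[a] for a in B)].append(i)   # f(x, A_i) for all A_i in B
    return list(blocks.values())

print(equivalence_classes(U, ["Headache", "Weakness"]))
# [[0, 2, 5], [1], [3], [4]]
```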

The indiscernibility relation induces a granulation of the information system. Classical RST leans on a particular type of indiscernibility relation called an equivalence relation (i.e., one that is reflexive, symmetric and transitive). An equivalence relation induces a granulation of the universe in the form of a partition. This type of relation works well when there are only nominal attributes and no missing values in the information system.

Information systems having incomplete, continuous, mixed or heterogeneous data are in need of a more flexible type of indiscernibility relation. Subsequent RST formulations relaxed the stringent requirement of having an equivalence relation by considering either a tolerance or a similarity relation [61, 68, 181, 207, 212, 231, 283, 284, 305, 306]; these relations induce a covering of the universe rather than a partition. Another relaxation avenue is based on the probabilistic approach [65, 182, 210, 259, 264, 267, 307]. A third alternative is the hybridization with fuzzy set theory [54, 55, 172, 258, 280]. These different approaches have contributed to positioning RST as an important component within Soft Computing [12].

All of the aforementioned RST formulations retain some basic definitions, such as the lower and upper approximations; however, they define them in multiple ways. The canonical RST definition for the lower approximation of a concept X is \(B_*(X) = \{x \in U: B(x) \subseteq X\}\), whereas its upper approximation is \(B^*(X) = \{x \in U: B(x) \cap X \ne \emptyset \}\), where \(B(x)\) denotes the set of objects indiscernible from x given the attributes in B (i.e., its equivalence class under \(I_B\)). From these approximations we can compute the positive region \(POS (X) = B_* (X)\), the boundary region \(BND (X) = B^*(X) - B_*(X)\) and the negative region \(NEG (X) = U - B^* (X)\). These concepts serve as building blocks for developing many problem-solving approaches, including data-driven learning.
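
As an illustration of these definitions (a minimal sketch reusing the partition computed in the previous snippet; it is not taken from any of the cited works), the code below derives \(B_*(X)\), \(B^*(X)\) and the three regions for a concept X given as a set of object indices.

```python
def regions(universe_size, blocks, X):
    """Lower/upper approximations and POS/BND/NEG regions of a concept X.
    `blocks` is the partition induced by I_B (lists of object indices),
    X is the set of indices of the objects belonging to the concept."""
    lower, upper = set(), set()
    for block in map(set, blocks):
        if block <= X:        # B(x) ⊆ X      -> part of the lower approximation
            lower |= block
        if block & X:         # B(x) ∩ X ≠ ∅  -> part of the upper approximation
            upper |= block
    return {"POS": lower,
            "BND": upper - lower,
            "NEG": set(range(universe_size)) - upper}

# Concept X = {0, 1, 2} (the objects labeled Influenza = Yes in the toy table):
print(regions(6, [[0, 2, 5], [1], [3], [4]], {0, 1, 2}))
# {'POS': {1}, 'BND': {0, 2, 5}, 'NEG': {3, 4}}
```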

RST and Machine Learning are also related in that both are concerned with removing irrelevant/redundant attributes. This process is termed feature selection, and RST approaches it from the standpoint of calculating the system reducts. Given an information system \(S = (U, A)\), where U is the universe and A is the set of attributes, a reduct is a minimal set of attributes \(B \subseteq A\) such that \(I_B = I_A\).
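
On very small systems the reduct definition can be verified exhaustively. The sketch below (illustrative only; practical reduct computation relies on the heuristics surveyed in Sect. 2.1.2) enumerates the minimal attribute subsets B with \(I_B = I_A\).

```python
from itertools import combinations

def partition(universe, B):
    """Frozen partition of the universe induced by the attributes in B."""
    blocks = {}
    for i, x in enumerate(universe):
        blocks.setdefault(tuple(x[a] for a in B), []).append(i)
    return frozenset(frozenset(b) for b in blocks.values())

def all_reducts(universe, A):
    """All minimal subsets B ⊆ A inducing the same partition as A.
    Exponential in |A|; meant only for tiny illustrative systems."""
    target, found = partition(universe, A), []
    for k in range(1, len(A) + 1):
        for B in combinations(A, k):
            if partition(universe, B) == target and \
               not any(set(r) <= set(B) for r in found):
                found.append(B)
    return found

# Usage (with the toy system above): all_reducts(U, ["Headache", "Weakness", "Temp"])
```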

This chapter emphasizes the role played by RST within the broad field of Machine Learning (ML). As a sound data analysis and knowledge discovery paradigm, RST has much to offer to the ML community. We survey the existing literature and report on the most relevant RST theoretical developments and applications in this area. The review starts with RST in the context of data preprocessing (discretization, feature selection, instance selection and meta-learning) as well as the generation of both descriptive and predictive knowledge via decision rule induction, association rule mining and clustering. Afterwards, we examine several special ML scenarios in which RST has been recently introduced, such as imbalanced classification, multi-label classification, dynamic/incremental learning, Big Data analysis and cost-sensitive learning.

The rest of the chapter is structured as follows. Section 2 reviews ML methods and processes from an RST standpoint, with emphasis on data preprocessing and knowledge discovery. Section 3 unveils special ML scenarios that are being gradually permeated by RST-based approaches, including imbalanced classification, multi-label classification, dynamic/incremental learning, Big Data analysis and cost-sensitive learning. Section 4 categorizes the surveyed references according to the ML task they address and Section 5 concludes the chapter.

2 Machine Learning Methods and RST

This section briefly goes over reported studies showcasing RST as a tool in data preprocessing and descriptive/predictive knowledge discovery.

2.1 Preprocessing

2.1.1 Discretization

As mentioned in [195], discretization is the process of converting a numerical attribute into a nominal one by applying a set of cuts to the domain of the numerical attribute and treating each interval as a discrete value of the (now nominal) attribute. Discretization is a mandatory step when processing information systems with the canonical RST formulation, as it makes no provision for handling numerical attributes. Some RST extensions avoid this issue by, for example, using similarity classes instead of equivalence classes and building a similarity relation that encompasses both nominal and numerical attributes.

It is very important that any discretization method chosen in the context of RST-based data analysis preserves the underlying discernibility among the objects. The level of granularity at which the cuts are performed in the discretization step will have a direct impact on any ensuing prediction, i.e., generic (wider) intervals (cuts) will likely avoid overfitting when predicting the class for an unseen object.

Dougherty et al. [53] categorize discretization methods along three axes:

  • global versus local: indicates whether an approach simultaneously converts all numerical attributes (global) or is restricted to a single numerical attribute (local). For instance, the authors in [174] suggest both local and global handling of numerical attributes in large databases.

  • supervised versus unsupervised: indicates whether or not an approach considers the values of the decision (class) attribute in the discretization process. A simple example of an unsupervised approach is an “equal width” interval method that works by dividing the range of a continuous attribute into k equal intervals, where k is given (a minimal sketch of this case appears after this list). A supervised discretization method, for example, will consider the correlation between the numerical attribute and the label (class) attribute when choosing the location of the cuts.

  • static versus dynamic: indicates whether an approach requires a parameter to determine the number of cut values or not. Dynamic approaches automatically generate this number along the discretization process whereas static methods require an a priori specification of this parameter.
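
As a concrete instance of the unsupervised, static case mentioned in the list above, the following minimal sketch implements the “equal width” method (an illustration only, not one of the RST-aware discretizers discussed next); the example values are hypothetical.

```python
def equal_width_cuts(values, k):
    """Return the k-1 interior cut points splitting the attribute's range
    into k intervals of equal width (unsupervised, static discretization)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def discretize(value, cuts):
    """Map a numerical value to the index of the interval it falls into."""
    return sum(value > c for c in cuts)

temps = [36.5, 37.0, 38.2, 39.5, 36.8, 40.1]        # hypothetical readings
cuts = equal_width_cuts(temps, k=3)                 # ~ [37.7, 38.9]
print([discretize(t, cuts) for t in temps])         # -> [0, 0, 1, 2, 0, 2]
```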

Lenarcik and Piasta [128] introduced an RST-based discretization method that leans on the concepts of a random information system and of an expected value of classification quality. A method for finding suboptimal discretizations based on these concepts is presented and illustrated with data from investigations of the frost resistance of concretes.

Nguyen [173] considers the problem of searching for a minimal set of cuts that preserves the discernibility between objects with respect to any subset of s attributes, where s is a user-defined parameter. It was shown that this problem is NP-hard and its heuristic solution is more complicated than that for the problem of searching for an optimal, consistent set of cuts. The author proposed a scheme based on Boolean reasoning to solve this problem.

Bazan [5] put forth a method to search for an irreducible set of cuts of an information system. The method is based on the notion of a dynamic reduct. These reducts are calculated for the information system and the one with the best stability coefficient is chosen. Next, the cuts belonging to the chosen dynamic reduct are selected as the irreducible set of cuts.

Bazan et al. [6] proposed a discretization technique named maximal discernibility (MD), which is based on rough sets and Boolean reasoning. MD is a greedy heuristic that searches for cuts along the domains of all numerical attributes that discern the largest number of object pairs in the dataset. These object pairs are removed from the information system before the next cut is sought. The set of cuts obtained that way is optimal in terms of object discernibility; however, the procedure is not feasible for large datasets since computing one cut requires \(O(|A|\cdot |U|^3)\) steps. Locally optimal cuts [6] are computed in \(O(|A|\cdot |U|)\) steps using only \(O(|A|\cdot |U|)\) space.
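
The greedy MD strategy can be sketched as follows for a single numerical attribute (an illustrative reconstruction of the idea described above, not the authors' implementation): candidate cuts are the midpoints between consecutive distinct values, each cut is scored by the number of object pairs with different decisions it separates, and the best cut is chosen repeatedly until no discernible pair remains.

```python
def md_cuts(values, decisions):
    """Greedy maximal-discernibility cut selection for one numerical attribute."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    # Object pairs that still have to be discerned (different decisions).
    pairs = {(i, j) for i in range(len(values)) for j in range(i + 1, len(values))
             if decisions[i] != decisions[j]}
    chosen = []
    while pairs and candidates:
        def discerned(c):               # pairs separated by cut c
            return {(i, j) for (i, j) in pairs
                    if (values[i] < c) != (values[j] < c)}
        best = max(candidates, key=lambda c: len(discerned(c)))
        resolved = discerned(best)
        if not resolved:
            break                       # the remaining pairs cannot be separated
        chosen.append(best)
        candidates.remove(best)
        pairs -= resolved
    return chosen

print(md_cuts([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))   # -> [2.5]
```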

Dai and Li [46] improved Nguyen’s discretization techniques by reducing the time and space complexity required to arrive at the set of candidate cuts. They proved that all bound cuts can discern the same object pairs as the entire set of initial cuts. A strategy to select candidate cuts was proposed based on that proof. They obtained identical results to Nguyen’s with a lower computational overhead.

Chen et al. [26] employ a genetic algorithm (GA) to derive the minimal cut set in a numerical attribute. Each gene in a binary chromosome represents a particular cut value. Enabling this gene means the corresponding cut value has been selected as a member of the minimal cut set. Some optimization strategies such as elitist selection and father-offspring combined selection helped the GA converge faster. The experimental evidence showed that the GA-based scheme is more efficient than Nguyen’s basic heuristic based on rough sets and Boolean reasoning.

Xie et al. [249] defined an information entropy value for every candidate cut point in their RST-based discretization algorithm. The final cut points are selected based on this metric and some RST properties. The authors report that their approach outperforms other discretization techniques and scales well with the number of cut points.

Su and Hsu [219] extended the modified Chi2 discretizer by learning the predefined misclassification rate (input parameter) from data. The authors additionally considered the effect of variance in the two adjacent intervals. In the modified Chi2, the inconsistency check in the original Chi2 is replaced with the “quality of approximation” measure from RST. The result is a more robust, parameterless discretization method.

Singh and Minz [205] designed a hybrid clustering-RST-based discretizer. The values of each numerical attribute are grouped using density-based clustering algorithms. This produces a set of (possibly overlapping) intervals that naturally reflect the data distribution. Then, the rough membership function in RST is employed to refine these intervals in a way that maximizes class separability. The proposed scheme yielded promising results when compared to seven other discretizers.

Jun and Zhou [116] enhanced existing RST-based discretizers by (i) computing the candidate cuts with an awareness of the decision class information, so that the number of candidate cuts can be remarkably reduced, thus considerably saving time and space, and (ii) introducing a notion of cut selection probability that measures cut significance in a more reasonable manner. Theoretical analyses and simulation experiments show that the proposed approaches can solve the data discretization problem more efficiently and effectively.

2.1.2 Feature Selection

The purpose behind feature selection is to discard irrelevant features, which are generally detrimental to the classifier’s performance, generate noise, and increase both the amount of information to be stored and the computational cost of the classification process [222, 302]. Feature selection is a computationally expensive problem that requires searching a space of \(2^n-1\) candidate subsets of the n original features according to a predefined evaluation criterion. The main components of a feature selection algorithm are: (1) an evaluation function (EF), used to calculate the fitness of a feature subset, and (2) a generation procedure responsible for producing different subsets of candidate features.

Different feature selection schemes that integrate RST into the feature subset evaluation function have been developed. The quality of the classification \(\gamma \) is the most frequently used RST metric to judge the suitability of a candidate feature subset, as shown in [9,10,11, 64] etc. Other indicators are conditional independence [208] and approximate entropy [209].

The concept of reduct is the basis for these results. Essentially, a reduct is a minimal subset of features that generates the same granulation of the universe as that induced by all features. Among these works we can list [37, 38, 85, 89, 111, 136, 168, 196, 221, 223, 239, 247, 248, 255, 270, 302]. One of the pioneer methods is the QuickReduct algorithm, which is typical of those algorithms that resort to a greedy search strategy to find a relative reduct [136, 202, 247]. Generally speaking, feature selection algorithms are based on heuristic search [97, 164, 302]. Other RST-based methods for reduct calculation are [98, 209].
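
A minimal sketch of this greedy strategy, in the spirit of QuickReduct and using the dependency degree \(\gamma\) as the evaluation function, is given below (an illustration under the dict-based decision-system representation used earlier, not the reference implementation).

```python
def gamma(universe, B, d):
    """Quality of classification: fraction of objects lying in the positive
    region of the partition induced by the attributes in B w.r.t. decision d."""
    blocks = {}
    for i, x in enumerate(universe):
        blocks.setdefault(tuple(x[a] for a in B), []).append(i)
    consistent = sum(len(idx) for idx in blocks.values()
                     if len({universe[i][d] for i in idx}) == 1)
    return consistent / len(universe)

def quick_reduct(universe, A, d):
    """Greedily add the attribute that most increases gamma until the
    dependency of the full attribute set A is matched."""
    B, best, target = [], 0.0, gamma(universe, A, d)
    while best < target:
        scores = {a: gamma(universe, B + [a], d) for a in A if a not in B}
        a_best = max(scores, key=scores.get)
        if scores[a_best] <= best:
            break                        # no remaining attribute helps
        B, best = B + [a_best], scores[a_best]
    return B

# Usage (with the toy system above):
# quick_reduct(U, ["Headache", "Weakness", "Temp"], "Influenza")
```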

More advanced methods employ metaheuristic algorithms (such as Genetic Algorithms, Ant Colony Optimization or Particle Swarm Optimization) as the underlying feature subset generation engine [8,9,10,11, 15, 64, 102, 119, 241, 242, 245, 246, 268, 274, 297]. Feature selection methods based on the hybridization between fuzzy and rough sets have been proposed in [13, 28, 42,43,44, 51, 75, 87, 90, 92, 101, 103,104,105, 125, 193, 197, 203, 225, 299]. Some studies aim at calculating all possible reducts of a decision system [27, 28, 206, 225, 299].

Feature selection is arguably the Machine Learning (ML) area that has witnessed the most influx of rough-set-based methods. Other RST contributions to ML are concerned with providing metrics to calculate the inter-attribute dependence and the importance (weight) of any attribute [120, 222].

2.1.3 Instance Selection

Another important data preprocessing task is the editing of the training set, also referred to as instance selection. The aim is to reduce the number of examples in the training set while maintaining the system’s efficiency. By doing that, a new, smaller training set is obtained that usually brings forth a higher efficiency of the ensuing learning process.

Some training set edition approaches using rough sets have been published in [16, 19]. The simplest idea is to remove all examples in the training set that are not contained in the lower approximation of any of the decision classes. A more thorough investigation also considers those examples that lie in the boundary region of any of the decision classes. Fuzzy rough sets have been also applied to the instance selection problem in [99, 232, 233].
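
The simplest editing idea mentioned above admits a very short sketch (again using the illustrative dict-based decision-system representation): keep only the examples that belong to the lower approximation of their own decision class, i.e., discard the objects lying in a boundary region.

```python
from collections import defaultdict

def edit_training_set(universe, B, d):
    """Keep the objects belonging to the lower approximation of some decision
    class; objects in any boundary region are dropped."""
    blocks = defaultdict(list)
    for i, x in enumerate(universe):
        blocks[tuple(x[a] for a in B)].append(i)
    kept = []
    for idx in blocks.values():
        if len({universe[i][d] for i in idx}) == 1:   # consistent block
            kept.extend(idx)
    return [universe[i] for i in sorted(kept)]
```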

2.1.4 Meta-Learning

An important area within knowledge discovery is that of meta-learning, whose objective is to learn about the underlying learning processes in order to make them more efficient or effective [234]. These methods may consider measures related to the complexity of the data [79]. The study in [18] explores the use of RST-based metrics to estimate the quality of a data set. The relationship between the “quality of approximation” measure and the performance of some classifiers is investigated in [17]. This measure describes the inexactness of the rough-set-based classification and denotes the percentage of examples that can be correctly classified employing the attributes included in the indiscernibility relation [224]. The authors in [251] analyze the inclusion degree as a perspective on measures for rough set data analysis (RSDA). Other RSDA measures are the “accuracy of the approximation” and the rough membership function [120]; for example, in [108, 109], the rough membership function and other RST-based measures are employed to detect outliers (i.e., examples that behave in an unexpected way or have abnormal properties).
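
The rough membership function mentioned above has an equally compact form, \(\mu_B^X(x) = |B(x) \cap X| / |B(x)|\); the sketch below is an illustrative computation (same toy representation as before), and a low membership of an object in its own decision class can be read as a hint of abnormality in the spirit of [108, 109].

```python
def rough_membership(universe, B, x_index, X):
    """mu_B^X(x) = |B(x) ∩ X| / |B(x)|, where B(x) is the block of objects
    indiscernible from x on the attributes in B and X is a set of indices."""
    key = tuple(universe[x_index][a] for a in B)
    block = [i for i, y in enumerate(universe)
             if tuple(y[a] for a in B) == key]
    return sum(1 for i in block if i in X) / len(block)

# Example with the toy system: object 5 is labeled "No" but is indiscernible
# from objects 0 and 2 (labeled "Yes"), so its membership in its own class is low:
# rough_membership(U, ["Headache", "Weakness"], 5, {3, 4, 5})  ->  1/3
```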

2.2 Descriptive and Predictive Knowledge Discovery

2.2.1 Decision Rule Induction

The knowledge uncovered by the different data analysis techniques can be either descriptive or predictive. The former characterizes the general properties of the data in the data set (e.g., association rules) while the latter allows performing inferences from the available data (e.g., decision rules). A decision rule summarizes the relationship between the properties (features) and the decision, describing it as a causal IF-THEN relationship; for example, IF Headache = Yes AND Weakness = Yes THEN Influenza = Yes. The most common rule induction task is to generate a rule base R that is both consistent and complete.
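
A decision rule such as the one above can be represented as a set of attribute-value conditions plus a decision; the short sketch below (purely illustrative, not one of the induction algorithms discussed next) shows how such a rule is matched against an unseen example.

```python
# A rule is a pair (conditions, decision); conditions map attributes to the
# required values. This encodes the influenza rule from the text.
rule = ({"Headache": "Yes", "Weakness": "Yes"}, ("Influenza", "Yes"))

def fires(rule, example):
    """True if every condition of the rule is satisfied by the example."""
    conditions, _ = rule
    return all(example.get(a) == v for a, v in conditions.items())

patient = {"Headache": "Yes", "Weakness": "Yes", "Temp": "Normal"}
if fires(rule, patient):
    attribute, value = rule[1]
    print(f"{attribute} = {value}")        # -> Influenza = Yes
```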

According to [161], RST-based rule induction methods provide the following benefits:

  • Better explanation capabilities

  • Generate a simple and useful set of rules.

  • Work with sparse training sets.

  • Work even when the underlying data distribution significantly deviates from the normal distribution.

  • Work with incomplete, inaccurate, and heterogeneous data.

  • Usually faster execution time to generate the rule base compared to other methods.

  • No assumptions made on the size or distribution of the training data.

Among the most popular RST-based rule induction methods we can cite LERS [67, 215], which includes the LEM1 (Learn from examples model v1) and LEM2 (Learn from examples model v2) methods; the goal is to extract a minimum set of rules covering the examples by exploring the search space of attribute-value pairs while taking into account possible data inconsistency issues. MODLEM [214, 215] is based on sequentially building coverings of the training data and generating minimal decision rule sets for each decision class. Each of these sets aims at covering all positive examples that belong to a concept and none from any other concept. The EXPLORE algorithm [216] extracts from data all the decision rules satisfying certain requirements. It can be adapted to handle inconsistent examples. The LEM2, EXPLORE and MODLEM rule induction algorithms are implemented in the ROSE2 software [3]. Filiberto et al. proposed the IRBASIR method [62], which generates decision rules using an RST extension rooted in similarity relations; another technique is put forth in [121] to discover rules using similarity relations for incomplete data sets. This learning problem in the presence of missing data is also addressed in [80].

Other RST-based rule induction algorithms available in the literature are [3, 14, 63, 110, 118, 129, 154, 179, 228, 229]. The use of hybrid models based on rough sets and fuzzy sets for rule induction and other knowledge discovery methods is illustrated in [2, 24, 41, 100, 123, 159, 201, 298, 300], which includes working with the so-called “fuzzy decision information systems” [2].

One of the most popular rule induction methods based on rough sets is the so-called three-way decisions model [81, 260,261,262,263]. This methodology is strongly related to decision making. Essentially, for each decision alternative, the method defines three rules based on RST’s positive, negative and boundary regions; they respectively indicate acceptance, rejection or abstention (non-commitment, denoting weak or insufficient evidence).
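
In a common formulation of this model (decision-theoretic rough sets with a pair of thresholds \(\alpha > \beta\); the sketch below is illustrative and the threshold values are hypothetical), the conditional probability of the concept within an object's equivalence class determines which of the three rules fires.

```python
def three_way_decision(block, X, alpha=0.7, beta=0.3):
    """Accept / reject / abstain for one equivalence class, based on
    Pr(X | block) and the thresholds alpha > beta (hypothetical values).
    `block` and X are sets of object indices."""
    pr = len(block & X) / len(block)
    if pr >= alpha:
        return "accept"      # block lies in the (probabilistic) positive region
    if pr <= beta:
        return "reject"      # block lies in the negative region
    return "abstain"         # boundary region: weak or insufficient evidence

print(three_way_decision({0, 2, 5}, {0, 1, 2}))    # Pr = 2/3 -> 'abstain'
```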

This type of rule, derived from the basic RST concepts, is a suitable knowledge representation vehicle in a plethora of application domains. Hence, it has been integrated into common machine learning tasks to facilitate the knowledge engineering process required for a successful modeling of the domain under consideration. The three-way decisions model has been adopted in feature selection [106, 107, 133, 163, 265, 293], classification [273, 281, 282, 293], clustering [276, 277] and face recognition [132, 289].

2.2.2 Association Rule Mining

The discovery of association rules is one of the classical data mining tasks. Its goal is to uncover relationships among attributes that frequently appear together, i.e., the presence of one implies the presence of the other. One of the typical examples is the purchase of beer and diapers during the weekends. Association rules are representative of descriptive knowledge. A particular case is that of the so-called “class association rules”, which are used to build classifiers. Several methods have been developed for discovering association rules using rough sets, including [49, 70, 94, 111, 127, 134, 211, 266].

2.2.3 Clustering

The clustering problem is another learning task that has been approached from a rough set perspective. Clustering is a landmark unsupervised learning problem whose main objective is to group similar objects in the same cluster and separate objects that are different from each other by assigning them to different clusters [96, 167]. The objects are grouped in such a way that those in the same group exhibit a high degree of association among them whereas those in different groups show a low degree of association. Clustering algorithms map the original N-dimensional feature space to a 1-dimensional space describing the cluster each object belongs to. This is why clustering is considered both an important dimensionality reduction technique and also one of the most prevalent Granular Computing [183] manifestations.

One of the most popular and efficient clustering algorithms for conventional applications is K-means clustering [71]. In the K-means approach, randomly selected objects serve as initial cluster centroids. The objects are then assigned to different clusters based on their distance to the centroids. In particular, an object gets assigned to the cluster with the nearest centroid. The newly modified clusters then employ this information to determine new centroids. The process continues iteratively until the cluster centroids are stabilized. K-means is a very simple clustering algorithm, easy to understand and implement. The underlying alternate optimization approach iteratively converges but might get trapped into a local minimum of the objective function. K-means’ best performance is attained in those applications where clusters are well separated and a crisp (bivalent) object-to-cluster decision is required. Its disadvantages include the sensitivity to outliers and the initial cluster centroids as well as the a priori specification of the desired number of clusters k.

Pawan Lingras [142, 145] found that the K-means algorithm often yields clustering results with unclear, vague boundaries. He pointed out that the “hard partitioning” performed by K-means does not meet the needs of grouping vague data. Lingras then proposed to combine K-means with RST in the so-called “Rough K-means” approach. In this technique, each cluster is modeled as a rough set and each object belongs either to the lower approximation of a cluster or to the upper approximation of multiple clusters. Instead of building each cluster directly, its lower and upper approximations are defined based on the available data. The basic properties of the Rough K-means method are: (i) an object can be a member of at most one lower approximation; (ii) an object that is a member of the lower approximation of a cluster is also a member of its upper approximation and (iii) an object that does not belong to the lower approximation of any cluster is a member of the upper approximations of at least two clusters. Other pioneering works on rough clustering methods are put forth in [78, 192, 235, 236].
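
One iteration of Rough K-means can be sketched as follows (an illustrative reconstruction of the assignment and centroid-update ideas just described, with hypothetical parameter values, not Lingras' reference implementation): an object whose distances to several centroids are close enough is placed in the upper approximations of all of those clusters; otherwise it enters the lower (and upper) approximation of the nearest cluster. The centroids are then recomputed as a weighted mix of the lower-approximation mean and the boundary mean.

```python
import numpy as np

def rough_kmeans_step(X, centroids, epsilon=0.5, w_lower=0.7, w_upper=0.3):
    """One assignment + update step of Rough K-means (illustrative sketch).
    X: (n, d) data matrix; centroids: (k, d) array; epsilon, w_lower and
    w_upper are hypothetical parameter values."""
    k = len(centroids)
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for i, x in enumerate(X):
        d = np.linalg.norm(centroids - x, axis=1)
        nearest = int(np.argmin(d))
        # Clusters whose centroid is almost as close as the nearest one.
        close = {j for j in range(k) if d[j] - d[nearest] <= epsilon}
        if len(close) > 1:
            for j in close:              # boundary object: several uppers
                upper[j].add(i)
        else:                            # clear-cut object: lower + upper
            lower[nearest].add(i)
            upper[nearest].add(i)
    new_centroids = []
    for j in range(k):
        boundary = upper[j] - lower[j]
        if lower[j] and boundary:
            c = (w_lower * X[list(lower[j])].mean(axis=0)
                 + w_upper * X[list(boundary)].mean(axis=0))
        elif lower[j]:
            c = X[list(lower[j])].mean(axis=0)
        elif boundary:
            c = X[list(boundary)].mean(axis=0)
        else:
            c = centroids[j]             # empty cluster: keep the old centroid
        new_centroids.append(c)
    return lower, upper, np.array(new_centroids)
```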

Rough K-means has been the subject of several subsequent studies aimed at improving its clustering capabilities. Georg Peters [187] concludes that rough clustering offers the possibility of reducing the number of incorrectly clustered objects, which is relevant to many real-world applications where minimizing the number of wrongly grouped objects is more important than maximizing the number of correctly grouped objects. Hence, in these scenarios Rough K-means arises as a powerful alternative to K-means. The same author proposes some improvements to the method regarding the calculation of the centroids, aiming to make the method more stable and robust to outliers [184, 185]. The authors in [291] proposed a Rough K-means improvement based on a variable weighted distance measure. Another enhancement brought forward in [186] suggested that well-defined objects must have a greater impact on the cluster centroid calculation rather than having this impact be governed by the number of cluster boundaries an object belongs to, as proposed in the original method. An extension to Rough K-means based on the decision-theoretic rough set model was developed in [130]. An evolutionary approach for rough partitive clustering was designed in [168, 189] while [45, 190] elaborate on dynamic rough clustering approaches.

Other works that tackle the clustering problem using rough sets are [35, 72, 76, 77, 122, 124, 135, 143, 144, 162, 177, 178, 213, 271, 272, 275, 292]. These methods handle more specific scenarios (such as sequential, imbalanced, categorical and ordinal data), as well as applications of this clustering approach to different domains. The rough-fuzzy K-means method is put forward in [88, 170] whereas the fuzzy-rough K-means is unveiled in [169, 188]. Both approaches amalgamate the main features of Rough K-means and Fuzzy C-means by using the fuzzy membership of the objects to the rough clusters. Other variants of fuzzy and rough set hybridization for the clustering problem are presented in [56, 126, 160, 171].

3 Special Learning Cases Based on RST

This section elaborates on more recent ML scenarios tackled by RST-based approaches. In particular, we review the cases of imbalanced classification, multi-label classification, dynamic/incremental learning, Big Data analysis and cost-sensitive learning.

3.1 Imbalanced Classification

The traditional knowledge discovery methods presented in the previous section have to be adapted when dealing with an imbalanced dataset [21]. A dataset is balanced if it has an approximately equal percentage of positive and negative examples (i.e., those belonging to the concept to be classified and those belonging to other concepts, respectively). However, there are many application domains where we find an imbalanced dataset; for instance, in healthcare scenarios the vast majority of patients do not suffer from a particular rare disease, so positive examples are scarce. Similarly, when learning a normalcy model for a certain environment, the number of labeled anomalous events is often small as most of the data corresponds to normal behaviour. The problem with imbalanced classes is that classification algorithms tend to favor the majority class. This occurs because the classifier attempts to reduce the overall error, so the classification error does not take into account the underlying data distribution [23].

Several solutions have been researched to deal with this kind of situation. Two of the most popular avenues are either resampling the training data (i.e., oversampling the minority class or undersampling the majority class) or modifying the learning method [153]. One of the classical methods for learning with imbalanced data is SMOTE (synthetic minority oversampling technique) [22]. Different learning methods for imbalanced classification have been developed from an RST-based standpoint. For instance, Hu et al. [91] proposed models based on probabilistic rough sets where each example has an associated probability p(x) instead of the default 1/n. Ma et al. [158] introduced weights in the variable-precision rough set model (VPRS) to denote the importance of each example. Liu et al. [153] bring some weights into the RST formulation to balance the class distribution and develop a method based on weighted rough sets to solve the imbalanced class learning problem. Ramentol et al. [194] proposed a method that integrates SMOTE with RST.

Stefanowski et al. [217] introduced filtering techniques to process inconsistent examples of the majority class (i.e., those lying in the boundary region), thereby adapting the MODLEM rule extraction method for coping with imbalanced learning problems. Other RST-based rule induction methods in the context of imbalanced data are also presented in [152, 243]. The authors in [218] proposed the EXPLORE method that generates rules for the minority class with a minimum coverage equal to a user-specified threshold.

3.2 Multi-label Classification

Normally, in a typical classification problem, a class (label) \(c_i\) from a set \(C = \{c_1, \ldots , c_k\}\) is assigned to each example. However, in multi-label classification, a subset \(S \subseteq C\) is assigned to each example, which means that an example could belong to multiple classes. Some applications of this type of learning emerge from text classification and functional genomics, namely, assigning functions to genes [226]. This gives rise to the so-called multi-label learning problem. The two avenues envisioned for solving this new class of learning problems have considered either converting the multi-label scenario to a single-label (classical) scenario or adapting the learning methods. Examples of the latter trend are the schemes proposed in [47, 198, 227, 290]. Similar approaches have been proposed for multi-label learning using rough sets. A first alternative is to transform the multi-label problem into a traditional single-label case and use classical RST-based learning methods to derive the rules (or any other knowledge); the other option is to adapt the RST-based learning methods, as shown in [240, 278, 279, 288].

In the first case, a decision system can be generated where some instances could belong to multiple classes. Multi-label classification can be regarded as an inconsistent decision problem, in which two objects having the same predictive attribute values do not share the same decision class. This leads to the modification of the definition of the lower/upper approximations through a probabilistic approach that facilitates modeling the uncertainty generated by the inconsistent system. This idea gives rise to the so-called multi-label rough set model, which incorporates a probabilistic approach such as the decision-theoretic rough set model. Some RST-based feature selection methods in multi-label learning scenarios have been enunciated [131], where the reduct concept was reformulated for the multi-label case.

3.3 Dynamic/Incremental Learning

Data are continuously being updated in today’s information systems. New data are added and obsolete data are purged over time. Traditional batch-learning methods rely on rerunning the algorithms on all of the data whenever the information is updated, which obviously affects the system efficiency while ignoring any previous learning. Instead, learning should occur as new information arrives. Managing this learning while adapting the previously acquired knowledge is the essence of incremental learning. This term refers to an efficient strategy for the analysis of data in dynamic environments that allows acquiring additional knowledge from an uninterrupted information flow. The advantage of incremental learning is not having to analyze the data from scratch but rather utilizing the previous outcomes of the learning process as much as possible [57, 73, 112, 176, 200]. The continuous and massive acquisition of data becomes a challenge for knowledge discovery; especially in the context of Big Data, it becomes necessary to develop the capacity to assimilate continuous data streams [29].

As an information-based methodology, RST is not exempt from being scrutinized in the context of dynamic data. The fundamental RST concepts and the knowledge discovery methods ensuing from them are geared towards the analysis of static data; hence, they need to be thoroughly revised in light of the requirements posed by data stream mining systems [151]. The purpose of the incremental learning strategy in rough sets is the development of incremental algorithms to quickly update the concept approximations, the reduct calculation or the discovered decision rules [40, 284]. The direct precursor of these studies can be found in [175]. According to [149], in recent years RST-based incremental learning approaches have become “hot topics” in knowledge extraction from dynamic data given their proven data analysis efficiency.

The study of RST in the context of learning with dynamic data can be approached from two different angles: what kind of information is considered to be dynamic and what type of learning task must be carried out. In the first case, the RST-based incremental updating approach could be further subdivided into three alternatives: (i) object variation (insertion or deletion of objects in the universe), (ii) attribute variation (insertion/removal of attributes) and (iii) attribute value variation (insertion/deletion of attribute values). In the second case, we can mention (i) incremental learning of the concept approximations [33, 139]; (ii) incremental learning of attribute reduction [52, 140, 237, 238, 250] and (iii) incremental learning of decision rules [59, 66, 148, 301].

Object variations include so-called object immigration and emigration [148]. Variations of the attributes include feature insertion or deletion [138, 287]. Variations in attribute values are primarily manifested via the refinement or scaling of the attribute values [32, 146]. Other works that propose modifications to RST-based methods for the case of dynamic data are [147, 149, 157].

The following studies deal with dynamic object variation:

  • The update of the lower and upper approximations of the target concept is analyzed in [33, 137, 156].

  • The update in the reduction of attributes is studied in [82, 250].

  • The update of the decision rule induction mechanism is discussed in [4, 40, 59, 93, 148, 199, 230, 244, 269, 301].

If the variation occurs in the set of attributes, its effects have been studied with respect to these aspects:

  • The update of the lower and upper approximations of the target concept is analyzed in [20, 36, 138, 139, 150, 287].

  • The update of the decision rule induction mechanism is discussed in [39].

The effect of the variations in the attribute values (namely, via refinement or extension of the attribute domains) with respect to the update of the lower and upper approximations of the target concept is analyzed in [30,31,32, 50, 237, 308].

The calculation of reducts for dynamic data has also been investigated. The effect when the set of attributes varies is studied in [39]. The case of varying the attribute values is explored in [50, 69] whereas the case of dynamic object update is dissected in [199, 244]. Other studies on how dynamic data affect the calculation of reducts appear in [140, 204, 237, 238].

3.4 Rough Sets and Big Data

The accelerated pace of technology has led to an exponential growth in the generation and collection of digital information. This growth is not only limited to the amount of data available but also extends to the plethora of diverse sources that emit these data streams. It becomes paramount, then, to efficiently analyze and extract knowledge from many dissimilar information sources within a certain application domain. This has led to the emergence of the Big Data era [25], which has a direct impact on the development of RST and its applications. Granular Computing, our starting point in this chapter, has a strong relation to Big Data [25], as its inherent ability to process information at multiple levels of abstraction and interpret information from different perspectives greatly facilitates the efficient management of large data volumes.

Simply put, Big Data can be envisioned as a large and complex data collection. These data are very difficult to analyze through traditional data management and processing tools. Big Data scenarios require new architectures, techniques, algorithms and processes to manage and extract value and knowledge hidden in the data streams. Big Data is often characterized by the 5 V’s vector: Volume, Velocity, Variety, Veracity and Value. Big Data includes both structured and unstructured data, including images, videos, textual reports, etc. Big Data frameworks such as MapReduce and Spark have been recently developed and constitute indispensable tools for the accurate and seamless knowledge extraction from an array of disparate data sources. For more information on the Big Data paradigm, the reader is referred to the following articles: [25, 48, 60, 117].

As a data analysis and information extraction methodology, RST needs to adapt and evolve in order to cope with this new phenomenon. A major motivation to do so lies in the fact that the sizes of nowadays’ decision systems are already extremely large. This poses a significant challenge to the efficient calculation of the underlying RST concepts and the knowledge discovery methods that emanate from them. Recall that the computational complexity of computing the target concept’s approximations is \(O(lm^2)\), the computational cost of finding a reduct is bounded by \(O(l^2m^2)\) and the time complexity to find all reducts is \(O(2^{l}J)\), where l is the number of attributes characterizing the objects, m is the number of objects in the universe and J is the computational cost required to calculate a reduct.

Some researchers have proposed RST-based solutions to the Big Data challenge [191, 286]. These methods are concerned with the design of parallel algorithms to compute equivalence classes, decision classes, associations between equivalence classes and decision classes, approximations, and so on. They are based on partitioning the universe, concurrently processing those information subsystems and then integrating the results. In other words, given the decision system \(S = (U, C \cup D)\), the subsystems \(\{S_1, S_2, \ldots , S_m\}\) are generated, where \(S_i = (U_i, C \cup D)\) and \(U = \bigcup U_i\); each subsystem \(S_i\), \(i \in \{1, 2,\ldots , m\}\), is then processed to compute the partitions \(U_i / B\), \(B \subseteq C\), and the results are finally amalgamated. This MapReduce-compliant workflow is supported by several theorems stating that (a) equivalence classes can be independently computed for each subsystem and (b) the equivalence classes from different subsystems can be merged if they are based on the same underlying attribute set. These results enable the parallel computation of the equivalence classes of the decision system S. Zhang et al. [286] developed the PACRSEC algorithm to that end.
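
A minimal single-machine sketch of this workflow is shown below (illustrative only; the cited methods run on actual MapReduce/Spark infrastructure): every data chunk is mapped to partial equivalence classes keyed by their attribute-value tuple, and classes sharing the same key are then merged, which theorems (a) and (b) above guarantee to be sound.

```python
from collections import defaultdict

def map_chunk(chunk, B):
    """Partial equivalence classes for one subsystem S_i.
    `chunk` is a list of (object_id, attribute_dict) pairs."""
    classes = defaultdict(list)
    for obj_id, x in chunk:
        classes[tuple(x[a] for a in B)].append(obj_id)
    return classes

def reduce_chunks(partials):
    """Merge partial classes that share the same attribute-value key."""
    merged = defaultdict(list)
    for part in partials:
        for key, ids in part.items():
            merged[key].extend(ids)
    return merged

# Usage sketch: split the decision system into chunks, map them (possibly in
# parallel, e.g. with multiprocessing.Pool), then reduce the partial results:
#   partials = [map_chunk(c, B) for c in chunks]
#   equivalence_classes = reduce_chunks(partials)
```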

Analogously, RST-based knowledge discovery methods, including reduct calculation and decision rule induction, have been investigated in the context of Big Data [58, 256, 285].

3.5 Cost-Sensitive Learning

Cost is an important property inherent to real-world data, and cost sensitivity has been addressed from different angles. Cost-sensitive learning [252, 294, 303, 304] emerged when an awareness of the learning context was brought into Machine Learning. It is one of the most difficult ML problems and was listed as one of the top ten challenges in the Data Mining/ML domain [296].

Two types of learning costs have been addressed through RST: misclassification cost and test cost [253]. Test cost has been studied by Min et al. [163, 165, 166, 295] using the classical rough set approach, i.e., using a single granulation; a test-cost-sensitive multigranulation rough set model is presented in [253]. Multigranulation rough set is an extension of the classical RST that leans on multiple granular structures.

A recent cost-sensitive rough set approach was put forward in [115]. The crux of this method is that the information granules are sensitive to test costs while the approximations are sensitive to decision costs; in this way, the construction of the rough set model takes into account both the test cost and the decision cost simultaneously. This new model is called the cost-sensitive rough set and is based on decision-theoretic rough sets. In [132], the authors combine sequential three-way decisions and cost-sensitive learning to solve the face recognition problem; this is particularly interesting since in real-world face recognition scenarios, different kinds of misclassifications lead to different costs [155, 294].

Other studies focused on the cost-sensitive learning problem from an RST perspective are presented in [84, 113, 253, 254]; these works have considered both the test cost and the decision cost. Attribute reduction based on test-cost-sensitivity has been quite well investigated [74, 86, 106, 114, 115, 133, 163, 164, 166, 296].

4 Reference Categorization

Table 1 lists the different RST studies according to the ML tasks they perform.

Table 1 Rough sets in machine learning

5 Conclusions

We have reported on hundreds of successful attempts to tackle different ML problems using RST. These approaches touch all components of the knowledge discovery process, ranging from data preprocessing to descriptive and predictive knowledge induction. Aside from the well-known RST strengths in identifying inconsistent information systems, calculating reducts to reduce the dimensionality of the feature space or generating an interpretable rule base, we have walked the reader through more recent examples that show the redefinition of some of the RST’s building blocks to make it a suitable approach for handling special ML scenarios characterized by an imbalance in the available class data, the requirement to classify a pattern into one or more predefined labels, the dynamic processing of data streams, the need to manage large volumes of static data or the management of misclassification/test costs. All of these efforts bear witness to the resiliency and adaptability of the rough set approach, thus making it an appealing choice for solving non-conventional ML problems.