
1 Introduction

Interest in machine learning tools and algorithms has grown rapidly in recent years and continues to grow. There is a wide range of applications on the market that use various machine learning routines. There are, however, still only a few solutions compatible with the Microsoft .NET framework that provide machine learning algorithms. Those that exist focus on numerical rather than symbolic methods, and so far none of them has included rough set [1] based algorithms.

Machine learning models that are based on mathematically sophisticated methods may achieve high accuracy, but they are hardly understandable by users who expect not only accurate results but also an easy yet meaningful explanation of how these results were obtained. Models relying on symbolic, e.g., rule based methods may be less accurate but more intuitive and understandable for humans [2]. In both cases, feature subset selection increases the interpretability and practical usefulness of machine learning models.

Symbolic methods focus on finding relationships in data, typically reported in the form of rules in a feature-value language. The rules are built with the use of basic logical operators. Examples include rule induction methods such as learning if-then rules [3] or decision trees [4].

Rough sets have proven to be a successful tool in feature selection (see e.g. [5]). The rough set approach is based on decision reducts – irreducible subsets of features which determine specified decision classes to (almost) the same degree as the original set of features. Determining decisions can be interpreted analogously to, e.g., functional or multi-valued dependencies in relational databases. Subsets of features providing exactly the same degree of determination as the original set are often referred to as crisp decision reducts, as opposed to approximate decision reducts [6], where some controlled decrease of determination is allowed. By specifying a threshold for the allowed decrease of determination, one can address the balance between a decision model’s simplicity and its accuracy. Indeed, it is easier to search for smaller subsets of features yielding simpler sub-models under loosened constraints for decision determination, although too weak constraints may also cause poorer accuracy. However, even relatively less accurate sub-models may lead towards a very accurate final model, if the processes of sub-model design are appropriately synchronized.

NRough is a set of libraries written in the C# programming language focusing on rough sets and other symbolic machine learning methods. It contains a number of algorithms for searching for approximate decision reducts and constructing decision models. All presented algorithms have been successfully used in our previous research and have proven their value. The framework is aimed at researchers, who can extend it and test their methods against already implemented models. The second user group are developers and system integrators, who can include the described routines in their applications.

NRough can be downloaded from its GitHub repository [7] as well as from its dedicated website [8] in the form of a Microsoft Visual Studio solution containing the source code of all described libraries. The sources include unit test code that presents use case examples as well as unit testing procedures.

We present the framework’s key features as well as the formal definitions behind the implemented algorithms in Sect. 2. In this section we describe data representation, approximate decision reducts and decision reduct classifier ensembles. Moreover, we list other supervised and unsupervised machine learning algorithms included in NRough, as well as features related to model evaluation. Next, in Sect. 3 we describe the architecture of our solution. Licensing information is given in Sect. 4. In Sect. 5 we describe other rough set based frameworks. The final Sect. 6 concludes this paper and includes a draft road map as well as the direction in which we would like the framework to evolve.

2 Key Features

The NRough framework contains a number of algorithms for searching for approximate decision reducts and constructing decision models based on rough set theory. Moreover, we added a number of decision rule based classifiers known from machine learning. Last but not least, the framework contains routines for classifier validation and result presentation. Below, we present the list of implemented features, starting with the data representation.

2.1 Data Representation

We represent data in the form of a decision table, which is a well known structure in the rough sets domain. A decision table is a tuple \(\mathbb {A}=(U, A \cup \{d\})\), where U is a finite set of objects, A is a finite set of attributes and \(d \notin A\) is a distinguished decision attribute. We refer to elements of U using their ordinal numbers \(i=0,...,|U|-1\) as well as by a unique record identifier (if one exists in the data set). We treat attributes \(a\in A\) as functions \(a:U\rightarrow V_a\), where \(V_a\) denotes the set of values of a. Values \(v_d\in V_d\) correspond to decision classes that we want to describe using the values of attributes in A. The framework internally encodes attribute values using the signed long base type, which allows generic data access and avoids boxing/unboxing from the original types, i.e., it yields faster computations. Internal values can, however, be converted back to their original typed values using dictionary lookup methods. There are no restrictions on input types, which are automatically recognized during data loading.
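The encoding idea can be sketched as follows (a minimal illustration with a hypothetical AttributeEncoder helper, not the framework’s actual classes): each distinct value receives a consecutive long code, and a reverse dictionary restores the original value on demand.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of long-based value encoding: every distinct
// attribute value gets a consecutive long code; a reverse lookup restores
// the original typed value.
public sealed class AttributeEncoder<T>
{
    private readonly Dictionary<T, long> toCode = new Dictionary<T, long>();
    private readonly Dictionary<long, T> fromCode = new Dictionary<long, T>();

    public long Encode(T value)
    {
        if (!toCode.TryGetValue(value, out long code))
        {
            code = toCode.Count;          // next unused code
            toCode[value] = code;
            fromCode[code] = value;
        }
        return code;
    }

    public T Decode(long code) => fromCode[code];
}

public static class EncoderDemo
{
    public static void Main()
    {
        var enc = new AttributeEncoder<string>();
        long sunny = enc.Encode("sunny");     // 0
        long rainy = enc.Encode("rainy");     // 1
        Console.WriteLine(enc.Decode(sunny)); // prints "sunny"
        Console.WriteLine(rainy);             // prints 1
    }
}
```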

Decision tables can be loaded from text files (including comma delimited files and the RSES 1.0 format [9]) as well as from a System.Data.DataTable instance, which is often used in .NET to store SQL query results. The framework includes a number of filters to manipulate the data, such as removal of selected attributes or records based on given user criteria. The filter concept is also used to define more sophisticated data manipulations such as numeric value discretization.
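On the .NET side, such a DataTable is typically obtained with standard ADO.NET code, as sketched below; the query and connection string are placeholders, and the resulting table is what the framework’s DataTable-based loading consumes.

```csharp
using System.Data;
using System.Data.SqlClient;

// Filling a System.Data.DataTable from an SQL query -- the standard .NET
// route; the resulting table can then be handed over to the framework.
public static class DataLoading
{
    public static DataTable LoadWeather(string connectionString)
    {
        var table = new DataTable("weather");
        using (var conn = new SqlConnection(connectionString))
        using (var adapter = new SqlDataAdapter("SELECT * FROM weather", conn))
        {
            adapter.Fill(table);   // opens the connection, executes, fills rows
        }
        return table;
    }
}
```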

One of the key concepts in rough set theory is the indiscernibility relation. For any subset of attributes \(B \subseteq A\) and object \(x \in U\) we can define an information vector \(B(x) = [ a_{i_{1}}(x),\dots ,a_{i_{|B|}}(x) ]\), where \(a_{i_{j}}(x)\) are values of attributes \(a_{i_{j}} \in B\) for \(j = 1, \dots ,|B|\). We denote the set of all B-information vectors occurring in \(\mathbb {A}\) by \(V_{B} = \{B(x) : x \in U \}\) and enumerate its elements as \(v_{1}, \dots ,v_{|V_{B}|}\). Each subset \(B \subseteq A\) partitions the universe U into so called equivalence classes, yielding the partition space \(U/B = \{E_{1},\dots ,E_{|V_{B}|}\}\), where each equivalence class is defined as \(E_{t} = \{x \in U : B(x) = v_{t} \}\) for \(t = 1,\dots ,|V_{B}|\). NRough implements this concept as a dedicated data structure which is used in many scenarios, such as calculating the information entropy, majority or relative gain functions, to name a few. The majority function is used in many ways by the framework’s algorithms, e.g., for approximate decision reduct computation as well as for decision tree pre-pruning or branch split calculation.
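For illustration, the partition U/B and the majority function mentioned above can be computed with a few lines of LINQ. The sketch below is our own illustration over a plain long[][] table of encoded values, not the framework’s dedicated data structure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustration: objects are rows of encoded values; grouping rows by their
// B-information vectors yields the partition U/B, and the majority measure
// follows by counting the most frequent decision inside each class.
public static class Partitions
{
    public static double MajorityMeasure(long[][] data, int[] B, int decisionCol)
    {
        var classes = data.GroupBy(
            row => string.Join("|", B.Select(a => row[a])));   // key = B(x)

        double hits = classes.Sum(E =>
            E.GroupBy(row => row[decisionCol])                 // split E by decision
             .Max(g => g.Count()));                            // majority count

        return hits / data.Length;                             // fraction covered
    }
}
```

The returned value is the fraction of objects whose decision agrees with the majority decision of their own equivalence class, i.e., the M(B) measure defined formally in the next subsection.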

Last but not least, the library contains a number of benchmark data sets taken from the UCI repository [10]. These data can be accessed with predefined methods loading the data into memory, including their metadata.

2.2 Approximate Decision Reducts

Attribute selection plays an important role in knowledge discovery. It establishes the basis for more efficient classification, prediction and approximation models. Attribute selection methods originating from the theory of rough sets aim at searching for so called decision reducts – irreducible subsets of attributes that satisfy predefined criteria for keeping enough information about decision classes. NRough contains a number of algorithms for computing approximate decision reducts (as well as crisp decision reducts, when the approximation threshold is set to 0). All reduct computation algorithms are based on a heuristic approach and many utilize parallel computing.

We define an \((F,\varepsilon )\)-approximate decision reduct [11], where F is a measure \(F(d|\cdot ) : 2^{A} \rightarrow \mathfrak {R}\) which evaluates the degree of influence F(d|B) of a subset \(B \subseteq A\) on d. Below we present the definition as well as the general routine for computing \((F,\varepsilon )\)-approximate decision reducts, given as Algorithm 1 and called \((F,\varepsilon )\)-REDORD [11].

Definition 1

Let \(\varepsilon \in [0,1)\) and \(\mathbb {A}=(U, A \cup \{d\})\) be given. We say that \(B \subseteq A\) is an \((F,\varepsilon )\)-approximate decision reduct, iff it is an irreducible subset of attributes satisfying the following condition:

$$\begin{aligned} F(B) \ge (1 - \varepsilon )F(A) \end{aligned}$$
(1)
(Algorithm 1: the \((F,\varepsilon )\)-REDORD procedure, rendered as a figure in the original.)
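Since Algorithm 1 is only available here as a figure, the C# sketch below reconstructs its general idea under our reading of [11]: an attribute permutation is fixed and attributes are greedily dropped as long as condition (1) still holds. It is an illustration, not a transcription of NRough’s implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reconstruction sketch of the permutation-based reduction idea behind
// (F,eps)-REDORD: start from the full attribute set and drop attributes
// one by one, in random order, whenever condition (1) is preserved.
public static class RedOrd
{
    public static ISet<int> Reduce(
        int[] attributes,                 // indices of A
        Func<ISet<int>, double> F,        // measure F(d|.)
        double eps)
    {
        var B = new HashSet<int>(attributes);
        double threshold = (1.0 - eps) * F(B);    // (1 - eps) * F(A)

        var rnd = new Random();
        foreach (int a in attributes.OrderBy(_ => rnd.Next()))  // permutation
        {
            B.Remove(a);
            if (F(B) < threshold)
                B.Add(a);                 // removal violates (1) -> restore
        }
        return B;                         // irreducible w.r.t. the visited order
    }
}
```

The measure delegate F can be instantiated, for instance, with the majority measure sketched in Sect. 2.1, or with any of the measures defined below.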

The framework defines three types of F measures: \(\gamma (B)\) [1], which is based on the so called positive region, the Majority measure M(B) [11] and the Relative Gain measure R(B) [12]. In the formulas below, \(X_k = \{x \in U : d(x) = k\}\), \(k \in V_d\), denote the decision classes. Other user defined measures can also be used with the \((F,\varepsilon )\)-approximate reduct computation algorithm.

$$\begin{aligned} M(B) = \frac{1}{|U|}\sum _{E\in U/B} \max _{k \in V_d} |X_k \cap E| \end{aligned}$$
(2)
$$\begin{aligned} R(B) = \frac{1}{|V_d|}\sum _{E\in U/B}\max _{X\in U/\{d\}}\frac{|X \cap E|}{|X|} \end{aligned}$$
(3)
$$\begin{aligned} \gamma (B) = \frac{1}{|U|}|POS(B)| = \frac{1}{|U|} \sum _{E\in U/B:\, \exists _{k}\, P(X_k|E)=1}|E| \end{aligned}$$
(4)

In [13] it was shown how to compute approximate decision reducts over a universe of weighted objects, and that two different weighting schemes lead to a unified way of computing the M(B) and R(B) measures: for \(\varvec{1} : U \rightarrow \{1\}\) we have \(M_{\varvec{1}}(B)=M(B)\), while for \(r(u) = \frac{1}{|\{x \in U : d(x) = d(u)\}|}\) we obtain \(M_{r}(B) = R(B)\).

Definition 2

Let \(\varepsilon \in [0,1)\), \(\mathbb {A}=(U, A \cup \{d\})\) and \(\omega : U \rightarrow [0,+\infty )\) be given. We say that \(B \subseteq A\) is an \((\omega ,\varepsilon )\)-approximate decision reduct, iff it is an irreducible subset of attributes satisfying the following condition:

$$\begin{aligned} M_{\omega }(B) \ge (1 - \varepsilon )M_{\omega }(A) \end{aligned}$$
(5)
$$\begin{aligned} M_{\omega }(B) = \frac{1}{|U|_{\omega }}\sum _{E\in U/B} \max _{k \in V_d} |X_k \cap E|_{\omega } \end{aligned}$$
(6)
$$\begin{aligned} |Y|_{\omega } = \sum _{u \in Y} \omega (u) \end{aligned}$$
(7)

Moreover, the framework contains algorithms for computing decision bireducts and their derivatives: \(\gamma \)-bireducts and relative-bireducts [14]. Below we present the definitions as well as pseudocode as Algorithm 2. In [15] we showed relationships between \((F,\varepsilon )\)-approximate decision reducts and different types of bireducts.

Definition 3

Let \(\mathbb {A}=(U,A\cup \{d\})\) be a decision system. A pair \((B,X)\), where \(B \subseteq A\) and \(X \subseteq U\), is called a decision bireduct, iff B discerns all pairs \(i,j \in X\) where \(d(i) \ne d(j)\), and the following properties hold:

1. There is no \(C \subsetneq B\) such that C discerns all pairs \(i,j \in X\) where \(d(i) \ne d(j)\);

2. There is no \(Y \supsetneq X\) such that B discerns all pairs \(i,j \in Y\) where \(d(i) \ne d(j)\).

Definition 4

Let \(\mathbb {A}=(U,A\cup \{d\})\) be a decision system. A pair \((B,X)\), where \(B \subseteq A\) and \(X \subseteq U\), is called a decision \(\gamma \)-bireduct, iff B discerns all pairs \(i \in X, j \in U\) where \(d(i) \ne d(j)\), and the following properties hold:

1. There is no \(C \subsetneq B\) such that C discerns all pairs \(i \in X, j \in U\) where \(d(i) \ne d(j)\);

2. There is no \(Y \supsetneq X\) such that B discerns all pairs \(i \in Y, j \in U\) where \(d(i) \ne d(j)\).

(Algorithm 2: decision bireduct computation, rendered as a figure in the original.)
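The discernibility condition shared by Definitions 3 and 4 can be checked directly. The sketch below verifies the Definition 3 variant for a candidate pair (B, X) and is our own illustration; irreducibility of B and maximality of X require the additional checks listed in the definition.

```csharp
using System.Collections.Generic;
using System.Linq;

// Core bireduct condition of Definition 3: within the object subset X,
// any two objects with different decisions must be discerned by at least
// one attribute of B.
public static class Bireducts
{
    public static bool Discerns(
        long[][] data, IList<int> X, ISet<int> B, int decisionCol)
    {
        for (int p = 0; p < X.Count; p++)
            for (int q = p + 1; q < X.Count; q++)
            {
                long[] u = data[X[p]], v = data[X[q]];
                if (u[decisionCol] != v[decisionCol] &&
                    B.All(a => u[a] == v[a]))      // indiscernible conflict
                    return false;
            }
        return true;
    }
}
```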

In [16] we presented a new definition of the so called generalized majority decision function, which can be treated as an extension of the well known generalized decision function. We also gave the definition of generalized approximate majority decision reducts. The pseudocode is presented as Algorithm 3. An interesting extension is to use so called exceptions, which on the one hand allow further feature reduction in the main model and on the other hand store details about outliers. Both definitions are implemented in NRough.

Definition 5

For any decision table \({\mathbb {A}} = (U,A\cup \{d\})\) and approximation threshold \(\varepsilon \in [0, 1)\) one can consider the generalized approximate majority decision function \(m_{d}^{\varepsilon } : 2^U \rightarrow 2^{V_d}\) of the following form:

$$\begin{aligned} m_{d}^{\varepsilon }(E) = \{ k : |X_k \cap E| \ge (1 - \varepsilon ) \max _{j} |X_j \cap E| \} \end{aligned}$$
(8)
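A direct reading of formula (8) as code (our own illustration; decisions are assumed to be encoded as long values):

```csharp
using System.Collections.Generic;
using System.Linq;

// Generalized approximate majority decision function (8): given the
// decision values of the objects in an equivalence class E, return every
// decision whose support is within a (1 - eps) factor of the most
// frequent one.
public static class GeneralizedDecision
{
    public static ISet<long> MajorityDecisions(
        IEnumerable<long> decisionsInE, double eps)
    {
        var counts = decisionsInE.GroupBy(d => d)
                                 .ToDictionary(g => g.Key, g => g.Count());
        double bound = (1.0 - eps) * counts.Values.Max();
        return new HashSet<long>(
            counts.Where(kv => kv.Value >= bound).Select(kv => kv.Key));
    }
}
```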

Definition 6

Let \({\mathbb {A}} = (U,A\cup \{d\})\) be given. We say that \(B \subseteq A\) is an \(m_{d}^{\varepsilon }\)-decision superreduct, if and only if the following condition holds:

$$\begin{aligned} \forall _{x \in U} \;\; m_{d}^{\varepsilon }([x]_B) \subseteq m_{d}^{\varepsilon }([x]_A) \end{aligned}$$
(9)
where \([x]_B \in U/B\) denotes the B-indiscernibility class of x.

We say that B is an \(m_{d}^{\varepsilon }\)-decision reduct, if and only if it is an \(m_{d}^{\varepsilon }\)-superreduct and none of its proper subsets satisfies the above condition.

(Algorithm 3: generalized majority decision reduct computation, rendered as a figure in the original.)

2.3 Approximate Decision Reduct Classifier Ensembles

Approximate decision reducts usually include fewer attributes than classical reducts. On the other hand, they may generate if-then rules that make mistakes even on the training samples. For noisy data sets this is to some extent desirable. Nevertheless, some methods for controlling those mistakes should be considered. For example, if the goal is to construct a classification model based on several approximate decision reducts, then – following ideas taken from machine learning [17] – one may wish to ensure that if-then rules generated by different reducts do not repeat the same mistakes on the training data. For this purpose, we can consider a mechanism aiming at diversifying the importance of particular objects while searching for different approximate reducts. The same mechanisms are used in classifier ensemble methods. These methods usually perform better than their components used independently [18, 19]. Combining classifiers is efficient especially if they are substantially different from each other. In fact, the feature subsets applied in ensembles can be relatively smaller than in the case of a single feature subset approach, if we can guarantee that the combination of less accurate classifier components (further referred to as weak classifiers) will lead back to a satisfactory level of determining the decision or preserving information about it.

NRough includes several mechanisms for learning ensembles of approximate decision reduct classifiers. One method is based on the well known Adaptive Boosting (AdaBoost) algorithm [20]. In NRough we introduced an AdaBoost version which uses decision rules derived from approximate decision reducts [21]; the pseudocode is presented as Algorithm 5.
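For reference, a compact AdaBoost.M1-style training loop is sketched below. It follows the textbook scheme [20] with an abstract weighted weak learner (in [21] the weak models are rule sets derived from approximate decision reducts) and is not a transcription of Algorithm 5.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified AdaBoost.M1 loop: weak classifiers are fitted iteratively on
// weighted data, and object weights are increased on the examples the
// last model misclassified.
public static class Boosting
{
    public static List<(Func<long[], long> Model, double Alpha)> Train(
        long[][] X, long[] y,
        Func<long[][], long[], double[], Func<long[], long>> trainWeak,
        int rounds)
    {
        int n = X.Length;
        double[] w = Enumerable.Repeat(1.0 / n, n).ToArray();
        var ensemble = new List<(Func<long[], long> Model, double Alpha)>();

        for (int t = 0; t < rounds; t++)
        {
            Func<long[], long> h = trainWeak(X, y, w);         // weighted fit
            double err = Enumerable.Range(0, n)
                                   .Where(i => h(X[i]) != y[i])
                                   .Sum(i => w[i]);
            if (err <= 0.0 || err >= 0.5) break;               // degenerate model
            double alpha = 0.5 * Math.Log((1.0 - err) / err);  // model weight
            for (int i = 0; i < n; i++)                        // boost mistakes
                w[i] *= Math.Exp(h(X[i]) == y[i] ? -alpha : alpha);
            double z = w.Sum();
            for (int i = 0; i < n; i++) w[i] /= z;             // renormalize
            ensemble.Add((h, alpha));
        }
        return ensemble;
    }
}
```

Prediction then combines the weak models by an alpha-weighted vote over their outputs.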

When a reduct ensemble is used to create decision rules, one needs a method for combining the weak classifier outputs. We implemented several voting mechanisms described in [22]. We present the different voting options in Table 1 in a way compliant with \((\omega , \varepsilon )\)-approximate decision reducts. The voting weights are presented in a slightly changed form, where

$$\begin{aligned} X^{\omega }_{E} = \mathop {\text {argmax}}\limits _{X\in U/\{d\}}|X \cap E|_{\omega } \end{aligned}$$
(10)
(Algorithm pseudocode, rendered as a figure in the original.)
Table 1. Six options of weighting decisions by if-then rules, corresponding to the consequent coefficient types plain, \(\omega \)-confidence and \(\omega \)-coverage, and antecedent coefficient types single and \(\omega \)-support. \(|E|_{\omega }\) denotes the support of a rule’s left side. \(X_{E}^{\omega }\) is defined by formula (10).
(Table 1 body, rendered as an image in the original.)

Another method introduced for decision reduct diversification is based on hierarchical clustering of decision reducts, where the distance between reducts is based on binary vectors created according to Formula (11). Figure 1 presents an example of a dendrogram created by hierarchical clustering of approximate decision reducts. By choosing a cut level we create a number of reduct groups. From each group a single reduct is selected and used as a base for creating a weak classifier. The weak classifiers together form a classifier ensemble.

$$\begin{aligned} \overrightarrow{v_{B}}[k] = {\left\{ \begin{array}{ll} 1, &{} \text {if} \quad x_k \in \mathop {\text {argmax}}\limits _{X \in U/\{d\}} |X \cap [x_k]_B| \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(11)
where \([x_k]_B \in U/B\) denotes the B-equivalence class of object \(x_k\), i.e., \(\overrightarrow{v_{B}}[k]=1\) iff the decision of \(x_k\) agrees with the majority decision of its own B-class.
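A sketch of Formula (11) and of the Hamming distance used to compare reducts (our own illustration over a plain encoded table):

```csharp
using System.Linq;

// Characteristic vector of a reduct B per Formula (11): mark the objects
// whose decision agrees with the majority decision of their B-class.
// Hamming distance between two such vectors measures how differently two
// reducts behave on the training data.
public static class ReductDiversity
{
    public static int[] CharacteristicVector(long[][] data, int[] B, int decisionCol)
    {
        return data
            .Select((row, i) => new { i, row })
            .GroupBy(o => string.Join("|", B.Select(a => o.row[a])))
            .SelectMany(E =>
            {
                long top = E.GroupBy(o => o.row[decisionCol])
                            .OrderByDescending(g => g.Count())
                            .First().Key;                  // majority decision of E
                return E.Select(o => new { o.i, hit = o.row[decisionCol] == top });
            })
            .OrderBy(o => o.i)                             // restore object order
            .Select(o => o.hit ? 1 : 0)
            .ToArray();
    }

    public static int Hamming(int[] u, int[] v) =>
        u.Zip(v, (a, b) => a == b ? 0 : 1).Sum();
}
```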
Fig. 1. Dendrogram created by hierarchical clustering of 18 approximate decision reducts. (Image in the original.)

The framework also includes selection mechanisms which allow one to select from a reduct pool those reducts that meet user defined criteria, e.g., contain the smallest number of features, generate the fewest decision rules, etc.

2.4 Other Machine Learning Algorithms

Apart from rough set inspired classifiers, the framework includes a number of decision rule induction algorithms. These algorithms can be combined with the rough set based feature selection methods described in the previous section. The current implementation includes the following routines:

 

Decision lists:

generation routine based on feature subsets and a given decision table.

Majority voting:

based on a feature subset and a given decision table [23].

Decision trees:

(C4.5 [24], ID3 [25]) implementation supporting numerical and nominal attribute types. The impurity functions can be easily exchanged with one another and include information entropy, the Gini index, the majority function or other user defined methods (a sketch of such exchangeable impurity functions follows this list). The decision tree base class includes a pruning option; the current implementation provides error based pruning and reduced error pruning [26]. We also introduce a new pre-pruning method based on the majority function.

Random forest:

implementation with options to select the ensemble size, the base decision tree type and the data sampling method.

1R rule inducer:

which, as suggested in [27], can be used to calculate a data set’s baseline accuracy.

Constant decision:

classifier which classifies all objects as the majority decision from the training data set.

 
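The exchangeable impurity functions mentioned for decision trees above can be illustrated as simple functions over a class-frequency vector (a sketch of the pluggable design; NRough’s actual signatures may differ):

```csharp
using System;
using System.Linq;

// Illustration of exchangeable impurity functions over a vector of
// per-class object counts.
public static class Impurity
{
    // Shannon entropy: -sum p_k * log2(p_k)
    public static double Entropy(double[] counts)
    {
        double n = counts.Sum();
        return -counts.Where(c => c > 0)
                      .Sum(c => (c / n) * Math.Log(c / n, 2));
    }

    // Gini index: 1 - sum p_k^2
    public static double Gini(double[] counts)
    {
        double n = counts.Sum();
        return 1.0 - counts.Sum(c => (c / n) * (c / n));
    }

    // Majority-based impurity: fraction of objects outside the majority class
    public static double Majority(double[] counts) =>
        1.0 - counts.Max() / counts.Sum();
}
```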

The framework includes a set of unsupervised algorithms based on hierarchical clustering [28] with different linkage and distance methods. Model construction algorithms that work only with nominal data can utilize a number of numerical attribute discretization methods, both supervised and unsupervised. The supervised ones include hierarchical methods based on information entropy [29, 30] and the majority function; the unsupervised ones include equal width and equal frequency binning. Most of the implemented algorithms can work with weighted instances.
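The two unsupervised binning schemes can be sketched as follows (our own illustration; the returned cut points define the discretization intervals):

```csharp
using System;
using System.Linq;

// Equal width: split the value range into k intervals of the same length.
// Equal frequency: put (roughly) the same number of objects in each interval.
public static class Binning
{
    public static double[] EqualWidthCuts(double[] values, int k)
    {
        double min = values.Min(), step = (values.Max() - min) / k;
        return Enumerable.Range(1, k - 1).Select(i => min + i * step).ToArray();
    }

    public static double[] EqualFrequencyCuts(double[] values, int k)
    {
        var sorted = values.OrderBy(v => v).ToArray();
        return Enumerable.Range(1, k - 1)
                         .Select(i => sorted[i * sorted.Length / k])
                         .ToArray();
    }
}
```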

2.5 Model Evaluation

Proper evaluation and error estimation are crucial in constructing and comparing decision models. One of the key features of NRough is decision model evaluation. Currently the framework provides the following evaluation methods [31]: n-repeated k-fold cross validation (CV), leave-one-out CV, bootstrap with out-of-bag testing and, finally, n-repeated hold out.

Each evaluation test can return detailed information about experiment results in the form of a formatted table which can be saved as a .CSV or .TEX file. Additionally, results can be presented in graphical form using an interface to the R environment [32]; the graphical presentation as well as the LaTeX tabular output are still under development, but working code is available in the repository and can be customized as needed. The classification result class exposes information such as error and accuracy rates, balanced accuracy (useful for testing imbalanced data sets), error standard deviation, confidence, coverage, F-score, recall and precision, as well as the classification confusion matrix. If the model definition allows it, it is possible to include information about complexity, e.g., the size of a decision tree, the number of rules, the average length of rules, etc.
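For illustration, the k-fold CV scheme can be sketched as follows (our own minimal version, not the framework’s evaluation classes; the trainAndTestError delegate stands for any model construction and testing routine):

```csharp
using System;
using System.Linq;

// Plain k-fold cross validation: indices are shuffled once, split into k
// folds, and the average error over held-out folds is reported.
public static class Evaluation
{
    public static double CrossValidate(
        int n, int k,
        Func<int[], int[], double> trainAndTestError,  // (trainIdx, testIdx) -> error
        int seed = 0)
    {
        var rnd = new Random(seed);
        int[] idx = Enumerable.Range(0, n).OrderBy(_ => rnd.Next()).ToArray();
        var folds = Enumerable.Range(0, k)
                              .Select(f => idx.Where((_, i) => i % k == f).ToArray())
                              .ToList();
        return folds.Average(test =>
            trainAndTestError(idx.Except(test).ToArray(), test));
    }
}
```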

3 Architecture

NRough is a Microsoft .NET based framework written in the C# programming language. The source code is CLS compliant, which enables its use from other .NET languages. Currently the framework is provided as a set of libraries targeting .NET 4.6.1 and the 64 bit architecture.

The libraries have the following structure:  

NRough.Core:

Contains generic data structures and extension methods.

NRough.Data:

Responsible for data handling. It defines the decision table interface and the equivalence class collection, as well as routines for providing metadata and an interface for data filtering.

NRough.MachineLearning:

Contains approximate decision reduct computation algorithms as well as other described machine learning models and routines.

NRough.Math:

Contains special functions used across other modules e.g. statistical functions or distance metrics used in clustering.

NRough.Tests.*:

Contains a set of test fixtures which, besides their unit testing purpose, serve as a code sample repository. Each of the above listed libraries has its own unit test library, e.g., NRough.Data is tested by NRough.Tests.Data.

 

The framework currently depends on the following external libraries (besides the standard .NET libraries): Math.NET.Numerics [33], NUnit [34] and R.NET [35].

4 License

NRough libraries are provided under the GNU Lesser General Public License ver. 3 (GNU LGPLv3). This means that the provided source can be used for research, commercial and non-commercial purposes without any charges, as long as the GNU LGPLv3 restrictions are satisfied. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights. However, a larger work using the licensed work through interfaces provided by the licensed work may be distributed under different terms and without source code for the larger work. The source code is provided “as is”, without warranty of any kind, express or implied. In no event shall the authors or copyright holders be liable for any claim, damages or other liability. The complete license can be found in [36].

5 Other Frameworks

Let us focus on other rough set related frameworks developed by various researchers. First of all, let us mention LERS, a system for Learning from Examples based on Rough Sets [37]. LERS contained two rule induction algorithms (LEM1 and LEM2) that could cope with inconsistent data. LERS also contained a number of algorithms for handling missing data and numerical attribute discretization. Its performance was comparable with the AQ15 and C4.5 algorithms.

Secondly, let us mention two more complete GUI-based systems: the Rough Set Exploration System (RSES) [9] and ROSETTA [38], a toolkit for analyzing tabular data within the framework of rough sets. Both solutions shared the same computational kernel, developed at the Group of Logic, Institute of Mathematics, University of Warsaw, Poland. Finally, there is ROSE2 (Rough Sets Data Explorer) [39], a software package implementing basic elements of rough set theory and rule discovery techniques. It was created at the Laboratory of Intelligent Decision Support Systems of the Institute of Computing Science in Poznan, Poland. RSES, ROSETTA as well as ROSE2 contained many different algorithms, ranging from data preprocessing, filtering, discretization and rule induction to classification and feature selection.

All of the above mentioned solutions are, to the authors’ knowledge, no longer maintained. These solutions were based on Java or C++ and their source code was not open. More recently, there were several attempts to revive RSES in the form of an open source Java based library distributed under the GNU GPL license, called RSESLib [40].

Last but not least, we need to mention the RoughSets [41] and RapidRoughSets [42] packages as the most recent developments in the rough sets domain. The former contains the core computational methods for rough and fuzzy sets and is based on the already mentioned R statistical software. The latter is a GUI extension based on the RapidMiner software.

In the .NET domain there have been, to the authors’ knowledge, no attempts so far to publish a machine learning framework containing rough set based algorithms. We would like to mention Microsoft Azure Machine Learning [43], a cloud based service integrating many different solutions, resulting in a platform worth considering for system integrators. It is, however, a commercial solution. We would also like to mention the Accord.NET framework [44], an open source machine learning framework focusing on numerical methods as well as on machine vision. This framework could complement some general machine learning routines not yet implemented in NRough.

6 Conclusions and Future Work

We have created the NRough framework based on our experience and research focused on approximate decision reducts over the past few years. Initially, the presented methods were developed separately, but recently the whole source code went through a major refactoring process, resulting in the presented solution. We have added other machine learning routines to the framework for two reasons: first, to combine well known, proven machine learning algorithms with rough sets; secondly, to be able to compare their performance against rough set inspired classifiers.

The whole framework has so far been developed by a single person and there is still much to be done. First of all, the framework needs strong API documentation and more examples; we are planning to add these in the near future and publish them on-line. Secondly, we would like to add more rough set related algorithms in order to create a comprehensive library of different decision reduct computation routines. Thirdly, we are planning to extend the approximate decision reduct selection and diversification criteria. Last but not least, there are some development tasks to complete, such as the graphical presentation methods using the R interface. Moreover, the framework currently targets .NET version 4.6.1, which allows it to be compiled only on the Windows platform. We plan to extend its compatibility to .NET Core in order to use it on Linux and OSX platforms, but so far .NET Core lacks some important functionality, so we are waiting for its new releases.