
1 Introduction

Technological innovation continues to advance at a fast pace, giving rise to modern computational tools that make capturing, storing and processing vast amounts of data a far more straightforward endeavor than in previous generations. This fact, together with the overwhelming volume of data generated over the last two decades, has sparked renewed interest in scientific branches such as Machine Learning [12].

Machine Learning is a very active research field within Artificial Intelligence that seeks to develop automated techniques capable of learning from the vast oceans of data available nowadays; in other words, the goal is the automatic extraction of meaningful and representative knowledge from the raw data streams in a particular domain of consideration  [15].

Inducing classification rules from data is a classical Machine Learning problem. Most approaches follow a sequential covering strategy to build and then refine the rule base. They operate on a training set of objects characterized by a list of conditional attributes (which form the antecedents of the resulting rules) and use those attributes to reach a conclusion about the decision (class/label) attribute, usually expressed as the consequent of the rules. On the other hand, uncertainty management, as well as knowledge representation, remains a pivotal issue for Artificial Intelligence. Uncertainty is pervasive and inherent to the real world and may emerge in multiple forms, caused by: (i) inaccurate information due to wrong measurements or transmission errors; (ii) incomplete information owing to forbidden or expensive access to data sources; and (iii) ambiguous concept descriptions.

Bayesian reasoning emerged in the 1980s as a probabilistic model for reasoning with uncertainty in Artificial Intelligence and became popular in just a few years. Successful applications of Bayesian reasoning have been reported in fields such as medicine, information retrieval, computer vision, information fusion and agriculture.

In this study we employ Bayesian reasoning to improve the performance of a recently proposed rule induction algorithm, named IRBASIR, by rewriting its aggregation function from a Bayesian angle. The new method, IRBASIR-Bayes, shows superior classification accuracy in comparison to other well-known rule induction methods. The rest of the paper is structured as follows. Section 2 elaborates on the technical background while Sect. 3 focuses on the application of Bayesian concepts to the IRBASIR algorithm. The empirical analysis is reported in Sect. 4. Conclusions and further remarks are stated in Sect. 5.

2 Technical Background

In this section we briefly review some foundational concepts in this study such as Bayesian networks as well as the Naïve-Bayes and IRBASIR algorithms.

2.1 Bayesian Networks

An increasing number of studies have to cope with a large number of variables that exhibit complex relationships among them. Bayesian Networks (BNs) are a class of probabilistic networks that model precisely these interactions. Probabilistic networks are graphical representations of a problem's variables and their relationships [14]. Simply put, BNs have a topological/structural component and a parametric/numerical component. The former is realized via a Directed Acyclic Graph (DAG) that encodes the qualitative knowledge of the model through probabilistic relationships of conditional dependence and independence. This knowledge is articulated by defining the dependence/independence relationships among the model variables, which range from complete independence to functional dependence. The fact that the model is specified in a graph-based fashion makes BNs really attractive among peer knowledge representation formalisms. BNs not only qualitatively model the domain knowledge but also quantitatively express the “strength” of the relationships among the variables by means of probability distributions, reflecting the degree of belief we hold in the underlying relations among the model variables.

A Directed Acyclic Graph is composed of nodes denoting the problem variables, which could be descriptive features or attributes. Pairs of nodes may be connected by directed edges, which represent probabilistic dependencies between the variables. In terms of probabilities, linking X to Y means there is a conditional dependence of Y with respect to X, i.e., the distribution of Y given X differs from the marginal distribution of Y. These variables are linked to the decision class, which acts as the root variable. The parametric component of a BN takes the form of Conditional Probability Tables (CPTs) established from the information encoded in the DAG. Bayes' theorem is given by Eq. 1,

$$\begin{aligned} P(h|O)=\dfrac{P(O|h)P(h)}{P(O)} \end{aligned}$$
(1)

where h is a hypothesis, O is an observation (or set of observations) and \(P(h|O)\), \(P(O|h)\) are conditional probabilities. The latter is the likelihood that hypothesis h may have produced the set of observations O. P(h) is called the prior probability and represents the initial degree of belief in h. For a classification problem with the class variable C and set of input attributes \(A=\{A_{1},A_{2},\cdots ,A_{n}\}\) the Bayes theorem adopts the form in Eq. 2,

$$\begin{aligned} P(c|A)=\dfrac{P(A|c)P(c)}{P(A)} \end{aligned}$$
(2)

What we want is to identify the most plausible decision class \(c_{*}\) in the set \(C=\{c_{1},c_{2},\cdots ,c_{k}\}\) for an observation O that needs to be classified. In the Bayesian framework, the most plausible hypothesis is the one with the maximum a posteriori (MAP) probability. In other words, the decision class \(c_{*}\) to be assigned to O is computed by means of Eq. 3,

$$\begin{aligned} c_{*}=\arg \max _{c\in C}\{{P(A|c)P(c)}\} \end{aligned}$$
(3)

Notice that the denominator in Eq. 2 has not been included in Eq. 3 for simplification purposes. Bayes’ theorem hence provides a neat and interpretable way to solve a classification problem.
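The MAP decision rule of Eq. 3 can be sketched in a few lines of Python. The priors and the likelihood table below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the MAP decision rule (Eq. 3). The priors and the
# likelihood table are illustrative assumptions, not values from the paper.
def map_class(observation, classes, prior, likelihood):
    """Return the class c* maximizing P(O|c) * P(c)."""
    return max(classes, key=lambda c: likelihood(observation, c) * prior[c])

# Toy example with two classes:
prior = {"pos": 0.3, "neg": 0.7}

def likelihood(obs, c):
    table = {("fever", "pos"): 0.9, ("fever", "neg"): 0.2}
    return table[(obs, c)]

print(map_class("fever", ["pos", "neg"], prior, likelihood))  # "pos", since 0.9*0.3 > 0.2*0.7
```

Note that, exactly as in Eq. 3, the evidence term P(O) is never computed: it is constant across classes and does not affect the arg max.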

2.2 Naïve-Bayes

A Bayesian classifier that is usually quite accurate despite its simplicity is known as Naïve-Bayes (NB)  [11]. NB is actually one of the most widely used methods among the BN family of classification techniques.

The core of the NB classifier, and the origin of its name, is the "naïve" assumption that all predictive attributes are conditionally independent given the value of the class variable. This assumption gives rise to a simplified form of Eq. 2: we only need to learn the conditional probabilities of the attributes given the class values [15]. The simplified expression used by NB for classification purposes is given in Eq. 4,

$$\begin{aligned} c_{*}=\arg \max _{c\in C}\left\{ P(c)\prod _{i=1}^{n}P(A_{i}|c)\right\} \end{aligned}$$
(4)

This procedure is known as maximum likelihood estimation. Unfortunately, it requires a large sample size and tends to overfit the data. More robust estimators, such as the Laplace rule of succession, are applied to counter this limitation.

NB deals with numerical data by assuming each attribute is generated from a normal (Gaussian) probability distribution. The mean and standard deviation for each decision class and numeric attribute are then calculated [15]. The probability density function for a normal distribution with mean \(\mu \) and standard deviation \(\sigma \) is given by Eq. 5,

$$\begin{aligned} f(x,\mu ,\sigma )=\dfrac{1}{\sigma \sqrt{2\pi }}e^{-\dfrac{(x-\mu )^2}{2\sigma ^2}} \end{aligned}$$
(5)

NB is one of the strongest and most popular classifiers. Several studies confirm that its classification results are competitive with regards to, or even surpass, those produced by other models such as decision trees, neural networks, etc.
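As a concrete illustration, a minimal Gaussian Naïve-Bayes classifier in the spirit of Eqs. 4 and 5 can be written as follows; the data layout (lists of numeric rows) and the function names are our own assumptions, not the paper's implementation.

```python
import math

# Minimal Gaussian Naive Bayes sketch (Eqs. 4-5): estimate per-class priors
# plus the mean and standard deviation of every numeric attribute, then pick
# the class with the highest posterior score. Layout and names are assumptions.
def fit(X, y):
    params, priors = {}, {}
    for c in set(y):
        rows = [x for x, lbl in zip(X, y) if lbl == c]
        priors[c] = len(rows) / len(y)
        params[c] = []
        for col in zip(*rows):  # one column per attribute
            mu = sum(col) / len(col)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)) or 1e-9
            params[c].append((mu, sigma))
    return params, priors

def gauss(x, mu, sigma):
    # Normal density of Eq. 5
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def predict(x, params, priors):
    def score(c):
        s = priors[c]
        for xi, (mu, sigma) in zip(x, params[c]):
            s *= gauss(xi, mu, sigma)  # conditional independence assumption
        return s
    return max(params, key=score)
```

Fitting on four two-attribute objects labeled "a"/"b" and querying a point near the "a" cluster returns "a", as expected from Eq. 4.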

2.3 IRBASIR

IRBASIR (Induction of Rules BAsed on SImilarity Relations) is an algorithm  [4, 5, 7] for the automatic generation of classification rules in decision systems that may contain both nominal and numerical attributes. The algorithm leans upon the learning of similarity relations  [3] for building similarity classes of objects, which is accomplished by extending the canonical Rough Set Theory (RST). The overall similarity relation learned from data encompasses attribute-wise similarity functions for both nominal and numerical descriptors. Algorithm 1 depicts the operational workflow of the IRBASIR algorithm introduced in  [7].

[Algorithm 1: IRBASIR (pseudocode figure)]

In step 1, we need to specify the set of functions \(\delta _i(x,y)\) that determine the similarity between two objects x and y based on their values for the attribute \(A_i\). Equation 6 shows the expression used in [7] to deal with numerical and nominal attributes, respectively,

$$\begin{aligned} \partial _i(x,y)={\left\{ \begin{array}{ll} 1 &{}\text { if } x_i \text { and } y_i \text { are real and } |x_i - y_i| \le \varepsilon \\ &{}\text { or } x_i \text { and } y_i \text { are discrete and } x_i=y_i \\ 0 &{}\text { otherwise } \end{array}\right. } \end{aligned}$$
(6)

To build the similarity relation R, Eq. 7 is employed,

$$\begin{aligned} xRy \text { if and only if } F_1(x,y)\ge \varepsilon \text { and } F_2(x,y)=1 \end{aligned}$$
(7)

where \(\varepsilon \) denotes a threshold. For the experiments, the value \(\varepsilon =0.85\) was used for all datasets, as suggested in [8]. The functions \(F_1\) and \(F_2\) are defined by Eqs. 8 and 9:

$$\begin{aligned} F_1(x,y)= \sum _{i=1}^n w_i* \partial _i(x,y) \end{aligned}$$
(8)
$$\begin{aligned} F_2(x,y)= {\left\{ \begin{array}{ll} 1 &{} \text { if } class(x)= class(y) \\ 0 &{}\text { otherwise } \end{array}\right. } \end{aligned}$$
(9)

where n is the number of features, \(w_i\) is the weight of feature i calculated according to the method proposed in [5, 6], and \(\partial _i\) is a feature comparison function that calculates the similarity between the values of objects x and y with respect to feature i, as defined by Eq. 10,

$$\begin{aligned} \partial _i(x_i,y_i)= {\left\{ \begin{array}{ll} 1-\dfrac{|x_i-y_i|}{\max (D_i)-\min (D_i)} &{} \text { if } i \text { is continuous } \\ 1 &{}\text { if } i \text { is discrete and } x_i=y_i \\ 0 &{}\text { if } i \text { is discrete and } x_i\ne y_i \end{array}\right. } \end{aligned}$$
(10)

where \(D_i\) is the domain of feature i.
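The similarity machinery of Eqs. 7–10 can be sketched as follows, assuming equal attribute weights and precomputed attribute domains; all names and the data layout are illustrative, not the paper's code.

```python
# Sketch of the similarity relation of Eqs. 7-10, assuming equal attribute
# weights and known attribute domains; names and data layout are illustrative.
def delta(xi, yi, is_numeric, lo, hi):
    if is_numeric:
        return 1 - abs(xi - yi) / (hi - lo)      # Eq. 10, continuous case
    return 1.0 if xi == yi else 0.0              # Eq. 10, discrete cases

def similar(x, y, cx, cy, numeric, domains, weights, eps=0.85):
    f1 = sum(w * delta(a, b, num, lo, hi)        # Eq. 8
             for a, b, num, (lo, hi), w in zip(x, y, numeric, domains, weights))
    f2 = 1 if cx == cy else 0                    # Eq. 9
    return f1 >= eps and f2 == 1                 # Eq. 7: x R y

# Two objects with one numeric and one nominal attribute, same class:
numeric = [True, False]
domains = [(0.0, 10.0), (None, None)]            # domain only needed for numeric attributes
weights = [0.5, 0.5]
print(similar([2.0, "red"], [2.5, "red"], "c1", "c1", numeric, domains, weights))  # True
```

Note how Eq. 9 makes R class-aware: two objects can only be similar if they carry the same decision class, which is what allows the similarity classes to drive rule induction.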

The third step is the automatic learning of classification rules based on Algorithm 2. This procedure generates the classification rules together with their certainty values.

[Algorithm 2: rule learning procedure (pseudocode figure)]

In the pseudocode of Algorithm 2, \([x_i]_R\) returns the set of decision classes of all objects in the set passed as its input argument. Upon calculation of the similarity class \([x_i]_R\) for an object \(x_i\), if the objects in the similarity class all share the same value of the decision class, a rule for that decision class is created; otherwise, the most frequent class in that object set is taken and a rule is created for it.

Then the objects in \([x_i]_R\) are all flagged as used and the rule generation process carries on. The GenRulSim() procedure is given in Algorithm 3.
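The rule-generation loop just described can be sketched as follows; `similarity_class` and `gen_rul_sim` are stand-ins (assumptions) for the paper's R relation and the GenRulSim() procedure, and the data layout is our own.

```python
from collections import Counter

# Hedged sketch of the rule-generation loop described above: for each unused
# object, build its similarity class; if the class is pure, emit a rule for
# that decision class, otherwise for the majority class, then flag all members
# as used. `similarity_class` and `gen_rul_sim` are stand-ins (assumptions)
# for the paper's R relation and GenRulSim() procedure.
def induce_rules(objects, labels, similarity_class, gen_rul_sim):
    rules, used = [], set()
    for i, _ in enumerate(objects):
        if i in used:
            continue
        members = similarity_class(i)                    # indices in [x_i]_R
        classes = [labels[j] for j in members]
        target = Counter(classes).most_common(1)[0][0]   # pure or most frequent class
        rules.append(gen_rul_sim(members, target))
        used.update(members)                             # flag objects as used
    return rules
```

Because every member of \([x_i]_R\) is flagged as used, each object contributes to at most one rule, which is what keeps the sequential covering process finite.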

[Algorithm 3: GenRulSim() (pseudocode figure)]

The term \(V_i[O_s^k]\) in Step 1 denotes the set of values of the attribute \(V_i\) for those objects belonging to the set \(O_s^k\). The function \(f(V_i[O_s^k])\) aggregates the set of values in \(V_i[O_s^k]\) depending on the type of the attribute \(V_i\), e.g., the mode in case of \(V_i\) being a nominal attribute or the mean/median if it is a numerical attribute.

Hence, the vector p could be thought of as a prototype (or centroid) for all objects in \(O_s^k\). Additionally, the weight vector \((w_1,\dots ,w_n)\), the similarity threshold \(\varepsilon \) and the attribute-wise similarity functions \(\delta _i\) come from Eqs. 7 and 8.
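The original aggregation \(f(\cdot )\) that builds the prototype p can be sketched as follows: the mode for nominal attributes and the mean for numeric ones (the paper also allows the median). The column layout and type flags are illustrative assumptions.

```python
import statistics
from collections import Counter

# Sketch of the original aggregation f(.) that builds the prototype p:
# mode for nominal attributes, mean for numeric ones (the paper also allows
# the median). Column layout and type flags are illustrative assumptions.
def prototype(objects, numeric_flags):
    p = []
    for col, is_num in zip(zip(*objects), numeric_flags):
        if is_num:
            p.append(statistics.mean(col))               # mean of a numeric attribute
        else:
            p.append(Counter(col).most_common(1)[0][0])  # mode of a nominal attribute
    return p

print(prototype([[1.0, "red"], [3.0, "red"], [2.0, "blue"]], [True, False]))  # [2.0, 'red']
```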

3 The IRBASIR-Bayes Algorithm

In order to improve the performance of the IRBASIR algorithm, a modification in Step 1 of the GenRulSim() method is introduced. The idea is to vary the aggregation function \(f(\cdot )\) to find the vector p with n components for the set of reference objects in \(O_s^k\), from which the rules are generated.

The main idea of the proposed IRBASIR-Bayes approach is to rewrite the function \(f(\cdot )\) by switching from a purely statistical aggregation method, such as the mean or the mode of a set, to a probabilistic scheme like the one used in the Naïve-Bayes classifier.

Each component \(p(i),i= 1,\cdots ,n\), of the reference (central tendency) vector p is now calculated as shown in Eq. 11 if \(V_i\) is a nominal attribute and by means of Eq. 12 if \(V_i\) is a numerical attribute,

$$\begin{aligned} p_{(i)}=\arg \max _{v\in V_i[O_{s}^k]} \left\{ \dfrac{|\{x \in O_{s}^k: x_i=v\}|}{|O_s^k|}\right\} \end{aligned}$$
(11)
$$\begin{aligned} p_{(i)}=\arg \max _{v\in V_i[O_{s}^k]} \left\{ \dfrac{1}{\sigma _i\sqrt{2\pi }}e^{-\dfrac{(v-\mu _i)^2}{2\sigma _i^2}}\right\} \end{aligned}$$
(12)

In the nominal/discrete case portrayed in Eq. 11, the estimation of the conditional probability is based on the frequencies with which the different values for the attribute \(V_i\) appear in the objects of the set \(O_s^k\). The reader may notice that this approach is equivalent to the calculation of the mode of \(V_i\) over \(O_s^k\). Semantically, this means that the vector component p(i) will take the most frequently occurring value in that attribute for the objects under consideration. Therefore, the reference vector p is made up by the most frequent values of all nominal attributes in the decision system.

When it comes to numerical attributes, however, the NB method assumes they follow an underlying Gaussian probability distribution; hence, the first step is the calculation of the sample mean \(\mu _i\) and the sample standard deviation \(\sigma _i\) for the objects in the similarity class \(O_s^k\). Then, Eq. 12 applies to each numeric value x of the attribute \(V_i\) over the objects in \(O_s^k\). From a semantic standpoint, this means that the representative value for that attribute captured by the vector component p(i) will be the one with the highest probability distribution value among all other values of \(V_i\) for the objects under consideration.
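Putting Eqs. 11 and 12 together, the Bayes-style prototype can be sketched as follows; the data layout and names are illustrative assumptions.

```python
import math
from collections import Counter

# Sketch of the Bayes-style prototype of Eqs. 11-12: nominal components take
# the most frequent value (Eq. 11, i.e., the mode), while numeric components
# take the observed value with the highest Gaussian density under the sample
# mean and standard deviation (Eq. 12). Layout and names are assumptions.
def bayes_prototype(objects, numeric_flags):
    p = []
    for col, is_num in zip(zip(*objects), numeric_flags):
        if is_num:
            mu = sum(col) / len(col)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)) or 1e-9
            def density(v):
                return math.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
            p.append(max(col, key=density))              # Eq. 12
        else:
            p.append(Counter(col).most_common(1)[0][0])  # Eq. 11
    return p
```

For a unimodal Gaussian the winning value in Eq. 12 is simply the observed value closest to the sample mean, which makes the probabilistic reading easy to check by hand.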

4 Experimental Results

For the empirical analysis, we relied on 13 data sets in .ARFF format from the UCI Machine Learning Repository. They were used to gauge the efficacy and efficiency of the algorithms under consideration. Table 1 lists the data sets.

Table 1. Description of the datasets used in the experiments.

We will empirically compare the classification accuracy of the following classifiers: C4.5 [13], EXPLORER [10], MODLEM [10], LEM2 [9], IRBASIR [4] and IRBASIR-Bayes on each of the datasets using 10-fold cross-validation. Table 2 reports these results. Classification accuracy was calculated with the measure typically used in these cases, as shown in Eq. 13:

Table 2. Classification accuracy results for all rule induction algorithms
$$\begin{aligned} Accuracy= \dfrac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(13)

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
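Expression 13 translates directly into code; the counts below are illustrative.

```python
# Expression 13 as code; the confusion-matrix counts are illustrative.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(40, 45, 5, 10))  # 0.85
```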

To statistically validate the results, we employed non-parametric multiple-comparison tests in order to identify the best algorithm. The Friedman test is the non-parametric standard test for comparison among several classifiers. This test works by assigning ranks in ascending order, i.e., the first rank for the best obtained results, the second rank for the second best result and so on  [2].

The null hypothesis states that all algorithms behave similarly, so their ranks should be similar. In the experimental study conducted in this investigation, the Friedman statistic was 44.241758, with an associated p-value of (practically) zero. As this value far exceeds the critical point of a Chi-square distribution with five degrees of freedom, there is enough evidence to reject the null hypothesis and conclude that there are indeed statistically significant differences among the algorithms. Table 3 displays the ranks produced by the Friedman test upon the set of algorithms under comparison, with IRBASIR-Bayes topping the list.
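For reference, the Friedman statistic can be computed from the per-dataset ranks as \(\chi ^2_F = \frac{12N}{k(k+1)}\left( \sum _j R_j^2 - \frac{k(k+1)^2}{4}\right) \); the sketch below is pure Python, with ties ignored for brevity, and the test data is illustrative rather than the paper's results.

```python
# Pure-Python sketch of the Friedman statistic: rank the k algorithms on each
# of the N datasets (rank 1 = highest accuracy), average the ranks, and apply
# chi2_F = 12N/(k(k+1)) * (sum_j R_j^2 - k(k+1)^2/4). Ties are ignored for
# brevity; the input layout is an assumption.
def friedman_statistic(results):
    """results[j][d] = accuracy of algorithm j on dataset d."""
    k, n = len(results), len(results[0])
    avg_rank = [0.0] * k
    for d in range(n):
        ordered = sorted(range(k), key=lambda j: results[j][d], reverse=True)
        for rank, j in enumerate(ordered, start=1):
            avg_rank[j] += rank / n
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_rank) - k * (k + 1) ** 2 / 4)
```

With three algorithms over five datasets where one algorithm always wins and another always loses, the statistic is 2N = 10, above the 5.99 critical value of a Chi-square distribution with 2 degrees of freedom at the 5% level.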

Table 3. Friedman ranks for each algorithm
Table 4. Holm post-hoc procedure results with IRBASIR-Bayes as the control method

Table 4 displays the results of the Holm post-hoc procedure used to compare the control algorithm (IRBASIR-Bayes) with the rest at a \(5\%\) significance level. The test statistic z for comparing the ith and jth algorithms depends on the main non-parametric procedure used, where \(R_0\) and \(R_i\) are the average Friedman rankings of the algorithms being compared [1]. The Holm test rejects those hypotheses having a p-value \(\le 0.05\). The test rejects all cases in favor of the best-ranked algorithm; hence we can claim that the proposed algorithm is statistically superior to the others in terms of classification accuracy.

5 Conclusions

The IRBASIR algorithm allows inducing classification rules for mixed data (i.e., numerical and nominal attributes). It is characterized by not requiring the discretization of the numerical attributes, neither as a preprocessing step nor during the learning process. IRBASIR generates the rules from a reference or prototype vector that allows constructing similarity classes for the objects in the training set. This vector is simply calculated as the mean/median for each numerical attribute.

IRBASIR-Bayes is a spin-off from IRBASIR that alters the way in which prototype vectors are constructed, namely by switching from the mean/median statistical approach to a probabilistic approach through the classification functions used by Naïve-Bayes. IRBASIR-Bayes has the disadvantage of assuming normality and independence among the data. Nevertheless, the results obtained with 13 publicly available datasets from the UCI Machine Learning Repository evidence a statistically verified superiority in terms of classification accuracy over other well-known rule induction algorithms (C4.5, MODLEM, EXPLORER, LEM2, and IRBASIR itself); another benefit is that the method allows processing attributes with missing values. Future work will concentrate on the parametric tuning of IRBASIR and IRBASIR-Bayes so as to obtain even higher accuracy marks.