1 Introduction

With the advent of new technologies, we are facing an exponentially growing volume of complex structured data in diverse fields of science and engineering, and data mining has become ubiquitous in many real-world applications. Wide interest in data analysis from numerous disciplines has motivated an interdisciplinary research approach that cuts across applied mathematics and computer science and fosters the integration of ideas. These interdisciplinary efforts set the stage for innovation by creating new tools, developing new disciplines, and ultimately opening new avenues of research. A fundamental challenge in data mining is to extract, analyze, and interpret knowledge from large-scale datasets effectively and efficiently. To address this challenge, traditional statistical methods are complemented by sophisticated supervised learning techniques, including, for example, support vector machines (Burges 1998; Schölkopf and Smola 2001), neural networks (Bishop 2007; Fausett 1994), decision trees (Bishop 2007; Duda et al. 2001), and a pattern-based method called Logical Analysis of Data (Alexe et al. 2007; Boros et al. 1997, 2000), all of which are designed to find a decision boundary from given samples with known classes in order to predict the class of a new or unseen observation.

Many supervised learning algorithms are designed for binary classification problems, in which a learning model is constructed to separate observations into two predefined classes. However, many real-world problems require the identification of more than two subgroups of observations and of the features and patterns associated with each subgroup. Typical examples include (i) identification of different subtypes of human cancers (Hanash and Creighton 2003), (ii) protein fold recognition (Ding and Dubchak 2001), (iii) classification of microscopy images (Boland et al. 1998; Misselwitz et al. 2010), (iv) histogram based image classification (Chapelle et al. 1999), (v) handwritten character recognition (LeCun et al. 1989; Lee and Seung 1997), (vi) part-of-speech tagging (Nakagawa et al. 2002), (vii) speech recognition (Jelinek 1998), and (viii) text categorization (Apté et al. 1994).

Since the problem is of practical importance, there have been several attempts to extend well-known binary classification algorithms to multiclass problems. The most common approaches to multiclass classification are natural extensions of the binary classification problem, known as One-vs-One (OvO) and One-vs-Rest (OvR) (Hastie and Tibshirani 1998). Given a K-class dataset \(\varOmega \subset {\mathbb {R}}^{m\times n}\) with m observations and n features, the OvO scheme, shown in Fig. 1, assumes that there exists a separator between any two classes and builds \(K(K-1)/2\) classifiers, denoted by \(f_{ij}\), to distinguish each pair of classes \({\mathcal {C}}_i, {\mathcal {C}}_j\in {\mathcal {C}}, i\ne j\), where \({\mathcal {C}}=\{{\mathcal {C}}_1,\ldots ,{\mathcal {C}}_K\}\) is the family of classes. The class of a new or unseen observation, \(\mathbf{o }\in {\mathbb {R}}^{n}\), \(\mathbf{o }\notin \varOmega \), is then assigned by the use of the discriminant function:

$$\begin{aligned} f(\mathbf{o })=\arg \max _i\sum _jf_{ij}(\mathbf{o }). \end{aligned}$$
(1)
Fig. 1 One-vs-One (OvO) multiclass scheme

A less expensive approach, OvR, shown in Fig. 2, assumes the existence of a single separator between a class \({\mathcal {C}}_i\) (for some i) and all other classes in \({\mathcal {C}}\) and builds K different binary classifiers. Let \(f_i\) be the ith classifier separating observations in class \({\mathcal {C}}_i\) (considered to be positive) and observations in \({\mathcal {C}}\setminus {\mathcal {C}}_i\) (forming the set of negative observations). In this case, a new or unseen observation \(\mathbf{o }\in {\mathbb {R}}^{n}\), \(\mathbf{o }\notin \varOmega \) is classified by

$$\begin{aligned} f(\mathbf{o })=\arg \max _i f_i(\mathbf{o }). \end{aligned}$$
(2)
Fig. 2 One-vs-Rest (OvR) multiclass scheme
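
To make the aggregation rules (1) and (2) concrete, the following minimal Python sketch implements both discriminants for generic binary classifiers; the dictionaries holding the trained pairwise and per-class classifiers are hypothetical placeholders, not part of the cited works.

```python
# Minimal sketch of the OvO rule (1) and the OvR rule (2).
# f_ovo[(i, j)](o) is assumed to return a score in favour of class i against class j,
# and f_ovr[i](o) a score in favour of class i against the rest (both hypothetical).

def ovo_predict(o, f_ovo, K):
    # f(o) = argmax_i sum_j f_ij(o)
    scores = [sum(f_ovo[(i, j)](o) for j in range(K) if j != i) for i in range(K)]
    return max(range(K), key=lambda i: scores[i])

def ovr_predict(o, f_ovr, K):
    # f(o) = argmax_i f_i(o)
    return max(range(K), key=lambda i: f_ovr[i](o))


# Toy usage with K = 3 classifiers scoring a single numeric feature o[0]:
f_ovr = {0: lambda o: -o[0], 1: lambda o: 1 - abs(o[0] - 1), 2: lambda o: o[0]}
print(ovr_predict([2.0], f_ovr, 3))  # -> 2
```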

Since both approaches are easy to adopt, diverse groups of researchers developed them independently: see, for example, multiclass classification (Beygelzimer et al. 2007; Gehler and Nowozin 2009; Har-Peled et al. 2002; Yang and Tsang 2012), discriminant analysis for multiclass classification (Li et al. 2006; Liu et al. 2011), multiclass learning (Daniely et al. 2012; Even-Zohar and Roth 2001), combining many two-class classifiers into a multiclass classifier (Galar et al. 2011; Platt et al. 2000; Tax and Duin 2002; Tewari and Bartlett 2007; Wu et al. 2004), multiclass classification with applications (Singh-Miller and Collins 2009), mixed integer programming approaches to multiclass data classification (Avila-Herrera and Subasi 2013, 2015; Kim and Choi 2015; Üney and Türkay 2006), multiclass classification using support vector machines (Aiolli and Sperduti 2005), and a general review of multiclass classification methods (Aly 2005). The choice between the use of OvO and OvR in multiclass problems is largely computational.

Despite the undoubted advancements in the area of multiclass classification, there is still room for developing new approaches to improve the effectiveness and efficiency of the methods and tools to analyze archives of historical records for the purpose of discovering hidden structural relationships in large-scale datasets. In this paper, we integrate the mixed integer linear programming based Logical Analysis of Data (LAD) approach of Ryoo and Jang (2009) with the multiclass LAD method of Avila-Herrera and Subasi (2013, 2015) to develop a new multiclass LAD algorithm, where two control parameters, homogeneity and prevalence, are incorporated to generate relaxed (fuzzy) patterns.

LAD is a pattern-based two-class learning method which integrates principles of combinatorics, optimization, and the theory of Boolean functions. The research area of LAD was introduced and developed by Hammer (1986), whose vision expanded the LAD methodology from theory to successful data applications in numerous biomedical, industrial, and economic case studies; see, for example, Alexe et al. (2003, 2004, 2005, 2006), Hammer et al. (1999, 2011), Hammer and Bonates (2006), Lauer et al. (2002), Reddy et al. (2008, 2009) and the references therein. The implementation of the LAD method was described in Boros et al. (1997, 2000), Crama et al. (1988) and several further developments of the original technique were presented in Alexe et al. (2007), Alexe and Hammer (2006), Bonates et al. (2008), Guo and Ryoo (2012), Hammer et al. (2004), Ryoo and Jang (2009). An overview of the standard LAD method can be found in Alexe et al. (2007) and Bonates et al. (2008). Various recent applications of LAD are presented in Dupuis et al. (2012), Ghasemi et al. (2013), Lejeune et al. (2018), Lejeune and Margot (2011), Mortada et al. (2011) and Subasi et al. (2017). The LAD method has also been extended to survival analysis (Kronek and Reddy 2008) and regression analysis (Bonates and Hammer 2007; Lemaire 2011).

Extensions of the LAD method to multiclass problems were previously studied by Moreira (2000), Mortada (2010), and Mortada et al. (2014). Moreira (2000) proposed two methods to break down a multiclass classification problem into two-class problems using an OvO approach. The first method uses the typical OvO scheme, which does not require altering the structure of the standard LAD method presented by Boros et al. (2000). The second OvO-type method modifies the architecture of the pattern generation and theory formation steps in the standard LAD method, where a LAD pattern \(P_{ij}\) is generated for each pair of classes \({\mathcal {C}}_i,{\mathcal {C}}_j\in {\mathcal {C}}\), \(\displaystyle {i\ne j}\).

Mortada (2010) proposed a multiclass LAD method that integrates ideas from the second approach of Moreira (2000), which is based on the OvO scheme, with the mixed integer linear programming (MILP) implementation of LAD presented by Ryoo and Jang (2009). The methodology of Mortada (2010) was applied to five multiclass benchmark datasets. Mortada (2010) observed that the MILP based LAD approach of Ryoo and Jang (2009) combined with the second approach of Moreira (2000) provides classification models with higher accuracy than those obtained by applying the multiclass approach to the standard LAD algorithm of Boros et al. (2000).

Recent papers by Avila-Herrera and Subasi (2013, 2015) and Kim and Choi (2015) have also considered the multiclass extension of LAD. Avila-Herrera and Subasi (2013, 2015) explored and rectified limitations of the two-class MILP LAD approach of Ryoo and Jang (2009) related to its poor differentiating power in two-class classification. Kim and Choi (2015) developed an efficient iterative genetic algorithm with flexible chromosomes and multiple populations to extend LAD to multiclass classification; the performance of their method was evaluated on six benchmark multiclass datasets.

In this paper, we propose a parametrized (relaxed) algorithmic approach, building on the MILP pattern generation approach of Ryoo and Jang (2009) and the multiclass LAD approach of Avila-Herrera and Subasi (2013, 2015), that constructs an OvR-type LAD classifier to identify patterns in multiclass datasets. This modification introduces two control parameters, homogeneity and prevalence, to generate fuzzy patterns which we call “relaxed patterns”. The organization of the paper is as follows. Section 2 describes the basic principles of the standard LAD method of Boros et al. (2000). Section 3 presents the proposed MILP based relaxed multiclass LAD approach to obtain OvR-type multiclass LAD classifiers. In Sect. 4 we present experiments on multiclass benchmark datasets to demonstrate the utility of our proposed methodology and compare the efficiency and performance of the proposed multiclass LAD method with relaxed patterns with those of previously developed LAD based multiclass classification methods as well as other well-known supervised learning methods.

2 Preliminaries: logical analysis of data

Logical Analysis of Data (LAD) is a two-class learning method based on combinatorics, optimization, and the theory of Boolean functions (Boros et al. 2000). Given an input dataset, \(\varOmega \), consisting of two disjoint classes \(\varOmega ^+\) (set of positive observations) and \(\varOmega ^-\) (set of negative observations), where \(\varOmega =\varOmega ^+\cup \varOmega ^-\) and \(\varOmega ^+\cap \varOmega ^-=\emptyset \), there exists a hidden function of nature separating the observations in \(\varOmega ^+\) and \(\varOmega ^-\). The goal of LAD is to identify positive and negative patterns to approximate this hidden function of nature (as illustrated in Fig. 3).

Fig. 3 LAD approximation to hidden function in nature

The standard LAD methodology presented in Boros et al. (1997, 2000) is a multistep procedure, consisting of five main components outlined below. A more detailed overview of the standard LAD method together with the recent developments in its theory and applications can be found in Lejeune et al. (2018).

2.1 Discretization/binarization and support set selection

This step is the transformation of numeric features (attributes/variables) into binary features without losing predictive power. The procedure consists of finding cut-points for each numeric feature. The set of cut-points can be interpreted as a sequence of threshold values collectively used to build a global classification model over all features (Boros et al. 1997, 2000). Given a numeric dataset, there is a large number of “feasible” cut-points, identified by standard methods such as equal width, equal intervals, entropy-based criteria, chi-square tests, etc. In most cases many of these cut-points may be redundant. In this step a set covering problem is solved to identify an optimal (minimum size) set of cut-points to transform the numerical features into binary ones, as illustrated in Fig. 4.

Fig. 4 Illustration of cut-points

Discretization is a very useful step in data mining, especially for the analysis of medical data (which is often noisy and includes measurement errors), since it reduces noise and produces robust results. The problem of discretization is well studied and many powerful methods are presented in the literature; see, e.g., the survey papers (Kotsiantis and Kanellopoulus 2006; Liu et al. 2002). The discretization step may produce several binary features, some of which may be redundant. In this step of the LAD procedure, a minimum set covering problem is solved to obtain an irredundant smallest subset of binary variables that can distinguish every pair of positive and negative observations in the dataset. The resulting set is called a support set (Boros et al. 1997, 2000). If the input data is categorical (nominal) or text, there are well-known techniques to binarize the data (Aggarwal 2015).
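
As a simple illustration of how candidate cut-points can be generated before the set covering step, the sketch below places a cut-point halfway between consecutive feature values observed in different classes; this is a simplified variant for illustration, not the exact procedure of Boros et al. (1997, 2000).

```python
def candidate_cutpoints(values, labels):
    """Candidate cut-points for one numeric feature: midpoints between
    consecutive distinct values observed in different classes.
    (Ties between classes at the same value are ignored in this sketch.)"""
    pairs = sorted(zip(values, labels))
    cuts = set()
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if v1 != v2 and c1 != c2:
            cuts.add((v1 + v2) / 2.0)
    return sorted(cuts)

# Example: candidate_cutpoints([4.9, 5.0, 6.1, 6.3], ['-', '-', '+', '+']) -> [5.55]
```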

2.2 Pattern generation

Patterns are the key ingredients of the LAD algorithm. This step uses the features in combination to produce rules (combinatorial patterns) that can define homogeneous subgroups of interest within the data. The simultaneous use of two or more features allows the identification of more complex rules that can be used for the precise classification of an observation, as illustrated in Fig. 5.

Given a binary (or binarized) dataset \(\varOmega =\varOmega ^+\cup \varOmega ^-\subset {\mathbb {B}}^{m\times n}\) with m observations and n features, where \(\varOmega ^+\cap \varOmega ^-=\emptyset \), a pattern P is simply defined as a subcube of \({\mathbb {B}}^n=\{0,1\}^n\). A LAD pattern can be described as a Boolean term, that is, a conjunction of literals (binary variables or their negations) which does not contain both a variable and its negation:

$$\begin{aligned} P=\bigwedge _{j\in M_P} x_j\bigwedge _{j\in N_P} \bar{x}_j \end{aligned}$$

where \(M_P,N_P\subseteq \{1,\cdots ,n\}\), \(M_P\cap N_P=\emptyset \), and \(x_j\) is the Boolean literal associated with the jth feature in the dataset.
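In code, a pattern can be represented by the index sets \(M_P\) and \(N_P\) and evaluated on a binary observation as in the following minimal sketch (features are 0-indexed here):

```python
def pattern_value(obs, M_P, N_P):
    """Evaluate P = AND_{j in M_P} x_j AND_{j in N_P} not x_j on a 0/1 observation."""
    return int(all(obs[j] == 1 for j in M_P) and all(obs[j] == 0 for j in N_P))

# Example: the pattern x_0 AND not x_2 covers (1, 1, 0) but not (1, 1, 1):
# pattern_value((1, 1, 0), M_P={0}, N_P={2}) -> 1
# pattern_value((1, 1, 1), M_P={0}, N_P={2}) -> 0
```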

Patterns define homogeneous subgroups of observations and have the following distinctive characteristics.

  • Degree: The number of literals involved in the definition of a pattern is called the degree of the pattern.

  • Homogeneity: The proportion of positive (negative) observations among all those observations covered by a pattern is called the positive (negative) homogeneity of the pattern. A pure positive pattern has 100% positive homogeneity and 0% negative homogeneity and is defined as a combination of features which covers a proportion of positive observations, but none of the negative ones: \(P(\omega ^+)=1\) for at least one \(\displaystyle {\omega ^+\in \varOmega ^+}\) and \(P(\omega ^-)=0\) for every \(\omega ^-\in \varOmega ^-\). A pure negative pattern can be defined similarly: \(P(\omega ^-)=1\) for at least one \(\displaystyle {\omega ^-\in \varOmega ^-}\) and \(P(\omega ^+)=0\) for every \(\omega ^+\in \varOmega ^+\).

  • Prevalence: The proportion of positive (negative) observations covered by a pattern is called the positive (negative) prevalence of the pattern.

  • Coverage: An observation \(\omega \in \varOmega \) satisfying the conditions of a pattern P, i.e., \(P(\omega )=1\), is said to be covered by that pattern. Coverage of a pattern P, denoted by Cov(P), is the set of observations covered by the pattern.

  • Hazard ratio: The ratio between the proportion of positive observations among all those observations covered by a pattern and the proportion of positive observations among those observations not covered by the pattern is called the hazard ratio of the pattern.

Note that the positive and negative homogeneities of a pattern sum to 100%. A pattern associated with the positive (negative) class must exhibit a positive (negative) homogeneity of more than 50%. A positive (negative) pattern with positive (negative) homogeneity less than 100% is called a relaxed (fuzzy) pattern. We also remark that a high quality positive (negative) pattern must have high positive (negative) homogeneity and high positive (negative) prevalence. A smaller degree reduces complexity, allowing easier interpretation of the pattern.
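
These characteristics can be computed directly from coverage counts. The sketch below (using the same set-based pattern representation as above) returns the degree, positive homogeneity, and positive prevalence of a candidate positive pattern; it is illustrative only.

```python
def pattern_characteristics(M_P, N_P, positives, negatives):
    """Degree, positive homogeneity, and positive prevalence of the pattern
    P = AND_{j in M_P} x_j AND_{j in N_P} not x_j over 0/1 observations."""
    covers = lambda obs: (all(obs[j] == 1 for j in M_P) and
                          all(obs[j] == 0 for j in N_P))
    pos_cov = sum(1 for obs in positives if covers(obs))
    neg_cov = sum(1 for obs in negatives if covers(obs))
    degree = len(M_P) + len(N_P)
    homogeneity = pos_cov / (pos_cov + neg_cov) if pos_cov + neg_cov else 0.0
    prevalence = pos_cov / len(positives) if positives else 0.0
    return degree, homogeneity, prevalence

# A pure positive pattern has homogeneity 1.0; a relaxed (fuzzy) positive
# pattern has homogeneity strictly between 0.5 and 1.0.
```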

The most straightforward approach to pattern generation is based on the use of combinatorial enumeration techniques, for example, a bottom-up/top-down approach (Boros et al. 1997, 2000). The bottom-up approach generates patterns in lexicographic order to reduce the amount of computation necessary. The approach starts with terms of degree one that cover some positive observations. If such a term does not cover any negative observation, it is a positive pattern. Otherwise, literals are added to the term one by one until a pattern of a prescribed degree is generated. The top-down pattern generation approach starts by considering all uncovered observations as patterns of degree n and, for each of those patterns, literals are removed one by one until a pattern with smallest degree is reached. The enumeration type pattern generation approach is a costly process. Given a two-class binary dataset with n features, the total number of candidate patterns to be searched is \(\sum _{i=1}^n 2^i \binom{n}{i}\) and the number of degree d patterns can be \(2^d\binom{n}{d}\); for instance, with \(n=30\) features there are already \(2^3\binom{30}{3}=32{,}480\) candidate terms of degree 3.

A pattern P is called a strong pattern if there is no pattern \(P^\prime \) such that \(Cov(P)\subset Cov(P^\prime )\). Pattern P is called a prime pattern if the deletion of any literal from P results in a term that is no longer a pattern. Since patterns play a central role in LAD methodology, various types of patterns have been studied and several pattern generation algorithms have been developed for their enumeration (see Lejeune et al. 2018 and the references therein). Our OvR-type multiclass LAD algorithm is motivated by the MILP approach of Ryoo and Jang (2009) that generates strong LAD patterns in a two-class dataset.

Consider a two-class dataset \(\varOmega \) consisting of m binary observations and n features. Let \(I^+=\{i: \omega ^+_i\in \varOmega ^+\}\) and \(I^-=\{i: \omega ^-_i\in \varOmega ^-\}\), where \(\varOmega =\varOmega ^+\cup \varOmega ^-\) and \(\varOmega ^+\cap \varOmega ^-=\emptyset \). For each observation \(\omega _i\in \varOmega \), let \(\omega _{ij}\) denote the binary value of the jth feature in the ith observation \(\omega _i\). Let \(x_j\), \(j=1,\cdots ,n\), denote the Boolean literal corresponding to the jth feature \(b_j\) in \(\varOmega \) and introduce n new features \(x_{n+j}=1-x_j\), \(j=1,\cdots ,n\) (representing \(\bar{b}_j\)). If \(x_j=1\) for some specific \(j=1,\ldots ,n\), then \(b_j=1\) in observation \(\omega _i\) and if \(x_{n+j}=1\) for some specific \(j=1,\ldots ,n\), then \(b_j=0\) in observation \(\omega _i\). Let \(\rho _i\), \(i=1,\ldots ,m\), be penalties associated with the coverage of the pattern P, defined as

$$\begin{aligned} \rho _i=\left\{ \begin{array}{ll} 1 &{}\quad \hbox {if}\quad P(\omega _i)=0,~ i\in I^+\\ 0 &{}\quad {\textit{otherwise}}. \end{array} \right. \end{aligned}$$

Following the pattern generation approach of Ryoo and Jang (2009), we formulate the following MILP to generate a positive LAD pattern with \(\alpha \%\) positive homogeneity and \(\beta \%\) positive prevalence:

$$\begin{aligned} \begin{array}{l} \hbox {minimize }\displaystyle { cd+\sum _{i\in I^+} {\rho _i} }\\ \hbox {subject to}\\ \begin{array}{c} \displaystyle {\sum _{j=1}^{2n}\omega _{ij}x_j+ n\rho _i\ge d, \quad i\in I^+} \\ \displaystyle {\sum _{j=1}^{2n}\omega _{ij}x_j - y_i\le d-1, \quad i \in I^-}\\ \displaystyle {\sum _{i\in I^-} y_i \le \alpha |\varOmega ^+|}\\ \displaystyle {\sum _{i\in I^+} \rho _i \le (1-\beta )|\varOmega ^+|}\\ x_j+x_{n+j} \le 1, \quad j=1,\ldots ,n \\ \displaystyle {\sum _{j=1}^{2n} x_j =d}\\ 1\le d\le n,\quad d\in {\mathbb {Z}}^+\\ 0\le y_i\le 1, \quad i=1,\ldots ,m\\ \rho _i\in \{0,1\}, \quad i=1,\ldots ,m\\ x_j\in \{0,1\}, \quad j=1,\ldots ,2n \\ \end{array} \end{array} \end{aligned}$$
(3)

where \(0\le \alpha ,\beta \le 1\) and \(c\in {\mathbb {R}}\) are constants, d is the unknown degree of a positive pattern P, and the variables \(y_i\), \(i=1,\ldots ,m\), are relaxation variables ensuring the generation of a pattern with \(\alpha \%\) homogeneity and \(\beta \%\) prevalence.

Fig. 5 LAD classification model

Variables \(x_j\) and \(x_{n+j}\), \(j=1,\ldots ,n\), associated with the jth feature in dataset \(\varOmega \) determine whether the jth feature takes value 1 or 0 in the ith observation \(\omega _i\in \varOmega \). A feasible solution of problem (3) is a positive pattern with degree d, \(\alpha \%\) positive homogeneity, and \(\beta \%\) positive prevalence. When the scaling parameter c is positive, an optimal solution of problem (3) is a positive pattern with the smallest degree and maximum coverage, and hence is a strong prime pattern (Ryoo and Jang 2009):

$$\begin{aligned} P=\bigwedge _{\{j~:~ x_j=1\}}b_j \bigwedge _{\{j~:~ x_{n+j}=1\}} \bar{b}_j~. \end{aligned}$$

Note that if we change the roles of index sets \(I^+\) and \(I^-\) in problem (3), an optimal solution of the problem provides us with a negative strong prime pattern with \(\alpha \%\) negative homogeneity and \(\beta \%\) negative prevalence.

2.3 LAD model

The entire collection of patterns generated in the Pattern Generation Step is called the pandect and is denoted by \(\displaystyle {{\mathcal {P}}}=\displaystyle {{\mathcal {P}}}^+\cup \displaystyle {{\mathcal {P}}}^-\), where \(\displaystyle {{\mathcal {P}}}^+\) and \(\displaystyle {{\mathcal {P}}}^-\) are the disjoint sets of all positive and negative patterns, respectively. A LAD model (illustrated in Fig. 5), denoted by \({\mathcal {M}}\), is a subset of the pandect, that is, \(\displaystyle {{\mathcal {M}}}=\displaystyle {{\mathcal {M}}}^+\cup \displaystyle {{\mathcal {M}}}^-\), where \(\displaystyle {{\mathcal {M}}}^+\subseteq \displaystyle {{\mathcal {P}}}^+\) and \(\displaystyle {{\mathcal {M}}}^-\subseteq \displaystyle {{\mathcal {P}}}^-\).

Patterns are selected into the LAD model \(\mathcal {M}\) so that the model provides the same separation of the positive and negative observations as the pandect \(\mathcal {P}\).

In many cases, when constructing a LAD model, every observation in the training dataset is required to be covered at least k times \((k\in {\mathbb {Z}}^+)\) by the patterns in the model. The standard LAD approach of Boros et al. (1997, 2000) produces patterns using enumerative techniques; greedy-type heuristics are then adopted to select patterns into a final LAD model. In the MILP approach of Ryoo and Jang (2009), where patterns are produced as optimal solutions of the MILPs involved, such as problem (3), the authors proposed Algorithm 1 to find a LAD model.

Algorithm 1 LAD model generation (Ryoo and Jang 2009)

Note that after a pattern is generated as an optimal solution of problem (3), Algorithm 1, proposed by Ryoo and Jang (2009), removes the observations covered by that pattern from the training data to prevent the algorithm from finding the same pattern in subsequent solutions of problem (3). The algorithm terminates when every observation is covered at least once. The resulting set of positive and negative patterns forms a LAD model \({\mathcal {M}}\).

2.4 Prediction and accuracy

In the final step of the LAD framework, the LAD classification model \(\mathcal {M}\) is used for the classification of a new or unseen observation \(\mathbf{o }\in {\mathbb {B}}^{n}\), \(\mathbf{o }\notin \varOmega \), by the use of a discriminant function \(\varDelta :\{0,1\}^n\rightarrow {\mathbb {R}}\) associated with the model \(\displaystyle {{\mathcal {M}}}\), where \(\varDelta (\mathbf{o })\) is defined as the difference between the weighted proportions of positive and negative patterns covering \(\mathbf{o }\), that is,

$$\begin{aligned} \varDelta (\mathbf{o })=\sum _{P_k^+\in \displaystyle {{\mathcal {M}}}^+}\delta _k^+P_k^+(\mathbf{o }) ~~-\sum _{P_k^-\in \displaystyle {{\mathcal {M}}}^-}\delta _k^-P_k^-(\mathbf{o }), \end{aligned}$$

where \(\delta _k^+\ge 0\) and \(\delta _k^-\ge 0\) are the weights assigned to positive patterns \(P_k^+\in {{\mathcal {M}}}^+\) and negative patterns \(P_k^-\in {{\mathcal {M}}}^-\), respectively. The weights \(\delta _k^+\) and \(\delta _k^-\) can be calculated in several ways. One possibility is to use the proportion of positive (negative) observations covered by a positive pattern \(P_k^+\in \displaystyle {{\mathcal {M}}}^+\) (a negative pattern \(P_k^-\in \displaystyle {{\mathcal {M}}}^-\)) to the total number of positive (negative) observations (i.e., the prevalence of the pattern):

$$\begin{aligned} \delta _k^+=\frac{1}{\left| {\varOmega ^+}\right| } \sum _{i\in I^+}P_k^+(\omega ^+_i)\quad \text{ and }\quad \delta _k^-= \frac{1}{|\varOmega ^-|} \sum _{i\in I^-}P_k^-(\omega ^-_i) \end{aligned}$$

where \(I^+=\{i: \omega _i^+\in \varOmega ^+\}\), and \(I^-=\{i: \omega _i^-\in \varOmega ^-\}\).
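
A minimal sketch of this discriminant is given below; each pattern is assumed to be available as a 0/1-valued function of the observation, and the weights are, for instance, the prevalences defined above.

```python
def discriminant(o, pos_model, neg_model):
    """Delta(o) = sum_k delta_k^+ P_k^+(o) - sum_k delta_k^- P_k^-(o).
    pos_model and neg_model are lists of (pattern_function, weight) pairs."""
    pos_score = sum(w * P(o) for P, w in pos_model)
    neg_score = sum(w * P(o) for P, w in neg_model)
    return pos_score - neg_score

# With the usual convention (an assumption of this sketch, not stated above),
# o is classified as positive when Delta(o) > 0, negative when Delta(o) < 0,
# and left unclassified when Delta(o) = 0.
```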

The accuracy of the model is estimated by a classical cross-validation procedure, where the dataset \(\varOmega \) is randomly divided into two disjoint subsets called training and test sets (Aggarwal 2015; Dietterich 1998; Efron and Tibshirani 1986; Hastie et al. 2005). A LAD model is generated on the training data and evaluated on the test data. The experiment is repeated several times (the number of repetitions is determined by the data analyst) and the accuracy of the LAD model is reported as the average of the accuracies on the test datasets.
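
A sketch of this repeated cross-validation protocol is shown below; the use of scikit-learn's StratifiedKFold for the splits and the generic fit/predict callables are assumptions of this illustration, not necessarily the implementation used in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_accuracy(X, y, fit, predict, repeats=10, folds=10, seed=0):
    """Average test accuracy over `repeats` runs of `folds`-fold cross-validation.
    X and y are numpy arrays; fit/predict are the model builder and classifier."""
    accuracies = []
    for r in range(repeats):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            model = fit(X[train_idx], y[train_idx])
            accuracies.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(accuracies))
```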

If an external dataset (validation set) is available, a LAD model \({\mathcal {M}}\) is obtained for the original dataset \(\varOmega \) and the performance of the model is evaluated on the validation set.

Figures 6 and 7 show the overall framework of the LAD method, including the validation step in the absence or presence of an external dataset, respectively.

Fig. 6 Cross-validation of LAD model

Fig. 7 Validation of LAD model on an external dataset

3 Multiclass LAD method with relaxed patterns

In this section we present an OvR-type extension of the LAD algorithm to multiclass classification problems. As in the conventional LAD method, our multiclass LAD approach has four steps: (i) binarization and support set selection, (ii) pattern generation, (iii) theory formation, and (iv) prediction. These steps are outlined below.

3.1 Binarization and support set selection

Binarization of multiclass numeric data is similar to that of two-class data discussed in Sect. 2.1. The binarization step associates several cut-points, \(\alpha _{\nu _k}\), and the following indicator variables with a numeric feature \(\nu \) to transform it into a set of binary features:

$$\begin{aligned} x_{\nu _k}=\left\{ \begin{array}{ll} 1 &{}\quad \hbox {if}\quad \nu \ge \alpha _{\nu _k}\\ 0 &{}\quad \hbox {if}\quad \nu < \alpha _{\nu _k} \end{array} \right. \end{aligned}$$

Transforming the data from discrete levels to indicator variables results in a multiclass binary dataset. For each variable, virtually any numerical value can be considered as a cut-point. However, the cut-points are chosen in a way that allows one to distinguish between observations in different classes (Kotsiantis and Kanellopoulus 2006). An optimal (minimum size) set of cut-points can be obtained by solving a set-covering problem. The multiclass discretization problem is extensively studied and there are several different approaches to accomplish this task (Friedman et al. 2000). As in the two-class binarization procedure, an irredundant subset of binarized features (support set) that can distinguish every pair of observations in different classes of the K-class dataset \(\varOmega \) can be identified by solving a minimum set covering problem similar to the one presented in Boros et al. (1997, 2000).
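
For example, once the cut-points \(\alpha _{\nu _k}\) for a feature have been selected, the indicator variables defined above can be computed as in the following sketch (the cut-point values used in the example are hypothetical):

```python
def binarize_feature(values, cutpoints):
    """Map each numeric value nu to the indicator vector (x_{nu_1}, ..., x_{nu_t}),
    where x_{nu_k} = 1 if nu >= alpha_{nu_k} and 0 otherwise."""
    cuts = sorted(cutpoints)
    return [[1 if v >= a else 0 for a in cuts] for v in values]

# Example with hypothetical cut-points 2.45 and 4.75:
# binarize_feature([1.4, 4.5, 5.1], [2.45, 4.75]) -> [[0, 0], [1, 0], [1, 1]]
```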

In what follows we assume that the input dataset is a binary (or binarized) multiclass dataset.

3.2 Relaxed multiclass LAD pattern generation

Let \(\varOmega =\varOmega _1\cup \cdots \cup \varOmega _K\subset {\mathbb {B}}^{m\times n}\) be a K-class binary dataset with n features and m observations, where \(\varOmega _i\cap \varOmega _j=\emptyset \) for all \(i\ne j\). We introduce the following notations:

  • \({\mathcal {C}}=\{{\mathcal {C}}_1,\ldots , {\mathcal {C}}_K\}\): family of classes in \(\varOmega \), that is, any observation in \(\varOmega _k\) has class \({\mathcal {C}}_k\), \(k=1,\ldots ,K\).

  • \(P_{{\mathcal {C}}_p}\): pattern associated with class \({\mathcal {C}}_p\), \(p=1,\ldots ,K\).

  • \(\omega _{ij}\): binary value of the jth feature in the ith observation \(\omega _i\in \varOmega \).

  • \(y_j\), \(j=1,\ldots ,n\): binary variable representing a Boolean literal corresponding to the jth binary feature \(b_j\) in \(\varOmega \). If \(y_j=1\) for some specific \(j=1,\ldots ,n\), then \(b_{j}=1\) in observation \(\omega _i\) and the jth literal, \(x_j\), is used in the construction of the pattern.

  • \(y_{n+j}=1-y_j\), \(j=1,\ldots ,n\): binary variable representing the negation of the jth binary feature \(\bar{b}_j\) in \(\varOmega \). If \(y_{n+j}=1\) for some specific \(j=1,\ldots ,n\), then \(b_j=0\) in observation \(\omega _i\) and the negation of the jth literal, \(\bar{x}_j\), is used in the construction of the pattern.

  • \(d\in {\mathbb {Z}}^+\): unknown degree of a pattern \(P_{{\mathcal {C}}_p}\), \(p=1,\ldots ,K\).

  • \(\displaystyle {\rho = (\rho _1, \rho _2,\ldots , \rho _m)}\): penalty vector associated with the coverage of the pattern \(P_{{\mathcal {C}}_p}\). For all \(i=1,\ldots ,m\),

    $$\begin{aligned} \rho _i=\left\{ \begin{array}{ll} 1 &{}\quad \hbox {if}\quad \omega _i \in {\mathcal {C}}_p~\hbox {is not covered by pattern}~P_{{\mathcal {C}}_p} \\ 0 &{} \quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$
    (4)
  • \(z_i\), \(i=1,\ldots ,m\): relaxation variables, ensuring the generation of a pattern with homogeneity less than \(100\%\).

We shall formulate an MILP to generate a relaxed pattern \(P_{{\mathcal {C}}_p}\) of smallest degree and maximum coverage associated with some class \({\mathcal {C}}_p\) in the K-class dataset \(\varOmega \). In order to achieve this goal we proceed as follows:

  • Since a pattern cannot include both literals \(x_j\) and \(\bar{x}_j\), we impose the condition

    $$\begin{aligned} y_j+y_{n+j}\le 1,\quad j=1,\ldots ,n. \end{aligned}$$
    (5)
  • Degree of pattern \(P_{{\mathcal {C}}_p}\) is \(d\in {\mathbb {Z}}^+\) (unknown):

    $$\begin{aligned} \sum _{j=1}^{2n} y_j =d. \end{aligned}$$
    (6)
  • Consider the augmented matrix \(B=[\varOmega |\overline{\varOmega }]\), where \(\overline{\varOmega }\) is the binary data obtained from \(\varOmega \) by replacing 0 entries by 1 and 1 entries by 0. Define the vector \(\mathbf{v }=B\mathbf{y }\). In order to produce a relaxed pattern \(P_{{\mathcal {C}}_p}\) with \(\alpha \%\) homogeneity and \(\beta \%\) prevalence, we prescribe the following constraints:

    $$\begin{aligned}&v_{i}+ n \rho _i \ge d, \quad i\in I_p, \end{aligned}$$
    (7)
    $$\begin{aligned}&v_i - z_i \le d-1, \quad i \in I_k,\quad k=1,\ldots ,K,\quad k\ne p \end{aligned}$$
    (8)
    $$\begin{aligned}&\sum _{i\in I_k} z_i \le \alpha |\varOmega _p|, \end{aligned}$$
    (9)
    $$\begin{aligned}&\sum _{i\in I_p} \rho _i \le (1-\beta )|\varOmega _p|, \end{aligned}$$
    (10)
    $$\begin{aligned}&0\le z_i\le 1, \quad i=1,\ldots ,m \end{aligned}$$
    (11)
    $$\begin{aligned}&1\le d\le n, \quad d\in {\mathbb {Z}}^+ \end{aligned}$$
    (12)
    $$\begin{aligned}&\rho _i\in \{0,1\}, \quad i=1,\ldots ,m \end{aligned}$$
    (13)
    $$\begin{aligned}&y_j\in \{0,1\}, \quad j = 1,\ldots , 2n \end{aligned}$$
    (14)

    where \(\displaystyle {I_p =\left\{ {i: \omega _i \text{ is } \text{ in } \text{ class } {\mathcal {C}}_p}\right\} }\) and \(\displaystyle {I_k =\left\{ {i: \omega _i \text{ is } \text{ in } \text{ class } {\mathcal {C}}_k}\right\} }\) for all \(k\ne p\).

  • Let \(c\in {\mathbb {R}}^+\) be a constant. Our goal is to generate a pattern \(P_{{\mathcal {C}}_p}\) with minimum degree and maximum coverage. Therefore, we consider the objective function

    $$\begin{aligned} cd+\sum _{i\in I_p} {\rho _i} \end{aligned}$$
    (15)

We use the objective function (15) and the conditions (5)–(14) to formulate the following MILP:

$$\begin{aligned} \begin{array}{l}\displaystyle \hbox {minimize}~~\displaystyle cd+\sum _{i\in I_p} {\rho _i} \\ \hbox {subject to}\\ \displaystyle {\begin{array}{c} v_{i}+ n \rho _i \ge d, \quad i\in I_p \\ v_i - z_i \le d-1, \quad i \in I_k~, ~k=1,\ldots ,K, ~~k\ne p\\ \displaystyle \sum _{i\in I_k} z_i \le \alpha |\varOmega _p|\\ \displaystyle \sum _{i\in I_p} \rho _i \le (1-\beta )|\varOmega _p|\\ y_j+y_{n+j} \le 1, \quad j=1,\ldots , n \\ \displaystyle \sum _{j=1}^{2n} y_j =d\\ 1\le d\le n, \quad d\in {\mathbb {Z}}^+\\ 0\le z_i\le 1, \quad i=1,\ldots ,m\\ \rho _i\in \{0,1\}, \quad i=1,\ldots ,m\\ y_j\in \{0,1\}, \quad j = 1,\ldots , 2n \\ \end{array}} \end{array} \end{aligned}$$
(16)

where \(c\in {\mathbb {R}}^+\) and \(0\le \alpha ,\beta \le 1\) are input parameters.
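
To illustrate how problem (16) might be solved in practice, the sketch below sets it up with the open-source PuLP modeler and the bundled CBC solver; this is an illustrative modeling choice, not the paper's implementation, and the default parameter values are placeholders. The sum in constraint (9) is interpreted here as running over all observations outside class \({\mathcal {C}}_p\), in analogy with problem (3).

```python
import pulp

def generate_relaxed_pattern(data, labels, p, alpha=0.9, beta=0.1, c=1.0):
    """Set up and solve problem (16) for class p over a binarized dataset.
    data: list of 0/1 rows; labels[i]: class index of row i."""
    m, n = len(data), len(data[0])
    B = [list(row) + [1 - v for v in row] for row in data]   # augmented matrix [Omega | Omega-bar]
    I_p = [i for i in range(m) if labels[i] == p]
    I_not = [i for i in range(m) if labels[i] != p]

    prob = pulp.LpProblem("relaxed_pattern", pulp.LpMinimize)
    y = [pulp.LpVariable(f"y_{j}", cat="Binary") for j in range(2 * n)]
    rho = {i: pulp.LpVariable(f"rho_{i}", cat="Binary") for i in I_p}
    z = {i: pulp.LpVariable(f"z_{i}", lowBound=0, upBound=1) for i in I_not}
    d = pulp.LpVariable("d", lowBound=1, upBound=n, cat="Integer")

    prob += c * d + pulp.lpSum(rho.values())                                  # objective (15)
    for i in I_p:                                                             # constraint (7)
        prob += pulp.lpSum(B[i][j] * y[j] for j in range(2 * n)) + n * rho[i] >= d
    for i in I_not:                                                           # constraint (8)
        prob += pulp.lpSum(B[i][j] * y[j] for j in range(2 * n)) - z[i] <= d - 1
    prob += pulp.lpSum(z.values()) <= alpha * len(I_p)                        # constraint (9)
    prob += pulp.lpSum(rho.values()) <= (1 - beta) * len(I_p)                 # constraint (10)
    for j in range(n):                                                        # constraint (5)
        prob += y[j] + y[n + j] <= 1
    prob += pulp.lpSum(y) == d                                                # constraint (6)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    S = [j for j in range(n) if y[j].varValue > 0.5]          # indices of literals x_j
    S_bar = [j for j in range(n) if y[n + j].varValue > 0.5]  # indices of negated literals
    return S, S_bar
```

The index sets S and \(\bar{S}\) returned by the sketch correspond to those in Theorem 1 below, from which the relaxed pattern \(P_{{\mathcal {C}}_p}\) is assembled.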

Theorem 1

Let \((\mathbf{v }^*,\mathbf{y }^*,\mathbf{z }^*,\rho ^*, d^*)\) be a feasible solution of problem (16). Let

$$\begin{aligned} S =\left\{ j:~ {y^*_j}=1, j=1,\ldots ,n\right\} \quad \text {and} \quad \bar{S} =\left\{ j:~ {y^*_{n+j}}=1, j=1,\ldots ,n\right\} . \end{aligned}$$

Then

$$\begin{aligned} P_{{\mathcal {C}}_p}=\bigwedge _{S} x_j\bigwedge _{\bar{S}} \bar{x}_j \end{aligned}$$
(17)

forms a relaxed (fuzzy) pattern of degree \(d^*\), associated with class \({\mathcal {C}}_p\).

Proof

Let \((\mathbf{v }^*, \mathbf{y }^*, \mathbf{z }^*,\rho ^*, d^*)\), where \(\mathbf{v }^*=B\mathbf{y }^*\), be a feasible solution of problem (16). First note that the constraint

$$\begin{aligned} y_j+y_{n+j} \le 1, \quad j=1, \ldots , n \end{aligned}$$

ensures that the Boolean term \(P_{{\mathcal {C}}_p}\) shown in (17) does not contain both literals \(x_j\) and \(\bar{x}_j\) associated with the jth feature in dataset \(\varOmega \) and the condition

$$\begin{aligned} \sum _{j=1}^{2n} y_j =d \end{aligned}$$

guarantees that the term \(P_{{\mathcal {C}}_p}\) is of degree d. The constraint

$$\begin{aligned} v_{i}+ n \rho _i \ge d, \quad i\in I_p \end{aligned}$$

ensures that \(P_{{\mathcal {C}}_p}\) covers at least one observation \(\omega _i\) in class \({\mathcal {C}}_p\), that is, \(P_{{\mathcal {C}}_p}(\omega _i)=1\), \(\displaystyle {i\in I_p}\). If an observation \(\omega _i, i\in I_p,\) is covered by \(P_{{\mathcal {C}}_p}\), then d of the \(y_j\)'s are set to 1 and hence we have \(v_{i}=d, i\in I_p\), where \(v_i\) is the ith component of the vector \(\mathbf{v }=B\mathbf{y }\). However, if an observation is not covered by \(P_{{\mathcal {C}}_p}\), then \(v_{i}<d, i\in I_p,\) and the term “\(n\rho _i\)” is added to the left hand side to compensate. Similarly, the constraints

$$\begin{aligned}&v_i - z_i \le d-1, \quad i \in I_k, \quad k=1,\ldots ,K, \quad k\ne p,\\&\sum _{i\in I_k} z_i \le \alpha |\varOmega _p| \end{aligned}$$

with the relaxation variables \(0\le z_i\le 1,~i=1,\ldots ,m\), guarantee that up to \((1-\alpha )\%\) of the observations covered by the term \(P_{{\mathcal {C}}_p}\) may be in \(\varOmega \setminus \varOmega _p\), i.e., the homogeneity of pattern \(P_{{\mathcal {C}}_p}\) is \(\alpha \%\).

Moreover, the constraint

$$\begin{aligned} \sum _{i\in I_p} \rho _i \le (1-\beta )|\varOmega _p| \end{aligned}$$

ensures that at least \(\beta \%\) of the observations in \(\varOmega _p\) are covered by pattern \(P_{{\mathcal {C}}_p}\), i.e., the prevalence of pattern \(P_{{\mathcal {C}}_p}\) is \(\beta \%\).

Thus, the solution \((\mathbf{v }^*, \mathbf{y }^*, \mathbf{z }^*, \rho ^*, d^*)\) can be used to form a relaxed (fuzzy) pattern \(P_{{\mathcal {C}}_p}\), shown in (17), with degree \(d^*\), homogeneity \(\alpha \%\), and prevalence \(\beta \%\). \(\square \)

Theorem 2

Let \((\mathbf{v }^{opt}, \mathbf{y }^{opt},\mathbf{z }^{opt}, \rho ^{opt}, d^{opt})\) be an optimal solution of problem (16). Let

$$\begin{aligned} S =\left\{ j~:~ {y^{opt}_j}=1, j=1,\ldots ,n\right\} \quad \text {and} \quad \bar{S} =\left\{ j~:~ {y^{opt}_{n+j}}=1, j=1,\ldots ,n\right\} . \end{aligned}$$

Then

$$\begin{aligned} \displaystyle P_{{\mathcal {C}}_p}^{opt}=\bigwedge _{S} x_j\bigwedge _{\bar{S}} \bar{x}_j \end{aligned}$$
(18)

is a relaxed (fuzzy) strong prime pattern of degree \(d^{opt}\), associated with class \({\mathcal {C}}_p\).

Proof

Let \((\mathbf{v }^{opt}, \mathbf{y }^{opt},\mathbf{z }^{opt}, \rho ^{opt}, d^{opt})\) be an optimal solution of problem (16). Hence, as discussed in the proof of Theorem 1, it can be used to construct a relaxed LAD pattern \(P_{{\mathcal {C}}_p}\) of degree \(d^{opt}\) that is associated with class \({\mathcal {C}}_p\). Note that the objective function of problem (16) minimizes the degree of \(P_{{\mathcal {C}}_p}\), resulting in a prime pattern. Since the objective function simultaneously minimizes the penalties \(\rho _i\) associated with observations \(\omega _i, i\in I_p\), the resulting pattern has the maximum coverage in class \({\mathcal {C}}_p\) and is a strong pattern. Thus, an optimal solution to problem (16) used to form a relaxed pattern \(P_{{\mathcal {C}}_p}^{opt}\), shown in (18), is a strong prime pattern with degree \(d^{opt}\), homogeneity \(\alpha \%\), and prevalence \(\beta \%\). \(\square \)

3.3 Relaxed multiclass LAD model

As stated in Theorem 2, an optimal solution of problem (16) is a fuzzy strong prime pattern. In order to obtain a multiclass LAD model, patterns must be generated until every observation in class \({\mathcal {C}}_k\), for all \(k=1,\ldots ,K\), is covered at least once. Before we develop an algorithm that produces a multiclass LAD model consisting of relaxed LAD patterns, we recall that in the two-class MILP approach, Algorithm 1 of Ryoo and Jang (2009) (shown in Sect. 2.3) produces a set of patterns associated with the positive (negative) class and loops as many times as necessary until all observations in the positive (negative) class are covered by at least one pattern. Note that once a positive (negative) pattern P is found as an optimal solution of problem (3), Algorithm 1 of Ryoo and Jang (2009) removes the observations covered by the pattern P during its execution. However, this is counterproductive because each time the algorithm loops, it uses less information (a smaller training set) to generate new patterns. Mortada (2010) adopted a similar approach to develop an OvO-type multiclass LAD algorithm, where observations covered by a pattern are removed from the training dataset while executing the proposed algorithm. As compared to Algorithm 1 of Ryoo and Jang (2009), the algorithm proposed by Mortada (2010) stops looping when each observation is covered by t patterns, where t is a user input determined by the data analyst.

In order to avoid the removal of observations from the training dataset when generating new patterns that form a multiclass LAD model containing relaxed patterns obtained as the optimal solutions of problem (16), we define \(\kappa \) as an \(m\)-vector that keeps track of the number of patterns covering each observation \(\omega _i\in \varOmega \), \(i=1,\ldots ,m\). Initially, for each class \(\displaystyle {{\mathcal {C}}_k}, k=1,\ldots ,K\), we set \(\kappa ={\mathbf {0}}\). This vector is updated as new solutions of the MILP problem (16) are found. With the help of the new vector \(\kappa \), constraint (7), i.e., the first constraint in problem (16), can be replaced by

$$\begin{aligned} v_{i}+ n \left( \rho _i+\kappa _i\right) \ge d, \quad i\in I_p \end{aligned}$$
(19)

where \(\kappa _i\ge 0\), \(i=1,\ldots ,m\).

Theorem 3

Let \((\mathbf{v }^\prime , \mathbf{y }^\prime , \mathbf{z }^\prime , \rho ^\prime , d^\prime )\) be an optimal solution of problem (16), where the constraint

$$\begin{aligned} v_{i}+ n {\rho _i} \ge d, \quad i\in I_p \end{aligned}$$

is replaced by constraint (19). Let

$$\begin{aligned} S =\left\{ j:~ {y^\prime _j}=1, j=1,\ldots ,n\right\} \quad \text {and} \quad \bar{S} =\left\{ j:~ {y^\prime _{n+j}}=1, j=1,\ldots ,n\right\} . \end{aligned}$$

Then

$$\begin{aligned} \displaystyle P_{{\mathcal {C}}_p}^\prime =\bigwedge _{S} x_j\bigwedge _{\bar{S}} \bar{x}_j \end{aligned}$$

is a degree \(d^\prime \) relaxed strong prime pattern associated with class \({\mathcal {C}}_p\).

Proof

The proof of the assertion follows immediately from the proof of Theorem 1 and Theorem 2 and hence, is left to the reader. \(\square \)

Theorem 3 enables us to propose an algorithmic approach to generate a multiclass LAD model consisting of relaxed LAD patterns obtained as the optimal solutions of our multiclass MILP problem (16). This approach is presented in Algorithm 2. We remark that, unlike Algorithm 1 of Ryoo and Jang (2009), our proposed Algorithm 2 does not require the removal of observations from the training dataset at any iteration; instead, a new constraint (“NewConstraint”) is added to the relaxed MILP problem (16) each time a new pattern is generated, preventing the algorithm from finding a pattern already found at previous iterations. This is achieved by introducing \(\kappa _i\), which keeps track of the number of patterns covering observation \(\omega _i\in \varOmega \) for all \(i=1,\ldots ,m\), and “TotCov”, which counts the number of observations covered so far.

Let \(\mathcal {M}_k\) denote the set of relaxed LAD patterns associated with class \({\mathcal {C}}_k\), \(k=1,\ldots ,K\) and \({\mathcal {M}}={\mathcal {M}}_1\cup \cdots \cup {\mathcal {M}}_K\), where \({\mathcal {M}}_i\cap {\mathcal {M}}_j = \emptyset \), \(i\ne j\), be the multiclass LAD model obtained by Algorithm 2.

Algorithm 2 Relaxed multiclass LAD model generation
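
Since Algorithm 2 itself is given as a figure, the following is only a rough sketch of the loop described above; it assumes a helper generate_relaxed_pattern(data, labels, p, kappa, alpha, beta) that solves problem (16) with constraint (7) replaced by (19) (for instance, a variant of the PuLP sketch in Sect. 3.2 extended with the vector \(\kappa \)), and it omits the “NewConstraint” and “TotCov” bookkeeping.

```python
def build_relaxed_multiclass_model(data, labels, classes, alpha, beta, max_iters=1000):
    """Generate relaxed patterns class by class until every observation of each
    class is covered at least once (a simplified rendering of Algorithm 2)."""
    m = len(data)
    model = {p: [] for p in classes}
    for p in classes:
        kappa = [0] * m                               # pattern counts per observation
        I_p = [i for i in range(m) if labels[i] == p]
        iters = 0
        while any(kappa[i] == 0 for i in I_p) and iters < max_iters:
            S, S_bar = generate_relaxed_pattern(data, labels, p, kappa, alpha, beta)
            covers = lambda row: (all(row[j] == 1 for j in S) and
                                  all(row[j] == 0 for j in S_bar))
            for i in I_p:
                if covers(data[i]):
                    kappa[i] += 1                     # feeds constraint (19) at the next solve
            model[p].append((S, S_bar))
            iters += 1                                # guard against stalling in this sketch
    return model
```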

The final step of our proposed multiclass LAD method with relaxed patterns is to validate and use the model \(\mathcal {M}\) for prediction of new observations. Similar to the two-class classification problem, the accuracy of a multiclass model \({\mathcal {M}}\) is estimated by the classical cross-validation procedures (Aggarwal 2015; Dietterich 1998; Efron and Tibshirani 1986; Hastie et al. 2005). If an external dataset (test set) is available, the performance of the model can be evaluated on that set and the accuracy of the model is reported as the accuracy on the test set.

3.4 OvR multiclass LAD discriminant and prediction

Given a K-class dataset \(\varOmega =\varOmega _1\cup \cdots \cup \varOmega _K\) and a corresponding multiclass LAD model \({\mathcal {M}}={\mathcal {M}}_1\cup \cdots \cup {\mathcal {M}}_K\) (\({\mathcal {M}}_i\cap {\mathcal {M}}_j = \emptyset \), \(i\ne j\)), the classification of a new (or unseen) observation \(\mathbf{o }\in {\mathbb {B}}^{n}\), \(\mathbf{o }\notin \varOmega \), is determined by the value of the discriminant function

$$\begin{aligned} \varDelta (\mathbf{o })=\arg \max _k \varDelta _k(\mathbf{o }) \end{aligned}$$
(20)

where

$$\begin{aligned} \varDelta _k(\mathbf{o })=\sum _{P_{{\mathcal {C}}_k}\in \displaystyle {{\mathcal {M}}}_k}\delta _k P_{{\mathcal {C}}_k}(\mathbf{o }),\quad k=1,\cdots ,K \end{aligned}$$

and \(\delta _k\ge 0\) are the weights assigned to patterns \(P_{{\mathcal {C}}_k}\in {{\mathcal {M}}}_k, k=1,\ldots ,K\) and can be calculated in various ways. One possibility is to use the prevalence of patterns that is defined by

$$\begin{aligned} {\delta _k=\frac{1}{\left| {\varOmega _k}\right| } \sum _{i\in I_{{\mathcal {C}}_k}}P_{{\mathcal {C}}_k}(\omega _i)} \end{aligned}$$

where \(\varOmega _k\subset \varOmega \) is the set of observations in class \({\mathcal {C}}_k\) and \(I_{{\mathcal {C}}_k}=\{i: \omega _i \in \varOmega _k\}\), \(k=1,\ldots ,K\). If the maximum in (20) is attained by two or more classes, that is, \(\max _k\varDelta _k(\mathbf{o })=\varDelta _p(\mathbf{o })=\varDelta _q(\mathbf{o })\) for some \(p\ne q\), then the observation \(\mathbf{o }\) is unclassified.
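
A minimal sketch of this OvR prediction rule is given below; each class model is assumed to be a list of (pattern_function, weight) pairs, with the weights given, for instance, by the pattern prevalences.

```python
def classify(o, model):
    """model[k] is a list of (pattern_function, weight) pairs for class C_k.
    Returns the index of the class maximizing Delta_k(o), or None when the
    maximum is attained by two or more classes (the observation is unclassified)."""
    scores = [sum(w * P(o) for P, w in patterns) for patterns in model]
    best = max(scores)
    winners = [k for k, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None
```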

4 Experiments

In this section we present experimental results on publicly available datasets to show how Algorithm 2 described in Sect. 3.3 can be used for multiclass classification. Regarding the stopping criterion, Algorithm 2 ends once all patterns for each class \({\mathcal {C}}_k,k=1,\ldots ,K,\) have been computed, i.e., all observations are covered. In the worst case, an ad hoc pattern can be built by the algorithm to cover a single observation. Our relaxed MILP LAD algorithm is implemented in Python 3.5.7 and the experiments are run on a GPU machine containing the virtual environment and all necessary packages. The specifications of the machine include an Intel Core i7-6700 CPU with 64 GB of memory running Linux. The goal of these experiments is not only to compare our new LAD based multiclass method with prior multiclass LAD techniques in the LAD literature, but also with well-known and commonly used supervised learning methods in the machine learning literature.

4.1 Experimental results

In order to test our proposed multiclass LAD methodology, we conduct experiments on five multiclass datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Table 1 summarizes the characteristics of these datasets.

Table 1 Five multiclass datasets from UCI repository

We apply our proposed “Relaxed Multiclass LAD Method” (referred to as Relaxed MC-LAD in what follows) to these five datasets and report the average sensitivities of \(10\times 10\)-fold cross-validation experiments in Table 2.

Table 2 Relaxed MC-LAD method: average sensitivity of \(10\times 10\) cross-validation experiments

4.2 Pattern characteristics

The Relaxed MC-LAD pattern characteristics for all five datasets are shown in Tables 3, 4, 5, 6 and 7. As can be seen from these results, the Relaxed MC-LAD method produces accurate classification models with high quality patterns.

4.2.1 Relaxed multiclass LAD pattern characteristics for Iris dataset

Table 3 shows the pattern characteristics for the Iris dataset, where the degree of the patterns ranges from one to three and the patterns exhibit coverage varying from 5.26 to 100%. Note that for class Setosa, our relaxed multiclass LAD approach produces a single pattern of degree 1 that covers all observations in class Setosa. The relaxed multiclass LAD model contains seven patterns, one for class Setosa, three for class Versicolor, and three for class Virginica.

Table 3 Relaxed multiclass LAD pattern characteristics for Iris dataset

4.2.2 Relaxed multiclass LAD pattern characteristics for Wine dataset

Our relaxed multiclass LAD approach produces a LAD model consisting of nine patterns, two for class A, five for class B, and two for class C. Table 4 shows the pattern characteristics for the Wine dataset, where the degree of the patterns ranges from 2 to 5 and the patterns exhibit coverage varying from 2.82 to 94.91%.

Table 4 Relaxed multiclass LAD pattern characteristics for Wine dataset

4.2.3 Relaxed multiclass LAD pattern characteristics for Glass dataset

The Glass dataset is a noisy dataset. Our relaxed multiclass LAD approach produces a LAD model consisting of 51 patterns, fifteen for class A, sixteen for class B, nine for class C, two for class D, four for class E, and five for class F. Table 5 shows the pattern characteristics for the Glass dataset, where the degree of the patterns ranges from 1 to 5 and the patterns exhibit coverage varying from 1.96 to 90.00%.

Table 5 Relaxed multiclass LAD pattern characteristics for Glass dataset

4.2.4 Relaxed multiclass LAD pattern characteristics for E. Coli dataset

Table 6 shows the pattern characteristics for the E. Coli dataset, where the patterns are generated by our multiclass LAD approach. Note that this model contains more complicated patterns, with degrees ranging from 3 to 11 and coverage ranging from 1.4 to 100%. The multiclass LAD model contains a total of 40 patterns. Note that only one pattern of degree three is sufficient to cover all observations in class omL.

Table 6 Relaxed multiclass LAD pattern characteristics for E. Coli dataset
Table 7 Relaxed multiclass LAD pattern characteristics for Dermatology dataset

4.2.5 Relaxed multiclass LAD pattern characteristics for Dermatology dataset

Table 7 shows the pattern characteristics for the Dermatology dataset, where the patterns are generated by our multiclass LAD approach. Similar to the E. Coli LAD model, the multiclass LAD model for the Dermatology dataset contains more complicated patterns, with degrees ranging from 1 to 6 and coverage ranging from 2.22 to 92.45%. Note that the method produced two degree-one patterns associated with class C, and these two patterns cover 92.45% of the observations in class C.

4.3 Experiments comparing LAD based multiclass classification methods

In this section we report the average accuracy of \(10\times 10\)-fold cross-validation experiments, where we compare the Relaxed MC-LAD method against other LAD based multiclass methods: Kim and Choi (2015)-OvR, Kim and Choi (2015)-OvO, Moreira (2000)-OvO, Mortada (2010)-OvO, and Avila-Herrera and Subasi (2013)-OvR. These results are summarized in Table 8 for the Iris, Wine, and Glass datasets and in Table 9 for the E. Coli and Dermatology datasets.

Table 8 Classification accuracy (%) of LAD based multiclass methods on Iris, Wine, and Glass datasets
Table 9 Classification accuracy (%) of LAD based multiclass methods on E. Coli and Dermatology datasets

The average CPU times for the LAD based multiclass methods, including our proposed Relaxed MC-LAD method, are given in Table 10. Note that the efficiency of our proposed Relaxed MC-LAD method is comparable to, if not better than, that of the other existing multiclass LAD methods.

Table 10 Comparison of average CPU time for LAD based multiclass methods

As compared to the previously proposed LAD multiclass approaches, our Relaxed MC-LAD method is more efficient and performs better (higher accuracy) than the methods of Moreira (2000)-OvO, Mortada (2010)-OvO and Avila-Herrera and Subasi (2013)-OvR for all five datasets. The Relaxed MC-LAD method has smaller CPU times than those reported for the methods of Kim and Choi (2015) for almost all of the datasets, except that it is slightly slower than Kim and Choi (2015)-OvR for the Glass dataset. We also observe that our proposed Relaxed MC-LAD method's classification accuracy is very comparable to, if not better than, that of the methods of Kim and Choi (2015) for all datasets except the Glass dataset. However, for the Glass dataset, the Relaxed MC-LAD method outperforms the previously developed multiclass LAD methods, excluding the methods of Kim and Choi (2015), as can be seen from Tables 8 and 9.

4.4 Experiments comparing relaxed multiclass LAD method with well-known classification techniques

Similar to the previous section, here we report the average accuracy of \(10\times 10\)-fold cross-validation experiments of our Relaxed MC-LAD method as compared to well-known classification techniques, including Nearest Neighbor, Naïve Bayes, Logistic Regression, Support Vector Machines, Neural Networks, and Decision Trees (C4.5). These experiments were run in the WEKA 3.8 Data Mining Software (Frank et al. 2016).

Table 11 shows the average accuracy of the cross-validation experiments for Iris, Wine, and Glass datasets and Table 12 shows the average accuracy of the cross-validation experiments for E. Coli and Dermatology datasets.

Table 11 Comparison of average accuracy of well-known classification methods for Iris, Wine, and Glass datasets
Table 12 Comparison of average accuracy of well-known classification methods for E. Coli and Dermatology datasets

We observe from Tables 11 and 12 that the classification accuracy of our proposed Relaxed MC-LAD method is very comparable to, if not better than, that of the well-known supervised learning methods, including Nearest Neighbor, Naïve Bayes, Logistic Regression, Neural Networks, and Decision Trees, for all datasets. This includes the Glass data, where the accuracy of all these methods is much lower (by 10% or more) than that of the Relaxed MC-LAD method. While all of the multiclass methods, both LAD based and others, excluding the Relaxed MC-LAD method, classify the observations in the Glass dataset poorly, the methods of Kim and Choi (2015) outperform all of the methods by more than about 10% in accuracy, based on the results reported in their paper.

Our numerical experiments suggest that the proposed Relaxed MC-LAD method is an exciting alternative to the multiclass classification literature. The method not only provides efficient and robust results, but also easily interpretable explicit classification models—an essential feature of LAD methodology.

5 Conclusions

In this paper we extend Logical Analysis of Data (LAD) to multiclass classification, where relaxed (fuzzy) patterns are generated as optimal solutions of a mixed integer linear programming problem. Our proposed relaxed multiclass LAD approach is motivated by the MILP formulation of Ryoo and Jang (2009), which generates LAD patterns for two-class datasets, and the LAD based multiclass algorithm of Avila-Herrera and Subasi (2013, 2015), which generates pure patterns with minimum degree and maximum coverage. Our multiclass method uses homogeneity and prevalence as two parameters to generate relaxed LAD patterns, aimed at improving the generalization capability. We demonstrate the advantage of this flexibility through experiments on five benchmark multiclass datasets. Experimental results show that the proposed relaxed multiclass LAD algorithm produces highly accurate classification models on the benchmark datasets, where the cross-validation accuracy of the relaxed multiclass LAD algorithm is comparable to, if not better than, that obtained by previously developed multiclass LAD classification methods as well as by the well-known classification techniques, including Nearest Neighbor, Naïve Bayes, Logistic Regression, Support Vector Machines, Neural Networks, and Decision Trees. In addition, our proposed relaxed multiclass LAD method is very efficient, as can be seen from the reported CPU times of the experiments.