Introduction

Data mining enables the extraction of useful knowledge from data. In recent decades, data mining has been investigated extensively, and a large number of techniques have been proposed for generating, for instance, descriptive models in clustering and frequent pattern analysis, and predictive models in classification and regression tasks [35]. Data mining is closely related to cognitive computation. Indeed, as discussed in [20], cognitive computation helps improve human decision-making: data mining models are often adopted for decision-making tasks, such as image recognition [16], disease identification [3], and managing the energy consumption of wireless sensor networks [38].

We are currently experiencing the Big Data Era [45], and classical data mining algorithms appear to be inadequate to manage big data. Indeed, big data are characterized by the four “V”s, namely volume, variety, velocity, and veracity: large volumes of data, which are often produced at a very high speed and need to be processed in almost real time (velocity), are generated by different sources and may have different formats (variety) [11] and trustworthiness (veracity). These data represent a very important source of added value in several contexts, such as marketing strategies [23], industrial applications [55], and the Internet of Things [2].

Big data are of prominent importance also for their relationship with cognitive computing systems [1]: such systems naturally learn from people and from the huge amounts of data they are exposed to, typically by exploiting computational intelligence and machine learning algorithms. In recent years, several researchers have introduced data mining approaches purposely designed and implemented for big data [48, 58]. Most of these approaches employ specific distributed frameworks, such as Apache Hadoop [57] and Apache Spark [59], which have been recently proposed with the aim of dealing with the storage and processing of big data. Further, most of the recent contributions in the field exploit the MapReduce paradigm [22] for implementing both descriptive and predictive models, with the additional benefit of being able to exploit computing resources on the Cloud [31].

As an example, in [39] and [42], the authors have proposed distributed versions of two well-known clustering algorithms, namely DBSCAN and Fuzzy C-Means, respectively, developed using the Apache Hadoop framework. A fuzzy version of random forests has been implemented over the same framework as well [15].

Recently, in [14] and [43], Apache Spark implementations of associative classification models and of the KNN classifier, respectively, have been discussed. Also, big social data analysis has taken advantage of the use of such distributed computing frameworks [47]. A recent work highlights the main advances, challenges, and objectives in designing, developing, and using data mining and machine learning algorithms for big data [60].

In the context of predictive models, in recent years, a number of contributions employing fuzzy models (FMs) for handling big data have been proposed [25, 28, 32, 41, 44, 51,52,53]. As stated in [29], FMs are particularly suitable for handling the variety and veracity of big data. This is mainly due to their good capability of coping with vague, imprecise, and uncertain concepts. It is worth underlining that fuzzy logic has also been recognized as an important tool to preserve the fidelity of the psychological interpretation of emotion [12], opening up new ways to analyze the sentiment content of the huge amounts of data available from web sources. From a more technical point of view, the use of overlapped fuzzy labels ensures a good coverage of the problem space. This issue is especially relevant when dealing with very large datasets that may be divided into a number of heterogeneous chunks, such as in the MapReduce programming paradigm. Indeed, the different chunks may influence the parameter learning process of the classification model in different ways.

To the best of our knowledge, the first proposal regarding FM for big data classification is the Chi-FRBCS-BigData [51] model. It is a fuzzy rule–based classifier (FRBC), developed according to the MapReduce paradigm and based on the approach described by Chi et al. [17]. In the work discussed in [41], the Chi-FRBCS-BigData algorithm has been adapted for handling imbalanced big datasets, and the effects of the granularity of fuzzy partitions, when using the same algorithm, have been studied in [30]. Recently, in the work published in [25], the CHI-BD algorithm has been introduced: it is a novel distributed version of Chi et al.’s approach [17], with improved results with respect to Chi-FRBCS-BigData. Basically, CHI-BD is the exact distributed implementation of Chi et al.’s approach, whereas Chi-FRBCS-BigData is an approximated implementation. More details can be found in [25] and [51], respectively. Notably, efficient MapReduce solutions based on the CHI-BD algorithm have been proposed also for the prototype reduction problem [26].

Similarly to the aforementioned approaches, most of the contributions proposed so far focus on the design and development of FMs in a distributed environment, considering especially the accuracy of the models. Also, two recent works discuss the good results obtained by two novel fuzzy classification models for big data. The two models are respectively based on fuzzy associative classifiers [52] and distributed multi-way fuzzy decision trees (DMFDTs) [53]. Even though the accuracy of the obtained classifiers is good, the complexity of the resulting models, in terms of both number of rules and number of decision nodes, is very high. The greater the complexity, the lower the interpretability of the FMs. However, interpretability is a very important feature of FMs, and it is particularly relevant in the context of big data as well [29, 56]. New methods to generate both accurate and interpretable FMs are currently being investigated in the fuzzy modeling research community [24].

Interpretability is a subjective concept: it is hard to find a universally agreed definition and, consequently, a universal measure of interpretability. A taxonomy of interpretability measures for fuzzy rule–based models has been proposed in [33] by considering the two distinct dimensions of semantics and complexity, at the rule base (RB) and data base (DB) levels. As regards the DB, semantic interpretability is usually evaluated in terms of integrity of the fuzzy partitions, whereas complexity is evaluated in terms of the number of fuzzy sets. As regards the RB, interpretability is mostly analyzed in terms of complexity, and one of the most used metrics is the total rule length (TRL) [18, 36, 37], that is, the total number of conditions used in the RB. In the learning of interpretable FMs, the importance of other factors, such as rule relevance, has been experimentally studied as well [49].

In the framework of “classical” FMs, multi-objective evolutionary algorithms (MOEAs) have been widely used with the aim of generating models characterized by good trade-offs between accuracy and interpretability [7, 27]. Independently of the approach used to generate the DB and the RB of the fuzzy rule–based systems, the computation of the accuracy of each individual generated during the evolutionary process requires a scan of the entire training set. When the size of the dataset is very large, as in the context of big data, the application of MOEA-based approaches to FM generation becomes critical. Thus, the natural way of managing very large datasets is to speed up the computation, which can be done by exploiting a distributed implementation on a computer cluster.

Although some evolutionary-based methods for learning FMs for big data have been recently proposed [28, 44], in 2017 we introduced the first distributed implementation of an MOEA to learn concurrently the RB and DB of FRBCs, by maximizing accuracy and minimizing complexity [32]. We named our algorithm DPAES-RCS: it is an Apache Spark–distributed implementation of the PAES-RCS approach discussed in [9, 10]. PAES-RCS learns the RB through a rule and condition selection strategy, which, during the evolutionary process, selects a reduced number of rules from a heuristically generated set of candidate rules and a reduced number of conditions for each selected rule. Moreover, the parameters of the fuzzy sets are learnt concurrently with the RB. PAES-RCS has proven to be very efficient in obtaining satisfactory approximations of the Pareto front using a limited number of iterations [9].

In this paper, we propose an extension of DPAES-RCS that includes two main novel aspects. First, we generate the initial set of rules using the distributed FDT learning approach introduced in [53] rather than the distributed version of the C4.5 algorithm [21]. We highlight that we adopt the same learning algorithm proposed in [53], but we employ neither the fuzzy partitions generated by the distributed fuzzy discretizer (proposed in [53] as well) nor leaves labeled with different classes. Similarly to [32], once the FDT has been generated, we extract the rules by traversing the tree from the root to each leaf. Second, we introduce a strategy for learning the most suitable number of fuzzy sets (granularity learning) for each linguistic variable, concurrently with the learning of the RB and of the fuzzy set parameters. To this aim, we adopt the virtual partition method introduced in [5] and recently used in [7] in the context of MOEA-based fuzzy models. We evaluate the proposed extension of DPAES-RCS, named DPAES-FDT-GL, on eight benchmark datasets for big data classification. We compare DPAES-FDT-GL with DPAES-RCS and with a simplified version of DPAES-FDT-GL, named DPAES-FDT, which exploits the FDT for generating the initial rule set, but does not employ granularity learning during the evolutionary process. This last comparison is performed to evaluate whether both extensions of DPAES-RCS included in DPAES-FDT-GL produce valuable effects.

We show that the accuracies achieved by the three approaches are statistically equivalent, while the complexity of the FRBCs generated by DPAES-FDT-GL and DPAES-FDT is much lower than that of the FRBCs generated by DPAES-RCS. Thus, we conclude that DPAES-FDT-GL and DPAES-FDT are definitely able to generate more interpretable models than DPAES-RCS. However, even though DPAES-FDT-GL and DPAES-FDT are statistically equivalent in terms of complexity, the results show that, in most of the cases, DPAES-FDT-GL generates the most compact solutions, characterized by the lowest number of rules, conditions, and fuzzy sets.

The paper is organized as follows. In “Preliminaries”, some background concepts are introduced. Section “The Proposed Approach” describes the overall approach. We report the results of our experimental analysis in “Experimental Results” and draw some final conclusions in “Conclusions and Future Work”.

Preliminaries

The term “classification” refers to the action of assigning a class Cm, out of a given set C = {C1,…,CK} of K classes, to an unlabeled instance. A generic instance is described by a set X = {X1,…,XF} of attributes with cardinality F. Each attribute can be either categorical or numerical. In the case of categorical attributes, Xf takes values out of a set \(L_{f} = \{L_{f,1}, \dots , L_{f,T_{f}} \}\) of Tf distinct values.

For numerical attributes, the universe Uf of Xf can always be considered a bounded interval in \(\mathbb {R}\). With the aim of defining fuzzy rules, a fuzzy partition is defined on each of these intervals. Referring to a generic numeric attribute Xf, let \(P_{f} = \{A_{f,1},\dots ,A_{f,T_{f}}\}\) be a partition over the relative universe Uf, where Tf is the number of fuzzy sets in the partition. A label Lf,j is then assigned to each fuzzy set Af,j, thus letting us work with linguistic variables and deal with both categorical and numerical attributes in a homogeneous way.

In this paper, we adopt triangular fuzzy sets and, therefore, each fuzzy set Af,j is identified by the tuple (af,j,bf,j,cf,j), where af,j and cf,j correspond to the left and right extremes of the support, respectively, and bf,j to the core. Since we use strong partitions, \(a_{f,1} = b_{f,1}\), \(b_{f,T_{f}} = c_{f,T_{f}}\) and, for \(j = 2,\dots,T_{f}-1\), \(b_{f,j} = c_{f,j-1}\) and \(b_{f,j} = a_{f,j+1}\).

The number Tf of fuzzy sets to be used in the partition for the attribute Xf can be regarded as a measure of the granularity used for the definition of linguistic variables over Uf. We can use the notation Pf(Tf) to emphasize the granularity level for Pf, because clearly Tf has a direct impact on the accuracy and interpretability of the derived classification models.
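As an illustration of the partitioning scheme described above, the following Python sketch (not part of the authors' implementation; names and structure are our own assumptions) builds a uniform strong partition of triangular fuzzy sets over a bounded universe and evaluates membership degrees:

```python
import numpy as np

def uniform_strong_partition(u_min: float, u_max: float, t_f: int):
    """Return the (a, b, c) triples of t_f triangular fuzzy sets whose cores
    are evenly spaced and whose supports overlap so that memberships sum to 1
    everywhere (strong partition)."""
    cores = np.linspace(u_min, u_max, t_f)              # b_{f,1}, ..., b_{f,T_f}
    fuzzy_sets = []
    for j, b in enumerate(cores):
        a = cores[j - 1] if j > 0 else b                 # a_{f,1} = b_{f,1}
        c = cores[j + 1] if j < t_f - 1 else b           # c_{f,T_f} = b_{f,T_f}
        fuzzy_sets.append((float(a), float(b), float(c)))
    return fuzzy_sets

def triangular_membership(x: float, abc) -> float:
    """Membership degree of x to a triangular fuzzy set (a, b, c)."""
    a, b, c = abc
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

# Example: a strong partition with T_f = 5 fuzzy sets on [0, 1].
P_f = uniform_strong_partition(0.0, 1.0, 5)
print([round(triangular_membership(0.3, A), 2) for A in P_f])   # [0.0, 0.8, 0.2, 0.0, 0.0]
```

With this construction, the memberships of any value to the fuzzy sets of the partition sum to 1, which is the defining property of a strong partition.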

Classification by Means of Fuzzy Rules

In FRBCs, the output value for an unlabeled instance is inferred from the fuzzy rules that compose the RB. In the present work, we assume the following structure for the generic mth rule Rm in the RB:

$$\begin{array}{@{}rcl@{}} \textit{R}_{m}:\ && \textbf{if}\ X_{1}\ \textbf{is}\ L_{1,j_{m,1}}\ \textbf{and} \ {\dots} \ \textbf{and}\ X_{\textit{F}}\ \textbf{is}\ L_{\textit{F},j_{m,F}}\\ && \textbf{then}\ Y\ \textbf{is}\ C_{k_{m}} \end{array} $$
(1)

where Y is the FRBC output, whose value in the consequent of rule Rm is \(C_{k_{m}}\), and jm,f ∈ [1,Tf] identifies the index of the label that has been selected for Xf in rule Rm, i.e., \(L_{\textit {f},j_{m,f}}\). Depending on the nature of Xf (either numerical or categorical), such a label may refer to either a fuzzy set in partition Pf or a categorical value. In general, it may happen that in one rule the value assumed by an attribute provides no indications in choosing the outcome. This situation can be plainly dealt with by introducing an additional fictitious label Lf,0 for each attribute Xf, and using it to express that Xf does not contribute to the classification. Formally, this “dummy” label corresponds to a set with unitary membership across the whole attribute universe: thus, Lf,0 lets us keep the generic structure of (1) also for rules where the outcome actually depends only on a subset of the attributes.

Let TR = {(x1,y1),(x2,y2),...,(xN,yN)} be the training set that contains N instances. In this notation, (xn, yn) indicates the n th input–output pair, where xn is the input vector with F values (each, either numerical or categorical, for the relative attribute), and yn is the classification label.

The matching degree of the rule Rm with the input xn represents the strength of activation of the rule. It is calculated according to the following equation:

$$ w_{m}(\textbf{x}_{n}) = {\prod}_{f = 1}^{F} \mu_{L_{f,j_{m,f}}}(x_{n,f}) $$
(2)

where \(\mu _{L_{f,j_{m,f}}}(x_{n,f})\) is, in the case of numerical attributes, the membership value of xn,f to the fuzzy set \(A_{f,j_{m,f}}\) represented by label \(L_{f,j_{m,f}}\) and, in the case of categorical attributes, is either 0 or 1.

As discussed in [33], the complexity of a rule base can be measured in different ways, but a simple yet effective index is TRL. In the approach proposed in this paper, TRL is used to quantify the model complexity (and, indirectly, its interpretability) and it is taken as one of the objectives for the evolutionary algorithm that shapes the final solutions.

Finally, we adopt the “maximum matching method” as the reasoning method: the class of an unlabeled instance is determined by the consequent of the rule with the maximum matching degree for that instance. In case of a tie, the first of the equally matching rules is chosen. If no rule is fired, the instance is classified with the most frequent class.
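A minimal Python sketch of Eq. (2) and of the maximum matching reasoning method may help fix the ideas; the rule representation and the membership callback `mu` are illustrative assumptions, not the authors' code:

```python
from typing import Callable, List, Tuple

# A rule is a pair: antecedent label indices j_{m,f} (0 = "don't care") and consequent class k_m.
Rule = Tuple[List[int], int]

def matching_degree(rule: Rule, x: List[float],
                    mu: Callable[[int, int, float], float]) -> float:
    """Eq. (2): product over the attributes of the memberships of x to the rule labels."""
    antecedent, _ = rule
    w = 1.0
    for f, j in enumerate(antecedent):
        w *= 1.0 if j == 0 else mu(f, j, x[f])    # L_{f,0} has unitary membership
    return w

def classify_max_matching(rules: List[Rule], x: List[float],
                          mu: Callable[[int, int, float], float],
                          default_class: int) -> int:
    """Class of the rule with the highest activation; ties broken by rule order,
    majority (default) class returned if no rule fires."""
    best_w, best_class = 0.0, default_class
    for rule in rules:
        w = matching_degree(rule, x, mu)
        if w > best_w:                            # strict '>' keeps the first rule on ties
            best_w, best_class = w, rule[1]
    return best_class
```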

The Proposed Approach

The overall proposed approach, named DPAES-FDT-GL, is structured according to the scheme reported in Fig. 1. In the first place, it is necessary to obtain a very good candidate rule base as the starting point for the subsequent optimization process, aimed at producing a set of final FRBCs with different trade-offs between accuracy and interpretability. We denote the first phase as Candidate RB Generation, and the second one as Multi-Objective Evolutionary Learning. It is important to underline that dealing with big data calls for particularly efficient algorithms, and that substantial speed-ups can be achieved by adopting distributed computing solutions. For this reason, throughout the steps of the proposed approach, distributed algorithms have been employed whenever possible, using the Apache Spark framework [59], which implicitly deals with data distribution by means of a predefined container type (the resilient distributed dataset, RDD).

Fig. 1
figure 1

General scheme of the proposed approach. Upon the construction of an initial candidate RB by means of a distributed FDT learning algorithm applied to uniformly partitioned attributes, a multi-objective evolutionary algorithm finds optimal FRBCs with different trade-offs between accuracy and complexity

For the Candidate RB Generation, differently from the previous solution discussed in [32], we adopt a distributed FDT learning approach. Thus, the candidate RB is directly obtained from the learned FDT. The rationale for using an FDT learning algorithm instead of C4.5 is based on the intuition that learning an initial candidate fuzzy RB by means of a procedure that is completely managed by using fuzzy sets can produce a fuzzy model better suited to the subsequent fuzzy manipulations; moreover, in practical contexts, the RBs generated by traversing all the paths from the root down to each leaf of an FDT have proven to be compact and interpretable [50]. As a preliminary step, a uniform discretization is performed. The FDT learning is carried out by means of a distributed algorithm [53].

The algorithm chosen for the multi-objective evolutionary learning (MOEL) is designed according to the well-known Pareto Archive Evolutionary Strategy (PAES) [19]; in particular, it is based on DPAES-RCS [32], which is a distributed version over Apache Spark of the PAES-RCS algorithm [9]. This choice is motivated by its fast convergence rate, which lets us reduce the number of iterations needed to reach a set of satisfactory solutions. This is a crucial point with big data, because fitness evaluations over very extensive datasets represent the most time-consuming task. An important characteristic of this phase is that the granularity of the attribute partitions is directly taken into account and learned throughout the evolutionary process.

The following subsections give a description of the algorithms used in the two phases of the proposed approach; further details can be found in [32] and [53].

The Distributed Candidate RB Generation

The Candidate RB Generation phase used in this work exploits a distributed FDT learning approach suitable for dealing with big data, recently proposed by Segatori et al. [53]. The multi-way version of the FDT has been chosen (instead of the binary one) because it makes the subsequent rule extraction for building the candidate RB particularly simple and efficient: rules are obtained in the form described in (1) through a tree traversal, deriving one rule for each possible path from the root down to a leaf.

Differently from [53], which employs fuzzy partitions generated by a distributed fuzzy discretizer (proposed in [53] as well), and leaves labeled with different classes, the tree learning used in this paper relies on a preliminary uniform discretization for all the numerical features, with Tmax evenly spaced triangular fuzzy sets that make up strong partitions. This operation requires nothing more than knowing the lowest and highest values in TR for each numerical feature (that is, the endpoints for each universe Uf), and this can be easily accomplished in a distributed fashion.
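As a purely illustrative example, a preliminary step of this kind could be expressed in PySpark as follows (the file path, the CSV layout, and the variable names are assumptions; the `tr` RDD is reused in the later sketches):

```python
from pyspark import SparkContext

sc = SparkContext(appName="universe-bounds")

# Hypothetical layout: one instance per line, numerical features followed by the class label.
tr = sc.textFile("hdfs:///data/train.csv") \
       .map(lambda line: line.split(",")) \
       .map(lambda v: ([float(a) for a in v[:-1]], v[-1]))   # (feature vector, class label)
tr.cache()                                                    # TR is scanned repeatedly

def elementwise(op, a, b):
    return [op(x, y) for x, y in zip(a, b)]

# Lowest and highest value of each numerical feature, i.e., the endpoints of each U_f.
mins = tr.map(lambda p: p[0]).reduce(lambda a, b: elementwise(min, a, b))
maxs = tr.map(lambda p: p[0]).reduce(lambda a, b: elementwise(max, a, b))
# Each U_f = [mins[f], maxs[f]] is then uniformly partitioned with T_max triangular fuzzy sets.
```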

Each FDT node corresponds to a subset of TR, and the root corresponds to the whole TR. The FDT construction starts from the root and follows a top-down approach; unless a termination condition is satisfied, a newly generated node gives rise to Tmax child nodes according to the fuzzy partition of the attribute chosen for that specific splitting. In this procedure, an attribute can be considered only once in the same path from the root to a leaf. The attribute that drives the splitting is selected as the one that yields the best fuzzy information gain, which will be defined below. The termination conditions are the following:

  1. All the instances in the node belong to the same class.

  2. The number of instances in the node is lower than a fixed threshold λ.

  3. The tree has reached a maximum fixed depth β.

  4. The value of the fuzzy information gain is lower than a fixed threshold ε. In our experiments, we set \(\varepsilon = 10^{-6}\).

More formally, given a parent node PN, let CNj indicate the generic jth child node, j = 1,…,Tmax. The subset of TR in CNj contains only the instances belonging to the support of the fuzzy set Af,j. Let Sf be the set of instances in the parent node, and Sf,j be the set of instances in CNj, i.e., the instances belonging to the support of Af,j. Each node CNj is characterized by a fuzzy set Gj, whose cardinality is defined as

$$ \left|G_{j}\right| = {\sum}_{i = 1}^{N_{j}}\mu_{G_{j}}(\textbf{x}_{i}) = {\sum}_{i = 1}^{N_{j}}TN(\mu_{A_{f,j}}(x_{f,i}), \mu_{G}(\textbf{x}_{i})) $$
(3)

where Nj is the number of instances in set Sf,j, μG(xi) is the membership degree of instance xi to parent node PN (for the root node, μG(xi) = 1), and the operator TN is a T-norm.

The fuzzy information gain FGain used for selecting the splitting attribute is computed, for a generic attribute Xf with partition Pf, as

$$ \mathit{FGain}(P_{f};I_{G}) = \mathit{FEnt}(G)-\mathit{WFEnt}(P_{f};I_{G}) $$
(4)

where IG is the support of fuzzy set G. The fuzzy entropy FEnt of a generic fuzzy set Bf,j is defined as

$$ \mathit{FEnt}(B_{f,j})={\sum}_{m = 1}^{M}-\frac{\left|B_{f,j,C_{m}}\right|}{\left|B_{f,j}\right|} log_{2}(\frac{\left|B_{f,j,C_{m}}\right|}{\left|B_{f,j}\right|}) $$
(5)

where the fuzzy cardinality \(\left |B_{f,j,C_{m}}\right |\) is computed on the set of instances in Sf,j with class label Cm. The weighted fuzzy entropy \(\mathit{WFEnt}(P_{I_{f}};I_{f})\) of partition \(P_{I_{f}}\) is defined as

$$ \mathit{WFEnt}(P_{I_{f}};I_{f})={\sum}_{j = 1}^{K_{P_{I_{f}}}} \frac{\left|B_{f,j}\right|}{\left|S_{f}\right|}\mathit{FEnt}(B_{f,j}) $$
(6)

where |Bf,j| is the fuzzy cardinality of fuzzy set Bf,j, |Sf| is the cardinality of set Sf, and FEnt(Bf,j) is the fuzzy entropy of Bf,j.
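For concreteness, the following sketch computes Eqs. (4)-(6) from pre-accumulated per-class fuzzy cardinalities; the data layout is an assumption made only for illustration:

```python
import math

def fuzzy_entropy(class_cards):
    """Eq. (5): FEnt of a fuzzy set B_{f,j}, given its per-class fuzzy cardinalities |B_{f,j,C_m}|."""
    total = sum(class_cards)
    if total == 0.0:
        return 0.0
    ent = 0.0
    for card in class_cards:
        if card > 0.0:
            p = card / total
            ent -= p * math.log2(p)
    return ent

def fuzzy_information_gain(parent_class_cards, children_class_cards):
    """Eq. (4): entropy of the parent node minus the weighted entropy of its children (Eq. (6))."""
    parent_total = sum(parent_class_cards)
    weighted = 0.0
    for cards in children_class_cards:        # one list of per-class cardinalities per child CN_j
        weighted += (sum(cards) / parent_total) * fuzzy_entropy(cards)
    return fuzzy_entropy(parent_class_cards) - weighted
```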

In the case of categorical attributes, we split the parent node into a number of child nodes CNj equal to the number of possible values for the attribute. Each node CNj is characterized by a fuzzy set Gj, whose cardinality is

$$ \left|G_{j}\right| = {\sum}_{i = 1}^{N_{j}}\mu_{G_{j}}(\textbf{x}_{i}) = {\sum}_{i = 1}^{N_{j}}TN(1, \mu_{G}(\textbf{x}_{i})) $$
(7)

Figure 2 summarizes the distributed implementation of the candidate rule generation phase, highlighting how the operations are distributed across the cluster of computing units (CUs). The adopted distribution strategy is aimed at reducing as much as possible the scans over TR, which are the real bottleneck of the overall computation. To this aim, the computation of the best split for each node is spread across the CUs. The FDT learning consists of the iterative execution of a MapReduce step: the map phase computes, over the V chunks of TR, the statistics needed to decide how to split, and the reduce phase is in charge of actually performing the node splitting, when appropriate. The nodes still to be split are kept in a list, from which, at each iteration, at most MaxY elements are extracted to be processed in a MapReduce step.

Fig. 2
figure 2

Outline of the distributed candidate rule generation phase
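A hedged PySpark sketch of one such MapReduce iteration is reported below; it reuses `sc`, `tr`, and `triangular_membership` from the earlier sketches, while `current_nodes`, `node_membership`, and the `unused_features` field are hypothetical names introduced only for illustration:

```python
from operator import add

batch = sc.broadcast(current_nodes)       # at most MaxY nodes processed per iteration (hypothetical list)
parts = sc.broadcast(partitions)          # per-feature lists of (a, b, c) triples

def split_stats(instance):
    """Map side: per-(node, feature, fuzzy set, class) fuzzy cardinality contributions of one instance."""
    x, y = instance
    for node_id, node in enumerate(batch.value):
        mu_g = node_membership(node, x, parts.value)      # membership of x to the parent node (hypothetical)
        if mu_g == 0.0:
            continue
        for f in node.unused_features:                    # an attribute is used once per root-to-leaf path
            for j, fs in enumerate(parts.value[f]):
                mu = triangular_membership(x[f], fs)
                if mu > 0.0:
                    yield ((node_id, f, j, y), mu * mu_g) # product T-norm, as in Eq. (3)

cards = tr.flatMap(split_stats).reduceByKey(add).collectAsMap()
# Reduce side (driver): group `cards` by (node_id, f), compute FGain for every candidate
# attribute, split the nodes whose termination conditions do not hold, and append the new
# children to the list of nodes still to be processed.
```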

It is worth underlining that the chosen distributed FDT achieves a very good CU utilization, with extremely satisfactory speed-up values, thus yielding good scalability with respect to the number of CUs [53]. The most critical aspect relates to the system requirements of the adopted computer cluster. In particular, the maximum number of nodes MaxY that can be processed in parallel depends on the amount of RAM available on the cluster. Of course, more memory resources can be provided so as to achieve the required parallelism [53].

The Distributed Multi-objective Evolutionary Learning

The Evolutionary Process

The multi-objective evolutionary learning phase aims to produce solutions that maximize accuracy and minimize TRL, the index chosen to express the complexity of the learned FRBCs. This is obtained through a distributed multi-objective evolutionary algorithm that takes as input the candidate RB produced in the previous phase. The overall structure of the learning stage is based on PAES-RCS for classification models, introduced in [9] and then extended to the distributed framework in [32]. As multi-objective evolutionary algorithm, we use (2 + 2)M-PAES, introduced in [18] and also successfully employed in our previous works [4, 6, 7]. (2 + 2)M-PAES is a modified version of the well-known (2 + 2)PAES [40] and is a steady-state evolutionary algorithm that uses two current solutions and stores the non-dominated solutions in an archive. Unlike the classical (2 + 2)PAES, which maintains the current solutions until they are replaced by solutions with particular characteristics, we randomly extract the current solutions at each iteration. Unlike the PAES-RCS adopted in [9, 32], which considers uniform fuzzy partitions with a pre-fixed number of fuzzy sets, here the granularity of each partition is learned as well. Thus, the chromosome coding has to accommodate this additional requirement.

The evolutionary process operates on three different aspects:

  1. selection of a subset of rules out of the initial set of candidate rules, with contextual activation/deactivation of the corresponding conditions;

  2. modification of the fuzzy partitions, by properly moving the cores of the composing triangular fuzzy sets;

  3. selection of the granularity level, i.e., the number Tf of fuzzy sets in each partition, in the range [Tmin,Tmax].

We recall that the initial set of candidate rules is obtained by considering uniform strong fuzzy partitions for each numeric variable Xf, each containing Tmax fuzzy sets. Also, the learning of the RBs, using the evolutionary rule and condition selection procedure, and the optimization of the fuzzy set parameters are performed considering such partitions, computed at the very beginning of the first phase. Indeed, we deal with virtual RBs [4] and virtual partitions [5]. The actual granularity is used only in the computation of the objectives. In practice, although during the evolutionary process we generate RBs, denoted as virtual RBs, and tune the fuzzy set parameters by using such virtual partitions, each time we need to evaluate the fitness, the evaluation is performed on the actual number of fuzzy sets used to partition the individual variables. This process requires the definition of proper mapping strategies, both for the RB and for the fuzzy set parameters. Thus, the execution of crossover and mutation operators is not affected by the actual number of fuzzy sets.

RB Mapping Strategy

To map the virtual RB, defined on partitions with Tmax fuzzy sets, onto a concrete RB defined on variables partitioned with Tf fuzzy sets, we adopt the simple procedure proposed in [4, 5]. Let us consider the following general proposition in a rule of the virtual RB: Xf is \(\widehat {A}_{f,h}\), h ∈ [1,Tmax]. It will be mapped to Xf is \(\widetilde {A}_{f,s}\), with s ∈ [1,Tf], where \(\widetilde {A}_{f,s}\) is the fuzzy set most similar to \(\widehat {A}_{f,h}\) among the Tf fuzzy sets of the partition defined on Xf. When dealing with triangular fuzzy sets, defined as discussed in “Preliminaries”, a trivial yet effective similarity measure is the distance between the centers of the cores of the two fuzzy sets. If two fuzzy sets exist in the partition whose core centers are at the same distance from the center of the core of \(\widehat {A}_{f,h}\), we randomly choose between them.

It can be noted that, at a certain granularity level, distinct fuzzy sets defined on the partitions of the virtual RB may map onto the same fuzzy set of the partitions used in the concrete RB. Thus, different rules of the virtual RB may correspond to the same rule in the concrete RB. For this reason, duplicates in the concrete RB are searched for and possibly removed. Of course, when the same virtual RB is mapped at a different granularity level, a different situation may arise also with respect to the presence of duplicated concrete rules. Thus, the concept of virtual RB allows us to explore the search space and, at the same time, exploit the (sub)optimal solutions found during the evolutionary process.
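The mapping strategy can be sketched in a few lines of Python; the data structures are illustrative assumptions and do not reflect the actual implementation:

```python
import random

def map_label(h: int, virtual_cores, concrete_cores) -> int:
    """Index (1-based) of the concrete fuzzy set whose core center is closest to virtual set h."""
    target = virtual_cores[h - 1]
    dists = [abs(b - target) for b in concrete_cores]
    best = min(dists)
    candidates = [s for s, d in enumerate(dists, start=1) if d == best]
    return random.choice(candidates)          # random tie-break, as described above

def map_rule_base(virtual_rb, virtual_cores, concrete_cores):
    """Map every virtual rule to its concrete counterpart and drop duplicated concrete rules."""
    concrete, seen = [], set()
    for antecedent, k in virtual_rb:
        mapped = tuple(0 if h == 0 else map_label(h, virtual_cores[f], concrete_cores[f])
                       for f, h in enumerate(antecedent))
        if (mapped, k) not in seen:
            seen.add((mapped, k))
            concrete.append((list(mapped), k))
    return concrete
```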

Fuzzy Set Parameter Mapping Strategy

As regards the fuzzy set parameter tuning, we approach the problem by using a piecewise, non-decreasing linear transformation [5]. We start from an initial partition of the input variables, and tune the parameters of the fuzzy sets that compose the partition by applying such a transformation. Let \(\widetilde {P}_{f}=\left \{\widetilde {A}_{f,1},\ldots ,\widetilde {A}_{f,T_{f}}\right \}\) and \(P_{f}=\left \{A_{f,1},\ldots ,A_{f,T_{f}}\right \}\) be the initial and the transformed partitions, respectively. In the following, we assume that the two partitions have the same universe (i.e., \(\widetilde {U}_{f} \equiv U_{f}\)), considering also each variable normalized in the interval [0,1].

Let \(t(x_{f}): U_{f} \rightarrow \widetilde {U}_{f}\) be the piecewise linear transformation. We have that \(A_{f,j}(x_{f})= \widetilde {A}_{f,j}\left (t \left (x_{f} \right )\right )=\widetilde {A}_{f,j}\left (\widetilde {x}_{f} \right )\), where \(\widetilde {A}_{f,j}\) and Af,j are two generic fuzzy sets from the initial and transformed partitions, respectively. We define the piecewise linear transformation by considering, as the “representative” of each fuzzy set, the corresponding center of the core. The sequence of representatives determines the changes of slope of the piecewise linear transformation t(xf) for each variable Xf. Let \(\widetilde {b}_{f,1}, \ldots ,\widetilde {b}_{f,T_{f}}\) and \(b_{f,1}, \ldots , b_{f,T_{f}}\) be the representatives of \(\widetilde {A}_{f,1}, \ldots ,\widetilde {A}_{f,T_{f}}\) and \(A_{f,1}, \ldots , A_{f,T_{f}}\), respectively. In each interval \(b_{f,j-1} \leq x_{f} \leq b_{f,j}\), \(j = 2,\ldots,T_{f}\), the transformation t(xf) is defined as:

$$ t(x_{f})=\frac{\widetilde{b}_{f,j}- \widetilde{b}_{f,j-1}}{b_{f,j}-b_{f,j-1}}\cdot (x_{f}-b_{f,j-1})+\widetilde{b}_{f,j-1} $$
(8)

Given t(xf), it can be used for the transformation of all the parameters that define the fuzzy sets.

Figure 3 shows an example of the case where t(xf), as per Eq. (8), is defined assuming a uniform initial partition and a maximum granularity Tmax = 7. Clearly, bf,1 and \(b_{f,T_{f}}\) coincide with the extremes of the universe Uf of Xf, thus t(xf) depends on Tf − 2 parameters, that is, \(t(x_{f}) = t(x_{f};b_{f,2},\ldots ,b_{f,T_{f}-1})\) [5]. Once the values \(b_{f,2},\ldots ,b_{f,T_{f}-1}\) are given, the partition Pf can be obtained by transforming the three points \((\widetilde {a}_{f,j},\widetilde {b}_{f,j},\widetilde {c}_{f,j})\) that describe the generic triangular fuzzy set \(\widetilde {A}_{f,j}\) into (af,j,bf,j,cf,j) by applying \(t^{-1}(\widetilde {x}_{f})\).

Fig. 3
figure 3

An example of piecewise linear transformation
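The following sketch implements Eq. (8) for a single variable under the stated assumptions (cores of the first and last fuzzy sets fixed to the universe extremes); the numeric example is hypothetical:

```python
import numpy as np

def t_transform(x: float, b, b_tilde) -> float:
    """Eq. (8): map x from the transformed partition (cores b) back to the uniform one (cores b_tilde)."""
    b = np.asarray(b)
    j = int(np.searchsorted(b, x))            # first index with b[j] >= x
    j = min(max(j, 1), len(b) - 1)            # clamp to the valid piecewise segments
    slope = (b_tilde[j] - b_tilde[j - 1]) / (b[j] - b[j - 1])
    return slope * (x - b[j - 1]) + b_tilde[j - 1]

# Hypothetical example with T_f = 5 on [0, 1]: uniform cores vs. displaced cores.
b_tilde = np.linspace(0.0, 1.0, 5)            # uniform representatives
b = np.array([0.0, 0.18, 0.55, 0.8, 1.0])     # learned (displaced) representatives
print(round(t_transform(0.3, b, b_tilde), 3)) # ~0.331 under these assumptions
```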

It is worth noticing that, during the learning of the granularity, the transformation is applied only to the parameters that describe a fuzzy set, thus obtaining again triangular fuzzy sets. Figure 4 shows an example of this transformation for granularity Tf = 5 by using the piecewise linear transformation in Fig. 3, which is defined for Tmax = 7.

Fig. 4
figure 4

Use of the transformation t(xf) of Fig. 3 on partitions with granularity Tf = 5 different from Tmax = 7

Objective Functions, Chromosome Coding, and Mating Operators

In the designed MOEL scheme, each chromosome is associated with a bi-dimensional objective vector. The first element accounts for the model complexity in terms of TRL for the relative actual RB. The second element assesses the model accuracy through its classification rate, as computed over the training set.

Within the evolutionary procedure, the chromosome C is composed of three different portions (CR,CG,CT): CR defines the virtual RB, CG the number of fuzzy sets, and CT the virtual partitions.

Let JFDT be the initial virtual RB obtained in the first phase, and MFDT the corresponding number of rules. We underline that the initial RBs are generated considering Tmax fuzzy sets for each partition. As we are interested in getting compact RBs, we constrain the virtual RB to contain no more than Mmax rules.

The CR part is a vector of Mmax pairs pm = (km,vm), where km ∈ [0,MFDT] identifies the index of the selected rule in JFDT, and vm = [vm,1,…,vm,F] is a boolean “mask” vector whose generic element vm,f indicates, for attribute Xf, whether or not to consider the corresponding condition in the rule (if not, it becomes a “don’t care” condition). As we want to be able to generate RBs with fewer than Mmax rules, km is set to 0 if the mth rule must be excluded from the RB.

CG is a vector that specifies the number of fuzzy sets to be used for each attribute. Thus, its f th element contains the number Tf ∈ [2,Tmax] of fuzzy sets to be used in the actual partition Pf(Tf). Tmax is an input parameter for the algorithm, and such a value applies to all the variables. As discussed in “RB Mapping Strategy,” the values contained in CG are used to generate the concrete RB from the virtual RB coded in CR.

CT is aimed at describing the placement of the Tmax distinct fuzzy sets within each strong fuzzy partition for all the F attributes; thus, it is a vector of F vectors, each containing Tmax − 2 real numbers. The fth vector \(\left [b_{f,2},\ldots ,b_{f,T_{\mathit {max}}-1}\right ]\) indicates the positions of the cores of the triangular fuzzy sets: it also contains the information needed to define the shape of the piecewise linear transformation t(xf) (and consequently \(t^{-1}\)) used to determine the positions of the Tf fuzzy sets, if Tf < Tmax holds. To make sure that bf,i < bf,i+ 1, ∀i ∈ [2,Tmax − 1], and to avoid an excessive departure of the cores with respect to the uniform partition, the value of the generic bf,j is restricted to vary in the interval

$$ \left[\!\widetilde{b}_{f,j}\!-\frac{\widetilde{b}_{f,j}-\widetilde{b}_{f,j-1}}{2},\widetilde{b}_{f,j}+\frac{\widetilde{b}_{f,j + 1}-\widetilde{b}_{f,j}}{2}\!\right] \!, \forall j \in \left[2,T_{\mathit{max}}-1\right]. $$
(9)

For the generation of the offspring populations, the MOEL makes use of both crossover and mutation operations. We apply independently the two-point crossover to CR, the one-point crossover to CG, and the BLX-α-crossover, with α = 0.5, to CT. As regards CR, the crossover points are always placed between two rules. In the case of CG, the crossover point is a random position in [1,F].

The algorithm includes two mutation operators for CR, one for CG, and another for CT. As regards CR, both operators start by randomly choosing a rule (actually, a pair pm) in the chromosome. The first operator replaces the rule in pm with another rule randomly chosen out of the candidate rule base. The second operator modifies the rule in pm by going through each position vm,f of the condition mask, and performing its complement with a probability equal to Pcond (\(P_{\textit {cond}}=\frac {2}{F}\) in the experiments).

The mutation for CG attempts to modify the granularity for one single variable Xf: it consists of randomly choosing a gene f ∈ [1,F] and randomly either incrementing or decrementing it by one. If the new value is out of the range [2,Tmax], no modification is actually performed.

The mutation for CT operates on bf,j, by first randomly choosing f in [1,F] and j in [2,Tmax − 1]: the new value for it is randomly selected in the allowed interval (Eq. 9). The described mating operators have experimentally shown a good balance between exploration and exploitation, thus being suitable for driving the evolutionary algorithm towards good approximations of the Pareto fronts.
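To make the chromosome coding and the mating operators more concrete, the sketch below shows a possible (purely illustrative) data layout for (CR, CG, CT) together with the granularity mutation and the core mutation constrained by Eq. (9); it is an assumption, not the authors' implementation:

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chromosome:
    c_r: List[Tuple[int, List[bool]]]   # M_max pairs (k_m, condition mask v_m)
    c_g: List[int]                      # one T_f in [2, T_max] per attribute
    c_t: List[List[float]]              # per attribute, the T_max - 2 cores b_{f,2}, ..., b_{f,T_max-1}

def mutate_granularity(ch: Chromosome, t_max: int) -> None:
    """Randomly pick one attribute and increment or decrement its granularity, if still in [2, T_max]."""
    f = random.randrange(len(ch.c_g))
    new_t = ch.c_g[f] + random.choice([-1, +1])
    if 2 <= new_t <= t_max:
        ch.c_g[f] = new_t

def mutate_cores(ch: Chromosome, uniform_cores: List[List[float]]) -> None:
    """Move one core b_{f,j} within the interval of Eq. (9) around its uniform position."""
    f = random.randrange(len(ch.c_t))
    j = random.randrange(len(ch.c_t[f]))      # 0-based index of b_{f,j+2}
    u = uniform_cores[f]                      # uniform core positions, length T_max
    lo = u[j + 1] - (u[j + 1] - u[j]) / 2.0
    hi = u[j + 1] + (u[j + 2] - u[j + 1]) / 2.0
    ch.c_t[f][j] = random.uniform(lo, hi)
```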

MOEL Distributed Implementation

The characteristics of the possible strategies for parallel/distributed MOEAs were extensively studied before the widespread use of cloud-computing facilities [54]. However, the recent opportunity to exploit cloud resources in a simple way, by means of efficient frameworks like Apache Spark, has made some solutions more attractive than others. Thus, the “master-slave” paradigm [54], inherent in typical Spark programs, has been chosen in particular for its ability to deal with big datasets, providing good scalability with respect to the size of the training set. Other paradigms, like the “islands” and “diffusion” ones, are much better suited to other distributed computing frameworks [19, 54]; in these cases, the obtained accuracy may be affected by the number of CUs used, whereas we want the results to be independent of the underlying platform.

The distributed implementation of the MOEL phase for the proposed approach is described in the schematic view of Fig. 5, where the workload distribution across a cluster of CUs is explicitly shown. It can be noted that distributed computations are used both in the initialization of the archive required by the genetic algorithm (upper part of the figure) and in the evolutionary procedure itself (lower part).

Fig. 5
figure 5

Outline of the distributed computation approach adopted in the evolutionary procedure

The overall MOEL algorithm is driven by the master task: it is in charge of the main control flow and, at each iteration, of the TRL part of the fitness computation (which, given the limited size of the rule base, is really effortless). Instead, the evaluation of the accuracy requires a scan of the whole TR, which is typically very large. Thus, it is advantageous to split TR into V chunks to be separately scanned by slave tasks on the cluster CUs.

This way of exploiting the CUs in the cluster is clearly indicated in Fig. 5.

This scheme can be easily accommodated by developing a single procedure to be executed just for this purpose by all the slave CUs, taking as input the two solutions to be evaluated and returning, for each of them, the number of successful classifications over the target chunk of TR. It is up to the driver task to sum up all the contributions to the overall accuracy value.
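Under these assumptions, the distributed accuracy evaluation can be sketched in PySpark as follows; `sc`, `tr`, and `classify_max_matching` follow the earlier sketches, and `mu` is the membership callback assumed there:

```python
def accuracy_of(solutions):
    """Distributed accuracy of the two current solutions; each solution is (rules, default_class)."""
    b_sols = sc.broadcast(solutions)

    def partial_hits(chunk):
        hits = [0] * len(b_sols.value)
        for x, y in chunk:
            for i, (rules, default_class) in enumerate(b_sols.value):
                if classify_max_matching(rules, x, mu, default_class) == y:
                    hits[i] += 1
        yield hits                                  # one partial count per chunk of TR

    totals = tr.mapPartitions(partial_hits) \
               .reduce(lambda a, b: [i + j for i, j in zip(a, b)])
    n = tr.count()                                  # in practice computed once and cached
    return [t / n for t in totals]
```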

Notably, it has been shown that the scalability of the adopted MOEL with respect to the number of CUs is almost linear [32]. This means that, whenever needed, additional CUs can be used so as to effectively deliver reduced runtimes. Finally, we underline that the effort required by the accuracy evaluation depends on the complexity of the rule base. As such complexity typically becomes smaller and smaller as the population evolves, the execution time of each iteration significantly decreases as the algorithm proceeds towards its completion.

Experimental Results

In this section, we show the results of an experimental study for the evaluation of DPAES-FDT-GL. In the following, we take two main aspects into consideration: (1) evaluation of the solutions provided by the algorithm in terms of classification accuracy, complexity, and interpretability; (2) comparison of the algorithm with the original DPAES-RCS. Moreover, in order to disentangle the contribution of the FDT from that of the granularity learning, we performed a comparison with DPAES-FDT, which adopts the FDT for generating the initial rule set, but no granularity learning during the evolutionary process.

The eight datasets used in the experiments are listed in Table 1, along with their number of instances, attributes (both numerical and categorical), and classes, and their size. The datasets have been collected from the UCI and LIBSVM repositories.

Table 1 Datasets used in the experiments. n and c denote numerical and categorical, respectively

For each dataset, we performed a five-fold cross validation. The experiments have been carried out using Apache Spark 2.2.0 over a small computer cluster; we used up to seven machines, one master node, and up to six workers. Both the master and the workers are supplied with 4 vCPU, 8 GB of RAM, and 160-GB hard drive. All the machines run Ubuntu 14.04. The datasets are stored on the Hadoop Distributed File System. In all the experiments, we used the stand-alone cluster manager provided by Apache Spark.

The parameterization used for DPAES-FDT-GL and DPAES-FDT is reported in Table 2; the values have been devised starting from our experience with DPAES-RCS [32]. Regarding the number of evaluations, for most of the datasets, we experimentally verified that the evolutionary optimization process exhibits a behavior similar to the one discussed in [9], where we showed that 50,000 fitness evaluations allow obtaining Pareto fronts statistically equivalent to the ones achieved after 1 million evaluations. For the sake of brevity, we do not report this analysis in the paper. Since each iteration of the (2 + 2)M-PAES requires two fitness evaluations, 50,000 fitness evaluations correspond to 25,000 iterations. Regarding granularity learning, we let the number of fuzzy sets per partition vary between Tmin = 3 and Tmax = 7. In fact, considering the employed strong triangular fuzzy partitioning scheme, the first and last fuzzy sets are tied to the ends of the universe, thus a meaningfully learned partition requires at least three fuzzy sets. Furthermore, according to psychologists, to preserve interpretability, the number of linguistic terms per variable should be in the range 7 ± 2, due to a limit of the human information processing capability [46]. Thus, for the sake of simplicity, we set Tmax = 7. Moreover, in our first work on granularity learning, discussed in [4], we showed that using 7 or 9 as Tmax yields similar results.

Table 2 Values of the parameters used in the experiments for DPAES-FDT-GL and DPAES-FDT

As previously described, for each dataset, the initial RB has been obtained by exploiting the multi-way version of the distributed FDT algorithm [53], along with a uniform discretization with Tmax = 7 linguistic values for each numeric attribute. As suggested in [53], we limited the minimum number of instances per leaf to 0.1% of the total number of instances. Moreover, we set the maximum tree depth β to 10 so as to generate sufficiently complex rules, while limiting their total number. The average number of generated rules, as well as the average number of selected features, is reported in Table 3.

Table 3 Values of the parameters used for the distributed FDT algorithm, and average numbers of rules and attributes in the RBs extracted from the generated FDTs

Experimental Evaluation of DPAES-FDT-GL

In this section, we discuss the results of an experimental evaluation of DPAES-FDT-GL. In order to provide a thorough evaluation of the proposed algorithm, we analyzed the Pareto front approximations generated during the optimization process by means of previously proposed methods [32]. First, for each fold, we extracted the Pareto front and sorted it by decreasing accuracy; then, we selected three solutions, denoted as FIRST, MEDIAN, and LAST: the most accurate solution, the median solution in the set, and the least accurate solution, respectively.

Table 4 shows, for each dataset and for each representative solution, the average values and standard deviations of the accuracy achieved on both the training (AccTra) and test (AccTst) sets, of the complexity (TRL), and of the number (NNDS) of non-dominated solutions contained in the archive at the end of the evolutionary process.

Table 4 Average values and standard deviations of the accuracy on the training and test sets and of TRL of the FIRST, MEDIAN, and LAST solutions and average values and standard deviations of the number of non-dominated solutions generated by DPAES-FDT-GL

First, in Table 4, we observe highly competitive results (a comparison with the state-of-the-art is provided in “Comparison of DPAES-FDT-GL with DPAES-FDT and DPAES-RCS”), while the TRL is still reasonable. Thus, we can conclude that DPAES-FDT-GL is able to generate both accurate and interpretable systems. Furthermore, by comparing the accuracies obtained on the training and test sets, we observed that little or no overtraining occurs.

In order to better characterize the interpretability of the provided solutions, in Table 5 we report M, \(\widehat {F}\), and #Fset for the FIRST, MEDIAN, and LAST solutions generated by DPAES-FDT-GL. Here, M is the average number of rules, \(\widehat {F}\) is the average number of selected features, and #Fset represents the average number of fuzzy sets obtained via granularity learning for each selected numerical feature. Interestingly, both the number of selected rules and the number of features are quite low, suggesting that the learned RB is highly interpretable. Moreover, the mean number of fuzzy sets per feature is lower than 5, suggesting that the granularity learning does indeed help in producing more interpretable systems.

Table 5 Average values and standard deviations of the number of rules (M), the number of attributes (\(\widehat {F}\)), and the average number of fuzzy sets in the partition (#Fset) for the FIRST, MEDIAN, and LAST solutions generated by DPAES-FDT-GL. Please note that the dataset POK has no numerical feature

Finally, comparing the number of rules M with TRL, we observe that the average rule length (not shown here) is typically very low, suggesting that the RBs are mostly composed by generic rules.

Table 6 shows, for each dataset, the average execution time (in seconds) of DPAES-FDT-GL, together with its standard deviation. Execution times have been measured on a cluster with 6 slaves, equipped with 4 cores each, for a total of 24 cores. Both the total execution time and the runtime of the distributed evolutionary optimization (DEO) phase are reported. We observe that the DEO phase is the most time-consuming one. The runtime is primarily affected by two factors: the number of instances in the dataset, and the TRL of the evaluated solutions.

Table 6 Average computation times (in seconds) and standard deviations for the distributed evolutionary optimization (DEO) phase and the overall algorithm (Tot)

To give an example of the results provided by DPAES-FDT-GL, we show the MEDIAN solution obtained on the first fold of the SUSY dataset. As reported in Table 4, the MEDIAN solution performs well both in terms of accuracy and of TRL. SUSY is a binary classification problem whose aim is to distinguish between a signal process that produces supersymmetric particles and a background process that does not. The data have been produced using Monte Carlo simulations and are characterized by 18 attributes. The first eight features are kinematic properties measured by the particle detectors in the accelerator. The last ten features are functions of the first eight; these are high-level features derived by physicists to help discriminate between the two classes. In the following, the features are labeled as: lepton 1 pT, lepton 1 η, lepton 1 ϕ, lepton 2 pT, lepton 2 η, lepton 2 ϕ, missing energy magnitude, missing energy ϕ, MET_rel, Axial MET, MR, M_TR_2, R, MT2, \(\sqrt {\hat {S}_{R}}\), \(M_{{\Delta }_{R}}\), \({\Delta }{\Phi }_{R^{\beta }}\), and \(\cos (\theta _{R+1})\). More information about the attributes can be found in the original manuscript [13].

Figure 6 shows, for each continuous attribute, both the original uniform fuzzy partition (dashed line) and the learned fuzzy partition (solid line) of the MEDIAN solution. The corresponding RB is shown in Fig. 7. Here, the labeling of the fuzzy sets depends on the granularity of the partitioning: for three fuzzy sets, we used low, medium, and high, while for seven fuzzy sets, we used very_low, low, medium-low, medium, medium-high, high, and very_high. The labeling for four, five, and six fuzzy sets has been obtained by interpolating in between. It is worth noticing that the RB is composed of only seven rules, with a maximum of three antecedents each.

Fig. 6
figure 6

Uniform fuzzy partitions (dashed line) and learned fuzzy partitions (solid line) of the attributes of the MEDIAN solution obtained at the end of the evolutionary process on the SUSY dataset

Fig. 7
figure 7

RB of the MEDIAN solution obtained on the first fold of SUSY. The RB, composed of seven rules, and characterized by a TRL of 18, achieved a classification accuracy of \(\sim 78.776\%\) on the test set

Comparison of DPAES-FDT-GL with DPAES-FDT and DPAES-RCS

In this section, we experimentally compare the performance of DPAES-FDT-GL with that of DPAES-FDT and DPAES-RCS, the baseline MOEL scheme from which DPAES-FDT-GL and DPAES-FDT have been derived. We underline that in [32] it has been shown that DPAES-RCS is highly effective when compared with other state-of-the-art algorithms, such as distributed decision trees and Chi-FRBCS-BigData.

Hereafter, the reported results achieved by DPAES-RCS are taken from the tables in [32]. In Table 7, we list the average values with standard deviations of the accuracy on the training (AccTra) and test (AccTst) sets for the FIRST, MEDIAN, and LAST solutions generated by DPAES-FDT-GL, DPAES-FDT, and DPAES-RCS. We also report M and TRL in Table 8.

Table 7 Average accuracies ± standard deviations achieved by the FIRST, MEDIAN, and LAST solutions generated by DPAES-FDT-GL, DPAES-FDT, and by DPAES-RCS
Table 8 Average M and TRL ± standard deviations achieved by the FIRST, MEDIAN, and LAST solutions generated by DPAES-FDT-GL, DPAES-FDT, and by DPAES-RCS

We observe that the accuracy values obtained by the three algorithms are generally comparable across all the three representative solutions analyzed here. Moreover, the solutions generated by DPAES-FDT-GL and DPAES-FDT are always more compact than those produced by DPAES-RCS, except for the FIRST solution on the COV_2 dataset. It is also worth noting that the DPAES-FDT-GL solutions are characterized, in most cases, by a lower TRL and fewer rules than the DPAES-FDT solutions.

To statistically assess the differences among the achieved accuracies and complexities, we generate, for each comparison algorithm and over all datasets, a distribution consisting of the average accuracy values obtained on the test set, and a distribution consisting of the average complexity values. Then, we apply non-parametric statistical tests. In particular, we perform the Friedman test to compute a ranking among the distributions, and the Iman and Davenport test to evaluate whether a statistical difference exists among the distributions. If the Iman and Davenport p value is lower than the level of significance α (we assume the standard value α = 0.05), we can reject the null hypothesis and affirm that there exist statistical differences among the multiple distributions. Otherwise, no statistical difference exists. In case of statistical difference, we apply a post hoc procedure, namely the Holm test. This test allows detecting statistical differences between the control approach, i.e., the one with the lowest Friedman rank, and the remaining approaches. Details on the aforementioned tests may be found in [34].
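As an illustration of this protocol, the following sketch computes the Friedman statistic and the Iman and Davenport correction with SciPy; the input file is hypothetical (one row per dataset, one column per algorithm):

```python
import numpy as np
from scipy.stats import friedmanchisquare, f as f_dist

# Hypothetical input: one row per dataset, one column per algorithm (here, the three approaches).
acc = np.loadtxt("avg_test_accuracies.csv", delimiter=",")
n, k = acc.shape

chi2, _ = friedmanchisquare(*acc.T)                 # Friedman statistic over the k algorithms
ff = (n - 1) * chi2 / (n * (k - 1) - chi2)          # Iman and Davenport correction
p_value = 1.0 - f_dist.cdf(ff, k - 1, (k - 1) * (n - 1))
reject_null = p_value < 0.05                        # if True, proceed with the Holm post hoc test
```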

Table 9 shows the results of the application of the Friedman and of the Iman and Davenport tests on the accuracy values obtained over the test set. The null hypothesis for the Iman and Davenport test can never be rejected (the p values are always greater than 0.05). Thus, we can conclude that the three algorithms are statistically equivalent in terms of accuracy. On the other hand, it is worth noticing that DPAES-FDT-GL and DPAES-FDT achieve the highest ranks for the FIRST solutions.

Table 9 Results of the Friedman and of the Iman and Davenport tests on the accuracy computed on the test set

Table 10 shows the results of the application of the Friedman and of the Iman and Davenport tests on the complexities. In this case, the null hypothesis associated with the Iman and Davenport test is always rejected (the p values are always lower than 0.05). Thus, we performed the Holm post hoc procedure by considering DPAES-FDT-GL, DPAES-FDT, and DPAES-FDT-GL as control approaches for the FIRST, MEDIAN, and LAST solutions, respectively. By analyzing Table 11, we can conclude that the DPAES-RCS solutions are always statistically more complex than those of the control algorithms. On the other hand, the complexity of the solutions generated by DPAES-FDT-GL and DPAES-FDT is always statistically equivalent. In conclusion, both DPAES-FDT-GL and DPAES-FDT outperform DPAES-RCS in terms of complexity. It is worth noticing that, for the FIRST and the LAST solutions, DPAES-FDT-GL achieves the best Friedman rank. Indeed, as discussed above, in most of the cases, the complexities of the DPAES-FDT-GL solutions are lower than those of the solutions generated by DPAES-FDT.

Table 10 Results of the Friedman and of the Iman and Davenport tests on the complexity
Table 11 Results of the Holm post hoc procedures on the complexity for α = 0.05

For an easier visual comparison of the widths of the Pareto front approximations obtained by DPAES-FDT-GL, DPAES-FDT, and DPAES-RCS, in Fig. 8 we plot, on the classification rate/TRL plane, the average values achieved by the three representative solutions, for all the datasets, on both the training and test sets. Here the solutions generated by DPAES-FDT-GL, DPAES-FDT, and DPAES-RCS are reported as blue diamond, empty black circle, and red plus markers, respectively. The results provided in Tables 7 and 8 can thus be visually evaluated, and the trends of the results previously discussed can be easily identified.

Fig. 8
figure 8

Plots of the average accuracy on the training and test sets and average TRL of the FIRST, MEDIAN, and LAST solutions generated by DPAES-FDT-GL (blue diamond markers), DPAES-FDT (empty black circle markers), and DPAES-RCS (red plus symbol markers)

In conclusion, we can state that employing the distributed FDT, rather than a distributed version of C4.5, allows the MOEL process to generate more compact FRBCs. Moreover, even though we cannot find statistical differences between the complexities of the FRBCs generated by DPAES-FDT-GL and DPAES-FDT, the activation of the granularity learning allows us to reduce, in most of the cases, the number of rules and the TRL of the generated classifiers. The good behavior of DPAES-FDT-GL can be mainly attributed to the following considerations. First of all, the FDT learning algorithm generates fuzzy decision trees directly from fuzzy partitions. Thus, the tree is tuned to the fuzzy partitions. On the other hand, the C4.5 learning algorithm used in DPAES-RCS generates decision trees from crisp partitions. Indeed, the fuzzy partitions are actually treated as crisp during the execution of the learning algorithm: each fuzzy set is approximated by a crisp set that corresponds to the α-cut with α = 0.5, preserving the same label as the corresponding fuzzy set. Once the tree has been learned, the rules are extracted from it and the labels are re-assigned to the original fuzzy sets. Thus, the decision tree (differently from the FDT) is not tuned to the final fuzzy partitions. Second, the granularity learning process allows reducing the number of fuzzy sets for each linguistic variable. The lower the number of fuzzy sets that describe each partition, the lower the number of combinations that can be generated for the classification rules. These two aspects mainly contribute to the good behavior exhibited by DPAES-FDT-GL, which achieves more compact solutions than DPAES-RCS. This result is achieved thanks to the synergy among the initial set of fuzzy rules extracted from the FDT, granularity learning, rule and condition selection, and fuzzy set parameter learning. Indeed, the membership function parameter learning allows adapting the fuzzy partitions to the dataset, even when using a low number of fuzzy sets for each linguistic attribute. Thus, the number of rules can decrease and the accuracy can increase during the evolutionary process.

As a final remark, we briefly compare the results achieved by DPAES-FDT-GL with those obtained by a distributed multi-way fuzzy decision tree (DMFDT) learning algorithm [53] and by a distributed fuzzy associative classifier for big data (DFAC-FFP) [52] (Table 12). We highlight that DMFDT exploits the same FDT learning algorithm used to generate the initial set of fuzzy rules in DPAES-FDT-GL, but it employs fuzzy partitions generated by a distributed fuzzy discretizer, leaves labeled with different classes, and a weighted voting inference strategy. To this aim, we selected the HIGGS, KDD_2, and SUSY datasets. We chose only these three datasets since they are the only ones for which results are available in both [53] and [52], where DMFDT and DFAC-FFP were respectively proposed. Furthermore, HIGGS and SUSY are the largest datasets in terms of memory occupancy.

Table 12 Comparison of the average accuracies on the test set and average complexities for DPAES-FDT-GL, DMFDT, and DFAC-FFP

On HIGGS, DMFDT achieves the highest accuracy; the lower complexity of both DPAES-FDT-GL and DFAC-FFP is balanced by a lower classification accuracy. Furthermore, while the accuracies of DPAES-FDT-GL and DFAC-FFP are comparable, their model complexities differ by about two orders of magnitude. On KDD_2, the three algorithms achieve more or less the same accuracy, but the complexity of DPAES-FDT-GL is one order of magnitude smaller than that of the two comparison algorithms. More interestingly, DMFDT achieves a classification accuracy of ∼ 79.6% on the SUSY dataset; this is ∼ 1.1% higher than that achieved by DPAES-FDT-GL, yet it has been obtained with 805,076 nodes and 758,064 leaves, that is, with a system four orders of magnitude more complex than the one generated by DPAES-FDT-GL. Finally, it is worth noticing that DPAES-FDT-GL achieves better results than DFAC-FFP, with a complexity smaller by two orders of magnitude.

Conclusions and Future Work

In this paper, we have presented a novel approach, denoted as DPAES-FDT-GL, for generating, from big data, sets of fuzzy rule–based classifiers with different optimal trade-offs between accuracy and interpretability. The approach extends DPAES-RCS, a distributed multi-objective evolutionary algorithm recently proposed on the Apache Spark framework. The extensions regard two main aspects. First, the initial set of candidate rules used in the multi-objective evolutionary learning is extracted from a fuzzy decision tree (FDT) rather than from a crisp decision tree. The FDT is generated by a distributed learning algorithm recently proposed by one of the authors. Second, the granularity of each numerical attribute is determined during the evolutionary process. We have executed DPAES-FDT-GL on eight big datasets and have compared the results with those obtained by DPAES-RCS. While the accuracy achieved by the fuzzy rule–based classifiers generated by DPAES-FDT-GL is statistically comparable to that obtained by the classifiers generated by DPAES-RCS, the models generated by DPAES-FDT-GL are characterized by the lowest number of rules, conditions, and fuzzy sets. We can conclude that DPAES-FDT-GL represents an important step forward in obtaining interpretable fuzzy classifiers in the context of big data. Since a number of real applications require not only high accuracy, but also high interpretability, we strongly believe that DPAES-FDT-GL can be a very interesting and promising approach for such applications.

In order to disentangle the contribution of the FDT from that of the granularity learning, we also performed a comparison with DPAES-FDT, a version of DPAES-FDT-GL, which adopts the FDT for generating the initial rule set, but no granularity learning during the evolutionary process. We observed that, when extracting the initial set of rules from an FDT, we obtain models that are always statistically less complex. Moreover, even though we cannot find statistical differences between the complexities of the FRBCs generated by DPAES-FDT-GL and DPAES-FDT, we observed that the activation of the granularity learning allows reducing, in most of the cases, the number of rules and the TRL of the generated classifiers.

Future work will address the problem, in the specific setting of the described approach, of bounding the size of the training set without experiencing losses in the achieved accuracy. This aspect is crucial in dealing with big data, and effective solutions can extend the practical applicability of DPAES-FDT-GL to extremely big datasets, with no significant additional penalties in the runtimes of the learning phase. Indeed, the main problem we have to cope with when using evolutionary fuzzy systems (EFSs) with big data is the computation of the accuracy on the overall training set. This computation depends on the number of instances in the training set and on the dimensionality of each instance. In big data, both these numbers are generally high, thus requiring long runs before achieving satisfactory solutions. Techniques for reducing the number of attributes and the number of instances of the datasets, while preserving the accuracy achieved by the models, are therefore very appealing. As regards attribute reduction, our approach already performs a selection of attributes when we apply the FDT algorithm for generating the initial set of rules: indeed, the attributes that are considered in no decision node are removed. Furthermore, during the evolutionary process of RCS, attributes that are included in no rule can be eliminated. Nevertheless, we would like to investigate an appropriate chromosome coding for explicitly performing attribute selection during the evolutionary optimization. The reduction of the number of instances can be performed with the many approaches that have been proposed in the literature, but these need to be adequately tuned to the specific setting of the proposed algorithm. Further, in the past, we proposed a co-evolutionary approach for instance selection [8], which should be adapted to manage big data.