
1 Introduction

Inspired by the idea that multiple opinions are crucial before making a final decision, ensemble methods make predictions by consulting multiple different predictors [31]. Apart from their similarity with some natural decision-making methodologies, ensemble methods have a strong statistical background. Namely, ensemble methods aim to reduce the variance – thus increasing the accuracy – by combining multiple different predictors. Due to their versatility and effectiveness, ensemble methods have been successfully applied to a wide range of problems including classification, regression, and feature selection. As a preliminary study, this paper only addresses ensemble methods for binary classification problems.

Although there is no rigorous definition of an ensemble classifier [23], it can be conceived as a group of individual classifiers, also called weak or base classifiers. As to the construction of an ensemble classifier, we must take into account the diversity of the base classifiers and the rule used to combine them [23, 30]. There are a plethora of ensemble methods in the literature, including bagging, pasting, random subspace, boosting, and stacking [2, 10, 14, 38]. For example, a bagging ensemble classifier is obtained by training copies of a single base classifier using different subsets of the training set [2]. Similarly, a random subspace classifier is obtained by training copies of a classifier using different subsets of features [14]. In both bagging and random subspace ensembles, the base classifiers are then combined using a voting scheme. Random forest is a successful example of an ensemble of decision tree classifiers trained using both the bagging and random subspace ideas [3].

In contrast to the traditional majority voting, in this paper, we propose to combine the base classifiers using an associative memory. Associative memories (AMs) refer to a broad class of mathematical models inspired by the human brain’s ability to store and recall information by association [1, 13, 21]. The Hopfield neural network is a typical example of a recurrent neural network able to implement an associative memory [15]. Despite its many successful applications [16, 32,33,34], the Hopfield neural network suffers from an extremely low storage capacity as an associative memory model [24]. To overcome the low storage capacity of the Hopfield network, many prominent researchers proposed alternative learning schemes [18, 27] as well as improved network architectures. In particular, the recurrent correlation associative memories (RCAMs), proposed by Chiueh and Goodman [5], can be viewed as a kernelized version of the Hopfield neural network [8, 9, 29]. In this paper, we apply the RCAMs to combine binary classifiers in an ensemble method.

At this point, we would like to remark that associative memories have been previously used by Kultur et al. to improve the performance of an ensemble method [22]. Besides addressing a regression problem, Kultur et al. use an associative memory in parallel with an ensemble of multi-layer perceptrons. The resulting model is called ensemble of neural networks with associative memory (ENNA). Our approach, in contrast, uses an associative memory to combine the base classifiers. Moreover, Kultur et al. associate patterns using the k-nearest neighbor algorithm, which is formally a non-parametric method for classification or regression. We, on the other hand, use recurrent correlation associative memories, which are models conceived specifically to implement associative memories.

The paper is organized as follows: The next section reviews the recurrent correlation associative memories. Ensemble methods are presented in Sect. 3. The main contribution of the manuscript, namely the ensemble classifiers based on associative memories, is addressed in Sect. 3.2. Section 4 provides some computational experiments. The paper finishes with some concluding remarks in Sect. 5.

2 A Brief Review on Recurrent Correlation Associative Memories

Recurrent correlation associative memories (RCAMs) have been introduced by Chiueh and Goodman as an improved version of the famous correlation-based Hopfield neural network [5, 15].

Briefly, an RCAM is obtained by decomposing the Hopfield network with Hebbian learning into a two-layer recurrent neural network. The first layer computes the inner product (correlation) between the input and the memorized items, followed by the evaluation of a continuous non-decreasing activation function. The subsequent layer yields a weighted sum of the stored items.

In mathematical terms, an RCAM is defined as follows: Let \(\mathbb {B} = \{-1,+1\}\) and \(f:[-1,+1]\rightarrow \mathbb {R}\) be a continuous non-decreasing real-valued function. Given a fundamental memory set \(\mathcal {U}=\{\mathbf{u}^1,\ldots ,\mathbf{u}^P \} \subset \mathbb {B}^N\), the neurons in the first layer of a bipolar RCAM yield

$$\begin{aligned} w_\xi (t)= f\left( \frac{1}{N} \sum _{i=1}^N z_i(t) u_i^\xi \right) , \quad \forall \xi = 1,\ldots ,P, \end{aligned}$$
(1)

where \(\textit{\textbf{z}}(t)=[z_1(t),z_2(t),\ldots ,z_N(t)]^T \in \mathbb {B}^N\) denotes the current state of the network and \(\mathbf{u}^\xi = [u_1^\xi ,\ldots ,u_N^\xi ]^T\) is the \(\xi \)th fundamental memory. The activation potential \(a_i(t)\) of the \(i\)th output neuron is given by the following weighted sum of the memory items:

$$\begin{aligned} a_i(t)=\sum _{\xi =1}^P w_\xi (t) u_i^\xi , \quad \forall i=1,\ldots ,N.\end{aligned}$$
(2)

Finally, the state of the ith neuron of the RCAM is updated as follows for all \(i=1,\ldots ,N\):

$$\begin{aligned} z_i(t+1) = {\left\{ \begin{array}{ll} {\text {sgn}}\big (a_i(t)\big ), &{} a_i(t) \ne 0, \\ z_i(t), &{} \hbox {otherwise}. \end{array}\right. } \end{aligned}$$
(3)

In view of (2), we refer to \(w_\xi (t)\) as the weight associated with the \(\xi \)th memory item.

In contrast to the Hopfield neural network, the sequence \(\{\textit{\textbf{z}}(t)\}_{t \ge 0}\) produced by an RCAM is convergent in both synchronous and asynchronous update modes independently of the number of fundamental memories and the initial state vector \(\textit{\textbf{z}}(0)\) [5]. In other words, the limit \(\textit{\textbf{y}}= \lim _{t \rightarrow \infty } \textit{\textbf{z}}(t+1)\) of the sequence given by (3) is well defined using either synchronous or asynchronous update.

As an associative memory model, an RCAM designed for the storage and recall of the vectors \(\mathbf{u}^1,\ldots ,\mathbf{u}^P\) proceeds as follows: Given a stimulus (initial state) \(\textit{\textbf{z}}(0)\), the vector recalled by the RCAM is \(\textit{\textbf{y}}= \lim _{t \rightarrow \infty } \textit{\textbf{z}}(t+1)\).

Finally, different choices of the function f yield different RCAM models. For example:

1. The correlation RCAM or identity RCAM is obtained by considering in (1) the identity function \(f_i(x)=x\).

2. The exponential RCAM, which is determined by

$$\begin{aligned} f_e(x;\alpha )= e^{\alpha x}, \quad \alpha >0. \end{aligned}$$
(4)

The identity RCAM corresponds to the traditional Hopfield network with Hebbian learning and self-feedback. In contrast to the Hopfield network and the identity RCAM, the storage capacity of the exponential RCAM scales exponentially with the dimension of the memory space. Apart from the high storage capacity, the exponential RCAM can be easily implemented on very large scale integration (VLSI) devices [5]. Furthermore, the exponential RCAM allows for a Bayesian interpretation [11] and it is closely related to support vector machines and the kernel trick [8, 9, 29]. In this paper, we focus on the exponential RCAM, also known as the exponential correlation associative memory (ECAM).
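To make the recall phase concrete, the following minimal NumPy sketch implements the synchronous dynamics (1)–(3); the function name, the stopping rule, and the default choice \(\alpha = 1\) are illustrative assumptions rather than part of the original model specification. Passing f = lambda x: x recovers the identity RCAM.

```python
import numpy as np

def rcam_recall(U, z0, f=lambda x: np.exp(1.0 * x), max_iter=100):
    """Synchronous recall of a bipolar RCAM following Eqs. (1)-(3).

    U  : (N, P) array whose columns are the fundamental memories u^1, ..., u^P.
    z0 : (N,) initial state z(0); zeros are tolerated as neutral entries.
    f  : continuous non-decreasing activation; the default is the exponential
         RCAM with alpha = 1, while f = lambda x: x gives the identity RCAM.
    """
    N = U.shape[0]
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(max_iter):
        w = f(U.T @ z / N)                       # Eq. (1): weights w_xi(t)
        a = U @ w                                # Eq. (2): activation potentials
        z_new = np.where(a != 0, np.sign(a), z)  # Eq. (3): state update
        if np.array_equal(z_new, z):             # stationary state reached
            break
        z = z_new
    return z
```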

3 Ensemble of Binary Classifiers

An ensemble classifier combines a group of single classifiers, also called weak or base classifiers, in order to provide better classification accuracy than any single one of them [23, 31, 38]. Although this approach is partially inspired by the idea that multiple opinions are crucial before making a final decision, ensemble classifiers have a strong statistical background. Namely, ensemble classifiers reduce the variance by combining the base classifiers. Furthermore, when the amount of training data available is too small compared to the size of the hypothesis space, the ensemble classifier “mixes” the base classifiers, reducing the risk of choosing the wrong single classifier [19].

Formally, let \( \mathcal {T} = \{(\textit{\textbf{t}}_1,d_1),\dots , (\textit{\textbf{t}}_M,d_M)\}\) be a training set where \(\textit{\textbf{t}}_i \in \mathcal {X}\) and \(d_i \in \mathcal {C}\) are respectively the feature sample and the class label of the ith training pair. Here, \(\mathcal {X}\) denotes the feature space and \(\mathcal {C}\) represents the set of all class labels. In a binary classification problem, we can identify \(\mathcal {C}\) with \(\mathbb {B} = \{-1,+1\}\). Moreover, let \(h_1,h_2,\ldots ,h_P:\mathcal {X} \rightarrow \mathcal {C}\) be base classifiers trained using the whole or part of the training set \(\mathcal {T}\).

Usually, the base classifiers are chosen according to their accuracy and diversity. On the one hand, an accurate classifier is one whose error rate on new instances is lower than that of random guessing. On the other hand, two classifiers are diverse if they make different errors on new instances [12, 19].

Bagging and random subspace ensembles are examples of techniques that can be used to ensure the diversity of the base classifiers. The idea of bagging, an acronym for Bootstrap AGGregatING, is to train copies of a certain classifier h on subsets of the training set \(\mathcal {T}\) [2]. The subsets are obtained by sampling the training set \(\mathcal {T}\) with replacement, a methodology known as bootstrap sampling [23]. In a similar fashion, random subspace ensembles are obtained by training copies of a certain classifier h using different subsets of the feature space [14]. Random forest, which is defined as an ensemble of decision tree classifiers, is an example of an ensemble classifier that combines both bagging and random subspace techniques [3].
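As an illustration of these constructions, the following scikit-learn snippet builds bagging, random subspace, and random forest ensembles of decision trees; the synthetic data set and all hyperparameter values are arbitrary choices made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data used only for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Bagging: copies of a decision tree trained on bootstrap samples of T.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                            random_state=42).fit(X, y)

# Random subspace: copies of a decision tree trained on random subsets of
# the features (no bootstrap sampling of the training samples).
subspace = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                             bootstrap=False, max_features=0.5,
                             random_state=42).fit(X, y)

# Random forest: decision trees combining both ideas.
forest = RandomForestClassifier(n_estimators=30, random_state=42).fit(X, y)
```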

Another important issue that must be addressed in the design of an ensemble classifier is how to combine the base classifiers. In the following, we review the majority voting methodology – one of the oldest and most widely used combination schemes. The methodology based on associative memories is introduced and discussed subsequently.

3.1 Majority Voting Classifier

As remarked by Kuncheva [23], majority voting is one of the oldest strategies for decision making. In a wide sense, a majority voting classifier yields the class label with the highest number of occurrences among the base classifiers [10, 35].

Formally, let \(h_1,h_2,\ldots ,h_P:\mathcal {X} \rightarrow \mathcal {C}\) be the base classifiers. The majority voting classifier, also called hard voting classifier and denoted by \(H_v:\mathcal {X} \rightarrow \mathcal {C}\), is defined by means of the equation

$$\begin{aligned} H_v(\textit{\textbf{x}}) = \underset{c \in \mathcal {C}}{{\text {argmax}}}\, \sum _{\xi =1}^P w_\xi \mathcal {I}[h_\xi (\textit{\textbf{x}})=c], \quad \forall \textit{\textbf{x}}\in \mathcal {X}, \end{aligned}$$
(5)

where \(w_1,\ldots ,w_P\) are the weights of the base classifiers and \(\mathcal {I}\) is the indicator function, that is,

$$\begin{aligned} \mathcal {I}[h_\xi (\textit{\textbf{x}})=c] = {\left\{ \begin{array}{ll} 1, &{} h_\xi (\textit{\textbf{x}})=c,\\ 0, &{} \hbox {otherwise}. \end{array}\right. } \end{aligned}$$
(6)

When \(\mathcal {C}=\{-1,+1\}\), the majority voting ensemble classifier given by (5) can be written alternatively as

$$\begin{aligned} H_v(\textit{\textbf{x}}) = {\text {sgn}}\left( \sum _{\xi =1}^P w_\xi h_\xi (\textit{\textbf{x}})\right) , \quad \forall \textit{\textbf{x}}\in \mathcal {X}, \end{aligned}$$
(7)

whenever \(\sum _{\xi =1}^P w_\xi h_\xi (\textit{\textbf{x}}) \ne 0\) [7].
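A minimal NumPy sketch of the weighted hard voting rule (7) is given below; the helper name and the uniform-weight default are illustrative assumptions.

```python
import numpy as np

def hard_voting(base_predictions, weights=None):
    """Weighted majority voting of Eq. (7) for bipolar labels in {-1, +1}.

    base_predictions : (P, L) array; row xi holds h_xi evaluated on L samples.
    weights          : (P,) array of voting weights (uniform voting if None).
    """
    base_predictions = np.asarray(base_predictions, dtype=float)
    P = base_predictions.shape[0]
    w = np.ones(P) if weights is None else np.asarray(weights, dtype=float)
    # As in Eq. (7), the output is well defined whenever the weighted sum of
    # the votes is non-zero for every sample.
    return np.sign(w @ base_predictions)
```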

3.2 Ensemble Based on Bipolar Associative Memories

Let us now introduce the ensemble classifiers based on the RCAM models. In analogy to the majority voting ensemble classifier, the RCAM-based ensemble classifier is formulated using only the base classifiers \(h_1,\ldots ,h_P:\mathcal {X} \rightarrow \mathbb {B}\). Precisely, consider a training set \(\mathcal {T} = \{(\textit{\textbf{t}}_i,d_i):i=1,\ldots ,M\} \subset \mathcal {X}\times \mathbb {B}\) and let \(X = \{\textit{\textbf{x}}_{1},\ldots ,\textit{\textbf{x}}_{L} \} \subset \mathcal {X}\) be a batch of input samples. We first define the fundamental memories as follows for all \(\xi =1,\ldots ,P\):

$$\begin{aligned} \mathbf{u}^\xi = [h_\xi (\textit{\textbf{t}}_1),\ldots ,{h_\xi (\textit{\textbf{t}}_M)},h_\xi (\textit{\textbf{x}}_1),\ldots ,h_\xi (\textit{\textbf{x}}_L)]^T \in \mathbb {B}^{M+L}. \end{aligned}$$
(8)

In words, the \(\xi \)th fundamental memory is obtained by concatenating the outputs of the \(\xi \)th base classifier evaluated at the M training samples and the L input samples. The bipolar RCAM is synthesized using the fundamental memory set \(\mathcal {U}=\{\mathbf{u}^1,\ldots ,\mathbf{u}^P\}\) and it is initialized at the state vector

$$\begin{aligned} \textit{\textbf{z}}(0) = [d_1,d_2,\ldots ,d_M, \underbrace{0,0,\ldots ,0}_{L-\hbox {times}}]^T. \end{aligned}$$
(9)

Note that the first M components of initial state \(\textit{\textbf{z}}(0)\) correspond to the targets in the training set \(\mathcal {T}\). The last L components of \(\textit{\textbf{z}}(0)\) are zero, a neutral element different from the class labels. The initial state \(\textit{\textbf{z}}(0)\) is presented as input to the associative memory and the last L components of the recalled vector \(\textit{\textbf{y}}\) yield the class label of the batch of input samples \(X = \{\textit{\textbf{x}}_1,\ldots ,\textit{\textbf{x}}_L\}\). In mathematical terms, the RCAM-based ensemble classifier \(H_a:\mathcal {X} \rightarrow \mathbb {B}\) is defined by means of the equation

$$\begin{aligned} H_a(\textit{\textbf{x}}_i) = y_{M+i}, \quad \forall \textit{\textbf{x}}_i \in X, \end{aligned}$$
(10)

where \(\textit{\textbf{y}}= [y_1,\ldots ,y_M,y_{M+1},\ldots ,y_{M+L}]^T\) is the limit of the sequence \(\{\textit{\textbf{z}}(t)\}_{t\ge 0}\) given by (3).
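The following NumPy sketch wires Eqs. (8)–(10) around the recall dynamics (1)–(3); the function name, the convergence test, and the assumption that the base classifiers return bipolar labels directly are ours and not part of the formal definition above.

```python
import numpy as np

def rcam_ensemble_predict(base_classifiers, X_train, d_train, X_batch,
                          f=lambda x: np.exp(1.0 * x), max_iter=100):
    """Sketch of the RCAM-based ensemble classifier H_a of Eqs. (8)-(10).

    base_classifiers : list of P fitted classifiers whose predict method is
                       assumed to return bipolar labels in {-1, +1}.
    X_train, d_train : training features and bipolar targets d_1, ..., d_M.
    X_batch          : batch of L input samples to be classified.
    """
    d_train = np.asarray(d_train, dtype=float)
    M, L = len(d_train), len(X_batch)
    N = M + L
    # Fundamental memories, Eq. (8): one column per base classifier.
    U = np.column_stack(
        [np.concatenate([h.predict(X_train), h.predict(X_batch)])
         for h in base_classifiers]).astype(float)
    # Initial state, Eq. (9): the training targets followed by L zeros.
    z = np.concatenate([d_train, np.zeros(L)])
    for _ in range(max_iter):
        w = f(U.T @ z / N)                       # Eq. (1)
        a = U @ w                                # Eq. (2)
        z_new = np.where(a != 0, np.sign(a), z)  # Eq. (3)
        if np.array_equal(z_new, z):             # stationary state reached
            break
        z = z_new
    return z[M:]                                 # Eq. (10): labels of X_batch
```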

In the following, we point out the relationship between the bipolar RCAM-based ensemble classifier and the majority voting ensemble described by (7). Let \(\textit{\textbf{y}}\) be the vector recalled by the RCAM fed by the input \(\textit{\textbf{z}}(0)\) given by (9), that is, \(\textit{\textbf{y}}\) is a stationary state of the RCAM. From (2), (3), and (8), the output of the RCAM-based ensemble classifier satisfies

$$\begin{aligned} H_a(\textit{\textbf{x}}_i) = {\text {sgn}}\left( \sum _{\xi =1}^P w_\xi h_\xi (\textit{\textbf{x}}_i) \right) , \end{aligned}$$
(11)

where

$$\begin{aligned} w_\xi = f\left( \frac{1}{M+L}\sum _{i=1}^{M+L} y_i u_i^\xi \right) , \quad \forall \xi =1,\ldots ,P. \end{aligned}$$
(12)

From (11), the bipolar RCAM-based ensemble classifier can be viewed as a weighted majority voting classifier. Furthermore, the weight \(w_\xi \) depends on the similarity between the \(\xi \)th base classifier \(h_\xi \) and the ensemble classifier \(H_a\). Precisely, let us define the similarity between two binary classifiers \(H,h:\mathcal {X} \rightarrow \mathbb {B}\) on a set of samples S by means of the equation

$$\begin{aligned} \mathtt {Sim}(H,h) = \frac{1}{\hbox {Card}(S)} \sum _{\textit{\textbf{s}} \in S} \mathcal {I}\big [ h(\textit{\textbf{s}}) = H(\textit{\textbf{s}})\big ]. \end{aligned}$$
(13)

Using (13), we can state the following theorem:

Theorem 1

The weights of the RCAM-based ensemble classifier given by (11) satisfy the following identity for all \(\xi =1,\ldots ,P\):

$$\begin{aligned} w_\xi = f \big (2\cdot \mathtt {Sim}(H_a, h_\xi ) - 1 \big ), \end{aligned}$$
(14)

where the similarity in (14) is evaluated on the union of all training and input samples, that is, on \(S = \{\textit{\textbf{t}}_1,\ldots ,\textit{\textbf{t}}_M\}\cup \{\textit{\textbf{x}}_1,\ldots ,\textit{\textbf{x}}_L\}\).

Proof

Since we are considering a binary classification problem, the similarity between the ensemble \(H_a\) and the base classifier \(h_\xi \) on the set \(S\), with \(N = \text {Card}(S) = M+L\), satisfies the following identities, where, with a slight abuse of notation, \(H_a(\textit{\textbf{s}}_i)\) denotes the \(i\)th component \(y_i\) of the recalled vector while \(h_\xi (\textit{\textbf{s}}_i) = u_i^\xi \) holds by (8):

$$\begin{aligned} \mathtt {Sim}(H_a, h_\xi )&= 1 - \frac{1}{N}\sum _{i=1}^N \mathcal {I}[h_\xi (\textit{\textbf{s}}_i) \ne H_a(\textit{\textbf{s}}_i)] = 1 - \frac{1}{4N}\sum _{i=1}^N \big (h_\xi (\textit{\textbf{s}}_i)- H_a(\textit{\textbf{s}}_i)\big )^2 \\&= 1 - \frac{1}{2N}\sum _{i=1}^N \big (1-H_a(\textit{\textbf{s}}_i) h_\xi (\textit{\textbf{s}}_i)\big ) = \frac{1}{2}\left( 1 + \frac{1}{N} \sum _{i=1}^N H_a(\textit{\textbf{s}}_i) h_\xi (\textit{\textbf{s}}_i) \right) \end{aligned}$$

Equivalently, we have

$$\begin{aligned} \frac{1}{\hbox {Card}(S)} \sum _{\textit{\textbf{s}} \in S} H_a(\textit{\textbf{s}})h_\xi (\textit{\textbf{s}}) = 2 \cdot \mathtt {Sim}(H_a,h_\xi ) - 1. \end{aligned}$$
(15)

Now, from (8), (10), (12), and (15), we obtain the following identities:

$$\begin{aligned} w_\xi&= f\left( \frac{1}{N}\sum _{i=1}^N y_i u_i^\xi \right) = f\big (2\cdot \mathtt {Sim}(H_a,h_\xi )-1\big ), \end{aligned}$$

which concludes the proof.

Theorem 1 shows that the RCAM-based ensemble classifier is a weighted majority voting classifier whose weights depend on the similarity between the base classifiers and the ensemble itself. In fact, in view of the dynamic nature of the RCAM model, \(H_a\) is obtained by a recurrent consult and vote scheme. Moreover, at the first step, the weights depend on the accuracy of the base classifiers on the training set because the first M components of the initial state \(\textit{\textbf{z}}(0)\) are the training targets.
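The identity behind Theorem 1 can be checked numerically: for bipolar vectors, the normalized inner product \((1/N)\sum _{i} y_i u_i^\xi \) equals \(2\cdot \mathtt {Sim}-1\). The short sketch below performs this check with the exponential activation; the random vectors and \(\alpha =1\) are arbitrary choices for illustration.

```python
import numpy as np

def similarity(y_ensemble, y_base):
    """Empirical similarity of Eq. (13) between two bipolar label vectors
    evaluated on the same set of samples."""
    return np.mean(np.asarray(y_ensemble) == np.asarray(y_base))

# Sanity check of Theorem 1 with f(x) = exp(x): the weight of Eq. (12)
# coincides with f(2 * Sim - 1) from Eq. (14).
rng = np.random.default_rng(seed=0)
y = rng.choice([-1, 1], size=50)   # stands for the recalled vector y
u = rng.choice([-1, 1], size=50)   # stands for a fundamental memory u^xi
w_direct = np.exp(y @ u / 50)                  # Eq. (12) with f = exp
w_theorem = np.exp(2 * similarity(y, u) - 1)   # Eq. (14)
assert np.isclose(w_direct, w_theorem)
```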

4 Computational Experiments

In this section, we perform some computational experiments to evaluate the performance of the proposed RCAM-based ensemble classifiers for binary classification tasks. Precisely, we considered the RCAM-based ensembles obtained using either the identity or the exponential function as the activation function f. The parameter \(\alpha \) of the exponential activation function was either set to \(\alpha =1\) or determined using a grid search on the set \(\{10^{-2},10^{-1},0.5,1,5,10,20,50\}\) with 5-fold cross-validation on the training set. The RCAM-based ensemble classifiers have been compared with the AdaBoost, gradient boosting, and random forest ensemble classifiers, all available in Python's scikit-learn library (sklearn) [28].

First of all, we trained the AdaBoost and gradient boosting ensemble classifiers using the default parameters of sklearn. Recall that boosting ensemble classifiers are developed incrementally by adding base classifiers to reduce the number of misclassified samples [23]. Also, we trained the random forest classifier with 30 base classifiers (\(P=30\)) [3]. Recall that the base classifiers of the random forest are decision trees obtained using bagging and random subspace techniques [2, 14]. Then, we used the base classifiers from the trained random forest ensemble to define the RCAM-based ensembles. In other words, the same base classifiers \(h_1,\ldots ,h_{30}\) are used in the random forest and the RCAM-based classifiers; the difference between the ensemble classifiers resides in the combining rule. Recall that the random forest combines the base classifiers using majority voting. From the computational point of view, training the random forest and the RCAM-based ensemble classifiers required similar resources. Moreover, despite the consult and vote scheme of the RCAM-based ensembles, they were not significantly more expensive than the random forest classifier. The grid search used to fine-tune the parameter \(\alpha \) of the exponential RCAM-based ensemble was the main computational burden in this experiment.
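Since the same base classifiers are shared by the random forest and the RCAM-based ensembles, the sketch below shows one way to produce such a pool of decision trees. It is a simplified stand-in (manual bootstrap sampling plus random feature subsets at each split), fitted directly on bipolar targets so that the trees' predictions can be fed to the rcam_ensemble_predict sketch of Sect. 3.2 without label conversion; it is not the exact procedure used by scikit-learn's random forest.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_base_trees(X_train, d_train, P=30, seed=0):
    """Simplified stand-in for the P = 30 base classifiers: decision trees fit
    on bootstrap samples with random feature subsets considered at each split.
    The bipolar targets d_train in {-1, +1} are used directly."""
    rng = np.random.default_rng(seed)
    X_train, d_train = np.asarray(X_train), np.asarray(d_train)
    trees = []
    for p in range(P):
        idx = rng.integers(0, len(d_train), size=len(d_train))  # bootstrap
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=p)
        trees.append(tree.fit(X_train[idx], d_train[idx]))
    return trees

# The very same trees can then be combined by majority voting, as in a random
# forest, or by the RCAM-based rule, e.g., rcam_ensemble_predict(trees, ...).
```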

For the comparison of the ensemble classifiers, we considered 28 binary classification problems from the OpenML repository [36]. These binary classification problems can be obtained using the function fetch_openml from sklearn. We would like to point out that missing data has been handled before splitting the data set into training and test sets using the SimpleImputer class from sklearn. Also, we pre-processed the data using the StandardScaler transform. Therefore, each feature is normalized by subtracting the mean and dividing by the standard deviation, both computed using only the training set. Furthermore, since some data sets are unbalanced, we used the F-measure to evaluate quantitatively the performance of a certain classifier. Table 1 shows the mean and the standard deviation of the F-measure obtained from the ensemble classifiers using stratified 10-fold cross-validation. The largest F-measures for each data set are typeset in boldface. Note that the exponential RCAM-based ensemble classifier with grid search produced the largest F-measures in 11 of the 28 data sets. In particular, the exponential RCAM with grid search produced outstanding F-measures on the “Monks-2” and “EEG-Eye-State” data sets. For a better comparison of the ensemble classifiers, we followed Demšar’s recommendations to compare multiple classifier models using multiple data sets [6]. The Friedman test rejected the hypothesis that there is no difference between the ensemble classifiers. A visual interpretation of the outcome of this computational experiment is provided in Fig. 1 with the Hasse diagram of the non-parametric Wilcoxon signed-rank test at a 95% confidence level [4, 37]. In this diagram, an edge means that the classifier on the top statistically outperformed the classifier on the bottom. The outcome of this analysis confirms that the RCAM-based ensemble classifiers statistically outperformed the other ensemble methods: AdaBoost, gradient boosting, and random forest.
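The evaluation protocol can be reproduced along the following lines; the data set name, the f1_macro scoring variant, and the use of a plain random forest as a stand-in model are illustrative assumptions, and, for simplicity, the imputer and the scaler are fitted inside each fold rather than before splitting as described above.

```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One OpenML data set fetched as an example.
X, y = fetch_openml("banknote-authentication", version=1, as_frame=False,
                    return_X_y=True)

# Imputation, standardization, and a 30-tree random forest in a pipeline,
# evaluated with the F-measure and stratified 10-fold cross-validation.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      StandardScaler(),
                      RandomForestClassifier(n_estimators=30, random_state=0))
scores = cross_val_score(model, X, y, scoring="f1_macro",
                         cv=StratifiedKFold(n_splits=10))
print(scores.mean(), scores.std())
```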

Table 1. Mean and standard deviation of the F-measures produced by ensemble classifiers using stratified 10-fold cross-validation.

As to the computational effort, Fig. 2 shows the average time required by the ensemble classifiers for the prediction of a batch of testing samples. Note that the most expensive method is the identity RCAM-based ensemble classifier, while gradient boosting is the cheapest. The exponential RCAM-based ensemble is less expensive than AdaBoost and quite comparable to the random forest classifier.

Finally, note from Table 1 that some problems, such as the “Banknote” and the “MOFN-3-7-10” data sets, are quite easy, while others, such as the “Haberman” and “Hill Valley” data sets, are very hard. In order to account for the different difficulties imposed by each data set, Fig. 3 shows a box-plot with the normalized F-measure values provided in Table 1. Precisely, for each data set (i.e., each row in Table 1), we subtracted the mean and divided by the standard deviation of the score values. The box-plot in Fig. 3 confirms the good performance of the RCAM-based ensemble classifiers, including the exponential RCAM-based ensemble classifier with grid search. In conclusion, the box-plots shown in Figs. 2 and 3 support the potential application of the RCAM models as ensembles of classifiers for binary classification problems.
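The row-wise normalization used for Fig. 3 amounts to the following few lines; the random matrix is only a placeholder for the table of F-measures, and its shape is illustrative.

```python
import numpy as np

# Placeholder for the table of F-measures: one row per data set, one column
# per ensemble classifier (shape chosen only for illustration).
scores = np.random.default_rng(0).random((28, 6))

# Standardize each row by its own mean and standard deviation, as in Fig. 3.
normalized = (scores - scores.mean(axis=1, keepdims=True)) \
             / scores.std(axis=1, keepdims=True)
```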

Fig. 1. Hasse diagram of the Wilcoxon signed-rank test at a 95% confidence level.

Fig. 2. Box-plot of the average time for the prediction of a batch of input samples.

Fig. 3. Box-plot of the normalized F-measures produced by the ensemble classifiers.

5 Concluding Remarks

This paper provides a bridge between ensemble methods and associative memories. In general terms, an ensemble method reduces the variance and improves the accuracy and robustness by combining a group of base predictors [23, 38]. The rule used to combine the base predictors is one important issue in the design of an ensemble method. In this paper, we propose to combine the base predictors using an associative memory. An associative memory is a model designed for the storage and recall of a set of vectors [13]. Furthermore, an associative memory should be able to retrieve a stored item from a corrupted or partial version of it. In our ensemble method, the memory model is designed for the storage of the evaluations of the base classifiers. The associative memory is then fed with a vector containing the targets of the training data together with neutral entries in place of the unknown predictions. The output of the ensemble method is obtained from the vector retrieved by the memory.

Specifically, in this paper, we presented ensemble methods based on the recurrent correlation associative memories (RCAMs) for binary classification problems. RCAMs, proposed by Chiueh and Goodman [5], are high storage capacity associative memories which, besides allowing a Bayesian and a kernel trick interpretation, are particularly suited for VLSI implementation [8, 9, 11, 29]. Theorem 1 shows that the RCAM model yields a majority voting classifier whose weights are obtained by a recurrent consult and vote scheme. Moreover, the weights depend on the similarity between the base classifiers and the resulting ensemble. Computational experiments using decision trees as the base classifiers revealed an outstanding performance of the exponential RCAM-based ensemble classifier combined with a grid search strategy to fine-tune its parameter. The exponential RCAM-based ensemble, in particular, outperformed the traditional AdaBoost, gradient boosting, and random forest classifiers.

In the future, we plan to investigate further associative memory-based ensemble methods. In particular, we plan to extend these ensemble methods to multi-class classification problems using, for instance, multistate associative memory models [17, 20, 25, 26].