1 Introduction

In a supervised learning setting, algorithms use a set of feature vectors xi to predict a set of targets yi, which can be discrete (classification) or continuous (regression). Typically, the xi are grouped into a design matrix X of dimensions (m, n), where m is the number of observations and n the number of features. The algorithm uses the learned features to predict new values \( \hat{y} \) for unseen data. In this context, features play an important role, since they can increase or decrease the algorithm's performance. Thus, learning an appropriate set of features becomes a crucial task in the majority of practical applications, and it is often necessary to apply transformations to the original features in order to improve them. Such techniques are usually grouped into feature engineering, feature extraction and feature selection. Although some debate exists about the formalism of feature engineering, there is no doubt about its effectiveness, as observed in many machine learning competitions. An example is described in [1], where a diverse set of features was generated in order to improve the final model score.

Feature engineering can be defined as an iterative process in which new features are generated through a set of transformations, which can be very simple, such as computing a mean, or can involve more complex calculations. The quality of the obtained features is usually related to the practitioner's prior knowledge of the problem domain, and the lack of such prior knowledge is an obstacle in the feature generation process. To address this problem, different techniques have been proposed to generate features automatically [2, 3]. In this context, algorithms are typically used to both generate and evaluate the new features.

Some of these techniques involve the application of unsupervised algorithms such as autoencoders, a type of unsupervised neural network introduced by Hinton [4] in 1986. Since then, many variants have been proposed, including sparse autoencoders [5], variational autoencoders [6] and denoising autoencoders [7]. These architectures introduce different variations to the original autoencoder model, but they are generally composed of two types of layers: an encoder and a decoder. The encoder learns a compressed representation of the data; typically it has k units such that k < n, where n is the number of features in the data. This ensures that the encoder learns a compact representation, similar to what can be achieved with PCA. Meanwhile, the decoder reconstructs the representation learned by the encoder back to its original form [8].

The ability of an autoencoder to generate features is related to factors such as the type of architecture used, the parameter tuning strategy, the preprocessing techniques and the data complexity. In this context, the encoder units learn a certain fixed set of features \( A \in H \), where \( H \) represents the total feature space. It is therefore reasonable to assume that there are other feature spaces F that could also be explored by varying the number of units in the encoder layer, which is the main topic of this study. Instead of using the usual reconstructions from the decoder layer, the encoder layer is used to learn and generate new features from the original data, using different subsets of encoder units. With these new representations, a set of learning algorithms belonging to the gradient boosting and ensemble families is trained. The results show that this approach makes it possible to explore other feature spaces F. Moreover, the new features have lower dimensionality than the original ones while obtaining similar or better AUC scores, even in the presence of highly imbalanced data sets.

2 Main Idea and Motivation

Autoencoders are applied in diverse application domains. For example, due to their innate capability to preserve local variability in lower dimensions, they can be used for feature embeddings [9]. In feature extraction tasks, they are capable of creating new features that can be used as training data for classifiers [10]. They are also very efficient when applied to extract features from high-dimensional data [11]. Another interesting application of autoencoders is data augmentation [12]. Finally, autoencoders can be used to automate the feature generation process [13]. In all of these applications, the encoder units are capable of generating a rich new set of representations.

However, these representations correspond to a fixed number of units in the encoder layer. Moreover, they are the result of a very careful and laborious tuning process over the autoencoder architecture design. If a different number of units were used in the encoder layer, the learned representations would change significantly; in fact, a totally new set of features would be obtained. This raises the question of which set of features is better [14]. From this observation, it is possible to hypothesize the existence of a more general feature space H, in which the encoder layer learns an optimal subset \( A \in H \).

Therefore, if variations are made to the encoder layer, it is possible to obtain a new set of features F. There are, however, diverse types of variations and criteria that can be implemented. For example, in [15], from a set of hidden layers, only the highest activations were kept. Multiple layers and other algorithms can also be used to learn features, as shown in [16]. Finally, it is also possible to introduce kernel techniques to control the learned representations [17]. In this study, however, variations are instead made to the number of encoder units, resulting in different k-encoder layers.

In order to explore F, a set of assumptions has to be made: (i) F must not be an infinite space, and (ii) there has to be a reasonable number of original features n from which to generate F. It is also important to pay attention to the autoencoder architecture used to explore F. For example, if a single encoder layer were used, it would be equivalent to a PCA reduction. Therefore, the autoencoder must be at least two layers deep to guarantee a set of more complex features. This idea is illustrated in Fig. 1.

Fig. 1. Feature space F generated by the k different encoder layers.

The autoencoder defined in Fig. 1 must also maintain sparsity in the learned representations. Therefore, a series of L1 and L2 regularizations, which can guarantee this property, are applied to the autoencoder layers.

Once the k-encoders have learned F, it is possible to transform the original data using their learned representations. Section 3 describes the implementation of this idea in detail.

3 Design Approach

3.1 Base Sparse Autoencoder

In order to implement the approach described in Sect. 2, a base sparse autoencoder [5], Ba, will be implemented and used to train the different k-encoder layers. Since the architecture defined in Fig. 1 is composed of a two-layer-deep sparse autoencoder, it is necessary to determine the number of units in the respective layers. Let p be the number of units in the first layer and k the number of units in the k-encoder layer. The number of hidden units p can then be defined as follows:

$$ p = \sum_{i = 1}^{n} \left[ c_{var\left( i \right)} > \varepsilon \right] $$
(1)

where the total number of hidden units p is determined by the number of principal components \( c_i \) obtained from the data set \( X_i \) that maintain a variance greater than \( \varepsilon \). In this context, \( \varepsilon \) becomes a parameter that must be tuned; here it is kept at \( \varepsilon = 0.9 \), meaning that the total number of units p is equivalent to the number of components that retain 90% of the variance. Equation (1) also acts as a filter, discarding features that could represent noise or are irrelevant.
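For illustration, a minimal scikit-learn sketch of one way to compute p is shown below, under the assumption that \( \varepsilon \) is interpreted as a cumulative explained-variance threshold of 0.9; the function name and defaults are illustrative rather than taken from the original implementation.

```python
import numpy as np
from sklearn.decomposition import PCA


def estimate_hidden_units(X, epsilon=0.9):
    """Estimate p as the number of principal components needed to keep
    a cumulative explained variance above the epsilon threshold."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Smallest number of components whose cumulative variance exceeds epsilon.
    return int(np.searchsorted(cumulative, epsilon) + 1)
```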

Once p is defined, it is possible to determine the number of units in each k-encoder as follows:

$$ k \in \left\{ 1, \ldots, p - 1 \right\} $$
(2)

Using (2), k is constrained to explore a representative finite subspace of F, while (1) removes unnecessary features from \( X_i \). To guarantee sparsity in Ba, a set of L1 and L2 regularizations is applied to the layers. The activation functions used are: ReLU for the first layer, tanh for the encoder layer, ReLU for the first layer of the decoder and sigmoid for the output layer. The weights are initialized using a Glorot normal distribution (3). Finally, to guarantee the reconstruction of the original \( X_i \), the KL divergence (4) is used as the loss function.

$$ stdev = \sqrt{\frac{2}{u_{i} + u_{o}}} $$
(3)
$$ KL\left( q \,||\, p \right) = \sum_{d} q\left( d \right) \log \left( \frac{q\left( d \right)}{p\left( d \right)} \right) $$
(4)

It is important to note that the number of k-encoders to be generated is restricted by three factors: (i) the total number of features n in \( X_i \), (ii) the number of units p obtained by applying (1) to \( X_i \) and (iii) the number of encoder units k generated from (2). This means that for each data set \( X_i \) a set of k-encoder layers is generated, varying only in the number of units k. Using the definition of Ba, the architecture is summarized in Fig. 2.

Fig. 2. Architecture representation of the Ba sparse autoencoder.

The scikit-learn [18] and Keras [19] libraries were used to implement Ba.
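The following is a minimal Keras sketch of one possible configuration of Ba, following the layer sizes from (1) and (2), the activation functions, the Glorot normal initialization (3) and the KL divergence loss (4) described above; the regularization strengths, optimizer and seed are illustrative assumptions, not the original settings.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers


def build_sparse_autoencoder(n_features, p, k, l1=1e-5, l2=1e-4):
    """Two-layer-deep sparse autoencoder Ba: n -> p -> k -> p -> n.
    L1/L2 activity regularization keeps the learned representations sparse.
    Inputs are assumed to be scaled to [0, 1] for the sigmoid output."""
    reg = regularizers.l1_l2(l1=l1, l2=l2)
    init = keras.initializers.GlorotNormal(seed=42)  # Glorot normal initialization (3)

    inputs = keras.Input(shape=(n_features,))
    h = layers.Dense(p, activation="relu", kernel_initializer=init,
                     activity_regularizer=reg)(inputs)
    code = layers.Dense(k, activation="tanh", kernel_initializer=init,
                        activity_regularizer=reg, name="k_encoder")(h)
    h = layers.Dense(p, activation="relu", kernel_initializer=init)(code)
    outputs = layers.Dense(n_features, activation="sigmoid",
                           kernel_initializer=init)(h)

    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, code)
    # KL divergence (4) as the reconstruction loss.
    autoencoder.compile(optimizer="adam", loss=keras.losses.KLDivergence())
    return autoencoder, encoder
```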

3.2 Data Sets

In order to test the proposed approach, the PMLB library [20] was used. This library provides a well-known, easily accessible benchmark repository of data sets that can be used for regression and classification tasks. In this experiment, the data sets were restricted to binary classification tasks only, including both balanced and imbalanced data sets.

From the repository, a total of 13 data sets were selected. Each data set was partitioned into training (80%), development (10%) and test (10%) sets. This partition was used when training both the sparse autoencoders Ba and the learning algorithms. To guarantee the reproducibility of the results, a constant random seed was maintained in each split throughout the experiment. Table 1 shows the data sets alongside their respective properties, and a loading sketch follows the table.

Table 1. Data sets description.
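A hedged sketch of how a PMLB data set could be loaded and split 80/10/10 with a constant seed is shown below; the data set name and seed value are examples, not the exact experimental configuration.

```python
from pmlb import fetch_data
from sklearn.model_selection import train_test_split

SEED = 42  # illustrative constant random seed

# 'spect' is one of the binary classification data sets used in this study.
X, y = fetch_data("spect", return_X_y=True)

# 80% training, then split the remaining 20% evenly into development and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=SEED, stratify=y_rest)
```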

3.3 Trained Algorithms

The algorithms trained with the k-encoder representations belong to the gradient boosting and ensemble families: LightGBM [21] and CatBoost [22], which apply gradient boosting techniques to grow trees, and AdaBoost [23] and Random Forest [24], which combine a set of weak learners in order to boost their performance. To ensure that the algorithms' parameters do not influence the results, no parameter tuning was performed; their default configuration with a constant random seed was used when training on both the k-encoder features and the original ones.
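A possible instantiation of these four algorithms with default parameters and a fixed seed is sketched below; the seed value is illustrative, and only keyword arguments from the public APIs of the respective libraries are used.

```python
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

SEED = 42  # illustrative constant random seed

# Default configurations; only the random seed is fixed.
models = {
    "LightGBM": LGBMClassifier(random_state=SEED),
    "CatBoost": CatBoostClassifier(random_seed=SEED, verbose=0),
    "AdaBoost": AdaBoostClassifier(random_state=SEED),
    "RandomForest": RandomForestClassifier(random_state=SEED),
}
```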

3.4 Training Methodology

Using the definitions in (1) and (2), the appropriate number of units p and k-encoder layers was determined for each data set \( X_i \). Each k-encoder layer was then trained using the sparse architecture. Once training was complete, the different k-encoders were used to transform the original features into their new feature representation F for each \( X_i \). This process was carried out a total of ten times per data set \( X_i \). Both the ensemble and gradient boosting algorithms were then trained on the generated features and on the original ones, and the AUC score on the transformed test features was averaged per data set \( X_i \). The hardware used was an Intel Core i5-6200U CPU (2.30 GHz) laptop with 8 GB RAM running Ubuntu 16.04 LTS 64-bit.
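The training loop can be summarized by the following condensed sketch, which reuses the illustrative helpers introduced earlier (estimate_hidden_units, build_sparse_autoencoder and the models dictionary); the number of epochs and batch size are assumptions, and the features are assumed to be scaled to [0, 1] for the KL divergence loss.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

p = estimate_hidden_units(X_train)
scores = {name: {} for name in models}

for k in range(1, p):                              # k-encoder layers from (2)
    run_scores = {name: [] for name in models}
    for run in range(10):                          # ten repetitions per data set
        autoencoder, encoder = build_sparse_autoencoder(X_train.shape[1], p, k)
        autoencoder.fit(X_train, X_train, epochs=50, batch_size=32,
                        validation_data=(X_dev, X_dev), verbose=0)
        # Transform the original features into the k-encoder representation F.
        Z_train = encoder.predict(X_train, verbose=0)
        Z_test = encoder.predict(X_test, verbose=0)
        for name, model in models.items():
            model.fit(Z_train, y_train)
            y_prob = model.predict_proba(Z_test)[:, 1]
            run_scores[name].append(roc_auc_score(y_test, y_prob))
    for name in models:
        scores[name][k] = float(np.mean(run_scores[name]))
```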

4 Results

The experiment was performed ten times on each data set. Table 2 below shows the average AUC test scores obtained by each algorithm, alongside the number of units p and the best encoder layer (bk).

The results in Table 2 indicate that each algorithm obtains its maximum AUC score from a different k-encoder. LightGBM and CatBoost do not show much improvement from the k-encoder representation; in most cases their AUC scores are equal or only slightly better. However, it must be noted that these results are achieved using a smaller number of features than the original ones, which indicates that the obtained features are relevant. On the other hand, AdaBoost and Random Forest receive a more significant improvement in their AUC scores, especially AdaBoost, which achieves an average AUC score greater than that of a standard AdaBoost model. There are also two remarkable cases, the spect and flare data sets, where all the algorithms benefit greatly from the encoder representations.

Table 2. ROC AUC scores from the transformed test data using the best encoder layers (bk).

When analyzing the impact of the different k-encoders on the AUC performance, two cases must be noted. In the first, there is a point where the maximum AUC score is achieved; when more k-encoder units are added, the performance gradually decreases or oscillates. This pattern is most evident in the spect and flare data sets, as observed in Fig. 3. In the second case, the opposite happens: for other data sets, adding more k-encoder units improves the AUC performance, as seen in Fig. 4. Taking this into consideration, it is necessary to train all the k-encoders in order to find a suitable set of features for a particular learning algorithm.

Fig. 3. Trained k-encoders on the spect (left) and flare (right) data sets.

Fig. 4. Trained k-encoders on the mushroom (left) and ring (right) data sets.

Another factor to consider is the time taken to train the k-encoders, illustrated in Tables 3 and 4 for the balanced and imbalanced data sets, respectively. In both cases, the k-encoders also reflect the lower dimensionality of the obtained features. The results in both tables show that the data sets that take the longest to train the k-encoders are usually the ones with more features; the number of training instances also has an impact on the training time.

Table 3. Training time in balanced data sets.
Table 4. Training time in imbalanced data sets.

5 Conclusions

The proposed approach allows us to observe how the different k-encoder features F impact the performance of the gradient boosting and ensemble algorithms. It greatly improved the performance on two highly imbalanced data sets, flare and spect, and also allowed us to obtain better results using fewer dimensions than the original features.

It was also observed that in some cases a reduced number of k-encoders was enough to generate a good set of features F, while in other cases adding more k-encoders further improved the performance. Both behaviors are strongly related to the particular characteristics of the data sets. Nevertheless, this approach should be able to generate useful features for both balanced and imbalanced data sets.

When analyzing the average performance obtained by the algorithms, AdaBoost benefited the most from the proposed approach: a standard AdaBoost model obtains only a 79% average AUC score, in contrast with 82% using the proposed approach.

It is also important to note that each algorithm selected a different k-encoder, which means that each k-encoder feature representation has a different relevance for each algorithm. Therefore, to put this approach into practice, all the k-encoders must be trained in order to find the best representation for a particular algorithm.

During the experiments, a limitation of the proposed approach to k-encoder generation was found: if the number of units p for a particular data set is equal to 1, then (2) yields an empty set and the k-encoders cannot be generated. Finally, it was observed that adding regularization techniques such as L1 and L2 to the layers significantly improved the results.