
1 Introduction

Extracting features from raw data and transforming them into formats that are appropriate for Machine Learning (ML) models is what is known as feature engineering [12]. This task is usually carried out by a data scientist with good knowledge of the domain and of the data sources of the task at hand [19, 21, 33]. In general, feature engineering entails the daunting manual labor of designing, selecting, and evaluating features, where even great intuition is needed [6, 18]. This is because the performance of most machine learning algorithms relies heavily on the quality of the training data. Such datasets usually consist of a large collection of different formats that need to be curated before they can be exploited by machine learning algorithms [6]. Therefore, through feature engineering, we can select and derive novel features from the raw data that better represent the problem.

However, most existing automated feature engineering proposals perform this task through the expansion-reduction method [17], which applies a predefined set of transformation functions to the raw features and then selects the transformed features based on the improvement of model performance or some evaluation metric [21]. Expansion-reduction leads to an exponential growth in the space of constructed features, which is known as the feature explosion problem [5]. In addition, extracting novel features without a proper and systematic method can unnecessarily increase the dimensionality of the data and hence degrade the learning process of the model [3]. Thus, the curse of dimensionality arises [20]: high-dimensional data tends to be more complicated to process than low-dimensional data [8].

Fig. 1. The framework of our method. MACFE extracts meta-features from dataset \(\boldsymbol{D}\) and a frequency table for each feature \(\boldsymbol{x} \in \boldsymbol{X}\). Then, an encoding \(\boldsymbol{e}\) is generated from the meta-features and the feature distribution. Next, we search for the most similar encoding in the Transformation Recommendation Matrix (\(\boldsymbol{TRM}\)) in order to recommend a useful transformation from it. The transformed dataset \(\hat{\boldsymbol{D}}\) is built from the constructed novel features and the original ones selected by the Directed Acyclic Graph (DAG) causal model.

It is crucial to realize that there are dozens of types of machine learning models, each with its own peculiarities and needs [19]. For instance, some models do not work well with highly correlated features or with high multicollinearity. Other models have trouble dealing with missing, noisy, or irrelevant features. Furthermore, since data and models are so diverse, it is difficult to generalize the practice of feature engineering across projects [33]. Thus, finding a process to treat data agnostically of any specific learning algorithm can help to choose transformations that better suit the learning process. One possibility for tackling this issue is to incorporate only the generated features that carry more appropriate knowledge about the data. To this end, we present MACFE, a novel meta-learning and causality-based approach to automated feature engineering for classification problems on tabular data. The main contributions of this paper are briefly described as follows:

  • We present a causality-based method for feature selection on the original dataset. For this, we use the mean magnitude effect of the features on the target to rank and select a subset of them.

  • We propose a novel meta-learning-based generation of unary, binary, and high-order features built on non-linear transformations. This approach addresses the feature explosion problem by searching only for feature transformations that were found useful in past experiences.

In order to evaluate the proposed method, we designed a series of experiments on fourteen popular public classification datasets of relatively small dimensionality to evaluate the feature generation and selection performance of MACFE. The results are obtained with eight machine learning models: Logistic Regression (LR), K-Nearest Neighbors (KNN), Linear Support Vector Machine (SVC-L), Polynomial Support Vector Machine (SVC-P), Random Forest (RF), AdaBoost (AB), Multi-layer Perceptron (MLP), and Decision Tree (DT). As illustrated in Fig. 1, our approach is divided into three phases. In the first one, feature selection is carried out by using a Structural Causal Model (SCM) [22] to choose the most promising features. In the second, meta-learning phase, meta-features are extracted from datasets and feature distributions to create encodings for each attribute, and we then look up feature transformations found useful on similar, previously engineered datasets. Finally, in the third phase, we evaluate the engineered features across eight machine learning models and report the mean accuracy of stratified 5-Fold Cross Validation in order to assess the quality of the feature engineering method. Experimental results show that our proposal surpasses the scores of state-of-the-art feature engineering methods, achieving a mean accuracy of 81.83% across the fourteen testing datasets and the eight machine learning models evaluated.

The rest of this paper is organized as follows. In Sect. 2, we review the state of the art in automated feature engineering. In Sect. 3, we formally define the problem. In Sect. 4, we introduce our proposed method, MACFE, whereas in Sect. 5 we present our evaluation results in detail. Finally, in Sect. 6 we give the conclusions drawn from this research work.

2 Related Work

In recent years, many automated feature engineering methods have been proposed, following different methodologies. For instance, Data Science Machine (DSM) [14] is an automated feature engineering approach for structured and relational data. DSM proposes a Deep Feature Synthesis (DFS) method, which searches for relations and transformations across features in databases. It includes a depth hyper-parameter d that sets the maximum composition and recursively enumerates all possible transformations. In addition, DSM generates a large novel feature space, which is reduced with feature selection based on Singular Value Decomposition (SVD). However, DSM is only suitable for relational data, and it can incur high computational cost because all transformation functions are applied to the entire original feature set.

The data-driven approach presented in FCTree [9] creates novel features from sequential transformations of the original space by employing decision trees and then selects the best features with the aid of information gain. The method in [25], known as the TFC framework, presents an iterative feature generation algorithm that applies feature transformations across all the features and then selects the best ones based on information gain. Nevertheless, the generated feature space grows combinatorially, leading to feature explosion. AutoFeat [13] and AutoLearn [16] are also data-driven methods. They generate large sets of feature transformations and select useful features by fitting regularized regression models for each pair of features. However, these methods require training a regression model, which can be time-consuming, and both suffer from the feature explosion problem. Label based Regression (LbR) [30] is another method for generating novel features, using Ridge Regression and Kernel Ridge Regression. It selects features based on the Distance Correlation Coefficient and the Maximum Information Coefficient (MIC) of each feature pair, which helps to identify features that are useful in combination with others.

2.1 Meta-learning for Feature Engineering

Recently, meta-learning has been proposed as a means of improving the quality of the generated features [21]. Meta-data can be simply defined as data about data [31]. In this work, meta-features are used to characterize and identify features, in the context of meta-learning [1, 4, 10, 28]. Some examples of meta-features are: a) general, such as the number of samples, features, or classes; b) statistical, such as standard deviation or correlation coefficients; c) information-theoretic, such as entropy, mutual information, or noise ratio; and d) model-based, which describe characteristics of models such as Decision Trees, Bayesian Networks, or SVMs.

ExploreKit [15] is an example of a method that uses meta-learning for ranking and selecting the most promising generated features. However, ExploreKit applies all possible transformations to the features and therefore suffers from the feature explosion problem. Learning Feature Engineering (LFE) is another approach that uses meta-learning to recommend useful features for classification problems. The transformation recommender in LFE builds a meta-feature vector from the feature values associated with each class label. However, LFE can recommend only unary and binary transformations, lacking high-order transformations.

2.2 Causality Feature Selection

Classical feature selection approaches leverage the correlations between features and class variables but fail to take advantage of the causal relationships between them. In contrast, knowing the causal relationships reveals the underlying mechanism of a dataset [32], and thus causal variables are expected to be persistent across different settings or environments.

Hence, basing the feature engineering on features that are causally related to the class of interest should ideally provide a richer and more robust set of engineered features. Consequently, if we work only with variables causally related to the target, then, independently of the type of relationship, it should be learnable by an ML model, which in turn facilitates, at some level, the efficacy of applying feature engineering to those variables.

3 Problem Definition

Let \(D = \{\boldsymbol{X}, \boldsymbol{Y}\}\) be a dataset of input-output pairs, where \(\boldsymbol{X}\) is a collection of n features \(\{\boldsymbol{x_1}, \boldsymbol{x_2}, ..., \boldsymbol{x_n} \}\) and \(\boldsymbol{Y} = \{y_1, ... , y_m \}\) is a set of m labels. We are also given a machine learning algorithm L (e.g., SVM, Logistic Regression, or Random Forest) and an evaluation metric E (e.g., accuracy, F1-score).

We refer to a transformation \(t \in \boldsymbol{T}\) as a function \(t(\boldsymbol{x})\) that takes a feature as an argument and maps it to a transformed feature \(\hat{\boldsymbol{x}}\in \boldsymbol{X'}\), where \(\boldsymbol{T}\) is our set of transformations \(\{t_1, t_2, ..., t_k\}\), which can be unary or binary depending on the number of arguments. A high-order transformation is a composition of unary and binary transformations. Over each feature it is possible to define a series of non-linear transformations \(t_i: \boldsymbol{x_i} \rightarrow \hat{\boldsymbol{x_i}}\) that extract as much intra- and inter-feature information as possible from the original data. The goal of feature engineering is thus to transform \(\boldsymbol{X}\) into \(\boldsymbol{X'}\) by applying \(\boldsymbol{T}\) such that \(\boldsymbol{X'}\) maximizes the evaluation metric E of a machine learning algorithm L. The search space of transformed features and their combinations grows exponentially, and the feature explosion problem arises. MACFE, our proposed feature engineering approach, was devised to help mitigate this problem by employing meta-features to guide the search for transformations on features.

3.1 Meta-learning and Meta-features

A formal definition of meta-features was proposed in [28], in which meta-features are a set of q values extracted from a dataset D by a function f.

$$\begin{aligned} f(D) = \sigma (\mu (D, h_\mu ), h_\sigma ), \end{aligned}$$
(1)

where \(f: D \mapsto \mathbb {R}^q\) is the extraction of q values from dataset D, \(\mu : D \mapsto \mathbb {R}^{q'}\) is a characterization measure, and \(\sigma : \mathbb {R}^{q'} \mapsto \mathbb {R}^q\) is a summarization function such as the mean, minimum, or maximum. Moreover, \(h_\mu \) and \(h_\sigma \) are hyperparameters for \(\mu \) and \(\sigma \), respectively. Thus, f is built by measuring some characteristic of D with \(\mu \) and summarizing it with \(\sigma \).

Here, meta-features describe features using meta-data. Examples are the mean or the median of a feature, as they provide extra information about the underlying data distribution. In particular, the core of this work is meta-learning applied to the identification of data through meta-features.
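To make Eq. (1) concrete, the following minimal sketch instantiates it with one hypothetical choice of \(\mu \) (absolute pairwise Pearson correlations between features) and \(\sigma \) (mean, minimum, and maximum); these particular measures are illustrative assumptions, not the exact ones used by our extractor.

```python
import numpy as np

def mu_correlation(X):
    """Characterization measure mu: absolute pairwise Pearson correlations."""
    corr = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices_from(corr, k=1)
    return np.abs(corr[iu])

def sigma_summary(values):
    """Summarization function sigma: mean, minimum and maximum."""
    return np.array([values.mean(), values.min(), values.max()])

def extract_meta_features(X):
    """f(D) = sigma(mu(D)), Eq. (1), with the hyperparameters omitted."""
    return sigma_summary(mu_correlation(X))

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 4))          # toy dataset with 4 numeric features
print(extract_meta_features(D))        # three statistical meta-features of D
```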

4 Proposed Approach

In the following sections, we describe the dataset preprocessing along with the construction of the meta-feature vector and encodings for features. Also, we present the training of our method including the Meta-learning and Causal Selection phases.

4.1 Datasets

Preprocessing. MACFE is guided by meta-feature learning based on past experience to create novel features. Our method is trained with M random datasets \(\boldsymbol{D_{train}} = \{\boldsymbol{D_1, D_2, ..., D_M}\}\) collected from OpenML [29], which have a structured format and a classification task associated with the data. First, each dataset is preprocessed and cleaned by removing non-numerical features and imputing missing values with the feature mean. Next, a meta-feature extractor is used to obtain meta-data about the datasets. Let \(\boldsymbol{mf}\) be a meta-feature vector composed of the main characteristics of a given dataset \(\boldsymbol{D_i} \in \boldsymbol{D_{train}}\). Thus, the meta-feature vector for a dataset \(\boldsymbol{D_i}\) is defined as:

$$\begin{aligned} \boldsymbol{mf} = [mf_1, mf_2, ..., mf_p], \end{aligned}$$
(2)

where each \(mf_i\) is a meta-feature value extracted from the data, and p is the size of the extracted meta-features.

However, describing datasets by mapping their main characteristics can be a challenging task. A wide set of estimators and metrics can be extracted from a dataset; e.g., the number of classes or instances can itself be a meta-feature value. For this, we follow the approach of [24] to perform the automatic meta-feature extraction process. The extracted meta-features are divided into the five categories proposed by Rivolli et al. [28]: simple or general, statistical, information-theoretic, model-based, and landmarking. In order to automate the process of extracting meta-features, we apply the Meta-feature Extractor (MFE) framework [1], which implements the standard meta-feature extraction described above, to each training dataset \(\boldsymbol{D_i} \in \boldsymbol{D_{train}}\).
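A minimal sketch of this extraction step with the pymfe package (the MFE framework cited as [1]) is shown below; the group names, default summarizations, and toy data are assumptions and may differ between library versions.

```python
import numpy as np
from pymfe.mfe import MFE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))           # numeric features after preprocessing
y = rng.integers(0, 2, size=150)        # class labels

# Extract the five meta-feature categories described above.
mfe = MFE(groups=["general", "statistical", "info-theory",
                  "model-based", "landmarking"])
mfe.fit(X, y)
names, values = mfe.extract()
mf = np.nan_to_num(np.array(values, dtype=float))   # meta-feature vector mf of Eq. (2)
print(len(mf), "meta-features extracted")
```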

Next, we treat each feature \(\boldsymbol{x} \in \boldsymbol{D_i}\) as follows:

  1. We create a frequency table with a fixed number of buckets or bins for each feature \(\boldsymbol{x}\).

  2. A range r is calculated on the feature values, given by the upper and lower bounds of the feature.

  3. We generate s disjoint bins of uniform width w. Each bin \(b_i\) is a bucket in which the values that fall within its range lie: the range of \(b_i\) starts at the lower bound of \(\boldsymbol{x}\) plus i times the width w, and ends at the lower bound of \(\boldsymbol{x}\) plus \(i+1\) times the width w.

  4. Finally, each frequency table or histogram is normalized to the range [0,1].

Thus, we obtain an encoding \(\boldsymbol{e} \in \mathbb {R}^{1\times \eta }\) for each feature \(\boldsymbol{x} \in \boldsymbol{D_i}\), composed of the meta-feature vector \(\boldsymbol{mf}\) of the dataset and the feature distribution, as follows:

$$\begin{aligned} \boldsymbol{e}= [mf_1, mf_2, \dots , mf_p, b_0, b_1, \dots , b_{s-1}] \end{aligned}$$
(3)
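The following short sketch shows how such an encoding can be built from a feature and a (hypothetical) meta-feature vector; normalizing the histogram by its largest bin count is an assumption, normalizing by the total count would serve equally well.

```python
import numpy as np

def feature_encoding(x, mf, s=10):
    """Encoding e = [mf, b_0, ..., b_{s-1}] of Eq. (3): an s-bin frequency
    table of feature x, normalized to [0, 1], concatenated with the
    dataset meta-feature vector mf."""
    counts, _ = np.histogram(x, bins=s, range=(x.min(), x.max()))
    hist = counts / counts.max() if counts.max() > 0 else counts.astype(float)
    return np.concatenate([mf, hist])

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)     # one toy feature
mf = np.array([500.0, 1.0, 0.37])            # hypothetical meta-feature vector
e = feature_encoding(x, mf, s=10)
print(e.shape)                               # (13,): p meta-features + s bins
```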

4.2 Model Training

Meta-learning Phase. The meta-learning phase is described as follows. The unary, binary, and scaling feature transformations \(t \in \boldsymbol{T}\) are applied to the original features \(\boldsymbol{X}\). Then, an evaluation is performed on both the original features and the generated features \(t(\boldsymbol{X})\). For this, we use the Maximal Information Coefficient (MIC) [27], which measures the strength of linear or non-linear relationships between two variables. MIC produces values between 0 and 1, where 0 means statistical independence and 1 stands for a noiseless statistical relationship between the variables. Thus, we obtain the set of selected transformations \(\boldsymbol{T_{sel}}\), one for each original feature \(\boldsymbol{x} \in \boldsymbol{X}\), with the maximum score as follows:

$$\begin{aligned} \boldsymbol{T_{sel}} = \mathop {\textrm{argmax}}\limits _{t \in \boldsymbol{T}}\, g_t\bigg (MIC(t(\boldsymbol{x})) - MIC(\boldsymbol{x})\bigg )\,. \end{aligned}$$
(4)
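As an illustration of Eq. (4), the hedged sketch below selects, for one feature, the unary transformation with the largest MIC gain using the minepy implementation of MIC; the candidate set and the use of the class labels as the second MIC argument are assumptions made for this example.

```python
import numpy as np
from minepy import MINE

def mic(a, b):
    """Maximal Information Coefficient between two variables."""
    m = MINE(alpha=0.6, c=15)      # default MINE parameters
    m.compute_score(a, b)
    return m.mic()

# A hypothetical set of unary candidate transformations.
unary_T = {
    "log":    lambda v: np.log(np.abs(v) + 1e-8),
    "sqrt":   lambda v: np.sqrt(np.abs(v)),
    "square": lambda v: v ** 2,
}

def select_transformation(x, y):
    """Keep the transformation with the largest MIC gain over the raw
    feature, mirroring Eq. (4); return None if no candidate improves."""
    base = mic(x, y)
    gains = {name: mic(t(x), y) - base for name, t in unary_T.items()}
    best = max(gains, key=gains.get)
    return best if gains[best] > 0 else None
```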

Finally, the selected transformations \(t \in \boldsymbol{T_{sel}}\) are stored in the Transformation Recommendation Matrix (\(\boldsymbol{TRM}\)) for each feature \(\boldsymbol{x} \in \boldsymbol{D_{train}}\), represented by its corresponding encoding \(\boldsymbol{e}\). The \(\boldsymbol{TRM}\) is represented as follows (Fig. 2).

Fig. 2. The \(\boldsymbol{TRM}\) matrix, where the \(i^{th}\) row is a feature \(\boldsymbol{x} \in \boldsymbol{D_{train}}\) and the \(j^{th}\) column is an encoding value of \(\boldsymbol{e}\) (Eq. 3). N is the total number of features in \(\boldsymbol{D_{train}}\), and \(\eta \) is the size of the encoding \(\boldsymbol{e}\), composed of the meta-feature vector \(\boldsymbol{mf}\) (Eq. 2) and the feature histogram. The last column holds the transformation \(t \in \boldsymbol{T}\) with the highest resulting MIC score for the given feature (Eq. 4).

Algorithm 1 presents the training procedure used to learn the most appropriate unary \(\boldsymbol{T_{un}}\) and binary \(\boldsymbol{T_{bin}}\) transformations. This process is done for each feature in a given dataset \(\boldsymbol{D}\). Similarly, high-order transformations are built by composing several unary or binary transformations one after the other (Algorithm 2).


The order of a transformation function is the number of times a feature is processed by a transformation. For example, an input feature \(\boldsymbol{x_1}\) is given as the argument of the log function, so \(f_1(\boldsymbol{x_1}) = log(\boldsymbol{x_1})\). Then, the resulting feature is combined with another feature \(\boldsymbol{x_2}\), let us say through a multiplication, so \(f_2(f_1(\boldsymbol{x_1}), \boldsymbol{x_2}) = mult(log(\boldsymbol{x_1}), \boldsymbol{x_2})\). Finally, the output feature is given to the unary function square. Thus, the final transformed feature \(\boldsymbol{\hat{x}}\) has an order of 3 and can be written as follows:

$$\begin{aligned} \boldsymbol{\hat{x}} = f_3(f_2(f_1(\boldsymbol{x_1}), \boldsymbol{x_2})) = square(mult(log(\boldsymbol{x_1}), \boldsymbol{x_2})) \end{aligned}$$
(5)
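Numerically, the order-3 composition of Eq. (5) is a one-liner; the toy features below are placeholders (with \(\boldsymbol{x_1}\) kept strictly positive so the logarithm is defined).

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0.1, 5.0, size=200)    # strictly positive, so log is defined
x2 = rng.normal(size=200)

# Order-3 feature of Eq. (5): square(mult(log(x1), x2)).
x_hat = np.square(np.log(x1) * x2)
print(x_hat[:3])
```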

Hence, we look for the underlying information in the data through the extraction of more complex features. This gives us the capability of creating novel features from raw features that apparently have no predictive power on their own but, combined through high-order functions, can be informative for some machine learning models.

Causal Feature Selection Phase. Once the \(\boldsymbol{TRM}\) is trained, MACFE is ready to recommend useful transformations for new datasets and features. We start by selecting the most promising original features through a causality-based feature selection. A DAG Classifier is trained to discover a causal graph from the data; for this, we use the implementation of CausalNex [2]. This graph captures the causal relationships between the features and the target variable. The mean identified causal magnitude effect of each feature on the target is used to rank the features, and a threshold hyperparameter s then determines the top k selected features. The resulting subset of selected features is processed to obtain an encoding \(\boldsymbol{e}\) (Eq. 3).
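A hedged sketch of this selection step is given below; the import path, the semantics of the coef_ attribute, and the ranking rule are assumptions about the CausalNex API and may differ across library versions.

```python
import numpy as np
from causalnex.structure import DAGClassifier   # import path may vary by version

def causal_select(X, y, k):
    """Rank features by the mean magnitude of their effect on the target
    (coef_ of the fitted DAG classifier, assumed here) and return the
    indices of the top-k features."""
    dag = DAGClassifier()
    dag.fit(X, y)
    effects = np.mean(np.abs(np.atleast_2d(dag.coef_)), axis=0)
    return np.argsort(effects)[::-1][:k]
```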

Then, for a given feature encoding \(\boldsymbol{e}\), we search for a transformation in \(\boldsymbol{TRM}\) by retrieving the most similar feature encoding, using cosine similarity as the measure, which scores 1.0 for identical feature vectors and 0.0 for orthogonal ones [26]. The transformation of the most similar feature is then applied to the feature. The process continues with the binary transformations, iterating over the features in the dataset (Algorithm 2). Furthermore, a depth hyper-parameter d sets the maximum transformation order across unary and binary functions.
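The lookup itself reduces to a nearest-neighbor search over the stored encodings; a minimal sketch with scikit-learn's cosine similarity follows, where the toy TRM contents are purely illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend_transformation(e, trm_encodings, trm_transformations):
    """Return the transformation stored in the TRM row whose encoding is
    most similar (cosine similarity) to the query encoding e."""
    sims = cosine_similarity(e.reshape(1, -1), trm_encodings)[0]
    return trm_transformations[int(np.argmax(sims))]

# Toy usage: a TRM with three stored encodings and their transformations.
trm_encodings = np.array([[0.2, 0.8, 0.1], [0.9, 0.1, 0.4], [0.3, 0.3, 0.9]])
trm_transformations = ["log", "sqrt", "square"]
print(recommend_transformation(np.array([0.85, 0.2, 0.35]),
                               trm_encodings, trm_transformations))  # "sqrt"
```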

Lastly, by scaling transformations we refer to those that change the scale of features to a standard range. Many machine learning algorithms struggle to find patterns in data when features are not on the same scale; having scaled features can, for example, help gradient descent converge faster towards a minimum. We scale features as follows. For a given feature \(\boldsymbol{x} \in \boldsymbol{X}\), one of the following scaling functions can be applied. Normalization, also called Min-Max Scaling, scales each feature value to the range [0,1]. Standardization scales each feature so that its mean is 0 and its standard deviation is 1. The Robust Scaler is useful when the input feature has many outliers: the median (\(50^{th}\) percentile) and the \(25^{th}\) and \(75^{th}\) percentiles are calculated, then the median is subtracted from each feature value and the result is divided by the Interquartile Range (IQR). In order to learn and recommend which scaler is appropriate for a given dataset, we run a series of tests on the data. First, we test the features to estimate the proportion of outliers; if this proportion is larger than a threshold \(\gamma \), a Robust Scaler is applied to the features. Second, if the data follows a normal distribution, we use a Standard Scaler; in particular, we use a Shapiro-Wilk test [11] to evaluate the normality of the data, and if the p-value is greater than 0.05 we consider the data normally distributed. Finally, if none of the above tests holds, we use a Min-Max Scaler on the features. The resulting scaling method is saved in the \(\boldsymbol{TRM}\) according to the dataset encoding.
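The decision rule can be sketched as follows with scipy and scikit-learn; the IQR-based outlier test and the value of \(\gamma \) are illustrative assumptions rather than the exact thresholds used in our implementation.

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

def choose_scaler(X, gamma=0.05):
    """Select a scaler following the tests described above."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    outliers = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
    if outliers.mean() > gamma:
        return RobustScaler()                       # many outliers
    p_values = [shapiro(X[:, j])[1] for j in range(X.shape[1])]
    if np.mean(p_values) > 0.05:
        return StandardScaler()                     # approximately normal data
    return MinMaxScaler()                           # default: scale to [0, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(type(choose_scaler(X)).__name__)
```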

5 Experimental Results

For the evaluation of MACFE, we first describe the evaluation details, such as the case study datasets and learning algorithms. Next, we briefly describe the implementation details of the classifiers and the evaluation methodology. Finally, we compare with previous work and present a discussion analyzing the characteristics of the datasets and algorithms for which MACFE is advantageous.

5.1 Evaluation Details

Table 1. Statistics of 14 case study datasets

The evaluation of MACFE as an automated feature engineering method is performed on a set of fourteen classification datasets and eight machine learning algorithms commonly cited in the literature [15, 16, 30]. These datasets are from different areas, such as medical, physical, life, and computer science. In addition, these datasets are publicly available in the UCI ML Repository [7] and OpenML Repository [29]. The main statistics of these datasets are shown in Table 1.

5.2 Implementation Details

For our experiments, we tested the following learning algorithms: Logistic Regression (LR), K-Nearest Neighbors (KNN), Linear Support Vector Machine (SVC-L), Polynomial Support Vector Machine (SVC-P), Random Forest (RF), AdaBoost (AB), Multi-layer Perceptron (MLP), and Decision Tree (DT). The scoring method for the evaluations is the mean accuracy of stratified 5-Fold Cross Validation on each dataset, the same scoring methodology used by the state-of-the-art methods. Each algorithm is used with scikit-learn [23] default parameters, because our objective is to enhance the accuracy of a model by improving the data through our automated feature engineering process, MACFE.
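A minimal sketch of this evaluation protocol is shown below; the wine dataset bundled with scikit-learn is used as a stand-in for a MACFE-transformed dataset, and max_iter is raised only to avoid a convergence warning in this toy run.

```python
import numpy as np
from sklearn.datasets import load_wine                      # stand-in dataset
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)      # in practice, the MACFE-transformed features
cv = StratifiedKFold(n_splits=5)       # stratified 5-fold cross validation

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(type(clf).__name__, round(float(np.mean(scores)), 4))
```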

Table 2. Mean accuracy results of 5-fold cross validation for the original datasets (ORIG), the consulted state-of-the-art methods (TFC [25], FCTree [9], ExploreKit [15], AutoLearn (AL) [16], LbR [30]), and MACFE (ours). The best performing approach is shown in bold; each dataset is shown with its corresponding ID from Table 1.

5.3 Comparison with Previous Works

The comparison of our proposal considers the same scenario conditions as the results presented in recent feature engineering proposals such as TFC [25], FCTree [9], ExploreKit [15], AutoLearn [16], and LbR [30]. Table 2 shows the scores achieved by our proposal compared against those obtained by the other state-of-the-art approaches. The best scores are shown in bold, and each dataset is represented by its ID defined in Table 1. The improvement across algorithms and datasets is notable: as shown in Fig. 3, we achieve an average accuracy of 81.83% across all tested datasets and classifiers, outperforming TFC, FCTree, ExploreKit, AutoLearn (AL), and LbR by 6.54%, 5.99%, 5.63%, 3.95%, and 2.71%, respectively.

Fig. 3. Mean accuracy of state-of-the-art methods and MACFE (ours) across fourteen case study datasets and eight machine learning models.

5.4 Discussion

The transformation recommendation procedure of this method is agnostic of the learning algorithm, although some transformations can be more appropriate for a certain algorithm than for others. As a result, MACFE achieves 100% efficacy in the sense of improving at least one model for each dataset. The depth hyperparameter d of MACFE controls the order of the complex features generated to improve model performance; a value of d that is too high can result in overly complex novel features from which the algorithm cannot learn. In contrast, a small value of the hyperparameter s can lead to a small subset of the original features, preventing good relationships between features from being found. Hence, a grid search is recommended to find the optimal hyperparameter values.

6 Conclusions and Future Work

In this paper, we presented a causality-based feature selection that reduces the search space for feature transformations, together with a meta-learning-based method for automated feature construction in which the transformations applied to a feature depend on the transformations found useful on similar features engineered in the past. In particular, this method is capable of constructing novel features from raw data that are informative and useful for a learning algorithm. MACFE can automatically create features by applying selected unary, binary, or high-order transformations to the data, instead of applying all possible combinations of them; hence, the feature explosion problem is minimized. However, MACFE has a fixed set of unary, binary, and scaling transformations. In future work, we intend to enlarge this set by adding more transformation functions, leading to the construction of more informative features from raw features. In addition, the causal selection of features could be improved, since it is currently applied in the same way to all datasets, whereas different datasets can be expected to satisfy different causal assumptions, which produces different levels of efficacy when selecting the features to be engineered. To improve this, better methods for general causal discovery are needed.