1 Introduction

Data scientists spend a lot of time simply preparing data for analysis. Even before exploratory analysis, data cleaning and feature engineering, an additional step of data wrangling is often required. This step consists of taking raw data and wrangling it into a format that can be used for data science tasks, such as visualisation and prediction. Data wrangling is typically carried out by writing a program that transforms part of the data into the desired format. Writing these programs is very time consuming for data scientists: according to a popular statistic, up to 80% of the time in the whole data science pipeline is spent on wrangling [2], which explains the interest in automated data wrangling [5, 10].

The existing work on automated data wrangling, however, assumes that a user knows which format the data should take and can provide input. This input can take many forms. Early methods interactively propose transformations based on data that the user selects and require the wrangling algorithm to learn how to extract the selection [7, 14]. Other methods allow the user to give an example of what the output should look like in order to learn a program that correctly produces this output [3, 5]. This is clearly a strong assumption: the big lesson of feature engineering is that users rarely know which features, and in which form, are useful for the target task.

Taking inspiration from both feature engineering and data wrangling, we introduce the problem of feature wrangling, which is concerned with wrangling at the feature level. More specifically, we automatically search for transformations that wrangle individual columns into high-quality features to be used in predictive models. For example, a date formatted as “01/01/2001” would be split into its constituent day, month and year parts, which should be marked as ordinal (day and month) or numerical (year) features. As the resulting features are to be used in supervised machine learning, the quality of the generated features can be assessed using the predictive performance of the resulting model. A major benefit of our approach is that it eliminates the need for user interaction.

Motivational Example. Consider the excerpt of basketball data in Fig. 1a. Regardless of the task, we can see that it is not well suited for further analysis. The height feature is not numerical and position is ambiguous—is “G-F” a position on its own or is this player comfortable in multiple positions?

Suppose a fourth column, salary, exists that we want to predict. We can then try different representations for position and height and use their performance in predicting salary to choose the most appropriate one. For example, position can be one-hot encoded, or split on “-” and then encoded with a dummy variable for every symbol.

Whereas previous approaches would require the user to provide an example of the desired feature, avatar generates and tries different alternatives to see which ones yield the best performance. In this example, it will also discover that splitting height by a “-” yields new columns that, after being made numerical in a second iteration, are good features. The full wrangling program and its result are respectively shown in Figs. 1b and 1c.

Fig. 1. Example of data wrangling for machine learning.

Contributions. In this paper, we make the following contributions.

  • We introduce the problem of automated feature wrangling, which is concerned with wrangling at the individual feature level and uses the performance of the predictive models to evaluate alternative feature sets.

  • We implement this idea in a prototype feature wrangling tool called avatar—the Automated VAlue Transformator And extractoR. Given only a dataset and a prediction task, it returns a new dataset with wrangled features that are more suitable for the given task.

  • We evaluate avatar on real datasets from the Kaggle data science platform.

2 Related Work

Feature engineering aims to improve the performance of predictive models by transforming and combining existing features into new features that are easier for the model to use. As it is a laborious process, automating it is an active area of research [8, 9]. The goal of feature wrangling is to use wrangling transformations in order to extract new features from previously unusable columns. Feature wrangling therefore lies at the intersection of data wrangling and feature engineering.

AutoML is concerned with automating the data science pipeline as a whole. The general structure of such a pipeline is shown in Fig. 2. These systems start from either the feature engineering or the model selection step and build a data science pipeline with the goal of optimising performance on a prediction task. Examples of such methods are TPOT [13], auto-sklearn [4] and OBOE [18]. Automated feature wrangling can be viewed as extending these approaches to include wrangling transformations in the feature engineering process.

Fig. 2. Overview of the data science pipeline.

Two types of wrangling approaches aim to prepare data for the data science pipeline. The first is concerned with extracting and restructuring data, as much information is still stuck in inconvenient formats such as spreadsheets, XML or JSON. Selecting a few examples of desired rows allows methods like FlashExtract [10] to learn a program that extracts all similar rows. Auto-Suggest recommends data preparation steps, such as pivot and join, for raw tables [17]. Foofah [6] and AutoPandas [1] learn full transformation programs if an output example can be given. A second type of wrangling is concerned with transforming and normalising individual columns based on examples provided by a user [3, 5]. Approaches of both types assume that a user knows how to represent their data and provide a shortcut to obtain this representation. avatar, on the other hand, automatically determines a suitable representation at the feature level.

3 Data Wrangling for Machine Learning

Data wrangling in general is concerned with preparing raw data for data science. In this paper, we focus on the specific task of transforming wrongly formatted columns into usable features by automatically constructing a transformation program—similar to how a human data scientist would perform the same task. We formally describe this problem and present a simple language capable of performing common data wrangling tasks in the following two sections.

3.1 Problem Statement

We are interested in learning a data transformation program \(P : \mathcal {X} \rightarrow \mathcal {X}'\) that transforms a dataset D with instances of the form \((\mathbf {x}, y) \in \mathcal {X} \times \mathcal {Y}\) into a new dataset \(D^\prime \). The goal is to obtain a dataset \(D^\prime \) with a better feature representation than the original dataset D for performing a given machine learning task. In this paper, we consider the task of supervised learning. The dataset D is used to learn a model \(m: \mathcal {X} \rightarrow \mathcal {Y}\) that predicts the value of y given its features \(\mathbf {x}\). This model m is learned using a learner \(\mathcal {F}\) on the dataset D, written as \(m = \mathcal {F}(D)\).

Assessing whether a dataset \(D^\prime \) contains better features than D is possible given a scoring function \(s: (\mathcal {F}, D) \rightarrow \mathbb {R}\) that estimates the performance of a model \(\mathcal {F}(D)\). The score of a model trained on a dataset then serves as a proxy for the quality of that dataset's features: better features will result in better predictions. We assume the learner and scoring function to be given, for example, a decision tree classifier and predictive accuracy.

The problem of feature wrangling for machine learning is then as follows. Given a dataset D and a transformation language \(\mathcal {L}\), find a program \(P^* = \mathop {\mathrm {arg\,max}}_{P \in \mathcal {L}} s(\mathcal {F}, P(D))\) that transforms D into a dataset \(D^* = P^*(D)\) on which an optimal model \(\mathcal {F}(D^*)\) can be learned.

Finding the optimal program is intractable, however, as an infinite number of programs can be generated. In practice, any program P such that \(s(\mathcal {F}, P(D)) > s(\mathcal {F}, D)\) already constitutes an improvement over using the raw data.

3.2 A Language for Feature Wrangling

Let us write \(D = [X_1, \ldots , X_m]\) when referring to columns of data. Let \(t(X, \mathbf {a})\) be a transformation that takes a column X and (optional) arguments \(\mathbf {a}\), and returns a new matrix of columns \(\mathbf {X}\). A transformation with fixed arguments is called a wrangling transformation and written as \(t(\mathbf {a})\). This wrangling transformation is valid for X if \(t(X, \mathbf {a}) \ne X\).

Example 1

The \(\textsf {split}(X, d)\) transformation takes a column X and a delimiter d. It returns a set of columns obtained by splitting each row of X at every occurrence of d. An example of a wrangling transformation is \(\textsf {split}(\text {``-''})\). It is valid for the height and position columns in Fig. 1a, but not for the name column.
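To make this concrete, the following is a minimal Python sketch of the split transformation, assuming a column is represented as a list of strings; the names and representation are illustrative and not avatar’s actual implementation.

    def split(column, d):
        """Split every row of `column` on delimiter d, padding shorter rows."""
        rows = [cell.split(d) for cell in column]
        width = max(len(r) for r in rows)
        rows = [r + [None] * (width - len(r)) for r in rows]  # equal-length rows
        return [list(col) for col in zip(*rows)]              # transpose into columns

    height = ["6-6", "6-5", "6-10"]
    print(split(height, "-"))  # [['6', '6', '6'], ['6', '5', '10']]

Applied to the name column, the same call would return a single column identical to the input, so split(“-”) is not a valid wrangling transformation there.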

A wrangling program is simply a sequence of wrangling transformations. Data scientists typically build such programs by iteratively picking a transformation t and arguments \(\mathbf {a}\) for a target column \(X_i\) such that \(t(X_i, \mathbf {a})\) yields new columns. Each of these new columns is given a unique identifier and added to the data.
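As a rough illustration, a wrangling program can be represented as a list of (target column, transformation, arguments) steps applied in order; the naming scheme for new columns below is hypothetical.

    def apply_program(columns, program):
        for target, t, args in program:
            for i, col in enumerate(t(columns[target], *args)):
                # Give each new column a unique identifier and add it to the data.
                columns[f"{target}_{t.__name__}_{i}"] = col
        return columns

    # Splitting height on "-" as in Fig. 1, using split from the sketch above:
    columns = {"height": ["6-6", "6-5", "6-10"]}
    columns = apply_program(columns, [("height", split, ("-",))])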

Example 2

An example of a dataset, a wrangling program and its result is shown in Fig. 1. Table 1 shows an overview of the transformations currently supported by avatar.

Table 1. Overview of transformations supported by avatar and their generators. The implicit column parameter is omitted. Argument d is a string, p is a regular expression and L is a list of strings.

3.3 Generating Arguments

To compose valid wrangling programs, we need to tractably identify the possible arguments of transformation functions. We do so by following a generator-based approach [1]. For each transformation t, we define a generator \(\mathcal {G}_t(X)\) that takes a column X as input and yields arguments \(\mathbf {a}\) such that \(t(\mathbf {a})\) is a valid wrangling transformation for X. In other words, the arguments of wrangling transformations are generated from data and are not predefined by a user. Generators in avatar are only allowed to yield a finite number of arguments.

Example 3

The generator for split yields strings of consecutive, non-alphanumeric characters from rows in the input column, one at a time. Given either position or height columns from Fig. 1a, only a single argument is generated: “-”.

Generators for all transformations that avatar supports are described in Table 1. A generator for a transformation without additional arguments returns true if the function can be applied to X.
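As an illustration, the following is a hedged Python sketch of the generator for split, assuming string-valued columns; a real implementation would also bound the number of yielded arguments.

    import re

    def gen_split(column):
        """Yield each distinct run of consecutive non-alphanumeric characters."""
        seen = set()
        for cell in column:
            for d in re.findall(r"[^a-zA-Z0-9]+", cell):
                if d not in seen:
                    seen.add(d)
                    yield d

    print(list(gen_split(["6-6", "6-5", "6-10"])))  # ['-']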

4 Machine Learning for Feature Wrangling

At the core of avatar is the use of machine learning models for evaluating progress during wrangling. Rather than searching for a single transformation at a time, multiple transformations are considered in parallel, so that feature interactions can be taken into account when evaluating progress. To do so, avatar explores the space of wrangling programs by exploiting the fact that, starting from a given dataset, the order in which transformations are applied does not matter. In each iteration, a large wrangling program is generated, which is then pruned by subsequent steps. A high-level overview of avatar is shown in Fig. 3 and the following sections describe each of these steps in detail.

Fig. 3. The avatar algorithm.

4.1 Prune

Pruning aims to remove columns that are not and will never become useful features. This step allows the generators for transformations to be significantly less complex, as different edge cases do not have to be explicitly considered. The following columns are removed from the dataset; a minimal code sketch follows the list.

  1. Columns that are constant.

  2. Columns in which more than \(p_{nan}\) percent of values are missing.

  3. Columns that are more than \(p_{id}\) percent identical to another column.
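A minimal pandas sketch of these three pruning rules, with the thresholds expressed as fractions; the function and parameter names are illustrative, not avatar’s actual implementation.

    import pandas as pd

    def prune(df, p_nan=0.5, p_id=0.95):
        keep = []
        for col in df.columns:
            x = df[col]
            if x.nunique(dropna=True) <= 1:  # rule 1: constant column
                continue
            if x.isna().mean() > p_nan:      # rule 2: too many missing values
                continue
            # rule 3: (almost) identical to a column that is already kept
            if any((x == df[k]).mean() >= p_id for k in keep):
                continue
            keep.append(col)
        return df[keep]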

4.2 Select

From all remaining columns, the selection step aims to find promising features—those that have at least some predictive power. Selection happens in two steps: preselection and feature ranking. This ranking of features is then used in the next step to evaluate the fitness of the current dataset.

Preselection. This step heuristically excludes the following columns, which are unlikely to yield useful features.

  1. Categorical columns that contain more than \(p_{u}\) percent unique values.

  2. Categorical columns \(X_j\) for which there exists a column \(X_i\) with \(i < j\) such that a bijective mapping exists between these columns for at least \(p_{bi}\) percent of rows.

Preselection serves to reduce the effort required by the feature ranking step that follows. As opposed to the pruning step, columns excluded by preselection are not removed from the dataset; instead, they remain available to subsequent wrangling steps, as they may still become useful features after more transformations.
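As an illustration of the second preselection rule, the following hedged sketch estimates the fraction of rows consistent with a bijective mapping between two pandas columns; the mode-based heuristic is one possible realisation, not necessarily avatar’s exact check.

    import pandas as pd

    def bijective_fraction(a, b):
        """Fraction of rows consistent with a one-to-one mapping between a and b."""
        df = pd.DataFrame({"a": a, "b": b})
        fwd = df.groupby("a")["b"].agg(lambda s: s.mode().iloc[0])
        bwd = df.groupby("b")["a"].agg(lambda s: s.mode().iloc[0])
        ok = (df["b"] == df["a"].map(fwd)) & (df["a"] == df["b"].map(bwd))
        return ok.mean()

A column \(X_j\) would then be excluded when this fraction reaches \(p_{bi}\) for some earlier column \(X_i\).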

Feature Ranking. The aim of this step is to quickly rank columns by potential relevance. To do so, avatar uses a wrapper approach—learning shallow models on subsets of features and aggregating the feature importances extracted from these models. In every iteration, n rows are randomly sampled from a random subset of columns. On this subset of data, we perform k-fold cross-validation with a shallow learner \(\mathcal {F}_r\). In each of the folds, feature importances are estimated from these learned models, averaged for each column and weighted by the cross-validation performance. Final importances for each column are obtained by averaging the weighted importances over multiple iterations.

To estimate model performance, we use accuracy in case of classification and \(\max (0, R^2)\) in case of regression. Feature importances within each model are estimated using SHAP values [11, 12]. They are a practical implementation of the game-theoretic concept of Shapley values, which quantify an individual player’s contribution towards the final outcome in a cooperative game [15]. Each feature takes the role of a player and a prediction is considered the outcome of a game.
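The ranking loop can be sketched as follows, assuming numpy arrays, a scikit-learn decision tree and the shap package; the sampling sizes and function names are illustrative, and a regression task would use a regressor with \(\max (0, R^2)\) as the weight instead.

    import numpy as np
    import shap
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    def rank_features(X, y, n_iter=100, n_rows=1000, k=5, seed=0):
        rng = np.random.default_rng(seed)
        importances = np.zeros(X.shape[1])
        counts = np.zeros(X.shape[1])
        for _ in range(n_iter):
            # Sample n rows from a random subset of columns.
            cols = rng.choice(X.shape[1], size=max(1, X.shape[1] // 2), replace=False)
            rows = rng.choice(X.shape[0], size=min(n_rows, X.shape[0]), replace=False)
            Xs, ys = X[np.ix_(rows, cols)], y[rows]
            for train, test in KFold(n_splits=k).split(Xs):
                model = DecisionTreeClassifier(max_depth=4).fit(Xs[train], ys[train])
                weight = model.score(Xs[test], ys[test])  # accuracy as the weight
                phi = shap.TreeExplainer(model).shap_values(Xs[test])
                if isinstance(phi, list):  # older shap: one array per class
                    phi = np.mean([np.abs(p) for p in phi], axis=0)
                else:
                    phi = np.abs(phi)
                # Mean absolute SHAP value per feature, weighted by accuracy.
                contrib = phi.reshape(len(test), len(cols), -1).mean(axis=(0, 2))
                importances[cols] += weight * contrib
                counts[cols] += 1
        # Average the weighted importances over all iterations and folds.
        return importances / np.maximum(counts, 1)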

4.3 Evaluate

Given the ranking of features, avatar now heuristically evaluates its progress on the current dataset. It looks for a k such that the top-k ranked features result in the best performance. Performance for a set of features is evaluated with cross-validation on all rows of the dataset, using a learner \(\mathcal {F}_e\). Accuracy is used for classification and RMSE for regression.
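A minimal sketch of this search, assuming a ranking of column indices ordered from most to least important; the evaluation learner and cross-validation setup below are illustrative stand-ins for \(\mathcal {F}_e\).

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def evaluate(X, y, ranking):
        best_k, best_score = 1, float("-inf")
        for k in range(1, len(ranking) + 1):
            model = DecisionTreeClassifier(max_depth=12)
            score = cross_val_score(model, X[:, ranking[:k]], y, cv=5).mean()
            if score > best_score:
                best_k, best_score = k, score
        return ranking[:best_k], best_score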

If performance decreases with respect to the previous iteration, avatar terminates and returns the set of features that achieved the highest performance. A user can easily request more features from avatar, which are returned in order of their rank. The wrangling program is also generated from these selected features by adding drop transformations for columns that are generated but not selected, as was shown in Fig. 1b.

4.4 Wrangle

In this step, avatar generates new candidate features by transforming the columns of the current dataset. We exhaust the generators for all transformations on all columns that were not wrangled before and apply the transformations to obtain new columns. These new columns are appended to the end of the current dataset, ensuring that when a new column duplicates an existing one, the more complex feature—obtained by applying more transformations—is pruned in favour of the simpler one.
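A hedged sketch of this step, reusing the split generator sketched earlier; the bookkeeping and naming below are illustrative, and each generator is assumed to yield a single argument at a time.

    def wrangle(columns, transformations, wrangled):
        """Exhaust all generators on columns that were not wrangled before."""
        new_columns = {}
        for name, X in columns.items():
            if name in wrangled:
                continue
            wrangled.add(name)
            for t, gen in transformations.items():
                for arg in gen(X):  # e.g. gen_split yields delimiters
                    for i, col in enumerate(t(X, arg)):
                        new_columns[f"{name}_{t.__name__}_{arg!r}_{i}"] = col
        columns.update(new_columns)  # append after existing, simpler columns
        return columns

    # e.g. wrangle(columns, {split: gen_split}, wrangled=set())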

5 Evaluation

We perform experiments to answer the following questions.

Q1: Is avatar able to find new and useful features?

Q2: How does avatar compare to human wranglers?

Data. We use datasets from Kaggle, a popular data science platform. Kaggle allows users to publish datasets and provides public notebooks that contain snippets of code executing the data science steps. We search for datasets that (1) contain scraped data, (2) consist of a single file, and (3) have a clear prediction target. We focus on evaluating avatar’s ability to wrangle interesting features and therefore only use datasets that have at least one column that requires wrangling. An overview of the datasets is shown in Table 2a.

Table 2. Data used for evaluating avatar.

Models and Metrics. As we are interested in avatar’s ability to wrangle new features, not in obtaining the best possible performance on a dataset, our primary concerns when choosing the model for estimating feature importance are its speed (as we need to train it frequently) and its ability to identify useful features. We assume that even low-capacity models are capable of identifying useful features, though their estimates might be less robust compared to complex models. For this reason, we use decision trees and limit their depth to 4 when evaluating feature importance and to 12 when training the final model. We report relative rather than absolute performance.

Experimental Setup. Experiments were performed on a laptop with a prototype implemented in Python. Code, data and results are available on GitHub. We ran avatar for four iterations, with 1600 feature-selection iterations on samples of 1000 rows. In some datasets with many columns containing long strings, the number of columns can quickly explode after a few iterations. If the pruning step took longer than two hours, avatar was stopped early.

Fig. 4. Relative performance of avatar after iterations of wrangling new features when compared to the original dataset (iteration 0). Feature importances for marked data points are shown in Fig. 5a.

5.1 Wrangling New Features

Evaluating the quality of features directly is impossible; instead, we evaluate their quality implicitly through the performance of a model trained on them. More precisely, we compare the performance of a model trained on the raw data with that of a model trained on data wrangled by avatar. The relative performance after each iteration is shown in Fig. 4. Note that avatar starts with pruning and selection, so the baseline result at iteration 0 is also obtained after greedily selecting the best-performing features, without any wrangling.

The results show that avatar consistently improves predictive performance by wrangling new features. The single exception is the NBA dataset, where the wrangled features are not relevant to the target. This is reinforced in the next experiment, where avatar performs on par with human wranglers. We observe a general trend in which performance drops after multiple wrangling iterations. The reason is noise in feature importance estimation: because avatar repeatedly estimates importances from uniformly sampled subsets of features, the estimates become less robust as the number of features grows. With more features, there is also a higher chance of spurious interactions between features, which negatively impacts the performance of the final model due to overfitting. avatar then terminates and returns the ranked features from the previous iteration.

As a small case study, we take a closer look at the best performing features for the Melbourne housing and iPhone 11 datasets in Fig. 5a. One column from each of their original datasets is shown in Fig. 5b. For the Melbourne housing dataset, the dummy-encoded feature for “Southern” after splitting Regionname on a space is found to be the most relevant one. The feature selected in the second iteration first splits this column by “-” and then extracts the word “Metropolitan”, “South” or “Victoria”. This results in “South-Eastern Metropolitan” and “Southern Metropolitan” being projected onto the same “South” feature, which a human might not think of. On the iPhone 11 dataset, many features are extracted from the Description. Highly relevant is the full model name “iPhone 11 256GB”, obtained by splitting on “-”. Human data scientists might expect that this feature needs to be split up further. It is, however, an ordinal feature on its own and provides a strong signal.

Fig. 5. A closer look at selected features for two datasets.

5.2 Comparison with Humans

In the second experiment, we compare the performance of a predictive model on a dataset wrangled by (1) human experts on Kaggle and (2) avatar. We obtain expert-wrangled datasets from the corresponding notebooks on Kaggle. As we are interested in the ability to wrangle features, any feature engineering steps are removed from the notebooks, but any feature selection is left untouched. A list of notebooks and the number of lines of wrangling code is given in Table 2b. We compare the relative performance of the same model trained on features wrangled by humans versus features wrangled by avatar and show the results in Fig. 6.

The results show that avatar performs on par with or better than human data wranglers. The only exception is the iPhone 11 dataset. avatar still identifies interesting features, but the main reason for the poor performance is noisy examples (listings for iPhone 11 covers rather than phones), which negatively impact the performance of avatar. For the NBA and Pet datasets, human performance is marginally better. Wrangled features are not representative of the target, which further explains their small performance differences in Fig. 4b. On datasets where avatar greatly improves with respect to the baseline, it also beats human wrangling.

Fig. 6. Comparing the relative performance of human wranglers to avatar. Downward slopes indicate that avatar is better, which happens for half of the datasets. For the NBA and Pet datasets, the previous experiment has already shown that little additional information is present in wrangled features.

6 Conclusion and Future Work

Data wrangling is a tedious process on which data scientists spend a great deal of valuable time. To reduce this burden, we present the avatar algorithm for automatically wrangling features from raw columns. We show that avatar is able to wrangle features that improve predictive performance when compared to the original dataset. On datasets that require heavy wrangling, it even outperforms some human wranglers.

Future Work. Two immediate directions for extension are expanding avatar to the multi-relational setting, allowing different tables to be joined, and exploring the unsupervised case, for example, by using multi-directional ensembles of decision trees [16]. Unsupervised data wrangling would allow avatar to aid exploratory data analysis, another significant time sink for data scientists. Quickly selecting relevant features from high-dimensional data with high multicollinearity plays an important role in avatar; our repository contains the intermediate, wrangled datasets to encourage further research on this topic. The main technical limitation of avatar is that the search space quickly explodes when many columns with long textual values are present. Being stricter in the generators and pruning rules can trade expressive power for speed.