1 Introduction

For machine learning on data tables, a data scientist may encounter columns with many different discrete entries or entities, for instance cities in a housing price prediction setting (Fig. 1a). These city names can be encoded as a categorical variable, but generalizing to housing in a new city is then impossible. A good solution for such columns is often to use external sources to bring in information: the GPS coordinates of the cities, the population, the average income (Fig. 1b)... From a data-science perspective, this requires feature engineering on relational data: merging and aggregating information across data sources to create an enriched table with extra features (Fig. 1c). In practice, however, such feature engineering is difficult and time-consuming for the human analyst, because it requires a good understanding of both the different data sources and the application domain. For instance, the number of wealthy people living in a city may be important, but estimating it may require crossing information across many tables to build a single, somewhat abstract indicator. In fact, it is often recognized that data preparation is one of the biggest bottlenecks of data science (Kaggle Industry Survey, 2018; Lam et al., 2021).

Fig. 1

The classical pipeline of feature enrichment. A base table (a) contains a target to predict and several features, including a categorical feature with discrete entities (here cities). To boost prediction performance, external data (b) about the entities of interest is incorporated into the base table –usually via tedious feature engineering– to obtain the enriched table (c). The external data (b) can come under various formats, e.g. tables or multi-relational graphs

A specificity of learning across a complex relational structure is that different entries come with very different information. For instance, when collecting information on local wealth in Wikipedia—querying DBPedia (Lehmann et al., 2015) or YAGO (Mahdisoltani et al., 2013)—a data scientist will find for San Francisco the GDP as well as many known individuals and companies. But for the neighboring locality Muir Beach, none of this is available. The data scientist may then need to dig up information at the county level, which has a different set of attributes. The root of the challenge is that the original relational information is fundamentally irregular and cannot be represented to a learning algorithm as a fixed set of “features”.

Our goal here is to make it very easy for the data scientist to enrich a feature with information from external data sources. Inspired by word embeddings (Mikolov et al., 2013), which brought a breakthrough to text processing by their ease of use, we strive to associate entities to general-purpose feature vectors that can be used in multiple downstream tasks. This requires a feature extraction method that captures entity attributes well and is scalable enough to be used on large databases. For instance, a general-purpose knowledge base such as YAGO3 (Mahdisoltani et al., 2013) is a particularly useful source of data, with information on 75,000 cities; but it is huge: millions of entities and hundreds of attributes. Existing automatic feature engineering methods, such as Deep Feature Synthesis (DFS) (Kanter & Veeramachaneni, 2015), are combinatorial: they greedily join and aggregate entity attributes across tables to create feature vectors. Their combinatorial nature leads to tractability challenges: running DFS on YAGO3 produces very high-dimensional vectors (\(d \sim\) 10,000–140,000) which entail large storage costs and computational hurdles in downstream machine-learning tasks.

Instead, we propose to use embedding models that learn a static vector representation for each entity. Indeed, they provide compact representations that can encode knowledge about various entities into a fixed, low-dimensional space (e.g. \(d = 200\)). We learn these vectors from the external data, and add them to the base table as new features to enhance prediction performance. A pioneering work in this direction is RDF2vec (Ristoski & Paulheim, 2016a) and its variants, which have been used to learn entity embeddings from multi-relational graphs for various downstream tasks (Egami et al., 2021; Saeed & Prasanna, 2018; Ristoski et al., 2019; Sousa et al., 2020). These works directly build on word-embedding tools developed for natural language—namely word2vec (Mikolov et al., 2013). As such, they leverage contextual information: as San Francisco and California are connected in the graph, they are related. However, they do not account for the nature of these relations, which requires modeling the relational information: Wikipedia specifies that San Francisco is in California, but Sacramento is the capital of California. We will see that capturing this information well is important to generate feature vectors for downstream analytic applications. Another, more general, drawback of embedding methods is that they are designed for discrete entities, and are less suited to capture numerical attributes. Yet these attributes are often useful for the end task: densely populated cities tend to exhibit high housing prices, for instance.

We propose here an approach that addresses these two limitations and provides high-performance embeddings. To capture relational information, we rely on knowledge graph embedding models (Wang et al., 2017), widely used for graph completion but not studied for feature extraction purposes. In such models, embeddings are directly optimized to capture relationships between entities. We then introduce KEN (Knowledge Embedding with Numbers), a module that extends knowledge graph embedding models to numerical attributes. Finally, we conduct a thorough empirical evaluation of our approach, using entity embeddings to boost machine-learning performance in multiple tasks, and show that:

  • Feature vectors obtained via knowledge graph embedding models perform much better than RDF2vec embeddings.

  • Embeddings learned with KEN do capture numerical information, which greatly improves prediction performance in downstream tasks.

  • A good embedding model coupled with KEN outperforms manually handcrafted features, while requiring much less human effort. It is also competitive with Deep Feature Synthesis, but is more scalable in terms of computation time, memory and size of the created features.

  • Although designed for multi-relational graphs, simple heuristics allow our approach to be applied to tabular data, with good performance.

The rest of the paper is organized as follows: Sect. 2 reviews related work in depth, Sect. 3 details our contributed approach, and Sect. 4 gives a thorough empirical study of approaches to create features from relational data.

2 Related work: extracting features from relational data

We focus here on two common data structures for data-science: tabular data, as in relational databases, and multi-relational graphs (a.k.a. knowledge graphs), the backbone of Linked Open Data (Bauer & Kaltenböck, 2011). We broadly refer to both as relational data. In this section we give an overview of various lines of work related to creating vectors from relational data, drawing from a variety of scientific communities.

2.1 The classic view: feature engineering

Manual feature engineering Feature engineering across multiple tables traditionally relies on a human analyst crafting SQL queries or dataframe operations, such as joins or aggregations, to build a single feature matrix. The problem is the same with Linked Open Data (Paulheim et al., 2013; Ristoski & Paulheim, 2016b): statistical studies require features extracted from the data, here coming as knowledge graphs rather than multiple tables. Propositionalization approaches used to mine knowledge graphs (Kramer et al., 2001) tackle this by creating for each entity (node) of the graph a set of features, statistical fingerprints and aggregates of its neighbourhood (Paulheim & Fürnkranz, 2012; Ristoski & Paulheim, 2014). Here again, manual crafting is needed to capture specific information such as wealth.

Whether it is done on tables or knowledge graphs, feature engineering is a time-consuming task: studies show that data scientists spend 60% or more of their time transforming the data for analysis (CrowdFlower, 2016). Indeed, designing the right features often requires careful effort from the analyst: which information is relevant for the task at hand? How to query it? This is particularly difficult on large data sources. For instance, a knowledge graph representation of Wikipedia leads to hundreds of entity classes described by thousands of attributes in DBPedia (Lehmann et al., 2015). Exploring which joins are best for a given analysis is difficult even for an expert: how to assemble indirect signals that capture information on the question at hand, for instance estimating the distribution of wealth in a locality?

Automated feature engineering A few approaches have been proposed to automate the construction of queries for feature engineering on relational databases. A fundamental challenge is that assembling such multi-table data transformations calls for discrete choices—e.g. to join, or not to join?—with combinatorial possibilities that explode on large databases.

Fig. 2

An example of deep feature synthesis. Starting from a reference table with entities of interest (here cities), new features are created by chaining joins to related tables, up to a certain depth (here 2). To aggregate values from one-to-many relations (e.g. city inhabitants), we use the MEAN and COUNT operators, respectively for numerical and categorical features. Colored arrows indicate join paths across tables for each depth

For instance, Deep Feature Synthesis (DFS) (Kanter & Veeramachaneni, 2015) is a greedy approach that denormalizes a database by chaining joins from one reference table to all related tables, and aggregates one-to-many relations using combinations of a small base of functions (see Fig. 2). Typical aggregation functions include COUNT and MODE (most common) for categorical features, and MEAN, MIN, MAX, STD for numerical features. A crucial parameter of DFS is the depth, which limits how many times joins can be chained to create new features. Higher depths capture a wider range of information and usually improve performance, but quickly result in very large feature vectors and computation times, as the number of possible join paths grows exponentially. This often calls for post-processing techniques to remove unpredictive or redundant features.
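
To make these mechanics concrete, the toy sketch below reproduces a single depth-1 step of Fig. 2 in pandas: joining an inhabitants table to a city table and aggregating the one-to-many relation with MEAN and COUNT. Tables, column names and values are illustrative, not taken from the datasets studied later, and this is not the featuretools implementation used in Sect. 4.

```python
# Toy illustration of one depth-1 DFS step (join + aggregate); names are made up.
import pandas as pd

cities = pd.DataFrame({"city": ["A", "B"], "state": ["CA", "NV"]})
inhabitants = pd.DataFrame({"city": ["A", "A", "B"], "income": [55_000, 72_000, 48_000]})

# Aggregate the one-to-many relation city -> inhabitants with MEAN and COUNT
agg = inhabitants.groupby("city").agg(
    MEAN_income=("income", "mean"),
    COUNT_inhabitants=("income", "count"),
).reset_index()

# Join the aggregates back onto the reference table: two new feature columns per city
enriched = cities.merge(agg, on="city", how="left")
print(enriched)
```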

Subsequent works have improved over DFS by adding aggregation functions for other types of data (text, sequences) (Lam et al., 2017), for instance via recurrent neural networks (Lam et al., 2019). Although powerful feature extractors, all these methods remain combinatorial in nature, and do not scale to large databases. Even with a limited depth, a large number of entities of different types leads to increasingly wide feature matrices with many missing values, as the different entities come with different sets of attributes. Finally, automated feature engineering methods present other drawbacks: the created features often contain categorical or missing values that must be encoded, and their interpretability (we can trace back the joins and aggregations needed to compute each feature) is challenged as their dimension quickly grows.

2.2 Entity embeddings in relational data

While entity embeddings come from a body of literature far from that of feature engineering, they also create feature vectors from relational data (Lavrač et al., 2020).

Prelude: word embeddings Many embedding methods for relational data take inspiration from word embeddings. By injecting discrete entities (words) in vector spaces, word embeddings have boosted statistical analyses of text. They rely on the distributional semantics idea, which can be summarized by Firth’s sentence: “a word is characterized by the company it keeps”. The central model is Skip-Gram with Negative Sampling (SGNS), used in word2vec (Mikolov et al., 2013). Each word w is associated with an embedding \({\varvec{w}} \in \mathbb {R}^p\)Footnote 1. SGNS learns these embeddings by optimizing similarities of pairs of words, using a scoring function:

$$\begin{aligned} \text {Scoring function}&f(w, w') = {\varvec{w}} {\varvec{\cdot }} {\varvec{w'}}&\end{aligned}$$
(1)

Given a text corpus, embeddings are optimized so that a word w is more similar to a word \(w'\) observed in the same context—e.g. the same sentence—, than another word \(w^\dagger\) not in the context; minimizing a cross-entropy lossFootnote 2:

$$\begin{aligned} \text {SGNS}&L = - \sum _{\begin{array}{c} w,\;w' \in \text {context}(w), \\ \; w^\dagger \not \in \text {context}(w) \end{array}} \left[ \log (\sigma (f(w, w'))) + \log (1 - \sigma (f(w, w^\dagger ))) \right] \end{aligned}$$
(2)

After training, word embeddings capture contextual similarities: words with similar contexts (neighbors) end up close in the embedding space.
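
As a concrete illustration, the sketch below evaluates the SGNS objective of Eqs. (1)–(2) for a single \((w, w', w^\dagger)\) triple with random vectors; a real word2vec implementation optimizes this over an entire corpus with negative sampling.

```python
# Toy evaluation of the SGNS loss for one positive and one negative word pair.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = 200
rng = np.random.default_rng(0)
w, w_ctx, w_neg = rng.normal(size=(3, p))  # word, in-context word, out-of-context word

f_pos = w @ w_ctx   # f(w, w')  = w . w'
f_neg = w @ w_neg   # f(w, w†)
loss = -(np.log(sigmoid(f_pos)) + np.log(1.0 - sigmoid(f_neg)))
```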

2.2.1 Embedding entities in a table

Word embedding methods, such as SGNS, can be extended to other data structures by defining a corresponding notion of context (Grohe, 2020). In tables, a common choice is to view rows as sentences: two entities are in one another's context if they appear in the same row. This was for instance applied to enable semantic queries over tables (Bordawekar & Shmueli, 2017) and for automatic table completion and retrieval (Zhang et al., 2019). More recent work integrates intra-row and intra-column information to learn richer representations. Cappuzzo et al. (2020) link entries of a table to the row and column nodes they belong to. Random walks through the resulting graph generate “sentences” of tokens, which are then fed to an SGNS model.
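
A minimal sketch of the rows-as-sentences idea is given below, assuming the gensim 4.x API (sg=1 with negative sampling selects SGNS); the toy table and its entries are illustrative only.

```python
# Treat each table row as a "sentence" of tokens and train an SGNS model on it.
import pandas as pd
from gensim.models import Word2Vec

table = pd.DataFrame({
    "city": ["San Francisco", "Sacramento"],
    "state": ["California", "California"],
    "county": ["San Francisco County", "Sacramento County"],
})
sentences = table.astype(str).values.tolist()  # one token sequence per row

model = Word2Vec(sentences, vector_size=50, window=5, sg=1, negative=5, min_count=1)
vec = model.wv["San Francisco"]  # entity embedding, usable as features downstream
```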

2.2.2 Embedding entities in knowledge graphs

Knowledge graphs use a more general representation of relational data than tables. They replace the notion of columns by that of relations, which enables a uniform representation over many tables, and helps assemble information from multiple sources of data. Each piece of information is encoded as a triple (h, r, t), indicating a certain relation r between the head and tail entities (h, t). Large knowledge graphs, such as YAGO3 (Mahdisoltani et al., 2013) or DBPedia (Lehmann et al., 2015), contain millions or even billions of triples—e.g. (San Francisco, HasState, California)—and cover millions of entities.

Knowledge graph embedding models learn a vector for each entity (node) and relation (edge) of the graph. They have been mostly developed for two purposes, leading to two distinct lines of research (Portisch et al., 2022):

  1. Predicting new triples of the knowledge graph for completion purposes, which has been the main application of knowledge graph embeddings.

  2. Providing feature vectors for downstream tasks outside the knowledge graph, which has received much less attention in the literature, but is our focus here.

Fig. 3

Graph to text representation in RDF2vec. Random walks are performed on the knowledge graph to generate sentences of tokens. Often, walks are only computed for a subset of entities, here San Francisco. The depth parameter limits the number of hops in the random walk, either forward or backward

Embeddings for downstream tasks RDF2vec (Ristoski & Paulheim, 2016a) is a central work applying knowledge graph embeddings in external downstream tasks. It has been used to incorporate background information in various tasks: geospatial data analysis (Egami et al., 2021), recommender systems (Saeed & Prasanna, 2018; Ristoski et al., 2019), or biomedical prediction tasks (Sousa et al., 2020). Given a knowledge graph, RDF2vec generates sequences of tokens by performing random walks on the graph, alternating between entities and relations (see Fig. 3). These sequences are then fed to an SGNS model to obtain embeddings for entities and relations. An important parameter is the depth, which limits the number of hops in the random walk, and thus the range of information to capture. A depth of 1 captures relationships between entities and their nearest neighbors in the graph, and so on... Similarly to Deep Feature Synthesis, a challenge is that the number of possible walks increases exponentially with depth. To avoid this, walks are often computed for certain entities of interest only, with a limited number of walks for each entity.
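
The sketch below illustrates the walk-generation step with forward hops only; the resulting token sequences would then be fed to the same SGNS model as above. It is a simplification of what RDF2vec and the pyRDF2Vec package used in Sect. 4 actually do (which also include backward hops and many walk strategies), and the toy triples are illustrative.

```python
# Simplified RDF2vec-style walk generation: alternate entity and relation tokens.
import random
from collections import defaultdict

triples = [
    ("San Francisco", "hasState", "California"),
    ("California", "hasCapital", "Sacramento"),
]
out_edges = defaultdict(list)
for h, r, t in triples:
    out_edges[h].append((r, t))

def random_walk(start, depth=2):
    walk, node = [start], start
    for _ in range(depth):
        if not out_edges[node]:
            break
        r, t = random.choice(out_edges[node])
        walk += [r, t]   # e.g. ["San Francisco", "hasState", "California", ...]
        node = t
    return walk

walks = [random_walk("San Francisco") for _ in range(10)]  # "sentences" for SGNS
```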

Since RDF2vec, most research efforts focused on the creation of walks, for instance giving more weight to relations/entities based on their frequency, PageRank or degree, removing rare entities, or allowing teleportations between entities that share similar properties (Cochez et al., 2017; Vandewiele et al., 2020).

Embeddings for graph completion Knowledge graph embeddings have been widely used for graph completion, either through link prediction (predicting the missing entity in an incomplete triple (h, r, ?)) or triple classification (predicting if a triple is True or False). Similarly to SGNS, these models define a scoring function f(h, r, t) that represents the plausibility of a given triple (h, r, t). Embeddings are then optimized so that observed triples obtain high scores, while negative ones (typically sampled by corrupting the head or tail entity in observed triples) obtain low scores.

Scoring functions typically model the different relations between entities as geometrical operations in the embedding space. For instance, the seminal TransE model (Bordes et al., 2013) represents a relation r as a translation vector \({\varvec{r}} \in \mathbb {R}^p\) between entity embeddings \({\varvec{h}}\) and \({\varvec{t}}\):

$$\begin{aligned} \text {TransE}&f(h, r, t) = - \Vert {\varvec{h}} + {\varvec{r}} - {\varvec{t}}\Vert&\end{aligned}$$
(3)

with \(\Vert .\Vert\) an \(\ell _1\) or \(\ell _2\) norm. Given a knowledge graph \(\mathcal {G}\), embeddings are trained to minimize a margin loss:

$$\begin{aligned} L = \sum _{\!\!\!\begin{array}{c} (h,r,t) \in \mathcal {G},\\ (h',t')\,\text {s.t.} (h',r,t') \not \in \mathcal {G} \!\! \\ \text {with } h'=h \text { or } t=t' \end{array}} [f(h', r, t') - f(h, r, t) + \gamma ]_+ \end{aligned}$$
(4)
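
A minimal PyTorch sketch of the TransE scoring function (3) and margin loss (4) on a single positive/negative pair, with randomly initialized vectors; it illustrates the objective, not a full training loop with negative sampling.

```python
# TransE: score triples and compute the margin loss for one corrupted triple.
import torch

def transe_score(h, r, t, p=2):
    return -torch.norm(h + r - t, p=p, dim=-1)    # f(h, r, t) = -||h + r - t||

def margin_loss(pos, neg, gamma=1.0):
    return torch.clamp(neg - pos + gamma, min=0)  # [f(h',r,t') - f(h,r,t) + gamma]_+

dim = 200
h, r, t, t_corrupt = (torch.randn(dim, requires_grad=True) for _ in range(4))
loss = margin_loss(transe_score(h, r, t), transe_score(h, r, t_corrupt))
loss.backward()  # gradients flow back to the entity and relation embeddings
```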

Many models that improve upon TransE (Wang et al., 2017) focus on better modeling of one-to-many relationships and certain relational patterns (e.g. symmetry/antisymmetry, inversion, composition) (Yang et al., 2015; Sun et al., 2019; Balazevic et al., 2019). For link prediction in knowledge bases, one of the best-performing methods (Ali et al., 2020) is MuRE, Multi-Relational Poincaré graph embeddings (Balazevic et al., 2019). The key component of the method is the model of the link between head and tail entity [homologous to (3) for TransE]:

$$\begin{aligned} \text {MuRE}&f(h, r, t) = - d({\varvec{\rho }}_r \odot {\varvec{h}}, {\varvec{t}} + {\varvec{r}}_r)^2 + b_h + b_t&\end{aligned}$$
(5)

where \(\odot\) is the element-wise multiplication, two vectors \({\varvec{\rho }}_r, {\varvec{r}}_r \in \mathbb {R}^p\) represent the relation r, and the head and tail entities are represented by vectors \({\varvec{h}}, {\varvec{t}} \in \mathbb {R}^p\) and biases \(b_h, b_t \in \mathbb {R}\). d is the Euclidean distanceFootnote 3. The model is optimized by sampling positive and negative triples (as in (4), but using a logistic loss (2) instead).
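For comparison, below is a sketch of the MuRE scoring function (5) under the Euclidean-distance variant mentioned in footnote 3; tensor shapes and initialization are illustrative.

```python
# MuRE: f(h, r, t) = -d(rho_r * h, t + r_r)^2 + b_h + b_t, with d the Euclidean distance.
import torch

def mure_score(h, t, rho_r, r_r, b_h, b_t):
    return -torch.sum((rho_r * h - (t + r_r)) ** 2, dim=-1) + b_h + b_t

dim = 200
h, t, rho_r, r_r = (torch.randn(dim) for _ in range(4))
b_h, b_t = torch.tensor(0.0), torch.tensor(0.0)
score = mure_score(h, t, rho_r, r_r, b_h, b_t)  # higher score = more plausible triple
```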

Structure of contextual vs relational embeddings Approaches based on SGNS such as RDF2vec only capture contextual information, while much progress in knowledge graph embedding has focused on modeling different types of relations separately. As a consequence, they induce very different neighborhood structures on entity embeddings.

Contextual embeddings, as RDF2vec, are trained on “sentences” of tokens, where each entity is surrounded by the relations and entities it co-occurs with in triples (Fig. 3). Two entities end up close in the embedding space if they have similar contexts: (1) They may share a relation, but not necessarily with the same entity, e.g. (San Francisco, LocatedIn, California) and (Paris, LocatedIn, France). This tends to group entities of the same type, since entities of different nature, like people and cities, share few relations. (2) They may share a connection to a common entity, but not necessarily via the same relation, e.g. (MathWorks, FoundedIn, California) and (Nevada, HasBorderWith, California). Figure 4a gives a paradigmatic example: such contextual information is blind to the difference between Facebook, founded in Massachusetts but headquartered in California, and MathWorks, founded in California but headquartered in Massachusetts.

Knowledge graph embeddings using the relation type in the scoring function between two entities create a very different structure in the embedding space. As relations of different nature lead to different transformations of the embedding space, they each “pull” entities in different directions. In addition, modern models can learn transformations that are not one-to-one (non-bijective), better suited to many-to-one relations, as when many cities are located in the same state. As a result, the different relations can be encoded separately in the entity embeddings, for instance along different coordinates (Fig. 4b).

Fig. 4

What drives entity neighborhoods in embedding space? a Contextual embeddings (as RDF2vec) ignore the nature of the relation: given information on states in which companies have been founded and have their headquarters, it cannot differentiate Facebook (born in Massachusetts, moved to California) from MathWorks (born in California, moved to Massachusetts). b Knowledge graph embedding models can give rise to different geometric constraints for these two relations, separating out the companies. For instance here a relation is encoded with a projection

Integrating numerical attributes in embeddings Numerical attributes, such as city populations, are poorly handled by most embedding methods. They are often simply dismissed, or at best binned and treated as discrete entities (Cappuzzo et al., 2020), which remains suboptimal as it does not capture the topology of numbers.

Recent knowledge graph embedding models address this issue (Gesese et al., 2021). TransEA (Wu & Wang, 2018) adds a loss to reconstruct numerical values from embeddings with a linear model. LiteralE (Kristiadi et al., 2019) is a state-of-the-art approach where each entity i is represented by two vectors: \({\varvec{e}}_i \in \mathbb {R}^p\) representing the entity itself, and \({\varvec{l}}_i \in \mathbb {R}^{q}\) containing its numerical attributes (0 if no value), where q is the number of numerical relations in the KG. When used in the scoring function, embeddings \({\varvec{h}}\) and \({\varvec{t}}\) are constructed with a function g that combines the two vectors into a single one: \({\varvec{h}} = g({\varvec{e}}_h, {\varvec{l}}_h)\), and \({\varvec{t}} = g({\varvec{e}}_t, {\varvec{l}}_t)\), both in \(\mathbb {R}^p\). LiteralE implements g as a learnable mechanism similar to gated recurrent units.

3 Contribution: multi-relational embeddings that capture numbers

We introduce here our approach to automatically extract information from relational data, creating feature vectors that can be used in downstream tasks. It relies on 3 key ingredients, which we describe in the following subsections:

  1. Using knowledge graph embedding models designed for graph completion, as opposed to RDF2vec, to capture relational information well.

  2. KEN (Knowledge Embedding with Numbers), a module that extends knowledge graph embedding models to numerical attributes.

  3. Representing tables as knowledge graphs, to leverage them in our approach.

Figure 5 summarizes our pipeline for automatic feature extraction from relational data.

Fig. 5

Our pipeline for automatic feature extraction from relational data. (1) The input data, which may contain tables, is transformed into a knowledge graph. (2) We use a knowledge graph embedding model to learn a vector for each entity, and leverage numerical values by embedding them in the same space as other entities with KEN. (3) After training, entity embeddings can be easily added as new features in downstream tasks

3.1 Relational rather than contextual embeddings to encode information

With our goal of creating embeddings as features for downstream tasks, we motivate here the importance of using relational embeddings, originally designed for knowledge graph completion, rather than contextual RDF2vec-like models, traditionally used to extract features for downstream tasks.

From a big picture perspective, given an entity h of interest (e.g. a city), we would like an embedding \({\varvec{h}}\) that encodes as well as possible the information related to h in the data. At the very least, it implies representing well the various relationships h has to other entities (e.g. its state), to make them available to the machine-learning model used in the downstream task. Representing not only the related entity t but also the nature of the relation r is often important: knowing whether a person A is the mother, the sister, or the daughter of a person B informs on the age difference.

In contextual embeddings such as RDF2vec, the presence of a link between an entity h and another entity t is modeled somewhat independently from the nature r of the link, i.e. the type of the relation. Indeed, the scoring function used in SGNS—Eq. (1)—is only applied to pairs (h, t), (h, r) and (r, t). Structure between h, r, and t is created indirectly as they appear in the same context.

In contrast, relational embeddings developed for knowledge graph embeddings use a scoring function involving h, r, and t jointly. As this scoring function is minimized for triples in the graph, it induces algebraic relations between the corresponding embeddings: for TransE \({\varvec{t}} \approx {\varvec{h}} + {\varvec{r}}\), or for MuRE \({\varvec{t}} \approx {\varvec{\rho }}_r \odot {\varvec{h}} - {\varvec{r}}_r\). These algebraic relations imply that \({\varvec{t}}\) captures the link to \({\varvec{h}}\) in a way that is specific to r and hence a downstream analysis model can recover this specific information, e.g. selecting on the mother, and not all relatives.

Figure 4 illustrates the specificity of the link: for RDF2vec the relations are encoded as vectors which lie in the middle of the embeddings of the entities, while a knowledge graph embedding model encodes the relations as a transformation of these vectors (here a projection), allowing the different relations to be expressed on different coordinates of the vectors.

3.2 Capturing numerical attributes with KEN

Numerical attributes are omnipresent in relational data, and often contain precious information for downstream tasks, e.g. a city’s wealth influences housing prices. While they are readily available as numbers, the irregular nature of the information prevents us from merely adding them as coordinates to the feature vectors. A first challenge is that different entities have different numerical attributes. A more serious one arises when aggregating numerical information across many-to-one relations: there are many ways of doing so. For instance, to characterize wealth in a county from the GDP of its cities, the mean, the Gini index, the percentiles, etc. are all useful aggregates. As a result, Deep Feature Synthesis generates more than 2000 features derived from numerical attributes for cities in YAGO3.

We strive for lower-dimensional representations, and thus aim to capture numerical information in entity embeddings. However, embedding methods are formulated in terms of discrete elements (Sect. 2.2): words, entities. A naive way to adapt them to numerical attributes would be to consider numbers as tokens and learn an independent embedding for each value. Yet doing so discards the topology underlying those numbers: close numerical values should have similar representations. Binning values before embedding reduces this effect, but remains suboptimal. To tackle this, we introduce here KEN (Knowledge Embedding with Numbers), a module that adapts embedding models to numerical attributes.

The KEN module Entity-embedding approaches can be seen as relying on a linear encoder to associate an entity h with its vector representation \({\varvec{h}} \in \mathbb {R}^p\). In this light, we propose to inject numerical values in the same vector space also with an encoder, learning a function \({\varvec{e}}: \mathbb {R} \rightarrow \mathbb {R}^p\) that maps numerical values to embeddings.

For this function, we use a single-layer neural network with a ReLU activation to embed numerical values. To embed different types of attributes separately (e.g. city populations and GPS coordinates), we learn a function \({\varvec{e}}_r\) for each attribute r:

$$\begin{aligned} {\varvec{e}}_r(x) = \mathrm {ReLU}(x\,{\varvec{w}}_r + {\varvec{b}}_r) \end{aligned}$$
(6)

with \(x \in \mathbb {R}\) the numerical value to embed, and \({\varvec{w}}_r, {\varvec{b}}_r \in \mathbb {R}^p\) the weights and biases of the linear layer. Embeddings \({\varvec{e}}_r(x)\) of numerical values can then be used in place of tail embeddings \({\varvec{t}}\) in the scoring function f(h, r, t).
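
Below is a minimal PyTorch sketch of this encoder, with one weight/bias pair per numerical relation as in Eq. (6); it is an illustration of the idea under these assumptions, not the authors' reference implementation.

```python
# KEN numerical encoder: e_r(x) = ReLU(x * w_r + b_r), one (w_r, b_r) per attribute type.
import torch
import torch.nn as nn

class KENEncoder(nn.Module):
    def __init__(self, n_numerical_relations: int, dim: int = 200):
        super().__init__()
        self.w = nn.Parameter(0.01 * torch.randn(n_numerical_relations, dim))
        self.b = nn.Parameter(torch.zeros(n_numerical_relations, dim))

    def forward(self, rel: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # rel: relation indices, shape (batch,); x: normalized values in [0, 1], shape (batch,)
        return torch.relu(x.unsqueeze(-1) * self.w[rel] + self.b[rel])

encoder = KENEncoder(n_numerical_relations=5)
# Embeddings of two (relation, value) pairs; they replace the tail embedding t in f(h, r, t)
e = encoder(torch.tensor([0, 2]), torch.tensor([0.3, 0.9]))
```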

Comparison with other methods capturing numerical attributes An asset of KEN is that it comes with no hyper-parameters to tune. This is unlike TransEA (Wu & Wang, 2018), where the importance of numerical attributes must be controlled, with the danger that the optimal value might differ for each attribute. Another important difference with TransEA is that KEN can capture non-linear interactions between entities and numerical attributes, thanks to the ReLU activation. For instance, cities in California are associated with latitudes between \(32^\circ\) N and \(41^\circ\) N, which cannot be expressed by a mere threshold on a linear representation.

Importantly, KEN uses numerical values x during training as new triples (h, r, x) to be predicted, which forces entity embeddings to capture these numerical attributes. This is different from LiteralE (Kristiadi et al., 2019), where numerical values are incorporated into entity embeddings to better predict non-numerical triples (h, r, t). LiteralE therefore only captures the information in numerical values useful to triangulate other entities, and not the values in themselves. In particular, non-discriminant numerical attributes can be discarded by the gate mechanism. As an extreme example, an entity linked to numerical attributes but not to other entities will not be embedded in LiteralE, as there is no training data.

In contrast, KEN draws no major distinction between discrete entities and numerical values: they are embedded in the same space. Each type of numerical attribute is associated with a specific relation and thus embedded on a specific line segment via Eq. (6). An analytic model for a downstream task can extract this information, proceeding in a similar way as with discrete information (as described in Sect. 3.1). The numerical attributes that an entity has and its relations to other entities may contribute to create similar neighborhood structures: for a city, being locatedIn California is equivalent to its GPS coordinates taking specific value ranges.

Fig. 6

Embedding numerical values with KEN

Making the architecture robust to attribute distribution One challenge of heterogeneous data is that different numerical attributes have very different distributions. We normalize numerical values \(x \in \mathbb {R}\) to the interval [0, 1] before embedding them. With neural networks, a common way to do so is “min-max” normalization: \(x' = \frac{x-x_{min}}{x_{max} - x_{min}}\). However, it is problematic when dealing with heavy-tailed distributions, such as city populations. Indeed, after normalization, most values \(x'\) will be very close to zero and have similar representations \({\varvec{e}}_r(x') \simeq \mathrm {ReLU}({\varvec{b}}_r)\). This makes it difficult, for instance, to distinguish a village with 1000 inhabitants from a medium-sized town of 10,000 people.

Ideally, we would like the values \(x'\) to be evenly distributed in [0, 1], to separate as well as possible their embeddings. We achieve this with quantile normalization, which maps numerical values to their quantile in the attribute distribution, using an empirical estimate of the cumulative distribution function: \(x' = \text {CDF}(x)\).
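
A minimal numpy sketch of this normalization, replacing each value by its empirical CDF so that a heavy-tailed attribute such as city population spreads evenly over [0, 1]; the values are illustrative.

```python
# Quantile normalization: map each value to its empirical CDF in the attribute distribution.
import numpy as np

def quantile_normalize(x: np.ndarray) -> np.ndarray:
    sorted_x = np.sort(x)
    return np.searchsorted(sorted_x, x, side="right") / len(x)  # CDF(x)

populations = np.array([1_000, 10_000, 50_000, 800_000, 4_000_000])  # heavy-tailed
print(quantile_normalize(populations))  # [0.2 0.4 0.6 0.8 1. ] -- evenly spread
```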

Figure 6 summarizes the complete picture of numerical value embedding with KEN.

3.3 Representing tables as knowledge graphs

Fig. 7

Representing tables with triples. For each row of the table, we generate triples by linking its entries through different relations. The methods we present here differ in their choice of head entities when building triples: a using all discrete entries as heads, b using only the entities of interest (generally from the same column), and c introducing a “row id” entity for each row and using it as head entity

To create embeddings with rich semantics, the source data must contain as much detail as possible about the entities under study. This often requires leveraging data from different sources, for instance combining broad but shallow information (e.g. city populations) from large knowledge graphs with more granular data (e.g. recent house prices at the neighbourhood level) from domain-specific tables. Although our approach inputs knowledge graphs (i.e. triples (h, r, t)), this representation is general enough to easily encode information from other data structures. We focus here on tabular data, and explore a few strategies to represent tables as knowledge graphs.

The core idea to generate triples from tables is to link entities from the same rows with different relations. For instance, an exhaustive strategy consists in building all possible triples from the table, linking all discrete entries to other entities or numerical values from the same rows (Fig. 7a). One asset of this method is that it produces good embeddings for all entities, as they are directly connected to their attributes in the graph. But it generates a large number of triples: \(\mathcal {O}(n_{cols}^2 \, n_{rows})\), which increases the training time of embeddings. If we know beforehand the entities of interest, i.e. those used in the end task (e.g. cities), we can instead build triples from these entities only (Fig. 7b). This greatly reduces the number of triples to \((n_{cols} - 1) \, n_{rows}\) (these entities generally come from a single column) and returns embeddings tailored for the entities under study. However, this approach neglects other entities: they are not directly connected to the entries of the row and are thus likely to underperform in other applications. Finally, we consider a third heuristic that assigns a row id to each row of the table, treats this row id as an entity, and then links it to the various entries of the row (Fig. 7c). This method combines benefits of the previous methods: it does not require any prior knowledge of the downstream application and generates a light graph with \(n_{cols} \, n_{rows}\) triples. Yet learning an additional embedding for each row also raises scalability issues if there are many more rows than distinct entities to embed.
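
A minimal sketch of the second strategy (Fig. 7b), generating \((n_{cols} - 1) \, n_{rows}\) triples with the entities of interest as heads; the table content and the use of column names as relations are illustrative.

```python
# Turn each table row into triples with the target entities (here cities) as heads.
import pandas as pd

table = pd.DataFrame({
    "city": ["San Francisco", "Sacramento"],
    "state": ["California", "California"],
    "population": [815_000, 525_000],
})

def table_to_triples(df: pd.DataFrame, head_col: str):
    triples = []
    for _, row in df.iterrows():
        for col in df.columns:
            if col != head_col:
                triples.append((row[head_col], col, row[col]))  # relation = column name
    return triples

print(table_to_triples(table, head_col="city"))  # (n_cols - 1) * n_rows = 4 triples
```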

Fig. 8

Capturing joint information across columns. a A table describing cities with two joint attributes that must be considered together to be meaningful. b Using cities as head entities encodes the two attributes separately, hence we cannot differentiate them from their triples. c Introducing row entities allows capturing all attributes jointly and distinguishing the two cities

A desirable property of table-to-graph methods is their ability to represent joint information across columns. For instance, Fig. 8a considers two cities A and B with their number of companies in different fields of activity. Taken alone, the two columns are not very informative: what matters here is the number of companies in a certain field of activity, which requires considering both columns jointly. Methods that build triples from table entries such as cities encode the attributes “field of activity” and “number of companies” independently, and thus cannot distinguish A and B from their triples (Fig. 8b). In contrast, introducing row entities allows capturing row data jointly and differentiating the two cities (Fig. 8c).

Finally, if missing data are present in the table, we encode them with specific entities (one for each column).

4 Empirical study

We compare our approach with automatic feature extraction techniques, such as Deep Feature Synthesis (DFS) or RDF2vec, and focus on two criteria:

  • the quality of the extracted features: how well do they improve performance in downstream tasks?

  • the scalability of the approach: time and space complexity, size of the feature vectors

4.1 Downstream tasks

We evaluate our approach on 7 prediction tasks on various types of entities. In each task, we extract features for the entities of interest (i.e. target entities) from a source dataset, and add them to a target dataset containing the variable to predict. To showcase the versatility of our method, we consider tables and knowledge graphs as source data. More details about the downstream tasks and datasets are given in Appendix 7.1.

Tabular data We first consider two classification tasks: KDD14 (classification of educational crowdfunding projects) and KDD15 (student dropout prediction in MOOCs). For these tasks the source data consists of multiple tables describing the target entities. To leverage this data in our approach, we represent it as a knowledge graph by using target entities as head entities and linking them to other entries from the same rows, similarly to Fig. 7b.

Knowledge graphs To support our claim that general-purpose embeddings can be learned from large databases and used in various end tasks, we consider a more challenging setup: enriching several downstream tasks with background information from Wikipedia. To that end, we leverage YAGO3, a knowledge graph representation of common knowledge, built from Wikipedia and other sources (Mahdisoltani et al., 2013).

Our version of YAGO3 contains 2.8 million entities, described by 7.2 million triples. We learn embeddings for various entities that are common in data science problems (counties, cities, people, companies, movies...) and use them in 5 regression tasks on socio-economic topicsFootnote 4:

  • Elections: predict the number of votes per party in 3000 US counties.

  • Housing prices: predict the average housing price in 23000 US cities.

  • Accidents: predict the number of accidents in 8500 US cities.

  • Movie revenues: predict the box-office revenues of 4900 movies.

  • Employees: predict the number of employees in 3000 companies.

Note that there exists a more recent version, YAGO4 (Tanon et al., 2020), with a much greater coverage of information: 64 million entities, with about 2 billion triples. However, we could not include it in our empirical study as the DFS baseline was intractable on such a large database.

4.2 Approaches considered for evaluation

We describe below the feature extraction approaches that we include in our empirical study.

Our approach We implement KEN on top of 3 embedding algorithms: TransE (Bordes et al., 2013), the seminal work that introduced relations as translations of embeddings, DistMult (Yang et al., 2015), with scoring function \(f(h, r, t) = {\varvec{h}} {\varvec{\cdot }} ({\varvec{r}} \odot {\varvec{t}})\), and MuRE (Balazevic et al., 2019), because it emerged as a top-performing method in link prediction (Ali et al., 2020). We learn 200-dimensional embeddings and keep all hyper-parameters constant, except for the number of epochs \(\in [2, 4, 8, 16, 24, 32, 40]\) that we tune (see Appendix 7.2 for the exact parameters used). We base our implementations on PyKEEN (Ali et al., 2021), a Python library for learning knowledge graph embeddings. In addition, PyKEEN implements a version of DistMult that leverages numerical values with LiteralE (Kristiadi et al., 2019), which allows for a comparison with KEN.
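
For illustration, below is a hedged sketch of training such embeddings with PyKEEN on a toy triple set; the argument names follow recent PyKEEN releases, and the attribute access used to extract the learned vectors is an assumption that varies across versions (the exact configuration used in the paper is given in Appendix 7.2).

```python
# Sketch of training MuRE embeddings with PyKEEN on a toy set of labeled triples.
import numpy as np
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

triples = np.array([
    ("San Francisco", "hasState", "California"),
    ("Sacramento", "isCapitalOf", "California"),
])
tf = TriplesFactory.from_labeled_triples(triples)

result = pipeline(
    training=tf, testing=tf,              # toy example: no real held-out split
    model="MuRE",
    model_kwargs=dict(embedding_dim=200),
    training_kwargs=dict(num_epochs=16),
    random_seed=0,
)
# Assumption: in recent PyKEEN versions the first entity representation holds the vectors,
# which can then be added as feature columns in a downstream table.
entity_vectors = result.model.entity_representations[0]().detach().numpy()
```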

Deep Feature Synthesis We compare our embedding approach to Deep Feature Synthesis (DFS, see Fig. 2). We use an implementation of DFS from the Python package featuretools and extract features at depths (0, 1, 2, 3) with the default aggregation functions: MEAN, MIN, MAX, STD, SKEW, SUM for numerical features, MODE, NUM_UNIQUE for categorical features and COUNT for both. Categorical features are one-hot encoded to their 10 most common categories. To apply DFS on YAGO3, we convert it to tabular format by creating a table with two columns (head, tail) for each forward/inverse relation.
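
A minimal sketch of the corresponding featuretools call on a toy database is given below; the argument names follow featuretools 1.x and may differ in other versions, and the dataframes are illustrative rather than taken from our benchmarks.

```python
# Sketch of running DFS with featuretools on a toy cities/people database.
import featuretools as ft
import pandas as pd

cities = pd.DataFrame({"city_id": [0, 1], "state": ["CA", "NV"]})
people = pd.DataFrame({"person_id": [0, 1, 2], "city_id": [0, 0, 1],
                       "income": [55_000, 72_000, 48_000]})

es = ft.EntitySet(id="toy")
es.add_dataframe(dataframe_name="cities", dataframe=cities, index="city_id")
es.add_dataframe(dataframe_name="people", dataframe=people, index="person_id")
es.add_relationship("cities", "city_id", "people", "city_id")

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="cities",        # entities of interest
    agg_primitives=["mean", "count"],      # subset of the default aggregations
    max_depth=2,                           # how many joins may be chained
)
```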

Manual feature engineering Besides DFS, we include manual feature engineering in our empirical study. The objective is to estimate how well an analyst would perform given a time budget of 1–2 h per dataset. Results obviously depend on the analyst and could be improved with more effort, but they provide a simple baseline for a time-constrained analysis. See Appendix 7.2 for a description of the handcrafted features we used.

RDF2vec Finally, we also compare our approach to RDF2vec, traditionally used to extract features for downstream tasks. For each entity under study, we generate all possible walks of depth 2, going through forward and backward relations (as in Fig. 3). However, as the number of walks can be very high for certain entities (e.g. tens of millions), we cap this number at 10,000, and checked empirically that this value is large enough to impact only a small fraction of entities. We then feed these sequences to an SGNS model with embedding dimension = 200, window size = 4 (which allows capturing 1-hop and 2-hop neighborhoods), and pick the epoch \(\in\) [1, 5, 10, 20] that performs best. We used the pyRDF2Vec package (Vandewiele et al., 2022) to run the experiments.

4.3 Quality of the extracted features

Methodology We first study how well feature vectors created from a source database can improve performance in data-science tasks. For this, we consider the prediction problems introduced in Sect. 4.1 and the feature extraction approaches presented in Sect. 4.2: TransE, DistMult and MuRE with and without KEN; Deep Feature Synthesis; manual feature engineering; and RDF2vec.

Table 1 Quality of the extracted features: cross-validation scores on target datasets using either embeddings, deep feature synthesis, or manually handcrafted vectors as features

We measure performance with cross-validation scores, and only use entity representations to predict the target values.Footnote 5 For regression and classification, we use two analytic models from the scikit-learn library: k-nearest neighbors and gradient boosted trees, whose hyper-parameters are tuned. We report in Table 1 fivefold cross-validation scores, averaged over multiple seeds for shuffling the data and training the embedding models. See Appendix 7.3 for a more detailed description of the experimental setup.
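
A minimal sketch of this evaluation protocol with scikit-learn is shown below, assuming the entity embeddings have already been merged into the target table; X and y are random placeholders, and the specific estimator classes used here are assumptions standing in for the tuned k-nearest-neighbors and gradient-boosted-tree models.

```python
# Cross-validated evaluation of embedding features with two analytic models.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))   # 200-dimensional entity embeddings as features
y = rng.normal(size=1000)          # target values, e.g. housing prices

for model in (HistGradientBoostingRegressor(), KNeighborsRegressor(n_neighbors=10)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())
```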

Results When using entity-embeddings as feature vectors, DistMult and MuRE overall outperform RDF2vec by a wide margin (except on the Employees dataset, where RDF2vec gets surprisingly good results), with MuRE appearing as the best approach. We explain this gap by their ability to capture relational information well. In particular, MuRE is more expressive than TransE and DistMult (their scoring functions can be seen as special cases of MuRE) and thus better models complex relations. In contrast, TransE does not model well many-to-one relationships: if we have (h, r, t) and \((h', r, t)\), then h and \(h'\) are forced to have very close embeddings \({\varvec{h}} = {\varvec{h'}} = {\varvec{t}} - {\varvec{r}}\). Similarly, the scoring function of DistMult is symmetric, i.e. f(h, r, t) = f(t, r, h), which is not suited for non-symmetric relations like locatedIn. We can also see from Table 1 that leveraging numerical attributes with KEN always improves performance in TransE, DistMult and MuRE, and that it is superior to LiteralE in DistMult.

We now compare the performance of MuRE + KEN (the best embedding approach) to manual and automatic feature engineering methods. When using powerful prediction models (gradient boosted trees), MuRE + KEN does not consistently outperform DFS, but is often competitive for depths \(\le\) 2, and almost always outperforms manual feature engineering. However, when using simpler prediction models (K-Nearest Neighbors), MuRE + KEN significantly outperforms DFS for all depths. Indeed, embeddings tend to be well structured (as induced by the scoring function) and have homogeneous coefficients with similar distributions, which facilitates the downstream learning. In contrast, DFS creates a huge number of heterogeneous features, which even after scaling are hard to leverage by simple models.

We also study whether injecting taxonomic information into embedding models improves performance. Following d’Amato et al. (2021), we augment YAGO3 with triples describing its ontology, such as entity types and their relations (subClassOf and disjointWith). We apply MuRE + KEN on this augmented version of YAGO3 and observe that it generally improves prediction performance and reduces the gap with DFS.

Capturing entity types Finally, we investigate whether knowledge graph embeddings capture entity types, for instance differentiating cities from movies or counties. Such information can be useful in certain tasks that we did not consider in our previous experiments, e.g. clustering. To evaluate this, we take many entities of various types (cities, counties, movies, companies) from our previous tasks on YAGO3, and measure how well entity types can be predicted from their MuRE + KEN embeddings. We use a simple K-Nearest Neighbor model, whose number of neighbors is tuned, and obtain a ROC AUC score of 0.996, showing that knowledge graph embeddings indeed capture entity types. We detail the experimental setup in Appendix 7.3.

4.4 Scalability concerns

Large databases, such as YAGO3, hold the promise of general-purpose feature enrichment. For this, the scalability of feature extraction methods is crucial. To that end, we compare in Table 2 the scalability of various approaches: Deep Feature Synthesis (for \(0 \le\) depth \(\le 3\)), RDF2vec and MuRE (with and without KEN).

Methodology We quantify computational scalability with several metrics capturing:

  1. The scalability of feature extraction: duration and RAM usage when computing the feature vectors.

  2. The scalability of feature usage: dimension of the feature vectors, disk memory needed to store them, and duration of cross-validated evaluation in prediction tasks (using gradient boosted trees).

A benefit of knowledge graph embedding models is that they learn representations for all entities at once (e.g. cities, counties, movies in YAGO3). This is unlike DFS and RDF2vec, which typically extract feature vectors for target entities only. Given our objective to provide representations for many different entities, we thus benchmark DFS and RDF2vec when extracting features for all entities.

In some cases (KDD14 with depth 3 and YAGO3 with depth 2/3), DFS breaks the RAM capacity of our machine (400 GB) and does not terminate, even when splitting entities into 1000 chunks to lower the RAM usage. For these cases, we extrapolate the total duration based on the duration for a subset of entities, and the disk memory required to store features based on the memory it takes for a smaller number of features.

Similarly, we were not able to learn RDF2vec embeddings for all YAGO3 entities due to memory overflow. We tried limiting the number of walks to 100 per entity, and only generating them from the 1% most frequent ones, but we still could not compute them in less than a day, even with parallelization over 40 CPUs. We thus interrupted the process, and measured the duration and RAM usage just before stopping.

Table 2 Scalability of feature extraction methods: computational scalability of embedding models versus deep feature synthesis

Results We report in Table 2 the scalability metrics described above. As expected, DFS quickly becomes intractable on large databases: it requires huge amounts of time and RAM to run, and returns very high-dimensional feature vectors that need a lot of memory to be stored and a lot of time to be leveraged by machine-learning models. Interestingly, we saw in Table 1 that DFS must be computed at a depth of 2 or more to outperform MuRE + KEN (using powerful gradient boosted tree models). Yet based on this scalability study, this is already too deep to run DFS for all entities in YAGO3, due to memory issues. In the end, DFS produces high-performance features, but its usage is limited to small databases, or when the downstream task is known beforehand so as to extract features for a subset of entities only. Unlike knowledge graph embedding models, it cannot be used to create general-purpose feature vectors from large databases with millions of entities.

We observe similar trends with RDF2vec: feature extraction for all entities overall requires much more time and memory than MuRE. Actually, even creating feature vectors for target entities only with RDF2vec can take more time (e.g. 9300s for 23000 cities in Housing prices) than applying MuRE to all YAGO3 entities, and must be repeated for every new downstream task.

4.5 KEN helps embeddings capture numerical attributes

As visible in Fig. 9, KEN provides embeddings that represent in a much simpler way the numerical information associated with entities. When embedding counties from YAGO3, the structure of KEN embeddings reflects well the population density, with a direction grouping together metropolitan areas such as Chicago (Cook county), Los Angeles (Orange County), Houston (Harris county), and Phoenix (Maricopa county), well separated from rural counties. On the other hand, this information is more diluted in standard MuRE embeddings.

Fig. 9

Embeddings of counties using only categorical attributes (MuRE) or all attributes (KEN-E) from YAGO3: PCA projection of the 200-dimension embeddings in 2D. The color represents the county population and the symbols the state of the county. We randomly draw high and low population counties in the same state. Cook, Orange, Harris, and Maricopa counties correspond to major cities: Chicago, Los Angeles, Houston, and Phoenix. The global structure of MuRE + KEN embeddings better reflects the population of the counties, in particular separating the rural counties from those related to major cities. A simple linear projection of the MuRE + KEN embeddings suffices to roughly capture the rural-urban gradients, while it is less clear on MuRE embeddings

Table 3 Reconstructing numerical attributes: cross-validation scores (R2) of simple nearest-neighbour models predicting the numerical attributes associated with an entity from its embedding

Methodology To evaluate quantitatively the ability of embeddings to capture numerical information, we compare the performance of simple supervised models to predict the numerical attributes of entities (e.g. county populations) from their embeddings. In practice we use K-Nearest Neighbors models (whose hyper-parameters are tuned) and aim to predict statistics about donations to projects in KDD14, student connections to MOOCs in KDD15 and county attributes in YAGO3. We measure performance with cross-validation scores. See Appendix 7.4 for the exact evaluation setup.

Results The scores reported in Table 3 confirm that adding KEN significantly improves the ability to capture numerical information related to the entities: in all settings adding KEN leads to better reconstruction of numerical attributes, and also outperforms LiteralE by a wide margin. In addition, results show that these embeddings capture to some extent the whole distribution of numerical attributes: their mean, but also their quantiles.

Table 4 Ablation study: drop in cross-validation scores of variants of MuRE + KEN and binning, relative to the original MuRE + KEN

4.6 Ablation study

We study in this section the influence of two ingredients of KEN on the quality of entity-embeddings: (1) the quantile normalization of numerical values at the input, and (2) the presence of a ReLU activation function at the output (Fig. 6).

Methodology We measure the drop in performance relative to the original MuRE + KEN when: (1) replacing the quantile normalization by a min-max normalization \(x' = \frac{x-x_{min}}{x_{max} - x_{min}}\), and (2) removing the ReLU activation. We also compare KEN to a standard binning practice, where numerical values are divided into bins and an embedding is learned for each bin. In practice we use 20 bins and split values evenly across bins to be robust to fat-tailed distributions: the first bin corresponds to values in the top 5%, the second bin to values in the range 5–10%, and so on... We use gradient boosted tree models for prediction, and the same setup as in Table 1.

Results Table 4 shows that all ingredients of KEN are important, especially the quantile normalization, and confirms that KEN leads to markedly better features than binning.

4.7 Capturing deep features with embeddings

Methodology We want to determine if embeddings can capture information deep in the knowledge graph, indirectly chaining relations as in Deep Feature Synthesis. For this purpose, we compare in Table 5 cross-validation scores of gradient boosted tree models with embeddings trained either on the full YAGO3 database, or on a subset of YAGO3 containing only the triples related to the target entities. For example, a subset with city-related triples would contain direct information about cities (e.g. the state to which they belong), but no information about the states themselves. Such “deep” information can however be helpful for analytical tasks, and should be captured by embedding models. The evaluation setup is the same as in Table 1.

Table 5 Embedding can capture deep features: cross-validation scores (R2) of gradient boosted tree models using as features either embeddings trained on the full YAGO3 dataset, or on a subset of YAGO3 containing only the triples related to the target entities

Results Table 5 shows that adding triples indirectly related to the target entities improves the quality of their embeddings; hence embedding models do capture deep information.

Table 6 Influence of table representations: cross-validation scores of different strategies to represent tables as a knowledge graph

4.8 Influence of table representations

Methodology When the source data consists of tables, it must be represented as a knowledge graph to be leveraged by our approach. We introduced in Sect. 3.3 three table-to-graph strategies, which differ on which entities are used as heads when generating triples (Fig. 7). We either use: (1) all entities, (2) only target entities (which requires some prior knowledge of the downstream application) or (3) row ids. We evaluate the performance of these strategies with cross-validation scores on KDD14 and KDD15, using gradient boosted tree models for prediction (as in Table 1). To show the importance of choosing well the column with the target entities in the second approach, we also evaluate a simple baseline taking entities from another column.

Results Based on Table 6, the top-performing table-to-graph strategy consists in generating triples from target entities. Indeed, the resulting graph directly connects them to their attributes, which facilitates the learning of embeddings. This intuition is confirmed when taking instead entities from another column, as we observe a sharp drop in performance. Interestingly, using all entities or row ids as head entities returns embeddings that perform reasonably well without being tailored for the specific task at hand. These methods can provide general-purpose embeddings that perform well for various entities and applications. However, they either increase the number of triples (and thus the training time of embeddings) or the number of entities.

5 Discussion

5.1 Embeddings capturing numerical information can provide feature enrichment

By relying on entity embeddings, our feature-synthesis pipeline departs strongly from the standard approach of feature engineering in databases. Our extensive experiments confirm that features created via knowledge graph embedding do capture the information needed for a statistical task. Embedding models coupled with KEN improve over manual feature engineering on almost all tasks.

We observe clear trends in the experimental results: Table 1 reveals the importance of capturing well (1) the numerical attributes and (2) relational, rather than contextual, information. Indeed, across all analytic tasks and embedding methods explored, adding KEN leads to features that better capture numerical attributes and improve the downstream analytic task (Tables 3, 1). It also improves over binning and LiteralE by a large margin. The ingredients that we introduced in KEN, such as the quantile normalization that accounts for the distribution of numerical attributes, significantly improve performance (Table 4). Improving models of relations makes a strong difference in how useful the resulting features are for downstream tasks: there are notable improvements from RDF2vec (which has no explicit model of relations) to MuRE (Table 1).
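
To illustrate the role of this ingredient, the sketch below applies a quantile normalization to a skewed numerical attribute using scikit-learn's QuantileTransformer; this is an illustrative stand-in, not necessarily the exact transformation implemented in KEN.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical numerical attribute values (e.g. city populations), heavily skewed.
populations = np.random.lognormal(mean=10, sigma=2, size=(10_000, 1))

# Quantile normalization maps the values to a uniform distribution on [0, 1],
# so the embedding model sees a well-spread signal regardless of the original
# scale or skew of the attribute.
qt = QuantileTransformer(output_distribution="uniform", n_quantiles=1000)
normalized = qt.fit_transform(populations)
```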

5.2 Deep feature synthesis cannot go so deep

Automated feature-engineering methods like Deep Feature Synthesis greatly reduce the human cost of manually handcrafting features across tables, while achieving excellent results on all datasets. With deep-enough features, DFS performs consistently better than manual feature engineering and often slightly better than MuRE + KEN (Table 1).

But this ability to generate good features comes at the price of scalability. Since DFS combines aggregation functions and features at each depth, the time and space complexity, as well as the number of created features, grow exponentially with depth (Table 2). Even on relatively small databases like KDD14 or YAGO3, building features for all entities with DFS at a depth of 2 or 3 becomes intractable, with memory requirements greatly exceeding our machine capacity (400 GB). Beyond memory limitations, the number of features quickly reaches tens or hundreds of thousands, making statistical models harder and slower to train (e.g. 180x longer on Employees), and reducing feature interpretability.

Yet, the databases that we have explored are smaller than the latest repositories of general knowledge: YAGO3 is 50 times smaller than YAGO4 (Tanon et al., 2020). Progress in linked open data is continuously increasing the amount of information available in a consistent representation: DBPedia (Lehmann et al., 2015) currently contains 900 million triples, and grows by a factor of 1.5 to 2 every two years (DBPedia Web Page, 2021). For instance, we could not run DFS on YAGO4, even with a depth of 1. Even if it could run, it would produce a huge number of features that would be hard to leverage.

Embeddings, by contrast, readily provide low-dimensional representations (\(p = 200\)) that capture “deep” information by indirectly chaining relations (Table 5). Finally, knowledge graph embedding methods are very scalable: embeddings are optimized with stochastic gradient descent (\(\mathcal {O}(\#\text {triples})\)), and can be trained on huge amounts of data. Further optimizations can make embedding techniques \(2-5\times\) faster than the implementations that we used (Zheng et al., 2020).
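
As an illustration of this training procedure, the sketch below fits MuRE embeddings with the PyKEEN library on a small built-in dataset; this library and dataset are an assumption for illustration purposes, not necessarily the implementation used in our experiments.

```python
from pykeen.pipeline import pipeline

# Train MuRE embeddings by stochastic gradient-based optimization over the
# triples; the cost scales with the number of triples.
result = pipeline(
    dataset="Nations",          # small built-in dataset, stands in for YAGO3
    model="MuRE",
    model_kwargs=dict(embedding_dim=200),
    training_kwargs=dict(num_epochs=50),
)

# Extract the learned entity embeddings as a NumPy array for downstream use.
entity_embeddings = result.model.entity_representations[0]().detach().cpu().numpy()
```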

Knowledge graph embedding models are also naturally suited to capture complex relational patterns between discrete elements. This is unlike DFS, which struggles to encode categorical features: sets of discrete entities (e.g. the cities located in a county) are aggregated by their most common element and then one-hot encoded, discarding a lot of information in the process.
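
A toy illustration of this information loss (with made-up county and city names):

```python
import pandas as pd

# Hypothetical "cities" table: several cities per county.
cities = pd.DataFrame({
    "county": ["A", "A", "A", "B"],
    "city":   ["x", "y", "z", "x"],
})

# DFS-style aggregation: the set of cities in a county is reduced to its
# most common element (the mode), then one-hot encoded.
mode_per_county = cities.groupby("county")["city"].agg(lambda s: s.mode().iloc[0])
one_hot = pd.get_dummies(mode_per_county)
print(one_hot)  # counties A and B both collapse to city "x"; "y" and "z" are lost
```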

5.3 Current limitations call for further work

Interpretability The biggest drawback of automatic feature generation is that it leads to models that are harder to interpret. Indeed, features are often manually crafted to capture a quantity of interest, such as the wealth of a locality. Data scientists can then reason about the role of the corresponding quantity, for instance the impact of local wealth on housing prices. A caveat to these interpretations is that the crafted feature must represent the quantity well, but here the burden is on the analyst rather than on the tool. With automatically generated features, the quantities of interest must be identified from the features themselves. This is typically hard: even in DFS, where features are associated with descriptive labels, we may have to distinguish between many partly redundant features. It is even harder with embedding models, which are black boxes and do not associate human-understandable labels to individual features.

Matching and out-of-vocabulary The target data may use different naming conventions than the source; for instance, county names in the Elections dataset are written differently than in YAGO3. In such cases, a form of matching must be performed (e.g. Cook County \(\rightarrow\) Cook, Illinois). This is often done manually using domain knowledge. Further work should explore automated techniques, for instance using fuzzy or similarity joins (Mann et al., 2016; Silva et al., 2010), or adapting NLP techniques used to create embeddings robust to out-of-vocabulary entities (Bojanowski et al., 2016; Pinter et al., 2017; Chen et al., 2022).
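
As a simple illustration, an approximate matching can be sketched with the standard-library difflib; the name lists below are hypothetical, and real applications would likely need dedicated fuzzy-join tools and domain knowledge.

```python
import difflib

# Hypothetical name variants: entity names in the target data vs. in YAGO3.
target_names = ["Cook County", "Du Page County"]
yago_names = ["Cook, Illinois", "DuPage, Illinois", "Lake, Illinois"]

# difflib gives a simple similarity-based matching between name variants.
for name in target_names:
    match = difflib.get_close_matches(name, yago_names, n=1, cutoff=0.3)
    print(name, "->", match[0] if match else "no match")
```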

6 Conclusion

We have shown how turn-key extraction of embeddings from relational data can distill valuable information from a database, synthesizing feature vectors for data enrichment in downstream analytic tasks. For these feature vectors to be most useful in analytic tasks, experiments show that embedding methods must model well the different relations between entities, and capture their numerical attributes. For this, we proposed to use knowledge graph embedding models designed for link prediction, and extended them to numerical attributes with KEN. Our extensive experiments show that these embeddings improve markedly upon manual feature engineering and embedding methods traditionally used for feature extraction, such as RDF2vec. They are also competitive with automatic feature engineering methods based on systematic denormalization, like Deep Feature Synthesis, but do not face the same scalability challenges.

A pipeline to minimize human effort Our pipeline is designed to facilitate data preparation. Not only does it circumvent the human labor of designing features manually, it also minimizes data integration and wrangling challenges. Operating on a triple representation (sometimes automatically built from tables) removes many tedious aspects of data input: for instance, it works equally well on tables in “long” or “wide” formats. It also makes it possible to capture and mix information from various data structures: tables, knowledge graphs... Yet, richer representations may be useful in the long run to better capture complex relationships within the data, such as temporal dependencies (Arora & Bedathur, 2020).

Towards general-purpose feature enrichment The scalability of our approach allowed us to easily extract embeddings from YAGO3, capturing the corresponding information drawn from Wikipedia. These embeddings could readily be used as feature enrichment to improve statistical analysis on the 5 socio-economic datasets we investigated. Our work thus opens a path to distilling large and complex stores of general information into feature vectors that are easy to integrate into any analysis. As such, it contributes a major step towards facilitating data science with less manual data preparation.