1 Introduction

Data in organizations (such as enterprises and hospitals) are dispersed across many sources in various formats. It is important to identify all records that correspond to the same real-world entity across data sources, a fundamental task called entity resolution (ER) in data integration and data governance [1, 2]. ER, also known as entity matching and record linkage, is a long-standing research topic in databases, data mining and machine learning.

Recently, deep learning (DL) has been extensively explored in ER research [3,4,5,6]. Although deep learning based ER (deep ER) has achieved remarkable progress, there is still room for improvement. A fundamental aspect of deep ER models is data embedding [3, 4]. The majority of existing works utilize universal word embeddings, such as word2vec [7, 8], GloVe [9] and FastText [10], all of which are pre-trained on large natural language processing (NLP) corpora. Despite their universality, pre-trained word embeddings show deficiencies on ER tasks. ER mainly processes records from data with attributes (as illustrated in Fig. 1) [1], which are more rigorously organized than free texts in NLP. Data with attributes subsume structured data (like data in relational databases) and semi-structured data (like CSV and JSON data). In the following, data refer to data with attributes unless otherwise specified. Universal word embeddings fail to capture attribute-related semantics, as detailed later. Also, ER datasets are often deeply domain-specific, like enterprise data and medical data. Such data commonly involve custom words outside the universal vocabulary, which are missed by pre-trained word embeddings; this is known as the out-of-vocabulary (OOV) problem. Thus, tailored local embeddings of data with attributes are preferred in ER.

Fig. 1. An example of data with attributes in our ER setting. In such data, records are annotated with attributes, as in ds1 and ds2. Each field is uniquely fixed by both a record and an attribute, such as f11 = r1[Item] = “Lenovo ThinkPad X1 Carbon Gen 7”, f12 = r1[Model] = “20QD001WUS”, and f13 = r1[Category] = “Laptops / Notebooks”. There might be complex (directed) attribute associations: from Item in ds1 to Title and Brand in ds2; from Title in ds2 to Item and Model in ds1. Also, there might be dirty data: in r2, the field of Category should be “Laptops/Notebooks”, which is however misplaced in the field of Model.

Let us analyze the data in Fig. 1 in detail. These data are hybridly hierarchical. In the two-dimensional data hierarchy, there are four roles: attributes, records, fields and tokens. A schema consists of several unordered attributes, which reflects its set nature. Each record is instantiated following the schema, where each field corresponds to a unique attribute. In such an organization, a field is horizontally related to other fields in the same record (row), and vertically related to other fields in the same attribute (column). A field itself is a token sequence, i.e., a short text. Specifically, as shown in Fig. 2, there are three types of semantic relations: attribute-field, where each field is annotated by a unique attribute; record-field, where a record is composed of several fields; and field-token, where a field is composed of several tokens. Hence the semantics in data is multitype, including schema semantics and hierarchical instance semantics (i.e., fields in a record and tokens in a field).

Traditional word or text embeddings [7,8,9,10,11], which originate from free texts in NLP, can neither properly process the complex hierarchy nor fully capture the multitype semantic relations in our data. To learn tailored local embeddings of data with attributes for ER, there are two major challenges. (1) The first is how to build a unified data model that encodes these multitype semantic relations. (2) The second is how to learn effective data embeddings from this unified data model for downstream ER tasks. On one hand, local data embeddings should fully capture both schema semantics and hierarchical instance semantics; on the other hand, local embeddings should be similarity driven, considering downstream ER tasks. EmbDI creates local embeddings for data integration tasks [5] and makes certain progress on our topic. Yet EmbDI is limited in the following aspects. Its data model ignores the key role of fields, which reduces the representational completeness of data with attributes. Its embedding method leverages vanilla random walks from DeepWalk [12], and does not fully exploit the differences among objects of distinct types or the heterogeneity of semantic relations among such objects.

Fig. 2. Multitype semantic relations in data with attributes. Each arrowed line indicates a specific semantic relation type.

In this work, our goal is to generate tailored multi-semantic data embeddings for ER. First, we model data with attributes as a family of multitype bipartite information networks, including the attribute-field network, record-field network and field-token network, each of which captures a specific type of semantic relations. Second, we learn multi-semantic local embeddings of data with attributes through these multitype information networks, which are tailored for ER. Since fields play key roles in both data organization and record comparisons, we choose to learn distributed representations of fields by collectively embedding the three bipartite information networks. In this way, field embeddings fully capture the hybrid hierarchy of schema semantics and instance semantics. Regarding the similarity requirements of ER, we propose a similarity driven method for bipartite information network embedding, which maps vertices into a low dimensional space according to their effectively measured similarities. In essence, the embedding proximity distribution should be consistent with the measured similarity distribution. Third, we introduce a hierarchical representation-comparison-classification framework for ER. Schema mapping is not given a priori in our ER setting, but can be inferred from field embeddings within the ER framework. Finally, we carry out comprehensive experiments over three types of datasets to evaluate the proposed approach. The evaluations demonstrate our improvements over previous works and examine the different components of our approach.

Summary of contributions.

  • We propose to represent data with attributes as a family of bipartite information networks, which fully preserve multitype semantic relations among records, attributes, fields and tokens in data.

  • We propose to learn tailored multi-semantic distributed representations of data for ER tasks by collectively embedding multitype information networks. In particular, we design a similarity based bipartite network embedding method.

  • We propose a flexible representation-comparison-classification framework of deep ER, into which probabilistic schema mapping is integrated.

  • We conduct extensive experimental evaluations on seven datasets of three types, which show the effectiveness of our approach and the effects of its components.

Organization of the rest of the paper. Section 2 formalizes the problem. Section 3 specifies multi-semantic data embedding through multitype information networks. Section 4 presents a representation-comparison-classification framework for ER. Section 5 conducts experimental evaluations over three types of datasets. Section 6 reviews related works. Section 7 concludes the whole work.

2 Problem Formalization

Entity resolution (ER) determines whether multiple records correspond to the same real-world entity. ER is essentially a classification problem, which can be solved with deep neural networks [3, 4], a.k.a. deep ER. This work focuses on learning tailored local data embeddings for ER tasks, a fundamental problem in deep ER model building [4]. Basically, we model data as multitype information networks and generate multi-semantic distributed representations of data by network embedding.

Multitype Information Networks for Data with Attributes.

As Fig. 2 illustrates, there are three semantic relation types in data. To incorporate multitype semantic relations into a unified representation, we define a family of multitype information networks and Fig. 3 presents an example. We choose to model data as multitype bipartite information networks rather than a single heterogeneous information network, because different types of semantic relations are not comparable.

Definition 1. Attribute-Field Network.

Attribute-field network is a weighted bipartite graph GAF = (A ∪ F, EAF, WAF), where A is a set of attributes, F is a set of fields, and EAF is the set of edges between attributes and fields. The weight wij of the edge between attribute ai and field fj is set to 1 uniformly.

Attribute-field network connects schema and instance, going from the abstract to its instantiation. The attribute set size is usually small; for instance, there are three attributes in ds1 or ds2 of Fig. 1. The number of fields corresponding to each attribute depends on the number of records, which is flexible. Thus the total number of fields can be very large.

Definition 2. Record-Field Network.

Record-field network is a weighted bipartite graph GRF = (R ∪ F, ERF, WRF), where R is a set of records, F is a set of fields, and ERF is the set of edges between records and fields. The weight wij of the edge between record ri and field fj is set to 1 uniformly.

Record-field network is an affiliation from ensembles to components. A record has n fields, where n is the attribute set size.

Definition 3. Field-Token Network.

Field-token network is a weighted bipartite graph GFT = (F ∪ T, EFT, WFT), where F is a set of fields, T is a set of tokens, and EFT is the set of edges between fields and tokens. The weight wij of the edge between field fi and token tj is set to the number of times token tj appears in field fi, and is normalized into (0, 1] with the max number.

Field-token network captures token co-occurrences at the field level, which expresses fine-grained data semantics. Field semantics majorly stems from token level semantics, and so does field similarity.
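To make Definitions 1-3 concrete, the following Python sketch builds the three bipartite networks from a list of records, each given as an attribute-to-text dictionary. The function name and data layout are illustrative assumptions, not the paper's actual (C++) implementation.

```python
from collections import defaultdict

def build_networks(records):
    """Build the attribute-field, record-field and field-token bipartite networks
    (Definitions 1-3) from a list of records, each an attribute->text dict.
    Attribute-field and record-field edges get weight 1; field-token edges are
    token counts normalized into (0, 1] by the maximum count within the field."""
    g_af, g_rf, g_ft = defaultdict(dict), defaultdict(dict), defaultdict(dict)
    for rid, record in enumerate(records):
        for attr, text in record.items():
            fid = (rid, attr)                      # a field is fixed by a record and an attribute
            g_af[attr][fid] = 1.0                  # attribute-field edge
            g_rf[rid][fid] = 1.0                   # record-field edge
            counts = defaultdict(int)
            for tok in text.lower().split():
                counts[tok] += 1
            max_cnt = max(counts.values(), default=1)
            for tok, cnt in counts.items():
                g_ft[fid][tok] = cnt / max_cnt     # normalized field-token weight
    return g_af, g_rf, g_ft
```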

Fig. 3. A family of information networks constructed with data ds1 and ds2 in Fig. 1. All three information networks are partially illustrated due to space limits.

Data with Attributes Embedding for Entity Resolution.

With the above three information networks, we formalize our problem. Our goal is to learn distributed representations of data with attributes that are optimized for downstream ER tasks. Among the four types of objects in the hybrid hierarchy, fields act as hubs connecting attributes, records and tokens, as the attribute-field, record-field and field-token networks together show. Fields, consisting of sequential tokens, are primary elements of records, and meanwhile are semantically constrained by attributes. Since fields play the role of hubs across multitype information networks and are the basic units for record comparisons, we choose to learn field embeddings for ER.

Definition 4. Data with Attributes Embedding (DAE) for Entity Resolution.

Given a collection of data with attributes from one or several data sources, the goal of DAE is to learn multi-semantic distributed representations of fields by embedding the multitype information networks built from the collection into a low dimensional vector space. Field embeddings should fully capture both schema semantics and hierarchical instance semantics. Also, field embeddings should be similarity oriented, where proximity in the embedding space should be consistent with an effective similarity.

3 Multi-semantic Data Embedding Through Multitype Information Networks

Basically, we embed data with attributes through multitype information networks, and the output is tailored embeddings of fields, which cover all datasets to be resolved. An essential problem of ER is similarity computation, which calls for similarity based data embedding. Inspired by [13], we propose a common neighbor similarity based bipartite information network embedding method. Then we generate multi-semantic field embeddings by collectively leveraging the three information networks (of different types) constructed from data with attributes.

3.1 Similarity Based Bipartite Network Embedding

We embed bipartite networks with a novel common neighbor similarity.

Common Neighbor Similarity.

For a bipartite network G = (VA ∪ VB, EAB, WAB), VA and VB are two disjoint vertex sets of different types, and EAB is the edge set between them. Generally, the similarity between two vertices of the same type is indirectly indicated by their common neighbors of the other type, since such vertices are never linked directly. Given two vertices vi and vj from VA, their similarity can be measured as follows.

$$sim_{cn} (v_{i} ,v_{j} ) = \frac{{\sum\nolimits_{{v_{k} \in N(v_{i} ) \cap N(v_{j} )}} {\frac{1}{{d(v_{k} )}}(w_{ik} + w_{jk} )} }}{{\sum\nolimits_{{v_{m} \in N(v_{i} )}} {\frac{1}{{d(v_{m} )}}w_{im} + \sum\nolimits_{{v_{n} \in N(v_{j} )}} {\frac{1}{{d(v_{n} )}}w_{jn} } } }}$$
(1)

N(vi) is the neighbor set of vi; d(vi) is the degree of vi. Our vertex similarity is a weighted variant of the Dice similarity. We weight both edges and vertices. For a vertex vi and its neighbor vk, their edge eik is naturally weighted as wik by network G. As a neighbor, vertex vk is weighted by 1/d(vk), which is inspired by the classical IDF (inverse document frequency). Thus, for vertex vi, the importance of its neighbor vk is measured by both vk’s weight and their edge weight, expressed as (1/d(vk))wik. Finally, the denominator on the right-hand side of formula 1 sums vi’s weighted neighbors and vj’s weighted neighbors, and the numerator sums their weighted common neighbors. Note that each common neighbor is counted twice, since the neighbor is linked to vi and vj separately.

Take the field-token network as an example. The larger the proportion of tokens two fields share, the more similar the two fields are; the more fields a token occurs in, the less the token contributes to field similarities.
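A minimal Python sketch of formula 1, assuming a bipartite network stored as a dict that maps each same-type vertex to its weighted neighbors (as produced by the construction sketch above); the helper names are illustrative.

```python
from collections import defaultdict

def neighbor_degrees(graph):
    """Degree of each opposite-type vertex: the number of same-type vertices it links to."""
    deg = defaultdict(int)
    for nbrs in graph.values():
        for v in nbrs:
            deg[v] += 1
    return deg

def sim_cn(graph, deg, v_i, v_j):
    """Common neighbor similarity (formula 1): each neighbor is weighted by 1/degree
    (IDF-like) times the edge weight; common neighbors are counted once per endpoint."""
    n_i, n_j = graph[v_i], graph[v_j]
    num = sum((n_i[v] + n_j[v]) / deg[v] for v in set(n_i) & set(n_j))
    den = sum(w / deg[v] for v, w in n_i.items()) + sum(w / deg[v] for v, w in n_j.items())
    return num / den if den > 0 else 0.0
```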

Bipartite Network Embedding.

Here we focus on embedding vertices of the same type in a bipartite network. Considering the similarity computation needs of ER, such as field similarities, the learned vertex feature representations are supposed to reflect the distribution of a given vertex similarity sim: V × V → \(\mathbb{R}\) over all vertices of the same type in a bipartite network.

We define neighborhoods in a bipartite network as follows: two vertices of one type belong to the same neighborhood(s) if they share at least one common neighbor vertex of the other type; otherwise, they are separated into different neighborhoods. Vertices sharing similar neighborhoods in the network should be mapped close to each other in the embedding space.

Let f: V → \(\mathbb{R}\)d be the mapping function from vertex v to its feature representation f(v) (a d-dimensional vector), which we want to learn. To model proximity in the embedding space, we define the conditional probability of vertex vj in set VA given vertex vi in set VA, as shown in formula 2. This is actually a normalized proximity.

$$pxt(v_{j} |v_{i} ) = \frac{{\exp (f(v_{j} )^{{\text{T}}} f(v_{i} ))}}{{\sum\nolimits_{{v_{k} \in V_{A} }} {\exp (f(v_{k} )^{{\text{T}}} f(v_{i} ))} }}$$
(2)

Then given a bipartite network similarity sim(·, ·), we generate its similarity distribution. For instance, the conditional similarity of vj given vi is defined in formula 3.

$$p_{sim} (v_{j} |v_{i} ) = \frac{{sim(v_{i} ,v_{j} )}}{{\sum\nolimits_{{v_{k} \in V_{A} }} {sim(v_{i} ,v_{k} )} }}$$
(3)

We want the embedding proximity to be consistent with the given similarity. Thus, we define the objective function as Kullback-Leibler (KL) divergence between the embedding proximity distribution and the given similarity distribution, as formula 4 shows, and minimize it.

$$O_{emb} = \sum\nolimits_{{v_{i} \in V_{A} }} {KL(p_{sim} ( \cdot |v_{i} )||pxt( \cdot |v_{i} ))}$$
(4)

Omitting some constants in formula 4, the objective function can be rewritten as formula 5, which is cross-entropy.

$$O_{emb} = - \sum\nolimits_{{v_{i} \in V_{A} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(5)

This objective can be optimized with asynchronous stochastic gradient descent (ASGD). However, it is computationally expensive to calculate the conditional probabilities psim(·|vi) and pxt(·|vi), which require summation over the entire vertex set. To address this issue, we adopt the negative sampling (NEG) method [8], which, for each positive sample, selects K negative samples according to some noise distribution. A positive sample is a pair of vertices (of the same type) sharing neighborhood(s), where they are neighbors of each other; a negative sample is a pair of vertices (of the same type) sharing no neighborhood. The adoption of NEG makes our model scalable. Formally, we define the negative sampling objective as formula 6.

$$O_{NEG} = \sum\limits_{\begin{subarray}{l} v_{i} \in V_{A} \\ v_{j} \sim p_{sim} ( \cdot |v_{i} ) \end{subarray} } {[\log \sigma (f(v_{j} )^{{\text{T}}} f(v_{i} )) + \sum\limits_{k \in [1,K]} {{\mathbb{E}}_{{v_{k} \sim P_{n} (v)}} \log \sigma ( - f(v_{k} )^{{\text{T}}} f(v_{i} ))} ]}$$
(6)

\(\sigma (x) = (1 + e^{ - x} )^{ - 1}\) is the sigmoid function. The first term models a positive sample, where vj is sampled from the neighborhoods of vi and their similarity sim(·, ·) is positive; the second term models K negative samples randomly selected from the noise distribution Pn(v), which is set following [8].

Embeddings of attribute-field network, record-field network and field-token network can all be learned by the proposed model. In our model, sim(·, ·) is set to our proposed common neighbor similarity simcn(·, ·).
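To illustrate how formula 6 can be optimized in practice, here is a minimal PyTorch sketch of the same-type vertex embedding model. The class name, tensor layout and initialization are assumptions (the paper's implementation is in C++), and the sampling of positives from psim(·|vi) and negatives from Pn(v) is left abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BipartiteSimEmbedding(nn.Module):
    """Similarity-driven embedding of the same-type vertex set V_A (formula 6):
    positives v_pos are drawn from p_sim(.|v_i), negatives v_neg from a noise
    distribution P_n(v); the loss is the negated NEG objective."""
    def __init__(self, num_vertices, dim=128):
        super().__init__()
        self.emb = nn.Embedding(num_vertices, dim)
        nn.init.uniform_(self.emb.weight, -0.5 / dim, 0.5 / dim)

    def forward(self, v_i, v_pos, v_neg):
        h_i = self.emb(v_i)                                           # (batch, dim)
        h_pos = self.emb(v_pos)                                       # (batch, dim)
        h_neg = self.emb(v_neg)                                       # (batch, K, dim)
        pos_score = (h_i * h_pos).sum(-1)                             # f(v_j)^T f(v_i)
        neg_score = torch.bmm(h_neg, h_i.unsqueeze(-1)).squeeze(-1)   # (batch, K)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1))
        return loss.mean()
```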

3.2 Multi-semantic Embedding for Data with Attributes

There are three information networks: attribute-field, record-field and field-token. Fields occur in all of them, and are also what we want to embed for downstream ER tasks. Each network indicates a unique affiliation relation and has a particular semantic interpretation. The field-token network contains token level semantics, which fundamentally contributes to field semantics and plays a key role in field similarity computations. The record-field network reflects record level semantics, where each small set of fields co-occurs in the same record context. The attribute-field network reflects schema semantics, where each (large) set of fields is constrained in the same attribute context.

Field embeddings should contain all multitype semantics. Therefore, field representations are collectively learned through the three bipartite information networks. We define the collective objective function (formula 7) for multi-semantic field embedding, and minimize it.

$$O_{all} = \alpha O_{FT} + \beta O_{RF} + \gamma O_{AF}$$
(7)
$$O_{FT} = - \sum\nolimits_{{v_{i} \in F \in G_{FT} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(8)
$$O_{RF} = - \sum\nolimits_{{v_{i} \in F \in G_{RF} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(9)
$$O_{AF} = - \sum\nolimits_{{v_{i} \in F \in G_{AF} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(10)

OFT is the objective function for the field-token network embedding, ORF is the objective function for the record-field network embedding, and OAF is the objective function for the attribute-field network embedding. The hyperparameters α, β, γ (α + β + γ = 1) are weights for the different objectives, which control the contribution of each network embedding to the overall field embedding.

We train the model jointly, utilizing all three types of networks. Since edges from different networks are not comparable, we interleave updates of the different network embeddings. Hence the model is updated network by network.
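A sketch of the interleaved update scheme, assuming a shared field-embedding model (such as the BipartiteSimEmbedding sketch above) and per-network batch samplers; the sampler interface and the optimizer choice are illustrative.

```python
import itertools
import torch

def train_multi_semantic(model, samplers, weights, num_steps, lr=0.025):
    """Minimize O_all (formula 7) by cycling through the three networks,
    one mini-batch per network per turn. `samplers` maps 'FT'/'RF'/'AF' to a
    function returning (v_i, v_pos, v_neg) index tensors for that network;
    `weights` holds alpha/beta/gamma under the same keys."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _, name in zip(range(num_steps), itertools.cycle(['FT', 'RF', 'AF'])):
        v_i, v_pos, v_neg = samplers[name]()
        loss = weights[name] * model(v_i, v_pos, v_neg)
        opt.zero_grad()
        loss.backward()
        opt.step()
```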

4 Flexible Entity Resolution with Multi-semantic Data Embeddings

There are two data sources S with schema \([{a}_{1}^{s},\dots ,{a}_{m}^{s}]\) and T with schema \([{a}_{1}^{t},\dots ,{a}_{n}^{t}]\). Entity resolution determines if two records rs and rt correspond to the same real-world entity, where \({r}^{s}=\{{<}{a}_{1}^{s},{f}_{1}^{s}{>},\dots ,{<}{a}_{m}^{s},{f}_{m}^{s}{>}\}\) is from S and \({r}^{t}=\{{<}{a}_{1}^{t},{f}_{1}^{t}{>},\dots ,{<}{a}_{n}^{t},{f}_{n}^{t}{>}\}\) is from T. Each field fi is annotated by a unique attribute ai, and is a token sequence.

With tailored multi-semantic field embeddings as the base, we propose a flexible representation-comparison-classification framework for ER. We integrate probabilistic schema mapping into ER. We adopt inter-attention to infer probabilistic attribute associations, and utilize intra-attention to arrange attribute weights.

Representation Layer.

All dirty data to be resolved are fed into DAE, and local field embeddings are generated. A field \(f_{i}^{s}\) from record rs is represented as \({\varvec{h}}_{i}^{s}\). Then rs is represented as \(\boldsymbol{H}^{s} = [\boldsymbol{h}_{1}^{s} , \ldots ,\boldsymbol{h}_{m}^{s} ]\), where [·,·] denotes vector or matrix concatenation.

$${\varvec{h}}_{i}^{s} = {\text{DAE}} (f_{i}^{s} )$$
(11)

Comparison Layer.

This layer includes field alignment, comparison and weighting. Record comparisons are bidirectional; we specify only rs → rt for simplicity. The layer aligns fields from rs to rt probabilistically, compares records at the field level, and assigns field weights. Its output is a pair of directional record similarities.

Field Alignment.

We build probabilistic schema mapping from rs to rt with inter-attention [14]. For each field representation \({\varvec{h}}_{i}^{s}\) of record rs, its soft-aligned representation is computed with all field representations of record rt. Soft field alignment jointly analyzes two records, and results in pairwise field proximities from \(\boldsymbol{H}^{s}\) to \(\boldsymbol{H}^{t}\), denoted as \(\boldsymbol{\alpha }^{s \to t}\). Field level inter-attention score from \(\boldsymbol{H}^{s}\) to \(\boldsymbol{H}^{t}\) is \((\boldsymbol{H}^{s} )^{{\text{T}}} \boldsymbol{W}^{s \to t} \boldsymbol{H}^{t}\), where \(\boldsymbol{W}^{s \to t}\) is a trainable matrix. With softmax, attention scores are normalized into field alignment matrix \(\boldsymbol{\alpha }^{s \to t}\), where each entry \(\boldsymbol{\alpha }^{s \to t} (i,j)\) is proximity from \({\varvec{h}}_{i}^{s}\) to \({\varvec{h}}_{j}^{t}\). \(\widehat{\boldsymbol{H}}^{s}\) is \(\boldsymbol{H}^{s}\)’s soft-aligned representation with \(\boldsymbol{H}^{t}\).

$${\boldsymbol{\alpha }}^{s \to t} = {\text{softmax}} ((\boldsymbol{H}^{s} )^{{\text{T}}} {\boldsymbol{W}}^{s \to t} {\boldsymbol{H}}^{t} )$$
(12)
$$\widehat{\boldsymbol{H}}^{s} = {\boldsymbol{H}}^{t} (\boldsymbol{\alpha }^{s \to t} )^{{\text{T}}}$$
(13)
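A PyTorch sketch of formulas 12 and 13, treating H^s (d × m) and H^t (d × n) as matrices whose columns are field embeddings; the module name and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FieldAlignment(nn.Module):
    """Probabilistic schema mapping via inter-attention (formulas 12 and 13)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))   # trainable W^{s->t}
        nn.init.xavier_uniform_(self.W)

    def forward(self, H_s, H_t):
        scores = H_s.t() @ self.W @ H_t     # (m, n) inter-attention scores
        alpha = F.softmax(scores, dim=-1)   # field alignment matrix alpha^{s->t} (formula 12)
        H_s_hat = H_t @ alpha.t()           # soft-aligned representation of H^s (formula 13)
        return alpha, H_s_hat
```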

Field Comparison.

For each \({\varvec{h}}_{i}^{s}\) and its soft-aligned representation \(\widehat{{\varvec{h}}}_{i}^{s}\), we compute their element-wise absolute difference \(|{\varvec{h}}_{i}^{s} - \widehat{{\varvec{h}}}_{i}^{s} |\) and Hadamard product \({\varvec{h}}_{i}^{s} \odot \widehat{{\varvec{h}}}_{i}^{s}\). The concatenation of these two interactions is fed into a two-layer highway network, which generates a compact similarity representation \(\widetilde{{\varvec{h}}}_{i}^{s}\). Formula 14, organized record-wise, presents the initial field level similarity from rs to rt. Up to this point, all fields play equally important roles in comparisons.

$$\widetilde{\boldsymbol{H}}^{s} = {\text{Highway}} ([|\boldsymbol{H}^{s} - \widehat{\boldsymbol{H}}^{s} |,{\boldsymbol{H}}^{s} \odot \widehat{\boldsymbol{H}}^{s} ])$$
(14)

Field Weighting.

As is commonly known, different fields do not contribute equally to record similarities. We introduce an intra-attention mechanism [14] to capture field importance in the similarity representations. \(\widetilde{\boldsymbol{H}}^{s}\)’s intra-attention score is computed as the product of \(\widetilde{\boldsymbol{H}}^{s}\) and a trainable global context vector \(\boldsymbol{c}^{s}\). The attention scores are normalized with softmax into the intra-attention \(\boldsymbol{\beta }^{s}\). The weighted similarity representation \(\boldsymbol{s}^{s \to t}\) from rs to rt is obtained by applying \(\boldsymbol{\beta }^{s}\) to the initial similarity \(\widetilde{\boldsymbol{H}}^{s}\).

$$\boldsymbol{\beta }^{s} = {\text{softmax}} ((\widetilde{\boldsymbol{H}}^{s} )^{{\text{T}}} \boldsymbol{c}^{s} )$$
(15)
$$\boldsymbol{s}^{s \to t} = \widetilde{\boldsymbol{H}}^{s} (\boldsymbol{\beta }^{s} )^{{\text{T}}}$$
(16)
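A sketch covering formulas 14-16, under the assumptions that fields are the columns of H^s and that the highway network keeps the 2d-dimensional input width; the Highway formulation here is the standard one and not necessarily the exact variant used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """Standard highway layers: y = g * relu(W x) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.lins = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for lin, gate in zip(self.lins, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * F.relu(lin(x)) + (1 - g) * x
        return x

class FieldCompareWeight(nn.Module):
    """Field comparison (formula 14) and intra-attention weighting (formulas 15-16)."""
    def __init__(self, dim):
        super().__init__()
        self.highway = Highway(2 * dim)
        self.c = nn.Parameter(torch.randn(2 * dim))    # global context vector c^s

    def forward(self, H_s, H_s_hat):
        inter = torch.cat([(H_s - H_s_hat).abs(), H_s * H_s_hat], dim=0)  # (2d, m)
        H_tilde = self.highway(inter.t()).t()          # similarity representations (formula 14)
        beta = F.softmax(H_tilde.t() @ self.c, dim=0)  # field weights beta^s (formula 15)
        return H_tilde @ beta                          # weighted similarity s^{s->t} (formula 16)
```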

ER Classification Layer.

We build a binary ER classifier with a highway network and softmax. The concatenation of the similarities \(\boldsymbol{s}^{s \to t}\) and \(\boldsymbol{s}^{s \leftarrow t}\) is fed into a two-layer fully connected highway network, whose output is the aggregated similarity \(\boldsymbol{s}^{s \leftrightarrow t}\). \(\boldsymbol{s}^{s \leftrightarrow t}\) is then fed into a softmax classifier, and the final output is the ER distribution \(P(y|r^{s} ,r^{t} )\).

$$\boldsymbol{s}^{s \leftrightarrow t} = {\text{Highway}} ([\boldsymbol{s}^{s \to t} ,\boldsymbol{s}^{s \leftarrow t} ])$$
(17)
$$P(y|r^{s} ,r^{t} ) = {\text{softmax}} (\boldsymbol{Ws}^{s \leftrightarrow t} + b)$$
(18)

The ER model is trained by minimizing the cross-entropy loss OER, where yl is the ground-truth label and ypre is the predicted label.

$$O_{ER} = crossEntropy(y_{l} ,y_{pre} )$$
(19)
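A sketch of the classification layer (formulas 17-19), reusing the Highway module sketched in the comparison layer above; the dimensions and the use of F.cross_entropy for O_ER are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERClassifier(nn.Module):
    """Aggregates the two directional similarities and classifies match vs. non-match.
    `Highway` is the module sketched in the comparison layer above."""
    def __init__(self, sim_dim):
        super().__init__()
        self.highway = Highway(2 * sim_dim)    # two-layer highway over [s^{s->t}, s^{s<-t}]
        self.out = nn.Linear(2 * sim_dim, 2)   # logits for the softmax in formula 18

    def forward(self, s_st, s_ts):
        s_both = self.highway(torch.cat([s_st, s_ts], dim=-1))   # aggregated similarity (formula 17)
        return self.out(s_both)

# Training objective O_ER (formula 19), given batched logits and 0/1 labels:
# loss = F.cross_entropy(logits, labels)
```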

5 Experimental Evaluation

5.1 Experiments Setup

Datasets.

As illustrated in Table 1, there are three groups of datasets for evaluation, including standard data and two types of hard data: dirty data and complex data. We implement an enhanced variant of the UIS data generator [15] (eUIS for short) to help generate dirty and complex data. We generate a standard person dataset Person-Person (PP), comprising two partitions with the same schema: name, telephone, address, city, state and zip code. In PP, there are duplicates between the two partitions, but no duplicates inside each partition. Later, we construct a dirty version PP1 and a complex version PP2 from PP.

(1) Standard data. There are three standard datasets, DBLP-Scholar (DS), DBLP-ACM (DA) and Fodors-Zagats (FZ) [4], which are well structured, are perfectly one-to-one aligned in schemas, and contain simple fields with few errors.

(2) Dirty data. Two dirty datasets, PP1 and BR1, are derived from the standard datasets PP and BeerAdvo-RateBeer (BR) [4] respectively. There are errors and value misplacements in dirty data. We generate a dirty dataset from a standard dataset in two steps: error injection and field misplacement. (a) With a probability of 25%, errors are injected into a selected field of a record, including edit errors (random character insertion, deletion, replacement and swap) and token errors (random token repeat, insertion, deletion, replacement and swap). (b) With a probability of 40%, one field is randomly selected and moved into another attribute of the same record. (A code sketch of these two steps appears after this list.)

    Table 1. Dataset descriptions.
(3) Complex data. Two complex datasets, PP2 and BR2, are derived from the standard datasets PP and BR [4] respectively. There is at least one one-to-many attribute association between the schemas of different data sources. We construct a complex dataset from a standard dataset in two steps: error injection and attribute merging. (a) Error injection here is similar to that of dirty data generation, except that the probability is 20%. (b) Then a subset of attributes is merged into a complex attribute. For PP2 (from PP), name and address are merged into a complex attribute name-address in partition one; name and telephone are merged into a complex attribute name-telephone, and address, city and zip code are merged into a second complex attribute address-city-zipcode in partition two. For BR2 (from BR), Beer Name and Brew Factory Name are merged into a complex attribute BN-BFN in BeerAdvo; Beer Name, Style and ABV are merged into a complex attribute BN-style-ABV in RateBeer.
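The sketch below (referenced from the dirty data item above) illustrates the two dirtying steps in Python. It is not the actual eUIS generator, and the single error type shown (a random character insertion) stands in for the fuller set of edit and token errors described above.

```python
import random
import string

def dirty_record(record, p_err=0.25, p_move=0.40):
    """Dirty-data generation sketch: (a) with probability p_err, inject an error
    into one field (here a single random character insertion as a stand-in for
    edit/token errors); (b) with probability p_move, move one field's value
    into another attribute of the same record."""
    record = dict(record)
    attrs = list(record)
    if random.random() < p_err:
        attr = random.choice(attrs)
        text = record[attr]
        if text:
            pos = random.randrange(len(text))
            record[attr] = text[:pos] + random.choice(string.ascii_lowercase) + text[pos:]
    if random.random() < p_move and len(attrs) >= 2:
        src, dst = random.sample(attrs, 2)
        record[dst] = record[src]          # value misplacement
        record[src] = ''
    return record
```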

Metric.

Our work focuses on the resolution quality of ER. We use the common F1 measure for ER evaluation: \(F_{1} = 2PR/(P + R)\), where P is precision and R is recall. P is the proportion of predicted matches that are true matches, and R is the proportion of true matches that are correctly predicted.
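For completeness, a small helper computing P, R and F1 from binary match predictions and ground-truth labels; the function name is illustrative.

```python
def f1_score(preds, labels):
    """Precision, recall and F1 over binary match predictions (1 = match)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```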

Settings.

The information network based data embedding is implemented in C++, and the ER model is implemented in Python (PyTorch). All experiments are run on a server with 8 CPU cores (Intel(R) E5-2667, 3.2 GHz), 64 GB memory, and an NVIDIA GeForce GTX 980 Ti.

Following previous works [13, 16,17,18], the network embedding dimensionality is set to d = 128. Each dataset is split 3:1:1 for training, validation and testing of ER tasks. The number of epochs, mini-batch size and dropout rate are 15, 16 and 0.1, respectively. Adam is used as the optimization algorithm.

5.2 Comparisons with Existing Works

We compare our approach, DAE based ER (DAER), with the existing graph based deep ER approaches EmbDI [5] and GraphER [6], and two deep ER baselines DeepER [3] and DeepMatcher [4], on three types of data.

Figure 4 illustrates the overall performances of the five ER approaches on three standard datasets: DS, DA and FZ. All approaches achieve relatively comparable (and good) performance on standard data, and the F1 gaps between approaches are usually small. Specifically, DAER outperforms the other approaches on DS and DA, with ΔF1 ranging from 0.2% to 3%; all five approaches achieve the same F1 on FZ. This is mainly because these standard datasets are easy to resolve.

Figure 5 illustrates the overall performances of the five ER approaches on two dirty datasets: BR1 and PP1. In general, DAER clearly outperforms the other four approaches on dirty data. On BR1, the ΔF1 between DAER and the others is at least 8.1%; on PP1, it is at least 11.6%. There are many typos, token errors and, even more, value misplacements in dirty data, which make the data hard to resolve. DAER's improvements mainly come from its local field representations: the tailored field representations capture multitype semantics, including token level (breaking attribute boundaries), record level and attribute level semantics, and are learned based on similarities; both properties are essential for similarity computation in ER.

Figure 6 depicts the overall performances of the five ER approaches on two complex datasets: BR2 and PP2. Overall, DAER surpasses the other four approaches on complex data. On BR2, the ΔF1 between DAER and the others is at least 5.6%; on PP2, it is at least 13.3%. There are complex attribute associations in the schemas of complex data; moreover, complex data contain typos and token errors. Hence complex data are difficult to resolve. We attribute DAER's advantages over previous approaches to the following aspects: (1) tailored local field representations, which capture multitype semantics and are similarity driven, and (2) the proposed ER model, which integrates flexible schema mapping into ER.

Fig. 4. General comparisons on standard data.

Fig. 5. General comparisons on dirty data.

Fig. 6. General comparisons on complex data.

5.3 Detailed Analysis

We evaluate key components of our proposed solution in detail.

Effect of Graph Embedding.

Data embedding via multitype information networks is our major contribution. We compare different graph embedding methods for data embedding in ER. We use the classical graph embedding methods PTE [18] and Node2Vec [16] for local field representations, with all other parts unchanged, denoted as DAER-PTE and DAER-N2V respectively. In DAER-PTE, PTE is directly used for local field embedding. In DAER-N2V, Node2Vec replaces our bipartite network embedding method (for each information network embedding) in local field representations. On the three standard datasets, the three approaches have comparable performances, as Fig. 7 shows. DAER overall outperforms the other two approaches in F1 on both dirty data and complex data. As Fig. 8 illustrates, the ΔF1 between DAER and the others is at least 7.7% on the two dirty datasets. As Fig. 9 illustrates, the ΔF1 between DAER and the others is at least 9.1% on the two complex datasets. These advantages show that our multitype information networks based data embedding is effective in ER. Our data embedding captures multitype semantics and considers object similarities, both of which are essential for similarity computation in ER.

Fig. 7. Graph embedding tests on standard data.

Fig. 8. Graph embedding tests on dirty data.

Fig. 9. Graph embedding tests on complex data.

Effect of ER Model.

Probabilistic schema mapping (PSM) and field weighting (FW) are two key components of our ER model, and we test their effects on the three types of data. DAER-[-FW] is DAER without FW, and DAER-[-PSM] is DAER with vanilla schema mapping instead of PSM. Figures 10, 11 and 12 illustrate the results on the three standard datasets, the two dirty datasets and the two complex datasets respectively. On standard datasets, the ΔF1 between DAER and the other two variants is minor. On both dirty and complex datasets, DAER clearly outperforms the variants in F1. Especially on complex datasets, removing PSM reduces accuracy much more than removing FW, due to the existence of many complex attribute associations. The evaluation results confirm that PSM and FW are effective components of our ER model, especially for dirty data and complex data, which commonly exist in the real world.

Fig. 10. ER model tests on standard data.

Fig. 11. ER model tests on dirty data.

Fig. 12. ER model tests on complex data.

6 Related Work

Entity resolution attracts multiple research communities, such as databases, data mining and machine learning [1, 19]. Currently, deep learning is strongly driving ER research. DeepER is a pioneering deep ER work [3], which builds an ER system with distributed word representations and LSTMs; it also investigates DL based blocking for ER efficiency. DeepMatcher defines a design space of deep ER, including attribute embedding, attribute similarity representation and classification [4], and introduces four methods: heuristic-based, RNN-based, attention-based and hybrid. There are also graph based deep ER works [5, 6]. GraphER is a token-centric approach, which utilizes a GCN (graph convolutional network) to aggregate token-level comparisons [6]. EmbDI creates embeddings of relational data for data integration tasks, such as schema mapping and ER [5]. EmbDI constructs a graph with tokens, attributes and records, and runs vanilla random walks over the graph to generate sentences that describe similarities across objects (like DeepWalk [12]). However, EmbDI disregards the key role of fields in graph construction, and does not fully utilize the heterogeneity of objects (tokens, attributes and records) and their semantic relations when learning embeddings.

Along with rapid DL developments, word embeddings have been widely used in NLP tasks. Trained over large NLP corpora, word embeddings map words into a compact vector space that preserves syntactic and semantic word relationships. As a milestone, word2vec proposes two neural language models, skip-gram and CBOW [7, 8], which learn high-quality word vectors with simple but effective neural architectures. Word2vec had a profound influence on later word embeddings, and also inspired other embeddings, such as graph embeddings [20]. GloVe incorporates global information via matrix factorization and local information via context windows into word representations [9], improving performance. Regarding unseen words, FastText extends the skip-gram model with character n-grams [10], where words are represented as sums of n-gram vectors.

Vertex embedding, a core branch of graph embedding, maps vertices into a low dimensional vector space by embedding graph structures [20]. Inspired by word2vec, DeepWalk captures the “context” of a vertex by running random walks and utilizes skip-gram as the learning model [12], where the generated walks play the role of sentences. Following DeepWalk, node2vec introduces biased random walks to diversify neighborhoods [16]; it guides random walks by configuring a mixture of BFS (breadth-first search) and DFS (depth-first search). LINE learns vertex embeddings by combining first-order and second-order proximities [17]. Incorporating both unlabeled and labeled information, PTE extends LINE for semi-supervised text data embedding [18]. As a versatile vertex similarity embedding framework, VERSE embeds graphs by reconstructing similarity distributions between vertices [13]. Our bipartite network embedding method is an improvement of VERSE adapted to ER tasks. In heterogeneous information networks, metapath2vec defines meta-path based random walks and exploits a heterogeneous skip-gram model to learn vertex embeddings [21].

7 Conclusion

In this work, we study how to locally embed data with attributes for ER tasks. Data are modeled as a family of information networks in which multitype semantic relations are preserved. Tailored multi-semantic distributed representations of fields are learned by collectively embedding these information networks. In particular, a similarity driven method is proposed to embed each bipartite information network. With the generated field embeddings, ER is carried out in a flexible representation-comparison-classification framework. Extensive experimental evaluations over several datasets show that our approach is an effective solution. In the future, an interesting research direction is how to apply our DAER approach to transfer learning for ER, which is meaningful for low-resource scenarios.