1 Introduction

Data in organizations (such as enterprises and hospitals) are dispersed across many sources in various formats. It is important to identify all records that correspond to the same real-world entity across data sources, a fundamental task called entity resolution (ER) in data integration and data governance [1, 2]. ER, also known as entity matching and record linkage, is a long-standing research topic in databases, data mining and machine learning.

Recently, deep learning (DL) has been extensively explored in ER research [3,4,5,6]. Although deep learning based ER (deep ER) has achieved remarkable progress, there is still room for improvement. A fundamental aspect of deep ER models is data embedding [3, 4]. The majority of existing works utilize universal word embeddings, such as word2vec [7, 8], GloVe [9] and FastText [10], all of which are pre-trained on large natural language processing (NLP) corpora. Despite their universality, pre-trained word embeddings show deficiencies on ER tasks. ER mainly processes records from data with attributes (as illustrated in Fig. 1) [1], which are more rigorously organized than free texts in NLP. Data with attributes subsume structured data (like data in relational databases) and semi-structured data (like CSV and JSON data). In the following, data refer to data with attributes unless otherwise specified. Universal word embeddings fail to capture attribute-related semantics, as detailed later. Also, ER datasets are often deeply domain-specific, like enterprise data and medical data. Such data commonly involve custom words outside the universal vocabulary, which are missed by pre-trained word embeddings; this is known as the out-of-vocabulary (OOV) problem. Thus, tailored local embeddings of data with attributes are preferred in ER.

Fig. 1. An example of data with attributes in our ER setting. In such data, records are annotated with attributes, as in ds1 and ds2. Each field is uniquely fixed by both a record and an attribute, such as f11 = r1[Item] = “Lenovo ThinkPad X1 Carbon Gen 7”, f12 = r1[Model] = “20QD001WUS”, and f13 = r1[Category] = “Laptops / Notebooks”. There might be complex (directed) attribute associations: from Item in ds1 to Title and Brand in ds2; from Title in ds2 to Item and Model in ds1. Also, there might be dirty data: in r2, the field of Category should be “Laptops/Notebooks”, which is however misplaced in the field of Model.

Let us analyze the data in Fig. 1 in detail. These data are hybridly hierarchical. In the two-dimensional data hierarchy, there are four roles: attributes, records, fields and tokens. A schema consists of several unordered attributes, which reflects its set nature. Each record is instantiated following the schema, where each field corresponds to a unique attribute. In such an organization, a field is horizontally related to other fields in the same record (row), and vertically related to other fields in the same attribute (column). A field itself is a token sequence, i.e., a short text. Specifically, as shown in Fig. 2, there are three types of semantic relations: attribute-field, where each field is annotated by a unique attribute; record-field, where a record is composed of several fields; and field-token, where a field is composed of several tokens. Hence the semantics in data is multitype, including schema semantics and hierarchical instance semantics (i.e., fields in a record and tokens in a field).

Traditional word or text embeddings [7,8,9,10,11], which originate from free texts in NLP, can neither properly process the complex hierarchy nor fully capture the multitype semantic relations in our data. To learn tailored local embeddings of data with attributes for ER, there are two major challenges. (1) The first is how to build a unified data model that encodes these multitype semantic relations. (2) The second is how to learn effective data embeddings from this unified data model for downstream ER tasks. On one hand, local data embeddings should fully capture both schema semantics and hierarchical instance semantics; on the other hand, local embeddings should be similarity driven, considering downstream ER tasks. EmbDI creates local embeddings for data integration tasks [5] and makes certain progress on our topic. Yet EmbDI is limited in the following aspects. Its data model ignores the key role of fields, which reduces the representational completeness of data with attributes. Its embedding method leverages vanilla random walks from DeepWalk [12], and does not fully exploit the differences among objects of distinct types or the heterogeneity of semantic relations among such objects.

Fig. 2. Multitype semantic relations in data with attributes. Each arrowed line indicates a specific semantic relation type.

In this work, our goal is to generate tailored multi-semantic data embeddings for ER. First, we model data with attributes as a family of multitype bipartite information networks, including the attribute-field network, record-field network and field-token network, each of which captures a specific type of semantic relations. Second, we learn multi-semantic local embeddings of data with attributes through these multitype information networks, which are tailored for ER. Since fields play key roles in both data organization and record comparisons, we choose to learn distributed representations of fields by collectively embedding the three bipartite information networks. In this way, field embeddings fully capture the hybrid hierarchy of schema semantics and instance semantics. Regarding the similarity requirements of ER, we propose a similarity driven method for bipartite information network embedding, which maps vertices into a low dimensional space according to their effectively measured similarities. In essence, the embedding proximity distribution should be consistent with the measured similarity distribution. Third, we introduce a hierarchical representation-comparison-classification framework for ER. Schema mapping is not given a priori in our ER setting, but can be inferred from field embeddings within the ER framework. Finally, we carry out comprehensive experiments over three types of datasets to evaluate the proposed approach. The evaluations demonstrate our improvements over previous works and examine the different components of our approach.

Summary of contributions.

  • We propose to represent data with attributes as a family of bipartite information networks, which fully preserve multitype semantic relations among records, attributes, fields and tokens in data.

  • We propose to learn tailored multi-semantic distributed representations of data for ER tasks by collectively embedding multitype information networks. In particular, we design a similarity based bipartite network embedding method.

  • We propose a flexible representation-comparison-classification framework of deep ER, into which probabilistic schema mapping is integrated.

  • We conduct extensive experimental evaluations on seven datasets of three types, which show the effectiveness of our approach and the effects of its components.

Organization of the rest of the paper. Section 2 formalizes the problem. Section 3 specifies multi-semantic data embedding through multitype information networks. Section 4 presents a representation-comparison-classification framework for ER. Section 5 conducts experimental evaluations over three types of datasets. Section 6 reviews related works. Section 7 concludes the whole work.

2 Problem Formalization

Entity resolution (ER) determines whether multiple records correspond to the same real-world entity. ER is essentially a classification problem, which can be solved with deep neural networks [3, 4], a.k.a. deep ER. This work focuses on learning tailored local data embeddings for ER tasks, a fundamental problem in deep ER model building [4]. Basically, we model data as multitype information networks and generate multi-semantic distributed representations of data by network embedding.

Multitype Information Networks for Data with Attributes.

As Fig. 2 illustrates, there are three semantic relation types in data. To incorporate multitype semantic relations into a unified representation, we define a family of multitype information networks and Fig. 3 presents an example. We choose to model data as multitype bipartite information networks rather than a single heterogeneous information network, because different types of semantic relations are not comparable.

Definition 1. Attribute-Field Network.

Attribute-field network is a weighted bipartite graph GAF = (A ∪ F, EAF, WAF), where A is a set of attributes, F is a set of fields, and EAF is the set of edges between attributes and fields. The weight wij of the edge between attribute ai and field fj is set to 1 uniformly.

Attribute-field network connects schema and instance, going from the abstract to its instantiation. The attribute set size is usually small; for instance, there are three attributes in ds1 or ds2 of Fig. 1. The number of fields corresponding to each attribute depends on the number of records, which is flexible. Thus the total number of fields can be very large.

Definition 2. Record-Field Network.

Record-field network is a weighted bipartite graph GRF = (R ∪ F, ERF, WRF), where R is a set of records, F is a set of fields, and ERF is the set of edges between records and fields. The weight wij of the edge between record ri and field fj is set to 1 uniformly.

Record-field network is an affiliation from ensembles to components. A record has n fields, where n is the attribute set size.

Definition 3. Field-Token Network.

Field-token network is a weighted bipartite graph GFT = (F ∪ T, EFT, WFT), where F is a set of fields, T is a set of tokens, and EFT is the set of edges between fields and tokens. The weight wij of the edge between field fi and token tj is set to the number of times token tj appears in field fi, and is normalized into (0, 1] with the max number.

Field-token network captures token co-occurrences at the field level, which expresses fine-grained data semantics. Field semantics majorly stems from token level semantics, and so does field similarity.
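To make Definitions 1-3 concrete, the following Python sketch builds the three bipartite networks from a list of records, each given as an attribute-to-text dictionary. The function name and data layout are illustrative assumptions, not the paper's actual (C++) implementation.

```python
from collections import defaultdict

def build_networks(records):
    """Build the attribute-field, record-field and field-token bipartite networks
    (Definitions 1-3) from a list of records, each an attribute->text dict.
    Attribute-field and record-field edges get weight 1; field-token edges are
    token counts normalized into (0, 1] by the maximum count within the field."""
    g_af, g_rf, g_ft = defaultdict(dict), defaultdict(dict), defaultdict(dict)
    for rid, record in enumerate(records):
        for attr, text in record.items():
            fid = (rid, attr)                      # a field is fixed by a record and an attribute
            g_af[attr][fid] = 1.0                  # attribute-field edge
            g_rf[rid][fid] = 1.0                   # record-field edge
            counts = defaultdict(int)
            for tok in text.lower().split():
                counts[tok] += 1
            max_cnt = max(counts.values(), default=1)
            for tok, cnt in counts.items():
                g_ft[fid][tok] = cnt / max_cnt     # normalized field-token weight
    return g_af, g_rf, g_ft
```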

Fig. 3. A family of information networks constructed with data ds1 and ds2 in Fig. 1. All three information networks are partially illustrated due to space limits.

Data with Attributes Embedding for Entity Resolution.

With the above three information networks, we formalize our problem. Our goal is to learn distributed representations of data with attributes that are optimized for downstream ER tasks. Among the four types of objects in the hybrid hierarchy, fields act as hubs connecting attributes, records and tokens, as the attribute-field, record-field and field-token networks together show. Fields, consisting of sequential tokens, are primary elements of records, and meanwhile are semantically constrained by attributes. Since fields play the role of hubs across multitype information networks and are the basic units for record comparisons, we choose to learn field embeddings for ER.

Definition 4. Data with Attributes Embedding (DAE) for Entity Resolution.

Given a collection of data with attributes from one or several data sources, the goal of DAE is to learn multi-semantic distributed representations of fields by embedding the multitype information networks built from the collection into a low dimensional vector space. Field embeddings should fully capture both schema semantics and hierarchical instance semantics. Also, field embeddings should be similarity oriented, where proximity in the embedding space should be consistent with an effective similarity.

3 Multi-semantic Data Embedding Through Multitype Information Networks

Basically, we embed data with attributes through multitype information networks, and the output is tailored embeddings of fields, which cover all datasets to be resolved. An essential problem of ER is similarity computation, which calls for similarity based data embedding. Inspired by [13], we propose a common neighbor similarity based bipartite information network embedding method. Then we generate multi-semantic field embeddings by collectively leveraging the three information networks (of different types) constructed from data with attributes.

3.1 Similarity Based Bipartite Network Embedding

We embed bipartite networks with a novel common neighbor similarity.

Common Neighbor Similarity.

For a bipartite network G = (VA ∪ VB, EAB, WAB), VA and VB are two disjoint vertex sets of different types, and EAB is the edge set between them. Generally, the similarity between two vertices of the same type is indirectly indicated by their common neighbors of the other type, since such vertices are never linked directly. Given two vertices vi and vj from VA, their similarity can be measured as follows.

$$sim_{cn} (v_{i} ,v_{j} ) = \frac{{\sum\nolimits_{{v_{k} \in N(v_{i} ) \cap N(v_{j} )}} {\frac{1}{{d(v_{k} )}}(w_{ik} + w_{jk} )} }}{{\sum\nolimits_{{v_{m} \in N(v_{i} )}} {\frac{1}{{d(v_{m} )}}w_{im} + \sum\nolimits_{{v_{n} \in N(v_{j} )}} {\frac{1}{{d(v_{n} )}}w_{jn} } } }}$$
(1)

N(vi) is the neighbor set of vi; d(vi) is the degree of vi. Our vertex similarity is a weighted variant of the Dice similarity. We weight both edges and vertices. For a vertex vi and its neighbor vk, their edge eik is naturally weighted as wik by network G. As a neighbor, vertex vk is weighted by 1/d(vk), which is inspired by the classical IDF (inverse document frequency). Thus, for vertex vi, the importance of its neighbor vk is measured by both vk’s weight and their edge weight, expressed as (1/d(vk))wik. Finally, the denominator on the right-hand side of formula 1 sums vi’s weighted neighbors and vj’s weighted neighbors, and the numerator sums their weighted common neighbors. Note that each common neighbor is counted twice, since the neighbor is linked to vi and vj separately.

Take the field-token network as an example. The larger the proportion of tokens two fields share, the more similar the two fields are; the more fields a token occurs in, the less the token contributes to field similarities.
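A minimal Python sketch of formula 1, assuming a bipartite network stored as a dict that maps each same-type vertex to its weighted neighbors (as produced by the construction sketch above); the helper names are illustrative.

```python
from collections import defaultdict

def neighbor_degrees(graph):
    """Degree of each opposite-type vertex: the number of same-type vertices it links to."""
    deg = defaultdict(int)
    for nbrs in graph.values():
        for v in nbrs:
            deg[v] += 1
    return deg

def sim_cn(graph, deg, v_i, v_j):
    """Common neighbor similarity (formula 1): each neighbor is weighted by 1/degree
    (IDF-like) times the edge weight; common neighbors are counted once per endpoint."""
    n_i, n_j = graph[v_i], graph[v_j]
    num = sum((n_i[v] + n_j[v]) / deg[v] for v in set(n_i) & set(n_j))
    den = sum(w / deg[v] for v, w in n_i.items()) + sum(w / deg[v] for v, w in n_j.items())
    return num / den if den > 0 else 0.0
```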

Bipartite Network Embedding.

Here we focus on embedding vertices of the same type in a bipartite network. Considering the similarity computation needs of ER, such as field similarities, the learned vertex feature representations are supposed to reflect the distribution of a given vertex similarity sim: V × V → \(\mathbb{R}\) over all vertices of the same type in a bipartite network.

We define neighborhoods in a bipartite network as follows: two vertices of one type belong to the same neighborhood(s) if they share at least one common neighbor vertex of the other type; otherwise, they are separated into different neighborhoods. Vertices sharing similar neighborhoods in the network should be mapped close to each other in the embedding space.

Let f: V → \(\mathbb{R}\)d be the mapping function from vertex v to its feature representation f(v) (a d-dimensional vector), which we want to learn. To model proximity in the embedding space, we define the conditional probability of vertex vj in set VA given vertex vi in set VA, as shown in formula 2. This is actually a normalized proximity.

$$pxt(v_{j} |v_{i} ) = \frac{{\exp (f(v_{j} )^{{\text{T}}} f(v_{i} ))}}{{\sum\nolimits_{{v_{k} \in V_{A} }} {\exp (f(v_{k} )^{{\text{T}}} f(v_{i} ))} }}$$
(2)

Then given a bipartite network similarity sim(·, ·), we generate its similarity distribution. For instance, the conditional similarity of vj given vi is defined in formula 3.

$$p_{sim} (v_{j} |v_{i} ) = \frac{{sim(v_{i} ,v_{j} )}}{{\sum\nolimits_{{v_{k} \in V_{A} }} {sim(v_{i} ,v_{k} )} }}$$
(3)

We want the embedding proximity to be consistent with the given similarity. Thus, we define the objective function as Kullback-Leibler (KL) divergence between the embedding proximity distribution and the given similarity distribution, as formula 4 shows, and minimize it.

$$O_{emb} = \sum\nolimits_{{v_{i} \in V_{A} }} {KL(p_{sim} ( \cdot |v_{i} )||pxt( \cdot |v_{i} ))}$$
(4)

Omitting some constants in formula 4, the objective function can be rewritten as formula 5, which is cross-entropy.

$$O_{emb} = - \sum\nolimits_{{v_{i} \in V_{A} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(5)

This objective can be optimized with asynchronous stochastic gradient descent (ASGD). However, it is computationally expensive to calculate the conditional probabilities psim(·|vi) and pxt(·|vi), which require summation over the entire vertex set. To address this issue, we adopt the negative sampling (NEG) method [8], which, for each positive sample, selects K negative samples according to some noise distribution. A positive sample is a pair of vertices (of the same type) sharing neighborhood(s), where they are neighbors of each other; a negative sample is a pair of vertices (of the same type) sharing no neighborhood. The adoption of NEG makes our model scalable. Formally, we define the negative sampling objective as formula 6.

$$O_{NEG} = \sum\limits_{\begin{subarray}{l} v_{i} \in V_{A} \\ v_{j} \sim p_{sim} ( \cdot |v_{i} ) \end{subarray} } {[\log \sigma (f(v_{j} )^{{\text{T}}} f(v_{i} )) + \sum\limits_{k \in [1,K]} {{\mathbb{E}}_{{v_{k} \sim P_{n} (v)}} \log \sigma ( - f(v_{k} )^{{\text{T}}} f(v_{i} ))} ]}$$
(6)

\(\sigma (x) = (1 + e^{ - x} )^{ - 1}\) is the sigmoid function. The first term models a positive sample, where vj is sampled from the neighborhoods of vi and their similarity sim(·, ·) is positive; the second term models K negative samples randomly selected from the noise distribution Pn(v), which is set following [8].

Embeddings of attribute-field network, record-field network and field-token network can all be learned by the proposed model. In our model, sim(·, ·) is set to our proposed common neighbor similarity simcn(·, ·).
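To illustrate how formula 6 can be optimized in practice, here is a minimal PyTorch sketch of the same-type vertex embedding model. The class name, tensor layout and initialization are assumptions (the paper's implementation is in C++), and the sampling of positives from psim(·|vi) and negatives from Pn(v) is left abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BipartiteSimEmbedding(nn.Module):
    """Similarity-driven embedding of the same-type vertex set V_A (formula 6):
    positives v_pos are drawn from p_sim(.|v_i), negatives v_neg from a noise
    distribution P_n(v); the loss is the negated NEG objective."""
    def __init__(self, num_vertices, dim=128):
        super().__init__()
        self.emb = nn.Embedding(num_vertices, dim)
        nn.init.uniform_(self.emb.weight, -0.5 / dim, 0.5 / dim)

    def forward(self, v_i, v_pos, v_neg):
        h_i = self.emb(v_i)                                           # (batch, dim)
        h_pos = self.emb(v_pos)                                       # (batch, dim)
        h_neg = self.emb(v_neg)                                       # (batch, K, dim)
        pos_score = (h_i * h_pos).sum(-1)                             # f(v_j)^T f(v_i)
        neg_score = torch.bmm(h_neg, h_i.unsqueeze(-1)).squeeze(-1)   # (batch, K)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1))
        return loss.mean()
```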

3.2 Multi-semantic Embedding for Data with Attributes

There are three information networks: attribute-field, record-field and field-token. Fields occur in all of them, and are also what we want to embed for downstream ER tasks. Each network indicates a unique affiliation relation and has a particular semantic interpretation. The field-token network contains token level semantics, which fundamentally contributes to field semantics and plays a key role in field similarity computations. The record-field network reflects record level semantics, where each small set of fields co-occurs in the same record context. The attribute-field network reflects schema semantics, where each (large) set of fields is constrained in the same attribute context.

Field embeddings should contain all multitype semantics. Therefore, field representations are collectively learned through the three bipartite information networks. We define the collective objective function (formula 7) for multi-semantic field embedding, and minimize it.

$$O_{all} = \alpha O_{FT} + \beta O_{RF} + \gamma O_{AF}$$
(7)
$$O_{FT} = - \sum\nolimits_{{v_{i} \in F \in G_{FT} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(8)
$$O_{RF} = - \sum\nolimits_{{v_{i} \in F \in G_{RF} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(9)
$$O_{AF} = - \sum\nolimits_{{v_{i} \in F \in G_{AF} }} {p_{sim} ( \cdot |v_{i} )\log pxt( \cdot |v_{i} )}$$
(10)

OFT is the objective function for the field-token network embedding, ORF is the objective function for the record-field network embedding, and OAF is the objective function for the attribute-field network embedding. The hyperparameters α, β, γ (α + β + γ = 1) are weights for the different objectives, which control the contribution of each network embedding to the overall field embedding.

We train the model jointly, utilizing all three types of networks. Since edges from different networks are not comparable, we interleave updates of the different network embeddings. Hence the model is updated network by network.
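A sketch of the interleaved update scheme, assuming a shared field-embedding model (such as the BipartiteSimEmbedding sketch above) and per-network batch samplers; the sampler interface and the optimizer choice are illustrative.

```python
import itertools
import torch

def train_multi_semantic(model, samplers, weights, num_steps, lr=0.025):
    """Minimize O_all (formula 7) by cycling through the three networks,
    one mini-batch per network per turn. `samplers` maps 'FT'/'RF'/'AF' to a
    function returning (v_i, v_pos, v_neg) index tensors for that network;
    `weights` holds alpha/beta/gamma under the same keys."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _, name in zip(range(num_steps), itertools.cycle(['FT', 'RF', 'AF'])):
        v_i, v_pos, v_neg = samplers[name]()
        loss = weights[name] * model(v_i, v_pos, v_neg)
        opt.zero_grad()
        loss.backward()
        opt.step()
```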

4 Flexible Entity Resolution with Multi-semantic Data Embeddings

There are two data sources S with schema \([{a}_{1}^{s},\dots ,{a}_{m}^{s}]\) and T with schema \([{a}_{1}^{t},\dots ,{a}_{n}^{t}]\). Entity resolution determines if two records rs and rt correspond to the same real-world entity, where \({r}^{s}=\{{<}{a}_{1}^{s},{f}_{1}^{s}{>},\dots ,{<}{a}_{m}^{s},{f}_{m}^{s}{>}\}\) is from S and \({r}^{t}=\{{<}{a}_{1}^{t},{f}_{1}^{t}{>},\dots ,{<}{a}_{n}^{t},{f}_{n}^{t}{>}\}\) is from T. Each field fi is annotated by a unique attribute ai, and is a token sequence.

With tailored multi-semantic field embeddings as the base, we propose a flexible representation-comparison-classification framework for ER. We integrate probabilistic schema mapping into ER. We adopt inter-attention to infer probabilistic attribute associations, and utilize intra-attention to arrange attribute weights.

Representation Layer.

All dirty data to be resolved are fed into DAE, and local field embeddings are generated. A field \(f_{i}^{s}\) from record rs is represented as \({\varvec{h}}_{i}^{s}\). Then rs is represented as \(\boldsymbol{H}^{s} = [\boldsymbol{h}_{1}^{s} , \ldots ,\boldsymbol{h}_{m}^{s} ]\), where [·,·] denotes vector or matrix concatenation.

$${\varvec{h}}_{i}^{s} = {\text{DAE}} (f_{i}^{s} )$$
(11)

Comparison Layer.

This layer includes field alignment, comparison and weighting. Record comparisons are bidirectional; we specify only rs → rt for simplicity. The layer aligns fields from rs to rt probabilistically, compares records at the field level, and assigns field weights. Its output is a pair of directional record similarities.

Field Alignment.

We build probabilistic schema mapping from rs to rt with inter-attention [14]. For each field representation \({\varvec{h}}_{i}^{s}\) of record rs, its soft-aligned representation is computed with all field representations of record rt. Soft field alignment jointly analyzes two records, and results in pairwise field proximities from \(\boldsymbol{H}^{s}\) to \(\boldsymbol{H}^{t}\), denoted as \(\boldsymbol{\alpha }^{s \to t}\). Field level inter-attention score from \(\boldsymbol{H}^{s}\) to \(\boldsymbol{H}^{t}\) is \((\boldsymbol{H}^{s} )^{{\text{T}}} \boldsymbol{W}^{s \to t} \boldsymbol{H}^{t}\), where \(\boldsymbol{W}^{s \to t}\) is a trainable matrix. With softmax, attention scores are normalized into field alignment matrix \(\boldsymbol{\alpha }^{s \to t}\), where each entry \(\boldsymbol{\alpha }^{s \to t} (i,j)\) is proximity from \({\varvec{h}}_{i}^{s}\) to \({\varvec{h}}_{j}^{t}\). \(\widehat{\boldsymbol{H}}^{s}\) is \(\boldsymbol{H}^{s}\)’s soft-aligned representation with \(\boldsymbol{H}^{t}\).

$${\boldsymbol{\alpha }}^{s \to t} = {\text{softmax}} ((\boldsymbol{H}^{s} )^{{\text{T}}} {\boldsymbol{W}}^{s \to t} {\boldsymbol{H}}^{t} )$$
(12)
$$\widehat{\boldsymbol{H}}^{s} = {\boldsymbol{H}}^{t} (\boldsymbol{\alpha }^{s \to t} )^{{\text{T}}}$$
(13)
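A PyTorch sketch of formulas 12 and 13, treating H^s (d × m) and H^t (d × n) as matrices whose columns are field embeddings; the module name and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FieldAlignment(nn.Module):
    """Probabilistic schema mapping via inter-attention (formulas 12 and 13)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))   # trainable W^{s->t}
        nn.init.xavier_uniform_(self.W)

    def forward(self, H_s, H_t):
        scores = H_s.t() @ self.W @ H_t     # (m, n) inter-attention scores
        alpha = F.softmax(scores, dim=-1)   # field alignment matrix alpha^{s->t} (formula 12)
        H_s_hat = H_t @ alpha.t()           # soft-aligned representation of H^s (formula 13)
        return alpha, H_s_hat
```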

Field Comparison.

For each \({\varvec{h}}_{i}^{s}\) and its soft-aligned representation \(\widehat{{\varvec{h}}}_{i}^{s}\), we compute their element-wise absolute difference \(|{\varvec{h}}_{i}^{s} - \widehat{{\varvec{h}}}_{i}^{s} |\) and Hadamard product \({\varvec{h}}_{i}^{s} \odot \widehat{{\varvec{h}}}_{i}^{s}\). The concatenation of these two interactions is fed into a two-layer highway network, which generates a compact similarity representation \(\widetilde{{\varvec{h}}}_{i}^{s}\). Formula 14, organized record-wise, presents the initial field level similarity from rs to rt. Up to this point, all fields play equally important roles in comparisons.

$$\widetilde{\boldsymbol{H}}^{s} = {\text{Highway}} ([|\boldsymbol{H}^{s} - \widehat{\boldsymbol{H}}^{s} |,{\boldsymbol{H}}^{s} \odot \widehat{\boldsymbol{H}}^{s} ])$$
(14)

Field Weighting.

As is commonly known, different fields do not contribute equally to record similarities. We introduce an intra-attention mechanism [14] to capture field importance in the similarity representations. \(\widetilde{\boldsymbol{H}}^{s}\)’s intra-attention score is computed as the product of \(\widetilde{\boldsymbol{H}}^{s}\) and a trainable global context vector \(\boldsymbol{c}^{s}\). The attention scores are normalized with softmax into the intra-attention \(\boldsymbol{\beta }^{s}\). The weighted similarity representation \(\boldsymbol{s}^{s \to t}\) from rs to rt is obtained by applying \(\boldsymbol{\beta }^{s}\) to the initial similarity \(\widetilde{\boldsymbol{H}}^{s}\).

$$\boldsymbol{\beta }^{s} = {\text{softmax}} ((\widetilde{\boldsymbol{H}}^{s} )^{{\text{T}}} \boldsymbol{c}^{s} )$$
(15)
$$\boldsymbol{s}^{s \to t} = \widetilde{\boldsymbol{H}}^{s} (\boldsymbol{\beta }^{s} )^{{\text{T}}}$$
(16)
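A sketch covering formulas 14-16, under the assumptions that fields are the columns of H^s and that the highway network keeps the 2d-dimensional input width; the Highway formulation here is the standard one and not necessarily the exact variant used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """Standard highway layers: y = g * relu(W x) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.lins = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for lin, gate in zip(self.lins, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * F.relu(lin(x)) + (1 - g) * x
        return x

class FieldCompareWeight(nn.Module):
    """Field comparison (formula 14) and intra-attention weighting (formulas 15-16)."""
    def __init__(self, dim):
        super().__init__()
        self.highway = Highway(2 * dim)
        self.c = nn.Parameter(torch.randn(2 * dim))    # global context vector c^s

    def forward(self, H_s, H_s_hat):
        inter = torch.cat([(H_s - H_s_hat).abs(), H_s * H_s_hat], dim=0)  # (2d, m)
        H_tilde = self.highway(inter.t()).t()          # similarity representations (formula 14)
        beta = F.softmax(H_tilde.t() @ self.c, dim=0)  # field weights beta^s (formula 15)
        return H_tilde @ beta                          # weighted similarity s^{s->t} (formula 16)
```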

ER Classification Layer.

We build a binary ER classifier with a highway network and softmax. The concatenation of the similarities \(\boldsymbol{s}^{s \to t}\) and \(\boldsymbol{s}^{s \leftarrow t}\) is fed into a two-layer fully connected highway network, whose output is the aggregated similarity \(\boldsymbol{s}^{s \leftrightarrow t}\). \(\boldsymbol{s}^{s \leftrightarrow t}\) is then fed into a softmax classifier, and the final output is the ER distribution \(P(y|r^{s} ,r^{t} )\).

$$\boldsymbol{s}^{s \leftrightarrow t} = {\text{Highway}} ([\boldsymbol{s}^{s \to t} ,\boldsymbol{s}^{s \leftarrow t} ])$$
(17)
$$P(y|r^{s} ,r^{t} ) = {\text{softmax}} (\boldsymbol{Ws}^{s \leftrightarrow t} + b)$$
(18)

The ER model is trained by minimizing the cross-entropy loss OER, where yl is the ground-truth label and ypre is the predicted label.

$$O_{ER} = crossEntropy(y_{l} ,y_{pre} )$$
(19)
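A sketch of the classification layer (formulas 17-19), reusing the Highway module sketched in the comparison layer above; the dimensions and the use of F.cross_entropy for O_ER are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERClassifier(nn.Module):
    """Aggregates the two directional similarities and classifies match vs. non-match.
    `Highway` is the module sketched in the comparison layer above."""
    def __init__(self, sim_dim):
        super().__init__()
        self.highway = Highway(2 * sim_dim)    # two-layer highway over [s^{s->t}, s^{s<-t}]
        self.out = nn.Linear(2 * sim_dim, 2)   # logits for the softmax in formula 18

    def forward(self, s_st, s_ts):
        s_both = self.highway(torch.cat([s_st, s_ts], dim=-1))   # aggregated similarity (formula 17)
        return self.out(s_both)

# Training objective O_ER (formula 19), given batched logits and 0/1 labels:
# loss = F.cross_entropy(logits, labels)
```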

5 Experimental Evaluation

5.1 Experiments Setup

Datasets.

As illustrated in Table 1, there are three groups of datasets for evaluation, including standard data and two types of hard data: dirty data and complex data. We implement an enhanced variant of the UIS data generator [15] (eUIS for short) to help generate dirty and complex data. We generate a standard person dataset Person-Person (PP), comprising two partitions with the same schema: name, telephone, address, city, state and zip code. In PP, there are duplicates between the two partitions, but no duplicates inside each partition. Later, we construct a dirty version PP1 and a complex version PP2 from PP.

(1) Standard data. There are three standard datasets, DBLP-Scholar (DS), DBLP-ACM (DA) and Fodors-Zagats (FZ) [4], which are well structured, are perfectly one-to-one aligned in schemas, and contain simple fields with few errors.

(2) Dirty data. Two dirty datasets, PP1 and BR1, are derived from the standard datasets PP and BeerAdvo-RateBeer (BR) [4] respectively. There are errors and value misplacements in dirty data. We generate a dirty dataset from a standard dataset in two steps: error injection and field misplacement. (a) With a probability of 25%, errors are injected into a selected field of a record, including edit errors (random character insertion, deletion, replacement and swap) and token errors (random token repeat, insertion, deletion, replacement and swap). (b) With a probability of 40%, one field is randomly selected and moved into another attribute of the same record. (A code sketch of these two steps appears after this list.)

    Table 1. Dataset descriptions.
(3) Complex data. Two complex datasets, PP2 and BR2, are derived from the standard datasets PP and BR [4] respectively. There is at least one one-to-many attribute association between the schemas of different data sources. We construct a complex dataset from a standard dataset in two steps: error injection and attribute merging. (a) Error injection here is similar to that of dirty data generation, except that the probability is 20%. (b) Then a subset of attributes is merged into a complex attribute. For PP2 (from PP), name and address are merged into a complex attribute name-address in partition one; name and telephone are merged into a complex attribute name-telephone, and address, city and zip code are merged into a second complex attribute address-city-zipcode in partition two. For BR2 (from BR), Beer Name and Brew Factory Name are merged into a complex attribute BN-BFN in BeerAdvo; Beer Name, Style and ABV are merged into a complex attribute BN-style-ABV in RateBeer.
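The sketch below (referenced from the dirty data item above) illustrates the two dirtying steps in Python. It is not the actual eUIS generator, and the single error type shown (a random character insertion) stands in for the fuller set of edit and token errors described above.

```python
import random
import string

def dirty_record(record, p_err=0.25, p_move=0.40):
    """Dirty-data generation sketch: (a) with probability p_err, inject an error
    into one field (here a single random character insertion as a stand-in for
    edit/token errors); (b) with probability p_move, move one field's value
    into another attribute of the same record."""
    record = dict(record)
    attrs = list(record)
    if random.random() < p_err:
        attr = random.choice(attrs)
        text = record[attr]
        if text:
            pos = random.randrange(len(text))
            record[attr] = text[:pos] + random.choice(string.ascii_lowercase) + text[pos:]
    if random.random() < p_move and len(attrs) >= 2:
        src, dst = random.sample(attrs, 2)
        record[dst] = record[src]          # value misplacement
        record[src] = ''
    return record
```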

Metric.

Our work focuses on the resolution quality of ER. We use the common F1 measure for ER evaluation: \(F_{1} = 2PR/(P + R)\), where P is precision and R is recall. P is the proportion of predicted matches that are true matches, and R is the proportion of true matches that are correctly predicted.
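For completeness, a small helper computing P, R and F1 from binary match predictions and ground-truth labels; the function name is illustrative.

```python
def f1_score(preds, labels):
    """Precision, recall and F1 over binary match predictions (1 = match)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```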

Settings.

The information network based data embedding is implemented in C++, and the ER model is implemented in Python (PyTorch). All experiments are run on a server with 8 CPU cores (Intel(R) E5-2667, 3.2 GHz), 64 GB memory, and an NVIDIA GeForce GTX 980 Ti.

Following previous works [13, 16,17,18], the network embedding dimensionality is set to d = 128. Each dataset is split 3:1:1 for training, validation and testing of ER tasks. The number of epochs, mini-batch size and dropout rate are 15, 16 and 0.1, respectively. Adam is used as the optimization algorithm.

5.2 Comparisons with Existing Works

We compare our approach, DAE based ER (DAER), with the existing graph based deep ER approaches EmbDI [5] and GraphER [6], and two deep ER baselines DeepER [3] and DeepMatcher [4], on three types of data.

Figure 4 illustrates the overall performances of the five ER approaches on three standard datasets: DS, DA and FZ. All approaches achieve relatively comparable (and good) performance on standard data, and the F1 gaps between approaches are usually small. Specifically, DAER outperforms the other approaches on DS and DA, with ΔF1 ranging from 0.2% to 3%; all five approaches achieve the same F1 on FZ. This is mainly because these standard datasets are easy to resolve.

Figure 5 illustrates the overall performances of the five ER approaches on two dirty datasets: BR1 and PP1. In general, DAER clearly outperforms the other four approaches on dirty data. On BR1, the ΔF1 between DAER and the others is at least 8.1%; on PP1, it is at least 11.6%. There are many typos, token errors and, even more, value misplacements in dirty data, which make the data hard to resolve. DAER's improvements mainly come from its local field representations: the tailored field representations capture multitype semantics, including token level (breaking attribute boundaries), record level and attribute level semantics, and are learned based on similarities; both properties are essential for similarity computation in ER.

Figure 6 depicts the overall performances of the five ER approaches on two complex datasets: BR2 and PP2. Overall, DAER surpasses the other four approaches on complex data. On BR2, the ΔF1 between DAER and the others is at least 5.6%; on PP2, it is at least 13.3%. There are complex attribute associations in the schemas of complex data; moreover, complex data contain typos and token errors. Hence complex data are difficult to resolve. We attribute DAER's advantages over previous approaches to the following aspects: (1) tailored local field representations, which capture multitype semantics and are similarity driven, and (2) the proposed ER model, which integrates flexible schema mapping into ER.

Fig. 4. General comparisons on standard data.

Fig. 5. General comparisons on dirty data.

Fig. 6. General comparisons on complex data.

5.3 Detailed Analysis

We evaluate key components of our proposed solution in detail.

Effect of Graph Embedding.

Data embedding via multitype information networks is our major contribution. We compare different graph embedding methods for data embedding in ER. We use the classical graph embedding methods PTE [18] and Node2Vec [16] for local field representations, with all other parts unchanged, denoted as DAER-PTE and DAER-N2V respectively. In DAER-PTE, PTE is directly used for local field embedding. In DAER-N2V, Node2Vec replaces our bipartite network embedding method (for each information network embedding) in local field representations. On the three standard datasets, the three approaches have comparable performances, as Fig. 7 shows. DAER overall outperforms the other two approaches in F1 on both dirty data and complex data. As Fig. 8 illustrates, the ΔF1 between DAER and the others is at least 7.7% on the two dirty datasets. As Fig. 9 illustrates, the ΔF1 between DAER and the others is at least 9.1% on the two complex datasets. These advantages show that our multitype information networks based data embedding is effective in ER. Our data embedding captures multitype semantics and considers object similarities, both of which are essential for similarity computation in ER.

Fig. 7. Graph embedding tests on standard data.

Fig. 8. Graph embedding tests on dirty data.

Fig. 9. Graph embedding tests on complex data.

Effect of ER Model.

Probabilistic schema mapping (PSM) and field weighting (FW) are two key components of our ER model, and we test their effects on the three types of data. DAER-[-FW] is DAER without FW, and DAER-[-PSM] is DAER with vanilla schema mapping instead of PSM. Figures 10, 11 and 12 illustrate the results on the three standard datasets, the two dirty datasets and the two complex datasets respectively. On standard datasets, the ΔF1 between DAER and the other two variants is minor. On both dirty and complex datasets, DAER clearly outperforms the variants in F1. Especially on complex datasets, removing PSM reduces accuracy much more than removing FW, due to the existence of many complex attribute associations. The evaluation results confirm that PSM and FW are effective components of our ER model, especially for dirty data and complex data, which commonly exist in the real world.

Fig. 10. ER model tests on standard data.

Fig. 11. ER model tests on dirty data.

Fig. 12. ER model tests on complex data.

6 Related Work

Entity resolution attracts multiple research communities, such as databases, data mining and machine learning [1, 19]. Currently, deep learning is strongly driving ER research. DeepER is a pioneering deep ER work [3], which builds an ER system with distributed word representations and LSTMs; it also investigates DL based blocking for ER efficiency. DeepMatcher defines a design space of deep ER, including attribute embedding, attribute similarity representation and classification [4], and introduces four methods: heuristic-based, RNN-based, attention-based and hybrid. There are also graph based deep ER works [5, 6]. GraphER is a token-centric approach, which utilizes a GCN (graph convolutional network) to aggregate token-level comparisons [6]. EmbDI creates embeddings of relational data for data integration tasks, such as schema mapping and ER [5]. EmbDI constructs a graph with tokens, attributes and records, and runs vanilla random walks over the graph to generate sentences that describe similarities across objects (like DeepWalk [12]). However, EmbDI disregards the key role of fields in graph construction, and does not fully utilize the heterogeneity of objects (tokens, attributes and records) and their semantic relations when learning embeddings.

Along with rapid DL developments, word embeddings have been widely used in NLP tasks. Trained over large NLP corpora, word embeddings map words into a compact vector space that preserves syntactic and semantic word relationships. As a milestone, word2vec proposes two neural language models, skip-gram and CBOW [7, 8], which learn high-quality word vectors with simple but effective neural architectures. Word2vec had a profound influence on later word embeddings, and also inspired other embeddings, such as graph embeddings [20]. GloVe incorporates global information via matrix factorization and local information via context windows into word representations [9], improving performance. Regarding unseen words, FastText extends the skip-gram model with character n-grams [10], where words are represented as sums of n-gram vectors.

Vertex embedding, a core branch of graph embedding, maps vertices into a low dimensional vector space by embedding graph structures [20]. Inspired by word2vec, DeepWalk captures the “context” of a vertex by running random walks and utilizes skip-gram as the learning model [12], where the generated walks play the role of sentences. Following DeepWalk, node2vec introduces biased random walks to diversify neighborhoods [16]; it guides random walks by configuring a mixture of BFS (breadth-first search) and DFS (depth-first search). LINE learns vertex embeddings by combining first-order and second-order proximities [17]. Incorporating both unlabeled and labeled information, PTE extends LINE for semi-supervised text data embedding [18]. As a versatile vertex similarity embedding framework, VERSE embeds graphs by reconstructing similarity distributions between vertices [13]. Our bipartite network embedding method is an improvement of VERSE adapted to ER tasks. In heterogeneous information networks, metapath2vec defines meta-path based random walks and exploits a heterogeneous skip-gram model to learn vertex embeddings [21].

7 Conclusion

In this work, we study how to locally embed data with attributes for ER tasks. Data are modeled as a family of information networks in which multitype semantic relations are preserved. Tailored multi-semantic distributed representations of fields are learned by collectively embedding these information networks. In particular, a similarity driven method is proposed to embed each bipartite information network. With the generated field embeddings, ER is carried out in a flexible representation-comparison-classification framework. Extensive experimental evaluations over several datasets show that our approach is an effective solution. In the future, an interesting research direction is how to apply our DAER approach to transfer learning for ER, which is meaningful for low-resource scenarios.