
1 Introduction

Entity resolution (ER) is a common and difficult problem in data cleaning: it is about handling unintended multiple representations of the same external objects in a database. Multiple representations lead to uncertainty in data and to the problem of managing it; cleaning the database reduces this uncertainty. In more precise terms, ER is about the identification and fusion of database records (think of rows or tuples in tables) that represent the same real-world entity [8, 15]. As a consequence, ER usually goes through two main consecutive phases: (a) detecting duplicates, and (b) merging them into single representations.

For duplicate detection, one must analyze pairs of records, comparing the two records in each pair and discriminating between pairs of duplicate records and pairs of non-duplicate records. This classification problem is approached with machine learning (ML) methods, which learn from previously known or already made classifications (a training set for supervised learning) and build a classification model (a classifier) for deciding about other record pairs [10, 15].

In principle, in ER every two records (forming a pair) have to be compared, and then classified. Most of the work on applying ML to ER works at the record level [10, 11, 21], and only some of the attributes, or their features, i.e. numerical values associated with them, may be involved in duplicate detection. The choice of relevant sets of attributes and features is application dependent.

ER may be a task of quadratic complexity since it requires comparing every two records. To reduce the large number of two-record comparisons, blocking techniques are used [2, 19, 23]. Commonly, a single record attribute, or a combination of attributes, the so-called blocking key, is used to split the database records into blocks. Next, under the assumption that any two records in different blocks are unlikely to be duplicates, only pairs of records within the same block are compared for duplicate detection.
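To make the quadratic blow-up and its standard remedy concrete, the following is a minimal Python sketch of key-based (standard) blocking; the record format and the choice of blocking key are illustrative assumptions, not taken from any system discussed here.

    from collections import defaultdict
    from itertools import combinations

    def standard_blocking(records, key_fn):
        # Group records by their blocking key; only records sharing a key
        # become candidate pairs, shrinking the quadratic comparison space.
        blocks = defaultdict(list)
        for rec in records:
            blocks[key_fn(rec)].append(rec)
        return [pair for block in blocks.values()
                for pair in combinations(block, 2)]

    # Illustrative records, blocked on the first three letters of the name.
    records = [{"id": 1, "name": "John Doe"}, {"id": 2, "name": "Joh Doe"},
               {"id": 3, "name": "Mary Roe"}]
    pairs = standard_blocking(records, lambda r: r["name"][:3].lower())
    # Only records 1 and 2 are compared; record 3 sits in its own block.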

Although blocking discards many record pairs that are obvious non-duplicates, some true duplicate pairs might be missed (by putting their records in different blocks), due to errors or typographical variations in attribute values. More importantly, similarity between blocking keys alone may fail to capture the relationships that naturally hold in the data and could be used for blocking. Thus, entity blocking based only on blocking-key similarities may cause low recall. This is a major drawback of traditional blocking techniques.

In this work we consider different and coexisting entities. For each of them, there is a collection of records. Records for different entities may be related via attributes in common or referential constraints. Blocking can be performed on each of the participating entities, and the way records for one entity are placed in blocks may influence the way the records for another entity are assigned to blocks. This is called “collective blocking”. Semantic information, in addition to that provided by blocking keys for single entities, can be used to state relationships between different entities and their corresponding similarity criteria. So, blocking decision making forms a collective and intertwined process involving several entities. In the end, the records for each individual entity are placed in blocks associated with that entity.

Example 1

Consider two entities, Author and Paper. For each of them, there is a set of records (for all practical purposes, think of database tuples in a single table). For Author we have records of the form \(\mathbf {a} = \langle { name}, \ldots , { affiliation}, \ldots ,\) \({ paper~title}, \ldots \rangle \), with {name, affiliation} the blocking key; and for Paper, records of the form \(\mathbf {p} = \langle { title}, \ldots , { author~name}, \ldots \rangle \), with title the blocking key. We want to group Author and Paper records at the same time, in an entwined process. We block together two Author entities on the basis of the similarities of authors’ names and affiliations.

Assume that Author entities \(\mathbf {a}_1, \mathbf {a}_2\) have similar names, but their affiliations are not similar. Then the two records would not be put in the same block. However, \(\mathbf {a}_1, \mathbf {a}_2\) are authors of papers (in Paper records) \(\mathbf {p_1}, \mathbf {p_2}\), resp., which have been put in the same block (of papers) on the basis of the similarity of their titles. In this case, additional semantic knowledge might specify that if two papers are in the same block, then corresponding Author records with similar author names should be put in the same block too. Then, \(\mathbf {a}_{\!1}\) and \(\mathbf {a}_2\) would end up in the same block.

In this example, we are blocking Author and Paper entities, separately, but collectively and in interaction. \(\blacksquare \)

Collective blocking is based on blocking keys and the enforcement of semantic information about the relational closeness of entities Author and Paper, which is captured by a set of matching dependencies (MDs). So, we propose “MD-based collective blocking” (more on MDs right below).

After records are divided into blocks, the proper duplicate detection process starts; it is carried out by comparing every two records in a block and classifying the pair as “duplicates” or “non-duplicates” using the trained ML model at hand. In the end, records in duplicate pairs are considered to represent the same external entity, and have to be merged into a single representation, i.e. into a single record. This second phase is also application dependent. MDs were originally proposed to support this task.

Matching dependencies are declarative logical rules that tell us under what conditions of similarity between attribute values, any two records must have certain attribute values merged, i.e. made identical [16, 17]. For example, the MD

$$\begin{aligned} { Dept}_{\!B}[{ dept}] \approx { Dept}_{\!B}[{ dept}] \ \rightarrow \ { Dept}_{\!B}[{ city}] \doteq { Dept}_{\!B}[{ city}] \end{aligned}$$
(1)

tells us that for any two records for entity (or relation or table) \({ Dept}_{\!B}\) that have similar values for attribute \({ dept}\), their values for attribute \({ city}\) should be matched, i.e. made the same.

MDs as introduced in [17] do not specify how to merge values. In [6, 7], MDs were extended with matching functions (MFs). For a data domain, an MF specifies how to assign a value in common to two values. We adopt MDs with MFs in this work. In the end, the enforcement of MDs with MFs should produce a duplicate-free instance (cf. Section 2 for more details).
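As an illustration of these laws (not a construction from [6, 7]), token-set union is one concrete MF: it is idempotent, commutative, and associative, and the induced partial order is set inclusion. The address values below are made up for the example.

    def m_address(a: frozenset, b: frozenset) -> frozenset:
        # A matching function on address values modeled as token sets;
        # set union satisfies the three MF laws, and a precedes a'
        # exactly when m_address(a, a') == a' (i.e., inclusion).
        return a | b

    a1 = frozenset({"25", "Main", "St."})
    a2 = frozenset({"25", "Main", "Street", "Ottawa"})
    assert m_address(a1, a1) == a1                     # idempotent
    assert m_address(a1, a2) == m_address(a2, a1)      # commutative
    merged = m_address(a1, a2)                         # both a1, a2 precede it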

MDs have to be specified in a declarative manner, and at some point enforced, by producing changes on the data. For this purpose, we use the LogicBlox platform, a data management system developed by the LogicBlox company, that is centered around its declarative language, LogiQL. LogiQL supports relational data management and, among several other features [1], an extended form of Datalog with stratified negation [9]. This language is expressive enough for the kind of MDs considered in this work.

In this paper, we describe our ERBlox system. It is built on top of the LogicBlox platform, and implements entity resolution (ER) by applying LogiQL, ML techniques, and the specification and enforcement of MDs. More specifically, ERBlox has three main components: (a) MD-based collective blocking, (b) ML-based duplicate detection, and (c) MD-based merging. The sets of MDs are fixed, and different for the first and last components. In both cases, the sets of MDs are interaction-free [7], which results, for each entity, in a unique set of blocks and, eventually, in a single, duplicate-free instance [7]. We use LogiQL to declaratively implement the two MD-based components of ERBlox.

The blocking phase uses MDs to specify the blocking strategy. They express conditions in terms of blocking-key similarities and also relational closeness (the semantic knowledge) under which two records are assigned to the same block (by making their block identifiers identical). Then, under MD-based collective blocking, different records of possibly several related entities are simultaneously assigned to blocks through the enforcement of MDs (cf. Sect. 5 for details).

On the ML side, the problem is about detecting pairs of duplicate records. The ML algorithm is trained using record pairs known to be duplicates or non-duplicates. We independently used three established classification algorithms: support vector machines (SVMs) [24], k-nearest neighbors (K-NN) [14], and the non-parametric Bayes classifier (NBC) [4]. We used the Ismion implementations of them due to the in-house expertise at LogicBlox. Since the emphasis of this work is on the use of LogiQL and MDs, we will refer only to our use of SVMs.

We experimented with our ERBlox system using as dataset a snapshot of Microsoft Academic Search (MAS) (as of January 2013) that includes 250K authors and 2.5M papers; it contains a training set. The experimental results show that our system improves ER accuracy over traditional blocking techniques [18], which we will call standard blocking, where just blocking-key similarities are used. Actually, MD-based collective blocking leads to higher precision and recall on the given datasets.

This paper is structured as follows. Section 2 introduces background on matching dependencies and their semantics, and SVMs. A general overview of the ERBlox system is presented in Sect. 3. The specific components of ERBlox are discussed in Sects. 4, 5, and 6. Experimental results are shown in Sect. 7. Section 8 presents conclusions.

2 Preliminaries

2.1 Matching Dependencies

We consider an application-dependent relational schema \(\mathcal { R}\), with a data domain U. For an attribute A, \({ Dom}_{\!A}\) is its finite domain. We assume predicates do not share attributes, but different attributes may share a domain. An instance D for \(\mathcal { R}\) is a finite set of ground atoms of the form \(R(c_1,\ldots , c_n)\), with \(R \in \mathcal { R}\), \(c_i \in U\).

We assume that each entity is represented by a relational predicate, and its tuples or rows in its extension correspond to records for the entity. As in [7], we assume records have unique, fixed, global identifiers, rids, which are positive integers. This allows us to trace changes of attribute values in records. Record ids are placed in an extra attribute for \(R \in \mathcal { R}\) that acts as a key. Then, records take the form \(R(r,\bar{r})\), with r the rid, and \(\bar{r} =(c_1,\ldots , c_n)\). Sometimes we leave rids implicit, and sometimes we use them to denote whole records: if r is a record identifier in instance D, \(\bar{r}\) denotes the record in D identified by r. Similarly, if \(\mathcal {A}\) is a sublist of the attributes of predicate R, then \(r[\mathcal { A}]\) denotes the restriction of \(\bar{r}\) to \(\mathcal {A}\).

MDs are formulas of the form: \(R_1[\bar{X}_1] \approx R_2[\bar{X}_2] \ \rightarrow \ R_1[\bar{Y}_1] \doteq R_2[\bar{Y}_2]\) [16, 17]. Here, \(R_1, R_2 \in \mathcal { R}\) (and may be the same); and \(\bar{X}_1, \bar{X}_2\) are lists of attribute names of the same length that are pairwise comparable, that is, \(X_1^i\) and \(X_2^i\) share the same domain, and similarly for \(\bar{Y}_1,\bar{Y}_2\). The MD says that, for every pair of tuples (one in relation \(R_1\), the other in relation \(R_2\)) where the LHS is true, the attribute values in them on the RHS have to be made identical. Symbol \(\approx \) denotes a generic, reflexive, symmetric, and application/domain dependent similarity relation on a shared attribute domain.

A dynamic, chase-based semantics for MDs with matching functions (MFs) was introduced in [7]. Given an initial instance D, the set \(\varSigma \) of MDs is iteratively enforced until they cannot be applied any further, at which point a resolved instance has been produced. In order to enforce (the RHSs of) MDs, there are binary matching functions (MFs) \(m_A\!: {Dom}_{\!A} \times {Dom}_{\!A} \rightarrow {Dom}_{\!A}\); and \(m_A(a,a')\) is used to replace two values \(a, a' \in {Dom}_{\!A}\) that have to be made identical. MFs are idempotent, commutative, and associative, and then induce a partial-order structure \(\langle {Dom}_{\!A}, \preceq _A\rangle \), with \(a \preceq _A a'\) iff \(m_A(a,a') = a'\) [5, 6]. It always holds: \(a, a' \preceq _A m_A(a,a')\). In this work, MFs are treated as built-in relations.

There may be several resolved instances for D and \(\varSigma \). However, when (a) MFs are similarity-preserving (i.e., \(a \approx a'\) implies \(a \approx m_A(a',a'')\)); or (b) \(\varSigma \) is interaction-free (i.e., no attribute appears both in a LHS and in a RHS of MDs in \(\varSigma \)), there is a unique resolved instance that is computable in polynomial time in |D| [7].

2.2 Support Vector Machines

The SVMs technique [24] is a form of kernel-based learning. SVMs can be used for classifying vectors in an inner-product vector space \(\mathcal { V}\) over \(\mathbb {R}\). Vectors are classified into two classes, with a label in \(\{0,1\}\). The algorithm learns from a training set, say \(\{(\mathbf {e}_1, f(\mathbf {e}_1)), \ldots , (\mathbf {e}_n, f(\mathbf {e}_n))\}\), where \(\mathbf {e}_i \in \mathcal { V}\) and, for the feature (function) f, \(f(\mathbf {e}_i) \in \{0, 1\}\).

SVMs find an optimal hyperplane, \(\mathcal { H}\), in \(\mathcal { V}\) that separates the two classes into which the training vectors are classified. Hyperplane \(\mathcal { H}\) has an equation of the form \(\mathbf {w} \bullet \mathbf {x} + b = 0\), where \(\bullet \) denotes the inner product, \(\mathbf {x}\) is a vector variable, \(\mathbf {w}\) is a weight vector of real values, and b is a real number. Now, a new vector \(\mathbf {e}\) in \(\mathcal { V}\) can be classified as positive or negative depending on the side of \(\mathcal { H}\) on which it lies. This is determined by computing \(h(\mathbf {e}) := { sign}(\mathbf {w} \bullet \mathbf {e} + b)\). If \(h(\mathbf {e})> 0\), \(\mathbf {e}\) belongs to class 1; otherwise, to class 0.

It is possible to compute real numbers \(\alpha _1, \ldots , \alpha _n\), such that the classifier h can be computed through: \(h(\mathbf {e}) = { sign}(\sum _i \alpha _i \cdot f(\mathbf {e}_i) \cdot \mathbf {e}_i \bullet \mathbf {e} + b)\) (cf. Fig. 3).
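A minimal numpy sketch of this decision rule follows; note that the standard dual form uses labels in \(\{-1,+1\}\), so the \(\{0,1\}\) labels \(f(\mathbf {e}_i)\) are mapped accordingly. All vectors and coefficients below are toy values.

    import numpy as np

    def svm_classify(e, support_vectors, alphas, labels01, b):
        # Dual-form decision: sign(sum_i alpha_i * y_i * (e_i . e) + b),
        # with y_i in {-1, +1} obtained from the {0, 1} labels f(e_i).
        y = 2 * np.asarray(labels01) - 1
        score = np.sum(np.asarray(alphas) * y * (support_vectors @ e)) + b
        return 1 if score > 0 else 0

    # Toy usage with two support vectors in R^2:
    sv = np.array([[1.0, 1.0], [0.0, 0.0]])
    print(svm_classify(np.array([0.9, 0.8]), sv, [1.0, 1.0], [1, 0], b=-0.5))  # -> 1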

3 Overview of ERBlox

A high-level description of the components of ERBlox is given in Fig. 1. It shows the workflow supported by ERBlox when doing ER. ERBlox’s three main components are: (1) MD-based collective blocking (path \(\mathbf {1,3,5,\{6,8\}}\)), (2) ML-based record duplicate detection (the whole initial workflow up to task \(\mathbf {13}\), inclusive), and (3) MD-based merging (path \(\mathbf {14, 15}\)). In the figure, all the boxes in light grey are supported by LogiQL. As just done, in the rest of this section, numbers in boldface refer to the edges in this figure.

Fig. 1. Overview of ERBlox

The initial input data is stored in structured text files. (We assume these data are already standardized and free of misspellings, etc., but duplicates may be present.) Our general LogiQL program that supports the whole workflow contains some rules for importing data from the files into the extensions of relational predicates (think of tables, this is edge \(\mathbf {1}\)). This results in a relational database instance T containing the training data (edge \(\mathbf {2}\)), and the instance D on which ER will be performed (edge \(\mathbf {3}\)).

Fig. 2. Feature-based similarity

The next main task is blocking, which requires similarity computation for pairs of records in D (edge \(\mathbf {5}\)). For record pairs \(\langle r_1, r_2\rangle \) in T, similarities have to be computed as well (edge \(\mathbf {4}\)). Similarity computation is based on similarity functions, \({ Sf}_i\!: { Dom}_{A_i} \times { Dom}_{A_i} \rightarrow [0,1]\), each of which assigns a numerical value, called a similarity weight, to a comparison of values for a record attribute \(A_i\) (from a pre-chosen subset of attributes) (cf. Fig. 2). A weight vector \(w(r_1,r_2) = \langle \cdots , { Sf}_i(r_1[A_i],r_2[A_i]), \cdots \rangle \) is formed from the similarity weights (edge \(\mathbf {7}\)). For more details on similarity computation see Sect. 4.
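A sketch of how such a weight vector could be assembled is shown below; the attribute names and the pairing with similarity functions are placeholders for the choices discussed in Sect. 4.

    def weight_vector(r1, r2, sim_fns):
        # w(r1, r2) = <..., Sf_i(r1[A_i], r2[A_i]), ...> over the
        # pre-chosen attributes; each Sf_i maps into [0, 1].
        return [sf(r1[attr], r2[attr]) for attr, sf in sim_fns.items()]

    # Hypothetical pairing of Paper attributes with similarity functions:
    # sim_fns = {"title": tfidf_cosine, "year": year_sim,
    #            "cid": exact_match, "keyword": tfidf_cosine}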

Since some pairs in T are considered to be duplicates and others non-duplicates, the result of this process leads to a “similarity-enhanced” database \(T^{\!s}\) of tuples of the form \(\langle r_1, r_2, w(r_1,r_2), L\rangle \), with label \(L \in \{0,1\}\) indicating if the two records are duplicates (\(L = 1\)) or not (\(L=0\)). The labels are consistent with the corresponding weight vectors. The classifier is trained using \(T^{\!s}\), leading to a classification model (edges \(\mathbf {9,10}\)).

For records in D, similarity measures are needed for blocking, to decide if two records \(r_1,r_2\) go to the same block. Initially, every record has its rid assigned as block (number). To assign two records to the same block, we use matching dependencies that specify and enforce (through their RHSs) that their blocks have to be identical. This happens when certain similarities between pairs of attribute values appearing in the LHSs of the MDs hold. For this reason, similarity computation is also needed before blocking (workflow \(\mathbf {5,6,8}\)). This similarity computation process is similar to the one for T. However, in the case of D, it does not lead directly to the same kind of weight-vector computation. Instead, similarity measures are computed only for the similarity predicates appearing in the LHSs of the blocking-MDs. (So, e.g., evaluating the LHS of (1) requires computing similarities between \({ dept}\)-string values.)

Notice that these blocking-MDs may capture semantic knowledge, so they could involve in their LHSs similarities of attribute values in records for different kinds of entities. For example, in relation to Example 1, there could be similarity comparisons involving attributes for entities Author and Paper, e.g.

$$\begin{aligned} { Author}(x_1,y_1,{ bl}_1) \wedge&{ Paper}(y_1,z_1, { bl}_3) \wedge { Author}(x_2,y_2,{ bl}_2) \wedge \nonumber \\&\qquad { Paper}(y_2,z_2, { bl}_4) \wedge x_1 \approx _1 x_2 \wedge z_1 \approx _2 z_2 \ \rightarrow \ { bl}_1 \doteq { bl}_2, \end{aligned}$$
(2)

expressing that when the similarities on the LHS hold, the blocks \({ bl}_1, { bl}_2\) have to be made identical. The similarity comparison atoms on the LHS are considered to be true when the similarity values are above predefined thresholds (edges \(\mathbf {5,8}\)).

This is the MD-based collective blocking stage that results in database D enhanced with information about the blocks to which the records are assigned. Pairs of records with the same block form candidate duplicate record pairs, and any two records with different blocks are simply not tested as possible duplicates (of each other).

Fig. 3. Classification hyperplane

After the records have been assigned to blocks, pairs of records \(\langle r_1, r_2\rangle \) in the same block are considered for the duplicate test. At this point we proceed as we did for T: the similarity vectors \(w(r_1,r_2)\) have to be computed (edges \(\mathbf {11, 12}\)). Next, tuples \(\langle r_1, r_2,w(r_1,r_2)\rangle \) are used as input for the trained classification algorithm (edge \(\mathbf {12}\)).

The result of the trained ML-based classifier, in this case obtained through SVMs as a separation hyperplane \(\mathcal { H}\), is a set M of record pairs \(\langle r_1,r_2,1\rangle \) that come from the same block and are considered to be duplicates (edge \(\mathbf {13}\)). The records in these pairs will be merged on the basis of an ad hoc set of MDs (edge \(\mathbf {15}\)), different from those used in edges \(\mathbf {6,8}\).

Informally, the merge-MDs are of the form: \(r_1 \approx r_2 \ \rightarrow \ r_1 \doteq r_2\), where the antecedent is true when \(\langle r_1, r_2,1\rangle \) is an output of the classifier. The RHS is a shorthand for: \(r_1[A_1] \doteq r_2[A_1] \wedge \cdots \wedge r_1[A_m] \doteq r_2[A_m]\), where m is the total number of record attributes. Merging at the attribute level uses the matching functions \(m_{A_i}\).

We point out that MD-based merging takes care of transitive cases produced by the classifier, e.g. if it returns \(\langle r_1, r_2,1\rangle \) and \(\langle r_2, r_3,1\rangle \), but not \(\langle r_1, r_3,1\rangle \), we still merge \(r_1, r_3\) (even when \(r_1 \approx r_3\) does not hold). Actually, we do this by merging all the records \(r_1,r_2, r_3\) into the same record. Our system is capable of recognizing this situation and solving it as expected. This relies on the way we store and manage, via our LogiQL program, the positive cases obtained from the classifier (details can be found in Sect. 6). In essence, this makes our set of merging-MDs interaction-free, and leads to a unique resolved instance [7].
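One standard way to obtain this transitive grouping is a union-find structure over the classifier's positive pairs; the sketch below only illustrates the effect, which ERBlox itself achieves through its LogiQL program (Sect. 6).

    class UnionFind:
        # Groups duplicate rids transitively: from (r1, r2) and (r2, r3),
        # the records r1, r2, r3 all end up in one merge group.
        def __init__(self):
            self.parent = {}
        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x
        def union(self, x, y):
            self.parent[self.find(x)] = self.find(y)

    uf = UnionFind()
    for r1, r2, label in [(1, 2, 1), (2, 3, 1)]:   # classifier output
        if label == 1:
            uf.union(r1, r2)
    assert uf.find(1) == uf.find(3)   # 1 and 3 merge though never compared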

The following sections provide more details on ERBlox and our approach to ER.

4 Initial Data and Similarity Computation

We now describe some aspects of the MAS dataset, highlighting the input to and the output of each component of the ERBlox system. The data is represented and provided as follows. The Author relation contains author names and their affiliations. The Paper relation contains paper titles, years, conference IDs, journal IDs, and keywords. The PaperAuthor relation contains paper IDs, author IDs, author names, and their affiliations. The Journal and Conference relations contain short names, full names, and home pages of journals and conferences, respectively. By using ERBlox on this dataset, we determine which papers in the MAS data are written by a given author. This is a clear case of ER since there are many authors who publish under several variations of their names. Also, the same paper may appear under slightly different titles, etc.

From the MAS dataset, which contains the data in structured files, extensions for intensional, relational predicates are computed by LogiQL rules of the general program, e.g.

$$\begin{aligned}& \_{ file}\_{ in}(x1,x2,x3) \rightarrow { string}(x1),{ string}(x2), { string}(x3). \end{aligned}$$
(3)
$$\begin{aligned}&{ lang:physical:filePath}[`\_{ file}\_{ in}] ="{ author.csv}". \end{aligned}$$
(4)
$$\begin{aligned}&{ +author}({ id}1,x2,x3) \leftarrow \_{ file}\_{ in}(x1, x2,x3),{ string\!:\!int}64\!:\!{ convert}[x1] = { id}1. \end{aligned}$$
(5)

Here, (3) is a predicate schema declaration (metadata uses “\(\rightarrow \)”), in this case of the \(\_{ file}\_{ in}\) predicate with three string-valued attributes, which is used to store the contents extracted from the source file, whose path is specified by (4). Derivation rules, such as (5), use the usual “\(\leftarrow \)”. In this case, it defines the author predicate, and the “\(+\)” in the rule head inserts the data into the predicate extension. The first attribute is made an identifier [1]. Figure 4 illustrates a small part of the dataset obtained by importing data into the relational predicates. (There may be missing attribute values.)

Fig. 4. Relation extensions from MAS using LogiQL rules

As described above, in ERBlox, similarity computation generates similarity weights, which are used to: (a) compute the weight vectors for the training data T and the data in D under classification; and (b) do the blocking, where similarity weights are compared with predefined thresholds for the similarity conditions in the LHSs of blocking-MDs.

We used three well-known similarity functions [13], depending on the attribute domains. “TF-IDF cosine similarity” [22] was used for computing similarities for text-valued attributes whose values are string vectors. It assigns low weights to frequent strings and high weights to rare strings, and was used for attribute values that contain frequent strings, such as affiliation. For attributes with short string values, such as author name, we applied “Jaro-Winkler similarity” [25]. Finally, for numerical attributes, such as publication year, we used “Levenshtein distance” [20], which computes the similarity of two numbers on the basis of the minimum number of operations required to transform one into the other.

Similarity computation for ERBlox is supported by LogiQL-rules that define similarity functions. In particular, similarity computations are kept in extensions of program predicates. For example, if the similarity weight of values \(a_1,a_2\) for attribute \({ Title}\) is above the threshold, a tuple \({ TitleSim}(a_1,a_2)\) is created by the program.
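As an illustration of this materialization step, the sketch below computes TitleSim-like facts with scikit-learn's TF-IDF machinery; the 0.8 threshold and the use of scikit-learn are assumptions made for the example only, since ERBlox keeps these computations in LogiQL predicate extensions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def title_sim_facts(titles, threshold=0.8):
        # Materialize TitleSim(a1, a2) for title pairs whose TF-IDF
        # cosine similarity weight is above the (hypothetical) threshold.
        sims = cosine_similarity(TfidfVectorizer().fit_transform(titles))
        return {(titles[i], titles[j])
                for i in range(len(titles))
                for j in range(i + 1, len(titles))
                if sims[i, j] >= threshold}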

5 MD-Based Collective Blocking and Duplicate Detection

Since every record has an identifier, rid, initially each record uses its rid as its block number, stored in an extra attribute \({ Bl\#}\). In this way, we create the initial blocking instance from the initial instance D, also denoted by D. Now, blocking strategies are captured by means of (blocking) MDs of the form:

$$\begin{aligned} R_i(\bar{X}_1,{ Bl}_1) \wedge R_i(\bar{X}_2,{ Bl}_2) \wedge \psi (\bar{X}_3) \ \rightarrow \ { Bl}_1\doteq { Bl}_2. \end{aligned}$$
(6)

Here \({ Bl}_1, { Bl}_2\) are variables for block numbers, and \(R_i\) is a database (record) predicate. The lists of variables \(\bar{X}_1, \bar{X}_2\) stand for all the attributes in \(R_i\), except \({ Bl\#}\). Formula \(\psi \) is a conjunction of relational atoms and comparison atoms via similarity predicates; it does not contain similarity comparisons of block numbers, such as \({ Bl}_3\!\approx { Bl}_4\). The variables in the list \(\bar{X}_3\) appear in \(R_i\), in another database predicate, or in a similarity atom. It holds that \((\bar{X}_1 \cup \bar{X}_2) \cap \bar{X}_3 \ne \emptyset \). For an example, see (2), where \(R_i\) is Author.

In order to enforce these MDs on two records, we use a binary matching function \(m_{{ Bl\#}}\) to make two block numbers identical: \(m_{{ Bl\#}}({ bl}_1,{ bl}_2) := { max}\{{ bl}_1,{ bl}_2\}\) (cf. rule (11) below). More generally, for the application-dependent set, \(\varSigma ^{{ Bl}}\), of blocking-MDs we adopt the chase-based semantics for entity resolution [7]. Since this set of MDs is interaction-free, its enforcement results in a single instance \(D^{\!{ Bl}}\), where now records may share block numbers, in which case they belong to the same block. Every record is assigned to a single block.
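To see the chase at work, the following sketch enforces blocking-MDs with the max matching function until a fixpoint is reached. For simplicity, the MD bodies are abstracted into a precomputed set of record pairs; properly collective MDs such as (9) and (10) below would instead be re-evaluated against the current block numbers in every round, which the fixpoint loop accommodates.

    def enforce_blocking_mds(blocks, md_pairs):
        # Chase-style enforcement with m(bl1, bl2) = max{bl1, bl2}:
        # whenever an MD fires on records r1, r2, both block numbers
        # become the larger one; iterate until nothing changes.
        changed = True
        while changed:
            changed = False
            for r1, r2 in md_pairs:          # pairs on which MD bodies hold
                m = max(blocks[r1], blocks[r2])
                if blocks[r1] != m or blocks[r2] != m:
                    blocks[r1] = blocks[r2] = m
                    changed = True
        return blocks

    # Initially each record's block number is its rid.
    print(enforce_blocking_mds({1: 1, 2: 2, 3: 3}, [(1, 2), (2, 3)]))
    # -> {1: 3, 2: 3, 3: 3}: all three records end up in block 3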

Example 2

These are some of the blocking-MDs used for the MAS dataset:

$$\begin{aligned}& { Paper}({ pid}_1,x_1,y_1,z_1,w_1,v_1,{ bl}_1) \wedge { Paper}({ pid}_2,x_2,y_2,z_2,w_2,v_2,{ bl}_2) \wedge \\& \ x_1 \approx _{\!{ Title}} x_2 \ \wedge \ y_1=y_2 \ \wedge z_1=z_2 \rightarrow \ { bl}_1 \doteq { bl}_2. \nonumber \end{aligned}$$
(7)
$$\begin{aligned}& { Author}({ aid}_1, x_1,y_1, { bl}_1) \ \wedge \ { Author}({ aid}_2, x_2,y_2, { bl}_2) \wedge \\& \ x_1 \approx _{\!{ Name}} x_2 \ \wedge \ y_1\approx _{\!{ Aff}}y_2 \rightarrow \ { bl}_1 \doteq { bl}_2.\nonumber \end{aligned}$$
(8)
$$\begin{aligned}& { Paper}({ pid}_1,x_1,y_1,z_1,w_1,v_1,{ bl}_1) \ \wedge \ { Paper}({ pid}_2,x_2,y_2,z_2,w_2,v_2,{ bl}_2) \wedge \\&{ PaperAuthor}({ pid}_1, { aid}_1, x'_1,y'_1) \ \wedge { PaperAuthor}({ pid}_2, { aid}_2, x'_2,y'_2) \wedge \nonumber \\& { Author}({ aid}_1, x'_1,y'_1, { bl}_3) \ \wedge { Author}({ aid}_2, x'_2,y'_2, { bl}_3) \wedge x_1 \approx _{\!{ Title}} x_2 \rightarrow \ { bl}_1 \doteq { bl}_2.\nonumber \end{aligned}$$
(9)
$$\begin{aligned}& { Author}({ aid}_1, x_1,y_1, { bl}_1) \ \wedge { Author}({ aid}_2, x_2,y_2, { bl}_2) \ \wedge \ x_1 \approx _{\!{ Name}} x_2\wedge \\& { PaperAuthor}({ pid}_1, { aid}_1, x_1,y_1) \ \wedge { PaperAuthor}({ pid}_2, { aid}_2, x_2,y_2) \wedge \nonumber \\&{ Paper}({ pid}_1,x'_1,y'_1,z'_1,w'_1,v'_1,{ bl}_3) \wedge { Paper}({ pid}_2,x'_2,y'_2,z'_2,w'_2,v'_2,{ bl}_3) \rightarrow \ { bl}_1 \doteq { bl}_2.\nonumber \end{aligned}$$
(10)

Informally, (7) tells us that, for every two Paper entities \(\mathbf {p}_1, \mathbf {p}_2\) whose values for attribute \({ Title}\) are similar and that have the same publication year and conference ID, the values for attribute \({ Bl\#}\) must be made the same. By (8), whenever there are similar values for name and affiliation in Author, the corresponding authors should be in the same block. Furthermore, (9) and (10) collectively block Paper and Author entities. For instance, (9) states that if two authors are in the same block, their papers \(\mathbf {p}_1\), \(\mathbf {p}_2\) with similar titles must be in the same block too. Notice that if papers \(\mathbf {p}_1\) and \(\mathbf {p}_2\) have similar titles, but do not have the same publication year or conference ID, we cannot block them together using (7) alone. \(\blacksquare \)

We now show how these MDs are represented in LogiQL, and how we use LogiQL programs for the declarative specification of MD-based collective blocking. In LogiQL, an MD takes the form:

$$\begin{aligned} R_i[\bar{X}_1]\!=\!{ Bl}_2 , \ \ R_i[\bar{X}_2]\!=\!{ Bl}_2 \ \ \longleftarrow \ \ \ R_i[\bar{X}_1]={ Bl}_1, \ R_i[\bar{X}_2]={ Bl}_2, \ \psi (\bar{X}_3), \ { Bl}_1 < { Bl}_2, \end{aligned}$$
(11)

subject to the same conditions as in (6). An atom \(R_i[\bar{X}]\!\!=\!\!{ Bl}\) states that predicate \(R_i\) is functional on \(\bar{X}\) [1]. It means each record in \(R_i\) can have only one block number \({ Bl\#}\).

Given an initial instance D, a LogiQL program \(\mathcal { P}^{B}(D)\) that specifies MD-based collective blocking contains the following (kind of) rules:

  • 1. For every atom \(R({ rid}, \bar{x}, { bl})\) \(\in \) D, the fact \(R[{ rid}, \bar{x}]={ bl}\). (Initially, \({ bl} := { rid}\).)

  • 2. For every attribute A of \(R_i\), facts of the form \(\text{ A-Sim }(a_1,a_2)\), with \(a_1, a_2 \in { Dom}_{\!A}\), the finite attribute domain. They are obtained by similarity computation.

  • 3. The blocking-MDs as in (11).

  • 4. Rules to represent the consecutive versions of entities during MD-enforcement:

    $$\begin{aligned}&R\text{- }{ OldVersion}(r_1,\bar{x}_1, { bl}_1) \ \ \leftarrow \ \ R[r_1, \bar{x}_1]={ bl}_1, \ R[r_1, \bar{x}_1]={ bl}_2, \ { bl}_1 < { bl}_2. \end{aligned}$$

    For each rid, r, there could be several atoms of the form \(R[r,\bar{x}]\!=\! { bl}\), corresponding to the evolution of the record identified by r due to MD-enforcement. The rule specifies that versions of records with lower block numbers are old.

  • 5. Rules that collect the latest versions of records. They are used to form blocks:

    $$\begin{aligned}&R\text{- }{ MDBlock}[r_1,\bar{x}_1]= { bl}_1 \;\;\leftarrow \;\;R[r_1,\bar{x}_1]= { bl}_1, \ ! \ R\text{- }{ OldVersion}(r_1,\bar{x}_1, { bl}_1). \end{aligned}$$

In LogiQL, “!”, as in the body above, is used for negation [1]. The rule collects R-records that are not old versions.

Programs \(\mathcal { P}^{B}(D)\) as above are stratified (there is no recursion involving negation). Then, as expected in relation to the blocking-MDs, they have a single model, which can be used to read the final block number for each record.

Example 3

(ex. 2 cont.) Considering only MDs (7) and (9), the portion of \(\mathcal { P}^{B}(D)\) for blocking Paper entities has the following rules:

  • 2. Facts such as: \({ TitleSim}({ Illness\ entities\ in\ West\ Africa},{ Illness\ entities\ in\ Africa})\).              \({ TitleSim}({ DLR\ Simulation\ Environment\ m3}, { DLR\ Simulation\ Environment}).\)

  • 3. \({ Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1]={ bl}_2, { Paper}[{ pid}_2,x_2,y_2,z_2,w_2,v_2]={ bl}_2 \ \leftarrow \) \({ Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1]={ bl}_1, { Paper}[{ pid}_2,x_2,y_2,z_2,w_2,v_2]={ bl}_2,\) \({ TitleSim}(x_1,x_2), y_1=y_2, z_1=z_2, { bl}_1 < { bl}_2.\) \({ Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1]={ bl}_2, { Paper}[{ pid}_2,x_2,y_2,z_2,w_2,v_2]={ bl}_2\ \leftarrow \) \({ Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1]={ bl}_1, { Paper}[{ pid}_2,x_2,y_2,z_2,w_2,v_2]={ bl}_2, { TitleSim}(x_1,x_2),\) \( { PaperAuthor}({ pid}_1,{ aid}_1, x'_1,y'_1), { PaperAuthor}({ pid}_2,{ aid}_2, x'_2,y'_2),\) \( { Author}[{ aid}_1, x'_1,y'_1]={ bl}_3, { Author}[{ aid}_2, x'_2,y'_2]={ bl}_3, { bl}_1 < { bl}_2\).

  • 4. \({ PaperOldVersion}({ pid}_1,x_1,y_1,z_1,w_1,v_1,{ bl}_1) \! \leftarrow \! { Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1] ={ bl}_1,\) \({ Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1]={ bl}_2, { bl}_1 < { bl}_2.\)

  • 5. \({ PaperMDBlock}[{ pid}_1,\bar{x}_1]= { bl}_1 \;\;\leftarrow \;\;{ Paper}[{ pid}_1,x_1,y_1,z_1,w_1,v_1]= { bl}_1,\) \(! \ { PaperOldVersion}({ pid}_1,x_1,y_1,z_1,w_1,v_1, { bl}_1).\)

Restricting the model of the program to the relevant attributes of predicate PaperMDBlock returns: \(\{\{123, 205\}, \{195 ,769\}\}\), i.e. the papers with pids 123 and 205 are blocked together; similarly for those with pids 195 and 769.\(\blacksquare \)

As described above, the input to the trained classifier is a set of tuples of the form \(\langle r_1,r_2, w(r_1,r_2) \rangle \), with \(w(r_1,r_2)\) the computed weight vector for records (with ids) \(r_1, r_2\) in the same block.

Example 4

(ex. 3 cont.) Consider the blocks for entity Paper. If the “journal ID” values are null in both records, but not the “conference ID” values, “journal ID” is not considered for a feature. Similarly when the “conference ID” values are null. Moreover, the values for “journal ID” and “conference ID” are replaced by “journal full name” and “conference full name” values, found in Journal and Conference records, resp. In this case, then, attributes Title, Year, ConfFullName or JourFullName, and Keyword are used as the corresponding features for weight-vector computation.

Considering the previous Paper records, the input to the classifier consists of: \(\langle 123,\) \(205, w(123,205) \rangle \), with \(w(123,205) =[0.8, 1.0, 1.0, 0.7]\), and \(\langle 195,769, w(195,769)\rangle \), with \(w(195, 769)=[0.93, 1.0, 1.0, 0.5]\) (actually the contents of the two square brackets only). \(\blacksquare \)

Several ML techniques are accessible from the LogicBlox platform through the BloxMLPack library, which provides a generic Datalog interface. Thus, ERBlox can call an ML-based record duplicate detection component through the general LogiQL program. In this way, the SVMs package is invoked by ERBlox.

The output is a set of tuples of the form \(\langle r_1, r_2, 1\rangle \) or \(\langle r_1, r_2, 0\rangle \), where \(r_1,r_2\) are ids for records of entity (table) R. In the former case, a tuple \(R\text{- }{ Duplicate}(r_1,r_2)\) is created (as defined by the LogiQL program). In the previous example, the SVMs method returns \(\langle [0.8, 1.0, 1.0, 0.7],1\rangle \) and \(\langle [0.93, 1.0, 1.0, 0.5],1\rangle \), so \({ PaperDuplicate}(123, 205)\) and \({ PaperDuplicate}(195,769)\) are created.

6 MD-Based Merging

When \({ EntityDuplicate}(r_1,r_2)\) is created, the corresponding full records \(\bar{r}_1,\bar{r}_2\) have to be merged via record-level merge-MDs of the form \(R[r_1] \approx R[r_2] \ \longrightarrow \ R[\bar{r}_1] \doteq R[\bar{r}_2]\), where \(R[r_1] \approx R[r_2]\) is true when \(R\text{- }{ Duplicate}(r_1,r_2)\) has been created according to the output of the SVMs classifier. The RHS means that the two records are merged into a new full record \(\bar{r}\), with \(\bar{r}[A_i] := m_{A_i}(\bar{r}_1[A_i],\bar{r}_2[A_i])\) [7].

Example 5

(ex. 4 cont.) We merge duplicate Paper entities enforcing the MD: \({ Paper}\) \([{ pid}_1]\approx { Paper}[{ pid}_2] \ \longrightarrow \ { Paper}[{ Title},{ Year}, { CID}, { Keyword}] \doteq { Paper}[{ Title}, { Year}, { CID},\) \({ Keyword}]\).\(\blacksquare \)

The portion, \(\mathcal { P}^{\!M}\), of the general LogiQL program that represents MD-based merging contains rules as in 1.–4. below:

  • 1. The atoms of the form \(R\text{- }{ Duplicate}\) mentioned above, and those representing the matching functions (MFs) \(m_A\).

  • 2. For an MD \(R[r_1] \approx R[r_2] \ \longrightarrow \ R[\bar{r}_1] \doteq R[\bar{r}_2]\), the rule:

    $$\begin{aligned}&R[r_1, \bar{x}_3]={ bl}_1, \ R[r_2, \bar{x}_3]={ bl}_2 \ \ \leftarrow \ \ R\text{- }{ Duplicate}(r_1,r_2), \ R[r_1, \bar{x}_1]={ bl}_1, \\&\qquad \qquad R[r_2, \bar{x}_2]={ bl}_2, \ \bar{x}_3 = \bar{m}(\bar{x}_1,\bar{x}_2), \end{aligned}$$

    which creates two records (one of them can be purged afterwards) with different ids but all the other attribute values the same, computed componentwise according to the MFs \(m_A\) (denoted \(\bar{m}\) above). Here, \(\bar{x}_1, \bar{x}_2, \bar{x}_3\) stand each for all attributes of relation R, except for the id and the block number (represented by \({ bl}\)). (Block numbers play no role in merging.)

  • 3. As for program \(\mathcal { P}^{B}(D)\) given in Sect. 5, rules specify the old versions of a record:

    $$\begin{aligned}&R\text{- }{ OldVersion}(r_1,\bar{x}_1) \ \leftarrow \ R[r_1, \bar{x}_1]={ bl}, \ R[r_1, \bar{x}_2]={ bl}, \ \bar{x}_1 \prec \bar{x}_2. \end{aligned}$$

    Here, \(\bar{x}_1\) stands for all attributes other than the id and the block number; and on the RHS \(\bar{x}_1 \prec \bar{x}_2\) means componentwise comparison of values according to the partial orders defined by the MFs.

  • 4. Finally, rules to collect the latest version of each record, building the final resolved instance: \(R\text{- }{ ER}(r_1,\bar{x}_1) \;\;\leftarrow \;\;R[r_1,\bar{x}_1]= { bl}, \ ! \ R\text{- }{ OldVersion}(r_1,\bar{x}_1)\).

Notice that the derived tables \(R\text{- }{ Duplicate}\) that appear in the LHSs of the MDs (or in the bodies of the corresponding rules) are all computed before, and kept fixed during, the enforcement of the merge-MDs. In particular, a duplicate relationship between any two records is never lost. This has the effect of making the set of merging-MDs interaction-free, which results in a unique resolved instance.
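A sketch of the componentwise merge described in this section is given below; the two MFs (longest string for titles, token-set union for keywords) are illustrative stand-ins for the application's actual MFs.

    def merge_records(rec1, rec2, mfs):
        # Componentwise merge of two duplicate records: each attribute
        # value is combined by the MF of its domain (rids and block
        # numbers are excluded, as they play no role in merging).
        return {attr: mf(rec1[attr], rec2[attr]) for attr, mf in mfs.items()}

    mfs = {"title":   lambda a, b: a if len(a) >= len(b) else b,
           "keyword": lambda a, b: a | b}
    merged = merge_records(
        {"title": "DLR Simulation Environment m3", "keyword": {"sim"}},
        {"title": "DLR Simulation Environment",    "keyword": {"dlr"}},
        mfs)
    # -> {'title': 'DLR Simulation Environment m3', 'keyword': {'sim', 'dlr'}}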

7 Experimental Evaluation

We now show that our approach to ER can improve accuracy in comparison with standard blocking. In addition to the MAS dataset, we used datasets from DBLP and Cora Citation.

Fig. 5. The experiments (MAS)

In order to emphasize the importance of semantic knowledge in blocking, we consider standard blocking and two different sets of MDs, (1) and (2), for MD-based collective blocking. Under (1), we define blocking-MDs for all the blocking keys used for standard blocking, whereas under (2) we have MDs for only some of those blocking keys. In both cases, there are, in addition, properly collective blocking-MDs.

We use three measures for the comparison of blocking techniques. One is the reduction ratio, which is one minus the ratio of the number of candidate record-pairs to the total number of record-pairs. The higher this value, the fewer candidate record-pairs are generated; the quality of the generated candidate record-pairs, however, is not taken into account. We also use recall and precision measures. The former is the number of true duplicate candidate record-pairs divided by the number of true duplicate pairs, and precision is the number of true duplicate candidate record-pairs divided by the total number of candidate pairs [12].
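Under these definitions [12], the three measures can be computed as follows; candidate and true duplicate pairs are given as sets of rid pairs, and the toy numbers are made up.

    def blocking_quality(cand_pairs, true_dup_pairs, n_records):
        # Reduction ratio, recall, and precision of a blocking scheme.
        total_pairs = n_records * (n_records - 1) // 2
        reduction_ratio = 1 - len(cand_pairs) / total_pairs
        tp = len(cand_pairs & true_dup_pairs)
        recall = tp / len(true_dup_pairs)
        precision = tp / len(cand_pairs)
        return reduction_ratio, recall, precision

    # E.g., 4 records, 2 candidate pairs, 2 true duplicate pairs:
    print(blocking_quality({(1, 2), (3, 4)}, {(1, 2), (1, 3)}, 4))
    # -> (0.666..., 0.5, 0.5)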

Figures 5, 6 and 7 show the comparative performance of ERBlox. They show that standard blocking has a higher reduction ratio than MD-based collective blocking version (1), i.e. fewer candidate record-pairs are generated by standard blocking. However, the precision and recall of MD-based blocking version (1) are higher than those of standard blocking, meaning that version (1) can lead to improved ER results at the cost of larger blocks, and thus more candidate record-pairs that need to be compared.

Fig. 6. The experiments (DBLP)

In blocking, this is a common tradeoff that needs to be considered. On the one hand, a large number of smaller blocks results in fewer candidate record-pairs being generated, probably increasing the number of true duplicate record-pairs that are missed. On the other hand, blocking techniques that result in larger blocks generate a higher number of candidate record-pairs, which will likely cover more true duplicate pairs, at the cost of having to compare more candidate pairs [12]. The experiments are all done before MD-based merging.

Fig. 7. The experiments (Cora)

Interestingly, MD-based blocking version (2) has higher reduction ratio, recall, and precision than standard blocking. This emphasizes the importance of MDs supporting collective blocking, and shows that blocking based on string similarity alone fails to capture the relationships that naturally hold in the data.

As expected, the experiments show that different sets of MDs for MD-based collective blocking have different impacts on the reduction ratio, just as standard blocking depends on the choice of blocking keys. However, the quality of MD-based collective blocking, in its two versions, dominates standard blocking on the three datasets.

8 Conclusions

We have shown that matching dependencies, a new class of data quality/cleaning semantic constraints in databases, can be profitably integrated with traditional ML methods, in our case for entity resolution. They play a role not only in the intended goal of merging duplicate representations, but also in the record-blocking process that precedes the learning task. At that stage, they allow us to declaratively capture semantic information that can be used to enrich the blocking activity. MD declaration and enforcement, data processing in general, and machine learning can all be integrated using the LogiQL language.