Synonyms

Co-reference resolution; Deduplication; Duplicate detection; Identity uncertainty; Merge-purge; Object consolidation; Record linkage; Reference reconciliation

Definition

A fundamental problem in data cleaning and integration (see Data Preparation) is dealing with uncertain and imprecise references to real-world entities. The goal of entity resolution is to take a collection of uncertain entity references (or references, in short) from a single data source or multiple data sources, discover the unique set of underlying entities, and map each reference to its corresponding entity. This typically involves two subproblems – identification of references with different attributes to the same entity and disambiguation of references with identical attributes by assigning them to different entities.

Motivation and Background

Entity resolution is a common problem that comes up in different guises (and is given different names) in many computer science domains. Examples include computer vision, where we need to figure out when regions in two different images refer to the same underlying object (the correspondence problem); natural language processing, where we would like to determine which noun phrases refer to the same underlying entity (co-reference resolution); and databases, where, when merging two databases or cleaning a database, we would like to determine when two tuple records refer to the same real-world object (deduplication and data integration). Deduplication is important for removing redundancy and for accurate analysis. In information integration, determining approximate joins is important for consolidating information from multiple sources; most often there will not be a unique key that can be used to join tables across databases.

Such ambiguities in entity references can occur for multiple reasons. Oftentimes, data contain entry errors, such as typographical errors. Multiple representations, such as abbreviations, are also possible. Different databases typically use different keys – one person database may use social security numbers, while another uses name and address.

Traditional entity resolution approaches focus on matching attributes of different references for resolving entities. However, many data sources have explicit or implicit relationships present among the entity references. These relations are indicative of relationships between the underlying entities themselves. For example, person records in census data are linked by family relationships such as sibling, parent, and spouse. Researchers collaborate mostly within their organization, or their research community, as a result of which references to related researchers tend to occur closely together. Recent entity resolution approaches in statistical relational learning make use of relationships between references to improve entity resolution accuracy and additionally to discover relationships between the underlying entities.

Theory/Solution

As an illustration of the entity resolution problem, consider the task of resolving the author references in a database of academic publications similar to DBLP, CiteSeer, or PubMed. Let us take as an example the following set of four papers:

  1. W. Wang, C. Chen, and A. Ansari, “A mouse immunity model”

  2. W. Wang and A. Ansari, “A better mouse immunity model”

  3. L. Li, C. Chen, and W. Wang, “Measuring protein-bound fluoxetin”

  4. W. W. Wang and A. Ansari, “Autoimmunity in biliary cirrhosis”

Now imagine that we would like to find out, given these four papers, which of these author names refer to the same author entities. This process involves determining whether paper 1 and paper 2 are written by the same author named Wang, or whether they are different authors. We need to answer similar questions about all such similar author names in the database.

In this example, it turns out there are six underlying author entities, which we will call Wang1 and Wang2, Chen1 and Chen2, Ansari, and Li. The three references with the name “A. Ansari” correspond to author Ansari and the reference with name “L. Li” to author Li. However, the two references with name “C. Chen” map to two different authors Chen1 and Chen2. Similarly, the four references with name “W. Wang” or “W. W. Wang” map to two different authors. The “Wang” references from the first, second, and fourth papers correspond to author Wang1, while that from the third paper maps to a different author Wang2. This inference illustrates the twin problems of identifying “W. Wang” and “W. W. Wang” as the same author and disambiguating two references with name “W. Wang” as different authors. This is shown pictorially in Fig. 1, where references that correspond to the same authors are shaded identically. In the entity resolution process, all those and only those author references that are shaded identically should be resolved as corresponding to the same underlying entity.
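The six-entity resolution described above can be captured as a small ground-truth table. The sketch below makes the twin identification and disambiguation problems concrete; the entity labels (Wang1, Chen2, etc.) are those introduced in the text, while the data layout itself is illustrative:

```python
# The four papers from the running example, as (author names, title) pairs.
papers = [
    (["W. Wang", "C. Chen", "A. Ansari"], "A mouse immunity model"),
    (["W. Wang", "A. Ansari"], "A better mouse immunity model"),
    (["L. Li", "C. Chen", "W. Wang"], "Measuring protein-bound fluoxetin"),
    (["W. W. Wang", "A. Ansari"], "Autoimmunity in biliary cirrhosis"),
]

# Ground-truth entity label for each author reference, keyed by
# (paper index, position in author list).  Six underlying entities.
truth = {
    (0, 0): "Wang1", (0, 1): "Chen1", (0, 2): "Ansari",
    (1, 0): "Wang1", (1, 1): "Ansari",
    (2, 0): "Li",    (2, 1): "Chen2", (2, 2): "Wang2",
    (3, 0): "Wang1", (3, 1): "Ansari",
}

# Name equality alone neither identifies "W. Wang" with "W. W. Wang"
# (identification) nor separates the two "C. Chen" references (disambiguation).
assert truth[(0, 0)] == truth[(3, 0)]   # same entity, different names
assert truth[(0, 1)] != truth[(2, 1)]   # same name, different entities
```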

Entity Resolution, Fig. 1

The references in different papers in the bibliographic example. References to the same entity are identically shaded

Formally, in the entity resolution problem, we are given a set of references \(\mathcal{R} =\{ r_{i}\}\), where each reference r has attributes \(r.A_{1},r.A_{2},\ldots,r.A_{k}\), such as observed names and affiliations for author references, as in our example above. The references correspond to some set of unknown entities \(\mathcal{E} =\{ e_{i}\}\). We introduce the notation r.E to refer to the entity to which reference r corresponds. The goal is to recover the hidden set of entities \(\mathcal{E} =\{ e_{i}\}\) and the entity labels r.E for individual references given the observed attributes of the references. In addition to the attributes, in some data sources we have information in the form of relationships between the references, such as coauthor relationships between author references in publication databases. We can capture the relationships with a set of hyper-edges \(\mathcal{H} =\{ h_{i}\}\). Each hyper-edge h may have attributes as well to capture the attributes of relationships, which we denote \(h.A_{1},h.A_{2},\ldots,h.A_{l}\), and we use h.R to denote the set of references that it connects. In our example, each rectangle denotes one hyper-edge corresponding to one paper in the database. The first hyper-edge, corresponding to paper 1, has as its attribute the title “A mouse immunity model” and connects the three references having name attributes “W. Wang,” “C. Chen,” and “A. Ansari.” A reference r can belong to zero or more hyper-edges, and we use r.H to denote the set of hyper-edges in which r participates. For example, if we have paper, author, and venue references, then a paper reference may be connected to multiple author references and also to a venue reference. In general, the underlying references can refer to entities of different types, as in a publication database or in newspaper articles, which contain references to people, places, organizations, etc. When the type information is known for each reference, resolution decisions are restricted to references of the same type. Otherwise, the types may need to be discovered as part of the entity resolution process.
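The formal setup above maps naturally onto a pair of data structures, one for references and one for hyper-edges. The following is an illustrative sketch; the class and function names are our own, not drawn from the literature:

```python
from dataclasses import dataclass, field

@dataclass
class Reference:
    """One entity reference r."""
    attrs: dict                                      # r.A1, ..., r.Ak, e.g. {"name": ...}
    entity: object = None                            # latent entity label r.E (to be recovered)
    hyperedges: list = field(default_factory=list)   # r.H, the hyper-edges r participates in

@dataclass
class HyperEdge:
    """One hyper-edge h connecting a set of references."""
    attrs: dict                                      # h.A1, ..., h.Al, e.g. {"title": ...}
    refs: list = field(default_factory=list)         # h.R, the references it connects

def add_paper(title, author_names):
    """Build one hyper-edge (a paper) connecting its author references."""
    h = HyperEdge(attrs={"title": title})
    for name in author_names:
        r = Reference(attrs={"name": name})
        r.hyperedges.append(h)
        h.refs.append(r)
    return h

# The first paper from the running example.
h1 = add_paper("A mouse immunity model", ["W. Wang", "C. Chen", "A. Ansari"])
```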

Traditional entity resolution approaches pose entity resolution as a pairwise decision problem over references based on their attribute similarity. It can also be posed as a graph clustering problem, where references are clustered together based on their attribute similarities and each cluster is taken to represent one underlying entity. Entity resolution approaches differ in how the similarities between references are defined and computed and how the resolution decisions are made based on these similarities. Traditionally, each pairwise decision is made independently of the others. For example, the decision to resolve the two Wang references from papers 1 and 3 would be made independently of the decision to resolve the two Chen references from the same papers.

The first improvement is to account for the similarity of the coauthor names when such relationships are available. However, this still does not consider the “entities” of the related references. For the two “Wang” references in the earlier example, the two “C. Chen” coauthors match regardless of whether they refer to Chen1 or Chen2. The correct evidence to use here is that the “Chens” are not co-referent. In such a setting, in order to resolve the “W. Wang” references, it is necessary to resolve the “C. Chen” references as well, and not just consider their name similarity. In the collective relational entity resolution approach, resolutions are not made independently; instead, one resolution decision affects other resolutions via hyper-edges.

Below, we discuss the different entity resolution approaches in greater detail.

Attribute-Based Entity Resolution

As discussed earlier, exact matching of attributes does not suffice for entity resolution. Several sophisticated similarity measures have been developed for textual strings (Cohen et al. 2003; Chaudhuri et al. 2003) that may be used for unsupervised entity resolution; a weighted combination of the similarities over the different attributes of two references then yields their overall attribute similarity. An alternative is to use adaptive supervised algorithms that learn string similarity metrics from labeled data (Bilenko and Mooney 2003). In the traditional entity resolution approach (Fellegi and Sunter 1969; Cohen et al. 2003), similarity is computed for each pair of references \(r_{i},r_{j}\) based on their attributes, and only those pairs that have similarity above some threshold are considered co-referent.
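A minimal unsupervised version of this pipeline can be sketched with a character-bigram Jaccard measure standing in for the more sophisticated string similarity measures cited above; the threshold value is an illustrative assumption:

```python
from itertools import combinations

def bigrams(s):
    """Set of character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity over character bigrams -- a simple stand-in for
    the more sophisticated string measures discussed in the text."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 0.0

def attribute_matches(names, threshold=0.5):
    """Declare co-referent every pair whose name similarity clears the threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if jaccard(a, b) >= threshold]

names = ["W. Wang", "W. W. Wang", "C. Chen", "A. Ansari"]
pairs = attribute_matches(names)
```

On these names, only the “W. Wang” / “W. W. Wang” pair clears the threshold, which illustrates why attribute-only resolution handles identification of name variants but not disambiguation of identical names.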

Efficiency

Even the attribute-only approach to entity resolution is known to be computationally hard, since it is infeasible to compare all pairs of references using expensive similarity measures. Therefore, efficiency issues have long been a focus for data cleaning, the goal being the development of inexpensive algorithms for finding approximate solutions. The key mechanisms for doing this involve computing the matches efficiently and employing techniques commonly called “blocking” to quickly find potential duplicates (Hernández and Stolfo 1995; Monge and Elkan 1997), using cheap and index-based similarity computations to rule out non-duplicate pairs. Sampling approaches can quickly compute cosine similarity between tuples for fast text-joins within an SQL framework (Gravano et al. 2003). Error-tolerant indexes can also be used in data warehousing applications to efficiently look up a small but “probabilistically safe” set of reference tuples as candidates for matching an incoming tuple (Chaudhuri et al. 2003). Generic entity resolution frameworks also exist that resolve and merge duplicates as a database operator while minimizing the number of record-level and feature-level operations (Menestrina et al. 2006).
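As a sketch of blocking, the following uses a deliberately cheap key (the surname initial, an illustrative choice; real systems use more refined keys) so that the expensive pairwise comparison is applied only within blocks:

```python
from collections import defaultdict
from itertools import combinations

def block_key(name):
    """Cheap blocking key: first letter of the last token (surname initial)."""
    return name.split()[-1][0].lower()

def candidate_pairs(names):
    """Group references by blocking key; only references sharing a block
    are ever compared with the expensive similarity measure."""
    blocks = defaultdict(list)
    for n in names:
        blocks[block_key(n)].append(n)
    return [p for group in blocks.values() for p in combinations(group, 2)]

names = ["W. Wang", "W. W. Wang", "C. Chen", "A. Ansari", "L. Li"]
pairs = candidate_pairs(names)
# The 10 possible pairs shrink to 1: only the two Wang references share a block.
```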

Probabilistic Models for Pairwise Resolution

The groundwork for posing entity resolution as a probabilistic classification problem was done by Fellegi and Sunter (1969), who studied the problem of labeling pairs of records from two different files to be merged as “match” (M) or “non-match” (U) on the basis of agreement γ among their different fields or attributes. Given an agreement pattern γ, the conditional probabilities P(γ | M) and P(γ | U) of γ given matches and non-matches are computed and compared to decide whether the two references are duplicates or not. Fellegi and Sunter showed that the probabilities P(γ | M) and P(γ | U) of field agreements can be estimated without requiring labeled training data if the different field agreements are assumed to be independent. Winkler (2002) used the EM algorithm to estimate the probabilities without making the independence assumption.
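Under the independence assumption, the Fellegi-Sunter decision reduces to summing per-field log-likelihood ratios. The sketch below uses illustrative, hand-set probabilities rather than values estimated from data:

```python
import math

# m[f] = P(field f agrees | match), u[f] = P(field f agrees | non-match).
# These numbers are illustrative assumptions, not estimated parameters.
m = {"name": 0.9, "affiliation": 0.8}
u = {"name": 0.1, "affiliation": 0.2}

def match_weight(gamma):
    """Log-likelihood ratio log P(gamma|M)/P(gamma|U) for an agreement
    pattern gamma mapping each field to True (agree) / False (disagree).
    Under field independence, the ratio factors into a per-field sum."""
    w = 0.0
    for f, agree in gamma.items():
        if agree:
            w += math.log(m[f] / u[f])
        else:
            w += math.log((1 - m[f]) / (1 - u[f]))
    return w

# Both fields agree: strong positive evidence for a match.
w_match = match_weight({"name": True, "affiliation": True})
# Name disagrees: the evidence turns negative despite affiliation agreement.
w_mixed = match_weight({"name": False, "affiliation": True})
```

A pair is then labeled a match, a non-match, or sent for clerical review depending on where its weight falls relative to two thresholds.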

Probabilistic Models for Relational Entity Resolution

Probabilistic models that take into account interaction between different entity resolution decisions through hyper-edges have been proposed for named-entity recognition in natural language processing and for citation matching (McCallum and Wellner 2004; Singla and Domingos 2004). Such relational learning approaches introduce a decision variable y ij for every pair of references r i and r j , but instead of inferring the y ij ’s independently, use conditional random fields for joint reasoning. For example, the decision variables for the “Wang” references and the “Chen” references in papers 1 and 3 would be connected to each other; features and functions would be defined to ensure that they are more likely to take up identical values.

Such relational models are supervised and require labeled data to train the parameters. One of the difficulties in using a supervised method for resolution is constructing a good training set that includes a representative collection of positive and negative examples. Accordingly, unsupervised relational models have also been developed (Li et al. 2005; Pasula et al. 2003; Bhattacharya and Getoor 2006). Instead of introducing pairwise decision variables, this category of approaches uses generative models for references using latent entity labels. Note that, here, the number of entities is unknown and needs to be discovered automatically from the available references. Relationships between the references, such as co-mentions or co-occurrences, are captured using joint distributions over the entity labels.

All of these probabilistic models have been shown to perform well in practice and have the advantage that the match/non-match decisions do not depend on any user-specified similarity measures and thresholds but are learned directly from data. However, this benefit comes at a price. Inference in relational probabilistic models is an expensive process. Exact inference is mostly intractable and approximate strategies such as loopy belief propagation and Monte Carlo sampling strategies are employed. Even these approximate strategies take several iterations to converge, and extending such approaches to large datasets is still an open problem.

Other Approaches for Relational Entity Resolution

Alternative approaches (Bhattacharya and Getoor 2007; Kalashnikov et al. 2005; Dong et al. 2005) consider relational structure of the entities for data integration but avoid the complexity of probabilistic inference. By avoiding a formal probabilistic model, these approaches can handle complex and longer-range relationships between different entity references, and the resolution process is significantly faster as well. Such approaches also create pairwise decision nodes between references and create a dependency graph over them to capture the relationships in the data. But instead of performing probabilistic inference, they keep updating the value associated with each decision node by propagating relational evidence from one decision node to another over the dependency graph.
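The propagation scheme can be sketched as an iterative update over pairwise decision nodes. The damping weight and the dependency structure below are illustrative assumptions, not the exact update rule of any of the cited systems:

```python
def propagate(scores, deps, alpha=0.3, iterations=5):
    """Iteratively fold relational evidence into pairwise decision scores.

    scores: node -> initial attribute similarity in [0, 1].
    deps:   node -> list of neighboring decision nodes whose resolution
            supports this one (the dependency graph over decisions).
    alpha:  illustrative weight on relational evidence.
    """
    s = dict(scores)
    for _ in range(iterations):
        # Each node keeps its attribute evidence and adds a fraction of the
        # strongest supporting neighbor's current score, capped at 1.
        s = {n: min(1.0, scores[n]
                    + alpha * max((s[d] for d in deps.get(n, [])), default=0.0))
             for n in s}
    return s

# The Wang pair from papers 1 and 3 depends on the Chen pair from the
# same papers, and vice versa.
scores = {"wang13": 0.6, "chen13": 0.5}
deps = {"wang13": ["chen13"], "chen13": ["wang13"]}
final = propagate(scores, deps)
```

Each node's score rises as its dependent decisions strengthen, mirroring how relational evidence flows between decision nodes over the dependency graph.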

When the relationships between the references and the entities can be captured in a single graph, the matching entity for a specific reference may be identified using path-based similarities between their corresponding nodes in the graph. The connection strength associated with each edge in the graph can be determined in an unsupervised fashion, given all the references, their candidate entity choices, and the relationships between them, by solving a set of nonlinear equations (Kalashnikov et al. 2005). This approach is useful for incremental data cleaning when the set of entities currently in the database is known and an incoming reference needs to be matched with one of these entities.

An alternative approach to performing collective entity resolution using relational evidence is collective relational clustering (Bhattacharya and Getoor 2007). The goal here is to cluster the references into entities by taking into account the relationships between the references. This is achieved by defining a similarity measure between two clusters of references that takes into account not only the attribute similarity of the references in the two clusters but also the neighboring clusters of each cluster. The neighboring clusters of any reference cluster c are defined by considering the references \(r^{{\prime}}\) connected via hyper-edges to the references r belonging to c, and the clusters to which these related references belong. If r.C represents the current cluster of reference r, then \(N(c) =\bigcup \{r^{{\prime}}.C\}\), the union taken over all pairs of references \(r^{{\prime}}\) and r such that \(r^{{\prime}}.H \cap r.H\neq \varnothing \) and r.C = c. For instance, the neighboring clusters for the Wang cluster in our example containing the Wang references from papers 1, 2, and 4 are the Ansari cluster and the Chen clusters containing the other references from the same papers. The relational similarity between two clusters is then computed by comparing their neighborhoods. This relational similarity complements attribute similarity in the combined similarity between two clusters. Intuitively, two entities are likely to be the same if they are similar in attributes and are additionally connected to the same other entities. Collective relational clustering can be implemented efficiently by maintaining a priority queue of merge-able cluster pairs and updating the “neighboring” queue elements with every merge operation.
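The combined similarity can be sketched as a weighted mix of attribute similarity and neighborhood overlap; the weight and the choice of Jaccard similarity for comparing neighborhoods are illustrative assumptions:

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of neighboring cluster ids."""
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(attr_sim, neighbors_c1, neighbors_c2, beta=0.5):
    """Combined similarity of two clusters: attribute similarity of their
    members mixed with the overlap of their neighborhoods N(c1), N(c2).
    beta is an illustrative weight on the relational component."""
    return (1 - beta) * attr_sim + beta * jaccard(neighbors_c1, neighbors_c2)

# Two clusters with the same attribute similarity are ranked differently
# once relational evidence is considered: sharing the Ansari neighbor
# cluster makes the first pair a better merge candidate than the second.
sim_shared = combined_similarity(0.9, {"ansari", "chen1"}, {"ansari"})
sim_disjoint = combined_similarity(0.9, {"chen2"}, {"ansari"})
```

In a full implementation, the top-scoring pair under this combined measure is merged first, and the scores of neighboring queue entries are recomputed after each merge.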

Applications

Data cleaning and reference disambiguation approaches have been applied and evaluated in a number of domains. The earliest applications were on medical data. Census data is an area where detection of duplicates poses a significant challenge and Winkler (2002) has successfully applied his research and other baselines to this domain. A great deal of work has been done making use of bibliographic data (Pasula et al. 2003; Singla and Domingos 2004; Bhattacharya and Getoor 2007). Almost without exception, the focus has been on the matching of citations. Work in co-reference resolution and disambiguating entity mentions in natural language processing (McCallum and Wellner 2004) has been applied to text corpora and newswire articles like the TREC corpus. There have also been significant applications in information integration in data warehouses (Chaudhuri et al. 2003).

Cross-References