
1 Introduction

Entity matching is the problem of identifying which entities in one data source refer to the same entities in other data sources. It is a well-known and important problem that arises in many research fields, including data cleaning and integration, information retrieval, and machine learning. Many applications can benefit from entity matching. For instance, users in one social network may be the same individuals as users on other platforms, yet their profiles and behaviors may differ slightly, e.g., containing different abbreviations or missing some information. Once the more complete user behaviors are determined, product recommendations can be improved. As another example, two second-hand housing websites may share many house listings; after deriving the matched entities across the different websites, we can gain insight into more complete house information. In this paper, we therefore study the problem of matching entities across two or more heterogeneous data sources.

However, entity matching is often a challenging task for the following reasons: (1) it is extremely expensive on large data sets, since the process considers all tuples as candidates; (2) it is arduous to compare the many heterogeneous attributes of a tuple of entities from different data sources; (3) some attributes of an entity may be missing in some Web applications.

In this paper, we provide an approach, called EMAN, to match entities across two or more heterogeneous data sources. We formulate the entity matching task as an unsupervised learning problem. Our proposed approach achieves promising accuracy without ground truth, which is usually arduous and costly to collect in Web applications. The approach addresses the three challenges highlighted above: it utilizes the exponential family to combine heterogeneous entity attributes, handles missing data within the unsupervised framework, and employs locality sensitive hashing (LSH) to speed up the computation. In summary, our major contributions are as follows.

  • We propose an unsupervised method to match entities across two or more heterogeneous data sources. The approach is unified: it integrates the heterogeneous entity attributes and employs distributions from the exponential family to model the similarities of different attributes.

  • We employ locality sensitive hashing (LSH) to block entities. With LSH, EMAN runs very efficiently while still maintaining promising matching results.

  • We evaluate our algorithm against a comparable baseline on three real data sources. The empirical results show that EMAN outperforms the baseline in matching entities across two or more data sources.

The rest of the paper is organized as follows. We briefly discuss the related work in Sect. 2. We formally define the problem and give an overview of our algorithm in Sect. 3. We present the entity matching method in Sect. 4 and report our empirical study in Sect. 5. Finally, we conclude the paper in Sect. 6.

2 Related Work

Entity matching aims at detecting records that describe the same real-world entity in given datasets. The entity matching problem has become a hot topic in recent years, though the earliest studies date back to the 1950s [1]. Entity matching is a broad research problem studied in multiple research communities. In the traditional database community, the problem is described as data matching [2], data deduplication [3], instance identification [4], Merge/Purge [5], and record linkage [6]; in the information retrieval community, the same problem is described as entity resolution [7–11]. Object linkage, object identification [12], and duplicate detection [13–15] also commonly refer to the same task.

In general, the existing studies of entity matching can be divided into two main categories: classification-based and rule-based [17]. Classification-based approaches assign labels to pairs after learning patterns from training data. Mikhail and Raymond train an SVM classifier with high accuracy [18]. Dong [19] implements an entity matching algorithm based on machine learning with three crucial features among records; the algorithm improves the accuracy of entity matching by gradually merging attribute values to enrich record information. However, it is not easy to find exactly matched pairs for training a classifier.

Rule-based approaches are deterministic linkage approaches [20, 21] that judge whether a record pair is matched based on rules. Jiannan Wang learns the most appropriate set of rules, but on the premise of a given rule set that is itself difficult to obtain [17]. Moreover, if the given rules are not sufficient, the deviation of the classification results is large and the results may not be acceptable. Whang [9] proposed an entity resolution method with evolving rules, based on the relationship between dynamic semantics and resolution rules, to solve the problem of interaction among results. Whang [7] indicates that entity resolution is not a one-time process but an incremental, constantly changing one. Wang et al. first pose the "how similar is similar" problem in entity matching, and they propose a rule-based method to identify the most appropriate similarity functions and thresholds to find entities effectively [17]. Rastogi et al. focus on the large-scale generic entity matching problem, which they implement with a parallel framework on Hadoop [22]. Lee et al. also attempt to address the scalability of entity matching [23]. In that work, they exploit a materialization structure inspired by top-k query processing and develop a scalable entity matching algorithm for evolving rules. However, the rules for linking entities are arduous to learn from data sets.

Fellegi-Sunter’s approach is also a rule-based method; it solves the record linkage problem using an unsupervised, probabilistic linkage approach [16, 24]. It works well only when the linkage problem is simple, i.e., exact one-to-one matching of usernames and other user attributes. Sadinle et al. extend Fellegi-Sunter’s model to a probabilistic method for linking multiple data files [25]. Many recent applications generate data of poor quality, with heterogeneous attributes, errors, and incomplete or missing values. However, existing entity matching approaches cannot integrate heterogeneous entity attributes and handle missing values. Gao et al. also extend Fellegi-Sunter’s approach to link users across different social networks; their approach can handle heterogeneous user attributes and missing values in user profiles [26]. In addition, Fellegi-Sunter’s approach can only link records from two data sources. For linking more than two data sources, the number of false positive tuples may be large if we link them pair-by-pair using Fellegi-Sunter’s approach. These issues are the focus of this work.

3 Entity Matching Approach

In this section, we first give a formal definition of the entity matching problem and then present an overview of our proposed approach.

3.1 The Problem Definition

We assume that there are N data sources (\(N > 1\)). Let \(E_i\), \(1\le i \le N\), be the set of entities from the \(i\)-th source and \(\alpha _i(e_{ij})\) represent the observed features of \(e_{ij}\in E_i\), i.e., \(\alpha _i(e_{ij})\) is the observed feature vector of \(e_{ij}\) from source \(E_i\). Let \(\alpha _i(E_i)\) be the set of attribute feature vectors of entities from source \(E_i\). The set of all candidate tuples T can then be represented as \(\prod _{i = 1}^{N}\alpha _i(E_i)\). The entity matching problem is to determine the matched tuples M and unmatched tuples U in T, i.e.,

$$\begin{aligned} M = \{(\alpha _1(e_{1j_1}), \ldots ,\alpha _N(e_{Nj_N}))\mid e_{1j_1}=\ldots =e_{Nj_N},\, e_{ij_i}\in E_i\}, \end{aligned}$$
(1)
$$\begin{aligned} U = \{(\alpha _1(e_{1j_1}), \ldots ,\alpha _N(e_{Nj_N}))\mid \exists N_1, N_2,\, e_{N_1j_{N_1}}\ne e_{N_2j_{N_2}},\, e_{ij_i}\in E_i\}. \end{aligned}$$
(2)

Here \((\alpha _1(e_{1j_1}),\alpha _2(e_{2j_2}), \ldots ,\alpha _N(e_{Nj_N}))\in M\) means that entities \(e_{ij_i}\), \(1\le i\le N\), are the same, while \((\alpha _1(e_{1j_1}),\alpha _2(e_{2j_2}), \ldots ,\alpha _N(e_{Nj_N}))\in U\) means that at least one entity \(e_{ij_i}\) is different from the others. Assuming each entity can match at most one entity from each \(E_i\), M can have at most \(\min (|E_1|,|E_2|,\cdots ,|E_N|)\) matched entity tuples. We would ideally want \(T = M\cup U\), but T is usually an extremely large set in practice. We therefore consider a smaller T that still includes M by utilizing blocking techniques.

3.2 Overview of EMAN

Our proposed entity matching approach consists of the following four steps.

  • Step 1: Candidate tuple generation. Generating candidate tuples dominates the overall computational cost. Blocking methods, such as n-gram indexing and sorted neighborhood, are feasible techniques to reduce the number of candidate tuples [27]. However, the number of candidate tuples from N sources of n entities contained in b blocks is \(O(\frac{n^N}{b^{N-1}})\), so both efficiency and accuracy are significantly impacted by the number of blocks. We therefore utilize the LSH (Locality Sensitive Hashing) scheme to speed up candidate tuple generation, since the number of blocks can be arbitrarily large. (Based on an n-gram model, entities can be blocked by LSH when string attributes, such as name and address, are represented as binary vectors after shingling; a minimal sketch of this blocking step appears after this list.)

  • Step 2: Entity vectorization and similarity computation. We choose a similarity function \(s^{j}\) to evaluate the similarity between the \(j\)-th features of two entities from a candidate tuple. For tuple

    $$\begin{aligned} t_i = (\alpha _1(e_{1i_1}),\alpha _2(e_{2i_2}), \cdots ,\alpha _N(e_{Ni_N})), \end{aligned}$$

    we can compute a similarity vector over the m attributes of any two entities of tuple \(t_i\) using similarity functions \(\{s^{j}\}_{j = 1}^{m}\). There are \(\left( {\begin{array}{c}N\\ 2\end{array}}\right) \) similarity vectors between different entity pairs. We take the minimum value of each entry over these \(\left( {\begin{array}{c}N\\ 2\end{array}}\right) \) similarity vectors to model the similarity of tuple \(t_i\), denoted as an m-dimensional vector \(\gamma _i\) (see the second sketch after this list). An entity may have multiple attributes, some of which are individual demographic attributes, e.g., name, location, date, and URL. These can be of different data types (e.g., numeric, text, string, categorical) and can also be represented as sets or distributions. Our proposed approach models the similarities between entities using different probability distributions from the exponential family to accommodate such heterogeneous features.

  • Step 3: Parameter learning. Given the similarity vectors \(\gamma _i\), EMAN models the similarity values of tuples using two different probability distribution functions, one for matched tuples and one for unmatched ones. The parameter learning step infers the parameters of these two distributions. More details are given in Sect. 4.

  • Step 4: Tuple scoring and label assignment. For a tuple \(t_i\in T\), its score is computed as \(\log {\frac{P(t_i \in M |\gamma _i,\hat{\varTheta })}{P(t_i \in U |\gamma _i,\hat{\varTheta })}}\) (the logarithm eases computation). Given a threshold, the matching scores of entity tuples determine whether they belong to the matched or unmatched tuple set, i.e., M or U: a tuple is judged matched if its score exceeds the threshold, and unmatched otherwise. Since a positive score means the tuple is more likely matched than unmatched, 0 is the natural default threshold.
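To make Step 1 concrete, the following Python sketch blocks entities by MinHash LSH over character 3-grams of a string attribute. It is a minimal illustration under our own helper names (shingles, minhash_signature, lsh_blocks) and illustrative band/row settings, not the paper's implementation.

```python
import hashlib
from collections import defaultdict

def shingles(s, n=3):
    """Character n-grams of a string attribute (name, address, ...)."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def minhash_signature(grams, num_perm=64):
    """MinHash signature: the minimum hash of the shingle set under
    each of num_perm seeded hash functions."""
    return [min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
                for g in grams)
            for seed in range(num_perm)]

def lsh_blocks(entities, num_perm=64, bands=16):
    """Band the signatures; entities sharing any band bucket become
    co-candidates, so the number of buckets adapts to the data."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for eid, text in entities:
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(eid)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Because similar strings share shingles, their signatures collide in at least one band with high probability, which is what keeps candidate generation cheap without fixing the number of blocks in advance.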
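As a companion illustration of Step 2, this sketch computes a tuple's similarity vector \(\gamma _i\) as the entrywise minimum over the \(\left( {\begin{array}{c}N\\ 2\end{array}}\right) \) pairwise similarity vectors; the Jaccard function is just one plausible choice of \(s^{j}\) for set-valued attributes, and both function names are ours.

```python
from itertools import combinations

def jaccard(a, b):
    """One plausible s^j for set-valued attributes."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def tuple_similarity(entities, sim_funcs):
    """gamma_i: entrywise minimum over the C(N,2) pairwise similarity
    vectors of an N-entity candidate tuple (m attributes each)."""
    pair_vecs = [[s(x[j], y[j]) for j, s in enumerate(sim_funcs)]
                 for x, y in combinations(entities, 2)]
    return [min(col) for col in zip(*pair_vecs)]
```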

Unlike the earlier Fellegi-Sunter method, EMAN handles both discrete and continuous similarities by drawing on a wider range of probability distributions from the exponential family to model the similarity values of matched and unmatched entity tuples (in Steps 2 and 3). This is an important extension for handling the heterogeneous attribute types, including string, numeric, set, and distribution, that arise in the entity matching task.

3.3 EMAN Algorithm

We now present the full EMAN algorithm in Algorithm 1. In the algorithm, PM maintains the set of matched entity tuples. At Lines 2–4, the algorithm employs LSH to block entities from the N data sources. At Lines 5–9, it generates all candidate tuples; a candidate tuple consists of N entities, one from each of the N data sources, where all entities in a tuple come from the same bucket. In this step, it also computes the similarity vectors. At Lines 10–13, it infers the parameters using the EM algorithm. Finally, at Lines 14–17, it judges whether each tuple is matched or unmatched based on the matching score computed at Line 15.

Algorithm 1. The EMAN algorithm

4 Parameters Inference and Prediction

We employ a generative model to solve the problem defined in the previous section. Given the similarity vectors of all candidate tuples, EMAN learns the parameters of the similarity distributions for matched and unmatched tuples based on exponential family distributions. Using these learned parameters, EMAN infers whether a candidate tuple \(t_i\in T\) is matched or unmatched by estimating the probabilities \(P(t_i\in M|\gamma _i, \varTheta )\) and \(P(t_i\in U|\gamma _i, \varTheta )\).

4.1 Likelihood

Assume that \(P(t_i\in M|\varTheta ) = p\), i.e., \(P(t_i\in U|\varTheta ) = 1-p\). Applying Bayes’ rule to \(P(t_i\in M|\gamma _i, \varTheta )\), we obtain:

$$\begin{aligned} P(t_i\in M|\gamma _i, \varTheta ) = \frac{p\times P(\gamma _i|t_i \in M,\varTheta )}{P(\gamma _i|\varTheta )} \end{aligned}$$
(3)
$$\begin{aligned} P(\gamma _i|\varTheta ) = p\times P(\gamma _i|t_i \in M,\varTheta ) +(1-p)\times P(\gamma _i|t_i \in U,\varTheta ) \end{aligned}$$
(4)

Note that we do not know the label of a candidate tuple \(t_i\). To represent the joint probability of the observed data, we define a latent variable \(l_i\) for candidate tuple \(t_i\): its value is 1 if tuple \(t_i\) is a matched tuple, and 0 otherwise. We then define \(c_i=(l_i, \gamma _i)\) as the complete data vector for T. The probability of observation \(c_i\) under parameter \(\varTheta \) is defined as:

$$\begin{aligned} \begin{aligned} P(c_i|\varTheta )&=[P(\gamma _i, t_i \in M | \varTheta )]^{l_i}[P(\gamma _i, t_i \in U | \varTheta )]^{(1-l_i)}\\&=[p\times P(\gamma _i|t_i \in M, \varTheta )]^{l_i}[(1-p)\times P(\gamma _i|t_i \in U, \varTheta )]^{(1-l_i)} \end{aligned} \end{aligned}$$
(5)

Let \(L_i=(l_i, 1-l_i)\); we thus obtain the log-likelihood for sample \(X=\{c_i:i = 1, 2, \cdots , |T|\}\) as:

$$\begin{aligned} \begin{aligned} \varvec{L}(\varTheta |X)&=\sum _{i=1}^{|T|} L_i[\log P(\gamma _i|t_i \in M, \varTheta ), \log P(\gamma _i | t_i \in U, \varTheta )]^{'}\\&\quad +\sum _{i=1}^{|T|} L_i[\log p, \log (1-p)]^{'}. \end{aligned} \end{aligned}$$
(6)

4.2 Exponential Family

As mentioned above, most common probability distributions can be represented within the exponential family, a convenient and widely used family of distributions. Distributions in the exponential family appeal to the machine learning community because of the good properties of their MLEs, which are functions of the sufficient statistics and are best unbiased estimators, among others [22].

In probability and statistics, an exponential family is a set of probability distributions represented in an exponential form chosen for mathematical convenience [23]. In other words, an exponential family is a set of probability distributions whose PDFs or PMFs can be expressed in the following form:

$$\begin{aligned} f(x; \theta ) = h(x) \exp \left( \theta ^{'} S(x) -z(\theta )\right) \end{aligned}$$
(7)

where \(\theta \) (possibly a vector) is the natural parameter of the distribution and S(x) is a sufficient statistic; generally, \(S(x)=x\). When the functions \(z\), \(h\), and \(S\) are fixed, we obtain an exponential family with parameter \(\theta \). The exponential family contains as special cases most of the standard discrete and continuous distributions used for practical modelling, such as the Bernoulli, Multinomial, Poisson, Gamma, and Dirichlet distributions.
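For instance, the Bernoulli distribution with mean \(\mu \) fits this form:

$$\begin{aligned} f(x;\mu ) = \mu ^x(1-\mu )^{1-x} = \exp \left( x\log \tfrac{\mu }{1-\mu } + \log (1-\mu )\right) , \end{aligned}$$

so that \(h(x)=1\), \(S(x)=x\), \(\theta =\log \frac{\mu }{1-\mu }\), and \(z(\theta )=\log (1+e^{\theta })\).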

One of our tasks is to calculate the probabilities \(P(t_i\in M|\gamma _i, \varTheta )\) and \(P(t_i\in U|\gamma _i, \varTheta )\). From Eq. 6, we know that the critical step is to calculate \(P(\gamma _i|t_i \in M,\varTheta )\) and \(P(\gamma _i|t_i \in U,\varTheta )\). To estimate them, we assume that \(\gamma _i\) is drawn from a distribution in the exponential family, and we use the simplifying assumption that the entries of vector \(\gamma _i\) are conditionally independent given the state of indicator \(L_i\), i.e.:

$$\begin{aligned} \begin{aligned} P(\gamma _i^j|t_i \in M,\varTheta )\sim f_{1,j}(\gamma _i^j;\theta _{1,j}),\quad \text {for } j=1,\cdots , m\\ P(\gamma _i^j|t_i \in U,\varTheta )\sim f_{0,j}(\gamma _i^j;\theta _{0,j}),\quad \text {for } j=1,\cdots , m \end{aligned} \end{aligned}$$
(8)

where \(f_{\cdot ,j}(\cdot ;\cdot )\) is the PDF or PMF from the exponential family in Eq. 7.

Next, the log-likelihood in Eq. 6 can be rewritten as:

$$\begin{aligned} \begin{aligned} \varvec{L}(\varTheta |X)&\propto \sum _{i=1}^{|T|} L_i[\sum _{j=1}^{m}\theta _{1,j}^{'}S_{1,j}(\gamma _i^j),\sum _{j=1}^{m}\theta _{0,j}^{'}S_{0,j}(\gamma _i^j)]^{'}\\&\quad -\sum _{i=1}^{|T|} L_i[\sum _{j=1}^{m}z_{1,j}(\theta _{1,j}),\sum _{j=1}^{m}z_{0,j}(\theta _{0,j})]^{'}+\sum _{i=1}^{|T|} L_i[\log {p}, \log {(1-p)}]^{'}. \end{aligned} \end{aligned}$$
(9)

4.3 Maximum Likelihood Estimator

Since \(L_i\) in Eq. 9 is a latent vector, we estimate the parameters \(\varTheta =\{p, \theta _{1,j}, \theta _{0,j}, \text { for } j = 1,\cdots , m\}\) by maximum likelihood estimation using the EM algorithm. The EM algorithm begins with an initial estimate of the unknown parameter \(\varTheta \) and repeats the expectation (E) and maximization (M) steps until convergence.

E-step. The objective of the E-step is to calculate the conditional expectation of the latent variables, estimating the missing data from the observed data. Given \(\gamma _i\) and \(\varTheta ^{(k-1)}\) in the \(k\)-th iteration, the conditional distribution of \(l_i\) is \(l_i|\gamma _i,\varTheta ^{(k-1)}\sim B(1,p_i^{(k)})\) with

$$\begin{aligned} \begin{aligned} p_i^{(k)}=P(l_i=1|\gamma _i,\varTheta ^{(k-1)}) \end{aligned} \end{aligned}$$
(10)

Then \(p_i^{(k)}\) can be expressed as in Eq. 11:

$$\begin{aligned} \begin{aligned} p_i^{(k)}&=P(l_i=1|\gamma _i,\varTheta ^{(k-1)})=\frac{P(t_i \in M,\gamma _i|\varTheta ^{(k-1)})}{P(\gamma _i|\varTheta ^{(k-1)})}\\&=\frac{p^{(k-1)}\cdot \prod _{j=1}^mf_{1,j}(\cdot ;\cdot )}{p^{(k-1)}\cdot \prod _{j=1}^mf_{1,j}(\cdot ;\cdot )+(1-p^{(k-1)})\cdot \prod _{j=1}^mf_{0,j}(\cdot ;\cdot )} \end{aligned} \end{aligned}$$
(11)

By substituting \(p_i^{(k)}\) for \(l_i\), we obtain the expectation function.

M-step. In the M-step, we maximize the likelihood after the E-step. With the values \(l_i^{(k)}=p_i^{(k)}\) estimated in the E-step, we take derivatives of the log-likelihood with respect to the parameters \(p\), \(\theta _{1,j}\), and \(\theta _{0,j}\) as follows:

$$\begin{aligned} \frac{\partial \varvec{L}(\varTheta |X) }{\partial p} = \sum _{i=1}^{|T|}\left( \frac{l_i^{(k)}}{p}-\frac{1-l_i^{(k)}}{1-p}\right) \end{aligned}$$
(12)
$$\begin{aligned} \frac{\partial \varvec{L}(\varTheta |X) }{\partial \theta _{1,j}} = \sum _{i=1}^{|T|} l_i^{(k)}\left( S_{1,j}(\gamma _i^j)-\frac{\partial z_{1,j}(\theta _{1,j})}{\partial \theta _{1,j} }\right) \end{aligned}$$
(13)
$$\begin{aligned} \frac{\partial \varvec{L}(\varTheta |X) }{\partial \theta _{0,j}} = \sum _{i=1}^{|T|}(1- l_i^{(k)})\left( S_{0,j}(\gamma _i^j)-\frac{\partial z_{0,j}(\theta _{0,j})}{\partial \theta _{0,j} }\right) \end{aligned}$$
(14)

Setting these derivatives to zero and using the exponential-family identity \(\frac{\partial z_{\cdot ,j}(\theta _{\cdot ,j})}{\partial \theta _{\cdot ,j} }=E_{\theta _{\cdot ,j}}[S_{\cdot ,j}(\gamma ^j)]\), where \(\cdot \) can be 1 or 0, we obtain the MLEs of the parameters shown in Table 1. Due to page limitations, we omit the proofs and computations.
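As a quick sanity check of this identity, take the exponential distribution (used for most similarities in Sect. 5), written in the form of Eq. 7 with \(S(x)=x\), \(\theta =-\lambda \), and \(z(\theta )=-\log (-\theta )\); then

$$\begin{aligned} \frac{\partial z(\theta )}{\partial \theta } = -\frac{1}{\theta } = \frac{1}{\lambda } = E_{\theta }[S(\gamma ^j)], \end{aligned}$$

so setting Eq. 13 to zero yields a responsibility-weighted sample mean as the estimator of the matched group's \(1/\lambda \).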

Once the distributions of all similarity values have been assigned, we can estimate the parameters in the M-step of the \(k\)-th iteration. In particular, the probability p is estimated as \(p^{(k)}=\frac{\sum _{i=1}^{|T|} l_i^{(k)}}{|T|}\).

Table 1. The MLEs of parameters for both matched and unmatched groups
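For intuition, here is a minimal, self-contained EM sketch for a single similarity feature modeled with exponential densities for the matched and unmatched groups. It follows Eqs. 10–14 in spirit, but the initial values and the single-feature setup are our own illustrative assumptions, not the paper's code.

```python
import math

def em_exponential(sims, iters=100):
    """EM for a two-component exponential mixture over one similarity
    feature: f1 (matched, rate lam1) vs f0 (unmatched, rate lam0),
    mixed with weight p = P(t_i in M)."""
    p, lam1, lam0 = 0.5, 0.5, 5.0  # Theta^(0); matched sims tend higher
    for _ in range(iters):
        # E-step (Eq. 11): posterior probability that tuple i is matched.
        post = [p * lam1 * math.exp(-lam1 * g) /
                (p * lam1 * math.exp(-lam1 * g)
                 + (1 - p) * lam0 * math.exp(-lam0 * g))
                for g in sims]
        # M-step: closed-form MLEs; an exponential rate is the inverse
        # of the responsibility-weighted mean (guarded against zero).
        n1 = sum(post)
        p = n1 / len(sims)
        lam1 = n1 / max(sum(l * g for l, g in zip(post, sims)), 1e-12)
        lam0 = ((len(sims) - n1) /
                max(sum((1 - l) * g for l, g in zip(post, sims)), 1e-12))
    return p, lam1, lam0
```

Under these initial values, a call such as em_exponential([0.9, 0.85, 0.8, 0.1, 0.05, 0.02]) would be expected to separate a high-mean matched component from a low-mean unmatched one.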

4.4 Missing Data

In the presence of missing data, we write the sample X as \((X_o,X_m)\), where \(X_o\) represents the observed data and \(X_m\) the missing data. Let \(\varTheta ^{(0)}\) be the initial parameter value. The E-step of the EM algorithm computes \(Q(\varTheta ;\varTheta ^{(k-1)})=E(\varvec{L}(\varTheta |X)|X_o,\varTheta ^{(k-1)})\) during the \(k\)-th iteration. For missing entries, the sufficient statistics \(S_{1,j}(\gamma _i^j)\) and \(S_{0,j}(\gamma _i^j)\) in Eq. 9 are unavailable, so \(Q(\varTheta ;\varTheta ^{(k-1)})\) is computed by substituting the expectations \(E(S_{1,j}(\gamma _i^j)|\varTheta ^{(k-1)})\) and \(E(S_{0,j}(\gamma _i^j)|\varTheta ^{(k-1)})\) for \(S_{1,j}(\gamma _i^j)\) and \(S_{0,j}(\gamma _i^j)\) in Eqs. 13 and 14, respectively.
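Concretely, with \(S(x)=x\) and an exponential density, the expected sufficient statistic of a missing entry is simply the current component mean; a one-line sketch (the None convention for missing values is ours):

```python
def expected_stat(g, lam):
    """Return the observed similarity if present; otherwise its
    expectation E[S] = 1/lam under the current exponential estimate."""
    return g if g is not None else 1.0 / lam
```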

4.5 Matching Score Computation

Once the parameters \(\varTheta \) have been estimated, EMAN determines whether a candidate tuple \(t_i\) is matched or unmatched by computing its matching score. To speed up the computation of matching scores for the exponential family, we define the matching score function as:

$$\begin{aligned} W_i=\log \frac{P(t_i\in M|\gamma _i, \hat{\varTheta })}{P(t_i\in U|\gamma _i,\hat{\varTheta })} \propto \sum _{j=1}^m w_i^j \end{aligned}$$
(15)

where

$$\begin{aligned} w_{i}^{j}=(\theta _{1,j}^{'}S_{1,j}(\gamma _{i}^{j})-z_{1,j}(\theta _{1,j}))-(\theta _{0,j}^{'}S_{0,j}(\gamma _i^j)-z_{0,j}(\theta _{0,j})) \end{aligned}$$
(16)

Note that \(P(t_i\in M|\gamma _i, \hat{\varTheta }) > P(t_i\in U|\gamma _i,\hat{\varTheta })\) exactly when \(W_i>0\). Alternatively, we can assign \(t_i\) to the matched tuple set only if \(W_i > W_0\), where \(W_0 > 0\) is a threshold.
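As an illustration, here is a matching-score sketch for exponential-distribution features, using the natural-parameter form \(\theta =-\lambda \) and \(z(\theta )=-\log (-\theta )\) from Sect. 4.2; the per-feature loop mirrors Eq. 16, and the function name is ours.

```python
import math

def match_score(gamma, theta1, theta0):
    """W_i = sum_j w_i^j per Eqs. 15-16, with S(x) = x and
    z(theta) = -log(-theta) for exponential-distribution features
    (so theta'S(x) - z(theta) = theta*x + log(-theta))."""
    w = 0.0
    for g, t1, t0 in zip(gamma, theta1, theta0):
        w += (t1 * g + math.log(-t1)) - (t0 * g + math.log(-t0))
    return w
```

A tuple is then assigned to M when match_score(...) exceeds the chosen threshold \(W_0\) (or 0).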

5 Empirical Evaluation

We conduct two experiments to compare the proposed EMAN with the baseline method on three real data sources. First, we evaluate performance on the self-matching problem, in which there is a clear ground-truth one-to-one matching between entities from replicas of the same data source. Second, we study the performance of matching three real heterogeneous data sources.

5.1 Experimental Setup

Datasets. We crawled three datasets from well-known real-estate websites in China for our experiments. Descriptive statistics of the datasets are shown in Table 2. The schemas of the three data sources differ from each other, and we identified 10 useful attributes, shown in the first column of Table 3. In our experiments, we model the similarities of property fees and price with a Gaussian distribution, purpose (shop or dwelling) with a Bernoulli distribution, and the remaining similarities with an Exponential distribution, as shown in the last column of Table 3.

Table 2. Descriptive statistics of datasets
Table 3. The setup of experiment and parameter estimation for matched and unmatched groups

Comparative Method and Evaluation Measures. We use the unsupervised Fellegi-Sunter approach (FS for short) as the comparative baseline. Fellegi-Sunter’s approach evaluates all attributes using binary similarity, i.e., the similarity is 1 if two attribute values are identical, and 0 otherwise. Since many data sources are low in quality, this is too crude to obtain reasonable performance. In our implementation of FS, we therefore set the similarity value to 1 if the attribute similarity is larger than a tuned threshold, and 0 otherwise.

As mentioned above, we evaluate our method using Precision@K and Recall@K. Precision@K is the fraction of matched tuples in the \(top\text {-}K\) result that are correctly matched. Recall@K is the fraction of ground-truth matched entities that appear among the \(top\text {-}K\) results. To evaluate the scalability of our proposed approach, we also measure the elapsed time in seconds and the number of candidate tuples.
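For completeness, the two measures can be sketched as follows, where ranked is a score-sorted list of candidate tuples and truth the set of ground-truth matches (both names are ours):

```python
def precision_at_k(ranked, truth, k):
    """Fraction of the top-K returned tuples that are true matches."""
    return sum(t in truth for t in ranked[:k]) / k

def recall_at_k(ranked, truth, k):
    """Fraction of all ground-truth matches found in the top-K."""
    return sum(t in truth for t in ranked[:k]) / len(truth)
```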

5.2 Self-matching Evaluation

First, we evaluate our method on the self-matching task, which is designed so that we know the complete ground-truth matched entities: we match the entities from three replicas of the same data source. We create two new data sources by injecting noise into the given data source. For each replica, we randomly inject noise into the string attributes: each character has the same probability of being inserted, deleted, or replaced. Taking PingAnFang as an example, \(PingAnFang_1(\psi )\) is the first replica of PingAnFang, where \(\psi \) denotes the probability of each character being changed. In our experiment, \(\psi \) varies from \(10\,\%\) to \(50\,\%\).
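The replica generation can be sketched as below: each character is perturbed with probability \(\psi \), and the perturbation type is drawn uniformly. The alphabet used for insertions and replacements is an assumption on our part, since the actual listings are largely Chinese text.

```python
import random
import string

def perturb(s, psi, alphabet=string.ascii_lowercase):
    """Noisy replica of a string attribute: with probability psi each
    character triggers a uniformly chosen insert, delete, or replace."""
    out = []
    for ch in s:
        if random.random() < psi:
            op = random.choice(("insert", "delete", "replace"))
            if op == "insert":
                out.append(ch)
                out.append(random.choice(alphabet))
            elif op == "replace":
                out.append(random.choice(alphabet))
            # "delete": emit nothing for this character
        else:
            out.append(ch)
    return "".join(out)
```

For example, \(PingAnFang_1(0.1)\) corresponds to applying perturb with psi=0.1 to every string attribute of every entity.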

Fig. 1. Accuracy of EMAN and FS

Figure 1(a) and (b) show the accuracy of EMAN on PingAnFang as \(\psi \) varies from 10 to \(50\,\%\). We observe that the accuracy of EMAN is promising: when 10 % noise is injected into the data, almost 90 % of matched entities are found by EMAN, and even when 50 % noise is injected, the precision in the top-200 remains almost 100 %. Figure 1(c) and (d) show that EMAN significantly outperforms the baseline for self-matching on PingAnFang. This result indicates that the exponential family is helpful for integrating heterogeneous entity attributes.

5.3 Matching Heterogeneous Data Sources

Scalability of EMAN. In this experiment, we examine whether LSH helps speed up EMAN. For entity matching on heterogeneous data sources, we vary only the size of NetEaseHouse from 500 to 2,500. In Fig. 2(a) (\(EMAN_{\_}L\) denotes EMAN with LSH blocking), we find that fewer than 1 % of candidate tuples remain after using LSH to block entities. In Fig. 2(b), EMAN with LSH matches entities within 2,000 s when the data source size is almost 2,000, whereas the elapsed time of EMAN without LSH exceeds 12 h. In summary, LSH helps reduce the number of candidate tuples and speeds up the computation of EMAN.

Fig. 2. Efficiency of EMAN

Manual Judgement. We now turn to matching the three heterogeneous data sources, namely the complete NetEaseHouse, AnJuKe, and PingAnFang. Since the maximum number of matched tuples is less than the minimum of |NetEaseHouse|, |AnJuKe|, and |PingAnFang|, we manually annotated the top-250 matched entities labelled by EMAN and FS. Three entities are judged to be a matched tuple when they have (1) similar entity names and (2) similar values in attributes such as address, region, and developer. The remaining entity triples are assigned the undetermined label. As shown in Fig. 3(a) and (b), the accuracy on the top-250 results of EMAN is more than 70 %, while it is about 60 % for FS. Both EMAN and FS are thus reasonably good at returning correctly matched entities for different top-K ranked tuples, but EMAN returns fewer undetermined tuples than FS.

For this experiment, we list all the parameters learned by our approach. An attribute is more important for matching entities if the difference of its parameters between the matched and unmatched groups is larger. From Table 3, we observe that name and address are the two most important attributes for this matching task.

Fig. 3. Accuracy for entity matching

6 Conclusion

In this paper, we have studied the problem of entity matching across two or more heterogeneous data sources. It is a challenging task due to the overwhelming computational expense, the heterogeneous attributes of each entity, and incomplete or missing data. We propose an unsupervised method to deal with these challenges and evaluate it on three real data sources. Experimental results indicate that EMAN not only outperforms the comparable baseline but also achieves promising performance overall.

In future work, we plan to extend our approach to exploit available ground-truth tuples in a semi-supervised manner, and to develop a distributed algorithm to support more efficient computation.