1 Introduction

Instance matching addresses the problem of detecting different instances that describe the same object. Such instances are called co-referent instances. Instance matching is an important task in knowledge discovery and data mining, especially in data integration, data cleansing, and linked data interlinking (Hernández and Stolfo 1995; Rahm and Do 2000; Ferrara et al. 2011). It enables a data integration system to maintain consistency and integrity when combining different repositories. In linked data, instance matching is an essential component for linking the co-referent instances of different repositories. This process enriches the knowledge of linked data by connecting independently created repositories. Because of its importance, instance matching has been extensively studied (Winkler 2006; Koudas et al. 2006; Köpcke and Rahm 2010). However, a definitive solution remains an open research problem.

In our work, we focus on instance matching for data in which the properties are declared. Examples of this form of data are relational data, XML, and linked data. Our study is motivated by the problem of data integration in general and linked data interlinking in particular, whose difficulties are largely shared but also vary with the characteristics of each problem. The first common difficulty is the heterogeneity at the schema and instance levels. The heterogeneity of schemas arises from the fact that different repositories do not use the same schema, which contains the description of each property (e.g., ‘name’ and ‘range’). For example, ‘abstract’ may be named ‘comment’, ‘description’, or even a random code. The heterogeneity of instances also results from inconsistent conventions and from data being collected from various sources. The same attribute of an instance can be represented in different styles (e.g., measurement unit, name-part ordering, and time format) or even by different values (e.g., New York and NY, clicker and remote control). The second common difficulty is data ambiguity. Different values can describe the same information and, conversely, the same value can simultaneously refer to different things (e.g., Beverly Hills, California and Beverly Hills, Texas). Heterogeneity and ambiguity are clearly causes of reduced accuracy.

In our work, although linked data is not the only focus, it is the main motivation for addressing scalability and heterogeneity. These challenges are especially difficult for linked data because of its open access and large scale. Most prominent linked data repositories are constructed from crowdsourced data, which offers a great opportunity to enlarge the knowledge. However, it also makes the data vulnerable to the inclusion of incorrect or inconsistent information. Besides that, the heterogeneity and ambiguity of linked data repositories are much higher than those of traditional databases. For example, DBpedia 3.9 contains 45,858 unique properties, 5,314,053 non-redirected instances (Footnote 1), and more than 20,000 instances sharing the token ‘John’ or ‘William’. Matching repositories involving such data is very challenging in terms of accuracy and scalability.

In order to perform the instance matching task, a matching configuration is used, implicitly or explicitly, to specify how instances are compared (Ngomo and Auer 2011; Volz et al. 2009). A typical configuration includes the equivalent property mappings, similarity metrics, and similarity aggregation. One or more similarity metrics can be applied to a property mapping in order to capture different aspects of the compared values. The similarity aggregation combines the similarities of all property mappings into a final score, which is used to determine the co-reference possibility.
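
To make the notion of a configuration concrete, the following sketch shows what such a specification might look like in Python. All property names, metrics, weights, and the threshold are purely illustrative assumptions, not values used by ScLink.

```python
# A minimal, illustrative matching configuration. Property names, metrics,
# weights, and the threshold are hypothetical examples only.
example_configuration = {
    "property_mappings": [
        {"source": "name", "target": "label", "metrics": ["levenshtein", "tfidf_cosine"]},
        {"source": "birthDate", "target": "dateOfBirth", "metrics": ["exact"]},
    ],
    # The aggregation combines the per-mapping similarities into one matching score.
    "aggregation": {"type": "weighted_average", "weights": [0.7, 0.3]},
    # A pair is considered co-referent if its aggregated score reaches this threshold.
    "threshold": 0.8,
}
```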

Besides the configuration-based approach, supervised instance matching can also be treated as a classification problem (Köpcke et al. 2010; Soru and Ngomo 2014). In contrast to the configuration-based approach discussed above, this approach takes the classifier's prediction as the co-reference decision. The limitation of the classification-based approach is that the classifier may be difficult to interpret. Furthermore, post-processing, which has been shown to be effective (Nguyen et al. 2012a; Ngomo et al. 2011), requires explicit matching scores, which a classifier does not offer.

Instance matching is a well-studied problem. Early instance matching systems were manually operated (Cruz et al. 2009; Niu et al. 2011), with the configuration constructed by a human. The heterogeneity of repositories limits the automation of this approach because it prevents the constructed configuration from being ported to other repositories. In addition, since a user's experience is not guaranteed to cover every data domain, the useful property mappings, suitable similarity metrics, and the optimal values of other settings are difficult to decide precisely. As a consequence, the accuracy is reduced. Matching configurations can be constructed effectively and automatically by using learning algorithms. Unsupervised learning is an option for a fully automatic system (Hall et al. 2008; Ngomo and Lyko 2013; Nikolov et al. 2012; Bhattacharya and Getoor 2006). However, it is still far from practical usage because it delivers only modest accuracy while suffering from very high complexity. In contrast, supervised configuration learning offers high accuracy and low complexity. Although this approach requires some labeled instance pairs for the learning step, a small amount of training data can compensate for the loss of automation. Configuration learning has been the focus of a few systems (Nikolov et al. 2012; Ngomo and Lyko 2012; Isele and Bizer 2013). However, those systems are not suitable for highly heterogeneous and large-scale datasets because they need manual intervention or have high complexity.

We present ScLink, a scalable and supervised instance matching system. ScLink consists of two phases: a learning phase and a resolution phase. In the learning phase, ScLink automatically generates property mappings and assigns similarity metrics to each mapping in order to create the initial similarity functions. An optimal combination of these similarity functions, as well as the other parameters of the resolution phase, is selected by a learning algorithm.

ScLink makes three main contributions. The first is cLearn, a heuristic-based configuration learning algorithm. The second is minBlock, a learning algorithm for the blocking model. Blocking is a technique used for quickly detecting candidates of potentially co-referent instances. The objective of this step is to avoid comparing all pairwise alignments of instances between the input repositories. The third novelty is the modified BM25 (mBM25, Footnote 2), a novel similarity metric that is robust to ambiguous strings. ScLink is highly scalable because its simple architecture enables parallel processing of instance comparisons. In addition, unlike previous supervised systems, ScLink optimizes the blocking model and thus generates fewer candidates. That is, the complexity is much reduced.

ScLink is an extensive improvement of cLink (Nguyen and Ichise 2015a). Compared to cLink, ScLink addresses scalability, ambiguity, and heterogeneity more thoroughly. The novelties of ScLink compared to cLink include the workflow, the minBlock algorithm, and mBM25. With the inclusion of these new elements, the experiments are conducted more elaborately with more datasets. In addition, the cLearn algorithm is analyzed in greater depth in this paper. In short, the main contributions of ScLink are minBlock, cLearn, and mBM25.

We analyze the performance of ScLink in many aspects. We use 15 real datasets whose sizes vary from small to very large. We evaluate minBlock, cLearn, and mBM25 in detail. We re-implement recent configuration learning algorithms (Isele and Bizer 2012, 2013; Nikolov et al. 2012; Hu et al. 2014; Ngomo and Lyko 2012) to compare them with cLearn using the same input similarity functions and all other parameters. We compare ScLink with previous systems, including supervised and state-of-the-art systems. Interestingly, the experimental results show that our system needs only a small amount of training data to construct an effective configuration, which supports the applicability of ScLink to practical instance matching problems.

The rest of the paper is organized as follows. Section 2 defines the problem of instance matching and provides some preliminary concepts. Section 3 describes ScLink in detail. Section 4 reports the experiments. Section 5 summarizes the related work. Section 6 reviews our study and outlines future work.

2 Preliminaries

The problem of instance matching is defined as follows. Given are two repositories: the source \(R_S\) and the target \(R_T\). Each repository is a collection of instances and is associated with a schema. In this paper, the schema of a repository is defined as all properties existing in that repository. In addition, the set of attributes of an instance x described by property p is denoted p(x). In relational databases, an element of p(x) can be considered a cell value. In linked data, where the RDF statement 〈subject, predicate, object〉 is used to describe a fact, p(x) is equivalent to predicate(subject), which returns all related objects. Due to heterogeneity, p(x) may be empty (e.g., a missing value) or contain one or multiple elements (e.g., different names for one thing).

The objective of instance matching is to identify the co-referent set \(I \subseteq R_S \times R_T\). A pair (x, y) of instances \(x \in R_S\) and \(y \in R_T\) belongs to I if x and y describe the same object. In the supervised scenario, a subset of \(R_S \times R_T\) is labeled and given to the learning algorithms. When the learning process finishes, the obtained knowledge is applied to detect the co-references existing in the unlabeled set.

The discrimination of source and target repository enables a role assignment for the input repositories. Most researchers consider the target as a reference repository, which is expected to contain the instances that are co-referent with many instances of the source repository. Typically, the target repository therefore contains more instances and is consequently more heterogeneous and ambiguous. This discrimination is helpful for reducing the complexity and improving the accuracy of instance matching. These advantages are further discussed in the next section.

The evaluation of instance matching is based on three measures: recall, precision, and F1 score. Among them, F1, the harmonic mean of recall and precision, is prioritized because it reflects the balance between recall and precision. Considering I as the expected co-references and D as the result of the instance matching process, recall and precision are calculated as follows.

$$ recall(I, D)=\frac{|I\cap D|}{|I|} $$
(1)
$$ precision(I, D)= \frac{|I\cap D|}{|D|} $$
(2)
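
As a concrete illustration of these measures, the following sketch computes recall, precision, and F1 for sets of instance pairs. It simply restates (1) and (2) and adds the harmonic mean; the example pairs are hypothetical.

```python
def recall(expected, detected):
    """Equation (1): fraction of expected co-references that were detected."""
    return len(expected & detected) / len(expected)

def precision(expected, detected):
    """Equation (2): fraction of detected pairs that are actual co-references."""
    return len(expected & detected) / len(detected)

def f1(expected, detected):
    """Harmonic mean of recall and precision."""
    r, p = recall(expected, detected), precision(expected, detected)
    return 2 * r * p / (r + p) if (r + p) > 0 else 0.0

# Hypothetical example, with pairs represented as (source_id, target_id) tuples.
I = {("x1", "y1"), ("x2", "y2"), ("x4", "y4")}
D = {("x1", "y1"), ("x2", "y3"), ("x4", "y4")}
print(recall(I, D), precision(I, D), f1(I, D))  # 0.667, 0.667, 0.667
```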

We use an example of two small repositories to illustrate the details of ScLink. Table 1 depicts this example. In the next section, we describe ScLink in detail and use this example to explain the main ideas. In addition, for ease of reference, Table 2 lists the notations frequently used in this article.

Table 1 A motivating example
Table 2 List of frequently used notations

3 ScLink

We begin with a description of the general workflow and then detail each step.

3.1 Overview

The general workflow of ScLink is depicted in Fig. 1. The instance matching process consists of two phases: learning and resolution. The learning phase contains four steps: property alignment, similarity function generation, blocking, and configuration learning. The resolution phase contains three steps: blocking, similarity aggregation, and filtering. ScLink takes two repositories \(R_S\) and \(R_T\) as the input. \(R_S\) is further separated into two parts: \(R_{SL}\) and \(R_{SU}\).

Fig. 1 The general architecture of ScLink

In the learning phase, first, the property alignment generates the property mappings using an overlap measure over the values described by all instances of \(R_S\) and \(R_T\). According to the type of each mapping, the similarity function generation assigns the similarity metrics and produces the initial similarity functions. Each function computes one atomic score reflecting an aspect of the similarity between two instances. In parallel, in the blocking step, the minBlock algorithm finds an optimal blocking model using the generated property mappings. The learned model is used to create the labeled candidate set \(C_L\) and the unlabeled candidate set \(C_U\), where \(C_L \subseteq R_{SL} \times R_T\) and \(C_U \subseteq R_{SU} \times R_T\). In this step, an annotation process is conducted to create the training data for minBlock as well as cLearn. Together with the initial similarity functions, the labeled candidates \(C_L\) are input into the configuration learning step, in which cLearn is executed. cLearn finds the optimal configuration, which contains the specification of the further steps, including similarity aggregation and filtering.

In the resolution phase, the optimal configuration is used to detect the co-references from the unlabeled candidates. For each candidate, the similarity aggregation executes the learned similarity functions and combines the results into one final matching score. Finally, the filtering step produces the co-references for the unlabeled candidates by applying a constraint that considers the matching scores of all candidates.

The main sub-problems of our work are the property alignment, minBlock, and cLearn. The objective of property alignment is to find the property mappings that are useful for matching instances, given the schemas of the repositories. The minBlock algorithm finds effective blocking functions that generate fewer candidates while preserving high recall. cLearn mainly searches for an optimal combination of similarity functions and a similarity aggregator. All of these are highly complex because the goal is to find the optimal solution within a large collection of possible answers. The search space for these problems is at least \(2^{n}-1\), where n is \(|schema(R_S)| \times |schema(R_T)|\), the number of blocking functions, or the number of similarity functions, for the respective sub-problems. For cLearn, the algorithm also finds the optimal similarity aggregator. Hence, the search space is m times larger, where m is the number of possible aggregators.

Most parts of the above workflow are shared with many configuration-based supervised instance matching systems (Hu et al. 2014; Ngomo et al. 2011; Isele and Bizer 2012, 2013). Considered as a complete system, one original aspect of ScLink is that it also applies learning to the blocking step to create an optimal model, which can generate compact candidate sets. Other systems use all property mappings for this step and thus skip any quality judgment of the input information, even though those mappings are not guaranteed to be correct. In other words, using all of them may produce many unnecessary candidates and further limit scalability.

3.2 Property alignment

The mission of property alignment is to find the property mappings that are expected to describe the same attributes. To solve this, ScLink first selects the property candidates from \(schema(R_S)\) and then aligns them to the appropriate properties in \(schema(R_T)\), where \(schema(R_S)\) and \(schema(R_T)\) are all properties of \(R_S\) and \(R_T\), respectively.

3.2.1 Select property candidates from source repository

ScLink does not take all properties of the source repository into the comparison. Instead, only the properties that satisfy the requirements of diversity and frequency are considered. These measures are adapted from the discriminability and coverage designed in Song and Heflin (2011). The diversity (div) of a property p reflects how many unique values are described by p. Meanwhile, the frequency (frq) of p expresses how many instances contain p. These measures are calculated by (3) and (4), respectively.

$$ div(p, R_{S}) = \frac{|\bigcup_{x\in R_{S}} p(x)|}{{\sum}_{x\in R_{S}}{|p(x)|}} $$
(3)
$$ frq(p, R_{S}) = \frac{|\{x|x \in R_{S}, p(x)\ne \emptyset\}|}{|R_{S}|} $$
(4)

We separate the properties by their type before performing the selection. This mechanism enables the user to set a quota for each type of property, in order to increase the inclusion of useful properties of various types. We distinguish four property types: string, number, date, and URI. For each type, ScLink limits the search space to only the properties whose diversity satisfies a threshold \(t_{div}\). Then, a property is considered a candidate if it is among the top \(K_{frq}\) most frequent properties.
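
The following sketch shows one way this selection could be implemented, with a repository modeled as a dictionary mapping instance identifiers to property-value lists. The data model and parameter defaults are assumptions for illustration, not ScLink's actual implementation.

```python
def div(p, R):
    """Equation (3): ratio of unique values to all values described by property p."""
    all_values = [v for x in R.values() for v in x.get(p, [])]
    return len(set(all_values)) / len(all_values) if all_values else 0.0

def frq(p, R):
    """Equation (4): fraction of instances in which property p is non-empty."""
    return sum(1 for x in R.values() if x.get(p)) / len(R)

def select_candidates(R, properties_of_type, t_div=0.5, k_frq=4):
    """Keep properties whose diversity reaches t_div, then take the K_frq most frequent."""
    qualified = [p for p in properties_of_type if div(p, R) >= t_div]
    return sorted(qualified, key=lambda p: frq(p, R), reverse=True)[:k_frq]
```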

3.2.2 Align properties between source and target repository

Each selected property from the source repository is aligned with the target properties of the same type. Each alignment is also called a mapping. The confidence of a mapping \([p_S, p_T]\) is measured by counting the values shared by \(p_S\) and \(p_T\). Concretely, it is defined as follows:

$$\begin{array}{@{}rcl@{}} conf([p_{S},p_{T}]) &=& \frac{|V_{p_{S}} \cap V_{p_{T}}|}{|V_{p_{S}}|}\\ V_{p_{k}} &=& \bigcup_{x\in R_{k}}\bigcup_{v \in p_{k}(x)}E(v) \end{array} $$
(5)

where E is a preprocessing function, detailed in Table 3. As the table shows, E is designed to work with various data types. Among the mappings related to one property of the source repository, we select those that belong to the top \(K_{conf}\) most confident ones. In practice, a very small threshold, fixed at 0.01, is applied to the confidence in order to skip barely relevant mappings.

Table 3 Value preprocessing
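
A possible implementation of the confidence measure in (5) is sketched below. The preprocessing E is reduced here to lower-cased tokenization of string values, as a stand-in for the full rules of Table 3.

```python
def E(value):
    """Stand-in for the preprocessing of Table 3: lower-cased tokens of the value."""
    return set(str(value).lower().split())

def values_of(p, R):
    """V_p: the union of preprocessed values described by p over all instances of R."""
    out = set()
    for x in R.values():
        for v in x.get(p, []):
            out |= E(v)
    return out

def conf(p_s, p_t, R_S, R_T):
    """Equation (5): fraction of source values of p_s that also appear under p_t."""
    V_s, V_t = values_of(p_s, R_S), values_of(p_t, R_T)
    return len(V_s & V_t) / len(V_s) if V_s else 0.0
```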

As an illustration, consider the example in Table 1: frq(Name, \(R_S\)) and div(Name, \(R_S\)) are both 1.0 because Name appears in all instances and all its values are different. However, frq(Class, \(R_S\)) and div(Class, \(R_S\)) are 4/5 and 2/2, respectively. Suppose that \(t_{div}\) = 0.5 and \(K_{frq}\) = 2; then both Class and Name are selected in the first step. In the second step, the confidences of the mappings are listed in Table 4. Let \(K_{conf}\) = 2; all mappings whose confidence is higher than zero are selected as the final result.

Table 4 Example: confidence of property mappings

\(K_{frq}\), \(t_{div}\), and \(K_{conf}\) can be chosen based on the number of retrieved property mappings. For each data type, the maximum number of mappings is \(K_{frq} \times K_{conf}\) and additionally depends on \(t_{div}\). More property mappings increase the complexity of the later learning algorithms but may yield more accurate results. In other words, setting these parameters is a heuristic for balancing accuracy and complexity.

The diversity and frequency are computed only for the properties of the source repository, instead of both repositories. This strategy originates from the assumption that the target is a reference repository (Section 2). The target property \(p_T\) is required to cover many values described by the source property \(p_S\). Therefore, it is sufficient to measure the confidence directly without applying property selection to the target repository.

ScLink generates the initial property mappings by observing the values described by the properties. That is, the creation of mappings is independent of the schema description. Therefore, ScLink is classified as a schema-independent system; in other words, the heterogeneity of schemas is handled. However, this is only half of the solution because the quality of the property mappings is not validated. The refinement is done by the configuration learning step.

Some systems have also applied similar techniques to find property mappings (Nguyen et al. 2012a; Araujo et al. 2015). However, prior property selection from the source repository was not investigated, even though computing the confidence of all pairwise mappings between \(schema(R_S)\) and \(schema(R_T)\) is impractical on large datasets. In addition, other systems use the Jaccard index to measure the confidence, by replacing the denominator of (5) with \(|V_{p_{S}}\cup V_{p_{T}}|\). Quantifying such a value is expensive because all elements of \(V_{p_{T}}\) must be retrieved. For a large reference repository, querying \(V_{p_{T}}\) is not trivial. For example, if the property label exists in almost all instances, \(V_{label}\) is very large and the number of executions of the preprocessing function E increases accordingly. Our strategy remains valid because we rank the mappings for each property of the source repository independently.

All property mappings are used for the similarity function generation and the mappings of string properties are used in the blocking step. In the next two sections, we describe those steps in detail.

3.3 Similarity function generation

The task of this step is to assign suitable similarity metrics to the property mappings. The result of one assignment is a similarity function. This step creates only the prototypes of the similarity functions; they are executed by the similarity aggregation step, which is described in Section 3.6. We denote a similarity function as \(sim_{mapping}^{metric}\). The formal definition of a similarity function is given in (6).

$$ sim_{[p_{S}, p_{T}]}^{metric} (x, y) = \underset{v_{x} \in p_{S}(x), v_{y} \in p_{T}(y)}{\max}metric(v_{x}, v_{y}) $$
(6)

where the max operator returns only the similarity of the most similar values described by \(p_S\) and \(p_T\), because \(p_S(x)\) or \(p_T(y)\) may contain multiple facts. For example, given name(x) = {‘Sony’, ‘Sony Corporation’} and label(y) = {‘Sony Global’}, \(sim_{[name, label]}^{Jaro}(x,y)\) compares the pairwise values of name(x) and label(y) using the Jaro similarity and returns the maximum value.
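
A sketch of how such a similarity function could be built from a property mapping and an arbitrary metric is shown below. The token-overlap metric used in the example is a toy stand-in for the Jaro similarity mentioned above.

```python
def make_sim_function(p_s, p_t, metric):
    """Build sim^{metric}_{[p_s, p_t]} as in (6): the best score over all value pairs."""
    def sim(x, y):
        values_x, values_y = x.get(p_s, []), y.get(p_t, [])
        if not values_x or not values_y:
            return 0.0
        return max(metric(vx, vy) for vx in values_x for vy in values_y)
    return sim

# Toy token-overlap metric standing in for the Jaro similarity.
def token_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

x = {"name": ["Sony", "Sony Corporation"]}
y = {"label": ["Sony Global"]}
sim_name_label = make_sim_function("name", "label", token_overlap)
print(sim_name_label(x, y))  # 0.5, from the pair ('Sony', 'Sony Global')
```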

ScLink assigns only the similarity metrics that are suitable for the type of the property mapping of interest. We distinguish four types of mappings: string, number, date, and URI. The list of similarity metrics for these types is reported in Table 5. According to this table, exact matching is used for comparing URIs and dates. The similarity of URI and date values is either 0 or 1, which is suitable for real data (e.g., homepage, birth date, and release date). Although the variation of time is useful in temporal instance matching (Christen and Gayler 2013), the focus of ScLink is the general context. For numbers, the inverse difference (7) is used to calculate how close two numbers a and b are. Exact matching could also be used for this data type; however, the inverse difference is more flexible when the repositories contain slightly different values (e.g., geographic coordinates).

$$ invDiff(a,b)=\frac{1}{1+|a-b|} $$
(7)
Table 5 Similarity metrics by data type

We apply the Levenshtein metric to the comparison of short strings because it is effective for this kind of data (Levenshtein 1966). For long strings, we use the well-known TF-IDF cosine metric and introduce the novel modified BM25 metric. These metrics take the tokens (w) of the given strings into the computation, together with their normalized term frequency (TF) and inverse document frequency (IDF). Equations (8) and (9) give the customized calculations of those weights, adapted to the instance matching scenario.

$$ TF(w,x,R) = \frac{\sum\limits_{p\in schema(R)}|\mathcal{V}(w,p,x)|} {\max\limits_{t\in R}{\sum\limits_{p\in schema(R)}|\mathcal{V}(w,p,t)|}} $$
(8)
$$ IDF(w,R) = \log{\frac{|R| - \sum\limits_{p\in schema(R)}|\{x | x\in R, \mathcal{V}(w,p,x) \ne \emptyset\}| }{\sum\limits_{p\in schema(R)}|\{x | x\in R, \mathcal{V}(w,p,x) \ne \emptyset\}|}} $$
(9)

where

$$\mathcal{V}(w,p,x) = \{v|v \in p(x), w \in E(v)\} $$

By including (8) and (9), the TF-IDF cosine and the modified BM25 similarities of two token lists a and b belonging to \(p_S(x)\) and \(p_T(y)\) are calculated as in (10) and (11), respectively.

$$\begin{array}{@{}rcl@{}} cosine(a,b) &=& \frac{\sum\limits_{w\in a \cap b}TFIDF(w,x,R_{S})\times TFIDF(w,y,R_{T})}{\sqrt{\sum\limits_{w\in a}TFIDF(w,x,R_{S})\times {\sum\limits_{w\in b}TFIDF(w,y,R_{T})}}}\\ TFIDF(w,x,R) &=& TF(w,x,R) \times IDF(w,R) \end{array} $$
(10)
$$ mBM25(a, b) = \frac{|a\cap b|}{|a\cup b|}\times {\sum}_{w\in a \cap b}{tWeight(w,x,y)\times tEdit(w,a,b)} $$
(11)

where

$$\begin{array}{@{}rcl@{}} tWeight(w,x,y) &=& IDF(w,R_{S}) \times IDF(w,R_{T})\times\frac{TF(w, y, R_{T}) \times k_{1}}{TF(w,y,R_{T})+k_{2}}\\ tEdit(w,a,b) &=& invDiff(pos(w,a)-pos(w_{0},a), pos(w,b)-pos(w_{0},b)) \end{array} $$

The TF-IDF cosine is popularly used for its advantages in weighting the tokens. However, it is sensitive to ambiguity because it ignores the token order, an important piece of information. Using a similarity metric with a robust disambiguation capability is very important. We introduce the modified BM25 as a more effective metric, which also considers this useful aspect. The first factor, \(\frac {|a\cap b|}{|a\cup b|}\), is the Jaccard coefficient of a and b. The second factor is the sum, over the shared tokens, of the token weight tWeight multiplied by the order inverse difference tEdit. The function pos returns the position of a token in its parent string. The variable \(w_0\) denotes the first shared token between a and b. The combination of the Jaccard coefficient, tWeight, and tEdit is effective because we can simultaneously penalize less overlapping strings, include the token weight, and consider the token order. tWeight modifies the original BM25 weighting scheme (Robertson et al. 1994), which was originally designed for information retrieval. tWeight eliminates the target document length in BM25 and introduces \(IDF(w, R_S)\). \(k_1\) and \(k_2\) are fixed to the BM25 defaults, 2.2 and 0.3, respectively. In our experiments, the modified BM25 shows considerable effectiveness compared with the TF-IDF cosine on highly ambiguous datasets.

We illustrate two motivating examples of mBM25. We simplify the examples by assuming that all tokens have the same weight, so tWeight is skipped in the calculation. The first example shows the necessity of tEdit. Suppose we need to compare ‘Tokyo (Prefecture of Japan)’ (1) to ‘Tokyo Prefecture’ (2) and ‘Tokyo, Japan’ (3). The task is to recognize that (1) and (2) refer to the same entity (Tokyo Metropolis) but (1) and (3) do not, because (3) refers to the special wards of Tokyo. If we use only the Jaccard measure, it fails to discriminate the cases, because it gives 2/4 for both pairs. However, tEdit for these pairs is, in turn, invDiff(0,0) for ‘Tokyo’ plus invDiff(1,1) for ‘Prefecture’, and invDiff(0,0) for ‘Tokyo’ plus invDiff(3,1) for ‘Japan’. Because the former is higher, the task is successfully done. The second example shows the necessity of the Jaccard measure. We compare ‘Tokyo University’ (4) to ‘University of Tokyo’ (5) and ‘The Tokyo Metropolitan University’ (6). We have tEdit(4,5) = 5/4 and tEdit(4,6) = 3/2. Therefore, tEdit fails to recognize that (4) is closer to (5) than to (6). However, because Jaccard(4,5) = 2/3 and Jaccard(4,6) = 2/4, the respective final results are 10/12 and 6/8, which meets the expectation.
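
The following sketch reproduces the second example above under the same simplification, namely that tWeight is fixed to 1 for every shared token; the full metric in (11) would additionally multiply each term by the BM25-style weight.

```python
def inv_diff(a, b):
    """Equation (7): inverse difference of two numbers."""
    return 1.0 / (1.0 + abs(a - b))

def simplified_mbm25(a, b):
    """mBM25 of two token lists with tWeight fixed to 1, as in the examples above."""
    shared = [w for w in a if w in b]           # tokens appearing in both strings
    if not shared:
        return 0.0
    jaccard = len(set(a) & set(b)) / len(set(a) | set(b))
    w0 = shared[0]                              # first shared token
    t_edit = sum(inv_diff(a.index(w) - a.index(w0), b.index(w) - b.index(w0))
                 for w in shared)
    return jaccard * t_edit

print(simplified_mbm25(["Tokyo", "University"], ["University", "of", "Tokyo"]))
# 0.833... (10/12): string (4) compared to string (5)
print(simplified_mbm25(["Tokyo", "University"], ["The", "Tokyo", "Metropolitan", "University"]))
# 0.75 (6/8): string (4) compared to string (6)
```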

The TF-IDF cosine and modified BM25 metrics compare tokens using exact matching. Since we focus on real data, which infrequently contains token distortion, exact matching is effective. In addition, ScLink is implemented with a flexible mechanism that allows new similarity metrics to be injected when needed.

3.4 Blocking

The input of this step includes the string property mappings, \(R_{SL}\), \(R_{SU}\), and \(R_T\). In this step, the labeled candidate set \(C_L\) and the unlabeled set \(C_U\) are generated. A candidate is defined as a pair of instances that is expected to be an actual co-reference. This step is very important because it reduces the number of pairwise alignments between \(R_S\) and \(R_T\). Therefore, besides retaining as many actual co-references as possible, another mission of this step is to generate candidate sets of compact size.

For each property mapping, we define a blocking function, which receives a set of instance pairs C as input and returns the pairs that satisfy the blocking mechanism. The input, for example, can be \(R_S \times R_T\) or just a subset of this Cartesian product. A blocking function is a combination of two elements: a property mapping and a blocking mechanism. A blocking mechanism can be simple exact matching or more advanced soundex or edit-distance matching. In ScLink, the blocking mechanism qualifies a pair if the instances share the first token of the values described by the properties of interest. Using multiple blocking functions increases the recall but is more likely to generate incorrect candidates. Therefore, selecting an optimal set of blocking functions is very important.
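
A sketch of such a first-token blocking function is given below, using the same dictionary-based instance model as the earlier sketches; applying a blocking model then amounts to taking the union of the outputs of its functions.

```python
def first_tokens(values):
    """The set of first tokens (lower-cased) among the values of a property."""
    return {str(v).lower().split()[0] for v in values if str(v).strip()}

def make_blocking_function(p_s, p_t):
    """Blocking function for one property mapping: keep pairs sharing a first token."""
    def block(pairs):
        return [(x, y) for (x, y) in pairs
                if first_tokens(x.get(p_s, [])) & first_tokens(y.get(p_t, []))]
    return block
```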

The procedure of the blocking step is as follows. First, we create a default blocking model \(B_{def}\), which is the set of all blocking functions generated from all input property mappings. The result of applying a blocking model is the union of the results of its members. Then, a candidate set \(C_{L0}\) is generated as the result of \(B_{def}(R_{SL} \times R_T)\), where \(R_{SL}\) is a portion of \(R_S\). \(C_{L0}\) is passed through an annotation process in which a positive or negative label is assigned to each candidate. The annotation can leverage the achievements of other studies, for instance, by combining automatic matching algorithms and crowdsourcing-based technology (Demartini et al. 2013; Vesdapunt et al. 2014). After that, the minBlock algorithm learns an optimal blocking model \(B_{opt}\) from \(C_{L0}\) and \(B_{def}\). Finally, \(B_{opt}\) is used to create \(C_L\) and \(C_U\), where \(C_L = B_{opt}(C_{L0})\) and \(C_U = B_{opt}(R_{SU} \times R_T)\).

An optimal blocking model generates compact candidate sets while ignoring as few actual co-references as possible. For learning such a model, we propose minBlock. minBlock considers all input blocking functions as the baseline and tries to retain the actual co-references that the baseline can generate. Meanwhile, it tries to reduce the unexpected candidates. The pseudo code of minBlock is given in Algorithm 1. In this pseudo code, \(C^{+}_{L0}\) is the set of positive candidates and recall is calculated by (1). The idea of minBlock is simple. First, the variable minRec is assigned a value estimated from the recall of using all blocking functions (\(B_{def}\)) (line 2). minRec is the lower bound for the optimal blocking model. By default, \(t_{loss}\) is set to 1.0, which expects zero loss. After that, in each main iteration controlled by line 3, minBlock removes the blocking function that generates the highest number of candidates while the remaining model still satisfies minRec. This selection is reflected by line 8. The algorithm stops when no blocking function can be removed (line 12).

Algorithm 1 The minBlock algorithm (pseudo code)

Consider the example in Table 1. From the property alignment, we have three property mappings: [Name, FirstName], [Name, Alias], and [Class, Type]. Let \(f_1\), \(f_2\), and \(f_3\) be the blocking functions defined for those mappings, all with the same blocking strategy of matching the first token. The co-references \((x_1, y_1)\), \((x_2, y_2)\), and \((x_4, y_4)\) are given to minBlock as \(C^{+}_{L0}\). By applying \(f_1\), \(f_2\), and \(f_3\) on \(R_{SL} \times R_T\), we get the respective candidate sets: \(\{(x_1, y_1),(x_2, y_3),(x_3, y_3)\}\), \(\{(x_1, y_1),(x_1, y_3),(x_4, y_4)\}\), and \(\{(x_1, y_4),(x_2, y_2),(x_3, y_2),(x_4, y_4)\}\). The result of applying the initial \(B_{def} = \{f_1, f_2, f_3\}\) is the union of the above candidate sets, and it covers all the expected co-references. Therefore, with \(t_{loss}\) = 1.0, the final blocking model is required to produce all three co-references (i.e., minRec = 1.0). In the first iteration controlled by line 3, the removal of each function in \(B_{def}\) is considered. Without \(f_3\), the candidate \((x_2, y_2)\) is not produced, while without \(f_1\) or \(f_2\), all co-references are still retrieved. \(f_1\) produces two unnecessary candidates (\((x_2, y_3)\) and \((x_3, y_3)\)) while \(f_2\) produces only one (\((x_1, y_3)\)). Therefore, \(f_1\) is removed and \(B_{opt}\) becomes \(\{f_2, f_3\}\). In the next iteration, because the removal of either \(f_2\) or \(f_3\) violates minRec, the algorithm stops.
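
A greedy sketch that follows the removal strategy described above is shown below. It assumes that the candidate set of each blocking function has been precomputed, and it takes minRec to be the baseline recall scaled by \(t_{loss}\); it is an illustration of the idea rather than the authors' exact pseudo code.

```python
def min_block(functions, candidates_of, positives, t_loss=1.0):
    """Greedy sketch of minBlock: repeatedly drop the function whose removal
    discards the most candidates while keeping recall >= minRec."""
    def covered(model):
        return set().union(*(candidates_of[f] for f in model)) if model else set()

    def rec(model):
        return len(positives & covered(model)) / len(positives)

    model = set(functions)
    min_rec = rec(model) * t_loss                 # lower bound on recall (line 2)
    while True:
        best, best_size = None, len(covered(model))
        for f in model:                           # try removing each function
            reduced = model - {f}
            if rec(reduced) >= min_rec and len(covered(reduced)) < best_size:
                best, best_size = f, len(covered(reduced))
        if best is None:                          # no function can be removed
            return model
        model.remove(best)
```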

The complexity of minBlock is acceptable. The worst case happens when the if statement in line 8 always holds. In that case, the complexity of minBlock is quadratic, \(O\left ({n(n+1)\over 2}\right )\), where n is the size of \(B_{def}\). This complexity indicates a moderate computational effort.

The learning of blocking models has also been investigated in many studies (Bilenko et al. 2006; Dalvi et al. 2013; Kejriwal and Miranker 2013). Among them, the most similar to minBlock is optimal hashing (Dalvi et al. 2013), which also tries to reduce the size of the candidate set while preserving the recall. However, the problem definition of this hashing model is very different from ours and therefore the solution differs.

The definition of a blocking function allows the blocking mechanism to be modified. That is, one or multiple blocking mechanisms can be used for each property mapping. However, ScLink currently uses a simple token-based strategy (Papadakis et al. 2013) as the only mechanism. The advantage of this strategy is that it retains the highest number of expected co-references, compared with many other systems (Nguyen et al. 2012a; Mendes and Jakob 2011; Volz et al. 2009; Isele et al. 2011), which use token weighting. Although the weighting approach generates fewer candidates, it comes with a considerable drop in recall. Token-based comparison is used in many blocking strategies (e.g., attribute clustering, n-grams, and suffix arrays). The most recent benchmark is reported in Papadakis et al. (2016).

3.5 Configuration learning

The labeled candidate set \(C_L\) is separated into two sets, the training set Train and the validation set Val. Using Train, Val, the initial similarity functions \(I_{sim}\) (Section 3.3), and the similarity aggregators \(I_{agg}\) (Section 3.6), this step learns the optimal configuration \(C_{opt}\) that is most suitable for the input repositories. A configuration specifies the combination of similarity functions \(F_{sim}\), the similarity aggregator Agg, the parameter \(\sigma_{sim}\) associated with each similarity function sim, and the parameter δ of the filtering step (Section 3.7). The mission of this step is to automatically assign optimal values to all elements of \(C_{opt}\).

We use cLearn, a heuristic search method, to optimize the combination of the similarity functions and the similarity aggregator. The pseudo code of cLearn is given in Algorithm 2. In this pseudo code, we use the dot (‘.’) notation to represent the member access operator. The details of the functions and parameters used in Algorithm 2 are as follows.

  • \(K_{top}\) is used to adjust the maximum number of selected similarity functions. A higher value of \(K_{top}\) offers more opportunity to find the optimal configuration but may also increase the complexity because it may add more iterations controlled by line 12. In cLearn, we assign \(K_{top}\) = 16 by default. We use such a high value so that cLearn is unlikely to miss the optimal configuration. In addition, in our experiments, \(K_{top}\) = 16 is sufficient to find a good configuration in an acceptable time on a desktop computer.

  • Init creates a configuration by assigning given values to Agg, \(F_{sim}\), and δ.

  • Match produces the co-references from a collection of candidates, given a configuration. This step calls the similarity aggregation (Section 3.6) and filtering (Section 3.7). The similarity aggregator computes the matching score of each candidate and the filtering produces the co-references.

  • FindThreshold assigns a value to δ, a parameter of the co-reference filter. This function first selects the top \(|Train^{+}|\) candidates with the highest matching scores, where \(|Train^{+}|\) is the number of actual co-references in Train. Then, it assigns to δ the lowest score among the correctly detected co-references.

  • Evaluate computes the instance matching performance by comparing the generated results with the labeled data. In this function, the F1 score, the harmonic mean of the recall and precision calculated by (1) and (2), is used as the default performance metric.

cLearn begins by checking each similarity function (lines 4 to 10) and selecting the \(K_{top}\) best-performing ones (line 11). The selected functions are stored in the candidate variable. After that, cLearn finds the optimal combination among the \(K_{top}\) functions using an Apriori-style approach (Agrawal et al. 1994) (lines 12 to 23). In the first iteration of line 12, candidate contains groups of a single similarity function. The size of these groups increases with each iteration, as a result of combining different groups. Line 16 reflects such a combination by taking the union of \(g.c.F_{sim}\) and \(h.c.F_{sim}\). The heuristic used in this algorithm is the direct enhancement assumption: according to line 21, the performance of a combination must not be less than that of the combined groups. This heuristic is reasonable because a combination of similarity functions that reduces the performance is unlikely to lead to a further combination with improvement. The qualified combinations are carried to the next iteration by updating the candidate variable (line 23). The algorithm stops when no new combination of similarity functions is made (candidate is empty).

Algorithm 2 The cLearn algorithm (pseudo code)

The validation set Val is used to increase the generality of the final configuration \(C_{opt}\). Each iteration controlled by line 2 finds an optimal configuration with one similarity aggregator A. In other words, there are \(|I_{agg}|\) configurations in \(C_{agg}\). To select the best one from \(C_{agg}\), instead of simply picking the configuration with the best performance on Train, using Val is recommended. If the labeled data is too small to be separated into training and validation sets, cross-validation is a reasonable alternative.

We illustrate the main idea of cLearn, which is to select the optimal combination of similarity functions (lines 12 to 23). From the example given in Table 6, the result of property alignment, and minBlock, we get three property mappings, [Name, FirstName], [Name, Alias], and [Class, Type], and six candidates produced by \(B_{opt} = \{f_2, f_3\}\). In order to simplify the illustration, we assign only the Jaccard measure to the mappings and thus obtain three similarity functions \(sim_1\), \(sim_2\), and \(sim_3\) corresponding to the above mappings. Considering the average as the similarity aggregator, the results of applying all combinations are listed in Table 6. We skip the stable filtering and set the matching threshold to 0.25. That is, if a value in Table 6 satisfies 0.25, the respective candidate becomes a detected co-reference. Assume that at the first iteration of lines 12 to 23, candidate is \(\{sim_1, sim_2, sim_3\}\). In this iteration, the combinations of two functions are considered. According to Table 6, only the combination of \(sim_1\) and \(sim_3\) violates the constraint of line 21, because its performance (F1) is lower than that of using \(sim_1\). Therefore, candidate is changed to \(\{\{sim_1, sim_2\},\{sim_2, sim_3\}\}\) and advanced to the next iteration. In the second iteration, the combination \(\{sim_1, sim_2, sim_3\}\) is checked. Because it is not better than \(\{sim_1, sim_2\}\), candidate becomes empty and the algorithm stops. Finally, the optimal combination is \(\{sim_1, sim_2\}\). In this example, all combinations are visited. In our experiments, the number of checked combinations is much smaller than the number of all possible cases.
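
The combination step walked through above could be sketched as follows, with the evaluation on Train abstracted into a scoring function; this is an illustration of lines 12 to 23 only (the loop over aggregators, FindThreshold, and the validation on Val are omitted).

```python
from itertools import combinations

def combine_step(top_functions, score):
    """Apriori-style search over combinations of similarity functions.
    `score(frozenset_of_functions)` is assumed to return the F1 on Train."""
    best_set, best_f1 = None, -1.0
    candidate = {frozenset([f]): score(frozenset([f])) for f in top_functions}
    while candidate:
        for group, f1_group in candidate.items():     # track the best group seen so far
            if f1_group > best_f1:
                best_set, best_f1 = group, f1_group
        new_candidate = {}
        for (g, f1_g), (h, f1_h) in combinations(candidate.items(), 2):
            merged = g | h
            if len(merged) != len(g) + 1 or merged in new_candidate:
                continue                              # grow by one function, as in Apriori
            f1_m = score(merged)
            if f1_m >= max(f1_g, f1_h):               # direct enhancement heuristic (line 21)
                new_candidate[merged] = f1_m
        candidate = new_candidate                     # stop when no new combination is made
    return best_set, best_f1
```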

Table 6 Example: cLearn algorithm

We analyze the complexity by the number of configurations that cLearn has to check. In total, the complexity is \(O(|G_{inp}| \times (|F_{def}| + f(while)))\), where f(while) is the complexity of the while block from line 12 to line 23. The complexity of this block depends on \(K_{top}\) and the probability of the if statement holding. Let \(p_i\) be the probability that the if statement holds in the i-th iteration of the while block. The complexity of the i-th iteration is \(f(while^{i})=p_{i-1} \times \binom {K_{top}}{i+1}\), where iterations are counted from 1 and \(p_0 = 1.0\). That is, the size of candidate may increase for the first few iterations (if \(K_{top} > 3\)) and then decrease. In the worst case, if p is equal to 1.0 for all iterations, the result will be the union of all \(K_{top}\) similarity functions. Otherwise, lower values of p make the algorithm finish faster.

cLearn works under the principle of the well-known Apriori algorithm, where each similarity function is treated as an item. cLearn begins by considering each single similarity function and then checks their combinations. Since the goals of Apriori and our learning task are different, we customize the acceptance mechanism used when checking a new combination of similarity functions. While in the Apriori algorithm the condition is uniformly fixed for all cases, we adaptively change the criterion for each combination, following the heuristic discussed above.

The global optimum of the configuration can be found using exhaustive search. However, such a method is extremely expensive in terms of computational cost and thus is not used. Some systems use genetic search to address the issue of complexity (Nikolov et al. 2012; Ngomo and Lyko 2012; Isele and Bizer 2013), but this algorithm is still time-consuming because it is based on random convergence. In addition, the genetic algorithm has many free parameters. cLearn uses a reasonable heuristic to reduce the complexity and minimize the number of parameters.

Most studies consider the supervised instance matching problem as the discovery of property mappings, instead of identifying them by annotation. The reason is that semantically correct mappings are different from mappings that are useful for instance matching. For example, air transportation companies frequently share their ICAO code (i.e., a unique abbreviated name) and stock symbol; or a full name can be matched against a first name and a last name, instead of only against values carrying the same amount of information. Besides taking this issue into account, ScLink also combines the similarity metrics and the mappings in order to directly optimize the similarity functions. The motivation of this mechanism is that each metric offers different advantages on different properties in association with particular repositories.

3.6 Similarity aggregation

This step computes the final matching score for each candidate using the similarity functions \(F_{sim}\) and their parameters \(\sigma_{sim}\) specified by a configuration. The matching score mScore(x, y) for two instances x and y is defined as follows:

$$\begin{array}{@{}rcl@{}} mScore_{{F}_{sim}}(x,y)&=&\frac{1}{valid(U_{{F}_{sim}}(x,y))}{\sum}_{v\in U_{{F}_{sim}}(x,y)}{v^{k}}\times weight(y)\\ U_{{F}_{sim}}(x,y) &=&\{sim(x,y)|sim(x,y)\ge \sigma_{sim}, sim\in {F}_{sim}\} \end{array} $$
(12)

where \(k\in\{1,2\}\), valid is a counting function, weight is a function weighting the target instance y, and \(\sigma_{sim}\) is the parameter of each similarity function sim, which is determined automatically by cLearn (at line 8). k controls the transformation of each similarity v. When k = 1, the mScore function acts as a first-order aggregation; when k = 2, we have a quadratic aggregation. There are two variations of valid, which return the number of elements in \(U_{{F}_{sim}}(x,y)\) or always 1.0, respectively. The difference between these variations is that the latter penalizes a pair (x, y) having similarities \(sim(x, y) < \sigma_{sim}\) while the former does not. For the weight function, ScLink also provides two options. For non-weighting, weight(y) simply returns 1.0. For weighting, the function returns:

$$ weight(y)=\log_{\max_{t\in R_{T}}size(t)}size(y) $$
(13)

where size(y) counts the number of RDF triples existing in y. By using (13), we assume that the instances containing more triples are given more priority. The logarithmic scale is used to reduce the weight of instances whose size is particularly large. This weighting method is effective when the target repository is very ambiguous, as is the case for large repositories. In addition, ScLink provides a restriction mechanism to enable or disable \(\sigma_{sim}\). When disabled, all \(\sigma_{sim}\) parameters are set to zero instead of the learned values.

In total, there are 16 combinations of weight, valid, k, and restriction. Consequently, there are 16 different aggregators supported by ScLink. All of them are used to initialize \(I_{agg}\) in cLearn, and the algorithm selects the most suitable one.
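
A sketch of the aggregation in (12) for a single candidate is given below. The weighting of (13) is passed in as a precomputed value, and the sigma restriction corresponds to setting all thresholds to zero; toggling k, the two variations of valid, the weighting, and the restriction yields the 16 aggregators.

```python
def m_score(x, y, sim_functions, sigma, k=1, average=True, weight_of_y=1.0):
    """Sketch of (12): aggregate the similarities that pass their sigma thresholds.
    `sim_functions` maps a name to a function of (x, y); `sigma` maps the same
    name to its learned threshold sigma_sim."""
    scores = {name: s(x, y) for name, s in sim_functions.items()}
    passed = [v for name, v in scores.items() if v >= sigma.get(name, 0.0)]
    if not passed:
        return 0.0
    denom = len(passed) if average else 1.0       # the two variations of `valid`
    return sum(v ** k for v in passed) / denom * weight_of_y
```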

3.7 Filtering

This step produces the final co-references. A candidate's co-referent possibility is not concluded directly from its matching score. Instead, the matching scores of all related candidates are considered. We reuse the principle of the stable marriage problem (Gale and Shapley 1962), which has been shown to be effective for instance matching (Nguyen et al. 2012a; Ngomo et al. 2011). A candidate (x, y) (where \(x \in R_S\) and \(y \in R_T\)) is eventually co-referent if mScore(x, y) satisfies the following condition:

$$ mScore(x,y) \ge \max\left( \underset{z\in R_{S}}{\max}~{mScore(z,y)}, \underset{t\in R_{T}}{\max}~{mScore(x,t)}\right) $$
(14)

In this step, ScLink also uses a cut-off threshold δ to eliminate incorrect candidates that nevertheless satisfy (14). δ is assigned automatically by the learning algorithm cLearn. Only candidates whose scores satisfy both the above condition and the threshold δ are selected for the final results.

The basic idea of the filtering step is to rank each candidate within the local set of candidates sharing its source or target instance. This guarantees that, for very ambiguous repositories, a highly similar pair (x, y) is not necessarily co-referent if there exist other pairs (x, z) with higher similarity.
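
A sketch of this filter is shown below: a pair is kept only if its score is the maximum among all candidates sharing its source or its target instance, as in (14), and also reaches the cut-off threshold δ.

```python
def filter_coreferences(scores, delta):
    """Keep (x, y) only if its score is the best for both x and y and >= delta.
    `scores` maps (x, y) candidate pairs to their aggregated matching scores."""
    best_for_source, best_for_target = {}, {}
    for (x, y), s in scores.items():
        best_for_source[x] = max(best_for_source.get(x, 0.0), s)
        best_for_target[y] = max(best_for_target.get(y, 0.0), s)
    return [(x, y) for (x, y), s in scores.items()
            if s >= delta
            and s >= best_for_source[x]
            and s >= best_for_target[y]]
```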

Above, we have described the details of ScLink. Next, we report some implementation notes.

3.8 Implementation notes

Since the processing of all steps is parallelizable, it is feasible to apply parallelization techniques to enhance the speed of the system. Although many advanced frameworks are available and used by other systems (Urbani et al. 2010; Volz et al. 2009; Ngomo and Auer 2011; Hogan et al. 2012), we currently implement ScLink with basic multi-threading, because we are also interested in memory efficiency and cost savings. Concretely, we combine multi-threading with each step of the system workflow. We experimented with very large datasets on a resource-limited computer in order to confirm the scalability of the system.

ScLink is implemented as part of ScSLINT (Nguyen and Ichise 2015b), an efficient framework for instance matching. In ScSLINT, the input repositories are indexed using the well-known inverted index structure for pairs of 〈p, p(x)〉 and 〈p(x), x〉. Therefore, the property alignment and blocking run at maximum processing speed. The complexity of indexing is very small and linear in the number of instances in the repository.

4 Experiment

In order to thoroughly evaluate ScLink, we conduct five experiments:

  • Experiment 1 evaluates the performance of the blocking step.

  • Experiment 2 evaluates the performance of cLearn.

  • Experiment 3 analyzes the similarity metrics and similarity aggregators.

  • Experiment 4 reports the runtime of ScLink.

  • Experiment 5 compares ScLink with other systems.

All experiments are conducted on a desktop computer equipped with one Intel Core i7 4770K CPU and 16 GB of memory. The results of these experiments are reported in Sections 4.3 to 4.7. Next, we describe the datasets and the settings common to all experiments.

4.1 Datasets

We use 15 datasets in total, collected from instance matching problems on relational databases and linked data. A summary of the datasets is given in Table 7. In this table, \(|R_k|\), \(P_k\), and \(fact_k\) are the number of instances, properties, and total facts (i.e., attributes) existing in repository k (\(k\in\{S, T\}\)), respectively. |I| is the number of actual co-references. The scale of a dataset is measured by the number of pairwise comparisons between the repositories. Accordingly, the average scale is \(64.6\times 10^{9}\) and the largest is \(0.4\times 10^{12}\), which is sufficient to assess the scalability of ScLink.

Table 7 Summary of used datasets

The first 3 datasets (D1-D3) cover the bibliography and e-commerce domains and are collected from instance matching problems on relational databases. These datasets have a small number of instances as well as simple schemas. The next 5 datasets (D4-D8) belong to the medicine and disease domains. Among them, the first 3 have medium size while the other 2 are larger. The last 7 datasets (D9-D15) belong to the people (peo), location (loc), and organization (org) domains and are very large. In particular, the datasets related to Freebase are of huge size; the largest one contains nearly \(402\times 10^{9}\) pairwise instance comparisons between the source and the target repositories. The schemas of most linked datasets are very heterogeneous. In DBpedia, 45,858 different properties are used to describe the instances; in Freebase, this quantity is more than 50 times larger. For the last 9 datasets (D7-D15), together with the many instances, the huge number of facts increases the ambiguity and complexity. In particular, the last 7 datasets (D9-D15) are considered very challenging in terms of scale, heterogeneity, and ambiguity. We use the dumps of NYTimes 2014/02, DBpedia 3.7 English, Freebase 2013/09/03, and Geonames 2014/02. There are a few slight inconsistencies between the provided ground truth and the downloaded dump data because of the difference in release dates. Therefore, we manually exclude a total of 130 (only 0.298 %) source instances related to such inconsistencies. In addition, the three repositories NYTimes, DBLP, and ACM do not contain co-referent instances within themselves, while the other repositories do.

The above datasets are selected because they are real datasets with a variety of domains and sizes. Although there are some newer datasets, they are either small, artificial, or focused on benchmarks with special targets (e.g., reasoning, string distortion, language variation). Therefore, we do not use such datasets. Moreover, many systems have recently been tested on the datasets we use (Soru and Ngomo 2014; Köpcke et al. 2010; Euzenat et al. 2011), which enables comparisons between ScLink and other systems.

Although the source repository of the tested datasets is much smaller than the target, ScLink will also work for a large source repository because the methods used in ScLink are not limited to small source repositories. Furthermore, because the target repository is considered a reference, if the source repository is too large, it is possible to reduce the scale by partitioning it into smaller parts sharing some characteristics (e.g., schema and class). Analyses of very large source repositories and partitioning methods are beyond the scope of this paper.

4.2 Experimental settings

There are three parameters in ScLink: \(t_{div}\), \(K_{frq}\), and \(K_{conf}\). We are interested in using the same parameters for all datasets and test cases in order to evaluate the sensitivity of ScLink to its parameters. We find that the selection of these parameters is not difficult. We first fix \(K_{frq}\) and \(K_{conf}\) to 4 for all property types. That is, at most 64 property mappings (4 types × 4 properties from \(R_S\) × 4 properties from \(R_T\)) will be generated. In order to assess the ability to select good similarity functions, a considerable number of property mappings should be given to cLearn. 64 property mappings is comparatively many because it is at least 2.4 times the number of properties of each source repository. With \(K_{frq}\) fixed as above, we gradually reduce \(t_{div}\) from 1.0 until at least one string property is selected from each source repository. \(t_{div}\) = 0.5 satisfies this expectation and we use this value uniformly for all experiments. We observe that varying \(t_{div}\) in the range 0.2 to 0.7 does not change the generated mappings on any dataset except the last 7 datasets, which are related to NYTimes. For the NYTimes repository, using \(t_{div}\) higher than 0.5 does not return any property of string type.

For training data separation, we follow the workflow of ScLink (Fig. 1). We split the source repository into two parts, \(R_{SL}\) and \(R_{SU}\). Part \(R_{SL}\) is used to generate the labeled candidates \(C_L\) and \(R_{SU}\) is used to generate the unlabeled candidates \(C_U\). The ratio of \(R_{SL}\) to \(R_{SU}\) varies between experiments (e.g., cross-validation or percentage split). After that, \(C_L\) is separated into two sets with the ratio 80 %:20 %. These sets are used as two labeled candidate sets, the training set Train (from the 80 %) and the validation set Val (from the 20 %), which are used by cLearn. In the rest of this section, ‘x % labeled data’ denotes using x % of the instances of the source repository as \(R_{SL}\).

The split of \(R_{SL}\) and \(R_{SU}\) is based on the ground-truth data, the set I of actual co-references. Each element of I is a pair of one source instance and one target instance. Given the expected split ratio r, we randomly separate I into two sets \(I_L\) and \(I_U\) with the ratio \({|I_{L}|\over |I_{L}| + |I_{U}|} = r\). The source instances of \(I_L\) and \(I_U\) are put into \(R_{SL}\) and \(R_{SU}\), respectively. Up to this point, \(R_{SL} \cup R_{SU}\) may not be equal to \(R_S\) because I does not necessarily cover all instances of \(R_S\). The uncovered instances are split with the same ratio r and additionally put into \(R_{SL}\) and \(R_{SU}\). Finally, we get \(R_{SL} \cup R_{SU} = R_S\) and maintain the ratio r between \(R_{SL}\) and \(R_{SU}\) both for the instances that exist and for those that do not exist in I. \(C_L\) and \(C_U\) are generated from \(R_{SL}\) and \(R_{SU}\), respectively. Following this procedure, it is simple to assign labels to \(C_L\), \(C_U\), and the detected co-references. The final accuracy (e.g., recall, precision, and F1) in our experiments is measured by comparing the detected co-references for \(R_{SU}\) with \(I_U\).

An important note is that for experiments with percentage splits, in order to reduce random noise, we always repeat the run 10 times for each dataset and report the average results.

4.3 Experiment 1: blocking

In this experiment, we compare the recall and the number of candidates when using and not using minBlock. First, we report the results of not using this algorithm. That is, all property mappings and all instances of \(R_S\) are used for the blocking step. In other words, we measure the number of candidates (size) and the recall (rec) for \(C_0 = B_{def}(R_S \times R_T)\). Table 8 reports these results for all datasets. In this table, size is the number of generated candidates and rec stands for recall. The high rec values in this table reflect the effectiveness of token-based blocking, especially when using only the first token. The most difficult dataset is D3, on which the recall is 0.865, noticeably lower than on the other datasets. However, this is acceptable because the most recent comparable result on this dataset is only 0.835 (Kejriwal and Miranker 2015), obtained using advanced tri-gram attribute clustering (Ngomo et al. 2011). The number of generated candidates is very small compared to all possible pairwise instances between \(R_S\) and \(R_T\). For all 15 datasets, the total number of pairwise instances is \(0.97\times 10^{12}\) and the number of generated candidates is \(1.38\times 10^{9}\). That is, about 99.9 % of the unnecessary candidates are removed. However, compared to the actual co-references, the generated candidate sets are generally still much larger.

Table 8 Result of blocking using all property mappings

The objective of minBlock is to retain the recall and reduce the unexpected candidates. Therefore, we evaluate the effectiveness of minBlock by taking the ratios of those values after and before using minBlock. The procedure of this experiment is as follows.

  • Given two splits \(R_S^L\) and \(R_S^U\), we first generate the results without minBlock, \(C_{L0} = B_{def}(R_S^L \times R_T)\) and \(C_{U0} = B_{def}(R_S^U \times R_T)\). \(C_{L0}\) is given to minBlock to generate the optimal blocking model \(B_{opt}\).

  • After that, we generate the results with minBlock by taking \(C_L = B_{opt}(R_S^L \times R_T)\) and \(C_U = B_{opt}(R_S^U \times R_T)\).

  • Finally, we calculate two ratios: the size reduction rate \(rSize= 1-\frac{|C_{L}|+|C_{U}|}{|C_{L0}|+ |C_{U0}|}\) (larger is better) and the recall reduction rate \(rRec=1-\frac{|R \cap C_{U}|}{|R \cap C_{U0}|}\) (smaller is better). A small sketch that computes these ratios follows this list.
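As a concrete illustration, the two reduction rates can be computed as below. This is only a minimal sketch assuming the candidate sets and the reference co-references for \(R_S^U\) are available as Python sets of pairs; the function name is hypothetical.

    def reduction_rates(C_L0, C_U0, C_L, C_U, references):
        """Compute rSize and rRec as defined above.

        C_L0, C_U0: candidates generated without minBlock
        C_L, C_U:   candidates generated with the learned blocking model
        references: the set of actual co-reference pairs for R_S^U
        """
        r_size = 1 - (len(C_L) + len(C_U)) / (len(C_L0) + len(C_U0))
        r_rec = 1 - len(references & C_U) / len(references & C_U0)
        return r_size, r_rec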

We measure the change in size for both the labeled set (\(C_L\)) and the unlabeled set (\(C_U\)) because both are used by cLearn and in the resolution phase. However, we only measure the change in recall for the unlabeled set, because the recall of the labeled set is identical to that of the original set \(C_{L0}\), as \(t_{loss}\) is set to its default value of 1.0. Note that when we divide \(R_S\) into different training splits, the recall and the size of \(C_{L0}\) and \(C_{U0}\) are probabilistically similar to those of the parent set \(C_0\) (reported in Table 8).

First, we use 5-fold cross-validation so that all instances are in turn used for training as well as testing. For this test, the recall reduction rRec is equal to 0.0 for all datasets except D7 and D12, where, on only 1 random fold (out of 5), 1 expected candidate is discarded. The size reduction rSize, however, varies between datasets, so we report the details in Table 9. According to this table, the reduction in size is low on small datasets but very high on large datasets, where unexpected candidates are frequent. For D9 to D15 in total, a considerable number of \(1.26\times 10^{9}\) candidates are discarded compared to not using minBlock. These results confirm the effectiveness and importance of minBlock for enhancing the scalability of ScLink. By limiting the candidates, the complexity of cLearn and of the resolution phase is much reduced.

Table 9 Cross-validation result with minBlock

4.3.1 Effect of size of training data

For a supervised system, the size of the training data is very important. Therefore, we also evaluate minBlock with different small amounts of training data. We vary the ratio of labeled data from 1 % to 15 % and analyze the trend of rRec and rSize.

Generally, on most datasets, rRec decreases as the amount of labeled data increases. In other words, the more labeled data is given, the more optimal the learned blocking model is. The coefficient of variation of the rRec values when changing the amount of labeled data is under 1.0 for all datasets except D15. In particular, for D6 and D9, rRec is always equal to zero regardless of the amount of labeled data. Considering the average over all datasets, rRec quickly decreases from 0.0089 to 0.0035 when the labeled data is increased from 1 % to 8 %. After this point, the change of rRec is not considerable, as only 0.001 is further reduced over the range from 8 % to 15 %. The slight variation of rRec between different levels of annotation effort, together with the early saturation, shows that minBlock can learn the optimal blocking model from a small amount of labeled data.

The reduction of rRec is accompanied by a drop in rSize. However, the reduction of rSize is not considerable and gradually slows down as the labeled data increases. On average over all datasets, rSize drops from 0.53 to 0.43 when 1 % and 15 % labeled data are given, respectively. This value is almost identical to that of 5-fold cross-validation, whose average rSize is 0.422. In addition, the coefficient of variation on every dataset is under 1.0. Given these facts, the reduction of rSize stays within an acceptable range even when we try to reduce rRec by giving more labeled data to minBlock.

4.4 Experiment 2: performance of learning algorithms

In order to evaluate the effectiveness of our proposed learning algorithm, we compare the results of ScLink when using cLearn and when replacing it with other algorithms. We compare against top-rank selection (naive), information-gain-based selection (gain), and a genetic algorithm (genetic). The mechanisms of these algorithms are as follows.

  • naive selects the \(K_{top}\) similarity functions that obtain the highest F1 (a minimal sketch follows this list).

  • gain re-implements the idea of ADL (Hu et al. 2014), which selects the most discriminative property mappings by independently measuring the information gain of each property.

  • genetic follows the idea of EAGLE (Ngomo and Lyko 2012), Knofuss (Nikolov et al. 2012), GenLink (Isele and Bizer 2012), and ActiveGenLink (Isele and Bizer 2013), which use a genetic algorithm to learn the matching configuration. We use a binary array representation for the combination of similarity functions. We choose exponential ranking for fitness selection, 0.7 for the single-point crossover probability, 0.1 for the single-point mutation probability, and 50 for the population size. We limit the maximum number of iterations to 1000 and also use an early-stopping mechanism, which terminates the algorithm when F1 is saturated.
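For reference, the naive baseline can be sketched as below. This is an illustration only; the evaluate_f1 callback (returning the F1 obtained when a single similarity function is used on its own on the labeled candidates) and the other names are hypothetical, and the actual baselines replace part of Algorithm 2 as described next.

    def naive_selection(similarity_functions, k_top, evaluate_f1):
        """Top-rank selection: keep the k_top similarity functions with the
        highest individual F1 on the labeled candidates (illustration only)."""
        ranked = sorted(similarity_functions, key=evaluate_f1, reverse=True)
        return ranked[:k_top]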

In order to integrate the other algorithms, we replace lines 3 to 23 of Algorithm 2 with the new algorithms. In other words, the mechanism of determining \(\sigma_{sim}\), \(\delta\), and \(Agg\) remains the same for all algorithms. Re-implementing the other algorithms enables elaborate comparisons: all algorithms receive the same input of similarity functions and use the same mechanism for determining the other settings. More importantly, the other algorithms were originally installed in systems that are not scalable enough and thus cannot work with large datasets such as D7 to D15.

Table 10 reports the average F1 score of the tested algorithms on each dataset when using 5-fold cross-validation. According to this table, the proposed algorithm consistently outperforms gain and naive. Compared to genetic, cLearn is better on 9 out of 15 datasets. Although the harmonic means of cLearn and genetic look close, when considering each fold separately, so that there are 75 tests (i.e., 15 datasets and 5 folds), a paired t-test finds that the improvement made by cLearn is significant (p = 0.0223). Similar results are also recorded when comparing cLearn with naive (p < 0.0001) and gain (p < 0.0001).
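A significance test of this kind can be reproduced, for instance, with a standard paired t-test over the 75 per-fold F1 scores. The sketch below uses SciPy and assumes the per-fold scores of the two algorithms are available as two equally long lists; the function name is ours.

    from scipy import stats

    def paired_significance(f1_clearn, f1_other):
        """Paired t-test over per-fold F1 scores (75 values each in our setting:
        15 datasets x 5 folds). Returns the two-sided p-value."""
        t_stat, p_value = stats.ttest_rel(f1_clearn, f1_other)
        return p_value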

Table 10 F1 scores of S c L i n k when using c L e a r n and other algorithms

The naive and gain algorithms are the fastest because they check each similarity function independently and do not consider combinations. For cLearn, since we fixed \(K_{top}\) to 16, the theoretical maximum number of configurations is \(2^{16}\) in the worst case, if cLearn exhaustively checked all combinations of similarity functions. In practice, with the effect of the heuristic, cLearn stops after checking 177 configurations on average, whereas genetic needs to check 316 configurations on average. That is, the proposed heuristic reduces the number of checked configurations by almost 44 % compared to genetic while retaining high accuracy. In summary, we can conclude that cLearn is more reliable and faster than the other tested algorithms.

4.4.1 Effect of size of training data

Similar to Experiment 1, we also analyze the trend of the F1 score when varying the size of the labeled data. Small amounts of labeled data, from 1 % to 15 %, are given to cLearn. For each dataset, we perform the random split 10 times and take the average result, which is depicted in Fig. 2. According to this figure, the harmonic mean of the F1 score quickly increases when the labeled data is raised from 1 % (0.7272) to 4 % (0.8078). After that, it increases at a lower rate. At the 13 % setting, ScLink reaches 0.8329, which is slightly better than using genetic with 80 % labeled data (Table 10). On most datasets, the F1 score is nearly saturated at 5 % labeled data, except for D3, D5, D6, and D8. For D8, since F1 still clearly increases at 15 %, we extend the experiment for this dataset and observe that the change slows down after 19 %. As 5 % training data is the goal of most supervised systems, ScLink demonstrates its capability by satisfying this expectation on most tested datasets.

Fig. 2 Performance of ScLink with different amounts of labeled data

4.5 Experiment 3: similarity aggregators and similarity metrics

In order to find out which similarity aggregators and similarity metrics are effective, as well as how diverse the learned configurations are, we compute some statistics on the configurations produced by cLearn. We reuse the cross-validation results of Experiment 2 for this analysis. In total, there are 300 similarity aggregators and 2,419 similarity functions.

For the similarity aggregator, considering each individual setting independently (e.g., k=1 vs. k=2 and weighting vs. non-weighting), the following settings are most frequently selected: k=2 (64 %), enabled restriction (79 %), \(valid=|U_{F_{sim}}(x,y)|\) (54 %), and weighting (77 %). Among them, weighting is very effective on the large datasets D9 to D15, as it is selected in all tests related to these datasets. All settings except valid show a dominant proportion in the learned configurations. However, the dominance is not strong enough to claim a universal effectiveness of each setting; for example, 21 % of the cases require a different setting for restriction. The combination of the above settings is also the most frequent aggregator, but its share is only 24 % of the 300 aggregators. This result implies that it is difficult to manually connect the dots between the datasets and the optimal similarity aggregator.

For similarity metrics, all string-related metrics show a relative balance, with similar proportions of Levenshtein (25 %), TFIDF-cosine (32 %), and mBM25 (28 %). In addition, rDiff is very important for the subsets related to the location domain, as it is always selected on D9, D10, and D13 for the distance estimation of geographic coordinates.

An interesting finding is observed on D10. When the size of the training data is 80 %, only longitude, one of the two important geographic properties, is selected. While both longitude and latitude are considered important for D9 and D13, and also in human intuition, the learning algorithm returns a different recommendation. This example, together with the variety of similarity aggregators and similarity metrics reported above, shows that for a particular input it is difficult for a user to define a matching configuration as good as the one an automatic system can find.

For the evaluation of mBM25, we compare the results of ScLink when using and when not using this metric. 5-fold cross-validation is also used for this test. The result is reported in Table 11. According to this table, by including mBM25, ScLink improves the results on most datasets. Especially on D10 to D14, which are among the large and highly ambiguous datasets, using mBM25 is clearly better. For D6, D8, D9, and D15, using mBM25 drops the performance, but only very slightly. On these datasets, the token order is not important because the learned similarity functions are almost exclusively constructed for short string properties, and the strings described by such properties frequently contain only one token. In summary, mBM25 shows its effectiveness against TFIDF-cosine, especially on large and ambiguous datasets.

Table 11 F1 scores of using (use) and not using (not use) mBM25 in ScLink

4.6 Experiment 4: runtime

Efficiency is an important factor of instance matching systems. Therefore, we evaluate the runtime of ScLink on the tested datasets. Five percent labeled data is used for this experiment. For the small datasets D1 to D6, ScLink takes only 2.68 seconds on average to complete all the steps; for D2 and D5, the runtime is under 1.0 second. The longest runtime is also small, 8.1 seconds, measured on D1. For the larger datasets, the runtime ranges from 38.6 seconds (D7) to 9.28 minutes (D15), and the average is 3.34 minutes. We are interested in comparing the runtime of ScLink with other systems; however, the runtimes of other systems on the medium to large datasets (D7 to D15) are unfortunately not available. Still, the speed of ScLink is remarkable, as it is fast on D9 to D15, which are scalability barriers for most existing systems.

In order to see the proportion of each step, we plot the detailed runtime for the medium and large datasets in Fig. 3. According to this figure, the time consumed by property alignment depends on the target repository. Overall, for medium datasets like D7, D8, and D9, this step occupies the largest portion of the total runtime. For larger datasets, cLearn and similarity aggregation share the dominant parts; on average, from D9 to D15, these steps cost 37.2 % and 41.2 % of the runtime, respectively. Meanwhile, the minBlock algorithm is very efficient. It takes only 9.5 % of the runtime on average but delivers a very important benefit: without minBlock, the similarity aggregation step would require about 11.5 times longer, since the reduction rate of candidates is about 91.3 % (Experiment 1) and \(1/(1-0.913)\approx 11.5\). For example, in the base framework ScSLINT (Nguyen and Ichise 2015b), the total runtime for D15 is 36 minutes because the number of candidates is more than 10 times larger. In summary, ScLink is very fast on large datasets, and the time consumed by the learning algorithms is also acceptable for the supervised scenario.

Fig. 3 Detailed runtime of ScLink with 5 % labeled data

4.7 Experiment 5: comparison to other systems

We compare the performance of ScLink with a series of recent and state-of-the-art systems, including unsupervised, supervised, and non-learning-based ones. The availability of their results is also a selection criterion. Since each system was tested on different datasets, depending on the supported data format and/or scalability, we separate the comparison into three groups: D1 to D3, D4 to D8, and D9 to D15.

4.7.1 D1 to D3

These datasets enable the comparison between ScLink and other instance matching systems for relational databases. We compare ScLink with the state-of-the-art FEBRL (Christen 2008c) and MARLIN (Bilenko and Mooney 2003), and with the most recent work of Kejriwal and Miranker (2015), which we temporarily call semiBoost. These systems are all classifier-based. FEBRL and MARLIN use SVM to learn the classification model; semiBoost uses AdaBoost in conjunction with semi-supervised learning.

The reported results of these systems correspond to different settings of labeled data. For semiBoost, 2 % is used. For FEBRL and MARLIN, 22 %, 46 %, and 38 % are used for D1, D2, and D3, respectively. Therefore, for each comparison, we give the same amount of labeled data to ScLink. Table 12 reports the results of ScLink and these systems.

Table 12 F1 scores of ScLink and other systems on D1 to D3

4.7.2 D4 to D8

For these datasets, we compare ScLink with ObjectCoref (Hu et al. 2011) and the work of Rong et al. (2012), which we temporarily call rdBoost. ObjectCoref is a semi-supervised system that learns discriminative property mappings. rdBoost uses AdaBoost to train a committee of random forest classifiers. In addition, we include the results of RiMOM (Li et al. 2009) and PARIS (Suchanek et al. 2011) as two state-of-the-art automatic systems. RiMOM combines multiple matching strategies in order to obtain the optimal resolution result. PARIS automatically generates co-references of instances, properties, values, and classes by combining similarity and probability propagation.

rdBoost uses 5 % candidates for training and ObjectCoref uses 2.3 %, 11.6 % and 1.2 % for D1, D2, and D3, respectively. Note that for ObjectCoref, only the results on D1, D2, and D3 are reported. Table 13 reports the comparison on these datasets.

Table 13 F1 scores of ScLink and other systems on D4 to D8

4.7.3 D9 to D15

Many systems have reported experiments on these 7 datasets. Unfortunately, in some of those experiments the repositories are simplified in various ways, so that the challenges of ambiguity and scalability are much reduced. For example, instead of inputting the whole target repository, only the class (location, organization, or people) related to the domain of the source repository (NYTimes) is used by ADL (Hu et al. 2014); or only the instances appearing in actual co-references are used by ADL (Hu et al. 2014), Knofuss (Nikolov et al. 2012), ActiveGenLink (Isele and Bizer 2013), and SLINT+ (Nguyen et al. 2012a). On such simplified datasets, it is not difficult to obtain perfect results with a simple method, as reported by SLINT+ (Nguyen et al. 2012a).

We report here the comparisons with Zhishi.Links (Niu et al. 2011), AgreementMaker (Cruz et al. 2011), and SERIMI (Araujo et al. 2011), which were tested on the full datasets, although they are not learning-based systems. In addition, because the periods of data retrieval are different, the repositories used by ScLink and the other systems may be slightly different. For all these reasons, the comparisons are made for reference purposes. In order to minimize the difference in input knowledge, we feed only 5 % labeled data to the learning algorithms; five percent is also popularly used as an expectation for the size of labeled data. Table 14 reports this comparison. The results of the other systems are collected from Euzenat et al. (2011).

Table 14 F1 scores of ScLink and other systems on D9 to D15

4.7.4 Discussion

Overall, ScLink delivers an impressive performance, as it is better than the other systems in many cases. Compared to state-of-the-art supervised systems on relational databases, ScLink is competitive with FEBRL and MARLIN. Compared to recent resolution systems for linked data, ScLink is far better than semiBoost, rdBoost, and ObjectCoref, especially on D2, D3, D7, and D8, which seem to be difficult for semiBoost and rdBoost because of the presence of co-references inside the source or target repository. ScLink clearly outperforms AgreementMaker and SERIMI, and is competitive with Zhishi.Links. Note that Zhishi.Links is specially customized for these datasets, as this system applies 19 unification rules for matching difficult strings that frequently appear in them (e.g., Co and Company, Manhattan and NYC). Therefore, considering the importance of generality, as well as the better results of ScLink in terms of the overall harmonic mean, ScLink reveals its strengths against Zhishi.Links. Comparing learning-based and non-learning-based systems, ScLink, rdBoost, and ObjectCoref are much better than RiMOM and PARIS. This fact confirms the necessity of learning-based systems for improving effectiveness.

5 Related work

Instance matching is an extensively studied problem. The literature on instance matching includes candidate generation methods, matching frameworks, and full solutions ranging from automatic to learning-based systems. In this section, we briefly summarize some representatives of each group.

Since candidate generation is a crucial step of most matching systems, research on this sub-problem is active and is considered an independent topic. Considering the complexity, token-based blocking (Papadakis et al. 2013) is the most basic approach, as it is the fastest one; the token-based approach is also the most effective in terms of recall. In most real cases, generating candidates by comparing tokens is acceptable, although in a few cases (e.g., in the presence of spelling errors) character-based comparison is safer. Representatives of more advanced methods include sorted neighborhood (Hernández and Stolfo 1995), canopy clustering (McCallum et al. 2000), collective blocking (Isele et al. 2011; Nguyen et al. 2012a), and learning-based algorithms (Dalvi et al. 2013; Kejriwal and Miranker 2013; Papadakis et al. 2014). Sorted neighborhood arranges the instances in a lexicographic order based on a feature estimated from their values; the instances that are close to each other within a fixed-size window are then grouped into one block. Canopy clustering randomly selects a center, and the instances within an acceptable distance of that center are put into its cluster. Collective blocking takes multiple types of property mappings into account for generating candidates; because this approach uses a multitude of information, it delivers high accuracy but also comes with high complexity. Learning-based algorithms can generate the most optimal candidate sets; however, existing algorithms are complex and not suitable for large repositories. An important technique in blocking is to eliminate instance groups of very large size (Papadakis et al. 2011), which frequently contain instances sharing stopwords. In ScLink, the preprocessing step handles this issue (Table 4).
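To make the sorted neighborhood idea concrete, the following minimal sketch (with hypothetical names such as sort_key and window) pairs the records that fall inside the same fixed-size window after sorting by a lexicographic key; it is a simplified illustration of the general technique, not of any particular cited system.

    def sorted_neighborhood(instances, sort_key, window=5):
        """Generic sorted neighborhood blocking (illustration only).

        instances: list of (instance_id, value) records, e.g. from both repositories
        sort_key:  function mapping a record to its lexicographic sorting key
        window:    sliding window size; records inside the same window
                   become candidate pairs
        """
        ordered = sorted(instances, key=sort_key)
        candidates = set()
        for i, (id_i, _) in enumerate(ordered):
            for id_j, _ in ordered[i + 1:i + window]:
                candidates.add((id_i, id_j))
        return candidates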

MOMA (Thor and Rahm 2007), FEBRL (Christen 2008c), SILK (Volz et al. 2009), and LIMES (Ngomo and Auer 2011) are among the well-known instance matching frameworks. These frameworks allow the user to manually input the matching configuration. Among them, SILK and LIMES support matching of linked data repositories and are considered scalable frameworks, according to the reported experiments. SILK and LIMES also provide integration with map-reduce technology and serve as the base frameworks for many recent instance matching algorithms. Most recently, ScSLINT has been presented as a memory- and time-efficient framework that is significantly faster than LIMES (Nguyen and Ichise 2015b); ScLink is implemented as a part of this framework. Also focusing on efficient instance matching, Kirsten et al. (2010) parallelized the instance comparisons over different blocks, together with a memory caching technique. Similar to the above frameworks, AgreementMaker (Cruz et al. 2009, 2011) and Zhishi.Links (Niu et al. 2011) are manual instance matching systems. These systems offer mechanisms for including domain knowledge, which is helpful for improving accuracy.

Remarkable automatic systems include SERIMI (Araujo et al. 2015), SLINT+ (Nguyen et al. 2012a), KD2R (Pernelle et al. 2013), RiMOM (Li et al. 2009), and PARIS (Suchanek et al. 2011). SERIMI and SLINT+ automatically detect the property mappings using purely statistical methods and apply these mappings for comparing instances. The limitation of this approach is that the mappings contain incorrect alignments, which reduce the final accuracy. The scalable architecture of SLINT+ enables the introduction of its efficient version, and ScLink is implemented on that scalable version of SLINT+. KD2R uses nearly functional properties for matching instances. Such properties are effective in discriminating ambiguous instances; however, this method is limited to repositories having the same schema. RiMOM and PARIS are among the state-of-the-art systems, although their mechanisms are totally different. RiMOM dynamically combines multiple comparison strategies in matching instances. PARIS simultaneously matches instances, classes, properties, and values using probability propagation. Also based on probabilistic theory, ZenCrowd (Demartini et al. 2013) is a crowdsourcing-based system; it ranks the candidates in each block and asks humans about the low-ranked ones. In addition, collective instance matching (Dong et al. 2005; Bhattacharya and Getoor 2004; Locoro et al. 2014) is well studied for graph-based data, such as the collaboration domain of relational databases or linked data in general. The basic idea is that the similarity of one node affects those of the others through a propagation mechanism; PARIS can be classified into this approach as well.

Recent achievements in instance matching also include incremental and progressive instance matching. Whang and Garcia-Molina (2014) focused on the evolution of the data and the matching rules. Observing that re-running the matching process whenever the input changes is costly, the authors proposed algorithms for such an incremental matching setting. In independent parallel work, Altowim et al. (2014) proposed a progressive algorithm that ranks the blocks by the predicted number of co-references. The blocks are then processed in a programmatic order instead of randomly, as in previous matching settings. This approach is useful when a system has to return an expected number of co-references within a limited time.

Closest to our work are RAVEN (Ngomo et al. 2011), EAGLE (Ngomo and Lyko 2012), ActiveGenLink (Isele and Bizer 2013; Nikolov et al. 2012), and EUCLID (Ngomo and Lyko 2013). These systems also focus on learning the matching configuration. EAGLE, ActiveGenLink, and Knofuss apply a genetic algorithm to search for an optimal configuration, while RAVEN and EUCLID use deterministic methods. RAVEN and ActiveGenLink also apply active learning to reduce the number of training examples. In addition, focusing on a very similar problem, ADL (Hu et al. 2014) learns discriminative property mappings that are expected to best separate the co-referent and non-co-referent cases. The limitation of ADL is that it considers each mapping independently and therefore cannot evaluate the quality of their combination.

Machine learning methods are widely studied as a solution for instance matching, by treating the matching of two instances as a classification problem. Many machine learning algorithms have been used, including CRF (Hall et al. 2008), latent Dirichlet allocation (Bhattacharya and Getoor 2006), neural networks (Li and Clifton 2000), SVM (Christen 2008a), rule-based learning (Sheila et al. 2002), and decision trees (Nguyen et al. 2012b; Rong et al. 2012). A recent comparison between machine learning classifiers is reported in Soru and Ngomo (2014). MARLIN (Bilenko and Mooney 2003) and the work of Sarawagi and Bhamidipaty (2002) investigate committee-based active learning methods for training example selection. Besides active learning, transfer learning (Rong et al. 2012) and semi-supervised learning (Kejriwal and Miranker 2015; Hu et al. 2011) are alternative options for compensating the automation loss of the supervised approach. Most machine learning algorithms try to maximize the margin between co-referent and non-co-referent cases. However, since most of them are probability-based, combining the classifier with post-processing (e.g., the filtering step of ScLink) is difficult, as is interpreting the classifier, which is helpful for scaling up the resolution phase.

6 Conclusion

In this paper, we presented a supervised instance matching system named ScLink. We described our solutions to the issues of heterogeneity, ambiguity, and scalability. We installed in ScLink novel learning algorithms for the matching configuration and the blocking model, and we also use a robust string similarity metric. We reported detailed analyses and experiments evaluating the proposed system. The experimental results confirm that ScLink meets practical demands in instance matching for real and large data. ScLink consistently outperforms other systems with the same objectives, including the state of the art. The heuristic search algorithm used in ScLink is also significantly better than many advanced methods of configuration learning. Together with its effectiveness, ScLink is also more efficient in both time and memory consumption. The source code of ScLink is available at the project page: http://ri-www.nii.ac.jp/ScSLINT/.

A common shortcoming of all supervised systems is the need for training data. Active learning is a promising strategy to reduce the annotation effort. In order to apply active learning, a study on effective training example selection is necessary, because querying informative examples from a large pool of candidates is not a trivial problem. Although active learning has been studied by many researchers, this complexity issue has not been convincingly solved (Christen 2008b; Sarawagi and Bhamidipaty 2002; Isele and Bizer 2013). Another alternative is transfer learning. As many existing co-references are retrievable nowadays, especially on the Web of linked data, the configuration constructed for two particular repositories can be shared to resolve the co-references between similar repositories (e.g., in terms of domain, schema, or provider). Transfer learning has been preliminarily shown to work for the instance matching problem (Rong et al. 2012). However, the issue of exactly identifying the equivalent properties between the trained repositories and unknown repositories with a different schema remains unsolved.

For the improvement of ScLink in particular, besides the above possibilities, ScLink can benefit from technical enhancements of the similarity metrics in terms of processing speed (Mishra et al. 2013; Soru and Ngomo 2013). ScLink is not sensitive to its parameters; however, reducing the number of free parameters would better satisfy the automation mission. In addition, an analysis of very large source repositories and of partitioning methods may further support the applicability of ScLink.