
1 Introduction

Schema matching is a complex and difficult problem, and over the years many methodologies have been proposed to address it. Surveys [1, 3, 13] show that the focus is on automating the process using composite systems; that is, more than one method is used to produce candidate matches, and the results are then combined to produce the final matches. While these systems boost matching accuracy, an important question remains: can matchers that are less reliable than others “unbalance” a system?

In this paper we propose two ways to address this question. The first is what we dub “two-phase schema matching”, which aims to allow the simultaneous use of a wide array of matchers regardless of their performance on the specific schemas to be matched. We achieve this by separating the available matching techniques into “strong” and “weak” ones. The intuition is to use the right technique at the right time, maximizing its usefulness while minimizing the destabilizing effects it may have on the system.

What we call “strong” techniques are those that have consistently produced good matches and are less likely to be affected by differences in domains, implementation details and/or the size and quality of instance data. All other techniques are classified as “weak”: they can be helpful if used, but their credibility is less certain.

The second is the use of a limited and easy-to-obtain amount of information from the human expert who will evaluate the results. This user interaction strategy is based on the fact that, according to recent research [4] and our own experience, there is no safe way to perform schema matching tasks in a fully automated way, as there can be no guarantee of 100 % confidence and accuracy. While most recent work does recognize that machine-based methods cannot solve the problem with 100 % precision, little work has been published on the benefits of user-computer interaction and the impact it can have on solving the problem. Even if such a system existed, it would, in the end, need a human expert to evaluate the matches and authorize the mapping process. We therefore put forth the idea that, since every outcome of an automated schema matching process must and will be evaluated by a human expert, it is to our advantage to bring some of that evaluation earlier in time. We use this early evaluation to discover the knowledge hiding in that mid-step and use it to help determine the best way to match the remaining attributes.

Both approaches are aimed at large schema matching projects, for example the matching of two database schemata during a company merger or a company-wide system integration. The system is agnostic to schema structure and thus able to find matches between structurally different schemata (for example between a relational database and a collection of flat files or an XML schema). In that sense, where the term “schema” is used it should be understood not only as a relational schema but as a data schema in general.

An overview of the two methods follows in Sects. 2 and 3. In Sect. 4 we perform experiments on both our systems and on currently available systems and comment on the outcome. Related work is presented in Sect. 5, and in Sect. 6 we present our conclusions and pose some interesting questions for future consideration.

2 The Strong Vs. Weak Approach

In this approach, which we call ASID (Another Schema Integration Dashboard), matching is performed in two phases: one where only strong techniques are used and one where all matchers are used. An overview is presented in Fig. 1. The first-phase combiner uses the results of the strong matchers to produce plausible matches. If there are unmatched attributes left, or the score of a match is too low, these attributes are passed to the second matching phase, where the “weak” techniques are also used and the results from all techniques are combined. This arrangement ensures that matches are not “unbalanced” by weak matching techniques, while the system can still benefit from the extra matching ability they provide.

Fig. 1.
figure 1

Overview of the ASID system

2.1 Matching Methods

In our system we use a mix of both established and new matchers that leverage information at both the semantic and the data instance level. Multiple matchers, especially “niche” matchers that need specific information to provide good results, can be of great help but can also unbalance a system. In our setup, with both the self-mute switch and the variable weights (described in the next section), we demonstrate that these methods do not hurt the results and can, under some conditions, improve them.

Name Matching.

Name matching is a string matching technique between the names of the attributes, used extensively in schema matching. We use the Jaro metric for this task as recent research [14] has shown that it is one of the best algorithms for name matching. The Jaro metric is based on the number and order of the common characters between two strings.

Given strings \( s = a_{1} \ldots a_{k} \) and \( t = b_{1} \ldots b_{L} \), define a character \( a_{i} \) in \( s \) to be common with \( t \) if there is a \( b_{j} = a_{i} \) in \( t \) such that \( i - H \le j \le i + H \), where \( H = \frac{\hbox{min} (\left| s \right|, \left| t \right|)}{2} \).

Let \( s' = a'_{1} \ldots a'_{k'} \) be the characters in \( s \) which are common with \( t \) (in the same order they appear in \( s \)) and let \( t' = b'_{1} \ldots b'_{L'} \) be defined in the same way for \( t \). Now define a transposition for \( s' \), \( t' \) to be a position \( i \) such that \( a'_{i} \ne b'_{i} \). Let \( T_{s',t'} \) be half the number of transpositions for \( s' \) and \( t' \). The Jaro similarity metric for \( s \) and \( t \) is

$$Jaro(s, t) = \frac{1}{3}\left( {\frac{{\left| {s'} \right|}}{{\left| s \right|}} + \frac{{\left| {t'} \right|}}{{\left| t \right|}} + \frac{{\left| {s'} \right| - T_{s',t'} }}{{\left| {s'} \right|}}} \right) $$

A further analysis of the algorithm is out of the scope of this paper; for more information, the reader may consult [14]. In our system, scores range in [0, 1].
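As an illustration, the metric above can be implemented in a few lines. Note that the sketch below uses the slightly narrower matching window \( H = \hbox{min}(\left| s \right|, \left| t \right|)/2 - 1 \) common in practical implementations, rather than the exact \( H \) defined above:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1] between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Matching window; common implementations use min(|s|,|t|)//2 - 1.
    H = max(min(len(s), len(t)) // 2 - 1, 0)
    used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        lo, hi = max(0, i - H), min(len(t), i + H + 1)
        for j in range(lo, hi):
            if not used[j] and t[j] == ch:
                used[j] = True
                s_common.append(ch)
                break
    m = len(s_common)
    if m == 0:
        return 0.0
    # Characters of t that are common with s, in the order they appear in t.
    t_common = [t[j] for j in range(len(t)) if used[j]]
    # T is half the number of positions where the two common sequences differ.
    T = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - T) / m) / 3
```

For the classic pair "MARTHA"/"MARHTA" this yields 17/18 ≈ 0.944, with six common characters and one transposition.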

Attribute Description Matching.

This method tries to exploit simple forms of documentation that often exist in relational systems used in organizations. In most RDBMS deployments, each attribute has a description written in natural language that describes its contents, usually in a sentence or two.

Our matching method is a simple one: for each attribute of the source schema, we create a text corpus made of its description string and the description strings of all the attributes of the target schema that are possible matches (Fig. 2). Then we use TF/IDF weighting to compute vector similarity between the first member of the corpus and the remaining ones: the more similar the description, the higher the score. This method ensures that words appearing many times in the corpus, like “table”, are not overly important to the computation of similarity, ensuring that the truly important words dominate the result.

Fig. 2.
figure 2

Creating the corpus
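The corpus-based scoring above can be sketched with a hand-rolled TF/IDF and cosine similarity; the function names are ours, and whitespace tokenization is a simplifying assumption:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors over a small corpus of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Words present in every document (e.g. "table") get idf = log(1) = 0.
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: tf[w] * idf[w] for w in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def description_scores(source_desc, target_descs):
    """Score each target description against the source description.

    The corpus is the source description plus all candidate target
    descriptions, as in Fig. 2; words common to the whole corpus
    contribute nothing, so the truly distinctive words dominate.
    """
    docs = [source_desc.lower().split()] + [d.lower().split() for d in target_descs]
    vecs = tfidf_vectors(docs)
    return [cosine(vecs[0], v) for v in vecs[1:]]
```

For example, a source description "customer family name" would score a target described as "surname of the customer" above one described as "table of product prices".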

Naïve Bayes Classifier.

All available data from one schema are fed into a simple Naïve Bayes classifier; after the learning phase is complete, the sample data of the target attribute are classified. The total score for all attributes is normalized to [0, 1] to be compatible in range with the previously computed scores. The formula by which the normalization is achieved is a compromise between robustness and ease of implementation.

The transformation formula used is \( s_{i} ' = \frac{{s_{i} - s_{\hbox{min} } }}{{s_{\hbox{max} } - s_{\hbox{min} } }} \), where \( s_{i} \) is the individual score of an attribute match and \( s_{\hbox{min} } \) and \( s_{\hbox{max} } \) are, respectively, the global minimum and maximum scores for that attribute. The intuition is that the best score for an attribute is transformed to 1 while the lowest is transformed to 0. It is important to note that this procedure gives a boost to the best score, which may lead to the production of false positives. Faced with the dilemma of producing some extra false positives or risking the loss of some good matches, we opted for the first choice, the rationale being that, in large-scale schema matching, a false positive is easier for a human to dismiss than a missed true match is to discover.
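The transformation is straightforward to implement; the behavior when all scores are equal is our own assumption, since the formula above is undefined in that case:

```python
def normalize(scores):
    """Min-max normalize one attribute's match scores into [0, 1].

    The best candidate maps to 1 and the worst to 0, per the formula
    above. If all scores are equal the formula divides by zero; we
    return all zeros in that case (an assumption of this sketch).
    """
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        return [0.0 for _ in scores]
    return [(s - s_min) / (s_max - s_min) for s in scores]
```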

TF-IDF Matching.

This matcher is inspired by the WHIRL system, which “extends relational databases to reason about the similarity of text-valued fields using information-retrieval technology” [6]. We use only the similarity function used in WHIRL to create a matcher for ASID.

Specifically, data are fed into the matcher and a collection of documents is created, where each document consists of all the instances of one available target attribute. A “source” document is created from the data available for the source attribute. We use TF/IDF weighting to compute vector similarity scores between the source document and each target document in the collection: the more similar a target document is to the source document, the higher the score. The last step is to normalize the results using the formula described above.

Datatype Matching.

This matcher uses similarity between datatypes to determine similarity between attributes. The matcher deduces the datatypes from the data instances, not from the data definition, because in systems that have been in use for long periods of time the datatypes actually stored often differ from those defined in the documentation. The actual datatype of an attribute, as determined by the system, is the datatype that can hold each and every instance of the attribute in question. These datatypes are then used to compute similarity between attributes as described in the datatype similarity matrix presented in Table 1.

Table 1. Datatype similarity matrix
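A possible sketch of the datatype deduction step follows; the set of recognized types and the date formats are our assumptions, and Table 1 would then supply the pairwise similarities between the deduced types:

```python
import re
from datetime import datetime

def infer_type(value: str) -> str:
    """Classify a single data instance into a coarse datatype."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "integer"
    if re.fullmatch(r"[+-]?\d*\.\d+", value):
        return "decimal"
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):  # recognized date formats (assumption)
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    return "string"

def attribute_type(instances):
    """Return the narrowest datatype that can hold every instance.

    Integers widen to decimal when mixed with decimals; any other
    mix ultimately widens to string.
    """
    seen = {infer_type(v) for v in instances}
    if seen <= {"integer"}:
        return "integer"
    if seen <= {"integer", "decimal"}:
        return "decimal"
    if seen == {"date"}:
        return "date"
    return "string"
```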

Instance Pattern Matching.

This matcher extracts patterns from the data instances. It aims to find patterns that can be hiding in the data of large schemas, patterns like passport numbers, license plates, VAT numbers etc. Similarity between these patterns is calculated with the edit distance method, which penalizes changes more strongly than the more lenient Jaro metric. Each pattern comprises the characters common to all data instances plus placeholders marking the positions of varying letters (“‡”) and/or numbers (“№”). For example, the IBAN number of a UK account would create the pattern [GB№№‡‡‡‡№№№№№№№№№№№№№№].
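The pattern extraction described above might be sketched as follows; the handling of instances of unequal length is our own assumption, as the paper does not specify it:

```python
def extract_pattern(instances):
    """Collapse a set of instances into a single pattern string.

    Positions holding the same character in every instance keep that
    character; positions with varying digits become "№" and positions
    with varying letters (or mixed content) become "‡". Instances of
    unequal length are truncated to the shortest (an assumption).
    """
    length = min(len(v) for v in instances)
    pattern = []
    for i in range(length):
        chars = {v[i] for v in instances}
        if len(chars) == 1:
            pattern.append(chars.pop())
        elif all(c.isdigit() for c in chars):
            pattern.append("№")
        else:
            pattern.append("‡")
    return "".join(pattern)
```

Run on two UK IBANs, this reproduces the [GB№№‡‡‡‡№№№№№№№№№№№№№№] pattern from the example above.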

An alternative implementation would be to express the patterns of the source attributes as regular expressions and award a score to those matches where the target attribute's instances validate against the regular expression. For example, from a source attribute containing IBAN numbers we can create the following regular expression:

[a-zA-Z]{2}[0-9]{2}[a-zA-Z0-9]{4}[0-9]{7}([a-zA-Z0-9]?){0,16}

which any target attribute containing IBAN numbers must adhere to. This method is more rigid than the first: contrary to the above regex, which validates all known IBANs, a regex derived only from UK IBANs will not validate German IBANs. In light of that, for the implementation of our system we opted for the first approach.

3 The Human-in-the-Loop Approach

We extend our two-phase matcher system, which we will henceforth refer to as ASID+, with a module that implements our proposed solution for increasing the quality of the matching by using small pieces of information in the form of user feedback. We then evaluated our approach and established that it can make a big difference in the performance of an automated schema matching tool. The design of the system is presented in Fig. 3.

Fig. 3.
figure 3

2-Phase matching with user feedback

3.1 The Human in the Loop Match Combiner

The human-in-the-loop approach uses a simple voting match combiner to compute an overall score for each possible match: the weighted sum of all matchers' scores is divided by the sum of their weights. If after the total evaluation a match is deemed probable (with a score greater than 0.5), the evaluator produces a “good match,” presenting it to the user and removing the matched pair from the source and target schemata.
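A minimal sketch of this voting combiner, reading the description above as a weight-normalized average with the 0.5 cutoff:

```python
def combine(scores, weights, threshold=0.5):
    """Weight-normalized voting combiner.

    `scores` and `weights` are parallel sequences, one entry per
    matcher. Returns the combined score and whether the pair should
    be proposed as a "good match".
    """
    combined = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return combined, combined > threshold
```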

User Interaction and Weight Redistribution.

As already suggested, combining the individual matchers' scores cannot involve training of the system, since schema matching is more often than not a one-time task. Proposed solutions like a decision tree combiner fail to leverage the knowledge of the human expert, which, apart from the schemata to be matched, may be the only other available source of information. In most systems, interaction between the user and the system happens before (providing the data) or after (evaluating the results) the matching process. We propose an interactive matching method, based on our two-phase idea, which aims to get the best out of both the human and the computer world.

After the system produces matches, the user is asked to confirm or reject a limited number of results. This feedback is then used to evaluate the performance of each technique and adjust its weight in our prediction combiner. The weights are initially set to values that have been found to be good starting points, but they can be set to any values without affecting the final outcome of the system, since, given enough iterations, they will gradually shift to optimal values. These can even be as extreme as zero for all but one technique, making the system work as a single matcher (Fig. 4).

Fig. 4.
figure 4

User Interaction and weight redistribution

The matches presented to the user are (pseudo)randomly selected from a pool of matches hovering just above or below a cutoff confidence point (0.5 in our experiments). The user is only asked whether a match is correct or not, as requiring them to manually search for, construct and submit a correct match would be tedious. When the system produces a true match, the individual techniques that scored it correctly have their weights increased; if the match was wrong, those that scored it incorrectly have their weights reduced. Both adjustments are proportional to the respective confidence scores. The algorithm of the operation and an example of running it once are shown below (Fig. 5).

Fig. 5.
figure 5

Weight redistribution example
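The weight update illustrated in Fig. 5 can be sketched as follows; the learning rate is an assumed constant, since the text gives the update only qualitatively:

```python
def redistribute(weights, scores, match_confirmed, rate=0.1):
    """One weight-redistribution step from a single user answer.

    Matchers that supported the match (score above the 0.5 cutoff)
    are rewarded if the user confirms it and penalized if the user
    rejects it, proportionally to their own confidence. `rate` is an
    assumed constant, not taken from the paper.
    """
    new_weights = []
    for w, s in zip(weights, scores):
        if s > 0.5:  # this matcher voted for the match
            w = w + rate * s if match_confirmed else w - rate * s
        new_weights.append(max(w, 0.0))  # weights never go negative
    return new_weights
```

A weight's absolute value is irrelevant; only its size relative to the other weights matters in the combiner.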

After all the random matches have been evaluated by the user, the matching is repeated with the new values. It is of note that the weight of each technique has no meaning other than distinguishing its proportional strength over the others.

3.2 Over-Fitting as a (Rare) Problem

During our experiments it became apparent that random sampling can, in edge cases, lead to overfitting. For example, a source schema that was developed incrementally had parts of its documentation in English and parts in transliterated Greek. The documentation matcher did exceptionally well in the parts that were documented in English (which had corresponding documentation in the target schema) but underperformed in the transliterated parts. When the matches presented to the user all came from the same type of documentation, the matcher was penalized or rewarded far beyond the desired point. For this reason every matcher has a self-mute switch that activates when the matcher is no longer reliable (e.g. when multiple documentation languages are present) and explicitly excludes it from the process.

3.3 Confidence in the Human Expert

For the system to work, the human expert must provide correct answers before the second matching step. We made certain choices and assumptions when we designed the system to interact with a single expert user. We designed and implemented the feedback phase aiming to minimize input errors by asking questions that can only be answered with a simple Yes/No/Ignore. We opted against the more elaborate “select and score the correct match for attribute x”, as this is much more prone to errors, as suggested in [23]. A central point of our design is that the user is just validating the results earlier than they usually would, since a human expert would ultimately need to evaluate the results and make the final decisions. With this in mind, and given the simple nature of the questions, the user's feedback is considered correct.

4 Experiments and Evaluation of the System

In this section we evaluate our two approaches against each other and against state-of-the-art systems. We conduct experiments with various datasets and demonstrate the robustness of two-phase schema matching and the added value of human user interaction to the matching process. We also show that our novel niche matchers can be used to boost the system's performance.

4.1 Overview

Three distinct and very different schema collections were used in the experiments. The aim was to see the system perform matches on small (~ 30 attributes) and medium (~ 120 attributes) schemas, as well as on large schemas derived from open source projects.

For the small schema test we used part of the data used in [2] to evaluate the system.

For the medium schema experiments, we used real schemas from the Greek cadastre, where different schemas were developed independently but address the same needs. With up to 120 attributes, they are representative of schemas used in medium-sized applications. The large schemata were derived from two open source shopping cart solutions [15, 16] that are both large and complicated.

The systems we compare ASID+ to are the Harmony matcher [19] as implemented in OpenII [20], and COMA ++ v3 community edition [21] (a derivative of [8–10, 17]). As ASID+ outputs only the best available match, we constrained Harmony to do the same and hence only evaluated the best match it produced, i.e. the one with the highest score. If we were to evaluate every match proposed by Harmony above a reasonable threshold, the number of false positives (and hence the number of pairs a human expert would have to evaluate and discard) would greatly increase. We also set Harmony's threshold to 0.5 to eliminate matches that would probably be false positives.

As our system is a self-tuning one, we ran all experiments using the default settings (for COMA ++ that meant using the $AllContextInstW dataflow with no other tuning).

In all experiments we ignored matches of attributes that exist only as unique identifiers with no real-world meaning, as these depend on the design principles used and do not transfer between schemata.

For the rest of the paper, the version of our system that uses a priori knowledge of the performance of each matcher is named ASID; it comes in two flavors, one with 4 matchers that excludes instance pattern matching and datatype matching, and one that has all 6 matchers. ASID+ is the name of our human-in-the-loop implementation, which also has the complete set of matchers. Also of note is that in these experiments ASID+ was configured to use only a single feedback phase.

4.2 Small Schema Tests

The “small schema” dataset was kindly provided by AnHai Doan and is the relational dataset used in [2]. Table 2 has information on the schemas used.

Table 2. Small Schema information

The correspondences between the attributes were sometimes, but not often, complex (rather than one-to-one). To create the schemata in OpenII we used its own SQL DDL importer on the source schemata's DDL scripts. Figure 6 summarizes the results of our evaluation and shows how our approach performs better on small schemas.

Fig. 6.
figure 6

Small schemas Accuracy (%)

In addition to improving the overall number of correct matches, there was a dramatic drop in false positives, with almost all proposed matches being true positives. The experiments demonstrate the robustness of the two-phase method and show that feedback from a human expert can improve the system's performance.

4.3 Medium Schema Tests

The experiments were conducted using schemas created for the Greek cadastre. The system was supplied with the distinct values of more than 30,000 tuples in each case. All schemas and data were real (Table 3).

Table 3. Medium schema information

The results (Figs. 7 and 8) illustrate that, despite the increased complexity of real-world schemas, accuracy remains high at ~ 85 %. Comparing the drop in accuracy of ASID+ and ASID relative to their performance in the small schema experiments shows that user interaction can make the system more robust with regard to increased schema size.

For the comparison with the other systems we decided against using the medium schemas, because they have documentation and attribute/table names in both Greek (transliterated or not) and English. Matching them is therefore heavily dependent on instance-based matchers, which the chosen open source schema matching systems lack, something that would skew these experiments in favor of our system.

Fig. 7.
figure 7

Number of true positives on medium schemas (left)

Fig. 8.
figure 8

Medium schemas ASID accuracy (%) (right)

4.4 Large Schema Comparison

For the large schemas we used the 6-matcher, human-in-the-loop ASID+. For Harmony we used the DDL importer to import the schemas, but in both DDL scripts of the e-shops some non-standard datatypes had to be substituted by their more common analogues.

The results in Fig. 9 suggest that ASID+ is able to outperform Harmony by a margin of 20 percentage points. It is notable that it manages this by utilizing all of its matchers, despite the fact that the available data instances are of very low quality, confirming our assumption that a small amount of user input can go a long way in increasing the robustness of our system. The same observation holds, to a lesser extent, for COMA ++, where the difference is 7 percentage points.

Fig. 9.
figure 9

Large schemas systems comparison (Precision %)

4.5 Conclusions on Experiments

The experiments presented strongly suggest that our approach can offer better results in automated schema matching. They demonstrate both the robustness of two-phase schema matching and the positive effect a user-assisted prediction combining method can have on the schema matching process.

Moreover, the evaluation of the results of ASID+, irrespective of the status of the niche matchers, demonstrates that our method, by taking user input into account to assign weights to the various matching methods, helps improve the results while at the same time allowing “niche” matching techniques to be used without degrading the results.

5 Related Work

The systems proposed and/or built through the years can be divided, for the purposes of this work, into two major categories: those that somehow involve a human in the matching process and those that do not. Of the latter there are many works worth mentioning. iMap [11], while able to identify complex matches (e.g. address = concat(city, state)), a feature rarely seen in schema matching tools, achieves 43–92 % success in matching attributes, a range suggesting it may not be stable enough for industrial schema matching. Similarity Flooding [12] is based on the eponymous graph matching algorithm. The strength of the algorithm (and also its weakness) is the lack of knowledge it has about the attributes being matched; a side effect is the production of out-of-context matches that need to be filtered out. The basic idea behind COMA ++ [9, 10] is to create a true composite system. It is a generic system, designed to be adaptable to many matching problems, achieving notable results not only in data integration, and for that reason it is one of the systems we tested ASID+ against. The LSD system [5] is a demonstrator of a multi-strategy learning approach to schema matching. Results show it achieves 71–92 % accuracy, but it demands training on target schemata, something that often is not possible. MKB [2] displays high accuracy, either autonomously or as an additional matcher, but only when past mappings exist, as it requires a large number of previously matched schemata, something not likely to be available in industrial RDBMS deployments. U-MAP [7] is again a technology demonstrator for a novel idea: using information extracted from query logs to generate correspondences between the attributes of two different schemas and the mappings between them.
YAM [22] is actually a schema matcher generator designed to produce a tailor-made matcher based on a knowledge base, not on human expert knowledge, which is by definition more relevant to the domain at hand. Harmony [19] aims to speed up the task of finding correspondences across two data schemas. As implemented in the OpenII project [20], it uses a variety of rather simple matchers, but its one-to-many visualization solution tends to produce far too many correspondences in real-world schemas, making it counter-intuitive.

Then there are the systems that enable human users to aid in the schema matching process. In [24] the authors focus on pay-as-you-go reconciliation in schema matching networks using a probabilistic matching network. In contrast to ours, the system in [24] focuses not on the costly and error-prone task of one-off schema matching but on improving the outcome of a lengthy, incremental process involving many schemata. In [26] the authors build a system that aims to use user interaction to resolve matches; this happens at the start/end of the matching process and involves a large number of ordinary users. In [25] the authors propose the use of crowdsourcing techniques to reduce ambiguity in schema matching. In contrast to the previous two works, our system involves one human expert during the matching process. ASID does not use the knowledge gained to resolve matches or to retrain its matching techniques; it uses it to decide which are (or are not) the best matchers at its disposal for a given problem, although we do plan to experiment with using the user feedback to also resolve matches in future work. There are also a number of other schema matching systems and techniques of note, e.g. [7, 8, 18], designed to address different aspects of the schema matching problem.

6 Conclusions and Future Work

Our approach is based on the assumption that schema matching in business environments is under most scenarios a one-off task, and hence learning-based approaches are of limited use. Based on this we proposed a two-phase method for building composite systems. We examined both the use of a priori knowledge in deciding which individual methods are most important and the use of limited user input. To demonstrate the effectiveness of our approach we created a prototype system and used it to match real small, medium and large schemata.

Conducting extensive experiments on a variety of schemata of variable size and domain, we demonstrated a significant improvement over existing open source systems, ranging from a 13 % to 19 % increase in accuracy in small schema matching to a significant increase in precision in large schemas of 27 % to 47 %. The evaluation of the results also shows that taking user input into account to assign weights to the various matching methods improves result accuracy by 5 %, while at the same time allowing “niche” matching techniques to be used without the risk of results degradation.

We plan on exploring additional strategies to initiate and use the human-computer interaction that is the basis of our method. In particular, we plan to expand ASID+ to not only use the feedback as a way to adjust the matchers' weights but also to treat true positive attribute matches as match constraints that can exclude some matches from even being generated. There is also the interesting path of extracting more information from the human user, instead of simply asking a random set of simple yes/no questions about matches hovering above or below our threshold.