Keywords

1 Introduction

A DI system aims at integrating a variety of data obtained from different data sources, usually autonomous and heterogeneous, and providing a unified view of these data, often using an integrated schema. The integrated schema makes a bridge between the data sources and the applications that access the DI system. Data in a DI system can be physically reconciled in a repository (materialized data integration approach), or can remain at data sources and is only consolidated when a query is posed to the DI system (virtual data integration approach). A data warehouse system [1] is a typical example of the first approach. As examples of the second approach, we can cite federated information systems [2] and mediator systems [3]. In the present work, both scenarios can be used, but in this paper we will focus on the materialized integration approach.

One of the hardest problems to solve in DI is to define mappings between the integrated schema (the target) and each data source schema, known as the schema mapping problem. It consists of two main tasks: i) schema matching to define/generate correspondences (a.k.a. matches) between schema elements (e.g., attributes, relation, XML tags, etc.); and ii) schema mapping to find data transformations that, given data instances of a source schema, obtain data instances of the target schema.

The result of schema matching is a set of correspondences that relate elements of a source schema to elements of the target schema, where an element can be a relation name or attribute in the relational model. These correspondences can be described using a Local-as-view (LAV), a Global-as-view (GAV), or a Global and Local-asview (GLAV) language. In summary, in a LAV approach, each data source is described as a view over the integrated schema. In a GAV approach, the integrated schema is expressed as a view over the data sources. Finally, the GLAV combines the expressive power of both GAV and LAV. Once the schema matching is performed, the correspondences are used to generate the schema mappings. For example, a schema mapping can be codified through an SQL query that transforms data from the source into data that can be stored in the target.

Extensive research on schema matching has been carried out in recent years [4, 5]. The majority of the works on this subject identifies 1-1 correspondences between elements of two schemas. For example, a 1-1 correspondence can specify that element title in one schema matches element film in another schema, or that relation genre matches relation category Footnote 1. This kind of schema matching is known in the literature as basic matching. Good surveys can be found in [6, 7].

While basic matching is common, it leaves out numerous correspondences of practical interest, in particular when we consider DI systems. Thus, more complex matches are necessary. A complex matching specifies 1:n, m:n, or n:1 correspondences between elements of two schemas. For example, it may specify that totalPrice corresponds to unitPrice * quantity; or that name matches concatenate(firstName, lastName), where concatenate is a function that applies to two strings and returns a concatenated string; or even that the average departmental salary avgWage corresponds to grouping the salaries (salary) of all employees (emp) by department (dept). Works in [8, 9] are examples of approaches that deals with complex matches.

Some researchers go beyond dealing with complex matches and add semantics to the correspondences to improve the overall matching quality. In the Sect. 2, we explain more about complex matches and show a motivation example. The remainder of the paper is structured as follows. In Sect. 3, we present the necessary background in Correspondence Assertions (CAs), the formalism used in this work to specify correspondences between elements of schemas. In Sect. 4, we propose new CAs to deal with join operators and metadata. Section 5 shows how to generate mapping expressions from CAs. Section 6 shows some preliminary tests to evaluate our approach. Section 7 describes the related work. Finally, Sect. 8 concludes and describes future work.

2 Motivating Example

Consider a motivating example with the source schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\) in Fig. 1, which contain information about movies. \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) keeps a catalog of movies with information about different types of media (dvd, blue rays, etc.) in which the movies are available. The names of the relations and attributes are mostly self-explanatory. Some non-self-explanatory attributes in \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) have the following meaning: id is the movie identifier, year is the year of a movie, film is the title of a movie, number is the tape identifier, name can be a producer or a director name, and role can be producer or director. FK1 and FK2 are foreign keys. We use the notation FK(\(\textsc {R}\):L, S:K) to denote a foreign key, named FK, where \(\textsc {R}\) and S are relation names and L and K are list of attributes from \(\textsc {R}\) and S, respectively, with the same length. FK1 is the foreign key of tape that refers to movie and FK2 is the foreign key of movieMakers that refers to movie. \({{\mathbf {\mathsf{{S}}}}}\) \(_2\) stores general information about movies and the places (in different cities) where movies are being shown. We assume that \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) can store older movies than \({{\mathbf {\mathsf{{S}}}}}\) \(_2\). Some non-self-explanatory attributes in \({{\mathbf {\mathsf{{S}}}}}\) \(_2\) have the following meaning: rate is the classification of the movie with regard the audience, location and city are, respectively, the cinema and the name of the city where the movie is shown, and time is the date when the movie is shown.

Fig. 1.
figure 1

Example of source schemas and a integrated schema.

The integrated schema M, also shown in Fig. 1, provides a unified user view of movies currently shown in cinemas of Lisbon. It is populated by information from schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\). The relation movie stores movies shown currently at a cinema. The relation filmMakers keeps information about professionals of show businesses. The relation schedule contains information about the schedule of movies shown in Lisbon. The relation remakes keeps the years of movies for which there is at least one remake. The relation rating stores the classification of movies with regard to suitability audience. Some non-self-explanatory attributes in M have the following meaning: description is the summary of a movie, nvYear is the year of the most recent version of a movie, ovYear is the year of the older versions of a movie, and quantity is the total of movies with the same rating.

Given the schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\), \({{\mathbf {\mathsf{{S}}}}}\) \(_2\), and M, we can consider the correspondences between the source schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\), and the target schema M. As an example, we can state that M.schedule corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).showTime, because both relations store information regarding the same real world conceptFootnote 2. However, in this correspondence, it is not clear that M.schedule only keeps schedules about movies shown in Lisbon. The additional information: M.schedule corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).showTime when \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).showTime.city = “Lisbon”, specifies better the matching.

The works reported in [1012] and [5](chap. 3) propose schema matching approaches that can specify correspondences to deal with situations as required in the example The reader can see more proposals to add semantics to schema matching in [8, 10, 12, 13]. However, the following situations have not been fully covered yet:

  1. 1.

    Correspondences Between Relations Involving Join Conditions Other than Equality of Attributes. Consider the relation M.remakes that keeps a list of remakes with the years of the oldest versions. Knowing that \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film keeps current movies and \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie may contain older versions of the same movie, we want to indicate which of the current movies are remakes and store this information in M.remakes. The correspondence between these relations can be specified as: M.remakes corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film join \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie where \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.title = \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.film and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.year \(>\) \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.year. Usual schema matching approaches cannot specify this correspondence, because join conditions are not explicitly defined in schema matching. Moreover, join paths are normally automatically discovered in the schema mapping phase [14], and the algorithms used can only find equi-join conditions, so they cannot automatically discover the condition \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.year \(>\) \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.year. Hence, we need a schema matching approach that makes it possible to specify the join between relations and allows general join conditions containing operators different from equality.

  2. 2.

    Correspondences Between Relations Involving Outer-joins (Full, Left, or Right). We want to indicate how M.movie is related to source schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\). M.movie and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film represent the same concept of the real world (i.e., both relations store current movies shown at some cinema). However, it is not enough to specify that M.movie matches \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film, because there are attributes in \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie (namely, category and summary) that contain information required in the schema of M.movie. Hence, we should specify that M.movie is related to both \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film. However, it is not correct we simply match M.movie to \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie because \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie can store movies that are not being shown in a cinema anymore and M.movie can store recent movies that are not available in dvds yet. In summary, we should specify that: M.movie corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film left outer-join \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie on \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.title = \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.film and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.year = \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.year. Note that the condition \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.title = \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.film and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.year = \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.year guarantees that we refer to a same movie stored in both \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film. Again, we cannot specify this type of correspondence since joins (and their variants) are not explicitly defined in current schema matching approaches.

  3. 3.

    Correspondences Between Data and Metadata. Consider the relations \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers and M.filmMakers. Both keep information about the relationship between a movie, a producer, and a director. We want to indicate that M.filmMakers corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers since they represent the same concept in the real world. In addition, we want to specify the correspondences between the attributes of these relations. Knowing that \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.name can be a producer name or a director name, we would like to specify that M.filmMakers.producer corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.name when \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.role = “producer” and that M.filmMakers.director corresponds to \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.name when \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.role = “director”. However, we cannot specify these correspondences using traditional schema matching approaches, because these correspondences involve semantics not covered yet by these approaches. Actually, we can only specify that M.filmMakers.producer matches to \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.name and M.filmMakers.director matches to \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.name.

In order to deal with these situations, we propose to use a formalism based on CAs [10, 15]. Using CAs, we can declaratively specify basic and complex matchings with semantics. We propose to adapt CAs to be able to express schema matching between relational schemas, as well as to extend this formalism with new types of CAs to deal with joins, outer-joins, and data-metadata relationships. Finally, we demonstrate how mapping expressions in the form of SQL queries can be generated from CAs.

3 Background

In this section, we present the basic terminology used in this paper. We also review the different classes of CAs, and adapt them to the Relational Data Model (RDM).

3.1 Basic Concept and Notation

We assume that the reader is familiar with the relational concepts. We denote a relation schema as \(\textsc {R}\)(\(\mathcal{A}_1\), \(\mathcal{A}_2\), \(\dots \), \(\mathcal{A}_n\)), and a foreign key as FK(\(\textsc {R}\):L, S:K). We say that FK relates \(\textsc {R}\) and S.

A relational schema is a pair \({{\mathbf {\mathsf{{S}}}}}\)= (\(\mathcal{{R}}\), \(\Omega \)), where \(\mathcal{{R}}\) is a set of relation schemas and \(\Omega \) is a set of relational constraints such that: (i) \(\Omega \) has a unique primary key for each relation schema in \(\mathcal{{R}}\); (ii) if \(\Omega \) has a foreign key of the form FK(\(\textsc {R}\):L, S: K), then \(\Omega \) also has a constraint indicating that K is the primary key of S. Given a relation schema \(\textsc {R}\)(\(\mathcal{A}_1\), \(\mathcal{A}_2\), \(\dots \), \(\mathcal{A}_n\)) and a tuple variable t over \(\textsc {R}\), we use t[\(\mathcal{A}_i\)] to denote the projection of t over \(\mathcal{A}_i\).

Let \({{\mathbf {\mathsf{{S}}}}}\)= (\(\mathcal{{R}}\), \(\Omega \)) be a relational schema and \(\textsc {R}\) and T be relation names of relation schemas in \(\mathcal{{R}}\). We denote \(\varrho \) = FK \(_1\bullet \) FK \(_2\bullet \dots \bullet \) FK \(_{n-1}\) a path from \(\textsc {R}\) to T iff there is a list \(\textsc {R}\) \(_1\), \(\dots \), \(\textsc {R}\) \(_n\) of relation schemas in \({{\mathbf {\mathsf{{S}}}}}\) such that \(\textsc {R}\) \(_1\) = \(\textsc {R}\), \(\textsc {R}\) \(_n\) = T, and FK \(_i\) relates \(\textsc {R}\) \(_i\) and \(\textsc {R}\) \(_{i+1}\). We say that tuples of \(\textsc {R}\) reference tuples of T through \(\varrho \).

3.2 Correspondence Assertions

We use Correspondence Assertions (CAs) in order to express schema matchings between schema elements. CAs are formal expressions of the general form \(\psi \): \(\mathcal{T} \leftarrow \mathcal{S}\), where \(\psi \) is the name of the CA, \(\mathcal{T}\) is an expression formed by elements of the target schema, and \(\mathcal{S}\) is an expression formed by elements of a source schema. The symbol “\(\leftarrow \)” means “is matched from”.

In accordance to [10], there are four types of CAs: Relation Correspondence Assertion (RCA), Attribute Correspondence Assertion (ACA), Summation Correspondence Assertion (SCA), and Grouping Correspondence Assertion (GCA). RCAs and SCAs specify the relationship between relations of distinct schemas, while ACAs and GCAs specify the relationship between attributes of relations of distinct schemas. We now shortly describe each type of CA, adapting them to the RDM. In the remainder of this Section, consider: \({{\mathbf {\mathsf{{S}}}}}\) \(_i\)= (\(\mathcal{{R}}\) \(_i\), \(\Omega \) \(_i\)) be relational schemas for \(1 \le i \le n\), with \(\textsc {R}\) \(_i\) being relation names of relation schemas in \(\mathcal{{R}}\) \(_i\).

Definition 1

Let \(\sigma \) be a selection over \(\textsc {R}\) \(_2\). A Relation Correspondence Assertion RCA is an expression of one of the following forms:

figure a

In Definition 1, we say that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_i\), for \(1 \le i \le n\). RCAs express the various kinds of semantic equivalent relationships. Two relations \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\) are semantically equivalent if they represent the same real concept and there is a 1-to-1 correspondence between their instances. \(\psi _1\), shown in Fig. 2, is an example of a RCA.

Fig. 2.
figure 2

Examples of correspondence assertions.

\(\psi _1\) specifies that M.schedule is semantically equivalent to \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).showTime when the condition \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).showTime.city = “Lisbon” is satisfied. This means that only a subset of tuples of \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).showTime, those that satisfy the condition are involved in the match.

Before we define an Attribute Correspondence Assertion (ACA), we need introduce the concept of attribute expression, as follows:

Definition 2

Let \(\textsc {R}\) \(_2\) and T be relation names in \(\mathcal{{R}}\) \(_2\), with \(\mathcal{A}\) being an attribute of \(\textsc {R}\) \(_2\) and \(\mathcal{B}\) an attribute of T. Let also \(\varrho \) be a path from \(\textsc {R}\) \(_2\) to T. An attribute expression \(\mathcal {E}\) over \(\textsc {R}\) \(_2\) is an expression with one of the following forms:

figure b

Definition 3

Let \(\mathcal{A}\) \(_i\) be attributes of \(\textsc {R}\) \(_1\) (for \(1 \le i \le n\)). Let also \(\mathcal {E}\) \(_j\), for \(1 \le j \le m\), be attribute expressions over \(\textsc {R}\) \(_2\). An Attribute Correspondence Assertion (ACA) is an expression of one of the following forms:

figure c

Where \(\varphi \) is a function over attributes of \(\textsc {R}\) \(_2\), p \(_j\) (for \(1 \le j \le m\)) are boolean conditions over attributes of \(\textsc {R}\) \(_2\), and v is a value. We say that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\).    \(\Box \)

ACAs specify the relationship between the attributes of relations that are matched by a RCA. They allow to define 1:1, 1:n, n:1, or m:n relationships between attributes of relations of different schemas. For example see the ACA \(\psi _{2}\) presented in Fig. 2. It specifies the correspondence between M.schedule.movie and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.title through a path from showTime to film.

Definition 4

Let \(\sigma \) be a selection over \(\textsc {R}\) \(_2\). Let also \(\mathcal{A}\)\(_i\) attributes of \(\textsc {R}\) \(_2\) (for \(1 \le i \le m\)). A Summation Correspondence Assertion (SCA) is an expression of one of the following forms:

  1. 1.

    \(\psi \): \({{\mathbf {\mathsf{{S}}}}}_1[\textsc {R}_1] \Leftarrow groupby({{\mathbf {\mathsf{{S}}}}}_2[\textsc {R}_2](\mathcal{A}'_1, \mathcal{A}'_2, \dots , \mathcal{A}'_m))\)

  2. 2.

    \(\psi \): \({{\mathbf {\mathsf{{S}}}}}_1[\textsc {R}_1] \Leftarrow groupby({{\mathbf {\mathsf{{S}}}}}_2[\textsc {R}_2\sigma ](\mathcal{A}'_1, \mathcal{A}'_2, \dots , \mathcal{A}'_m))\)

  3. 3.

    \(\psi \): \({{\mathbf {\mathsf{{S}}}}}_1[\textsc {R}_1] \Leftarrow normalise({{\mathbf {\mathsf{{S}}}}}_2[\textsc {R}_2](\mathcal{A}'_1, \mathcal{A}'_2, \dots , \mathcal{A}'_m))\)

  4. 4.

    \(\psi \): \({{\mathbf {\mathsf{{S}}}}}_1[\textsc {R}_1] \Leftarrow normalise({{\mathbf {\mathsf{{S}}}}}_2[\textsc {R}_2\sigma ](\mathcal{A}'_1, \mathcal{A}'_2, \dots , \mathcal{A}'_m))\)    \(\Box \)

In Definition 4, we say that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\). SCAS specify 1:n, n:1, m:n relationships between relations with distinct schemas. Here we use the symbol “\(\Leftarrow \)” instead of “\(\leftarrow \)” in order to emphasize that the correspondence is not 1:1 as is usual in the most part of schema matching approaches. SCAS are used to describe the summary of a relation whose tuples are related to the tuples of another relation by gathering them into logical groups. This means that a SCA has only the necessary information to indicate which grouping field is involved in the relationship and the process used to grouping the tuples. \(\psi _3\) shown in Fig. 2 is a simple example of a SCA.

GCAs specify the relationship 1:1, 1:n, n:1, or m:n between attributes of relations that are matched by a SCA.

Definition 5

Let \(\mathcal{A}\) be an attribute of \(\textsc {R}\) \(_1\). Let also \(\mathcal {E}\) \(_i\), for \(1 \le i \le m\), be attribute expressions over \(\textsc {R}\) \(_2\). A Grouping Correspondence Assertion (GCA) is an expression of one of the following forms:

figure d

Where \(\varphi \) is a function over attributes of \(\textsc {R}\) \(_2\), p \(_j\) (for \(1 \le j \le m\)) are boolean conditions over attributes of \(\textsc {R}\) \(_2\), v is a value, and \(\gamma \) is one of the aggregate functions: sum (summation), max (maximum), min (minimum), avg (average), or count. We say that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\).    \(\Box \)

Consider the relations \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film and M.rating. \(\psi _4\), represented in Fig. 2, specifies that M.rating.quantity corresponds to the counting of all distinct values of \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.rate.

Definition 6

Let \({{\mathbf {\mathsf{{S}}}}}\) \(_1\), \({{\mathbf {\mathsf{{S}}}}}\) \(_2, \dots \), \({{\mathbf {\mathsf{{S}}}}}\) \(_n\) and T be relational schemas; \(\textsc {R}\) \(_1\) be a relation schema of T, and \(\textsc {R}\) \(_2\) a relation schema of some \({{\mathbf {\mathsf{{S}}}}}\) \(_i\), \(1 \le i \le n\). Let also \(\mathcal {E}\) \(_j\) (for \(1 \le j \le m\)) be expressions as defined in Definition 2. A schema matching between schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\), \({{\mathbf {\mathsf{{S}}}}}\) \(_2\), \(\dots \), \({{\mathbf {\mathsf{{S}}}}}\) \(_n\) and the schema T is a set \({\mathcal {M}}\) of CAs such that:

  1. 1.

    if \({\mathcal {M}}\) has an ACA \(\psi \) such that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\), then \({\mathcal {M}}\) has a RCA \(\psi '\) that matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\).

  2. 2.

    if \({\mathcal {M}}\) has a GCA \(\psi \) such that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\), then \({\mathcal {M}}\) has a SCA \(\psi '\) that matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\).

  3. 3.

    if \({\mathcal {M}}\) has a RCA \(\psi \) such that \(\psi \) matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\), then \({\mathcal {M}}\) has an ACA \(\psi '\): \({{\mathbf {\mathsf{{S}}}}}_1[\textsc {R}_1](\mathcal{A}_1, \dots , \mathcal{A}_n) \leftarrow (\mathcal {E}_1, \dots , \mathcal {E}_n)\) that matches \(\textsc {R}\) \(_1\) and \(\textsc {R}\) \(_2\).    \(\Box \)

4 Specifying New CAs

In Sect. 1, we identified the following types of relationships between schemas elements that are not properly handled in current schema matching approaches: (1) matches involving explicit join conditions; (2) matches involving outer-joins; and (3) matches involving data-metadata. Join (and outer-join) relationships can express one-to-one or many-to-many correspondences between the relations involved. Matches involving data-metadata can express many-to-many correspondences between the relations involved. So, we extend our previous definitions of RCA and SCA in order to better specify these types of matchings. In the following text consider \({{\mathbf {\mathsf{{S}}}}}\) \(_i\) relational schemas, \(\textsc {R}\) \(_i\) relation schemes of \({{\mathbf {\mathsf{{S}}}}}\) \(_i\) (for \(1 \le i \le 3\)), \(\theta \) a join condition between \(\textsc {R}\) \(_2\) and \(\textsc {R}\) \(_3\), and \(\mathcal{A}\) \(_j\) attributes of \(\textsc {R}\) \(_2\) (for \(1 \le j \le n\))

Definition 7

A Relation Correspondence Assertion (RCA) is an expression of one of the following forms:

figure e

Definition 8

A Summation Correspondence Assertion (SCA) is an expression of one of the following forms:

figure f

Consider the three examples about join, outer-join, and data-metadata correspondences described in Sect. 1. The correspondence between M.remakes and both \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movies and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film can be specified by the SCA \(\psi _5\) shown in Fig. 3. \(\psi _5\) specifies that M.remakes corresponds to a join between \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movies and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film where the join condition: \({{\mathbf {\mathsf{{S}}}}}\) \(_2.\) film.title= \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.film and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film.year \(>\) \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.year is satisfied.

Fig. 3.
figure 3

Examples of CAs involving joins, outer-joins and data-metadata.

The correspondence between M.movie and both \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film and \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie can be specified by the RCA \(\psi _6\), shown in Fig. 3. \(\psi _6\) specifies that M.movie corresponds to a left outer-join between \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film and \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.

The correspondence between M.filmMakers and \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers can be specified by the SCA \(\psi _7\), shown in Fig. 3. \(\psi _7\) specifies that M.filmMakers corresponds to grouping \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers by the attribute id, being that a data-metadata translation should be performed (i.e., some data should be converted into metadata).

Once the schema matching is finished, the CAs generated can be used, for example, to generate mapping expressions that convert data sources into data target. We propose that the mapping expressions are automatically generated in the form of SQL queries, which are used to load the relations (the materialized views) of the integrated schema.

5 From CAs to Mapping Expressions

In our proposal, the process to create queries to transform data from a schema to another one consists of three steps:

  1. 1.

    Indicate the source schemas and the integrated schema using a high-level data model. In our case, we use the RDM.

  2. 2.

    Define the CA that formally specify the relationships between the integrated schema and the source schemas.

  3. 3.

    Generate a set of queries based on the CAs generated in step 2, in order to populate the relations of the integrated schema.

In order to illustrate our approach, consider the integrated schema M and the sources schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\) shown in Fig. 1.

Now, we should define CAs between M and \({{\mathbf {\mathsf{{S}}}}}\) \(_1\), and CAs between M and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\). In our work, the CAs are specified using a GAV approach rather than a LAV one. One of reasons for our choices was due to the GAV approach makes the query answering easier than LAV one, both in materialized and in virtual integration approaches.

The process to generate the CAs consists of the following steps:

  1. 1.

    To each relation \(\textsc {R}\) \(^T\!\) of the target T do:

    1. (a)

      Identify the correspondences at a relation level (i.e., if there is a RCA or a SCA matching a target relation \(\textsc {R}\) \(^T\!\) and some source relation \(\textsc {R}\) \(^S\!\)).

    2. (b)

      Identify the correspondences at an attribute level: (1) identify the ACAs between the attributes of \(\textsc {R}\) \(^T\!\) and \(\textsc {R}\) \(^S\!\) (if there is a RCA between \(\textsc {R}\) \(^T\!\) and \(\textsc {R}\) \(^S\!\)); (2) identify the GCA between the attributes of \(\textsc {R}\) \(^T\!\) and \(\textsc {R}\) \(^S\!\) (if there is a SCA between \(\textsc {R}\) \(^T\!\) and \(\textsc {R}\) \(^S\!\)).

    3. (c)

      Determine which RCA and SCA can be combined to form a single CA.

In the current work, CAS were manually specified. However, we can use traditional schema matching tools (e.g.,[8, 9]) as a starting point to find basic matchings. Then these basic matchings can be enriched through our formalism (using the CAs).

Some examples of RCAs, ACAs, SCAs, and GCAs between elements of M and the source schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\) and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\) can be found in Figs. 3 and 4.

Fig. 4.
figure 4

Examples of ACAs and GCAs.

The final step in the process of creating queries to transform data from a schema to another is the generation of the queries. In our proposal, they are defined based on the definition of the schemas and the CAs. Here we use SQL syntax of MySQL, since MySQL is an open source database that allows to combine the information from many databases in a single query. However, our CAs can be used to generate queries in any SQL syntax or even other federating queries languages as SchemaSQL [16].

Let \({\mathcal {M}}\) be a set of CAs that defines a matching between the source schemas \({{\mathbf {\mathsf{{S}}}}}\) \(_1\), \({{\mathbf {\mathsf{{S}}}}}\) \(_2\) and the integrated schema G, that is, \({\mathcal {M}}\) satisfies the conditions stated in Definition 6. Algorithm 1 shows the procedure to automatically generate the statements of SQL queries from the CAs in \({\mathcal {M}}\).

The Algorithm 1 generates a set of SQL queries, one for each relation schema \(\textsc {R}\) \(^T\) in the integrated schema. First it spans all ACAs and GCAs that relates attributes of \(\textsc {R}\) \(^T\), and puts the correct value in lists S, J, and LA, in accordance to the type of the CA. S keeps the relation schemas that will be included in the FROM clause, J keeps the join conditions that will be included in the WHERE clause, and LA keeps the attributes that will be included in the SELECT clause. The procedure G_SQL_ACA(), shown in Algorithm 3, spans the ACAs, while the procedure G_SQL_GCA(), shown in Algorithm 3, spans the GCAs. After, the algorithm spans the RCAs and SCAs, of \(\textsc {R}\) \(^T\), in order to create the SQL query to load \(\textsc {R}\) \(^T\), using templates in Table 1. In accordance to type of CA besides S, J, and LA, other variables are needed to keep the join conditions that will be included in the ON clause (\(\theta \)), the relation schema that will be included in (inner, outer, left, or right) JOIN clause (RJ), and the grouping attributes that will be included in the GROUP BY clause (G). Due to space limitations, Algorithms 1, 2, and 3, as well as the Table 1, do not cover the whole set of CAs as defined in Definitions 3, 5, 7, and 8.

In Algorithms 2, we assumed that \(\varphi \)() is a pre-defined SQL function or a user-defined function on SQL. In Algorithms 3, JAux, and Aux are lists used when it is necessary to create temporary tables in SQL. This occurs when the SQL query is created from a SCA of metadata. JAux stores the joins that will be included in the WHERE clause of the temporary table, while Aux keeps the relation schema that will be the alias of the temporary table.

figure g
figure h

In Table 1, Att() is a function that returns the list of attribute names of a relation schema. We use the short word outer join to emulate a UNION of a LEFT JOIN and a RIGHT JOIN, since MySQL does not support directly full outer-joins.

The SQL queries generated by our algorithms can be used to compute the data target once, and to recompute them at pre-stablished times in order to maintain the target data up-to-date (this approach is named rematerialization). Generally, a more efficient approach is to periodically modify only part of the target data to reflect updates in data sources (this approach is named incremental maintenance). Rematerialization is adequate, for example, when the integrated schema is firstly populated, or in situations involving complex operations.

Figure 5 presents the SQL query to transform data from \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie and \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film to M.movie from the RCA \(\psi _6\) and ACA \(\psi _{8}, \psi _{9}, \psi _{10}, \) and \(\psi _{11}\). The “select” clause (in line 2) is derived based on ACAS \(\psi _8\), \(\psi _9\) , \(\psi _{10}\) and \(\psi _{11}\). The “from” clause (in line 3) implements a join operation as specified by RCA \(\psi _6\). The “on” clause (in line 4) is based on the join condition indicated in the end of \(\psi _6\).

figure i
Fig. 5.
figure 5

Query definition to populate M.movie from \({{\mathbf {\mathsf{{S}}}}}\) \(_2\).film and \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movie.

Figure 6 presents the definition of the query to transform data from \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers to M.filmMakers. For this query, we have to define a nested select statement to each case-base GCA that relates attributes of \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers to attributes of M.filmMakers. Each nested select statement must be joined through an outer-join in order to guarantee both: i) that duplicate tuples will be merged properly, and 2) not duplicate tuples will be stored in M.filmMakers. Thus, the clauses “from” (line 3), “outer join” (line 7), and “on” (line 12) correctly implement the data-metadata relationship specified by the SCA \(\psi _{7}\). The “on” clause (line 12) is based on the attribute indicated in \(\psi _{7}\). The first nested select statement (lines 4 to 6) is defined based on the GCAS \(\psi _{12}\) and \(\psi _{14}\). The second nested select statement (lines 8 to 11) is similar to the first one, but now it is based on \(\psi _{13}\) and \(\psi _{14}\). The “select” clause in line 2 is based on the left-hand side of GCAs \(\psi _{12}\), \(\psi _{13}\) and \(\psi _{14}\).

Table 1. Templates to generate SQL Statements induced by RCAS and ACAS.
Fig. 6.
figure 6

Query definition to populate M.filmMakers from \({{\mathbf {\mathsf{{S}}}}}\) \(_1\).movieMakers.

6 Empirical Evaluation

We have performed some preliminary tests to verify that our approach is tractable for reasonably sized input.

6.1 Study Case Scenario

For our evaluation, we create a case study to simulate a situation close to the real world. We need to integrate information of three different sources: IES, FSP, and CDV to get a more complete information about Brazil’s universities. IES contains data about Brazil’s universities (name, city, state, etc.). It has a single relation (ies_2011) with 26 attributes and 2366 tuplesFootnote 3. FSP contains data about the ranking of the Brazil’s universities (ranking, university, grade, etc.). It has two relations, but only one of them (ranking, with 13 attributes and 191 tuples) was used in the evaluationFootnote 4. CDV contains data about the living cost of some cities of Brazil. It has a single relation (livingCost) with 5 attributes and 84 tuplesFootnote 5.

The integrated schema, named G, contains the necessary structure to keep the information required by the designer. It contains 8 relations with a total of 31 attributes and 8 foreign keys, as can be saw in the Fig. 7.

Fig. 7.
figure 7

The integrated schema G.

6.2 Method

We measure the performance of the data translation (i.e., the run time of the queries to load the schema G). For our tests, we have used a Macbook Pro/2.3GHz Intel Core (4GB of RAM and 499Gb of HD) running OSx 10.9.5. All databases were locally stored in this machine using the MySQL 5.6.

We first manually defined the CAs, with the aid of a tool implemented by us. For this case study, we defined 3 RCAs, 5 SCAs of normalize, 31 ACAs, and 8 GCAs, being a total of 52 CAs. Using the Algorithm 1, we generate 8 SQL queries: 5 queries of group by, 2 simple select-from queries, and 1 more complex query that simulates the outer join operator. Some queries use stored functions defined to looks for the value of a primary key in a target relation based on the attribute value of a source relation.

For data translation test, we measured the time that MySQL took to load each target relation using the queries generated by the Algorithm 1. Due to the run time of SQL queries can change depending on internal and external factors, we ran each query by 50 times and took the average to each 10 executions. All tests were performed locally in a same machine and only the MySQL server and MySQLWorkbench were running at the time. The result of the test can be observed through the chart shown in Fig. 8.

Fig. 8.
figure 8

Run time query by quantity of executions chart.

6.3 Discussion

We noted that the run time to most queries are more or less constant and below that 60 ms. It is not a surprise that the queries with higher execution time were those to load fsp_ranking (about 200ms) and university (about 590 ms), because both has more tuples to load than the others. Considering the number of tuples of the university (more than 2000) and that the query generated is a bit complex (includes left-join, right-join, union all, and 5 stored functions), we believe that 590 ms is a good performance.

7 Related Work

Schema matching is an important step of the data integration process. Typically, 1:1 correspondences between two different schemas are manually defined using a GUI or are (semi-) automatically discovered using matchers (usually through heuristics). Each correspondence, in general, only specifies which elements refer to a same attribute or relation in the real world [17]. AgreementMaker [4], and OII Harmony [9] are some examples of tools for schema matching. AgreementMaker [4] can match schemas and ontologies using schema information as well as instance-level data to generate the correspondences. OII Harmony [9] combines multiple matchers algorithms based on natural-language processing to identify correspondences between schemas.

Correspondences such as those defined/generated in [4, 9] do not provide all necessary information for discovering expressions to transform data sources in data target (i.e., the mapping expressions), the next phase in the schema mapping process. Richer models for specifying correspondences between schemas were proposed by [8, 1113] and [5](chap. 3). These approaches allow to define one-to-one or many-to-one attribute correspondences (i.e., association between attributes of two schemas). COMA++ [8] is a generic prototype for schema and ontology matching, schema-based and instance-based, and support a semi-automatic or manual enrichment of simple 1:1 correspondences into more complex mapping expressions including functions to support data transformations. [13] describes the IMAP system, which semi-automatically discovers complex matches, using different kinds of information such as domain knowledge, and domain integrity constraints to improve matching accuracy. [5](chap. 3) and [11] allow to express conditional correspondences (i.e., the value of an attribute A is the same of an attribute B if a given condition is satisfied). More closely to our approach is the work in [12]. In [12], the authors allow to manually specify one-to-one correspondence assertions between elements of Entity Relationship models. Although they cannot specify many-to-many matches, their correspondences have some semantic and allow to specify relationships such as: equivalence, union, intersection, and selection.

[10] specify one-to-one and many-to-many basic, complex, and semantic matches between elements of object-relational schemas. They can specify most part of the correspondences specified in [12] and other more complex. For example, they can deal with aggregate functions, denormalisations, and grouping (i.e., group by in SQL). Joins and outer-joins are implicitly defined based on the integrity constraints or match functionsFootnote 6. A distinguished feature of the approach proposed in [10] is that it allows to match, in the same correspondence, relations and attributes of two or more schemas. Yet, the information they provide is not sufficient, since they do not explicitly enable the specification of join paths and its variants, nor to deal with data-metadata relationships.

Data-metadata translations between elements of different relational schemas have been studied extensively. SchemaSQL [16] and FIRA/FISQL [18] are the most notable works on this subject. SchemaSQL [16] is a SQL-like metadata query language that uses view statements to restructure one column of values of a relation into metadata in another one. FISQL [18] is a sucessor of SchemaSQL and it is equivalent to the query algebra FIRA. Both SchemaSQL and FIRA/FISQL were proposed to provide interoperability in relational multi-database systems. Our SCA of metadata was based on the promote metadata operator of FIRA.

8 Conclusions

This paper focused on present CAs that deal with 1:1 and m:n matchings between schemas components, including correspondences involving aggregations, joins, and metadata. We emphasize that, in our approach, the CAs can specify basic and complex correspondences with semantics. Using CAs, we shown how SQL queries can be automatically generated to populate relations (views) of a global schema.

We presented some preliminary tests to evaluate the performance of the queries generated from CAs. We intend to realize more tests to evaluate the performance to different types of queries and other datasets.

We currently are working in as specifying complex correspondences between relational schemas and (RDF). Some initial work was published in [19]. We intent extend the initial proposal with the CAs presented here.