1 Introduction

Historically, data integration is associated with distributed databases (DBs), developed mainly in the 1980s. These DBs mostly used the relational DB model, with a global schema and local schemas for the DBs placed at multiple nodes of a network. Two approaches based on DB schema management then emerged:

  • top-down – starting with a global schema to design schemas for particular data stores in network sites,

  • bottom-up – schema mappings relate the schemas of the data stores at the sites, with the help of a middleware (e.g., JDBC). The process consists of integrating local DBs with their (local) schemas into a global DB with its global schema.

Recall that the former concerns rather homogeneous DB models in the integrated data stores, usually relational DBs, while the latter supports various DB models and, consequently, heterogeneous database systems (DBSs).

Now, systems that store and process Big Data have become a common component of data management architectures. Generally, Big Data can be a combination of (i) structured data in DBs and data warehouses based on SQL, (ii) semi-structured data, such as web server logs or streaming data from sensors, organized by means of, e.g., RDF graphs or XML documents, or (iii) unstructured data, such as document (or text) collections. Here, we will consider categories (i) and (ii).

A traditional problem in approaching data in such an environment is how the data is integrated. The remainder of the paper is organized as follows. Section 2 presents functional modelling of conceptual and DB structures, including tools appropriate for querying them, i.e., typed functions and a typed lambda calculus. Section 3 explores some approaches to data integration. Section 4 presents functional querying of integrated data. Finally, Sect. 5 provides conclusions and topics for future work.

2 Functional Data Modelling

We start from classic approaches to functional DBs that use a version of functional typing and a typed lambda calculus (Sects. 2.1 and 2.2; for more details see, e.g., [6]). In Sect. 2.3, we show how attributes, as functional conceptual structures, can be described by natural-language expressions. Combining attributes with the typed lambda calculus, we obtain a powerful query language (QL), presented in Sect. 2.4.

2.1 Functional Data Types

We assume the existence of some elementary types \(S_{1} , \ldots ,S_{\text{k}}\) (k ≥ 1) constituting a base B. More complex types are constructed in the following way:

If S, \(R_{1} , \ldots ,R_{\text{n}}\) (n ≥ 1) are types, then

  (i) (S:R1, …, Rn) is a (functional) type,

  (ii) (R1, …, Rn) is a (tuple) type.

The set of types T over B is the least set containing all types from B and those given by (i)–(ii). When Si in B are interpreted as non-empty sets, then (S:R1, …, Rn) denotes the set of all (total or partial) functions from \(R_{1} \times ... \times R_{\text{n}}\) into S, and (R1, …, Rn) denotes the Cartesian product \(R_{1} \times ... \times R_{\text{n}}\). The elementary type Bool = {TRUE, FALSE} is also in B. It allows sets (resp. relations) to be modelled as unary (resp. n-ary) characteristic functions. An object o of type T is called a T-object, denoted o/T. Logical connectives, quantifiers, and predicates are typed functions, e.g., and/(Bool:Bool, Bool) and implies/(Bool:Bool, Bool). Arithmetic operations are (Number:Number, Number)-objects. Aggregation functions also have associated types, e.g., SUM/(Real:(Bool:Real)). We use infix notation for functions and arithmetic operations. We write ‘∀x…’ and ‘∃x…’ for application of the universal and existential quantifier, respectively. Relations are (Bool:S1, …, Sm)-objects, where Si are descriptive elementary types.
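The type constructors (i)–(ii) and the reading of sets as characteristic functions can be illustrated by a small Python sketch. It is only an illustration under stated assumptions: the class names `Base`, `Func`, `Tup` and the helpers `show`, `make_set`, and `SUM` are hypothetical, and `SUM` is given an explicit finite domain so that the characteristic function can be enumerated.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Base:
    """An elementary type from the base B (hypothetical representation)."""
    name: str


@dataclass(frozen=True)
class Func:
    """A functional type (S:R1, ..., Rn): functions from R1 x ... x Rn into S."""
    result: "Type"
    args: Tuple["Type", ...]


@dataclass(frozen=True)
class Tup:
    """A tuple type (R1, ..., Rn)."""
    parts: Tuple["Type", ...]


Type = Union[Base, Func, Tup]


def show(t):
    """Render a type in the (S:R1, ..., Rn) notation used in the text."""
    if isinstance(t, Base):
        return t.name
    if isinstance(t, Func):
        return "(" + show(t.result) + ":" + ", ".join(show(a) for a in t.args) + ")"
    return "(" + ", ".join(show(p) for p in t.parts) + ")"


BOOL, REAL = Base("Bool"), Base("Real")
SET_OF_REAL = Func(BOOL, (REAL,))      # a set of reals as a characteristic function
SUM_TYPE = Func(REAL, (SET_OF_REAL,))  # the type of the aggregation function SUM


def make_set(elements):
    """Model a set as its unary characteristic function."""
    members = frozenset(elements)
    return lambda x: x in members


def SUM(char_fn, domain):
    """Sum the members of a set given by char_fn; the finite domain is an assumption."""
    return sum(x for x in domain if char_fn(x))
```

Here `show(SUM_TYPE)` yields the notation SUM/(Real:(Bool:Real)) from the text, and `SUM(make_set([1.5, 2.5]), [1.5, 2.5, 4.0])` adds only those elements of the domain for which the characteristic function is TRUE.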

2.2 Typed Lambda Calculus

Let F be a collection of constants, each having a fixed type, and suppose that a denumerable set of variables of each type is at our disposal. The language of lambda terms LT is defined as follows:

Let types R, S, R1, …, Rn (n ≥ 1) be elements of T. Then

  (1) Every variable of type R is a term of type R. (variable)

  (2) Every constant (a member of F) of type R is a term of type R. (constant)

  (3) If M is a term of type (S:R1, …, Rn), and N1, …, Nn are terms of types R1, …, Rn, respectively, then M(N1, …, Nn) is a term of type S. (application)

  (4) If x1, …, xn are distinct variables of types R1, …, Rn, respectively, and M is a term of type S, then λx1, …, xn(M) is a term of type (S:R1, …, Rn). (λ-abstraction)

  (5) If N1, …, Nn are terms of types R1, …, Rn, respectively, then (N1, …, Nn) is a term of type (R1, …, Rn). (tuple)

  (6) If M is a term of type (R1, …, Rn), then M[1], …, M[n] are terms of respective types R1, …, Rn. (components)

Terms are interpreted by an interpretation that assigns to each constant from F an object of the same type, together with a semantic mapping from LT into the functions and Cartesian products given by the type system T. Briefly, an application is evaluated as the application of the associated function to its arguments, while a λ-abstraction “constructs” a new function. A tuple is a member of the Cartesian product of sets of typed objects.
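The six formation rules of LT can be read as a type-checking procedure. The following Python sketch is one possible rendering, not a definitive implementation: types are encoded as nested tuples, and all names (`base`, `fun`, `tup`, `Var`, `Const`, `App`, `Lam`, `Tup`, `Proj`, `type_of`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Tuple


# Hypothetical encoding of types as nested tuples.
def base(name):
    return ("base", name)


def fun(result, *args):      # the functional type (S:R1, ..., Rn)
    return ("fun", result, tuple(args))


def tup(*parts):             # the tuple type (R1, ..., Rn)
    return ("tup", tuple(parts))


@dataclass(frozen=True)
class Var:
    name: str
    type: Any


@dataclass(frozen=True)
class Const:
    name: str
    type: Any


@dataclass(frozen=True)
class App:
    fn: Any
    args: Tuple[Any, ...]


@dataclass(frozen=True)
class Lam:
    params: Tuple[Var, ...]
    body: Any


@dataclass(frozen=True)
class Tup:
    items: Tuple[Any, ...]


@dataclass(frozen=True)
class Proj:
    term: Any
    index: int               # 1-based, as in M[1], ..., M[n]


def type_of(term):
    """Return the type of an LT term or raise TypeError if it is ill-formed."""
    if isinstance(term, (Var, Const)):                     # rules (1) and (2)
        return term.type
    if isinstance(term, App):                              # rule (3)
        f = type_of(term.fn)
        arg_types = tuple(type_of(a) for a in term.args)
        if f[0] != "fun" or f[2] != arg_types:
            raise TypeError("ill-typed application")
        return f[1]
    if isinstance(term, Lam):                              # rule (4)
        names = [p.name for p in term.params]
        if len(set(names)) != len(names):
            raise TypeError("bound variables must be distinct")
        return fun(type_of(term.body), *(p.type for p in term.params))
    if isinstance(term, Tup):                              # rule (5)
        return tup(*(type_of(i) for i in term.items))
    if isinstance(term, Proj):                             # rule (6)
        t = type_of(term.term)
        if t[0] != "tup" or not 1 <= term.index <= len(t[1]):
            raise TypeError("ill-typed component selection")
        return t[1][term.index - 1]
    raise TypeError("not an LT term")


BOOL = base("Bool")
AND = Const("and", fun(BOOL, BOOL, BOOL))      # and/(Bool:Bool, Bool)
x, y = Var("x", BOOL), Var("y", BOOL)
conjunction = Lam((x, y), App(AND, (x, y)))    # the term λx, y(and(x, y))
```

In this sketch `type_of(conjunction)` is again the type (Bool:Bool, Bool), while an application of `AND` to a single argument is rejected as ill-typed.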

2.3 Conceptual Modelling with Attributes

In general, attributes are parametrized by possible worlds (elementary type w) and time moments (elementary type t). Mathematical/logical functions do not depend on w and t. For simplicity, we will assume neither possible worlds nor time moments in this paper. For example, ACTORS/(Bool:Name, Title, Role) and MOVIES/(Bool:Title, Released, Director, Genre) represent named attributes that are relations.

figure a

The two attributes in figure a are rather functional attributes. We will denote them JU and SMU, respectively.

Other conceptual constructions are propositions of type Bool. Attributes generate certain basic propositions, e.g., “Mr. Baker contributes to the journal Computer Reviews”. It is generated by the JU attribute. A conceptual schema is a tuple of attribute specifications and, possibly, a set of integrity constraints, i.e., certain propositions giving explicitly some information about attributes. An information base is a set of TRUE-propositions induced by attributes in an actual world and in a given time moment. Obviously, all known conceptual constructs used in conceptual modelling are cases of attributes. In [4] and [7] we applied this approach to XML and JSON data, respectively.

2.4 Querying with Attributes

The LT language can be used as a theoretical tool for building a functional QL. The choice of functions determines the expressive power of the QL. A query in such a language is expressed by an LT term, e.g.,

$$ \lambda {\text{ u}}^{User} ,{\text{ n}}^{Number} ({\text{n}} = {\text{COUNT}}(\lambda {\text{ m}}^{Movie} ( \exists \, {\text{s}}^{Stars}\, {\mathsf{SMU}}\left( {\text{u}} \right)\left( {{\text{s}}, {\text{ m}}} \right))) \, ) $$

of type (Bool:User, Number). The superscripts of variables denote their types. The query means “Find for each user the number of rated movies”.
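How such a term could be evaluated can be sketched in Python over a toy information base. Everything below is an assumption made for illustration: the data in `RATINGS`, the finite domains `USERS`, `MOVIES`, and `STARS`, and the reading of SMU as assigning to each user a relation over (stars, movie) pairs.

```python
# Hypothetical finite domains for the types User, Movie, and Stars.
USERS = ["ann", "bob"]
MOVIES = ["Film A", "Film B", "Film C"]
STARS = [1, 2, 3, 4, 5]

# Toy information base: for each user, the (stars, movie) pairs he/she rated.
RATINGS = {
    "ann": {(5, "Film A"), (3, "Film B")},
    "bob": {(4, "Film C")},
}


def SMU(u):
    """Assumed reading of SMU: maps a user to a characteristic function over (s, m)."""
    rated = RATINGS.get(u, set())
    return lambda s, m: (s, m) in rated


def COUNT(char_fn, domain):
    """Number of domain elements for which the characteristic function holds."""
    return sum(1 for x in domain if char_fn(x))


def query(u, n):
    """The term λu, n (n = COUNT(λm (∃s SMU(u)(s, m)))) as a Boolean function."""
    return n == COUNT(lambda m: any(SMU(u)(s, m) for s in STARS), MOVIES)


# Materialize the (Bool:User, Number) result as the set of pairs mapped to TRUE.
answer = {(u, n) for u in USERS for n in range(len(MOVIES) + 1) if query(u, n)}
```

The query result is itself a characteristic function; materializing it over the finite domains gives one (user, count) pair per user.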

A more complex example of a term uses a universal quantifier and implication:

figure b

expressing the query “Find the names of actors who play in every Spielberg film.”
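One possible formulation of this query, assuming it has the shape λn (∀t ((∃y, g MOVIES(t, y, 'Spielberg', g)) implies (∃r ACTORS(n, t, r)))), can be sketched in Python as follows. The formulation, the toy rows, and the helper names are assumptions made for illustration and need not coincide with the term in figure b.

```python
# Hypothetical toy relations; film titles, names, and roles are invented.
MOVIE_ROWS = {
    ("Film A", 1975, "Spielberg", "thriller"),
    ("Film B", 1982, "Spielberg", "sci-fi"),
    ("Film C", 1979, "Scott", "sci-fi"),
}
ACTOR_ROWS = {
    ("Ann", "Film A", "role 1"),
    ("Ann", "Film B", "role 2"),
    ("Bob", "Film A", "role 3"),
    ("Bob", "Film C", "role 4"),
}


def MOVIES(title, released, director, genre):
    """MOVIES/(Bool:Title, Released, Director, Genre) as a characteristic function."""
    return (title, released, director, genre) in MOVIE_ROWS


def ACTORS(name, title, role):
    """ACTORS/(Bool:Name, Title, Role) as a characteristic function."""
    return (name, title, role) in ACTOR_ROWS


def implies(p, q):
    """implies/(Bool:Bool, Bool)."""
    return (not p) or q


# Finite domains over which the quantifiers range (an assumption of the sketch).
TITLES = {r[0] for r in MOVIE_ROWS}
YEARS = {r[1] for r in MOVIE_ROWS}
GENRES = {r[3] for r in MOVIE_ROWS}
NAMES = {r[0] for r in ACTOR_ROWS}
ROLES = {r[2] for r in ACTOR_ROWS}


def is_spielberg_film(t):
    return any(MOVIES(t, y, "Spielberg", g) for y in YEARS for g in GENRES)


def plays_in(n, t):
    return any(ACTORS(n, t, r) for r in ROLES)


def query(n):
    """TRUE iff n plays in every Spielberg film (vacuously TRUE if there is none)."""
    return all(implies(is_spielberg_film(t), plays_in(n, t)) for t in TITLES)


answer = {n for n in NAMES if query(n)}
```

The universal quantifier becomes `all` and the existential quantifiers become `any` over the finite domains; note that with this formulation every name qualifies when the information base contains no Spielberg film.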

We gain a tool for common manipulation of relations and other typed functions. Then, the query results can be relations, nested relations, typed functions, etc. For Boolean queries, YES/NO can be a query result. It is important that there is no sharp line between conceptual and DB modelling with the functional approach. An application of the typed lambda calculus with equality is used in the approach of Hillebrand [2].

3 Multi-model Approach to Data Integration

Today, polystores and multi-model DBs are considered for DBs with multiple data stores [3]. In a polystore, the multiple storage engines are distinct and are accessed separately through their own query engines. A more user-friendly solution to heterogeneous DB integration is referred to as multi-model DBs. Typically, the relational data model can be one of the supported models [1]. A query is then executed over several data sources, but an additional layer is often used to enable data integration.

Fig. 1. GDB conceptual schema Movies and its functional version

The notion of attribute applied in graph DBs (GDBs) can be restricted to attributes of types (R:S), ((Bool:R):S), or (Bool:R, S), where R and S are entity types. These cover, respectively, binary functional types, binary multivalued functional types, and binary relationships described as binary characteristic functions. The last option corresponds to M:N relationship types. For modelling directed graphs, the first two types are sufficient, because an M:N relationship type can be expressed by two mutually “inverse” binary multivalued functional types. To express a graph conceptual schema graphically, we use two types of arrows according to the associated binary functional types (see Fig. 1).
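The claim that an M:N relationship type can be expressed by two mutually inverse multivalued functional types can be checked on a small example. The sketch below is illustrative only; the relation `RATES` over (user, movie) pairs and the helpers `split` and `rebuild` are hypothetical.

```python
from collections import defaultdict

# Hypothetical M:N relationship of type (Bool:User, Movie), given as a set of pairs.
RATES = {("ann", "Film A"), ("ann", "Film B"), ("bob", "Film A")}


def split(relation):
    """Split a binary relation into two mutually inverse multivalued functions."""
    forward, backward = defaultdict(set), defaultdict(set)
    for r, s in relation:
        forward[r].add(s)    # of type ((Bool:Movie):User): user -> set of movies
        backward[s].add(r)   # of type ((Bool:User):Movie): movie -> set of users
    return dict(forward), dict(backward)


def rebuild(multivalued):
    """Recover the (source, target) pairs from a multivalued function."""
    return {(src, tgt) for src, targets in multivalued.items() for tgt in targets}


movies_rated_by, users_rating = split(RATES)
```

Either multivalued function alone determines the relation: `rebuild(movies_rated_by)` returns the original pairs, and `rebuild(users_rating)` returns the same pairs with the components swapped.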

Properties describing entity types can be of types (S1, …, Sm:R), where Si are descriptive elementary types and R is an entity type. They are of types (S1, …, Sm, R1:R2) and ((Bool:S1, …, Sm, R1):R2) for binary functional and binary multivalued functional types, respectively. Functional querying in GDBs is described, e.g., in [5].

For relational DBs, we can assume the existence of an E-R schema describing the semantics of relations. Here we use attributes for conceptual schemas based on E-R models and a sufficiently structured approach for expressing the semantics of data in particular NoSQL DBs. The database schemas of these DBs are then described by sets of attributes, i.e., rather as local conceptual schemas (LCSs); a global schema is obtained as the union of these LCSs. Such an approach can be generalized to most NoSQL DBs [8].

In the case of NoSQL, more than one data model is often included even within one DB architecture. For example, the distributed DB Cassandra combines column-based and key-value data models, and DynamoDB combines document-oriented and key-value data models. ArangoDB also represents a multi-model approach, meaning that it can handle JSON documents, graphs, and key-value pairs. OrientDB is a multi-model DB including geospatial, graph, full-text, and key-value data models. MarkLogic can store and search JSON and XML documents and RDF triples. In [10], the gap between SQL and NoSQL is bridged via an abstraction level in which the NoSQL data are transformed into triples incorporated into an SQL DB as virtual relations.

4 Querying Multi-model Data

In the literature, we can find two basic general frameworks for unified modelling and management of multi-model data. The categorical approaches described in [3, 9] use category theory for transformations between models and are also usable for conceptual querying. Querying multi-model data by a functional approach means describing the DB structures of the particular DBs functionally by attributes. In principle, this means that LCSs are specified. Since sets (relations) are modelled as their characteristic functions, we gain a tool for common manipulation of relations and functional data from NoSQL DBs. Consequently, the query results can be relations, nested relations, or XML [4], JSON [7], and graph data [5] as well, again expressed by LT terms.

Another approach uses a global schema, similarly to the ANSI/SPARC approach. In such logical integration, the global conceptual (or mediated) schema (GCS) is entirely virtual and not materialized. The bottom-up design involves both the generation of the GCS and the mapping of individual LCSs to this GCS. In any case, schema integration is difficult because of the different structures and semantics of the local DBs. Details of the functional integration of relational DBs and GDBs are described in [6]. Data selection is performed in the source systems using SQL and Cypher. The results are mapped into data structures associated with the source query term.

Example 1:

Consider the relational attributes {ACTORS, MOVIES} from Sect. 2.3 and the GDB described in Fig. 1, i.e., the attributes {Movie, User, Journal, Rates, Submittes_to}. In the integrated DB, i.e., the multi-model system, the term in the simplified notation

figure c

expresses the query “Find for each user the genres and the number of reviews he/she made in them”. The answer will be of type ((Bool:Genre, Number):User), i.e., a new multivalued attribute assigning to each user a binary relation whose tuples contain a genre and the number of ratings created for this genre by the given user. The query term is decomposed and transformed into a query program that requires evaluation of the included attributes, e.g., by SQL and Cypher expressions, respectively. These partial results are then integrated to produce the query result.
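The decomposition into partial results and their integration can be sketched as follows. This is a minimal illustration under several assumptions, not the query program of [6]: the relational part is evaluated in an in-memory SQLite DB, the graph part is replaced by a hard-coded edge list `RATES_EDGES` standing in for the result of a Cypher query (which is not executed here), movie titles are assumed to identify movies in both stores, and all rows and helper names are hypothetical.

```python
import sqlite3
from collections import Counter

# Relational source: the MOVIES relation from Sect. 2.3 with toy rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MOVIES (Title TEXT, Released INTEGER, Director TEXT, Genre TEXT)"
)
conn.executemany(
    "INSERT INTO MOVIES VALUES (?, ?, ?, ?)",
    [
        ("Film A", 1975, "Spielberg", "thriller"),
        ("Film B", 1982, "Spielberg", "sci-fi"),
        ("Film C", 1979, "Scott", "sci-fi"),
    ],
)

# Partial result 1: title -> genre, obtained by SQL.
genre_of = dict(conn.execute("SELECT Title, Genre FROM MOVIES").fetchall())

# Partial result 2: (user, movie title) pairs of the Rates attribute.
# In a real system these would come from a Cypher query against the GDB.
RATES_EDGES = [
    ("ann", "Film A"),
    ("ann", "Film B"),
    ("ann", "Film C"),
    ("bob", "Film C"),
]


def integrate(edges, genre_of):
    """Combine both partial results into a value of type ((Bool:Genre, Number):User)."""
    counts = Counter(
        (user, genre_of[title]) for user, title in edges if title in genre_of
    )
    result = {}
    for (user, genre), n in counts.items():
        result.setdefault(user, set()).add((genre, n))
    return result


answer = integrate(RATES_EDGES, genre_of)
conn.close()
```

Each user is mapped to a binary relation of (genre, number of ratings) tuples, which corresponds to the multivalued attribute described above; ratings of movies unknown to the relational source are silently skipped in this sketch.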

5 Conclusions

In the paper, we have focused on the integration of relational and NoSQL DBs. Even a variant of the E-R model can be used without problems. Formally, we used a functional type system serving for the specification of so-called attributes. The attributes can be named with natural-language expressions, bringing database querying closer to conceptual querying. A typed lambda calculus can be used as a manipulation language.

The presented tools create a formal background for querying an integrated multi-model DB. In principle, such a language could be based on an SQL-like syntax. Other interesting research topics are the expressive power of the subsets of LT considered, the design of their user variants, and the complexity of formulating queries in such an apparatus. In general, the expressive power of a user QL depends on the choice of constant functions included in the QL. These are themes for future work.