
1 Introduction

Graph databases are focused on efficiently storing and querying highly connected data. They are a powerful tool for graph-like queries, e.g., computing the shortest path between two nodes in the graph. They achieve excellent performance for local reads by traversing the graph and can use various data models for graphs and their data extensions.

Graph databases are usually considered NoSQL databases (e.g., [12]). One rather popular definition of a graph database (GDB), also called a graph-oriented database, says that it is a database that uses graph theory to store, map, and query relationships. That is, the distinguishing characteristics of the domain include:

  • relationship-rich data,

  • relationships are first-class citizens in graph databases.

Although there are various approaches to GDB implementation, native graph processing based on so-called index-free adjacency is the most efficient means of processing data in a graph, because connected nodes use physical “pointers” to neighbouring nodes in the database.
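The idea of index-free adjacency can be sketched as follows (a minimal illustration, not taken from any particular GDBMS): each node holds direct references to its neighbours, so a 1-hop traversal follows pointers and needs no global index lookup.

```python
# Sketch of index-free adjacency: each node stores direct references
# ("pointers") to its neighbouring nodes.

class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to neighbour nodes

    def link(self, other):
        self.out.append(other)

a, b, c = Node("a"), Node("b"), Node("c")
a.link(b)
b.link(c)

# Traversal cost is proportional to the node's degree,
# independent of the total graph size.
neighbours = [n.name for n in a.out]
two_hop = [m.name for n in a.out for m in n.out]
```

The node names here are illustrative; the point is that traversal never consults a global structure.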

A GDB can contain one (big) graph or a collection of graphs. The former includes, e.g., graphs of social networks or the Web graph; the latter is used especially in scientific domains such as bioinformatics and chemistry. Graph search also occurs in other application scenarios, like recommender systems, complex software plagiarism detection, and traffic route planning. In line with similar concepts in other database technologies, we will talk about Graph Data Management Systems (GDBMS) and Graph Database Systems (GDBS).

An important part of GDB technology is querying graphs. There is always an intimate relationship between database modelling and querying. Most graph query languages directly use a structure of directed graphs or property graphs. Currently, the best-known declarative query language over property graphs is Cypher of the GDBMS Neo4j [10]. Cypher was the first pattern-matching query language to target the property graph data model. Cypher commands partially use SQL syntax and are targeted at ad hoc queries over the graph data.

Yet other approaches are possible, e.g., a functional approach. In the late 80s, there was the functional language DAPLEX [11]. The language only allowed nested applications of functions. The functional map was applied in the context-oriented semantics of multivalued functions. A number of significant works using the functional approach to data management are contained in the book [3]. In the current era of GDBMSs, we can mention Gremlin, a functional graph query language developed by Apache TinkerPop that allows expressing complex graph traversals and mutation operations over property graphs. It is supported by many GDBMSs (e.g., Titan).

Here, we will use a functional approach in which a database graph is represented by so-called attributes, i.e., typed partial functions. For this approach we use the HIT Database Model (see, e.g., [6]) as a functional alternative to the E-R model. A typed lambda calculus can then be used as a data manipulation language. This approach reflects the graph structure of a GDB and, on the other hand, provides powerful possibilities for dealing with properties, i.e., with the GDB content.

The rest of the paper is organized as follows. Section 2 introduces a graph data model based on (labelled) property graphs. Section 3 describes modelling GDBs, both on the conceptual and the database level. The notions of graph conceptual schema and graph database schema are introduced, including some integrity constraints (ICs). Section 4 briefly introduces a functional approach to GDB modelling and a version of typed lambda calculus appropriate for GDB querying. Details of these principles are explained with examples in Sect. 5. Section 6 gives the conclusion.

2 Graph Data Model

In general, database technologies are based on a database model. Here we will use a (labelled) property graph data model whose basic constructs include:

  • entities (nodes),

  • properties (attributes),

  • labels (types),

  • relationships (edges) having a direction, start node, and end node,

  • identifiers.

Entities and relationships can have any number of properties; nodes and edges can be tagged with labels. Both nodes and edges are identified by a unique identifier (Id). Properties are expressed in the key:value style. In graph-theoretic terms, we talk about labelled and directed attributed multigraphs in this case. These graphs are used both for a GDB and for its database schema (if any). An example of a GDB is in Fig. 1.

Fig. 1. Example of a GDB
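The constructs listed above can be sketched as a minimal data structure (an illustration, not a GDBMS API): nodes and edges carry a unique Id, a set of labels, and key:value properties; edges are directed, with a start node and an end node. The labels and property names are borrowed from the paper's running example; the values are illustrative.

```python
# Minimal property-graph sketch: elements carry an Id, labels,
# and key:value properties; edges add a start and an end node.

class Element:
    def __init__(self, id_, labels, **props):
        self.id, self.labels, self.props = id_, set(labels), props

class PGNode(Element):
    pass

class PGEdge(Element):
    def __init__(self, id_, labels, start, end, **props):
        super().__init__(id_, labels, **props)
        self.start, self.end = start, end  # direction: start -> end

t = PGNode(1, ["Teacher"], T_Name="John", Birth_year=1990)
l = PGNode(2, ["Language"], Name="English")
e = PGEdge(3, ["Teaches"], t, l, Day="Mon", Room="R1")
```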

3 Modelling and Querying Graph Databases

Current commercial GDBMSs need more improvements to meet the traditional definitions of conceptual and database schema known, e.g., from the relational database world. The graph database model is usually not presented explicitly but is hidden in the constructs of the data definition language (DDL) available in the given GDBMS. These languages also make it possible to specify some simple ICs. Conceptual modelling of graph databases is not used at all. An exception is the GRAD database model [5], which, although schema-less, uses conceptual constructs occurring in the E-R conceptual model and some powerful ICs. Both a graph conceptual schema and a graph database schema can provide an effective communication medium between users of a GDB. They can also significantly help GDB designers.

3.1 Graph Conceptual and Database Schemas

In [9] we proposed a binary E-R model as a variant for graph conceptual modelling, considering strong entity types, weak entity types, relationship types, attributes, identification keys, partial identification keys, ISA-hierarchies, and min-max ICs. Figure 2 uses the well-known notation for min-max ICs, with dotted lines and crow's feet at the start and end nodes of some edges. The perpendicular line denotes the identification and existence dependency of weak entity types. Subtyping (ISA-hierarchies) is simply expressed by an arrow to the entity supertype.

Fig. 2. Graph conceptual schema

A correct graph conceptual schema may be mapped into an equivalent (or nearly equivalent) graph database schema with the straightforward mapping algorithm [9], but with a weaker notion of a database schema, i.e., some inherent ICs from the conceptual level have to be neglected to satisfy the usual notation of labelled and directed attributed multigraphs. Consequently, several different graph database schemas can be proposed from one graph conceptual schema. For example, the edges Teaches and Is_born_in provide only partial information w.r.t. the associated source conceptual schema. For example, the inverted arrow Is_taught_by could be used as well. Due to the loss of the inherent ICs of the conceptual level, we should put some explicit ICs into the GDB schema, e.g., that “A teacher can teach more languages” and “A teacher is born in exactly one town”.

Figures 2 and 3 give examples of a graph conceptual schema and a graph database schema, respectively. Concerning properties, the identification key of Teacher would be #Person_ID. On the database schema level, the identification key of Street would be {Town_name, Street_name}. Details of the mapping of graph conceptual schemas to graph database schemas can be found in [9].

Fig. 3. Graph database schema

Due to the graph structure of data in a GDB, the associated explicit ICs can also have a graph form. A very simple IC is a functional dependency (FD) or a conditional functional dependency (CFD), introduced in [9]. For example, “Each teacher is born in one town” and “Teachers born later than 1994 teach at most one language” are examples of an FD and a CFD, respectively.
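Checking the CFD above can be sketched as follows (a minimal illustration; the tuple representation and the data are assumed, not taken from the paper):

```python
# Sketch of checking the CFD "teachers born later than 1994 teach
# at most one language" over (teacher, birth_year, language) tuples.

teaches = [("Anna", 1996, "English"),
           ("Anna", 1996, "French"),
           ("Petr", 1990, "English"),
           ("Petr", 1990, "German")]

def cfd_violations(rows):
    langs = {}
    for teacher, born, lang in rows:
        if born > 1994:  # the CFD's condition restricts which tuples apply
            langs.setdefault(teacher, set()).add(lang)
    return {t for t, ls in langs.items() if len(ls) > 1}

violators = cfd_violations(teaches)
```

Petr is not a violator: born in 1990, the conditional part of the dependency does not apply to him.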

On the other hand, NoSQL databases often do not require the notion of a database schema at all. Strict application of schemas is sometimes considered disadvantageous by those who develop applications for dynamic domains, e.g., domains working with user-generated content, where the data structures change very often [1]. Consequently, many GDBMSs are schema-less. OrientDB even distinguishes three roles of a graph database schema: schema-full, schema-less, and schema-hybrid.

3.2 Graph Querying

In this section we focus on the basics of graph querying (for more details see [8]). Its simplest type uses the index-free adjacency. In practice, basic queries like k-hop queries are the most frequent. Looking up a node, finding its neighbours (1-hop), scanning edges over several hops, retrieving property values, etc., belong in this category.
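A k-hop query can be sketched as a bounded breadth-first search over an adjacency list (an illustration with assumed data): it collects all nodes reachable from the start node within at most k hops, excluding the start node itself.

```python
# Sketch of a k-hop query: BFS bounded at depth k over an
# adjacency-list graph.

from collections import deque

def k_hop(adj, start, k):
    seen, frontier = {start}, deque([(start, 0)])
    reached = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:          # do not expand beyond k hops
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.add(nxt)
                frontier.append((nxt, depth + 1))
    return reached

adj = {"a": ["b"], "b": ["c"], "c": ["d"]}
```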

More complex queries are subgraph and supergraph queries. They belong to traditional queries based on exact matching. Other typical queries include breadth-first/depth-first search, path and shortest path finding, least-cost path finding, finding cliques or dense subgraphs, finding strongly connected components, etc.

Very useful are regular path queries (RPQ). RPQs have the form:

$$ \mathrm{RPQ}(x,y) := (x, R, y) $$

where R is a regular expression over the vocabulary of edge labels. RPQs provide pairs of nodes connected by a path conforming to R. With the closure of RPQs under conjunction and existential quantification, we obtain conjunctive RPQs.
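RPQ evaluation can be sketched by relational composition over the edge relation (an illustration; the nested-tuple encoding of regular expressions and the data are assumptions of this sketch): a regular expression over edge labels is evaluated to the set of node pairs (x, y) connected by a path spelling a word of the expression.

```python
# Sketch of RPQ evaluation: regular expressions are nested tuples
# ("label", l), ("concat", r, s), ("alt", r, s), ("star", r), and
# each is evaluated to a set of node pairs.

def eval_rpq(expr, edges, nodes):
    op = expr[0]
    if op == "label":
        return {(x, y) for (x, lab, y) in edges if lab == expr[1]}
    if op == "concat":  # relational composition
        r = eval_rpq(expr[1], edges, nodes)
        s = eval_rpq(expr[2], edges, nodes)
        return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}
    if op == "alt":
        return eval_rpq(expr[1], edges, nodes) | eval_rpq(expr[2], edges, nodes)
    if op == "star":    # reflexive-transitive closure, to fixpoint
        base = eval_rpq(expr[1], edges, nodes)
        result = {(n, n) for n in nodes} | base
        while True:
            step = {(x, z) for (x, y1) in result for (y2, z) in base if y1 == y2}
            if step <= result:
                return result
            result |= step

edges = [("a", "knows", "b"), ("b", "knows", "c"), ("c", "likes", "d")]
nodes = {"a", "b", "c", "d"}
# knows* . likes — all pairs connected by zero or more "knows" edges
# followed by one "likes" edge
pairs = eval_rpq(("concat", ("star", ("label", "knows")), ("label", "likes")),
                 edges, nodes)
```

Real engines evaluate RPQs with automata-based techniques rather than this naive fixpoint, but the semantics is the same.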

For example, Cypher, working with Neo4j databases, lacks some fundamental graph querying functionality, namely RPQs and graph construction. An interesting newer approach is offered in [2] by the language G-Path. G-Path is an RPQ language working on graphs that supports most useful regular expression operators (grouping, alternation, the Kleene operator, etc.). PGQL [13] is also based on the paradigm of graph pattern matching, closely follows the syntactic structures of SQL, and provides RPQs with conditions on labels and properties.

4 Database Modelling with Functions

Conceptual modelling can be based on the notion of an attribute viewed as an empirical typed function that is described by an expression of a natural language [6]. A lot of papers were devoted to this approach, studied mainly in the 1990s (see, e.g., [7]).

4.1 Types

A hierarchy of types is constructed as follows. We assume the existence of some (elementary) types S1,…,Sk (k ≥ 1). They constitute a base B. More complex types are constructed in the following way.

If S, R1,…,Rn (n ≥ 1) are types, then

  (i) (S : R1,…,Rn) is a (functional) type,

  (ii) (R1,…,Rn) is a (tuple) type.

The set of types T over B is the least set containing the types from B and those given by (i)–(ii). When the Si in B are interpreted as non-empty sets, then (S : R1,…,Rn) denotes the set of all (total or partial) functions from R1 × … × Rn into S, and (R1,…,Rn) denotes the Cartesian product R1 × … × Rn.

The elementary type Bool = {TRUE, FALSE} is also in B. The type Bool allows typing some objects as sets and relations; they are modelled as unary and n-ary characteristic functions, respectively. The notion of a set is then redundant here.

The fact that X is an object of type R ∈ T can be written as X/R, or “X is an R-object”. For each typed object o, the function type returns type(o) ∈ T. Logical connectives, quantifiers, and predicates are also typed functions: e.g., and/(Bool : Bool, Bool); the R-identity =R is a (Bool : R, R)-object; the universal R-quantifier ΠR and the existential R-quantifier ΣR are (Bool : (Bool : R))-objects. The R-singularizer IR/(R : (Bool : R)) denotes the function whose value is the only member of an R-singleton; in all other cases the application of IR is undefined. The arithmetic operations +, −, *, / are examples of (Number : Number, Number)-objects. The approach also enables typing functions of functions, etc.

4.2 Attributes

Object structures usable in building a database can be described by some expressions of a natural language. For example, “the language taught by a teacher at a school” (abbr. LTS) is a (Language : Teacher, School)-object, i.e., a (partial) function f: Teacher × School → Language, where Teacher, School, and Language are appropriate elementary types. Such functions are called attributes in [6].

More formally, attributes are functions of type ((S : T) : W), where W is the logical space (possible worlds), T contains time moments, and S ∈ T. Mw denotes the application of the attribute M to w/W; Mwt denotes the application of Mw to the time moment t. We can omit the parameters w and t in type(M). In the case of the LTS attribute we consider possible worlds where teachers teach at most one language in each school. For GDBs, we can conceive elementary entity types as sets of node IDs.
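For a fixed world and time, the LTS attribute is simply a typed partial function, which can be sketched as follows (the school and language values here are illustrative, not from the paper):

```python
# Sketch of the LTS attribute as a partial function
# Teacher x School -> Language, undefined outside its domain.

LTS = {("John", "Oxford_House"): "English",
       ("Frank", "Oxford_House"): "German"}

def lts(teacher, school):
    # None models "undefined": the function is partial
    return LTS.get((teacher, school))
```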

Attributes can be constructed according to their type in a more complicated way. For example, “the classes in a school” could be considered as an attribute CS of type ((Bool : (Bool : Student)) : School), i.e., classes are sets of students and CS returns a set of classes (of students) for a given school.

We can also consider other functions that need no possible world. For example, aggregate functions like COUNT and + (addition) are such functions. These functions have the same behaviour in all possible worlds and time moments.

Consequently, we can distinguish two categories of functions: empirical (e.g., attributes) and analytical. The former are conceived as partial functions from the logical space; the ranges of these functions are again functions. Analytical functions are of a type R that does not depend on W and T. In conceptual modelling, each base B consists of descriptive and entity types. Descriptive types (String, Number, etc.) serve as domains of properties.

The notion of attribute applied in GDBs can be restricted to attributes of types (R : S), (Bool(R) : S), or Bool(R, S), where R and S are entity types. This strategy covers binary functional types, binary multivalued functional types, and binary relationships described as binary characteristic functions. The last option corresponds to m:n relationship types. For modelling directed graphs, the first two types are sufficient, because m:n relationship types can be expressed by two “inverse” binary multivalued functional types; here we will always consider one of them.

Now we add properties. Properties describing an entity type R can be of type (S1,…,Sm : R), where the Si are descriptive elementary types and R is an entity type. So we deal with functional properties. Similarly, we can express properties of edges; they are of types (S1,…,Sm, R1 : R2) or (Bool(S1,…,Sm, R1) : R2).

Then a functional database schema describing the GDB in Fig. 1 can look as follows:

  • La/((Name, Textbook) : Language)

  • Te/((T_ID, T_Name, Birth_year) : Teacher)

  • Tw/((Town_name, Population) : Town)

  • Teaches/(Bool(Day, Hour, Room, Language) : Teacher)

  • Is_born_in/((Date, Town) : Teacher)

We remark, however, that our functional GDBs with such schemas can contain isolated nodes with at least one property. IDs of edges are not necessary, because edges are not considered explicitly.

4.3 Manipulating Functions

A manipulation language for functions is traditionally a typed lambda calculus. Our version of the typed lambda calculus uses tuple types and directly supports manipulating objects typed by T. We will suppose a collection Func of constants, each having a fixed type, and denumerably many variables of each type. Then the language of (lambda) terms LT is defined as follows:

Let the types R, S, R1,…,Rn (n ≥ 1) be elements of T.

  (1) Every variable of type R is a term of type R.

  (2) Every constant (a member of Func) of type R is a term of type R.

  (3) If M is a term of type (S : R1,…,Rn) and N1,…,Nn are terms of types R1,…,Rn, respectively, then M(N1,…,Nn) is a term of type S. /application/

  (4) If x1,…,xn are different variables of the respective types R1,…,Rn and M is a term of type S, then λx1,…,xn(M) is a term of type (S : R1,…,Rn). /lambda abstraction/

  (5) If N1,…,Nn are terms of types R1,…,Rn, respectively, then (N1,…,Nn) is a term of type (R1,…,Rn). /tuple/

  (6) If M is a term of type (R1,…,Rn), then M[1],…,M[n] are terms of the respective types R1,…,Rn. /components/

Instead of the positional notation we can use the more readable dot notation for components. Consider the Prague object (entity, node). Then instead of Tw(Prague)[2], where Prague/Town, we can write Tw(Prague).Population.
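The two component notations can be sketched with a named tuple (the population figure is illustrative; Python indexing is 0-based, whereas the paper's component notation is 1-based):

```python
# Sketch of tuple components: Tw maps a Town node to the tuple
# (Town_name, Population); components can be accessed positionally
# or, more readably, by name (dot notation).

from collections import namedtuple

TownTuple = namedtuple("TownTuple", ["Town_name", "Population"])
Tw = {"Prague": TownTuple("Prague", 1_300_000)}

positional = Tw["Prague"][1]      # Tw(Prague)[2] in the paper's 1-based notation
dotted = Tw["Prague"].Population  # Tw(Prague).Population
```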

Terms are interpreted in the standard way by means of an interpretation assigning to each function symbol from Func an object of the same type, and by a semantic mapping [ ] from LT into all functions given by T. Func influences the power of LT; it can contain the usual arithmetic and aggregation functions, etc. Of special importance is the R-identity =Bool(S), usable for comparing two sets of S-objects. Consider the Teacher-objects John and Frank. Then with

$$ \lambda l_1^{\mathrm{Language}}\, \exists d_1,h_1,r_1\; Teaches(John)(d_1,h_1,r_1,l_1^{\mathrm{Language}}) \; =_{\mathrm{Bool(Language)}} \; \lambda l_2^{\mathrm{Language}}\, \exists d_2,h_2,r_2\; Teaches(Frank)(d_2,h_2,r_2,l_2^{\mathrm{Language}}) $$

we can test whether John and Frank teach the same set of languages. Similarly to the domain relational calculus, we can simplify this expression by omitting some existential quantifiers and associated variables. The resulting expression can look like

$$ \lambda l_1^{\mathrm{Language}}\; Teaches(John)(l_1^{\mathrm{Language}}) \; =_{\mathrm{Bool(Language)}} \; \lambda l_2^{\mathrm{Language}}\; Teaches(Frank)(l_2^{\mathrm{Language}}) $$

A further simplification can be made by omitting the lambda abstraction, supposing that the information about the objects considered is in the Bool(Language)-identity. Then we can write

$$ Teaches(John) =_{\mathrm{Bool(Language)}} Teaches(Frank) $$

In other words, an application is evaluated as the application of the associated function to the given arguments; a lambda abstraction “constructs” a new function. In the conventional approach, a valuation δ is used. With this mapping we can assign objects to every variable occurring in a term.
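The set-identity test above can be sketched by modelling the characteristic functions directly as sets (the timetable data is illustrative, not from the paper):

```python
# Sketch of the Bool(Language)-identity: comparing the sets of
# languages taught by two teachers.

teaches = [("John",  "Mon", 9,  "R1", "English"),
           ("John",  "Tue", 10, "R2", "French"),
           ("Frank", "Wed", 9,  "R1", "French"),
           ("Frank", "Thu", 11, "R3", "English")]

def languages_of(teacher):
    # corresponds to λl ∃d,h,r Teaches(teacher)(d, h, r, l)
    return {l for (t, d, h, r, l) in teaches if t == teacher}

same = languages_of("John") == languages_of("Frank")
```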

For example, CS(Oxford_House)(AM_training) is TRUE if there is the class AM_training in the Oxford_House school (AM_training is a constant of type (Bool : Student)). It is true as long as AM_training contains all students of the class AM_training.

In accordance with the semantics of the quantifiers and the singularizer, we can simply write ∀x(M) instead of Π(λx(M)). Similarly, ∃x(M) replaces Σ(λx(M)). Finally, we write I(λx(M)) shortly as onlyone x(M), read “the only x such that M”. Certainly, M/Bool.

Similarly, aggregation functions such as COUNT/(Number : (Bool : S)), SUM/(Number : (Bool : Number)), etc., can be defined with the usual meaning.

The deductive power of LT is given by β-reduction. The main rule of the system is called the β-rule:

$$ \lambda x_1,\ldots,x_n(M)(N) = M[x_i \leftarrow N[i]] $$

where the right side of the equality represents the substitution of N[i] for xi in M.

From the database point of view, we thus have at our disposal a powerful declarative language for formulating schema transformations, ICs, etc., and even for querying GDBs, as we shall see in Sect. 5.

5 Querying with Functions

The LT language can be used as a theoretical tool for building a functional database language. A query in such a language is expressed by a term of LT, e.g.,

$$ \lambda t,n\,(n = \mathrm{COUNT}(\lambda l\,(\exists d,h,r\; Teaches(t)(d,h,r,l)))) $$

with t, d, h, r, and l of types Teacher, Day, Hour, Room, and Language, respectively.

Using a syntactic abbreviation similar to the one used in Sect. 4.3, the resulting lambda expression

$$ \lambda t^{\mathrm{Teacher}},n^{\mathrm{Number}}\,(n^{\mathrm{Number}} = \mathrm{COUNT}(\lambda l\,(Teaches(t^{\mathrm{Teacher}})(l^{\mathrm{Language}})))) $$

denotes the query “Give the set of pairs associating to each teacher the number of languages taught by him/her”. Obviously, the query could also be reformulated as λt ∃n(…), or more conventionally λt onlyone n(…).
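This counting query can be sketched directly (illustrative data; COUNT is applied to the language set abstracted by λl):

```python
# Sketch of the query "for each teacher, the number of languages
# he/she teaches", over (teacher, language) pairs.

teaches = [("John", "English"), ("John", "French"), ("Frank", "German")]

result = {(t, len({l for (t2, l) in teaches if t2 == t}))
          for (t, _) in teaches}
```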

To consider only teachers born in Prague, the lambda expression might look like

$$ \lambda t^{\mathrm{Teacher}},n^{\mathrm{Number}}\,(n^{\mathrm{Number}} = \mathrm{COUNT}(\lambda l\,(Teaches(t^{\mathrm{Teacher}})(l^{\mathrm{Language}})))\ \mathbf{and}\ Is\_born\_in(t^{\mathrm{Teacher}}).Town\_name = \mathrm{Prague}) $$

Remark: in our conceptual framework, we could conceive each query as an attribute, i.e., a lambda expression dependent on possible worlds and time moments. In practice, we omit λwt… in its head. Clearly, the resulting lambda expressions can express more complex graph structures than k-hop queries, i.e., they contribute to the construction of new graphs from the original GDB.

In querying GDBs, RPQs are of particular importance. We will consider expressions

$$ \lambda x,y\; Reg(x,y) $$

where Reg is an expression simulating concatenation and closure. There are two styles of constructing Reg. First, consider the attribute Friends_Of_Friend/((Bool : Person) : Person), abbreviated as FOF. Then

$$ FOF^{*}(p_1, p_k) $$

will denote the expression FOF(p1)(p2) and … and FOF(pk−1)(pk), for some k (k > 1), and

$$ \lambda x,y\; FOF^{*}(x,y) $$

provides the set of pairs (p1, pk) such that there is a directed path from p1 to pk along FOF edges.

Now we consider a single-valued function, e.g., Manager_of/(Person : Person), abbreviated as MO. MO*(p) will denote the expression MO(…(MO(p))…) with k applications of MO, for some k (k > 1), and

$$ \lambda x,y\,(MO^{*}(x) = y) $$

provides the set of pairs (p1, pk) such that there is a directed path from p1 to pk along MO edges.
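For the single-valued case, the closure MO* can be sketched by simply iterating the function (illustrative data; the management chain is assumed to be acyclic):

```python
# Sketch of the closure of the single-valued function Manager_of (MO):
# iterating MO from a person yields all pairs (p, q) such that there is
# a directed MO-path from p to q.

MO = {"alice": "bob", "bob": "carol"}

def mo_star(person):
    reached, current = [], person
    while current in MO:            # follow the chain until MO is undefined
        current = MO[current]
        reached.append(current)
    return reached

pairs = {(p, q) for p in MO for q in mo_star(p)}
```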

We remind the reader that LT is not computationally complete. However, its computational power can be increased by adding new built-in functions into Func.

6 Conclusions

The objective of this paper was to provide an alternative approach to GDB querying based on a functional approach. In comparison with other graph query languages, the functional language designed here is based on the notion of a graph database schema using the notion of attribute.

All the techniques associated with a GDBMS and supported in a graph search engine should fulfil the so-called FAE rule [4]. The FAE rule says that the quality of search engines includes three key factors: Friendliness, Accuracy, and Efficiency, i.e., a good search engine must provide users with a friendly query interface and highly accurate answers in a fast way. Clearly, the friendliness of our functional language is still missing. This is the main challenge for future work.