Keywords

1 Introduction

The analysis of database schema elements such as tables, views, triggers, user-defined functions and constraints provide valuable information for database users for understanding, maintaining and managing a database application and its evolution. In the literature, schema analysis has been investigated for improving the quality of SQL/program code or detecting program errors [3] and for determining the consequences of schema changes [10], versioning [8], or matching [11]. In addition, the analysis of schema objects plays an important role for tuning resp. refactoring database applications [2]. All these approaches rely on exploring dependencies between schema objects and an in-depth analysis of their components and interactions. A comprehensive and flexible analysis of schema elements, however, is not provided as these approaches are typically restricted to some subparts of a given schema.

The same is true for analysis features provided by commercial systems where approaches such as integrity checking, executing referential actions or query change notification (as provided by Oracle) already use schema object dependencies but in an implicit and nontransparent way, only. That is, no access to the underlying meta-data is provided to the user nor can be freely analyzed by means of user-defined queries. Even the meta-data about tables and SQL views which are sometimes provided by system tables cover only certain information of the respective schema elements. In this paper, we propose a uniform approach for analyzing schema elements in a comprehensive way. To this end, the schema objects are compiled and their meta-data is stored into a Datalog program which employs queries for deriving interesting properties of the schema. This way, indirect dependencies between tables, views and user-defined functions (UDFs) can be determined which is important for understanding follow-up changes. In order to show the expressiveness of the proposed analysis, our rule-based approach is applied to the problem of deducing functional dependencies (FDs) for derived relations, i.e., views, based on FDs defined for base relations. This so-called FD propagation or FD-FD implication problem has been studied since the 80s [7, 9, 12] and has applications in data exchange [6], data integration [4], data cleaning [7], data transformations [5], and semantic query optimization. We show that our rule-based approach to schema analysis is well-suited for realizing all known techniques for FD propagation indicating the expressiveness of the proposed analysis. In particular, our contributions are as follows:

  • We propose an approach for analyzing the properties of views, tables, trigger and functions in a uniform way.

  • Our declarative approach can be easily extended for refining the analysis by user-defined queries.

  • The employed Datalog solution can be simply transfered into SQL systems.

  • In order to show the expressiveness of our approach, the implication problem for functional dependencies is investigated using our approach.

2 Rule-Based Schema Analysis

A database schema describes the structure of the data stored in a database system but also contains views, triggers, integrity constraints and user defined functions for data analysis. Functions and these different rule types, namely deductive, active and normative rules, are typically defined with various interdependencies. For example, views are defined with respect to base relations and/or some other views inducing a hierarchy of derived queries. In particular, the expression CREATE VIEW q AS SELECT ... FROM p\(_1\), p\(_2\), ..., p\(_n\) leads to the set {p\(_1\) \(\rightarrow \) q, ..., p\(_n\) \(\rightarrow \) q} of direct dependencies between the derived relation q and derived or a base relations \(p_i\). which are typically represented by means of a predicate dependency graph for analyzing indirect dependencies via the transitive closure, too. This allows for understanding the consequences of changes made to the instances of the given database schema (referred to as update propagation in the literature) or to its structure. This is important when a database user wants to know all view definitions potentially affected by these changes. Various dependencies can occur in a database schema such as table-to-table dependencies induced by triggers or view-to-table dependencies which can be induced by functions. The analysis of such dependencies can be further refined by structural details (e.g., negative vs. positive dependencies as needed for update propagation) as well as by considering the syntactical components of schema objects such as column names (attributes) or operator types (sum, avg, insert, delete, etc.). To this end, the definitions of schema objects need to be parsed and the obtained tokens stored as queryable facts. This kind of analysis is well-known from meta-programming in Prolog which led to the famous vanilla interpreter. For readability reasons we use Datalog with facts such as base(R,A) (base relation R with arity A), derived(V,A) (view V with arity A), dep(To,From) (dependency between relations), call(V,I,O,F) (input I and output O of function F in view V), attr(R,P,N) (position P of attribute named N in relation R) for representing meta-information about a given view or user-defined function. Based on these facts, schema analysis can be realized by queries like

figure a

for determining reused attribute names, views calling a function, base tables possibly changed by function f\(_1\), and cyclic dependencies between two base tables through a function. This way, many interesting properties of schema elements can be systematically determined which supports users in understanding the interrelationships of schema elements. Most database systems already allow for storing and querying meta-data about schema elements in a simple way but a comprehensive (and in particular user-driven) analysis like this is still missing.

3 Functional Dependency Propagation

In order to show the expressiveness of our approach, we investigate the possibility to compute induced FDs for derived relations using the deductive rules introduced above. FDs form special constraints which are assumed to hold for any possible valid database instance. The FD propagation problem is undecidable in the general setting for arbitrary relational expressions [9]. Even restricted to SC views, i.e., relational expressions allowing selection and cross product only, the propagation problem turns out to be coNP-complete (for an in-depth discussion on complexity see [7]). In favor of addressing the general setting, we drop the ambition of achieving completeness by considering a special case, only. Instead, we allow for arbitrary expressions over all relational operators, multiple propagation steps and possibly finite domainsFootnote 1 in order to cover the majority of practical cases.

3.1 Preliminaries

A functional dependency \(\alpha = \{A_1, \ldots , A_n\} \rightarrow \) B states that the attribute values of \(\alpha \) determine those of B. The restriction to univariate right sides can be done without loss of generality as well as the representation of FDs satisfying \(B\,\notin \,\alpha \), only.Footnote 2 We allow \(\alpha = \emptyset \) which means that the attribute values of B are constant. For our FD propagation rules, we employ a Datalog variant with special data types for finite, one-leveled sets and finite, possibly nested lists. In our approach we use the extended transitivity axiom

$$\begin{aligned} \alpha \rightarrow B, \ \upgamma \rightarrow D, \ B \in \upgamma , \ D \notin \alpha \ \Rightarrow \ \alpha \cup (\upgamma - B) \rightarrow D \end{aligned}$$
(1)

to derive transitive FDs. Note that if \(B \notin \alpha \) and \(D \notin \upgamma \), then the derived FD also satisfies \(D \notin \alpha \cup (\upgamma - B)\).

Rule Normalization. For our systematic FD propagation approach, we assume the Datalog rules defining views to be in a normal form, where each rule corresponds to exactly one of the relational operators \(\pi \), \(\pi '\), \(\sigma \), \(\times \), \(\cup \), \(\cap \), −, or \(\bowtie \).Footnote 3 Any set of Datalog rules can be transformed into an equivalent set of normalized rules while preserving important properties like being stratifiable [1].

3.2 Representation of FDs and Normalized Rules

We assume that functional dependencies for EDB predicates are given in a relation \(\mathtt{edb\_fd({\mathsf{p}},{\alpha },B,ID)}\). Here \({\alpha }\) and B are (sets of) column numbers of the relation \({\mathsf{p}}\). The fact represents the functional dependency \(\alpha {\rightarrow } B\) for the relation p. The ID is of type list and used to identify the dependency in later steps, e.g., in case of union. The derived functional dependencies will be represented in the same way in an IDB predicate \(\mathtt{fd({\mathsf{p}},{\alpha },B,ID')}\). Here ID’ is related to the dependency’s ID where the FD is derived from for propagated FDs or to a newly created ID for FDs that arise during the propagation process.

As in normal form every rule corresponds to exactly one operator, we can refine the above defined dependency relation dep/2 to rel/3 by adding the respective operator. A fact rel(p,q,op) indicates that a relation p depends (positively) on q via an operator op which is one of ’projection’, ’extension’, ’selection’ ’product’, ’join’, ’negation’, ’intersection’, and ’union’. We further introduce an EDB predicate pos(head,body,pos_head,pos_body) for storing information on how the positions of non position preserving operators (cf. Table 1) transform from rule body to head (as FDs are represented via column numbers). Remembering that each relation is defined via one operator only and that we exclude self joins for simplicity (cf. Sect. 3.1), the above defined relation pos/4 is non-ambiguous. Finally, we have two additional EDB predicates eq(pred,pos1,pos2) and const(pred,pos,val) for information on equality conditions (e.g., X \(=\) Y or X \(=\) const) in extension and selection rules.

Table 1. Properties of FD propagation categorized by operator

3.3 Propagation Rules

In this section, we present three different types of propagation rules for (a) propagating FDs to the next step, (b) introducing additional FDs arising from equality constraints, and (c) calculating transitive FDs.

Example 1

Consider the following rule set given in normal form together with two FDs \(\mathtt{fd(s,\{1\},2,ID_1)}\) and \(\mathtt{fd(t,\{1,2\},3,ID_2)}\) for the base relations s and t:

figure b

Omitting IDs, we obtain the following propagation process: First, both FDs are propagated to r resulting in fd(r,{1},2,-) and fd(r,{2,3},4,-) (with the appropriate column renaming for the latter FD). By transitivity we have fd(r,{1,3},4,-) as a combination of the two. All three FDs are propagated to q together with fd(q, \(\emptyset \) ,3,-) resulting from the equality constraint Y = 2. Applying transitivity results in three more FDs for q, but only fd(q,{1},4,-) is propagated further to p as fd(p,{1},2,-). The complete list of propagated FDs including IDs is given in Example 2.

Table 1 summarizes the properties of how FDs are propagated via the different relational operators which form the basis for the propagation rules. In most cases, the FDs are propagated as they are (with adjustments on the positions for \(\pi \), \(\times \), and \(\bowtie \)). If there is a single rule defining a derived relation, the source FDs transform to FDs for the new relation (restricted to the attributes in use). Union forms an exception where even common FDs are only propagated in special cases (cf. Sect. 3.4). For extensions \(\pi '\) and selections \(\sigma \) where additional FDs can occur due to equality conditions as well as for joins \(\bowtie \) transitive FDs may appear so that taking the transitive closure becomes necessary. In cases where the number of tuples is reduced (i.e., \(\sigma \), \(\cap \), \(\bowtie \), and −) it is possible that new FDs appear as there are less tuples for which the FD constraint must be satisfied. The different propagation rules for all relational operators except union are specified in the following.

(a) Induced FDs. For direct propagation of FDs from one level to the next, we distinguish between position preserving and non position preserving operators. In the first case FDs can be directly propagated (2), whereas in the latter adjustments on the column numbers are necessary (3). The EDB predicates pos_pres and non_pos_pres comprise the respective operators as listed in Fig. 1.

$$\begin{aligned} \mathtt{fd(P,\alpha ,B,\texttt {-}) \leftarrow fd(Q,\alpha ,B,\texttt {-}), rel(P,Q,op), pos\_{pres(op)}.} \end{aligned}$$
(2)
$$\begin{aligned} \mathtt{fd(P,\{X_1,\ldots ,X_n\},Y,\texttt {-}) \leftarrow fd(Q,\{A_1,\ldots ,A_n\},B,\texttt {-}),} \end{aligned}$$
(3)
$$\begin{aligned}&\mathtt{pos(P,Q,X_1,A_1),\ldots , pos(P,Q,X_n,A_n), pos(P,Q,Y,B),}\\&\mathtt{rel(P,Q,op), non\_pos\_pres(op).} \end{aligned}$$
Fig. 1.
figure 1

Position preserving (left) and non position preserving (middle) operators, and relations where transitive FDs may occur (right).

(b) Additional FDs. For any equality constraint \(X = Y\) we can deduce the dependencies \(X \rightarrow Y\) and \(Y \rightarrow X\). Similar, a constant constraint \(X = c\) induces the dependency \(\emptyset \rightarrow X\). That is for any fact eq(R,pos1,pos2) and const(R,pos,val) respectively we derive the following FDs:

$$\begin{aligned} \mathtt{fd(R,pos1,pos2,\texttt {-}) \leftarrow eq(R,pos1,pos2).} \end{aligned}$$
(4)
$$\begin{aligned} \mathtt{fd(R,pos2,pos1,\texttt {-}) \leftarrow eq(R,pos1,pos2).} \end{aligned}$$
(5)
$$\begin{aligned} \mathtt{fd(R,\emptyset ,pos,\texttt {-}) \leftarrow const(R,pos,val).} \end{aligned}$$
(6)

(c) Transitive FDs. Since transitive FDs can only occur for certain operators it is sufficient to deduce them for those cases, only (cf. Table 1):

$$\begin{aligned} \mathtt{fd(P,\varepsilon ,D,\texttt {-}) \leftarrow }~&\mathtt{fd(P,\alpha ,B,\texttt {-}), fd(X,\upgamma ,D,\texttt {-}),} \\&\mathtt{B \in \upgamma , D \notin \alpha , \varepsilon = \alpha \cup (\upgamma -\{B\}), trans(P)}. \nonumber \end{aligned}$$
(7)
$$\begin{aligned} \mathtt{fd(P,X,Y,\texttt {-}) \leftarrow fd(P,\alpha ,X,ID), fd(P,\alpha ,Y,ID), trans(P).} \end{aligned}$$
(8)

The first rule implements the extended transitivity axiom (1) and the second equates the right sides of two identical FDs (identified by matching IDs). The IDB predicate trans/1 comprises all relations where transitive FDs may occur.

3.4 Union

In case of union \(p = p_1\,\cup \,p_2\) even common FDs are only propagated in special cases. Consider the following example of post codes. In each country, the post code uniquely identify the city associated with it. But the same post code can be used in different countries for different cities. So although we have the FD post code \(\rightarrow \) city in the relations german_post_codes and us_post_codes, it is not a valid FD in the union of both. A common FD of \(p_1\) and \(p_2\) is only propagated to p if the domains of the FD are disjoint, or if they match on common instances. The first case can only be handled safely on schema level if constants are involved. The latter is the case if the FDs have the same origin and are propagated in a similar way. Whether two FDs have the same origin can be easily checked (e.g. using path from Sect. 2). This criteria is not yet enough as the FDs might have been manipulated during the propagation process (e.g., changes in the ordering, equality constraints, etc.). We employ identifiers to track changes made to certain FDs using a list structure that adopts the tree structure of [9] who represents FDs as trees with source domains as leaves and the target domain as the tree’s root. As the target is already handled in the FD itself, we keep track of the source domains and transitively composed FDs, only.

At the beginning, each base FD \(\alpha \rightarrow B\) gets a unique identifier ID\(_i\). The idea is to propagate this ID together with the FD and to keep track of the modifications made to the FD. For this purpose we attach an ordered tuple, a (possibly nested) list, to the ID, i.e., ID\(_i[A_1,\ldots A_n]\) for \(\alpha = \{A_1,\ldots A_n\}\). For the position preserving operators (that in particular do not change the FD’s structure) the ID is identically propagated in (2). For the non position preserving operators the positions are updated (using a UDF) similarly to the position adjustments of the FD itself in (3). The difference is that the ID maintains an ordering and the cardinality stays invariant. For constant constraints, we set the constant value as ID in (6), equality constraints in (4), (5) and (8) get the (column number of the) left side as ID. In (7) we replace the occurrences of the column number B in the ID of \(\mathtt{fd(X,\upgamma ,D,\texttt {-})}\) by the ID of \(\mathtt{fd(X,\alpha ,B,\texttt {-})}\).

Example 2

figure c

A common ID implies that the same modifications have been made to a common base FD. This means that the FD is preserved in the case of union:

(9)

4 Conclusion

In Sect. 3.3 we introduced our propagation rules for propagating functional dependencies. To compute the set of propagated FDs these rules are simultaneously applied to the input Datalog program in normal form. The rules are based on the observations in Table 1 which can be easily verified. The propagated functional dependencies of our approach are not complete as the problem is undecidable in general. Also limited to a less expressive subset of the relational operators (e.g., restricted operator order SPC views) one has to assume the absence of finite domains to achieve completeness. Nevertheless, we are able to deal with many cases appearing in real world applications. Our FD propagation approach can be flexible extended to allow for user-defined functions in the extension operator and even recursion can be covered with some modifications.

In [9] the FD implication problem was addressed first. We provided a full declarative approach covering most cases stated in this work. In addition, we are able to cover linear recursion in a similar way as proposed by [12]. Other related approaches like the work of [7] for conditional FDs can be incorporated into our approach, too. Besides these rule-based approaches, a detailed comparison with the chase, an established algorithm for FD implication, is object for future research.