
1 Introduction

Motivation. Domain, entity, and referential integrity form the three inherent integrity constraints in relational database systems [5]. Entity integrity is enforced by primary keys to ensure that entities of an application domain are uniquely represented within a database [12]. Referential integrity is enforced by foreign keys to ensure that the intrinsic links between entities of an application domain are correctly represented within a database [12]. In the TPC-C database, the Order table contains the foreign key o_c_id, o_d_id, o_w_id, which references the primary key c_id, c_d_id, c_w_id of the Customer table. The primary key uniquely identifies a customer c_id within a district c_d_id of a warehouse c_w_id. The foreign key references the unique customer in a district of a warehouse who placed the order. Primary and foreign keys are fundamental building blocks of database systems and are indispensable for most database applications [5, 12].

In practice, designers fail to specify primary and foreign keys for various reasons. For example, they may not comprehend the intrinsic links between data in the application domain well enough, it may be infeasible to enforce integrity due to inconsistencies arising from data integration or database evolution over time, or integrity enforcement would inhibit data acquisition [17]. Enterprise-level databases frequently contain hundreds of tables and thousands of columns, and are often insufficiently documented. In such situations it is difficult to identify foreign keys. As a consequence, the discovery of primary and foreign keys is an important yet challenging core activity of data profiling [13]. This is due to the large number of candidates and the fact that the given data source may not even satisfy the meaningful primary and foreign keys, since these are not always enforced. Previous work on profiling primary and foreign keys has exclusively focused on purely relational databases, in which null markers are treated naively as any other domain value [17]. In the real world, null markers occur frequently and the SQL standard recommends a designated treatment for them.

Goal. This background motivates our objective to establish algorithms that efficiently profile unary and multi-column foreign keys from dirty and incomplete data. Efficiency refers to the following factors: (i) there should be a dedicated algorithm for each of the semantics for foreign keys proposed by the SQL standard, i.e. simple, full, and partial; (ii) the precision and recall of the algorithms should be similar to those of the state of the art algorithm for the idealized special case in which null markers are treated naively; and (iii) the profiling time of our algorithms should be similar to that of the state of the art algorithm.

Key Idea. Let X denote the sequence of distinct foreign key columns on table \(R_1\) and Y the equal-length sequence of distinct primary key columns on \(R_2\). Then we write \(R_1[X]\subseteq R_2[Y]\) to denote the foreign key on \(R_1\). For example, Order[o_c_id, o_d_id, o_w_id] \(\subseteq \) Customer[c_id, c_d_id, c_w_id] denotes the foreign key on Order from the TPC-C benchmark. For the projection r[Z] of a table instance r over table R onto the columns in sequence Z over R, the inclusion coefficient between \(r_1[X]\) and \(r_2[Y]\) has been defined as \(\sigma (r_1[X],r_2[Y]) = |r_1[X] \cap r_2[Y]|/|r_1[X]|\), where |S| denotes the number of elements in S. Intuitively, the inclusion coefficient \(\sigma (r_1[X],r_2[Y])\) measures the proportion of values in \(r_1[X]\) that are also in \(r_2[Y]\). The higher \(\sigma (r_1[X],r_2[Y])\), the more likely it is that \(R_1[X]\subseteq R_2[Y]\) forms a true foreign key. However, the inclusion coefficient has only been defined for relations in which no null markers occur [14, 17]. The SQL standard recommends three different options for the semantics of foreign keys, which can be declared in the Match clause as simple, full, and partial [12]. The semantics differ in how they treat tuples with occurrences of Null in foreign key columns. Simple semantics regards such tuples as non-offensive, while full semantics regards them as offensive. Finally, partial semantics regards tuples \(t\in r_1[A_1,\ldots ,A_n]\) as offensive only if there is no tuple \(t'\in r_2[B_1,\ldots ,B_n]\) that subsumes t, i.e., where \(t[A_i]=t'[B_i]\) on all attributes \(A_i\) where \(t[A_i]\not =\textsc {Null}\). We say that a tuple t is X-total, for a sequence X of distinct attributes, if \(t[A]\not =\textsc {Null}\) on all \(A\in X\). Furthermore, we call the tuple t over table R total if it is R-total, and a relation r over R total if all tuples in r are total. We thus obtain the notions of simple \(\sigma _s\), full \(\sigma _f\), and partial \(\sigma _p\) inclusion coefficients by defining

  • \(r_1[X] \cap _{s} r_2[Y] := \{t \in r_1[X]\mid \text { if}\,t\,\text {is}\,X\text {-total, then}\,t \in r_2[Y]\}\),

  • \(r_1[X] \cap _{f} r_2[Y] := \{t \in r_1[X]\mid t\,\text {is}\,X\text {-total and}\,t\in r_2[Y]\}\), and

  • \(r_1[X] \cap _{p} r_2[Y] := \{t \in r_1[X]\mid \exists t'\in r_2[Y] \text { such that}\,t \text { is subsumed by}\,t'\}\),

and by setting \(\sigma _O(r_1[X],r_2[Y]) := |r_1[X] \cap _O r_2[Y]|/|r_1[X]|\) for each \(O\in \{s,f,p\}\).

Indeed, when \(r_1[X]\) and \(r_2[Y]\) are total relations, the simple \(\sigma _s\), full \(\sigma _f\), and partial \(\sigma _p\) inclusion coefficients all coincide with the standard inclusion coefficient.
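To make the definitions concrete, the following is a minimal Python sketch of the three coefficients; the encoding of Null as None and the sample projections are assumptions made purely for illustration, not data from the paper.

```python
def is_total(t):
    """A tuple is total if it carries no null marker (None) in any position."""
    return all(v is not None for v in t)

def subsumed_by(t, t2):
    """t is subsumed by t2 if they agree wherever t is non-null."""
    return all(a is None or a == b for a, b in zip(t, t2))

def coefficients(r1_x, r2_y):
    """Return the (simple, full, partial) inclusion coefficients of r1[X] in r2[Y];
    assumes a non-empty projection r1[X]."""
    r1_x, r2_y = set(r1_x), set(r2_y)
    s = {t for t in r1_x if not is_total(t) or t in r2_y}
    f = {t for t in r1_x if is_total(t) and t in r2_y}
    p = {t for t in r1_x if any(subsumed_by(t, t2) for t2 in r2_y)}
    n = len(r1_x)
    return len(s) / n, len(f) / n, len(p) / n

# Hypothetical projections: two total matches, one dangling total tuple,
# one incomplete tuple that is subsumed, and one that is not.
fk = [(1, 2), (3, 3), (9, 9), (None, 2), (None, 7)]
pk = [(1, 2), (3, 3), (4, 2)]
print(coefficients(fk, pk))  # (0.8, 0.4, 0.6)
```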

Table 1. TPC-C foreign key and its different inclusion coefficients

Example 1

For illustration, consider the foreign key FK Order[o_c_id, o_d_id, o_w_id] \(\subseteq \) Customer[c_id, c_d_id, c_w_id] of the TPC-C schema. Table 1 shows sample data that violates the foreign key under all three semantics, since the fifth tuple, which is total, is not present in Customer. An intriguing question is whether FK can still be discovered as a good foreign key candidate. For total relations this question has been answered affirmatively [17], but our data features Null.

Indeed, \(\cap _f\) consists of only four tuples since full semantics regards incomplete tuples as offensive, \(\cap _s\) consists of nine tuples since simple semantics regards incomplete tuples as non-offensive, and \(\cap _p\) consists of eight tuples since the fourth and fifth Order tuples are not subsumed by any Customer tuple. \(\square \)

Example 1 illustrates that the three null marker semantics, proposed by the SQL standard, require different techniques to profile foreign keys.

Contributions. Firstly, we demonstrate that current methods for foreign key profiling over complete data are not suitable for incomplete data. These methods already exhibit poor precision and recall when only a few null markers occur. Secondly, based on the recommended semantics of foreign keys by the SQL standard, we propose three notions of inclusion coefficients. Each notion leads to a different algorithm for profiling SQL foreign keys in dirty and incomplete data. We focus on foreign keys that reference a primary key, the most common case in practice. Thirdly, experiments with benchmark data demonstrate that our algorithms for simple and partial semantics perform as well for dirty and incomplete data as the current state of the art algorithm performs in the idealized special case where incomplete data is absent. Our results for precision and recall are robust under different rates of incompleteness. The penalty for sustaining strong profiling results for dirty and incomplete data is only marginal when comparing our profiling time to that of the state of the art for complete data.

Organization. In Sect. 2 we recall the three semantics the SQL standard proposes for foreign keys. Our profiling algorithms are proposed in Sect. 3. The algorithms are evaluated in Sect. 4. Their combination with schema-driven techniques is evaluated in Sect. 5. Related work is discussed in Sect. 6 before we conclude and comment on future work in Sect. 7.

2 Referential Integrity and the SQL Standard

Inclusion dependencies, and foreign keys in particular, are referential integrity constraints that enforce the intrinsic links between data items of an application domain within a database system [8, 11]. An inclusion dependency (IND) over a database schema R is an expression of the form \(R_1[X] \subseteq R_2[Y]\), where \(R_1\), \(R_2\) are relation schemata of R, and X and Y are equal-length sequences of distinct columns from \(R_1\) and \(R_2\), respectively. For \(X = f_1,\ldots ,f_n\) and \(Y = k_1,\ldots ,k_n\), the inclusion dependency holds in the given database over R if for every tuple t in the relation over \(R_1\) there is some tuple \(t'\) in the relation over \(R_2\) such that \(t[f_i] = t'[k_i]\) for \(1 \le i \le n \). An inclusion dependency \(R_1[X] \subseteq R_2[Y]\) is a foreign key if Y forms a key of \(R_2\) [11].

Match Options to Handle Nulls. Occurrences of the null marker Null are an exception to the semantics of foreign keys. If some foreign key column features an occurrence of Null, then the semantics of a foreign key is defined by the Match clause of the SQL standard [12]. The available options are full, simple, and partial, each proposing a different way to handle occurrences of Null.

Under full semantics, the foreign key \(R_1[f_1,\ldots ,f_n] \subseteq R_2[k_1,\ldots ,k_n]\) is satisfied if every tuple \(t\in r_1\) over \(R_1\) is \(\{f_1,\ldots ,f_n\}\)-total and there is some tuple \(t'\in r_2\) over \(R_2\) such that \(t[f_i]=t'[k_i]\) for all \(i=1,\ldots ,n\). Hence, full semantics does not permit any occurrences of Null on any of the foreign key columns.

Under simple semantics, the foreign key \(R_1[f_1,\ldots ,f_n] \subseteq R_2[k_1,\ldots ,k_n]\) is satisfied if for every tuple \(t\in r_1\) over \(R_1\), either \(t[f_i]=\textsc {Null}\) for some \(i\in \{1,\ldots ,n\}\), or there is some tuple \(t'\in r_2\) over \(R_2\) such that \(t[f_i]=t'[k_i]\) for all \(i=1,\ldots ,n\). Hence, simple referential integrity is never violated by tuples that are partially defined on the foreign key columns.

Under partial semantics, the foreign key \(R_1[f_1,\ldots ,f_n] \subseteq R_2[k_1,\ldots ,k_n]\) is satisfied if for every tuple \(t\in r_1\) over \(R_1\) there is some tuple \(t'\in r_2\) over \(R_2\) such that \(t[f_1,\ldots ,f_n]\) is subsumed by \(t'[k_1,\ldots ,k_n]\). Hence, partial referential integrity may also be violated by tuples that are partially defined on the foreign key columns. It can be understood as a compromise that balances the pessimistic full semantics and the optimistic simple semantics.
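The three options differ only in which tuples count as offensive. The following sketch, a hypothetical Python rendering rather than the paper's implementation, checks satisfaction of a foreign key under each Match option on projected instances, again encoding Null as None:

```python
def subsumed_by(t, k):
    # t is subsumed by k if both agree wherever t is non-null
    return all(a is None or a == b for a, b in zip(t, k))

def satisfies(fk_tuples, key_tuples, match):
    """Check R1[f1..fn] <= R2[k1..kn] on projected instances under a Match option."""
    keys = set(key_tuples)
    if match == "full":     # offensive: any null marker, or no matching key tuple
        return all(None not in t and t in keys for t in fk_tuples)
    if match == "simple":   # offensive only if total and unmatched
        return all(None in t or t in keys for t in fk_tuples)
    if match == "partial":  # offensive if no key tuple subsumes it
        return all(any(subsumed_by(t, k) for k in keys) for t in fk_tuples)
    raise ValueError("match must be 'full', 'simple', or 'partial'")
```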

In Table 1, only tuples 1, 3, 6, and 8 of the Order table satisfy full semantics, while all tuples except for tuple 5 satisfy simple semantics, and all tuples except for tuples 4 and 5 satisfy partial semantics.

3 Profiling Foreign Keys from Dirty and Incomplete Data

We propose two-phase algorithms for profiling full, simple, and partial foreign keys from dirty and incomplete data. In phase one, we apply full, simple, and partial inclusion coefficients to identify good candidates for foreign keys. In phase two, a randomness test validates whether the distribution of the foreign key values is similar to that of the key values.

Inclusion Coefficients for Incomplete Data. The first challenge is to find reasonable candidates from data that may violate meaningful foreign keys. Similar to [2, 6, 17], a foreign key is a candidate when its inclusion coefficient meets a user-defined threshold \(\theta \). We introduce three different inclusion coefficients, resulting from the SQL standard semantics for null markers.

Example 2

When \(\theta =0.8\), up to 20 % of the values in the given data may be inconsistent. The simple and partial coefficients of the foreign key FK from Example 1 meet \(\theta \), but the full inclusion coefficient does not. \(\square \)

Exact computations of inclusion coefficients are infeasible for large data sets, so we approximate them by bottom-k sketches [17]. A bottom-k sketch consists of those values of a given data set that have the k smallest ranks assigned to them by a hash function. For a bottom-k sketch of a multi-column key, the hash function is applied to the concatenation of the values on each column [17].
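The construction can be pictured as follows; the concrete hash function (MD5) and the value separator are assumptions chosen for illustration:

```python
import hashlib

def bottom_k(tuples, k):
    """Keep the k distinct tuples whose hashed concatenated values rank smallest."""
    def rank(t):
        concat = "|".join("NULL" if v is None else str(v) for v in t)
        return hashlib.md5(concat.encode("utf-8")).hexdigest()
    return sorted(set(tuples), key=rank)[:k]

# e.g. a bottom-256 sketch of a single-column projection
sketch = bottom_k([(i,) for i in range(100000)], k=256)
```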

[Algorithm 3.1: pseudo-code (figure omitted)]

Algorithm 3.1 shows the pseudo-code of our proposed algorithm. As in [17], we assume that a set of unary (\(\mathbf {P_u}\)) and multi-column (\(\mathbf {P_m}\)) primary keys has been obtained already. \(\mathbf {C}\) and \(\mathbf {T}\), respectively, refer to the set of columns and tables in the given schema. The options O for the Match clause of the SQL standard are denoted by \(\mathbf {f}\) for full, \(\mathbf {s}\) for simple, and \(\mathbf {p}\) for partial. \(\hat{C}\) refers to the bottom-k sketch of column C. For \(O\in \{\mathbf {f},\mathbf {s},\mathbf {p}\}\), \(\textsc {Inclusion-O}(\hat{F}, \hat{P})\) computes an approximation of the full, simple, or partial inclusion coefficient by counting the number of foreign key tuples in the current bottom-k sketch \(\hat{F}\) that meet the requirements of full, simple, or partial semantics, respectively, and dividing this number by the total number of foreign key tuples in \(\hat{F}\). If \(f_j\) denotes the jth foreign key tuple in \(\hat{F}\), then \(f_j\) meets the requirements of (i) full semantics, when \(f_j[C']\not =\textsc {Null}\) for all columns \(C'\) and \(f_j\) occurs in \(\hat{P}\); (ii) simple semantics, when either \(f_j[C']=\textsc {Null}\) for some column \(C'\) or \(f_j\) occurs in \(\hat{P}\); and (iii) partial semantics, when there is some \(p\in \hat{P}\) that subsumes \(f_j\).

Algorithm 3.1 starts by profiling unary inclusion dependencies as in [6, 17]. The novel idea is to profile foreign keys under different SQL semantics. For unary candidates, simple and partial semantics coincide. Thus, Inclusion-S computes unary inclusion coefficients under simple and partial semantics.

All candidates for unary inclusion dependencies are stored in memory for profiling multi-column foreign keys. If a unary inclusion dependency references some \(P\in \mathbf P _u\), it is stored as a unary foreign key in \(\mathbf F _u\). If there is some multi-column primary key, Algorithm 3.1 calculates the inclusion coefficient for every set of unary inclusion dependencies that originate in one table and pairwise reference the columns of this primary key. Finally, all pairs \((F,P)\) which pass the inclusion test are stored in \(\mathbf {F}_m\). Algorithm 3.1 returns the set \(\mathbf {F_u} \cup \mathbf {F_m}\) of unary and multi-column candidate foreign keys.
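A minimal sketch of this composition step is given below; the quadruple representation of unary inclusion dependencies and all names are assumptions, and the subsequent Inclusion-O test on the combined sketch is omitted.

```python
from itertools import product

def multi_column_candidates(unary_inds, multi_pks):
    """unary_inds: set of (fk_table, fk_col, pk_table, pk_col);
    multi_pks: dict mapping a table to its multi-column primary key."""
    candidates = []
    for pk_table, pk_cols in multi_pks.items():
        fk_tables = {t for (t, _, pt, _) in unary_inds if pt == pk_table}
        for fk_table in fk_tables:
            # for every key column, the unary candidates from this fk table
            choices = [[fc for (t, fc, pt, pc) in unary_inds
                        if t == fk_table and pt == pk_table and pc == col]
                       for col in pk_cols]
            if not all(choices):
                continue  # some key column is not referenced from fk_table
            for fk_cols in product(*choices):
                if len(set(fk_cols)) == len(fk_cols):  # columns must be distinct
                    candidates.append((fk_table, fk_cols, pk_table, pk_cols))
    return candidates
```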

Example 3

Consider the TPC-C data set from Example 1 and a threshold of \(\theta = 0.9\). Algorithm 3.1 calculates the inclusion coefficients for unary candidates first:

 

| Candidate | \(|\textsc {Order}[{o\_c\_id},{o\_d\_id},{o\_w\_id}]|\) | \(\sigma _p\) | \(\sigma _s\) | \(\sigma _f\) |
| \(O[{o\_c\_id}] \subseteq C[{c\_id}]\) | 5 | 1 | 1 | 0.9 |
| \(O[{o\_d\_id}] \subseteq C[{c\_d\_id}]\) | 5 | 1 | 1 | 0.9 |
| \(O[{o\_w\_id}] \subseteq C[{c\_w\_id}]\) | 4 | 1 | 1 | 0.75 |
| FK | 10 | 0.8 | 0.9 | |

In particular, \(\sigma _f({o\_w\_id}, {c\_w\_id})=0.75 < \theta \), which means this unary inclusion dependency does not meet the requirements of full semantics. Therefore, our FK \(\textsc {Order}[{o\_c\_id},{o\_d\_id},{o\_w\_id}]\,\subseteq \,\textsc {Customer}[{c\_id},{c\_d\_id},{c\_w\_id}]\) is not considered further under full semantics. The final results show that this foreign key is profiled under simple but not under partial semantics, since \(\sigma _p=0.8 < \theta \). \(\square \)

Testing Randomness on Dirty and Incomplete Data. The randomness test from [14, 17] checks if the distinct values of the referencing column set X have the same distribution as the distinct values in the referenced column set Y. This test is good for eliminating false positives [14, 17]. For complete data, the Earth Mover's Distance (EMD) represents the least amount of work required to move the set of values in the candidate foreign key to the set of values in the referenced primary key. The smaller the EMD, the closer the distributions of the candidate foreign and primary key values. We extend the randomness test of [14, 17] from complete to incomplete data. Our method measures the likelihood of the candidate pair \((F,P)\) under either full, simple, or partial semantics.
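For intuition: in the one-dimensional case with equal total mass, the EMD reduces to accumulating, bin by bin, the mass that still has to be moved. The toy sketch below illustrates only this special case, not the quantile-grid approximation used by our algorithm.

```python
def emd_1d(fk_hist, pk_hist):
    """Histograms over the same ordered bins with equal total mass."""
    work = carry = 0.0
    for f, p in zip(fk_hist, pk_hist):
        carry += f - p       # mass still to be moved past this bin boundary
        work += abs(carry)
    return work

print(emd_1d([2, 2, 2], [2, 2, 2]))  # 0.0: identical distributions need no work
print(emd_1d([3, 2, 1], [1, 2, 3]))  # 4.0: a skewed distribution needs more work
```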

Algorithm 3.2 returns a set of candidates ranked in increasing order of their EMDs. Firstly, Algorithm 3.2 applies the extended randomness test to further prune the candidates found by Algorithm 3.1. As in [17], we apply quantile histograms to approximate EMDs. A histogram with a user-defined number of quantiles is calculated for every column of a primary key. For multi-column keys, a quantile grid is constructed from the quantile histograms of the individual columns. The grid for a candidate foreign key is constructed by populating the associated primary key grid with the values in its sketch. The approximate EMD is computed from the distance between the quantile grids of primary and foreign key. We empower this method to deal with real world data by imputing null markers with actual domain values that are consistent with simple and partial semantics, respectively. The procedure Simple replaces the incomplete bottom-k sketch \(\hat{F}\) of the given foreign key F by the completed sketch \(\hat{F}_{total}\) in which Null occurrences in \(\hat{F}\) have been imputed with randomly chosen domain values from the referenced primary key sketch \(\hat{P}\). This is consistent with simple semantics. The procedure Partial imputes Null occurrences in subsumed foreign key tuples t with randomly chosen domain values from tuples of the primary key sketch that subsume t. This is consistent with partial semantics. Based on the completed sketches, Algorithm 3.2 proceeds by approximating the EMD of candidate foreign keys from the quantiles of \(\hat{F}_{total}\).

[Algorithm 3.2: pseudo-code (figure omitted)]

Example 4

Applying Partial from Algorithm 3.2 to Example 1 may yield the following completed sketch: {(1,2,1), (1,2,2), (3,3,3), (Null,5,3), (2,3,2), (3,6,3), (4,2,2), (4,9,2), (2,5,2)}. An alternative is {(1,2,1), (4,2,2), (3,3,3), (Null,5,3), (2,3,2), (3,6,3), (2,5,1)}. In each completion, all partial tuples are replaced by complete tuples that subsume them. Tuple (Null,5,3) is not subsumed by any primary key tuple and violates partial semantics. Applying Simple produces a completed sketch in which all Null occurrences of (Null,2,2), (Null,5,3), (3,Null,3), (4,Null,2), (2,5,Null) are imputed by random domain values from the referenced primary key, for example by (1,2,2), (3,5,3), (4,5,2), (2,5,3). Even (Null,5,3) is replaced, consistently with simple semantics. \(\square \)
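A minimal sketch of the two imputation procedures follows, assuming Null is encoded as None; the description above requires only that imputed values are consistent with the respective semantics, so the uniform random choices below are one possible instantiation.

```python
import random

def impute_simple(fk_sketch, pk_sketch):
    """Replace every Null by a random value drawn from the same key column."""
    cols = list(zip(*pk_sketch))  # column-wise domains of the key sketch
    return [tuple(random.choice(cols[i]) if v is None else v
                  for i, v in enumerate(t))
            for t in fk_sketch]

def impute_partial(fk_sketch, pk_sketch):
    """Complete each tuple from a random key tuple that subsumes it; tuples
    subsumed by no key tuple stay unchanged and remain offensive."""
    def subsumed_by(t, k):
        return all(a is None or a == b for a, b in zip(t, k))
    completed = []
    for t in fk_sketch:
        donors = [k for k in pk_sketch if subsumed_by(t, k)]
        completed.append(tuple(random.choice(donors)) if donors else t)
    return completed
```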

4 Evaluation of Data-Driven Profiling Techniques

We evaluate the data-driven profiling of foreign keys from real world data sets.

Characteristics of Experiments. Our algorithms are evaluated on the two benchmark databases TPC-C and TPC-H. The characteristics of the data sets are summarized as follows:

| Data set | #tables | #rows | Data set size | #unary FKs | #multi-column FKs |
| TPC-C | 9 | 2.41M | 0.39G | 3 | 7 |
| TPC-H | 9 | 9.42M | 1.43G | 7 | 1 |

Our algorithms were implemented in C++ and run on an Intel Core i5 CPU at 3.3 GHz with 8 GB RAM. The operating system was 64-bit Windows 7 Enterprise, Service Pack 1. The database management system we used was MySQL version 5.5. Our algorithms are evaluated in terms of accuracy and time with respect to increasing levels of incompleteness in the data, starting with complete data. We use three well-known measures of accuracy: precision, recall, and f-measure. For each data set, we consider the constraints which are explicitly declared in the schema as the golden standard for determining the accuracy of our algorithms. This ensures that we can compare our results to those of [17], who followed the same approach. In TPC-C, there are three single-column, three 2-column (\(Cu \subseteq Di\), \(Hi \subseteq Di\), \(OL \subseteq St\)) and four 3-column (\(Hi \subseteq Cu\), \(Or \subseteq Cu\), \(OL \subseteq Or\), \(NO \subseteq Or\)) foreign keys (FKs). In TPC-H, 7 of the 8 foreign keys are single-column and the only composite foreign key has two columns. Starting from the given complete data sets, null marker occurrences were randomly generated for foreign key columns, in different percentages of tuples, to pinpoint the impact of different levels of incompleteness on our measures. The names of the resulting data sets originate from the original data set augmented by the percentage of nullified tuples. For example, TPC-C2 results from TPC-C where 2 % of the tuples had null markers in their foreign key columns.
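The incomplete variants can be generated along the following lines; whether a selected tuple receives one or several null markers is not fully specified above, so nullifying a single random foreign key column per selected tuple is an assumption of this sketch.

```python
import random

def nullify(rows, fk_cols, percentage, seed=42):
    """Set one random foreign key column to None in `percentage` % of the rows."""
    rng = random.Random(seed)
    rows = [list(r) for r in rows]
    victims = rng.sample(range(len(rows)), int(len(rows) * percentage / 100))
    for i in victims:
        rows[i][rng.choice(fk_cols)] = None
    return [tuple(r) for r in rows]

# TPC-C2-style variant: 2 % of the Order tuples receive a null marker
# (column indices of o_c_id, o_d_id, o_w_id are hypothetical here)
# order_rows = nullify(order_rows, fk_cols=[3, 2, 1], percentage=2)
```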

Profiling Accuracy and Time. We applied Algorithm 3.2 to the benchmark data sets with increasing levels of incompleteness under simple, partial and full semantics, respectively. As in [17] we applied \(\theta =0.9\), bottom-256 sketches, 256-quantiles for unary, and 16-quantiles for multi-column foreign key candidates. The results are summarized in Fig. 1 and Table 2. Here, \(t_p\) denotes the number of true positives, \(f_p\) the number of false positives, \(T_{3.2}\) the overall time of running Algorithm 3.2 in hours and minutes (hr:mn), and \(T_{3.1}\) indicates the time spent on Algorithm 3.1 as part of running Algorithm 3.2. Partial-PK denotes an optimization technique discussed in Sect. 5.

Table 2. Data profiling of SQL foreign keys from incomplete benchmark variants

Fig. 1. F-measure: TPC-C and TPC-H

Our first observation is that on the original data sets TPC-C and TPC-H, respectively, all accuracy measures agree under all three semantics. The sweet spot for balancing precision and recall is given by the top-20 % ranked foreign keys discovered from TPC-C, which yield an f-measure of 0.36. For TPC-H, the f-measure is around 0.4 when the sweet spots of the top-55 % or top-60 % ranked foreign keys are considered, see Table 3.

Our second observation is that full semantics is inadequate for SQL data profiling of foreign keys. Since this is the semantics applied by [17], their good performance for complete data does not carry over to incomplete data. The poor performance is more pronounced for TPC-C, since there are more multi-column foreign keys. Having null markers present in just 2 % of all tuples lowers the recall for full semantics from 0.9 to 0.3. Introducing null markers in 5 % of all tuples brings the recall down to 0.1, and this foreign key no longer features amongst the top-25 %. On TPC-H with 7 unary foreign keys and 1 binary foreign key, the f-measure under full semantics also declines as soon as null markers occur. In particular, the binary foreign key is never profiled under full semantics on any data set with some level of incompleteness.

Our third observation is that both simple and partial semantics perform very well in profiling foreign keys. Their measures of accuracy remain robust under increasing levels of incompleteness and are competitive with the corresponding measures on complete data, as Fig. 1 and Table 2 show.

Our final observation is that the robustness of simple and partial semantics under incomplete data comes at a very reasonable increase in profiling time, when compared to the state of the art algorithm for complete data. It is natural that the profiling time increases with the recall. However, spending 3hr:47mn, which is already the worst case, instead of 2hr:44mn is well worth the increase in accuracy, considering the purpose of data profiling.

Table 3. Partial semantics ranks meaningful foreign keys higher
Table 4. Semantics with lower EMD for TPC-C multi-column foreign keys
Table 5. Profiling benchmarks with different sample sizes under partial semantics

Comparing Precision of Simple and Partial Semantics. While simple semantics outperforms partial semantics in recall, the roles change for precision and f-measure. Table 3 shows, for each data set and semantics, the lowest percentage that captures all discovered foreign keys from the golden standard. Hence, partial semantics ranks meaningful foreign keys higher.

Table 4 shows for each of the meaningful multi-column foreign keys of TPC-C, which of simple (S) and partial (P) semantics resulted in a lower EMD. In about two thirds of the cases (23 out of 35) this was partial semantics.

Size of Sketches. As in [17], we observe that bottom-256 sketches are a sweet spot for the experiments. Bottom-128 sketches lead to drops in recall, while bottom-k sketches for \(k=512, 1024, 10K\) lead to minor improvements in recall, more false positives, and higher profiling time; see Table 5 for an example.

Other Experiments. We have applied our algorithms to the CM data set about employer costs for employee compensation from US government data http://catalog.data.gov/. It consists of 11 tables featuring 11 one-attribute foreign keys and two 3-column foreign keys. On total data, we obtain a recall of 0.85 and a precision of 0.3 under all semantics. Adding 2 %, 5 %, and 10 % null markers, respectively, simple and partial semantics maintain these rates, while full semantics drops to a recall of 0.31 and a precision of 0.2.

5 Optimization Strategies

We propose and evaluate two strategies for improving our algorithms.

Increasing Recall. Recall under partial semantics can be improved to the level of recall for simple semantics. Under partial semantics, our algorithm looks for a matching reference in the bottom-256 sketch of the primary key. If a match does not exist in the sketch, it may still exist in the whole table. Referring to this strategy as Partial-PK, Fig. 1 shows its increase in f-measure in comparison to scanning only the bottom-256 sketch of the referenced primary key. Moreover, Table 6 indicates that the penalty in profiling time is very minor, since the lookup uses the index on the referenced primary key.

Table 6. TPC-C and TPC-H with Partial-PK, bottom-256 sketch
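A sketch of the Partial-PK lookup is given below; the set standing in for the primary key index and the function name are illustrative assumptions.

```python
def subsumed_by(t, k):
    return all(a is None or a == b for a, b in zip(t, k))

def references(t, pk_sketch, pk_index):
    """pk_index: set of all key tuples, standing in for the PK index."""
    if any(subsumed_by(t, k) for k in pk_sketch):
        return True              # match found in the bottom-256 sketch
    if all(v is not None for v in t):
        return t in pk_index     # total tuple: direct index probe
    return any(subsumed_by(t, k) for k in pk_index)  # whole-table fallback
```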

Schema-Driven Strategies. We add two schema-driven techniques leading to significant improvements of our measures, as in [17].

The first technique concerns the golden standard. So far, it comprises only those foreign keys that were explicitly specified in the benchmark documentation. In reality, these foreign keys logically imply other foreign keys [3], which our current algorithms treat (incorrectly) as false positives. On the TPC-C schema, for example, the two foreign keys Order[o_c_id, o_d_id, o_w_id] \(\subseteq \) Customer[c_id, c_d_id, c_w_id] and Customer[c_d_id, c_w_id] \(\subseteq \) District[d_id,d_w_id] logically imply the foreign key Order[o_d_id, o_w_id] \(\subseteq \) District[d_id,d_w_id]. Adding all implied foreign keys to the golden standard is more realistic and leads to an improvement of our accuracy measures.
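The implication used in this example combines the projection and transitivity rules for inclusion dependencies; note that the result is a foreign key because the referenced columns form District's primary key. A small closure sketch, with the quadruple representation of foreign keys as an assumption, could look as follows:

```python
def implied_fks(fks):
    """fks: set of (t1, X, t2, Y) with X, Y tuples of column names;
    if R1[X] <= R2[Y] and R2[Y'] <= R3[Z] with Y' listing columns of Y,
    then R1[X'] <= R3[Z] for the matching columns X'."""
    closure = set(fks)
    changed = True
    while changed:
        changed = False
        for (t1, X, t2, Y) in list(closure):
            for (s1, Yp, s2, Z) in list(closure):
                if s1 == t2 and set(Yp) <= set(Y):
                    pos = [Y.index(c) for c in Yp]       # positions of Y' in Y
                    new = (t1, tuple(X[i] for i in pos), s2, Z)
                    if new not in closure:
                        closure.add(new)
                        changed = True
    return closure - set(fks)

fks = {("Order", ("o_c_id", "o_d_id", "o_w_id"),
        "Customer", ("c_id", "c_d_id", "c_w_id")),
       ("Customer", ("c_d_id", "c_w_id"), "District", ("d_id", "d_w_id"))}
print(implied_fks(fks))
# {('Order', ('o_d_id', 'o_w_id'), 'District', ('d_id', 'd_w_id'))}
```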

Our second technique applies a pruning strategy from [17] regarding the similarity of column names. Finding good measures for string similarity in database schemata is an open problem and can easily lead to false positives [17]. However, the documentation for the benchmark databases is extensive and explains the naming conventions. For TPC-C, for example, we can apply string identity after removing table name prefixes from column names. The resulting names are identical only if the candidate is an implied foreign key.
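One possible reading of the TPC-C convention, shown in the sketch below, is to strip the referencing table's column prefix and test for identity with the referenced column name; the one-letter prefix rule is an assumption for illustration, not the paper's exact matching rule.

```python
def names_match(fk_col, fk_prefix, pk_col):
    """Strip the referencing table's prefix from the foreign key column and
    require identity with the referenced column name."""
    return fk_col.startswith(fk_prefix) and fk_col[len(fk_prefix):] == pk_col

print(names_match("o_c_id", "o_", "c_id"))  # True: Order references Customer
print(names_match("o_d_id", "o_", "d_id"))  # True: implied reference to District
print(names_match("o_d_id", "o_", "c_id"))  # False
```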

Adding these techniques to our algorithms results in a perfect precision score of 1, and more than doubles our f-measure in comparison to applying the data-driven algorithms only, see Fig. 2.

Fig. 2. F-measures after adding schema-driven techniques to TPC-C, using the top-20 % of candidates on the left and the top-25 % of candidates on the right

6 Related Work

Previous algorithms for profiling referential integrity constraints do not consider incomplete data. The state of the art algorithm that considers dirty data is presented in [17]. The authors established a randomness test that reduces a large number of false positives and can profile foreign keys efficiently from complete data. Our evaluation on the original TPC-C data set, without introducing null markers, provides further evidence for the efficiency of the techniques in [17], whose evaluation did not consider TPC-C despite its high number of multi-column foreign keys compared to TPC-H and TPC-E.

State of the art algorithms for efficiently profiling uniqueness constraints are described in [7, 16]. Profiling inclusion dependencies is an effective precursor for profiling foreign keys [14, 17]. In fact, the discovery of unary foreign keys from unary inclusion dependencies has been studied by several authors [2, 6, 8, 11]. Data summaries and random sampling, cliques, min-hash sketches, and the Jaccard coefficient all constitute pruning techniques from previous work [2, 4, 6, 9, 17]. Lopes et al. [10] profile foreign keys from given SQL query workloads. Approximate inclusion dependencies constitute a different popular approach to accommodate dirty data within the data profiling framework [2, 6, 17]. The profiling of data quality semantics is important for big data [15].

7 Conclusion and Future Work

We have proposed the first algorithms for profiling foreign keys from dirty and incomplete data. Our techniques target the three different semantics for null marker occurrences as proposed by the SQL standard. On complete data, all three semantics coincide with the strictly relational semantics. Our detailed evaluation demonstrates several insights. Firstly, full semantics is not suitable for profiling foreign keys from incomplete data, even when only a few null markers occur. Secondly, our purely data-driven algorithms for simple and partial semantics perform as well as the state of the art algorithm does for complete data. Thirdly, the robustness of profiling with simple and partial semantics under different levels of incompleteness incurs very reasonable penalties in profiling time. Finally, combining data-driven with schema-driven techniques improves performance significantly, which is consistent with previous work on complete data.

In future work we will look at other strategies for imputing null markers during EMD calculation. Identifying thresholds of inclusion coefficients that work well for our profiling techniques is an interesting goal. We will also consider the less frequently observed case in which null markers also occur in referenced columns. Our preliminary observations suggest only marginal changes to the overall trend we have described. The extension of our techniques to profiling conditional foreign keys from dirty and incomplete data is an interesting goal for data cleaning. So far, approaches to this problem address neither dirty data nor different semantics for null markers [1].