
1 Introduction

Watermarking of relational databases has received much attention from the research community over the last decade, as various application scenarios, e.g., database-as-a-service, data-mining technologies, online B2B interactions, etc., demand an effective way to protect database information from fraudulent activities such as illegal redistribution, ownership claims, forgery, and theft [15, 26]. Figure 1 depicts a pictorial view of database watermarking: a watermark W is embedded into the original database using a private key K (known only to the owner), and the verification process is later performed on any suspicious database using the same private key K by extracting the embedded watermark (if present) and comparing it with the original watermark information.

Fig. 1 Basic watermarking technique

1.1 Related Works

Existing watermarking techniques fall into two categories: distortion-based and distortion-free. Distortion-based techniques [1, 10, 11, 25, 27, 28] introduce distortion into the underlying database data, and hence usability is a prime concern while watermarking: distortion should always be introduced in such a way that it is tolerable and does not destroy the usability of the data. Watermarking in [1] is performed by flipping bits of numerical values at some predetermined positions based on secret parameters. In [28], an image is embedded as a watermark at the bit level. The approaches in [10, 27] are based on database content: characteristics of the database data are extracted and embedded as a watermark into the data itself. The authors in [11] proposed a reversible watermarking technique which allows the original data to be recovered from the distorted watermarked data. Khanduja et al. [19] proposed a secure embedding of blind, multi-bit watermarks using the Bacterial Foraging Algorithm; later, they used voice as a biometric identifier for watermarking [18]. Besides numerical values, categorical data types and nonnumeric multi-word attributes are also considered as cover for watermarking in [2, 25]. Distortion-free watermarking techniques [5, 6, 13, 20, 21], on the other hand, do not introduce any distortion: the watermark is generated from the database rather than embedded into it. In [4, 21], a hash value of the database is extracted as the watermark information. The approaches in [5, 6, 20] are based on converting the database relation into a binary form to be used as the watermark. In [17], the watermark is generated based on digit frequencies, lengths of data values, etc. in the database, whereas [7] generates the watermark by grouping data into square matrices and computing the determinant and diagonals' minor for each group. Although the approach in [7] is not economically viable, it is suitable for detecting multifaceted attacks and is resilient against tuple insertion-deletion attacks and value modification attacks.

1.2 Motivations

It is to be observed that most of the distortion-based techniques in the literature use a part of the database content as cover [10, 27, 28]; therefore, a number of update or delete operations may distort the watermark or render it undetectable. Moreover, re-watermarking the database is a very expensive process. The authors in [12, 13] first addressed a key issue, called persistency, in the context of database watermarking where database tuples are updated or deleted frequently by the associated legitimate applications. Their approaches aim at preserving the persistency of the embedded watermark under usual database operations: the watermark is embedded into an invariant part of the database (w.r.t. database operations), while the watermark itself is generated from an abstraction of the variant part representing properties instead of actual values. However, they did not specify any approach to identify the variant/invariant part when watermarking a complete information system consisting of a set of applications interacting with a database at the back-end.

1.3 Contributions

In this paper, we propose a data-flow analysis-based approach which serves as a generic framework for persistent database watermarking. Unlike existing approaches, we consider watermarking of a complete information system, which includes both the back-end database and the associated applications legitimately accessing or manipulating the data in the database. In particular, our proposal unfolds into the following phases:

  • Formulation of data-flow equations for the applications embedding query languages.

  • Analysis of the applications based on the data-flow equations which effectively identifies an invariant part of the underlying database instances.

  • Watermarking of the invariant part by distortion-based technique.

  • Generation of Opaque Predicates from the variant part respecting the integrity constraints of the database systems.

  • Embedding opaque predicates as watermarks into the associated applications.

The structure of the paper is as follows: Sect. 2 provides a motivating example. Section 3 recalls some basic notions about persistent watermarking, data-flow analysis, etc. The proposed technique is discussed in Sect. 4. In Sects. 5 and 6, we provide, respectively, brief discussions of the complexity and robustness of our proposal. Experimental results are presented in Sect. 7. Finally, we draw our conclusions in Sect. 8.

2 Running Example

Consider three online trading companies, say x, y, z, each maintaining its own database and the associated applications. Table 1 depicts one such database, which stores the details of the customers, various products, and the purchase history. Suppose the three companies decide to collaborate, aiming at making the online purchasing system more attractive to the customers in terms of product availability.

However, according to the policy, each company can additionally run its own business independently. After the collaboration, a common interface is developed which is allowed to access any of the three databases. This makes the database information vulnerable to various kinds of attacks, e.g., theft, illegal redistribution, ownership claiming, etc. Therefore, it is mandatory to watermark each individual database in order to prevent the above-mentioned attacks.

Consider the code fragment P (Footnote 1) depicted in Fig. 2, which accesses and manipulates the database of Table 1. The code either inserts order details (statements 7–11) or offers gifts to premium customers (statements 13–16). Note that the database part corresponding to the attributes “TotalAmt” and “Offer” can possibly be updated by the application; hence it is a variant part. The rest of the database acts as the invariant part. It is immediate that any watermark embedded into this variant part may be destroyed or become undetectable due to legitimate update operations on the values.

Fig. 2 Program P
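
The actual code of Fig. 2 is not reproduced here. As an illustration only, the following Python-style sketch is consistent with the description above (statements 7–11 insert order details and update “TotalAmt”; statements 13–16 grant offers to premium customers); the helper `db` and the exact SQL strings are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the kind of code fragment P described above (the actual
# Fig. 2 is not reproduced here). Table/attribute names follow Table 1; the helper
# `db` (with `query`/`execute`) is an assumed database wrapper, not from the paper.
def process(db, custid, prodid, qty, price, mode):
    if mode == "order":                                   # statements 7-11: insert order details
        db.execute("INSERT INTO Orders(CustId, ProdId, Qty) VALUES (?, ?, ?)",
                   (custid, prodid, qty))
        db.execute("UPDATE Cust SET TotalAmt = TotalAmt + ? WHERE CustId = ?",
                   (qty * price, custid))                 # updates the variant attribute TotalAmt
    else:                                                 # statements 13-16: offer gifts to premium customers
        rs2 = db.query("SELECT CustId FROM Cust WHERE TotalAmt > 5000")
        for (cid,) in rs2:
            db.execute("UPDATE Cust SET Offer = 'gift' WHERE CustId = ?", (cid,))
```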

Table 1 Online trading database

In the subsequent sections, we propose an efficient way to identify the invariant and variant parts of the underlying databases w.r.t. the associated applications in the system. This enhances existing watermarking techniques w.r.t. the persistency issue.

3 Basic Concepts

In this section, we recall some basic notions about persistent watermarking from [13].

Persistent watermark Given a database dB and a set of associated applications A, we denote by 〈dB, A〉 an information system model. Let \( d_0 \) be the initial state in which the watermark W is embedded. When applications from A are processed on \( d_0 \), the state changes and goes through a number of valid states \( d_1, d_2, \ldots, d_{n-1} \). The watermark W is persistent if we can successfully extract and verify it blindly from any of these n − 1 states.

Definition 1

(Persistent Watermark)

Let 〈dB, A〉 be an information system model where A represents the set of associated applications interacting with the database dB. Suppose the initial state of dB is \( d_0 \). The processing of applications from A over \( d_0 \) yields a set of valid states \( d_1, \ldots, d_{n-1} \). A watermark W embedded in state \( d_0 \) of dB is called persistent if

$$ \forall i \in [1..(n - 1)],\;{\text{verify}}({d_0},W) = {\text{verify}}({d_i},W) $$

where verify(d, W) is a boolean function such that the probability of “verify(d, W) = true” is negligible when W is not the watermark embedded in d.
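
For concreteness, the persistency condition of Definition 1 can be phrased operationally as the following minimal sketch, where verify is the technique-specific boolean function mentioned above; its internals are not fixed by the paper.

```python
# Minimal sketch of the persistency condition in Definition 1, assuming a
# user-supplied verify(state, W) -> bool (technique-specific, not fixed here).
def is_persistent(states, W, verify):
    """states = [d_0, d_1, ..., d_{n-1}] reached under the applications in A."""
    baseline = verify(states[0], W)          # verify(d_0, W)
    return all(verify(d_i, W) == baseline for d_i in states[1:])
```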

Variant versus Invariant Database Part Consider an information system 〈dB, A〉 where A is the set of applications interacting with the database dB. For any state \( d_i \), i ∈ [0…(n − 1)], we can partition the data cells in \( d_i \) into two parts: invariant and variant. The invariant part contains those data cells that are not updated or deleted by the applications in A, whereas data cells in the variant part of \( d_i \) may change under the processing of applications in A.

Let \( {\text{CELL}}_{d_i} \) be the set of cells in the state \( d_i \). The set of invariant cells of \( d_i \) w.r.t. A is denoted by \( {\text{Inv}}_{d_i}^A \subseteq {\text{CELL}}_{d_i} \). For each tuple \( t \in d_i \), the invariant part of t is \( {\text{Inv}}_t^A \subseteq {\text{Inv}}_{d_i}^A \). Thus, \( {\text{Inv}}_{d_i}^A = \bigcup\nolimits_{t \in d_i} {\text{Inv}}_t^A \). The variant part w.r.t. A, on the other hand, is defined as \( {\text{Var}}_{d_i}^A = {\text{CELL}}_{d_i} - {\text{Inv}}_{d_i}^A \).
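
A straightforward way to realize this partition is sketched below, assuming cells are identified by (tuple key, attribute) pairs and that the variant cells are those reported by the analysis of Sect. 4.1; the representation is an illustrative assumption.

```python
# Sketch of the cell partition, assuming cells are identified by
# (tuple_key, attribute) pairs; `variant_cells` would come from the
# data-flow analysis of Sect. 4.1 (names here are illustrative only).
def partition_cells(all_cells, variant_cells):
    var = set(all_cells) & set(variant_cells)     # Var_{d_i}^A
    inv = set(all_cells) - var                    # Inv_{d_i}^A = CELL_{d_i} - Var_{d_i}^A
    return inv, var
```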

Data-flow Analysis Data-flow analysis is a technique for gathering information about the dynamic behavior of programs by only examining the static code [24]. A program’s control-flow graph (CFG) is used to define data-flow equations for each of the nodes in the graph. Data-flow analysis can be performed either in a forward direction or in a backward direction, depending on the equations defined. The least fix-point solution of the equations provides the required information about the program. The information gathered is often used by compilers when optimizing a program. A canonical example of a data-flow analysis is reaching definitions.
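
For illustration, the following sketch implements the canonical reaching-definitions analysis as an iterative fix-point computation over gen/kill sets; it is a textbook example to make the fix-point machinery concrete, not the analysis proposed in this paper.

```python
# A textbook reaching-definitions analysis (the canonical example mentioned
# above), shown only to make the fix-point machinery concrete; it is not the
# paper's analysis. Each CFG node n carries gen(n)/kill(n) sets of definitions.
def reaching_definitions(nodes, preds, gen, kill):
    """nodes: iterable of CFG node ids; preds[n]: predecessor ids of n;
    gen[n], kill[n]: sets of definitions (e.g. (variable, node) pairs)."""
    IN  = {n: set() for n in nodes}
    OUT = {n: set() for n in nodes}
    changed = True
    while changed:                       # iterate until the least fix point
        changed = False
        for n in nodes:
            IN[n] = set().union(*(OUT[p] for p in preds[n])) if preds[n] else set()
            new_out = gen[n] | (IN[n] - kill[n])
            if new_out != OUT[n]:
                OUT[n], changed = new_out, True
    return IN, OUT
```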

4 Proposed Technique

The intuition of our proposal is to make the embedded watermark persistent w.r.t. all possible operations in the information system. As database states change frequently under various legitimate operations of the associated applications, content-dependent watermarks embedded into the database are highly susceptible to benign updates. In particular, update and delete operations may remove or distort any existing watermark of the database [10, 27, 28].

In order to make the watermark persistent, our proposal aims at identifying invariant parts of the database states which remain unchanged w.r.t. the applications. To this aim, we apply a static data-flow analysis technique to the associated applications which identifies the parts of the database, called variant parts, targeted by update or delete operations in the applications. The complement of this variant part in the database acts as the invariant part and is used for persistent watermarking. For instance, any database part accessed only by SQL SELECT statements remains unchanged and is, of course, suitable for persistent watermarking. We also watermark the associated applications in the information system by using opaque predicates obtained from the variant part.

Summarizing, the proposed technique consists of the following phases:

  • Identifying variant and invariant parts of the database by performing data-flow analysis on the associated applications.

  • Watermarking of invariant database parts.

  • Watermarking of associated applications by using opaque predicates obtained from the variant part.

4.1 Data-Flow Analysis

In this phase, we analyze the associated applications based on the data-flow equations in order to collect information about the parts of the database updated or deleted at each point of the applications.

The data-flow equations for various commands in applications embedding query languages are defined in Fig. 3. The abstract syntax of update and delete statements is denoted by \( \langle {\vec v_d}\mathop = \limits^{\text{upd}} \vec e,\;\phi \rangle\) and \( \langle {\text{del}}({\vec v_d}),\;\phi \rangle\), respectively, where \( {\vec v_d} = \langle {a_1},{a_2}, \ldots ,{a_r}\rangle\) denotes a sequence of database attributes, \( \vec e = \langle {e_1},{e_2}, \ldots ,{e_r}\rangle\) denotes a sequence of arithmetic expressions, and ϕ denotes the WHERE-part of the statements, expressed as a first-order formula [14]. We denote by \( {\text{upd}}({\vec v_d}{)|_\phi } \) and \( {\text{del}}({\vec v_d}{)|_\phi }\) the parts of the database updated and deleted by \( \langle {\vec v_d}\mathop = \limits^{\text{upd}} \vec e,\;\phi \rangle\) and \( \langle {\text{del}}({\vec v_d}),\;\phi \rangle\), respectively. Observe that any database part is identified by the values of a subset of attributes \( {\vec v_d} \) in the subset of tuples satisfying ϕ. The notation (x, n) represents that x is defined at program point n, whereas \( (x,?) \) represents that x may be defined at any program point. In the case of a conditional node with boolean expression b, we denote by \( {\text{JOIN}}(n{)|_b} \) the information restricted by b.

Fig. 3 Data-flow equations of applications embedding query languages
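
As an illustration of the information these equations propagate, one possible encoding (an assumption for illustration, not the paper's implementation) represents each fact as a triple (attribute, φ, set of program points), with an UPDATE or DELETE node contributing one fact per attribute in \( {\vec v_d} \), restricted by its WHERE-formula.

```python
# Illustrative representation of the analysis facts: one triple per database
# attribute possibly written, i.e. upd(v_d)|_phi and del(v_d)|_phi each
# contribute (attribute, phi, {n}). Formulas are kept as strings here for
# simplicity; this encoding is an assumption, not taken from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    attr: str          # database attribute a_k
    phi: str           # WHERE-part restricting the affected tuples
    points: frozenset  # program points where the cell may be (re)defined

def transfer_write(attrs, phi, n):
    """Facts generated by an UPDATE or DELETE node n over attributes `attrs`."""
    return {Fact(a, phi, frozenset({n})) for a in attrs}

# e.g. a hypothetical UPDATE of TotalAmt at program point 11 of P,
# restricted by a WHERE-formula such as CustId = $cid (illustrative only):
facts_11 = transfer_write(["TotalAmt"], "CustId = $cid", 11)
```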

The data-flow analysis is performed by using the data-flow equations for each node of the control-flow graph and solving them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes, i.e., it reaches a fix point. The least fix-point solution of the equations provides the information about the variant part of the database possibly updated or deleted by the program. Observe that, while solving the data-flow equations, the result in any iteration may contain multiple definitions of the same attributes corresponding to different conditions (for example, \( {\vec v_d}{|_{\phi_1}} \) and \( {\vec v_d}{|_{\phi_2}} \)) (Footnote 2). In such a case, we use the merge function defined below:

$$ {\text{merge}}((a{|_{\phi_1}},{n_1}),(a{|_{\phi_2}},{n_2})) = (a{|_{{\phi_1} \vee {\phi_2}}},\{ {n_1},{n_2}\} ) $$

This yields modified data-flow equations for UPDATE and DELETE as follows:

Lattice Structure Defining Data-flow. Let Lab, Var, and ψ be the set of program points, the set of program variables, and the set of well-formed formulas (in first-order logic), respectively. Let \( R = {\text{Var}} \times \psi \times \wp ({\text{Lab}}) \). The lattice is defined as (℘(R), ⊆, ∅, R, ∪, ∩), where ∅ is the bottom element and R is the top element of the lattice. The least upper bound ∪ is defined as:

$$ \{ (x_i,\phi_i,\{ l_{i,m}\} )\} \cup \{ (x_j,\phi_j,\{ l_{j,n}\} )\} = \begin{cases} \{ (x_i,\phi_i \vee \phi_j,\{ l_{i,m}\} \cup \{ l_{j,n}\} )\} & \text{if } x_i = x_j \\ \{ (x_i,\phi_i,\{ l_{i,m}\} ),\;(x_j,\phi_j,\{ l_{j,n}\} )\} & \text{otherwise} \end{cases} $$

and the greatest lower bound ∩ is defined as:

$$ \{ (x_i,\phi_i,\{ l_{i,m}\} )\} \cap \{ (x_j,\phi_j,\{ l_{j,n}\} )\} = \begin{cases} \{ (x_i,\phi_i \wedge \phi_j,\{ l_{i,m}\} \cap \{ l_{j,n}\} )\} & \text{if } x_i = x_j \\ \emptyset & \text{otherwise} \end{cases} $$
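
The two lattice operations can be sketched as follows; formulas are kept as strings and combined syntactically, which is an illustrative simplification rather than the paper's implementation. Note that the join applied to facts over the same variable coincides with the merge function above.

```python
# Sketch of the lattice operations over sets of (variable, formula, labels)
# triples. Formulas are combined syntactically as strings; a real
# implementation would use a proper formula representation.
def join(s1, s2):
    """Least upper bound: facts on the same variable are merged by OR-ing
    their formulas and uniting their label sets (this realizes `merge`)."""
    out = {}
    for (x, phi, labels) in list(s1) + list(s2):
        if x in out:
            phi0, labels0 = out[x]
            out[x] = (f"({phi0}) OR ({phi})", labels0 | set(labels))
        else:
            out[x] = (phi, set(labels))
    return {(x, phi, frozenset(ls)) for x, (phi, ls) in out.items()}

def meet(s1, s2):
    """Greatest lower bound: only variables present on both sides survive,
    with AND-ed formulas and intersected label sets."""
    d1 = {x: (phi, set(ls)) for (x, phi, ls) in s1}
    out = set()
    for (x, phi2, ls2) in s2:
        if x in d1:
            phi1, ls1 = d1[x]
            out.add((x, f"({phi1}) AND ({phi2})", frozenset(ls1 & set(ls2))))
    return out
```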

Example 1

Let us illustrate the data-flow analysis on the running example P of Sect. 2. The control-flow graph of P and the data-flow equations for each node are depicted in Figs. 4 and 5 (Footnote 3), respectively. Solving the equations with the empty set as initial value, we obtain the least fix-point solution depicted in Fig. 6. The solution clearly indicates that the data corresponding to the attributes “TotalAmt” and “Offer” may possibly be defined at program points 11 and 16. Therefore, this part acts as the variant part of the database, while the remainder acts as the invariant part.

Fig. 4 Control-flow graph of P

Fig. 5 Data-flow equations of control-flow graph nodes of P

Fig. 6 Least fix-point solution of equations in Fig. 5

4.2 Watermarking of Invariant Parts

In this phase, we may use any of the existing watermarking techniques [15] to watermark the invariant part of the database obtained in the previous phase. As the invariant part is not prone to modification, the embedded watermark will, of course, behave as a persistent one.

However, the choice of existing watermarking technique is determined by (i) the use of data in a particular application context, (ii) the size of invariant part which is used as cover, (iii) the type of the cover, etc.
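
For instance, if the AHK algorithm [1] (also used in Sect. 7) is chosen, the embedding can be restricted to the invariant cells, as in the following hedged sketch. Parameter names γ (marking fraction) and ξ (number of markable low-order bits) follow the usual AHK description, while the exact hashing layout below is only illustrative and not a faithful reproduction of [1].

```python
# Hedged sketch of an AHK-style [1] bit embedding restricted to the invariant
# cells identified by the analysis; the hashing layout is illustrative only.
import hmac, hashlib

def _h(key, *parts):
    """Keyed hash of the given parts; key is the owner's secret (bytes)."""
    return int.from_bytes(hmac.new(key, ".".join(map(str, parts)).encode(),
                                   hashlib.sha256).digest(), "big")

def embed_invariant(rows, pk, invariant_attrs, key, gamma=10, xi=2):
    """rows: list of dicts; pk: primary-key attribute; invariant_attrs:
    numeric attributes found invariant w.r.t. the applications."""
    for r in rows:
        h = _h(key, r[pk])
        if h % gamma == 0:                                   # tuple selected for marking
            attr = invariant_attrs[h % len(invariant_attrs)] # attribute to mark
            bit  = (h // gamma) % xi                         # which low-order bit
            mark = (h // (gamma * xi)) % 2                   # bit value to embed
            r[attr] = (int(r[attr]) & ~(1 << bit)) | (mark << bit)
    return rows
```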

4.3 Watermarking of Applications Using Opaque Predicates

An opaque predicate is a predicate whose truth value is known a priori [8]. Monden et al. [22] first used opaque predicates in software watermarking by inserting dummy methods guarded by opaque predicates. The key challenge in designing opaque predicates is that they should be resilient to various forms of attack analysis. A variety of techniques, such as number-theoretic results, pointer aliases, and concurrency, have been suggested for the construction of opaque predicates [8]. In addition, Arboit suggested a technique for constructing a family of opaque predicates through the use of quadratic residues [3]. Arboit's proposal is to encode the watermark information in the form of opaque predicates and to embed it into the software without affecting the control-flow structures.

The integrity constraints defined on a database ensure that the attributes under the constraints will have right and proper values in the database. Moreover, database designers also have the opportunity to define their own assertions. These constraints, which in fact define the properties of attribute values, can be represented as predicate formulas of first-order logic.

In this phase, we identify integrity constraints, or we define assertions, as a way to represent the properties of values in the variant part of the database obtained in the previous phase. Observe that, although values in the variant part are prone to be updated or deleted, their properties represented by the constraints (integrity constraints or assertions) remain unchanged. Importantly, these constraints act as opaque predicates, as their truth value w.r.t. the values in the variant part is always true. We follow existing software watermarking techniques [16, 23] to watermark the applications in the information system using these opaque predicates. As the applications contain SQL statements, we may use the conditional part (WHERE clause) of the SQL statements as cover.
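
Before using a constraint as an opaque predicate, one may sanity-check that it indeed holds on every tuple of the relation containing the variant part. The following sketch performs such a check; the use of sqlite3 and string-encoded predicates is an assumption for illustration only.

```python
# Sketch of a sanity check that a candidate constraint really behaves as an
# opaque predicate, i.e. no row of the (variant) relation violates it.
import sqlite3

def is_opaque(conn, table, predicate):
    """True iff `predicate` holds on every row of `table` (e.g. an integrity
    constraint such as 'Age BETWEEN 15 AND 70')."""
    cur = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE NOT ({predicate})")
    return cur.fetchone()[0] == 0
```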

Consider the running example, and an integrity constraint defined on the attribute “Age” which states that the age must lie in the range 15–70. This is expressed as:

$$ 15 \leq {\text{Age}} \leq 70 $$

Since the formula is always true, it acts as an opaque predicate. Following Arboit's proposal [3], we can watermark the code by embedding this opaque predicate into statement 13, as shown below:

$$ \$ {\text{rs}}2 = {\text{SELECT CustId FROM Cust WHERE TotalAmt}} > 5000\;{\text{AND}}\;15 \leq {\text{Age}} \leq 70; $$
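
The rewriting itself amounts to conjoining the opaque predicate with the existing WHERE clause. The string-based transformation sketched below is a simplification of what [3, 16, 23] perform at the program level.

```python
# Sketch of the embedding step: conjoin the opaque predicate to the WHERE
# clause of a target SQL statement (string-level simplification).
def embed_opaque_predicate(sql, predicate):
    """e.g. embed_opaque_predicate(
        "SELECT CustId FROM Cust WHERE TotalAmt > 5000",
        "15 <= Age AND Age <= 70")
    -> 'SELECT CustId FROM Cust WHERE (TotalAmt > 5000) AND (15 <= Age AND Age <= 70)'"""
    head, _, cond = sql.partition(" WHERE ")
    return f"{head} WHERE ({cond}) AND ({predicate})" if cond else f"{sql} WHERE {predicate}"
```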

5 Complexity Analysis

Let n be the program size, and let p be the number of variables (which include database attributes and application variables) in the program. The number of data-flow equations associated with the control-flow nodes of the program is n. Since each data-flow equation depends on the results of the predecessor nodes, the worst-case time complexity of evaluating each data-flow equation is O(n). At each iteration, the analysis provides the information about the data defined up to each program point; therefore, the height of the corresponding finite lattice is O(p). Thus, the overall worst-case time complexity of the data-flow analysis is O(n × n × p) = O(n²p).

6 Security Analysis

The proposed approach focuses on an information systems scenario where databases are associated with a predefined set of applications. Our basic assumption is that only the database statements in the associated applications are authorized to perform computations on the database. Since attackers are not allowed to issue any other database operations, this mitigates the possibility of random value-modification attacks on the watermark in the invariant part. Note that an attacker can still perform attacks on the variant part (see Sect. 7). The integrity constraints, which are treated as opaque predicates, also do not change over time. Therefore, watermark detection in our approach is deterministic in practice. However, attackers may perform static analysis to detect opaque predicates [9] in order to remove watermarks from the code of the associated applications.

7 Experimental Results

We have performed experiments on the Forest Cover Type data set (Footnote 4). The data set has 581,012 tuples and 61 attributes. An extra attribute id is added in our experiment to serve as the primary key. The experiment is performed on a server equipped with an Intel Xeon processor (3.07 GHz), 64 GB RAM, and the Linux operating system. The algorithms are implemented in Java version 1.7 with MySQL version 5.1.73.

In Table 2, we describe the notations used in the tables showing the experimental results. Table 3 depicts the results of watermark detection after random update attacks on a database watermarked with the AHK algorithm [1]. Observe that detection may fail when more tuples are modified (updated) by attackers.

Table 2 Descriptions of the notations
Table 3 Detection results after random update attacks in AHK algorithm [1]

Experimental results obtained with our proposed scheme are depicted in Table 4. We have taken results by varying the size of the invariant part as 25, 50, 75 and 90 %, which include 145253, 290506, 435759 and 522910 tuples, respectively. Observe that we follow the AHK algorithm to embed and detect the watermark in the invariant part. The experimental results show that attackers may try to create a new watermark in the variant part by performing random modification attacks. The results imply that the probability of false-watermark detection in the variant part increases if the size of the variant part decreases or the value of α (hence τ) decreases. For lower values of α, an attacker may successfully prove the existence of such a false watermark. The parameters used by the attacker for detecting the false watermark are the same as those used for marking by the owner. This situation may arise while proving ownership in the presence of all concerned parties.

Table 4 Detection after random update attacks on variant in proposed scheme
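
For reference, in AHK-style detection the threshold τ is usually derived from the significance level α as the smallest match count that cannot plausibly occur by chance. The following minimal sketch mirrors the usual derivation in [1] and may differ from the exact parameterization used in the experiments.

```python
# Sketch of how the detection threshold tau is typically derived from the
# significance level alpha in AHK-style detection: tau is the smallest match
# count whose probability of occurring by chance (bits matching with
# probability 1/2) stays at or below alpha.
from math import comb

def threshold(omega, alpha):
    """omega: number of embedded bits examined during detection."""
    tail = 0.0
    for tau in range(omega, -1, -1):
        tail += comb(omega, tau) / 2 ** omega   # P(X = tau), X ~ Binomial(omega, 1/2)
        if tail > alpha:
            return tau + 1                      # smallest tau with P(X >= tau) <= alpha
    return 0
```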

8 Conclusions

In this paper, we proposed a persistent watermarking of information systems comprising a set of applications supported by a database at the back-end. We provided a unified framework combining software watermarking and database watermarking to watermark the complete system at once. The proposal identifies both the variant and invariant parts of the database by applying data-flow analysis to the applications, aiming at making the embedded watermarks persistent. The proposed technique serves as a generalized framework which may enhance any of the existing techniques in the literature in terms of persistency. We are now in the process of building a prototype tool based on this proposal.