16.1 Introduction

The impact of digital information environments has been nearly universal, generating vast quantities of information. Often these data are stored in special infrastructure called a digital preservation or archival system. Along with newly produced data, many countries have digitized paper documents to save physical space; land registers in Europe [4, 18] and the records of high courts in India [20] are two examples. The main goal of such a system is to secure the long-term persistence of information in digital form.

There are many reasons why the bit patterns of stored data objects change periodically in archival systems [1], with or without the knowledge of the data object's owner. Today we have several protection solutions based on cryptography. However, such solutions are not guaranteed to remain secure as computing power and cryptanalysis evolve [24]; for example, quantum computing techniques could be used to attack applications that currently rely on RSA signatures. Conversely, advances in computation power make it possible to strengthen cryptographic security by increasing the key size or block size. Protecting data over the long term can therefore involve changes to the key, the key size, the block size, or the cryptographic algorithm itself, as well as format transformations, all of which change the bit pattern of the data.

This work builds on our previous work, a periodical re-encryption-based archival system [23]. In such a system an archived data object may be transformed into several versions within a short period of time, and preserving an assurance of integrity all the way back to the original data object is a real challenge. Traditionally, integrity is checked by comparing a stored hash value with a hash newly computed from the archived data object. In a periodical re-encryption-based system, however, the bit pattern of a stored data object keeps changing, so comparing a newly computed hash value against the hash value stored with the owner is not a feasible way to ensure integrity. Many systems [8, 10, 12, 16, 22, 25] address this problem by reregistering a data object with the archival system whenever it is re-encrypted with new credentials, which also involves verifying integrity and authenticity before the data object is transformed into a new format or version. This process is impractical, because the owner or organization that submitted the data objects may not always be online, or may no longer be available over a long period, to authenticate each reregistration. Data stored in an archival system with periodical re-encryption is useful only if its integrity (the data objects are unaltered) is protected over the long term and a proof of existence (a time reference at which the data object was witnessed) is also provided.

16.2 Related Work

In this section, we describe some common integrity checking and proof of existence techniques traditionally used in digital archival systems.

16.2.1 Cryptographic Hash Function

A cryptographic hash function, as defined by Wenbo Mao [14], is a deterministic function that maps a bit string of arbitrary length to a bit string of fixed length, called a hash value, digest, or simply hash, and satisfies three important properties: (a) the hash value h(x) for an input x should be computationally indistinguishable from a uniform binary string in the interval \((0, 2^{|h|})\) (mixing transformation); (b) it should be computationally infeasible to find two inputs x and y with x ≠ y such that h(x) = h(y) (collision resistance); and (c) given a hash value h, it should be computationally infeasible to find an input string x such that h = h(x) (pre-image resistance). The security of traditional archival systems relies on the hardness of defeating one of these properties. A hash function that is secure today may therefore become insecure as cryptanalysis evolves, and thus no single hash function can be considered secure from a long-term point of view [15].
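As a concrete illustration, the following minimal Java sketch computes a fixed-length digest from an arbitrary-length input using the standard java.security API; SHA-256 is chosen here only as an example of a currently accepted hash function.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// A minimal sketch: mapping an arbitrary-length bit string to a
// fixed-length digest. SHA-256 stands in for the generic function h;
// as the text notes, no single choice stays secure in the long term.
public class HashSketch {
    public static void main(String[] args) throws Exception {
        byte[] input = "archived data object".getBytes(StandardCharsets.UTF_8);
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(input);
        System.out.println("digest length: " + digest.length + " bytes"); // always 32
    }
}
```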

16.2.2 Digital Signature

A digital signature scheme consists of three algorithms [15, 16, 25]. An efficient key generation algorithm generates a private and a public key such that a message m encrypted with the private key can be decrypted with the public key, or vice versa. An efficient signing algorithm generates a signature on a given message m using the private key. An efficient verification algorithm decides whether a given signature is valid. RSA [21] and variants of ElGamal [6, 13] are popular digital signature schemes.
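The following minimal Java sketch shows the three algorithms end to end, using the standard java.security API with RSA [21] as the underlying scheme; the key size and algorithm names are illustrative choices, not prescribed by the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// A minimal sketch of key generation, signing, and verification.
public class SignatureSketch {
    public static void main(String[] args) throws Exception {
        // 1. Key generation: an RSA private/public key pair.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair keys = gen.generateKeyPair();

        byte[] message = "message m".getBytes(StandardCharsets.UTF_8);

        // 2. Signing the message with the private key.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(message);
        byte[] sig = signer.sign();

        // 3. Verifying the signature with the public key.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(message);
        System.out.println(verifier.verify(sig)); // true for an unaltered message
    }
}
```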

16.2.3 Merkle Tree

A Merkle tree [15, 17] is one of the most widely used hash linking schemes. The leaves of the Merkle binary tree are the hash values of the data objects being processed, and the value stored at each internal node is the hash of the concatenated values of its children. The value computed at the root of the tree is called the Merkle root; it represents a compressed value of all the data objects being processed.

For example [8, 22], consider four data objects \(D_1\), \(D_2\), \(D_3\), and \(D_4\) with corresponding hash values \(h_1\), \(h_2\), \(h_3\), and \(h_4\), which are to be processed at a time \(T_0\). The corresponding Merkle tree is shown in Fig. 16.1.

Fig. 16.1 Merkle tree with time stamp

The values of the internal nodes are obtained as \(h_{12} = h(h_1 \| h_2)\) and \(h_{34} = h(h_3 \| h_4)\), and the Merkle root as \(MR = h(h_{12} \| h_{34})\). To check the proof for data object \(D_2\), whose hash value is \(h_2\), the required path is \(D_2 \rightarrow h_2 \rightarrow h_{12} \rightarrow MR\). We can recompute \(MR = h(h(h_1 \| h_2) \| h_{34})\) from the values \(h_1\), \(h_2\), and \(h_{34}\); this information is called the authentication path of the data object \(D_2\). The compressed value \(MR\) is used as a proof of existence of a data object, and a change to any of the data objects results in a different \(MR\) value.
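A minimal Java sketch of this example follows; SHA-256 stands in for the generic hash function h, and the leaf contents are placeholders.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Builds the four-leaf Merkle tree from the example: h12 = h(h1||h2),
// h34 = h(h3||h4), MR = h(h12||h34), then recomputes MR from the
// authentication path of D2 (h1, h2, and h34).
public class MerkleSketch {
    static byte[] h(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] p : parts) md.update(p); // hash of the concatenation
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] h1 = h("D1".getBytes()), h2 = h("D2".getBytes());
        byte[] h3 = h("D3".getBytes()), h4 = h("D4".getBytes());
        byte[] h12 = h(h1, h2), h34 = h(h3, h4);
        byte[] mr = h(h12, h34); // Merkle root

        // Recompute MR as h(h(h1||h2)||h34) using only the authentication path.
        System.out.println(Arrays.equals(mr, h(h(h1, h2), h34))); // true
    }
}
```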

16.2.4 Time Stamp

Whenever an archivist, who manages the data objects, has a copy to be time-stamped, he or she transmits the hash value of the data object to a trusted time stamping authority (TSA). The authority records the date and time at which the hash value was received and retains a copy for safekeeping. Any challenger can then check integrity by comparing the archivist's hash value with the TSA record [2, 7, 8, 11, 15, 22].
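A minimal Java sketch of this interaction is given below; the in-memory map stands in for the TSA's safekeeping store, and the signing of TSA responses is omitted, so all names are illustrative.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// A toy model of the time stamping interaction: the archivist submits a
// hash, the TSA records when it was received, and a challenger later
// compares a claimed (hash, time) pair against the TSA record.
public class TsaSketch {
    private final Map<String, Instant> records = new HashMap<>();

    // Archivist submits the hex-encoded hash value of a data object.
    Instant timestamp(String hashHex) {
        records.putIfAbsent(hashHex, Instant.now());
        return records.get(hashHex);
    }

    // Challenger checks the archivist's hash value against the TSA record.
    boolean verify(String hashHex, Instant claimed) {
        return claimed.equals(records.get(hashHex));
    }
}
```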

16.2.5 Evidence Record Syntax (ERS)

The Evidence Record Syntax (ERS), proposed by Gondrom et al. in RFC 4998 [8, 15], holds evidence for each data object. The evidence record contains a sequence of certificates issued by the archivist, and each certificate contains a time stamp issued by a trusted time stamping authority (TSA) at the archivist's request on a Merkle root. Each time stamp thus covers a group of data objects through the Merkle tree.

16.2.6 One-Way Accumulators

A one-way accumulator [3] is a function that takes two arguments from comparably sized domains and produces a result of similar size; in other words, it is a quasi-commutative one-way hash function with two input arguments and a fixed-size digest. The desired property is obtained by considering a function \(Acc : X \times Y \rightarrow X\) and requiring that for all \(x \in X\) and all \(y_1, y_2 \in Y\), \(Acc(Acc(x, y_1), y_2) = Acc(Acc(x, y_2), y_1)\).
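One classic construction satisfying this property is \(Acc(x, y) = x^y \bmod N\) for an RSA modulus N of unknown factorization, since \((x^{y_1})^{y_2} = (x^{y_2})^{y_1} \bmod N\). A minimal Java sketch, with a toy modulus for readability:

```java
import java.math.BigInteger;

// Quasi-commutative one-way accumulator Acc(x, y) = x^y mod N (the
// classic RSA-based construction). The modulus below is a toy value; a
// real deployment needs a large RSA modulus of unknown factorization.
public class AccumulatorSketch {
    static final BigInteger N = BigInteger.valueOf(3233); // toy modulus, 61 * 53

    static BigInteger acc(BigInteger x, BigInteger y) {
        return x.modPow(y, N);
    }

    public static void main(String[] args) {
        BigInteger x = BigInteger.valueOf(7);
        BigInteger y1 = BigInteger.valueOf(11), y2 = BigInteger.valueOf(13);
        // Accumulation order does not matter: Acc(Acc(x,y1),y2) = Acc(Acc(x,y2),y1).
        System.out.println(acc(acc(x, y1), y2).equals(acc(acc(x, y2), y1))); // true
    }
}
```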

16.2.7 Patricia Trees

A Patricia tree is a space-optimized data structure in which each node with a single child is merged with its parent [9]. It supports comparably high-speed search and insertion of new nodes. Unlike in regular trees, the key at each node is compared a group of bits at a time; the number of bits in that group is the radix r of the tree, which is binary when r is 2. The example in Fig. 16.2 shows a binary tree containing the strings 000, 011, 100, and 101. A new string is easily inserted into the data structure, since each node represents a unique string and its position is uniquely determined by its value. The root node represents the empty string, and each leaf node represents an actual string in the data structure. Searching is a simple tree traversal starting from the root, following the left path if the current bit of the string is 0 and the right path otherwise; the process repeats until the string is found or all bits of the string are exhausted.

Fig. 16.2 Patricia tree containing the strings 000, 011, 100, and 101
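A minimal Java sketch of the binary (r = 2) case follows. The node-merging space optimization that gives Patricia trees their name is omitted to keep the example short, and all names are illustrative.

```java
// A plain binary trie over fixed-length bit strings such as "000": the
// root represents the empty string, left edges are 0 bits, right edges
// are 1 bits, and leaves mark complete strings.
public class BinaryTrieSketch {
    static class Node {
        Node left, right;  // 0-branch and 1-branch
        boolean isLeaf;    // true if a stored string ends here
    }

    private final Node root = new Node();

    void insert(String bits) {
        Node n = root;
        for (char b : bits.toCharArray()) {
            if (b == '0') { if (n.left == null) n.left = new Node(); n = n.left; }
            else          { if (n.right == null) n.right = new Node(); n = n.right; }
        }
        n.isLeaf = true;
    }

    boolean search(String bits) {
        Node n = root;
        for (char b : bits.toCharArray()) {
            n = (b == '0') ? n.left : n.right;   // follow the bit's path
            if (n == null) return false;         // path missing: string absent
        }
        return n.isLeaf;
    }

    public static void main(String[] args) {
        BinaryTrieSketch t = new BinaryTrieSketch();
        for (String s : new String[]{"000", "011", "100", "101"}) t.insert(s);
        System.out.println(t.search("011")); // true
        System.out.println(t.search("010")); // false
    }
}
```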

16.3 Proposed Scheme

We propose a scheme with three actors in an archival system: the data owner, who generates the data objects that need to be archived for long-term use; the archivist, who manages the archived data objects in a secure manner; and a trusted middle layer between the owner and the archivist, defined below.

16.3.1 Lightweight Integrity Management Layer (IML)

The IML performs the computations needed to provide proof of integrity and proof of existence for a data object. It uses an XOR operation to compute the proof of integrity. To provide a proof of existence, it computes a Merkle root over a special data structure, a Patricia tree, whose special feature is its support for a time stamping scheme [19]. The proof-of-existence value for a data object is computed from the Patricia tree over a group of data object hash values.

16.3.2 Scheme Description

The basic computations required are defined in Fig. 16.3. The data owner computes \(h \leftarrow Hash(D_i)\) before archiving the data object \(D_i\). The notations \(B_i\) and \(A_i\) denote the hash values of data object \(D_i\) before and after encryption/re-encryption, respectively. The record \(E_i \leftarrow (F_i, S_i)\) provides proof of integrity, and the node \(V_h^t\), computed and inserted into the Patricia tree at time interval t, is used to confirm the existence of data object \(D_i\).

Fig. 16.3 The basic computations

Any challenger or the owner who wishes to confirm the integrity of the data object \(D_i\) collects the record \(E_i\) from the IML and the current hash value \(A_i\) from the archivist, and then checks whether \(F_i \oplus A_i = S_i \oplus h\). If the equality does not hold, integrity was lost at some re-encryption stage; to locate the exact interval at which integrity was lost, the IML must store the record \(E_i\) for every re-encryption stage.
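A minimal Java sketch of this check is given below. The construction of \(F_i\) and \(S_i\) is defined in Fig. 16.3 and is not reproduced here; the sketch shows only the XOR comparison itself, and assumes all four values have the same length.

```java
import java.util.Arrays;

// Checks F_i XOR A_i == S_i XOR h, where (F_i, S_i) come from the record
// E_i held by the IML, A_i is the current hash reported by the archivist,
// and h is the original hash computed by the owner at archival time.
public class IntegrityCheckSketch {
    static byte[] xor(byte[] x, byte[] y) {
        byte[] out = new byte[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (byte) (x[i] ^ y[i]);
        return out;
    }

    static boolean integrityHolds(byte[] f, byte[] a, byte[] s, byte[] h) {
        return Arrays.equals(xor(f, a), xor(s, h));
    }
}
```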

We organize our data structure as a Patricia tree, implemented as a binary tree indexed by the original hash value h computed at archival time, to enable efficient search. Each internal node stores a small amount of metadata, mainly a set of hash values computed from the hash values of its children at different stages; this set is ordered by time interval t. The node also stores two additional values identifying the left and right child records used when the hash value was computed. Specifically, we write \(V^t_n = \{h(V^t_l \| V^t_r), l, r\}\) for the record V stored at node n during time interval t, where the symbols l and r indicate the left and right child records used to calculate the node's hash value. Each leaf node holds one additional record E storing the values for proof of integrity.
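A minimal Java sketch of this node layout follows; the class and field names are illustrative, not taken from the paper's implementation.

```java
import java.util.TreeMap;

// One node of the binary Patricia tree. Records are kept ordered by the
// time interval t at which they were computed, mirroring
// V^t_n = {h(V^t_l || V^t_r), l, r}.
public class NodeSketch {
    static class Record {
        byte[] hash;   // h(V^t_l || V^t_r)
        Integer l, r;  // time intervals of the child records used (null if absent)
        Record(byte[] hash, Integer l, Integer r) { this.hash = hash; this.l = l; this.r = r; }
    }

    final TreeMap<Integer, Record> records = new TreeMap<>(); // keyed by t

    NodeSketch left, right;  // children in the binary tree
    Record integrityRecord;  // the extra record E = (F_i, S_i); leaf nodes only
}
```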

Let us consider two data objects \(D_1\) and \(D_2\) to be archived at a time \(T_1\), and assume that 00 and 01 are the hash values of \(D_1\) and \(D_2\), respectively. The IML computes the values for proof of integrity and updates the records at the corresponding internal nodes of the tree. At each leaf node, the top record V holds the current hash value of the data object, and the bottom record E holds the values for proof of integrity. Figure 16.4 shows the data structure at time \(T_1\). The record at the root node, \(V^1 = \{h(V^1_0 \| null), 1, null\}\), is used as the Merkle root. The authentication path is formed by the records \(V^1_{00}\), \(V^1_{01}\), and \(V^1\).

Fig. 16.4 Data objects \(D_1\) and \(D_2\) are archived at time interval \(T_1\)

Similarly, assume that in time interval \(T_2\) the data object \(D_2\) is re-encrypted and a data object \(D_3\) with hash value 10 is inserted, and that in time interval \(T_3\) the data objects \(D_2\) and \(D_3\) are re-encrypted and a data object \(D_4\) with hash value 11 is inserted. Figure 16.5 shows the corresponding data structure.

Fig. 16.5 Data structure at time interval \(T_3\)

The records at the root of the tree serve as Merkle roots, and the index of a record can be taken as the time stamp at which that Merkle root was constructed. The owner or any challenger can reconstruct the Merkle root from the authentication path, using the specific time interval as the record index, and thereby confirm the existence of the data object. For example, to confirm the existence of \(D_2\), the records \(V^3_{01}\), \(V^1_{00}\), \(V^3_1\), and \(V^3\) are required; similarly, to confirm \(D_1\), the records \(V^1_{00}\), \(V^1_{01}\), and \(V^1\) are required. The IML periodically removes records that are out of scope in order to reduce space overhead. For example, with reference to Fig. 16.5, at time interval \(T_3\) all records with index 2 (time interval 2) except \(V^2_{10}\) are irrelevant and can be removed from the tree; the record \(V^2_{10}\) is still required to prove that the data object \(D_3\) with hash value 10 was archived at time interval \(T_2\).
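A minimal Java sketch of this confirmation step follows, treating the required records as a standard Merkle audit path in which each step supplies a sibling hash and its side. The names are illustrative; in the scheme above, the records \(V^t_n\) for the chosen time intervals supply these hashes.

```java
import java.security.MessageDigest;

// Recomputes the Merkle root from a leaf hash and its authentication path
// and compares it with the stored root for the claimed time interval.
public class ExistenceCheckSketch {
    static class PathStep {
        byte[] siblingHash;
        boolean siblingOnRight; // true if the sibling is the right-hand input
        PathStep(byte[] s, boolean r) { siblingHash = s; siblingOnRight = r; }
    }

    static byte[] sha256(byte[] a, byte[] b) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(a);
        md.update(b); // hash of the concatenation a || b
        return md.digest();
    }

    static boolean verify(byte[] leafHash, PathStep[] path, byte[] merkleRoot) throws Exception {
        byte[] v = leafHash;
        for (PathStep step : path)
            v = step.siblingOnRight ? sha256(v, step.siblingHash)
                                    : sha256(step.siblingHash, v);
        return MessageDigest.isEqual(v, merkleRoot);
    }
}
```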

16.4 Implementation

We implemented the IML in Java on a desktop machine and performed experiments using the Enron employee email data set [5]. The experiments run on the periodical re-encryption algorithm of [23]. We implemented a Patricia tree-based data structure to hold the records in the IML. The following functions give an overview of the IML (a sketch of this interface follows the list):

  1. SearchNode searches for the data object identified by a given hash value, by traversing from the root node to the corresponding leaf node of the tree.

  2. InsertNode inserts a new node into the tree during a new archival.

  3. InsertRecord inserts a new record into a node; it is used when the metadata of the node changes.

  4. UpdateRecord updates the values for proof of integrity after a re-encryption of a data object has occurred.

  5. DeleteRecord deletes a record of a node; it is used when the IML removes irrelevant records from the tree.
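As a sketch, the five operations can be gathered into a Java interface such as the one below. The signatures are assumptions, since the paper names the operations but does not specify their parameters or return types; NodeSketch refers to the illustrative node class shown earlier.

```java
// An illustrative interface for the IML operations listed above.
public interface IMLOperations {
    // Traverse from the root to the leaf identified by the original hash h.
    NodeSketch searchNode(byte[] originalHash);

    // Insert a new leaf (and any needed internal nodes) during a new archival.
    void insertNode(byte[] originalHash);

    // Insert a new record V^t_n into a node when its metadata changes.
    void insertRecord(NodeSketch node, int timeInterval, byte[] hash, Integer l, Integer r);

    // Update the proof-of-integrity values (F_i, S_i) after a re-encryption.
    void updateRecord(NodeSketch node, byte[] newF, byte[] newS);

    // Delete a record of a node when the IML removes irrelevant records.
    void deleteRecord(NodeSketch node, int timeInterval);
}
```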

The IML maintains a special list noting the data objects that have undergone re-encryption. Periodically, the IML computes a Merkle root over the hash values of all those data objects and inserts new records into the corresponding nodes of the data structure.

16.5 Results and Discussion

In a long-term archival system, it may not be possible for the owner or client to always be online to receive and store the certificates issued by the archivist. Therefore, client-server-based systems may not be advisable solutions for long-term, re-encryption-based archival systems.

The major problem of the ERS scheme [8] is the size of the linking information: the metadata grows linearly with the number of data objects in the archival system. In addition, it may not be a practical solution given the frequent re-encryption of data objects.

The major problem of the one-way accumulator-based scheme is the high complexity of computing a one-way accumulation on every re-encryption of a data object. If the archival node takes one microsecond per data object to generate a certificate with a new accumulator value, it can process only around 8% of the data objects that the ERS scheme can process.

Our scheme eliminates this complexity by using a simple XOR operation. Simulation of the IML shows that it can process 2–3% more data objects than the ERS scheme within a time period of 30 ms. We randomly selected 90,000 Enron employee emails with an average size of 1.9 KB, treating each email together with all its attachments as a single data object. The simulation started with ten initial data objects, with further objects added uniformly as defined in [23]. At every millisecond (one time interval), the IML is called to compute the proof of existence over the hashes of the data objects re-encrypted during that interval.

Over five simulation runs, the periodical re-encryption algorithm [23] processed up to 81,400 emails in 30 time intervals, of which nearly 23,600 were highly sensitive data objects (HSD). The runs also show a linear increase in medium sensitive data objects (MSD) over time. Figure 16.6 shows the storage requirements for data objects and metadata: the metadata size for our scheme stabilizes at around 20 MB, as stabilization is observed with respect to the HSD objects. Figure 16.7 shows that, without periodical re-encryption, the metadata storage requirement increases linearly with the number of data objects, indicating that our scheme can also be used with a traditional system.

Fig. 16.6 Storage requirements for data objects and metadata with periodical re-encryption

Fig. 16.7 Storage requirements for data objects and metadata without periodical re-encryption

The IML scheme is lightweight and easy to implement on any periodical re-encryption-based archival system, and the storage space required for integrity-checking and proof-of-existence information is comparatively small.

16.6 Conclusion and Future Work

The obsolescence of hash functions and cryptographic algorithms undermines archival system solutions from a long-term point of view. The traditional reregistration of a data object with the archival system whenever its version or format changes may not be a realistic solution. Ensuring the integrity and proof of existence of a data object whenever the owner needs confirmation is a real challenge for any archival system.

In this paper we have shown the challenge of, and the need for, a specific technique to certify long-term integrity and proof of existence in a periodical re-encryption-based archival system. Our scheme uses a suitable time stamping technique and focuses on realistic implementation in archival systems. The experiments demonstrate that the metadata required to provide proof of integrity and proof of existence is compact, reaching about 20 MB for 81,000 emails. The simulation results show that the metadata storage requirement stabilizes over time instead of growing linearly. The scheme is very lightweight and easy to implement: the owner need not store all the certificates issued by the archivist, and the integrity and proof of existence of a data object are easily verifiable by any challenger at any point in time.

One issue our scheme does not address is that, if a data object is altered, it cannot generate an alert message to an authorized person or the archivist.