
1 Introduction

Confidential documents often follow an incremental structure, in which files arrive like chunks of a larger file. Suppose files arrive in the sequence \( f_{1}, f_{2}, f_{3}, \ldots, f_{k} \); then each \( f_{i} \) is a file in its own right, and any prefix \( f_{1} f_{2} f_{3} \ldots f_{k} \) is also a file at the moment \( f_{k} \) arrives. Tamper detection commonly relies on hashing and message authentication code (MAC) computations. When such incremental documents are stored with a third party, such as a cloud service provider (CSP), the files need to be stored in encrypted form. Encryption protects the crucial information in the files from being leaked, but the files can then no longer be used, or even concatenated, without first decrypting them. Given the amount of data, downloading the files from the cloud to the user site, decrypting and concatenating them, and uploading them again is impractical. If this stupendous task is outsourced to the cloud, the encryption key would have to be shared, defeating the purpose of encryption in the first place. Here we assume that the cloud may be honest but curious. The first requirement, therefore, is that the CSP should be able to concatenate files in encrypted form only. This can be achieved simply by encrypting files at word level, that is, each word of the file is encrypted separately and the cipherwords are concatenated to form the file. Encryption could be taken down to bit level, but that would be very expensive. Once this is done, any message digest or hash function can be computed homomorphically over the files for the purpose of tamper detection. The real problem arises when the files are part of an incremental document: the message digest or hash must now be computed for \( f_{1}, f_{1} f_{2}, f_{1} f_{2} f_{3} \), and so on. This requires a homomorphic hash, as proposed in this paper.

2 Incremental Text Model

Documents like appeals, applications, project reports, annual reports, etc. are crucial and large in size. They are incremental in both use and storage. Though common in practice, a corresponding model for the soft copies of incremental documents has not yet been formalized. We present here an elementary idea of how such documents can be stored and viewed as soft copies. Let every individual part of the document be called a file, denoted by f. At any point, the document is the sequential concatenation of the files generated so far, \( D_{k} = \left\langle {f_{1} f_{2} \ldots f_{k}} \right\rangle \). Since files are generated and maintained chronologically, for any files \( f_{i} \) and \( f_{j} \), \( i < j \) implies that \( f_{i} \) was generated before \( f_{j} \). Each file has an independent existence, as does the entire document; no other sequence or combination of files can be retrieved or has any meaningful interpretation. Thus, at any point of time k, \( D_{k} \) is the sequence of \( k \) files beginning at \( f_{1} \) and ending at \( f_{k} \).
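In effect, the model is an append-only list of files from which only chronological prefixes are ever read. The following minimal Java sketch illustrates this view; the class and method names are illustrative only and not part of the proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the incremental document model: D_k is the chronological
// concatenation of files f_1..f_k; files are only ever appended.
public class IncrementalDocument {
    private final List<String> files = new ArrayList<>();   // f_1, f_2, ...

    // Appends the newly generated file f_k; earlier files are never reordered.
    public void append(String file) {
        files.add(file);
    }

    // Returns D_k, the concatenation of the first k files (1 <= k <= n).
    public String prefix(int k) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < k; i++) sb.append(files.get(i));
        return sb.toString();
    }
}
```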

3 DGHV Style FHE Scheme

Homomorphic encryption is a technique that allows operations to be performed over encrypted data such that the results, when decrypted, are the same as those produced by the desired operation on the plaintexts. Thus, the operation ‘g’ performed over ciphertexts is homomorphic to the desired function ‘f’ over plaintexts. If ‘f’ may be an arbitrary circuit and a homomorphic ‘g’ exists for every such ‘f’, the scheme is called fully homomorphic. The idea was first posed by Rivest et al. [1], and the first solution arrived in 2009 through Gentry [2]. The idea has since been developed from bits to integers, based on several hard problems. Recent works based on ring learning with errors [3, 4] are efficient and secure enough. Yet practical issues such as the number and size of keys involved, noise growth, etc. make the practicality of implementing FHE debatable.

In this section we first discuss the FHE scheme to be used, based on those suggested in [5], which in turn are symmetric versions of DGHV-style [6] FHE systems. Any file is assumed to be divided into equal-sized words, which may be 1, 2 or 4 bytes long, or more, as per the computation power available at the data owner. The word size affects the runtime cost of encryption. A file is encrypted before storage to ensure security, and the key used remains secret with the data owner. Let the word size be w (256, 65,536 or 4,294,967,296) and the key length be η bits. The file to be encrypted consists of B words, and each word \( b_{1}, b_{2}, \ldots, b_{B} \) is encrypted as

$$ x_{j} = b_{j} + p*r_{j} $$
(1)

where p is the secret key, an odd number of length \( \eta \) bits. Each \( r_{j} \) is a random odd number of length η bits, \( r_{j} \ne p, \forall j \). Thus, the file \( \left\langle {b_{1}, b_{2}, \ldots, b_{B}}\right\rangle \) is encrypted and stored in the cloud as \( \left\langle {x_{1}, x_{2}, \ldots, x_{B}}\right\rangle \). Each encrypted word, henceforth called a cipherword, has length \( O(\eta ) \).
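A minimal Java sketch of the word-level encryption of Eq. (1) follows, in the spirit of the Java implementation reported later; the key generation and parameter handling shown here are illustrative assumptions, not the exact implementation.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Sketch of word-level encryption per Eq. (1): x_j = b_j + p * r_j.
public class WordEncryptor {
    private final BigInteger p;              // secret key: random odd number of eta bits
    private final int eta;
    private final SecureRandom rnd = new SecureRandom();

    public WordEncryptor(int eta) {
        this.eta = eta;
        this.p = randomOdd(eta);
    }

    // Draws a random odd number of exactly 'bits' bits.
    private BigInteger randomOdd(int bits) {
        return new BigInteger(bits, rnd).setBit(bits - 1).setBit(0);
    }

    // Encrypts one plaintext word b_j (0 <= b_j < w) as x_j = b_j + p * r_j.
    public BigInteger encryptWord(long word) {
        BigInteger r;
        do {
            r = randomOdd(eta);              // fresh random odd r_j of eta bits
        } while (r.equals(p));               // r_j != p
        return BigInteger.valueOf(word).add(p.multiply(r));
    }

    // Decryption by the data owner: b_j = x_j mod p (valid while b_j < p).
    public long decryptWord(BigInteger cipherWord) {
        return cipherWord.mod(p).longValueExact();
    }

    // Encrypts a file given as its sequence of words b_1..b_B.
    public List<BigInteger> encryptFile(long[] words) {
        List<BigInteger> out = new ArrayList<>();
        for (long b : words) out.add(encryptWord(b));
        return out;
    }
}
```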

Each file is associated with a hash value, equivalent to the bitwise XOR of all words of the file; in its final, encrypted form this hash is a word of length \( O(\lambda ) \). Since the encryption is fully homomorphic, instead of computing the XOR of all words and then encrypting it under key p, the encrypted hash value, y, can be computed directly as the modular sum of all cipherwords. Because a homomorphic hash is required here, the encrypted hash is re-encrypted under a new key q,

$$ y = (\varSigma x_{j} ){\mod w} $$
(2)
$$ z = y + qs $$
(3)

where \( q \) is a public key, an odd number of λ bits, and \( s \) is a random odd number of λ bits, \( s \ne q \). This produces the final hash value of the file, stored as z, a number of length \( O(\lambda ) \) bits.
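Continuing the sketch above, the hash of Eqs. (2) and (3) can be computed directly over cipherwords, so a third party can evaluate it without the secret key p. The class below is an illustrative assumption of how this might look.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.List;

// Sketch of the file-hash computation of Eqs. (2) and (3):
// y = (sum of cipherwords) mod w, then z = y + q * s under the public key q.
public class FileHasher {
    private final BigInteger w;      // word modulus, e.g. 256 for 1-byte words
    private final BigInteger q;      // public key: odd number of lambda bits
    private final int lambda;
    private final SecureRandom rnd = new SecureRandom();

    public FileHasher(BigInteger w, BigInteger q, int lambda) {
        this.w = w;
        this.q = q;
        this.lambda = lambda;
    }

    // Computes the encrypted hash z of one file, given its cipherwords x_1..x_B.
    public BigInteger hashFile(List<BigInteger> cipherWords) {
        BigInteger y = BigInteger.ZERO;
        for (BigInteger x : cipherWords) {
            y = y.add(x);
        }
        y = y.mod(w);                               // Eq. (2)
        BigInteger s;
        do {                                        // random odd s of lambda bits, s != q
            s = new BigInteger(lambda, rnd).setBit(lambda - 1).setBit(0);
        } while (s.equals(q));
        return y.add(q.multiply(s));                // Eq. (3)
    }
}
```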

The fully homomorphic properties of the suggested scheme can be observed directly, as proved in [5].

Note on parameter selection: The word size \( w \) makes it mandatory that \( \eta > \log_{2} w \), so that the modular arithmetic remains correct. Moreover, \( \lambda > \eta \) is required for hash computations. Our experiments suggest that a minimum difference of 2 be maintained.
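These constraints can be made explicit as a small runtime check; the helper below is hypothetical, and the margin of 2 between λ and η simply follows the experimental suggestion above.

```java
// Sketch of the parameter constraints stated above: eta > log2(w) and
// lambda >= eta + 2 (the margin of 2 follows the paper's experiments).
public class ParameterCheck {
    public static void check(long w, int eta, int lambda) {
        int wordBits = 64 - Long.numberOfLeadingZeros(w - 1);   // ceil(log2 w)
        if (eta <= wordBits)
            throw new IllegalArgumentException("eta must exceed log2(w)");
        if (lambda < eta + 2)
            throw new IllegalArgumentException("lambda should be at least eta + 2");
    }
}
```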

4 Proposed Protocol

Let the incremental document be a chronological sequence of n files, \( D = \left\langle{f_{1} f_{2} \ldots f_{n}}\right\rangle \). The user generates a file \( f_{i} \), encrypts it into a sequence of cipherwords under key p using Eq. (1), and uploads it to the third-party storage (TPS). The TPS computes the hash value of the file as \( z_{i} \) using Eqs. (2) and (3) under key q. At a subsequent instant, file \( f_{i + 1} \) is generated and the same procedure is followed. At any arbitrary instant k, the document \( D_{k} \) is the concatenation of k files \( \left\langle{ f_{1} f_{2} \ldots f_{k}}\right\rangle \); hash values therefore need to be associated with documents as well. The hash values of documents are computed as

$$ Hash\left( {D_{1} } \right) = Hash\left( {f_{1} } \right) = z_{1} $$
$$ Hash\left( {D_{2} } \right) = Hash\left( {f_{1} f_{2} } \right) = z_{1} + z_{2} $$

Thus,

$$ Hash(D_{k} ) = Hash(f_{1} f_{2} \ldots f_{k} ) = \sum\nolimits_{i = 1}^{k} {z_{i}}. $$
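Since each new file only adds one term to this sum, the TPS can maintain the document hash incrementally. A minimal sketch, with illustrative names, is:

```java
import java.math.BigInteger;

// Sketch of incremental document hashing: Hash(D_k) = z_1 + z_2 + ... + z_k.
// The TPS only needs to add the new file's hash to the stored document hash.
public class DocumentHash {
    private BigInteger hash = BigInteger.ZERO;   // Hash(D_0)

    // Called whenever a new file f_k arrives with its encrypted hash z_k.
    public void appendFile(BigInteger zk) {
        hash = hash.add(zk);
    }

    public BigInteger current() {                // Hash(D_k)
        return hash;
    }
}
```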

We proceed to discuss how hash value of a document \( D_{k} \) can be verified for correctness.

Steps 1 to 4 of the protocol are repeated many times at arbitrary instants. Files are concatenated in chronological sequence. At any point k, the current document is \( D_{k} \) and the user may want to perform a tamper check. The tamper-check request is handled by sending the hash value of the current document, which is checked using the verification algorithm. If verification is positive, the protocol stops. If verification is negative, a tamper is detected and the user proceeds to locate it. This is done by running the verification algorithm over all documents beginning at \( D_{1} \), with the hash values now recomputed by the TPS under a new key sent by the user. Let the point at which verification first fails be t. This implies that file \( f_{t} \) is tampered. The error can be located further by checking file \( f_{t} \): the cipherword that does not decrypt correctly is the tampered one.
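A sketch of this tamper-location scan is given below. Since the verification algorithm itself is only referenced in this section, it is abstracted here as a caller-supplied predicate, and the per-file hashes recomputed under the new key are passed in as a list; both are assumptions made for illustration.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.function.BiPredicate;

// Sketch of tamper location: scan documents D_1, D_2, ... and report the first
// index t at which verification fails, i.e. the first tampered file f_t.
// 'verify' stands in for the verification algorithm (not specified here);
// 'fileHashes' are the per-file hashes recomputed by the TPS under the new key.
public class TamperLocator {

    // Returns the index t (1-based) of the first prefix D_t that fails
    // verification, or -1 if every prefix D_1..D_n verifies correctly.
    public static int locate(List<BigInteger> fileHashes,
                             BiPredicate<Integer, BigInteger> verify) {
        BigInteger docHash = BigInteger.ZERO;
        for (int k = 1; k <= fileHashes.size(); k++) {
            docHash = docHash.add(fileHashes.get(k - 1));   // Hash(D_k)
            if (!verify.test(k, docHash)) {
                return k;                                   // f_k is tampered
            }
        }
        return -1;
    }
}
```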

When the verification process returns a negative reply, a tamper in one or more files is expected. The protocol for tamper detection can be summarized as shown in Fig. 1.

Fig. 1. Tamper detection protocol

5 Analysis of Proposal

The proposal was implemented as a Java program, and the growth of runtime was recorded for different values of \( \eta, \lambda \) and w. The full results cannot be reported due to space limitations. Figure 2 plots the encryption time for different combinations of \( \eta \) and \( \lambda \), keeping w = 256. It can be seen that the growth is quadratic as the parameter values increase, which is acceptable even for a lightweight client.

Fig. 2. Growth of encryption time with increasing values of parameters

5.1 Performance Analysis

A formal analysis can be performed for each step involved. To encrypt a word of size \( \log_{2} w \) bits under a key of \( \eta \) bits, the major cost is incurred in the multiplication of p and r, making encryption an \( O(\eta^{2} ) \) operation, since \( \log_{2} w \ll \eta \). Computing the hash involves additions of cost \( O(\eta ) \) and a multiplication of cost \( O(\lambda^{2} ) \). The time complexity of the verification process for a document \( D_{k} \) is \( O(k\lambda ) \), and thus depends on the number of files in \( D_{k} \). The tamper-location process, once the tampered file has been identified, involves checking all cipherwords of the suspected file. This amounts to the cost of encrypting all words of the file, and so depends on both the number of words in the file and \( \eta^{2} \).

5.2 Security Analysis

A brute-force guess of a \( b \)-bit odd number requires \( O(2^{b - 1} ) \) effort. Since the secret key ‘p’ is never shared, it can only be guessed, requiring \( O(2^{\eta - 1} ) \) effort. Forging a valid hash requires picking a correct number of length \( (2\eta + 1) \) bits, since ‘\( q \)’ may be leaked. This amounts to a naive effort of \( O(2^{2\eta + 1} ) \).

At present, an effort of \( 2^{80} \) cycles is considered secure. This means that the suggested scheme and protocol are secure at parameters η = 40, λ = 42. Considering other kinds of attacks, a key of length 128 bits is considered secure enough. The parameter values η = 128, λ = 140, when tested, incur an encryption time of less than 10 ms.
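For concreteness, if the hash-forging effort \( O(2^{2\eta + 1}) \) is taken as the binding bound, the claimed parameter choice follows as

$$ \eta = 40 \;\Rightarrow\; 2^{2\eta + 1} = 2^{81} > 2^{80}. $$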

6 Conclusion

The security and tamper-detection problem for incremental documents has not been addressed until now. We present a simple solution based on a DGHV-style FHE scheme. The protocol for tamper detection and location is simple yet powerful. Keeping efficiency in mind, a lightweight scheme is suggested, which is fast enough even for very large keys.

7 Future Scope

Better candidates of FHE can be explored for a similar application. The overall protocol is still under implementation. We plan to analyze the growth of verification time and tamper-detection time with increasing file size. A security analysis of the suggested scheme in the context of known plaintext–ciphertext pairs is yet to be done. A formal ontology related to incremental documents is a further research area.