1 Introduction

Big data is a term for datasets that are so large and complex that they cannot be analysed with traditional computing technologies. The quantity of data being generated is increasing exponentially from different application sources such as retail, logistics and financial databases, social networks, sensors, the Internet of Things, etc.

In order to interpret the data and understand its characteristics, it is very important to securely store, manage and share this huge amount of complex data. Such a facility is now made available via the distributed platform popularly known as cloud computing.

The main feature of cloud computing is on-demand network access to computing resources, on a pay-per-use basis, provided by cloud service providers. Common service models for cloud computing include Platform as a Service (PaaS), Software as a Service (SaaS) and Infrastructure as a Service (IaaS). PaaS provides customers with a platform to develop, run and manage applications without owning the underlying infrastructure. SaaS provides businesses with applications that are stored and run on virtual servers in the cloud.

In the IaaS model, the client pays on a per-use basis for the equipment that supports computing operations, including storage, hardware, servers and networking equipment.

However, security over cloud services is still in a maturing phase, and data in the cloud is at risk from a large number of security vulnerabilities. Cloud users have no knowledge of where their data is being stored or in what format. Therefore, users must be assured that proper security measures have been adopted to protect their information, mainly from data leakage and data tampering. Further, processing and analysing huge volumes of data at a cloud data center is a critical issue. Recently, several distributed frameworks such as HADOOP [1, 2] and the Google File System [3] have been developed for storing and processing BIG DATA. Among these, the HADOOP distributed framework is quite popular in industry and research communities. HADOOP includes two sets of functionalities: (i) the HADOOP Distributed File System (HDFS) to store large and unstructured datasets, and (ii) the MapReduce framework for processing huge data. Typically, HADOOP works with applications having thousands of nodes and petabytes of data.

Security mechanisms are not incorporated in HADOOP by default. Several works have reported the use of cryptographic algorithms to encrypt data before storing it in HDFS. Encryption is used to provide security for sensitive information. An encryption algorithm performs various substitutions and transformations on the original message or data and transforms it into cipher text, which ultimately appears as a random message. Various cryptographic algorithms are available and used in information security. They fall into two broad types: (i) symmetric-key algorithms [4, 5] such as the Data Encryption Standard (DES) [6], Triple DES [7] and the Advanced Encryption Standard (AES) [7], and (ii) asymmetric-key algorithms [8] such as RSA [7] and Elliptic Curve Diffie-Hellman (ECDH).
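As an illustration of the symmetric-key case, the following is a minimal sketch of AES encryption and decryption using the standard Java javax.crypto API. The key size, cipher mode and sample plaintext are illustrative assumptions, not necessarily the parameters used in the experiments reported later.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

public class AesSketch {
    public static void main(String[] args) throws Exception {
        // Generate a random 128-bit AES key (illustrative key size).
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] plaintext = "sensitive record destined for cloud storage"
                .getBytes(StandardCharsets.UTF_8);

        // Encrypt: substitutions and transformations turn the message into cipher text.
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] ciphertext = cipher.doFinal(plaintext);

        // Decrypt with the same (symmetric) key.
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] recovered = cipher.doFinal(ciphertext);

        System.out.println(new String(recovered, StandardCharsets.UTF_8));
    }
}
```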

This paper is organized as follows. Section 2 depicts the framework for a HADOOP based cloud data center. Section 3 presents work related to securing BIG DATA at a HADOOP based cloud data center. Section 4 discusses the performance of encryption and decryption operations using different cryptographic algorithms for securing BIG DATA at a HADOOP based cloud data center. Finally, Sect. 5 concludes the work.

2 HADOOP Based Cloud Data Center

As shown in Fig. 1, the user data are first taken from different sources and encrypted on the server using either symmetric or asymmetric cryptographic mechanisms. After encryption, the data are stored in the cloud, i.e. in a cluster via the HADOOP Distributed File System (HDFS). In HADOOP, the NameNode (NN) is responsible for distributing the data to the DataNodes (DN). Whenever the user requests data, the server returns the encrypted data, which the user then decrypts using the corresponding keys.

Fig. 1. A HADOOP based cloud data center architecture for big data analytics
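As a rough sketch of the storage step in Fig. 1, the ciphertext produced on the server can be written to HDFS through HADOOP's FileSystem API and later read back for client-side decryption. The NameNode URI and file path below are assumptions made for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);

        byte[] ciphertext = new byte[0]; // placeholder for the encrypted user data
        Path target = new Path("/secure/data/record.enc"); // illustrative path

        // Write: the NameNode assigns blocks, which are placed on DataNodes.
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write(ciphertext);
        }

        // Read back: the encrypted bytes are returned to the user for decryption.
        byte[] fetched = new byte[(int) fs.getFileStatus(target).getLen()];
        try (FSDataInputStream in = fs.open(target)) {
            in.readFully(fetched);
        }
        fs.close();
    }
}
```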

3 Related Work

The HADOOP architecture assumes a secure network and hence no security framework is incorporated at the base level. As a first step, Park and Lee [2] introduced a secure HADOOP framework with AES based encryption/decryption. The research work cited in [9, 10] demonstrated the adoption of a Kerberos based authentication mechanism to secure the data in HDFS storage. Zhou and Wen [11] applied Ciphertext Policy Attribute Based Encryption (CP_ABE) to provide access control credentials for valid cloud users. Here, CP_ABE uses an encrypted data access control structure rather than the user's personal identity. The user can perform decryption provided the user's identification attributes match the access control structure. In this mechanism, the cipher text and the corresponding cipher key generated via the CP_ABE method are transmitted to the Namenode. The Namenode further re-encrypts the cipher text and distributes the file blocks to the Datanodes. Here, key distribution is simple and requires little user intervention due to the centralized, CP_ABE based key distribution at the Namenode. However, the original file is also sent to the Namenode for re-encryption; therefore, the security of the client file is not guaranteed. Cohen and Acharya [12] proposed an AES New Instructions (AES-NI) based encryption framework for data encryption and integrity validation by making use of a Trusted Platform Module (TPM). Further, advanced cryptographic mechanisms such as homomorphic encryption are also widely used to secure BIG DATA in cloud storage. Using fully homomorphic encryption, Jin et al. [13] devised a security mechanism for cloud storage in which agent technology is used for encryption and user authentication. However, fully homomorphic encryption may not be practical for the real-world requirements of big data scenarios. Hybrid encryption schemes have also been devised to secure data in HDFS. Lin et al. [14] proposed a hybrid encryption method in which the user's data file is symmetrically encrypted with a unique key k, and this k is then asymmetrically encrypted with the owner's public key.

To encrypt users' files, this mechanism initially uses the DES algorithm with a generated 'data key'. RSA is then used to encrypt this 'data key', and the user keeps the private key needed to decrypt it. Yang et al. [15] assumed that the private key generated using RSA is still vulnerable; consequently, they used IDEA (International Data Encryption Algorithm) to further encrypt the secret key. Although this hybrid encryption method secures the data, it increases the computational complexity to some extent. Saini and Naveen [6] presented a steganography based hybrid scheme to make the encrypted data completely invisible to outside users.
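The hybrid idea described above can be sketched as follows: a symmetric 'data key' encrypts the file, and the data key itself is encrypted with the owner's RSA public key. This is only an illustrative sketch using DES and RSA defaults from the Java javax.crypto and java.security APIs, not the exact construction of the cited works.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class HybridEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // 1. Generate a symmetric 'data key' and encrypt the file contents with DES.
        SecretKey dataKey = KeyGenerator.getInstance("DES").generateKey();
        Cipher des = Cipher.getInstance("DES");
        des.init(Cipher.ENCRYPT_MODE, dataKey);
        byte[] encryptedFile =
                des.doFinal("user file contents".getBytes(StandardCharsets.UTF_8));

        // 2. Encrypt the data key with the owner's RSA public key.
        KeyPair owner = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, owner.getPublic());
        byte[] wrappedKey = rsa.doFinal(dataKey.getEncoded());

        // 3. Only the holder of the RSA private key can recover the data key.
        rsa.init(Cipher.DECRYPT_MODE, owner.getPrivate());
        SecretKey recovered = new SecretKeySpec(rsa.doFinal(wrappedKey), "DES");

        des.init(Cipher.DECRYPT_MODE, recovered);
        System.out.println(new String(des.doFinal(encryptedFile), StandardCharsets.UTF_8));
    }
}
```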

4 Comparative Study of Cryptographic Algorithms Over HADOOP Based Cloud Data Center

Table 1 depicts the comparison of encryption and decryption of differing file sizes using various cryptographic algorithms; the time unit is seconds (s). From our experiments, we observed that RSA could not handle files of larger sizes (the considered file sizes being 50, 100 and 150 MB). Hence, the results reported here are for file sizes of 512, 1024, 2048 and 4096 bytes. Encryption and decryption using the AES and DES algorithms also performed well for these smaller file sizes (512, 1024, 2048 and 4096 bytes), as shown in Fig. 2. Thus, AES, DES, EC and ECDH are considered for the HADOOP based experiments.

Table 1. Comparison across RSA, AES, DES, EC and ECDH with respect to encryption and decryption times

Fig. 2. Encryption and decryption times for cryptographic algorithms
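A minimal sketch of how such timings can be gathered for one symmetric algorithm is shown below; it simply encrypts and decrypts random buffers of the tested sizes and records the elapsed time. The actual measurement harness behind Table 1 is not specified here, so this procedure is an assumption.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.SecureRandom;

public class TimingSketch {
    public static void main(String[] args) throws Exception {
        int[] sizes = {512, 1024, 2048, 4096}; // file sizes in bytes, as in Table 1
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        Cipher cipher = Cipher.getInstance("AES");
        SecureRandom random = new SecureRandom();

        for (int size : sizes) {
            byte[] data = new byte[size];
            random.nextBytes(data);

            long t0 = System.nanoTime();
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] ct = cipher.doFinal(data);
            double encSeconds = (System.nanoTime() - t0) / 1e9;

            long t1 = System.nanoTime();
            cipher.init(Cipher.DECRYPT_MODE, key);
            cipher.doFinal(ct);
            double decSeconds = (System.nanoTime() - t1) / 1e9;

            System.out.printf("%d bytes: encrypt %.6f s, decrypt %.6f s%n",
                    size, encSeconds, decSeconds);
        }
    }
}
```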

In order to study the behavior of the considered algorithms on HADOOP, we consider three different scenarios of HADOOP settings for securing the BIG DATA. The performance of these algorithms with respect to HDFS storage is measured in terms of the write and read speeds of files of varying sizes into and from HADOOP's HDFS. Usually, HDFS reading takes more time than writing, as the data is stored in different blocks. Further, HADOOP supports a replication factor, which indicates how many times the original data is replicated across HDFS nodes.
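A hedged sketch of how the HDFS write and read times can be measured for an encrypted file is given below. The file path and payload size are illustrative assumptions, and the cluster address is taken from the local HADOOP configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSpeedSketch {
    public static void main(String[] args) throws Exception {
        // Cluster address and replication factor come from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/bench/cipher.bin");    // illustrative path
        byte[] payload = new byte[64 * 1024 * 1024];  // 64 MB of (already encrypted) data

        long w0 = System.nanoTime();
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write(payload);                       // write time: placing all blocks
        }
        double writeSeconds = (System.nanoTime() - w0) / 1e9;

        long r0 = System.nanoTime();
        byte[] back = new byte[payload.length];
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(back);                       // read time: fetching the blocks
        }
        double readSeconds = (System.nanoTime() - r0) / 1e9;

        System.out.printf("write %.2f s, read %.2f s%n", writeSeconds, readSeconds);
        fs.close();
    }
}
```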

4.1 Scenario 1

In this scenario, the same node acts as both master and slave, and the replication factor is set to one. From Fig. 3, it can be observed that as the plain text size increases, the time required to write to and read from HDFS also increases. In this scenario, writing and reading encrypted files using EC and ECDH takes more time than with AES and DES, because EC and ECDH generate larger encrypted files.

Fig. 3. Write and read speeds in Scenario 1

4.2 Scenario 2

This scenario considers a three node cluster in which one node acts as the Namenode and the other two are Datanodes, i.e., one master node and two slaves, with the replication factor set to one. The write and read speeds in this scenario are shown in Fig. 4. The writing time increases compared to the first scenario: previously all blocks were stored on one node, whereas here the data blocks are distributed across three different nodes. The read speed is nearly the same as in the previous scenario.

Fig. 4. Write and read speeds in Scenario 2

4.3 Scenario 3

This scenario is similar to Scenario 2 except that the replication factor is set to three. The write and read speeds in this scenario are depicted in Fig. 5. The writing time increases compared with the previous scenarios because of the replication factor: in the previous scenario only one copy of the data was stored, whereas here three copies are stored. The read speed remains similar to that of Scenarios 1 and 2.

Fig. 5. Write and read speeds in Scenario 3
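For reference, the replication factor used in this scenario can be set either cluster-wide through the standard dfs.replication property or per file via the FileSystem API, as in the sketch below. Which method was used in the experiments is not stated, so this is an assumption; the file path is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");  // default replication for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // The replication factor of an existing file can also be changed explicitly.
        fs.setReplication(new Path("/secure/data/record.enc"), (short) 3);
        fs.close();
    }
}
```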

5 Conclusion

To observe the behavior of encryption and decryption operations on BIG DATA in the HADOOP environment, we used symmetric and asymmetric cryptographic algorithms over varied file sizes. From the experiments, we conclude that ECDH yielded the highest read and write speeds compared to EC, AES and DES; EC yielded the second highest speeds, while AES and DES yielded the lowest speeds for the same file block sizes. Further, we observed that the read speed remains consistent even as the replication factor increases.