1 Introduction

The exchange of data is now frequent in e-governance applications. Textual data, such as legal data and citizens' personal data, flows between different departments in e-governance. If any leakage occurs during transit, security properties such as confidentiality are not preserved. The confidentiality of sensitive data must therefore be ensured during transmission from the sender to the receiver. To counter threats to confidentiality and other security properties, encryption is widely used. Traditional encryption systems can be divided into two categories: symmetric and asymmetric methods [1]. Recent studies, however, have argued against the applicability of the symmetric key concept for encoding textual information. Consequently, the asymmetric key concept is a good choice for encrypting textual data. From a different viewpoint, encoding methods can also be of two types: encoding a selected portion of the original text and encoding the whole text. Both methods have their benefits and drawbacks. Full encryption methods are not suitable for resource-constrained environments [2]. Whole-text encoding necessarily consumes more computation time than selective encoding, and the speedup factor is also a major consideration [3]. In selective encoding, the speed of encryption is much higher for huge amounts of data produced from different sources, while the same level of security as whole-text encryption is maintained for the selected data. In our proposed method, we combine the benefits of the asymmetric key method and the selective encoding approach to design a robust and secure encryption system. Regular expressions are used to select the segment of textual data from a text given as user input, and RSA cryptography is then used to encrypt the selected segment.
In our research study, 1024-bit RSA is used for the encryption process. RSA is one of the best-known algorithms in asymmetric key cryptography [4]. The steps of the RSA algorithm are defined in [5]. In our study, the predefined function \(rsa.encrypt(Org\_msg,Pub\_Key)\) of the Python-RSA module [6], a pure-Python RSA implementation, is used with a 1024-bit key for encryption. In the decoding step, \(rsa.decrypt(Enc\_msg,Priv\_Key)\) is used to recover the original text, where \(Org\_msg\) denotes the original message, \(Enc\_msg\) denotes the encrypted message, \(Pub\_Key\) denotes the public key of the receiver and \(Priv\_Key\) denotes the private key of the receiver. The message is encoded to UTF-8 before encryption and decoded from UTF-8 after decryption.
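
The encode, encrypt, decrypt, decode sequence described above can be illustrated with a minimal textbook RSA sketch. This is not the Python-RSA implementation used in the study: the two primes, the names `encrypt` and `decrypt`, and the absence of padding are assumptions made only to keep the example self-contained, and the roughly 150-bit modulus is far too small to be secure.

```python
# Toy textbook RSA, for illustration only: fixed small primes, no padding.
# This is NOT the Python-RSA implementation used in the study.

P = 2**61 - 1   # Mersenne prime (hypothetical choice for the demo)
Q = 2**89 - 1   # Mersenne prime (hypothetical choice for the demo)
N = P * Q       # public modulus, about 150 bits
E = 65537       # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def encrypt(org_msg: str, e: int = E, n: int = N) -> int:
    """Encode the text as UTF-8, map the bytes to an integer, and compute m^e mod n."""
    m = int.from_bytes(org_msg.encode("utf8"), "big")
    if m >= n:
        raise ValueError("message too long for this toy modulus")
    return pow(m, e, n)


def decrypt(enc_msg: int, d: int = D, n: int = N) -> str:
    """Compute c^d mod n and decode the recovered bytes from UTF-8."""
    m = pow(enc_msg, d, n)
    return m.to_bytes((m.bit_length() + 7) // 8, "big").decode("utf8")


cipher = encrypt("July 4, 1776")
plain = decrypt(cipher)
```

With the Python-RSA module, the corresponding steps would be `rsa.newkeys(1024)` to generate the key pair, followed by `rsa.encrypt` and `rsa.decrypt` on byte strings; that library also applies randomized padding, which this sketch omits.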

2 Our Contribution

Selective encryption is rarely applied to text. Our main contribution is that a chosen portion of the data remains untraceable, even if an attacker manages to extract the rest of the unencrypted data. Assume that the PAN or the Aadhaar number is important information about a citizen that must be kept private. Whenever an Income Tax Return form is generated by the authority, the PAN is added to it. If an attacker obtains the PAN, he or she can obtain all the legal information pertaining to that citizen. Our aim is to encode only the PAN, while the rest of the document is not encoded. To this end, RSA with a 1024-bit key is implemented. We combine the benefits of selective encryption and an asymmetric key algorithm to design our new encoding technique. We chose selective encoding by search to reduce the time required by traditional whole-text encryption. The asymmetric key encoding scheme is then used to achieve a high level of security while maintaining the data's confidentiality. Our method can be extended and applied to secure medical records and sensitive data generated by wireless and IoT devices.
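
As a rough illustration of the selection step, a regular expression can locate PAN-like tokens in extracted text. The pattern below assumes the usual PAN layout of five uppercase letters, four digits and one uppercase letter; the function name `find_sensitive_spans` and the sample string are hypothetical and not taken from the study.

```python
import re

# Assumed PAN layout: five uppercase letters, four digits, one uppercase letter.
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def find_sensitive_spans(text: str) -> list:
    """Return a (start, end, value) tuple for every PAN-like token in the text."""
    return [(m.start(), m.end(), m.group()) for m in PAN_PATTERN.finditer(text)]


# Hypothetical sample line; the PAN shown is a made-up placeholder.
sample = "Name: A. Citizen, PAN: ABCDE1234F, Assessment Year: 2022-23"
spans = find_sensitive_spans(sample)
```

Only the returned spans would be passed to the encryption step, so the rest of the line stays readable.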

3 Literature Review

The research study [1] introduces a novel selective significant data encryption algorithm, in which a significant amount of uncertainty is added to the data as it is encrypted. The algorithm draws on natural language processing to extract the significant data from the whole text. The selective encryption technique in that study has four steps. First, special characters are removed; second, tokenization fetches all the words in the message; third, the stop words are removed. Finally, encryption is applied to the keywords, leaving the common words as they are. Both the encrypted keywords and the plain common words are sent over the network. More recently, a study [2] considered selective encryption for image and audio data in resource-constrained environments, characterized by low memory, low computation capacity and low power. It also evaluates selective encoding techniques against criteria such as tunability, visual degradation, cryptographic security, encryption ratio, compression friendliness, format compliance and error tolerance, and it categorizes selective encoding into pre-compression, in-compression and post-compression approaches. The selective significant data encryption approach [3] for text data was introduced in an earlier study. This method chooses only the relevant data from the entire message, in terms of the message's keywords, which gives the encryption procedure enough uncertainty. This improves speed and cuts down the overhead associated with encryption. The encryption is carried out with a symmetric key technique, the Blowfish algorithm. A comparison of the proposed technique, the full encoding scheme and the toss-of-a-coin method is also included in terms of the proportion of encoded text and computation time.
The study of a selective encoding scheme in [7] provides an AES-Rijndael-based encryption technique for medical data. First, a selector component is described that allows the method to be implemented on a variety of platforms with the required input size and number of rounds. Second, the original image is compressed with the Huffman algorithm to reduce its size and to cut the AES encryption time by more than half. Third, the simulation time of the AES algorithm is kept to a minimum through loop unrolling and merging techniques. The experimental study reports that this selective encoding scheme cuts the average execution time by 35% compared with the traditional AES scheme. A modified RSA method [8] has been presented with improved security for message encryption. Using three factors of n instead of two makes the proposed model harder for an attacker to break by factorization. Thus the security is raised by two levels. Because a second modulus x is passed in place of the modulus n, finding the public and private keys is difficult, and messages can be encrypted or decrypted only with these keys, which preserves message secrecy. Key generation also takes less time than in the traditional RSA method. A selective encryption technique [9] has been demonstrated that employs a secure, index-based chaotic sequence to encrypt only the chosen compressed video frames from each set of images. Simulation results and statistical analysis, based on quality analysis, key space, PSNR, mean squared error and computation time, show it to be more effective and efficient than the traditional AES and RC5 encoding algorithms.
The CMYK color model [10] has been used to create an encoding and decoding approach with four keys for conversion from text to image. This approach encrypts text characters faster. To prevent the factorization of n from revealing the factors p and q, the modified RSA algorithm [11] removes the large number n (the product of p and q) from the key. A one-digit number serves as the initial message in that experiment. According to the reported analysis, the suggested approach encrypts and decrypts faster than conventional RSA. To address slow decryption and slow key transmission, an improved homomorphic encryption method based on the Chinese remainder theorem with the Rivest-Shamir-Adleman (RSA) algorithm [12] was developed, utilizing multiple keys. It decodes the cipher text of documents better than standard RSA.

4 Proposed Algorithm

The proposed algorithm is depicted as a block diagram in Fig. 1.

Fig. 1
Two flowcharts for encryption and decryption. They read the document or the encoded file, extract text, and match it with the key. If there is no match, stop. If there is a match, apply RSA. Then the encoded or decoded file is written. The process repeats.

Block diagram of encryption and decryption process

4.1 Encrypting and Decrypting Procedure

The encryption and decryption procedures are given below.

Algorithm 1
Algorithm 1 features encryption in 12 lines. It takes the original PDF as input and generates an encoded file for the same.

Encryption Procedure

Algorithm 2
Algorithm 2 features decryption in 6 lines. It takes the encoded PDF as input and generates the original PDF as the decoded file.

Decryption Procedure
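
The flow in Fig. 1 (extract text, match the selected segment, apply RSA, write the result) can be sketched on plain strings as follows. This is a minimal sketch under several assumptions, not the published Algorithm 1 or 2: PDF reading and writing are omitted, the `<<...>>` marker for encrypted segments is a hypothetical convention, and a toy unpadded RSA with small fixed primes stands in for the 1024-bit Python-RSA calls.

```python
import re

# Toy RSA parameters (hypothetical and insecure): two small Mersenne primes.
P, Q, E = 2**61 - 1, 2**89 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

# Hypothetical marker that wraps each encrypted segment as hexadecimal text.
TOKEN = re.compile(r"<<([0-9a-f]+)>>")


def encrypt_selected(text: str, pattern: str) -> str:
    """Replace every match of `pattern` with a marked hex ciphertext."""
    def _enc(match):
        m = int.from_bytes(match.group().encode("utf8"), "big")
        if m >= N:
            raise ValueError("selected segment too long for the toy modulus")
        return "<<{:x}>>".format(pow(m, E, N))
    # If nothing matches, the text is returned unchanged (the "stop" branch).
    return re.sub(pattern, _enc, text)


def decrypt_selected(text: str) -> str:
    """Replace every marked ciphertext with the recovered plaintext segment."""
    def _dec(match):
        m = pow(int(match.group(1), 16), D, N)
        return m.to_bytes((m.bit_length() + 7) // 8, "big").decode("utf8")
    return TOKEN.sub(_dec, text)


# Hypothetical extracted text containing the segment used in the example.
original = "In Congress, July 4, 1776. The rest of the page stays readable."
encoded = encrypt_selected(original, r"July 4, 1776")
decoded = decrypt_selected(encoded)
```

A marker of this kind only works if the surrounding text never contains the same delimiter, so a real implementation would need a more robust way to record which spans were encrypted.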

5 Implementation Example

The experiment was conducted on a computer with a 3rd-generation Intel processor running at 1.70 GHz, a 500 GB HDD and 4 GB of RAM. PyCharm 2020.2 was used for the experiment, along with MATLAB R2016b for statistical analysis. Standard PDF documents were collected from web sources [13,14,15]. In the following example, the content of the PDF document is considered for analysis irrespective of its position, layout and font. The content "July 4, 1776" is selected from the second line of the text for the encrypting and decrypting process. The selective encoding mechanism is applied to the selected part "July 4, 1776", and the encrypted form of the text is written to the encoded PDF file, whose content is shown in the middle part of Fig. 2. The decrypting process converts the encoded content back to the original text "July 4, 1776", which is written to a new decoded PDF file, whose content is shown in the bottom part of Fig. 2.

Fig. 2
Three paragraphs of text titled declaration of independence in congress, July 4, 1776. The first para represents the original text. Encrypted and decrypted texts are in the second and third para respectively.

Original text, encrypted text and decrypted text

6 Analysis of Security Parameters

The dataset is composed of three standard PDF documents. The extracted portions of text are named "Data1", "Data2" and "Data3", respectively. "Data1" consists of the text "July 4, 1776", "Data2" of the text "SEMPRONIO", and "Data3" of the text "Contents".

6.1 Study of Key Space

The study of key space considers the number of variables used in the computation. A high value of this metric rules out brute-force attacks. Under the IEEE floating-point standard, the precision of a 64-bit double variable is approximately \(10^{-15}\). We have four double variables: p, q, e and d. So, the key space is about \(10^{60} \approx 2^{199.31569}\). Given this large key space, our scheme of encrypting and decrypting text is protected against brute-force attacks.
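
The conversion between the decimal and binary forms of this estimate can be checked directly, taking the text's assumption of about \(10^{15}\) distinguishable values for each of the four variables:

$$\begin{aligned} \left( 10^{15}\right) ^{4}=10^{60},\qquad \log _{2}\left( 10^{60}\right) =60\log _{2}10\approx 60\times 3.3219281\approx 199.3157 \end{aligned}$$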

6.2 Entropy

The term was first introduced by the mathematician Shannon as a metric to measure uncertainty, and it has been applied in the domain of information processing [16]. A text whose occurrence has a lower probability carries more information, and thus it has a higher information entropy [17]. For example, if the phrase "Data security" has a lower probability of appearing than the sentence "Data security is applicable to different fields", it carries more information. The entropy of a sentence indicates how much information it contains [18]. Entropy is computed as in Eq. 1 given below [19].

$$\begin{aligned} {} H(P)=\sum _{i=0}^{255}[\mathrm{{Prob}}(X_{i})\times \log (\frac{1}{\mathrm{{Prob}}(X_{i})})] {}\end{aligned}$$
(1)

In the above equation, \(Prob(X_{i})\) represents the probability of occurrence of symbol \(X_{i}\).
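
A minimal sketch of Eq. 1 over byte values is given below. It assumes a base-2 logarithm, so the result is in bits per symbol (the equation does not state the base), and the function name `shannon_entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per symbol over the 256 possible byte values (base-2 log assumed)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Byte values that never occur have probability 0 and contribute nothing,
    # so only the observed values are summed.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


h_plain = shannon_entropy("July 4, 1776".encode("utf8"))
```

For a byte-oriented measure of this kind the maximum is 8 bits per symbol, reached only when all 256 values are equally frequent, so a short string can never come close to it.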

Table 1 Study of entropy

Table 1 shows that the encrypted text has a higher entropy than the original text. A higher entropy makes the encryption scheme harder to break.

Fig. 3
9 bar graphs distributed over 3 rows and 3 columns. The first column has 3 graphs for original text. The second column has 3 graphs for encrypted text. The third column has 3 graphs for decrypted texts.

Study of histogram of Data1, Data2 and Data3

6.3 Histogram Analysis

A histogram shows the frequency of each letter or symbol that appears in the message "Msg". If the distribution of the letters or symbols is uniform, the encrypting technique resists statistical attacks [20]. The histogram of the ciphered text should differ drastically from that of the plain text and should be as evenly distributed as possible, meaning that every value is equally likely [10]. Fig. 3 depicts the histograms of the original, encoded and decoded text after conversion to ASCII. For the encrypted text, the vertical bars of the histogram are more uniform than those of the original text.
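
The counts behind such a histogram can be obtained with a few lines of Python. This sketch assumes the text has already been converted to bytes and leaves the plotting aside; the name `byte_histogram` is hypothetical.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Return 256 counts, one for each possible byte value from 0 to 255."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


hist_plain = byte_histogram("July 4, 1776".encode("utf8"))
# Byte values that occur at least once in the plain text.
used_values = [value for value, count in enumerate(hist_plain) if count > 0]
```

Comparing `used_values` and the spread of counts for the plain and cipher bytes gives a quick numerical view of what the bar charts show.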

6.4 Avalanche Effect

The "avalanche effect" is a property of an encryption method whereby changing one bit of the original text changes multiple bits of the encoded text [21]. The avalanche effect should be 0.5 under ideal circumstances [22]. It is computed as in Eq. 2 given below, where "Ctext" denotes the cipher text.

$$\begin{aligned} {} {\text {Avalanche Effect}}=\frac{{{\text {Number of Bits Flipped in Ctext}}}}{{{\text {Number of Bits in Ctext}}}} {}\end{aligned}$$
(2)
Table 2 Study of avalanche effect

From Table 2, it can be concluded that our proposed technique reaches the ideal range of the avalanche effect value, which is a property of a good encryption system.
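
Eq. 2 can be evaluated on two ciphertexts of equal length as sketched below. The function name `avalanche_effect` and the requirement of equal lengths are assumptions of this sketch.

```python
def avalanche_effect(ctext_a: bytes, ctext_b: bytes) -> float:
    """Fraction of bit positions that differ between two equal-length ciphertexts."""
    if len(ctext_a) != len(ctext_b):
        raise ValueError("ciphertexts must have the same length")
    if not ctext_a:
        raise ValueError("ciphertexts must not be empty")
    # XOR leaves a 1 in every position where the two bytes disagree.
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(ctext_a, ctext_b))
    return flipped / (8 * len(ctext_a))


ratio = avalanche_effect(b"\x0f\xf0", b"\x00\x00")
```

Here `ctext_a` and `ctext_b` would be the ciphertexts of two plaintexts that differ in a single bit.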

Fig. 4
2 line graphs for before change and after change feature highly fluctuating trends.

Study of plaintext sensitivity

6.5 Plaintext Sensitivity

The study of plaintext sensitivity shows that a small modification of the original content, even of a single bit, produces a drastic change in the encoded content. The original text "July 4, 1776" is changed to "July 4, 1777" to compute the plaintext sensitivity, and the result is given in Fig. 4. The two encoded outputs, before and after the change, are totally different. So, a change of only one digit in the original string (a single-bit change in ASCII, since "6" is 0x36 and "7" is 0x37) makes a huge change in the cipher text. The correlation between the two cipher files is -0.0276. This low correlation means that there is no meaningful linear relation between the two encoded files.
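
A correlation of this kind can be computed with the Pearson coefficient over the byte values of the two cipher files. The sketch below assumes that this is the coefficient meant and that both sequences have the same length; the source does not state which correlation measure was used.

```python
import math


def pearson_correlation(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# The single-bit difference between the two plaintexts used in the experiment.
bits_changed = bin(ord("6") ^ ord("7")).count("1")
```

Passing two `bytes` objects works directly, because iterating over bytes yields integers; a value near 0 indicates that the two ciphertexts are not linearly related.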

6.6 Computation Time Analysis

Table 3 gives the computation time, in seconds, for encoding and decoding the text file. The time analysis shows that our method consumes little CPU time and can be incorporated not only in e-governance applications but also in resource-limited environments.
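
Timings of this kind can be collected with a small helper such as the one sketched here. It measures wall-clock time with `time.perf_counter` and keeps the best of several runs; the helper name `time_call`, the repeat count and the stand-in workload are assumptions, and the source does not describe how its timings were taken.

```python
import time


def time_call(func, *args, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical workload: one modular exponentiation standing in for an encryption call.
elapsed = time_call(pow, 3, 65537, 2**127 - 1)
```

Replacing `pow` with the actual encryption or decryption function gives figures comparable in kind to those in Table 3.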

Table 3 Study of encryption and decryption time
Table 4 Comparison result of proposed text encoding with others

Table 4 shows that existing methods of text encryption lack a detailed statistical analysis in terms of metrics such as entropy and the avalanche effect, and present only the required encryption time. Our method has a high entropy and an ideal avalanche effect value with a low encryption time. Our proposed method also includes a detailed study of statistical metrics, which demonstrates its robustness against different attacks. Important metrics such as plaintext sensitivity and the histogram study have also been included in our research to qualify it as a good cryptosystem.

7 Conclusion and Future Scope

Our research study addresses text data security in e-governance applications. The asymmetric approach to encoding text is discussed in this paper using the 1024-bit RSA cryptographic algorithm. The proposed method preserves the confidentiality of data along with strong security features. Government and legal documents can be secured using the proposed encoding scheme. Important selected data, such as the account number, PAN and Aadhaar number of a citizen, can be encrypted using the proposed method and added to government documents. An attacker may obtain the document but will be unable to decrypt the selected part of the content, so the attempt at data theft fails. The security analysis demonstrates the robustness of our method against different attacks. Also, the proposed model of encrypting and decrypting a specific part of the content fetched from a PDF document takes less time than whole-text encoding. Consequently, our method is well suited to resource-limited environments. At present, the method is implemented for text in PDF documents, but it can also be applied to multimedia content such as images and video. In future, chaotic functions may be incorporated to introduce more randomness into the encoding and decoding technique. The encoding scheme can also be extended with elliptic curve cryptography. The proposed method can encrypt a segment of any length at any position, but in the context of "selective encryption", a small portion of the whole text is taken for the experiment.