Introduction

Data compression is the process of transforming a data stream into another form that is smaller than the original [1]. The stream can be a document, a bit stream, or individual bits sent over a channel. The primary goals of data compression are to reduce the size of data and to increase the transfer rate. Data compression covers a wide range of applications including data communication, data storage, and database technology. Text compression is a branch of data compression that uses lossless methods to convert a document into a form that requires less space to store. The most obvious advantage of data compression is therefore a reduced storage requirement; reducing the storage requirement is equivalent to increasing the capacity of the storage medium. Since compressed text is encoded using fewer bits, transferring compressed data from one place to another takes less time and hence yields a higher effective transfer rate. Since compression also reduces the load on I/O channels, it becomes practical to process more I/O requests per second and thus achieve higher effective channel utilization. One of the most important applications of data compression is reducing the cost of data communication in distributed networks [2]. Several systems have been proposed for text compression in the literature, most of which rely on the same principle of removing or reducing redundancy in the target text file. Redundancy can appear at the character, syllable, or word level. This principle suggests a mechanism for text compression that assigns short codes to common units, that is, characters, syllables, words, or sentences, and long codes to rare ones. In recent years, several methods have been developed for text compression. These methods can be classified into four types of techniques, namely substitution, statistical, dictionary, and context-based techniques. Substitution techniques replace longer repetitions of characters with shorter ones; a representative of this class is run-length encoding [3].

Statistical methods typically compute the probabilities of characters to produce the shortest average code length, for example Huffman coding [4] and arithmetic coding [5]. Dictionary methods substitute a substring of the text with an index or pointer code corresponding to the position of that substring in a dictionary. Representatives of these methods are LZW [6], LZ77 [7], and LZ78 [8]. The last type is context-based methods, which make minimal prior assumptions about the statistics of the text. Typically, they use the context of the text being encoded and the history of the text to provide more efficient compression. Representatives of this type are Prediction by Partial Matching (PPM) [9] and the Burrows–Wheeler Transform (BWT) [10]. Compression methods can also be divided into two basic categories, namely lossless and lossy. In lossless compression, the decompressed data is an exact replica of the original data. On the contrary, in lossy compression the decompressed data may differ from the original; typically, there is some distortion between the original and reconstructed data [11].

In this paper, we propose a dictionary-based lossless compression technique that converts 8-bit characters to 5 bits using a lookup table. The lookup table is built from the common printable characters used in English and Bangla text, and its layout follows the Zipf distribution [12, 13]. After converting the characters into 5 bits, we propose the idea of an n-sequence to create a dictionary. The dictionary is implemented by a hash function that computes the value of a dictionary entry logically. Therefore, the dictionary is a logical implementation and does not take any physical storage. Compression and decompression algorithms are proposed using the converted 5-bit stream and the n-sequence characters. We call our proposed scheme n-Sequence based m Bit Compression (nSmBC). We compared our scheme with Huffman coding [4] and LZW [6] and found that it performs better than both. We also compared the proposed scheme with WinRAR and WinZip, which also shows promising efficiency. The nSmBC technique can compress text by more than 92% of its original size. The proposed scheme can be applied not only to text compression but also to text mining [14], text encoding [15], and database compression [16]. The rest of the paper is organized as follows. The next section presents some related works, the following section explains the proposed text compression scheme, the next section describes the analytical evaluation of the scheme, the next section presents the experimental results, and the last section outlines some conclusions.

Related Works

Most text compression systems [17,18,19,20,21] are based on a dictionary at the word or character level. An n-gram based dictionary approach is presented in [17] for compressing Vietnamese text. It uses dictionaries from bigrams up to five-grams to encode a string. The compression ratio is good, but the scheme fails to work if the input string is not a traditional word of the language. In [18], the authors proposed a technique to convert the characters in the original file to binary codes. In this method, the most common characters receive the shortest binary codes whereas the least common characters receive the longest binary codes. The binary codes are derived from the estimated probability of each character within the file, using an 8-bit character word length. In [19], the authors proposed a technique that combines a word-based approach with the LZW algorithm. The method divides the input text into words and non-words and then uses them as the initial alphabet of LZW; however, separating words from non-words is a costly operation. The authors of [20] proposed a two-stage method to compress short text. The first stage transforms the source input, including the letters, numbers, spaces, and punctuation marks used in English; the second stage applies a transformation that reduces the length of the text by a fixed fraction of the length of its input. In [21], the authors introduced a word-based compression scheme built on the LZ77 algorithm and proposed and implemented different sliding-window designs and various output encodings. Other syllable-based text compression techniques have also appeared, such as those built on the Burrows–Wheeler transform. These techniques target languages whose morphology is visible in the organization of words or morphemes (e.g., German, Arabic, Turkish, and Czech), as in [22,23,24,25]. In [22], the authors introduced a lossless text compression algorithm that uses the syllable-based morphology of multisyllabic languages. The technique partitions words into syllables and then produces compact bit representations of them for compression. A genetic algorithm based technique is introduced in [23]. It determines the characteristics of syllables and stores them in a dictionary used during the compression process, so the characteristics need not be placed into the compressed data, which reduces the space used. Lansky and his colleagues [24, 25] proposed syllable-based text compression techniques, emphasizing the specification of syllables, methods for decomposing words into syllables, and the use of syllable-based compression in combination with the principles of LZW and Huffman coding.

A 6-bit representation of printable characters is presented in [26]. It converts 8-bit characters to 6 bits by partitioning the characters into 5 sets and using them in a lookup table. This method has a compression ratio that converges to around 50% and is suitable for small text files. Another system that applies the algorithm of [26] to database operations is proposed in [27].

Most of the techniques mentioned above use a dictionary and apply run-length encoding, Huffman encoding, or the LZW technique. In this paper, we introduce a new idea, namely the n-sequence based m bit compression scheme. The basic difference is that we implement the dictionary in a logical fashion, so it does not take any physical space. The scheme shows superior performance compared to existing systems.

Compression Scheme for Natural Language Text

Definition 1

(n-sequence) An n-sequence is a sequence of n characters taken from a given character set; the set of n-sequences contains all combinations of n characters from that character set. For example, if a character set contains the characters {A, B}, then the 1-sequences are <A, B>, the 2-sequences are <AA, BB, AB, BA>, and the 3-sequences are <AAA, BBB, AAB, ABA,…>. Each member of the n-sequence set is identified by a numeric index. If there are k members in an n-sequence set, the index numbers run from 0 to (k − 1). For example, the index of “BB” in the 2-sequence set above is 1.

Definition 2

(Set tag) A set tag is a group of characters tagged with a particular set number. For example, the capital letters, small letters, digits, and symbols are tagged with particular set numbers as shown in Table 1. We call this the set tag representation.

Table 1 Lookup table for set tag

Look Up Table Construction

We construct the lookup table by dividing the characters into 7 set tags, namely Set-1, Set-2, …, Set-7. Each set tag contains 25 characters. The characters are placed in the lookup table as shown in Table 1. The entries of Table 1 are organized as follows:

  1. Characters of the Bangla alphabet are placed in Set-1, Set-2 and Set-3. Set-3 also contains the Bangla digits. Positions 21–24 in Set-3 are blank and can be used in the future.

  2. Characters of the English alphabet are placed in Set-4, Set-5, Set-6 and Set-7. The English digits and special characters are placed in Set-6 and Set-7, respectively. Positions 19–24 of Set-7 are empty and can be filled with any missing characters.

  3. The remaining 7 combinations are filled with the 7 set tags as shown in Table 1.

Therefore, Table 1 is indexed by 32 code points (0 to 31). These 32 (\(2^5 = 32\)) combinations can be represented by 5 bits. Of the 32 combinations, 25 are used for converting the original 8-bit characters to 5 bits, and the remaining 7 are used to represent the set tags. Therefore, we can accommodate \((2^5 - 7) \times 7 = 175\) characters in Table 1. If we instead take 6 bits, then \((2^6 - 7) \times 7 = 399\) characters can be handled. We call this the m bit representation scheme. In the following, we use m = 5 bits to explain our proposed method. We represent any character using this encoding scheme (see Table 1) and call the result the set tag representation. For example, if we have the character stream “ABCDabcd956”, then the set tag representation is “Set4ABCDSet5abcdSet6956”. Since “A” is located in Set-4, we start with Set4 followed by “A”; since “a” belongs to Set-5, we put Set5 before “a”; and so on. Whenever a set change occurs, we insert a set number to distinguish the following characters from the others. At this stage, the representation can be seen as a variant of run-length encoding.
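To make the set tag conversion concrete, the following is a minimal Java sketch under a toy character-to-set mapping; the actual Table 1 orders characters by Zipf frequency (see below), so the mapping used here is only illustrative.

    import java.util.Map;

    public class SetTagDemo {
        // Character -> set number; an illustrative subset of Table 1
        static final Map<Character, Integer> SET = Map.ofEntries(
                Map.entry('A', 4), Map.entry('B', 4), Map.entry('C', 4), Map.entry('D', 4),
                Map.entry('a', 5), Map.entry('b', 5), Map.entry('c', 5), Map.entry('d', 5),
                Map.entry('9', 6), Map.entry('5', 6), Map.entry('6', 6));

        // Insert a set tag whenever the set of the next character changes
        static String toSetTags(String s) {
            StringBuilder out = new StringBuilder();
            int current = -1; // no set selected yet
            for (char c : s.toCharArray()) {
                int set = SET.get(c);
                if (set != current) {
                    out.append("Set").append(set);
                    current = set;
                }
                out.append(c);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(toSetTags("ABCDabcd956")); // prints Set4ABCDSet5abcdSet6956
        }
    }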

The placement of characters in the lookup table is optimized using the Zipf distribution, a discrete distribution of commonly used characters in various languages [12]. Zipf's law is an empirical law formulated using mathematical statistics. It states that, given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table; word number n has a frequency proportional to 1/n. Thus the most frequent word occurs about twice as often as the second most frequent word and three times as often as the third most frequent word [13]. The objective of using Zipf's distribution is to place the characters in Table 1 such that the minimum number of set changes occurs while handling the input string.

n-Sequence Dictionary Construction

After converting the 8-bit characters into 5 bits, we have a bit stream with 5 bits per character. For any input text T, we create this bit stream and then divide it into groups of 4 bits. If the length of the bit stream is not a multiple of 4, we append trailing bits of the last set number to make it so. Four bits give \(2^4 = 16\) different combinations. Since each output character is represented by 8 bits, we add a fixed bit pattern in front of each 4-bit group; the fixed bit pattern is 0100. Figure 1 shows an example.

Fig. 1 Adding the fixed bit pattern

After adding the fixed 4-bit pattern (0100), we get ASCII values in the range 64–79. Table 2 shows these characters along with their decimal and ASCII values. Hence, any character shown in Table 1 is mapped to characters shown in Table 2.

Table 2 Converted list of characters

Example 1

Original Text: “Test Text”

Set Representation:

Set4 T Set5 est space Set4 T Set5 ext

Decimal Representation:

28 1 29 0 7 1 24 28 1 29 0 22 1

5 bit representation:

11100 00001 11101 00000 00111 00001 11000 11100 00001 11101 00000 10110 00001

After Dividing by 4:

1110 0000 0111 1010 0000 0011 1000 0111 0001 1100 0000 1111 0100 0001 0110 0000 1111

Adding 0100 to every combination:

01001110 01000000 01000111 01001010 01000000 01000011 01001000 01000111 01000001 01001100 01000000 01001111 01000100 01000001 01000110 01000000 01001111

Corresponding ASCII Characters: N@GJ@CHGAL@ODAF@O
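The steps of Example 1 can be reproduced with the short Java sketch below. The 5-bit codes are taken from the example's decimal representation; padding the final partial group with the leading bits of the last set tag's code is our reading of the padding rule, chosen because it reproduces the example's output exactly.

    public class Example1 {
        public static void main(String[] args) {
            // Decimal codes for "Set4 T Set5 est space Set4 T Set5 ext"
            int[] codes = {28, 1, 29, 0, 7, 1, 24, 28, 1, 29, 0, 22, 1};
            int lastSetTag = 29; // Set5 is the last set tag in the stream

            StringBuilder bits = new StringBuilder();
            for (int c : codes) { // 5-bit representation
                bits.append(String.format("%5s", Integer.toBinaryString(c)).replace(' ', '0'));
            }
            // Pad to a multiple of 4 bits with leading bits of the last set tag's code
            String pad = String.format("%5s", Integer.toBinaryString(lastSetTag)).replace(' ', '0');
            for (int p = 0; bits.length() % 4 != 0; p++) {
                bits.append(pad.charAt(p));
            }
            // Prepend 0100 to each 4-bit group, giving ASCII codes 64..79
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < bits.length(); i += 4) {
                int nibble = Integer.parseInt(bits.substring(i, i + 4), 2);
                out.append((char) (0b01000000 | nibble));
            }
            System.out.println(out); // prints N@GJ@CHGAL@ODAF@O
        }
    }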

Dictionary Construction

Using the characters of Table 2, we generate a dictionary of n-sequences (see Definition 1) for different values of n. Figure 2 shows the n-sequences for n = 1, 2 and 3.

Fig. 2 n-Sequences for n = 1, 2 and 3

Key Generation

From the n-sequence dictionary we generate a key of the form < n, \((v_1, v_2, v_3, v_4, \dots)\) >, where n is the order of the n-sequence and each v is the index value in the corresponding n-sequence dictionary (see Fig. 2). We store \((v_1, v_2, v_3, v_4, \dots)\) in secondary storage and store n in main memory.

Logical Dictionary

The dictionary we generate using n-sequence generation is not stored in physical memory. We implement the dictionary using a hash function h(s). The function h() takes a string s, which is a member of the n-sequence dictionary, as input and returns the corresponding index in the dictionary. Hence the dictionary is a logical one and does not take any physical memory.

Hash Function Development

Forward Hash Function

Firstly, we assign each of the 16 characters (see Table 2) a value as follows: \(V_0 = @\), \(V_1 = A\), \(V_2 = B\), \(V_3 = C\), \(V_4 = D\), \(V_5 = E\), \(V_6 = F\), \(V_7 = G\), \(V_8 = H\), \(V_9 = I\), \(V_{10} = J\), \(V_{11} = K\), \(V_{12} = L\), \(V_{13} = M\), \(V_{14} = N\), \(V_{15} = O\), where the value of \(V_i\) is \(i\) (\(0 \le i \le 15\)).

The index value in the n-sequence dictionary for different values of n is calculated as follows.

For n = 1,

\(h(s) = V_i + 1\)

If s = “A” then \(h(\text{“A”}) = V_1 + 1 = 1 + 1 = 2\) [where i = 1].

For n = 2,

\(h(s) = (V_i * 16) + V_j + 1\)

If s = “AM” then \(h(\text{“AM”}) = (V_1 * 16) + V_{13} + 1 = (1 * 16) + 13 + 1 = 30\) [where i = 1 and j = 13].

For n = 3,

\(h(s) = ((V_i * 16) + V_j) * 16 + V_k + 1\)

If s = “BAG” then \(h(\text{“BAG”}) = ((V_2 * 16) + V_1) * 16 + V_7 + 1 = ((2 * 16) + 1) * 16 + 7 + 1 = 536\) [where i = 2, j = 1 and k = 7].

Finally, we generalize h(s) as

\(h(s) = (\cdots(((V_i * 16) + V_j) * 16 + V_k) * 16 + V_l \cdots) + 1\)

where s is a string that is a member of the n-sequence dictionary, \(V_i\) is the assigned number of the 1st character, \(V_j\) that of the 2nd character, \(V_k\) that of the 3rd character, \(V_l\) that of the 4th character, and so on.

After getting the indexes using the above hash function, we write each index character by character using Java's OutputStreamWriter class, which converts the written characters into bytes on the underlying OutputStream. Here we convert each written index to ASCII, where each character occupies 1 byte.
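A sketch of the forward hash in Java is given below; it treats each Table 2 character as a base-16 digit ('@' = 0, …, 'O' = 15) and adds 1 at the end. A long return type is used because for n = 14 or 15 (settings used later in the experiments) the index exceeds the int range. This sketch reproduces the worked values above: h("A") = 2, h("AM") = 30, h("BAG") = 536, and h("N@GJ") = 57467, the first index in Example 2 below.

    // Forward hash h(s): Horner evaluation of the base-16 digits of s
    static long h(String s) {
        long value = 0;
        for (char c : s.toCharArray()) {
            value = value * 16 + (c - '@'); // V = position of c in Table 2 (0..15)
        }
        return value + 1;
    }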

Reverse Hash Function

For n = 1,

\(h = V_i + 1\)

\(Y = h - 1 = V_1\)

\(V_1 = Y \bmod 16\)

For n = 2,

\(h = (V_i * 16) + V_j + 1\)

\(Y = h - 1 = (V_1 * 16) + V_2\)

\(V_2 = Y \bmod 16\)

\(V_1 = \lfloor Y / 16 \rfloor\)

For n = 3,

\(h = ((V_i * 16) + V_j) * 16 + V_k + 1\)

\(Y = h - 1 = ((V_1 * 16) + V_2) * 16 + V_3\)

\(V_3 = Y \bmod 16\)

\(V_2 = \lfloor Y / 16 \rfloor \bmod 16\)

\(V_1 = \lfloor \lfloor Y / 16 \rfloor / 16 \rfloor\)

Hence the general equations become

\(V_n = Y \bmod 16\)

\(V_1 = \lfloor Y / 16^{\,n-1} \rfloor\)

\(V_i = \lfloor Y / 16^{\,n-i} \rfloor \bmod 16\) for \(2 \le i \le n - 1\)
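The reverse hash can be sketched in Java as follows: it peels the base-16 digits off \(Y = h - 1\) from the last character to the first, mirroring the equations above. For example, reverseH(57467, 4) returns "N@GJ".

    // Reverse hash: recover the n-sequence string from its index
    static String reverseH(long h, int n) {
        long y = h - 1;
        char[] s = new char[n];
        for (int i = n - 1; i >= 0; i--) {
            s[i] = (char) ('@' + (int) (y % 16)); // V_i = Y mod 16
            y /= 16;                              // integer division by 16
        }
        return new String(s);
    }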

We use the idea of the n-sequence together with the m-bit representation, hence we call the scheme nSmBC (n-Sequence based m Bit Compression).

Compression and Decompression Algorithm

In this section, the compression and decompression algorithms of nSmBC are briefly described in steps. After each algorithm, an example is provided to show the working procedure.

Forward Mapping

Input: A string S to be compressed.

Output: An encoded compressed string Sc.

  Step 1: Represent S as S′ in set tag representation by adding set tags.

  Step 2: Using the lookup table, convert the string S′ into a 5-bit stream. Suppose the bit stream contains k bits at this stage.

  Step 3: d = k % 4; if (d ≠ 0), append trailing bits of the last set number to the stream until d = 0.

  Step 4: Split k into 4-bit combinations and store them.

  Step 5: Prepend 0100 to every 4-bit combination of k so that the resulting binary combinations are limited to the characters of Table 2.

  Step 6: Split k into 8-bit groups to find the corresponding ASCII characters.

  Step 7: Create the logical n-sequence dictionary using the forward hash function h() and store < n, index >.

Example 2

Original Text (Input): “Test Text”

  Step 1: Set representation: Set4 T Set5 est space Set4 T Set5 ext

  Step 2: Decimal representation: 28 1 29 0 7 1 24 28 1 29 0 22 1

  Step 3: 5-bit representation: 11100 00001 11101 00000 00111 00001 11000 11100 00001 11101 00000 10110 00001

  Step 4: After dividing by 4: 1110 0000 0111 1010 0000 0011 1000 0111 0001 1100 0000 1111 0100 0001 0110 0000 1111

  Step 5: Adding 0100 to every combination: 01001110 01000000 01000111 01001010 01000000 01000011 01001000 01000111 01000001 01001100 01000000 01001111 01000100 01000001 01000110 01000000 01001111

  Step 6: ASCII representation: N@GJ@CHGAL@ODAF@O

  Step 7: Generate n-sequences to get < n, index > (n = 4 used here): < 4, (57467, 904, 7184, 16737, 63422) >

Backward Mapping

Input: Compressed String, Sc.

Output: Uncompressed original string, S

  Step 1: Represent the string Sc by its corresponding < n, index > pairs using the reverse hash function.

  Step 2: From each < n, index > pair, find the exact n-sequence character combination and store it in Sc.

  Step 3: From Sc, find the corresponding binary combinations from the ASCII table (Table 2) and store the resulting binary bits in k.

  Step 4: Remove 0100 from every 8-bit binary combination.

  Step 5: From the remaining bit stream, take 5 bits at a time and represent them by the characters of the lookup table (Table 1).

  Step 6: Remove the set numbers to get the original string S.
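Steps 2–5 can be sketched in Java as below, reusing the reverseH() sketch from the reverse hash section. The number of original 5-bit codes (codeCount) is assumed to be known so that the padding bits appended during compression can be discarded; with the indexes of Example 3 below, n = 4, and codeCount = 13, the sketch recovers the decimal codes 28 1 29 0 7 1 24 28 1 29 0 22 1.

    // Rebuild the 5-bit codes from the <n, index> pairs
    static int[] backwardMap(int n, long[] indexes, int codeCount) {
        StringBuilder bits = new StringBuilder();
        for (long idx : indexes) {
            for (char c : reverseH(idx, n).toCharArray()) {
                int nibble = c - '@'; // Step 4: drop the fixed 0100 prefix
                bits.append(String.format("%4s", Integer.toBinaryString(nibble)).replace(' ', '0'));
            }
        }
        // Step 5: regroup the remaining bits into 5-bit codes
        int[] codes = new int[codeCount];
        for (int i = 0; i < codeCount; i++) {
            codes[i] = Integer.parseInt(bits.substring(i * 5, i * 5 + 5), 2);
        }
        return codes;
    }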

Example 3

Compressed String: < 4, (57467, 904, 7184, 16737, 63422) > 

  Step 1: Corresponding string in the n-sequence dictionary: N@GJ@CHGAL@ODAF@O

  Step 2: 8-bit representation: 01001110 01000000 01000111 01001010 01000000 01000011 01001000 01000111 01000001 01001100 01000000 01001111 01000100 01000001 01000110 01000000 01001111

  Step 3: Removing 0100 from every 8-bit combination: 1110 0000 0111 1010 0000 0011 1000 0111 0001 1100 0000 1111 0100 0001 0110 0000 1111

  Step 4: 5-bit representation: 11100 00001 11101 00000 00111 00001 11000 11100 00001 11101 00000 10110 00001

  Step 5: Decimal numbers corresponding to the 5-bit groups: 28 1 29 0 7 1 24 28 1 29 0 22 1

  Step 6: Corresponding set representation: Set4 T Set5 est space Set4 T Set5 ext

  Step 7: Original text: Test Text

The proposed nSmBC compression algorithm is designed for all the natural characters found on a standard keyboard. These include English characters, special characters, and Bangla characters; hence we believe we have handled special characters as well as natural English and Bangla characters. Moreover, the algorithm can easily be extended to other characters, as the last column of the lookup table (Table 1) contains blank cells that can be used for them. The lookup table can also be extended to accommodate further characters (if necessary) for compression with the nSmBC algorithm.

Analytical Evaluation

In this section, we present the analytical evaluation of the proposed scheme. Table 3 shows the parameters used in the analysis. Some parameters are provided as input while others are derived from the input parameters. All lengths and sizes are in bits.

Table 3 Parameters for analytical evaluation

Therefore,

\(\eta = \frac{q \alpha}{N \cdot 8} = \frac{v q \alpha}{n N \cdot 8} = \frac{S_2 \alpha}{\beta n N \cdot 8} = \frac{(N m + \varepsilon) \alpha}{\beta n N \cdot 8}\)

[We assume the size of the set tags is negligible, \(\varepsilon \approx 0\).]

\(= \frac{N m \alpha}{\beta n N \cdot 8} = \frac{m \alpha}{\beta n \cdot 8} = \frac{m \cdot 8}{\beta n \cdot 8}\) [\(\alpha\) = 1 byte = 8 bits]

\(\eta = \frac{m}{\beta n}\)

Using the above equation, we evaluated the trend of η with varying values of n (6 to 15). Figure 3 shows the analytical result.

Fig. 3 Analytical evaluation of the compression ratio for different n-sequence values
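As a quick check of the formula, take m = 5 and assume β = 4 (each Table 2 character carries 4 payload bits, consistent with the derivation above; the actual value of β is given in Table 3). Then for n = 15:

\(\eta = \frac{m}{\beta n} = \frac{5}{4 \times 15} \approx 0.083, \qquad \sigma = 1 - \eta \approx 92\%\)

which agrees with the experimental value reported in the next section.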

The performance of the proposed nSmBC scheme depends on the value of n, i.e., if the length of the n-sequence is large, the performance is better. The performance also depends on the values of m and β. If m is large, performance degrades, but if m is very small, the number of characters that can be accommodated in the lookup table becomes small; therefore, a moderate value of m is necessary. If the value of β increases, performance also improves, but as β increases, the number of characters in Table 2 increases as well, and hence the number of characters used to generate the n-sequences becomes large.

Experimental Results

We have implemented a prototype of our proposed algorithm in Java using the NetBeans IDE 8.2 with the parameter values shown in Table 4. In this section we present the experimental results. In our experiments, we used the Microsoft Research (MSR) Abstractive Text Compression Dataset, which is available at [28]. The MSR dataset contains a substantial amount of text collected from diverse sources and genres, including business letters, newswire, journals, and non-fiction academic publications such as PLoS Medicine, an open-access journal, from the Open American National Corpus. The MSR text maintains a strong correlation with human judgments of meaning. Therefore, we believe the dataset is a good sample of natural text. Since our main concern is compressing natural text, we used the MSR dataset in our experimental evaluation. The details of the dataset can be found in [29]. Table 5 describes the dataset.

Table 4 Parameters for experimental evaluation
Table 5 Dataset description

Figure 4a shows the experimental results for the compression ratio with varying values of n for nSmBC. It demonstrates that as the value of n increases, the value of η decreases. For n = 15, η drops to 0.08, which means σ is 92%, as shown in Fig. 4b. When n increases, η decreases because η depends mainly on n, m and β: with a larger n, more characters are covered by a single index. Hence η decreases and σ increases, as shown in Fig. 4. This is what we showed in the analytical evaluation (see Fig. 3); hence, the experiments validate our analytical model.

Fig. 4 a Compression ratio for different n-sequence values. b Savings of space for different n-sequence values

We compare our proposed technique with the well-known Huffman technique [4] and LZW [6]. The comparison of compressed file sizes with LZW and Huffman is shown in Fig. 5a. The results for nSmBC are shown for n = 6, 8, 14 and 15. The nSmBC outperforms the Huffman technique for n = 6, 8, 14 and 15. LZW shows good results, but the nSmBC scheme outperforms LZW for n = 14 and 15. In all cases, Huffman shows the worst result.

Fig. 5 a Comparison of compressed file size with LZW and Huffman. b Comparison of compressed file size with WinZip and WinRAR

We also compare our technique with two industrial systems, WinZip and WinRAR. WinZip and WinRAR compress all character sets as well as images. The scope of this paper is compressing natural text; image compression is a separate research issue and is therefore out of the scope of this paper. The MSR dataset used in the experimental evaluation is pure natural text, and the same dataset was used to evaluate all the compared schemes, namely the proposed nSmBC, WinZip, and WinRAR.

Figure 5b shows the comparison with WinZip and WinRAR for compressed file size. The nSmBC performs better than WinZip and WinRAR for n = 14 and 15. WinZip performs well for small \(S_1\); as \(S_1\) increases, nSmBC performs better even for n = 8.

The comparison of η with LZW and Huffman is shown in Fig. 6a. The results for nSmBC are shown for n = 6, 8, 14 and 15. Huffman shows the poorest result among the schemes. The reason is that Huffman derives its codes from the frequency of occurrence of the possible values of the source symbols [4]; if the data is quite large, a large number of individual symbols is created. As a result, it shows a poor η compared to the others, ranging from 0.35 to 0.4 (σ = 60–65%). The results demonstrate that nSmBC provides the best compression ratio, reaching η = 0.08 (σ = 92%) for n = 15. The other values of n also provide good performance.

Fig. 6 a Comparison of compression ratio with LZW and Huffman. b Comparison of compression ratio with WinZip and WinRAR

Figure 6b shows the comparison of η with WinZip and WinRAR. WinRAR shows η = 0.10 (σ = 90%) initially, but as \(S_1\) increases, η degrades to 0.15 (σ = 85%). WinZip shows a similar range of η values to WinRAR at the initial stage, though not as good. As \(S_1\) increases, its compression ratio fluctuates between 0.16 and 0.20 (σ = 80–84%). The nSmBC shows better performance and outperforms WinZip and WinRAR for n = 14 and 15. Finally, we conclude that nSmBC outperforms the other techniques.

Figure 7 shows the compression times of the different algorithms. We use Java's currentTimeMillis() to measure the times of the nSmBC, LZW and Huffman methods. Figure 7 shows that nSmBC is faster than Huffman. Initially LZW performs poorly, but as the file size increases, it shows a similar range of times to nSmBC. WinZip and WinRAR give the best results, i.e., these two tools need very short run times. We think this is the only shortcoming of our algorithm with respect to WinZip and WinRAR; the time issue can be mitigated by running the algorithm on high-configuration computers.

Fig. 7 Comparison of compression time for LZW, Huffman, WinZip and WinRAR
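A minimal sketch of such a measurement is shown below, where compress() is a hypothetical stand-in for the nSmBC, LZW, or Huffman implementation being timed.

    long start = System.currentTimeMillis();
    byte[] compressed = compress(inputBytes); // hypothetical entry point of the method under test
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.println("Compression time: " + elapsedMs + " ms");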

Conclusion

In this paper, we present a novel method for text compression. The paper proposes the idea of n-sequences and the construction of a logical dictionary. The large dictionary is implemented by a hash function. The proposed nSmBC takes 5 bits for each character using a lookup table. Analytical and experimental results are presented to show the superiority of the scheme. The scheme is able to compress text by up to 92% on a diverse web-based dataset. It shows superior performance to existing schemes, namely LZW and Huffman, as well as to WinZip and WinRAR. The technique can easily be utilized to compress large amounts of natural language text. Both the forward and backward mapping algorithms are presented. The technique can also be used in parallel processing environments, together with load balancing, to achieve promising encoding times. We believe nSmBC is an efficient compression algorithm that has the potential to compete with existing text compression techniques.