1 Introduction

The rapid development of the Internet requires the lightweight of data and its storage security much more than before (Parah et al. 2015; Zhang et al. 2017; Utku Celik et al. 2005; Ruijin et al. 2016). Accordingly, users need effective tools to decrease the size of large data blocks before data transmission (Xuan et al. 2005; Tseng and Chang 2004; Guo and Liu 2012; Mali et al. 2012; Zhu et al. 2017).

The wide application of compressing tools inspires a new way for data security, which is to embed secret data in compressed files (Tseng and Chang 2004; Guo and Liu 2012; Guo and Tsai 2012; Jian et al. 2015; Xuan et al. 2004; Nikolaidis 2015). Such kind of methods toughen the process of discovering hidden data for adversaries, since a proper decompressing method is essential to obtain hidden data (Parah et al. 2015; Lee et al. 2012; Zhang et al. 2013; Yan et al. 2016; Kumar and Pooja 2010). Additionally, these methods have advantages over reducing transmission costs and simultaneously make the transmission more secure (Nikolaidis 2015; Wu et al. 2006; Chang et al. 2006, 2016; Moulin and Koetter 2005; Zhu et al. 2016; Najafi 2007; Liu and Tsai 2007).

Past 10 years has seen the development of data hiding algorithms based on compressed tools (Tseng and Chang 2004; Guo and Tsai 2012; Xuan et al. 2004; Kumar and Pooja 2010; Chang et al. 2006; Moon and Kawitkar 2007; Yadav et al. 2012). Among large number of compressing tools, Lempel–Ziv–Welch (LZW) coding is a widely applied lossless compression algorithm and Shim et al. (2004) proposed the DH-LZW scheme by one character of one symbol to hide the data. Chen and Chang (2010) proposed the HCDH-LZW scheme in which received a higher embedding capacity than Shim et al.’s scheme by shrinking the characters according to the length of the symbol used to hide the data. Wang et al. (2013) proposed the HPDH-LZW scheme by modifying the value of LZW codes to hide secret data.

Schemes based on LZW hold the advantage getting a higher capacity for data hiding as well as the convenience of secret data extracting (Shim et al. 2004; Chen and Chang 2010; Bender et al. 1996). However, on the one hand, since many optimized schemes have been proposed, which hold higher performance and prevalence, new data hiding algorithm is needed to adapt to these optimized schemes for both the less time consumption and better compression rate. On the other hand, since most of these schemes hide secret data by changing the length of symbols, we discover a detecting method which can detect the existence of hiding data in such kind of schemes. Thus, those schemes may not be considered as secure ones.

To make up for this defect, our proposed scheme, named DH-Deflate, is based on a kind of optimized and prevalence compressing algorithm, Deflate, which is widely applied in compressing tools and considered to outperform the original algorithm, LZ’77. Besides the application of Deflate, our scheme also changes the algorithm of data hiding to embed secret data in the process of altering the symbol producing process, which disables the proposed detecting method to discover hidden data.

Besides the proposed data hiding scheme, we also work on providing experiential advice for compression data hiding achievers. That is, once the size of carriers is constrained and the embedded rate is specifically required, an advice of choosing proper file formats can be generalized. Thus, an optimization is done to shed light on the relationship of data embedding rate between carrier files format and redundancy.

Our contribution:

  1. 1.

    DH-Deflate a data hiding scheme based upon Deflate algorithm is proposed to achieve data hiding tasks which owns compatibility and could be transplant to LZ’77 series algorithms. To the best of our knowledge, this is the first data hiding scheme realized in Deflate.

  2. 2.

    A longest match detecting algorithm is proposed to judge if compression files contain embedded data. This algorithm can be applied to detect most existed data hiding algorithms which achieve data hiding based on shrinking the length of the longest match. Of course, DH-Deflate is an exception for our detecting algorithm.

  3. 3.

    An optimization is done to provide advisable inspiration to data hiding work achievers. They can achieve data hiding in a more purposive and convenient way by utilizing the optimization in this paper.

The following section is devoted to an overview of data hiding algorithms based on compressed file and our proposed scheme. In Sect. 2, we will review several related compressing algorithms and data hiding schemes. Section 3 proposes a longest match detecting algorithm for detecting the existence of LZW-based hidden data. In Sect. 4, we propose our data hiding scheme. The experimental results and optimization are illustrated in Sect. 5. Limitation and future work are described in Sect. 6, and Sect. 7 shows our conclusion.

2 Preliminaries

In this section, LZW compression algorithm and its several data hiding schemes including DH-LZW scheme, HCDH-LZW scheme, HPDH-LZW scheme are introduced. Deflate compression algorithm is also briefly described.

2.1 LZW algorithm

LZW is a lossless data compression algorithm proposed by Abraham Lempel, Jacob Ziv, and Terry Welch, which codes variable-length symbols with fixed-length codes in the dictionary (Welch 1984). During encoding phase, sequences which do not exist in dictionary will be regarded as a new symbol and inserted into the next unused location of dictionary; then, the sequences are partly coded and outputted. In LZW algorithm, the dictionary initializes with ASCII values 0–255; thus, the shortest code length is 9 bits (Wu et al. 2006). During decoding phase, encoded data are decoded according to its dictionary. Since this phase can build a same dictionary with encoding phase automatically, receivers do not need to synchronize its dictionary with senders.

2.2 DH-LZW data hiding scheme

The DH-LZW scheme is a data hiding algorithm based on LZW algorithm which is proposed by Shim et al. (2004). Via dropping the last character of the symbols, secret data can be embedded into the compressed files by bytes.

Table 1 Data hiding and data extracting phases of DH-LZW

The basic idea of DH-LZW is to modify symbol’s length during the encoding phase until the least significant bit of symbol’s length is equal to embedding secret bit, which means if the least significant bit of symbol’s length equals to the hidden bit, the symbol will remain unchanged, else drop the last character of symbol to let them consistent. Thus, during the decoding phase, the embedded data can be extracted by figuring out the least significant bit of each symbol’s length.

Accounting for algorithm’s efficiency (Wang et al. 2013), the DH-LZW scheme sets a threshold called THD to judge if a symbol is available to hide secret bit, a symbol cannot be used to hide secret bit if its length is not larger than THD. When extracting embedded bits, if a symbol’s length is not larger than THD, this symbol’s least significant bit would not be outputted to the secret data files.

Example 1

Set the source file ”aabbaaabbbabbbabbaaa,” THD is 3, and secret file is ”001.”

An example of DH-LZW schemes is given in Example 1, and its specific results are shown in Table 1. In encoding phase, before the 263rd item is generated, the length of symbol ”abbb” is 4, which can embed a secret bit. The first secret bit is ”0” which is not equal to (4–1); thus, the last character of symbol ”abbb” is dropped. Similarly, when getting the 264th item, symbol ”bbab” drops the character ”b” to hide the second secret bit ”0.” And before the 265th item is added to the dictionary, symbol ”abba” whose length is 4 can be used to hide a secret bit. The third secret bit ”1” equals to (4–1), and the symbol remains unchanged.

In decoding phase, the 263rd and 264th items have already existed in the dictionary, and we can get the secret bit ”0” from the least significant bit of (3–1). The length of 265th item is 4, from which the secret bit ”1” can be extracted since the least significant bit of (4–1) is 1.

2.3 Optimized schemes of DH-LZW

Some optimized schemes of DH-LZW have been proposed in the last several years. Chen and Chang (2010) proposed the HCDH-LZW scheme, which enlarged the data hiding capacity of the original scheme. The main idea of HCDH-LZW scheme is to shrink the symbol according to the symbol’s length, \(m = {\hbox {log}}_2 (n-1)\), where  m is how many bit can be embedded in one symbol and n is symbol’s length. For example, if symbol’s length is 2 or 3, it can be used to hide 1 bit, or if the length is between 4 and 7, 2 bits can be hided in the symbol. Wang et al. (2013) proposed the HPDH-LZW scheme, which realized data hiding by modifying LZW code according to the hidden bit. For example, if the hidden bit is ”0,” the output LZW code will remain equal to the original LZW code; however, if the hidden bit is ”1,” the output LZW code will be the sum of original LZW code and currently dictionary size.

2.4 Deflate algorithm

Deflate algorithm is an optimize compressing algorithm based on LZ’77 algorithm (Lonardi et al. 2007; Lonardi and Szpankowski 2003). Different from LZW algorithm, Deflate algorithm compresses files by representing sequential symbols with its longest match’s distance and length pairs in sliding window. A sliding window is a buffer which stores the compressed data blocks.

During the encoding phase, at very beginning, sliding window fills itself with data blocks and encodes them directly. Then, sliding window scans source files and encodes new data block with longest matches buffered in the window.

Table 2 Compression phase and decompression phase of Deflate

Accounting for algorithm’s efficiency, if a symbol’s longest match length is less than a threshold (usually set to 3), encoder will skip this symbol, else the encoder calculates the distance between the longest match and symbol and conducts a [distance, length] pair. Then, the [distance, length] pair will be encoded to Huffman codes as output. Finally, the sliding window slides ”distance” length and encodes the following data blocks.

In decoding phase, [distance, length] pairs are decoded according to the slide window. A simple example of the Deflate algorithm is presented in Table 2 for providing a better understanding of Deflate.

Example 2

The size of sliding window is assumed to be 5. The source file is assumed to be “aabbaaabbbabbbabbaaa.” [5, 4] means distance is 5 and length is 4.

It can be seen from Table 2, in the compression phase, before the sliding window is fully filled, the input character is coded directly. The window then begins to slid and generates [distance, length] code pairs. The encoder tries to find the longest match and express it with [distance, length]. For those whose length of longest match is shorter than 3, the encoder codes it directly. In the data extracting phase, a single character is directly decoded and the [distance, length] pairs are transformed to strings according to the window.

3 Detect data hiding behavior of altering symbol length

3.1 Longest match detecting algorithm

The proposed longest match detecting algorithm can be applied to detect whether LZW compressed files contain secret data. In LZW algorithm encoding phase, all items in the dictionary are unique, which means each new symbol must be a longest match. However, in DH-LZW algorithm encoding phase, not all items are unique, since symbol may be modified for data hiding. Thus, new symbol may not always be a longest match. The possibilities that a symbol is a longest match or not are approximately equal, which means the more the secret bits, the higher the possibility we can discover a non-longest match. For practical reason, we always try to embed as much secret data as possible; in such situation, our detecting scheme obtains a very high possibility of detecting if a file contains hidden data. The algorithm is summarized as shown in Algorithm 1. With this algorithm, rivals can easily figure out if there are hidden data.

figure a

3.2 Examples of longest match detecting algorithm

To show how the longest match detecting algorithm works, an example is given in Table 3. The source files and secret files are all completely same with the corresponding examples given in Sect. 2. Example 3 is the longest match detection for DH-LZW.

Table 3 Longest match detection for DH-LZW

For HCDH-LZW, since this scheme also embeds secret bit via changing symbols’ length, the longest match detection can also detect whether there are hidden bits or not, just as DH-LZW does.

For HPDH-LZW, this scheme does not change symbols’ length, and the long longest match detection cannot find the hidden bits. However, as it has been introduced before, rivals can easily use other methods to find the hidden data.

It can be seen that longest match detection for DH-LZW and HCDH-LZW can detect if there is a hidden bit if only there is a change of symbol length.

4 The secure data hiding scheme for Deflate codes

Instead of embedding secret bits via shrinking symbols, our proposed scheme embeds secret bits into the longest match generating process. Thus, this scheme could confirm that all symbols generated will not be shrunken and remained as longest matches.

Fig. 1
figure 1

Embedding strategy of the proposed scheme

4.1 Data embedding

The encoding phase of the proposed scheme is shown in Fig. 1. \(n\) denotes the longest match’s length, the generating phase of which is controlled to hide a secret bit. The encoder scans the whole sliding window from beginning side to end and finds out all of the longest matches in the sliding window (noting \({T_1},{T_2}, \ldots \)), their lengths are calculated (marked in \({n_1},{n_2}, \ldots \)), and those whose length is too short are dropped. Different from the original Deflate algorithm where \(l = {\left\{ {{n_1},{n_2}, \ldots } \right\} _{\max }}\), the longest match is select from (\({T_1},{T_2}, \ldots \)) in the principle of \(l = {\{ {n_k}\left| {} \right. {n_k}(\mathrm{mod}2) = b\} _{\max }}\), \(T = \{ \) \({T_k}\left| {{n_k} = l,{d_k} = } \right. \)

\({\left\{ {{d_i}\left| {{n_i} = l} \right. } \right\} _{\min }}\} \). Thus, each successful match phase can embed a hidden bit. If \(T = \phi \), the current position’s character will be coded singly. The data hiding algorithm (DH-Deflate) is summarized as shown in Algorithm 2.

figure b

4.2 Data extraction

In the data extracting procedure, assuming the current length of longest match is C’, if C’ is larger than 2, the secret bit is equal to the least significant bit of C’. Each [distance, length] pair contains a secret bit. For example, if C’ is 6, then the secret bit is ”0”; if C’ is 5, then the secret bit is ”1.” The data extracting algorithm is summarized as shown in Algorithm 3.

figure c
Table 4 Data hiding and data extracting phases of DH-Deflate

To show our proposed scheme more intuitively, an example is given in Table 4. The compressing phase’s 6th–11th input is given in Table 4.

Example 3

Assuming that the size of slid window is 5, the source file is ”aabbaaabbbabbbabbaaa.” The secret bits ”001,” [5,3] means distance is 5 and length is 3.

In Table 4, when encoder is going to compress the 6th input character, the longest match is “aabb,” the length of “aabb” is 4, and the least significant bit of which is equal to the first secret bit “0,” which means it is available to hide the secret bit “0.” The 8th input has the longest match “abbb,” and the least significant bit of “abbb” is “0” which is equal to the second secret bit. The 9th input has the longest match “abb,” least significant bit of which is “1” equaling to the third secret bit. In data extracting phase, the decoder finds all of [distance, length] pairs and their parities of length are equal to the secret file “001.”

4.3 Performance discussion

This section deduces the performance of the proposed scheme. We analyze \(M_{n}\) for a binary source, \(X = {X_1}{X_2}\).

\({X_3}\ldots \), where the \(x_{i}\) are random variables on the binary alphabet with \(Pr(x_{i}=0)=p_{i}\) and \(Pr(x_{i}=1)=q_{i}\). \(Y = {Y_1}{Y_2}{Y_3} \cdots \)is the hiding data series where the \({Y_j}\) are random variables on the binary alphabet with \(Pr({Y_j} = 0) = {p_j}'\) and \(Pr({Y_j} = 1) = {q_j}'\). Our goal is to figure out the probability of a successful data hiding for DH-Deflate. Firstly, assuming that the size of sliding window is \(w\) byte, we conclude (1), in which \(Pr\left( {\mathop = \limits _m n\,~byte} \right) \) presents the longest match’s length in position \(m\) byte(\(m\) is in the sliding window).

$$\begin{aligned}&Pr\left( \mathop = \limits _m n~byte \right) \nonumber \\&\quad = \mathop {{\varPi }}\limits _{v = 0}^{8n - 1} ({p_{m + v}}{p_{i + w + v + 1}} + {q_{m + v}}{q_{i + w + v + 1}}) \end{aligned}$$
(1)

Nextly, we try to work out the probability of the longest match’s length equals to \(n\), which is marked as \(Pr(\mathop = \limits _{\max } n~byte)\), and here we conclude (3). The \(Pr(\mathop < \limits _t n~by\) \(te)\) in (3) is deduced from (1) which is represented in (2).

Fig. 2
figure 2

Five test images

Table 5 Comparison to DH-LZW scheme for text and image files
$$\begin{aligned}&Pr\left( {\mathop < \limits _m n\,~byte} \right) = \mathop {\varSigma } \limits _{r = 1}^{n = 1} Pr\left( {\mathop = \limits _m r~byte} \right) \end{aligned}$$
(2)
$$\begin{aligned}&Pr\left( {\mathop = \limits _{\max } n\,~byte} \right) = \mathop {\varSigma } \limits _{m = 1}^w Pr\left( {\mathop = \limits _m n~byte} \right) \frac{{\mathop {\varPi } \limits _{t = 1}^w Pr(\mathop< \limits _t n\,~byte)}}{{Pr(\mathop< \limits _m n\,~byte)}} \nonumber \\&\qquad + \mathop {\mathop {\varSigma }\limits _{{m_1} = 1}^w \mathop {\varSigma } \limits _{{m_2} = 1}^w }\limits _{({m_1} \ne {m_2})} Pr\left( {\mathop = \limits _{{m_1}} n~byte} \right) Pr\left( {\mathop = \limits _{{m_2}} n~byte} \right) \ \nonumber \\&\qquad \frac{{\mathop {\varPi } \limits _{t = 1}^w Pr(\mathop< \limits _t n\,~byte)}}{{Pr(\mathop< \limits _{{m_1}} n\,~byte)Pr(\mathop< \limits _{{m_2}} n\,~byte)}} + \cdots \nonumber \\&\qquad + \mathop {\mathop {\varSigma } \limits _{{m_1} = 1}^w \mathop {\varSigma } \limits _{{m_2} = 1}^w \cdots \mathop {\varSigma } \limits _{{m_w} = 1}^w }\limits _{({m_a} \ne {m_b},if\,a \ne b)} Pr\left( {\mathop = \limits _{{m_1}} n~byte} \right) \nonumber \\&\cdots Pr\left( {\mathop = \limits _{{m_w}} n~byte} \right) \frac{{\mathop {\varPi } \limits _{t = 1}^w Pr(\mathop< \limits _t n\,~byte)}}{{Pr(\mathop< \limits _{{m_1}} n\,~byte) \cdots Pr(\mathop < \limits _{{m_w}} n\,~byte)}} \end{aligned}$$
(3)

To figure out the probability of a successful data hiding for DH-Deflate in condition that the longest match’s length is \(n\), we calculated \(Pr(n~(\bmod 2)) = {Y_j}'\) in (4), and then, we get the \(Pr(\mathop {success}\limits _n )\) in (5). Finally, our aim \(Pr(\mathop {success} )\) is deduced in (6). As it has been shown above, the success rate of DH-Deflate is correlated with both the carrier source and the hiding data series.

$$ \begin{aligned} \begin{aligned}&Pr(n~(\bmod 2) = Y{_j}') = Pr(({Y_j} = 0) \& (n\mathop = \limits _{\max } 4,6,8 \cdots )) \\&~~~+ Pr(({Y_j} = 1) \& (n\mathop = \limits _{\max } 3,5,7 \cdots )) \\&= p{_j}'(Pr(\mathop = \limits _{\max } 4) + Pr(\mathop = \limits _{\max } 6) + Pr(\mathop = \limits _{\max } 8) \cdots ) \\&~~~+ q{_j}'(Pr(\mathop = \limits _{\max } 3) + Pr(\mathop = \limits _{\max } 5) + Pr(\mathop = \limits _{\max } 7) \cdots ) \end{aligned} \end{aligned}$$
(4)
$$ \begin{aligned} \begin{aligned}&Pr(\mathop {success}\limits _n ) = Pr((\mathop = \limits _{\max } n~byte) \& n(\bmod 2) = {Y_j}') \\&~~~ + Pr((\mathop { = }\limits _{\max } n~byte) \& (n(\bmod 2) \ne {Y_j}') \& (\mathop { = }\limits _{\mathop {\mathrm{m}}} (n - 1)~byte)) \\&~~~+ Pr((\mathop { = }\limits _{\max } n~byte) \& (n(\bmod 2) \ne {Y_j}') \\&~~~~~~~ \& (\mathop { = }\limits _{\mathop {\mathrm{m}}\nolimits } \overline{(n - 1)~byte} ) \& ( = (n - 3)~byte)) \\&~~~+ \cdots + Pr((\mathop { = }\limits _{\max } n~byte) \& (n(\bmod 2) \ne {Y_j}') \\&~~~~~~~ \& (\overline{\mathop { = }\limits _m (n - 1)~byte}) \& ((\overline{\mathop { = }\limits _m (n - 3)~byte} )) \& \cdots \\&~~~~~~~ \& \left\{ \begin{array}{l} (\overline{\mathop { = }\limits _m 5~byte} ) \& (\mathop { = }\limits _m 3~byte)(n = 3,5,7, \cdots ) \\ (\overline{\mathop { = }\limits _m 6~byte} ) \& (\mathop { = }\limits _m 4~byte)(n = 4,6,8, \cdots ) \end{array} \right. ) \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} Pr(\mathop {success}\limits _{} ) = \mathop {\varSigma } \limits _{s = 3}^\infty Pr(\mathop = \limits _{\max } s~byte)Pr(\mathop {success}\limits _s ) \end{aligned}$$
(6)

5 Experimental result

5.1 Performance evaluation via scheme comparison

The proposed scheme is implemented based on zlib ver 1.2.8, an frequently used open-source library of providing compress/decompress features, and we modify the ”longest_match()” function in Deflate.c which controls the generation of longest matches. Nearly, 300 lines of C code are added.

Table 6 Comparison to the original Deflate scheme for data hiding and data extracting time

This experiment is done on Win7 64-bit operating system, Intel(R) Core(TM) i5-3210M CPU@2.50GHz,4 GB RAM.

In our experiment, four text files and five image files are used to evaluate the performance of the proposed scheme. For the text files, ”Obama” is Obama’s speech; ”Lincoln” is Lincoln’s Gettysburg address; ”Hamlet” is one of the Shakespear’s famous works; and ”Pride” is a novel named ”Pride and prejudice” written by Jane Austen, first published in 1813. Paper-b*.txt and paper-s*.txt in the same size and format of Paper-b.txt and paper-s.txt in Chen and Chang (2010) are chosen as testing samples. Such replacement is reasonable since DH-Deflate performs stable among txt carriers which will be deduced in the following Sect. 5.2. For the image files, five popular 256*256 images, as shown in Fig. 2, are included, such as Baboon, Goldhill, Boat, Lena, and Pepper.

Table 7 Test results of 33 typical sample files

We choose the DH-LZW scheme as a reference because both DH-LZW and our proposed scheme try to hide one secret bit in one symbol. The experiment results are shown in Table 5. The hidden secret bits are generated as a random sequence. The sizes of the files and compressed size (LZW and Deflate) shown in tables are measured in bytes, and the hidden data are measured in bit; the data hiding and data extracting time are measured in second. Tables 5 and 6 compare the hiding capacities of the proposed scheme and the DH-LZW scheme. The variant quantities show on average the influence of embedded data to carrier files’ size. The result shows that DH-LZW causes a larger size increment in hiding the same size of data. For testing data, the author of DH-LZW uses four text files and five image files.

According to the proposed scheme, the capacity of secret data that a file can hold is depending on symbols generated in the data hiding phase, which means the amount of [distance, length] pairs. For DH-LZW, the embedding capacity is depended on the number of embeddable symbols, the DH-LZW symbols which are shorter than 3 are not available for data hiding, and hence, the proposed scheme can get a larger capacity for hiding secret data.

Table 6 shows the data hiding and data extracting time of the original Deflate scheme and the proposed scheme. It can be easily seen that the average hiding time of the proposed scheme increases 42%, and the average extracting time of the proposed scheme increases 35.8%.The time of hiding and extracting phase increases because of the existence of hidden bit. In the phase of data embedding, if all lengths of the longest matches in slid window do not match with the embedded hidden bit, the hidden bit will be left until it finds a corresponding longest match, which causes the increase in the hiding time. For the extracting phase, it needs time to calculate the parity of length for each [length distance] pair, which generates the increase in extracting time.

Fig. 3
figure 3

Embedding rates for the texts and images

Figure 3 is generated based on Table 5 and shows the variation tendency of embedding rate for both DH-LZW and DH-Deflate. The embedding rate means how many percent of capacity is used to embed secure data bits in the compressed files. It is obviously that DH-Deflate owns a higher embedding rate in our nine testing samples and DH-Deflate performs much better than DH-LZW in image format carriers. The average embedding rate of the DH-LZW and the DH-Deflate, respectively, is 0.53% and 5.23%, and it means how many percent of capacity is used to embed secure data bits in the whole compressed files.

5.2 Hiding capacity inspiration for different file formats

This section discusses the relationship between data redundancy and embedding rate of DH-Deflate while it makes an optimization in providing practical suggestion for data hiding.

Fig. 4
figure 4

Relationship of file redundancy and embedded capacity which is illustrated as the rate of embedded size and original file size of 33 typical sample files in 11 different formats

Fig. 5
figure 5

Distribution of average embedding rate in 11 different formats

In total, 33 typical sample files in 11 different formats ( Table 7) are selected from maximum compression library and Internet. All of these file formats are frequently used in huge size data transmission. Their specific basic information (including file format and original size) and experiment data (for example, compression rate) are shown in table. Figure 4 shows the relationship between original file size and embedding file size of DH-Deflate. Figures 5 and 6 give a data hiding inspiration for some most common file types.

In Fig. 4, the histogram shows the rate between embedded size and original file size of 11 common file formats. Displayed in ascending order of the average format rate (the average embedding rate in a same file type), it is obviously that the embedding rates in unit size are varied in different file format. Figure 5 shows the average embedding rate of 11 formats. For format, eg., JPEG, files in such formats have been encoded in a compression way for data transmission, compression tools usually cannot make them further smaller, and some situation may in turn become larger. On the other hand, files in text format such as .bpm or .dat which contains larger redundancy hold the highest average embedding rate in around 5.1%.

When trying to achieve data hiding, if a specific file is affordable in embedding, all secret information shall be taken into consideration. In this paper, by analyzing the average embedding rate of 11 formats of file carriers, optimization in providing practical suggestion for DH-Deflate data hiding is done based upon experiment results. As Fig. 5 shows, four sections are divided considering the average embedding rates: 0–1.50, 1.51–3.00, 3.01–4.50, and 4.51–6.00%. This brings an inspiration on piratical DH-Deflate data hiding. For example, if 10,000 bytes are to be embedded, the average sizes of carrier files are varying: 76,923.1 KB for JPEG, 2173.9 KB for pdf, 378.8 KB for doc, 256.4 KB for txt and 198.1 KB for dat.

Figure 6 intuitively reveals which format of carrier files is available for achieving some certain DH-Deflate data hiding tasks. If the sizes of carrier files are constrained to no larger than 400 KB while no less than 4 KB data are to be embedded, according to Fig. 6, only the strings of tif, doc, txt, bmp, and dat pass through the section which represents the section met the requirement. Thus, .tif, .doc, .txt, .bmp, and .dat are the possible carriers. One can choose which format and size of files are to be selected and data hiding work can be achieved in a convenient and targeted way.

Fig. 6
figure 6

Format of carrier files available for achieving a certain DH-Deflate data hiding tasks. (Constraint: secret text size = 4 KB, carrier file size \(\le \) 400 KB)

6 Limitation and future work

DH-Deflate will change the original size of compressed files by altering the compressing process. If rivals can get the relevant information of compressed files such as the version of compression software, compression parameters (window size, compression level, etc.), it is possible to discover whether a suspicious compressed file contains hidden secret data by uncompressing it to get the source data, compressing the source data again with the same version of compression tool and parameters, and then checking if the resulting compressed data are same as the suspicious one. However, since the compress file usually does not contain all of above information, and there are a plenty more of compression tools frequently used, it is hard for rivals to detect the existence of secret data embedded by our proposed scheme.

In the future work, we are committed to improving the efficiency of our scheme by referring to the improved scheme of DH-LZW.

7 Conclusion

In this paper, we proposed a longest match detecting algorithm which can detect the existence of secret embedded by the existing LZW data hiding scheme. We also proposed DH-Deflate, a secure data hiding scheme based on the Deflate compression code, that can fight against the above detecting algorithm. The experiment shows the proposed scheme achieves 5.12% of embedding rates with the cost of 10.18% size increase in the compressed file. In addition, the data hiding performance of Deflate algorithm is much better than LZW algorithm because that Deflate codes use different ways to store the redundant information. Moreover, the Deflate algorithm is now widely used in many compression tools, and our proposed scheme shall be much easier to deploy than the existing LZW-based ones. Lastly, via optimization analysis, it can be found that formats with high redundancy such as .txt, .dat, .bmp own higher and stable embedding rate.