1 Introduction

Information hiding is an ancient yet still young and challenging subject. By exploiting the insensitivity of human sensory organs and the redundancy of the digital signal itself, secret information is hidden in a host signal without perceptibly degrading the host or reducing its value in use. The host can be any kind of digital carrier, such as image, text, video or audio [1]. Since text is the most frequently and widely used information carrier, text information hiding has attracted the interest of many scholars and has produced many results.

There are four main types of text information hiding technology: methods based on text format, methods based on text images, generation-based natural language information hiding and embedding-based natural language information hiding.

Text format-based information hiding mainly hides secret information by changing the character spacing, inserting invisible characters (spaces, tabs, special spaces, etc.) and modifying the format of documents (PDF, HTML, Office). For example, [2, 3] hid data by changing properties such as row spacing, word spacing, character height and character width. The methods in [4–8] embedded data by using a programming language to modify certain properties of an Office document (including NoProofing attribute values, character color attributes, font size, font type, font underline, the Range object, the object’s Kerning property, color properties, etc.). Based on the format information of disk volumes, Blass et al. proposed a robust hidden volume encryption in [9, 10]. The hiding capacity of methods based on text format is large, but most of them cannot resist re-composing and OCR attacks, nor can they resist steganography detection based on statistical analysis [11, 12].

The main idea of text image-based information hiding is to regard a text as a kind of binary image and to combine the features of binary images with those of texts to hide data. For example, [13] embedded information by using the parity of the numbers of black and white pixels in a block. In [14, 15], information was embedded by modifying the proportion of black and white pixels in a block and the pixel values of the outer edge, respectively. In [16], secret information is embedded by rotating the strokes of Chinese characters. In addition, Daraee et al. [17] presented an information hiding method based on hierarchical coding, and Satir et al. [18] designed a text information hiding algorithm based on compression, which improved the embedding capacity. The biggest problem of text image-based methods is that they cannot resist re-composing and OCR attacks: once the characters carrying the hidden information are re-composed into a non-formatted text, the hidden information disappears completely.

Generation-based natural language information hiding uses natural language processing (NLP) technologies to carry secret information by generating text that resembles natural text. It can be divided into two types: the primary text generation mechanism and the advanced text generation mechanism. The former is based on probability statistics: the secret data is coded by using a random dictionary or the occurrence frequency of letter combinations, so that the generated text matches the statistical characteristics of natural language. The latter is based on a linguistic approach: following linguistic rules, it carries the secret data in an imitation natural language text without specific content [19]. Because artificial intelligence is not yet able to generate arbitrary text automatically, the text generated by these methods often contains idiomatic mistakes, common sense errors or sentences without complete meaning. Moreover, the semantic context may be incoherent and the text hard to read, so it is easily recognized by a human reader [20, 21].

Embedding-based natural language information hiding embeds the secret information by modifying the text at different granularities [22, 23]. According to the scope of the modified text data, it can be divided into lexical-level information hiding and sentence-level information hiding. The former hides messages by means of substitution of similar characters [24], substitution of spelling mistakes [25], substitution between abbreviations/acronyms and words in complete form [26], etc. The latter, based on an understanding of the sentence structure and its semantics, changes the sentence structure to hide information by using syntactic transformation and restatement technology while keeping the meaning and style unchanged [27–30]. The embedding method is the focus and hotspot of current research on text information hiding. However, it needs the support of natural language processing technology, such as syntactic parsing, disambiguation and automatic generation, so that the text carrying the embedded information remains sound in word choice, collocation accuracy, syntactic structure and the statistical characteristics of the language [31]. Because of the limitations of existing NLP technology, such hiding algorithms are hard to realize. In addition, some statistical and linguistic deviation and distortion remain [32].

From the above we can see that text information hiding has produced many research results, but problems remain, such as weak resistance to statistical analysis and poor text rationality. Furthermore, in theory, as long as the carrier is modified, the secret message can be detected: as long as the secret information exists in the cover, it can hardly escape steganalysis. Thus, existing steganography technology is facing a huge security challenge, and its development has encountered a bottleneck.

The proposed method first performs syntactic parsing on the information to be hidden and divides it into independent keywords, then uses the Chinese mathematical expression [33] to create locating tags. After that, by using a cloud search service, multi-keyword ranked search [34, 35], a normal text containing the secret information is retrieved, which achieves direct transmission of the secret information. The method requires no other carrier and no modification, and it can resist all kinds of existing steganalysis methods. This research has an important positive significance for the development of information hiding technology.

2 Related Works

The mathematical expression of Chinese characters was proposed by Sun et al. in 2002 [33]. The basic idea is to express a Chinese character as a mathematical expression in which the operands are components of Chinese characters and the operators are six spatial relations between components. Some definitions are given below.

Definition 1.

A basic component is composed of several strokes, and it may be a Chinese character or a part of a Chinese character.

Definition 2.

An operator is the location relation between two components. Let A and B be two components. Then A lr B, A ud B, A ld B, A lu B, A ru B and A we B indicate that A and B are in the left-right, up-down, left-down, left-upper, right-upper and whole-enclosed spatial relation, respectively. An intuitive explanation of the six operators is shown in Fig. 1.

Fig. 1. Intuitive explanation of the defined operators

Definition 3.

The priority of the six operators defined in Definition 2 is as follows: (1) () is the highest; (2) we, lu, ld and ru are in the middle; (3) lr and ud are the lowest. The operating direction is from left to right.

Using the selected 600 basic components and the six operators in Fig. 1, we can express all 20,902 CJK Chinese characters in Unicode 3.0 as mathematical expressions. The representation is natural and has a simple structure, and every character can be processed by certain operational rules, like a general mathematical expression. Once Chinese characters are expressed in mathematical symbols, many Chinese information processing tasks become simpler than before.
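As an illustration of how such expressions can be processed by operational rules, the following Python sketch parses an expression into a tree using the operator priorities of Definition 3. The token format (space-separated symbols), the placeholder component names A, B and C, and the function names `parse` and `components` are assumptions made for this example and are not taken from [33].

```python
# Priorities from Definition 3: parentheses bind tightest, then we/lu/ld/ru,
# then lr/ud; operators of equal priority are applied from left to right.
PRIORITY = {"lr": 1, "ud": 1, "we": 2, "lu": 2, "ld": 2, "ru": 2}


def parse(tokens):
    """Parse a list of tokens into a nested tuple (operator, left, right)."""
    pos = 0

    def primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expression(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok in PRIORITY or tok == ")":
            raise ValueError(f"unexpected token {tok!r}")
        return tok  # a basic component

    def expression(min_priority):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and PRIORITY.get(tokens[pos], 0) >= min_priority:
            op = tokens[pos]
            pos += 1
            # The right operand may only contain higher-priority operators,
            # so equal-priority operators group from left to right.
            right = expression(PRIORITY[op] + 1)
            left = (op, left, right)
        return left

    tree = expression(1)
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree


def components(tree):
    """List the basic components of an expression tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return components(left) + components(right)


print(parse("A lr B we C".split()))   # ('lr', 'A', ('we', 'B', 'C'))
print(parse("A lr B ud C".split()))   # ('ud', ('lr', 'A', 'B'), 'C')
```

The `components` helper returns the operands of an expression, which is the kind of decomposition needed when components are later counted and used as location labels.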

From the Chinese mathematical expression we can see that, if appropriate components are selected as the location labels of the secret message, they perform better than words or phrases selected directly as the index on many indicators, such as randomness, distinguishability and universality.

3 Proposed Method

Unlike conventional information hiding, which needs to find a carrier in which to embed the secret information, coverless information hiding requires no other carrier. The secret information drives the generation of an encryption vector, and a normal text containing the encrypted vector is then retrieved from a large collection of text, so the secret message is conveyed directly without any modification.

From the above analysis, a coverless information hiding algorithm has three characteristics. The first is “no embedding”: secret information is not embedded into a carrier by modifying it. The second is “no additional message needs to be transmitted except an original text”: beyond the original agreement, no other carrier is used to send auxiliary information, such as the details or parameters of the embedding or extraction. The third is “anti-detection”: the algorithm can resist all kinds of existing detection algorithms. Based on these characteristics, together with the related theory of the Chinese mathematical expression, this paper presents a text-based coverless information hiding algorithm.

3.1 Information Hiding

For coverless information hiding based on text, we first segment the secret data into words, then convert the Chinese words one by one, design the locating tags, and generate the keywords that contain the converted secret data and the locating tags. We then search the database for texts that contain the keywords, thereby achieving information hiding with zero modification.

Let m be a Chinese character, and let \( \mathcal{T} \) be the set of the 20,902 CJK Chinese characters in Unicode 3.0. Suppose the secret message is \( M = m_{1} m_{2} \ldots m_{n} \). The conversion and location process of the secret information is summarized in Fig. 2. The details are as follows:

Fig. 2. Secret information conversion and positioning process

  (1) Information segmentation. Based on the dependencies obtained by syntactic parsing, M is segmented into \( \mathcal{N} \) non-overlapping keywords \( w_{i} \ (i = 1, 2, \cdots, \mathcal{N}) \), where \( ||w_{i}|| \le \ell \) and \( \ell \) is a predetermined threshold that controls the length of the keywords. The greater \( \ell \) is, the higher the security, but the more difficult the extraction of the secret information. According to research on Chinese information entropy, \( \ell \) is usually chosen to be no more than three.

  (2) Words conversion. Segment the text database into keywords according to the segmentation rules of step (1), calculate the frequency of every keyword in the text database, and sort the keywords in descending order of frequency to obtain \( \mathcal{P} = \{ p_{1}, p_{2}, \ldots, p_{s} \} \). The word transformation protocol can then be designed as follows:

    $$ p'_{i} = \mathcal{F}_{c}(p_{i}, k), \quad i = 1, 2, \cdots, s; $$

    where \( \mathcal{F}_{c}(p_{i}, \cdot) \) is the transformation function obtained from the statistical analysis of the text database and is open to all users, \( \cdot \) stands for the private key of the information receiver, such as \( k \) in the above formula, and \( s \) is the number of keywords in the text database. The frequencies of \( p_{i} \) and \( p'_{i} \) should not differ too much; otherwise a commonly used word may be converted into a rarely used word, which greatly decreases the retrieval efficiency of the stego-text. Using \( \mathcal{F}_{c} \), the converted keyword \( w'_{i} \) of \( w_{i} \) can be calculated, and we obtain the converted secret message \( \mathcal{W}' = w'_{1} w'_{2} \ldots w'_{\mathcal{N}} \).

  (3) Get the locating tags. For the text database, first divide the Chinese characters into components by using the Chinese mathematical expression, then calculate the frequency of every component, and finally sort the components in descending order of frequency. Select the 50 most frequent components and determine the locating sequence according to the user’s key. For \( i = 1, 2, \cdots, \mathcal{N} \), suppose \( b_{i} \) is the component corresponding to the located keyword \( w'_{i} \). When \( \mathcal{N} > 50 \), the locating components repeat every 50 keywords.

    For many components, the corresponding Chinese character is not unique. In order to find a stego-text that contains the secret information, for every component \( b_{i} \) we first determine the \( r \) most frequent characters \( m_{i}^{j} \ (j = 1, 2, \ldots, r) \) that contain it, combine every \( m_{i}^{j} \) with the keyword \( w'_{i} \) and search for the combination in the text database, and then sort the \( m_{i}^{j} \) according to the number of occurrences. Using the user’s key, one character is selected from the top 5 Chinese characters, which gives the location tags \( \mathcal{L}_{i} \ (i = 1, 2, \cdots, \mathcal{N}) \).

  (4) Combine \( \mathcal{L}_{i} \) with \( w'_{i} \) to obtain \( \mathcal{D}_{i} \ (i = 1, 2, \cdots, \mathcal{N}) \), where \( \mathcal{D}_{i} \) is the keyword to be retrieved in the text database.
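Steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: greedy forward maximum matching stands in for the dependency-based segmentation, the conversion function \( \mathcal{F}_{c} \) is assumed to rotate a word by \( k \) positions inside a small block of neighbouring frequency ranks (so that \( p_{i} \) and \( p'_{i} \) have similar frequencies), the key is assumed to be an integer offset, and Latin strings stand in for Chinese words, components and characters. All function names and the `block` parameter are hypothetical.

```python
from collections import Counter


def segment(message, vocabulary, max_len=3):
    """Step (1), simplified: greedy forward maximum matching, keyword length <= max_len."""
    words, i = [], 0
    while i < len(message):
        for size in range(min(max_len, len(message) - i), 0, -1):
            piece = message[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def rank_by_frequency(items):
    """Distinct items in descending order of frequency (ties broken alphabetically)."""
    counts = Counter(items)
    return sorted(counts, key=lambda x: (-counts[x], x))


def convert(word, ranked, key, block=4, inverse=False):
    """Step (2), assumed F_c: rotate a word by `key` ranks inside its frequency block."""
    r = ranked.index(word)  # raises ValueError if the word is not in the database
    start = (r // block) * block
    size = min(block, len(ranked) - start)
    shift = -key if inverse else key
    return ranked[start + (r - start + shift) % size]


def locating_tags(ranked_components, key, n, top=50):
    """Step (3): one component per keyword from the `top` most frequent, repeating with period `top`."""
    pool = ranked_components[:top]
    return [pool[(key + i) % len(pool)] for i in range(n)]


def build_keywords(message, vocabulary, ranked_words, ranked_components, chars_of, key):
    """Steps (1)-(4): return the retrieval keywords D_i = L_i + w'_i."""
    words = segment(message, vocabulary)
    converted = [convert(w, ranked_words, key) for w in words]
    tags = locating_tags(ranked_components, key, len(words))
    # chars_of[c] lists the characters containing component c, most frequent first;
    # the key picks one of the top five as the location tag L_i.
    labels = [chars_of[c][key % min(5, len(chars_of[c]))] for c in tags]
    return [label + w for label, w in zip(labels, converted)]


# Toy data: Latin strings stand in for Chinese words, components and characters.
ranked_words = rank_by_frequency(["ab", "ab", "ab", "cd", "cd", "e", "e", "f"])
chars_of = {"X": ["P", "Q"], "Y": ["R", "S"]}
print(build_keywords("abcde", {"ab", "cd"}, ranked_words, ["X", "Y"], chars_of, key=1))
# ['Scd', 'Qe', 'Sf']
```

Because the rotation stays inside one block of ranks, a frequent word is only ever replaced by another word of similar frequency, and calling `convert(..., inverse=True)` with the same key restores the original word.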

In order to find a normal text that contains the retrieval keywords \( \mathcal{D} = \{ \mathcal{D}_{i} | i = 1, 2, \cdots, \mathcal{N} \} \), the creation of a large-scale text database plays a crucial role. The database should not only emphasize “high speed, wide range, great quantity”, but also follow the principle of “quality, standardization and accuracy”. Moreover, in order to improve the anti-detection performance, the quality of the text needs to be controlled in two respects: first, the text must be normal and contain no secret information; second, the text must be standardized in line with the language specification.

Based on the above text database, keyword indexing technology is applied to find every \( \mathcal{D}_{i} \ (i = 1, 2, \cdots, \mathcal{N}) \) in the database and to build the inverted file index \( \mathcal{ID}_{i} \); a text that contains the secret information is then searched for. If the search string is found, the text is sent to the receiver; otherwise, the search string is divided into two segments and each is retrieved again until suitable texts are found. In order to avoid suspicion, text classification, such as emotional and situational classification, can be applied in the retrieval process so that unrelated texts are not grouped together in the retrieved results.
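The retrieval procedure described above can be sketched as follows. This is a simplified illustration, assuming that the database is a list of strings, that the keywords must appear in a text in their original order, and that a keyword sequence that cannot be found is split in the middle; the function names are hypothetical and the ranked cloud search of [34, 35] is not modelled.

```python
from collections import defaultdict


def build_inverted_index(texts, keywords):
    """Map each retrieval keyword to the ids of the texts that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for kw in keywords:
            if kw in text:
                index[kw].add(doc_id)
    return index


def contains_in_order(text, keywords):
    """True if the keywords occur in the text in the given order without overlapping."""
    pos = 0
    for kw in keywords:
        pos = text.find(kw, pos)
        if pos < 0:
            return False
        pos += len(kw)
    return True


def retrieve(texts, index, keywords):
    """Return ids of texts that together carry `keywords`, splitting in two on failure."""
    candidates = set.intersection(*(index.get(kw, set()) for kw in keywords))
    for doc_id in sorted(candidates):
        if contains_in_order(texts[doc_id], keywords):
            return [doc_id]
    if len(keywords) == 1:
        raise LookupError(f"no text contains {keywords[0]!r}")
    mid = len(keywords) // 2
    return retrieve(texts, index, keywords[:mid]) + retrieve(texts, index, keywords[mid:])


texts = ["xx Scd yy Qe", "Sf only", "Qe then Scd"]
keywords = ["Scd", "Qe", "Sf"]
index = build_inverted_index(texts, keywords)
print(retrieve(texts, index, ["Scd", "Qe"]))  # [0]
print(retrieve(texts, index, keywords))       # [0, 0, 1]
```

In the second call no single text contains all three keywords, so the sequence is split and the result lists one text id per retrieved segment.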

It is worth mentioning that, as the above hiding process shows, the word conversion protocol is essentially a data encryption and the locating protocol is essentially a mapping. Together, the two enhance security and determine the location of the secret information. This idea does not come from nothing; it has a profound historical heritage. When poetry is chosen as the text database, the keywords are not converted and the first character of each line is selected as the location tag, the result is the ancient acrostic poem.

This technique is used in chapter sixty of “Outlaws of the Marsh”, the famous novel about a peasant uprising in the Chinese Northern Song Dynasty (1119-1121), as shown in Fig. 3. The poem is read normally from left to right, and its meaning is to praise the general Lu Junyi. However, if we combine the initial character of each line, we get 芦俊义反, which means “卢俊义 (Lu Junyi) will defect to the enemy” (芦 and 卢 are homonyms), thus achieving the purpose of hiding information in a public document.

Fig. 3. Left is the acrostic and right is the picture of Lu Junyi.
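The acrostic can be viewed as the simplest instance of the locating protocol: there is no word conversion, and the location of each secret character is simply the start of a line. The short Python sketch below illustrates this reading rule. The four lines are the poem as it is commonly quoted from the novel; the function name `acrostic` is chosen for this example.

```python
def acrostic(lines):
    """Read the first character of every non-empty line."""
    return "".join(line.strip()[0] for line in lines if line.strip())


poem = [
    "芦花丛里一扁舟",
    "俊杰俄从此地游",
    "义士若能知此理",
    "反躬逃难可无忧",
]
print(acrostic(poem))  # 芦俊义反
```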

3.2 Information Extraction

In conventional text information hiding, the stego-text appears normal to a stranger but abnormal to the receiver, so the receiver can extract the secret information by analyzing the abnormalities. In coverless information hiding, however, the stego-text is an open and normal text, and the receiver cannot extract the secret information by looking for abnormal places. Let the stego-text be \( \mathbb{S} \) and let \( k \) be the private key; the extraction process is shown in Fig. 4, and the details are as follows.

Fig. 4. The flowchart of the information extraction

  (1) Extraction preprocessing. Because the text database is open to all users, the 50 Chinese components used for marking and their corresponding Chinese characters are also public information, denoted \( b = \{ b_{i} | i = 1, \ldots, 50 \} \) and \( m_{i} = \{ m_{i}^{j} | j = 1, \ldots \} \) respectively. Therefore, using the user’s private key, we can obtain the locating components and their order of appearance. Moreover, based on the statistical results, it is also easy to obtain the locating characters \( \mathcal{L} = \{ \mathcal{L}_{i} | i = 1, 2, \cdots, \mathcal{N} \} \) in the database.

  (2) The extraction of candidate keywords. Sequentially scan the stego-text \( \mathbb{S} \), then extract the candidate locating tags \( \mathcal{CL} = \{ \mathcal{CL}_{i} | i = 1, 2, \cdots, \mathcal{N}' \} \) and the candidate keywords \( \mathcal{S}' = \{ \mathcal{S}'_{i} | i = 1, 2, \cdots, \mathcal{N}' \} \) according to the user’s set of locating components, where \( \mathcal{N}' \ge \mathcal{N} \).

    When \( \mathcal{N}' = \mathcal{N} \), \( \mathcal{CL} \) is exactly the set of locating tags of the secret information, so skip step (3) and go to step (4). When \( \mathcal{N}' > \mathcal{N} \), \( \mathcal{CL} \) contains non-locating components, which should be eliminated from \( \mathcal{CL} \) in step (3).

  (3) Eliminate the redundant tags. The procedure is as follows:

    (a) Compare \( \mathcal{CL}_{i} \) with \( \mathcal{L}_{i} \), starting from \( i = 1 \). If \( \mathcal{CL}_{i} = \mathcal{L}_{i} \), update \( i = i + 1 \) and execute step (a) again; otherwise go to step (b);

    (b) If \( \mathcal{CL}_{i - 1} \ne \mathcal{CL}_{i} \) but both have the same components, then \( \mathcal{CL}_{i} \) is not a location tag; delete it and go to step (a). If \( \mathcal{CL}_{i - 1} \ne \mathcal{CL}_{i} \) and they have different components, compare the frequency ranks of the two Chinese characters in the text database; the character that does not match the receiver’s key is not a locating tag, so delete it from \( \mathcal{CL} \). Otherwise go to step (c);

    (c) If \( \mathcal{CL}_{i - 1} = \mathcal{CL}_{i} \), then at least one of \( \mathcal{CL}_{i - 1} \) and \( \mathcal{CL}_{i} \) is not a locating tag. Combine \( \mathcal{CL}_{i - 1} \) and \( \mathcal{CL}_{i} \) with their subsequent keywords to generate the retrieval keywords \( \mathcal{D}_{i} \), delete the one with the smaller number and go to step (a);

    When the correct tags have been determined, locate them in \( \mathbb{S} \), and then extract the character strings \( \mathcal{S}_{i} \ (i = 1, 2, \cdots, \mathcal{N}) \) after the locating points, where \( ||\mathcal{S}_{i}|| = \ell \);

  (4) Since each keyword is obtained by dependency-based segmentation before hiding, the length \( \ell_{w_{i}} \) of each keyword may differ, where \( 1 \le \ell_{w_{i}} \le \ell \). Moreover, because of the words conversion, the keywords cannot be extracted exactly. Therefore, when the inverse transform of the word conversion \( \mathcal{F}_{c}^{-1}(p_{i}, k) \) is used to restore the string \( \mathcal{S}_{i} \), the obtained candidate keyword set \( \mathbb{K}_{i} = \{ \mathbb{K}_{i}^{j} | 1 \le j \le \ell \} \) is not unique;

  (5) Select a keyword from every \( \mathbb{K}_{i} \ (i = 1, 2, \cdots, \mathcal{N}) \), and generate the candidate secret messages by examining the language features and the word segmentation based on the user’s background. Then measure the confidence of each candidate secret message by analyzing the edit distance and similarity of the keywords, so that a ranking can be recommended to the receiver;

  (6) Using the ranked recommendations, combined with language analysis and the features of Chinese grammar, we can obtain the secret information \( M = m_{1} m_{2} \ldots m_{n} \).
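The core of the extraction procedure, steps (2) to (4), can be sketched in Python as follows. The sketch makes strong simplifying assumptions: redundant tags are removed by keeping the earliest occurrences that follow the expected tag order (instead of the full rules (a) to (c)), the conversion \( \mathcal{F}_{c} \) is assumed to be a rotation by \( k \) ranks inside a block of neighbouring frequency ranks, and Latin strings stand in for Chinese text. The ranking of steps (5) and (6) is not modelled, and all function names are hypothetical.

```python
def scan_candidates(stego, tag_chars, max_len=3):
    """Step (2): (position, tag, following string) for every tag character in the stego-text."""
    return [(i, ch, stego[i + 1:i + 1 + max_len])
            for i, ch in enumerate(stego) if ch in tag_chars]


def match_tags(candidates, expected_tags):
    """Simplified step (3): keep the earliest candidates that follow the expected tag order."""
    kept, j = [], 0
    for cand in candidates:
        if j < len(expected_tags) and cand[1] == expected_tags[j]:
            kept.append(cand)
            j += 1
    if j != len(expected_tags):
        raise ValueError("stego-text does not contain all locating tags")
    return kept


def inverse_convert(word, ranked, key, block=4):
    """Assumed inverse of F_c: rotate a word back by `key` ranks inside its frequency block."""
    r = ranked.index(word)
    start = (r // block) * block
    size = min(block, len(ranked) - start)
    return ranked[start + (r - start - key) % size]


def extract(stego, expected_tags, ranked, key, max_len=3):
    """Steps (2)-(4): one list of candidate keywords K_i per locating tag."""
    found = match_tags(scan_candidates(stego, set(expected_tags), max_len), expected_tags)
    result = []
    for _, _, string in found:
        # Every prefix that is a known database word gives one candidate keyword.
        result.append([inverse_convert(string[:n], ranked, key)
                       for n in range(1, len(string) + 1) if string[:n] in ranked])
    return result


ranked = ["ab", "cd", "e", "f"]  # database words in descending order of frequency
print(extract("xxScd yyQe zzSf", ["S", "Q", "S"], ranked, key=1))
# [['ab'], ['cd'], ['e']]
```

When several prefixes of an extracted string are valid database words, the corresponding list contains more than one candidate, which is why the receiver still needs the ranking of steps (5) and (6).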

4 Example Verification

To describe the above information hiding process clearly, we illustrate it with a simple example. Let the secret information M be , then the procedure of the hiding is shown in Fig. 5.

Fig. 5. Example results of the information hiding procedure

Firstly, segment M into , then design the words conversion protocol \( {\mathcal{F}}_{c} (p_{i} ,{\text{k}}) \), where \( {\mathcal{F}}_{c} (p_{i} ,{\text{k}}) \) can be set to: .

Secondly, analyze the text database and calculate the statistic values, then choose the suitable component for the locating, where we set the components of the Chinese characters as .

Thirdly, select the locating tags from the candidate Chinese characters, where we obtain the locating tags , and the keywords set retrieved is .

Finally, retrieve the text database to find a stego-text which contains the locating tags and keywords, where using the rules of the above, is a stego-text with the retrieved secret information .

In the case of the recipient’s encrypted text information , because of the absence of redundant components, the extraction process is the inverse of the embedding process. The realization is relatively simple, so we do not repeat it here.

5 Conclusions

This paper presented a text information hiding method based on the Chinese mathematical expression. Unlike conventional information hiding methods, which need to find a carrier in which to embed the secret message, the proposed method requires no other carrier. First, an encryption vector is generated from the secret information, and then a normal text containing the encrypted vector is retrieved from the text database, so the secret information is conveyed directly without any modification of the retrieved text. Therefore, the proposed method can resist all kinds of existing steganalysis methods. This research has an important positive significance for the development of information hiding technology.