Audio encryption based on the cosine number transform

Lima, Juliano B.; da Silva Neto, Eronides F.

doi:10.1007/s11042-015-2755-6


Published: 01 July 2015

Volume 75, pages 8403–8418, (2016)

Multimedia Tools and Applications


Abstract

In this paper, we introduce an audio encryption scheme based on the cosine number transform (CNT). The transform, which is defined over a finite field, is recursively applied to blocks of samples of a noncompressed digital audio signal. The blocks are selected using a simple overlapping rule, which provides diffusion of the ciphered data to all processed blocks. A secret-key is used to specify the number of times the transform is applied to each one of such blocks. Computer experiments are carried out and security aspects of the proposed scheme are discussed. Our analysis indicates that the method meets the main security requirements of secret-key cryptography. More specifically, after the encryption of 16-bit audio signals, correlation coefficients significantly close to 0 and entropy values close to 16 were obtained. Furthermore, the flexibility of the method easily allows key space sizes greater than 2²⁵⁶ and provides robustness against differential, known-plaintext and chosen-plaintext attacks.


1 Introduction

In recent years, the development of multimedia security techniques has attracted the attention of researchers from several fields of knowledge. This is mainly due to the increasing ease of sharing digital images, video and audio through communication networks [22]. In this scenario, steganographic and watermarking schemes have the purpose of providing privacy and authenticity using information hiding fundamentals [4, 6, 7, 20, 28]. On the other hand, multimedia encryption schemes employ several strategies and mathematical tools with the purpose of making perceptual and statistical aspects of an image or an audio file seem noisy [1, 12, 15, 29, 30]. In the optical domain, image encryption can be performed by using fractional Fourier transforms and random phase encoding, for instance [9, 24]. Chaotic maps, which have the property of being highly sensitive to initial conditions, are also widely used in multimedia encryption techniques [27].

More specifically, digital audio encryption can be implemented in different ways and applied to scenarios with manifold requirements. In [21], for example, chaotic and multiple-keys algorithms are employed in association with discrete transforms; the purpose is to provide a new audio encryption package for TV cloud computing. The audio encryption algorithm proposed in [18] is devoted to online applications and employs transposition and a multiplicative non-binary system. In [8], a higher dimensional chaotic map is used to enhance the key space and the security of an iterative audio encryption method. The operating principle of the approach presented in [17] is based on a virtual optics scheme; both virtual wavelength and virtual diffraction distance are applied in conjunction with a complex-valued random mask to design multiple-locks and multiple-keys in the course of audio data encryption and decryption. In [26], an index-based selective audio encryption for wireless multimedia sensor networks is proposed and, in [11], the authors introduce an advanced partial encryption using watermarking and scrambling in MP3.

In this paper, we propose an audio encryption scheme based on the cosine number transform (CNT). This transform was originally named finite field cosine transform, as a reference to the algebraic structures where it is defined [14]. However, in the present context, the CNT can be viewed as a cosine-based version of the number-theoretic transform (NTT), which explains the nomenclature we have adopted. Number-theoretic transforms are well-known in the signal processing community, where they are used in fast algorithms for computing linear convolutions [3, 19] and, more recently, in fragile watermarking techniques [5, 25]. Analogously to the NTT, all arithmetic operations involved in the computation of the CNT are carried out modulo an odd prime p. In other words, computations in extension fields are avoided, which is suitable for signal processing applications.

The first encryption scheme based on the CNT was proposed in [13]. In that paper, the encryption of grayscale images is considered and the secret-key, which is given by a permutation, determines the position of each image block to be processed by the transform. Such a scheme also includes a preliminary transform step, which is not key-dependent. The encryption scheme proposed in the present work is applicable to noncompressed audio. The technique consists of applying the CNT to blocks of samples of an audio signal. The number of times that the transform is recursively applied to each block depends on a secret-key. The transformed block replaces the original block before the next block is processed. Since two adjacent blocks share some samples, the ciphered data is diffused along the whole audio signal. This is important to ensure some properties related to the security of the method.

Besides complying with the main security requisites of a multimedia encryption scheme, the proposed approach has the following attractive features: (i) simplicity: basically, the scheme consists in computing transforms of audio blocks and requires only two encryption rounds; (ii) flexibility: the scheme can be easily adapted to audio signals encoded with different numbers of bits per sample and allows adjustments on the key sizes; (iii) fidelity: since rounding is not necessary at any step of the algorithm, if the key is correct, the decrypted audio is identical to the original audio signal; (iv) computational efficiency: the CNT can be computed via fast algorithms and employing fixed-point arithmetic operations only. This permits efficient implementations and reduces the number of additions and multiplications necessary to calculate an N-length CNT from $\mathcal {O}(N^{2})$ to $\mathcal {O}(N\log N)$ [3, 10].

This paper is organized as follows. In Section 2, the main theoretical aspects concerning the cosine number transform are presented. In Section 3, we introduce the proposed scheme and describe the steps involved in the encryption/decryption of an audio signal. In Section 4, we present numerical results of computer experiments of the proposed technique and analyze its security. A preliminary comparison between our approach and other state-of-the-art audio encryption schemes is carried out and some concluding remarks are presented in Section 5.

2 Cosine number transform

The definition of the cosine number transform requires the following finite field cosine function.

Definition 1

Let ζ be a nonzero element in the finite field GF(p), p an odd prime. The finite field cosine function related to ζ is computed modulo p by

$$ \cos_{\zeta}(x):=\frac{\zeta^{x}+\zeta^{-x}}{2}, $$

(1)

x=0,1,…,ord(ζ), where ord(ζ) denotes the multiplicative order of ζ (see Note 1).

The finite field cosine function holds properties similar to those of the standard real-valued one, such as the unit circle and the addition of arcs, for instance. Definition 1 can also contain additional details, which do not need to be considered in the present context [14]. The cosine number transform is given by the following definition.
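As a minimal sketch of Definition 1, the snippet below evaluates the finite field cosine over GF(65537) with ζ=4 (the values used in the example later in this section) and checks one form of the addition of arcs, cos(a+b)+cos(a−b)=2cos(a)cos(b). The function name `cos_zeta` and the test arguments are illustrative choices, not part of the original formulation.

```python
P = 65537      # odd prime (taken from the example given later in this section)
ZETA = 4       # nonzero element of GF(P); its multiplicative order is 16

def cos_zeta(x, zeta=ZETA, p=P):
    """Finite field cosine of Definition 1: (zeta^x + zeta^(-x)) / 2 modulo p."""
    inv_zeta = pow(zeta, p - 2, p)   # zeta^(-1) by Fermat's little theorem
    inv_two = pow(2, p - 2, p)       # dividing by 2 means multiplying by 2^(-1)
    return (pow(zeta, x, p) + pow(inv_zeta, x, p)) * inv_two % p

# cos(0) = 1, as for the real-valued cosine.
print(cos_zeta(0))

# Addition of arcs: cos(a + b) + cos(a - b) = 2 cos(a) cos(b) (mod p).
a, b = 5, 3
lhs = (cos_zeta(a + b) + cos_zeta(a - b)) % P
rhs = 2 * cos_zeta(a) * cos_zeta(b) % P
print(lhs == rhs)
```

Since ord(ζ)=16, the function is periodic with period 16, and the half-period value cos(8) equals −1 (mod p), mirroring cos(π)=−1.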

Definition 2

Let ζ∈GF(p) be an element such that ord(ζ)=2N. The cosine number transform of the vector $\mathbf{x}=[x_{0},x_{1},\ldots,x_{N-1}]$, $x_{i}\in \text{GF}(p)$, is the vector $\mathbf{X}=[X_{0},X_{1},\ldots,X_{N-1}]$, $X_{j}\in \text{GF}(p)$, of elements

$$ X_{j}:=\sqrt{\frac{2}{N}}\sum\limits_{i=0}^{N-1} \beta_{j} x_{i}\cos_{\zeta}\left( j\frac{2i+1}{2}\right) $$

(2)

computed modulo p, where

$$ \beta_{j} =\left\{ \begin{array}{ll}1/\sqrt{2}\:(\text{mod}\:\:p), & j=0, \\ 1, & j=1,2,\ldots,N-1. \end{array} \right. $$

The computation of the CNT of a row vector x can be represented by the matrix equation

$$\mathbf{X}=\mathbf{C}\cdot\mathbf{x}^{T}, $$

where $\mathbf{x}^{T}$ is the transpose of the vector x and C corresponds to the transform matrix, whose element in the (j+1)-th row and the (i+1)-th column is given by

$$C_{j+1,i+1}=\sqrt{\frac{2}{N}}\beta_{j} \cos_{\zeta}\left( j\frac{2i+1}{2}\right)=\sqrt{\frac{2}{N}}\beta_{j} \cos_{\sqrt{\zeta}}\left( j(2i+1)\right), $$

i,j=0,1,…,N−1. It can be shown that the inverse CNT is obtained by using the transform matrix $\mathbf{C}^{-1}=\mathbf{C}^{T}$ [14]. This means that algorithms and architectures designed to compute a CNT can be easily adjusted to compute the corresponding inverse CNT.

As an example, let us construct a CNT of length N=8 over GF(65537). We use the element ζ=4, with multiplicative order ord(ζ)=16, and, from Definitions 1 and 2, we compute

$$\mathbf{C}=\left[ \arraycolsep=4.2pt \begin{array}{cccccccc} 1020 & 1020 & 1020 & 1020 & 1020 & 1020 & 1020 & 1020\\ 24577 & 63491 & 65033 & 65441 & 96 & 504 & 2046 & 40960\\ 61442 & 65297 & 240 & 4095 & 4095 & 240 & 65297 & 61442\\ 63491 & 96 & 40960 & 504 & 65033 & 24577 & 65441 & 2046\\ 64517 & 1020 & 1020 & 64517 & 64517 & 1020 & 1020 & 64517\\ 65033 & 40960 & 65441 & 63491 & 2046 & 96 & 24577 & 504\\ 65297 & 4095 & 61442 & 240 & 240 & 61442 & 4095 & 65297\\ 65441 & 504 & 63491 & 40960 & 24577 & 2046 & 65033 & 96 \end{array}\right]. $$

(3)
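The matrix in (3) can be reproduced with a few lines of modular arithmetic. The sketch below assumes particular square roots in GF(65537): √ζ=2, √2=4080 (since 4080²≡2) and √(2/N)=1/2 for N=8. These root choices are assumptions made here so that the output matches (3); the helper names are illustrative.

```python
P, N = 65537, 8

def inv(a):
    """Multiplicative inverse modulo the prime P (Fermat's little theorem)."""
    return pow(a, P - 2, P)

SQRT_ZETA = 2      # 2^2 = 4 = zeta
SQRT_TWO = 4080    # 4080^2 = 2 (mod 65537)
SCALE = inv(2)     # sqrt(2/N) = sqrt(1/4) = 1/2 when N = 8

def cos_sqrt_zeta(x):
    """Finite field cosine related to sqrt(zeta) = 2."""
    return (pow(SQRT_ZETA, x, P) + pow(inv(SQRT_ZETA), x, P)) * inv(2) % P

def cnt_matrix():
    """Entry (j, i) is sqrt(2/N) * beta_j * cos_sqrt_zeta(j * (2i + 1)) mod P."""
    rows = []
    for j in range(N):
        beta = inv(SQRT_TWO) if j == 0 else 1
        rows.append([SCALE * beta * cos_sqrt_zeta(j * (2 * i + 1)) % P
                     for i in range(N)])
    return rows

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) % P for col in cols] for row in A]

C = cnt_matrix()
Ct = [list(col) for col in zip(*C)]
identity = [[int(i == j) for j in range(N)] for i in range(N)]

print(C[1])                          # second row of the matrix in (3)
print(mat_mul(C, Ct) == identity)    # orthogonality: C times its transpose is I
```

The orthogonality check at the end is the property that lets the inverse CNT reuse the same matrix, transposed.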

Regarding the matrix C given in (3), it is important to remark that the least positive integer l such that $\mathbf{C}^{l}=\mathbf{I}$ (the identity matrix) is greater than 10⁹. This means that the constructed CNT can be iteratively applied to a vector x at least 1 billion times before the original vector x is recovered. Due to this property, the CNT is suitable for cryptographic schemes which use recursive transformations. This strategy would not be effective, for example, if standard number-theoretic transforms were considered, because the fourth power of a standard NTT matrix is equal to the identity matrix [2, 3, 25].
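The short period of the standard NTT can be checked numerically. The sketch below builds an unnormalized 8-point NTT matrix over GF(65537), assuming the kernel ω=16 (an element of multiplicative order 8, chosen here for illustration). Without the 1/√N normalization, the fourth power equals N² times the identity, so the transform returns to a scaled copy of the input after only four applications.

```python
P, N = 65537, 8
OMEGA = 16     # 16 = 2^4 has multiplicative order 8 modulo 65537 (illustrative choice)

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) % P for col in cols] for row in A]

# Unnormalized NTT matrix: F[j][i] = omega^(i*j) mod P.
F = [[pow(OMEGA, i * j, P) for i in range(N)] for j in range(N)]

F2 = mat_mul(F, F)     # N times the index-reversal permutation
F4 = mat_mul(F2, F2)   # N^2 times the identity

scaled_identity = [[(N * N % P) * int(i == j) for j in range(N)] for i in range(N)]
print(F4 == scaled_identity)
```

With a key-dependent number of iterations, such a transform would offer only four distinct powers, which is why a matrix with a very large period is attractive here.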

Additionally, it is important to remark that p=65537, the characteristic of the finite field employed in the example developed above, is a Fermat prime, i.e., it has the form $p=2^{2^{s}}+1$, with s=4. In the cases where p is a Fermat prime, the possible multiplicative orders of the elements of GF(p) are divisors of $p-1=2^{2^{s}}$ and CNTs whose lengths are also powers of two can be defined (see Definition 2). This allows the use of standard radix-2 decimation-in-time and decimation-in-frequency fast algorithms to compute the CNT [10]. On the other hand, a prime of the form $p=2^{s}-1$ is a Mersenne prime. In the cases where p is a Mersenne prime, a multiplication by a power of 2 (mod p) corresponds to a cyclic bit shift. This means that, if the CNT kernel is expressed as a sum of powers of two, multiplication-free transforms can be constructed [3, 16].
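The bit-shift property of Mersenne primes can be illustrated as follows. This is a sketch assuming the Mersenne prime p=2¹³−1=8191 (chosen only to keep the exhaustive check small); the function name `mul_pow2` is hypothetical. Because 2ˢ≡1 (mod 2ˢ−1), multiplying by 2ᵏ amounts to rotating the s-bit representation left by k positions.

```python
S = 13
P = (1 << S) - 1      # 8191, a Mersenne prime (illustrative choice)

def mul_pow2(x, k, s=S):
    """Multiply x by 2^k modulo 2^s - 1 using an s-bit cyclic left shift.

    Assumes 0 <= x < 2^s - 1, i.e. x is a reduced residue.
    """
    k %= s
    mask = (1 << s) - 1
    return ((x << k) | (x >> (s - k))) & mask

x, k = 5000, 7
print(mul_pow2(x, k) == (x * 2 ** k) % P)
```

No multiplication or division is performed inside `mul_pow2`, only shifts and a mask, which is the source of the multiplication-free transforms mentioned above.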

Usually, the parameters of a CNT to be used in the processing of a specific signal are chosen in a way such that the computational advantages mentioned above can be achieved. If a signal has samples whose values are integers in the range 0−M, for instance, the smallest Fermat (or Mersenne) prime greater than M is selected and a transform defined over the corresponding finite field is constructed. This premise is considered in the design of the encryption scheme introduced in the next section.

3 The encryption scheme

The proposed encryption scheme is illustrated in Fig. 1. It requires the definition of a CNT over the smallest prime finite field in which the range of integer values assumed by the audio samples can be mapped. In this paper, we have designed the proposed encryption scheme to be applied to 16 bits/sample noncompressed audio signals; each sample assumes an integer value in the range 0−65535 and the CNT given in the example presented in Section 2 is used. In step 1 of our scheme, the CNT matrix C is constructed (see Fig. 1).

An audio block with 8 samples is then taken from the original audio (step 2 in Fig. 1). Such a block, denoted by $\mathbf{b}_{n}$ (starting from $\mathbf{b}_{1}$, which is composed of the first eight samples of the audio signal), is taken in such a way that it overlaps the previous ciphered audio block in two samples; that is, the first two samples of the original audio block $\mathbf{b}_{n}$ are the last two samples of the ciphered audio block $\mathbf {b}^{\prime }_{n-1}$ (only the block $\mathbf{b}_{1}$ is taken without overlap). This provides diffusion in our scheme. The index n of the audio block being processed determines the choice of the element of the secret-key used in the scheme (step 3). More specifically, the secret-key is the K-length vector of integers

$$\mathbf{k}=[k_{0},\, k_{1},\, \ldots,\,k_{K-1}], $$

whose n (mod K)-th component is considered in the encryption of the audio block $\mathbf{b}_{n}$ (see Note 2). Such a component determines the computation of the $k_{n\:(\text{mod}\:K)}$-th power of the CNT matrix C (step 4). The matrix $\mathbf {C}^{k_{n\:(\text {mod}\:K)}}$ is then multiplied by the block $\mathbf{b}_{n}$ (step 5), which produces the provisional ciphered audio block

$$ \mathbf{b}^{\prime}_{n,1}=\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\cdot \mathbf{b}_{n}^{T}. $$

(4)

This corresponds to computing the CNT of $\mathbf{b}_{n}$ iteratively, $k_{n\:(\text{mod}\:K)}$ times.

The block computed in (4) is provisional because, since all computations are carried out modulo p=65537, $\mathbf {b}^{\prime }_{n,1}$ may contain samples whose values are equal to 65536. This would require a binary representation with 17 bits, which violates the encoding of the original audio signal. In order to avoid such an extra bit, the matrix $\mathbf {C}^{k_{n\:(\text {mod}\:K)}}$ is iteratively multiplied by $\mathbf{b}_{n}$; if a block $\mathbf {b}_{n,e}^{\prime }$ obtained from (4) in the e-th iteration contains a sample equal to 65536, we update such a block, multiplying it by $\mathbf {C}^{k_{n\:(\text {mod}\:K)}}$ again. The process stops when a new block $\mathbf {b}^{\prime }_{n,E}$ without samples equal to 65536 is encountered in the E-th iteration (step 6). The definitive block $\mathbf {b}^{\prime }_{n}=\mathbf {b}^{\prime }_{n,E}$ is then taken as the encrypted version of $\mathbf{b}_{n}$ and replaces $\mathbf{b}_{n}$ in the composition of the encrypted audio signal (step 7). The encryption procedure is completed after the whole audio vector is submitted to two rounds of the described transformation strategy. This is necessary to make brute-force attacks infeasible.

The decryption consists of applying, in the reverse order, the same steps used in the encryption; the matrix C is replaced by the matrix $\mathbf{C}^{-1}=\mathbf{C}^{T}$ and the blocks are taken from right to left. We remark that the number of times each audio block has to be iteratively multiplied by $\mathbf {C}^{k_{n\:(\text {mod}\:K)}}$ in the encryption does not need to be known for a successful decryption. Suppose that

$$\left( \mathbf{b}^{\prime}_{n,e}\right)^{T}=\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{e}\cdot\mathbf{b}^{T}_{n},\quad e = 1,2,\ldots,E-1,$$

contains at least one sample whose value is equal to 65536, but the maximum sample value in

$$\left( \mathbf{b}^{\prime}_{n,E}\right)^{T}=\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{E}\cdot\mathbf{b}^{T}_{n}$$

does not exceed 65535. Then, $\mathbf {b}^{\prime }_{n,E}=\mathbf {b}^{\prime }_{n}$ is taken as the (definitive) ciphered version of $\mathbf{b}_{n}$. In the decryption,

$$\left( \mathbf{b}^{\prime}_{n,E-d}\right)^{T}=\left[\mathbf{C}^{-k_{n\:(\text{mod}\:K)}}\right]^{d}\cdot\left( \mathbf{b}^{\prime}_{n}\right)^{T}=\left[\mathbf{C}^{-k_{n\:(\text{mod}\:K)}}\right]^{d}\cdot\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{E}\cdot\mathbf{b}^{T}_{n},$$

d=1,2,…,E−1, will contain at least one sample whose value is equal to 65536; actually, $(\mathbf {b}^{\prime }_{n,E-d})^{T}$, d=1,2,…,E−1, are the same provisional blocks produced in the encryption. Only when d=E is reached is a block with no sample equal to 65536 obtained. Such a block is

$$\left( \mathbf{b}^{\prime}_{n,0}\right)^{T}=\left[\mathbf{C}^{-k_{n\:(\text{mod}\:K)}}\right]^{E}\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{E}\cdot\mathbf{b}^{T}_{n}=\mathbf{b}^{T}_{n},$$

the original audio block correctly recovered.
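The encryption and decryption steps described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the signal length is assumed to be 8+6m so that the overlapping blocks tile it exactly, the block index n restarts at 1 in the second round, and the square roots used to build C are chosen as √ζ=2, √2=4080 and √(2/8)=1/2. All function names are hypothetical.

```python
P, N, STEP = 65537, 8, 6          # STEP = N - 2: adjacent blocks share two samples

def inv(a):
    return pow(a, P - 2, P)

def cnt_matrix():
    # Assumed roots: sqrt(zeta) = 2, sqrt(2) = 4080, sqrt(2/8) = 1/2.
    half = inv(2)
    def cos2(x):
        return (pow(2, x, P) + pow(half, x, P)) * half % P
    return [[half * (inv(4080) if j == 0 else 1) * cos2(j * (2 * i + 1)) % P
             for i in range(N)] for j in range(N)]

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) % P for col in cols] for row in A]

def mat_pow(A, e):
    R = [[int(i == j) for j in range(N)] for i in range(N)]
    while e:
        if e & 1:
            R = mat_mul(R, A)
        A = mat_mul(A, A)
        e >>= 1
    return R

def transform(M, block):
    # Multiply by M; repeat while some sample equals 65536 (it would need 17 bits).
    out = [sum(m * x for m, x in zip(row, block)) % P for row in M]
    while P - 1 in out:
        out = [sum(m * x for m, x in zip(row, out)) % P for row in M]
    return out

def run(signal, key, base, reverse):
    mats = [mat_pow(base, k) for k in key]               # base^k for each key entry
    blocks = range(1, (len(signal) - N) // STEP + 2)     # 1-based block index n
    s = list(signal)
    for _ in range(2):                                   # two rounds
        for n in (reversed(blocks) if reverse else blocks):
            start = STEP * (n - 1)
            s[start:start + N] = transform(mats[n % len(key)], s[start:start + N])
    return s

def encrypt(signal, key):
    return run(signal, key, cnt_matrix(), reverse=False)

def decrypt(signal, key):
    Ct = [list(col) for col in zip(*cnt_matrix())]       # C^{-1} = C^T
    return run(signal, key, Ct, reverse=True)

key = [25, 34, 224, 146, 16, 60, 91, 210, 4, 11, 44, 166, 187, 166, 115]
audio = [(7919 * i * i + 13) % 65536 for i in range(8 + 6 * 20)]   # toy 16-bit samples
cipher = encrypt(audio, key)
print(decrypt(cipher, key) == audio)
```

Note that `transform` is shared by both directions: the decryption never needs the iteration count E, because it simply keeps multiplying by the inverse power until a block free of the value 65536 appears.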

4 Computer experiments and security analysis

The proposed encryption scheme was implemented in Matlab®. Segments with 1.8×10⁵ samples of eight noncompressed audio signals with different characteristics (music, speech etc.), encoded with 16 bits/sample, were encrypted using

$$ \mathbf{k}=[25,\, 34,\, 224,\, 146,\, 16 ,\, 60,\, 91 ,\, 210 ,\, 4 ,\, 11,\, 44,\, 166,\, 187 ,\, 166,\, 115] $$

(5)

as secret-key. In Fig. 2, the waveforms of the original audio signals are shown. The file audio_01.wav, for example, was obtained using the sampling rate $F_{s}=8000$ Hz and, therefore, has time duration equal to 22.5 s; nevertheless, the encryption procedure does not depend on the sampling rate used in the discretization of a signal.

In Fig. 3a, the complete ciphered version of audio_01.wav is shown. Naturally, the “dense” visual aspect of the waveform reflects the rapid variations arising from the encryption process. In order to provide a more suitable visualization, we show in Fig. 3b the first 250 samples of the same ciphered audio signal (the samples are connected with lines to enhance the visualization). In this figure, we can observe the noisy aspect of the waveform. This contrasts with the waveform presented in Fig. 3c, where the first 250 samples of the corresponding original audio signal are shown. In this case, the quasi-periodicity of the waveform is emphasized; this property indicates that the non-ciphered audio segment may represent, for instance, a voiced speech. The visual aspect of the waveforms corresponding to ciphered versions of all other audio signals used in our experiments is similar.

4.1 Histogram analysis

The noisy aspect of the ciphered versions of the audio signals is also reflected in their histograms. In Fig. 4a, the histogram of audio_03.wav is shown; it follows a specific distribution, which is similar to the distributions observed for the other original audio signals. On the other hand, the histogram of the ciphered version of audio_03.wav (Fig. 4b) has a flat shape. This behavior is also verified for the other audio signals. Since the number of samples in the audio segments employed in the simulations (1.8×10⁵) is relatively small when compared with the number of symbols in the source alphabet (65536), the histogram in Fig. 4b appears not to be uniform. However, if longer audio segments are used, the tendency toward uniformity can be observed. See, for example, Fig. 4c, where the histogram of the ciphered version of a segment with 1.8×10⁶ samples of audio_03.wav is shown. This suggests that the proposed encryption scheme produces samples that are uniformly distributed and weakly correlated.

4.2 Statistical analysis

An objective analysis of the statistical properties of the ciphered audio signals resulting from our experiments can be carried out by computing correlation coefficients. By selecting arbitrarily P samples, the correlation coefficient is computed by

$$r_{xy}=\frac{\text{cov}(x,y)}{\sqrt{D(x)D(y)}}, $$

where $\text {cov}(x,y)=\frac {1}{P}{\sum }_{i=1}^{P}(x_{i}-E(x))(y_{i}-E(y))$, $D(x)=\frac {1}{P}{\sum }_{i=1}^{P}(x_{i}-E(x))^{2}$ and $E(x)=\frac {1}{P}{\sum }_{i=1}^{P}x_{i}$; $x_{i}$ is the value of the i-th selected sample and $y_{i}$ is the value of the corresponding adjacent sample. The results for P=10⁵ are shown in Table 1. Original audio signals have correlation coefficients clearly close to one, while ciphered audio signals have correlation coefficients close to zero. This indicates that the proposed scheme is resistant against statistical attacks. Moreover, in the simulations, the entropy of the ciphered audio files has assumed values varying from 15.7057 to 15.7117. Although these values are greater than those commonly observed for non-ciphered 16-bit audio data, they are not too close to 16. Again, this is due to the relationship between the number of samples of the audio signals used in our experiments and the number of symbols in the source alphabet, that is, 65536. If we consider a segment with 1.8×10⁶ samples of audio_03.wav (10 times longer than the segment employed in our simulations), for instance, the corresponding ciphered audio has entropy equal to 15.9735, which is significantly closer to 16, when compared with the entropy values previously given. A similar behavior is verified for all other audio signals. This means that the transformed audio signals are close to a random source and the proposed technique is also secure against the entropy attack.
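The two statistics used in this subsection can be computed as in the sketch below. It assumes that every pair of adjacent samples is used (rather than P arbitrarily selected samples) and that the entropy is the empirical Shannon entropy in bits per sample; the function names are illustrative.

```python
from collections import Counter
from math import log2, sqrt

def adjacent_correlation(samples):
    """Correlation coefficient r_xy between each sample and its right neighbour."""
    x, y = samples[:-1], samples[1:]
    num_pairs = len(x)
    ex, ey = sum(x) / num_pairs, sum(y) / num_pairs
    cov = sum((a - ex) * (b - ey) for a, b in zip(x, y)) / num_pairs
    dx = sum((a - ex) ** 2 for a in x) / num_pairs
    dy = sum((b - ey) ** 2 for b in y) / num_pairs
    return cov / sqrt(dx * dy)

def entropy_bits(samples):
    """Empirical Shannon entropy of the sample values, in bits per sample."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())

ramp = list(range(10))                 # strongly correlated toy signal
print(adjacent_correlation(ramp))      # close to 1
print(entropy_bits([0, 1, 2, 3]))      # four equally likely symbols: 2 bits
```

For 16-bit audio, `entropy_bits` can approach 16 only when the signal is long enough for all 65536 symbols to occur with nearly equal frequency, which is consistent with the dependence on segment length reported above.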

Table 1 Correlation coefficients of original (r) and ciphered ($\tilde {r}$) audio files used in the simulations


4.3 Key space

Another important security parameter is the key space. If we encode each key position with 10 bits, for example, a key space of size 2¹⁵⁰ is achieved with the 15-length key used in the experiments (K=15). In this respect, the proposed scheme is very flexible. Larger key spaces can be obtained by increasing the key length or by increasing the number of bits used to encode each key position. A key space of size 2²⁵⁶, for example, is obtained if we consider K=16 and encode each key position with 16 bits. In fact, according to the comments made after (3), each key position could be an integer in the range 1−10⁹. This indicates that our scheme is secure against brute-force attacks [23].
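The key space sizes quoted above follow from simple arithmetic, sketched here with a hypothetical helper. The last line estimates the size in bits when each of the 15 positions ranges over 1 to 10⁹, assuming all such values yield distinct powers of C.

```python
from math import log2

def key_space_bits(key_length, bits_per_position):
    """log2 of the number of keys when every position is encoded independently."""
    return key_length * bits_per_position

print(key_space_bits(15, 10))    # 150, i.e. a key space of size 2^150
print(key_space_bits(16, 16))    # 256, i.e. a key space of size 2^256
print(15 * log2(10 ** 9))        # roughly 448 bits for positions in 1..10^9
```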

4.4 Robustness to differential attacks

In order to evaluate the resistance of the method against differential attacks, for each original audio signal, we randomly choose one sample. The least significant bit of such a sample is inverted and a modified audio signal is obtained. Original and modified audio signals are encrypted using the same key and two ciphered audio signals are generated. Such ciphered audio signals are then compared by the number of samples change rate (NSCR) and the unified average changing intensity (UACI), which are defined by [27]

$$ \textrm{NSCR} = \frac{{\sum}_{i}D_{i}}{L}\times 100~\% $$

and

$$ \text{UACI} = \frac{1}{L}\left[ \sum\limits_{i}\frac{|A_{i}-A_{i}^{\prime}|}{65535}\right]\times 100~\%. $$

$A$ and $A^{\prime}$ are the two ciphered audio signals whose corresponding original audio signals have only one-sample difference; the values of the samples at position i of $A$ and $A^{\prime}$ are respectively denoted by $A_{i}$ and $A_{i}^{\prime }$; L corresponds to the length of the audio vector; $D_{i}$ is determined according to the rule

$$D_{i} =\left\{ \begin{array}{ll}1, & A_{i} \neq A_{i}^{\prime}, \\ 0, & \text{otherwise}. \end{array} \right. $$

The ideal values for NSCR and UACI are 100 % and 33.3 %, respectively [27]. In Table 2, the minimum, the maximum and the average values of NSCR and UACI, computed from the encryption of 100 different modified versions of each audio signal, are shown. The results are considerably close to the ideal values and do not depend on the position of the modified sample.
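Both metrics are straightforward to compute from two ciphered signals of equal length. The sketch below assumes 16-bit samples and expresses both results in percent, so that they can be compared directly with the ideal values; the function names are illustrative.

```python
def nscr(a, b):
    """Number of samples change rate, in percent."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)

def uaci(a, b, max_value=65535):
    """Unified average changing intensity, in percent (16-bit samples assumed)."""
    return 100.0 * sum(abs(x - y) for x, y in zip(a, b)) / (max_value * len(a))

print(nscr([1, 2, 3, 4], [1, 2, 0, 4]))     # one of four samples differs: 25.0
print(uaci([0, 65535], [65535, 65535]))     # mean normalized difference 0.5: 50.0
```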

Table 2 Maximum, minimum and average NSCR and UACI (100 different modified versions of each audio signal were used)


4.5 Key sensitivity

The key sensitivity of the proposed encryption scheme is evaluated by encrypting an audio signal with a given secret-key k and attempting to decrypt it with a wrong key $\mathbf{k}^{\prime}$ slightly different from k. In our simulations, we use

$$\mathbf{k}^{\prime}=[25,\, 34,\, 224,\, 146,\, 16 ,\, 60,\, 91 ,\, 210 ,\, 4 ,\, 11,\, 44,\, 166,\, 187 ,\, 166,\, \underline{114}] $$

which is different from the key k given in (5) by one bit only (the underlined number 114 in $\mathbf{k}^{\prime}$ replaces the number 115 in k). The original audio and that recovered with $\mathbf{k}^{\prime}$ are compared using the number of samples change rate. The NSCR obtained for all audio signals used in the simulations varies from 99.9972 % to 100.0000 %, which means that the audio signals decrypted using the wrong key are completely different from the original ones. In fact, the aspect of the recovered audio signals is completely noisy, being similar to those presented in Figs. 3a and 3b.

4.6 Known-plaintext and chosen-plaintext attacks

A preliminary analysis indicates that the proposed scheme can also resist known-plaintext and chosen-plaintext attacks. Even if an adversary has access to some plaintext/ciphertext pair, the overlapping among adjacent audio blocks and the employment of a two-round encryption procedure reduce to a brute-force attack the attempt of obtaining the secret-key. The adversary could find the position ℓ of the last ciphered audio block $\mathbf {b}^{\prime }_{\ell }$ and also determine the index ℓ (mod K) of the component $k_{\ell\:(\text{mod}\:K)}$ of the secret-key k used to encrypt such a block. However, $k_{\ell\:(\text{mod}\:K)}$ could not be discovered by “comparing” successive results of recursive computations of the inverse CNT of $\mathbf {b}^{\prime }_{\ell }$ to the corresponding known-plaintext (the block $\mathbf{b}_{\ell}$ at the same position in the original audio). This is due to the fact that, even if the decryption of $\mathbf {b}^{\prime }_{\ell }$ is correct, it produces an audio block which is composed of samples encrypted in the first encryption round.

The choice of plaintexts that would reveal the secret-key is apparently not straightforward. If we choose an audio signal with all the samples equal to zero, a ciphered audio signal with all the samples equal to zero is obtained. Another possibility is to choose an audio such that the only nonzero sample is the first one, i.e., $\mathbf{b}_{1}=[1\;0\;0\;0\;0\;0\;0\;0]$ and $\mathbf{b}_{i}=\mathbf{0}$, i≠1. If an adversary has access to $\mathbf {b}^{\prime }_{1}$ (computed in the first encryption round), he can obtain the exponent $k_{1}$ by verifying exhaustively whether the first column of $\mathbf {C}^{k_{1}}$ is equal to $\left (\mathbf {b}^{\prime }_{1}\right )^{T}$. Applying a similar procedure to the pair $\mathbf{b}_{2}$ and $\mathbf {b}^{\prime }_{2}$, $k_{2}$ could be obtained and so on. However, such a procedure is feasible only if the adversary has access to each ciphered audio block before the next block is processed; usually, this is not considered a realistic attack scenario. If the adversary has access only to the whole ciphered audio, even if he can choose a plaintext, the described attack becomes impractical.

5 Discussion and concluding remarks

We have introduced an audio encryption scheme based on the cosine number transform. The scheme is very flexible and can be applied to noncompressed digital data encoded with different numbers of bits per sample. Our approach has demonstrated robustness against the main cryptographic attacks, namely, statistical, brute-force, differential, known-plaintext and chosen-plaintext attacks [23]. This is supported by the results obtained in our experiments, which include the calculation of several metrics specifically related to certain types of attacks. In the literature, the contexts in which other audio encryption techniques are placed are very diverse. Furthermore, the incompleteness of some previous audio encryption papers with respect to security aspects makes a systematic and full comparison with our approach infeasible. Considering such restrictions, we carry out a preliminary and qualitative analysis regarding this point.

In [18], for example, the authors basically compare the histogram of one audio signal to the histogram of the corresponding ciphered audio signal by means of a visual inspection. Moreover, the entropy, the standard deviation and the mean absolute difference of such a ciphered audio are calculated. Although such measurements indicate that the ciphered audio has a statistical behavior similar to that of a uniformly distributed random source, they are not sufficient to ensure the security of the method.

The scheme proposed in [17] is based on virtual optics and requires the conversion of the audio signal into a two-dimensional sound map. According to the authors, the method is highly sensitive to deviations in parameters related to the secret-key and its key space size can reach $(2^{8})^{256\times 256}$. Besides not presenting a complete security analysis, the implementation of the method depends on the knowledge of operations and elements commonly employed in optics frameworks, but probably unfamiliar in multimedia scenarios. This hinders its practical utilization, which contrasts with the simplicity and the suitability of our method for processing digital data.

The method described in [21] employs real-valued discrete transforms, which means that rounding operations are necessary. Unlike our method, this may produce a decrypted audio signal slightly different from the corresponding original audio signal. Moreover, the experiments performed by the authors are incomplete (for example, only one audio signal is considered) and unusual metrics are computed for security analysis. Nevertheless, we have verified that our correlation measurements are similar to those obtained by the authors in their paper.

In [8], a scheme based on a higher dimensional Arnold’s cat map is proposed. According to the authors, the security level of their approach depends on the iteration time, which is directly proportional to the key space size. On the other hand, our scheme involves only two rounds and the key space size can be increased without the need of more iterations. Moreover, although the authors mention several security parameters, they do not present numerical results. This raises doubts regarding the robustness of the method against specific cryptographic attacks.

The scopes considered in [11] and [26] are quite different from ours. In [11], a partial encryption is performed by means of watermarking and scrambling in MP3; the authors perform numerical experiments whose focus is the overhead rate and the watermark robustness against amplitude reduction and echo addition. The selective encryption scheme presented in [26] encrypts only the important audio data in order to achieve both real-time performance and energy-efficient transmission in wireless multimedia sensor networks; security aspects and metrics usually considered to evaluate a data encryption algorithm are not discussed.

The extension of the proposed scheme to digital images is currently under investigation. In this case, a two-dimensional CNT has to be considered and the transform parameters have to be chosen according to specific digital image standards. Aspects related to the fast computation of the CNT should also be considered in future work. This is particularly important in scenarios where hardware implementations of the proposed encryption technique are desired. The definition of a CNT over fields of characteristic two is also part of the topics for future research. The possibility of defining a CNT over GF(2¹⁶), for instance, would eliminate the need of recursively computing the transform of an audio block in order to avoid the appearance of certain sample values. This would simplify our method and make it faster.

Notes

1. The multiplicative order of an element ζ in the finite field GF(p) is the least positive integer l such that $\zeta^{l}\equiv 1$ (mod p).
2. The index of the component selected in the secret-key has to be reduced modulo K because the number of audio blocks to be processed throughout the encryption procedure is usually greater than the key-length K. In this sense, the K-th block is processed using the component of index K (mod K)≡0 (mod K) of the secret-key; the (K+1)-th block is processed using the component of index K+1 (mod K)≡1 (mod K) of the secret-key and so on.

References

Abuturab MR (2013) Color image security system based on discrete Hartley transform in gyrator transform domain. Opt Lasers Eng 51(3):317–324
Birtwistle DT (1982) The eigenstructure of the number theoretic transforms. Signal Process 4(4):287–294
Blahut RE (2010) Fast algorithms for signal processing. Cambridge University Press
Cheddad A, Condell J, Curran K, McKevitt P (2010) Digital image steganography: Survey and analysis of current methods. Signal Process 90(3):727–752
Cintra RJ, Dimitrov VS, Campello de Souza RM, de Oliveira HM (2009) Fragile watermarking using finite field trigonometrical transforms. Signal Process Image Commun 24:587–597
Cox I, Miller M, Bloom J, Fridrich J, Kalker T (2007) Digital watermarking and steganography, 2nd edn. The Morgan Kaufmann series in multimedia information and systems. Morgan Kaufmann
Fallahpour M, Megias D (2009) High capacity audio watermarking using FFT amplitude interpolation. IEICE Electron Express 6(14):1057–1063
Gnanajeyaraman R, Prasadh K, Ramar D (2009) Audio encryption using higher dimensional chaotic map. Int J Recent Trends Eng 1(2):103–107
Gong L, Liu X, Zheng F, Zhou N (2013) Flexible multiple-image encryption algorithm based on log-polar transform and double random phase encoding technique. J Modern Opt 60(13):1074–1082
Kok CW (1997) Fast algorithm for computing discrete cosine transform. IEEE Trans Signal Process 45(3):757–760
Kwon GR, Wang C, Lian S, Hwang SS (2012) Advanced partial encryption using watermarking and scrambling in MP3. Multimed Tools Appl 59(3):885–895
Lian S (2008) Multimedia content encryption: techniques and applications, 7th edn. Auerbach Publications
Lima JB, Lima EAO, Madeiro F (2013) Image encryption based on the finite field cosine transform. Signal Process Image Commun 28(10):1537–1547
Lima JB, Campello de Souza RM (2011) Finite field trigonometric transforms. Appl Algebra Eng Commun Comput 22(5-6):393–411
Madain A, Abu Dalhoum AL, Hiary H, Ortega A, Alfonseca M (2014) Audio scrambling technique based on cellular automata. Multimed Tools Appl 71(3):1803–1822
Nibouche O, Boussakta S, Darnell M (2009) Pipeline architectures for radix-2 new Mersenne number transform. IEEE Trans Circ Syst–I: Regular Papers 56(8):1668–1680
Peng X, Cui Z, Cai L, Yu L (2003) Digital audio signal encryption with a virtual optics scheme. Optik - Int J Light Electron Opt 114(2):69–75
Raghunandhan KR, Radhakrishna D, Sudeepa KB, Ganesh A (2013) Efficient audio encryption algorithm for online applications using transposition and multiplicative non-binary system. Int J Eng Res Technol 2(6):472–477
Rubanov NS, Bovbel EI, Kukharchik PD, Bodrov VJ (1998) The modified number theoretic transform over the direct sum of finite fields to compute the linear convolution. IEEE Trans Signal Process 46(3):813–817
Sadek MM, Khalifa AS, Mostafa MGM (2014) Video steganography: a comprehensive review. Multimed Tools Appl. 1–32. doi:10.1007/s11042-014-1952-z
Serag Eldin SM, Khamis SA, Mahmoud Hassanin AAI, Alsharqawy MA (2015) New audio encryption package for TV cloud computing. Int J Speech Technol 18(1):131–142
Shih FY (2012) Multimedia security: watermarking, steganography and forensics. CRC Press
Smart N (2011) ECRYPT II yearly report on algorithms and keysizes (2010-2011). Tech. rep., European Network of Excellence in Cryptology II
Sui L, Duan K, Liang J, Zhang Z, Meng H (2014) Asymmetric multiple-image encryption based on coupled logistic maps in fractional Fourier transform domain. Opt Lasers Eng 62:139–152
Tamori H, Yamamoto T (2009) Asymetric fragile watermarking using a number theoretic transform. IEICE Trans Fundament Electron Commun Comput Sci E92-A(3):836–838
Wang H, Hempel M, Peng D, Sharif H, Chen HH (2010) Index-based selective audio encryption for wireless multimedia sensor networks. IEEE Trans Multimed 12(3):215–223
Wang Y, Wong KW, Liao X, Chen G (2011) A new chaos-based fast image encryption algorithm. Appl Soft Comput 11(1):514–522
Yan D, Wang R, Yu X, Zhu J (2012) Steganography for MP3 audio by exploiting the rule of window switching. Comput Secur 31(5):704–716
Ye G (2010) Image scrambling encryption algorithm of pixel bit based on chaos map. Pattern Recog Lett 31(5):347–354
Zhou N, Zhang A, Zheng F, Gong L (2014) Image compression-encryption hybrid algorithm based on key-controlled measurement matrix in compressive sensing. Opt Laser Technol 62:152–160

Acknowledgments

This research was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under Grants 307686/2014-0 and 456744/2014-2.

Author information

Authors and Affiliations

Department of Electronics and Systems, Federal University of Pernambuco, Av. da Arquitetura, S/N, 50740-550, Recife, Brazil
Juliano B. Lima & Eronides F. da Silva Neto

Authors

Juliano B. Lima
Eronides F. da Silva Neto

Corresponding author

Correspondence to Juliano B. Lima.

About this article

Cite this article

Lima, J.B., da Silva Neto, E.F. Audio encryption based on the cosine number transform. Multimed Tools Appl 75, 8403–8418 (2016). https://doi.org/10.1007/s11042-015-2755-6

Received: 02 November 2014
Revised: 17 April 2015
Accepted: 15 June 2015
Published: 01 July 2015
Issue Date: July 2016
DOI: https://doi.org/10.1007/s11042-015-2755-6
