1 Introduction

In recent years, the development of multimedia security techniques has attracted the attention of researchers from several fields of knowledge. This is mainly due to the increasing ease of sharing digital images, video and audio through communication networks [22]. In this scenario, steganographic and watermarking schemes have the purpose of providing privacy and authenticity using information hiding fundamentals [4, 6, 7, 20, 28]. On the other hand, multimedia encryption schemes employ several strategies and mathematical tools with the purpose of making perceptual and statistical aspects of an image or an audio file seem noisy [1, 12, 15, 29, 30]. In the optical domain, image encryption can be performed by using fractional Fourier transforms and random phase encoding, for instance [9, 24]. Chaotic maps, which have the property of being highly sensitive to initial conditions, are also widely used in multimedia encryption techniques [27].

More specifically, digital audio encryption can be implemented in different ways and applied to scenarios with manifold requirements. In [21], for example, chaotic and multiple-keys algorithms are employed in association with discrete transforms; the purpose is to provide a new audio encryption package for TV cloud computing. The audio encryption algorithm proposed in [18] is devoted to online applications and employs transposition and a multiplicative non-binary system. In [8], a higher dimensional chaotic map is used to enhance the key space and the security of an iterative audio encryption method. The approach presented in [17] is based on a virtual optics scheme; both virtual wavelength and virtual diffraction distance are applied in conjunction with a complex-valued random mask to design multiple-locks and multiple-keys in the course of audio data encryption and decryption. In [26], an index-based selective audio encryption for wireless multimedia sensor networks is proposed and, in [11], the authors introduce an advanced partial encryption using watermarking and scrambling in MP3.

In this paper, we propose an audio encryption scheme based on the cosine number transform (CNT). This transform was originally named finite field cosine transform, as a reference to the algebraic structures where it is defined [14]. However, in the present context, the CNT can be viewed as a cosine-based version of the number-theoretic transform (NTT), which explains the nomenclature we have adopted. Number-theoretic transforms are well-known in the signal processing community, where they are used in fast algorithms for computing linear convolutions [3, 19] and, more recently, in fragile watermarking techniques [5, 25]. Analogously to the NTT, all arithmetic operations involved in the computation of the CNT are carried out modulo an odd prime p. In other words, computations in extension fields are avoided, which is suitable for signal processing applications.

The first encryption scheme based on the CNT was proposed in [13]. In that paper, the encryption of grayscale images is considered and the secret-key, which is given by a permutation, determines the position of each image block to be processed by the transform. Such a scheme also includes a preliminary transform step, which is not key-dependent. The encryption scheme proposed in the present work is applicable to noncompressed audio. The technique consists in applying the CNT to blocks of samples of an audio signal. The number of times that the transform is recursively applied to each block depends on a secret-key. The transformed block replaces the original block before the next block is processed. Since there is an overlapping among the samples of two adjacent blocks, the ciphered data is diffused along the whole audio signal. This is important to ensure some properties related to the security of the method.

Besides complying with the main security requisites of a multimedia encryption scheme, the proposed approach has the following attractive features: (i) simplicity: basically, the scheme consists in computing transforms of audio blocks and requires only two encryption rounds; (ii) flexibility: the scheme can be easily adapted to audio signals encoded with different numbers of bits per sample and allows adjustments on the key sizes; (iii) fidelity: since rounding is not necessary at any step of the algorithm, if the key is correct, the decrypted audio is identical to the original audio signal; (iv) computational efficiency: the CNT can be computed via fast algorithms and employing fixed-point arithmetic operations only. This permits efficient implementations and reduces the number of additions and multiplications necessary to calculate an N-length CNT from \(\mathcal {O}(N^{2})\) to \(\mathcal {O}(N\log N)\) [3, 10].

This paper is organized as follows. In Section 2, the main theoretical aspects concerning the cosine number transform are presented. In Section 3, we introduce the proposed scheme and describe the steps involved in the encryption/decryption of an audio signal. In Section 4, we present numerical results of computer experiments of the proposed technique and analyze its security. A preliminary comparison between our approach and other state-of-the-art audio encryption schemes is carried out and some concluding remarks are presented in Section 5.

2 Cosine number transform

The definition of the cosine number transform requires the following finite field cosine function.

Definition 1

Let ζ be a nonzero element in the finite field GF(p), p an odd prime. The finite field cosine function related to ζ is computed modulo p by

$$ \cos_{\zeta}(x):=\frac{\zeta^{x}+\zeta^{-x}}{2}, $$
(1)

x=0,1,…,ord(ζ), where ord(ζ) denotes the multiplicative order of ζ (see Footnote 1).

The finite field cosine function holds properties similar to those of the standard real-valued one, such as the unit circle and the addition of arcs, for instance. Definition 1 can also contain additional details, which do not need to be considered in the present context [14]. The cosine number transform is given by the following definition.

Definition 2

Let ζ∈GF(p) be an element such that ord(ζ)=2N. The cosine number transform of the vector \(\mathbf{x}=[x_{0},x_{1},\ldots,x_{N-1}]\), \(x_{i}\in \text{GF}(p)\), is the vector \(\mathbf{X}=[X_{0},X_{1},\ldots,X_{N-1}]\), \(X_{j}\in \text{GF}(p)\), of elements

$$ X_{j}:=\sqrt{\frac{2}{N}}\sum\limits_{i=0}^{N-1} \beta_{j} x_{i}\cos_{\zeta}\left( j\frac{2i+1}{2}\right) $$
(2)

computed modulo p, where

$$ \beta_{j} =\left\{ \begin{array}{ll}1/\sqrt{2}\:(\text{mod}\:\:p), & j=0, \\ 1, & j=1,2,\ldots,N-1. \end{array} \right. $$

The computation of the CNT of a row vector x can be represented by the matrix equation

$$\mathbf{X}=\mathbf{C}\cdot\mathbf{x}^{T}, $$

where \(\mathbf{x}^{T}\) is the transpose of the vector x and C corresponds to the transform matrix, whose element in the (j+1)-th row and the (i+1)-th column is given by

$$C_{j+1,i+1}=\sqrt{\frac{2}{N}}\beta_{j} \cos_{\zeta}\left( j\frac{2i+1}{2}\right)=\sqrt{\frac{2}{N}}\beta_{j} \cos_{\sqrt{\zeta}}\left( j(2i+1)\right), $$

i,j=0,1,…,N−1. It can be shown that the inverse CNT is obtained by using the transform matrix \(\mathbf{C}^{-1}=\mathbf{C}^{T}\) [14]. This means that algorithms and architectures designed to compute a CNT can be easily adjusted to compute the corresponding inverse CNT.

As an example, let us construct a CNT of length N=8 over GF(65537). We use the element ζ=4, with multiplicative order ord(ζ)=16, and, from Definitions 1 and 2, we compute

$$\mathbf{C}=\left[ \arraycolsep=4.2pt \begin{array}{cccccccc} 1020 & 1020 & 1020 & 1020 & 1020 & 1020 & 1020 & 1020\\ 24577 & 63491 & 65033 & 65441 & 96 & 504 & 2046 & 40960\\ 61442 & 65297 & 240 & 4095 & 4095 & 240 & 65297 & 61442\\ 63491 & 96 & 40960 & 504 & 65033 & 24577 & 65441 & 2046\\ 64517 & 1020 & 1020 & 64517 & 64517 & 1020 & 1020 & 64517\\ 65033 & 40960 & 65441 & 63491 & 2046 & 96 & 24577 & 504\\ 65297 & 4095 & 61442 & 240 & 240 & 61442 & 4095 & 65297\\ 65441 & 504 & 63491 & 40960 & 24577 & 2046 & 65033 & 96 \end{array}\right]. $$
(3)
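The construction of the matrix in (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`inv`, `ff_cos`, `cnt_matrix`, `matmul`) are hypothetical, and the particular square roots chosen (2 for ζ=4, 4080 for 2, and 1/2 for 2/N=1/4) are assumptions, since each square root in GF(65537) has two possible values.

```python
P = 65537      # Fermat prime 2**16 + 1
N = 8          # transform length
ROOT = 2       # assumed square root of zeta = 4; its multiplicative order is 32
ORDER = 32


def inv(a):
    """Modular inverse in GF(P) via Fermat's little theorem."""
    return pow(a, P - 2, P)


def ff_cos(x):
    """Finite field cosine related to ROOT: (ROOT^x + ROOT^(-x)) / 2 mod P."""
    e = x % ORDER
    return (pow(ROOT, e, P) + pow(ROOT, (ORDER - e) % ORDER, P)) * inv(2) % P


SQRT2 = 4080       # assumed root: 4080**2 % P == 2
SCALE = inv(2)     # assumed root of 2/N = 1/4


def cnt_matrix():
    """N x N matrix with entries sqrt(2/N) * beta_j * cos(j(2i+1)) mod P."""
    rows = []
    for j in range(N):
        beta = inv(SQRT2) if j == 0 else 1
        rows.append([SCALE * beta * ff_cos(j * (2 * i + 1)) % P
                     for i in range(N)])
    return rows


def matmul(a, b):
    """Product of two N x N matrices modulo P."""
    return [[sum(a[r][k] * b[k][c] for k in range(N)) % P for c in range(N)]
            for r in range(N)]


C = cnt_matrix()
CT = [list(col) for col in zip(*C)]                    # transpose of C
IDENTITY = [[int(r == c) for c in range(N)] for r in range(N)]
for row in C:
    print(row)
print(matmul(C, CT) == IDENTITY)                       # orthogonality check
```

With these choices of square roots the computed rows coincide with the entries of (3), and the final check confirms that the transpose acts as the inverse modulo p.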

Regarding the matrix C given in (3), it is important to remark that the least positive integer l such that \(\mathbf{C}^{l}=\mathbf{I}\) (the identity matrix) is greater than \(10^{9}\). This means that the constructed CNT can be iteratively applied to a vector x at least 1 billion times before the original vector x is recovered. Due to this property, the CNT is suitable for cryptographic schemes which use recursive transformations. This strategy would not be effective, for example, if standard number-theoretic transforms were considered, because the fourth power of a standard NTT matrix is equal to the identity matrix [2, 3, 25].
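The contrast between the two transforms can be checked numerically. The sketch below is illustrative only: it assumes a Fourier-like NTT of length 8 over GF(65537) with kernel ω=16 (an element of order 8) and normalization 1/√N=1020, which are choices made here for the example and are not taken from the cited references. The function names are hypothetical.

```python
P, N = 65537, 8


def matmul(a, b):
    """Product of two N x N matrices modulo P."""
    return [[sum(a[r][k] * b[k][c] for k in range(N)) % P for c in range(N)]
            for r in range(N)]


def matpow(m, e):
    """m**e modulo P by repeated squaring."""
    result = [[int(r == c) for c in range(N)] for r in range(N)]
    while e:
        if e & 1:
            result = matmul(result, m)
        m = matmul(m, m)
        e >>= 1
    return result


def cnt_matrix():
    """CNT matrix of the example (zeta = 4, assumed square root 2)."""
    half = pow(2, P - 2, P)

    def cos2(x):
        return (pow(2, x % 32, P) + pow(2, -x % 32, P)) * half % P

    return [[half * (2040 if j == 0 else 1) * cos2(j * (2 * i + 1)) % P
             for i in range(N)] for j in range(N)]


def ntt_matrix():
    """Normalized Fourier-like NTT matrix (assumed kernel omega = 16)."""
    scale = 1020          # 8 * 1020**2 % P == 1, so scale acts as 1/sqrt(N)
    return [[scale * pow(16, i * j, P) % P for i in range(N)]
            for j in range(N)]


IDENTITY = [[int(r == c) for c in range(N)] for r in range(N)]
F, C = ntt_matrix(), cnt_matrix()
print(matpow(F, 4) == IDENTITY)   # True: the NTT has period 4
print(matpow(C, 4) == IDENTITY)   # the text reports a CNT period above 10**9
```

Because the NTT returns to the identity after four applications, an exponent used as a key component would carry at most two bits of information, whereas the long period reported for the CNT leaves room for much larger exponents.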

Additionally, it is important to remark that p=65537, the characteristic of the finite field employed in the example developed above, is a Fermat prime, i.e., it has the form \(p=2^{2^{s}}+1\), with s=4. In the cases where p is a Fermat prime, the possible multiplicative orders of the elements of GF(p) are divisors of \(p-1=2^{2^{s}}\) and CNTs whose lengths are also powers of two can be defined (see Definition 2). This allows the use of standard radix-2 decimation-in-time and decimation-in-frequency fast algorithms to compute the CNT [10]. On the other hand, if \(p=2^{s}-1\) is prime, it is a Mersenne prime. In the cases where p is a Mersenne prime, multiplications by powers of 2 (mod p) correspond to a bit shift. This means that, if the CNT kernel is expressed as a sum of powers of two, multiplication-free transforms can be constructed [3, 16].

Usually, the parameters of a CNT to be used in the processing of a specific signal are chosen in a way such that the computational advantages mentioned above can be achieved. If a signal has samples whose values are integers in the range 0−M, for instance, the smallest Fermat (or Mersenne) prime greater than M is selected and a transform defined over the corresponding finite field is constructed. This premise is considered in the design of the encryption scheme introduced in the next section.
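As a small illustration of this selection rule, the following sketch picks the smallest Fermat or Mersenne prime above the maximum sample value M. The function name `smallest_prime_above` is hypothetical, and the Mersenne list is truncated at \(2^{31}-1\) purely for brevity.

```python
# The five known Fermat primes and the Mersenne primes up to 2**31 - 1.
FERMAT_PRIMES = [3, 5, 17, 257, 65537]
MERSENNE_PRIMES = [3, 7, 31, 127, 8191, 131071, 524287, 2147483647]


def smallest_prime_above(max_value, candidates):
    """Return the first prime in the sorted list that exceeds max_value."""
    for p in candidates:
        if p > max_value:
            return p
    raise ValueError("no listed prime exceeds the requested maximum")


# 16 bits/sample audio: sample values lie in 0..65535
print(smallest_prime_above(65535, FERMAT_PRIMES))    # 65537
print(smallest_prime_above(65535, MERSENNE_PRIMES))  # 131071
# 8 bits/sample audio: sample values lie in 0..255
print(smallest_prime_above(255, FERMAT_PRIMES))      # 257
```

For 16-bit samples the Fermat prime 65537 exceeds M by only two, which is why a single field value (65536) falls outside the sample range.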

3 The encryption scheme

The proposed encryption scheme is illustrated in Fig. 1. It requires the definition of a CNT over the smallest prime finite field in which the range of integer values assumed by the audio samples can be mapped. In this paper, we have designed the proposed encryption scheme to be applied to 16 bits/sample noncompressed audio signals; each sample assumes an integer value in the range 0−65535 and the CNT given in the example presented in Section 2 is used. In step 1 of our scheme, the CNT matrix C is constructed (see Fig. 1).

Fig. 1 Block diagram of the proposed audio encryption scheme

An audio block with 8 samples is then taken from the original audio (step 2 in Fig. 1); such a block, which is denoted by \(\mathbf{b}_{n}\) (starting from \(\mathbf{b}_{1}\), which is composed of the first eight samples of the audio signal), is taken in such a way that it overlaps the previous ciphered audio block in two samples; that is, the first two samples of the original audio block \(\mathbf{b}_{n}\) are the last two samples of the ciphered audio block \(\mathbf {b}^{\prime }_{n-1}\) (only the block \(\mathbf{b}_{1}\) is taken without the overlapping). This provides diffusion in our scheme. The index n of the audio block being processed determines the choice of the element of the secret-key used in the scheme (step 3). More specifically, the secret-key is the K-length vector of integers

$$\mathbf{k}=[k_{0},\, k_{1},\, \ldots,\,k_{K-1}], $$

whose n (mod K)-th component is considered in the encryption of the audio block \(\mathbf{b}_{n}\) (see Footnote 2). Such a component determines the computation of the \(k_{n\:(\text{mod}\:K)}\)-th power of the CNT matrix C (step 4). The matrix \(\mathbf {C}^{k_{n\:(\text {mod}\:K)}}\) is then multiplied by the block \(\mathbf{b}_{n}\) (step 5), which produces the provisory ciphered audio block

$$ \mathbf{b}^{\prime}_{n,1}=\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\cdot \mathbf{b}_{n}^{T}. $$
(4)

This corresponds to iteratively computing the CNT of \(\mathbf{b}_{n}\) a total of \(k_{n\:(\text{mod}\:K)}\) times.

The block computed in (4) is provisory because, since all computations are carried out modulo p=65537, \(\mathbf {b}^{\prime }_{n,1}\) may contain samples whose values are equal to 65536. This would require a binary representation with 17 bits, which violates the encoding of the original audio signal. In order to avoid such an extra bit, the matrix \(\mathbf {C}^{k_{n\:(\text {mod}\:K)}}\) is iteratively multiplied by \(\mathbf{b}_{n}\); if a block \(\mathbf {b}_{n,e}^{\prime }\) obtained from (4) in the e-th iteration contains a sample equal to 65536, we update such a block, multiplying it by \(\mathbf {C}^{k_{n\:(\text {mod}\:K)}}\) again. The process stops when a new block \(\mathbf {b}^{\prime }_{n,E}\) without samples equal to 65536 is encountered in the E-th iteration (step 6). The definitive block \(\mathbf {b}^{\prime }_{n}=\mathbf {b}^{\prime }_{n,E}\) is then taken as the encrypted version of \(\mathbf{b}_{n}\) and replaces \(\mathbf{b}_{n}\) in the composition of the encrypted audio signal (step 7). The encryption procedure is completed after the whole audio vector is submitted to two rounds of the described transformation strategy. This is necessary to make brute-force attacks unfeasible.
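Steps 4 to 6 for a single block can be sketched as follows. This is an illustrative Python sketch rather than the authors' Matlab code; the function names are hypothetical, the CNT matrix is rebuilt with the same assumed square roots as in the Section 2 example, and the sample block and the key component 25 are arbitrary.

```python
P, N = 65537, 8          # field characteristic and block length from the text
FORBIDDEN = P - 1        # 65536 does not fit in 16 bits


def cnt_matrix():
    """CNT matrix of the Section 2 example (zeta = 4, assumed square root 2)."""
    half = pow(2, P - 2, P)

    def cos2(x):
        return (pow(2, x % 32, P) + pow(2, -x % 32, P)) * half % P

    return [[half * (2040 if j == 0 else 1) * cos2(j * (2 * i + 1)) % P
             for i in range(N)] for j in range(N)]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(N)) % P for c in range(N)]
            for r in range(N)]


def matpow(m, e):
    """m**e modulo P by repeated squaring."""
    result = [[int(r == c) for c in range(N)] for r in range(N)]
    while e:
        if e & 1:
            result = matmul(result, m)
        m = matmul(m, m)
        e >>= 1
    return result


def matvec(m, v):
    return [sum(m[r][k] * v[k] for k in range(N)) % P for r in range(N)]


def encrypt_block(block, key_component, C):
    """Apply C**key_component until no sample equals 65536.

    Returns the ciphered block and the number E of iterations used.
    """
    if any(not 0 <= s < FORBIDDEN for s in block):
        raise ValueError("samples must lie in 0..65535")
    M = matpow(C, key_component)
    out, iterations = matvec(M, block), 1
    while FORBIDDEN in out:          # provisory block: transform again
        out, iterations = matvec(M, out), iterations + 1
    return out, iterations


C = cnt_matrix()
cipher, E = encrypt_block([100, 200, 300, 400, 500, 600, 700, 800], 25, C)
print(cipher, E)
```

The loop always terminates for a valid input block: the matrix power has finite multiplicative order, so the iteration eventually returns to the input block itself, which contains no sample equal to 65536.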

The decryption consists in applying, in the reverse order, the same steps used in the encryption; the matrix C is replaced by the matrix \(\mathbf{C}^{-1}=\mathbf{C}^{T}\) and the blocks are taken from right to left. We remark that the number of times each audio block has to be iteratively multiplied by \(\mathbf {C}^{k_{n\:(\text {mod}\:K)}}\) in the encryption does not need to be known for a successful decryption. Suppose that

$$\left( \mathbf{b}^{\prime}_{n,e}\right)^{T}=\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{e}\cdot\mathbf{b}^{T}_{n},\quad e = 1,2,\ldots,E-1,$$

contains at least one sample whose value is equal to 65536, but the maximum sample value in

$$\left( \mathbf{b}^{\prime}_{n,E}\right)^{T}=\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{E}\cdot\mathbf{b}^{T}_{n}$$

does not exceed 65535. Then, \(\mathbf {b}^{\prime }_{n,E}=\mathbf {b}^{\prime }_{n}\) is taken as the (definitive) ciphered version of \(\mathbf{b}_{n}\). In the decryption,

$$\left( \mathbf{b}^{\prime}_{n,E-d}\right)^{T}=\left[\mathbf{C}^{-k_{n\:(\text{mod}\:K)}}\right]^{d}\cdot\left( \mathbf{b}^{\prime}_{n}\right)^{T}=\left[\mathbf{C}^{-k_{n\:(\text{mod}\:K)}}\right]^{d}\cdot\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{E}\cdot\mathbf{b}^{T}_{n},$$

d=1,2,…,E−1, will contain at least one sample whose value is equal to 65536; actually, \((\mathbf {b}^{\prime }_{n,E-d})^{T}\), d=1,2,…,E−1, are the same provisory blocks produced in the encryption. Only when d=E is reached is a block obtained in which no sample is equal to 65536. Such a block is

$$\left( \mathbf{b}^{\prime}_{n,0}\right)^{T}=\left[\mathbf{C}^{-k_{n\:(\text{mod}\:K)}}\right]^{E}\left[\mathbf{C}^{k_{n\:(\text{mod}\:K)}}\right]^{E}\cdot\mathbf{b}^{T}_{n}=\mathbf{b}^{T}_{n},$$

the original audio block correctly recovered.
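The whole encryption/decryption cycle can be sketched along the same lines. The code below is a simplified illustration under several assumptions that the text does not settle: blocks advance by 6 samples (8 minus the 2-sample overlap), trailing samples that do not fill a complete block are left untouched, both rounds restart the block index at n=1, and the key is the one later given in (5). All function names are hypothetical.

```python
P, N, OVERLAP = 65537, 8, 2
STEP = N - OVERLAP                    # assumed stride between block starts


def cnt_matrix():
    """CNT matrix of the Section 2 example (zeta = 4, assumed square root 2)."""
    half = pow(2, P - 2, P)

    def cos2(x):
        return (pow(2, x % 32, P) + pow(2, -x % 32, P)) * half % P

    return [[half * (2040 if j == 0 else 1) * cos2(j * (2 * i + 1)) % P
             for i in range(N)] for j in range(N)]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(N)) % P for c in range(N)]
            for r in range(N)]


def matpow(m, e):
    result = [[int(r == c) for c in range(N)] for r in range(N)]
    while e:
        if e & 1:
            result = matmul(result, m)
        m = matmul(m, m)
        e >>= 1
    return result


def matvec(m, v):
    return [sum(m[r][k] * v[k] for k in range(N)) % P for r in range(N)]


def transform_until_valid(block, M):
    """Multiply by M repeatedly until no sample equals 65536."""
    out = matvec(M, block)
    while P - 1 in out:
        out = matvec(M, out)
    return out


def run_round(samples, key, base, reverse):
    """One pass over all blocks; base is C to encrypt, its transpose to decrypt."""
    data = list(samples)
    starts = list(range(0, len(data) - N + 1, STEP))
    order = range(len(starts), 0, -1) if reverse else range(1, len(starts) + 1)
    for n in order:                   # n is the 1-based block index
        s = starts[n - 1]
        M = matpow(base, key[n % len(key)])
        data[s:s + N] = transform_until_valid(data[s:s + N], M)
    return data


def encrypt(samples, key, C):
    return run_round(run_round(samples, key, C, False), key, C, False)


def decrypt(samples, key, C):
    CT = [list(col) for col in zip(*C)]          # inverse of C
    return run_round(run_round(samples, key, CT, True), key, CT, True)


KEY = [25, 34, 224, 146, 16, 60, 91, 210, 4, 11, 44, 166, 187, 166, 115]
C = cnt_matrix()
audio = [(i * 7919) % 65536 for i in range(20)]  # toy 16-bit samples
cipher = encrypt(audio, KEY, C)
print(decrypt(cipher, KEY, C) == audio)          # True: lossless round trip
```

Note that the decryption never needs the iteration count E: processing the blocks from right to left and stopping at the first block free of the value 65536 recovers each original block, as argued above.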

4 Computer experiments and security analysis

The proposed encryption scheme was implemented in Matlab®. Segments with \(1.8\times 10^{5}\) samples of eight noncompressed audio signals with different characteristics (music, speech etc.), encoded with 16 bits/sample, were encrypted using

$$ \mathbf{k}=[25,\, 34,\, 224,\, 146,\, 16 ,\, 60,\, 91 ,\, 210 ,\, 4 ,\, 11,\, 44,\, 166,\, 187 ,\, 166,\, 115] $$
(5)

as secret-key. In Fig. 2, the waveforms of the original audio signals are shown. The file audio_01.wav, for example, was obtained using the sampling rate \(F_{s}=8000\) Hz and, therefore, has time duration equal to 22.5 s; nevertheless, the application of the encryption procedure does not depend on the sampling rate used in the discretization of a signal.

Fig. 2 Original audio signals used in the computer experiments: (a) audio_01.wav, (b) audio_02.wav, (c) audio_03.wav, (d) audio_04.wav, (e) audio_05.wav, (f) audio_06.wav, (g) audio_07.wav, (h) audio_08.wav

In Fig. 3a, the complete ciphered version of audio_01.wav is shown. Naturally, the “dense” visual aspect of the waveform reflects the rapid variations arising from the encryption process. In order to provide a more suitable visualization, we show in Fig. 3b the first 250 samples of the same ciphered audio signal (the samples are connected with lines to enhance the visualization). In this figure, we can observe the noisy aspect of the waveform. This contrasts with the waveform presented in Fig. 3c, where the first 250 samples of the corresponding original audio signal are shown. In this case, the quasi-periodicity of the waveform is emphasized; this property indicates that the non-ciphered audio segment may represent, for instance, voiced speech. The visual aspect of the waveforms corresponding to ciphered versions of all other audio signals used in our experiments is similar.

Fig. 3 (a) Full ciphered version of audio_01.wav and (b) its first 250 samples; (c) first 250 samples of the original version of audio_01.wav

4.1 Histogram analysis

The noisy aspect of the ciphered versions of the audio signals is also reflected in their histograms. In Fig. 4a, the histogram of audio_03.wav is shown; it follows a specific distribution, which is similar to the distributions observed for the other original audio signals. On the other hand, the histogram of the ciphered version of audio_03.wav (Fig. 4b) has a flat shape. This behavior is also verified for the other audio signals. Since the number of samples in the audio segments employed in the simulations (\(1.8\times 10^{5}\)) is relatively small when compared with the number of symbols in the source alphabet (65536), the histogram in Fig. 4b appears not to be uniform. However, if longer audio segments are used, the tendency towards uniformity can be observed. See, for example, Fig. 4c, where the histogram of the ciphered version of a segment with \(1.8\times 10^{6}\) samples of audio_03.wav is shown. This suggests that the proposed encryption scheme produces samples that are uniformly distributed and weakly correlated.

Fig. 4 Histograms of the original version (a) and the ciphered version (b) of a segment with \(1.8\times 10^{5}\) samples of audio_03.wav; (c) histogram of the ciphered version of a segment with \(1.8\times 10^{6}\) samples of audio_03.wav. The histograms shown in (b) and (c) have a uniform tendency

4.2 Statistical analysis

An objective analysis of the statistical properties of the ciphered audio signals resulting from our experiments can be carried out by computing correlation coefficients. By selecting arbitrarily P samples, the correlation coefficient is computed by

$$r_{xy}=\frac{\text{cov}(x,y)}{\sqrt{D(x)D(y)}}, $$

where \(\text {cov}(x,y)=\frac {1}{P}{\sum }_{i=1}^{P}(x_{i}-E(x))(y_{i}-E(y))\), \(D(x)=\frac {1}{P}{\sum }_{i=1}^{P}(x_{i}-E(x))^{2}\) and \(E(x)=\frac {1}{P}{\sum }_{i=1}^{P}x_{i}\); \(x_{i}\) is the value of the i-th selected sample and \(y_{i}\) is the value of the corresponding adjacent sample. The results for \(P=10^{5}\) are shown in Table 1. Original audio signals have correlation coefficients clearly close to one, while ciphered audio signals have correlation coefficients close to zero. This indicates that the proposed scheme is resistant against statistical attacks. Moreover, in the simulations, the entropy of the ciphered audio files has assumed values varying from 15.7057 to 15.7117. Although these values are greater than those commonly observed for non-ciphered 16-bit audio data, they are not too close to 16. Again, this is due to the relationship between the number of samples of the audio signals used in our experiments and the number of symbols in the source alphabet, that is, 65536. If we consider a segment with \(1.8\times 10^{6}\) samples of audio_03.wav (10 times longer than the segment employed in our simulations), for instance, the corresponding ciphered audio has entropy equal to 15.9735, which is significantly closer to 16, when compared with the entropy values previously given. A similar behavior is verified for all other audio signals. This means that the transformed audio signals are close to a random source and the proposed technique is also secure against the entropy attack.
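Both measurements are simple to reproduce. The sketch below implements the correlation coefficient exactly as defined above, together with the usual Shannon entropy estimate in bits per sample; the function names, the random seeds and the two synthetic test signals (a sinusoid and uniformly distributed pseudo-random integers) are assumptions made for illustration and are not the audio files of the experiments.

```python
import math
import random
from collections import Counter


def adjacent_correlation(signal, num_pairs, seed=0):
    """Correlation between randomly selected samples and their right neighbours."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(signal) - 1) for _ in range(num_pairs)]
    x = [signal[i] for i in idx]
    y = [signal[i + 1] for i in idx]
    ex, ey = sum(x) / num_pairs, sum(y) / num_pairs
    cov = sum((a - ex) * (b - ey) for a, b in zip(x, y)) / num_pairs
    dx = sum((a - ex) ** 2 for a in x) / num_pairs
    dy = sum((b - ey) ** 2 for b in y) / num_pairs
    return cov / math.sqrt(dx * dy)


def entropy_bits(signal):
    """Empirical Shannon entropy in bits per sample."""
    n = len(signal)
    return -sum(c / n * math.log2(c / n) for c in Counter(signal).values())


rng = random.Random(1)
# Stand-in for an original signal: a slowly varying sinusoid in 16-bit range.
smooth = [round(32768 + 20000 * math.sin(2 * math.pi * i / 200))
          for i in range(20000)]
# Stand-in for a ciphered signal: uniform integers, 1.8e5 samples as in the text.
noisy = [rng.randrange(65536) for _ in range(180000)]

print(adjacent_correlation(smooth, 10000))   # close to one
print(adjacent_correlation(noisy, 10000))    # close to zero
print(entropy_bits(noisy))                   # below 16 for this sample size
```

Even an ideal uniform source yields an empirical entropy visibly below 16 bits when only \(1.8\times 10^{5}\) samples are drawn from an alphabet of 65536 symbols, which is consistent with the explanation given above for the measured values.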

Table 1 Correlation coefficients of original (r) and ciphered (\(\tilde {r}\)) audio files used in the simulations

4.3 Key space

Another important security parameter is the key space. If we encode each key position with 10 bits, for example, a key space of size \(2^{150}\) is achieved with the 15-length key used in the experiments (K=15). In this respect, the proposed scheme is very flexible. Larger key spaces can be obtained by increasing the key length or by increasing the number of bits used to encode each key position. A key space of size \(2^{256}\), for example, is obtained if we consider K=16 and encode each key position with 16 bits. In fact, according to the comments made after (3), each key position could be an integer in the range from 1 to \(10^{9}\). This indicates that our scheme is secure against brute-force attacks [23].
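The key space figures quoted above follow from a direct count, which the short sketch below reproduces. The function name is hypothetical; the last line estimates, under the assumption that every key position ranges over the \(10^{9}\) exponents mentioned in the text, the equivalent key size in bits.

```python
import math


def key_space_bits(key_length, bits_per_position):
    """Base-2 logarithm of the number of keys with fixed-width positions."""
    return key_length * bits_per_position


print(key_space_bits(15, 10))        # 150, i.e. a key space of size 2**150
print(key_space_bits(16, 16))        # 256, i.e. a key space of size 2**256

# K = 15 positions, each an integer from 1 to 10**9 (assumed full range).
print(15 * math.log2(10 ** 9))       # roughly 448 bits
```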

4.4 Robustness to differential attacks

In order to evaluate the resistance of the method against differential attacks, for each original audio signal, we randomly choose one sample. The least significant bit of such a sample is inverted and a modified audio signal is obtained. Original and modified audio signals are encrypted using the same key and two ciphered audio signals are generated. Such ciphered audio signals are then compared by the number of samples change rate (NSCR) and the unified average changing intensity (UACI), which are defined by [27]

$$ \textrm{NSCR} = \frac{{\sum}_{i}D_{i}}{L}\times 100~\% $$

and

$$ \text{UACI} = \frac{1}{L}\left[ \sum\limits_{i}\frac{|A_{i}-A_{i}^{\prime}|}{65535}\right]. $$

\(A\) and \(A^{\prime}\) are the two ciphered audio signals whose corresponding original audio signals have only one-sample difference; the values of the samples at position i of \(A\) and \(A^{\prime}\) are respectively denoted by \(A_{i}\) and \(A_{i}^{\prime }\); L corresponds to the length of the audio vector; \(D_{i}\) is determined according to the rule

$$D_{i} =\left\{ \begin{array}{ll}1, & A_{i} \neq A_{i}^{\prime}, \\ 0, & \text{otherwise}. \end{array} \right. $$

The ideal values for NSCR and UACI are 100 % and 33.3 %, respectively [27]. In Table 2, the minimum, the maximum and the average values of NSCR and UACI, computed from the encryption of 100 different modified versions of each audio signal, are shown. The results are considerably close to the ideal values and do not depend on the position of the modified sample.
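The two metrics can be computed as in the sketch below. The function names are hypothetical, and the UACI is returned as a percentage (the defining sum multiplied by 100) so that it can be compared directly with the ideal value of 33.3 %. The two test signals are independent pseudo-random sequences used only as stand-ins for a pair of ciphered signals.

```python
import random


def nscr(a, b):
    """Number of samples change rate, in percent."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def uaci(a, b, max_value=65535):
    """Unified average changing intensity, in percent."""
    return 100.0 * sum(abs(x - y) for x, y in zip(a, b)) / (max_value * len(a))


rng = random.Random(0)
# Stand-ins for two ciphered signals: independent uniform 16-bit sequences.
A = [rng.randrange(65536) for _ in range(100000)]
A_prime = [rng.randrange(65536) for _ in range(100000)]

print(nscr(A, A_prime))   # close to 100 %
print(uaci(A, A_prime))   # close to 33.3 %
```

Two independent, uniformly distributed sequences already produce values near the ideal ones, which is why results close to 100 % and 33.3 % indicate that a one-bit change in the plaintext yields an essentially unrelated ciphertext.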

Table 2 Maximum, minimum and average NSCR and UACI (100 different modified versions of each audio signal were used)

4.5 Key sensitivity

The key sensitivity of the proposed encryption scheme is evaluated by encrypting an audio signal with a given secret-key \(\mathbf{k}\) and attempting to decrypt it with a wrong key \(\mathbf{k}^{\prime}\) slightly different from \(\mathbf{k}\). In our simulations, we use

$$\mathbf{k}^{\prime}=[25,\, 34,\, 224,\, 146,\, 16 ,\, 60,\, 91 ,\, 210 ,\, 4 ,\, 11,\, 44,\, 166,\, 187 ,\, 166,\, \underline{114}] $$

which differs from the key \(\mathbf{k}\) given in (5) by one bit only (the underlined number 114 in \(\mathbf{k}^{\prime}\) replaces the number 115 in \(\mathbf{k}\)). The original audio and that recovered with \(\mathbf{k}^{\prime}\) are compared using the number of samples change rate. The NSCR values obtained for all audio signals used in the simulations vary from 99.9972 % to 100.0000 %, which means that the audio signals decrypted using the wrong key are completely different from the original ones. In fact, the aspect of the recovered audio signals is completely noisy, being similar to those presented in Figs. 3a and 3b.

4.6 Known-plaintext and chosen-plaintext attacks

A preliminary analysis indicates that the proposed scheme can also resist known-plaintext and chosen-plaintext attacks. Even if an adversary has access to some plaintext/ciphertext pair, the overlapping among adjacent audio blocks and the employment of a two-round encryption procedure reduce the attempt of obtaining the secret-key to a brute-force attack. The adversary could find the position of the last ciphered audio block \(\mathbf {b}^{\prime }_{\ell }\) and also determine the index \(\ell\) (mod K) of the component \(k_{\ell \:(\text{mod}\:K)}\) of the secret-key \(\mathbf{k}\) used to encrypt such a block. However, \(k_{\ell \:(\text{mod}\:K)}\) could not be discovered by “comparing” successive results of recursive computations of the inverse CNT of \(\mathbf {b}^{\prime }_{\ell }\) to the corresponding known-plaintext (the block \(\mathbf{b}_{\ell}\) at the same position in the original audio). This is due to the fact that, even if the decryption of \(\mathbf {b}^{\prime }_{\ell }\) is correct, it produces an audio block which is composed of blocks encrypted in the first encryption round.

The choice of plaintexts that would reveal the secret-key is apparently not straightforward. If we choose an audio signal with all the samples equal to zero, a ciphered audio signal with all the samples equal to zero is obtained. Another possibility is to choose an audio signal such that the only nonzero sample is the first one, i.e., \(\mathbf{b}_{1}=[1\;0\;0\;0\;0\;0\;0\;0]\) and \(\mathbf{b}_{i}=\mathbf{0}\), i≠1. If an adversary has access to \(\mathbf {b}^{\prime }_{1}\) (computed in the first encryption round), he can obtain the exponent \(k_{1}\) by verifying exhaustively whether the first column of \(\mathbf {C}^{k_{1}}\) is equal to \(\left (\mathbf {b}^{\prime }_{1}\right )^{T}\). Applying a similar procedure to the pair \(\mathbf{b}_{2}\) and \(\mathbf {b}^{\prime }_{2}\), \(k_{2}\) could be obtained and so on. However, such a procedure is feasible only if the adversary has access to each ciphered audio block before the next block is processed; usually, this is not considered a realistic attack scenario. If the adversary has access only to the whole ciphered audio, even if he can choose a plaintext, the described attack becomes impractical.

5 Discussion and concluding remarks

We have introduced an audio encryption scheme based on the cosine number transform. The scheme is very flexible and can be applied to noncompressed digital data encoded with different numbers of bits per sample. Our approach has demonstrated robustness against the main cryptographic attacks, namely, statistical, brute-force, differential, known-plaintext and chosen-plaintext attacks [23]. This is supported by the results obtained in our experiments, which include the calculation of several metrics specifically related to certain types of attacks. In the literature, the contexts in which other audio encryption techniques are placed are very diverse. Furthermore, the incompleteness of some previous audio encryption papers with respect to security aspects makes a systematic and full comparison with our approach unfeasible. Considering such restrictions, we carry out a preliminary and qualitative analysis regarding this point.

In [18], for example, the authors basically compare the histogram of one audio signal to the histogram of the corresponding ciphered audio signal by means of a visual inspection. Moreover, the entropy, the standard deviation and the mean absolute difference of such a ciphered audio are calculated. Although such measurements indicate that the ciphered audio has a statistical behavior similar to that of a uniformly distributed random source, they are not sufficient to ensure the security of the method.

The scheme proposed in [17] is based on virtual optics and requires the conversion of the audio signal into a two-dimensional sound map. According to the authors, the method is highly sensitive to deviations in parameters related to the secret-key and its key space size can reach \((2^{8})^{256\times 256}\). Besides not presenting a complete security analysis, the implementation of the method depends on the knowledge of operations and elements commonly employed in optics frameworks, but probably unfamiliar in multimedia scenarios. This hinders its practical utilization, which contrasts with the simplicity of our method and its suitability for processing digital data.

The method described in [21] employs real-valued discrete transforms, which means that rounding operations are necessary. Unlike our method, this may produce a decrypted audio signal slightly different from the corresponding original audio signal. Moreover, the experiments performed by the authors are incomplete (for example, only one audio signal is considered) and unusual metrics are computed for security analysis. Nevertheless, we have verified that our correlation measurements are similar to those obtained by the authors in their paper.

In [8], a scheme based on a higher dimensional Arnold’s cat map is proposed. According to the authors, the security level of their approach depends on the iteration time, which is directly proportional to the key space size. On the other hand, our scheme involves only two rounds and the key space size can be increased without the need of more iterations. Moreover, although the authors mention several security parameters, they do not present numerical results. This raises doubts regarding the robustness of the method against specific cryptographic attacks.

The scopes considered in [11] and [26] are quite different from ours. In [11], a partial encryption is performed by means of watermarking and scrambling in MP3; the authors perform numerical experiments whose focus is the overhead rate and the watermark robustness against amplitude reduction and echo addition. The selective encryption scheme presented in [26] encrypts only the important audio data in order to achieve both real-time performance and energy-efficient transmission in wireless multimedia sensor networks; security aspects and metrics usually considered to evaluate a data encryption algorithm are not discussed.

The extension of the proposed scheme to digital images is currently under investigation. In this case, a two-dimensional CNT has to be considered and the transform parameters have to be chosen according to specific digital image standards. Aspects related to the fast computation of the CNT should also be considered in future work. This is particularly important in scenarios where hardware implementations of the proposed encryption technique are desired. The definition of a CNT over fields of characteristic two is also part of the topics for future research. The possibility of defining a CNT over \(\text{GF}(2^{16})\), for instance, would eliminate the need of recursively computing the transform of an audio block in order to avoid the appearance of certain sample values. This would simplify our method and make it faster.