1 Introduction

With the rapid proliferation of the Internet and the applications built on it, handling multimedia traffic and services has gained in importance over the last couple of decades. In particular, the safe distribution and management of multimedia content has become a real challenge, since audio and video in particular must be protected against piracy and copyright infringement during distribution. This paper presents a technique for partial encryption and digital watermarking of audio files that addresses this problem. A typical scenario is an on-line music portal that sells and distributes audio files to customers over the Internet. To protect against piracy, users have to register with the site through a proper authorization mechanism. Registered users gain access to high-quality audio content, whereas unregistered users can only access a lower-quality preview. The method applies a combination of partial encryption, perceptual encryption, and commutative watermarking and encryption [1, 13] to audio files.

Partial Encryption [13] is a process that encrypts only a selected part of the multimedia contents, while leaving other parts unchanged. The decryption process is similar, where only the encrypted portion is decrypted to get back the original contents. Since the whole file is not encrypted, the encryption process is faster and can lead to efficient on-line implementation. Figure 1 captures the basic idea behind this concept.

Fig. 1 Partial encryption

Perceptual Encryption [13] is a type of controlled encryption process which degrades the quality of a multimedia file depending upon the requirements. By providing encryption with various levels of degradation, media contents with various quality levels can be generated, like preview mode, medium quality mode, high quality mode, etc. To carry out this kind of encryption, the sensitive parameters of the media file have to be extracted out, and some of them encrypted, as illustrated in Fig. 2. For example, in an audio file, we can perform discrete wavelet transformation (DWT), and select some subset of the wavelet coefficients for encryption.

Fig. 2 Perceptual encryption

In Commutative Watermarking and Encryption (CWE) [13, 23, 24], encryption and watermarking are applied to disjoint parts of the multimedia content. Encryption protects confidentiality, while watermarking [7] can be used for key distribution and for protecting media copyright. The encryption and watermarking steps are normally applied one after the other in a specified sequence, both at the sender side and at the receiver side. In CWE, since the parts are disjoint, the steps can be applied in any order (i.e., the operations commute). Thus, encrypting the watermarked version and watermarking the encrypted version both generate very similar final media content, as illustrated in Fig. 3. The main advantage of a commutative system [3] is flexibility: the same watermarked version can be encrypted in different ways depending on varying requirements, or the same encrypted version can be given a different watermark if required. This saves computation time compared to carrying out both watermarking and encryption every time.

Fig. 3 Commutative watermarking and encryption

In this paper, we propose an audio encryption and watermarking [17] scheme that is commutative in nature. The DWT coefficients of a given audio file are partitioned, and encryption and watermarking are performed on disjoint partitions. Moreover, the quality of the watermarked audio can be controlled by a user-specified parameter. This feature helps in controlling the distribution of media (audio) files over a public network.

The rest of the paper is organized as follows. A brief review of similar existing works is presented in Section 2, followed by the details of the proposed scheme, along with experimental results, in Section 3. Section 4 compares the results with existing works, and Section 5 provides concluding remarks.

2 Review of the existing works

There exist applications of digital watermarking where copyright protection is not the major concern. Rather, issues like content authentication and secure distribution assume greater importance. A set of methods in the literature combine conventional watermarking with encryption to achieve this objective, and some of them are reviewed in this section. Although quite a few research works have been reported in this field, there is scope for improvement as far as the security and flexibility of the schemes are concerned.

Juan et al. [9] proposed a perception-based scalable encryption model for the Audio Video Coding Standard (AVS). Depending on the type of application, the scalable encryption technique provides different protection levels. Not all audio bits are equally important, and so a perceptual classification of the bit streams is incorporated in that paper. The main design goal of the scheme is to account for the degree of degradation of the audio content. The security of the scheme, however, depends on the encryption algorithm used.

Servetti et al. [19] proposed a low-complexity, perception-based partial encryption scheme that provides very good content protection. Instead of encrypting the whole multimedia data, only a selected fraction is encrypted, which reduces the processing load. The scheme is applied to clean speech taken from the NTT Multilingual Speech Database. The bitstream is partitioned as per their scheme and is then encrypted. Two partial encryption techniques are reported in the paper: a low protection scheme, which protects against most kinds of eavesdropping, and a high protection scheme, based on encrypting perceptually important bits.

Servetti et al. [20] proposed a low-complexity scheme based on partial encryption for content protection of MP3 audio. The main motivation of this work is to provide users with audio of degraded quality that can be restored to the original quality by obtaining a key. The decryption process is applied only to a small fraction of the bits (1–10% of the total bitstream). MDCT coefficients from the MP3 audio file are divided into several frequency regions, and these spectral subdivisions are exploited to degrade the perceptual quality of the compressed audio by low-pass filtering. Limiting the frequency content is an effective way of introducing annoying artifacts into the compressed audio. The cut-off frequency is adjusted by increasing or decreasing the number of coefficients to obtain the desired degree of perceptual quality. The results show that low-pass filtering at 5.5 kHz preserves the audio content. However, the paper lacks formal tests of the overall effectiveness of the proposed scheme, and it does not mention the set of MP3 audio files used for experimentation.

Lemma et al. [11] proposed a secure embedding scheme that incorporates traditional watermarking and partial encryption. Two new techniques are proposed: one for MASK watermarking on baseband audio, and the other for spread spectrum watermarking on MPEG-2 encoded video streams. In the first technique, embedding is performed by modifying the envelope of the host signal. Encryption is performed by modulating the signal with a piecewise stationary random sequence so that the encrypted audio is perceptually degraded. The server generates the watermark information for each individual client. Encryption is first performed on the audio signal, and the generated watermark is then embedded into the encrypted audio. In the second technique, a simple additive spread spectrum watermarking scheme on an MPEG-2 compressed video stream is presented. The security of the partial encryption method is not addressed; only the efficiency of the secure watermark embedding process is analyzed.

In [2] a protocol is proposed which uses cryptographic techniques to address the piracy issue. A commutative encryption technique is used to protect the watermarked data against piracy. The fingerprint to be embedded is determined jointly by the content provider and the customer, so that a public authority can fairly judge who is guilty of unauthorized distribution. The scheme is thus reported to be secure against attacks from either the content provider or the customer. Although this method proposes a very elegant technique for secure distribution, there is very little mention of the security of the process.

In [14] a commutative watermarking and encryption technique is used for MPEG-2 video. Usually, watermarking and encryption are performed on different parts of the media file, but this has the disadvantage that it cannot protect against replacement attacks. To overcome this difficulty, watermarking and encryption are performed on the same part of the media data. Although the replacement attack is important, it is more relevant for images [6], and several studies have been carried out on still images.

Lian et al. [16] proposed a commutative watermarking and encryption scheme for image files based on frequency characteristics of wavelet codec. They analyzed the variations of PSNR values with changes in the frequency bands used for encryption.

Though a number of works have tried to combine encryption with watermarking to achieve some degree of degradation in a given audio file, the process is not continuous in terms of the amount of degradation possible. There is, therefore, a need to devise a scheme in which any arbitrary level of degradation in quality can be achieved based on the value of a user-specified parameter.

3 The proposed scheme

Media encryption and media watermarking are two different techniques, which can be combined to protect both confidentiality and identity. The proposed approach is a commutative watermarking and encryption (CWE) scheme based on partial audio encryption, which provides a controlled level of degradation in the quality of the audio files. To obtain the commutative property, the discrete wavelet transform coefficients of the given audio file are partitioned, and watermarking and encryption are applied to disjoint partitions. The basic features of the scheme are discussed in the following subsections.

3.1 Partial (perceptual) encryption

This subsection provides an insight into the concept of partial encryption as used in the context of the proposed scheme. The original media file P is initially partitioned into two parts, P1 and P2. The part P2 is significant to perception and a change in the coefficients of this part renders the audio file unintelligible, whereas the human auditory senses are not sensitive to the part P1. The part P1 is encrypted using a key \(K_E\) to form the perception-sensitive part P1′. Then the encrypted part P1′ is recombined with the untouched part P2 to form the encrypted media M.

$$\begin{array}{rll} (P1, P2) & = & Partition (P)\\ P1^{\prime} &=& Encrypt (P1, K_{E})\\ M &=& Combine (P1^{\prime}, P2) \end{array}$$

In practice, the partitioning is carried out in the DWT domain, and not in the time domain. Hence this method may also be regarded as a perceptual encryption scheme. Encryption is carried out on some of the higher frequency DWT coefficients, using the key distribution and encryption framework explained in the following subsection.

3.2 The key distribution and encryption framework

The proposed requirements and the suggested solution for distributing the encryption key and partial encryption of the media file can be explained with respect to the general model as shown in Fig. 4. The server S allows users (clients) to register themselves, and have access to a database of media files hosted by the server. Unregistered users will have access to partially encrypted versions of media files which will result in degraded quality of playback, whereas registered users will be provided with a decryption key using which the high-quality version can be reconstructed. In the proposed work, the key is distributed along with the media using watermarking; however, it may also be sent over the secure channel that exists between a user and the server. The process of key distribution, and the partial encryption of the media files are explained below.

Fig. 4 The client-server environment

When clients register with the server S through a secure channel, they are provided with a (public key, private key) pair for use with a standard public-key system such as RSA [18]. For a client m, the public and private keys are denoted as \(KU_{cm}\) and \(KR_{cm}\) respectively.

A media file is partially encrypted by the server with a randomly generated symmetric key \(K_s\), using the AES algorithm [4] as shown in Fig. 5. The block of data to be encrypted (referred to as plaintext in the figure) is divided into 128-bit sub-blocks, and each sub-block is encrypted using AES with the key \(K_s\) to obtain the corresponding ciphertext block. If the last sub-block is shorter than 128 bits, it is left unencrypted.

Fig. 5 AES encryption scheme
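The sub-block handling described above can be sketched as follows. This is only an illustration of the blocking logic: `xor_block` is a toy stand-in introduced here (it is not AES and is not secure), and the function names are hypothetical. In a real implementation a proper AES routine would be passed as `encrypt_block`.

```python
from typing import Callable

BLOCK_BYTES = 16  # 128 bits


def xor_block(block: bytes, key: bytes) -> bytes:
    """Toy stand-in for a 128-bit block cipher (NOT AES, NOT secure)."""
    return bytes(b ^ k for b, k in zip(block, key))


def encrypt_full_blocks(
    data: bytes,
    key: bytes,
    encrypt_block: Callable[[bytes, bytes], bytes] = xor_block,
) -> bytes:
    """Encrypt every complete 128-bit sub-block; copy a short tail unchanged."""
    n_full = len(data) // BLOCK_BYTES
    out = bytearray()
    for i in range(n_full):
        block = data[i * BLOCK_BYTES:(i + 1) * BLOCK_BYTES]
        out += encrypt_block(block, key)
    # A final sub-block shorter than 128 bits is left in the clear.
    out += data[n_full * BLOCK_BYTES:]
    return bytes(out)


key = bytes(range(16))
plaintext = bytes(range(40))  # two full sub-blocks plus an 8-byte tail
ciphertext = encrypt_full_blocks(plaintext, key)
```

Leaving the short tail untouched keeps the output the same length as the input, so the encrypted coefficients can be written back in place without padding.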

The server S distributes the encryption key \(K_s\) to a registered client m by encrypting the key using the public key \(KU_{cm}\) of the client, and watermarking the encrypted key in the media file. This process is illustrated in Fig. 6. It may be noted that any other method that does not rely on watermarking may also be used for distributing the key. However, since in the context of the present work we are not considering signal processing attacks that can disturb the watermark, we have chosen to send the encrypted key watermarked along with the audio file.

Fig. 6 Key distribution process

3.3 DWT decomposition tree of coefficients

It is known that DWT transforms an audio signal at any level into approximate and detail coefficients. The approximate coefficients refer to the low frequency components, which the human auditory senses are sensitive to. The detail coefficients are the representation of high frequency components, which largely go undetected by human auditory senses.

In the first level of DWT transformation, the audio samples are decomposed into the low frequency components (approximate coefficients A) and the high frequency components (detail coefficients D). At the second level, both the approximate and the detail coefficients are transformed again to form the four coefficient sets AA, AD, DA, and DD. As we proceed down the levels, the audio file gets further transformed, giving rise to a binary-tree-like structure. At level n, the total number of leaf nodes of the binary tree, which represent the approximate and detail coefficients, is \(2^n\). Figure 7 depicts the decomposition tree for n = 3.

Fig. 7 DWT of an audio signal up to 3 levels

Inverse discrete wavelet transform (IDWT) is used to recombine the decomposed low and high frequency coefficients to get back the audio file. In the proposed schemes, the given audio undergoes DWT up to three levels. For ease of explanation, we will use the following alternate notations in some of the following subsections: x1 (instead of AAA), x2 (instead of AAD), x3 (instead of ADA), x4 (instead of ADD), x5 (instead of DAA), x6 (instead of DAD), x7 (instead of DDA), x8 (instead of DDD).
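As an illustration of the decomposition tree, the sketch below applies a Haar analysis step to both branches at every level and collects the \(2^n\) leaf nodes. It assumes the Haar wavelet and a signal length divisible by \(2^n\); the function names are hypothetical and the code is not taken from the paper's MATLAB implementation.

```python
import math


def haar_step(signal):
    """One Haar analysis step: return (approximation, detail) halves."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return approx, detail


def packet_leaves(signal, levels=3):
    """Decompose both branches at every level; map leaf label -> coefficients."""
    nodes = {"": list(signal)}
    for _ in range(levels):
        next_nodes = {}
        for label, coeffs in nodes.items():
            approx, detail = haar_step(coeffs)
            next_nodes[label + "A"] = approx
            next_nodes[label + "D"] = detail
        nodes = next_nodes
    return nodes


samples = [float(i % 7) for i in range(64)]
leaves = packet_leaves(samples, levels=3)
labels = list(leaves)  # AAA, AAD, ..., DDD, i.e. x1 to x8 in the notation above
```

With 64 input samples, each of the eight leaves holds 8 coefficients, so each leaf carries one eighth of the transformed data.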

3.4 Proposed scheme 1

After decomposition of the original audio file up to three levels using DWT, parameter extraction is performed. The eight leaf nodes, x1 to x8, correspond to transform coefficients, arranged in order of increasing frequency. In the first scheme that is being proposed, coefficients in the lowest frequency components (x1) are used for watermarking, while all the coefficients in one of the frequency blocks x2 to x8 are encrypted. Through experimentation, the impact of the frequency block used for encryption on the quality of the resulting audio signal is evaluated in terms of the SNR values of the encrypted media files. The algorithms for watermark embedding and encryption, and watermark detection and decryption are given below. In Algorithm 1, the concatenation operation (denoted by ||) is carried out considering all its operands as bit strings. It may be noted that during the process of watermarking in x1, the audio signal will undergo some degradation which, however, is small as compared to that resulting due to encryption. In terms of SNR values, this degradation has been found to be within 5% for all the audio files that have been experimented with.

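[Algorithm 1: watermark embedding and encryption (scheme 1)]
[Algorithm 2: watermark detection and decryption (scheme 1)]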

3.4.1 Watermarking process

As mentioned above, the lowest frequency coefficients present in x1 are used for embedding the watermark. The watermarking scheme employs a mean quantization technique [10, 25]. During embedding, the coefficients in x1 are first divided into frames (the number of frames being equal to the number of watermark bits to be embedded), the frame mean is subtracted from each of the coefficient values, and an offset is added or subtracted depending on the watermark bit (0 or 1). This watermarking scheme has been shown to be robust against many signal processing attacks such as MP3 compression and quantization. However, it may be noted that in the context of the present work, robustness of the watermarking process [22] is not a very important consideration.

Detail of the embedding process is depicted in Fig. 8. The following are the steps performed during the embedding process.

  1. Step 1:

    First, DWT is performed up to three levels to obtain AAA and AAD.

    $$ [A, D] = DWT(s, wavelet) $$
    (1)

    Here s denotes the audio sample values and wavelet specifies the mother wavelet used for the analysis (Haar or Daubechies). A and D are the first-level approximate and detail coefficients. To obtain the third-level approximate coefficients, the approximate coefficients are decomposed further down to the third level.

  2. Step 2:

    The approximate coefficients of the third level (AAA) are divided into frames of fixed size \(\lfloor n/m \rfloor\), denoted as \(f_1, f_2, \ldots, f_m\). Here n is the total number of coefficients in AAA, and m is the length of the watermark in bits.

  3. Step 3:

    After the framing process is done, the mean of each frame is calculated.

    $$ m_i = Mean(f_i), \ \ \mbox{where}\ i=1,2,\ldots,m $$
    (2)

    The mean of each frame is then subtracted from all the coefficient values of that frame.

  4. Step 4:

    The watermark bit \(W_i\) is embedded into the frame \(f_i\) as follows:

    $$\begin{array}{rll} f_{ij}^{\prime} & = & f_{ij} + \alpha m_i, \mbox{ if } W_i = 1 \\ & = & f_{ij} - \alpha m_i, \mbox{ if } W_i = 0 \end{array}$$
    (3)

    where \(f_{ij}\) denotes the jth coefficient of frame \(f_i\), and α is a constant called the embedding intensity.

Fig. 8 Watermark embedding process

Detail of the extraction process is depicted in Fig. 9. The following are the steps performed during the extraction.

  1. Step 1:

    As in the embedding process, DWT is performed to obtain the third-level approximate coefficients.

    $$ [A^{\prime}, D^{\prime}] = DWT(s^{\prime}, wavelet) $$
    (4)

    In a similar way we move down to third level and obtain AAA.

  2. Step 2:

    Framing is then performed, and the mean of each frame is calculated.

    $$ m^{\prime}_i = Mean(f_i), \ \ \mbox{where}\ i=1,2,\ldots,m $$
    (5)

  3. Step 3:

    The ith watermark bit \(W_i\) is extracted using the following formula.

    $$\begin{array}{rll} W_i & = & 1, \mbox{ if } m^{\prime}_i > 0 \\ & = & 0, \mbox{ if } m^{\prime}_i < 0 \end{array}$$
    (6)

Fig. 9 Watermark extraction process
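The embedding and extraction steps can be sketched as below, operating directly on a list of third-level approximate coefficients (the DWT itself is assumed to have been done already). One deliberate simplification should be noted: instead of the offset \(\alpha m_i\) of Eq. (3), the sketch uses a fixed positive offset `delta`, an assumption made here so that the sign test of Eq. (6) works regardless of the sign of the original frame mean. The function names are hypothetical.

```python
import math


def embed_bits(coeffs, bits, delta):
    """Embed one watermark bit per frame of the approximate coefficients."""
    frame_len = len(coeffs) // len(bits)  # floor(n / m), as in Step 2
    out = list(coeffs)
    for i, bit in enumerate(bits):
        start, stop = i * frame_len, (i + 1) * frame_len
        frame = out[start:stop]
        mean = sum(frame) / frame_len  # Eq. (2)
        # Assumption: fixed positive offset rather than alpha * m_i.
        offset = delta if bit == 1 else -delta
        out[start:stop] = [c - mean + offset for c in frame]
    return out


def extract_bits(coeffs, n_bits):
    """Recover the bits from the sign of each frame mean."""
    frame_len = len(coeffs) // n_bits
    bits = []
    for i in range(n_bits):
        frame = coeffs[i * frame_len:(i + 1) * frame_len]
        mean = sum(frame) / frame_len  # Eq. (5)
        bits.append(1 if mean > 0 else 0)  # Eq. (6)
    return bits


approx = [5.0 * math.sin(0.1 * i) for i in range(66)]
watermark = [1, 0, 1, 1]
marked = embed_bits(approx, watermark, delta=0.5)
recovered = extract_bits(marked, len(watermark))
```

Coefficients beyond the last complete frame (here the final two) are left unchanged, since the frame size is \(\lfloor n/m \rfloor\).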

The watermarking process discussed here uses a simple mean quantization technique; several such techniques [5, 10, 12] are described in the literature. It may be recalled that robustness is not a necessary condition for the successful implementation of our scheme, and hence the choice of the watermarking algorithm is not very critical.

3.4.2 Integration of encryption and watermarking

As mentioned, the encryption and watermarking operations are performed on the coefficients in separate leaf nodes (among x1, x2, ..., x8). Since the leaf nodes (or the sections of the coefficients) chosen for encryption and for watermarking are disjoint, neither operation modifies the data used by the other, and the two can be applied in either order. The original audio file, whether encrypted first and watermarked second or watermarked first and encrypted second, generates the same media file. Moreover, embedding a watermark into the encrypted audio file does not require knowledge of the decryption key, which helps in controlled access and safe distribution of audio content. The process is depicted in Fig. 10.

Fig. 10 Combined watermarking and encryption for proposed scheme 1
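The commutativity argument can be written out compactly. The notation below is introduced only for illustration: w denotes the watermark embedding applied to x1, e denotes the encryption applied to the chosen block \(x_k\) with \(k \in \{2, \ldots, 8\}\), and W and E are the corresponding operations on the whole coefficient set P.

$$\begin{array}{rll} P & = & (x_1, \ldots, x_k, \ldots, x_8)\\ W(P) & = & (w(x_1), \ldots, x_k, \ldots, x_8)\\ E(P) & = & (x_1, \ldots, e(x_k), \ldots, x_8)\\ E(W(P)) & = & (w(x_1), \ldots, e(x_k), \ldots, x_8) \ = \ W(E(P)) \end{array}$$

Because w reads and writes only x1 and e reads and writes only \(x_k\), each operation leaves the other's input untouched, which is why both orders give the same result.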

3.4.3 Experimental results

In the implementation, we have used mean quantization method on the low frequency DWT coefficients for watermarking, and AES algorithm on some subset of higher frequency DWT coefficients for encryption. Experiments have been carried out using MATLAB on 100 audio files, by watermarking on x1 and encrypting one of the leaf nodes x2 through x8.

Representative results for four files are shown in Table 1, which depicts the variations in SNR values after encrypting individual leaf nodes at third level.

Table 1 Variation of SNR values with encryption of various \(x_i\)

It may be observed from the table that SNR values after encryption do not vary monotonically with frequency. Though we have shown the results for four files only, similar results are found to hold for the other files also. We may conclude from the experimental results that the variation in SNR with the block \(x_i\) being encrypted depends quite heavily on the audio file under consideration, so it is not possible to predict which block will result in the desired level of degradation. For this reason, we have not explored this method any further.
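The text does not state the exact SNR formula used. A common definition, assumed here purely for illustration, is the ratio of the original signal's energy to the energy of the difference between the original and the processed signal, expressed in decibels. The function name `snr_db` is hypothetical.

```python
import math


def snr_db(original, processed):
    """SNR in dB: original signal energy over error energy (assumed definition)."""
    signal_energy = sum(x * x for x in original)
    noise_energy = sum((x - y) ** 2 for x, y in zip(original, processed))
    if noise_energy == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(signal_energy / noise_energy)


clean = [1.0, -1.0, 1.0, -1.0]
degraded = [0.9, -1.1, 1.1, -0.9]
value = snr_db(clean, degraded)
```

Under this definition a lower value means stronger degradation, which matches how the SNR figures are interpreted in the discussion above.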

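[Algorithm 3: watermark embedding and encryption (scheme 2)]
[Algorithm 4: watermark detection and decryption (scheme 2)]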

3.5 Proposed scheme 2

Since the previous scheme fails to provide monotonic degradation controlled by a user-defined parameter, an alternative scheme is proposed. As before, watermarking is carried out on the leftmost block of DWT coefficients (namely, x1). However, instead of encrypting one of the higher frequency blocks, we encrypt a variable number of consecutive coefficients starting from the highest frequency side. Essentially, we treat all the coefficients in all the leaf nodes x1, x2, ..., x8 as a single vector V. Encryption is performed on S% of the coefficients in the vector V from the rightmost (high frequency) side. Watermarking is performed as before using the mean quantization method on x1, which corresponds to the lowest-frequency 12.5% (one eighth) of the coefficients at level 3. As S increases, the number of encrypted DWT coefficients increases, so the amount of degradation is also expected to increase monotonically. The algorithms for watermark embedding and encryption, and for watermark detection and decryption, are given in Algorithms 3 and 4.

3.5.1 Watermark embedding

As in the previous approach, we embed the watermark bits in the DWT transformed lowest frequency coefficients in x1 at the third level, using mean quantization technique.

3.5.2 Integration of encryption and watermarking

In this scheme, watermarking and encryption are carried out as illustrated in Fig. 11. Clearly, watermarking and encryption are carried out on disjoint sections of the DWT transformed coefficients, and hence the process is commutative. The value of S can be chosen seamlessly to encrypt any percentage of the higher frequency components as desired.

Fig. 11 Combined watermarking and encryption for proposed scheme 2
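The selection of the encrypted region and the resulting commutativity can be checked with a small sketch. Everything here is a toy under stated assumptions: the coefficients are integers, "encryption" is an XOR with a constant, "watermarking" adds a constant to x1, and all function names are hypothetical. With eight equal-length leaves, x1 is the first eighth of V, so the two regions stay disjoint for the values of S considered (5 to 85).

```python
def high_frequency_range(n_coeffs, s_percent):
    """Index range covering the last S% of the vector V."""
    count = (n_coeffs * s_percent) // 100
    return n_coeffs - count, n_coeffs


def toy_encrypt(v, s_percent, key):
    """Toy stand-in for encryption: XOR the last S% of the coefficients."""
    start, stop = high_frequency_range(len(v), s_percent)
    return [c ^ key if start <= i < stop else c for i, c in enumerate(v)]


def toy_watermark(v, offset):
    """Toy stand-in for watermarking: modify only x1, the first eighth of V."""
    x1_len = len(v) // 8
    return [c + offset if i < x1_len else c for i, c in enumerate(v)]


v = list(range(100, 180))  # 80 integer "coefficients"
results = {}
for s in range(5, 90, 5):
    encrypt_after = toy_encrypt(toy_watermark(v, 3), s, key=0x5A)
    encrypt_before = toy_watermark(toy_encrypt(v, s, key=0x5A), 3)
    results[s] = encrypt_after == encrypt_before
```

For S = 85 and 80 coefficients the encrypted range starts at index 12, just past the 10 coefficients of x1, so both orders of application agree for every S in the tested range.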

3.5.3 Experimental results

The experiment has been carried out using MATLAB on a set of 100 audio files, which include pop, speech, rock and instrumental files, by varying the value of S from 5 to 85, in increments of 5. Table 2 shows the variations in SNR values with variations in S for 40 of the files, with some of the corresponding plots shown in Figs. 12, 13, 14, and 15. The following observations may be drawn from the results.

  • The SNR values deteriorate when 5 ≤ S ≤ 25.

  • The SNR values remain more or less constant when 25 ≤ S ≤ 50.

  • The SNR values further degrade when S > 50.

Table 2 Variation of SNR with S

Fig. 12 Variations with S for pop files

Fig. 13 Variations with S for speech files

Fig. 14 Variations with S for rock files

Fig. 15 Variations with S for instrumental files

Therefore, by varying the value of S, any desired level of degradation can be achieved. Through experimentation, it has been found that any value of S between 30 and 60 can be used to provide acceptable levels of degradation in the audio quality. Since the value of S is also watermarked as stated in Algorithm 3, the decoder at the receiver end can extract the value of S before carrying out decryption.

4 Analysis and comparison

In the first scheme (Algorithms 1 and 2), in addition to a watermark identifying the origin of the file, the value of i (indicating which leaf node block \(x_i\) is encrypted) and the encryption key k are embedded into the audio file as a watermark. Assuming a 128-bit encryption key, 3 bits for i, and 69 bits of information about the audio, the watermark is 200 bits long. With a frame size of 16 for the mean quantization method (which gives good results), this requires a minimum of 200 × 16 = 3,200 coefficients in x1. In terms of embedding capacity, this translates into one bit for every 8 × 16 = 128 audio samples, because x1 holds one eighth of the coefficients at level 3. The minimum number of samples in the audio file is therefore 8 × 3,200 = 25,600, which determines the smallest audio file for which the proposed method can be implemented.

Similarly, in the second scheme (Algorithms 3 and 4), watermark will consist of information about the file, S, and k, which again will be around 200 bits if S is encoded in multiples of 5. The minimum size of the audio file in this case will also be around 25,600 samples.
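The capacity arithmetic above can be reproduced with a few lines. The helper name `min_audio_samples` is hypothetical; the numbers are those stated in the text for the first scheme.

```python
def min_audio_samples(watermark_bits, frame_size, dwt_levels=3):
    """Smallest sample count so that x1 can hold one frame per watermark bit."""
    coeffs_needed = watermark_bits * frame_size  # coefficients required in x1
    return coeffs_needed * 2 ** dwt_levels       # x1 is 1 / 2^levels of the data


key_bits, index_bits, info_bits = 128, 3, 69
total_bits = key_bits + index_bits + info_bits
samples = min_audio_samples(total_bits, frame_size=16)
samples_per_bit = samples // total_bits
```

Changing `frame_size` or the watermark length shows directly how the minimum file size scales with either quantity.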

In the context of the present work, the word security needs some explanation. We are not concerned about the robustness of the watermarking algorithm, since the attacker does not achieve its objective by destroying the watermark. What the attacker tries to achieve is either to extract the encryption key k or to gain access to the high-quality version of the audio. Clearly, the latter cannot be achieved without knowledge of the former, which is safeguarded by encrypting it using the public key of the client and watermarking it into the audio.

The results of the proposed approach cannot be directly compared with other techniques because, for the application scenario considered, no equivalent methods are available in the literature. The partial audio encryption scheme proposed in [21] is somewhat similar: the FFT parameters of speech data are encrypted, thereby degrading the quality. However, continuous control over the level of degradation, as offered by the proposed scheme, is not possible. The schemes proposed in [19, 20] work on G.729 compressed speech and are used to encrypt telephone voice signals. In [8], which works on MP3 audio, some bit allocation information or some Huffman codes are encrypted to provide degradation. Again, continuous control over the level of degradation is not easy. Similarly, in [15, 16], a CWE scheme for compressed MPEG-4 video is proposed, where video parameters like inter- or intra-prediction mode, motion vector difference and residue coefficient sign are encrypted, while the amplitudes of DC or AC coefficients are watermarked. Here again, continuous control is difficult, as there are too many parameters.

5 Conclusion

A partial encryption and watermarking scheme based on the DWT coefficients of a given audio file has been proposed in this paper. Two different schemes have been reported. Low frequency components are chosen for watermarking, whereas high and middle frequencies are selected for encryption. In the first scheme, SNR values after encryption do not vary monotonically with frequency, so a leaf node cannot always be selected to give the expected degradation. The second scheme, however, provides a mechanism for selecting the degree of encryption so as to degrade the original audio by a desired amount. The value of S can be tuned to obtain any desired level of degradation for a given audio file. This method can be used effectively for safe distribution of audio files over the Internet.