1 Introduction

Since the late 20th century, the Internet has grown at an unprecedented rate, and the transfer of digital multimedia has increased exponentially, driven by advances in data compression, more powerful personal computers, and online social communities. This communication channel is a double-edged sword. On the one hand, it offers convenient access to a gigantic source of data. On the other hand, it can be harmful when misused. For example, there are social concerns about authentication, copyright management, copy control, and similar issues for digital media. The music industry has been aware of these issues since the 1990s [1]. One potential solution to this kind of problem in the audio domain is audio watermarking.

Watermarking is a scheme for making information unnoticeable, so that a user is not aware of the existence of the hidden information. This type of information hiding is called steganography. The other type is cryptography. Both are information-hiding techniques, but they have different objectives: steganography makes information unnoticeable, whereas cryptography makes it unreadable. Copyright control is claimed to be the original goal of audio watermarking [2]. Besides copyright management, audio watermarking can be used in many other applications, each with its own requirements. As stated in [2], there are at least three application areas for audio watermarking: (1) copyright marking and copy control, (2) forensic watermarking, such as digital fingerprinting, and (3) information hiding for annotation and added value. Since the main objective of audio watermarking is to add information to an audio signal transparently, there are in general four requirements for an audio-watermarking method, especially for commercial purposes. (1) Inaudibility: the human auditory system should not be able to detect the watermark, so the sound quality of the watermarked signal should be indistinguishable from that of the original. (2) Robustness: the watermark should survive attacks, such as re-sampling or compression, applied to the watermarked signal. (3) Blindness: the hidden information should be extractable from the watermarked signal without access to the original signal. (4) Confidentiality: the hidden data should remain concealed [3]. The authors of [2, 4] add a further requirement, capacity, which is the quantity of information embedded in the original signal. For example, when the purpose is adding value, increasing capacity is important because the hidden information is more than a short serial number.

Naturally, these requirements conflict with one another. High robustness, for example, normally comes at the cost of lower audio quality, i.e. reduced transparency. The work in [4] reports a similar phenomenon: high capacity implies low robustness. Therefore, in addition to proposing new effective techniques, much research on audio watermarking has focused on how to balance these conflicts. However, no existing work resolves them completely. Some techniques achieve good transparency or inaudibility but are not blind. Others achieve high capacity but are not robust, especially against certain attacks.

Regarding the classification of audio-watermarking schemes, the author of [5] categorizes them into three categories. The first embeds information in the time domain, mostly by changing the least significant bit (LSB) of the original signal. The second introduces an echo, and the third embeds information in a transform domain, such as the frequency or wavelet domain. However, any classification depends on the set of criteria used. For example, schemes could instead be classified into two categories: those that exploit properties of the human auditory system and those that do not. Some techniques use characteristics of the human auditory system as a guideline for hiding information [6]. Others are based on mathematical manipulation and rely little on special characteristics of the human auditory system [7]. In this paper, we investigate the latter kind and propose a watermarking scheme based on singular-spectrum analysis (SSA), which is related to singular value decomposition-based (SVD-based) watermarking techniques. These techniques rely on a mathematical method for extracting algebraic features, called singular values (SVs), from a two-dimensional matrix.

SVD-based watermarking has many advantages [7–23]. These are mainly due to properties of SVs, such as their invariance under common signal-processing operations. For example, when a small modification is applied to an original signal represented by a two-dimensional matrix, its SVs change only negligibly. This property makes SVD-based techniques robust against common signal processing. In addition, SVD-based watermarking has a low computational complexity compared with other transform-based methods.

SVD-based audio watermarking was originally proposed by Özer et al. in 2005 [8], based on a watermarking technique employed in the image domain [11]. From our survey, SVD-based methods can be categorized into two frameworks according to whether the extracting process uses information about the watermark that was obtained in the embedding stage. Methods described in [8–10, 21] use such information in the extracting stage and are therefore necessarily non-blind. Those described in [7, 12–15, 17–20, 22, 23] do not use information about the watermark in the extracting process; they can be blind or non-blind. For example, [7, 15, 22, 23] are blind, while the others are non-blind. Although there are two frameworks, both share the common concept of SVD-based watermarking, i.e. SVs are modified slightly according to embedding rules. The positions of the modified SVs can also be used as a criterion for categorization. For example, methods described in [12–14, 18–20] modify only the largest SV, whereas the method described in [23] modifies only some small SVs. Methods in [7, 15, 17, 21, 22] modify all SVs. Section 5 shows the effect of the position of the modified SVs on the quality of the watermarked sound. It is important to note that Lamarche et al. have pointed out that the robustness of methods in the first framework is highly likely due to false positives [11]. Therefore, to avoid this kind of problem, our proposed method is designed to be blind.

Interestingly, to the best of our knowledge, there has been little discussion of the essence or meaning of SVs in previous SVD-based audio-watermarking techniques. When an SV is modified, which audio feature exactly is modified? We believe the answer depends not only on the domain in which the signal is represented but also on how the matrix is created. From the viewpoint of SSA, the meaning of SVs can be clearly explained. We are inspired by the robustness of SVD-based methods and intrigued by the question of the physical meaning of SVs, in the hope that knowing this meaning and its relation to other physical features could help us overcome the conflicts among the requirements of audio watermarking.

The rest of this paper is organized as follows. Section 2 introduces the background of SSA and SVD. Our proposed method is detailed in Sect. 3. Performance evaluation and experimental results are given in Sect. 4. Remarks are made in Sect. 5. Section 6 summarizes this paper.

2 Singular-Spectrum Analysis

Singular-spectrum analysis (SSA) is a method for identifying and extracting oscillatory components from a signal [24]. There are many types of SSA. The one described in this section, on which our proposed method is based, is called Basic SSA. SSA is used to decompose a signal (or time series) of interest into several additive oscillatory components. Each oscillatory component represents a simple oscillatory mode. We hypothesize that the relationship between SSA and oscillatory components is somewhat similar to that between empirical mode decomposition (EMD) and intrinsic mode functions.

SSA consists of two stages: decomposition (analysis) and reconstruction (synthesis). The decomposition stage has two steps: embedding and singular value decomposition (SVD). It should be noted that the name of the first step is SSA terminology and has nothing to do with embedding a watermark. The reconstruction stage also has two steps: grouping and diagonal averaging.

In the embedding step, a signal \(X = (f_0,f_1,...,f_{N-1})^T\) of length N is mapped to a trajectory matrix \({{\varvec{X}}}\) of size \(L \times K\):

$$\begin{aligned} {{\varvec{X}}} = \left[ {\begin{array}{*{20}c} f_0 &{} f_1 &{} f_2 &{} \cdots &{} f_{K-1} \\ f_1 &{} f_2 &{} f_3 &{} \cdots &{} f_{K} \\ f_2 &{} f_3 &{} f_4 &{} \cdots &{} f_{K+1} \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ f_{L-1} &{} f_L &{} f_{L+1} &{} \cdots &{} f_{N-1} \\ \end{array} } \right] \end{aligned}$$
(1)

Each column vector of \({{\varvec{X}}}\) is called a lagged vector. The ith column of \({{\varvec{X}}}\) is defined as \(X_i = (f_{i-1},f_i,...,f_{i+L-2})^T\), where L is the window length, which is at most N. The matrix \({{\varvec{X}}}\) therefore consists of \(K = N-L+1\) lagged vectors, i.e. \({{\varvec{X}}} = [X_1 X_2 ... X_K]\). Note that L is the only parameter of Basic SSA. Since the trajectory matrix has equal elements on its anti-diagonals, i.e. \(x_{i,j} = x_{i-1,j+1}\), where \(x_{i,j}\) is the element in the ith row and jth column of \({{\varvec{X}}}\), it is a Hankel matrix.
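The embedding step can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than the implementation used in the experiments; the function name `trajectory_matrix` and the toy signal are assumptions introduced here.

```python
import numpy as np

def trajectory_matrix(f, L):
    """Map a 1-D signal f of length N to its L x K trajectory matrix, K = N - L + 1."""
    f = np.asarray(f, dtype=float)
    N = f.size
    if not 1 <= L <= N:
        raise ValueError("window length L must satisfy 1 <= L <= N")
    K = N - L + 1
    # Column i (0-based) is the lagged vector (f_i, ..., f_{i+L-1})^T.
    return np.column_stack([f[i:i + L] for i in range(K)])

f = np.arange(7.0)               # toy signal f_0, ..., f_6, so N = 7
X = trajectory_matrix(f, L=3)
print(X.shape)                   # (3, 5)
print(X)
```

For N = 7 and L = 3 this gives K = 5 lagged vectors, and every anti-diagonal of the result holds a single sample value, which is the Hankel property described above.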

In the second step, SVD decomposes the matrix \({{\varvec{X}}}\) into a product of three matrices \({{\varvec{U}}}\), \({{\varvec{D}}}\), and \({{\varvec{V}}}\) as follows.

$$\begin{aligned} {{\varvec{X}}} = {{\varvec{UDV}}}^T \end{aligned}$$
(2)

where \({{\varvec{X}}}\) is the matrix being decomposed (here, the trajectory matrix from the first step), \({{\varvec{U}}}\) and \({{\varvec{V}}}\) are orthogonal matrices, i.e. \({{\varvec{U}}}{{\varvec{U}}}^T ={{\varvec{U}}}^T{{\varvec{U}}} = {{\varvec{V}}}{{\varvec{V}}}^T = {{\varvec{V}}}^T{{\varvec{V}}} = {{\varvec{I}}}\), where \({{\varvec{I}}}\) is the identity matrix, and \({{\varvec{D}}}\) is a diagonal matrix whose diagonal elements are called singular values (SVs). The columns of \({{\varvec{U}}}\) and \({{\varvec{V}}}\), \(U_i\) and \(V_i\), are eigenvectors of \({{\varvec{X}}}{{\varvec{X}}}^T\) and \({{\varvec{X}}}^T{{\varvec{X}}}\), respectively, sorted in descending order of the corresponding eigenvalues. The elements of \({{\varvec{D}}}\) are the square roots of these eigenvalues. If the eigenvalues of \({{\varvec{X}}}{{\varvec{X}}}^T\) (or \({{\varvec{X}}}^T{{\varvec{X}}}\)) are denoted by \(\lambda _1, \lambda _2, ...,\) and \(\lambda _L\), then the trajectory matrix \({{\varvec{X}}}\) can be written as

$$\begin{aligned} {{\varvec{X}}} = {{\varvec{X}}}_1 + {{\varvec{X}}}_2 + ... + {{\varvec{X}}}_d \end{aligned}$$
(3)

where \({{\varvec{X}}}_i = \sqrt{\lambda _i}U_iV_i^T \) and \(d = \max \{i\), such that \(\lambda _i > 0\}\). Each \({{\varvec{X}}}_i\) represents a simple oscillatory component of the signal X.
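Equations (2) and (3) can be checked numerically. The following sketch, which assumes NumPy's `numpy.linalg.svd` and a random toy signal chosen only for illustration, builds the elementary matrices \({{\varvec{X}}}_i\) and compares the SVs with the eigenvalues of \({{\varvec{X}}}{{\varvec{X}}}^T\).

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(40)                 # toy signal, N = 40 (illustrative only)
L = 10
K = f.size - L + 1
X = np.column_stack([f[i:i + L] for i in range(K)])   # trajectory matrix

# s holds the singular values sqrt(lambda_i) in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# d = number of strictly positive SVs (up to a numerical tolerance).
d = int(np.sum(s > 1e-12 * s[0]))

# Elementary matrices X_i = sqrt(lambda_i) * U_i * V_i^T and their sum, Eq. (3).
components = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(d)]
X_sum = np.sum(components, axis=0)

# Eigenvalues of X X^T, sorted in descending order for comparison with s**2.
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]

print(np.allclose(X_sum, X))
print(np.allclose(s**2, eigvals))
```

Each entry of `components` is a rank-one matrix, so the SV acts purely as a scale factor on the pattern \(U_iV_i^T\).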

The third step is grouping. In this step, the set of indices \(\{1, 2, ..., d\}\) obtained from the previous step is partitioned into m disjoint subsets \(I_1, I_2, ..., I_m\). Then, \({{\varvec{X}}}_1, {{\varvec{X}}}_2, ... ,{{\varvec{X}}}_d\) are grouped into m groups.

$$\begin{aligned} {{\varvec{X}}} = {{\varvec{X}}}_{I_1} + {{\varvec{X}}}_{I_2} + ... + {{\varvec{X}}}_{I_m} \end{aligned}$$
(4)

The purpose of this step is to separate the time series into meaningful additive sub-series, such as trend or noise, according to separability conditions. Since this is not our purpose in watermarking, the grouping step is not included in our proposed method.

The last step is diagonal averaging, which transforms (hankelizes) each matrix \({{\varvec{X}}}_{I_j}\) of the grouped decomposition into a new series of length N. The hankelization of a matrix \({{\varvec{Y}}}\) of dimension \(L \times K\) into the series \(Y = (g_0,g_1,...,g_{N-1})^T\) is defined as follows.

$$\begin{aligned} g_k =\left\{ \begin{array}{ll} \begin{array}{ll} \frac{1}{k+1}\displaystyle \sum _{m=1}^{k+1}y^*_{m,k-m+2} &{} \quad \text {for} \quad 0 \le k < L^*-1 \\ \frac{1}{L^*}\displaystyle \sum _{m=1}^{L^*}y^*_{m,k-m+2} &{} \quad \text {for} \quad L^*-1 \le k < K^* \\ \frac{1}{N-k}\displaystyle \sum _{m=k-K^*+2}^{N-K^*+1}y^*_{m,k-m+2} &{} \quad \text {for} \quad K^* \le k < N \end{array} \end{array} \right. \end{aligned}$$
(5)

where \(L^* = \min (L,K)\), \(K^* = \max (L,K)\), \(y^*_{ij}=y_{ij}\) if \(L < K\), and \(y^*_{ij}=y_{ji}\) if \(L \ge K\). In our proposed method, \({{\varvec{Y}}}\) is a watermarked trajectory matrix, i.e. a trajectory matrix whose SVs have been modified.
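Diagonal averaging amounts to taking the mean of each anti-diagonal of the matrix, which the three cases of Eq. (5) express through their different normalizing counts. A minimal NumPy sketch follows; the function name `hankelize` is an assumption made for this illustration.

```python
import numpy as np

def hankelize(Y):
    """Average every anti-diagonal i + j = k of an L x K matrix into the sample g_k."""
    Y = np.asarray(Y, dtype=float)
    L, K = Y.shape
    N = L + K - 1
    g = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):
        # Row i contributes its element y_{i,j} to position k = i + j.
        g[i:i + K] += Y[i]
        counts[i:i + K] += 1
    # counts[k] equals k + 1, L*, or N - k, matching the three cases of Eq. (5).
    return g / counts

Y = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(hankelize(Y))   # [1. 3. 4. 6.]
```

Applied to an exact trajectory matrix, the function returns the original signal, because all entries on an anti-diagonal are already equal.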

3 Proposed Method

3.1 Embedding Process

The embedding process is illustrated in Fig. 1 (left). First, an audio signal is segmented into non-overlapping frames. One bit of the watermark is embedded into each frame, which implies that the frame length determines the embedding capacity. Then, a trajectory matrix is created for each frame, and SVD is applied to each matrix. A watermark bit is embedded by modifying SVs according to certain rules. In this work, we use simple rules similar to quantization index modulation, summarized as follows.

Let \(\{\sqrt{\lambda _1}, \sqrt{\lambda _2}, ..., \sqrt{\lambda _d}\}\) be the set of SVs in descending order, where \(d = \max \{i\), such that \(\lambda _i > 0\}\), and let \(\epsilon \) be a small positive real number. We use the following rule to modify the SVs. If the watermark bit is 0, then \(\sqrt{\lambda _{u}}, \sqrt{\lambda _{u+1}}, ...,\) and \(\sqrt{\lambda _{l}}\) are replaced with \((1+\epsilon )\sqrt{\lambda _{l+1}}\); if the watermark bit is 1, they are replaced with \((1-\epsilon )\sqrt{\lambda _{u-1}}\), given that \(\sqrt{\lambda _{u}}\) is greater than \(\sqrt{\lambda _{l}}\).

After modifying SVs, each modified trajectory matrix is hankelized. Finally, the watermarked signal is obtained by adding those hankelized frames.
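The per-frame embedding described above can be sketched as follows. This is an illustrative reading of the rule, not the code used in the experiments: the function names `modify_svs` and `embed_frame`, the toy SV vector, and the small frame and window sizes are all assumptions made here. The indices u and l are 1-based, as in the text.

```python
import numpy as np

def modify_svs(s, bit, u, l, eps):
    """Apply the embedding rule to SVs s (descending order); u and l are 1-based."""
    s = np.array(s, dtype=float)               # work on a copy
    if not (2 <= u <= l <= s.size - 1):
        raise ValueError("need 2 <= u <= l <= len(s) - 1 so that both neighbours exist")
    if bit == 0:
        s[u - 1:l] = (1 + eps) * s[l]           # s[l] is sqrt(lambda_{l+1}) in 0-based terms
    else:
        s[u - 1:l] = (1 - eps) * s[u - 2]       # s[u - 2] is sqrt(lambda_{u-1})
    return s

def embed_frame(frame, bit, L, u, l, eps):
    """Embed one bit: trajectory matrix -> SVD -> modify SVs -> diagonal averaging."""
    frame = np.asarray(frame, dtype=float)
    N = frame.size
    K = N - L + 1
    X = np.column_stack([frame[i:i + L] for i in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = (U * modify_svs(s, bit, u, l, eps)) @ Vt    # watermarked trajectory matrix
    g = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):                              # diagonal averaging as in Eq. (5)
        g[i:i + K] += Y[i]
        counts[i:i + K] += 1
    return g / counts

# Toy SVs 10, 9, ..., 1 with u = 4, l = 6: the 4th to 6th SVs are replaced.
s = np.arange(10, 0, -1.0)
s_bit0 = modify_svs(s, 0, u=4, l=6, eps=0.1)    # set just above the 7th SV
s_bit1 = modify_svs(s, 1, u=4, l=6, eps=0.1)    # set just below the 3rd SV

rng = np.random.default_rng(0)
frame = rng.standard_normal(200)                # stand-in for one audio frame
watermarked = embed_frame(frame, 1, L=20, u=5, l=8, eps=0.1)
```

Note that diagonal averaging generally changes the SVs again, so the SVs recovered from the watermarked frame are close to, but not identical to, the values written by `modify_svs`.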

Fig. 1. Embedding and extracting processes

3.2 Extracting Process

The proposed extracting process is shown in Fig. 1 (right). The watermarked signal is segmented into non-overlapping frames. As in the embedding process described above, each frame is mapped to a trajectory matrix, and SVD is used to extract its SVs. The watermark bit is decoded from the value of \(\sqrt{\lambda _{m}}\), where \(\sqrt{\lambda _{m}}\) is the median of \(\{\sqrt{\lambda _{u}}, \sqrt{\lambda _{u+1}}, ..., \sqrt{\lambda _{l}}\}\). If \(\sqrt{\lambda _{u-1}} - \sqrt{\lambda _{m}}\) is greater than \(\sqrt{\lambda _{m}} - \sqrt{\lambda _{l+1}}\), i.e. the median lies closer to \(\sqrt{\lambda _{l+1}}\), the watermark bit is 0, consistent with the embedding rule for bit 0 in Sect. 3.1. Otherwise, the watermark bit is 1.
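The decision can be sketched as follows, assuming Python with NumPy. The sketch decodes in the way implied by the embedding rule of Sect. 3.1: bit 0 places the modified SVs just above \(\sqrt{\lambda _{l+1}}\), so a median nearer the lower neighbour is read as 0, and a median nearer \(\sqrt{\lambda _{u-1}}\) is read as 1. The function names `decode_from_svs` and `extract_bit` and the toy values are hypothetical.

```python
import numpy as np

def decode_from_svs(s, u, l):
    """Decode one bit from SVs s (descending order); u and l are 1-based."""
    s = np.asarray(s, dtype=float)
    m = np.median(s[u - 1:l])        # median of sqrt(lambda_u), ..., sqrt(lambda_l)
    upper_gap = s[u - 2] - m         # distance to sqrt(lambda_{u-1})
    lower_gap = m - s[l]             # distance to sqrt(lambda_{l+1})
    return 0 if upper_gap > lower_gap else 1

def extract_bit(frame, L, u, l):
    """Blind extraction from one frame: trajectory matrix -> SVs -> decision."""
    frame = np.asarray(frame, dtype=float)
    K = frame.size - L + 1
    X = np.column_stack([frame[i:i + L] for i in range(K)])
    s = np.linalg.svd(X, compute_uv=False)   # SVs only, in descending order
    return decode_from_svs(s, u, l)

# Toy SV vectors as they would look right after embedding with u = 4, l = 6, eps = 0.1.
s_after_bit0 = [10, 9, 8, 4.4, 4.4, 4.4, 4, 3, 2, 1]
s_after_bit1 = [10, 9, 8, 7.2, 7.2, 7.2, 4, 3, 2, 1]
bit_a = decode_from_svs(s_after_bit0, u=4, l=6)
bit_b = decode_from_svs(s_after_bit1, u=4, l=6)
print(bit_a, bit_b)
```

Only the watermarked frame and the parameters L, u, and l are needed, which is what makes the extraction blind.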

4 Evaluation

Following the evaluation criteria suggested by the committee of Information Hiding and its Criteria (IHC) [25], this work is evaluated along two major dimensions: objective sound-quality tests of the watermarked signals and robustness tests. The perceptual evaluation of audio quality (PEAQ), recommended in ITU-R BS.1387-1, is used to measure the objective difference grade (ODG). PEAQ measures the degradation of the watermarked signal relative to the original and covers a scale from \(-5\) (worst) to 0 (best). IHC suggests that the ODG should be greater than \(-2.5\).

Twelve host signals from the RWC music-genre database [26] (Track No. 01, 07, 13, 28, 37, 49, 54, 57, 64, 85, 91, and 100), which were used in IHC's 2012 audio watermarking competition, were used in our experiments. All have a sampling rate of 44.1 kHz, 16-bit quantization, and two channels (stereo). A 90-bit payload is embedded per 15 seconds of host signal. This corresponds to a maximum embedding capacity of 6 bps.
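As a quick check of this figure, and assuming (this is not stated explicitly in the text) that the frames tile each channel of the host signal without gaps, the payload rate and the implied number of samples available per embedded bit would be:

$$\begin{aligned} \frac{90\ \text {bits}}{15\ \text {s}} = 6\ \text {bps}, \qquad \frac{44100\ \text {samples/s}}{6\ \text {bits/s}} = 7350\ \text {samples per bit} \end{aligned}$$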

For robustness evaluation, five attacks were applied to the watermarked signals: Gaussian-noise addition with an average SNR of 36 dB, re-sampling to 16 and 22.05 kHz, band-pass filtering from 100 to 6000 Hz with a slope of \(-12\) dB/oct, MP3 compression at 128 kbps joint stereo, and MP4 compression at 96 kbps. We express extraction precision in terms of the bit error rate (BER), which is defined as the number of bit errors divided by the total number of embedded bits. Note that IHC suggests that the BER should be lower than \(10\,\%\).
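The BER definition translates directly into code. The sketch below assumes NumPy; the function name `bit_error_rate` and the two example bit sequences are made up for illustration and are not experimental data.

```python
import numpy as np

def bit_error_rate(embedded, extracted):
    """Number of bit errors divided by the total number of embedded bits."""
    embedded = np.asarray(embedded)
    extracted = np.asarray(extracted)
    if embedded.shape != extracted.shape:
        raise ValueError("bit sequences must have the same length")
    return float(np.mean(embedded != extracted))

embedded = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])    # illustrative payload
extracted = np.array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0])   # two bits flipped
ber = bit_error_rate(embedded, extracted)
print(f"BER = {100 * ber:.1f} %")                        # BER = 20.0 %
```

In this toy case two of ten bits differ, so the BER is 20 %, which would not meet the 10 % threshold suggested by IHC.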

4.1 Singular-Spectrum Analysis and Synthesis

As mentioned in Sect. 2, SSA can be used to decompose a signal into oscillatory components. In our experiments, the window length is set to 500. Figure 2 shows an example of using SSA to decompose a signal. The first panel, labeled Org, shows an original signal zoomed in to 300 samples. The second to sixth panels show five oscillatory components corresponding to the 1st, 5th, 50th, 100th, and 200th SVs, respectively. Specifically, the waveform \(X_1\) in the second panel is the result of hankelizing the matrix \({{\varvec{X}}}_1 = \sqrt{\lambda _1}U_1V_1^T \), the waveform \(X_5\) in the third panel is the result of hankelizing the matrix \({{\varvec{X}}}_5 = \sqrt{\lambda _5}U_5V_5^T \), and so on. An SV can therefore be interpreted as the scale factor of its oscillatory component. Because SVs are sorted in descending order, the lower the component order, the greater its contribution to the signal. There are in fact more than 200 components, since more than 200 SVs are greater than zero, as shown in Fig. 3. The first panel of Fig. 4 compares the waveform of a reconstructed signal with the original. The reconstructed signal is built from only the first 100 oscillatory components. The second panel shows the residual signal, i.e. the difference between the original and reconstructed signals. The more components are added, the smaller the residual becomes. It is therefore possible to modify the high-order oscillatory components without significantly affecting sound quality. In this sense, our simple rules correspond to changing the scale factors of certain components. Figure 5 shows example waveforms when bit 1 and bit 0 are embedded into \(X_{35}\). This example shows clearly that what is modified is the scale factor of the component waveform.
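The behaviour of the residual can be reproduced on a small synthetic example. The sketch below is an assumption-laden illustration, not the experiment reported here: it uses a toy signal of two sinusoids plus weak noise, a window length of 50 instead of 500, and a hypothetical helper `ssa_reconstruct` that keeps only the first r components.

```python
import numpy as np

def ssa_reconstruct(f, L, r):
    """Reconstruct a signal from its first r SSA components (by descending SV)."""
    f = np.asarray(f, dtype=float)
    N = f.size
    K = N - L + 1
    X = np.column_stack([f[i:i + L] for i in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = (U[:, :r] * s[:r]) @ Vt[:r]          # sum of the first r elementary matrices
    g = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):                       # diagonal averaging
        g[i:i + K] += Y[i]
        counts[i:i + K] += 1
    return g / counts

n = np.arange(400)
rng = np.random.default_rng(1)
f = (np.sin(2 * np.pi * n / 25) + 0.5 * np.sin(2 * np.pi * n / 7)
     + 0.05 * rng.standard_normal(n.size))

# RMS of the residual for a few choices of r; r = 50 uses every component.
rms = {}
for r in (2, 4, 50):
    residual = f - ssa_reconstruct(f, L=50, r=r)
    rms[r] = float(np.sqrt(np.mean(residual**2)))
    print(r, rms[r])
```

With all components retained the residual is at the level of floating-point rounding, while dropping high-order components leaves only a small residual, in line with the observation above.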

Fig. 2. An example of using SSA to decompose a signal

Fig. 3. Singular spectrum (the first 200 SVs)

Fig. 4. Original and reconstructed signals (top); residual signal, i.e. the difference between the original and reconstructed signals (bottom)

Fig. 5. Embedding “0” and “1” into the component \(X_{35}\)

Although the residual signal is very small when all oscillatory components are added up, it might still exist. Thus, we first check the ODGs of the synthesis signals against the originals. The result is shown in Fig. 6 (light gray). The sound quality of the synthesis signals does not differ from that of the originals.

Fig. 6. ODGs of watermarked and synthesis signals

In addition to PEAQ, the log-spectral distance (LSD) and the signal-to-error ratio (SER) were also computed. LSD is a distance or distortion measure between two spectra. It is defined as follows, where \(P(\omega )\) and \(\hat{P}(\omega )\) are the power spectra of the original and synthesis signals, respectively.

$$\begin{aligned} \mathrm{{LSD}} = \sqrt{\frac{1}{2\pi }\int \limits _{-\pi }^{\pi }\left[ 10\log _{10}\frac{P(\omega )}{\hat{P}(\omega )}\right] ^2d\omega } \end{aligned}$$
(6)
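In practice the integral in Eq. (6) has to be approximated on a discrete frequency grid. One possible sketch, assuming NumPy's FFT, a base-10 logarithm, and a small floor added to both spectra to avoid division by zero (the floor and the function name `log_spectral_distance` are assumptions, not part of the definition), is:

```python
import numpy as np

def log_spectral_distance(original, synthesis, floor=1e-12):
    """Discrete approximation of Eq. (6): RMS over FFT bins of the dB spectral ratio."""
    P = np.abs(np.fft.fft(original))**2          # power spectrum of the original
    P_hat = np.abs(np.fft.fft(synthesis))**2     # power spectrum of the synthesis signal
    ratio_db = 10 * np.log10((P + floor) / (P_hat + floor))
    # The mean over all bins plays the role of (1 / 2 pi) * integral over [-pi, pi].
    return float(np.sqrt(np.mean(ratio_db**2)))

rng = np.random.default_rng(2)
x = rng.standard_normal(1024)                    # toy signal, illustrative only
lsd_same = log_spectral_distance(x, x)
lsd_scaled = log_spectral_distance(x, 2 * x)     # power ratio of 1/4 in every bin
print(lsd_same, lsd_scaled)
```

Identical signals give an LSD of zero, and doubling the amplitude gives about 6.02 dB, since the power ratio is 1/4 at every frequency.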

SER is the power ratio between a signal and the error. Given amplitudes \(A_\mathrm{{sig}}(n)\) and \(A_\mathrm{{syn}}(n)\) of original and synthesis signals respectively, SER is defined as follows.

$$\begin{aligned} \mathrm{{SER}} = 10\log _{10}\frac{\displaystyle \sum _{n}{}{[A_\mathrm{{sig}}(n)]}^2}{\displaystyle \sum _{n}{}{[A_\mathrm{{sig}}(n)-A_\mathrm{{syn}}(n)]}^2} \end{aligned}$$
(7)
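Equation (7) can be computed directly from the two waveforms. The following NumPy sketch assumes a base-10 logarithm and returns infinity when the error is exactly zero; the function name `signal_to_error_ratio` and the toy signals are assumptions made for illustration.

```python
import numpy as np

def signal_to_error_ratio(sig, syn):
    """SER in dB: signal power divided by the power of the error sig - syn."""
    sig = np.asarray(sig, dtype=float)
    syn = np.asarray(syn, dtype=float)
    error_power = np.sum((sig - syn)**2)
    if error_power == 0:
        return float("inf")                  # perfect reconstruction
    return float(10 * np.log10(np.sum(sig**2) / error_power))

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)                # toy signal, illustrative only
ser_perfect = signal_to_error_ratio(x, x)
ser_scaled = signal_to_error_ratio(x, 0.9 * x)   # error is 10 % of the amplitude
print(ser_perfect, ser_scaled)
```

An amplitude error of 10 % corresponds to an error power of 1 % of the signal power, i.e. an SER of 20 dB.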

Equations (6) and (7) imply that the lower the LSD, the more similar the power spectrum of a synthesis signal is to that of the original, and the higher the SER, the lower the error. Therefore, for a perfect reconstruction, the ideal value of LSD is zero, and the ideal value of SER is infinity. The results of our experiment confirm that the LSD between every pair of original and synthesis signals is zero and that the SER is \(\infty \). Thus, the framework of singular-spectrum analysis-synthesis can be used in audio watermarking.

4.2 Objective Sound-Quality Test

The following parameters are chosen for the proposed embedding rules: \(\epsilon = 0.1\), \(u = 21\), and \(l = 49\), i.e. the 21st to 49th SVs are modified according to the watermark bit as described in Sect. 3. Figure 6 shows the ODGs of the watermarked signals together with those of the synthesis signals. From the viewpoint of PEAQ, our proposed method introduces very little distortion to the sound quality. The LSD and SER between the original and watermarked signals are shown in Fig. 7. The results indicate that our proposed method satisfies the inaudibility criterion.

Fig. 7. Log-spectral distance and signal-to-error ratio of watermarked signals

4.3 Robustness Test

The results of the robustness tests are shown in Table 1. The proposed method is robust against MP3 and MP4 compression, Gaussian-noise addition, band-pass filtering, and re-sampling. Furthermore, to compare robustness against certain attacks, we implemented the scheme based on [15], which we refer to as the conventional method in this paper. The proposed method is more robust against MP3 compression and band-pass filtering than the conventional method.

Table 1. Comparing BERs of the proposed and conventional methods when attacks are applied

5 Discussion

The results from our investigation suggest that the robustness could be improved if lower-order oscillatory components are modified instead of the higher-order ones. For example, the robustness against MP3 attack of the proposed scheme increases, as shown in Fig. 9, if values of the 2nd to 9th SVs, instead of the 21st to 49th SVs, are modified to embed a watermark. However, the sound quality of watermarked signals decreases as shown in Fig. 10. It reveals a trade-off between robustness and inaudibility.

Besides the position of the modified SVs, the other point we would like to discuss is the number of modified SVs, i.e. those ranging from u to l. If the number of modified SVs increases, the robustness increases. It is possible to maintain the ODG while decreasing the BER, especially for high-order SVs. For example, when we modify the 51st to 89th SVs, the average BER for the MP3 attack is \(13.26\,\%\). If we set l to 99, the average BER becomes \(10.83\,\%\), and in both cases the average ODG is around 0.18. Therefore, one way to improve the proposed scheme is to choose these parameters appropriately.

We also evaluated robustness against single-echo addition, which is not a standard attack in watermark evaluation. We found that neither the conventional nor the proposed method is robust against this attack, as shown in Fig. 8. This is because both are time-domain methods in which matrices are created directly from the waveform, so time-domain processing significantly affects the elements of those matrices.

Fig. 8. Comparing BERs of the proposed and conventional methods when single-echo addition with a delay time of 100 ms and \(-6\) dB is applied

We have verified that our proposed scheme can extend capacity from 6 to 18 bps without affecting either sound quality or detection rate. The maximum capacity of the proposed scheme for a given sound quality is determined by N. It could be increased if parameters such as L, u, l, and \(\epsilon \) are chosen appropriately, in a similar way to improving robustness.

Finally, since this is a frame-based method, there is also a problem of frame synchronization. The current scheme assumes that the location of the hidden information is known. One possible solution to this problem is to use synchronization codes [27, 28]. Moreover, we do not claim that the parameters used in our experiments, such as N, L, u, l, and \(\epsilon \), are optimal. These are subjects of our future work.

Fig. 9. Comparing BERs, after MP3 attack is applied, when lower-order components are modified to hide information

Fig. 10. Comparing ODGs when lower-order components are modified to hide information

6 Conclusion

This paper presented our investigation into using SSA for hiding information and proposed an audio-watermarking scheme based on it. SSA was used to decompose a signal into oscillatory components. The key procedure for embedding a watermark is controlling the scale factors of some components, which is done by modifying their SVs. From this SSA viewpoint, we can interpret the physical meaning of an SV. We found that SSA is a perfect analysis-synthesis tool. Although the proposed scheme is robust against many attacks, especially MP3 and MP4 compression, and satisfies the inaudibility criterion, it still has a drawback: for example, it is fragile to single-echo addition. The proposed method is a time-domain SSA-based method, and it appears to be robust against frequency-domain processing. Our next interests are as follows: whether a frequency-domain (amplitude and phase spectra) SSA-based method will be robust against time-domain processing such as single-echo addition, and whether a hybrid SSA-based method combining the time and frequency domains will offer a solution to inaudible and robust audio watermarking.