1 Introduction

Speech is the most widely used medium in telephony, mobile communications, and other transmission systems. Speech compression aims to make the best use of the available capacity and resources of a communication system. It works by reducing the size or bit rate of the transmitted speech signal [2, 4], which saves channel bandwidth and reduces the memory needed to store speech files. Compression is possible because speech signals carry a large amount of redundant information: a compressed signal is generated by coding the necessary speech information and discarding the non-essential part. The amount of discarded information must be matched to the level of intelligibility required of the restored speech. Today, speech compression is used in many applications, such as voice mail, video teleconferencing systems, and satellite and cellular communications [10]. To prepare the speech signal for compression, transforms such as the cosine and wavelet transforms are used [11, 14, 19, 31]. These transforms can represent speech in both the time and frequency domains with high resolution.

The discrete wavelet transform (DWT) decomposes the speech signal by decorrelating its samples into sets of coefficients, many of which are almost zero [3, 27]. Compression can therefore be performed by coding only the significant coefficients and setting the remaining coefficients to zero during decompression. For this reason, wavelet-based compression remains an active research field. Supatinee K. et al. [14] examined the application of Haar, Biorthogonal, and Discrete Meyer wavelets to speech compression. Their experiments indicated that the Biorthogonal wavelet provides a better compression ratio and reconstructed-signal quality than the other two. Snehanka G. et al. [19] tested Coiflet-wavelet-based speech compression, seeking high auditory quality in the recovered speech. Fatma Z. Chelali et al. [3] used the DWT for speech compression and denoising; their results showed that DWT-based compression of an acoustic signal outperforms compression based on the discrete cosine transform. In [27], Rekha V. and Sachin S. Chauhan proposed speech compression using a hybrid wavelet, in which the energy retained in the coefficients of each speech frame is used as a threshold that controls the compression level. Syu-Siang W. et al. [29] proposed another compression algorithm for a distributed speech recognition system. This algorithm uses suppression by selecting wavelets to achieve compact and efficient data transmission: the DWT splits the speech signal into two frequency bands, the low-frequency band is kept and transmitted, and the high-frequency band is discarded.

To increase the compression ratio beyond that of existing wavelet-based schemes, a compression approach called Compressive Sensing (CS) has recently been used. This approach reduces acquisition time and complexity. Its main challenge is that the processing power available at the decoder limits the reconstruction quality of the signals [6, 13, 21, 25]. In [20], Vinitha R. et al. propose a CS-based system to enhance and compress speech signals. Maher K. M. Al-Azawi et al. [1] propose a scheme that improves the compression ratio by combining CS with a chaotic system. All these compression algorithms aim to increase the compression level while retaining as much as possible of the quality and intelligibility of the recovered speech signal.

In this paper, a high-quality system for compressing and encrypting a speech signal based on the DWT and Hénon chaotic maps is proposed. The multi-level wavelet decomposition produces a highly sparse representation of the signal, which yields a high compression ratio after thresholding. In addition, the new coding process, which encodes the remaining wavelet coefficients using chaotic signals, improves the auditory quality of the reconstructed speech and reduces the side information required to decompress the speech signal. The characteristics of the eight chaotic signals provide a high level of security for the compressed speech. By combining the DWT with chaotic signals, the proposed approach provides a unified framework for compression and encryption.

The paper is organized as follows. Section 2 describes the DWT and the Hénon chaotic map. Section 3 presents the proposed speech compression and encryption system. Finally, the simulation results and conclusions are presented in Sections 4 and 5, respectively.

2 Discrete wavelet transform and chaotic map

2.1 Discrete wavelet transform (DWT)

The wavelet transform applies windows of different scales to split the data into various ranges of frequency components. It captures the time location and frequency information of a signal with high resolution [5, 28]. The DWT coefficients W computed for a signal S[n] are defined by [28]

$$ W\left(j,k\right)=\sum \limits_nS\left[n\right]{2}^{-\frac{j}{2}}\psi \left({2}^{-j}n-k\right), $$
(1)

where j and k are the scale and shift parameters, respectively.

ψ(t) is called the mother wavelet. There are many wavelet families, each characterized by the shape of its mother wavelet.

Haar is the oldest and simplest orthogonal wavelet, and it has a linear phase characteristic. The Haar wavelet has one vanishing moment and two filter coefficients, and it does not use overlapping windows [17, 28]. The Haar mother wavelet is given by

$$ \psi (t)=\left\{\begin{array}{ll}1, & 0\le t<\frac{1}{2},\\ {}-1, & \frac{1}{2}\le t<1,\\ {}0, & \mathrm{otherwise}.\end{array}\right. $$
(2)

Daubechies-p (db) and Coiflets-p (coif) are other orthogonal wavelet families with longer compact support than Haar. These families use overlapping windows to decompose the data samples. Daubechies filters have 2p coefficients, while Coiflet filters have 6p coefficients, so these families operate on 2p and 6p adjacent data elements, respectively. This windowing produces a smoother representation of the signal in the wavelet domain than Haar does. Another family, the Biorthogonal wavelet (bior), uses two different wavelet functions. It is orthogonal to the shifted basis function under different scale factors, but it is not orthogonal for the same scale factor [14, 19]. Figure 1 shows the Haar, db4, coif1, and bior2.2 mother wavelets, respectively.

Fig. 1
figure 1

Mother functions of several wavelet families

To decompose the signal, the DWT passes the input data through successive low-pass and high-pass filters with different cutoff frequencies. This process produces an orthogonal set of wavelet coefficients, many of which carry almost no information. In a multilevel decomposition, the DWT splits the data into approximation and detail coefficients by passing it through these filters, and each filter output is down-sampled by two [7, 16, 30]. The time resolution is therefore halved while the frequency resolution is doubled. At each subsequent level, the approximation coefficients are decomposed and subsampled again. Mathematically, the output coefficient vector can be written as

$$ W(n)=\left\{\begin{array}{c}\left(\sum \limits_{k=-\infty}^{\infty }S\left[k\right]L\left[2n-k\right]\right)\downarrow 2,\\ {}\left(\sum \limits_{k=-\infty}^{\infty }S\left[k\right]H\left[2n-k\right]\right)\downarrow 2,\end{array}\ \right. $$
(3)

where S, L, and H are the input signal, the low-pass filter, and the high-pass filter, respectively.
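The filter-and-downsample step of Eq. (3) can be sketched for the Haar case as follows. This is a minimal illustration with NumPy only, not the authors' implementation; the function names `dwt_level` and `wavedec_haar`, the sign convention chosen for H, and the restriction to even-length inputs are assumptions made for the example.

```python
import numpy as np

# Orthonormal Haar analysis filters. The sign convention of H is an assumption.
L = np.array([1.0, 1.0]) / np.sqrt(2.0)    # low-pass
H = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # high-pass


def dwt_level(s):
    """One analysis level: convolve with L and H, then keep every second output."""
    s = np.asarray(s, dtype=float)
    if len(s) % 2:
        raise ValueError("this sketch assumes an even number of samples")
    approx = np.convolve(s, L)[1::2]   # (s[2n] + s[2n+1]) / sqrt(2)
    detail = np.convolve(s, H)[1::2]   # (s[2n] - s[2n+1]) / sqrt(2)
    return approx, detail


def wavedec_haar(s, levels):
    """Multilevel decomposition: the approximation is decomposed again at each level."""
    details = []
    approx = np.asarray(s, dtype=float)
    for _ in range(levels):
        approx, d = dwt_level(approx)
        details.append(d)
    # Coarsest approximation first, then details from coarse to fine.
    return [approx] + details[::-1]


if __name__ == "__main__":
    coeffs = wavedec_haar(np.arange(8.0), levels=3)
    print([c.round(3) for c in coeffs])
```

Because the Haar filters are orthonormal, the total energy of the coefficients equals the energy of the input, which is a convenient sanity check for any implementation of this step.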

For two-dimensional (2D) input data, the 2D wavelet transform is obtained by applying the one-dimensional L and H filters along each dimension. First, each row of the input matrix is passed through the two filters. After downsampling, each column of the resulting coefficient matrix is passed through the filters again. This yields three detail subbands, which hold the highest-resolution wavelet coefficients, and one approximation subband, which holds the smooth coefficients of the original data. To analyze the signal features at further scales, the approximation subband is decomposed again at the next level [12, 15].
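The row-then-column procedure can be illustrated with a one-level 2D Haar transform. The sketch below is an assumption-laden example rather than the paper's code: the function name `haar_dwt2` is hypothetical, the input is assumed to have even dimensions, and the labels LH and HL follow one of several naming conventions in use.

```python
import numpy as np


def haar_dwt2(x):
    """One-level 2D Haar DWT: filter the rows, then the columns of each result."""
    x = np.asarray(x, dtype=float)
    if x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("this sketch assumes even matrix dimensions")
    k = np.sqrt(2.0)
    # Row pass: pairwise sums (low-pass) and differences (high-pass), downsampled by 2.
    lo = (x[:, 0::2] + x[:, 1::2]) / k
    hi = (x[:, 0::2] - x[:, 1::2]) / k
    # Column pass on both row outputs.
    ll = (lo[0::2, :] + lo[1::2, :]) / k   # approximation subband
    lh = (lo[0::2, :] - lo[1::2, :]) / k   # detail subband
    hl = (hi[0::2, :] + hi[1::2, :]) / k   # detail subband
    hh = (hi[0::2, :] - hi[1::2, :]) / k   # detail subband
    return ll, lh, hl, hh


if __name__ == "__main__":
    ll, lh, hl, hh = haar_dwt2(np.arange(16.0).reshape(4, 4))
    print(ll)
```

For the next level, `haar_dwt2` would be applied again to `ll` only, which matches the description of further decomposing the approximation subband.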

2.2 Hénon chaotic map

Chaotic maps produce deterministic sequences with several distinctive properties, such as sensitivity to their parameters, noise-like behavior, and ergodicity. A chaotic signal therefore adds confidentiality when an encryption scheme employs it [26]. In 1976, Michel Hénon introduced a chaotic map that generates two chaotic signals. This map is defined as

$$ {\displaystyle \begin{array}{c}{x}_n=1-r{x_{n-1}}^2+{y}_n,\\ {}\kern0ex {y}_n={cx}_{n-1},\kern5em \end{array}} $$
(4)

where (xn, yn) ∈ R are the generated chaotic sequences, and r and c are the control parameters.

To guarantee chaotic behavior of the Hénon map signals, the control parameters may take values r ∈ (1.399, 1.4) and c ∈ (0.299, 0.3) [18].
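A Hénon sequence generator can be sketched as below. Note the assumptions: the code uses the classical form of the map, in which both new values are computed from the previous pair (x, y), and the defaults r = 1.4, c = 0.3 and the zero initial state are the textbook values, chosen here only for illustration. The function name `henon` is hypothetical.

```python
import numpy as np


def henon(n, r=1.4, c=0.3, x0=0.0, y0=0.0):
    """Generate n samples of the two Henon signals (classical form, assumed)."""
    x = np.empty(n)
    y = np.empty(n)
    xp, yp = x0, y0
    for i in range(n):
        # Both updates use the previous state (xp, yp).
        xn = 1.0 - r * xp * xp + yp
        yn = c * xp
        x[i], y[i] = xn, yn
        xp, yp = xn, yn
    return x, y


if __name__ == "__main__":
    x, y = henon(10)
    print(x)
    print(y)
```

The control parameters and the initial state fully determine the sequences, which is what allows a receiver holding the same values to regenerate them.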

3 Proposed speech compression and encryption system

Figure 2 shows the block diagram of the proposed compression and encryption system for speech signals. The system uses the DWT and Hénon chaotic signals to compress and encrypt the speech. The speech signal is first arranged into an almost square matrix whose dimensions depend on the signal length. A 2D spectrogram is then constructed by applying the discrete cosine transform (DCT) to each column vector; the DCT acts as the first stage of sparsification. Next, a multilevel 2D-DWT is applied to the spectrogram matrix to generate a wavelet coefficient matrix. A hard threshold value is chosen to attain the required compression level, and all detail wavelet coefficients whose values are below the threshold are set to zero. The matrix is then converted to a 1D vector (W) for the compression step. A new coding process is proposed to compress and encode the significant wavelet coefficients. Eight chaotic signals (x1, y1, x2, y2, x3, y3, x4, y4), generated from four Hénon maps, are employed for this purpose. These signals are used to code runs of one, two, three, four, five, six, seven, and eight adjacent data samples, respectively. The chaotic signals are coupled to ensure high-quality randomness. The coupling is accomplished by modifying the Hénon map equations as follows:

$$ \left.\begin{array}{l}y{1}_n=c1\ \left(x{1}_{n-1}+y{2}_{n-1}\right),\\ {}y{2}_n=c2\ \left(x{2}_{n-1}+y{3}_{n-1}\right),\\ {}y{3}_n=c3\ \left(x{3}_{n-1}+y{4}_{n-1}\right),\\ {}y{4}_n=c4\ \left(x{4}_{n-1}+y{1}_{n-1}\right),\end{array}\right\} $$
(5)

where c1, c2, c3, and c4 are the control parameters for each map respectively.
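The front end described at the start of this section (near-square arrangement, column-wise DCT, hard thresholding) can be sketched as follows. Several details are assumptions, since the text does not fix them: the matrix is filled column by column and zero-padded, the DCT is the orthonormal DCT-II built explicitly as a matrix, and the threshold is applied to magnitudes. The names `to_near_square`, `dct_matrix`, and `hard_threshold` are hypothetical, and the multilevel 2D-DWT stage between the DCT and the threshold is omitted here.

```python
import numpy as np


def to_near_square(signal):
    """Arrange a 1D signal into an almost square matrix (zero-padded, column-major)."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    rows = int(np.ceil(np.sqrt(n)))
    cols = int(np.ceil(n / rows))
    padded = np.zeros(rows * cols)
    padded[:n] = signal
    return padded.reshape(rows, cols, order="F")


def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ v is the DCT of a column vector v."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


def hard_threshold(w, t):
    """Set coefficients with magnitude below t to zero; keep the rest unchanged."""
    w = np.asarray(w, dtype=float)
    return np.where(np.abs(w) < t, 0.0, w)


if __name__ == "__main__":
    speech = np.sin(0.1 * np.arange(1000))      # stand-in for a speech frame
    X = to_near_square(speech)
    spec = dct_matrix(X.shape[0]) @ X           # DCT of every column
    # In the described system only the detail wavelet coefficients are thresholded;
    # here the threshold is applied directly to the DCT output for brevity.
    sparse = hard_threshold(spec, 0.05)
    print(X.shape, np.count_nonzero(sparse), sparse.size)
```

Because the DCT matrix is orthonormal, the inverse column transform is simply `dct_matrix(rows).T @ spec`.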

Fig. 2
figure 2

The structure of proposed speech compression system

A quantization process is applied to each chaotic signal to make it compatible with the number of bits per sample of the speech signal. Then, to ensure that no chaotic sample value is repeated, an elimination process cancels every sample whose value occurs more than once within an instantaneous chaotic plane (ICP). An ICP is a block of the eight quantized chaotic signals containing a thousand samples of each signal, starting at an instantaneous sample n. The ICP can be written as

$$ ICP=\left[\begin{array}{cccc}\overset{\sim }{x}{1}_n& \overset{\sim }{x}{1}_{n+1}& \cdots & \overset{\sim }{x}{1}_{n+1000}\\ {}\overset{\sim }{y}{1}_n& \overset{\sim }{y}{1}_{n+1}& \cdots & \overset{\sim }{y}{1}_{n+1000}\\ {}\overset{\sim }{x}{2}_n& \overset{\sim }{x}{2}_{n+1}& \cdots & \overset{\sim }{x}{2}_{n+1000}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}\overset{\sim }{y}{4}_n& \overset{\sim }{y}{4}_{n+1}& \cdots & \overset{\sim }{y}{4}_{n+1000}\end{array}\right], $$
(6)

where n is the instantaneous sample index, which shifts as the process advances, and \( \overset{\sim }{x} \) and \( \overset{\sim }{y} \) are the quantized chaotic signals.

The wavelet coefficients are compressed and encoded using the ICP planes. The number of adjacent non-zero wavelet coefficients determines which ICP component is chosen as their code. Specifically, if a run of A adjacent non-zero coefficients (A up to eight) starts at position P within the current block of a thousand coefficients, then \( ICP_{A,P} \) is chosen to encode this run. For example, if the wavelet coefficients contain the stream …, 0, 0, \( W_{77} \), \( W_{78} \), 0, …, the value in the 2nd row (two adjacent non-zero samples) and 77th column of the ICP, \( \overset{\sim }{y}{1}_{77} \), is chosen as the code. The compressed signal then contains \( \overset{\sim }{y}{1}_{77} \), \( W_{77} \), \( W_{78} \). Figure 3 illustrates an example of the compression and encoding process.

Fig. 3
figure 3

An example of compression and encoding process

After each thousand wavelet coefficients, a sample with zero value is inserted into the compressed stream. This sample marks the ICP length and allows the decoding process to resynchronize if an error occurs in the compressed samples. To retrieve the speech signal from the compressed data, the inverse steps of the compression process are performed at the receiver. If the decoder receives \( ICP_{A,P} \), the next A samples are placed in the decoding vector starting at position P. All following positions are set to zero until the position P of the next received \( ICP_{A,P} \) code. For example, if the decoder receives a sample whose value equals \( \overset{\sim }{x}{4}_{65} \), the next seven samples are placed at positions 65 to 71 of the decoding vector. The sample that follows these seven samples is then compared with the ICP to find the position and run length of the next group. Once this process is complete, the two-dimensional inverse discrete wavelet transform (2D-IDWT) and the inverse discrete cosine transform (IDCT) are applied, in that order, to the resulting decoding vector.
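The run-based encoding and decoding described above can be sketched for a single block of a thousand coefficients. This is an illustration under stated assumptions, not the authors' implementation: the ICP is replaced by a stand-in table of unique random 16-bit integers (rather than quantized Hénon samples), runs longer than eight are split, the zero-valued block marker is omitted, and the names `encode_block` and `decode_block` are hypothetical.

```python
import numpy as np

MAX_RUN = 8   # one ICP row per run length 1..8
BLOCK = 1000  # coefficients per ICP


def encode_block(w, icp):
    """Replace each run of non-zero coefficients by [ICP code, run values...]."""
    stream, p, n = [], 0, len(w)
    while p < n:
        if w[p] == 0:
            p += 1
            continue
        a = 1
        while a < MAX_RUN and p + a < n and w[p + a] != 0:
            a += 1
        stream.append(int(icp[a - 1, p]))            # row = run length, column = position
        stream.extend(float(v) for v in w[p:p + a])
        p += a
    return stream


def decode_block(stream, icp, block_len):
    """Look up each code to recover (run length, position), then copy the run back."""
    lookup = {int(icp[a, p]): (a + 1, p)
              for a in range(icp.shape[0]) for p in range(icp.shape[1])}
    w = np.zeros(block_len)
    i = 0
    while i < len(stream):
        a, p = lookup[stream[i]]
        w[p:p + a] = stream[i + 1:i + 1 + a]
        i += 1 + a
    return w


# Stand-in ICP: unique non-zero 16-bit values (zero is left free for the block marker).
rng = np.random.default_rng(0)
icp = rng.permutation(np.arange(1, 2 ** 16))[:MAX_RUN * BLOCK].reshape(MAX_RUN, BLOCK)

w = np.zeros(BLOCK)
w[76:78] = [3.5, -1.25]            # W77, W78 in the one-based numbering of the text
stream = encode_block(w, icp)
restored = decode_block(stream, icp, BLOCK)
```

Decoding relies on every ICP value being unique within the plane, which is why repeated values have to be eliminated before the plane is used. The decoder never confuses data with codes, because each code tells it how many data samples follow.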

4 Simulation results

Speech files from the ‘NOIZEUS’, ‘CMU_Arctic’, and ‘TIMIT’ speech databases are used to evaluate the performance of the proposed speech compression system. All tested signals have 16-bit samples. An eight-level DWT, mainly with the Haar family, is applied to obtain the wavelet coefficients. In addition, four coupled Hénon maps with different initial conditions and control parameters are used to generate the eight chaotic signals. To prepare the chaotic signals for the encoding process, all signals are quantized to \( 2^{16} \) levels, matching the 16 bits per sample of the tested speech files. Repeated samples in the ICP, which has dimensions 8 × 1000, are eliminated to ensure correct reconstruction of the speech signal at the receiver. As an example, Figure 4a shows the y2 samples whose values also occur in the other quantized chaotic signals; the red points mark these repeated values. Figure 4b shows the same relation after the elimination process, where the samples with repeated values within the ICP have been removed. The ICP is then ready to be used for the coding process.
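The quantization and the detection of repeated values can be sketched as below. The uniform min-max quantizer and the function names `quantize` and `repeated_mask` are assumptions for illustration; the text does not specify the quantizer or how a flagged sample is replaced, so this sketch only flags repeated values within a plane.

```python
import numpy as np


def quantize(signal, bits=16):
    """Uniformly map a real-valued signal onto integer levels 0 .. 2**bits - 1."""
    signal = np.asarray(signal, dtype=float)
    lo, hi = signal.min(), signal.max()
    scaled = (signal - lo) / (hi - lo) * (2 ** bits - 1)
    return np.floor(scaled + 0.5).astype(np.int64)


def repeated_mask(plane):
    """Boolean mask marking every entry whose value occurs more than once in the plane."""
    plane = np.asarray(plane)
    flat = plane.ravel()
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    return (counts[inverse] > 1).reshape(plane.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    plane = quantize(rng.uniform(-1.3, 1.3, size=(8, 1000)))   # stand-in for chaotic signals
    mask = repeated_mask(plane)
    print("repeated entries:", int(mask.sum()), "of", mask.size)
```

With 8000 entries drawn from 65536 levels, some collisions are expected, which illustrates why an elimination step is needed before the plane can serve as a code table.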

Fig. 4
figure 4

ICP plane before and after repeated samples elimination process

Each chaotic signal is assigned to encode groups with a specific number of adjacent non-zero samples in the wavelet coefficients. Up to eight adjacent samples can be encoded, corresponding to the eight chaotic signals. Figure 5 depicts the number of adjacent samples in the wavelet coefficients with respect to the compression level for the Sp21 speech file in the NOIZEUS database.

Fig. 5
figure 5

Number of adjacent samples in the wavelet coefficients

Several statistical performance measures, listed in the next subsection, are used to evaluate the proposed speech compression system.

4.1 The performance measures

Objective measures are used to assess the residual intelligibility of the compressed speech and the quality of the retrieved signal. The quality of the retrieved speech signal is most commonly measured by the Signal to Noise Ratio (SNR) [1]; a high SNR corresponds to high quality of the recovered speech. A higher Peak Signal to Noise Ratio (PSNR) [10] likewise indicates higher quality of the retrieved speech. Perceptual Evaluation of Speech Quality (PESQ) [8, 24] is an international standard for estimating speech quality, and it has become a worldwide industry standard test for speech-quality applications in voice processing and telephone networks. The Segmental Spectral Signal to Noise Ratio (SSSNR) [1] indicates the residual intelligibility of the encoded speech signal: the more negative the SSSNR, the stronger the encryption. The correlation coefficient (CF) [23] is a statistical measure of the similarity between signals. CF takes values between −1 and +1. A value near zero means the signals differ greatly, while a value near one confirms their similarity. Finally, two measures are suggested here: the Number of Non-Zero Coefficients (NNZC) before the thresholding process and the Number of Wavelet Coefficients (NWC). NNZC is the percentage of non-zero coefficients (the coefficients that are processed to obtain the compressed signal) among all coefficients before thresholding. NWC is the percentage increase in the number of decomposition coefficients of a wavelet family relative to the Haar wavelet family. NNZC and NWC are given by

$$ \mathrm{NNZC}=\frac{\text{number of non-zero coefficients before thresholding}}{\text{total number of coefficients}}\times 100\%. $$
(7)
$$ \mathrm{NWC}=\left(\frac{\text{number of coefficients of a family}}{\text{number of coefficients of the Haar wavelet}}-1\right)\times 100\%. $$
(8)

All these statistical measures are computed with respect to the Compression Ratio (CR), defined as the percentage ratio of the size of the compressed signal to that of the original speech signal.
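Several of these measures are simple to compute. The sketch below uses common textbook definitions of SNR, PSNR, and the correlation coefficient, which are assumptions because the text does not state the exact formulas (in particular, the peak value used for PSNR defaults here to the largest magnitude of the reference signal). NNZC, NWC, and CR follow Eqs. (7) and (8) and the definition of CR above. PESQ and SSSNR are not covered, and all function names are hypothetical.

```python
import numpy as np


def snr_db(ref, rec):
    """SNR in dB: signal energy over the energy of the reconstruction error."""
    ref, rec = np.asarray(ref, float), np.asarray(rec, float)
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum((ref - rec) ** 2))


def psnr_db(ref, rec, peak=None):
    """PSNR in dB; `peak` defaults to max |ref| (an assumption)."""
    ref, rec = np.asarray(ref, float), np.asarray(rec, float)
    peak = np.max(np.abs(ref)) if peak is None else peak
    return 10.0 * np.log10(peak ** 2 / np.mean((ref - rec) ** 2))


def cf(ref, rec):
    """Pearson correlation coefficient between two signals."""
    return float(np.corrcoef(ref, rec)[0, 1])


def nnzc_percent(coeffs):
    """Eq. (7): share of non-zero coefficients before thresholding, in percent."""
    coeffs = np.asarray(coeffs)
    return 100.0 * np.count_nonzero(coeffs) / coeffs.size


def nwc_percent(n_family, n_haar):
    """Eq. (8): percentage increase in coefficient count relative to Haar."""
    return (n_family / n_haar - 1.0) * 100.0


def cr_percent(compressed_size, original_size):
    """CR: size of the compressed signal as a percentage of the original size."""
    return 100.0 * compressed_size / original_size


if __name__ == "__main__":
    t = np.arange(1000)
    ref = np.sin(0.05 * t)
    rec = ref + 0.01 * np.cos(0.3 * t)   # stand-in for a recovered signal
    print(snr_db(ref, rec), psnr_db(ref, rec), cf(ref, rec))
```

Under this definition of CR, a smaller percentage means stronger compression, so CR = 10% is a more aggressive setting than CR = 40%.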

4.2 The results of proposed speech compression system

The performance of the proposed system is illustrated in this subsection. Figures 6 and 7 show the waveforms and spectrograms, respectively, of the original, compressed, and recovered speech signals. These figures indicate high intelligibility and quality of the retrieved signal at a high compression level (CR = 18%). The compressed signal is noise-like and bears no resemblance to the original speech. This analysis is supported by the statistical results shown in Tables 1 and 2.

Fig. 6
figure 6

Waveform of original, compressed, and recovered speech signals

Fig. 7
figure 7

Spectrogram of original, compressed, and recovered speech signals

Table 1 Simulation results of proposed speech compression system
Table 2 Comparison results of proposed compression system based on various wavelet families

Table 1 lists the SNR, PSNR, PESQ, and CF results for the retrieved speech signals and the SSSNR for the encrypted and compressed signal at various CR levels. All simulated objective measures show high values. The high compression levels are achieved by the efficient encoding of the wavelet coefficients, which requires minimal side information to retrieve the speech signal. In addition, the linear phase property of the Haar wavelet contributes to good reconstruction of the speech signal, as reflected by the high quality of the reconstructed speech in these results. Moreover, the low SSSNR values (between −41.449 and −26.4618) confirm the strength of the encryption process and indicate a high level of immunity against attacks.

To test the effect of applying other wavelet families, Table 2 compares the results of using several db, coif, and bior wavelets instead of Haar to compress a speech file from the NOIZEUS database at CR = 30%. The NNZC results in this table indicate that, except for the Haar wavelet, none of the decomposition coefficients produced by the db, coif, or bior multilevel 2D-DWT is zero before the thresholding process. This results from the highly smooth representation of the signal caused by the inherent overlapping-window property of these families, but it leads to the loss of more information through thresholding. The NWC results also indicate that the db, coif, and bior wavelets produce more coefficients (between 5% and 31% more) than Haar, because the FIR L and H filters of those families have many taps, whereas the Haar filters have only two. For these reasons, the Haar wavelet performs best in the proposed scheme in terms of SNR, PSNR, PESQ, and CF with respect to CR.

Figures 8 and 9 show the waveform, spectrogram, and correlation of the recovered speech signal without and with a \( 10^{-15} \) change in the value of the r1 parameter, respectively, both at CR = 40%. With the changed parameter, the spectrograms and correlation test results show large differences between the original and recovered speech, and the waveforms of the recovered speech confirm this.

Fig. 8
figure 8

Waveform, spectrogram and correlation test without any change in Hénon parameters

Fig. 9
figure 9

Waveform, spectrogram and correlation test with \( \pm 10^{-15} \) change in r1 parameter only

Table 3 displays the SNR and CF for various CR values under a tiny change (\( \pm 10^{-15} \)) in the r1 control parameter. The SNR and CF results clearly show that changing a control parameter by \( \pm 10^{-15} \) makes it impossible to retrieve the original speech signal. The low CF values indicate that there is no relation between the original and recovered signals.
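The mechanism behind this sensitivity can be illustrated with a single Hénon map: two sequences generated with control parameters that differ by \( 10^{-15} \) start out indistinguishable and then separate completely. The sketch below is only an illustration with a single uncoupled map in its classical form and textbook parameter values (assumptions); it does not reproduce the results in Table 3. The function name `henon_x` is hypothetical.

```python
import numpy as np


def henon_x(n, r, c=0.3, x0=0.0, y0=0.0):
    """Return n samples of the x signal of a classical Henon map (assumed form)."""
    x = np.empty(n)
    xp, yp = x0, y0
    for i in range(n):
        xn = 1.0 - r * xp * xp + yp
        yn = c * xp
        x[i] = xn
        xp, yp = xn, yn
    return x


delta = 1e-15
x_ref = henon_x(1000, r=1.4)
x_pert = henon_x(1000, r=1.4 + delta)
gap = np.abs(x_ref - x_pert)

if __name__ == "__main__":
    first = int(np.argmax(gap > 0.1))
    print("first sample where the sequences differ by more than 0.1:", first)
    print("largest gap over 1000 samples:", gap.max())
```

Once the chaotic sequences differ, the decoder's ICP no longer matches the one used by the encoder, so the received codes cannot be mapped back to the correct run lengths and positions.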

Table 3 SNR and CF of retrieved speech under a tiny change in a control parameter

In the proposed system, the wavelet family, the wavelet level, and all the control parameters and initial values of the four Hénon maps are used as secret keys. In general, increasing the number of keys in an encryption system enlarges its key space. The sixteen parameters of the four Hénon maps alone give a key space of \( (10^{15})^{16} = 10^{240} \).
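As a quick check of this arithmetic, assume that each of the sixteen parameters (a control pair and an initial pair for each of the four maps) can be resolved to about \( 10^{-15} \), giving roughly \( 10^{15} \) distinguishable values per parameter. The key space and its approximate size in bits are then

$$ {\left({10}^{15}\right)}^{16}={10}^{15\times 16}={10}^{240},\kern2em {\log}_2{10}^{240}=240\,{\log}_2 10\approx 797\ \mathrm{bits}. $$

The wavelet family and decomposition level add to this figure but are not counted in the estimate.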

The performance of the proposed compression and encryption system has been compared with the schemes presented in [1, 9, 11, 19, 20, 22, 27]. Table 4 summarizes the CR, SNR, PSNR, and PESQ results for the proposed system, together with those of these measures that are available for the compared schemes. The proposed compression system outperforms the other schemes at the same CR values. The SNR and PESQ values indicate that the proposed system attains a high level of quality and intelligibility for the reconstructed signal. The encryption strength of the proposed system is also confirmed by its lower SSSNR of −38.74 dB, compared with −20.78 and −14.834 for [9] and [1], respectively.

Table 4 Comparison results of proposed compression system with various compression schemes

The sparsification of the speech information by the multilevel 2D-DWT, together with the efficient proposed encoding of the significant coefficients, gives the proposed compression scheme an advantage over the other schemes in terms of compression ratio and quality of the retrieved speech.

5 Conclusions

In this paper, the proposed system compresses and encrypts the speech signal simultaneously. The discrete wavelet transform sparsifies the speech information, and the proposed coding process, which is based on eight signals from Hénon maps, then encodes the significant coefficients. The simulation results show the superior performance of the proposed system in terms of SNR, PESQ, PSNR, CF, and SSSNR with respect to CR. At a low CR value of 10%, the results are SNR = 11.5496 dB, PSNR = 58.21 dB, PESQ = 3.02945, and CF = 0.96437. These results reflect high intelligibility and quality of the reconstructed speech signals. The proposed system also provides high encryption strength with a large key space for the compressed speech: the SSSNR values (between −41.449 and −26.4618 dB) are very low. Consequently, it is hard for an intruder to extract the original speech signal.