1 Introduction

The rapid growth of the Internet and its applications has drastically increased the need to secure multimedia data transmitted over public networks. The Internet is used in almost every domain, including education, government, the military, banking, and commerce. The growing popularity of the Internet and of electronic gadgets increases the volume of multimedia data on the network, and mobile phones are widely used to transmit audio, image, video, and text data. A public network is open to all and is therefore vulnerable to attacks by hackers and intruders, so multimedia data must be protected against unauthorized access, with both confidentiality and integrity preserved. Different techniques have been proposed to secure multimedia content, including encryption, watermarking, and steganography. Protecting multimedia content differs from protecting regular text data. In particular, most of the traffic generated on mobile devices is audio traffic, so protecting the audio signal against attackers and enforcing its confidentiality is a fundamental requirement today.

Audio is one of the predominant forms of information representation today and is widely used for many types of communication. Secrets are now routinely shared in audio form over public channels, and audio is accepted as evidence in court cases, so digital audio needs to be protected against unauthorized access. However, audio signals differ substantially from text and images. They are represented as waveforms characterized by properties such as frequency, amplitude, and phase. Most existing cryptographic algorithms are best suited to text data and cannot be applied directly to audio signals because of this representation and, in particular, because audio data is high in volume and highly redundant. Hence, efficient cryptographic algorithms are required to secure sensitive audio signals before they are transmitted over public networks, in particular the Internet and mobile networks. Designing such an audio encryption algorithm remains a challenging task.

Researchers have recently studied this problem and proposed different kinds of algorithms for protecting audio against unauthorized access (Al Saad and Hato 2014; Li et al. 2009; Kohad et al. 2012; Sharma 2012; Zeng et al. 2012; Sheu 2011; Elshamy et al. 2013; Zhao et al. 2014; Mermoul and Belouchrani 2010; Al-Karim et al. 2013). Several image- and video-based encryption, watermarking, and steganography algorithms are available in the literature (Zhang et al. 2008; Lin and Chang 2001; Petitcolas et al. 1999; Langelaar et al. 2000; Chen and Lin 2003; Barni et al. 2001; Refregier and Javidi 1995; Hedelin et al. 1999; Yang et al. 1998; Kim et al. 2004; Kwon et al. 2006; Wu and Ng 2002; Wang and Fan 2010), but audio protection methods are comparatively few. In audio scrambling techniques, the audio signal is rearranged to remove the correlation between audio samples. Most audio scrambling techniques are based on 1D linear mapping (Zeng et al. 2012), but these algorithms are vulnerable to attack: because an audio signal varies only slightly over time, adjacent samples have similar values. Audio encryption is therefore a challenging task, and only a limited number of techniques have been proposed to protect audio (Al Saad and Hato 2014; Li et al. 2009; Kohad et al. 2012; Sharma 2012; Zeng et al. 2012; Sheu 2011; Elshamy et al. 2013; Zhao et al. 2014; Mermoul and Belouchrani 2010; Al-Karim et al. 2013). This paper addresses this problem and proposes an efficient audio encryption scheme to provide confidentiality for sensitive audio signals.

The rest of the paper is arranged as follows. Section 2 reviews the existing models for audio encryption. In Sect. 3, the proposed SEED encryption and decryption algorithms are described with diagrams. Section 4 presents the experimental results, performance analysis, and security analysis. The security analysis illustrates that the proposed algorithms are highly sensitive to minor alterations of the keys. The statistical analysis covers mean square error (MSE), peak signal-to-noise ratio (PSNR), correlation analysis, and histogram analysis, and indicates that the proposed algorithms resist statistical attacks. The experimental results demonstrate the usefulness of the audio encryption scheme. A brief conclusion is given in Sect. 5.

1.1 Motivation

Microphones that can act as hidden listening devices are present everywhere and give hackers a way in. Hackers gain entry to government and corporate systems through remote access trojans. They can capture audio through these microphones and transmit it as compressed audio files via email for illegal use. Many computing systems can be compromised if their audio and microphone channels are not physically partitioned. The risks grow rapidly in Voice over Internet Protocol (VoIP) phone systems.

Intruders can gain remote access to microphones and easily evade security software, so that their activity goes undetected. The use of memory buffers and other types of storage devices further raises the danger of misuse, and malicious software can easily manipulate these technologies when users switch between systems. Many VoIP networks transmit data between networks with different security policies, which increases the danger of electromagnetic interference leakage. Because the switch logic in firmware is reprogrammable, it can be tampered with, and it is then difficult to identify whether it has been used or compromised. New approaches are emerging to avoid malicious audio signal interference. By keeping audio signals physically separated from the microphone or speaker signals, the likelihood of leakage between signals on either side can be removed. As a result, organizations can prevent signals from being suppressed or manipulated by malicious software, thereby preserving the integrity of the signal when users switch between computers. Further, a microphone mute button that physically disconnects the microphone when it is not in use, and that cannot be manipulated by software or drivers, provides extended system security.

2 Related works

Audio encryption based on one- and two-dimensional discrete-time chaotic systems was proposed by Akgül and Kaçar (2015). In this model, both mono and stereo audio samples are scrambled, and the security of the algorithm is increased by non-linear models. Sadkhan and Mohammed (2015) proposed a pseudo-random bit generator for audio encryption based on a chaotic map. Tamimi and Abdalla (2014) demonstrated a scrambling process to protect audio with traditional block cipher algorithms; the secret key was designed to depend on both the audio signal and the public key. Lima and Silva Neto (2016) proposed an audio scrambling method using the cosine number transform (CNT). The CNT is defined over finite fields and is applied recursively to blocks of samples of raw uncompressed audio; the blocks are selected using an overlapping rule, which yields confusion and diffusion across the different blocks of the encrypted audio.

Ciptasari et al. (2014) demonstrated a technique based on a hybrid combination of the Discrete Wavelet Transform (DWT) and the Discrete Cosine Transform (DCT) to improve the resilience of audio protection. It is used to provide visual cryptography, time stamping, and watermarking on digital data. In that work, the watermark is not embedded in the plain audio; instead, it is used to create a secret image and a public image that protect the audio signal. In visual cryptographic techniques, chaotic maps such as the Chebyshev map (Liu and Wang 2010) and the Tent map, and chaotic systems such as the Lorenz system (Anees 2015) and the Chen system (Tong et al. 2015), are regularly used to create random sequences such as key streams, because a minor modification of one of the initial parameters leads to an utterly different trajectory. Augustine et al. (2015) proposed an audio scrambling technique based on compressive sensing (CS) and the Arnold transform (AT). The scrambling and compressive sensing are carried out by means of a key-based depth matrix, and the encryption is performed using an Arnold matrix whose initial condition is created by a Piecewise Linear Chaotic Map (PWLCM). Audio encryption algorithms proposed to handle outmoded and strong audio signals include the chaos-based and double random phase encoding (DRPE) methods (Al-Karim et al. 2013).

Chaotic-map-based audio encryption algorithms have been proposed in several works (Eldin et al. 2015; Elkholy et al. 2015; Mostafa et al. 2015; Alwahbani and Bashier 2013). An audio encryption scheme based on LFSR is proposed in James et al. (2014). Dengre and Gawande (2015) proposed audio encryption for uncompressed data. Selective audio data encryption for a multimodal surveillance system is proposed in Cichowski and Czyzewski (2012). Datta and Gupta (2013) proposed fractional encryption and watermarking methods for audio signals with a reduction of quality. Rashidi and Rashidi (2013) proposed an FPGA-based AES encryption algorithm for audio signals. Voice authentication and real-time audio encryption are proposed in Nguyen et al. (2013). Kulkarni and Patil (2015) proposed a strong encryption technique for hiding audio data in digital images for better security. Ashok et al. (2013) proposed a secure cryptographic scheme for audio signals. Iyer et al. (2016) proposed multimedia encryption based on a hybrid approach. Context-aware multimedia encryption is proposed in Fazeen and Bajwa (2014). Washio and Watanabe (2014) proposed an audio secret sharing scheme. Zhao et al. (2014) proposed a dual-key speech encryption algorithm based on underdetermined BSS. Scrambling-based speech encryption via compressed sensing is proposed in Zeng et al. (2012). Lu et al. (2012) demonstrated audio data hiding based on AT and double random phase encoding methods.

Audio is one of the essential forms of information representation and is broadly used in present society. Numerous confidential commercial conversations need to be protected, and in some cases, such as sensitive business conversations, audio is acceptable as proof in court. In many real-time situations, digital audio therefore needs to be concealed as secret information and protected from malicious exploits. Growing awareness of individual privacy protection has driven the rapid development of protection mechanisms, and audio encryption has attracted a great deal of interest from researchers.

3 Proposed multi-tier seed model

Figure 1 depicts the multi-tier SEED model for the proposed SAIL cryptosystem. The activities carried out in each tier during encryption (the forward process) and decryption (the reverse process) are shown as four tiers. The input audio signal is first digitized by analog-to-digital conversion. In the first tier, the digitized signal is segmented; in the second tier, it is compressed by applying the discrete wavelet transform (DWT); in the third tier, the compressed signal is encrypted using ECC. The final tier performs desegmentation to construct the digital audio information.

Fig. 1 Multi-tier SEED model for SAIL cryptosystem

DWT has been preferred over alternative transformations for the following reasons:

  1. It can offer better audio quality than DCT with an increased compression ratio.

  2. DWT performs compression on the whole file rather than block by block, and hence the compression errors are distributed across the entire file.

ECC has been preferred over RSA for the following reasons:

  1. It provides the same level of security with a key of just 160 bits as RSA does with a 1024-bit key, as per the recommendation of the National Institute of Standards and Technology (NIST), and key generation is also faster in ECC.

  2. It is not vulnerable to timing attacks in the way RSA is.

  3. Brute force attacks and Pollard's rho attack are computationally expensive or infeasible, as they involve exponential running time.

  4. Computational complexity and overhead are minimal in ECC compared with RSA, as the former is based on an additive group whereas the latter is based on a multiplicative group.

  5. ECC involves point operations, which are less complicated than the exponentiation performed in RSA.

  6. It is more suitable for power-constrained devices, as it requires less computing power.

The novelty of the proposed SEED system relies entirely on the selection of an appropriate elliptic curve over a prime field.

3.1 Elliptic curve cryptography

Elliptic curve cryptography (ECC) is an asymmetric cryptosystem standardized by IEEE P1363. It offers the same level of security as Rivest-Shamir-Adleman (RSA) but with a smaller key size, and hence it reduces the processing overhead. An elliptic curve is defined by a Weierstrass equation of the form given in Eq. (1).

$${y^2}+axy+by={x^3}+c{x^2}+dx+e$$
(1)

where a, b, c, d, and e are real numbers, and x and y take on values in the real numbers. The simplified form of Eq. (1) is

$${y^2}={x^3}+ax~+b$$
(2)

Equation (2) is a cubic equation (degree 3), where a and b are coefficients and x and y are variables. An elliptic curve over a finite field uses either a prime curve or a binary curve. The prime curve is based on GF(p): the coefficients and variables take on values in the set of integers from 0 to p − 1, and the curve is represented as \({E_p}(a,b)\). The binary curve is based on GF(\({2^m}\)): the variables and coefficients of the cubic equation take on values in GF(\({2^m}\)), and the curve is represented as \({E_{{2^m}}}(a,b)\).

3.1.1 Arithmetic operations on Ep(a,b)

\(P+O=P\), where \(P \in {E_p}(a,b)\) and \(O\) is the point at infinity (the additive identity)

If \(P=({x_p},{y_p})\) then \(- P=({x_p}, - {y_p})\)

If \(P=( {{x_p},{y_p}} )\) and \(Q=({x_Q},{y_Q})\) with \(P \ne ~ - Q\) then \(R=P+Q=({x_R},{y_R})\) is based on the formula given below,

$${x_R}=\left( {{\lambda ^2} - {x_p} - {x_Q}} \right)\bmod p$$
(3)
$${y_R}=\left( {\lambda \left( {{x_p} - {x_R}} \right) - {y_p}} \right)\bmod p$$
(4)

where \(\lambda =\begin{cases} \dfrac{{y_Q} - {y_p}}{{x_Q} - {x_p}} \bmod p, & \text{if } P \ne Q \\ \dfrac{3x_p^2+a}{2{y_p}} \bmod p, & \text{if } P=Q \end{cases}\)

Elliptic curve encryption and decryption need a base point G on an elliptic curve \({E_p}(a,b)\). B selects a private key d and computes the public key as \({P_A}=d*G\). B transmits the pair \((G,{P_A})\) to A. A encodes the message as a point \({P_m}\), selects a secret random key r, and encrypts \({P_m}\) as follows,

$$C1=r*G,~C2={P_m}+~r~{P_A}$$
(5)

The pair (C1, C2) is transmitted across the network. B decrypts the message as follows,

$${P_m}=C2 - d*C1$$
(6)
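The point arithmetic of Eqs. (3) and (4) and the encryption and decryption of Eqs. (5) and (6) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: it assumes a small textbook curve, \({E_{17}}(2,2)\) with base point G = (5, 1) of order 19, and the key values d and r and all function names are hypothetical.

```python
# Toy parameters (assumption, not the curve used in the paper):
# E_17(2, 2): y^2 = x^3 + 2x + 2 over GF(17), base point G of order 19.
P_MOD, A, B = 17, 2, 2
G = (5, 1)
ORDER = 19
O = None  # point at infinity, the additive identity


def ec_neg(pt):
    """Return -P = (x, -y mod p)."""
    if pt is O:
        return O
    x, y = pt
    return (x, (-y) % P_MOD)


def ec_add(p1, p2):
    """Add two curve points using Eqs. (3) and (4)."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O  # P + (-P) = O
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def ec_mul(k, pt):
    """Scalar multiplication k * P by double-and-add."""
    result, addend = O, pt
    while k > 0:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result


def encrypt(pm, pub, r):
    """Eq. (5): C1 = r*G, C2 = Pm + r*P_A."""
    return ec_mul(r, G), ec_add(pm, ec_mul(r, pub))


def decrypt(c1, c2, d):
    """Eq. (6): Pm = C2 - d*C1."""
    return ec_add(c2, ec_neg(ec_mul(d, c1)))


d = 7                    # B's private key (illustrative value)
pub = ec_mul(d, G)       # B's public key P_A = d*G
pm = ec_mul(3, G)        # a message already encoded as a curve point
c1, c2 = encrypt(pm, pub, r=4)
recovered = decrypt(c1, c2, d)
```

Decryption works because \(C2 - d*C1={P_m}+r*d*G - d*r*G={P_m}\). The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.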

3.2 Proposed SEED encryption algorithm

The SEED encryption algorithm is given in Algorithm 1 and shown in Fig. 2. The original audio is digitized and segmented into 8-bit segments. The first 44 bytes hold the wave description (header), and the remainder is the payload data, which is likewise segmented into 8-bit units. The audio payload is compressed by the 1D DWT. Elliptic curve encryption is then applied to each 8-byte block of data. The encrypted data is prefixed with the 44-byte wave description to form the cipher audio signal. The output of the SAIL cryptosystem is thus the compressed and encrypted audio signal, which is transmitted across the network.

Fig. 2 SAIL cryptosystem forward process

Elliptic curve selection for audio encryption based on complex multiplication (CM) method

  1. Given a prime number \(p\), find the smallest discriminant \(D\), together with the trace \(t\), that satisfies Eq. (7); the order of the curve then follows from Eq. (8).

$$4p={t^2}+D{s^2},\;{\text{where }}t,s \in Z$$
(7)
$$\# E({F_p})=p+1 - t,\;{\text{where }}\left| t \right| \leq 2\sqrt p$$
(8)

  2. Check whether the order of \(E({F_p})\) has an admissible factorization. Otherwise, choose a different \(D\) and \(t\). Repeat step 2 until an order with an acceptable factorization is found.

  3. Create the class polynomial \({H_D}(x)\).

  4. Find a root \({j_0}\) of \({H_D}(x)\), where \({j_0}\) is the j-invariant of the curve.

  5. Set \(k\) and the curve as in Eq. (9).

$$k=\frac{{{j_0}}}{{1728 - {j_0}}} \pmod p,\quad E({F_p}):{y^2}={x^3}+3kx+2k$$
(9)

  6. Verify the order of the curve. If it is not equal to \(p+1 - t\), then create the twist using a randomly chosen nonsquare \(C \in {F_p}\).
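The relations in Eqs. (7) and (8) can be checked numerically on a small curve. The sketch below is illustrative only: it counts points by brute force, which is feasible only for tiny primes and is not part of the CM method itself, and it assumes the toy curve \({y^2}={x^3}+2x+2\) over \({F_{17}}\). The function name `count_points` is hypothetical.

```python
def count_points(p, a, b):
    """Brute-force #E(F_p) for y^2 = x^3 + a*x + b, including the point at infinity."""
    # roots[v] = number of y in F_p with y^2 = v (mod p)
    roots = [0] * p
    for y in range(p):
        roots[y * y % p] += 1
    total = 1  # the point at infinity
    for x in range(p):
        total += roots[(x * x * x + a * x + b) % p]
    return total


p, a, b = 17, 2, 2            # toy parameters (assumption)
order = count_points(p, a, b)
t = p + 1 - order             # Eq. (8) rearranged for the trace t
hasse_ok = t * t <= 4 * p     # |t| <= 2*sqrt(p), written without a square root
d_s2 = 4 * p - t * t          # Eq. (7): D * s^2 = 4p - t^2
```

For this toy curve the order is 19, which is prime, so t = −1 and \(D{s^2}=67\), giving D = 67 with s = 1.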

In this audio encryption algorithm, every 8 bytes form one point with 4-byte coordinates. To cover the points, the prime \(p={2^{31}} - 1=2147483647\), the largest prime that fits in a signed 32-bit integer, is used. The CM method is applied to compute the elliptic curve over \({F_{2147483647}}\); the constructed curve is,

$$E\left( {{F_{2147483647}}} \right):~{y^2}={x^3}+390064447,~{\text{where }}a=0,~b=390064447$$
(10)

The selected curve is well suited to the encryption of all three categories of audio: human voice, animal sound, and instrumental music.


Algorithm 1: SEED Encryption

Input: Digitized plain audio signal Ap

Output: Cipher audio signal Ac

Procedure

  Step 1. Digital segmentation tier

The digitized audio signal \({A_p}\) is fragmented into segments of size 1 byte: \({A_p}=\left\{ {S_{p}^{1},S_{p}^{2},S_{p}^{3}, \ldots S_{p}^{N}} \right\}\)

The first 44 bytes \({A_{sw}}=\left\{ {S_{p}^{1},S_{p}^{2},S_{p}^{3}, \ldots S_{p}^{{44}}} \right\}\) contain the wave data, and the remaining \({A_{spl}}=\left\{ {S_{p}^{{45}},S_{p}^{{46}},S_{p}^{{47}}, \ldots S_{p}^{N}} \right\}\) is the audio payload.
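The segmentation step can be sketched in Python as a simple byte split. The sketch assumes the canonical 44-byte PCM WAV header stated above; it does not parse the header fields, and the function name `split_wave` and the synthetic input are hypothetical.

```python
HEADER_SIZE = 44  # wave description length, as stated in the text


def split_wave(data: bytes):
    """Split a byte string into (A_sw, A_spl): 44-byte header and payload."""
    if len(data) < HEADER_SIZE:
        raise ValueError("input is shorter than a 44-byte wave header")
    return data[:HEADER_SIZE], data[HEADER_SIZE:]


# Synthetic stand-in for the bytes of an audio file (not real audio).
demo = bytes(range(44)) + bytes([10, 20, 30, 40])
header, payload = split_wave(demo)
```

Concatenating `header` and `payload` reproduces the input, which is what the desegmentation tier relies on.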

  Step 2. One dimensional compression tier

Take the audio payload \({A_{spl}}=\left\{ {S_{p}^{{45}},S_{p}^{{46}},S_{p}^{{47}}, \ldots S_{p}^{N}} \right\}\) and compute the length of the audio payload vector: \(N=\left| {{A_{spl}}} \right|\)

Haar scaling function is described in Eq. (11)

$$\phi \left( x \right)=\left\{ {\begin{array}{*{20}{c}} {1,~if~0~ \leq x<1} \\ {0,~Otherwise~~~~} \end{array}} \right.$$
(11)

Haar wavelet mother function is described in Eq. (12)

$$\psi \left( x \right)=\phi \left( {2x} \right) - \phi (2x - 1)$$
(12)
$$\psi \left( x \right)=\begin{cases} 1, & 0 \leq x<\tfrac{1}{2} \\ - 1, & \tfrac{1}{2} \leq x<1 \\ 0, & {\text{otherwise}} \end{cases}$$

Consider the audio payload as a vector of length \(N=~{2^n}\)

1-level Haar transform for \(f=({x_1},{x_2}, \ldots ,{x_N})\)

$$f\mathop \to \limits_{{H_{1} }} \left( {\left. {a^{1} } \right|d^{1} } \right)$$
(13)

where,

$${a^1}=\left( {\frac{{{x_1}+{x_2}}}{{\sqrt 2 }},\frac{{{x_3}+{x_4}}}{{\sqrt 2 }}, \ldots ,\frac{{{x_{N - 1}}+{x_N}}}{{\sqrt 2 }}} \right)$$
(14)
$${d^1}=\left( {\frac{{{x_1} - {x_2}}}{{\sqrt 2 }},\frac{{{x_3} - {x_4}}}{{\sqrt 2 }}, \ldots ,\frac{{{x_{N - 1}} - {x_N}}}{{\sqrt 2 }}} \right)$$
(15)

1-level Haar wavelets:

$$W_{1}^{1}=\left( {\frac{1}{{\sqrt 2 }}, - \frac{1}{{\sqrt 2 }},0,0, \ldots 0} \right)$$
$$W_{2}^{1}=\left( {0,0,\frac{1}{{\sqrt 2 }}, - \frac{1}{{\sqrt 2 }} \ldots 0} \right)$$
$$W_{{N/2}}^{1}=\left( {0,0,0,0,..\frac{1}{{\sqrt 2 }}, - \frac{1}{{\sqrt 2 }}} \right)$$

Therefore, d1 is represented in Eq. (16)

$${d^1}=(fW_{1}^{1},fW_{2}^{1}, \ldots fW_{{N/2}}^{1})$$
(16)

1-level Haar scaling functions:

$$V_{1}^{1}=\left( {\frac{1}{{\sqrt 2 }},\frac{1}{{\sqrt 2 }},0,0, \ldots 0} \right)$$
$$V_{2}^{1}=\left( {0,0,\frac{1}{{\sqrt 2 }},\frac{1}{{\sqrt 2 }}, \ldots 0} \right)$$
$$V_{{N/2}}^{1}=\left( {0,0,0,0, \ldots \frac{1}{{\sqrt 2 }},\frac{1}{{\sqrt 2 }}} \right)$$

Therefore, a1 is represented in Eq. (17)

$${a^1}=(fV_{1}^{1},fV_{2}^{1}, \ldots fV_{{N/2}}^{1})$$
(17)

\(V_{1}^{1},V_{2}^{1}, \ldots .V_{{N/2}}^{1},W_{1}^{1},W_{2}^{1}, \ldots W_{{N/2}}^{1}\) construct an orthonormal basis in an N-dimensional space.

$$V_{i}^{1} \cdot V_{j}^{1}=0,~W_{i}^{1} \cdot W_{j}^{1}=0~{\text{for }}i \ne j,~V_{i}^{1} \cdot W_{j}^{1}=0~{\text{for all }}i,j$$
(18)

\(\left| {V_{i}^{1}} \right|=\left| {W_{i}^{1}} \right|=1\) They form a new coordinate system.
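The 1-level Haar transform of Eqs. (14) and (15) can be sketched in plain Python as follows. The function name `haar_level1` and the sample vector are illustrative assumptions, and the sketch operates on a list of numbers rather than on an actual audio payload.

```python
import math


def haar_level1(f):
    """Return (a1, d1): trend and fluctuation of a 1-level Haar transform."""
    if len(f) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    a1 = [(f[i] + f[i + 1]) / s for i in range(0, len(f), 2)]  # Eq. (14)
    d1 = [(f[i] - f[i + 1]) / s for i in range(0, len(f), 2)]  # Eq. (15)
    return a1, d1


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # illustrative samples
a1, d1 = haar_level1(signal)

# Because the basis is orthonormal, the transform preserves signal energy.
energy_in = sum(x * x for x in signal)
energy_out = sum(x * x for x in a1 + d1)
```

For slowly varying audio, most of the energy ends up in the trend a1 while the fluctuation d1 stays small, which is what makes the transform useful for compression.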

  Step 3. Encryption

The elliptic curve chosen for the sound encryption is \({E_{2147483647}}\,\,( {0,390064447} )\), based on the CM method presented above.

The generator point M is selected as \((1027045486,1393612238)\).

Select a random seed \(k\) from \(\left[ {1..(n - 1)} \right]\) and set \(j=1\).

For every 8 bytes in the audio payload, repeat the following:

Form the point \({X_i}\) by taking the first 4 bytes as the x coordinate and the remaining 4 bytes as the y coordinate.

Cipher points are generated by Eqs. (19) and (20).

$${C_{i1}}=k*M,~{C_{i2}}={X_i}+k*Q$$
(19)
$${Y_j}=({C_{i1}},{C_{i2}})$$
(20)

Here Q denotes the receiver's public key (\({P_A}\) in Eq. (5)).

The cipher audio signal is \({A_{\text{c}}}={A_{\text{sw}}}\) padded with Y, where \({A_{\text{sw}}}\) is the wave data and Y is the compressed, encrypted audio.
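The point-forming part of Step 3, in which every 8 bytes become one point with 4-byte coordinates, can be sketched as follows. The byte order is not stated in the text, so big-endian unsigned integers are assumed here. The sketch performs only the byte packing and does not check that a point lies on the curve; the function names are hypothetical.

```python
import struct


def payload_to_points(payload: bytes):
    """Group the payload into 8-byte blocks and unpack each as a point (x, y)."""
    usable = len(payload) - len(payload) % 8  # trailing bytes are left out in this sketch
    points = []
    for offset in range(0, usable, 8):
        # ">II" = two big-endian unsigned 32-bit integers (assumed layout)
        x, y = struct.unpack_from(">II", payload, offset)
        points.append((x, y))
    return points


def points_to_payload(points):
    """Inverse of payload_to_points for complete 8-byte blocks."""
    return b"".join(struct.pack(">II", x, y) for x, y in points)


demo_payload = bytes(range(16))  # synthetic data, not real audio
demo_points = payload_to_points(demo_payload)
```

Each resulting pair would then be passed to the elliptic curve encryption of Eq. (19).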

3.3 Proposed SEED decryption algorithm

The audio decryption algorithm is shown in Fig. 3, and its details are given in Algorithm 2. The cipher audio is segmented with a segment size of 8 bytes. The first 44 bytes represent the wave description, and the remainder is a sequence of 8-byte blocks. Each pair of 8-byte blocks (a cipher pair) is decrypted using the elliptic curve decryption algorithm. Finally, the 1D inverse wavelet transform is applied to recover the original signal.

Fig. 3 SAIL cryptosystem reverse process

Algorithm 2: SEED Decryption

Input: Cipher audio signal Ac

Output: Digitized plain audio signal Ap

Procedure:

  Step 1. Digital Segmentation

The digitized audio signal \({A_c}\) is fragmented into segments of size 8 bits: \({A_c}=\left\{ {S_{c}^{1},S_{c}^{2},S_{c}^{3} \ldots S_{c}^{N}} \right\}\)

The first 44 bytes \({A_{sw}}=\left\{ {S_{c}^{1},S_{c}^{2},S_{c}^{3}, \ldots S_{c}^{{44}}} \right\}\) contain the wave data, and the remaining \({A_{sc}}=\left\{ {S_{c}^{{45}},S_{c}^{{46}},S_{c}^{{47}}, \ldots S_{c}^{N}} \right\}\) is the audio payload.

  Step 2. Decryption

The elliptic curve chosen for the audio encryption is \({E_{2147483647}}\,\,( {0,390064447} )\)

The generator point M is \((1027045486,1393612238)\)

Select the random seed from \(\left[ {1..(n - 1)} \right]\)

For each cipher pair in \({A_{sc}}\), repeat the following:

\({X_i}={C_{i2}} - d*{C_{i1}}\)

  Step 3. 1D Decompression

The Haar wavelet defined in Eq. (12) is used here for the inverse discrete wavelet transform.

The transformation \({H_1}\) is reversible. That means f is reconstructed from \(( {{a^1},{d^1}} )\)

$${a^1}=({a_1}, \ldots {a_{N/2}})$$
(21)
$${d^1}=({d_1}, \ldots {d_{N/2}})$$
(22)
$$f=\left(\frac{{{a_1}+{d_1}}}{{\sqrt 2 }},\frac{{{a_1} - {d_1}}}{{\sqrt 2 }}, \ldots ,\frac{{{a_{N/2}}+{d_{N/2}}}}{{\sqrt 2 }},\frac{{{a_{N/2}} - {d_{N/2}}}}{{\sqrt 2 }}\right)$$
(23)

The reconstruction from the 1-level Haar transform is given by Eq. (24)

$$f=\left( {\frac{{{a_1}+{d_1}}}{{\sqrt 2 }},\frac{{{a_1} - {d_1}}}{{\sqrt 2 }}, \ldots .\frac{{{a_{N/2}}+{d_{N/2}}}}{{\sqrt 2 }},\frac{{{a_{N/2}} - {d_{N/2}}}}{{\sqrt 2 }}} \right)={A^1}+{D^1}$$
(24)

\({A_p}={A_{sw}}\) padded with X; where \({A_{sw}}\) is the wave data, and X is the decrypted and decompressed audio
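The reconstruction in Eq. (23) can be sketched in Python and checked with a round trip through the forward transform. The function names and the sample vector are illustrative assumptions.

```python
import math


def haar_level1(f):
    """Forward 1-level Haar transform: trend a1 and fluctuation d1."""
    s = math.sqrt(2.0)
    a1 = [(f[i] + f[i + 1]) / s for i in range(0, len(f), 2)]
    d1 = [(f[i] - f[i + 1]) / s for i in range(0, len(f), 2)]
    return a1, d1


def inverse_haar_level1(a1, d1):
    """Rebuild f from (a1, d1) following Eq. (23)."""
    s = math.sqrt(2.0)
    f = []
    for a, d in zip(a1, d1):
        f.append((a + d) / s)  # odd-indexed sample x_{2m-1}
        f.append((a - d) / s)  # even-indexed sample x_{2m}
    return f


original = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # illustrative samples
trend, fluct = haar_level1(original)
restored = inverse_haar_level1(trend, fluct)
max_error = max(abs(x - y) for x, y in zip(original, restored))
```

The round trip is exact up to floating-point rounding as long as both a1 and d1 are kept; discarding or quantizing d1 for compression would make the reconstruction approximate.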

4 Experimental results

The designed SAIL system has been implemented in Python, and the statistical and security analyses have been performed in MATLAB. Audio signals with a sampling rate of 8 kHz are used for human voice and animal voice, and a sampling rate of 48 kHz is used for instrumental music. All the audio signals are initially fed in uncompressed form. Samples are taken from three different categories, namely human voice, animal voice, and instrumental music. These signals are encoded into binary using quantization. The “dense” visual appearance of the encrypted waveform reflects the rapid variations introduced by the encryption process.

4.1 Histogram analysis

Histogram analysis is performed on all three categories, namely animal sound, human sound, and instrumental music, and is depicted in Figs. 4, 5 and 6. Figure 4a–c displays the original audio, its encrypted version, and the corresponding decrypted version, respectively. Figure 4b shows the histogram of an encrypted sound file, and Fig. 4c illustrates the decrypted sound. The properties of the encrypted audio signals are also visible in their corresponding histograms. The histogram of the dog barking sound in Fig. 4c follows a specific distribution, similar to the distributions obtained for the other plain audio signals. In contrast, the histogram of the encrypted audio in Fig. 4b has a very flat structure. The same behavior is observed for the other audio signals, namely human voice and instrumental music.
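How flat a histogram is can be quantified with the Shannon entropy of the sample values. The sketch below is illustrative only: it assumes 8-bit sample values, uses synthetic data rather than the audio files of the experiments, and the function names are hypothetical.

```python
import math
from collections import Counter


def histogram(samples):
    """Count how often each sample value occurs."""
    return Counter(samples)


def shannon_entropy(samples):
    """Entropy in bits per sample; the maximum is log2(number of possible values)."""
    counts = histogram(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


flat = list(range(256)) * 4           # every 8-bit value equally often
peaked = [128] * 1000 + [127] * 24    # values concentrated on two levels

flat_entropy = shannon_entropy(flat)      # reaches the 8-bit maximum
peaked_entropy = shannon_entropy(peaked)  # far below the maximum
```

A perfectly flat histogram reaches the maximum entropy (8 bits for 8-bit samples, 16 bits for 16-bit samples), whereas a concentrated histogram such as that of plain audio gives a much lower value.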

Fig. 4 a Dog barking sound, b encrypted sound, c decrypted sound

Fig. 5 a Human voice (hello), b encrypted voice, c decrypted voice

Fig. 6 a Music instrument sound (piano), b encrypted music, c decrypted music

4.2 Time domain and frequency domain analysis

The time domain and frequency domain characteristics of the plain audio signals (human voice, animal sound, and instrumental sound) and their corresponding encrypted audio are shown in Figs. 7, 8 and 9. The figures indicate that the encrypted audio bears no similarity to the plain audio; it is full of noise and hence unintelligible. The decryption algorithm recovers the original sound successfully: Figs. 7, 8 and 9 show that the decrypted audio resembles the original audio.

Fig. 7 a Dog barking sound, b encrypted audio, c decrypted sound

Fig. 8 a Human voice (hello), b encrypted voice, c decrypted voice

Fig. 9 a Music instrument sound (piano), b encrypted music, c decrypted music

4.3 Correlation analysis

The statistical properties of the original and encrypted signals are analyzed by calculating correlation coefficients. Equation (25) shows the correlation coefficient formula. It is computed on p randomly selected samples from each category of audio signal: human voice, animal sound, and instrumental music.

$${r_{xy}}=\frac{{cov(x,~y)}}{{\sqrt {D\left( x \right)D\left( y \right)} }}$$
(25)
$$cov\left( {x,y} \right)=~\frac{1}{p}\mathop \sum \limits_{{i=1}}^{p} \left( {{x_i} - E\left( x \right)} \right)\left( {{y_i} - E\left( y \right)} \right),~~D\left( x \right)=\frac{1}{p}\mathop \sum \limits_{{i=1}}^{p} {({x_i} - E\left( x \right))^2},~E\left( x \right)=\frac{1}{p}\mathop \sum \limits_{{i=1}}^{p} {x_i}$$
(26)

Here \({x_i}\) is the value of the i-th chosen audio sample, and \({y_i}\) is the value of the corresponding adjacent audio sample. Original digital audio signals have correlation coefficients close to 1, exhibiting strong resemblance between adjacent samples, whereas encrypted digital audio signals have correlation coefficients close to zero, indicating no resemblance, as shown in Table 1. This indicates that the proposed method is not vulnerable to statistical attacks. In addition, the entropy of the encrypted digital audio signals ranges from 15.7057 to 15.7117. Although these values are larger than those usually observed for original 16-bit audio signals, they are not very close to 16, which is attributed to the number of samples in the audio signals used in the experiments. The equivalent encrypted audio has an entropy of 15.9735, which is considerably close to 16. The same behavior is observed for all types of audio samples. This implies that the encrypted audio signals are close to a random source and that the proposed model is also secure against entropy attacks.
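The adjacent-sample correlation of Eqs. (25) and (26) can be sketched in plain Python. This is an illustration on synthetic data: a slowly varying sine stands in for plain audio and seeded pseudo-random values stand in for cipher audio. Neither is taken from the experiments, and the function names are hypothetical.

```python
import math
import random


def correlation(x, y):
    """Correlation coefficient r_xy following Eqs. (25) and (26)."""
    p = len(x)
    ex = sum(x) / p
    ey = sum(y) / p
    cov = sum((xi - ex) * (yi - ey) for xi, yi in zip(x, y)) / p
    dx = sum((xi - ex) ** 2 for xi in x) / p
    dy = sum((yi - ey) ** 2 for yi in y) / p
    return cov / math.sqrt(dx * dy)


def adjacent_correlation(samples):
    """Correlate each sample with its right-hand neighbour."""
    return correlation(samples[:-1], samples[1:])


# Smooth signal: neighbouring samples are nearly equal.
smooth = [math.sin(2 * math.pi * n / 200) for n in range(1000)]
# Noise-like signal: neighbouring samples are unrelated.
rng = random.Random(0)
noisy = [rng.randrange(65536) for _ in range(5000)]

r_smooth = adjacent_correlation(smooth)
r_noisy = adjacent_correlation(noisy)
```

The smooth signal gives a coefficient close to 1 and the noise-like signal a coefficient close to 0, which is the contrast the analysis looks for between plain and cipher audio.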

Table 1 Correlation analysis of recovered and ciphered audio w.r.t. original audio

4.4 MSE and PSNR analysis

MSE and PSNR are calculated for all the audio samples taken for analysis.

$$MSE=\frac{1}{{N*M}}\mathop \sum \limits_{{i=1}}^{N} \mathop \sum \limits_{{j=1}}^{M} {\left[ {f\left( {i,j} \right) - {f_0}(i,j)} \right]^2}$$
(27)

where f and f0 are the intensity functions of the decrypted and original sounds, (i, j) is the position of the data, and (N*M) is the size of the sound file. Table 2 shows the MSE of the recovered sound. The MSE of the decrypted sound with respect to the original audio is close to 0, which is desirable.

Table 2 MSE of decrypted audio concerning original

PSNR expresses the ratio between the maximum possible power of a signal and the power of the error (the mean square difference) between two audio files. The larger the PSNR, the better the quality of the recovered sound. PSNR values are tabulated in Table 3.

Table 3 PSNR of decrypted audio concerning the original audio
$$PSNR=20\,{\log _{10}}\frac{{255}}{{\sqrt {MSE} }}$$
(28)
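MSE and PSNR can be sketched in Python as follows. The sketch treats the audio as a one-dimensional list of samples and uses the common definition \(PSNR=10\,{\log _{10}}(pea{k^2}/MSE)\), with a peak value of 255 assumed to match the 8-bit segments used here. The function names are hypothetical.

```python
import math


def mse(original, decrypted):
    """Mean square error between two equally long sample sequences."""
    if len(original) != len(decrypted):
        raise ValueError("sequences must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, decrypted)) / len(original)


def psnr(original, decrypted, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for identical signals."""
    err = mse(original, decrypted)
    if err == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / err)


reference = [12, 200, 34, 90]   # illustrative 8-bit samples
recovered = [12, 201, 33, 90]   # recovered signal with two small errors
error = mse(reference, recovered)
quality = psnr(reference, recovered)
```

An MSE of exactly zero corresponds to a lossless recovery and an unbounded PSNR, so in practice PSNR is reported only when the two signals differ.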

4.5 Power spectrum analysis

The power spectral density (PSD) is the distribution of power per unit frequency. Here the PSD of discrete-time audio signals is estimated from their spectrum, with the PSD generalized to discrete-time variables. Signals are sampled at discrete time intervals \({x_n}=x(n\Delta t)\) for a total measurement period of \(T=N\Delta t.\) Figures 10, 11 and 12 show the PSD of the original sound, the encrypted sound, and the decrypted sound for the different categories of audio signals. From the figures, it can be inferred that the PSDs of the original and encrypted sounds differ greatly, whereas the PSDs of the original and decrypted sounds remain the same.

Fig. 10 a Dog barking sound, b encrypted sound, c decrypted sound

Fig. 11 a Human voice (hello), b encrypted voice, c decrypted voice

Fig. 12 a Music instrument sound (piano), b encrypted music, c decrypted music

4.6 Keyspace analysis

The SEED encryption and decryption algorithms are based on DWT and ECC. ECC provides the same level of security as RSA with a much shorter key length. Because ECC is based on the elliptic curve discrete logarithm problem (DLP), a brute force attack is computationally infeasible. The key size is chosen so that it suits RTP applications, which do not tolerate delays, without compromising the security level, and the algorithm has been designed to scale up to larger key sizes. To preserve privacy, a random seed is used, as practiced in Diffie-Hellman key exchange. To break the cryptosystem, an adversary has to know the random seed, which is not available to the adversary in the proposed system. The elliptic curve chosen for the sound encryption is \({E_{2147483647}}\,\,( {0,390064447} )\) and the generator point is \((1027045486,1393612238)\).

4.7 Key sensitivity analysis

The proposed algorithm is highly sensitive to the key: even a one-bit change in the decryption key produces noisy audio and makes the original irrecoverable. The proposed cryptosystem is fully based on a random key. The use of random keys yields a different cipher audio each time for a given clear audio, making known-plaintext and chosen-plaintext attacks ineffective. A key sensitivity test was conducted by changing the initial parameters used for decryption, which resulted in a completely different audio output.

4.8 Robustness to differential attacks

One sample is selected at random to analyze the vulnerability of the proposed method to differential attack. The audio signal is modified by inverting the least significant bit (LSB) of that sample. The modified and original audio signals are encrypted using the same key and evaluated using the number of samples change rate (NSCR) and the unified average changing intensity (UACI), as given below:

$$NSCR=\frac{{\mathop \sum \nolimits_{i} {D_i}}}{L}*100\%$$
(29)
$$UACI=\frac{1}{L}\left[ {\mathop \sum \limits_{i} \frac{{\left| {{A_i} - A_{i}^{'}} \right|}}{{65535}}} \right]$$
(30)

A and A′ are the two encrypted audio signals whose corresponding plain audio signals differ by a single bit in one sample; the sample values at location i of A and A′ are denoted by \({A_i}\) and \(A_{i}^{'}\), respectively; L is the size of the audio vector; and \({D_i}\) is calculated by the rule,

$${D_i}=\left\{ {\begin{array}{*{20}{c}} {1,}&{{A_i}~ \ne A_{i}^{'}} \\ {0,}&{Otherwise} \end{array}} \right.$$
(31)

The benchmark value for NSCR is 100% and for UACI is 33.3%. The minimum, maximum, and average values of NSCR and UACI are calculated from the encryption of 100 different modified versions of each audio signal. The computed NSCR values are close to 98%, and the UACI values are close to 33%. The results are considerably close to the ideal values and are independent of the position of the modified sample.
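NSCR and UACI from Eqs. (29) to (31) can be sketched in Python as follows. The sketch assumes 16-bit sample values (maximum 65535, as in Eq. (30)) and reports UACI as a percentage so that it can be compared with the 33.3% benchmark. The function names and test vectors are hypothetical.

```python
def nscr(a, b):
    """Number of samples change rate in percent, Eqs. (29) and (31)."""
    if len(a) != len(b):
        raise ValueError("cipher signals must have the same length")
    changed = sum(1 for x, y in zip(a, b) if x != y)  # sum of D_i
    return 100.0 * changed / len(a)


def uaci(a, b, max_value=65535):
    """Unified average changing intensity in percent, Eq. (30) scaled by 100."""
    if len(a) != len(b):
        raise ValueError("cipher signals must have the same length")
    total = sum(abs(x - y) for x, y in zip(a, b))
    return 100.0 * total / (max_value * len(a))


cipher_a = [100, 2000, 30000, 65535]   # illustrative cipher samples
cipher_b = [100, 1000, 40000, 0]       # cipher of the modified plain audio
rate = nscr(cipher_a, cipher_b)
intensity = uaci(cipher_a, cipher_b)
```

In a differential test, `cipher_a` and `cipher_b` would be the cipher signals of the original audio and of the audio with one flipped LSB, both encrypted under the same key.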

5 Conclusion

Audio security ensures the secrecy, integrity, accessibility, and confidentiality of the audio signal. The multi-tier SEED model uses DWT to compress the audio signal, which suits Real Time Protocol (RTP) based applications such as VoIP, live audio streaming, and video conferencing. Digital audio encryption is adopted because it provides lower residual intelligibility and greater cryptanalytic strength. The application of ECC distinguishes this work, as ECC is well suited to digital encryption. Since a larger key size would be inappropriate for RTP and delay-sensitive applications, an optimal key size is chosen without compromising security. The SEED model provides faster encryption because it performs fixed-point operations that involve less computation time. The model is easy to implement in spite of its mathematical complexity and offers a high degree of flexibility, as the sample range can vary from 8 to 16 k. Various statistical analyses have been performed, and the results substantiate the higher level of security, indicating that the scheme is not vulnerable to statistical attacks and is hence well suited to multi-channel audio processing.