1 Introduction

With the development of cloud and mobile technology, large numbers of audio files, such as voice messages and meeting records, are now stored in the cloud, and audio privacy is a growing concern. Unlike text files, audio files are usually large, which makes efficiency a central requirement of encryption technology [7]. Classical encryption algorithms, such as the Data Encryption Standard (DES), the Advanced Encryption Standard (AES), and RSA, take a long time to encrypt or decrypt audio files and are difficult to apply in practice [20]. Chaotic systems [4, 8] have gained popularity for efficient encryption because of their sensitivity to initial conditions and parameters, topological transitivity, ergodicity, etc. To improve security, several chaotic maps have been combined into composite chaotic systems. For the same reason, hyper chaotic systems with more than one positive Lyapunov exponent have been proposed. Although chaotic systems offer high efficiency and good encryption quality, it is hard to interact safely and correctly with audio files encrypted this way in the cloud. Operations cannot be executed correctly on the encrypted data, while decrypting in the cloud may leak information to the provider. For safety, users have to download and decrypt audio files before modifying them, which increases the cost of network transmission.

For that reason, homomorphic encryption has been on the rise in recent years, since it can keep cloud service providers from obtaining sensitive information [14]. With homomorphic encryption, decrypting the result of a computation on encrypted data yields the same value as performing that computation on the original data. Some cloud computations can therefore be carried out without exposing private data, which gives the technique a wide range of applications in cloud computing, blockchain, etc. [10]. However, existing homomorphic encryption schemes suffer from large data expansion and high computational complexity. They are difficult to apply to audio, so most research focuses on core data such as audio features [27]. Users, however, often want to interact with their audio files stored in the cloud, for example through playback, adding or deleting parts of the audio, or volume adjustment. A homomorphic audio encryption algorithm with small data expansion and high efficiency would therefore be very useful in practice. In addition, audio files come in a variety of formats, with different attributes such as the number of channels and the maximum data size of each sample. Encryption and interactive operations must adapt to these differences automatically.

The contributions of this paper are as follows:

  1) We propose a new homomorphic audio encryption scheme with lower complexity and data expansion, which supports both additive and multiplicative homomorphism under certain conditions.

  2) We design an intelligent encryption algorithm that adapts to various common audio formats. Suitable encryption parameters are generated automatically, and the encrypted data can be reconverted to an audio file for different kinds of interaction.

  3) We provide a solution for secure interaction with encrypted audio, in which some operations can be carried out without decryption.

The rest of this paper is laid out as follows. Section 2 discusses research related to audio encryption and homomorphic encryption. Section 3 describes the proposed homomorphic encryption scheme, the intelligent encryption algorithm, and the method of interactive operation on encrypted audio. Section 4 gives details of the experimental analysis. Section 5 concludes the paper and outlines future work.

2 Related work

In [18], a modified Generalized Feistel Network (non-chaotic) and a generalized modified tent map (chaotic) were used separately to encrypt signals. Both methods had similar encryption quality and time consumption, although the chaotic algorithm decrypted much more quickly. To improve efficiency, Kalpana et al. [11] globally synchronized two chaotic systems (one for encryption and one for decryption). Because low-dimensional chaotic systems can be predicted in the short term, multiple chaotic systems were proposed to improve security. In [12], for example, four types of chaotic maps combined with SHA-1 were used to encrypt the signal. Hyper chaotic systems were then adopted for their higher sensitivity to initial values and larger key space. Farsana et al. [6] introduced non-orthogonal quantum states with an improved 4-D chaotic system to achieve higher security. Likewise, multiple hyper chaotic systems have been presented. Sathiyamurthi et al. [16] permuted the real and imaginary values of the signal with those of a reference sample using a 3D Lorenz-Logistic map. They later presented a hybrid-hyper chaotic system to further improve the security level [17].

Although chaotic systems allow the encrypted audio signal to be reconverted to an audio format, it is hard to perform calculations on it without decryption, whereas homomorphic encryption algorithms make this possible [1]. The classical RSA algorithm supports multiplicative homomorphism, but it is not suitable for handling large amounts of data. The ElGamal encryption algorithm also supports multiplicative homomorphism, and some researchers have applied it to audio signal encryption with improved efficiency [9, 13]. Unlike RSA, the Paillier homomorphic cryptosystem can directly encrypt decimal integers, which makes it more efficient, and it supports additive homomorphism [25]. Audio signal encryption schemes based on probability and matrices have also been developed; the homomorphic cryptosystem proposed in [19] has lower computational complexity and less data expansion, so that it can run on smartphones. However, it still took more than 20 seconds to encrypt about 5 seconds of audio, which makes it unsuitable for long recordings. Furthermore, because it supports only additive homomorphism, it is of limited use in audio applications. Van Dijk, Gentry et al. [3] proposed fully homomorphic encryption over the integers, which in theory can evaluate an unlimited number of multiplications and additions through bootstrapping. However, this scheme must convert each integer to binary before encryption, and its high complexity and huge data expansion have kept it theoretical. Later, the BGV scheme replaced the earlier bootstrapping with a modulus-switching technique, which lowers the complexity and makes the approach more practical. Multithreading has been used to improve the efficiency of the DGHV fully homomorphic encryption scheme [26]. In [28], VGGVox outputs were encrypted with the BFV scheme (TFHE and SEAL libraries) and then used directly in a speaker identification system. Thaine et al. [21] also applied BFV (PALISADE library) to encrypt speech signals and extracted MFCC and BFCC features from them for speech recognition. The CKKS encryption scheme can process real numbers by scaling them up and rounding to the nearest integer, and it is used in deep learning. In [2], spectrograms of speech were encrypted with CKKS (SEAL library) and then fed to a convolutional neural network to identify speakers. Although these newer homomorphic encryption schemes have improved performance and reduced data expansion through batch processing and other methods, they are not yet suitable for audio files. Current research therefore concentrates on encrypting core data such as audio features, with the exception of the work in [21]. Meanwhile, a lightweight solution is needed to process encrypted audio stored in the cloud, and to the best of our knowledge no such research has been reported.

In summary, chaotic encryption is more efficient, but its practical applications are limited. Homomorphic encryption allows some calculations on encrypted data, but its poor efficiency and huge data expansion increase storage space and transmission time. These problems are especially hard to solve for the large data volumes of long audio recordings. In addition, encryption and calculation require different parameters for different audio formats and property values, a problem that has received little attention in related research so far. This paper therefore proposes an intelligent homomorphic audio encryption method for secure interaction. It encrypts and decrypts adaptively according to the format and attributes of the audio, achieving good efficiency and low data expansion while keeping interactive operations secure.

3 Proposed algorithm

3.1 Homomorphic encryption scheme based on decimal integers

Our scheme is inspired by the binary symmetric encryption scheme in [3]. That scheme achieves fully homomorphic encryption through bootstrapping and can be turned into an asymmetric cryptosystem by generating a public key set. Its security relies on the approximate GCD problem. However, its data expansion is so large that the size of an encrypted audio file would be unacceptable.

The proposed scheme η has three key parts:

KeyGen: J, K are keys, chosen randomly from positive integers.

Encrypt (J, K, m): Set c = m + J × d + K × a × b, where the message m is a decimal integer, m ∈ N, c ∈ N, and a, b, d are random positive integers.

Decrypt (J, K, c): (c mod J) mod K.

Theorem 1

The above scheme η is correct, if K > m, and K × a × b + m < J/2.

Proof

From the Decrypt (J, K, c),

$${\displaystyle \begin{array}{l}\begin{array}{l}\begin{array}{l}c\ \mathit{\operatorname{mod}}\ \textrm{J}\\ {}=\left(m+\textrm{J}\times \textrm{d}+\textrm{K}\times \textrm{a}\times \textrm{b}\right) \operatorname {mod}\ \textrm{J}\end{array}\\ {}=\left(m \operatorname {mod}\ \textrm{J}+\left(\textrm{K}\times \textrm{a}\times \textrm{b}\right) \operatorname {mod}\ \textrm{J}\right) \operatorname {mod}\ \textrm{J}\\ {}\because\; \textrm{K}\times \textrm{a}\times \textrm{b}+m<\textrm{J}/2\end{array}\\ {}\therefore\; \textrm{Above}\\ {}\begin{array}{l}=\left(m+\textrm{K}\times \textrm{a}\times \textrm{b}\right) \operatorname {mod}\ \textrm{J}\\ {}=m+\textrm{K}\times \textrm{a}\times \textrm{b}\end{array}\end{array}}$$

Then

$${\displaystyle \begin{array}{l}\begin{array}{l}\begin{array}{l}\left(c \operatorname {mod}\ \textrm{J}\right) \operatorname {mod}\ \textrm{K}\\ {}=\left(m+\textrm{K}\times \textrm{a}\times \textrm{b}\right) \operatorname {mod}\ \textrm{K}\end{array}\\ {}=\left(m \operatorname {mod}\ \textrm{K}+\left(\textrm{K}\times \textrm{a}\times \textrm{b}\right) \operatorname {mod}\ \textrm{K}\right) \operatorname {mod}\ \textrm{K}\\ {}\because\; \textrm{K}>m\end{array}\\ {}\therefore\; \textrm{Above}\\ {}\begin{array}{l}=m \operatorname {mod}\ \textrm{K}\\ {}=m\end{array}\end{array}}$$

When the keys K and J meet the given conditions, the decryption is successful.
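As an illustration of Theorem 1, the three parts of the scheme can be sketched in a few lines of Python. The function names and the toy values of J, K, a, b, and d below are assumptions chosen only to satisfy K > m and K × a × b + m < J/2; they are far too small to offer any security.

```python
def encrypt(J, K, m, a, b, d):
    """Encrypt a non-negative integer m as c = m + J*d + K*a*b."""
    return m + J * d + K * a * b


def decrypt(J, K, c):
    """Recover m as (c mod J) mod K."""
    return (c % J) % K


# Toy parameters (illustrative assumptions, not secure key sizes).
K = 1021
J = 10_000_019
m, a, b, d = 200, 13, 7, 12345

# Conditions of Theorem 1.
assert K > m
assert K * a * b + m < J / 2

c = encrypt(J, K, m, a, b, d)
recovered = decrypt(J, K, c)
print(recovered)  # 200
```

The term J × d vanishes in the first reduction and K × a × b vanishes in the second, which is why only m survives.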

Theorem 2

Let c1 be the result of encrypting message m1, c1 = m1 + J × d1 + K × a1 × b1, and let c2 be the result of encrypting message m2, c2 = m2 + J × d2 + K × a2 × b2. Then the scheme η is additively homomorphic if m1 + m2 < K and the keys are chosen as required by Theorem 1.

Proof

$${\displaystyle \begin{array}{l}\begin{array}{l}\begin{array}{l}\begin{array}{l}{c}_1+{c}_2=\left({m}_1+{m}_2\right)+\textrm{J}\times \left({\textrm{d}}_1+{\textrm{d}}_2\right)+\textrm{K}\times \left({\textrm{a}}_1\times {\textrm{b}}_1+{\textrm{a}}_2\times {\textrm{b}}_2\right)\\ {}\textrm{Decrypt}\ \left(\textrm{J},\textrm{K},{c}_1+{c}_2\right)\end{array}\\ {}=\left(\left({m}_1+{m}_2\right) \operatorname {mod}\ \textrm{J}+\textrm{K}\times \left({\textrm{a}}_1\times {\textrm{b}}_1+{\textrm{a}}_2\times {\textrm{b}}_2\right) \operatorname {mod}\ \textrm{J}\right) \operatorname {mod}\ \textrm{J} \operatorname {mod}\ \textrm{K}\\ {}\because\; \textrm{K}\times \textrm{a}\times \textrm{b}+m<\textrm{J}/2\end{array}\\ {}\therefore\; \textrm{Above}\\ {}=\left(\left({m}_1+{m}_2\right)+\textrm{K}\times \left({\textrm{a}}_1\times {\textrm{b}}_1+{\textrm{a}}_2\times {\textrm{b}}_2\right)\right) \operatorname {mod}\ \textrm{K}\end{array}\\ {}\because\; {m}_1+{m}_2<\textrm{K}\\ {}\begin{array}{l}\textrm{Then}\ \textrm{Decrypt}\ \left(\textrm{J},\textrm{K},{c}_1+{c}_2\right)\\ {}={m}_1+{m}_2\end{array}\end{array}}$$

If the operand is a constant, the sum of message m and this constant must be smaller than the key K too.
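A small numerical check of Theorem 2 may help here. The keys and random numbers are hand-picked toy values (assumptions, not values from the experiments). The second case shows what happens when m1 + m2 is not smaller than K: the decrypted sum wraps around modulo K.

```python
J, K = 10_000_019, 1021  # toy keys, illustrative only


def enc(m, a, b, d):
    """c = m + J*d + K*a*b"""
    return m + J * d + K * a * b


def dec(c):
    """(c mod J) mod K"""
    return (c % J) % K


c1 = enc(300, 5, 9, 777)
c2 = enc(450, 11, 3, 4242)
c3 = enc(600, 2, 2, 1)

ok_sum = dec(c1 + c2)       # 300 + 450 = 750 < K, so the sum is recovered
wrapped_sum = dec(c3 + c2)  # 600 + 450 = 1050 >= K, so the result is 1050 mod K

print(ok_sum, wrapped_sum)  # 750 29
```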

Theorem 3

Let c1 be the result of encrypting message m1, c1 = m1 + J × d1 + K × a1 × b1, and let r be an integer. If m1 × r < K and K × a1 × b1 × r < J, with both keys chosen as required by Theorem 1, then the decrypted value of c1 × r equals m1 × r, and the scheme η is multiplicatively homomorphic.

Proof

$${\displaystyle \begin{array}{l}\begin{array}{l}\begin{array}{l}{c}_1\times r={m}_1\times r+\textrm J\times \textrm {d}_1\times r+\textrm{K}\times {\textrm{a}}_1\times {\textrm{b}}_1\times r\\ {}\textrm{Decrypt}\ \left(\textrm{J},\textrm{K},{c}_1\times r\right)\\ {}=\left(\left({m}_1\times r+\textrm{J}\times {\textrm{d}}_1\times r+\textrm{K}\times {\textrm{a}}_1\times {\textrm{b}}_1\times r\right) \operatorname {mod}\ \textrm{J}\right) \operatorname {mod}\ \textrm{K}\end{array}\\ {}=\left(\left({m}_1\times r\right) \operatorname {mod}\ \textrm{J}+\left(\textrm{K}\times {\textrm{a}}_1\times {\textrm{b}}_1\times r\right) \operatorname {mod}\ \textrm{J}\right) \operatorname {mod}\ \textrm{J} \operatorname {mod}\ \textrm{K}\\ {}\because\; {m}_1\times r<\textrm{K},\textrm{K}\times {\textrm{a}}_1\times {\textrm{b}}_1\times r<\textrm{J}\end{array}\\ {}\therefore\; \textrm{Above}\\ {}={m}_1\times r\end{array}}$$
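The two conditions of Theorem 3 can be tried out numerically. The sketch below uses toy keys (an assumption for illustration) and multiplies one ciphertext by several constants r; the helper name `conditions_hold` is hypothetical.

```python
J, K = 10_000_019, 1021  # toy keys, illustrative only
m1, a1, b1, d1 = 200, 13, 7, 12345
c1 = m1 + J * d1 + K * a1 * b1


def dec(c):
    """(c mod J) mod K"""
    return (c % J) % K


def conditions_hold(r):
    """Conditions of Theorem 3: m1*r < K and K*a1*b1*r < J."""
    return m1 * r < K and K * a1 * b1 * r < J


# Map each factor r to (conditions satisfied?, decrypted value of c1 * r).
results = {r: (conditions_hold(r), dec(c1 * r)) for r in (2, 4, 6)}
print(results)
# r = 2 and r = 4 satisfy both conditions and decrypt to 400 and 800;
# r = 6 violates m1*r < K, so decryption yields 1200 mod 1021 = 179.
```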

3.2 Adaptive intelligent audio encryption algorithm

Figure 1 illustrates the main process of the proposed encryption algorithm. Adaptive cryptographic parameters are generated to support multiple audio formats, and the encrypted data are then reconverted to audio files automatically.

Fig. 1 Processes of the proposed audio encryption algorithm

Definition 1

(Adaptive Cryptographic Parameters): For an audio file, let the maximum value of each sample be 2^s (usually s is 8, 16, or 32), and let the encryption level be l, where l ∈ N+. Then let the length of secret key K be s + 2 bits and the length of secret key J be (s × 4 × l − 2 − 7 × l) bits, and choose random numbers a, b, and d with 1 ≤ a, b ≤ 2^(4 × l) and 1 ≤ d ≤ 2^(7 × l). These parameters always guarantee the success of decryption.

Proof

The length of K is s + 2 bits, so 2^(s+1) < K < 2^(s+2); the maximum value of an audio sample is 2^s, so K is always larger than m. The other condition, from Theorem 1, is K × a × b + m < J/2, where K × a × b + m is no more than 2^(s + 8 × l + 2) and J/2 is 2^(s × 4 × l − 3 − 7 × l). The difference between the two exponents is s + 8 × l + 2 − (s × 4 × l − 3 − 7 × l) = (1 − 4 × l) × s + 15 × l + 5. According to the standards of audio files, the minimum value of s is 8, so this difference is negative (it is −4 for s = 8 and l = 1) and K × a × b + m is always less than J/2, even when l takes its minimum value of 1.
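A minimal sketch of how such parameters might be generated is given below. It follows the bit lengths and ranges of Definition 1, but the function names and the use of Python's `secrets` module are assumptions; the text only requires "any random number generator".

```python
import secrets


def random_with_bits(n_bits):
    """Random integer whose binary length is exactly n_bits (top bit set)."""
    return secrets.randbits(n_bits - 1) | (1 << (n_bits - 1))


def generate_keys(s, l):
    """Keys K (s + 2 bits) and J (4*s*l - 2 - 7*l bits), as in Definition 1."""
    K = random_with_bits(s + 2)
    J = random_with_bits(4 * s * l - 2 - 7 * l)
    return K, J


def draw_randoms(l):
    """Per-sample random numbers: 1 <= a, b <= 2^(4l) and 1 <= d <= 2^(7l)."""
    a = 1 + secrets.randbelow(2 ** (4 * l))
    b = 1 + secrets.randbelow(2 ** (4 * l))
    d = 1 + secrets.randbelow(2 ** (7 * l))
    return a, b, d


def exponent_gap(s, l):
    """(s + 8l + 2) - (4sl - 3 - 7l); a negative value means K*a*b + m < J/2."""
    return (1 - 4 * l) * s + 15 * l + 5


for s in (8, 16, 32):
    for l in (1, 2, 4):
        print(s, l, exponent_gap(s, l))  # every printed gap is negative
```

The gap shrinks further as s or l grows, so the smallest case s = 8, l = 1 (gap −4) is the one that matters for the proof.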

From the header information of an audio file, the format can be identified, such as ‘WAV’, ‘MP3’, ‘WMA’, etc. The related attributes are then obtained: h (the number of channels) and s (the maximum length of an audio sample, in bits). Let the total number of samples be n and the original audio sampling data be M = {m1, m2, ⋯, mn}, where −2^s ≤ mi ≤ 2^s − 1. Keys K and J can be generated adaptively by any random number generator under Definition 1.

The proposed encryption algorithm can be split into five specific steps:

Step 1: Each mi in M is split into two parts, sgni for the plus/minus sign and umi for the unsigned value, where sgni ∈ {+, −}, 0 ≤ umi ≤ 2^s and umi ∈ N. This restricts the modulo operation to non-negative numbers, because different programming languages handle negative operands in different ways.

Step 2: For each mi, generate new adaptive random numbers ai, bi, and di under Definition 1.

Step 3: The uci (encrypted data without sign) is computed as uci = umi + J × di + K × ai × bi.

Step 4: Set ci ← sgni + uci. If the length of ci is less than s × 4 × l − 1 bits, pad ci on the left with zeros. If the length of ci is more than 32 bits, divide ci into groups of 32 bits each, with the last group having 31 bits: ci ← {x1, x2, …, xj}, where j = (s × 4 × l)/32.

Step 5: Reconvert the encrypted data to WAV format to preserve quality, since MP3 and other formats would compress the data again. Write each xi to the file; if x1 is zero, xj is written with sgni. The maximum length of an audio sample in the encrypted file is 32 bits, and the remaining attributes are unchanged.
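Steps 1 to 4 can be sketched for a single sample as follows. This is only an illustration under stated assumptions: the keys and random numbers are fixed toy values with the bit lengths of Definition 1 for s = 16 and l = 1, the helper names are hypothetical, and the sign is kept as a separate value because the exact way it is stored in the WAV samples (Step 5) is not modelled here.

```python
def split_sign(m):
    """Step 1: separate the sign from the unsigned magnitude."""
    return (-1 if m < 0 else 1), abs(m)


def split_groups(uc, s, l):
    """Step 4: left-pad uc to 4*s*l - 1 bits and cut it into j groups
    (j - 1 groups of 32 bits followed by one group of 31 bits)."""
    j = (s * 4 * l) // 32
    total_bits = s * 4 * l - 1
    if uc.bit_length() > total_bits:
        raise ValueError("ciphertext does not fit in 4*s*l - 1 bits")
    bits = format(uc, "0{}b".format(total_bits))
    widths = [32] * (j - 1) + [31]
    groups, pos = [], 0
    for w in widths:
        groups.append(int(bits[pos:pos + w], 2))
        pos += w
    return groups


# Toy run for s = 16, l = 1 (so j = 2); all values are assumptions.
s, l = 16, 1
K = (1 << 17) + 12345        # 18 bits = s + 2
J = (1 << 54) + 987654321    # 55 bits = 4*s*l - 2 - 7*l
a, b, d = 9, 16, 100         # Step 2: within 1..2^4, 1..2^4, 1..2^7

sgn, um = split_sign(-1234)
uc = um + J * d + K * a * b  # Step 3
groups = split_groups(uc, s, l)
print(sgn, len(groups))  # -1 2
```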

The proposed decryption algorithm can fully recover the original audio signal without any additional record. It can be split into four steps:

Step 1: Read in an encrypted file, get h, l, s, and encrypted audio data C.

Step 2: Let j ← (s × 4 × l)/32. Every j samples form a group, and sgni is obtained according to whether the first value of each group is 0 or not.

Step 3: Fill each sample with zeros from the left to make sure its size is 32 bits (31 bits for the last one). Then recombine the data of each group to get uci.

Step 4: Let udi ← (uci mod J) mod K and di ← sgni + udi. Write di to the decrypted file. WAV format is preferred, but other formats can also be specified.
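The regrouping and decryption of one sample (Steps 3 and 4) might look like the sketch below. The toy keys and the ciphertext are assumptions for s = 16 and l = 1, the helper names are hypothetical, and the sign is passed in directly rather than read from the file as in Step 2.

```python
def join_groups(groups):
    """Step 3: rebuild uc from groups of 32 bits (31 bits for the last one)."""
    widths = [32] * (len(groups) - 1) + [31]
    uc = 0
    for g, w in zip(groups, widths):
        uc = (uc << w) | g
    return uc


def decrypt_sample(sgn, groups, J, K):
    """Step 4: ud = (uc mod J) mod K, then restore the sign."""
    ud = (join_groups(groups) % J) % K
    return sgn * ud


# Toy data for s = 16, l = 1 (j = 2); all values are illustrative assumptions.
K = (1 << 17) + 12345
J = (1 << 54) + 987654321
uc = 1234 + J * 100 + K * 9 * 16
groups = [uc >> 31, uc & ((1 << 31) - 1)]  # one 32-bit group, one 31-bit group

sample = decrypt_sample(-1, groups, J, K)
print(sample)  # -1234
```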

3.3 Secure interactive operation on encrypted audio

Theorem 4

According to the relationship between audio signals and volume, increasing the volume by approximately 6 dB means doubling the value of each sample. The adaptive parameters satisfy the conditions of multiplicative homomorphism for this operation under the supported audio formats.

Proof. By Theorem 3 there are two conditions: m × 2 < K and K × a × b × 2 < J. Since m is no more than 2^s, m × 2 is no more than 2^(s+1) and is therefore less than K. K × a × b × 2 is likewise smaller than J.
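The claim can be checked against the worst case allowed by Definition 1. The sketch below is an assumption-laden sanity check rather than part of the proof: it takes the largest possible K, a, b, and m together with the smallest possible J, and it tests the slightly stronger combined bound 2 × (K × a × b + m) < J.

```python
def doubling_is_safe(s, l):
    """Worst-case check that doubling a ciphertext still decrypts correctly."""
    m_max = 2 ** s
    k_min = 2 ** (s + 1) + 1              # the text states 2^(s+1) < K
    k_max = 2 ** (s + 2) - 1              # largest K with s + 2 bits
    j_min = 2 ** (4 * s * l - 3 - 7 * l)  # smallest J with 4sl - 2 - 7l bits
    ab_max = 2 ** (8 * l)                 # a and b are each at most 2^(4l)
    message_ok = 2 * m_max < k_min
    noise_ok = 2 * (k_max * ab_max + m_max) < j_min
    return message_ok and noise_ok


all_safe = all(doubling_is_safe(s, l) for s in (8, 16, 32) for l in range(1, 5))
print(all_safe)  # True
```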

Figure 2 shows the solution for editing encrypted audio in the cloud without decrypting or downloading it. The proposed method improves the efficiency and security of interaction between the user and the encrypted audio. The user sends a volume adjustment request to the server. On receiving it, the server reads the encrypted audio file to obtain each ci and multiplies it by 2 without decryption. By Theorem 4, Decrypt (ci × 2) = mi × 2.

Fig. 2 Sketch map of the interactive operation

Likewise, if the user wants to delete a clip (from the t1-th to the t2-th second), the server simply removes the encrypted data from position t1 × r × 2 × l to position t2 × r × 2 × l, where r is the number of samples per second. If the user wants to insert a new clip at the t-th second, the server adds the new encrypted data after the y-th datum, where y = t × r × 2 × l.
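The index arithmetic for these two edits can be sketched as list slicing. The factor 2 × l in the text matches the group size j = (s × 4 × l)/32 for s = 16; the sketch assumes that reading, a single channel, and an encrypted file represented as a flat Python list, and all function names are hypothetical.

```python
def group_size(s, l):
    """Number of 32-bit values written per original sample (2*l when s = 16)."""
    return (s * 4 * l) // 32


def clip_index(t, r, s, l):
    """Position in the encrypted data that corresponds to time t (seconds)."""
    return round(t * r) * group_size(s, l)


def delete_clip(enc, t1, t2, r, s, l):
    """Remove the encrypted data between t1 and t2 seconds."""
    start, end = clip_index(t1, r, s, l), clip_index(t2, r, s, l)
    return enc[:start] + enc[end:]


def insert_clip(enc, new_enc, t, r, s, l):
    """Insert already encrypted data at time t seconds."""
    y = clip_index(t, r, s, l)
    return enc[:y] + new_enc + enc[y:]


# Toy stream: 3 seconds at r = 4 samples per second, s = 16, l = 1.
enc = list(range(24))
shorter = delete_clip(enc, 1, 2, 4, 16, 1)
longer = insert_clip(enc, [100, 101], 1, 4, 16, 1)
print(len(shorter), len(longer))  # 16 26
```

Because whole groups are removed or inserted, the remaining groups still decrypt sample by sample without any change to the keys.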

Moreover, the modified data can be sent back to the user, who can play the audio while decrypting it. Other signal processing operations based on addition, subtraction, and multiplication can also be introduced.

4 Security analysis and performance analysis

For the tests, we use audio files from the THCHS30 and TIMIT databases. All of them are sampled at 16 kHz with a bit rate of 256 kbps. All tests are programmed in Python 3.6 on a PC with a 2.30 GHz CPU and 4.00 GB of main memory.

4.1 Statistical analysis

4.1.1 Histograms and residual intelligibility

Figures 3 and 4 show the waveforms, spectrograms, and histograms of one test file and of its encrypted and decrypted versions, for different formats and encryption levels.

Fig. 3 The waveforms, spectrograms, and histograms for the format WAV. a The original audio, b Encrypted audio (l = 1), c Decrypted audio (l = 1), d Encrypted audio (l = 4), e Decrypted audio (l = 4)

Fig. 4 The waveforms, spectrograms, and histograms for the format MP3. a The original audio, b Encrypted audio (l = 1), c Decrypted audio (l = 1), d Encrypted audio (l = 4), e Decrypted audio (l = 4)

As shown in Figs. 3 and 4, the waveforms and spectrograms change completely and retain no original features, even at encryption level 1. Because the residual intelligibility of the encrypted audio is poor, attackers cannot obtain any information about the original signal. In the histogram, the encrypted data are distributed more evenly, which makes statistical analysis by attackers unlikely to succeed.

When the encryption level is raised to 4, the spectrogram of the encrypted audio signal looks even more like noise. Since only one sample in each group may carry the plus/minus sign when the encrypted data are reconverted, the histogram contains many more positive samples than negative ones. Within the positive and negative ranges separately, however, the data are evenly distributed.

In addition, the waveforms and spectrograms of the decrypted and original audio files are practically the same, which shows that the audio signals are fully recovered after decryption.

4.1.2 Correlation coefficients

The correlation coefficients [5] between the original and encrypted audio signals, and between the original and decrypted audio signals, of four test files are listed in Table 1. Note: Some of the encrypted audio signals have been truncated in the calculation so that both inputs to the formula have the same length.

Table 1 Correlation coefficients

It can be seen from Table 1 that the correlation coefficients between the encrypted audio and the original are close to zero. Even if attackers obtained some original data, they could not recover adjacent samples through correlation analysis. The correlation coefficients between the decrypted audio and the original are nearly 1, which again suggests that the decrypted audio has the same high quality as the original.
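For readers who want to reproduce this kind of measurement, a minimal sketch is given below. It assumes that the coefficient of [5] is the usual Pearson correlation coefficient, which this section does not spell out, and it truncates the longer signal as described in the note above. The function name is hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation of two sample sequences, truncated to equal length."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


same_shape = correlation([1, 2, 3, 4], [2, 4, 6, 8, 10])  # y is truncated to 4 values
opposite = correlation([1, 2, 3, 4], [4, 3, 2, 1])
print(same_shape, opposite)
```

Values near 1 indicate that one signal is almost a scaled copy of the other, and values near 0 indicate no linear relationship.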

4.1.3 Entropy

The entropy [22] of the four test audio files and the corresponding encrypted audio files are shown in Table 2.

Table 2 Entropy

Entropy is commonly used to measure the randomness and unpredictability of the distribution of audio samples. The entropy of the encrypted audio is higher than that of the original, and it increases significantly with the encryption level. The distribution of the signal after encryption is more random, which is enough to guard against statistical analysis attacks.
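As a rough illustration, the entropy of a sample sequence can be estimated from the relative frequencies of its values. The sketch assumes the Shannon entropy in bits; the exact definition used in [22] is not reproduced in this section, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of sample values."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


constant = shannon_entropy([0] * 8)     # a single value carries no uncertainty
two_values = shannon_entropy([1, 1, 2, 2])
four_values = shannon_entropy([0, 1, 2, 3])
print(constant, two_values, four_values)
```

The more evenly the values are spread, the higher the estimate, which is the behaviour the comparison in Table 2 relies on.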

4.2 Sensitivity analysis

4.2.1 Key space and security analysis

Our algorithm has two keys J and K as well as random numbers a, b, and d. The total key space is 2^(7 + 4 × l + s × 4 × l). If s = 16 and l = 4, the key space is as large as 2^279, which exceeds that of the traditional AES algorithms and can prevent brute force attacks. Meanwhile, the key space can be further enlarged by increasing the encryption level for the security of important audio.
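The exponent in this expression is easy to tabulate for different settings. The short sketch below only evaluates the formula stated above; the function name is hypothetical.

```python
def key_space_bits(s, l):
    """Exponent of the key space 2^(7 + 4*l + s*4*l)."""
    return 7 + 4 * l + s * 4 * l


# Exponents for 16-bit audio at encryption levels 1 to 4.
table = {l: key_space_bits(16, l) for l in (1, 2, 3, 4)}
print(table)  # {1: 75, 2: 143, 3: 211, 4: 279}
```

For comparison, AES keys are 128, 192, or 256 bits long, so the choice of encryption level determines where the scheme sits relative to those sizes.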

In addition, the proposed algorithm can effectively resist known/chosen plaintext attacks. Because encryption involves three random numbers, encrypting the same data with the same key produces different results each time.

The header information contains only general properties of the audio; only the length and the size of K can be obtained from it. Even if an attacker intercepts the encrypted file, it is hard to recover the original audio from the header information without the keys. Therefore, keeping the format information unencrypted does little harm to the security of the scheme.

4.2.2 Key sensitivity

There are two important indexes for key sensitivity, the Number of Samples Change Rate (NSCR) and the Unified Average Changing Intensity (UACI) [23] as follows:

$$NSCR=\frac{\sum_i D(i)}{q}\times 100\%,\quad D(i)=\begin{cases}1, & c(i)\ne c^{\prime}(i)\\ 0, & c(i)=c^{\prime}(i)\end{cases} \tag{1}$$

$$UACI=\frac{1}{q}\sum_i\frac{\bigl|\,|c(i)|-|c^{\prime}(i)|\,\bigr|}{2^{31}}\times 100\% \tag{2}$$

Here q is the number of samples, c(i) is the encrypted signal using the keys J and K, and c’(i) is the encrypted signal using the keys J’ (obtained by changing the least significant bit of J) and K. Unlike pixels, the values of the signal may be negative, so the absolute values of the signals are used in Eq. (2) to avoid interference.

NSCR gives the proportion of encrypted samples that differ when the least significant bit of the key is changed. UACI further quantifies how large the difference between the two encrypted audio files is. The results of four test files are given in Table 3.
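Both indexes are straightforward to compute from two encrypted sample sequences of equal length. The sketch below follows Eqs. (1) and (2), using the absolute sample values and the 2^31 normalisation for 32-bit samples; the function names and the tiny input lists are illustrative assumptions.

```python
def nscr(c, c_prime):
    """Percentage of positions where the two encrypted signals differ, Eq. (1)."""
    q = len(c)
    return 100.0 * sum(1 for x, y in zip(c, c_prime) if x != y) / q


def uaci(c, c_prime):
    """Average difference of absolute sample values relative to 2^31, Eq. (2)."""
    q = len(c)
    total = sum(abs(abs(x) - abs(y)) for x, y in zip(c, c_prime))
    return 100.0 * total / (q * 2 ** 31)


c = [0, 2 ** 31, -2 ** 30, 5]
c_prime = [0, 0, 2 ** 30, 5]
print(nscr(c, c_prime), uaci(c, c_prime))  # 50.0 25.0
```

The third pair differs only in sign, so it counts towards NSCR but contributes nothing to UACI.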

Table 3 NSCR and UACI

Following the measurement used in image encryption, the ideal values of NSCR and UACI are 99.6094% and 33.4635%, respectively. The closer the results are to the ideal values, the more sensitive the algorithm is to the key. Table 3 shows that the proposed encryption algorithm is sensitive to its keys, which indicates strong security.

4.3 Quality of encrypted audio

To measure the encryption quality, we calculate the signal-to-noise ratio (SNR), segmental signal-to-noise ratio (SegSNR), peak signal-to-noise ratio (PSNR) [15], and the mean squared error (MSE). Table 4 lists some of the results. In addition, many samples of the MP3 files are zero, and log10(0) would lead to an incorrect result (an infinite value) in the calculation of SegSNR. To avoid this, log10(0.1) is used instead. Some data from the encrypted audio are also truncated in the calculation to keep the same length as the original.

Table 4 SNR, SegSNR, PSNR and MSE

From the results, whether the encryption level is 1 or 4, the values of SNR, SegSNR, and PSNR are all less than zero, and the MSE values reach 10^17. The difference between the encrypted signal and the original is too large for any information to be detected, which means that the encryption quality of the proposed method is high.
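To make the sign of these values concrete, the sketch below computes MSE and SNR for a tiny pair of signals. It assumes the common textbook definitions (MSE as the mean squared difference, SNR as ten times the base-10 logarithm of signal energy over error energy); the exact formulas of [15], including SegSNR and PSNR, are not reproduced here, and the function names are hypothetical.

```python
import math


def mse(x, y):
    """Mean squared error between two signals, truncated to equal length."""
    n = min(len(x), len(y))
    return sum((a - b) ** 2 for a, b in zip(x[:n], y[:n])) / n


def snr_db(x, y):
    """SNR in dB: energy of x divided by energy of the difference x - y."""
    n = min(len(x), len(y))
    signal = sum(a * a for a in x[:n])
    noise = sum((a - b) ** 2 for a, b in zip(x[:n], y[:n]))
    return 10 * math.log10(signal / noise)


original = [3, 4]
distorted = [30, 40]  # differs strongly from the original
error = mse(original, distorted)
ratio = snr_db(original, distorted)
print(error, ratio)  # the SNR is negative because the error energy dominates
```

A negative SNR therefore means that the encrypted signal deviates from the original by more than the original's own energy.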

4.4 Comparison with the present works

Table 5 compares the proposed method with a multiple chaotic system [12], a multiple hyper chaotic system [16], a hybrid-hyper chaotic system [17], ElGamal encryption [9, 13], and multithreaded DGHV encryption [26]. The results are averages of the main indexes over 6 audio files from the TIMIT database. The proposed method performs better than the others, since all of its index values are less than zero. The absolute SNR of the proposed algorithm is the largest except for that of Ref. [26], and its absolute PSNR and SegSNR values are the largest. The correlation coefficient between the original and the encrypted audio is the smallest for the proposed algorithm, except for that of Ref. [13]. However, the decrypted audio in Ref. [13] does not maintain the same quality as the original (the correlation coefficient between the original and the decrypted audio is only 0.8006).

Table 5 Comparison of the main evaluation metrics

4.5 Complexity analysis

Overall, the proposed algorithm is simple and easy to implement: only one loop is needed to encrypt an audio file with n samples. From the encryption function C = m + J × d + K × a × b, the computational complexity is Ο (n).

$$T(n)=\sum_{i=1}^n\left({m}_i+J\times d+K\times a\times b\right)=\sum_{i=1}^n\textrm{O}\left({m}_i\right)=\textrm{O}(n)$$

The decryption function is (c mod J) mod K, and its computational complexity is Ο (n) too.

$$T(n)=\sum_{i=1}^n\left(\left(c_i \bmod J\right) \bmod K\right)=\mathrm{O}(n)$$

The computational complexity of encryption based on the Paillier cryptosystem [24] is over Ο (gn), where g is the public key. The computational complexity of decryption is over Ο (nλ), where λ is the private key.

The encryption \(\textbf{c}={\textbf{P}}_{\textbf{1}}\times \boldsymbol{m}\times {\textbf{P}}_{\textbf{2}}^{-1}\) and decryption \(\boldsymbol{m}={\textbf{P}}_1^{-1}\times \textbf{c}\times {\textbf{P}}_{\textbf{2}}\) are used in [19], where the matrices P1 and P2 are keys. The computational complexity of both is Ο (n^3).

We used twenty audio files from the THCHS30 database to test the time consumption of encryption and decryption. When the encryption level l = 1, it takes an average of 1.79 s to encrypt each audio file (7–8 seconds long), and encrypting 16,000 samples (about 1 s of audio) takes only 0.2215747 s. When l = 4, it takes an average of 2.51 s per file and 0.3112282 s per 16,000 samples. The algorithm used in [19] and the Paillier cryptosystem take around 50 s to encrypt an 8-second audio file, about 6.2348027 s per 16,000 samples, and their time consumption increases non-linearly with the amount of data. For decryption, when l = 1 (l = 4), our algorithm takes an average of 1.34 s (2.85 s) per audio file, whereas the algorithm used in [19] and the Paillier cryptosystem take around 40 s to decrypt an audio file of about 8 seconds.

Table 6 lists the comparison results of time consumption, which show that the proposed method encrypts and decrypts much more quickly than most other homomorphic encryption schemes and could process large data volumes in real time for cloud applications. Although ElGamal encryption [13] takes less time, it supports only multiplicative homomorphism.

Table 6 Comparison of time consumption and data expansion

In terms of data expansion, the proposed algorithm has the best performance. When l = 4, the data volume increases only 16-fold after encryption. In [19], the remainder of dividing the encrypted data by 2^15 is taken to reduce data expansion, but all of the quotients must be saved to reconstruct the original encrypted audio.

4.6 The simulation results of interactive operation on encrypted audio

Figure 5 shows the simulation results of amplifying the volume of the encrypted audio with the proposed algorithm. The decibel (dB) is a measure of amplitude; compared with Fig. 5a, the decibel values in Fig. 5b and 5c are significantly higher. The volume amplification can also be heard during playback. The waveforms in Fig. 5b and 5c are almost identical, which verifies the effectiveness of the proposed algorithm.

Fig. 5 The comparison of volume adjustment. a The original audio, b The decrypted audio using the proposed secure interactive operation, c Original audio after increasing 6 dB using an editor

Figure 6 gives some numerical examples of volume adjustment. The server performs the calculation on the encrypted data and cannot obtain any information during the process. The proposed solution is secure and efficient: it takes only 0.0679390 s to multiply the whole encrypted audio by 2 (7-second original audio, 124,800 samples).

Fig. 6 Numerical examples

If various audio filters were combined and the relevant calculations transformed into approximate additions and multiplications, more refined adjustments could be achieved. This is beyond the scope of this paper and is not discussed further.

Figure 7 shows the result of the proposed deleting operation on encrypted audio. The duration of the original audio is 7.80 s, and the clip to be deleted runs from 2.35 s to 5.75 s.

Fig. 7 Screenshots of the audio editor. a The original audio, b The modified decrypted audio using the proposed secure interactive operation

Many operations on audio, such as volume adjustment, involve multiplication, so an algorithm that supports only additive homomorphism is of little use for them. With the BFV scheme, it is difficult to reconstruct the encrypted data into audio files and hard to perform interactive operations efficiently.

5 Conclusions

This paper presents an intelligent homomorphic audio encryption method for secure interaction. It performs well on SNR, SegSNR, PSNR, and MSE, and can guard against attacks such as statistical analysis. Moreover, its lower data expansion and time complexity make real-time encryption and secure interaction possible. The proposed algorithm can handle the commonly used audio formats in both the original and compressed domains and can generate suitable encryption parameters adaptively. In addition, the proposed scheme supports homomorphic multiplication and addition and can carry out some interactive operations on encrypted audio files efficiently and safely. Users can also set the encryption level to balance security against time and space consumption.

For future work, we intend to reduce the time and space consumption and to design an asymmetric encryption system.