1 Introduction

Speech enhancement, which extracts the target speech from mixed signals, is one of the important issues in acoustic signal processing. A multi-channel approach using a microphone array is a typical way to enhance a speech signal: enhancement is achieved by exploiting the differences in phase and amplitude of the sound arriving at each microphone [1,2,3]. Although this is an attractive approach, it requires installing multiple microphones and estimating their positions. Moreover, when such a speech-enhancement system is mounted on machines such as robots and vehicles, background noise degrades its performance. To address this problem, we focus on sensor-fusion speech enhancement that combines image and audio signals. P. Duchnowski et al. [4] proposed a method that extracts speech cues from the speaker’s lips using image information and tracks the speaker’s face based on the extracted lips to assist speech enhancement. H. Kulkarni et al. [5] proposed a lip-reading method that automatically enhances speech for a given language data set by combining deep learning and lip reading.

Sensor fusion has thus attracted attention for a long time. However, speech enhancement in which image and audio signals are transmitted and received simultaneously has not been studied so far.

Recently, we proposed a speech enhancement technique based on cooperative transmission and reception [6]. In this method, the speaker transmits not only the audio signal through a loudspeaker but also information about the audio as an image signal through a display. The listener enhances the target acoustic signal from the image and audio signals received through the standard camera and microphone mounted on a smartphone or tablet. This transmission/reception-cooperative sensor-fusion framework makes it possible to implement speech enhancement that is not affected by the type of external noise, which was difficult with conventional speech enhancement techniques. Since the mask information is obtained as visual information, noise that does not overlap the target can be removed by the binary mask even for mixed speech that overlaps on the frequency axis. However, noise that overlaps the target on the time-frequency plane remains after the binary mask. In this paper, we focus on reducing this overlapping noise by using spectral subtraction, and we give experimental results that show the effectiveness of the proposed approach.

2 Problem Formulation

In this section, we describe the assumed situation and formulate the problem. Figure 1 shows an example of how the proposed method is used. It can be used to deliver a voice to an unspecified number of people, for example an election speech in a public space. The speaker not only produces the voice through a loudspeaker, but also transmits the mask information of the voice to the listener through a display. When the listener wants to hear the speaker’s voice, the listener points the camera of a smartphone or tablet at the display and captures the audio signal and the image information. With the proposed method, the listener can obtain the speaker’s voice with high quality even in a noisy situation. Let us consider a target signal s(t) and the ith noise ni(t). The mixed speech x1(t) acquired by the microphone is described as follows.

Fig. 1. Assumed usage scenario of the proposed method.

$$ x_{1} (t) = s(t) + \sum\nolimits_{i = 1}^{n} {n_{i} } (t) $$
(1)

Next, we define the mask information received as image information. Even if the audio information and the image information are sent at the same time, there is a time lag between the signals arriving at the listener side. Therefore, the mask signal X2(τ, ω) received by the listener can be expressed as follows:

$$ X_{2} (\tau ,\omega ) = M(\tau - \delta ,\omega ) $$
(2)

where \(\delta \) is the time delay.

Here, when ∆ is defined as the maximum delay between the sensors, the following inequality is satisfied.

$$ \left| \delta \right| \le \Delta $$
(3)

Considering that the audio signal received by the microphone is restored using the image signal received by the camera, the correlation between the two is expected to be maximal when the delay is compensated. Hence, the delay can be estimated as follows:

$$ \tilde{\delta } = \arg \;\max_{\left| \delta \right| \le \Delta } \sum\nolimits_{\tau ,\omega } {\left| {X_{2} (\tau + \delta ,\omega )X_{1} (\tau ,\omega )} \right|} $$
(4)

Speech enhancement is performed as follows using the estimated mask information.

$$ S(\tau ,\omega ) = \tilde{M}(\tau ,\omega )X_{1} (\tau ,\omega ), \quad \forall \tau ,\omega $$
(5)

where \( \tilde{M}(\tau ,\omega ) = X_{2} (\tau + \tilde{\delta },\omega ) \) is the mask information aligned using the estimated delay \( \tilde{\delta } \).
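To make the alignment step concrete, the following is a minimal NumPy sketch of Eqs. (4) and (5). It assumes the mixture spectrogram X1 and the received mask X2 are arrays of shape (frames, bins); the function names and array layout are our own illustration under those assumptions, not the original implementation.

```python
import numpy as np

def estimate_delay(X1, X2, max_delay):
    """Eq. (4): choose the frame shift that maximizes the correlation
    between the received mask X2 and the mixture spectrogram X1."""
    best_delta, best_score = 0, -np.inf
    for delta in range(-max_delay, max_delay + 1):
        # np.roll(X2, -delta, axis=0)[tau] == X2[tau + delta]; the frame
        # axis is 0, and wrap-around at the edges is ignored for brevity.
        shifted = np.roll(X2, -delta, axis=0)
        score = np.sum(np.abs(shifted * X1))
        if score > best_score:
            best_delta, best_score = delta, score
    return best_delta

def enhance_with_mask(X1, X2, delta):
    """Eq. (5): apply the delay-compensated binary mask to the mixture."""
    M_tilde = np.roll(X2, -delta, axis=0)  # mask aligned with the mixture
    return M_tilde * X1
```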

3 Proposed Approach

The binary-mask approach works well when the target signal and the noise are sparse with respect to each other in the time-frequency domain. However, if the conventional method is applied to sounds that do not satisfy this sparseness, noise remains in the overlapping parts. Therefore, we consider an improved method that combines a binary mask with the spectral subtraction method, as shown in Fig. 2. The spectral subtraction method removes noise by subtracting an estimate of the average noise power spectrum from the power spectrum of the mixed speech.

Let us suppose that the mixed signal x(t) can be written using the target signal s(t) and the noise n(t) as

$$ x(t) = s(t) + n(t) $$
(6)

The representation of the mixed signal in the time-frequency domain, obtained by the short-time Fourier transform, can be described as follows.

$$ X(\tau ,\omega ) = S(\tau ,\omega ) + N(\tau ,\omega ) $$
(7)

where X(τ, ω), S(τ, ω), and N(τ, ω) are the complex spectra of x(t), s(t), and n(t), respectively, and τ and ω are the time-frame index and the angular frequency, respectively.

In the spectral subtraction method, it is assumed that the signal and the noise are uncorrelated, and the power spectrum of the mixture is approximated as follows.

$$ \begin{aligned} \left| {X(\tau ,\omega )} \right|^{2} &= \left| {S(\tau ,\omega ) + N(\tau ,\omega )} \right|^{2} \\ &= \left| {S(\tau ,\omega )} \right|^{2} + S(\tau ,\omega )N^{*} (\tau ,\omega ) + S^{*} (\tau ,\omega )N(\tau ,\omega ) + \left| {N(\tau ,\omega )} \right|^{2} \\ &\approx \left| {S(\tau ,\omega )} \right|^{2} + \left| {N(\tau ,\omega )} \right|^{2} \end{aligned} $$
(8)
where * denotes the complex conjugate. In the spectral subtraction method, we exploit Eq. (8) and enhance the target signal by subtracting the estimated noise from the mixed sound. In addition, general spectral subtraction assumes that the noise is stationary. Let the estimated speech spectrum after noise removal be \( \tilde{S} \)(τ, ω) and the average power spectrum of the estimated noise be \( \tilde{N} \)(ω). Then \( \tilde{S} \)(τ, ω) can be estimated as follows:

$$ \left| {\tilde{S}(\tau ,\omega )} \right|^{2} = \left| {X(\tau ,\omega )} \right|^{2} - \left| {\tilde{N}(\omega )} \right|^{2} $$
(9)

Figure 2 shows the basic concept of the proposed approach for the case in which noise remains because the target signal and the noise overlap in the time-frequency domain. Figure 3 shows the procedure of the proposed approach for estimating the noise. Since the mask information is acquired as image information in this method, the noise is estimated by averaging over the points where the mask is 0, as shown in Fig. 3:

Fig. 2. Noise remains when the target signal and the noise overlap in the time-frequency domain. We apply spectral subtraction to reduce the remaining noise and extract the target signal.

Fig. 3. Procedure of the proposed approach for estimating the noise. The noise is estimated by averaging the noise parts based on the mask information.

$$ \left| {\tilde{N}(\omega )} \right|^{2} = \frac{1}{T(\omega )}\sum\nolimits_{\tau :\,M(\tau ,\omega ) = 0} {\left| {X(\tau ,\omega )} \right|^{2} } $$
(10)

Here, \( \sum\nolimits_{\tau :\,M(\tau ,\omega ) = 0} \) denotes summation in the τ direction over the points where M(τ, ω) = 0, and T(ω) denotes the number of such points for each ω. Figure 4 shows the procedure of the proposed method for enhancing the target signal. To recover the waveform, not only the power of the signal but also its phase information is needed. Since it is difficult to recover the phase of the target signal itself from the mixed sound, the phase of the mixed sound is commonly used instead.

Fig. 4. Procedure of the proposed approach for enhancing the speech. The speech is enhanced by subtracting the estimated noise from the mixed speech.
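To summarize Eqs. (9) and (10) together with the phase handling above, the following is a minimal NumPy sketch of the noise estimation and subtraction steps. The array layout (frames × frequency bins), the guard against frequency bins with no masked frames, and the flooring of negative powers at zero are our own assumptions for illustration; they are not specified in the text.

```python
import numpy as np

def mask_based_spectral_subtraction(X, M):
    """Estimate the average noise power from the points where the mask is 0
    (Eq. (10)), subtract it from the mixture power (Eq. (9)), and rebuild
    the complex spectrum by reusing the phase of the mixed sound."""
    power = np.abs(X) ** 2                        # |X(tau, omega)|^2, shape (frames, bins)
    noise_points = (M == 0)                       # points where M(tau, omega) = 0
    T = np.maximum(noise_points.sum(axis=0), 1)   # T(omega); guard against empty bins
    noise_power = (power * noise_points).sum(axis=0) / T  # |N~(omega)|^2, Eq. (10)
    clean_power = np.maximum(power - noise_power, 0.0)    # Eq. (9), floored at zero
    return np.sqrt(clean_power) * np.exp(1j * np.angle(X))  # mixture phase reused
```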

4 Experiments

In this section, we describe the experimental conditions and results. The target voice was a female voice from the ATR newspaper-reading speech database. All programs were written in Python using Visual Studio 2017 Community. White noise and pink noise were used as noise, at three levels: 0, −10, and −20 dB.
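As a reference for reproducing the mixing step, the sketch below scales the noise to a target level before adding it to the target voice. We assume here that the 0, −10, and −20 dB levels denote the signal-to-noise ratio of the mixture; the text does not state this explicitly, and the function name is illustrative.

```python
import numpy as np

def mix_at_level(speech, noise, level_db):
    """Scale the noise so that the mixture has the requested
    signal-to-noise ratio, then add it to the speech waveform."""
    noise = noise[:len(speech)]           # match lengths
    p_speech = np.mean(speech ** 2)       # average speech power
    p_noise = np.mean(noise ** 2)         # average noise power
    scale = np.sqrt(p_speech / (p_noise * 10 ** (level_db / 10.0)))
    return speech + scale * noise
```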

Table 1 shows the experimental conditions. We set the threshold used to create the binary mask from −100 dB to −30 dB in 10 dB steps. The signal-to-distortion ratio (SDR) was used to evaluate the waveform error between the target speech and the speech after noise removal [7]. In general, the larger the SDR value, the smaller the distortion of the signal. Let S(τ, ω) be the complex amplitude of the target speech in the time-frequency domain, and let \( \tilde{S} \)(τ, ω) be the spectrum of the signal after speech enhancement.

Table 1. Experimental conditions.
$$ SDR = 10\,\log_{10} \left( {\frac{{\sum\nolimits_{\tau ,\omega } {\left| {S(\tau ,\omega )} \right|^{2} } }}{{\sum\nolimits_{\tau ,\omega } {\left( {\left| {S(\tau ,\omega )} \right| - \lambda \left| {\tilde{S}(\tau ,\omega )} \right|} \right)^{2} } }}} \right) $$
(11)

where λ is defined as follows:

$$ \lambda = \sqrt {\frac{{\sum\nolimits_{\tau ,\omega } {\left| {S(\tau ,\omega )} \right|^{2} } }}{{\sum\nolimits_{\tau ,\omega } {\left| {\tilde{S}(\tau ,\omega )} \right|^{2} } }}} $$
(12)
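A direct NumPy transcription of Eqs. (11) and (12) is shown below. S_mag and S_hat_mag stand for the magnitude spectrograms |S(τ, ω)| and |\( \tilde{S} \)(τ, ω)| as 2-D arrays; the names and array layout are our own illustration.

```python
import numpy as np

def sdr(S_mag, S_hat_mag):
    """Signal-to-distortion ratio of Eqs. (11)-(12) between the target
    magnitude spectrogram and the enhanced one."""
    lam = np.sqrt(np.sum(S_mag ** 2) / np.sum(S_hat_mag ** 2))  # Eq. (12)
    distortion = np.sum((S_mag - lam * S_hat_mag) ** 2)         # denominator of Eq. (11)
    return 10.0 * np.log10(np.sum(S_mag ** 2) / distortion)
```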

Tables 2, 3, and 4 show the experimental results when white noise at −20 dB, −10 dB, and 0 dB, respectively, was used. Tables 5, 6, and 7 show the corresponding results for pink noise. In each table, the maximum value among all tested thresholds is set in bold. From Tables 2, 3, 4, 5, 6, and 7, it can be confirmed that the proposed method greatly increases the signal-to-distortion ratio in all cases, especially when the threshold is low. When the threshold is high, the proposed method maintains the accuracy of the conventional method, while it substantially improves the accuracy when the threshold is low. The SDR improvement at low thresholds is presumably because a low threshold leaves many noise components in the masked signal, which the subsequent spectral subtraction can then remove.

Table 2. Experimental results when white noise (−20 dB) was used.
Table 3. Experimental results when white noise (−10 dB) was used.
Table 4. Experimental results when white noise (0 dB) was used.
Table 5. Experimental results when pink noise (−20 dB) was used.
Table 6. Experimental results when pink noise (−10 dB) was used.
Table 7. Experimental results when pink noise (0 dB) was used.

5 Conclusion

In this research, we proposed a vision-referential speech enhancement method that combines a binary mask with spectral subtraction. To verify the proposed method, we prepared white noise and pink noise at different noise levels and evaluated its effectiveness. The experimental results show that, compared with the previous method, the proposed method can effectively remove noise that is not sparse with respect to the target signal. Since the proposed method is not effective against fluctuating noise, future work includes a method that can cope with fluctuating noise, as well as real-time processing.