1 Introduction

Speech enhancement, which extracts the target speech from mixed signals, is one of the important issues in acoustic signal processing. A multi-channel approach using a microphone array is a typical way to enhance a speech signal: enhancement is achieved by exploiting the differences in phase and amplitude of the sound arriving at each microphone [1,2,3]. Although this is an attractive approach, it requires installing multiple microphones and estimating their positions. Moreover, when such a speech-enhancement system is mounted on machines such as robots and vehicles, background noise degrades its performance. To address this problem, we focus on sensor-fusion speech enhancement that combines image and audio signals. P. Duchnowski et al. [4] proposed a method that extracts speech cues from the speaker’s lips using image information and tracks the speaker’s face based on the extracted lips to assist speech enhancement. H. Kulkarni et al. [5] proposed a lip-reading method that automatically enhances speech for a given language data set by combining deep learning and lip reading.

Sensor fusion has thus attracted attention for a long time. However, speech enhancement in which image and audio signals are transmitted and received simultaneously has not been studied so far.

Recently, we proposed a speech enhancement technique based on cooperative transmission and reception [6]. In this method, the speaker transmits not only the audio signal through a loudspeaker but also information about the audio as an image signal through a display. The listener enhances the target acoustic signal from the image and audio signals received through the standard camera and microphone mounted on a smartphone or tablet. This transmission/reception-cooperative sensor-fusion framework makes it possible to implement speech enhancement that is not affected by the type of external noise, which was difficult with conventional speech enhancement techniques. Since the mask information is obtained as visual information, noise that does not overlap the target can be removed by the binary mask even for mixed speech that overlaps on the frequency axis. However, noise that overlaps the target on the time-frequency plane remains after the binary mask. In this paper, we focus on reducing this overlapping noise by using spectral subtraction, and we give experimental results that show the effectiveness of the proposed approach.

2 Problem Formulation

In this section, we describe the assumed situation and formulate the problem. Figure 1 shows an example of how the proposed method is used. It can be used to deliver a voice to an unspecified number of people, for example an election speech in a public space. The speaker not only produces the voice through a loudspeaker, but also transmits the mask information of the voice to the listener through a display. When the listener wants to hear the speaker’s voice, the listener points the camera of a smartphone or tablet at the display and captures the audio signal and the image information. With the proposed method, the listener can obtain the speaker’s voice with high quality even in a noisy situation. Let us consider a target signal s(t) and the ith noise ni(t). The mixed speech x1(t) acquired by the microphone is described as follows.

Fig. 1. Assumed usage scenario of the proposed method.

$$ x_{1} (t) = s(t) + \sum\nolimits_{i = 1}^{n} {n_{i} } (t) $$
(1)

Next, we define the mask information received as image information. Even if the audio information and the image information are sent at the same time, there is a time lag between the signals arriving at the listener side. Therefore, the mask signal X2(τ, ω) received by the listener can be expressed as follows:

$$ X_{2} (\tau ,\omega ) = M(\tau - \delta ,\omega ) $$
(2)

where \(\delta \) is the time delay.

Here, when ∆ is defined as the maximum delay between the sensors, the following inequality is satisfied.

$$ \left| \delta \right| \le \Delta $$
(3)

Considering that the audio signal received by the microphone is restored using the image signal received by the camera, the correlation between the two is expected to be maximal when the delay is compensated. Hence, the delay can be estimated as follows:

$$ \tilde{\delta } = \arg \;\max_{\left| \delta \right| \le \Delta } \sum\nolimits_{\tau ,\omega } {\left| {X_{2} (\tau + \delta ,\omega )X_{1} (\tau ,\omega )} \right|} $$
(4)

Speech enhancement is performed as follows using the estimated mask information.

$$ S(\tau ,\omega ) = \tilde{M}(\tau ,\omega )X_{1} (\tau ,\omega ), \quad \forall \tau ,\omega $$
(5)

where \( \tilde{M}(\tau ,\omega ) = X_{2} (\tau + \tilde{\delta },\omega ) \) is the mask information aligned using the estimated delay \( \tilde{\delta } \).
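To make the alignment step concrete, the following is a minimal NumPy sketch of Eqs. (4) and (5). It assumes the mixture spectrogram X1 and the received mask X2 are arrays of shape (frames, bins); the function names and array layout are our own illustration under those assumptions, not the original implementation.

```python
import numpy as np

def estimate_delay(X1, X2, max_delay):
    """Eq. (4): choose the frame shift that maximizes the correlation
    between the received mask X2 and the mixture spectrogram X1."""
    best_delta, best_score = 0, -np.inf
    for delta in range(-max_delay, max_delay + 1):
        # np.roll(X2, -delta, axis=0)[tau] == X2[tau + delta]; the frame
        # axis is 0, and wrap-around at the edges is ignored for brevity.
        shifted = np.roll(X2, -delta, axis=0)
        score = np.sum(np.abs(shifted * X1))
        if score > best_score:
            best_delta, best_score = delta, score
    return best_delta

def enhance_with_mask(X1, X2, delta):
    """Eq. (5): apply the delay-compensated binary mask to the mixture."""
    M_tilde = np.roll(X2, -delta, axis=0)  # mask aligned with the mixture
    return M_tilde * X1
```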

3 Proposed Approach

The binary-mask approach works well when the target signal and the noise are sparse with respect to each other in the time-frequency domain. However, if the conventional method is applied to sounds that do not satisfy this sparseness, noise remains in the overlapping parts. Therefore, we consider an improved method that combines a binary mask with the spectral subtraction method, as shown in Fig. 2. The spectral subtraction method removes noise by subtracting an estimate of the average noise power spectrum from the power spectrum of the mixed speech.

Let us suppose that the mixed signal x(t) can be written using the target signal s(t) and the noise n(t) as

$$ x(t) = s(t) + n(t) $$
(6)

The representation of the mixed signal in the time-frequency domain, obtained by the short-time Fourier transform, can be described as follows.

$$ X(\tau ,\omega ) = S(\tau ,\omega ) + N(\tau ,\omega ) $$
(7)

where X(τ, ω), S(τ, ω), and N(τ, ω) are the complex spectra of x(t), s(t), and n(t), respectively, and τ and ω are the time-frame index and the angular frequency, respectively.

In the spectral subtraction method, it is assumed that the signal and the noise are uncorrelated, and the power spectrum of the mixture is approximated as follows.

$$ \begin{aligned} \left| {X(\tau ,\omega )} \right|^{2} &= \left| {S(\tau ,\omega ) + N(\tau ,\omega )} \right|^{2} \\ &= \left| {S(\tau ,\omega )} \right|^{2} + S(\tau ,\omega )N^{*} (\tau ,\omega ) + S^{*} (\tau ,\omega )N(\tau ,\omega ) + \left| {N(\tau ,\omega )} \right|^{2} \\ &\approx \left| {S(\tau ,\omega )} \right|^{2} + \left| {N(\tau ,\omega )} \right|^{2} \end{aligned} $$
(8)
where * denotes the complex conjugate. In the spectral subtraction method, we exploit Eq. (8) and enhance the target signal by subtracting the estimated noise from the mixed sound. In addition, general spectral subtraction assumes that the noise is stationary. Let the estimated speech spectrum after noise removal be \( \tilde{S} \)(τ, ω) and the average power spectrum of the estimated noise be \( \tilde{N} \)(ω). Then \( \tilde{S} \)(τ, ω) can be estimated as follows:

$$ \left| {\tilde{S}(\tau ,\omega )} \right|^{2} = \left| {X(\tau ,\omega )} \right|^{2} - \left| {\tilde{N}(\omega )} \right|^{2} $$
(9)

Figure 2 shows the basic concept of the proposed approach for the case in which noise remains because the target signal and the noise overlap in the time-frequency domain. Figure 3 shows the procedure of the proposed approach for estimating the noise. Since the mask information is acquired as image information in this method, the noise is estimated by averaging over the points where the mask is 0, as shown in Fig. 3:

Fig. 2. Noise remains when the target signal and the noise overlap in the time-frequency domain. We apply spectral subtraction to reduce the remaining noise and extract the target signal.

Fig. 3. Procedure of the proposed approach for estimating the noise. The noise is estimated by averaging the noise parts based on the mask information.

$$ \left| {\tilde{N}(\omega )} \right|^{2} = \frac{1}{T(\omega )}\sum\nolimits_{\tau :\,M(\tau ,\omega ) = 0} {\left| {X(\tau ,\omega )} \right|^{2} } $$
(10)

Here, \( \sum\nolimits_{\tau :\,M(\tau ,\omega ) = 0} \) denotes summation in the τ direction over the points where M(τ, ω) = 0, and T(ω) denotes the number of such points for each ω. Figure 4 shows the procedure of the proposed method for enhancing the target signal. To recover the waveform, not only the power of the signal but also its phase information is needed. Since it is difficult to recover the phase of the target signal itself from the mixed sound, the phase of the mixed sound is commonly used instead.

Fig. 4. Procedure of the proposed approach for enhancing the speech. The speech is enhanced by subtracting the estimated noise from the mixed speech.
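To summarize Eqs. (9) and (10) together with the phase handling above, the following is a minimal NumPy sketch of the noise estimation and subtraction steps. The array layout (frames × frequency bins), the guard against frequency bins with no masked frames, and the flooring of negative powers at zero are our own assumptions for illustration; they are not specified in the text.

```python
import numpy as np

def mask_based_spectral_subtraction(X, M):
    """Estimate the average noise power from the points where the mask is 0
    (Eq. (10)), subtract it from the mixture power (Eq. (9)), and rebuild
    the complex spectrum by reusing the phase of the mixed sound."""
    power = np.abs(X) ** 2                        # |X(tau, omega)|^2, shape (frames, bins)
    noise_points = (M == 0)                       # points where M(tau, omega) = 0
    T = np.maximum(noise_points.sum(axis=0), 1)   # T(omega); guard against empty bins
    noise_power = (power * noise_points).sum(axis=0) / T  # |N~(omega)|^2, Eq. (10)
    clean_power = np.maximum(power - noise_power, 0.0)    # Eq. (9), floored at zero
    return np.sqrt(clean_power) * np.exp(1j * np.angle(X))  # mixture phase reused
```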

4 Experiments

In this section, we describe the experimental conditions and results. The target voice was a female voice from the ATR newspaper-reading speech database. All programs were written in Python using Visual Studio 2017 Community. White noise and pink noise were used as noise, at three levels: 0, −10, and −20 dB.
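As a reference for reproducing the mixing step, the sketch below scales the noise to a target level before adding it to the target voice. We assume here that the 0, −10, and −20 dB levels denote the signal-to-noise ratio of the mixture; the text does not state this explicitly, and the function name is illustrative.

```python
import numpy as np

def mix_at_level(speech, noise, level_db):
    """Scale the noise so that the mixture has the requested
    signal-to-noise ratio, then add it to the speech waveform."""
    noise = noise[:len(speech)]           # match lengths
    p_speech = np.mean(speech ** 2)       # average speech power
    p_noise = np.mean(noise ** 2)         # average noise power
    scale = np.sqrt(p_speech / (p_noise * 10 ** (level_db / 10.0)))
    return speech + scale * noise
```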

Table 1 shows the experimental conditions. We set the threshold used to create the binary mask from −100 dB to −30 dB in 10 dB steps. The signal-to-distortion ratio (SDR) was used to evaluate the waveform error between the target speech and the speech after noise removal [7]. In general, the larger the SDR value, the smaller the distortion of the signal. Let S(τ, ω) be the complex amplitude of the target speech in the time-frequency domain, and let \( \tilde{S} \)(τ, ω) be the spectrum of the signal after speech enhancement.

Table 1. Experimental conditions.
$$ SDR = 10\,\log_{10} \left( {\frac{{\sum\nolimits_{\tau ,\omega } {\left| {S(\tau ,\omega )} \right|^{2} } }}{{\sum\nolimits_{\tau ,\omega } {\left( {\left| {S(\tau ,\omega )} \right| - \lambda \left| {\tilde{S}(\tau ,\omega )} \right|} \right)^{2} } }}} \right) $$
(11)

where λ is defined as follows:

$$ \lambda = \sqrt {\frac{{\sum\nolimits_{\tau ,\omega } {\left| {S(\tau ,\omega )} \right|^{2} } }}{{\sum\nolimits_{\tau ,\omega } {\left| {\tilde{S}(\tau ,\omega )} \right|^{2} } }}} $$
(12)
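A direct NumPy transcription of Eqs. (11) and (12) is shown below. S_mag and S_hat_mag stand for the magnitude spectrograms |S(τ, ω)| and |\( \tilde{S} \)(τ, ω)| as 2-D arrays; the names and array layout are our own illustration.

```python
import numpy as np

def sdr(S_mag, S_hat_mag):
    """Signal-to-distortion ratio of Eqs. (11)-(12) between the target
    magnitude spectrogram and the enhanced one."""
    lam = np.sqrt(np.sum(S_mag ** 2) / np.sum(S_hat_mag ** 2))  # Eq. (12)
    distortion = np.sum((S_mag - lam * S_hat_mag) ** 2)         # denominator of Eq. (11)
    return 10.0 * np.log10(np.sum(S_mag ** 2) / distortion)
```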

Tables 2, 3, and 4 show the experimental results when white noise at −20 dB, −10 dB, and 0 dB, respectively, was used. Tables 5, 6, and 7 show the corresponding results for pink noise. In each table, the maximum value among all tested thresholds is set in bold. From Tables 2, 3, 4, 5, 6, and 7, it can be confirmed that the proposed method greatly increases the signal-to-distortion ratio in all cases, especially when the threshold is low. When the threshold is high, the proposed method maintains the accuracy of the conventional method, while it substantially improves the accuracy when the threshold is low. The SDR improvement at low thresholds is presumably because a low threshold leaves many noise components in the masked signal, which the subsequent spectral subtraction can then remove.

Table 2. Experimental results when white noise (−20 dB) was used.
Table 3. Experimental results when white noise (−10 dB) was used.
Table 4. Experimental results when white noise (0 dB) was used.
Table 5. Experimental results when pink noise (−20 dB) was used.
Table 6. Experimental results when pink noise (−10 dB) was used.
Table 7. Experimental results when pink noise (0 dB) was used.

5 Conclusion

In this research, we proposed a vision-referential speech enhancement method that combines a binary mask with spectral subtraction. To verify the proposed method, we prepared white noise and pink noise at different noise levels and evaluated its effectiveness. The experimental results show that, compared with the previous method, the proposed method can effectively remove noise that is not sparse with respect to the target signal. Since the proposed method is not effective against fluctuating noise, future work includes a method that can cope with fluctuating noise, as well as real-time processing.