
1 Introduction

In a noisy environment, the speech produced by a person is affected by the surrounding noise, which may be another person speaking, a passing car, or many other sounds. Many applications need a system that separates the target speaker's speech from the noisy mixture and enhances it. In cellular phone communication, for example, the voice is corrupted by the surrounding noise at the transmitter, and a speech enhancement system can be used at the receiving end to improve the quality of the speech.

In the speech separation process, the target speech signal is separated from an acoustic mixture, which may contain another speaker, environmental noise, or both. Speech separation is used in speech/speaker recognition, voice communication, air–ground communication, hearing aids, etc. Spectral subtraction, subspace analysis, hidden Markov modeling, and sinusoidal modeling are some of the methods proposed earlier for monaural speech separation; these approaches usually require prior knowledge of the noise signal. In the last few decades, researchers have developed monaural speech separation using an adaptive energy threshold [1] and image analysis techniques as in [2, 3]. Speech quality is improved using a genetic algorithm-based fusion scheme in [4], and speech intelligibility is improved by fusing voiced and unvoiced speech segments in [5]. All these methods use audio as the only input to improve the speech quality.

Visual cues, which reflect the voice activity of the target speaker, are used in this paper to produce an enhanced output of the target speaker's speech. Conventional speech enhancement techniques used to enhance a single (monaural) speaker's speech do not provide the expected result in such conditions.

Lip-reading in human-to-human communication depends on a number of factors [6]. The quality of the visual information plays a vital role: in poor lighting conditions, for example, it is hard to detect the shape of the mouth. It also becomes difficult to detect visual cues as the distance between the listener and the speaker increases.

Hence, the target person's speech activity is estimated by tracking the lip movement of the person using the Viola–Jones algorithm [7, 8] and the Kanade–Lucas–Tomasi (KLT) tracker [9], and it is compared with the original audio sample to ensure that the voice activity detected from the lip movement matches the onsets and offsets of the original audio.

The presence of voice activity can be detected from both the audio and the video stream. When the SNR is high, speech dominates the noise, and it is reliable to detect the presence of speech activity from the audio stream itself. But as the SNR decreases and noise begins to dominate the speech, detecting onsets and offsets from the audio stream becomes unreliable, since noisy segments may be treated as speech or vice versa. In such cases, it is advisable to detect onsets and offsets from the video stream, which is independent of the SNR of the signal. The onset and offset times in the video stream are detected by tracking the mouth of the target speaker. The detected times are then plotted, compared with those of the audio stream, and checked for one-to-one correspondence.

The remainder of this paper is organized as follows. Section 2 describes the proposed speech onset and offset detection using the video stream. The existing system for speech onset and offset detection using the audio stream is explained in Sect. 3. The experimental results are given in Sect. 4, and the conclusion is outlined in Sect. 5.

2 Speech Onset and Offset Detection Using Visual Stream

The main focus of this paper is to obtain a voice activity detection mask from visual cues (V-VAD) for target speech detection. The mask is a binary mask that specifies the absence or presence of voice activity in each frame. First, the video stream is split into a sequence of image frames. Second, the face is detected in a given frame and the mouth region within the detected face is enclosed in a bounding box. The first frame is processed using the Viola–Jones algorithm [7, 8, 10] to obtain the bounding box of the mouth region. The Viola–Jones algorithm uses Haar features to detect the face, and its four stages are: (a) choice of Haar features, (b) creation of the integral image, (c) AdaBoost training, and (d) cascading of classifiers.

2.1 Binary Mask Detection

The mouth region in the image is identified using the Viola–Jones algorithm, which uses a cascade of classifiers to detect the presence of the target object in the image frame. Every stage in the cascade rejects regions that do not contain the target object. As the sliding window travels over the image, multiple detections may be produced near the target object; these are merged into a single bounding box per target object. The bounding box is represented by its top-left corner coordinates together with its height and width. Figure 1 highlights the detected bounding box of the mouth region.

Fig. 1 Mouth detection of first frame
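
A minimal sketch of this detection step is given below, using OpenCV's Haar-cascade (Viola–Jones) implementation. The video file name, the cascade files, and the strategy of searching for the mouth only in the lower half of the detected face are illustrative assumptions, not details taken from the paper.

```python
# Sketch: locate the mouth bounding box in the first video frame with a
# Haar-cascade (Viola-Jones) detector. The video path is a placeholder and
# the detector parameters are assumed, not taken from the paper.
import cv2

video = cv2.VideoCapture("speaker.mpg")          # hypothetical input video
ok, frame = video.read()
assert ok, "could not read the first frame"
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mouth_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")   # used here as a mouth detector

faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
x, y, w, h = faces[0]                            # assume the speaker's face is found

# Search for the mouth only in the lower half of the face to reject false hits.
lower = gray[y + h // 2: y + h, x: x + w]
mouths = mouth_cascade.detectMultiScale(lower, scaleFactor=1.1, minNeighbors=20)
mx, my, mw, mh = mouths[0]

# Bounding box (a, b, w, h) of the mouth in full-frame coordinates,
# used for the two KLT feature points in Sect. 2.2.
a, b = x + mx, y + h // 2 + my
print("mouth bounding box:", (a, b, mw, mh))
```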

2.2 KLT Feature Tracking

The Kanade–Lucas–Tomasi (KLT) algorithm [9, 11, 12] is used to track the two feature points across the video frames. KLT tracking works in two simple steps: features to track are identified in the initial frame, and these features are then tracked through the remaining frames. Assume the first and the next images were taken at times t and t + τ, respectively, where τ is determined by the number of frames per second captured by the video camera. Let an image be represented as a function of the two spatial variables x and y, with the variable t added to represent the time at which the image was captured. Any point in the second image is then given by the function f(x, y, t + τ). The assumption made by the KLT tracking algorithm is

$$f\left( {x,y,t + \tau } \right) = f\left( {x - \Delta x,y - \Delta y,t} \right)$$
(1)

From Eq. (1), it is understood that each point in the first frame is shifted by an amount (Δx, Δy) to obtain the second frame. This shift is represented by the displacement d = (Δx, Δy), and the objective of tracking is to compute d. The two feature points taken from the first frame in our system and tracked throughout the video are [a + (w/2), b] and [a + (w/2), b + h], where (a, b) are the top-left corner coordinates and w and h are the width and height of the bounding box, respectively.

The displacement of the feature points is calculated from one frame to the next, and the lip movement is computed from this displacement. The displacement is used to decide the onset and offset of speech from the video, and these decisions form the binary mask for speech enhancement. The steps involved in obtaining the binary mask from the visual stream are shown in Fig. 2.

Fig. 2 Binary mask detection using video stream
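
The tracking and mask-building step can be sketched as follows, using OpenCV's pyramidal Lucas–Kanade tracker in place of the paper's KLT implementation. The lip-movement threshold and all parameter values are assumptions for illustration only.

```python
# Sketch: track the two mouth feature points through the video with a
# pyramidal Lucas-Kanade tracker and derive a per-frame binary voice-activity
# decision from the change in lip opening. The movement threshold is an
# assumed value, not one reported in the paper.
import cv2
import numpy as np

def visual_vad_mask(video_path, a, b, w, h, move_thresh=0.5):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Two feature points: midpoints of the top and bottom edges of the mouth box.
    pts = np.array([[[a + w / 2.0, b]],
                    [[a + w / 2.0, b + h]]], dtype=np.float32)

    prev_opening = float(h)
    mask = [0]                                   # no decision for the first frame
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)

        # Lip opening = vertical distance between the two tracked points.
        opening = float(new_pts[1, 0, 1] - new_pts[0, 0, 1])
        mask.append(1 if abs(opening - prev_opening) > move_thresh else 0)

        prev_gray, pts, prev_opening = gray, new_pts, opening
    cap.release()
    return np.array(mask, dtype=np.uint8)        # 1 = lip movement, 0 = no movement
```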

2.3 Linear Interpolation

The number of video frames is significantly smaller than the number of audio frames. However, the voice activity decisions from the audio and video streams are compared frame by frame for one-to-one correspondence, so the original visual decisions are interpolated to match the number of audio frames. The onset/offset decision between two visual frames N − 1 and N is assigned to the Nth frame, as in Fig. 3. The general formula for the number of frames in a given audio stream is

$${\text{NbFr}} = \left( {{\text{length}}\left( {x[n]} \right) - {\text{FL}} + {\text{FS}}} \right)/{\text{FS}}$$
(2)
Fig. 3 Audio subsequence corresponding to a frame

where NbFr is the number of audio frames, x[n] is the audio signal, FL is the frame length, and FS is the frame shift (both in samples).
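
A short sketch of Eq. (2) and of the decision up-sampling is shown below. The sampling rate, frame length, frame shift, and video frame rate are assumed typical values, not figures quoted from the paper.

```python
# Sketch: number of audio frames from Eq. (2) and nearest-neighbour
# up-sampling of the per-video-frame decisions to one decision per audio
# frame. Sampling rate, frame length/shift, and video frame rate are assumed.
import numpy as np

def audio_frame_count(num_samples, frame_len, frame_shift):
    # Eq. (2): NbFr = (length(x[n]) - FL + FS) / FS
    return (num_samples - frame_len + frame_shift) // frame_shift

def upsample_mask(video_mask, num_audio_frames):
    # Each audio frame takes the decision of the video frame that covers the
    # same instant in time (Fig. 3).
    idx = np.linspace(0, len(video_mask) - 1, num_audio_frames)
    return video_mask[np.round(idx).astype(int)]

fs = 16000                                       # assumed sampling rate
FL, FS = int(0.030 * fs), int(0.010 * fs)        # assumed 30 ms frames, 10 ms shift
n_audio = audio_frame_count(3 * fs, FL, FS)      # e.g. a 3 s clip
video_mask = np.zeros(75, dtype=np.uint8)        # e.g. 75 video frames at 25 fps
audio_mask = upsample_mask(video_mask, n_audio)
```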

3 Speech Onset and Offset Detection Using Audio Stream

A variety of information is present in the acoustic speech signal. In the existing system, voice activity detection using only audio is implemented as in [13]. The first step is framing: the input audio signal is segmented into 30 ms frames with 10 ms overlap between successive frames, and the Fourier transform is then applied to obtain the frequency-domain representation according to the window, sidelobe attenuation, and FFT length properties. Using a rectangular window would introduce high-frequency noise at the beginning and end of every frame, so a Hamming window is used to reduce this edge effect. The signal is then represented in the power domain. The noise variance is estimated according to [14], and the posterior and prior SNR are estimated according to [15]. The probability of speech being present in the current frame is calculated using a hidden Markov model (HMM) and a log-likelihood ratio test according to [13]. Based on the probability of each frame, onset/offset decisions are made in the existing method.
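
A simplified stand-in for this audio-only baseline is sketched below: framing, Hamming windowing, power spectrum, a noise estimate from the leading frames, and a plain a-posteriori SNR threshold in place of the full HMM/log-likelihood ratio test of [13]. The threshold and the noise-estimation window are assumptions.

```python
# Simplified stand-in for the audio-only VAD: Hamming-windowed frames, power
# spectrum via FFT, noise power estimated from the leading frames, and a
# per-frame a-posteriori SNR threshold instead of the full HMM /
# log-likelihood ratio test of [13]. Threshold and noise window are assumed.
import numpy as np

def audio_vad(x, fs, frame_ms=30, shift_ms=10, n_noise_frames=6, snr_thresh_db=3.0):
    fl, hop = int(frame_ms * fs / 1000), int(shift_ms * fs / 1000)
    win = np.hamming(fl)
    n_frames = (len(x) - fl + hop) // hop        # Eq. (2)

    power = np.array([np.abs(np.fft.rfft(win * x[i * hop: i * hop + fl])) ** 2
                      for i in range(n_frames)])
    noise_power = power[:n_noise_frames].mean(axis=0)   # assume leading frames are noise only

    # A-posteriori SNR per frame, averaged over frequency bins.
    post_snr_db = 10 * np.log10(np.maximum(power / (noise_power + 1e-12), 1e-12)).mean(axis=1)
    return (post_snr_db > snr_thresh_db).astype(np.uint8)   # 1 = speech, 0 = noise
```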

4 Experimental Results

The experiment is first conducted on sample test videos from the GRID audio–visual corpus, which provides videos of male and female speakers of the same duration. The noise samples mixed with the clean speech are taken from the NOISEX-92 database.

The frontal-view videos from the GRID corpus show the speakers uttering the same sentence consisting of voiced, unvoiced, and silent segments. The mask for voice activity detection using visual cues is obtained with the Viola–Jones and KLT algorithms, and the target speech is separated using this mask. We tested our system on three short videos from the GRID corpus, a total of 297 × 3 frames, with noise mixed at −5 dB to +5 dB SNR. The target speech separated using the visual-cue mask is compared with the target speech separated using the mask obtained from the audio stream.

The experimental results are shown in Table 1. Performance is evaluated using the perceptual evaluation of speech quality (PESQ) measure. The PESQ score ranges from −0.5 to 4.5; the higher the score, the better the quality.

Table 1 PESQ improvement of the proposed system
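
For reference, a hedged sketch of the PESQ scoring step is given below using the third-party pesq and soundfile Python packages; the file names are placeholders and the 16 kHz wideband mode is an assumption (PESQ accepts only 8 kHz or 16 kHz input).

```python
# Sketch: score enhanced speech against the clean reference with PESQ using
# the third-party 'pesq' and 'soundfile' packages. File names are placeholders
# and 16 kHz wideband mode is an assumption.
import soundfile as sf
from pesq import pesq

clean, fs = sf.read("clean.wav")
enh_visual, _ = sf.read("enhanced_visual_mask.wav")
enh_audio, _ = sf.read("enhanced_audio_mask.wav")

print("PESQ (visual-cue mask):", pesq(fs, clean, enh_visual, "wb"))
print("PESQ (audio-cue mask): ", pesq(fs, clean, enh_audio, "wb"))
```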

5 Conclusion

We have presented a mask using visual cues for voice activity detection. The proposed system detects the face and the mouth region to effectively distinguish speaking frames from silent frames in low SNR conditions. The binary mask presented here is independent of the noise and effective for speech enhancement.

Comparing the proposed VAD using visual cues with the VAD using audio cues on three videos of 297 frames each, it is evident that at low SNR the target speech enhanced using visual cues achieves a higher PESQ score than the target speech enhanced using audio cues.

Extremely low lighting, faces that move considerably away from a frontal pose, and faces too far from the camera to provide enough information about the mouth would cause the system to perform poorly.