Abstract
Conventional target speech separation directly estimates the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts every speaker's voice from the mixed speech, rather than optimally estimating only a single target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS). By exploiting target speaker voice activity detection (TSVAD) and the estimated mask, our SD-MTSS model can extract the speech signal of each speaker concurrently from a conversational recording without requiring enrollment audio in advance. Experimental results show that our MTSS model improves over the baseline on the WSJ0-2mix-extr dataset by 1.38 dB in signal-to-distortion ratio (SDR), 1.34 dB in scale-invariant signal-to-distortion ratio (SI-SDR), and 0.13 in perceptual evaluation of speech quality (PESQ), respectively. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate on the AliMeeting dataset.
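For reference, the scale-invariant signal-to-distortion ratio (SI-SDR) reported above can be computed as follows. This is a minimal illustrative sketch of the standard metric definition, not the authors' evaluation code:

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    # Remove DC offset so the metric depends only on signal shape.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to find the optimal scaling,
    # which makes the metric invariant to the estimate's overall gain.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref          # scaled reference component
    noise = est - target          # residual distortion
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + eps))
```

Because of the optimal-scaling projection, rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.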
Acknowledgment
The authors thank the Advanced Computing East China Sub-Center for providing computational resources.
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
Foundation item: the National Natural Science Foundation of China (No. 62171207), the Science and Technology Program of Suzhou City (No. SYC2022051), and OPPO
Cite this article
Zeng, B., Suo, H., Wan, Y. et al. Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2739-7
Keywords
- target speech separation
- interrelationship
- speaker diarization (SD)
- target speaker voice activity detection (TSVAD)
- multiple-target speech separation (MTSS) model