1 Introduction

Speaker recognition and speaker diarization both deal with the identification of human voices, but the former only distinguishes between speakers, whereas the latter also determines the number of speakers and the duration of each speaker's speech. Diarization assigns specific time intervals to each speaker in a multi-speaker conversation; it essentially answers the question "Who spoke when?" [2, 4]. Separating a handful of speakers in a conversation is a relatively easy task; the challenge arises in multi-speaker scenarios, where the durations of many speakers must be extracted from meetings, seminars, conferences, discussions, and telephone conversations. A further complexity is distinguishing the speakers' speech from background noises such as clapping, murmuring, laughing, reverberation, and overlapped speech [3]. These noises create disturbances and degrade the performance of the complete diarization pipeline.

Generally, the diarization process can be categorized as unsupervised or supervised based on the behavior of its segmentation, embedding, and clustering stages. Initially, unsupervised diarization systems [25] with various types of embeddings (i-vector [3, 13, 26], x-vector [11, 12, 27], and d-vector [14, 15, 28, 30]) and different clustering methods (mean shift [16, 17], agglomerative hierarchical clustering [3, 18, 22], spectral clustering [19,20,21,22, 30]) were the norm. Recently, a shift from unsupervised to supervised and then to fully supervised diarization systems, such as UIS-RNN (Unbounded Interleaved-State Recurrent Neural Network) [29], has accelerated research on both online and offline diarization. This progress followed the rise of deep neural network architectures. For example, in [30] Wang Q. et al. introduced a diarization system based on an LSTM (Long Short-Term Memory) neural network with d-vector embeddings and several offline and online clustering algorithms. A speaker diarization pipeline, as described in the literature, comprises the following sub-modules [2]:

  1. Speech Activity Detection: It separates speech segments from non-speech segments such as noise, reverberation, etc. In this module, important features such as Mel-Frequency Cepstral Coefficients (MFCC), zero-crossing rate, and spectral features are extracted [1], after which speech separation is performed on the speech part. A classifier model predicts whether each input frame is speech or not. Earlier, Gaussian Mixture Models (GMM) and Hidden Markov Models [5] were used for this detection, but Deep Neural Networks (DNN) have since performed better.

  2. Speech Segmentation: Larger chunks of audio are divided into small segments to simplify speaker assignment. Segments obtained through speaker change-point detection work better than those from other methods, and they are also used to assign speaker labels. Previously, techniques such as the Generalised Likelihood Ratio [6] and the Bayesian Information Criterion [7, 8] were used for segmentation, and before i-vectors, the Kullback–Leibler divergence and the information change rate were popular measures of the distance between speech segments. With the introduction of i-vectors and d-vectors, uniform segmentation, which uses a fixed window length and overlap length, has prevailed to date [9].

  3. Embedding Extraction: Previously, GMM and GMM-UBM (Universal Background Model) models were preferred for speaker representation until Joint Factor Analysis (JFA) and i-vectors appeared. JFA overcame problems faced by MAP (Maximum A Posteriori) adaptation, such as channel and background noise [1]. Neural representation techniques then took over and reshaped the process [10]. Various combinations of x-vectors [11, 12] and i-vectors [13] exist, but d-vectors [14, 15, 30] remain dominant.

  4. Clustering: This module determines the speaker count by assigning each speaker a separate cluster label. As soon as the embeddings are produced, the clustering algorithm is applied. Initially, the mean shift algorithm [16, 17] and agglomerative hierarchical clustering [3, 18] were used with most representation techniques. More recently, spectral clustering [19,20,21] has become the algorithm of choice in diarization, and many of its improved variants [23, 24, 58,59,60] work well with deep learning models.

Besides these four modules, a diarization pipeline also contains pre- and post-processing modules [1]. They reduce the complexity of the system by refining and smoothing the input and output of the pipeline, respectively. With the emergence of large multi-disciplinary datasets such as VoxCeleb, VoxConverse, and various meeting corpora, new challenges appeared: unwanted background noise, many speakers (more than 10 or 15), lengthy audio files, overlapping segments, heavy computation, low-resource compatibility, etc. Sun L. et al. [3] addressed the noise issue with an LSTM-based speech denoising model combined with i-vector embeddings and AHC. In [30], Wang Q. et al. suggested a d-vector-based LSTM model, which proved more robust and effective for diarization than i-vectors. In 2020, the authors of [40] proposed an LSTM-based speech enhancement block that reduces noise using densely connected progressive learning.

However, the pipeline still suffers from shortcomings in removing background noise such as laughter, clapping, and murmuring, and the heavy computation required for large datasets such as VoxConverse is another challenge. To overcome both of these challenges, this paper proposes a modified speaker diarization pipeline. It contains a modified pre-processing module in which a speech refinement model based on a Bi-LSTM network with skip U-Net connections is added for noise reduction. The paper also suggests a remedy for the slow computation of spectral clustering on large datasets: the affinity matrix is symmetrized for smoother calculation, and the eigenvectors are evaluated using singular value decomposition.

The rest of the paper is organized as follows. Section 2 presents the background study concerning both modified modules: sub-section 2.1 briefly reviews speech enhancement models, while sub-section 2.2 covers the modifications made to spectral clustering from the traditional form to the latest variants. Section 3 highlights the materials and methods used in the proposed system pipeline. Section 4 then introduces the proposed speaker diarization pipeline, with the modified pre-processing and clustering modules described in detail. The metric used, the datasets, and the experimental setup are specified in Section 5. The results are presented in Section 6. Finally, conclusions and future scope are discussed in Section 7.

2 Background study

2.1 Review of speech enhancement methods

Speech enhancement is one of the most essential pre-processing steps. It reduces or removes various noises, such as background sounds, applause, laughter, and reverberation, to obtain improved speech signals. Enhancement methods fall into two domains, the frequency domain and the time domain.

  • In the frequency domain, single-channel speech enhancement [31], frequency-domain linear prediction (FDLP) [32], and conventional Fast Fourier Transform magnitude spectra [33] have been superseded by wavelet-threshold multi-taper spectra [34, 35], which yield a significant improvement in the signal-to-noise ratio (SNR) evaluation metric.

  • In the time domain, the earliest speech enhancement technique was spectral subtraction [36], in which the noise spectrum is subtracted from the noisy spectrum to obtain the clean speech spectrum; it improves both quality and intelligibility. Many variants of spectral subtraction followed, such as spectral over-subtraction, multiband spectral subtraction, and iterative spectral subtraction, and finally Wiener filtering, which replaced them all. The main time-domain techniques employed for speech enhancement are the following.

    • Spectral Techniques [37]. In this family of methods, the authors of [37] added two factors (an over-subtraction factor and a spectral floor) to basic spectral subtraction, which improved the algorithm, but remnant noise could not be removed. Multiband spectral subtraction was therefore introduced to control real-world noise; it applies spectral subtraction over four equally spaced frequency bands, and an additional band-subtraction factor customizes the noise removal and controls the amount of subtraction in each band. Another technique is Wiener filtering, which improved on spectral subtraction by minimizing the mean square error (MSE) between the original and estimated signals; however, it uses a fixed gain function, which degrades the quality of the clean speech signal. Iterative spectral subtraction then improved on Wiener filtering by feeding the enhanced output signal back as the input for the next iteration. In 2013, Abd El-Fattah et al. [38] improved ordinary Wiener filtering into an adaptive Wiener filter, targeting the drawback of the former, namely that spectral subtraction is applied to stationary signals, whereas the adaptive filter learns a sample-by-sample filter response. Finally, the gating technique has been used to reduce noise in the music industry for many years [57]; it uses a gate that monitors the audio level (a minimal sketch is given after this list). In [50], spectral gating was applied in the pre-processing module of a speaker diarization pipeline and improved its performance. Table 1 shows how gating reduces the noise present in raw audio files on the VoxConverse dev set.

    • Deep Learning-based Techniques: Recently, various neural-network-based filtering algorithms, such as a Convolutional Neural Network (CNN)-based speech enhancement method [39] and a Kalman-filter-based Deep Neural Network (DNN) [40], have been implemented. Neural network models performed well in noisy conditions on both quality and intelligibility metrics, i.e., Perceptual Evaluation of Speech Quality (PESQ) and short-time objective intelligibility (STOI), as noted by the authors of [39, 40]. Deep learning has thus paved the way for further reduction of background disturbances and unnecessary noise. The authors of [66, 67] also discussed the U-Net architecture, which has shown improved speech enhancement results.
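As an illustration of the gating idea used in [50], the following is a minimal spectral-gating sketch (stationary variant) in Python. The STFT settings, the gate threshold, and the use of the first 0.5 s as a noise-only estimate are assumptions made for this example, not parameters taken from [50].

```python
# Minimal spectral-gating sketch (stationary variant); thresholds and the
# noise-only estimate (first 0.5 s) are illustrative assumptions.
import numpy as np
from scipy.signal import stft, istft
import soundfile as sf

wav, sr = sf.read("raw_meeting.wav")              # hypothetical input file
if wav.ndim > 1:                                  # mix down to mono if needed
    wav = wav.mean(axis=1)

f, t, Z = stft(wav, fs=sr, nperseg=512)           # short-time Fourier transform

noise_frames = Z[:, t < 0.5]                      # assume the first 0.5 s is noise-only
noise_mag = np.abs(noise_frames).mean(axis=1, keepdims=True)
threshold = 1.5 * noise_mag                       # gate threshold per frequency bin

mask = (np.abs(Z) > threshold).astype(float)      # open the gate only above threshold
_, gated = istft(Z * mask, fs=sr, nperseg=512)    # back to the time domain
sf.write("gated_meeting.wav", gated, sr)
```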

Table 1 Average values of the DER (Diarization Error Rate) components and DER % before and after applying spectral gating on the VoxConverse dev set [50]

The enhancement techniques discussed above are based on filtering, gating, and neural networks. Gating techniques were first used by practitioners in the music domain, but recent research shows that neural networks outperform the other two approaches [37, 38, 40]. Still, some distracting noises, such as laughter, murmuring, and clapping, persist and cannot be fully reduced or removed from the audio files of multidisciplinary datasets.

2.2 Review on spectral clustering

Clustering is an important module of the diarization pipeline; it labels speakers through clusters. Spectral clustering is one of the finest clustering algorithms and overcomes the drawbacks of k-means clustering, namely that k-means cannot handle anisotropic data (it assumes spherical or round clusters), is sensitive to the initialization of the centroids, and easily converges to local optima. The basic steps of the traditional spectral clustering algorithm by Ng A. et al. [42] are as follows:

  • Form an affinity matrix \({A}_{ij}=\mathrm{exp}(\frac{{-\Vert {s}_{i}-{s}_{j}\Vert }^{2}}{2{\sigma }^{2}})\) for \(i\ne j\), with \({A}_{ii}=0\).

  • Let \(D\) be the diagonal matrix whose \((i,i)\) element is the sum of the \(i\)-th row of \(A\), and construct the matrix \(L= {D}^{-1/2}A{D}^{-1/2}\).

  • Find the \(k\) largest eigenvectors \({x}_{1}, {x}_{2},\dots , {x}_{k}\) of \(L\) and stack them as the columns of a matrix \(X\).

  • Form the matrix \(Y\) from \(X\) by renormalizing each row of \(X\) to unit length.

  • Apply \(k\)-means clustering to the rows of \(Y\) and assign each point the label of its cluster.
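These steps translate almost directly into code. The following is a minimal NumPy/scikit-learn sketch of the Ng–Jordan–Weiss procedure above; the kernel width sigma and the use of scikit-learn's KMeans are illustrative choices, not prescriptions from [42].

```python
# Minimal sketch of the Ng-Jordan-Weiss spectral clustering steps above.
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(points, k, sigma=1.0):
    # affinity matrix with zero diagonal
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # normalized matrix L = D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # k largest eigenvectors of L, stacked as the columns of X
    eigvals, eigvecs = np.linalg.eigh(L)
    X = eigvecs[:, -k:]
    # renormalize each row of X to unit length to obtain Y
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # k-means on the rows of Y gives the cluster labels
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```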

Several variants of the spectral clustering algorithm have appeared with different modifications, implemented with different machine learning algorithms and neural network systems. Some of them are discussed below:

  A. Unnormalized Spectral Clustering

    This was one of the earliest variants, discussed by von Luxburg in [19]. As the name suggests, unnormalized spectral clustering computes the unnormalized Laplacian matrix (which is symmetric, positive semi-definite, and has non-negative eigenvalues). Its basic steps are as follows:

    • With \(W\) the weighted adjacency matrix and \(D\) the diagonal degree matrix, the unnormalized Laplacian matrix is computed as \(L=D-W\).

    • The first \(k\) eigenvectors \({u}_{1}, {u}_{2},\dots , {u}_{k}\) of \(L\) are calculated.

    • A matrix \(U\) is formed with these eigenvectors as its columns.

    • Cluster labels are obtained by applying the \(k\)-means algorithm to the rows of \(U\).

  B. Normalized Spectral Clustering

    Unlike the above, normalized spectral clustering uses a normalized Laplacian matrix (\({L}_{sym}\) (symmetric) or \({L}_{rw}\) (random walk)) and then proceeds with the same steps as the unnormalized algorithm [19].

    • For the random-walk Laplacian: the eigenvectors \(u\) are computed from the generalized eigenproblem \(Lu=\lambda Du\), where \(D\) is the diagonal degree matrix and \(\lambda\) is an eigenvalue of \({L}_{rw}\).

    • For the symmetric Laplacian: \(\lambda\) is an eigenvalue of \({L}_{sym}\) with eigenvector \(w={D}^{-1/2}u\). The first \(k\) eigenvectors of \({L}_{sym}\) are computed and their rows are normalized to unit length to form a matrix \(T\).

  C. Self-tune Spectral Clustering

    In self-tuning spectral clustering, the authors automated the complete algorithm [24] by modifying a few steps of traditional spectral clustering. The steps are given below:

    • Compute a local scale \({\sigma }_{i}\) for every point to be clustered.

    • Form the locally scaled affinity matrix \(\widehat{A}\) and construct the normalized affinity matrix \(L={D}^{-1/2}\widehat{A}{D}^{-1/2}\) using the diagonal degree matrix \(D\).

    • Form the matrix \(X=[{x}_{1},\dots , {x}_{C}]\) from the \(C\) largest eigenvectors of \(L\).

    • Rotate the eigenvectors to obtain the maximally sparse representation.

  D. Auto-tune Spectral Clustering

    This is the latest automated version of spectral clustering, developed by Park T. J. et al. [23] in 2019. Here, the clustering hyperparameters, \(p\) (the threshold used for row-wise binarization) and \({g}_{p}\) (the normalized maximum eigengap value), are tuned automatically. The two share an approximately linear relationship and thus play an important role in obtaining a good DER, since the ratio \(p/{g}_{p}\) can suggest the value of \(p\) for which the DER is presumably lowest. As a result, spectral clustering with cosine similarity showed better performance than AHC coupled with a PLDA (Probabilistic Linear Discriminant Analysis) model. Other variants have also been introduced, such as constrained spectral clustering [58], scalable constrained spectral clustering [59], multi-view spectral clustering [61], multiclass spectral clustering [60], and many others.

However, even after so many modifications, spectral clustering still has drawbacks on large and noisy datasets such as VoxCeleb [63], VoxCeleb2 [64], and VoxConverse [68]. Its performance degrades in terms of speed and computational cost, and the selection of informative eigenvectors requires heavy calculation. To overcome these setbacks, this paper suggests a better technique for eigenvector decomposition: with singular value decomposition, eigenvector selection becomes easier and the computational load of the complete pipeline is balanced, which also improves overall performance. The modified clustering is discussed in detail in module M5 of Sect. 4.

3 Methodology

3.1 Bi-LSTM model with skip U-Net connections

Speech denoising became a crucial pre-processing step in diarization after the advent of large multidisciplinary datasets, which contain multi-speaker conversations from news broadcasts, interviews, conferences, meetings, and discussions [1]. Various types of noise disrupt clean and effective diarization.

Such noisy disturbances are present everywhere, so a proper reduction or removal technique that pre-processes the audio files is required to obtain refined audio for further evaluation.

The Demucs architecture developed by Défossez A. et al. [54] has shown excellent results for music source separation in the waveform domain [53]. It enhances speech quality by suppressing environmental noise, reverberation, background noise, etc. to a great extent. It inherits its structure from Wave-U-Net [55], which was introduced for audio source separation on music datasets.

This paper adapts the Demucs architecture into a speech refinement module consisting of a convolutional encoder, a Bi-LSTM, and a convolutional decoder, in which the encoder and decoder are linked by skip U-Net connections. The architecture can reduce both stationary and non-stationary noises and also improves the naturalness of the audio. A brief description of the architecture is given below.

  • It consists of \(L\) encoder layers, numbered from 1 to \(L\), while the decoder layers are in reverse order from \(L\) to 1.

  • Firstly, the \(i\)-th encoder layer contains a convolution with \({2}^{i-1}H\) output channels followed by Rectified Linear Unit (ReLU) activation, then a 1 × 1 convolution with \({2}^{i}H\) output channels, and finally a Gated Linear Unit (GLU) activation that brings the number of channels back to \({2}^{i-1}H\).

  • Secondly, the Bi-LSTM network consists of 2 layers with \({2}^{L-1}H\) hidden units. An L1 loss over the waveform together with an STFT loss over the spectrogram magnitude is used for the architecture.

  • Lastly, the decoder network takes the encoded features and outputs a clean signal. The \(i\)-th decoder layer takes \({2}^{i-1}H\) channels, applies a 1 × 1 convolution with \({2}^{i}H\) output channels followed by a GLU activation that reduces them to \({2}^{i-1}H\) channels, and then a transposed convolution with kernel size K = 8, stride S = 4, and \({2}^{i-2}H\) output channels.

  • Finally, a ReLU activation is applied after every decoder layer except the last one, which has no ReLU and a single output channel. The skip U-Net connections connect the \(i\)-th layer of the encoder to the \(i\)-th layer of the decoder.

A combination of a Bi-LSTM network with skip U-Net connections is employed in the proposed diarization pipeline described in Sect. 4.
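For concreteness, the following is a minimal PyTorch sketch of this encoder / Bi-LSTM / decoder layout with skip connections. Kernel size K = 8, stride S = 4, and the ReLU/GLU/1 × 1-convolution arrangement follow the description above, while the depth (L = 4) and hidden width (H = 48) are illustrative assumptions; this is a sketch, not the authors' released model.

```python
# Minimal sketch of a Demucs-style encoder / Bi-LSTM / decoder with skip
# (U-Net) connections. Depth and hidden width are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMSkipUNet(nn.Module):
    def __init__(self, depth=4, hidden=48):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        in_ch, ch = 1, hidden
        for i in range(depth):
            # encoder layer i: Conv(K=8, S=4) -> ReLU -> 1x1 Conv (2x channels) -> GLU
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, kernel_size=1),
                nn.GLU(dim=1),                         # halves channels back to ch
            ))
            # matching decoder layer i (stored so decoder runs deepest-first)
            out_ch = 1 if i == 0 else in_ch
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(ch, 2 * ch, kernel_size=1),
                nn.GLU(dim=1),
                nn.ConvTranspose1d(ch, out_ch, kernel_size=8, stride=4),
                nn.ReLU() if i != 0 else nn.Identity(),  # no ReLU on the final layer
            ))
            in_ch, ch = ch, 2 * ch
        # 2-layer bidirectional LSTM bottleneck over the deepest feature sequence
        self.lstm = nn.LSTM(in_ch, in_ch // 2, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x):                      # x: (batch, 1, time)
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        y, _ = self.lstm(x.permute(0, 2, 1))   # (batch, frames, channels)
        x = y.permute(0, 2, 1)
        for dec in self.decoder:
            skip = skips.pop()                 # skip: encoder layer i -> decoder layer i
            n = min(x.shape[-1], skip.shape[-1])
            x = x[..., :n] + skip[..., :n]
            x = dec(x)
        return x                               # denoised waveform estimate

if __name__ == "__main__":
    net = BiLSTMSkipUNet()
    noisy = torch.randn(1, 1, 16000)           # one second of 16 kHz audio
    print(net(noisy).shape)                    # output is slightly shorter than the input
```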

3.2 Singular Value Decomposition (SVD)

SVD is a matrix factorization technique that decomposes a single matrix \(A\) into three matrices, as shown below:

$$A=U\Sigma {V}^{T}$$
(1)
$$A{A}^{T}=U\Sigma {\Sigma }^{T}{U}^{T},\quad {A}^{T}A=V{\Sigma }^{T}\Sigma {V}^{T}$$
(2)

where \(A\) is an \(m\times n\) matrix, \(U\) is the \(m\times m\) left singular matrix whose columns are the eigenvectors of \(A{A}^{T}\), \(V\) is the \(n\times n\) right singular matrix whose columns are the eigenvectors of \({A}^{T}A\), and \(\Sigma\) is an \(m\times n\) diagonal matrix containing the singular values. \({A}^{T}\) denotes the transpose of \(A\), \(m\) the number of rows, and \(n\) the number of columns.
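The decomposition and its relation to the eigenvalues of \({A}^{T}A\) can be checked numerically; the short NumPy sketch below uses an arbitrary example matrix, not data from the paper.

```python
# Numerical check of Eqs. (1)-(2) on an arbitrary example matrix.
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))    # m x n example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U @ diag(s) @ Vt
assert np.allclose(A, U @ np.diag(s) @ Vt)           # reconstruction, Eq. (1)

# the squared singular values equal the eigenvalues of A^T A (and of A A^T)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]          # sorted in descending order
assert np.allclose(s ** 2, eigvals)                  # consistency with Eq. (2)
```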

4 Proposed system description

This paper proposes an improved speaker diarization pipeline, shown in Fig. 1, that employs a Bi-LSTM skip U-Net model in the pre-processing module for speech refinement and a Modified Spectral Clustering (MSC) algorithm with SSVD (symmetrization and singular value decomposition) in the clustering module. All modules of the improved speaker diarization pipeline are explained in detail in the following sub-sections.

Fig. 1 Proposed speaker diarization pipeline

4.1 M1: Pre-processing module: Speech refinement model using Bi-LSTM

Pre-processing is the first module of the proposed diarization pipeline. It refines and enhances the raw, noisy audio files. The paper proposes a speech refinement model using a Bi-LSTM with skip U-Net connections for this module. The network architecture of the refinement module is depicted in Fig. 2.

Fig. 2 Speech refinement model using Bi-LSTM

The Bi-LSTM model handles sequential and time-series data very well: it captures long-range dependencies in the audio and finds temporal patterns in both directions, so noisy patterns within files can be identified more quickly with a Bi-LSTM than with a unidirectional LSTM network. Likewise, the skip U-Net connections carry both high-level and low-level features of the audio signal, which helps retain the structure and characteristics of the clean speech, a property beneficial for denoising. The combination of a Bi-LSTM with skip U-Net connections is therefore better at understanding the structure of the audio and distinguishing noise from the actual signal components; together they create a robust network for speech refinement. The pseudocode for the proposed speech refinement module is given below:

  a) Input a raw audio waveform to the pre-processing module, where a state dictionary from the pre-trained model is loaded.

  b) Set all parameters according to the model's requirements and pass the waveform to the speech refinement model.

    i. First, the waveform enters the encoder network, where gated linear units (GLU) boost performance; at the output, both rectified linear unit (ReLU) and GLU activations are applied for enhancement.

    ii. The processed signal then passes through the Bi-LSTM network with 48 hidden units.

    iii. Lastly, the decoder network takes in the channels and, after a convolution, a gated linear activation, and a transposed convolution, applies a rectified linear activation (except at the final layer) to produce an enhanced signal for further processing through the speaker diarization pipeline.

The refined audio files, with background noise reduced, are then passed to the next module for speech and non-speech detection.
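As a usage illustration of steps (a) and (b), the sketch below applies a pretrained Demucs-style denoiser to a raw waveform using the open-source denoiser package. The function names are recalled from that package's documentation and should be treated as assumptions; the checkpoint shown is a generic pretrained model, not the refinement model trained in this paper.

```python
# Hedged usage sketch: pretrained Demucs-style denoiser applied to a waveform.
# Function names follow the facebookresearch/denoiser package as recalled
# here (an assumption), not the authors' released code.
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio

model = pretrained.dns64()                         # generic pretrained Demucs denoiser
wav, sr = torchaudio.load("noisy_meeting.wav")     # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.chin)
with torch.no_grad():
    refined = model(wav[None])[0]                  # denoised waveform
torchaudio.save("refined_meeting.wav", refined.cpu(), model.sample_rate)
```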

4.2 M2: Speech activity detection module

At this stage, the files obtained from the pre-processing module are converted to an array, and L1 normalization is applied. Overlap detection is also part of SAD, but in this framework overlapping speech segments are not considered.
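A minimal sketch of this step follows; the use of librosa for loading and the application of L1 normalization to the raw waveform array are assumptions, since the text does not name the loader.

```python
# Minimal sketch of the array conversion and L1 normalization mentioned above.
import numpy as np
import librosa

wav, sr = librosa.load("refined_meeting.wav", sr=16000)   # waveform as a float array
wav_l1 = wav / np.sum(np.abs(wav))                         # L1 normalization
```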

4.3 M3: Segmentation Module

During this phase, speech and non-speech segments are separated, which further reduces complexity by removing silences, gaps, and non-speech portions. The resulting smaller segments smooth the process by gathering the vocal segments in one place.

4.4 M4: Embedding extraction module

From these speech segments, embeddings are obtained through utterance embedding, for which MFCC features are extracted, and continuous embeddings are derived from the partial embeddings. In this paper, d-vector embeddings are extracted with the pre-trained PyTorch model Resemblyzer.
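The sketch below illustrates this extraction with Resemblyzer. The function names (VoiceEncoder, preprocess_wav, embed_utterance) follow its public documentation, and the rate value is an assumption, so treat this as an illustration rather than the authors' exact code.

```python
# Hedged sketch of d-vector extraction with Resemblyzer; the rate parameter
# is an assumed value, not taken from the paper.
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("refined_meeting.wav")        # resample and normalize the audio
encoder = VoiceEncoder()                           # pre-trained d-vector model
# return_partials yields one partial embedding per sliding window ("rate" per second)
_, partial_embeds, wav_splits = encoder.embed_utterance(
    wav, return_partials=True, rate=16)
# partial_embeds: (n_windows, 256) continuous d-vectors fed to the clustering module
```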

4.5 M5: Clustering module: Spectral clustering modification using MSC-SSVD

The continuous embeddings are received from the previous module, and the number of clusters must first be predicted as an input to spectral clustering. The second modification of the proposed pipeline is implemented here, in the spectral clustering algorithm. The traditional algorithm is modified in order to use a basic dimensionality reduction technique and to speed up eigenvector calculation and decomposition: the affinity matrix is first symmetrized to obtain a transformed adjacency matrix, and singular value decomposition is then applied for the eigenvector calculation.

The steps of the modified spectral clustering- symmetrized singular value decomposition (MSC-SSVD) are as follows.

  a) Form an affinity matrix \(A\) using the cosine similarity measure, \({A}_{ij}=d({w}_{i},{w}_{j})\), which measures the similarity between the speaker embeddings of two speech segments.

  b) Apply a symmetrization operation to the affinity matrix to obtain a transformed matrix. The affinity matrix is converted into an undirected adjacency matrix by averaging the original and transposed versions of the affinity matrix:

    $${A}_{s}= \frac{1}{2} (A+ {A}^{T})$$
    (3)
  c) Compute the eigenvalues and eigenvectors using singular value decomposition and arrange the eigenvectors in descending order of their singular values.

  d) Apply the k-means clustering algorithm to the obtained spectral embeddings to get the cluster labels.

  e) The number of distinct cluster labels gives the speaker count.

Lastly, individual speaker labels are obtained after the final clustering module. The labels signify the speech duration of each speaker separately.
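A minimal NumPy/scikit-learn sketch of steps (a)–(d) is given below; the function and variable names are illustrative, not the authors' implementation.

```python
# Minimal sketch of the MSC-SSVD clustering steps described above.
import numpy as np
from sklearn.cluster import KMeans

def msc_ssvd(embeddings, n_speakers):
    """embeddings: (n_segments, dim) d-vectors; returns per-segment cluster labels."""
    # step (a): cosine-similarity affinity between segment embeddings
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = norm @ norm.T
    # step (b): symmetrize, Eq. (3)
    A_s = 0.5 * (A + A.T)
    # step (c): eigenvectors via SVD (A_s is symmetric, so its singular vectors
    # coincide with its eigenvectors up to sign), sorted by singular value
    U, s, _ = np.linalg.svd(A_s)
    spectral = U[:, :n_speakers]                   # leading spectral dimensions
    # step (d): k-means on the spectral embeddings
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(spectral)
```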

5 Dataset, experimental setup, and evaluation metrics

5.1 Dataset

The proposed speaker diarization pipeline is implemented on the VoxConverse dataset [64], which was made publicly available in 2020. It consists of 50 h of multi-speaker clips of human conversations in various forms: telephone calls, broadcast news, interviews, and other conversations.

The implementation on the VoxConverse dataset was carried out in three batches of files with different durations, in order to add variation and to obtain a complete picture across different recording lengths.

The pipeline is implemented on VoxConverse because it contains a variety of multi-speaker clips from many different scenarios, which gathers most of the relevant variation in one place and allows multidisciplinary domains to be analyzed within a single audio dataset.

5.2 Experimental setup

The implementation was done on an NVIDIA GeForce GPU with "NVIDIA-SMI 471.41, Driver Version: 471.41, CUDA Version: 11.4", 8 GB RAM, and a 512 GB SSD. Running this heavy dataset in such a low-resource environment was a challenge in itself. pyannote.audio 2.0.1 was used for the SAD task, and metric evaluation was done with pyannote.metrics. The pre-trained Resemblyzer embedding extractor was used to extract d-vector embeddings.
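As an illustration of the scoring tooling, the sketch below computes DER with pyannote.metrics on two toy annotations; the segment boundaries and speaker labels are made up for the example and are not the paper's data.

```python
# Toy example of DER computation with pyannote.metrics; values are illustrative.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(10.0, 20.0)] = "spk_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 12.0)] = "spk_1"
hypothesis[Segment(12.0, 20.0)] = "spk_2"

metric = DiarizationErrorRate()
print(metric(reference, hypothesis))   # fraction of reference time in error
```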

5.3 Evaluation metric

To evaluate the proposed speaker diarization pipeline we use the Diarization Error Rate (DER) [65]. It measures the percentage of the reference speaker duration that is not correctly assigned to a speaker, where "correctly assigned" means a one-to-one mapping of speaker labels between the ground truth and the hypothesis. Overlapping segments are ignored.

The Diarization Error Rate consists of three sub-components, as given in Equation (4):

  • False Alarm: It refers to the percentage of scored time that a hypothesized speaker is labeled as non-speech in the reference.

  • Missed Detection: It refers to the percentage of scored time that a hypothesized non-speech segment corresponds to a reference speaker segment.

  • Confusion (Speaker error): It refers to the percentage of scored time that a speaker ID is assigned to the wrong speaker.

    $$\mathrm{Diarization\ Error\ Rate}=\frac{\text{False Alarm}+\text{Missed Detection}+\text{Confusion}}{\text{Total Reference}}$$
    (4)
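A small worked example of Eq. (4) with made-up durations illustrates how the components combine.

```python
# Worked illustration of Eq. (4) with made-up durations (seconds),
# not results from the paper.
false_alarm = 2.0        # hypothesized speech where the reference has none
missed_detection = 3.0   # reference speech the hypothesis missed
confusion = 5.0          # speech attributed to the wrong speaker
total_reference = 100.0  # total reference speaker time

der = (false_alarm + missed_detection + confusion) / total_reference
print(f"DER = {der:.1%}")   # -> DER = 10.0%
```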

6 Results and discussions

The proposed speaker diarization pipeline is evaluated on the VoxConverse dataset, and the results obtained after employing the Bi-LSTM refinement with spectral clustering are tabulated in Table 2 and compared with the state-of-the-art system that used an LSTM neural network with spectral clustering in its diarization pipeline [30]. The results show a significant decrease in DER after adding the Bi-LSTM model to the pre-processing stage: an average reduction of 3.8% in DER.

Table 2 Results of DER% by Speech refinement using BiLSTM on VoxConverse dataset

Table 3 shows the DER% after applying the proposed modified spectral clustering with symmetrized singular value decomposition, together with state-of-the-art clustering algorithms such as Particle Swarm Optimization k-means (PSO k-means), Agglomerative Hierarchical Clustering (AHC), and spectral clustering in the diarization pipeline. In this experiment, the LSTM network was combined with each clustering method, without any refinement step. The proposed MSC-SSVD technique reduces the DER% by a noticeable margin compared with the existing spectral clustering [30].

Table 3 Comparison of different clustering algorithms with LSTM network with our proposed Modified SC-SSVD on VoxConverse dataset

Finally, the comparison between the baseline system [30] and the proposed modified speaker diarization system is compiled in Table 4, where the two proposed modifications are also evaluated separately. The results make it clear that the Bi-LSTM model produces a remarkable change in DER when combined with the MSC-SSVD clustering technique. Overall, the absolute change in DER is 6.1%, 4.7%, and 7% for the three batches, respectively, which indicates that background noise plays a significant role in degrading the quality of an audio file.

Table 4 Comparison of DER of our proposed pipeline with other state-of-art systems on the VoxConverse dataset (All the values are in % and the lower is better)

7 Conclusion

Diarization is the task of identifying and tracking each speaker's speech duration in an audio recording. Its scope has now extended to speaker indexing, content structuring, and audio information retrieval. This paper aims to reduce the extraneous noise generated by non-speech sounds such as clapping, murmuring, and laughing. It proposed an improved speaker diarization pipeline with a speech refinement module using a Bi-LSTM with skip U-Net connections and an improved spectral clustering algorithm with symmetrized singular value decomposition. The DER obtained after implementing the proposed solution is 37.2%, 37.1%, and 43.3% for the three batches of the multi-disciplinary VoxConverse dataset, respectively. Compared with the baseline approach [30], this is a significant decrease of 6.1%, 4.7%, and 7%. The modified pipeline paves the way for retrieving audio files with reduced background and unnecessary noise, but it still has difficulty distinguishing similar voices at times.

The proposed pipeline can be extended to multimodal functionality, such as video and audio-based emotion recognition. There is still ample scope for improvement in handling speaker variability, adaptation, and real-time performance.