1 Introduction

Speaker Identification (SID) is a voice-biometric technology gaining popularity for voice-assisted devices and authentication applications. Traditionally, MFCC features are used to extract the speaker-specific information (Maurya et al., 2018), represented as audio features or descriptors. However, the noise present in real scenarios, intra-speaker variability, and similar factors degrade speaker identification performance. Many approaches for noise removal are proposed in the literature (Al-Allaf, 2015; Manasa & Rama, 2020), yet noise reduction remains a major issue for well-known reasons. First, the noise in speech is non-stationary, so estimating its statistics for noise removal is tedious. Second, speech distortion is usually introduced during speech enhancement.

Due to its hidden and perceptual characteristics, identifying a person from whispered utterances is a complicated process. The discriminative ability provided by rich phonation is absent from a whisper (Bimbot et al., 2004; Singh & Joshi, 2020). Audio descriptors (ADs) that work well on a neutral-speech database do not work efficiently on a whispered one. As a result, the customized strategy indicated in Fig. 2 is utilized, in which the suitable ADs are selected at the very beginning. The identification rate is improved by picking ADs appropriate to the type of database and application.

Many low-level audio descriptors are explored in the literature. Energy, speech bandwidth, spectral centroid, zero-crossing rate, etc. are all important attributes of an audio signal (Bhattacharjee et al., 2018; Fan et al., 2011). Mel-Frequency Cepstral Coefficients (MFCC), roll-off, and brightness are examples of higher-level, more complex descriptors utilized for speech and speaker identification; such descriptors are based on a parametric analysis of the spectral envelope (Davis & Mermelstein, 1980). The different audio descriptors used for speech processing should be de-correlated from one another. For music and voice classification, a comparatively simple task, typical results employing improved non-correlated MFCC features report accuracy of up to 95% (Hermansky & Malaya, 1998). These outcomes, however, are based on clean audio signals. When the problems of noise, inter-session variability, telephone speech, etc. are considered, the performance degrades considerably (Dobrowohl et al., 2019; Toonen Dekkers & Aarts, 1995). A whisper resembles noise because of the air turbulence produced by vocal effort; hence the traditional audio descriptors are replaced here by selected timbre descriptors to maximize the accuracy.

The general strategy found in the literature for speaker identification research is to develop a statistical model to justify the applicability of audio features for the database in question and then utilize it (Foulkes & Sóskuthy, 2017; Karvanagh, 2519). This generalized strategy is shown in Fig. 1.

Fig. 1 Block diagram of Generalized Speaker Identification System

However, the unknown and hidden reasons behind a perceptual audio feature's good performance may not be fully justified. Since the whispered database lacks phonation, the intangible timbre features are proposed; timbre has been defined differently by different researchers, and it must capture some additional concealed speaker-specific information.

While sorting the best-performing features, eight candidate ADs are targeted here; they form a good mix of various domains such as time, frequency, cepstrum, and wavelet. Any unknown attribute of these ADs may contribute to the enhancement of performance; therefore the Hybrid Selection method is a good choice. As a result, the modified strategy illustrated in Fig. 2 is adopted, where the selection of suitable ADs is performed at the very beginning.

Fig. 2 Block diagram of Modified Speaker Identification System

This paper is organized as follows. Audio features are classified in two different ways in Sect. 2, followed by a description of the timbre features included in the MIR toolbox and the impact of timbre feature selection on identification accuracy for the whispered database. Section 3 is dedicated to the system description; the methodologies and resources used for the study are explored in detail, including the database, the Hybrid Selection algorithm, and the classifiers (K-means and K-NN). Section 4 emphasizes the use of median values of timbre features for the whispered database and illustrates their impact on decreasing the intra-speaker spread. Results on the performance of K-means/K-NN with different features, namely MFCC only, timbre features, and median values of timbre features, are presented in Sect. 5. A few results on FAR (False Acceptance Rate) and FRR (False Rejection Rate) are also presented.

2 Audio Features

2.1 Review of audio features

Audio features can be divided into two categories. A global descriptor is a type of feature computed on the entire signal; for example, the whole duration of an audio stream can be used to determine the attack time of a sound. Instantaneous descriptors are another class of descriptors that work on a single frame of audio data at a time (40 ms). Because the spectral centroid of an audio signal can change over time, it is referred to as an instantaneous descriptor. As an instantaneous descriptor generates many values for a given number of frames, statistical operations (such as mean or median, standard deviation, and inter-quartile range) are needed to provide a single-value representation. A list of 166 audio features is offered in the CUIDADO project (Peeters, 2004).
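To make the instantaneous/global distinction concrete, a minimal Matlab sketch follows. It computes a frame-wise spectral centroid over 40 ms frames and then reduces it to single-value statistics; the file name, frame hop, windowing, and toolbox functions (hamming, iqr) are illustrative assumptions, not taken from the study.

[x, fs] = audioread('speaker.wav');        % mono speech sample (file name illustrative)
x = mean(x, 2);                            % collapse to one channel if stereo

frameLen = round(0.040 * fs);              % 40 ms instantaneous frames
hop      = round(0.020 * fs);              % 50% overlap (assumed)
nFrames  = floor((length(x) - frameLen) / hop) + 1;
nBins    = floor(frameLen / 2);
f        = (0:nBins - 1)' * fs / frameLen; % frequency axis of the positive bins

centroid = zeros(nFrames, 1);
for i = 1:nFrames
    seg = x((i - 1) * hop + (1:frameLen)) .* hamming(frameLen);
    mag = abs(fft(seg));
    mag = mag(1:nBins);                    % keep positive frequencies only
    centroid(i) = sum(f .* mag) / (sum(mag) + eps);   % instantaneous spectral centroid
end

% Single-value statistical representations of the instantaneous descriptor
summaryStats = [median(centroid), std(centroid), iqr(centroid)];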

Further differentiation can be made based on the method of extraction, as shown in Fig. 3:

Fig. 3 Classification of audio features based on the extraction method

Each extraction method is effective to a different degree depending on the type of database.

2.2 MIR toolbox Matlab for timbre audio descriptors

The Music Information Retrieval (MIR) toolbox is mainly designed to enable the study of the relation between musical attributes and music-induced sensation. The MIR toolbox uses a modular design. Common algorithms used in audio processing, such as segmentation, filtering, and framing, are combined with one or more specialized algorithms at some stage of the processing. These algorithms are available in modular form, and the individual blocks can be integrated to capture particular features (Albert-Ludwigs-Universität Freiburg, 2007).

We use MIRToolbox, an integrated set of functions written in Matlab, dedicated to the extraction from audio recordings of descriptors related to timbre, tonality, rhythm, or musical form. It offers a modular, composable computational approach to Music Information Retrieval (MIR). The different algorithms are decomposed into stages, formalized using a minimal set of elementary mechanisms, and integrated with different variants. We have formulated a processing strategy (Fig. 4) for this study. Before that, it is essential to define the timbre features of concern that are used in the subsequent work.

Fig. 4 Philosophical integration of modules for the timbre features of concern in the MIR toolbox

Roll-off frequency: the frequency below which most of the spectral energy (85% or 95% as a standard) is contained.

Roughness: an estimate of the average dissonance between all pairs of spectral peaks of the signal. It is also an indicator of the presence of harmonics, generally above the 6th harmonic.

Brightness: the percentage of spectral energy above some cut-off frequency.

Irregularity: it may be calculated as the sum of the squared differences in amplitude between adjoining partials, or as the sum, over partials, of the deviation of each amplitude from the mean of the preceding, the same, and the subsequent amplitudes.
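As a concrete illustration of two of these definitions, a minimal Matlab sketch for roll-off and brightness on one frame's magnitude spectrum follows. The 85% energy threshold and the 1500 Hz brightness cut-off are assumed values for illustration only.

% One frame's magnitude spectrum 'mag' with frequency axis 'f' is assumed
% (e.g. from the sketch in Sect. 2.1); thresholds below are illustrative.
energy = mag .^ 2;
total  = sum(energy);

% Roll-off: lowest frequency below which 85% of the spectral energy lies
idx     = find(cumsum(energy) >= 0.85 * total, 1, 'first');
rolloff = f(idx);

% Brightness: fraction of spectral energy above an assumed 1500 Hz cut-off
cutoff     = 1500;                              % Hz (assumed)
brightness = sum(energy(f > cutoff)) / total;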

After defining the timbre features of concern, the discussion on integrating modules in the MIR toolbox resumes. For illustration, measuring irregularity and brightness shares common processes such as reading audio samples, segmentation, filtering, and framing. In the final stage, owing to their characteristic differences, irregularity needs a peak-picking algorithm while brightness needs spectrum analysis. Even the integration of individual stages depends upon parameter variations, e.g. mirregularity (…, ‘Jensen’), where the adjoining partials are taken into consideration, and mirregularity (…, ‘Krimphoff’), which considers the mean of the preceding, same, and next amplitudes.

miraudio: This command loads the appropriate format of an audio file. E.g. miraudio (‘speaker.wav’).

mirsegment: This process splits a continuous audio signal into homogeneous segments.

mirfilterbank: A set of filters is required to select neighboring narrow sub-bands that cover the entire frequency range, while avoiding effects such as aliasing in the reconstruction process. E.g. mirfilterbank (..., ‘Gammatone’) performs a Gammatone filterbank decomposition.

mirframe: The frame decomposition can be performed using the mirframe command. The frames can be specified as follows:

mirframe (x,…, ‘Length’, w, wu).

mirspectrum: the Discrete Fourier Transform decomposes the energy of a signal (be it an audio waveform, an envelope, etc.) along the frequency axis.

Mathematically, for an audio signal x;

$$X_{k} = \mathop \sum \limits_{n = 0}^{N - 1} x_{n} e^{{\frac{ - 2\pi ikn}{N}}} ,\quad k = 0, \ldots , N - 1$$
(1)

This decomposition is performed using a Fast Fourier Transform by the ‘mirspectrum’ function.

mirpeaks: Many features, such as irregularity, require peak analysis. Peaks are calculated from any data x produced in the MIR toolbox using the command ‘mirpeaks(x)’.
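Putting the commands above together, a possible Matlab sketch of the extraction chain is shown below. It assumes MIRtoolbox is on the Matlab path; the feature extractors mirbrightness, mirrolloff, mirroughness, mirregularity, mirmfcc and the accessor mirgetdata are part of the toolbox, but exact option names should be verified against its documentation, and the file name is illustrative.

% Possible chaining of the MIRtoolbox modules described above
a  = miraudio('speaker.wav');                  % load the audio file (name illustrative)
fr = mirframe(a, 'Length', 0.040, 's');        % 40 ms frame decomposition
sp = mirspectrum(fr);                          % FFT of every frame (Eq. 1)

% Frame-wise timbre descriptors of concern
bright = mirgetdata(mirbrightness(sp));               % energy above a cut-off frequency
roll   = mirgetdata(mirrolloff(sp));                  % spectral roll-off frequency
rough  = mirgetdata(mirroughness(sp));                % built on peak analysis (mirpeaks)
irreg  = mirgetdata(mirregularity(sp, 'Jensen'));     % adjoining-partial variant
mfcc   = mirgetdata(mirmfcc(fr));                     % cepstral coefficients per frame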

In most studies, timbre features have been used for music processing. Timbre features cover almost all the feature-extraction domains shown in Fig. 3, and a variety of processing mechanisms is available in the MIR toolbox. Hence, the timbre class is expected to be helpful in capturing hidden speaker-specific information in whispered speech.

3 Description of proposed system

The major system components, namely the database, the role of the Hybrid Selection Algorithm in selecting the ADs, and the classifiers, are described in the subsequent discussion.

3.1 Database

The database utilized for the study comprises 36 speakers with 33 samples each, with a good blend of male and female voices (20 male and 16 female). It is the CHAIN database created at the School of Computer Science and Informatics, University College Dublin (Cummins et al., 2006). Each utterance of 2–3 s duration is recorded at 44.1 kHz. The sentences are chosen from the CSLU and TIMIT databases, which guarantees phonetic balance within the corpus. The database may be divided into different sub-databases (DB1, DB2, DB3, and DB4) to determine the contribution of individual ADs. Figure 5 presents the framework used to analyze all the databases and automatically select the appropriate audio descriptors that maximize the results. The processes can be divided into two parts, using the application software and the system software.

Fig. 5 Speaker identification architecture with feature selection algorithm

3.2 Hybrid selection algorithm

Hybrid selection is an iterative process that begins with the targeted timbre class of audio descriptors and progressively retains the ADs with the best identification result (Deshmukh & Bhirud, 2012). This technique was also utilized to classify abnormal images of liver tissue in Li et al. (2016).

After every iteration, the sorted AD combination that maximizes the classifier accuracy is appended with each of the remaining ADs for the next iteration. The process continues until no further increase in accuracy is observed.

As shown in Fig. 6 below, all eight targeted ADs are individually investigated for accuracy in the speaker identification experiment. (i) Iteration I sorts the three features offering the highest accuracy, namely MFCC, Roll-off, and Brightness. (ii) In Iteration II, each sorted single AD is combined with all remaining ADs and the performance of each two-AD combination is evaluated; the three best-performing pairs sorted are MFCC + Roll-off, MFCC + Brightness, and Brightness + MFCC. (iii) The next iterations sort the three best performances for combinations of three and four ADs, and (iv) the last iteration sorts the best performance for the combination of MFCC, Roll-off, Roughness, Brightness, and Irregularity. The process then terminates, as appending further ADs brings no improvement; a sketch of this loop is given below.
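A minimal Matlab sketch of the greedy forward loop described above follows. The helper evaluateAccuracy is hypothetical: it stands for training and testing the classifier on the given descriptor subset and returning identification accuracy; the list of AD names is also illustrative.

% Greedy forward (hybrid) selection over the eight targeted ADs.
% evaluateAccuracy is a hypothetical helper that trains and tests the
% classifier on the given descriptor subset and returns accuracy (%).
allADs   = {'MFCC', 'RollOff', 'Brightness', 'Roughness', 'Irregularity', ...
            'AD6', 'AD7', 'AD8'};              % names illustrative
selected = {};
bestAcc  = 0;
improved = true;

while improved && numel(selected) < numel(allADs)
    improved  = false;
    remaining = setdiff(allADs, selected);
    accs      = zeros(1, numel(remaining));
    for i = 1:numel(remaining)
        accs(i) = evaluateAccuracy([selected, remaining(i)]);  % append one AD
    end
    [acc, best] = max(accs);
    if acc > bestAcc                           % keep the AD only if accuracy improves
        bestAcc  = acc;
        selected = [selected, remaining(best)];
        improved = true;
    end                                        % otherwise terminate: no further gain
end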

Fig. 6 Pictorial demonstration of the Hybrid Selection Algorithm

3.3 Classifiers

3.3.1 K-means classifier

A semi-supervised strategy based on K-means clustering is adopted in this study. The algorithm partitions the audio feature samples into clusters. To partition the data, ‘k’ centroids are assumed, and each feature vector is assigned to a cluster based on its minimum distance from the corresponding centroid. K-means clustering aims to partition the ‘n’ observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance (Ito et al., 2005). To be specific, the purpose is to find:

$$\mathop {\arg \min }\limits_{S} \mathop \sum \limits_{i = 1}^{k} \mathop \sum \limits_{{x \in S_{i} }} \left\| {x - \mu_{i} } \right\|^{2} = \mathop {\arg \min }\limits_{S} \mathop \sum \limits_{i = 1}^{k} \left| {S_{i} } \right|\,{\text{Var}}\,S_{i}$$
(2)
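A minimal Matlab sketch of this clustering step, assuming the Statistics and Machine Learning Toolbox, is shown below. The feature matrix F (one row per sample) and the choice of 36 clusters, one per enrolled speaker, are illustrative assumptions.

% F: feature matrix with one row per whispered sample (columns = descriptors)
K = 36;                                        % one cluster per enrolled speaker (assumed)
[idx, C, sumd] = kmeans(F, K, ...
                        'Distance', 'sqeuclidean', ...   % WCSS criterion of Eq. (2)
                        'Replicates', 5);                % restarts to avoid poor local minima

wcss = sum(sumd);                              % total within-cluster sum of squares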

3.3.2 K-nearest neighbor (K-NN)

K-NN is a straightforward, non-parametric algorithm that separates the data points into several classes. Query samples are assigned to the defined classes based on a distance metric. It is also called a lazy algorithm, as the classification does not make any assumptions about the distribution of the data; real-world data do not comply with the usually assumed patterns (e.g. linear regression models), so the K-NN classifier is useful in general. With this classifier, the following parameters are used: the number of nearest neighbors (k), a distance function (d), a decision rule, and n labeled samples of audio files Xn. The query sample is assigned to one of the labels among the existing classes based on the minimum distance within the proximity of several neighbors (2-NN (nearest neighbors), 3-NN) from the training classes. In other words, K-NN estimates the a posteriori class probability P(wi|x), given the class prior P(wi), as below:

$$P\left( {\left. {w_{i} } \right|x} \right) = \frac{{k_{i} }}{k} \cdot P\left( {w_{i} } \right)$$
(3)

where ki is the number of vectors belonging to class wi within the subset of the k nearest vectors (Shah et al., 2015).

The KNN classifier allocates a class label to the query sample based on the closest distance from the training classes, called the nearest neighbors. The five selected features, namely brightness, roll-off, irregularity, roughness, and MFCC, are extracted and arranged in vector form. The distances between the query feature vector and the feature vectors of all existing classes are calculated. The Euclidean distance is a popular distance metric, and the City-block distance is an alternative that minimizes the effect of any strongly deviating feature(s); both distance metrics are exercised in the study (Sreelekshmi & Syama, 2017).

  • Euclidean Distance: the n-dimensional Euclidean distance is given as:

    $$D\left( {x,y} \right) = \sqrt {\left( {x_{1} - y_{1} } \right)^{2} + \left( {x_{2} - y_{2} } \right)^{2} + \ldots + \left( {x_{n} - y_{n} } \right)^{2} }$$
    (4)

    where x is the coordinates of the training feature vector and y is the coordinates of a query feature vector.

  • City-block: the City-block (Manhattan) distance between a pair of points, x and y, with n dimensions is calculated as:

    $$D\left( {x,y} \right) = \mathop \sum \limits_{j = 1}^{n} \left| {x_{j} - y_{j} } \right|$$
    (5)

The feature vector consists of multiple features, and some features may show high intra-speaker variation (though undesirable) for some speech samples. With the City-block distance the component differences are not squared, so the effect of such a large deviation in a single dimension is diminished.

For our system, all the variants of the KNN classifier are verified to maximize the identification accuracy. The values tested for the number of nearest neighbors are 1-NN, 2-NN, and 3-NN; two distance functions, Euclidean and City-block, are investigated; and the decision rules nearest and consensus are also tested. After a variety of experiments, it is concluded that the combination of 3-NN, City-block distance, and the nearest rule gives the maximum identification accuracy, as reported in Sardar and Shirbahadurkar (2018).
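A minimal Matlab sketch of this configuration, assuming the Statistics and Machine Learning Toolbox, is given below. Xtrain/Xtest hold the five-descriptor feature vectors, ytrain/ytest hold numeric speaker labels; all variable names are illustrative.

% 3-NN with City-block distance and the nearest decision rule
mdl = fitcknn(Xtrain, ytrain, ...
              'NumNeighbors', 3, ...           % 3-NN
              'Distance', 'cityblock', ...     % Eq. (5); robust to one deviating feature
              'BreakTies', 'nearest');         % "nearest" decision rule

predicted = predict(mdl, Xtest);               % speaker label for each query sample
accuracy  = mean(predicted == ytest);          % identification accuracy (numeric labels)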

4 The role of median values of timbre features

In the speaker identification task, intra-speaker variability is one of the reasons for degraded performance. The standard deviation (σ) is a statistical tool used to examine the variations among samples of the same speaker, and hence among the corresponding feature values. The evaluated standard deviation would need to be either added to or subtracted from the feature value (i.e. feature value ± σ). However, using standard deviations for feature modification is intricate for two reasons. First, it requires a complex decision algorithm for every feature value of every speaker sample. Second, modifying the feature values with the standard deviation may exceed the normalization range. The MEDIAN formula, considering even and odd numbers of samples, is as below:

$${\text{MEDIAN}}\left( X \right) = \left\{ {\begin{array}{ll} {X\left[ {\frac{n + 1}{2}} \right]} & {if\; n\; is\; odd} \\ {\frac{{X\left[ {\frac{n}{2}} \right] + X\left[ {\frac{n}{2} + 1} \right]}}{2}} & {if\; n\; is\; even} \\ \end{array} } \right.$$
(6)

X = ordered list of values in a data set; n = number of values in the data set.
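The sketch below shows, under illustrative naming assumptions, how the frame-wise timbre values of one utterance can be reduced to their median, the single value per descriptor used in the rest of this section.

% frameFeat: rows are frames of one utterance, columns are the selected
% descriptors (MFCC summary, roll-off, brightness, roughness, irregularity)
featVec = median(frameFeat, 1);        % Eq. (6): one median value per descriptor

% The standard-deviation route discussed above would instead need a per-feature
% decision between featVal + sigma and featVal - sigma, and can leave the
% normalization range; the median avoids both issues.
sigma = std(frameFeat, 0, 1);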

As a result, the MEDIAN can be thought of as the fully trimmed mid-range. The median values of the individual features are chosen to minimize the intra-speaker spread. The following illustration uses a few samples of five speakers to represent the feature vector (MFCC + Roll-off + Brightness + Roughness + Irregularity) in the feature space by a single-valued dot. Part (a) of the figure shows the plot of the feature vector when the direct values of the timbre features are utilized, while part (b) is the plot after using the median values of the timbre features (Fig. 7).

Fig. 7 Effect on the intra-speaker variability due to the use of median values of timbre features

The illustration in Fig. 7 shows that the feature samples of each class (here, each speaker) are closely spaced with minimum intra-speaker distance when median values of the timbre features are used instead of the absolute values.

5 Results and evaluation

5.1 Identification accuracy

The Mel Frequency Cepstral Coefficient (MFCC) is a widely used feature for speaker identification tasks. Table 1 illustrates the comparative performance of the K-means and K-NN classifiers. A total of 35 speakers with 33 whispered samples each from the CHAIN database are tested in a whisper train-whisper test scenario. The samples of each speaker are split into 70% for training and 30% for testing.

The results shown in Table 1 for the same feature (i.e. MFCC) show that the K-NN classifier is better suited here.

Table 1 Comparative accuracy using MFCC features and K-means/KNN classifiers

Table 2 shows the results using the K-NN classifier with the parameter settings Rule: nearest, Neighbors: 3, and distance metric: City-block. The timbre features selected by the Hybrid Selection Algorithm are MFCC, Roll-off, Brightness, Roughness, and Irregularity. The results are also examined using the median values of the timbre features (Fig. 8).

Table 2 Comparative Identification accuracy by using features MFCC, Timbre only and Timbre (Median)
Fig. 8 Comparison of speaker identification accuracy of the proposed study with MFCC and the baseline system

Compared to the conventional MFCC features, the identification accuracy utilizing the chosen timbre descriptors is improved by 7.72%. Further, using median values of the timbre features enhances the outcome by about 2.23%, owing to the compensation of the intra-speaker spread by the median values (Table 2).

The results are compared with a baseline speaker identification system, which also used the whispered data from the CHAINs database. Its highest identification accuracy, 83.75%, was obtained using an NDMP-based fusion system (α = 0.70) + SVM in the whisper train-whisper test setting, as reported in Wang et al. (2015) and reproduced in Table 3.

Table 3 Baseline results of speaker identification accuracy by Timbre features

Compared to the highest result given by the baseline system (83.75%, Table 3), the result achieved by the proposed system using median values of timbre features, i.e. 88.76% (Table 2), represents an increase of 5.01%.

5.2 False acceptance and false rejection rate

The false-positive rate (FPR) is the proportion of all negatives that still yield positive test outcomes, while the false-negative rate (FNR) is the proportion of all positives that yield negative test outcomes.

$${\text{FPR }} = {\text{ FAR }} = {\text{ FP}}/ \, \left( {{\text{FP}} + {\text{TN}}} \right)$$
(7)
$${\text{FNR }} = {\text{FRR }} = {\text{ FN}}/ \, \left( {{\text{FN}} + {\text{TP}}} \right)$$
(8)

True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are the counts used in Eqs. (7) and (8).
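A minimal Matlab sketch of these per-speaker calculations is given below; it reuses the predicted and ytest vectors from the K-NN sketch in Sect. 3.3.2 and treats the evaluation as one-vs-rest counting. The speaker index is illustrative.

% One-vs-rest counts for a single speaker (index illustrative)
spk = 18;
TP = sum(predicted == spk & ytest == spk);
FN = sum(predicted ~= spk & ytest == spk);
FP = sum(predicted == spk & ytest ~= spk);
TN = sum(predicted ~= spk & ytest ~= spk);

FAR = FP / (FP + TN);                  % false acceptance rate, Eq. (7)
FRR = FN / (FN + TP);                  % false rejection rate, Eq. (8)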

Table 4 shows the performance calculation on a per-sample basis, i.e. FPR and FNR. Five speakers, numbered 18 to 22, are randomly considered for the calculation of FAR and FRR.

Table 4 Sample calculations of performance parameters for speakers 18 to 22

FNR and FPR should undoubtedly both be low, but they influence different applications differently.

6 Conclusion

A variety of audio descriptors is available, and they are selected to suit the application. A drastic change in the characteristics of a whisper is observed compared to the neutral voice; hence, the multidimensional and perceptually motivated timbre features are assumed to be most appropriate. However, it is advisable to utilize a constrained set of well-performing descriptors for high speed and performance. The Hybrid Selection Algorithm sorted five features based on the best performance using the CHAINs database. The selected timbre features (MFCC, Brightness, Roll-off, Roughness, and Irregularity) are used as the feature vector for speaker identification. Using these timbre features enhances the identification accuracy by 7.72% compared to the widely used MFCC features. The speaker identification task generates false positive outcomes due to intra-speaker variability; hence, the MEDIAN values of the timbre features are utilized to reduce the intra-speaker spread, which further improves speaker identification by 2.23%. The aggregate result considers the complete database. This motivates future work to investigate the effectiveness of the selected features on unvoiced phonemes in whispered speech, which will shed light on speaker identification and other speech processing applications.