Introduction

Voice timbre is among the primary acoustic characteristics of a speaker’s vocal tract and has attracted the attention of researchers and specialists across a wide range of fields for many years [1, 2]. Voice timbre analysis is therefore a classic problem in the acoustic measurement of speech signals [3,4,5], and the comparative analysis of speech signals by voice timbre is an important aspect of this problem. The latter arises in the design and study of modern automatic speech processing systems intended for a wide range of purposes [5,6,7,8].

Despite years of research into the acoustic characteristics of the speaker’s vocal tract, work in this field continues to expand [9,10,11] because, in the author’s view, a number of theoretical problems remain unresolved. One of the most important concerns small observation samples [12]. In the case studied here, the sample size is strictly limited to two or three periods of the fundamental pitch (T0 = 5–10 ms), over which the vocalized speech signal can be considered steady-state [13].

Since the voice timbre is defined by the fine structure of a speech signal within one such period, the sequence of observations x(i), \(i=1, 2, \ldots\), should be synchronized with the vibrations of the speaker’s vocal cords [14, 15]. However, under conditions of a priori uncertainty and small sample sizes, such synchronization is practically unachievable. Therefore, the topic of this study is highly relevant.

The goal of this work is to develop an objective measure of differences in speech signals by the voice timbre, which does not require synchronization of observations with the fundamental pitch period. To achieve this goal, a universal information-theoretic approach and methodology of the acoustic theory of speech production were used.

This article was written to further advance the results of the previous studies performed by the author in collaboration with the personnel of the Laboratory of Algorithms and Technologies of Network Structure Analysis at the National Research University “Higher School of Economics” [16, 17].

Problem statement

Let x(t) be a speech signal observed in discrete time t = iT, i = 1, 2, …, N, with sampling period T and samples \(x(i)=x(iT)\), over the observation interval of a vocalized (vowel) speech sound of duration Tob = MT0, where \(M\geq 1\). Assuming that the first sample x(1) coincides with the beginning of the observation interval, the sample size is \(N=nM\), where \(n=[T_{0}T^{-1}]=[F{F}_{0}^{-1}]\); \(F_{0}={T}_{0}^{-1}\) is the fundamental pitch frequency; \(F=T^{-1}\) is the sampling frequency of the speech signal; and [·] denotes the integer part of a number. In this case, the observations (analysis [15]) are said to be synchronized with the fundamental pitch of the speech signal. We now divide the N-sequence of samples {x(i)} into M partial sequences \(\{x_{m}(i), m\leq M\}\), each of dimensionality \(n=NM^{-1}\gg 1\). For example, at F = 8 kHz and F0 = 100 Hz (a standard value of the fundamental pitch frequency for male voices [16]), there are n = 8000/100 = 80 samples \(x_{m}(i)=x(mT_{0}+iT)\), \(i\leq n\), within each individual period of the fundamental pitch. According to the acoustic theory of speech production [18,19,20], it is these samples that determine the speaker’s voice timbre. Formally, therefore, the voice timbre can be described by an intraperiodic (within one period of the fundamental pitch) autocorrelation function of the sequence {xm(i)} of speech signal observations over the finite interval \(T_{0}=nT\) [10]. The statistical equivalent of this function is the empirical (sample) autocorrelation (p×p)-matrix [7]:

$$S_{x}\triangleq M^{-1}\sum _{m=1}^{M}\boldsymbol{x}_{m}\;{\boldsymbol{x}}_{m}^{\top}\;{,}$$
(1)

defined over a set {xm} of p-dimensional (vector) observations \(\boldsymbol{x}_{m}=\mathrm{col}_{p}\{x_{m}(i)\}\), synchronous with the fundamental pitch of the speech signal x(t). Here, \(\mathrm{col}_{p}\{\cdot\}\) is a column vector of dimensionality p ≤ n, and ≜ denotes equality by definition. Similarly, for any other speech signal y(t), there is an empirical autocorrelation matrix:

$$S_{y}\triangleq M^{-1}\sum _{m=1}^{M}\boldsymbol{y}_{m}{\boldsymbol{y}}_{m}^{\top}\;{,}$$
(2)

where \(\boldsymbol{y}_{m}=\mathrm{col}_{p}\{y_{m}(i)\}\) is the p-column-vector of synchronous observations ym(i) in discrete time \(i=1,2,\ldots ,n\).
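As an illustration of Eqs. 1 and 2, the following minimal sketch builds the empirical autocorrelation matrix from a pitch-synchronous signal. The function name `sync_autocorr_matrix`, the choice of taking the first p samples of each period as the vector x_m, and the random placeholder signal are assumptions made for the example; they are not taken from the original study.

```python
import numpy as np

def sync_autocorr_matrix(x, n, p):
    """Empirical (p x p) autocorrelation matrix in the sense of Eq. 1.

    x : 1-D signal whose first sample is ASSUMED to coincide with the
        start of a pitch period (the synchronization requirement).
    n : number of samples per pitch period, n = int(F / F0).
    p : dimensionality of each observation vector, p <= n.
    """
    x = np.asarray(x, dtype=float)
    M = len(x) // n                      # number of whole pitch periods
    if M < 1 or p > n:
        raise ValueError("need at least one full period and p <= n")
    # Row m holds the first p samples of the m-th period (assumed layout of x_m).
    frames = x[:M * n].reshape(M, n)[:, :p]
    return frames.T @ frames / M         # M^-1 * sum_m x_m x_m^T

F, F0 = 8000, 100
n = F // F0                              # 80 samples per pitch period
rng = np.random.default_rng(0)
x = rng.standard_normal(3 * n)           # placeholder signal, M = 3 periods
S_x = sync_autocorr_matrix(x, n, p=10)
```

The sketch makes the practical difficulty visible: the reshape into periods is only meaningful if the pitch period n and the position of the first sample are known in advance.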

Following the information theory of speech perception [6, 21], we will use matrices (1) and (2) as a basis of the information-theoretic approach to the automatic differentiation of speech signals x(t) and y(t) by the voice timbre.

Kullback-Leibler divergence

We now determine the Kullback-Leibler divergence [22] between two Gaussian probability distributions specified by their autocorrelation matrices Sx and Sy in a p-dimensional sample space (see Footnote 1):

$$\rho _{x,y}\triangleq 0.5M\left[\mathrm{tr}\left(S_{x}{S}_{y}^{-1}\right)+\mathrm{tr}\left(S_{y}{S}_{x}^{-1}\right)-2p\right]\geq 0$$
(3)

where tr(·) denotes the trace (spur) of a square (p×p) matrix.
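The statistic of Eq. 3 can be sketched numerically as follows. The function name `divergence_measure` and the randomly generated positive-definite matrices are assumptions for illustration only. Full-rank placeholder matrices are used because a sample matrix built from fewer than p observation vectors is singular and cannot be inverted.

```python
import numpy as np

def divergence_measure(S_x, S_y, M):
    """Eq. 3: 0.5*M*[tr(S_x S_y^-1) + tr(S_y S_x^-1) - 2p]."""
    p = S_x.shape[0]
    # tr(S_x S_y^-1) = tr(S_y^-1 S_x); solve() avoids forming explicit inverses.
    t_xy = np.trace(np.linalg.solve(S_y, S_x))
    t_yx = np.trace(np.linalg.solve(S_x, S_y))
    return 0.5 * M * (t_xy + t_yx - 2 * p)

# Hypothetical, well-conditioned stand-ins for S_x and S_y.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
S_x = A @ A.T + 4.0 * np.eye(4)
S_y = B @ B.T + 4.0 * np.eye(4)

rho = divergence_measure(S_x, S_y, M=3)
rho_same = divergence_measure(S_x, S_x, M=3)   # identical matrices give ~0
```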

As shown in Ref. [21], Eq. 3 defines the asymptotically optimal (as \(M\rightarrow \infty\)) decision statistic in the problem of differentiating two speech signals x(t) and y(t) based on finite observation samples. However, the practical use of Eq. 3 as a measure of differences between such speech signals is greatly limited by the requirement for synchronization of their vector observations {xm} and {ym} with the fundamental pitch of the corresponding signal.

To circumvent this issue, the problem at hand will be reduced to signal processing in the frequency domain, where there is fundamentally no need to synchronize the observation sequence. In Ref. [23], the frequency equivalent of the information divergence (3) is justified using the formula for a scale-invariant modification of the COSH distance (see Footnote 2):

$$\rho _{x,y}=\sqrt{\left[F^{-1}\int _{-0.5F}^{0.5F}\hat{G}_{x}\left(f\right){\hat{G}}_{y}^{-1}\;\left(f\right)df\right]\;\left[F^{-1}\int _{-0.5F}^{0.5F}\hat{G}_{y}\left(f\right){\hat{G}}_{x}^{-1}\left(f\right)df\right]}-1 \geq 0.$$
(4)

Bartlett’s periodograms [24, 25] from Eq. 4:

$$\begin{cases} \hat{G}_{x}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left(nT\right)^{-1}\left| T\sum _{i=1}^{n}x_{m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2};\\ \hat{G}_{y}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left(nT\right)^{-1}\left| T\sum _{i=1}^{n}y_{m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}. \end{cases}$$
(5)

are used as statistical estimates of the intraperiodic spectra of power of speech signals x(t) and y(t) based on the discrete observation samples.

The scale invariance property of measure (4) can be easily confirmed by bringing the arbitrary gain coefficients for the signals {xm(i)} and {ym(i)} under the absolute value sign on the right-hand side of Eq. 5. The result will remain unchanged regardless [16]. However, this does not solve the main problem of automatic speech processing when analyzing the voice timbre, which is the synchronization of the observation sequence with the fundamental pitch of speech signals.
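The scale invariance of measure (4) can be checked numerically with a short sketch. Here the integrals of Eq. 4 are approximated by averages over a uniform FFT frequency grid, and the periodograms of Eq. 5 are averaged over M frames of n samples. The function names, the FFT length, and the white-noise placeholder signals are assumptions for illustration.

```python
import numpy as np

def bartlett_periodogram(x, n, T, nfft=256):
    """Eq. 5: average of M frame periodograms, each frame n samples long."""
    x = np.asarray(x, dtype=float)
    M = len(x) // n
    frames = x[:M * n].reshape(M, n)
    # |T * DFT|^2 / (n*T); the index offset (i = 1 vs i = 0) only changes phase.
    spectra = np.abs(T * np.fft.fft(frames, n=nfft, axis=1)) ** 2 / (n * T)
    return spectra.mean(axis=0)

def cosh_measure(G_x, G_y):
    """Eq. 4 with F^-1 * integral replaced by a mean over the frequency grid."""
    return np.sqrt(np.mean(G_x / G_y) * np.mean(G_y / G_x)) - 1.0

F = 8000
T = 1.0 / F
n = 80
rng = np.random.default_rng(0)
x = rng.standard_normal(2 * n)           # placeholder signals, M = 2 frames
y = rng.standard_normal(2 * n)

G_x = bartlett_periodogram(x, n, T)
G_y = bartlett_periodogram(y, n, T)
rho = cosh_measure(G_x, G_y)
# Multiplying x by an arbitrary gain scales G_x by gain^2 and leaves rho unchanged.
rho_scaled = cosh_measure(bartlett_periodogram(5.0 * x, n, T), G_y)
```

The gain cancels because it multiplies one factor under the square root and divides the other.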

Method of asynchronous analysis of voice timbre

Considering that under the general assumptions [1], partial oscillations

$$x_{m}\left(i\right)=a_{x}h_{x,m}\left(i\right);\quad y_{m}\left(i\right)=a_{y}h_{y,m}\left(i\right),\ i=1,2,\ldots ,$$
(6)

(where ax, ay = const) are determined by the dynamics of the pulse response characteristics hx,m(i) and hy,m(i) of the linear (filter-based) “acoustic tube” model of the vocal tract, which is inherently stable in terms of digital filtering [26], and therefore decay over time, we will rewrite Eq. 4 in an asymptotically equivalent form:

$$\begin{aligned}[b] \rho _{x,y}&=\sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{x,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{y,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}df}\times \\ &\quad \times \sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{y,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{x,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}df}-1. \end{aligned}$$
(7)

The expressions under the absolute value sign from Eq. 7, through the Fourier transform of the corresponding pulse characteristics (6) in discrete time i, determine two complex transfer coefficients:

$$\begin{aligned}[b]K_{x,m}\left(\mathrm{j}f\right)&=T\sum _{i=1}^{\infty }h_{x,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right);\\ K_{y,m}\left(\mathrm{j}f\right)&=T\sum _{i=1}^{\infty }h_{y,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\end{aligned}$$

From Eq. 7 the following expression can be obtained:

$$\begin{aligned}[b] \rho _{x,y}&=\sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\left| K_{x,m}\;\left(\mathrm{j}f\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\;\left| K_{y,m}\;\left(\mathrm{j}f\right)\right| ^{2}}\;d\;f}\times \\ &\quad\times \sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\;\left| K_{y,m}\;\left(\mathrm{j}f\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\;\left| K_{x,m}\;\left(\mathrm{j}f\right)\right| ^{2}}\;df}-1=\\ &=F^{-1}\sqrt{\int _{-0.5F}^{0.5F}\frac{{K}_{x}^{2}\left(f\right)}{{K}_{y}^{2}\left(f\right)}\;df\int _{-0.5F}^{0.5F}\frac{{K}_{y}^{2}\left(f\right)}{{K}_{x}^{2}\left(f\right)}\;df}-1. \end{aligned}$$
(8)

Thus, the problem comes down to determining the average statistical values of the squares of the amplitude-frequency characteristics (AFC) of the speaker’s vocal tract:

$${K}_{x}^{2}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left| K_{x,m}\left(\mathrm{j}f\right)\right| ^{2};\ {K}_{y}^{2}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left| K_{y,m}\left(\mathrm{j}f\right)\right| ^{2}.$$

This is a typical problem of statistical analysis and speech modeling [6, 27]. A variety of theoretical approaches have been developed for solving it [18, 19], the most relevant being the methods of parametric spectral analysis [24, 25], and specifically Burg’s method (see Footnote 3).

Example of practical implementation

According to the universal all-pole model of the speaker’s vocal tract within short (10–20 ms) intervals of vocalized verbal speech, the desired amplitude-frequency characteristics can be determined using the formula for calculating the absolute value of the complex transfer coefficient of a recursive filter of the pth order [23]:

$$K_{x}\left(f\right)=b_{x}\left| 1-\sum _{k=1}^{p}a_{x,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)\right| ^{-1};$$
(9)
$$K_{y}\left(f\right)=b_{y}\left| 1-\sum _{k=1}^{p}a_{y,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)\right| ^{-1},$$
(10)

where |f| ≤ 0.5F; bx and by are the gain factors of the signals x(t) and y(t), respectively, in the speaker’s vocal tract; and ax,p(k) and ay,p(k) are the autoregression coefficients of finite order p (k is the coefficient index).
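The amplitude-frequency characteristic of Eq. 9 can be evaluated directly on a frequency grid. The sketch below assumes a hypothetical first-order coefficient vector so that the result is easy to check by hand; the function name `afc` and the grid size are likewise illustrative.

```python
import numpy as np

def afc(a, F, b=1.0, n_freq=512):
    """Eq. 9: K(f) = b / |1 - sum_k a(k) exp(-j 2 pi k f T)| for |f| <= 0.5 F."""
    a = np.asarray(a, dtype=float)
    T = 1.0 / F
    f = np.linspace(-0.5 * F, 0.5 * F, n_freq, endpoint=False)
    k = np.arange(1, len(a) + 1)
    denom = 1.0 - np.exp(-2j * np.pi * np.outer(f, k) * T) @ a
    return f, b / np.abs(denom)

# Hypothetical first-order model a(1) = 0.9 at F = 8 kHz.
f, K = afc([0.9], F=8000)
K_dc = K[len(f) // 2]        # f = 0:      1 / |1 - 0.9| = 10
K_edge = K[0]                # f = -0.5 F: 1 / |1 + 0.9|
```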

Considering Eqs. 9 and 10, we can rewrite Eq. 8 as follows:

$$\begin{aligned}[b] \rho _{x,y}&=\sqrt{F^{-1}\int _{-0.5F}^{0.5F}\left| \frac{1-\sum _{k=1}^{p}a_{y,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}{1-\sum _{k=1}^{p}a_{x,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}\right| ^{2}df}\times \\ &\quad \times \sqrt{F^{-1}\int _{-0.5F}^{0.5F}\left| \frac{1-\sum _{k=1}^{p}a_{y,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}{1-\sum _{k=1}^{p}a_{x,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}\right| ^{-2}df}-1. \end{aligned}$$
(11)

The integrands in Eq. 11 are the direct and inverse ratios of the squares of the two normalized amplitude-frequency characteristics (9) and (10) (normalized in the sense that \(b_{x}=b_{y}=1\)). The gain coefficients bx and by therefore play no role. The autoregression coefficients ax,p(k) and ay,p(k) are fitted to the speech signals x(t) and y(t) for all \(k\leq p\) from the corresponding observation samples {x(i)} and {y(i)} by one of the known methods. For instance, this could be Burg’s method, which is based on the Levinson recursion [24]:

$$\begin{aligned}[b]\forall q&=\overline{1,p}\colon a_{x,q}\left(i\right)=a_{x,q-1}\left(i\right)+c_{q}a_{x,q-1}\left(q-i\right),\quad i=1,2,\ldots ,q;\\ c_{q}&=-\frac{2\sum _{n=q}^{N-1}\eta _{q-1}\left(n\right)\nu _{q-1}\left(n-1\right)}{\sum _{n=q}^{N-1}\left[{\eta }_{q-1}^{2}\left(n\right)+{\nu }_{q-1}^{2}\left(n-1\right)\right]};\\ \eta _{q}\left(n\right)&=\eta _{q-1}\left(n\right)+c_{q}\nu _{q-1}\left(n-1\right);\\ \nu _{q}\left(n\right)&=\nu _{q-1}\left(n-1\right)+c_{q}\eta _{q-1}\left(n\right)\end{aligned}$$
(12)

with the recursion initialized by the system of equalities \(\nu _{0}(n)=\eta _{0}(n)=x(n)\backslash y(n)\), \(n=1,2,\ldots ,N\) (the symbol \ denotes the choice function OR). The final values of recursion (12) (at q = p), taken with the opposite sign, determine the two p-vectors of coefficients {ax,p(k)} and {ay,p(k)} on the right-hand side of Eqs. 9 and 10.
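Recursion (12) can be sketched in a few lines. The version below is a minimal textbook-style implementation written for this illustration, with zero-based array indexing instead of the one-based indexing of Eq. 12. It is not the code of the Phoneme Training software, and the second-order test process with coefficients 1.5 and −0.8 is a hypothetical example.

```python
import numpy as np

def burg(x, p):
    """Burg-type estimate of p autoregression coefficients (sketch of Eq. 12).

    Returns a(k), k = 1..p, in the sign convention of Eqs. 9 and 10,
    i.e. the prediction-error filter is 1 - sum_k a(k) z^-k.
    """
    x = np.asarray(x, dtype=float)
    eta = x[1:].copy()        # forward errors  eta_{q-1}(n)
    nu = x[:-1].copy()        # backward errors nu_{q-1}(n-1)
    a = np.array([1.0])       # error-filter coefficients, a(0) = 1
    for _ in range(p):
        c = -2.0 * np.dot(eta, nu) / (np.dot(eta, eta) + np.dot(nu, nu))
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + c * a_ext[::-1]          # Levinson update of the coefficients
        eta_new = eta + c * nu
        nu_new = nu + c * eta
        eta, nu = eta_new[1:], nu_new[:-1]   # realign errors for the next order
    return -a[1:]             # "taken with the opposite sign"

# Hypothetical stable AR(2) test process: x(i) = 1.5 x(i-1) - 0.8 x(i-2) + e(i).
rng = np.random.default_rng(0)
N = 4000
e = rng.standard_normal(N)
x = np.zeros(N)
for i in range(N):
    x[i] = e[i]
    if i >= 1:
        x[i] += 1.5 * x[i - 1]
    if i >= 2:
        x[i] -= 0.8 * x[i - 2]

a_hat = burg(x, 2)            # expected to lie close to (1.5, -0.8)
```

No pitch period or frame boundary enters the estimate, which is the property the asynchronous analysis relies on.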

Thus, Eq. 11 together with recursion (12) defines a scale-invariant measure of differences in speech signals by the voice timbre of one or two different speakers. It does not require synchronization of observations with the fundamental pitch of speech signals. The potential of the proposed measure can be illustrated by the results of the experiment described below, in which the author’s software Phoneme Training was used (see Footnote 4).
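Given two coefficient vectors, measure (11) reduces to two averages over frequency. The sketch below approximates the integrals by means over a uniform grid of normalized frequencies fT; the function names and the grid size are assumptions. The two input vectors are the ones reported for the phoneme “a” pair in the first stage of the experiment. No particular output value is claimed here, since the result depends on the grid and on the details of the original implementation.

```python
import numpy as np

def ar_polynomial(a, nu):
    """A(f) = 1 - sum_k a(k) exp(-j 2 pi k f T), evaluated at nu = f*T."""
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(a) + 1)
    return 1.0 - np.exp(-2j * np.pi * np.outer(nu, k)) @ a

def timbre_measure(a_x, a_y, n_freq=2048):
    """Eq. 11 with F^-1 * integral replaced by a mean over nu in [-0.5, 0.5)."""
    nu = np.arange(n_freq) / n_freq - 0.5
    ratio = np.abs(ar_polynomial(a_y, nu) / ar_polynomial(a_x, nu)) ** 2
    return np.sqrt(np.mean(ratio) * np.mean(1.0 / ratio)) - 1.0

# Coefficient vectors of order p = 10 reported in the article for one pair.
a_x = [1.364686, -1.08823, 0.532204, -0.80853, 0.906187,
       -0.43502, 0.107709, -0.17596, 0.40483, -0.12711]
a_y = [1.368958, -1.01194, 0.457037, -0.84364, 1.049269,
       -0.55294, 0.145535, -0.27892, 0.529057, -0.18108]

rho = timbre_measure(a_x, a_y)
print(f"rho = {rho:.4f}")
```

By the Cauchy-Schwarz inequality the value is non-negative, and it is zero only when the two normalized characteristics coincide.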

Experimental procedure and results

The experimental program consisted of two stages.

First stage. During the first stage, the sensitivity of the new measure (11) to differences in the fine structure of speech signals was studied for observations that were asynchronous relative to the fundamental pitch. The study focused on long (approximately 1.5 to 2 s) signals in the form of vowel phonemes of the reference speaker (the author of this article). Using the Phoneme Training software, each such signal was transformed into a sequence of homogeneous (monophonic) frames x(t)\y(t) of relatively short duration: Tob = 16 ms. It was assumed that all frames of the same phoneme are characterized by the same voice timbre, whereas frames of different phonemes differ fundamentally in voice timbre.

The graphs in Fig. 1 show timing diagrams of the phoneme “a” signal for synchronous (a) and asynchronous (b) discrete observations relative to the fundamental pitch. The sampling frequency (8 kHz) was in both cases consistent with the standard telephone bandwidth (4 kHz). In the first case, the signal covers two full periods of the fundamental pitch, while in the second case it covers only one. As a result, the fine structures of the two signals are very different, while the voice timbre is practically the same in both cases. There is no contradiction here: the two observations differ in their overall fine structure, whereas voice timbre is determined only by the intraperiodic fine structure. This severely limits the use of the classical periodogram estimates (5) when the observations are asynchronous.

Fig. 1. Timing diagrams of the phoneme “a” signal in two variants of discrete observations: synchronous (a) and asynchronous (b) with the fundamental pitch

In accordance with the experimental procedure, measures (11) were calculated for different pairs of speech signals x(t) and y(t) within the set of prepared voice samples. For each such pair, two corresponding vectors \(\boldsymbol{a}_{x}=\{a_{x,p}(k)\}\) and \(\boldsymbol{a}_{y}=\{a_{y,p}(k)\}\) of autoregression coefficients of order p = 10 (see Footnote 5) were first calculated using algorithm (12). These vectors were then used to calculate the measure of differences ρx,y according to Eq. 11. For example, for a pair of homonymous signals (see Fig. 1), the following two vectors were obtained: ax = (1.364686; −1.08823; 0.532204; −0.80853; 0.906187; −0.43502; 0.107709; −0.17596; 0.40483; −0.12711) ≜ ax*; ay = (1.368958; −1.01194; 0.457037; −0.84364; 1.049269; −0.55294; 0.145535; −0.27892; 0.529057; −0.18108), based on which the measure ρx,y = 0.009 \(\ll 1\) was calculated. Similar results were obtained for all other experimental pairs {x(t), y(t)} composed of monophonemic voice samples from the reference speaker, within a range of ρx,y = 0.005–0.025.

As can be seen from Table 1, which shows the average values of the measure of differences (ρx,y) for a set of 100 realizations of each separate variant of the experimental pair {x(t), y(t)}, the situation is sharply different for pairs of heterophonemic samples. The grayed out elements of the main diagonal of the Table correspond to the monophonemic variants of the experimental pair. All their values are considerably less than one, which indicates a high degree of similarity of the monophonemic speech signals in terms of the speaker’s voice timbre. On the contrary, all other elements of the Table consistently exceed one, indicating significant differences between heterophonemic signals. Therefore, a conclusion can be made about high sensitivity of measure (11) towards differences in speech signals in terms of the voice timbre.

Table 1 Average values of measure ρx,y

Second stage. A pair of phoneme “a” signals were studied that were synthesized using a recursive filter scheme [23]:

$$\begin{cases} x\left(i\right)=-\sum _{k=1}^{p}a_{z,p}\left(k\right)x\left(i-k\right)+\eta _{x}\left(i\right);& \\ y\left(i\right)=-\sum _{k=1}^{p}a_{z,p}\left(k\right)y\left(i-k\right)+\eta _{y}\left(i\right);& i=0,1,\ldots \end{cases}$$
(13)

The signals are characterized with pulse excitation {ηx(i)} and {ηy(i)} of different fundamental pitch frequencies: F0 = 100; 130 Hz (130 Hz ≈ 1/7.7 ms). Here, the vector of autoregression coefficients \(a_{z,p}(k)=a_{x,p}(k)\) was determined by the same vector \({\boldsymbol{a}}_{x}^{\ast }\), obtained during the first stage of the experiment in both cases. To maintain the small sample conditions, the duration of the synthesized signal samples in discrete time (i) was established as \(N=128\). The idea of the second stage was to ensure that the signals in each experimental pair {x(i), y(i)} were similar in terms of the virtual speaker’s voice timbre, while being significantly different in terms of the fundamental pitch frequency (F0) [27, 28]. These signals along with algorithms (11) and (12) were then used to calculate the vectors of autoregression coefficients: ax = (1.366934; −1.11707; 0.595185; −0.85859; 0.900872; −0.40507; 0.008816; −0.05351; 0.328108; −0.0927) and ay = (1.42998; −1.17923; 0.64873; −0.93677; 1.072479; −0.59299; 0.211108; −0.26045; 0.482161; −0.1689) and the measure of their differences: ρx,y = 0.0219 \(\ll\)1 based on the voice timbre. The obtained results are illustrated in Figs. 2 and 3.
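The synthesis scheme (13) with pulse excitation can be sketched as follows. The recursion uses the sign convention of Eq. 13 as printed. The unit-amplitude pulse train, the rounding of the pitch period to a whole number of samples, the zero initial conditions, and the second-order coefficient vector `a_demo` are all assumptions made for this illustration; they are not parameters of the original experiment.

```python
import numpy as np

def synthesize(a_z, F0, F=8000, N=128):
    """Eq. 13: x(i) = -sum_k a_z(k) x(i-k) + eta(i), with eta a pulse train."""
    a_z = np.asarray(a_z, dtype=float)
    p = len(a_z)
    period = int(round(F / F0))          # pitch period in samples (assumed rounding)
    eta = np.zeros(N)
    eta[::period] = 1.0                  # unit pulses at i = 0, period, 2*period, ...
    x = np.zeros(N)
    for i in range(N):
        acc = eta[i]
        for k in range(1, min(i, p) + 1):    # samples before i = 0 are taken as zero
            acc -= a_z[k - 1] * x[i - k]
        x[i] = acc
    return x, eta

# Hypothetical stable second-order filter, used for both pitch frequencies.
a_demo = [-1.5, 0.8]
x, eta_x = synthesize(a_demo, F0=100)    # period of 80 samples
y, eta_y = synthesize(a_demo, F0=130)    # period of 62 samples
```

Both signals share the same filter, so they differ only in the spacing of the excitation pulses, which mirrors the design of the second stage.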

Fig. 2. Timing diagrams of synthesized phoneme “a” signals with a pitch frequency F0 = 100 Hz (a) and F0 = 130 Hz (b)

Fig. 3. Graphs of the square of the normalized AFC \({K}_{x,y}^{2}\) (9) of the vocal tract linear model (13) at a pitch frequency F0 = 100 Hz (a) and F0 = 130 Hz (b) in comparison with autoregressive estimates of the discrete speech signal power spectrum Gx,y (at high autoregression order)

The graphs in Fig. 2 show timing diagrams of two synthesized phoneme “a” signals with different fundamental pitch frequencies (F0) but practically identical intraperiodic fine structure. The graphs in Fig. 3 show the squared normalized AFC (order p = 10) of the linear model (13) of the vocal tract with pulse excitation for the fundamental pitch frequencies F0 = 100 and 130 Hz. These characteristics are compared with the corresponding estimates Gx(f) and Gy(f) of the discrete speech signal power spectrum obtained using Burg’s method at a high autoregression order p* = 90. (This order was chosen to satisfy the requirement p ≥ F/F0 = 8000/100 = 80 for resolving the fine structure of the speech signal in the frequency domain [6].) In both cases, the AFC follows the spectral envelope of the synthesized signals, which is practically independent of the fundamental pitch frequency. Therefore, it can be concluded that measure (11) is invariant with respect to the value of F0, which is a key requirement for an objective measure of differences in speech signals when analyzing voice timbre [29, 30].

Results and discussion

The theoretical justification of the measure of differences (11) is based on the principle of superposition of oscillations in linear systems [26]. According to this principle, a speech signal {x(i)} with dominating vowel sounds is defined by the convolution of a periodic sequence of fundamental pitch pulses {ηx(i)} with the pulse characteristic of the vocal tract {hx(i)}. The square of the AFC (9) has the form of the envelope of the discrete speech signal power spectrum. For this reason, the acoustic theory of speech production considers the spectral envelope to be the most comprehensive characteristic of the voice timbre in the frequency domain [18, 25].

The problem is that the concept of “spectral envelope” is not strictly defined in the theory. As a result, researchers still lack clarity on the issue of optimal estimation of the spectral envelope based on the speech signal [28,29,30]. Therefore, in the conducted study, this concept was not used for the synthesis of the measure of differences (11), but exclusively as an illustration of the synthesis results.

Conclusion

The developed objective measure of differences in speech signals by voice timbre makes it possible to automatically assess the specifics of the fine structure of these signals under conditions of a priori uncertainty and small observation sample sizes. During practical implementation of the new measure based on a finite-order recursive filter with adaptive tuning of its parameters using Burg’s method, it was established that there is no need to synchronize observations with the fundamental pitch of speech signals. The results of the conducted full-scale experiment confirmed the following two main properties of the proposed measure: high sensitivity to differences in speech signals by voice timbre and significant invariance to the fundamental pitch frequency.

The obtained results are intended for use when designing and studying digital speech processing systems tuned to the speaker’s voice, where the individual characteristics of the vocal tract are of primary importance [31,32,33]. Examples include digital voice communication systems, biometric and biomedical systems, etc. [4,5,6,7,8,9].