A measure of differences in speech signals by the voice timbre

Savchenko, V. V.

doi:10.1007/s11018-024-02294-1

Published: 11 March 2024

Volume 66, pages 803–812, (2024)
Measurement Techniques

V. V. Savchenko ORCID: orcid.org/0000-0003-3045-3337¹


Abstract

This research relates to the field of speech technologies, where the key issue is the optimization of speech signal processing under conditions of a priori uncertainty about its fine structure. The problem of automatic (objective) analysis of the speaker’s voice timbre using a speech signal of finite duration is considered. A universal information-theoretic approach is proposed to solve it. Based on the Kullback-Leibler divergence, an expression is obtained for the asymptotically optimal decision statistic for differentiating speech signals by the voice timbre. A serious obstacle to the practical implementation of this statistic is highlighted, namely the need to synchronize the sequence of observations with the pitch of the speech signals. To overcome this obstacle, an objective measure of timbre-based differences in speech signals is proposed in terms of the acoustic theory of speech production and its “acoustic tube” type model of the speaker’s vocal tract. The possibilities of practical implementation of the new measure based on an adaptive recursive filter are considered. A full-scale experiment was set up and carried out. The experimental results confirmed two main properties of the proposed measure: high sensitivity to differences in speech signals in terms of voice timbre and invariance with respect to the fundamental pitch frequency. The obtained results can be used when designing and studying digital speech processing systems tuned to the speaker’s voice, for example, digital voice communication systems, biometric and biomedical systems, etc.

Introduction

Voice timbre is among the primary acoustic characteristics of a speaker’s vocal tract. As such, it has attracted the attention of researchers and specialists across a wide range of fields for many years [1, 2]. Consequently, voice timbre analysis is a classic problem in the field of acoustic measurements of speech signals [3,4,5], and the comparative analysis of speech signals based on the voice timbre is an important aspect of this problem. The latter is addressed when designing and studying modern automatic speech processing systems intended for a wide spectrum of purposes [5,6,7,8].

Despite years of research of the acoustic characteristics of the speaker’s vocal tract, studies performed in this field show a clear tendency towards the development and expansion of this topic [9,10,11], because in the author’s view, a number of unresolved theoretical problems still remain to date. One of the most important problems has to do with small observation samples [12]. In the studied case, the sample size is strictly limited by the duration of two to three periods of the fundamental pitch (T₀ = 5–10 ms), when the vocalized speech signal can be considered steady-state [13].

Since the voice timbre is defined by the fine structure of a speech signal within one such period, the sequence of observations x(i), $i=1, 2, \ldots$, should be synchronized with the vibrations of the speaker’s vocal cords [14, 15]. However, under conditions of a priori uncertainty and small sample sizes, such synchronization is a practically unsolvable problem. Therefore, the topic of this study is highly relevant.

The goal of this work is to develop an objective measure of differences in speech signals by the voice timbre, which does not require synchronization of observations with the fundamental pitch period. To achieve this goal, a universal information-theoretic approach and methodology of the acoustic theory of speech production were used.

This article was written to further advance the results of the previous studies performed by the author in collaboration with the personnel of the Laboratory of Algorithms and Technologies of Network Structure Analysis at the National Research University “Higher School of Economics” [16, 17].

Problem statement

Let x(t) represent a speech signal in discrete time t = iT, i = 1, 2, …, N, with sampling period T, so that $x(i)=x(iT)$, over the observation interval of a vocalized (vowel) speech sound having a duration of T_ob = MT₀, where $M\geq 1$. Assuming that the first sample x(1) coincides with the beginning of the observation interval, the sample size will be $N=nM$, where $n=[T_{0}T^{-1}]=[F{F}_{0}^{-1}]$; $F_{0}={T}_{0}^{-1}$ is the fundamental pitch frequency; $F=T^{-1}$ is the speech signal sampling frequency; [·] denotes the integer part of a rational number. In such cases, the observations (analysis [15]) are said to be synchronized with the fundamental pitch of the speech signal. We will now divide the N-sequence of samples {x(i)} into M partial sequences $\{x_{m}(i), m\leq M\}$, each having a dimensionality of $n=NM^{-1}\gg 1$. For example, at F = 8 kHz and F₀ = 100 Hz (standard value of the fundamental pitch frequency for male voices [16]), there are n = 8000/100 = 80 samples of $x_{m}(i)=x(mT_{0}+iT)$ for $i\leq n$ within one (each individual) period of the fundamental pitch. According to the acoustic theory of speech production [18,19,20], these samples are the ones that determine the speaker’s voice timbre. Therefore, formally, the voice timbre can be described using an intraperiodic (within the period of the fundamental pitch) autocorrelation function of the sequence {x_m(i)} of speech signal observations over a finite duration interval $T_{0}=nT$ [10]. The statistical equivalent of this function is the empirical (sample) autocorrelation (p×p)-matrix [7]:

$$S_{x}\triangleq M^{-1}\sum _{m=1}^{M}\boldsymbol{x}_{m}\;{\boldsymbol{x}}_{m}^{\top}\;{,}$$

(1)

defined over a set { x_m} of p-dimensional (vector) observations $\boldsymbol{x}_{m}=\mathrm{col}_{p}\{x_{m}(i)\}$, synchronous with the fundamental pitch of the speech signal x(t). Here, $\mathrm{col}_{p}\{\cdot\}$ is a column-vector having a dimensionality of p ≤ n; and ≜ is equality by definition. Similarly, for any other speech signal y(t), there is an empirical autocorrelation matrix:

$$S_{y}\triangleq M^{-1}\sum _{m=1}^{M}\boldsymbol{y}_{m}{\boldsymbol{y}}_{m}^{\top}\;{,}$$

(2)

where $\boldsymbol{y}_{m}=\mathrm{col}_{p}\{y_{m}(i)\}$ is the p-column-vector of synchronous observations y_m(i) in discrete time $i=1,2,\ldots ,n$.

Following the information theory of speech perception [6, 21], we will use matrices (1) and (2) as a basis of the information-theoretic approach to the automatic differentiation of speech signals x(t) and y(t) by the voice timbre.

Kullback-Leibler divergence

We will now determine the Kullback-Leibler divergence [22] for two Gaussian probability distributions specified by their autocorrelation matrices S_x and S_y in a p-dimensional sample space (see Note 1):

$$\rho _{x,y}\triangleq 0.5M\left[\mathrm{tr}\left(S_{x}{S}_{y}^{-1}\right)+\mathrm{tr}\left(S_{y}{S}_{x}^{-1}\right)-2p\right]\geq 0$$

(3)

where tr(·) denotes the trace (spur) of a square (p×p) matrix.

As shown in Ref. [21], Eq. 3 defines the asymptotically optimal (as $M\rightarrow \infty$) decision statistic in the problem of differentiating two speech signals x(t) and y(t) based on finite observation samples. However, the practical use of Eq. 3 as a measure of differences between such speech signals is greatly limited by the requirement for synchronization of their vector observations {x_m} and {y_m} with the fundamental pitch of the corresponding signal.
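Equations 1 to 3 can be illustrated with a minimal NumPy sketch. It is not the author's implementation: the test signals (white noise and a simple moving-average colouring), the values p = 4 and M = 200, and the function names `sample_autocorr_matrix` and `kl_statistic` are all assumptions made for illustration, and the frames are assumed to be already pitch-synchronous.

```python
import numpy as np

def sample_autocorr_matrix(x, n, p):
    """Eq. 1: average outer product of the p-vectors taken from the start of
    each length-n pitch period (the signal is assumed pitch-synchronous)."""
    x = np.asarray(x, dtype=float)
    M = len(x) // n                       # number of whole pitch periods
    X = x[:M * n].reshape(M, n)[:, :p]    # row m holds the vector x_m
    return X.T @ X / M

def kl_statistic(S_x, S_y, M):
    """Eq. 3: symmetric Kullback-Leibler statistic for two zero-mean Gaussians."""
    p = S_x.shape[0]
    t_xy = np.trace(np.linalg.solve(S_y, S_x))   # equals tr(S_x S_y^{-1})
    t_yx = np.trace(np.linalg.solve(S_x, S_y))   # equals tr(S_y S_x^{-1})
    return 0.5 * M * (t_xy + t_yx - 2 * p)

F, F0 = 8000, 100
n = F // F0                    # 80 samples per pitch period, as in the text
p, M = 4, 200                  # hypothetical values; M >> p keeps S invertible
rng = np.random.default_rng(0)
w = rng.standard_normal(M * n + 1)
x = w[1:]                      # white-noise stand-in for signal x
y = w[1:] + 0.9 * w[:-1]       # spectrally coloured stand-in for signal y

S_x = sample_autocorr_matrix(x, n, p)
S_y = sample_autocorr_matrix(y, n, p)
rho_xx = kl_statistic(S_x, S_x, M)   # identical statistics: zero up to rounding
rho_xy = kl_statistic(S_x, S_y, M)   # different spectra: strictly positive
print(rho_xx, rho_xy)
```

In this sketch M is deliberately large. With only two or three pitch periods, S_x has rank at most M and is singular whenever p > M, so Eq. 3 cannot be evaluated in this direct form on such short samples.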

To circumvent the aforementioned issue, the problem at hand will be reduced to signal processing in the frequency domain, where there is fundamentally no need for synchronization of the observation sequence. In Ref. [23], the frequency equivalent of the information divergence (3) is justified using the formula for a scale-invariant modification of the COSH-distance (see Note 2):

$$\rho _{x,y}=\sqrt{\left[F^{-1}\int _{-0.5F}^{0.5F}\hat{G}_{x}\left(f\right){\hat{G}}_{y}^{-1}\;\left(f\right)df\right]\;\left[F^{-1}\int _{-0.5F}^{0.5F}\hat{G}_{y}\left(f\right){\hat{G}}_{x}^{-1}\left(f\right)df\right]}-1 \geq 0.$$

(4)

Bartlett’s periodograms [24, 25] from Eq. 4:

$$\begin{cases} \hat{G}_{x}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left(nT\right)^{-1}\left| T\sum _{i=1}^{n}x_{m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2};\\ \hat{G}_{y}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left(nT\right)^{-1}\left| T\sum _{i=1}^{n}y_{m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}. \end{cases}$$

(5)

are used as statistical estimates of the intraperiodic spectra of power of speech signals x(t) and y(t) based on the discrete observation samples.

The scale invariance property of measure (4) can be easily confirmed by bringing the arbitrary gain coefficients for the signals {x_m(i)} and {y_m(i)} under the absolute value sign on the right-hand side of Eq. 5. The result will remain unchanged regardless [16]. However, this does not solve the main problem of automatic speech processing when analyzing the voice timbre, which is the synchronization of the observation sequence with the fundamental pitch of speech signals.
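The scale invariance of measure (4) can also be checked numerically. The following is a sketch under stated assumptions, not the author's code: the integral in Eq. 4 is approximated by a mean over a uniform FFT grid, the test signals are synthetic noise, and the names `bartlett_periodogram` and `cosh_measure` are hypothetical.

```python
import numpy as np

def bartlett_periodogram(x, n, T, nfft=512):
    """Eq. 5: periodograms of M length-n frames, averaged over the frames."""
    x = np.asarray(x, dtype=float)
    M = len(x) // n
    frames = x[:M * n].reshape(M, n)
    # The FFT indexes samples from 0 instead of 1; this changes only the
    # phase of each term, not the squared magnitude used in Eq. 5.
    spectra = T * np.fft.fft(frames, nfft, axis=1)
    return np.mean(np.abs(spectra) ** 2, axis=0) / (n * T)

def cosh_measure(G_x, G_y):
    """Eq. 4, with F^{-1} * integral replaced by a mean over the grid."""
    return np.sqrt(np.mean(G_x / G_y) * np.mean(G_y / G_x)) - 1.0

F = 8000.0
T = 1.0 / F
n = 80
rng = np.random.default_rng(1)
w = rng.standard_normal(50 * n + 1)
x = w[1:]                      # stand-in for signal x
y = w[1:] + 0.9 * w[:-1]       # stand-in for signal y with a different spectrum

rho = cosh_measure(bartlett_periodogram(x, n, T),
                   bartlett_periodogram(y, n, T))
# Arbitrary gains applied to both signals leave the measure unchanged.
rho_scaled = cosh_measure(bartlett_periodogram(3.0 * x, n, T),
                          bartlett_periodogram(0.5 * y, n, T))
print(rho, rho_scaled)
```

The gains 3.0 and 0.5 multiply one ratio by a constant and the reciprocal ratio by its inverse, so their product under the square root is unchanged.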

Method of asynchronous analysis of voice timbre

Considering that under the general assumptions [1], partial oscillations

$$x_{m}\left(i\right)=a_{x}h_{x,m}\left(i\right);\quad y_{m}\left(i\right)=a_{y}h_{y,m}\left(i\right),\ i=1,2,\ldots ,$$

(6)

(where a_x, a_y = const) are determined by the dynamics of the impulse responses h_x,m(i) and h_y,m(i) of the linear (filter-based) “acoustic tube” type model of the vocal tract, which is inherently stable in terms of digital filtering [26], and therefore decay over time, we rewrite Eq. 4 in an asymptotically equivalent form:

$$\begin{aligned}[b] \rho _{x,y}&=\sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{x,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{y,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}df}\times \\ &\quad \times \sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{y,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\left| T\sum _{i=1}^{\infty }h_{x,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\right| ^{2}}df}-1. \end{aligned}$$

(7)

The expressions under the absolute value sign in Eq. 7 are the discrete-time Fourier transforms of the corresponding impulse responses from (6), and they define two complex transfer coefficients:

$$\begin{aligned}[b]K_{x,m}\left(\mathrm{j}f\right)&=T\sum _{i=1}^{\infty }h_{x,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right);\\ K_{y,m}\left(\mathrm{j}f\right)&=T\sum _{i=1}^{\infty }h_{y,m}\left(i\right)\exp \left(-\mathrm{j}2\pi ifT\right)\end{aligned}$$

From Eq. 7 the following expression can be obtained:

$$\begin{aligned}[b] \rho _{x,y}&=\sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\left| K_{x,m}\;\left(\mathrm{j}f\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\;\left| K_{y,m}\;\left(\mathrm{j}f\right)\right| ^{2}}\;d\;f}\times \\ &\quad\times \sqrt{F^{-1}\int _{-0.5F}^{0.5F}\frac{M^{-1}\sum _{m=1}^{M}\;\left| K_{y,m}\;\left(\mathrm{j}f\right)\right| ^{2}}{M^{-1}\sum _{m=1}^{M}\;\left| K_{x,m}\;\left(\mathrm{j}f\right)\right| ^{2}}\;df}-1=\\ &=F^{-1}\sqrt{\int _{-0.5F}^{0.5F}\frac{{K}_{x}^{2}\left(f\right)}{{K}_{y}^{2}\left(f\right)}\;df\int _{-0.5F}^{0.5F}\frac{{K}_{y}^{2}\left(f\right)}{{K}_{x}^{2}\left(f\right)}\;df}-1. \end{aligned}$$

(8)

Thus, the problem comes down to determining the average statistical values of the squares of the amplitude-frequency characteristics (AFC) of the speaker’s vocal tract:

$${K}_{x}^{2}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left| K_{x,m}\left(\mathrm{j}f\right)\right| ^{2};\ {K}_{y}^{2}\left(f\right)\triangleq M^{-1}\sum _{m=1}^{M}\left| K_{y,m}\left(\mathrm{j}f\right)\right| ^{2}.$$

This is a typical problem of statistical analysis and speech modeling [6, 27]. Various theoretical approaches have been developed for solving it [18, 19]; the most relevant include the methods of parametric spectral analysis [24, 25], and specifically Burg’s method (see Note 3).

Example of practical implementation

According to the universal all-pole model of the speaker’s vocal tract within short (10–20 ms) intervals of vocalized speech, the desired amplitude-frequency characteristics can be determined using the formula for the absolute value of the complex transfer coefficient of a recursive filter of order p [23]:

$$K_{x}\left(f\right)=b_{x}\left| 1-\sum _{k=1}^{p}a_{x,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)\right| ^{-1};$$

(9)

$$K_{y}\left(f\right)=b_{y}\left| 1-\sum _{k=1}^{p}a_{y,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)\right| ^{-1}{,}$$

(10)

where |f| ≤ 0.5 F; b_x and b_y are the gain factors of signals x(t) and y(t), respectively, in the speaker’s vocal tract; a_x,p(k) and a_y,p(k) are the autoregression coefficients of finite order p, and k is the coefficient number.
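As a brief illustration of Eq. 9, the amplitude-frequency characteristic can be evaluated on a uniform frequency grid with a zero-padded FFT of the coefficient sequence (1, −a(1), …, −a(p)). The single-resonance coefficients below (pole radius 0.95, resonance at 700 Hz) are hypothetical and are not taken from the article; the function name `afc` is likewise an assumption.

```python
import numpy as np

def afc(a, b=1.0, nfft=1024):
    """Eq. 9 on the grid f_k = k*F/nfft, k = 0..nfft-1
    (grid points above F/2 correspond to negative frequencies)."""
    a = np.asarray(a, dtype=float)
    # FFT of (1, -a(1), ..., -a(p)) gives 1 - sum_k a(k) exp(-j 2 pi k f T)
    denom = np.fft.fft(np.concatenate(([1.0], -a)), nfft)
    return b / np.abs(denom)

F = 8000.0
r, f_res = 0.95, 700.0                       # hypothetical pole radius and resonance
theta = 2.0 * np.pi * f_res / F
a = np.array([2.0 * r * np.cos(theta), -r ** 2])   # second-order resonator

K = afc(a)
freqs = np.arange(len(K)) * F / len(K)
half = len(K) // 2                           # restrict the search to 0..F/2
f_peak = freqs[np.argmax(K[:half])]
print(f_peak)                                # lies close to the chosen resonance
```

The peak of K(f) sits near the pole angle, which is how the resonances (formants) of the vocal tract appear in the all-pole model.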

Considering Eqs. 9 and 10, we can rewrite Eq. 8 as follows:

$$\begin{aligned}[b] \rho _{x,y}&=\sqrt{F^{-1}\int _{-0.5F}^{0.5F}\left| \frac{1-\sum _{k=1}^{p}a_{y,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}{1-\sum _{k=1}^{p}a_{x,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}\right| ^{2}df}\times \\ &\quad \times \sqrt{F^{-1}\int _{-0.5F}^{0.5F}\left| \frac{1-\sum _{k=1}^{p}a_{y,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}{1-\sum _{k=1}^{p}a_{x,p}\left(k\right)\exp \left(-\mathrm{j}2\pi kfT\right)}\right| ^{-2}df}-1. \end{aligned}$$

(11)

Written under the integral sign in Eq. 11 are the direct and inverse ratios of the squares of the two normalized amplitude-frequency characteristics (9) and (10) (assuming that $b_{x}=b_{y}=1$). In this case, gain coefficients b_x and b_y do not play a role, because they cancel in the product of the two integrals. Autoregression coefficients a_x,p(k) and a_y,p(k) are fitted to the speech signals x(t) and y(t) for all $k\leq p$ from the corresponding observation samples {x(i)} and {y(i)} using one of the known methods. For instance, this could be Burg’s method, which is based on the Levinson recursion [24]:

$$\begin{aligned}[b]\forall q&=\overline{1,p}\colon a_{x,q}\left(i\right)=a_{x,q-1}\left(i\right)+c_{q}a_{x,q-1}\left(q-i\right),\quad i=1,2,\ldots ,q;\\ c_{q}&=-\frac{2\sum _{n=q}^{N-1}\eta _{q-1}\left(n\right)\nu _{q-1}\left(n-1\right)}{\sum _{n=q}^{N-1}\left[{\eta }_{q-1}^{2}\left(n\right)+{\nu }_{q-1}^{2}\left(n-1\right)\right]};\\ \eta _{q}\left(n\right)&=\eta _{q-1}\left(n\right)+c_{q}\nu _{q-1}\left(n-1\right);\\ \nu _{q}\left(n\right)&=\nu _{q-1}\left(n-1\right)+c_{q}\eta _{q-1}\left(n\right)\end{aligned}$$

(12)

with the recursion initialized by the system of equalities $\nu _{0}(n)=\eta _{0}(n)=x(n)\backslash y(n)$, $n=1,2,\ldots ,N$, where the symbol \ denotes the choice (OR) between the two signals. The final values of recursion (12) (at q = p), taken with the opposite sign, determine two p-vectors of corresponding coefficients {a_x,p(k)} and {a_y,p(k)} on the right-hand side of Eqs. 9 and 10.
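Recursion (12) can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard Burg algorithm under the assumption that it matches the recursion as printed (forward error η, backward error ν, time index n); the function name `burg` and the second-order test process with coefficients (1.5, −0.8) are hypothetical and serve only as a self-check.

```python
import numpy as np

def burg(x, p):
    """Burg recursion of Eq. 12. Returns a(1..p) in the sign convention of
    Eqs. 9-10, i.e. the final polynomial coefficients with opposite sign."""
    x = np.asarray(x, dtype=float)
    eta = x.copy()              # forward prediction error, eta_0(n) = x(n)
    nu = x.copy()               # backward prediction error, nu_0(n) = x(n)
    poly = np.array([1.0])      # prediction-error filter 1 + sum_k poly[k] z^-k
    for _ in range(p):
        e = eta[1:]             # eta_{q-1}(n)
        v = nu[:-1]             # nu_{q-1}(n-1)
        c = -2.0 * np.dot(e, v) / (np.dot(e, e) + np.dot(v, v))
        ext = np.concatenate((poly, [0.0]))
        poly = ext + c * ext[::-1]          # Levinson update of the coefficients
        eta, nu = e + c * v, v + c * e      # error updates for the next order
    return -poly[1:]

# Self-check on a synthetic AR(2) process x(i) = 1.5 x(i-1) - 0.8 x(i-2) + w(i)
rng = np.random.default_rng(2)
true_a = np.array([1.5, -0.8])
N = 4000
w = rng.standard_normal(N)
x = np.zeros(N)
for i in range(N):
    x[i] = w[i]
    if i >= 1:
        x[i] += true_a[0] * x[i - 1]
    if i >= 2:
        x[i] += true_a[1] * x[i - 2]

a_hat = burg(x, 2)
print(a_hat)                    # close to true_a for a sample of this length
```

Each reflection coefficient c has magnitude at most one by construction, which is the source of the stability property mentioned for this method.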

Thus, Eq. 11 together with recursion (12) defines a scale-invariant measure of differences in speech signals by the voice timbre of one or two different speakers. It does not require synchronization of observations with the fundamental pitch of speech signals. The potential of the proposed measure can be illustrated by the results of the experiment described below, in which the author’s software Phoneme Training was used (see Note 4).
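A numerical evaluation of Eq. 11 might look as follows. This is a hedged sketch: the integrals are approximated by means over a uniform FFT grid, and the second-order coefficient vectors are hypothetical resonators built by the helper `resonator`, not values from the experiment.

```python
import numpy as np

def timbre_measure(a_x, a_y, nfft=1024):
    """Eq. 11 with F^{-1} * integral replaced by a mean over nfft grid points."""
    A_x = np.fft.fft(np.concatenate(([1.0], -np.asarray(a_x, dtype=float))), nfft)
    A_y = np.fft.fft(np.concatenate(([1.0], -np.asarray(a_y, dtype=float))), nfft)
    ratio = np.abs(A_y / A_x) ** 2
    return np.sqrt(np.mean(ratio) * np.mean(1.0 / ratio)) - 1.0

def resonator(r, f_res, F=8000.0):
    """Hypothetical order-2 coefficient vector with one resonance at f_res."""
    theta = 2.0 * np.pi * f_res / F
    return np.array([2.0 * r * np.cos(theta), -r ** 2])

a_1 = resonator(0.95, 700.0)
a_2 = resonator(0.95, 720.0)     # resonance shifted slightly
a_3 = resonator(0.95, 1500.0)    # resonance shifted substantially

rho_same = timbre_measure(a_1, a_1)
rho_near = timbre_measure(a_1, a_2)
rho_far = timbre_measure(a_1, a_3)
print(rho_same, rho_near, rho_far)
```

In this toy setting the measure is zero for identical coefficient vectors and grows as the resonance moves away, and it is symmetric in its two arguments because the two factors under the square root simply swap.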

Experimental procedure and results

The experimental program consisted of two stages.

First stage. During the first stage, the sensitivity of the new measure (11) to differences in the fine structure of speech signals was studied with the observations being asynchronous relative to the fundamental pitch. The study was focused on the long (approximately 1.5 to 2 s) signals in the form of the vowel phonemes of the reference speaker—the author of this article. Using the Phoneme Training software, each such signal was transformed into a sequence of homogeneous (monophonic) frames x(t)\y(t) of relatively short duration: T_ob = 16 ms. In this case, it was assumed that all such frames are characterized by the same voice timbre. On the contrary, frames of different phonemes fundamentally differ from each other in terms of voice timbre.

The graphs shown in Fig. 1 represent timing diagrams of the phoneme “a” signal in case of synchronous (a) and asynchronous (b) discrete observations relative to the fundamental pitch. The sampling frequency (8 kHz) of the signal in both cases was consistent with the standard telephone bandwidth (4 kHz). In the first case, the signal covers two full periods of the fundamental pitch, while in the second case, it covers only one. As a result, the fine structures of the two signals are very different, while the voice timbre in both cases is practically the same. There is no contradiction here, because voice timbre is determined only by the intraperiodic fine structure, whereas the observed difference concerns the fine structure of the frame as a whole. This is what severely limits timbre analysis by means of the classical periodogram estimates (5) in the case of asynchronous observations.

In accordance with the experimental procedure, measures (11) were calculated for different pairs of speech signals x(t) and y(t) within the set of prepared voice samples. For each such pair, two corresponding vectors $\boldsymbol{a}_{x}=\{a_{x,p}(k)\}$ and $\boldsymbol{a}_{y}=\{a_{y,p}(k)\}$ of autoregression coefficients of order p = 10 (see Note 5) were first calculated using algorithm (12). These vectors were then used to calculate the measure of differences ρ_x,y according to Eq. 11. For example, for a pair of homonymous signals (see Fig. 1), the following two vectors were obtained: a_x = (1.364686; −1.08823; 0.532204; −0.80853; 0.906187; −0.43502; 0.107709; −0.17596; 0.40483; −0.12711) ≜ a_x^*; a_y = (1.368958; −1.01194; 0.457037; −0.84364; 1.049269; −0.55294; 0.145535; −0.27892; 0.529057; −0.18108), based on which measure ρ_x,y = 0.009 $\ll 1$ was calculated. Similar results were obtained for all other experimental pairs {x(t), y(t)}, composed of monophonemic voice samples from the reference speaker, within a range of ρ_x,y = 0.005–0.025.

As can be seen from Table 1, which shows the average values of the measure of differences (ρ_x,y) for a set of 100 realizations of each separate variant of the experimental pair {x(t), y(t)}, the situation is sharply different for pairs of heterophonemic samples. The grayed out elements of the main diagonal of the Table correspond to the monophonemic variants of the experimental pair. All their values are considerably less than one, which indicates a high degree of similarity of the monophonemic speech signals in terms of the speaker’s voice timbre. On the contrary, all other elements of the Table consistently exceed one, indicating significant differences between heterophonemic signals. Therefore, a conclusion can be made about high sensitivity of measure (11) towards differences in speech signals in terms of the voice timbre.

Table 1 Average values of measure ρ_x,y

Second stage. A pair of phoneme “a” signals were studied that were synthesized using a recursive filter scheme [23]:

$$\begin{cases} x\left(i\right)=-\sum _{k=1}^{p}a_{z,p}\left(k\right)x\left(i-k\right)+\eta _{x}\left(i\right);& \\ y\left(i\right)=-\sum _{k=1}^{p}a_{z,p}\left(k\right)y\left(i-k\right)+\eta _{y}\left(i\right);& i=0,1,\ldots \end{cases}$$

(13)

The signals are characterized with pulse excitation {η_x(i)} and {η_y(i)} of different fundamental pitch frequencies: F₀ = 100; 130 Hz (130 Hz ≈ 1/7.7 ms). Here, the vector of autoregression coefficients $a_{z,p}(k)=a_{x,p}(k)$ was determined by the same vector ${\boldsymbol{a}}_{x}^{\ast }$, obtained during the first stage of the experiment in both cases. To maintain the small sample conditions, the duration of the synthesized signal samples in discrete time (i) was established as $N=128$. The idea of the second stage was to ensure that the signals in each experimental pair {x(i), y(i)} were similar in terms of the virtual speaker’s voice timbre, while being significantly different in terms of the fundamental pitch frequency (F₀) [27, 28]. These signals along with algorithms (11) and (12) were then used to calculate the vectors of autoregression coefficients: a_x = (1.366934; −1.11707; 0.595185; −0.85859; 0.900872; −0.40507; 0.008816; −0.05351; 0.328108; −0.0927) and a_y = (1.42998; −1.17923; 0.64873; −0.93677; 1.072479; −0.59299; 0.211108; −0.26045; 0.482161; −0.1689) and the measure of their differences: ρ_x,y = 0.0219 $\ll$1 based on the voice timbre. The obtained results are illustrated in Figs. 2 and 3.

The graphs shown in Fig. 2 represent timing diagrams of two synthesized phoneme “a” signals with different fundamental pitch frequencies (F₀), but practically identical intraperiodic fine structure. The graphs shown in Fig. 3 illustrate the squared normalized AFC (order p = 10) of the linear model (13) of the vocal tract with pulse excitation for the fundamental pitch frequencies F₀ = 100 and 130 Hz. These characteristics are compared with the corresponding estimates G_x(f) and G_y(f) of the discrete speech signal power spectrum obtained using Burg’s method at a high autoregression order $p^{*}=90$. (This order was chosen based on the requirement of p ≥ F/F₀ = 8000/100 = 80 for resolving the fine structure of the speech signal in the frequency domain [6]). In both cases, the AFC shape follows the spectral envelope of the synthesized signals, which is practically independent of the fundamental pitch frequency. Therefore, it can be concluded that measure (11) is invariant with respect to the value of F₀, which is a key requirement for an objective measure of differences in speech signals when analyzing voice timbre [29, 30].
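The second-stage procedure can be mimicked with a short, self-contained sketch. Several things here are assumptions rather than details from the article: the fourth-order filter with two hypothetical resonances (700 and 1200 Hz) replaces the tenth-order vector obtained in the first stage, the excitation is an idealized unit-pulse train, and the synthesis recursion is written so that its transfer function matches Eq. 9 (coefficients enter with a plus sign). No particular numerical outcome is claimed.

```python
import numpy as np

def burg(x, p):
    """Burg recursion (Eq. 12); returns a(1..p) in the convention of Eq. 9."""
    eta = np.asarray(x, dtype=float).copy()
    nu = eta.copy()
    poly = np.array([1.0])
    for _ in range(p):
        e, v = eta[1:], nu[:-1]
        c = -2.0 * np.dot(e, v) / (np.dot(e, e) + np.dot(v, v))
        ext = np.concatenate((poly, [0.0]))
        poly = ext + c * ext[::-1]
        eta, nu = e + c * v, v + c * e
    return -poly[1:]

def timbre_measure(a_x, a_y, nfft=1024):
    """Eq. 11 with the integrals approximated by means over an FFT grid."""
    A_x = np.fft.fft(np.concatenate(([1.0], -a_x)), nfft)
    A_y = np.fft.fft(np.concatenate(([1.0], -a_y)), nfft)
    ratio = np.abs(A_y / A_x) ** 2
    return np.sqrt(np.mean(ratio) * np.mean(1.0 / ratio)) - 1.0

def synthesize(a, period, N):
    """All-pole filter driven by unit pulses every `period` samples.
    Assumed sign convention: x(i) = sum_k a(k) x(i-k) + eta(i)."""
    excitation = np.zeros(N)
    excitation[::period] = 1.0
    x = np.zeros(N)
    for i in range(N):
        past = sum(a[k] * x[i - 1 - k] for k in range(min(len(a), i)))
        x[i] = past + excitation[i]
    return x

F = 8000
# Hypothetical vocal-tract model: two resonances, poles inside the unit circle.
poles = [0.95 * np.exp(2j * np.pi * 700 / F), 0.93 * np.exp(2j * np.pi * 1200 / F)]
poles = poles + [np.conj(z) for z in poles]
a_z = -np.real(np.poly(poles))[1:]

N, p = 128, 4                            # small sample, as in the second stage
x = synthesize(a_z, round(F / 100), N)   # F0 = 100 Hz, period of 80 samples
y = synthesize(a_z, round(F / 130), N)   # F0 = 130 Hz, period of 62 samples
a_x, a_y = burg(x, p), burg(y, p)
rho = timbre_measure(a_x, a_y)
print(a_x, a_y, rho)
```

Both signals share the same filter and differ only in pulse spacing, so the printed value of rho indicates how far the two fitted coefficient vectors drift apart because of the pitch change alone in this toy setting.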

Results and discussion

The theoretical justification of the measure of differences (11) is based on the principle of superposition of oscillations in linear systems [26]. According to this principle, a speech signal {x(i)} with dominating vowel sounds is defined by the convolution of a periodic sequence of fundamental pitch pulses {η_x(i)} with the impulse response of the vocal tract {h_x(i)}. The square of the AFC (9) has the form of an envelope of the discrete speech signal power spectrum. Therefore, the acoustic theory of speech production considers the spectral envelope as the most comprehensive characteristic of the voice timbre in the frequency domain [18, 25].

The problem is that the concept of “spectral envelope” is not strictly defined in the theory. As a result, researchers still lack clarity on the issue of optimal estimation of the spectral envelope based on the speech signal [28,29,30]. Therefore, in the conducted study, this concept was not used for the synthesis of the measure of differences (11), but exclusively as an illustration of the synthesis results.

Conclusion

The developed objective measure of differences in speech signals by voice timbre makes it possible to automatically assess the specifics of the fine structure of these signals under conditions of a priori uncertainty and small observation sample sizes. During practical implementation of the new measure based on a finite-order recursive filter with adaptive tuning of its parameters using Burg’s method, it was established that there is no need to synchronize observations with the fundamental pitch of speech signals. The results of the conducted full-scale experiment confirmed the following two main properties of the proposed measure: high sensitivity to differences in speech signals by voice timbre and significant invariance to the fundamental pitch frequency.

The obtained results are intended for use when designing and studying digital speech processing systems tuned to the speaker’s voice, where the individual characteristics of the vocal tract are of primary importance [31,32,33]. Examples include digital voice communication systems, biometric and biomedical systems, etc. [4,5,6,7,8,9].

Notes

1. The assumption of a Gaussian probability distribution does not limit the generality of the conclusions of this study, as this law is characterized by the maximum entropy for a given average power of the speech signal.
2. COSH: hyperbolic cosine function.
3. Researchers often prefer Burg’s method over other parametric spectral analysis methods due to its well-known advantages in terms of computational speed and, most importantly, stability of the spectral estimates of the autoregressive type that are formed on its basis.
4. The Phoneme Training phonetic analysis and speech training information system: [website]. URL: https://sites.google.com/site/frompldcreators/produkty-1/phonemetraining (access date: May 18, 2023).
5. This order is intended for autoregressive simulation of 4–5 AFC resonances of a typical vocal tract when pronouncing vowels in the frequency bandwidth of 0 to 4 kHz.

References

1. Zhao, R., Erleke, E., Wang, L., Huang, J., Chen, Z.: The effects of timbre on voice interaction. In: Rau, P.-L.P. (ed.) Cross-Cultural Design: HCII 2023, Lecture Notes in Computer Science, vol. 14023. Springer, Cham (2023) https://doi.org/10.1007/978-3-031-35939-2_12
2. Ando, Y.: Temporal and spatial features of speech signals. In: Signal processing in auditory neuroscience, pp. 81–101. Academic Press (2019) https://doi.org/10.1016/B978-0-12-815938-5.00009-1
3. Ternström, S.: Appl. Sci. 13(6), 3514 (2023). https://doi.org/10.3390/app13063514
4. Song, W., Yue, Y., Zhang, Y., et al.: Multi-speaker multistyle speech synthesis with timbre and style disentanglement. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds.) Man-machine speech communication: NCMMSC 2022, communications in computer and information science. Springer, Singapore (2022) https://doi.org/10.1007/978-981-99-2401-1_12
5. Jialu, L., Hasegawa-Johnson, M., McElwain, N.L.: Speech. Commun. 133, 41–61 (2021). https://doi.org/10.1016/j.specom.2021.07.010
6. Savchenko, V.V.: Radioelectron. Commun. Syst. 64(11), 592–603 (2021). https://doi.org/10.3103/S0735272721110030
7. Savchenko, A.V., Savchenko, V.V.: Meas. Tech. 64(4), 928–935 (2022). https://doi.org/10.1007/s11018-022-02025-4
8. Wei, Y., Gan, L., Huang, X.: Front. Psychol. 13, 869475 (2022). https://doi.org/10.3389/fpsyg.2022.869475
9. Xue, J., Zhou, H., Song, H., Wu, B., Shi, L.: Speech. Commun. 147, 41–50 (2023). https://doi.org/10.1016/j.specom.2023.01.001
10. Li, J., Zhang, L., Qiu, Z.: 5th International Conference on Intelligent Control, Measurement and Signal Processing (ICMSP). Chengdu, pp. 833–837 (2023). https://doi.org/10.1109/ICMSP58539.2023.10171030
11. Igras-Cybulska, M., Hekiert, D., Cybulski, A., et al.: Work-in-Progress. In: 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Shanghai, pp. 355–359 (2023). https://doi.org/10.1109/VRW58643.2023.00079
12. Cui, S., Li, E., Kang, X.: 2020 IEEE International Conference on Multimedia and Expo (ICME). London, pp. 1–6 (2020). https://doi.org/10.1109/ICME46284.2020.9102765
13. Gupta, S., Fahad, M.S., Deepak, A.: Multimed Tools Appl 79, 23347–23365 (2020). https://doi.org/10.1007/s11042-020-09068-1
14. Dai, B., Zahorian, S.: J. Acoust. Soc. Am. 104, 1805 (1998). https://doi.org/10.1121/1.423591
15. Zakhar’ev, V.A., Petrovskii, A.A.: Metody parametrizatsii rechevogo signala na osnove analiza, sinkhronizirovannogo s chastotoi osnovnogo tona v sistemakh konversii golosa. In: Proceedings of the 11th International Scientific and Technical Conference “Nauka – obrazovaniyu, proizvodstvu, ekonomike”, vol. 1, pp. 203–204. BNTU, Minsk (2013). in Russian
16. Savchenko, V.V., Savchenko, L.V.: J. Commun. Technol. Electron. 68(7), 757–764 (2023). https://doi.org/10.1134/S1064226923060128
17. Savchenko, A.V., Savchenko, V.V.: Radioelectron. Commun. Syst. 64(6), 300–309 (2021). https://doi.org/10.3103/S0735272721060030
18. Gibson, J.: Information 10(5), 179–189 (2019). https://doi.org/10.3390/info10050179
19. Herbst, Ch.T., Elemans, C.P.H., Tokuda, I.T., Chatziioannou, V., Švec, J.G.: J. Voice (2023). https://doi.org/10.1016/j.jvoice.2022.10.004
20. Sadok, S., Leglaive, S., Girin, L., Alameda-Pineda, X., Séguier, R.: Speech. Commun. 148, 53–65 (2023). https://doi.org/10.1016/j.specom.2023.02.005
21. Savchenko, V.V.: J. Commun. Technol. Electron. 64(6), 590–596 (2019). https://doi.org/10.1134/S0033849419060093
22. Kullback, S.: Information theory and statistics. Dover, New York (1997)
23. Savchenko, V.V.: Meas. Tech. 66(6), 430–438 (2023). https://doi.org/10.1007/s11018-023-02244-3
24. Marple Jr., S.L.: Digital spectral analysis, 2nd edn. Dover, New York (2019)
25. Savchenko, V.V.: Meas. Tech. 66(3), 203–210 (2023). https://doi.org/10.1007/s11018-023-02211-y
26. Oppenheim, A., Schafer, R.: Discrete-time signal processing, 3rd edn. Pearson (2009)
27. Kathiresan, Th., Maurer, D., Suter, H., Dellwo, V.: J. Acoust. Soc. Am. 143(3), 1919–1920 (2018). https://doi.org/10.1121/1.5036258
28. Kovela, S., Valle, R., Dantrey, A., Catanzaro, B.: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, pp. 1–5 (2023). https://doi.org/10.1109/ICASSP49357.2023.10096220
29. Sun, P., Mahdi, A., Xu, J., Qin, J.: Speech. Commun. 101, 57–69 (2018). https://doi.org/10.1016/j.specom.2018.05.006
30. Tohyama, M.: Spectral envelope and source signature analysis. In: Acoustic signals and hearing, pp. 89–110. Academic Press (2020) https://doi.org/10.1016/B978-0-12-816391-7.00013-9
31. Savchenko, V.V.: Radioelectron. Commun. Syst. 63, 42–54 (2020). https://doi.org/10.3103/S0735272720010045
32. Eggermont, J.J.: Brain responses to auditory mismatch and novelty detection. Academic Press, pp. 345–376 (2023). https://doi.org/10.1016/B978-0-443-15548-2.00011-9
33. Oganian, Y., Bhaya-Grossman, I., Johnson, K., Chang, E.: Neuron 111(13), 2105–2118e4 (2023). https://doi.org/10.1016/j.neuron.2023.04.004

Author information

Authors and Affiliations

National Research University Higher School of Economics, Nizhny Novgorod, Russian Federation
V. V. Savchenko

Corresponding author

Correspondence to V. V. Savchenko.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Translated from Izmeritel’naya Tekhnika, No. 10, pp. 63–69, October, 2023. Russian DOI: https://doi.org/10.32446/0368-1025it.2023-10-63-69.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Original article submitted September 18, 2023; approved after reviewing October 18, 2023; accepted for publication October 18, 2023.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Savchenko, V.V. A measure of differences in speech signals by the voice timbre. Meas Tech 66, 803–812 (2024). https://doi.org/10.1007/s11018-024-02294-1

Issue Date: January 2024

Keywords

UDC

53.082.4;004.934.2
