
1 Introduction

Stuttering occurs, with similar characteristics, in many languages. The disorder comprises several types of dysfluency: prolongations, blocks, interjections, and repetitions of syllables or even parts of words [15]. This paper deals with syllable and fragment repetitions in continuous speech. Analysing whether dysfluencies can be detected automatically in continuous speech is an important topic: people who stutter have to cope not only with interpersonal communication problems but also with the increasingly widespread use of voice-controlled systems. In addition, such analysis deepens the knowledge of the nature of stuttering and helps to design better stuttering therapies.

1.1 Related Work

In research on automatic dysfluency detection, speech fragments are usually first classified by listening into fluent and dysfluent parts [1, 5, 7, 8, 16, 19–21, 24–29]. This research aims at detecting dysfluent fragments in continuous utterances. The first dysfluency detection trials that did not start from extracted dysfluent words were undertaken in the works of Howell [9–11], and later in the research of Suszyński [23], Wiśniewski [30, 31], Codello [2–4] and Kobus [13, 15]. Those works take as their input four-second-long recordings initially classified as fluent or dysfluent. Comparing the results of various analyses of the fluent and dysfluent samples reveals significant disparities between the two types of samples; thanks to this, prolongation and phoneme repetition detection gives good results. Syllable repetition detection is a more complex issue. Still, it was addressed in Suszyński's research [25], based on the works of Hiroshima [7] and Codello [4]. For this purpose Suszyński compared spectrograms: he applied the correlation of one-third octave spectra, combined with procedures analysing the time–amplitude structure of the file, to detect repeating fragments. Classification was performed by thresholding the correlation coefficient obtained for the two fragments. This analysis detects and localises syllable repetitions with 70 % efficiency. Codello's research [4] was based on the application of the CWT in the Bark scale; the results were split into vectors and compared by the correlation method, reaching 80 % efficiency. For the speech dysfluency analysis the authors programmed the "Dabar" application. Besides the implementation of the prolongation [15] and block [13] detection algorithms, the possibility of syllable and word fragment detection was also analysed.
Amongst the many functionalities of the application are the following: dimension reduction by Kohonen networks, evaluation of linear prediction and PARCOR coefficients, spectrum evaluation on the basis of LPC, and the creation of an elliptic model of the vocal tract [14]. Previous work showed that representing the speech signal by linear prediction coefficients is very well suited to detecting plosive repetitions [13] and phoneme prolongations [15]. For this research, the splitting of utterances into syllables and/or speech fragments was implemented, together with a K-means method for obtaining centres.

2 Methodology

The goal of this research is to analyse whether repetitions of syllables and speech fragments can be detected automatically using the linear prediction method. In each utterance, the places where a syllable or speech fragment repetition occurs were marked.

2.1 Speech Data

Utterances several seconds long were used for the analysis. All 56 files, of a total length of 3 min 47 s, are in WAV format. Fourteen speakers were recorded: nine men and five women. The men were 11–25 years old (30 recordings) and the women 13–24 years old (26 recordings). The recordings contain 262 pairs of syllables, 117 of which were dysfluent. The material was recorded with Creative Wave Studio using a SoundBlaster card at a 22050 Hz sampling frequency and 16 bits per sample.

2.2 Preprocessing

Each recording was independently split into non-overlapping frames of 512 samples. Each frame was multiplied by the Hann window function (1). Previous research showed that this window yields frequency spectra with the best characteristics among those obtained by the linear prediction method [13].

$$\begin{aligned} w(n) = 0.5\left( 1-\cos \left( \frac{2\pi n}{N-1}\right) \right) , \end{aligned}$$
(1)

where \(N=512\) is the frame size.
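The framing and windowing step can be sketched as follows. This is a minimal illustration in Python following Eq. (1) and the 512-sample frame size from the text; the function names are our own:

```python
import numpy as np

def hann_window(N: int) -> np.ndarray:
    """Hann window of Eq. (1): w(n) = 0.5 * (1 - cos(2*pi*n / (N-1)))."""
    n = np.arange(N)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (N - 1)))

def frame_signal(signal: np.ndarray, frame_size: int = 512) -> np.ndarray:
    """Split a signal into non-overlapping frames and apply the Hann window.
    Trailing samples that do not fill a whole frame are discarded."""
    n_frames = len(signal) // frame_size
    frames = signal[: n_frames * frame_size].reshape(n_frames, frame_size)
    return frames * hann_window(frame_size)
```

The window equals NumPy's built-in `np.hanning`, so the formula can be cross-checked against the library implementation.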

2.3 Feature Extraction

2.3.1 Linear Prediction

The key property of the linear prediction method is that it captures the behaviour of the speech signal in a vector of a few coefficients. Consecutive speech signal samples vary only slightly [18]; thus an approximation of the next sample can be evaluated from the p previous samples using appropriate linear prediction coefficients \(\alpha \). This idea is expressed by the following equation:

$$\begin{aligned} \tilde{s}(m) = \sum _{k=1}^p\alpha _ks(m-k) \end{aligned}$$
(2)

where \(\tilde{s}(m)\) is the mth value of the predicted speech sample, s(m) is the mth value of the input speech signal, p is the prediction order and \(\alpha _k\) are the obtained linear prediction coefficients. Based on the linear prediction coefficients, a continuous frequency spectrum with an arbitrary number of stripes may be evaluated. The evaluation is expressed by Eq. (3), derived from the definition of LP (Eq. 2):

$$\begin{aligned} H(f_i) = \frac{G}{1-\displaystyle \sum _{k=1}^p\alpha _k \left( \cos \left( k\frac{f_i}{f}2\pi \right) -i\sin \left( k\frac{f_i}{f}2\pi \right) \right) } \end{aligned}$$
(3)

where f is the chosen number of spectrum stripes, \(f_i\) is the index of the chosen spectrum stripe, with \(0 \le f_i \le f\), and G is the gain factor.
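Eq. (3) can be sketched in Python as below. The complex exponential \(\cos(\cdot) - i\sin(\cdot)\) is written as `exp(-1j * ...)`; sweeping \(f_i\) over the integers \(0, \dots, f-1\) is one reading of the equation, and the function name is our own:

```python
import numpy as np

def lp_spectrum(alpha, G=1.0, f=300):
    """Magnitude of H(f_i) from Eq. (3):
    G / |1 - sum_k alpha_k * exp(-1j * 2*pi * k * f_i / f)|,
    evaluated for f_i = 0 .. f-1."""
    alpha = np.asarray(alpha, dtype=float)
    k = np.arange(1, len(alpha) + 1)
    fi = np.arange(f)
    # E[i, j] = exp(-1j * 2*pi * k_j * f_i / f) = cos(...) - 1j*sin(...)
    E = np.exp(-1j * 2.0 * np.pi * np.outer(fi, k) / f)
    A = 1.0 - E @ alpha
    return G / np.abs(A)
```

With all coefficients zero the denominator is 1, so the spectrum is flat and equal to G, which gives a quick sanity check.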

2.3.2 Analysis

In this algorithm the linear prediction coefficients, together with the gain G, were evaluated for each frame using the Levinson–Durbin method [18]. The number of coefficients per frame was 15, which is sufficient for a good signal characteristic [22]. Next, the frequency spectrum was evaluated from Eq. (3) for each frame; in order to obtain a precise spectrum, the number of stripes was set to 300. The evaluated coefficient vectors were used as input vectors in the method of splitting speech into fragments (Sect. 2.4). They were also the basis of the values parameterising the speech for further analysis (Sect. 2.6).
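A minimal sketch of the Levinson–Durbin recursion over the autocorrelation of one windowed frame is shown below. It is the textbook recursion, not the authors' implementation, and the gain is taken as the square root of the final prediction error energy:

```python
import numpy as np

def levinson_durbin(x, p=15):
    """Levinson-Durbin recursion: LP coefficients alpha_1..alpha_p and
    gain G from the autocorrelation of one (windowed) frame x."""
    x = np.asarray(x, dtype=float)
    # autocorrelation lags r[0] .. r[p]
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)          # a[j] holds alpha_j; a[0] is unused
    E = r[0]                     # prediction error energy
    for i in range(1, p + 1):
        # reflection coefficient for order i
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / E
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]
        E *= (1.0 - k * k)
    return a[1:], np.sqrt(E)
```

On a noiseless first-order exponential sequence the recursion recovers a first coefficient close to the decay factor, with higher coefficients near zero.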

2.4 Segmentation

An algorithm splitting the speech into fragments was implemented for the purpose of this research. It is based on the frequency spectrum obtained from the linear prediction coefficients. From the obtained spectra, the average power over all m spectrum stripes was evaluated for each frame. On the basis of these values, a threshold was defined: the average of the first two values of the vector, increased by 3 dB.
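The thresholding step can be sketched as follows. Reading the "first two values" as the initial two frames (assumed to hold background noise) is our interpretation, flagged in the comments:

```python
import numpy as np

def segment_frames(spectra):
    """Mark frames as speech / non-speech from their LP spectra.
    spectra: (n_frames, n_stripes) array of |H(f_i)| values.
    Threshold: mean power (in dB) of the first two frames, raised by
    3 dB.  Treating those frames as a noise-floor estimate is an
    assumption about the paper's 'two primary values'."""
    power_db = 10.0 * np.log10(np.mean(spectra ** 2, axis=1))
    threshold = np.mean(power_db[:2]) + 3.0
    return power_db > threshold
```

Frames above the threshold would then be merged into contiguous speech fragments.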

2.5 Pairing

Each fragment of the examined utterance was compared with the succeeding fragment. If the succeeding fragment was longer than the preceding one, it was truncated to the length of the first fragment. This is because the succeeding fragment may also contain the further, fluent part of the utterance, while the repeated material lies in its initial phase.
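The pairing rule amounts to a short loop; this sketch assumes fragments are given as sample (or frame) sequences:

```python
def pair_fragments(fragments):
    """Pair each fragment with its successor, truncating the successor
    to the length of the preceding fragment: if the successor contains a
    repetition, the repeated material sits at its start."""
    pairs = []
    for a, b in zip(fragments, fragments[1:]):
        pairs.append((a, b[: len(a)]))
    return pairs
```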

2.6 Formants Extraction

The formants are determined for each frame [6, 17, 22]. They are those local maxima F of the frequency spectrum which fulfil the bandwidth condition [12, 22]:

$$\begin{aligned} B_j = -(f_j/\pi )\ln (r_0) < B \end{aligned}$$
(4)

where \(B_j\) is the 3 dB bandwidth, j is the ordinal number of the maximum within the chosen frame, \(f_j\) is the frequency of the jth spectrum maximum and \(r_0\) is the modulus of the corresponding root of the polynomial \(A(z)=1-\displaystyle \sum _{k=1}^p\alpha _kz^{-k}\), where \(z=e^{i2\pi f_j/f_s}\) and \(f_s\) is the sampling frequency. B is the maximum bandwidth permissible for a formant; for the purpose of this analysis \(B=1000\) Hz. Each point has three coordinates: frequency, amplitude and time. Several formants were obtained for each frame, so a fragment is represented by a set P of points. Two of these dimensions, frequency and amplitude, are the basis for the K-means algorithm. The main goal of this method is to reduce the m characteristic points of the set P to k characteristic centres for the whole fragment.
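A common way to realise this is to find the roots of A(z) and keep those whose bandwidth passes the test. The sketch below uses the standard root-bandwidth formula \(B_j = -(f_s/\pi)\ln|r_j|\) (substituting the sampling frequency \(f_s\) where Eq. (4) writes \(f_j\) is our assumption), with the 22050 Hz rate and 1000 Hz limit from the text:

```python
import numpy as np

def formants(alpha, fs=22050.0, B_max=1000.0):
    """Candidate formant frequencies from the roots of
    A(z) = 1 - sum_k alpha_k z^{-k}, keeping roots whose 3 dB bandwidth
    -(fs/pi) * ln|r_j| is below B_max (cf. Eq. (4))."""
    # np.roots expects descending powers: z^p - alpha_1 z^{p-1} - ...
    poly = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))
    roots = np.roots(poly)
    roots = roots[np.imag(roots) > 0]           # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)  # pole frequency in Hz
    bands = -(fs / np.pi) * np.log(np.abs(roots))
    return np.sort(freqs[bands < B_max])
```

For a single resonator with a pole pair at 1000 Hz and radius 0.98, the coefficients \(\alpha_1 = 2r\cos\theta\), \(\alpha_2 = -r^2\) yield exactly one formant near 1000 Hz.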

2.7 Clusterisation

The principle of the K-means algorithm is to gradually adjust the midpoints (centres) so that they divide the space as well as possible into groups of points concentrated around each centre.
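The adjustment loop can be sketched as plain K-means, alternating nearest-centre assignment and centre updates (a generic textbook version, not the authors' exact implementation):

```python
import numpy as np

def kmeans(points, centres, iters=20):
    """Plain K-means: assign each point to its nearest centre (Euclidean
    metric), then move every centre to the mean of its assigned points."""
    centres = centres.copy()
    for _ in range(iters):
        # pairwise distances: d[i, j] = |points[i] - centres[j]|
        d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for i in range(len(centres)):
            if np.any(labels == i):             # leave empty clusters alone
                centres[i] = points[labels == i].mean(axis=0)
    return centres
```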

2.7.1 The Normalisation of Space of Characteristic Points

The normalisation of the space of characteristic points consists in scaling each dimension with respect to the average value and the standard deviation of the set of points, in the spirit of the Mahalanobis distance:

$$\begin{aligned} \overline{p_j} = \frac{{p_j}-{\mu }}{\sigma } \end{aligned}$$
(5)

where \(\overline{p_j}\) is a normalised point in the \(R^N\) space, \(p_j\) is a point in the \(R^N\) space, \(\mu \) is the point of average values of the \(p_j\) points and \(\sigma \) is the vector of standard deviations of these points.
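Eq. (5) is a per-dimension z-score; a minimal sketch:

```python
import numpy as np

def normalise_points(P):
    """Per-dimension normalisation of Eq. (5): subtract the mean and
    divide by the standard deviation of each coordinate.
    P: (m, N) array of points; returns the normalised (m, N) array."""
    mu = P.mean(axis=0)
    sigma = P.std(axis=0)
    return (P - mu) / sigma
```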

2.7.2 Input Data

The set P of normalised points \(p_j\) is given, where \(0<j\le m\) and m is the number of elements of P, as well as the vector \(\mathbf{k}\) of normalised initial centres \(c_i\), where \(0<i\le k\) and k is the length of the vector \(\mathbf{k}\). Both the set of points and the vector of centres should be normalised: measuring distances between points in an N-dimensional space requires the space to be normalised so that distances along different dimensions are comparable. The vector \(\mathbf{k}\) of centres should also be initialised with coordinate values.

2.7.3 Centres Initialisation

Random initialisation of the centres is often applied; however, knowledge of the structure of the point set allows an informed initialisation, and the uniform distribution method was used for the purpose of this research. The input data for initialisation are points with two dimensions: amplitude and frequency. The input points are sorted by frequency, and \(k_p\) points distributed uniformly in frequency are chosen. If there are too few such points, they are duplicated up to the number k, so that the shortage does not affect the measurement of distances between the vectors.
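A sketch of this initialisation, assuming points are given as (frequency, amplitude) rows:

```python
import numpy as np

def init_centres(points, k):
    """Initialise k centres spread uniformly along the frequency axis.
    points: (m, 2) array of (frequency, amplitude) rows, already
    normalised.  If there are fewer than k points, they are duplicated."""
    pts = points[np.argsort(points[:, 0])]      # sort by frequency
    if len(pts) < k:
        reps = int(np.ceil(k / len(pts)))
        pts = np.repeat(pts, reps, axis=0)
    idx = np.linspace(0, len(pts) - 1, k).round().astype(int)
    return pts[idx]
```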

2.8 Distance Measuring

For the analysis of the distances between the vectors \(\mathbf{k}\) and consequently between the points, a few distance measures were applied.

2.8.1 Metrics

For measuring the distance between the points two metrics were applied.

Euclidean Metric

This metric was applied in the measurement of the distance between points in the K-means algorithm, as well as in the classification for measurement of the distance between the tested pair and the threshold between the fluent and dysfluent pairs (MEDIAN, AVERAGE).

Mahalanobis Metric

This metric was applied in the classification for examining the distance between the tested pair and the group of fluent and dysfluent pairs (MAHALANOBIS).

2.8.2 Distance Between Vectors

After the K-means analysis, each fragment is represented by a vector of centres. The fragments are combined into pairs and the distance between them in two dimensions is evaluated as the sum of the minimal Euclidean distances between the centres of the succeeding and the preceding fragment.
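This distance can be sketched as follows; summing over the centres of the first fragment (taking, for each, the nearest centre of the second) is one reading of the text:

```python
import numpy as np

def fragment_distance(c1, c2):
    """Distance between two fragments represented by centre vectors:
    for each centre of c1, take the minimal Euclidean distance to any
    centre of c2, and sum these minima.  c1, c2: (k, 2) arrays."""
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)
    return d.min(axis=1).sum()
```

Identical centre vectors give distance zero, so similar (repeated) fragments yield small values.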

2.9 Classification

The obtained distances were used for the classification of 262 pairs of fragments into dysfluent ones (fragment repetitions) and fluent ones. The previously classified pairs were randomly divided into three parts: two of them were used as the training part (172 pairs) and one as the testing part (90 pairs). Three methods were used for classification:

AVERAGE: the minimal distance from the threshold in the Euclidean metric. The threshold was obtained experimentally as \(n + (f-n)/5\), where n is the arithmetic average of the distances between contiguous recurring fragments and f is the arithmetic average of the distances between contiguous non-recurring fragments (fluent speech).

MEDIAN: analogous to the above, but with the average distance replaced by a median.

MAHALANOBIS: the minimal distance from the group in the Mahalanobis metric.

For the evaluation of the average and the median, the training part was used.

Fig. 1

Classification quality percentage graph for the best precision and sensitivity in relation to the number of centres for the AVERAGE type of categorisation

3 Results and Discussion

The results refer to the testing part of the extracted pairs. The result figures report several parameters related to the quality assessment of the classification:

1. Sensitivity: \(100*\textit{TP}/(\textit{TP}+\textit{FN})\), where \(\textit{TP}\) (true positive) is the number of correctly classified dysfluencies and \(\textit{FN}\) (false negative) is the number of unrecognised dysfluencies.

2. Precision: \(100*\textit{TP}/(\textit{TP}+\textit{FP})\), where \(\textit{TP}\) is the number of non-fluent pairs with recognised non-fluency and \(\textit{FP}\) (false positive) is the number of fluent pairs classified as non-fluent.

3. K: the number of centres in a representation vector.

4. Categorisation method (Sect. 2.9): AVERAGE, MEDIAN or MAHALANOBIS (Figs. 1, 2 and 3).
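The two quality measures follow directly from the counts:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = 100 * TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Precision = 100 * TP / (TP + FP)."""
    return 100.0 * tp / (tp + fp)
```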

Fig. 2

Classification quality percentage graph for the best precision and sensitivity in relation to the number of centres for the MEDIAN type of categorisation

Fig. 3

Classification quality percentage graph for the best precision and sensitivity in relation to the number of centres for the MAHALANOBIS type of categorisation

4 Conclusion

The results of the research confirmed the effectiveness of syllable and fragment repetition detection using the proposed method. Regarding the number K of centres, the best results were reached for \(5\le K\le 7\); the results for \(8\le K\le 9\) were good, and those for \(K=4\) slightly worse. This may follow from the number of formants represented by the centres: if K is too small, some formants may not be duly represented, e.g. in the case of two or three phonemes; if K is too large, the representation is good, but the excess of points distorts the results for smaller numbers of phonemes. Fortunately, the differences were not significant.

The second aspect examined was the categorisation. Regardless of its type, over 80 % sensitivity was achieved. The best results were obtained with the AVERAGE categorisation; the MAHALANOBIS and MEDIAN categorisations were slightly worse but similar to each other.

The analysis leads to the conclusion that detecting speech fragment repetitions, using the K-means method as a dimension reducer, distance analysis as a classifier and the linear prediction coefficients as the input values, is possible and reaches 89 % precision at 89 % sensitivity. The applied method of automatic speech segmentation into syllables and fragments determines the places where a syllable, phoneme or fragment repetition occurs with 89 % efficiency. The algorithm described in this paper, equipped with a database of recordings with predefined dysfluency occurrences, may also be applied to real-time detection of repetitions.