
1 Introduction

Speaking and listening are the most common ways in which humans convey information and understand each other in daily conversations. Nowadays, speech interfaces have also been widely integrated into many applications/devices like Siri, Google Assistant, and Alexa [13]. These applications use speech recognition-based approaches [3, 11] to understand spoken user queries. Like speech, text is also a widely used medium in which people converse. Recent advances in language modeling and representation learning using deep learning approaches [2, 7, 24] have proven very promising at understanding the actual meaning of textual data, by capturing semantic, syntactic, and contextual relationships between words in the learned fixed-size vector representations.

Such computational language modeling is difficult in the case of speech for spoken language understanding because, unlike textual words, (1) the same spoken word can have different meanings when spoken in different tones/expressions [9], (2) it is difficult to identify sub-word units in speech because of the variable-length spacing and overlap between spoken words [34], and (3) stress/emphasis on a few syllables of a multi-syllabic word can increase the variability of speech production [27]. Although textual word representations capture semantic, syntactic, and contextual properties, they fail to capture tone/expression. Conversely, using only speech/audio data for training spoken-word representations results in semantically and syntactically poor representations.

In this paper, we therefore propose a novel spoken-word representation learning approach called STEPs-RL that uses speech and text entanglement to learn phonetically sound spoken-word representations, which capture not only acoustic and contextual features but also semantic, syntactic, and phonetic information. STEPs-RL is trained in a supervised manner such that the learned representations capture the phonetic structure of the spoken-words along with their inter-word semantic, syntactic, and contextual relationships. We validated the proposed model by (1) evaluating the semantic and syntactic relationships between the learned spoken-word representations on four widely used word similarity benchmark datasets and comparing its performance with textual word representations learned by Word2Vec and FastText (obtained using transcriptions), and (2) investigating the phonetic soundness of the generated vector space.

The rest of the paper is organized as follows: Sect. 2 describes the related work; Sect. 3 explains the proposed model architecture; Sect. 4 describes the datasets used, the pre-processing pipeline, and the training details for reproducibility. Experimental results are discussed in Sect. 5, and we conclude in Sect. 6.

2 Related Work

Earlier, speech processing was done using feature learning-based models like deep neural networks (DNNs) [28]. These models were able to capture contextual and temporal information from speech data after the introduction of sequential neural networks like RNNs [16], LSTMs [25], Bi-LSTMs [10, 36], and GRUs [29, 33]. Recent research by [23] presented a transformer-based self-supervised speech representation learning approach called TERA that uses multi-target auxiliary tasks and is trained by generating acoustic frame reconstructions; [30] introduced wav2vec, a CNN-based model pre-trained in an unsupervised manner using a contrastive loss to learn raw audio representations; [20] explored the use of black-box variational inference for linguistic representation learning of speech using an unsupervised generative model; [26] proposed Contrastive Predictive Coding (CPC) for extracting representations from high-dimensional data by predicting the future in latent space using autoregressive models; [18] proposed a novel variational autoencoder-based model that learns disentangled and interpretable latent representations of sequential data in an unsupervised manner; [22] used a BERT encoder for learning phonetically aware contextual speech representation vectors; and [4] proposed a Word2Vec-style sequence-to-sequence autoencoder model for embedding variable-length audio segments. Other works on learning fixed-length spoken-word vector representations that use multi-task learning include [5, 6, 19, 21, 32].

3 Model

In this paper, we propose STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning. STEPs-RL is a novel spoken-word representation learning approach that entangles speech- and text-based contextual information to learn phonetically sound spoken-word representations. The model architecture is shown in Fig. 1. Given a target spoken-word represented by \(S^t\), its left and right contextual spoken-words represented by \(S_{ctx}^{l} = \{S^i\}_{t-1-m}^{t-1}\) and \(S_{ctx}^{r} = \{S^i\}_{t+1}^{t+1+m}\) respectively (\(m\) denotes the context window size), along with the textual word embeddings of the corresponding spoken-words represented by \(W_{ctx}^{l} = \{W^i\}_{t-1-m}^{t-1}\), \(W^t\), and \(W_{ctx}^{r} = \{W^i\}_{t+1}^{t+1+m}\), the proposed model learns a vector representation of the target spoken-word that captures not only semantic, syntactic, and acoustic information but also phonetic information.

Fig. 1.

Illustration of the STEPs-RL model architecture.

Here, a single spoken-word \(S^i \in \mathbb {R}^{n\times d_{mfcc}}\) consists of a sequence of acoustic features represented by \(d_{mfcc}\)-dimensional Mel-frequency cepstral coefficients (MFCCs), and \(W^i \in \mathbb {R}^{d_{w}}\) represents the \(d_{w}\)-dimensional pre-trained textual word embedding of the corresponding spoken-word. Each spoken-word is padded with silence so that all of them consist of a sequence of \(n\) acoustic feature vectors.

Our approach uses Bidirectional LSTMs [31] for capturing contextual information. A Bidirectional LSTM (Bi-LSTM) uses two LSTM [15] networks (\(\overrightarrow{LSTM}, \overleftarrow{LSTM}\)) to capture contextual information in opposite directions (forward and backward) of a sequence (\(t_1,...,t_T\)). The final hidden representations corresponding to the sequence tokens are generated by concatenating (\(\oplus \)) the hidden representations (\(\overrightarrow{h_i},\overleftarrow{h_i}\)) produced by the two LSTM networks. The final hidden representation of the \(i^{th}\) token is therefore given by Eq. 1.

$$\begin{aligned} \overrightarrow{h_i} = \overrightarrow{LSTM}(t_i,\overrightarrow{h_{i-1}}), \quad \overleftarrow{h_i} = \overleftarrow{LSTM}(t_i,\overleftarrow{h_{i + 1}}), \quad h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i} \end{aligned}$$
(1)
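For readers who prefer code, the following minimal PyTorch sketch (not from the original paper; layer sizes and batch dimensions are arbitrary placeholders) illustrates how a Bi-LSTM produces the concatenated per-token hidden states of Eq. 1.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
d_in, d_hid, seq_len, batch = 50, 64, 12, 4

# A bidirectional LSTM runs a forward and a backward LSTM over the same sequence.
bilstm = nn.LSTM(input_size=d_in, hidden_size=d_hid,
                 batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, d_in)       # input sequence (t_1, ..., t_T)
h, _ = bilstm(x)                            # h: (batch, seq_len, 2 * d_hid)

# h[:, i] already holds the concatenation h_i = forward_h_i (+) backward_h_i of Eq. 1.
h_forward, h_backward = h[..., :d_hid], h[..., d_hid:]
assert torch.equal(h, torch.cat([h_forward, h_backward], dim=-1))
```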
Fig. 2.

(a) STEPs-RL Phase 1: Each of the individual Bi-LSTM captures contextual information. (b) STEPs-RL Phase 2: Speech & Text entanglement with target spoken word.

STEPs-RL consists of three independent Bi-LSTM networks, denoted \(BiLSTM_{C}\), \(BiLSTM_{T}\) and \(BiLSTM_{W}\), which capture contextual information respectively from (1) the acoustic features of the left and right contextual spoken-words, \(S_{ctx}^{l}\) and \(S_{ctx}^{r}\), (2) the acoustic features of the target spoken-word, \(S^t\), and (3) the pre-trained textual word embeddings of the target spoken-word, left contextual spoken-words, and right contextual spoken-words, \(W^t\), \(W_{ctx}^{l}\), and \(W_{ctx}^{r}\).

$$\begin{aligned} h^{C},\overrightarrow{o^{C}},\overleftarrow{o^{C}} = BiLSTM_{C}([S_{ctx}^{l},S_{ctx}^{r}]) \end{aligned}$$
(2)
$$\begin{aligned} h^{T},\overrightarrow{o^{T}},\overleftarrow{o^{T}} = BiLSTM_{T}([S^{t}]) \end{aligned}$$
(3)
$$\begin{aligned} h^{W},\overrightarrow{o^{W}},\overleftarrow{o^{W}}=BiLSTM_{W}([W_{ctx}^{l},W^t,W_{ctx}^{r}]) \end{aligned}$$
(4)

As shown in Eqs. 2, 3, and 4, all three Bi-LSTM networks generate a final hidden state representation corresponding to each timestamp (\(h^{C}\), \(h^{T}\), \(h^{W}\)), a final output of the corresponding forward LSTM network (\(\overrightarrow{o^{C}}\), \(\overrightarrow{o^{T}}\), \(\overrightarrow{o^{W}}\)), and a final output of the corresponding backward LSTM network (\(\overleftarrow{o^{C}}\), \(\overleftarrow{o^{T}}\), \(\overleftarrow{o^{W}}\)). The final forward and backward outputs of \(BiLSTM_{C}\) and \(BiLSTM_{W}\) are concatenated to generate \(f^C\) and \(f^W\) respectively, which later act as context vectors during the entanglement of speech and text.

$$\begin{aligned} f^C = \overrightarrow{o^{C}} \oplus \overleftarrow{o^{C}}, \quad f^W = \overrightarrow{o^{W}} \oplus \overleftarrow{o^{W}} \end{aligned}$$
(5)
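The sketch below, continuing the same hypothetical PyTorch setup, shows one way to realize Eqs. 2–5: three independent bidirectional encoders whose final forward and backward outputs are concatenated into the context vectors \(f^C\) and \(f^W\). All tensor shapes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

d_mfcc = d_w = 50                 # MFCC / word-embedding size (the paper uses 50, 100 or 300)
d_hid, n_frames, m, batch = 64, 40, 3, 2   # hidden size, frames per word, context window

# Three independent Bi-LSTM encoders, as in Eqs. 2-4.
BiLSTM_C = nn.LSTM(d_mfcc, d_hid, batch_first=True, bidirectional=True)
BiLSTM_T = nn.LSTM(d_mfcc, d_hid, batch_first=True, bidirectional=True)
BiLSTM_W = nn.LSTM(d_w,    d_hid, batch_first=True, bidirectional=True)

S_ctx = torch.randn(batch, 2 * m * n_frames, d_mfcc)   # left + right context MFCC frames
S_t   = torch.randn(batch, n_frames, d_mfcc)           # target spoken-word MFCC frames
W_seq = torch.randn(batch, 2 * m + 1, d_w)             # context + target word embeddings

hC, (oC, _) = BiLSTM_C(S_ctx)      # hC: hidden state per timestamp; oC: final states
hT, (oT, _) = BiLSTM_T(S_t)
hW, (oW, _) = BiLSTM_W(W_seq)

# Eq. 5: concatenate the final forward (index 0) and backward (index 1) outputs.
f_C = torch.cat([oC[0], oC[1]], dim=-1)   # (batch, 2 * d_hid)
f_W = torch.cat([oW[0], oW[1]], dim=-1)   # (batch, 2 * d_hid)
```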
Fig. 3.

STEPs-RL Phase 3: Latent representation learning

Intuitively (as shown in Fig. 2a), \(f^C\) represents the final contextual representation of the spoken-words present in the context of the target spoken-word, and \(f^W\) represents the final semantic and syntactic contextual representation of all the corresponding textual words. In other words, \(f^C\) captures the acoustic/speech-based contextual information whereas \(f^W\) captures the text-based contextual information. Both \(f^C\) and \(f^W\) are then used to entangle speech- and text-based contextual information with the target spoken-word by generating new speech- and text-entangled bidirectional hidden state representations (\(h^{T,C}\) and \(h^{T,W}\)) of the target spoken-word from the hidden representations generated by \(BiLSTM_{T}\), as shown in Eqs. 6 and 7.

$$\begin{aligned} h^{T,C} = [h_1^{T,C}, h_2^{T,C},...,h_n^{T,C}] = h^{T} \otimes f^C; \quad h_i^{T,C} = \alpha _i^{T,C} \times h_i^{T} \end{aligned}$$
(6)
$$\begin{aligned} h^{T,W} = [h_1^{T,W}, h_2^{T,W},...,h_n^{T,W}] = h^{T} \otimes f^W; \quad h_i^{T,W} = \alpha _i^{T,W} \times h_i^{T} \end{aligned}$$
(7)

In the above equations, (\(\otimes \)) denotes an element-wise attention function; \(h^{T,C}\) and \(h^{T,W}\) represent the newly generated speech-entangled and text-entangled hidden representations respectively; \(\alpha _i^{T,C}\) and \(\alpha _i^{T,W}\) represent the speech-entangled and text-entangled attention scores corresponding to the \(i^{th}\) timestamp of the hidden representations generated by \(BiLSTM_{T}\). The attention scores \(\alpha _i^{T,C}\) and \(\alpha _i^{T,W}\) are computed by taking the dot product (\(\bullet \)) of each timestamp of \(h^T\) with the context vectors \(f^C\) and \(f^W\) respectively, as shown in Eq. 8. The same is illustrated in Fig. 2b.

$$\begin{aligned} \alpha _i^{T,C} = h_i^{T} \bullet f^C, \quad \alpha _i^{T,W} = h_i^{T} \bullet f^W \end{aligned}$$
(8)
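A possible implementation of the entanglement step in Eqs. 6–8 is sketched below, assuming tensors with the same shapes as in the previous sketches: a raw dot-product score is computed per timestamp and each hidden state is rescaled by it (no normalization is applied, mirroring Eq. 8 as written).

```python
import torch

batch, n, d_hid = 2, 40, 64
hT  = torch.randn(batch, n, 2 * d_hid)   # hidden states of BiLSTM_T (one per timestamp)
f_C = torch.randn(batch, 2 * d_hid)      # speech context vector from Eq. 5
f_W = torch.randn(batch, 2 * d_hid)      # text context vector from Eq. 5

# Eq. 8: attention score per timestamp = dot product of h_i^T with the context vector.
alpha_TC = torch.einsum('bnd,bd->bn', hT, f_C)          # (batch, n)
alpha_TW = torch.einsum('bnd,bd->bn', hT, f_W)          # (batch, n)

# Eqs. 6-7: scale every timestamp of h^T by its score (element-wise entanglement).
h_TC = alpha_TC.unsqueeze(-1) * hT                      # speech-entangled hidden states
h_TW = alpha_TW.unsqueeze(-1) * hT                      # text-entangled hidden states
```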

Next, the proposed model uses the newly generated speech-entangled and text-entangled hidden representations \(h^{T,C}\) and \(h^{T,W}\), along with the original bidirectional hidden state representations \(h^T\) of the target spoken-word (generated by \(BiLSTM_{T}\)), to produce a latent vector representation \(z\) of the target spoken-word. The three hidden representations are stacked on top of each other (illustrated in Fig. 3) and the result is passed through a simple encoder LSTM network \(\overrightarrow{LSTM_{encode}}\).

$$\begin{aligned} z = \overrightarrow{LSTM_{encode}}([h^{T,C} \oplus h^{T,W} \oplus h^T]), \quad z_{new} = zW_1 + z_{aux}W_2 + B \end{aligned}$$
(9)

In Eq. 9, \(z\) is a fixed-size latent vector output by the encoder LSTM network. To add information about the speaker, the proposed model linearly combines this latent vector with an auxiliary vector \(z_{aux}\) to generate a new latent representation \(z_{new}\) of the target spoken-word. This new latent representation \(z_{new} \in \mathbb {R}^{d_{e}}\) is the \(d_{e}\)-dimensional vector representation that the proposed model learns. In Eq. 9, \(W_1 \in \mathbb {R}^{d\times d_{e}}\) and \(W_2 \in \mathbb {R}^{d_a\times d_{e}}\) are the learnable combination weights and \(B\) is the learnable bias. The auxiliary vector \(z_{aux} \in \mathbb {R}^{d_a}\) is a one-hot vector of size \(d_a\) that encodes the speaker's gender, dialect, or both. This auxiliary vector was introduced because the pronunciation of a word usually depends on the speaker's gender and dialect, and such information can therefore help in learning phonetically sound spoken-word representations.
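The following sketch shows one reading of Eq. 9, with the same illustrative tensor shapes as above: the three hidden-state sequences are concatenated along the feature axis (our interpretation of the stacking in Fig. 3), encoded by a unidirectional LSTM, and the resulting latent vector is linearly combined with a speaker auxiliary vector. The dimensions and the exact form of \(z_{aux}\) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

batch, n, d_hid = 2, 40, 64
d_e, d_a = 50, 10                         # latent size; e.g. 8-dim dialect + 2-dim gender
hT   = torch.randn(batch, n, 2 * d_hid)   # original hidden states of BiLSTM_T
h_TC = torch.randn(batch, n, 2 * d_hid)   # speech-entangled hidden states (Eq. 6)
h_TW = torch.randn(batch, n, 2 * d_hid)   # text-entangled hidden states (Eq. 7)

# Stack the three representations (interpreted here as feature-wise concatenation)
# and encode them into a fixed-size latent vector z with a unidirectional LSTM.
stacked = torch.cat([h_TC, h_TW, hT], dim=-1)            # (batch, n, 6 * d_hid)
LSTM_encode = nn.LSTM(6 * d_hid, d_e, batch_first=True)
_, (z, _) = LSTM_encode(stacked)
z = z.squeeze(0)                                         # (batch, d_e)

# Eq. 9 (right): linear combination with the one-hot speaker auxiliary vector z_aux.
W1 = torch.randn(d_e, d_e)                               # learnable in the real model
W2 = torch.randn(d_a, d_e)
B  = torch.zeros(d_e)
z_aux = torch.zeros(batch, d_a); z_aux[:, 0] = 1.0       # illustrative auxiliary vector
z_new = z @ W1 + z_aux @ W2 + B                          # final spoken-word representation
```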

Next, the proposed model uses a decoder LSTM network \(\overrightarrow{LSTM_{decode}}\) to predict the sequence of phonetic symbols \(Y = [y_1,...,y_k]\) of the target spoken-word from the latent representation \(z_{new}\) generated above, as shown in Eqs. 10 and 11.

$$\begin{aligned} P_{\theta }(y_i|Y_{<i}, z_{new}) = \varUpsilon (h_i^d, y_{i-1}) \end{aligned}$$
(10)
$$\begin{aligned} h_i^d = \varPsi (h_{i-1}^d, y_{i-1}) \end{aligned}$$
(11)

Here, \(\varPsi \) denotes the function that generates the hidden vectors \(h_i^d\) (the hidden state representations of the decoder network), and \(\varUpsilon \) denotes the function that computes the generative probability of the one-hot vector \(y_i\) (the target phonetic symbol). When \(i = 0\), the hidden vector \(h_i^d\) is \(z_{new}\) and \(y_i\) is the one-hot vector of “[SOP]”, the start-of-phoneme token. The proposed model is trained with the cross-entropy loss shown in Eq. 12, where the loss \(L\) is computed between the actual phonetic sequence of the target spoken-word (\(Y = [y_1,...,y_k]\)) and the predicted phonetic sequence (\(\hat{Y} = [\hat{y}_1,...,\hat{y}_k]\)).

$$\begin{aligned} L(Y,\hat{Y}) = \sum _{i=1}^{k}y_i\log \frac{1}{\hat{y}_i} \end{aligned}$$
(12)
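A hedged sketch of the decoding and training objective of Eqs. 10–12 is shown below. It assumes teacher forcing, a phoneme vocabulary of 31 symbols (27 corpus symbols plus 4 special tokens, as described in Sect. 4), and PyTorch's built-in cross-entropy, which coincides with Eq. 12 for one-hot targets; the specific layer choices are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_phones, d_e, k, batch = 31, 50, 50, 2   # vocabulary size, latent size, sequence length

embed       = nn.Embedding(n_phones, d_e)     # embeds the previous phoneme y_{i-1}
LSTM_decode = nn.LSTM(d_e, d_e, batch_first=True)
out_proj    = nn.Linear(d_e, n_phones)        # maps decoder state h_i^d to phoneme logits

z_new = torch.randn(batch, d_e)                 # latent spoken-word representation (Eq. 9)
y     = torch.randint(0, n_phones, (batch, k))  # gold phonetic sequence (teacher forcing)

# Eq. 11: the decoder starts from z_new and consumes y_{i-1} at every step
# (index 0 stands in for the "[SOP]" token in this sketch).
sop    = torch.zeros(batch, 1, dtype=torch.long)
inputs = embed(torch.cat([sop, y[:, :-1]], dim=1))            # (batch, k, d_e)
h0     = (z_new.unsqueeze(0), torch.zeros(1, batch, d_e))
h_dec, _ = LSTM_decode(inputs, h0)

# Eq. 10 + Eq. 12: generative probabilities per step and cross-entropy over the sequence.
logits = out_proj(h_dec)                                      # (batch, k, n_phones)
loss   = F.cross_entropy(logits.reshape(-1, n_phones), y.reshape(-1))
```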

4 Dataset and Experimental Setup

Table 1. Gender and dialect distribution of the speakers in TIMIT speech corpus.

For our experiments, we used the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus [8]. This corpus contains 16 kHz audio recordings of 630 speakers of 8 major American English dialects, of which approximately 70% were male and 30% were female, as shown in Table 1. The corpus consists of 6300 (5.4 h of) phonetically rich utterances (10 by each speaker) along with their time-aligned orthographic, phonetic, and word transcriptions.

All the recordings were segmented according to the spoken-word boundaries using the transcriptions, and each target spoken-word was paired with its left and right context spoken-words, their corresponding textual words, and the phonetic sequence of the target spoken-word. All spoken-word utterances were represented by their MFCC features and the textual words by pre-trained textual word embeddings, with the MFCC representations and the textual word embeddings of the same size (\(d_{mfcc}\) = \(d_{w}\)). One-hot encoded dialect (8-dimensional) and gender (2-dimensional) vectors were used as auxiliary information vectors. We used the standard train (462 speakers, 4956 utterances) and test (168 speakers, 1344 utterances) splits of the TIMIT speech corpus for training and testing the proposed model. Due to computational resource limitations, a context window size of 3 was used.

In all experiments the MFCC representations and the textual word embeddings were of the same size (\(d_{mfcc}\) = \(d_{w} \in \) {50, 100, 300}). For the textual word embeddings, the proposed model used two widely used pre-trained word embeddings: (1) Word2Vec [24], which is word-based, and (2) FastText [2], which is character-based. For all experiments, the proposed model was trained for 20 epochs with a mini-batch size of 100. The initial learning rate was set to \(0.01\) and the Adam optimizer was used. The Bi-LSTM and LSTM nodes were regularized using an L2 regularizer with a penalty of \(0.01\), and early stopping was used to avoid over-fitting. The size of the target spoken-word latent representation \(z_{new}\) was set to 50, 100, and 300 for comparison. All spoken-words were represented by a sequence of 50 phonetic symbols drawn from the 27 unique phonetic symbols present in the corpus along with four newly introduced symbols (“[SOPS]” for the start of each phonetic sequence, “[SEP]” for the separation/space between phonetic symbols, “[PAD]” for padding, and “[EOPS]” for the end of each phonetic sequence).
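As a concrete illustration of this pipeline, the sketch below shows how a single spoken-word segment could be cut from a TIMIT recording and converted into a fixed-length MFCC sequence. It uses librosa for MFCC extraction and approximates the silence padding with zero-valued frames; both choices, like the helper name itself, are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
import librosa   # assumed available; any MFCC extractor would do

def word_mfcc(wav_path, start_s, end_s, d_mfcc=50, n_frames=None, sr=16000):
    """Cut one spoken-word out of a recording using its time-aligned word boundary
    and return its MFCC sequence, optionally padded/truncated to n_frames.

    Zero-frame padding stands in for the silence padding described above; the
    exact padding scheme is an assumption of this sketch."""
    signal, sr = librosa.load(wav_path, sr=sr)
    segment = signal[int(start_s * sr):int(end_s * sr)]
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=d_mfcc).T   # (frames, d_mfcc)
    if n_frames is not None:
        pad = max(0, n_frames - len(mfcc))
        mfcc = np.pad(mfcc, ((0, pad), (0, 0)))[:n_frames]
    return mfcc

# The four special symbols added to the corpus's 27 original phonetic symbols.
SPECIAL_SYMBOLS = ["[SOPS]", "[SEP]", "[PAD]", "[EOPS]"]
```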

5 Results

Table 2. Phonetic sequence prediction results on the TIMIT speech corpus. We compare the test-set accuracy (%) of the STEPs-RL model using different sets of auxiliary information (gender (G), dialect (D)) with the base STEPs-RL model using no auxiliary information. The comparison is done for different textual word embedding sizes \(\varvec{d_{w}}\) = {50, 100, 300}, different spoken-word latent representation sizes \(\varvec{d_{e}}\) = {50, 100, 300}, and different word embeddings, Word2Vec (w) and FastText (f). The best performance in each configuration is marked in bold, the row of the best-performing model is highlighted, and the overall best performance is further marked together with its configuration (shown in blue).

For evaluation, we first tested the proposed model on the phonetic sequence prediction task with different spoken-word latent representation and textual word embedding sizes, and also tested its performance with different types of textual word embeddings (Word2Vec and FastText). We compared the phonetic sequence prediction accuracy (%) of the base STEPs-RL model (without any auxiliary information) with its variants that use different sets of auxiliary information (gender, dialect, or both). The results are shown in Table 2. It was observed that increasing the spoken-word representation size resulted in better performance, whereas the effect of the textual word embedding size was less evident. It was also observed that, in general, using Word2Vec textual word embeddings achieved better results than using FastText textual word embeddings. The addition of auxiliary information like dialect and gender showed clear improvements in accuracy over the base STEPs-RL model, validating the use of this type of auxiliary information for spoken-word representation learning. STEPs-RL performed best when it used both dialect (D) and gender (G) together in its auxiliary vector (STEPs-RL+D+G). For further evaluations, we therefore consider only the target spoken-word representations generated by the STEPs-RL+D+G model using the configuration marked blue in Table 2. Table 3a lists examples of four different spoken-words along with their actual phonetic sequences and the phonetic sequences predicted by the STEPs-RL+D+G model. These examples demonstrate the ability of the STEPs-RL+D+G model to encode phonetic information in the corresponding latent representations.

To further evaluate the latent representations generated by STEPs-RL+D+G, we use intrinsic methods to test the semantic and syntactic relationships between the generated latent representations of the spoken-words present in the corpus. To do so, we use four benchmark word similarity datasets and compare the performance of the spoken-word representations generated by STEPs-RL+D+G with the representations generated by text-based language models (Word2Vec and FastText) trained on the textual transcripts. The word similarity datasets are SimLex-999 [14], MTurk-771 [12], WS-353 [35] and Verb-143 [1]. These datasets contain pairs of English words and their corresponding human-annotated word similarity ratings. The word similarities between the spoken-words (in the case of STEPs-RL+D+G) and the textual words (in the case of Word2Vec and FastText) were obtained by measuring the cosine similarities between their corresponding representation vectors.
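A minimal sketch of this intrinsic evaluation is shown below, assuming the representations are stored in a plain word-to-vector dictionary (a hypothetical data layout, not the paper's): out-of-vocabulary pairs are skipped, cosine similarities are computed, and Spearman's \(\rho \) is taken against the human ratings.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_eval(pairs, human_scores, vectors):
    """pairs: list of (word_a, word_b); human_scores: annotated similarity ratings;
    vectors: dict mapping a word to its spoken-word (or textual) representation.
    Pairs with a word missing from the TIMIT vocabulary are skipped, as in the paper."""
    model_scores, gold = [], []
    for (a, b), s in zip(pairs, human_scores):
        if a in vectors and b in vectors:
            va, vb = vectors[a], vectors[b]
            cos = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            model_scores.append(cos)
            gold.append(s)
    rho, _ = spearmanr(model_scores, gold)   # Spearman's rank correlation (Table 3b)
    return rho
```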

Table 3. (a) Examples of the phonetic sequences generated by STEPs-RL+D+G model. (b) Performance of STEPs-RL+D+G compared to Word2Vec & FastText on four benchmark word similarity datasets.
Fig. 4.

Difference vectors corresponding to (a) Set 1: Word pairs differ in last few phonemes (b) Set 2: Word pairs differ in first few phonemes.

Table 3b reports Spearman’s rank correlation coefficient \(\rho \) between the human rankings and the ones generated by STEPs-RL+D+G, Word2Vec, and FastText. Since many words in these datasets are not present in the TIMIT speech corpus, only word pairs in which both words appear in the TIMIT speech corpus were considered. Table 3b shows that the performance of the spoken-word representations generated by STEPs-RL+D+G was comparable to that of the textual word representations generated by Word2Vec and FastText. This demonstrates that our proposed model was also able to capture semantic and syntactic information, although its scores were slightly lower than those of Word2Vec and FastText. We believe the primary reason for this difference is the disparity in the way different speakers speak: the same word can be spoken in different ways and can have different meanings depending on tone and expression, which may in turn lead to an entirely different representation for the same word. In addition, these word similarity datasets were built for textual words and do not take tone and expression into account, and, to the best of our knowledge, no comparable word similarity dataset exists for spoken-words. Keeping these issues in mind, the performance of the proposed model validates its ability to capture semantic and syntactic information in the representations it generates.

Next, we investigate the phonetic soundness of the vector space generated by the proposed model. A vector space can be said to be phonetically sound if the spoken-word representations of words with similar pronunciations lie close to each other in the vector space. For this investigation we use two sets of randomly chosen word pairs:

  • Set 1: (street, streets), (come, comes), (it, its), (project, projects), (investigation, investigations)

  • Set 2: (few, new), (bright, night), (bedroom, room)

Here, the word pairs in Set 1 differ in the last few phonemes and the word pairs in Set 2 differ in the first few phonemes. To illustrate the relationship between these word pairs, difference vectors were first computed between the average spoken-word vector representations of the words in the above-mentioned word pairs, and these high-dimensional difference vectors were then reduced to 2-dimensional vectors using PCA [17] for interpretation. The difference vectors corresponding to Set 1 and Set 2 are shown in Fig. 4. It can be observed that the difference vectors are similar in direction and magnitude. In both figures, phonetic replacements lead to similar transformations; for example, (come \(\rightarrow \) comes) is similar to (it \(\rightarrow \) its) in Fig. 4a, and (few \(\rightarrow \) new) is similar to (bright \(\rightarrow \) night) in Fig. 4b. These transformations are not perfectly aligned because we average representations of the same word spoken by different speakers with different accents and pronunciations, but despite this, the transformations are still very close to each other. Together, these experiments demonstrate the quality of the spoken-word vector representations generated by the proposed model using speech and text entanglement, which are not only semantically and syntactically adequate but also phonetically sound.
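The following sketch reproduces this analysis under the assumption that averaged spoken-word vectors are available in a dictionary (the helper name and data layout are hypothetical): difference vectors are computed for each word pair and projected to two dimensions with PCA for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA

def difference_vectors_2d(word_pairs, avg_vectors):
    """avg_vectors: dict mapping a word to its spoken-word representation averaged
    over all speakers who uttered it.  Returns a 2-D PCA projection of the
    difference vectors, one row per word pair, for plotting as in Fig. 4."""
    diffs = np.stack([avg_vectors[b] - avg_vectors[a] for a, b in word_pairs])
    return PCA(n_components=2).fit_transform(diffs)

# Example usage with the Set 1 pairs (differing in the last few phonemes).
set1 = [("street", "streets"), ("come", "comes"), ("it", "its"),
        ("project", "projects"), ("investigation", "investigations")]
# points = difference_vectors_2d(set1, avg_vectors)   # avg_vectors built beforehand
```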

6 Conclusion

In this paper, we introduced STEPs-RL for learning phonetically sound spoken-word representations using speech and text entanglement. Our approach achieved an accuracy of 89.47% in predicting phonetic sequences when both the gender and the dialect of the speaker were used as auxiliary information. We also compared its performance across different configurations and observed that the performance of the proposed model improved with (1) larger spoken-word latent representation sizes and (2) the addition of auxiliary information like gender and dialect. We not only validated the capability of the learned representations to capture semantic and syntactic relationships between spoken-words but also illustrated the soundness of the phonetic structure of the generated vector space.