Keywords

1 Introduction

Inserting punctuation into the output of Automatic Speech Recognition (ASR) is a well-known problem in speech technology. The importance of punctuation in automatically generated text (transcripts, indexing, closed captions, metadata extraction, etc.) has been pointed out several times [1, 16]: punctuation improves human readability and also supports subsequent processing with text-based tools, which usually require punctuation marks in the very first step of their operation, tokenization. In dictation systems, punctuation marks can be dictated explicitly; in several other domains where ASR is used, however, this is not possible.

Two basic approaches to automatic punctuation can be distinguished, although they are often used in combination: prosody-based and text-based approaches. In general, prosody-based approaches require less computation and less training data, and hence can result in lightweight punctuation models. They are also more robust to ASR errors. Recently proposed text-based approaches, on the other hand, mostly provide more accurate punctuation, but they are more sensitive to ASR errors and may introduce high latency, because they process a wide context that requires extensive computation and includes future words.

In this paper we focus on reducing this latency while maintaining the accuracy provided by text-based models. We demonstrate systems intended for punctuating closed-captioned data. ASR technology is widely used by television companies to produce closed captions, especially for live programs [21], which require almost real-time processing with little latency.

Much effort has been devoted to developing reliable punctuation restoration algorithms. Early approaches proposed adding punctuation marks to the N-gram language model of the ASR as hidden events [8, 23]. These models have to be trained on huge corpora to reduce data sparsity [8]. More sophisticated sequence modeling approaches were also inspired by this idea: a transducer-like approach that takes non-punctuated text as input is capable of predicting punctuation, as was presented in numerous works [1, 3, 11], with frameworks built on top of Hidden Markov Models (HMM), Maximum Entropy (MaxEnt) models, conditional random fields, etc. MaxEnt models allow for an easy combination of textual and prosodic features in a common punctuation model [10]. In a comprehensive study [2], many features were compared in terms of their effect on the punctuation accuracy of a MaxEnt model. The most powerful textual features were found to be word forms and part-of-speech (POS) tags, whereas the best prosodic feature was the duration of inter-word pauses.

Applying a monolingual translation paradigm to punctuation, regarded as a sequence modeling task, was also proposed in [5], which also allowed for considerably reducing latency. Recently, deep neural network based sequence-to-sequence solutions have also been presented: by taking a large word context and projecting the words via an embedding layer into a bidirectional Recurrent Neural Network (RNN) [22], high-quality punctuation could be achieved. RNNs are successfully used in many sequence labeling tasks, as they are able to model large contexts and to learn distributed features of words to overcome data sparsity issues. The first attempt to use an RNN for punctuation restoration was presented in [24], where a unidirectional LSTM [9] was trained on Estonian broadcast transcripts. Shortly after, Tilk and Alumäe introduced a bidirectional RNN model using GRU [7] together with an attention mechanism, which outperformed the previous state of the art on Estonian and English IWSLT datasets [25]. In a recent study [15], capitalization and punctuation recovery are treated as correlated multiple sequence labeling tasks and modeled with a bidirectional RNN. In [14], a prosody-based punctuation approach was proposed using an RNN on top of phonological phrase sequence modeling.

In this paper, we introduce a lightweight RNN-based punctuation restoration model using bidirectional LSTM units on top of word embeddings, and compare its performance to a MaxEnt model. We pay special attention to low-latency solutions. Both approaches are evaluated on automatic and manual transcripts and in various setups, including on-line and off-line operation. We present results on Hungarian broadcast speech transcripts and on the IWSLT English dataset [4] to make the performance of our approach comparable to state-of-the-art systems. Apart from the purely prosody-based approach outlined in [14], we are not aware of any prior work on punctuation restoration for Hungarian speech transcripts.

Our paper is structured as follows: first we present the datasets in Sect. 2, then we describe the experimental systems in Sect. 3. The results of the Hungarian and English punctuation restoration tasks are presented and discussed in Sect. 4. Our conclusions and ideas for future work are given in Sect. 5.

2 Data

2.1 The Hungarian Broadcast Dataset

The Hungarian dataset consists of manually transcribed closed captions made available by the Media Service Support and Asset Management Fund (MTVA), Hungary’s public service broadcaster. The dataset contains captions for various TV genres, enabling us to evaluate the punctuation models on different speech types, such as weather forecasts, broadcast news and conversations, magazines, sport news and sport magazines. We focus on restoring those punctuation marks that are highly important for understandability in Hungarian: commas, periods, question marks and exclamation marks. Colons and semicolons were mapped to commas. All other punctuation symbols were removed from the corpora. We reserve a disjoint 20% of the corpus for validation and use a representative test set that does not overlap with the training and validation subsets. For further statistics about the training and test data, we refer the reader to Table 1.
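The label preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessing code used for the experiments: the function name `text_to_slots`, the label names, the regex-based tokenization and the lower-casing are all hypothetical choices. Each label describes the slot that follows a word.

```python
import re

# Hypothetical label names for the four target marks named in the text.
LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}
# Colons and semicolons are mapped to commas.
MAPPED = {":": ",", ";": ","}
EMPTY = "O"  # assumed label for a slot without punctuation


def text_to_slots(text):
    """Return (tokens, labels): words and the label of the slot after each word."""
    tokens, labels = [], []
    # Group 1 matches a run of word characters, group 2 a single symbol.
    for match in re.finditer(r"(\w+)|([^\w\s])", text):
        word, symbol = match.groups()
        if word is not None:
            tokens.append(word.lower())  # lower-casing is an assumption
            labels.append(EMPTY)
            continue
        mark = MAPPED.get(symbol, symbol)
        # Keep only target marks; all other symbols are dropped.
        # If several marks follow one word, the first one wins.
        if mark in LABELS and labels and labels[-1] == EMPTY:
            labels[-1] = LABELS[mark]
    return tokens, labels
```

Note that this simple tokenizer splits hyphenated words at the hyphen; a production pipeline would need language-specific tokenization.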

Table 1. Statistics of the Hungarian dataset

The automatic transcription of the test set is carried out with an ASR system optimized for the task (closed captioning of live audio) [27]. The language model for the ASR was trained on the same corpus as the punctuation model and was coupled with a Deep Neural Network based acoustic model trained on roughly 500 hours of speech using the Kaldi ASR toolkit [18]. The average word error rate (WER) of the automatic transcripts was around 24%, but it showed a large variation depending on genre (see Table 1). Note that no audio data was available for the Mixed category in the test database.

2.2 The English IWSLT Dataset

The IWSLT dataset consists of English TED talk transcripts and has recently become a benchmark for evaluating English punctuation recovery models [4, 15, 24, 25]. We use the same training, validation and test sets as the studies above, containing 2.1 M, 296 K and 13 K words, respectively. This dataset covers only three types of punctuation marks: comma, period and question mark.

3 Experimental Setups

3.1 MaxEnt Model

The maximum entropy (MaxEnt) model was suggested by Ratnaparkhi for POS tagging [19]. In his framework, each sentence is described as a token (word) sequence, and each token to be classified is described with a set of unique features. In supervised learning, output labels are assigned to the token series, and the system learns to predict these labels from the features. To determine the set of features, the MaxEnt model defines a joint distribution over the available tags and the current context, which can be controlled with a radius parameter. Pre-defined features such as word forms, capitalization, etc. can also be added.

We use the MaxEnt model with word-form-related input features only, and all tokens are represented in lower case. To obtain these input features, we use Huntag, an open-source, language-independent Maximum Entropy Markov Model-based sequential tagger, for both the Hungarian and the English data [20].

The radius parameter of the MaxEnt tagger determines the size of the context considered. By default, both left (past) and right (future) context are taken into account. We will refer to this setup as off-line mode. As taking future context into account increases latency, we also consider limiting it, which we will refer to as on-line mode. In the experiments we use round brackets to specify the left and right context, respectively. Hence (5,1) means that we consider 5 past tokens and 1 future token.
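The (past, future) context notation can be illustrated with a small feature extractor. This is only a sketch of the windowing idea and does not reproduce the Huntag feature format or API; the function name `window_features`, the feature keys and the `<pad>` symbol are hypothetical.

```python
def window_features(tokens, i, past, future):
    """Word-form features for position i with `past` left and `future` right tokens."""
    feats = {}
    for offset in range(-past, future + 1):
        j = i + offset
        # Positions outside the sequence are filled with a padding symbol.
        feats[f"w[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<pad>"
    return feats
```

With `future=0` the decision for a position depends only on words already seen, whereas `future=1` means the tagger has to wait for one more word before it can label the position, which is the source of the added latency.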

3.2 Recurrent Neural Networks

We split the training, validation and test corpora into short, fixed-length sub-sequences called chunks (see the optimized length in Table 2), without overlap, i.e. such that every token appears exactly once. A vocabulary is built from the k most common words of the training set, and a catch-all “Unknown” entry is added to which rare words are mapped. Incomplete sub-sequences are padded with zeros. An embedding weight matrix is built from pre-trained embedding weights and the tokens of the vocabulary.
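A minimal sketch of this preparation step is given below. The function names and the concrete id assignment (0 for padding, 1 for the “Unknown” entry, frequent words from 2 upwards) are assumptions made for illustration; the text only specifies zero padding, the k most common words and a shared unknown entry.

```python
from collections import Counter

PAD_ID = 0  # zero padding for incomplete chunks
UNK_ID = 1  # assumed id of the shared "Unknown" entry


def build_vocab(train_tokens, k):
    """Map the k most common training words to ids starting at 2."""
    most_common = Counter(train_tokens).most_common(k)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}


def encode_chunks(tokens, vocab, chunk_len):
    """Split tokens into non-overlapping id chunks, zero-padding the last one."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokens]
    chunks = []
    for start in range(0, len(ids), chunk_len):
        chunk = ids[start:start + chunk_len]
        chunk += [PAD_ID] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks
```

Because the chunks do not overlap, every token is labeled exactly once, at the cost of a truncated context for tokens near chunk boundaries.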

We investigate the performance of a unidirectional and a bidirectional RNN model in our experiments. Our target slot for punctuation prediction is the one preceding the current word. The architectures used are presented in Fig. 1.

Our RNN models (WE-LSTM and WE-BiLSTM, named after using “Word Embedding”) are built up in the following way: based on the embedding matrix, the preprocessed sequences are projected into the embedding space (\(x_{t}\) represents the word vector x at time step t). These features are fed into the following layer composed of LSTM or BiLSTM hidden cells, to capture the context of \(x_{t}\). The output is obtained by applying a softmax activation function to predict the \(y_{t}\) punctuation label for the slot preceding the current word \(x_{t}\). We chose this simple and lightweight structure to allow for real-time operation with low latency.
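The processing chain described above can be summarized in equations. The following is a sketch in our own notation: the one-hot word index \(w_{t}\), the embedding matrix \(E\), the hidden state \(h_{t}\) and the output parameters \(W\) and \(b\) are symbols assumed here for illustration, and the internal LSTM gate equations are omitted.

$$\begin{aligned} x_{t}&=E\,w_{t},\\ h_{t}&=\mathrm{LSTM}(x_{t},h_{t-1}) \quad \text{(WE-LSTM)},\\ h_{t}&=\left[ \overrightarrow{h}_{t};\overleftarrow{h}_{t}\right] \quad \text{(WE-BiLSTM)},\\ P(y_{t}\mid x)&=\mathrm{softmax}(W h_{t}+b). \end{aligned}$$

In the unidirectional case \(h_{t}\) depends only on \(x_{1},\dots ,x_{t}\), so a label can be emitted as soon as the current word is available; in the bidirectional case the backward state \(\overleftarrow{h}_{t}\) also depends on the words that follow within the chunk.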

Fig. 1. Structure of WE-BiLSTM (left) and WE-LSTM (right) RNN model

The Hungarian punctuation models were trained on the 100 K most frequent words in the training corpus, with the remaining rare words mapped to a shared “Unknown” symbol. The RNN-based recovery models use 600-dimensional pre-trained Hungarian word embeddings [13]. This relatively high dimensionality of the embeddings comes from the highly agglutinating nature of Hungarian. In our English RNN models, a 100-dimensional pre-trained “GloVe” word embedding [17] is used for projection. During training, we use the categorical cross-entropy cost function and also allow the imported embeddings to be updated.

We performed a systematic grid search over the hyperparameters of the RNNs on the validation set: chunk length, vocabulary size, number of hidden states, mini-batch size and optimizer. We also use early stopping, controlled by a patience parameter, to prevent overfitting. Table 2 summarizes the final values of each hyperparameter used in the Hungarian and the English WE-BiLSTM and WE-LSTM models, including those that were inherited from [25] to ensure partial comparability.
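Early stopping with patience can be sketched as follows. This is a generic illustration rather than the training code of the experiments; it assumes that the monitored quantity is the validation loss and that training stops once the loss has failed to improve for `patience` consecutive epochs. The function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the last epoch that is run under early stopping."""
    best = float("inf")
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    # The loss kept improving often enough; all epochs are run.
    return len(val_losses) - 1
```

A larger patience tolerates temporary plateaus in the validation loss but lengthens training.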

Table 2. Hyperparameters of WE-BiLSTM and WE-LSTM models

As in the MaxEnt setup, we distinguish between a low-latency, lightweight on-line mode and a robust off-line mode that uses the future context. All RNN models for punctuation recovery were implemented with the Keras library [6] and trained on GPU. The source code of the RNN models is publicly available (see Footnote 1).

We briefly mention that, besides word forms, we also considered other textual features: lemmas, POS tags (also suggested by [26]) and morphological analysis. The latter were extracted using the magyarlánc toolkit, designed for morphological analysis and dependency parsing in Hungarian [28]. Nevertheless, as word forms yielded the most encouraging results, and as further analysis for feature extraction increases latency considerably, the evaluated experimental systems rely on word-form features only, which are input to the embedding layers.

4 Results and Discussion

This section presents the punctuation recovery results for the Hungarian and English tasks. For evaluation, we use standard information retrieval metrics such as Precision (Pr), Recall (Rc), and the F1-Score (F1). In addition, we also calculate the Slot Error Rate (SER) [12], as it is able to incorporate all types of punctuation errors – insertions (Ins), substitutions (Subs) and deletions (Dels) – into a single measure:

$$\begin{aligned} \mathrm{SER}=\frac{C(\mathrm{Ins})+C(\mathrm{Subs})+C(\mathrm{Dels})}{C(\text{total slots})}, \end{aligned} \tag{1}$$

where slots are considered after each word in the transcription and \(C(\cdot)\) is the count operator.
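The metrics can be computed from aligned slot label sequences as in the sketch below. Several choices here are assumptions: the function name, the `"O"` label for an empty slot, the pooling of all punctuation types into overall Precision, Recall and F1, and in particular the SER denominator, which is taken to be the number of reference slots that hold a punctuation mark. This reading is consistent with SER values above 100% being possible; if every inter-word position is meant to be counted, the denominator should be replaced by the sequence length.

```python
def punctuation_scores(reference, hypothesis, empty="O"):
    """Return SER, precision, recall and F1 for aligned slot label sequences."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same number of slots")
    ins = subs = dels = correct = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref == hyp:
            if ref != empty:
                correct += 1      # correctly restored mark
        elif ref == empty:
            ins += 1              # mark inserted into an empty slot
        elif hyp == empty:
            dels += 1             # reference mark missed
        else:
            subs += 1             # wrong mark type predicted
    ref_marks = correct + subs + dels   # assumed SER denominator
    hyp_marks = correct + subs + ins
    ser = (ins + subs + dels) / ref_marks if ref_marks else 0.0
    precision = correct / hyp_marks if hyp_marks else 0.0
    recall = correct / ref_marks if ref_marks else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"SER": ser, "Pr": precision, "Rc": recall, "F1": f1}
```

Unlike F1, SER penalizes insertions directly, so a system that over-generates punctuation can score above 100%.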

4.1 Hungarian Overall Results

First, we compare the performance of the baseline MaxEnt sequence tagger (see Subsect. 3.1) to the RNN-based punctuation recovery system (see Subsect. 3.2) on the Hungarian broadcast dataset. Both approaches are presented in two configurations. In the on-line mode punctuations are predicted for the slot preceding the current word in the input sequence resulting in a low latency system, suitable for real-time application. In the off-line mode, aimed at achieving the best result with the given features and architecture, the future word context is also exploited. Please note that hyperparameters of all approaches and configurations were optimized on the validation set as explained earlier (see Sect. 3).

The test evaluations are presented in Table 3 for the reference transcripts and in Table 4 for the automatic (ASR) transcripts. In the notation of the MaxEnt models, (i, j), i stands for the backward (past) radius, whereas j stands for the forward (future) radius. As can be seen, the prediction results for commas stand out from the others for all methods and configurations. This can be explained by the fact that Hungarian has generally clear rules for comma usage. In contrast, period prediction may also benefit from acoustic information, an assumption supported by the results in [14], which show robust period recovery with less effective comma restoration.

Table 3. Punctuation restoration results for Hungarian reference transcripts

As Table 3 shows, switching to RNN-based punctuation restoration for the Hungarian reference transcripts reduces SER by around 20% relative to the baseline MaxEnt approach. The WE-BiLSTM and WE-LSTM are especially beneficial in restoring periods, question marks and exclamation marks, as they are able to exploit large contexts much more efficiently than the MaxEnt tagger. Limiting the future context in the on-line configuration causes much less deterioration in the results than we had expected. The features from the future word sequence seem to be useful if the task requires maximizing recall; otherwise the WE-LSTM is an equally suitable model for punctuation recovery.

Table 4. Punctuation restoration results for Hungarian ASR transcripts

As outlined in the introduction, limiting the future context and the propagation of ASR errors into the punctuation recovery pipeline are considered to be the most important factors hindering effective punctuation recovery in live TV streams. Contrary to our expectations, the results show that a large future context is less crucial for robust punctuation recovery. In contrast, ASR errors seem to be more directly related to punctuation errors: switching from reference transcripts to ASR hypotheses resulted in a 15–20% increase in SER (see Table 4). Although the performance gap between the two approaches decreases when the input consists of ASR hypotheses, the RNN still outperforms the MaxEnt baseline by a large margin.

4.2 Hungarian Results by Genre

The Hungarian test database can be divided into 6 subsets based on the genres of the transcripts (see Table 1). We also analyzed punctuation recovery for these subsets, hypothesizing that more informal and more spontaneous genres are harder to punctuate, in parallel with the higher number of ASR errors seen in these scenarios. Some of the punctuation marks were not evaluated for specific genres (see “N/A” in Table 1) if the Precision or Recall could not be determined from their confusion matrix.

As the RNN-based approach outperformed the MaxEnt tagger for every genre, we decided to include only results of WE-BiLSTM and WE-LSTM systems in Tables 5 and 6 for better readability.

Table 5. Hungarian reference transcript results by genres

If we compare the results to the statistics in Table 1, it can be seen that the punctuation recovery system performed best on those genres (broadcast news, broadcast conversations, magazine) for which we had the most training samples. However, the relatively large difference among these three well-modeled genres suggests that there must be another factor in the background as well, namely the predictability of the given task. Analogous to language modeling, the more formal the task is, the more predictable the punctuation becomes (see the broadcast news results). Conversational (broadcast conversations) and informal (magazine) speech styles, characterized by less constrained wording and an increased number of disfluencies and ungrammatical phrases, make prediction more difficult and introduce punctuation errors compared to more formal styles.

Table 6. Hungarian ASR transcript results by genres

The relatively high SER of the weather forecast and sport program genres points out the importance of using a sufficient amount of in-domain training data. Besides collecting more training data, adaptation techniques could be utilized to improve results for these under-resourced genres.

By comparing the punctuation recovery errors of the reference and ASR transcripts, we can draw some interesting conclusions. For the well-modeled genres (Brc.-News, Brc.-Conv., magazine) the increase in SER correlates with the word error rate (WER) of the ASR transcript. However, for the remaining genres (weather, sport news, sport magazine), this relationship between SER and WER is much less predictable. It is particularly difficult to explain the relatively poor results for the sport news genre. Whereas the WER of the ASR transcript is moderate (24.7%), the SER of punctuation almost doubles for it (67% to 107%). We assume that this phenomenon is related to the high number of named entities in the sport news programs, considering that this genre has the highest OOV rate (10%) among all 6 tested genres.

4.3 English Results

In this subsection, we compare our solutions for punctuation recovery with some recently published models. For this purpose, we use the IWSLT English dataset, which consists of TED talk transcripts and is considered a benchmark for English punctuation recovery. For complete comparability, we used the default training, validation and test datasets. However, the hyperparameters were optimized for this task (see Table 2). Please note that the IWSLT dataset does not contain samples for exclamation marks.

We present the English punctuation recovery results in Tables 7 and 8. As can be seen, in on-line mode the proposed RNN (WE-LSTM) significantly outperformed the so-called T-LSTM configuration presented in [25], which, to the best of our knowledge, had the best on-line results on this dataset so far. Without pre-trained word embeddings (noWE-LSTM), our results come very close to those of the T-LSTM configuration.

Table 7. Punctuation restoration results for English reference transcripts
Table 8. Punctuation restoration results for English ASR transcripts

Although in this paper we primarily focused on creating a lightweight, low-latency punctuation recovery system, we also compared our WE-BiLSTM system to the best available off-line solutions. As shown in Tables 7 and 8, both the T-BRNN-pre configuration from [25] and Corr-BiRNN from [15] outperformed our WE-BiLSTM, mainly due to their better performance for commas and question marks. However, these punctuation recovery systems use a much more complex structure, and it is questionable whether they would be able to operate in real-time scenarios. We consider the high recall of periods achieved by our WE-BiLSTM models, on both reference and ASR transcripts, a notable result.

5 Conclusions

In this paper, we introduced a low-latency, RNN-based punctuation recovery system, which we evaluated on Hungarian and English datasets and compared to a MaxEnt sequence tagger. Both approaches were tested in off-line mode, where textual features could be used from both forward and backward directions, and also in on-line mode, where only backward features were used to allow for real-time operation. The RNN-based approach outperformed the MaxEnt baseline by a large margin in every test configuration. More surprisingly, the on-line mode causes only a small drop in the accuracy of punctuation recovery.

By comparing results on different genres of the Hungarian broadcast transcripts, we found (analogous to language modeling) that the accuracy of text-based punctuation restoration mainly depends on the amount of available training data and the predictability of the given task. Note that we are not aware of any prior work in the field of text-based punctuation recovery for Hungarian speech transcripts.

In order to compare our models to state-of-the-art punctuation recovery systems, we also evaluated them on the IWSLT English dataset in both on-line and off-line modes. In on-line mode, our WE-LSTM system achieved the overall best result. In off-line mode, however, some more complex networks turned out to perform better than our lightweight solution.

For future work, we are mainly interested in merging our word-level system with the prosody-based approach outlined in [14] for Hungarian. Extending the English model with further textual or acoustic features is also a promising direction, while keeping our focus on low latency for both languages.

All in all, we consider the following to be the important contributions of our work: (1) we use a lightweight and fast RNN model while closely maintaining performance; (2) we target real-time operation with little latency; (3) we apply the approach to the highly agglutinating Hungarian language, which has a much less constrained word order than English, as grammatical functions depend much less on word order than on suffixes (case endings), which makes sequence modeling more difficult due to the higher variation seen in the data.