1 Introduction

Deep neural networks (DNNs) have been extensively studied in the past few decades and employed in multiple pattern recognition tasks owing to their high performance when large labeled datasets are available [1,2,3]. For handwriting recognition, DNNs have achieved increasing recognition accuracy [4,5,6] on many benchmark databases [7,8,9,10,11] of Latin, Arabic, Chinese, Indic, and Japanese scripts. These models require more labeled samples for training when the number of parameters is high [12, 13]. On the other hand, they do not take advantage of unlabeled samples. Unlabeled samples are easier to collect in large quantities and at a lower cost than labeled samples. For example, the two new databases of handwritten answers, namely SCUT-EPT [14] and NCUEE-HJA [15], have 40,000 labeled sentences and more than 190,000 unlabeled sentences, respectively. Only a few studies have utilized unlabeled samples for handwritten text recognition [16, 17]. Thus, we aim to create a generalized learning framework for any handwriting recognizer that satisfies two criteria: (i) it is trainable with as little labeled data as possible; (ii) it can utilize both unlabeled and labeled data.

Thus far, semi-supervised learning (SSL) methods have been established and developed to address the use of unlabeled data. Since the early deep learning era, Pseudo-Labeling has been proposed and extended for image classification tasks [18]. In the Pseudo-Labeling method, a pre-trained model is initialized using a small, labeled subset and is then used to predict the pseudo labels of a large unlabeled subset. Next, the unlabeled subset with the corresponding pseudo labels is used to re-train the model. Generally, Pseudo-Labeling is similar to the teacher-student training framework, where the initialized supervised pre-trained model is a teacher model while the training model is a student model. The teacher model provides pseudo labels for training a student model with unlabeled input samples. Thus, the handwriting recognizer is optimized on both the labeled and unlabeled samples using features from the unlabeled samples.

In fact, the Pseudo-Labeling method depends on the quality of the pseudo labels, as erroneous predictions often appear early in the training process [19]. Handwritten text recognition (HTR) is considered a sequential labeling task requiring a sequence of character predictions. It is difficult to employ Pseudo-Labeling for training HTR because misrecognized labels might lead to incorrect predictions in the rest of the sequence. Hence, we propose a framework, termed the Incremental Teacher Model, to gradually extend the effect of pseudo labels during the training process. The teacher model is incrementally updated after each epoch by its student model.

We have not focused on developing a novel handwriting recognizer in this work. Instead, we employ the proposed framework to train existing handwriting recognition architectures: Convolutional Recurrent Neural Network (CRNN) with connectionist temporal classification (CTC) [20], Attention-based Encoder-Decoder (AED) [21], and Self-Attention-based CRNN with CTC [22]. These handwriting recognition architectures utilize unlabeled data using the proposed SSL framework.

The rest of this paper is organized as follows: Sect. 2 reviews related studies on SSL methods. Section 3 presents our proposed framework with Mixed Augmentations and Scheduled Pseudo-Label loss. Section 4 presents the experiments and results of the proposed framework applied to different HTR architectures. In Sect. 5, we draw conclusions.

2 Related Works

Although DNNs have been continuously improved for higher performance, they strongly depend on large-scale labeled datasets for training. In fact, it is difficult to efficiently adapt them to new tasks, such as recognizing unseen or seen characters written in a new writing style. During the last few years, meta-learning has been widely studied to enable DNNs to learn new patterns from a few training samples [23]. It is a wide field of machine learning that includes few-shot learning, one-shot learning, and domain adaptation [17, 24, 25]. Among them, the domain adaptation (DA) methods, particularly methods following the SSL approach, are promising for generalizing a handwriting recognizer using both labeled and unlabeled data. Specifically, we focus on the inner-domain handwriting recognition task where training and testing sets have the same categories.

Two main approaches are studied based on these assumptions: consistency regularization and entropy minimization. Consistency regularization is mainly based on data augmentation and weight noise by dropout, as small changes should not significantly affect the prediction made by the network. The consistency loss measures the distance between the network predictions, with and without augmentations for input samples. Some well-known methods in this approach are the Π-Model [26], Temporal Ensembling [26], Mean Teacher [27], and Virtual Adversarial Training (VAT) [28].

The Π-Model employs stochastic augmentation to provide minor changes in each input sample. It also applies dropout to inject noise into the weights of a given DNN model. The distance between the predictions of the original sample (without either augmentation or dropout) and its variant (with both augmentation and dropout) is then minimized. While the Π-Model requires two executions of the network for every sample, Temporal Ensembling keeps and updates the ensembled prediction of every sample during the training process; thus, its computation cost is lower than that of the Π-Model. Mean Teacher focuses on updating the ensembled model instead of tracking the ensembled predictions, so it converges faster than Temporal Ensembling. On the other hand, VAT approximates the perturbation of each input sample that affects the output class distribution most significantly.
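As a minimal sketch of the two ideas above, the snippet below computes a consistency loss as the mean squared difference between two predictions and updates an ensembled (teacher) model as an exponential moving average of student weights, which is how Mean Teacher maintains its teacher. The function names, the plain-dictionary weights, and the decay value are illustrative assumptions, not details taken from the cited methods' implementations.

```python
def consistency_loss(pred_a, pred_b):
    """Mean squared difference between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight.

    `decay` is an assumed value; weights are plain floats keyed by name.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Toy example: predictions with and without augmentation for one sample.
loss = consistency_loss([0.5, 0.5], [1.0, 0.0])

# Toy example: one averaging step with hypothetical weights.
teacher = ema_update({"w": 1.0, "b": 0.0}, {"w": 0.0, "b": 1.0}, decay=0.9)
```

Averaging weights rather than stored per-sample predictions is what lets the ensembled model be refreshed at every step instead of once per pass over the data.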

Entropy minimization prevents the decision boundary from lying near the low-confidence prediction region in the feature space. A simple loss term is commonly used to minimize the entropy for unlabeled data with all the classes. Two well-known methods based on entropy minimization are Pseudo-Labeling [18] and Label Propagation [29]. Pseudo-Labeling trains a student model based on a teacher model’s predictions or pseudo labels, in which the teacher model is pre-trained using supervised learning. On the other hand, Label Propagation diffuses labels from labeled samples to unlabeled ones according to the propagation weights computed from pairwise similarity scores.
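The link between entropy minimization and pseudo labels can be illustrated with a small sketch: a uniform prediction has maximal entropy, while the hard pseudo label derived from a prediction is a one-hot target with zero entropy. The helper names below are hypothetical and the example covers a single classification output, not a character sequence.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hard_pseudo_label(probs):
    """One-hot target for the most probable class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


uncertain = [0.5, 0.5]          # prediction near the decision boundary
confident = [0.9, 0.1]          # prediction far from the boundary
target = hard_pseudo_label(confident)
```

Training toward `target` pushes the prediction to a lower-entropy distribution than `confident` itself.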

Recent studies have combined consistency regularization and entropy minimization, such as MixMatch [30] and FixMatch [31]. These methods apply multiple augmentations on a single unlabeled sample and force the model to predict these augmented input data similarly. By combining numerous augmentations, the trained model extracts invariant features to improve the overall performance even using a small number of labeled samples.

3 Methodology

By extending the Pseudo-Labeling method, we propose an SSL framework integrated with mixed augmentations and multiple losses, as shown in Fig. 1. First, an initial handwriting recognizer is prepared as a student model by supervised learning on labeled data. Second, mixed augmentations are applied to generate a weakly transformed variant and a strongly transformed variant from each original sample. Third, the teacher model produces a pseudo label from the weakly augmented variant, and a pseudo-label loss is then computed against this pseudo label on the strongly augmented variant. The predictions of the teacher model should not contain the special tokens for padding or blank [PAD], start of sequence [SOS], and unknown [UNK]. These tokens are eliminated from the predictions to maintain the quality of the pseudo labels. Fourth, the student model is trained by minimizing both the supervised and pseudo-label losses with a flexible ratio. The ratio depends on the rate between labeled and unlabeled samples in a single training minibatch and the number of trained epochs. Note that the pseudo-label loss is introduced gradually when updating the handwriting recognizer to avoid the negative effect of incorrect pseudo labels; we term this the Scheduled Pseudo-Label loss. Finally, the teacher model is incrementally updated using the student model and used for evaluation.
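The elimination of special tokens from a teacher prediction can be sketched as a simple filter. This is an illustration under assumptions: the prediction is taken to be a list of token strings, the token spellings follow the text above, and the function name is hypothetical.

```python
# Special tokens that should not appear in a pseudo label.
SPECIAL_TOKENS = frozenset({"[PAD]", "[SOS]", "[UNK]"})


def clean_pseudo_label(tokens, special=SPECIAL_TOKENS):
    """Drop special tokens from a predicted token sequence, keeping order."""
    return [tok for tok in tokens if tok not in special]


# Toy teacher prediction for one word image.
raw = ["[SOS]", "t", "h", "[UNK]", "e", "[PAD]", "[PAD]"]
pseudo_label = clean_pseudo_label(raw)
```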

Although the Mean Teacher and Pseudo-Labeling methods are the basis of this study, they follow different training schemes. Thus, we modified their training schemes to be similar to that of our model to achieve a fair comparison with the proposed framework in this study.

Fig. 1. Workflow of our proposed Incremental Teacher Model with Mixed Augmentations and Scheduled Pseudo-Label loss. The single-line arrows illustrate supervised learning using labeled samples, whereas the double-line arrows represent SSL with unlabeled samples.

3.1 Incremental Teacher Model

The models that generate pseudo labels are updated differently depending on the research and application. In [18], the teacher model is commonly pre-trained and fixed; therefore, the predicted pseudo labels are stable for training the student model. This approach works well when the teacher model is sufficiently trained on labeled data. In practice, however, many labeled samples are not always available. On the other hand, methods that compute consistency regularization, such as Mean Teacher, can simultaneously train the student model and the teacher model that generates the pseudo labels in the training process. However, they might update the teacher model with a worse student model in the early stage of the training process. Thus, we propose to update the teacher model with the student model whenever the validation accuracy is improved at the end of each training epoch. The teacher model is updated by copying the weight parameters from the student model. Finally, the teacher model is used for evaluation. To the best of our knowledge, this is the first work applying incremental updates of the teacher model for handwriting recognition using pseudo labels.
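The update rule described above can be sketched as follows. This is a minimal illustration, assuming that "improved" means exceeding the best validation accuracy seen so far and that model weights can be represented as a plain dictionary; the class and method names are hypothetical.

```python
import copy


class IncrementalTeacher:
    """Teacher that copies student weights only when validation improves."""

    def __init__(self, initial_weights, initial_val_acc):
        # The teacher starts from the supervised pre-trained recognizer.
        self.weights = copy.deepcopy(initial_weights)
        self.best_val_acc = initial_val_acc

    def maybe_update(self, student_weights, val_acc):
        """Call at the end of each epoch; returns True if the teacher changed."""
        if val_acc > self.best_val_acc:
            self.weights = copy.deepcopy(student_weights)
            self.best_val_acc = val_acc
            return True
        return False


# Toy run: (student weights, validation accuracy) after three epochs.
teacher = IncrementalTeacher({"w": 0.0}, initial_val_acc=0.60)
epochs = [({"w": 0.1}, 0.58), ({"w": 0.2}, 0.65), ({"w": 0.3}, 0.63)]
updated = [teacher.maybe_update(w, acc) for w, acc in epochs]
```

In this toy run only the second epoch replaces the teacher, so a temporarily worse student never degrades the pseudo labels.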

A well-initialized pre-trained model is essential to prepare a good teacher model to enhance the performance of the student model later. Because RotNet has been demonstrated to be effective for general images with complex backgrounds [32], we expected that it would be suitable for HTR with simple backgrounds. Moreover, the aspect ratio of the handwritten word images was within the range of that of general images. Therefore, we employed RotNet, a self-supervised learning method for predicting the rotation of images, as a pretext task. This initialization method provides more general network weights to achieve a higher accuracy using supervised learning or SSL in the later training process.

3.2 Mixed Augmentations

In recent years, augmentation has played an important role in avoiding overfitting during the DNN training process [33] since it provides a large number of variants from a small number of samples. With more variants, a well-trained DNN model with augmentation tends to perform better extraction and focus on the invariant features. Since augmentation does not require newly collected data, it is commonly employed as an efficient method to improve the DNN performance. On the other hand, sequence-to-sequence contrastive learning (SeqCLR) has been proposed to employ stochastic image augmentation to generate two different variants from a single input sample [16]. Subsequently, the mapping between two extracted feature sequences is computed and considered the contrastive loss for optimization. In addition, augmentations are employed to generate multiple variants of a single sample for training based on prediction consistency [30].

In this study, we used multiple augmentation methods to generate two variants from a sample, which we call “Mixed Augmentations”. One variant used smaller deformations to obtain a pseudo label, while the other had larger deformations. Note that the stochastic image augmentation in SeqCLR randomly generates two variants of an original sample by applying a single transforming pipeline twice. Owing to the asymmetry of the proposed framework, the two variants in our method are generated by two different transforming pipelines (weak and strong).

Augmentations used in general image recognition, such as in FixMatch [31], are composed of geometric transforms for the weak transformation and multiple mixed transformations for the strong transformation. For handwriting recognition, however, geometric transforms must be limited to maintain the readability of the augmented handwritten images. Thus, we use four augmentations, namely rotation, crop, perspective, and Gaussian blur, which are commonly employed in handwriting recognition studies, as shown in Table 1. These settings are based on comparative experiments and applied consistently in experiments with many HTR architectures and in different labeled ratio scenarios.

Table 1. Details of Mixed Augmentations.

3.3 Scheduled Pseudo-Label Loss

For training samples \(\mathrm{X}\) with corresponding labels \(\mathrm{Y}\), the supervised loss is based on the negative log-likelihood as follows:

$${\mathcal{L}}_{SL}=\sum_{\left(X,Y\right)}-\log p\left(Y\mid X\right)$$
(1)

The pseudo-label loss for the unlabeled training samples \(\mathrm{X}\) is defined as follows:

$${\mathcal{L}}_{PL}=\sum_{X}-\log p\left(\overline{Y}\mid \overline{\overline{X}}\right)\quad \text{with}\quad \overline{Y}=\mathrm{teacher}\left(\overline{X}\right)$$
(2)

Here, \(\overline{X }\) and \(\overline{\overline{X}}\) are the weakly and strongly transformed variants from \(\mathrm{X}\), respectively. The pseudo labels \(\overline{Y }\) are predicted by the teacher model on \(\overline{X }\). Thus, the pseudo-label loss is based on the conditional probabilities of the pseudo-label \(\overline{Y }\) for the strongly transformed variants \(\overline{\overline{X}}\).
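Eq. (2) can be read as the following loop, given here as a hedged sketch rather than the implementation used in the experiments. The callables `teacher_predict`, `student_log_prob`, `weak`, and `strong` are hypothetical stand-ins for the teacher model, the log-probability assigned by the model being trained, and the two augmentation pipelines.

```python
import math


def pseudo_label_loss(unlabeled, teacher_predict, student_log_prob, weak, strong):
    """Sum of -log p(Y_bar | X_strong) over unlabeled samples, as in Eq. (2)."""
    total = 0.0
    for x in unlabeled:
        y_bar = teacher_predict(weak(x))               # pseudo label from the weak variant
        total += -student_log_prob(y_bar, strong(x))   # NLL on the strong variant
    return total


# Toy stand-ins: strings play the role of images, and the "student"
# assigns probability 0.5 to every pseudo label.
identity = lambda x: x
toy_teacher = lambda x: x.lower()
toy_student = lambda y, x: math.log(0.5)

loss = pseudo_label_loss(["The", "of"], toy_teacher, toy_student, identity, identity)
```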

We introduce scheduling of the loss calculations for the pseudo labels of the unlabeled samples. This aims to avoid the problem that the target model does not converge owing to incorrect pseudo labels generated in the early stages of training. Label scheduling has been proposed alongside Pseudo-Labeling, and several variations have been considered in other related studies. In this study, we applied the Scheduled Pseudo-Label loss as follows:

$${\mathcal{L}}_{SPL}=\frac{1}{n}{\mathcal{L}}_{SL}+\alpha\left(t\right)\frac{1}{n'}{\mathcal{L}}_{PL}$$
(3)

where \(n\) is the total number of labeled samples, \(n'\) is the total number of unlabeled samples, \(t\) is the training epoch, and \(\alpha \left(t\right)\) is the scheduled weight for \({\mathcal{L}}_{PL}\), which depends on \(T_{1}\), \(T_{2}\), and \(A\) as shown below:

$$\alpha\left(t\right)=\begin{cases}0, & t<T_{1}\\ \dfrac{t-T_{1}}{T_{2}-T_{1}}A, & T_{1}\le t<T_{2}\\ A, & T_{2}\le t\end{cases}$$
(4)

Thus, \({\mathcal{L}}_{PL}\) begins to affect \({\mathcal{L}}_{SPL}\) once the epoch number exceeds \(T_{1}\); its weight then increases linearly until epoch \(T_{2}\), after which it remains at its maximum value \(A\). In this study, we set \(T_{1}=50\), \(T_{2}=250\), and \(A=1\), so that \({\mathcal{L}}_{PL}\) is used from the midpoint of learning on the labeled data. Note that these hyperparameters of the Scheduled Pseudo-Label loss were chosen experimentally.
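The schedule in Eq. (4) and the combination in Eq. (3) can be sketched directly in code. The defaults follow the values stated above (50, 250, and 1); the function names and the guard against an empty labeled or unlabeled set are assumptions added for illustration.

```python
def alpha(t, t1=50, t2=250, a=1.0):
    """Scheduled weight of the pseudo-label loss at epoch t, as in Eq. (4)."""
    if t < t1:
        return 0.0
    if t < t2:
        return a * (t - t1) / (t2 - t1)
    return a


def scheduled_pseudo_label_loss(sl_sum, n_labeled, pl_sum, n_unlabeled, t):
    """Combine the summed losses as in Eq. (3).

    The zero-count guards are an added assumption, not part of Eq. (3).
    """
    supervised = sl_sum / n_labeled if n_labeled else 0.0
    pseudo = pl_sum / n_unlabeled if n_unlabeled else 0.0
    return supervised + alpha(t) * pseudo


# Weights at a few epochs: before T1, halfway between T1 and T2, after T2.
weights = [alpha(t) for t in (0, 50, 150, 250, 300)]

# Toy numbers: summed losses of 8.0 over 4 labeled and 6.0 over 3 unlabeled samples.
loss_mid = scheduled_pseudo_label_loss(8.0, 4, 6.0, 3, t=150)
```

With these toy numbers the pseudo-label term contributes half of its mean value at epoch 150 and its full mean value from epoch 250 onward.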

4 Experiments

4.1 IAM Handwriting Database and Scenarios for SSL

We used handwritten English word-level patterns of the IAM database for evaluation because they have been used as the benchmark for many HTR studies [7]. Although the SSL methods have been employed for many recognition tasks, they have not been widely applied in handwriting recognition as mentioned in the review section. For handwriting recognition, a sequence of characters is required for prediction instead of single characters. Thus, preliminary experiments at the word level are the most straightforward HTR task.

Table 2 shows four splitting scenarios derived from the RWTH Aachen University split of the IAM handwriting database, where Words, Pages, and Writers denote the numbers of labeled and unlabeled samples in the training set, the number of samples in the validation set, and that in the testing set, respectively. There is no writer duplication between the labeled and unlabeled samples. These scenarios are prepared to evaluate the SSL methods with our handwriting recognizers. These splitting scenarios satisfy the writer-independent requirement, which is commonly used to benchmark handwritten English text recognizers.

Scenario 1 is the same as the supervised learning configuration without unlabeled samples. Scenarios 2, 3, and 4 are prepared by randomly selecting 50%, 10%, and 1% of the training set as the labeled training sets, respectively, while the rest is used as the unlabeled training sets. Note that the labeled training set of Scenario 4 (1% labeled) does not include eight character categories, which is over 10% of all character categories (8/79). Thus, Scenario 4 is the most challenging, with unseen categories and writing styles.
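One way to obtain a labeled subset of a given ratio without writer duplication is to select whole writers at random until the target number of samples is reached. The sketch below illustrates that idea together with a check for character categories missing from the labeled subset. It is an assumption about how such a split could be built, not the procedure used to produce Table 2, and all names and the toy data are hypothetical.

```python
import random


def split_by_writer(samples, labeled_ratio, seed=0):
    """Split (writer_id, word) pairs into writer-disjoint labeled/unlabeled sets."""
    counts = {}
    for writer, _ in samples:
        counts[writer] = counts.get(writer, 0) + 1
    writers = sorted(counts)
    random.Random(seed).shuffle(writers)

    target = labeled_ratio * len(samples)
    labeled_writers, n_labeled = set(), 0
    for writer in writers:
        if n_labeled >= target:
            break
        labeled_writers.add(writer)
        n_labeled += counts[writer]

    labeled = [s for s in samples if s[0] in labeled_writers]
    unlabeled = [s for s in samples if s[0] not in labeled_writers]
    return labeled, unlabeled


def missing_characters(labeled, all_samples):
    """Character categories present in the full set but absent from the labeled set."""
    seen = {ch for _, word in labeled for ch in word}
    full = {ch for _, word in all_samples for ch in word}
    return full - seen


# Toy data: 10 writers with 5 words each.
data = [(w, f"word{w}") for w in range(10) for _ in range(5)]
labeled, unlabeled = split_by_writer(data, labeled_ratio=0.1)
missing = missing_characters(labeled, data)
```

With a small labeled ratio, `missing` is typically non-empty, which mirrors the unseen character categories noted for Scenario 4.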

Table 2. Details of SSL scenarios on IAM handwriting database.

4.2 Handwritten Text Recognition Architectures

We tested four handwriting recognition architectures in the experiments. The first is a CRNN using ResNet as a feature extractor and Bidirectional Long Short-Term Memory (BLSTM) with CTC [20]. The second is another general encoder–decoder architecture, where an attention layer guides the decoder (AED) [21]. The third is a Deep Convolutional Recurrent Neural Network (DCRN) derived from AED with a simple Convolutional Neural Network (CNN) and a stacked BLSTM that provides a deeper sequential encoder [22]. The fourth is a CRNN using multiple Self-Attention layers for the sequential encoder (SelfAttn) [22]. These are listed in Table 3 with their major components.

Table 3. Main components of four HTR architectures.

4.3 Results of Different Recognition Architectures

To the best of our knowledge, no related research has applied similar techniques to the HTR problem. The related studies were proposed for general image classification. For comparison, we experimented using Mean Teacher [27] and Pseudo-Labeling [18] because the proposed method is derived from them. Furthermore, we experimented using FixMatch [31], as it is one of the most efficient SSL methods. Note that we modified these SSL methods to match the training scheme used for our method.

Table 4 reports the results of four HTR architectures trained by different frameworks in each scenario. The baseline column shows the character accuracy rate (CAR) of the HTR architectures trained by only labeled samples, while the other columns show the CARs of trained HTR architectures using Mean Teacher, Pseudo-Labeling, FixMatch, and Incremental Teacher Model. For Pseudo-Labeling, we followed the default setting of scheduling parameters reported in [18]. Note that these reported results are on the IAM word-level testing set. The recognition rates shown here seem inferior to the state-of-the-art results [34] since these rates are obtained without word dictionaries and language models.
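The text does not spell out how CAR is computed. A common convention, assumed here, is 100 × (1 − total character edit distance / total number of reference characters). The sketch below follows that assumption with a standard Levenshtein distance; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_accuracy_rate(refs, hyps):
    """CAR (%) over a set of words, under the assumed definition above."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * (1.0 - errors / total)


# Toy example: one deleted character out of nine reference characters.
car = character_accuracy_rate(["friend", "the"], ["frend", "the"])
```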

Table 4. Character accuracy rate (%) of HTR architectures trained by Supervised Learning, Mean Teacher, Pseudo-Labeling, FixMatch, and Incremental Teacher Model in four SSL scenarios.

Overall, AED produced the best results in all scenarios with any training framework (bold), while CRNN typically produced the second-best results (underline). These results suggest that a ResNet-based feature extractor is better than the simple CNN. Moreover, the highly complex sequential encoders of DCRN and SelfAttn did not achieve an accuracy as high as that of the simple sequential encoders of AED and CRNN. The performance of all the HTR architectures decreased significantly in Scenario 4 since the labeled training set did not cover the character set.

For the related SSL methods, Pseudo-Labeling outperformed Mean Teacher and FixMatch in almost all scenarios with all the HTR architectures. Note that with the Mean Teacher and FixMatch methods, the performance of the HTR architecture deteriorates in some cases, which is shown by ↓ in Table 4. Mean Teacher and FixMatch mainly rely on a loss calculated by comparing the distributions of the pseudo labels and the output, and this consistency cost appears unsuitable for text line recognition. Capturing the consistency is considered difficult because the output before decoding is a time series of classifications, which varies significantly depending on augmentations that affect positional information. Therefore, a method that expands on the pseudo labels is effective, and additional study is required to introduce consistency costs.

For every architecture except SelfAttn, the recognizer trained by the Incremental Teacher Model outperforms the recognizers trained by the well-known SSL methods: Mean Teacher, Pseudo-Labeling, and FixMatch in every scenario using only 50%, 10%, or 1% labeled training samples on the IAM handwriting database, respectively. The SelfAttn architecture with a simple feature extractor and a complex sequential encoder does not perform well in Scenarios 2 and 4. Mixed Augmentations seem to be helpful for the feature extractor rather than the sequential encoder.

Figure 2 illustrates the changes in the recognition accuracy with the increase in the ratio of labeled data in the training set. The Incremental Teacher Model increases the accuracy of the AED architecture by at most 15.7 percentage points (p.p.) in Scenario 3. Even with only 1% of the training samples labeled, the accuracy of AED is increased by 6.4 p.p. Compared to Pseudo-Labeling, it improves the HTR accuracy by at least 0.9 and at most 6.5 p.p. in Scenarios 2 and 3, respectively. These results show that the proposed framework can leverage unlabeled data to improve the HTR accuracy. Moreover, they suggest that HTR could be applied in practice to an unlabeled dataset by labeling only a small portion of it.

Fig. 2. Character accuracy rate (%) of AED trained by different methods in four SSL scenarios.

Table 5 lists six word-level samples from the IAM handwriting database with the predictions from four architectures trained by the Incremental Teacher Model. For short words such as “of”, “the”, and “friend”, CRNN and AED produced correct predictions, while DCRN and SelfAttn produced misrecognitions. For longer words, even CRNN and AED did not predict correctly. The predictions by AED differed from the ground truth by one to two characters, while those by CRNN had more differences. The predictions by DCRN were shorter than the ground truth, which might suggest that DCRN is limited in the length of its output sequences. The SelfAttn architecture performed well, with its predictions differing from the ground truth by only one to two characters.

Table 5. IAM word-level samples with predictions from four architectures trained using Incremental Teacher Model in Scenario 3.

4.4 Results of Different Augmentation Configurations

Table 6 shows our search for weak/strong transformation settings, where we trained the AED architecture on Scenario 3 (10% of the training samples are labeled). The most basic augmentation is rotation by at most 15 degrees (Rot15). Thus, we conducted a series of experiments with Rot15 as both the weak and strong transformations and inserted other augmentations into the strong transformation, such as Crop80 (randomly removing at most 20% of an image), Blur2 (randomly applying Gaussian blur with a maximum sigma of 2), and Per30 (randomly and vertically distorting an image by at most 30%). By employing more augmentations in the strong transformation, the AED performance increases from R1 to R5. Moreover, we tried eliminating Rot15 from the weak transformation; however, R6 performs worse than R5 by 2.3 p.p. Next, we modified the augmentation parameters from the settings of R5 to make R7. Even small changes in the parameters might reduce the final recognition accuracy. Moreover, we tested including more augmentations in the weak transformation. As shown in the R8 and R9 rows, the recognition accuracy declines when more augmentations are applied.

Table 6. Ablation studies for different configurations of Mixed Augmentations in Scenario 3.

Thus, we might assume that simple augmentations are suitable for weak transformations. Moreover, we still need to search for the optimal parameters of Mixed Augmentations.
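To make the weak/strong asymmetry concrete, the sketch below samples augmentation parameters for the two pipelines from the limits named in this section (Rot15, Crop80, Blur2, Per30). It only draws parameters and does not transform images. The composition of the pipelines is one plausible reading of the text rather than the configuration of Table 1, and the uniform sampling and all names are assumptions.

```python
import random

# (augmentation name, upper limit); rotation is sampled symmetrically around 0.
WEAK_PIPELINE = [("rotate_deg", 15.0)]
STRONG_PIPELINE = [
    ("rotate_deg", 15.0),       # Rot15: rotate by at most 15 degrees
    ("crop_fraction", 0.20),    # Crop80: remove at most 20% of the image
    ("blur_sigma", 2.0),        # Blur2: Gaussian blur with sigma up to 2
    ("perspective", 0.30),      # Per30: vertical distortion of at most 30%
]


def sample_params(pipeline, rng):
    """Draw one parameter per augmentation (uniform sampling is assumed)."""
    params = {}
    for name, limit in pipeline:
        low = -limit if name == "rotate_deg" else 0.0
        params[name] = rng.uniform(low, limit)
    return params


def mixed_augmentation_params(seed=None):
    """Parameters for the weak and strong variants of one sample."""
    rng = random.Random(seed)
    return sample_params(WEAK_PIPELINE, rng), sample_params(STRONG_PIPELINE, rng)


weak, strong = mixed_augmentation_params(seed=0)
```

The sampled dictionaries would then be passed to the image operations of whichever image library is in use.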

4.5 Discussions

Based on the experiments, the AED model outperformed the other models, which may be due to its components: a ResNet-based feature extractor and an LSTM-based decoder with attention. These components are large and deep enough to extract useful features for recognition and to focus correctly on character regions. Thus, they are commonly used to build handwriting recognizers. Because these experiments were on word-level patterns only, further experiments at the sentence level are required to verify the efficacy of the proposed framework. We believe that designing the consistency cost for long handwritten text is challenging. As it is impractical to investigate all types of augmentation in this study, we selected and applied the augmentations that are commonly used and perform well on HTR. However, we expect that other augmentations can also be employed in the proposed framework.

5 Conclusions

We proposed the Incremental Teacher Model and demonstrated its effectiveness. It produces a high recognition accuracy for handwritten text recognition (HTR) even when only a part of the training set is labeled. It comprises Mixed Augmentations and the Scheduled Pseudo-Label loss. Instead of using a fixed pre-trained HTR model as a teacher model to generate pseudo labels, the proposed framework incrementally updates the teacher model using the latest recognizer. We applied the proposed framework to four DNN architectures for handwriting recognition and compared it with well-known semi-supervised learning methods: Mean Teacher, Pseudo-Labeling, and FixMatch. For almost every architecture, the recognizer trained by the Incremental Teacher Model outperforms the recognizers trained by the other well-known SSL methods in every scenario when using only 50%, 10%, or 1% labeled training samples on the IAM handwriting database. However, we only confirmed the effectiveness of our framework for word-level English, so we plan to examine the framework for text-line-level English as well as for other languages in future work.