1 Introduction

Handwriting recognition remains an important open problem. A robust solution would automate the business processes of many companies; a clear example is a postal company, where sorting a large volume of letters and parcels is a pressing issue. Many researchers have built handwritten text recognition systems for different languages such as English [19, 35, 59], Chinese [54, 60], Arabic [43], Japanese [14], Bangla [8], Malayalam [31], etc. That said, the recognition problems for these scripts cannot be considered entirely solved.

Any language contains a large number of words. For example, dictionaries of the Russian and Kazakh languages register more than 100,000 words on average, and the Oxford English Dictionary more than 300,000. Collecting an exhaustive database of handwritten words, covering every word with a large variation in handwriting, is therefore practically impossible: there will always be a word the system cannot recognize. To the best of our knowledge, no analogous handwritten text databases exist for the Russian and Kazakh languages. To create such a database, we adopted the general principles of data collection and storage described for the IAM Database [39]. In the context of handwritten address recognition, it is necessary to identify the many keywords that can occur in an address.

In this paper, we describe the first version of a database that contains Russian words and present a new database for offline handwriting recognition. The collection of this database combined the following steps. As an initial step, we collected the first dataset ourselves, since no comparable set is publicly available. This dataset was obtained using forms consisting of machine-typed texts with empty lines next to them; the empty lines were subsequently filled in by participants in their own handwriting. It can serve as a basis for a variety of handwriting recognition tasks. In the same way, we then collected handwritten samples of the Kazakh and Russian Cyrillic alphabets. The last set of data came from handwritten samples of Russian poems, collected with the same kind of forms. Overall, the database was produced by approximately 200 different writers, each filling in 5 to 10 forms (made up of poem and keyword texts).

For these purposes, we determined the minimum set of words that includes all the names of cities, towns, villages, districts, and streets in Kazakhstan, and created layouts for the forms to be filled out. The forms were designed to simplify the process of “cutting” words out of the form as much as possible (Fig. 1). Extensive experiments on form pre-processing were also carried out in order to automatically identify forms, determine their contours, compensate for rotation, and remove edge artifacts at the boundaries of segmented words.

Fig. 1

One of the poem forms in the dataset. The database consists of more than 1500 filled forms

To solve the natural language processing problem of optically recognizing handwritten texts in the Russian and Kazakh languages, software is being developed using state-of-the-art neural-network-based machine learning methods.

The following section reviews related work on handwriting databases. Section 3 presents the data collection and storage phases, among the most time-consuming and costly stages. Section 4 describes automated labeling and word segmentation. Section 5 provides further characteristics of the database. Section 6 presents experimental results on the HKR dataset, and conclusions and future work are given in Sect. 7.

2 Related work

The IAM Handwriting Database [39, 40] comprises handwritten samples in English that can be used to evaluate systems for text segmentation, handwriting recognition, writer identification, and writer verification. The database is based on the Lancaster-Oslo/Bergen Corpus and comprises forms on which the contributors copied a given text in their natural, unconstrained handwriting. Each form was subsequently scanned at 300 dpi and saved as a gray-level (8-bit) PNG image. The IAM Handwriting Database 3.0 includes contributions from 657 writers, making a total of 1539 handwritten pages comprising 5685 sentences, 13,353 text lines, and 115,320 words. The database is labeled at the sentence, line, and word levels. It has been widely used in word spotting [18, 21, 57, 58], writer identification [7, 11, 13, 30, 51], handwriting gender prediction [37, 38], handwritten text segmentation [46, 47, 61], and offline handwriting recognition [12, 16, 22, 26].

RIMES [24] is a database representative of an industrial application. The main idea behind this database was to collect handwritten samples similar to those sent to companies by individuals via postal mail and fax. Each contributor was assigned a fictitious identity and up to five different scenarios from a set of nine themes. These themes included real-world scenarios like damage declaration or modification of contract. The subjects were required to compose a letter for a given scenario in their own words and layout, on white paper using black ink. A total of 1300 volunteers contributed to data collection, providing 12,723 pages corresponding to 5605 mails. Each mail contains two to three pages, including the letter written by the contributor, a form with information about the letter, and an optional fax sheet. The pages were scanned, and the complete database was annotated to support evaluation of tasks like document layout analysis [41], mail classification [32], handwriting recognition [25], and writer recognition [51].

The National Institute of Standards and Technology (NIST) developed a series of databases [23] of handwritten characters and digits supporting tasks like isolation of fields, detection and removal of boxes in forms, character segmentation, and recognition. The form comprises boxes containing writer information, 28 boxes for digits, 2 for letters, and 1 box for a paragraph of text. The NIST Special Database 1 comprised samples contributed by 2100 writers. The latest version, Special Database 19, comprises handwritten forms of 3600 writers with 810,000 isolated character images along with ground-truth information. This database has been widely employed in a variety of handwritten digit [27] and character recognition systems [52].

CVL [33] is a database of handwritten samples supporting handwriting recognition, word spotting, and writer recognition. The database consists of seven different handwritten texts, one in German and six in English. A total of 310 volunteers contributed to data collection, with 27 authors producing 7 pages and 283 writers providing 5 pages each. The ground-truth data is available in XML format, which includes a transcription of the text, the bounding box of each word, and the identity of the writer. The database has been used for writer recognition and retrieval [17] and can also be employed for other recognition tasks. In addition to regular text, a database of handwritten digit strings written by 303 students has also been compiled [15]. Each writer provided 26 digit strings of different lengths, making a total of 7800 samples. Isolated digits were extracted from the database to form a separate dataset, the CVL Single Digit Dataset, which comprises 3578 samples for each of the digit classes (0-9). A subset of this database was also used in the ICDAR 2013 digit recognition competition [15].

The AHDB [4] is an offline database of Arabic handwriting, together with several pre-processing procedures. It contains Arabic handwritten paragraphs and words, including the words used to represent numbers on checks, produced by 100 different writers. The database was mainly intended to support automatic processing of bank checks, but it also contains pages of unconstrained text, allowing evaluation of generic Arabic handwriting recognition systems as well. The database has been employed in handwriting recognition [5] and writer identification [3] tasks.

IFN/ENIT [44] is a database of handwritten Arabic town/village names. A total of 411 writers filled out forms with 26,459 handwritten Tunisian town/village names, totaling over 210,000 characters. The database is intended for training and evaluating handwritten Arabic word recognition systems.

CASIA [34] comprises collections of isolated characters and handwritten texts from online and offline Chinese handwriting databases, contributed by a total of 1,020 writers. The isolated character datasets, online and offline, contain approximately 3.9 million samples from 7,356 classes (7,185 Chinese characters and 171 symbols), while the handwritten text datasets contain approximately 5,090 pages and 1.35 million character samples. Each dataset is divided into standard training and test subsets and is segmented and annotated at the character level. Various handwritten document analysis tasks can be studied using the online and offline databases. For Chinese handwriting, Zhou et al. [60] proposed a client-server approach: first, using digital ink techniques, the client samples and redisplays handwritten text, segments handwritten characters, edits them, and stores the original handwritten information in a self-defined document; second, using the suggested Gabor feature extraction and affinity propagation clustering (GFAP) approach, the server recognizes the handwritten documents and delivers the recognition results to the client.

3 Data collection and storage

3.1 Data collection

The data collection phase is one of the most time-consuming and costly stages, so our main task was to simplify and automate it as much as possible. The sources of all the forms in the datasets were generated with LaTeX, converted to PDF, and printed to be filled in by writers, which made it easy to generate the correct labels for the printed text on the forms. Each writer filled in approximately 5-10 keyword and poem forms, so each form in the dataset is written by approximately 50-100 writers. Each form has a unique id in its name, and each word or letter is placed in a rectangle. The filled forms and letters were scanned with a Canon MF4400 Series UFRII scanner at a resolution of 300 dpi and a color depth of 24 bits.

We collected three different datasets, described as follows:

  • Handwritten samples (forms) of keywords in Kazakh and Russian (Areas, Cities, Villages, etc.) are shown in Fig. 2.

  • Handwritten Kazakh and Russian alphabets in Cyrillic are shown in Fig. 2.

  • Handwritten samples (Forms) of poems in Russian are shown in Fig. 1.

Fig. 2

Two forms for collecting handwritten samples of the Cyrillic alphabet and keywords

3.1.1 Keyword database

To begin with, we consider correspondence addresses relevant to the Republic of Kazakhstan, with the list of keywords containing the following names:

  • Areas

  • Cities

  • Villages

  • Settlements

  • Streets

  • Poems

  • Russian letters

Additional information, such as:

  • Indices (postal codes)

  • Phone numbers

  • Surnames

  • Company Names

was not included in the database.

3.1.2 Handwritten alphabet and forms

There are two fundamental approaches to text recognition: character recognition (optical character recognition, OCR) and word recognition (optical word recognition, OWR). With OCR, the dataset required to train a model should contain handwritten samples of all the characters in a language’s alphabet. It is important to compose separate forms for each language, since the letter sets of different alphabets can vary greatly. With OWR, on the other hand, the dataset required to train a model should contain handwritten samples of all the words of the language. Further, for subsequent training and testing of the model, handwritten samples of the target words are needed. An example of one of the forms for collecting word and letter samples is shown in Fig. 2.

3.1.3 Data collection methods

A person who has agreed to provide a sample of their handwriting fills in the forms and returns them to us; we then scan the forms and save them in our database.

4 Automated labeling and words segmentation

4.1 Automated labeling

Labeled data are data that have been marked with labels identifying certain features, characteristics, or kinds of objects. Labeling data is a prerequisite for recognition experiments; it is also expensive, time-consuming, and error-prone. As in the IAM Database [39], we decided to automate as much of the process as possible. The sources of all the printed forms (subsequently filled in by writers) were saved in a text file with a unique id for the form and the cell number within the form, so it was an easy task to generate the correct labels for the printed text on the forms. On this basis, we developed a recommendation system that simplifies the process of labeling the data in the forms; a minimal sketch of the idea is shown below.
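The following Python sketch illustrates the idea, assuming a hypothetical tab-separated file of (form id, cell number, printed text) records; the actual storage format of the form sources is not reproduced here.

```python
import csv

def load_labels(path):
    """Build a {(form_id, cell_no): label} map from the stored form sources."""
    labels = {}
    with open(path, newline="", encoding="utf-8") as f:
        for form_id, cell_no, text in csv.reader(f, delimiter="\t"):
            labels[(form_id, int(cell_no))] = text
    return labels

# A segmented cell image then inherits the printed text of its cell, e.g.:
# label = load_labels("forms.tsv")[("form_0001", 17)]
```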

4.2 Segmentation

The form is designed so that it can easily be identified and segmented by cells. To identify the form, there is a marker in the upper-right corner of each form. To simplify segmentation, the entire form is divided by horizontal and vertical lines, which makes it quite easy to restore the structure of the document and, accordingly, the spatial position of each word. Words are indexed (annotated) according to their position in the table. To cut the cells out of the form, the following pre-processing actions are performed (a sketch of two of these steps follows the list):

  • filtering the forms to enhance table boundaries

  • defining the contours of the table

  • determining and compensating for the rotation angle

  • excluding lines

  • sorting forms by id (marker)

  • streaming the division of forms into words

  • naming and storing the words
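A minimal OpenCV sketch of two of these steps, rotation compensation and line exclusion, is given below. The thresholds, kernel sizes, and angle convention are illustrative assumptions, not the exact values used for the database.

```python
import cv2
import numpy as np

def deskew(gray):
    """Estimate the form's rotation angle and compensate for it."""
    binary = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    pts = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(pts)[-1]  # sign/range convention varies by OpenCV version
    if angle > 45:
        angle -= 90
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, m, (w, h), borderMode=cv2.BORDER_REPLICATE)

def exclude_lines(binary):
    """Suppress long horizontal and vertical table lines with morphology."""
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    lines = cv2.add(cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel),
                    cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel))
    return cv2.subtract(binary, lines)
```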

After the image areas corresponding to the word cells are segmented, they may still contain edge artifacts, for example line fragments cut out with the cell or parts of a word from a neighboring cell (Fig. 3). We eliminate these artifacts by constructing vertical and horizontal histograms (Figs. 4 and 5) and by cutting off parts that are localized separately, closer to the edges of the cell; a sketch of this clean-up follows.
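A sketch of the histogram-based clean-up, with an assumed ink threshold: the longest run of ink in each projection is kept, so fragments isolated near the cell edges are dropped.

```python
import numpy as np

def longest_ink_run(projection, min_ink=2):
    """Return (start, end) of the longest run where the projection has ink."""
    mask = projection >= min_ink
    best, start = (0, 0), None
    for i, has_ink in enumerate(np.append(mask, False)):  # sentinel closes the last run
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

def trim_cell(binary):
    """binary: 2-D array, nonzero where ink is present."""
    r0, r1 = longest_ink_run((binary > 0).sum(axis=1))  # horizontal histogram
    c0, c1 = longest_ink_run((binary > 0).sum(axis=0))  # vertical histogram
    return binary[r0:r1, c0:c1]
```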

Fig. 3

Example of a region cut out of a form with a word. A pronounced cell line and a piece of a letter from a neighboring area are visible along the edges

However, it is not always possible to eliminate all artifacts. The following are some aspects that make further processing of a segmented word difficult:

  • Letters may not be interconnected.

  • Letters can be corrupted by artifacts.

  • The position of the letters and their size vary significantly from word to word.

  • Letters can be written in different colors (blue, black, red).

In this regard, we developed a recommendation system that simplifies the process of selecting word areas from the form.

  • We suggest filling out the form with a blue pen. This allows the system to distinguish the word from the table borders at the color level. For example, by converting an image from RGB to HSV, we obtain a color representation of objects that is invariant to lighting: in this color space, blue remains blue regardless of the brightness and intensity of the image (see the sketch after this list).

  • Sometimes eliminating parts of words from neighboring cells is impossible without distorting the target content of a given cell; therefore, when filling out the form, it is desirable that the writer does not go beyond the boundaries of the cell.

  • We find the region of interest (ROI) in the forms; in our forms, the ROI is the two columns that are filled in by the writers.

  • We segmented the cells based on the horizontal white space between them, using the histogram (Fig. 4).

  • We then exclude lines that cross the words.

  • Finally, we cropped and segmented the cells based on the vertical white space, using the histogram (Fig. 5).
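For the blue-ink separation mentioned in the first point, a sketch could look as follows; the HSV bounds for “blue” are illustrative assumptions and would need tuning to the actual scans.

```python
import cv2
import numpy as np

def blue_ink_mask(bgr):
    """Keep only blue strokes; black/gray table lines fall outside the range."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([90, 60, 40])    # hue ~90-130 covers blue on OpenCV's 0-179 scale
    upper = np.array([130, 255, 255])
    return cv2.inRange(hsv, lower, upper)
```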

Fig. 4

Horizontal histogram

Fig. 5

Vertical histogram

Fig. 6

Examples of segmented words

The final images are shown in Fig. 6.

5 Further characteristics of the database

The database consists of more than 1500 filled forms written by approximately 200 writers. It contains approximately 63,000 sentences and more than 715,699 symbols, as shown in Fig. 7, as well as approximately 106,718 words. After pre-processing and segmenting the forms, the dataset contains 64,943 images.

Fig. 7

Histogram of characters in the dataset

6 Experiment result

A quantitative comparison of well-known recurrent neural network (RNN) models, namely SimpleHTR [48], LineHTR [36], NomeroffNet [42], Bluche [9], Puigcerver [45], and Attention-Gated-CNN-BGRU [2], was carried out to choose the best-performing model on the given dataset. First, the final dataset was split into three parts: training (70%), validation (15%), and testing (15%). The test dataset was equally split into two sub-datasets (7.5% each): the first, named TEST1, consisted of words that did not exist in the training and validation datasets; the second, named TEST2, was made up of words that exist in the training dataset but in totally different handwriting styles. The primary purpose of splitting the test dataset into TEST1 and TEST2 was to check the difference in accuracy between recognizing totally unseen words and words seen in the training phase but in unseen handwriting styles. After the training, validation, and testing datasets were prepared, the models were trained and a series of comparative evaluation experiments was conducted. The experiments showed that the Attention-Gated-CNN-BGRU model performed best, with an 8.34% character error rate (CER, Levenshtein), 38.54% word error rate (WER), and 12.12% CER by our algorithm on the first test dataset, and an 8.36% CER (Levenshtein), 56.36% WER, and 16.5% CER by our algorithm on the second test dataset.

6.1 Evaluation methods

In this article, we evaluated models using two methods. In the first, the standard performance measures are used for all results presented: the character error rate (CER) and the word error rate (WER) [20]. The CER is defined via the Levenshtein distance, i.e., the sum of the character substitutions (S), insertions (I), and deletions (D) required to turn one string into another, divided by the total number of characters in the ground-truth word (N):

$$CER = \frac{S+I+D}{N}$$
(1)

Similarly, the WER is calculated as the sum of the word substitutions (\(S_w\)), insertions (\(I_w\)), and deletions (\(D_w\)) required to transform one string into another, divided by the total number of ground-truth words (\(N_w\)):

$$WER = \frac{S_w + I_w + D_w}{N_w}$$
(2)
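A minimal reference implementation of Eqs. (1) and (2), computing the edit distance over characters for CER and over whitespace-separated words for WER:

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```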

The second method is a character error rate (CER*) [10] developed by the authors to evaluate our results. The algorithm loops through all the results, counting the frequency of each character and the number of its correctly recognized occurrences. The error for each character is calculated by

$$CER_c = \left(1-\frac{pred_c}{freq_c}\right) \times 100$$
(3)

where c is a character, \(pred_c\) is the number of correct predictions of c, and \(freq_c\) is the number of occurrences of c.

We then calculate the average CER using the following formula, where the error for each character is weighted by that character’s fraction of the whole test dataset and the results are summed:

$$CER_{avg} = \sum_c CER_c \cdot \frac{freq_c}{total}$$
(4)

where c is a character, \(CER_c\) is the character error rate of c, \(freq_c\) is the number of occurrences of c, and total is the total number of all characters. In the rest of the paper, we write \(CER^*\) for our algorithm and CER for the Levenshtein-based measure. A sketch of \(CER^*\) is given below.
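The following sketch follows Eqs. (3) and (4); it assumes a simple position-wise alignment of prediction and ground truth, since the exact character-matching loop is not reproduced here.

```python
from collections import Counter

def cer_star(pairs):
    """pairs: iterable of (ground_truth_word, predicted_word)."""
    freq, correct = Counter(), Counter()
    for truth, pred in pairs:
        for t, p in zip(truth, pred):      # position-wise alignment (assumption)
            freq[t] += 1
            correct[t] += (t == p)
    total = sum(freq.values())
    # per-character error rates, Eq. (3)
    cer_c = {c: (1 - correct[c] / freq[c]) * 100 for c in freq}
    # frequency-weighted average, Eq. (4)
    return sum(cer_c[c] * freq[c] / total for c in freq)
```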

6.2 Training

All models were trained using the TensorFlow [1] deep learning library in Python. TensorFlow allows transparent use of highly optimized mathematical operations on GPUs from Python: a computational graph defined in the Python script describes all operations necessary for the specific computations.

The experiments were run on a machine with two Intel(R) Xeon(R) E5-2680 CPUs, four NVIDIA Tesla K20X GPUs, and 100 GB of RAM. The use of a GPU reduced the training time of the models by approximately a factor of 3; however, this speed-up was not closely monitored throughout the project, so it may have varied.

The plots for this paper were generated using the matplotlib library for Python, and the illustrations were created using Inkscape, a vector graphics editor similar to Adobe Illustrator.

All models were trained to minimize the validation loss. Optimization was performed with stochastic gradient descent using the RMSProp method [29], a base learning rate of 0.001, and mini-batches of 32. Early stopping with a patience of 20 was also applied: we monitored the validation loss at each epoch, and when it did not improve for 20 epochs, training was interrupted. A sketch of this setup is shown below.
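In Keras terms this setup might look as follows; the ReduceLROnPlateau schedule from Sect. 6.9 is included, and the model and datasets are placeholders.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
callbacks = [
    # interrupt training after 20 epochs without validation-loss improvement
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                     restore_best_weights=True),
    # decay the learning rate by a factor of 0.2 after 10 idle epochs
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.2, patience=10),
]
# model.compile(optimizer=optimizer, loss=ctc_loss)
# model.fit(train_ds, validation_data=val_ds, batch_size=32,
#           epochs=1000, callbacks=callbacks)
```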

6.3 SimpleHTR model

Originally inspired by the artificial neural network architectures of [50] and [56], Harald Scheidl proposed a new approach to the handwritten recognition task [48] in 2018. The model’s architecture consists of 5 convolutional neural network (CNN) layers, 2 long short-term memory (LSTM) layers, and connectionist temporal classification (CTC) loss and decoder layers, as shown in Fig. 8.

Fig. 8

The SimpleHTR architecture contains 5 CNN layers as an encoder and 2 LSTM layers that decode the features and pass them to the CTC loss function

In short, the SimpleHTR pipeline is as follows [49]:

  • Input is a gray-scale image of fixed size 128 x 32 (W x H)

  • CNN layers map this gray-scale image to a feature sequence of size 32 x 256

  • LSTM layers with 256 units map this feature sequence to a matrix of size 32 x 80: here 32 represents the number of time-steps (horizontal positions) in an image of a word, and 80 represents the probabilities of the different characters at a certain time-step in that image

  • CTC layer may work in 2 modes: loss mode - to learn to predict the right character at a time-step when training; decoder mode - to get the recognized word when testing

  • batch size is equal to 50

The SimpleHTR model ideally requires handwritten texts to be split into words; otherwise, recognizing a full text line would result in low accuracy, since 32 time-steps are insufficient to handle the larger number of characters in a text line. A hedged sketch of the architecture follows.
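The Keras sketch below reproduces the 128 x 32 -> 32 x 256 -> 32 x 80 shapes given above; the filter counts, pooling layout, and bidirectional wrapping are assumptions, not the reference implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 80  # characters + CTC blank

inputs = layers.Input(shape=(32, 128, 1))          # H x W x 1
x = inputs
for filters, pool in [(32, (2, 2)), (64, (2, 2)),
                      (128, (2, 1)), (128, (2, 1)), (256, (2, 1))]:
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool)(x)               # height collapses to 1
x = layers.Reshape((32, 256))(x)                   # 32 time-steps x 256 features
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # 32 x 80
model = tf.keras.Model(inputs, outputs)            # train with a CTC loss
```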

6.4 LineHTR model

The LineHTR model [36] is an extension of the SimpleHTR model, developed to enable the model to process images of full text lines (not single words only) and thus to increase accuracy further. The architecture of the LineHTR model is quite similar to that of SimpleHTR, with some differences in the number of CNN and LSTM layers and in the size of those layers’ inputs: it has 7 CNN layers and 2 bidirectional LSTM (BLSTM) layers. In short, the LineHTR pipeline is as follows [36]:

  • Input is a gray-scale image of fixed size 800 x 64 (W x H)

  • CNN layers map this gray-scale image to a feature sequence of size 100 x 512

  • BLSTM layers with 512 units map this feature sequence to a matrix of size 100 x 205: here 100 represents the number of time-steps (horizontal positions) in an image of a text line, and 205 represents the probabilities of the different characters at a certain time-step in that image

  • CTC layer may work in 2 modes: loss mode - to learn to predict the right character at a time-step when training; decode mode - to get the final recognized text line when testing

6.5 Nomeroff net OCR model

According to the authors of the Nomeroff Net automatic number-plate recognition system [42], the OCR architecture is as shown in Fig. 9.

Fig. 9

Nomeroff’s number plate recognition model architecture

As can be seen from Fig. 9, the Nomeroff Net OCR pipeline is as follows:

  • Input is a gray-scale image of fixed size 64 x 128 (W x H)

  • This gray-scale image is introduced into 2 subsequent CNN layers, which output feature maps of (16 x 32) x 16 size

  • These feature maps are reshaped into a single map of 256 x 32 size

  • This map is introduced into a fully connected (FC) layer

  • The output of the FC layer is directed to 2 parallel recurrent neural network (RNN) layers of gated recurrent units (GRU)

  • The outputs of the 2 GRU layers are combined into one by element-wise addition to form a map of 512 x 32 size

  • This map is directed to 2 parallel GRU layers again

  • The outputs of the last 2 GRU layers are concatenated to form a map of 1024 x 32 size

  • This map is passed through subsequent FC and Softmax layers before being directed to the CTC decoder to obtain the final recognized text

Although the Nomeroff Net OCR architecture was designed to recognize machine-printed car plates, it is worth checking the model’s performance on handwritten text recognition tasks; that is why this model is included in the list of RNN architectures evaluated in this research work. A hedged sketch of the pipeline follows.
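The Keras sketch below matches the listed shapes ((16 x 32) x 16 -> 256 x 32 -> 512 x 32 -> 1024 x 32); the FC width, the alphabet size, and other details are assumptions rather than the authors’ exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 80  # placeholder alphabet size + CTC blank

inputs = layers.Input(shape=(64, 128, 1))                 # H x W x 1
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)                        # feature maps (16 x 32) x 16
x = layers.Permute((2, 1, 3))(x)                          # width becomes the time axis
x = layers.Reshape((32, 16 * 16))(x)                      # single 256 x 32 map
x = layers.Dense(32, activation="relu")(x)                # FC layer before the RNNs
x = layers.Bidirectional(layers.GRU(512, return_sequences=True),
                         merge_mode="sum")(x)             # element-wise addition -> 512 x 32
x = layers.Bidirectional(layers.GRU(512, return_sequences=True),
                         merge_mode="concat")(x)          # concatenation -> 1024 x 32
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # FC + Softmax, then CTC decode
model = tf.keras.Model(inputs, outputs)
```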

6.6 Bluche model

The Bluche model [9] proposes a new neural network structure for modern handwritten text recognition (HTR) as an alternative to multidimensional LSTM RNNs. The model is based on a deep convolutional encoder of the input image and a bidirectional LSTM decoder predicting sequences of characters. Its goal is to enable standard, multilingual, and reusable components in this paradigm, using the convolutional encoder to leverage more data for transfer learning.

The encoder in the Bluche model contains a 3x3 convolutional layer with 8 features, a 2x4 convolutional layer with 16 features, a 3x3 gated convolutional layer, a 3x3 convolutional layer with 32 features, a 3x3 gated convolutional layer, a 2x4 convolutional layer with 64 features, and a 3x3 convolutional layer with 128 features. The decoder contains 2 bidirectional LSTM layers of 128 units with a 128-unit dense layer between them. Figure 10 shows the Bluche architecture; a sketch of a gated convolutional layer is given below.
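The gated convolutional layer, the distinctive element here, modulates a convolution’s output element-wise by a learned sigmoid gate. A minimal Keras sketch of this idea (filter count and kernel size are free parameters):

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_conv2d(x, filters, kernel_size):
    """Gated convolution: features multiplied element-wise by a sigmoid gate."""
    features = layers.Conv2D(filters, kernel_size, padding="same")(x)
    gate = layers.Conv2D(filters, kernel_size, padding="same",
                         activation="sigmoid")(x)
    return layers.Multiply()([features, gate])
```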

Fig. 10

The Bluche HTR model contains 8 CNN and 2 BLSTM layers

6.7 Puigcerver model

The Puigcerver model [45] questions how much modern approaches to offline HTR depend on multidimensional (2D-LSTM) networks, relying instead on 1D-LSTM layers. The model achieves a high recognition rate with a large number of parameters (around 9.6 million). This implies that the dependencies theoretically modeled by multidimensional recurrent layers might not be necessary, at least in the lower layers of the system, to achieve high recognition accuracy. Figure 11 shows the Puigcerver architecture.

The Puigcerver model has three important parts:

  • Convolutional blocks: they include 2-D convolutional layers with a 3x3 kernel size and a stride of 1 in both the horizontal and vertical directions. The number of filters is equal to 16n at the n-th convolutional layer.

  • Recurrent blocks: bidirectional 1D-LSTM layers form the recurrent blocks, which traverse the input image column-wise from left to right and from right to left. The outputs of the two directions are concatenated depth-wise.

  • Linear layer: the output of the recurrent 1D-LSTM blocks is fed to a linear layer to predict the output labels. Dropout (with probability 0.5) is applied before the linear layer to prevent overfitting.

Fig. 11

The Puigcerver HTR model contains 5 CNN and 5 BLSTM layers

6.8 Attention-Gated-CNN-BGRU model

The Attention-based Fully Gated CNN-BGRU model [2] aims at improving HTR accuracy on the handwritten Cyrillic text recognition task. The architecture is shown in Fig. 12.

Fig. 12

Attention-Gated-CNN-BGRU architecture for handwriting recognition. The system contains four main parts: (A) encoder, (B) attention block, (C) decoder, (D) CTC

This model’s architecture consists of 4 main parts: encoder, attention, decoder, and CTC. An encoder part consists of 5 convolutional blocks, each of which is made up of a convolutional layer, Parametric Rectified Linear Unit (PReLU) activator [28] with Batch Normalization, and gated convolutional layer [9]. The Dropout technique is also applied at the input of some convolutional layers (with a dropout probability of 0.5) to reduce the overfitting issue [53]. As an attention part of this model’s architecture, Bahdanau attention mechanism is used [6]. Generally, attention mechanisms encode an input sentence by segmenting it into a fixed number of parts so they can be processed later by a decoder. Bahdanau attention mechanism enabled attention mechanism to focus on relevant parts of an input sentence, rather than hard segmenting it. The key role of Bahdanau attention mechanism applied between an encoder and decoder is to provide a richer encoding of the input sequence.

6.9 Results

All the models were trained on the HKR dataset. We evaluated them with the standard performance measures used for all results presented: CER, CER*, and WER. For all models, a mini-batch size of 32, early stopping after 20 epochs without improvement in the validation loss, and lr = 0.001 were set. To make the best use of each model within the 20 tolerance epochs, a ReduceLROnPlateau schedule [55] with a decay factor of 0.2 after 10 epochs without improvement in the validation loss was also used. All of the following figures present the character error rate, which shows how well each model detects each character.

The first experiment was conducted with the SimpleHTR model, which showed an average of 58.97% WER, 33.26% CER*, and 19.98% CER on TEST1 and 11.09% WER, 2.45% CER*, and 1.55% CER on TEST2 (Fig. 13). This large difference in error rates shows that the SimpleHTR model overfitted to words seen in the training stage and demonstrated a lower level of generalization.

Fig. 13

SimpleHTR model performance on TEST1 and TEST2 datasets

The next experiment was carried out with the LineHTR model, which was trained for 100 epochs. This model demonstrated an average of 85.66% WER, 47.46% CER*, and 33.63% CER on TEST1 and 21.99% WER, 5.41% CER*, and 3.51% CER on TEST2 (Fig. 14). A similar tendency of overfitting to the training data can be observed here as well.

Fig. 14

LineHTR model performance on TEST1 and TEST2 datasets

The same experiments were conducted with the Nomeroff Net HTR model. Unlike the previous models, it showed weaker results overall: an average of 80.28% WER, 52.37% CER*, and 34.84% CER on TEST1 and 50.19% WER, 16.23% CER*, and 10.19% CER on TEST2 (Fig. 15). As can be observed from the figure, the NomeroffNet model also suffers from overfitting.

Fig. 15

NomeroffNet HTR model performance on TEST1 and TEST2 datasets

Experiments were also conducted with the Puigcerver and Bluche models, and the following recognition errors were obtained on the test datasets: (1) the Bluche model achieved 76.43% WER, 31.91% CER*, and 22.31% CER on TEST1 and 69.13% WER, 21.84% CER*, and 12.94% CER on TEST2 (Fig. 16); (2) the Puigcerver HTR model showed 100.00% WER, 82.00% CER*, and 69.15% CER on TEST1 and 98.95% WER, 66.34% CER*, and 49.87% CER on TEST2 (Fig. 17). The Puigcerver model has a higher error rate than the other models because, despite its many parameters (~9.6M), it underfits this dataset.

Fig. 16

Bluche HTR model performance on the TEST1 and TEST2 datasets

Fig. 17

Puigcerver HTR model performance on the TEST1 and TEST2 datasets

Training the Attention-Gated-CNN-BGRU model took 240 epochs. Its CER on TEST1 and TEST2 was 8.34% and 8.36%, respectively (Fig. 18). As can be seen from the figure, the Attention-based Fully Gated CNN-BGRU model resulted in lower CER and better generalization overall.

Fig. 18

Attention-based Fully Gated CNN-BGRU model performance on TEST1 and TEST2 datasets

Table 1 shows the results of the comparison between all models.

Table 1 CER, CER*, and WER for Bluche, Puigcerver, NomeroffNet, LineHTR, SimpleHTR, and Attention-Gated-CNN-BGRU

7 Conclusion and future work

In this research work, we first built a handwritten Kazakh and Russian database that can serve as a basis for research in handwriting recognition. It contains Russian words (areas, cities, villages, settlements, streets) written by approximately 200 different writers and incorporates the most popular words in the Republic of Kazakhstan. Several pre-processing and segmentation procedures were developed together with the database. The database also contains free handwriting forms on topics of the writers’ interest and is meant to provide training and testing sets for Kazakh and Russian word recognition research. In the future, work on gathering handwriting samples of keywords and envelope images will continue; the envelopes will be annotated, and various metrics will be checked to evaluate the recognition error. For artifacts not to interfere, we need to collect as much labeled data as possible.

Secondly, this research work addressed a handwritten Cyrillic postal-address interpretation task using well-known RNN models: SimpleHTR, LineHTR, NomeroffNet, Bluche, Puigcerver, and Attention-Gated-CNN-BGRU. These models were quantitatively evaluated against each other to select the best-performing one. According to the experiments, the Attention-Gated-CNN-BGRU HTR model demonstrated the highest recognition rate overall.

One of this research work’s goals was to investigate and quantitatively compare the state-of-the-art RNN models to choose the best performing one in a handwritten Cyrillic postal-address recognition task. This goal also incorporates all efforts put into improving the best performing RNN model. According to experiment results, the Attention-Gated-CNN-BGRU HTR model demonstrated comparatively better results in terms of generalization and overall accuracy (see Table 1). This model was then extended to the modified version, called Attention-based Fully Gated CNN-BGRU model.

As Figs. 13-18 show, the average CERs of all models tend to be high. The reason appears to be the large differences between the frequencies of Cyrillic characters. In other words, since the dataset includes a small number of Kazakh-language handwriting samples, the Kazakh-specific characters have lower frequencies (distributions in the dataset) than other Cyrillic letters. Consequently, the above-mentioned models struggle to recognize these characters, resulting in very low recognition rates for them, which raises the overall average CER. The dataset also includes non-alphabetic characters (such as “.”, “,”, “!”, and so on) with small distributions. The SimpleHTR, LineHTR, and NomeroffNet models seemed prone to overfitting while being trained on Cyrillic handwriting. Enriching the dataset with a greater variety of Kazakh and Russian words, and balancing it, should resolve this issue.

Generally, all the models examined in this research showed that more data are needed, especially data containing Kazakh-language words. As future work, a Telegram bot was created to collect a new dataset with predominantly Kazakh-language words. Currently, the handwriting recognition model developed in this research is not ready for use at a production level, for example in a postal company. A web application that provides an easy-to-use interface for users is still under development.