
1 Introduction

Handwritten text analysis and recognition [20] has long been an important field of OCR [9] and has been a focus of research over the past decade [14]. From early rule-based methods to current deep learning methods, recognition accuracy has improved continuously. Depending on the representation, handwritten text recognition is divided into online and offline recognition. Offline characters are represented by two-dimensional static images, while online characters are represented by continuous coordinate sequences that also capture the trajectory, speed, and angle of the pen during writing. As a result, online handwriting recognition is usually more accurate than offline recognition. However, offline text is easier to collect, better suited to practical application scenarios, and more widely applicable. If the dynamic information of the text can be recovered from the two-dimensional static image, static and dynamic information can be combined to further improve recognition accuracy. Moreover, handwriting reconstruction is widely used in smart writing and handwriting identification [7].

Fig. 1. The framework of the handwriting reconstruction method.

Current character handwriting reconstruction methods include graph search, template matching, writing rules, and deep learning based methods [6, 20]. Graph search methods [17] find the path with the least cost according to a minimum energy criterion; they are only suitable for recovering the writing order of digits and letters. Template matching methods [11] build a stroke template library and restore the trajectory by comparing the input image against the templates; they apply more widely and achieve higher accuracy, but computing the best path during matching is too expensive. Methods based on writing rules [2] use the structural characteristics of characters to express the relationships between strokes and then apply rules to restore their order; their disadvantage is that they cannot adapt to changes in writing style or handle broken strokes. The deep learning method of [19] applies a series of preprocessing steps to the image and then predicts the ordering relation of each pixel through a network; it adapts poorly to complicated text with many strokes. Other deep learning methods such as [6] extract a feature sequence from the two-dimensional static image and generate the handwriting sequence through an RNN and a fully connected network; they adapt poorly to samples with complex fonts and a wide range of stroke counts.

When a person writes, visual attention moves with the handwriting. In machine vision, we express this as the response probability of each position at different times, which should be concentrated in a certain area or point. Therefore, this paper proposes a handwriting reconstruction method based on a spatial-temporal encoder-decoder network, which simulates the movement of human visual attention [18] by predicting the probability of each point on the image at different times.

The rest of this article is organized as follows. Section 2 introduces the proposed Spatial-Temporal Encoder-Decoder Network in detail. Section 3 explains the proposed reconstruction constraints. Section 4 describes the composition of the loss function. Section 5 presents the experiments and results, and the last section concludes.

2 Spatial-Temporal Encoder-Decoder Network

In this section, we describe in detail how the proposed method generates online handwriting sequences from offline images. As mentioned earlier, we do not output coordinates directly; instead, we output the absolute position of the maximum-probability point at each temporal step. The Spatial-Temporal Encoder-Decoder Network consists of three modules that together generate the handwriting sequence: a key point detector module, a spatial encoder module, and a temporal decoder module. The spatial encoder module is the backbone of the model; it is essentially a variant FCN and outputs spatial features for each position of the offline image (Fig. 3 shows its structure). The key point detector module is a branch of the backbone network that outputs and classifies all key points of the character. A recurrent neural network, GRU [3], and a multi-layer perceptron (MLP) form the temporal decoder, which combines the spatial features to output the heat map sequence. The overall framework is shown in Fig. 1, and a minimal code sketch follows below.
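For concreteness, here is a minimal runnable PyTorch sketch of how the three modules could be wired together. The layer choices and names are our own assumptions, not the authors' released code; only the tensor shapes follow Sect. 5.2 (512x512 input, 64x64 output maps, d = 128, a 64-dimensional GRU state).

```python
import torch
import torch.nn as nn

class STEncoderDecoder(nn.Module):
    """Sketch of the three-module pipeline (hypothetical layer choices)."""

    def __init__(self, d=128, hidden=64):
        super().__init__()
        # Stand-in spatial encoder: strided convs reducing 512 -> 64.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, d, 3, 2, 1), nn.ReLU())
        # Key point branch: one heatmap per class (end / connection points).
        self.kp_head = nn.Conv2d(d, 2, 1)
        # Temporal decoder: GRU plus an MLP scoring every spatial position.
        self.gru = nn.GRUCell(d, hidden)
        self.score = nn.Sequential(nn.Linear(hidden + d, 64), nn.Tanh(),
                                   nn.Linear(64, 1))

    def forward(self, image, steps):
        feat = self.encoder(image)                    # (B, d, 64, 64)
        keypoints = self.kp_head(feat).sigmoid()      # end/connection heatmaps
        B, d, H, W = feat.shape
        a = feat.flatten(2).transpose(1, 2)           # (B, L, d) spatial features
        h = feat.new_zeros(B, self.gru.hidden_size)
        maps = []
        for _ in range(steps):
            e = self.score(torch.cat([h.unsqueeze(1).expand(-1, H * W, -1), a], -1))
            p = torch.softmax(e.squeeze(-1), dim=1)   # response over positions
            c = a[torch.arange(B), p.argmax(dim=1)]   # context = best point feature
            h = self.gru(c, h)                        # advance the hidden state
            maps.append(p.view(B, H, W))
        return keypoints, torch.stack(maps, dim=1)
```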

2.1 Key Point Detector

The key point detector module [1, 5, 16] regresses the position of each candidate point through an FCN [13]. Fully convolutional networks have better spatial generalization capabilities than fully connected networks, and they are also more stable for position regression, which is why they are increasingly used for key point detection [1]. This module detects all key points, divides them into two categories, end points and connection points, and provides this information to the reconstruction constraint module. A fully convolutional network contains only convolutional, pooling, and activation layers; specific parts are selected through the information connections between them, which is valuable for detection tasks (Fig. 2).

Fig. 2. The structure diagram of the key point detector network.

The detection network can identify the overall frame of the character and filter out its key parts. It is sensitive to the turning points of line segments: even on curves with small curvature it can identify subtle turning points, which are reflected in the output heat map. While detecting the key points, the detection network also determines the length of the output coordinate sequence, enabling variable-length sequence generation. A sketch of how discrete key points could be extracted from the predicted heatmaps follows.
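As an illustration, discrete key points can be pulled out of such heatmaps with a simple max-pooling non-maximum suppression; this snippet is our sketch of the idea (the threshold is illustrative), not the paper's exact procedure.

```python
import torch.nn.functional as F

def extract_keypoints(heatmap, thresh=0.5):
    """Extract key points from a (2, H, W) heatmap by NMS via max pooling.
    Channel 0 holds end points, channel 1 holds connection points."""
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1)[0]
    peaks = (heatmap == pooled) & (heatmap > thresh)   # local maxima above threshold
    end_pts = peaks[0].nonzero()                       # (k, 2) rows of (y, x)
    conn_pts = peaks[1].nonzero()
    return end_pts, conn_pts
```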

2.2 Spatial Encoder Network

The key point detector network can find the key points of the character, but it only extracts position information and cannot analyze the relationships between points: its feature map has the same size as the output image, its receptive field is limited, and deeper sequential features are not extracted. An FCN cannot extract deeper features well without changing the size of the feature map, so this work is done by the spatial encoder network. The spatial encoder network is a special fully convolutional network designed to extract the deep-level visual features of the image and obtain a larger receptive field while keeping the feature map at a fixed size. It is therefore composed of an FCN [13], briefly introduced above, and a U-Net [12], which is an FCN with a special structure. U-Net consists of two parts: a contracting path and an expanding path. The contracting path captures contextual information at different scales, and the expanding path supplements some of the deep-level information of the image. Because this supplement is necessarily incomplete, skip connections are needed to merge in the higher-resolution feature maps from the contracting path. Since the proposed method generates heat maps of handwriting points through the FCN, the output size must respect the size ratio of handwriting points in the original image. The U-Net is added to maintain a feature map of a fixed scale while extracting deeper features, and its enlarged receptive field also captures the stroke texture information of the image.

Fig. 3. The structure of the spatial encoder network.

The specific structure of the spatial encoder network is shown in Fig. 3. The spatial encoder network encodes the image as a tensor of size \(d\times H'\times W'\). We denote these coding features as in Eq. 1,

$$\begin{aligned} a=\{a_1,a_2,a_3,...,a_L\} ,a_i\in R^d ,L=H'\times W' \end{aligned}$$
(1)

where d is the dimension of \(a_i\) and L is the number of spatial positions. A minimal sketch of this flattening follows.
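In code this is just a reshape; a minimal sketch with the shapes of Sect. 5.2 (d = 128, H' = W' = 64):

```python
import torch

feat = torch.randn(1, 128, 64, 64)      # encoder output of size d x H' x W'
a = feat.flatten(2).transpose(1, 2)     # (1, L, d): the sequence {a_1, ..., a_L}
print(a.shape)                          # torch.Size([1, 4096, 128]), L = 64 * 64
```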

2.3 Temporal Decoder Network

The Temporal Decoder Network is essentially a candidate determiner composed of an MLP and a GRU [3]. To link the offline image with the variable-length output sequence, the paper [18] computes an intermediate vector that provides a regional feature filter for subsequent recognition and classification; we instead use this intermediate vector to output the coordinate points we need in the image heat map. Figure 4 shows the workflow of the temporal decoder network, where the MLP is a multi-layer perceptron composed of several fully connected layers and serves as the output layer. The GRU [3] is an improved variant of the recurrent neural network (RNN) that mitigates vanishing and exploding gradients during training and has a much smaller memory footprint than LSTM [4] while achieving comparable results. The hidden state of the GRU is computed by Eqs. 2-5.

$$\begin{aligned} z_t = \sigma \left( W_{hz} H_{t-1}+U_{cz}C_t+b_z \right) \end{aligned}$$
(2)
$$\begin{aligned} r_t = \sigma \left( W_{hr} H_{t-1}+U_{cr}C_t+b_r \right) \end{aligned}$$
(3)
$$\begin{aligned} \widetilde{H_t} = \tanh \left( W_h\left( H_{t-1} \otimes r_t\right) +U_h C_t+b_h \right) \end{aligned}$$
(4)
$$\begin{aligned} H_t = \left( 1-z_t\right) \otimes H_{t-1} + z_t \otimes \widetilde{H_t} \end{aligned}$$
(5)

Here \(\sigma \left( \cdot \right) \) is the sigmoid function, and \(z_t\), \(r_t\), and \(\widetilde{H_t}\) are the update gate, the reset gate, and the candidate state, respectively. When the temporal decoder network predicts the position of the handwriting point at each time step, it outputs a probability for each area, so the output only needs to maximize the probability of the candidate area. The temporal decoder network combines the spatial features \(a_i\) with the current hidden state \(H_{t-1}\) of the GRU to compute the maximum-probability position of the handwriting point at the current moment (see Eqs. 6-7),

Fig. 4. The calculation process of the temporal decoder network.

$$\begin{aligned} e_{ti} = v^T_a \tanh \left( W_a H_{t-1} +U_a a_i\right) \end{aligned}$$
(6)
$$\begin{aligned} p_{ti} =\frac{\exp \left( e_{ti}\right) }{\sum _{i=0}^L \exp \left( e_{ti}\right) } \end{aligned}$$
(7)

where \(v_a \in R^n\), \(W_a \in R^{n'\times n}\), and \(U_a \in R^{n\times d}\). We then take the feature of the most probable point as the context \(C_t\) to strengthen the relationship between points, as in Eq. 8.

$$\begin{aligned} C_t = a\left[ \max \left( p\right) \right] \end{aligned}$$
(8)
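Eqs. 2-8 together describe one decoder step. As a hedged sketch, `nn.GRUCell` realizes the gated update of Eqs. 2-5, and three linear layers realize the additive attention of Eq. 6; the module layout and names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One temporal-decoder step: attention over spatial features using the
    previous hidden state H_{t-1}, context pick C_t, then the GRU update."""

    def __init__(self, d=128, n=64):
        super().__init__()
        self.gru = nn.GRUCell(d, n)            # Eqs. 2-5: H_t from C_t, H_{t-1}
        self.W_a = nn.Linear(n, n, bias=False)
        self.U_a = nn.Linear(d, n, bias=False)
        self.v_a = nn.Linear(n, 1, bias=False)

    def forward(self, a, h):
        # a: (B, L, d) spatial features; h: (B, n) hidden state H_{t-1}
        e = self.v_a(torch.tanh(self.W_a(h).unsqueeze(1) + self.U_a(a)))  # Eq. 6
        p = torch.softmax(e.squeeze(-1), dim=1)                           # Eq. 7
        c = a[torch.arange(a.size(0)), p.argmax(dim=1)]                   # Eq. 8: C_t
        return self.gru(c, h), p                                          # H_t, p_t
```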

To strengthen the path information, this article proposes a handwriting trend feature. The trend feature starts from a blank map \(\left( \beta \right) \) at the output scale, on which the predicted position is marked at each time step. Trend features are then extracted by convolution and fed to the MLP. The full calculation is given in Eqs. 9-12.

$$\begin{aligned} \beta = \left( 0 \right) \in F^{H' \times W'} \end{aligned}$$
(9)
$$\begin{aligned} \beta _t = \left( 1_{ij} \right) \in F^{H' \times W'}, i \in \left( 0,W' \right) ,j \in \left( 0,H' \right) \end{aligned}$$
(10)
$$\begin{aligned} F _t = f \left( \beta _t \right) \end{aligned}$$
(11)
$$\begin{aligned} e_{ti} = v^T_a \tanh \left( W_a H_{t-1} +U_a a_i + U_f f_t\right) ,f_t \in F_t \end{aligned}$$
(12)

where \(\beta _t\) is the trajectory map at time step t and \(f\left( \cdot \right) \) is a convolution module. The output \(e_{ti}\), which represents the response of each position at time step t, then enters the softmax of Eq. 7. A small sketch of the trend feature follows.
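A small sketch under our reading of Eqs. 9-11; the convolution module f is an illustrative choice, since the paper does not specify its layers.

```python
import torch
import torch.nn as nn

# Hypothetical convolution module f of Eq. 11.
trend_conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(8, 1, 3, padding=1))

beta = torch.zeros(1, 1, 64, 64)     # Eq. 9: blank trend map at the output scale

def update_trend(beta, index, W=64):
    i, j = index % W, index // W     # recover (x, y) from the flat argmax index
    beta[0, 0, j, i] = 1.0           # Eq. 10: mark the predicted position
    return trend_conv(beta)          # Eq. 11: F_t, fed into Eq. 12 via U_f
```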

3 Handwriting Reconstruction Constraints

Although the Spatial-Temporal Encoder-Decoder Network adapts reasonably well to reconstructing handwritten Chinese characters with complex fonts and broken strokes, handwriting point probabilities computed over the whole image become chaotic on such samples. To constrain this chaos, we designed connection rules based on the different types of handwriting points, as shown in Fig. 5.

In the key point detection module, we divide all points into connection points and end points. Accordingly, we define two rules:

Rule 1

The starting point of a stroke must be an end point.

Rule 2

The line segment connecting two consecutive points must pass through a solid stroke.

In practical applications, we select candidate points based on these rules and the output of the model (see Eqs. 13-15),

$$\begin{aligned} p_{ti} = \left\{ \begin{array}{lll} \frac{\exp \left( e_{ti} \right) \times d_i}{\sum _{i=0}^L\exp \left( e_{ti} \right) \times d_i }, &{} if~last~point \in endpoints,\\ \frac{\exp \left( e_{ti} \right) \times k_i}{\sum _{i=0}^L\exp \left( e_{ti} \right) \times k_i}, &{} if~last~point \in connection~points. \end{array} \right. \end{aligned}$$
(13)
$$\begin{aligned} l = \frac{1}{n}\phi \left( last~point,candidate~point \right) \end{aligned}$$
(14)
$$\begin{aligned} P_{ti} = \left\{ \begin{array}{lll} \max \left( p_{ti} \right) ,&{} if~last~point \in endpoints,\\ \max \left( p_{ti} \times l \right) , &{} if~last~point \in connection~points. \end{array} \right. \end{aligned}$$
(15)

where \(\phi \left( l,c\right) \) in Eq. 14 denotes interpolated sampling between the two points in the original image and n is the number of samples. In addition, \(k_i \in key~point~map\) and \(d_i \in end~point~map\). \(P_{ti}\) is the final predicted value. A sketch of this constrained selection follows.
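The following sketch shows how these constraints could be applied at inference time; the sampling scheme in `line_ink_ratio` and all function names are our assumptions about Eqs. 13-15, not the authors' implementation.

```python
import torch

def line_ink_ratio(img, p0, p1, n=16):
    """Eq. 14: mean intensity sampled along the segment p0 -> p1 in the
    original image (simple linear interpolation with n samples)."""
    ts = torch.linspace(0, 1, n)
    xs = (p0[0] + ts * (p1[0] - p0[0])).long().clamp(0, img.size(1) - 1)
    ys = (p0[1] + ts * (p1[1] - p0[1])).long().clamp(0, img.size(0) - 1)
    return img[ys, xs].mean()

def constrained_pick(e, last_is_end, end_map, key_map, img, last_pt, coords):
    """Eq. 13: reweight the responses with d_i (after an end point) or k_i
    (after a connection point). Eq. 15: after a connection point, also weight
    each candidate by the ink ratio l before taking the maximum."""
    w = end_map if last_is_end else key_map            # flattened (L,) maps
    p = torch.exp(e) * w
    p = p / p.sum()                                    # Eq. 13
    if not last_is_end:
        l = torch.stack([line_ink_ratio(img, last_pt, c) for c in coords])
        p = p * l                                      # Eq. 15, second branch
    return p.argmax()
```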

Fig. 5. Example of situations arising during the reconstruction process.

4 Loss Function

Since the Spatial-Temporal Encoder-Decoder Network must learn two different tasks, key point detection and key point ordering, we define the final loss function as in Eq. 16, following [8], where \(L_{det}\) is the loss of the key point detection task and \(L_{sq}\) is the loss of the key point ordering task.

$$\begin{aligned} L = L_{det} + L_{sq} \end{aligned}$$
(16)

To measure the gap between the predicted map and the label, and to balance the strong class imbalance between key points and background, the focal loss [10] is used as the detection loss, as shown in Eq. 17.

$$\begin{aligned} L_{det} = \frac{-1}{N}\sum _{c=1}^C \sum _{h=1}^H \sum _{w=1}^W \left\{ \begin{array}{ll} \beta \left( 1-p_{cij}\right) ^{\alpha }\log \left( p_{cij}\right) ,&{}if~y_{cij}=1,\\ \left( 1-\beta \right) p_{cij}^{\alpha } \log \left( 1-p_{cij} \right) ,&{} otherwise. \end{array} \right. \end{aligned}$$
(17)

Unlike traditional ranking losses, our ordering task is to maximize the probability of the labelled point at each time step, so we directly adopt the cross-entropy loss (\(L_{sq}\), see Eq. 18). A sketch of both losses follows Eq. 18.

$$\begin{aligned} L_{sq} = \frac{-1}{N} \sum _{t=1}^N \log \left( p_{t,label\left[ t\right] }~\right) \end{aligned}$$
(18)
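As a sketch, both terms could be implemented as follows; the focal-loss hyper-parameters \(\alpha \) and \(\beta \) are illustrative values, since the paper does not specify them.

```python
import torch

def detection_focal_loss(pred, target, alpha=2.0, beta_w=0.25, eps=1e-6):
    """Eq. 17: focal loss over the key point heatmaps, normalized by the
    number of positive (key point) pixels."""
    pos = target.eq(1).float()
    loss = -(beta_w * pos * (1 - pred).pow(alpha) * (pred + eps).log()
             + (1 - beta_w) * (1 - pos) * pred.pow(alpha) * (1 - pred + eps).log())
    return loss.sum() / pos.sum().clamp(min=1)

def sequence_ce_loss(probs, labels, eps=1e-6):
    """Eq. 18: cross entropy maximizing the probability of the labelled point
    index at each time step; probs is (T, L), labels is (T,)."""
    return -(probs[torch.arange(labels.numel()), labels] + eps).log().mean()
```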

5 Experiment

To verify the effectiveness of the proposed method for handwriting reconstruction, this section presents ablation and comparative experiments.

5.1 Dataset Processing

OLHWDB1.1 and the Tamil dataset are used in the experiments. OLHWDB1.1 includes 3755 classes of Chinese characters, each written separately, with the pen-tip stroke coordinates recorded. The Tamil dataset, used in paper [6] and originating from an HP competition, lets us explore the reconstruction effect on another language. Since both datasets store characters as trajectory sequences, we convert them to offline form.

Unlike offline handwritten characters, which are saved as static images, online handwritten characters retain richer dynamic writing information in the form of point sequences. We store the original data in the form of Eq. 19 [20],

$$\begin{aligned} \left[ \left[ x_1,y_1,s_1\right] ,\left[ x_2,y_2,s_2\right] ,...,\left[ x_n,y_n,s_n\right] \right] \end{aligned}$$
(19)

where \(x_i\) and \(y_i\) are the coordinates and \(s_i\) is the pen state. We then convert the dataset into a specific form for training.

We regard points that are too densely spaced, as well as intermediate points lying on the same straight line, as redundant. To filter them out, we set two conditions [20], shown in Eq. 20 and Eq. 21,

$$\begin{aligned} \sqrt{\left( x_i-x_{i-1}\right) ^2 + \left( y_i-y_{i-1}\right) ^2} \le T \end{aligned}$$
(20)
$$\begin{aligned} \frac{\varDelta x_{i-1} \varDelta x_i + \varDelta y_{i-1}\varDelta y_i}{\sqrt{\left( \varDelta x_{i-1}^2 + \varDelta y_{i-1}^2 \right) \cdot \left( \varDelta x_i^2+\varDelta y_i^2 \right) }}\ge C \end{aligned}$$
(21)

where T is the threshold for filtering out points that are too densely spaced, and C is the threshold for filtering out intermediate points on the same straight line. To protect the start and end points of strokes from being removed, the filtering is only applied when \(s_{i-1}=s_i=s_{i+1}\). A sketch of this filter follows.
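A minimal sketch of the filter under stated assumptions: the direction of the inequality in Eq. 21 depends on the orientation of the difference vectors, and our reading takes both vectors outward from the middle point, so that a collinear middle point gives a cosine near \(-1 \le C = -0.9\) and is dropped.

```python
import math

def filter_points(pts, T=25.6, C=-0.9):
    """Redundancy filter (Eqs. 20-21). pts is a list of (x, y, s) tuples;
    T = 0.05 * max(H, W) = 25.6 for 512x512 images (Sect. 5.2). Points are
    only considered for removal when s_{i-1} = s_i = s_{i+1}."""
    out = [pts[0]]
    for i in range(1, len(pts) - 1):
        (x0, y0, s0), (x1, y1, s1), (x2, y2, s2) = out[-1], pts[i], pts[i + 1]
        if not (s0 == s1 == s2):                    # protect stroke endpoints
            out.append(pts[i])
            continue
        if math.hypot(x1 - x0, y1 - y0) <= T:       # Eq. 20: too dense
            continue
        ax, ay = x0 - x1, y0 - y1                   # vector to previous point
        bx, by = x2 - x1, y2 - y1                   # vector to next point
        cos = (ax * bx + ay * by) / (
            math.hypot(ax, ay) * math.hypot(bx, by) + 1e-6)
        if cos <= C:                                # Eq. 21: nearly collinear
            continue
        out.append(pts[i])
    out.append(pts[-1])
    return out
```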

Offline Character Generation. To make the key point heat maps and labels correspond to the offline image, we map the preprocessed handwriting point coordinates to an image of size \(H' \times W'\) and then resize it to \(H \times W\) (see (a) in Fig. 6). According to the corresponding label, a key point heat map is generated on the image from a Gaussian distribution [15] (see (b) in Fig. 6).

Fig. 6. Offline characters (a) and corresponding heatmap labels (b).
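A common way to render such Gaussian key point labels, as a sketch (the paper follows [15] but does not state \(\sigma \), so the value here is illustrative):

```python
import torch

def gaussian_heatmap(cx, cy, H=64, W=64, sigma=1.5):
    """Render one key point label as a 2D Gaussian centred on (cx, cy)."""
    ys = torch.arange(H).view(-1, 1).float()
    xs = torch.arange(W).view(1, -1).float()
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```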

5.2 Implementation Details

The network model in this article is built with the PyTorch framework and runs on a 64-bit Linux system with an NVIDIA 1080Ti GPU. In data preprocessing, the parameter T in Eq. 20 is \(0.05 \times \max \left( H,W \right) \) and the parameter C in Eq. 21 is \(-0.9\). The image size is \(H \times W =512 \times 512\); the heat map and output scale are \(H' \times W' =64 \times 64\). The dimension of the spatial encoder output \(a_i\) is \(d = 128\), and the hidden state \(H_t\) of the GRU is a 64-dimensional tensor. Finally, we use the Adam optimizer with initial learning rate \(lr=0.001\), decayed by a factor of 0.1 every 10 epochs, as sketched below.
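This optimizer setup corresponds to a few lines of PyTorch (assuming "rounds" means epochs; the placeholder model stands in for the full network):

```python
import torch

model = torch.nn.Linear(4, 4)  # placeholder for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.1 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```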

5.3 Evaluation Metrics

At present there is no unified standard for evaluating online handwriting generation, as papers [6, 20] illustrate; this is largely due to the large differences between handwriting generation methods.

Owing to the particularity of the proposed method, we use the average probability at the corresponding position of each handwriting point of a character as the criterion of model quality (see Eq. 22).

$$\begin{aligned} mean P = \frac{1}{K} \sum _{t=1}^K p_{t,indice} \end{aligned}$$
(22)

where K is the number of trajectory points. Although meanP cannot fully represent how well a character's handwriting is recovered, it reflects how strongly the model responds at the handwriting points. A sketch of this metric follows.
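Computing meanP from a predicted probability sequence is direct; a minimal sketch (function and argument names are ours):

```python
import torch

def mean_p(prob_seq, indices):
    """Eq. 22: average predicted probability at the ground-truth position of
    each of the K trajectory points; prob_seq is (K, L), indices is (K,)."""
    K = indices.numel()
    return prob_seq[torch.arange(K), indices].mean()
```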

In addition, to facilitate comparison with paper [6], we also adopt their evaluation metrics (see Eqs. 23-24),

$$\begin{aligned} Starting~Point~ Accuracy = \frac{Number~of~correct~SP}{Total~number~of~test~images} \end{aligned}$$
(23)
$$\begin{aligned} Junction~Point~ Accuracy = \frac{Number~of~correct~JP}{Total~number~JP~points~in~test~data} \end{aligned}$$
(24)

When the complete trajectory \(\left( CT \right) \) of an offline character image is perfectly retrieved along with the correct starting point, we count it as a positive result.

Table 1. The meanP of each method combination

5.4 Experiment and Result Analysis

To verify the necessity of the handwriting trend characteristics \(\left( TC\right) \) of Eq. 11 and the reconstruction constraints \(\left( RC\right) \) in the proposed model, we conducted ablation experiments with meanP as the evaluation metric, randomly selecting 5000 samples from the test set (see Table 1). The results in Table 1 are as expected: the trend characteristics provide the model with the features formed by the handwriting points of all previous moments, giving instructive information for the next moment, while the reconstruction constraints correct errors in time and provide more accurate information for the next step.

Table 2. Stroke recovery accuracy
Fig. 7. Examples of trajectories recovered from offline characters on the OLHWDB dataset.

Fig. 8. Examples of trajectories recovered from offline characters on the Tamil dataset.

We also conducted a comparative experiment against paper [6] on the Tamil dataset and OLHWDB1.1; the results are from 1,000 randomly selected samples (see Table 2).

We compared the accuracy of our proposed method with the method from [6], selected because it is the most recent method in this field, and implemented it on our datasets. The results show that our model is more suitable for the trajectory reconstruction of Chinese characters. A few qualitative results are shown in Fig. 7 and Fig. 8.

6 Conclusion

This paper proposes a method that regresses trajectory sequences by generating heat maps with a Spatial-Temporal Encoder-Decoder Network. Its reconstruction results are better than those of method [6] on OLHWDB. However, the coordinates generated this way cannot be trained directly in the network: selecting the maximum-probability point is non-differentiable, so supervision must pass indirectly through the generated heat map labels. In future work we will focus on this problem and combine the model with a GAN to generate more complete trajectory sequences. In addition, whether the model can fully recover the handwriting of characters unseen during training will also be a future research direction.