17.1 Introduction

Mathematics is often called the “handmaiden of science” and plays a pivotal role in scientific research. It is usually expressed in the form of equations, and handwriting remains the primary way of writing them; the handwritten expressions are later encoded in LaTeX or MathML for proper rendering in digital documents. However, automatically recognizing Handwritten Mathematical Expressions (HMEs) and converting them to an appropriate format remains a difficult task. One of the main difficulties is the two-dimensional nature of HMEs [1, 2], i.e. their symbols are related to one another spatially rather than in a simple left-to-right sequence.

Solving the HME recognition problem can be broken down into two major stages: symbol recognition and structural analysis. The analysis may be done in two ways, sequentially or globally. The sequential approach [3] first performs symbol recognition and then proceeds to structural analysis, whereas the global approach deals with both at the same time. Both approaches come with their own problems. They require prior knowledge of the type of expressions to be recognized in order to build the parser, and the complexity of the parser increases with the number of symbols it handles. In addition, sequential parsers do not take into account the semantic context among associated symbols when resolving ambiguous symbols.

In the last decade or so, encoder-decoder models have been applied to HME recognition, motivated by their success in machine translation [4]. We propose a variation of the encoder-decoder model that requires less time and has lower complexity during the training phase. The model is trained to take images of HMEs as input and produce MathML strings, which can be embedded directly in any word processor that accepts MathML. The overall goal is to generate a document file, such as a .docx file for a WYSIWYG text editor, that contains the mathematical expression. Complex mathematical expressions are frequently required in scientific documents, and entering them manually with present methods is a hassle that we aim to remove. MathML is an XML-based format, and since most text editors use XML-based formats internally, the conversion is a trivial task provided that the text editor supports the symbols involved and that there exists an XSLT transformation that converts MathML to the format used by the word processor.

The rest of the paper is organized as follows: Sect. 17.2 summarizes the related work. Section 17.3 describes the proposed methodology. Section 17.4 discusses the results and comparison, and finally Sect. 17.5 concludes this paper.

17.2 Related Work Summary

Handwritten mathematical expression recognition (HMER) consists of two elemental components: symbol recognition and structural analysis. Given the two-dimensional nature of a mathematical expression, many researchers prefer approaches based on predefined grammars as a natural way to perform structural analysis. Several types of math grammars have been investigated. Chan and Yeung [5] used definite clause grammars. However, their system works only on online mathematical expressions; they did not demonstrate it on an offline data set. The authors in [7] showed the effectiveness of stochastic context-free grammars, as systems based on them typically performed well in the CROHME competitions. Approaches based on probabilistic context-free grammars (PCFGs) analyze the structure of a mathematical expression and deal with ambiguities in handwritten data; such an approach was proposed in [6, 8]. However, it deals only with online mathematical expressions, and the authors leave its application to offline recognition of both printed and handwritten expressions as future work. The authors in [9] proposed a neural network framework, namely the encoder-decoder, for sequence-to-sequence learning. The encoder-decoder model has many applications, including [10,11,12].

17.3 Methodology

17.3.1 Overview

The encoder in our model is a pretrained DenseNet [13] followed by two fully convolutional layers (FCN) [14], which together produce the encoded image features. The decoder is a recurrent neural network (RNN) [15] with gated recurrent units (GRU) [16] that converts the encoded image features into a MathML string, which is our desired output. The resulting model (1) is end-to-end trainable, (2) produces expressions based on data rather than a predefined grammar, and (3) takes into account the semantic context of a symbol to choose the best symbol and position. The data used for training and validation is the CROHME dataset, which contains the stroke metadata (pen-up and pen-down sequences) recorded while each expression was written, as well as the ground truth in the form of MathML. The flowchart of the proposed methodology is shown in Fig. 17.1.

Fig. 17.1 Flowchart of the proposed model: the input image passes through the DenseNet image encoder and two 2-D convolutional layers, and the resulting image features are decoded by the GRU-based attention model into the output MathML expression

17.3.2 Dataset Preprocessing

Handwritten expressions are usually stored as images that vary in quality and size, so image preprocessing is applied to bring the images into a specific format before they are fed to the encoder. The preprocessing consists of resizing the image, center cropping it, and normalizing the pixel values to keep them within a fixed range. The MathML corresponding to each expression image is stored separately under the same name as the image file. A MathML expression consists of predefined tags, operator symbols, and operand symbols, following the pattern of one symbol at a time between an opening and a closing tag. MathML thus encodes the two-dimensional structure of the input handwritten expression. The MathML expression is divided into tokens of tags, operator symbols, and operand symbols.
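The tokenization step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tokenizer used in the experiments; the regular expression, the function names, and the special tokens `<SOS>` and `<EOS>` are assumptions introduced for the example.

```python
import re

# A token is either a tag (<mi>, </mrow>, ...) or a run of characters
# between tags that is neither whitespace nor an angle bracket.
TOKEN_PATTERN = re.compile(r"</?[A-Za-z][^>]*>|[^<>\s]+")

def tokenize_mathml(mathml):
    """Split a MathML string into tag tokens and symbol tokens."""
    return TOKEN_PATTERN.findall(mathml)

def build_vocab(token_lists):
    """Assign an integer id to every distinct token.
    <SOS> and <EOS> are hypothetical start/end markers for the decoder."""
    vocab = {"<SOS>": 0, "<EOS>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

sample = "<mrow><mi> x</mi> <mrow> <mo> + </mo> <mi> y </mi></mrow></mrow>"
tokens = tokenize_mathml(sample)
vocab = build_vocab([tokens])
print(tokens)
```

Each token id can then be turned into the one-hot vector that the decoder's embedding layer expects.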

17.3.3 Encoder

The encoder takes the transformed image and converts the 3-channel image into an N-channel feature matrix, which is an intermediate form used as the decoder input. The encoder consists of a DenseNet and convolution layers stacked on top of one another. The DenseNet consists of dense blocks; within each dense block, the concatenation of the outputs of all preceding layers is fed as input to each succeeding layer. Let \({\mathbf{H}}_{\mathbf{l}}(.)\) denote the convolution function of the lth layer; then the output of layer l is represented as:

$${x}_{l}={H}_{l}\left(\left[{x}_{0};{x}_{1};{x}_{2};\dots ;{x}_{l-1}\right]\right)$$
(17.1)

where \({x}_{0}\), \({x}_{1}\), …, \({x}_{l}\) denote the output features produced in layers 0, 1, …, l, and “;” denotes the concatenation of feature maps.

The connections established between layers enable DenseNet to reuse features extracted in previous layers and allow gradients to propagate easily to the initial layers. This mechanism also strengthens feature extraction in DenseNet without requiring much deeper convolution layers.
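To make the concatenation in Eq. (17.1) concrete, the following sketch tracks only the channel counts inside one dense block. The values (64 input channels, a growth rate of 32, 6 layers) are illustrative assumptions; the text does not state which DenseNet configuration is used.

```python
def dense_block_channels(k0, growth_rate, num_layers):
    """Return the number of input channels seen by each layer H_l
    (l = 1..num_layers) and the channel count leaving the block."""
    features = [k0]              # channel counts of x_0, x_1, ...
    inputs_per_layer = []
    for _ in range(num_layers):
        # H_l receives the concatenation [x_0; x_1; ...; x_{l-1}]
        inputs_per_layer.append(sum(features))
        # x_l adds growth_rate new channels to the running concatenation
        features.append(growth_rate)
    return inputs_per_layer, sum(features)

inputs, out_channels = dense_block_channels(k0=64, growth_rate=32, num_layers=6)
print(inputs, out_channels)
```

The input width grows linearly with depth, which is why every layer can reuse all earlier feature maps while adding only a small number of new channels.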

In this paper, the pretrained DenseNet model provided by PyTorch is used. Using a pretrained DenseNet has its advantages: it avoids the cost of training such a complex and memory-consuming architecture, and it is easy to load. The output produced by DenseNet is large. Convolutional layers are widely used to reduce the size of an n-dimensional feature representation while preserving the features it represents. Thus, the last layers of the DenseNet model are removed so that the model works as a feature extractor instead of a classifier, and two convolution layers are then stacked on the DenseNet output to reduce it to a compact feature representation. The proposed model takes a raw expression image as input and generates the corresponding MathML sequence.
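How two stacked convolutions shrink a feature map can be checked with the standard output-size formula for a convolution without dilation. The 7×7 starting size and the 3×3 unpadded kernels below are hypothetical values chosen for illustration; the kernel sizes and strides actually used are not given in the text.

```python
def conv2d_output_size(size, kernel, stride=1, padding=0):
    """Output size of a 2-D convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical example: a 7x7 feature map followed by two 3x3 convolutions.
size = 7
sizes = [size]
for kernel in (3, 3):
    size = conv2d_output_size(size, kernel)
    sizes.append(size)
print(sizes)
```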

17.3.4 Decoder

The input block of the decoder provides the one-hot encoding of the input word to the embedding layer. The embedding layer converts this one-hot encoding into a word embedding, a vector of length H, where H is the hidden size (hidden_size). Word embeddings are an efficient way to represent relations between words in a vocabulary. An embedding is a dense vector of floating-point values that represents a word’s features; more importantly, these features can be learned by training the embedding layer. The working of the decoder is shown in Fig. 17.2.

Fig. 17.2 Decoder. Blocks: input, embedding layer, previous hidden state, attention, encoder outputs, BMM with attention weights, combined attention layer, GRU, hidden state, and output layer

Let \({x}_{i}\) be the one-hot encoding of the input word and \({O}_{{\text{en}}}\) the encoder output. Then \({O}_{{\text{emm}}}\) denotes the output of the embedding layer, which takes a vector of vocabulary size as input and gives an output vector of size H.

$${O}_{{\text{emm}}}={W}_{{\text{emm}}}{x}_{i}$$
(17.2)

The previous hidden state \({h}_{t-1}\) is a vector of size H that corresponds to the last hidden state generated by the GRU.

$${x}_{{\text{attn}}1}=\{{O}_{{\text{emm}}};{h}_{t-1}\}$$
(17.3)

Here, \({x}_{{\text{attn}}1}\) is the concatenation of the embedding output and the previous hidden state. The attention block, a linear layer that takes an input of size 2*H and gives an output of size H, is applied to \({x}_{{\text{attn}}1}\).

$${O}_{{\text{attn}}1}={\text{softmax}}({W}_{{\text{attn}}1}{x}_{{\text{attn}}1}+{b}_{{\text{attn}}1})$$
(17.4)

\({O}_{{\text{attn}}1}\) represents the output of the attention block, which acts as the attention weights for the encoder output.

The softmax activation function converts real values into probabilities so that they can be applied as weights to the encoder output.

$$x_{{{\text{in}}}} = O_{{{\text{attn}}1}} \otimes O_{{{\text{en}}}}$$
(17.5)

\({x}_{{\text{in}}}\) is the element-wise multiplication of the attention weights and the encoder output.

$${x}_{{\text{out}}}=\{{O}_{{\text{emm}}} ;{x}_{{\text{in}}}\}$$
(17.6)

\({x}_{{\text{out}}}\) represents the concatenation of the embedding output and \({x}_{{\text{in}}}\). It is the input to the second attention block, called attention combined, which is also a linear layer that takes an input of size 2*H and gives an output of size H.

$${O}_{{\text{attn}}2}={\text{RELU}}({W}_{{\text{attn}}2}{x}_{{\text{out}}}+{b}_{{\text{attn}}2})$$
(17.7)

The rectified linear activation function (RELU) is a piecewise linear function that outputs its input directly if the input is positive and outputs zero otherwise. \({O}_{{\text{attn}}2}\) is the output of the combined attention layer, and it is a vector of size H. \({O}_{{\text{attn}}2}\) is the input to the GRU block of the decoder.
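Equations (17.2) to (17.7) can be traced with a small pure-Python sketch. It follows the equations as written, so the encoder output is treated as a single vector of size H and Eq. (17.5) is an element-wise product; Fig. 17.2 mentions a BMM, which would apply if the encoder output were a sequence of vectors. The toy sizes, the weights, and the bias-free embedding are all assumptions made for the example.

```python
import math

def linear(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(v):
    m = max(v)                                   # subtract max for stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def relu(v):
    return [max(0.0, a) for a in v]

def attention_step(x_i, h_prev, o_en, W_emm, W_attn1, b_attn1, W_attn2, b_attn2):
    """One pass through Eqs. 17.2-17.7; returns (O_attn1, O_attn2)."""
    o_emm = linear(W_emm, x_i, [0.0] * len(W_emm))        # Eq. 17.2
    x_attn1 = o_emm + h_prev                               # Eq. 17.3, concatenation
    o_attn1 = softmax(linear(W_attn1, x_attn1, b_attn1))   # Eq. 17.4
    x_in = [a * e for a, e in zip(o_attn1, o_en)]          # Eq. 17.5, element-wise
    x_out = o_emm + x_in                                   # Eq. 17.6, concatenation
    o_attn2 = relu(linear(W_attn2, x_out, b_attn2))        # Eq. 17.7
    return o_attn1, o_attn2

# Toy sizes (hypothetical): vocabulary of 3 tokens, hidden size H = 2.
x_i = [0.0, 1.0, 0.0]                                   # one-hot for token 1
W_emm = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]              # H x V
h_prev = [0.0, 0.0]
o_en = [1.0, -1.0]                                      # encoder output, size H
W_attn1 = [[0.5, -0.5, 0.0, 0.0], [-0.5, 0.5, 0.0, 0.0]]  # H x 2H
W_attn2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]    # H x 2H
bias = [0.0, 0.0]
o_attn1, o_attn2 = attention_step(x_i, h_prev, o_en, W_emm,
                                  W_attn1, bias, W_attn2, bias)
print(o_attn1, o_attn2)
```

With a one-hot input, the product in Eq. (17.2) simply selects one column of the embedding matrix, which is why embedding layers are usually implemented as table lookups.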

$${x}_{t}= {O}_{{\text{attn}}2}$$
(17.8)

GRU is an improved version of the RNN that mitigates the vanishing gradient problem. Given the input \({x}_{t}\), the GRU output \({h}_{t}\) is computed as:

$${h}_{t}={\text{GRU}}({x}_{t}, {h}_{t-1})$$
(17.9)

The softmax activation function is applied to the GRU output to generate a vector of output probabilities, and argmax is applied to predict the output word.
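The GRU update in Eq. (17.9) and the final prediction step can be sketched in pure Python as below. The gate equations are not spelled out in the text, so the sketch assumes the common formulation with reset gate r, update gate z, and candidate state n, combined as h_t = (1 - z) * n + z * h_{t-1}, with one bias vector per gate. The linear output layer that maps the hidden state to vocabulary size is suggested by Fig. 17.2; its weights and all parameter names here are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_cell(x_t, h_prev, p):
    """One GRU step. `p` maps names such as "W_r", "U_r", "b_r" to weights
    (an assumed parameter layout, chosen only for this sketch)."""
    def pre(name):
        return [a + b + c for a, b, c in zip(
            matvec(p["W_" + name], x_t), matvec(p["U_" + name], h_prev), p["b_" + name])]
    r = [sigmoid(v) for v in pre("r")]           # reset gate
    z = [sigmoid(v) for v in pre("z")]           # update gate
    wx = matvec(p["W_n"], x_t)
    uh = matvec(p["U_n"], h_prev)
    n = [math.tanh(a + ri * b + c)               # candidate state
         for a, ri, b, c in zip(wx, r, uh, p["b_n"])]
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]

def predict_token(h_t, W_out, b_out):
    """Output layer, softmax, and argmax over the vocabulary."""
    logits = [a + b for a, b in zip(matvec(W_out, h_t), b_out)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Toy example: H = 2, vocabulary of 3 tokens, all GRU weights set to zero.
H = 2
params = {}
for gate in ("r", "z", "n"):
    params["W_" + gate] = [[0.0] * H for _ in range(H)]
    params["U_" + gate] = [[0.0] * H for _ in range(H)]
    params["b_" + gate] = [0.0] * H

h_t = gru_cell([0.3, 0.7], [1.0, -2.0], params)
W_out = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]    # V x H
token, probs = predict_token(h_t, W_out, [0.0, 0.0, 0.0])
print(h_t, token)
```

With zero weights both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one; this makes the update rule easy to verify by hand.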

17.3.5 Document

The predicted MathML is parsed into a tree structure and inserted into a Word document using Python libraries, namely python-docx and xET.
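The parsing step can be illustrated with the standard-library module `xml.etree.ElementTree`. This is a minimal sketch of turning a predicted MathML string into a tree and reading its symbols back; the function name is hypothetical, and the insertion into a .docx file with python-docx is not shown.

```python
import xml.etree.ElementTree as ET

def mathml_to_text(mathml):
    """Parse a MathML string into a tree and return its leaf symbols
    in document order, separated by spaces."""
    root = ET.fromstring(mathml)       # raises ET.ParseError if ill-formed
    symbols = [node.text.strip() for node in root.iter()
               if len(node) == 0 and node.text and node.text.strip()]
    return " ".join(symbols)

predicted = "<mrow><mi> x</mi> <mrow> <mo> + </mo> <mi> y </mi></mrow></mrow>"
print(mathml_to_text(predicted))
```

A prediction whose tags are not properly nested or closed cannot be parsed at all, so the parse step also acts as a well-formedness check on the decoder output.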

17.4 Result and Comparison

This section describes the system settings used for experimentation and the evaluation metrics used.

17.4.1 Experimental Setup

The system is implemented on an Intel(R) Core(TM) i5 CPU at 3.30 GHz with 4 cores and 8 GB of RAM. During training of the model, the quantities monitored are the training loss and the validation loss.

The red line in Fig. 17.3 represents the log loss, and the blue line represents the validation loss.

Fig. 17.3 Loss and validation graph: both curves fall, with fluctuations (axes unlabeled)

Figure 17.4 shows the decrease in loss (red curve) and the increase in BLEU score on the test set (blue curve) over the epochs.

Fig. 17.4 Loss and BLEU score: the loss falls and the BLEU score rises, with fluctuations (axes unlabeled)
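For reference, a BLEU score over MathML token sequences can be computed as in the sketch below. It implements the standard sentence-level definition (clipped n-gram precisions up to 4-grams with uniform weights, multiplied by the brevity penalty) without smoothing. Which BLEU variant or library was used for the reported scores is not stated, so this is an assumption for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = ["<mrow>", "<mi>", "x", "</mi>", "<mrow>", "<mo>", "+", "</mo>",
             "<mi>", "y", "</mi>", "</mrow>", "</mrow>"]
wrong_symbol = [tok if tok != "y" else "z" for tok in reference]
print(bleu(reference, reference), bleu(reference, wrong_symbol))
```

Because tags dominate the token sequence, a prediction with the right structure but one wrong symbol still receives a fairly high score, which is worth keeping in mind when reading BLEU values for MathML output.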

Model comparison by BLEU scores (see Table 17.1):

Table 17.1 Comparison results

Initial predictions—<mrow><mi><mi></mi><mrow><mo><mi></mi>

<mrow><mi><mo></mi></mrow>

Original value—<mrow><mi> x</mi> <mrow> <mo> + </mo> <mi> y </mi></mrow></mrow>

For the input image in Fig. 17.5, the resulting output is x + y.

Fig. 17.5 Input handwritten expression, x + y

17.5 Conclusion

In this paper, we conclude that, using a pretrained DenseNet encoder, an attention model can be trained on the extracted features to provide good accuracy. DenseNet provides better image features than most of the state-of-the-art models available for image segmentation and feature extraction. Using the pretrained network significantly reduces the computational cost that would otherwise be required to train a DenseNet. In addition, the conversion of feature vectors to MathML provides a base for conversion to other standard formats of mathematical expressions. Finally, the GRU-based architecture of the decoder is defined specifically for this task, and experiments have been carried out to assess its effectiveness.