Keywords

1 Introduction

Mathematical expression (ME) plays an important role in scientific and technical documents. As an intuitive and easily comprehensible knowledge representation model, MEs are present in various types of literature and could help the dissemination of knowledge in many related domains, including mathematics, physics, economy and many other fields. Mathematical expression recognition is the process of converting scanned images or online handwritten form of mathematical expression into editable text and other forms through related technologies such as image processing, image recognition, semantic analysis, and structural analysis.

With the electronic process of document and the development of online education, the need to identify MEs in pictures or documents, which are based on typesetting systems such as LaTeX and MS Word, has increased recently. It is important to reduce time in converting image-based documents like PDF to text-based documents that are easy to use and edit, but the recognition of MEs is far more difficult than the recognition of traditional text. Since the two-dimensional structure of the mathematical expression is significantly different from the traditional natural language, it is difficult for traditional text OCR technology to be directly applied to mathematical expression recognition.

Meanwhile, benefit from the development and popularization of smart terminal devices based on optical scanning and pen writing input, online handwritten data has become a more common important data type. As one of the branches, people are allowed to use handwritten mathematical expressions (HMEs) as the input data conveniently and naturally, which puts the problem, how to recognize mathematical handwritten expressions, in people perspective. Therefore, the software implementation aspect of the HMEs recognition system has become a key issue. Nevertheless, it is still a difficult problem to recognize HMEs successfully for the reason that the handwritten mathematical expression recognition exhibits three distinct difficulties [1, 2]: i.e., the complex two-dimensional structures, enormous ambiguities in handwriting input and strong dependence on contextual information. Because of the ambiguity present in handwritten input, it is often unrealistic to hope for consistently perfect recognition accuracy.

This paper firstly discusses the three parts of the research status of mathematical expression recognition, including expression detection, symbol recognition, and structural analysis. Then introduces the research status of mathematical expression recognition from document images or handwritten forms. Finally, this paper looks forward to the future development of recognition of mathematical expression, especially handwritten mathematical expression.

2 Major Problems

There are three important directions for the recognition of mathematical expressions today [3, 4]: expression positioning, symbol segmentation, symbol recognition, and structural analysis. There are basically two approaches of mathematical expression recognition: online recognition and offline recognition.

2.1 Symbol Segmentation

In a mathematical expression, there usually exist multiple number, letter, and operator symbols. Before identifying a single symbol, we must first segment the individual symbol from the expression properly.

The mathematical symbol system is a symbol system with a two-dimensional spatial layout, which is different from traditional natural language texts with only a single sequence in the horizontal direction. For example, a fractional expression contains an upper, middle, and lower structure composed of a numerator, a fractional line, and a denominator. A radical expression includes a semi-enclosed structure composed of a root number, a square number, and a squared number. A fixed integrator operation includes an integrator and an integral upper limit. Structures that have both upper and lower relations and left and right relations formed with the integral lower limit, and even some mathematical expressions also contain complex nested structures of these relations. This feature makes the process of segmenting mathematical expression symbols extremely complicated. Figure 1 illustrates the segmentation difficulties mentioned above.

2.2 Expression Positioning

After the symbol segmentation step, we will get a list of objects with some known attribute values, and hope to get the identification of the symbol. In this step, we can apply any symbol recognition method as long as it is designed for the corresponding data type (i.e., online or offline). Not surprisingly, there are still some objective difficulties in the mathematical symbol recognition.

Fig. 1.
figure 1

Some examples of the spatial structure of mathematical expression.

The features of mathematical symbols which are not conducive to automatic recognition are as shown below.

In the task of identifying mathematical expression, local ambiguity is an inevitable problem. Local ambiguity means that the specific content represented by some strokes is difficult to determine. Especially in the handwriting scene, it is difficult to accurately distinguish many characters from the morphology alone. In some cases, it is difficult for even humans or the writer himself to give unique answers to what the strokes represent. For example, the hand-written English characters “O” and numeric characters “0” are difficult to distinguish, and similar ones include “2” and “z”, lowercase letters “x”, and multiplication operators “×”, “6”, and “b” “,” 9 “and” q “,” B “and” 13 “, etc. Figure 2 illustrates the segmentation difficulties mentioned above.

Fig. 2.
figure 2

Some examples of the local ambiguity of mathematical expression. The mathematical expressions are: ① \( {\text{O}}_{a} + a_{0} \); ② \( {\text{z}}_{1} + 1.23 \); ③ \( {\text{x}} \times 6{\text{b}} \); ④ \( 9{\text{B}} - 13{\text{q}} \).

The character set of mathematical symbols is relatively large, including English letters, Roman letters and Greek letters, numbers, and many operators. At the same time, font, size, bold, italics and others make the recognition of mathematical symbols more complicated. For example, in printing scene, the italic English letter “A” and the flower font “\( {\mathcal{A}} \)” are difficult to distinguish, which is much harder in handwriting scene. In mathematical expression, implicit multiplication and dot multiplication are quite common, which makes it difficult to distinguish commas, dots, other small symbols, and noise when the sample quality is poor. In addition, the adhesion of characters can also lead to local ambiguity. Figure 3 illustrates the segmentation difficulties mentioned above.

Fig. 3.
figure 3

Some examples of the local ambiguity of mathematical expression. The mathematical expressions are:① the italic English letter “A” and the flower font “\( {\mathcal{A}} \)”; ② \( \left( {{\text{p}} + 2} \right)\left( {{\text{y}} + 6} \right) \).

The sources of mathematical expression samples are mainly divided into print samples and handwriting samples. For the printing mathematical expression, due to its regular structural shape, the general recognition methods have a good performance in general. However, the recognition of the handwritten mathematical expression requires more consideration of its topological structure. Obviously, the difficulty in recognizing the handwritten mathematical expression is far greater than the recognition of the scanned image of the document.

As mentioned, there are basically two approaches of mathematical expression recognition: online recognition and offline recognition. The available information for online and offline recognition is different. Offline recognition mainly refers to the recognition of the mathematical expression image obtained by the scanner or the document containing the mathematical expression, including the printed matter and the handwriting. Online recognition refers to the recognition of the expression on smart terminal devices based on pen writing input, mainly the handwritten mathematical expression recognition. Compared with offline recognition, online recognition can make more use of handwriting trace information.

For online mathematical expression recognition, the handwriting trace information can help the model to identify segmentation. However, the problems caused by the difference in writing order and separation of handwriting trace from break make it difficult to use the handwriting trace information. For offline mathematical expression recognition, it is often judged by spatial information in binarized pictures. In view of the special circumstances of mathematical expression pictures, such as low pixels, blurred writing, and adhesion, offline mathematical expression recognition also faces many difficulties, and even segmentation ambiguity may occur (see Fig. 3).

2.3 Structure Analysis

To achieve accurate identification of mathematical expression, on the basis of mathematical symbol recognition, it is necessary to assemble, restore, and reconstruct mathematical symbols with the structural characteristics of the mathematical expression itself, to get the result of mathematical expression recognition. Structural analysis is mainly by determining the spatial relationship between the identified symbols, judging their logical relationship and constructive significance. Through structural analysis, the discrete recognition results can be assembled, restored, and reconstructed based on the original structural features, grammatical rules, priorities and other characteristics to complete the mathematical expression recognition of the entire process.

Suppose we are able to correctly segment and recognize mathematical symbols in the previous steps. Then, in the structural analysis stage, we can use the label, size and location of each mathematical symbol to build a parse tree or relationship tree. However, depending on the spatial operator, mathematical expression have the unique two-dimensional structure, which makes the spatial relationship between symbols in mathematical expressions more complicated. Therefore, we should first identify all spatial operators in order to build sub-structures over them and their operands. Using all intermediate substructures and the remaining objects, the final structure can be constructed. It is a pity that in some cases, the spatial relationship between symbols is still very vague, especially the handwritten mathematical expression. And it is difficult to judge some very complicated spatial operators from the simple positional relationship of the context, which requires the contextual semantic information.

The space operators mainly include implicit multiplication, subscripting, or exponentiation, which are widely used in mathematical expressions. However, the spatial relationship of the above space operators may be ambiguous. Figure 4 demonstrates that the same configuration of bounding boxes may reveal different spatial relationships. Furthermore, due to the complexity of large matrices and the structure of equations, it is more difficult to analyze and identify them.

Fig. 4.
figure 4

Implicit multiplication, subscripting, and exponentiation cannot be determined.

3 Document Mathematical Expression Recognition Research

Document mathematical expression recognition, normally, is printed mathematical expression recognition. With many researches in the identification of mathematical symbols at the beginning, many excellent methods have been proposed in the field of document mathematical expression recognition. In recent years, the development of deep learning has also led many researchers try to use neural network as experimental tools.

Ha et al. [5] proposed a recursive X-Y cut segmentation method, which worked well on typeset MEs but not fitted for handwritten cases. In the proposed system, a top-down page segmentation technique known as the recursive X-Y cut decomposes a document image recursively into a set of rectangular blocks. The recursive X-Y cut be implemented using bounding boxes of connected components of black pixels instead of using image pixels.

Kumar et al. [6] try a rule-based approach in the symbol information step. They focus on the recognition of printed MEs and assume connected components (ccs) of a given ME image are labelled. The proposed method comprises three stages, namely symbol formation, structure analysis and generation of encoding form like LATEX. The symbol formation process, where multi-cc symbols (like =, $\equiv $ etc.) are formed, identity of context-dependent symbols (like a horizontal line can be MINUS, OVERBAR, FRACTION etc.) are resolved using spatial relations. Multi-line MEs like matrices and enumerated functions are also handled in this stage. A rule-based approach is proposed for the purpose, where the heuristics based on spatial relations are represented in the form of rules (knowledge) and those rules are fired depending on input data (labelled ccs). As knowledge is isolated from data like an expert system in the approach, it allows for easy adaptability and extensibility of the process (Fig. 5).

Fig. 5.
figure 5

Various stages of Kumar’s approach [6] to ME recognition

Kim et al. [7] point out that the sequence of character segmentation is from left to right, and from top to bottom in case of general character recognition. However, mathematical expression is a kind of two-dimension visual language. Thus, they propose a modified recursive projection profile cutting method of character segmentation in images of mathematical expression, using depth first search for arranging and double linked list for re-arranging.

Murru et al. [8] propose the use of artificial neural networks (ANN) in order to develop pattern recognition algorithms capable of recognizing both normal texts and mathematical expression, and present an original improvement of the backpropagation algorithm. In symbol segmentation step, considering that features selection mainly depends on the experience of the authors, fuzzy logic can be more useful, since it is widely used in applications where tuning of features is based on experience and it can be preferred to a deterministic approach. Thus, a method is proposed that combines, by means of a fuzzy logic based approach, some state–of–the–art features usually exploited one at a time (Fig. 6).

Fig. 6.
figure 6

Kim’s [7] modified recursive projection profile cutting method of character segmentation in images of mathematical expression, and the double linked list applied in the paper.

4 Handwritten Mathematical Expression Recognition Research

Handwritten mathematical expression recognition includes both online recognition, such as the input from smart devices like note apps, and offline formula recognition, such as photos of mathematical expression. As mentioned in Sect. 2, due to the freedom of writing, handwritten mathematical expression recognition has greater difficulties than printed ones. Therefore, more diverse and complex technologies are needed, and researchers have invested more interest in it.

Hirata et al. [9] proposes a novel approach, based on expression matching, for generating ground-truthed exemplars of expressions (and, therefore, of symbols). Matching is formulated as a graph matching problem in which symbols of input instances of a manually labeled model expression are matched to the symbols in the model. Pairwise matching cost considers both local and global features of the expression (Fig. 7).

Fig. 7.
figure 7

Normalization of input expressions with respect to its corresponding model and deformation induced to the model graph in Hirata’s approach [9].

MacLean et al. [10] presents a new approach for parsing two-dimensional input using relational grammars and fuzzy sets. Motivated by the two-dimensional structure of written mathematics, a fast, incremental parsing algorithm is developed. Then check the similarity and membership between the fuzzy set and the handwritten input. The approach applies and improves existing rectangular partitioning and sharing techniques such as parsing forests, and then introduces new ideas such as relationship classes and interchangeability. A correction mechanism that allows the user to view the parsing results and select the correct interpretation in the case of identifying errors or ambiguities is proposed, and then such modifications are incorporated into subsequent incremental recognition results.

MacLean et al. [11] also uses Bayesian networks for parse tree selection. They proposed a system which captures all recognizable interpretations of the input and organizes them in a parse forest from which individual parse trees may be extracted and reported. If the top-ranked interpretation is incorrect, the user may request alternates and select the recognition result they desire. The tree extraction step uses a novel probabilistic tree scoring strategy in which a Bayesian network is constructed based on the structure of the input, and each joint variable assignment corresponds to a different parse tree. Parse trees are then reported in order of decreasing probability.

Simistira et al. [12] symbolizes the elastic template matching distance based on the directional feature of the pen. The structural analysis is based on extracting the baseline of the mathematical expression, and then classifying symbols into levels above and below the baseline. The symbols are then sequentially analyzed using six spatial relations and a respective two-dimensional structure is processed to give the resulting MathML representation of the ME (Fig. 8).

Fig. 8.
figure 8

Processing steps of Simistira’s approach [12] to ME recognition. (a) ME with strokes labeled in time order, (b) grouping of strokes into symbols, (c) symbol recognition, (d) assignment of symbols to levels, (e) hierarchical structure of MEs

Álvaro et al. [13] describes a formal model or the recognition of on-line handwritten mathematical expressions using two-dimensional stochastic context-free grammar and hidden Markov models to identify online handwritten mathematical expressions. Hidden Markov models are used to recognize mathematical symbols, and a stochastic context-free grammar is used to model the relation between these symbols. This formal model makes possible to use classic algorithms for parsing and stochastic estimation. The model is able to capture many of variability phenomena that appear in on-line handwritten mathematical expressions during the training process. The parsing process can make decisions taking into account only stochastic information, and avoiding heuristic decisions.

Awal et al. [14] presents an online handwritten mathematical expression recognition system that handles mathematical expression recognition as a simultaneous optimization of expression segmentation, symbol recognition, and two-dimensional structure recognition under the restriction of a mathematical expression grammar. The originality of the approach is a global strategy allowing learning mathematical symbols and spatial relations directly from complete expressions. A new contextual modeling is proposed for combining syntactic and structural information. Those models are used to find the most likely combination of segmentation/recognition hypotheses proposed by a two-dimensional segmentation scheme. Thus, the models are based on structural information concerning the symbol layout (Fig. 9).

Fig. 9.
figure 9

System architecture of Awal’s approach [14] to ME recognition.

Hu et al. [15] proposed a novel framework to analyze the layout and semantic information of handwritten mathematical expressions. The framework includes three steps, namely symbol segmentation, symbol recognition and semantic relationship analysis. For symbol segmentation, a decomposition on strokes is operated, then dynamic programming is adopted to find the paths corresponding to the best segmentation manner and reduce the stroke searching complexity. For symbol recognition, spatial geometry and directional element features are classified by a Gaussian Mixture Model learnt through Expectation-Maximization algorithm. At last, in the semantic relationship analysis module, a ternary tree is utilized to store the ranked symbols through calculating the operator priorities. The motivation for our work comes from the apparent difference in writing styles across western and Chinese populations.

Le et al. [16] present an end-to-end system to recognize Online Handwritten Mathematical Expressions (OHMEs). The system has three parts: a convolution neural network for feature extraction, a bidirectional LSTM for encoding extracted features, and an LSTM and an attention model for generating target LaTex (Fig. 10).

Fig. 10.
figure 10

Structure of the end-to-end model in Le’s approach [16].

Le et al. [17] also present a technique based on stroke order normalization for improving recognition of online handwritten mathematical expression (OHME). The stroke order dependent system has less time complexity than the stroke order free system, but it must incorporate special grammar rules to cope with stroke order variations. The presented stroke order normalization technique solves this problem and also the problem of unexpected stroke order variations without increasing the time complexity of ME recognition. In order to normalize stroke order, the X–Y cut method is modified since its original form causes problems when structural components in ME overlap. First, vertically ordered strokes are located by detecting vertical symbols and their upper/lower components, which are treated as MEs and reordered recursively. Second, unordered strokes on the left side of the vertical symbols are reordered as horizontally ordered strokes. Third, the remaining strokes are reordered recursively. The horizontally ordered strokes are reordered from left to right, and the vertically ordered strokes are reordered from top to bottom. Finally, the proposed stroke order normalization is combined with the stroke order dependent ME recognition system (Fig. 11).

Fig. 11.
figure 11

Horizontal and vertical components and their desired stroke order in Le’s approach [17].

Zhang et al. [18] extend the chain-structured BLSTM to tree structure topology and apply this new network model for online math expression recognition. The proposed system addresses the recognition task as a graph building problem. The input expression is a sequence of strokes from which an intermediate graph is derived using temporal and spatial relations among strokes. In this graph, a node corresponds to a stroke and an edge denotes the relationship between a pair of strokes. Then several trees are derived from the graph and labeled with Tree-based BLSTM. The last step is to merge these labeled trees to build an admissible label graph (LG) modeling two-dimensional expressions uniquely. The proposed system achieves competitive results in online math expression recognition domain (Fig. 12).

Fig. 12.
figure 12

A tree-based BLSTM network with one hidden level in Zhang’s approach [18].

Zhang et al. [19, 20] introduce Track, Attend and Parse (TAP), an end-to-end approach based on neural networks for online handwritten mathematical expression recognition (OHMER). The architecture of TAP consists of a tracker and a parser. The tracker employs a stack of bidirectional recurrent neural networks with gated recurrent units (GRU) to model the input handwritten traces, which can fully utilize the dynamic trajectory information in OHMER. Followed by the tracker, the parser adopts a GRU equipped with guided hybrid attention (GHA) to generate LATEX notations. The proposed GHA is composed of a coverage based spatial attention, a temporal attention and an attention guider. Moreover, the strong complementarity is demonstrated between offline information with static-image input and online information with ink-trajectory input by blending a fully convolutional networks based watcher into TAP. Inherently unlike traditional methods, this end-to-end framework does not require the explicit symbol segmentation and a predefined expression grammar for parsing (Fig. 13).

Fig. 13.
figure 13

Overall architecture of Track, Attend and Parse in Zhang’s approach [19, 20]. X denotes the input sequence, A denotes the annotation sequence, Y denotes the output sequence.

5 Conclusion

Through the efforts of researchers, recognition technology is becoming more and more mature. As mentioned in Sect. 2, mathematical expression recognition mainly includes three questions. These three tasks can be solved sequentially or jointly. The proposed solutions can be roughly divided into sequential solutions and integrated solutions. In addition, with recent advances in deep learning, several end-to-end deep-learning based systems were proposed for ME recognition.

In the past, due to technical and computational limitations, researchers performed separate calculations for each step. With the development of computing power and algorithms, more and more research attempts to solve problems from a global perspective. In the beginning, researchers attempted to apply much prior knowledge to the three steps of recognition models in order to obtain higher efficiency without a research foundation. This method makes it easier for people to understand the logic of recognition, and it is easier to summarize the reasons for the wrong results.

On the other hand, considering that prior knowledge depends on the experience of the authors, modern researches are more inclined to use of machine learning or deep learning algorithms to allow computers to learn recognition logic from samples in order to discover recognition logic that is not covered by prior knowledge. In addition, since the recognition of mathematical expression is also affected by contextual information, modern researches attempt to recognize the mathematical expression globally to improve the recognition efficiency. The continuous improvement of experimental results proves the advantages of this method, but it may be difficult for people to directly understand the recognition logic obtained by machine learning.

In the future, the work of mathematical expression recognition may include the following aspects. When completing the three steps independently, improve the level of segmentation algorithms and structural analysis, especially in terms of spatial operators and stroke adhesion whether printed or handwritten. Improve the reliability of structural analysis by using more reliable symbolic structural models. In the global recognitions, the algorithm can try to make full use of the context information, and tt can be continuously improved in network design and loss function. Depending on the full development of computing power, the application of deep learning for research may have good prospects. At the same time, how to interpret the results obtained by deep learning is also a potential problem. There are a variety of methods for identifying system integration, especially in terms of syntax and dynamic processing. In addition, more usage scenarios can be considered, such as mathematical expression recognition on the blackboard in teaching scenarios, or the conversion of voice input.