1 Introduction

Recognition of mathematical expressions (MEs) is currently a challenging pattern recognition problem. ME recognition is a process of converting offline or online MEs into an editable encoding format like (Lamport 1986), MathML (W3C 2010) etc. In offline recognition, MEs take the form of scanned images whereas for online, they are written using data tablets.

Several approaches are proposed in literature for the recognition of MEs and a recent survey on existing works is found in (Zanibbi & Blostein 2012). Some offline recognition systems are discussed in (Lee & Lee 1994; Lee & Wang 1997; Chaudhuri & Garain 2000; Suzuki et al 2003) and online systems in (Zanibbi et al 2002; Garain & Chaudhuri 2004; Vuong et al 2008). There is an imperative need to evaluate and compare the performance of different systems. It requires that the recognition output and ground truth in standard editable form be matched in an automated and generic manner. An obstacle to straight-forward matching is that the standard forms for representing MEs, such as and MathML, use extraneous tags and commands in addition to the regular mathematical symbols and units. For example, in , extraneous keywords or symbols like \begin{array} and \end{array} are used to encode matrices, enumerated functions, etc. In MathML also, extraneous tags or symbols like <mo>, <mi>, etc. are added to encode operators, identifiers, etc. These extraneous symbols are also involved in the matching process if standard forms are used. Performance evaluation can still be done by considering these additional symbols, but it is not intuitive.

In this paper, we have proposed an approach that allows performance evaluation to be done using standard forms and yet tackles the issues mentioned above. For that, we have introduced structure encoded strings which are analogous to the intermediate representation of compilers. Structure encoded strings retain the structure (spatial relationships like superscript, subscript, etc.) but do not contain any extraneous symbols. Recognition output and ground truth in form are converted into their corresponding structure encoded strings. Levenshtein edit distance (Richard et al 2000) between the structure encoded strings is then used as a measure for performance evaluation. As structure encoded strings retain the linear representation of MEs, standard Levenshtein edit distance can be applied. This approach is intuitive as extraneous symbols are not involved in the matching process.

The paper is organized as follows. Existing approaches for performance evaluation are discussed in section 2. Proposed approach to evaluate performance is presented in section 3. In section 4, experimental results are discussed, summary and discussion was provided in section 5 and conclusions are provided in section 6.

2 Related work

The complexity of performance evaluation task has been discussed in (Lapointe & Blostein 2009) and several issues such as levels of representation used for MEs, quality of performance metrics, etc. have been identified. Garain & Chaudhuri (2005) have reported a performance index that depends on their recognition procedure and it is not useful as different researchers use different recognition approaches. In (Lin et al 2012), the authors have focused on performance evaluation of mathematical formula identification in the documents. Chan & Yeung (2001) have proposed an integrated measure for symbol recognition and structural analysis. This integrated recognition rate is given by the ratio of the number of correctly recognized symbols and operators to the total number of symbols and operators. Jin et al (2004) have proposed a bottom-up and top-down performance evaluation method based on tree matching and they have presented triplet tree for ME representation. Some studies (Okamoto et al 2001; Ashida et al 2006) have measured accuracy manually by separating different structures like fraction, limit, root, etc. and the number of such structures that are recognized correctly is computed.

For the first time, (Sain et al 2011) have proposed an approach (called as EMERS approach) for an automated evaluation in a generic manner (using editable format). In their approach, the authors have adopted tree distance based algorithm (Zhang & Shasha 1989) to find distance between two MathML (tree-like) representations. As tree matching (Tai 1979) is a non-trivial problem in terms of implementation complexity, the authors have approximated MathML trees by their corresponding Euler strings (Aratsu et al 2009) and tree matching algorithm based on dynamic programming (Zhang & Shasha 1989) is used. Euler string (Aratsu et al 2009) is obtained by the depth-first traversal of MathML tree and in the traversal, if a node is encountered second time, complement of the node’s label is added. In addition, Sain et al (2011) have suggested that edit cost should take the level of symbol into consideration as symbols in the superscript and subscript are smaller in size than their base symbols and the current recognition engines cannot be expected to handle characters irrespective of sizes. So, they have designed the cost functions in terms of levels of tags in MathML tree. For example, let b i and \(a^{x^{i}}\) be recognized as bi and a xi, respectively. Both of them are erroneous. In the first case, the error occurs in the first level script whereas for the second case, in the second level script. It would be judicious to give more penalty to the first error than the second one. As dynamic programming based tree distance algorithm has been used, their algorithm complexity is O(n 4) (Sain et al 2011).

In EMERS approach, extraneous tags are also involved in matching that is not intuitive. Consider an example given in (Sain et al 2011): ∫ dx / x is wrongly recognized as 1dx / x. MathML representation for is <mo>#x222b;</mo> and for 1 is <mn>1</mn>. By intuition, it needs only one edit (substitute 1 with ) operation. But the edit operations are given by: Substitute <mn> with <mo> and Substitute 1 with #x222b;. The total edit cost is 2 as the extra tag <mn> is also involved in the matching.

Zanibbi et al (2011) have proposed performance metrics based on bipartite graphs at stroke level. In this approach, recognition output and ground truth interpretations are converted to bipartite graphs on which metrics based on hamming distances are defined. A competition (called as CROHME competition) has been conducted on the recognition of online handwritten mathematical expressions (CROHME 2011). Four research groups have submitted their ME recognition systems to the CROHME competition Mouchère et al 2011 and each of them is evaluated based on the following four aspects: stroke-level classification rate, symbol segmentation rate, symbol recognition rate and expression-level recognition rate. In this competition, some restrictions like only one symbol in superscript/subscript of function names, no recursive fraction, etc. have been imposed on the datasets. Recently, the second version of CROHME competition, CROHME2012 (Mouchère et al 2012) has been organized. In CROHME2012, seven systems are evaluated using augmented datasets that have less restrictions than those of the CROHME2011 competition.

In (Álvaro et al 2012), a method for unbaised evaluation has been proposed as different MathML structures can draw the same ME. In this method, a constrained parsing of an ME is performed to obtain a different MathML tree equivalent to the ground truth tree according to the model representation criteria. For evaluation, recognized tree and the obtained constrained parsed tree are compared using EMERS approach.

3 Proposed approach

Our approach for performance evaluation uses format for illustration. It works for MathML also. To eliminate extraneous symbols, we propose the use of structure encoded strings. Recognition output and ground truth in form are converted into their corresponding structure encoded strings. Levenshtein edit distance (Richard et al 2000) between the structure encoded strings is used as a measure for performance evaluation.

As suggested in (Sain et al 2011), we have also incorporated level information of symbols in the cost functions of Levenshtein edit distance measure. As Levenshtein edit distance is used, time complexity of the proposed approach is O(n 2) (Cormen et al 1989). Our approach is shown in the form of a block diagram in figure 1. Structure encoded strings and the algorithm for performance evaluation are discussed in the next sections.

Figure 1
figure 1

Proposed approach for performance evaluation.

3.1 Structure encoded strings

Mathematical symbols can have spatially related symbols in its surrounding six regions namely, top-left, above, top-right, bottom-left, below and bottom-right regions, which are shown in figure 2. But in most of the MEs, for any symbol, the entire top-left, above and top-right regions form a single subexpression and similarly, the entire bottom-left, below and bottom-right regions form a single subexpression. The combined top-left, above and top-right (bottom-left, below and bottom-right) regions is called as Northern (Southern) region (shown in figure 2).

Figure 2
figure 2

Possible spatial regions around a mathematical symbol M. Here, TL, A, TR, BL, B and BR refer to top-left, above, top-right, bottom-left, below and bottom-right regions, respectively. N and S refer to northern and southern regions, respectively.

For example, consider the following ME:

$$ \int\limits_{x-x_{0} < 0} f(x) \, dx. $$
(1)

Here, subscript expression (xx 0 < 0) of Integral symbol () spans its bottom-left, below and bottom-right regions (entire southern region). For FRACTION symbol, numerator and denominator present in its above and below regions and so they form its northern and southern regions, respectively. For SQUAREROOT symbol, its degree is considered to be in the northern region and the contained expression in the southern region. For other symbols, northern and southern regions give superscript and subscript expressions, respectively. That means, atmost two different subexpressions are present around any symbol in most of the MEs. Identity also called as label (a positive integer) information of a symbol can be used to resolve its northern and southern regions.

There may be unusual cases, where two different subexpressions are present in the same region like n C r . In n C r , C has two different subexpressions n (bottom-left) and r (bottom-right) in its southern region. There can also be rare cases where more than two different subexpressions are present around a symbol. In some other cases, symbols enclosed in the accent symbols (present in above or below region) may have subexpressions in the other regions. For example, in \(\hat {a}^{2}_{i}\), a is enclosed in the HAT accent symbol (above of a) and has superscript 2 (top-right subexpression) and subscript i (bottom-right subexpression). In this example, both HAT symbol and 2 are present in the northern region of a. All of these cases are considered in converting strings into structure encoded strings.

Generation of structure encoded strings: Structure encoded strings are generated by scanning strings from left to right. Let symbols in the structure encoded string be called as Structure encoded symbols. As symbols in or any other encoding forms are not atomic, a mapping table that maps symbols to structure encoded symbols is created. Each structure encoded symbol is given by a unique integer label. We have listed keywords of different categories namely, Roman letters, numerals, Greek letters, Sum-like operators, Binary operators, Log-like symbols, Delimiters, etc. in a sequence and these keywords are consecutively assigned labels (starting from 1). For example, \frac (fraction in ) has four symbols and is made atomic by mapping it to an integer label (say, 219). Similarly, \sqrt \(\left (\sqrt {\hspace {0.5cm}}\right .\) in \(\left .\vphantom {\sqrt {\hspace {0.5cm}}}\right )\) has five symbols and is given an integer label, 140. In the process of scanning string, keywords are captured and mapped to their corresponding labels using mapping table. Extraneous symbols are ignored while scanning and so this mapping not only creates atomic symbols but also eliminates extraneous symbols (that is similar to tokenization in compilers).

To maintain structure information, four special structure symbols are used to designate start and end of northern and southern subexpressions. That means, start structure symbol for northern (southern) region is used to designate start of northern (southern) subexpression. Similarly, end structure symbol for northern (southern) region is used to designate end of northern (southern) subexpression. Therefore, start and end structure symbols form the boundaries of northern and southern subexpressions so that these subexpressions can be unambiguously determined in the structure encoded string. Structure symbols also have unique labels, but do not refer to any mathematical symbol. Let N s and N e (S s and S e ) designate start and end structure symbols for northern (southern) subexpression, respectively.

Therefore, conversion is a simple mapping process that maps encoded symbols to the corresponding structure encoded symbols by scanning the input string. In the process of scanning, extraneous keywords or symbols are discarded as well as start and end structure symbols of northern and southern subexpressions for different symbols (like fraction, squareroot, etc.) are inserted recursively to store structure information.

For example, consider an ME: \(a^{i+1}_{2}\) and its string given by: a ̂{i+1}_2. In , ̂ and _ designate superscript (northern subexpression) and subscript (southern subexpression), respectively. Start and end structure symbols are added before and after scanning these subexpressions, recursively. Its structure encoded string is given by: a, N s , i, +, 1, N e , S s , 2, S e . As it is difficult to follow integer labels, here and in all the subsequent examples in the paper, we have shown symbols instead of their integer labels in the structure encoded string. It can be observed that N s and N e are written before and after superscript (i + 1) and similarly, S s and S e are written before and after subscript (2).

Consider a slightly unusual case where accented symbols have subexpressions. For example, \(\hat {a}^{2}_{i}\). Its string is given by: \hat{a}\(\hat {\hspace {0.1cm}}\)2_i. Here, HAT symbol is considered to be northern subexpression of a. Logically, other subexpressions are meant for the composite symbol (accent symbol along with its enclosed symbol), not for the symbol alone and hence they appear after the composite symbol in the structure encoded string representation. Continuing the example, super and subscripts, 2 and i form the northern and southern subexpressions of \(\hat {a}\) respectively and appear after it. Therefore, its structure encoded string is written as: a, N s , \(\hat {\hspace {0.1cm}}\), N e , N s , 2, N e , S s , i, S e .

Wide accent symbols can have enclosed expressions in their northern or southern regions. For wide accent symbols having enclosed expression in the northern region, subscript if any present, is considered to be in the southern region. For example, in \(\underbrace {a+b+\cdots +z}_{26}\), enclosed expression a + b + ⋯ + z is in the northern region of UNDERBRACE and the subscript 26 is in its southern region. Its structure encoded string representation is given by: UNDERBRACE, N s , a, + , b, + , CDOTS, + , z N e , S s , 2, 6, S e . Here, UNDERBRACE and CDOTS denote integer labels of underbrace and ellipsis ( ⋯ ), respectively. Similarly, for wide accent symbols with southern enclosed expression, superscript if any present, is considered to be in the northern region. For example, in \(\overbrace {n(n-1) \cdots (n-m+1)}^{\mbox {m factors}}\), enclosed expression n(n − 1) ⋯ (nm + 1) is in the southern region of OVERBRACE and superscript m factors is in its northern region.

Handling Multi-Line MEs: For Multi-Line MEs (MLMEs, defined in our earlier work (Pavan Kumar et al 2011)) like matrices, enumerated functions, etc. structure encoded strings are generated for all the elements (note that each element is again a ME) in all the rows in a row-major order. Special element and row delimiter symbols (which also have unique labels) are added after each element and row, respectively. Let D e and D r designate labels of the element and row delimiters of MLMEs respectively.

Consider the following MLME example (a determinant):

$$ \left| \begin{array}{cc} a_{0} & b_{0}\\ a_{1} & b_{1}\\ \end{array} \right| =0. $$
(2)

Its string is given by: \left∣ \begin{array}{cc} a_{0} & b_{0} \\ a_{1} & b_{1} \\ \end{array} \right∣ = 0. In , element and row delimiters are given by & and \\ , respectively and they are accordingly converted to D e and D r , respectively. Its structure encoded string is given by: ∣, a, S s , 0, S e , D e , b, S s , 0, S e , D r , a, S s , 1, S e , D e , b, S s , 1, S e , D r , ∣, =, 0. It has two rows with two elements in each row and the inserted delimiters, D e and D r . It can be observed that \begin{array} and \end{array} are removed in the structure encoded string.

Handling math fonts: Mathematical symbols can be typeset in different fonts namely, Calligraphic (\mathcal or {\cal }), Blackboard Bold (\mathbb), Fraktur (\mathfrak), etc. For example, Calligraphic form of A is given by {\cal A} (\(\mathcal {A}\)). To handle math fonts, we have defined start and end special symbols (that have unique labels) for each font. These start and end font symbols enclose the actual symbol. Let C s and C e denote integer labels of start and end Calligraphic font symbols, respectively. Structure encoded string representation for the above example is given by: C s , A, C e .

Consider an example: \(\mathcal {A} \setminus \mathcal {B}\). Its string is given by: {\cal A} \setminus {\cal B}. Its structure encoded string is given by: C s , A, C e , \ , C s , B, C e .

Completeness: As structure encoded strings need to handle all types of symbols including rare cases, we have introduced two special symbols, EMPTY and STACK. EMPTY and STACK are dummy symbols that also have unique labels and are used to handle unusual and rare cases as discussed below:

  1. (i)

    Consider an unusual case where two different subexpressions are present in the same (northern or southern) region for a given symbol. In this case, an EMPTY symbol is added and one of the subexpressions of the given symbol is handled by it. Consider n C r for example. Here, pre-subscript n is considered the southern subexpression of EMPTY symbol and r is the southern subexpression of C. That is, EMPTY symbol is written first followed by its southern subexpression, followed by C and its southern subexpression. Structure encoded string for this example is thus given by: EMPTY, S s , n, S e , C, S s , r, S e .

  2. (ii)

    Consider rare cases with more than two subexpressions around a symbol. Such symbols do not occur almost in any ME, but can generate them. For such cases also, EMPTY symbols are added to handle the extra subexpressions as discussed below.

    • Consider \({}^{p}_{q} \sum ^{s}_{r}\). Here, ∑ has four different subexpressions around it. Here, one EMPTY symbol is added to handle pre-super and pre-subscripts (top-left and bottom-left). That means, p and q are considered as the northern and southern subexpressions for the EMP symbol and the super and subscripts, s and r, are considered for ∑ symbol. Its structure encoded string is given by: EMPTY, N s , p, N e , S s , q, S e , ∑, N s , s, N e , S s , r, S e .

    • Consider \( \sideset {^{a}_{b}}{_{c}^{d}}\prod\limits _{i=1}^{n}\). Here, ∏ has six different subexpressions around it. In this case, two EMPTY symbols are added, one to handle top-left and bottom-left subexpressions and the other one to handle above and below subexpressions. Actual symbol ∏ handles the remaining two subexpressions. Its structure encoded string is given by: EMPTY, N s , a, N e , S s , b, S e , EMPTY, N s , n, N e , S s , i, =, 1, S e , ∏, N s , d, N e , S s , c, S e .

  3. (iii)

    There can be rare cases where two or more symbols are vertically stacked one over another. For example, in \(\overset {\alpha }{\omega }\), α is above ω and in \(\underset {\mu }{\nu }\), μ is below ν. To handle these cases, we have introduced start and end STACK symbols that are used to enclose the actual stacked symbols. Let T s and T e denote integer labels of start and end STACK symbols, respectively.

In , keywords \overset, \underset, \stackrel, etc. are used to generate vertically stacked symbols (like <mover> and <munder> in MathML). In \overset {x}{y} or \stackrel {x}{y}, x is above (northern region) y (whose structure encoded string can be written as y, N s , x, N e ). As y and x are vertically stacked, they are enclosed between T s and T e . Its structure encoded string representation is thus given by: T s , y, N s , x, N e , T e . Similarly, in \underset {x}{y}, x is below (southern region) y and its structure encoded string is given by: T s , y, S s , x, S e , T e .

Consider a complex example: \(\overset {\beta }{\underset {\Delta }{\tau }}\). Its string is given by: \overset{\beta}{\underset {\Delta }{\tau}}. Here, β is above (northern region) \(\underset {\Delta }{\tau }\) and in \(\underset {\Delta }{\tau }\), Δ is below (southern region) τ. Structure encoded string for \(\underset {\Delta }{\tau }\) is given by: T s , τ, S s , Δ, S e , T e and for the entire expression is thus given by: T s , T s , τ, S s , Δ, S e , T e , N s , β, N e , T e .

Uniqueness: For performance evaluation, structure encoded strings should give a unique representation. That is, if two structure encoded strings are the same, both should refer to the same ME. Structure encoded strings give a unique representation due to the following reasons:

  1. (i)

    Symbols in an expression or any subexpression (northern or southern) are ordered in a left to right manner.

  2. (ii)

    Northern and southern subexpressions for any symbol are enclosed between their corresponding start and end structure symbols recursively.

  3. (iii)

    Southern subexpression is always handled after the northern subexpression (if present).

  4. (iv)

    EMPTY and STACK symbols are used to handle rare cases.

Length: As structure symbols are added, the length of structure encoded string is more than the number of symbols in the given ME. Let n be the number of symbols in the given ME. Let n 1 be the number of symbols (out of n) having only one of the northern or southern subexpressions and n 2 be the number of symbols (out of n) having both the subexpressions. Two structure symbols (start and end of one of the subexpressions) are added for each of the n 1symbols and for each of the n 2symbols, four structure symbols (start and end of both the subexpressions) are added. Therefore, total number of structure symbols is 2n 1 + 4n 2 and the total length of the string is given by n + 2n 1 + 4n 2. As n 1 < n and n 2 < n, total length of structure encoded string is O(n).

Levels: Levels give nested depths of the symbols in an ME. In the conversion process, levels of the symbols in the structure encoded string are also found. Symbols of any expression (or subexpression) which are not present in the northern or southern region of any symbol are called its Baseline symbols. All the baseline symbols of a ME are at the same level ‘0’. For example, a is the baseline symbol of ME \(a^{i^{2}_{j}}\) and is at level 0. Any symbol present in the northern or southern regions of a symbol (called a base symbol) is one level greater than that of the base symbol. i is the baseline symbol of the northern subexpression \(i^{2}_{j}\), of a and so is at level 1. 2 and j are the northern and southern subexpressions of i, respectively and are at level 2.

Levels of start and end structure symbols of northern and southern subexpressions are assigned the same values as those of the baseline symbols of the subexpressions. For example, structure encoded string of a 2 is: a, N s , 2, N e . Here, 2 is the baseline symbol of northern subexpression of a and is at level 1. Levels of N s and N e are assigned the same value 1. Similarly, levels of element and row delimiters of MLMEs are assigned the same values as those of baseline symbols of the elements.

Intuitiveness: In structure encoded string representation, as northern and southern regions are considered standalone, it appears to be non-intuitive. But in most of the cases, mathematical symbols have atmost two arguments (numerator/denominator for fraction, degree/contained expression for squareroot, super/subscripts for others, etc.). Hence proposed approach that considers standalone northern and southern subexpressions is not far from intuition.

If sub-regions are considered separately so that each of the them has its corresponding start and end structure symbols, uniqueness property may be violated. For example, both \(\sum _{i}\) and \(\sum\limits_{i}\) represent the same ME (as i is the lower limit of ∑ semantically). In the first ME, i is in the bottom-right region whereas in the second one it is in the below region. Let B R s and B R e (B s and B e ) denote start and end structure symbols for bottom-right (below) region. Structure encoded strings for the above expressions if sub-regions are considered separately are respectively given by: ∑, B R s , i, B R e and ∑, B s , i, B e (both are different). If southern subexpression is considered standalone, both the above MEs are given by the same structure encoded string: ∑, S s , i, S e .

In view of the above reasons, northern and southern regions are considered standalone, and EMPTY and STACK symbols are used for rare cases in our structure encoded string representation. If tight performance evaluation (that gives more importance to spatial locations than semantics) is needed, sub-regions have to be considered separately. Tight evaluation is also feasible in our representation by defining and using start and end structure symbols for all the sub-regions.

Generality: The idea of the structure encoded string can be applicable to other similar structures like Indian language scripts and chemical equations. Indian language scripts have vowel and consonant modifiers in the northern and southern regions of base consonants, respectively. Hence proposed approach can be used for performance evaluation of the OCR systems for Indic scripts. Similarly, chemical equations have ionic information (like oxidation state) and the number of instances of an atom in the northern and southern regions, respectively. Hence matching of chemical structures can be performed using the proposed approach.

3.2 Algorithm for performance evaluation

As mentioned earlier, proposed approach for performance evaluation takes strings of the recognition output and ground truth and converts them into their corresponding structure encoded strings. Levenshtein edit distance (Richard et al 2000) between these two strings is then used as a measure for performance evaluation. All types of errors such as recognition, structural errors; etc. can be handled using edit distance on structure encoded strings. For example, if a 2 + b 2 is recognized as a 2 t b 2, it is a recognition error where + is misrecognized as t. Similarly, if it is recognized as a2 + b 2, superscript relationship between a and 2 is lost, and it is a structural error.

Levenshtein edit distance (Richard et al 2000) between two strings gives the minimum number of edit operations (substitution, insertion and deletion) required to transform one string to the other. The dynamic programming approach (Cormen et al 1989) uses a matrix to hold partial distances between all prefixes of the first string and all prefixes of the second. The final distance between the given two strings is computed using it. The time complexity of the dynamic programming approach to computing Levenshtein Edit Distance is O(n 2) (Cormen et al 1989). In terms of time complexity, this string edit distance is better than the tree edit distance (which is O(n 4)).

Let G and R be structure encoded strings of the ground truth and recognition output respectively and let p and q be their respective lengths. Let G[i] and R[i] be the i th elements of the strings G and R, respectively. Let E be the dynamic programming matrix, which is of size (p + 1) × (q + 1). Each cell in E, E[i, j], contains the edit distance by considering the first i elements in G and the first j elements in R and E[p, q] gives the final edit cost. Distance between empty strings, the trivial case, is zero. This implies, E[0, 0] = 0. Each cell E[i, j], is computed as follows:

$$ E[i,0] = E[i-1,0] + I(i) $$
(3)
$$ E[0,j] = E[0,j-1] + D(j) $$
(4)
$$\begin{array}{@{}rcl@{}} E(i,j) = \min \left\{ \begin{array}{c} E(i-1,j-1) + S(i,j) \\ E(i-1,j) + I(i)\quad.\\ E(i,j-1) + D(j) \end{array}\right. \end{array} $$
(5)

Here, S(i, j) denotes the cost of substituting R[j] with G[i]. I(i) denotes the cost of inserting G[i] and D(j) denotes the cost of deleting R[j]. We have used the same cost function as given in (Sain et al 2011), which is \(1/(l+1)\) where l is the level of a symbol. As the level of a symbol increases, its cost decreases and this cost function penalizes errors in symbols more than their nested (northern and southern) ones.

If all the errors need to be given same cost, irrespective of levels, it is still possible by setting l to 0 in the cost function which gives a cost of 1 to all the errors. Cost functions that are designed in terms of levels of structure encoded symbols are given by equations (6), (7) and (8). L(X) denotes the level of symbol X. For substitution cost, level of ground truth symbol is used as it is being substituted. For start and end structure symbols (as well as for start and end font and STACK symbols), only half of the computed costs are assigned as each start and end pair constitute a single (northern or southern) region.

$$S(i,j) = \left\{\begin{array}{cc} 0 \quad \text{ if } G[i] = R[j]\\ \frac{1}{L(G[i])+1} \quad \text{ otherwise} \end{array}\right. $$
(6)
$$I(i) = \frac{1}{L(G[i])+1} $$
(7)
$$D(j) = \frac{1}{L(R[j])+1} . $$
(8)

Consider the example discussed in section 2. strings of the ground truth and recognition output are given by: \int \frac{dx}{x} and 1 \frac{dx}{x}. Using proposed approach, structure encoded strings of ground truth and recognition output are respectively given by: , FRACTION, N s , d, x, N e , S s , x, S e and 1, FRACTION, N s , d, x, N e , S s , x, S e . FRACTION denotes the integer label for \frac. Levels of structure encoded symbols of ground truth in order are: 0, 0, 1, 1, 1, 1, 1, 1, 1 and those of structure encoded symbols of recognition output are: 0, 0, 1, 1, 1, 1, 1, 1, 1.

Hence using the proposed approach, only one edit operation is needed: Substitute 1 with ∫ at position 1 at level 0 with cost 1. The total edit cost is 1 which is as per intuition. Here, positions (start from 1) refer to structure encoded string of the recognition output as edit operations are assumed to be performed on the recognition output.

In our implementation, the edit operations can have further interpretations. For example, the edit operation like insertion of structure symbol N s at position x can be interpreted in the following manner: Subsequent symbols (after the inserted symbol N s ) form the northern subexpression of the symbol at position, x − 1. Northern and southern sub-expressions for different symbols are also accordingly interpreted as discussed in section 3.1. For instance, northern sub-expression of FRACTION symbol is interpreted as its numerator while that of SQUAREROOT symbol is interpreted as its degree.

It is to be noted that edit distance is not normalized between 0 and 1. As edit distance gives an estimate of the minimum number of edits to correct an ME, it reflects the correction efforts of a proof reader. Theoretically, its value ranges from 0 to ∞. If there are no edits (perfect recognition), edit distance is zero. More number of edits results in a greater edit distance. Hence the lesser the value of edit distance, the lesser the correction effort and the more efficient the ME recognition system. Hence edit distance values can be used to relatively compare correction efforts for different systems and so their normalization is not needed.

4 Experimental results and discussion

In this section, some experimental results are presented. As EMERS approach (Sain et al 2011) is the only one in the literature where performance evaluation is done using standard forms in a generic manner, proposed approach is compared with this approach. For comparison, we have used MEs from the publicly available Infty database (Infty 2009).

One detailed comparative example is given and table 1 summarizes results on more number of MEs from Infty database. All the recognition outputs are obtained using InftyReader (Suzuki et al 2003), an OCR system for MEs (that is also publicly available). Both the proposed and EMERS approaches are compared under similar conditions. For both of them, edit operations are assumed to be performed on the recognition output and the notion of level as well as the cost functions (defined in terms of levels) are same. Consider the following example.

Table 1 Edit costs of EMERS and Proposed approaches on different types of MEs from Infty database (Infty 2009). Recognition outputs are obtained using InftyReader (Suzuki et al 2003).

Ground truth: \(j \in P = \cup _{-1 \leq i \leq 2^{s-1}}P(k;i)\)

Recognition output: j 𝜖 P = ∪ − 1 ≤ i ≤ 2s − 1 P(k; i)

Here, ∈ is misrecognized as 𝜖 (symbol recognition error) and superscript relationship between 2 and s − 1 is lost (structural error).

4.1 EMERS approach

MathML for ground truth:

  • 1. <math> <mi>j</mi> <mo>#x03f5</mo>

  • 2. <mi>P</mi> <mo>=</mo> <msub>

  • 3. <mo>#x222a</mo> <mrow> <mo>-</mo>

  • 4. <mn>1</mn> <mo>#x2264</mo> <mi>i</mi>

  • 5. <mo>#x2264</mo> <msup> <mn>2</mn>

  • 6. <mrow> <mi>s</mi> <mo>-</mo>

  • 7. <mn>1</mn> </mrow> </msup> </mrow>

  • 8. </msub> <mi>P</mi> <mo maxsize="1">

  • 9. (</mo> <mi>k</mi> <mi>;</mi>

  • 10. <mi>i</mi> <mo maxsize="1">)</mo> </math>

MathML for recognition output:

  • 1. <math> <mi>j</mi> <mi>#x03b5</mi>

  • 2. <mi>P</mi> <mo>=</mo> <msub>

  • 3. <mo>#x222a</mo> <mrow> <mo>-</mo>

  • 4. <mn>1</mn> <mo>#x2264</mo> <mi>i</mi>

  • 5. <mo>#x2264</mo> <mn>2</mn> <mi>s</mi>

  • 6. <mo>-</mo> <mn>1</mn> </mrow> </msub>

  • 7. <mi>P</mi> <mo maxsize="1">(</mo>

  • 8. <mi>k</mi> <mi>;</mi> <mi>i</mi>

  • 9. <mo maxsize="1">)</mo> </math>

Edit operations (costs are shown upto two decimal points):

  1. 1

    Substitute <mi> and </mi> (that encloses #x03b5 (𝜖)) with <mo> and </mo> at level 0 with cost 1 (in line 1).

  2. 2

    Substitute #x03b5 with #x03f5 (𝜖 with ∈ ) at level 0 with cost 1 (in line 1).

  3. 3

    Insert <msup> before <mn> 2 </mn> (in line 5) and </msup> after <mn> 1 </mn> (in line 6) at level 2 with cost 0.33.

  4. 4

    Insert <mrow> before <mn> s </mn> (in line 5) and </mrow> after <mn> 1 </mn> (in line 6) at level 2 with cost 0.33.

    Total cost of edit operations = 2.66.

Here, line numbers refer to MathML of recognition output as edit operations are assumed to be performed on recognition output. First two edit operations correct the recognition error and #x03b5 and #x03f5, denote MathML symbols for 𝜖 and ∈ respectively. First edit operation considers substitution of the extraneous tags <mo> and </mo> , and so an additional cost of 1 is incurred which is not intuitive. Last two insert operations correct the structural error. If a super or subscript expression has more than one symbol, it should be enclosed between <mrow> and </mrow> in MathML. As such, last edit operation considers insertion of these extraneous tags with an additional cost of 0.33. It can be seen that edit operations on the opening and closing tags (like <mi> and </mi>, <msup> and </msup> etc.) are combined in this approach (Sain et al 2011).

4.2 Proposed approach

for Ground Truth: j \in P = \cup_{-1 \leq i \leq 2 \(\hat {\hspace {0.01cm}}\) {s-1}}P(k;i).

for Recognition output: j \epsilon P = \cup_{-1 \leq i \leq 2s-1}P(k;i).

Structure encoded string for Ground Truth: j, ∈ , P, =, ∪ , S s , -, 1, ≤ , i, ≤ , 2, N s , s, -, 1, N e , S e , P (k; i).

Levels of structure encoded symbols of Ground Truth: 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0, 0.

Structure encoded string for Recognition output: j, 𝜖, P, =, ∪ , S s , -, 1, ≤ , i, ≤ , 2, s, -, 1, S e , P (k; i).

Levels of structure encoded symbols of Recognition output: 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0.

Edit operations (costs are shown upto two decimal points):

  1. 1.

    Substitute 𝜖 with ∈ at position 2 at level 0 with cost 1.00.

  2. 2.

    Insert N s at position 13 (before s) at level 2 with cost 0.17.

  3. 3.

    Insert N e at position 16 (after 1) at level 2 with cost 0.17.

  4. 4.

    Total cost of edit operations = 1.34.

Here, positions (start from 1) refer to structure encoded string of the recognition output. First substitute operation corrects the symbol recognition error and the next two insert (of structure symbols) operations correct the structural error. As mentioned earlier, cost of the insertion operations for each of the structure symbols, N s and N e , is 0.17 (half of 0.33, obtained using equation (7)). As extraneous symbols are not involved, edit cost is reduced to 1.34.

If level information is considered, it is to be noted that half of the edit costs are considered for each of the start and end structure symbols. Hence the edit costs for the start and end structure symbols as well as for the opening and closing MathML tags, at a given level are equal.

Ignoring level information (setting L(x) = 0 ∀x, so that a cost of one is assigned to all the errors), edit costs for EMERS and the proposed approaches are given by 7 and 3 respectively. In our approach, edit operations on start and end structure symbols are considered separately unlike EMERS approach. To facilitate comparison, we have treated the operations on opening and closing tags separately due to which an edit cost of 7 is obtained for EMERS approach.

Our approach is compared with EMERS approach by taking more examples that cover different fields like Algebra, Calculus, Geometry etc. from Infty database. Edit costs for these examples using both the approaches, are shown in table 1.

5 Summary and discussion

It can be observed from table 1 that EMERS approach shows higher edit cost values than for ours for most of the MEs as extraneous tags are also involved in the EMERS approach. Both EMERS and the proposed approaches give same cost for the last five entries in table 1. In these examples, extraneous tags are not involved in EMERS approach and the cases where this phenomenon occurs, are discussed below.

  1. (i)

    Symbol mis-recognition occurs within the same class like operators, identifiers, numerals, etc. For example, if j is misrecognized as J, their tags ( <mi> ) are matched as both are identifiers. Identifier tag matching can be seen for the entries 22 and 23, in table 1. In 24th entry of the table, operator ) is misrecognized as 〉 and in 25th entry, operator ≃ is misrecognized as =. For these two, operator ( <mo> ) tags are matched.

  2. (ii)

    Superscript (subscript) relationship is lost and the superscript (subscript) has a single symbol. In this case, <mrow> tags are not needed to group symbols in the super/subscripts. This case is shown in the 26th entry of table 1 (where subscript relationship between ∪ and ψ is lost).

In other cases, EMERS approach shows higher edit costs. EMERS approach can be modified to avoid contribution of the extraneous tags to the total edit cost by assigning zero costs to the edit operations involving these tags. But still, as tree edit distance is used, its time complexity is O(n 4) (Sain et al 2011). Differences between EMERS and the proposed approaches are highlighted in table 2.

Table 2 EMERS vs proposed approaches.

6 Conclusions and future work

In this paper, we have addressed the problem of an automated performance evaluation of ME recognition using standard forms like . For this, we have proposed structure encoded strings that retain the structure and eliminate extraneous symbols. Recognition output and ground truth in format are converted to structure encoded strings and Levenshtein edit distance is used to find distance between them. This approach is intuitive as only required symbols are involved in the matching process. Cost functions of edit distance are designed in terms of levels of structure encoded symbols. Length of structure encoded string is O(n) and as string edit distance is used, time complexity is O(n 2) which is better than O(n 4) for the earlier EMERS approach. Although it is illustrated using in this paper, note that it is a general approach and can be used for MathML or any other encodings. It is also applicable to other domains like chemical structures and OCR systems of Indic scripts.As structure encoded string representation is unique and complete, in future, it can be incorporated into ME recognition systems so that structure encoded string representation can be generated instead of , MathML, etc.