1 Introduction

An idea is the origin point of innovation and research. The key initiative towards realizing these ideas is to build and enhance cost-effective and efficacious methods for converting documented knowledge into a corresponding electronic format that can be processed by computers and distributed over the Internet. With the rapid expansion of internet users in recent years, the dissemination and exchange of information is increasingly done via digital mediums [38]. The categorization and recognition of mathematical expressions (MEs) have become a fascinating and stimulating study area of pattern recognition with endless real-world ramifications, because MEs represent an essential part of the engineering and scientific literature [1]. In addition, recognizing handwritten mathematical expressions (HMEs) is a difficult classification task that requires real-time identification of all the symbols in the input as well as the intricate 2D relationships that exist between subexpressions and symbols [119]. These subexpressions can be nested and contain Greek and Latin letters, special symbols, and characters. The task of HME recognition becomes even more onerous because it involves complicated sub-tasks such as structure analysis, symbol recognition, and interpretation of the context of MEs, all of which demand additional computational effort. Therefore, recognizing MEs is unquestionably a difficult and laborious task, particularly when attempting to recognize them from a handwritten source of information.

1.1 Review gaps and importance

  a. Although researchers have carried out plenty of work in the arena of recognizing MEs and symbols, there is a need to systematically collect, compile, and consolidate the recent works in this field. Other reviews [98] and surveys [221] present the works on HME well, yet none of them systematically reports the studies or judiciously covers all significant empirical instances of the literature available on MEs.

  b. Unlike traditional review methods, which present past works by producing a summarized form of several studies in the concerned area, the objective of this SLR is to provide as complete a list as possible of all studies related to the subject area, viewed from different research aspects.

  c. To the best of our knowledge, no systematic review focuses on the extensive identification and classification of techniques used for ME recognition. This is the first-ever SLR that aims to be as fair as possible by being auditable [79] and providing transparency to researchers about what is left unexplored and unmined in this area.

  d. The entire research methodology used for drafting this study has been presented in detail. Every step involved in the review process has been kept transparent to the readers; for instance, data collection, data selection, and answer extraction design have each been vividly depicted.

  e. The uniqueness of the study lies in the research methodology involved. Apart from strictly adhering to the SLR guidelines, the authors have endeavored to experiment creatively with inter-disciplinary concepts during data synthesis and answer design.

  f. The intended effect of the study is to establish a synthesis of the research questions through the use of grounded theory, a qualitative research methodology. This method has been deployed for the first time in conjunction with a systematic literature analysis.

  g. In addition, the sub-processes that make up the recognition process have been dissected in great detail.

A better approach to summarizing the studies and research performed over a period of time, in order to direct researchers and future aspirants interested in the topic, is a real need of the hour. So, this review is planned, conducted, and reported to broadly specify which techniques outperform the rest and where there is a genuine requirement for implementing a new method for recognizing MEs.

1.2 Research objectives

This paper conscientiously reviews all the studies published between January 2000 and June 2021. To perform the review, the authors have collated several techniques that belong to multiple computer science domains, such as computer vision, digital image processing, and artificial intelligence. This SLR endeavours to summarise, analyse, and methodically assess the empirical evidence regarding:

  a) Identification of the methods and techniques used in the recognition of HMEs.

  b) Extraction of the kinds of HMEs used in recognition studies.

  c) Listing of the datasets most frequently used in the research and for empirical analysis.

  d) Analysis of the accuracy measures and evaluation of the accuracy values of the several methods used.

  e) Focus on the pure or hybrid techniques implemented in the sub-processes, and analysis of the performance and capability of the applied techniques for recognizing HMEs.

  f) Comparison of the performance accuracy of contrasting ML techniques to establish which method outperforms the other techniques belonging to the same header.

  g) Analysis of the actualization and capability of the applied techniques for recognizing HMEs, i.e., ML versus non-ML or conventional statistical methods.

  h) Summary of the set of journals publishing the research on this stimulating research area.

1.3 Motivation

The several motivating factors for carrying out research in this domain are listed in Table 1.

Table 1 Motivation factors

Apart from the listed factors, the primary factor that motivated the writing of this review is the lack of a systematic survey that compiles and extracts all the pivotal attributes and determinants, such as recognition models, datasets, sub-processes, performance metrics, and other metadata analysis, while keeping the selection and synthesis process transparent at every stage. This facilitates better readership and provides crisper insights from the bulk of literature present on MEs.

1.4 Focus of the study

The focus of the study is on the following points:

  • To acquire a deeper understanding of the evolving domain of ME recognition.

  • To provide a reference framework for future research projects by identifying gaps in the domain of mathematical expression recognition.

  • To conclude and formulate facts from the metadata and evidence present in the literature.

Thus, the primary focus of this SLR is to provide a comprehensive and unbiased analysis based on the evidence captured from the literature and all the metadata-related information extracted in this process. The facts and findings of the study will direct future research in the domain of ME recognition.

1.5 Research questions

The research questions addressed by this study are tabulated in Table 2.

Table 2 Research Questions

1.6 Paper organization

The organization of the rest of this paper is as follows. Section 2 briefs about all the critical questions that a beginner needs answered to understand the domain of HMER. In Section 3, the entire research methodology is presented. Section 4 presents the statistical analysis of the HMER-related studies. Section 5 contains the results and discussions around the formulated research questions. The subsequent sections discuss a variety of results and highlight the facts extracted from the findings under the heading Summary and Findings, present the limitations, and state the conclusions drawn from this study. All references, bibliometric content, and appendices are provided at the end. To enhance the readability of the study, the entire roadmap of the paper organization is presented in Fig. 1.

Fig. 1
figure 1

Roadmap for Paper Organization

2 Background

The earliest research work on MEs dates back to 1965 [15], where Anderson worked on syntax-directed recognition of printed 2D MEs. With the advancement of technology, the interest of researchers shifted from printed MEs to handwritten mathematical symbols and expressions. The idea is to recognize HMEs and handwritten equations from scanned images or pen-based computing technologies such as electronic tablets [38], digital pens, and other gadgets. Before embarking on the review process, the authors have tried to briefly answer all possible prerequisite questions needed to understand the objective of the research topic: what HMEs are, their types, why recognition is necessary, the types of recognition, the stepwise recognition procedure, the challenges involved, the types of inputs to a mathematical expression recognition system, and the problems related to math recognition systems.

2.1 Defining handwritten mathematical expressions

The term “mathematical expressions” (MEs) refers to a finite set of symbols ordered in accordance with a rule or experimental study connected to some context, most commonly science and research. These MEs consist of symbols such as numerals, operators, constants, variables, functions, parentheses, special characters, and letters (Greek and Latin) arranged in a well-formed order in accordance with formal propositional norms. As a result, MEs make use of symbols, letters, and notations to deliberately depict various mathematical, scientific, and technological laws or formulas. In addition, an ME is not simply a collection of symbols arranged in a random fashion; rather, it possesses a well-organized structure that is subject to the rules of the system of mathematical notation [188]. These MEs, when considered in handwritten form, constitute the HMEs.
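To make the notion of a "well-formed order" concrete, the following minimal sketch (our own illustration, not drawn from any reviewed study) checks a toy linear grammar of operands, binary operators, and balanced parentheses; real recognition systems, of course, must handle far richer two-dimensional structure.

```python
# Minimal illustrative sketch: a toy well-formedness check showing that an ME is not
# an arbitrary symbol sequence but must follow formal notational rules.
import re

TOKEN = re.compile(r"\d+|[a-zA-Z]|[+\-*/^()=]")

def is_well_formed(expr: str) -> bool:
    """Accept simple linear MEs of the form operand (operator operand)*,
    with balanced parentheses; reject randomly ordered symbol strings."""
    compact = expr.replace(" ", "")
    tokens = TOKEN.findall(compact)
    if "".join(tokens) != compact:
        return False                      # unknown symbol present
    depth, expect_operand = 0, True
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0 or expect_operand:
                return False
        elif tok in "+-*/^=":
            if expect_operand:
                return False              # two operators in a row
            expect_operand = True
        else:                             # number or variable
            if not expect_operand:
                return False              # two operands in a row (no implicit operators in this toy)
            expect_operand = False
    return depth == 0 and not expect_operand

print(is_well_formed("2*(x+1)=y"))        # True  -> well-formed
print(is_well_formed(")x++2("))           # False -> random symbol arrangement
```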

2.2 Types of handwritten mathematical expressions

When handwritten, MEs can be categorized as offline or online based on the input mode. According to [145], handwritten data must be converted to digital form through different modes, such as scanning or writing with special pens on an electronic surface, for example a digitizer combined with a liquid crystal display. In the former case, the writing is examined on paper and constitutes offline handwriting; writing produced by a finger or a digital pen on an electronic device such as a tablet comprises online handwriting. Accordingly, HMEs are divided into the sets of offline and online HMEs. The differences between online and offline HMEs are tabulated in Table 3.

Table 3 Comparative Analysis of types of Handwritten Mathematical Expression

2.3 Characteristics of mathematical expressions

MEs integrate writing letters with drawing a variety of signs and related symbols. The following are the traits of the MEs, according to [172]:

  • Two-Dimensional in nature: Although they frequently convey implicit meanings, the two-dimensional relationships between symbols are crucial. The two-dimensional relationships of mathematical expressions are examined using TeX's math environment commands and the current MathML standard; the main layout classes are listed below (a brief LaTeX sketch of these layouts is given after this list).

  • Inline (Example: 2x)

  • Subscript (Example: x₂)

  • Superscript (Example: x²)

  • Prescript (Example: yF)

  • Enclosed (Example: √x)

The entirety of the MEs is comprised of symbols that exhibit two-dimensional relationships. A fundamental understanding of mathematics is essential in identifying and comprehending two-dimensional connections.

  • Implicit Semantics: Certain symbols and letters possess inherent semantic meaning. The interpretation of expressions with implicit operators heavily relies on the identification of the symbols involved. Two examples are f(y-1) and x(y-1). In the f(y-1) scenario, it is generally accepted that f denotes a mathematical function, while (y-1) serves as the input or argument of the declared function. In the x(y-1) scenario, x is not being used as a function; rather, an implicit multiplication operator is assumed to exist between x and (y-1). Thus, the symbols used in equations carry different semantics in different contexts of use.

  • Arbitrary Associations: Considering that there are numerous potential two-dimensional links between symbols and expressions, only a few associations are permitted depending on the nature of the symbols and the expressions themselves. Mathematicians who are well-versed in the subject are aware of the proper associations and ultimately put an interpretation on the terms. Whether implicit or not, operators have the ability to link symbols and subexpressions. Examples of the several types of operators are linear prefix (−x), infix (x + y), postfix (x!), bounding ([,]), vertical (x over y, as in a fraction), implicit (x², 2x), tabular (matrices), and enclosing (√x). Operators can be further divided into unary (x!), binary (x²), and N-ary (x + y + z) operators, since some operators accept only a particular number of arguments. The rules set a limit on the number of arguments and the positions of those arguments. Prescripts, for instance, are so uncommon in general that only certain symbols might be associated with them.

  • Conventional Dependency: Conventions regulate how mathematical symbols are used correctly. The conventions to be followed depend on the mathematical specialty and the text's point of origin, and the conventions of several branches of mathematics may be followed. As a result, different fields may have different ways of expressing the same mathematical concept. For instance, the imaginary number √−1 is denoted by the letter "i" in calculus texts and "j" in engineering. Different countries may also adopt different conventions. These conventions and the 2D relationships between symbols are the primary cause of the inherent ambiguities in math expressions.

  • Variant Scales: The different scales of mathematical symbols refer to the sizes of the symbols and special operators employed in mathematical notation. The varying scales of handwritten math symbols used in the creation of HMEs are one of the fundamental qualities that contribute to ambiguity and issues in expression recognition. The scale of a symbol alters its semantics and ultimately differentiates the meaning of the expression.
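To make these layout relationships concrete, the brief LaTeX fragment below (our own illustrative sketch, not taken from the reviewed studies) encodes the layout classes mentioned above; MathML provides analogous structural markup.

```latex
% Illustrative LaTeX encodings of the layout classes discussed above (requires amsmath)
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Inline:      $2x$                      \par
Subscript:   $x_{2}$                   \par
Superscript: $x^{2}$                   \par
Prescript:   ${}_{y}F$                 \par
Enclosed:    $\sqrt{x}$                \par
Vertical:    $\frac{x}{y}$             \par
Tabular:     $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$
\end{document}
```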

2.4 Challenges caused by the inherent properties of ME

Several challenges are encountered while recognizing mathematical symbols and expressions. These challenges are numerous enough that a substantial list of them can be maintained, and the reason for such an ample number of challenges is the inherent properties of the symbols and expressions. HME recognition is also challenging due to the various writing styles and ME formats [198]. The authors have tried to list the challenges belonging to the recognition process of HMEs, along with the properties that cause these challenges. One of the primary reasons for these challenges is the two-dimensional structure and the spatial relationships among the symbols used in MEs. Table 4 adds clarity about the challenges residing in recognizing handwritten mathematical symbols and expressions due to their inherent properties, which are majorly responsible for ambiguity during recognition.

Table 4 List of challenges caused due to properties of HME

2.4.1 Other challenges and CROHME

One of the particular challenges involved in the recognition of HMEs is the lack of sufficient training and testing data, especially for academic recognition systems [111]. Due to the lack of a sizeable common dataset of online or offline HMEs usable in the recognition process, there was essentially no central benchmark available for comparing the research works of different researchers. This non-availability of an open dataset of HMEs forced researchers to develop their own datasets of MEs, consisting either of images of expressions handwritten by several writers or volunteers, or sometimes of a corpus of expressions recovered from the prior work of Raman [152] and from the mathematical expression base of Aster. These collected datasets had a limitation, as they tended to cover only a subset of math expressions or expressions limited to a specific domain of mathematics. Thus, it was impossible to compare the performance accuracy of different systems, as no standard evaluation measure was available until the CROHME series of competitions was first conducted in 2011, a remarkable milestone in the research history of handwritten mathematical symbols and expressions.

Moreover, CROHME is the dataset used by the majority of researchers in the field of HME recognition. The studies of different researchers show that significantly less work has been done in math recognition with standard encodings, benchmark datasets, or evaluation tools [131]. To make work on handwritten math recognition easier to start, the CROHME competition was organized in 2011, allowing systems to be meaningfully compared using the common publicly available dataset provided by the competition [128].

3 Review methodology

This study comprises the planning, execution, and description of the result analysis for the research area, i.e., the recognition of HMEs. According to the guidelines of [96], the authors designed the planning phase, in which they apply a review protocol comprising seven stages: (1) defining and framing research questions, (2) planning the search methodology, (3) the search process and criteria, (4) selection of relevant studies, (5) data collection and extraction, (6) data analysis, and (7) evaluation of results and conclusions.

3.1 Review protocol and criteria

In an SLR, critical importance is given to the review protocol [196]. The protocol has been developed according to the guidelines set out by [96] so as to reduce researcher bias and ensure rigor in the SLR. After considering the principles, philosophy, and measures of an SLR, a comprehensive review protocol has been developed. It mainly focuses on the review background, research questions, search strategy, data extraction, quality assessment criteria for the research studies, and data analysis [126]. The review protocol plays an important role in differentiating an SLR from a traditional or narrative literature review. It enhances evaluation consistency and reduces researcher bias, since the researchers have to present the search strategy and the criteria for the inclusion or exclusion of any study in the review [96, 126]. The connected series of steps performed in this process is illustrated in Fig. 2.

Fig. 2
figure 2

Systematic review methodology

3.2 Research questions

According to our review protocol, in the first phase the authors set up a few research questions related to the objectives of our study. These research questions are fundamental tools for digging out valuable information from the literature already available, so their definition and framing must be done very carefully and critically. The direction of this review is entirely based on the framework designed through these research questions. While formulating the research questions, our goal is to assess the empirical evidence resulting from various studies on recognizing MEs using multiple techniques and methods. The authors have selected the research questions to cover all aspects investigated by [39] in their previous survey, to investigate further issues, and to perform a better meta-analysis on the chosen studies.

To our knowledge, no SLR has been performed in the area of ME recognition to date that could give future researchers a perspicuous idea of the complete analysis of this challenging research area. The surveys performed by Tapia [178] and Zhang [213], and even the recent survey by [222], focused only on online HMEs. Another recent short review by [98] compiles the studies broadly but is not purely evidential of a systematic analysis. At the same time, the authors could find no complete review after 2000 [38] that targeted a systematic research analysis of this recognition process. That is why the time range selected for this review is from 2000 to the present date.

3.2.1 Answer extraction design

Figure 3 outlines our search strategy and selection procedure. It focuses mainly on the research design for extracting data from specific research questions in order to answer the other research questions of this study. According to the mode and kind of extraction used, the authors divided the extraction criterion into three categories: direct extraction, indirect interpretation, and synthesis-based indirect interpretation.

  • In direct extraction, the authors retrieved the raw data from the study and used the processed information to answer our questions. This criterion is applied to answer the basic research questions, whose answers could be extracted directly from the chosen study, thus justifying the name of this criterion.

  • Indirect interpretation works by using the information retrieved for previous questions. It makes use of this information, draws an interpretation, and answers the formulated research questions. This criterion is so named because we try to make the pre-fetched information more meaningful and valuable in our study: it uses the answers to one research question to formulate and answer other research questions.

  • Synthesis-based indirect interpretation is an extended version of indirect interpretation, with a slight difference: it uses the answers or information extracted from two or more questions to answer another research question of this review study.

Fig. 3
figure 3

Research design for extracting answers to research questions. 1 denotes direct extraction, 2 indicates indirect interpretation, 3 denotes synthesis-based indirect interpretation.

Correlating the criteria with our research questions, we can thoroughly read a study and directly extract the answer to RQ1 (identifying the ML/non-ML technique); this is an instance of direct extraction. Similarly, the extracted information of RQ1 can be used to answer RQ5 and RQ6 (concentrating on the approach used in the processes and sub-processes involved). The answers to these questions come from interpreting the response to RQ1, where the technique used in the recognition process is analyzed. The answers to RQ7 and RQ8 can be fetched by synthesis-based indirect interpretation, where the results of RQ1 and RQ4 are combined to reach conclusions for both of these RQs. Here, we aim solely to identify the best possible technique that yields sufficient accuracy; so the method and the corresponding accuracy values can answer RQ7 and RQ8 (whether ML outperforms other methods or not).
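The mapping just described can be summarized as a small lookup structure. The sketch below is purely illustrative (the RQ labels follow the text; the structure and helper are our own):

```python
# Illustrative summary of the answer-extraction design described above.
# Keys are extraction modes; values map a research question to the sources feeding it.
EXTRACTION_DESIGN = {
    "direct_extraction":              {"RQ1": ["full text of the study"]},
    "indirect_interpretation":        {"RQ5": ["RQ1"], "RQ6": ["RQ1"]},
    "synthesis_based_interpretation": {"RQ7": ["RQ1", "RQ4"], "RQ8": ["RQ1", "RQ4"]},
}

def sources_for(rq):
    """Return the earlier answers (or raw study text) that feed a given research question."""
    for mapping in EXTRACTION_DESIGN.values():
        if rq in mapping:
            return mapping[rq]
    raise KeyError(f"unknown research question: {rq}")

print(sources_for("RQ7"))   # ['RQ1', 'RQ4']
```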

3.3 Search design and strategy

The design of the search strategy encompasses the search terms, the sources of literature, and the search process. For the search terms, the authors extracted the significant keywords and key terms essential for formulating the search strings. The core idea behind the search string is to extract all candidate papers of interest in one go. After considering the research questions, in the second phase of the SLR a search string is designed to fetch all the relevant literature that is good enough to answer the research questions.

To search for the papers, this step of defining the search terms and constructing a search string is crucial. The creation of a search string to filter the relevant literature of interest works iteratively and provides an unbiased strategy for searching for appropriate studies. The second step is selecting digital libraries and using the data retrieval settings of these libraries to extract the papers. In the third step, inclusion and exclusion criteria are defined and applied to the search results [32].

3.3.1 Search terms

The following steps are used to construct the search terms [29]: (a) extract key terms by analyzing the research questions; (b) identify synonyms and alternative spellings of the acknowledged key terms; (c) look up keywords in the research literature; (d) use Boolean OR in the search string to incorporate synonyms and alternative spellings; (e) use Boolean AND to integrate the essential terms. The Boolean operators used in the search string thus each have a specific role in the searching process. The primary key terms from the questions are: handwritten, MEs, techniques, ML, recognition, process, online, offline, and non-ML. By identifying synonyms and alternative spellings, the authors added further key search words to the list, i.e., 'mathematical expressions' and 'prediction' (used in the sense of recognition). Hence, after mustering the essential words in the pool of necessary keywords, the authors formulated the search string using Boolean OR and AND. The final resultant search string after this procedure is: (handwritten) AND ("mathematical expression" OR "mathematical expressions") AND (classification OR prediction OR recognition).
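As a simple illustration of steps (d) and (e), the following sketch (our own helper, not part of any reviewed tool) assembles the final search string above from the term groups by OR-ing synonyms within a group and AND-ing the groups together:

```python
# Minimal sketch: build the Boolean search string from grouped key terms.
def build_search_string(term_groups):
    """Each inner list is OR-ed; the groups are then AND-ed together."""
    def or_group(terms):
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        return "(" + " OR ".join(quoted) + ")"
    return " AND ".join(or_group(g) for g in term_groups)

groups = [
    ["handwritten"],
    ["mathematical expression", "mathematical expressions"],
    ["classification", "prediction", "recognition"],
]
print(build_search_string(groups))
# (handwritten) AND ("mathematical expression" OR "mathematical expressions")
#   AND (classification OR prediction OR recognition)
```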

3.3.2 Data retrieval and literature sources

To perform an automatic search, data retrieval has been carried out on the following digital databases: 1) IEEE Xplore, 2) ACM Digital Library, 3) ScienceDirect, 4) Springer, 5) Wiley Online Library, and 6) Scopus. The entire data retrieval process consisted of executing the appropriate search string on each of the mentioned digital libraries and properly retrieving the relevant studies and papers. Though an adequate amount of data was retrieved from the above digital sources, the authors did not include other literature such as magazines, books, and general articles in the review, because the content in these sources is not peer-reviewed and thus its quality cannot be reliably corroborated.

After the proper execution of the automatic search for data retrieval, a manual search, our secondary search phase, was performed to ensure that nothing worthwhile was missed. To fulfill this aim, the authors performed the manual search by forward and backward referencing, iteratively following references from the retrieved studies to extract more relevant works from the past. This iterative process applied in the secondary search phase is called snowballing. After snowballing, the extracted studies were added to a Mendeley library, which further helped in making suitable references in this study. The studies focused on range from 2000 to 2021.

3.4 Screening of papers and the process of filtration of studies

When the constructed search string is run on the digital libraries, it fetches a different number of studies from each database.

In screening the papers, appropriate studies are selected, and this selection is based on well-defined criteria and research themes. The well-defined selection criteria include four stages, as per the headings of Table 5. The first filter applied here is year-wise filtration, with the constraint of considering only studies from the year 2000 onwards. The second filter focuses on the removal of duplicate studies. The chance of redundant papers arose owing to the fact that articles were extracted from the major digital databases as well as from Scopus, which indexes almost all articles from IEEE, ACM, Springer, and Wiley. The reason Scopus was searched in addition is that the authors did not want to miss any quality publication published outside the mentioned digital databases. The criteria used for removing duplicates are "Exact Match" (where the titles of all the gathered studies were compared and any exact duplicates were removed) and "Cross Checking" (where the authors, publication dates, and other bibliographic information were cross-verified to identify potential duplicates from different sources or variations in title wording). The third filter is removal after reading the title: the search string targets studies containing the mentioned keywords, so some retrieved studies match the keywords without being relevant.

Table 5 Count of studies at different stages of the refining process

One such example is the study titled “Strategy and Tools for Collecting and Annotating Handwritten Descriptive Answers for Developing Automatic and Semi-Automatic Marking - An Initial Effort to Math”.

This study is extracted because it contains the keywords 'handwritten' and 'math'. The paper was published in the 2019 International Conference on Document Analysis and Recognition Workshops, so the study was also extracted because of the keyword 'recognition'. On reading the title itself, however, the study is realized not to be relevant to the research topic and is hence removed by the title filter. The criteria for title-based removal thus include "Relevance" (where titles that appeared to be directly related to the topic or research question were considered) and "Focus" (where titles that indicated a clear focus on the specific aspects or associated research variables were retained). This helped to narrow the studies down to those most aligned with the research objectives. Similarly, in the case of the fourth filter, a study is removed after analysing the relevance of its abstract. The criteria for abstract-based removal include "Study Objective" (where abstracts that clearly outlined the objective or purpose of the research, with a direct mapping to our research concerns, were considered), "Methodology" (where abstracts that briefly described the methodology employed in the study, ensuring that it aligned with the research interest of this study, were prioritized), and "Findings" (where abstracts that summarized the main findings or results obtained from the study were analysed to assess the potential relevance and contribution of the research to this study).
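A compact sketch of these four filters is given below; the record fields and topic terms are placeholders of our own, intended only to illustrate the order in which the filters described above are applied:

```python
# Illustrative screening pipeline (our own sketch): year filter, duplicate removal by
# exact title match and bibliographic cross-checking, then a coarse keyword-based
# relevance check on the title and abstract.
def screen(records, year_from=2000, topic_terms=("handwritten", "mathematical expression")):
    seen_titles, seen_biblio, selected = set(), set(), []
    for rec in records:  # rec: dict with "title", "year", "authors", "abstract"
        if rec["year"] < year_from:
            continue                                   # filter 1: publication year
        title_key = rec["title"].strip().lower()
        biblio_key = (tuple(sorted(a.lower() for a in rec["authors"])), rec["year"])
        if title_key in seen_titles or biblio_key in seen_biblio:
            continue                                   # filter 2: exact-match / cross-check duplicates
        seen_titles.add(title_key)
        seen_biblio.add(biblio_key)                    # coarse heuristic; borderline cases are checked by hand
        text = (rec["title"] + " " + rec["abstract"]).lower()
        if not all(term in text for term in topic_terms):
            continue                                   # filters 3-4: title/abstract relevance
        selected.append(rec)
    return selected
```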

On analysing the contribution of each digital library to the final count of selected studies, the authors noticed that most of the quality papers come from two of the digital databases involved, i.e., IEEE Xplore and Scopus, with Scopus ranking highest in terms of its contribution to the final set of selected papers. Figure 4 below shows the contribution of each digital library to the definitive collection of selected studies.

Fig. 4
figure 4

Analysis of the contribution of each digital library to the selected database. Relevance ratio = (number of selected studies) / (total number of retrieved studies) = NS : NR

This analysis of the selected studies based on the relevance ratio and relevance percentage allows us to identify which sources retrieved the most relevant results when the search string was executed. The estimated relevance percentages show that the IEEE Xplore and Scopus digital databases have higher relevance percentages than the others. The conclusion drawn is that Scopus had the highest relevance among all the digital databases compared, whereas Wiley had the least relevance, as per the calculated values given in Table 6.
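The calculation behind Table 6 can be sketched as follows; the per-library counts used here are placeholders, not the actual values reported in the table:

```python
# Sketch of the relevance measure quoted in Fig. 4; the (NS, NR) pairs are hypothetical.
def relevance(selected, retrieved):
    ratio = selected / retrieved                 # RELEVANCE RATIO = NS : NR
    return ratio, 100 * ratio                    # ratio and relevance percentage

libraries = {"IEEE Xplore": (40, 120), "Scopus": (45, 110), "Wiley": (2, 60)}
for name, (ns, nr) in libraries.items():
    ratio, pct = relevance(ns, nr)
    print(f"{name}: {ns}:{nr} -> ratio {ratio:.2f}, relevance {pct:.1f}%")
```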

Table 6 Representation of Relevance Ratio and Relevance Percentage of each study

3.5 Classification of HMER related studies

This section presents the review findings and conclusions, consisting of the results obtained by analyzing the selected studies for each formulated research question. The authors have discussed the conclusions and answered the research questions in separate subsections by considering sufficient empirical evidence collected from the data. In the discussion, the findings are placed in a broader perspective that is nonetheless closely tied to the study issues.

At the end of this entire screening process, 28.94% of the total initial papers are identified, which, when added to the studies selected from the manual search, results in 202 articles. These 202 studies are collected and their abstracts are read, allowing the authors to classify the entire assembled cluster and allocate the studies to empirical categories according to the research approach. The nominated headers of the classification are presented in Table 7 as (a) generic, (b) technique based, (c) application-specific, (d) sub-process concentrated, (e) survey-based study, (f) particular script oriented, (g) CROHME winning studies, and (h) other evaluation and problem-addressing papers. The studies of interest mainly include articles selected from categories (a), (b), and (g). The count of studies under each classified category is vividly depicted in Fig. 5.

Table 7 Defining the nominated headers of classification of studies
Fig. 5
figure 5

Classification of papers based on the defined meta-data categories

3.5.1 Inclusion and exclusion criteria: (Primary Selection Phase)

Inclusion-exclusion principles are the significant filters for selecting potential studies from the candidate studies retrieved after the screening process. Since many of the candidate papers lack support for addressing the research questions raised by the current study, further filtration is needed to boost the relevance of the studies collected for our review, and that is precisely what the selection process aims to do. The inclusion-exclusion parameters used in the primary selection phase are tabulated in Table 8.

Table 8 Inclusion Criteria (Primary Selection Phase)

These inclusion/exclusion criteria parameters are selected after numerous meetings between the authors, finalized with mutual consent.

3.5.2 Quality assessment criterion: (Secondary Selection Phase)

This is a pure quality-check measure on the selected studies, aiming to satisfy the defined quality standard for the chosen studies. Some quality assessment questions are formulated to weigh the candidate studies so that the final selection of studies can be made, as mentioned in Appendix 2 Table 23. Note that studies with low quality, i.e., those whose quality scores fall below the satisfaction threshold, are excluded from the cluster of selected studies. A fuzzy linguistics idea is adopted from an SLR study by [4]: rather than assigning scores on a binary scale of 0 and 1, fuzzy linguistic variables are used to measure the score (relevance) of each study with respect to the aforementioned assessment questions more appropriately.

The score chart for assessing the quality of the studies is shown in Table 9. It depicts the score allocation to the studies under the quality assessment criteria. In particular, the authors have organized the scores for checking and validating the relevance of each study against the question list formulated for this review. A rating of 0 indicates that the study has no significance with respect to the objectives of this review. Scores between 0.1 and 0.9 scale from 'rarely' (less relevant), through 'partially' (mediocre relevance), to 'mainly' (highly significant). The highest score of 1 indicates that the study answers all the quality-check queries designed for assessing the quality of the research. Since eight questions have been formulated for assessing quality, the summed score for a single study is shown in Table 10. Table 10 was shaped after numerous rounds of discussion and attentive observation of previous studies and works connected to quality evaluation criteria; both authors agreed on basing the assessment score exclusively on the fuzzy linguistic variables.

Table 9 Score allocation according to relevance
Table 10 Total score for each study with relevance scale

The whole concept is concentrated on assessing the quality of the different selected studies. The studies with extremely low scores (mainly with relevance 'No' and 'Rarely') are excluded in this secondary selection phase of the review methodology; the apparent reason for this exclusion is their low scores against the quality assessment questions. For instance, if a study receives scores of 0.2 for Q1, 0.4 for Q2, 0.2 for Q3, 0.3 for Q4, 0.6 for Q5, 0.4 for Q6, 0.1 for Q7, and 0.1 for Q8, the total score is 2.3, which lies in the range of 'rarely' (refer to Table 10). This implies that the study does not pass the quality assessment procedure, so it cannot be included in the selected studies.

On the contrary, if a study satisfies the defined quality criteria, scores more than 3.5, and reaches the scale interval named 'Partly' according to the relevance measure, the candidate study is confirmed to join the selected list of studies for consideration and review. After this extensive quality assessment procedure, 98 studies were chosen, and the rest were rejected as per the selection protocol. It should be noted that the studies rated with the relevance measures 'Partly' and 'Yes' are selected for further review, as these scores indicate that the studies are of high relevance and quality.
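The decision rule can be sketched as below; the per-question scores are hypothetical and only illustrate the 3.5 cut-off described above (the actual score bands are given in Table 10):

```python
# Minimal sketch of the secondary (quality-assessment) decision: eight per-question
# fuzzy scores in [0.0, 1.0] are summed and a study is retained only if the total
# exceeds the 3.5 cut-off. The example scores below are hypothetical.
def passes_quality_check(scores, cutoff=3.5):
    total = round(sum(scores.values()), 1)
    return total, total > cutoff

example = {f"Q{i}": s for i, s in enumerate([0.3, 0.4, 0.2, 0.5, 0.6, 0.4, 0.2, 0.1], start=1)}
print(passes_quality_check(example))   # (2.7, False) -> excluded in the secondary phase
```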

While analyzing recent studies, one matter of concern is that the most recent studies naturally carry fewer citations, even though there is a decent possibility that such research is relevant according to the other assessment questions. Considering this exception, publications from the years 2017-2021 are exempted from Q8, i.e., the citation-count check. In such cases, a study is evaluated on the other relevance measures and on the h-index and impact factor of the publishing journal. Appendix 1 Table 22 shows the studies grouped according to the quality assessment labels.

The quality assessment criterion is applied to 198 of the 209 studies selected from the previous selection phase; eleven of the considered studies turned out to be redundant when the entire database of chosen studies was compared. Each of these 198 studies is evaluated against the designed quality assessment questions to decide whether it proceeds to the full review. The qualifying range in the quality assessment lies between 3.6 and 8.0; any study scoring less than 3.5 is not considered in the final database of selected studies. In this way, the potential studies are chosen for the complete review by applying the quality assessment criteria. Additionally, the authors analyzed the count of studies for each relevance measure and segregated the studies according to the citation rate observed during the selection process.

Approximately 49% of the studies are selected, constituting 98 studies out of 202. Overall, 97% of the studies had at least one citation (note that studies from 2017 to 2021 are not considered in the citation check). As per the analysis, 32.32% of the studies (64) are rejected outright ('No'), 19.6% (39 studies) are rarely relevant, and 14.64% are of average relevance. Almost 33.33% of the studies (66) are highly relevant, with the relevance measure 'Yes'. In this way, the entire criterion of the secondary phase is accomplished while leveraging the studies as per their relevance.

3.6 Data collection and extraction

Extracting quality information, in sufficient quantity, from the selected database is one of the prime objectives of this systematic study. After compiling the studies chosen in the well-defined selection phase, the authors decided to extract metadata from the database of the final selected studies of this review. Table 11 shows the framework of the review analysis of the chosen studies. The authors have tried to classify the investigations on several grounds so as not to miss any critical parameter or vital perspective of the review that could help in understanding the present status of the research literature produced to date.

Table 11 Framework of review analysis (investigating each study against the formulated research questions)

3.7 Data analysis and synthesis

In this section, data analysis and synthesis are performed. The principal goal of this phase is to identify, select, aggregate, and analyze the pieces of evidence collected from the chosen papers in order to answer the formulated research questions. A single piece of evidence might have little evidential force, but the aggregation of many pieces can make a point stronger [139]. Thus, evidence collection and synthesis become a truly indispensable part, as they help draw the conclusions that are the tangible outcome of a good review paper. Table 12 shows the review analysis framework.

Table 12 The framework of Review Analysis

To extract answers to the research questions, the selected studies are analyzed and synthesized quantitatively (e.g., the estimated accuracy of different recognition techniques) and qualitatively (e.g., differentiating ML and non-ML methods, recording their accuracy metrics, and evaluating their strengths and weaknesses). The authors employed various strategies to synthesize the extracted data associated with the different kinds of research questions.

The different synthesis strategies conforming to the research questions are depicted in Fig. 6.

Fig. 6
figure 6

Different synthesis strategies

The narrative synthesis method is applied to the data pertaining to RQ1, RQ2, RQ3, and RQ4a. That is, the data are tabulated in a regular and organized way consistent with the formulated research questions. To represent the extracted information, visualization tools including bar charts, pie charts, and other graphs are used. These graphical representations enrich the presentation of the distribution of ML techniques, the work on offline/online HMEs, the frequently used datasets, and their estimation accuracy data. The survey studies conducted to date are also analyzed and synthesized using this strategy.

For the data pertinent to RQ2, RQ3, RQ4a, and RQ9, which focus on comparing the different techniques, their estimation accuracy, the datasets used, and the kinds of expressions used, the vote-counting method is applied. Suppose we want to synthesize comparisons of the types of math expressions primarily used by researchers for recognition purposes: we can compare the number of studies using offline expressions with the number using online ones, and thereby obtain a brief idea of which kind of math glyphs is used more frequently. The same voting synthesis can be used to estimate which accuracy metrics are more regularly used by different techniques, to compare them, and to analyze and compare the ML models used for the recognition task.
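A minimal sketch of this vote-counting synthesis is shown below; the study records are placeholders, not the actual extracted data:

```python
# Sketch of vote-counting: each selected study "votes" for the expression type,
# dataset, or technique it uses, and the votes are tallied.
from collections import Counter

studies = [
    {"id": "S1", "expression_type": "online",  "technique": "SVM"},
    {"id": "S2", "expression_type": "offline", "technique": "CNN"},
    {"id": "S3", "expression_type": "online",  "technique": "CNN"},
]

votes_by_type = Counter(s["expression_type"] for s in studies)
votes_by_technique = Counter(s["technique"] for s in studies)
print(votes_by_type.most_common())        # e.g. [('online', 2), ('offline', 1)]
print(votes_by_technique.most_common())
```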

For the data pertaining to RQ5 and RQ6, the concept of grounded theory (GT) is used; this is one of the systematic methodologies of the social sciences, involving the construction of theories through the methodical gathering and analysis of data. We have tried to use it in an engineering-science review to analyze the qualitative data on the different approaches used to recognize math expressions. The basic idea of the GT approach is to read, revise, and review the textual data (such as the evidence on the approaches used by different techniques) and to label the variables (called groups, concepts, and properties) and their interrelationships. Grounded theory has three phases, explained and illustrated in Fig. 7.

Fig. 7
figure 7

Conclusion construction of SLR using Grounded Theory

3.8 Synthesis with grounded theory

Open coding

The essential part of the data analysis phase is open coding. It uses techniques such as identifying, categorizing, naming, and describing the approaches found in the text; that is, from the initial data collection, the researchers categorize the information about the incidents [30, 48, 50]. The authors have applied open coding to classify the different methods and approaches used in the recognition process and its sub-processes, so as to consolidate all the techniques used. Overall, 98 open codes were found in this SLR, related to the several approaches used for the task of recognizing HMEs.

Axial coding

Axial coding is the procedure of developing inter-relationships between the open codes (properties and groups) via a combination of deductive and inductive thinking. It involves collecting the open codes together, with similar ones confined to distinct axial coding groups. This mechanism is not just time-saving but also lessens the overhead of searching for and establishing every relation in the set [30, 48, 50, 73]. The reviewers categorized all the open codes into eight principal axial codes, also known as concepts, and then built the interconnections among the collected open codes; in axial coding, the inter-group relationships between the chosen open codes are defined. The solid-line arrows in Fig. 8 depict direct connections between the categorized open codes, whereas the broken-line arrows differentiate indirect relationships from direct ones, and the double-headed arrows represent attributes identified through open coding that can be derived from each other.

Fig. 8
figure 8

Implementing phases of Grounded theory on the attributes of the study

Selective coding

Selective coding is the process of selecting one core category and then relating all the other groups to that core category [30, 48]. The initial idea is to develop a single core category around which everything else is organized. The authors segregated the appropriate axial code chains from the prefabricated axial code chains to synthesize the research questions. Selective coding thus refines the synthesis procedure and assists in the consistent framing of relevant code chains, which helps in qualitatively analyzing the research questions and extracting satisfactory answers. Figure 8 illustrates the refining process involved in selective coding.
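For readers unfamiliar with the three phases, the toy structure below sketches how open codes can be grouped under axial codes and how selective coding then follows the chain related to one core category; all code labels here are hypothetical and are not the actual codes extracted in this review:

```python
# Hypothetical illustration of the three GT phases described above; the labels are
# not the actual codes from this review.
OPEN_TO_AXIAL = {                       # axial code (concept) -> open codes grouped under it
    "symbol segmentation":   ["stroke grouping", "connected components"],
    "symbol classification": ["SVM classifier", "CNN classifier"],
    "structural analysis":   ["2D grammar parsing", "spatial relation trees"],
}

AXIAL_RELATIONS = {                     # directed links produced during axial coding
    "structural analysis":   ["symbol classification"],
    "symbol classification": ["symbol segmentation"],
}

def selective_coding(core_category, axial_relations):
    """Return the chain of axial codes reachable from the chosen core category."""
    chain, stack = [], [core_category]
    while stack:
        node = stack.pop()
        if node not in chain:
            chain.append(node)
            stack.extend(axial_relations.get(node, []))
    return chain

print(selective_coding("structural analysis", AXIAL_RELATIONS))
# ['structural analysis', 'symbol classification', 'symbol segmentation']
```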

3.9 Threats to validity

3.9.1 Limitations of search string

Though sincere efforts have been made to formulate the search string, its creation is subject to certain restrictions owing to the number of keywords allowed in a search string. Most digital libraries do not support lengthy search strings containing too many keywords; in other words, there is a limitation on the number of keywords that can be used. For example, a string like [(offline OR handwritten) AND ("mathematical expression" OR "mathematical expressions") AND (regression OR "machine learning" OR classification OR "Bayesian network" OR "neural network" OR "decision tree" OR "support vector machine" OR "genetic algorithms" OR "random forest" OR "deep learning")] would not execute well.

3.9.2 Selection bias

The search string was created and used by the authors to select the studies for this SLR, and they tried their best to choose entirely appropriate phrases for it. It must be noted that the keywords for the search string were picked according to the research questions initially formulated. It is therefore possible that the authors missed some relevant studies, since some studies may have omitted the chosen primary keywords from the sections that are searched, namely the title, abstract, and keywords. Though rigorous efforts were made to avoid such a setback, including the decision to add manual search criteria and to refer to the bibliographies of the various studies in order to pick up further research of interest, there remains a possibility that some substantial studies were missed, which could be interpreted as posing a risk to validity.

4 Statistical analysis of the selected HMER related studies

In this section, the statistical results of the selected studies will be presented concerning their publication type, publication year, geographical distribution over years, authors, and keyword count status.

4.1 Extracted metadata fields from selected studies

The list of the extracted fields/ heads for metadata analysis of the selected studies is mentioned in Table 13.

Table 13 Metadata themes with description

4.2 Publication type overview

The authors have tried to identify the sources that contributed most significantly to the collection of studies. The purpose of conducting this analysis is to identify the foundation of the pertinent studies, allowing these high-ranking sources to be shortlisted and preferred as secondary sources of literature for the subject matter of this review article and any further implementations. Additionally, this analysis gives future researchers a priority order for identifying reference sources and an outstanding supply of literature, functioning as a secondary database for comprehending this research subject. Figure 9 shows the publications with respect to publication type. It is observed that the majority of the selected studies belong to journals and conferences, about 77 of the 95 selected studies in total. The other studies include papers from workshop proceedings, a well-framed thesis, and organized symposiums.

Fig. 9
figure 9

Distribution of selected studies according to publication type

4.3 Temporal view of research over the years

The count of selected studies is analyzed on a yearly scale, depicting the number of publications produced per year. It is noticed that the average count of papers published per year before 2011 is about 3 (an average of three publications per year), while in the latter half of the period under review, i.e., from 2011 to 2019 (excluding the few studies taken from 2020), it is about 6 (an average of six publications per year), as shown in Fig. 10. This estimated average signifies the role of CROHME, a well-established annual competition conducted to accelerate the growth of research in this challenging domain, in doubling the number of useful publications per year. After the start of the competition, interest in handwritten mathematical symbol and expression recognition has escalated, with on average double the number of studies produced per year after 2011. This is not surprising, since the concept of handwritten mathematical symbol and expression recognition is attracting more researchers' attention because of advances in ML, deep learning, and computer vision. Moreover, given the need to input MEs directly into systems, research on high-accuracy recognition processes and systems is expected to rise exponentially, opening new ways and methods to recognize and retrieve mathematical expressions.

Fig. 10
figure 10

Distribution of selected studies according to the year of publication

4.4 Geographical distribution of research studies over the years

The count of papers is also analyzed from the perspective of their geographical distribution and year of publication. It is observed from the figure that the maximum number of research articles comes from China and Japan, with the former having 17 papers and the latter 16. The contribution of different countries in different years can be visualized in Fig. 11.

Fig. 11
figure 11

Geographical distribution of research studies per year

4.5 Distribution of publications by authors

Figure 12 below shows the distribution of the authors' total research publication counts. The authors highlighted in this figure have made significant contributions to the field of HME recognition over the course of several years by publishing high-quality articles. Researchers who place highly on this graph of study distributions likely have a keen interest in, and considerable competence with, the research topic of recognizing mathematical expressions. In addition to the metadata gathered, the analysis presented here is likely to be of great assistance to the domain experts who have dedicated themselves to this line of inquiry and developed the relevant recognition algorithms.

Fig. 12
figure 12

Distribution of publication according to the authors of the publications

On analyzing Fig. 12, it can be noticed that the author Anh Duc Le has contributed the most to the collection of selected studies, followed by the authors H. Mouchere, M. Nakagawa, Viard-Gaudin, and Richard Zanibbi. These are the authors leading the research on recognizing handwritten mathematical symbols and expressions.

4.6 Frequency of keywords

The authors also tried to analyze the keyword counts in the retrieved publications, because these keywords are responsible for the high relevance ratio of the studies with respect to the search string. It has been observed that 'mathematical expression recognition' is the most common keyword found in the selected studies; this keyword has the greatest relevance to the topic of the review objective.

Other specific keywords like 'mathematical expression', 'online handwritten mathematical expression', 'handwritten mathematical expression', and 'online recognition' are among the top entries when tabulating the keyword counts in the studies retained after the entire selection procedure. These frequently used keywords help contextualize the keywords of our own review study, as our study revolves around the selected papers that carry the above-mentioned keywords in their abstract and keywords sections. The graphical analysis of the keywords against the count of publications is presented in Fig. 13.

Fig. 13
figure 13

Distribution of publications according to the frequency of keywords used

5 Results and discussion

The discussions around the extracted answers to the research questions are elaborated in this section.

  • RQ1. Which ML/non-ML techniques are used in the studies for recognition?

Before analyzing the types of ML and non-ML techniques used to recognize HMEs, the authors segregated the selected studies based on the approach used. It is found that, of the total selected studies, 60 used ML algorithms for the task of handwritten math recognition, whereas 38 studies implemented a non-ML approach. As the popularity of ML and deep learning algorithms grew rather late, by the end of 2010, most studies before that used conventional, non-ML-based methods. An endeavor has been made to analyze the trend and popularity of the ML techniques used for recognition over the years. Figure 14 depicts the rising trend of machine learning techniques from the year 2000 to the present time. At the onset, not a single study used ML methods of recognition in the year 2000, and ML usage remained limited until 2011. After that, the pace of implementation of ML algorithms ([45, 116, 143]) increased, with almost every study using ML models by the year 2021. The stacked bars colored in red depict the studies using ML techniques for recognition, and the stacks colored in grey represent the studies implementing non-ML approaches; this graphical representation shows the rising trend of ML techniques over the years. Of the 60 studies found to use ML models for recognition tasks, the authors further analyzed the different ML techniques used across all the studies selected by the SLR. The list of ML techniques used for the recognition of HMEs is as follows:

  • Support Vector Machine (SVM)

  • Artificial Neural Network (ANN)

  • Convolutional Neural Network (CNN)

  • Recurrent Neural Network (RNN)

  • Bidirectional Long Short Term Memory (BLSTM)

  • K-means neighbors (K-means)

  • Decision Tree and Random Forest (DT + RF)

  • Generative Adversarial Networks (GANs)

  • Graph Neural Network (GNN)

Fig. 14 Year-wise distribution of studies based on ML/non-ML technique

Figures 15 and 16 show, for each technique, the number of studies that applied it. It can be noted that the most frequently used ML technique overall is SVM. However, if the trend after 2013 is observed, ANN and CNN are applied more frequently to recognition systems when the experimental parts of the papers are scrutinized closely. As per our observations, about eleven papers used ANN and thirteen used CNN to classify and recognize math expressions. Overall, SVM and CNN are the most frequently used techniques, with about 18 and 13 studies, respectively. In addition, SVM competed closely with ANN, and even after 2013 a considerable number of papers still used SVM for recognition. Generalizing, it can be concluded that SVM, ANN, and CNN are the techniques primarily used to recognize HMEs and symbols.

When the number of papers associated with each method is broken down, SVM is found to be the most frequently used ML method, appearing in approximately 18 studies, or about 29% of all studies. CNN is the second most commonly used technique, explored in approximately 13 (about 21%) distinct papers. Several varieties of neural networks (NNs), such as backpropagation networks [167] and recurrent neural networks, have been utilized in this research [11, 216, 219], along with fuzzy NNs [69, 90, 118]. ANN has been investigated in 11 studies (approximately 18%), while BLSTM has been employed in 6 (approximately 11%) and RNN has been implemented in four different studies. Decision trees and random forests have been used in two and one studies, respectively. K-means neighbor has been used in three selected studies, and other techniques like Naïve Bayes and Generative Adversarial Networks have been scarcely used, with a single entry each. The history of the dominant ML models is shown in Table 14, whereas Fig. 15 shows the algorithms used by the selected studies and Fig. 16 shows the distribution of HMER-based selected studies according to the ML techniques. It is also observed that the three dominating ML techniques were applied in this field only several years after they were first introduced. For instance, the SVM algorithm originated in 1963, its modern formulation was published in 1995, and the selected studies started implementing the technique in 2003 [179]. The second most frequently used ML technique is CNN. The ‘Neocognitron,’ the origin of the CNN architecture, was introduced by Kunihiko Fukushima in 1980; owing to the pioneering work of Yann LeCun, one of the first convolutional neural networks, LeNet, moved the field of deep learning forward in 1989, and after many successful iterations the LeNet-5 architecture followed in 1998. Among the selected studies, the very first study employing CNN appeared in 2015. ANN, whose perceptron model was first developed in 1958 by psychologist Frank Rosenblatt, is the third most used machine learning technique. Many real-world institutes began applying ANNs to a variety of applications only in the late 1980s, and the studies included in this SLR did not start making use of this approach until 2006 [153].
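To make the dominant technique concrete, the following minimal sketch (in Python, using scikit-learn) illustrates how an SVM classifier is typically applied to isolated handwritten symbol images. Scikit-learn's digits dataset stands in here for a handwritten math symbol set, and the kernel and hyperparameter choices are illustrative assumptions rather than settings taken from any reviewed study.

```python
# Minimal sketch: SVM-based symbol classification, as commonly reported in the
# reviewed studies. The digits dataset stands in for a handwritten math symbol
# dataset (e.g., isolated CROHME symbols), which is an assumption for the demo.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()                                   # 8x8 grayscale glyph images
X = digits.images.reshape(len(digits.images), -1)        # flatten images into feature vectors
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# RBF-kernel SVM; in practice, features such as directional histograms or
# stroke-based descriptors would replace raw pixel intensities.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_train, y_train)
print(f"Symbol classification accuracy: {clf.score(X_test, y_test):.3f}")
```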

Table 14 History and year of implementation of dominant ML models in the studies
Fig. 15 Machine learning methods used by the studies

Fig. 16 Distribution of studies according to the ML technique implemented

Delving into the non-ML recognition techniques, a blend of trends is witnessed in the studies before the advent of the machine learning era. The major non-ML recognition approaches identified are grammar-based approaches such as graph grammars [56, 75, 89], stochastic context-free grammars [7, 108, 134, 201], probabilistic context-free grammars [35, 171], definite clause grammars [39], and other algorithmic approaches [141, 142, 210]. There have also been studies concentrating on parsing [41, 108, 206, 208], fuzzy methodologies [58, 60, 70, 90, 102, 117, 118], and other methods based on relational grammars [120]. Though the 38 studies that are purely based on non-ML approaches account for roughly 39% of the chosen studies, the inclination towards these methods cannot be neglected, as the early research on HMSER relied heavily on them. But, undoubtedly, ML techniques have come to dominate the recognition trends in HMSER overall.
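As a concrete illustration of the grammar-based family, the following minimal sketch parses an already-recognized, linear symbol sequence with a handful of production rules. Real systems in the reviewed studies parse 2D symbol layouts, so this 1D recursive-descent parser, its token format, and its rule set are simplifying assumptions made purely for illustration.

```python
# Minimal sketch of a grammar-driven structural analysis step, assuming symbol
# recognition has already produced a linear token sequence. The rules roughly
# encode  Expr -> Term (('+'|'-') Term)*  with a superscript relation.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Node:
    op: str
    children: List[Union["Node", str]]

def parse_expression(tokens: List[str]) -> Node:
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        base = eat()                      # a symbol such as 'x' or '2'
        if peek() == "^":                 # superscript relation -> exponent subtree
            eat()
            return Node("sup", [base, factor()])
        return Node("sym", [base])

    def term():
        node = factor()
        while peek() == "*":
            eat()
            node = Node("*", [node, factor()])
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = Node(eat(), [node, term()])
        return node

    return expr()

# Example: the recognized symbol sequence for "x^2 + 3*y"
print(parse_expression(["x", "^", "2", "+", "3", "*", "y"]))
```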

Pros and cons of using ML/non-ML approaches for HMER

Non-ML Techniques:

Pros:

  • Non-ML methods, such as rule-based and grammar-based approaches, template-matching algorithms, and other parsing techniques, can be more efficient and require less computational capacity than ML methods.

  • These techniques can be useful for recognizing simple mathematical expressions or symbols with well-defined patterns and structures.

Cons:

  • Non-ML techniques are less precise when it comes to recognizing complex mathematical expressions or symbols, where handwriting, size, and orientation variations may exist.

  • Non-ML techniques necessitate more manual intervention and specialized knowledge because they rely on predefined rules or templates that must be created and maintained.

ML Techniques:

Pros:

  • ML techniques, such as deep learning algorithms, can be highly accurate at recognizing complex mathematical expressions because they learn to recognize patterns and characteristics from large datasets of HMEs.

  • The ML techniques can accommodate variations in handwriting, size, and orientation because they can learn to recognize the mathematical expression’s underlying structure rather than its individual symbols.

  • ML techniques are also capable of automatically adapting to new handwriting patterns and symbols.

Cons:

  • To train and optimize ML models, large amounts of data and computational capacity are required, which can be time-consuming and costly.

  • ML techniques can also be susceptible to overfitting, which occurs when models learn to recognize specific patterns in the training data but cannot generalize to new data.

  • ML techniques may necessitate more specialized knowledge for model development and optimization.

To summarize, non-ML techniques may be effective for basic mathematical expressions or symbols, but they are less precise for complex expressions. ML techniques can accomplish greater accuracy and can handle complex expressions, but their development and optimization require more data, computational power, and expertise.

  • RQ2. Which datasets are frequently used in the studies?

Almost ten datasets have been used in the selected studies; any dataset used at least once in a selected study has been considered. The listing of all datasets used is given in Table 15, along with the count of studies that used each dataset.

Table 15 Count of papers using different datasets

It has been observed that the most widely used dataset in the SLR, employed in almost 38% of the studies, is CROHME. This dataset is provided by the CROHME series of competitions. CROHME is a competition that was first held in 2011 in Beijing as part of the International Conference on Document Analysis and Recognition (ICDAR), and it has also contributed to the mathematical formula symbol library. One of its main goals is to encourage research in the area of HME recognition.

In addition to supplying the dataset, CROHME also provides academics with a platform on which they can test and analyze their methods and ultimately stimulate further development in this area. Prior to CROHME, only a modest amount of research on math recognition had been carried out, without the benefit of benchmark datasets, standard encodings, or evaluation tools. Thanks to the CROHME competition, researchers are able to effectively evaluate different systems and work on improving handwritten math recognition. Self-created datasets are the second entry in the table above. This type of dataset is extremely adaptable, as several authors have generated unique and diversified datasets of their own; for example, some authors included written expressions from conventional math books, while others recruited volunteer writers to compose HMEs. It has been noted that the majority of the studies conducted before 2011 relied on datasets produced by the researchers themselves because a common standard dataset was not readily available. After the launch of the CROHME series, however, most studies mainly used the CROHME dataset.

As the majority of the studies used the CROHME dataset, the authors decided to analyze the sub-datasets released as part of this series of competitions. CROHME was held in 2011, 2012, 2013, 2014, and 2016, and each edition provided a different dataset of handwritten mathematical symbols and expressions. The authors have investigated the count of studies using the datasets released in the different years. Figure 17 clarifies the trend of the different CROHME datasets used by the selected studies.

  • RQ3. What type of handwritten symbols are used (online/offline)?

Fig. 17 Distribution of studies according to CROHME sub-dataset

Recognition of handwritten symbols written on paper or other non-digital media is called offline recognition, whereas recognition of symbols written on a digital platform, where pen or fingertip movements are recorded, is called online recognition. Because online input offers additional information about how the writing is produced, the accuracy of online recognition is typically higher than that of offline recognition. Despite this, offline handwriting recognition is employed in large-scale real-world systems, such as reading the monetary values on bank cheques or deciphering handwritten postal addresses [145]. In online expression recognition, the input to the system is composed of a set of strokes that contain geometric and temporal information, and the system can make use of the temporal information contained in the online input. In offline mode, the input image carries no such temporal component; this mode of input is consequently harder to recognize and is utilized less frequently in the recognition literature. Table 3 illustrates the primary distinction between the two categories of HMEs, and Fig. 18 illustrates the percentage breakdown.
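For illustration, the following minimal sketch shows what the online representation carries and what is lost when it is rendered into an offline image. The (x, y, t) stroke format, the grid size, and the example strokes are assumptions made for the demo, not a format prescribed by any reviewed study.

```python
# Minimal sketch contrasting online and offline input, under the simplifying
# assumption that a stroke is a list of (x, y, t) samples from a pen trace.
# Rendering strokes onto a pixel grid discards the temporal (t) and stroke-order
# information, which is exactly what an offline recognizer no longer sees.
from typing import List, Tuple

Stroke = List[Tuple[float, float, float]]   # (x, y, timestamp)

def rasterize(strokes: List[Stroke], width: int = 32, height: int = 32) -> List[List[int]]:
    """Project online strokes onto a binary image grid (the offline view)."""
    image = [[0] * width for _ in range(height)]
    xs = [x for s in strokes for (x, _, _) in s]
    ys = [y for s in strokes for (_, y, _) in s]
    min_x, min_y = min(xs), min(ys)
    span_x = max(max(xs) - min_x, 1e-6)
    span_y = max(max(ys) - min_y, 1e-6)
    for stroke in strokes:
        for (x, y, _t) in stroke:           # the timestamp is dropped here
            col = int((x - min_x) / span_x * (width - 1))
            row = int((y - min_y) / span_y * (height - 1))
            image[row][col] = 1
    return image

# Two strokes forming a rough '+': the temporal order is available online,
# but only the resulting pixels survive in the offline image.
plus_sign = [
    [(0.0, 5.0, 0.00), (5.0, 5.0, 0.10), (10.0, 5.0, 0.20)],   # horizontal bar
    [(5.0, 0.0, 0.50), (5.0, 5.0, 0.60), (5.0, 10.0, 0.70)],   # vertical bar
]
offline_image = rasterize(plus_sign)
print(sum(map(sum, offline_image)), "foreground pixels")
```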

Fig. 18 Distribution of papers based on HME type

It can be observed that most of the papers from the selected database of studies worked on online handwritten math glyphs rather than offline ones. The reason could be, as mentioned above, that online data contains considerably more spatial and temporal information; thus, most researchers prefer working on online HMEs.

  • RQ4. What metric is used to measure accuracy, how much accuracy has been achieved, and by which techniques?

Accuracy measures are essential components of this SLR; they are used to highlight how reliable a particular proposed model or system is when it comes to predicting and recognizing HMEs. Several accuracy measures have been applied across the selected 95 studies. All the frequently used accuracy measures are defined in Table 16.

Table 16 Description of Metrics/ Accuracy measures used by the reviewed studies

On investigating the trend of accuracy measures used, it is noticed that the studies published after 2010 are more likely to give comparable results, because well-formed datasets came into use only after that year. Before 2010, almost 90% of the studies used self-created datasets, for which it is hard to list and compare the accuracy metrics used; this is the reason for leaving out a portion of the studies published before 2010 from the comparison. Table 17 lists the studies taken into consideration while comparing the accuracy metrics used for experimentation and implementation of the recognition techniques for HMEs. The reviewers compared the accuracy measures used by all these studies and noted the dataset on which each experiment was carried out. Among these measures, ExpRate, the expression recognition rate, has been used in almost 43% of the studies.

Table 17 Accuracy analysis of the reviewed studies
  • RQ5. What approach is followed in the study, or what is the proposed system?

In this research question, the authors have tried to understand the kind of approach followed by the selected studies for recognizing math expressions in handwritten form. On investigating and generalizing the approaches, it is found that the recognition systems following the non-ML approach chiefly followed the conventional pipeline of symbol segmentation, symbol recognition, structural analysis, and expression recognition. The last phase (expression recognition) is common to the studies that used ML techniques, for which the proposed approaches generally followed the steps of preprocessing, segmentation, feature extraction, classification, and expression recognition. The generalized approach is shown in Fig. 19 and sketched in code after the figure caption below.

Fig. 19 Generalized approach used by different ML and non-ML methods in the selected studies
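The following schematic sketch mirrors the generalized ML pipeline above (preprocessing, segmentation, feature extraction, classification, expression recognition). Every stage body is a toy placeholder standing in for the concrete techniques surveyed in RQ6, not an implementation from any specific study.

```python
# Schematic sketch of the generalized ML pipeline: each stage is a placeholder.
from typing import List

def preprocess(strokes: List[list]) -> List[list]:
    # e.g., resampling, smoothing, size normalization
    return strokes

def segment(strokes: List[list]) -> List[List[list]]:
    # e.g., group strokes into candidate symbols
    return [[s] for s in strokes]

def extract_features(symbol_strokes: List[list]) -> List[float]:
    # e.g., directional histograms, geometric descriptors
    return [float(len(symbol_strokes))]

def classify(features: List[float]) -> str:
    # e.g., an SVM / CNN / BLSTM classifier; here a dummy label
    return "x"

def recognize_expression(symbols: List[str]) -> str:
    # e.g., grammar-based or attention-based structural analysis into LaTeX
    return " ".join(symbols)

def pipeline(strokes: List[list]) -> str:
    symbols = [classify(extract_features(s)) for s in segment(preprocess(strokes))]
    return recognize_expression(symbols)

print(pipeline([[(0, 0)], [(1, 1)]]))   # -> "x x"
```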

  • RQ6. What kind of techniques is/are used in the sub-process?

It has also been observed that specific procedures or specialized techniques are followed to execute the phases mentioned above efficiently in both approaches. This research question focuses on highlighting the most frequently used specialized procedures and techniques for executing each stage of the problem in both approaches. The authors noticed that some studies used several different methods for preprocessing, segmentation, classification, and so on. A summary of the experimental techniques used in the sub-processes of recognition is presented in Table 18. Note that only papers in which some special technique is used in a sub-process have been considered here.

Table 18 Sub-processes analysis concerning the reviewed studies
  • RQ7. Which ML Technique outperforms other ML techniques?

About 60% of the selected studies used ML techniques, which are compared here with the other ML methods. To reach conclusions for this research question, the performances of different ML techniques have been compared based on the accuracy measure/metric used. By comparing the accuracy results, an appropriate answer to this question can be retrieved and analyzed; still, it would be unfair to compare values belonging to different accuracy metrics. So, the authors decided to identify the most frequently used measure, which could be taken as a standard metric for accuracy analysis, and by comparing the corresponding accuracy values the authors could determine which technique outperforms the rest. The steps involved in reaching conclusions for RQ7 are as follows:

  1. Construct a table detailing, for each study, the technique used, the dataset, the accuracy measure/metric, and the accuracy value.

  2. Identify the accuracy metric that has been used most often.

  3. Compare the accuracy values corresponding to the identified metric (from step 2) across the different ML techniques used.

Note: this comparison procedure is designed because different studies used different datasets and various accuracy measures, so it would not be justified to compare accuracy values irrespective of the accuracy metric used. The authors believe that the conclusions would be more robust if performance could be evaluated on the same datasets with the same accuracy measure, but the lack of standardization forced us to assess the results using the procedure defined above.
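A minimal sketch of this three-step procedure is given below. The records other than S15 and S23 (whose ExpRate values appear later in this section) are invented placeholders, and the table only illustrates the filtering-and-comparison logic, not the actual contents of Table 17.

```python
# Sketch of the three-step comparison procedure for RQ7 on illustrative records;
# only the S15 and S23 values come from the text, the rest are made up.
from collections import Counter

records = [
    {"study": "S15", "technique": "SVM", "metric": "ExpRate", "value": 68.07},
    {"study": "S23", "technique": "SCFG+SVM", "metric": "ExpRate", "value": 68.07},
    {"study": "S40", "technique": "CNN", "metric": "ExpRate", "value": 52.80},        # placeholder
    {"study": "S41", "technique": "BLSTM", "metric": "symbol accuracy", "value": 92.10},  # placeholder
]

# Step 2: identify the most frequently used accuracy metric.
most_common_metric, _ = Counter(r["metric"] for r in records).most_common(1)[0]

# Step 3: compare techniques only on studies reporting that metric.
comparable = [r for r in records if r["metric"] == most_common_metric]
best = max(comparable, key=lambda r: r["value"])
print(f"Most used metric: {most_common_metric}")
print(f"Best on that metric: {best['technique']} ({best['value']}%) in {best['study']}")
```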

Table 17 allows for an easy analysis that reveals ExpRate to be the accuracy metric utilized most of the time. The expression recognition rate is the proportion of correctly recognized expressions to the total number of expressions. On analyzing the values reported against ExpRate, the highest accuracy value observed is 68.07%. This ExpRate is obtained by applying SVM on the Handsmath dataset (using an augmented incremental approach) (refer to S15). Thus, the highest accuracy rate is found as the outcome of SVM, leading to the conclusion that SVM outperformed all other ML models used by the different studies.
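In other words, the metric can be written as

$$\mathrm{ExpRate} = \frac{\text{number of correctly recognized expressions}}{\text{total number of expressions}} \times 100\%.$$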

NOTE: research articles that only summarize the winning models of the CROHME competitions have been excluded from the comparison; independent studies have been taken for performance analysis and comparison.

  • RQ8. Which ML techniques outperform other non-ML methods?

As for non-ML approaches, about 40% of the selected studies used non-ML techniques. The non-ML methods usually followed the recognition steps of symbol segmentation, symbol recognition, and structural analysis. Fewer clear trends are observed in the kinds of non-ML techniques used, so when comparing their performance with ML techniques, only the studies that evaluated performance using the identified accuracy metric are considered. Among the accuracy values reported against the expression recognition rate, the leading ExpRate is 64.9%, obtained by implementing a Gaussian model. This accuracy rate is lower than the rate produced by the SVM model. Thus, it can be concluded that SVM (an ML model) outperformed the non-ML techniques.

It must be noted that a study abbreviated as S23 reported an ExpRate of 68.07%, the same as that reported by study S15. In S23, the Cocke–Younger–Kasami (CYK) algorithm is used to parse the two-dimensional (2D) structures of online handwritten MEs, with the MEs encoded in the form of a stochastic context-free grammar (SCFG); however, this research also utilized SVM classifiers. This ExpRate is evaluated on the Handsmath dataset. Thus, the accuracy rate equivalent to that of S15 (the study that holds the maximum expression recognition rate) is achieved by a hybrid approach, i.e., by applying both ML and non-ML models. Such combined approaches are out of the scope of this review.
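For readers unfamiliar with the parsing machinery, the following minimal sketch runs probabilistic CYK over a toy SCFG in Chomsky normal form. The grammar, its probabilities, and the 1D token input are invented simplifications; systems such as S23 parse 2D stroke layouts rather than linear sequences.

```python
# Minimal probabilistic CYK sketch over a toy SCFG in Chomsky normal form.
from collections import defaultdict

# Terminal rules: nonterminal -> (terminal, probability)
lexical = {
    "E": [("a", 0.3), ("b", 0.3)],
    "P": [("+", 1.0)],
}
# Binary rules: nonterminal -> (left child, right child, probability)
binary = {
    "E": [("E", "R", 0.4)],   # E -> E R   ("expression, then plus-rest")
    "R": [("P", "E", 1.0)],   # R -> P E
}

def cyk(tokens, start="E"):
    n = len(tokens)
    # table[i][j] maps nonterminal -> best probability of deriving tokens[i:j]
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for lhs, rules in lexical.items():
            for term, p in rules:
                if term == tok:
                    table[i][i + 1][lhs] = max(table[i][i + 1][lhs], p)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rules in binary.items():
                    for left, right, p in rules:
                        score = p * table[i][k][left] * table[k][j][right]
                        table[i][j][lhs] = max(table[i][j][lhs], score)
    return table[0][n][start]

print(cyk(["a", "+", "b"]))   # probability of the best parse, ~0.036 here
```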

  • RQ9. Which are the dominant journals/conference proceedings for papers analyzing HMEs recognition?

We identified 98 studies in the field of recognition of HMEs, published during the period 2000–2021. Among these selected studies, 45 (about 45%) papers were published in conference proceedings, 34 (approximately 35%) articles appeared in journals, 8 (8%) articles are from workshops, and the rest are taken from symposiums, forums, meeting proceedings, technical reports, and theses. The analysis of the studies selected from journals is presented in Fig. 20.

Fig. 20 Distribution of studies by different journals

The studies selected for the review are taken from 14 different journals, presented in Fig. 20. It is observed that the dominant journal in this research domain of HMEs is the International Journal on Document Analysis and Recognition (IJDAR), which contributed about 22% of the studies selected from journals, followed by Pattern Recognition Letters and Pattern Recognition, which ranked equally according to the figure and together contributed 42% of the journal-sourced studies.

The authors have also analyzed the selected studies based on the conference proceedings to which they belong. The specified set of papers is taken from 22 different conferences. The dominant conference in this category is the International Conference on Frontiers in Handwriting Recognition, followed by ICDAR.

A summary of the dominant journals, conferences, and workshops in this domain is presented in Table 19. The International Journal on Document Analysis and Recognition (IJDAR) has been the most dominant journal; the International Association for Pattern Recognition is the sponsoring agency for this reputed journal. The goal of this journal is to publish articles related to document analysis and document recognition. It invites articles of four different types: ‘original research papers’, ‘system descriptions’, ‘correspondence’, and ‘overviews and summaries’. It also focuses on special issues targeting active areas of research.

Table 19 List of dominating journals, conferences, and others used by reviewed studies

6 Summary of findings

The authors have thoroughly reviewed and presented the results of the systematic review analysis performed on recognition techniques for HMEs. We have followed the guidelines designed by [96] and applied inclusion-exclusion criteria to the studies retrieved by running a formulated search string on digital libraries such as Scopus, IEEE Xplore, Science Direct, Wiley, Springer, and the ACM Digital Library. The period of studies considered for this review is from January 2000 to June 2021. The last detailed report on this research topic was published in 2012 [204], and the aspects of analysis considered in that study are entirely different from the systematic approach used by this review. It should also be noted that the authors have chosen the potential studies out of the candidate studies based on how well each study can answer the formulated research questions; citation counts were given the least attention, and quality assessment scores were strictly followed when choosing a study for the review.

The authors also believe that such comprehensive and systematic work has not been performed before in this research domain of recognition of HMEs. In the initial stage of screening papers, before performing the quality assessment, the number of candidate studies was as high as 202; these studies were collected, analyzed, and stored in the screening database for reference purposes. It is found that very few review studies have been performed to date on this research theme. We analyzed these reviews to summarize which aspects and review concepts have been covered, which helped give a better direction before framing this study. During this analysis, it was also observed that no existing review study has used systematic literature review guidelines and frameworks to review this challenging research domain. Thus, there is a need for a new analysis that covers the gap of 8 or 9 years and also adds a new systematic dimension to the analysis of this topic.

Indeed, the idea behind conducting this SLR is to summarize the work done in this field and present several review perspectives to present and future researchers, helping them generalize and revise the prerequisites required to know about this subject area. Moreover, it covers the techniques used and analyzes the metadata related to the selected studies. Based on the quality assessment criteria, the reviewers chose 98 studies out of the 202 retrieved studies; these studies were selected based on their relevance and quality score according to the quality assessment questions. The 98 articles are published in 14 leading journals, 22 premier conferences, eight established workshops, and other sources such as symposiums, technical reports, and theses.

The authors have analyzed all possible aspects required for reaching conclusions and finding answers to the research questions. The data from the selected studies are analyzed qualitatively and quantitatively using narrative, vote counting, and grounded theory methods. The review aims to segregate the studies based on the technique used (ML/non-ML) to recognize handwritten mathematics. The focus is not limited to analyzing the methods and comparing their accuracies; the authors also analyzed the datasets used and the kind of handwritten expressions used in those datasets for experimentation. The primary findings of this study are summarized in Table 20.

Table 20 List of findings as per the formulated research goals

Apart from the primary results and findings drawn from the selected studies of the SLR, the authors have extended the findings by analyzing other details of the research publications associated with this research area. The summary of findings from the metadata extracted from the studies included in the SLR is tabulated in Table 21.

Table 21 Meta-analysis listing and findings from the reviewed studies

The above summarizes the significant findings from the research studies referred to and reviewed while conducting this SLR. Although much more could be added about the results, we have tried to summarize the research findings clearly and crisply by tabulating and listing the highlights and key conclusions extracted from the study. Finally, in a nutshell, we want to highlight that the primary purpose of this review is to analyze the trend of techniques used to recognize HMEs. Though the research questions are formulated especially targeting the ML-based studies published between 2000 and 2019, the SLR has also analyzed the studies using non-ML approaches as well as deep learning models, thereby extending the scope to a broader scale of review. This SLR follows an entirely different approach than the other review studies performed since 2000, and to the best of our knowledge it is the first SLR ever performed on this research domain of recognition of HMEs. This SLR fulfills all the described objectives by determining:

  • the new priorities of researchers regarding the recognition techniques, datasets, types of handwritten expressions, accuracy metrics, approaches, and sub-techniques employed in their studies;

  • which technique outperforms the other methods employed for recognition, based on a comparison of the reported experimentation results;

  • how the studies are distributed across several metadata aspects such as recognition technique, publication year, authors, country, journals, conferences, keywords, and affiliations.

7 Limitations of this study

Taking into account the above findings, we have noticed certain limitations of this study. In places, the scope could be made broader by formulating more research questions that analyze in greater depth the techniques used for recognition over a larger set of MEs. Researchers could also explore, in an extended version of this study, a categorical analysis of the features extracted in the ML-based studies; this study has restricted its scope to examining the techniques based on the ML/non-ML methods used. Another dimension could be to analyze the studies based on combined approaches, which use some ML classifiers together with some non-ML methods for recognition.

Further, this review is constrained because it restricts its scope to generic HMEs, with no examination of the identification and recognition of MEs written in different scripts and languages such as Arabic, Chinese, Gurmukhi, Devanagari, and so on. There is thus a need for a broader, more comprehensive review dealing with HMEs written using different scripts and languages. A detailed study of all features of written math expressions could also be undertaken, as this review discusses features less thoroughly. We sum up these limitations here and leave them to be resolved in future studies.

8 Conclusions and future scope

  • Many of the datasets used in the studies lack a large expression corpus. There is a need for a standard dataset, a complete corpus built using expressions of all forms, ranging from high school equations to complex scientific expressions.

  • Although many of the offered methods and approaches have been successful in achieving high levels of accuracy in their outcomes, there is still a lack of unified procedures in this field to evaluate the effectiveness of those methods.

  • Considering there was no freely accessible public dataset of HMEs before 2011, researchers were compelled to collect and construct sets of MEs on their own, which tend to be restricted to a subset of expressions or particular domains. As a result, the comparative examination of the various methods is made more difficult, and a side-by-side comparison of the respective performance levels of different recognition models and systems is often impractical.

  • No standard accuracy measurements have been developed for the recognition tasks in this domain. This absence is what causes the inconsistent choice of metrics used in the research to evaluate performance and carry out the experiments, and it makes direct comparisons between accuracy values produced with different metrics difficult. Therefore, there is a need for a well-defined standard accuracy metric to be adopted and experimented with.

  • There are several ways of representing MEs. Many systems build trees to represent the expressions resulting from structural analysis, while others use parsing techniques and express the grammar using context-free grammar rules. Other representations include binary trees and baseline structure trees, the latter capturing the hierarchical structure of baselines in a mathematical expression. Several procedures have been proposed and implemented using representations like the bounding box, body box, and hidden writing area (HWA), even for stroke recognition. The variety of representations used in the recognition process, together with the different classifiers and feature extractors, broadens the research zone, and there is a need for a separate comparative study analyzing recognition systems based on the kind of representation method used (a minimal sketch of one such tree representation follows this list).

  • As the field of recognition of HMEs has been advancing considerably over the past decade, this review concludes that there is an urgent need to standardize the different evaluation measures (accuracy metrics) and to make use of standard, commonly used datasets so that better conclusions can be drawn about the performance of different recognition systems. A public benchmarking system should be developed to ease and facilitate the comparative analysis of the achievements of the various recognition systems.

  • It is observed that different classification techniques perform differently when combined with different models and other ML/non-ML methods and applied to different datasets; the outputs vary and are highly incomparable in these cases. So, standardized datasets, accuracy measures, and a benchmark system need to be established after thoroughly analyzing, revising, and examining the implementation of the pattern recognition techniques used in this research domain.
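Relating to the bullet on representations above, the following minimal sketch shows one way a tree-based representation of an ME might look, with symbols linked by spatial relations and converted to LaTeX. The relation names and the conversion rules are illustrative choices, not a standard taken from any specific reviewed study.

```python
# Minimal sketch of a layout-tree representation of a math expression.
# Nodes carry a symbol plus spatially related children ('sup' for superscript,
# 'sub' for subscript, 'right' for the next baseline symbol).
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SymbolNode:
    symbol: str
    relations: Dict[str, "SymbolNode"] = field(default_factory=dict)

def to_latex(node: SymbolNode) -> str:
    out = node.symbol
    if "sub" in node.relations:
        out += "_{" + to_latex(node.relations["sub"]) + "}"
    if "sup" in node.relations:
        out += "^{" + to_latex(node.relations["sup"]) + "}"
    if "right" in node.relations:
        out += " " + to_latex(node.relations["right"])
    return out

# x^2 + 1 as a layout tree: 'x' has a superscript child '2' and a baseline
# successor '+', which in turn is followed by '1'.
tree = SymbolNode("x", {
    "sup": SymbolNode("2"),
    "right": SymbolNode("+", {"right": SymbolNode("1")}),
})
print(to_latex(tree))   # -> x^{2} + 1
```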

In the end, the authors draw the attention of future researchers to the implementation and accuracy of the various recognition techniques and to the need for a more systematic review of deep learning methods, which have proven to produce significantly enhanced performance. Also, as already mentioned, a standard model for the comparison of different techniques is a requirement. Another central point is that different classification techniques are evaluated under different experimental conditions, on different datasets, and with varying accuracy metrics. Thus, standard accuracy measures should be adopted for better comparisons in the future.