1 Introduction

Although an enormous amount of work has been done on mathematical symbol recognition, more precise recognition of mathematical symbols remains a major concern. The applications of handwritten input in electronic devices are vast, which is why handwritten symbol and character recognition technology has captured the interest of academics and industry experts [19]. Optical Character Recognition handles handwritten symbol and character recognition efficiently, but with a strong dependency on contextual information, which is difficult to exploit for handwritten mathematical formulas.

According to the authors of [14], the increased accessibility of pen- and touch-based devices such as smartphones, tablets, personal computers, and smartboards has proliferated interest in mathematical symbol recognition, and the field has kept growing. Still, existing systems are far from perfect because of complications arising from typesets with many similar-looking symbols or characters and from the two-dimensional nature of mathematical input. Recently, handwriting recognition has used encoder-decoder models for image-to-text transformation [73, 82]. The recognition trend has evolved considerably since this field was first explored.

Handwriting recognition is divided into two categories: offline and online. “Off-line” recognition works on images of written text captured by optical image scanning or Intelligent Word Recognition, whereas “on-line” recognition relates to the movement of the pen tip in contact with a pen-based computer screen surface [45].

Mathematics, as the “unique international scientific language,” is free from racial, national, and cultural constraints, and mathematical formulas are an essential part of all kinds of documents. The development of information processing requires document image processing systems mature enough to handle characters, tables, and images well [61]. This challenging domain then attracted more attention, and in the 1990s mathematical formula recognition became a hot topic in pattern recognition. The author of [8] surveyed different techniques for symbol segmentation and recognition for online and offline math recognition.

A mathematical symbol recognition system must be robust enough to handle a wide variety of distinct symbols.

The current literature highlights that Machine Learning (ML)-based models have been extensively used for recognition, and recent studies show them achieving significant accuracy rates. For instance, the research in [52] uses a K-NN classifier and achieves 90% accuracy, while another study [46] uses a CNN as the recognition technique and achieves a competitive accuracy of 93.5%.

All the above considerations and significant strands of research have inspired us to summarize and extract the findings of current research on ML approaches for mathematical symbol recognition. To this end, this review describes an SLR of the studies published until 2021, following the SLR guidelines. An appreciable amount of work has been done in this domain in the past few years. In this SLR, the authors cover seven research questions. Studies for this SLR are extracted from widely used digital libraries, i.e., Scopus, Springer, IEEE Xplore, and Google Scholar, based on inclusion/exclusion factors and quality assessment scores.

The paper is organized as follows. In section 2, the authors cover the necessary background for this review. In section 3, they detail the entire review process, discussing the procedural steps involved in conducting this SLR. Section 4 presents detailed observations answering the formulated research questions; each discussion of a research question is the result of extensive analysis and data extraction. In section 5, the authors summarize the gist of the observations under the heading “summary of findings”. In section 6, conclusions and recommendations for future work are offered. The overall organization of the paper is presented in Fig. 1.

Fig. 1 Organization of the paper

1.1 Motivation for research

  1. Recognizing handwriting is an evolving research domain, and mathematical symbol and expression recognition has been a proactive research theme within it for the past 40 years. This systematic review aims to summarize the recognition models reported in the literature and to explore that literature along its crucial dimensions and directions.

  2. The landmark survey in this domain was published in 2012 [78]. The nine years since then have witnessed the emergence of, and a shift towards, other implementation models and recognition techniques. Considering this progressive research on recognizing handwritten math text, we recognized the need for a comprehensive, systematic compilation of the existing literature that also presents the research challenges for future study.

  3. In terms of scientific motivation, recognizing handwritten MEs is representative of two-dimensional pattern recognition, so recognition techniques for MEs can be applied to other notations such as flow charts and diagrams. In terms of academic motivation, recognizing handwritten math notation can help both teachers and students retrieve information by identifying mathematical notation; a set of relevant documents can be obtained using an ME as a query.

  4. Another main motivation for reviewing systems that identify mathematical text is that MEs and related text can then be entered into machines and devices easily. Inputting directly from a handwritten source is far more natural than using pre-existing formats for math descriptions such as TeX or MathML. Once such an expression is correctly interpreted, it can be pasted into an article, evaluated by a mathematical program, or plotted as a graph. Thus, compiling the literature on recognizing online handwritten mathematical symbols is the chosen subject of this paper.

1.2 Our contributions

  • A systematic investigation has been conducted to study various recognition models, classifying the work based on the kind of technique used for recognition (ML/non-ML).

  • The identified procedures and techniques have been compared and categorized based on their kind, segmentation, and feature extraction techniques. The datasets deployed to implement the recognition models have been identified, and the corresponding metrics used for accuracy have been extracted. Further, the relationship between accuracy metrics and dataset size is shown in Section 4.6.

  • Statistical results (yearly publications, type of publications, dataset frequency analysis) related to handwritten mathematical symbol recognition papers have been presented.

  • Future research directions relating to the compiled literature on handwritten mathematical symbol recognition have been outlined.

1.3 Related surveys

Earlier surveys by Chan [8] and Zanibbi [78] have been very influential. Still, as research in mathematical symbol recognition has grown consistently, a systematic literature survey is required to narrow down the field’s existing research. Chan reviewed recognition processes such as segmentation and feature extraction based on the online and offline handwritten text classification. Zanibbi explored research works based on representation models and identified the challenges for recognizing and retrieving MEs, identifying the critical problems in math recognition: detecting expressions, detecting and classifying symbols, analyzing symbol layout, and constructing a representation of meaning. Another survey, by Firdaus [22], focuses on the merits and demerits of several recognition techniques for mathematical symbols. The recent survey report by Sakshi [62] presents statistical extractions derived from the combined literature on mathematical symbols and expressions, leaving decent scope for a literature investigation focused purely on isolated symbol recognition.

This review will be a valuable addition to the existing literature on handwritten mathematical symbols, as it is the first systematic survey that aims to compile the literature comprehensively concerning recognition-model trends in the context of symbol recognition and to observe the research challenges concerning the root problems.

2 Background

2.1 Mathematical symbol recognition

ME recognition mainly includes two steps: symbol recognition and structural analysis. Structural analysis determines the spatial relationships between symbols and interprets the expression [1, 4], (Carlos Ramirez-Pina et al., 2019), and symbol recognition forms its basis. Symbol recognition involves two phases: symbol segmentation and isolated symbol recognition. An online ME is a set of input strokes, and a mathematical symbol may comprise more than one stroke. Converting the stroke sequence into a set of symbols is called symbol segmentation; the segmented symbols are then classified in the isolated symbol recognition stage [8, 29].

The symbols that appear in MEs include mathematical symbols, multi-font characters, and Greek letters. Extracting a symbol correctly from an ME is not an easy task. The process of extracting mathematical symbols from an ME is termed mathematical symbol recognition and requires specific steps. Most recognition systems have three main stages: preprocessing, feature extraction, and matching. The preprocessing stage provides the processing environment for subsequent feature extraction. Feature extraction is the most crucial step in pattern recognition; static and dynamic properties, such as descender or no descender and dot or no dot, are factors in feature analysis (Tappert, 1990). Selecting the most likely candidate for an unknown input character is the process called matching [42].
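
To make these three stages concrete, the following minimal Python sketch chains toy preprocessing, a simple feature vector, and nearest-template matching; the 32 x 32 grid, the feature choice, and the function names are illustrative assumptions and are not drawn from any of the reviewed systems.

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Binarize and size-normalize a raw grayscale symbol image (toy version)."""
    binary = (image > image.mean()).astype(np.float32)
    # Crop to the bounding box of the ink and place it on a fixed 32x32 grid.
    ys, xs = np.nonzero(binary)
    cropped = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    out = np.zeros((32, 32), dtype=np.float32)
    h, w = cropped.shape
    out[:min(h, 32), :min(w, 32)] = cropped[:32, :32]
    return out

def extract_features(image: np.ndarray) -> np.ndarray:
    """A toy feature vector: flattened pixels plus overall ink density."""
    return np.append(image.ravel(), image.mean())

def match(features: np.ndarray, templates: dict) -> str:
    """Return the label of the closest template (nearest-neighbour matching)."""
    return min(templates, key=lambda lbl: np.linalg.norm(features - templates[lbl]))

# Toy usage: classify a random "image" against two random templates.
templates = {"x": np.random.rand(32 * 32 + 1), "+": np.random.rand(32 * 32 + 1)}
sample = np.random.rand(40, 40)
print(match(extract_features(preprocess(sample)), templates))
```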

2.2 Types of mathematical objects

Mathematical objects are the units of communication used by mathematicians in mathematical language, and they form the kernel of a mathematical document. A mathematical object can be a set of elementary mathematical objects arranged according to the grammar rules of mathematical language, as shown in Fig. 2. An elementary mathematical object is a small entity that carries meaning and cannot be divided into other mathematical objects. Mathematical objects, or sets of mathematical objects interconnected or arranged by operators, are MEs [31].

Fig. 2 Types of mathematical objects

2.3 Types of mathematical symbol recognition

Handwritten character pattern recognition, including mathematical symbol recognition, generally falls into two categories: online and offline methods [43]. Online patterns are time-ordered sequences of strokes, where a stroke is the pen-tip or fingertip trajectory recorded by a pen-based or touch-based device [83]; earlier work recognized such patterns with Markov Random Fields. Offline patterns are 2D images captured with a scanner or camera and have been recognized using the Modified Quadratic Discriminant Function with directional feature extraction. If the temporal and structural information is discarded from an online pattern, methods used to recognize offline patterns can also identify it.

Forward-backward neural network algorithms such as Bidirectional Long Short-Term Memory (BLSTM) networks with a Connectionist Temporal Classification (CTC) output layer have outperformed state-of-the-art Hidden Markov Model-based systems at recognizing online handwriting. Combining online and offline information to extract a set of hybrid features, and using Hidden Markov Models and BLSTM networks for classification, was recently proposed by the authors of [3]. Raw images used as local offline features by a BLSTM, alongside the pen-tip curve, extensively outperformed the Hidden Markov Model in symbol recognition rate.
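
As a hedged illustration of the architecture described above, the PyTorch sketch below wires a bidirectional LSTM to a CTC output layer; the four trajectory features per time step, the 100 symbol classes, and the random tensors standing in for data are assumptions of this sketch, not the configuration used in [3].

```python
import torch
import torch.nn as nn

class BLSTMSymbolRecognizer(nn.Module):
    """Bidirectional LSTM over a stroke/feature sequence with a CTC output layer."""
    def __init__(self, num_features: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.blstm = nn.LSTM(num_features, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        # One extra output unit for the CTC blank label (index 0).
        self.fc = nn.Linear(2 * hidden, num_classes + 1)

    def forward(self, x):                        # x: (batch, time, num_features)
        out, _ = self.blstm(x)
        return self.fc(out).log_softmax(dim=-1)  # (batch, time, num_classes + 1)

# Hypothetical training step on one batch of online pen-trajectory features.
model = BLSTMSymbolRecognizer(num_features=4, num_classes=100)
ctc = nn.CTCLoss(blank=0)
x = torch.randn(8, 60, 4)                        # 8 sequences, 60 time steps
targets = torch.randint(1, 101, (8, 5))          # 5 symbol labels per sequence
log_probs = model(x).permute(1, 0, 2)            # CTCLoss expects (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 60),
           target_lengths=torch.full((8,), 5))
loss.backward()
```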

2.4 Characteristics of mathematical symbols

According to the relationships defined among the symbols used in mathematical expressions, symbols exhibit several characteristics, displayed as follows:

  • The scope of fractions, roots, and matrices is determined by their “core symbol”, which gives meaning to sub-expressions, such as “-”, “√”, “(”, “[”, “{”, “|”, etc. These core symbols can be further divided into two types: fence symbols and binding symbols [9].

  • Fence symbols serve as fences around an expression so that it is treated as a unit. For instance, enclosed entities are grouped into a unit with the help of parentheses; for example, “(ab + c)” is a unit that holds the sum of the product of a and b with c.

  • Some symbols are termed binding symbols; they purposely bind together several entities of an expression and give meaning to the expression as a whole. Binding symbols such as the fraction line \( x/y \), the radical \( \sqrt{x} \), and the summation \( \sum x_i \) dominate their neighboring expressions: they bind several entities and represent them as a single unit. For example, \( \sum_{i=1}^{n} X_i \) represents the sum of n terms, i.e., X1 + X2 + X3 + … + Xn.

  • According to [54], the scope of scripts and limits is determined by the previous and subsequent symbols on the same line. For example, pre-superscript, upper script, superscript, pre-subscript, lower script, and subscript are the possible positions/scripts around a symbol.

  • Further, there are operator symbols, which include operators for evaluation such as +, −, /, ÷, and *. These operator symbols dominate and operate on their operands. As mentioned in the findings of [8], operators can be classified into two types, namely explicit and implicit operators. Explicit operators are operator symbols, while implicit operators include spatial operators, which account for geometrical properties defined in physical space.

3 Systematic review process

The procedure of this SLR follows the strategy recommended by Kitchenham [39]. It comprises the planning, the execution, and the description of the result analysis. The first stage in the SLR procedure is the planning stage, in which we classify and build up the review protocol comprising the steps presented in Fig. 3.

Fig. 3 Systematic Literature Review (SLR) Procedure

After the planning stage, a series of steps is carried out. In the first step, the research questions associated with the SLR’s objectives are posed. In the second step, the reviewers design search tactics to determine the potential studies that can answer the formulated research questions; the primary task of this stage is to identify and corroborate the search terms (formation of the search string) and the literature resources for extracting the set of studies for this review. According to the framed research questions, phase 3 of this SLR deals with extracting the relevant and appropriate studies from the chosen digital data sources. The fourth step applies the inclusion/exclusion parameters to decide which studies are included in and excluded from the SLR.

Further, quality assessment criteria are used to determine individual studies’ strength and quality by forming a quality assessment questionnaire and using fuzzy linguistic variable values to assess the studies. In the second-to-last step, the research questions formulated in this study are answered based on the information gathered and stored in the data extraction form. The final step comprises analyzing the extracted data stored in the data extraction form in accordance with the designed research questions described in Table 1. The flow of the SLR procedure is depicted in Fig. 3. Developing the review protocol is an essential step in an SLR to minimize the risk of research bias.

Table 1 Research Questions

3.1 Research questions

This systematic analysis evaluates the empirical evidence from different studies using ML and non-ML techniques for mathematical symbol recognition. The formulated research questions are presented in Table 1.

To cover all the aspects to be investigated in mathematical symbol recognition, the review group chose the research questions above. The authors focused on ML techniques, corpora, accuracy measures employed, and publication sources for mathematical symbol recognition.

3.2 Search tactics

Our search approach is split into two stages for identifying and downloading studies: (a) primary search and (b) secondary search.

Following are the procedural steps for primary search:

  A. Identify the keywords from the research questions.

  B. Identify synonyms and related words for the terms from step A.

  C. Use the Boolean operator OR to combine synonyms and related words, and the Boolean operator AND to join the significant terms.

The review group investigated the bibliographical content of the chosen primary set of studies to look for those studies that could have been missed or relatively ignored in the primary search. This falls under our secondary search phase. The string used for search is:

“mathematical symbol recognition” OR “Handwritten Formula Symbol Recognition” OR “Math symbol recognition” OR “Math symbol retrieval” OR “Mathematical symbol retrieval” OR “handwritten mathematical symbols.”

Related terms and synonyms were combined using the Boolean operator OR, and the main terms were united via the Boolean operator AND, as complex search terms have been constructed in similar systematic reviews. The search string is thus framed to fetch all accessible and relevant studies in the databases. The articulated search string, based on the Boolean operators OR and AND together with all possible synonyms, is run on the selected digital libraries. For example, we combined the synonyms of recognition and retrieval using the OR operator because some studies use mathematical symbol recognition while others use MS retrieval. Searches have been done in the same way for the “maths” and “mathematical” terms: the authors used maths* and mathematical*, so that studies using the words maths, mathematical, or mathematics are all matched by the framed search string.
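
For illustration only, the snippet below shows one way such a Boolean search string could be assembled programmatically; the helper function is hypothetical and simply mirrors the quoted synonyms joined with OR.

```python
# Synonym list mirroring the terms quoted in the search string above.
synonyms = [
    "mathematical symbol recognition",
    "Handwritten Formula Symbol Recognition",
    "Math symbol recognition",
    "Math symbol retrieval",
    "Mathematical symbol retrieval",
    "handwritten mathematical symbols",
]

def build_query(terms):
    """Join quoted synonyms with OR, as accepted by most digital libraries."""
    return " OR ".join(f'"{t}"' for t in terms)

print(build_query(synonyms))
```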

The following are the sources of searching and selecting the primary studies for this literature review:

  • IEEE Xplore

  • Scopus

  • Springer

  • Google Scholar

The above digital databases are chosen for data collection because, according to the AMSTAR checklist, at least two databases, as well as at least one supplemental search, should be searched (Shea et al., 2017). Furthermore, the most recent PRISMA 2020 [55] standards suggest that research for SLR should be retrieved from multiple databases. To mitigate the consequences of varying coverage between particular databases, specific authors in other domains generally recommend a minimum of two databases [26, 51]. As a result, the authors could complete their search on IEEE, Scopus, and Springer, which covered the majority of the relevant studies. To avoid missing any significant papers for the review, the authors expanded their Google Scholar search to include all relevant and semi-related studies. Furthermore, these four digital libraries drew a sizable number of relevant studies.

Studies published until 2021 have been gathered. Each digital library fetched a specific number of articles according to the search string. A total of 648 studies were retrieved based on the search terms. The titles and abstracts of these 648 studies were investigated, resulting in 185 studies. Thereafter, duplicate studies were removed, and 94 studies were chosen at the end of the primary search phase. The secondary search for more relevant studies was performed by considering the references of the selected set of studies; this phase added 12 significant studies missed in the initial search. Therefore, the total number of studies chosen for this SLR is 106, based on our primary and secondary search phases. Figure 4 depicts the steps used for determining and selecting the studies for this review.

Fig. 4 Stepwise selection and filtering of studies

3.3 Study selection

The selection of studies is initially based on the relevance of their titles and abstracts. The review group then employed two more phases to obtain more reliable and consistent literature, namely inclusion-exclusion and quality assessment. Firstly, the inclusion and exclusion criteria are applied to the collected corpus of studies, which helped narrow the decision process while selecting studies for this review. Secondly, quality assessment criteria are implemented for further filtration of the chosen studies. Table 2 lists the factors defining our inclusion/exclusion criteria.

Table 2 Inclusion-Exclusion Factors

3.4 Quality assessment criteria

Quality assessment is carried out to determine the integrity and relevance of the studies considered; it is a further method for selecting studies from the raw corpus. First, the inclusion and exclusion criteria are applied to the studies, and then questions are framed to assess the quality of the studies and obtain potential studies for the literature review. Studies with low scores are excluded. The questions considered for the quality assessment are given below:

  • Q1. Is the research aim unambiguously stated?

  • Q2. Are the estimation methods well defined and deliberated?

  • Q3. Is the experiment applied to sufficient datasets?

  • Q4. Is the accuracy of estimations measured?

Related work on quality assessment criteria was studied, and various meetings and discussions were held among the authors. After these meetings, we finalized the fuzzy linguistic variable method to assess the quality of the selected studies. Fuzzy linguistic variables are used to score the quality assessment questions, and this SLR uses a score from 0 to 1 for each question. Assigning only crisp values is not considered a fair way to measure a specific quality question, and this is the idea behind the fuzzy linguistic variables. For example, consider the quality question “Is the experiment applied on a sufficient number of datasets?” If a study uses five datasets and we assign a precise value of 1, the question arises what value should be assigned when a study uses only 2 or 3 datasets; assigning 0 would be unfair. To avoid giving crisp values, exact values are avoided in this study and the fuzzy linguistic variable approach is used. The score range of each question is mentioned in Table 3.
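
A short Python sketch of this fuzzy-linguistic scoring idea follows; the numeric cut-offs assigned to each linguistic label are illustrative assumptions, since the actual ranges are defined in Tables 3 and 4 rather than in the text.

```python
def question_score(evidence: float) -> float:
    """Clamp a judged degree of fulfilment to the [0, 1] range used per question."""
    return max(0.0, min(1.0, evidence))

def cumulative_label(total: float) -> str:
    """Map the summed score of the four questions to a linguistic label (assumed cut-offs)."""
    if total < 1.0:
        return "no"
    if total < 2.0:
        return "rarely"
    if total < 2.5:
        return "partly"
    if total < 3.5:
        return "mostly"
    return "yes"

scores = [question_score(s) for s in (1.0, 0.75, 0.5, 1.0)]  # Q1-Q4 judgments
print(sum(scores), cumulative_label(sum(scores)))            # 3.25 mostly
```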

Table 3 Range and Linguistic score of Quality Assessment

Labeling and the total score for each considered question are shown in Table 4.

Table 4 Cumulative Assessment Score with Labels

In this study, five linguistic variable values are considered, as mentioned in Table 4. Studies with the values no and rarely are excluded from the corpus, as these are regarded as low scores; we discarded a further 23 studies as a result of the quality assessment score. Studies with the score values partly, mostly, and yes are considered confirmed studies for this SLR. Studies scored as “yes” are judged high-quality studies, and studies with a score value of partly or mostly are considered partially good studies. The score of each study is shown in Table 5.

Table 5 Quality assessment scores and their linguistic variable

High-impact studies can be highlighted for our readers; for example, studies that answered most of the research questions can be considered high-impact. After all the steps followed in selecting the final corpus, we finalized 52 relevant studies for this SLR. Table 13 in Appendix A, at the end of the paper, lists the abbreviations in full.

3.5 Data extraction

For every research question formulated for this review, separate Excel files were prepared. For instance, the count of publications per year had its own record, while other data extraction files managed the technique used, the dataset deployed, and other associated factors. One of the authors extracted the raw data from the selected studies, another arranged it as required for this review, and the third author thoroughly revised and validated the authenticity of the data arranged after extraction.

The final studies of this SLR, together with the research questions they address, are shown in Table 6. Not every study needs to answer every research question formulated in the SLR. The chosen final list of studies is scored based on two factors: the quality assessment score and the number of questions addressed by the study. Studies are scored from 0 to 10, with 0 as the minimum and 10 as the maximum score; according to this score, each study’s credibility is evaluated and represented in Table 5. Seven research questions are formulated in this SLR, so a study can earn a maximum of seven points for them, i.e., one point per research question. For the quality assessment score, a study can reach a maximum of three points: a study valued “yes” is scored 3, “mostly” is scored 2, and “partly” is scored 1. The considered studies are thus scored out of 10 based on these parameters. The idea of assigning a score to each study is to recommend quality studies to our readers.
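
The credibility score described above can be expressed as a small function; the sketch below simply encodes the stated rule (one point per research question addressed, plus 3, 2, or 1 points for a quality label of “yes”, “mostly”, or “partly”) and the function name is our own.

```python
# Quality-label points as stated in the text: yes = 3, mostly = 2, partly = 1.
QUALITY_POINTS = {"yes": 3, "mostly": 2, "partly": 1}

def credibility_score(questions_addressed: int, quality_label: str) -> int:
    """Return the 0-10 credibility score: RQ points (max 7) plus quality points (max 3)."""
    if not 0 <= questions_addressed <= 7:
        raise ValueError("a study can address between 0 and 7 research questions")
    return questions_addressed + QUALITY_POINTS[quality_label]

print(credibility_score(6, "yes"))  # 9
```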

Table 6 SLR’s research questions addressed by the selected studies

3.6 Data synthesis

The data synthesis step aims to identify and group all the evidence from the chosen set of potential studies related to our research questions. A small piece of information might count as little evidence on its own, but collecting such information can strengthen a point. Tables are mostly used to synthesize the result data, and vote counting is used for comparative analysis of the results.

3.7 Threat to validity

3.7.1 Study selection and discrimination

In this SLR, we selected 52 studies; to finalize them, the search string was tried on different platforms according to our formulated research questions. There is a chance that some studies do not include the string keywords in their metadata, which creates the possibility of rejecting those studies. To choose the studies, we also followed the references used by different authors to include essential studies in the corpus. However, there is still a possibility that some relevant studies were missed, which can be considered a threat.

3.7.2 Subjective quality assessment

The criteria settled by the quality assessment led to discarding studies that do not fulfill the set bounds. Thus, there is a decent probability that prime studies in this field might not have been considered, owing to their failure to meet the criteria settled in the quality assessment phase; this could be a potential threat. Fuzzy linguistic variables are used to gauge each study’s score value and exclude studies with low scores, and the scoring of the selected studies is done by the three authors involved in this SLR, which could also be a threat. The credibility score of each study is shown in Fig. 5; it depicts that studies S11, S17, S21, S23, S25, S26, S32, S39, S42, and S43 have a high credibility score, whereas studies like S15, S22, S29, S38, S40, and S52 have a low score.

Fig. 5 Overall credibility of the selected studies

4 Results and discussions

In this section, the results obtained for each of the formulated research questions are presented and discussed.

4.1 RQ1: Which machine learning/non-machine learning techniques are employed in mathematical symbol recognition?

This SLR analyzes the ML and non-ML techniques used in all the selected studies, as shown in Fig. 6. On examining the trend of strategies implemented for the recognition tasks, it is observed that the ML techniques are comparatively more frequently deployed than the non-ML methods.

Fig. 6 ML and non-ML techniques analyzed by SLR

Further analysis of the ML techniques revealed that SVM is a widely used ML model, implemented in almost 21% (#11) of the total studies. The success of SVM can be attributed to the fact that it performs effectively when classes are separated by distinct margins. Furthermore, mathematical symbols are carved in two dimensions, and SVM deals effectively with a wide range of dimensions and sample sizes; it is also memory efficient from an algorithmic standpoint. The use of SVM, according to the authors, has resulted in competent recognition rates (presented later in Table 11). As a result, SVM has become increasingly popular for symbol recognition tasks.
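
As an illustrative, non-authoritative sketch of how an SVM might be applied to isolated symbol classification, the scikit-learn pipeline below uses flattened 32 x 32 images, an RBF kernel, and random placeholder data; none of these settings are taken from the reviewed studies.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(500, 32 * 32)          # 500 symbol images, flattened (placeholder data)
y = np.random.randint(0, 10, size=500)    # 10 hypothetical symbol classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_train, y_train)
print("symbol recognition rate:", clf.score(X_test, y_test))
```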

The next most applied ML methods are K-nearest neighbor and the Convolutional Neural Network (CNN), each deployed in about five studies, constituting about 9% (each) of the chosen studies. Other neural network models, such as artificial neural networks and recurrent neural networks, have also been employed in a few studies, approximately 2 to 3% of the selected studies each. BLSTM, a bidirectional recurrent architecture, has been used in 5% (#3) of the studies.

There have also been observations regarding the deep learning and CNN architectures deployed in the selected studies. The authors have endeavored to present a timeline of the varied deep learning architectures in the studies chosen over the years (Fig. 7). For instance, in 2012, study S13 recognized symbols using a Self-Organizing Map neural network (SOM-NN). Since 2015, the advent of hybrid architectures such as CNN combined with a Recurrent Neural Network, Multi-Layer Perceptron, LSTM, or BLSTM has been seen. The recent trend shows the use of varied CNN architectures such as the Multi-Column Deep Neural Network (MDNN) and the Squeeze-Extracted Multi-feature CNN (SE-MCNN).
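
For readers unfamiliar with these architectures, the following minimal PyTorch CNN sketch classifies isolated symbol images; the 32 x 32 input size, two convolutional blocks, and 101 classes are assumptions made for the sketch and do not reproduce any surveyed architecture such as MDNN or SE-MCNN.

```python
import torch
import torch.nn as nn

class SymbolCNN(nn.Module):
    """A small CNN for isolated handwritten symbol images."""
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                 # x: (batch, 1, 32, 32)
        return self.classifier(self.features(x))

logits = SymbolCNN()(torch.randn(4, 1, 32, 32))
print(logits.shape)                       # torch.Size([4, 101])
```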

Fig. 7 Deep Learning Architectures over a Timeline

It is a critical observation that the ML techniques in the selected studies are used individually or combined with other ML or non-ML techniques. The details of all ML-based studies are illustrated in Fig. 8. Non-ML techniques have also been observed (refer to Fig. 9). The analysis shows the prevalent non-ML methods to be elastic matching, the Hidden Markov Model, relational grammars, and fuzzy logic, which have been employed in 5% (#3), 4% (#2), 4% (#2), and 4% (#2) of the chosen studies, respectively.

Highlights of RQ1: Most of the recognition models used for mathematical symbol recognition are based on ML approaches. The popular ML algorithms are SVM, K-nearest neighbor, recurrent neural network, and convolutional neural network.

Fig. 8 ML Techniques Identified by SLR Studies

Fig. 9 Non-ML Techniques Identified by SLR Studies

4.2 RQ2: Which corpora are mainly used for experimentation in mathematical symbol recognition?

About 24 corpora have been used in the selected studies, but only those corpora used in more than one study are included here. The most widely used datasets are self-created datasets, employed in about 14, i.e., 26%, of the selected studies. The second most prevalent dataset is the CROHME dataset, launched by the CROHME competition series for recognizing handwritten math expressions. This series launched datasets periodically, leading to several versions of the same core dataset: CROHME 2011 [30], CROHME 2012 [16, 33], CROHME 2013 [15, 53], CROHME 2014 [4, 27], and CROHME 2016 [19, 20]. Among these, the primarily used dataset is CROHME 2013, employed in 7 (13%) of the total selected studies. CROHME 2014 is the second most widely investigated dataset, employed in 5 (9%) of the chosen studies, followed by CROHME 2016, employed in 4 (7%) of the collected studies. InftyCDB-1 is investigated in two different studies. Some datasets, such as CROHME 2012, InftyCDB-3 [48, 49], Ground-truth [29, 47], and the Modified National Institute of Standards and Technology dataset [21, 52], have been investigated in at most two studies each. The statistical analysis of the datasets is presented in Fig. 10 in the form of groups, where the CROHME datasets and other recurrent datasets are segregated: the first group covers the CROHME datasets, whereas the second group depicts the established datasets such as the Modified National Institute of Standards and Technology dataset and InftyCDB. The self-created category denotes corpora developed by individual researchers according to their requirements and available resources; before the launch of the CROHME competition series, most of the literature used this category, with researchers developing their own corpora for implementation.

Highlights of RQ2: Majorly, the datasets used for experimentation are self-created datasets used in around 26% of studies. The launch of CROHME datasets has been influential, and the analysis witnesses CROHME to be the second most popular dataset.

Fig. 10 Corpus employed in the studies selected by SLR

4.3 RQ3: What are the most widely used accuracy metrics for mathematical symbol recognition? Which notable metric corresponds to a specific phase of the recognition process?

Accuracy metrics are one of the essential components of this SLR, as they indicate how efficiently a recognition system recognizes mathematical symbols. Several accuracy measures have been used in the studies covered by this SLR.

The Symbol Recognition Rate (SRR) and the Minimum Symbol Error Rate (MSER) are broadly used in the mathematical symbol recognition studies selected for this SLR. The higher the SRR and the lower the minimum symbol error, the more accurate a particular technique is. SRR is employed in about 36 studies, i.e., 69% of the selected studies, and MSER is employed in about 18 studies, constituting 34% of the selected studies. The third most used accuracy measure is processing time, used in four (7%) of the selected studies. Other accuracy measures, such as stroke recognition rate, ranking metrics, classification rate, segmentation rate, and convergence rate, have each been employed in about 2 to 3 studies, constituting 4 to 5% each of the total selected studies. One requirement for including an accuracy metric here is that at least two of the chosen studies must have used it during experimentation.
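
A minimal sketch of the headline metric follows; it treats SRR as the fraction of correctly labelled symbols and the symbol error rate as its complement, with hypothetical label lists standing in for a test set.

```python
def symbol_recognition_rate(truth, predicted):
    """Fraction of symbols whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)

truth     = ["x", "+", "2", "=", "y"]
predicted = ["x", "+", "z", "=", "y"]
srr = symbol_recognition_rate(truth, predicted)
print(f"SRR = {srr:.2f}, symbol error rate = {1 - srr:.2f}")  # SRR = 0.80, error = 0.20
```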

Thus, the effective extract of this research question’s findings is that the notable metric for evaluating recognition results is SRR. Figure 11 illustrates the accuracy metrics, their frequency count (green part of the chart), their weightage (grey part of the chart), and the evidential studies as references. The authors have also endeavored to provide brief insights into the metric selection criteria and their pros and cons in Table 7.

Fig. 11 Accuracy metrics that are employed in different studies in the SLR

Table 7 Contemporary analysis of Metric Selection Criteria

4.3.1 Insights about validation methodologies for machine learning

Validation methodologies are the means of determining the error rate of a proposed ML model; they are used to estimate an error rate that is as close as possible to the population’s true error rate. The authors have synthesized the validation approaches in all the studies that deployed an ML-based symbol recognition model. The analysis of the validation approaches in the studies is depicted in Table 8. The predominant validation techniques implemented in the selected studies are the train/test split and k-fold cross-validation, as depicted in Fig. 12. A few studies did not mention any validation methodology and are placed in the “Not Applicable” category. The train/test split is the most basic strategy: the data is randomly split into two parts, for instance 70% for training and 30% for testing the model. In k-fold cross-validation, k-1 folds are utilized for training and the remaining fold is used for testing. The benefit of these methodologies is that all of the data may be used for training and testing.
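
The two validation strategies can be illustrated with scikit-learn as below; the random placeholder data, the 70/30 split, the choice of a K-NN classifier, and k = 5 are assumptions made for the sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 64)               # 200 symbol feature vectors (placeholder data)
y = np.random.randint(0, 5, size=200)     # 5 hypothetical symbol classes
clf = KNeighborsClassifier(n_neighbors=3)

# Hold-out validation: a random 70/30 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: each fold serves once as the test set.
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("cross-validation accuracy:", scores.mean())
```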

Highlights of RQ3: The most popular accuracy metric is the symbol recognition rate, deployed in almost 70% of the studies; other prominent metrics are the minimum symbol error rate (34%) and processing time (7%). The well-known validation techniques deployed in machine learning-based studies are the train/test split and k-fold cross-validation.

Table 8 Analysis of Validation Techniques
Fig. 12 Validation metric analysis

4.4 RQ4: What segmentation/feature extraction techniques are preeminent for mathematical symbol recognition?

The literature displays significant use of segmentation and feature extraction in the sub-stages of the recognition process, as listed in Tables 9 and 10, respectively. This question targeted the extraction and identification of the segmentation and feature extraction techniques used in the implementations of most selected studies. Some studies reported no associated segmentation or feature extraction technique. The review group of this SLR decided to map the extracted data of the chosen studies under these two heads to explore another dimension of a potentially important study parameter.

Highlights of RQ4: K-NN is found to be a popular segmentation and classification technique. In feature extraction, Euclidean distance matching and convolutional neural networks are the most recognized techniques.

Table 9 Segmentation Techniques Identified from SLR
Table 10 Features Extraction Techniques from SLR

4.5 RQ5: What trend-analysis statistics compare the performance of machine learning/non-machine learning techniques with other machine learning and non-machine learning techniques?

This research question is intended to identify which category of technique (ML/non-ML) outperformed the other. Answering it required specifying common ground, a prevalent parameter from which conclusions could be drawn, so several rounds of discussion between the reviewers led to finalizing a standard accuracy metric for this comparison. Ideally, an effective comparison would use both a common dataset and a concordant accuracy metric; however, as data extraction yielded an insufficient number of such studies, the accuracy metric alone is taken as the parameter for comparison. Since RQ3 identified SRR as a prominent and frequent accuracy metric, the studies’ techniques are compared based on SRR, and S30 reports the highest SRR value, using BLSTM-RNN as its recognition method. This indicates that the ML techniques outperform the non-ML recognition models in terms of accuracy, as shown in Table 11.

Highlights of RQ5: On analyzing the accuracy metrics, the ML technique BLSTM-RNN (used in S30) has outperformed other ML and non-ML methods. Among non-ML techniques, Elastic Matching (used in S48) achieves the best performance, yielding a symbol recognition rate of 99.70%.

Table 11 Accuracy Metric Analysis

4.6 RQ6: Is the size of the dataset a factor influencing the accuracy rate of mathematical symbol recognition?

The intent of this research question is to examine whether the dataset’s size is a parameter affecting the accuracy rate of mathematical symbol recognition. To answer it, data extraction has been done for two dataset size parameters: the number of symbol classes and the number of symbols included in the dataset. The accuracy values extracted from the studies are mapped to the corresponding values of the dataset parameters, and the mapped charts are presented in Figs. 13 and 14. From these charts, the values are arranged on the designed metric-size scale (Fig. 15), which depicts the accuracy ranges with respect to the number of symbol classes. The data analysis led to the following points:

  • Datasets with 0 to 100 symbol classes (as shown in Fig. 15) give accuracy values of up to 89.97%, leading to the observation that datasets in this range generally reach accuracies below 90%.

  • Two highlighted accuracy values, produced by studies S17 and S30, deviate from this trend; the authors consider these two results to be a possible outcome of overfitting.

  • Datasets with 100 to 300 symbol classes, as presented on the accuracy metric-size scale, produce accuracy values of at least 90% and below 97%.

  • Datasets with 300 to 400 symbol classes yield accuracy values above 97%. The accuracy metric-size scale therefore suggests that dataset size can be a determining factor for the achievable accuracy.

  • The authors also believe that the dataset cannot be the only parameter influencing the accuracy values. Accuracy can depend on other factors, such as the recognition technique employed, the kind of dataset used, the accuracy metric type, and the specific pre-processing, segmentation, and feature extraction methods. All of these can contribute to the accuracy range of a recognition process, even though dataset size has been correlated with the obtained accuracy and presented in Fig. 15.

Highlights of RQ6: When accuracy is mapped to dataset size, the scale (Fig. 15) depicts that when a dataset has more than 300 symbol classes, the recognition process is likely to achieve an accuracy of 97% or above. Two exceptional studies (S17, S30) provide high accuracy even with fewer symbol classes.

Fig. 13 Accuracy based on total symbols

Fig. 14 Accuracy based on Symbol Classes

Fig. 15 Accuracy Metric-Size Scale

4.7 RQ7: What are the predominant publication channels for research communications in mathematical symbol recognition?

The idea behind this research question is to identify the primary publication sources of articles related to mathematical symbol recognition. The International Journal on Document Analysis and Recognition (IJDAR) has published many studies on mathematical symbol recognition and is the leading journal among all those analyzed in this SLR, publishing 5% of the reviewed studies. The International Conference on Document Analysis and Recognition (ICDAR) is the leading conference among those analyzed, publishing 23% of the reviewed studies. The International Conference on Frontiers in Handwriting Recognition (ICFHR) and the International Conference on Pattern Recognition (ICPR), each publishing 7% of the reviewed studies, are the next major conference publication sources. Only sources with at least two publications among the reviewed studies are considered in this research question.

Highlights of RQ7: The predominant publication channel is the renowned ICDAR conference, which conducts the CROHME series of competitions; around 23% of the reviewed studies are part of its proceedings.

4.7.1 Extending discussions on problems encountered during math symbol recognition

The following are the five main problems associated with recognizing math notations, symbols, and expressions. Four of the issues are given by [78], and our analysis adds one more problem.

  • Expression Extraction/ Expression Detection: It’s tough to extract or recognize a mathematical expression on the same line as ordinary text. Tracing an exact phrase surrounded by words and extracting it from a text-rich document confuses and complicates the recognition operation. Furthermore, detecting expressions hidden in large amounts of textual material is difficult.

  • Symbol Extraction/Symbol Recognition: There are two types of representations involved: vector-based representations and raster-scanned representations. In the former case, for example PDFs, numerous symbol labels and locations can be recovered, but some exceptional cases must be handled, like the one mentioned by [5]: a root symbol is not a single glyph (a connected component) but comprises a typeset radical symbol with an upper horizontal bar represented differently from the radical sign. In the latter case, where raster image data drawn with pen strokes is available, the complicated task is to detect and identify each symbol’s location. Hundreds of alphanumeric and mathematical symbols are used, many so similar in appearance (e.g., “o”, “O”, “0”) that some use of context is necessary for disambiguation.

  • Layout Analysis: It is challenging to perform a detailed analysis of the spatial relationships between distinct symbols of varied scales embedded in math expressions. When these spatial relationships among the symbols are represented using tree structures, we term them Symbol Layout Trees. These symbol layout trees also represent contextual information such as that present in LaTeX MEs: they specify the groups of horizontally adjacent symbols that share a baseline or writing line, along with superscript, subscript, above, below, and containment relationships. In other words, layout analysis is an in-depth analysis of the spatial relationships among math symbols using symbol layout trees; a minimal sketch of such a tree appears after this list. Layout analysis is performed by many different techniques, some based on baseline structure trees (BST) [66, 79, 80], projection-profile cutting [18, 54], and graph rewriting [36]. One popular open-source structural analyzer is the Diagram Recognition Application for Computer Understanding of Large Algebraic Expressions (DRACULAE) [13, 57, 70], which implements the BST method along with various optimizations and improvements.

  • Semantic Analysis: This step focuses on understanding the semantics, or meaning, of the recognized symbols from the segmented expressions. It mainly involves the creation of a semantic analysis tree output by the semantic understanding step, which holds the mathematical meaning of the input formula, including the syntactic mathematical relationships among characters and symbols. The layout analysis and semantic analysis steps have been proposed by [12], who also give the basic idea of a system architecture for layout recognition and semantic understanding of mathematical formulae.
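
As referenced in the layout analysis item above, the following minimal Python sketch shows one possible in-memory form of a symbol layout tree; the node structure and relation labels are assumptions chosen to mirror the spatial relationships described in the text, not a format taken from the literature.

```python
from dataclasses import dataclass, field

@dataclass
class SymbolNode:
    """A symbol with its spatially related children (relation, node) pairs."""
    label: str
    children: list = field(default_factory=list)

    def add(self, relation: str, child: "SymbolNode") -> "SymbolNode":
        self.children.append((relation, child))
        return child

# Layout of "x^2 + 1": 'x' carries a superscript '2', then '+' and '1' on the baseline.
root = SymbolNode("x")
root.add("superscript", SymbolNode("2"))
plus = root.add("right", SymbolNode("+"))
plus.add("right", SymbolNode("1"))
```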

Apart from the problems mentioned above, a colossal challenge is encountered when comparing different recognition techniques to conclude which is the best. Many past studies have introduced various schemes for recognizing mathematical symbols and MEs, and nearly every study has used a distinct method for evaluating the performance accuracy of its recognition system, making it a challenging task to compare the several approaches to this problem accurately. The literature also shows that several metrics have been recommended to evaluate the performance of ME recognition systems; some, such as the symbol recognition rate [10] and the symbol segmentation rate [65], can be calculated if the ground truth is available. These metrics take into consideration only a specific part of the recognition problem. Hence, the need for a standardized dataset and a standard accuracy metric has been realized, so that the task of comparison can be accomplished more effectively.

5 Summary of findings

This SLR offers the results related to mathematical symbol recognition based on a total of 52 chosen studies. We extracted studies published until 2021 from Scopus, IEEE Xplore, Springer, and Google Scholar. To highlight the contribution of the studies, the publication years and the number of studies published in each range are shown in Fig. 16. It is worth noting that from 2006 to 2016 there was a significant increase in publications related to mathematical symbol recognition compared to previous years. In the inclusion-exclusion criteria, we focused only on mathematical symbol recognition rather than whole-expression recognition, and the inclusion-exclusion questions were formed with this in mind. After applying these criteria, 75 studies remained, to which the quality assessment test was then applied; studies that did not fulfill the quality assessment requirements were discarded, leaving 52 final selected studies for our SLR. These 52 studies answered the questions of our SLR to a significant extent.

Fig. 16 Studies Distribution from 1987 to 2021

The study analyzed the pattern of research trends in the domain of handwritten mathematical symbol recognition, and it has been observed that the trend line is focused more on ML techniques. Among the ML techniques, SVM is the most widely used, with K-nearest neighbour and CNN as the second most frequent. The observations are summarized in Table 12.

Table 12 Research Questions Major Extracts

6 Conclusion and future scope

In this paper, we have covered the studies published until 2021 related to mathematical symbol recognition. A total of 52 studies were selected in our SLR to evaluate and analyze critical information for users. A substantial amount of work has been done in mathematical symbol recognition, and a systematic review of it is much needed; this is the idea behind our SLR. In the results and comparisons, we found that SVM is the most used ML technique and elastic matching is the most used non-ML technique. The most employed accuracy metrics in the selected studies are SRR and MSER, and SRR, together with the feature set or feature extraction details, is the metric most broadly used when comparing techniques. In terms of dataset/corpus, CROHME is the most used corpus in the reviewed studies; among its versions, CROHME 2013 has been used most widely. This organized survey has also extracted the uncovered and less-focused potential dimensions of this domain. The reviewers have extracted the following points for future research in this direction:

  • Ambiguity and the 2D structure of MEs have always remained a challenge for the recognition task. Moreover, the literature shows decent scope and potential for handling more complicated MEs involving high-order exponentials.

  • On analyzing the several representation models implemented in the recognition process, it has been found that there is a need for a more efficient representation model or structure that could contain information for both symbol segmentation and stroke segmentation.

  • The main challenge for the symbol segmentation sub-step is a lack of clear rules to group the strokes belonging to a single symbol embedded in the whole math expression.

  • One of the constraints observed during the review process concerns defining the accuracy metric-size scale (Fig. 15). There is a limitation in describing the relationship between the accuracy metric and dataset size because of the absence of standard, publicly available datasets and standard accuracy metrics. Until common ground is established across the varied recognition approaches, it is hard to determine the direct relationships between the necessary comparison factors, and performing comparative analysis by narrowing the field and choosing only specific intersecting factors for comparison is not ideal.

  • The datasets available in this domain of handwritten mathematical symbols and expressions need more semantically annotated handwritten symbols and expressions, involving writer attributes such as age and gender, so that an improved recognizer can be built to advance this domain.

  • It can also be observed that the multi-modal encoder-decoder systems deployed for recognition have fair scope for offline-to-online reduction, resulting in a single recognizer for both types of math expressions. Exploring the trend of multi-attention and multi-modal systems, and working on their enhancement, is a challenge in itself.

  • Ambiguity problems arising at the pre-processing step are challenging, as scaling may distort the shape of specific signs and symbols; their real meaning could be lost in this process.