Extracting mathematical expressions (MEs) from scientific documents is a demanding area of research. It is a complex yet essential task for academic plagiarism detection and information retrieval. Various scholars have attempted to extract mathematical notation and expressions from documents, but the precision and recall of these approaches remain low compared with plain text retrieval. Both the detection of mathematical plagiarism and the retrieval of the source document depend on reliable ME detection. As more documents are digitized, detecting the MEs in them is becoming increasingly difficult. Although many OCR-based techniques perform well on plain text documents, they do not retrieve the exact ME from the source accurately or effectively. There are two main types of ME detection, inline and embedded, as described by Zanibbi and Blostein (Zanibbi and Blostein 2012a).

Recent research on ME detection focuses on online and offline handwritten MEs, but it has not yet fully solved the problem. OCR-based ME detection usually has difficulty recognizing the large number of characters and the different types of symbols found in document images. Traditional methods detect displayed and inline MEs with rule-based approaches, as in Lee and Wang (Lee and Wang 1997), and employ an n-gram model to recognize MEs in a large corpus. Phong et al. (2017, 2019) proposed several further methods for classifying inline and displayed MEs, for example based on SVMs. There are also DNN-based methods that recognize symbols in PDFs, handwritten documents, and printed documents, including the methods proposed by Gao et al. (Gao et al. 2017) and Chan and Yeung (Chan and Yeung 2000).

Furthermore, detecting formulas and symbols through mathematical expression extraction is a relatively small but important part of academic plagiarism detection that cannot be ignored. It underpins mathematical plagiarism detection and addresses the current gap in math information retrieval. Meuschke et al. (Meuschke et al. 2017) designed feature selection and feature comparison strategies for developing math-based plagiarism detection approaches, and their results show that mathematical expressions are promising text-independent features for identifying academic plagiarism. Later, they also presented a prototype that implements a hybrid approach to academic plagiarism detection by analyzing the similarity of mathematical expressions, images, citation patterns, and text, and they demonstrated a result visualization approach that uses HyPlag to analyze confirmed cases of content reuse. Meuschke et al. (Meuschke et al. 2018) analyzed the concept of mathematical content similarity in different types of STEM documents and its implications for academic plagiarism detection. They presented a two-stage detection process that combines similarity assessments of mathematical content, academic content, and text. They also compared the effectiveness of math-based, citation-based, and text-based approaches using confirmed cases of academic plagiarism.

The rest of the paper is organized as follows. Section 2 reviews the work done in the research area. Section 3 presents and discusses the proposed IFME framework. Section 4 discusses the performance metrics that can be used to evaluate the results of our model in the future. Finally, Sect. 5 concludes the paper and outlines future work.

2 Background and Related Work

Mathematical Plagiarism Detection Techniques

Table 1 outlines the mathematical plagiarism detection techniques proposed by various researchers.

Table 1 Analysis of mathematical plagiarism detection techniques

Mathematical Expression Extraction Techniques

Table 2 summarizes the mathematical expression extraction techniques proposed by various researchers.

Table 2 Analysis of mathematical expression extraction techniques

3 IFME-Intelligent Filter for Mathematical Expression

To detect standard mathematical expressions and notation in scientific documents, the IFME framework is proposed, as shown in Fig. 1.

Fig. 1 IFME framework

The description of the various components used in the IFME framework is discussed as follows:

Math Documents

In this phase, the different mathematical documents are collected.

ME Extraction by Neural Network

This component extracts the mathematical expressions from the documents collected by the first component. A CNN and the U-Net framework are used to extract both inline and embedded mathematical expressions.

Segmentation of ME Features

The extracted ME features are then segmented into sub-blocks, for both inline and embedded MEs.

Compute Cosine Similarity of ME

The extracted features are converted into vectors, and the cosine similarity between these vectors is computed to measure how similar the detected MEs are to one another. This similarity is used to improve mathematical plagiarism detection.
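As a minimal illustration of how an ME could be represented as a vector, the sketch below counts symbols over a shared vocabulary. This bag-of-symbols representation, the token sequences, and the function name `symbol_vector` are assumptions made for illustration only; they are not the features produced by the CNN and U-Net extraction stage.

```python
from collections import Counter


def symbol_vector(symbols, vocabulary):
    """Count how often each vocabulary symbol occurs in one expression."""
    counts = Counter(symbols)
    # Counter returns 0 for symbols that do not occur in this expression.
    return [counts[s] for s in vocabulary]


# Hypothetical token sequences for two extracted expressions.
me_a = ["x", "^", "2", "+", "y", "^", "2"]
me_b = ["x", "^", "2", "+", "1"]

# A shared, sorted vocabulary keeps both vectors aligned component by component.
vocabulary = sorted(set(me_a) | set(me_b))
vec_a = symbol_vector(me_a, vocabulary)
vec_b = symbol_vector(me_b, vocabulary)
```

Because both vectors are built over the same vocabulary, they have the same length and can be compared directly with a similarity measure such as cosine similarity.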

ML Classification of ME

After the similarity between the ME features has been computed, a random forest classifier decides whether each detected ME is standard notation or a potentially new idea that is relevant for plagiarism detection. An ME identified as a new idea is then validated manually.

Standard ME Database

The InftyProject databases InftyCDB-1 and InftyCDB-2 and the Marmot dataset, which contain the characters, symbols, and spatial features of mathematical documents, are used as the standard mathematical expression databases.

Intelligent Filter for Mathematical Expression (IFME) Algorithm

The pseudo-code of the algorithm for extracting mathematical expressions and detecting plagiarism is given below:

Input: Mathematical documents

Output: Standard mathematical notations

Step 1: Take the mathematical document.

Step 2: Extract the different mathematical expressions from the document.

Step 3: Store the extracted mathematical expressions in vector A.

$$ [A] = A_{ME_1} + A_{ME_2} + A_{ME_3} + \cdots + A_{ME_n} = \sum_{i=1}^{n} A_{ME_i} $$

Step 4: Take the ME dataset and store its mathematical expressions in vector B.

$$ [B] = B_{ME_1} + B_{ME_2} + B_{ME_3} + \cdots + B_{ME_n} $$

Step 5: Calculate the cosine similarity between the two vectors A and B.

$$ Similarity = \cos \theta = \frac{A \cdot B}{\| A \| \, \| B \|} = \frac{\sum_{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i=1}^{n} \left( A_{i} \right)^{2}} \times \sqrt{\sum_{i=1}^{n} \left( B_{i} \right)^{2}}} $$

where \(A_{i}\) and \(B_{i}\) denote the components \(A_{ME_i}\) and \(B_{ME_i}\) of vectors A and B.

Step 7: /* Classify the features of the detected MEs */

If "STANDARD NOTATION" then "NOT PLAGIARISM"

Else "MANUAL VALIDATION"

Step 8: END
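The similarity and classification steps of the pseudo-code can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the feature vectors are hypothetical, each extracted ME is compared with every vector in the standard set, and a fixed similarity threshold (0.9, chosen arbitrarily) stands in for the random forest classifier used for Step 7. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return A.B / (||A|| ||B||), or 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_expression(vec, standard_vectors, threshold=0.9):
    """Label one extracted ME by its best match in the standard set.

    The threshold rule is a hypothetical stand-in for the classifier of Step 7.
    """
    best = max((cosine_similarity(vec, s) for s in standard_vectors), default=0.0)
    if best >= threshold:
        return "NOT PLAGIARISM", best  # matches standard notation
    return "MANUAL VALIDATION", best


# Hypothetical feature vectors: B holds standard MEs, A holds extracted MEs.
standard = [[1, 0, 2, 2, 1, 1], [0, 3, 0, 1, 0, 0]]
extracted = [[2, 0, 4, 4, 2, 2], [1, 1, 1, 1, 1, 0]]
labels = [filter_expression(v, standard) for v in extracted]
```

The first extracted vector is a scaled copy of a standard vector, so its cosine similarity is 1 up to rounding and it is labeled "NOT PLAGIARISM". The second falls below the threshold and is routed to "MANUAL VALIDATION".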

4 Result and Discussion

To evaluate the proposed algorithm, a simulation test bench for the IFME framework was set up on a Lenovo IdeaPad laptop with 8 GB of RAM and a 2 TB hard disk. The input to the proposed framework consists of mathematical document images collected from 400 different documents.

Mathematical Dataset

The model is trained and tested on InftyMCCDB-2, an updated version of the InftyCDB-2 dataset. It contains more than 30,000 expressions, which are grouped into 12,551 images for training and 6830 images for testing (Fig. 2).

Fig. 2 InftyMCCDB-2 dataset

Evaluation of Framework

Recall \(R_{ME}\), precision \(P_{ME}\), and F-measure \(F_{ME}\) are used as the performance metrics to validate the identified mathematical expressions. The F-measure \(F_{ME}\) is the harmonic mean of recall \(R_{ME}\) and precision \(P_{ME}\) and indicates how well the designed framework works.

Recall RME

It is the ratio of correctly predicted positive mathematical expressions to all mathematical expressions in the actual class, defined as:

$$ R_{ME} = \frac{TP}{{TP + FN}} $$

Precision PME

It is the ratio of correctly predicted positive mathematical expressions to the total number of predicted mathematical expressions, defined as:

$$ P_{ME} = \frac{TP}{{TP + FP}}$$

F-measures FME

The F-measure is the harmonic mean of recall and precision for the predicted mathematical expressions. It takes both the false negatives and the false positives into account and is defined as:

$$ F_{ME} = \frac{{2 \times P_{ME} R_{ME} }}{{R_{ME} + P_{ME} }} $$

where TP (True Positive) is the number of positive values correctly predicted as positive, FN (False Negative) is the number of positive values predicted as negative, and FP (False Positive) is the number of negative values predicted as positive.
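The three definitions above can be checked with a short Python sketch. The counts used here are hypothetical and serve only to illustrate the formulas; they are not results of the framework. Returning 0.0 when a denominator is zero is a convention assumed for this sketch.

```python
def recall(tp, fn):
    """R_ME = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """P_ME = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(p, r):
    """F_ME = 2 * P_ME * R_ME / (R_ME + P_ME), the harmonic mean of the two."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion counts for detected mathematical expressions.
tp, fp, fn = 80, 10, 20

r_me = recall(tp, fn)
p_me = precision(tp, fp)
f_me = f_measure(p_me, r_me)
```

With these counts, recall is 80/100, precision is 80/90, and the F-measure equals 160/190, which lies between the two values and closer to the lower one.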

The quality of the classification can also be measured by the accuracy \(A_{ME}\) of the model, which is defined as:

$$ A_{ME} = \frac{TP + TN}{{TP + TN + FP + FN}}$$

In the accuracy formula, TP, FN, and FP have the same meaning as in recall \(R_{ME}\), precision \(P_{ME}\), and F-measure \(F_{ME}\), and TN (True Negative) is the number of negative values correctly predicted as negative. Accuracy \(A_{ME}\) summarizes the performance of the framework over all four outcome counts. Figure 3 shows the performance reported for each work carried out by the researchers in terms of recall, precision, F-measure, and accuracy. From this comparison, we conclude that some researchers achieved strong ME extraction performance, and their techniques can be useful for filtering out MEs when detecting plagiarism in mathematical documents.
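As a worked example of the accuracy formula, assume the hypothetical counts TP = 80, TN = 90, FP = 10, and FN = 20. These values are chosen only to illustrate the arithmetic and are not measurements of the framework.

```latex
$$ A_{ME} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{80 + 90}{80 + 90 + 10 + 20} = \frac{170}{200} = 0.85 $$
```

Unlike recall and precision, the numerator here also credits the correctly rejected negatives (TN).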

Fig. 3 Performance measures of ME techniques

5 Conclusion and Future Work

This paper surveys the existing mathematical plagiarism detection techniques and mathematical expression extraction techniques proposed by different researchers. The proposed framework uses a convolutional neural network and the U-Net framework to extract inline and embedded mathematical expressions. Cosine similarity is used to measure the similarity between the features of the mathematical expressions. A random forest classifier then decides whether each detected ME is standard notation or a potentially new idea that is relevant for plagiarism detection, and any newly identified idea is validated manually. The analysis indicates that the convolutional neural network and the U-Net framework produce promising results, reaching a higher accuracy (around 0.941) than the machine learning-based framework. In the future, the framework could also be built with other kinds of neural networks for better performance, and it should also be useful for information retrieval from mathematical documents.