1 Introduction

This study proposes a multi-task learning technique and uses a recent language model, ELECTRA [1], to solve mathematical word problems automatically. In automatic mathematical word problem-solving tasks, a machine must deduce the answer to a given mathematical problem by extracting the numeric information implied in the problem. Solving such a task is useful in that its solutions can be generalized to other types of natural language understanding and mathematical manipulation tasks, such as tax affairs or accounting.

Researchers have attempted to address the task of automatic mathematical word problem solving since the 1960s [2, 3]. Starting with models that utilize hand-crafted features, more recent solutions include fully automatic feature generation models [4, 5]. With recent developments in machine learning, some researchers have adopted model architectures such as recursive neural networks or sequence-to-sequence models that incorporate natural language understanding to solve mathematical word problems [6, 7].

To solve mathematical word problems automatically, recent machine learning-based studies have used equation templates as a general approach. An equation template is a normalized version of an answer equation, in which the numbers and variables of the answer equation are replaced with slots [8]. Table 1 shows sample mathematical word problems along with the answer equation, equation template, and answer to each question. Because training with equation templates reduces the space of possible answer equations from an unbounded set to a restricted one, using equation templates instead of answer equations during model training can improve the deduction of a correct answer equation during model testing.

Table 1 Four sample mathematical word problems

In this study, we propose template-based multitask generation (TM-generation), which can improve problem-solving accuracy by handling two previously unaddressed challenges. This model is an extension of T-MTDNN [9], which attempts to address two challenges: (1) filling in missing world knowledge that is required to understand the given mathematical word problem, and (2) understanding the implied relationship between numbers and variables. Table 1 shows four examples of mathematical word problems that demonstrate the two challenges. For all four problems, a model must deduce a correct equation or equation template to provide an answer. Questions 1 and 2 show how each question requires world knowledge about monetary units, that is, that 100 pennies equal 10 dimes and 1 dollar. Questions 3 and 4 show how each question requires a different set of operators, + and −, once the relationship between numbers and variables is correctly inferred.

To address these two challenges, we employ the following two techniques. First, for challenge (1), we utilize recent language models, BERT and ELECTRA. Second, for challenge (2), we propose an improved version of operator identification. To verify whether these techniques can help address the two challenges, we conducted an experiment and analyzed the errors of our models. Through the experiment and analysis, this study makes the following three contributions:

  • We propose the TM-generation model, which achieves comparable or state-of-the-art performance on the MAWPS [10] and Math23k [11] datasets.

  • We suggest that using language models can fill in missing world knowledge, as supported by an error analysis showing higher performance on problems requiring world knowledge.

  • We propose an operator identification layer that can extract the relationship between numbers and variables. The layer is an improved version of T-MTDNN’s comparable component.

The remainder of this paper is organized as follows. In Sect. 2, we review existing studies that have attempted to address the two challenges when solving mathematical word problems. We then propose TM-generation (Sect. 3) and provide a detailed explanation of how we analyzed the model (Sect. 4). The results of the analysis are discussed in Sect. 5. In Sect. 6, we conclude our study and present future research directions.

2 Related work

In this section, we discuss how previous studies have attempted to resolve the two challenges we presented. These approaches can be categorized as follows: (1) solving mathematical word problems using templates, (2) methods for filling in missing world knowledge, and (3) methods for understanding the relationship between numbers and variables.

2.1 Solving mathematical word problems using templates

Since the equation template approach was first proposed by Kushman et al. [8], many researchers have adopted it to address mathematical word problem-solving tasks. Existing methods for deducing equation templates can be broadly categorized into two types: classification [8, 12] and generation [7, 13].

First, in the classification method, a model learns equation templates from a training dataset. After the training phase, the model classifies an input question into the most probable equation template learned from the dataset. Therefore, the search space of the model is limited to previously learned templates during classification. Because limiting the search space with templates can improve performance, many classification studies have attempted to limit the number of equation templates. For example, Kushman et al. [8] used a simplified equation template whose numbers and variables are replaced with ‘slots’. The ‘slot’ aims to group equations that differ only in their number instances into one template. After choosing a template, the model aligns numbers or nouns in a question to the slots of the classified template. Kushman et al. reported that their model achieved a problem-solving accuracy of 68.7%. In contrast, Zhou et al. [14] improved on Kushman et al.’s model by using equation templates in which only the numbers are slotted. This approach was used to test the optimal number of equation templates and achieved an accuracy of 79.7% on Kushman et al.’s dataset. However, these two slot-based approaches share an issue: because numbers are replaced with slots, the information carried by a particular number is masked.

Meanwhile, in the generation method [7, 13], a model learns to generate the correct equation template directly, token by token. Generating an equation template token by token gives a model better extensibility than classification models, because the model can generate an equation template even if that template was unseen during training. Existing research attempts to find the most suitable unit of generation to improve performance. For example, researchers have used recursive tree structures [6, 13] and simple sequences [4, 7, 11]. The state-of-the-art accuracies using recursive tree structures and simple sequences were 69.0% [6] and 66.9% [7] on Math23k, respectively. Therefore, we developed our model based on a generation method to pursue extensibility by widening the search space and to improve performance.

2.2 Methods for filling in missing world knowledge

A successful mathematical word problem-solving model should utilize world knowledge to understand the relationship between numbers and variables, because the information required for solving a problem may not be explicitly stated in the question. One approach for incorporating world knowledge is to use a language model that is pre-trained on a large multi-domain corpus and then fine-tuned on a specific task such as mathematical word problem solving. Because language models are pre-trained on multi-domain corpora, they can indirectly supply the world knowledge required to solve mathematical word problems. For example, Liu et al. [15] reported that using the BERT language model [16] can improve performance on a simple algebraic problem dataset. Ki et al. [17] also reported state-of-the-art performance, with 73% accuracy, on a Korean mathematical word problem dataset using a BERT-based classification model. Moreover, Wallace et al. [18] argued that language models such as BERT possess numeracy, that is, the ability to understand numbers.

Despite the potential of language models to improve the performance of mathematical word problem-solving models, a model’s performance can decline when it is pre-trained on a corpus with a scarce set of tokens. For example, Griffith and Kalita [19] reported decreased performance when they used a pre-trained language model: they achieved 84.7% accuracy on MAWPS with a non-pretrained vanilla Transformer structure, whereas performance decreased by 2.6% when they used a model pre-trained on a small movie dialog dataset. This decline may be due to the movie dataset not containing a sufficient number of words used to represent numbers. Therefore, it is worthwhile to investigate whether performance increases when a model is built on a language model pre-trained on a corpus with diverse token sets, such as Wikipedia. In addition to using a corpus with many number-related tokens, a model should be designed to utilize such number tokens. Although a given corpus may contain various number tokens, the contexts in which those tokens are used vary greatly. Therefore, to increase performance, a model should include a method that utilizes these tokens for the specific task. In this paper, the operator identification layer of the TM-generation model serves this purpose.

2.3 Methods for understanding the relationship between numbers and variables

In this section, we present two types of existing methods for finding the implied relationship between numbers and variables in mathematical word problems: hand-crafted features and automated features. In hand-crafted feature approaches, domain experts design features that reflect the implied relationship between numbers and variables. For example, ARIS [20] used a verb categorizer to understand the relationships between entities needed to form an equation. This verb categorizer sorts the verbs used in a question into seven expert-predefined types. By capturing the connection between numbers and operators in the question, this categorized information improved the accuracy of deducing the equation template from 64.0% to 77.7%. In contrast, Roy and Roth [21] tagged unit dependencies between numbers in the question to identify numbers with the same unit, utilizing unit dependency information to deduce the correct equation template; they reported an accuracy of 56.2% on their dataset. However, the design cost of hand-crafted features is high, and the features are usually defined for a specific dataset. Thus, hand-crafted features tend to overfit the dataset, limiting the model’s extensibility.

To address the limited extensibility of hand-crafted features, numerous attempts have been made to learn the relationship between numbers and variables through automated feature extraction methods, which have been actively proposed and investigated. For example, the deep neural solver [11], which achieved an accuracy of 64.7% on Math23k, adopted a sequence-to-sequence model whose LSTM encoder automatically learns the hidden contextual relationships in the question. Wang et al. [7] proposed a Bi-LSTM model that separates the operator selection process from the operand selection process to identify the relationship between numbers and variables independently. Recently, Liu et al. [6] also proposed a structure that utilizes the relationships between numbers while generating each node of a tree-structured equation template, rather than a sequence.

More recently, Lee and Gweon [9] proposed an ‘operator classification’ task, which extracts the relationship between numbers and variables, together with a multi-task learning technique that can extract this relationship from various aspects. Lee and Gweon showed that a general representation of numbers can be obtained by jointly learning to classify operators and to derive equation templates with multi-task learning. By applying this technique, they reported an accuracy of 71.2%, which was the state-of-the-art performance on Math23k at the time. However, their operator classification task incurred a high computational cost and relied only on the classification method, without exploring alternatives such as generation. Thus, the model still has room for improvement.

3 The TM-generation model

We propose the TM-generation model, which is an extension of T-MTDNN [9]. In Sect. 3.1, we first detail the input normalization process used in the TM-generation model. The T-MTDNN method is then explained in Sect. 3.2. Finally, the TM-generation model is explained in Sect. 3.3. We explain how the language model and improved operator identification are applied, and how they address the following two challenges: (1) filling in missing world knowledge, and (2) understanding the implied relationship between numbers and variables.

3.1 Input normalization

The input normalization step generates a normalized data format so that the search space for equation templates can be narrowed. Figure 1 shows the original data format and the normalized data format. For input normalization, we generate number-INC (index for numeric constants) pairs, INC-mapped questions, and three types of templates: the postfix equation template, the normalized equation template (NET), and the partial NET.

Fig. 1 Input normalization process

Each number given in a mathematical word problem is assigned a number-INC pair, which comprises a number token and its corresponding INC. An INC is a normalized representation assigned to each number appearing in a mathematical word problem, according to its order of occurrence: Ni is assigned to the i-th number in the problem. The INC is used to generate an INC-mapped question and two types of templates. For the INC-mapped question, the INC is attached to each number that appears in the question during the number-INC pair generation process. By proposing the INC, we attempt to make the model extract more information from a problem than the ‘slot’ proposed by Kushman et al. [8]. Unlike Kushman et al., we expect the model to learn the number information that appears in the problem as a particular instance, as well as to learn that the particular numbers do not affect the template.
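For illustration, the following minimal sketch shows one plausible implementation of number-INC pair generation and INC mapping; the whitespace tokenization, the regular expression, and the function name are our assumptions, not the released implementation.

```python
import re

def assign_incs(question: str):
    """Assign an INC token Ni to the i-th number in the question and
    attach it next to the number to build the INC-mapped question."""
    number_inc_pairs = []  # list of (number token, INC) pairs
    mapped_tokens = []
    for token in question.split():
        if re.fullmatch(r"\d+(\.\d+)?", token):
            inc = f"N{len(number_inc_pairs)}"
            number_inc_pairs.append((token, inc))
            mapped_tokens.append(f"{token} {inc}")  # attach INC to the number
        else:
            mapped_tokens.append(token)
    return number_inc_pairs, " ".join(mapped_tokens)

pairs, mapped = assign_incs("Mark has 3 apples and buys 5 more .")
# pairs  -> [('3', 'N0'), ('5', 'N1')]
# mapped -> 'Mark has 3 N0 apples and buys 5 N1 more .'
```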

For the three templates, the INC is used as follows. First, to generate the postfix equation template, we replace all numbers in an answer equation with the corresponding INCs and then convert the equation into postfix format. Second, to generate the NET, we replace all arithmetic operators (÷, −, ×, and +) in the postfix equation template with an ‘OP’ token. Partial NETs, which are used in the T-MTDNN model, are generated for each operator: for each ‘OP’ token in the NET, a partial NET is generated by removing everything in the NET after the ‘OP’ token to be predicted. Therefore, two partial NETs are formed in the example shown in Fig. 1. To make the model reflect the contextual meaning of operators during the learning process, we use the OP token in a slightly different way than in [4]: the model receives the OP tokens and the sentence together as input.
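The template side of the normalization can be sketched as follows, assuming the answer equation is already tokenized with numbers replaced by INCs; the shunting-yard conversion and the helper names are our illustration.

```python
def to_postfix(tokens):
    """Convert an infix equation (numbers already replaced by INCs) to
    postfix via the shunting-yard algorithm; four binary operators only."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    out, stack = [], []
    for tok in tokens:
        if tok in prec:
            while stack and stack[-1] != "(" and prec.get(stack[-1], 0) >= prec[tok]:
                out.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()
        else:  # an operand such as N0
            out.append(tok)
    return out + stack[::-1]

def make_templates(postfix):
    """Build the NET (operators masked as 'OP') and one partial NET per
    operator, truncated at the 'OP' token to be predicted."""
    ops = {"+", "-", "*", "/"}
    net = ["OP" if t in ops else t for t in postfix]
    partial_nets = [net[: i + 1] for i, t in enumerate(postfix) if t in ops]
    return net, partial_nets

postfix = to_postfix(["N0", "+", "N1", "*", "N2"])  # ['N0', 'N1', 'N2', '*', '+']
net, partials = make_templates(postfix)
# net      -> ['N0', 'N1', 'N2', 'OP', 'OP']
# partials -> [['N0', 'N1', 'N2', 'OP'], ['N0', 'N1', 'N2', 'OP', 'OP']]
```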

3.2 Base model: T-MTDNN

The TM-generation model is an enhanced version of T-MTDNN. Therefore, we first explain the structure of T-MTDNN as the base model for understanding TM-generation. The top half of Fig. 2 shows the architecture of T-MTDNN, a classification-based model that consists of the BERT language model, a template classification layer, and an operator classification layer. In T-MTDNN, the BERT language model first yields a distributed representation that reflects human pragmatic language usage and world knowledge [16]. Next, the template classification layer matches the question to an appropriate normalized equation template (NET) produced during the input normalization step, without considering operator types. Finally, the operator classification layer fills in the ‘OP’ tokens in the NET.

Fig. 2 The architecture of T-MTDNN and TM-generation

The learning process of T-MTDNN is as follows. First, the BERT language model tokenizes the INC-mapped question and produces a hidden-state vector for each token. Next, the template classification layer outputs the most probable NET for the INC-mapped question. The yielded NET is then divided into partial NETs, one per operator: for each ‘OP’ token in the NET, a partial NET is generated by removing everything in the NET after the ‘OP’ token to be predicted. A concatenation of the INC-mapped question and each partial NET is fed into BERT again, and the tokenized input passes through the operator classification layer, which outputs the appropriate operator. After T-MTDNN selects the most probable operators according to the number of operators in the NET, the OP tokens in the NET are replaced with the corresponding arithmetic operators to generate the equation template.
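This two-stage pipeline can be summarized in code. The sketch below uses our own naming (`template_clf`, `operator_clf`, `NET_VOCAB`) and assumes the classification heads are already trained; it is an illustration of the described flow, not the authors’ implementation.

```python
import torch

OPS = ["+", "-", "*", "/"]
NET_VOCAB = [["N0", "N1", "OP"], ["N0", "N1", "OP", "N2", "OP"]]  # templates seen in training

@torch.no_grad()
def t_mtdnn_predict(bert, tokenizer, template_clf, operator_clf, question):
    # Stage 1: classify the NET from the [CLS] representation of the question.
    enc = tokenizer(question, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state        # (1, seq_len, dim)
    net = NET_VOCAB[template_clf(hidden[:, 0]).argmax(-1).item()]

    # Stage 2: re-encode question + partial NET once per 'OP' token and
    # classify the operator from the [CLS] vector.
    predicted = []
    for i, tok in enumerate(net):
        if tok != "OP":
            predicted.append(tok)
            continue
        partial = " ".join(net[: i + 1])          # partial NET up to this OP
        enc2 = tokenizer(question, partial, return_tensors="pt")
        h2 = bert(**enc2).last_hidden_state
        predicted.append(OPS[operator_clf(h2[:, 0]).argmax(-1).item()])
    return predicted                              # template with operators filled in
```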

3.3 TM-generation model

TM-generation (template-based multitask generation) is an extension of T-MTDNN. The bottom half of Fig. 2 shows the architecture of the TM-generation model, which has two main components for addressing the two challenges: (1) filling in missing world knowledge required to understand the given mathematical word problem, and (2) understanding the implied relationship between numbers and variables. The two corresponding components are the language model and the operator identification layer. By combining these two components, TM-generation obtains the correct equation template. The pipeline for generating the equation template is organized as follows.

First, the ELECTRA language model tokenizes input sentences and produces a hidden-state vector for each token. Then, to generate the NET, the model uses the encoder-decoder architecture of a Transformer [22] and a template generation layer. The TM-generation encoder is the ELECTRA language model, and the decoder is a Transformer decoder. Because such an encoder-decoder architecture can generate any token in an equation template, the TM-generation model can build an equation template that does not appear in the training set. The Transformer decoder uses the hidden-state vectors produced by the ELECTRA language model as inputs. Next, the template generation layer takes the decoder’s hidden-state vectors as input and recursively generates NET tokens. By combining the tokens generated sequentially, the NET is finally produced.
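A minimal sketch of this encoder-decoder branch is given below, assuming the Hugging Face ELECTRA-base checkpoint and our own module names; the decoder hyperparameters (six layers, eight heads, feed-forward size 2048) follow Sect. 4.2, and we use the newer batch-first PyTorch decoder API for readability (the paper itself used PyTorch 1.6).

```python
import torch
import torch.nn as nn
from transformers import ElectraModel

class NetGenerator(nn.Module):
    """ELECTRA encoder + Transformer decoder + linear template generation
    layer; an illustrative sketch, not the released implementation."""
    def __init__(self, net_vocab_size: int, dim: int = 768):
        super().__init__()
        self.encoder = ElectraModel.from_pretrained("google/electra-base-discriminator")
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.embed = nn.Embedding(net_vocab_size, dim)
        self.generation_head = nn.Linear(dim, net_vocab_size)

    def forward(self, input_ids, attention_mask, net_ids):
        # Encoder hidden states serve as the decoder's memory.
        memory = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        tgt = self.embed(net_ids)
        # Causal mask so each NET token only attends to earlier ones.
        t = net_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.generation_head(out)  # logits over NET tokens per step
```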

After generating the NET, the ELECTRA language model retokenizes the concatenation of the INC-mapped question and the NET, similar to T-MTDNN, and produces the hidden-state vector of each token. The operator identification layer then predicts the operators by transforming the hidden-state vector of each ‘OP’ token, which is intended to extract contextual information between numbers and operators. This technique of predicting through ‘OP’ tokens is similar to the approach used in BERT, where the ‘[CLS]’ token is used as a symbolic token [16]. Just as the ‘[CLS]’ token of BERT represents the context of a whole sentence, we expect that the ‘OP’ token of the TM-generation model helps establish the relationship between numbers and operators more easily than using a language model by itself. Finally, after the operator identification layer outputs the most probable operators, the TM-generation model deduces the correct equation template by combining the generated NET and the operators. Throughout the whole process, we used cross entropy to calculate the loss. The loss for NET generation is as follows:

$$ L_{\text{NET}} = - \sum_{t = 1}^{T} \log P\left( y_{t} \mid x, y_{<t} \right), $$

where \(y_{t}\) indicates the t-th token in the output NET for the given problem \(x\), and \(y_{<t}\) denotes the previously generated tokens.
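In code, \(L_{\text{NET}}\) is ordinary token-level cross entropy over the generated sequence; the padding handling below is our assumption.

```python
import torch.nn.functional as F

def net_loss(logits, target_ids, pad_id=0):
    """L_NET: negative log-likelihood of each gold NET token given the
    problem and the previously generated tokens."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * T, vocab)
        target_ids.reshape(-1),               # (batch * T,)
        ignore_index=pad_id,
    )
```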

3.3.1 Language model improvement

Specifically, for the language model, we apply the ELECTRA language model [1] instead of BERT, which was used in T-MTDNN. As ELECTRA has achieved state-of-the-art performance on several natural language understanding tasks, including question answering, we expect that using ELECTRA improves performance compared to using BERT.

3.3.2 Operator identification calculation process

Figure 3 compares the operator classification pipeline of T-MTDNN with the operator identification pipeline of TM-generation. The main difference between the two processes is the input format. In T-MTDNN, a single hidden-state vector of the [CLS] token, which is known to reflect the meaning of a whole sentence [16], is used; thus, T-MTDNN produces the correct equation template by running the operator pipeline multiple times, once per partial NET. Meanwhile, in TM-generation, the hidden-state vectors of all ‘OP’ tokens that appear in a single NET are used in a single step, without partial NETs. This modification enables the operator identification layer to process all the operators in the NET simultaneously; repeating the operator pipeline is therefore unnecessary, which improves computational efficiency.

Fig. 3 Comparison of the operator classification calculation process of T-MTDNN versus the operator identification calculation process of TM-generation

Moreover, using the OP tokens’ hidden states instead of the [CLS] token’s hidden state has the advantage that TM-generation can utilize the information extracted by the language model more directly in the operator calculation process than the T-MTDNN model can. Although the [CLS] token’s hidden state is known to contain the context of the entire sentence, it is unclear which part of the sentence is used to infer that context. Meanwhile, if the hidden states of the OP tokens are used to predict the operators, it becomes clear that each operator is deduced from its corresponding OP token. Thus, we expect that predicting an operator directly through the hidden state of its OP token is more reliable than predicting it indirectly through the hidden state of the [CLS] token.
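This single-pass gathering of OP hidden states can be sketched as follows; `op_head` and `op_token_id` are our names for the linear operator identification layer and the tokenizer id of the ‘OP’ token.

```python
import torch

def identify_operators(electra, op_head, input_ids, attention_mask, op_token_id):
    """Classify every 'OP' token in the question + NET sequence in one
    forward pass, instead of re-encoding one partial NET per operator
    as in T-MTDNN."""
    hidden = electra(input_ids, attention_mask=attention_mask).last_hidden_state
    op_mask = input_ids == op_token_id    # (batch, seq_len) boolean mask
    op_states = hidden[op_mask]           # (num_op_tokens_in_batch, dim)
    return op_head(op_states).argmax(-1)  # one operator class per OP token
```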

For calculating the loss of operator identification, we used the cross entropy loss function. The detailed formula is as follows:

$$ L_{\text{total}} = L_{\text{NET}} - \frac{1}{\left| O \right|} \sum_{n = 1}^{\left| O \right|} \log P\left( o_{n} \right), $$

where \(o_{n}\) indicates the n-th operator token, and \(|O|\) indicates the number of operator tokens in the equation.
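Combined with \(L_{\text{NET}}\), the total loss can be sketched as follows; the reduction details are our assumptions.

```python
import torch.nn.functional as F

def total_loss(net_logits, net_targets, op_logits, op_targets):
    """L_total = L_NET + operator cross entropy averaged over the |O|
    operator tokens (F.cross_entropy performs the 1/|O| averaging)."""
    l_net = F.cross_entropy(net_logits.reshape(-1, net_logits.size(-1)),
                            net_targets.reshape(-1))
    l_op = F.cross_entropy(op_logits, op_targets)
    return l_net + l_op
```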

4 Experiment

In this section, we present the experimental design for evaluating the performance of our TM-generation model. We first explain the benchmark datasets and the baseline model used in our work. Then, we explain the implementation details used to run our models.

4.1 Dataset and baseline models

We used two popular mathematical word problem datasets as benchmarks: MAWPS [10] (2373 problems) and Math23k [11] (23,162 problems). Both datasets contain mathematical word problems that use an equation with one variable and four arithmetic operators (+, −, ×, ÷). Table 2 shows the characteristics of the two datasets in terms of the number of problems, dataset split, number of equation templates, number of templates unseen during training, and number of problems that require world knowledge to deduce a correct answer equation. Note that Math23k has a sufficiently large number of mathematical word problems requiring world knowledge, approximately 23% of the problems.

Table 2 Characteristics of the datasets used for performance evaluation

For metrics, we computed the problem-solving accuracy on each dataset. As in other studies [4, 7, 23], we defined problem-solving accuracy as the proportion of correctly solved problems over the total number of problems. A problem is regarded as correctly solved by a model if (1) the model successfully deduces the answer equation for the given problem, or (2) the model produces an answer equation that yields the same answer as the original one. We calculated the answer of an equation using the Python library SymPy [24].
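A sketch of this answer-equivalence check with SymPy is shown below; the tolerance and error handling are our choices.

```python
from sympy import sympify

def is_correct(predicted_eq: str, gold_eq: str, tol: float = 1e-6) -> bool:
    """A problem counts as solved if the predicted equation evaluates to
    the same answer as the gold one, e.g. '3 + 5 * 2' vs '5 * 2 + 3'."""
    try:
        pred = float(sympify(predicted_eq))
        gold = float(sympify(gold_eq))
    except Exception:  # unparsable or non-numeric prediction counts as wrong
        return False
    return abs(pred - gold) < tol
```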

For baseline models, we compared the problem-solving accuracy of our proposed models with those of TSN-MD [23], Griffith and Kalita [19], and T-MTDNN with operator identification, which are the state-of-the-art models on MAWPS and Math23k. TSN-MD uses a teacher-student approach with a student model that has multiple decoders. The multiple decoders deduce different answer equations, and TSN-MD then selects the equation with the highest beam search score. This process can regularize the learning behavior of the student network of the teacher-student model in solving mathematical word problems. Similarly, Griffith and Kalita [19] proposed a Transformer model and evaluated its performance with various forms of expression, with and without pre-trained language models. Although they reported the performance of their models on the same datasets we used, they obtained their train/dev/test sets by random sampling. Because their test set is unpublished, a direct comparison with our model is not possible; we therefore used five-fold cross-validation for the performance comparison.
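For reference, the five-fold protocol can be expressed as below; `problems` stands for the list of dataset items, and the library and seed are our choices.

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(problems)):
    # Train on problems[train_idx], evaluate on problems[test_idx],
    # then average problem-solving accuracy over the five folds.
    ...
```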

4.2 Implementation

The implementation details of T-MTDNN with operator identification and the TM-generation model are as follows. We built both models with PyTorch 1.6 [25] and used two language models provided by the Transformers library: ELECTRA-base and Chinese-ELECTRA-base. The multi-task layers are implemented as follows: for the TM-generation model, the operator identification layer is a single linear layer; for the template generation task, we used a single linear template generation layer and a Transformer decoder with six layers. Each decoder layer has two multi-head attention modules with eight heads and an intermediate layer with an output size of 2048.

We used the following hyperparameters to train our models: optimizer, learning rate, warm-up scheduler, and batch size. For the optimizer, we used Adam [26] with \(\epsilon = 1 \times 10^{-6}\), β1 = 0.9, and β2 = 0.999. For the learning rates, we used different values across components, ranging from 3 × 10−5 to 2 × 10−4, as shown in Table 3. We used a linear warm-up scheduler for the learning rates with a warm-up step ratio of 0.1. For the batch size, we grouped 16 problems into a batch. We also employed cross entropy loss with label smoothing [27] to handle the overconfidence issue in the template generation layer of the TM-generation model. In addition, during testing, we used beam search with a beam size of 3 for the TM-generation model. All experiments were conducted on Google Colab, a local PC with 64 GB RAM and RTX 2070 Super and RTX 2080 Ti GPUs, and a local server with 192 GB RAM and two Quadro RTX 8000 GPUs.

Table 3 Learning rates of T-MTDNN + operator identification and TM-generation
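These training settings correspond roughly to the following configuration sketch; `model`, `train_loader`, and `num_epochs` are assumed to be defined, the parameter grouping and concrete learning rates are illustrative stand-ins for Table 3, and the newer PyTorch `label_smoothing` argument stands in for the approach of [27].

```python
import torch
from transformers import get_linear_schedule_with_warmup

param_groups = [
    {"params": model.encoder.parameters(), "lr": 3e-5},      # language model
    {"params": model.decoder.parameters(), "lr": 2e-4},      # task-specific layers
]
optimizer = torch.optim.Adam(param_groups, eps=1e-6, betas=(0.9, 0.999))

total_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),                 # warm-up ratio of 0.1
    num_training_steps=total_steps,
)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing value is illustrative
```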

5 Results and discussion

The proposed TM-generation model outperformed or was comparable to TSN-MD and T-MTDNN with operator identification on MAWPS and Math23k. As shown in Table 4, on the MAWPS dataset, the TM-generation model achieved state-of-the-art performance of 85.2% accuracy with the ELECTRA language model. On the Math23k dataset, the TM-generation model achieved state-of-the-art performance of 85.3% accuracy with ELECTRA. We performed two ablation studies to examine whether addressing the two challenges improved the performance of the proposed model. In the two subsections below, we discuss how TM-generation addresses each challenge.

Table 4 Problem solving accuracies (%) of TM-generation and those of existing models

5.1 Analysis of filling in missing world knowledge

To examine whether using language models can solve the first challenge of filling in missing world knowledge, we conducted an ablation study and a corresponding error analysis. For the ablation study, we compared the performance of TM-generation without a language model against TM-generation with a language model; the difference in performance reflects the effectiveness of the language model. For the error analysis, we calculated the number of problems with the correct equation template (CE) over the number of problems requiring world knowledge (WK). We interpreted CE/WK as the proportion of problems for which the model successfully utilized world knowledge and deduced the correct answer equation. For example, consider a problem containing the words ‘nickel’ and ‘pennies’. To correctly understand the problem, a model should understand that ‘nickel’ in this context refers to ‘5 cents’ rather than ‘a kind of metal’.

A comparison of the accuracy between TM-generation without the language model and TM-generation with the language model showed that the language model contributes to performance improvement. As shown in Table 5, on the MAWPS and Math23k datasets, TM-generation with the language model showed increases of 13.4% and 14.9%, respectively, over TM-generation without the language model.

Table 5 Ablation study for challenge 1

To investigate the source of these performance improvements, we analyzed CE/WK. From the analysis, we suggest that a language model can be an effective method for reflecting world knowledge, but only if a sufficient amount of data is provided. On MAWPS and Math23k, applying the language model yielded CE/WK improvements of 8.2% and 8.9%, respectively. In particular, on Math23k, out of a total of 226 problems requiring world knowledge in the test set, TM-generation with the language model correctly answered 20 more problems than the model without the language model, a performance improvement exceeding the error range.

However, on MAWPS, no statistically significant performance improvement was observed in the test set, in contrast to the improvement in the ablation study. We observed a gap in the CE/WK ratio between the MAWPS (20%) and Math23k (70%) datasets. This difference appears to be caused by the number of training problems requiring world knowledge. The MAWPS dataset, which is considerably smaller than Math23k, contains only 189 problems requiring world knowledge, whereas Math23k contains 5235 such problems. Thus, with the MAWPS dataset, there may not have been enough opportunities to learn world knowledge because the amount of data was relatively small.

5.2 Analysis of understanding the implied relationship

To investigate whether operator identification can address the second challenge of understanding the implied relationship between numbers and variables, we conducted an ablation study and an error analysis. For the ablation study, we compared the accuracy of TM-generation with that of a “pure” generation model, which does not use operator identification; the difference in problem-solving accuracy indicates the effectiveness of the operator identification task. For the error analysis, we computed the number of problems with a correct NET (CN) and the number of problems with correct operators (CO). We interpreted CN and CO as the counts of problems for which the model correctly derived the NET and the operators, respectively.

A comparison of the accuracies between the TM-generation and pure generation models showed that operator identification contributes to performance improvements. As shown in Table 6, on the MAWPS and Math23k datasets, the TM-generation model performed 3.6% and 2.0% better than the pure generation model, respectively. To investigate the source of these improvements, we analyzed the change in CN and CO between the pure generation (ablated) model and the TM-generation model. As shown in Table 6, the model using operator identification exhibited better performance than the one without. Specifically, the TM-generation model showed increases in CN (368 to 383) and CO (363 to 378) on MAWPS, and in CN (711 to 739) and CO (694 to 729) on Math23k. Considering these performance improvements, we conclude that operator identification contributes to understanding the implied relationship between numbers and variables.

Table 6 Ablation study for challenge 2

6 Conclusion

In this study, we proposed the TM-generation model, designed to address two challenges: (1) filling in missing world knowledge required to understand the given mathematical word problem, and (2) understanding the implied relationship between numbers and variables. The TM-generation model successfully addressed both challenges and achieved accuracies of 85.2% and 85.3%, which are state-of-the-art performances on MAWPS and Math23k, respectively. Specifically, we utilized pre-trained language models to handle the first challenge, and we proposed an improved operator identification method to address the second challenge. Our analysis revealed two findings: (1) using a language model can improve the performance of a mathematical word problem-solving model by reflecting world knowledge, but only if a sufficient amount of data is provided; and (2) the improved version of operator identification proposed in the TM-generation model is more successful in understanding the relationship between numbers and variables than that of T-MTDNN. In future work, we plan to apply the two components to build a machine learning model for solving linear equations, which is more difficult than the benchmark task of mathematical word problem solving.