Keywords

1 Introduction

Since the advent of pre-trained language models (PLMs), the enhancement in model performance on knowledge-intensive tasks like arithmetic reasoning and common sense reasoning has been attributed to the ability of semantic representation and the acquisition of knowledge learned implicitly by PLMs trained on pre-training tasks. The math word problem (MWP), as an important branch task in arithmetic reasoning, has attracted the attention of many scholars. Large language models (LLMs) such as GPT-4 [1] achieve SOTA on reasoning tasks, but training them is extremely expensive. Thus, exploring the upper bound of the capability of small language models and the limitation of the knowledge they can accommodate is a more crucial issue.

Numerous scholars have made outstanding contributions to previous work on MWP. Still, typically they only use one solution to each question in the training set, which will further damage generalization performance by exacerbating the error accumulation of ingrained exposure bias in the generative model [2]. To address this, we propose utilizing the addition commutative and multiplication commutative properties to generate consistent solutions for each question in the training set. These consistent solutions are equivalent, as shown in Table 1. Training the model with consistent solutions can optimize the search space of beam search, reducing the error accumulation in exposure bias exacerbated by training with only one solution. This is due to the fact that consistent solutions can give the model the information to enhance reasoning abilities in the decoder, i.e., there may be a number of reasonable options available at each time step as opposed to using a single ground truth in a single solution, forcing the model to generate an optimal sequence of a solution according to the different ground truths from consistent solutions, which can be regarded as a disguised approximation strategy to not using ground truth in training. In testing, this will encourage model learning to adjust to circumstances without ground truth.

Table 1. The example of consistent solutions

In addition, a significant distinction is found between models trained with consistent solutions and those without, based on comparing results on the validation set. The intuitive approach is to use this difference combined with the model ensemble technique to improve the accuracy of the test set.

To sum up, our main contributions are as follows:

1. We use the consistent solutions generated by using the commutative property to replace the single solution when training the model and explain how the consistent solutions optimize the search space of beam search.

2. We utilize two models, one trained with consistent solutions and one trained without consistent solutions, to further improve the model’s performance with the model ensemble.

We achieve fourth place in the NLPCC-2023 shared task 3 with an accuracy of 23.66% on the test set. The specific ranking is shown in Table 2. Our system name is “Tsingriver". Our codes are available at GithubFootnote 1.

Table 2. The ranking of the NLPCC-2023 shared task on Math Word Problem. Only the top five results are shown in the table. A total of 18 teams participated in the task, of which 7 teams submitted the results

2 Related Work

Parsing-based methods and rule-based methods attempt expression templates that summarize various problems from data, but the generalization performance is only moderately effective [3,4,5,6]. More researchers have concentrated their efforts on considering MWP as a generative task and continuing to optimize it since the invention of the seq2seq model. The seq2seq model is used as a baseline model on the proposed Chinese MWP dataset math23k [7].

In terms of decoder architecture, to adapt to the characteristics of generated solutions in MWP, many scholars have proposed tree-structured decoder [8,9,10], among which GTS [10] is considered a strong baseline model. Pointer networks utilized to enhance tree-structured decoder are impacted by machine translation model [11, 12].

To improve the semantic representation of the encoder, graph convolutional networks are applied to capture relationships between quantities in math problems [13]. Similarly, the dependency parsing of hierarchical structure as semantic information is fused into an encoder to enhance the representation of the GTS model.

From the standpoint of deep reinforcement learning, the generative model could be considered a policy network employing the policy gradient algorithm to optimize the model’s capacity for generation [14].

With the emergence of pre-trained language models like BERT [15], the MWP tasks have been switched to the generative model named bert2tree, with BERT and GTS Decoder as the backbone. Subsequent research improves model performance on more detailed issues, such as strengthening the model’s awareness of numerical values through auxiliary or extra pre-training tasks [16,17,18], utilizing contrastive learning to increase the accuracy of solving the problem with similar solutions [19], and incorporating multilingual settings to boost generalization [20].

Inspired by the generative adversarial network (GAN), another mainstream method in the MWP problem employs the ranker network as a discriminator to score the answers in the candidate set generated by the generator. In order to better enhance representation through generative adversarial training, sampling candidate sets has become an important issue. Two methods to generate additional candidate answers, including adding tree-based disturbance to real labels [21] or using the diversity of solutions [22], both demonstrate the importance of using effective data augmentation for sampling.

In addition to regarding MWP as a generating task, the method combining deductive reasoning to treat MWP as an extraction task also shows excellent performance [23].

To sum up, inspired by the diversity of consistent solutions [22], in order to optimize the search space of the decoder, we use the commutative property to obtain the consistent solutions set for each problem and use it directly as training data to fine-tune the generative model consisting of the Mengzi encoder [24] and GTS decoder. Finally, we found significant differences between models trained using consistent solutions and those trained without consistent solutions, so the model ensemble technique is applied to improve model performance (Fig. 1).

3 Proposed System

3.1 Problem Statement

Given a data (WSA) from dataset \(\mathcal {D}\), we denote the background and the question as the input \(W = \{w_{1}, w_{2},..., w_{n}\}\) with length n, while the generated solution \(S = \{s_{1},s_{2},...,s_{m}\}\) following the pre-order traversal ordering is calculated to derive final answer A. Next, we define that S consisting of \(V_{num}\), \(V_{op}\) and \(V_{con}\), where \(V_{num} = \{n_{1},n_{2},...,n_{l}\}\) is variable-length set containing all the numerical quantities involved in W. \(V_{op} = \{+,-,\times ,\div ,\wedge \}\) and \(V_{con} = \{1,\pi \}\) denote the operators and the constant values that could be used in the solution, respectively. With number mapping [7], the elements among \(V_{num}\) transformed to [NUM1], [NUM2], .., [NUMl] will be recognized by the encoder as a special token to encode all tokens in W uniformly.

Given W and S from \(\mathcal {D}\) in the training stage, we first encode W using a pre-trained model, and then we extract the state \(H_{en}\) of the last hidden layer to represent W, which can be indicated as follows:

$$\begin{aligned} H_{en} = Encoder(W). \end{aligned}$$
(1)

the conditional probability p(S|W) can be decomposed as

$$\begin{aligned} p(S|W) = \prod \limits _{i=1}^{m}{p(s_{i}|h_{i},H_{en})} , \end{aligned}$$
(2)

where \(h_{i}\) uniformly denote the \(i\text {-}th\) hidden state containing \(q_{i}\) and \(c_{i}\) used in GTS decoder [10]. At the beginning of decoding, \(q_{1}\) is the vector of [CLS] token from all the hidden state \(H_{en}\) and \(c_{i}\) is always derived by calculating attention between \(H_{en}\) and \(q_{i}\). The objective is to minimize the negative log-likelihood of \(\mathcal {D}\):

$$\begin{aligned} \mathcal L = \sum \limits _{(W,S)\in \mathcal D} -\log {p(S|W)}. \end{aligned}$$
(3)
Fig. 1.
figure 1

On the left is the training strategy with consistent solutions, and on the right is the ensemble strategy

3.2 Consistent Solutions

During the training stage, the teaching forcing mechanism will force the ground truth of the previous time step as the reasoning premise of the current time step. To simplify the decoding process and highlight the role of the ground truth of the previous time step, we can regard it as a Markov Decision Process. When using GTS decoder as an illustration, the conditional probability \(p^{1}_{i}\) and \(p^{2}_{i}\) can be expressed as follows:

$$\begin{aligned} p^{1}_{i} = {\left\{ \begin{array}{ll} p(s^{1}_{i}|s^{1}_{i-1}), &{} \text{ if } s^{1}_{i} \text{ is } \text{ left } \text{ child } \\ p(s^{1}_{i}|s^{1}_{i-j},...,s^{1}_{i-1}), &{} \text{ if } s^{1}_{i} \text{ is } \text{ right } \text{ child } \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} p^{2}_{i} = {\left\{ \begin{array}{ll} p(s^{2}_{i}|s^{2}_{i-1}), &{} \text{ if } s^{2}_{i} \text{ is } \text{ left } \text{ child } \\ p(s^{2}_{i}|s^{2}_{i-j},...,s^{2}_{i-1}), &{} \text{ if } s^{2}_{i} \text{ is } \text{ right } \text{ child } \end{array}\right. } \end{aligned}$$
(5)

where \(s^{1}_{i}\) and \(s^{2}_{i}\) denotes the prediction in current time step i, \(s^{1}_{i-1}\) from solution 1 \(S^{1}\) indicates the ground truth in previous time step \(i\text {-}1\) and \(s^{1}_{i-j}\) is parent node of \(s^{1}_{i}\). Assumes \(s^{2}_{i-1}\) is another ground truth from solution 2 \(S^{2}\). For instance, 4.0 in \(S^{1}\) and \(\div \) in \(S^{2}\) both could be ground truth at the third time step as shown in Table 1. Let’s use the left child node as an example to make the explanation more straightforward.

Considering only \(S^{1}\) is used in training, the model would maximize the conditional probability \(p^{1}_{i}\) in Eq. 4. However, in testing, if \(s^{1}_{i-1}\) isn’t selected and \(s^{2}_{i-1}\) appear in the path of beam search, the specific predicted token of \(s^{2}_{i}\) could be wrong even resulting error accumulation due to the conditional probability \(p^{2}_{i}\) never be seen by the model. Thus, the beam search algorithm will be invalid for searching consistent solutions except \(S^{1}\), resulting in performance loss manifested as insufficient reasoning ability during decoding.

It is worth emphasizing that the insufficient reasoning ability is caused by the complete dependence on the only ground truth \(S^{1}\), simultaneously failing to generate \(S^{2}\) according to \(s^{2}_{i-1}\) never seen by the model. Essentially, although the teaching force limits the exploration ability of the model, we have to rely on it to make the model converge. Therefore, we observe this problem again in terms of the impact of the single solution on the search space of beam search instead of addressing exposure bias directly.

The single solution, essentially, limits the search space of beam search by invalidating the path of beam search that generates \(S^{2}\) according to \(s^{2}_{i-1}\). Thus, we adopt consistent solutions for training directly based on two subsequent considerations:

1) The consistent solutions enable the path of beam search that generates \(S^{2}\) according to \(s^{2}_{i-1}\).

2) When overfitting occurs with the data \((W,S^{1})\), \((W,S^{2})\) fed into the model can be regarded as an examination to generate \(S^{2}\) and \(s^{2}_{i-1}\) simulates the output of the model used to predict the next token in testing, which is a disguised approximation strategy to not using ground truth in training.

Ideally, all the consistent solutions to each data should be sampled as training data to address this problem. But the cost of manually labeling all consistent solutions is relatively expensive. Thus, only the addition commutative property and the multiplication commutative property are utilized to generate consistent solutions. The experimental examples and figures for optimizing the search space of beam search are offered in Sect. 4.4.

3.3 Model Ensemble

The data distribution with respect to \(\mathcal {D}\) is defined as \(\mathcal {P}_{\mathcal {D}}(W)\), and a data \((W_{1},S^{1}_{1}) \sim \mathcal {P}_{\mathcal {D}}(W)\). Next, We denote \(\mathcal {D}^{*}\) as the dataset augmented through consistent solutions, corresponding to the data distribution \(\mathcal {P}_{\mathcal {D}^{*}}(W)\). Thus, \(W_{1}\) with t solutions and \((W_{1},S^{t}_{1}) \sim \mathcal {P}_{\mathcal {D}^{*}}(W)\).

The model with parameters \(\theta \) trained on the data distribution \(\mathcal {P}_{\mathcal {D}}(W)\) is denoted as \(M_{\theta }\) and the model with parameters \(\theta ^{*}\) trained on \(\mathcal {P}_{\mathcal {D}^{*}}(W)\) denoted as \(M_{\theta ^{*}}\). A significant distinction between models trained with consistent solutions and those without is found based on comparing results on the validation set. Thus, we attempt to utilize two models combined model ensemble to improve performance. To start with, construct bias sets for \(M_{\theta }\) and \(M_{\theta ^{*}}\). We define that \(B_{M_{\theta }}\) as bias set involving W can be answered correctly by \(M_{\theta }\) only, and \(B_{M_{\theta ^{*}}}\) as bias set involving W can be answered correctly by \(M_{\theta ^{*}}\) only. They can be described as follows:

$$\begin{aligned} B_{\theta } = \{W|W \in R(M_{\theta }(\cdot )), W \in E(M_{\theta ^{*}}(\cdot ))\}, \end{aligned}$$
(6)
$$\begin{aligned} B_{\theta ^{*}} = \{W|W \in R(M_{\theta ^{*}}(\cdot )), W \in E(M_{\theta }(\cdot )) \}, \end{aligned}$$
(7)

where \(R(\cdot )\) and \(E(\cdot )\) are indicating functions. \(R(\cdot )\) and \(E(\cdot )\) indicate the problem from the validation set answered correctly and incorrectly, respectively.

Given a new \(\hat{W}\), \(B_{\theta }\) and \(B_{\theta ^{*}}\), the encoder with parameters \(\theta \), uniformly, is used to derive the [CLS] vector corresponding to \(\hat{W}\) and the [CLS] vector set corresponding to \(B_{M_{\theta }}\) and \(B_{M_{\theta ^{*}}}\) separately.

$$\begin{aligned} CLS_{\hat{W}} = Encoder_{\theta }(\hat{W}) \end{aligned}$$
(8)
$$\begin{aligned} CLS_{B_{\theta }} = Encoder_{\theta }(B_{\theta }) \end{aligned}$$
(9)
$$\begin{aligned} CLS_{B_{\theta ^{*}}} = Encoder_{\theta }(B_{\theta ^{*}}) \end{aligned}$$
(10)

Next, we calculate cosine similarity between \(CLS_{\hat{W}}\) and each element in \(CLS_{B_{\theta }}\) and take the top k [CLS] vector from \(CLS_{B_{\theta }}\) denoted as \(CLS^{i}_{B_{\theta }}\) according to cosine similarity in descending order. Similarly, we perform the same operation between \(CLS_{\hat{W}}\) and \(CLS_{B_{\theta ^{*}}}\) as described above. Final, we respectively calculate the score of \(\hat{W}\) on two bias sets as follows:

$$\begin{aligned} Score_{B_{\theta }} = \frac{1}{k} \sum \limits _{i=1}^{k}{sim(CLS_{\hat{W}},CLS^{i}_{B_{\theta }})}, \end{aligned}$$
(11)
$$\begin{aligned} Score_{B_{\theta ^{*}}} = \frac{1}{k} \sum \limits _{i=1}^{k}{sim(CLS_{\hat{W}},CLS^{i}_{B_{\theta ^{*}}})}, \end{aligned}$$
(12)

where \(sim(\cdot )\) indicates cosine similarity. \(\hat{W}\) will be solved by \(M_{\theta }\) if \(Score_{B_{\theta }} > Score_{B_{\theta ^{*}}}\), otherwise by \(M_{\theta ^{*}}\).

4 Experiment

4.1 Dataset

Experiments are conducted on the challenging dataset from NLPCC-2023 shared task 3, a well-annotated MWP dataset in Chinese. The training, validation, and test sets include 23162, 1200, and 1200 data, respectively.

Due to the commutative property being used to generate additional consistent solutions for each problem in the training set if their solution is commutative, the number of data in the training set is 48,103. Each question in the training set has a corresponding solution, whereas the validation and test sets only contain the final answer. This is a minor distinction between the training, validation, and test sets. Thus, the accuracy of the final answer serves as the evaluation metric in NLPCC-2023 shared task 3.

4.2 Implementation Details

In the training phase, we use AdamW [25] to optimize our models and tune hyperparameters based on the validation set results. In the hyperparameter setting, it is appropriate to set the learning rate at 8e-5 for the model trained without consistent solutions and 4e-5 for the model trained with consistent solutions. We recommend setting the rest of the hyperparameters involving warmup steps, dropout rate, hidden size, embedding size, and beam size at 3000, 0.5, 768, 128, and 3, respectively.

In the ensemble phase, we start with a statistical analysis of the results of two models on the validation set. Although the accuracy of the validation set for the two models is only marginally different, a detailed analysis of the answers of the models on particular questions reveals significant differences in that 37 problems are answered correctly only by \(M_{\theta }\) and 62 problems are answered correctly only by \(M_{\theta ^{*}}\). Thus, we utilize it to improve performance based on the hypothesis that the consistent distribution on validation and testing sets.

4.3 Experimental Results

Due to only one submission for the test set being allowed, we display all results on the validation set in Table 3. And the submission for the test set is shown in Table 2.

We experimented with two PLMs that work better in a Chinese context. In addition to comparing results from the end-to-end generative model to fine-tuning, mBART, used in previous work, is also added to our experiment.

Table 3. All results on the validation set. and indicate the model trained with or without consistent solutions

Ernie-3.0-Base [26]. AutoRegressive and AutoEncoder networks are creatively combined for pre-training in ERNIE 3.0 by including semantic tasks like entity prediction, judging sentence causality, and reconstructing article sentence structure.

Mengzi-Bert-Base [24]. Mengzi encoder is a pre-trained model on 300G Chinese corpus. Masked language modeling(MLM), part-of-speech(POS) tagging and sentence order prediction(SOP) are used to train.

mBART [27]. With earlier methods concentrating just on the encoder, decoder, or reconstructing portions of the text, mBART is the first method to pre-train the complete sequence-to-sequence model by denoising full texts in multilingual setting.

4.4 Case Study

Solution on Models.

The compared example from the validation set is fed into the Mengzi2tree with and without consistent solutions. The solution output is shown in Fig. 2.

Fig. 2.
figure 2

Solution on two models. The DACS Model indicates the model trained with consistent solutions, while the Normal Model is trained without consistent solutions

Fig. 3.
figure 3

The results of beam search on two models. The beam search result from the model trained with consistent solutions is shown above the red dashed line (Color figure online)

Fig. 4.
figure 4

Consistent solutions optimize the search space of beam search

When the left and right sub-nodes of the “\(*\)" are interchangeable, selecting the “4.0" as the left node first, the model that has not been trained with consistent solutions cannot deduce the correct solution and the final answer. We will display the results of the beam search from the model trained with consistent solutions in the following subsection.

Optimization on Search Space. The model trained with consistent solutions is still able to make accurate predictions even when the “4.0" is chosen as the left node of the “\(*\)" first, as shown in the second subgraph above the red dashed line in Fig. 3.

Additionally, Fig. 4 illustrates how the consistent solutions optimize the search space by demonstrating that our model trained with consistent solutions can produce the correct solution regardless of whatever node among the interchangeable nodes is predicted to appear first. The top 5 results from the beam search of the model are all correct, although the beam width is only set at 3 in training.

5 Conclusion

Based on previous studies on math word problems, we found that models that only learned one solution lacked reasoning skills when decoding. To address this, we suggest utilizing the commutative property to generate consistent solutions for the training set, and then using it as extra training data to optimize the search space in beam search.

In addition, we discover significant differences between models trained with and without consistent solutions, so the model ensemble is applied to improve performance. For future research, we will focus on exploring the optimization of search space with adversarial training and reinforcement learning.