Abstract
Research on math word problems has made significant advancements due to the emergence of language models. Large language models have excelled in a variety of reasoning tasks. Still, due to the demand for low costs, research on the upper bound of small language models in reasoning tasks and the limitation of the knowledge they can accommodate has drawn attention. In line with previous work on math word problems, we discover that models that only learned a single solution lacked reasoning ability during the decoding process, further exacerbating the error accumulation caused by exposure bias that will fail generalization. To tackle this problem, we suggest using the commutative property to generate a consistent solution set for each data in the training set. Then, we use it as additional training data to optimize the search space in beam search. On this foundation, we will go into great detail about how consistent solutions training affects the work process of beam search. In addition, we found significant differences between models trained using consistent solutions and those trained without consistent solutions, so the model ensemble technique is applied to improve model performance. In the NLPCC-2023 shared task 3, our model ultimately ranks fourth with an accuracy of 23.66%.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Since the advent of pre-trained language models (PLMs), the enhancement in model performance on knowledge-intensive tasks like arithmetic reasoning and common sense reasoning has been attributed to the ability of semantic representation and the acquisition of knowledge learned implicitly by PLMs trained on pre-training tasks. The math word problem (MWP), as an important branch task in arithmetic reasoning, has attracted the attention of many scholars. Large language models (LLMs) such as GPT-4 [1] achieve SOTA on reasoning tasks, but training them is extremely expensive. Thus, exploring the upper bound of the capability of small language models and the limitation of the knowledge they can accommodate is a more crucial issue.
Numerous scholars have made outstanding contributions to previous work on MWP. Still, typically they only use one solution to each question in the training set, which will further damage generalization performance by exacerbating the error accumulation of ingrained exposure bias in the generative model [2]. To address this, we propose utilizing the addition commutative and multiplication commutative properties to generate consistent solutions for each question in the training set. These consistent solutions are equivalent, as shown in Table 1. Training the model with consistent solutions can optimize the search space of beam search, reducing the error accumulation in exposure bias exacerbated by training with only one solution. This is due to the fact that consistent solutions can give the model the information to enhance reasoning abilities in the decoder, i.e., there may be a number of reasonable options available at each time step as opposed to using a single ground truth in a single solution, forcing the model to generate an optimal sequence of a solution according to the different ground truths from consistent solutions, which can be regarded as a disguised approximation strategy to not using ground truth in training. In testing, this will encourage model learning to adjust to circumstances without ground truth.
In addition, a significant distinction is found between models trained with consistent solutions and those without, based on comparing results on the validation set. The intuitive approach is to use this difference combined with the model ensemble technique to improve the accuracy of the test set.
To sum up, our main contributions are as follows:
1. We use the consistent solutions generated by using the commutative property to replace the single solution when training the model and explain how the consistent solutions optimize the search space of beam search.
2. We utilize two models, one trained with consistent solutions and one trained without consistent solutions, to further improve the model’s performance with the model ensemble.
We achieve fourth place in the NLPCC-2023 shared task 3 with an accuracy of 23.66% on the test set. The specific ranking is shown in Table 2. Our system name is “Tsingriver". Our codes are available at GithubFootnote 1.
2 Related Work
Parsing-based methods and rule-based methods attempt expression templates that summarize various problems from data, but the generalization performance is only moderately effective [3,4,5,6]. More researchers have concentrated their efforts on considering MWP as a generative task and continuing to optimize it since the invention of the seq2seq model. The seq2seq model is used as a baseline model on the proposed Chinese MWP dataset math23k [7].
In terms of decoder architecture, to adapt to the characteristics of generated solutions in MWP, many scholars have proposed tree-structured decoder [8,9,10], among which GTS [10] is considered a strong baseline model. Pointer networks utilized to enhance tree-structured decoder are impacted by machine translation model [11, 12].
To improve the semantic representation of the encoder, graph convolutional networks are applied to capture relationships between quantities in math problems [13]. Similarly, the dependency parsing of hierarchical structure as semantic information is fused into an encoder to enhance the representation of the GTS model.
From the standpoint of deep reinforcement learning, the generative model could be considered a policy network employing the policy gradient algorithm to optimize the model’s capacity for generation [14].
With the emergence of pre-trained language models like BERT [15], the MWP tasks have been switched to the generative model named bert2tree, with BERT and GTS Decoder as the backbone. Subsequent research improves model performance on more detailed issues, such as strengthening the model’s awareness of numerical values through auxiliary or extra pre-training tasks [16,17,18], utilizing contrastive learning to increase the accuracy of solving the problem with similar solutions [19], and incorporating multilingual settings to boost generalization [20].
Inspired by the generative adversarial network (GAN), another mainstream method in the MWP problem employs the ranker network as a discriminator to score the answers in the candidate set generated by the generator. In order to better enhance representation through generative adversarial training, sampling candidate sets has become an important issue. Two methods to generate additional candidate answers, including adding tree-based disturbance to real labels [21] or using the diversity of solutions [22], both demonstrate the importance of using effective data augmentation for sampling.
In addition to regarding MWP as a generating task, the method combining deductive reasoning to treat MWP as an extraction task also shows excellent performance [23].
To sum up, inspired by the diversity of consistent solutions [22], in order to optimize the search space of the decoder, we use the commutative property to obtain the consistent solutions set for each problem and use it directly as training data to fine-tune the generative model consisting of the Mengzi encoder [24] and GTS decoder. Finally, we found significant differences between models trained using consistent solutions and those trained without consistent solutions, so the model ensemble technique is applied to improve model performance (Fig. 1).
3 Proposed System
3.1 Problem Statement
Given a data (W, S, A) from dataset \(\mathcal {D}\), we denote the background and the question as the input \(W = \{w_{1}, w_{2},..., w_{n}\}\) with length n, while the generated solution \(S = \{s_{1},s_{2},...,s_{m}\}\) following the pre-order traversal ordering is calculated to derive final answer A. Next, we define that S consisting of \(V_{num}\), \(V_{op}\) and \(V_{con}\), where \(V_{num} = \{n_{1},n_{2},...,n_{l}\}\) is variable-length set containing all the numerical quantities involved in W. \(V_{op} = \{+,-,\times ,\div ,\wedge \}\) and \(V_{con} = \{1,\pi \}\) denote the operators and the constant values that could be used in the solution, respectively. With number mapping [7], the elements among \(V_{num}\) transformed to [NUM1], [NUM2], .., [NUMl] will be recognized by the encoder as a special token to encode all tokens in W uniformly.
Given W and S from \(\mathcal {D}\) in the training stage, we first encode W using a pre-trained model, and then we extract the state \(H_{en}\) of the last hidden layer to represent W, which can be indicated as follows:
the conditional probability p(S|W) can be decomposed as
where \(h_{i}\) uniformly denote the \(i\text {-}th\) hidden state containing \(q_{i}\) and \(c_{i}\) used in GTS decoder [10]. At the beginning of decoding, \(q_{1}\) is the vector of [CLS] token from all the hidden state \(H_{en}\) and \(c_{i}\) is always derived by calculating attention between \(H_{en}\) and \(q_{i}\). The objective is to minimize the negative log-likelihood of \(\mathcal {D}\):
3.2 Consistent Solutions
During the training stage, the teaching forcing mechanism will force the ground truth of the previous time step as the reasoning premise of the current time step. To simplify the decoding process and highlight the role of the ground truth of the previous time step, we can regard it as a Markov Decision Process. When using GTS decoder as an illustration, the conditional probability \(p^{1}_{i}\) and \(p^{2}_{i}\) can be expressed as follows:
where \(s^{1}_{i}\) and \(s^{2}_{i}\) denotes the prediction in current time step i, \(s^{1}_{i-1}\) from solution 1 \(S^{1}\) indicates the ground truth in previous time step \(i\text {-}1\) and \(s^{1}_{i-j}\) is parent node of \(s^{1}_{i}\). Assumes \(s^{2}_{i-1}\) is another ground truth from solution 2 \(S^{2}\). For instance, 4.0 in \(S^{1}\) and \(\div \) in \(S^{2}\) both could be ground truth at the third time step as shown in Table 1. Let’s use the left child node as an example to make the explanation more straightforward.
Considering only \(S^{1}\) is used in training, the model would maximize the conditional probability \(p^{1}_{i}\) in Eq. 4. However, in testing, if \(s^{1}_{i-1}\) isn’t selected and \(s^{2}_{i-1}\) appear in the path of beam search, the specific predicted token of \(s^{2}_{i}\) could be wrong even resulting error accumulation due to the conditional probability \(p^{2}_{i}\) never be seen by the model. Thus, the beam search algorithm will be invalid for searching consistent solutions except \(S^{1}\), resulting in performance loss manifested as insufficient reasoning ability during decoding.
It is worth emphasizing that the insufficient reasoning ability is caused by the complete dependence on the only ground truth \(S^{1}\), simultaneously failing to generate \(S^{2}\) according to \(s^{2}_{i-1}\) never seen by the model. Essentially, although the teaching force limits the exploration ability of the model, we have to rely on it to make the model converge. Therefore, we observe this problem again in terms of the impact of the single solution on the search space of beam search instead of addressing exposure bias directly.
The single solution, essentially, limits the search space of beam search by invalidating the path of beam search that generates \(S^{2}\) according to \(s^{2}_{i-1}\). Thus, we adopt consistent solutions for training directly based on two subsequent considerations:
1) The consistent solutions enable the path of beam search that generates \(S^{2}\) according to \(s^{2}_{i-1}\).
2) When overfitting occurs with the data \((W,S^{1})\), \((W,S^{2})\) fed into the model can be regarded as an examination to generate \(S^{2}\) and \(s^{2}_{i-1}\) simulates the output of the model used to predict the next token in testing, which is a disguised approximation strategy to not using ground truth in training.
Ideally, all the consistent solutions to each data should be sampled as training data to address this problem. But the cost of manually labeling all consistent solutions is relatively expensive. Thus, only the addition commutative property and the multiplication commutative property are utilized to generate consistent solutions. The experimental examples and figures for optimizing the search space of beam search are offered in Sect. 4.4.
3.3 Model Ensemble
The data distribution with respect to \(\mathcal {D}\) is defined as \(\mathcal {P}_{\mathcal {D}}(W)\), and a data \((W_{1},S^{1}_{1}) \sim \mathcal {P}_{\mathcal {D}}(W)\). Next, We denote \(\mathcal {D}^{*}\) as the dataset augmented through consistent solutions, corresponding to the data distribution \(\mathcal {P}_{\mathcal {D}^{*}}(W)\). Thus, \(W_{1}\) with t solutions and \((W_{1},S^{t}_{1}) \sim \mathcal {P}_{\mathcal {D}^{*}}(W)\).
The model with parameters \(\theta \) trained on the data distribution \(\mathcal {P}_{\mathcal {D}}(W)\) is denoted as \(M_{\theta }\) and the model with parameters \(\theta ^{*}\) trained on \(\mathcal {P}_{\mathcal {D}^{*}}(W)\) denoted as \(M_{\theta ^{*}}\). A significant distinction between models trained with consistent solutions and those without is found based on comparing results on the validation set. Thus, we attempt to utilize two models combined model ensemble to improve performance. To start with, construct bias sets for \(M_{\theta }\) and \(M_{\theta ^{*}}\). We define that \(B_{M_{\theta }}\) as bias set involving W can be answered correctly by \(M_{\theta }\) only, and \(B_{M_{\theta ^{*}}}\) as bias set involving W can be answered correctly by \(M_{\theta ^{*}}\) only. They can be described as follows:
where \(R(\cdot )\) and \(E(\cdot )\) are indicating functions. \(R(\cdot )\) and \(E(\cdot )\) indicate the problem from the validation set answered correctly and incorrectly, respectively.
Given a new \(\hat{W}\), \(B_{\theta }\) and \(B_{\theta ^{*}}\), the encoder with parameters \(\theta \), uniformly, is used to derive the [CLS] vector corresponding to \(\hat{W}\) and the [CLS] vector set corresponding to \(B_{M_{\theta }}\) and \(B_{M_{\theta ^{*}}}\) separately.
Next, we calculate cosine similarity between \(CLS_{\hat{W}}\) and each element in \(CLS_{B_{\theta }}\) and take the top k [CLS] vector from \(CLS_{B_{\theta }}\) denoted as \(CLS^{i}_{B_{\theta }}\) according to cosine similarity in descending order. Similarly, we perform the same operation between \(CLS_{\hat{W}}\) and \(CLS_{B_{\theta ^{*}}}\) as described above. Final, we respectively calculate the score of \(\hat{W}\) on two bias sets as follows:
where \(sim(\cdot )\) indicates cosine similarity. \(\hat{W}\) will be solved by \(M_{\theta }\) if \(Score_{B_{\theta }} > Score_{B_{\theta ^{*}}}\), otherwise by \(M_{\theta ^{*}}\).
4 Experiment
4.1 Dataset
Experiments are conducted on the challenging dataset from NLPCC-2023 shared task 3, a well-annotated MWP dataset in Chinese. The training, validation, and test sets include 23162, 1200, and 1200 data, respectively.
Due to the commutative property being used to generate additional consistent solutions for each problem in the training set if their solution is commutative, the number of data in the training set is 48,103. Each question in the training set has a corresponding solution, whereas the validation and test sets only contain the final answer. This is a minor distinction between the training, validation, and test sets. Thus, the accuracy of the final answer serves as the evaluation metric in NLPCC-2023 shared task 3.
4.2 Implementation Details
In the training phase, we use AdamW [25] to optimize our models and tune hyperparameters based on the validation set results. In the hyperparameter setting, it is appropriate to set the learning rate at 8e-5 for the model trained without consistent solutions and 4e-5 for the model trained with consistent solutions. We recommend setting the rest of the hyperparameters involving warmup steps, dropout rate, hidden size, embedding size, and beam size at 3000, 0.5, 768, 128, and 3, respectively.
In the ensemble phase, we start with a statistical analysis of the results of two models on the validation set. Although the accuracy of the validation set for the two models is only marginally different, a detailed analysis of the answers of the models on particular questions reveals significant differences in that 37 problems are answered correctly only by \(M_{\theta }\) and 62 problems are answered correctly only by \(M_{\theta ^{*}}\). Thus, we utilize it to improve performance based on the hypothesis that the consistent distribution on validation and testing sets.
4.3 Experimental Results
Due to only one submission for the test set being allowed, we display all results on the validation set in Table 3. And the submission for the test set is shown in Table 2.
We experimented with two PLMs that work better in a Chinese context. In addition to comparing results from the end-to-end generative model to fine-tuning, mBART, used in previous work, is also added to our experiment.
Ernie-3.0-Base [26]. AutoRegressive and AutoEncoder networks are creatively combined for pre-training in ERNIE 3.0 by including semantic tasks like entity prediction, judging sentence causality, and reconstructing article sentence structure.
Mengzi-Bert-Base [24]. Mengzi encoder is a pre-trained model on 300G Chinese corpus. Masked language modeling(MLM), part-of-speech(POS) tagging and sentence order prediction(SOP) are used to train.
mBART [27]. With earlier methods concentrating just on the encoder, decoder, or reconstructing portions of the text, mBART is the first method to pre-train the complete sequence-to-sequence model by denoising full texts in multilingual setting.
4.4 Case Study
Solution on Models.
The compared example from the validation set is fed into the Mengzi2tree with and without consistent solutions. The solution output is shown in Fig. 2.
When the left and right sub-nodes of the “\(*\)" are interchangeable, selecting the “4.0" as the left node first, the model that has not been trained with consistent solutions cannot deduce the correct solution and the final answer. We will display the results of the beam search from the model trained with consistent solutions in the following subsection.
Optimization on Search Space. The model trained with consistent solutions is still able to make accurate predictions even when the “4.0" is chosen as the left node of the “\(*\)" first, as shown in the second subgraph above the red dashed line in Fig. 3.
Additionally, Fig. 4 illustrates how the consistent solutions optimize the search space by demonstrating that our model trained with consistent solutions can produce the correct solution regardless of whatever node among the interchangeable nodes is predicted to appear first. The top 5 results from the beam search of the model are all correct, although the beam width is only set at 3 in training.
5 Conclusion
Based on previous studies on math word problems, we found that models that only learned one solution lacked reasoning skills when decoding. To address this, we suggest utilizing the commutative property to generate consistent solutions for the training set, and then using it as extra training data to optimize the search space in beam search.
In addition, we discover significant differences between models trained with and without consistent solutions, so the model ensemble is applied to improve performance. For future research, we will focus on exploring the optimization of search space with adversarial training and reinforcement learning.
References
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Lee, S., Lee, D.B., Hwang, S.J.: Contrastive learning with adversarial perturbations for conditional text generation. In: International Conference on Learning Representations (2021)
Bakman, Y.: Robust understanding of word problems with extraneous information. arXiv preprint math/0701393 (2007)
Roy, S., Vieira, T., Roth, D.: Reasoning about quantities in natural language. Trans. Assoc. Comput. Linguistics 3, 1–13 (2015)
Liang, C.-C., Hsu, K.-Y., Huang, C.-T., Li, C.-M., Miao, S.-Yu., Su, K.-Y.: A tag-based English math word problem solver with understanding, reasoning and explanation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 67–71, June 2016
Hosseini, M.J., Hajishirzi, H., Etzioni, O., Kushman, N.: Learning to solve arithmetic word problems with verb categorization. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 523–533, October 2014
Wang, Y., Liu, X., Shi, S.: Deep neural solver for math word problems. In: Proceedings of the 2017 Conference on Empirical methods in Natural Language Processing, pp. 845–854, September 2017
Liu, Q., Guan, W., Li, S., Kawahara, D.: Tree-structured decoding for solving math word problems. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2370–2379, November 2019
Wang, Y., Lee, H.-Y., Chen, Y.-N.: Tree transformer: integrating tree structures into self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1061–1070, November 2019
Xie, Z., Sun, S.: A goal-driven tree-structured neural model for math word problems. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 5299–5305, 7 2019
Lin, X., Huang, Z., Zhao, H., Chen, E., Liu, Q., Wang, H., Wang, S.: Hms: A hierarchical solver with dependency-enhanced understanding for math word problem. Proceedings of the AAAI Conference on Artificial Intelligence 35(5), 4232–4240 (2021)
Kim, B., Ki, K.S., Lee, D., Gweon, G.: Point to the expression: solving algebraic word problems using the expression-pointer transformer model. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3768–3779, November 2020
Zhang, J., et al.: Graph-to-tree learning for solving math word problems. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3928–3937, July 2020
Huang, D., Liu, J., Lin, C.-Y., Yin, J.: Neural math word problem solver with reinforcement learning. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 213–223, August 2018
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, June 2019
Qin, J., Liang, X., Hong, Y., Tang, J., Lin, L.: Neural-symbolic solver for math word problems with auxiliary tasks. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 5870–5881, August 2021
Wu, Q., Zhang, Q., Wei, Z., Huang, X.: Math word problem solving with explicit numerical values. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 5859–5869, August 2021
Liang, Z., Zhang, J., Wang, L., Qin, W., Lan, Y., Shao, J., Zhang, X.: MWP-BERT: Numeracy-augmented pre-training for math word problem solving. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 997–1009 (2022)
Li, Z., Zhang, W., Yan, C., Zhou, Q., Li, C., Liu, H., Cao, Y.: Seeking patterns, not just memorizing procedures: Contrastive learning for solving math word problems. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 2486–2496 (2022)
Tan, M., Wang, L., Jiang, L., Jiang, J.: Investigating math word problems using pretrained multilingual language models. In: Proceedings of the 1st Workshop on Mathematical Natural Language Processing, pp. 7–16, December 2022
Shen, J., Yin, Y., Li, L., Shang, L., Jiang, X., Zhang, M., Liu, Q.: Generate & rank: A multi-task framework for math word problems. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2269–2279 (2021)
Liang, Z., Zhang, J., Wang, L., Wang, Y., Shao, J., Zhang, X.: Generalizing math word problem solvers via solution diversification. In: Proceedings of the AAAI Conference on Artificial Intelligence, 37, pp. 13183–13191, 06 2023
Jie, Z., Li, J., Lu, W.: Learning to reason deductively: math word problem solving as complex relation extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pp. 5944–5955, May 2022
Zhang, Z., et al.: Towards lightweight yet ingenious pre-trained models for chinese. arXiv preprint arXiv:2110.06696 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations, May 2019
Sun, Y., et al.: Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021
Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics 8, 726–742 (2020)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, Y., Li, S., Pu, C., Wang, J., Zhou, X. (2023). Consistent Solutions for Optimizing Search Space of Beam Search. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science(), vol 14304. Springer, Cham. https://doi.org/10.1007/978-3-031-44699-3_13
Download citation
DOI: https://doi.org/10.1007/978-3-031-44699-3_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44698-6
Online ISBN: 978-3-031-44699-3
eBook Packages: Computer ScienceComputer Science (R0)