Abstract
Text classification is an essential task in natural language processing. With the development of deep learning technology, deep learning methods have become the mainstream approach to text classification. However, these methods typically require large amounts of data, and collecting and annotating datasets is cumbersome and expensive. This paper presents a data augmentation method that simulates the generation of acronyms, quickly expanding text classification datasets. We evaluate our method on three classical datasets and compare it with two classical text data augmentation methods. The results show that our method effectively improves text classification performance. Moreover, a similarity comparison against the original sentences shows that our method changes sentence semantics very little and is more robust than the baseline methods.
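The abstract does not spell out the augmentation procedure, but the core idea of acronym-style augmentation can be sketched as replacing a short run of consecutive words with the acronym formed from their initial letters. The span length, replacement probability, and sampling strategy below are illustrative assumptions, not the authors' exact procedure:

```python
import random

def acronymize(sentence: str, min_len: int = 2, max_len: int = 4, p: float = 0.5) -> str:
    """Replace one random run of consecutive words with its acronym.

    Hypothetical sketch of acronym-based augmentation: picks a span of
    `min_len`..`max_len` consecutive words and substitutes the acronym
    built from their first letters. With probability 1 - p the sentence
    is returned unchanged.
    """
    words = sentence.split()
    if len(words) < min_len or random.random() > p:
        return sentence
    span = random.randint(min_len, min(max_len, len(words)))
    start = random.randint(0, len(words) - span)
    acronym = "".join(w[0].upper() for w in words[start:start + span])
    return " ".join(words[:start] + [acronym] + words[start + span:])

# Example: "natural language processing" may become "NLP"
augmented = acronymize("natural language processing is widely studied", p=1.0)
```

Because only a few surface tokens change while the rest of the sentence is kept verbatim, such a transformation tends to preserve sentence semantics, which is consistent with the similarity results reported in the abstract.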
Copyright information
© 2022 Chinese Institute of Command and Control
About this paper
Cite this paper
Ou, L., Chen, H., Luo, X., Li, X., Chen, S. (2022). ADA: An Acronym-Based Data Augmentation Method for Low-Resource Text Classification. In: Proceedings of 2022 10th China Conference on Command and Control. C2 2022. Lecture Notes in Electrical Engineering, vol 949. Springer, Singapore. https://doi.org/10.1007/978-981-19-6052-9_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-6051-2
Online ISBN: 978-981-19-6052-9
eBook Packages: Intelligent Technologies and Robotics (R0)