Abstract
In 2019, Reimers and Gurevych proposed Sentence-BERT (SBERT) to derive sentence embeddings for a range of tasks. Compared with plain BERT/RoBERTa, SBERT drastically reduces the time required to find the most similar pair among a collection of sentences while maintaining accuracy. Many SBERT models exist for English, but few for other languages. In this paper, we develop a Vietnamese SBERT model for Vietnamese sentence embeddings, using PhoBERT as the underlying transformer for Vietnamese token embeddings. For training, we use the Vietnamese NLI and STSb datasets; for evaluation on the sentence paraphrase identification task and comparison with other models, we use the VnPara dataset. Our model achieves an accuracy of 95.33% and an F1 score of 95.42%, slightly outperforming many recent Vietnamese methods.
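The pipeline the abstract describes can be illustrated with a minimal, self-contained sketch: token embeddings from the transformer are pooled into a fixed-size sentence vector, and two sentences are judged paraphrases when the cosine similarity of their vectors exceeds a threshold. The mean-pooling choice, the toy vectors, and the 0.5 threshold below are illustrative assumptions, not values reported in the paper.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of non-padded positions (SBERT-style mean pooling)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_paraphrase(emb_a, emb_b, threshold=0.5):
    """Illustrative decision rule: similar enough -> paraphrase (threshold is hypothetical)."""
    return cosine(emb_a, emb_b) >= threshold

# Toy token embeddings (3 tokens, dim 2); the last token of sentence B is padding.
tokens_a = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
mask_a = [1, 1, 1]
tokens_b = [[1.8, 0.2], [0.2, 1.8], [0.0, 0.0]]
mask_b = [1, 1, 0]

sent_a = mean_pool(tokens_a, mask_a)
sent_b = mean_pool(tokens_b, mask_b)
print(is_paraphrase(sent_a, sent_b))
```

In the real model, `tokens_a`/`tokens_b` would come from PhoBERT's contextual token embeddings rather than toy lists; only the pooling and similarity steps are shown here.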
Notes
1. Our Vietnamese SBERT model: https://huggingface.co/keepitreal/vietnamese-sbert.
References
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. In: EMNLP 2019 (2019). arXiv:1908.10084
Nguyen, D.Q., Nguyen, A.T.: PhoBERT: pre-trained language models for Vietnamese. In: Findings of EMNLP 2020, pp. 1037–1042. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.92
Dinh, D., Le Thanh, N.: Vietnamese sentence paraphrase identification using pre-trained model and linguistic knowledge. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 12(8) (2021). https://doi.org/10.14569/IJACSA.2021.0120891
Vani, K., Gupta, D.: Study on extrinsic text plagiarism detection techniques and tools. J. Eng. Sci. Technol. Rev. (2016). https://doi.org/10.25103/jestr.095.02
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v2 [cs.CL], 24 May 2019
Vaswani, A., et al.: Attention is all you need. arXiv:1706.03762v5 [cs.CL], 6 December 2017
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. arXiv:1911.02116 [cs.CL], 5 November 2019
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs.CL], 26 July 2019
Bach, N.X., Oanh, T., Hai, N., Phuong, T.: Paraphrase identification in Vietnamese documents. In: 2015 IEEE International Conference on Knowledge and Systems Engineering, KSE 2015, pp. 174–179 (2015). ISBN: 9781467380133. https://doi.org/10.1109/KSE.2015.37
SQuAD 2.0: The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/
Vietnamese NLI Dataset, Dat Quoc Nguyen. https://github.com/DatCanCode/sentence-transformers/tree/master/DataNLI
Semantic Textual Similarity Wiki. http://ixa2.si.ehu.eus/stswiki
Acknowledgement
We are very grateful to our instructors, who helped us review our work, and to our friends for their valuable support. We also acknowledge the support of facilities from Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for this study.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Phan, Q.L., Doan, T.H.P., Le, N.H., Tran, N.B.D., Huynh, T.N. (2022). Vietnamese Sentence Paraphrase Identification Using Sentence-BERT and PhoBERT. In: Nguyen, N.T., Dao, N.N., Pham, Q.D., Le, H.A. (eds) Intelligence of Things: Technologies and Applications. ICIT 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 148. Springer, Cham. https://doi.org/10.1007/978-3-031-15063-0_40
DOI: https://doi.org/10.1007/978-3-031-15063-0_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15062-3
Online ISBN: 978-3-031-15063-0
eBook Packages: Intelligent Technologies and Robotics (R0)