1 Introduction

A Protein-DNA binding site refers to a fragment of a protein macromolecule that specifically binds [1] to a DNA sequence of approximately 4–30 bp in length [2,3,4]. Transcription factors, a common class of protein macromolecules, are an important subject of Protein-DNA binding site prediction: during gene transcription, a transcription factor binds specifically to a segment of the DNA sequence, and the sites formed by these specific regions are called transcription factor binding sites (TFBS) [5, 6]. Transcription factors are of great importance in gene regulation, transcription, biological research, and drug design [7,8,9]. Accurate prediction of Protein-DNA binding sites is therefore essential for understanding the genome and describing gene-specific functions [10, 11].

In the past decades, sequencing was performed using traditional biological methods, in particular ChIP-seq [12], which greatly increased the quantity and quality of available sequences and laid the foundation for subsequent studies. As sequencing technology developed, the number of genomic sequences grew dramatically, while traditional biological sequencing techniques remained costly and slow. Machine learning [13] was therefore applied to Protein-DNA binding site prediction: Wong et al. proposed the kmerHMM [14] model based on hidden Markov models (HMMs) and belief propagation, and Li et al. [15] proposed a model fusing pseudo nucleic acid composition (PseNAC) with an SVM. However, as sequences continued to accumulate, traditional machine learning methods could no longer meet the requirements for prediction accuracy and computational speed, while deep learning was performing well in other fields such as machine vision [2, 16, 17]. Researchers therefore gradually applied deep learning to bioinformatics [4, 18,19,20]. DeepBind applied convolutional neural networks to Protein-DNA binding site prediction for the first time, and Zeng et al. further explored the number of convolutional layers and pooling strategies to validate the value of the Convolutional Neural Network (CNN) for Protein-DNA binding sites. KEGRU is a fully RNN-based framework using a Bidirectional Gated Recurrent Unit (Bi-GRU) and k-mer embedding. DanQ employs a hybrid neural network combining a CNN with a Recurrent Neural Network (RNN), adding a Bi-directional Long Short-Term Memory (Bi-LSTM) layer to better learn long-distance dependencies in sequences.

In our work, we utilized DNABERT for feature extraction from the dataset and fully connected layers for classification. First, we segment the DNA sequences using the k-mer representation; unlike the one-hot encoding commonly used in previous deep learning work, we only segment the sequence, and the processed data, together with positional information, serve as the input to BERT. Then feature extraction is performed by BERT with its multi-head self-attention mechanism; the input data have dimensions 101 × 768, and the output keeps the same dimensionality. Finally, the representation is fed into a fully connected layer and activated with the softmax function for binary classification. To verify the generalization ability of the model, we fine-tuned it on transcription factor datasets from different cell lines and confirmed its effectiveness.
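For concreteness, the sketch below shows one way the classification head described above could be wired up in PyTorch, assuming a pre-trained BERT-style encoder whose output keeps the 101 × 768 shape of the input; the class name and the use of the [CLS] position are our illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of the classification head, assuming an encoder with a
# Hugging Face-style interface returning last_hidden_state.
import torch
import torch.nn as nn

class BindingSiteClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim=768, num_classes=2):
        super().__init__()
        self.encoder = encoder  # pre-trained BERT-style encoder (assumed)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids, attention_mask):
        # The encoder preserves the input shape: (batch, seq_len, 768).
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Summarize the whole sequence with the [CLS] position (index 0).
        logits = self.classifier(hidden[:, 0, :])
        # Softmax turns the two logits into binding / non-binding probabilities.
        return torch.softmax(logits, dim=-1)
```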

2 Materials and Methods

2.1 Benchmark Dataset

To better evaluate the performance of the model, we selected 45 public transcription factor ChIP-seq datasets of Broad cell lines from the ENCODE dataset, which were previously utilized in the DeepBind, CNN-Zeng, and DeepSEA frameworks. Each DNA sample is 101 bp long, and the ratio of positive to negative samples is approximately 1:1. The data are available at http://cnn.csail.mit.edu/motif_discovery/.
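As a hedged illustration, a loader along the following lines could read one of these datasets; the exact file layout (one whitespace-separated sequence/label pair per line) is an assumption on our part and should be adjusted to the downloaded files.

```python
# Minimal loading sketch for one ChIP-seq dataset file; the line format
# "sequence label" is assumed, not taken from the original paper.
def load_dataset(path):
    sequences, labels = [], []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip headers or blank lines
            seq, label = fields[0], fields[-1]
            # Each sample in these datasets is a 101 bp sequence.
            sequences.append(seq.upper())
            labels.append(int(label))
    return sequences, labels
```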

2.2 Model

Tokenization

We apply k-mer tokenization to DNA sequences: each deoxyribonucleic acid base is concatenated with its subsequent bases, so that every token integrates contextual information. Different values of K correspond to different tokenizations of a DNA sequence; we set K to 6, so that {ACGTACGT} is tokenized as {ACGTAC, CGTACG, GTACGT}. In addition to all permutations represented by the k-mers, the vocabulary includes five special tokens: the classification token [CLS] inserted at the head, the [SEP] token inserted after each sentence, the [MASK] token that masks words, the placeholder [PAD] token, and the [UNK] token that stands for unknown symbols in the sequence. For K = 6, the vocabulary therefore contains \(4^6 + 5 = 4101\) tokens.
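The following minimal sketch reproduces this tokenization scheme, assuming a sliding window of width K with stride 1; the function names are ours, not DNABERT's.

```python
# Sliding-window k-mer tokenization plus the five special tokens,
# giving a vocabulary of 4^6 + 5 = 4101 entries for K = 6.
from itertools import product

def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_vocab(k=6):
    """All 4^k k-mers plus the special tokens used by the model."""
    specials = ["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: idx for idx, token in enumerate(specials + kmers)}

print(kmer_tokenize("ACGTACGT"))  # ['ACGTAC', 'CGTACG', 'GTACGT']
print(len(build_vocab()))         # 4101
```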

The DNABERT Model

BERT is a Transformer-based pre-trained language representation model and a milestone in NLP. It introduced the idea of pre-training and fine-tuning: after pre-training on a large amount of data, an additional output layer is added and fine-tuned with a small amount of task-specific data, yielding state-of-the-art performance on downstream tasks. The key innovation of BERT is the masked language model (MLM) technique, which enables a bidirectional Transformer for language modeling; such a bidirectional model outperforms unidirectional models in language representation. BERT models can also be used in question-answering systems, language analysis, document clustering, and many other tasks. We believe that BERT can be applied to Protein-DNA binding site prediction to better capture the hidden information in DNA sequences, as shown in Fig. 1.

Fig. 1. DNABERT framework.
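As a hedged sketch of the fine-tuning step, the Hugging Face transformers API could be used along the following lines; the checkpoint path is a placeholder for a published 6-mer DNABERT release, and the hyper-parameters are illustrative, not the authors' exact settings.

```python
# Fine-tuning sketch: the pre-trained MLM head is replaced by a
# two-class sequence classifier, as in the standard BERT recipe.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "path/to/dnabert-6mer"  # placeholder checkpoint path

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# 6-mer tokens are passed as a whitespace-separated "sentence".
kmers = "ACGTAC CGTACG GTACGT"
inputs = tokenizer(kmers, return_tensors="pt")
labels = torch.tensor([1])  # 1 = binding site, 0 = background

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```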

3 Result and Discussion

3.1 Competing Methods

To ensure a fair comparison, we evaluated three deep learning-based models against DNABERT: DeepBind, DanQ, and WSCNNLSTM. The comparison shows that DNABERT performs better on all the evaluation metrics we used. Table 1 compares the performance of the models on the datasets of the selected cell lines. As can be seen from Table 1, DNABERT surpasses the existing models in ACC, F1-score, MCC, Precision, and Recall: on average, ACC is 0.013537 higher than the other methods, F1-score increases by 0.010866, MCC by 0.029813, and Precision and Recall by 0.052611 and 0.122131, respectively. The experimental results show that our method outperforms the existing networks.

Table 1. Comparison of performance on datasets of cell lines.
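The reported metrics can be reproduced with scikit-learn, for example as in the sketch below; y_true and y_prob are assumed to hold the true 0/1 labels and the model's softmax probability for the positive class, and the 0.5 threshold is a common default rather than a setting taken from the paper.

```python
# Computing ACC, F1-score, MCC, Precision, and Recall from predictions.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "F1-Score": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
    }
```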

4 Conclusion

In recent years, Transformer-based models have achieved state-of-the-art performance in NLP. As research progressed, researchers transferred them to other fields and obtained equally desirable results. In our work, we demonstrate that DNABERT greatly exceeds other existing tools in Protein-DNA binding site prediction. Owing to the sequence similarity between genomes, biological information can be transferred across datasets with the DNABERT pre-trained model. DNA sequences cannot be interpreted directly by a machine; DNABERT offers a solution to deciphering the language of non-coding DNA, correctly capturing the hidden syntax and semantics in DNA sequences and showing excellent results. Although DNABERT performs excellently in predicting Protein-DNA binding sites, there is room for further improvement: the CLS token represents the global information of the sequence while the remaining tokens represent the features of its parts, so processing them separately could capture sequence features better and achieve stronger results. Nevertheless, the BERT pre-training approach currently delivers the most advanced performance for Protein-DNA binding site prediction, and DNABERT brings the perspective of high-level language modeling to genomic sequences, providing new advances and insights for the future of bioinformatics.