Function Characterization of Unknown Protein Sequences Using One Hot Encoding and Convolutional Neural Network Based Model

  • Conference paper
Machine Learning and Computational Intelligence Techniques for Data Engineering (MISP 2022)

Abstract

Predicting the Protein Subcellular Localization (PSL) of recently evolved Unknown Protein Sequences (UPS) is vital for understanding protein function. PSL also provides insight into the prediction of harmful and useful characteristics, the diagnosis of disease, and drug design. In the present work, an OCNN model based on One-Hot Encoding (OHE) and a Convolutional Neural Network (CNN) is proposed for the functional characterization of protein sequences through PSL. A Gram-Positive (G+) dataset with 473 known protein sequence samples covering four subcellular localizations is used for training and validation of the OCNN model. As essential preprocessing, the raw protein sequences are encoded using OHE, and the lengths of the encoded sequences are standardized and normalized through padding and capping. Next, the encoded and standardized protein sequence samples are convolved in the hidden layer of the OCNN model using the ReLU, TanH, and Sigmoid activation functions. The Adam and Stochastic Gradient Descent (SGD) optimizers are then used for PSL prediction of the protein sequence samples. The OCNN model achieved 92.94% accuracy on known protein sequences with the combination of Sigmoid, Softmax, and Adam functions. The validated OCNN model can be further used for function prediction of UPS, where 64.83% accuracy is achieved with the combination of ReLU, Softmax, and Adam functions.
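
The preprocessing step described in the abstract (one-hot encoding followed by length standardization through padding and capping) can be illustrated with a minimal sketch. This is not the authors' implementation: the 20-letter alphabet ordering, the `max_len` parameter, the all-zero treatment of unknown residues, and the helper name `one_hot_encode` are all assumptions made for illustration.

```python
# Illustrative sketch only; alphabet order, max_len, and zero rows for
# padding/unknown residues are assumptions, not details from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix (list of lists) for one protein sequence."""
    # Capping: residues beyond max_len are discarded.
    capped = sequence.upper()[:max_len]
    matrix = []
    for residue in capped:
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # unknown symbols (e.g. X) remain an all-zero row
            row[INDEX[residue]] = 1
        matrix.append(row)
    # Padding: append all-zero rows until the matrix has max_len rows.
    matrix.extend([0] * len(AMINO_ACIDS) for _ in range(max_len - len(capped)))
    return matrix


short = one_hot_encode("MKV", max_len=5)        # shorter than max_len: padded
long_ = one_hot_encode("MKVLAAGIT", max_len=5)  # longer than max_len: capped
```

Both calls yield matrices of the same shape, which is what allows variable-length sequences to be fed to a fixed-input convolutional layer.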


Notes

  1. http://www.csbio.sjtu.edu.cn/bioinf/Gpos-multi/Data.htm

Author information

Correspondence to Saurabh Agrawal.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Agrawal, S., Sisodia, D.S., Nagwani, N.K. (2023). Function Characterization of Unknown Protein Sequences Using One Hot Encoding and Convolutional Neural Network Based Model. In: Singh, P., Singh, D., Tiwari, V., Misra, S. (eds) Machine Learning and Computational Intelligence Techniques for Data Engineering. MISP 2022. Lecture Notes in Electrical Engineering, vol 998. Springer, Singapore. https://doi.org/10.1007/978-981-99-0047-3_24
