Abstract
Huge numbers of protein sequences are now available in public databases. To exploit this valuable biological data more fully, these sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology terms. The UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. In the March 2018 release of UniProtKB, some 556,000 sequences have been manually curated, but over 111 million sequences still lack functional annotations. The ability to annotate these sequences automatically would represent a major advance for the field of bioinformatics. Here, we present a novel network-based approach called GrAPFI for the automatic functional annotation of protein sequences. The underlying assumption of GrAPFI is that proteins may be related to each other by the protein domains, families, and super-families that they share. Several protein domain databases exist, including InterPro, Pfam, SMART, CDD, Gene3D, and PROSITE. Our approach uses InterPro domains because the InterPro database integrates information from several other major protein family and domain databases. Our results show that GrAPFI achieves better EC number annotation performance than several previously described approaches.
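The idea of relating proteins through shared domains can be illustrated with a small sketch. It is not the GrAPFI implementation: the abstract does not specify how edges are weighted or how labels are transferred, so the Jaccard edge weight, the weighted neighbour vote, the function names, and the toy protein and domain identifiers below are all assumptions made for illustration.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard index of two domain sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_domain_graph(domains):
    """Link every pair of proteins that share at least one domain.

    `domains` maps a protein ID to its set of domain IDs. An inverted index
    (domain -> proteins) avoids comparing proteins with nothing in common.
    """
    by_domain = defaultdict(set)
    for protein, doms in domains.items():
        for d in doms:
            by_domain[d].add(protein)

    graph = defaultdict(dict)
    for members in by_domain.values():
        for p in members:
            for q in members:
                if p != q:
                    graph[p][q] = jaccard(domains[p], domains[q])
    return graph


def predict_ec(query, graph, ec_labels):
    """Score EC numbers by the summed edge weight of labelled neighbours."""
    scores = defaultdict(float)
    for neighbour, weight in graph.get(query, {}).items():
        for ec in ec_labels.get(neighbour, ()):
            scores[ec] += weight
    # Highest score first; ties broken by EC number for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: the IDs and labels below are made up for illustration.
domains = {
    "P1": {"IPR_A", "IPR_B"},
    "P2": {"IPR_A", "IPR_B"},
    "P3": {"IPR_B", "IPR_C"},
    "Q": {"IPR_A", "IPR_B"},  # unannotated query protein
}
ec_labels = {"P1": ["1.1.1.1"], "P2": ["1.1.1.1"], "P3": ["2.7.11.1"]}

graph = build_domain_graph(domains)
ranking = predict_ec("Q", graph, ec_labels)
```

In this toy case the query shares its full domain set with two proteins carrying the same EC number, so that number ranks first; the protein that shares only one of three domains contributes a much smaller score to its own label.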
Notes
1. Protein Data Bank, https://www.rcsb.org/.
Acknowledgements
This work was partially supported by the CNRS-INRIA/FAPs project “TempoGraphs” (PRC2243). Bishnu Sarker is a doctoral student funded by an INRIA CORDI-S contract.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sarker, B., Ritchie, D.W., Aridhi, S. (2019). Exploiting Complex Protein Domain Networks for Protein Function Annotation. In: Aiello, L., Cherifi, C., Cherifi, H., Lambiotte, R., Lió, P., Rocha, L. (eds) Complex Networks and Their Applications VII. COMPLEX NETWORKS 2018. Studies in Computational Intelligence, vol 813. Springer, Cham. https://doi.org/10.1007/978-3-030-05414-4_48
DOI: https://doi.org/10.1007/978-3-030-05414-4_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05413-7
Online ISBN: 978-3-030-05414-4
eBook Packages: Intelligent Technologies and Robotics (R0)