Integrative Literature and Data Mining to Rank Disease Candidate Genes

Wu, Chao; Zhu, Cheng; Jegga, Anil G.

doi:10.1007/978-1-4939-0709-0_12

Part of the book series: Methods in Molecular Biology (MIMB, volume 1159)

Abstract

While genomics-derived discoveries promise benefits for basic research and health care, the speed and affordability of sequencing that followed recent technological advances have further aggravated the data deluge. Seamless integration of the ever-increasing clinical, genomic, and experimental data, together with efficient mining to extract knowledge, deliver actionable insight, and generate testable hypotheses, is therefore critical to biomedical research. For instance, high-throughput techniques are frequently applied to detect disease candidate genes. Experimental validation of these candidates, however, is both time-consuming and expensive. Hence, several computational approaches based on literature and data mining have been developed to identify the most promising candidates for follow-up studies. Based on the “guilt by association” principle, most of these methods use prior knowledge about a disease of interest to discover and rank novel candidate genes. In this chapter, we provide a brief overview of recent advances in literature- and data-mining-based approaches to candidate gene prioritization. As a case study, we focus on a Web-based computational approach that ranks disease candidate genes using integrated heterogeneous data sources, including gene–literature associations, and we explain how to run typical queries with this system.
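
To make the “guilt by association” idea concrete, the following is a minimal sketch and not the method used by the system described in this chapter. It assumes each gene is represented by a set of annotation terms (for example, functional terms or linked article identifiers) and scores every candidate by its mean Jaccard similarity to a set of known disease ("seed") genes. The gene names, annotation terms, and the functions `jaccard` and `rank_candidates` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two annotation sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(
    seeds: Dict[str, Set[str]],
    candidates: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Score each candidate by its mean similarity to the seed genes, highest first."""
    scored = []
    for gene, terms in candidates.items():
        sims = [jaccard(terms, seed_terms) for seed_terms in seeds.values()]
        score = sum(sims) / len(sims) if sims else 0.0
        scored.append((gene, score))
    # Sort by descending score, breaking ties by gene name for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: gene names and annotation terms are made up.
seeds = {
    "SEED1": {"cell cycle", "DNA repair", "PMID:1"},
    "SEED2": {"DNA repair", "apoptosis", "PMID:2"},
}
candidates = {
    "CAND_A": {"DNA repair", "cell cycle"},
    "CAND_B": {"lipid metabolism"},
    "CAND_C": {"apoptosis", "PMID:2", "DNA repair"},
}

ranking = rank_candidates(seeds, candidates)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

In this toy setting the candidate that shares the most annotations with the seed genes ranks first, and a candidate with no shared annotations scores zero. Real prioritization tools typically combine many data sources and use more robust statistics than a single set-overlap measure.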

Author information

Authors and Affiliations

Department of Computer Science, College of Engineering and Applied Science, University of Cincinnati, Cincinnati, OH, USA
Chao Wu, Cheng Zhu & Anil G. Jegga D.V.M., M.S.
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, MLC 7024, Cincinnati, OH, 45229, USA
Chao Wu, Cheng Zhu & Anil G. Jegga D.V.M., M.S.
Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, 45229, USA
Anil G. Jegga D.V.M., M.S.
Genzyme Corporation, Cambridge, MA, 02142, USA
Cheng Zhu

Corresponding author

Correspondence to Anil G. Jegga D.V.M., M.S.

Editor information

Editors and Affiliations

GlaxoSmithKline, King of Prussia, Pennsylvania, USA
Vinod D. Kumar
GlaxoSmithKline, Hitchin, Hertfordshire, United Kingdom
Hannah Jane Tipney

About this protocol

Cite this protocol

Wu, C., Zhu, C., Jegga, A.G. (2014). Integrative Literature and Data Mining to Rank Disease Candidate Genes. In: Kumar, V., Tipney, H. (eds) Biomedical Literature Mining. Methods in Molecular Biology, vol 1159. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-0709-0_12

DOI: https://doi.org/10.1007/978-1-4939-0709-0_12
Published: 10 April 2014
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-0708-3
Online ISBN: 978-1-4939-0709-0
eBook Packages: Springer Protocols
