Abstract
A working draft of the human genome was first published in 2001, after roughly 10 years of cooperation between numerous international research organizations. Genomics labs can now sequence an entire genome in only a few days. Here, we discuss how the advent of high-throughput sequencing platforms has paved the way for Big Data in biology and contributed to the development of modern bioinformatics, which in turn has helped to expand the scope of biology and allied sciences. This growth has made new technologies and methodologies for the storage, management, analysis, and visualization of big data necessary. Modern bioinformatics must not only process massive amounts of heterogeneous data but also cope with different ways of interpreting and presenting the results, as well as with a variety of software programs and file formats. This chapter attempts to present solutions to these problems. Storing massive amounts of data while answering search queries in a reasonable time will require database management systems other than relational ones. Emerging advanced programming approaches, such as machine learning, Hadoop, and MapReduce, aim to make it easy to construct one’s own data-processing scripts and to address the diversity of genomic and proteomic data formats in bioinformatics.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Gupta, A., Kumar, S., Kumar, A. (2024). Big Data in Bioinformatics and Computational Biology: Basic Insights. In: Mandal, S. (eds) Reverse Engineering of Regulatory Networks. Methods in Molecular Biology, vol 2719. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3461-5_9
DOI: https://doi.org/10.1007/978-1-0716-3461-5_9
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-3460-8
Online ISBN: 978-1-0716-3461-5
eBook Packages: Springer Protocols