Introduction

Since the 1990s, the whole genomes of various species have been sequenced by different genome sequencing projects. In 1995, the genome of Haemophilus influenzae, the first free-living organism to be fully sequenced, was completed by the Institute for Genomic Research. In 1996, the first eukaryotic genome (Saccharomyces cerevisiae) was completely sequenced. In 2000, the first plant genome, Arabidopsis thaliana, was sequenced by the Arabidopsis Genome Initiative. In 2003, the Human Genome Project (HGP) announced its completion. Following the HGP, the Encyclopedia of DNA Elements (ENCODE) project was launched, revealing a massive number of functional elements in the human genome in 2011 (ENCODE Project Consortium et al. 2004). The drastically decreasing cost of sequencing also enabled the 1000 Genomes Project and the Roadmap Epigenomics Project to be carried out; their results were published in 2012 and 2015, respectively (1000 Genomes Project Consortium et al. 2010; Kundaje et al. 2015). Nonetheless, the massive genomic data generated by those projects impose unforeseen challenges for big data analysis at the scale of gigabytes (GBs) or even terabytes (TBs).

In particular, next-generation sequencing (NGS) technologies have enabled massive data generation for different genomes (Wong and Zhang 2014; Mardis 2008); for instance, DNA sequencing, protein-DNA binding occupancy (Wong et al. 2013) (e.g., ChIP-seq Visel et al. (2009), ChIP-exo Ho and Franklin Pugh (2011), and ChIA-PET Fullwood et al. (2009)), bisulfite sequencing (Bock et al. 2005), transcriptome sequencing (e.g., RNA-seq Mortazavi et al. (2008)), and chromatin interaction sequencing (e.g., Hi-C Lieberman-Aiden et al. (2009)). Thanks to their relatively low costs, those NGS technologies are now readily applied to human genomes. The international projects mentioned above have been successfully launched, leading to massive NGS data accumulation at an unprecedentedly fast pace. Nonetheless, current integrative analyses are usually limited to traditional machine learning and data mining methods such as pairwise correlation analysis, statistical tests, classification, and feature extraction (Wong et al. 2015b). Those methods are intentionally designed to fit different types of data in general. However, NGS data is unique and differs from traditional data; for instance, ChIP-seq data is sparse, noisy, and discontinuous. Special care has to be taken to alleviate those challenges and, where possible, turn them to our advantage (Wong et al. 2015a). In addition, NGS data is huge (gigabytes per dataset), which makes it difficult to apply some of the existing statistical and computational methods.
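As a deliberately simplified illustration of that sparsity, the following Python sketch bins a bedGraph-like ChIP-seq coverage track and reports what fraction of genomic bins actually carry any signal. The file name, chromosome, and bin size are hypothetical.

```python
# A minimal sketch (not from the cited work) illustrating why ChIP-seq profiles
# are "sparse, noisy, and discontinuous": bin a bedGraph-like coverage track and
# measure how many genomic bins carry signal. File name and bin size are assumed.
import numpy as np

BIN_SIZE = 200            # bp per bin (assumed)
CHROM_LEN = 248_956_422   # length of chr1 in GRCh38, used only to size the array

bins = np.zeros(CHROM_LEN // BIN_SIZE + 1)

with open("chipseq_chr1.bedGraph") as f:   # hypothetical input file
    for line in f:
        if line.startswith(("track", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        if chrom != "chr1":
            continue
        # add the coverage value to every bin the interval overlaps (end is exclusive)
        for b in range(int(start) // BIN_SIZE, (int(end) - 1) // BIN_SIZE + 1):
            bins[b] += float(value)

nonzero = np.count_nonzero(bins)
print(f"{nonzero / bins.size:.1%} of bins carry any signal")  # typically a small fraction
```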

Therefore, different genome-scale problems have been defined and framed to harness those genomic data. Figure 1 aims to provide a concise summary of those challenges.

Fig. 1 Big data challenges in genome informatics. The challenges are listed from top to bottom; namely, genome assembly, signal profile analysis, and 3D genome structure reconstruction

De novo genome assembly

The advancement in DNA sequencing technologies has enabled the assembly of whole genomes in an economical and fairly accurate way (Mardis 2011). Nonetheless, a genome cannot be easily identified in one piece in wet-labs. Limited by our current DNA sequencing technologies, each genome has to be shattered into many small pieces (short DNA sequence reads) before their DNA nucleotides can be sequenced and identified, as shown in Fig. 1. Therefore, we come to the de novo genome assembly problem: to sequence and identify a genome, we have to “stitch” those short DNA sequences back into a single and consistent DNA genome while allowing for overlaps. There are different benchmark measurements such as N50, total length, and number of missing nucleotides. If a reference genome is already available, more solid measurements than the previous ones, such as NG50 and genome fraction, can be used. If a reference genome annotation is available, the number of genes covered can also be a good measurement. More details can be found in Gurevich et al. (2013). To solve this kind of genome assembly problem (in GBs or TBs), many computational methods have been proposed in the past. Nonetheless, most of them depend on the construction of a de Bruijn graph, which is memory-consuming and computationally intensive. According to a recent benchmark study by Bradnam et al. (2013), different genome assembly methods disagree with each other in their results. In addition, the sequencing errors incurred by wet-lab experimental conditions are unavoidable, making the genome assembly problem even more complicated than we might have imagined (Mardis 2011). Therefore, the genome assembly problem remains a big data challenge to be solved.
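To make the de Bruijn graph idea concrete, the following Python sketch (a toy illustration, not any assembler's actual implementation) cuts reads into overlapping k-mers and links each (k-1)-mer prefix to its suffix. For a real genome, billions of such k-mers would have to be stored, which is exactly why this construction is memory-consuming.

```python
# A minimal de Bruijn graph construction sketch: every read is cut into
# overlapping k-mers, and each (k-1)-mer prefix is linked to its (k-1)-mer suffix.
from collections import defaultdict

def build_de_bruijn(reads, k=5):
    """Return adjacency lists mapping each (k-1)-mer to the (k-1)-mers it precedes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
    return graph

# Toy example: overlapping reads sampled from the sequence "ACGTACGTGACG"
reads = ["ACGTACGT", "GTACGTGA", "CGTGACG"]
for node, neighbours in build_de_bruijn(reads, k=5).items():
    print(node, "->", neighbours)
```

An assembler would then search this graph for paths (e.g., Eulerian walks) that spell out contiguous genome sequence; the sketch above stops at graph construction.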

Genome signal profile analysis

In addition to genome assembly, there are different genome-wide signals such as gene regulation (e.g., protein-DNA binding interactions) and epigenetic interactions (e.g., DNA methylation), as shown in Fig. 1. Therefore, it is essential for us to look into that information. To this end, different genome-wide biotechnologies have been developed, such as ChIP-seq, DNase-seq, RNA-seq, CLIP-seq, DNA methylation array assays, bisulphite sequencing, Repli-Seq, and CAGE. To gain insights into those data, tremendous efforts have been made to pre-process the data, such as read trimming (Bolger et al. 2014), sequencing error correction (Yang et al. 2013), handling of sequencing replicates (Robasky et al. 2014), and read mapping (David et al. 2011). After the data has been processed, downstream analysis methods can be applied to reveal genome-wide signals from it; for instance, multiple signal profile integrative analysis (Wong et al. 2015a, b) and signal profile peak calling (Zhang et al. 2008). In particular, multiple signal profile analysis is very important for understanding the complex behavior of genome-wide signals (Wong et al. 2015a). Unfortunately, the size of each signal profile is proportional to the genome size since it has genome-wide coverage (usually in GBs). Therefore, if we have multiple signal profiles (e.g., hundreds from the ENCODE consortium), the computational scalability issue has to be taken seriously into account. Another issue is that past wet-lab studies have largely been limited to fine-scale knowledge (e.g., single-gene studies). Therefore, genome-wide result verification is very difficult to carry out. At the current stage, we heavily rely on null hypothesis testing to ascertain the statistical significance of the results. Therefore, we can foresee that genome signal profile analysis will remain a big data challenge in genome informatics.
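As a simplified illustration of such pairwise integrative analysis (an assumed setup, not the cited pipelines), the following Python sketch compares two binned genome-wide signal profiles with Spearman correlation; in practice, profiles covering the whole genome would be streamed and binned chromosome by chromosome.

```python
# A minimal sketch of pairwise signal profile comparison: two genome-wide
# profiles are represented as one value per fixed-size genomic bin and compared
# with Spearman correlation. Synthetic data stands in for real binned ChIP-seq signal.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Stand-ins for two binned signal profiles (e.g., ChIP-seq of two factors).
profile_a = rng.poisson(lam=3, size=1_000_000).astype(float)
profile_b = 0.5 * profile_a + rng.poisson(lam=2, size=1_000_000)

rho, pvalue = spearmanr(profile_a, profile_b)
print(f"Spearman rho = {rho:.3f}, p = {pvalue:.2e}")
```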

3D genome structure reconstruction

In recent years, Hi-C technology has been developed and applied to reveal the three-dimensional organization of different cell lines based on the chromosome conformation capture method (Belton et al. 2012). In particular, there is increasing evidence that long-range chromatin interactions are related to gene co-expression (Babaei et al. 2015; Jin et al. 2013) as well as protein-DNA interactions (Lan et al. 2012; Mifsud et al. 2015). Therefore, it is essential to comprehensively identify those long-range chromatin interactions and reconstruct the three-dimensional (3D) shape of the genome from them in order to understand genomes in three-dimensional space. Given the gigabyte scale of genome data as well as its three-dimensional nature, such 3D genome reconstruction is bound to be another big data challenge.
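One common distance-based idea behind such reconstruction (shown here as a minimal sketch rather than a published method) is to convert a Hi-C contact matrix into pairwise distances, with higher contact frequencies mapped to shorter distances, and then embed the loci in 3D by multidimensional scaling. The contact-to-distance conversion below, including the exponent, is an assumption for illustration.

```python
# A minimal sketch of distance-based 3D genome reconstruction: convert a Hi-C
# contact matrix into distances and embed the loci in 3D with multidimensional
# scaling (MDS). A synthetic matrix stands in for real Hi-C counts.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_loci = 50

# Stand-in for a symmetric Hi-C contact matrix over 50 genomic bins.
contacts = rng.poisson(lam=10, size=(n_loci, n_loci)).astype(float)
contacts = (contacts + contacts.T) / 2.0

alpha = 1.0                                   # assumed contact-to-distance exponent
distances = 1.0 / (contacts + 1.0) ** alpha   # more contacts -> shorter distance
np.fill_diagonal(distances, 0.0)              # a locus is at zero distance from itself

# Embed the loci into 3D coordinates that best preserve the pairwise distances.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distances)
print(coords.shape)  # (50, 3): one 3D coordinate per genomic bin
```

Real Hi-C matrices cover millions of bins, so scalable variants of this embedding step are themselves an active research topic.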

Future perspectives

In this article, we have discussed several big data challenges in genome informatics. In particular, we envision that those challenges will intensify in the near future, given the maturing and increasingly cost-effective sequencing technologies. Several future directions are deemed promising: (1) Third-generation sequencing technologies (Schadt et al. 2010) have been developed and are being refined for practical use. Although their sequencing error rates are still high, we believe that those third-generation sequencing technologies will enable another wave of big data challenges in genome informatics. (2) Single-cell sequencing is another promising direction. In the past, we usually studied specific cell types or tissue types using population-based approaches. However, cell-type heterogeneity is often observed in practice. Current single-cell sequencing technologies enable us to look at each individual cell; this holds tremendous potential to trigger the next level of big data challenges. However, the cell-destructive nature of single-cell sequencing may limit its applicability to tasks such as real-time live tracking, disease prognosis analysis, and stem cell development. To address those limitations, single-cell imaging techniques could be promising; they can even offer insights into the spatial arrangement of individual cells. (3) Given genome data in GBs or even TBs, high-performance computing frameworks such as MapReduce are definitely needed to handle the exponentially growing genome data in a scalable yet accurate manner. High-throughput computing technologies such as Hadoop, Spark, and Pig Latin are expected to become more prominent than they are now.
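As a minimal sketch of the MapReduce-style scaling mentioned above (with a hypothetical input path and layout), the following PySpark snippet counts k-mers across a large collection of reads by mapping each read to its k-mers and reducing by key across the cluster.

```python
# A minimal PySpark sketch of MapReduce-style k-mer counting over a large read
# collection. The HDFS path and one-read-per-line layout are assumptions.
from pyspark.sql import SparkSession

K = 21  # k-mer length (assumed)

spark = SparkSession.builder.appName("kmer-count").getOrCreate()
reads = spark.sparkContext.textFile("hdfs:///data/reads.txt")  # one read per line (assumed)

kmer_counts = (
    reads.flatMap(lambda r: [r[i:i + K] for i in range(len(r) - K + 1)])  # map: read -> k-mers
         .map(lambda kmer: (kmer, 1))
         .reduceByKey(lambda a, b: a + b)                                  # reduce: sum counts per k-mer
)

print(kmer_counts.take(5))
spark.stop()
```

The same map/reduce pattern underlies many scalable genome informatics tasks, from read filtering to per-bin signal aggregation, which is why such frameworks are expected to play a growing role.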