1 Introduction

Embedded in a long string spanning several billion characters drawn from a small set of genetic alphabets, genomic big data encodes a richly authored genetic narrative of evolution over billions of years. Genome Informatics (GI), the study of genomes, integrates this big data with a broad base of interoperable medical and engineering disciplines. GI has evolved into a discovery-driven approach to analysing unstructured genomic big data, drawing inferences from an organism's genetic code to arrive at translationally important interpretations. Emerging and widely popular GI applications cater to numerous domains, including targeted personalized diagnostics and therapeutics, thereby improving the effectiveness of healthcare.

1.1 Sequencing for GI

Understanding the genome through GI involves determining the order of the genetic alphabets or bases, namely adenine (A), cytosine (C), guanine (G) and thymine (T), within the genomic sequence; this process is widely known as sequencing. Next Generation Sequencing (NGS) performs massively parallel sequencing of genetic data with high throughput, offering an unparalleled interrogation of the genome and deeper insight into the functional and structural investigation of genetic data [1, 2].

Data processing with NGS, over an elaborate multi-stage data-analytics pipeline, is depicted in Fig. 1. At the end of the primary data analysis, the pipeline generates several intermediate and output files of significant magnitude, contributing petabytes of NGS big data of raw sequence short reads per sample per run. Each short read is a very small fragment or substring of the target genome string under consideration. The short reads are then aligned or mapped to a reference genome string through a process called Short Read Mapping (SRM) or Short Read Alignment (SRA). By the year 2025, genomic data acquisition through NGS, being highly geographically distributed across multiple species, is predicted to reach the rate of one zettabase per year [3].

Figure 1: NGS workflow.

1.2 Short Read Mapping (SRM): Computational Bottleneck in GI

The SRM process, illustrated in Fig. 2, is interpreted as a classic Approximate String Matching (ASM) problem. SRM attempts to search for a specific short read string q of length |q| (ranging from about 25 to a few hundred bases) over a much longer reference genome string G of length |G| (a human reference genome is typically 3 billion bases long). SRM aims to find the region of origin of each short read string with respect to the reference, and hence finds regions of similarity or dissimilarity, over the character set Σ = {A, C, G, T} [4, 5].

Figure 2: SRM workflow.

SRM performs ASM for its input strings, with a cost function and appropriate error model to accommodate errors in strings induced due to:

  1. Genetic mutations (computationally interpreted as a character being replaced with another over the character set Σ)

  2. Single Nucleotide Polymorphism (SNP) (algorithmically similar to genetic mutations, but interpretations vary biologically over a sample population)

  3. Insertions or deletions of bases (computationally interpreted as addition or deletion of a genetic character across the length of the string)

  4. Other evolutionary genetic alterations

  5. NGS platform induced errors

The algorithms here quantify the similarity, or the edit distance, between the two strings under consideration. Edit distance is the minimum number of such deviations or edit operations required to transform one string into the other. The cost function in ASM assigns each such operation a cost and aims at minimizing or maximizing the total cost, depending on the formulation of the cost function, which serves to quantify the similarity between the two strings.
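As a minimal sketch of this formulation, the hypothetical snippet below computes the Levenshtein edit distance between a short read and a reference window over Σ = {A, C, G, T} with unit costs; the helper name, the strings and the unit-cost model are illustrative only and do not reproduce the scoring scheme used later in the pipeline.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Minimal sketch of the ASM idea: Levenshtein edit distance between a short
// read and a same-length window of the reference, with unit costs for
// substitutions, insertions and deletions. The real pipeline uses its own
// scoring/penalty model; this only illustrates the DP formulation.
int editDistance(const std::string& read, const std::string& ref) {
    const size_t m = read.size(), n = ref.size();
    std::vector<std::vector<int>> D(m + 1, std::vector<int>(n + 1, 0));
    for (size_t i = 0; i <= m; ++i) D[i][0] = static_cast<int>(i);  // deletions
    for (size_t j = 0; j <= n; ++j) D[0][j] = static_cast<int>(j);  // insertions
    for (size_t i = 1; i <= m; ++i) {
        for (size_t j = 1; j <= n; ++j) {
            int sub = D[i - 1][j - 1] + (read[i - 1] == ref[j - 1] ? 0 : 1);
            int del = D[i - 1][j] + 1;
            int ins = D[i][j - 1] + 1;
            D[i][j] = std::min({sub, del, ins});
        }
    }
    return D[m][n];  // minimum number of edits transforming read into ref
}

int main() {
    // Hypothetical 8-base read vs. reference window over Σ = {A, C, G, T}.
    std::cout << editDistance("ACGTACGT", "ACGTTCGT") << "\n";  // prints 1
}
```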

2 Related Work: SRM in GI Using Accelerators and HPC Platforms

With the growing volume of NGS big data, SRM and the subsequent analytic steps demand an HPC environment complemented with accelerators for data storage and analyses [6, 7]. NGS has thus become a complex engineering problem, eliciting innovative computational, scientific and statistical approaches towards big data analysis. A strict validation of the various algorithms and software tools in an NGS pipeline is essential to ensure reliable and accurate results [8,9,10,11].

2.1 Accelerators for SRM

The most popular and central scheme for SRM is the Dynamic Programming (DP) methodology. Though computationally complex, DP algorithms prove to be very efficient in discriminating substantial similarities amidst the severe noise that evolution introduces into genetic data. There are several parallel implementations of the DP method [12,13,14,15,16,17]. While some adopt parallel computation using SIMD (Single Instruction Multiple Data) style instructions within a single processor, others realize parallelism on multiple processors. There are various accelerator platforms, such as reconfigurable hardware (FPGAs) and GPUs, on which the DP recursive-equation kernel is realized as multiple threads or blocks to accelerate alignment. Most of these methods can be classified into the two major categories listed in Table 1, which also highlights their bottlenecks; these bottlenecks render the methods less useful when handling big data.

Table 1 Alignment methods.

2.2 SRM on HPC Platforms

To perform SRM on such large data volumes, GI adopts a multi-stage multi-algorithmic parallel pipeline. The deployment of the GI pipeline exploits the best practices in HPC on platforms like clouds, grids, accelerators and clusters, while strictly following bio computational principles in classical genetics, molecular and cell biology. All such efforts are predominantly directed towards prospecting the unexploited scope of parallelism and scalability of the HPC platforms [18, 19].

However, the bio-computing within the GI pipeline is irregular and combinatorial in nature. It is irregular because it is heavily data dependent and lacks temporal and spatial locality of data. This severely curbs the performance of modern processor architectures built on deep memory hierarchies tuned for regular, locality-friendly data structures. The runtime computational irregularities are compounded by non-contiguous file accesses, making an optimal parallelization of the GI pipeline on a multi-core environment more difficult. The big data, along with all-to-all computation, contributes to the time and computational complexity of the combinatorial algorithms. This makes fine-grain synchronization an utmost necessity to exploit data-level and process-level parallelism in a multi-node and multi-core HPC environment. With a variety of accelerator platforms available to conceive parallel versions of the various computational algorithms, a substantial engineering effort is required to optimize bio-computing on the available HPC hardware for concurrency, time, cost, and coverage [6, 9, 20, 21].

Through this paper, we present a detailed performance analysis of ReneGENE-GI, an innovatively engineered GI pipeline. The architecture of ReneGENE-GI was discussed in our previous work [22]; this paper extends that work with more algorithmic, performance and implementation details of the various stages of the pipeline. The pipeline maps raw genomic data from NGS platforms with high precision. It hosts a unique blend of highly dynamic multi-dimensional data structures and parallel algorithms designed for executing irregular genomic computing on accelerator-based hardware and HPC platforms. ReneGENE-GI exploits the inherent parallelism and scalability of the hardware at the level of micro and system architecture to offer reliable mapping for NGS read data of any size. This allows for optimizing time, cost, and affordability without unduly penalizing the biological fidelity of the results. It exploits a substantial degree of latent parallelism by engaging fine-grain synchronization, while allowing the application to scale up on HPC platforms.

The principal novelty of our solution lies in engineering the pipeline from existing algorithms using a data streaming approach that minimizes heap memory footprint and input/output bottlenecks, supplemented by compiler-level and architecture-specific optimizations that improve performance in a reconfigurable HPC environment. We also present the performance analysis of ReneGENE-GI's Comparative Genomics Module (CGM), implemented on both FPGA and GPU accelerator platforms.

3 The ReneGENE-GI Pipeline

The ReneGENE-GI pipeline, illustrated in Fig. 3, adopts the multi-stage NGS workflow illustrated in Fig. 1 for data analytics. While implementing the regular bio-computing algorithms used for SRM and subsequent steps, the pipeline follows a modular approach for each step and each algorithm in an effort to deploy the respective stages on an HPC environment to enable parallel computing.

Figure 3: The ReneGENE-GI pipeline.

3.1 Overview

The novelty of the ReneGENE-GI pipeline lies in the fact that it offers a unique blend of comparative genomics and de novo sequence assembly, offering the most precise SRM. The CGM exploits a parallel dynamic programming methodology to accurately map the short reads against the reference genome. The alignment is backed by an exhaustive indexing and lookup of reads against the reference, using a parallel implementation of the dynamic Monotonic Minimal Perfect Hashing (MMPH) method [23]. This is a complete index of the reference, where the k-mer seeds fully cover the entire span of the reference, inclusive of the repeat regions. As compared to other indexing techniques that employ heuristics to purge repeat-region hits, the ReneGENE-GI pipeline reports those hits as well, throwing light on many anomalies embedded in these repeats.

The de novo module is implemented as a parallel map-reduce based readtig generation technique. Readtigs are extended short reads, built with a novel read extension algorithm that has been prototyped and verified for precision on HPC platforms with reconfigurable accelerator support. The readtigs are further mapped onto the reference genome to capture possible insertions and deletions of genetic alphabets at certain locations, thereby widening the map space and coverage.

The final SRM alignment results are then subjected to variant calling or preliminary tertiary analysis.

3.2 Reference Preprocessing in ReneGENE-GI

With an extremely long reference sequence string, indexing the reference over the alphabet set of Σ = {A, C, G, T} is a difficult task. ReneGENE-GI presents an efficient indexing scheme for the reference. Here, we generate a hash table for the index, based on a static set of lexicographically ordered keys.

It is known that a Perfect Hash Function (PHF) for a set U places the keys from U in an index table for efficient lookup operations by mapping distinct elements in U to distinct values, avoiding any collisions. The table is indexed by the output of the PHF. Such PHFs are best suited for indexing where the data is very large and infrequently updated. This method is space-efficient, creating a compact table for a static set of keys.

A PHF becomes a minimal PHF when it maps k keys to consecutive integer values, usually ranging from 1 to k or 0 to k − 1. A minimal PHF is order preserving when, for keys given in some order k1, k2, …, kn, any two keys ki and kj with i < j satisfy PHF(ki) < PHF(kj).

A minimal PHF becomes a Monotonic Minimal PHF (MMPH) when the lexicographical order of the keys is preserved. In an application like genomics, the key set is dynamic, with continuous insertions and deletions into the set U. Hence, to avoid heuristics in lookup, ReneGENE-GI implements a dynamic MMPH. This is typically done as part of preprocessing the reference genome, producing index tables over the lexicographically sorted set of keys extracted from the reference.

ReneGENE-GI indexing is illustrated in Algorithm 1. The keys are substrings of length k, called k-mer seeds, extracted by a sliding-window operation on the genome. By performing a dynamic MMPH, ReneGENE-GI maps each k-mer from a lexicographically ordered keyset K to its corresponding index position in the Reference Index Table (RIT). The number of keys is fixed by the choice of k. The natural order is preserved over the keyset K by the binary encoding of the bases in each k-mer (i.e., A → 00, C → 01, G → 10, T → 11). This algorithm renders a small memory footprint for the resulting reference index. For example, the human genome reference version GRCh38 (3.1 GB) was indexed in about 5 minutes using the MPI-based implementation of Algorithm 1, generating an index of 5.4 GB in size.

Algorithm 1
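As a minimal, hypothetical sketch of this encoding-driven indexing (not the actual Algorithm 1, whose MMPH construction and MPI parallelization are more involved), the snippet below 2-bit-encodes k-mers and builds a toy reference index by a sliding window, keeping every occurrence of a seed, including those from repeat regions:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of the reference indexing idea under the stated 2-bit encoding
// (A=00, C=01, G=10, T=11). Encoding a k-mer this way yields an integer key
// whose numeric order equals the lexicographic order of the k-mer, which is
// what makes an order-preserving (monotonic) index possible. The k value is
// illustrative, and a std::map stands in for the MMPH-backed RIT.
static const int K = 12;  // hypothetical seed length

uint64_t encodeKmer(const std::string& kmer) {
    uint64_t key = 0;
    for (char base : kmer) {
        uint64_t code = (base == 'A') ? 0 : (base == 'C') ? 1
                      : (base == 'G') ? 2 : 3;          // 'T'
        key = (key << 2) | code;                        // append 2 bits per base
    }
    return key;
}

// Reference Index Table (RIT): encoded k-mer -> list of reference positions.
// Repeat regions naturally produce multiple positions per key and are kept.
std::map<uint64_t, std::vector<uint32_t>> buildIndex(const std::string& ref) {
    std::map<uint64_t, std::vector<uint32_t>> rit;
    for (uint32_t pos = 0; pos + K <= ref.size(); ++pos) {
        rit[encodeKmer(ref.substr(pos, K))].push_back(pos);  // sliding window
    }
    return rit;
}
```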

The value corresponding to each position in the index is the list of RIT IDs, i.e., the locations at which the k-mer seed occurs along the length of the reference string. These values can be retrieved with a single access to the table, so the sorted index table is searched with O(1) accesses per key. The lookup process per read on the RIT is explained in Algorithm 2.

In the context of repeat regions, a k-mer is extracted from multiple locations over a fragment of the reference, resulting in an extended list of values in the RIT. As compared to other indexing techniques that employ heuristics to purge repeat-region hits, ReneGENE-GI reports those hits, which can eventually throw light on many anomalies embedded in the repeats during alignment.

To make the lookup process mutation aware, all possible mutations at all positions of a single k-mer key are derived, and a lookup is performed for each of these mutation-induced k-mers. This results in a complete lookup, where, instead of a single key, the lookup is performed over a complete set of mutation-aware keys for each read.

Algorithm 2
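The following hypothetical sketch illustrates the mutation-aware lookup described above, under the assumption that the derived mutations are single-base substitutions of the k-mer; for brevity the toy RIT here is keyed by the k-mer string rather than by its 2-bit code, unlike the real RIT:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of a mutation-aware lookup: every position of the read's k-mer is
// substituted with each alternative base and all resulting keys are queried
// against the RIT, so candidate reference locations are still found even when
// the k-mer differs from the reference by one base.
std::vector<uint32_t> mutationAwareLookup(
        const std::string& kmer,
        const std::map<std::string, std::vector<uint32_t>>& rit) {
    std::vector<std::string> keys{kmer};                  // exact key first
    static const char bases[] = {'A', 'C', 'G', 'T'};
    for (size_t i = 0; i < kmer.size(); ++i) {
        for (char b : bases) {
            if (b == kmer[i]) continue;                   // skip the original base
            std::string mutated = kmer;
            mutated[i] = b;
            keys.push_back(mutated);                      // 3*k substituted keys
        }
    }
    std::vector<uint32_t> hits;                           // candidate reference locations
    for (const auto& key : keys) {
        auto it = rit.find(key);
        if (it != rit.end())
            hits.insert(hits.end(), it->second.begin(), it->second.end());
    }
    return hits;
}
```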

3.3 SRM in ReneGENE-GI

The ReneGENE-GI pipeline performs SRM, where the small fragments of genome from the NGS platforms, generally known as short reads, are mapped or aligned against a reference genome string. SRM works on a massive input data set of short reads, typically of the order of petabytes, and aims to find the region of origin of each short read string with respect to the reference, and hence to find regions of similarity or dissimilarity. Eventually, SRM builds the longer genome from the short reads by putting them together, as in a jigsaw puzzle, with respect to a reference genome. SRM is interpreted as a typical ASM problem in the ReneGENE-GI pipeline, finding occurrences of a smaller short read within a much larger reference [4, 24, 25].

Algorithm 3

SRM based on a Dynamic Programming (DP) [26] method, with preprocessing, is shown in Algorithm 3. For genome sequences, the DP technique has proven to be the most sensitive way of performing ASM. The DP method comes with a quadratic time and space complexity of O(LN) in the lengths L and N of the two strings. DP-based algorithms employ a recursive scoring or cost function, with an appropriate linear or affine penalty model (for the dissimilarities and string errors), to assign scores to a mapping. The algorithm operates over a matrix space called the alignment matrix, D.
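As one concrete instance of such a recurrence (shown purely for illustration; the exact scoring and penalty model of ReneGENE-DP is not reproduced here), a Smith-Waterman-style local alignment with a linear gap penalty δ fills the alignment matrix D as:

$$ D(i,j) = \max \left\{ 0,\; D(i-1,j-1) + s(q_i, G_j),\; D(i-1,j) - \delta,\; D(i,j-1) - \delta \right\} $$

with D(i,0) = D(0,j) = 0, where s(q_i, G_j) is the match/mismatch score between read base q_i and reference base G_j.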

The SRM module of ReneGENE-GI runs on accelerator hardware such as FPGAs and GPUs plugged into the HPC systems. The ReneGENE-DP algorithm within the SRM module is designed to run as multiple parallel threads on the accelerator hardware. This effectively speeds up the entire pipeline, providing a multi-fold performance improvement over state-of-the-art SRM implementations.

3.4 Read Extension Module of ReneGENE-GI

The de novo read extension module of the pipeline deals with the problem of grouping short reads based on an overlap relationship among the reads, in the absence of a reference genome. The algorithm is discussed in detail in our previous work [27]. Related reads are grouped together and grow into longer sequences called readtigs. A single read can share a similar overlap relationship with several of its sequence neighbours, so a single seed can grow into many readtigs. This is decided at run time, and the computations are therefore clearly irregular, owing to the irregularity in the relationships among the input data sets. To accommodate readtigs, or extended reads, that grow on the fly, the module implements dynamically growing data structures cast in a map-reduce framework, allowing a parallel deployment. The de novo module processing is shown in Algorithm 4, and a simplified overlap-grouping sketch follows it.

Algorithm 4
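The hypothetical sketch below illustrates only the core overlap idea behind readtig growth: reads are grouped by a shared fixed-length prefix and greedily extended whenever another read's prefix matches the current readtig's suffix. The overlap length, the greedy single-path extension and the container choices are illustrative and do not reflect the map-reduce structure of Algorithm 4:

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified, hypothetical sketch of overlap-driven read extension: a read
// whose suffix matches another read's prefix (for a fixed overlap length) is
// extended by the non-overlapping tail of that neighbour. The real module is
// map-reduce based, lets one seed grow into many readtigs, and uses dynamic
// structures; none of that is reproduced here.
static const size_t OVERLAP = 4;  // illustrative overlap length

std::vector<std::string> buildReadtigs(const std::vector<std::string>& reads) {
    // Map each read's prefix of length OVERLAP to the read indices carrying it.
    std::map<std::string, std::vector<size_t>> byPrefix;
    for (size_t i = 0; i < reads.size(); ++i)
        byPrefix[reads[i].substr(0, OVERLAP)].push_back(i);

    std::vector<std::string> readtigs;
    std::vector<bool> used(reads.size(), false);
    for (size_t i = 0; i < reads.size(); ++i) {
        if (used[i]) continue;
        std::string tig = reads[i];        // seed read starts a new readtig
        used[i] = true;
        bool extended = true;
        while (extended) {                 // greedy extension along one path
            extended = false;
            std::string suffix = tig.substr(tig.size() - OVERLAP);
            auto it = byPrefix.find(suffix);
            if (it == byPrefix.end()) break;
            for (size_t j : it->second) {
                if (used[j]) continue;
                tig += reads[j].substr(OVERLAP);  // append the new tail
                used[j] = true;
                extended = true;
                break;
            }
        }
        readtigs.push_back(tig);
    }
    return readtigs;
}
```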

3.5 Variant Calling in ReneGENE-GI

Variants or mutations in a genome sequence represent the unique changes in genomic alphabets along the length of the target genome, with respect to a reference genome, at specific locations. Variant calling is the process of identifying such variants for the sample under consideration. These variants can eventually throw light on many structural and functional anomalies embedded in genomes and their repeat regions, manifesting in the form of structural variations, Copy Number Variations (CNVs), Single Nucleotide Polymorphisms (SNPs), etc. A precise alignment achieved through ReneGENE-GI's SRM enables variant calling of high quality and confidence, allowing more precise genotyping and phenotyping in the presence of fusion genes and translocations within repeat regions of a genome. The SRM output from ReneGENE-GI is presented to variant calling tools like GATK, SAMtools and FreeBayes, which provide the resultant variant calls in the standard VCF format.

Amidst a wide variety of state-of-the-art GI solutions [28, 29], the genomic computing community lacks consensus or standards for a flawless elucidation of biologically relevant information. In addition, the choices of algorithms and implementations in the intermediate stages of GI have been subjective enough to discard information useful for downstream analyses, in the process of optimizing and accelerating the pipeline. As a result, downstream analyses continue to suffer from the large heuristics-driven errors that creep into the pipeline and into the subsequent biologically relevant inferences. In this context, the ReneGENE-GI pipeline stands out as an optimal choice for performing GI over a fully accelerated pipeline, with an underlying confidence in the biologically significant and causative inferences made downstream.

4 ReneGENE CGM: The Comparative Genomics Module of ReneGENE-GI

The ReneGENE-CGM runs on accelerator platforms like FPGAs and GPUs. We have two flavours of this module, ReneGENE-AccuRA for FPGAs and ReneGENE-GMAccS for GPUs.

4.1 AccuRA: The SRM Pipeline for FPGAs

ReneGENE-GI’s CGM is implemented on a reconfigurable accelerator platform as ReneGENE-AccuRA. This is an extended version of AccuRA, published in our earlier work [30, 31], which presents AccuRA’s architecture, algorithms, mathematical model and scalability analysis. The AccuRA hardware archetype is presented in Fig. 4.

Figure 4: AccuRA SRM Pipeline Architecture. The RHP hosts Mapper Kernel (MAK) Units embedded with a filter subsystem and Aligner Kernel (ALK) Units embedded with Dynamic Programming Kernel (DPK) Units.

The SRM performed by the CGM, when applied to very long genomic sequences, is interpreted as an Approximate String Matching (ASM) problem. SRM algorithmically analyses the structural, functional and evolutionary relationship between the two input strings. SRM attempts to search for a specific short read string q of length |q| (ranging from about 25 to a few hundred bases) over a much longer reference genome string G of length |G| (a human reference genome is typically 3 billion bases long). The aim is to find the region of origin of each short read string with respect to the reference, and hence to find regions of similarity or dissimilarity, over the character set Σ = {A, C, G, T}.

The Dynamic Programming Kernel (DPK) units in AccuRA's hardware host a highly efficient and parallel kernel that achieves traceback in hardware, based on the DP alignment algorithm of Algorithm 3. The hardware performs alignment in the shortest deterministic time, agnostic to short read length. AccuRA achieves a significant improvement in performance over conventional RHP models for SRM, with adequate sequence partitioning and scheduling schemes in the SRM workflow. By performing traceback in hardware, overlapped with the forward scan during alignment, AccuRA eliminates memory bottleneck issues and significantly reduces the compute-intensive tasks on the host. The AccuRA prototype, configured on reconfigurable hardware such as an FPGA, scaled well to accommodate the big data of short reads of varying lengths, from smaller prokaryotic genomes to larger mammalian genomes, with fine-grained single nucleotide resolution.

4.2 ReneGENE-AccuRA: A Multichannel Implementation of AccuRA SRM Pipeline on FPGAs

The scalability analysis and results from the various prototypes in our earlier work substantiated the scalability and performance of the parallel AccuRA SRM pipeline, making it a promising candidate to accelerate the SRM process in the NGS pipeline. Here, we present ReneGENE-AccuRA, a multi-channel, scalable and massively parallel computing pipeline that performs ultra-fast alignment of DNA short reads, presented in Fig. 5. Each channel of ReneGENE-AccuRA comprises one AccuRA SRM pipeline, hosting several DPK and mapper units. A single piece of reconfigurable hardware such as an FPGA can host multiple such AccuRA SRM pipelines. Supplemented with a multi-threaded firmware architecture, ReneGENE-AccuRA precisely aligns short reads at a fine-grained single nucleotide resolution and offers full alignment coverage of the genome, including repeat regions. ReneGENE-AccuRA is a fully streaming solution that eliminates memory bottleneck and storage issues, thus reducing the computing and I/O burden on the host significantly. With an appropriate data streaming pipeline, we provide an affordable solution, customizable according to scalability needs and budget availability. It is also pluggable into any genome analysis pipeline for use across multiple domains, from research to clinical environments.

Figure 5: ReneGENE-AccuRA: the multi-channel architecture based on the AccuRA SRM pipeline.

4.3 ReneGENE-GMAccS: A Multichannel Implementation of SRM on GPUs

ReneGENE-GMAccS, presented in Fig. 6, is a scalable, massively parallel and heterogeneous GPU-based model for SRM. It is a heterogeneous Single Instruction Multiple Data (SIMD) system for accelerating the SRM process in the NGS pipeline. The architecture implements Algorithm 3 of ReneGENE-GI across multiple parallel computational threads on GPUs.

Figure 6: The ReneGENE-GMAccS architecture.

ReneGENE-GMAccS efficiently handles the task-level and data-level parallelism that is implicit in the ASM problem presented. The ReneGENE-GMAccS firmware runs on a multi-node multi-core host platform, hosting several kernels in a pipelined fashion. These kernels are scheduled to run on a GPU-based Accelerator Platform (GAP) housing single or multiple GPUs. The firmware allows for dynamic balancing of computations and a flexible memory hierarchy. Each kernel performs a complex, coarse-grained to fine-grained parallel task, executed on a collection of data elements. With seamless streaming of data between the host and the GAP, ReneGENE-GMAccS presents a massively parallel HPC model for SRM.

4.4 ReneGENE-GMAccS OpenCL Kernel Model on GPUs

We have implemented the SRM algorithms in ReneGENE-GMAccS as kernels written in the Open Computing Language (OpenCL), which makes them platform independent. This also enables ReneGENE-GMAccS to run in a heterogeneous parallel computing environment hosting GPUs from multiple vendors. OpenCL also offers Just-In-Time (JIT) compile options, which allow the end-to-end application to make superior use of the target GPU platform. ReneGENE-GMAccS leverages the portability of OpenCL and the compute efficiency of GPUs through simplified wrappers and libraries. The kernels take advantage of the concurrent and parallel programming model provided by the OpenCL framework, in association with the application code developed in C/C++ that resides on the host.

The OpenCL platform model comes with an inbuilt host management layer and device-side language support, and defines the relationship between the ReneGENE-GMAccS host and the GAP. Each device on the GAP is an abstraction of a set of Compute Units (CUs), with each CU hosting a set of Processing Elements (PEs), presenting a Single Instruction Multiple Thread (SIMT) parallel execution model.

A scalable execution model is realized while deploying fine-grained work items, or batches of short read inputs, onto the GAP. The kernels define the number of work-items as an n-dimensional range, or NDRange. The work items are deployed as fixed-size work groups across the CUs in the Streaming Multiprocessors (SMs) of the GPU. The work groups get scheduled as groups of threads on sets of PEs.

Each schedulable group of threads is termed a wavefront on the hardware, where all the threads execute the same instruction even when they lie on different control paths. Several such work groups run concurrently in a large batch of execution. An appropriate choice of NDRange and work-group size results in the most suitable occupancy levels for the GAP. Occupancy, a measure of concurrency within the GAP, decides the system performance while streaming large batches of short reads.

ReneGENE-GMAccS follows a fully consistent OpenCL shared memory model. The kernels allow synchronized host-device and device-host data transfers while running ReneGENE-GMAccS at multiple levels of granularity. This helps the ReneGENE-GMAccS firmware efficiently interleave periods of computation and communication while handling ASM on large batches of short reads. By choosing the appropriate batch size, NDRange and level of granularity, the ReneGENE-GMAccS OpenCL kernels exploit the underlying GPU architecture-level parallelism to its full potential.
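A hypothetical host-side fragment is sketched below to show how the NDRange and work-group size discussed above are supplied to an OpenCL kernel; the kernel name, arguments and sizes are placeholders and not the actual ReneGENE-GMAccS kernels, and error checking is omitted:

```cpp
#include <CL/cl.h>

// Sketch only: launch one batch of reads on an already-built OpenCL kernel.
// The kernel name ("srm_align" is assumed elsewhere), buffer layout, batch
// size and work-group size are purely illustrative.
void enqueueAlignmentBatch(cl_command_queue queue, cl_kernel srmKernel,
                           cl_mem d_reads, cl_mem d_refIndex, cl_mem d_results,
                           size_t batchSize) {
    // Bind the device buffers holding one batch of reads, the reference index
    // and the output alignments to the kernel arguments.
    clSetKernelArg(srmKernel, 0, sizeof(cl_mem), &d_reads);
    clSetKernelArg(srmKernel, 1, sizeof(cl_mem), &d_refIndex);
    clSetKernelArg(srmKernel, 2, sizeof(cl_mem), &d_results);

    // One work-item per read in the batch; a 1-D NDRange with a fixed
    // work-group size (64 here, purely illustrative) controls how work-groups
    // are distributed across the compute units and hence the occupancy.
    // (Assumes batchSize is a multiple of the work-group size.)
    const size_t globalSize = batchSize;   // NDRange: total work-items
    const size_t localSize  = 64;          // work-group size per compute unit
    clEnqueueNDRangeKernel(queue, srmKernel, /*work_dim=*/1, nullptr,
                           &globalSize, &localSize, 0, nullptr, nullptr);
    clFinish(queue);                       // wait for the batch to complete
}
```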

5 ReneGENE-GI: Solutions and Results

Here, we present the details of the prototypes developed for ReneGENE-AccuRA and ReneGENE-GMAccS. The performance comparison for the above CGMs are provided for the SRM conducted on short reads for the human genome data set.

5.1 ReneGENE-GI: Solutions

5.1.1 ReneGENE-GI Host Environment

The software stack that runs on the host comprises the preprocessing and post-processing modules of the ReneGENE-GI pipeline. This includes: (i) the reference index hashing step based on the MMPH algorithm, (ii) the read-lookup algorithm against the indexed reference for candidate genomic locations for a probable alignment, (iii) the HPC platform specific libraries and middleware, (iv) the hardware abstraction layer with the corresponding device drivers and platform drivers, (v) the post-processing module that decides the best alignment and secondary alignments and computes the corresponding alignment/map qualities, and (vi) a subsequent formatting of the output data in the Sequence Alignment/Map (SAM) format. The pipeline also allows conversion of the SAM file to its compressed Binary Alignment/Map (BAM) form and its verification for fitness for downstream NGS data analytics.

5.1.2 Prototype Model for ReneGENE-GMAccS

To evaluate ReneGENE-GMAccS for its performance and scalability, we have developed prototype models based on both single and multiple GPUs. The single GPU environment (Platform P1) is a workstation hosting an 8-core AMD processor coupled with an Nvidia GPU. To evaluate the scalability features of ReneGENE-GMAccS, we modeled the same on SahasraT, the Cray XC40 based in-house supercomputing cluster, with up to 24 GPUs put to use for alignment in parallel (Platform P2) [32]. The prototype has a set of three kernels, embedded in a buffered pipeline. The details of the platforms are shown in Table 2.

Table 2 ReneGENE-GMAccS prototype platform details.

5.1.3 Prototype Model for ReneGENE-AccuRA

ReneGENE-AccuRA was prototyped on an HPC platform supported with a reconfigurable accelerator card built on multiple Xilinx Virtex 7 XC7V2000T devices, scalable up to 633 million ASIC gates. The host interface is through a Kintex-7 XC7K325T-FBG-900 FPGA. The host processor is interfaced to the Kintex-7 FPGA via the high speed PCI-E x8 gen3 interface. The embarrassingly parallel bio-computing in AccuRA's SRM is further favoured by the inherent reprogrammability, massively parallel compute resources, extreme data path parallelism and fine-grained control mechanisms offered by the FPGAs.

5.1.4 ReneGENE-AccuRA Hardware

The multi-channel ReneGENE-AccuRA is represented as the Design Under Test (DUT) within each FPGA. It is interfaced with the prototyping infrastructure on the FPGA through the standard AXI4 interface, with a 256-bit-wide data bus running at a frequency of 125 MHz. The Address Remapper unit automatically remaps the address spaces of the DUT for transactions, easing scalability when more AccuRA SRM channels are added to the DUT. The implementation is done in VHDL and Verilog.
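As a point of reference, the raw streaming bandwidth implied by this interface, ignoring AXI4 protocol overheads (which are not quantified here), works out to:

$$ 256\ \text{bits} \times 125\ \text{MHz} = 32\ \text{Gb/s} = 4\ \text{GB/s} $$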

5.1.5 Scalability Analysis for ReneGENE-AccuRA

The parameters in scalability analysis for ReneGENE-AccuRA are given in Table 3. Consider the multi-channel AccuRA SRM pipeline model, where reads are streamed in at the rate Rin (measured as Giga Reads/second or GR/s) over an input streaming bandwidth of BWin (measured as Giga Bytes/second or GB/s). The m subsequences of short reads, each of length l, are streamed through a streaming buffer of depth B, which holds one subsequence in each word of storage.

Table 3 Scalability analysis parameters.

Each MAK unit performs filtering in time TMAK, over x cycles of the MAK unit clock, with period τMAK. Each DPK unit performs alignment in time TDPK, over y cycles of the DPK unit clock, with period τDPK. If N MAK-DPK units are configured within a single AccuRA SRM pipeline channel, then each unit gets its share of p pairs for performing SRM. The single AccuRA SRM pipeline channel thus performs N SRMs in a total time of TMAK + TDPK, with N MAK-DPK units running in parallel. The single channel hence processes reads at a rate of RRHP measured in GR/s. At this rate, the hardware aligns all the P reads, with p reads aligned in parallel over N MAK-DPK units, over a total time of TRHP.

For scaling up the performance, let us include C such channels of AccuRA SRM pipelines within a single FPGA. Here, each channel will take the same amount of time to process the same number of reads.

Now, the overall performance of all the MAK units across the C channels, measured in terms of Giga Maps Per Second (GMPS), is given by:

$$ \small P_{MAK} = \frac{C \times N \times K}{x \times T_{MAK}} $$
(1)

The overall performance of the DPK units, measured in terms of Giga Cell Updates Per Second (GCUPS), is given by:

$$ \small P_{DPK} = \frac{C \times N \times C}{y \times T_{DPK}} $$
(2)

Thus, we see that by scaling up the single AccuRA SRM pipeline channel, by increasing N, the ReneGENE-AccuRA hardware gains better throughput, as it can handle more pairs in parallel. The scalability is complemented by further scaling up the number of such channels, C, within a single FPGA. The number of channels within an FPGA is limited only by the reconfigurable hardware space allowed for the DUT. The input data is then divided fairly among the channels, so that the SRM process completes in approximately 1/C of the total time taken for SRM by a single channel.

5.2 ReneGENE-GI: Results

5.2.1 Comparing ReneGENE-GI CGM with Existing Aligners

Here, we compare the basic multi-core implementation of ReneGENE-GI's CGM with the open-source distributions of widely used aligners, namely BWA-MEM and Bowtie2, without the support of any acceleration hardware.

We have selected the short read data for a list of small organisms, given in Table 4, for running the ReneGENE-GI pipeline. These data sets vary in the length of the reads and the reference. The performance comparisons are listed in Table 5, which shows that, for these organisms, ReneGENE-GI is much faster than the other two SRM tools. In the subsequent sections, we present the results for ReneGENE-GI on accelerator platforms with larger genomes.

Table 4 Small genome data.
Table 5 ReneGENE-GI Versus state-of-the-art SRM comparisons (time taken in seconds).

ReneGENE-GI has reported an increased number of valid alignments, with precise and accurate reference locations. This has helped in reporting several unique variants (changes in genomic alphabets) at specific genomic locations as part of the alignment process. Through a downstream process called variant calling, the SRM output was analysed and scanned for the quality and quantity of such variants. A comparison of the variants derived from ReneGENE-GI with those derived from Bowtie2 and BWA-MEM was performed; the details are captured in Table 6. The table shows that several variants are reported exclusively by ReneGENE-GI. These are relevant when looking for structural and biological interpretations and when filtering out causative and actionable variants, and they would otherwise have been purged by the other SRM tools.

Table 6 ReneGENE-GI variant calling comparison with alignment to forward reference strand: results derived from the output of comparative genomics module of the ReneGENE-GI pipeline.

5.2.2 Performance Evaluation of ReneGENE-GI for Large Genome Benchmarks

The ReneGENE-GI prototypes for the FPGA and GPU platforms were tested by running SRM on very large data sets, of the order of several gigabytes, for the mammalian human genome. The details of the input data sets are provided in Table 7. We have used the GRCh38 reference genome assembly, which is around 3 billion bases long, consisting of 23 chromosomes and mitochondrial DNA. We have considered the alignment of three human genomes, corresponding to a family: the father (SRR1559289, SRR1559290, SRR1559291, SRR1559292, SRR1559293), the mother (SRR1559294, SRR1559295, SRR1559296, SRR1559297, SRR1559298) and their child (SRR1559281, SRR1559282, SRR1559283, SRR1559284). Here, each read is 200 bases long. The reads are subjected to lookup against the reference genome index. Subsequently, they are sent for alignment on the FPGA and GPU by streaming over the PCIe link through buffers. The sample FPGA buffers are configured to hold up to 18874368 words of data in one batch, as shown in Table 7. For GPUs, this is configured through two variables: the batch size (the number of parallel GPU compute threads) and the NDRange (NDR).

Table 7 Human genome experiment details.
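If each buffer word corresponds to the 256-bit AXI4 data bus width described in Section 5.1.4 (an assumption, since the word size of these buffers is not stated explicitly), one FPGA batch amounts to:

$$ 18874368\ \text{words} \times 32\ \text{B/word} \approx 576\ \text{MiB} $$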

5.2.3 Results from Large Genome Benchmarks for ReneGENE-GMAccS on P1

Table 8 captures the time taken by P1, in seconds, to align the read sets across all the chromosomes in the reference. For the largest read set, SRR1559291, P1 could align about 169 million short reads of 100 bases each in a total time of 162.66 seconds against chromosome 1. Thus, with a single GPU, we could achieve SRM for the entire read set of 169777482 reads, against all the chromosomes of the human reference genome, in about 38.71 minutes, which is the GPU run time for all the alignments.

Table 8 ReneGENE-GMAccS alignment time per chromosome for human genome read sets on P1. The values indicate time taken in seconds.

5.2.4 Results from Large Genome Benchmarks for ReneGENE-GMAccS on P2

We have exploited the scalability of ReneGENE-GMAccS to achieve a better SRM performance on P2, where the system was configured to use up to 24 GPUs in parallel.

Figure 7 shows the performance of P2 with 1, 2, 4, 8 and 24 GPUs, for a batch size of 1283776 and an NDR of 64, on the large genome read sets. Table 9 lists the total time taken by P1 and P2 to align all the input read sets. With 24 GPUs, P2 took 113.71 seconds to complete the alignment of the largest read set, SRR1559291, whereas P1 took 2322.85 seconds to finish the job with a single GPU. Thus, from Fig. 7 and Table 9, the performance improvement is evident both in the throughput levels, measured in Million Maps Per Second (MMPS), and in the total time taken. The distributed memory architecture on P2, along with the MPI middleware in ReneGENE-GMAccS, thus helps in achieving a more-than-linear improvement in performance without overwhelming the shared resources as the number of GPUs increases.

Figure 7: ReneGENE-GMAccS performance in MMPS, on P2 with multiple GPUs.

Table 9 ReneGENE-GMAccS overall alignment time comparison for human genome read sets.

5.2.5 Results from Large Genome Benchmarks for ReneGENE-AccuRA

The ReneGENE-AccuRA prototype was tested with single- and dual-channel AccuRA SRM pipelines within a single FPGA while aligning the human short read sets. Each channel hosted 16 MAK units and 16 DPK units. With this configuration, to align 500 million reads (100 bases long) against the reference genome (3 billion bases long), with each read reporting a mapping at five locations on the reference, ReneGENE-AccuRA performs 4.65 Tera map operations and 10.24 Tera cell updates, at the rate of 21.14 GMPS and 46.56 GCUPS, in about 3.68 minutes. The implementation results for the dual-channel ReneGENE-AccuRA are provided in Table 10.

Table 10 ReneGENE-AccuRA utilization report, with single and dual channel AccuRA SRM pipelines on a single Xilinx Virtex 7 XC7V2000T device.

For the human genome read sets in Table 7, the alignment times for various configurations are shown in Fig. 8. The time taken by ReneGENE-AccuRA is about one-fifth (with a single-channel AccuRA SRM pipeline) and about one-tenth (with a dual-channel AccuRA SRM pipeline) of the time taken by the single-GPU OpenCL implementation of ReneGENE-GI's CGM. This single-GPU implementation is itself 2.62x faster than CUSHAW2-GPU (the GPU CUDA implementation of CUSHAW2) [33, 34]. With the single-GPU implementation demonstrating a speedup of 150x over standard heuristic aligners in the market like BFAST [35], the reconfigurable accelerator version, ReneGENE-AccuRA, is several orders of magnitude faster than its competitors, offering precision over heuristics. Extending the implementation to four and six channels within a single FPGA is expected to yield a further increase in performance, as evident from the scalability analysis. With multiple FPGAs available on the platform, the scope for improvement grows with the number of FPGAs and the number of channels supported within each FPGA.

Figure 8: Performance comparison of FPGA versus GPU for human short read sets.

6 Conclusion

Through this paper, we have presented ReneGENE-GI, an innovatively engineered GI pipeline. The pipeline strikes the right balance between comparative genomics and de novo read extension, to run an irregular application like GI. With parallel algorithms executed on reconfigurable accelerator hardware, ReneGENE-GI exploits the inherent parallelism and scalability of the hardware at the level of micro and system architecture, amidst fine-grain synchronization.

The k-mer based dynamic MMPH algorithm for reference genome indexing provides an accurate hash table, allowing heuristic-free multi-read alignment across the repeat regions of the reference. Supplemented with a multi-threaded firmware architecture, the CGM in ReneGENE-GI precisely aligns short reads at a fine-grained single nucleotide resolution and offers full alignment coverage of the genome, including repeat regions. The CGM has been deployed on two accelerator platforms, as ReneGENE-AccuRA on FPGAs and ReneGENE-GMAccS on GPUs. The parallel dynamic programming kernels on the multiple channels of the CGM seamlessly perform the traceback process in hardware, in parallel with the forward scan, thus achieving short read mapping in the minimum possible deterministic time.

ReneGENE-GI is a fully streaming solution that eliminates memory bottleneck and storage issues, thus reducing the computing and I/O burden on the host significantly. The performance analysis shows that ReneGENE-AccuRA is faster than ReneGENE-GMAccS and the state-of-the-art aligners, with similar levels of precision and accuracy, while aligning significantly large volumes of human genome data. With an appropriate data streaming pipeline, we provide an affordable solution, customizable according to scalability needs and budget availability. It is also pluggable into any genome analysis pipeline for use across multiple domains, from research to clinical environments. The precise secondary analysis offered by the CGM of ReneGENE-GI running on accelerator hardware, associated with efficient tertiary analysis downstream, serves as a promising route to derive more meaningful inferences from NGS data with biological and clinical significance.