1 Introduction

De novo sequence assembly remains one of the most fundamental problems in bioinformatics. Most of the available assemblers [1, 12, 13, 19, 20, 25] are based on the notions of de Bruijn graphs and of k-mers (short substrings of the input data, each of length k). Currently, biological data are produced by different Next-Generation Sequencing (NGS) technologies, which routinely and cheaply produce a large number of reads whose length varies according to the specific technology. For example, reads obtained with the Illumina technology (currently the most widely used) have length between 50 and 150 bases [21].

To analyze datasets coming from different technologies, hence with a large variation of read lengths, an approach based on same-length strings is likely to be limiting, as witnessed by the recent introduction of variable-length de Bruijn graphs [9]. The string graph [18] is an alternative representation that does not need to break the reads into k-mers (as de Bruijn graphs do), and has the advantage of immediately distinguishing the repeats that result in different arcs. The string graph is the main data representation used by assemblers based on the overlap-layout-consensus paradigm. Indeed, in a string graph the vertices are the input reads and the arcs correspond to overlapping reads, with the property that contigs are paths of the string graph. An immediate advantage of string graphs is that they can disambiguate some repeats that methods based on de Bruijn graphs might resolve only at later stages, for example the repeats that are longer than k/2 but contained in a read. Even without repetitions, analyzing only k-mers instead of the longer reads can result in some information loss, since bases of a read that are more than k positions apart are not part of the same k-mer, but might be part of the same read. Indeed, differently from de Bruijn graphs, any path of a string graph is a valid assembly of reads. On the other hand, string graphs are computationally more expensive to build [24], justifying our search for faster algorithms.

From an algorithmic point of view, the most widely used string graph assembler is SGA [23], which first constructs the BWT [11] and the FM-index of a set of strings, and then uses those data structures to efficiently compute the arcs of the string graph (connecting overlapping reads). Another string graph assembler is Fermi [17], which implements a variant of the original SGA algorithm [23] tailored for SNP and variant calling. A number of recent works address the problem of designing efficient algorithmic strategies or data structures for building string graphs. Among them is a string graph assembler [4], based on a careful use of hashing and Bloom filters, with performance comparable to the first SGA implementation [23]. Another important alternative to SGA is Readjoiner [15], which is based on an efficient computation of a subset of exact suffix-prefix matches and outputs the non-redundant arcs of the graph through subsequent rounds of suffix sorting, scanning, and filtering.

All assemblers based on string graphs (such as SGA) need both (1) to query an indexing data structure (such as an FM-index), and (2) to access the original read set to detect prefix-suffix overlaps between reads. Since self-indexing data structures, such as the FM-index, represent the whole information of the original dataset, an interesting problem is to design efficient algorithms for the construction of string graphs that keep only the index, without accessing the read set alongside it. Improvements in this direction have both theoretical and practical motivations. Indeed, detecting prefix-suffix overlaps by analyzing only the (compressed) index is an almost unexplored problem, and managing such a data structure is usually more efficient.

Following this research direction, we propose a new algorithm, called FSG, to compute the string graph of a set R of reads; its O(nm) time complexity matches that of SGA, where n is the number of reads in R and m is the maximum read length. To the best of our knowledge, it is the first algorithm that computes a string graph using only the FM-index of the input reads. The vast literature on the BWT and the FM-index hints that this approach is amenable to further research. An important observation is that SGA computes the string graph essentially by performing, for each read r, a query to the FM-index for each character of r, in order to compute the arcs outgoing from r. While this approach works in O(nm) time, it can perform several redundant queries, most notably when the reads share common suffixes (a very common case). Our algorithm queries the FM-index in a specific order, so that each string is processed only once, while SGA might process each repeated string more than once. It is important to notice that our novel algorithm uses a characterization of the string graph that is different from, but equivalent to, the one in [18]; this characterization, stated in [7], is quite useful when processing reads with their FM-index. Moreover, since we have integrated our algorithm into SGA, the read correction and assembly phases of SGA can be applied without any modification. These facts guarantee that the assemblies produced by our approach and by SGA are the same. In a previous paper, we tackled the problem of constructing the string graph in external memory [8] by taking advantage of some recent results on the external-memory implementation of the FM-index [2]. Experimental results [8] revealed that computing the FM-index and the LCP (Longest Common Prefix) array are the two main limiting factors towards an efficient (in terms of running time and main memory requirements) external-memory algorithm to construct the string graph. In fact, even the best known algorithms for these steps do not have an optimal I/O complexity [2, 3].

The FSG algorithm provides an approach to build a string graph that could serve different read assembly purposes. We have implemented FSG and integrated it into the SGA assembler, replacing SGA's string graph construction step. Our implementation follows the SGA guidelines, i.e., we run the correction step of SGA before computing the overlaps, and we compute overlaps without allowing mismatches (which is also SGA's default). Notice that SGA is a finely tuned implementation that performed very well in the latest Assemblathon competition [10]. We have compared FSG with SGA using the latter's default parameters. Our experimental evaluation on a standard benchmark dataset shows that our approach is 2.3–4.8 times faster than SGA in terms of wall-clock time.

2 Preliminaries

We briefly recall some standard definitions that will be used in the following. Let \(\varSigma \) be a constant-sized alphabet and let S be a string over \(\varSigma \). We denote by S[i] the i-th symbol of S, by \(\ell =|S|\) the length of S, and by S[i : j] the substring \(S[i]S[i+1] \cdots S[j]\) of S. The suffix and prefix of S of length k are the substrings \(S[\ell -k +1: \ell ]\) (denoted by \(S[\ell -k +1:]\)) and S[1 : k] (denoted by S[ : k]), respectively. Given two strings \(S_i\) and \(S_j\), we say that \(S_i\) overlaps \(S_j\) iff a nonempty suffix \(\beta \) of \(S_i\) is also a prefix of \(S_j\), that is, \(S_{i}=\alpha \beta \) and \(S_j = \beta \gamma \). In this paper we consider a set R of n strings over \(\varSigma \) that are terminated by the sentinel \(\$\), which is the smallest character. To simplify the exposition, we will assume that all input strings have exactly m characters, excluding the \(\$\). The overlap graph of a set R of strings is the directed graph \(G_O = (R, A)\) whose vertices are the strings in R, and each two overlapping strings \(r_i= \alpha \beta \) and \(r_j= \beta \gamma \) form the arc \((r_i, r_j) \in A\) labeled by \(\alpha \). In this case \(\beta \) is called the overlap of the arc and \(\alpha \) is called the extension of the arc. Observe that the notion of overlap graph originally given in [18] labels the arc \((r_i, r_j) \in A\) with \(\gamma \) instead.
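For a concrete (hypothetical) example, consider the reads \(r_i = \mathtt{TACGT}\) and \(r_j = \mathtt{CGTAA}\): the nonempty suffix \(\beta = \mathtt{CGT}\) of \(r_i\) is also a prefix of \(r_j\), so \(r_i\) overlaps \(r_j\), and the overlap graph contains the arc \((r_i, r_j)\) with overlap \(\beta = \mathtt{CGT}\) and extension \(\alpha = \mathtt{TA}\).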

The notion of a string graph derives from the observation that in an overlap graph the label of an arc \((r, s)\) may be obtained by concatenating the labels of a pair of arcs \((r, t)\) and \((t, s)\); thus the arc \((r, s)\) can be removed from the overlap graph without loss of information, since removing all such arcs, called redundant arcs, does not change the set of valid paths. In [18] redundant arcs are those arcs \((r, s)\) labeled by \(\alpha \beta \) such that the prefix \(\alpha \) labels an arc \((r, t)\). An equivalent definition of string graphs follows. An arc \(e_1 = (r_i, r_j)\) of \(G_{O}\) labeled by \(\alpha \) is transitive (or reducible) if there exists another arc \(e_2 = (r_k, r_j)\) labeled by \(\delta \), where \(\delta \) is a suffix of \(\alpha \) [7]. We say that \(e_1\) is non-transitive (or irreducible) if no such arc \(e_{2}\) exists. The string graph of R is obtained from \(G_{O}\) by removing all reducible arcs. This definition allows us to use the FM-index to compute the labels of the string graph via backward extensions on the index.

The Generalized Suffix Array (GSA) [22] of R is the array \( SA \) where each element \( SA [i]\) is equal to (k, j) iff the k-long suffix \(r_{j}[|r_j|-k+1:]\) of the string \(r_{j}\) is the i-th smallest element in the lexicographically ordered set of all suffixes of the strings in R. The Burrows-Wheeler Transform (BWT) of R is the sequence B such that \(B[i]=r_{j}[|r_j|-k]\) if \( SA [i] = (k,j)\) and \(k > 1\), or \(B[i]= \$\) otherwise. Informally, B[i] is the symbol that precedes the k-long suffix of a string \(r_j\), where such a suffix is the i-th smallest in the ordering given by \( SA \). For any string \(\omega \), the suffixes with prefix \(\omega \) appear consecutively in the (lexicographically sorted) \( SA \). Consequently, we define the \(\omega \)-interval [2], denoted by \(q(\omega )\), as the maximal interval [b, e] such that \(b\le e\) and the suffixes \( SA [b]\) and \( SA [e]\) both have prefix \(\omega \). Notice that the width \(e-b+1\) of the \(\omega \)-interval is equal to the number of occurrences of \(\omega \) in the reads of R. Since the BWT B and \( SA \) are closely related, we also say that [b, e] is the \(\omega \)-interval on B. Given an \(\omega \)-interval and a character c, the backward c-extension of the \(\omega \)-interval is the \(c \omega \)-interval.
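To make the backward c-extension concrete, the following C++ fragment is a minimal, self-contained sketch on a toy BWT of the single string GTA$; all names (Interval, occ, smaller, backward_extension) are ours and not part of any existing implementation, and occ is computed here by scanning, whereas an FM-index answers it in constant time with a rank structure.

#include <iostream>
#include <string>

// Half-open interval [b, e) of rows of SA; empty iff b >= e.
struct Interval { long b, e; };

// occ(B, c, i): number of occurrences of c in B[0..i).
static long occ(const std::string& B, char c, long i) {
    long n = 0;
    for (long k = 0; k < i; ++k) if (B[k] == c) ++n;
    return n;
}

// smaller(B, c): number of symbols of B lexicographically smaller than c.
static long smaller(const std::string& B, char c) {
    long n = 0;
    for (char x : B) if (x < c) ++n;
    return n;
}

// Backward c-extension: maps the omega-interval on B to the
// (c omega)-interval, via the standard LF-mapping of the FM-index.
Interval backward_extension(const std::string& B, Interval q, char c) {
    return { smaller(B, c) + occ(B, c, q.b),
             smaller(B, c) + occ(B, c, q.e) };
}

int main() {
    // BWT of the toy string GTA$ (sorted suffixes: $, A$, GTA$, TA$).
    const std::string B = "AT$G";
    Interval a{1, 2};                              // the A-interval
    Interval ta = backward_extension(B, a, 'T');   // the TA-interval
    std::cout << ta.b << " " << ta.e << std::endl; // prints: 3 4
}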

3 The Algorithm

Our algorithm consists of two steps: the first computes the overlap graph, the second removes all transitive arcs. Given a string \(\omega \) and a set R of strings (reads), let \(R^S(\omega )\) and \(R^P(\omega )\) be, respectively, the subsets of R whose elements have suffix (resp. prefix) \(\omega \). As usual in string graph construction algorithms, we will assume that the set R is substring free, i.e., no string is a substring of another. A fundamental observation is that the list of all nonempty overlaps \(\beta \) is a compact representation of the overlap graph, since all pairs in \(R^S(\beta ) \times R^P(\beta )\) are arcs of the overlap graph. Our approach to compute all overlaps between pairs of strings is based on the notion of potential overlap, which is a nonempty string \(\beta ^{*}\in \varSigma ^+\) such that there exists at least one input string \(r_i = \alpha \beta ^{*}\) (\(\alpha \ne \epsilon \)) with suffix \(\beta ^{*}\), and there exists at least one input string \(r_j=\gamma \beta ^{*}\delta \) (\(\delta \ne \epsilon \)) with \(\beta ^{*}\) as a substring (possibly a prefix). The first part of Algorithm 1 (lines 3–11) computes all potential overlaps, starting from those of length 1 and extending the potential overlaps by adding a new leading character. For each potential overlap, we check whether it is an actual overlap. Lemma 1 is a direct consequence of the definition of potential overlap.

Lemma 1

Let \(\beta \) be an overlap. Then all suffixes of \(\beta \) are potential overlaps.

The second part of our algorithm, that is, the detection of all transitive arcs, can be sped up if we cluster together and examine some sets of arcs. We start by considering the set of all arcs sharing the same overlap and a suffix of their extensions, as stated in the following definition.

Definition 2

Assume that \(\alpha , \beta \in \varSigma ^*\), \(\beta \ne \epsilon \), and \(X\subseteq R^P(\beta )\). The arc-set \(ARC(\alpha , \alpha \beta , X)\) is the set \(\{ (r_{1}, r_{2}) : r_{1}\in R, r_{2}\in X, \alpha \beta \text{ is a suffix of } r_{1}, \text{ and } \beta \text{ is a prefix of } r_{2}\}\). The strings \(\alpha \) and \(\beta \) are called the extension and the overlap of the arc-set. The set X is called the destination set of the arc-set.

In other words, an arc-set contains the arcs with overlap \(\beta \) and extension \(\alpha \). An arc-set is terminal if there exists \(r \in R\) such that \(r = \alpha \beta \), while an arc-set is basic if \(\alpha = \epsilon \) (the empty string). Since the arc-set \(ARC(\alpha , \alpha \beta , X)\) is uniquely determined by the strings \(\alpha \), \(\alpha \beta \), and the set X, the triple \((\alpha , \alpha \beta , X)\) encodes the arc-set \(ARC(\alpha , \alpha \beta , X)\). Moreover, the arc-set \(ARC(\alpha , \alpha \beta , X)\) is correct if X includes the destination of every irreducible arc that has overlap \(\beta \) and extension with suffix \(\alpha \), that is, \(X\supseteq \{ r_{2}\in R^P(\beta ): \text{there exists } r_{1}\in R^S(\alpha \beta )\text{ such that } (r_{1}, r_{2}) \text{ is irreducible}\}\). Observe that our algorithm computes only correct arc-sets. Moreover, terminal arc-sets only contain irreducible arcs (Lemma 5). Lemma 3 shows the use of arc-sets to detect transitive arcs. Due to space constraints, all proofs are omitted.
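For instance, on the hypothetical reads \(r_1 = \mathtt{TACGT}\) and \(r_2 = \mathtt{CGTAA}\) of the earlier example, the arc \((r_1, r_2)\) with overlap \(\beta = \mathtt{CGT}\) and extension \(\alpha = \mathtt{TA}\) belongs both to the basic arc-set \(ARC(\epsilon , \mathtt{CGT}, \{r_2\})\) and to the arc-set \(ARC(\mathtt{TA}, \mathtt{TACGT}, \{r_2\})\); the latter is terminal, since \(\mathtt{TACGT} = r_1 \in R\).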

Lemma 3

Let \((r_{1}, r_{2})\) be an arc with overlap \(\beta \). Then \((r_1, r_2)\) is transitive iff (i) there exist \(\alpha , \gamma , \delta , \eta \in \varSigma ^*\), with \(\gamma ,\eta \ne \epsilon \), such that \(r_1 = \gamma \alpha \beta \) and \(r_2 = \beta \delta \eta \), and (ii) there exists an input read \(r_{3}=\alpha \beta \delta \) such that \((r_{3}, r_{2})\) is an irreducible arc of a nonempty arc-set \(ARC(\alpha , \alpha \beta \delta , X)\).

A direct consequence of Lemma 3 is that a nonempty correct terminal arc-set \(ARC(\alpha , \alpha \beta \delta , X)\) implies that all arcs of the form \((\gamma \alpha \beta , \beta \delta \eta )\), with \(\gamma ,\eta \ne \epsilon \), are transitive. Another consequence of Lemma 3 is that an irreducible arc \((\alpha \beta \delta , \beta \delta \eta )\) with extension \(\alpha \) and overlap \(\beta \delta \) reduces all arcs with overlap \(\beta \) and extension \(\gamma \alpha \), with \(\gamma \ne \epsilon \). Lemma 3 is the main ingredient of our algorithm. More precisely, the algorithm computes correct terminal arc-sets of the form \(ARC(\alpha , \alpha \beta \delta , X)\) for extensions \(\alpha \) of increasing length. By Lemma 3, \(ARC(\alpha , \alpha \beta \delta , X)\) contains arcs that reduce all the arcs contained in \(ARC(\alpha , \alpha \beta , X')\) that have a destination in X. Since the transitivity of an arc is related to the extension \(\alpha \) of the arc that is used to reduce it, and our algorithm considers extensions of increasing length, a main consequence of Lemma 3 is that the computed terminal arc-sets are correct, that is, they contain only irreducible arcs. We will further speed up the computation by clustering together the arc-sets sharing the same extension.
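A small hypothetical instance may help. Take \(\alpha = \mathtt{C}\), \(\beta = \mathtt{GT}\), \(\delta = \mathtt{A}\), \(\gamma = \mathtt{T}\), and \(\eta = \mathtt{C}\), so that \(r_1 = \gamma \alpha \beta = \mathtt{TCGT}\), \(r_2 = \beta \delta \eta = \mathtt{GTAC}\), and \(r_3 = \alpha \beta \delta = \mathtt{CGTA}\). The arc \((r_1, r_2)\) has overlap \(\mathtt{GT}\) and extension \(\mathtt{TC}\), while \((r_3, r_2)\) has overlap \(\mathtt{GTA}\) and extension \(\mathtt{C}\): by Lemma 3 the arc \((r_1, r_2)\) is transitive, since the extension \(\mathtt{C}\) of the irreducible arc \((r_3, r_2)\) is a suffix of \(\mathtt{TC}\).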

Definition 4

Let T be a set of arc-sets, and let \(\alpha \) be a string. The cluster of \(\alpha \), denoted by \(C(\alpha )\), is the union of all arc-sets of T whose extension is \(\alpha \).

We sketch Algorithm 1, which consists of two phases: the first phase computes the overlap graph, while the second phase removes all transitive arcs. In our description, we assume that, given a string \(\omega \), we can compute in constant time (1) the number \(\mathsf {suff}{(\omega )} \) of input strings whose suffix is \(\omega \), (2) the number \(\mathsf {pref}{(\omega )} \) of input strings whose prefix is \(\omega \), and (3) the number \(\mathsf {substr}{(\omega )} \) of occurrences of \(\omega \) in the input strings. Moreover, we assume that we can list the set \(\mathsf {listpref}{(\omega )}\) of input strings with prefix \(\omega \) in \(O(|\mathsf {listpref}{(\omega )}|)\) time. In Sect. 4 we will describe such a data structure.

The first phase (lines 3–11) exploits Lemma 1 to compute all overlaps. Potential overlaps are defined inductively. The empty string \(\epsilon \) is a potential overlap of length 0; given an i-long potential overlap \(\beta ^*\), the \((i+1)\)-long string \(c\beta ^*\), for \(c \in \varSigma \), is a potential overlap iff \(\mathsf {suff}{(c\beta ^{*})} > 0\) and \(\mathsf {substr}{(c\beta ^{*})} > \mathsf {suff}{(c\beta ^{*})} \). Our algorithm uses this definition to build potential overlaps of increasing length, starting from those of length 1, i.e., the symbols of \(\varSigma \) (line 2). The lists Last and New store the potential overlaps computed at the previous and at the current iteration, respectively. Observe that a potential overlap \(\beta ^{*}\) is an overlap iff \(\mathsf {pref}{(\beta ^{*})} > 0\). Since a potential overlap is a suffix of some input string, there are at most nm distinct suffixes, where m and n are the length and the number of the input strings, respectively. Each query \(\mathsf {suff}{(\cdot )}, \mathsf {pref}{(\cdot )}, \mathsf {substr}{(\cdot )} \) requires O(1) time, thus the time spent on all such queries is O(nm). Given two distinct strings \(\beta _1\) and \(\beta _2\) with \(|\beta _{1}|=|\beta _{2}|\), no input string can be in both \(\mathsf {listpref}{(\beta _{1})}\) and \(\mathsf {listpref}{(\beta _{2})}\). Since each overlap is at most m long, the overall time spent on the \(\mathsf {listpref}{(\cdot )}\) queries is O(nm). The first phase produces (line 7) the set of disjoint basic arc-sets \(ARC(\epsilon , \beta , R^P(\beta ))\), one for each overlap \(\beta \), whose union is the set of arcs of the overlap graph. Recall that \(\mathsf {listpref}{(\beta )}\) gives the set of reads with prefix \(\beta \), which we have denoted by \(R^P(\beta )\). A toy re-implementation of this phase is sketched below.
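The following self-contained C++ toy (our own sketch, not the FSG code) enumerates all overlaps exactly as described above, but replaces the constant-time FM-index queries suff, pref, and substr with naive scans over three hypothetical reads; sentinels are omitted for brevity.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Naive stand-ins for the O(1) FM-index queries of the text.
static long suff(const std::vector<std::string>& R, const std::string& w) {
    long n = 0;
    for (const auto& r : R)
        if (r.size() >= w.size() &&
            r.compare(r.size() - w.size(), w.size(), w) == 0) ++n;
    return n;
}
static long pref(const std::vector<std::string>& R, const std::string& w) {
    long n = 0;
    for (const auto& r : R)
        if (r.size() >= w.size() && r.compare(0, w.size(), w) == 0) ++n;
    return n;
}
static long substr(const std::vector<std::string>& R, const std::string& w) {
    long n = 0;
    for (const auto& r : R)
        for (size_t i = 0; i + w.size() <= r.size(); ++i)
            if (r.compare(i, w.size(), w) == 0) ++n;
    return n;
}

int main() {
    const std::vector<std::string> R = {"TACGT", "CGTAA", "GTAAC"};  // toy reads
    const std::string sigma = "ACGT";
    // Length-1 potential overlaps (line 2 of Algorithm 1).
    std::vector<std::string> last;
    for (char c : sigma) {
        const std::string w(1, c);
        if (suff(R, w) > 0 && substr(R, w) > suff(R, w)) last.push_back(w);
    }
    // Extend potential overlaps by one leading character per round.
    while (!last.empty()) {
        std::vector<std::string> next;
        for (const auto& b : last) {
            if (pref(R, b) > 0)  // a potential overlap is an actual overlap
                std::cout << "overlap: " << b << "\n";
            for (char c : sigma) {
                const std::string cb = c + b;
                if (suff(R, cb) > 0 && substr(R, cb) > suff(R, cb))
                    next.push_back(cb);
            }
        }
        last = std::move(next);
    }
}

On these toy reads the sketch prints, among others, the overlaps CGT (between TACGT and CGTAA) and GTAA (between CGTAA and GTAAC).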

The second phase (lines 13–25) classifies the arcs of the overlap graph as reducible or irreducible by computing arc-sets of increasing extension length, starting from the basic arc-sets \(ARC(\epsilon , \epsilon \beta , R^P(\beta ))\) obtained in the previous phase. By Lemma 3, we compute all correct terminal arc-sets \(ARC(\alpha ,\alpha \beta ,X)\) and remove all arcs that are reduced by \(ARC(\alpha ,\alpha \beta ,X)\). The set Rdc is used to store the destination sets X of the computed terminal arc-sets. Notice that if \(ARC(\alpha ,\alpha \beta ,X)\) is terminal, then all of its arcs have the same origin \(r=\alpha \beta \), i.e., \(ARC(\alpha ,\alpha \beta ,X) = \{(r,x): x\in X\}\). By Lemma 3, all arcs in the cluster \(C(\alpha )\) with a destination in X and with an origin different from r are transitive and can be removed, simply by removing X from all destination sets in the arc-sets of \(C(\alpha )\). Another application of Lemma 3 is that when we find a terminal arc-set, all of its arcs are irreducible, i.e., the arc-set is also correct. In fact, Lemma 3 classifies an arc as transitive according to the existence of a read \(r=\alpha \beta \) with extension \(\alpha \). Since the algorithm considers extensions \(\alpha \) of increasing length, all arcs whose extension is shorter than \(\alpha \) have been reduced in a previous step, thus all terminal arc-sets of previous iterations contain only irreducible arcs.

More precisely, the test at line 18 is true iff the current arc-set is terminal. In that case, at line 19 all arcs of the arc-set are output as arcs of the string graph, and at line 20 the destination set X is added to the set Rdc that contains the destinations of \(C(\alpha )\) that must be removed. For each cluster \(C(\alpha )\), we read all of its arc-sets twice: the first time to determine which arc-sets are terminal and, in that case, to build the set Rdc of reads that must be removed from all destination sets of the arc-sets included in \(C(\alpha )\); the second time to compute the clusters \(C(c\alpha )\) that contain the nonempty arc-sets with extension \(c\alpha \), consisting of the arcs that we still have to test for transitivity (that is, the arcs with destination set \(X\setminus Rdc\)). A toy re-implementation of this phase is sketched right after Algorithm 1.

In Algorithm 1, the cluster \(C(\alpha )\) that is currently analyzed is stored in CurrentCluster, which is a list of the arc-sets included in the cluster. Moreover, the clusters that still have to be analyzed are stored in the stack Clusters. We use a stack to guarantee that the clusters are analyzed in the correct order, that is, the cluster \(C(\alpha )\) is analyzed after all clusters \(C(\alpha [i:])\), where \(\alpha [i:]\) is a generic suffix of \(\alpha \). We can prove that a generic irreducible arc \((r_1, r_2)\) with extension \(\alpha \) and overlap \(\beta \) belongs exactly to the clusters \(C(\epsilon ), \ldots , C(\alpha [2:]), C(\alpha ) \). Moreover, \(r_2\) does not belong to the set Rdc when considering \(C(\epsilon ), \ldots , C(\alpha [2:])\), hence the arc \((r_1, r_2)\) is correctly output when considering the cluster \(C(\alpha )\). The lemmas leading to the correctness of the algorithm follow.

Lemma 5

Let \(ARC(\alpha ,\alpha \beta ,X)\) be an arc-set inserted into a cluster by Algorithm 1. Then such arc-set is correct.

Lemma 6

Let \(e_1\) be a transitive arc \((r_1, r_2)\) with overlap \(\beta \). Then the algorithm does not output \(e_1\).

Theorem 7

Given as input a set of strings R, Algorithm 1 computes exactly the arcs of the string graph.

Algorithm 1. (pseudocode of FSG; its line numbers are referenced in the text)
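As a companion to the description above, the following self-contained C++ toy (our own sketch with hypothetical names, not the FSG code) mimics the second phase on the reads and basic arc-sets of the first-phase sketch; naive string scans replace the FM-index, and clusters are processed by increasing extension length, which is the order that the stack of Algorithm 1 enforces.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using ReadSet = std::set<std::string>;

struct ArcSet {
    std::string alphabeta;  // extension plus overlap
    ReadSet X;              // destination set
};

static bool hasSuffix(const std::string& r, const std::string& w) {
    return r.size() >= w.size() &&
           r.compare(r.size() - w.size(), w.size(), w) == 0;
}

int main() {
    const std::vector<std::string> R = {"TACGT", "CGTAA", "GTAAC"};
    const std::string sigma = "ACGT";
    // Basic arc-sets ARC(eps, beta, R^P(beta)), one per overlap beta
    // found by the first-phase sketch; clusters are keyed by alpha.
    std::map<std::string, std::vector<ArcSet>> clusters;
    clusters[""] = {{"C", {"CGTAA"}}, {"T", {"TACGT"}}, {"GT", {"GTAAC"}},
                    {"CGT", {"CGTAA"}}, {"GTAA", {"GTAAC"}}};
    while (!clusters.empty()) {
        std::map<std::string, std::vector<ArcSet>> next;  // extensions one longer
        for (const auto& [alpha, arcsets] : clusters) {
            // First pass: terminal arc-sets (alphabeta is itself a read)
            // yield irreducible arcs; their destinations fill Rdc.
            ReadSet rdc;
            for (const ArcSet& a : arcsets) {
                if (std::find(R.begin(), R.end(), a.alphabeta) == R.end()) continue;
                for (const auto& x : a.X)
                    if (x != a.alphabeta)  // skip trivial self-arcs of the toy
                        std::cout << a.alphabeta << " -> " << x << "\n";
                rdc.insert(a.X.begin(), a.X.end());
            }
            // Second pass: arcs with destination in Rdc are transitive;
            // the surviving ones are extended by one leading character.
            for (const ArcSet& a : arcsets) {
                ReadSet rest;
                std::set_difference(a.X.begin(), a.X.end(),
                                    rdc.begin(), rdc.end(),
                                    std::inserter(rest, rest.begin()));
                if (rest.empty()) continue;
                for (char c : sigma) {
                    const std::string cab = c + a.alphabeta;
                    // ARC(c alpha, c alphabeta, rest) is nonempty iff some
                    // read has suffix c alphabeta.
                    if (std::any_of(R.begin(), R.end(), [&](const std::string& r) {
                            return hasSuffix(r, cab); }))
                        next[c + alpha].push_back({cab, rest});
                }
            }
        }
        clusters = std::move(next);
    }
}

On this input the toy outputs exactly the irreducible arcs CGTAA -> GTAAC, TACGT -> CGTAA, and GTAAC -> CGTAA, while the transitive arc TACGT -> GTAAC (reduced by the terminal arc-set with origin CGTAA) is never printed.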

We can now sketch the time complexity of the second phase. Previously, we have shown that the first phase produces O(nm) arc-sets, one for each distinct overlap \(\beta \). Since each string \(\alpha \beta \) considered in the second phase is a suffix of an input string, and there are at most nm such suffixes, at most nm arc-sets are considered in the second phase. In the second phase, a set Rdc is computed for each cluster. If Rdc is empty, then each arc-set of the cluster can be examined in constant time, since all unions at line 20 are trivially empty and at line 25 the set \(X \setminus Rdc\) is equal to X, therefore no actual operation must be performed. The interesting case is when \(X\ne \varnothing \) for some terminal arc-set. In that case the union at line 20 and the difference \(X \setminus Rdc\) at line 25 are computed. Let d(n) be the time complexity of those two operations on n-element sets (the actual time complexity depends on the data structure used). Notice that X is nonempty only if we have found an irreducible arc, that is, an arc of the string graph. Overall, there can be at most |E| nonempty such sets X, where E is the set of arcs of the string graph. Hence, the time complexity of the entire algorithm is \(O(nm+|E|d(n))\).

4 Data Representation

Our algorithm operates entirely on the (potentially compressed) FM-index of the collection of input reads. Indeed, each processed string \(\omega \) (both in the first and in the second phase) can be represented in constant space by the \(\omega \)-interval \([b_{\omega }, e_{\omega }]\) on the BWT (i.e., \(\textsf {q}{(\omega )} \)), instead of using the naïve representation taking \(O(|\omega |)\) space. Notice that in the first phase, the i-long potential overlaps of a given iteration are obtained by prepending a symbol \(c \in \varSigma \) to the \((i-1)\)-long potential overlaps of the previous iteration (lines 8–10). The arc-sets of increasing extension length are computed in the same way in the second phase. In other words, our algorithm in general needs to obtain the string \(c \omega \) from the string \(\omega \), and, since we represent strings as intervals on the BWT, this operation can be performed in O(1) time via the backward c-extension of the interval \(\textsf {q}{(\omega )} \) [14].

Moreover, both queries \(\mathsf {pref}{(\omega )} \) and \(\mathsf {substr}{(\omega )} \) can be answered in O(1) time. In fact, given \(\textsf {q}{(\omega )} = [b_\omega , e_\omega ]\), we have \(\mathsf {substr}{(\omega )} =e_\omega - b_\omega + 1\) and \(\mathsf {pref}{(\omega )} =e_{\$\omega } - b_{\$\omega } + 1\), where \(\textsf {q}{(\$\omega )} = [b_{\$\omega }, e_{\$\omega }]\) is the result of the backward \(\$\)-extension of \(\textsf {q}{(\omega )} \). Similarly, it is easy to compute \(\mathsf {listpref}{(\omega )}\), as it corresponds to the set of reads with a suffix in the interval \(\textsf {q}{(\$\omega )} \) of the GSA. The interval \(\textsf {q}{(\omega \$)} = [b_{\omega \$}, e_{\omega \$}]\) allows us to answer the query \(\mathsf {suff}{(\omega )} \), which is computed as \(e_{\omega \$} - b_{\omega \$} + 1\). The interval \(\textsf {q}{(\omega \$)} \) is maintained along with \(\textsf {q}{(\omega )} \). Moreover, since \(\textsf {q}{(\omega \$)} \) and \(\textsf {q}{(\omega )} \) share the lower extreme \(b_{\omega }=b_{\omega \$}\) (recall that \(\$\) is the smallest symbol), each string \(\omega \) can be compactly represented by the three integers \(b_\omega , e_{\omega \$}, e_\omega \) (see the sketch at the end of this section).

While in our algorithm any substring \(\omega \) of an input read could be represented by those three integers, we exploit the following representations for greater efficiency. In the first phase of the algorithm we mainly have to represent the set of potential overlaps. At each iteration, the potential overlaps in Last (New, resp.) have the same length, hence their corresponding intervals on the BWT are disjoint. Therefore we can store those intervals using a pair of \(n(m+1)\)-long bitvectors: for each potential overlap \(\beta \in Last\) (New, resp.) represented by the \(\beta \)-interval \([b_{\beta }, e_{\beta }]\), the first bitvector has 1 in position \(b_\beta \), and the second bitvector has 1 in positions \(e_{\beta \$}\) and \(e_\beta \). Recall that we also want to maintain the interval \(q(\beta \$)=[b_{\beta }, e_{\beta \$}]\). Since \(\mathsf {substr}{(\beta )} > \mathsf {suff}{(\beta )} \), we have \(e_{\beta \$}\ne e_\beta \), so both values can be stored in the same bitvector.

In the second phase of the algorithm, we mainly represent clusters. A cluster groups together arc-sets whose overlaps are pairwise different or such that one is a prefix of the other; thus, the corresponding intervals on the BWT are disjoint or nested. Moreover, the destination sets of the basic arc-sets can also be represented by a set of pairwise disjoint or nested intervals on the BWT (since \(\mathsf {listpref}{(\beta )}\) of line 7 corresponds to the interval \(\textsf {q}{(\$\beta )} \)). Furthermore, the loop at lines 13–25 preserves the following invariant: if \(ARC(\alpha , \alpha \beta _1, X_1)\) and \(ARC(\alpha , \alpha \beta _2, X_2)\) are two arc-sets of the same cluster \(C(\alpha )\) with \(\beta _1\) a prefix of \(\beta _2\), then \(X_2 \subseteq X_1\). Hence, each subset of arc-sets whose extensions plus overlaps share a common nonempty prefix \(\gamma \) is represented by means of three vectors: two integer vectors \(V_b, V_e\) of length \(e_{\gamma } - b_{\gamma } +1\) and a bitvector \(B_x\) of length \(e_{\$\gamma } - b_{\$\gamma } +1\), where \([b_{\gamma }, e_{\gamma }]\) is the \(\gamma \)-interval and \([b_{\$ \gamma }, e_{\$\gamma }]\) is the \(\$\gamma \)-interval. More specifically, \(V_b[i]\) (\(V_e[i]\), resp.) is the number of arc-sets whose representation (BWT interval) of the overlap starts (ends, resp.) at \(b_\gamma + i\), while \(B_x[i]\) is 1 iff the read at position \(b_{\$\gamma } + i\), in the lexicographic order of the GSA, belongs to the destination set of all the arc-sets. As a consequence, the number of backward extensions performed by Algorithm 1 is O(nm), while SGA always performs \(\varTheta (nm)\) extensions.
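To fix ideas, the following C++ fragment (a sketch with hypothetical names, not the FSG code) shows the three-integer representation of a string \(\omega \) and the O(1) queries it supports directly; as in the text, intervals are on the lexicographically sorted GSA, 1-based and inclusive.

struct CompactString {
    // The three integers of Sect. 4: since $ is the smallest symbol,
    // q(omega $) = [b, e_dollar] and q(omega) = [b, e] share the
    // lower extreme b.
    long b;         // b_omega = b_omega$
    long e_dollar;  // e_omega$
    long e;         // e_omega

    long substr() const { return e - b + 1; }         // occurrences of omega
    long suff()   const { return e_dollar - b + 1; }  // reads with suffix omega
    // pref(omega) additionally needs the backward $-extension of [b, e].
};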

5 Experimental Analysis

A C++ implementation of our approach, called FSG (short for Fast String Graph), has been integrated into the SGA suite and is available at http://fsg.algolab.eu under the GPLv3 license. We have evaluated the performance of FSG on a standard benchmark of 875 million 101 bp reads sequenced from individual NA12878 of the International HapMap and 1000 Genomes projects, comparing the running time of FSG with that of SGA. We have run SGA with its default parameters, that is, SGA computes exact overlaps after having corrected the input reads. Since the string graphs computed by FSG and SGA are the same, we have not compared the entire pipeline, but only the string graph construction phase. We could not compare FSG with Fermi, since Fermi does not split its steps in a way that makes it possible to isolate the running time of the string graph construction: most notably, it includes read correction and scaffolding.

Especially on the DNA alphabet, short overlaps between reads may occur by chance. Hence, for genome assembly purposes, only overlaps whose length exceeds a user-defined threshold are considered. The minimum overlap length that empirically gives the best results in terms of genome assembly quality is around 75% of the read length [24]. To assess how the graph size affects performance, different values of the minimum overlap length (called \(\tau \)) between reads have been used (clearly, the lower this value, the larger the graph). The minimum overlap lengths used in this experimental assessment are 55, 65, 75, and 85, hence the chosen values test the approaches also on larger-than-normal (\(\tau =55\)) and smaller-than-normal (\(\tau =85\)) string graphs. Another aspect that we wanted to measure is the scalability of FSG: we have run the programs with 1, 4, 8, 16, and 32 threads. In all cases, we have measured the elapsed (wall-clock) time and the total CPU time (the time during which a CPU has been working). All experiments have been performed on an Ubuntu 14.04 server with four 8-core Intel® Xeon E5-4610v2 2.30 GHz CPUs. The server has a NUMA architecture with 64 GiB of RAM per node (256 GiB in total).

Table 1. Comparison of FSG and SGA, for different minimum overlap lengths and numbers of threads. The wall-clock time is the time used to compute the string graph. The CPU time is the overall execution time over all CPUs actually used.

Table 1 summarizes the running times of both approaches on the different configurations of the parameters. Notice that the FSG approach is from 2.3 to 4.8 times faster than SGA in terms of wall-clock time, and from 1.9 to 3 times faster in terms of CPU time. On the other hand, FSG uses approximately 2.2 times the memory used by SGA (on the executions with at most 8 threads). On a larger number of threads, the fact that the elapsed time of FSG on 32 threads is larger than that on 16 threads suggests that, in its current form, FSG might not be suitable for a large number of threads. However, since the current implementation of FSG is almost a proof of concept, future improvements to its codebase and a better analysis of the race conditions of our tool will likely lead to better performance with a large number of threads. Furthermore, notice that even SGA, which is (almost) embarrassingly parallel and has a stable implementation, does not achieve a speed-up better than 6.4 with 32 threads. As such, a factor that likely contributes to the poor scaling behaviour of both FSG and SGA could also be the NUMA architecture of the server used for the experimental analysis, which makes different-unit memory accesses more expensive (in our case, the processors in each unit can manage at most 16 logical threads, and only 8 on physical cores).

FSG uses more memory than SGA since genome assemblers must correctly manage reads extracted from both strands of the genome. In our case, this fact has been addressed by adding the reverse complement of each read to the set of strings on which the FM-index is built, hence immediately doubling the size of the FM-index. Moreover, FSG needs some additional data structures to correctly maintain potential overlaps and arc-sets: two pairs of \(n(m + 1)\)-long bitvectors and the combination of two (usually small) integer vectors and a bitvector of the same size. Our experimental evaluation shows that the memory required by the latter is usually negligible, hence a better implementation of the four bitvectors could decrease the memory use. In any case, the main goal of FSG is to improve the running time, not the memory use.

The combined analysis of the CPU time and of the wall-clock time on at most 8 threads (the number of physical cores of each CPU on our server) suggests that FSG is more CPU efficient than SGA and is able to better distribute the workload across the threads. In our opinion, this greater efficiency is achieved by operating only on the FM-index of the input reads and by the order in which the extension operations (i.e., considering a new string \(c\alpha \) after \(\alpha \) has been processed) are performed. These two characteristics of our algorithm allow us to eliminate the redundant queries to the index that are instead performed by SGA. In fact, FSG considers each string that is longer than the threshold at most once, while SGA potentially reconsiders the same string once for each read in which it occurs. Indeed, FSG uses 2.3–3 times less user time than SGA when \(\tau = 55\) (hence, when such sufficiently long substrings occur more frequently) and "only" 2–2.6 times less user time when \(\tau = 85\) (hence, when such sufficiently long substrings are rarer).

6 Conclusions and Future Work

We have presented FSG, a tool implementing a new algorithm for constructing a string graph that works by directly querying an FM-index representing a collection of reads, instead of processing the input reads themselves. Our main goal is to provide a simple and fast algorithm to construct string graphs, so that its implementation can be easily integrated into an assembly pipeline that analyzes the paths of the string graph to produce the final assembly. Indeed, FSG could also be used for related purposes, such as transcriptome assembly [5, 16] and haplotype assembly [6]. These topics are some of the research directions that we plan to investigate.