
1 Introduction

The NIH Sequence Read Archive [8] currently contains over 26 petabases of sequence data. The increasing use of sequence-based assays in research and clinical settings creates a high computational processing burden; metagenomics studies generate even larger sequencing datasets [17, 19]. New computational ideas are essential to manage and analyze these data. To this end, researchers have turned to k-mer-based approaches to index datasets more efficiently [7].

Minimizer techniques were introduced to select k-mers from a sequence to allow efficient binning of sequences such that some information about the sequence’s identity is preserved [18]. Formally, given a sequence of length L and an integer k, its minimizer is the lexicographically smallest k-mer in it. The method has two key advantages: k-mers selected from overlapping windows are close to one another; and similar sequences select similar k-mers. Minimizers were adopted for biological sequence analysis to design algorithms that are more efficient in both memory usage and runtime by reducing the amount of information processed, while losing little or no information [12]. The minimizer method has been applied in a large number of settings [4, 6, 20].
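As a toy illustration (a minimal Python sketch of our own; the function names are not from any cited tool), the minimizer of every L-long window can be computed directly from the definition. Note how overlapping windows often select the same k-mer, so selected positions are close to one another.

```python
def minimizer(window, k):
    """Lexicographically smallest k-mer in a single window."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))

def window_minimizers(seq, k, L):
    """Minimizer of each L-long window of seq."""
    return [minimizer(seq[i:i + L], k) for i in range(len(seq) - L + 1)]

# Overlapping windows of an 8-long sequence with k = 3, L = 6:
print(window_minimizers("ACGTACGT", 3, 6))  # ['ACG', 'ACG', 'ACG']
```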

Orenstein and Pellow et al. [14, 15] generalized and improved upon the minimizer idea by introducing the notion of a universal hitting set (UHS). For integers k and L, a set \(U_{k,L}\) is called a universal hitting set of k-mers if every possible sequence of length L contains at least one k-mer from \(U_{k,L}\). Note that a UHS for any given k and L only needs to be computed once. Their heuristic DOCKS finds a small UHS in two steps: (i) remove a minimum-size set of vertices from a complete de Bruijn graph of order k to make it acyclic; and (ii) remove additional vertices to eliminate all \((L-k)\)-long paths. The removed vertices comprise the UHS. The first step is solved optimally, while the second requires a heuristic. The method is limited by runtime to \(k\le 13\), and is thus applicable to only a small subset of minimizer scenarios. Recently, Marçais et al. [10] showed that there exists an algorithm to compute a set of k-mers that covers every path of length L in a de Bruijn graph of order k. This algorithm gives an asymptotically optimal solution as k approaches L. Yet this condition rarely holds in real applications, where \(10\le k \le 30\) and \(100 \le L \le 300\). The results of Marçais et al. show that for \(k \le 30\) and fixed L, the computed sets are far from optimal. A more recent method by DeBlasio et al. [3] can handle larger values of k, but only with \(L\le 21\), which is impractical for real applications. Thus, it is still desirable to devise faster algorithms to generate small UHSs.

Here, we present PASHA (Parallel Algorithm for Small Hitting set Approximation), the first randomized parallel algorithm to efficiently generate near-optimal UHSs. Our novel algorithmic contributions are twofold. First, we improve upon the process of calculating vertex hitting numbers, i.e. the number of \((L-k)\)-long paths passing through each vertex. Second, we build upon a randomized parallel algorithm for Set Cover to substantially speed up the removal of k-mers for the UHS (the major time-limiting step) with a guaranteed approximation ratio on the k-mer set size. PASHA performs substantially better than current algorithms at finding a UHS in terms of runtime, with only a small increase in set size; it is consequently applicable to much larger values of k. Software and computed sets are available at: pasha.csail.mit.edu and github.com/ekimb/pasha.

2 Background and Preliminaries

Preliminary Definitions

For \(k \ge 1\) and finite alphabet \(\varSigma \), directed graph \(B_k = (V, E)\) is a de Bruijn graph of order k if V and E represent k- and \((k+1)\)-long strings over \(\varSigma \), respectively. An edge may exist from vertex u to vertex v if the \((k-1)\)-suffix of u is the \((k-1)\)-prefix of v. For any edge \((u, v) \in E\) with label \(\mathcal {L}\), labels of vertices u and v are the prefix and suffix of length k of \(\mathcal {L}\), respectively. If a de Bruijn graph contains all possible edges, it is complete, and the set of edges represents all possible \((k+1)\)-mers. An \(\ell =(L-k)\)-long path in the graph, i.e. a path of \(\ell \) edges, represents an L-long sequence over \(\varSigma \) (for further details, see [1]).
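For concreteness, the following small sketch (ours, for illustration only) builds the complete de Bruijn graph of order k over \(\varSigma = \{A, C, G, T\}\): every \((k+1)\)-mer labels one edge from its k-prefix to its k-suffix, and an \(\ell \)-edge path spells an \(L = k + \ell \)-long sequence.

```python
from itertools import product

def complete_debruijn(k, sigma="ACGT"):
    """Complete de Bruijn graph of order k: vertices are all k-mers,
    and each (k+1)-mer labels the edge from its k-prefix to its k-suffix."""
    vertices = ["".join(p) for p in product(sigma, repeat=k)]
    edges = [(e[:k], e[1:]) for e in ("".join(p) for p in product(sigma, repeat=k + 1))]
    return vertices, edges

V, E = complete_debruijn(2)
print(len(V), len(E))  # 16 vertices (2-mers), 64 edges (all 3-mers over ACGT)
```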

For any L-long string s over \(\varSigma \), k-mer set M hits s if there exists a k-mer in M that is a contiguous substring in s. Consequently, universal hitting set (UHS) \(U_{k, L}\) is a set of k-mers that hits any L-long string over \(\varSigma \). A trivial UHS is the set of all k-mers, but due to its size (\(|\varSigma |^k\)), it does not reduce the computational expense for practical use. Note that a UHS for any given k and L does not depend on a dataset, but rather needs to be computed only once.
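The hitting relation is simple to state in code (a trivial sketch of ours); a set \(U_{k,L}\) is universal precisely when this check succeeds for every possible L-long string over \(\varSigma \).

```python
def hits(kmer_set, s, k):
    """True iff some k-mer in kmer_set occurs as a contiguous substring of s."""
    return any(s[i:i + k] in kmer_set for i in range(len(s) - k + 1))

print(hits({"ACG", "TTT"}, "GGGACGGG", 3))  # True: the string contains ACG
```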

Although no hardness results are known for the problem of computing a minimum-size universal hitting set, several NP-hard problems are closely related to it. In particular, it is highly similar, although not identical, to the (k, L)-hitting set problem, which asks for a minimum-size k-mer set that hits an input set of L-long sequences. Orenstein and Pellow et al. [14, 15] proved that the (k, L)-hitting set problem is NP-hard, and consequently developed the near-optimal DOCKS heuristic. DOCKS relies on the Set Cover problem: given a collection of subsets \(S_1, \ldots, S_n\) of a finite set U, find a minimum-size subcollection whose union is U.

The DOCKS Heuristic

DOCKS first removes from a complete de Bruijn graph of order k a decycling set, turning the graph into a directed acyclic graph (DAG). This set of vertices represents a set of k-mers that hits all sequences of infinite length. A minimum-size decycling set can be found by Mykkeltveit’s algorithm [13] in \(O(|\varSigma |^k)\) time. Even after all cycles, which represent sequences of infinite length, are removed from the graph, there may still be paths representing sequences of length L, which also need to be hit by the UHS. DOCKS therefore removes an additional set of k-mers that hits all remaining sequences of length L, so that no path representing an L-long sequence, i.e. a path of length \(\ell =L-k\), remains in the graph.

However, finding a minimum-size set of vertices to cover all paths of length \(\ell \) in a directed acyclic graph (DAG) is NP-hard [16]. In order to find a small, but not necessarily minimum-size, set of vertices to cover all \(\ell \)-long paths, Orenstein and Pellow et al. [14, 15] introduced the notion of a hitting number, the number of \(\ell \)-long paths containing vertex v, denoted by \(T(v, \ell )\). DOCKS uses the hitting number to prioritize removal of vertices that are likely to cover a large number of paths in the graph. This, in fact, is an application of the greedy method for the Set Cover problem, thus guaranteeing an approximation ratio of \(O(1 + \log (\max _v T(v,\ell )))\) on the removal of additional k-mers.

The hitting numbers for all vertices can be computed efficiently by dynamic programming: For any vertex v and \(0 \le i \le \ell \), DOCKS calculates the number of i-long paths starting at v, D(v, i), and the number of i-long paths ending at v, F(v, i). Then, the hitting number is directly computable by

$$\begin{aligned} T(v, \ell ) = \sum _{i = 0}^{\ell } F(v, i) \cdot D(v, \ell - i) \end{aligned}$$
(1)

and the dynamic programming calculation in graph \(G=(V',E')\) is given by

$$\begin{aligned} \begin{array}{l} \forall v \in V', \ D(v, 0) = F(v, 0) = 1\\ D(v, i) = \sum \nolimits _{(v, u) \in E'} D(u, i-1)\\ F(v, i) = \sum \nolimits _{(u, v) \in E'} F(u, i-1) \end{array} \end{aligned}$$
(2)
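As a hedged, toy-scale illustration of Eqs. (1) and (2) (our own Python sketch, not the DOCKS implementation), the path counts D and F and the hitting numbers can be computed over an adjacency-list DAG as follows.

```python
def hitting_numbers(vertices, edges, ell):
    """Compute D(v, i) and F(v, i) by Eq. (2), then T(v, ell) by Eq. (1)."""
    out_adj = {v: [] for v in vertices}
    in_adj = {v: [] for v in vertices}
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)
    # D[v][i]: number of i-long paths starting at v; F[v][i]: ending at v.
    D = {v: [1] + [0] * ell for v in vertices}
    F = {v: [1] + [0] * ell for v in vertices}
    for i in range(1, ell + 1):
        for v in vertices:
            D[v][i] = sum(D[u][i - 1] for u in out_adj[v])
            F[v][i] = sum(F[u][i - 1] for u in in_adj[v])
    return {v: sum(F[v][i] * D[v][ell - i] for i in range(ell + 1)) for v in vertices}

# Tiny DAG a -> b -> c with ell = 2: each vertex lies on the single 2-edge path.
print(hitting_numbers(["a", "b", "c"], [("a", "b"), ("b", "c")], 2))  # {'a': 1, 'b': 1, 'c': 1}
```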

Overall, DOCKS performs two main steps: First, it finds and removes a minimum-size decycling set, turning the graph into a DAG. Then, it iteratively removes vertex v with the largest hitting number \(T(v, \ell )\) until there are no \(\ell \)-long paths in the graph. DOCKS is sequential: In each iteration, one vertex with the largest hitting number is removed and added to the UHS output, and the hitting numbers are recalculated. Since the first phase of DOCKS is solved optimally in polynomial time, the bottleneck of the heuristic lies in the removal of the remaining set of k-mers to cover all paths of length \(\ell =L-k\) in the graph, which represent all remaining sequences of length L.

As an additional heuristic, Orenstein and Pellow et al. [14, 15] developed DOCKSany, which has a similar structure to DOCKS, but instead of removing the vertex that hits the most \((L-k)\)-long paths, it removes in each iteration a vertex that hits the most paths of any length. This reduces the runtime by a factor of L, as calculating the hitting number T(v) for each vertex can be done in linear time with respect to the size of the graph. DOCKSanyX extends DOCKSany by removing X vertices with the largest hitting numbers in each iteration. DOCKSany and DOCKSanyX run faster than DOCKS, but the resulting hitting sets are larger.

3 Methods

Overview of the Algorithm. Similar to DOCKS, PASHA is run in two phases: First, a minimum-size decycling set is found and removed; then, an additional set of k-mers that hits remaining L-long sequences is removed. The removal of the decycling set is identical to that of DOCKS; however, in PASHA we introduce randomization and parallelization to efficiently remove the additional set of k-mers. We present two novel contributions to efficiently parallelize and randomize the second phase of DOCKS. The first contribution leads to a faster calculation of hitting numbers, thus reducing the runtime of each iteration. The second contribution leads to selecting multiple vertices for removal at each iteration, thus reducing the number of iterations to obtain a graph with no \((L-k)\)-long paths. Together, the two contributions provide orthogonal improvements in runtime.

Improved Hitting Number Calculation

Memory Usage Improvements. We reduce memory usage through algorithmic and technical advances. Instead of storing the number of i-long paths for \(0 \le i \le \ell \) in both F and D, we apply the following approach (Algorithm 1): We compute D for all \(v \in V\) and \(0 \le i \le \ell \). Then, while computing the hitting numbers, we calculate F one iteration at a time. For this aim, we define two arrays, \(F_{curr}\) and \(F_{prev}\), storing i-long path counts for each vertex only for the current and previous iterations. At iteration i, we compute \(F_{curr}\) based on \(F_{prev}\), add \(F_{curr}(v) \cdot D(v, \ell - i)\) to the hitting number sum of each vertex, and set \(F_{prev} = F_{curr}\). We then increase i and repeat the procedure, accumulating the hitting numbers iteratively. This approach replaces matrix F, which originally consists of \(\ell +1\) arrays, by just two arrays, \(F_{curr}\) and \(F_{prev}\). Therefore, we reduce memory usage by close to half, with no change in runtime.
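Below is a minimal sketch of this two-array scheme (our own illustration of the description above, not PASHA’s code): D is kept in full, while F is rolled through \(F_{prev}\) and \(F_{curr}\) and folded into the hitting-number sums on the fly.

```python
def hitting_numbers_lowmem(vertices, edges, ell):
    """Hitting numbers with matrix F replaced by two arrays (F_prev, F_curr)."""
    out_adj = {v: [] for v in vertices}
    in_adj = {v: [] for v in vertices}
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)
    # Full D matrix: D[v][i] = number of i-long paths starting at v.
    D = {v: [1] + [0] * ell for v in vertices}
    for i in range(1, ell + 1):
        for v in vertices:
            D[v][i] = sum(D[u][i - 1] for u in out_adj[v])
    # Roll F through two arrays, accumulating T(v, ell) as we go.
    T = {v: D[v][ell] for v in vertices}          # i = 0 term: F(v, 0) = 1
    F_prev = {v: 1 for v in vertices}
    for i in range(1, ell + 1):
        F_curr = {v: sum(F_prev[u] for u in in_adj[v]) for v in vertices}
        for v in vertices:
            T[v] += F_curr[v] * D[v][ell - i]
        F_prev = F_curr
    return T   # matches the full-matrix computation, storing only two F arrays
```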

To further reduce memory usage, we use the float type (4 bytes) instead of the double type (8 bytes). The number of paths stored in F and D grows exponentially with i, the length of the paths; to accommodate these large counts within float’s 8-bit exponent field, we initialize F and D to the minimum positive float value. This does not disturb the correctness of the algorithm, as the path counts are only scaled by an arbitrary unit value, which may be \(2^{-149}\), the smallest positive value representable by a float. The remaining main memory bottleneck is matrix D, whose size is \(4\cdot 4^{k}\cdot (\ell +1)\) bytes.

Lastly, we utilize the fact that a complete de Bruijn graph of order k is the line graph of the complete de Bruijn graph of order \(k-1\). While all k-mers are represented as the set of vertices in the graph of order k, they are represented as edges in the graph of order \(k-1\). By removing edges of a de Bruijn graph of order \(k-1\), instead of vertices in a graph of order k, we reduce memory usage by another factor of \(|\varSigma |\). In our implementation we compute D and F for all vertices of a graph of order \(k-1\), and calculate hitting numbers for its edges. Thus, the memory bottleneck is reduced to \(4 \cdot 4^{k-1}\cdot (\ell +1)\) bytes.

Algorithm 1 (pseudocode)

Runtime Reduction by Parallelization. We parallelize the calculation of the hitting numbers to achieve a constant-factor reduction in runtime. The calculation of i-long paths through vertex v depends only on the previously calculated counts of \((i-1)\)-long paths through the vertices adjacent to v (Eq. 2). Therefore, for any fixed i, we can compute D(v, i) and F(v, i) for all vertices in \(V'\) in parallel, where \(V'\) is the set of vertices left after the removal of the decycling set. Similarly, we can calculate the hitting number \(T(v, \ell )\) for all vertices in \(V'\) in parallel, since it does not depend on the hitting number of any other vertex (we call this parallel variant PDOCKS for the purpose of comparison with PASHA). We note that for DOCKSany and DOCKSanyX, the hitting numbers cannot be computed in parallel, since the numbers of paths starting and ending at each vertex depend on those of the preceding vertices in topological order.
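The key observation is that level i of the recurrence depends only on level \(i-1\), so all vertices at a level can be processed independently. The following hedged sketch expresses this in vectorized form (a dense adjacency matrix is only feasible for toy graphs and stands in for distributing vertices across threads; in the actual graphs, adjacency is implicit in the de Bruijn structure).

```python
import numpy as np

def level_counts(A, ell):
    """A[u, v] = 1 iff edge (u, v) is in the DAG left after decycling.
    Each level i is one independent pass over all vertices (Eq. 2),
    followed by the per-vertex combination of Eq. (1)."""
    n = A.shape[0]
    D = np.zeros((ell + 1, n))
    F = np.zeros((ell + 1, n))
    D[0] = F[0] = 1.0
    for i in range(1, ell + 1):
        D[i] = A @ D[i - 1]       # D(v, i): sum of D(., i-1) over out-neighbors of v
        F[i] = A.T @ F[i - 1]     # F(v, i): sum of F(., i-1) over in-neighbors of v
    T = sum(F[i] * D[ell - i] for i in range(ell + 1))   # Eq. (1), per vertex
    return T

A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])   # DAG a -> b -> c
print(level_counts(A, 2))                          # [1. 1. 1.]
```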

Parallel Randomized k-mer Selection

Our goal is to find a minimum-size set of vertices that covers all \(\ell \)-long paths. We can represent the remaining graph as an instance of the Set Cover problem. While the greedy algorithm for the second phase of DOCKS is serial, we will show that we can devise a parallel algorithm, which is close to the greedy algorithm in terms of performance guarantees, by picking a large set of vertices that cover nearly as many paths as the vertices that the greedy algorithm picks one by one.

In PASHA, instead of removing the vertex with the maximum hitting number in each iteration, we consider a set of vertices for removal with hitting numbers within an interval, and pick vertices in this set independently with constant probability. Considering vertices within an interval allows us to efficiently introduce randomization while still emulating the deterministic algorithm. Picking vertices independently in each iteration enables parallelization of the procedure. Our randomized parallel algorithm for the second phase of the UHS problem adapts that of Berger et al. [2] for the original Set Cover problem.

The UHS Selection Procedure. The input includes graph \(G=(V,E)\) and randomization variables \(0<\varepsilon \le \frac{1}{4}\), \(0<\delta \le \frac{1}{\ell }\) (Algorithm 2). Let function calcHit() calculate the hitting numbers for all vertices, and return the maximum hitting number (line 2). We set \(t = \lceil \log _{1+\varepsilon }T_{max} \rceil \) (line 3), and run a series of steps from t, iteratively decreasing t by 1. In step t, we first calculate the hitting numbers of all vertices (line 5); then, we define vertex set S to contain vertices with a hitting number between \((1+\varepsilon )^{t-1}\) and \((1+\varepsilon )^t\) for potential removal (lines 8–9).

Let \(P_{S}\) be the sum of the hitting numbers of the vertices in S, i.e. \(P_S = \sum _{v \in S} T(v, \ell )\) (line 10). In each step, if the hitting number of vertex v is at least a \(\delta ^3\) fraction of \(P_S\), i.e. \(T(v, \ell ) \ge \delta ^3P_{S}\), we add v to the picked vertex set \(V_t\) (lines 11–13). Vertices with a hitting number smaller than \(\delta ^3P_{S}\) are picked pairwise independently with probability \(\frac{\delta }{\ell }\): We test the vertices in pairs to impose pairwise independence. If an unpicked vertex u passes a Bernoulli trial with success probability \(\frac{\delta }{\ell }\), we choose another unpicked vertex v and perform the same trial. If both trials succeed, we add both vertices to the picked vertex set \(V_t\); otherwise, neither is added (lines 14–16). This bounds the probability of picking any individual vertex. If the sum of hitting numbers of the vertices in \(V_t\) is at least \(|V_t|(1+\varepsilon )^t (1 - 4\delta - 2\varepsilon )\), we add the vertices to the output set, remove them from the graph, and decrease t by 1 (lines 17–20). The next iteration runs with the decreased t. Otherwise, we rerun the selection procedure without decreasing t.

Algorithm 2 (pseudocode)
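Below is a hedged sketch of a single selection step, following the description above (our own simplification; the name select_step is ours, and the retry loop over the same t is assumed to live in the caller).

```python
import random

def select_step(T, S, eps, delta, ell, t):
    """One selection step: T maps each vertex in S (hitting numbers between
    (1+eps)^(t-1) and (1+eps)^t) to T(v, ell); returns the picked set V_t
    and whether it passes the acceptance test."""
    P_S = sum(T[v] for v in S)
    heavy = [v for v in S if T[v] >= delta**3 * P_S]       # always picked
    light = [v for v in S if T[v] < delta**3 * P_S]
    picked = set(heavy)
    random.shuffle(light)
    for u, v in zip(light[0::2], light[1::2]):             # test light vertices in pairs
        if random.random() < delta / ell and random.random() < delta / ell:
            picked.update((u, v))                          # add both, or neither
    # Accept V_t only if its total hitting number is large enough;
    # otherwise the caller reruns the step with the same t.
    threshold = len(picked) * (1 + eps)**t * (1 - 4 * delta - 2 * eps)
    accepted = sum(T[v] for v in picked) >= threshold
    return picked, accepted
```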

Performance Guarantees. At step t, we add the selected vertex set \(V_t\) to the output set if \(\sum _{v \in V_t} T(v,\ell ) \ge |V_t|(1+\varepsilon )^t (1 - 4\delta - 2\varepsilon )\). Otherwise, we rerun the selection procedure with the same value of t. We show in Appendix A that with high probability, \(\sum _{v \in V_t} T(v,\ell ) \ge |V_t|(1+\varepsilon )^t(1 - 4\delta -2\varepsilon )\). We also show that PASHA produces a cover \(\alpha (1 + \log {T_{max}})\) times the optimal size, where \(\alpha = 1/(1 - 4\delta - 2\varepsilon )\). In Appendix B, we give the asymptotic number of the selection steps and prove the average runtime complexity of the algorithm. Performance summaries in terms of theoretical runtime and approximation ratio are in Table 1.

Table 1. Summary of theoretical results for the second phase of different algorithms for generating a set of k-mers hitting all L-long sequences. PDOCKS is DOCKS with the improved hitting number calculation, i.e. greedy removal of one vertex at each iteration. \(p_{D}, p_{DA}\) denote the total number of picked vertices for DOCKS/PDOCKS and DOCKSany, respectively. m denotes the number of parallel threads used, \(T_{max}\) the maximum vertex hitting number, and \(\varepsilon \) and \(\delta \) PASHA’s randomization parameters.

4 Results

PASHA Outperforms Extant Algorithms for \(k \le 13\)

We compared PASHA and PDOCKS to extant methods on several combinations of k and L, with \(20 \le L \le 200\) throughout: we ran DOCKS, DOCKSany, PDOCKS, and PASHA for \(5 \le k \le 10\); PDOCKS, DOCKSany10, and PASHA for \(k = 11\); and PASHA together with DOCKSany100 and DOCKSany1000 for \(k = 12\) and \(k = 13\), respectively. We say that an algorithm is limited by runtime if for some value of \(k \le 13\) and for \(L = 100\) its runtime exceeds 1 day (86400 s), in which case we stopped the run and excluded the method from the results for the corresponding value of k. While running PASHA, we set \(\delta = 1/\ell \) and \(1 - 4\delta - 2\varepsilon = 1/2\), giving an emulation ratio \(\alpha = 2\) (see Sect. 3 and Appendix A). The methods were benchmarked on a 24-CPU Intel Xeon Gold (2.10 GHz) with 754 GB of RAM. We ran all tests using all available cores (\(m = 24\) in Table 1).

Comparing Runtimes and UHS Sizes. We ran DOCKS, PDOCKS, DOCKSany, and PASHA for \(k = 10\) and \(20 \le L \le 200\). As seen in Fig. 1A, DOCKS has a significantly higher runtime than the parallel variant PDOCKS, while producing identical sets (Fig. 1B). For small values of L, DOCKSany produces the largest UHSs compared to other methods, and as L increases, the differences in both runtime and UHS size for all methods decrease, since there are fewer k-mers to add to the removed decycling set to produce a UHS.

We ran PDOCKS, DOCKSany10, and PASHA for \(k = 11\) and \(20 \le L \le 200\). As seen in Fig. 1C, for small values of L, both PDOCKS and DOCKSany10 have significantly higher runtimes than PASHA; while for larger L, DOCKSany10 and PASHA are comparable in their runtimes (with PASHA being negligibly slower). In Fig. 1D, we observe that PDOCKS computes the smallest sets for all values of L. Indeed, its guaranteed approximation ratio is the smallest among all three benchmarked methods. While the set sizes for all methods converge to the same value for larger L, DOCKSany10 produces the largest UHSs for small values of L, in which case PASHA and PDOCKS are preferable.

PASHA’s runtime behaves differently from that of the other methods. For all methods but PASHA, runtime decreases as L increases. PASHA’s runtime decreases up to \(L = 70\), at which point it begins to increase at a much slower rate. This is explained by the asymptotic complexity of PASHA (Table 1). Since computing a UHS for small L requires removing a larger number of vertices, the decrease in runtime with increasing L up to \(L = 70\) is significant; however, because PASHA’s asymptotic complexity is quadratic in L, we see a small increase from \(L = 70\) to \(L = 200\). All other methods depend linearly on the number of removed vertices, which decreases as L increases.

Despite the significant decrease in runtime of PDOCKS compared to DOCKS, PDOCKS was still limited by runtime to \(k \le 12\). Therefore, we ran DOCKSany100 and PASHA for \(k = 12\) and \(20 \le L \le 200\). As seen in Figs. 1E and F, both methods follow a trend similar to that for \(k = 11\), with DOCKSany100 being significantly slower and generating significantly larger UHSs for small values of L. For larger values of L, DOCKSany100 is slightly faster, while PASHA produces sets that are slightly smaller.

We ran DOCKSany1000 and PASHA for \(k = 13\) and \(20 \le L \le 200\). At \(k=13\) we observed the superior performance of PASHA over DOCKSany1000 in both runtime and set size for all values of L: as seen in Figs. 1G and H, DOCKSany1000 produces larger sets and is significantly slower compared to PASHA. This result demonstrates that the slow increase in runtime for PASHA compared to other algorithms for \(k<13\) does not have a significant effect on runtime for larger values of k.

Fig. 1. Runtimes (left) and UHS sizes (divided by \(10^4\), right) for values of \(k = 10\) (A, B), 11 (C, D), 12 (E, F), and 13 (G, H) and \(20 \le L \le 200\) for the different methods. Note that the y-axes for runtimes are in logarithmic scale.

PASHA Enables UHS for \(k = 14, 15, 16\)

Since all existing algorithms and PDOCKS are limited by runtime to \(k \le 13\), we report the first UHSs for \(14 \le k \le 16\) and \(L = 100\) computed using PASHA, run on a 24-CPU Intel Xeon Gold (2.10 GHz) with 754 GB of RAM using all 24 cores. Figure 2 shows runtimes and sizes of the sets computed by PASHA.

Density Comparisons for the Different Methods

In addition to runtimes and UHS sizes, we report values of another measure of UHS performance known as density. The density of a minimizers scheme, d(M, S, k), is the fraction of selected k-mer positions over the number of k-mers in the sequence. Formally, the density of scheme M over sequence S is defined as

$$\begin{aligned} d(M, S, k) = \frac{|M(S, k)|}{|S|-k+1} \end{aligned}$$
(3)

where M(S, k) is the set of positions of the k-mers selected over sequence S.
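Eq. (3) translates directly into code (a one-line sketch of ours), with a small worked example.

```python
def density(selected_positions, seq_len, k):
    """Eq. (3): |M(S, k)| over the number of k-mer positions in S."""
    return len(set(selected_positions)) / (seq_len - k + 1)

print(density({0, 4, 8}, 12, 5))  # 3 selected positions out of 8 -> 0.375
```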

Fig. 2. Runtimes (A) and UHS sizes (divided by \(10^6\)) (B) for \(14 \le k \le 16\) and \(L = 100\) for PASHA. Note that the y-axis for runtime is in logarithmic scale.

We calculate densities for a UHS by selecting, within each window of \(L-k+1\) consecutive k-mers, the lexicographically smallest k-mer that is in the UHS, since at least one k-mer from the UHS is guaranteed to be in each such window. Marçais et al. [11] showed that using UHSs for k-mer selection in this manner yields smaller densities than lexicographic or random minimizer selection schemes. Therefore, we do not report comparisons between UHSs and minimizer schemes, but rather comparisons among UHSs constructed by different methods.
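A sketch of this window-based selection scheme (our own illustration; the authors’ benchmarking code may differ in details such as tie-breaking):

```python
def uhs_select(seq, uhs, k, L):
    """In every window of L - k + 1 consecutive k-mers, select the position of the
    lexicographically smallest k-mer belonging to the UHS; a UHS guarantees that
    each window contains at least one such k-mer."""
    w = L - k + 1
    positions = set()
    for start in range(len(seq) - L + 1):
        candidates = [(seq[i:i + k], i) for i in range(start, start + w) if seq[i:i + k] in uhs]
        positions.add(min(candidates)[1])   # smallest k-mer; ties broken by leftmost position
    return positions

# Density of the scheme over seq, per Eq. (3):
# d = len(uhs_select(seq, uhs, k, L)) / (len(seq) - k + 1)
```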

Marçais et al. [11] also showed that the expected density of a minimizers scheme for any k and window size \(L-k+1\) is equal to the density of the scheme on a de Bruijn sequence of order L. This allows exact calculation of the expected density of any k-mer selection procedure. However, for \(14 \le k \le 16\) we calculated UHSs only for \(L = 100\), and iterating over a de Bruijn sequence of order 100 is infeasible. Therefore, we computed approximate expected densities on long random sequences, since densities computed on such sequences converge to the expected density [11]. In addition, we computed the density of the different methods on the entire human reference genome (GRCh38).

We computed the density values of UHSs generated by PDOCKS, DOCKSany, and PASHA over 10 random sequences of length \(10^6\), and the entire human reference genome (GRCh38), for \(5 \le k \le 16\) and \(L = 100\), when a UHS was available for such (kL) combination.

As seen in Fig. 3, the differences in both approximate expected density and density computed on the human reference genome are negligible when comparing UHSs generated by the different methods. For most values of k, DOCKS yields the smallest approximate expected density and human genome density values, while DOCKSany generally yields lower human genome density values, but higher expected density values than PASHA. For \(k\le 6\), the UHS is only the decycling set; therefore, density values for these values of k are identical for the different methods.

Since there is no significant difference in the density of the UHSs generated by the different methods, other criteria, such as runtime and set size, are relevant when evaluating the performance of the methods: As k increases, PASHA produces sets that are only slightly smaller or larger in density, but significantly smaller in size, and it runs significantly faster than extant methods.

Fig. 3. Mean approximate expected density (A), and density on the human reference genome (B) for the different methods, for \(5 \le k \le 16\) and \(L = 100\). Error bars represent one standard deviation from the mean across 10 random sequences of length \(10^6\). Density is the fraction of selected k-mer positions over the number of k-mers in the sequence.

5 Discussion

We presented an efficient randomized parallel algorithm for generating a small set of k-mers that hits every possible sequence of length L, producing a set that is a small guaranteed factor away from the optimal set size. Since the runtime of the DOCKS variants and PASHA depends exponentially on k, these greedy heuristics are eventually limited by runtime. However, using these heuristics in conjunction with parallelization, we are now able to compute UHSs for values of k and L large enough for most biological applications.

The improvements in runtime for the hitting number calculation are due to parallelization of the dynamic programming phase, which is the bottleneck in sequential DOCKS variants. A minimum-size set that hits all infinite-length sequences is optimally and rapidly removed; however, the remaining sequences of length L are calculated and removed in time polynomial in the output size. We show that a constant factor reduction is beneficial in mitigating this bottleneck for practical use. In addition, we reduce the memory usage of this phase by theoretical and technical advancements. Last, we build on a randomized parallel algorithm for Set Cover to significantly speed up vertex selection. The randomized algorithm can be derandomized, while preserving the same approximation ratio, since it requires only pairwise independence of the random variables [2].

One main open problem remains from this work. Although the randomized approximation algorithm enables us to generate a UHS more efficiently, the hitting numbers still need to be calculated at each iteration, and this calculation remains the bottleneck in computing a UHS. Is there a more efficient way of calculating hitting numbers than the dynamic programming used in DOCKS and PASHA? A more efficient calculation of hitting numbers would enable PASHA to run for \(k>16\) in reasonable time.

As for long reads, which are becoming more popular for genome assembly tasks, a k-mer set that hits all infinitely long sequences, as computed optimally by Mykkeltveit’s algorithm [13], suffices due to the length of these reads. Still, owing to the inaccuracies and higher cost of long-read sequencing compared to short-read sequencing, the latter remains the prevailing method for producing sequencing data, and is expected to remain so for the near future.

We expect the efficient calculation of UHSs to lead to improvements in sequence analysis and construction of space-efficient data structures. Unfortunately, previous methods were limited to small values of k, thus allowing application to only a small subset of sequence analysis tasks. As there is an inherent exponential dependency on k in terms of both runtime and memory, efficiency in calculating these sets is crucial. We expect that the UHSs newly-enabled by PASHA for \(k>13\) will be useful in improving various applications in genomics.

6 Conclusion

We developed PASHA, a novel randomized parallel algorithm to compute a small set of k-mers that together hit every sequence of length L. It is based on two algorithmic innovations: (i) improved calculation of hitting numbers through parallelization and memory reduction; and (ii) randomized parallel selection of additional k-mers to remove. We demonstrated PASHA’s scalability to values of k up to 16. Notably, universal hitting sets need to be computed only once, and can then be used in many sequence analysis applications. We expect our algorithms to become an essential part of the sequence analysis toolkit.