
1 Introduction

The majority of bioinformatics analyses are performed by k-mer-based tools, which offer several advantages over tools that directly process reads or read alignments. These tools operate primarily by transforming the input sequence data, whose reads may vary in length depending on the sequencing technology, into a k-mer set, that is, a set of fixed-length strings together with their multiplicities, called counts.

k-mer-based methods achieve better performance in many applications. In genome assembly, SPAdes [2] uses k-mers to reconstruct entire genomes from reads, efficiently obtaining highly accurate results. The assembly validation of Merqury [22] also relies on k-mer counts. In metagenomics, Kraken [27] can classify and identify microorganisms in complex environmental samples using k-mers, and it is 900 times faster than MegaBLAST. Since the introduction of Kraken, most tools for metagenomic classification have been based on k-mers [1, 5, 19, 24]. In genotyping, several tools [9, 13, 14, 26] use k-mers instead of alignments to identify genetic variations in individuals or populations. In phylogenomics, Mash [15] uses k-mers to efficiently estimate distances between genomes and metagenomes in order to reconstruct evolutionary relationships among organisms. In database searching, many k-mer-based methods [3, 10, 12, 16, 25] have been proposed to search sequences efficiently.

Overall, k-mer-based methods have revolutionized many areas of bioinformatics and they have become an essential tool for analyzing large-scale genomic data. These tools often rely on specialized data structures for representing sets of k-mers (for a survey, see [6]). Since modern sequencing datasets are huge, the space used by such data structures is a bottleneck when attempting to scale up to large databases. Conway and Bromage [8] showed that at least \(\log \binom{4^k}{n}\) bits are needed to losslessly store a set of n k-mers, in the worst case. However, a set of k-mers generated from a sequencing experiment typically exhibits the spectrum-like property [6] and contains a lot of redundant information. Therefore, in practice, most data structures can substantially improve on that bound [7].

Since storing a k-mer set requires non-negligible space, it is desirable to reduce its size, which can be very large. For example, the dataset used to test the BIGSI [21] index takes approximately 12 TB to be stored in compressed form.

The best tool to compress a set of k-mers with counts is UST [21] (see Sect. 1.1), and it uses the de Bruijn graph representation of the input k-mer set. The problem of finding the smallest k-mer set representation is equivalent to finding the smallest path cover of a de Bruijn graph (Sect. 2). In this paper, we present USTAR (Unitig STitch Advanced constRuction), which follows a similar paradigm but implements a better strategy for exploring de Bruijn graphs. The USTAR strategy leverages the density of the de Bruijn graph and node connectivity, enabling a more effective path selection for the construction of the path cover, and thus improving the compression. In Sect. 3 we report a series of results on several real sequencing datasets, showing that USTAR achieves the best compression ratio for k-mers and counts, outperforming UST and other tools.

1.1 Related Works

The problem of k-mer set compression has been addressed by several researchers; in this section we summarize the most recent findings. k-mer counters are tools designed to count and store distinct k-mers, a particularly hard challenge for large datasets. The best-known tools are Squeakr [17], KMC [11] and DSK [23]. Squeakr is an approximate and exact k-mer counting system that exploits the counting quotient filter, a compact probabilistic data structure, to store k-mers efficiently. KMC uses disk files as bins in which k-mers are divided, sorted, and counted. Finally, DSK uses hash tables to update k-mer counters. The last two tools are not specifically designed for compression, but they are still capable of reducing the size by storing only distinct k-mers and their counts.

A k-mer set with counters can be represented by a de Bruijn graph (dBG), which can be exploited for efficient storage. BCALM2 [7] is a tool for the low-memory construction of compacted dBGs, in which maximal non-branching paths are merged into a single node labeled with the k-mers glued together and the list of their counts. Compaction provides advantages not only in terms of memory usage but also in terms of disk space: the idea, in order to compress k-mers, is to save \(k-1\) characters per link. For example, given a dBG, the non-branching path (ACT, CTG, TGA) can be replaced by a single node ACTGA. The sequence represented by a non-branching path is called a unitig, and unitigs are a first attempt to compress k-mers using dBGs.

Another way to reduce the redundancy of a k-mer set is to exploit its spectrum-like property [6], i.e. the existence of long strings that “generate” all the k-mers. This idea has been developed in parallel by the authors of ProphAsm [4] and UST [21]. The authors of ProphAsm [4] refer to these long strings as simplitigs and build them by linking overlapping unitigs and k-mers during the exploration of a dBG computed on the fly. They showed that simplitigs outperform unitigs, the k-mer representation proposed by BCALM2, in terms of computational resources and compression rate. UST [21] also links overlapping unitigs and k-mers, but it takes as input the compacted de Bruijn graph computed by BCALM2 and it also considers k-mer counts. Its authors derive a nearly tight lower bound for the optimal k-mer representation and show empirically that in most cases their greedy algorithm is within 3% of that lower bound.

2 USTAR: Unitig STitch Advanced ConstRuction

2.1 Definitions

For the purpose of this paper, we consider strings over the alphabet \(\Sigma = \{A, C, T, G\}\). A string of length k is called a k-mer, and its reverse complement \(rc(\cdot )\) is obtained by reversing the k-mer and replacing each character with its complement, that is \(A \mapsto T\), \(C \mapsto G\), \(T \mapsto A\), \(G \mapsto C\). Since we do not know from which DNA strand a k-mer is taken, we consider a k-mer and its reverse complement as the same k-mer. Given a string \(s = \langle s_1, \dots , s_{|s|}\rangle \), we denote the first i characters of s as \(pref_i(s)=\langle s_1, \dots , s_i \rangle \) and the last i characters of s as \(suf_i(s) = \langle s_{|s|-i+1}, \dots , s_{|s|} \rangle \). We define the glue operation between two strings u and v such that \(suf_{k-1}(u) = pref_{k-1}(v)\) as the concatenation of u with the non-overlapping suffix of v:

$$u \odot ^{k-1} v = u \cdot suf_{|v| - (k - 1)}(v)$$

For example, given two 3-mers \(u = CTG\) and \(v = TGA\), their gluing is \(u \odot ^2 v = CTGA\).
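To make the glue operation concrete, here is a minimal Python sketch (our own illustration, not part of any of the tools discussed); it assumes the caller provides strings with a valid \((k-1)\)-overlap and simply verifies it with an assertion:

```python
def glue(u: str, v: str, k: int) -> str:
    """Return u glued with v, i.e. u concatenated with the non-overlapping suffix of v."""
    assert u[-(k - 1):] == v[:k - 1], "the (k-1)-suffix of u must equal the (k-1)-prefix of v"
    return u + v[k - 1:]

# Example from the text: gluing the 3-mers CTG and TGA yields CTGA.
print(glue("CTG", "TGA", k=3))  # CTGA
```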

A set of k-mers can be represented by a de Bruijn graph, of which we give a node-centric definition, meaning that the arcs are implicitly defined by the nodes. Thus we can refer to a k-mer set K and its graph dBG(K) interchangeably.

Given a k-mer set \(K = \{m_1, \dots , m_{|K|}\}\), a de Bruijn graph of K is a directed graph dBG(K) = (V, A) in which:

  1. \(V=\{v_1, \dots , v_{|K|}\}\)

  2. each node \(v_i \in V\) has a label \(lab(v_i) = m_i\)

  3. each node \(v \in V\) has two different sides \(s_v \in \{0,1\}\), where (v, 1) is graphically represented with a tip

  4. a node side \((v,s_v)\) is spelled as

     $$spell(v, s_v) = \begin{cases} lab(v) & \text{if } s_v = 0 \\ rc(lab(v)) & \text{if } s_v = 1 \end{cases} \qquad (1)$$
  5. there is an arc between two node sides \((v,s_v)\) and \((u,s_u)\) if and only if their spellings share a \((k-1)\)-mer; more precisely,

     $$\bigl ((v,s_v),(u,s_u)\bigr ) \in A \iff suf_{k-1}\left( spell\left( v, 1 - s_v\right) \right) = pref_{k-1}\left( spell\left( u,s_u\right) \right) $$

     The right-hand condition is also known as a (v, u)-oriented-overlap [21].

Note that nodes’ sides allow treating a k-mer and its reverse complement as if they were the same k-mer. Furthermore, nodes can be associated with k-mer counts.
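The following Python sketch (again our own illustration, with hypothetical function names) makes the node-side machinery explicit: reverse complement, side spelling as in Eq. (1), and the oriented-overlap test that defines the arcs:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rc(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def spell(label: str, side: int) -> str:
    """Spelling of a node side: the label for side 0, its reverse complement for side 1 (Eq. 1)."""
    return label if side == 0 else rc(label)

def has_arc(lab_v: str, s_v: int, lab_u: str, s_u: int, k: int) -> bool:
    """Arc condition: suf_{k-1}(spell(v, 1 - s_v)) == pref_{k-1}(spell(u, s_u))."""
    return spell(lab_v, 1 - s_v)[-(k - 1):] == spell(lab_u, s_u)[:k - 1]

# With the convention above, the arc ACT -> CTG (both labels read forward)
# corresponds to s_v = 1, so that spell(v, 1 - s_v) = ACT.
print(has_arc("ACT", 1, "CTG", 0, k=3))  # True: the spellings share the 2-mer "CT"
```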

A path \(p = \langle (v_1,s_1), \dots , (v_l, s_l) \rangle \) is spelled by gluing the spellings of its node sides:

$$spell(p) = spell(v_1, s_1) \odot ^{k-1} spell(v_2, s_2) \odot ^{k-1} \dots \odot ^{k-1} spell(v_l, s_l)$$

The path p is said to be a unitig if its internal nodes have in-degree and out-degree equal to 1. A unitig is said to be maximal if it cannot be extended on either side. In order to decrease its memory footprint, a dBG(K) can be compacted by replacing maximal unitigs with single nodes labeled with the spellings of the unitigs.

An example of compacted dBG(K) with

$$K=\{ACT,CTG,TGA,CTT,TTG,TGC\}$$

is shown in Fig. 1. It has been compacted by replacing the maximal unitig (CTT, TTG) with the node CTTG.

2.2 Vertex-Disjoint Path Cover Problem

Compressing a k-mer set K can be achieved by finding a representation S of K made of strings of any length such that the set of their length-k substrings is equal to K.

A natural way to measure the size of a string set S is by computing its \(cumulative\ length\) defined as the sum of all the string lengths:

$$CL(S) = \sum _{s \in S} |s|$$

where |s| is the length of the string s. It has been shown in [4, 21] that, when S does not contain duplicate k-mers, the cumulative length of S is determined by its cardinality; in particular, it holds that

$$CL(S) = |K| + (k-1)\cdot |S|$$

where |S| is the cardinality of the set S. Therefore our goal, finding the best representation of a k-mer set K, is equivalent to minimizing the number of strings in the string set S.

Consider again the example in Fig. 1. From the path \(p = (ACT,CTG,TGA)\) we can compute its spelling \(spell(p) = ACTGA\), which contains all the 3-mers ACT, CTG and TGA of p. Thus, from a set of paths P that covers all the nodes of dBG(K), we can derive a set S of compressed k-mers. By imposing that all the paths are vertex-disjoint, we guarantee that each k-mer is represented only once. Therefore a vertex-disjoint path cover can be used to compute S for compression.
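The connection between path covers and the cumulative length can be checked on the example of Fig. 1 with the following Python sketch (our own, ignoring reverse complements for simplicity); it spells one possible cover and verifies the identity \(CL(S) = |K| + (k-1)\cdot |S|\):

```python
def spell_path(path, k):
    """Spell a path of k-mers by gluing consecutive (k-1)-overlapping k-mers."""
    out = path[0]
    for kmer in path[1:]:
        assert out[-(k - 1):] == kmer[:k - 1]
        out += kmer[k - 1:]
    return out

k = 3
K = {"ACT", "CTG", "TGA", "CTT", "TTG", "TGC"}
# One possible vertex-disjoint path cover of Fig. 1 (the "red" cover discussed below).
P = [["ACT", "CTG", "TGA"], ["CTT", "TTG"], ["TGC"]]
S = [spell_path(p, k) for p in P]                 # ['ACTGA', 'CTTG', 'TGC']
CL = sum(len(s) for s in S)                       # 12
assert CL == len(K) + (k - 1) * len(P)            # 6 + 2 * 3 = 12
# Every k-mer of K appears exactly once among the length-k substrings of S.
assert sorted(s[i:i + k] for s in S for i in range(len(s) - k + 1)) == sorted(K)
```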

Fig. 1.

An example of a compacted de Bruijn graph. Nodes are labeled with k-mers and their counts. Undirected arcs are used in place of two arcs with opposite directions. UST may choose the path cover in red while USTAR is forced to choose the path cover in green. (Color figure online)

Recall that, in order to minimize the cumulative length of a string set S that represents K, we need to minimize the number of strings |S|, which corresponds to the number of paths |P|. We can therefore solve the following problem.

Problem 1

Given a de Bruijn graph dBG(K) of a k-mer set K, the minimum vertex-disjoint path cover problem is to find the minimum number of vertex-disjoint paths that cover the graph.

For general graphs the problem above is known to be NP-hard [4, 21], since the Hamiltonian path problem can be reduced to it: a graph has a Hamiltonian path if and only if it has a vertex-disjoint path cover of cardinality 1. However, it is not clear whether the problem remains NP-hard for de Bruijn graphs.

Nevertheless, greedy, non-optimal algorithms have been proposed. ProphAsm [4] uses a simple heuristic that takes an arbitrary k-mer in dBG(K), tries to extend it forward and backward as long as possible, and restarts until all the k-mers have been consumed. Similarly, using as input the compacted dBG(K) constructed by BCALM2, UST [21] takes an arbitrary node, tries to extend it forward as long as possible, and restarts as long as uncovered nodes remain; in the end, UST merges linked paths. Both methods follow a similar strategy, picking the first available k-mer without considering the graph structure. If we consider the example in Fig. 1, ProphAsm and UST, by choosing nodes arbitrarily, may build the path cover (in red) \(P=\{(ACT, CTG, TGA), (CTT, TTG), (TGC)\}\), from which, by computing the spelling of each path, the set of strings \(S=\{ACTGA, CTTG, TGC\}\) is derived. In this example, the cumulative length is \(CL(S)=12\).
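A simplified Python rendition of this arbitrary greedy strategy is sketched below (forward extension only, on a plain adjacency map of the compacted graph of Fig. 1, ignoring node sides, counts, and UST's final path-merging step; the node order is arbitrary, and with the order chosen here the sketch happens to reproduce the red cover):

```python
def greedy_path_cover(adj):
    """UST/ProphAsm-like heuristic: pick an arbitrary uncovered node and extend it
    through arbitrary uncovered neighbors as long as possible, then restart."""
    covered, paths = set(), []
    for seed in adj:                          # arbitrary seed order
        if seed in covered:
            continue
        path, node = [seed], seed
        covered.add(seed)
        while True:
            nxt = [u for u in adj[node] if u not in covered]
            if not nxt:
                break
            node = nxt[0]                     # first available neighbor, no look-ahead
            path.append(node)
            covered.add(node)
        paths.append(path)
    return paths

# Compacted graph of Fig. 1 (CTTG is the compacted unitig CTT, TTG).
adj = {"ACT": ["CTG", "CTTG"], "CTG": ["TGA", "TGC"],
       "TGA": [], "TGC": [], "CTTG": ["TGA", "TGC"]}
print(greedy_path_cover(adj))  # [['ACT', 'CTG', 'TGA'], ['TGC'], ['CTTG']] -> CL = 12
```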

In this work, we present USTAR (Unitig STitch Advanced constRuction) which, unlike previous algorithms, exploits the connectivity of the dBG and the values of the counts. USTAR also implements a heuristic to compute simplitigs. Like UST, USTAR takes advantage of the compacted de Bruijn graph computed by BCALM2. Similarly to UST and ProphAsm, at each step USTAR selects a seed node in the graph and then tries to compute a path starting from this node. A path is constructed by connecting adjacent nodes until it cannot be further extended. The algorithm continues with the selection of a new seed node until all nodes have been covered by a path. The two key operations in this algorithm are how to select a good seed node and how to extend a path among the available connections.

In USTAR, the counts associated with each node and the topology of the graph are used to determine the best seed and how to extend it. The distribution of counts is in general very skewed, with many low values and a few very high values; since the distribution is non-uniform, higher counts are harder to compress. For this reason, the exploration strategy of USTAR chooses as seed the node with the highest average count. Since neighboring nodes usually have similar counts, choosing the seed based on the highest average count may improve the compression of these high counts.

As for the path cover construction, we observe that UST and ProphAsm might choose a highly connected node; since this node will not be available in subsequent iterations, this selection may lead to isolated nodes that increase the cumulative length, as in the example in Fig. 1. USTAR tries to avoid this scenario: paths are extended by selecting the neighbor with the fewest connections, so that highly connected nodes remain available for future iterations. These choices lead to a lower CL since they create fewer and longer simplitigs.
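The two selection rules can be sketched as follows (a minimal Python illustration of the idea, not USTAR's actual implementation; here each node carries a hypothetical `count` field holding its average k-mer count and an `adj` list of neighbors):

```python
def pick_seed(nodes, covered):
    """Seed selection: among uncovered nodes, prefer the one with the highest average count."""
    return max((v for v in nodes if v not in covered),
               key=lambda v: nodes[v]["count"])

def pick_next(nodes, covered, node):
    """Path extension: among uncovered neighbors, prefer the one with the fewest
    connections, so that highly connected nodes stay available for later paths."""
    candidates = [u for u in nodes[node]["adj"] if u not in covered]
    if not candidates:
        return None
    return min(candidates, key=lambda u: len(nodes[u]["adj"]))
```

Plugging these two rules into a greedy loop like the one sketched above replaces the arbitrary choices of UST and ProphAsm with count- and connectivity-aware ones.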

Following the example in Fig. 1, USTAR guarantees that, while constructing the first path, the most connected node CTG is avoided. This will produce a cover of the dBG with the paths (in green) \(P^\prime =\{(ACT, CTT, TTG, TGA), (CTG, TGC)\}\) and thus a set of strings \(S^\prime =\{ACTTGA, CTGC\}\). If we measure the cumulative length of \(S^\prime \) we have that \(CL(S^\prime ) = 10\).

$$CL(S^\prime ) = 10 < CL(S) = 12 < CL(K) = 18$$

Overall, in this example, the uncompressed k-mer set requires \(CL(K) = 18\) characters; with UST the k-mer set can be compressed to \(CL(S) = 12\), whereas USTAR produces a better compression with \(CL(S^\prime ) = 10\).

3 Results

In this section, we present a series of experiments aimed at finding the tool that best compresses k-mers and counts. In our evaluation, we compared USTAR with several other tools: Squeakr, KMC, DSK, BCALM2, and UST. For testing we used a set of real read datasets taken from previous studies [4, 7, 11, 17, 23]; a summary of all datasets is reported in Table 1. For each dataset, we extracted all k-mers (see Table 1) and the corresponding counts, and used this information as input for all compression tools. Some tools, like UST and USTAR, require building a compacted dBG with BCALM2 as a preprocessing step.

Table 1. A summary of the read datasets used in the experiments. Datasets are downloaded from NCBI’s Sequence Read Archive.

In the first experiment, we ran all compression tools on all datasets with \(k=21\) and report the results in Table 2. In all cases, the stored data is additionally compressed using MFCompress [18] for nucleotide sequences or bzip3 for binary data. Table 2 reports the sizes of the files compressed by the different tools.

Table 2. Datasets (with \(k=21\)) are processed with DSK, KMC, Squeakr, BCALM, UST, and USTAR. Nucleotide files are then compressed with MFCompress [18], while other text or binary files are compressed with bzip3. The average file size over all datasets is reported in the last row.

We can observe that USTAR is on average the best compressor, and it consistently outperforms the other tools on all datasets. As expected, UST and BCALM are the second- and third-best methods; nevertheless, USTAR shrinks the representation by 76% with respect to BCALM and by 4.2% with respect to UST.

Since UST is clearly the best competitor, in the next tests we compare USTAR only with UST. We used three different evaluation metrics:

  • CL: the cumulative length as defined in Sect. 2, which measures the quality of the representation before compressing the nucleotide sequences with the dedicated compressor MFCompress;

  • counts: the file size of counts after compression with a general-purpose compressor;

  • overall: the sum of compressed file sizes of k-mers and counts.

Given a metric M, the improvement of USTAR over UST is computed as

$$\varDelta M = \frac{M_{UST} - M_{USTAR}}{M_{UST}}$$

where M can be CL, counts or overall.
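For instance, with purely illustrative file sizes: if for some dataset \(M_{UST} = 100\) MB and \(M_{USTAR} = 90\) MB, then

$$\varDelta M = \frac{100 - 90}{100} = 10\%$$

so positive values indicate that USTAR produces a smaller file than UST.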

In the next experiment, we compared USTAR and UST with these three metrics. In Table 3 we report the average improvements of USTAR w.r.t. UST over all the above datasets while varying the k-mer size, using odd lengths as previously done by other authors [4, 20, 21].

Table 3. Average improvements of USTAR w.r.t. UST varying the k-mer size.

We can observe that for all values of k, USTAR improves over UST, for both nucleotide and count compression. For the largest value, \(k=31\), the overall advantage of USTAR is \(2.30\%\), and this improvement is mainly achieved through a better compression of counts; in fact, \(\varDelta \)counts = 12.70%. We suspect that with higher values of k the compression gain might decrease, but more experiments are needed to confirm this hypothesis.

When smaller values of k are considered, the overall improvement increases through \(k=21\) and \(k=17\), reaching its maximum of \(26.40\%\) for \(k=15\). A similar behavior can be observed for \(\varDelta \)CL, which grows from \(0.97\%\) with \(k=31\) to \(33.64\%\) with \(k=15\). The count improvement, on the other hand, is roughly constant as k varies, ranging between 12.07% and \(15.85\%\). In general, \(\varDelta \)overall is mainly driven by \(\varDelta \)CL.

Since USTAR exploits the structure and connectivity of the dBG, it is interesting to further study its behavior w.r.t. the number of arcs in the graph. Recall that nodes in a dBG have two sides and that each side can have four arcs, one per nucleotide; thus the maximum number of arcs is \(2 \cdot 4 \cdot \#nodes\). Based on this observation, we define the graph density as the number of arcs over the maximum number of arcs:

$$density = \frac{\#arcs}{8\cdot \#nodes}$$
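As an illustration with made-up numbers, a dBG with 1,000 nodes has at most \(8 \cdot 1000 = 8000\) arcs; if 2,800 of them are present, then

$$density = \frac{2800}{8 \cdot 1000} = 35\%$$

which, as discussed below, is roughly the density at which the improvement in Fig. 2 starts to grow almost linearly.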
Fig. 2.

Improvement on the cumulative length, w.r.t. UST, as a function of the dBG graph density.

Fig. 2 shows the improvements in cumulative length for all datasets and k-mer sizes, plotted against graph density; each point represents a dataset. The plot has a well-defined curve that rises slowly up to \(density\,=\,35\%\) and then increases almost linearly. \(\varDelta CL\) spans from \(0.02\%\) to \(48.31\%\), showing that the compression improvement strongly increases as the density increases. Graph density is therefore very important in determining the improvement in k-mer compression: with denser graphs we obtained significantly higher improvements. We also considered other properties such as the number of k-mers, the number of unitigs, the number of isolated nodes, the read length, and the variance of the counts. However, none of these features is as significant as the graph density, so they are omitted.

Indeed, a denser graph implies that USTAR has more nodes to choose from, unlike UST which picks the first connected k-mer, so that well-connected nodes remain available for paths that would otherwise have consisted of single nodes. The higher compression ratio achieved by USTAR on denser graphs confirms that the strategy based on node connectivity works well. Since UST chooses almost arbitrarily how to extend a path, it cannot guarantee good behavior: even on a denser graph it may generate many single-node paths, which substantially increases the number of paths and thus the cumulative length of the compressed k-mers, worsening the compression ratio. If we decrease the k-mer size we observe a higher density: with small k there are more k-mers and it is more likely to have connections between them, leading to an increase in density. In summary, USTAR can compress k-mers and counts with a better compression ratio than several other tools, and it is most effective on dense dBGs and for small k-mer sizes.

4 Conclusions

In this paper, we have presented USTAR, a tool for compressing k-mer sets with counters. Our approach utilizes a vertex-disjoint path cover to find a representation of the k-mer set that minimizes the cumulative length of the compressed data. By exploring de Bruijn graphs and making informed choices based on node connectivity and average counts, we have achieved better compression ratios compared to existing tools such as UST.

We have evaluated USTAR on various datasets and compared it with several other tools. The results demonstrate that our method consistently outperforms UST and the other tools in terms of compression ratio, especially for smaller k-mer sizes. The improvements in cumulative length range from 0.97% to 33.64%, while the improvement on counts is almost constant across k. The overall improvement in compressed file size is driven mainly by the reduction in cumulative length.

Furthermore, we have observed that graph density plays a crucial role in determining the effectiveness of USTAR. Denser graphs yield higher improvements, and lowering the k-mer size contributes to increased density. These findings highlight the power of our method, particularly for datasets with higher graph density.

In conclusion, USTAR offers an effective solution for compressing k-mer sets with counters. By leveraging de Bruijn graphs and making informed choices in the construction of the vertex-disjoint path cover, we achieve superior compression ratios compared to existing tools. Our method has the potential to improve the storage and processing of large-scale genomic data, enabling more efficient analyses with k-mer-based tools.