
1 Introduction

Several applications in bioinformatics require the compression of a string, x, given another string, y. This is the case when one needs to analyze the data or to store them as compactly as possible [1,2,3,4,5,6]. The information in y can be used together with that of x or alone. In the so-called conditional approach [7, 8], the compressor can exploit the information contained in y, as well as that from x (assuming causality), according to

$$\begin{aligned} C(x|y)= \sum _{i=1}^{|x|}-\log _2 P(x_i|x_1^{i-1},y), \end{aligned}$$
(1)

where |x| is the size of x and \(x_i\) is the \(i^{th}\) element of x. So, for example, \(x_3^5\) is the substring of x composed of \(x_3\), \(x_4\) and \(x_5\).

The relative approach [6, 9,10,11,12,13,14], \(C(x\Vert y)\), assumes that information comes exclusively from y, according to

$$\begin{aligned} C(x\Vert y)= \sum _{i=1}^{|x|}-\log _2 P(x_i|x_{i-\pi }^{i-1},y), \end{aligned}$$
(2)

where \(\pi \) is the maximum number of past elements of x that can be used to search for regularities in y. For \(i\le \pi \), we assume a uniform distribution.
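The relative code length of Eq. 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `probs` is a hypothetical stand-in for the estimator built over y, and the uniform fallback for \(i\le \pi \) follows the definition above.

```python
import math

def relative_code_length(x, probs, pi, alphabet_size=4):
    """Sketch of Eq. 2: sum over i of -log2 P(x_i | x_{i-pi}^{i-1}, y).

    `probs(symbol, context)` is a hypothetical callable returning the
    model's probability estimate; for i <= pi a uniform distribution
    over the alphabet is assumed, as stated in the text.
    """
    total = 0.0
    for i, symbol in enumerate(x):
        if i < pi:
            p = 1.0 / alphabet_size          # uniform fallback
        else:
            p = probs(symbol, x[i - pi:i])   # model conditioned on y
        total += -math.log2(p)
    return total
```

For instance, with a model that always returns 0.5 and \(\pi =2\), the first two symbols of a 4-symbol DNA string cost 2 bits each (uniform over 4 symbols) and the remaining two cost 1 bit each, for a total of 6 bits.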

In order to calculate the probabilities of Eq. 2, we need data models that describe y efficiently. Both Ziv-Merhav dictionary-based models [9, 13, 15] and Markov models [5, 14, 16, 17] have been successfully used in diverse data type applications. However, for DNA sequences, Markov models proved to be more efficient [6].

Markov models (MMs), also known as finite-context models (FCMs), are statistical models. A MM of an information source assigns probability estimates to the symbols of an alphabet, \(\varTheta \), according to a conditioning context computed over a finite and fixed number, k, of past outcomes (order-k MM) [18]. At element i, these conditioning outcomes are represented by \(x_{i-k}^{i-1} = x_{i-k},\dots ,x_{i-1}\). A non-relative MM can store every past outcome in memory, while a MM working in relative mode can only store the outcomes seen in y. For DNA sequences, the number of conditioning states of the model is \(4^k\). Cooperation between MMs of different orders, rather than competition, has proved to be the more efficient solution for representing DNA sequences [19].
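A relative-mode MM memory of this kind can be sketched by counting, for each order-k context occurring in y, the symbols that follow it. This is a simplified illustration under our own naming (`build_counts` is a hypothetical helper, not the authors' implementation):

```python
from collections import defaultdict

def build_counts(y, k):
    """Relative-mode order-k MM memory sketch: for every length-k
    context seen in y, count the symbols of Theta = {A,C,G,T}
    that follow it. Only y is ever stored, as in relative mode."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(y)):
        context = y[i - k:i]
        counts[context][y[i]] += 1
    return counts
```

In practice a hash table (as here) avoids allocating all \(4^k\) states for deep orders, since only contexts actually present in y need an entry.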

High-order MMs, typically with \(k\ge 13\), have proved to be among the most important models for DNA sequence representation [20], as well as for other applications [21,22,23]. However, when substitutional mutations occur between two otherwise identical sequences, high-order MMs fall short in representing the data. This happens because, if, for example, we use an order-20 MM and there is on average one random substitution per 20 bases, the probability that the same context is seen again is low. DNA data from closely related species is frequently of this nature, because the species share a common ancestor. Moreover, the vast majority of the edits in DNA sequences are of substitutional nature.

Aware of these characteristics, we have recently proposed a preliminary approach to deal with substitutional mutations in DNA sequences [6]. In this paper, we consolidate the concept of substitutional tolerant Markov models (STMMs) and apply them to the relative compression case. We then measure their impact on synthetic genomic data, exploring the effect of compressing the elements in reverse order, as well as some combinations of both directions. Finally, we show comparative results between whole genomes.

2 Substitutional Tolerant Markov Model (STMM)

A substitutional tolerant Markov model (STMM) is a probabilistic-algorithmic finite-context model. It assigns probabilities according to a conditioning context in which the last symbol is not the symbol that actually occurred in the sequence, but rather the most probable symbol given the occurrences stored in memory, such as those from y.

For a symbol \(s\in \varTheta \), the estimator of a STMM, working in relative mode, is given by

$$\begin{aligned} P(s|{x'}_{i-k}^{i-1}, y) = \frac{N(s|{x'}_{i-k}^{i-1}, y)+\alpha }{N({x'}_{i-k}^{i-1}, y)+ \alpha |\varTheta |}, \end{aligned}$$
(3)

where function N accounts for the memory counts regarding the model and \(x'\) is a copy of x, edited according to

$$\begin{aligned} {x'}_{i} = \mathop {\mathrm{argmax}}\limits _{\forall s \in \varTheta }{P(s|{x'}_{i-k}^{i-1}, y)}. \end{aligned}$$
(4)

The parameter \(\alpha \) allows balancing between the maximum likelihood estimator and a uniform distribution. For deeper orders, \(\alpha \) should generally be lower than one.
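Given counts gathered over y, the estimator of Eq. 3 can be sketched as follows (a minimal illustration; `estimate` is a hypothetical helper operating on a plain dictionary of context counts):

```python
def estimate(counts, context, symbol, alpha, alphabet=("A", "C", "G", "T")):
    """Sketch of the estimator in Eq. 3:
    P(s | ctx, y) = (N(s|ctx,y) + alpha) / (N(ctx,y) + alpha*|Theta|)."""
    ctx_counts = counts.get(context, {})
    n_s = ctx_counts.get(symbol, 0)          # N(s | ctx, y)
    n_total = sum(ctx_counts.values())       # N(ctx, y)
    return (n_s + alpha) / (n_total + alpha * len(alphabet))
```

For example, if context AC was followed twice by G in y and \(\alpha =1\), then \(P(\mathrm{G}|\mathrm{AC})=(2+1)/(2+4)=0.5\) and each unseen symbol gets \(1/6\); an unseen context yields the uniform \(1/4\).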

When a STMM (relative or non-relative) cooperates with other models, it is not only probabilistic but also algorithmic, because it can be switched on or off depending on its performance, according to a threshold, t, defined before the computation.

Both relative and non-relative modes work with a threshold, t, that enables or disables the model according to the number of times the context has been seen. Listing 1.1 describes the process for enabling or disabling a STMM.

(Listing 1.1. Process for enabling or disabling a STMM.)

The threshold, t, is set at the beginning of the computation. We also need a Boolean cache-array (history) to store the past k hits/fails. For example, consider that \(k=7\) and that \(c_0 = \mathrm {C}\mathrm {A}\mathrm {C}\mathrm {G}\mathrm {T}\mathrm {C}\mathrm {A}\) is the current context. Also, consider that the number of past symbol occurrences following \(c_0\) was \(\mathrm {A}=1, \mathrm {C}=0, \mathrm {G}=0, \mathrm {T}=0\). If the symbol being compressed is \(\mathrm {G}\) (contradicting the probabilistic model), a MM would have as next context \(c_1 = \mathrm {A}\mathrm {C}\mathrm {G}\mathrm {T}\mathrm {C}\mathrm {A}\mathrm {G}\). However, the STMM would use \(c'_1\), taking into account the most probable outcome and, hence, \(c'_1 = \mathrm {A}\mathrm {C}\mathrm {G}\mathrm {T}\mathrm {C}\mathrm {A}\mathrm {A}\). Therefore, the next probabilistic state is conditioned on the context assumed to have been seen: the model proceeds as if the compressed symbol had been \(\mathrm {A}\).
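One step of this mechanism, combining the argmax substitution of Eq. 4 with the hit/fail history, can be sketched as follows. This is a hedged illustration: the exact switching rule, the history update, and the tie-breaking in the argmax are our own assumptions, not the authors' exact procedure from Listing 1.1.

```python
def stmm_step(context, true_symbol, counts, history, t, k,
              alphabet=("A", "C", "G", "T")):
    """One STMM step (a sketch; Listing 1.1 gives the exact rule).

    Picks the most probable symbol for `context` (Eq. 4), records a
    hit or fail in the sliding `history` of the last k outcomes, and
    reports the model as disabled once the fails in `history` exceed
    the threshold t. Returns (next_context, hit, enabled)."""
    ctx_counts = counts.get(context, {})
    # argmax over the alphabet; ties broken by alphabet order (assumption)
    guess = max(alphabet, key=lambda s: ctx_counts.get(s, 0))
    hit = (guess == true_symbol)
    history.append(hit)
    if len(history) > k:               # keep only the last k hits/fails
        history.pop(0)
    enabled = history.count(False) <= t
    # the edited sequence x' keeps the guessed symbol, not the true one
    next_context = (context + guess)[-k:]
    return next_context, hit, enabled
```

Running the worked example from the text (context CACGTCA, counts A=1, true symbol G) yields the next context ACGTCAA, as described above.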

3 Results

To produce the results, we have used both synthetic and real data. The synthetic data enabled a controlled analysis of the STMMs, while the real data exposed characteristics that are not controlled. The materials to replicate the results on both synthetic and real data are available, under the GPL v3 license, at the repository https://github.com/pratas/STMM. All experiments were run on Ubuntu Linux v16.04 LTS, with gcc v5.3.1, using a single Intel Core i7-6700K 3.4 GHz CPU, 32 GB of RAM and a solid-state drive.

3.1 Synthetic Data

In Fig. 1 we have simulated a sequence y with 200 bases, copied y to x, and inserted edits at several positions of x, specifically at positions 50, 100, 102, 150, 152 and 154. Then we have compressed x relatively to y, processing the elements of x as \(x_1,x_2,...,x_{|x|}\) (right direction), as \(x_{|x|},...,x_2,x_1\) (left direction), and taking the minimum of the complexities of both directions (min).
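The min combination of the two directions reduces, per element, to taking the smaller of the two code lengths. A trivial sketch, assuming the per-element code lengths of both passes are available as lists:

```python
def min_direction_profile(right_bits, left_bits):
    """Per-element `min` of the two compression directions:
    right_bits[i] is the code length of x_i when x is processed as
    x_1..x_|x|, left_bits[i] when processed as x_|x|..x_1."""
    return [min(r, l) for r, l in zip(right_bits, left_bits)]
```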

Fig. 1.

Relative compression using a cooperative set of MMs (left plot) and a cooperative set of MMs and STMMs (right plot). The compression direction is shown for right and left, as well as the minimum (min) between both for each element. The data is synthetic. The length is in bytes (B). The experiment can be replicated using the script runSmallBidirection.sh, from the repository described in this paper.

As can be seen, the cooperation between MMs and STMMs leads to a much better approximation of the data. While the MMs cannot efficiently address the data after a substitution occurs, for a period that seems related to the context size k, the cooperation between MMs and STMMs addresses it efficiently, with an almost immediate decay to a low complexity value.

In Fig. 2 we have simulated a sequence y. Then, we have made 12 copies, applied to each one some degree of random substitutional mutations, and concatenated all of them into a final sequence, x. Then we have compressed it, using \(C(x_i\Vert y)\), and plotted the result. As can be seen, the cooperation of MMs alone reaches an average of 1 BPB (bits per base) already at 7.5% of substitutional mutations, while the cooperation between MMs and STMMs only reaches the same BPB at 15% of substitutional mutations.
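The construction of the mutated copies can be sketched as follows (a hypothetical `mutate` helper, assuming independent per-base substitutions; the actual generator in the repository may differ):

```python
import random

def mutate(seq, rate, alphabet="ACGT", seed=0):
    """Apply random substitutional mutations to a copy of `seq` at
    the given per-base rate: each base is independently replaced by
    a different symbol of the alphabet with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)
```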

Fig. 2.

Relative compression using a cooperative set of MMs (left plot) and a cooperative set of MMs and STMMs (right plot). The synthetic data has been copied from y, creating multiple concatenated x's. For each 100k of data (bottom axis), a substitutional mutation rate has been applied (top axis). Besides normal, the legend shows the computation of min and max, i.e., the minimum (min) and maximum (max) functions of each element processed in the left and right directions. The length is in megabytes (M). The experiment can be replicated using the script runRelativeBidirection.sh, from the repository described in this paper.

3.2 Real Data

We have used two eagle whole genomes in non-assembled mode, namely the White-tailed eagle (Haliaeetus albicilla, 1.14 GB, 26X) and the Bald eagle (Haliaeetus leucocephalus, 1.26 GB, 88X), from [24]. We have also used the reference genomes of human, chimpanzee, gorilla, orangutan, and marmoset from the NCBI. We have used a setup of 4 MMs in cooperation, with orders k of \(\{4,6,13,20\}\) and \(\alpha \) of, respectively, \(\{1,1,0.5,0.005\}\). Only one STMM was used, with \(k=20\), \(\alpha =0.5\) and \(t=5\). The experiments can be replicated using the scripts runBirds.sh and runPrimates.sh.

Fig. 3.

Compression improvement and added compression time between the relative compression using a cooperative set of MMs and a cooperative set of MMs and STMMs. Percentages are given by \((1-STMM_{bytes}/MM_{bytes})\times 100\) for compression improvement and \((STMM_{minutes}/MM_{minutes}-1)\times 100\) for time added.

As can be seen in Fig. 3, to compress the Bald eagle relatively to the White-tailed eagle, using only a cooperation of MMs, we needed 34,864,683 bytes. Adding the cooperation of the STMMs, we were able to spend only 31,561,247 bytes, which is around 10% of improvement, using the same RAM (13.8 GB) and around 10% more computational time. These species are believed to have diverged \(\approx \)1 million years ago (mya) [25].

To compress a chimpanzee relatively to a human genome, using only a cooperation of MMs, we needed 274,450,972 bytes and nearly 80 min. Adding the cooperation of the STMMs, we were able to spend only 210,691,987 bytes, which is around 23% of improvement, using the same RAM (26.3 GB) and around 22.5% more computational time. The human and chimpanzee lineages are believed to have diverged \(\approx \)3–4.5 mya [26].

To compress a gorilla relatively to a human genome, using only a cooperation of MMs, we needed 262,271,376 bytes. Adding the cooperation of the STMMs, we were able to spend only 199,204,749 bytes, which is around 24% of improvement, using the same RAM (26.3 GB) and around 19.8% more computational time. The human and gorilla lineages are believed to have diverged \(\approx \)5–9 mya [26].

To compress an orangutan relatively to a human genome, using only a cooperation of MMs, we needed 418,481,411 bytes. Adding the cooperation of the STMMs, we were able to spend only 299,316,387 bytes, which is around 28.5% of improvement, using the same RAM (26.3 GB) and around 19.2% more computational time. The human and orangutan lineages are believed to have diverged before 10 mya [26].

Finally, to compress a marmoset relatively to a human genome, using only a cooperation of MMs, we needed 562,916,901 bytes. Adding the cooperation of the STMMs, we were able to spend only 488,238,361 bytes, which is around 13.3% of improvement, using the same RAM (26.3 GB) and around 18.8% more computational time. The human and marmoset lineages are believed to have diverged \(\approx \)40 mya [27].
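The improvement percentages above follow directly from the reported byte counts, e.g. for the orangutan and chimpanzee cases:

```python
def improvement(mm_bytes, stmm_bytes):
    """Compression improvement of MMs+STMMs over MMs alone, in percent:
    (1 - STMM_bytes / MM_bytes) * 100."""
    return (1 - stmm_bytes / mm_bytes) * 100

# byte counts taken from Sect. 3.2
print(round(improvement(418481411, 299316387), 1))  # orangutan vs. human: 28.5
print(round(improvement(274450972, 210691987)))     # chimpanzee vs. human: 23
```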

4 Conclusions

In this paper, we have proposed a new model for relative compression of DNA sequences: the substitutional tolerant Markov model (STMM). We have shown that it efficiently addresses some degree of substitutional mutations, making it an efficient model to use between species that diverged less than 40 million years ago, such as some primates or eagles. The time the model adds to the compressor is affordable, given the compression improvement; for example, between human and orangutan it is around 28.5%. This model is, therefore, a strong candidate for use in ancient DNA analysis, namely because of the high substitutional mutation rates of such data.