Introduction

DNA sequences are often compared with the texts written in a language, either a natural human language or an artificial one. The fundamental difference is that all usual human texts are organized linearly so that if there is more than one message, the messages do not overlap but follow one another. This does not apply to natural genetic DNA sequences. The same fragment of a sequence typically carries more than one functional message, and these messages overlap. Holliday (1968) predicted this phenomenon considering signals responsible for recombination, which may well reside within the protein-coding sequences. This would be allowed, perhaps, due to degeneracy of the triplet code. The idea was developed further in works by Schaap (1971), Zuckerkandl (1976), Eigen and Schuster (1979), and Trifonov (1981) and confirmed by numerous experimental data, as reviewed by Normark et al. (1983). In a generalized form (Trifonov 1989) any sequence pattern that corresponds to a certain function is considered a code. Due to the degeneracy of the codes, “corresponding messages are not only interspersed, but actually overlapped, so that some nucleotides belong to several messages simultaneously.” The idea of the multiplicity of the codes of nucleotide sequences and their overlapping also appeared in studies by Caporale (1984), Staden (1984), Kypr (1986), and Konings et al. (1987) (reviewed by Trifonov 1996).

One example of such overlapping is when a biologically significant RNA secondary structure overlaps sequencewise with a protein-coding region. Several well-known RNA secondary structures along the HIV-1 genome play various functional roles during the virus life cycle while residing in protein-coding regions. One of the best known is the RNA structure of the Rev Responsive Element (RRE), which interacts with the Rev Trans-Activator protein (Dayton et al. 1989; Kjems et al. 1991; Malim et al. 1989, 1990). A biologically functional RNA secondary structure motif may also stimulate ribosomal frameshifting in the gag–pol overlapping region in retroviruses. The exact nature of this signal is still controversial. Various researchers define it as either a pseudoknot (Le et al. 1991; Morikawa and Bishop 1992), a stem–loop structure (Parkin et al. 1992; Vickers and Ecker 1992), or a two-stem structure (Dulude et al. 2002). The translational frameshifting message has also been suggested to reside in the (G–non-G–N)n mRNA periodicity (Trifonov 1987, 1992; Lagunez-Otero and Trifonov 1992), which may be responsible for monitoring the correct reading frame during translation by forming transient complementary complexes with the C-periodical structure of the ribosomal RNA. Staple and Butcher (2003) supported the frameshifting theory with an NMR analysis of the RNA of the gag–pol region. This analysis indicated a frameshift-inducing stem–loop element made of an A-form helix capped by a structured ACAA tetraloop.

Superimposed messages can be revealed even without knowing the biological meaning of the hidden message. A simple statistical approach, based on processing data from a combination of multiple alignments and RNA secondary structure predictions, made it possible to deduce distinct RNA folds in the first conserved (C1) protein-coding region of the env gene (Peleg et al. 2002) and in the β3/β4 region of the nef gene of HIV-1 (Peleg et al. 2003). It is important to distinguish between overlapping messages that play a role in the expression of the gene they overlap with and those that are functionally unrelated to that gene. The first group includes ribosomal frameshift signals, attenuators, antiterminators, and nucleosome positioning codes. An example of a message from the second group is the RRE RNA secondary structure encoded by the sequence of the env gene in retroviruses, which does not play any role in the expression of the protein gp160 encoded by the same sequence.

Viruses often have very compact genomes, and overlapping of messages occurs frequently in them. Konings (1992) studied a specific type of multiple-coding problem in lentiviruses, taking the RRE structure as a unique biological example. TAR is another well-known RNA secondary structure of retroviruses, which is not located inside any coding sequence. Comparison of the mutation rates of the TAR and RRE regions revealed that, in the case of the RRE, overlapping with the protein-coding message could be viewed as a factor constraining the evolutionary divergence of the element by increasing its selective value. Wagner and Stadler (1999) studied the genomes of single-stranded RNA viruses (dengue virus, hepatitis C virus, and HIV-1), focusing on the mutational stability of conserved and nonconserved viral secondary structure elements. Using this comparison, they concluded that mutational robustness and monomorphism are associated with an RNA secondary structure message that overlaps with a protein-coding message.

Overlapping messages occur frequently in prokaryotic genomes and in prokaryote-derived organelles such as mitochondria (Normark et al. 1983). Fukuda et al. (1999) identified 160 overlapping gene pairs in the genome of M. pneumoniae and 155 overlapping gene pairs in the genome of M. genitalium.

A number of reports indicate the existence of overlapping messages in nuclear genomes of eukaryotes. Coelho et al. (2002) found that coding information for protein (TAR1) and structural RNAs (rRNA) in Saccharomyces cerevisiae can overlap, raising issues regarding the coevolution of such complex genes. Shintani et al. (1999) reported the overlapping of genes ACAT2 and TCP1 in their 3′ untranslated regions (UTRs). They appear in a tail-to-tail orientation, while their coding sequences are located on the opposite strands. Zhou and Blumberg (2003) found that the genes VLCAD and DLG4 are arranged in a head-to-head orientation on human chromosome 17p13 and share a 245-bp overlapping region that contains part of DLG4 exon 1 and the entire exon 1 of VLCAD including 62 bp of protein-coding sequence. Edgar (2003) found that the genes ABHD1 and Sec12 overlap. These genes, located on human chromosome 2p23.3, share 42 bp of the 3′-UTR in an antisense manner.

Species with a low quality of replication can maintain only a short genome (Eigen and Schuster 1979), which may be too small to store all the necessary information in a sequential manner. To increase the amount of information that can be stored, “the quantity of information per length unit has to be increased; i.e., part of the genome has to code for multiple functions” (Huynen et al. 1993). Hogeweg and Hesper (1992) studied the evolutionary dynamics leading to multiple coding. They showed that a high mutation rate and crossing-over lead to “multiple coding.” However, they concluded that “multiple coding often does not increase the fitness of the population; nevertheless, it is selected.” Huynen et al. (1993) evaluated the transition from RNA primary sequences to RNA secondary structure in the RRE region in the env gene of lentiviruses and Visna virus. The results of this study indicated a variation in the initial ruggedness of fitness landscapes that plays an essential role in the evolution and optimization of RNA secondary structure encoded in the translated region. On the other hand, simulation of an evolutionary search process for a specific secondary structure shows a reduction of allowable point mutations and a reduction of the possibility for small-scale adaptation. Apparently, this decreases the final fitness of the region of overlapping. High fitness as a prerequisite for multiple coding is discussed by Pavesti et al. (1997). Studying the informational content of overlapping genes in prokaryotic and eukaryotic viruses, the authors revealed an increased frequency of amino acid residues with high levels of degeneracy in proteins encoded by overlapping genes. Krakauer (2000) proposed a mathematical model to estimate the stability and evolution of overlapping genes in various orientations in terms of information cost. Krakauer assumed that the superposition increases coupling between functionally related genes and concluded that overlapping at the 3′ end decays more slowly than that at the 5′ end.

If the cost of the superposition is so high, is it at all advantageous? In this article, we propose two versions of a model in which multiple coding directly enhances the fitness under conditions of a high mutation rate.

Models of Advantageous Overlapping of Biological Messages

A Simple Model

Let us consider an example with two adjacent messages in a given genome. Each message contains several crucial residues (uppercase) along its base sequence.

Seq 1:ActggtGttaTCtttaCcgATAggaTGgccttActC

Seq 2:CaGggaaggAaaCagtTAgCcaGtcaAtcgGtagT

Here the total number of crucial residues in both messages is 23 (N = 23). For simplicity, other residues are here considered neutral, replaceable by any other residue. Consider now the overlapping as below, such that the matching residues (boldface letters) are identical in both sequences:

The superposition (Seqoverlap) contains both messages, but the total number of crucial residues (uppercase), including the common ones, is now smaller (18). Reducing the number of crucial residues (the target size) by 5 would thus increase the fitness of the organism. In other words, the fitness λ decreases as the number of crucial positions N grows. In the case of overlapping messages, N_overlap < N_non. As a result, λ_overlap > λ_non.
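The counting argument can be checked with a short script. The sketch below is illustrative only: the helper names `crucial_count` and `merged_crucial` are hypothetical, and the offset of 4 positions is inferred from the stated totals (23 and 18) rather than taken from the original alignment. Two crucial residues are treated as compatible only if they are identical, and neutral (lowercase) residues are assumed to be compatible with anything.

```python
# Sequences copied from the text; uppercase letters mark crucial residues.
SEQ1 = "ActggtGttaTCtttaCcgATAggaTGgccttActC"
SEQ2 = "CaGggaaggAaaCagtTAgCcaGtcaAtcgGtagT"


def crucial_count(seq):
    """Number of crucial (uppercase) residues in a message."""
    return sum(1 for ch in seq if ch.isupper())


def merged_crucial(seq_a, seq_b, shift):
    """Crucial residues left after overlaying seq_b on seq_a.

    seq_b[j] is placed on top of seq_a[j + shift]. Returns None when two
    different crucial residues collide (assumed to be an incompatible overlap).
    """
    shared = 0
    for j, b in enumerate(seq_b):
        i = j + shift
        if 0 <= i < len(seq_a):
            a = seq_a[i]
            if a.isupper() and b.isupper():
                if a != b:
                    return None
                shared += 1  # one position now serves both messages
    return crucial_count(seq_a) + crucial_count(seq_b) - shared


n_non = crucial_count(SEQ1) + crucial_count(SEQ2)
n_overlap = merged_crucial(SEQ1, SEQ2, shift=4)  # shift=4 is an inferred offset
print(n_non, n_overlap)
```

With these inputs the script reports 23 crucial residues for the separate messages and 18 for the overlaid ones, in agreement with the counts given above.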

The Model Based on Probabilities of Lethal Mutations per Nucleotide in A, B, and A/B

The above model can be described in detail by considering two sequence messages sliding toward their superposition forming a common region. Let us have two messages, A and B. Message A has length n1; message B has length n2. m is the length of the overlapping region a/b in the A/B merge.

Let μ1 be the probability of a (lethal) mutation per nucleotide for message A. (Further on, we consider only lethal mutations.) Let μ2 be the probability of a mutation per nucleotide of message B, and let ν be the probability of a mutation per nucleotide in the overlapping region. Therefore, the probability that there will be no mutation in the interval n1 − m is equal to (1−μ1)^(n1−m), and by the same token the probability of no mutation in the interval n2 − m is (1−μ2)^(n2−m). The probability that there will be no mutation in the overlapping interval m is equal to (1−ν)^m. Consequently, the probability P_m of the total absence of any mutation in the whole interval is equal to the product of these probabilities,

$$P_m = (1-\mu_1)^{n_1-m}\,(1-\mu_2)^{n_2-m}\,(1-\nu)^{m} \tag{1}$$

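Obviously, 0 ≤ μ1, μ2, ν ≤ 1. Let us rewrite P_m as

$$P_m = C\left[\frac{1-\nu}{(1-\mu_1)(1-\mu_2)}\right]^{m} \tag{2}$$

where the constant C = (1−μ1)^(n1) (1−μ2)^(n2) does not depend on m.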

From (2), it follows that the extreme value of P_m is reached at one of the limits of the parameter range. Namely, if (1−ν) < (1−μ1)(1−μ2), then P_m reaches its maximum when m = 0 (P_0 > P_i for 0 < i ≤ min (n1, n2)). If (1−ν) > (1−μ1)(1−μ2), then P_m reaches its maximum when m is maximal, that is, m = min (n1, n2) (P_m > P_i for 0 ≤ i < m). Note that the parameter ν may, in principle, take any value between zero and one in a species-dependent or environment-dependent manner. One reasonable possibility is that ν is equal to the larger of the probabilities μ1 and μ2: ν = max (μ1, μ2). In this case, the inequality (1−ν) > (1−μ1)(1−μ2) always holds for 0 < μ1, μ2 < 1; consequently, the optimum corresponds to the maximal possible overlap. In the case of ν = μ1μ2, the probability that there would not be a mutation in the interval m is equal to (1−μ1μ2)^m, and again, (1−ν) > (1−μ1)(1−μ2).
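The endpoint behavior described above can be illustrated numerically. The following is a minimal sketch, assuming the product form of P_m (no lethal mutation in the two nonoverlapping parts and in the overlap); the function names and all parameter values are hypothetical choices made for illustration.

```python
def survival_probability(m, n1, n2, mu1, mu2, nu):
    """P_m: probability of no lethal mutation with an overlap of m nucleotides."""
    return (1 - mu1) ** (n1 - m) * (1 - mu2) ** (n2 - m) * (1 - nu) ** m


def optimal_overlap(n1, n2, mu1, mu2, nu):
    """Overlap length in 0..min(n1, n2) that maximizes P_m (brute force)."""
    lengths = range(min(n1, n2) + 1)
    return max(lengths, key=lambda m: survival_probability(m, n1, n2, mu1, mu2, nu))


# Illustrative values only (assumptions, not taken from the text).
n1, n2, mu1, mu2 = 120, 100, 1e-3, 1e-3

# nu = max(mu1, mu2): (1 - nu) > (1 - mu1)(1 - mu2), so full overlap wins.
best_shared = optimal_overlap(n1, n2, mu1, mu2, nu=max(mu1, mu2))
# A much larger nu reverses the inequality, so no overlap wins.
best_costly = optimal_overlap(n1, n2, mu1, mu2, nu=0.01)
print(best_shared, best_costly)
```

In both regimes the brute-force search lands on a boundary of the range, m = min (n1, n2) in the first case and m = 0 in the second, as expected for a function of the form C·q^m.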

To illustrate this equation graphically we assumed that the probability of a lethal mutation per one nucleotide μ1 is equal to mutation rate α multiplied by “index of lethality” λ and ν = μ1 = μ2. Then the survival index σ for estimating the difference between the nonoverlapping situation and overlapping of m bases (σ = P0P m ) is

(3)

Figure 1A–C illustrate how the survival index is influenced by the index of lethality, mutation rate, and length of the superposition. All figures correspond to mutation rate values between 0 and 0.0001 and superposition lengths between 0 and 100 bases. Figure 1A corresponds to lethality index λ = 0.3 (30% of critical points). Figure 1B and C correspond to lethality index λ = 0.6 and λ = 0.9, respectively. We would like to emphasize that the actual values of the survival index are not important. For the purposes of this analysis what matters is its behavior as a function of the parameters.
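The qualitative dependence on mutation rate, lethality index, and overlap length can be sketched in a few lines. This is a rough illustration under stated assumptions: the function name `survival_gain` is hypothetical, the message lengths n1 = n2 = 200 are arbitrary, ν = μ1 = μ2 = α·λ as assumed for the figures, and the sign is chosen here so that positive values favor overlapping.

```python
def survival_gain(alpha, lethality, m, n1=200, n2=200):
    """Difference between the survival probabilities with and without overlap.

    Assumes nu = mu1 = mu2 = alpha * lethality. The sign convention
    (P_m minus P_0) and the default lengths are assumptions of this sketch.
    """
    mu = alpha * lethality
    p_0 = (1 - mu) ** (n1 + n2)      # no overlap: n1 + n2 vulnerable positions
    p_m = (1 - mu) ** (n1 + n2 - m)  # overlap of m: m fewer vulnerable positions
    return p_m - p_0


# Scan the ranges quoted for the figures: alpha up to 0.0001, m up to 100.
for lethality in (0.3, 0.6, 0.9):
    row = [survival_gain(1e-4, lethality, m) for m in (0, 50, 100)]
    print(lethality, ["%.5f" % value for value in row])
```

Under these assumptions the gain is zero at m = 0 and grows with the overlap length, the mutation rate, and the lethality index; the absolute numbers depend on the arbitrary lengths and carry no meaning of their own.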

Figure 1. Survival index as a function of overlap length and mutation rate. A λ = 0.3 (30% of critical points). B λ = 0.6. C λ = 0.9.

Figure 1A–C demonstrate that the survival index increases with the degree of overlapping, in particular under conditions of a high mutation rate. This is a simple formal illustration of the general idea of the advantage of overlapping. Formally, the survival index as defined and parameterized in this model would increase indefinitely and cause the whole genome to overlap. This situation is clearly unrealistic. A penalty should be introduced to take into account a consequence of the compromise, which decreases the ability of the overlapping (degenerate) messages to vary. The compromise of the overlapping may lead, for example, to a nonoptimal choice of an amino acid at a particular location in the protein. It might also cause a mismatch between a specific nucleotide in an RNA secondary structure and its counterpart in the opposite strand of the stem. Both would cause a reduction of the phenotypic repertoire. This justifies the assumption that the fitness also depends on the length of the overlap. Let us introduce the penalty for the overlapping as a function f(m) that decreases with an increase in m. Then the general fitness λ_m becomes

$$\lambda_m = f(m)\,P_m \tag{4}$$

Examples with different possible types of penalty function f(m) are considered below.

(1) f(m) = (1−β)^m. In this case the penalty for each overlapping nucleotide (letter) equals a constant value β, so the probability that no letter is penalized is (1−β)^m. Then

$$\lambda_m = (1-\beta)^{m} P_m = C\left[\frac{(1-\nu)(1-\beta)}{(1-\mu_1)(1-\mu_2)}\right]^{m} \tag{5}$$

where C is the constant of (2). Obviously, the maximum of λ_m is reached either when m = 0 or when m = min (n1, n2). As in the previous case, if (1−ν)(1−β) < (1−μ1)(1−μ2), then λ_m is maximal when m = 0. If (1−ν)(1−β) > (1−μ1)(1−μ2), then λ_m is maximal when m = min (n1, n2).
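The switch between the two regimes can be expressed through a threshold on β. The sketch below is an illustration under assumptions: the helper `critical_penalty` is hypothetical and simply solves (1−ν)(1−β) = (1−μ1)(1−μ2) for β, and the numerical values are arbitrary.

```python
def fitness_exp_penalty(m, n1, n2, mu1, mu2, nu, beta):
    """Fitness with a constant per-nucleotide penalty: (1 - beta)^m times P_m."""
    p_m = (1 - mu1) ** (n1 - m) * (1 - mu2) ** (n2 - m) * (1 - nu) ** m
    return (1 - beta) ** m * p_m


def critical_penalty(mu1, mu2, nu):
    """Value of beta at which (1 - nu)(1 - beta) equals (1 - mu1)(1 - mu2)."""
    return 1 - (1 - mu1) * (1 - mu2) / (1 - nu)


def best_overlap(n1, n2, mu1, mu2, nu, beta):
    """Overlap length in 0..min(n1, n2) with the highest fitness (brute force)."""
    lengths = range(min(n1, n2) + 1)
    return max(lengths,
               key=lambda m: fitness_exp_penalty(m, n1, n2, mu1, mu2, nu, beta))


# Illustrative values only (assumptions): equal rates, nu = mu1 = mu2.
n1, n2, mu = 120, 100, 1e-3
beta_star = critical_penalty(mu, mu, mu)

mild = best_overlap(n1, n2, mu, mu, mu, beta=0.5 * beta_star)    # below threshold
severe = best_overlap(n1, n2, mu, mu, mu, beta=2.0 * beta_star)  # above threshold
print(beta_star, mild, severe)
```

A penalty below the threshold leaves the maximal overlap optimal, whereas a penalty above it makes any overlap disadvantageous; with ν = μ1 = μ2 = μ the threshold is simply β = μ.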

(2) f(m) = 1/(1+m). In this case

$$\lambda_m = \frac{P_m}{1+m} = \frac{C}{1+m}\left[\frac{1-\nu}{(1-\mu_1)(1-\mu_2)}\right]^{m} \tag{6}$$

If (1−ν) < (1−μ1)(1−μ2), then λ_m decreases with an increase in m. This means that m = 0 (no overlap) has the best fitness. If (1−ν) > (1−μ1)(1−μ2), then λ_m is not a monotonic function of m. An extreme value would be reached when

$$m_0 = \frac{1}{\ln\dfrac{1-\nu}{(1-\mu_1)(1-\mu_2)}} - 1 \tag{7}$$

However, it is easy to show that λ_m0 is the minimum of the function. Thus, as before, the maximum of λ_m is reached either when m = 0 or when m = min (n1, n2).
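That the interior extremum is a minimum can be verified numerically. The sketch below assumes the fitness P_m/(1+m) and treats m as a continuous variable; setting the derivative of ln λ_m, namely ln q − 1/(1+m) with q = (1−ν)/((1−μ1)(1−μ2)), to zero gives the stationary point used here. Function names and parameter values are hypothetical.

```python
import math


def fitness_hyperbolic(m, n1, n2, mu1, mu2, nu):
    """Fitness with penalty f(m) = 1 / (1 + m): P_m divided by (1 + m)."""
    p_m = (1 - mu1) ** (n1 - m) * (1 - mu2) ** (n2 - m) * (1 - nu) ** m
    return p_m / (1 + m)


def stationary_overlap(mu1, mu2, nu):
    """Stationary point m0 = 1 / ln(q) - 1; only meaningful when q > 1."""
    q = (1 - nu) / ((1 - mu1) * (1 - mu2))
    return 1 / math.log(q) - 1


# Illustrative values only (assumptions): nu = mu1 = mu2 = 0.01, long messages.
n1 = n2 = 300
mu = 0.01
m0 = stationary_overlap(mu, mu, mu)

at_m0 = fitness_hyperbolic(m0, n1, n2, mu, mu, mu)
left = fitness_hyperbolic(m0 - 10, n1, n2, mu, mu, mu)
right = fitness_hyperbolic(m0 + 10, n1, n2, mu, mu, mu)

# Brute-force search over integer overlaps for the actual maximum.
best = max(range(min(n1, n2) + 1),
           key=lambda m: fitness_hyperbolic(m, n1, n2, mu, mu, mu))
print(round(m0, 2), at_m0 < left, at_m0 < right, best)
```

The fitness at m0 is lower than on either side, so the stationary point is a minimum, and the brute-force maximum falls on a boundary of the admissible range.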

(3) , where ω is a relatively small value. This penalty function has a weaker length dependence than in case 2. Consequently,

(8)

An extreme value would be reached when

(9)

0 ≤ m0 ≤ min (n1,n2) and, in principle, may be any value in this range. If ν = max (μ12), assuming μ12, then

(10)

That is, approximately

(11)

Thus, the optimal m, which is the optimal length of the overlapping region, depends on the ratio between the mutation rate μ and the coefficient ω of the cumulative influence of an overlapping length m in the penalty function f(m).

From formulae (10) and (11) it is apparent that the optimal m does not depend upon the lengths of the messages. It should obey the inequality 0≤m≤min (n1,n2), and it may happen that m would be less than required by (10) and equal to min (n1,n2), which is the case of the maximal possible overlap.

Figure 2A–C correspond, respectively, to Figs. 1A–C modified by the addition of the penalty function. The penalty function is . The survival index σ = λ0−λ m is

(12)

Figure 2. Same as Fig. 1, with penalty introduced. A λ = 0.3. B λ = 0.6. C λ = 0.9 (90% of critical mutations).

The survival index indicates an evolutionary advantage for overlapping as long as it is higher than zero. Figure 2 shows that since the penalty depends only on the extent of the overlapping, its increase leads to accumulation of the compromises. The information contained in at least one of the messages decreases, which leads to the loss of the evolutionary advantage of the overlapping. Values of the parameters in the formulae above (overlapping length m, mutation rate μ, lethality index λ, parameter ω) are only of illustrative significance. Moreover, each sequence position has, actually, its own penalty for the message overlapping. However, the plots above illustrate in a semiquantitative way how the accumulation of compromises that accompanies the overlapping restricts the length of the overlapping region. This occurs even under favorable conditions for overlapping such as a high mutation rate and reduction of the number of crucial points.

The Model Based on Probabilities of Lethal Mutations per Interval

In this model we assume that the probability of one or more lethal mutations within an interval is proportional to the length of the interval. The definitions follow.

Q m is the probability that a lethal mutation happens inside the overlapping interval,

(13)

and the probability P m ,

(14)

to compare with (1).

From (14) it follows that the optimal m corresponds to the maximal overlap m = min (n1, n2). Let us introduce a penalty for each overlapping nucleotide (letter) equal to a constant value β. Then the probability that no letter would be penalized is (1−β)^m. Accordingly

(15)

and the extreme value of a general fitness λ m would be reached when

(16)

A very simple and meaningful conclusion follows from this formula: The higher the penalty for the overlapping, the lower the advantage of the overlapping.

Discussion

A genome region with two overlapping messages is considered, such that each message contains several crucial residues, while the other residues are assumed to be neutral. The models are proposed to demonstrate how the superposition of critical points and the mutation rate may influence the survival index. The interplay between the advantage of the superposition and the penalty for restricting the variability of the (degenerate) messages is analyzed. The conclusion is that, under reasonable assumptions, the degree of optimal overlapping depends on the ratio between the mutation rate and the cumulative penalty function. A low mutation rate and incompatibility of the messages (too high a cost for the compromise in the overlapping region) make overlapping disadvantageous. Conversely, a high mutation rate and high tolerance of the juxtaposed sequence messages make the superposition beneficial.

Usually, the phenomenon of overlapping is explained by the need to store a large quantity of information in a small genome (Eigen and Schuster 1979). The model that we present here is in general accord with previous models associating a high mutation rate with message overlapping. However, unlike the previous models (Hogeweg and Hesper 1992; Huynen et al. 1993), our analysis suggests a direct evolutionary advantage for message overlapping under conditions of a high mutation rate. Moreover, our model proposes a particular mechanism to realize this evolutionary advantage: the superposition of critical points, which reduces their number in the genome. This simple model involves two opposing forces that balance the degree of overlapping. The first force is the reduction of the number of vulnerable points; the second, opposing force is a penalty for deterioration of the messages, which gradually brings the gain down to zero. This penalty is quite similar to the information cost described by Krakauer (2000). He described the information cost in terms of the tendency of a population to become monomorphic, which restricts the ability of polypeptides to be fine-tuned. Although our model describes the overlapping messages as sliding toward each other, we do not mean that this is the only possible mechanism. The sliding presentation is only a simplification that helps to illustrate the idea of two opposing forces optimizing the effect of overlapping.