Abstract
Single-strand DNA symmetry has been pointed out as a universal law observed in the genomes of all living organisms. It is a rather broadly defined concept, which has been refined into more specific, measurable effects. Here we discuss the exceptional symmetry effect. Exceptional symmetry is the symmetry effect beyond that expected under independence, and it can be measured for each word, for each equivalent composition group, or globally, combining the effects of all possible words of a given length. Global exceptional symmetry has been found in several species, but some genomic words show no exceptional symmetry effect, whereas others show a very high one. In this work, we discuss a measure to evaluate the exceptional symmetry effect by symmetric word pair, and compare it with others. We present a detailed study of exceptional symmetry by symmetric pairs, taking the CG content into account. We also introduce and discuss the exceptional symmetry profile for the DNA of each organism, and we perform a multiple comparison for 31 genomes: 7 viruses; 5 archaea; 5 bacteria; 14 eukaryotes.
1 Introduction
Erwin Chargaff was a biochemist who discovered a set of intriguing rules about the composition of DNA from the analysis of bacterial genomes [1]. The first rule states that, in double-stranded DNA, the percentages of complementary nucleotides (A-T and C-G) must be equal. This is now known to result from the double helix structure of DNA [2]. The second rule states that the percentages of complementary nucleotides are also approximately equal within each strand [3–5], [6, chap. 4].
A natural extension of Chargaff’s second parity rule is that, in each DNA strand, the number of occurrences of a given word (oligonucleotide or k-mer) should match that of its reversed complement [6]. The extension to the second parity rule is also known as the single-strand symmetry phenomenon. This symmetry phenomenon refers to the distributions of symmetric pairs, i.e., the distribution of occurrences of all words and the distribution of occurrences of the corresponding reversed complements.
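The pairing of a word with its reversed complement can be illustrated with a minimal Python sketch. The function names (`reversed_complement`, `pair_counts`) are illustrative choices, not taken from the original work, and the input is assumed to contain only the lowercase symbols a, c, g and t.

```python
from collections import Counter

# Complement table for lowercase DNA symbols.
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reversed_complement(word):
    """Return the reversed complement of a DNA word, e.g. 'cca' -> 'tgg'."""
    return word.translate(_COMPLEMENT)[::-1]


def pair_counts(sequence, k):
    """Map each symmetric pair (w, w') found in the sequence to (n_w, n_w').

    Words are counted with overlap. Each pair is stored under a sorted key,
    so a word and its reversed complement share one entry.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    pairs = {}
    for word in counts:
        rc = reversed_complement(word)
        key = (word, rc) if word <= rc else (rc, word)
        # Counter returns 0 for a word that never occurs.
        pairs[key] = (counts[key[0]], counts[key[1]])
    return pairs
```

Under single-strand symmetry, the two counts stored for each pair are expected to be close to each other in a sufficiently long sequence.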
Presently, there is no generally accepted justification for the need for single-strand parity in DNA sequences, and there is no consensus explanation for the occurrence of the single-strand symmetry phenomenon. The attempts to explain the phenomenon can be classified into two groups: the conserved patterns model [7–9], and the evolutionary models. Evolutionary models can be further classified according to their underlying hypotheses, for example: the stem-loops hypothesis [10]; the duplication followed by inversion hypothesis [11]; the inversions and inverted transpositions hypothesis [12, 13]; the non-uniform substitutions hypothesis [14]; and the statistical mechanics equilibrium hypothesis [15].
To characterize the symmetry phenomenon, Powdel et al. [16] analyzed the frequency distributions of oligonucleotides in localized windows along a single strand of DNA. They found that the differences between the frequency distributions of reverse complementary oligonucleotides are not statistically significant. Afreixo et al. [17] noted that the frequency of an oligonucleotide is more similar to the frequency of its reversed complement than to the frequencies of other words of equivalent composition (equal-length oligonucleotides with equal CG content). They called this phenomenon exceptional symmetry, defined measures to evaluate it, and identified several word groups with strong exceptional symmetry in the human genome. More recently, a different measure was introduced to overcome a disadvantage of the previous measure of exceptional symmetry by word [18]. This measure evaluates the difference between the number of occurrences of a word and that of its reversed complement, and relates it to the dissimilarities between the numbers of occurrences in the corresponding equivalent composition group.
Here, we introduce an improved exceptional symmetry measure and use it to obtain the word symmetry effects in 31 complete genomes stratified by equivalent composition group for word lengths up to 14. Results confirm that measures of word exceptional symmetry can be used to form clusters of related species. Also, we identify words that show high symmetry effect across the 31 species, and across the 9 animal species studied.
2 Materials
The genomes analyzed here are available from the website of the National Center for Biotechnology Information (NCBI; ftp://ftp.ncbi.nih.gov/genomes/). The complete list of species is indicated in Table 1. We selected genomes of species representative of the major taxonomic groups across the tree of life. These include vertebrates, invertebrates, protozoans, fungi, plants, bacteria (gram-positive and gram-negative), archaea and viruses (both double-stranded and single-stranded DNA and RNA viruses).
All non-sequenced or ambiguous nucleotides (mostly N symbols in the sequence file) were discarded from the analysis. For genomes composed of several chromosomes, the chromosomes were processed as separate sequences. All genome sequences used in this study were processed to obtain the word counts, considering overlap between successive words. We obtained the word counts for word lengths from 1 to 14 nucleotides.
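The counting step can be sketched in Python as follows. This is only an illustration: the text does not say whether the flanks of a discarded symbol were joined, so the sketch assumes that any window overlapping an ambiguous symbol is skipped. The name `count_words` is hypothetical.

```python
from collections import Counter

VALID = set("acgt")


def count_words(sequence, k):
    """Count overlapping words of length k in one sequence.

    Assumption: a window that contains any symbol other than a, c, g, t
    (for example N) is skipped rather than bridged.
    """
    seq = sequence.lower()
    counts = Counter()
    run = 0  # length of the current run of valid symbols ending at position i
    for i, symbol in enumerate(seq):
        run = run + 1 if symbol in VALID else 0
        if run >= k:
            counts[seq[i - k + 1:i + 1]] += 1
    return counts


def count_genome(chromosomes, k):
    """Process chromosomes as separate sequences and add up their counts."""
    total = Counter()
    for chromosome in chromosomes:
        total.update(count_words(chromosome, k))
    return total
```

Because each chromosome is counted on its own, no word spans the boundary between two chromosomes.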
3 Methods
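In a previous work [17], we defined an equivalent composition group (ECG) as the set of words of length k that contain a given number m of nucleotides a or t. For example, for \(k=2\) there are three ECGs:
\[G_0=\{cc, cg, gc, gg\}, \quad G_1=\{ac, ag, ca, ct, ga, gt, tc, tg\}, \quad G_2=\{aa, at, ta, tt\}.\]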
The partition of words into ECGs is also called a binary partition [19]. Consider the binary classification of nucleotides into two types, \(T_1=\{a,t\}\) and \(T_2=\{c,g\}\), and let \(G_m^k\) (or simply, \(G_m\)) be the ECG with words of length k where each word has m symbols of type \(T_1\) and \(k-m\) symbols of type \(T_2\), with \(m \in \{0,1,...,k\}\). There are \(\binom{k}{m}\) ways to choose the positions of the \(T_1\) symbols and two possible nucleotides for each of the k positions, so \(G_m\) has \(N_m\) distinct words,
\[N_m = \binom{k}{m}\, 2^k.\]
Note that, for k-mers there are \(k+1\) ECGs with a total of \(4^k\) words.
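The partition can be checked by brute-force enumeration. The sketch below is illustrative (the function names are not from the original work): it groups all words of length k by their number m of a or t symbols and compares the group sizes with the counting argument, which gives \(\binom{k}{m}\) choices of positions times two nucleotides per position, that is, \(\binom{k}{m}2^k\) words per group.

```python
from itertools import product
from math import comb


def ecg_index(word):
    """Number m of symbols of type T1 = {a, t} in the word."""
    return sum(symbol in "at" for symbol in word)


def ecg_partition(k):
    """Group all 4**k words of length k by their ECG index m."""
    groups = {m: [] for m in range(k + 1)}
    for letters in product("acgt", repeat=k):
        word = "".join(letters)
        groups[ecg_index(word)].append(word)
    return groups


k = 3
groups = ecg_partition(k)
for m, words in groups.items():
    # Group size predicted by the counting argument: C(k, m) * 2**k.
    assert len(words) == comb(k, m) * 2 ** k
# The k + 1 groups together contain every word of length k.
assert sum(len(words) for words in groups.values()) == 4 ** k
```

Enumeration is only practical for small k; for the longer words considered in the study, the index of a word can be computed directly with `ecg_index` without building the whole partition.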
For even values of k, some words are equal to their reversed complement. We denote these as self symmetric words (SSW). We also define a symmetric word pair as the set composed of one word w and the corresponding reversed complement word \(w'\), with \((w')'=w\) (for example, cca and tgg make a symmetric word pair).
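As a small illustration, self symmetric words can be enumerated directly. The helper names below are hypothetical. A self symmetric word is fully determined by its first half, so there are \(4^{k/2}\) of them for even k, and none for odd k because the middle nucleotide would have to equal its own complement.

```python
from itertools import product

_COMPLEMENT = str.maketrans("acgt", "tgca")


def reversed_complement(word):
    """Return the reversed complement of a lowercase DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def self_symmetric_words(k):
    """List the words of length k that equal their own reversed complement."""
    words = ("".join(letters) for letters in product("acgt", repeat=k))
    return [w for w in words if w == reversed_complement(w)]


def number_of_symmetric_pairs(k):
    """Number of symmetric word pairs {w, w'} with w different from w'."""
    return (4 ** k - len(self_symmetric_words(k))) // 2
```

For \(k=2\) this yields the four self symmetric words at, cg, gc and ta, leaving six symmetric word pairs among the remaining twelve dinucleotides.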
In a previous work [17], we proposed an exceptional genomic word symmetry measure evaluated for ECGs and globally. Here, we focus on the exceptional genomic symmetry evaluated for each word, and discuss the potential of the T measure (symmetric word pair effect, Eq. 1), an improvement of the S measure recently proposed in [18].
Let \(n_w\) be the total number of occurrences of word w in the sequence, and \(n_m\) be the total number of occurrences of words in the ECG \(G_m\), which contains the words composed of m nucleotides a or t. The symmetric word pair effect, for \(w \in G_m=\{w_1,w_2,w_3,..., w_{N_m} \}\), is given by,
The T(w) measure may also be expressed as the difference between two terms. The first term assesses the average frequency deviation between any two words in \(G_m\), whereas the second term accounts for the deviation between the frequency of w and that of its reversed complement. Exceptional symmetry, therefore, is revealed by positive values of T.
T differs from the previously defined S measure by a simple correction introduced to avoid indeterminate values. The two measures are approximately equal for sufficiently large word counts.
3.1 Control Experiments
Small, positive values of T may be obtained for word pairs that are not exceptionally symmetric. In order to establish a magnitude reference for T, we generate random sequences of independent and identically distributed nucleotides, under the assumption of the validity of the second parity rule, that is, by constraining the generator to produce complementary nucleotides with equal probabilities. Under these conditions, all words in each ECG have the same probabilities, hence no exceptional symmetry (see details in [20]). The label sym is used to denote these random sequences in the remainder of the document.
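A generator of such control sequences might look like the following Python sketch. The parameter `p_at` (the total probability of a and t) and the function names are assumptions made for illustration; the text does not state which nucleotide composition was used.

```python
import random


def sym_sequence(length, p_at, seed=None):
    """Draw an i.i.d. nucleotide sequence that obeys the second parity rule.

    Complementary nucleotides get equal probabilities:
    P(a) = P(t) = p_at / 2 and P(c) = P(g) = (1 - p_at) / 2.
    """
    rng = random.Random(seed)
    weights = [p_at / 2, (1 - p_at) / 2, (1 - p_at) / 2, p_at / 2]
    return "".join(rng.choices("acgt", weights=weights, k=length))


def word_probability(word, p_at):
    """Probability of a word under the model above.

    It depends only on m, the number of a or t symbols, so all words in
    the same ECG are equally likely.
    """
    m = sum(symbol in "at" for symbol in word)
    return (p_at / 2) ** m * ((1 - p_at) / 2) ** (len(word) - m)
```

The second function makes the argument of the paragraph explicit: under this model a word and every other member of its ECG share one probability, so a word is no closer in frequency to its reversed complement than to any other word of the group.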
3.2 Word Analysis Procedure
A word is declared as exceptionally symmetrical when its T value surpasses the critical value, which is defined as the 95th percentile of the T values obtained from the control experiments. To complement this analysis, we compute the percentage of words with \(T\le 0\) for each word length.
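The decision rule can be sketched as below, taking already computed T values as input (dictionaries mapping words to T). The percentile convention (linear interpolation between order statistics) is an assumption, since the text does not specify one, and the function names are illustrative.

```python
def percentile(values, q):
    """Percentile q (0 to 100) of a non-empty collection of numbers.

    Assumption: linear interpolation between neighbouring order statistics.
    """
    data = sorted(values)
    pos = (len(data) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


def classify(t_genome, t_control, q=95):
    """Apply the critical value from the control T values to a genome.

    Returns the critical value, the set of words whose T exceeds it, and
    the percentage of words with T <= 0.
    """
    critical = percentile(t_control.values(), q)
    exceptional = {word for word, t in t_genome.items() if t > critical}
    non_positive = sum(t <= 0 for t in t_genome.values())
    return critical, exceptional, 100 * non_positive / len(t_genome)
```

In this sketch the control and genome T values are expected to come from words of the same length, so that the critical value is specific to each word length.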
To identify groups of genomes with similar exceptional symmetry profiles (T(w) values), we use a hierarchical clustering procedure, using the UPGMA aggregation criterion with Euclidean distance. A similar clustering procedure is used to identify words with similar exceptional symmetry profiles across species.
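For readers unfamiliar with the aggregation criterion, the following pure-Python sketch shows UPGMA (average linkage) on Euclidean distances between profiles. It is a didactic illustration with hypothetical names, not the software used in the study; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` applies the same criterion.

```python
from math import dist


def upgma(profiles):
    """Cluster labelled profiles with average linkage (UPGMA).

    `profiles` maps a label (e.g. a species) to a list of T values, one per
    word, in the same word order for every label. Returns the merge history
    as (members_a, members_b, height) tuples, from first to last merge.
    """
    clusters = {i: (label,) for i, label in enumerate(profiles)}
    ids = list(clusters)
    # Euclidean distances between the initial singleton clusters.
    d = {}
    for a in ids:
        for b in ids:
            if a < b:
                d[(a, b)] = dist(profiles[clusters[a][0]], profiles[clusters[b][0]])
    merges = []
    next_id = len(ids)
    while len(clusters) > 1:
        # Merge the two closest clusters.
        (a, b), height = min(d.items(), key=lambda item: item[1])
        members_a, members_b = clusters.pop(a), clusters.pop(b)
        merges.append((members_a, members_b, height))
        size_a, size_b = len(members_a), len(members_b)
        # UPGMA update: size-weighted mean of the distances to a and to b.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = members_a + members_b
        next_id += 1
    return merges
```

Clustering words instead of species only requires transposing the input, so that each label is a word and each profile lists its T values across species.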
4 Results and Discussion
For the set of 31 genomes, the word counts were obtained for all word lengths between 1 and 14 nucleotides, and the symmetric word pair effect was obtained for each genomic word. However, for a given genome, we only consider the genomic words with lengths k (\(k \in \{1, ..., k_\mathrm{{max}}\}\)), with
and n the genome size. This threshold ensures that the word counts are representative and protects the T measure from its sensitivity to rare word occurrences.
For \(k=1\), each ECG contains only one symmetric word pair, and so \(T(w)=0\) for all nucleotides. Almost all words in eukaryote genomes show a significant exceptional symmetry effect (above the critical values obtained in the control experiments). Table 2 shows the percentage of words with \(T>0\) for each species and word length in this study. A high percentage of words in viruses show no exceptional symmetry. This result agrees with a previous work [20], which used a different measure and procedure.
Table 2 includes the sym row corresponding to one control scenario (sequence with length equal to the length of the human genome). This may be used as a reference of non-exceptional symmetry results.
4.1 Human Genome
A word analysis in the context of exceptional symmetry for the human genome was carried out.
Figure 1 shows boxplots of the T values for \(k=5\) in the human genome and in the corresponding random realization sym. The boxplot for the human genome shows high and significant symmetric word pair effects. The most exceptionally symmetric word pairs, corresponding to the right outliers, detected in the human T boxplot are: (gcgta, tacgc), (accgg, ccggt), (gccac, gtggc), (gccca, tgggc), (cggga, tcccg).
Figure 2 shows the T values in each ECG for \(k=5\) in the human genome. We observe that as the CG content varies (decreases along the x-axis), the T median values behave non-monotonically. The ECG \(G_1\) has the highest T median value. In general, for the word lengths under study and for the human genome, the T median in ECG \(G_0\) is lower than in \(G_1\), and the T median for \(G_k\) is higher than for \(G_{k-1}\). For the control scenario, on the other hand, we observed that the T median values remained essentially constant across all ECGs.
Table 3 presents, for the word lengths under study, the twelve words with the six highest and the six lowest T(w) values. Some of these extreme words could have some biological interest, e.g., regulatory elements, functional elements, motifs.
Based on the results of the effect size measure, we may conclude that the human genome presents exceptional symmetry. The human genome shows exceptional symmetry for the thirteen different word lengths (\(k=2,...,14\)) used in this study.
Although global exceptional symmetry was verified in the human genome, each chromosome has a distinct profile. Consequently, the exceptional symmetry profile may be used as a signature of each chromosome. Preliminary results, presented in the next section, also suggest that exceptional symmetry profiles differ between species.
It may also be concluded that, in the human genome, some ECGs are more exceptionally symmetric than others, and that a large percentage of the genomic words present some exceptional symmetry. However, for longer word lengths (\(k\ge 5\)), there are some words without any exceptional symmetry. This analysis also showed that words rich in CG content behave differently from words rich in AT content in terms of exceptional symmetry.
4.2 Species Comparison
Figure 3 shows the dendrogram obtained with the hierarchical clustering procedure, for \(k=4\). Four distinct groups can be observed in Figure 3: mammals (on the left); viruses (on the right); a group including the plants and the other animals (except Danio rerio); and a group with the unicellular species, plus Danio rerio. For other word lengths, the resulting dendrograms essentially maintain the same structure (the dendrogram for \(k=3\) is also included in Figure 4).
Figure 4 shows the heatmap with biclustering organization for trinucleotides. Species are shown on the horizontal axis, and words are shown on the vertical axis. The symmetric word pair effect is stronger on the left side of the heatmap, corresponding to multicellular organisms, and weaker on the right side. The word clustering highlights the group formed by two symmetric word pairs: (ccg, cgg), (gcg, cgc).
We identified the word pairs with high exceptional symmetry (T above the third quartile) in every species under study. From these, we selected the pairs that are highly symmetric in the largest number of species under study, and those that are highly symmetric in the largest number of animal species under study. The results are shown in Table 4. No word pair is considered highly symmetric across all the species under study. However, \(T(cgtacga) = T(tcgtacg)\) is above the third quartile in all the animal species under study.
The results presented in Table 4 are restricted to word lengths between 2 and 7 because, for longer word lengths, the number of most common symmetric word pairs above the third quartile is high. The strongest symmetric word pair effect is observed in words composed of CpG dinucleotides.
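The selection of highly symmetric pairs can be sketched as follows, assuming the T values are available as one dictionary per species (pair to T). The names are illustrative, and the quartile convention is the default of Python's `statistics.quantiles`, which may differ from the one used to build Table 4.

```python
from statistics import quantiles


def high_symmetry_counts(t_by_species):
    """Count, for each pair, the species in which T exceeds the third quartile.

    `t_by_species` maps a species name to a dictionary {pair: T}. The third
    quartile is computed separately within each species.
    """
    counts = {}
    for t_values in t_by_species.values():
        q3 = quantiles(list(t_values.values()), n=4)[2]
        for pair, t in t_values.items():
            if t > q3:
                counts[pair] = counts.get(pair, 0) + 1
    return counts


def most_shared_pairs(t_by_species):
    """Pairs that are highly symmetric in the largest number of species."""
    counts = high_symmetry_counts(t_by_species)
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(pair for pair, c in counts.items() if c == best), best
```

Restricting the input dictionary to the animal species gives the second selection described above.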
5 Conclusions
We evaluated the exceptional symmetry effect in several species, with particular emphasis on the human genome. The word exceptional symmetry values contain information specific to the species and seem to contain information about the evolution of the species. Among the species in this study, the primate and rodent species have the highest exceptional symmetry values and form a subgroup distinct from all the other species under study. Globally, the eukaryote group showed the highest word exceptional symmetry values, while viruses showed the lowest values. We emphasize that some viruses show a behavior opposite to exceptional symmetry (\(T<0\)) in almost all words under study.
An exceptional symmetry effect was found in a high percentage of words in all cellular organisms under study. Therefore, we conjecture that exceptional symmetry results from some universal law imposed on cellular organisms. Still, the exceptional symmetry profiles are species specific.
References
Chargaff E (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia 6(6):201–209
Watson J, Crick F (1953) A structure for deoxyribose nucleic acid. Nature 171:737–738
Karkas JD, Rudner R, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands. II. Template functions and composition as determined by transcription with RNA polymerase. Proc Natl Acad Sci USA 60(3):915–920
Rudner R, Karkas JD, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands, I. Biological properties. Proc Natl Acad Sci USA 60(2):630–635
Rudner R, Karkas JD, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proc Natl Acad Sci USA 60(3):921–922
Forsdyke DR (2011) Evolutionary bioinformatics. Springer, New York
Sobottka M, Hart AG (2011) A model capturing novel strand symmetries in bacterial DNA. Biochemical and biophysical research communications 410(4):823–828. doi:10.1016/j.bbrc.2011.06.072. http://www.sciencedirect.com/science/article/pii/S0006291X1101045X
Zhang SH, Huang YZ (2008) Characteristics of oligonucleotide frequencies across genomes: conservation versus variation, strand symmetry, and evolutionary implications. Nat Precedings:1–28
Zhang SH, Huang YZ (2010) Strand symmetry: characteristics and origins. In: Fourth international conference on bioinformatics and biomedical engineering (iCBBE) 2010. pp. 1–4. doi:10.1109/ICBBE.2010.5517388
Forsdyke DR, Bell SJ (2004) Purine loading, stem-loops and Chargaff’s second parity rule: a discussion of the application of elementary principles to early chemical observations. Appl Bioinform 3(1):3–8
Baisnée PF, Hampson S, Baldi P (2002) Why are complementary DNA strands symmetric? Bioinformatics 18(8):1021–1033
Albrecht-Buehler G (2006) Asymptotically increasing compliance of genomes with Chargaff’s second parity rules through inversions and inverted transpositions. Proc Natl Acad Sci USA 103(47):17828–17833
Albrecht-Buehler G (2007) Inversions and inverted transpositions as the basis for an almost universal “format” of genome sequences. Genomics 90:297–305
Lobry JR (1995) Properties of a general model of DNA evolution under no-strand-bias conditions. J Mol Evol 40:326–330
Hart A, Martínez S, Olmos F (2012) A Gibbs approach to Chargaff’s second parity rule. J Stat Phys 146:408–422
Powdel B, Satapathy S, Kumar A, Jha P, Buragohain A, Borah M, Ray S (2009) A study in entire chromosomes of violations of the intra-strand parity of complementary nucleotides (Chargaff’s second parity rule). DNA Res 16:325–343
Afreixo V, Rodrigues JMOS, Bastos CAC (2015) Analysis of single-strand exceptional word symmetry in the human genome: new measures. Biostatistics 16(2):209–221
Afreixo V, Rodrigues JMOS, Bastos CAC, Silva RM (2016) Exceptional symmetry profile: A genomic word analysis. In: PACBB
Kong SG, Fan WL, Chen HD, Hsu ZT, Zhou N, Zheng B, Lee HC (2009) Inverse symmetry in complete genomes and whole-genome inverse duplication. PLoS ONE 4(11):e7553
Afreixo V, Rodrigues JMOS, Bastos CAC (2014) Exceptional single strand DNA word symmetry: analysis of evolutionary potentialities. J Integr Bioinform 11(3):250
Acknowledgements
This work was supported by Portuguese funds through the iBiMED-Institute of Biomedicine, IEETA-Institute of Electronics and Informatics Engineering of Aveiro, CIDMA - Center for Research and Development in Mathematics and Applications and the Portuguese Foundation for Science and Technology (“FCT–Fundação para a Ciência e a Tecnologia”), within projects: UID/BIM/04501/2013, PEst-OE/EEI/UI0127/2014 and UID/MAT/04106/2013.
Afreixo, V., Rodrigues, J.M.O.S., Bastos, C.A.C. et al. Exceptional Symmetry by Genomic Word. Interdiscip Sci Comput Life Sci 9, 14–23 (2017). https://doi.org/10.1007/s12539-016-0200-9
DOI: https://doi.org/10.1007/s12539-016-0200-9