1 Introduction

Erwin Chargaff was a biochemist who discovered a set of intriguing rules about the composition of DNA from the analysis of bacterial genomes [1]. The first rule states that, in double-stranded DNA, the total percentages of complementary nucleotides (A-T and C-G) must be equal. Of course, this is now known to result from the double helix structure of DNA [2]. The second rule states that the percentages of complementary nucleotides are also identical in each strand [3–5], [6, chap. 4].

A natural extension of Chargaff’s second parity rule is that, in each DNA strand, the number of occurrences of a given word (oligonucleotide or k-mer) should match that of its reversed complement [6]. The extension to the second parity rule is also known as the single-strand symmetry phenomenon. This symmetry phenomenon refers to the distributions of symmetric pairs, i.e., the distribution of occurrences of all words and the distribution of occurrences of the corresponding reversed complements.

Presently, there is no generally accepted justification for the need for single-strand parity in DNA sequences, and no consensus explanation for the occurrence of the single-strand symmetry phenomenon. The attempts to explain the phenomenon can be classified into two groups: the conserved patterns model [7–9] and the evolutionary models. Evolutionary models can be further classified according to their underlying hypotheses, for example: the stem-loops hypothesis [10]; the duplication followed by inversion hypothesis [11]; the inversions and inverted transpositions hypothesis [12, 13]; the non-uniform substitutions hypothesis [14]; and the statistical mechanics equilibrium hypothesis [15].

To characterize the symmetry phenomenon, Powdel et al. [16] analyzed the frequency distributions of oligonucleotides in localized windows along a single strand of DNA. They found that the differences between the frequency distributions of reverse complementary oligonucleotides are not statistically significant. Afreixo et al. [17] noted that the frequency of an oligonucleotide is more similar to the frequency of its reversed complement than to the frequencies of other words of equivalent composition (equal-length oligonucleotides with equal CG content). They called this phenomenon exceptional symmetry, defined measures to evaluate it, and identified several word groups with strong exceptional symmetry in the human genome. More recently, a different measure was introduced to overcome a disadvantage of the previous measure of exceptional symmetry by word [18]. This measure evaluates the difference between the number of occurrences of a word and that of its reversed complement, and relates it to the dissimilarities between the numbers of occurrences in the corresponding equivalent composition group.

Here, we introduce an improved exceptional symmetry measure and use it to obtain the word symmetry effects in 31 complete genomes stratified by equivalent composition group for word lengths up to 14. Results confirm that measures of word exceptional symmetry can be used to form clusters of related species. Also, we identify words that show high symmetry effect across the 31 species, and across the 9 animal species studied.

2 Materials

The genomes analyzed here are available from the website of the National Center for Biotechnology Information (NCBI; ftp://ftp.ncbi.nih.gov/genomes/). The complete list of species is indicated in Table 1. We selected genomes of species representative of the major taxonomic groups across the tree of life. These include vertebrates, invertebrates, protozoans, fungi, plants, bacteria (gram-positive and gram-negative), archaea and viruses (both double-stranded and single-stranded DNA and RNA viruses).

Table 1 List of species whose genomes are analyzed in this work

All non-sequenced or ambiguous nucleotides (mostly N symbols in the sequence file) were discarded from the analysis. For genomes composed of several chromosomes, the chromosomes were processed as separate sequences. All genome sequences used in this study were processed to obtain the word counts, considering overlap between successive words. We obtained the word counts for word lengths from 1 to 14 nucleotides.
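The counting step described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `count_words` is hypothetical, and it assumes that "discarding" ambiguous nucleotides means skipping every window that contains a symbol other than a, c, g or t.

```python
from collections import Counter

NUCLEOTIDES = set("acgt")


def count_words(sequence, k):
    """Count overlapping words of length k in a single sequence.

    Assumption (not stated in the text): a window is skipped whenever it
    contains any symbol other than a, c, g or t (for example N).
    """
    seq = sequence.lower()
    counts = Counter()
    # Slide a window of width k one position at a time (overlapping words).
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if NUCLEOTIDES.issuperset(word):
            counts[word] += 1
    return counts
```

For instance, in the toy sequence `ACGTNACG` with k = 2, the two windows that touch the N are skipped and the remaining five windows are counted.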

3 Methods

In a previous work [17], we defined an equivalent composition group (ECG) as the set of words of length k that contain a given number m of nucleotides a or t. For example, for \(k=2\) there are three ECGs:

$$\begin{aligned} G_0=\, & {} \{cc,cg,gc,gg\};\\ G_1=\, & {} \{ac,ag,ca,ct,ga,gt,tc,tg\};\\ G_2=\, & {} \{aa,at,ta,tt\}. \end{aligned}$$

The partition of words into ECGs is also called a binary partition [19]. Consider the binary classification of nucleotides in two types, \(T_1=\{a,t\}\) and \(T_2=\{c,g\}\), and let \(G_m^k\) (or simply, \(G_m\)) be the ECG with words of length k where each word has m symbols of type \(T_1\) and \(k-m\) symbols of type \(T_2\), with \(m \in \{0,1,\ldots ,k\}\). By a standard combinatorial argument (permutations with repetition of indistinguishable objects), \(G_m\) has \(N_m\) distinct words,

$$\begin{aligned} N_m=2^k\times \frac{k!}{m!(k-m)!}. \end{aligned}$$

Note that, for k-mers there are \(k+1\) ECGs with a total of \(4^k\) words.
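As a quick check of the definitions above, the following sketch enumerates the ECGs by brute force and compares their sizes with the closed form for \(N_m\). The helper names (`ecg_index`, `ecgs`, `ecg_size`) are illustrative only and do not come from the original work.

```python
from itertools import product
from math import comb


def ecg_index(word):
    """Return m, the number of a or t symbols in the word."""
    return sum(ch in "at" for ch in word)


def ecgs(k):
    """Group all 4**k words of length k by their number m of a/t symbols."""
    groups = {m: [] for m in range(k + 1)}
    for letters in product("acgt", repeat=k):
        word = "".join(letters)
        groups[ecg_index(word)].append(word)
    return groups


def ecg_size(k, m):
    """Closed form N_m = 2**k * k! / (m! (k-m)!)."""
    return 2 ** k * comb(k, m)
```

For \(k=2\), `ecgs(2)` reproduces the three groups \(G_0\), \(G_1\) and \(G_2\) listed above, with 4, 8 and 4 words respectively.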

For even values of k, some words are equal to their reversed complement. We denote these as self-symmetric words (SSW). We also define a symmetric word pair as the set composed of one word w and the corresponding reversed complement word \(w'\), with \((w')'=w\) (for example, cca and tgg make a symmetric word pair).
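The two notions just defined are easy to illustrate in code. The sketch below uses hypothetical helper names and assumes lowercase words over the alphabet a, c, g, t.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(word):
    """Complement every nucleotide, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def is_self_symmetric(word):
    """True when the word equals its own reversed complement (an SSW)."""
    return word == reverse_complement(word)


def symmetric_pair(word):
    """Return the symmetric word pair {w, w'} as a frozenset."""
    return frozenset((word, reverse_complement(word)))
```

With these helpers, `reverse_complement("cca")` returns `"tgg"`, and applying the function twice returns the original word, matching \((w')'=w\).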

In a previous work [17], we proposed an exceptional genomic word symmetry measure evaluated for ECGs and globally. Here, we focus on the exceptional genomic symmetry evaluated for each word, and discuss the potential of the T measure (symmetric word pair effect, Eq. 1), an improvement of the S measure recently proposed in [18].

Let \(n_w\) be the total number of occurrences of word w in the sequence, and \(n_m\) be the total number of occurrences of words in the ECG \(G_m\), which contains the words with m nucleotides a or t. The symmetric word pair effect, for \(w \in G_m=\{w_1,w_2,w_3,\ldots , w_{N_m} \}\), is given by,

$$\begin{aligned} T(w) = T(w') =\ln {\frac{\sqrt{\frac{\sum _{i=1}^{N_m}{\sum _{j=1}^{N_m}{(n_{w_i}-n_{w_j})^2}}}{N_m^2-N_m}}+1}{|n_w-n_{w'}|+1}}. \end{aligned} \tag{1}$$

The T(w) measure may also be expressed as the difference between two terms. The first term assesses the average frequency deviation between any two words in \(G_m\), whereas the second term accounts for the deviation between the frequency of w and that of its reversed complement. Exceptional symmetry, therefore, is revealed by positive values of T.
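Written out, and using \(D_m\) as shorthand for the root mean squared pairwise count difference in \(G_m\) (a symbol introduced here for illustration only, not part of the original notation), this decomposition of Eq. 1 reads:

$$\begin{aligned} T(w) = \ln \left( D_m+1\right) - \ln \left( |n_w-n_{w'}|+1\right) , \qquad D_m=\sqrt{\frac{\sum _{i=1}^{N_m}\sum _{j=1}^{N_m}(n_{w_i}-n_{w_j})^2}{N_m^2-N_m}}. \end{aligned}$$

Hence \(T(w)>0\) exactly when \(|n_w-n_{w'}|<D_m\), that is, when the two words of the pair are closer in count than a typical pair of words in the same ECG.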

T differs from the previously defined S measure by a simple correction introduced to avoid indeterminate values. The two measures are approximately equal for sufficiently large word counts.
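A direct, unoptimized transcription of Eq. 1 might look like the sketch below. The function name `symmetric_pair_effect` and the dictionary-of-counts input are assumptions made for illustration; words absent from the dictionary are treated as having zero occurrences. The double sum is evaluated through the algebraic identity \(\sum _i\sum _j (n_i-n_j)^2 = 2N\sum _i n_i^2 - 2(\sum _i n_i)^2\), which avoids the quadratic loop.

```python
from itertools import product
from math import log, sqrt

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def symmetric_pair_effect(word, counts):
    """Evaluate T(w) of Eq. 1 from a mapping word -> number of occurrences.

    Assumption: words missing from `counts` have zero occurrences.
    """
    k = len(word)
    m = sum(ch in "at" for ch in word)
    # All words of the ECG G_m (same length, same number of a/t symbols).
    group = ["".join(p) for p in product("acgt", repeat=k)
             if sum(ch in "at" for ch in p) == m]
    n = [counts.get(w, 0) for w in group]
    size = len(n)
    # sum_i sum_j (n_i - n_j)^2 = 2*N*sum(n_i^2) - 2*(sum n_i)^2
    pair_sq = 2 * size * sum(x * x for x in n) - 2 * sum(n) ** 2
    spread = sqrt(pair_sq / (size * size - size))
    gap = abs(counts.get(word, 0) - counts.get(reverse_complement(word), 0))
    return log((spread + 1) / (gap + 1))
```

Because the ECG is re-enumerated on every call, this version is only practical for short words; for longer words the group spread would more sensibly be computed once per ECG and reused. For single nucleotides the function returns 0, since each ECG then contains a single symmetric word pair.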

3.1 Control Experiments

Small, positive values of T may be obtained for word pairs that are not exceptionally symmetric. In order to establish a magnitude reference for T, we generate random sequences of independent and identically distributed nucleotides, under the assumption of the validity of the second parity rule, that is, by constraining the generator to produce complementary nucleotides with equal probabilities. Under these conditions, all words in each ECG have the same probabilities, hence no exceptional symmetry (see details in [20]). The label sym is used to denote these random sequences in the remainder of the document.
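A generator for such control sequences could be sketched as follows. The function name `generate_sym`, the `at_fraction` parameter and the seeding are illustrative assumptions; the exact nucleotide probabilities used for the control experiments are those described in [20], not the defaults shown here.

```python
import random


def generate_sym(length, at_fraction=0.5, seed=None):
    """Draw an i.i.d. nucleotide sequence with P(a) = P(t) and P(c) = P(g).

    `at_fraction` (hypothetical parameter) is the total probability of a or t.
    """
    rng = random.Random(seed)
    p_at = at_fraction / 2          # probability of a, and of t
    p_cg = (1 - at_fraction) / 2    # probability of c, and of g
    symbols = rng.choices("acgt", weights=[p_at, p_cg, p_cg, p_at], k=length)
    return "".join(symbols)
```

Under this model the probability of a word depends only on how many a/t and c/g symbols it contains, so all words within one ECG are equiprobable, as stated above.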

3.2 Word Analysis Procedure

A word is declared exceptionally symmetric when its T value surpasses the critical value, which is defined as the 95th percentile of the T values obtained from the control experiments. To complement this analysis, we compute the percentage of words with \(T\le 0\) for each word length.
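The decision rule can be illustrated with the short sketch below. The function names are hypothetical, and the nearest-rank definition of the percentile is an assumption, since the text does not state which percentile estimator was used.

```python
def critical_value(control_t_values, percentile=95):
    """Nearest-rank percentile of the control T values (assumed estimator)."""
    ordered = sorted(control_t_values)
    # Integer ceiling of percentile * n / 100, kept at least 1.
    rank = max((percentile * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


def exceptionally_symmetric(t_values, control_t_values):
    """Keep the words whose T value exceeds the control critical value."""
    threshold = critical_value(control_t_values)
    return {word: t for word, t in t_values.items() if t > threshold}


def percent_non_positive(t_values):
    """Percentage of words with T <= 0."""
    return 100 * sum(t <= 0 for t in t_values.values()) / len(t_values)
```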

To identify groups of genomes with similar exceptional symmetry profiles (T(w) values), we use a hierarchical clustering procedure, using the UPGMA aggregation criterion with Euclidean distance. A similar clustering procedure is used to identify words with similar exceptional symmetry profiles across species.
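To make the clustering step concrete, here is a small, dependency-free sketch of UPGMA (average linkage) on Euclidean distances. It is an illustration under stated assumptions, not the software used in this study: the function name `upgma` and the input layout (one list of T values per species, all in the same word order) are hypothetical.

```python
from math import dist


def upgma(profiles):
    """Average-linkage clustering of T profiles with Euclidean distance.

    `profiles` maps a label to a list of T values (same word order for all).
    Returns the merges as (members_a, members_b, height) tuples.
    """
    labels = list(profiles)
    vectors = [profiles[label] for label in labels]
    clusters = {i: (label,) for i, label in enumerate(labels)}
    # Pairwise distances, keyed by (smaller id, larger id).
    d = {(a, b): dist(vectors[a], vectors[b])
         for a in clusters for b in clusters if a < b}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda item: item[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merges.append((clusters[a], clusters[b], height))
        merged = clusters.pop(a) + clusters.pop(b)
        # Keep distances between untouched clusters.
        new_d = {key: v for key, v in d.items()
                 if a not in key and b not in key}
        # UPGMA update: size-weighted mean of the two old distances.
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = new_d
        clusters[next_id] = merged
        next_id += 1
    return merges
```

In practice a library routine would normally be preferred; for example, SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="euclidean"` implements the same criterion. Clustering words instead of species amounts to transposing the input, so that each word is described by its T values across species.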

4 Results and Discussion

For the set of 31 genomes, the word counts were obtained for all word lengths between 1 and 14 nucleotides, and the symmetric word pair effect was obtained for each genomic word. However, for a given genome, we only consider the genomic words with lengths \(k \in \{1, \ldots , k_{\max }\}\), with

$$\begin{aligned} k_{\max }=\max \left\{ k \in \left\{ 1,2,3,\ldots \right\} : n\times 0.25^k>5\right\} \end{aligned}$$

and n the genome size. This threshold requires the expected count of a word under equiprobable nucleotides to exceed 5; it ensures that the word counts are representative and protects the T measure from its sensitivity to rarely occurring words.
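The threshold is straightforward to evaluate. The sketch below (with the hypothetical name `k_max`) uses the equivalent integer condition \(n > 5\times 4^k\) to avoid floating-point rounding, and returns 0 when no word length satisfies it.

```python
def k_max(genome_size, min_expected=5):
    """Largest k with genome_size * 0.25**k > min_expected (0 if none)."""
    k = 0
    # n * 0.25**k > c  is equivalent to  n > c * 4**k  for integers n and c.
    while genome_size > min_expected * 4 ** (k + 1):
        k += 1
    return k
```

For a hypothetical sequence of \(3\times 10^9\) nucleotides, `k_max(3_000_000_000)` returns 14, because \(5\times 4^{14}\approx 1.34\times 10^9\) is below that size while \(5\times 4^{15}\approx 5.37\times 10^9\) is not.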

Obviously, for \(k=1\), each ECG contains only one symmetric word pair, and so \(T(w)=0\) for all nucleotides. Almost all words in eukaryote genomes show a significant exceptional symmetry effect (above the critical values obtained in the control experiments). Table 2 shows the percentage of words with \(T>0\) for each species and word length in this study. A high percentage of words in viruses show no exceptional symmetry. This result agrees with a previous work [20], which used a different measure and procedure.

Table 2 Percentage of words (of length k) with exceptional symmetry effect (\(T>0\)), measured in the genomes of 31 species and in the random control sequence (sym)

Table 2 includes the sym row corresponding to one control scenario (sequence with length equal to the length of the human genome). This may be used as a reference of non-exceptional symmetry results.

4.1 Human Genome

We carried out a word-level analysis of exceptional symmetry in the human genome.

Figure 1 shows boxplots of the T values for \(k=5\) in the human genome and in the corresponding random realization sym. The boxplot for the human genome shows high and significant symmetric word pair effects. The most exceptionally symmetric word pairs, corresponding to the right outliers, detected in the human T boxplot are: (gcgta, tacgc), (accgg, ccggt), (gccac, gtggc), (gccca, tgggc), (cggga, tcccg).

Fig. 1 Boxplots for T values in the human genome and in a random control sequence realization (sym) for word length 5

Figure 2 shows the T values in each ECG for \(k=5\) in the human genome. We observe that, as the CG content decreases along the x-axis, the T median values behave non-monotonically. The ECG \(G_1\) has the highest T median value. In general, for the word lengths under study and for the human genome, the T median in ECG \(G_0\) is lower than in \(G_1\), and the T median for \(G_k\) is higher than for \(G_{k-1}\). For the control scenario, on the other hand, we observed that the T median values remained essentially constant across all ECGs.

Fig. 2 Boxplots for T values in each ECG for word length 5, in the human genome

Table 3 presents, for the word lengths under study, the twelve words with the six highest and the six lowest T(w) values. Some of these extreme words could have some biological interest, e.g., regulatory elements, functional elements, motifs.

Table 3 The six symmetric word pairs (represented by a single word of the pair) that have the highest (−h) T(w) values, and the six symmetric word pairs that have the lowest (−l) T(w) values for each k, in the human genome

Based on the results of the effect size measure, we may conclude that the human genome presents exceptional symmetry. The human genome shows exceptional symmetry for the thirteen different word lengths (\(k=2,...,14\)) used in this study.

Although the existence of global exceptional symmetry in the human genome was verified, there are distinct profiles for each chromosome. Consequently, the exceptional symmetry profile may be used as a signature of each chromosome. Preliminary results also suggest that exceptional symmetry profiles differ between species; these results are presented in the next section.

It may also be concluded that, in the human genome, some ECGs are more exceptionally symmetric than others, and that a large percentage of the genomic words present some exceptional symmetry. However, for longer word lengths (\(k\ge 5\)), there are some words without any exceptional symmetry. This analysis also shows that words rich in CG content behave differently from words rich in AT content in terms of exceptional symmetry.

4.2 Species Comparison

Figure 3 shows the dendrogram obtained with the hierarchical clustering procedure, for \(k=4\). Four distinct groups can be observed in Figure 3: mammals (on the left); viruses (on the right); a group including the plants and the other animals (except Danio rerio); and a group with the unicellular species, plus Danio rerio. For other word lengths, the resulting dendrograms essentially maintain the same structure (the dendrogram for \(k=3\) is also included in Figure 4).

Fig. 3 Dendrogram obtained from the T values for all species under study, word length 4

Figure 4 shows the heatmap with biclustering organization for trinucleotides. Species are shown on the horizontal axis, and words are shown on the vertical axis. The symmetric word pair effect is stronger on the left side of the heatmap, corresponding to multicellular organisms, and weaker on the right side. The word clustering highlights the group formed by two symmetric word pairs: (ccg, cgg), (gcg, cgc).

Fig. 4 Heatmap with biclustering organization of the T values for words of length 3 and for all species under study

We identified the word pairs with high exceptional symmetry (T above the third quartile) in every species under study. From these, we selected the pairs that are highly symmetric across the largest number of species under study, and those that are highly symmetric across the largest number of animal species under study. The results are shown in Table 4. No word pair is considered highly symmetric across all the species under study. However, \(T(cgtacga) = T(tcgtacg)\) is above the third quartile in all the animal species under study.

Table 4 Word pairs with exceptional symmetry effect above the third quartile, which are most common across species, and most common across animal species

The results presented in Table 4 are restricted to word lengths between 2 and 7 because, for longer word lengths, the number of most common symmetric word pairs above the third quartile is high. The strongest symmetric word pair effect is observed in words composed of CpG dinucleotides.
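The selection behind Table 4 can be sketched as follows. The function names and the nested-dictionary input are hypothetical, and the quartile estimator (the default of Python's `statistics.quantiles`) is an assumption, since the text does not specify one; the quartile is taken over whatever set of T values is passed in for each species.

```python
from collections import Counter
from statistics import quantiles


def pairs_above_q3(t_values):
    """Words of one species whose T value exceeds the third quartile."""
    q3 = quantiles(t_values.values(), n=4)[2]  # assumed quartile estimator
    return {word for word, t in t_values.items() if t > q3}


def most_common_high_pairs(t_by_species):
    """Tally, for each word, the number of species where it is above Q3."""
    tally = Counter()
    for t_values in t_by_species.values():
        tally.update(pairs_above_q3(t_values))
    return tally.most_common()
```

Restricting `t_by_species` to the animal species gives the second selection; a word whose tally equals the number of species passed in is above the third quartile in all of them.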

5 Conclusions

We evaluated the exceptional symmetry effect in several species, with particular emphasis on the human genome. The word exceptional symmetry values contain information specific to the species and seem to contain information about the evolution of the species. Among the species in this study, the primate and rodent species have the highest exceptional symmetry values and form a subgroup distinct from all the other species under study. Globally, the eukaryote group showed the highest word exceptional symmetry values, while viruses showed the lowest values. We stress that some viruses show a behavior opposite to exceptional symmetry (\(T<0\)) in almost all words under study.

An exceptional symmetry effect was found in a high percentage of words in all cellular organisms under study. Therefore, we conjecture that exceptional symmetry results from some universal law imposed on cellular organisms. Still, the exceptional symmetry profiles are species-specific.