Introduction

Compositional properties of genome sequences have been investigated long before the beginning of the genome sequence projects since global and local compositional features of DNA are important from an evolutionary point of view and because they also provide information about functional elements. An approach that has been used to study the compositional properties of genome sequences is related to the concept of genome signature that, so far, has been conceived in terms of dinucleotide frequencies (Campbell et al. 1999; Karlin and Ladunga 1994) or the frequencies of longer oligonucleotides (Ohno 1987; Bultrini et al. 2003). It is believed that the genome signature may represent the response of the genome to an overall selective pressure at the DNA level, a pressure that is basically constant in both coding and non-coding sequences (Russell et al. 1976). There are, however, a number of known genome signatures that differ between coding and non-coding regions (Havlioglu et al. 2007).

Another fruitful approach to study the compositional properties of DNA deals with the concept of the information supplied per base or residue as deduced from information theory (Shannon 1948). For that purpose, compression algorithms applied to a sequence in which the length of the compressed sequence is taken as a measure of its complexity are especially useful.

The genome signature and the information content are interrelated since the information content of DNA, recorded in the base composition, often reflects phylogenetic relationships (Sueoka 1961) and sheds light on the mechanisms that shape genomes (Bernardi 2000). Consequently, systematic measurements of base compositions have been used for taxonomic classification objectives (Schildkraut et al. 1962). As it has also been pointed out, the nucleotide order of DNA sequences contributes to their similarity in base composition if the evolutionary distance is much smaller than their average permutation distance obtained by calculating the distances for many random permutations of these sequences (Altschul and Erickson 1985). Thus, to retrieve stored information from DNA not only the genetic code is required but also other genetic (Osawa et al. 1992) and epigenetic codes (Zuckerkandl and Cavalli 2007).

Some data have been reported indicating that the size of complex genomes has not been reached by the incorporation of sequences generated de novo without any dependence upon pre-existing coding or non-coding sequences, but that it has resulted from increases in ploidy and from smaller scale duplication of DNA sequences followed by sequence divergence and often, eventually, translocation of sequence duplicates (Ohno 1970; Long and Langley 1993; Begun et al. 2007; Levine et al. 2006). It is, then, fully expected that, to a significant extent, DNA sectors of a given organism can be grouped into clusters sharing similar compositional features.

In this report, a novel way of evaluating similarities and differences among functionally distinct DNA sectors based on neighbor base context is presented. This approach reveals compositional features that, to our knowledge, have not been reported before. It provides, moreover, a new coding system whose elements, the triplet composons, give precise information about the base context of a given DNA sequence. This context is closely tied to the DNA sequence and closely dependent upon its mutation regime (Antezana and Jordan 2008).

Triplet composons are defined here as groups of triplets that contain the same set of distinct bases, in whatever order and number, giving all possible neighbor base combinations that could be formed within a triplet. In this article, we have analyzed the composon usage frequency of a sample of 706 human genes. The results reveal that both the coding and intron sequences form clusters characterized by specific composon usage signatures. The sequences that belong to a particular cluster have similar base composition but dissimilar triplet permutation frequencies. By triplet permutation frequency, we understand the number of sequence changes in the set of overlapping triplets that compose the sequence. To examine the evolution of the gene clusters found in humans, human and mouse homologous gene samples of similar size were compared.

Methods

DNA Composition in Terms of Varied Base Representation in Consecutive Standard Length Sequence Fragments (Multiplets)

Establishing a genome’s spectrum of base multiplets obviously requires exploring the frequency of occurrence of multiplets of different length per unit length of DNA. The higher the number of bases is in a multiplet used as probe, the lower will be the frequency of the multiplet’s occurrence in a DNA sequence. Markedly short or markedly long test sequences to be matched to genomic sequences yield only limited information about the DNA being explored. Information can be gained by focusing on the base composition of the multiplets in addition to their length. This can be done in particular by counting the number of multiplets that share in the exclusion of particular bases. We will refer to multiplets as “exclusionary multiplets” when they exclude particular bases. As can be seen in Table 1, the maximum number of different exclusionary multiplets with the minimal number of bases is the highest for triplets.

Table 1 Number of exclusionary multiplets found for different DNA fragments of unit length based on the presence or absence of particular bases

The Triplet Composon Code

In this article, the term “triplet” refers to any trinucleotide in whatever part of the genome. A group of triplets containing the same set of distinct bases, in whatever order and number, will be referred to as a “triplet composon”. For example, the set of triplets AAC, ACA, CAA, CCA, CAC, and ACC constitutes the AC composon (abbreviated <AC>). As can be observed, the <AC> composon contains all possible triplets that include both A and C and no other base. The 64 DNA triplets that constitute the 14 possible triplet composons and their abbreviations are shown in Fig. 1. For the sake of brevity, in the rest of the article the term triplet composon will be shortened to composon.
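The grouping described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the authors' software; the function names `composon_of` and `build_composon_table` are hypothetical, and labelling a composon by its alphabetically sorted bases is an assumption that happens to match the abbreviations used in the text (<AC>, <AT>, <CG>, <ACT>).

```python
from itertools import product


def composon_of(triplet):
    """Label a triplet by its set of distinct bases, sorted alphabetically."""
    return "<" + "".join(sorted(set(triplet))) + ">"


def build_composon_table(alphabet="ACGT"):
    """Map each composon label to the list of triplets it contains."""
    table = {}
    for bases in product(alphabet, repeat=3):
        triplet = "".join(bases)
        table.setdefault(composon_of(triplet), []).append(triplet)
    return table


table = build_composon_table()
print(len(table))             # 14 composons
print(sorted(table["<AC>"]))  # ['AAC', 'ACA', 'ACC', 'CAA', 'CAC', 'CCA']
```

Enumerating the table this way also confirms the counts given in the legend of Fig. 1: four composons contain a single triplet and ten contain six triplets each.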

Fig. 1

Triplet classification of the DNA proposed code. a The 64 DNA triplets that can be formed with the nucleotides A, G, T, and C are grouped in composons. Ten composons are sixfold degenerate and four composons are non-degenerate. b Nomenclature used in the paper for the composons

Calculation of Composon Usage Frequencies

To calculate the composon usage frequencies in the DNA of a given gene, the exons (CDS) on the one hand and the introns (IS) on the other were nicked out and stitched together. To do that, the software utilities from both the National Center for Biotechnology Information (NCBI) DNA database and the European Molecular Biology Open Software Suite (EMBOSS) explorer were used. The NCBI sequence viewer utility, named cds, stitches together the exons of a gene giving as output the protein coding sequence in a FASTA format. In the case of introns, the utility used was coderet from the EMBOSS explorer. The introns and exons of a given gene were concatenated and considered as a unique sequence, hereafter named intron sequence (IS) and coding sequence (CDS), respectively.

Composon usage frequencies in genomic DNA fragments were then determined starting the reading at any letter of the original triplet (5′) and moving from triplet-to-triplet to the end of the sequence, in a fully overlapping way. The EMBOSS utility compseq was used for this reading. A given composon usage frequency was obtained by summing up the usage frequencies of each of the triplets that constitute a composon. The usage frequencies of the 14 composons, in ‰, were tabulated and represented graphically.
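As a rough illustration of this counting step, the sketch below reads a sequence in fully overlapping triplets and sums the counts per composon. It is a plain-Python stand-in and does not reproduce the EMBOSS compseq program; the function name `composon_usage` is hypothetical, and skipping windows that contain ambiguous bases is an assumption not stated in the text.

```python
from collections import Counter


def composon_usage(sequence):
    """Composon usage frequencies (per thousand) from fully overlapping triplets."""
    seq = sequence.upper()
    counts = Counter()
    total = 0
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        # Assumption: windows with bases other than A, C, G, T are ignored.
        if set(triplet) <= set("ACGT"):
            label = "<" + "".join(sorted(set(triplet))) + ">"
            counts[label] += 1
            total += 1
    if total == 0:
        return {}
    return {label: 1000.0 * n / total for label, n in counts.items()}


# Toy sequence: the windows are AAA, AAC, ACC, CCC.
print(composon_usage("AAACCC"))
```

Because the per-composon value is the sum of the frequencies of its member triplets, counting by composon label directly gives the same result as counting triplets first and summing afterwards.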

The k-means algorithm used for gene clustering is available on the GEPAS website http://www.gepas.org (Montaner et al. 2006). This algorithm starts from a random initial clustering. Then, by iteration, the algorithm calculates the cluster centroids and reassigns elements to the cluster with the closest centroid; it halts when no more reassignments are made. The centroid, or centre, is the average of all the points in the cluster (MacQueen 1967). The distance function used in the analysis was the Pearson correlation coefficient, which measures the strength and the direction of the linear relationship between the n = 14 pairs of values of the two variables x_i and y_i, which represent the composon usage frequencies of the two sequences to be compared. The coefficient is, then, the best estimate of the correlation of x_i and y_i and it is written as:

$$ r = \frac{n\sum x_{i} y_{i} - \sum x_{i} \sum y_{i}}{\sqrt{n\sum x_{i}^{2} - \left(\sum x_{i}\right)^{2}}\,\sqrt{n\sum y_{i}^{2} - \left(\sum y_{i}\right)^{2}}}. $$
(1)
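A minimal sketch of Eq. 1 and of the iteration described above is given here for orientation. It is not the GEPAS implementation. Two points are assumptions: the text does not say how the correlation coefficient is converted into a distance, so 1 − r is used, and the initial centroids are passed in by the caller (they could be drawn at random) so that the example is reproducible. All function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles (Eq. 1)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    vx = n * sum(a * a for a in x) - sx ** 2
    vy = n * sum(b * b for b in y) - sy ** 2
    # Assumption: a constant profile (zero variance) is treated as uncorrelated.
    if vx <= 0 or vy <= 0:
        return 0.0
    return (n * sxy - sx * sy) / (sqrt(vx) * sqrt(vy))


def correlation_distance(x, y):
    """Assumed distance: 1 - r, so identical shapes are at distance 0."""
    return 1.0 - pearson_r(x, y)


def kmeans(profiles, initial_centroids, max_iter=100):
    """Assign profiles to the closest centroid until no reassignment occurs."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: correlation_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # halted: no more reassignments
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy profiles (made-up numbers): two rising and two falling shapes.
profiles = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1], [9, 6, 4, 2]]
labels, centroids = kmeans(profiles, [profiles[0], profiles[2]])
print(labels)  # [0, 0, 1, 1]
```

In the analysis described here, each profile would hold the 14 composon usage frequencies of one sequence; the four-value profiles above only keep the example short.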

Composon Usage of Random DNA Sequences

The base line used represents the composon usage frequency in a random DNA sequence of infinite length in which the four nucleotides A, T, G, and C are equally frequent. Thus, on average, each of the 64 triplets in that random DNA sequence must appear with a frequency of 15.6‰ (1/64). This number will be considered as the base line for the frequency of appearance of the non-degenerate composons. Since each one of the degenerate composons is encoded by six triplets, the expected usage of these composons would be 93.6‰ (6 × 15.6‰).
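The arithmetic behind the base line can be checked with a short calculation. The helper name `baseline_permille` is hypothetical. Note that the value of 93.6‰ quoted in the text corresponds to six times the rounded figure of 15.6‰; without intermediate rounding the product is 93.75‰.

```python
def baseline_permille(n_triplets, n_possible=64):
    """Expected usage (per thousand) of a composon made of n_triplets triplets
    in a random sequence where all 64 triplets are equally likely."""
    return 1000.0 * n_triplets / n_possible


print(baseline_permille(1))  # 15.625 (non-degenerate composons)
print(baseline_permille(6))  # 93.75  (sixfold degenerate composons)
```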

Composon Fingerprints and Composon Signatures

The composon fingerprint of a specified gene cluster is defined by the composon usage frequencies of the CDS and IS that do not overlap with the composon usage frequencies of random sequences (the base line). The composon signature of a particular gene cluster (represented as a barcode) is defined by the composon usage frequencies of the CDS that do not overlap with those of the IS.
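The two definitions can be expressed as simple filters, as sketched below. The text does not state how overlap is decided, so this sketch assumes that two values overlap when their mean ± SD intervals intersect; that criterion, the function names, and the numbers in the usage example are all hypothetical.

```python
def intervals_overlap(mean_a, sd_a, mean_b, sd_b):
    """Assumed criterion: mean ± SD intervals intersect."""
    return abs(mean_a - mean_b) <= sd_a + sd_b


def fingerprint(stats, baseline):
    """Composons whose usage does not overlap with the base line.

    stats maps a composon label to (mean, SD); baseline maps it to a value.
    """
    return sorted(
        label for label, (mean, sd) in stats.items()
        if not intervals_overlap(mean, sd, baseline[label], 0.0)
    )


def signature(cds_stats, is_stats):
    """Composons whose usage in CDS does not overlap with that in IS."""
    return sorted(
        label for label in cds_stats
        if not intervals_overlap(*cds_stats[label], *is_stats[label])
    )


# Made-up values for two composons, for illustration only.
baseline = {"<AT>": 93.75, "<C>": 15.625}
cds = {"<AT>": (40.0, 5.0), "<C>": (16.0, 2.0)}
introns = {"<AT>": (150.0, 10.0), "<C>": (15.0, 2.0)}
print(fingerprint(cds, baseline))  # ['<AT>']
print(signature(cds, introns))     # ['<AT>']
```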

Statistical Criterion for the Selection of Gene Samples

The gene sample of human DNA was selected by using the random sample selection criterion, which characterizes a good sample from a population as one in which any individual sequence is as likely to be included as any other. Moreover, when only a single sample from a population is available, the best estimate of the mean of the population is given by the mean of the sample. Thus, if the mean of the sample and the mean of the population are similar, it is inferred that a correct selection of the sample was made. A collection of Mus musculus genes homologous to those of humans was also selected. The mean composon usage frequencies of the populations of 93487 human-CDSs and 53036 mouse-CDSs were calculated taking into account the codon usage tables tabulated in the Codon Usage Database, available at http://www.kazusa.or.jp/codon/ (Nakamura et al. 2000). Using the algorithms previously described, we calculated the mean composon usage frequency of the CDS of human and mouse genes. Figure 2 shows that the mean of the composon usage frequencies of the human gene population and the gene sample were nearly identical (r² = 0.998; P < 0.0001). A similar value (r² = 0.989; P < 0.0001) was found between the mouse gene population and the sample.
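One way such a comparison could be set up is sketched below. The text does not detail how the codon usage tables were converted into composon frequencies, so summing the tabulated per-thousand codon frequencies by composon is an assumption, as are the function names. The codon values in the usage example are invented and are not taken from the Codon Usage Database.

```python
from math import sqrt


def composons_from_codon_table(codon_permille):
    """Sum per-thousand codon frequencies by composon (U is read as T)."""
    usage = {}
    for codon, freq in codon_permille.items():
        bases = set(codon.upper().replace("U", "T"))
        label = "<" + "".join(sorted(bases)) + ">"
        usage[label] = usage.get(label, 0.0) + freq
    return usage


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length profiles."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    vx = n * sum(a * a for a in x) - sx ** 2
    vy = n * sum(b * b for b in y) - sy ** 2
    if vx <= 0 or vy <= 0:
        return 0.0  # assumption: constant profiles carry no correlation
    return (num / (sqrt(vx) * sqrt(vy))) ** 2


# Invented per-thousand codon frequencies, for illustration only.
toy_table = {"UUU": 17.0, "UUC": 20.0, "CUU": 13.0}
print(composons_from_codon_table(toy_table))  # {'<T>': 17.0, '<CT>': 33.0}
```

With real data, the population profile obtained this way and the mean profile of the sample would each have 14 values, and `r_squared` would be applied to those two lists.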

Fig. 2

Plots of the composon usage frequency averages for both the sample and the population of human-CDS. The black line represents the composon usage frequency average of the population that was estimated by using the codon usage tables of Homo sapiens tabulated in the Codon Usage Database. The mean frequency (±SD) profile of composon usage was estimated from the CDS of 706 human genes chosen from the human gene population on the basis of the random sample selection criterion

Data Source

The human and mouse DNA sequences were taken from NCBI-GenBank Flat File Release 160.0 [June 15, 2007] (http://www.ncbi.nlm.nih.gov/entrez/). Only complete genes were used for the analysis. Supplemental Table 1 lists the sample of the 706 genes of Homo sapiens analyzed in the article with their accession numbers.

Results

Composon Usage Frequencies of a Sample of Human Genes

Figure 3a and b shows the profiles of the composon usage frequency of the IS and CDS analyzed in this study, respectively. Two characteristics clearly differ between these two profiles: (1) composons <AT> in IS and <CG> in CDS are used with high variability, and (2) composons <CG> in IS and <AT> in CDS are scarcely used. Thus, as a whole, the IS and the CDS present clear differences in their compositional contents. To analyze in more detail the differences in composon usage frequencies between IS and CDS, a statistical analysis (a t-test) of the averages of the composon usage frequencies in IS and CDS was carried out. Table 2 shows that only the composon usage frequencies of <C> and <ACT>, in CDS and IS, have a high probability of overlapping, with P_t > 10⁻². The rest of the composons have usage frequencies with overlapping probabilities in the range 10⁻⁹⁶ < P_t < 10⁻⁸.
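For readers who wish to reproduce this kind of comparison, the sketch below computes a two-sample t statistic for one composon. The text does not specify which variant of the t-test was applied, so Welch's unequal-variance form is an assumption, and the function name `welch_t` is hypothetical. Only the statistic and its degrees of freedom are computed; the probability P_t would then be read from the t distribution, for example with SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up usage frequencies of one composon in five CDS and five IS.
t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, df)
```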

Fig. 3

Composon usage frequency profiles of all IS (a) and all CDS (b) of the gene sample. Each line represents a DNA sequence

Table 2 Statistical data obtained from a sample of 706 human genes for each one of the 14 composons

k-Means Clustering of Human Genes

To analyze the possible existence of defined patterns in the distribution of the composon usage frequencies in CDSs shown in Fig. 3b, a k-means clustering algorithm was applied to all CDSs of the gene sample. The number of clusters, for a given size of the sample, was defined by the confidence limits of the statistical parameters that characterize the cluster with a minimal number of elements. The aim is to find the clustering solution that minimizes the within-cluster sum of distances. In this case, with a sample of 706 human genes, the minimum number of clusters turns out to be 11. The percentages of genes in clusters 1 to 11 were approximately 10.6, 10.4, 8.3, 6.9, 7.1, 12.2, 12.2, 6.3, 10.3, 6.6, and 9.4%, respectively.

The clustering of genes was performed according to the composon usage frequencies of their CDS. Then, for each cluster, a statistical analysis of the composon usage frequencies of the IS was also performed. Interestingly, it was observed that all IS of the same cluster have nearly the same composon usage frequencies, which nevertheless differ from those of the CDS. This means that all the IS of the CDS of a given cluster also form a cluster, as expected if they have evolved together. The profiles of the composon usage frequencies of CDS and IS are shown in Fig. 4. In each profile, the composon usage frequencies are represented with their respective standard deviations. We provide the numerical values of the average composon usage frequencies as supplemental material (see Tables 2–12 in supplementary material).

Fig. 4

Composon usage frequency averages (±SD) of human-CDS (dark lines) and IS (gray lines) in each one of the gene clusters. The dotted line represents the base line. Each gene cluster is represented in the figure by a number

Composon Usage Frequency of CDS Versus IS

A detailed analysis of the mean composon usage frequencies of each cluster shows that there are notable inter-cluster and also intra-cluster differences between CDS and IS. To assess whether or not the composon usage frequency averages of CDS and IS are statistically equal in each of the gene clusters, a t-test was performed (Table 3). The data shown indicate that there are specific inter-cluster and intra-cluster differences. Only the usage frequency averages of <G>, <C>, and <ACT> in CDS and IS have a high probability of being similar in all clusters (P_t > 10⁻³), with the exception of the CDS and IS of clusters 5 and 7. The usage frequency average of <AC> also has a high probability (P_t > 10⁻⁴) of being similar in all clusters, with the exception of the CDS and IS of clusters 1 and 9. The inter-cluster variations in the usage frequency averages of the rest of the composons differ within a broad range. The dispersion values of each one of the clusters and the probability obtained from the t-test for CDS and IS are provided as supplemental material (see Tables 2–12 in supplementary material).

Table 3 Values of the probability (P_t) obtained from a t-test to assess the hypothesis that the composon usage frequency averages obtained in CDS and IS are statistically equal for each gene cluster

Fingerprint of CDS and IS

As indicated before, we define the composon fingerprint of a particular gene cluster as the composon usage frequency averages of the CDS and IS that do not overlap with the base line (Fig. 4, Table 3). Thus, the fingerprint of the CDS together with that of the IS constitutes the fingerprint of the cluster. Since we are dealing with CDS and their corresponding IS, the cluster can properly be named a gene cluster. These data indicate that the IS of the CDS that belong to a single specific cluster also form a single specific cluster, an indication that the CDS and IS of a particular gene co-evolve as a unit.

The Gene Cluster Barcode

We have detected that the usage frequencies of certain composons overlap in CDS and IS. The set of composons of CDS or IS of each cluster whose usage frequencies do not overlap with each other defines the barcode of the cluster. In order to determine whether each one of the gene clusters has features that differentiate it from the others, an analysis of the composon usage frequencies of CDS and IS from each of the clusters was carried out. The analysis reveals the composon usage frequency averages that do not overlap with one another. The black and white bars (Fig. 5) correspond to those composons whose mean usage frequencies are lower and higher than the base line, respectively. The gray bar corresponds to those composons whose mean usage frequencies overlap with the base line. The white regions represent the composons whose mean usage frequencies overlap between CDS and IS. Figure 5 shows that in CDS and IS composons <G>, <C>, and <ACT> overlap in all clusters but that all gene clusters have different barcodes.

Fig. 5

Barcodes of CDS and IS in each one of the gene clusters. White and black bars correspond to those composon usage frequencies higher and lower than the base line, respectively. The gray bar corresponds to those frequencies that overlap with the base line. The dotted line represents regions where the composons <G>, <C>, and <ACT> overlap with each other in all clusters

k-Means Clustering of Mouse Genes

We then wanted to analyze whether or not the composon usage pattern of the clusters found in humans is also conserved in another species, such as mouse. For that purpose, a collection of mouse genes homologous to those of humans was clustered following the same methodology we used for the clustering of human genes. Our data indicated that the composon usage frequencies defining clusters 2 and 5 in humans are different in mouse. The composon usage frequencies defining the rest of the human clusters are also found in mouse. A high correlation for CDSs (r² > 0.932) was observed between clusters 1, 3, 4, 6, 7, 8, 9, 10, and 11 in mouse and the corresponding ones in humans. A high correlation for IS (r² > 0.911) was also observed between clusters 1, 3, 4, 6, 7, 8, 9, 10, and 11 in mouse and the corresponding ones in humans. However, for some composons statistically significant differences in the usage frequency were observed between similar human and mouse clusters (Fig. 6a, b). The genes present in clusters 2 and 5 in humans are present in several clusters in mouse, as shown in Fig. 7.

Fig. 6

Comparison of the human (dotted lines) and mouse (continuous line) composon usage frequency averages. CDS (left columns) and IS (right columns). Each gene cluster is represented by a number. a Clusters 1, 3, 4, and 6. b Clusters 7 to 11. Clusters 2 and 5 (from humans) are absent in mouse

Fig. 7

Number of genes (in percentage) from various mouse clusters whose human homologous genes are present in cluster 2 (a) and cluster 5 (b)

Discussion

Nucleotide composition measurements have been used for taxonomic DNA classification (Bernardi 1995, 2000; Schildkraut et al. 1962; Sueoka 1961) and to analyze the distance of phylogenetic relationships that correlate with the notion of triplet permutation in sequences with the same triplet usage (Altschul and Erickson 1985). Methodologies based on the identification of characteristic compositional features of coding sequences that distinguish them from noncoding DNA have been used for gene prediction (Cruveiller et al. 2003). As a practical outcome of the data presented in this article, algorithms based on specific fingerprints of CDS and IS and on gene barcodes could be developed to identify and cluster gene sequences by the analysis of their composon usage frequency.

The coding system presented in this article defines the triplet composons as groups of triplets that contain the same set of distinct bases, in whatever order and number, giving all possible neighbor base contexts that could be formed within a triplet (see Fig. 1). A coding system such as the one presented may be highly relevant due to the vital importance that the neighbor base context has in the mutation regime (Antezana and Jordan 2008) and possibly in the rules that govern the long-term shaping of DNA sequences. Moreover, it is known that in coding regions of vertebrates a strong correlation exists between, on the one hand, GC content and codon usage and, on the other, certain motif preferences. These are conserved in vertebrates even more rigidly than codon usage in coding regions. It is known that the occurrence–preference correlations are strongest in intronic and non-genic DNA (Antezana and Jordan 2008). It has also been shown that the human coding regions are less mutable under estimated neighbor-base-dependent mutation than under matched context-independent mutation and that this entails marked differences between the spectra of amino acid mutations that either mutation regime should generate (Antezana and Jordan 2008).

The data presented in Fig. 4 and supplemental Tables 3–12 show that the clusters with the highest number of genes are those with a higher usage frequency of composons rich in GC content (65%) and that by contrast the clusters with a lower number of genes are those with a lower usage frequency in composons rich in GC content (35%). This finding is in agreement with the data indicating that the GC content of a gene is highly correlated with the GC content of the region of the genome, namely, the isochore, in which it is found (Bernardi et al. 1985; Bernardi 2000) and that this variation in base composition affects both coding and non-coding sequences (Eyre-Walker and Hurst 2001).

Although there are articles dealing with the evolutionary and functional role of introns (Mattick 1994; Stoltzfus 1994), compositional fingerprints of introns in humans, to our knowledge, have not yet been obtained. The fact that the CDS and IS formed clusters when the reading of the sequence was done in a fully overlapping way most likely indicates that the DNA sequence contains a type of information that is independent of the reading frame. This observation is in agreement with data reporting that the “off-frame” trinucleotide-motif preferences inferable from Drosophila coding regions are often in the same direction as Drosophila’s “in-frame” codon preferences, i.e., its codon usage (Antezana and Kreitman 1999). Although codon bias is only shown in coding sequences, it could reflect a wider base context (Zhang et al. 2003).

It was interesting to detect that all IS associated with each one of the CDS of the same cluster also form a cluster having, however, a composition different from that of the CDS. This finding suggests that the CDS and IS of a given gene co-evolve as a single unit and that, most likely, this type of co-evolution occurs in all genes. The compositional co-evolution between CDS and IS is in agreement with reported data indicating that there is co-evolution in the intron–exon architecture of various eukaryotic genomes, including the human genome, and that common factors must exist in shaping both exons and introns (Zhu et al. 2009). At this stage, it would be difficult to provide details about the mechanistic bases of how the base composition of coding and intron sequences co-evolves. The evolutionary rates of coding and intron sequences are different, since more indels (random nucleotide substitution and insertion/deletion events) are likely to occur in intron sequences than in exons, because an indel in an exon, even if it does not cause a frameshift, may be disruptive to protein structure (Hughes and Yeager 1997). Nevertheless, previous investigations on human genes have shown that consistent compositional constraints exist between exon and intron sequences (Clay et al. 1996). Evidence for the existence of functional constraints has also been provided by analysis of the high levels of sequence conservation in noncoding DNA between human and mouse genes (Hare and Palumbi 2003).

Since, as we described, there are notable differences in composon usage frequencies between exons and introns, we think that they may represent important factors for the recognition of, for example, the splice sites because of the relationship that exists between the composon usage frequency and the base composition of a DNA sequence. This difference in composition between coding exons and flanking introns is known to be important for the recognition of splice sites. The marked difference in base composition of the leader exon and the leader intron is probably maintained by selection despite a high rate of sequence divergence (Gottlieb and Ford 2002). In line with this observation, we believe that the particular association observed in all clusters between the composon usage frequencies of CDS and their respective intron sequences is in general conserved because it could have a selective advantage.

The observation that between the corresponding human and mouse clusters there is high correlation in their composon usage frequencies indicates that the majority of the clusters are stable for both the CDS and IS. The fact that the IS associated with each one of the CDS of the same cluster also form a cluster suggests that, as we mentioned for humans, the CDS and IS of a given gene co-evolve as a single unit.

It was interesting to observe that the composon usage frequencies of CDS and IS defining human clusters 2 and 5 are different in mouse. The observation, moreover, that some of the mouse genes from different clusters form new clusters in humans seems to suggest that the sequence composition of those genes has evolved with higher mutational rates than that of the genes from the rest of the clusters. In view of the data presented, we believe that it would be appropriate to analyze the base composition of specific homologous human and mouse genes from those clusters in which changes in composon usage frequency have been detected, in order to locate within the sequences the position and nature of these changes.

We think that the data presented here are in agreement with those indicating that both coding and non-coding sequences in complex organisms carry informational content although of a different nature. Coding sequences convey functional and evolutionary information in terms of proteins according to the genetic code while non-coding sequences may convey information in regulatory (Mattick 1994; Mitchell and Tjian 1989) as well as evolutionary terms (Blake 1979; Gilbert 1987; Stoltzfus 1994). We think that the finding of specific compositional features in IS implies that these sequences could be considered as key informational and functional elements in large genomes of complex organisms as has been suggested (Vinogradov 2003; Vinogradov 2006; Zuckerkandl 1981).