Introns Form Compositional Clusters in Parallel with the Compositional Clusters of the Coding Sequences to Which they Pertain

Fuertes, Miguel A.; Pérez, José M.; Zuckerkandl, Emile; Alonso, Carlos

doi:10.1007/s00239-010-9411-6

Published: 04 December 2010

Volume 72, pages 1–13, (2011)

Journal of Molecular Evolution


Abstract

This report deals with the study of compositional properties of human gene sequences, evaluating similarities and differences among functionally distinct sectors of the gene independently of the reading frame. To retrieve the compositional information of DNA, we present a neighbor base dependent coding system in which the alphabet of 64 letters (DNA triplets) is compressed to an alphabet of 14 letters, here termed triplet composons. The triplets containing the same set of distinct bases, in whatever order and number, form a triplet composon. The reading of the DNA sequence is performed starting at any letter of the initial triplet and then moving, triplet-to-triplet, until the end of the sequence. The readings were made in an overlapping way along the length of the sequences. The analysis of the compositional content in terms of the composon usage frequencies of the gene sequences shows that: (i) the compositional content of the sequences is far from that of random sequences, even in the case of non-protein coding sequences; (ii) coding sequences can be classified as components of compositional clusters; and (iii) intron sequences in a cluster have the same composon usage frequencies, even though their base composition differs notably from that of their home coding sequences. A comparison of the composon usage frequencies between human and mouse homologous genes indicated that two clusters found in humans do not have their counterpart in mouse, whereas the other clusters are stable in both species with respect to their composon usage frequencies in both coding and noncoding sequences.


Introduction

Compositional properties of genome sequences were investigated long before the beginning of the genome sequence projects, since global and local compositional features of DNA are important from an evolutionary point of view and because they also provide information about functional elements. One approach that has been used to study the compositional properties of genome sequences is related to the concept of genome signature that, so far, has been conceived in terms of dinucleotide frequencies (Campbell et al. 1999; Karlin and Ladunga 1994) or the frequencies of longer oligonucleotides (Ohno 1987; Bultrini et al. 2003). It is believed that the genome signature may represent the response of the genome to an overall selective pressure at the DNA level, a pressure that is basically constant in both coding and non-coding sequences (Russell et al. 1976). There are, however, a number of known genome signatures that differ between coding and non-coding regions (Havlioglu et al. 2007).

Another fruitful approach to study the compositional properties of DNA deals with the concept of the information supplied per base or residue as deduced from information theory (Shannon 1948). For that purpose, compression algorithms applied to a sequence in which the length of the compressed sequence is taken as a measure of its complexity are especially useful.

The genome signature and the information content are interrelated since the information content of DNA, recorded in the base composition, often reflects phylogenetic relationships (Sueoka 1961) and sheds light on the mechanisms that shape genomes (Bernardi 2000). Consequently, systematic measurements of base compositions have been used for taxonomic classification objectives (Schildkraut et al. 1962). As it has also been pointed out, the nucleotide order of DNA sequences contributes to their similarity in base composition if the evolutionary distance is much smaller than their average permutation distance obtained by calculating the distances for many random permutations of these sequences (Altschul and Erickson 1985). Thus, to retrieve stored information from DNA not only the genetic code is required but also other genetic (Osawa et al. 1992) and epigenetic codes (Zuckerkandl and Cavalli 2007).

Some data have been reported indicating that the size of complex genomes has not been reached by the incorporation of sequences generated de novo without any dependence upon pre-existing coding or non-coding sequences, but that it has resulted from increases in ploidy and from smaller scale duplication of DNA sequences followed by sequence divergence and often, eventually, translocation of sequence duplicates (Ohno 1970; Long and Langley 1993; Begun et al. 2007; Levine et al. 2006). It is, then, fully expected that, to a significant extent, DNA sectors of a given organism can be grouped into clusters sharing similar compositional features.

In this report, a novel way of evaluating similarities and differences among functionally distinct DNA sectors based on neighbor base context is presented. This approach reveals compositional features that, to our knowledge, have not previously been reported. It provides, moreover, a new coding system whose elements, the triplet composons, give precise information about the base context of a given DNA sequence, a context that depends closely on the mutation regime of the sequence (Antezana and Jordan 2008).

Triplet composons are defined here as groups of triplets that contain the same set of distinct bases, in whatever order and number, giving all possible neighbor base combinations that could be formed within a triplet. In this article, we have analyzed the composon usage frequency of a sample of 706 human genes. The results reveal that both the coding and intron sequences form clusters characterized by specific composon usage signatures. The sequences that belong to a particular cluster have similar base composition but dissimilar triplet permutation frequencies. By triplet permutation frequency, we understand the number of sequence changes in the set of overlapping triplets that compose the sequence. To examine the evolution of the gene clusters found in humans, human and mouse homologous gene samples of similar size were compared.

Methods

DNA Composition in Terms of Varied Base Representation in Consecutive Standard Length Sequence Fragments (Multiplets)

Establishing a genome’s spectrum of base multiplets obviously requires exploring the frequency of occurrence of multiplets of different length per unit length of DNA. The higher the number of bases is in a multiplet used as probe, the lower will be the frequency of the multiplet’s occurrence in a DNA sequence. Markedly short or markedly long test sequences to be matched to genomic sequences yield only limited information about the DNA being explored. Information can be gained by focusing on the base composition of the multiplets in addition to their length. This can be done in particular by counting the number of multiplets that share in the exclusion of particular bases. We will refer to multiplets as “exclusionary multiplets” when they exclude particular bases. As can be seen in Table 1, the maximum number of different exclusionary multiplets with the minimal number of bases is the highest for triplets.

Table 1 Number of exclusionary multiplets found for different DNA fragments of unit length based on the presence or absence of particular bases

The Triplet Composon Code

In this article, the term “triplet” refers to any trinucleotide in whatever part of the genome. A set of triplets containing the same set of distinct bases, in whatever order and number, will be referred to as a “triplet composon”. For example, the triplets AAC, ACA, CAA, CCA, CAC, and ACC constitute the composon abbreviated <AC>. As can be observed, the <AC> composon contains all possible triplets that are formed by both nucleotides A and C and no other base. The 64 DNA triplets that constitute the 14 possible triplet composons and their abbreviations are shown in Fig. 1. For the sake of brevity, in the rest of the article the term triplet composon will be shortened to composon.
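The grouping of the 64 triplets into 14 composons can be sketched in a few lines of Python. This is only an illustrative sketch: the names `composon_of` and `COMPOSONS` are hypothetical, and the label format (distinct bases in alphabetical order between angle brackets) is assumed from the abbreviations used in the text, such as <AC> and <ACT>.

```python
from itertools import product

BASES = "ACGT"

def composon_of(triplet):
    """Return the composon label of a triplet, e.g. 'CAC' -> '<AC>'.

    The label lists the distinct bases of the triplet in alphabetical
    order (an assumed convention, matching labels such as <ACT>).
    """
    return "<" + "".join(sorted(set(triplet))) + ">"

# Group all 64 triplets by the set of distinct bases they contain.
COMPOSONS = {}
for bases in product(BASES, repeat=3):
    triplet = "".join(bases)
    COMPOSONS.setdefault(composon_of(triplet), []).append(triplet)

# 4 one-base composons (1 triplet each), 6 two-base composons and
# 4 three-base composons (6 triplets each): 4 + 36 + 24 = 64.
for label in sorted(COMPOSONS, key=lambda s: (len(s), s)):
    print(label, " ".join(COMPOSONS[label]))
```

The count follows directly from the definition: a triplet can contain one, two, or three distinct bases, which gives 4 + 6 + 4 = 14 possible base sets.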

Calculation of Composon Usage Frequencies

To calculate the composon usage frequencies in the DNA of a given gene, the exons on the one hand and the introns on the other were excised and stitched together. To do that, software utilities from both the National Center for Biotechnology Information (NCBI) DNA database and the European Molecular Biology Open Software Suite (EMBOSS) explorer were used. The NCBI sequence viewer utility, named cds, stitches together the exons of a gene, giving as output the protein coding sequence in FASTA format. In the case of introns, the utility used was coderet from the EMBOSS explorer. The concatenated exons of a given gene were treated as a single sequence, hereafter named the coding sequence (CDS), and the concatenated introns as another single sequence, hereafter named the intron sequence (IS).

Composon usage frequencies in genomic DNA fragments were then determined starting the reading at any letter of the original triplet (5′) and moving from triplet-to-triplet to the end of the sequence, in a fully overlapping way. The EMBOSS utility compseq was used for this reading. A given composon usage frequency was obtained by summing up the usage frequencies of each of the triplets that constitute a composon. The usage frequencies of the 14 composons, in ‰, were tabulated and represented graphically.
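The fully overlapping reading described above can be illustrated with a short stand-alone Python sketch. It does not reproduce the EMBOSS compseq utility; the function name `composon_usage` and the decision to skip windows containing symbols other than A, C, G, and T are assumptions made for the example.

```python
from collections import Counter

def composon_usage(seq):
    """Composon usage frequencies, in per thousand, of a DNA sequence.

    The sequence is read in fully overlapping triplets (window step of
    one base), and each triplet is counted under the composon formed
    by its set of distinct bases.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        # Assumption: windows with ambiguous symbols (e.g. N) are skipped.
        if set(triplet) <= set("ACGT"):
            counts["<" + "".join(sorted(set(triplet))) + ">"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 1000.0 * n / total for label, n in counts.items()}

# "AACGT" is read as AAC, ACG, CGT, i.e. <AC>, <ACG>, <CGT>.
print(composon_usage("AACGT"))
```

Because every window contributes to exactly one composon, the 14 frequencies of a sequence always sum to 1000‰, so summing triplet frequencies per composon and counting composons directly give the same result.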

The k-means algorithm used for gene clustering is available on the GEPAS website http://www.gepas.org (Montaner et al. 2006). This algorithm starts from a random initial clustering. Then, by iteration, the algorithm calculates the cluster centroids and reassigns each element to the cluster with the closest centroid; it halts when no more reassignments are made. The centroid, or centre, is the average of all the points in the cluster (MacQueen 1967). The distance function used in the analysis was the Pearson correlation coefficient, which measures the strength and the direction of the linear relationship between the n = 14 pairs of values of the two variables x_i and y_i, which represent the composon usage frequencies of the two sequences to be compared. The coefficient is, then, the best estimate of the correlation of x_i and y_i and it is written as:

$$ r = \frac{n\sum x_{i} y_{i} - \sum x_{i} \sum y_{i}}{\sqrt{n\sum x_{i}^{2} - \left(\sum x_{i}\right)^{2}}\,\sqrt{n\sum y_{i}^{2} - \left(\sum y_{i}\right)^{2}}}. $$

(1)
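As an illustration of Eq. 1 and of the iterative reassignment described above, the following Python sketch computes the Pearson coefficient and runs a minimal k-means loop. It is not the GEPAS implementation. Two points are assumptions of the sketch: the distance is taken as 1 − r, and the initial centroids are supplied by the caller (for instance, profiles drawn at random) rather than derived from a random initial clustering. The names `pearson_r` and `kmeans` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors (Eq. 1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    vx = n * sum(a * a for a in x) - sx * sx
    vy = n * sum(b * b for b in y) - sy * sy
    if vx <= 0 or vy <= 0:
        raise ValueError("correlation is undefined for a constant vector")
    return (n * sxy - sx * sy) / (sqrt(vx) * sqrt(vy))

def kmeans(profiles, centroids, max_iter=100):
    """Minimal k-means with distance 1 - r (an assumed distance form).

    profiles:  list of composon usage vectors (14 values each in the paper).
    centroids: initial centroids chosen by the caller.
    Returns (labels, centroids) once no element changes cluster.
    """
    centroids = [list(c) for c in centroids]
    k = len(centroids)
    labels = [None] * len(profiles)
    for _ in range(max_iter):
        # Assign each profile to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: 1.0 - pearson_r(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:  # no reassignment: halt
            break
        labels = new_labels
        # Recompute each centroid as the average of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With a correlation-based distance, two profiles with the same shape but different overall magnitude fall into the same cluster, which is one reason such a distance suits frequency profiles.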

Composon Usage of Random DNA Sequences

The base line used represents the composon usage frequency in a random DNA sequence of infinite length in which the four nucleotides A, T, G, and C are equally frequent. Thus, on average, each triplet in that random DNA sequence must appear with a frequency of 1/64, that is, 15.6‰. This number will be considered as the base line for the frequency of appearance of the non-degenerate composons, each of which consists of a single triplet. Since each one of the degenerate composons is codified by six triplets, the expected usage of these composons would be 93.6‰ (6 × 15.6‰; 93.75‰ without the intermediate rounding).
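The base line values can be checked with a small Python sketch. It assumes that successive bases are independent, so that the probability of a triplet is the product of its base frequencies; with equal base frequencies this gives 1/64 per triplet. The function name `baseline_usage` and the optional `base_freqs` parameter are hypothetical additions for illustration.

```python
from itertools import product

def baseline_usage(base_freqs=None):
    """Expected composon usage (per thousand) in a random sequence.

    Assumes independent bases. With the default equal frequencies,
    every triplet has probability 1/64.
    """
    if base_freqs is None:
        base_freqs = {b: 0.25 for b in "ACGT"}
    usage = {}
    for triplet in product("ACGT", repeat=3):
        p = 1.0
        for b in triplet:
            p *= base_freqs[b]
        label = "<" + "".join(sorted(set(triplet))) + ">"
        usage[label] = usage.get(label, 0.0) + 1000.0 * p
    return usage

base = baseline_usage()
print(base["<A>"], base["<AC>"])  # unrounded: 15.625 and 93.75
```

The unrounded values are 15.625‰ for a single-triplet composon and 93.75‰ for a six-triplet composon; the figures 15.6‰ and 93.6‰ quoted in the text correspond to rounding the per-triplet value before multiplying by six.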

Composon Fingerprints and Composon Signatures

The composon fingerprint of a specified gene cluster is defined by the composon usage frequencies of the CDS and IS that do not overlap with the composon usage frequencies of random sequences (the base line). The composon signature of a particular gene cluster (represented as a barcode) is defined by the composon usage frequencies of the CDS that do not overlap with those of the IS.
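The two definitions can be expressed as simple set selections. The sketch below is only one possible reading: it assumes that two usage frequencies "overlap" when their mean ± SD intervals intersect, a criterion the text does not state explicitly. The function names and the numerical values in the usage example are hypothetical and are not taken from the paper's tables.

```python
def intervals_overlap(mean_a, sd_a, mean_b, sd_b):
    """True if [mean_a ± sd_a] and [mean_b ± sd_b] intersect (assumed criterion)."""
    return abs(mean_a - mean_b) <= sd_a + sd_b

def fingerprint(stats, baseline):
    """Composons whose usage does not overlap with the base line.

    stats:    {composon: (mean, sd)} for the CDS or the IS of a cluster.
    baseline: {composon: expected usage in a random sequence}.
    """
    return sorted(
        label for label, (mean, sd) in stats.items()
        if not intervals_overlap(mean, sd, baseline[label], 0.0)
    )

def signature(cds_stats, is_stats):
    """Composons whose CDS usage does not overlap with the IS usage."""
    return sorted(
        label for label in cds_stats
        if label in is_stats
        and not intervals_overlap(*cds_stats[label], *is_stats[label])
    )

# Made-up illustrative values (per thousand), not data from the article.
cds = {"<AT>": (40.0, 5.0), "<C>": (15.0, 3.0)}
ins = {"<AT>": (120.0, 10.0), "<C>": (16.0, 4.0)}
base = {"<AT>": 93.75, "<C>": 15.625}
print(fingerprint(cds, base), signature(cds, ins))
```

A stricter criterion, such as the t-test probabilities reported in the Results, could replace `intervals_overlap` without changing the structure of the two selections.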

Statistical Criterion for the Selection of Gene Samples

The gene sample of human DNA was selected by using the random sample selection criterion, which characterizes a good sample from a population as one in which any individual sequence is as likely to be included as any other. Moreover, when only a single sample from a population is available, the best estimate of the mean of the population is given by the mean of the sample. Thus, if the mean of the sample and the mean of the population are similar, it is inferred that a correct selection of the sample was made. A collection of Mus musculus genes homologous to those of humans was also selected. The mean composon usage frequencies of the populations of 93487 human CDSs and 53036 mouse CDSs were calculated from the codon usage tables of the Codon Usage Database, available at http://www.kazusa.or.jp/codon/ (Nakamura et al. 2000). Using the algorithms previously described, we calculated the mean composon usage frequency of the CDS of human and mouse genes. Figure 2 shows that the mean composon usage frequencies of the human gene population and of the gene sample were nearly identical (r² = 0.998; P < 0.0001). A similar value (r² = 0.989; P < 0.0001) was found between the mouse gene population and the sample.

Data Source

The human and mouse DNA sequences were taken from NCBI-GenBank Flat File Release 160.0 [June 15, 2007] (http://www.ncbi.nlm.nih.gov/entrez/). Only complete genes were used for the analysis. Supplemental Table 1 lists the sample of the 706 genes of Homo sapiens analyzed in the article with their accession numbers.

Results

Composon Usage Frequencies of a Sample of Human Genes

Figure 3a and b show the profiles of the composon usage frequency of the IS and CDS analyzed in this study, respectively. We can observe two characteristics that are clearly different in these two profiles: (1) composons <AT> in IS and <CG> in CDS are used with high variability, and (2) composons <CG> in IS and <AT> in CDS are scarcely used. Thus, as a whole, the IS and the CDS present clear differences in their compositional contents. In order to analyze in more detail the differences in composon usage frequencies between IS and CDS, a statistical analysis (a t-test) of the averages of the composon usage frequencies in IS and CDS was carried out. Table 2 shows that only the composon usage frequencies of <C> and <ACT>, in CDS and IS, have a high probability of overlapping, with P_t > 10⁻². The rest of the composons have usage frequencies with overlapping probabilities in the range 10⁻⁹⁶ < P_t < 10⁻⁸.

Table 2 Statistical data obtained from a sample of 706 human genes for each one of the 14 composons

k-Means Clustering of Human Genes

To analyze the possible existence of defined patterns in the distribution of the composon usage frequencies in CDSs shown in Fig. 3b, a k-means clustering algorithm was applied to all CDSs of the gene sample. The number of clusters, for a given size of the sample, was defined by the confidence limits of the statistical parameters that characterize the cluster with a minimal number of elements. The aim is to find the clustering solution that minimizes the within-cluster sum of distances. In this case, with a sample of 706 human genes, the minimum number of clusters turned out to be 11. The percentages of genes in clusters 1 to 11 were approximately 10.6, 10.4, 8.3, 6.9, 7.1, 12.2, 12.2, 6.3, 10.3, 6.6, and 9.4%, respectively.

The clustering of genes was performed according to the composon usage frequencies of their CDS. Then, for each cluster, a statistical analysis of the composon usage frequencies of the IS was also performed. Interestingly, it was observed that all IS of the same cluster have nearly the same composon usage frequencies, which nevertheless differ from those of the CDS. This means that all the IS of the CDS of a given cluster also form a cluster, as expected if they have evolved together. The profiles of the composon usage frequencies of CDS and IS are shown in Fig. 4. In each profile, the composon usage frequencies are represented with their respective standard deviations. We provide the numerical values of the average composon usage frequencies as supplemental material (see Tables 2–12 in supplementary material).

Composon Usage Frequency of CDS Versus IS

A detailed analysis of the mean composon usage frequencies of each cluster shows that there are notable inter-cluster and also intra-cluster differences between CDS and IS. To assess whether or not the composon usage frequency averages of CDS and IS are statistically equal in each of the gene clusters, a t-test was performed (Table 3). The data shown indicate that there are specific inter-cluster and intra-cluster differences. Only the usage frequency averages of <G>, <C>, and <ACT> in CDS and IS have a high probability of being similar in all clusters, P_t > 10⁻³, with the exception of the CDS and IS of clusters 5 and 7. The usage frequency average of <AC> also has a high probability (P_t > 10⁻⁴) of being similar in all clusters, with the exception of the CDS and IS of clusters 1 and 9. The inter-cluster variations in the usage frequency averages of the rest of the composons differ within a broad range. The dispersion values of each one of the clusters and the probability obtained from the t-test for CDS and IS are provided as supplemental material (see Tables 2–12 in supplementary material).

Table 3 Values of the probability (P_t) obtained from a t-test to assess the hypothesis of whether the composon usage frequency averages obtained in CDS and IS are statistically equal for each gene cluster

Fingerprint of CDS and IS

As indicated before, we define the composon fingerprint of a particular gene cluster as the composon usage frequency averages of the CDS and IS that do not overlap with the base line (Fig. 4, Table 3). Thus, the fingerprint of the CDS together with that of the IS constitutes the fingerprint of the cluster. Since we are dealing with CDS and their corresponding IS, the cluster can properly be called a gene cluster. These data indicate that the IS of the CDS that belong to a single specific cluster also form a single specific cluster, an indication that the CDS and IS of a particular gene co-evolve as a unit.

The Gene Cluster Barcode

We have detected that the usage frequencies of certain composons overlap in CDS and IS. The set of composons of each cluster whose usage frequencies in CDS and IS do not overlap with each other defines the barcode of the cluster. In order to determine whether each one of the gene clusters has features that differentiate it from the others, an analysis of the composon usage frequencies of CDS and IS from each of the clusters was carried out. The analysis reveals the composon usage frequency averages that do not overlap with one another. The black and white bars (Fig. 5) correspond to those composons whose mean usage frequencies are lower and higher than the base line, respectively. The gray bar corresponds to those composons whose mean usage frequencies overlap with the base line. The white regions represent the composons whose mean usage frequencies overlap between CDS and IS. Figure 5 shows that in CDS and IS composons <G>, <C>, and <ACT> overlap in all clusters but that all gene clusters have different barcodes.

k-Means Clustering of Mouse Genes

We then wanted to analyze whether or not the composon usage pattern of the clusters found in humans is also conserved in another species, such as mouse. For that purpose, a collection of mouse genes homologous to those of humans was clustered following the same methodology we used for the clustering of human genes. Our data indicated that the composon usage frequencies defining clusters 2 and 5 in humans are different in mouse. The composon usage frequencies defining the rest of the human clusters are also found in mouse. A high correlation for CDSs (r² > 0.932) was observed between clusters 1, 3, 4, 6, 7, 8, 9, 10, and 11 in mouse and the corresponding ones in humans. A high correlation for IS (r² > 0.911) was also observed between clusters 1, 3, 4, 6, 7, 8, 9, 10, and 11 in mouse and the corresponding ones in humans. However, for some composons statistically significant differences in the usage frequency were observed between similar human and mouse clusters (Fig. 6a, b). The genes present in clusters 2 and 5 in humans are present in several clusters in mouse, as shown in Fig. 7.

Discussion

Nucleotide composition measurements have been used for taxonomic DNA classification (Bernardi 1995, 2000; Schildkraut et al. 1962; Sueoka 1961) and to analyze the distance of phylogenetic relationships that correlate with the notion of triplet permutation in sequences with the same triplet usage (Altschul and Erickson 1985). Methodologies based on the identification of characteristic compositional features of coding sequences that distinguish them from noncoding DNA have been used for gene prediction (Cruveiller et al. 2003). As a practical outcome of the data presented in this article, algorithms based on specific fingerprints of CDS and IS and on gene barcodes could be developed to identify and cluster gene sequences by the analysis of their composon usage frequency.

The coding system presented in this article defines the triplet composons as groups of triplets that contain the same set of distinct bases, in whatever order and number, giving all possible neighbor base contexts that could be formed within a triplet (see Fig. 1). A coding system such as the one presented may be highly relevant due to the vital importance that the neighbor base context has in the mutation regime (Antezana and Jordan 2008) and possibly in the rules that govern the long-term shaping of DNA sequences. As is known, moreover, a strong correlation exists in coding regions of vertebrates between GC content and codon usage on the one hand and certain motif preferences on the other. These are conserved in vertebrates even more rigidly than codon usage in coding regions. It is known that the occurrence–preference correlations are strongest in intronic and non-genic DNA (Antezana and Jordan 2008). It has also been shown that the human coding regions are less mutable under estimated neighbor-base-dependent mutation than under matched context-independent mutation and that this entails marked differences between the spectra of amino acid mutations that either mutation regime should generate (Antezana and Jordan 2008).

The data presented in Fig. 4 and supplemental Tables 3–12 show that the clusters with the highest number of genes are those with a higher usage frequency of composons rich in GC content (65%) and that by contrast the clusters with a lower number of genes are those with a lower usage frequency in composons rich in GC content (35%). This finding is in agreement with the data indicating that the GC content of a gene is highly correlated with the GC content of the region of the genome, namely, the isochore, in which it is found (Bernardi et al. 1985; Bernardi 2000) and that this variation in base composition affects both coding and non-coding sequences (Eyre-Walker and Hurst 2001).

Although there are articles, moreover, dealing with the evolutionary and functional role of introns (Mattick 1994; Stoltzfus 1994), compositional fingerprints of introns in humans, to our knowledge, have not yet been obtained. The fact that the CDS and IS formed clusters when the reading of the sequence was done in a fully overlapping way most likely indicates that the DNA sequence contains a type of information that resides in the DNA sequence independently of the reading frame. This observation is in agreement with data reporting that the “off-frame” trinucleotide-motif preferences inferable from Drosophila coding regions are often in the same direction as Drosophila’s “in-frame” codon preferences, i.e., its codon usage (Antezana and Kreitman 1999). Although the codon bias is only shown in coding sequences it could reflect a wider base context (Zhang et al. 2003).

It was interesting to detect that all IS associated with each one of the CDS of the same cluster also form a cluster having, however, a composition different from that of the CDS. This finding suggests that the CDS and IS of a given gene co-evolve as a single unit and that, most likely, this type of co-evolution occurs in all genes. The compositional co-evolution between CDS and IS is in agreement with reported data indicating that there is co-evolution in the intron–exon architecture of various eukaryotic genomes, including the human genome, and that common factors must exist in shaping both exons and introns (Zhu et al. 2009). At this stage, it would be difficult to provide details about the mechanistic bases of how the base composition of coding and intron sequences co-evolves. The evolutionary rates of coding and intron sequences are different, since more random nucleotide substitutions and insertion/deletion events (indels) are likely to occur in intron sequences than in exons, because an indel in an exon, even if it does not cause a frameshift, may be disruptive to protein structure (Hughes and Yeager 1997). Nevertheless, previous investigations on human genes have shown that consistent compositional constraints exist between exon and intron sequences (Clay et al. 1996). Evidence for the existence of functional constraints has also been provided by the analysis of the high levels of sequence conservation in noncoding DNA between human and mouse genes (Hare and Palumbi 2003).

Since, as we described, there are notable differences in composon usage frequencies between exons and introns, we think that they may represent important factors for the recognition of, for example, the splice sites, because of the relationship that exists between the composon usage frequency and the base composition of a DNA sequence. This difference in composition between coding exons and flanking introns is known to be important for the recognition of splice sites. The marked difference in base composition of the leader exon and the leader intron is probably maintained by selection despite a high rate of sequence divergence (Gottlieb and Ford 2002). In line with this observation, we believe that the particular association observed in all clusters between the composon usage frequencies of CDS and their respective intron sequences is in general conserved because it could have a selective advantage.

The observation that between the corresponding human and mouse clusters there is high correlation in their composon usage frequencies indicates that the majority of the clusters are stable for both the CDS and IS. The fact that the IS associated with each one of the CDS of the same cluster also form a cluster suggests that, as we mentioned for humans, the CDS and IS of a given gene co-evolve as a single unit.

It was interesting to observe that the composon usage frequencies of CDS and IS defining human clusters 2 and 5 are different in mouse. The observation, moreover, that some of the mouse genes from different clusters form new clusters in humans seems to suggest that the sequence composition of those genes has evolved with higher mutational rates than that of the genes from the rest of the clusters. In view of the data presented, we believe that it would be appropriate to analyze the base composition of specific homologous human and mouse genes from those clusters in which changes in composon usage frequency have been detected, in order to locate within the sequences the position and nature of these changes.

We think that the data presented here are in agreement with those indicating that both coding and non-coding sequences in complex organisms carry informational content although of a different nature. Coding sequences convey functional and evolutionary information in terms of proteins according to the genetic code while non-coding sequences may convey information in regulatory (Mattick 1994; Mitchell and Tjian 1989) as well as evolutionary terms (Blake 1979; Gilbert 1987; Stoltzfus 1994). We think that the finding of specific compositional features in IS implies that these sequences could be considered as key informational and functional elements in large genomes of complex organisms as has been suggested (Vinogradov 2003; Vinogradov 2006; Zuckerkandl 1981).

Abbreviations

SD: Standard deviation
CDS: Coding sequence
IS: Intron sequence

References

Altschul SF, Erickson BW (1985) Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 2:526–538
CAS PubMed Google Scholar
Antezana MA, Jordan IK (2008) Highly conserved regimes of neighbor-base-dependent mutation generated the background primary-structural heterogeneities along vertebrate chromosomes. PLoS One 3(5):1–51
Article Google Scholar
Antezana MA, Kreitman M (1999) The nonrandom location of synonymous codons suggest that reading frame-independent forces have patterned codon preferences. J Mol Evol 49:36–43
Article CAS PubMed Google Scholar
Begun DJ, Lindfors HA, Kern AD, Jones CD (2007) Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade. Genetics 176:1131–1137
Article CAS PubMed Google Scholar
Bernardi G (1995) The human genome: organization and evolutionary history. Annu Rev Genet 29:445–476
Article CAS PubMed Google Scholar
Bernardi G (2000) Isochores and the evolutionary genomics of vertebrates. Gene 241:3–17
Article CAS PubMed Google Scholar
Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F (1985) The mosaic genome of warm-blooded vertebrates. Science 228:953–958
Article CAS PubMed Google Scholar
Blake CC (1979) Exons encode protein functional units. Nature 277:598
Article CAS PubMed Google Scholar
Bultrini E, Pizzi E, Del Giudice P, Frontali C (2003) Pentamer vocabularies characterizing introns and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melagaster. Gene 304:183–192
Article CAS PubMed Google Scholar
Campbell A, Mrazek J, Karlin S (1999) Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci USA 96:9184–9189
Article CAS PubMed Google Scholar
Clay O, Cacciò S, Zoubak S, Mouchiroud D, Bernardi G (1996) Human coding and noncoding DNA: compositional correlations. Mol Phylogenet Evol 5(1):2–12
Article CAS PubMed Google Scholar
Cruveiller S, Jabbari K, Clay O, Bemardi G (2003) Compositional features of eukaryotic genomes for checking predicted genes. Brief Bioinform 4:43–52
Article CAS PubMed Google Scholar
Eyre-Walker A, Hurst LD (2001) The evolution of isochores. Nat Rev Genet 2:549–554
Article CAS PubMed Google Scholar
Gilbert W (1987) The exon theory of genes. Cold Spring Harb Symp Quant Biol 52:901–905
CAS PubMed Google Scholar
Gottlieb LD, Ford VS (2002) The 5′ leader of plant PgiC has an intron: the leader shows both the loss and maintenance of constraints compared with introns and exons in the coding region. Mol Biol Evol 19:1613–1623
CAS PubMed Google Scholar
Hare MP, Palumbi SR (2003) High intron sequence conservation across three mammalian orders suggests functional constraints. Mol Biol Evol 20(6):969–978
Article CAS PubMed Google Scholar
Havlioglu N, Wang J, Fushimi K, Vibranovski MD, Kan Z, Gish W, Fedorov A, Long M, Wu JY (2007) An intronic signal for alternative splicing in the human genome. PLoS One 2(11):e1246
Hughes AL, Yeager M (1997) Comparative evolutionary rates of introns and exons in murine rodents. J Mol Evol 45:125–130
Karlin S, Ladunga I (1994) Comparisons of eukaryotic genomic sequences. Proc Natl Acad Sci USA 91:12832–12836
Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA 103:9935–9939
Long M, Langley CH (1993) Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260:91–95
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, Berkeley, pp 281–297
Mattick JS (1994) Introns: evolution and function. Curr Opin Genet Dev 4:823–831
Mitchell PJ, Tjian R (1989) Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245:371–378
Montaner D, Tarraga J, Huerta-Cepas J, Burguet J, Vaquerizas JM, Conde L, Minguez P, Vera J, Mukherjee S, Valls J, Pujana MA, Alloza E, Herrero J, Al-Shahrour F, Dopazo J (2006) Next station in microarray data analysis: GEPAS. Nucleic Acids Res 34:W486–W491
Nakamura Y, Gojobori T, Ikemura T (2000) Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res 28:292
Ohno S (1970) Evolution by gene duplication. Springer-Verlag, New York, p 160
Ohno S (1987) Codon preference is but an illusion created by the construction principle of coding sequences. Proc Natl Acad Sci USA 84:6486–6490
Osawa S, Jukes TH, Watanabe K, Muto A (1992) Recent evidence for evolution of the genetic code. Microbiol Rev 56:229–264
Russell GJ, Walker PM, Elton RA, Subak-Sharpe JH (1976) Doublet frequency analysis of fractionated vertebrate nuclear DNA. J Mol Biol 108:1–23
Schildkraut CL, Mandel M, Levisohn S, Smith-Sonneborn JE, Marmur J (1962) Deoxyribonucleic acid base composition and taxonomy of some protozoa. Nature 196:795–796
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
Stoltzfus A (1994) Origin of introns-early or late. Nature 369:526–527 (author reply 527–528)
Sueoka N (1961) Variation and heterogeneity of base composition of deoxyribonucleic acids: a compilation of old and new data. J Mol Biol 3:31–40
Vinogradov AE (2003) Silent DNA: speaking RNA language? Bioinformatics 19:2167–2170
Vinogradov AE (2006) ‘Genome design’ model and multicellular complexity: golden middle. Nucleic Acids Res 34:5906–5914
Zhang J, Hu J, Shi XF, Cao H, Liu WB (2003) Detection of potential positive regulatory motifs of transcription in yeast introns by comparative analysis of oligonucleotide frequencies. Comput Biol Chem 27(4–5):497–506
Zhu L, Zhang Y, Zhang W, Yang S, Chen JQ, Tian D (2009) Patterns of exon-intron architecture variation of genes in eukaryotic genomes. BMC Genomics 10:47
Zuckerkandl E (1981) A general function of noncoding polynucleotide sequences. Mass binding of transconformational proteins. Mol Biol Rep 7:149–158
Zuckerkandl E, Cavalli G (2007) Combinatorial epigenetics, “junk DNA”, and the evolution of complex organisms. Gene 390:232–242

Acknowledgments

We thank Sara Fuertes for help with statistical data. This work was supported by COST D20/003/00, CICYT Bio BIO2002-04049-C02-01, SAF 2004-03111, and ISCIII-RETIC RD06/0021/0008-FEDER programs. An institutional grant from Fundación Ramón Areces is also acknowledged.

Author information

Authors and Affiliations

Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Universidad Autónoma de Madrid, c/Nicolás Cabrera 1, 28049, Madrid, Spain
Miguel A. Fuertes & Carlos Alonso
Bioapter S.L., c/Formentera, Polígono Industrial Las Rozas, Las Rozas, Madrid, Spain
José M. Pérez
Department of Biological Sciences, Stanford University, Stanford, CA, 94305, USA
Emile Zuckerkandl
Institute of Molecular Medical Sciences, P.O. Box 20452, Stanford, CA, 94309, USA
Emile Zuckerkandl

Corresponding author

Correspondence to Carlos Alonso.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 555 kb)

Supplementary material 2 (DOC 66 kb)

Supplementary material 3 (DOC 66 kb)

Supplementary material 4 (DOC 72 kb)

Supplementary material 5 (DOC 66 kb)

Supplementary material 6 (DOC 67 kb)

Supplementary material 7 (DOC 43 kb)

About this article

Cite this article

Fuertes, M.A., Pérez, J.M., Zuckerkandl, E. et al. Introns Form Compositional Clusters in Parallel with the Compositional Clusters of the Coding Sequences to Which they Pertain. J Mol Evol 72, 1–13 (2011). https://doi.org/10.1007/s00239-010-9411-6

Received: 22 September 2009
Accepted: 10 November 2010
Published: 04 December 2010
Issue Date: January 2011
DOI: https://doi.org/10.1007/s00239-010-9411-6
