Introduction

Arabidopsis thaliana, a flowering plant, is an important model system for the study of plant biology. Through the efforts of the Arabidopsis Genome Initiative (AGI), established in 1996, all five chromosomes have been completely sequenced (AGI 2000). The availability of the complete Arabidopsis genome sequence provides an unprecedented opportunity to study the global genome organization at the sequence level.

The term isochore structure refers to the phenomenon that some eukaryotic genomes are organized into mosaics of long regions, each characterized by a fairly constant average GC content over scales of hundreds of kilobases and separated by an abrupt change from the neighboring region, which has a different but again fairly constant GC content (Bernardi 1995; Macaya et al. 1976). More than 10 years ago, Bernardi and coworkers analyzed the isochore structures of plants by density gradient ultracentrifugation experiments (Matassi et al. 1989; Montero et al. 1990; Salinas et al. 1988). The isochore structures of the Arabidopsis genome have also been investigated at the sequence level (Nekrutenko and Li 2000; Oliver et al. 2001).

In this report, we analyzed the isochore structures of the Arabidopsis genome using a newly developed windowless technique for GC content computation, the cumulative GC profile. As a result, 15 isochores were identified. These isochores have a fairly homogeneous GC content and alternate along the genome, with relatively sharp boundaries. The isochores are classified into three types, i.e., AT-, GC-, and centromere-isochores. The three types are distinct in terms of GC content, gene density, T-DNA insertion density, and transposable element (TE) distribution. It is generally believed that TEs accumulate in the regions surrounding the centromeres. Surprisingly, we found that within these TE-rich regions, there are regions with extremely low TE numbers (TE deserts), which correspond to the positions of the centromere-isochores. In addition, a heterochromatic knob is located at the boundary of an AT-isochore. The source of GC content variation among isochores was analyzed and shown to be mainly due to differences in GC content in introns, at third codon positions, and in intergenic sequences.

Materials and Methods

The genome sequences of Arabidopsis were downloaded from http://www.ncbi.nlm.nih.gov. The TE data were based on Wright et al. (2003).

The Z-curve is a three-dimensional space curve constituting the unique representation of a given DNA sequence in the sense that each can be uniquely reconstructed given the other (Zhang and Zhang 1991, 1994). Based on the Z-curve, any DNA sequence can be uniquely described by three independent distributions, i.e., \(x_n\), \(y_n\), and \(z_n\). In particular, \(z_n\) displays the distribution of bases of GC/AT types along the sequence, which is calculated as follows (Zhang and Zhang 1991, 1994):

\[ z_n = (A_n + T_n) - (C_n + G_n) \tag{1} \]

where \(A_n\), \(C_n\), \(G_n\), and \(T_n\) are the cumulative numbers of the bases A, C, G, and T, respectively, occurring in the subsequence from the first base to the nth base in the DNA sequence inspected, with \(A_0 = C_0 = G_0 = T_0 = 0\) and hence \(z_0 = 0\). By viewing the curve of \(z_n\) versus \(n\), many global and local features of the GC content can be readily seen.

For most genome or chromosome sequences, the curves of \(z_n\) versus \(n\) are roughly straight lines. To amplify the variations, the curve of \(z_n\) versus \(n\) is fitted by a straight line using the least-squares technique,

\[ z = kn \tag{2} \]

where (z, n) is the coordinate of a point on the fitted straight line and k is its slope. Instead of the curve of \(z_n\) versus \(n\), we will use the curve of \(z'_n\) versus \(n\), or simply the z′ curve hereafter, where

\[ z'_n = z_n - kn \tag{3} \]

Therefore, the deviations of the \(z_n\) versus \(n\) curve from the straight line, which corresponds to a constant GC content (see Eq. [4] below), are accentuated by the \(z'_n\) versus \(n\) curve. The terms z′ curve and cumulative GC profile are used interchangeably in this paper. Let \(\overline {{\text{GC}}}\) denote the average GC content within a region Δn in a sequence; it was shown that (Zhang et al. 2001)

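\[ \overline{\text{GC}} = \frac{1}{2}\left(1 - k - \frac{\Delta z'_n}{\Delta n}\right) = \frac{1}{2}\left(1 - k - k'\right) \tag{4} \]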

where \(k' = \Delta z'_n / \Delta n\) is the average slope of the z′ curve within the region Δn. It is clear from Eq. (4) that an upward jump in the z′ curve, i.e., k′ > 0, indicates a decrease in GC content or an increase in AT content, whereas a drop in the z′ curve, i.e., k′ < 0, indicates an increase in GC content or a decrease in AT content. In addition, if a region in the z′ curve is a purely (approximately) straight line, then the GC content is absolutely (approximately) constant within this region. Any sharp maximum (minimum) point in the z′ curve indicates a turning point, where the GC content undergoes an abrupt change from a relatively GC-poor (GC-rich) region to a relatively GC-rich (GC-poor) region. The region Δn is usually chosen to be a fragment of a natural DNA sequence, e.g., an isochore. The above method of calculating the GC content is called the windowless technique (Zhang et al. 2001).
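The windowless calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`z_curve`, `fit_slope`, `z_prime`, `gc_windowless`) are hypothetical, the fitted line is assumed to pass through the origin, and the sequence is assumed to contain only A, C, G, and T.

```python
def z_curve(seq):
    """Cumulative z_n = (A_n + T_n) - (C_n + G_n), with z_0 = 0."""
    z = [0]
    for base in seq.upper():
        if base in "AT":
            z.append(z[-1] + 1)
        elif base in "GC":
            z.append(z[-1] - 1)
        else:
            # Assumption: ambiguous bases (e.g. N) leave z unchanged;
            # the GC estimate is then no longer exact for such regions.
            z.append(z[-1])
    return z


def fit_slope(z):
    """Least-squares slope k of a line z = k*n (assumed through the origin)."""
    numerator = sum(n * zn for n, zn in enumerate(z))
    denominator = sum(n * n for n in range(len(z)))
    return numerator / denominator


def z_prime(z, k):
    """z'_n = z_n - k*n, the cumulative GC profile."""
    return [zn - k * n for n, zn in enumerate(z)]


def gc_windowless(zp, k, start, end):
    """Average GC content of bases start+1..end from the slope of z'."""
    k_prime = (zp[end] - zp[start]) / (end - start)
    return 0.5 * (1.0 - k - k_prime)


# Toy sequence: a GC-free domain followed by a domain with 80% GC.
seq = "ATATATTAAT" * 20 + "GCGCGGCCAT" * 20
z = z_curve(seq)
k = fit_slope(z)
zp = z_prime(z, k)
print(gc_windowless(zp, k, 0, 200))    # first domain
print(gc_windowless(zp, k, 200, 400))  # second domain
```

On the toy sequence the z′ curve rises over the first domain and falls over the second, so the domain boundary appears as a sharp maximum, and the GC content of each domain is recovered from the slope alone without choosing a window size.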

The concept of isochores is related to domains of relatively homogeneous GC content with large scales in genomes, in which the variations of GC content may be considered to be small. Based on the z′ curve, a homogeneity index h, which describes the smallness of the GC content variations in isochores, was introduced (Zhang and Zhang 2003):

(5)

where

(6)
(7)

where \(z'_n\) is the cumulative GC profile defined in Eq. (3) for the isochore and the entire chromosome studied, respectively, and M and N are their lengths. If h ≪ 1, the variations of the GC content of the isochore may be considered to be small. No prior knowledge is available to define isochores based on h. In the current study, we arbitrarily chose h = 0.2 as the threshold of isochores.
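As a rough illustration of how such an index might be computed, the sketch below assumes that h is the ratio of the variance of the z′ curve computed for the candidate region alone to the variance of the z′ curve of the whole chromosome. This variance-ratio form is an assumption made for illustration and may differ from the published definition (Zhang and Zhang 2003); the function names are hypothetical, and the fitted line is assumed to pass through the origin.

```python
def z_prime_curve(seq):
    """z' curve of a sequence: z_n minus a least-squares line z = k*n."""
    z = [0]
    for base in seq.upper():
        step = 1 if base in "AT" else -1 if base in "GC" else 0
        z.append(z[-1] + step)
    k = sum(n * zn for n, zn in enumerate(z)) / sum(n * n for n in range(len(z)))
    return [zn - k * n for n, zn in enumerate(z)]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def homogeneity_index(chrom_seq, start, end):
    """Assumed form: var(z' of region) / var(z' of whole chromosome)."""
    region = z_prime_curve(chrom_seq[start:end])
    whole = z_prime_curve(chrom_seq)
    return variance(region) / variance(whole)


# Toy chromosome with two compositionally distinct domains.
chrom = "ATATATTAAT" * 50 + "GCGCGGCCAT" * 50
h = homogeneity_index(chrom, 500, 1000)
print(h, h < 0.2)  # compare against the threshold used in the text
```

Under this assumed form, a region whose own z′ curve is nearly flat yields h close to 0, while a region spanning the whole chromosome yields h = 1.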

One-way ANOVA tests were performed when comparing multiple groups of samples, whereas Student t-tests were performed when comparing two groups of samples, unless indicated otherwise.

Results

Features of the z′ Curves for the Five Chromosomes of Arabidopsis

The z′ curves for all five chromosomes of Arabidopsis are shown in Fig. 1. The cumulative GC profiles show that these chromosomes contain long domains of relatively homogeneous GC content, as reflected by the fact that some regions of the cumulative GC profiles can be approximately described by straight lines. The domains that have relatively homogeneous GC content are isochores. A quantitative index is used to assess the homogeneity of the GC content of isochores and this is discussed in another section. In addition, the GC-rich isochores are followed immediately by AT-rich isochores, and vice versa, clearly indicating a mosaic structure of the genome. One striking feature is that all the centromeric regions of the five chromosomes are located within GC-rich isochores. Furthermore, the overall patterns of GC content variation can be roughly classified into two groups: those of chromosomes 1, 3, and 5 are highly similar, and those of chromosomes 2 and 4 are highly similar.

Figure 1

The cumulative GC profiles for the five chromosomes of the Arabidopsis genome. The filled circles indicate the position of centromeric regions. The position of the heterochromatic knob is indicated by an arrow. An upward jump in the curve indicates a decrease in GC content, whereas a drop in the curve indicates an increase in GC content. If a region is approximately described by a straight line, then the GC content is approximately constant, suggesting that this region is an isochore. Therefore, the sharp peaks of the curves suggest a mosaic structure of the genome, i.e., the GC content undergoes abrupt changes from GC-rich regions to GC-poor regions, and vice versa, in alternation. In addition, the variations in GC content in isochores are relatively small. Refer to the text for a quantitative definition. All the centromeric regions are located in GC-rich isochores. The overall patterns of GC distribution of chromosomes 1, 3, and 5 are similar. The locations of isochores and unassigned regions are indicated by bars of different patterns.

Isochores of Arabidopsis and Their Classification

A total of 15 isochores have been identified in the Arabidopsis genome. In a previous work (Zhang and Zhang 2003), we classified the isochores into two types, i.e., GC-isochores and AT-isochores. GC- (AT-) isochores are isochores whose GC content is higher (lower) than that of the chromosome where the isochores are located. The gene density of GC-isochores is usually high, whereas that of AT-isochores is usually low. In the Arabidopsis genome, all the centromeric regions are located in five GC-rich isochores. Although these five GC-rich isochores have a high GC content, they are distinct from the other class of GC-rich isochores in terms of many features, such as gene and T-DNA insertion densities. A much lower gene density was found in these five GC-rich isochores that are associated with centromeric regions (Table 1). Therefore, we classify the isochores that are associated with the centromeric regions as another class, the centromere-isochores.

Table 1 The isochores in the Arabidopsis genome

Gene Distribution Among Isochores

In mammalian genomes, genes are preferentially distributed in high GC regions (Bernardi 1995; Lander et al. 2001). In human isochores, a higher gene density was found in H3 isochores (more GC-rich) than in L isochores (less GC-rich) (Bernardi 2000), which was subsequently confirmed based on isochores identified by a compositional segmentation method (Oliver et al. 2002). This correlation also holds, generally, for the Arabidopsis genome (Fig. 2A). The average GC contents for AT-isochores and GC-isochores are 0.340 and 0.368, respectively (Table 2) (p < 0.01). The gene density in AT-isochores and GC-isochores is 215 and 291/Mb, respectively (p < 0.001). The third type of isochores, centromere-isochores, although they have the highest GC content, 0.405 (p < 0.0001 compared with that of AT-isochores and p < 0.01 compared with that of GC-isochores), have the lowest gene density, 77/Mb (p < 0.001 compared with that of GC-isochores). Therefore, the centromere-isochores are distinct from the other class of GC-rich isochores, GC-isochores, which are characterized by a high GC content and a high gene density.

Figure 2

Different genome features among the three classes of isochores. These features include (A) gene density, (B) T-DNA insertion density, (C) TE number, (D) TE length, (E) intron number, (F) intron length, (G) GC, (H) GC (exon), (I) GC (intron), (J) GC (intergenic), (K) GC12, and (L) GC3.

Table 2 Statistics of the three types of isochores in the Arabidopsis genome

T-DNA Insertion Site Distribution Among Isochores

The distribution of the integration of transferred DNA (T-DNA) was investigated in the genome of Arabidopsis in 2000 (Barakat et al. 2000). Recently, over 225,000 independent Agrobacterium T-DNA insertion events in the Arabidopsis genome have been created that represent near-saturation of the gene space. The precise locations were determined for more than 88,000 T-DNA insertions. Genome-wide analysis of the distribution of integration events revealed the existence of a large integration site bias at the chromosome level (Alonso et al. 2003).

We studied the distribution of T-DNA insertion sites among isochores and found that T-DNA is integrated most frequently into GC-isochores and least frequently into centromere-isochores. The T-DNA insertion densities for AT-, GC-, and centromere-isochores are 2397, 3532, and 1605 sites/Mb, respectively (p < 0.0001) (Table 2 and Fig. 2B).

Although the precise mechanism of T-DNA integration in the host genome is not fully understood, one possible explanation for the integration site bias is a difference in chromatin structure. For example, integration may be promoted by increased chromatin accessibility in transcribed regions, which removes the inhibitory effects of an unfavorable chromatin environment (Schroder et al. 2002). Indeed, it has recently been found that sites of HIV integration in the human genome are not randomly distributed but instead are enriched in active genes (Schroder et al. 2002). Therefore, the biased distribution of T-DNA insertion sites among these isochores may reflect the difference in the chromatin structures of the three classes of isochores.

Transposable Element Distribution Among Isochores

Transposable elements (TEs) have been found in all eukaryotes, and these elements have the ability to move to new locations in chromosomes. The locations of TEs are highly biased, and in the Arabidopsis genome, it is generally believed that TEs accumulate in the regions surrounding the centromeres (Copenhaver et al. 1999; AGI 2000; Wright et al. 2003). Recently, the locations of all TEs in the Arabidopsis genome were determined based on an Arabidopsis TE database (http://www.tebureau.mcgill.ca/). By using these data, we analyzed the TE distribution among isochores.

Our analysis of the TE distribution along the genome shows that the regions surrounding the centromeres indeed contain a high concentration of TEs. However, we noticed that within these TE-rich regions, there are regions that have extremely low numbers of TEs (TE deserts). Strikingly, these TE deserts correspond to the locations of centromere-isochores. Refer to Fig. 3 for an example based on chromosome 1. One possible explanation is that the regions corresponding to TE deserts are critical for centromere functions, and therefore, insertions of TEs into these regions are deleterious and eliminated by natural selection.

Figure 3

A The z′ curve for chromosome 1. B TE distribution of chromosome 1 based on 50-kb sliding windows. Note that TEs are accumulated in the region surrounding the centromere. However, in the TE-rich region, there is a segment of genome sequence that has an extremely low number of TEs (TE desert). The TE desert corresponds to the position of the centromere-isochore.

There is also a considerable difference in TE numbers and lengths between the AT- and the GC-isochores. The average TE numbers of AT- and GC-isochores are 84.16 and 13.39/Mb, respectively; the average TE lengths are 124 and 17 kb/Mb, respectively (Table 2 and Figs. 2C and D) (p < 0.01). Therefore, although there is only a 2.8% difference in GC content, the TE number (length) of AT-isochores is 6.3 (7.3) times that of GC-isochores. As mentioned previously, AT-isochores have a lower gene density than GC-isochores. The distributions of TEs in AT- and GC-isochores are consistent with the result of Wright et al. (2003), which shows a negative correlation between gene density and TE abundance.
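The ratios quoted above follow directly from the reported averages. The short check below simply recomputes them from the values given in the text; the variable names are illustrative only.

```python
# Averages for AT- and GC-isochores as quoted in the text (Table 2).
te_number_per_mb = {"AT": 84.16, "GC": 13.39}
te_length_kb_per_mb = {"AT": 124.0, "GC": 17.0}
gc_content = {"AT": 0.340, "GC": 0.368}

number_ratio = te_number_per_mb["AT"] / te_number_per_mb["GC"]
length_ratio = te_length_kb_per_mb["AT"] / te_length_kb_per_mb["GC"]
gc_difference_percent = (gc_content["GC"] - gc_content["AT"]) * 100

print(round(number_ratio, 1))           # 6.3
print(round(length_ratio, 1))           # 7.3
print(round(gc_difference_percent, 1))  # 2.8
```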

Intron Number and Length Distributions Among Isochores

It has been found that two classes of genes are present in Arabidopsis. One class of genes, the GC-rich class, has relatively low intron numbers and short concatenated intron lengths. The other class of genes, the GC-poor class, has relatively high intron numbers and long concatenated intron lengths (Barakat et al. 1998; Carels and Bernardi 2000). We analyzed the intron length and number distributions among isochores. The intron numbers of AT-, GC-, and centromere-isochores are 3.81, 4.58, and 2.70, respectively (p < 0.001); intron lengths are 645, 728, and 450 bp, respectively (p < 0.001) (Table 2). Therefore, genes within the most GC-rich class of isochores, centromere-isochores, have a much lower intron number and shorter intron length than those of the other two classes (Figs. 2E and F). However, genes in the GC-isochores do not have lower intron numbers and shorter intron lengths. Therefore, it is likely that in the previously found two classes of genes (Carels and Bernardi 2000), the GC-rich class, which has low intron numbers and short lengths, contains the genes in the centromere-isochores.

A Heterochromatic Knob Is Located at an Isochore Boundary

Heterochromatic knobs were first observed by McClintock (1929) in the maize genome. These knobs are cytologically detectable, darkly stained heterochromatic regions present on the maize pachytene chromosomes. Many genetic effects are linked to the heterochromatic knobs. For instance, in the maize genome, the heterochromatic knobs were found to affect the recombination frequency and chromosome behavior in microspore divisions (Rhoades 1978; Rhoades and Dempsey 1973). Chromosome 5 of the Arabidopsis genome has a heterochromatic knob (AGI 2000), which is located at the boundary of an AT-isochore. At the location of the heterochromatic knob, the genome undergoes a relatively abrupt change from a GC-rich region to an AT-rich region. However, chromosome 4 also has a heterochromatic knob, which is not close to any isochore boundary. Therefore, it is possible that the correlation between the heterochromatic knob and the isochore boundary in chromosome 5 is coincidental. Further information on heterochromatic knobs is needed to investigate this issue. If there is indeed a correlation between heterochromatic knobs and isochores, it may provide further insight into the origin of isochores. In addition, the cumulative GC profiles for chromosomes 1, 3, and 5 show a similar overall pattern (Fig. 1). It will therefore be interesting to investigate the corresponding parts of chromosomes 1 and 3, to examine whether they also contain heterochromatic knob structures.

Features of Unassigned Regions

In each chromosome, besides the isochores, there are unassigned regions that are not isochores. We have also studied various features of these unassigned regions (Tables 1 and 2). The average gene density of these regions is 267.55/Mb; the T-DNA insertion number is 3146.31/Mb; the TE number is 31.09/Mb; the TE length is 40,952.24 bp/Mb; the intron number is 4.25/gene; the intron length is 686.03 bp/gene; and the GC contents of exons, introns, intergenic regions, GC12, and GC3 are 0.438, 0.333, 0.324, 0.452, and 0.427, respectively. In brief, all these features lie between those of AT- and GC-isochores. Because the GC content of these regions also lies between those of the AT- and GC-isochores, it appears that the GC content is a critical factor in determining all these features.

Discussion

Source of the GC Content Variations Among Isochores

To investigate the source of the GC content variation among isochores, we calculated the GC content of exons, introns, GC12, GC3, and intergenic regions among different classes of isochores (Tables 1 and 2 and Fig. 2). Generally, the GC contents of all these different regions of AT-isochores are lower than those of GC-isochores, which are lower than those of centromere-isochores. For the AT-, GC-, and centromere-isochores, the GC contents of exons are 0.434, 0.442, and 0.468, respectively (p < 0.01); the GC contents of introns are 0.332, 0.336, and 0.399, respectively (p < 0.01); the GC contents of intergenic regions are 0.317, 0.330, and 0.389, respectively (p < 0.01); the values of GC12 are 0.447, 0.453, and 0.462, respectively (p < 0.05); and the values of GC3 are 0.419, 0.437, and 0.468, respectively (p < 0.01). There is a clear difference, however, between GC12 and the GC content of noncoding regions, i.e., the GC content differences in noncoding regions among isochores are much larger than those of GC12. For instance, the difference in GC12 between centromere- and GC-isochores is 0.009, whereas the difference in GC content of introns is 0.063; the difference in GC3 is 0.031; and the difference in GC content of intergenic regions is 0.059. Therefore, the difference in isochore GC content is likely to be largely due to the variation in GC content in noncoding regions. This observation is consistent with the GC variation in human isochores, i.e., GC3 variation is greater than the GC content variation of isochores (Clay et al. 1996). These noncoding regions are believed to be under less selective pressure than coding regions. Therefore, this GC variation pattern appears to support the view that isochores are due to mutational bias along genomes (Eyre-Walker 1999; Eyre-Walker and Hurst 2001).
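The differences between centromere- and GC-isochores quoted above can be recomputed from the listed GC contents. The sketch below uses only the values given in the text; the dictionary layout and names are illustrative.

```python
# GC contents quoted in the text, in the order (AT-, GC-, centromere-isochores).
gc_by_region = {
    "exons":      (0.434, 0.442, 0.468),
    "introns":    (0.332, 0.336, 0.399),
    "intergenic": (0.317, 0.330, 0.389),
    "GC12":       (0.447, 0.453, 0.462),
    "GC3":        (0.419, 0.437, 0.468),
}

# Difference between centromere-isochores and GC-isochores for each region.
centromere_minus_gc = {
    region: round(values[2] - values[1], 3)
    for region, values in gc_by_region.items()
}

# Rank the regions from the largest to the smallest difference.
for region, diff in sorted(centromere_minus_gc.items(), key=lambda kv: -kv[1]):
    print(f"{region:10s} {diff:.3f}")
```

Ranked this way, introns and intergenic regions show the largest differences and GC12 the smallest, which is the pattern the argument relies on.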

Comparison of Arabidopsis Isochores with Those of the Human Genome

Recently, the isochore structures of the human genome have been identified based on the cumulative GC profile (Zhang and Zhang 2003). The isochore structures of Arabidopsis are distinct from those of human in several aspects. First, the variations in GC content between AT- and GC-isochores are different. In the human genome, the average GC contents for AT- and GC-isochores are 0.38 and 0.47, respectively. Therefore, the GC content of GC-isochores is 9 percentage points higher than that of AT-isochores. In the Arabidopsis genome, the average GC contents for AT- and GC-isochores are 0.34 and 0.37, respectively. Therefore, there is only a difference of 3 percentage points in GC content between these two types of isochores. However, isochore structures are still clear even though the GC content difference is not as large as that in the human genome. Another striking difference is that the centromere-isochore type does not exist in the human genome. There, the GC-isochores that harbor the centromeric regions behave no differently from other GC-isochores. This difference may reflect the characteristic organization of the centromeric region of Arabidopsis (Haupt et al. 2001). In addition, the relative proportions of the AT-, GC-, and centromere-isochores in the Arabidopsis genome are 17.45, 53.65, and 7.37%, respectively. However, in the human genome, the GC-rich isochores (H3) only represent ∼3–5% (Bernardi 1995). Furthermore, in the Arabidopsis genome, the GC-isochores (8.01 Mb) are longer than the AT-isochores (2.49 Mb) and there are fewer GC-isochores (3) than AT-isochores (7), whereas in the human genome, GC-isochores (11.25 Mb) are shorter than AT-isochores (13.49 Mb) and there are more GC-isochores (34) than AT-isochores (22).

Comparison of the Cumulative GC Profile with the GC Content Distribution Based on Other Methods

As a routine procedure, the GC content distribution is computed by a sliding-window technique (Lander et al. 2001; Waterston et al. 2002). In this method, the G and C residues are counted within a window, so the size of the window can be considered as the resolution of the GC content. The resolution of this method is usually low, because a small window size leads to large statistical fluctuations. In addition, the GC content distribution obtained by window-based methods is dependent on the window sizes chosen; in contrast, the cumulative GC profile is unique for a genome. A comparison between the window-based and the windowless approaches has been detailed in the literature (Li 2001; Zhang and Zhang 2003). We want to emphasize that in genomes with larger GC content variation, such as the human genome, the homogeneous GC content domains can be revealed by window-based methods, although the boundaries sometimes cannot be determined precisely (Pavlicek et al. 2002). However, for genomes with relatively low GC content variation, the disadvantages of window-based methods are more obvious, i.e., the window-based methods usually show a complex pattern. For example, the GC content distribution of chromosome 3 of the Arabidopsis genome, plotted based on 20-kb sliding windows (Fig. 4C), shows a complex pattern, and boundaries between domains of the GC distribution are totally blurred by the variations. In contrast, the z′ curve clearly shows five domains of GC distribution, i.e., two AT-isochores, two GC-isochores, and a domain with large GC variation (Figs. 4A and B). Therefore, the isochore structures and their boundaries can be revealed clearly by the cumulative GC profiles.
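For comparison, a minimal sketch of a window-based GC calculation is given below. It is a generic illustration of the sliding-window idea rather than the procedure used in the cited studies; the function name `sliding_window_gc` and its parameters are hypothetical.

```python
def sliding_window_gc(seq, window, step=None):
    """Return (start, GC fraction) for each window of the given size.

    If step is None, the windows do not overlap. Trailing bases that do
    not fill a complete window are ignored.
    """
    step = step or window
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        profile.append((start, gc))
    return profile


# Toy sequence: a GC-free domain followed by a domain with 80% GC.
seq = "ATATATTAAT" * 20 + "GCGCGGCCAT" * 20
for window in (100, 200, 400):
    print(window, sliding_window_gc(seq, window))
```

The same sequence yields a different profile for each window size, and with the largest window the boundary between the two domains disappears entirely, which illustrates the window-size dependence discussed above.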

Figure 4

A Schematic diagram showing the GC content of isochores in Arabidopsis thaliana chromosome 3 based on the cumulative GC profile. The regions of isochores are marked with bold horizontal lines, whereas those of interisochores are marked with wavy lines. B The z′ curve for chromosome 3. C The GC content calculated based on the 20-kb sliding window technique. Note that the boundaries of isochores cannot be identified based on the window technique. D Fifty-six isochores identified by the entropic segmentation method.

Another windowless tool to analyze genome heterogeneity is compositional segmentation, and this technique has also been used to study the isochore structures of eukaryotic genomes (Li 2001; Oliver et al. 2001). The isochores of the Arabidopsis genome have also been studied by the entropic segmentation method (Oliver et al. 2001). Details about the method are available at http://bioinfo2.ugr.es/isochores/. As a comparison, the isochores of chromosome 3, obtained based on the entropic segmentation method, are shown in Fig. 4D. One apparent difference is that many more isochores (56) were determined by the entropic segmentation method than by the z′ curve. Another difference is that all regions of the chromosome are classified into different isochores based on the entropic segmentation method, whereas there are some unassigned regions, which are not isochores, based on the z′ curve. However, some segmentation points of both methods are highly consistent. For instance, the isochore boundaries obtained based on the z′ curve overlap well with some of the segmentation points based on the entropic segmentation method. In addition, among the 56 isochores obtained based on the entropic segmentation method, many boundaries correspond to jumps in the z′ curve. Therefore, in some aspects, the two methods are consistent. However, the z′ curve appears to be more intuitive and can give a picture of the global GC content distribution along genomes.

Definition of Isochores

Identification of isochore structures in eukaryotic genomes provides much insight into genome organization, because of the clear functional implications of isochores. For instance, isochores have been correlated with gene density (Zoubak et al. 1996), chromosome bands (Saccone et al. 1993), and repeat elements (Meunier-Rotival et al. 1982). In the Arabidopsis genome, the isochores identified have been shown to be related to gene density, T-DNA insertion density, TE density, intron length, and so on. Although isochores have been known for more than 25 years, currently no clear definition of isochores is available. We defined the index, h, to assess the relative homogeneity of isochores compared to the variation in GC content of the whole genome. In this study, we arbitrarily chose a threshold of the homogeneity index, h, to be 0.20 for isochores. In fact, the homogeneity index, h, is better suited to assessing the relative homogeneity of isochores than to serving as a definition. By using the entropic segmentation method, genomes can be split into many segments (isochores) objectively, based on segmentation points (Oliver et al. 2001). For instance, 56 isochores were found in chromosome 3 of the Arabidopsis genome (Oliver et al. 2001). However, it seems unreasonable that every region of a genome is an isochore. Therefore, due to the lack of a clear definition, most isochores identified so far are quite subjective.

The homogeneity of the GC content of isochores should be considered to be relative, whereas boundaries of isochores are absolute. No strict isochores that have an absolutely constant GC content have been found in the human genome or in other genomes. In terms of the homogeneity index h, h cannot be equal to 0. In some sense, the homogeneity of GC content of isochores is not as important as their functional implications. An isochore is a segment of genomic DNA sequence in which many characteristics, such as the gene density and repeat density, are different from those of other isochores (Meunier-Rotival et al. 1982; Zoubak et al. 1996). Therefore, isochores may be regarded as functional domains of genomes or chromosomes, whose boundaries have critical biological meanings. For example, the boundary between Class II and Class III isochores in the human MHC sequence corresponds to a change in replication timing (Tenzen et al. 1997). In the Arabidopsis genome, the centromere-isochores correspond to TE deserts. The problem is how to find these boundaries both experimentally and theoretically. The cumulative GC profile is one of the available tools (Li et al. 2002; Oliver et al. 2001; Peshkin and Gelfand 1999) to determine the isochore boundaries. The characterization of isochores and their boundaries will provide a way to define isochores based on their biological functions, and in this regard, the isochore boundary appears to be more important than the homogeneity of GC content of isochores.