Introduction

Highly repetitive, noncoding DNA is known to make up the majority of many eukaryotic genomes. Prokaryotic genomes, however, are typically seen to contain lower levels of repetitive DNA; most phyla examined have less than 5% of their genome in the form of global DNA repeats (Ussery et al. 2004). These authors hypothesized that prokaryotes have had more replication cycles over evolutionary time to “streamline” their genomes and that the repetitive DNA found in prokaryotic genomes often has some selective advantage or other reason for its existence.

DNA repeats have been classified as either direct, inverted, or mirror, depending on their region of symmetry. Short direct repeats are common in genomes containing repetitive DNA; these can be found arrayed in tandem, dispersed throughout a genome, or, more recently, interspersed with nonrepetitive sequences. Clustered regularly interspaced short palindromic repeats (CRISPRs) are a recently described class of repetitive DNA element that is contained exclusively in prokaryotes. CRISPRs contain 21–47 bp of a directly repeated sequence separated by similarly sized chunks of nonrepetitive DNA.

The presence of short regularly spaced repeats in prokaryotes was first described by Mojica and colleagues, whereas the name CRISPRs was later introduced by Jansen and colleagues (Mojica et al. 2000; Jansen et al. 2002a). These authors were able to locate CRISPRs in approximately half of the prokaryotic genomes available at the time, including a wide diversity of organisms belonging to both the Eubacterial and Archaeal domains (Jansen et al. 2002a,b). Jansen and colleagues also described four CRISPR-associated, or cas, genes that were typically found in association with the DNA repeats. The function of CRISPRs, as well as their associated genes, remains unknown. The genes do share homology with proteins involved in DNA recombination and repair. Cas 1, 3, and 4 are homologous to a DNA repair protein, a helicase, and a RecB exonuclease, respectively. The function of cas2 has not been predicted with any certainty.

Clusters of CRISPR-associated genes have previously been characterized in great detail due to their resemblance to a novel DNA repair system in certain prokaryotes (Makarova et al. 2002). This investigation was performed before it was known that repetitive sequences were located nearby. The authors used phylogenetic evidence to show that HGT was most likely responsible for the movement of this cluster among distantly related genomes, and hypothesized that the gene cassettes disseminated as a single entity. Recently, the intervening sequences of CRISPR loci from a wide range of different prokaryotes have been found to resemble sequence from transmissible genetic elements such as bacteriophage and conjugative transposons (Mojica et al. 2005). This conclusion has been confirmed by studies which focused on the origin of intervening DNA taken from many strains of a single organism, either Streptococcus thermophilus or Yersinia pestis (Bolotin et al. 2005; Pourcel et al. 2005). All of this has lead to the speculation that CRISPRs and their associated genes represent a form of mobile genetic element which moves via HGT. In this study, we expand upon the prokaryotic genomes found to contain CRISPRs as well as cas genes, confirm findings that HGT has acted upon the cas genes in question and present evidence that this transfer has likely involved the use of conjugation.

Results

We have searched 370 prokaryotic genomes for the presence of interspersed direct repeats, three quarters of these genomes had been completed and published at the time of our analysis, the remainder of which were in various stages of completion (supplemental material, Table A). We initially set the limits of the repeated pattern broader than the previously reported range that CRISPRs fell into, until we were convinced that no species fell outside of this range. The average size of a CRISPR repeat was found to be 32 bp (supplemental material, Table B). Organisms at the lower end of this range include Yersinia pestis KIM and Nostoc punctiforme, while Bacteroides fragilis NCTC 9343 has the longest CRISPR repeat found to date (Table 1). Once the range became apparent, the search parameters were narrowed to include this range only in order to speed up our analysis. We have been able to greatly expand the number of species which are known to contain CRISPRs, from the 39 previously reported to the 148 described here (Jansen et al. 2002a; Table 1; supplemental material, Table A).

Table 1 Cas genes and associated CRISPRs in prokaryotic genomes

CRISPR containing species of prokaryotes are extremely diverse, belonging to the domain Archaea as well as the majority of the phyla that have been sequenced from the Eubacterial domain. In all, about 40% of the genomes that have been searched were found to contain CRISPR loci. It is not uncommon for there to be more than one CRISPR locus per genome, genera with large numbers of loci include Anabaena (sp. 7120), Chloroflexus, and Methanocaldococcus, which all contain 11 (supplemental material, Table A). Overall, about half of the CRISPR-containing organisms studied had more than one CRISPR locus in their genome (supplemental material, Table A). The number of interspersed repeats in these loci also varies, with many loci containing as few as 3 repeats (the lower limit set by our program) and the average number of repeats per locus being around 27. However, one species, Thermoanaerobacter tengcongensis, contains a locus with 217 repeats, nearly twice the upper limit previously reported (Jansen et al. 2002a). The genomic coordinates for each of the CRISPR loci studied are given in Table 1 and again in a machine-readable format in the supplemental material (Table B).

The CRISPRs described here can be seen to be quite variable in sequence. In fact, with the exception of closely related strains or species, no two CRISPR sequences were found to be alike. Most CRISPRs do, however, have an imperfect palindromic sequence which conforms to the general consensus sequence shown in Figure 1. It is not uncommon for a CRISPR sequence to begin with a G, followed by 3 Ts and either a G or a C. This is roughly palindromic to the most common ending, namely GAAAC. A run of cytosines a quarter of the way into the sequence also corresponds with a run of guanines three quarters of the way in. There is, however, little conservation of sequence in the center of the CRISPR repeat.

Figure 1
figure 1

Sequence logo representing the consensus sequence for all CRISPRs characterized to date. The relative size of each nucleotide represents the probability that it is found at each position. The logo has been truncated by 5 bp at each end to remove non-consensus sequence.

Many, but not all, of the CRISPR loci described here contain a number of genes in close association with the DNA repeats. A recent study has examined many CRISPR-associated genes and grouped them into 45 different protein families, providing evidence of the size, complexity, and heterogeneity of many CRISPR loci (Haft et al. 2005). Although the number of CRISPR-associated genes appears to be approximately a dozen in many of the genomes we have examined, we have chosen to concentrate on the four original cas genes for this investigation. Homologs of the cas genes were identified in many of the species found to have CRISPRs by our DNA pattern searching techniques. In order to be classified as a cas homolog in this study, a gene had to meet two criteria: 1) homology with known cas genes, and 2) proximity to a CRISPR locus. This allows a greater degree of confidence that, despite the divergence of many of the encoded proteins, they indeed belong to the cas gene categories indicated, although we realize that bone fide cas genes may have been left out by imposing these criteria. Overall, about two thirds of the 148 CRISPR-containing species revealed identifiable homologs of at least one of the four core cas genes, these 103 species are presented in Table 1, along with the locus tags of the homologs in question.

We performed phylogenetic analysis of the identified cas genes in order to determine whether horizontal gene transfer, or HGT, of these elements had occurred during prokaryotic evolution. Although we initially analyzed each of the cas genes separately, we found four separate phylogenetic trees containing approximately 100 members each to be rather unwieldy (supplemental material, Fig. A). Multiple alignment data as well as smaller individual trees for most of the cas genes in question have been presented previously by others (Jansen et al. 2002a; Makarova et al. 2002; Bolotin et al. 2005). The latter authors have suggested that each of the cas gene classes, with the exception of cas4, divide into two distinct clades and have thus divided off cas1b, cas6, and cas5 from what we have called cas 1, 2 and 3, respectively. We do not see such a clear distinction in our phylogenetic trees and have chosen to keep the classes more inclusive for this analysis (supplemental material, Fig. A).

Since the conclusions drawn from each individual tree were qualitatively similar, we chose to create a “total evidence” tree to represent our data. Since it is likely that the cas genes are transferred among prokaryotes as an entire cassette (Makarova et al. 2002), we decided to join their amino acid sequences prior to phylogenetic analysis. This approach did require that we limit our analysis to loci that contained representatives of all four cas genes; this amounted to about one half of the organisms presented in Table 1, specifically 59 loci from 52 different species. The amino acid sequences of the cas genes were concatenated to form long sequences with an average size of 1413 residues, and this representation of the data was submitted to multiple alignment followed by the generation of an unrooted phylogenetic tree (supplemental material, Figs. B-F; Fig. 2). Certain features of this tree are notable. For instance, major phylogenetic groups which should form their own clade if no HGT had taken place are seen dispersed throughout the tree. The Archaea, Proteobacteria, and Firmicutes are all designated on the tree, none of which cluster together into a single clade.

Figure 2
figure 2

Total evidence phylogenetic tree for all four cas genes. The organism in which cas genes are found is listed, these are numbered if more than one cluster is found in a given organism. Members of the division Archaea are indicated by brackets and the letter “A”. Members of the phyla Proteobacteria and Firmicutes are indicated with the letters “P” and “F”, respectively. Distances between organisms and the next node are given above their branches, bootstrap values greater than 55 (of 1000 replicates) are given next to their corresponding nodes. An arrow indicates the Shewanella sp. from the Sargasso Sea environmental sample and the Dusulfovibrio vulgaris megaplasmid.

In searching for CRISPR loci and associated cas genes, we noticed that a number of them were not located on the bacterial chromosome but were found on megaplasmids, very large plasmids ≥ 40 kb in size (Table 2). Of the ten megaplasmids described, four of these are the only CRISPR loci found within a particular species, while the rest have CRISPR sequences and cas genes on the corresponding chromosome as well. The number of loci associated with megaplasmids varies between one and eight, while the size of the plasmids themselves varies between 39 and 257 kb. The interspersed sequence in one of these plasmids, Sulfolobus pNOB8, has been described previously but was not identified as a CRISPR at that time (She et al. 1998). Two other plasmids on our list have their cas genes identified in GenBank but the presence of CRISPR loci on these plasmids was not discussed in the corresponding publications (Heidelberg et al. 2004; Henne et al. 2004).

Table 2 Cas genes and associated CRISPRs from megaplasmids

While some of the CRISPR sequences are unique to the megaplasmid on which they are found, others are shared with bacterial chromosomes or other megaplasmids (Tables 1 & 2). The pTT27 megaplasmid from Thermus thermophilus HB8, for instance, shares a repeating sequence with the chromosome of the subspecies HB27, while the corresponding plasmid in HB27 shares a different repeating unit with both the HB8 chromosome and megaplasmid. We have extended our analysis by comparing the two pTT27 megaplasmids using a Pustell DNA Matrix (Fig. 3). It is clear that the two plasmids are highly conserved at the level of DNA sequence, with most of their non-conserved regions being made up of CRISPR loci which differ in both sequence and location. Gaps occur in the diagonal in regions which correspond to CRISPR loci coordinates, while segments which deviate from this line can often be traced to the one CRISPR sequence which the two plasmids share, albeit at different locations. It also appears that much of the size difference between the two plasmids (24 kb) can be accounted for by the presence of two more cas gene containing loci in the HB8 subspecies. The inter-relatedness of some of the megaplasmid-borne cas genes we have described is also apparent from our phylogenetic analysis, namely the cas genes found on the Thermus HB8 megaplasmid show a close phylogenetic relationship with those found on the pSYSA plasmid from Synechocystis (Fig. 2).

Figure 3
figure 3

Comparison of megaplasmids from Thermus thermophilus. A Pustell DNA Matrix highlighting conserved sequence between pTT27 plasmids from strains HB8 and HB27. An unbroken diagonal line represents regions of high sequence conservation.

Additional clues concerning the origin and method of propagation of CRISPRs and their associated genes comes from recent studies of environmental samples taken from the Sargasso Sea. Venter and colleagues have embarked upon a project to obtain large amounts of DNA sequence from uncultured samples taken from the environment (Venter et al. 2004). While it is still relatively uncommon to have sequence data available for a number of different strains of the same species (with obvious exceptions occurring for pathologically important specimens, see Pourcel et al. 2005), environmental sampling has provided a number of closely related genomes belonging to non-pathogenic species.

We searched for CRISPRs using this data and noticed that there was a CRISPR locus in the SAR2 scaffold that corresponds to the complete genome of a Shewanella sp. obtained from the Sargasso Sea. This environmental sample was compared to the Shewanella oneidensis genome, a closely related species that completely lacks CRISPR loci (Fig. 4). It can be seen that, while the genes flanking the CRISPR locus share a high degree of homology between SAR2 and S. oneidensis, the CRISPR locus itself appears as wide gap of non-conserved sequence. The environmental sample has apparently picked up the CRISPR locus via HGT and incorporated the locus into its genome. Although the four cas genes from the Shewanella environmental sample are somewhat divergent from their homologs on the Desulfovibrio megaplasmid listed in Table 2 at the level of amino acid sequence (Fig. 2, see arrow), this supposed insertion displays a high degree of synteny with the CRISPR locus from the megaplasmid, the order of not only the 4 core cas genes but also three additional CRISPR-associated ORFs, cas5d, csd1, and csd2, is completely conserved between the two (Haft et al. 2005; Fig. 5). This arrangement of genes has been deemed the Dvulg subtype by the above authors and it is found not only in Desulfovibrio vulgaris, which gives this subtype its name, but also in the 15 other species which share a large clade with the Desulfovibrio megaplasmid in our phylogenetic analysis (Fig. 2, unpublished data). It appears that sharing this subtype of CRISPR-associated genes is one of the few things these organisms have in common; they consist of a mix of Proteobacteria, Firmicutes, Spirochetes, and Chlorobi, to name a few phyla.

Figure 4
figure 4

Alignment of a portion of the Shewanella oneidensis MR-1 genome (top, accession # NC_004347.1) with the SAR2 environmental sample taken from the Sargasso sea (bottom, # AACY01119384.1) using the Artemis Comparison Tool (ACT). The CRISPR locus is seen as a gap between overlapping regions. The coordinates of each genome are given.

Figure 5
figure 5

Model for insertion of CRISPRs and Cas genes into Shewanella. This is one model which would explain our findings concerning the presence of a CRISPR locus in the SAR2 environmental sample. The synteny between the Desulfovibrio megaplasmid and the Shewanella sample from the Sargasso Sea is displayed. The cas genes are identified by name, as are five additional ORFs which have not been identified as CRISPR-associated. Int., Tran., and PR stand for a putative integrase, transposase, and plasmid replication protein, respectively. The locus tags for S. baltica homologs are given in parentheses in the boxes for their corresponding S. oneidensis genes.

Through the comparison of these two strains of Shewanella, it is possible to estimate the size of at least one CRISPR locus insertion. The SAR2 scaffold shows no similarity to the published genome of S. oneidensis between bases 14,480 and 54,988, for a total length of 40,508 bp. The sequences which flank this region, however, show a high degree of similarity to the Shewanella genome, with no deletions of genomic DNA detectable. To rule out the possibility that the differences between these genomes could be accounted for by a deletion of approximately 40 kb from the organism represented by SAR2 to create the sequence of genes seen in S. oneidensis, we have analyzed the arrangement of genes in a third genome, Shewanella baltica, that is likely to have diverged from S. oneidensis before this predicted insertion took place (Murray et al. 2001). The draft genome of S. baltica is almost completely syntenic with the corresponding region in S. oneidensis, the locus tags of the S. baltica homologs are given in Fig. 5. A single ORF of 112 amino acids, Sbal_1424, lying within the region of our predicted insertion does not contain a S. oneidensis homolog in the corresponding region. The number of DNA bases which fall between SO1154 and SO1155 in S. oneidensis is 757, while the number of bases between Sbal_1423 and Sbal_1425 in S. baltica is 1150, for a difference of 393 bases which contains the ORF described above. Since the size difference between these divergent genomes (baltica vs. oneidensis) is 100 times less than we find between SAR2 and S. oneidensis, we conclude that a deletion event was not responsible for difference between the latter two strains. Our conclusion is that the opposite has occurred- that about 40 kb has been inserted into the Shewanella genome to form the SAR2 strain.

Since 40 kb is approximately the size of the smallest CRISPR-containing megaplasmid we have characterized, and all of the megaplasmids presented contain additional genes which lie outside of the CRISPR loci, this raises the possibility that various plasmid-associated genes may be transferred along with a CRISPR locus but would not be identifiable as being CRISPR-associated per se. We have identified five additional ORFs upsteam of cas3 in the SAR2 scaffold that do not appear in the Desulfovibrio megaplasmid and were probably specific to the vector which was responsible for the transfer of this CRISPR locus. These ORFs include a putative integrase, transposase, and plasmid replication protein, as well as TraD (a putative DNA transport protein) and TrwC (a putative conjugative relaxase). The presence of these ORFs in the SAR2 genome also point to a conjugative origin of the adjoining CRISPR locus. It is likely that additional transferred genes which are not required by the CRISPR locus would eventually be deleted by the host genome so that analogous conjugation-associated genes are not evident in many of the loci examined.

Discussion

Previous studies have suggested that the CRISPR-associated genes play a role in DNA recombination and repair (Jansen et al. 2002a). Additional CRISPR-associated genes have been found that act as hydrolases as well as members of the RAMP superfamily (Repair Associated Mysterious Protein) (Makarova et al. 2002). Phylogenetic analyses by these authors indicated the possibility that cas genes have been propagated via HGT. These investigators surmised that what came to be known as CRISPR loci represented a novel DNA repair pathway which is found primarily in Archaea as well as hyperthermophilic Eubacteria. Our current work has greatly expanded the number of bacteria known to have CRISPR loci, including a diverse group of mesophiles, thermophiles, aerobes, anaerobes, enterics, heterotrophs, photoautotrophs, etc. It is clear from perusing the 148 species that contain CRISPR loci that they hail from all walks of prokaryotic life and that they do not belong exclusively to any specific groups.

Multiple events of HGT throughout evolution are suggested by our phylogenetic analysis of concatenated sequences from cas1-4. While some have argued that homologs of cas2 should not be included in the conserved “core” of genes usually found in what are now known as CRISPR loci (Makarova et al. 2002), this work supports their inclusion since we have now expanded the number of these homologs to well within the range of the other three core cas genes. The separation of species in the phylogenetic tree we have generated that one would expect to find in close association, as well as the close association in the tree of divergent species that one would expect to be separated, both indicate that these genes have at times been passed by means other than vertical transmission. The precise number of HGT events which have taken place are impossible to estimate due to the complexity of the phylogeny, but that there have been numerous events is without a doubt. Recent studies of the non-repetitive intervening sequences have suggested the origins of this spacer DNA originate from elsewhere in genomes, namely bacteriophage DNA or conjugative transposons (Mojica et al. 2005; Bolotin et al. 2005; Pourcel et al. 2005).

The presence of megaplasmids that contain CRISPR loci suggest a mode of transmission by which CRISPRs have spread so widely throughout prokaryotes. The 40 kb size of the predicted insertion into the Shewanella oneidensis genome is approximately the size of the smallest CRISPR-containing megaplasmid characterized to date (plasmid ece1 from Aquifex aeolicis, which is 39 kb), both of which are on the same size scale as bacteriophage and conjugative transposons. One difference the CRISPR-containing megaplasmids share with most conjugative transposons described to date is the absence of genes conveying resistance to antibiotics (Scott & Churchward, 1995). This is not to say that other, non-CRISPR-associated genes are not found on CRISPR containing megaplasmids. The pTT27 plasmid, for instance, contains enzymes required for the final stages of carotenoid as well as cobalamin biosynthesis (Henne et al. 2004).

One question that remains is: if megaplasmids are ultimately responsible for the propagation of CRISPRs throughout prokaryotic genomes, why have only ten CRISPR-containing megaplasmids been found, compared to the 148 CRISPR-containing genomes? The answer might be that most megaplasmids are not stably maintained in their host cells and that they often pay transient visits to the species in question, what we have defined as megaplasmids account for one fifth to one quarter of the total number of plasmids contained on the NCBI website for Archaea and Eubacteria, respectively. Notable exceptions to this proposed transience include the pTT27 plasmid discussed above which contains needed biosynthetic enzymes. Other CRISPR-containing megaplasmids may have developed alternate means by which they became stably maintained within a host. To fully answer this question, further studies are needed. Future characterization of CRISPR containing megaplasmids as well as the genes contained therein should shed insight into the function and amazing propagation of the class of DNA repetitive elements known as CRISPRs.

Experimental Procedures

Finding CRISPRs

Whole genomes were downloaded and searched using the program PatScan which was obtained from Argonne National Laboratory, Argonne, IL USA (http://www-unix.mcs.anl.gov/compbio/PatScan/HTML/patscan.html). Initially, the pattern used to find interspersed repeats was: p1 = 15 ... 70 15 ... 70 p1 15 ... 70 p1 (algorithm 1), but this was changed to: p1 = 21 ... 39 15 ... 45 p1 15 ... 45 p1 (algorithm 2) since it was found to speed up our analysis without detrimentally affecting the results obtained. The specific algorithm used to obtain our results is noted in the supplemental materials (Table A). Degenerate tandem repeats were discarded by hand following the analysis. A sequence logo representing the consensus sequence of all CRISPRs characterized was created by submitting a multiple alignment of the sequences to NetLogo at http://www.weblogo.berkeley.edu. A multiple alignment of 360 sequences was created by aligning each sequence, along with its reverse complement using the ClustalW function of MacVector 7.2.2. The open gap penalty was set at 10, while the extend gap penalty was set at 5, transitions were weighted with a divergent delay of 40%.

Finding Cas Genes

Existing cas genes (Jansen et al. 2002a) were used as the query for pBLAST searches using the NCBI database (http://www.ncbi.nlm.nih.gov/BLAST/). Cas genes were identified based both on their homology to the query, as well as their proximity to a CRISPR sequence. Often, the newly identified cas gene with the lowest amount of homology to the query (E-value ≤.7) was then used as a query in a new search to identify divergent members of the cas family.

Phylogenetic Analysis of Cas Genes

Organisms which contained all four known cas genes in a particular CRISPR locus were compared phylogenetically. All phylogenetic analysis was performed using the MacVector 7.2.2 suite of software. The cas genes were first concatenated in the following order: cas 1, 2, 4, then 3. Cas3 was the last protein sequence included since it is the most variable in size. These concatenated sequences were aligned using ClustalW, using an MD 350 matrix for pairwise alignment and an MD series for multiple alignment. Open gap penalties of 10 and extend gap penalties of 0.1 and 0.05 were used in the respective alignments. The neighbor joining method with random tie breaking was used to construct the phylogenetic tree. Gaps were distributed proportionally. The concatenated alignment was visually inspected for significant overlap between the different cas genes and none was found.

Alignment of Genomes

Genomic sequences were aligned using the Artemis Comparison Tool (http://www.sanger.ac.uk/Software/ACT/). Input comparison files were generated at the Double ACT website (http://www.193.129.245.227/pise/double_act.html). A cutoff score of 200 was used. Megaplasmid sequences were compared using a Pustell DNA Matrix generated in MacVector 7.2.2. A window size of 232 was used with a minimum % score of 60, a hash value of 6, and a jump value equal to 1.