Introduction

The mysterious microbial world encompasses organisms with wide diversity in their metabolic, phenotypic, and genomic characteristics. Microbial identification has shifted from reliance on morphological and biochemical characteristics to genomic features. The advent of molecular biology and bioinformatic techniques has almost completely revolutionized bacterial taxonomy and the understanding of bacterial evolutionary pathways. The transition from single gene sequences to whole genome sequences has given confidence in identifying even bacteria that are yet to be cultured. In fact, the limits of bacterial genomics can be extended through metagenomic explorations [1]. Phylogenetic trees provide an evolutionary scale for distinguishing organisms which are distantly placed. However, the output of these tools is complicated by excessive heterogeneity at one extreme and virtually no variability among strains at the other. It thus becomes difficult to identify them unambiguously [2, 3].

The modern taxonomic classification of microbes is based largely on a gene that is conserved throughout the prokaryotic domain: the 16S rRNA gene (rrs). Microbial taxonomy was given a new look, and the nucleotide sequence of this gene has been so widely adopted that it has become a reference point for almost all practical purposes. The Ribosomal Database Project (RDP) (https://rdp.cme.msu.edu/), which was initiated as a small depository of a few hundred rrs sequences, has more than 3.0 million entries at present (RDP Release 11, Update 3, September 17, 2014: 3,019,928 16S rRNA and 102,901 fungal 28S rRNA entries). The rapidly increasing size of this database is a clear reflection of the influence of the findings of Prof. Carl R. Woese [4, 5]. At times, the rrs gene sequence cannot differentiate very closely related taxa. In such a scenario, one needs to resort to gene sequences that code for features such as heat shock proteins, the ATPase β-subunit, RNA polymerases, or recombinase. In certain cases, additional genes have been identified that can be used exclusively for distinguishing members within a genus: (1) rpoB for Mycobacterium; (2) gyrB for Acinetobacter, Mycobacterium, Pseudomonas, and Shewanella; and (3) gyrA for Bacillus subtilis. A few methods generally used for identifying bacterial strains are amplified fragment length polymorphism (AFLP), DNA–DNA re-association, microarray, PCR-ribotyping, multi-locus sequence analysis, randomly amplified polymorphic DNA, and restriction endonuclease (RE) digestion [5, 6].

The Latent Features of 16S rDNA

The RDP database, used as a reference to identify newly sequenced 16S rDNA, is limited by the fact that it can identify a sequence only to the extent of what is already known and available. For a gene sequence that the database has not yet seen, it is difficult to visualize how to place the unknown on the evolutionary map. Efforts to resolve the potential problems existing among the different species of (1) Bacillus, (2) Clostridium, (3) Pseudomonas, and (4) Streptococcus revealed the presence of certain latent features in their 16S rDNA genes. The first step in the generation of molecular markers was to develop a Phylogenetic Framework composed of sequences that delineated one species from another, i.e., sequences that could be used to demarcate the phylogenetic limits of all the known sequences within a species. The second step was to identify motifs (signature sequences, 30–50 nucleotides (nts) in length) that were unique to a particular species and completely absent from all other species. The third feature, which validates the true identity of the 16S rDNA, was the identification of an RE that gives a unique digestion pattern: fragment lengths (nts) and the order of their occurrence. These efforts helped in the identification of organisms that had initially been identified (by the inventor) only up to genus level [6–9]. This humble beginning in identifying the latent features of organisms that are already well identified will help in future to identify unknown sequences and place them on the phylogenetic tree. In fact, these tools have been used to a small extent in certain studies; however, a complete study has been undertaken successfully by others to identify clinically important members of the genus Streptococcus [8, 10].
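The second step described above, finding signature motifs that occur in every sequence of one species and in no sequence of any other, can be illustrated with a minimal sketch. It is not the procedure used in the cited studies: it assumes exact k-mer matching, and the function names, toy sequences, and short motif length are illustrative only (real signatures are 30–50 nts long).

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_motifs(target_seqs, other_seqs, k):
    """k-mers present in every target sequence and absent from all others.

    Assumption: exact matching only; mismatches and reverse complements
    are ignored in this sketch.
    """
    if not target_seqs:
        raise ValueError("at least one target sequence is required")
    shared = set.intersection(*(kmers(s, k) for s in target_seqs))
    seen_elsewhere = set().union(*(kmers(s, k) for s in other_seqs))
    return shared - seen_elsewhere


# Toy data (hypothetical sequences, k = 4 for readability).
species_a = ["ACGTAC", "TACGTA"]
other_species = ["CGTACC"]
print(sorted(signature_motifs(species_a, other_species, 4)))  # ['ACGT']
```

Here CGTA is shared by both target sequences but is discarded because it also occurs in the other species, which mirrors the requirement that a signature be completely absent elsewhere.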

The Mysterious Clostridium

Clostridium is a phenotypically and phylogenetically heterogeneous group of strains, which may or may not produce spores and/or toxins, and may give a gram-negative or gram-positive reaction [7, 9]. They are tedious to identify, since their GC content varies from 24 mol % in Clostridium perfringens to 58 mol % in C. barkeri. Another major hurdle in identifying Clostridium with high precision is the high heterogeneity caused by the presence of multiple copies of the rrs gene. Novel markers are therefore needed for their rapid identification. A novel approach to distinguish very closely related strains of Clostridium botulinum was developed recently [11]. However, the method, though effective, could be applied only to a limited set of strains. In order to identify Clostridium present in a mixture of unrelated bacteria, we have identified two sets of genes in Clostridium: (1) genes common to most of the species, and (2) genes unique to a species. A combination of a particular gene or gene set and its (unique) digestion pattern obtained with a specific RE can be exploited to rapidly identify Clostridium species.

Materials and Methods

Sequence Data and Comparative Genome Analysis

Completely sequenced genomes of 27 strains of 9 species belonging to the genus Clostridium were retrieved (http://www.ncbi.nlm.nih.gov/). Of these, 13 strains belonged to C. botulinum, 3 strains each to C. acetobutylicum and C. perfringens, and 2 strains each to C. kluyveri and C. tetani. The remaining genomes were of C. beijerinckii, C. cellulovorans, C. ljungdahlii, and C. novyi (Table S1). The accession number, GC percentage, size, and number of genes of each genome are presented in Table S1. Pairwise comparisons among the Clostridium genomes were done to identify common genes (Table 1) and unique genes (Table S2).
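The text does not state how the pairwise comparisons were implemented. As a rough illustration of the idea only, the sketch below assumes that each genome has already been reduced to a list of gene names and treats "common" as present in every genome and "unique" as present in exactly one. The strain labels and the names geneX and geneY are hypothetical.

```python
def common_and_unique(genomes):
    """Split gene names into those shared by all genomes and those
    found in a single genome.

    genomes: dict mapping a strain label to an iterable of gene names.
    Assumption: genes are matched by name, not by sequence similarity.
    """
    gene_sets = {strain: set(genes) for strain, genes in genomes.items()}
    common = set.intersection(*gene_sets.values())
    unique = {}
    for strain, genes in gene_sets.items():
        others = set().union(
            *(g for s, g in gene_sets.items() if s != strain)
        )
        unique[strain] = genes - others
    return common, unique


# Toy input with hypothetical strains and two hypothetical gene names.
toy_genomes = {
    "strain_A": ["recN", "dnaJ", "secA", "geneX"],
    "strain_B": ["recN", "dnaJ", "secA"],
    "strain_C": ["recN", "dnaJ", "mutS", "geneY"],
}
common, unique = common_and_unique(toy_genomes)
print(sorted(common))              # ['dnaJ', 'recN']
print(sorted(unique["strain_C"]))  # ['geneY', 'mutS']
```

A strain such as strain_B, with an empty unique set, corresponds to the genomes for which no unique gene is available and which must be separated by digestion patterns of common genes instead.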

Table 1 List of genes common among sequenced genomes of Clostridium strains (www.ncbi.nlm.nih.gov)

Restriction Endonuclease Analysis for Common Gene

A total of 22 Type II REs were considered for digestion on the basis of our previous works [6, 7, 9, 11]. The following REs were used: (1) four-base cutters AluI (AG′CT), BfaI (C′TA_G), BfuCI (_GATC′), Bsp143I (_GATC′), BstKTI (G′AT_C), BstMBI (_GATC′), CviAII (C_AT′G), DpnI (GA′TC), DpnII (_GATC′), FatI (_CATG′), FspBI (C_TA′G), Hin1II (′CATG_), HpyCH4V (TG′CA), Hsp92II (′CATG_), MaeI (C_TA′G), RsaI (GT′AC), TaqI (T_CG′A), Tru9I (T_TA′A), XspI (C_TA′G); (2) five-base cutter Hsp92I (GR_C′YC); and (3) six-base cutters HaeI (WGG′CCW) and Hin1I (GR_CG′YC) (Table S3). All 27 common gene sequences (Table 1) were entered into Cleaver (http://cleaver.sourceforge.net/) to obtain RE digestion patterns. Subsequently, emphasis was laid on those RE motifs which were common to all the strains. Only data matrices of REs that produced 5–15 fragments were taken into consideration. Consensus RE patterns, the frequency of occurrence of RE sites, and the pattern of nucleotide fragments (nts) were determined for each gene by employing AluI (AG′CT), BfaI (C′TA_G) and Tru9I (T_TA′A).
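The digestion patterns in this study were obtained with Cleaver. The sketch below is not Cleaver's implementation; it only illustrates what an in silico digestion computes, under the assumptions that the gene is a linear sequence, that the recognition site contains no degenerate bases, and that only the top-strand cut is considered. The function names and the toy sequence are illustrative; the AluI cut offset of 2 follows the AG′CT notation above.

```python
def digest(seq, site, cut_offset):
    """Ordered fragment lengths (5'->3') of a linear sequence cut at
    every occurrence of `site`.

    cut_offset: number of bases from the start of the site to the cut,
    e.g. 2 for AluI (AG'CT). Degenerate sites are not handled here.
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    # Skip zero-length pieces that arise when a cut falls on an end.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def is_informative(fragments, low=5, high=15):
    """Apply the 5-15 fragment criterion used to select REs."""
    return low <= len(fragments) <= high


toy_gene = "TTAGCTGGGAGCTA"  # hypothetical 14-nt sequence, two AluI sites
alu_fragments = digest(toy_gene, "AGCT", 2)
print(alu_fragments)                  # [4, 7, 3]
print(is_informative(alu_fragments))  # False
```

The fragment lengths always sum to the gene length, and their order is kept because both the sizes and the order of occurrence are part of the pattern.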

Restriction Endonuclease Analysis for Unique Gene

A total of 241 Type II REs with recognition sites of ≥4 nucleotides available in BioEdit were used to generate unique RE patterns [12]. Out of these, only 102 REs were used for further analyses (Table S4). Subsequently, the study was focused on those RE sites which were unique to each strain.

Results

The 27 completely sequenced genomes of Clostridium, namely C. botulinum (13), C. acetobutylicum and C. perfringens (3 each), C. kluyveri and C. tetani (2 each), and C. beijerinckii, C. cellulovorans, C. ljungdahlii, and C. novyi (1 each), showed high heterogeneity at the genetic level. The number of genes per genome varied from 2427 to 5243 and the overall GC content ranged from 27.4 to 32.02 mol % (Table S1).

Common Gene Analysis

Comparative genomic analyses revealed the presence of genes which were common to all the Clostridial genomes. A total of 27 common genes including 22 housekeeping genes (HKG) were identified on the basis of their high frequency of occurrence (Table 1). A total of 13 genes (including 10 HKGs) were found to be present in 2–4 copies in 21 strains.

In Silico RE Digestion Patterns of Common Genes

In silico RE digestion patterns for all 27 common genes were obtained with 22 REs, which were selected on the basis of our previous works [6, 7, 9, 11]. Three REs, AluI (AG′CT), BfaI (C′TA_G) and Tru9I (T_TA′A), were generally found to produce 5–15 easily distinguishable fragments and were thus selected for identifying novel markers (Tables 2, 3, 4, S5–S7).

Table 2 Unique fragmentation pattern (5′-3′) generated by in silico digestion of common genes present in Clostridium strains: AluI
Table 3 Unique fragmentation pattern (5′-3′) generated by in silico digestion of common genes present in Clostridium strains: BfaI
Table 4 Unique fragmentation pattern (5′-3′) generated by in silico digestion of common genes present in Clostridium strains: Tru9I

AluI: RE-AluI showed unique digestion patterns in three HKGs, recN, dnaJ and secA, among the Clostridium strains (Tables 2, S5). On the basis of the digestion of recN with RE-AluI, it was possible to distinguish 20 out of 27 Clostridium strains, representing 8 species (Table 2): 10 strains of C. botulinum, 3 strains of C. perfringens, 2 strains of C. tetani, and one each of C. beijerinckii NCIMB 8052, C. cellulovorans 743B, C. kluyveri DSM 555, C. ljungdahlii DSM 13528 and C. novyi NT. Notably unique digestion patterns (nucleotide fragments) were observed with C. botulinum 230613 (162·19 nts), C. botulinum Alaska E43 (575·22 nts), C. kluyveri DSM 555 (897·165 nts) and C. novyi NT (193·132 nts), which had only two fragments each. Another set of strains, which had only four RE fragments each, comprised (1) C. beijerinckii NCIMB 8052 (240·483·310·50 nts), (2) C. botulinum Eklund 17B (13·17·575·121 nts), and (3) C. ljungdahlii DSM 13528 (565·188·286·32 nts). C. botulinum strain BKT015925, C. cellulovorans strain 743B, and C. tetani strains 12124569 and E88 were also easily distinguishable on the basis of their unique RE-AluI digestion patterns.

Among C. botulinum strains Kyoto, 657, Langeland, Loch Maree, Okra and BKT015925, each had similar fragments of 162·19 nts at the 5′ end and 70·30·213 nts at the 3′ end. However, all of them were easily distinguishable on the basis of the fragments present between the two ends. The common genes of C. botulinum strain H04402 065 showed minor similarities to those of other strains of this species; however, their patterns were still unique and can be used as novel markers. Similarly, the three strains of C. perfringens appeared quite close to each other; however, certain fragments were further subdivided, which enabled easy distinction, e.g., the 252 nts and 325 nts fragments of strain 13 appeared as 4·248 nts and 123·202 nts in strains ATCC 13124 and SM101. Further distinction between C. perfringens strains ATCC 13124 and SM101 could be made on the basis of the 105 nts fragment being partitioned into 5·100 nts in the latter.
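The subdivision of fragments described for C. perfringens is simple arithmetic on cut positions: 4 + 248 = 252 and 123 + 202 = 325, so the finer pattern keeps the original cut and adds new ones. The sketch below checks this for the fragment lengths quoted above. Two assumptions are made for illustration: the 252 and 325 nts fragments are treated as an isolated, adjacent stretch, and positions are counted from the start of that stretch rather than from the start of the gene. The function names are illustrative.

```python
from itertools import accumulate


def cut_positions(fragments):
    """Internal cut coordinates implied by ordered fragment lengths."""
    return set(accumulate(fragments[:-1]))


def extra_cuts(coarse, fine):
    """Cut positions found only in `fine`, after checking that `fine`
    covers the same length and retains every cut of `coarse`."""
    if sum(coarse) != sum(fine):
        raise ValueError("patterns cover different total lengths")
    coarse_cuts, fine_cuts = cut_positions(coarse), cut_positions(fine)
    if not coarse_cuts <= fine_cuts:
        raise ValueError("fine pattern lacks a cut of the coarse pattern")
    return sorted(fine_cuts - coarse_cuts)


strain_13 = [252, 325]            # fragments quoted for strain 13
subdivided = [4, 248, 123, 202]   # ATCC 13124 and SM101
print(extra_cuts(strain_13, subdivided))  # [4, 375]
print(extra_cuts([105], [5, 100]))        # [5]
```

Each additional cut position corresponds to a recognition site that is present in one strain and absent in the other.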

Similarly, digestion of dnaJ and secA with RE-AluI made it possible to distinguish all 16 strains listed in Table 2.
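Deciding which strains a given RE and gene combination can distinguish amounts to grouping strains by their ordered fragment pattern and keeping those whose pattern occurs only once. The following sketch illustrates this with the four two-fragment recN/AluI patterns quoted above. The two "hypothetical strain" entries and their fragment lengths are invented solely to show what an indistinguishable pair looks like, and the function name is illustrative.

```python
from collections import defaultdict


def distinguishable(patterns):
    """Strains whose ordered fragment pattern is shared with no other.

    patterns: dict mapping a strain label to a sequence of fragment
    lengths in 5'->3' order.
    """
    groups = defaultdict(list)
    for strain, fragments in patterns.items():
        groups[tuple(fragments)].append(strain)
    return sorted(
        members[0] for members in groups.values() if len(members) == 1
    )


recn_alui = {
    "C. botulinum 230613": (162, 19),
    "C. botulinum Alaska E43": (575, 22),
    "C. kluyveri DSM 555": (897, 165),
    "C. novyi NT": (193, 132),
    # Hypothetical entries, not from the study: an identical pair.
    "hypothetical strain X": (300, 40, 10),
    "hypothetical strain Y": (300, 40, 10),
}
for strain in distinguishable(recn_alui):
    print(strain)
```

Because tuples are compared element by element, two patterns with the same fragment sizes in a different order count as different, in line with the use of both fragment length and order of occurrence.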

BfaI: With RE-BfaI, the unique digestion patterns of the common genes recN and mutS of Clostridium species (Tables 3, S6) could be used as novel markers for 17 and 14 strains, respectively.

Tru9I: In silico digestion pattern analysis of the common genes of Clostridium species with RE-Tru9I (Tables 4, S7) revealed that two genes, mutS and grpE, can be used to clearly identify 16 strains. However, from a practical point of view, the digestion pattern of mutS may not be very effective, as it generates a large number of small fragments (Table 4). mutS is the only gene that may be used to differentiate strain Hall from all other Clostridium strains.

Multiple Copies of Common Genes in Clostridium Genome

In this study, we found multiple copies of 13 different genes belonging to 22 different strains of Clostridium. The number of gene copies varied from 2 to 4, with 2 being the most frequent number (Tables S8–S10). In most cases, RE digestion patterns varied among the copies as well. By digesting common genes that were present in multiple copies, we could distinguish an additional 3 strains of Clostridium: C. acetobutylicum ATCC 824, C. acetobutylicum DSM 1731 and C. kluyveri NBRC 12016 (Tables S8–S10). It may be concluded that, using combinations of REs and common genes, we could distinguish 24 out of the 27 strains used in this study.

Unique Gene Analysis

Pairwise comparison among the 27 annotated strains of Clostridium species revealed the presence of unique genes. The number of unique genes varied from as low as one in C. acetobutylicum strains ATCC 824 and EA 2018 and C. botulinum strains Alaska E43 and BKT015925 to as high as 31, 35, 40 and 71 in C. ljungdahlii DSM 13528, C. tetani 12124569, C. acetobutylicum DSM 1731, and C. kluyveri DSM 555, respectively (Table S2). Out of the 27 genomes, only 19 strains were found to have unique genes, which can be exploited for strain-level identification. This indicates that wide genetic variability is available for distinguishing even very closely related species.

In Silico RE Digestion Patterns of Unique Genes

Unique genes for 19 strains of Clostridium and their digestion patterns with REs are listed in Table S2. These genes can be used either individually or in various combinations to identify organisms up to strain level. In order to increase the validity of the identification, RE patterns of genes with multiple cut sites can be used (Table S2). By combining the RE digestion patterns of common and unique genes, we can identify 26 out of the 27 strains used in this study.

Discussion

In silico mapping of genes with different Type II REs has revealed that digestion patterns vary substantially even between closely related organisms. The variation in RE digestion patterns within a gene originates from single nucleotide changes, especially those that fall within the RE recognition motif [11]. Although a large number of REs can be used to digest a gene, only a few of them can be employed for drawing meaningful conclusions. A set of 22 different REs was used in this study to identify unique digestion patterns within a gene. It was revealed that, out of the 2427–5243 genes present in the genomes of Clostridium strains, around 27 genes were common to most of them. The presence of these common genes can help in easily identifying the organism at least up to genus level. To identify the organism up to species level, another set of markers is needed. It was realized that only three RE-HKG combinations can be used as novel markers for identifying Clostridium strains: (1) AluI with recN, dnaJ and secA, (2) BfaI with recN and mutS, and (3) Tru9I with mutS and grpE. In summary, we may conclude that each strain can be identified and further validated by combining the observations made on certain common or unique genes and their RE digestion patterns (Table 5). This study thus provides a unique opportunity to develop diagnostic kits for rapidly identifying strains by amplifying only a very limited number of genes. Perhaps the best part of this study is its potential to be extended to any gene and organism of interest. A few studies have in fact been conducted in which RE digestion patterns of functional genes have been used as markers [8, 10, 13–17].

Table 5 Potential gene types which can be used for identification of Clostridium strains