Introduction

The Cyanobacteria constitute a large phylum of oxygenic phototrophic bacteria, which exhibit enormous diversity in terms of their morphological, physiological and developmental characteristics (Rippka et al. 1979; Castenholz 2001). The evolutionary relationships and the classification of bacteria within this phylum are, at present, poorly understood (Wilmotte and Herdman 2001; Komárek 2002). This is largely because cyanobacterial taxonomy was for a long time governed by botanical criteria, which are based on morphological and developmental characteristics (Geitler 1932; Desikachary 1959; Rippka et al. 1979; Castenholz 2001; Hoffmann 2005). These characteristics are generally plastic in nature and most of them are not well suited for reliable classification (Stanier et al. 1978; Woese 1992; Gupta 1998; Wilmotte and Herdman 2001). Although the nomenclature and taxonomy of cyanobacteria were later placed under the International Code of Nomenclature of Bacteria (ICNB) (Stanier et al. 1978; De Vos and Truper 2000; Labeda 2000; Hoffmann 2005), most cyanobacterial names are still based on the original botanical criteria and very few taxa are validly described in terms of the bacteriological code (Oren 2004; Hoffmann 2005; Oren et al. 2009; Parte 2014). As a result, species/strains from large numbers of cyanobacterial genera (e.g., Synechococcus, Nostoc, Calothrix, Cyanothece, Oscillatoria, Anabaena, Prochlorococcus, Leptolyngbya, etc.) exhibit extensive polyphyletic branching in the 16S rRNA or other gene/protein trees (Honda et al. 1999; Turner et al. 1999; Wilmotte and Herdman 2001; Hoffmann et al. 2005; Rajaniemi et al. 2005; Gupta and Mathews 2010; Shih et al. 2013). Thus, it has proven very difficult to develop any reliable classification for cyanobacteria (Rippka et al. 1979; Castenholz 2001; Cavalier-Smith 2002; Hoffmann 2005; Sayers et al. 2010; Oren and Garrity 2014). 
In view of the enormous challenges that are faced in developing a meaningful classification of cyanobacteria under the Bacteriological Code, it has been recently proposed that the cyanobacteria be excluded from the groups of organisms that are covered by the Bacteriological Code (Oren and Garrity 2014).

Within cyanobacteria, one important group consists of the bacteria that are able to differentiate into heterocysts (Rippka et al. 1979; Castenholz 2001). Heterocysts are specialized cells that enable these bacteria to separate nitrogen fixation from photosynthesis. The heterocystous cyanobacteria have filamentous morphology and most of them use specialized cells, called hormogonia, for replication (Rippka et al. 1979; Castenholz 2001). The heterocyst-producing cyanobacteria are further divided into two groups based on their development of unbranched (false branching) or branched (true branching) filament colonies (Rippka et al. 1979; Cavalier-Smith 2002). These two groups of heterocystous cyanobacteria are placed into two different orders, Nostocales and Stigonematales, by Cavalier-Smith (2002), and into two separate sections (IV and V, respectively) in a classification scheme proposed by Rippka et al. (1979). In the 16S rRNA tree, heterocystous cyanobacteria, which comprise more than 40 genera, form a monophyletic cluster (Giovannoni et al. 1988; Honda et al. 1999; Turner et al. 1999; Wilmotte and Herdman 2001; Henson et al. 2002; Rajaniemi et al. 2005). However, within this clade, the members of both orders exhibit extensive intermixing (Wilmotte and Herdman 2001; Gugger and Hoffmann 2004). Similar polyphyletic branching of the members from these two orders is observed in phylogenetic trees based upon nifD and nifH sequences (Zehr et al. 1997; Henson et al. 2002, 2004; Singh et al. 2013). These observations indicate that the division of heterocystous cyanobacteria into the two distinct groups, viz. Nostocales and Stigonematales, is not supported by the available evidence. Additionally, although the formation of hormogonia is regarded as a defining characteristic of heterocystous cyanobacteria (Cavalier-Smith 2002), these structures are not found in many members of this group (Castenholz 2001; Shih et al. 2013). 
Thus, it is important to identify other biochemical/molecular characteristics which are uniquely shared by either all or different subgroups of heterocystous cyanobacteria and can prove useful in clarifying their evolutionary relationships.

In recent years, genome sequences for a large number of cyanobacteria have become available; these sequences provide a valuable resource for understanding the evolutionary relationships among cyanobacteria and for discovering molecular markers that are specific for the different main clades that are present within this phylum (Gupta 2000; Gao et al. 2009; Gupta 2009; Wu et al. 2009). Our earlier work on a limited number of cyanobacterial genomes identified many molecular signatures in the form of conserved signature inserts or deletions (i.e., indels; CSIs) in protein sequences and conserved signature proteins (CSPs) that are specific for different clades of cyanobacteria (Gupta 2009, 2013; Gupta and Mathews 2010). However, the overall coverage of cyanobacterial diversity in these analyses was very limited and they included either no genomes, or only a few genomes, from the Stigonematales and Nostocales species. Hence, an examination of the relationships among the heterocystous cyanobacteria or the discovery of new molecular signatures for this group was not feasible at that time. Recently, as a result of the phylogenetically driven Genomic Encyclopedia of Bacteria and Archaea (GEBA) project (Wu et al. 2009; Shih et al. 2013), genome sequences for large numbers of cyanobacteria have become available (CyanoGEBA database), providing extensive coverage of the phylogenetic diversity within this phylum. These sequences, which more than triple the number of available cyanobacterial genomes, include multiple representatives from the orders Nostocales and Stigonematales (Shih et al. 2013), thus enabling detailed phylogenetic and comparative genomic studies on these bacteria. In this study, we have used these genome sequences to construct phylogenetic trees for 140 cyanobacterial species/strains based upon concatenated sequences for 32 universally distributed proteins. 
These studies provide strong evidence that the heterocystous cyanobacteria form a monophyletic clade within the phylum cyanobacteria. In addition, comparative analyses of these genome sequences have identified 15 CSIs and 3 CSPs that are specific for either all sequenced heterocystous cyanobacteria or a number of their ancestral lineages, providing novel molecular markers and insights into the evolution of this important group of cyanobacteria.

Methods

Phylogenetic analysis

Phylogenetic analysis was performed on a concatenated sequence alignment of 32 highly conserved proteins that are found in most bacteria (Harris et al. 2003; Gupta 2009). The names and sequence characteristics of the proteins used for phylogenetic analysis are provided in Supplementary Table 1. Amino acid sequences for these proteins were obtained for 140 cyanobacterial species/strains, whose complete or draft genomes are now available in at least one of the following databases, viz. the Joint Genome Institute’s Integrated Microbial Genomes Database (http://img.jgi.doe.gov/cgi-bin/w/main.cgi), the NCBI genome database (http://www.ncbi.nlm.nih.gov/) and the EzGenome database (http://www.ezbiocloud.net/). Characteristics of these genomes are listed in Supplementary Table 2. The sequences for Bacillus subtilis subsp. subtilis 168 were included in our dataset to root the tree. Multiple sequence alignments of the proteins were created using Clustal_X 2.1 (Larkin et al. 2007) and, after arranging them in the same species order, they were concatenated into a single file. Poorly aligned regions from the alignment were removed using Gblocks 0.91b (Castresana 2000). The resulting sequence alignment, which contains 15,279 aligned positions, was used for phylogenetic analysis. Maximum-likelihood (ML) and neighbor-joining (NJ) trees based on 100 bootstrap replicates of this sequence alignment were constructed using Mega 5.05 (Tamura et al. 2011) employing Jones-Taylor-Thornton substitution models (Jones et al. 1992). A phylogenetic tree was also constructed for 68 16S rRNA gene sequences (>1,100 bp in length) for members of the orders Nostocales and Stigonematales that are present in the SILVA ribosomal RNA gene database (Quast et al. 2013). This alignment included at least one representative member from the different genera of the above two orders. 
A maximum-likelihood tree with 100 bootstrap replicates was constructed from this dataset based on the maximum composite-likelihood model of evolutionary gene change using Mega 5.05 (Nei and Kumar 2000; Tamura et al. 2011). The 16S rRNA gene of Microcystis aeruginosa NIES-843 was used to root this tree.
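
The concatenation step described above can be illustrated with a minimal Python sketch. It is not the procedure used in this work (the alignments here were produced with Clustal_X and concatenated after manual ordering); the function name `concatenate_alignments`, the dictionary-based input and the gap-padding of species missing from a protein alignment are all assumptions made for illustration.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Join per-protein alignments into one concatenated alignment.

    Each alignment maps a species name to its aligned sequence.
    Hypothetical helper: names and padding behaviour are illustrative only.
    """
    # Use one fixed species order for every protein, as required for concatenation.
    species = sorted({name for aln in alignments for name in aln})
    parts: Dict[str, List[str]] = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        width = lengths.pop()
        for name in species:
            # Assumption: a species lacking this protein is padded with gaps.
            parts[name].append(aln.get(name, gap * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}
```

For example, calling the function on two toy alignments, `{"A": "MK-L", "B": "MKAL"}` and `{"A": "GG", "B": "GA", "C": "TA"}`, yields sequences of equal length for A, B and C, with C gap-padded over the first protein.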

Identification of conserved signature indels

The identification of CSIs was carried out in a similar manner as described in earlier work (Gupta et al. 2003; Gupta 2009). Blastp searches were initially performed on proteins from the genome of Nostoc sp. PCC 7120. These searches were individually performed on all proteins from GI numbers 17227498 to 177228576 and 17231359 to 17232863 using the NCBI nr database. For each protein, sequences of 10–15 high-scoring homologues were obtained from the available Nostocales and other cyanobacteria as well as several outgroup species. Multiple sequence alignments of these proteins were constructed using Clustal_X 2.1 (Larkin et al. 2007). The resulting alignments were visually examined for the presence of indels that are flanked on both sides by at least 5–6 identically conserved amino acids in the neighboring 30–40 residues. Indels that were not flanked on either side by conserved regions were not further considered, as they do not provide useful molecular markers and could arise from alignment artefacts (Gupta et al. 2003; Gupta 2009). Species distribution patterns of all potentially useful indels were examined further by performing more detailed Blastp searches on short sequence regions (approximately 60–80 aa long) containing the indel and its flanking conserved regions. The top 250 blast hits were examined for the presence or absence of similar indels to determine the specificities of different indels (Gupta 2009). Protein sequence information from available draft genomes was obtained by tBlastn searches. In this work, we report the results of those CSIs that in most cases are specifically present in species from the orders Nostocales and Stigonematales and not present in other cyanobacteria or other bacteria.
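
Although the alignments in this work were examined visually, the flanking-conservation criterion can be expressed as a short Python sketch. The function names, the default window of 30 columns and the threshold of 5 identical positions are assumptions chosen to mirror the ranges stated above; column indices are 0-based and the indel occupies columns `indel_start` up to, but not including, `indel_end`.

```python
from typing import List


def conserved_columns(seqs: List[str], start: int, stop: int) -> int:
    """Count alignment columns in [start, stop) where all sequences share one residue."""
    count = 0
    for col in range(max(start, 0), min(stop, len(seqs[0]))):
        residues = {s[col] for s in seqs}
        # A column counts only if it is identical in every sequence and is not a gap.
        if len(residues) == 1 and "-" not in residues:
            count += 1
    return count


def is_flanked(seqs: List[str], indel_start: int, indel_end: int,
               window: int = 30, min_identical: int = 5) -> bool:
    """Check that an indel has enough identically conserved residues on both sides.

    Hypothetical helper: window and min_identical are illustrative defaults.
    """
    left = conserved_columns(seqs, indel_start - window, indel_start)
    right = conserved_columns(seqs, indel_end, indel_end + window)
    return left >= min_identical and right >= min_identical
```

An indel that passes on only one side is rejected, which corresponds to the exclusion of indels not flanked by conserved regions on both sides.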

Identification of conserved signature proteins specific for heterocystous cyanobacteria

Blastp searches were conducted on all previously identified proteins that were only found in a limited number of Nostocales species (from Table 4 and supplementary Table 5 of Gupta and Mathews 2010) to determine their species distribution. A protein was considered to be specific for heterocystous cyanobacteria, if all significant Blast hits (E value < 1 × 10⁻⁴) were from this group and the protein was present in all (or most) heterocystous cyanobacteria, with very few exceptions (Gupta and Mathews 2010).
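
The specificity criterion can be sketched in Python as follows. This is an illustration only: the function name `is_group_specific`, the representation of Blast hits as (species, E value) pairs and the `min_fraction` parameter (a stand-in for "all or most" members of the group) are assumptions and are not taken from the original analysis.

```python
from typing import Iterable, Set, Tuple


def is_group_specific(hits: Iterable[Tuple[str, float]], group: Set[str],
                      e_cutoff: float = 1e-4, min_fraction: float = 0.9) -> bool:
    """Decide whether a protein looks specific for a group of species.

    hits: (species, E value) pairs from a Blast search.
    min_fraction: assumed share of the group that must have a significant hit.
    """
    # Keep only species with a significant hit (E value below the cutoff).
    significant = {species for species, e_value in hits if e_value < e_cutoff}
    if not significant:
        return False
    # All significant hits must come from the group of interest...
    only_in_group = significant <= group
    # ...and the protein must be present in all (or most) members of the group.
    coverage = len(significant & group) / len(group)
    return only_in_group and coverage >= min_fraction
```

This strict version does not tolerate the isolated exceptions outside the group that are allowed in the text; a tolerance for such hits would need to be added explicitly.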

Results

Phylogenetic analysis of cyanobacteria based on a large dataset of protein sequences

The evolutionary relationships among different cyanobacteria until recently were only determined on the basis of 16S rRNA gene sequences (Honda et al. 1999; Turner et al. 1999; Wilmotte and Herdman 2001; Yarza et al. 2010). With the availability of genome sequences, it is now possible to construct phylogenetic trees based upon concatenated sequences for many conserved proteins, which provide a more reliable portrayal of the species relationships than those based on single genes or proteins (Rokas et al. 2003; Ciccarelli et al. 2006; Wu et al. 2009; Gupta and Mathews 2010). Phylogenetic trees for a limited number of cyanobacteria have been previously constructed based upon different datasets of protein sequences (Sanchez-Baracaldo et al. 2005; Zhaxybayeva et al. 2006; Shi and Falkowski 2008; Swingley et al. 2008; Gupta and Mathews 2010). Although, in these studies, a number of distinct clades of cyanobacteria were identified, due to the very limited coverage of cyanobacterial diversity in these analyses, inferences based upon them required further confirmation. Recently, Shih et al. (2013) reported sequencing of genomes from 54 phylogenetically and phenotypically diverse cyanobacteria (CyanoGEBA database), which greatly increases the information available for this phylum. In the phylogenetic tree, that they constructed for 123 cyanobacterial species/strains, based upon 31 conserved proteins, the sequenced cyanobacteria formed several distinct clusters and all heterocystous cyanobacteria grouped into one clade (Shih et al. 2013). More recently, draft genomes for a number of additional cyanobacterial species/strains, which include 11 new members from the heterocyst group (viz. Cylindrospermopsis raciborskii CS-509, Anabaena sp. 
90, Anabaena circinalis AWQC131C, Anabaena circinalis AWQC310F, Chlorogloeopsis fritschii PCC 9212, Chlorogloeopsis fritschii PCC 6912, Fischerella thermalis PCC 7521, Fischerella muscicola PCC 73103, Calothrix desertica PCC 7102, Mastigocoleus testarum BC008, Richelia intracellularis HH01) have become available (see Methods). This has afforded a more detailed examination of the relationships among heterocystous cyanobacteria.

In this study, we have constructed phylogenetic trees for 140 cyanobacteria (listed in Supplementary Table 2), including 35 members of the heterocystous group, based upon concatenated sequences of 32 conserved proteins. The trees based upon this large dataset of protein sequences were constructed using both ML and NJ algorithms. The tree obtained using the ML method is shown in Fig. 1 and information for the NJ tree is provided in Supplementary Figure 1. The branching patterns of the cyanobacterial species/strains in the two trees are very similar and most nodes are supported by bootstrap values between 70 and 100 %. As noted earlier, and known from numerous other studies (Honda et al. 1999; Turner et al. 1999; Wilmotte and Herdman 2001; Rajaniemi et al. 2005; Gupta and Mathews 2010; Yarza et al. 2010; Shih et al. 2013), many cyanobacterial genera exhibited extensive polyphyletic branching indicating that the current names of many cyanobacteria are not informative and may, in fact, be misleading. Aside from this problem, the examined cyanobacterial species formed a number of strongly supported clades in these trees. These clades, arbitrarily marked 1–9, are generally similar to those observed by Shih et al. (2013), except for some differences in the grouping of species in the smaller clades.

Fig. 1

A maximum-likelihood consensus tree based on 32 concatenated sequences for 140 species of cyanobacteria. The tree was rooted using the sequences for Bacillus subtilis subsp. subtilis 168. Numbers located at nodes indicate the bootstrap values out of 100. Major clades and subclades of cyanobacteria resolved in the tree are indicated

Of these clades, Clade 1A includes all of the heterocystous cyanobacteria, confirming their monophyletic origin. The closest relative of the heterocystous cyanobacteria in this tree is Clade 1B, which is comprised of three poorly characterized cyanobacteria (viz. Synechocystis PCC 7509, Chroococcidiopsis PCC 7203 and Gloeocapsa PCC 7428). A specific relationship between the Clade 1A and 1B cyanobacteria is supported by both ML and NJ methods and it was also observed in the tree by Shih et al. (2013). In addition to this relationship between Clade 1A and Clade 1B cyanobacteria, another small clade consisting of two species, viz. Crinalium epipsammum PCC 9333 and Chamaesiphon minutus PCC 6605 (Clade 1C), also exhibited a close relationship to these two groups of cyanobacteria. A close relationship of the heterocystous cyanobacteria (Clade 1A) to the species which are part of the Clades 1B and 1C is also observed in the 16S rRNA tree (Wilmotte and Herdman 2001). Of the other main clades marked in the tree, the Clades 2 and 3 are mainly comprised of the species from the orders Chroococcales and Oscillatoriales. Although species/strains belonging to these orders of cyanobacteria are also present in a number of other clades, these other species are evolutionarily unrelated to the members of these two clades. The phylogenetic tree also supports a grouping of the species/strains from Clades 1, 2 and 3 into a larger clade, which corresponds to Clade B in our earlier work (Gupta 2009; Gupta and Mathews 2010) and in that of Shih et al. (2013). In addition to these clades, another large and strongly supported clade in the tree, i.e., Clade 7, is comprised of marine unicellular cyanobacteria belonging to the genera Prochlorococcus, Synechococcus and Cyanobium. This clade, which is separated from all other cyanobacteria by a long branch, was referred to as Clade C in our earlier work (Gupta 2009; Gupta and Mathews 2010) or Syn/Pro clade by Sanchez-Baracaldo et al. (2005). 
Previously, many CSIs and CSPs, which are specific for members of this clade, have been identified (Gupta 2009; Gupta and Mathews 2010). Besides these main clades, a number of smaller clades of cyanobacteria, which are comprised of between 3 and 5 species/strains, are also observed in the tree shown in Fig. 1 and in the work of Shih et al. (2013).
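
The notion of monophyly used in describing these clades can be made concrete with a small Python sketch. It assumes a rooted tree written as nested tuples with species names as leaves; the representation and the function names are illustrative assumptions and do not reflect how the trees in this work were analyzed.

```python
from typing import Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> Set[str]:
    """Return the set of leaf names under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    found: Set[str] = set()
    for child in tree:
        found |= leaves(child)
    return found


def is_monophyletic(tree: Tree, group: Set[str]) -> bool:
    """True if some clade of the rooted tree contains exactly the given species."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Toy tree with placeholder labels (H = heterocystous, B = a sister clade).
toy = (((("H1", "H2"), "H3"), ("B1", "B2")), "Out")
```

In this toy tree, `{"H1", "H2", "H3"}` forms a clade, whereas `{"H1", "B1"}` does not, which is the sense in which members of intermixed groups are not monophyletic.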

Since the focus of this work is on heterocystous cyanobacteria, a subtree for these bacteria and the two related clades excerpted from Fig. 1 is shown in Fig. 2. Within Clade 1A, comprising the heterocystous cyanobacteria, a number of distinct subclades are also phylogenetically resolved. The members of these subclades differ from each other in their morphological/developmental characteristics. One of these large subclades is comprised of the akinete-forming Nostocales species/strains. This subclade can be further divided into two smaller subclades depending upon whether their members differentiate into hormogonia or lack this ability. However, it should be mentioned that the ability to differentiate into hormogonia is more broadly distributed among the heterocyst-forming cyanobacteria (Rippka et al. 1979; Cavalier-Smith 2002).

Fig. 2

A summary diagram showing the distribution of identified CSIs and CSPs specific to heterocystous cyanobacteria and their immediate relatives. The identified clades supported by both phylogenetic studies and various molecular markers are indicated. Dots indicate nodes that are supported by bootstrap scores of at least 80. The superscripts N and S denote members belonging to the suggested orders Nostocales and Stigonematales, respectively

A phylogenetic tree was also constructed for the heterocystous cyanobacteria based on 16S rRNA gene sequences (Fig. 3). This tree includes representatives from different genera of heterocystous cyanobacteria. The branching of different species/strains in this tree is similar to that observed in earlier studies (Wilmotte and Herdman 2001; Gugger and Hoffmann 2004), with members from the orders Nostocales and Stigonematales (marked with the superscripts N and S, respectively) showing extensive intermixing. The cyanobacterial species/strains that are part of our protein tree are distributed throughout the rRNA tree, indicating that our dataset provides a good representation of the known heterocystous cyanobacteria. Importantly, the different subclades of heterocystous cyanobacteria, which are seen in the protein tree, are also observed in the rRNA tree. Thus, a subclade consisting of the akinete-forming cyanobacteria and the two groups within it, which differentiate into hormogonia or lack such ability, are also resolved, with only isolated exceptions.

Fig. 3

A maximum-likelihood consensus tree based on 16S rRNA gene sequences, representing at least one member from every genus of the orders Nostocales and Stigonematales (except Riveria) and rooted with Microcystis aeruginosa NIES-843. N and S denote members belonging to the Nostocales and Stigonematales groups, respectively. The boxed species indicate those which were also used in the creation of concatenated protein trees and for comparative genomic analyses

Conserved signature indels specific for Nostocales/Stigonematales

An important objective of this study was to identify molecular markers that can distinguish the heterocystous cyanobacteria from all other bacteria, or those which can provide insights into the origin and evolutionary relationships of these bacteria to other cyanobacteria. As noted earlier, CSIs and CSPs, which are restricted to a given group of related species, provide important molecular characteristics for evolutionary and taxonomic purposes (Baldauf and Palmer 1993; Delwiche et al. 1995; Gupta 1998, 2003; 2009; Rokas and Holland 2000). Recently, these markers have been used to define, in molecular terms, and to propose important taxonomic changes in a number of phyla of bacteria (viz. Spirochetes, Aquificae, Thermotogae, Chloroflexi) at multiple phylogenetic levels (Adeolu and Gupta 2013; Bhandari and Gupta 2014; Gupta and Lali 2013; Gupta et al. 2013; Naushad et al. 2014). We have previously identified several CSIs and CSPs that appeared specific, at that time, for either all cyanobacteria, or a number of their larger clades (Gupta 2009, 2010; Gupta and Mathews 2010). However, due to the paucity of sequence information for heterocystous cyanobacteria, no markers specific for these bacteria were identified at that time. As genome sequences are now available for large numbers of heterocystous cyanobacteria, comparative genomic analyses were undertaken to identify conserved signature indels (CSIs) that might be specifically present in this group of bacteria or provide information regarding their relationships to other cyanobacteria.

The present study has identified 15 CSIs in different proteins that are useful in this respect. Eight of these CSIs, which are present in proteins involved in diverse functions, are specific for all of the sequenced species/strains of heterocystous cyanobacteria (i.e., Clade 1A) and they are not found in the homologous proteins from any other bacteria. Two examples of the CSIs that are specific for the Clade 1A species are shown in Fig. 4. In the first example, a one amino acid insert is present in a highly conserved region of an XRE family transcription regulator (Fig. 4a), which plays an important role in the metabolism of toxic compounds (Saatcioglu et al. 1990). Another CSI showing similar specificity is shown in Fig. 4b, where a 7 aa insert in the protein all0200 (DUF111) is uniquely found in all heterocystous cyanobacteria. Sequence information for 6 other CSIs showing similar specificity is provided in Supplementary Figures S2–S7 and some of their characteristics are summarized in Table 1.

Fig. 4

Partial sequence alignments for (a) an XRE family protein, showing a 1 aa insertion, and (b) the protein all0200, showing a 7 aa insertion. Both insertions are specifically present in all heterocystous cyanobacterial members and are flanked by conserved regions. Dashes in the sequence alignments show amino acid identity with the amino acid indicated on the top line. Sequence information for a limited number of species from other cyanobacterial taxa is shown here, but the indicated CSIs were not found in any other cyanobacteria

Table 1 Conserved Signature Indels that are specific for the heterocystous cyanobacteria

In addition to the CSIs that are specific for all heterocystous cyanobacteria (Clade 1A), our analyses have also identified 5 CSIs, which are found in the Clade 1A as well as in members of the Clade 1B cyanobacteria, which forms the immediate outgroup of the Clade 1A in the phylogenetic tree (Fig. 1). A partial sequence alignment for one of these CSIs, containing a 5 amino acid insert in a highly conserved region of the 30S ribosomal protein S3, is shown in Fig. 5a. Sequence alignments for four other CSIs in different proteins, which also exhibit similar specificity, are provided in Supplementary Figures S8–S11 and some of their characteristics are summarized in Table 1. Two other CSIs, identified by our analysis, in addition to being commonly shared by Clade 1A and Clade 1B cyanobacteria, are also present in some or all members of the Clade 1C cyanobacteria (comprising Crinalium epipsammum PCC 9333 and Chamaesiphon minutus PCC 6605), which forms the outgroup of the Clades 1A and 1B in the phylogenetic tree (Fig. 1). Sequence information for one of these CSIs, consisting of a 7 aa insert in a conserved region of the pentapeptide repeat protein, is shown in Fig. 5b and some characteristics of these CSIs are also summarized in Table 1.

Fig. 5

(a) Partial sequence alignment of the 30S ribosomal protein S3 showing a 5 aa insert that is specifically present in all heterocystous cyanobacteria and Clade 1B cyanobacteria; (b) Sequence alignment of a pentapeptide repeat protein, showing a 7 aa insert in a conserved region that is commonly shared by all heterocystous cyanobacteria as well as Clade 1B and Clade 1C cyanobacteria. Additionally, this latter insert is also present in Geitlerinema sp. PCC 7407. Dashes in the sequence alignments indicate the presence of the same amino acid as shown on the top line. Sequence information for a limited number of species from other cyanobacterial groups is presented here

Signature proteins (CSPs) specific for the heterocystous cyanobacteria

Our earlier work on Nostoc sp. PCC 7120 identified a number of proteins which were specifically found in a small number of the sequenced Nostocales species/strains (Gupta and Mathews 2010; Gupta 2010). In view of the large numbers of genome sequences that are now available for heterocystous cyanobacteria, Blastp searches on the sequences of these proteins were repeated to determine whether any of them are specifically found in all heterocystous cyanobacteria. The results of these studies show that for three of the proteins, of unknown function, all significant Blast hits (apart from a few isolated exceptions noted in Table 2) are limited to members of the Clade 1A, comprising the heterocystous cyanobacteria. Due to the specific presence of these proteins in different sequenced members of the Clade 1A cyanobacteria, these CSPs appear to be distinctive characteristics of the heterocystous cyanobacteria. In addition to these three CSPs, our analysis has also identified one protein (accession number NP_485332), whose homologs are mainly limited to the akinete-forming cyanobacteria (Table 2), providing further evidence that these cyanobacteria form a distinct group within the heterocystous cyanobacteria.

Table 2 Conserved signature proteins specific for all Nostocales and Stigonematales members

Discussion

In this work, we have examined the evolutionary relationships among 140 genome-sequenced cyanobacteria, covering the diverse lineages of this phylum, based upon concatenated sequences for 32 conserved proteins. The tree constructed in this work is the most comprehensive tree for cyanobacteria that has been made to date based upon genomic sequence data. The evolutionary relationships among different cyanobacterial taxa seen in this work are similar to those observed by Shih et al. (2013) in another detailed study based upon a different dataset of protein sequences, and they also show good concordance with the relationships seen in the 16S rRNA trees (Wilmotte and Herdman 2001; Yarza et al. 2010; Shih et al. 2013). The cyanobacterial species form at least 8–9 distinct clades in these trees (Fig. 1), which could correspond to the higher taxonomic groupings (e.g., classes or orders) within the phylum Cyanobacteria. Some of these clades were also identified in our earlier work based upon a limited number of cyanobacteria (Gupta 2009; Gupta and Mathews 2010).

The main focus of this work was on heterocystous cyanobacteria. In the trees based on large datasets of concatenated proteins, these bacteria formed a strongly supported monophyletic clade, which is in accordance with their distinct branching in the 16S rRNA trees (Wilmotte and Herdman 2001; Gugger and Hoffmann 2004; Yarza et al. 2010). The monophyletic origin of the heterocystous cyanobacteria is also independently strongly supported by eight novel CSIs described here in different proteins, which are uniquely present in all of the heterocystous cyanobacteria (Table 1). The most parsimonious explanation to account for the presence of these CSIs is that the rare genetic changes responsible for them first occurred in a common ancestor of the Clade 1A and these changes were then vertically inherited by its different descendants (Gupta 2009). These CSIs provide novel genetic markers (synapomorphies) for the identification of heterocystous cyanobacteria in molecular terms.

The phylogenetic trees based on both protein sequences and the 16S rRNA gene sequences created in this study also show that the two previously suggested main groups within the heterocystous cyanobacteria (viz. sections IV and V or the orders Nostocales and Stigonematales) are not monophyletic and that the members of these groups exhibit extensive intermixing. The intermixing of the members of these two orders is also observed in phylogenetic trees based upon 16S rRNA as well as nifH and nifD gene sequences (Zehr et al. 1997; Wilmotte and Herdman 2001; Gugger and Hoffmann 2004; Henson et al. 2004; Singh et al. 2013). While the division of the heterocystous cyanobacteria into the two main groups, viz. Nostocales and Stigonematales, is not supported by available evidence, this work has identified a number of novel subclades within the heterocystous cyanobacteria (see Fig. 2). One of these subclades consists of the akinete-forming cyanobacteria (Figs. 2, 3). The distinctness of this subclade of heterocystous cyanobacteria is supported not only by phylogenetic analyses but also by a CSP identified in this work that is mainly limited to this group of cyanobacteria. Within the akinete-forming cyanobacteria, two smaller subclades are also distinguished by both phylogenetic means and by their shared morphological and developmental characteristics (Figs. 2, 3). The members of one of these subclades are capable of differentiating into distinct hormogonia, while the members of the other subclade lack this ability. Apart from these subclades, several deeper branching subclades that grouped the remainder of the heterocystous cyanobacteria are also observed.

The ability of the cells to differentiate into specialized cell types (viz. heterocysts) is found only within a defined lineage of cyanobacteria (Rippka et al. 1979; Castenholz 2001). To understand the origin of heterocyst-forming bacteria, it is of much interest to identify their closest relatives within the cyanobacteria. The results of our studies provide important insights in this respect. In the phylogenetic trees based upon concatenated protein sequences, the members of the Clade-1B cyanobacteria are found to be the immediate outgroup of the heterocyst-forming cyanobacteria (Clade-1A). A close relationship between these two groups was also observed in the tree by Shih et al. (2013). Additionally, another deeper branching clade (Clade 1C) consisting of two species/strains (viz. Crinalium epipsammum PCC 9333 and Chamaesiphon minutus PCC 6605) also exhibits a close relationship to the above two clades of cyanobacteria. Strong and independent evidence that the species from the Clade-1B are the closest relatives of the heterocyst-forming cyanobacteria is provided by our identification of five CSIs, in proteins involved in different functions, that are uniquely shared by the members of these two subclades (Clade-1A and 1B) of cyanobacteria. These results provide strong evidence that the Clade-1B cyanobacteria and the heterocyst-forming cyanobacteria shared a common ancestor exclusive of all other cyanobacteria. Additionally, a specific relationship of these two clades of cyanobacteria to the Clade-1C species/strains is also supported by phylogenetic analysis and by two of the identified CSIs. The members of the Clades-1B and 1C are presently very poorly characterized. However, as these are indicated to be the closest relatives of the heterocyst-forming cyanobacteria, it is important to determine what other genetic, physiological or morphological characteristics are commonly shared by the members of these clades and the heterocystous cyanobacteria.

This work has identified for the first time multiple molecular markers in the form of CSIs and CSPs that are uniquely shared by the heterocystous cyanobacteria or their closest relatives. The cellular functions of these molecular signatures are presently not known. However, due to the unique presence of these molecular characteristics in these specific lineages of cyanobacteria, it is likely that these genetic changes (or genes) are in some way linked to the unique morphological (viz. heterocyst formation) and associated biochemical (viz. nitrogen fixation) characteristics of these cyanobacteria. Earlier work on a number of CSIs and CSPs, which were specific for other groups, has shown that these molecular characteristics are essential for the groups where they are found and hence serve important functions in the particular groups of bacteria (Fang et al. 2005; Singh and Gupta 2009; Schoeffler et al. 2010). Therefore, studies on understanding the cellular functions of the CSPs and CSIs that are specific for the heterocystous cyanobacteria could provide important insights into the novel biochemical or structural aspects of these bacteria.