1 The General Organisation of Rhizobial Genomes

In the first decade of the twenty-first century, researchers working on rhizobia were greatly encouraged by the release of the complete genomes of model strains, including Mesorhizobium loti MAFF303099 (reclassified as M. japonicum) (Kaneko et al. 2000), Sinorhizobium meliloti 1021 (Galibert et al. 2001), Bradyrhizobium japonicum USDA110 (reclassified as B. diazoefficiens) (Kaneko et al. 2002), Rhizobium etli CFN42 (González et al. 2006), Rhizobium leguminosarum bv. viciae 3841 (Young et al. 2006), Bradyrhizobium sp. BTAi1 and ORS278 (Giraud et al. 2007), Azorhizobium caulinodans ORS571 (Lee et al. 2008), Cupriavidus taiwanensis LMG19424 (Amadou et al. 2008), Sinorhizobium sp. NGR234 (Schmeisser et al. 2009) and Sinorhizobium medicae WSM419 (Reeve et al. 2010). Notably, all of these genomes were sequenced using the Sanger platform, and these valuable early efforts provided essential information on the general features of rhizobial genomes. For example, symbiosis genes are densely clustered in a symbiosis island or a symbiosis plasmid, and genome organisation and gene content can vary drastically between species. These features have been further validated by more than 100 complete rhizobial genomes obtained later using next-generation sequencing platforms such as Illumina, Roche 454, Ion Torrent and PacBio. As shown in Fig. 4.1, genome size varies greatly among strains and species within each genus, suggesting diverse metabolic abilities of rhizobial germplasms.

Fig. 4.1

122 complete rhizobial genomes. Genomes retrieved on March 30, 2018, from BioProject as indicated are ordered according to the genome sizes. Bradyrhizobium, Rhizobium, Mesorhizobium and Sinorhizobium are coloured

1.1 Replicons: Chromosome, Chromid and Plasmid

1.1.1 Replicons

Although some archaea harbour a chromosome with a multiple-origin mode of replication (Lindås and Bernander 2013), all DNA molecules of bacterial genomes studied to date have a single replication origin. Consequently, the term “replicon” has been widely used to refer to such a DNA molecule in bacterial genomes (Harrison et al. 2010; diCenzo and Finan 2017). Among the 122 complete rhizobial genomes from 12 genera available as of March 2018 (Fig. 4.1), 107 genomes from 11 genera have two or more replicons (Fig. 4.2a), a feature of genome organisation described as a multipartite genome. The multipartite genome organisation is apparently overrepresented in rhizobia compared with an estimated 10% of bacterial genomes (diCenzo and Finan 2017).
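The degree of overrepresentation can be made explicit with a short calculation. The sketch below uses only the counts quoted above (107 of 122 genomes) and the 10% estimate for bacterial genomes; the variable names are illustrative.

```python
# Counts taken from the text: 122 complete rhizobial genomes,
# 107 of which have two or more replicons.
total_genomes = 122
multipartite_genomes = 107

# Estimated share of multipartite genomes among bacteria
# (diCenzo and Finan 2017), expressed as a fraction.
bacterial_estimate = 0.10

rhizobial_fraction = multipartite_genomes / total_genomes
fold_enrichment = rhizobial_fraction / bacterial_estimate

print(f"Multipartite rhizobial genomes: {rhizobial_fraction:.1%}")
print(f"Fold over the bacterial estimate: {fold_enrichment:.1f}x")
```

This is a descriptive comparison only; the set of complete rhizobial genomes is not a random sample, so the ratio should not be read as a formal enrichment statistic.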

Fig. 4.2

The number of replicons in rhizobial genomes. (a) Box plot of the number of replicons within individual complete genomes available in the corresponding genus. The number of analysed genomes is indicated in brackets. (b) Box plot of the genome size in rhizobia with different numbers of replicons. One hundred twenty-two rhizobial genomes were analysed

The multipartite genome architecture is scarce in Bradyrhizobium and not found in the genus Azorhizobium, from which only one complete genome is available (Fig. 4.2a). More replicons per genome do not necessarily lead to a larger genome size (Fig. 4.2b). Genome sizes of Bradyrhizobium strains are among the largest in rhizobia (Figs. 4.1 and 4.3), even though most Bradyrhizobium strains have only one replicon. In a recent global exploration of the soil microbiome, Bradyrhizobium was identified as one of the most ubiquitous phylotypes of bacteria (Delgado-Baquerizo et al. 2018). It would be interesting to investigate the relationship between the genome size and the adaptation ability of different Bradyrhizobium strains with a great variation in their genome sizes (such as from 7.23 to 10.48 Mb; Fig. 4.1). The size of individual replicons in a multipartite genome can be smaller than the single replicon found in Bradyrhizobium and Azorhizobium, and the total genome size of a rhizobial strain is generally above 4.5 Mb (Fig. 4.3). This number is several times larger than those of animal-associated obligate endosymbionts (Toft and Andersson 2010), which cannot be cultivated in the laboratory. It is also larger than the average and median bacterial genome sizes, 3.87 Mb and 3.65 Mb, respectively (diCenzo and Finan 2017).

Fig. 4.3

Box plot showing the size of rhizobial genomes. The numbers of analysed genomes are the same as those indicated in Fig. 4.2a for each genus

1.1.2 Chromosome, Chromid and Plasmid

The single replicon, or the largest replicon in a multipartite genome, is called a chromosome. In most multipartite genomes, informational genes such as rRNA genes, highly conserved housekeeping genes and most essential genes are located on the chromosome. However, the presence of a sole rRNA operon on a nonchromosomal replicon has been reported for the plant-associated α-proteobacterium Aureimonas sp. AU20 (Anda et al. 2015). In addition to the chromosome, the terms secondary chromosome, chromid, megaplasmid (above 350 kb in size) and plasmid have been proposed to classify the different replicons present in a multipartite genome (diCenzo and Finan 2017). Among them, secondary chromosome and chromid refer to a secondary replicon harbouring some essential genes (essential either under all conditions or only in particular environments) (Harrison et al. 2010; diCenzo and Finan 2017). The name “secondary chromosome” was used to indicate that the replicon resulted from a split of the ancestral chromosome into two (Fig. 4.4). Despite the increasing number of genomes in the database, convincing evidence for such a split event remains rare (diCenzo and Finan 2017) and is not easily obtained by researchers outside this specific field. On the other hand, nearly all secondary replicons with essential genes are considered to be chromids that evolved from plasmids (Harrison et al. 2010; diCenzo and Finan 2017). Megaplasmids (>350 kb) and plasmids refer to replicons that lack essential genes but are enriched with dispensable genes and are characterised by biased sequence features, such as lower GC content and biased codon usage, compared to the chromosome. By contrast, the sequence features of chromids, including GC content and codon usage, are similar to those of the chromosome (Harrison et al. 2010). The distinct characteristics of different replicons have been excellently reviewed by Harrison et al. (2010) and are listed in Table 4.1.
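The classification rules described above can be summarised as a small decision procedure. The following is a minimal sketch, not a published tool: the `Replicon` class, the `has_essential_genes` flag and the example replicons are hypothetical, and only the 350 kb cut-off is taken from the text. Secondary chromosomes are not modelled, because distinguishing them from chromids requires evidence of an ancestral chromosome split.

```python
from dataclasses import dataclass

MEGAPLASMID_MIN_KB = 350  # size cut-off quoted in the text


@dataclass
class Replicon:
    name: str
    size_kb: float
    has_essential_genes: bool  # hypothetical annotation flag


def classify_replicons(replicons):
    """Return {name: class} following the rules summarised in the text."""
    if not replicons:
        return {}
    # The single or largest replicon is treated as the chromosome.
    largest = max(replicons, key=lambda r: r.size_kb)
    labels = {}
    for r in replicons:
        if r is largest:
            labels[r.name] = "chromosome"
        elif r.has_essential_genes:
            # Assumption: every secondary replicon with essential genes
            # is called a chromid (secondary chromosomes are not modelled).
            labels[r.name] = "chromid"
        elif r.size_kb > MEGAPLASMID_MIN_KB:
            labels[r.name] = "megaplasmid"
        else:
            labels[r.name] = "plasmid"
    return labels


# Hypothetical genome used purely for illustration.
example = [
    Replicon("rep1", 4000, True),
    Replicon("rep2", 2000, True),
    Replicon("rep3", 420, False),
    Replicon("rep4", 150, False),
]
example_labels = classify_replicons(example)
print(example_labels)
```

In practice, sequence features such as GC content and codon usage (Table 4.1) would also be weighed, since the presence of essential genes is often not known with certainty.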

Fig. 4.4

Classification of bacterial replicons. A single replicon genome (left) and a multipartite genome (right) are shown. Different replicons in a multipartite genome are ordered according to their sizes from left to right, but the diameter of each replicon is not scaled to actual size (e.g. >350 kb for megaplasmid). The same colour of second chromosome and the primary chromosome indicates that two replicons result from a split of the ancestor chromosome into two. Essential genes are harboured by the corresponding replicon as indicated. Notably, a chromid smaller than megaplasmids has been reported in a non-rhizobial strain Deinococcus deserti VCD115. (Harrison et al. 2010)

Table 4.1 General features of different replicons

1.2 Symbiosis Plasmid and Symbiosis Island

Despite the great diversity of alpha- and beta-rhizobia (Peix et al. 2015), most rhizobia have a cluster of key symbiosis genes, including nod, nif and fix, localised within a symbiosis island in the chromosome or in the symbiosis plasmid. These genes are specifically involved in nodulation and nitrogen fixation processes during the symbiosis with compatible legumes. Deletion of the key nodulation genes usually leads to a complete loss of symbiotic ability of rhizobia associated with either specific (such as Medicago sativa) or promiscuous (such as Sophora flavescens) legume hosts (Marvel 1985, 1987; Horvath et al. 1986; Liu et al. 2018). Some Bradyrhizobium strains do not require canonical nod genes and typical lipochitooligosaccharidic Nod factors for symbiosis with certain Aeschynomene species (Giraud et al. 2007; Miche et al. 2010); nevertheless the nif and fix genes are clustered in a 45-kb island in their genomes (Giraud et al. 2007).

As expected from Table 4.1, symbiosis plasmids have a lower GC content than the chromosome in a multipartite rhizobial genome. Here we take Sinorhizobium strains associated with soybeans as examples (Fig. 4.5a, b). S. fredii CCBAU45436 is an epidemic and efficient soybean microsymbiont in alkaline soils (Zhang et al. 2011; Tian et al. 2012). Five replicons were identified in its multipartite genome (Jiao et al. 2018): a chromosome (cSF45436), a chromid (pSF45436b), a symbiosis plasmid (pSF45436a) and two accessory plasmids (pSF45436d and pSF45436e) (Fig. 4.5a). The symbiosis plasmid pSF45436a is a megaplasmid of 0.42 Mb, around 10% and 20% of the size of the chromosome (cSF45436) and the chromid (pSF45436b), respectively (Fig. 4.5a). Its GC% (59.9%) is at least 3% lower than those of the chromid and chromosome (Fig. 4.5a). Although the size of the symbiosis plasmid varies among S. fredii strains nodulating soybeans (0.40–0.74 Mb in CCBAU45436, CCBAU25509 and CCBAU83666), the average GC% varies little among the symbiosis plasmids of Sinorhizobium spp. nodulating soybeans (Fig. 4.5b). By contrast, the GC% of the chromid is only slightly (0.5%), though significantly, lower than that of the chromosome (Fig. 4.5b). Another notable feature of the symbiosis plasmid in these Sinorhizobium strains is its enrichment of insertion sequences (ISs), particularly high-copy ones, compared to the chromosome and chromid (Zhao et al. 2018). Although transposable elements were once considered junk or selfish components of genomes, accumulating evidence supports their critical roles in the evolution of both eukaryotes and prokaryotes (Biémont 2010). 
A recent experimental evolution study demonstrated that insertion mutation of type three secretion system (T3SS) genes by parallel transpositions of ISs, enriched on the same symbiosis plasmid, is the major mutagenesis mechanism during adaptive evolution of symbiotic compatibility of Sinorhizobium associated with soybeans (Zhao et al. 2018). It should be noted, however, that the symbiosis plasmid is not essential for the free-living stage of its rhizobial host, as experimentally demonstrated in S. meliloti (diCenzo et al. 2014, 2018). Transcriptomics analyses recurrently show that most genes on the symbiosis plasmid of diverse rhizobia are specifically induced during nodulation and nitrogen fixation, but not under free-living conditions lacking a compatible host or its symbiotic signal molecules (Ampe et al. 2003; Capela et al. 2006; Vercruysse et al. 2011; Li et al. 2013; Jiao et al. 2016).

Fig. 4.5

Representative symbiosis plasmid and symbiosis island. (a) Five replicons including the symbiosis plasmid pSF45436a in the genome of Sinorhizobium fredii CCBAU45436 nodulating soybeans. (b) Average GC% of three major replicons in soybean microsymbionts belonging to Sinorhizobium (S. fredii CCBAU45436, S. fredii CCBAU25509, S. fredii CCBAU83666, S. sojae CCBAU05684, Sinorhizobium sp. CCBAU05631). Significant GC% difference of chromid or symbiosis plasmid compared to that of chromosome is shown (T-test; ∗, p < 0.05; ∗∗∗, p < 0.001). (c) The genome of Bradyrhizobium diazoefficiens USDA 110 nodulating soybeans. The size and GC% of the symbiosis island are indicated. (a and c) GC content (black ring) and GC skew (the ring in green and purple) are shown. The genome size of USDA 110 in (c) is at a scale of one third of the CCBAU45436 genome in (a). A window size of 10,000 and a step of 100 were used in GC content and GC skew analyses for USDA 110, cSF45436 and pSF45436b, while a size of 1000 and a step of 10 were used for pSF45436a, pSF45436d and pSF45436e
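The sliding-window analysis mentioned in the caption can be illustrated as follows. This is a minimal sketch rather than the software used to draw Fig. 4.5: it assumes the conventional definition of GC skew, (G − C)/(G + C), treats the sequence as linear (no wrap-around for circular replicons), and the function name `gc_windows` is hypothetical. The default window and step match the values given for the larger replicons.

```python
def gc_windows(seq, window=10_000, step=100):
    """Yield (start, gc_content, gc_skew) for each full window of seq.

    gc_content = (G + C) / window length
    gc_skew    = (G - C) / (G + C), assumed conventional definition
    """
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_content = (g + c) / window
        # Avoid division by zero in windows without any G or C.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, gc_content, gc_skew


# Toy sequence with a small window, purely for illustration.
toy_result = list(gc_windows("GGCCAATT", window=4, step=2))
for start, gc, skew in toy_result:
    print(start, gc, skew)
```

For the smaller plasmids, the caption's settings would correspond to `gc_windows(seq, window=1000, step=10)`. Recounting every window is simple but not optimised; a running count would be faster for large replicons.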

In rhizobia with a single replicon and some rhizobia (such as certain Mesorhizobium strains) with multiple replicons, key symbiosis genes are found on the chromosome. As shown in the genome of Bradyrhizobium diazoefficiens USDA 110 (Fig. 4.5c), a genomic island of 681 kb in length is characterised by a lower GC% (59.4%) than the genomic average (64.4%). This island contains the key symbiosis genes nod, nif and fix (Göttfert et al. 2001; Kaneko et al. 2002) and many uncharacterised genes, which are highly transcribed in soybean nodules (Pessi et al. 2007). Similarly, a genomic island of 611 kb containing nod/nif genes was identified on the chromosome of Mesorhizobium japonicum MAFF303099, which harbours two more replicons (plasmids) (Kaneko et al. 2000). Consequently, “symbiosis island” has been used to refer to this kind of genomic island (Sullivan and Ronson 1998). As in symbiosis plasmids, ISs are overrepresented in symbiosis islands: 60% of the ISs of B. diazoefficiens USDA 110 are localised in this island (Kaneko et al. 2002). The symbiosis island of M. japonicum is also characterised by its enrichment of transposable elements compared to the chromosome background and the two plasmids (Kaneko et al. 2000).

More than 20 years ago, it was demonstrated that the symbiosis island of Mesorhizobium loti can be transferred into non-symbiotic mesorhizobia under field and lab conditions and integrated into a phe-tRNA gene (Sullivan et al. 1995; Sullivan and Ronson 1998). Recently, Ling et al. provided evidence that the symbiosis island of Azorhizobium caulinodans is an integrative and conjugative element that can be transferred to a specific site in a gly-tRNA gene of other rhizobial genera (Ling et al. 2016). Moreover, the horizontal transfer frequency of this symbiosis island increased in the legume rhizosphere or in the presence of plant flavonoids (Ling et al. 2016), highlighting an intriguing host-dependent evolutionary scenario of rhizobia. As shown in Table 4.2, one or two conserved met-tRNA gene(s) can be identified in symbiosis plasmids but not other extrachromosomal replicons of Sinorhizobium strains nodulating soybeans. These data imply that integration into a tRNA gene may have played an important role in the horizontal transfer of symbiosis genes in many rhizobial genera in the long run. It is noteworthy that the symbiosis plasmid itself can be subject to conjugative transfer, as demonstrated in Rhizobium and Sinorhizobium (Danino et al. 2003; Perez-Mendoza et al. 2004, 2005). This is in line with the finding that extremely similar symbiosis plasmids were found in different Rhizobium species associated with common bean (Perez Carrascal et al. 2016). If we look at the alignment of symbiosis plasmids from Sinorhizobium strains associated with soybeans (Fig. 4.6), a similar conclusion can be drawn for certain S. fredii and S. sojae strains (CCBAU45436, CCBAU25509 and CCBAU05684). Although highly conserved locally collinear blocks can also be found in S. fredii CCBAU83666 and Sinorhizobium sp. CCBAU05631, extensive rearrangement and the presence of other accessory sequences can be found in symbiosis plasmids of these two strains (Fig. 4.6).

Table 4.2 Distribution of tRNA genes in multipartite genomes of Sinorhizobium nodulating soybeans

Fig. 4.6

Progressive Mauve alignment of symbiosis plasmids from Sinorhizobium microsymbionts of soybean. From the first to fifth row: S. sojae CCBAU05684, S. fredii CCBAU25509, S. fredii CCBAU45436, S. fredii CCBAU83666 and Sinorhizobium sp. CCBAU05631. Locally collinear blocks conserved between different strains are indicated in the same colour and connected

2 Evolution of Core and Accessory Genes

2.1 Characteristics of Core and Accessory Genes

As shown in the previous section (Sect. 4.1), rhizobial genome size varies at the Mb scale both within species and within genera. If we simply take 1 kb as the average length of a gene, the difference in gene number between strains can be up to several thousand. This phenomenon is widespread in prokaryotes. In 2005, the term “pan-genome” (“pan”, from the Greek “παν”, means “whole”) was introduced to refer to the gene repertoire accessible to any given species (Medini et al. 2005; Tettelin et al. 2005). The pan-genome is composed of a “core genome” containing genes present in all strains and a “dispensable genome” (also called accessory, flexible or adaptive) with genes present in a subset of strains (Medini et al. 2005) (Fig. 4.7). The dispensable genome can be further divided into two elements: genes shared by some but not all strains (named “accessory” genes in some publications to distinguish them from “core” and “unique” elements) and genes unique to each strain (Medini et al. 2005, 2008; Rouli et al. 2015) (Fig. 4.7). Although the species is usually considered to be an evolutionary unit, the pan-genome concept has been extended to higher taxonomic units (Lapierre and Gogarten 2009). This is biologically meaningful, since accessory gene functions may provide adaptive advantages for their host cells in a specific niche, and the pan-genome analysis of different species inhabiting the same niche can provide novel insight into the evolutionary mechanisms underlying their adaptation and competition. For example, S. sojae CCBAU05684, Sinorhizobium sp. CCBAU05631 and S. fredii CCBAU45436 share certain wild soybean hosts (Li et al. 2011; Zhang et al. 2011; Liu et al. 2017; Zhao et al. 2018). A pan-genome analysis followed by reverse genetics has revealed that an accessory gene cluster present in CCBAU45436 and CCBAU05631 but absent in CCBAU05684 is essential for effective symbiosis of its host strains (Liu et al. 2017).

Fig. 4.7

A schematic diagram illustrating the partition of a pan-genome for n strains of a given taxonomic unit. +, present
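The partition illustrated in Fig. 4.7 can be expressed with simple set operations. The sketch below assumes that gene families have already been clustered across strains, so that each strain is represented by a set of family identifiers; the function name and the toy data are hypothetical.

```python
from collections import Counter


def partition_pan_genome(strain_genes):
    """Split a pan-genome into core, accessory and unique gene families.

    strain_genes: {strain name: set of gene-family identifiers}
    core      = families present in all n strains
    unique    = families present in exactly one strain (when n > 1)
    accessory = families present in some but not all strains, excluding unique
    """
    n = len(strain_genes)
    # Count in how many strains each gene family occurs.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    core = {g for g, k in counts.items() if k == n}
    unique = {g for g, k in counts.items() if k == 1 and n > 1}
    accessory = set(counts) - core - unique
    return core, accessory, unique


# Toy presence/absence data for three hypothetical strains.
toy_strains = {
    "strainA": {"a", "b", "c"},
    "strainB": {"a", "b", "d"},
    "strainC": {"a", "e"},
}
core, accessory, unique = partition_pan_genome(toy_strains)
print(sorted(core), sorted(accessory), sorted(unique))
```

The dispensable genome in the sense of Medini et al. (2005) corresponds to the union of the accessory and unique sets returned here.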

It has been estimated that the pan-genome of the bacterial domain is of infinite size, i.e. the number of new genes grows indefinitely with the number of sequenced strains, likely owing to the numerous niches on earth (Lapierre and Gogarten 2009; McInerney et al. 2017). The term “open” pan-genome is used to refer to this pattern (Medini et al. 2005). By contrast, if the size of a pan-genome quickly saturates to a limiting value, a “closed” pan-genome can be proposed (Medini et al. 2005). A closed pan-genome has been reported for species living in isolated niches with limited access to the global microbial gene pool, such as Bacillus anthracis, Mycobacterium tuberculosis and Chlamydia trachomatis (Medini et al. 2005). As facultative microsymbionts, rhizobia are expected to have a large pan-genome to cope with fluctuating biotic and abiotic stimuli in soils and during symbiosis with legumes. Indeed, rhizobia such as the model species Sinorhizobium meliloti associated with Medicago and S. fredii nodulating soybeans have a typical open pan-genome (Tian et al. 2012; Galardini et al. 2013). The same conclusion can be drawn for species belonging to Rhizobium, Mesorhizobium and Bradyrhizobium (Tian et al. 2012; Kumar et al. 2015; Perez Carrascal et al. 2016; Porter et al. 2017).
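Whether a pan-genome appears open or closed is commonly judged from an accumulation curve, in which the pan-genome size is recorded as strains are added one at a time. The sketch below is one simple way to compute such a curve, averaging over random orders of strain addition; it is not the procedure of any study cited here, and the function name, the number of permutations and the toy data are assumptions.

```python
import random


def pan_genome_curve(strain_genes, permutations=100, seed=0):
    """Mean pan-genome size after adding 1..n strains in random orders.

    strain_genes: {strain name: set of gene-family identifiers}
    Returns a list whose i-th entry is the mean number of distinct gene
    families observed after i + 1 strains have been added.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    strains = list(strain_genes)
    totals = [0] * len(strains)
    for _ in range(permutations):
        rng.shuffle(strains)
        seen = set()
        for i, strain in enumerate(strains):
            seen |= strain_genes[strain]
            totals[i] += len(seen)
    return [t / permutations for t in totals]


# Toy presence/absence data for three hypothetical strains.
toy_strains = {
    "strainA": {"a", "b", "c"},
    "strainB": {"a", "b", "d"},
    "strainC": {"a", "e"},
}
curve = pan_genome_curve(toy_strains)
print(curve)
```

A curve that keeps rising as strains are added is consistent with an open pan-genome, whereas a curve that plateaus suggests a closed one; with real data, many more strains and a fitted model would be needed before drawing either conclusion.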

A genome-wide average nucleotide identity (ANI) value of 95% has been widely used to determine whether two prokaryotic strains can be considered to belong to the same species (Richter and Rossello-Mora 2009), and a discontinuity in ANI space is observed around this boundary (Konstantinidis and Tiedje 2005; Richter and Rossello-Mora 2009). This gap in sequence space has also been reported in several independent analyses of rhizobia using either a fixed number of shared core genes or a genome-scale alignment (Tian et al. 2012; Zhang et al. 2012; Kumar et al. 2015). Therefore, it is established that the core genome determines the taxonomy of rhizobia, as for other prokaryotes (Ormeño-Orrillo et al. 2015). By contrast, the representative features used in polyphasic taxonomy in pre-genomics studies capture only a tiny fraction of the inter-species variation, and it is not uncommon for these features to vary at the intraspecies level as well (Ormeño-Orrillo and Martínez-Romero 2013; Kumar et al. 2015; Vernikos et al. 2015; Young 2016), thus blurring the species boundary. Comparative genomics of rhizobia from 8 genera suggested that the phyletic distribution of 887 functional genes with experimental evidence can reflect the species phylogeny of the test strains, while the distribution of the whole pan-genome could not (Tian et al. 2012). This highlights that accessory genes in the open pan-genome of rhizobia are differentially integrated with the genome backgrounds of individual species. As typical accessory genes, key nodulation and nitrogen fixation genes within symbiosis islands or symbiosis plasmids of rhizobia determine the symbiovar, and hence the corresponding legume host, rather than the bacterial species assignment (Rogel et al. 2011). These key symbiosis genes provide an adaptive advantage for rhizobia in the presence of compatible legumes, while many other accessory genes can be adaptive in diverse niches in soils. 
For example, in contrast to Sinorhizobium, Bradyrhizobium strains are enriched with accessory genes involved in secondary metabolism, which may explain the high global abundance of Bradyrhizobium in soils (Tian et al. 2012; Delgado-Baquerizo et al. 2018).

2.2 Main Evolutionary Forces Shaping the Diversity of Core and Accessory Genes

It is estimated that the divergence of rhizobial genera predates the origin of legumes (Turner and Young 2000), and transferable accessory symbiosis genes can be considered “microsymbionts” that have spread across diverse bacteria (Remigi et al. 2016). That is to say, these symbiosis genes have succeeded, as judged by their wide phyletic distribution in at least two bacterial orders, by improving the adaptation of their host strains. This view has largely dominated the evolutionary study of rhizobia in past decades.

With the burst of new rhizobial species being documented in the literature and the development of sequencing technology, our knowledge of rhizobial core genes has been extended from information on the 16S rRNA gene and a few housekeeping genes (such as atpD, glnII, recA and rpoB) to hundreds or even thousands of core genes. It is notable that both intragenic and intergenic recombination, in addition to point mutation, have played a substantial role in creating the observed diversity of chromosomal housekeeping genes in rhizobial species such as Bradyrhizobium canariense, B. japonicum, B. elkanii, B. liaoningense, B. yuanmingense, B. diazoefficiens, Rhizobium gallicum sensu lato, Rhizobium leguminosarum bv. viciae and Sinorhizobium fredii (Vinuesa et al. 2005, 2008; Silva et al. 2005; Tian et al. 2010; Zhang et al. 2014; Guo et al. 2014). This view has been further verified by comparing individual core gene trees with the species tree based on 295 core genes in alpha- and beta-rhizobia (Tian et al. 2012). Around 90% of these core genes have undergone horizontal gene transfer or intergenic recombination, and only 20 out of 295 genes in the test strains were free of either inter- or intragenic recombination (Tian et al. 2012). Therefore, strict vertical evolution is rare in rhizobial chromosomal core genes.

The multipartite architecture of many rhizobial genomes (Fig. 4.2 and Table 4.1) provides a unique opportunity to investigate the evolution of core and accessory genes. Extrachromosomal replicons thought to be essential for the saprophytic lifestyle in soils and rhizospheres usually show higher rates of recombination than the chromosomes, as demonstrated in Rhizobium and Sinorhizobium (Bailly et al. 2011; Guo et al. 2014; Perez Carrascal et al. 2016). The chromid of Sinorhizobium species such as S. meliloti and S. fredii is characterised by its distinct role in intraspecies differentiation and its enrichment with accessory genes (Galardini et al. 2013; Jiao et al. 2018). Moreover, the chromid is a hot spot for positively selected genes such as those involved in the synthesis of polysaccharides (Bailly et al. 2011; Galardini et al. 2013), which can influence diverse traits including host range and phage tolerance (Campbell et al. 2003; Parada et al. 2006; Staehelin et al. 2006; Müller et al. 2009; López-Baena et al. 2016). Horizontal gene transfer has a greater effect on the gene content of symbiosis plasmids/islands than on that of chromids or chromosomes (Bailly et al. 2011; Zhang et al. 2014; Guo et al. 2014; Kumar et al. 2015; Perez Carrascal et al. 2016). Symbiosis plasmids are more prone to share a gene pool with accessory plasmids, as reported in S. meliloti strains (Nelson et al. 2018). A low frequency of horizontal gene transfer on chromosomes does not mean that it is absent. Although accessory genes can be interspersed throughout the chromosome, most are concentrated in flexible genomic islands (fGIs) (Rodriguez-Valera and Lo 2016). This phenomenon can be clearly seen in the example of S. fredii strains (Fig. 4.8), in which several fGIs are present within locally collinear blocks. These fGIs may contribute to intraspecies variation and increase the adaptation potential of populations. For example, an accessory operon encoding a multidrug efflux system in S. fredii CCBAU45436 is located within an fGI on the chromosome (indicated in Fig. 4.8) and is essential for efficient symbiosis of CCBAU45436 with soybeans (Jiao et al. 2018).

Fig. 4.8

Progressive Mauve alignment of chromosomes from soybean microsymbionts of Sinorhizobium fredii. From the first to third row: S. fredii CCBAU45436, CCBAU25509 and CCBAU83666. Locally collinear blocks conserved between different strains are indicated in the same colour and connected. ∗ indicates a flexible genome island containing genes encoding a multidrug efflux system that is essential for effective symbiosis of CCBAU45436 with soybeans. (Jiao et al. 2018)