Introduction

The genus Corynebacterium includes a number of pathogenic bacterial species of importance for human and animal health (Bernard 2012). The best-studied pathogen in this genus, Corynebacterium diphtheriae, is the main causative agent of diphtheria, a potentially fatal infection of humans. Although the incidence of classical, toxin-mediated diphtheria has dropped considerably due to widespread vaccination, continued surveillance is needed because new cases are still being reported in various regions (Zakikhany and Efstratiou 2012; Forbes 2017). There are also several recent reports of outbreaks of cutaneous diphtheria, a highly contagious skin manifestation of infection, including in developed countries (Meinel et al. 2016; Reynolds et al. 2016; Belchior et al. 2017; Kolios et al. 2017). Moreover, the emergence of toxigenic strains of Corynebacterium ulcerans in cases of diphtheria-like illness and of non-toxigenic strains of C. diphtheriae as causative agents of severe invasive infections has been a matter of concern worldwide (Edwards et al. 2011; Viguetti et al. 2012; Zasada 2013; Hacker et al. 2016).

Similarly, novel cases of severe infections caused by non-diphtherial Corynebacterium spp. are increasingly being registered, involving the emerging pathogenic species Corynebacterium striatum, Corynebacterium amycolatum, Corynebacterium minutissimum, and, less frequently, Corynebacterium xerosis (Martins et al. 2009; Hahn et al. 2016; Chandran et al. 2016). These so-called diphtheroids can be found as normal constituents of the human microbiota, but their clinical significance has been demonstrated by various studies (Martins et al. 2009; Chandran et al. 2016; Hahn et al. 2016; Leal et al. 2016). Particularly, rapid acquisition of a multidrug resistant phenotype has been shown for the species C. striatum, and this justifies the need for rapid and accurate identification of diphtheroids to the species level (Hahn et al. 2016; Ajmal et al. 2017).

Traditional phenotypic identification methods based on batteries of biochemical reactions, such as the API Coryne system (bioMérieux) or the RapID CB Plus Kit (Thermo Fisher), are the most widely employed to identify corynebacterial isolates to the species level (Bernard 2012; Bernard and Funke 2012). These tests are considered efficient for identification of the species C. diphtheriae (Zakikhany and Efstratiou 2012), though some biochemical variability can be observed between isolates of different biovars, especially in the tests for nitrate reduction and utilization of sucrose, glycogen, maltose, and starch (Hall et al. 2010; Bernard and Funke 2012; Sangal et al. 2014). Notably, recent genomic studies have challenged the usefulness of classifying C. diphtheriae isolates into biovars on the basis of biochemical tests, due to extensive genomic diversity within the species (Sangal and Hoskisson 2016). Additionally, the effectiveness of biochemical tests for unambiguous identification of some diphtheroids varies considerably (reported range = 41.7–90.5% success), and additional tests might be required (Funke et al. 1997; Renaud et al. 1998; Almuzara et al. 2006; Adderson et al. 2008; Alibi et al. 2015). Confounding biochemical identification profiles are commonly reported for the species C. striatum, C. amycolatum, C. minutissimum, and C. xerosis (Funke et al. 1996; Renaud et al. 1998; Adderson et al. 2008; Palacios et al. 2010). Palacios et al. (2010) reported five different biochemical profiles among the C. amycolatum isolates studied, as well as difficulties in identifying C. xerosis due to variability in the tests for nitrate reduction and alpha-glucosidase. Biochemical variability among isolates of these four diphtheroids also frequently involves the utilization of sucrose, maltose, ribose, trehalose, and galactose (Funke et al. 1996; Renaud et al. 1998; Palacios et al. 2010; Bernard and Funke 2012).

To identify the genomic basis of the biochemical variability observed in phenotypic identification of isolates of these emerging and reemerging species of the genus Corynebacterium, we combined a comprehensive literature review with a bioinformatics approach based on reconstruction of specific biochemical reactions/pathways in 33 recently released whole genome sequences. We used data retrieved from curated databases (MetaCyc, PathoSystems Resource Integration Center (PATRIC), The SEED, TransportDB, UniProtKB), together with homology searches by BLAST and profile Hidden Markov Models (HMMs), to detect enzymes participating in the various pathways, and performed ab initio protein structure modeling and molecular docking to confirm specific results. Our data demonstrate the usefulness of an integrated approach (literature review, homology searches, structural analysis) for successful prediction of phenotypes from whole genome sequences. Notably, the level of genome assembly (draft or complete) and the quality of annotation significantly impacted phenotypic predictions.

Materials and methods

Whole genome sequences and genome re-annotation

Table 1 lists all genomic sequences used in this study. In total, 33 whole genome sequences were retrieved from NCBI’s Genome Database: 14 C. diphtheriae strains, 2 C. ulcerans strains, 10 C. striatum strains, 2 C. xerosis strains, 3 C. minutissimum strains, and 2 C. amycolatum strains. To standardize genomic annotations, all genome sequences were re-annotated using RASTtk (Brettin et al. 2015), through the fully automated genome annotation functionality of PATRIC (Wattam et al. 2017). When required, divergences between annotations were detected through comparative analysis using Artemis software (Rutherford et al. 2000) and NCBI’s Sequence Viewer application (http://www.ncbi.nlm.nih.gov/projects/sviewer/). Three sets of annotations were compared: those from the International Nucleotide Sequence Database Collaboration (INSDC), which are generated by the submitting authors; those from the NCBI Reference Sequence Database (RefSeq), which are generated automatically by NCBI’s Prokaryotic Genome Annotation Pipeline; and those newly generated with RASTtk.

Table 1 Genomic sequences used in this study

Literature review and identification of variable biochemical pathways

The target metabolic pathways were identified through literature review of studies aimed at evaluating the reliability of various biochemical reactions for differential identification of the six Corynebacterium species studied: C. diphtheriae, C. ulcerans, C. striatum, C. amycolatum, C. minutissimum, and C. xerosis (Martinez-Martinez et al. 1995; Funke et al. 1996, 1997; Renaud et al. 1998, 2001; Wauters et al. 1998; Almuzara et al. 2006; Letek et al. 2006; Adderson et al. 2008; Funke and Frodl 2008; Palacios et al. 2010; Bernard 2012; Bernard and Funke 2012; Torres et al. 2013). Reactions that are commonly used in phenotypic identification methods and that presented conflicting results between the various studies were selected: (i) nitrate reduction and utilization of (ii) sucrose, (iii) maltose, (iv) glycogen, (v) galactose, and (vi) ribose (Table 2). These biochemical reactions are part of the most traditionally used commercial identification kits, such as the API Coryne system (bioMérieux) (Table 2).

Table 2 Common biochemical profiles of isolates of C. diphtheriae, C. ulcerans, and C. xerosis/C. striatum/C. minutissimum/C. amycolatum (XSMA), according to literature review

Retrieval of enzyme commission numbers for each pathway

The general workflow of the bioinformatics analysis is shown in Fig. 1. The various enzymatic reactions that compose each of the six variable biochemical pathways were initially retrieved from the MetaCyc database (www.metacyc.org). The following pathways were selected: sucrose degradation I (sucrose phosphotransferase), maltose degradation, ribose degradation, d-galactose degradation I, glycogen degradation I, and nitrate reductase. The enzymatic reactions composing each curated pathway were retrieved through the associated enzyme commission (EC) numbers, using the reference bacterial species available in the database. For comparison with MetaCyc, the manually curated subsystems in the SEED (Overbeek et al. 2005) that correspond to each of the biochemical pathways were also retrieved, this time choosing phylogenetically close bacterial species. Additionally, automatic metabolic reconstructions for each of the studied genomes were retrieved from PATRIC v.3.3.10 (www.patricbrc.org/), using the Comparative Pathway tool.

Fig. 1 General workflow of the bioinformatics analysis

Ortholog search

The EC numbers that compose each of the studied biochemical pathways were used to retrieve protein sequences from UniProtKB (www.uniprot.org); sequences from phylogenetically close organisms were prioritized. A local nucleotide database was created from the 33 whole genome sequences using the makeblastdb tool, available in NCBI’s BLAST+ suite. Similarity searches for orthologs were done with the stand-alone version of tBLASTn, v2.6.0 (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Cutoff values were as follows: E value ≤ 10E-4, query cover ≥ 70%, and identity ≥ 30%. When required, ortholog searches were performed using profile Hidden Markov Models, with jackHMMER (Finn et al. 2015).
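The three cutoffs above can be applied to tabular search results with a short filter. The following is a minimal sketch in Python, not the authors' script: it assumes hits were exported as tab-separated text with the columns qseqid, sseqid, pident, evalue, and qcovs in that order (the paper does not state the output format), and it reads the E value cutoff as 1e-4. The names `Hit`, `parse_hits`, and `filter_hits` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str          # protein query (for example a UniProtKB entry)
    subject: str        # genome sequence that was hit
    identity: float     # percent identity
    evalue: float       # expectation value
    query_cover: float  # percent of the query covered by the alignment


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated rows; column order is an assumption (see lead-in)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        q, s, pident, evalue, qcov = line.split("\t")[:5]
        hits.append(Hit(q, s, float(pident), float(evalue), float(qcov)))
    return hits


def filter_hits(hits: Iterable[Hit],
                max_evalue: float = 1e-4,    # assumed reading of the stated cutoff
                min_cover: float = 70.0,     # query cover >= 70%
                min_identity: float = 30.0   # identity >= 30%
                ) -> List[Hit]:
    """Keep only hits that satisfy all three cutoffs."""
    return [h for h in hits
            if h.evalue <= max_evalue
            and h.query_cover >= min_cover
            and h.identity >= min_identity]
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the gene presence calls are to each cutoff.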

Literature-based curation of carbohydrate utilization pathways and mapping of transporters

To support automatic predictions of metabolic pathways, a literature review of experimentally confirmed carbohydrate utilization pathways in Corynebacterium glutamicum, a biotechnologically important species, was performed to identify genes demonstrated to participate in the metabolism of hexoses (glucose and fructose), pentoses (d-ribose, l-arabinose, d-xylose), disaccharides (sucrose and maltose), and other relevant carbohydrates. Information was also retrieved on carbohydrate transport in C. glutamicum using phosphoenolpyruvate-dependent sugar phosphotransferase (PTS) and non-PTS, ATP-binding cassette (ABC)-type, systems (Blombach and Seibold 2010; Ikeda 2012). Additionally, sugar transport systems were predicted in the whole genome sequences using the TransportDB 2.0 relational database (Elbourne et al. 2017).

Prediction of genomic islands, genomic context analysis, and pan-genome comparisons

Metabolic genomic islands were predicted in the studied genomes using GIPSy software (Soares et al. 2016). The genomic context of the genes belonging to the various biochemical pathways was analyzed with the SyntTax tool (http://archaea.u-psud.fr/SyntTax/), using default parameters (Oberto 2013). A pan-genome analysis was performed with re-annotated genome sequences using the Bacterial Pan-Genome Analysis (BPGA) pipeline (Chaudhari et al. 2016), to identify shared proteins and genome-specific proteins.

Protein structural analysis

A structural approach was used to aid prediction of the functionality of the mapped PTS-type carbohydrate transporters and of strain-restricted proteins. Ab initio 3D protein modeling was performed with the RaptorX (Källberg et al. 2012) and Robetta (Song et al. 2013) servers. Template-based modeling was performed with the software Modeller v9.18. Protein structure refinement was done with i3Drefine (Bhattacharya and Cheng 2013). Protein structure evaluation and validation were done with Ramachandran plots generated with RAMPAGE (http://mordred.bioc.cam.ac.uk/~rapper/rampage2.php), and with Z score values and energy distribution plots obtained with ProSA-web (Wiederstein and Sippl 2007). Structural alignments were done with the RaptorX structure alignment server (Wang et al. 2011, 2013) and visualized using UCSF Chimera (Pettersen et al. 2004). Functional domains were predicted with the Conserved Domains Database (CDD) (Marchler-Bauer et al. 2014) and the CATH structure database (Dawson et al. 2017). Multiple protein sequence alignments were performed through the PRABI server (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html). Molecular docking was performed using AutoDockTools v1.5.6 (Huey and Morris 2003).

Results and discussion

Variable biochemical reactions

The target biochemical reactions were identified through literature review of conflicting results obtained in phenotypic identification studies of six emerging and reemerging pathogenic Corynebacterium species: C. diphtheriae (including biovars gravis, mitis, belfanti, and intermedius) and C. ulcerans; C. xerosis, C. striatum, C. minutissimum, and C. amycolatum (hereafter collectively referred to as the XSMA group). Six biochemical reactions were identified as the most variable among the various isolates, of which five are components of the most traditionally used identification test, the API Coryne system (bioMérieux): nitrate reduction and fermentation of ribose, maltose, sucrose, glycogen, and galactose (the latter not included in API Coryne) (Table 2). Bacterial identification by the API Coryne system is based on a numerical code generated by the APIweb application; this code combines results of 21 different reactions, and biochemical variability in the aforementioned reactions may lead to unreliable identifications.
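To illustrate why a single variable reaction can change an identification, the sketch below builds a numerical profile from 21 positive/negative results. It assumes the convention commonly used for API strips, in which reactions are read in groups of three, weighted 1, 2, and 4, and summed to one digit per group; the positions of the reactions in the example are hypothetical and do not reproduce the actual API Coryne layout or the APIweb software.

```python
from typing import Sequence


def numerical_profile(results: Sequence[bool]) -> str:
    """Turn an ordered list of positive/negative reactions into a digit string.

    Assumed convention: each group of three reactions contributes one digit,
    with weights 1, 2 and 4 for the first, second and third reaction.
    """
    if len(results) % 3 != 0:
        raise ValueError("number of reactions must be a multiple of three")
    digits = []
    for i in range(0, len(results), 3):
        first, second, third = results[i:i + 3]
        digits.append(str(1 * first + 2 * second + 4 * third))
    return "".join(digits)


# Hypothetical isolate: only the first of 21 reactions is positive.
typical = [False] * 21
typical[0] = True

# Same isolate, but one variable reaction (hypothetical position 4) flips.
atypical = list(typical)
atypical[4] = True

print(numerical_profile(typical))   # 1000000
print(numerical_profile(atypical))  # 1200000
```

Under this assumed scheme, 21 reactions yield a seven-digit code, and a single atypical result yields a different code that may match another taxon or no taxon at all.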

The nitrate reduction (NIT) test normally renders positive results for the species C. striatum, though some studies have reported on nitrate-negative isolates (Table 2). It is also positive for the biovars gravis, mitis, and intermedius of C. diphtheriae, while the biovar belfanti is negative. The species C. ulcerans and C. minutissimum are also expected to yield negative results, whereas variable results are reported for the species C. xerosis and C. amycolatum (Table 2).

Ribose fermentation (RIB) is positive for isolates of C. diphtheriae, C. ulcerans, and C. xerosis but variable for the species C. striatum, C. minutissimum, and C. amycolatum (Table 2). Maltose fermentation (MAL) presents variable results for C. amycolatum and is negative for C. striatum. This test is generally positive for C. xerosis, C. ulcerans, and C. diphtheriae, but there are reports of atypical C. diphtheriae isolates presenting negative results (Hall et al. 2010). Similarly, the sucrose fermentation (SAC) test, which is mostly negative for the species C. diphtheriae, might present some atypical positive results for some isolates (Mattos-Guaraldi and Formiga 1998). This reaction is expected to be positive for C. xerosis but is variable for all the other species (Table 2). Glycogen utilization (GLYG) presents variability within the species C. diphtheriae, with isolates of the biovar gravis normally being expected to render a positive result; this test is also expected to render positive results for C. ulcerans but is negative for the remaining four species studied (Table 2).

Enzymatic reactions composing each biochemical pathway

A comparative analysis was performed using data from different databases and literature-based curation, in order to identify key enzymatic reactions and potential variations in the enzymatic activities that compose the six target metabolic pathways (Fig. 1). Annotated EC numbers for the various enzymatic reactions were retrieved from the MetaCyc database for the following bacterial species: Bacillus subtilis, Escherichia coli, Lactobacillus brevis, Lactococcus lactis, and Neisseria meningitidis. From The SEED database, we retrieved annotated enzymatic reactions for the species C. diphtheriae. Finally, a literature review of the various reactions was performed for the species C. glutamicum. This latter species was chosen for manual curation of pathways.

We identified variations in the annotated carbohydrate utilization pathways in the different sources (Supplementary Figs. S1 and S2). The maltose utilization pathway in C. glutamicum requires internalization by the ABC-type transporter MusEFGK2I (EC 3.6.3.19) (Henrich et al. 2013); then, it involves the actions of a 4-α-glucanotransferase (encoded by the malQ gene) (EC 2.4.1.25) and of maltose phosphorylase (malP gene) (EC 2.4.1.8), glucokinase (glk gene) (EC 2.7.1.2), and phosphoglucomutase (pgm gene) (EC 5.4.2.2), to generate maltodextrins, glucose-1-phosphate, and glucose-6-phosphate (Blombach and Seibold 2010; Clermont et al. 2015). The sucrose degradation pathway in this species involves the transport of this carbohydrate by a specific PTS-type transporter (EC 2.7.1.191), followed by cleavage of sucrose-6-phosphate by a sucrose hydrolase (scrB gene) (EC 3.2.1.-), to generate glucose-6-phosphate and fructose; unlike annotations for various bacteria in MetaCyc and the SEED, C. glutamicum does not present fructokinase (scrK gene) activity (EC 2.7.1.4) (Supplementary Figs. S1 and S2). Instead, it is believed to export fructose by an as-yet-unidentified efflux mechanism, and this sugar is then re-imported through a fructose-specific PTS to yield fructose-1-phosphate; it can then be phosphorylated by the enzyme FruK 1-phosphofructokinase (pfkB gene) to generate the glycolytic intermediate fructose-1,6-bisphosphate (Moon et al. 2005; Engels et al. 2008; Ikeda 2012) (Supplementary Figs. S1 and S2).
Glycogen utilization involves the following reactions: cleavage of α-1,4-glycosidic linkages by glycogen phosphorylase (glgP gene) (EC 2.4.1.1); generation of maltodextrins by the glycogen debranching enzyme GlgX (glgX gene) (EC 3.2.1.196); and generation of glucose and glucose-6-phosphate through the activities of the enzymes maltodextrin phosphorylase (malP gene) (EC 2.4.1.1), 4-alpha-glucanotransferase (malQ gene) (EC 2.4.1.25), maltodextrin glucosidase (malZ gene) (EC 3.2.1.20), glucokinase (glk gene) (EC 2.7.1.2), and alpha-phosphoglucomutase (pgm gene) (EC 5.4.2.2) (Seibold et al. 2009; Von Zaluskowski 2015) (Supplementary Figs. S1 and S2). d-ribose utilization involves the activity of the RbsACBD transporter (EC 3.6.3.17) and generation of d-ribose-5-phosphate by RbsK1 and RbsK2 ribokinases (rbsK1 and rbsK2 genes) (EC 2.7.1.15) (Nentwich et al. 2009). Since C. glutamicum is not able to utilize galactose, we retrieved EC numbers from the annotated pathway in E. coli; the pathway involves conversion of β-d-galactose to α-d-galactose by galactose-1-epimerase (galM gene) (EC 5.1.3.3), phosphorylation by galactokinase (galK gene) (EC 2.7.1.6), transfer of UMP from UDP-glucose to α-d-galactose-1-phosphate by galactose-1-phosphate uridylyltransferase (galT gene) (EC 2.7.7.12), and conversion of UDP-galactose to UDP-glucose by UDP-glucose-4-epimerase (galE gene) (EC 5.1.3.2); phosphoglucomutase (pgm gene) (EC 5.4.2.2) is then required to convert glucose-1-phosphate to glucose-6-phosphate (Holden et al. 2003) (Supplementary Figs. S1 and S2).

Finally, the nitrate reductase activity (EC 1.7.5.1) in C. glutamicum involves the narKGHJI gene cluster (Nishimura et al. 2007), which is highly similar to the system annotated for E. coli and other bacteria in MetaCyc. It involves the transport of nitrate ions by the NarK transporter and nitrate reduction by the alpha subunit, NarG, which contains the guanylyl molybdenum cofactor and one [4Fe-4S] iron-sulfur cluster. The beta subunit NarH, which contains three [4Fe-4S] and one [3Fe-4S] iron-sulfur clusters, is anchored to the membrane, together with NarG, by the gamma subunit NarI, which contains two heme groups for electron transfer; the delta subunit NarJ is needed for assembly of the complex (Rothery et al. 1998; Almeida et al. 2017).
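The reference pathways described above can be held as a simple mapping from pathway to required EC numbers, against which the activities detected in a genome are compared. The sketch below is an illustration only, under stated assumptions: the EC numbers are those quoted in the text for the maltose, ribose, galactose, and nitrate reduction pathways (sucrose and glycogen are left out for brevity), and the names `PATHWAY_ECS` and `missing_ecs` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# EC numbers as quoted in the text; this is a subset of the six pathways.
PATHWAY_ECS: Dict[str, Set[str]] = {
    "maltose": {"3.6.3.19", "2.4.1.25", "2.4.1.8", "2.7.1.2", "5.4.2.2"},
    "ribose": {"3.6.3.17", "2.7.1.15"},
    "galactose": {"5.1.3.3", "2.7.1.6", "2.7.7.12", "5.1.3.2", "5.4.2.2"},
    "nitrate reduction": {"1.7.5.1"},
}


def missing_ecs(detected: Iterable[str]) -> Dict[str, List[str]]:
    """For each pathway, list the required EC numbers not found in a genome."""
    found = set(detected)
    return {name: sorted(required - found)
            for name, required in PATHWAY_ECS.items()}


# Hypothetical genome with a complete maltose pathway and a ribokinase only.
report = missing_ecs(
    ["3.6.3.19", "2.4.1.25", "2.4.1.8", "2.7.1.2", "5.4.2.2", "2.7.1.15"]
)
for pathway, gaps in report.items():
    print(pathway, "complete" if not gaps else "missing: " + ", ".join(gaps))
```

An empty list marks a pathway whose required activities were all detected; as the text stresses, this is evidence of gene presence, not proof of the phenotype.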

Identification of the various enzyme genes in the studied genomes

Homology searches using the tBLASTn tool were initially performed against a local database, using protein sequences retrieved from UniProtKB that represent each of the enzymatic activities involved in the variable biochemical reactions. The distribution of the various enzyme genes in the 33 studied genomes is shown in Fig. 2. The differential occurrence of the genes is also color-coded in Supplementary Figs. S1 and S2. Complete results of the BLAST searches for each genome are presented in supplementary material (Supplementary Table S1).

Fig. 2 Graphical representation of tBLASTn-based homology searches on the local database containing genomic sequences of the 33 strains. Pie charts represent the different metabolic pathways and each fraction corresponds to an essential protein/subunit required for that pathway. Red: nitrate reductase reaction. Green: ribose utilization. Blue: maltose utilization. Yellow: sucrose utilization. Pink: glycogen utilization. Purple: galactose utilization. Gray fractions indicate genes annotated as pseudogenes and white fractions indicate the absence of specific genes

In accordance with expected phenotypic results, all genes required for the nitrate reduction pathway were found in most strains of C. striatum, C. xerosis, and C. diphtheriae (Fig. 2 and Table 2). Two strains of C. striatum (2230 and 2245) and all studied strains of C. ulcerans, C. minutissimum, and C. amycolatum did not present any of the genes required for respiratory nitrate reductase activity (Fig. 2). As for the biovar belfanti of C. diphtheriae, BLAST searches suggested that the nitrate reductase pathway might be non-functional in strain INCA 402, due to the absence of the narG gene. Although this is in accordance with the expected phenotypic profile for strains of the biovar belfanti, we later traced this finding to incorrect assembly of the genomic region that contained the narGHIJ operon in this strain, in the genome release used. Nevertheless, Sangal et al. (2014) have previously reported that this belfanti strain indeed possesses a mutated narJ gene that might explain the nitrate-negative phenotype. We further investigated this in four newly released draft genomic sequences of isolates assigned to the biovar belfanti (strains 631, 1734, 1137, and 2937; GenBank accession numbers GCA_002202425.1, GCA_002202575.1, GCA_002202505.1, GCA_002202655.1). Although the narKGHIJ gene cluster was complete in all strains, the gene coding for the molybdopterin biosynthesis enzyme MoaB, which is required for synthesis of the cofactor needed for nitrate reductase activity, is annotated as a pseudogene due to frameshifts in three belfanti strains (data not shown).

The main enzyme gene participating in the ribose utilization pathway, rbsK, which codes for ribokinase, was found in all genomic sequences, along with rpiA, coding for ribose-5-phosphate isomerase (Fig. 2). Unlike C. glutamicum, which has two ribokinases, the six species studied presented only one ribokinase gene. Given the low complexity of the main reactions in the pathway, the observed variability could potentially originate from the functionality of the ribose-specific transporter. According to Ikeda (2012), ribose uptake in C. glutamicum occurs through the RbsACBD transporter (Nentwich et al. 2009). However, only C. minutissimum 1941 presented all four components of the transporter. The other strains showed only the RbsABC components (C. diphtheriae, C. ulcerans, C. striatum, and C. minutissimum) or RbsAC (C. xerosis and C. amycolatum) (Tables 3, 4, 5, and 6). According to Clifton et al. (2015), the ribose-specific ABC transporter in E. coli differs from other ABC transporters, being composed only of RbsABC2 subunits; the two ATPase components are fused to the RbsA subunit. Van Der Heide and Poolman (2002) identified the existence of chimeric ABC proteins having the substrate-binding subunit (SBP) fused to the translocator, in which case the RbsC translocating subunit could have the substrate-binding site incorporated into the sequence. Also, the existence of low-affinity ribose transporters cannot be ruled out; as shown by Kim et al. (1997) for E. coli, the d-allose ABC transporter is also capable of transporting ribose. Considering the absence of the rbsD gene in the majority of strains in this study, we hypothesize that ribose transport in these corynebacteria may follow a pathway very similar to the one in E. coli.

Table 3 Genomic status of transporter systems in C. diphtheriae strains
Table 4 Genomic status of transporter systems in strains of the XSMA group
Table 5 Genomic status of the variable reactions in 14 C. diphtheriae strains
Table 6 Genomic status of variable reactions in 17 strains of the XSMA group

In the maltose utilization pathway, all studied genomes also presented the four main enzymes predicted according to the annotated pathway in C. glutamicum (Blombach and Seibold 2010) (Fig. 2); nevertheless, the presence of the pgmB gene (coding for beta-phosphoglucomutase) in some species (C. xerosis and C. amycolatum) indicates that maltose degradation might also follow an alternative pathway that involves the genes malP, pgmB, and glk, as described by Andersson and Rådström (2002). The essentiality of beta-phosphoglucomutase activity for maltose metabolism has also been demonstrated for L. lactis and Streptococcus mutans (Buckley et al. 2014).

C. striatum is phenotypically negative for maltose utilization but possessed all the genes for maltose degradation according to the pathway annotated in C. glutamicum. By using a refined search using jackHMMER, we could also find in this species all the components of the maltose-specific transporter MusKEFG (Henrich et al. 2013) (Tables 4 and 6); thus, it is feasible to propose that this species might need beta-phosphoglucomutase for efficient maltose utilization. C. minutissimum, in turn, did not present the pgmB gene but is phenotypically positive for the maltose fermentation test. This species presented all genes for the MusKEFG transporter components, so the pathway used for maltose utilization seems to be closer to the functional variant of C. glutamicum (Table 4). The distribution of homologs of the malZ gene in the strains is controversial, as only strains of C. xerosis and C. diphtheriae presented homologs with identities above 60%. However, the predicted proteins that are encoded by these genes, when evaluated in databases of protein domain prediction, indicate an enzyme from the same protein family but with slightly different catalytic site. Whereas the maltodextrin glycosidase MalZ has catalytic activity at alpha-1,4-glycosidic linkages, the alignment results in CATHdb indicate an oligo-1,6 glycosidase. Seibold et al. (2009) also failed to identify the malZ gene in C. glutamicum, though they have demonstrated experimentally the expected enzymatic activity.

In the sucrose utilization pathway, the main enzyme for differentiation between functional alternative pathways is fructokinase (encoded by the scrK gene); the ability to phosphorylate fructose in the cytoplasm is important to determine if the pathway follows the classical model, using the enzymes ScrB and ScrK and the transporter PTS(Suc), or the functional variant, in which fructose is exported to the extracellular environment and then imported through a fructose-specific PTS. Our data indicate that C. diphtheriae probably assimilates sucrose following a classic model that requires the three proteins, given the absence of PTS(Suc) in most strains. Additional evidence would be the annotation of the fructose-specific transporter PTS(Fru) in all strains of this species as a pseudogene, due to frameshifts. The absence of the ability to phosphorylate fructose through PTS(Fru) impairs sucrose utilization, as C. diphtheriae strains also do not possess a functional fructokinase. Nonetheless, several previous studies have reported on the identification of atypical sucrose-fermenting strains of C. diphtheriae (Pennie et al. 1996; Efstratiou and George 1996; Mattos-Guaraldi and Formiga 1998; Pimenta et al. 2008; Viguetti et al. 2012). We could identify the presence of the ptsS, scrB and scrK genes in C. diphtheriae strains 31A and BH8, confirming the strain-specific presence of the reaction. However, as reported by Viguetti et al. (2012), other strains used in this study also present a positive sucrose fermentation reaction (strains HC01, HC02, HC03, INCA 402, and 241); we only identified in these strains the presence of the scrB gene. This is controversial as the absence of the sucrose-specific PTS transporter PTS(Suc) would be sufficient to abolish the sucrose fermentation capacity (Moon et al. 2005).
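The reasoning in this paragraph amounts to a small decision rule over four genes. The sketch below encodes one reading of that rule; it is an illustration under stated assumptions, not the authors' procedure. Each argument is True only when an intact gene (not a pseudogene) was detected, and the function name and return labels are hypothetical.

```python
def predict_sucrose_route(pts_suc: bool, scr_b: bool,
                          scr_k: bool, pts_fru: bool) -> str:
    """Predict which sucrose utilization route a genome could support.

    pts_suc: sucrose-specific PTS transporter, PTS(Suc)
    scr_b:   sucrose-6-phosphate hydrolase, ScrB
    scr_k:   fructokinase, ScrK
    pts_fru: fructose-specific PTS transporter, PTS(Fru)
    """
    if not pts_suc:
        # Per the text, lack of PTS(Suc) alone would abolish fermentation.
        return "no route predicted"
    if not scr_b:
        # Sucrose-6-phosphate cannot be cleaved.
        return "no route predicted"
    if scr_k:
        # Fructose is phosphorylated in the cytoplasm.
        return "classical"
    if pts_fru:
        # Fructose is exported and re-imported through PTS(Fru).
        return "export and re-import variant"
    return "impaired (fructose not phosphorylated)"


# Gene content as described in the text for selected C. diphtheriae strains.
print(predict_sucrose_route(True, True, True, False))    # strains 31A and BH8
print(predict_sucrose_route(False, True, False, False))  # strains with scrB only
```

The second call returns "no route predicted" for strains that were nevertheless reported as sucrose-positive, which is the discrepancy the paragraph describes as controversial.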

In the strains of the XSMA group, the presence of the scrK gene in 14 genomic sequences indicates that sucrose fermentation in these species may follow the classical pathway. The species C. xerosis and C. amycolatum presented the gene coding for the hydrolase ScrB, although it is annotated as a pseudogene due to frameshifts. These species are phenotypically positive for sucrose fermentation according to the literature; however, the frameshifts annotated in the scrB sequence in both strains of C. xerosis and of C. amycolatum would make it impossible to cleave sucrose molecules. Most strains of C. minutissimum and C. striatum showed all the necessary genes, with the exception of strain 2245, which lacks a detectable scrK gene. Notably, this strain is also phenotypically negative for fructose utilization (Ramos 2014). The high number of annotated pseudogenes in the genomic sequences of the C. xerosis strains raises the possibility that these have arisen due to assembly errors. Interestingly, the absence of PTS(Suc) and ScrK also occurs in the diphtheric species C. pseudotuberculosis and C. ulcerans, which have positive and variable phenotypes for sucrose fermentation, respectively. C. pseudotuberculosis is phylogenetically close to C. diphtheriae but does not present all the genes required for utilization of sucrose; no orthologs of the PTS(Suc) transporter could be identified in genomic sequences of C. pseudotuberculosis strains (data not shown). This points to the existence of an alternative sucrose utilization pathway operating in these species, but this needs to be explored further. We did investigate the possibility of alternative transport of sucrose in these species and found an N-acetylglucosamine-specific transporter with high structural similarity to the C. glutamicum transporter PTS(Suc) and to the ones found in the strains 31A and BH8 of C. diphtheriae (data not shown).

The glycogen degradation pathway relates to the maltose utilization pathway, as glycogen is formed by glucose chains of 8 to 12 branched molecules and connected by 1,4 and 1,6-glycosidic linkages between the branches (Berg et al. 2012), and maltose is formed by two glucose molecules connected by 1,4-glycosidic linkages; therefore, sharing of enzymatic reactions between the two pathways is expected. Among the shared enzymes are Glk, MalP, MalZ, MalQ, and α-Pgm (Seibold et al. 2009).

According to literature review, all species of the XSMA group are negative for glycogen utilization, but all had homologs to the main genes of the metabolic pathway. Most of the strains only presented one α-glucan phosphorylase (malP or glgP genes), indicating a possible difference with the metabolic pathway annotated for E. coli (Seibold et al. 2009). As the enzymes MalP and GlgP catalyze the same reaction, the presence of one enzyme is sufficient for functionality of the pathway. MalP and GlgX are the main enzymes for glycogen degradation (Von Zaluskowski 2015); annotations for C. xerosis ATCC 373 and C. striatum 1961 indicated a pseudogene in glgX, suggesting inability to metabolize glycogen. The malQ gene is also annotated as a pseudogene in the strain 1941 of C. minutissimum. Even though this gene is not essential, it could alter the dynamics of production of free glucose and decrease the rate of maltose absorption. The other genomic sequences presented all the genes with identities above 50%, except for malZ which was the one with the lowest identity among all strains.

Similarly to the findings for the XSMA group, all the strains of C. diphtheriae presented genes for glycogen degradation. However, according to the literature, only strains of the biovar gravis, and occasionally of the biovar intermedius, can utilize glycogen. The strains 241 and HC01, both of biovar mitis, possessed all genes but presented frameshifts in the glgP gene. Although the glycogen utilization pathway described for C. glutamicum covers the main reactions needed for maintenance of the pathway intracellularly, it does not account for the degradation of glycogen in the external environment. Due to its size, glycogen cannot normally be transported through the bacterial membrane into the cytoplasm, raising doubts about the mechanism by which it is fragmented and transported in strains of the biovar gravis. Abbott et al. (2010) emphasize the importance of the spuA and malX genes in the metabolism of extracellular glycogen in Streptococcus pneumoniae. The pneumococcal SpuA acts on glycogen depolymerization through the hydrolysis of α-1,6 glycosidic linkages and formation of malto-oligosaccharides of different sizes, while MalX binds with high affinity to the different malto-oligosaccharides produced by SpuA activity. GlgX also cleaves α-1,6 glycosidic linkages but has no signal peptide for export from the intracellular environment; thus, SpuA is essential for assimilation of external glycogen. A profile HMM-based search for an ortholog of the streptococcal extracellular enzyme SpuA (pullulanase) rendered positive results only for some genomic sequences from C. diphtheriae, particularly of the biovar gravis (E value = 7.4e-189) (Fig. 3a). This SpuA ortholog identified in C. diphtheriae has a secretory signal peptide (Fig. 3b) and is putatively able to bind to glycogen, as shown by molecular docking (Fig. 3c, d). Out of 200 genomic sequences (draft and complete) of C. diphtheriae available in public databases, we could identify the putative pullulanase in 72 genomes: 53 of the biovar gravis; 2 of the biovar mitis; and 17 without biovar assignment. All C. diphtheriae strains presented the gene coding for MalX, which binds to malto-oligosaccharides longer than 11 glucose units that cannot be transported normally (Abbott et al. 2010), reinforcing the need for the SpuA ortholog. The classification of C. diphtheriae strains into biovars based on biochemical reactions is still complex and unreliable (Sangal et al. 2014). Sangal et al. (2014) reported on the difficulties of finding a genetic basis that supports biovar differentiation. Our finding that strains of the biovar gravis present a SpuA ortholog may contribute to this knowledge, as it might help to explain the differential ability to utilize glycogen in this biovar. The presence of this gene in three strains of the biovar mitis could also help to explain the rare ability to utilize starch. In the XSMA group, no homologs were found for spuA and malX, indicating that these strains are not capable of utilizing external glycogen, though they may still be able to degrade it intracellularly.
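The genome counts reported above can be tallied in a few lines. The following is only a minimal sketch: the counts are those given in the text, while the variable names and the dictionary layout are illustrative assumptions and not part of the original analysis.

```python
# Counts of C. diphtheriae genomes carrying the putative pullulanase
# (SpuA ortholog), as reported in the text. Names are illustrative.
TOTAL_GENOMES = 200
spua_positive = {"gravis": 53, "mitis": 2, "unassigned": 17}

# Total number of SpuA-positive genomes and its share of all genomes searched.
total_positive = sum(spua_positive.values())
fraction_of_all = total_positive / TOTAL_GENOMES

# Share of each biovar among the SpuA-positive genomes.
share_by_biovar = {b: n / total_positive for b, n in spua_positive.items()}

print(f"SpuA-positive genomes: {total_positive} of {TOTAL_GENOMES} "
      f"({fraction_of_all:.0%})")
for biovar, share in share_by_biovar.items():
    print(f"  {biovar}: {share:.1%} of SpuA-positive genomes")
```

Under these counts, roughly one third of the available genomes carry the putative pullulanase, and most of those belong to the biovar gravis.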

Fig. 3

A C. diphtheriae biovar gravis enzyme with putative alpha-1,6-glycosidase activity, similar to streptococcal SpuA. a Genomic context analysis demonstrating the strain-specific presence of the gene coding for the C. diphtheriae alpha-1,6-glycosidase. b Prediction of signal peptide for protein extracellular localization. c Structural alignment of the C. diphtheriae alpha-1,6-glycosidase with SpuA from Streptococcus pneumoniae. d Molecular docking of C. diphtheriae VA01 putative alpha-1,6-glycosidase and the glycogen molecule

The galactose degradation pathway is not covered by the most traditionally used biochemical identification tests for Corynebacteria, such as the API Coryne system. However, this reaction is widely used in the identification of microorganisms in clinical microbiology laboratories, and its high variability in the studied species may aid understanding of the genetic events contributing to biochemical variability. The presence of the galK, galE, and galT genes is essential for the maintenance of the galactose utilization pathway. C. xerosis presented homologs for all genes in the pathway; however, the strain ATCC 373 possesses frameshifts in the galK and galT genes, indicating an inability to phosphorylate galactose and impairment of the subsequent reactions. The C. minutissimum genotype corresponds to the expected negative phenotype, as the galM and galT genes were not identified in any of the strains. There were divergences between C. amycolatum strains, as strain SK46 presented all the essential genes for galactose utilization, whereas strain ICIS 53 did not present galT, possibly explaining the variable phenotype observed for this species. All C. diphtheriae strains presented the essential genes for degradation of galactose, but prediction of transporter components rendered ambiguous results, with protein domains for transport of other substrates, such as ribose and xylose. A similar result was found for prediction of ABC transport systems in the other species (Tables 3 and 4).
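The reasoning applied to the galactose pathway, in which a gene counts only if it is present and not disrupted by a frameshift, can be illustrated with a short sketch. This is a hypothetical illustration and not the procedure used in the study: the class and function names are assumptions, and the presence of galK and galE in strain ICIS 53 is assumed from the statement that only galT was missing.

```python
from dataclasses import dataclass, field

# Genes stated in the text as essential for galactose utilization.
ESSENTIAL_GAL_GENES = ("galK", "galE", "galT")


@dataclass
class GenomeAnnotation:
    """Illustrative container: homologs found and those carrying frameshifts."""
    present: set
    frameshifted: set = field(default_factory=set)


def galactose_pathway_intact(genome: GenomeAnnotation) -> bool:
    """True only if every essential gene is present and not frameshifted."""
    functional = genome.present - genome.frameshifted
    return all(gene in functional for gene in ESSENTIAL_GAL_GENES)


genomes = {
    # All homologs found, but galK and galT carry frameshifts.
    "C. xerosis ATCC 373": GenomeAnnotation(
        present={"galK", "galE", "galT"}, frameshifted={"galK", "galT"}
    ),
    # All essential genes found.
    "C. amycolatum SK46": GenomeAnnotation(present={"galK", "galE", "galT"}),
    # galT not found (galK and galE assumed present for illustration).
    "C. amycolatum ICIS 53": GenomeAnnotation(present={"galK", "galE"}),
}

results = {name: galactose_pathway_intact(g) for name, g in genomes.items()}
for name, intact in results.items():
    print(f"{name}: pathway intact = {intact}")
```

Such a check covers only the enzymatic steps; as noted above, ambiguous transporter predictions can still prevent a reliable phenotype call.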

Genomic context analysis and prediction of genomic islands

To explore the contributions of events of gene gain and loss for the differential distribution of the various enzyme genes, we performed a genomic context analysis of the main genes that are essential for the six variable biochemical pathways and evaluated their occurrences in horizontally acquired genomic islands. We found that the main genes in the sucrose utilization pathway—scrA, scrB, and scrK—are in close proximity in the genome, whereas the most conserved components of the PTS system, Hpr and EI, encoded by the ptsH and ptsI genes, respectively, are found in distinct regions of the genomes. A prediction of genomic islands (Fig. 4) indicated the existence of a metabolic island that contains the main genes for sucrose utilization, only in the strains 31A and BH8 of C. diphtheriae. Although these genes were also found in other species, prediction of genomic islands in strains of the XSMA group was hampered due to the high fragmentation of the genome assemblies into various contigs (Table 1).

Fig. 4

Genomic island alignments in C. diphtheriae strains. Graphical representation of the alignment of the genomes of C. diphtheriae strains with potential for sucrose fermentation. The genomic islands were aligned with the five genomes but were not found in some of the strains described as positive for sucrose utilization. The genome of C. diphtheriae 31A was used as a reference. GI: Genomic Island; MI: Metabolic Island; CD: C. diphtheriae

Mapping of carbohydrate transporters

We mapped the carbohydrate transporter systems in the various genomic sequences in order to evaluate whether differential occurrence of transport-related protein components might contribute to the biochemical variability of the strains. In addition to evaluating transport of the target carbohydrates, we also examined the presence of transporters for the monosaccharides fructose and glucose, which are basic components of the target carbohydrates. The genes coding for the common Hpr and EI components of the PTS system were present in all strains (Tables 3 and 4). The glucose- and fructose-specific PTS transporters were identified in all strains; however, all C. diphtheriae strains presented frameshifts in the fructose-specific PTS (ptsF) gene, which is annotated as a pseudogene. The ptsF gene is a key component of sucrose metabolism in C. glutamicum, due to the lack of fructokinase activity; the only pathway for phosphorylation of fructose derived from sucrose is the export and subsequent uptake of fructose through PTS(Fru). All strains of the XSMA group presented homologous genes of the non-PTS iol1 and iol2 glucose transporters (Ikeda et al. 2011), while no strains of C. diphtheriae showed homologs for these transporters (Tables 3 and 4). An automatic analysis in the curated database TransportDB identified the presence of the fructose-specific EIIAB components in all strains of C. diphtheriae, with the exception of the strain INCA 402. A sucrose-specific PTS was found in all strains of C. striatum and C. minutissimum (Tables 3 and 4); however, only two strains of C. diphtheriae (31A and BH8) and one strain each of C. xerosis (NBRC16721) and C. amycolatum (SK46) presented this sucrose-specific transporter. Maltose transport is carried out by a specific ABC-type transporter; homologs with high similarities to the four C. glutamicum transporter components were identified in all strains of C. diphtheriae, C. minutissimum, and in C. amycolatum strain ICIS53 (Tables 3 and 4). Ribose transport is also performed by an ABC-type transporter; however, homologs of the four component proteins were only found in the C. minutissimum 1941 strain, an unexpected result given the observed positive and variable profiles of the studied species (Tables 3 and 4). The transport of galactose in its free form is also carried out by an ABC carrier, according to the Transport Classification Database (TCDB). The transport of galactose in E. coli is carried out by three components, encoded by the genes mglA, mglB, and mglC, and by the symporter permease encoded by galP. Protein orthologs sharing more than 30% identity were found in all C. diphtheriae strains; however, protein annotation and domain database searching indicated these components as part of the ribose-specific ABC transporter, given the similarities between the families of ABC transporters of ribose and galactose. C. striatum and C. minutissimum also rendered results with more than 30% identity for mglA and mglC, but these are potentially false-positive results, as the prediction of domains and annotations in different databases also indicated the transport of other substrates. C. xerosis and C. amycolatum only showed homologs for mglA (Tables 3 and 4). BLAST-based homology searches did not detect orthologs of the gene encoding the galactose-binding periplasmic protein MglB; nevertheless, by using profile HMMs through jackHMMER (Finn et al. 2015), we could find statistically relevant hits for the MglB substrate-binding protein in C. diphtheriae, C. striatum, C. minutissimum, and C. amycolatum (Tables 3 and 4). Notably, some ABC transporter components present ambiguous annotations due to high sequence conservation among various protein family members.

Concluding remarks

The results obtained solely by BLAST-based homology searches of the main enzyme genes needed for each of the biochemical reactions (Fig. 2) clearly do not explain the phenotypic profiles of the various bacterial isolates (Table 2). Moreover, discordant phenotypic predictions from genotypes are obtained if only these results are considered, without curation of the biochemical pathway reconstructions. Additionally, the differential occurrence of carbohydrate transporters may impact the phenotypic predictions (Tables 5 and 6).

For instance, prediction of maltose utilization using the pathway structure annotated for the species C. glutamicum renders positive results for all genomic sequences analyzed (Fig. 2), even though this reaction is expected to be negative for the species C. striatum. Automatic reconstruction of this pathway using the PATRIC platform would yield similar results (data not shown). However, when considering the structure of the pathway annotated in MetaCyc for other bacterial species, which requires the enzyme beta-phosphoglucomutase (pgmB gene) (EC 5.4.2.6), C. striatum would render a negative result, while the species C. xerosis and C. amycolatum would be considered positive (Fig. 5). Therefore, the occurrence of the pgmB gene permits a better prediction of maltose utilization ability in species of the XSMA group.
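The effect of the chosen pathway definition on the maltose prediction can be sketched as follows. This is a simplified, hypothetical illustration: the function and field names are assumptions, and the pgmB presence values are inferred from the outcomes described above rather than taken from the underlying tables.

```python
def predict_maltose_utilization(core_genes_present: bool,
                                has_pgmB: bool,
                                require_pgmB: bool) -> bool:
    """Predict maltose utilization under one of two pathway definitions.

    require_pgmB=False mimics the C. glutamicum-style pathway structure;
    require_pgmB=True mimics the MetaCyc variant that needs
    beta-phosphoglucomutase (pgmB).
    """
    if not core_genes_present:
        return False
    return has_pgmB if require_pgmB else True


# pgmB values inferred from the predictions described in the text.
species = {
    "C. striatum": {"core": True, "pgmB": False},
    "C. xerosis": {"core": True, "pgmB": True},
    "C. amycolatum": {"core": True, "pgmB": True},
}

without_pgmB_rule = {
    name: predict_maltose_utilization(s["core"], s["pgmB"], require_pgmB=False)
    for name, s in species.items()
}
with_pgmB_rule = {
    name: predict_maltose_utilization(s["core"], s["pgmB"], require_pgmB=True)
    for name, s in species.items()
}

for name in species:
    print(f"{name}: C. glutamicum-style = {without_pgmB_rule[name]}, "
          f"pgmB-dependent = {with_pgmB_rule[name]}")
```

Only the pgmB-dependent definition separates C. striatum from the other two species, in line with the expected phenotypes.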

Fig. 5

Comparisons between true phenotypic profiles and bioinformatic predictions of phenotypes from genomic data. Phenotypic profiles were obtained from literature review for specific strains of C. diphtheriae, C. ulcerans, and C. striatum (Trost et al. 2011; Baio et al. 2013; Ramos 2014; Encinas et al. 2015). Curated bioinformatics predictions considered results obtained by tBLASTn- and profile HMM-based orthologs search, prediction of carbohydrate transport systems, and literature-based curation of results. Improvements over bioinformatics predictions based solely on BLAST searches are indicated. Limitations of the strategy are also highlighted

In the glycogen utilization pathway, the major variation between all genomic sequences was the occurrence of the malZ gene (that codes for maltodextrin glucosidase) (Fig. 2). Biochemical pathway reconstructions in PATRIC indicate differential occurrence of the malP gene (not shown). All strains presented the essential genes glgX and glgP for intracellular degradation of glycogen; considering that mostly strains of the biovar gravis of C. diphtheriae were expected to be positive for glycogen fermentation, these findings pointed to a possible variability in the process of import of glycogen derivatives from the extracellular environment. In fact, as mentioned earlier, a profile HMM-based search for an ortholog of the streptococcal extracellular enzyme SpuA (pullulanase) rendered positive results only for genomic sequences from C. ulcerans and C. diphtheriae, particularly for strains of the biovar gravis (Fig. 3). Therefore, we propose that a positive phenotype prediction for glycogen utilization in strains of C. diphtheriae should include the detection of a SpuA ortholog in the genomic sequence (Fig. 5).
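The proposed rule for glycogen utilization in C. diphtheriae can be written as a small decision function. This is a minimal sketch under stated assumptions: the class, field, and function names are hypothetical, the example genomes are invented placeholders rather than strains from this study, and the rule is a simplification of the curated prediction.

```python
from dataclasses import dataclass


@dataclass
class GlycogenEvidence:
    """Illustrative summary of the evidence found in one genome."""
    glgX_functional: bool          # debranching enzyme, not a pseudogene
    phosphorylase_functional: bool  # glgP or malP, not frameshifted
    spuA_ortholog: bool            # secreted pullulanase-like enzyme detected


def predict_glycogen_utilization(ev: GlycogenEvidence) -> bool:
    """Positive only if intracellular degradation genes are functional
    AND an ortholog of the extracellular enzyme SpuA is detected."""
    intracellular_ok = ev.glgX_functional and ev.phosphorylase_functional
    return intracellular_ok and ev.spuA_ortholog


# Hypothetical genomes used only to exercise the rule.
examples = {
    "hypothetical genome with SpuA ortholog": GlycogenEvidence(True, True, True),
    "hypothetical genome without SpuA ortholog": GlycogenEvidence(True, True, False),
    "hypothetical genome with glgX pseudogene": GlycogenEvidence(False, True, True),
}

predictions = {name: predict_glycogen_utilization(ev)
               for name, ev in examples.items()}
for name, positive in predictions.items():
    print(f"{name}: glycogen utilization predicted = {positive}")
```

The second example shows why homology searches on the intracellular genes alone would overpredict a positive phenotype.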

In the ribose utilization pathway, there was variability in the occurrence of the components of the RbsACBD transporter between the studied genomes (Tables 5 and 6). With the exception of the pathway annotated in The SEED database, the other variants of this pathway required the component D-ribose pyranase (rbsD gene); only one strain of C. minutissimum (strain 1941) presented this component (Tables 5 and 6), even though the ribose utilization test renders positive results for the species C. diphtheriae and C. xerosis and is potentially positive for the remaining species. PATRIC reconstructions indicate this biochemical pathway as positive for all genomic sequences (not shown). Intriguingly, false-positive predictions of phenotype from genotype were obtained for the species C. striatum when we assumed that the lack of rbsD would not compromise the ability to utilize ribose (Fig. 5). On the other hand, the same assumption could be made for the species C. diphtheriae (Fig. 5).

Similarly, in the galactose utilization pathway, all required genes were found in genomic sequences of strains of C. diphtheriae, C. ulcerans, and C. striatum (Fig. 2); however, strains of C. diphtheriae which are phenotypically negative for galactose utilization also show the presence of all genes in the pathway (Fig. 5). As mentioned earlier, reliable prediction of specific ABC transport systems might be hampered by high similarities between the different systems and this will need further studies.

In the sucrose utilization pathway, there was significant variation in the distribution of the major component genes scrA (sucrose-specific PTS permease) and scrB (sucrose phosphate hydrolase) between the genomes. Interestingly, strains of C. striatum and C. minutissimum presented a gene coding for ScrK (fructokinase), which is absent in most C. diphtheriae strains and in C. glutamicum. The latter species uses PtsF for uptake and phosphorylation of fructose exported during sucrose utilization. Although some C. diphtheriae strains used in this study are reported to present the atypical sucrose-fermenting phenotype (Viguetti et al. 2012), the ptsF gene is annotated as non-functional due to frameshifts in all strains. Importantly, an automated pathway reconstruction using PATRIC would indicate the presence of all components for sucrose utilization in most C. diphtheriae strains, including a sucrose-specific PTS (not shown). Conversely, manual curation indicates the horizontal acquisition of the entire gene set in only two C. diphtheriae strains (31A and BH8). Therefore, prediction of sucrose utilization ability by identification of a scrK homolog in the genomic sequences renders reliable results; however, the absence of this gene and of a detectable sucrose-specific PTS is not sufficient to rule out this ability. These results suggest the existence of an alternative pathway for sucrose utilization.
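Because the absence of scrK does not rule out sucrose utilization, the prediction is better expressed as a call with an explicit undetermined outcome than as a yes/no answer. The sketch below illustrates this asymmetry; the enum, the function name, and the per-species inputs (simplified from the description above) are assumptions made for illustration only.

```python
from enum import Enum


class Call(Enum):
    POSITIVE = "positive"
    UNDETERMINED = "undetermined"


def predict_sucrose_utilization(has_scrK_homolog: bool) -> Call:
    """A scrK homolog supports a positive call; its absence is not
    treated as negative, since an alternative pathway may exist."""
    return Call.POSITIVE if has_scrK_homolog else Call.UNDETERMINED


# Simplified inputs following the text: ScrK found in C. striatum and
# C. minutissimum, absent in most C. diphtheriae strains.
scrK_status = {
    "C. striatum": True,
    "C. minutissimum": True,
    "C. diphtheriae (most strains)": False,
}

calls = {name: predict_sucrose_utilization(found)
         for name, found in scrK_status.items()}
for name, call in calls.items():
    print(f"{name}: {call.value}")
```

Keeping the undetermined category avoids turning a missing homolog into a false-negative phenotype prediction.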