Introduction

In the sequence-based classification system of carbohydrate-active enzymes, the CAZy database (https://www.cazy.org/; Lombard et al. 2014), glycoside hydrolase family 77 (GH77) is a monospecific family of 4-α-glucanotransferases (EC 2.4.1.25) (Janecek and Gabrisko 2016). In Archaea and Bacteria, these enzymes are also known as amylomaltases (Terada et al. 1999; Kaper et al. 2005; Godany et al. 2008; Mehboob et al. 2016, 2020), while they are named disproportionating enzymes (DPEs; or just D-enzymes) in plants (Takaha et al. 1993; Wattebled et al. 2003). Currently, family GH77 contains more than 12,000 sequences (Lombard et al. 2014), of which the vast majority (> 11,800) are from Bacteria, while the remaining members are distributed between Archaea (~ 70) and Eucarya, i.e. plants and green algae (~ 90). In total, 20 enzymes of GH77 have been characterised experimentally so far, and crystal structures have been solved for 6 bacterial amylomaltases and 2 plant DPEs (Lombard et al. 2014).

At the higher hierarchical level, GH77 together with the main α-amylase family GH13 and family GH70 of circularly permuted glucansucrases constitutes the so-called α-amylase clan GH-H (Kuriki and Imanaka 1999; MacGregor et al. 2001; van der Maarel et al. 2002; Janecek et al. 2014; Lombard et al. 2014; Janecek and Gabrisko 2016; Gangoiti et al. 2018). Like the GH13 and GH70 enzymes, the GH77 members employ a retaining reaction mechanism and adopt the α-amylase-type (β/α)8-barrel (TIM-barrel) catalytic domain (Kuriki and Imanaka 1999; Janecek et al. 2014). Similar to TIM-barrels from the α-amylase family GH13, family GH77 has a protruding domain B inserted between strand β3 and helix α3 (Matsuura et al. 1984; Vujicic-Zagar et al. 2010). However, the GH77 TIM-barrel contains additional insertions following other β-strands (MacGregor et al. 2001; Janecek and Gabrisko 2016). Overall, the tertiary structure of GH77 is composed of two main domains, the catalytic TIM-barrel and domain B, the latter comprising three auxiliary subdomains known as B1, B2 and B3 (Przylas et al. 2000b). While subdomain B1 corresponds to domain B in GH13 and B3 may play the role of domain C in GH13, the subdomain B2 is unique to GH77 and has no counterpart identified in GH13 and GH70 (Matsuura et al. 1984; MacGregor et al. 2001; Vujicic-Zagar et al. 2010; Janecek and Gabrisko 2016). Notably, the lack of an antiparallel β-sandwich domain C succeeding the catalytic TIM-barrel is the main feature distinguishing GH77 4-α-glucanotransferases from the other clan GH-H members in GH13 and GH70 (Przylas et al. 2000b). The primary structures of 4-α-glucanotransferases of GH77 have 4–7 conserved sequence regions (CSRs), CSR-I–CSR-VII, in common with GH13 and GH70 (Janecek 2002; Janecek and Gabrisko 2016).
The catalytic machinery of GH77 enzymes consists of a triad of aspartic acid, glutamic acid and aspartic acid localised at the C-terminal ends of strands β4 (Asp; catalytic nucleophile), β5 (Glu; proton donor) and β7 (Asp; transition-state stabiliser), respectively. These were first observed for GH77 as Asp293, Glu340 and Asp395 in the crystal structure of amylomaltase from Thermus aquaticus (Przylas et al. 2000a, b). Currently, seven additional structures are available in GH77, namely of amylomaltases from Aquifex aeolicus [Protein Data Bank (PDB): 1TZ7], Thermus thermophilus (Barends et al. 2007), Thermus brockianus (Jung et al. 2011), Escherichia coli (Weiss et al. 2015) and Corynebacterium glutamicum (Joo et al. 2016) and two DPEs from Arabidopsis thaliana (O’Neill et al. 2015) and potato (Imamura et al. 2020).

In general, enzymes of GH77 catalyse transfer of a glucan chain from one α-1,4-glucan to extend another α-1,4-glucan or produce a cyclic α-1,4-glucan from a single linear α-1,4-glucan chain (Takaha et al. 1993; Terada et al. 1999; Wattebled et al. 2003; Kaper et al. 2005; Godany et al. 2008; Mehboob et al. 2016). The degree of polymerization (DP) of the latter products, often referred to as cycloamylose, is 17 or higher; thus these cyclic α-1,4-glucans are much larger than the α-, β- and γ-cyclodextrins of DP 6, 7 and 8, respectively, produced by cyclodextrin glucanotransferases of GH13 (Fujii et al. 2005a, b, 2007; Srisimarat et al. 2012; van der Maarel and Leemhuis 2013; Roth et al. 2017; Tumhom et al. 2017).

In addition to catalytic domains of carbohydrate-active enzymes, CAZy classifies carbohydrate-binding modules (CBMs) (Boraston et al. 2004). CBMs are functionally and structurally independent domains without enzymatic activity, which by binding carbohydrates can support the function of catalytic domains. In amylolytic enzymes, these modules are known as starch-binding domains (SBDs) or less frequently as glycogen-binding domains, and in principle, they contribute by binding α-glucans, for example on starch granules (Janecek et al. 2011). Currently, CAZy categorises 88 CBM families (Lombard et al. 2014), among which 15 are considered SBDs: CBM20, 21, 25, 26, 34, 41, 45, 48, 53, 58, 68, 69, 74, 82, and 83 (Janecek et al. 2019). Members belonging to these individual SBD CBM families, with the exception of family CBM74 (Valk et al. 2016), are approximately 100 amino acid residues long (Janecek et al. 2011, 2019; Carvalho et al. 2015; Armenta et al. 2017).

As demonstrated previously, amylomaltases from borreliae may be a rather unique group within family GH77 exhibiting unusual amino acid residues at functionally important positions in individual CSRs (Godany et al. 2008; Kuchtova and Janecek 2015). Otherwise, most of the typical GH77 4-α-glucanotransferases are covered by the prokaryotic Thermus-like amylomaltases (Przylas et al. 2000a, b; Barends et al. 2007; Jung et al. 2011) and the eukaryotic DPE1s (O’Neill et al. 2015; Imamura et al. 2020). These enzymes are usually ~ 500 residues long; however, part of the GH77 family is formed by longer prokaryotic and eukaryotic sequences (Kuchtova and Janecek 2015). Thus, in Eucarya, the DPE2s possess a ~ 140-residue insertion between the catalytic nucleophile and proton donor and have two SBDs of family CBM20 in tandem preceding the TIM-barrel domain (Lloyd et al. 2004; Steichen et al. 2008; Kuchtova and Janecek 2015; Janecek et al. 2019). In addition, in Bacteria, a group of amylomaltases differs from the Thermus-like GH77 (Przylas et al. 2000a, b; Barends et al. 2007; Jung et al. 2011) by a ~ 190-residue N-terminal extension (Kuchtova and Janecek 2015; Janecek et al. 2019). Currently, this group is best represented by amylomaltases from Escherichia coli (Pugsley and Dubreuil 1988; Weiss et al. 2015) and Corynebacterium glutamicum (Srisimarat et al. 2011; Joo et al. 2016), both with available crystal structures in which part of the N-terminal extension adopts an immunoglobulin-like fold similar to those seen in structure-determined SBD families (Janecek et al. 2019). Although especially the C. glutamicum amylomaltase has been extensively studied from the structure/function point of view (Srisimarat et al. 2012; Rachadech et al. 2015; Nimpiboon et al. 2016a, b; Tumhom et al. 2017, 2018), no function has been assigned to the N-terminal domain of either the E. coli or the C. glutamicum GH77 enzyme (Weiss et al. 2015; Joo et al. 2016; Janecek et al. 2019).
Recently, the 4-α-glucanotransferase from Bifidobacterium longum was characterised (Jeong et al. 2020), which according to its sequence also belongs to the group having the N-terminal extension.

The need to expand fundamental knowledge on ancillary domains in family GH77 and clan GH-H motivates the present bioinformatics investigation of the unusual N-terminal domains, which are candidates for a novel SBD CBM family and are seen in the crystal structures of GH77 amylomaltases from E. coli and C. glutamicum. The approach involves the search for and retrieval of a relevant group of bacterial amylomaltases containing N-terminal extensions homologous to those found in the E. coli and C. glutamicum GH77 enzymes. The comparison includes docking trials in which a series of α-1,4-oligoglucosides is docked to the two determined and the two modelled three-dimensional structures representing the four phylogenetically distinguished new putative N-terminal SBDs in GH77.

Materials and methods

Sequence collection

Since the CAZy database contains a huge number of GH77 members (currently > 12,000 sequences; Lombard et al. 2014; https://www.cazy.org/), the two GH77 amylomaltases with solved crystal structures, from Escherichia coli (Pugsley and Dubreuil 1988; Weiss et al. 2015) and Corynebacterium glutamicum (Srisimarat et al. 2011; Joo et al. 2016), which possess the mutually homologous N-terminal extension, were chosen as the main representatives in the present study. All GH77 sequences exhibiting resemblance with the N-terminal module of these two enzymes were collected from CAZy, because protein BLAST searches (Altschul et al. 1990; https://blast.ncbi.nlm.nih.gov/) using these E. coli and C. glutamicum N-terminal domains as queries failed to provide meaningful results. A preliminary set of 682 sequences was obtained by browsing family GH77 data from CAZy, the main criterion being a sequence length of 650–750 residues. The sequences were aligned by the programme Clustal-Omega available at the European Bioinformatics Institute’s server (Sievers et al. 2011; https://www.ebi.ac.uk/Tools/msa/clustalo/) and, following an initial alignment, 194 sequences were eliminated for one or more of three reasons: (i) they did not contain a domain homologous to the N-terminal domain of amylomaltases from E. coli and C. glutamicum; (ii) they did not contain the complete catalytic machinery; and (iii) they significantly disrupted the multiple alignment. This resulted in 488 sequences of bacterial GH77 amylomaltases with a convincing N-terminal domain homologous to those found in E. coli and C. glutamicum amylomaltases. One hundred sequences (including E. coli and C. glutamicum amylomaltases as well as the third experimentally characterised amylomaltase from B. longum having the N-terminal extension homologous to that present in the two former enzymes) were finally selected for in-depth analysis (Table S1), ensuring the widest possible taxonomical diversity and a minimum length of ~ 70 residues of the predicted N-terminal domain.
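
The length criterion and the bookkeeping of the collection step can be illustrated with a minimal Python sketch. It is not the procedure actually used (the sequences were obtained by browsing CAZy); the FASTA reader, the function names and the toy records are assumptions made only for illustration, while the 650–750 window and the counts 682 and 194 come from the text.

```python
from typing import Dict, Iterator, Tuple

MIN_LEN, MAX_LEN = 650, 750  # length window stated in the text


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(text: str, lo: int = MIN_LEN, hi: int = MAX_LEN) -> Dict[str, str]:
    """Keep only records whose sequence length lies within [lo, hi]."""
    return {h: s for h, s in read_fasta(text) if lo <= len(s) <= hi}


# Toy records: hypothetical identifiers and artificial sequences.
demo = ">short_demo\n" + "A" * 500 + "\n>long_demo\n" + "A" * 700 + "\n"
kept = filter_by_length(demo)

# Counts reported in the text: 682 candidates, 194 removed after alignment.
remaining = 682 - 194
```

With the toy input only `long_demo` passes the filter, and the subtraction reproduces the 488 sequences reported above.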

All amino acid sequences were retrieved from UniProt (UniProt Consortium 2017; https://www.uniprot.org/) and/or GenBank (Benson et al. 2018; https://www.ncbi.nlm.nih.gov/genbank/) databases. The sequence boundaries for the N-terminal domains in amylomaltases from E. coli and C. glutamicum were retrieved from their tertiary structures (Weiss et al. 2015; Joo et al. 2016), and those of other full-length GH77 amylomaltase sequences were defined based on sequence alignment with the E. coli and C. glutamicum enzymes.

Sequence comparison and evolutionary analysis

The alignment of the selected 100 sequences of the N-terminal module (Table S1) was performed using the programme Clustal-Omega (Sievers et al. 2011; https://www.ebi.ac.uk/Tools/msa/clustalo/). Only a subtle manual tuning was necessary to maximise similarities considering the best-conserved residues and those potentially involved in carbohydrate binding based on inspection of the two three-dimensional structures (Weiss et al. 2015; Joo et al. 2016).

The evolutionary tree was calculated from the final sequence alignment of all 100 N-terminal domains including all gaps as a maximum-likelihood tree using the WAG substitution model (Whelan and Goldman 2001) and the bootstrapping procedure with 500 bootstrap trials (Felsenstein 1985) implemented in the MEGA-X package (Kumar et al. 2018). The tree was displayed with the programme iTOL (Letunic and Bork 2007; https://itol.embl.de/).
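
The bootstrapping idea (Felsenstein 1985) can be sketched in a few lines of Python: each pseudo-replicate is built by sampling alignment columns with replacement, and a tree would then be inferred from every replicate. The sketch below covers only the resampling step and is not the MEGA-X implementation; the function name and the toy alignment (whose first row is the E. coli CSR-4 segment, the other rows being invented) are assumptions.

```python
import random
from typing import List


def bootstrap_alignment(alignment: List[str], rng: random.Random) -> List[str]:
    """Return one pseudo-replicate: columns sampled with replacement."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]


# Toy alignment; rows 2 and 3 are hypothetical variants used for illustration.
toy = ["PEGYHTLT", "PDGYHT-T", "PEGWHSLT"]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(500)]  # 500 trials
```

Every replicate keeps the number of sequences and columns of the input; only the column composition changes, which is what allows branch support to be estimated from the replicate trees.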

Sequence logos of five CSRs defined within the alignment were created using the WebLogo3 online server (Crooks et al. 2004; http://weblogo.threeplusone.com/). Four sequence logos were calculated—one for each of the four clusters identified in the evolutionary tree.
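
For readers unfamiliar with sequence logos, the stack height at each position reflects the information content of the corresponding alignment column. A minimal Python sketch of that quantity follows, assuming a 20-letter alphabet and omitting the small-sample correction that logo software may apply; it is an illustration, not the WebLogo3 code, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List

ALPHABET_SIZE = 20  # standard amino acids


def column_information(column: str) -> float:
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(ALPHABET_SIZE) - entropy


def logo_heights(alignment: List[str]) -> List[float]:
    """Total stack height (bits) for every column of an alignment."""
    return [column_information("".join(col)) for col in zip(*alignment)]
```

Under these assumptions an invariant column reaches the maximum of log2(20) bits, whereas a column split evenly between two residues (for example tyrosine and tryptophan) loses exactly one bit.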

Comparison of tertiary structures and molecular docking

The coordinates of the GH77 template amylomaltases from E. coli (Weiss et al. 2015) and C. glutamicum (Joo et al. 2016) were retrieved from the Protein Data Bank (PDB; Berman et al. 2000; https://www.rcsb.org/) under the PDB codes 4S3Q (4S3R) and 5B68, respectively. The structural data were modified to contain only the N-terminal domain of the two enzymes by cutting out the remaining parts of their structures from the original PDB files based on the literature information (Weiss et al. 2015; Joo et al. 2016).
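
Trimming a PDB file to a single domain amounts to keeping the coordinate records of one chain within a residue range. The Python sketch below shows one way to do this with the fixed-column PDB layout; it is not the authors' procedure. The chain identifier "A" and the range 52–128 are assumptions (the range is inferred from the E. coli numbering offsets quoted in the Results, not from stated domain boundaries), and the helper `make_atom_line` exists only to build toy records.

```python
from typing import Iterable, List


def extract_residue_range(pdb_lines: Iterable[str], chain: str,
                          first: int, last: int) -> List[str]:
    """Keep ATOM/HETATM records of one chain with residue number in [first, last]."""
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        if line[21] != chain:        # column 22: chain identifier
            continue
        res_seq = int(line[22:26])   # columns 23-26: residue sequence number
        if first <= res_seq <= last:
            kept.append(line)
    return kept


def make_atom_line(serial: int, res_name: str, chain: str, res_seq: int) -> str:
    """Build a minimal CA record in PDB column layout (toy data only)."""
    return (f"ATOM  {serial:5d}  CA  {res_name:>3s} {chain}{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Toy records: residues just outside and inside the assumed range, plus chain B.
toy_lines = [
    make_atom_line(1, "ALA", "A", 51),
    make_atom_line(2, "PRO", "A", 52),
    make_atom_line(3, "PRO", "A", 128),
    make_atom_line(4, "GLY", "A", 129),
    make_atom_line(5, "PRO", "B", 60),
    "REMARK   toy header line",
]
domain = extract_residue_range(toy_lines, chain="A", first=52, last=128)
```

Only the two chain A records at residues 52 and 128 survive; the header line, the chain B record and the flanking residues are dropped.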

Reflecting the preliminary evolutionary distribution of all 100 GH77 amylomaltases into four groups (Table S1), structures of the N-terminal domain of two other amylomaltases, from Kushneria marisflavi (Yun and Bae 2018; UniProt: A0A240US28) and Pelotomaculum thermopropionicum (Kosaka et al. 2008; UniProt: A5D1W1), were modelled using the fold recognition Phyre2 server (Kelley and Sternberg 2009; http://www.sbg.bio.ic.ac.uk/~phyre2/). The K. marisflavi and P. thermopropionicum amylomaltases represent the two clusters additional to those containing the experimentally characterised amylomaltases from E. coli and C. glutamicum.

All molecular docking trials were performed by the programme CB-Dock (Liu et al. 2020; http://clab.labshare.cn/cb-dock/), which utilises AutoDock Vina (Trott and Olson 2010; http://vina.scripps.edu/), with all parameters left at their default values. The structures of the N-terminal domains from E. coli (Weiss et al. 2015) and C. glutamicum (Joo et al. 2016) amylomaltases as well as the structural models of homologous domains in K. marisflavi (Yun and Bae 2018) and P. thermopropionicum (Kosaka et al. 2008) amylomaltases were docked with maltose (G2), maltotriose (G3), maltotetraose (G4) and β-cyclodextrin (β-CD). Three-dimensional structures of the ligands were retrieved from the PubChem database (Kim et al. 2019; https://pubchem.ncbi.nlm.nih.gov/) and converted into PDB coordinates by the SMILES programme (Weininger et al. 1988; https://cactus.nci.nih.gov/translate/). The resulting complexes of individual structures with bound ligands were displayed using the UCSF Chimera programme (Pettersen et al. 2004).

Results and discussion

Family GH77 has attracted a special interest not only as a member of the α-amylase clan GH-H (MacGregor et al. 2001; Lombard et al. 2014; Janecek and Gabrisko 2016), but also due to the fact that GH77 amylomaltases from borreliae contain unique substitutions in their amino acid sequence, especially the presence of a lysine two residues before the catalytic nucleophile instead of an arginine otherwise invariant throughout clan GH-H (Godany et al. 2008; Kuchtova and Janecek 2015). This difference was first observed in the amylomaltase from Borrelia burgdorferi (Machovic and Janecek 2003). The present study, however, focuses on a different feature of quite a large group of bacterial GH77 amylomaltases, namely a unique N-terminal extension comprising a separate domain (Fig. 1) that adopts an immunoglobulin-like fold, which is also characteristic for SBD CBM families (Janecek et al. 2019). The first in silico analysis predicting this domain in family GH77 (Kuchtova and Janecek 2015) was later confirmed by the crystal structures of amylomaltases from E. coli (Weiss et al. 2015) and C. glutamicum (Joo et al. 2016), having this N-terminal extension as opposed to the typical Thermus-like amylomaltases (Fig. 1). Notably, the N-terminal extension has two separated parts, the so-called N-terminal domain adopting an immunoglobulin-like fold and considered a potential novel SBD, which immediately precedes the catalytic TIM-barrel, and also a smaller domain N1 situated at the very N-terminus of the protein (Fig. 1).

Fig. 1

Domain arrangement of selected amylomaltases of family GH77. The enzymes from E. coli (a) and C. glutamicum (b) represent the group of bacterial amylomaltases with N-terminal extensions, a part of which, indicated as the “N-terminal domain”, constitutes a presumed CBM, i.e. a novel SBD. Individual domains or subdomains are coloured as follows: domain N1—grey; N-terminal domain (the potential novel SBD CBM family)—red; catalytic TIM-barrel domain A—blue; subdomains B1, B2 and B3—green, cyan and yellow, respectively. The residues of the catalytic triad of family GH77 (and hence of the entire clan GH-H)—aspartic acid, glutamic acid and aspartic acid, positioned on strands β4, β5 and β7, respectively, of the TIM-barrel, are shown as red diamonds. For comparison, the domain arrangement of a typical bacterial Thermus-like GH77 amylomaltase from Thermus aquaticus (c) is also shown

Sequence analysis and evolutionary relationships

The amino acid sequence alignment of the N-terminal domains of the 100 collected GH77 amylomaltases (Fig. 2), despite the obvious overall homology, also shows two larger distinguishable groups (Table S1): (i) sequences No. 1–49 represented by the amylomaltase from E. coli (Weiss et al. 2015); and (ii) sequences No. 50–100 represented by the C. glutamicum enzyme (Joo et al. 2016). These groups clearly differ by the length of their N-terminal domain, being 70–80 residues for the E. coli-like and ~ 100 residues for the C. glutamicum-like group (Fig. 2).

Fig. 2

Amino acid sequence alignment of the N-terminal domains of GH77 amylomaltases. The sequence order from the top corresponds to their appearance in the evolutionary tree (Fig. 3). Four proteins, representing four potential clusters in the tree (distinguished by different colours), are marked by an asterisk, i.e. amylomaltases from Kushneria marisflavi (blue), Escherichia coli (red), Pelotomaculum thermopropionicum (magenta) and Corynebacterium glutamicum (green). Each protein is labelled by the UniProt accession number and the name of the organism. The five suggested conserved sequence regions (CSR-1–CSR-5) are indicated above the alignment. The best-conserved aromatic position corresponding to Tyr108 in amylomaltase from E. coli and the two invariant residues (the glycine succeeding the Tyr108 and the proline at the end of the domain) are signified by dollar and hash symbols, respectively, below the alignment. The colour code for the selected residues: W, yellow; F, Y—blue; V, L, I—green; D, E—red; R, K—cyan; H—brown; C—magenta; G, P—black

The evolutionary tree (Fig. 3) shows that the selected 100 sequences (Fig. 2), in fact, segregate into four clusters: (i) sequences No. 1–17 (blue group in the tree); (ii) 18–49 (red); (iii) 50–64 (magenta); and (iv) 65–100 (green). The two best characterised amylomaltases, i.e. from E. coli (No. 49; Weiss et al. 2015) and C. glutamicum (No. 80; Joo et al. 2016), represent sequences No. 18–49 (red group; covering mostly Gammaproteobacteria) and No. 65–100 (green group; mostly Actinobacteria), respectively. In addition, the hypothetical amylomaltases from K. marisflavi (No. 8; Yun and Bae 2018; UniProt: A0A240US28) and P. thermopropionicum (No. 64; Kosaka et al. 2008; UniProt: A5D1W1) were chosen from the two remaining groups of sequences No. 1–17 (blue; covering mostly Proteobacteria) and No. 50–64 (magenta; mostly Firmicutes and Alphaproteobacteria). It is of note that the third experimentally characterised amylomaltase from B. longum having the N-terminal domain (No. 69; Jeong et al. 2020) is positioned in the cluster represented by C. glutamicum amylomaltase, thus reflecting its actinobacterial origin (Fig. 3). In conclusion, each of the two large groups recognised in the multiple alignment (Fig. 2) is formed by two evolutionarily independent clusters. This division unambiguously observed in the evolutionary tree (Fig. 3) is kept throughout the present study (Table S1).
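
The cluster boundaries given above can be written down as a small lookup, which also makes it easy to check that the four groups add up to the 100 analysed sequences. This is a Python sketch for illustration; the dictionary and function names are hypothetical, while the sequence numbers and colours are those quoted in the text.

```python
# Sequence-number ranges of the four clusters (Python ranges exclude the end).
CLUSTERS = {
    "blue": range(1, 18),      # No. 1-17, K. marisflavi group
    "red": range(18, 50),      # No. 18-49, E. coli group
    "magenta": range(50, 65),  # No. 50-64, P. thermopropionicum group
    "green": range(65, 101),   # No. 65-100, C. glutamicum group
}


def cluster_of(seq_no: int) -> str:
    """Return the colour label of the cluster containing a sequence number."""
    for colour, numbers in CLUSTERS.items():
        if seq_no in numbers:
            return colour
    raise ValueError(f"sequence number {seq_no} is outside 1-100")


sizes = {colour: len(numbers) for colour, numbers in CLUSTERS.items()}
```

The resulting sizes are 17, 32, 15 and 36 sequences, and `cluster_of(69)` places the B. longum enzyme in the green cluster together with C. glutamicum (No. 80).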

Fig. 3

Evolutionary tree of the N-terminal domains of GH77 amylomaltases. The tree displays the relatedness of 100 unique non-redundant GH77 sequences (Table S1) containing the N-terminal domain corresponding to an immunoglobulin-like fold observed in crystal structures of E. coli (Pugsley and Dubreuil 1988; Weiss et al. 2015) and C. glutamicum amylomaltases (Srisimarat et al. 2011; Joo et al. 2016). Each protein is labelled by the UniProt accession number and the name of the organism. Four proteins, representing four clusters illustrated on the tree and each distinguished by different colours, are marked by an asterisk, i.e. amylomaltases from Kushneria marisflavi (blue), Escherichia coli (red), Pelotomaculum thermopropionicum (magenta) and Corynebacterium glutamicum (green). The tree is based on the alignment of the N-terminal domains (Fig. 2)

Aromatic residues (phenylalanine, tyrosine and/or tryptophan) are usually responsible for binding α-glucans in SBDs classified in various CBM families (Janecek et al. 2011, 2019). Therefore, a thorough inspection of the 100 aligned N-terminal domains was performed to identify the best-conserved residues with special attention to conserved aromatic residues. Indeed, aromatic residues are conserved at several positions in the four identified clusters, e.g. Tyr76 and Trp78 in the group represented by the amylomaltase from E. coli (Fig. 2 shows these as Tyr25 and Trp27 in the E. coli sequence; No. 49) and Phe133 in the group represented by the C. glutamicum amylomaltase (Fig. 2 shows this as Phe63 in the C. glutamicum sequence; No. 80). These three positions, however, are not conserved in all four groups. Overall, the only widely, albeit not invariantly, conserved aromatic position corresponds to Tyr108 of the E. coli amylomaltase (Fig. 2, position Tyr57 in the E. coli sequence; No. 49). Remarkably, this tyrosine is invariant in three of the four groups, i.e. those containing E. coli, K. marisflavi and P. thermopropionicum amylomaltases. In the fourth group containing the C. glutamicum amylomaltase, it is substituted in 18 of 36 cases by tryptophan (Trp143) (Fig. 2, position Trp73 in C. glutamicum; No. 80), although, e.g. the B. longum amylomaltase, the third experimentally characterised enzyme of this set, contains a tyrosine in that position (Fig. 2, position Tyr65 in B. longum; No. 69). The Tyr108 in E. coli amylomaltase and corresponding aromatic residues in all the sequences support the hypothesis that the N-terminal domain is involved in α-glucan binding and defines a new SBD.
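
Two numbering schemes are used side by side here: that of the full-length enzyme and that of the isolated domain as shown in Fig. 2. The residue pairs quoted above imply a constant offset for each enzyme, which the following Python sketch derives and applies. The function and variable names are hypothetical; the residue numbers are taken from the text.

```python
from typing import Dict, List, Tuple

# (full-length number, Fig. 2 number) pairs quoted in the text.
PAIRS: Dict[str, List[Tuple[int, int]]] = {
    "E. coli": [(76, 25), (78, 27), (108, 57)],
    "C. glutamicum": [(133, 63), (143, 73)],
}


def domain_offset(pairs: List[Tuple[int, int]]) -> int:
    """Return the single offset shared by all pairs, or fail if they disagree."""
    offsets = {full - local for full, local in pairs}
    if len(offsets) != 1:
        raise ValueError(f"inconsistent numbering offsets: {sorted(offsets)}")
    return offsets.pop()


OFFSETS = {organism: domain_offset(p) for organism, p in PAIRS.items()}


def to_domain(organism: str, full_length_number: int) -> int:
    """Convert full-length numbering to the Fig. 2 domain numbering."""
    return full_length_number - OFFSETS[organism]


def to_full_length(organism: str, domain_number: int) -> int:
    """Convert Fig. 2 domain numbering back to full-length numbering."""
    return domain_number + OFFSETS[organism]
```

The pairs are internally consistent, giving an offset of 51 for the E. coli enzyme and 70 for the C. glutamicum enzyme.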

It is worth mentioning, however, that this Tyr108 can hardly be related to the two functional tyrosines (Tyr54 and Tyr101) from the catalytic domain identified in the Thermus aquaticus amylomaltase as contributing to the second α-glucan binding site (Fujii et al. 2007). Moreover, neither Tyr54 nor Tyr101 represents an invariantly conserved residue in family GH77, as demonstrated by a previous in silico study comparing more than 400 sequences of GH77 amylomaltases (Kuchtova and Janecek 2015).

Regarding fully invariant residues, the N-terminal domains of bacterial GH77 amylomaltases have only two, Gly107 and Pro128 (E. coli amylomaltase numbering), of which the former precedes the potential α-glucan binding Tyr108, while the latter is the C-terminal residue of the domain (Fig. 2; positions Gly56 and Pro77, respectively, in E. coli; No. 49). Glycine and proline residues are both significant in the so-called consensus sequence of the first known SBD (Svensson et al. 1989; Janecek and Sevcik 1999), belonging to the current family CBM20 (Janecek et al. 2011, 2019; Lombard et al. 2014).

In an effort to focus attention on segments potentially bearing residues involved in α-glucan binding, the five best-conserved short stretches were proposed to constitute conserved sequence regions (CSRs) of the N-terminal domains in bacterial GH77 amylomaltases as follows (Fig. 2; E. coli amylomaltase sequence and numbering): (i) CSR-1—55_PNVMVYTSG; (ii) CSR-2—66_MPMVVE; (iii) CSR-3—80_LTTE; (iv) CSR-4—105_PEGYHTLT; and (v) CSR-5—121_HCRVIVAP. Identification of CSRs has become typical for catalytic domains to emphasise key residues involved in activity and/or substrate specificity. This is particularly important for large and polyspecific GH families, such as in the individual α-amylase families GH13, GH57, GH119 and GH126 (Janecek 2002; Blesak and Janecek 2012, 2013; Janecek and Kuchtova 2012; Janecek et al. 2014; Janecek and Gabrisko 2016; Kerenyiova and Janecek 2020a, b). It also makes sense, however, to establish CSRs for putative SBDs. Among all known SBDs, currently classified in 15 CBM families in CAZy (Lombard et al. 2014), residues from the best-conserved regions usually belong to one or in some cases to two binding sites (Janecek et al. 2011, 2019).

In agreement with four clusters being found in the evolutionary tree of N-terminal domains from bacterial GH77 enzymes (Figs. 2 and 3), sequence logos covering 35 positions for the five proposed CSRs were created for each cluster (Fig. 4). All logos contain the invariant Gly107 and Pro128, at CSR-4 position 22 and CSR-5 position 35, respectively (E. coli amylomaltase numbering), plus the potentially α-glucan binding Tyr108 at CSR-4 position 23. The tripeptide GYH in CSR-4 that is invariant in two of the four evolutionary clusters, containing E. coli and P. thermopropionicum amylomaltases, respectively (Fig. 4b, c), is one of the best-conserved stretches in the N-terminal domain. The other highly conserved region is found at the last five residues, L/I/V-A/I-V/I-A/T-P, at CSR-5 positions 31–35. At the start of the logo, conserved proline residues (CSR-1; positions 1–2) are seen except in the first position in the logo of the group of P. thermopropionicum amylomaltase (Fig. 4c). Concerning additional positions of interest, Glu83 (E. coli amylomaltase numbering; CSR-3 position 19) is almost invariantly conserved and deserves attention.
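
The relation between E. coli residue numbers and the 35 logo positions follows directly from the start residues and lengths of the five CSRs defined above. A short Python sketch of that mapping is given below; the names `CSRS`, `build_logo_map` and `LOGO_MAP` are hypothetical, while the start residues and segment sequences are those listed in the text.

```python
from typing import Dict, List, Tuple

# (name, first residue in E. coli numbering, E. coli sequence of the segment)
CSRS: List[Tuple[str, int, str]] = [
    ("CSR-1", 55, "PNVMVYTSG"),
    ("CSR-2", 66, "MPMVVE"),
    ("CSR-3", 80, "LTTE"),
    ("CSR-4", 105, "PEGYHTLT"),
    ("CSR-5", 121, "HCRVIVAP"),
]


def build_logo_map(csrs: List[Tuple[str, int, str]]) -> Dict[int, Tuple[str, int, str]]:
    """Map residue number -> (CSR name, logo position, residue letter)."""
    mapping: Dict[int, Tuple[str, int, str]] = {}
    logo_pos = 1
    for name, start, seq in csrs:
        for i, aa in enumerate(seq):
            mapping[start + i] = (name, logo_pos, aa)
            logo_pos += 1
    return mapping


LOGO_MAP = build_logo_map(CSRS)
```

The lookup reproduces the assignments stated in the text, e.g. Gly107 at CSR-4 position 22, Tyr108 at position 23, Glu83 at CSR-3 position 19 and Pro128 at CSR-5 position 35.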

Fig. 4

Sequence logos of N-terminal domains from individual clusters of family GH77 bacterial amylomaltases. The four groups (for details, see Table S1) are represented by a Kushneria marisflavi (Proteobacteria; ~ 70 residues long; 17 sequences); b Escherichia coli (Gammaproteobacteria; ~ 80 residues long; 32 sequences); c Pelotomaculum thermopropionicum (Firmicutes and Alphaproteobacteria ~ 100 residues long; 15 sequences); and d Corynebacterium glutamicum (Actinobacteria; ~ 90 residues long; 36 sequences). CSR-1, residues 1–9; CSR-2, residues 10–15; CSR-3, residues 16–19; CSR-4, residues 20–27; CSR-5, residues 28–35. The aromatic residue, marked by the red asterisk, corresponding to Tyr108 and Trp143 in E. coli and C. glutamicum amylomaltases, respectively, is potentially involved in stacking interactions with α-glucans. All other residues involved in hydrogen bonding contacts with docked α-glucans (G2, G3, G4 and β-CD; for details, see Table 1) are indicated by black asterisks

Importantly, the sequence logos identify positions suitable to discriminate the four clusters by showing positions illustrating their individual uniqueness. This is a key attribute of CSRs, hence of sequence logos, which are well known and widely utilised in the α-amylase families mentioned above to distinguish subfamilies and/or enzyme specificities (Janecek 2002; Blesak and Janecek 2012, 2013; Janecek and Kuchtova 2012; Janecek et al. 2014; Janecek and Gabrisko 2016; Kerenyiova and Janecek 2020a, b). Notably, it was described for a single SBD CBM family (Janecek et al. 2019), CBM41 composed of two genuine groups mutually distinguished by a characteristic sequence pattern of three essential aromatic residues, “W‐W‐∼10aa‐W” and “W‐W‐∼30aa‐W”, implying that the position of the third tryptophan in the pattern is not shared by the two groups (Janecek et al. 2017). Similarly, only the C. glutamicum amylomaltase group has an invariantly conserved histidine, His90 in CSR-2 position 15 (Fig. 4d). Another example of interest is the well-conserved cysteine in the E. coli amylomaltase group (Cys122; Fig. 4b; CSR-5 position 29).

Tertiary structure comparison and molecular docking

The amylomaltases having the N-terminal domain are not the only family GH77 members that possess extra sequence compared to the canonical Thermus-like amylomaltases of just the catalytic TIM-barrel domain and a few inserted subdomains (Fig. 1). In addition to amylomaltases from borreliae that may have single unique substitutions even in functionally important positions, but still within the otherwise conserved basic domain arrangement (Machovic and Janecek 2003; Godany et al. 2008; Kuchtova and Janecek 2015; Janecek and Gabrisko 2016), in the Eucarya, the DPE2 version exists as a typical GH77 amylomaltase with a ~ 140 residues insertion between the catalytic nucleophile and proton donor in the TIM-barrel which, moreover, is preceded by two SBDs of CBM20 (Lloyd et al. 2004; Steichen et al. 2008; Kuchtova and Janecek 2015; Janecek et al. 2019). Unfortunately, no three-dimensional structure is available for DPE2 and the structural fold of the ~ 140-residue insertion is not known, neither is its potential function (Steichen et al. 2008), nor if it is essential for activity (Ruzanski et al. 2013). Remarkably, however, the function of DPE2 with the two CBM20s in Arabidopsis thaliana was effectively retained by replacement with amylomaltase from E. coli (Ruzanski et al. 2013). In that light, it may be relevant that the N-terminal domain in the E. coli enzyme (Fig. 1) has an immunoglobulin-like fold typical of SBDs (Janecek et al. 2019), and can be speculated to act as an SBD.

The two crystal structures from E. coli (Weiss et al. 2015) and C. glutamicum (Joo et al. 2016) describe amylomaltases with the investigated N-terminal domain (Fig. 1). In neither case was the structure of the N-terminal domain obtained in complex with an α-glucan. Since the thorough phylogenetic analysis revealed four clusters (Fig. 3), models for structural comparison were made of N-terminal domains of two hypothetical amylomaltases of K. marisflavi and P. thermopropionicum from the other two clusters. In agreement with the closer evolutionary relatedness observed on the one hand for the E. coli and K. marisflavi groups and on the other for the C. glutamicum and P. thermopropionicum groups (Fig. 3), N-terminal domains of K. marisflavi and P. thermopropionicum were modelled using those of amylomaltases from E. coli (PDB code: 4S3R; Weiss et al. 2015) and C. glutamicum (PDB: 5B68; Joo et al. 2016), respectively, as templates. Notably, none of the established SBD CBM families (Janecek et al. 2019) with available three-dimensional structures were identified as a suitable template.

To get an idea of how an α-glucan binds if the N-terminal domain acted as an SBD, maltose (G2), maltotriose (G3), maltotetraose (G4) and β-cyclodextrin (β-CD) were docked. Since the SBDs from various CBM families have already been demonstrated to retain their binding abilities even if being separated from their catalytic domains, i.e. they can preserve the binding also in an isolated form (for a review, see Janecek et al. 2019), the docking was in each case performed with extracted N-terminal domains. Although it may not reflect the situation in real amylomaltases completely, for the purpose of the present study it has been considered sufficient.

To avoid biasing the results, it is relevant that docking by CB-Dock does not require the ligand to be directed to a target site in the protein (Liu et al. 2020), i.e. to the presumed aromatic binding residue Tyr108 in amylomaltase from E. coli and its counterparts, Trp143, Tyr121 and Tyr146 from C. glutamicum, K. marisflavi and P. thermopropionicum, respectively. Importantly, the docking of all four ligands suggests that there is a single α-glucan-binding site in every studied N-terminal domain and that these sites correspond to each other (cf. Figs. 5 and 6). Nevertheless, it seems that at least Tyr121 and Tyr146 from K. marisflavi and P. thermopropionicum amylomaltases have not been recognised as making direct hydrogen bond contacts with all four ligands: Tyr121 may be involved with G2, G4 and β-CD, whereas Tyr146 appears to be involved just with G2 (Table 1). Note that Table 1 summarises only the residues involved in hydrogen bond contacts. Importantly, the aromatic residues Tyr108, Trp143, Tyr121 and Tyr146 may also interact with the respective ligands via stacking interactions (cf. Figs. 5 and 6). Results of all individual docking trials are summarised in Table 1. Notably, these complexes indicated an appropriate binding energy ranging from the best of − 5.9 kJ/mol for maltotriose bound to the N-terminal domain from K. marisflavi amylomaltase to − 3.7 kJ/mol for the maltose complex of the P. thermopropionicum N-terminal domain (Table 1). Interestingly, in addition to the aromatic position represented by Tyr108 (E. coli amylomaltase numbering) that might interact by stacking onto glucose moieties, a few other residues were involved in hydrogen bond formation with docked α-glucans (Table 1), corresponding, e.g. to the positions of Glu83 (CSR-3), Thr110 (CSR-4), Thr112 (CSR-4) and Arg123 (CSR-5) in E. coli amylomaltase (Fig. 4).

Fig. 5

Visualisation of molecular docking experiments with real structures. The N-terminal domain of family GH77 amylomaltases from a, b Escherichia coli (PDB: 4S3Q; Weiss et al. 2015) and c, d Corynebacterium glutamicum (PDB: 5B68; Joo et al. 2016) with docked β-cyclodextrin (a, c) and maltose (b, d). Side-chains of residues potentially involved in carbohydrate binding are displayed (in black) and labelled accordingly (for details, see Table 1). The best-conserved aromatic residue, i.e. Tyr108 of the E. coli amylomaltase and Trp143 of the C. glutamicum counterpart (cf. Fig. 2), which is possibly involved in stacking interactions with glucose moieties, is specifically colour-highlighted

Fig. 6

Visualisation of molecular docking experiments with modelled structures. The models of the N-terminal domain of family GH77 amylomaltases from a, b Kushneria marisflavi (UniProt: A0A240U528; template: 4S3R; Weiss et al. 2015) and c, d Pelotomaculum thermopropionicum (UniProt: A5D1W1; template: 5B68; Joo et al. 2016) with docked β-cyclodextrin (a, c) and maltose (b, d). Side-chains of residues potentially involved in carbohydrate binding are displayed (in black) and labelled accordingly (for details, see Table 1). The best-conserved aromatic residue, i.e. Tyr121 of the K. marisflavi amylomaltase and Tyr146 of the P. thermopropionicum counterpart (cf. Fig. 2), which is possibly involved in stacking interactions with glucose moieties, is also highlighted, regardless of whether or not it is involved in hydrogen bond contacts in the docking trials (Table 1)

Table 1 Characteristics of docking trials of the N-terminal domain of four selected amylomaltases

There are two positions of interest, Gly107 and Pro128 (E. coli amylomaltase numbering), which are invariantly conserved (Fig. 2). However, neither the glycine nor the proline at these positions was found to be involved in binding α-glucans in the docking trials (Fig. 4 and Table 1). A few other glycine and proline residues might interact with the tested α-glucans (Table 1). Some of these residues are positioned within the identified CSRs (cf. Figs. 2 and 4) and others are located outside these regions, but there seems to be no tendency to preserve their potential binding function.

Each N-terminal domain thus exhibits the potential to act as an SBD with at least one starch-binding site, as is the case in the currently established SBD CBM families (Janecek et al. 2019). Admittedly, docking of the individual α-glucans resulted in slightly different arrangements of binding residues within the same single binding site (Table 1). Seeing different residues in the binding site of a potential SBD (or a CBM in general) may not be too surprising, since an SBD (CBM) typically has one or two main binding residues, with the overall binding assisted by different surrounding residues depending on the particular case (Penninga et al. 1996; Sorimachi et al. 1997; Janecek et al. 2017, 2019). In the present case of the N-terminal domain of GH77 amylomaltases, the main binding residue is suggested to be Tyr108 of E. coli amylomaltase (and its counterparts in homologous amylomaltases). Most important, however, is the fact that each of the four maltooligosaccharides (G2, G3, G4 and β-CD) was docked in each N-terminal domain (from four different bacterial origins) within the mutually corresponding single potential binding site (cf. Figs. 5 and 6; Table 1).
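The design of the docking experiment, four ligands against four N-terminal domains, can be enumerated explicitly. The Python sketch below is a minimal illustration under stated assumptions: it uses only the organisms, ligands and proposed key aromatic residues named in the text, the identifiers `DOMAINS`, `LIGANDS` and `docking_trials` are hypothetical, and no docking is run or reproduced.

```python
from itertools import product

# N-terminal domains and their proposed key aromatic residue, as named in the text.
DOMAINS = {
    "E. coli": "Tyr108",
    "C. glutamicum": "Trp143",
    "K. marisflavi": "Tyr121",
    "P. thermopropionicum": "Tyr146",
}

# Ligands docked into each domain (abbreviations as in the text).
LIGANDS = ["G2", "G3", "G4", "beta-CD"]


def docking_trials():
    """Return one (organism, key_residue, ligand) tuple per domain-ligand pair."""
    return [(org, DOMAINS[org], lig) for org, lig in product(DOMAINS, LIGANDS)]


trials = docking_trials()
print(len(trials), "domain-ligand combinations")
```

Each tuple corresponds to one domain-ligand combination of the kind summarised in Table 1.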

Conclusions

The present study provides an in silico analysis of the N-terminal domain from 100 selected bacterial amylomaltases classified in family GH77. This domain is predicted to function as a type of SBD that would define a novel CBM family. From the evolutionary point of view, these GH77 amylomaltases are divided into four clusters, roughly reflecting bacterial phyla and classes as follows: (i) Gammaproteobacteria; (ii) Proteobacteria; (iii) Firmicutes and Alphaproteobacteria; and (iv) Actinobacteria, illustrated by amylomaltases from E. coli, K. marisflavi, P. thermopropionicum and C. glutamicum, respectively. The conserved Tyr108 of E. coli amylomaltase, together with its counterparts throughout the four phylogenetic clusters, is proposed as the key residue responsible for α-glucan binding. Based on a careful sequence comparison, including the definition of CSRs, coupled with docking of linear maltooligosaccharides and β-CD, a few additional residues are predicted to belong to the starch-binding site. All candidate residues identified in the present study should be among the first targets for future mutational analysis. Experimental work to confirm the starch-binding role of this N-terminal domain has been initiated.