Introduction

Cellulose and xylan, from the plant cell wall, represent the largest pool of organic carbon in land ecosystems, whereas chitin, although abundant in terrestrial ecosystems, dominates in marine ecosystems. The deconstruction of these polysaccharides, outside the cell, by specific enzymes releases short oligo-/disaccharides, which can then be translocated into the cell and further processed to release energy (e.g., glycolysis, fermentation). Glycoside hydrolases (GHs) are essential enzymes required for the breakdown of these polysaccharides. These proteins, together with other carbohydrate-active enzymes (e.g., carbohydrate esterases (CE), polysaccharide lyases (PL), and lytic polysaccharide monooxygenases (LPMO)) (Lombard et al. 2014), support key processes across ecosystems. However, although abundant across environments (Berlemont and Martiny 2016), GHs are not randomly distributed (Berlemont and Martiny 2013). Most identified GHs are from microbes (e.g., Medie et al. 2012; Berlemont and Martiny 2013; Lombard et al. 2014; Berlemont and Martiny 2015; Berlemont 2017) and invertebrates (see Guo et al. 2008; Rahman et al. 2014).

Large-scale comparisons of sequenced bacterial genomes reveal that not all the lineages have genes for GHs potentially involved in polysaccharide deconstruction, whereas most microbes have genes for the processing of polysaccharide deconstruction products (i.e., oligosaccharides) (e.g., Berlemont and Martiny 2013). Microorganisms targeting short deconstruction products, sometimes referred to as the opportunists (Berlemont et al. 2014), contribute indirectly to the process of polysaccharide deconstruction by keeping local oligosaccharide concentrations low, thus preventing the product from inhibiting the enzyme (see Rignall et al. 2002; Gefen et al. 2012; Xu et al. 2013; Bailey et al. 2013). In contrast, organisms targeting larger substrates have evolved several strategies to degrade polysaccharides (Wilson 2011; Talamantes et al. 2016). Briefly, microbes produce many single-domain enzymes consisting of one unique GH-catalytic domain, sometimes associated with accessory non-catalytic domain(s) such as carbohydrate-binding modules (CBMs). These CBMs direct their associated catalytic domains to specific substrates, increase the local concentration of enzymes, reduce enzyme diffusion, and help relax the crystalline structure of the substrate. In so doing, they improve the overall catalytic efficiency of the hydrolytic systems (Din et al. 1991; Hervé et al. 2010). Interestingly, still other degraders produce proteins with multiple catalytic domains (Gibbs et al. 1992; Brunecky et al. 2013; Talamantes et al. 2016). Finally, some bacteria and fungi evolved non-covalent modular multi-protein complexes consisting of several GHs called cellulosomes (Artzi et al. 2016; Haitjema et al. 2017).

Large-scale comparisons of sequenced genomes highlight the phylogenetic conservatism of enzymes involved in polysaccharide deconstruction and the variability of the domain organization in GHs from closely related strains (e.g., Talamantes et al. 2016; Berlemont 2017). In bacteria, most members of the same genus share similar abilities for cellulose, xylan, and chitin deconstruction. Identified degrader lineages frequently display redundant enzymes that belong to the same GH family and are assumed to target the same substrate (Berlemont and Martiny 2015; Berlemont 2017). However, the extensive biochemical characterization of the “CAZome” of isolated microbes highlighted subtle variations in the substrate specificity, enzymology, and regulation of these apparently redundant enzymes (e.g., Ravachol et al. 2014).

Detailed information about characterized GHs is centralized on the carbohydrate-active enzymes database (CAZy, http://www.cazy.org) (Lombard et al. 2014). CAZy is an essential resource for scientists studying the processing of carbohydrates. Amongst others, CAZy provides a framework for the sequence-based classification of GHs (and other carbohydrate-active enzymes; CAZymes), a listing of the characterized enzymes, and some taxonomic information. CAZy also lists and classifies the many identified yet uncharacterized GHs from sequenced genomes. The classification of CAZymes reflects their structural and/or sequence similarity (Henrissat and Davies 1997). According to CAZy, most GH families are polyspecific and target various substrates (Henrissat and Davies 1997; Aspeborg et al. 2012). For example, at least 20 distinct enzymatic activities are listed for members of the GH family 5, whereas monospecific families display narrow substrate specificity. Specificity can also be investigated at the protein level. The vast majority of characterized GHs are monospecific, whereas few proteins or catalytic domains, able to accommodate multiple substrates, are polyspecific (e.g., Berger et al. 1989). Sometimes, multi-domain proteins (e.g., Brunecky et al. 2013) artificially inflate the polyspecificity of identified GH families on CAZy. For example, the unique xylanase in GH family 62 is a multi-domain xylanase/arabinofuranosidase from Streptomyces chattanoogensis UAH23, whose xylanolytic activity likely results from a GH10 domain (acc. num. AAD32559.2) (Hernández et al. 2001).
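The distinction between family-level and protein-level specificity can be made concrete with a short sketch. The Python example below is illustrative only: the record layout, the accessions (prot_A to prot_E), and the function name summarize_specificity are assumptions made for the example and do not correspond to any CAZy export format. A family is called polyspecific when its members together carry more than one EC number, and a protein is called polyspecific when it carries more than one EC number itself.

```python
from collections import defaultdict

# Hypothetical records: (GH family, protein accession, set of EC numbers).
# Accessions and assignments are invented for illustration only.
RECORDS = [
    ("GH5", "prot_A", {"3.2.1.4"}),
    ("GH5", "prot_B", {"3.2.1.78"}),
    ("GH5", "prot_C", {"3.2.1.4", "3.2.1.8"}),
    ("GH11", "prot_D", {"3.2.1.8"}),
    ("GH11", "prot_E", {"3.2.1.8"}),
]


def summarize_specificity(records):
    """Summarize family-level and protein-level specificity."""
    activities = defaultdict(set)              # family -> all EC numbers seen
    polyspecific_proteins = defaultdict(list)  # family -> multi-EC proteins
    for family, accession, ec_numbers in records:
        activities[family] |= ec_numbers
        if len(ec_numbers) > 1:
            polyspecific_proteins[family].append(accession)

    summary = {}
    for family, ecs in activities.items():
        summary[family] = {
            "activities": sorted(ecs),
            "family_level": "polyspecific" if len(ecs) > 1 else "monospecific",
            "polyspecific_proteins": polyspecific_proteins[family],
        }
    return summary


summary = summarize_specificity(RECORDS)
for family, info in summary.items():
    print(family, info)
```

Such a count does not separate single catalytic domains that accept several substrates from multi-domain proteins carrying several catalytic domains; this is the source of the inflated polyspecificity mentioned above, and resolving it requires the domain architecture of each protein.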

Although it lists GHs, CAZy provides no tools to annotate or analyze new sequences. The removal of the links to the Pfam and InterPro databases (in 2008) resulted in the development of alternative annotation systems for GHs and related enzymes (Park et al. 2010; Yin et al. 2012; Berlemont and Martiny 2013; Talamantes et al. 2016). In this context, three important questions remain. First, can the polyspecificity of GH families associated with the deconstruction of abundant polysaccharides be estimated? Knowing the targeted substrates and the activities in each family will provide an easy way to connect specific sequences to environmental processes (e.g., carbon cycling) (Treseder and Lennon 2015). Next, what are the taxonomic distribution and the substrate specificity of characterized GHs? Identifying clustered distribution of enzymes with particular substrate specificity could highlight their recent evolution and provide a comprehensive framework to isolate new enzymes with specific activities (Aspeborg et al. 2012). Finally, as the classification of CAZymes was first intended to be “more friendly to the needs of bioinformatics” (Henrissat and Davies 1997), we asked the question: can the annotation of GHs for cellulose, xylan, and chitin be achieved using publicly accessible tools? The rapid and reliable annotation of GHs in the growing number of sequenced genomes and microbiomes is essential because GH enzymes support key functions in cells and ecosystems (Knight et al. 2012). In order to answer these questions, we reviewed the functional and taxonomic distribution of characterized enzymes listed on the CAZy database, as of summer 2017. Sequences from characterized proteins from the GH families of interest were analyzed using Pfam-scan against the entire Pfam A database (Eddy 2011; Punta et al. 2012; Finn et al. 2014).

Cellulases

The enzymatic deconstruction of cellulose requires the interaction between some endo-acting GHs (i.e., endocellulase, EC.3.2.1.4) and some exo-acting GHs (i.e., exocellulase, EC.3.2.1.91/176) (Wilson 2011). These activities release cellooligosaccharides and cellobiose that are further degraded to glucose by β-glucosidases (EC.3.2.1.21, not discussed here). Besides these enzymes, 4-β-D-glucan glucohydrolases (EC.3.2.1.74) are exocellulases active on cellulose and releasing glucose directly. Both endo- and exocellulases are listed in the GH families 5, 6, 7, 9, and 48. In addition, according to CAZy, endocellulases have also been identified in GH families 8, 10, 12, 26, 44, 45, 51, 74, and 124 (Table 1, S1).

Table 1 Distribution of substrate specificity and activities of characterized enzymes from GH families associated with the deconstruction of cellulose, xylan, and chitin according to the CAZy database (in August 2017)

Most characterized cellulolytic enzymes are from the polyspecific GH family 5 (n = 547 listed proteins) (Aspeborg et al. 2012). Of the characterized GH5s, 61 and 1% act as endocellulases (EC.3.2.1.4) and exocellulases (EC.3.2.1.74/91), respectively. Non-cellulolytic GH5s target other plant cell wall polysaccharides and include 109 endo-β-1,4-mannosidases (EC.3.2.1.78), a few licheninases (EC.3.2.1.73, 2%), some xylanases (EC.3.2.1.8, 1.6%), and some xyloglucanases (EC.3.2.1.151, 1%). Besides these well-characterized enzymes, 46 of the listed GH5s have unspecified substrates (i.e., EC.3.2.1.-). Interestingly, 19 cellulolytic GH5s are polyspecific enzymes. Some of these proteins are multi-activity GHs (Talamantes et al. 2016), such as a GH5-CBM3-CBM3-GH44 protein from Caldicellulosiruptor saccharolyticus (acc. num. AAA71887.1) (Gibbs et al. 1992) and the GH5-GH26 protein from an uncultured bacterium (acc. num. ABB46200.1) (Palackal et al. 2007). Conversely, some enzymes with only one catalytic domain, such as the GH5-CBM2 protein identified in Butyrivibrio fibrisolvens H17c (acc. num. CAA35574.1) (Berger et al. 1989), also target multiple substrates, sometimes marginally, and are thus considered polyspecific enzymes. Similarly, although most GH5 endo-β-1,4-mannosidases target mannans only, a few enzymes such as the cel5B mannanase from Thermotoga maritima MSB8 target multiple substrates (acc. num. AAD36817.1) (Nelson et al. 1999). The GH family 8 (n = 74 characterized enzymes) is also polyspecific (Table 1). Endocellulases (n = 36), chitosanases (EC.3.2.1.132, n = 23), and a few xylanases (EC.3.2.1.8, n = 12) are listed. A few proteins from Bacillus (n = 1), Paenibacillus (n = 3), and Lysobacter (n = 1) are polyspecific and target lichenan and cellulose or chitosan (e.g., Ogura et al. 2006). Licheninase activity is never observed alone, however, suggesting that lichenan is not the primary target of characterized GH8s.
The GH family 12 (n = 68), also polyspecific, lists enzymes targeting cellulose or xyloglucan (EC.3.2.1.151) (Table 1). TrCel12A from Trichoderma reesei is the only polyspecific GH12 listed (EC.3.2.1.4/151, acc. num. AAE59774.1) (US patent #6187732).

Proteins in GH families 9, 44, and 45 are mostly monospecific endocellulases (> 95%), with only a few characterized enzymes from each family targeting a different substrate (Table 1). Non-cellulolytic GH9s target other plant polysaccharides (e.g., Cel9X xyloglucanase from Clostridium cellulolyticum H10, acc. num. ACL76949.1 (Ravachol et al. 2014)) and oligosaccharides (e.g., exo-β-D-glucosaminidase from Photobacterium profundum SS9, acc. num. CAG18943.1 (Honda et al. 2011)). Similarly, GH families 6, 7, and 48 are mostly monospecific exocellulases (Table 1). Finally, few cellulolytic enzymes are listed in GH families 10, 26, 51, and 74. Amongst others, 3 GH5 endocellulases were associated with catalytic domains from GH family 10 (xylanase) in Caldicellulosiruptor (Talamantes et al. 2016) and an uncultured bacterium (Saul et al. 1989). Conversely, in GH26, amongst the 72 characterized proteins listed, all four cellulolytic enzymes are polyspecific, suggesting that cellulose is not the primary target of GH26 (von Freiesleben et al. 2016). Finally, in GH families 51 and 74, only six and two proteins are cellulases, out of 78 and 27 characterized proteins, respectively. Most characterized GH51s are α-L-arabinofuranosidases (EC.3.2.1.55), whereas most GH74s are xyloglucanases (EC.3.2.1.151).

Few characterized GHs targeting cellulose have been identified in animals, including cellulases from termites (e.g., Coptotermes), crustaceans (e.g., Limnoria), and mollusks (e.g., Ampullaria and Aplysia) (e.g., Byrne et al. 1999; Guo et al. 2008; King et al. 2010). Most of these enzymes are likely involved in digestive functions (Watanabe and Tokuda 2010), whereas GH5 and GH9 cellulases from plants (e.g., Arabidopsis, Nicotiana) are likely involved in plant cell wall synthesis and remodeling (Vain et al. 2014). Few hydrolytic cellulases from archaea (e.g., GH5 and 12 from Crenarchaeota) (Huang et al. 2005; Graham et al. 2011) and algae (e.g., GH9 from Eisenia fetida, acc. num. BAM14716.1) have been characterized. However, besides these enzymes, the overwhelming majority of characterized cellulases are from bacteria or fungi. Characterized GH5, 6, and 12 from fungi account for 28, 34, and 36% of the characterized proteins, whereas bacterial enzymes account for 53, 62, and 55%, respectively (excluding the enzymes of unknown origin). Next, 86% of characterized GH7s are derived from fungi (mostly from Ascomycota), whereas some are from termites (e.g., Coptotermes) and their symbiotic protozoa Holomastigotoides. Similarly, in GH family 45, besides two bacterial enzymes from Cellvibrio (acc. num. ACE82688.1) and Fibrobacter (acc. num. ACX75523.1), 49% of characterized enzymes are derived from fungi (mostly from Ascomycota and Mucoromycota and few from Basidiomycota and Neocallimastigomycota) and 47% from animals (mostly arthropods and mollusks). The systematic investigation of sequenced bacterial and fungal genomes supports the skewed distribution of characterized GH7 and 45 in fungi (Berlemont and Martiny 2013; Berlemont 2017). Thus, due to their abundance in sequenced genomes, identifying sequences for GH7 can be used to estimate the contribution of fungi to plant cell wall deconstruction in the environment (Berlemont et al. 2014; Treseder and Lennon 2015).
Conversely, no characterized enzymes from GH families 8, 9, and 44 are from fungi. More precisely, 49 and 35% of the characterized GH8s are from Firmicutes and Proteobacteria, respectively, but few are from Fibrobacteres and Actinobacteria. In the GH9 family, 19, 20, and 48% of characterized enzymes derive from plants, animals, and bacteria (mostly Firmicutes and Proteobacteria), respectively. Similarly, most characterized GH44s and GH48s are from bacteria (mostly Firmicutes), whereas one GH44 is from the mollusk Bankia gouldi (acc. num. CAH68691.1) and one GH48 is from the insect Gastrophysa atrocyanea (acc. num. BAE94320.1). However, despite a skewed distribution in characterized enzymes, identified sequences for GH9 and 44 are abundant in bacteria and fungi (Berlemont 2017). Thus, amongst the cellulase families, only GH8 and GH48 can be predominantly associated with bacterial lineages. However, unlike GH7, many GH8s are not hydrolytic cellulases. Indeed, many GH8s are chitosanases in Firmicutes or non-cellulolytic cellulases associated with the bacterial cellulose synthesis operon in Proteobacteria (Berlemont and Martiny 2013; Römling and Galperin 2015), whereas GH48s are relatively rare in sequenced genomes.

Xylanases

Xylan, abundant in hemicellulose from the plant cell wall, consists of a linear backbone made of β-1,3/4 linked β-D-xylose “decorated” with various side groups such as acetyl groups in the O-2 and O-3 positions. Larger groups can also substitute the xylan backbone (e.g., arabinofuranosyl and 4-O-methyl glucuronyl) (Grantham et al. 2017). The enzymatic deconstruction of xylan requires first the removal of the side chains and then the deconstruction of the xylan backbone (Dodd and Cann 2009). Carbohydrate esterases (CE) are carbohydrate-active enzymes involved in the removal of the side chains from the substituted xylose units (Dodd et al. 2009), whereas specific GHs, called xylanases, are involved in the xylan backbone hydrolysis (Kulkarni et al. 1999; Dodd and Cann 2009). The removal of cumbersome side chains by CEs exposes the xylan backbone to xylanases and improves the overall deconstruction process (Vardakou et al. 2008). Amongst others, acetyl xylan esterases (EC.3.1.1.72) are found in CE families 1, 2, 3, 4, 5, 6, 7, 12, and 16, whereas feruloyl esterases (EC.3.1.1.73) are found in CE family 1 (Lombard et al. 2014). CE and endo-1,4-β-xylanase domains are frequently identified in multi-domain proteins (e.g., Xyn10D-Fae1A, acc. num. ACN78954.1 (Dodd et al. 2009)).

Xylanases/endo-β-1,4-xylanases (EC.3.2.1.8) are endo-acting GHs targeting the backbone of xylan from plants and seaweed, whereas endo-β-1,3-xylanases (EC.3.2.1.32) target β-1,3-linked xylose in seaweed (Konishi et al. 2012). Xylanases release xylobiose that is further degraded by β-xylosidases (EC.3.2.1.37, not discussed here). According to CAZy, endo-β-1,4-xylanases are found in GH families 3, 5, 8, 9, 10, 11, 12, 16, 26, 30, 43, 44, 51, 62, 98, and 141, whereas endo-β-1,3-xylanases are found in GH families 11 and 26 only. The vast majority of characterized xylanases are from GH families 10 and 11 (Lombard et al. 2014). In GH10 (n = 350 listed proteins) and GH11 (n = 271), 96.8 and 99.6% of the characterized enzymes are monospecific endo-1,4-β-xylanases, respectively, whereas few characterized endo-1,3-β-xylanases are found in GH11 (0.01%). In GH10, the few non-xylanolytic enzymes are two endocellulases (EC.3.2.1.4) and eight enzymes with unspecified substrate (EC.3.2.1.-). Most characterized xylanases have one catalytic domain, such as Xyn10A (acc. num. AGA16736.1) (Bai et al. 2012); however, some are multi-domain, multi-activity enzymes such as Xyn10D-Fae1A (i.e., GH10-CE1, acc. num. ACN78954.1) from Prevotella ruminicola 23 (Dodd et al. 2009).

Although polyspecific, the GH family 30 (n = 38 listed proteins) is dominated by endo-1,4-β-xylanases (n = 17). In this family, other activities include glucuronoarabinoxylan endo-1,4-β-xylanases (EC.3.2.1.136, n = 8) and β-xylosidases (EC.3.2.1.37, n = 5) also involved in xylan deconstruction. GH30 also contains a few non-xylanolytic activities such as β-glucosidase (EC.3.2.1.21, n = 2), β-1,6-glucanase (EC.3.2.1.75, n = 7), endo-β-1,6-galactanase (EC.3.2.1.164, n = 3), and some unspecified activity (EC 3.2.1.-) (n = 3).

Besides GH10, 11, and 30, a few xylanolytic enzymes were identified in cellulolytic GH families (see “Cellulases” section) and in some polyspecific families including GH26, 43, 51, 62, and 98, according to CAZy (Lombard et al. 2014). More precisely, only six characterized GH26s are bacterial β-1,3-xylanases (EC.3.2.1.32), whereas most GH26s target mannans (e.g., 74% β-mannanases (EC.3.2.1.78), 8% exo-β-1,4-mannobiohydrolase (EC.3.2.1.100)). Similarly, although most of the 151 listed GH43s target arabinans (e.g., 37% α-L-arabinofuranosidases (EC.3.2.1.55), 22% arabinanase (EC.3.2.1.99)) or xylobiose (EC.3.2.1.37, 34%), six proteins with a domain from GH family 43 are xylanolytic. Most of these xylanases are bifunctional multi-domain proteins with two catalytic domains (e.g., β-1,4-xylanase/α-L-arabinosidase (XynA) from Caldicellulosiruptor sp. Tok7B.1, acc. num. AAD30363.1) (Gibbs et al. 2000). Similarly, the unique “xylanase” from GH family 62 is a multi-domain xylanase/arabinofuranosidase from Streptomyces chattanoogensis UAH23 (acc. num. AAD32559.2) (Hernández et al. 2001). Finally, the only characterized GH51, out of 78 listed proteins, endowed with xylanolytic activity is a distantly related single polyspecific catalytic domain targeting cellulose and xylan from Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446 (acc. num. ACV57112.1) (Mavromatis et al. 2010).

Investigating the taxonomic origin of characterized proteins from GH family 10 and 11 supports their broad distribution in sequenced bacterial and fungal genomes (Berlemont and Martiny 2015; Berlemont 2017). In addition, two GH10 xylanases are from plants (e.g., maize xylanase) (Wu et al. 2002), and one GH11 is from the insect Phaedon cochleariae (acc. num. AGK45632.1) (Pauchet and Heckel 2013). Finally, in GH family 30, all the xylanases, except one from the nematode Meloidogyne incognita (acc. num. AAF37276.1) (Mitreva-Dautova et al. 2006) and one exo-xylanase from T. reesei RutC30 (acc. num. AAP64786.1) (US patent #6555335), are from bacteria. However, not all the bacterial GH30s are xylanases, and most non-xylanolytic GH30s target hemicellulose (e.g., glucuronoarabinoxylan-specific endo-β-1,4-xylanases (EC.3.2.1.136) and endo-β-1,6-galactanase (EC.3.2.1.164)). Most characterized eukaryotic GH30s are fungal enzymes also targeting hemicellulose (e.g., endo-β-1,6-galactanase (EC.3.2.1.164) from Trichoderma viride (acc. num. BAC84995.1) (Kotake et al. 2004)).

Chitinases

Chitin is a linear polysaccharide made of β-1,4 linked N-acetylglucosamine produced by fungi and arthropods. The enzymatic deconstruction of chitin requires chitinases. These GHs release the disaccharide chitobiose that is further processed by β-N-acetylhexosaminidase (EC.3.2.1.52). Chitin can also be deacetylated by specific chitin deacetylases (EC.3.5.1.41) to produce chitosan (not discussed here). Chitinolytic enzymes are endochitinases (EC.3.2.1.14) (Lombard et al. 2014) or exochitinases acting on either the reducing (EC.3.2.1.201) or the non-reducing end of chitin (EC.3.2.1.200) (e.g., Wang et al. 2001). Although there is no mention of exo-chitinase on CAZy, many chitinolytic enzymes are listed in the mostly monospecific GH families 18 and 19. In addition, few chitinases are listed in GH families 23 and 48.

More precisely, in GH18, amongst 477 listed proteins, 91% display endochitinase activity, whereas three proteins are lysozymes targeting bacterial peptidoglycan (EC.3.2.1.17), 21 are endo-β-N-acetylglucosaminidases (EC.3.2.1.96) possibly involved in the deconstruction of chito-oligosaccharides (Gooday 1990; Jhaveri et al. 2015), and 25 target unspecified substrates (EC.3.2.1.-). Although most enzymes are monospecific, three polyspecific chitinases-lysozymes from Bacillus cf. pumilus SG2 (acc. num. ABI15082.1), Hevea brasiliensis (acc. num. CAA07608.1), and Nicotiana tabacum (acc. num. CAA55128.1) are identified (e.g., Bokma et al. 2000). Next, 172 out of 176 proteins from GH family 19 are monospecific endochitinases, three are lysozymes, and one targets an unspecified substrate (acc. num. BAE86996.1). Finally, GH families 23 and 48, described mostly as lysozyme (type G) (Wohlkönig et al. 2010) and exocellulase families, respectively, contain one chitinolytic enzyme each. Interestingly, the GH48 chitinase from the beetle G. atrocyanea is the only characterized member of the GH family 48 derived from a eukaryote (acc. num. BAE94320.1) (Fujita et al. 2006).

When excluding the enzymes of unknown origin, 51% of characterized GH18s are derived from eukaryotes. More precisely, 20.4, 11.9, 8.4, and 4.5% are from Ascomycota, plants, arthropods, and chordates, respectively. Conversely, 46% of characterized GH18s are bacterial enzymes: 20.7, 16.2, 7.4, and 1.7% derive from Proteobacteria, Actinobacteria, Firmicutes, and Bacteroidetes, respectively. Finally, 2.4% derive from Euryarchaeota. Most characterized GH19s are derived from plants (88.8%), whereas few originate from bacteria (mostly Actinobacteria and some Proteobacteria) and none from fungi. The taxonomic origin of characterized chitinases from GH18 and GH19 reflects the broad distribution of identified chitinases in sequenced genomes (Suzuki et al. 2001; Kawase et al. 2006; Bussink et al. 2007; Berlemont 2017). The skewed distribution of GH19 provides some rationale to link the identification of GH19 in sequenced microbiomes to the environmental deconstruction of chitin by bacteria.
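The taxonomic percentages reported in this review are computed after excluding enzymes of unknown origin. A minimal sketch of that calculation is given below; the function name taxonomic_shares, the label "unknown", and the ten example proteins are assumptions made for illustration and are not taken from the CAZy listing.

```python
from collections import Counter


def taxonomic_shares(origins, unknown_label="unknown"):
    """Percentage of proteins per taxon, ignoring proteins of unknown origin."""
    known = [origin for origin in origins if origin != unknown_label]
    if not known:
        return {}
    counts = Counter(known)
    return {taxon: round(100 * n / len(known), 1) for taxon, n in counts.items()}


# Invented example: ten proteins, two of them without taxonomic information.
origins = (
    ["Ascomycota"] * 2
    + ["Proteobacteria"] * 4
    + ["plants"] * 2
    + ["unknown"] * 2
)
shares = taxonomic_shares(origins)
print(shares)
```

Because the denominator only counts proteins with a known origin, the shares sum to 100% regardless of how many entries lack taxonomic information.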

Glycoside hydrolase identification

The identification of GHs is a prerequisite for understanding polysaccharide deconstruction by isolated microorganisms (Youssef et al. 2013) and by complex communities of microorganisms (Hess et al. 2011), and for the identification of new enzymes for biotechnological applications (Brunecky et al. 2013). As the CAZy database provides no tool for sequence annotation and is not designed to ease the extraction and the analysis of information at large, we created a custom bioinformatic program to link (i) the functional annotation as listed on CAZy, (ii) the sequences retrieved from the NCBI database, and (iii), if possible, the taxonomic information for all characterized proteins from GH families involved in the deconstruction of cellulose, xylan, and chitin (Supplementary data). Next, sequences were analyzed using HMMscan (Eddy 2011) against the Pfam A database (Finn et al. 2014) as described by Talamantes et al. (2016); this approach is hereafter referred to as the Pfam-based annotation.
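To illustrate what a Pfam-based annotation involves in practice, the sketch below parses the per-domain table written by hmmscan (option --domtblout) and converts Pfam accessions into GH families. It is not the custom program used here. The function name parse_domtblout, the E-value cutoff of 1e-5, the sample table, and the protein names protX and protY are all assumptions. Only PF00150 (GH5 cellulase) is taken from this review; the PF00331 entry for GH10 is an assumption that should be checked against the Pfam release in use.

```python
# Pfam accession -> GH family. PF00150 is cited in this review; PF00331 is an
# assumed mapping for GH10 that should be verified before use.
PFAM_TO_FAMILY = {"PF00150": "GH5", "PF00331": "GH10"}

# Invented hmmscan --domtblout rows (22 whitespace-separated columns followed
# by a free-text description).
SAMPLE = """\
# target name  accession  tlen  query name  accession  qlen  ...
Cellulase      PF00150.23 281 protX - 900 1e-60 200.0 0.1 1 1 1e-63 1e-60 199.5 0.1 2 280 500 780 499 781 0.95 Cellulase (glycosyl hydrolase family 5)
Glyco_hydro_10 PF00331.25 320 protX - 900 1e-80 260.0 0.2 1 1 1e-83 1e-80 259.0 0.2 1 319 30 350 29 351 0.97 Glycosyl hydrolase family 10
Cellulase      PF00150.23 281 protY - 400 0.5 5.0 0.0 1 1 0.4 0.5 4.8 0.0 10 60 100 150 99 151 0.60 Cellulase (glycosyl hydrolase family 5)
"""


def parse_domtblout(text, pfam_to_family, max_evalue=1e-5):
    """Return {protein: [GH families ordered along the sequence]}."""
    hits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            continue                              # malformed row
        pfam_acc = fields[1].split(".")[0]        # strip the version suffix
        query = fields[3]
        i_evalue = float(fields[12])              # independent domain E-value
        ali_from, ali_to = int(fields[17]), int(fields[18])
        family = pfam_to_family.get(pfam_acc)
        if family is None or i_evalue > max_evalue:
            continue
        hits.setdefault(query, []).append((ali_from, ali_to, family))
    # Sort domains by start coordinate to recover the domain architecture.
    return {q: [fam for _, _, fam in sorted(doms)] for q, doms in hits.items()}


architectures = parse_domtblout(SAMPLE, PFAM_TO_FAMILY)
print(architectures)
```

Ordering the domains by alignment coordinates makes multi-domain proteins, such as the GH5/GH10 combinations discussed in the “Cellulases” section, visible as a single architecture rather than as two unrelated hits.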

We focused on GH families predominantly involved in cellulose, xylan, and chitin deconstruction. The Pfam-based annotation of characterized proteins (Table 1) correctly identified 2326 proteins out of 2409 tested sequences (96.6%). More precisely, HMMscan identified 514 PF00150-cellulase domains out of 537 sequences retrieved from the GH family 5 (96%). Similarly, more than 90% of the domains from analyzed GH families were categorized as expected, with the exception of proteins from GH families 30 and 45. Notably, a detailed analysis of sequences from GH family 30 highlighted its polyspecificity and its structural complexity (St John et al. 2010). Eventually, this resulted in the creation of up to eight subfamilies, and all the xylanolytic GH30s fall into the subfamily GH30_8. In GH family 45, nine of the 55 characterized proteins were misannotated. However, these cellulases are derived from invertebrates (e.g., Aplysia, acc. num. BAP19116.1) and are known to be distantly related to the other GH45s (Rahman et al. 2014). Finally, the identification of GH families marginally associated with cellulose, xylan, and chitin deconstruction was also supported by Pfam-based annotation (Table S1).
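The proportions quoted above follow directly from the reported counts, as the short calculation below shows. The GH45 entry assumes that the 46 proteins not flagged as misannotated (55 minus 9) were identified correctly, which is implied but not stated explicitly; the dictionary and function names are illustrative.

```python
# (correctly identified, total tested) taken from the counts reported above.
COUNTS = {
    "all tested sequences": (2326, 2409),
    "GH5 (PF00150)": (514, 537),
    "GH45": (55 - 9, 55),  # assumption: the nine misannotated proteins are the only misses
}


def percent_identified(identified, total):
    """Share of sequences recovered by the annotation, in percent."""
    return round(100 * identified / total, 1)


rates = {name: percent_identified(*pair) for name, pair in COUNTS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The GH45 value falls below the 90% observed for most families, consistent with the exception noted for that family.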

Future directions

The vast majority of sequences from the GH families involved in cellulose, xylan, and chitin deconstruction can be identified using Pfam-based annotation, a documented and publicly accessible system (Eddy 2011; Finn et al. 2014). Although similar annotation systems exist (e.g., Yin et al. 2012), using the Pfam A database allows for the identification of accessory domains not listed in the CAZy database (Talamantes et al. 2016). However, some recently created GH families containing a reduced number of sequences (e.g., six identified and one characterized sequences in GH124, acc. num. ABN51673.1) cannot yet be identified using this approach.

The conserved substrate specificity in monospecific GH families provides a way to link the occurrence of specific GH domains to environmental processes (Stursová et al. 2012; Berlemont and Martiny 2016). Moreover, based on the skewed taxonomic distribution of domains from GH families, the studied processes can be attributed to specific lineages (Berlemont et al. 2014; Treseder and Lennon 2015; Lladó et al. 2017). In polyspecific families (GH5, 8, and 30), the predictive power is reduced. However, characterized GH5s target mostly cellulose or hemicellulosic substrates, GH8s target mostly cellulose or chitosan (i.e., deacetylated chitin), and GH30s target mostly xylan or hemicellulosic substrates. In these polyspecific families, the substrate specificity cannot be identified using the HMM-based annotation. Detailed characterization of these families eventually led to the creation of monospecific GH subfamilies (St John et al. 2010; Aspeborg et al. 2012). In the future, the growing number of sequences in poorly represented GH families (e.g., GH124) and in each subfamily (e.g., GH30_8) will allow the creation of specific HMM profiles to further identify proteins from these groups. In the meantime, the compiled sequence dataset (Supplementary data) will be more friendly to the needs of bioinformatics. This dataset can be used for the creation of a custom database to be used with the Basic Local Alignment Search Tool (BLAST) in order to perform detailed sequence comparison, identify close relatives, and help predict the mode of action or the substrate specificity of proteins from polyspecific GH families.
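As an illustration of how such a custom BLAST database could be used, the sketch below reads BLAST tabular output (the 12 default columns of -outfmt 6) and transfers the annotation of the best-scoring characterized protein to each query. The file names in the comments, the identifiers (query_1, ref_1, and so on), the annotations, and the function name best_hits are hypothetical; only the makeblastdb and blastp commands and the column order are standard BLAST+ features.

```python
import csv
import io

# Typical commands (file names are placeholders):
#   makeblastdb -in characterized_GH.fasta -dbtype prot -out characterized_GH
#   blastp -query new_proteins.fasta -db characterized_GH -outfmt 6 -out hits.tsv

BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Hypothetical annotations for characterized reference proteins.
ANNOTATIONS = {
    "ref_1": "GH5 endocellulase (EC.3.2.1.4)",
    "ref_2": "GH5 endo-beta-1,4-mannosidase (EC.3.2.1.78)",
}

# Invented tabular output for two query proteins.
ROWS = [
    ["query_1", "ref_1", "41.2", "300", "170", "6", "10", "305", "22", "320", "1e-60", "210.0"],
    ["query_1", "ref_2", "78.5", "310", "66", "1", "5", "314", "8", "317", "1e-150", "480.0"],
    ["query_2", "ref_1", "65.0", "280", "98", "0", "1", "280", "30", "309", "1e-110", "350.0"],
]
TABULAR = "\n".join("\t".join(row) for row in ROWS)


def best_hits(tabular_text, annotations):
    """Keep the highest-bitscore subject per query and attach its annotation."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) != len(BLAST6_COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        query, score = hit["qseqid"], float(hit["bitscore"])
        if query not in best or score > best[query]["bitscore"]:
            best[query] = {
                "subject": hit["sseqid"],
                "identity": float(hit["pident"]),
                "bitscore": score,
                "annotation": annotations.get(hit["sseqid"], "unannotated"),
            }
    return best


predictions = best_hits(TABULAR, ANNOTATIONS)
for query, info in predictions.items():
    print(query, info["subject"], info["identity"], info["annotation"])
```

A best hit only suggests a substrate; in polyspecific families the transferred annotation remains a hypothesis, and a minimum identity threshold would likely be needed before relying on it.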