Carbohydrate-Active enZymes (CAZymes) assemble, break down, and modify glycans and glycoconjugates using their catalytic and binding modules (functional protein domains). Since 1998, the CAZy database has offered an online and continuously updated classification of CAZyme modules (Lombard et al. 2014). Each module family in the CAZy classification has been created based on experimentally characterized protein modules from the literature, and the families are populated by related module sequences from public protein sequence databases. Since no universal threshold allows the systematic classification of the various CAZyme families, CAZy annotations result from an expert combination of module modeling/calibration and human curation. CAZy annotations are made publicly available for all proteins released by GenBank (Benson et al. 2012), Swiss-Prot (Boutet et al. 2016), and the Protein Data Bank (PDB; http://www.rcsb.org; Berman et al. 2000). Further, functional and 3-D structural information, curated from the literature on a regular basis, constitutes an essential added value of the CAZy annotation. In this spirit, the display of ligand information from crystallographic complexes has recently been developed (Lombard et al. 2014). This chapter guides the reader through the use of CAZy to search enzyme annotations. It also answers frequent questions such as (i) how to obtain CAZy annotations for a specific protein, a genome, or a metagenome, (ii) how to have a newly characterized family included in the CAZy classification scheme, (iii) why CAZy does not cover all protein families related to glycans/glycoconjugates, and (iv) why CAZy does not transfer functional annotation to similar sequences. Finally, we present a recent CAZy-associated tool, namely, the Polysaccharide Utilization Loci (PUL) predictor and database for Bacteroidetes species (Terrapon et al. 2015).

1 Classes of CAZy Modules

The CAZy classification covers sequences from all taxonomic groups and provides the basis for a common CAZyme nomenclature shared by glycobiologists, who often specialize in particular taxa. Among the large diversity of proteins acting on glycoconjugates, poly- and oligosaccharides, the CAZy classification covers several enzyme classes that catalyze their assembly, breakdown, or modification.

  • Glycosyltransferases (GTs) are the only class responsible for glycan assembly, forming glycosidic bonds from phospho-activated sugar donors by either inverting or retaining the anomeric configuration (Campbell et al. 1997; Coutinho et al. 2003).

  • Glycoside hydrolases (GHs) and polysaccharide lyases (PLs) are responsible for the cleavage of glycans (Lombard et al. 2010). GHs hydrolyze or transglycosylate glycosidic bonds, while PLs cleave the glycosidic bonds of uronic acid-containing polysaccharides by a β-elimination mechanism. Because of their widespread importance for biotechnological and biomedical applications, GHs and PLs are so far the best biochemically characterized enzymes in the CAZy database. Interestingly, GH-coding genes are abundant, occur in the vast majority of genomes, and account for almost half of the enzymes classified in CAZy, whereas PLs represent only a very small proportion (Table 6.1).

    Table 6.1 Statistics of the CAZy website content in April 2016

  • Because lignin is invariably found together with polysaccharides in the plant cell wall, CAZy recently integrated enzyme families known to be involved in lignin degradation along with lytic polysaccharide monooxygenases in a new class termed auxiliary activities (AAs) to accommodate the large range of mechanisms and substrates (Levasseur et al. 2013).

  • Carbohydrate esterases (CEs) are enzymes that remove O- or N-acyl substituents on glycans (Coutinho 1999) and thereby often facilitate the action of GHs and PLs on complex polysaccharides. However, as the specificity barrier between carbohydrate esterases and other esterase activities is thin, it is likely that the CAZy sequence-based classification incorporates some enzymes that may act on noncarbohydrate esters (as illustrated by the high proportion of CEs falling in the “Nonclassified” category – see below).

  • Carbohydrate-binding modules (CBMs) have no enzymatic activity per se but are known to potentiate many of the enzyme activities described above by targeting the enzyme to its substrate and promoting a prolonged interaction with it. Although CBMs can occasionally exist in isolated or tandem forms, they are usually combined with catalytic modules within enzymes (Boraston et al. 2004). For this reason, CBMs are set apart from other non-catalytic sugar-binding proteins (such as lectins and sugar transporters; see Sect. 6.6) and integrated in the CAZy classification scheme (Coutinho 1999).

The CAZy module classes are subdivided into families (see Table 6.1) based on amino acid sequence similarity, which almost invariably implies similar mechanisms. Families are designated using a simple formula combining the class and a number that reflects the order of family creation within the class, such as GT1 or GH130. However, the occurrence of enzymes that act on different substrates within a single family prevents the direct functional annotation of CAZymes based on family assignment. Phylogenetic analyses can frequently improve the correlation between sequence and specificity by defining subfamilies, as was done for families GH5 (Aspeborg et al. 2012), GH13 (Stam et al. 2006), GH30 (St John et al. 2010), GH43 (Mewis et al. 2016), and all PL families (Lombard et al. 2010). More subfamilies are currently under development internally in CAZy and could be released once in-depth analyses confirm that they remain stable as the number of sequences increases. Finally, some of the most remote homologs, for which sequence similarity is still detectable but can no longer guarantee any level of functional similarity or support a family assignment, are also reported, without family assignment, in a “Nonclassified modules” list for each CAZy class (see Table 6.1). The “Nonclassified modules” list is not a family per se but gathers many heterogeneous remote sequences, some of which might give rise to distinct CAZy families in the future.
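The naming formula described above (class abbreviation followed by a creation-order number) lends itself to simple parsing when handling family labels in scripts. The following is a minimal sketch and not part of any CAZy tool: the names `FamilyId` and `parse_family` are hypothetical, and the `_<n>` suffix used here to denote a subfamily is an assumption made for illustration.

```python
import re
from typing import NamedTuple, Optional


class FamilyId(NamedTuple):
    """Hypothetical container for a parsed family label."""
    cazy_class: str           # one of the class abbreviations used in this chapter
    number: int               # order of family creation within the class
    subfamily: Optional[int]  # None when the label carries no subfamily


# Class abbreviations as named in this chapter; the optional "_<n>" subfamily
# suffix is an assumption of this sketch.
_PATTERN = re.compile(r"^(GH|GT|PL|CE|AA|CBM)(\d+)(?:_(\d+))?$")


def parse_family(label: str) -> FamilyId:
    """Split a label such as 'GH130' into its class and family number."""
    match = _PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognizable CAZy family label: {label!r}")
    cazy_class, number, subfamily = match.groups()
    return FamilyId(cazy_class, int(number),
                    int(subfamily) if subfamily else None)


if __name__ == "__main__":
    for label in ("GT1", "GH130", "GH5_2"):
        print(label, parse_family(label))
```

Note that a parsed family says nothing about substrate specificity; as explained above, family membership alone does not support direct functional annotation.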

2 Browsing the CAZy Website

The homepage of the CAZy website includes a banner with several links to browse the CAZy annotation, either by CAZyme class (tabs labeled “enzyme classes” and “associated modules”) or by genome (see Fig. 6.1).

Fig. 6.1 Banner of the CAZy website where the user can choose to browse the data by CAZyme module class/family or search for a specific genome annotation

2.1 Browsing by CAZy Class and Families

2.1.1 CAZy Class Webpages

The webpage dedicated to each CAZy class, illustrated in Fig. 6.2, starts with an introduction to the module function, supplemented by some details about the catalytic mechanisms for GHs, PLs, and GTs. Further, some statistics are given about the number of occurrences of modules in each family of this class and about the most distant homologs assigned to this class but not to a family, referred to as “Nonclassified modules.” Finally, it provides the user with access to all families created in this class (links to individual webpages) in two tables: a simple ordered enumeration of existing families and a functionally oriented table that lists the different families by EC number. Please note that, due to the modular nature of CAZymes, these EC numbers may not be directly associated with the family but simply borne by adjacent modules. Consequently, enzyme families with more than one known activity appear several times in this table.

Fig. 6.2 Screenshot of the CAZy webpage that describes the PL class. Following an introductory description, statistics data and direct access to individual families are provided in different tables

Fig. 6.3 Screenshot of the CAZy webpage that describes the GH5 family information with the specific details on protein sequences attributed to subfamilies in the rightmost column (“Subfamilies” tab)

Fig. 6.4 Screenshot of the CAZy webpage corresponding to the GT3 family with the specific display (at the bottom) of structurally characterized proteins and related information (“Structure” tab)

2.1.2 CAZy Family Webpages

Each webpage dedicated to a CAZy family, illustrated in Figs. 6.3 and 6.4, contains a synthetic and updated report with all known activities (EC numbers and activity names) in the family. It should be noted that, contrary to the class webpage, the activities listed in the header of a family page correspond to the actual modules of the family and not to the activity of adjacent modules. The report also specifies the mechanism (e.g., inverting or retaining), structural fold, catalytic residues, etc. where known or appropriate. More extensive encyclopedic knowledge of the biology/chemistry of some families can be obtained through links to the CAZypedia resource (see Sect. 6.8). CAZy also provides statistics about the number of known modules in each family, the number of members with a 3-D structure, and the number of functionally characterized enzymes. Finally, the complete list of modules can be browsed through tabs that show either all modules, those of a specific kingdom of life, or only the structurally/experimentally characterized cases. Almost all tabs present modules as lines containing the protein name, EC numbers if any, the organism, the GenBank accessions (one reference in bold, and redundant ones below), and the UniProt and PDB identifiers (crystals not yet solved/deposited are labeled “cryst”). Further, for families divided into subfamilies, a column at the far right shows the subfamily number (see Fig. 6.3). The “Structure” tab is a special case (see Fig. 6.4): it contains neither GenBank nor UniProt accessions but instead displays more detailed information from the PDB files. For each PDB file, we extract and display the resolution when the structure was solved by X-ray crystallography (otherwise we indicate the method: powder diffraction or nuclear magnetic resonance).

2.1.3 Recent Addition of Carbohydrate Ligands

The PDB does not provide any option to perform a comprehensive search for carbohydrate structures found in CAZyme binding sites, and, unlike for proteins or nucleic acids, the nomenclature for carbohydrate residues within PDB files is not yet standardized. Significantly, PDB files do not describe how the isolated carbohydrate residues are linked to each other. For each PDB file, we thus extract the carbohydrate ligand information using PDB-care (www.glycosciences.de/tools/pdb-care/; Lütteke and Von Der Lieth 2004). These ligands are filtered for display as follows. N- and O-glycans covalently linked to Asn or Ser/Thr residues are discarded, as they correspond to posttranslational modifications of the protein structure and are generally not directly linked to enzyme function. The remaining carbohydrate ligands are retained, as they should describe functional recognition in catalytic or other binding sites in CAZymes, and are displayed in the structure pages of CAZy (see Fig. 6.4) following IUPAC nomenclature (Lombard et al. 2014). Not all carbohydrate structures are amenable to automated description by PDB-care. In a number of cases, we must manually curate and provide IUPAC descriptions for structures that are unsuitable for PDB-care, such as (i) nonreducing glycans (cyclodextrins, sucrose and sucrose derivatives, trehalose, kestose, raffinose, nystose, etc.), (ii) ligands that are made of both carbohydrate and noncarbohydrate moieties such as acarbose, (iii) thio-oligosaccharides, (iv) fluorine-containing carbohydrates, and (v) oligosaccharides containing 3,6-anhydro bridges. In addition, automated scripts have been devised to handle close to 200 carbohydrate analogues that we denote <carb_like_ligandref>, where ligandref corresponds to the three-letter ligand name given by the PDB. For instance, the carbohydrate-like inhibitor 1-deoxynojirimycin appears as <carb_like_NOJ>. Significantly, nearly half (45 %) of the approximately 7500 PDB structures present in CAZy as of April 2016 bear a glycan-containing ligand or a glycan analog, revealing enzyme-glycan interactions.
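The display filter described above can be summarized in a few lines of code. This is only an illustrative sketch of the stated rules, not the CAZy implementation: the `Ligand` record, its fields, and the function names are hypothetical, and the ligand descriptions in the example are placeholders rather than output of PDB-care.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ligand:
    """Hypothetical record for one carbohydrate ligand found in a PDB file."""
    description: str                      # IUPAC-style text, or a three-letter PDB ligand name
    linked_residue: Optional[str] = None  # residue covalently bound to the ligand, if any
    is_analogue: bool = False             # True for carbohydrate-like compounds


# Residues that carry N- and O-glycans (posttranslational modifications).
GLYCOSYLATION_SITES = {"ASN", "SER", "THR"}


def display_label(ligand: Ligand) -> str:
    """Analogues are shown as <carb_like_XXX>, other ligands by their description."""
    if ligand.is_analogue:
        return f"<carb_like_{ligand.description}>"
    return ligand.description


def ligands_for_display(ligands: List[Ligand]) -> List[str]:
    """Discard glycans attached to Asn or Ser/Thr and label the remaining ligands."""
    kept = []
    for ligand in ligands:
        residue = (ligand.linked_residue or "").upper()
        if residue in GLYCOSYLATION_SITES:
            continue  # N-/O-glycan: a modification of the protein, not a functional ligand
        kept.append(display_label(ligand))
    return kept


if __name__ == "__main__":
    example = [
        Ligand("NOJ", is_analogue=True),
        Ligand("placeholder N-glycan", linked_residue="Asn"),
        Ligand("placeholder disaccharide"),
    ]
    print(ligands_for_display(example))
```

A ligand covalently attached to a catalytic residue (for example a glutamate) is deliberately kept by this sketch, since only Asn- and Ser/Thr-linked glycans are discarded.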

2.2 Browsing by Genome

The collection of carbohydrate-active enzymes encoded by the genome of an organism, hereafter referred to as the “CAZome,” provides an insight into the nature and extent of the metabolism of complex carbohydrates of the species. The CAZomes of free-living organisms typically correspond to 1–5 % of the predicted coding genes. Because of the massive chemical, structural, and functional variability of carbohydrates, CAZome comparisons can highlight how the CAZyme repertoire of a species has adapted to its carbohydrate environment.

The CAZy website allows users to browse CAZomes by kingdom of life, where species are presented in alphabetically ordered tabs. For each organism, the complete list of CAZymes is displayed in addition to the family distribution, as illustrated in Fig. 6.5. As of April 2016, CAZy lists close to 5000 public genomes, with more than 4000 bacterial genomes but fewer than 200 eukaryotic genomes. This is because the CAZomes listed on the CAZy website correspond to protein models of finished genomes from the daily releases of GenBank. In a few rare cases, genomes with protein models not released as finished entries in GenBank, but publicly available, have been analyzed and are presented in CAZy. For these few cases, however, the display only shows the number of proteins in each family and does not feature the actual list of proteins with database accessions. The taxonomic lineage of each genome is directly extracted and updated from the NCBI Taxonomy database.

Fig. 6.5 Screenshot of the CAZy webpage for the genome of Porphyromonas gingivalis ATCC 33277. The CAZy family distribution (top) is followed by the list of all identified CAZymes with their modularity (where relevant)

3 Retrieving Information from the Search Form/Engine in the CAZy Website

To facilitate the search for specific information, the CAZy website includes a search tool, which appears at the top right of every page. The search form is composed of a text area with a magnifying glass, in which to enter the query, and a drop-down list to indicate the field to be searched (see Fig. 6.6; the “Site” option searches every field). The main fields allow the user to search by CAZy family, organism (name, even partial, or taxonomy id), protein name or accession (GenBank, UniProt, and PDB), ligand (indicating a sugar-like compound, a part of a chain, or the catalytic residue to which the ligand is attached, e.g., “GLU”), activity (EC number/name), etc. The result of the search either indicates the modularity of the protein or provides direct links to the relevant genome and family webpages.

Fig. 6.6 Search tools and fields that accept queries in the CAZy database

4 How to Get CAZy to Annotate Your Studied Protein, Genome, or Metagenomic Sample?

The most straightforward way to obtain a CAZy annotation for a genome is either to submit your sequence(s) to the NCBI with the “finished” status or to contact us (cazy@afmb.univ-mrs.fr) to request our analysis as part of a collaborative effort. Every day, the internal CAZy tool for semiautomatic modular assignment runs on the protein sequences from the daily release of NCBI GenBank, and our computational equipment makes it possible to perform several large-scale analyses, such as the annotation of the CAZyme repertoire in genomic and metagenomic investigations (the latter can be in the form of DNA or protein sequences). These putative assignments are then manually validated (or rejected) by expert curators. Subsequent family comparisons provide insights into how similar or different the newly sequenced organisms are compared to closely related species or how metagenomic samples differ from each other. Differences in the relative family size, for example, can reflect the relative diversity or complexity of the inherent biological processes and, therefore, the biology of the compared species/samples.
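Once curated family assignments are available, the family comparisons mentioned above amount to counting modules per family and contrasting the counts. The sketch below illustrates the idea under simplifying assumptions: the function names are hypothetical, the input is assumed to be a plain list of family labels (one per detected module), and raw counts are compared without any normalization for genome or sample size.

```python
from collections import Counter
from typing import Dict, Iterable


def family_counts(assignments: Iterable[str]) -> Counter:
    """Count how many modules were assigned to each family."""
    return Counter(assignments)


def family_differences(a: Counter, b: Counter) -> Dict[str, int]:
    """Return count(a) - count(b) for every family whose counts differ."""
    families = sorted(set(a) | set(b))
    # Counter returns 0 for families absent from one of the two inputs.
    return {f: a[f] - b[f] for f in families if a[f] != b[f]}


if __name__ == "__main__":
    # Made-up assignments for two hypothetical genomes.
    genome_a = family_counts(["GH13", "GH13", "GT2", "PL1"])
    genome_b = family_counts(["GH13", "GT2", "GT2"])
    print(family_differences(genome_a, genome_b))
```

For metagenomic samples of very different sizes, relative frequencies would be a more sensible basis for comparison than raw counts.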

Automatic tools, freely available on the web, attempt to emulate the CAZy classification scheme. Our experience is that these fully automatic methods provide results that can be substantially different from actual CAZy assignments. Further, these tools sometimes include outdated module families and automatic subfamilies that are not curated. Finally, and most importantly, automatic predictors depend on the user’s parameters for the detection threshold, generally applied to e-value statistics. The issues with an e-value threshold are that (i) the e-value varies with the length of the aligned sequences for an identical percentage of sequence similarity, (ii) such a threshold completely bypasses the curation needed to distinguish possibly functionally related homologs from locally shared secondary structures, (iii) a single threshold is not appropriate for families of unequal diversity, and (iv) low (significant) e-values do not guarantee the completeness of modules, since all detection tools (BLAST- or HMM-based) are local by nature.
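Point (i) can be made concrete with the standard relation used in BLAST statistics, E = m · n · 2^(−S′), where S′ is the bit score and m and n are the (effective) query and database lengths. This relation is not taken from the CAZy pipeline, and HMM-based tools use different statistics. In the minimal sketch below, the constant number of bits gained per aligned position is an assumption standing in for "identical similarity percentage", and the lengths are arbitrary illustrative values.

```python
from typing import Dict, Iterable


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def evalue_by_length(aligned_lengths: Iterable[int],
                     bits_per_position: float = 0.5,
                     query_len: int = 300,
                     db_len: int = 10_000_000) -> Dict[int, float]:
    """E-values for alignments of equal quality but different lengths.

    Assumes every aligned position contributes the same number of bits,
    so the bit score grows linearly with the alignment length.
    """
    return {n: expected_hits(bits_per_position * n, query_len, db_len)
            for n in aligned_lengths}


if __name__ == "__main__":
    for length, evalue in evalue_by_length([40, 80, 160]).items():
        print(f"aligned length {length:4d}: e-value {evalue:.3g}")
```

Under these assumptions, doubling the aligned length squares the factor 2^(−S′), so a short module and a long module of equal similarity can fall on opposite sides of any fixed e-value cutoff.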

5 What to Do If You Obtain a New Activity or the 3-D Structure for a CAZyme or If You Characterize a New CAZy Family?

We cordially invite biologists with experimental results to contact us (with appropriate material such as a peer-reviewed preprint of the work), to reduce the number of reports that are missed during our literature surveillance. If the subject protein has not yet been submitted to GenBank/PDB, we strongly encourage you to do so. This will allow the automatic capture and display of the annotation on the CAZy website as follows:

5.1 Novel Activity, 3-D Structure, or New Chemical Information for an Enzyme in an Existing CAZy Family

If the studied enzyme is already assigned to a numbered family in the CAZy website, we will complete the family records (content and description) with the new information. The new information will thus be displayed on the webpage dedicated to the family, in the corresponding sections as described in Sect. 6.2.1. The accumulation of experimental evidence will notably help in refining the classification system by the population of subfamilies based on phylogenetic analyses (see Table 6.1).

Warning

If your enzyme appears in the CAZy website but in the “Nonclassified modules” listed for each CAZy class, it has to be considered as new regarding the CAZy classification (see below).

5.2 Novel Family in the CAZy Classification

If the newly characterized enzyme does not belong to any known CAZy family, or belongs to the “Nonclassified modules” of a CAZy class, we will create a new CAZy family, as follows. Starting from the subject sequence, we first collect the most similar homologs in GenBank by BLAST. Then, we iteratively gather more distant family members using HMMs, which capture the family diversity (flexible and constrained positions in the multiple sequence alignment and corresponding structure). The delineation of the module boundaries is guided by family conservation and is generally facilitated or refined when a 3-D structure becomes available. The creation and analysis of a new CAZy family remains private until notification by the original requestor or until publication.

6 Why Doesn’t CAZy Extend Its Classification Scheme to Other Classes of Enzymes?

Even though the CAZy families do not always coincide with a precise substrate specificity, family assignment often gives clues as to what the broad substrate category might be. When the relatedness to a functionally characterized enzyme is high, typically at the subfamily level, the functional predictions for CAZymes can be very good. In any case, family assignment is substantially more informative for CAZymes than for most other families of enzymes (kinases, proteases, esterases, etc.), whose substrates are difficult or impossible to derive from their sequence alone. Due to our limited number of expert curators and to the poorer relationship between family and function in other enzyme categories, we prefer to stay within our field of competence and do not expand the scope of CAZy beyond what it is.

7 Why Doesn’t CAZy Propagate Experimentally Established Function to Similar Sequences?

All too often during a protein/genome study, the functional annotations automatically inferred by computational methods contain a significant amount of low-quality and even erroneous information that is then propagated to subsequent projects. For example, the transfer of Gene Ontology (GO) terms based on Pfam modules usually assigns excessively general terms. This can be explained by the stringent policy of module annotation, which links a module solely to the GO terms common to all proteins having this module, whatever the diversity of the possible module combinations and associated functions. Other widely used tools are also prone to overprediction, either by transferring annotation from a demonstrated example to distant homologs or by creating annotation based on hypotheses devoid of any experimental evidence. An example is the BACON domain (Pfam ID PF13004), whose name stands for Bacteroidetes-Associated Carbohydrate-binding Often N-terminal and is based on a conjecture-only publication. This conjecture has recently been challenged by a publication showing that the BACON domain of the BACOVA_02653 protein of Bacteroides ovatus ATCC 8483 does not have any carbohydrate-binding activity (Larsbrink et al. 2014). As a consequence, CAZy did not create a new CBM family for such modules. More generally, to avoid problems linked to annotation transfer, CAZy policy is to display EC numbers only for experimentally characterized enzymes.
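The effect of linking a module only to the terms shared by all proteins that contain it can be illustrated with a set intersection. This is a toy sketch, not the actual procedure of any annotation resource: the function name is hypothetical, and the proteins and term labels are made up for illustration.

```python
from typing import List, Set


def shared_terms(annotations: List[Set[str]]) -> Set[str]:
    """Terms common to every protein carrying a given module."""
    if not annotations:
        return set()
    return set.intersection(*annotations)


if __name__ == "__main__":
    # Made-up annotations for three hypothetical proteins that share one
    # module but combine it with different catalytic partners.
    proteins = [
        {"carbohydrate metabolic process", "xylan breakdown"},
        {"carbohydrate metabolic process", "cellulose breakdown"},
        {"carbohydrate metabolic process", "pectin breakdown"},
    ]
    print(shared_terms(proteins))
```

Only the most general label survives the intersection, which is why annotations transferred in this way tend to be correct but uninformative as the diversity of module combinations grows.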

8 Links and Announcements on the CAZy Website

In addition to multiple links to essential enzymatic and glycogenomic resources, CAZy contains many cross-links to the CAZypedia resource. CAZypedia is a community-driven encyclopedic resource meant to be the logical extension of the CAZy classification. It contains extensive information about CAZy families, with particular emphasis on GHs, and the pages for the other CAZy families are being filled in progressively. The CAZy website also offers an opportunity for commercial enzyme providers to present products that follow the CAZy nomenclature, and it announces scientific meetings and open job positions related to CAZymes.

9 What Is the PULDB Database?

PULDB is a recent addition to CAZy that describes Polysaccharide Utilization Loci (PULs) experimentally characterized in the literature, together with our automated PUL predictions in Bacteroidetes species (Terrapon et al. 2015). A PUL is a set of physically linked genes organized around a susCD gene pair. Named after the prototypic starch utilization system, susC encodes a characteristic membrane transporter, and susD encodes an outer membrane carbohydrate-binding protein (Shipman et al. 2000). PULs are prevalent in the Bacteroidetes phylum, with species encoding dozens of PULs, each tailored to degrade a particular glycan structure. PULs provide an evolutionary advantage to these gram-negative species by orchestrating the breakdown of complex glycans, thanks to the encoded CAZymes, and by sequestering these nutrients away from competitors (Terrapon and Henrissat 2014). PULDB offers a query engine to search PULs by species, by CAZy module (or combination of modules), and by locus tag. It also contains a JBrowse engine (Skinner et al. 2009) to visualize the genomic context of CAZymes and PULs for all the integrated genomes (source: IMG HMP project at the JGI (Markowitz et al. 2012)), as illustrated in Fig. 6.7.

Fig. 6.7 Screenshot of the PULDB website. JBrowse visualization of a xyloglucan PUL in Bacteroides ovatus ATCC 8483

10 Conclusion

The CAZy database is based on family classification schemes that were established in the 1990s, before any genome had been completely sequenced. A key feature of the success of CAZy is the stability of its underlying classification system: the earliest GH families have survived a more than 500-fold expansion since their creation in 1991. Other key features of CAZy are the integration of the variable modular architecture of CAZymes and its panel of expert curators, who capture structural and functional data from the literature. In the near future, however, high-throughput enzymology will deliver more data in 1 year than has accumulated during the last 50 years. Without a mechanism to capture functional information reliably, a large amount of experimental data will remain buried and underexploited.