Introduction

Lectins are proteins of non-immune origin that bind carbohydrates with high specificity. They form a large class of multivalent recognition molecules that interact specifically with their cognate sugar moieties, thereby decoding the information encoded in carbohydrate structural heterogeneity (Sharon and Lis 2004). Although their occurrence in nature has been known since the late nineteenth century, it was only by the 1960s that extensive research was carried out to explore their functional importance in a range of biological processes across both the plant and animal kingdoms. Plant lectins have been the most extensively studied (da Silva and Correia 2014); among them, legume lectins were the first to be investigated, and legumes proved to be a rich source of lectins.

Legume lectins have been pivotal to the study of the molecular basis of protein-carbohydrate interactions (Sharon and Lis 1995). They form a large family of homologous proteins that are highly similar in their physical, chemical and biological properties despite originating from taxonomically distant species. Yet they display remarkable divergence in carbohydrate specificity, ranging from monosaccharides to oligosaccharides. Some legume lectins are synthesized as prolectins in the endoplasmic reticulum and undergo post-translational modification in the Golgi apparatus to function as secretory proteins (Moreira et al. 2013).

Since the advent of recombinant techniques in the 1970s, studies of lectins have intensified, determining their physico-chemical and physiological properties and amino acid sequences and elucidating their 3D structures. Concanavalin A, a legume lectin, was the first lectin for which a high-resolution X-ray crystallographic structure became available (Edelman et al. 1972). Soon thereafter, the 3D structures of a diverse group of lectins were elucidated.

The basic architecture of the protomer is the “lectin fold”, which is related to the “jelly-roll fold” and comprises three anti-parallel β sheets arranged as a sandwich and connected by α turns, β turns and bends along with short loops. The three sheets are a flat six-stranded “back” sheet, a concave seven-stranded “front” sheet and a short “top” sheet that holds the other two together (Fig. 13.1). The protomers are usually devoid of α helices, with the exception of occasional 3₁₀ helices. Each protomer is dome shaped, with dimensions of 42 × 40 × 39 Å and a molecular weight of 25–30 kDa. The carbohydrate recognition domain (CRD) is a shallow depression on the surface at the apex of the dome, accessible to both monosaccharides and oligosaccharides (Sharon and Lis 2002). The CRD of legume lectins is built from four binding site loops, A, B, C and D, which are adjacent to each other in the pocket in the 3D structure but are not close together in the sequence. The residues in the binding pocket show the greatest variability and are inferred to be involved in determining specificity (Young and Oomen 1992; Benevides et al. 2012). The floor of the binding site consists of a few conserved key amino acid residues in the loops, including Asp in Loop A, Gly (or Arg in Canavalia and Dioclea lectins) in Loop B, and Asn and an aromatic residue in Loop C, which contribute hydrogen bonds and van der Waals interactions with the sugar. Variation in loops C and D is possibly a primary determinant of monosaccharide specificity (Sharma and Surolia 1997; Rao et al. 1998). The CRD lies in close proximity to the metal binding sites, and these lectins require Ca²⁺ and the transition metal ion Mn²⁺ for their binding activity (Etzler et al. 2009).

Fig. 13.1

(a) Structure of concanavalin A as a model for the legume lectin fold, represented as a cartoon. (b) Binding site loops A, B, C and D of the Canavalia lectin with mannose in its carbohydrate recognition domain

Despite sharing a common β-sandwich fold, legume lectins vary both at the level of quaternary structure, with a variety of dimeric and tetrameric arrangements (Srinivas et al. 2001; Manoj and Suguna 2001), and at the level of the binding site. Other factors that contribute to the variability in specificity include interactions with water, post-translational modification and carbohydrate-aromatic interactions. Thus, classifying lectins into distinct groups based on their monosaccharide specificity (that is, the best hapten inhibitor of each lectin) and relating this classification to amino acid sequence variation should shed light on the design of their combining sites.

So far, the relationship between the amino acid composition of legume lectins and their diverse specificities has been examined only to a limited extent (Swamy et al. 1985). In this work, we identify broad features that allow a spectrum of specificities to be generated without fundamental alteration of the 3D structural fold. To this end, we employ a new approach that simultaneously visualizes and analyses the amino acid variations in 46 legume lectins, categorized into five sugar-specific groups, through pattern recognition using heatmaps.

Methodology

  1. Generation of dataset of 3D structures

    Nearly 1,094 plant lectins belonging to the Leguminosae family (with 159 unique source entries) have been deposited in the comprehensive UniProt database (http://www.uniprot.org/) with amino acid sequence and functional information. Of these, 235 structures have been deposited in the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do), in which each lectin is complexed with one or more ligands (Berman et al. 2000). For this study, a set of 46 legume lectins with elucidated 3D structures was short-listed from this large dataset, using “unique source” as the criterion.

    These legume lectins were categorized into five groups according to their monosaccharide specificity as reported in the literature: (1) Mannose/Glucose (MG), (2) Galactose (GA), (3) N-Acetyl-Glucosamine (GLN), (4) N-Acetyl-Galactosamine (GAN) and (5) Fucose (FU). Table 13.1 lists the 46 lectins along with their source, monosaccharide specificity and PDB IDs. The final dataset comprises 24 MG, 1 GLN, 8 GA, 10 GAN and 3 FU lectins.

    Table 13.1 Dataset of 46 legume lectins considered in this study

  2. Obtaining amino acid sequences

    Only complete canonical sequences were selected for these entries, and they were retrieved in FASTA format from RCSB-PDB. For consistency, only chain A was used, except for the lectins with PDB IDs 1LEN, 1LOB, 2B7Y and 2LTN, for which both chains A and B were considered because they are fragments of the same protomer that had been truncated. Because the lectins of the genera Canavalia, Dioclea, Cratylia and Cymbosema in the MG group exhibit circular homology, their sequences were manually re-transposed to align them with the other legume lectin sequences.

  3. Protein secondary structure prediction using PSSPRED

    For secondary structure prediction from the amino acid sequence, the PSSPRED (Protein Secondary Structure PREDiction) web server (http://zhanglab.ccmb.med.umich.edu/PSSpred/), which is based on the Rumelhart error back-propagation method, was employed (Xu and Zhang 2013). This tool uses a simple neural network training algorithm for accurate prediction (Zhang 2012). Based on these predictions, the amino acid sequences of the four binding site loops were determined for each lectin in the dataset.

  4. Multiple sequence alignment and analysis

    Multiple sequence alignment was performed using ClustalW2 (http://www.ebi.ac.uk/Tools/msa/clustalw2/) (Larkin et al. 2007) with default parameters: the BLOSUM protein weight matrix, a gap opening penalty of 10, a gap extension penalty of 0.20 and a gap distance penalty of 5.

  5. Phylogenetic analysis of legume lectins

    Phylogenetic analysis was based on amino acid sequence alignment. Using ClustalW2, multiple sequence alignments were performed for the entire set of 46 lectins, once with the complete sequences and once with only the amino acid sequences of the binding site loops. These alignments were generated using the PAM matrix with all other settings at their defaults. The resulting alignment files were provided as input to MEGA6 (Molecular Evolutionary Genetics Analysis) (Tamura et al. 2013) for inferring phylogenetic trees.

  6. Calculation of percentage identity matrix

    Pairwise percentage identity scores for all 46 lectins and for their respective binding sites were computed from the ClustalW2 sequence alignments. The scores were arranged as a matrix giving the identity between every pair of sequences in the legume lectin dataset. Each score is the number of identical positions between the two sequences divided by the length of the alignment, expressed as a percentage.

  7. Computation of amino acid composition

    Amino acid composition was computed separately for the complete protein and for the binding site loops alone. The ProtParam web server (http://web.expasy.org/protparam/) was employed to obtain the percentage composition of each amino acid in a given protein sequence (Gasteiger et al. 2005), and the same procedure was repeated for the four binding site loops. The values were tabulated into a 20 × 46 matrix, from which a clustergram was generated.

  8. Pattern recognition and clustering

    To demonstrate characteristic features of the sugar-specific lectin groups, we computed heatmaps that display patterns in the entire lectin sequence and in the binding site in particular, based on two measures: (1) percentage identities, to demonstrate (dis)similarities, and (2) percentage amino acid compositions, to study the significance of amino acid variation. Heatmaps were generated using MATLAB v7.5 (MathWorks 2007) with the Euclidean distance measure. Clustergrams based on amino acids were also generated with the k-means clustering algorithm (MacQueen 1967; Weisstein 1995), using the CVAP 3.7 (Cluster Validity Analysis Platform) function module in MATLAB v7.5.
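
The sequence-level computations in steps 2, 6 and 7 can be illustrated with a minimal Python sketch. It is not the pipeline used in this study, which relied on ClustalW2 and ProtParam. The function names, the toy sequences and the `cut` position are hypothetical, and skipping columns in which both sequences carry a gap is an assumption about how the alignment length is counted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def retranspose(seq: str, cut: int) -> str:
    """Undo a circular permutation by moving the first `cut` residues to the end.

    `cut` is a hypothetical per-lectin position that would be chosen by hand.
    """
    cut %= len(seq)
    return seq[cut:] + seq[:cut]


def percent_identity(a: str, b: str) -> float:
    """Identical positions divided by alignment length, as a percentage.

    `a` and `b` are two rows of one alignment ('-' marks a gap).
    Columns where both rows are gaps are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def identity_matrix(aligned: dict[str, str]) -> list[list[float]]:
    """Square matrix of pairwise percentage identities, in insertion order."""
    names = list(aligned)
    return [[percent_identity(aligned[p], aligned[q]) for q in names] for p in names]


def composition(seq: str) -> list[float]:
    """Percentage of each of the 20 amino acids, ordered as in AMINO_ACIDS."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return [100.0 * counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


# Toy aligned fragments (invented for illustration, not taken from the dataset).
toy = {"lecA": "ADTIV-AVEL", "lecB": "ADTLVSAVEL", "lecC": "SDT-V-AIDL"}
matrix = identity_matrix(toy)
# One composition vector per sequence; transposing gives a 20 x N table.
table = [composition(s.replace("-", "")) for s in toy.values()]
```

Stacking one composition vector per lectin and transposing yields the 20 × 46 layout described in step 7.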

Results and Discussion

In the present study, we have employed pattern recognition for demonstrating the influence of amino acid variability on legume lectin specificity. Pattern recognition allows making inferences from observations using a statistical approach. Pattern recognition enables discrimination between seemingly similar entities based on their quantitative features (Duin and Pekalska 2007). Accordingly, we have used heatmaps and clustergrams to highlight the characteristic features of each of the five lectin groups classified based on their monosaccharide binding abilities.

(Dis)similarities in Legume Lectins Based on Percentage Identities

Figure 13.2 shows the percentage identity matrix as a heatmap, in which the upper diagonal half is computed from the binding site loops and the lower diagonal half from the full lectin sequences. The heatmap shows a clear demarcation between the five groups of legume lectins on the basis of their pairwise comparisons. The overall percentage identity across the 46 lectins was in the range of 28.24–100 % for the entire protein, whereas it was 14.29–100 % for the binding site loops. This reflects the greater variability of the carbohydrate binding site residues relative to the whole protein sequence, with the highest identities shared among lectins of the same species within the same sugar-specific group. The intra-group percentage identities for the full sequences were 35.68–100 % for MG, 33.78–96.65 % for GA, 37.78–61.61 % for GAN and 35.71–36.89 % for FU lectins. Similarly, the identities between the binding site loops fell in the range of 21.05–100 % for MG, 17.07–98.04 % for GA, 20.83–56.6 % for GAN and 14.49–34 % for FU binding legume proteins. Table 13.2 lists the intra- and inter-group percentage identities across the five groups of legume lectins.

Fig. 13.2

Heatmap generated for 46 legume lectins using percentage identity matrix. The upper diagonal half represents the identities computed for the four binding site loops and the lower diagonal half is based on the full protein sequence
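
The layout of such a two-in-one heatmap can be sketched in Python as follows. This is only an illustration of how two symmetric identity matrices might be merged into a single square matrix before plotting; the function name `combine_triangles` and the toy values are hypothetical, and setting the diagonal to 100 % (self-identity) is an assumption.

```python
def combine_triangles(loops, full):
    """Merge two square matrices of the same size into one.

    Cells above the diagonal come from `loops` (binding site loops),
    cells below the diagonal come from `full` (full sequences),
    and the diagonal is set to 100.0 (self-identity, an assumption).
    """
    n = len(full)
    if len(loops) != n or any(len(row) != n for row in loops + full):
        raise ValueError("both matrices must be square and of equal size")
    combined = [[100.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j > i:
                combined[i][j] = loops[i][j]
            elif j < i:
                combined[i][j] = full[i][j]
    return combined


# Invented 3 x 3 identity matrices, for illustration only.
loops_toy = [[100, 10, 20], [10, 100, 30], [20, 30, 100]]
full_toy = [[100, 50, 60], [50, 100, 70], [60, 70, 100]]
merged = combine_triangles(loops_toy, full_toy)
```

Reading along any row of `merged`, values to the right of the diagonal describe the binding site loops and values to the left describe the full sequences.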

Table 13.2 Intra- and inter-group percentage identities calculated from full protein sequences and binding site loops

  1. MG lectin group: This set includes lectins from Canavalia sp., Dioclea sp., Cratylia sp., Cymbosema sp., Camptosema sp., Bowringia, Platypodium, Pterocarpus, Lens culinaris, Pisum sativum, Vicia faba and Lathyrus ochrus I. The dataset contains seven Canavalia lectins, which share more than 97 % identity with one another; in particular, three lectins (PDB IDs: 1I3H, 3QLQ and 2A7A) are 100 % identical over the full protein sequence, while the binding site loops of six Canavalia lectins (all except 2OVU) are 100 % identical, indicating high conservation of the binding site architecture. Phylogenetic trees based on the sequences of the entire protein and of its binding site show that all the Canavalia lectins cluster closely (Figs. 13.3 and 13.4). Similarly, there are six Dioclea lectins with a percentage identity greater than 95 %. Three of these proteins (2JEC, 2GDF and 3SH3) showed 100 % sequence identity in their binding site loops, consistent with their forming a single clade in the cladogram based on the binding site. The Cymbosema lectin (3A0K) also shared a high identity (>93 %) with the Dioclea lectins and hence was grouped with them (Loris et al. 1998). The two Cratylia proteins (2D3P and 1MVQ), along with Camptosema (3U4X), formed a third clade closer to the origin of Canavalia sp. in both cladograms. Interestingly, all the lectins of the above-mentioned species are known to exhibit an unusual type of homology called circular homology. They are initially synthesized as glycosylated precursors of nearly 290 amino acids and are known to undergo transposition by domain swapping followed by transpeptidation (Sharon and Lis 1990). On the other hand, the Platypodium (3ZYR) and Pterocarpus (1UKG) proteins of the MG group cluster together, while the lectin from Bowringia (2FMD) stands as an individual clade.

    The four lectins from Lens culinaris (1LEN), Pisum sativum (2LTN), Vicia faba (2B7Y) and Lathyrus ochrus I (1LOB) shared high percentage identity (>80 %) and were grouped together in the heatmaps as well as the cladograms, as these four lectins have identical B-chains (Kolberg et al. 1980; Debraya and Rougé 1984).

    Fig. 13.3

    Cladogram for the 46 lectins obtained from full protein sequences

    Fig. 13.4

    Cladogram showing the relationship between 46 legume lectins based on the four loops of their binding site

  2. GLN lectin group: Among the 46 lectins in the dataset, only one, from Ulex europaeus II (1QNW), was found to be GLN specific. Its full sequence differed from the MG, GA, GAN and FU groups by ranges of 40–89.7 %, 38.01–62.29 %, 35.45–56.52 % and 36.73–52.97 %, respectively. Similarly, its binding site differed from these sugar-specific groups by 21.57–46.67 %, 26–49.09 %, 17.39–32.08 % and 23.08–35.19 %, respectively.

  3. GA and GAN lectin group: Until the early 1990s, the GA and GAN specific lectins were grouped together, but because amino acid variability in their binding regions leads to differences in their biological recognition processes, they came to be considered separate entities (Sharma et al. 1998). The findings of our present study reinforce these observations: the heatmap based on percentage identities showed higher identities between these two groups for the entire lectin (28.24–86.96 %) than for the binding site residues (19.15–71.7 %). This demonstrates the difference in specificity between the two sets of lectins. For the full sequence, the intra-group identity was 33.78–96.65 % for GA and 37.78–61.61 % for GAN binding proteins, while for the binding site loops, identities of 17.07–98.04 % and 20.83–60.96 % were observed for GA and GAN proteins, respectively. Analysis of the cladograms demonstrated a similar trend, wherein the GA and GAN lectins were distributed in common clades, which in turn supports the view that the structural characteristics of GA and GAN lectins depend on their phylogeny rather than on their differences in sugar specificity (Liener et al. 1986).

  4. FU lectin group: This set included three proteins: Griffonia simplicifolia IV (1GSL), Lotus tetragonolobus (2EIG) and Ulex europaeus I (1FX5). These lectins had intra-group identity ranges of 35.7–36.89 % for the full protein sequence and 14.49–34 % for the binding site loops. Despite their shared specificity, these FU specific lectins cluster with their respective genus clades in both cladograms (Thomas and Surolia 2000).
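
The intra- and inter-group ranges quoted for the lectin groups above are minima and maxima taken over subsets of the pairwise identity matrix. A minimal Python sketch of that bookkeeping is given below; the function name `identity_range`, the sequence labels and the numerical values are hypothetical and do not reproduce the values in Table 13.2.

```python
from itertools import combinations, product


def identity_range(matrix, names, groups, g1, g2):
    """Return (min, max) pairwise identity between groups g1 and g2.

    `matrix[i][j]` is the identity between names[i] and names[j];
    `groups` maps each name to its sugar-specificity label.
    With g1 == g2 the intra-group range is returned. None is returned
    when no pair exists, e.g. for a group with a single member.
    """
    index = {name: i for i, name in enumerate(names)}
    first = [n for n in names if groups[n] == g1]
    second = [n for n in names if groups[n] == g2]
    pairs = combinations(first, 2) if g1 == g2 else product(first, second)
    values = [matrix[index[p]][index[q]] for p, q in pairs]
    if not values:
        return None
    return min(values), max(values)


# Invented labels and identities, for illustration only.
names_toy = ["mg_1", "mg_2", "ga_1"]
groups_toy = {"mg_1": "MG", "mg_2": "MG", "ga_1": "GA"}
matrix_toy = [
    [100.0, 90.0, 40.0],
    [90.0, 100.0, 35.0],
    [40.0, 35.0, 100.0],
]
intra_mg = identity_range(matrix_toy, names_toy, groups_toy, "MG", "MG")
inter_mg_ga = identity_range(matrix_toy, names_toy, groups_toy, "MG", "GA")
```

A group with one member, such as GLN in this dataset, has no intra-group pairs, which is why the sketch returns `None` in that case rather than a range.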

Amino Acid Variability in Legume Lectins Based on Percentage Composition

The basic differences in the binding site architecture of legume lectins, and thereby in their biological function, can be attributed to amino acid variability. The amino acids in the clustergrams were grouped based on their abundance (Figs. 13.5 and 13.6). Table 13.3 details the relative abundance of the 20 amino acids in all 46 lectins, classified as high, moderate and low.

Fig. 13.5

Heatmaps with the dendrogram constructed using percentage composition of amino acids for the 46 lectins on full protein sequences

Fig. 13.6

Heatmaps with the dendrogram constructed using percentage composition of amino acids in the binding site loops for the 46 lectins

Table 13.3 Categorisation of amino acids based on their abundance in the full protein and the binding site loops

In the clustergram of the full protein, Ser and Thr were abundant across the 46 lectins; Ser was prevalent in MG, GLN and GA binding proteins, at 12.39 %, 12.10 % and 11.7 %, respectively. Unlike other MG lectins, those of Lens culinaris (1LEN), Pisum sativum (2LTN) and Vicia faba (2B7Y) had a relatively high percentage of Thr. MG proteins had a higher percentage of acidic amino acids than the other groups. Val and Thr were present in moderate percentages in the full protein but in low percentages at the binding site, indicating that they matter more for the stability of the protein structure than for binding site specificity.

In the binding site loops, Gly clearly predominates over other residues, reaching a maximum of 12.37 % in MG and 11.39 % in GA binding proteins. GA, GAN and FU lectins are Pro rich at the carbohydrate binding site. The other major residue, Ser, has a high percentage in MG, GLN and GA specific lectins. Tyr in particular has a moderate percentage of 6.47 % and is reported to be involved in CH-π interactions in MG binding proteins. GAN and FU binding lectins have considerably higher percentages of the basic residues Lys and Arg in the loops.

The residues Asp, Asn and Ala have an intermediate percentage in the binding site loops in comparison to the full protein, which is in accordance with their role in non-covalent interactions with the monosaccharide. Similarly, Phe, His, Tyr and Trp, found in low percentages, have been reported to be necessary for stacking interactions with the sugar unit.

Cys and Met were either found in very low percentages or absent and thus were grouped together in the dendrogram.
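
How k-means with a Euclidean distance groups composition profiles can be illustrated with a short, self-contained Python sketch. It is not the CVAP/MATLAB implementation used in this study: it is plain Lloyd's algorithm, the initial centroids are simply the first k points (an assumption made to keep the run deterministic), and the two-dimensional toy profiles are invented.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points, k, iterations=100):
    """Plain Lloyd's k-means.

    Initial centroids are the first k points (an assumption, chosen so
    that the result is reproducible). Returns (labels, centroids).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented two-dimensional profiles (e.g. percentage of two residues).
profiles = [[12.0, 1.0], [2.0, 9.0], [11.5, 1.2], [2.5, 8.5]]
labels, centroids = kmeans(profiles, k=2)
```

In practice each point would be a full composition vector, and several random initialisations are usually compared because k-means can converge to different local optima.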

Conclusion

Pattern recognition through heatmaps reduces data complexity and aids interpretation through visualisation. We have therefore used it in this study to analyse amino acid variability in a set of 46 legume lectins. Our findings on sequence-based variability and phylogenetic analysis are complementary to previous studies, revealing that legume lectins arose by divergent evolution while retaining a common β-sandwich fold. There is a clear distinction in sequence identity among the proteins specific to a particular monosaccharide. The results from percentage composition support the plausible role of certain amino acid residues in the carbohydrate binding site in non-covalent interactions with the sugar.