4.1 Introduction

The adaptive immune system, also termed the acquired immune system because it is acquired during a lifetime rather than inherited, is a subsystem of the overall immune system. It consists of highly specialized systemic cells and processes that eliminate pathogens or prevent their growth. Acquired immunity creates immunological memory after the initial response to a specific pathogen, which results in a strong anamnestic response upon subsequent exposure to that pathogen. Vaccination is based on this feature of acquired immunity. B and T cells mediate adaptive immunity and are responsible for humoral and cell-mediated immunity, respectively. They recognize a specific portion of a protein residing on the pathogen surface rather than the pathogen as a whole; that protein is termed an antigen. Recognition is mediated by distinct receptors on the surface of B and T cells, designated B-cell and T-cell receptors (BCR and TCR); the BCR consists of membrane-bound immunoglobulin that recognizes solvent-exposed antigens. B and T cells perceive antigens in remarkably different ways [30]. Antibodies released by B cells trigger different functions upon binding their respective antigens. As a result, toxins and pathogens are neutralized or labeled for destruction [20].

In contrast, the T-cell receptor (TCR) on the surface of T cells recognizes antigens that are displayed by antigen-presenting cells (APCs) bound to major histocompatibility complex (MHC) molecules. MHC class I and class II molecules present T-cell epitopes. According to the immunological dogma, helper T cells express the co-receptor CD4 and recognize antigen in the context of MHC class II, whereas cytotoxic CD8+ T cells recognize antigen displayed by MHC class I molecules. Consequently, there are CD8 and CD4 T-cell epitopes. CD4 T cells can act as helper or regulatory T cells [20]. Helper T cells amplify the immune response and are divided into three major subclasses: Th1, involved in cell-mediated immunity against intracellular pathogens; Th2, involved in antibody-mediated immunity; and Th17, which drives inflammatory responses and defense against extracellular bacteria [37].

Advances in recombinant DNA technology, the development of bioinformatics tools, and knowledge of the host immune response and of the genetic background of pathogens have led to new vaccines that are more efficient, safer, and less expensive than conventional vaccines. The epitopes chosen for a vaccine must be conserved across the distinct stages of the pathogen and its variants. A cytotoxic T-cell-mediated response requires intracellular antigen processing, so linear epitopes are its predominant targets. In this respect, the epitopes selected for a particular vaccine should bind to more than one major histocompatibility complex allele.

Identifying B-cell and T-cell epitopes is a decisive step in vaccine design. Experimentally, it requires constructing overlapping peptides that span the complete sequence of a protein antigen and scanning them for epitope-active regions, which is expensive and tedious. In silico techniques are therefore an attractive alternative for identifying, out of thousands of plausible candidates, the protein regions able to elicit an immune response [29]. This chapter gives an overview of some commonly used bioinformatics tools for B-cell and T-cell epitope prediction.

4.2 Tools for B-Cell Epitopes Prediction

B-cell epitope prediction tools aim to identify the specific antigenic peptide (epitope) that can substitute for the whole antigen in antibody production.

B-cell epitopes are classified into two groups: linear and conformational. A linear epitope consists of residues that are sequential in the primary sequence, whereas a conformational epitope is a cluster of antigen residues that are distant from each other in the primary sequence but are brought into spatial proximity by polypeptide folding [1]. Linear and conformational B-cell epitopes are therefore also termed continuous and discontinuous B-cell epitopes, respectively. Antibodies that recognize linear B-cell epitopes can still identify denatured antigens, whereas denaturation abolishes the recognition of conformational B-cell epitopes. Unlike that of linear epitopes, the prediction of conformational epitopes depends on the three-dimensional structure of the protein. Only a few epitopes of native antigens are linear; approximately 90% are conformational [26].

4.2.1 Linear B-Cell Epitope Prediction

Although they are the minority, linear B-cell epitopes can substitute for the antigen in immunization and antibody production, so their prediction has received major attention. They are predicted with sequence-based methods from the primary sequence of the antigen. Early computational methods relied on simple amino acid propensity scales that describe physicochemical characteristics of B-cell epitopes. For example, Hopp and Woods used residue hydrophilicity calculations to predict B-cell epitopes [11, 12], on the hypothesis that hydrophilic regions preferentially reside on the protein surface and are therefore probably antigenic. The various prediction tools differ in the datasets, algorithms, and training features they use.
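A propensity-scale method of this kind can be sketched as a sliding-window average. The example below is a minimal illustration, not the original implementation: the hydrophilicity values are those commonly tabulated for the Hopp-Woods scale, the window of six residues is an assumed default, and the function names are hypothetical.

```python
# Hydrophilicity values as commonly tabulated for the Hopp-Woods scale
# (assumed here; check against the original publication before reuse).
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def hydrophilicity_profile(sequence, window=6):
    """Return (start_index, mean_hydrophilicity) for every full window."""
    seq = sequence.upper()
    if window < 1 or len(seq) < window:
        return []
    # A KeyError is raised for letters outside the 20 standard amino acids.
    values = [HOPP_WOODS[aa] for aa in seq]
    profile = []
    for start in range(len(seq) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start, mean))
    return profile


def most_hydrophilic_window(sequence, window=6):
    """Return (start_index, segment, score) of the highest-scoring window."""
    profile = hydrophilicity_profile(sequence, window)
    if not profile:
        return None
    start, score = max(profile, key=lambda item: item[1])
    return start, sequence.upper()[start:start + window], score
```

The highest-scoring window is only a candidate antigenic region under the hydrophilicity hypothesis; it is not a validated epitope.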

Currently available linear B-cell epitope prediction tools include BcePred, which predicts linear B-cell epitopes from their physicochemical attributes. Another is Lbtope, which is based on experimentally validated non-B-cell epitopes derived from the Immune Epitope Database (IEDB) [39]. The artificial neural network (ANN) algorithm implemented in Lbtope is trained on analogous positive data of B-cell epitopes, yet it differs in its negative data of non-B-cell epitopes.

Another tool is BepiPred, which uses a random forest algorithm trained on B-cell epitopes derived from the three-dimensional structures of antigen-antibody complexes. It predicts both types of B-cell epitopes [14]. On the whole, B-cell epitope prediction methods that use machine learning algorithms have outperformed methods based on amino acid propensities.

4.2.2 Conformational B-Cell Epitope Prediction

As already mentioned, most B-cell epitopes are conformational, yet the prediction of linear B-cell epitopes is more advanced, for two main practical reasons. First, predicting conformational B-cell epitopes requires complete information on the protein 3D structure, which is available for only a few proteins [31]. Second, isolating discontinuous B-cell epitopes from their protein context in order to raise a particular antibody is a complicated task that requires suitable scaffolds for epitope grafting. In spite of these difficulties, various methods exist to predict conformational B-cell epitopes.

One of them is CBTOPE, which relies on a support vector machine (SVM) algorithm. It is trained on physicochemical characteristics and sequence-derived features of conformational B-cell epitopes and was assessed on a benchmark dataset of conformational epitopes derived from 3D structures of antibody-protein complexes, reaching 86.59% accuracy in cross-validation experiments [1]. Because it predicts the discontinuous B-cell epitopes of an antigen from its primary sequence, it overcomes the first difficulty.

Another is ElliPro, which relies on the geometrical properties of the protein structure. Like CBTOPE, ElliPro has been assessed on the same benchmark dataset derived from 3D structures of antibody-protein complexes [24].

Bioinformatics tools for both types of B-cell epitope prediction play a significant role in peptide-based vaccine design and disease identification [9, 22].

Although various tools exist for each type of B-cell epitope prediction, five of the most commonly used are described in Table 4.1.

Table 4.1 Some freely accessible B-cell epitope prediction tools

4.2.3 Description of Various Tools and Their Overall Performance Listed in Table 4.1

4.2.3.1 BcePred Server

The BcePred server predicts linear B-cell epitopes from the physicochemical properties of amino acids. These properties comprise mobility, turns, flexibility, exposed surface, accessibility, hydrophilicity, polarity, and antigenicity. To quantify them, a property value is assigned to each of the 20 natural amino acids. The user can choose any combination of physicochemical properties for epitope prediction.

The common gateway interface (CGI) scripts are written in PERL version 5.03, and the server is installed on a Sun Server (420E) in a UNIX (Solaris 7) environment.

Submission Steps for the BcePred Server

  • The input sequence is entered in the submission form in one-letter amino acid code: “acdefghiklmnpqrstvwy” or “ACDEFGHIKLMNPQRSTVWY.” All other letters are converted to “X” and treated as unknown amino acids.

  • Threshold values lie in the range of −3 to +3. The default thresholds for the various parameters were chosen for the best sensitivity and specificity obtained.

  • After the “Submit sequence” button is pressed, a result page is returned that summarizes the query sequence in graphical form (Fig. 4.1a) as well as in tabular and overlap display formats (Fig. 4.1b). The tabular format gives the normalized score of the selected properties for each amino acid residue of the protein, together with the minimum, maximum, and average values of the combined methods selected.

  • Plotting the residue properties along the protein backbone gives a quick picture of the B-cell epitopes on the protein. A residue is considered part of a predicted B-cell epitope when its peak value is above the threshold (the default is 2.38 in the combined approach).
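The input handling and threshold rule described in these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's code: the per-residue scores are taken as given, uppercasing and whitespace removal are assumed preprocessing choices, and all function names are hypothetical.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def sanitize_sequence(raw):
    """Uppercase the input, drop whitespace, and replace other letters with 'X'."""
    cleaned = []
    for ch in raw:
        if ch.isspace():
            continue  # assumption: line breaks and spaces are ignored
        ch = ch.upper()
        cleaned.append(ch if ch in STANDARD_AA else "X")
    return "".join(cleaned)


def residues_above_threshold(scores, threshold=2.38):
    """Return 0-based positions whose score is strictly above the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


def contiguous_segments(positions):
    """Group sorted positions into inclusive (start, end) runs."""
    segments = []
    for pos in positions:
        if segments and pos == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], pos)
        else:
            segments.append((pos, pos))
    return segments
```

Grouping adjacent positions into segments mirrors the way predicted epitope stretches are highlighted in the overlap display; how the server itself merges residues is not specified in the text.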

Fig. 4.1 BcePred server showing B-cell epitope regions in the insulin precursor sequence (length 156 aa) of Aplysia californica. (a) Graphical result. (b) Overlap display in which the selected programs are hydrophilicity, flexibility, accessibility, and turns, with threshold values of 1.9, 2.0, 1.9, and 2.4, respectively. Predicted B-cell epitopes are shown in blue and underlined

Pros and Cons

  • BcePred can predict B-cell epitopes from two or more physicochemical properties at a time, which may make its predictions more accurate.

  • However, the server offers no independent assessment or benchmarking of the available methods, so deciding which residue property or method performs best is difficult.

4.2.3.2 Lbtope

Lbtope is a tool designed to predict linear B-cell epitopes. Its front end was developed with PHP 5.2.9, HTML, and JavaScript, and it is installed in a Red Hat Enterprise Linux 6 server environment. Experimentally validated B-cell epitopes and non-B-cell epitopes were retrieved from the Immune Epitope Database (IEDB) and organized into five datasets: Lbtope_Fixed, Lbtope_Fixed_non_redundant, Lbtope_Variable, Lbtope_Variable_non_redundant, and Lbtope_Confirm. Various models have been developed on these datasets to discriminate B-cell epitopes from non-epitopes.

In Lbtope, the SVM technique is implemented with the SVMlight package, in association with the IBk classifier implemented in Weka.

Working Steps

  I. The input is a primary amino acid sequence in FASTA format (Fig. 4.2a).

  II. For the prediction of linear epitopes, overlapping peptides of 20 amino acids are generated for the Lbtope fixed dataset model and of 5–30 amino acids for the variable dataset models. A nonredundant model is provided as well because of its very high specificity.

  III. The output is the antigen sequence annotated with predicted B-cell epitopes on a probability scale of 20–80% (Fig. 4.2a).

  IV. A higher score means a higher probability that a peptide acts as a B-cell epitope.

Fig. 4.2 (a) Sequence of OspA from Borrelia burgdorferi taken as input, with the predicted B-cell epitopes highlighted along with the probability scale. (b) Output data from peptide submission and mutant generation
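The peptide generation in step II can be sketched as a sliding window over the antigen sequence. The example is a minimal illustration: the step size of one residue is an assumption, since the text does not state how far consecutive peptides are shifted, and the function names are hypothetical.

```python
def overlapping_peptides(sequence, length=20, step=1):
    """Yield (start_index, peptide) for every window of `length` residues."""
    if length < 1 or step < 1:
        raise ValueError("length and step must be positive")
    # The range is empty when the sequence is shorter than the window.
    for start in range(0, len(sequence) - length + 1, step):
        yield start, sequence[start:start + length]


def variable_length_peptides(sequence, min_len=5, max_len=30, step=1):
    """Yield windows for every peptide length from min_len to max_len."""
    for length in range(min_len, max_len + 1):
        yield from overlapping_peptides(sequence, length, step)
```

Each generated peptide would then be scored by the trained model; the scoring step itself is outside this sketch.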

Pros and Cons

  • In addition to B-cell epitope prediction, the server offers a peptide mutation tool. It generates all possible single-point mutants of a given peptide (Fig. 4.2b), calculates their probability scores with the selected algorithm, and predicts other properties. The tool is therefore useful for creating peptide mutants and examining their epitope probability and other desired properties.

  • The model based on the Lbtope_Confirm dataset performed better than the model based on the Lbtope_Variable dataset. However, the performance of these models decreased on the nonredundant datasets.
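Generating all single-point mutants of a peptide, as the mutation tool is described to do, amounts to substituting each position with each of the 19 other standard amino acids. The sketch below illustrates the enumeration only; the tuple layout and function name are hypothetical, and scoring the mutants is left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_point_mutants(peptide):
    """Return (position, original, substitute, mutant) for every substitution.

    Positions are 1-based, as is usual when naming mutations.
    """
    peptide = peptide.upper()
    mutants = []
    for index, original in enumerate(peptide):
        for substitute in AMINO_ACIDS:
            if substitute == original:
                continue
            mutant = peptide[:index] + substitute + peptide[index + 1:]
            mutants.append((index + 1, original, substitute, mutant))
    return mutants
```

A peptide of n standard residues yields 19 × n mutants, so a 20-residue peptide produces 380 candidates.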

4.2.3.3 ElliPro

ElliPro, whose name is derived from Ellipsoid and Protrusion, is a Web server that implements a modified version of Thornton’s method, which identifies continuous epitopes in the protruding regions of the globular protein surface [38]. In addition to a residue clustering algorithm, ElliPro incorporates the MODELLER program [8] and the Jmol viewer (Fig. 4.3b). This allows antibody epitopes to be predicted and visualized in protein sequences as well as structures. ElliPro was evaluated on a benchmark dataset of epitopes derived from 3D structures of antibody-protein complexes, where it reached an area under the ROC curve (AUC) of 0.732 [23].

Fig. 4.3 (a) ElliPro prediction result for myohemerythrin as the input, with PDB ID 2MHR. (b) 3D structures of the epitopes of 2MHR in the Jmol viewer

ElliPro implements three algorithms that perform the following tasks: approximation of the protein shape as an ellipsoid, calculation of the residue protrusion index (PI), and clustering of neighboring residues based on their PI values.

Working Steps

  I. The input is either a protein structure or a primary amino acid sequence.

  II. If only the sequence is available, it can be entered in FASTA format, as single-letter codes, or as a SwissProt/UniProt ID. To model a 3D structure of the submitted sequence, both a BLAST e-value threshold and structural templates from PDB must be selected.

  III. If a structure is available, either a four-character PDB ID is entered or a file in PDB format is uploaded (Fig. 4.3a). If the submitted structure contains more than one protein chain, the user must select the chain on which the calculation is based.

  IV. The thresholds of the parameters used for epitope prediction can be changed: the minimum residue score (protrusion index), denoted S, ranges from 0.5 to 1.0, and the maximum distance, denoted R, ranges from 4 to 8 Å.
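The roles of the two thresholds S and R can be illustrated with a simple clustering sketch. It is not ElliPro's algorithm: protrusion indices and residue coordinates are taken as given, single-linkage clustering is an assumed stand-in for the server's residue clustering, and the default values are arbitrary picks within the stated ranges.

```python
import math


def cluster_protruding_residues(residues, min_score=0.5, max_distance=6.0):
    """Cluster residues with PI >= min_score that lie within max_distance.

    residues: list of (label, protrusion_index, (x, y, z)) tuples, where the
    coordinates (in angstroms) stand for each residue's center.
    """
    selected = [r for r in residues if r[1] >= min_score]
    parent = list(range(len(selected)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(selected)):
        for j in range(i + 1, len(selected)):
            if math.dist(selected[i][2], selected[j][2]) <= max_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i, (label, _, _) in enumerate(selected):
        clusters.setdefault(find(i), []).append(label)
    # Largest clusters first; ties are ordered by their labels.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))
```

Raising S keeps only the most protruding residues, while raising R merges more distant residues into the same predicted epitope.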

Pros and Cons

  • ElliPro is a helpful server for identifying antibody epitopes in protein antigens and is also useful for identifying protein-protein interactions.

  • The server relies on geometrical attributes of the protein structure and requires no training; as a consequence, it cannot properly discriminate between epitopes and non-epitopes.

4.2.3.4 CBTOPE

CBTOPE is a user-friendly Web server established to predict conformational B-cell epitopes from the amino acid sequence of an antigen rather than from its tertiary structure. Its CGI scripts are written in Perl and HTML, and it is installed on a Sun Server (420E) in a UNIX (Solaris 7) environment [1]. The server shows that the primary amino acid sequence of an antigen can be used to predict its conformational B-cell epitopes.

Methodology

  (a) The main dataset for CBTOPE was created by combining 526 antigenic sequences obtained from the IEDB database with a benchmark dataset [23] comprising 161 protein chains derived from 144 antigen-antibody complex structures.

  (b) Sequence redundancy was removed with the program CD-HIT [16] at a 40% cutoff.

  (c) This yielded a nonredundant set of 187 antigens in which no two sequences share more than 40% sequence identity.

  (d) Window patterns are then created. A pattern is labeled positive if its central residue interacts with the antibody and negative otherwise (Fig. 4.4).

  (e) Several models were developed with SVM as the classifier, using pattern features such as the binary profile of patterns (BPP) and the physicochemical profile of patterns (PPP). These reached maximal Matthews correlation coefficient (MCC) values of 0.22 and 0.17, respectively.

  (f) The conventional binary and physicochemical profile features were assessed by fivefold cross-validation.

  (g) All SVM models were trained and assessed on the 187 nonredundant protein chains, which comprise 2261 antibody-interacting B-cell epitope residues.
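Steps (d) and (e) can be sketched as follows: every residue becomes the center of a fixed-size window, the window inherits the label of its central residue, and the window is encoded as a binary profile. This is a minimal illustration, not the CBTOPE code; the window size, the "X" padding at the sequence ends, and the 20-dimensional one-hot encoding are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_patterns(sequence, labels, window=5, pad="X"):
    """Return one (pattern, label) pair per residue.

    labels[i] is 1 if residue i interacts with the antibody, else 0; each
    pattern takes the label of its central residue.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that a central residue exists")
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    half = window // 2
    padded = pad * half + sequence.upper() + pad * half
    return [(padded[i:i + window], labels[i]) for i in range(len(sequence))]


def binary_profile(pattern):
    """One-hot encode a pattern with 20 values per position.

    Padding or unknown residues are encoded as all zeros (an assumption).
    """
    vector = []
    for aa in pattern:
        vector.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vector
```

The resulting vectors and labels are the kind of input a classifier such as an SVM would be trained on.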

Fig. 4.4 CBTOPE prediction result for the insulin sequence of Octodon degus as input. The predicted B-cell epitope is shown in red

Working Steps

  I. The input is an amino acid sequence in FASTA format.

  II. The server creates a total of 19 window patterns for each submitted sequence. The amino acid composition is then calculated to predict the residues that interact with the antibody.

  III. The output is the amino acid sequence mapped with a probability scale from zero to nine for every residue, where zero indicates the lowest probability that the residue is part of a B-cell epitope and nine the highest (Fig. 4.4).

  IV. For high-precision (high-confidence) prediction, a higher threshold should be selected, at the cost of sensitivity. Conversely, a lower threshold should be chosen to predict as many antibody-interacting residues as possible.

  V. The default threshold is −0.3, the value at which sensitivity and specificity were found to be equal during CBTOPE development.
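The trade-off between sensitivity and specificity when the threshold is moved, and the MCC used to compare models, can be made concrete with a small sketch. The scores and labels are hypothetical inputs, the rule "score at or above the threshold counts as predicted epitope residue" is an assumption, and the function names are illustrative.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_metrics(scores, labels, threshold):
    """Return sensitivity, specificity, and MCC at a given threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}
```

Scanning a range of thresholds with `threshold_metrics` and picking the point where sensitivity and specificity meet is one way to arrive at a balanced default of the kind described in step V.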

Pros and Cons

  • Protein structure determination by techniques such as X-ray crystallography is costly, laborious, and time-consuming. With CBTOPE, conformational B-cell epitopes can be predicted easily for antigens that lack a tertiary structure, with better sensitivity and AUC than structure-based methods on the same benchmark dataset; the server uses the CPP composition-based SVM model, which outperformed the others.

  • A limitation of CBTOPE is that it cannot determine the number of residues or the distances required to delineate an epitope segment within the antigen sequence.

4.2.3.5 BepiPred-2.0

BepiPred-2.0 is a Web server that predicts B-cell epitopes with a random forest algorithm trained on epitopes annotated from a dataset of 649 antigen-antibody crystal structures derived from the Protein Data Bank (PDB). The antibody molecules in each complex are identified with hidden Markov models (HMMs).

Methodology

  (a) A random forest (RF) regression algorithm is applied to the dataset with a fivefold cross-validation strategy to estimate the probability that a given antigen residue is part of an epitope.

  (b) Each residue is encoded by its polarity, hydrophobicity, and computed volume, along with the secondary structure (SS) and relative surface accessibility (RSA) predicted by NetSurfP [21].

  (c) The total volume of the antigen, obtained by summing the volumes of all its residues, is included as well, giving a total of 46 variables.

  (d) A rolling average with a window of 9 is applied to the RF output to obtain the final BepiPred-2.0 predictions.

Working Steps

  I. The input is one or more protein sequences in FASTA format, each longer than 10 and shorter than 6000 amino acids, either pasted into the text box or uploaded as a single file.

  II. When the predictions are complete, the user is automatically redirected to the output page (Fig. 4.5), whose navigation bar contains tabs such as “Summary,” which shows the result for each sequence as a horizontal or a vertical table. Optionally, the user can provide an email address to receive a link to the result page when the job has finished.

  III. In the “Epitopes” line above the protein sequence, “E” marks residues whose prediction is higher than the user-defined threshold (0.5 by default); the threshold also determines the background color of the protein sequence. The epitope classification can be changed with the “Epitope Threshold” slider.

  IV. The prediction results can be downloaded in JSON or CSV format from the “Downloads” dropdown tab. Clicking the “All Downloads” tab additionally provides a short descriptive file.
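The smoothing and thresholding described above can be sketched in a few lines. This is an illustration under stated assumptions, not the BepiPred-2.0 code: the per-residue scores are taken as given, the rolling window is centered and truncated at the sequence ends, and the "." used for non-epitope residues is a hypothetical display choice.

```python
def rolling_average(values, window=9):
    """Centered rolling mean; windows are truncated at the sequence ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        low = max(0, i - half)
        high = min(len(values), i + half + 1)
        chunk = values[low:high]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def epitope_line(smoothed, threshold=0.5):
    """Mark residues scoring above the threshold with 'E', others with '.'."""
    return "".join("E" if score > threshold else "." for score in smoothed)
```

Moving the threshold up or down plays the same role as the “Epitope Threshold” slider: fewer or more residues are marked with “E.”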

Fig. 4.5 Sequence markup table of epitope predictions for three antigenic sequences, visualizing the predictions on the sequences in advanced output mode

Pros and Cons

  • BepiPred-2.0 attains a considerably better positive predictive value (PPV) and a moderately better true positive rate (TPR) than other methods. It also outperforms other available tools for sequence-based epitope prediction, such as BepiPred-1.0 and Lbtope, both on a dataset derived from solved 3D structures and on a large collection of linear epitopes downloaded from the IEDB database.

  • The result format is informative as well as convenient.

  • A limitation of BepiPred-2.0 is that it does not accept nucleic acid sequences.

4.3 Tools for T-Cell Epitopes Prediction

The main objective of T-cell epitope prediction is to identify the shortest peptides within an antigen that are immunogenic, that is, able to stimulate either CD4 or CD8 T cells. Immunogenicity depends on three essential events: processing of the antigen, binding of the peptide to MHC molecules, and recognition by a cognate TCR.

Of these steps, MHC-peptide binding is the most selective in determining T-cell epitopes [13, 15]. Consequently, peptide-MHC binding prediction is the main basis for the prediction of T-cell epitopes.

4.3.1 Peptide-MHC Binding Prediction

Peptide-MHC binding prediction relies on peptide sequences already known to bind MHC molecules, which are available from specialized epitope databases such as AntiJen [32], EPIMHC [18], and IEDB [39].

The 3D structures of peptides bound in the groove are similar for MHC I and MHC II molecules, although their binding grooves differ markedly. The peptide-binding cleft of MHC I molecules is formed by a single α chain and is closed, which restricts bound peptides to 9 to 11 amino acid residues whose N- and C-terminal ends are held in place by a network of hydrogen bonds with conserved residues of the MHC I molecule [17, 36]. The peptide-binding groove also contains deep binding pockets with tight physicochemical preferences, which facilitates binding prediction. Peptides of different sizes often use alternative binding pockets of the same MHC I molecule. Hence, methods that predict MHC I-binding peptides require a fixed peptide length. As most ligands have 9–11 residues, a length in this range is the usual choice.

In contrast, the peptide-binding cleft of MHC II molecules is open, which allows the N- and C-terminal ends of the peptide to extend beyond the binding groove [17, 36]; as a result, MHC II-bound peptides vary in length (9–22 residues). However, only a core of nine residues, termed the peptide-binding core, sits within the binding cleft. Consequently, peptide-MHC II binding prediction tools mainly aim to identify peptide-binding cores. Predictions of peptide binding to MHC II molecules are less accurate than those for MHC I molecules because the binding pockets of MHC II molecules are shallower and less demanding [30].
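Because an MHC II-bound peptide can place its nine-residue core in several registers, a predictor has to consider every possible core together with its flanking residues. The sketch below only enumerates the candidates; choosing among them requires a scoring model that is not shown, and the function name and tuple layout are hypothetical.

```python
def candidate_binding_cores(peptide, core_length=9):
    """Return (offset, core, n_flank, c_flank) for every possible core register."""
    cores = []
    for offset in range(len(peptide) - core_length + 1):
        core = peptide[offset:offset + core_length]
        n_flank = peptide[:offset]
        c_flank = peptide[offset + core_length:]
        cores.append((offset, core, n_flank, c_flank))
    return cores
```

A peptide of n residues has n − 8 candidate nine-residue cores, so a 15-residue peptide offers seven registers, whereas a nine-residue MHC I ligand offers exactly one.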

Apart from this, MHC I and MHC II molecules present peptide antigens derived from the endogenous and exogenous pathways, respectively. Endocytosed antigens are degraded and loaded onto MHC II molecules in endosomal compartments [7], whereas antigens degraded in the cytosol are transported by the transporter associated with antigen processing (TAP) to the endoplasmic reticulum and loaded onto MHC I molecules. Before loading, peptides are often trimmed by ERAAP N-terminal aminopeptidases [10].

In addition to MHC I and MHC II peptide-binding prediction tools, tools exist to predict TAP binding; they were developed by training different algorithms on peptides with known affinity for TAP [3].

Peptides that bind a given MHC molecule share conserved amino acids at particular positions, termed anchor residues, which were thought to be responsible for binding. It was later shown, however, that non-anchor residues also contribute to peptide binding to a given MHC molecule [27]. Accordingly, motif matrices (MM) were developed to assess the contribution of every peptide position to MHC binding [19, 25].

Several machine learning (ML) algorithms, trained on datasets of peptides with known affinity for MHC molecules, have been used to address two distinct problems: discriminating MHC binders from non-binders, and predicting the binding affinity of peptides for MHC molecules.

MHC polymorphism is the major challenge in T-cell epitope prediction. In humans, MHC molecules are known as human leukocyte antigens (HLAs), and hundreds of allelic variants exist that bind different sets of peptides, so distinct models are needed to predict peptide-MHC binding. These variants are expressed at very different frequencies, so HLA polymorphism also hampers the development of T-cell epitope-based vaccines for distinct ethnic groups. In spite of these obstacles, various tools are available for the prediction of peptide-MHC binding. Some of them are described in Table 4.2.

Table 4.2 Some freely accessible T-cell epitope prediction tools

4.3.2 Description of Various Tools for T-Cell Epitope Prediction Listed in Table 4.2

4.3.2.1 nHLAPred

nHLAPred is a Web server based on a hybrid approach. First, it includes a quantitative matrix (QM)-based method that takes the contribution of every residue into account rather than only the anchor residues; matrices were built for 47 MHC class I alleles for which at least 15 binders are available in MHCBN version 1.1 [5]. Second, an artificial neural network (ANN)-based method is implemented for 30 of the 47 MHC alleles, those with at least 40 binders available in the database. The combined approach (ANN and QM) is used for predictions on these 30 MHC alleles (Fig. 4.6), while prediction for the remaining 37 alleles relies on QM alone [4]. The hybrid approach reaches an average prediction accuracy of 92.8%, an improvement of 6% over either individual method.

Fig. 4.6 Diagrammatic representation of combining ANNs and QM

The server is installed on a Sun Server 420R in a Solaris environment. It is divided into two main parts, ComPred and ANNPred; ComPred predicts binders for 67 MHC class I alleles. Both parts use proteasomal matrices to identify MHC binders that have a predicted proteasomal cleavage site at their C terminus.

Working Steps

  I. The server uses ReadSeq, developed by Dr. Don Gilbert, so the input protein sequence can be in any standard format.

  II. Quantitative matrices were developed for 47 MHC class I alleles and assessed with a jackknife validation test.

  III. For each amino acid at positions one to nine, a coefficient was calculated from the probability of that amino acid at that position in binders and in non-binders.

  IV. For the prediction of proteasomal cleavage sites, which fall at the midpoint of 12mer peptides (six amino acids from the N terminus), proteasomal and immunoproteasomal matrices were obtained from the ProPred I server [34].
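The idea behind a quantitative matrix in step III can be sketched as a position-specific comparison of amino acid frequencies in binders and non-binders. The log-ratio form and the pseudocount used below are assumptions chosen for illustration; the text does not give the exact formula used by nHLAPred, and all names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(peptides, length=9, pseudocount=1.0):
    """Per-position amino acid frequencies, smoothed with a pseudocount."""
    counts = [{aa: pseudocount for aa in AMINO_ACIDS} for _ in range(length)]
    for peptide in peptides:
        if len(peptide) != length:
            raise ValueError("all peptides must have the same length")
        for position, aa in enumerate(peptide.upper()):
            counts[position][aa] += 1
    frequencies = []
    for column in counts:
        total = sum(column.values())
        frequencies.append({aa: count / total for aa, count in column.items()})
    return frequencies


def quantitative_matrix(binders, non_binders, length=9):
    """Coefficient = log(frequency in binders / frequency in non-binders)."""
    in_binders = position_frequencies(binders, length)
    in_non_binders = position_frequencies(non_binders, length)
    return [
        {aa: math.log(in_binders[p][aa] / in_non_binders[p][aa])
         for aa in AMINO_ACIDS}
        for p in range(length)
    ]


def score_peptide(matrix, peptide):
    """Sum the coefficients of the residues found at each position."""
    return sum(matrix[p][aa] for p, aa in enumerate(peptide.upper()))
```

Because every position contributes a coefficient, non-anchor residues influence the score as well, which is the point the QM-based method makes against motif-only approaches.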

Pros and Cons

  • The server is user-friendly, and its result display format (HTML-II) helps locate promiscuous MHC-binding regions in an antigenic sequence with fair accuracy.

  • However, it has limitations: the quantitative matrix-based method cannot handle nonlinearity in the data, and the ANN-based method requires a large dataset for training.

  • Proteasome cleavage site prediction is less reliable because the specificity of the proteasome is broader than that of MHC-peptide binding. Data on proteasome digestion are also limited. Moreover, cleavage specificity depends on the residues at the cleavage site and on the neighboring residues alike.

4.3.2.2 ProPred1

ProPred1 is an online matrix-based Web server that predicts peptide binding to 47 MHC class I alleles. The matrices were obtained from the BIMAS server and from the literature. The results are displayed in a user-friendly format that helps users identify promiscuous MHC binders in an antigen sequence.

The server can simultaneously predict MHC binders in an antigenic sequence and the standard proteasome and immunoproteasome cleavage sites at their C termini, which helps identify potent T-cell epitopes.

The common gateway interface (CGI) scripts are written in PERL and served by an Apache Web server, installed on a Sun Server (420E) in a UNIX (Solaris 7) environment.

Working Steps

  1. I.

    Input data is the primary amino acid sequence of protein query in any frequently used sequence formats as the server uses ReadSeq to analyze input sequence (Fig. 4.7a).

  2. II.

    There is an independency to select a threshold value for prediction.

  3. III.

    Representation of output data in graphical (Fig. 4.7b) or text form provides assistance to the user in appropriate recognition of promiscuous MHC-binding domains in their query sequence.

  4. IV.

    Firstly, for a given antigen sequence, all probable overlying 9mer peptides are produced followed by a quantitative matrix-based score calculation of selected MHC alleles. A peptide is designated as predicted binder if their score would be superior to a particular threshold value (e.g., at 4%) for selected MHC allele.

  5. V.

    To predict proteasome cleavage sites in an antigenic sequence, overlapping 12mer peptides are generated from the sequence and scored using the weight matrix of the proteasome.

  6. VI.

    Peptides with a score above a certain threshold value (e.g., at 5%) are predicted to carry a proteasome cleavage site at their midpoint (six positions to the left and six positions to the right).

  7. VII.

    Immunoproteasome cleavage sites are predicted in the same way as proteasome cleavage sites.

  8. VIII.

    Simultaneous prediction of MHC binders and proteasome cleavage sites allows the removal of MHC binders that do not have a cleavage site at their C terminus.
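
The working steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the ProPred1 implementation: the matrices are random placeholders rather than the BIMAS or proteasome matrices, the score is assumed to be additive, raw score cutoffs stand in for the percentage thresholds, and all function names and the example sequence are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def placeholder_matrix(length, seed):
    """Hypothetical matrix: one {residue: weight} table per peptide position."""
    rng = random.Random(seed)
    return [{aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS} for _ in range(length)]


def windows(sequence, size):
    """Yield (start, peptide) for every overlapping window of the given size."""
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


def matrix_score(peptide, matrix):
    """Assumed additive score: sum of the weight of each residue at its position."""
    return sum(matrix[pos][aa] for pos, aa in enumerate(peptide))


def predict_binders(sequence, mhc_matrix, threshold):
    """Return {start: 9mer} for every 9mer whose score exceeds the threshold."""
    return {s: p for s, p in windows(sequence, 9)
            if matrix_score(p, mhc_matrix) > threshold}


def predict_cleavage_sites(sequence, proteasome_matrix, threshold):
    """Return 0-based indices i where a cut is predicted between residues i and i + 1.

    Each 12mer is scored; a high-scoring 12mer is cut at its midpoint,
    i.e., six residues to the left and six to the right of the cut.
    """
    return {s + 5 for s, p in windows(sequence, 12)
            if matrix_score(p, proteasome_matrix) > threshold}


def filter_by_c_terminal_cleavage(binders, cleavage_sites):
    """Keep binders whose last residue (start + 8) is directly followed by a cut."""
    return {s: p for s, p in binders.items() if s + 8 in cleavage_sites}


# Hypothetical query sequence and placeholder matrices.
sequence = "MKTLLVAGSILFWQHNPRDEYCKTAMVLGS"
mhc_matrix = placeholder_matrix(9, seed=1)
proteasome_matrix = placeholder_matrix(12, seed=2)

binders = predict_binders(sequence, mhc_matrix, threshold=0.5)
sites = predict_cleavage_sites(sequence, proteasome_matrix, threshold=0.5)
kept = filter_by_c_terminal_cleavage(binders, sites)
print(len(binders), "predicted binders,", len(kept), "with a C-terminal cleavage site")
```

The filter in the last function corresponds to step VIII: a predicted binder is retained only when a predicted cleavage site coincides with its C terminus.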

Fig. 4.7

(a) Sequence submission form of the ProPred1 server, with the protein sequence of the O-antigen polymerase of Shigella dysenteriae as input. (b) Prediction result in graphical format

Pros and Cons

  • ProPred1 was developed to reduce the number of wet-lab experiments needed to identify effective T-cell epitopes and thereby to develop relevant vaccines.

  • However, because data on MHC non-binders are insufficient, calculating the threshold value is somewhat problematic.

4.3.2.3 TAPPred

TAPPred is a user-friendly, support vector machine (SVM)-based Web server designed to predict the TAP-binding affinity and translocation efficiency of peptides. The server runs on the public-domain software package Apache on a Sun server (420R) under Solaris. All Web pages are written in HTML, while the CGI scripts are written in PERL and JavaScript. The SVM is implemented with the freely downloadable software SVMlight.

Working Steps

  1. I.

    The input is a protein sequence in single-letter amino acid code with a minimum length of nine residues. It can be uploaded as a local sequence file or pasted into the input box in any standard format, because ReadSeq is integrated into the server.

  2. II.

    Before running the prediction, the user must indicate whether the uploaded sequence is in plain or formatted form. The server accepts both formatted and unformatted raw antigenic sequences, but selecting the wrong format results in erroneous prediction.

  3. III.

    The server predicts the binding affinity of peptides using one of two SVM variants. The simple SVM relies on the sequence information of the peptides alone and is faster than the cascade SVM, which uses the properties of the amino acids in addition to sequence information.

  4. IV.

    The cascade SVM predicts in two tiers. In the first tier, the properties of the amino acids are combined with sequence information to obtain preliminary results. In the second tier, the results of the first tier are filtered further. Although it is slower, the cascade SVM is more reliable than the simple SVM. Only one approach can be selected for a given prediction.

  5. V.

    Results are shown in two user-friendly formats. In the first format, the result is presented by coloring the residues: the N terminus is marked with a green background, and the remaining residues are shown with a violet-blue background (Fig. 4.8a).

  6. VI.

    The user can choose which types of peptides are displayed in the result.

  7. VII.

    The tabular display (Fig. 4.8b) offers four alternatives, of which only one can be selected at a time.

  8. VIII.

    The selected output display begins with a header that reports the length of the peptide sequence, the nonamers obtained, and the date of prediction.
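
The input handling and the result header described in the steps above might look roughly as follows. This is a minimal sketch under stated assumptions, not the TAPPred code: the function names, the error messages, and the header field names are hypothetical, and the SVM prediction itself is left out.

```python
from datetime import date

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 9  # minimum query length stated for the server


def clean_plain_sequence(raw):
    """Normalize a plain (unformatted) sequence: drop whitespace, use upper case."""
    return "".join(raw.split()).upper()


def nonamers(sequence):
    """Split a validated sequence into all overlapping nine-residue peptides."""
    if len(sequence) < MIN_LENGTH:
        raise ValueError("query sequence must contain at least nine residues")
    invalid = set(sequence) - VALID_RESIDUES
    if invalid:
        raise ValueError("unexpected residue codes: " + "".join(sorted(invalid)))
    return [sequence[i:i + 9] for i in range(len(sequence) - 8)]


def result_header(sequence, peptides, today=None):
    """Hypothetical header: sequence length, nonamer count, and prediction date."""
    return {
        "sequence_length": len(sequence),
        "nonamer_count": len(peptides),
        "prediction_date": (today or date.today()).isoformat(),
    }


# Hypothetical pasted input with line breaks and lower-case letters.
query = clean_plain_sequence("mktllvagsi\nlfwqhnprde")
peptides = nonamers(query)
print(result_header(query, peptides))
```

A sequence of length n yields n - 8 nonamers, so a query shorter than nine residues produces none, which is why such input is rejected before prediction.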

Fig. 4.8

Prediction results from the TAPPred server for CTL as an input sequence. (a) Result displayed as colored residues. (b) Tabular display format

Pros and Cons

  • The server lets users select the prediction parameters of their choice.

  • However, because data on TAP-binding peptides are insufficient, only a limited number of algorithms are available. Also, a query sequence shorter than nine residues is not accepted for prediction.

4.3.2.4 ProPred

ProPred is a graphical Web server that uses a matrix-based prediction algorithm, together with amino acid/position coefficient tables derived from the literature, to predict MHC class II binding regions in antigenic sequences. Predicted binders can be visualized either as peaks in the graphical interface or as colored residues in the HTML interface. It was developed mainly for 51 HLA-DR alleles, whose matrices were extracted from the pocket profile database described by Sturniolo et al. in 1999 [33].

Working Steps

  1. I.

    The input is a protein sequence in FASTA or PIR format, both of which are commonly used standard sequence formats; the sequence can be uploaded as a file.

  2. II.

    To obtain the desired results, the user can customize the selection of alleles, the threshold, and other parameters.

  3. III.

    The analysis of the sequence data produces output as text or graphics. The text display offers two choices. In the first, the binding regions of the antigenic sequence are shown in different colors, which makes them easier to detect. There is also an option to present the binding scores calculated from the matrix in a commonly used tabular format.

  4. IV.

    In the second choice, overlapping regions are shown separately on distinct lines, which makes it easier to delineate specific regions in the display.

  5. V.

    Graphics are generated in GIF format using the GDPlot library developed by Lincoln D. Stein. The output shows the HLA-DR-binding propensity along the primary structure of the protein together with the binding strength, which gives it an advantage over the text presentation.

  6. VI.

    In addition, the server can plot the threshold against the binding peptides, i.e., a threshold profile, which helps in selecting a reasonable threshold value for finding promiscuous binders.
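
The idea behind the threshold profile and the search for promiscuous binders can be illustrated with a short sketch. It is not the ProPred algorithm: the scores are invented for illustration, the threshold is treated as a raw score cutoff (with a score at or above the cutoff counted as binding) rather than the server's own threshold scale, and the function names are hypothetical.

```python
from collections import Counter


def binders_at(scores, threshold):
    """Peptides of one allele whose score reaches the cutoff (assumed rule: >=)."""
    return {pep for pep, score in scores.items() if score >= threshold}


def threshold_profile(scores_by_allele, thresholds):
    """For each cutoff, count the predicted binders of every allele."""
    return {
        t: {allele: len(binders_at(scores, t))
            for allele, scores in scores_by_allele.items()}
        for t in thresholds
    }


def promiscuous_binders(scores_by_allele, threshold, min_alleles=2):
    """Peptides predicted to bind at least min_alleles alleles at the cutoff."""
    counts = Counter()
    for scores in scores_by_allele.values():
        counts.update(binders_at(scores, threshold))
    return sorted(pep for pep, n in counts.items() if n >= min_alleles)


# Invented scores for three peptide frames and two HLA-DR alleles.
scores_by_allele = {
    "DRB1*01:01": {"LVAGSILFW": 2.1, "VAGSILFWQ": 0.4, "AGSILFWQH": 1.5},
    "DRB1*04:01": {"LVAGSILFW": 1.8, "VAGSILFWQ": 1.2, "AGSILFWQH": -0.3},
}

for cutoff, counts in threshold_profile(scores_by_allele, [0.0, 1.0, 2.0]).items():
    print(cutoff, counts)
print(promiscuous_binders(scores_by_allele, threshold=1.0))
```

Scanning several cutoffs in this way shows how quickly the number of predicted binders falls as the cutoff becomes stricter, which is the information a threshold profile is meant to convey.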

Pros and Cons

  • The server evaluates all HLA-DR alleles independently and presents the output on a single screen, which helps the user visualize promiscuous binders quickly. It can therefore be considered a useful tool.

  • The server can compute the binding strength for all peptide frames in a selected subsequence.

  • However, it is less effective at representing overlapping binding regions.

4.3.2.5 EpiDOCK

EpiDOCK is the first structure-based server for predicting peptide binding to the 23 most common human MHC class II proteins, which include 5 HLA-DP, 6 HLA-DQ, and 12 HLA-DR proteins. These alleles cover more than 95% of the human population. The server is reported to identify 90% of true binders and 76% of true non-binders, with an overall accuracy of 83%.

Working Steps

  1. I.

    The input is a protein sequence in FASTA format. Multi-FASTA protein format is also supported.

  2. II.

    Next, the HLA class II protein of interest is selected; this can be a single protein or all proteins.

  3. III.

    Because the peptide-binding core consists of nine contiguous residues, the input sequence is converted into a set of overlapping nonamers. Each nonamer is evaluated with a docking score-based quantitative matrix (DS-QM) for the selected HLA class II protein and assigned a score.

  4. IV.

    A threshold is defined for each DS-QM. Peptides with a score greater than or equal to the threshold are predicted to be binders; all others are considered non-binders.

  5. V.

    A predicted nonamer binder is classified as a correctly predicted binder only if it is part of a known binder sequence; otherwise, it is counted as a false binder. The results are reported in xls or csv format.

  6. VI.

    To validate the predictions, a test set of 7050 known binders to HLA-DR, HLA-DQ, and HLA-DP proteins was used; these binders originate from 1195 proteins and were collected from the Immune Epitope Database.

  7. VII.

    The reported values for specificity, sensitivity, accuracy, and AUC are 0.759, 0.903, 0.831, and 0.892, respectively.
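
The decision rule and the evaluation measures named in the steps above can be made concrete with a small sketch. It assumes the standard definitions of sensitivity, specificity, and accuracy; the function names, the example peptides, and the confusion counts are hypothetical (the counts are round numbers chosen only to mirror the reported percentages and are not the actual test-set counts).

```python
def is_predicted_binder(score, threshold):
    """Decision rule from the text: a score at or above the threshold is a binder."""
    return score >= threshold


def is_correct_binder(nonamer, known_binders):
    """A predicted nonamer counts as correct if it lies within a known binder."""
    return any(nonamer in known for known in known_binders)


def evaluate(tp, fn, tn, fp):
    """Standard measures computed from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true binders recovered
        "specificity": tn / (tn + fp),  # fraction of true non-binders recovered
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical known binder and two predicted nonamers.
known_binders = ["GSILFWQHNPRDEYC"]
print(is_correct_binder("ILFWQHNPR", known_binders))  # lies within the known binder
print(is_correct_binder("MKTLLVAGS", known_binders))  # does not

# Illustrative round counts, not the published test-set counts.
print(evaluate(tp=90, fn=10, tn=76, fp=24))
```

With these illustrative counts, the sensitivity, specificity, and accuracy come out at 0.90, 0.76, and 0.83, the same rounded figures as those quoted for the server.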

Pros and Cons

  • Structure-based approaches require only information about the peptide-MHC protein complex, derived from its X-ray structure, rather than extensive preexisting experimental data.

  • Its predictions are considered reliable.

  • Because experimental testing is resource-intensive, a large number of false positives is a more serious problem than a large number of false negatives when scanning a large proteome.

  • Amino acids having negative coefficients decrease the affinity of peptides for HLA-DRB1.