Introduction

Over the last 20 years, the amount of information related to epitopes recognized in the course of T- and B-cell-mediated immune responses has dramatically increased. As of November 2004, a PubMed search using the word “epitope” revealed a total of 5,173 records prior to 1975, and 17,088 records in the 1975–1984 period. The number of records jumps to 30,948 for 1985–1994, and has now reached 34,205 for the 1995–2004 period.

Epitope-based techniques allow the accurate and precise characterization of immune responses. This precise definition of immune responses allows understanding and quantitating instances of immune responses in which the microbe is able to overcome the host, or establish chronic infection. This knowledge is ultimately crucial for basic immunological studies, and for designing effective intervention strategies and vaccines, as well as for the rigorous evaluation of new vaccine candidates. Knowledge of the epitopes recognized in the course of natural infection, or as a result of vaccination, is also key for our capacity to develop bioinformatics tools and to accurately analyze, model, and predict immune responses. Recently, a renewed focus on emerging diseases and bioterrorism has emphasized the need to develop a central repository of available information regarding the key epitopes recognized by the immune system.

Despite this growing need, a centralized resource to store and access relevant information is not available. To address this need, the National Institute for Allergy and Infectious Diseases (NIAID) is sponsoring the creation of the Immune Epitope Database and Analysis Resource (IEDB), a comprehensive knowledge center composed of a repository of immune epitope data and associated analysis tools. This repository will host data relating to both B and T cell epitopes from infectious pathogens as well as experimental and self-antigens (RFP-NIH-NIAID-DAIT-03/31). Priority will be placed on pathogens considered to be potential bioterrorism threats and emerging diseases. Epitopes recognized in humans, nonhuman primates, rodents, and other animal species are included. In a coordinated effort to increase our knowledge in the field of emerging diseases and category A–C pathogens, NIAID has also launched a large-scale antibody and T cell epitope discovery program aimed at generating epitope data and analysis resources to be included in the IEDB (RFP-NIAID-DAIT-03-43/-04-39).

The IEDB is not the first database aimed at storing immune epitope information. Table 1 gives an overview of existing databases. By examining their structure during the IEDB design, we were able to capitalize on their experiences. Most components of the IEDB can be found in one of these databases, but none contains them all. For example, the pioneering SYFPEITHI database focuses on carefully mapped epitopes or naturally processed peptides, but does not further annotate the context in which an epitope is immunogenic. The Los Alamos HIV Molecular Immunology Database does provide an extensive context annotation, but is limited to a single virus system.

Table 1 Existing epitope databases

So what are the main distinguishing features of the new IEDB database? First, we will place a priority focus on category A–C pathogens. Second, database dimensions compatible with inclusion of a large number of records will be provided, which is needed to accommodate further growth due to large-scale epitope identification programs. Third, the IEDB will include T cell and B cell epitope data from both humans and other animal species. Fourth, extensive epitope annotation will allow users to customize queries to an unprecedented level of detail (e.g., search by type of assay, type of cytokine released in response, and route of vaccine administration). Finally, the inclusion of all available data including patent data will provide unprecedented comprehensiveness. The patent literature is underrepresented by most epitope databases, but contains a large fraction of the data generated by the biotechnology and pharmaceutical industries.

Another crucial component of the project is the involvement of the scientific community in the design of the IEDB’s scope and capability. The IEDB will be produced in a manner that encourages the incorporation of data and analytical tools derived by research labs at large. The scope of this report is to inform the scientific community of our effort and to solicit feedback while the project is still in a design stage. Herein, we present the specific immunological perspective of the IEDB and the database concepts applied to its design. We envision that this resource center will be freely available on the Internet, and that a prototype will be available and operational in the first quarter of 2006.

Results

What is an epitope? Defining the IEDB scope

An immunogen is defined as a substance used to elicit a specific immune response, whereas an antigen is a substance recognized by an existing immune response and associated molecules such as T cells or antibodies in a recall response. An epitope is defined as the chemical structure recognized by antigen specific receptors of the immune system [antibodies and/or T cell receptors (TR)] (William 1999). In the case of most T cell epitopes, an epitope is defined as a structure that is presented in association with a specific MHC, and is bound by the variable domain of a specific TR (Fig. 1a,b). Likewise, a B cell epitope is defined as a structure bound by the variable domains of a monoclonal or polyclonal antibody (Fig. 1c). Both linear and conformational B cell epitopes are considered within the scope of our project. Linear epitopes (also called sequential or continuous epitopes) are normally defined as short peptide fragments that can bind antibodies raised against the intact protein from which the peptides are derived. Conformational or discontinuous epitopes are defined as the set of antibody binding amino acids that are not adjacent in the primary sequence, but are brought into proximity by the folding of the polypeptide chain. The classification is not clear-cut as discontinuous epitopes may contain linear stretches of amino acids, and continuous epitopes may contain indifferent amino acids. Ninety percent of antibodies raised against proteins react against discontinuous fragments (Van Regenmortel 1996).

Fig. 1
figure 1

Orthogonal views of X-ray structures of immunological complexes available in the Protein Data Bank. a Complex of a human TR, influenza Ha antigen peptide PKYVKQNTLKLAT, and MHC Class II HLA-DR1 (PDB ID, 1fyt). T-cell epitope residues colored in orange. b Xenoreactive complex AHIII 12.2 TR bound to P1049 ALWGFFPVLS and MHC Class I HLA-A2.1 (PDB ID, 1lp9). T-cell epitope residues colored in orange. c IgG1 Fab fragment of E8 antibody complexed with horse cytochrome c (PDB ID, 1wej). B cell epitope and paratope residues colored in red and green, accordingly. The figures were obtained using the WebLab viewer

Over the course of the last 20 years, a sizable amount of data has been accumulated relating to which peptides can bind to specific MHC (Barber and Parham 1993; Buus et al. 1987; Engelhard 1994; Engelhard et al. 2002; Fleckenstein et al. 1999; Flower et al. 2003; Gromme and Neefjes 2002; Hammer et al. 1993; Lauemoller et al. 2000; Lehner 2003; Madden 1995; Margalit and Altuvia 2003; Rammensee 1995; Saveanu et al. 2002; Sette and Sidney 1999; Stevanovic 2002; van der Merwe and Davis 2003; Van Kaer 2002). In addition, a wealth of data has been generated relating to the peptides naturally presented by MHC (Engelhard et al. 2002; Falk et al. 1991; Hickman et al. 2004; Latek and Unanue 1999; Rammensee et al. 1993; Rotzschke et al. 1990; Shastri et al. 1998; Stevanovic et al. 2003; Van Bleek and Nathenson 1990). Taken together, these data have provided the basis for clarification of the chemical specificity of MHC, and for the development of methods to predict and identify peptides binding to, or naturally presented by, MHC (Rotzschke et al. 1991). The fact that a given structure can bind to or is presented by MHC does not guarantee, per se, that the resulting complex will be capable of engaging specific TR and thereby meet the definition of an epitope. However, for the sake of completeness, the IEDB will also include data relating to MHC binding, or lack thereof, and naturally presented MHC ligands since compounds capable of binding to MHC have the potential of being immunogenic or antigenic for T cells.

Intrinsic and extrinsic features associated with epitopes

In designing the database, we started from the basic observation that each epitope is associated with intrinsic and extrinsic features. Intrinsic features are those determined by its structure and its phylogenetic relations, whereas extrinsic features are context-dependent attributes dependent on the experimental or natural environment. In other words, immune epitope information is best represented by capturing not only genomic/proteomic and structural information regarding the pathogen of origin, but also representing the immune system of the host interacting with the pathogen, and the biological circumstances in which the interaction occurs. Accordingly, in our view, information related to epitopes should be represented by considering both intrinsic and extrinsic features.

At the level of T cell epitopes, intrinsic features that we plan to include in the IEDB are the molecular structure of the epitope, its binding affinity for different MHC, and the affinity of epitope/MHC complexes for TR of defined sequence. Likewise, at the level of B cell epitopes, we plan to include intrinsic features such as the epitope’s molecular structure and binding affinity for immunoglobulin (IG) or antibody of defined sequence. These features are unequivocally specified and can be singularly associated with a given epitope structure or receptor/epitope combination. Finally, each epitope may also be linked with objective and quantifiable measurements of its binding affinity to specific immune receptors.

As mentioned above, other features, such as immunogenicity, are not intrinsically associated with a given epitope or structure, but rather are highly context-dependent (i.e., extrinsic). Accordingly, lack of information regarding the context may limit the usefulness of data relating to the epitope’s antigenicity or immunogenicity. Context information includes data representing the immune system of the host interacting with the pathogen and the biological circumstances in which the interaction occurs. Examples of context information and extrinsic features are the species of the host (or vaccine recipient) in which the response occurs, the assay utilized to measure responses, and the dose and route of administration (Brockstedt et al. 1999). Likewise, the yield when processing a given epitope protein precursor is dictated by the overall sequence of the protein in which the epitope is contained, and the general cellular environment in which cellular processing occurs (York et al. 1999).

3D structure as an intrinsic feature associated with epitopes

The Protein Data Bank (Berman et al. 2000a,b) currently provides access to 3D structures of more than 1,600 molecules of immunological interest and their complexes. These structures are valuable for epitope analysis and can be used for prediction tool development. For these purposes, as well as for epitope visualization, the IEDB will provide access to the available 3D structures of epitope/MHC and TR/epitope/MHC complexes, antibodies and antibody/antigen complexes, and proteins of the immunoglobulin superfamily (IgSF) and MHC superfamily (MhcSF) in the PDB. Figure 1 presents examples of X-ray structures of epitopes in complexes with antibody and TR/MHC: (1) complex of a human TR, Influenza Ha antigen peptide, and MHC Class II HLA-DR1 (Hennecke et al. 2000); (2) xenoreactive complex AHIII 12.2 TR bound to P1049 ALWGFFPVLS and MHC Class I HLA-A2.1 (Buslepp et al. 2003); (3) E8 antibody Fab complexed with horse cytochrome c (Mylvaganam et al. 1998). In addition to this, we will consider providing links to IMGT/3Dstructure-DB, http://imgt.cines.fr, which provides standardized annotations for the 3D structures of the IG, TR, MHC, IgSF, and MhcSF, and a query tool for contact analysis (IMGT/StructuralAnalysis) (Kaas et al. 2004).

Immunogenicity and immunodominance of T cell epitopes

In general, the immune responses produced following natural infection or immunization with native antigens does not exploit the full range of epitope possibilities. Rather, only a small fraction of all possible epitopes derived from a given pathogen actually induces an immune response. This phenomenon is known as immunodominance. As stated above, immunogenicity is not determined by the intrinsic features of an epitope, but rather depends on the context. The following are a number of examples illustrating how context influences immunogenicity.

T cell responses against certain epitopes may go undetected because the epitopes are poorly generated in the breakdown of pathogen-derived proteins within the cells of the host (Chen et al. 2001; Kloetzel 2004; Nakagawa and Rudensky 1999; Stoltze et al. 2000; Yewdell 2001; York et al. 1999). The specificity of the TAP transporter (Daniel et al. 1998; Kloetzel 2004; Lauvau et al. 1999; Uebel and Tampe 1999) and the postproteasomal antigen processing (Reits et al. 2004; Rock et al. 2004) may also play a critical role. The abundance of a particular pathogen-derived protein in intracellular compartments and the presence of proteolytic cleavage sites proximal to, distal to, and within the epitope are just two of many factors that determine this yield (Bergmann et al. 1994; Shastri et al. 1995)

Whether an epitope appears to be immunogenic also depends on the assay method utilized. For example, a given Hepatitis C virus-derived epitope elicited T cells that did not recognize naturally processed Hepatitis C virus antigens expressed by transfected cells, as determined by chromium release killing assays. However, the same CTL line was fully capable of recognizing the same transfected cells if the production of IFNγ was utilized as a readout (Alexander et al. 1998). In a different study (Sette et al. 2001), it was similarly shown that Hepatitis B virus peptide-specific CTL lines apparently incapable of killing endogenous targets, or even producing significant amounts of IFNγ in vitro, were nonetheless capable of controlling viral replication in vivo in Hepatitis B virus transgenic mice.

In other cases, T cells specific for a given epitope are not detectable (“holes in the repertoire”; Klein 1982), because of immunoregulatory phenomena, thymic selection, or peripheral tolerance. In certain cases, the hierarchy of epitope recognition can be altered by prior experiences of the immune system as in memory T cells (Cole et al. 1997). Likewise, the deletion of the immunodominant epitope from a protein antigen can also dramatically change the pattern of epitope recognition, resulting in the activation of T cells against previously subdominant peptides (van der Most et al. 1997). Finally, T cells specific for a given epitope might limit the expansion of other T cell specificities by competition for space or nutrients, competition for cell–cell interactions, or by rapidly eliminating or decreasing the pathogen concentration before other epitope specificities can be effectively stimulated (Chen et al. 2000; Grufman et al. 1999).

The outcome of immunization also varies as a function of immunogen structure (synthetic peptide, recombinant vector, protein, attenuated pathogen) and presence or absence of specific adjuvants. Dose, schedule, and route of immunization also significantly influence the nature of the immune response generated (Brockstedt et al. 1999; Doria-Rose and Haigwood 2003; Horwood and Macfarlane 2002; Imami and Hardy 2003; Murray and Jackson 2002; Scheibenbogen et al. 2003; Sondak and Sosman 2003).

Collectively, these examples illustrate that the failure to detect a response against a given epitope does not necessarily signify that a particular epitope cannot be immunogenic in a different context.

Epitopes recognized in B cell responses

Similar to what is described above for T cell epitopes, not all possible structures expressed by a foreign (or self) protein are recognized as epitopes by B cells and result in antibody responses. Intense investigation has focused on elucidation of the factors that determine antigenicity, immunogenicity, and immunodominance in antibody responses (Aguilar et al. 2001; Atassi et al. 1996; Cleveland et al. 2000; Herkel et al. 2002; Ito et al. 2003; Kanduc et al. 2001; Liang et al. 1999; Musiani et al. 2000; Oleksiewicz et al. 2001; Sobotta et al. 2000; Xiong et al. 2001). It appears that, given appropriate conditions, almost any structure, whether natural or man-made, has the potential to elicit or be recognized by antibody responses.

In general, it appears that B cell epitopes are located on the surface of the structures recognized (protein, virus, bacteria), since antigen processing is not involved and therefore B cells only come in contact with surface components. In addition to structural features, other characteristics and variables play a crucial role in determining immunodominance at the B cell level. For example, the breadth and affinity of antibody responses change dramatically as a function of time and repeated antigenic exposures. In these cases, preferential continued stimulation of high-affinity clones under limiting (decreasing) antigen conditions, combined with the somatic mutation of the immunoglobulin genes encoding the antibody molecule itself (William 1999), leads to affinity maturation. Another mechanism for immunodominance relates to the phenomenon of “original antigenic sin” (Berry et al. 1999; Franks et al. 2003), whereby previous exposure to related (cross-reactive) antigens greatly influences which particular antibody specificities are observed upon subsequent vaccination, infection, or immunization. The presence (or absence) of helper T cells, and their TH1 versus TH2 phenotype, also has a dramatic influence on the type of B cell response observed and on the epitopes recognized (Spellberg and Edwards 2001).

Several additional factors determine recognition of a particular protein within a given pathogen, or of an epitope within a protein. These factors include the type of antigen presenting cell initially encountering the antigen, the presence or absence of “danger” signals, and the presence of opsonizing complement factors and/or antibodies (Aalberse and Platts-Mills 2004; Gallucci and Matzinger 2001; Ishida et al. 2001; Janssens and Beyaert 2003; Jarva et al. 2003; Jiang and Koganty 2003; Kayhty 1998). As in the case of T cell epitopes, these examples illustrate that the immunogenicity or antigenicity of a B cell epitope is highly context-dependent. Therefore, to accurately represent B cell immunogenicity or antigenicity in a database, context-dependent variables must be taken into consideration.

System engineering

The previous sections described the immunological considerations behind our definition of an epitope and the type of immunological information we want to capture. This and the following sections describe how this was translated into a database application design.

We are in the process of developing a formal ontology, which defines a common vocabulary for researchers to share information (Noy and McGuinness 2001). Where possible, we try to build on existing ontologies, such as the IMGT-ONTOLOGY (Giudicelli and Lefranc 1999). This is the first ontology in the domain of immunogenetics and immunoinformatics, which provides a controlled vocabulary and the annotation rules for data, and allows the knowledge management of the IG, TR, MHC, IgSF, and MhcSF of human and other vertebrate species (Lefranc et al. 2005b). The formal specification of the terms to be used in the domain of epitope description, binding, and recognition will be developed to ensure as much as possible accuracy, consistency, and coherence with IMGT (Lefranc et al. 2005a, 2004).

The IEDB classes

The data fields in the IEDB are organized hierarchically into classes and subclasses (Noy and McGuinness 2001). The main classes of the IEDB are defined as “Reference,” “Epitope,” “Binding,” and “Context” (Fig. 2). The class “Reference” contains three main subclasses, each representing the possible sources of data, namely literature, patent, and direct submissions. Literature references will be described by a series of slots (or fields), including the PubMed ID, journal title, and authors. Similarly, in the case of patent references, the patent application number, inventors, title of the application, and related fields are utilized to capture relevant information. Finally, in the case of direct submissions, different fields are utilized to capture the relevant information describing the reference. We anticipate that direct submission of large volumes of data through the NIH-NIAID Large Scale Antibody and T Cell Epitope Discovery Program will be one of the main sources of information for the database.

Fig. 2
figure 2

Class diagram for the Immune Epitope Database and Analysis Resource

The next concept of the high level diagram is the “Epitope” class. It is further subdivided in two categories, namely structure and source. Within the Epitope structure category, different fields are meant to capture different types of available information on epitope structure. For example, an epitope could be described by its primary amino acid sequence, by its chemically defined 2D structure, which can be captured in the Simplified Molecular Input Line Entry System (SMILES) (Weininger 1988) notation, or by its 3D structure, which would usually be defined as a subset of atoms in a PDB structure. In the Epitope source category, fields describing the source of the epitope structure are grouped. Examples of these fields are the organism from which the epitope is derived and the relevant GenBank or Swiss-Prot accession numbers identifying the source protein of an epitope.

The “Binding” class captures intrinsic, context-independent information relating to how the structure specified in the “Epitope” class interacts with well-defined receptors of the immune system. Three different subclasses exist: MHC, antibody, and TR. Each subclass encompasses fields relating to the nature of the specific receptors (e.g., PDB, IMGT/3Dstructure-DB, IMGT/GENE-DB, GenBank, Swiss-Prot, or IMGT/LIGM-DB accession numbers), as well as fields relating to the magnitude and modality of interaction such as binding affinity.

The “Context” class is a key novel aspect of our database design. It is organized into three subclasses, including T cell immune responses, naturally processed peptides and B cell immune responses. Within the T cell responses subclass, information fields relating to the TR (of the effector T cells), the immunogen, the administration, and the formulation are found. Other components of this section include fields specifying the source organism from which the MHC of the antigen presenting cells is derived, the antigen, and the type of assay utilized. In general, the subclass “Naturally processed peptides” is similar to the T cell responses category, except for the fact that no effector T cells (and related fields) are present. Likewise, the B cell response subclass is similar to the T cell response subclass, with the main difference that no MHC or antigen-presenting cell is present. This design is robust and flexible and has thus far allowed representation of most immunological situations modeled in our preliminary studies.

As an example of how epitope information would be captured, consider the first sequenced peptide eluted from influenza infected cells (Rotzschke et al. 1990): the reference fields for this epitope would contain the authors, publication year, journal, and so on. The epitope fields would contain the epitope structure, which is its linear sequence TYQRTRALV, and the epitope source, which is the NP protein of the influenza virus strain A/PR/8/34. In this reference, the epitope is associated with two separate context entries. The “naturally processed” context specifies that this epitope was eluted from mouse cells bearing the MHC allele H2-Kd, and that the antigen containing the epitope in the assay was the influenza virus itself. The “Immune response T cell” context specifies that a CTL response was measured in a 51Cr release assay, in which the target cells presented the MHC allele H2-Kd, the antigen used was the epitope itself, and the effector cells were a CTL line derived from mice infected with influenza virus.

Guidelines for populating the database

In recent years, different groups have followed different approaches to the discovery of immune epitopes, and have collected privileged data generated by certain assay types (such as direct MHC binding assays, ex vivo T cell detection assays or elution of naturally processed peptides) for the purpose of epitope definition or validation. In a programmatic sense, we believe that selecting data that fit one particular epitope definition or experimental bias is not our prerogative and would be unwise. Rather, we have opted, as described in the preceding sections, for a comprehensive all-inclusive definition. It is our plan to include as many different data sources as possible, and to minimize or exclude bias at the level of database population. A corollary of this approach is that it is necessary to develop robust and flexible user interfaces, so that the user will not be overwhelmed and will be able to navigate with ease through a large volume of diverse information.

Criteria for data inclusion is publication in peer-reviewed journals, published patents or patent applications, and direct submissions from institutions or companies. Particular attention and priority is given to the inclusion of antibody and T cell epitopes associated with NIAID category A–C pathogens (http://www2.niaid.nih.gov/Biodefense/bandc_priority.htm), which are considered potential bioterrorism agents. They are divided into categories A–C, where category A is considered the worst potential bioterror threat. NIAID has recently solicited proposals for high-throughput identification of epitopes derived from class A–C pathogens, with 14 distinct contracts having been awarded to different research groups worldwide. These newly established epitope discovery contracts are expected to be a major source of new information, but the database will also include epitopes derived from other emerging/reemerging infectious pathogens, allergens, alloantigens, autoantigens, and model antigens.

All individuals who contribute data or analysis resources to the database will be cited, either by authorship or acknowledgment of their contributions. All of the information contained within the Immune Epitope Database and Analysis Resource, as well as all of the contractor-generated materials (e.g., documentation, software source code, analysis tools, and algorithms) will be made freely available to the entire research community for use, improvement, and publication purposes.

The curation process

We have begun the database population task before deployment of the database because curation of an estimated 80,000 possibly relevant journal articles and 8,000 patent records will be time-consuming. We will define metrics so that the first few months of curation will allow us to refine our labor estimates, and to quantify the value of our proposed curation aids and work flow software support system. Scientific literature will be accessed using PubMed to select articles with a deliberately broad query, then potentially further filtered using a machine-learning model to decrease the number of irrelevant items while maintaining an acceptably low rate of missed relevant items.

The curation workflow on the sorted and queued items is carried out and supervised in a three-tier staffing structure (Fig. 3). The first tier consists of curators who are responsible for the bulk of the data entry and annotation. This data entry team is supervised and coordinated by a second tier of lead curators who oversee data entry and conduct spot checks on the entries to ensure accuracy, consistency, and quality of the annotation. They answer questions that the data entry staff might have and resolve issues related to data curation and annotation. The lead curators also oversee bulk data submissions which will result from either agreement with other existing databases hosting epitope information or data transfer agreements. The third tier, called the Epitope Council, is composed of specialists and recognized authorities in various types of epitopes, who provide input and clarification to the two lower tiers. Approximately ten faculty members of the La Jolla Institute for Allergy and Immunology (LIAI), The Scripps Research Institute (TSRI), and the University of California, San Diego (UCSD) now serve in this capacity on the committee. If the epitope council requires advisory input, it can consult the Epitope Annotation Advisory Board (EAAB) composed of experts in various specific disease systems. The Council may seek the input of specific advisors for any disease-specific issue that might arise that requires specialized know-how and expertise, in-depth knowledge of relevant literature, or an in-depth understanding of issues related to the specific pathology, immunology, and biology of a given pathogen or model organism.

Fig. 3
figure 3

The curation process

As populating the database begins, the available data are being compiled from various sources (scientific literature, patent applications, and data submissions), starting from the present date and moving backwards in time, with priority given to items containing epitopes from NIAID’s category A–C priority pathogen list. Our current rough estimate is that a curator will be able to process several filtered journal articles or patents per day. We have staffed appropriately to accommodate the entire literature and patent backlog within 4 years (approximately 1,000 working days), encompassing new entries, and assuming that a reasonable fraction of new publications will have their data submitted by the authors using a prescribed template format, once the database is on-line. These estimates will be refined as the curation effort proceeds and better metrics become available.

Scientific approach for the development of the analysis resource

Our proposal includes the establishment and maintenance of an Analysis Resource for the Immune Epitope Database. This resource will provide direct online access to various analytical tools where allowed by the tool authors, or links to other public web sites if only links are permitted or feasible. The IEDB team will also develop a pallet of tools for public access and use. As the goal of this resource is to service the entire scientific community, it is important that the tools provided cover a wide range of research areas relating to epitope discovery and analysis, and that no particular scientific approach has exclusivity. To identify tool candidates, a list of existing tools of interest was initially generated through extensive literature searches and expert input. This list will be periodically revised and expanded, taking advantage of feedback from the scientific community and NIAID. Where desired functionality is not provided by existing tools, the IEDB team will facilitate the development of new tools. An overview of tool categories is provided in Table 2, and a more detailed description of each tool category is given below.

Table 2 Tool categories for the analysis resource

The list of candidate existing tools comprises prediction tools for identifying novel antibody and T cell epitopes from genome and protein sequences. At the level of antibody epitope predictions, standard methods predicting which regions in a protein are likely to be on the surface will be provided, such as hydrophilicity, solvent accessibility, or thermal mobility analysis. Also, tools using various methods for prediction of MHC binding will be provided that include binary main anchor motif scans (both canonical and extended), linear coefficient matrices, neural networks, and other machine learning prediction methods, for as many different MHC as practical at any given point in time. As mentioned above, tools predicting proteasomal processing and TAP transport of T cell epitopes will also be made available. We anticipate that the prediction quality of all of these tools will increase as more data is deposited in the database, which in turn will allow tool developers to train more accurate models.

Apart from these prediction tools, we will provide analytical tool resources to assist in vaccine discovery and development. The population coverage tool will take a set of epitopes with known MHC restriction as an input, and from that project what fraction of a given ethnic population is likely capable to respond to at least one of the epitopes, using MHC allele frequency tables. The conservancy analysis tool will take an epitope and a set of protein sequences as input, and assess the degree of conservancy of the epitope in the sequences, which could be various isolates of the same pathogen or related pathogens. Finally, tools to visualize data will be provided, such as tools displaying antibody antigen interactions where 3D structural information is either available or can be generated by predictive algorithms.

In terms of how many tools should be hosted, a balance has to be achieved between being too selective, which may leave user demands unaddressed, and hosting too many tools, so that the collection becomes overly redundant and unmanageable. On one hand, resources are required to thoroughly evaluate, host, and maintain the tools, making prioritization necessary in terms of tools actually hosted by the IEDB. On the other hand, links to other websites hosting specific tools can be almost all-inclusive.

To facilitate an objective and transparent choice of which predictive tools should be hosted, the predictions of all candidate tools will be evaluated. This evaluation will be periodically revised as new sets of testing data become available. Most importantly, we plan to make all evaluations (and the methods utilized) publicly available through the IEDB website, and we will encourage all different scientific groups to participate by submitting tools and evaluating data. Such prediction “contests” have had a tremendous positive impact in the evaluation and prediction of protein structure (Fischer and Rychlewski 2003; Moult et al. 2003; Vajda and Camacho 2004; Venclovas et al. 2003; Wodak and Mendez 2004). The set of evaluations proposed herein represent, to the best of our knowledge, the first attempt at a rigorous and comprehensive evaluation of immune response associated prediction tools. Input, feedback, and collaboration with the scientific community will be a key component of this endeavor.

Discussion

In this work we have described the blueprint for the development of an immune epitope knowledge center, composed of an epitope database and an associated analysis resource. Our vision is based on the definition of epitopes as structures interacting with B or T cell antigen specific receptors, or MHC. In addition to the structures defining the epitopes, we are attempting to also represent context-dependent features of immune epitopes, such as the type of responses evaluated in a particular immunological context.

In terms of developing an Analysis Resource, our approach is inclusive and rigorous. We plan to include a variety of different tools aimed at assisting vaccine design, evaluating immune responses, and identifying T cell and B cell epitopes. Our plans also call for a rigorous, unbiased evaluation of different tools, and open dissemination of the results of the analysis.

In conclusion, our vision is to develop an Immune Epitope Database and Analysis Resource. Our goal is to empower researchers working on immunological evaluations by quickly accessing relevant information and bioinformatics tools. As a result, we hope to facilitate basic research studies and the development and evaluation of prophylactic and therapeutic interventions against new and old, emerging and reemerging disease threats.