Introduction

The advent of high-throughput TCR sequencing techniques has changed the way T cell responses are studied (Benichou et al. 2012). As deep sequencing is required to study the highly diverse set of TCR sequences present in each individual, such techniques soon became one of the methods of choice for assaying the TCR repertoire (Rubelt et al. 2017). TCR repertoire sequencing can generate huge amounts of data and can be successfully applied to compare repertoires between individuals and tissues, as well as to track and monitor selected TCR sequences. However, it soon became obvious that the exact functional role is known for only a minor fraction of TCR sequences, which is one of the major bottlenecks in extracting all the available information from TCR repertoire sequencing data (Shugay et al. 2015).

A large number of published assays that involve enriching a T cell sample for cells specific to an antigen of interest and sequencing their TCRs predate the high-throughput sequencing era. Such data, while sparsely organized until recently, are an important source of information that can be applied to annotate TCR sequences and to support high-throughput TCR repertoire profiling studies. Recent efforts to summarize such data gave rise to databases listing TCRs with known antigen specificity: the McPAS-TCR (Tickotsky et al. 2017) and VDJdb (Shugay et al. 2018; Bagaev et al. 2019) databases contained ~ 20 k and ~ 55 k specific TCR sequences, respectively, as of October 2019. Since 2018, the Immune Epitope Database (IEDB) has also provided TCR and antibody sequences linked to specific antigens, with ~ 30 k TCR and ~ 2 k antibody sequences as of October 2019 (Mahajan et al. 2018).

These databases complement the existing set of related immunoinformatics resources that include epitope immunogenicity databases such as IEDB (Vita et al. 2015), MHC-binding analysis tools such as NetMHCpan (Jurtz et al. 2017), and various other T cell epitope prediction tools such as those reviewed in Kar et al. (2018). Both McPAS-TCR and VDJdb are directly applicable to the analysis of large-scale TCR repertoire studies; however, as suggested below, any comprehensive adaptive immunity study requires integrated usage of all of these data sources to delineate interactions between TCRs, antigens, and MHCs involved in the immune response.

Accumulating TCR antigen specificity data

There are several types of T cell assays that can serve as a source of data for a TCR antigen specificity database. The most basic approach is to stimulate a culture of T cells with an antigen of interest in order to select T cells that have undergone an antigen-driven expansion. The stimulation is typically carried out together with a series of limiting dilutions, leading to what is known as a limiting dilution assay (Sharrock et al. 1990). Alternatively, primary T cells can be stimulated and sorted on the basis of expression of surface molecules such as CD69, CD154, and CD137 (Bacher and Scheffold 2013) or production of effector molecules such as IFNγ and perforin. The resulting T cell cultures or cloned TCR sequences can be validated either by antigen re-stimulation and monitoring for intra- and extracellular molecules (cytokine secretion, granzyme B, perforin, etc.) or by direct lysis of antigen-expressing target cells (such as the ⁵¹Cr release assay) (Saade et al. 2012).

Culture-based assays can be carried out using either a specific epitope, a set of peptides, or an entire protein. In the case of protein and peptide mixes, the information about the actual cognate epitope and presenting MHC allele may be lost. It is necessary to note that the VDJdb database includes only those TCR specificity records that list both the exact cognate epitope sequence and the HLA allele presenting it. McPAS-TCR and IEDB, on the other hand, also report TCR specificities resolved up to polypeptide, protein, pathogen, or even pathology level.

One of the main limitations of conventional antigen specificity assays is that they are quite laborious and have a relatively small yield in terms of TCR sequences. The development of peptide-MHC multimer-based techniques allowed accurate identification of specific T cells in both primary samples and expanded cultures (Altman et al. 1996; Dolton et al. 2015). While originally followed by conventional cloning and Sanger sequencing of sorted T cells, such approaches are currently followed by high-throughput sequencing that yields several thousand TCR sequences (Rius et al. 2018). This assay became the main source of specificity records for the corresponding databases, as it can yield several-fold more sequences than experiments that use Sanger sequencing. As this technique yields many TCR sequences with a frequency lower than 1%, it can be biased by sorting contamination and other artefacts. However, as was recently shown, antigen specificity can be validated even for low-abundance TCR variants coming from such experiments (Rius et al. 2018).

A combinatorial approach based on parallel sorting of donor cells for several combinations of peptide-MHC multimers followed by TCR repertoire sequencing of positive and negative fractions can be applied to survey a list of epitopes in a single assay (Klinger et al. 2015; Napolitani et al. 2018). The number of false-positives in this case can be minimized by combining the set of positive fractions sorted for each particular epitope. Further developments in the field of TCR specificity assays are related to the application of DNA-barcoded MHC-multimers and single-cell sequencing, which make it possible to survey multiple antigen specificities for an individual T cell in a high-throughput manner (Bentzen et al. 2016). Recent improvements in methods for tagged MHC-multimer library production and specific T cell isolation promise to greatly increase the yield of T cell specificity assays in the near future (Zhang et al. 2018; Moritz et al. 2019; Saini et al. 2019; Ng et al. 2019).

Peptide-MHC libraries displayed in yeast are a powerful technology for high-throughput screening of peptides against a target TCR (Birnbaum et al. 2014). Besides studying TCR cross-reactivity, the technology can provide data for identifying TCR targets without prior knowledge of the antigen for the TCR of interest (Gee et al. 2018). Two recently developed promising methods take advantage of the “natural” antigen-presentation process in a specific cell line for large-scale screening of potential epitopes against T cells (Kula et al. 2019; Sharma et al. 2019). Combining a library of antigen-presenting cells with a library of cells expressing TCRs and co-stimulatory molecules can lead to a technology for large-scale identification of TCR-epitope pairs (Siewert et al. 2011).

Such studies can greatly increase the size of available TCR specificity knowledgebase in the near future; however, a thorough assessment of the amount of non-specific TCR binding should be performed in order to guarantee that TCRs specific to multiple antigens do represent real T cell cross-reactivity events.

Studying TCR repertoire structure

Several large datasets of human TCR repertoires were generated recently, most notably the collection of 786 samples from Emerson et al. (Emerson et al. 2017) and 79 samples from the Britanova et al. aging study (Britanova et al. 2016), which were used as a reference set for a number of recent bioinformatic studies (DeWitt et al. 2018; Pogorelyy et al. 2018). Such a large collection of donor TCR repertoires is useful in defining public TCR sequences that are found frequently across the population and rare sequences that are private to a specific donor (Venturi et al. 2008; Shugay et al. 2013; Bagaev et al. 2016).
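The distinction between public and private sequences can be illustrated with a minimal Python sketch. The clonotype key (V gene plus CDR3 amino acid sequence), the two-donor threshold, the function names, and the toy sequences are all assumptions made for illustration; they are not taken from the cited studies.

```python
from collections import defaultdict


def donor_sharing(repertoires):
    """Map each clonotype to the set of donors in which it was observed.

    `repertoires` maps a donor ID to an iterable of clonotype keys,
    here hypothetical (V gene, CDR3 amino acid sequence) tuples.
    """
    donors_by_clonotype = defaultdict(set)
    for donor, clonotypes in repertoires.items():
        for clonotype in clonotypes:
            donors_by_clonotype[clonotype].add(donor)
    return donors_by_clonotype


def split_public_private(repertoires, min_donors=2):
    """Call a clonotype public if seen in at least `min_donors` donors.

    The threshold is an illustrative assumption, not a published cutoff.
    """
    sharing = donor_sharing(repertoires)
    public = {c for c, donors in sharing.items() if len(donors) >= min_donors}
    private = set(sharing) - public
    return public, private


# Made-up toy repertoires, for illustration only.
toy_repertoires = {
    "donor1": [("TRBV9", "CASSAGGGF"), ("TRBV7", "CASSLLLLF")],
    "donor2": [("TRBV9", "CASSAGGGF")],
    "donor3": [("TRBV9", "CASSAGGGF"), ("TRBV5", "CASSKKKKF")],
}
public, private = split_public_private(toy_repertoires)
```

With real data, the sharing level would also depend on sequencing depth per donor, which this sketch ignores.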

Moreover, these datasets allow quantifying the incidence of specific T cells across the general population and can serve as a reliable baseline for specific TCR frequency estimates. As was demonstrated previously, the precursor frequency of T cells specific to a selected epitope varies greatly and may be as small as 10⁻⁶ (Alanio et al. 2010; Neller et al. 2015), reaching the detection limit of a regular high-throughput sequencing study (Britanova et al. 2016). On the other hand, there should be a direct link between epitope immunogenicity, i.e., its ability to elicit an immune response in a given individual, and the incidence rate of specific T cells: given an estimate of ~ 3 × 10¹¹ T cells in a human body and a typical naive T cell clone size of ~ 5 cells (Mora and Walczak 2016), an encounter between an antigen and an individual specific T cell is an extremely rare event that is limited by specific T cell abundance. Some recent results do indeed confirm this link, suggesting that the overall frequency of specific T cells is important for forming an immune response (Pogorelyy et al. 2018).
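The orders of magnitude quoted above can be made concrete with a short back-of-the-envelope calculation in Python. The total cell count, clone size, and precursor frequency come from the text; the number of sampled cells and the Poisson approximation for detection are assumptions introduced here for illustration, and treating every T cell as belonging to a clone of typical naive size is a deliberate oversimplification.

```python
import math

TOTAL_T_CELLS = 3e11         # estimate quoted in the text
NAIVE_CLONE_SIZE = 5         # typical naive clone size quoted in the text
PRECURSOR_FREQUENCY = 1e-6   # lower end of the range quoted in the text


def detection_estimates(sampled_cells):
    """Rough numbers for a hypothetical experiment profiling `sampled_cells` T cells."""
    # Upper bound on clone count if all cells sat in clones of naive size.
    n_clones = TOTAL_T_CELLS / NAIVE_CLONE_SIZE
    # Epitope-specific cells in the whole body at the given precursor frequency.
    specific_cells = TOTAL_T_CELLS * PRECURSOR_FREQUENCY
    # Expected number of specific cells captured in the sample.
    expected_in_sample = sampled_cells * PRECURSOR_FREQUENCY
    # Poisson approximation: probability of capturing at least one specific cell.
    p_detect = 1.0 - math.exp(-expected_in_sample)
    return n_clones, specific_cells, expected_in_sample, p_detect


# Assumption: one million T cells profiled in a single experiment.
n_clones, specific_cells, expected, p_detect = detection_estimates(1e6)
print(f"clones ~ {n_clones:.1e}, specific cells ~ {specific_cells:.1e}")
print(f"expected specific cells in sample ~ {expected:.2f}, P(detect) ~ {p_detect:.2f}")
```

Under these assumptions a sample of a million cells is expected to contain about one specific cell, which is why such frequencies sit at the detection limit.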

The structure of an unperturbed T cell repertoire can also be summarized using a model originally developed by Murugan et al. (2012). This elegant model involves a relatively small set of parameters including specific variable, diversity, and joining segment choices, their 5′ and 3′ sequence trimmings and features of randomly added N-bases, and can be trained using non-functional TCR sequences that do not undergo thymic selection. Interestingly, for TCRs with known antigen specificity, TCR population frequencies predicted using this model are in good agreement with those observed in the Emerson et al. dataset (Pogorelyy et al. 2018). Another advantage of using such a model is the ability to subtract the background TCR repertoire structure produced by intrinsic biases of the VDJ rearrangement process. For example, one of the recent studies combined a large TCR repertoire sequencing dataset and a VDJ rearrangement model in order to identify a set of rare TCR sequences associated with an autoimmune disease (Komech et al. 2018).
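The idea of a factorized rearrangement model can be sketched in Python as follows. This is a heavily simplified toy, not the model of Murugan et al.: it omits the diversity segment, assumes uniform probabilities for inserted nucleotides, and uses made-up parameter values. All names (`Scenario`, `scenario_probability`, `generation_probability`, the parameter dictionary keys) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One hypothetical way of assembling a sequence (toy V-J model, no D segment)."""
    v_gene: str
    j_gene: str
    v_trim: int    # bases trimmed from the 3' end of V
    j_trim: int    # bases trimmed from the 5' end of J
    n_insert: int  # number of randomly inserted N-bases


def scenario_probability(s, params):
    """Product of segment choice, trimming, and insertion factors."""
    return (
        params["p_v"][s.v_gene]
        * params["p_j"][s.j_gene]
        * params["p_v_trim"][s.v_gene][s.v_trim]
        * params["p_j_trim"][s.j_gene][s.j_trim]
        * params["p_n_insert"][s.n_insert]
        * 0.25 ** s.n_insert  # assumption: each inserted base is uniform over A/C/G/T
    )


def generation_probability(scenarios, params):
    """Sum over all scenarios that yield the same final sequence."""
    return sum(scenario_probability(s, params) for s in scenarios)


# Made-up toy parameters, for illustration only.
toy_params = {
    "p_v": {"V1": 0.6, "V2": 0.4},
    "p_j": {"J1": 0.5, "J2": 0.5},
    "p_v_trim": {"V1": {0: 0.5, 1: 0.5}, "V2": {0: 1.0}},
    "p_j_trim": {"J1": {0: 1.0}, "J2": {0: 0.5, 1: 0.5}},
    "p_n_insert": {0: 0.25, 1: 0.25, 2: 0.5},
}

# Two hypothetical scenarios assumed to produce the same sequence.
scenarios = [Scenario("V1", "J1", 0, 0, 2), Scenario("V1", "J1", 1, 0, 2)]
p_gen = generation_probability(scenarios, toy_params)
```

Summing over scenarios matters because the same nucleotide sequence can often be produced by several combinations of trimming and insertion; a background frequency estimate of this kind is what allows rare, disease-associated sequences to stand out.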

Applying MHC-restriction rules

MHC presentation of foreign peptides is a limiting factor for TCR-mediated recognition, while thymic selection and TCR-MHC interactions at the periphery shape the T cell repertoire of an individual (Zvyagin et al. 2014; Sharon et al. 2016; Chen et al. 2017). A huge body of knowledge on which epitopes are actually presented by MHC molecules and recognized by TCRs is summarized in the IEDB database (Vita et al. 2015). Peptidomes of various MHC molecules can be utilized to build highly accurate predictors of MHC binding, such as the one implemented in the NetMHCpan software (Jurtz et al. 2017). Searching for TCRs that recognize epitopes that are not presented by donor MHCs, even if they are related to pathogens of interest, makes little sense, as such epitopes are unlikely to ever elicit a strong immune response. One of the recent studies probing the Emerson et al. dataset shows strong patterns of MHC-associated clusters of TCR sequences that are related to common infections (DeWitt et al. 2018). A ubiquitous infection such as EBV, which affects nearly 95% of the adult population, can leave a distinctive mark on the repertoire that can, in theory, be used for MHC typing when the actual donor MHCs are unknown (Pogorelyy et al. 2018). Thus, pathology-associated TCR sequences can be identified based on their HLA linkage in the case of pathologies with strong HLA association. Overall, this suggests that the donor MHC haplotype is critical for any study that aims at donor TCR repertoire annotation and should be utilized to increase the precision of this procedure (Pogorelyy and Shugay 2019).

Studying crystal structures of TCR/peptide/MHC complex

TCR antigen specificity is determined by a set of complex interactions between complementarity determining region (CDR) loops and a peptide presented by the MHC complex (Rossjohn et al. 2015). The ability to model and predict these interactions will ultimately lead to the ability to predict potential binding for TCRs and epitopes that were not studied previously. As current knowledge of TCR specificity is limited to a relatively small set of peptides, studying and modeling interactions in the TCR/peptide/MHC complex is key to resolving unknown epitope specificity for autoimmunity-linked TCRs and predicting optimal TCRs targeting cancer neoantigens.

The number of available TCR/peptide/MHC structures resolved so far is relatively small, fewer than 200 structures (Leem et al. 2018). Several databases and web resources dedicated to TCR/peptide/MHC complexes were recently developed, such as the TCRmodel web server (Gowthaman and Pierce 2018). There are, however, several lessons that can be learned from structural data analysis: for example, only the central part of the hypervariable CDR3 region is in direct contact with the antigen (Rossjohn et al. 2015), and the CDR3 regions of alpha and beta chains contact the central part of the epitope, while CDR1/2 regions contact epitope termini (Egorov et al. 2018).

Integrated analysis of TCR repertoire sequencing data using immunoinformatics resources

A large set of immunoinformatics resources related to various aspects of TCR repertoire and immunopeptidome analysis outlined in previous paragraphs is summarized in Table 1. Integrating various resources is necessary for proper analysis of TCR repertoire sequencing data since straightforward application of a TCR specificity database for TCR repertoire annotation may give rise to many false-positive calls. Moreover, certain tasks, especially those related to tumor immunotherapy and autoimmunity studies, may require complex analysis of both TCR repertoire and peptidome.

Table 1 An overview of resources related to annotation of TCR repertoire antigen specificity. Software tools, databases, and datasets are subdivided into four sections: those related to TCR repertoire structure learning (“TCR repertoire baseline”), MHC binding and epitope immunogenicity prediction (“Epitope mapping”), TCR specificity databases and specificity prediction (“TCR specificity”), and TCR/peptide/MHC structure modeling (“Structural modeling”)

For example, if one would like to identify the imprint of past and ongoing infections (Fig. 1a), donor HLA haplotype information should be taken into account, as it allows filtering out the majority of TCRs that most probably would not recognize a pathogen in a given context (Pogorelyy and Shugay 2019). This strategy can also be applied when dealing with an immune response directed towards rare epitopes: controlling for the baseline structure of the TCR repertoire allows inferring TCR motifs that are present in the data, while common pathogen-specific responses that are always present in donor repertoires can be filtered using the TCR specificity database.
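The HLA-aware annotation step can be sketched in Python as below. This is a minimal illustration under stated assumptions: matching is exact on V gene and CDR3 amino acid sequence, the record layout and function name are hypothetical and do not reflect the schema or API of VDJdb, McPAS-TCR, or IEDB, and the toy records and epitope labels are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class SpecificityRecord(NamedTuple):
    """Hypothetical record layout for a TCR specificity database entry."""
    v_gene: str
    cdr3: str
    epitope: str
    hla: str  # restricting allele, e.g. "HLA-A*02:01"


def annotate_repertoire(clonotypes, database, donor_hla):
    """Return database hits for donor clonotypes, restricted to donor HLA alleles.

    `clonotypes` is an iterable of (v_gene, cdr3) tuples; `donor_hla` is a set
    of allele names written in the same notation as the database records.
    """
    index = defaultdict(list)
    for record in database:
        index[(record.v_gene, record.cdr3)].append(record)

    hits = {}
    for clonotype in clonotypes:
        # Keep only records whose presenting allele the donor actually carries.
        matched = [r for r in index.get(clonotype, []) if r.hla in donor_hla]
        if matched:
            hits[clonotype] = matched
    return hits


# Made-up toy data, for illustration only.
toy_database = [
    SpecificityRecord("TRBV9", "CASSAGGGF", "EPITOPE_A", "HLA-A*02:01"),
    SpecificityRecord("TRBV7", "CASSLLLLF", "EPITOPE_B", "HLA-B*07:02"),
]
toy_clonotypes = [("TRBV9", "CASSAGGGF"), ("TRBV7", "CASSLLLLF"), ("TRBV5", "CASSKKKKF")]
hits = annotate_repertoire(toy_clonotypes, toy_database, donor_hla={"HLA-A*02:01"})
```

Here the second clonotype matches a database record but is discarded because the donor does not carry the restricting allele, which is the kind of false-positive call the HLA filter is meant to remove. A practical pipeline would likely also allow inexact CDR3 matches, which this sketch does not attempt.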

Fig. 1 Answering some of the key questions of adaptive immunity studies by integrating TCR repertoire sequencing, HLA typing, epitope discovery, structural modeling, and TCR specificity predictors. a Exploring the imprint of common pathogen exposure on donor TCR repertoire. Pathogen-reactive TCRs can be inferred by analyzing hyperexpanded TCR variants and groups of homologous TCRs that are unlikely to appear simultaneously in an individual repertoire by chance. Resulting TCR set can be queried against a database of TCR sequences of known antigen specificity to infer the pathogen exposure landscape. HLA-restriction of pathogen epitopes can be exploited to filter out irrelevant TCR matches and decrease false discovery rate. b Estimating immunogenicity of pathogenic and autologous epitopes. Viral and self-peptides that are presented by HLA molecules and are homologous to known epitopes from TCR specificity database are rated based on the frequency of TCRs that can potentially recognize them. Specific TCR frequency can be estimated using the V(D)J rearrangement model and correlated with epitope features. Links between specific TCR frequency and epitope features can be further validated using sets of immunogenic and non-immunogenic epitopes from a database containing T cell assay results for various epitopes. c, d Unraveling T cell reactivity to tumor neoantigens and self-peptides related to autoimmunity. c Candidate TCRs are screened for reactivity towards highly immunogenic tumor neoantigens presented by donor HLAs. A set of potential neoantigens is selected based on their presentation on patient HLAs, distance from self, and overall immunogenicity. Candidate TCRs can be either selected using TCR specificity database in case the tumor expresses known neoantigens, selected from the set of TCR sequences of tumor-infiltrating lymphocytes, or predicted using structural modeling or machine learning methods otherwise. d Potential immunogenic self-peptides presented by donor HLAs are screened based on their affinity towards TCRs observed exclusively in patients with autoimmune disease. Screening process can be guided by HLA association of the pathology in question and by considering molecular mimicry and immunogenicity of self-antigens. Self-antigen ranking can again be performed using molecular modeling of corresponding TCR/peptide/MHC complexes or machine learning algorithms

Integrating TCR specificity and epitope assay databases can be applied to quantify peptide immunogenicity for both pathogenic and self-peptides (Fig. 1b). Population frequencies of T cells recognizing a given epitope can be estimated using both existing theoretical models and available TCR repertoire sequencing data. As one would expect, specific TCR frequencies are a good predictor of antigen immunogenicity (Pogorelyy et al. 2018) and can be correlated to certain antigen features allowing immunogenicity scoring for novel epitopes.

Prediction of potential targets for cancer immunotherapy also relies on selecting a set of neoantigens that are both immunogenic and can be presented by donor HLAs (Fig. 1c). A number of recent studies (Bjerregaard et al. 2017; Thorsson et al. 2018; Rubinsteyn et al. 2018; Schenck et al. 2019) aim at identifying prospective candidates among mutated self-peptides. Methods outlined in the previous paragraph can also be used to rank neoantigens, while TCR specificity prediction algorithms and TCR/peptide/MHC structural modeling can be utilized to select candidate TCRs for tumor immunotherapy.

When only a weak immune response is expected, such as responses towards non-mutated tumor-associated antigens or self-peptides in autoimmunity, structural modeling may aid in identifying cognate epitopes for the corresponding TCRs (Fig. 1d). To date, despite the emergence of several successful database-driven TCR specificity prediction algorithms (Table 1, “TCR specificity” section), no study has reported a pipeline for TCR/peptide/MHC structural modeling that can successfully predict cognate TCRs and epitopes using their primary sequences alone, which remains a major challenge. However, a number of TCR modeling tools were published recently (Table 1, “Structural modeling” section), and we hope that TCR specificity databases such as VDJdb can aid in calibrating such tools in order to allow de novo prediction of T cell antigen specificity.