Introduction

Immunoglobulin heavy- and light-chain genes rearrange early in B-cell ontogeny. The rearranged light chain is formed by the recombination of genes selected from either the variable and joining genes of the kappa locus (IGKV and IGKJ), or the variable and joining genes of the lambda locus (IGLV and IGLJ). The genes and allelic variants of these loci were a focus of early immunoglobulin sequence studies in the mid-1980s. Major studies in the early 1990s led to the conclusion that the description of the human kappa locus was essentially complete (Cox et al. 1994; Zachau 1993), and that IGKV genes show relatively little polymorphism. Nevertheless, a number of later studies reported new alleles (e.g., Atkinson et al. 1996). A study that specifically set out to identify new alleles of the IGKV2-29 and IGKV2D-29 genes reported one new IGKV2-29 allele and two new IGKV2D-29 alleles, leading the authors to suggest that many more unreported alleles probably remained to be detected (Atkinson et al. 1996; Feeney et al. 1996).

In a major initiative in the late 1990s, all reported immunoglobulin genes and allelic variants were incorporated into the ImMunoGeneTics (IMGT) databases of germline sequences and a systematic nomenclature was developed. The IMGT database of germline human kappa chain immunoglobulin gene sequences includes 103 IGKV genes and alleles, including 37 pseudogenes, 55 functional genes and allelic variants and 11 Open Reading Frames (ORFs) (Barbie and Lefranc 1998). Functional genes are defined as such by IMGT if the coding region has an open reading frame without stop codons, and where there are no apparent defects in the splicing sites, recombination signals, or regulatory elements of the genes.

Over the last decade, the number of reported rearranged IGK genes has increased substantially, making it possible to reassess the functionality of reported alleles, and to assess the completeness of the reported repertoire. We have recently used this approach to review the reported human heavy chain gene repertoire (Lee et al. 2006, 2007; Wang et al. 2008). These studies identified both sequences that have been reported in error, and putative polymorphisms that remain unreported. In this study, we reconsider the reported IGKV and IGKJ gene repertoires by analysing features of the reported germline alleles, and by an analysis of the apparent usage of the different germline genes in a database of 1863 rearranged gene sequences.

In order to gain a better understanding of the overall diversity of kappa chains, we also analysed the processes that contribute to junctional diversity of the kappa chain repertoire during the gene rearrangement process. This diversity is introduced by deletions and additions of nucleotides at the joining IGKV and IGKJ gene ends. Analysis is presented showing relatively little addition or loss of nucleotides from the kappa chain gene ends. Analysis of the resulting junctional amino acids shows that the kappa chain repertoire includes a number of dominant sequences that have been reported from multiple independent studies.

Materials and methods

Sequence accumulation

Expressed human immunoglobulin kappa chain sequences were obtained from the European Molecular Biology Laboratory (EMBL) nucleotide sequence database (http://www.ebi.ac.uk/embl/) (Kanz et al. 2005). All sequences were derived from complementary DNA (cDNA) and had a minimum sequence length of 270 nucleotides. Not all sequences included identifiable IGKJ genes.

A database of IGKV germline genes was compiled from the IMGT IGKV gene database (http://imgt.cines.fr/) (Barbie and Lefranc 1998). The publications that reported each germline gene were identified from IMGT annotations, and each publication was reviewed to identify the polymerase chain reaction (PCR) primers that were used. The number of genomic sequences that have been reported for each reported germline gene was also determined.

The germline genes were analysed by multiple sequence alignment, using ClustalW (www.ebi.ac.uk/Tools/clustalw) (Thompson et al. 1994). Germline sequences that had only one or two nucleotide differences to other sequences were noted, as apparent allelic variants can be generated as a result of PCR errors.

Expressed kappa chain sequences were aligned against the IMGT IGKV germline repertoire using the Smith–Waterman algorithm (Smith and Waterman 1981) with the Gotoh optimisation (Gotoh 1982). Clonally elated sequences were identified by shared IGKV genes and shared somatic point mutations. No attempt was made to identify clonally related sequences amongst relatively unmutated sequences.

Partitioning

The 3′ ends of most IGKV genes are C-rich, with most genes ending with the nucleotides CCTCC. The neighbouring N-REGIONs, at the junction of the IGKV and IGKJ genes, are also often likely to be C-rich, because of the bias of N nucleotide addition towards additions of guanine and cytidine, and a tendency for the formation of homopolymer tracts (Jackson et al. 2007). There is therefore a risk that an N-REGION could be mistaken for the end of an IGKV gene, and this risk is particularly high if mutations occur near the junction. In order to accurately characterise the processing of IGKV and IGKJ gene ends, as well as N-REGIONS, kappa sequences were therefore chosen that had little likelihood of including mutations near the ends of the IGKV and IGKJ genes. The LowVMut database was compiled from 448 sequences, which had no more than three mutations in their IGKV genes, between codons 10 and 104.

These sequences were re-aligned, to determine IGKJ genes, using IMGT/VQUEST (Giudicelli et al. 2004), and after the removal of sequences that did not include rearranged IGKJ, 435 sequences remained. Where exonuclease removals meant that a sequence aligned equally well to two alleles, the sequence was ultimately assigned to the allele that was seen more often in the database of expressed kappa chain sequences.

The ends of the IGKV genes were determined by accepting just a single mutation within the nucleotides that were aligned against the final 15 nucleotides of the IGKV germline sequence. Consecutive mismatches at the 3′ end of the IGKV genes were deemed to be N nucleotides, or part of an IGKJ sequence, and where the ends of an IGKV gene included a series of mismatches and matches, no runs of more than two matching nucleotides were deemed to have been removed from the end of any sequence. The start of the IGKJ genes was determined in a similar fashion, and the nucleotides between the ends of the IGKV and the start of the IGKJ genes were defined as the N-REGIONS.

Identification of unreported polymorphisms

The number of mutations in each rearranged sequence was noted from the output of the program, after removal of likely primer-mediated mismatches. The frequencies with which different levels of mutation were seen in alignments to each germline gene were then compared to the overall distribution of mutations in the dataset by Chi-squared test. Where the number of sequences that aligned to a particular gene/allele included an unexpectedly low number of unmutated sequences, additional analysis was performed using multiple sequence alignment. Where shared mismatches were commonly seen at a particular position within a sequence, unreported polymorphisms were inferred.

DNA isolation and amplification

In order to confirm the existence of putative alleles identified in the bioinformatic studies, genomic screening was undertaken. Buccal smears were collected from 12 volunteers, with the approval of the University of New South Wales Human Research Ethics Committee. DNA was extracted and IGKV sequences were amplified as previously described (Dahlke et al. 2006), using primers that were designed using the reported IGKV1-16*01 and IGKV2-30*01 sequences, as follows: 5′-tcccaggtaaggatggagaa-3′ (IGKV1–16 forward primer), 5′-gatggaggcatcaggagaag-3′ (IGKV1–16 reverse primer), 5′-ttgaaatatgaacaaatacatatatagcctg-3′ (IGKV2–30 forward primer) and 5′-ttcagggctgtaccactgtg-3′ (IGKV2–30 reverse primer). PCR products were cloned and sequenced as previously described (Dahlke et al. 2006). Sequences were then aligned against the reported IGKV germline repertoire as described above.

Results

A database of germline IGKV and IGKJ genes was accessed from the IMGT website, and after removal of pseudogenes, 66 reportedly functional IGKV genes and ORFs remained, as well as nine reportedly functional IGKJ genes. The expression of each of the IGKV genes was then investigated by analysis of 1,863 rearranged kappa gene sequences, and the frequency of expression of the different IGKV genes are presented as Table 1. No alignments were seen for 22 reported IGKV genes. Nine of these IGKV genes have been reported from at least two independent studies, and therefore while the functionality of the genes may be questioned, the genes almost certainly exist. Five of the remaining genes are described as ORFs, and a lack of rearrangements could therefore be expected. Five sequences differ from other reported genes by at least three nucleotides, and therefore PCR errors or other PCR artefacts are unlikely to be responsible for their generation. The existence of the final 3 genes (IGKV1-5*02, IGKV3-NL2, IGKV3-NL4) may be queried, but no additional evidence could be found to suggest that these germline sequences were originally reported in error.

Table 1 IGKV gene usage in a database of 1863 rearranged human kappa chain genes

A further 13 germline genes were seen in fewer than five alignments. As rearranged sequences may be misaligned because of the similarity of different IGKV genes, the presence of a handful of alignments in a database of 1,863 rearranged sequences cannot be taken as unequivocal proof that reported genes are real and functional. But the existence of eight of the 13 genes has been confirmed by multiple reports of the germline sequences, while the low level of apparent mutations seen in the rearranged gene alignments gives credence to the existence of three of the remaining five genes. Only one of the final two genes (IGKV2-29*03) is sufficiently similar to another gene to suspect that it may have resulted from PCR error; however, there is insufficient evidence to be certain that this was the case.

A small number of alignments were seen to two sequences (IGKV1D-13*01 and IGKV1-37*01), which have previously been reported as ORFs (Barbie and Lefranc 1998), suggesting that, in some individuals, these may be functional genes.

The apparent absence of unmutated sequences utilising IGKV1-16*01, and the relative absence of unmutated sequences utilising IGKV2-30*01 led us to infer the likely existence of unreported alleles for these two genes. The putative IGKV1-16*02 allele includes AAG rather than AGG as codon 75. All 29 alignments to IGKV1-16*01 included this single nucleotide mismatch in codon 75, suggesting that the germline gene was originally reported in error. The putative IGKV2-30*02 allele includes CAC rather than TAC as codon 31. Of 68 alignments to IGKV2-30*01, 27 sequences included the C/T mismatch in codon 31. This suggests that the IGKV2-30*01 allele was reported accurately, but that an unreported polymorphism exists that may be present in many individuals. We carried out investigations of the putative alleles by genomic screening of buccal swabs from 12 individuals. This confirmed the existence of both IGKV1-16*02 (accession no. FM164406) and IGKV2–30*02 (accession no. FM164408).

The LowVMut database, a smaller dataset of 435 relatively unmutated sequences, was used to investigate IGKJ gene/allele usage, and is available upon request. Analysis was restricted to these sequences because accurate partitioning is essential for investigation of the usage of reported IGKJ2 and IGKJ4 alleles, which only differ at the 5′ ends of the sequences. Database analysis showed that seven of the nine reported IGKJ genes could be identified in rearranged sequences, though their frequency of expression varies from less than 0.5% of sequences (IGKJ2*02) to greater than 25% of sequences (IGKJ1*01, IGKJ4*01) (Table 2). Neither the IGKJ2*03 or IGKJ4*02 alleles were seen in the VLowMut Database, though IGKJ4*02 is well established as a rare allele (Atkinson et al. 1996). IGKJ2*03 has only been reported from cDNA, and its existence must be queried.

Table 2 IGKJ gene usage in a database of 435 relatively unmutated rearranged human kappa chain genes

Summaries of exonuclease removals from the IGKV and IGKJ genes are provided as Fig. 1a and b, respectively. An average of 2.4 nucleotides were apparently removed from IGKV, and 1.8 nucleotides were removed from IGKJ. These are likely to be slight underestimates, for in some cases nucleotide removal will not be identified because of addition of the same nucleotide by the process of N nucleotide addition. The detectable levels of N addition are presented as Fig. 1c. On average, just 1.5 nucleotides were added, and in 46% of cases, there was no addition at all. The lack of both addition and removal of nucleotides means that the overall length of IGK genes shows little variability, as shown in Fig. 1d.

Fig. 1
figure 1

Analysis of a dataset of 435 relatively unmutated rearranged human kappa chain genes, showing (a) exonuclease removals from the IGKV genes, (b) exonuclease removals from the IGKJ genes, (c) N nucleotide addition at the VJ junction, and (d) overall nucleotide gains and losses to the VJ rearrangements as a result of N addition and exonuclease removals

The small amount of processing of the VJ junction, as well as the strong biases in the use of certain IGKV and IGKJ genes, led to additional investigations of IGKV/IGKJ pairings in the expressed repertoire. The ten most common IGKV/IGKJ gene combinations collectively accounted for 40.9% of all alignments (Table 3). These combinations reflect the biases in expression of the IGKV and IGKJ genes, and there were no additional biases in the combinations themselves.

Table 3 Commonly seen IGKV/IGKJ pairings in a database of 435 relatively unmutated rearranged human kappa chain genes, and the most commonly seen amino acid junction sequences that were seen in such pairings

Amino acid sequences of the VJ junction were then analysed, and many examples of identical junctions were noted (Table 3). Not all such identical sequences can be assumed to have arisen independently, for identical sequences were seen that were derived from the same study. Most sequences, however, were independently derived. The eight sequences encoding the dominant junction CQQYGSSPRTF of the IGKV3-20*01/IGKJ4*01 pairing came from numerous sources, including two sequences from patients with chronic lymphocytic leukemia (Z46310, DQ101165), one with systemic lupus erythematosus (AY422527), one with MALT lymphoma (AY281340), one with follicular lymphoma (AM040568), as well as two identical sequences (AB064107, AB064108) from an unpublished study involving an antibody gene library and one sequence (DQ187538) that is said to be derived from pneumococcal polysaccharide-specific B cells. On the other hand, the dominance of the most frequently seen junction in the study (CQQYGSSPLTF) is likely to be exaggerated by the presence of 11 identical sequences derived from a single study (Meijer et al. 2006).

Identical junctions were often produced in different ways. For example, the CQQYGSSPLTF junctions of the IGKV3-20*01/IGKJ4*01 sequences were the result of one, three or six nucleotide removals from the IGKV end, zero, one or three N additions, and zero, one or two nucleotide removals from the IGKJ end. Four different processing pathways led to the same junction sequence. Not surprisingly, very similar junctions were also seen at high frequency. For example, in addition to the seven independent instances of the junction CQQYGSSPRTF amongst IGKV3-20*01/IGKJ4*01 sequences, there were two independent instances each of CQQYGSSP[E/G/Q/W]TF.

Discussion

We have recently shown that 104 of the 226 reported heavy chain IGHV genes almost certainly contain errors and should be removed from the available repertoire (Wang et al. 2008). Many additional IGHV genes may also contain sequencing errors, but there was insufficient evidence to conclude this with confidence. This present study casts some doubt on five of the 55 functional IGKV genes, but there was insufficient evidence to firmly conclude that any of the sequences contained errors.

Most of the errors in reported IGHV genes appear to be the result of Taq polymerase-mediated PCR errors, while some errors came from the use of coding region primers and from the amplification of chimeric sequences (Wang et al. 2008). These errors led to the impression that IGHV genes are more highly polymorphic than is the case. The IGHV3-30 gene, for example, has been reported to have 19 allelic variants (Pallares et al. 1999), with 16 of these variants being identified in a single study (Olee et al. 1991). It seems likely that 14 IGHV3-30 alleles have been reported in error, but the gene is certainly polymorphic, and over half of all IGHV genes are polymorphic (Wang et al. 2008). In contrast, only 11 of the 51 IGKV genes considered here have apparent allelic variants, with no more than three variants being reported for any gene. Functional polymorphisms could initially only be confirmed in this study for IGKV1-5 and IGKV2D-29. In each case, just two functional alleles were identified. With the sequencing of the IGKV2-30*02 allele, we have identified a third gene with functional allelic variants. The locus nevertheless shows little polymorphism, and it is likely that few if any further common IGKV alleles remain to be identified.

Not only is there little diversity in the germline genes of the kappa locus, but little junctional diversity is generated during the process of gene rearrangement. The small amount of processing of rearranged kappa genes has been previously noted (Tomlinson et al. 1995), and parallels reports of lambda gene rearrangements, where the overwhelming majority of sequences are subject to little exonuclease activity and where very few sequences have more than a handful of nucleotides added by TdT activity (Farner et al. 1999). This means that the length of the lambda light chain CDR3 regions is highly constrained (Farner et al. 1999), as was also observed here for the kappa chain CDR3 regions.

We extended these observations by studying the amino acid sequences of the VJ junction, and noted a number of sequences that have been repeatedly observed in independent studies. Highly similar heavy chain sequences, termed stereotypical sequences have also been reported (Stamatopoulos et al. 2007); however, these ‘stereotypical’ heavy and light chain sequences may arise for different reasons. The formation of the heavy chain junction region generates far more diversity than is the case for the light chain, and stereotypical heavy chain rearrangements are truly improbable events. In contrast, ‘stereotypical’ light chains may simply arise because of the overuse of certain IGKV and IGKJ genes, combined with the general lack of exonuclease activity and N nucleotide addition. Although biases towards pairings of downstream IGKV genes with upstream IGKJ genes, and vice versa, have been suggested by modelling studies (Mehr et al. 1999), such biases were not seen here. In fact the prominent IGKV3-20 gene was frequently associated with four of the five IGKJ genes.

Whatever the theoretical diversity of the light chain, the present study suggests that the expressed repertoire is dominated by a relatively small number of amino acid sequences. Many different sequences are each seen at a frequency of more than 1% of randomly selected sequences. The immunoglobulin gene repertoire can therefore be characterised as having extraordinary diversity in the heavy chain, and limited diversity in the light chain. In order to understand the operation of the humoral immune response, it is likely that an understanding of the role of such commonly expressed light chains will be as important as an understanding of the legendary but perhaps overstated diversity of the theoretical repertoire.