1 Human Protein-Coding Genes

In this section, I first outline the basic concepts of human protein-coding genes and describe the structure of a typical human protein-coding gene. I then describe two exceptional categories: genes of the immune system and pseudogenes.

1.1 Concepts of Human Protein-Coding Genes

According to the conventional textbook concept, human protein-coding genes are regions of the human genome that begin at transcription start sites (TSSs) and end at transcription termination sites, sometimes including the promoter regions upstream of the TSSs. Introns, which are spliced out after transcription, are also considered parts of genes. The gene regions are transcribed into RNAs, which are then modified by splicing, 5′-capping, and polyadenylation to produce mature messenger RNAs (mRNAs). The mRNAs have open reading frames (ORFs) that carry the information necessary for translation into proteins. Upstream and downstream of the ORFs are the 5′ and 3′ untranslated regions (5′ UTRs and 3′ UTRs), respectively.

This classic concept of human protein-coding genes has been gradually eroded by the large amount of comprehensive experimental data produced in recent years. For example, it had been thought that mRNA molecules are repeatedly transcribed from one particular genomic position, producing many copies of mRNAs with identical nucleotide sequences. However, transcriptome studies have revealed that the positions of TSSs are highly variable. Cap analysis of gene expression (CAGE) is a technique for precisely determining the genomic locations of TSSs on a genome-wide scale: a library of tags of about 20 bases is prepared from the 5′ ends of full-length cDNAs, the tags are sequenced, and the sequences are mapped onto the genome sequence. Using CAGE, the FANTOM and Genome Network projects found that human genes have broad and narrow types of TSS and that each gene has its own characteristic pattern of TSS distribution (Carninci et al. 2006). In other words, a TSS is not a single predetermined nucleotide of the genome but a region of a certain length. Some genes have a sharp, single dominant TSS peak, whereas others have multiple, broad peaks.

In addition, many important discoveries of the last decade, such as the existence of many noncoding RNA genes other than rRNAs and tRNAs and of many genes with alternative splicing isoforms, have drastically changed our concept of human genes. Some of these are described in more detail in the following sections.

1.2 A Typical Human Protein-Coding Gene

Human genes vary greatly in size and structure. While there are many single-exon genes, there are also genes with more than a hundred exons. Some human genes encode very small proteins of fewer than 100 amino acids, whereas others encode huge proteins of 36,000 amino acids (titin). To give a clear picture of such diverse genes, a general description of an average protein-coding gene may help. I therefore show here some features of an average human protein-coding gene, such as ORF length, number of exons, and number of alternative splicing isoforms (Table 4.1). I also compare them with those of single-exon genes, which are a minority of human genes.

Table 4.1 Statistics of average human genes

The average structure of multi-exon genes is quite different from that of single-exon genes (Table 4.1). The average length of transcripts (after splicing) is 2.6 kb for multi-exon genes but only 1.7 kb for single-exon genes. Multi-exon genes have ORFs of 1520 bp (506 amino acids) on average, while single-exon genes have ORFs of 708 bp (235 amino acids), about half the length. Upstream and downstream of these ORFs are short 5′ UTRs and relatively long 3′ UTRs, respectively.

Additionally, the majority of human single-exon genes encode G-protein-coupled receptors (GPCRs), including hundreds of olfactory receptors, that have important cellular functions and are frequently utilized as drug targets.

1.3 Number of Protein-Coding Genes

How many protein-coding genes exist in the human genome has long been a question of great interest, especially among genome scientists, partly because answering it was an important milestone of human genome research. Genome scientists were so enthusiastic that, while the human genome-sequencing project was under way, they placed bets on the number of human genes (Pennisi 2007). However, it is still not clear whether we have reached the final answer, even though more than 10 years have passed since the completion of the human genome-sequencing project in 2004.

The numbers of protein-coding genes in the major eukaryotic model organisms were determined between 1996 and 2000: about 6000 in yeast (Goffeau et al. 1996), 19,000 in worms (C. elegans Sequencing Consortium 1998), and 13,600 in fruit flies (Adams et al. 2000). In contrast, the number of human protein-coding genes remained controversial even after the draft sequence of the human genome was published. For example, 39,114 genes were predicted from the human genome draft sequences determined in 2000, whereas the RefSeq mRNA dataset contained only 11,015 known human genes, a great discrepancy (Hogenesch et al. 2001). Around the same time, extrapolation from the number of genes on chromosome 22, which had been sequenced by then, and from the number of expressed sequence tags (ESTs) gave an estimate of around 35,000 human protein-coding genes (Ewing and Green 2000). Finally, after the completion of the human genome-sequencing project, the number was revised to only 20,000–25,000 (International Human Genome Sequencing Consortium 2004). Many scientists were surprised, because the number had been thought to lie between 30,000 and 100,000 before the completion of the human genome project and because it was not much different from those of fruit flies and other animals.

Recent human proteomics studies have validated many of the human protein-coding genes. A proteomics study of 30 human tissues (including 7 fetal tissues) validated 17,294 human proteins by mass spectrometry (Kim et al. 2014). Another study of 32 human tissues and organs identified 17,132 human proteins by mass spectrometry, monoclonal antibodies, or other techniques, out of 20,344 putative protein-coding genes supported by RNAseq data (Uhlén et al. 2015). These studies have provided good protein-level evidence for many human protein-coding genes, but more studies will be needed to identify “minor” proteins that are used in a specific tissue at a particular time. Further proteomics studies of various human tissues at various developmental stages will allow us to discriminate better between protein-coding genes and noncoding RNA genes, leading to a more precise number of human protein-coding genes (Imanishi et al. 2013; Gaudet et al. 2015).

1.4 Immunoglobulin and T-Cell Receptor Genes

The gene complexes that differ most strikingly from the majority of human protein-coding genes are probably the immunoglobulin (Ig) and T-cell receptor (TCR) genes. These genes, which encode proteins central to the immune response, undergo somatic rearrangements and mutations during the maturation of B-cells and T-cells. This produces enormous variation in gene sequences and allows the recognition of a wide variety of antigens. The hypervariability is mostly due to combinatorics: during somatic rearrangement, one gene is selected from many variable (V) genes, one from many diversity (D) genes, and one from many joining (J) genes, which greatly increases the number of possible combinations.

Ig loci are comprised of a heavy-chain locus (IGH locus at 14q32.33) and two light-chain loci, lambda (IGL at 22q11.2) and kappa (IGK at 2p11.2). All these loci undergo somatic V(D)J rearrangements that produce great diversity of Ig proteins. The TCR loci are comprised of alpha/beta TCR loci (TRA at 14q11.2 and TRB at 7q34) and gamma/delta TCR loci (TRG at 7p14 and TRD at 14q11.2). In the same way as Ig loci, TCR loci undergo somatic V(D)J rearrangements to produce hypervariability in the complementarity-determining regions (CDRs) of the TCR molecules.

In fact, the immunoglobulin heavy-chain (IGH) locus has 167 V regions, 27 D regions, 9 J regions, and 11 constant (C) regions according to the HGNC human gene nomenclature database. The number of possible combinations of V, D, J, and C regions is therefore 167 × 27 × 9 × 11 = 446,391. The T-cell receptor beta (TRB) locus has 68 V regions, 2 D regions, 14 J regions, and 2 C regions if pseudogenes are included, giving 68 × 2 × 14 × 2 = 3808 possible combinations. Because these molecules form heterodimers, the number of possible combinations is even larger. Thus, although there are only three Ig loci and only four TCR loci, the mechanism described above produces the hypervariability of these molecules.
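The combinatorial counts quoted for the IGH and TRB loci can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic: the dictionary layout and the function name are illustrative choices, and the segment counts are simply the figures given in the text.

```python
from math import prod

# Segment counts as quoted in the text (TRB counts include pseudogenes).
SEGMENTS = {
    "IGH": {"V": 167, "D": 27, "J": 9, "C": 11},
    "TRB": {"V": 68, "D": 2, "J": 14, "C": 2},
}

def combinations_per_locus(counts):
    """Number of ways to choose exactly one segment of each kind."""
    return prod(counts.values())

for locus, counts in SEGMENTS.items():
    print(locus, combinations_per_locus(counts))
```

The sketch counts only the choice of one segment per class at a single locus; it does not model heterodimer pairing or somatic mutation, both of which increase the diversity further.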

1.5 Pseudogenes

Pseudogenes are sequences that are similar in structure and sequence to functional genes but have lost their original function by some mechanism. There are two main routes by which pseudogenes arise. One is the inactivation of one copy of a duplicated functional gene. This type of pseudogene is quite common, because duplication provides redundant copies of functional genes, and the loss of one of multiple copies does not harm the survival of the organism. The second type is the processed pseudogene, which arises by reverse transcription of an mRNA. Processed pseudogenes lack intronic sequences and sometimes retain poly-A sequences, so they are easily recognized in genome sequences. There are also two ways in which pseudogenes lose their function: fixation of nonsense mutations in ORFs and silencing of genes by mutations in gene control regions.

In general, most functional genes are conserved during evolution. This means that functional genes are subject to functional constraint and that, if they lose their function by mutation and turn into pseudogenes, the loss should be harmful to the organism. It is thus safe to assume that pseudogenization is disadvantageous in most cases. To avoid such a genetic load, redundant copies of functional genes must be produced by duplication before pseudogenization takes place. This is why many pseudogenes have functional counterparts in the genome.

Some functional genes originate from processed pseudogenes. If we carefully examine the human and mouse processed pseudogenes that can be found by comparing transcriptome and genome sequences, we find many transcribed pseudogenes with intact ORFs. They are indistinguishable from functional genes except that their gene structures lack introns. Up to 1% of processed pseudogenes seem to have been reactivated and to have become functional genes (Sakai et al. 2007). Thus, functional resurrection has taken place in a small fraction of processed pseudogenes, providing a way of producing new functional genes in the genome.

According to the human GENCODE database (version 24), there are 14,505 pseudogenes in the human genome, including 10,728 processed pseudogenes and 3295 unprocessed pseudogenes (Pei et al. 2012).

2 Noncoding RNA Genes

One of the most significant findings of the human transcriptome studies carried out extensively alongside the human genome studies is the discovery of abundant transcripts that contain no apparent open reading frames and thus lack the potential to produce proteins. This finding was not anticipated from the analysis of the human genome sequence, because such noncoding RNA (ncRNA) genes could not be predicted from the genome sequence alone. ncRNAs have now become an established category of genes, and more and more functional ncRNA genes are being discovered in the human genome.

2.1 Classification of Noncoding RNA Genes

In the first years of human transcriptome studies, expressed sequence tags (ESTs) were used to identify many genes that encode human proteins (Adams et al. 1992). Later, the transcribed regions of the human genome were comprehensively surveyed using genome-wide tiling arrays (Bertone et al. 2004; Cheng et al. 2005) and cDNAs (Imanishi et al. 2004; Genome Information Integration Project and H-Invitational 2, 2008), which gradually revealed the whole picture of human transcribed genes. These studies also identified many transcripts that have no apparent ORFs. It was possible that these transcripts had very short ORFs encoding functional peptides, but this hypothesis was not supported experimentally at the protein level. Later, at least some of these transcripts turned out to function as RNA molecules.

Aside from mRNAs, which carry the genetic information for protein synthesis, classical biology textbooks introduce ribosomal RNA (rRNA), a component of ribosomes, and transfer RNA (tRNA), which transports amino acids during protein synthesis. rRNA is an essential component of ribosomes and the most abundant RNA molecule in the cell. In eukaryotes, the large ribosomal subunit contains the 28S, 5.8S, and 5S rRNAs, while the small ribosomal subunit contains the 18S rRNA. tRNA, on the other hand, is a small RNA that carries specific amino acids to the growing polypeptide during protein synthesis. Each tRNA is joined to its specific amino acid by an aminoacyl-tRNA synthetase to become an aminoacyl-tRNA. The correct amino acid is then added to the polypeptide by matching the codons of the mRNA in the ribosome with the anticodons of the tRNAs.

Later, many other new classes of functional ncRNAs have been discovered (Hirose et al. 2014). They can be classified into long noncoding RNAs (lncRNAs) that are typically longer than a few hundred bases and short noncoding RNAs that function in gene expression regulation. Until now, noncoding RNA genes have been roughly classified into six classes by their length, structure, and function (Table 4.2). However, noncoding RNA is one of the most enthusiastically studied subjects as of now, and it is highly probable that new members of ncRNAs as well as new classes of ncRNAs will be discovered in the future. Data in Table 4.2 should thus be regarded as a tentative snapshot of the functional ncRNAs.

Table 4.2 Classification of noncoding RNAs

2.2 Long Noncoding RNAs

Among RNAs other than mRNAs, rRNAs, and tRNAs, those that are longer than a few hundred bases and have poly-A sequences like mRNAs are called long noncoding RNAs (lncRNAs). Most members of this class bind to proteins and function in various cellular processes such as chromatin modification, transcription, and splicing. Examples of human lncRNAs include X inactive-specific transcript (XIST), H19 imprinted maternally expressed transcript (H19), and nuclear paraspeckle assembly transcript 1 (NEAT1). Among them, XIST is the most extensively studied.

XIST is a master controller of the X chromosome inactivation in females. In female somatic cells, one of the two copies of the X chromosome is inactivated. Because the inactive X chromosome is randomly chosen, expression of X chromosomal genes will be a mosaic in the female cells. The X chromosome from which the XIST gene is expressed is inactivated. XIST is a long noncoding RNA of 17 kb that is spliced and polyadenylated. The function of XIST is to bind with inactivated X chromosomes and trigger X chromosome inactivation.

As explained above, XIST RNAs bind with genomic DNA and some proteins in the nucleus. There seem to be many other lncRNAs that function in a similar way; they bind with certain RNA-binding proteins and function as ribonucleoproteins in the nucleus. However, most of their functions remain unresolved, which are to be further studied in the future.

According to lncRNAdb version 2.0, a database of lncRNAs, there are at least 76 human lncRNA entries of possible functional significance (Quek et al. 2015). There are also hundreds of noncoding transcripts in the human transcriptome database H-InvDB, but only a small fraction of them have a known function (Imanishi et al. 2004; Takeda et al. 2013). Such transcripts may include products of un-induced or leaky transcription from the human genome.

2.3 miRNA and Other Small ncRNAs

Other classes of short RNAs that clearly differ from the RNAs described above have also been found. These small ncRNAs have turned out to be highly diverse and can be classified into miRNAs, snRNAs, snoRNAs, and many other minor RNAs. Here, I outline these small ncRNAs.

Small nuclear RNAs (snRNAs) are a group of small RNAs found in the nucleus. They are involved in important functions such as splicing of mRNAs and maintenance of telomeres. Each snRNA binds to specific proteins to form a small nuclear ribonucleoprotein (snRNP). The best-known snRNAs are the five components of the spliceosome (U1, U2, U4, U5, and U6), whose snRNPs function in splicing reactions. There are 65 snRNA genes registered in the HGNC database, which provides the official nomenclature of human genes (Table 4.2).

Small nucleolar RNA (snoRNA) is a class of small RNAs that function in chemical modifications of other RNA molecules such as rRNAs and tRNAs. There are two major classes of snoRNAs: C/D box snoRNAs function in methylation, and H/ACA box snoRNAs function in pseudouridylation. snoRNAs are associated with some proteins to form ribonucleoproteins (called snoRNPs). snoRNAs bind to target RNAs that have complementary sequences to parts of snoRNAs, and associated proteins catalyze chemical modifications. According to snoRNA-LBME-db, a database of snoRNAs, there are 269 C/D box snoRNAs and 108 H/ACA box snoRNAs. On the other hand, there are 498 entries in the HGNC database.

MicroRNA (miRNA) is yet another class of small noncoding RNAs and a key molecule in the posttranscriptional regulation of gene expression. miRNAs have sequences complementary to specific protein-coding genes, and they bind to the 3′ UTRs of mRNA molecules to form double-stranded RNAs, which triggers degradation of the target mRNAs. After transcription from the genome, miRNA precursors form a characteristic stem-loop structure through internal complementary sequences and are then processed by Dicer proteins and the RNA-induced silencing complex (RISC) to become mature miRNAs of 21–22 nucleotides. According to miRBase, a database of miRNA genes, there are 1881 miRNA genes in the human genome (Table 4.2). It has been estimated that each miRNA has about 400 target mRNAs on average (Friedman et al. 2009).
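The idea that a miRNA pairs with a complementary stretch of a 3′ UTR can be illustrated with a toy search. The sketch below is not a target-prediction method: both sequences are invented, and the choice to match the first seven bases of the miRNA is an arbitrary assumption made only to keep the example small.

```python
# Watson-Crick pairing for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Sequence that pairs with `rna` when both are written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def find_complementary_sites(mirna, utr, window=7):
    """0-based UTR positions that pair perfectly with the first `window` miRNA bases."""
    target = reverse_complement(mirna[:window])
    return [i for i in range(len(utr) - window + 1)
            if utr[i:i + window] == target]

# Hypothetical sequences, for illustration only.
toy_mirna = "UACGGAUCCGAUUAGCCAUGCA"
toy_utr = "GGCAUCCGUAACGU"

print(find_complementary_sites(toy_mirna, toy_utr))
```

Real miRNA target recognition is considerably more involved than an exact substring match, so this should be read purely as an illustration of base-pair complementarity.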

Because small noncoding RNA is an actively studied area of biology, it is highly probable that new classes of functional ncRNAs will be discovered and their classifications will be modified in the future.

3 Alternative Splicing

By the time the human genome-sequencing project was officially completed, it had become clear that the total number of human genes is as low as 20,000–25,000, much lower than previous predictions (International Human Genome Sequencing Consortium 2004). On the other hand, some researchers predicted that the number of different human proteins is around 100,000. One mechanism that can partly fill the gap between the two numbers is alternative splicing (AS). It is now known that more than 90% of human genes undergo AS. Nearly 100,000 different kinds of proteins are produced by AS from some 20,000 genes, expanding the diversity of the human proteome (Nilsen and Graveley 2010). The one gene-one protein relationship therefore no longer holds. Furthermore, we postulate that many unidentified AS variants exist that are specific to particular tissues or developmental stages. To reveal the whole picture of AS, we thus need to investigate comprehensively the transcriptome of each of the several hundred kinds of human cells.

3.1 Mechanisms of AS

Splicing is a rigorously regulated reaction that takes place in the nucleus and removes introns from the pre-mRNA molecules transcribed from the genome. Usually, there are recognizable sequence signals for splicing in both introns and exons that precisely determine the positions of splice junctions. In principle, therefore, each gene would produce one kind of mRNA and hence one kind of protein. However, when, for example, nucleotide substitutions have weakened the signals that regulate splicing, the splicing machinery no longer works in a perfectly uniform manner, and splicing may occur at positions different from the original one. This is thought to be one of the possible mechanisms of AS.

A splicing reaction usually joins the 5′ end of an exon to the 3′ end of the immediately preceding exon. In contrast, in alternative splicing of the cassette exon type, an exon is joined not to the immediately preceding exon but to one further upstream, so that the exon in between is skipped. This can produce two kinds of mRNAs: long mRNAs that contain the exon in question and short mRNAs that lack it (Fig. 4.1).

Fig. 4.1 A model of alternative splicing. Exon 3 of this gene is an alternative exon, and depending on whether exon 3 is included (upper panel) or excluded (lower panel), two kinds of mRNAs are generated. This type of alternative splicing is called the cassette type.
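The cassette exon model can be mimicked by joining exon sequences with or without the alternative exon. The following is a minimal sketch; the exon names and sequences are hypothetical and serve only to show that the two mRNAs differ by exactly the skipped exon.

```python
def splice(exons, skip=()):
    """Join exon sequences in order, omitting any exon whose name is in `skip`."""
    return "".join(seq for name, seq in exons if name not in skip)

# Hypothetical four-exon gene; exon3 plays the role of the cassette exon.
EXONS = [
    ("exon1", "ATGGCC"),
    ("exon2", "GATTAC"),
    ("exon3", "AGGTCA"),
    ("exon4", "TTGTAA"),
]

long_mrna = splice(EXONS)                    # exon 3 included
short_mrna = splice(EXONS, skip={"exon3"})   # exon 3 skipped

print(long_mrna)
print(short_mrna)
```

Introns are left out of the model entirely, since only the choice of exons matters for the resulting mRNA.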

The above is the basic mechanism of alternative splicing. In this way, AS produces different kinds of mRNAs, which sometimes results in diversification of protein function. A typical example is AS, especially of the exon-skipping type, that switches the protein sequence at the C-terminus and sometimes deletes a transmembrane domain, changing the protein from a membrane-bound form into a secreted form. There seem to be many genes encoding membrane proteins that are regulated in this way.

3.2 Patterns of AS

There are various forms of alternative splicing, which can be classified into five major types based on the positions of the exons that are skipped (Fig. 4.2, Table 4.3; Takeda et al. 2006). Here, I call the exons that are alternatively spliced “AS exons.” The most prevalent pattern of AS is the cassette exon, in which single or multiple neighboring exons are skipped in some of the isoforms. The second most common type involves variation among isoforms in the start or end positions of introns, which causes sequence length variation among AS isoforms. These are called “alternative 5′ splice site” and “alternative 3′ splice site.” NAGNAG sequences, which are introduced later in this chapter, are a kind of alternative 3′ splice site. Mutually exclusive exons are often regarded as a typical pattern of AS, but in fact they are found in only 791 human genes. Because this type of AS can switch the sequence from one exon to another, it may be the structure best suited to substituting functional domains of proteins. Retained intron is a type of AS in which both spliced and unspliced transcripts exist in the cell. One might think that intron-retained transcripts result from splicing errors, but in fact the splicing reactions in these transcripts are rigorously controlled. For example, there are known cases in which AS of the retained intron type occurs in a tissue-specific manner.

Fig. 4.2 Five most common types of alternative splicing. From upper to lower panels: cassette exon, alternative 5′ splice site, alternative 3′ splice site, mutually exclusive exons, and retained intron. Alternatively spliced exons are shaded.

Table 4.3 Patterns of human alternative splicing based on human full-length cDNA sequences

There are other patterns of AS that do not fit into any of these five groups. They include combinations of the above five AS patterns and some complicated, unclassifiable patterns. Such irregular types of AS can be identified from transcript evidence, but they are few in number, and their functional significance is not yet clear.

3.3 Examples of Human AS

Many human AS genes produce a simple pair of isoforms through a single AS exon. On the other hand, some human AS genes have multiple AS exons and consequently show an extensive repertoire of AS isoforms. Remarkable examples include the MUC1 (mucin 1, cell surface associated), DISC1 (disrupted in schizophrenia 1), KCNMA1 (potassium calcium-activated channel subfamily M alpha 1), and CALU (calumenin) genes. Here, I describe the KCNMA1 and CALU genes (Fig. 4.3).

Fig. 4.3 Examples of human genes that undergo alternative splicing. Gene structures of the KCNMA1 and CALU genes are depicted. Exon lengths are shown in proportion to the number of nucleotides, except that the last exon of NM_001014797 is shortened. Introns are shortened. Predicted ORFs are shown in green. The KCNMA1 gene has many AS isoforms: at least ten distinct types of transcripts are produced by alternative splicing. The CALU gene has at least three AS isoforms, including those with mutually exclusive exons.

The KCNMA1 gene has many kinds of AS isoforms. Figure 4.3 shows the structure of the KCNMA1 gene based on the sequences of four RefSeq transcripts and six high-quality full-length cDNAs. There are isoforms with different transcription start sites (AK124355 and AK128392) as well as three isoforms with cassette exons (AK310379, NM_001161352, and NM_001161353). Because this locus has multiple AS exons, combinations of these exons give rise to a variety of AS isoforms. CALU (calumenin) is a typical example of an AS gene with mutually exclusive exons (Fig. 4.3). There are two RefSeq entries for this gene (NM_001130674 and NM_001219), and the third exons of these transcripts are selected in a mutually exclusive manner. There is also a cassette-type exon in the 5′ UTR (AK056338). Because there are two AS exons in this gene, the number of possible AS isoforms is four (= 2²), but only three of them have been observed so far.
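The four possible CALU isoforms can be listed explicitly by combining the two independent splicing choices. In this sketch the labels "3a" and "3b" for the two mutually exclusive third exons, and "included" and "skipped" for the 5′ UTR cassette exon, are hypothetical names introduced only for illustration.

```python
from itertools import product

# Two independent alternative splicing events (labels are illustrative).
third_exon_choices = ["3a", "3b"]               # mutually exclusive third exons
utr_cassette_choices = ["included", "skipped"]  # cassette exon in the 5' UTR

possible_isoforms = list(product(third_exon_choices, utr_cassette_choices))

for third_exon, cassette in possible_isoforms:
    print(f"third exon {third_exon}, 5' UTR cassette exon {cassette}")

print(len(possible_isoforms), "possible isoforms")
```

The enumeration gives the theoretical maximum; as noted in the text, only three of these combinations have been observed.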

The most extreme example of AS is seen in the Dscam gene of Drosophila. The gene structure of Drosophila Dscam is quite unique, with large numbers of alternative copies of exons 4, 6, and 9. There are 12, 48, and 33 copies of exons 4, 6, and 9, respectively, tandemly arranged on the genome. For each exon, one of these copies is chosen during splicing in a mutually exclusive manner. Each copy encodes a different type of immunoglobulin domain. In addition, there are two copies of exon 17, which encode the transmembrane domain of the Dscam protein. As a result, the number of possible AS isoforms is 12 × 48 × 33 × 2 = 38,016 (Wojtowicz et al. 2004). The Down syndrome cell adhesion molecule (DSCAM) gene is the human ortholog of Drosophila Dscam. DSCAM belongs to the immunoglobulin superfamily and encodes a cell adhesion molecule. It was identified in the Down syndrome susceptible region on chromosome 21. The human DSCAM gene produces at least two different mRNAs that encode different protein isoforms, but the extremely large number of AS variants observed in Drosophila is not found in the human ortholog (Yamakawa et al. 1998). As this example shows, different species have different patterns of AS.
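The Dscam isoform count follows from multiplying the numbers of alternative copies of each exon, since one copy per exon is chosen independently. A minimal check of that arithmetic, using only the copy numbers given in the text (the variable names are illustrative):

```python
from math import prod

# Alternative copies per exon in Drosophila Dscam, as quoted in the text.
DSCAM_EXON_COPIES = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One copy of each exon is selected in a mutually exclusive manner.
dscam_isoforms = prod(DSCAM_EXON_COPIES.values())
print(dscam_isoforms)
```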

3.4 Evolutionary Conservation of AS

As has been discussed above, AS can expand the degree of human proteome variation that is translated from the human genome, which contributes to the regulation of complicated molecular mechanisms in humans. It is a very interesting problem to imagine how the precise regulatory mechanism of AS evolved.

Here, I will introduce an example of highly conserved AS in mammalian evolution. Human cysteinyl-tRNA synthetase (CARS) gene has two AS isoforms of cassette type (Takeda et al. 2008). One of the AS isoforms (BX647906) has exon 2 in its transcript, while the other isoform (BC002880) skips this exon. As a result, the former transcript is 249 nucleotides longer than the latter, and the protein product from the former transcript is 83 amino acids longer than that of the latter transcript. This AS exon is known to encode a functional domain, glutathione S-transferase C-terminal-like (IPR010987). What is more interesting is that the same pair of AS variants exists in mouse. This means that the common ancestor of humans and mouse about 100 million years ago might have possessed this pair of AS isoforms, and these AS isoforms have been conserved in both human and mouse lineages ever since.

As the above example shows, some AS isoforms are conserved in evolution, but how many? To answer this question, we conducted a comprehensive cross-species analysis of AS between human and mouse (Takeda et al. 2008). Because a large amount of transcriptome data is available for both species, we can comprehensively identify AS genes and AS exons and then compare AS between the two species. First, we mapped the sequences of all available AS isoforms onto the genome sequences, and then, using the whole-genome alignment of human and mouse, we identified AS isoforms that correspond well between the two species. We call these “evolutionarily conserved AS isoforms.” If two or more pairs of evolutionarily conserved AS isoforms were found at a particular locus, we defined the locus as having evolutionarily conserved AS. This analysis found only 189 genes with multiple pairs of evolutionarily conserved AS isoforms between human and mouse. This means that only a small fraction of human AS genes have been completely conserved over a long period of mammalian evolution and that the majority of human AS is species specific.

To sum up, the majority of the human AS arose recently in the evolutionary history. Creation of AS isoforms can be regarded as evolutionary experiments to try to generate new functional proteins from already existing genes. This is as if life is trying to make full use of all available resources during evolution. It is well established that gene duplications have created new functional genes during evolution. In a similar way, AS might have contributed to the modification and improvement of genes during evolution.

4 Other Mechanisms for Proteome Diversification

In addition to alternative splicing, there are other mechanisms that can extend the diversity of human genes and proteins. They do not operate in all human genes but do occur in a minority of them. In this section, I introduce three such mechanisms: alternative open reading frames (AltORFs), NAGNAG sequences, and selenoproteins.

4.1 Alternative ORFs

Among geneticists, the one gene-one protein theory was taken for granted for a very long time, but alternative splicing has undermined it, and even more exceptional phenomena have since come to light. If we carefully examine the mRNA sequences of human protein-coding genes, we sometimes find open reading frames that differ from the authentic, principal ones (Fig. 4.4a). We call them alternative open reading frames (AltORFs) in this chapter. AltORFs have been observed in many viral genomes, possibly because of a constraint to keep the genome compact, but such ORFs also exist in human genes.

Fig. 4.4 Miscellaneous mechanisms of human gene and protein diversification. (a) Alternative open reading frames (AltORFs). AltORFs (ORF2) generally overlap with, and are much shorter than, the authentic, principal ORFs (ORF1). (b) NAGNAG introns. If NAGNAG sequences are present at the 3′ ends of introns, two kinds of mRNAs can be produced by the use of different splice acceptor sites. (c) Molecular structure of selenocysteine (Sec), which is incorporated into selenoproteins. It is identical to cysteine (Cys) except that sulfur (S) is replaced by selenium (Se).
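The relationship between a principal ORF and an overlapping AltORF in another reading frame can be illustrated by scanning all three forward frames of an mRNA. The sketch below is a simplified assumption-laden model: the sequence is invented, an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, and the Kozak context of the start codon is not examined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(mrna, min_codons=2):
    """Return (frame, start, end) for ATG-to-stop ORFs in the three forward frames.

    Coordinates are 0-based and end-exclusive; `end` includes the stop codon.
    `min_codons` counts codons from the ATG up to, but not including, the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs

# Hypothetical mRNA: a frame-0 ORF with a shorter overlapping ORF in frame 1.
toy_mrna = "ATGCATGGCTTGACCTAA"
for frame, start, end in find_orfs(toy_mrna):
    print(frame, start, end, toy_mrna[start:end])
```

In the toy sequence the longer frame-0 ORF stands in for the principal ORF (ORF1) and the shorter frame-1 ORF for an AltORF (ORF2), mirroring panel (a) of the figure.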

A systematic survey was carried out for proteins encoded by AltORFs in all human genes, that is, ORFs that are different from the primary ORF and have strong Kozak consensus sequences at the translation initiation sites, by using mass spectrometry analysis of human cell lines and tissues. As a result, at least 1259 AltORF proteins were experimentally identified (Vanderperre et al. 2012, 2013). The average length of these AltORF proteins was 57 amino acids, although some were much longer. In many cases, both the AltORF and the primary ORF are translated from one mRNA, leading to co-expression of two proteins.
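The idea of ORFs other than the primary one can be illustrated with a minimal Python sketch that scans the three forward reading frames of an mRNA. This is only an illustration and not the procedure used in the cited survey: the function names, the toy sequence, and the `min_codons` threshold are assumptions, and the Kozak context is ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(mrna, min_codons=10):
    """Return sorted (start, end, frame) tuples for ATG...stop ORFs.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    Only the first ATG after the previous stop is used in each frame.
    """
    seq = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                # Length in codons, counting the ATG but not the stop.
                if (end - start) // 3 - 1 >= min_codons:
                    orfs.append((start, end, frame))
                start = None
    return sorted(orfs)

def alt_orfs(mrna, primary_start, min_codons=10):
    """ORFs other than the annotated primary ORF (hypothetical helper)."""
    return [orf for orf in find_orfs(mrna, min_codons) if orf[0] != primary_start]

# Toy mRNA: primary ORF in frame 0, short overlapping ORF in frame 1.
toy = "ATGCATGCCCTGAAATAA"
print(find_orfs(toy, min_codons=2))
print(alt_orfs(toy, primary_start=0, min_codons=2))
```

In the toy sequence, the second ORF lies entirely inside the primary ORF but in a different frame, which corresponds to the arrangement of ORF1 and ORF2 in Fig. 4.4a.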

There might be many human AltORF proteins with functional importance. For example, the AltORF protein from the MRVI1 gene was demonstrated to bind to BRCA1 protein in the nucleus (Vanderperre et al. 2013). Also, many AltORFs show high evolutionary conservation, suggesting functional importance. However, whether or not human AltORF proteins have important functions is still an open question, and further investigation is required. Experimental validation of the functional significance of AltORF proteins will be hard, because AltORF proteins could be translated in very small quantities and could be synthesized only in specific tissues or at specific developmental stages. In the future, high-resolution proteomics studies may lead to new discoveries about human AltORFs.

Furthermore, it is known that some human mRNAs have short upstream ORFs in their 5′ UTRs, which are called uORFs. They typically terminate upstream of the primary ORF, or they overlap with the primary ORF. If a uORF is translated, it consumes ribosomes and, as a result, inhibits translation of the primary ORF downstream. By this mechanism, uORFs are thought to control the translation of human proteins from mRNAs.
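The two arrangements of uORFs described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name, the labels, and the toy sequences are hypothetical, every ATG in the 5′ UTR is treated as a potential start, and an upstream ATG that is in frame with the primary ORF and has no intervening stop is labelled separately as an N-terminal extension.

```python
STOPS = {"TAA", "TAG", "TGA"}

def classify_uorfs(mrna, cds_start):
    """List (atg_pos, stop_end, kind) for ORFs that start in the 5' UTR.

    cds_start is the 0-based position of the primary ORF's ATG.
    kind is 'upstream' if the uORF stops before the primary ORF begins,
    'overlapping' if it runs into the primary ORF in another frame, and
    'n_terminal_extension' if it is in frame with the primary ORF.
    """
    seq = mrna.upper().replace("U", "T")
    results = []
    for atg in range(0, cds_start - 2):       # ATG must lie fully in the 5' UTR
        if seq[atg:atg + 3] != "ATG":
            continue
        for i in range(atg + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                end = i + 3
                if end <= cds_start:
                    kind = "upstream"
                elif (cds_start - atg) % 3 == 0:
                    kind = "n_terminal_extension"
                else:
                    kind = "overlapping"
                results.append((atg, end, kind))
                break
    return results

# uORF that ends before the primary ORF (primary ATG at position 13).
print(classify_uorfs("GGATGAAATAGCCATGGCCGCCTAA", cds_start=13))
# uORF that overlaps the primary ORF (primary ATG at position 5).
print(classify_uorfs("ATGCCATGGCTGTAACCTGA", cds_start=5))
```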

4.2 NAGNAG Introns

Nucleotide sequences at the 3′ ends of introns are called splicing acceptor sites and are known to be highly conserved. In particular, the last two nucleotides at the 3′ ends of introns show a strong consensus sequence of “AG.” However, if we examine the last six nucleotides of introns, some introns have “NAGNAG” sequences, where N represents any of the A, C, G, or T nucleotides (Fig. 4.4b). In this case, splicing usually takes place immediately downstream of the second “AG” (the fifth and sixth nucleotides), but it also occurs immediately downstream of the first “AG” (the second and third nucleotides) with some probability. When the upstream site is used, the resulting mRNA is three nucleotides longer, and its protein product is one amino acid longer because of the extra three nucleotides. Because the reading frame of this elongated mRNA is the same as the original one, the other parts of the protein remain unaffected. In many cases, therefore, the protein function is not expected to be affected significantly, and both types of proteins can coexist.
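The effect of a NAGNAG acceptor on the spliced sequence can be written as a short Python sketch. The function name and the toy exon and intron sequences are assumptions made for illustration; the sketch only joins sequences and does not model how often each acceptor site is used.

```python
import re

# Last six intron nucleotides of the form NAGNAG.
NAGNAG = re.compile(r"[ACGT]AG[ACGT]AG$")

def nagnag_isoforms(upstream_exon, intron, downstream_exon):
    """Return (longer, shorter) spliced sequences, or None without NAGNAG.

    shorter: splicing after the second AG, so the whole intron is removed.
    longer:  splicing after the first AG, so the last three intron
             nucleotides (NAG) are retained in the mRNA.
    """
    intron = intron.upper()
    if not NAGNAG.search(intron):
        return None
    shorter = upstream_exon + downstream_exon
    longer = upstream_exon + intron[-3:] + downstream_exon
    return longer, shorter

longer, shorter = nagnag_isoforms("ATGGCC", "GTAAGTTTTTTTCAGCAG", "GGCTAA")
print(longer, shorter, len(longer) - len(shorter))
```

Because the difference is exactly three nucleotides, the downstream reading frame is the same in both products, which is why only one amino acid is gained or lost.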

As shown above, the presence of NAGNAG sequences at splicing acceptor sites causes diversification of human mRNAs and proteins. In fact, such NAGNAG sequences are found in introns of many human genes (Hiller et al. 2004). For example, the second intron of the human microfibrillar-associated protein 2 (MFAP2) gene has a “CAGCAG” sequence at the acceptor site, and as a result, two types of MFAP2 mRNAs are produced (BC015039 and AK222751 in GenBank records). Although the functional consequence of this variation has not been recognized, these mRNAs may be translated into two different proteins. According to H-InvDB, an integrated database of human genes, there are at least 5081 human protein-coding genes that have NAGNAG introns, which might have caused mRNA and protein diversification to a certain extent. They may have some functional influence on their mRNAs and protein products. Also, NAGNAG introns have been found in 31 long noncoding RNA genes (Sun et al. 2014).

How the proteins that are one amino acid shorter or longer because of NAGNAG sequences diverge in their structure and function is an issue to be addressed in the future. If the extra or missing amino acid is located in a loop region of a protein, it may not change the protein structure drastically and hence may not be very disadvantageous. In contrast, if the amino acid is located inside a functional domain of the protein, it may have a serious impact on the protein function. In reality, there seems to be a strong structural constraint of proteins on the creation of NAGNAG sequences, because more than half of the mutations that produce NAGNAG sequences in human introns have been eliminated by negative selection (Hiller et al. 2008).

4.3 Selenoproteins

Selenium (Se), the element with atomic number 34, belongs to the same group as oxygen and sulfur and is an essential trace element for humans. Selenocysteine (Sec) is an amino acid with a molecular structure similar to that of cysteine (Cys), in which sulfur (S) is replaced by selenium (Fig. 4.4c). It is generally thought that human proteins are composed of 20 kinds of amino acids, but in fact Sec is the 21st amino acid and is incorporated in some human proteins. The proteins that contain Sec are collectively called selenoproteins (Hatfield and Gladyshev 2002).

During translation, selenoproteins are synthesized by incorporating Sec at the positions of UGA codons, which are usually recognized as a termination signal of protein synthesis. Sec insertion requires not only a UGA codon in the selenoprotein mRNA, which is decoded by the Sec-specific tRNA, but also the presence of a specific stem-loop structure, called the Sec insertion sequence, in the 3′ UTR of the mRNA. Thus, Sec is not incorporated at other UGA codons that lack the signal structure. In addition, UGA codons for Sec could be recognized as premature termination codons by the regular translational machinery, in which case such mRNAs would be degraded by nonsense-mediated decay (NMD). However, selenoprotein mRNAs can escape NMD (Reeves and Hoffmann 2009).
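The dual meaning of the UGA codon can be illustrated with a minimal Python sketch. It assumes the standard genetic code and reduces the Sec insertion sequence to a simple Boolean flag (`has_secis`, a hypothetical parameter); in this simplification every internal in-frame UGA is read as Sec (one-letter code U) when the flag is set, and the final codon always terminates translation. Real decoding depends on the structure and position of the stem-loop and on dedicated translation factors, which are not modelled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(cds, has_secis=False):
    """Translate a coding sequence that starts at its ATG.

    Without a Sec insertion signal, UGA terminates translation.
    With has_secis=True (simplifying assumption), an internal UGA is
    recoded as selenocysteine, written 'U'.
    """
    seq = cds.upper().replace("U", "T")
    n_codons = len(seq) // 3
    protein = []
    for k in range(n_codons):
        codon = seq[3 * k:3 * k + 3]
        aa = CODON_TABLE[codon]
        if codon == "TGA" and has_secis and k < n_codons - 1:
            aa = "U"  # internal UGA read as Sec
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

toy_cds = "AUGUGUUGAGGCUAA"  # AUG UGU UGA GGC UAA
print(translate(toy_cds))                  # UGA read as stop
print(translate(toy_cds, has_secis=True))  # UGA read as Sec
```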

Incorporation of Sec seems to occur only in specific proteins. It has been revealed that there are 25 selenoproteins in humans (Kryukov et al. 2003). In many selenoproteins, the selenium atoms of the incorporated Sec residues are located in the reactive centers of the proteins and play important roles. For example, selenoprotein P, which is encoded by the SEPP1 gene, has ten Sec residues per protein molecule and may function as an extracellular antioxidant (Mostert 2000).

Incorporation of selenocysteine is an exceptional phenomenon that requires a special mechanism of translation. In the same way as the irregular genetic code in mitochondrial DNA, we can consider Sec incorporation as a kind of genetic code variation in human protein translation. In other words, selenoproteins can be regarded as the products of a mechanism for diversification of human genes and proteins.

5 Human Gene Databases

In this section, I will introduce five major databases from which researchers can obtain information about human genes. HGNC is the official body of human gene nomenclature and provides a database of human gene symbols and gene names. RefSeq is a database of human reference sequences that are nonredundant and curated. GENCODE is the standard dataset of the ENCODE project for identifying all functional elements in the human genome. H-InvDB is an annotation database of human genes based on the human transcriptome. lncRNAdb is a database of long noncoding RNAs. Finally, I will discuss comparisons of these databases and future perspectives.

Because these databases are being developed independently in different research institutes, there are still some discrepancies among them. The discrepancies come from different interpretations of experimental data and will not be resolved until more detailed and comprehensive validation studies are carried out for all human genes. Nevertheless, these databases reflect our current understanding of human genes.

5.1 HUGO Gene Nomenclature Committee (HGNC)

HGNC is responsible for approving unique symbols and names for human genes and provides a database of human gene names (http://genenames.org/). As of August 2015, the HGNC database provides information about 18,997 protein-coding genes, 2734 long noncoding RNAs, 1879 miRNAs, 65 small nuclear RNAs, and 458 small nucleolar RNAs. Furthermore, HGNC provides symbols and names for 12,444 human pseudogenes.

5.2 RefSeq

RefSeq (http://www.ncbi.nlm.nih.gov/refseq/) is a collection of annotated genomic, transcript, and protein sequence records developed by the National Center for Biotechnology Information (Pruitt et al. 2014). The human gene set of RefSeq comprises 26,266 genes and 47,619 transcripts. Seventy-nine percent of the transcripts are “curated,” and the remainder are “models” (release 60; July 2013).

5.3 GENCODE

GENCODE (http://www.gencodegenes.org) is a database developed by the Wellcome Trust Sanger Institute that was used as a reference gene set in the ENCODE (Encyclopedia of DNA Elements) project (Harrow et al. 2012). GENCODE provides integrated annotation of human genes by manual curation, computational analysis, and experimental validation. The latest version M16 (Aug 2017 freeze) contains a total of 53,379 genes (21,963 protein-coding genes, 12,374 long noncoding RNA genes, 6109 small noncoding RNA genes, 12,437 pseudogenes, and 494 immunoglobulin and T-cell receptor gene segments).

5.4 H-InvDB

H-InvDB (http://hinv.jp) is a human gene database based on analysis of the human transcriptome (Imanishi et al. 2004). The latest release of H-InvDB (ver 9.0) contains 39,495 entries of protein-coding genes in total. Among them, the 26,386 entries that belong to categories I (identical to known human proteins), II (similar to known proteins), or III (InterPro domain-containing proteins) comprise a reliable set of protein-coding genes. There are also 8591 entries of possible noncoding RNA genes.

5.5 lncRNAdb

lncRNAdb (http://www.lncrnadb.org) is a database of long ncRNAs of 200 nucleotides or longer that are manually curated from the literature (Quek et al. 2015). Unlike shorter ncRNAs, long ncRNAs have diverse cellular functions, such as roles in chromatin modification, transcription, and splicing. The number of entries in lncRNAdb has gradually increased as research has proceeded. As of August 2015, there are at least 76 entries of human long ncRNAs (lncRNAdb ver 2.0).

5.6 Comparisons of Databases

If we compare human gene datasets among various public databases, they never agree with each other perfectly, because these databases have different policies of gene annotation. For example, a recent comparison of GENCODE and RefSeq gene annotations found a significant difference between the two (Frankish et al. 2015). The basic set of GENCODE is larger than the basic set of RefSeq (NM/NR only), and the comprehensive set of GENCODE is larger than the large set of RefSeq (XM/XR included). Figure 4.5 shows a comparison of three major databases of human genes, transcripts, and proteins. This also illustrates how hard it is to obtain a consensus among different sets of experimental evidence.

Fig. 4.5 Comparison of human gene, transcript, and protein datasets. Human protein sequences were obtained from each of RefSeq protein, H-InvDB (transcripts), and UniProtKB/Swiss-Prot (proteins; The UniProt Consortium 2015), and the protein sequences were compared to each other to examine correspondence among databases. The condition of sequence matches was set at identity >95% and coverage >80%. Values in the Venn diagram indicate numbers of protein entries in each cell (green for RefSeq proteins, blue for H-InvDB representative protein-coding transcripts, and purple for UniProtKB/Swiss-Prot proteins). Because of the AS and partial sequences, the proteins from different databases do not always show one-to-one correspondence. The following datasets were used: 35,967 protein-coding genes in RefSeq protein (release 59; dated on May 20, 2013), 28,384 human reviewed proteins in UniProtKB/Swiss-Prot (release 2013_05; dated on May 23, 2013), and 38,040 protein-coding transcripts in H-InvDB (release 9.0; dated on May 27, 2015)
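The matching condition used for Fig. 4.5 (identity >95% and coverage >80%) can be expressed as a small Python predicate. The caption does not define how identity and coverage were computed, so the definitions below are assumptions: identity is the fraction of identical residues within the aligned region, and coverage is the aligned length divided by the length of the longer protein. The function and parameter names are hypothetical.

```python
def is_same_protein(identical, aligned_len, len_a, len_b,
                    min_identity=0.95, min_coverage=0.80):
    """Decide whether two aligned protein entries count as a match.

    identical:   number of identical residues in the alignment
    aligned_len: number of aligned positions
    len_a/len_b: full lengths of the two proteins
    Both thresholds are strict, mirroring '>95%' and '>80%'.
    """
    identity = identical / aligned_len
    coverage = aligned_len / max(len_a, len_b)  # assumed definition
    return identity > min_identity and coverage > min_coverage

print(is_same_protein(96, 100, 110, 100))   # high identity, good coverage
print(is_same_protein(99, 100, 200, 100))   # partial sequence, low coverage
```

Under these assumed definitions, a partial sequence or an AS isoform can fail the coverage threshold even when the aligned region is nearly identical, which is one way the lack of one-to-one correspondence among databases can arise.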

Furthermore, each database changes its contents in periodic updates. In particular, every time the human reference genome sequence is updated, the structure and annotation of many genes and transcripts require significant modifications. Also, because of the genomic variations among individuals, we cannot expect the DNA sequences in databases to match perfectly with the sequences of the human samples that are actually used in experiments.

These problems will be very hard to resolve. Researchers may wish to use the most accurate database, the one that shows the best correspondence with the actual samples used in experiments. However, none of the databases can provide perfect information, and we need to understand that what the databases provide is no more than an approximation of the real world. To approach a complete human gene database, we need to produce and accumulate a larger amount of more precise data about human genes by carrying out more comprehensive experiments than before.

6 Conclusion

The number of human protein-coding genes is converging on about 20,000, while that of RNA-coding genes is still gradually increasing. There are thousands of other genomic regions that are transcribed from the human genome, which can be regarded as candidates for novel human genes. For these candidate genes, it is essential to conduct proteomic analysis to validate the existence of protein products and to reveal their functions and interactions with other molecules. Through such functional studies of candidate human genes, it is expected that many true genes that are involved in important biological functions or human diseases will be discovered in the future.