Introduction

Regulatory variants: definition and distribution

Regulatory variants may be defined as DNA sequence changes that have occurred within a gene’s regulatory elements and which serve to alter either the expression of an mRNA transcript or one of its encoded isoforms (Pastinen and Hudson 2004). Although most known regulatory variants are found within promoters and 5′ untranslated regions (5′ UTRs) and may thus exert their effects by interfering with either transcription or translation (Pickering and Willis 2005), variants within the coding sequences and intronic regions are also known to affect allelic expression [e.g. coding sequence variants that affect splicing (Cartegni et al. 2002) or which lead to nonsense-mediated mRNA decay (Mendell et al. 2004), and intronic variants that modulate either splicing (Baralle and Baralle 2005) or transcription factor binding (Liao et al. 2004)]. By contrast, the 3′ UTR has been a relatively neglected source of regulatory variants. This is perhaps surprising since not only is the average length of 3′ UTRs of protein-coding genes some 3–5 times longer than that of their 5′ counterparts (Mignone et al. 2002), but there is also increasing evidence to show that 3′ UTRs are particularly rich in regulatory elements. It therefore follows that 3′ UTRs are likely to be a fertile hunting ground for regulatory variants.

3′ UTRs: diverse roles in the regulation of gene expression

3′ UTRs are involved in regulating gene expression at multiple levels: at the pre-mRNA level, 3′ UTRs are involved in mRNA 3′ end formation and polyadenylation (Zhao et al. 1999) whereas at the mature mRNA level, 3′ UTRs determine such properties as mRNA stability/degradation, nuclear export, subcellular localization and translation efficiency (Conne et al. 2000; Mignone et al. 2002; Chabanon et al. 2004). These diverse regulatory roles are executed via cis-acting elements in the 3′ UTRs that interact with a multitude of trans-acting factors in a given cellular environment. Whereas the function of a DNA regulatory motif is essentially conferred by its primary structure, that of an RNA regulatory motif often relies on a combination of primary and secondary structure (Mignone et al. 2002).

A recent comparative analysis of the 3′ UTRs of some 17,700 human genes with their counterparts in the mouse, rat and dog genomes identified 106 highly conserved motifs in human 3′ UTRs, many of which may turn out to be microRNA (miRNA) targets (Xie et al. 2005). A subsequent study of the ‘transcriptional landscape’ of the mammalian genome has revealed that many non-protein-coding RNAs initiate from sites in the 3′ UTRs of protein-coding genes (Carninci et al. 2005). Further, 3′ UTRs have also been found to constitute preferred targets of cis-encoded natural antisense transcripts (Sun et al. 2005). These findings suggest that 3′ UTRs may play a rather more important role in regulating gene expression than perhaps has hitherto been appreciated.

How do the 3′ UTRs of protein-coding genes perform such diverse biological functions?

3′ UTRs are not under the same rigid structural constraints as promoter or coding sequences which are obliged to accommodate transcriptional or translational machinery (Conne et al. 2000). In other words, 3′ UTRs are subject to weaker purifying selection than coding and promoter sequences and have consequently been found to be less highly conserved in comparisons of orthologous gene regions (Makalowski et al. 1996). 3′ UTR sequences may thus have been freed up to evolve more rapidly in response to the need to specify the fates of different mRNA species in different cell types at different developmental stages. This postulate appears to be supported by at least two lines of evidence. On the one hand, not only do 3′ UTRs manifest a higher density of both short insertion/deletion length variants and single nucleotide polymorphisms (SNPs; Imanishi et al. 2004) but they also appear to have a greater tolerance of repeat expansions (Missirlis et al. 2005) as compared with 5′ UTRs and coding sequences. Furthermore, whilst the average length of 5′ UTRs appears to be roughly constant over diverse taxonomic classes (ranging between 100 and 200 nucleotides), that of 3′ UTRs is much more variable, ranging from ∼200 nucleotides in plants and fungi to ∼1,000 nucleotides in humans (Mignone et al. 2002).

Aims of this meta-analytical study

Given the importance of 3′ UTRs in regulating gene expression, Conne et al. (2000) suggested that the 3′ UTR might serve as a “molecular ‘hotspot’ for pathology”. Indeed, a number of examples of human diseases arising from anomalies within 3′ UTRs were already known at this time and Conne et al. (2000) suggested that a more systematic study would be likely to reveal the existence of other ‘such 3′ UTR-mediated diseases’. Noting that most of the 106 highly conserved motifs found in human 3′ UTRs did not match known regulatory motifs, Xie et al. (2005) opined that “the next challenge will be to develop systematic methods to discern the specific functions of these motifs in a genome-wide fashion”. While such an approach is clearly warranted, we nevertheless feel that other complementary strategies should also be adopted. In this regard, we have performed a systematic analysis of the considerable number of disease-associated sequence variants that have now been identified in human gene 3′ regulatory regions (3′ RRs; see Fig. 1 for definition), with a view not only to cataloguing 3′ RR-mediated disease but also to gaining a greater understanding of the functional role of both the known and the putative regulatory elements that have been disrupted.

Fig. 1
figure 1

Schematic representation of the primary and processed transcripts of an archetypal human protein-coding gene depicting the 3′ regulatory domains and motifs relevant to this study. Shaded boxes, coding region of the last exon [note that in some genes (e.g. IL12B and PTGDS), the last exon corresponds precisely to the entire 3′ untranslated region (3′ UTR)]; USS, upstream sequence between the translational termination codon and the upstream core polyadenylation signal (UCPAS) exemplified by the most frequently occurring AAUAAA hexamer; LAS and RAS, left and right arms of the ‘spacer’ sequence between the UCPAS and the downstream core polyadenylation signal (DCPAS) exemplified by the best described U/GU-rich element; CS, cleavage site or poly(A) addition site (indicated by an upward pointing arrow). The 3′ flanking region (3′ FR) was arbitrarily designated as the 2 kb region 3′ to the pre-mRNA CS. DSS, downstream sequence of the 3′ FR. The 3′ regulatory region (3′ RR) and 3′ UTR are also indicated. The trans-acting factors, cleavage and polyadenylation specificity factor (CPSF) and cleavage stimulating factor (CstF) that specifically bind to UCPAS and DCPAS, respectively, are also illustrated. aAverage length of complete human 3′ UTRs in accordance with Mignone et al. (2002)

Data source and analysis

Data collection

Original publications reporting all the disease-associated 3′ RR variants collated in the Human Gene Mutation Database (HGMD; http://www.hgmd.org; Stenson et al. 2003) as of Jan. 31, 2006 were manually screened. Since the HGMD is a non-redundant mutation database, considerable efforts were made not only to check the original publications that first reported each variant, but also to follow up subsequent reports that provided either (1) functional data or (2) additional information relating to a putative disease association. This was achieved by a combination of a keyword search in PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&itool=toolbar) and the perusal of articles citing a given publication (reference citations for some papers are available in the linked-ISI Web of Science and other citation tracking systems but may also be traced through Google Advanced Scholar Search (http://scholar.google.com/advanced_scholar_search)). As a result of this process, some additional 3′ RR variants that had not originally been logged in HGMD were identified; these were also included for the purposes of analysis. It should also be noted that the term ‘disease-associated’ as used in this study is quite loose; some of the collated variants may be disease-causing (as in the case of monogenic disorders), disease-predisposing (as in the case of polygenic or complex disorders), disease-modifying (i.e. the variant itself does not cause or predispose to disease but rather affects the severity, age of onset or clinical outcome of the disease), or even ‘disease-protective’, whereas other variants may only serve as disease markers. Together, these efforts yielded a total of 121 variants in the 3′ RRs of 94 human protein-coding genes (Table 1). Although this dataset should not be regarded as an exhaustive list of reported 3′ RR variants, it is nevertheless likely to be the most comprehensive available. Based upon the data currently logged in HGMD, we estimate, rather conservatively, that at least 0.2% of disease-associated mutations reside within 3′ RRs.

Table 1 Catalogue of 3′ regulatory region (3′ RR) variants analysed in this study

Classification of 3′ RR variants with respect to their locations

The collated variants were classified into distinct categories depending upon their respective locations within the 3′ RRs (Fig. 1). This classificatory scheme is discussed below.

Function-based domain and motif definition of the 3′ RR

The 3′ ends of all fully processed eukaryotic mRNAs [with the exception of most histone mRNAs] have a poly(A) tail (Gilmartin 2005). The poly(A) tail is synthesized by poly(A) polymerase, has a length ranging in mammalian mRNAs from ∼150 to 250 nucleotides [See Supplementary Online Text for mRNAs with <20-nt poly(A) tails] and is critical for the regulation of mRNA stability and translation, probably by providing a binding site for poly(A)-binding proteins (Coller and Parker 2004; Kuhn and Wahle 2004).

The cellular process that adds a poly(A) tail to the maturing mRNAs, known as polyadenylation, proceeds in two steps: the endonucleolytic cleavage of pre-mRNA and the subsequent addition of poly(A) tail to the newly formed 3′ end (Zhao et al. 1999). In mammalian cells, three elements define the core polyadenylation signal: (1) the highly conserved AAUAAA hexamer or a close variant thereof (termed the upstream core polyadenylation signal, UCPAS) located 10–30 nucleotides upstream of the polyadenylation addition site or cleavage site (CS), (2) a less conserved U/GU-rich element (termed the downstream core polyadenylation signal, DCPAS) located 10–30 nucleotides downstream of the CS, and (3) the CS itself (Fig. 1; Zhao et al. 1999).

Except for the preferential use of a CA dinucleotide immediately 5′ to the CS (Sheets et al. 1990), sequence between the UCPAS and the DCPAS is poorly conserved and has not yet been shown to contain additional experimentally confirmed regulatory motifs (Chen et al. 1995; Zarudnaya et al. 2003; Tian et al. 2005). It would nevertheless appear that the distance between the UCPAS and the DCPAS is critical for correct cleavage and polyadenylation to occur (Chen et al. 1995). We have therefore termed this region the ‘spacer’ and it may be further divided into two portions relative to the CS, viz. the left arm of the spacer (LAS) and the right arm of the spacer (RAS) (Fig. 1).

The 3′ RR comprises the 3′ UTR and 3′ FR (Fig. 1). Whilst the 3′ UTR of a gene can be unequivocally defined as the region stretching from the first nucleotide 3′ to the translational termination codon to the CS, there is no consensus definition of the 3′ FR. Here, we arbitrarily define the 3′ FR as a 2 kb sequence tract downstream of the CS; this allows the inclusion not only of the region known to contain downstream core and auxiliary cis-elements which regulate mRNA 3′ end formation, but also those variants that could disrupt more distant transcription enhancer or repressor sites.

Practical considerations for assigning 3′ RR variants to specific domains or motifs

Alternative polyadenylation has been observed in many genes and this phenomenon is generally considered to be a further means of gene regulation (Edwalds-Gilbert et al. 1997; Tian et al. 2005; Yan and Marr 2005). However, to date, no consensus as to how to document the complex patterns of alternative polyadenylation has been proposed. Here we propose a new classificatory system depending upon whether or not alternative splicing is involved.

First, alternatively spliced transcripts that end in different 3′ terminal exons invariably promote alternative polyadenylation and we have termed these, type I alternative splicing-coupled alternative polyadenylation (ASCAP) isoforms (Fig. 2a). Second, alternatively spliced transcripts that end in a common 3′ terminal exon but which use different sets of UCPASs and CSs for 3′ end formation have been termed type II ASCAP isoforms (Fig. 2a). By contrast, alternatively spliced transcripts that not only end in a common 3′ terminal exon but also use the same set of UCPAS and CS for 3′ end formation were termed alternative splicing-coupled identical polyadenylation (ASCIP) isoforms (Fig. 2a). Third, the length of a 3′ terminal exon in a given protein-coding transcriptional framework may vary due to the alternative use of different sets of tandemly arrayed UCPASs and CSs in the downstream sequence; these alternatively polyadenylated isoforms may simply be termed non-ASCAP (Fig. 2b). Finally, it is important to make a clear distinction between the alternative polyadenylation-associated use of alternative CSs and the heterogeneity of CSs. By definition, alternative polyadenylation, irrespective of whether ASCAP or non-ASCAP, involves the use of different sets of UCPASs and CSs. In addition, some 50% of human mRNAs have more than one CS (Beaudoing and Gautheret 2001; Pauws et al. 2001; Tian et al. 2005) in the context of a common UCPAS motif (Fig. 2c).

Fig. 2
figure 2

a Schematic representation of the different types of mRNA polyadenylation when alternative splicing is involved. Type I alternative splicing-coupled alternative polyadenylation (ASCAP) isoforms end in different 3′ terminal exons; type II ASCAP isoforms end in a common 3′ terminal exon but use different sets of UCPASs and CSs for 3′ end formation; alternative splicing-coupled identical polyadenylation (ASCIP) refers to the use of a same set of UCPAS and CS for 3′ end formation within a common 3′ terminal exon. Exons are indicated by boxes whereas splicing is denoted by dotted lines. Oval indicates UCPAS. Note that multiple patterns of alternative splicing may exist for each of the three illustrated types. b Schematic representation of non-alternative splicing-coupled alternative polyadenylation (non-ASCAP), exemplified by three mRNA species that were generated from the differential use of three tandemly arrayed UCPASs and CSs in the context of a given transcriptional unit. c Schematic representation of the heterogeneity of CS in the context of a common UCPAS. Symbols used in b and c: shaded box, coding region of the last exon; oval, UCPAS; arrow, CS

Based upon the above classificatory system, all the collated 3′ RR variants were precisely localised within GenBank-available genomic DNA sequences and/or mRNA sequences in the following way: first, when a variant occurred in genes with alternatively spliced isoforms, it was usually assigned to the main transcript (i.e. the most extensively studied transcript with respect to physiological expression and function); second, when a variant occurred in genes with non-ASCAP isoforms, it was assigned to the shortest 3′ UTR whilst its possible relevance to the context of longer UTRs is specified only when deemed to be of potential importance; and finally, the CS of a given mRNA with heterogeneous CSs was assigned 3′ to a CA dinucleotide (in accordance with Sheets et al. 1990) wherever possible; otherwise the location of the 5′-most CS was used to represent the location of the CS (in accordance with Tian et al. 2005).

In addition, for the UCPAS, LAS and 3′ FR variants, mRNAs were systematically aligned with both ESTs and genomic sequences available in GenBank in order to locate the UCPAS hexamers and CSs accurately. This was not however performed for some of the USS variants located at least 100 bp 5′ to the end of the GenBank-annotated reference mRNAs.

Distribution of reported 3′ RR variants in the different domains or motifs

The distribution of reported 3′ RR variants in the different domains or motifs is not proportional to the lengths of the respective domains or motifs (Table 1). This may be largely accounted for by the bias inherent in mutation screening but may also reflect the differential functional importance of the different domains. For example, only ten variants have been found in the 2,000 bp 3′ FR (including LAS, DCPAS and DSS). Apart from being screened much less often, this region clearly has a much more limited regulatory role than the 3′ UTR since it is not included in the mature mRNA. Finally, it should be noted that seven variants (two of which occurred at the 3′ end of the unique 3′ UTR-lacking GLA mRNA and five others within the USSs of their respective genes) have been treated as isolated examples owing to their unique features (Table 1).

Systematic evaluation of 3′ RR variants with respect to their potential functional consequences

All the collated variants were evaluated individually in the context of the initial disease association reports, replication studies, the physiological role of the gene involved, the expression level of mRNA/protein in vivo, and in vitro functional analysis data whenever available. In particular, considerable efforts were made to validate/elucidate the mechanisms underlying the bona fide or putative functional variants wherever appropriate. Here it should be emphasized that most of the functional analyses were performed in transient transfection assays, in which a fragment of the studied 3′ UTR was inserted downstream of a reporter gene. There are always concerns as to whether in vitro results can be extrapolated to the in vivo situation (Pastinen and Hudson 2004). Nevertheless, every care has been taken to avoid potentially misleading interpretations by considering the abovementioned factors. It is also important to emphasize that linkage disequilibrium (LD) is not only often seen with common variants but is also frequently encountered in the context of rare mutations. However, in the absence of appropriate functional analytical data, all 3′ RR variants in LD with other variants were regarded as being of potential functional importance and were therefore analysed in the same way as those that had not been shown to be in LD with other variants.

Disease-associated variants within the UCPAS

The UCPAS hexamer serves as the binding site for CPSF (Fig. 1) and constitutes the most frequently encountered motif in the 3′ UTRs of protein-coding genes (Graber et al. 1999; Zhao et al. 1999; Beaudoing et al. 2000; Tian et al. 2005; Xie et al. 2005). It is therefore not surprising that a considerable number (17) of functional variants within the UCPAS have been reported (Table 1). It should however be pointed out that whereas some 92% of human protein-coding genes contain the consensus UCPAS hexamer (AATAAA) motif or a close variant thereof (Tian et al. 2005; see also Table 2 for the 12 most frequently encountered UCPAS hexamer variants), the remaining 8% of genes may employ an alternative tripartite mechanism for poly(A) site recognition (Venkataraman et al. 2005). In this study, however, no tripartite-like structures were identified in any of the genes in which the UCPAS hexamers and cleavage sites (CSs) were accurately located.

Table 2 Most frequently encountered UCPAS hexamers in human protein-coding genes and disease-associated UCPAS mutations

HBA2: the gene in which the first disease-causing UCPAS mutation was characterized

Two functional human α-globin genes, α2 (HBA2; MIM# 141850) and α1 (HBA1; MIM# 141800) are tandemly repeated on chromosome 16pter-p13.3. Mutations in the duplicated HBA2 and HBA1 genes can result in diminished (α+ thalassaemia) or absent (α0 thalassaemia) α-chain production.

A homozygous A>G mutation in a patient with α+ thalassaemia that altered the UCPAS of the HBA2 gene from AATAAA to AATAAG (Fig. 3), was the first UCPAS mutation reported to cause a human inherited disease (Higgs et al. 1983). Since the HBA1 gene in the same patient was inactivated by a homozygous frame-shift mutation, the observed α+ phenotype must have been due solely to the mutant HBA2 gene. Indeed, analysis of cytoplasmic mRNA obtained from peripheral blood reticulocytes of the patient revealed that the HBA2 gene was still expressed, albeit at a significantly reduced level (some 5% of normal; Higgs et al. 1983). In a HeLa cell-based transient expression system, the 3′ end of wild-type HBA2 mRNA was shown to correspond to the usual poly(A) addition site, whereas the majority of the mutant mRNA extended beyond this point (Higgs et al. 1983; Whitelaw and Proudfoot 1986). However, mutant HBA2 mRNA transcripts, obtained from the peripheral blood reticulocytes of the patient, were identical in size to those from normal reticulocytes. They would thus appear to represent those transcripts that have not only been cleaved correctly at the usual polyadenylation site but which have also been normally polyadenylated (Higgs et al. 1983). By contrast, the abnormal (i.e. extended) HBA2 transcripts, observed in transfected HeLa cells but not in the patient’s peripheral blood reticulocytes, were relatively unstable as compared with the normally formed transcripts (Higgs et al. 1983). This kind of aberrant mRNA with an extended 3′ UTR is probably degraded by the same cellular machinery that is responsible for nonsense-mediated mRNA decay (Muhlrad and Parker 1999; Amrani et al. 2006).

Fig. 3
figure 3

Naturally occurring human UCPAS variants. The UCPAS motif is highlighted in bold. Point mutations are denoted by upper case letters and shaded. Deleted nucleotides are barred and shaded. The CSs are indicated by downward pointing arrows. Reference is made to the publications that first reported these mutations. Note also that both the ARSA and IGF1 genes utilised non-canonical UCPAS hexamers and have alternatively polyadenylated transcripts

Three other HBA2 UCPAS mutations have also been reported (Fig. 3). Although these mutations were not functionally characterized, they may cause disease through a similar mechanism to that discussed above.

HBB: the gene with the most reported UCPAS mutations

β-thalassaemia is an autosomal recessive disease characterized by defective synthesis of the β-chains (encoded by the HBB gene; MIM# 141900) of the haemoglobin tetramer. As in the case of α-thalassaemia, β-thalassaemia can also be divided into two types: β+, in which some chains are present, and β0, in which β chain synthesis is absent.

Orkin et al. (1985) reported an AATAAA→AACAAA mutation in the UCPAS of the HBB gene in a patient with β+-thalassaemia (Fig. 3). This mutation was shown to result in an extended transcript (∼1,500 bp, cf. the normal transcript of ∼600 bp) not only in vitro but also in vivo. Two additional β+-thalassaemia-causing UCPAS mutations in the HBB gene, AATAAA→AATGAA and AATAAA→AATAGA (Jankovic et al. 1990; Fig. 3), were also found to be associated with extended unstable ∼1,500 bp transcripts in vivo.

Rund et al. (1992) reported a further two HBB UCPAS mutations, AATAAA→AATAAG and AATAAA→A in patients with β+-thalassaemia (Fig. 3). A detailed analysis of in vivo-expressed mRNA in these patients enabled Rund et al. (1992) to draw several important conclusions:

  • A significant amount of correct cleavage and polyadenylation occurred in spite of the AATAAA→AATAAG mutation. This indicates that the highly conserved AATAAA hexamer may not be an absolute requirement for correct mRNA 3′ end formation.

  • Only four extended HBB transcripts with lengths of ∼1,500, ∼1,650, ∼2,450 and ∼2,900 bp, respectively, were present in RNA derived from patients carrying the AATAAA→AATAAG mutation, although up to 14 cryptic AATAAA sequences are present distal to the normal UCPAS within the range of the extended transcripts. This serves to demonstrate that not all UCPAS hexamers within 3′ UTRs are functional and that additional contextual signals may be required for correct mRNA 3′ end formation to occur.

  • The human HBB gene primary transcript can extend to >5 kb beyond the CS. This illustrates the point that transcriptional termination is not a spatially precise event but rather can extend over hundreds or even thousands of nucleotides (Birnstiel et al. 1985; Kim and Martinson 2003).

Although four additional HBB UCPAS mutations have been reported in β+-thalassaemia patients (Fig. 3), no associated information about in vivo mRNA expression and in vitro characterization has been made available.

The HBB gene itself harbours a total of nine UCPAS variants whereas the other five genes listed exhibit only eight between them (Fig. 3). Of the many factors that could have contributed to the preponderance of disease-causing UCPAS mutations in the HBB gene, perhaps the most important is likely to be that the HBB gene coding region is fairly short and uncomplex, thereby increasing the relative mutational target size proffered by the UCPAS. However, in comparison with the HBA2 gene, which is similar in size and structure to HBB but harbours only four UCPAS mutations, it may be significant that the HBB gene does not have a functionally equivalent paralogue.

IL2RG: difficulties in assessing the significance of UCPAS mutation-associated transcripts of unusual length

An A>G mutation that changed the UCPAS of the IL2RG gene from AATAAA→AATAAG was identified in a male patient with sporadic X-linked severe combined immunodeficiency (Hsu et al. 2000; Tsai et al. 2002; Fig. 3). Northern blotting revealed a low amount of a 4 kb mRNA species (cf. the normal 1.8 kb IL2RG mRNA) in a B-cell line derived from the patient and this abnormally long mRNA was accounted for by invoking extended transcription (Hsu et al. 2000). However, in the process of cloning the human IL2RG gene, Takeshita et al. (1992) had already demonstrated that in addition to the main ∼1.8 kb transcript, a second ∼3.6 kb mRNA species was also detectable in normal human peripheral blood leukocytes; whereas the shorter species corresponds to the cloned IL2RG gene (GenBank accession number D11086.1), the origin of the longer one remains unknown.

Given that the lengths of the two longer transcripts represent only rough estimates from Northern blotting, we consider that it is likely that the two longer transcripts described by Takeshita et al. (1992) and Hsu et al. (2000) represent one and the same transcript. Since the human IL2RG gene spans about 4 kb on the X chromosome (GenBank accession number NC_000023.8), we surmise that the longer mRNA species may represent a non-processed nuclear transcript of the IL2RG gene. To provide some support for this postulate, we searched the EST-human database in GenBank (as of July 8, 2005) using BLASTN against ±20 bp flanking the cleavage site of the human IL2RG gene and indeed found a clone (GenBank accession number CT002657.1) that is characteristic of a non-processed nuclear transcript viz. it comprises the 3′ end sequence of the last intron, the last exon (coding region plus 3′ UTR), and sequence 3′ to the cleavage site of the IL2RG gene.

The AATAAA→AATAAG mutation could indeed have resulted in the production of extended transcripts. Such transcripts were however probably unstable and therefore unlikely to be detectable in vivo. Finally, this mutation certainly did not render the gene completely non-functional because the patient expressed trace amounts of IL2RG (Tsai et al. 2002).

FOXP3: an illustration of the necessity to analyse entire 3′ UTRs

IPEX (Immune dysfunction, polyendocrinopathy, enteropathy, X-linked) is caused by mutations in the FOXP3 gene (MIM# 300292). In a five-generation family with IPEX, no mutations were found in the FOXP3 gene after sequencing all 11 exons including at least 50 bp flanking each intron/exon boundary, 531 bp promoter sequence and 487 bp 3′ UTR. Since the 3′ UTR sequence originally screened did not contain a canonical UCPAS, Bennett et al. (2001) extended their analysis to further downstream 3′ UTR sequence and consequently found an A>G mutation within the first UCPAS encountered, some 972 bp 3′ to the translational termination codon (AATAAA>AATGAA; Fig. 3). That this substitution was disease-causing was supported by the following observations: (1) it segregated with the disease; (2) it was absent in 318 control chromosomes; (3) the next potential UCPAS motif was not encountered for a further 5.1 kb and (4) FOXP3 mRNA expression was not only clearly reduced in the primary peripheral blood mononuclear cells from the heterozygous mother but also undetectable in those from the affected son (Bennett et al. 2001).

The mutant UCPAS sequence observed in this case, AATGAA, is used as a normal polyadenylation signal in 1.6% of human genes (Table 2). This lends further support to the notion that UCPAS function must be context-dependent.

ARSA: the mutational loss of one UCPAS did not lead to the increased use of alternative UCPASs

Metachromatic leukodystrophy (MLD) is an autosomal recessive inborn error of metabolism resulting from the deficiency of the lysosomal enzyme, arylsulphatase A (ARSA). ARSA deficiency is consequent to mutations in the ARSA gene (MIM# 607574) but not all mutations resulting in ARSA deficiency cause MLD. In this regard, ∼10% of healthy individuals exhibit ARSA activity levels comparable to those seen in MLD patients; this non-pathogenic reduction in ARSA activity is caused by homozygosity for the so-called ARSA pseudodeficiency allele (ARSA-PD; e.g. Gieselmann et al. 1989; Harvey et al. 1998).

Northern blotting using ARSA cDNA probes has detected three different RNA species of 2.1, 3.7 and 4.8 kb in poly(A)+ RNA from normal human fibroblasts (Gieselmann et al. 1989; Kreysing et al. 1990). These probably represent either type II ASCAP isoforms (refer to Fig. 2a) or non-ASCAP mRNA isoforms (refer to Fig. 2b), both because a probe derived from DNA sequence 300 nucleotides downstream of the first UCPAS (i.e. that used for the 2.1 kb species) hybridises only to the two larger transcripts and because an intron-specific probe does not hybridise to any of the three transcripts (Kreysing et al. 1990). The 2.1 kb mRNA species, whose sequence is known (Kreysing et al. 1990), accounts for 71–90% of the total ARSA mRNA (Gieselmann et al. 1989; Harvey et al. 1998). The full-length sequences of the two larger transcripts have still not been determined.

An A>G mutation in an ARSA-PD allele that changed the UCPAS of the major mRNA species from AATAAC to AGTAAC has been described (Gieselmann et al. 1989; Fig. 3). Although the expression of the 2.1 kb mRNA species was seriously reduced in ARSA-PD homozygous individuals, that of the other two longer transcripts was unaltered (Gieselmann et al. 1989; Harvey et al. 1998). This example therefore indicates that the loss of one UCPAS may not necessarily lead to a compensatory increase in the use of other alternative UCPASs. In addition, it is worth noting that AATAAC is not among the most frequently encountered UCPAS hexamers in human genes (Table 2).

IGF1: an UCPAS mutation with probable dual effect

The insulin-like growth factor 1 gene (IGF1; MIM# 147440) encodes three major mRNA species of 1.1, 1.3 and 7.6 kb, respectively. Since the 1.1 and 7.6 kb transcripts share the same coding sequence (exons 1, 3, 4 and 6; gene structure in accordance with Sussenbach et al. 1992) but differ in terms of their 3′ UTRs, they therefore represent non-ASCAP isoforms (refer to Fig. 2b). The 1.3 kb species is a type I ASCAP isoform (refer to Fig. 2a) which comprises exons 1, 3, 4 and 5, and will not be considered further.

In a patient with short stature, sensorineural deafness and delayed psychomotor development, Bonapace et al. (2003) identified a homozygous AATATA→AAAATA UCPAS mutation in the IGF1 gene with respect to the 1.1 kb transcript (Fig. 3). This transversion is probably disease-causing because (1) it was the only mutation found within the entire coding region, promoter and 3′ UTR of the IGF1 gene, (2) it was not present in 100 unrelated healthy controls and (3) the patient manifested a significant deficiency of serum IGF1. The original authors attempted to assess the mutation’s effect on mRNA transcription and maturation by RT-PCR, using total RNA prepared from leukocytes from the homozygous patient, heterozygous mother and father and a normal control. The normal control displayed a single band of 450 bp, the patient a single band of 340 bp, and the mother and father both bands (see Fig. 3 in Bonapace et al. 2003). Although the sequences for the two RT-PCR primers were not provided in the original publication, it may be inferred that the two primers must have targeted the 5′ end of exon 6 and sequences downstream of the pre-mRNA CS of the IGF1 gene with respect to the 1.1 kb mRNA [this means that the normally expressed 1.1 kb mRNA could not have been amplified by this pair of primers]. Although the 450 bp band corresponds to a cDNA sequence amplified from the normally expressed 7.6 kb mRNA species, our evaluation of the sequence of the 340 bp band (see Fig. 4 in Bonapace et al. 2003) revealed, however, that it was probably derived from non-specific PCR amplification. It would thus appear that no specific amplification was obtained from the homozygous patient. This, when combined with the significant IGF1 deficiency in the patient, strongly suggests that the expression of both the 1.1 and 7.6 kb mRNA species were adversely affected. In other words, the observed point mutation could have abrogated not only 3′ end formation and stability of the shorter mRNA species (as a UCPAS mutation), but also that of the longer transcript (as a USS mutation).

Finally, it is important to note that, in the context of the 1.1 kb mRNA species, the AATATA>AAAATA mutation yields an alternative AATATA motif 2 bp downstream from the physiological UCPAS hexamer (Fig. 3). Clearly, this newly created hexamer motif is unable to compensate for the loss of the physiological one, most probably due to a sub-optimal distance between the new UCPAS and the normal CS (Chen et al. 1995).

NAT1*10 is a common USS polymorphism rather than an UCPAS mutation as originally claimed

The gene encoding arylamine N–acetyl transferase 1, known as NAT1 (MIM# 108345), is highly polymorphic. Approximately 30% of Europeans carry a NAT1*10 allele in which the canonical UCPAS AATAAA has become altered to AAAAAA. A number of studies have found associations of this allele with colorectal and bladder cancer, but results have often been contradictory (e.g. Gu et al. 2005). Irrespective of the status of these disease associations, the T>A transversion associated with the NAT1*10 allele would be expected to affect 3′ end formation of its associated mRNA. However, as noted by de Leon et al. (2000), an aat triplet lies immediately 5′ to the original UCPAS (AATAAA) and, together with the first three A nucleotides of the mutant sequence, would be expected to generate a new canonical UCPAS (aatAAA) only 3 nucleotides 5′ to the original UCPAS (i.e. aatAATAAAaatAAAAAA). The in vivo efficacy of the newly generated UCPAS was thought to be unclear because the 5′ shift of three nucleotides may give rise to a sub-optimal distance between the UCPAS and the normal CS (de Leon et al. 2000).

de Leon et al. (2000) addressed this issue by means of two complementary approaches. First, they demonstrated that the recruitment of the immediate upstream aat to the formation of a new AATAAA polyadenylation signal site does not disrupt the predicted secondary structure of the NAT1*10 pre-mRNA which is indistinguishable from that of the wild-type sequence. Then they performed transfection studies in COS-1 cells: the amount of NAT1*10 protein in COS-1 cell cytosol was identical to that of the wild-type protein. The authors therefore concluded that the aatAATAAAaatAAAAAA change in NAT1*10 had been fully compensated for by the newly generated polyadenylation signal site.

We were intrigued by the extremely mild effect on gene expression conferred by this mutation that occurred within a canonical AATAAA element. We therefore traced the original report that cloned the NAT1 gene and found that the gene was sequenced at the genomic rather than the cDNA level (Blum et al. 1990). Thus, the 3′ UTR of the gene was formerly unknown and the ‘canonical UCPAS AATAAA’, referred to subsequent work, was regarded as only ‘putative’ in the original study (Blum et al. 1990). Evaluation of the mRNA sequences and ESTs available in UniGene Hs. 155956 (for Homo sapiens NAT1; http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene&cmd=search&term=hs.155956; as of Aug. 18, 2005) allowed us to conclude that, on the basis of all informative sequences with an unambiguous poly(A) tail, the physiological UCPAS of the NAT1 gene should be assigned as the ATTAAA motif located 154 bp 3′ to the so-called ‘canonical AATAAA motif” (see Supplementary Table S1). Consequently, NAT1*10 should be re-defined as an USS variant, the above functional interpretation of de Leon et al. (2000) being simply spurious.

Correlation of genome-wide and biochemical analysis with naturally occurring mutation data

As indicated in Table 2, the canonical AATAAA sequence accounts for ∼53% of the presumably functional UCPAS hexamers detected in the 3′ UTRs of human protein-coding genes. The most frequent UCPAS variant, ATTAAA, has a frequency of ∼16% whereas the 11 next most frequently encountered UCPAS hexamers exhibit frequencies of 0.6–4.4%. Early in vitro studies conclusively demonstrated that, with the exception of ATTAAA, all variants of the canonical AATAAA sequence strongly inhibit pre-mRNA processing (Wickens and Stephenson 1984; Wilusz et al. 1989; Sheets et al. 1990). This is also the case for the functionally analysed pathogenic UCPAS mutations. It therefore follows that the less frequently used non-canonical UCPAS hexamers may well have become evolutionarily adapted (and hence perhaps even optimised) to the expression of their respective genes. Alternatively, it may be that the use of an apparently sub-optimal UCPAS site represents a gene-specific strategy for modulating the level of gene expression. Nevertheless, it seems very likely that the less frequently used non-canonical UCPAS hexamers serve as comparatively poor binding sites for CPSF; thus, CPSF would be more likely, in those genes using non-canonical hexamers as physiological UCPASs, to be able to bind to other potential UCPAS elements. It is precisely this type of gene that is often associated with alternative polyadenylation (e.g. Xiong et al. 1991; Zhao et al. 1999; Hall-Pogar et al. 2005). Interestingly, of the six genes listed in Fig. 3, the first four use AATAAA as an UCPAS and have no alternatively polyadenylated mRNA species, whereas the remaining two genes utilize non-canonical UCPAS variant sites and give rise to at least two alternatively polyadenylated mRNA isoforms.

From Table 2, we can see that the pathogenic point mutations that have been reported within canonical AATAAA UCPAS hexamers tend to generate motifs that correspond to the less frequently used non-canonical UCPAS hexamers. Further, it has been clearly shown that, in a given sequence context, the effect of single nucleotide substitution within the UCPAS AATAAA hexamer on cleavage and polyadenylation depends critically on its position within the hexamer; and the in vitro phenotypic severity follows the order of positions 3, 4, 6 > 5 > 1 > 2 (Sheets et al. 1990). Intriguingly, of the 12 non-canonical UCPAS hexamers found in human genes, the 2nd position in the motif is the most frequently affected whereas the 6th and 4th positions are the least frequently affected (Table 2). By contrast, of the pathogenic canonical (AATAAA) hexamer point mutations, the 2nd position is least frequently affected whereas the 6th and 4th positions are the most frequently affected (Table 2). Finally, biochemical analysis has revealed that a substituting G in any given position is often associated with a severe phenotype by comparison with other base substitutions (Sheets et al. 1990). Not surprisingly, therefore, G is the most frequently encountered substituting base among the pathogenic point mutations known to have occurred within canonical AATAAA UCPAS hexamers (Table 2).

Variants that have occurred within the LAS

Currently, the only known role of the short LAS is involvement in mRNA 3′ end formation at the pre-mRNA level (discussed below). A recent bioinformatics analysis has identified several well conserved motifs in the region -40/−1 (the first nucleotide 5′ to the CS as –1; nomenclature used consistently in this section) of human genes (Hu et al. 2005). However, since (1) the analysed region is 10–30 bp longer than the LAS as defined here and (2) the relative position of each identified motif within the analysed region is unknown, the utility of this information is seriously limited. Nevertheless, none of the six known LAS variants were located in these motifs and their possible functional consequences will therefore only be discussed in the context of our current understanding of the role of the LAS in mRNA 3′ end formation.

Three point mutations involving the dinucleotide immediately upstream of the CS

Preferential use of a CA dinucleotide immediately upstream of the CS

To assess the evolutionary conservation of the nucleotides to which the poly(A) is added, Sheets et al. (1990) aligned 63 vertebrate cDNA sequences with their corresponding genomic sequences. Whenever the start of the poly(A) tail coincided with an A in the genomic sequence, this A was assigned as the nucleotide to which the poly(A) stretch was added post-transcriptionally [in accordance with the observation that the first nucleotide of the poly(A) tail in most adenovirus L3 and SV40 late mRNAs is likely to be template-encoded (Moore et al. 1986; Sheets et al. 1987)]. Having found a C residue at position –2 in 59% of the genes examined and an A at position –1 in 71%, Sheets et al. (1990) concluded that CA is the most common dinucleotide to which poly(A) is added. Using a large set of human genes (NB. the exact number of genes used for the relevant analysis cannot be determined from the original publication), Zhang (1998) found that a C residue at position –2 and an A residue at position –1 accounted for 46 and 64% of nucleotides at these sites, respectively. Since the C at position –2 and the A at position –1 were counted separately in both the above cited studies, we have no idea as to the proportion of human genes that use a CA dinucleotide immediately before the CS. Using the same principle employed by Sheets et al. (1990), we systematically annotated the CSs of all the genes known to contain UCPAS, LAS and 3′ FR variants and of some of the genes containing USS variants; a CA dinucleotide was found to be used in 25 (37%) of the 67 genes analysed.

Since a significant fraction of CSs are not preceded by a CA dinucleotide and since >50% of human protein-coding genes harbour multiple mRNA CSs (Tian et al. 2005), it would appear that a CA dinucleotide cannot be an absolute requirement for correct cleavage, a situation analogous to the use of both the canonical AATAAA sequence and its variants by human genes. In this regard, a detailed biochemical analysis of a series of constructs, in which the wild-type CA dinucleotide at positions –2 and –1 of the SV40 late polyadenylation signal was changed to all 16 possible combinations, clearly demonstrated that (1) mutations at position –2 or/and –1 had only minor effects on the overall efficiency of cleavage; (2) whilst cleavage occurred predominantly at certain sites, additional minor sites of cleavage over a ∼6 base-pair interval were also observed and (3) CS usage at position –1 was found to be in the order of preference A > U > C >> G (Chen et al. 1995).

F2: two variants that probably increase the efficiency of CS recognition

G20210A, occurring immediately 5′ to the CS of the prothrombin gene (F2; MIM# 176930), displays an allele frequency of 1–4% in the Caucasian population and represents a moderate risk factor for venous thrombosis (Poort et al. 1996). Consistent with the observations that (1) the G20210A polymorphism changes the CG dinucleotide to a preferentially used CA dinucleotide (Fig. 4), (2) it is associated with a gain-of-function, and (3) heterozygous carriers of the A allele exhibit on average a 25% higher plasma prothrombin level as compared to wild-type individuals, this polymorphism might reasonably be expected to give rise to improved cleavage site by enhancing CS recognition thereby leading to increased mRNA accumulation and protein synthesis. This indeed turned out to be the case, as has been demonstrated by several in vitro studies (Gehring et al. 2001; Pollak et al. 2002; Ceelie et al. 2004; Danckwardt et al. 2004; Sachchithananthan et al. 2005). Thus, in the words of Danckwardt et al. (2004), “the physiological G at the cleavage site at position 20210 is the functionally least efficient nucleotide to support 3′ end processing but has evolved to be physiologically optimal”.

Fig. 4
figure 4

Naturally occurring LAS variants. The UCPAS motif is shaded. The cleavage sites are indicated by downward pointing arrows. Point mutations in F2, HBB, and ADIPOR1 and the ATG repeat polymorphism exemplified by the 7-repeat allele in RETN are highlighted in bold upper case letters. The two 3′ flanking region variants in the F2 gene are also included (underlined)

C20209T, which affects the –2 position relative to the CS of F2 (Fig. 4) and which has been noted at low frequency in the African-American population (∼0.4%), probably acts as a modifier of thrombotic risk (Warshawsky et al. 2002; Schrijver et al. 2003; Itakura et al. 2005; Soo et al. 2005; Wylenzek et al. 2005). This variant, if it is indeed of functional significance, should also increase the efficiency of pre-mRNA cleavage and polyadenylation.

HBB: a mutation that may decrease the efficiency of CS recognition

Two heterozygous single nucleotide substitutions in cis that altered the end of the HBB 3′ UTR from ttgCA to ctgAA (Fig. 4) [combined with a null mutation on the other allele] were identified in a patient with β-thalassemia intermedia (Heath et al. 2001). Note that this mutation changed the preferred CA dinucleotide to a sub-optimal AA (Sheets et al. 1990). Given both the loss-of-function nature of known HBB mutations and the patient’s mild phenotype (a moderate microcytic, hypochromic anaemia with normal iron and elevated HbF; Heath et al. 2001), this change could lead to a moderate decrease in the efficiency of pre-mRNA CS recognition.

Two point mutations occurring in residues upstream of the −2/−1 dinucleotide

To date, we are not aware of any reports of the functional characterization of artificially introduced single nucleotide substitutions with respect to the remaining positions in the LAS (i.e. from +1 downstream of the UCPAS hexamer to position –3 relative to CS). However, substitution of a short sequence tract ranging from 3 to 5 bp in the middle of the LAS of the human complement C2 (C2) gene has been found to affect adversely cleavage and polyadenylation efficiency. Moreover, this effect was proportional to the number of bases substituted (Moreira et al. 1995). Based upon the finding that the shortest 3-base substitution resulted in only a twofold reduction in efficiency as compared with the wild-type sequence (Moreira et al. 1995), and the results of Chen et al. (1995), it is not unreasonable to conclude that nucleotide substitutions in the LAS, from +1 downstream of the UCPAS hexamer to position –3 relative to the CS, are unlikely to have a major effect on mRNA cleavage and polyadenylation. Indeed, the A20207C variant, located at position –4 of F2 (Fig. 4) and so far only reported in abstract form (Meadows et al. 2002), does not appear to result in major changes to the position of the poly(A) attachment site, the efficiency of polyadenylation, or protein synthesis (Ceelie et al. 2005). The functional consequences of a common C>G SNP, located deep within the LAS of the APIPOR1 gene (MIM# 605441; Fig. 4) and reported to be associated with reduced gene expression in cell lines from diabetic African-Americans as compared with control cell lines (Wang et al. 2004), appear to be more elusive.

RETN: an ATG repeat variation polymorphism that may exert its effects by altering the optimal distance between the UCPAS and the CS

A specific spacing between the UCPAS and the CS has been found to be essential for efficient cleavage and polyadenylation (Chen et al. 1995). However, until now, no sequence variants have been reported that exert their effects by altering the spacing between the UCPAS and the CS. Our systematic analysis may have identified just such an example.

Pizzuti et al. (2002) identified three alleles involving an ATG triple repeat in the 3′ UTR of the resistin gene (RETN; MIM# 605565) in a control population: 8, 7 and 6 repeats with allele frequencies of 0.3, 94.5 and 5.2%, respectively. The 6-repeat allele was found to be associated with a decreased risk of insulin resistance by comparison with the 7-repeat allele. Since the ATG repeat maps to the LAS of the RETN gene (Fig. 4), the shortest allele could well lead to decreased RETN expression by reducing the optimal distance between the UCPAS and the CS.

Variants located within the USS

There is no doubt that the USS, which accounts for some 97% of the average length of 3′ UTRs in human protein-coding genes (Fig. 1), is heavily involved in gene regulation at the post-transcriptional level. To better understand the relatively large number (81) of USS variants (most of which are polymorphisms), we first performed a comprehensive survey of well-defined cis-regulatory elements within the 3′ UTRs of protein-coding genes; this has served to strengthen the notion that RNA regulatory elements function in the context of a specific secondary structure. We then attempted to validate/decipher the potential functional consequences of the collated USS variants by systematically evaluating the primary sequences (against the well-defined cis-regulatory motifs) within the context of predicted RNA secondary structures; this enabled us to identify (1) consistent patterns of secondary structural change that may allow the discrimination of non-functional USS variants from their functional counterparts and (2) potential novel regulatory motifs within the 3′ UTRs. The details of this part will be published elsewhere (J.M. Chen, C. Férec, D.N. Cooper, submitted).

3′ FR variants

The 3′ flanking sequence, arbitrarily defined here as the region containing up to 2,000 nucleotides 3′ to the CS, was further divided into three sub-domains or motifs viz. RAS, DCPAS and DSS (Fig. 1).

Unlike the UCPAS, the DCPAS is poorly conserved evolutionarily. This notwithstanding, two main DCPAS types, a U-rich element and a GU-rich element, have been described (Zhao et al. 1999; Zarudnaya et al. 2003; Hu et al. 2005). In addition, some putative cis-acting elements in the region from +1 to +100 (the first nucleotide 3′ to CS is numbered +1; nomenclature used consistently in this section) of human genes have also been identified by comparative sequence analysis (Hu et al. 2005). Further, some secondary and higher-order structures formed by sequences mapping within the region +1/+100 have been reported to be involved in mRNA 3′ end formation (Zarudnaya et al. 2003).

It is highly likely that elements which regulate mRNA 3′ end formation are also going to occur in the region downstream of +100 [NB. the human HBB gene primary transcript can extend to >5 kb beyond the CS (Rund et al. 1992)] but such elements have been reported only rarely (Dye and Proudfoot 2001; Plant et al. 2005). Rather, most studies have focused upon putative regulatory elements that function as transcriptional enhancers or repressors (see below).

Given the poorly defined nature of the DCPAS and hence the RAS, three of the 10 3′ FR variants (all occurring ≤12 bp 3′ to the CS) were assigned as RAS or DCPAS mutations whereas the remaining seven (all occurring ≥74 bp 3′ to the CS) were assigned as DSS mutations.

RAS or DCPAS mutations that could affect mRNA 3′ end formation

F2: C20221T is probably a pathological mutation in a CstF binding site but the functional consequences of A20218G are less clear

A C20221T mutation in the F2 gene was identified in a 9-year-old child with acute vascular rejection and intrarenal segmental arterial thrombosis of an allogeneic kidney transplant (Wylenzek et al. 2001), a 28-year-old man with Budd-Chiari syndrome (Balim et al. 2003), and a 40-year-old women with pregnancy complications (Schrijver et al. 2003), respectively. The clinical phenotypes of these patients are consistent with a gain-of-function mutation. Interestingly, the C20221T mutation, which occurred at position 11 3′ to the CS of F2 (Fig. 4), generates a four nucleotide sequence tract TGTT that could potentially enhance CstF binding. Indeed, when tested in vitro, the 20221T allele increases the mRNA expression level by 2.6-fold as compared with the wild-type allele (Danckwardt et al. 2004).

The F2 A20218G variant (Fig. 4) was cited by Ceelie et al. (2005) as having been first reported in abstract form (Meadows et al. 2002). When tested by in vitro functional analysis, this variant did not lead to major changes in the position of the poly(A) attachment site, in the effectiveness of polyadenylation or in protein expression (Ceelie et al. 2005).

FGG: reduced expression of the γB isoform may be due to increased expression of the type I ASCAP γA isoform carrying a putative gain-of-function mutation in the CstF binding site

The fibrinogen γ gene (FGG; MIM# 134850) gives rise to two type I ASACP isoforms, γA and γB (Fornace et al. 1984). [γB corresponds to γ′ in Wolfenstein-Todel and Mosesson (1981) and Uitte de Willige et al. (2005)]. γA, the major isoform (comprises some 90% of the fibrinogen γ chain in plasma; Wolfenstein-Todel and Mosesson 1981), consists of exons 1–10. By contrast, γB retains intron 9 and polyadenylation occurs in a site within this intron (see Fig. 1 in Uitte de Willige et al. 2005).

A FGG haplotype, FGG-H2, has recently been reported to be associated with reduced γB levels and elevated total fibrinogen levels that are in turn associated with an increased risk of venous thrombosis (Uitte de Willige et al. 2005). Uitte de Willige et al. (2005) have proposed that the reduced expression of FGG γB might be attributable to the increased expression of γA: the FGG-H2 haplotype contains a 10034C>T polymorphism (located 12 bp downstream of the CS in the context of the γA isoform; Supplementary Table S2) which results in the gain of a CstF consensus 2a sequence viz. YGTGTYTTYAYTGNNYGT (Beyer et al. 1997). In other words, the T allele may increase the efficiency of 3′ end formation of the γA mRNA isoform, which in turn leads to a competitive decrease in the formation of the γB mRNA isoform.

DSS variants that may affect binding sites for transcriptional enhancers or repressors

It is now apparent that cis-acting transcriptional enhancers or repressors can modulate gene expression over very long distances e.g. tens of kilobases or even up to a megabase away from their target genes (reviewed by West and Fraser 2005). Here, we focus exclusively upon the proximal 3′ flanking regions. As we shall see below, three of the seven disease-associated DSS variants have been shown to affect transcription factor-binding sites. The remaining four variants―a G > A SNP (236 bp downstream of the CS) in CTLA4 (MIM# 123890), a C > T polymorphism (+ 242 bp) in CYP1A1 (MIM# 108330), a single T deletion (+ 485 bp) in FABP3 (MIM# 134651), and a C > T polymorphism (+ 707 bp) in KCNS3 (MIM# 603888) (Table 1; see also Supplementary Table S3)―will not be addressed further owing to the lack of supporting functional analysis.

SERPINA1: an intriguing illustration of the complexity of gene regulation

α1-antitrypsin is an acute-phase reactant. Its concentration in the plasma can increase three to fourfold during inflammation; this effect is mediated primarily by the cytokine, interleukin-6, that interacts with the tissue-specific transcription factor NF-IL6 (Morgan et al. 1997). A G > A substitution at position 1255 in the 3′ flanking region of the α1-antitrypsin gene (SERPINA1; MIM# 107400; Supplementary Table S4) occurs in ∼5% of apparently healthy individuals. However, it is present in ∼17% of patients with chronic respiratory disease (Morgan et al. 1993). Since the G > A transition does not affect the basal expression of the protein (the plasma concentration of α1-antitrypsin is normal in individuals carrying the polymorphism), this variant probably predisposes individuals to chronic respiratory disease by modulating the increase in plasma α1-antitrypsin concentration during inflammation (Morgan et al. 1993). Indeed, this polymorphic variant, which occurs within a DNA binding site for the ubiquitous transcription factor Oct-1 (i.e. ATTTCGA > ATTTCAA) was found to disrupt the functional cooperativity between Oct-1 and NF-IL6, resulting in IL6-deficient acute-phase response (Morgan et al. 1997).

HBD: a variant that affects a putative context-specific transcriptional repressor binding site

A G > A substitution 64 bp downstream of the CS of the δ-globin gene (HBD; MIM# 142000; Supplementary Table S5) was identified in subjects with δ-thalassaemia (Moi et al. 1992). The wild-type G contributes to the sequence TACAGATAGG which is very similar to the GATA box core sequence (A/T)GATA(A/G), a binding site for GATA-1. That the GATA-1 protein binds more tightly to the HBD mutant A allele than to its wild-type counterpart (Moi et al. 1992) suggests that GATA-1 may act as a transcriptional repressor in this specific context (Cao and Moi 2002).

SLC9A3R1: a SNP that disrupts a putative RUNX1-binding site

Psoriasis is a common, immunologically-mediated, hyperproliferative skin disorder that is inherited as a multifactorial trait. Recently, several non-coding SNPs within the second susceptibility locus, PSORS2 (MIM# 602723), have been reported to be associated with the disease (Helms et al. 2003). The A allele of SNP9, which lies 237 bp downstream of the CS of the SLC9A3R1 gene (MIM# 604990) and serves to eliminate a putative binding site for the Runt-related transcription factor, RUNX1 (see Supplementary Table S6), has received special attention for two reasons. First, SLC9A3R1 encodes a PDZ domain-containing phosphoprotein that is implicated in diverse aspects of epithelial membrane biology and immune synapse formation in T cells (Itoh et al. 2002). Second, DNA variants in RUNX1 [a transcription factor that is essential for haematopoietic and endothelial cell development; Lacaud et al. 2002)] binding sites have been associated with systematic lupus erythematosus (Prokunina et al. 2002) and rheumatoid arthritis (Tokuhiro et al. 2003). Interestingly, electrophoretic mobility shift assays indicated specific binding of nuclear extracts from a human T cell line (Jurkat) to the wild-type RUNX1-binding site (G allele), but not to the psoriasis-associated A allele at any concentration of nuclear extract. Moreover, although no significant differences in reporter gene activity were observed in transient transfection assays in the Jurkat T cell line with constructs, respectively, expressing the A or G alleles, the wild-type G allele showed the most prominent increase in reporter gene activity when RUNX1 and its coactivator CBFβ were co-transfected (Helms et al. 2003).

The above observations have strongly suggested that the SNP9-associated RUNX1-binding site has a functional role. However, this disease association has not been confirmed by subsequent studies (Capon et al. 2004; Butt et al. 2005; Hosomi et al. 2005; Huffmeier et al. 2005; Hwu et al. 2005; Morar et al. 2006; Stuart et al. 2006).

Isolated examples

DMPK and SCA8: trinucleotide CTG repeat expansions within the 3′ UTR

CTG repeat expansion within the 3′-UTR of the dystrophia myotonica protein kinase gene (DMPK; MIM# 605377) causes autosomal dominant myotonic dystrophy, type 1 (Brook et al. 1992). This constitutes a novel pathological mechanism resulting from a triplet repeat expansion that acts at the RNA level: CUG expansion sequesters essential cellular RNA-binding proteins resulting in abnormal splicing of multiple transcripts involved in the disease (Day and Ranum 2005). The CTG expansion within the 3′-UTR of the SCA8 gene (MIM# 603680) causing spinocerebellar ataxia type 8 (Koob et al. 1999) may involve a similar pathogenic RNA mechanism (Mutsuddi et al. 2004).

FCMD: a large LINE-1 trans-driven SVA insertion in the 3′-UTR

Disease-causing mutations resulting from long interspersed element-1 (LINE-1 or L1)-dependent retrotransposition constitute ∼0.1% of the lesions listed in HGMD (Stenson et al. 2003; Chen et al. 2005a). Of the ∼50 such events known (Chen et al. 2005b), one involves the integration of a 3,062 bp SVA element into the 3′-UTR of the FCMD gene (MIM# 607440) causing autosomal recessive Fukuyama-type congenital muscular dystrophy (Kobayashi et al. 1998). It is thought that this large SVA insertion may either inhibit transcriptional elongation or cause abnormal polyadenylation, resulting in the complete loss of gene expression (reviewed in Chen et al. 2006).

GLA: a human gene lacking a 3′-UTR

The human GLA gene (MIM# 301500) is thought to be unique on account of the complete absence of a 3′ UTR: the polyadenylation signal ATTAAA actually lies within the coding sequence whilst the translational termination codon TAA corresponds to the end of the mature mRNA sequence. Two micro-deletions at the 3′ end of the GLA gene―a dinucleotide (AA) deletion occurred within the UCPAS ATTAAA and a tetranucleotide (ACTT) deletion located immediately upstream of the translational termination codon TAA―were identified separately in two unrelated men with classical Fabry disease. Both mutations were found to affect adversely 3′ end formation of the GLA mRNA (Yasuda et al. 2003).

HLA-G: a 14 bp deletion/insertion length polymorphism within the 3′ UTR regulates alternative splicing and mRNA stability

Human leukocyte antigen-G (HLA-G; MIM# 142871), a non-classical major histocompatibility complex (MHC) class I molecule, is characterised by reduced polymorphism (15 alleles) and only seven different protein isoforms (LeMaoult et al. 2003). The presence of a 14 bp insertion polymorphism in the 3′ UTR of the HLA-G gene (Harrison et al. 1993) in certain alleles promotes a further splicing event viz. a total of 92 bases flanking the insertion site was spliced out due to the activation of a downstream cryptic acceptor splice site (Hiby et al. 1999; see also Fig. 1 in Rousseau et al. 2003); this also affects mRNA stability (Hviid et al. 2003; Rousseau et al. 2003). The 14 bp insertion/deletion length polymorphism may play a role in recurrent spontaneous abortion, pre-eclampsia, and outcome of in vitro fertilization (Hviid 2004; Hviid et al. 2004; Tripathi et al. 2004; Hviid and Christiansen 2005).

SLC6A3: controversial association of a 3′ UTR VNTR polymorphism with ADHD

Given its important role in regulating dopamine neurotransmission, the dopamine transporter gene (SLC6A3; MIM# 126455) has been extensively studied as one of the candidate genes that could predispose to attention-deficit/hyperactivity disorder (ADHD; Madras et al. 2005). Not only is the association of the 10-repeat allele of a 40-bp variable number of tandem repeat (VNTR) polymorphism in the SLC6A3 3′ UTR with ADHD controversial (e.g. Bellgrove et al. 2005; Feng et al. 2005; Langley et al. 2005; Purper-Ouakil et al. 2005; Kim et al. 2006), but inconsistent results have also been generated in in vitro transient expression analyses of the 10-repeat allele (e.g. Fuke et al. 2001; Mill et al. 2005; VanNess et al. 2005).

Concluding remarks and perspectives

Despite their evident complexity both in terms of biology and pathology, we have performed the first systematic analysis of the known (and highly diverse) naturally-occurring 3′ RR variants. This has generated many useful insights, some of which will be briefly discussed below.

3′ RR variants do not usually lead to the complete loss of gene expression

With the exception of the rather unusual cases of DMPK, SCA8 and FCMD mutations, none of the known disease-associated 3′ RR variants/mutations result in the complete functional loss of the genes involved. This is perhaps not altogether surprising since even the highly conserved AATAAA hexamer appears not to be an absolute requirement for correct mRNA 3′ end formation, let alone the other functional elements.

Functionality versus causality

Most of the USS variants studied might reasonably have been expected to have had some functional consequences for gene expression, a conclusion based upon the variants’ high probability of changing either the primary sequences of a variety of regulatory motifs present within this region or/and the associated mRNA secondary structures (J.M. Chen, C. Férec, D.N. Cooper, submitted). However, most of these variants as well as those in the LAS and the 3′ FR have usually been found to exert a comparatively minor influence on mRNA expression. This could perhaps provide a partial explanation for why such a high proportion of initially reported disease associations involving 3′ RR variants end up not being independently confirmed; these variants may predispose to, ‘protect from’, or modify disease susceptibility but only in combination with other genetic and environmental factors, which may vary quite dramatically between different populations.

Our current understanding of the functionality of 3′ RR variants is far from complete

Our current understanding of the functional roles of the diverse 3′ RR variants (and accordingly the biology of the 3′ RR) is far from complete. For example, none of the in vitro characterised functional USS variants have yet been found to affect mRNA cellular localisation. It may be that this is due to the still rather limited number of such variants which are known. However, any such effects might well have been overlooked in studies that have focussed almost exclusively on the analysis of mRNA stability and/or the level of the synthesized protein. Different types of enhancers, repressors and silencers should also exist within the 3′ UTRs of human genes. However, owing to the presence of well characterized functional elements that operate post-transcriptionally, the identification of such bona fide regulatory elements is likely to be rather challenging. The 3′ UTR is known to represent a preferred target for regulation by cis-encoded natural antisense transcripts (Sun et al. 2005), a phenomenon that has so far scarcely been investigated and which could yet prove to be quite widespread. It may be that a given 3′ UTR motif will turn out to have multiple roles, depending upon the cell type, subcellular location, developmental stage, and the concomitant availability or otherwise of trans-acting protein or RNA factors. We would also like to emphasize that whilst miRNA-binding sites are likely to exist in the 3′ UTRs of many human genes (Lim et al. 2005), much of the new lexicon of miRNA-binding sites still remains to be defined. Many of these sites could harbour mutations such as that found to cause Tourette syndrome (Abelson et al. 2005) and it may be that this category of lesion will turn out to be especially important in certain disease states.

Toward the identification of all functional regulatory elements in the human genome

The study of naturally occurring disease-associated mutations in the 3′ RRs of human genes should facilitate the identification of novel regulatory sequence elements within these regions. Conversely, our improving knowledge of these long neglected regulatory regions should help to guide mutation screening programs such that an increasing number of mutations within 3′ RRs are likely to come to clinical attention in the coming years. This parallelism should ensure not only that the mechanisms through which newly detected 3′ RR mutations exert their pathological effects will become steadily clearer, but also that we shall in time arrive at a better understanding of how both the primary and secondary structure of these regions influences their normal function.

Of the different approaches to studying cis-acting regulatory elements, each has its own strengths and weaknesses (Pastinen and Hudson 2004). The use of all these approaches in concert promises to provide us with new insights into the emerging vocabulary of 3′ RR elements. It is hoped that the integrative approach employed here in the study of naturally occurring variants of actual or potential pathological significance will serve to complement ongoing efforts to identify all functional regulatory elements in the human genome (e.g. ENCODE Project Consortium 2004).