Introduction

In an attempt both to catalogue 3′ regulatory region (3′ RR)-mediated disease and to improve our understanding of the structure and function of the 3′ RR, we have performed a systematic analysis of disease-associated variants in the 3′ RRs of human protein-coding genes. In a previous article, we discussed the general principles pertinent to this meta-analytical study and focused on the variants that are known to have occurred in two specific domains/motifs [i.e. the upstream core polyadenylation signal sequence (UCPAS) and the left arm of the ‘spacer’ sequence (LAS) between the UCPAS and the pre-mRNA cleavage site (CS)] of the 3′ untranslated region (3′ UTR) as well as in the 3′ flanking region (Chen et al. 2006). In the present article, we focus on the 83 variants known to have occurred within the upstream sequence (USS) between the translational termination codon and the UCPAS of the 3′ UTR (refer to Fig. 1 in Chen et al. 2006 for definition of terms).

Classification of the USS variants

The 81 previously collated USS variants (Chen et al. 2006) were essentially derived from the Human Gene Mutation Database (http://www.hgmd.org; Stenson et al. 2003) with the inclusion of several entries found by cross-reference search (note that no public polymorphism databases were used for data collection). In addition, two recently published and functionally characterised USS variants in the BMP2 (Fritz et al. 2006) and SEPN1 (Allamand et al. 2006) genes were also included for analysis (Table 1; see Supplementary Tables S1–S5 for sequence details). Although the general principles involved in analysing the 3′ RR variants have been previously described (Chen et al. 2006), it is nevertheless important to explain how the USS variants were sub-divided into the five groups listed in Table 1. The first two groups were assigned on the basis of in vitro reporter gene and/or electrophoretic mobility shift assay (EMSA) data: a variant was termed ‘functional’ when it exerted a marked effect on gene expression (at either the mRNA or protein level) in a reporter gene assay and/or affected trans-acting factor binding in an EMSA; otherwise, it was termed ‘non-functional’. The third group contains those variants which have not been characterised in vitro but which have supporting in vivo gene expression data, whereas the fourth group comprises those variants that lack any in vitro or in vivo supporting data. Finally, four variants were treated as isolated examples owing to their complex nature.

Table 1 Summary of USS variants analysed in this study

Analytical strategy

It has long been appreciated that many RNA regulatory motifs rely for their function on a combination of primary and secondary structure (e.g. Mignone et al. 2002). However, to date, no general rules have been formulated that govern how RNA secondary structure might operate to mediate the functionality of the component regulatory elements. Here we have set out to address this basic question by systematically evaluating both functional and non-functional USS variant-containing primary sequences against the well-defined cis-regulatory motifs and within the context of the predicted RNA secondary structures. The rules derived from this study were then used to infer potential functionality, in the case of some of the remaining functionally uncharacterised USS variants, from their respective secondary structures. First, however, a survey of known cis-regulatory elements within the USS is warranted.

Overview of well-defined regulatory elements in the USS

As we shall see below, a diverse range of cis-acting elements has been identified in the USS. Note that (a) in the interest of brevity, we have focused primarily on human genes, discussing other systems and organisms only where data from humans were limited or absent; (b) usually only well-characterised elements (i.e. those mapped within shorter RNA fragments and experimentally validated) were collated; (c) secondary structural features were discussed whenever such information was available in the original publications; and (d) although the examples given are by no means comprehensive (despite an extensive literature search), taken together they represent all the major stages of post-transcriptional gene regulation.

Auxiliary regulatory elements involved in mRNA 3′ end formation

In addition to the three core elements required for polyadenylation [i.e. UCPAS, DCPAS (downstream core polyadenylation signal), and CS; refer to Fig. 1 in Chen et al. 2006], other sequence elements have also been identified that modulate the efficiency of mRNA 3′ end formation. One class of such auxiliary regulatory elements is located upstream of the UCPAS hexamer; its members are thus termed upstream sequence elements (USEs; reviewed in Zhao et al. 1999). Human USEs are generally U-rich, but a consensus sequence has not yet been established (Table 2). U-rich USEs in human genes bind to Fip1, a subunit of cleavage and polyadenylation specificity factor (CPSF), thereby contributing to CPSF-mediated stimulation of poly(A) polymerase activity (Kaufmann et al. 2004).

Table 2 Summary of upstream sequence elements (USEs) that modulate cleavage/polyadenylation in human protein-coding genes

An AU-rich element in the 3′ UTRs of the CCND1 and ODC genes is involved in eIF4E-mediated mRNA nuclear export

The eukaryotic translation initiation factor eIF4E promotes the nucleocytoplasmic export of cyclin D1 (CCND1) and ornithine decarboxylase (ODC) mRNAs but it does not influence GAPDH or actin mRNA export (Rousseau et al. 1996); this has recently been attributed to the presence of an eIF4E sensitivity element, a ∼100 bp AU-rich sequence, in the USS of the human CCND1 and ODC genes (Culjkovic et al. 2005).

Cis-acting elements that control mRNA subcellular localisation

In eukaryotic cells, localisation of mRNA to different regions of the cytoplasm ensures that intracellular proteins are synthesised near to where they are to function. It also provides the means to establish cellular polarity through asymmetric RNA and protein distribution (reviewed in Chabanon et al. 2004; Shav-Tal and Singer 2005). Table 3 summarises the currently well-characterised mRNA localisation signals (known as ‘zipcodes’) in mammalian genes.

Table 3 Summary of ‘zipcodes’ in the USSs of mammalian genes

3′ UTR motifs that affect mRNA stability

Of the many known post-transcriptional gene regulatory mechanisms, the control of mRNA stability/instability is the most intensively studied.

AU-rich elements (AREs)

AU-rich elements were initially identified in the 3′ UTRs of a variety of short-lived mRNAs encoding cytokines and protooncogenes (Caput et al. 1986; Shaw and Kamen 1986). Recent bioinformatics analyses have, however, revealed a large repertoire of AU-rich mRNAs that encode functionally diverse proteins (Bakheet et al. 2001; Khabar et al. 2005). AREs from different mRNAs vary dramatically but can be divided broadly into three distinct classes based upon their sequence features: class I has 1–3 copies of scattered AUUUA motifs coupled with a nearby U-rich region or U stretch; class II has at least two overlapping copies of the nonamer UUAUUUA(U/A)(U/A) in a U-rich region; and class III has a U-rich region but lacks a core AUUUA element (Chen and Shyu 1995).
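The three-class scheme above can be sketched as a simple sequence test. This is a minimal illustration rather than the published classification procedure: the function name `classify_are` is hypothetical, the text does not quantify "U-rich" (a run of five or more U residues is assumed here as a stand-in), and the class II test checks only for overlapping nonamers without separately testing the surrounding U-rich context.

```python
import re

# Lookaheads allow overlapping copies of each motif to be counted.
NONAMER = re.compile(r"(?=UUAUUUA[UA][UA])")   # UUAUUUA(U/A)(U/A)
PENTAMER = re.compile(r"(?=AUUUA)")            # core AUUUA motif
# ASSUMPTION: "U-rich" is approximated by a run of >= 5 consecutive U residues.
U_STRETCH = re.compile(r"U{5,}")


def classify_are(seq: str) -> str:
    """Assign a sequence to ARE class I, II or III using simplified rules."""
    rna = seq.upper().replace("T", "U")
    nonamer_starts = [m.start() for m in NONAMER.finditer(rna)]
    n_pentamers = len(PENTAMER.findall(rna))
    has_u_stretch = U_STRETCH.search(rna) is not None

    # Class II: at least two nonamer copies that overlap one another
    # (start positions fewer than 9 nt apart).
    if any(b - a < 9 for a, b in zip(nonamer_starts, nonamer_starts[1:])):
        return "class II"
    # Class I: 1-3 AUUUA motifs together with a U stretch.
    if 1 <= n_pentamers <= 3 and has_u_stretch:
        return "class I"
    # Class III: U-rich but lacking the core AUUUA motif.
    if n_pentamers == 0 and has_u_stretch:
        return "class III"
    return "unclassified"


if __name__ == "__main__":
    for example in ("GCAUUUAGCGCUUUUUUGC", "UUAUUUAUUUAUUUAUUUA", "GCGUUUUUUCGC"):
        print(example, classify_are(example))
```

The three example sequences are synthetic and were chosen only to exercise each branch; real AREs would need the distance between the AUUUA motifs and the U-rich region to be taken into account.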

To date, at least a dozen ARE-binding proteins (AREBPs) have been identified (reviewed by Khabar 2005; Barreau et al. 2006) and these proteins can either promote or inhibit the degradation of ARE-containing mRNAs. In the former case, the exosome, a complex of 3′ to 5′ exoribonucleases, may be directly recruited to the ARE by certain AREBPs such as AUF1 (Chen et al. 2001); these AREBPs therefore function as mRNA-destabilising proteins. In the latter case, however, a putative DExH RNA helicase, known as RHAU (RNA helicase associated with AU-rich element), may first be required to disrupt the interaction between ARE and certain AREBPs such as HuR before the exosome is activated (Tran et al. 2004); thus, these AREBPs function as mRNA-stabilising proteins.

mRNA-destabilising elements that are structurally different and functionally distinct from AREs have been claimed to be present in mRNAs encoding several cytokines including the granulocyte colony-stimulating factor (G-CSF; Brown et al. 1996; Putland et al. 2002) and tumour necrosis factor α (TNF-α; Stoecklin et al. 2003). However, we believe that (a) these elements may well fall into the AU-rich element category primarily because they are highly AU-rich (i.e. CUGUUUAAUAUUUAAACAG in Brown et al. 1996; UGUUUUCUGUGAAAAC in Stoecklin et al. 2003) and (b) that these elements may be regulated independently of other AREs in the same genes as a consequence of their interaction with different AREBPs. This notwithstanding, Brown et al. (1996) noted the importance of stem–loop structure in determining mRNA stability and suggested that it is the structure of the stem, rather than its sequence, which is important for function. We also evaluated the 15-nucleotide element reported by Stoecklin et al. (2003) and found that it could indeed form a stem–loop structure. Such AREs have been increasingly investigated in the context of secondary structure (e.g. Lopez de Silanes et al. 2004; Shao and Ismail-Beigi 2004; Berger et al. 2005; Fialcowitz et al. 2005).
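The stem–loop potential of a short element can be illustrated with a naive pairing check. The sketch below is not the mfold energy minimisation used in this study: it simply pairs the two ends of a sequence inwards using Watson-Crick and G·U rules, and the function name `terminal_stem` and the minimum loop size of three nucleotides are assumptions made for illustration.

```python
# Watson-Crick pairs plus the G.U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def terminal_stem(seq: str, min_loop: int = 3) -> tuple[int, str]:
    """Return (stem length in bp, loop sequence) for a hairpin closed by the sequence ends.

    ASSUMPTION: no bulges or mismatches are allowed and the loop must
    contain at least `min_loop` unpaired nucleotides.
    """
    rna = seq.upper().replace("T", "U")
    i, j = 0, len(rna) - 1
    # Pair bases from the outside in while a loop of sufficient size remains.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        i += 1
        j -= 1
    return i, rna[i:j + 1]


if __name__ == "__main__":
    # Element quoted above from Brown et al. (1996).
    stem_bp, loop = terminal_stem("CUGUUUAAUAUUUAAACAG")
    print(stem_bp, loop)
```

Applied to the 19-nucleotide element quoted from Brown et al. (1996), this check pairs eight bases on each side around a UAU loop, which is consistent with the emphasis those authors placed on the stem. A check of this kind says nothing about thermodynamic stability or about competing structures in the longer transcript.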

C- or pyrimidine-rich elements

In contrast to the many short-lived mRNAs that often have half-lives measured in minutes, globin mRNAs, which have half-lives of >24 h, are among the most stable of mRNAs due to their obvious biological roles. Three discontinuous C-rich stretches (i.e. CCUGCC, GCCU, and CUCCCCUCCUUG) within the 3′ UTR of α-globin mRNA have been identified as being essential for the longevity of the mRNA in erythroid cells (Weiss and Liebhaber 1994, 1995). They bind to αCP1 and αCP2 [α-globin poly(C)-binding proteins; Kiledjian et al. (1995)] which also interact directly with poly(A)-binding protein (PABP; Wang et al. 1999). This complex RNA–protein–protein interaction may promote/stabilise the binding of PABP to the poly(A) tail, thereby hindering the deadenylation of the α-globin mRNA (Wang et al. 1999). Interestingly, all three αCP-binding, C-rich segments are located in the loop or interloops of a complex stem–loop structure (Waggoner and Liebhaber 2003).

C-rich elements have also been found in the 3′ UTRs of three other highly stable mRNAs: α(I)collagen, tyrosine hydroxylase and 15-lipoxygenase. That all these C-rich elements bind to one or more common trans-acting protein factors may suggest a common regulatory mechanism of mRNA stability (reviewed by Waggoner and Liebhaber 2003). These C-rich elements are also known as pyrimidine-rich elements since the second frequently encountered nucleotide within the element is U (e.g. Holcik and Liebhaber 1997). In this regard, Yu and Russell (2001) identified a specific 14-nt pyrimidine-rich tract, UUCCUUUGUUCCCU, in the 3′ UTR of the β-globin mRNA as the stability determinant. Given that antibodies raised against αCP also bind to the β-globin mRNP complex, these authors suggested that the highly stable α- and β-globin mRNAs might be regulated through a common pathway.

AG-rich elements

In keeping with studies of the orthologous murine gene (Glisovic et al. 2003 and references therein), Christian et al. (2004) demonstrated that the stability of the human CYP2A6 mRNA is increased when its 3′ UTR is bound by heterogeneous nuclear ribonucleoprotein (hnRNP) A1. Further, secondary structure prediction studies revealed that the primary binding site within the 3′ UTR of the CYP2A6 mRNA contains AG-rich blocks that are likely to be single-stranded (Christian et al. 2004).

Histone mRNA 3′-terminal stem–loop

The half-life of replication-dependent histone mRNA is about 40 min during S phase but decreases to 10 min by the end of S phase. This rapid degradation appears to be due to the release of HBP (hairpin-binding protein) from a 6-bp stem/4-nt loop structure (figure a; Y=C/U; R=G/A; N=A/G/C/U) in the gene’s 3′ UTR, allowing 3′ exonuclease to access the histone mRNA (reviewed in Jaeger et al. 2005).

Cis-acting elements determining site-specific endonucleolytic mRNA cleavage

The degradation of at least four human mRNAs―Gro protein alpha (groa or CXCL1; Stoeckle 1992), transferrin receptor (TFRC; Binder et al. 1994), insulin-like growth factor 2 (IGF2; de Pagter-Holthuizen et al. 1988), and α-globin (Wang and Kiledjian 2000)―is, at least in part, determined by site-specific endonucleolytic cleavage within their 3′ UTRs. The frequency of occurrence of this regulatory mechanism is, however, likely to be seriously underestimated because internally cleaved mRNA intermediates are difficult to capture: whilst the resulting 5′ fragment is no longer protected by a poly(A) tail (analogous to deadenylation), the resulting 3′ fragment is unprotected by a cap structure (analogous to decapping; Mata et al. 2005). The mRNA breakdown intermediates captured in the above four cited cases may have formed stable RNase-resistant duplex structures or could have bound to stabilising trans-acting factors (Ross 1995). To date, endonucleolytic cleavage sites (ECS) have been fully characterised in three of the four mRNAs.

The role of iron-responsive elements (IREs) in modulating TFRC mRNA stability

TFRC mRNA displays a relatively short half-life when iron is abundant and a relatively long half-life when iron is scarce (Koeller et al. 1991). Regulation is accomplished through the binding of trans-acting IRE-binding proteins (IRPs) to the cis-IREs located within the gene’s 3′ UTR (Casey et al. 1989; Koeller et al. 1989).

The iron-responsive element is a hairpin structure comprising a 5′-CAGWGH-3′ apical loop (W is A or U and H is C, A or U; the first C and the second G, i.e. loop positions 1 and 5, form a Watson-Crick pair) and a stem that is interrupted either by a single C bulge or an unpaired C residue within an internal loop/bulge. The apical loop is necessary for high affinity binding to IRPs, whereas the C-bulge appears to orientate optimal protein binding without directly contacting the IRPs (Meehan and Connell 2001).
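The degenerate loop consensus can be turned into a sequence search, as sketched below. This is a minimal illustration under stated assumptions: the helper names `iupac_to_regex` and `find_ire_loops` are hypothetical, and a match to the loop hexamer alone does not establish an IRE, since the stem and the C bulge described above are not tested.

```python
import re

# IUPAC codes used in the loop consensus: W = A or U; H = A, C or U.
IUPAC = {"W": "[AU]", "H": "[ACU]"}


def iupac_to_regex(motif: str) -> str:
    """Translate a motif containing W/H ambiguity codes into a regular expression."""
    return "".join(IUPAC.get(base, base) for base in motif)


IRE_LOOP = re.compile(iupac_to_regex("CAGWGH"))


def find_ire_loops(seq: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched hexamer) for each candidate loop."""
    rna = seq.upper().replace("T", "U")
    return [(m.start() + 1, m.group()) for m in IRE_LOOP.finditer(rna)]


if __name__ == "__main__":
    # Synthetic example sequence, not taken from TFRC.
    print(find_ire_loops("GGCAGUGCCC"))
```

Candidate positions returned by such a search would still need to be examined in a predicted secondary structure to confirm that the hexamer lies in an apical loop above an interrupted stem.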

Up to five IREs (viz. A, B, C, D, and E in the order 5′ to 3′) have been identified in the 3′ UTR of the TFRC gene (Casey et al. 1988); three (i.e. B, C, and D) were found to be capable of conferring iron-dependent regulation of TFRC expression upon binding with IRPs (Casey et al. 1989). How does the IRE–IRP association stabilise TFRC mRNA? Binder et al. (1994) detected a shorter mRNA that lacks a significant portion of the 3′ end of the TFRC mRNA in a human plasmacytoma cell line. These authors noted that (a) the appearance of the truncated molecules correlated with rapid turnover of the TFRC mRNA; (b) the truncated RNA resulted from endonucleolytic cleavage rather than a 3′→5′ exonuclease pause; and (c) the ECS was located just 6 bp 3′ to the IRE element C, and concluded that the binding of IREs with IRPs results in protection of the TFRC mRNA against nucleolytic attack. Put another way, the IREs and the ECS constitute the key determinants of TFRC mRNA stability.

Secondary structures formed by two widely separated sequence elements are required for the specific cleavage of IGF2 mRNA

Human IGF2 mRNA is specifically cleaved between nucleotides 2183 and 2184 (numbering the first nucleotide 3′ to the translational stop codon as 1; Supplementary Table S3) in the 3′ UTR (de Pagter-Holthuizen et al. 1988; Meinsma et al. 1991). The ECS, located in an internal loop, is flanked by two complex secondary structures: the 5′ flanking one contains two stable stem–loops that are formed by nucleotides -133 to -7 (numbered with respect to the ECS); the 3′ flanking one comprises a downstream G-rich tract (element II) that folds into an extended duplex of 83 nucleotides with a C-rich sequence tract (element I) that is located ∼2 kb 5′ to the ECS (Meinsma et al. 1992; Scheper et al. 1995; van Dijk et al. 1998, 2000). These secondary structures (refer to Fig. 1 in van Dijk et al. 2000) have been shown to maintain a highly specific ECS by preventing the formation of alternative structures.

C-rich elements and specific α-globin mRNA cleavage

Three C-rich elements in the 3′ UTR of α-globin mRNA have been postulated to hinder deadenylation through the formation of an RNA–αCP–PABP complex. α-globin mRNA has also been reported to be specifically cleaved in its 3′ UTR, the ECS (indicated by /) being located within the third and longest C-rich element, CUCCCCUCC/UUG (Wang and Kiledjian 2000). More interestingly, secondary structure prediction has suggested that whilst the first nine nucleotides of the element are single stranded, the last three nucleotides are located within a stem structure. In other words, the ECS is located just 5′ to a double-stranded region (Waggoner and Liebhaber 2003). This degradation pathway is probably also regulated by the interaction between the C-rich elements and αCPs, irrespective of whether it is functionally linked with the deadenylation pathway (Waggoner and Liebhaber 2003).

Cis-acting 3′UTR elements that control mRNA translation

AU-rich elements

AU-rich elements (AREs) are the best-known determinants of mRNA stability/instability. However, it has been known for some time that AREs within certain cytokine genes are also responsible for translational repression (effects on mRNA stability have been excluded; Kruys et al. 1989; Han et al. 1990). The expression of some 14 genes, mostly encoding proteins involved in inflammation and tumour growth, is known to be controlled at the level of translation by means of an association between AREs and AREBPs such as TIA-1 and CUGBP2 (Espel 2005).

T-cell intracellular antigen 1 (TIA-1) has been best characterised as a translational repressor in the context of the ARE-bearing tumour necrosis factor-α (TNFA) and cyclooxygenase 2 (COX2) genes (Piecyk et al. 2000; Dixon et al. 2003). Very recently, using immunoprecipitation of TIA-1–RNA complexes from human colorectal carcinoma RKO cells followed by microarray-based identification and computational analysis of bound transcripts, Lopez de Silanes et al. (2005) identified a common motif present in TIA-1 target mRNAs. The 30 to 37-nucleotide-long motif is highly U-rich in its 5′ segment and AU-rich in its 3′ segment, forming loops of variable size and a bent stem. This motif has been found not only in both the TNFA and COX2 mRNAs but also in ∼3% (3,019) of UniGene transcripts (Lopez de Silanes et al. 2005).

Cis-acting sequence involved in specific incorporation of selenocysteine at UGA codons of selenoprotein mRNAs

In addition to functioning as a translational stop codon, UGA also signals the incorporation of selenocysteine (Sec). In eukarya and archaea, recognition of UGA as Sec requires a specific secondary structure known as the selenocysteine-insertion sequence (SECIS) that is located within the USS of these mRNAs. The SECIS recruits SECIS-binding protein 2 (SBP2), which in turn recruits a Sec-specific translation elongation factor and a specific Sec transfer RNA (Berry 2005 and references therein).

The canonical mammalian SECIS element is characterised by a hairpin structure comprising two helices, an internal loop, four consecutive non-Watson-Crick base-pairs (the quartet) containing a central G.A/A.G tandem, and an apical loop (Walczak et al. 1996). [Unlike ‘Watson-Crick base-pairs’, ‘non-Watson-Crick base-pairs’ present chemical groups in either the major or the minor groove of the helix that are available for specific protein or RNA binding (Walczak et al. 1998) and may thus be regarded as single stranded]. However, more recent studies have led to the identification of non-canonical forms of SECIS (Korotkov et al. 2002; Kryukov et al. 2003). Indeed, as Korotkov et al. (2002) opined, the absolutely conserved primary sequence in the SECIS is limited to the UGA...GA motif in the quartet, which serves as a specific recognition site for SBP2; the only other recognition feature of the SECIS might be its three-dimensional structure.

MicroRNA targets: another type of cis-acting regulatory element

MicroRNAs (miRNAs) post-transcriptionally regulate gene expression by binding to their target mRNAs (usually within the 3′ UTRs) (for recent reviews, see Bartel 2004; Pillai 2005; Valencia-Sanchez et al. 2006). The total number of human miRNAs has recently been estimated to be at least 800 (Bentwich et al. 2005), each being potentially capable of down-regulating a large number of different target mRNAs (Lim et al. 2005). However, to date, the mechanism by which the bound miRNA down-regulates gene expression remains unclear. In this regard, it is pertinent to note that Robins et al. (2005) and Zhao et al. (2005) have recently tried to incorporate the folded structure of mRNA to understand miRNA–target interactions.

Practical considerations about secondary structure prediction for establishing the rules that determine how RNA regulatory elements function in the context of a specific secondary structure

The above survey of well-defined cis-regulatory elements within the 3′ UTRs of protein-coding genes supports the view that RNA regulatory elements function in the context of a specific secondary structure, best exemplified by the extensively studied histone mRNA 3′-terminal stem–loop, IREs, SECIS, and cis-elements determining site-specific endonucleolytic mRNA cleavage. To establish general rules governing how RNA secondary structure might operate to effect the functionality of component regulatory elements, the most commonly used mfold program (Zuker 2003; http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi) was employed to predict secondary structures under default parameters. Several key points regarding this analysis are discussed below.

‘Local’ vs ‘global’

Our analysis was based on the prediction of ‘local’ rather than ‘global’ secondary structure. Interested readers are invited to consult Meyer and Miklos (2005) for detailed discussions of this issue.

Sequence length used for secondary structure analysis

Secondary structure varies as a function of the input sequence length for which there is no consensus for use in mfold analysis [e.g. 36 bp in Ruggiero et al. (2003); 87 bp in Chabanon et al. (2005); from 100 to 200 bp in Aranda-Abreu et al. (2005); 130 bp in Waggoner and Liebhaber (2003); 234 bp in Christian et al. (2004)]. In an attempt to evaluate the diverse USS variants in both a systematic and an objective manner, we employed the following policy: (a) the sequence analysed was always limited to the last exon; (b) ±100 bp sequences flanking each variant were used for analysis wherever possible; (c) if either of the analysed paired sequences (i.e. wild-type and variant sequences) did not display an obvious folded structure or the 5′ flanking sequence was <100 bp, ±50 bp sequences flanking each variant were used for analysis wherever possible; and (d) if again either of the paired sequences analysed did not display a folded structure, ±60 or ±40 bp sequences flanking each variant were used for analysis wherever possible.
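The windowing step of this policy can be sketched as follows. The function name `flanking_window` is hypothetical, and the sketch covers only the extraction of a flanking window clipped to the last exon; the decision to fall back to a shorter window when no folded structure is obtained depends on the folding output and is not modelled here.

```python
def flanking_window(exon: str, variant_index: int, flank: int = 100) -> tuple[str, int]:
    """Extract the sequence flanking a variant, clipped to the exon boundaries.

    `exon` is the last-exon sequence and `variant_index` is the 0-based
    index of the variant within it. Returns the window and the 1-based
    position of the variant inside that window.
    """
    if not 0 <= variant_index < len(exon):
        raise ValueError("variant_index lies outside the exon sequence")
    start = max(0, variant_index - flank)                 # clip at the exon start
    end = min(len(exon), variant_index + flank + 1)       # clip at the exon end
    return exon[start:end], variant_index - start + 1


if __name__ == "__main__":
    exon = "A" * 300  # placeholder sequence for illustration only
    window, position = flanking_window(exon, 150)
    print(len(window), position)
```

With a full ±100 bp flank the window is 201 nucleotides long and the variant occupies position 101 within it. Calling the function with `flank=50` gives the shorter window used when the 5′ flanking sequence is <100 bp.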

Terms used for describing secondary structural features

Terms used for describing RNA secondary structural features (e.g. hairpin loop, bulge loop, internal loop) are in accordance with Mathews et al. (1999).

Evaluating non-functional USS variants in the context of secondary structure

Four USS variants were identified as being non-functional during the course of this meta-analytical study.

CXCL12

Stromal cell-derived factor-1 (SDF-1; or CXCL12; MIM# 600835) is the principal ligand for CXCR4, a co-receptor with CD4 for T-lymphocyte cell line-tropic human immunodeficiency virus type 1 (HIV-1). Winkler et al. (1998) identified a common polymorphism, designated SDF1-3′A (or SDF-G801A) in the 3′-UTR of the SDF-1 β isoform. Homozygosity for SDF1-3′A has been shown in one study to have a protective effect with respect to AIDS progression (Winkler et al. 1998) although it has also been reported to be associated with accelerated disease progression (Mummidi et al. 1998; van Rij et al. 1998). A subsequent international meta-analysis has, however, indicated that SDF1-3′A homozygosity is unlikely to affect disease progression (Ioannidis et al. 2001).

If the SDF1-3′A allele really were to play a role in AIDS progression, then it should be associated with altered gene expression. In the original work of Winkler et al. (1998), the functional significance of the SDF1-3′A allele was simply inferred from the observation that “the SDF1-3′A variant is located in a segment of the 3′-UTR of the SDF-1β transcript that is highly conserved in sequence (69% sequence between human and mouse SDF-1β 3′ UTR sequence with no gaps).” As opined by Winkler et al. (1998), “this extent of conservation with the segment suggests that it may serve as a target for cis-acting factors influencing transcript abundance, synthesis, transport, stability, or splice product abundance.” However, the SDF1-3′A variant was not found to affect SDF-1β RNA synthesis either in vitro (Arya et al. 1999) or in vivo (Kimura et al. 2003). More recently, using allele-specific transcript quantification, Kimura et al. (2005) were able to show that polymorphisms other than SDF1-3′A exert a cis-acting effect on the expression of SDF-1 transcripts.

APOA5

Apolipoprotein A-V (APOA5; MIM# 606368) plays an important role in the regulation of triglyceride metabolism (Merkel and Heeren 2005). The APOA5*2 haplotype, which contains a T>C SNP in the 3′ UTR (located 158 bp downstream of the translational termination codon) and is present in ∼16% of Caucasians, was found to be significantly associated with increased plasma triglyceride levels (Pennacchio et al. 2001, 2002). The 3′ UTR T>C SNP has, however, recently been shown not to influence reporter gene expression (Talmud et al. 2005).

NPPC

A rare G>A polymorphism in the C-type natriuretic peptide gene (NPPC; MIM# 600296) has been reported to be tentatively associated with hypertension in the Japanese population (A allele frequency: 1.6% in controls vs 2.6% in patients). However, this polymorphism was not found to influence luciferase activity when evaluated in a transient reporter gene assay (Ono et al. 2002).

SEP15

SEP15, one of the 25 selenoprotein-encoding genes (Kryukov et al. 2003), manifests two haplotypes, C811/G1125 versus T811/A1125 (frequencies: 68 vs 32%); both variants are located within the USS of the gene’s 3′ UTR (Gladyshev et al. 1998; Kumaraswamy et al. 2000). Unlike G>A at position 1125 (which will be considered in the next section), C>T at position 811 was not predicted to reside within a SECIS motif using SECISearch 2.19 (http://www.genome.unl.edu/SECISearch.html; Kryukov et al. 2003); C811T has also been consistently found to be non-functional in a reporter gene system (Kumaraswamy et al. 2000; Hu et al. 2001).

All four non-functional USS variants result in a similar secondary structural change

All four non-functional USS polymorphisms were predicted to give rise to similar secondary structures (here termed pattern 0) in which both the substituted and substituting nucleotides are capable of pairing with the same opposing nucleotides in a helical portion (NB. G can pair with either U or C; U can pair with either G or A in an mRNA molecule). This is exemplified by the G801A polymorphism in CXCL12 (Fig. 1); the predicted secondary structures of the other three polymorphisms are provided in Supplementary Figs. S1, S2 and S3, respectively.

Fig. 1 Proposed secondary structures of the wild-type (left panel) and variant (right panel) RNA sequences in relation to the alternative G/A alleles (position 101; indicated by arrow) of the 3′ UTR polymorphism in the CXCL12 gene. This type of secondary structural change was termed pattern 0
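The pairing rule that underlies pattern 0 can be expressed compactly. The sketch below uses only the pairing possibilities stated in the text (Watson-Crick pairs plus G·U); the names `shared_partners` and `is_pattern_0_candidate` are hypothetical, and the check covers only base-pairing compatibility, not whether the position actually lies in a helix in the predicted structure.

```python
# Allowed partners for each base: Watson-Crick pairs plus the G.U wobble pair.
CAN_PAIR = {
    "A": {"U"},
    "C": {"G"},
    "G": {"C", "U"},
    "U": {"A", "G"},
}


def shared_partners(ref: str, alt: str) -> set[str]:
    """Bases that can pair with both the substituted and the substituting nucleotide."""
    return CAN_PAIR[ref] & CAN_PAIR[alt]


def is_pattern_0_candidate(ref: str, alt: str, opposing: str) -> bool:
    """True if the opposing base in the helix can pair with either allele."""
    return opposing in shared_partners(ref, alt)


if __name__ == "__main__":
    print(shared_partners("G", "A"))   # G>A substitution
    print(shared_partners("U", "C"))   # T>C substitution at the RNA level
```

Under these rules a G>A change can preserve pairing only opposite a U, and a U>C (or C>U) change only opposite a G, whereas transversions such as A>C share no common partner.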

Evaluating functional USS variants in the context of secondary structure

A total of 17 known functional USS variants were collated and analysed individually. All deletions of ≥5 bp were predicted by mfold to result in significant secondary structural change and will thus not be discussed individually in the text. As we shall see below, the functional consequences of these variants, taken together, affected virtually all the main stages of post-transcriptional gene regulation (with the exception of subcellular localisation). In addition, the cis-regulatory elements that these variants have disrupted can also be confidently assigned in most cases.

HBB: a point mutation occurring within a CU-rich sequence tract affects both mRNA 3′ end formation and stability

A C>G mutation, six nucleotides downstream of the translational termination codon (UAA) of the HBB gene (MIM# 141900; encoding β-globin), has been reported in β-thalassaemia intermedia patients (Jankovic et al. 1991; Maragoudaki et al. 1998). This mutation, which occurred within a pyrimidine-rich sequence tract, UAAGCUCG(C/G)UUUCUUGCUGUCCAAUUUCUAUU, is associated with a 20–34% reduction in HBB mRNA levels in heterozygous patients as compared with healthy controls (Maragoudaki et al. 1998). Thus, in vivo, the mutant allele is associated with a 40–68% reduction in HBB mRNA level as compared with the wild-type allele. This observation concurs with data obtained from in vitro analysis: steady-state cytoplasmic mRNA levels from transfected MEL cells containing the mutant allele were reduced by 52–60% as compared with those obtained from the wild-type allele (Sgourou et al. 2002). Analysis of nuclear RNA demonstrated that the HBB mutation served to lower the ratio of cleaved/uncleaved transcripts by 22–30%, suggesting that the C>G substitution adversely affects mRNA 3′ end formation. As already noted by Sgourou et al. (2002), the observed decrease in nuclear HBB RNA does not account for the rather greater decrease in cytoplasmic HBB mRNA. It may well be therefore that this lesion impacts on other properties of the HBB mRNA such as its stability.
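The conversion from the reduction observed in heterozygotes to the reduction attributable to the mutant allele can be made explicit. The sketch below assumes that both alleles contribute equally to the mRNA pool in controls and that the wild-type allele in patients is expressed at its normal level; the function name `mutant_allele_reduction` is hypothetical.

```python
def mutant_allele_reduction(heterozygote_reduction_pct: float) -> float:
    """Percentage reduction in output of the mutant allele relative to wild type.

    ASSUMPTION: each allele contributes half of the control mRNA level, so
    a heterozygote produces (100 + m) / 2 percent of control, where m is the
    mutant allele's output as a percentage of wild type. A total reduction
    of r percent therefore corresponds to a mutant-allele reduction of 2r.
    """
    if not 0 <= heterozygote_reduction_pct <= 50:
        raise ValueError("a single allele can account for at most a 50% reduction")
    return 2 * heterozygote_reduction_pct


if __name__ == "__main__":
    for observed in (20, 34):
        print(observed, mutant_allele_reduction(observed))
```

Doubling the observed 20–34% reduction gives the 40–68% range quoted for the mutant allele.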

As shown in Fig. 2, all three large loops that formed downstream of the translational stop codon (UAA) are pyrimidine- or U-rich. Since U-rich elements are known to regulate mRNA 3′ end formation (refer to Table 2), and CU-rich elements in the 3′ UTR of HBB mRNA have been shown to affect mRNA stability (Yu and Russell 2001), it is reasonable to suppose that these CU-rich sequence tracts regulate both pre-mRNA 3′ end formation in the nucleus and mRNA stability in the cytoplasm. Not surprisingly, the C>G mutation was predicted not only to lead to a shortening of the first bulge loop by one base (i.e. from UUUCUU to UUCUU) but also to introduce an unpaired base (G) into the stem portion of this loop (Fig. 2). Thus, whilst the shortened CU-rich motif may exhibit reduced binding affinity for trans-acting factors, the unpaired G within the stem could also have affected the interaction between the CU-rich motif(s) and the trans-acting factors by modifying the orientation of the first bulge loop. This kind of secondary structure change, affecting both the loop(s) and stem(s), was termed pattern I.

Fig. 2 Proposed secondary structures of the wild-type (left panel) and mutant (right panel) RNA sequences in relation to the 3′ UTR C>G mutation (position 101; indicated by thin solid arrow) in the HBB gene. Dotted arrow: the unpaired G residue in the mutant sequence; thick solid arrow: translational stop codon UAA (positions 93–95); open arrows: loops comprising pyrimidine-rich sequences. This type of secondary structural change was termed pattern I

HBB: a 13 bp deletion that may affect RNA nucleocytoplasmic transport

A 13 bp deletion (GCATCTGGATTCT, the first 13 bases of the sequence GCATCTGGATTCTGCCTAATAAA) that terminates four nucleotides 5′ to the UCPAS (AATAAA) of the HBB gene was identified in a 12-year-old Turkish boy with β-thalassaemia (Basak et al. 1993). This mutation was associated with a 6-fold reduction in mRNA level in vivo. However, in vitro analysis indicated that the lesion neither affected the assembly of the mRNA-stabilising mRNP complex nor the stability of the mutant mRNA. Rather, this deletion appears to decrease HBB mRNA levels mainly by inhibiting the processing of pre-mRNA to mRNA in the nucleus (Bilenoglu et al. 2002). The original authors considered that the mutation might disrupt a sequence element that facilitates nucleocytoplasmic transport, thereby hampering the processing of the mutant pre-mRNA, and resulting in the accumulation of the fully processed RNA species in the nucleus.

USS variants that affect mRNA stability

AU- or U-rich elements

Six USS variants involving six different genes were found to affect AU- or U-rich elements.

BMP2. Variations in the BMP2 gene (MIM# 112261; encoding bone morphogenetic protein 2) have been associated with both osteoporosis (Styrkarsdottir et al. 2003) and osteoarthritis (Valdes et al. 2004). A common A>C SNP in BMP2, located within the AU-rich USS, was found to be functional: the minor C allele not only has a different affinity for specific proteins but also exhibits a higher in vitro decay rate as compared with the major A allele (Fritz et al. 2006). Unlike the above-mentioned pattern 0 and pattern I variants, the A>C SNP in the BMP2 gene was located within a hairpin loop (here termed pattern II; Fig. 3). Since this was the only change observed in the predicted secondary structure, it would follow that an A to C substitution in a single-stranded AU-rich motif might reduce the motif’s affinity for its cognate binding factors.

Fig. 3

Proposed secondary structures of the wild-type (left panel) and variant (right panel) RNA sequences in relation to the 3′ UTR A>C SNP (position 101; indicated by arrow) in the BMP2 gene. This kind of secondary structural change was termed pattern II

NR3C1. The human glucocorticoid receptor gene (GCCR or NR3C1; MIM# 138040) comprises 10 exons (1–8, 9α, and 9β). It encodes two major isoforms through alternative splicing; whereas the α isoform (hGRα; exons 1–9α) encodes a functional receptor, the β isoform (hGRβ; exons 1–8 and 9β) encodes a protein that does not bind glucocorticoid (Oakley et al. 1996). However, hGRβ acts as a dominant negative inhibitor of hGRα by competitively binding to hGR-interacting proteins (Oakley et al. 1996; Charmandari et al. 2005). Thus, increased expression of hGRβ would be predicted to result in glucocorticoid resistance which may not only impede glucocorticoid treatment of immune-related disease but may also contribute to the pathogenesis of these conditions (Schaaf and Cidlowski 2002).

An A>G polymorphism, which occurred in the AU-rich USS of hGRβ mRNA, has been reported to be associated with rheumatoid arthritis (Derijk et al. 2001). Two independent in vitro studies have shown that the G allele of this polymorphism increases the stability of the reporter mRNA (Derijk et al. 2001; Schaaf and Cidlowski 2002). This polymorphism also results in a pattern II secondary structural change involving an AU-rich loop (Supplementary Fig. S4).

TGFB3. Beffagna et al. (2005) recently detected a 1723 C>T transition in the 3′ UTR of the transforming growth factor-β3 gene (TGFB3; MIM# 190230) in a 16-year-old boy with a typical arrhythmogenic right ventricular dysplasia (ARVD) phenotype and his brother who died suddenly of ARVD at the age of 16. This variant was considered to be disease-associated since it was not found in 600 control chromosomes, and an in vitro transfection assay revealed that it significantly increased reporter gene activity in murine C2C12 myoblasts (Beffagna et al. 2005).

The TGFB3 C>T polymorphism is again predicted to result in a pattern II secondary structural change involving an AU-rich sequence (Supplementary Fig. S5). However, while the above-mentioned BMP2 and NR3C1 polymorphisms decrease the number of motif-defining nucleotides, the TGFB3 polymorphism increases the number of such nucleotides.

CEACAM1. A single T deletion polymorphism, located within a T8 sequence tract in the 3′ UTR of the carcinoembryonic antigen-related cell adhesion molecule 1 gene (CEACAM1; MIM# 109770), is involved in tumour onset and progression; this ΔT allele is also associated with an increased level of reporter gene expression (Ruggiero et al. 2003). This deletional polymorphism has been predicted to result in an enlarged T-rich hairpin (from 4T to 6T), a shortened stem, and the generation of a new small bulge loop in the deletion allele as compared with the insertion allele (see Fig. 2 in Ruggiero et al. 2003). This then falls into the pattern I secondary structure change.

PPP1R3A. A common polymorphism comprising a 5 bp deletion plus three single nucleotide substitutions in the AU-rich 3′ UTR of the protein phosphatase-1 regulatory subunit 3 gene (PPP1R3A; MIM# 600917) has been found to be associated with insulin resistance and type 2 diabetes in Pima Indians; it also correlates with significantly reduced PPP1R3A mRNA expression in vivo (Xia et al. 1998). As illustrated in Fig. 4, the deletion allele is characterised by the loss of an AUUUA motif and the generation of a similar motif, AUUUUA; the spacing of the two AUUU(U)A motifs nevertheless differs between the two alleles. Transient expression analysis indicated that the half-life of the chimeric β-globin mRNA containing the deletion allele was at least 10-fold shorter than that observed for an mRNA species bearing the insertion allele, suggesting that the observed reduction in PPP1R3A mRNA could be directly attributable to the deletion variant in the 3′ UTR (Xia et al. 1998). Further in vitro analyses revealed that three different proteins (43, 80, and 139 kDa) bind to the polymorphic region whereas the less stable deletion allele exhibits ≥2-fold higher relative protein binding (Xia et al. 1999).

Fig. 4

Comparison of the deletion and insertion alleles of the length polymorphism in the 3′ UTR of the PPP1R3A gene. The AUUU(U)A motif is shaded. Dashes indicate the deleted residues. The three substitutions are highlighted in bold
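The comparison of AUUU(U)A motifs between alleles can be sketched programmatically. The following is a minimal illustration only: the function names and the toy test sequence are hypothetical, and the actual PPP1R3A allele sequences (shown in Fig. 4) are not reproduced here. It simply locates AUUUA and AUUUUA motifs, allowing for overlaps, and reports the spacing between successive motifs.

```python
import re

# Lookahead so that overlapping motifs (e.g. AUUUAUUUA) are all reported.
# Matches AUUUA and AUUUUA only; this is an illustrative choice, not a
# definition of AU-rich elements taken from the article.
ARE_PATTERN = re.compile(r"(?=(AU{3,4}A))")


def find_are_motifs(seq):
    """Return (start, motif) pairs for AUUUA/AUUUUA motifs in seq.

    DNA input is accepted and converted to RNA (T -> U); start is 0-based.
    """
    rna = seq.upper().replace("T", "U")
    return [(m.start(1), m.group(1)) for m in ARE_PATTERN.finditer(rna)]


def motif_spacings(hits):
    """Distances (in nt) between the start positions of successive motifs."""
    starts = [start for start, _ in hits]
    return [b - a for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Hypothetical toy sequence, NOT the PPP1R3A 3' UTR.
    toy = "GCAUUUAGCCGAUUUACG"
    hits = find_are_motifs(toy)
    print(hits, motif_spacings(hits))
```

Applied to the two real alleles, such a scan would be expected to show the loss of one AUUUA motif, the gain of an AUUUUA motif and an altered spacing, as described in the text.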

TYMS. A common 6-bp deletion polymorphic variant (delTTAAAG) in the USS of the thymidylate synthase gene (TYMS; MIM# 188350; Ulrich et al. 2000) has recently been reported to be associated with gastric (Graziano et al. 2004; Zhang et al. 2005) and colorectal (Mandola et al. 2004) cancer susceptibility, sensitivity of gastric cancer to 5-fluorouracil-based chemotherapy (Lu et al. 2006), and clinical outcome of patients with esophageal adenocarcinoma treated with pre-operative chemoprevention (Liao et al. 2006). That this deletion variant is functional is supported by two observations: (1) it is associated with reduced intratumoural TYMS mRNA expression in vivo and (2) the ∼35% decrease in the in vitro expression of TYMS-3′ UTR reporter gene constructs bearing the 6 bp deletion was due to an increased rate of mRNA degradation (Mandola et al. 2004).

C-rich element in NPR1

A common micro-deletion (CCCC→CCC) polymorphism (with an allele frequency of 33–39% in the general population) in the 3′ UTR of the human natriuretic peptide receptor A gene (NPR1; MIM# 108960) was identified concurrently by Knowles et al. (2003) and Pitzalis et al. (2003). These two independent studies were complementary to each other. Transient expression analysis of constructs containing the NPR1 3′ UTR indicated that the 3C allele was associated with a 3-fold reduction in the expression of the reporter gene by comparison with the 4C allele (Knowles et al. 2003). On the other hand, although the 3C allele is not associated with cardiovascular disease, it does appear to represent a risk-modifying factor: the 3C allele, in heterozygous form, occurs more frequently in young normotensive subjects with a family history of hypertension than those without a family history, whereas 3C homozygotes display significantly higher systolic blood pressure and a prolonged ventricular relaxation time by comparison with 4C homozygotes (Pitzalis et al. 2003).

In the words of Knowles et al. (2003), “we were surprised to find that having 3C or 4C at 30 bp downstream of the stop codon in the NPR1 transcript significantly affected expression”. With hindsight, however, this finding is compatible with the observation that C-rich elements act as mRNA stability determinants (Waggoner and Liebhaber 2003). Further, the deletional polymorphism (shown in parentheses) is located within an extended C-rich sequence tract, CCTGCCTCCTCTCCTATCCCTCCACACCTCCC(C)TACCC, with most of the cytosines being predicted to be single-stranded (Supplementary Fig. S6). The loss of one cytosine in the hairpin loop of the deletion allele (pattern II) could reduce the binding efficiency to mRNA-stabilising trans-acting factors.
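The C-richness of this tract in the two alleles can be checked directly from the sequence quoted above. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the bracketed "(C)" in the printed tract denotes the polymorphic cytosine that is present in the 4C allele and absent in the 3C allele.

```python
import re

# Tract exactly as printed in the text; "(C)" marks the polymorphic cytosine.
NPR1_TRACT = "CCTGCCTCCTCTCCTATCCCTCCACACCTCCC(C)TACCC"


def expand_indel(tract):
    """Return (insertion_allele, deletion_allele) for a tract written with
    the polymorphic residue(s) in parentheses, e.g. 'ACC(C)T'."""
    insertion = re.sub(r"[()]", "", tract)       # keep the bracketed bases
    deletion = re.sub(r"\([^)]*\)", "", tract)   # drop the bracketed bases
    return insertion, deletion


def base_fraction(seq, base):
    """Fraction of positions in seq occupied by the given base."""
    return seq.count(base) / len(seq)


four_c, three_c = expand_indel(NPR1_TRACT)

if __name__ == "__main__":
    for name, allele in (("4C", four_c), ("3C", three_c)):
        print(name, len(allele), allele.count("C"),
              round(base_fraction(allele, "C"), 3))
```

Both alleles remain predominantly cytosine at the primary sequence level, which is in keeping with the argument that the relevant difference lies in the single-stranded loop rather than in overall base composition.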

CU-rich element in TFCP2

Lambert et al. (2000) reported that the A allele of a G/A polymorphism in the 3′ UTR (15 bp downstream of the translational stop codon) of the transcription factor LBP-1c/CP2/LSF gene (TFCP2; MIM# 189889) was associated with a reduced risk of Alzheimer’s disease (AD) in three independent populations (French, British, and North American). Although two studies (Taylor et al. 2001; Luedecking-Zimmer et al. 2003) have lent support to this finding, other studies have yielded contradictory results viz. the A allele was found to be associated with increased AD risk (Panza et al. 2004; Bertram et al. 2005). The underlying reasons for this discrepancy are unclear but it may be that the 3′ UTR SNP is not itself pathogenic but is rather in linkage disequilibrium (LD) with another functional genetic variant in the vicinity. This notwithstanding, electrophoretic mobility shift assays have demonstrated that oligomers containing the A allele display, on average, a 3.75-fold lower affinity for neuroblastoma nuclear proteins than those containing the G allele, whereas the absence of the A allele was associated with lower expression of TFCP2 mRNA. These findings suggested that TFCP2 gene expression may be differentially modulated by the G and A alleles within the 3′ UTR sequence through differential binding of a nuclear protein (Lambert et al. 2000).

The G/A polymorphism, which occurs within a CU-rich sequence tract, CGUUUC(G/A)UGCCC, was predicted to cause a pattern I secondary structural change (Supplementary Fig. S7). In particular, this polymorphism significantly affects the length of a hairpin loop: whereas the wild-type hairpin loop contains 12 CU-rich bases, the variant loop contains only three bases (UUC). In addition, the newly generated 5-nt bulge loop in the variant allele is not CU-rich (Supplementary Fig. S7). In summary, the G>A polymorphic change may convert much of the normal CU-rich motif from the single-stranded state to the double-stranded state, thereby lowering its binding affinity for trans-acting factors. This structural analysis therefore argues strongly for the functional significance of the G>A polymorphic change, although its exact role in the pathogenesis of AD remains to be clarified.

AG-rich elements

In addition to the recently described functional polymorphisms that have occurred within different AG-rich blocks in the 3′ UTR of the CYP2A6 gene (Wang et al. 2006), three other polymorphisms were retrospectively found to occur within an AG-rich element.

F7. Factor VII (F7; MIM# 227500) plays a key role in the extrinsic pathway of blood coagulation. A 2 bp (AA) insertional polymorphism in the 3′ UTR of the F7 gene, with an allele frequency of 15% in the Caucasian population, is often associated with other variants causing F7 deficiency (Peyvandi et al. 2005 and references therein). It has been postulated that this common polymorphism may in part account for the poor genotype–phenotype correlation that has frequently been observed in F7 deficiency (Peyvandi et al. 2005). Transient expression analysis has demonstrated that the insertion allele reduces the steady-state level of F7 mRNA by 40% as compared with the deletion allele (Peyvandi et al. 2005).

The AA insertional polymorphism in the F7 gene occurs within a large loop of AG-rich sequence tract and results in a pattern II secondary structure change (Supplementary Fig. S8). The consequently enlarged single-stranded AG-rich motif might display an increased binding capacity for trans-acting factors.

PTPN1. Di Paola et al. (2002) identified a common G micro-insertion variant at position 1484 (nomenclature in accordance with GenBank accession number M33689) in the 3′ UTR of the protein tyrosine phosphatase 1B gene (PTPN1; MIM# 176885) that is associated with several features of insulin resistance in two different Italian populations. Although this association has not been confirmed by subsequent studies (Echwald et al. 2002; Dahlman et al. 2004; Florez et al. 2005; Spencer-Jones et al. 2005), the potential functionality of this insertion polymorphism is indicated by (a) PTPN1 mRNA levels being significantly higher in five muscle biopsies taken from 1484insG carriers than in 11 age- and sex-matched controls and (b) 1484insG mRNA being more stable than that derived from the deletion allele in transfection studies (Di Paola et al. 2002). This deletion/insertion polymorphism occurs within an AG-rich sequence tract and results in a pattern I secondary structural change (Supplementary Fig. S9).

NPR1. In addition to the common micro-deletion (CCCC→CCC) polymorphism, Knowles et al. (2003) also identified a second 4 bp (AGAA) micro-deletion polymorphism in the 3′ UTR of the NPR1 gene, with an allele frequency of 4%. Transient expression analysis has shown it to be associated with a 2-fold reduction in expression of the reporter gene (Knowles et al. 2003). This deletion polymorphism, which occurs within an AG-rich sequence tract, GGGAGGAGAAAGAG, results in a pattern I secondary structural change (Supplementary Fig. S10). Thus, the stability of NPR1 mRNA appears to be regulated by two different kinds of 3′ UTR motifs (i.e. C-rich and AG-rich elements).

Variants that modify the efficiency of Sec incorporation

Of the 25 known selenoprotein-coding genes, four have been reported to harbour disease-associated variants in their 3′ UTRs. However, only the T>C mutation in SEPN1 (MIM# 606210), the G1125A polymorphism in SEP15 (MIM# 606254) and a common T/C polymorphism in GPX4 (MIM# 138322) appear to reside within the SECIS motif (see Cis-acting sequence involved in specific incorporation of selenocysteine at UGA codons of selenoprotein mRNAs). The two common 3′ UTR polymorphisms in the type 1 deiodinase gene (DIO1; MIM# 147982; Peeters et al. 2003, 2005) are not predicted to affect the SECIS motif and will thus be analysed together with the other group 4 variants. In addition, the polymorphism in the GPX4 gene has not been functionally characterised and was therefore also included in group 4.

SEP15

As illustrated in Fig. 5, the G/A1125 polymorphism does not affect the main stem–loop structure of the SECIS motif. However, it does serve to alter the sub-stem–loop structure attached to the apical loop. [Consistent with this, mfold analysis suggested that this polymorphic variant gives rise to a pattern I secondary structural change (Supplementary Fig. S11)]. This change may indirectly modify the three-dimensional structure of the SECIS motif, resulting in reduced binding of trans-acting factors such as SBP2. Indeed, using a specialised reporter gene construct, it has been demonstrated that although the A1125-containing SECIS motif was approximately twice as efficient in stimulating the readthrough of the UGA codon, it was less responsive to added selenium in the culture medium than the G1125-containing SECIS motif (Kumaraswamy et al. 2000; Hu et al. 2001). Malignant mesothelioma cells with the A1125 genotype were also less responsive to the growth inhibitory and apoptotic effects of selenium than cells with the G1125 genotype (Apostolou et al. 2004).

Fig. 5

Predicted secondary structures of the wild-type (left panel) and variant (right panel) RNA sequences in relation to the 3′ UTR G>A polymorphism (indicated by arrows) in the SEP15 gene using SECISearch 2.19. The well-conserved AA in the apical loop, the non-Watson-Crick base-pairs (the quartet; 5′-UGAA-3′:5′-AGAU-3′), and an A in the internal loop are highlighted in bold. Note that in the secondary structures predicted by mfold, the non-Watson-Crick base-pairs (the quartet) are single stranded (Supplementary Fig. S11)

SEPN1

Very recently, a homozygous T>C mutation that occurred in the quartet 5′-UGAU-3′:5′-AGAU-3′ of the SEPN1 gene has been detected in a patient with a mild form of rigid spine muscular dystrophy. This mutation was found to abolish SBP2 binding in vitro (Allamand et al. 2006) and results in a pattern II secondary structural change (Supplementary Fig. S12).

SLITRK1: a 3′ UTR variant reported to occur within a miRNA target site

A G>A transition, in a predicted binding site for human miRNA hsa-miR-189 within the 3′ UTR of the Slit and Trk-like 1 gene (SLITRK1; MIM# 609678), was identified in two apparently unrelated Tourette syndrome patients but was absent in 4,296 control chromosomes (Abelson et al. 2005). Experimental confirmation of the functional effect of this mutation came from the demonstration that, in the presence of hsa-miR-189, in vitro constructs bearing the 3′ UTR mutation increased repression of a reporter gene by comparison with the wild-type. Interestingly, the G>A mutation results in a pattern I secondary structural change (Supplementary Fig. S13).

Correlating potential functionality with secondary structural changes

Comparison of the patterns of secondary structural changes observed with both the functional and non-functional USS variants (Table 4) strongly suggests that secondary structural change can be used as a reliable indicator of functionality. More importantly, this analysis sheds some light on several fundamental issues. Firstly, the frequently adopted rules to define cis-regulatory elements, based upon sequence comparisons between orthologous genes and/or in vitro functional analysis, appear to be inadequate to the task. For example, a variant that occurs within a well-conserved sequence tract that is double-stranded in the predicted RNA secondary structure may well have functional consequences. It would, however, be inappropriate immediately to term this sequence tract a cis-acting element because the variant could simply have induced a structural change in a nearby single-stranded, bona fide regulatory motif. Secondly, there may be a minimal length requirement for cis-acting elements. A detailed evaluation of the loops presumed to serve as binding sites for trans-acting factors reveals that they usually comprise at least four nucleotides (Table 4).

Table 4 Characteristic hallmarks of different patterns of secondary structural change with respect to functionality

Predicting functionality of the remaining USS variants using the newly established rules

The functionality or non-functionality of the group 3 and 4 variants can be reasonably well predicted by application of the rules listed in Table 4 and briefly described in Table 1. Here we shall confine our discussion to some novel findings and several interesting examples.

Identification of a new pattern of secondary structure change

Five variants (GFPT2, IL12B, and RNASE3 in group 3; CDKN2A and ZNF627 in group 4) were predicted to result in an altered orientation of the global stem–loop structure, exemplified by the IL12B A/C SNP (Fig. 6). This kind of radical secondary structural change (termed pattern III) could well be of functional significance.

Fig. 6

Proposed secondary structures of the wild-type (left panel) and variant (right panel) RNA sequences in relation to the 3′ UTR A>C polymorphism (position 101; indicated by arrow) in the IL12B gene. This kind of secondary structural change was termed pattern III

Further secondary structural patterns of unknown significance

Type 1

The secondary structural change predicted to be associated with the AGTR1 A1166C variant (in group 4) involved the replacement of the paired A•U in the wild-type allele by a 1 × 1 internal loop in the variant allele (Supplementary Fig. S14). This secondary structural change as well as those of the ASIP g.8818A>G polymorphism (group 4; Supplementary Fig. S15) and the DPYSL2 2236T>C polymorphism (group 4; Supplementary Fig. S16) can in principle be assigned to pattern I. However, these changes are more minor than those of the pattern I functional variants and are thus termed patterns of unknown significance, type 1 (Table 4).

Type 2

Seven variants (VEGF 936C>T in group 3; APC A8822G, EDNRA, KCNJ9 8639A>G, LEP, OLR1, and TCP1 in group 4) could in principle be assigned to pattern II. However, unlike the pattern II functional variants, all of which involve a loop (either hairpin or internal) comprising at least four nucleotides (Table 4), these seven variants involved an internal loop comprising only 1–3 nucleotides (e.g. Supplementary Fig. S17). These changes were thus termed patterns of unknown significance, type 2 (Table 4).
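The loop-length criterion that separates pattern II from type 2 can be written down as a simple rule. The sketch below is a hypothetical illustration rather than the procedure actually used in this analysis: the function name and labels are invented for the example, and it assumes that the variant alters only a single-stranded loop whose length (in nucleotides) has already been read off a predicted secondary structure.

```python
# Minimum loop length observed for the pattern II functional variants
# (at least four nucleotides, as stated in the text).
MIN_FUNCTIONAL_LOOP_NT = 4


def classify_loop_only_change(loop_length_nt):
    """Label a variant whose only predicted structural effect lies within
    a loop, according to the length of that loop.

    Hypothetical helper: it covers only the pattern II versus type 2
    distinction, not patterns 0, I or III.
    """
    if loop_length_nt < 1:
        raise ValueError("loop length must be at least 1 nucleotide")
    if loop_length_nt >= MIN_FUNCTIONAL_LOOP_NT:
        return "pattern II"
    return "unknown significance, type 2"


if __name__ == "__main__":
    for n in (1, 3, 4, 6):
        print(n, classify_loop_only_change(n))
```

The threshold is held in a single named constant so that it can be revised should prospective data indicate a different minimal length for cis-acting elements.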

In the light of the above predictions, it would be interesting to investigate the functional consequences of these type 1 and 2 variants. Here it is perhaps worth reiterating that virtually all of these patterns of unknown significance were identified in group 4 variants; the only one (i.e. VEGF 936C>T) from group 3 is in LD with a 1451C>T polymorphism that results in a pattern II change involving a bulge loop comprising five nucleotides (ACACC>ACACU).

Identification of potential novel cis-regulatory elements

In addition to the known AU-, AG-, C- and AC-rich motifs, some putatively novel USS motifs including AGCCUG(C/U)AG (CDKN1A) and a GC-rich sequence (THPO) were identified during the course of this analysis (Table 1).

Several interesting examples

These examples in groups 3 and 4 are intended to stimulate discussion rather than to allow conclusions to be drawn.

CRP: CRP3 (pattern I) vs. CRP4 (pattern 0)

A haplotype containing the minor A allele of a G/A polymorphism (termed CRP4) in the 3′ UTR of the C-reactive protein (CRP) gene (MIM# 123260) has been reported to be associated with both reduced basal CRP expression and the development of systemic lupus erythematosus (Russell et al. 2004). However, secondary structure prediction (pattern 0; Supplementary Fig. S18) suggests that CRP4 is unlikely to be of functional significance. In this regard, CRP4 is in tight LD with another polymorphism in the 3′ UTR of the CRP gene, CRP3 (CRP3 is located 858 bp upstream of CRP4; Russell et al. 2004). CRP3 is identical to the +1444C>T polymorphism reported to influence in vivo CRP levels in two other studies (Brull et al. 2003; Kovacs et al. 2005). Thus, it is CRP3 that is probably of functional significance since it results in a pattern II secondary structure change in which the single-stranded AAACGG sequence in an internal loop is altered to AAAUGG (Supplementary Fig. S19).

GPX4

A common T/C polymorphism (allele frequencies 0.54/0.46) has been found in the 3′ UTR of the GPX4 gene (Villette et al. 2002). That individuals with different genotypes exhibited significant differences in the levels of lymphocyte 5-lipoxygenase suggests that this SNP may be of functional significance. As suggested by Supplementary Figs. S20 and S21, this common SNP probably affects the binding of trans-acting factors to the SECIS motif by modifying the three-dimensional structure of the GPX4 mRNA.

IGF2: does the ApaI polymorphism affect the specific cleavage of IGF2 mRNA?

As discussed earlier, specific cleavage of human IGF2 mRNA is largely determined by two flanking stem structures, one of which involves a long-range RNA interaction between two widely separated (∼2 kb) elements, I and II. It is perhaps pertinent to mention that the formation of this long-range stem structure is highly unusual and is facilitated firstly by the presence of a large tract (700–800 bp) of simple CA repeats and secondly by an extremely C-rich element II complementary to element I which is extremely G-rich (Supplementary Table S3).

An ApaI polymorphic site, initially detected at the IGF2 cDNA level by Xiang et al. (1988), was deduced to be a G>A substitution (Tadokoro et al. 1991) at position 820 (in accordance with GenBank accession number X07868.1; Supplementary Table S3) within the USS. This polymorphism has been reported to be associated with body weight (O’Dell et al. 1997; Gaunt et al. 2001; Gu et al. 2002; Roth et al. 2002; Schrager et al. 2004; Gomes et al. 2005) and to affect IGFII levels in vivo (O’Dell et al. 1997). Although the nucleotide at position 820 (located in the middle of a ∼450 base tract between the element II and the simple CA repeats) is not involved in the formation of the two flanking stem structures, the G/A polymorphism does change the local secondary structure of the IGF2 mRNA significantly (pattern I; Supplementary Fig. S22). It is possible that this secondary structural change somehow impacts on the accessibility of the special ECS within the IGF2 3′ UTR by modifying either or both of the two flanking stem structures.

IL12B: although the initial disease association of a 3′ UTR A/C SNP may be spurious, this variant may nevertheless still act as a disease modifier

The initial report of the positive association of an A/C SNP in the 3′ UTR of the IL12B gene (MIM# 161561) with type 1 diabetes (Morahan et al. 2001) is now generally regarded as a false positive (e.g. Dahlman et al. 2002). This notwithstanding, there are consistent and independent findings to suggest that this SNP does indeed have functional consequences; whilst allele C correlated with significantly increased secretion of IL12 protein after stimulation of human peripheral blood mononuclear cells (PBMCs) with Staphylococcus aureus plus recombinant human interferon-γ (Seegers et al. 2002), lipopolysaccharide (Yilmaz et al. 2005), or purified protein derivative (Yilmaz et al. 2005), the A allele correlated with increased IL12 p40 production in C3-binding glycoprotein-stimulated human PBMCs (Stanilova and Miteva 2005). However, given the fact that the A/C SNP is a very common polymorphism (minor C allele frequency of 20% in controls; Morahan et al. 2001), its clinical consequences, if there are any, are likely to be fairly minor. Interestingly, the authors who reported the initial positive association have recently reported that the C allele is associated with late onset type 1 diabetes: the C allele was found to be more common in patients diagnosed after the age of 16 than in controls or in patients diagnosed before the age of 16 (Windsor et al. 2004). Although this observation should still be regarded as preliminary until independently replicated, it may well provide a plausible molecular mechanism for mediating disease severity (as distinct from mediating disease susceptibility). In other words, it is possible that the IL12B A/C SNP may act as a disease modifier.

This postulate has been supported by several recent studies. Given the central role of T cells in viral control and clearance in general (reviewed by Bowen and Walker 2005; Klenerman and Hill 2005) and the pivotal role of IL12 in the generation of the Th1 response in particular, three groups have postulated that the IL12B A/C SNP could affect spontaneous and treatment-induced recovery from hepatitis C virus (HCV) infection. Their results [the A/A genotype was found to be associated with persistent infection (Yin et al. 2004; Houldsworth et al. 2005) whereas the C allele was associated with a more efficient response to antiviral combination therapy as a consequence of a reduced relapse rate (Mueller et al. 2004)] suggest that the C allele may act as a protective factor with respect to the outcome of HCV infection. However, the A/C SNP does not appear to be a susceptibility factor for HCV infection since its distribution did not significantly differ between patients and healthy controls (Mueller et al. 2004).

Taken together, the above observations strongly suggest that the IL12B A/C SNP is potentially of functional significance. This polymorphism causes a pattern III secondary structure change (Fig. 6).

OLR1: a 3′ UTR variant that is in linkage disequilibrium with splicing variants

The OLR1 gene (MIM# 602601) contains six exons and encodes oxidized low-density lipoprotein receptor 1. Whether a C>T SNP, located 188 bp 3′ to the translational termination codon in the OLR1 gene, is associated with atherosclerosis (Trabetti et al. 2006 and references therein) and/or Alzheimer’s disease (Shi et al. 2006 and references therein) remains controversial. This SNP is, however, in complete LD with another five intronic SNPs: IVS4+27G>C, IVS4-73C>T, IVS4-14A>G, IVS5-70A>G, and IVS5-27G>T (Mango et al. 2003). Using quantitative real-time PCR and a minigene approach, Mango et al. (2005) demonstrated that the intronic variants regulate the expression of an exon 5-lacking isoform (termed LOXIN) of the OLR1 gene, with the ‘non-risk haplotype’ being associated with increased expression of LOXIN at both the mRNA and protein levels. Further in vivo analysis suggested that increased expression of LOXIN protects cells from OLR1-induced apoptosis (Mango et al. 2005). Thus, the ‘risk haplotype’, irrespective of its potential medical significance, is biologically important owing to the presence of the multiple intronic variants. Interestingly, the OLR1 3′ UTR C>T SNP results in a pattern of unknown significance, type 2 (Supplementary Fig. S17).

PHB: a controversial cancer risk-associated polymorphism exhibits a pattern 0 secondary structural change

A single C>T transition at position 729 of the PHB 3′-UTR (nomenclature in accordance with Fig. 5 in Jupe et al. 1996) has been reported to be associated with an increased risk of breast cancer in North American women, especially in those aged ≤50 years who already had a first-degree relative with the disease (Jupe et al. 2001). In support of this association, stable clones of the breast cancer cell line MCF7 transfected with plasmids containing the PHB wild-type 3′-UTR (i.e. the C allele) manifested significant suppression of growth in cell proliferation assays, inhibition of colony formation in soft agar assays, and suppression of xenograft tumour growth when implanted on nude mice, as compared with those clones transfected with either empty vectors or plasmids containing the PHB variant 3′-UTR (i.e. the T allele; Manjeshwar et al. 2003). Based upon these and their own previous results, Manjeshwar et al. (2003) stated that “our studies showing the cell cycle inhibitory activity of the prohibitin 3′ UTR in a panel of immortalized cell lines certainly suggests that its loss of function may play a role in a number of cancers”. However, not only was the association of the T allele with an increased risk of breast cancer not confirmed in two subsequent studies (Spurdle et al. 2002; Campbell et al. 2003), but also an association of this allele with risk of ovarian cancer has been excluded (Spurdle et al. 2003). A recent study from the group that reported the initial disease association and who performed the relevant functional evaluation of the C>T variant has, however, suggested that the PHB and other genes might interact to influence breast cancer risk in a manner not entirely predictable from single gene effects (Aston et al. 2005). Finally, it is pertinent to note that the C>T polymorphism results in a pattern 0 secondary structural change (Supplementary Fig. S23).

Isolated examples

ADRA2C polymorphisms: exemplifying the diverse intragenic variability with haplotype-specific functional effects

A 12-bp in-frame deletion/insertion polymorphism of nucleotides 964–975 in the presynaptic α2C adrenergic receptor gene (ADRA2C; MIM# 104250), leading to the gain or loss of amino acids 322–325, constitutes a risk factor for heart failure (Small et al. 2004 and references therein). This polymorphism was, however, partitioned between nine haplotypes that displayed differential mRNA and protein expression profiles in a whole-gene transfection assay. Obviously, in a given del322-325-containing haplotype, other polymorphic variants including a 21-bp insertion/deletion polymorphism in the 3′-UTR may serve to amplify, attenuate, or even dominate the phenotypic expression previously attributed solely to the del322-325 variant (Small et al. 2004).

GPR44: haplotypes containing two closely linked common 3′ UTR polymorphisms

Two common polymorphisms, separated by only six nucleotides, occur in the 3′ UTR of the GPR44 gene (MIM# 604837) (Huang et al. 2004). Transcription pulsing experiments have shown that the 1544G-1651G haplotype confers a higher level of mRNA stability than the 1544C-1651A haplotype. However, it is as yet unclear whether this effect is conferred by a single variant or by the two variants acting in concert.

TNFRSF1B: haplotypes containing different combinations of three SNPs within a 28 bp 3′ UTR sequence tract

The TNFRSF1B (MIM# 191191) gene encodes the human TNFα receptor. Different combinations of three SNPs within a 28 bp sequence tract in the 3′ UTR of TNFRSF1B constitute five different haplotypes. These haplotypes display different mRNA stabilities in transient expression assays (Puga et al. 2005).

VDR: a functional 3′ UTR haplotype containing multiple variants

Polymorphisms of the vitamin D receptor gene (VDR; MIM# 601769) have been reported to be associated with increased fracture risk (Fang et al. 2005). Multiple polymorphic variants occur in the 3′ UTR and manifest as several haplotypes. When evaluated using reporter gene constructs containing the complete 3.2 kb 3′ UTR of VDR, the decay rate of haplotype 1-related mRNA was found to be 30% greater than that for haplotype 2-related mRNA (Fang et al. 2005). The individual contribution of each sequence variant is, however, unknown.

Concluding remarks

The idea that an RNA regulatory motif relies on a combination of primary and secondary structure is not new. Indeed, more and more studies have incorporated secondary structure prediction in order to identify regulatory elements and/or to understand the nature of disease-associated regulatory variants. However, the systematic evaluation of known naturally occurring functional and non-functional USS variants has not previously been attempted. In the present work, we have not only established a reliable and objective means to perform secondary structure prediction but, somewhat unexpectedly, have also obtained consistent patterns of secondary structural change that appear to allow the discrimination of functional USS variants from their non-functional counterparts. The resulting rules were then employed to predict the potential functionality of the other collated USS variants. This notwithstanding, the validity and reliability of the predicted secondary structural patterns 0–III will need to be tested prospectively as new 3′ UTR variants are identified and functionally analysed. Moreover, the predicted types 1 and 2 of unknown significance warrant further exploration. Nevertheless, had this type of analysis been performed by Winkler et al. (1998), they might well have been more cautious with respect to their interpretation regarding the biomedical consequences of the CXCL12 SDF-G801A polymorphism.