Biological context

RNA is a multifaceted biomolecule that plays important functional and regulatory roles in the cell. In the simplest case, RNA has a crucial role in the central dogma of molecular biology. Not only do messenger RNAs (mRNAs) serve as the template for protein synthesis in the cell, but the cellular machinery responsible for protein synthesis, the ribosome, is also largely composed of RNA molecules. Outside of their role in the central dogma, RNAs contribute to the overall health of the cell by controlling splicing (Papasaikas and Valcarcel 2016), translation efficiency (Rodnina 2016), and genomic stability (Theimer and Feigon 2006). More recently, RNA molecules have been used as the foundation for vaccines (Corbett et al. 2020; Pardi et al. 2018), investigated as therapeutics (Burnett and Rossi 2012; Damase et al. 2021; Winkle et al. 2021), and identified as potential therapeutic targets themselves (Howe et al. 2015; Meyer et al. 2020; Warner et al. 2018).

Despite the importance of RNAs in biology, structural elucidation of RNA molecules lags significantly behind that of their protein counterparts. As of May 2021, the Protein Data Bank (PDB) (Berman et al. 2000a, b) has 173,534 protein-containing structures and only 5,364 RNA-containing structures. Similarly, the Biological Magnetic Resonance Data Bank [BMRB (Ulrich et al. 2008)] contains 13,420 protein depositions and only 467 RNA depositions. This imbalance in our RNA structural knowledge has led to a similar disparity in our mechanistic understanding of the functions of these RNAs.

Chemical shifts are a fundamental indicator of molecular structure and are at the heart of all NMR studies (Ebrahimi et al. 2001; Fares et al. 2007; Ghose et al. 1994; Giessner-Prettre and Pullman 1987; Rossi and Harbison 2001; Sripakdeevong et al. 2014). However, assignment of chemical shifts remains a laborious and often rate-limiting step in the determination of biomolecular structure and characterization of biomolecular dynamics and ligand interactions. A number of groups are actively working to develop and improve methods to predict chemical shifts of nuclei in proteins (Han et al. 2011; Meiler 2003), DNA (Kwok and Lam 2013; Lam 2007; Lam et al. 2007; Ng and Lam 2015), and RNAs (Aeschbacher et al. 2013; Bahrami et al. 2012; Barton et al. 2013; Brown et al. 2015; Frank et al. 2013, 2014; Krahenbuhl et al. 2014; Wang et al. 2021). A robust prediction, coupled with a linked, interactive data analysis software (Marchant et al. 2019), can significantly reduce the time required for the assignment of chemical shifts. However, the accuracy of these algorithms requires a robust database of chemical shifts that represent a wide variety of both sequence and structural motifs. At the beginning of this study, we found only 244 entries at the BMRB for RNA molecules that were usable with our chemical shift analysis tools. Features which excluded some (27) BMRB entries from our analysis included the lack of an associated PDB file, presence of unnatural or modified nucleotides, or lack of relevant chemical shifts. The sparseness of this database limits the accuracy of the available tools.

In a canonical A-form helix, the neighboring base pairs have the most significant impact on the chemical shifts of the central base pair (Barton et al. 2013), forming 64 possible canonical base pair triplets or “core sequences”. We aim to identify the effect that bases flanking these core sequences have on the chemical shifts of the central base pair. Since the flanking base pairs have a relatively small effect on the chemical shifts of the central base pair as compared to the internal bases of the core sequence, we considered only nucleotide type (purine or pyrimidine) rather than nucleotide identity at the flanking positions. This is consistent with the attributes used in the current prediction algorithm used in NMRFx Analyst (Marchant et al. 2019). For the purposes of this study, we define a “group” as a core sequence with variable flanking bases. If each of these core sequences is flanked by a purine-pyrimidine base pair or a pyrimidine-purine base pair at the 5ʹ- and 3ʹ-ends, there are 256 possible combinations of 5 base pair sequences (Table S1). With these 256 sequences in hand, we examined the completeness of aromatic proton and carbon chemical shifts deposited in the BMRB and identified four “groups” of RNAs whose chemical shifts were incomplete (Fig. 1). We prepared 12 unique RNAs, three from each group, for chemical shift assignment (Fig. 2). These sequences were chosen from a number of missing sequences (see Methods, below) because they represented groups that were the most incomplete (i.e. three of the four group members had missing chemical shift data).
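The counting in the preceding paragraph can be reproduced with a short sketch. It assumes that "canonical" means the four oriented Watson-Crick pairs (A-U, U-A, G-C, C-G) and that each flanking pair is reduced to purine-pyrimidine (R-Y) or pyrimidine-purine (Y-R). The function and constant names are illustrative and are not taken from NMRFx Analyst.

```python
from itertools import product

# The four canonical Watson-Crick base pairs, written as (strand base, partner base).
CANONICAL_PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]
# Flanking pairs are described by nucleotide type only: R = purine, Y = pyrimidine.
FLANKING_PAIRS = [("R", "Y"), ("Y", "R")]


def core_sequences():
    """All triplets of canonical base pairs: 4**3 = 64 core sequences."""
    return list(product(CANONICAL_PAIRS, repeat=3))


def five_pair_sequences():
    """Each core flanked on both sides by an R-Y or Y-R pair: 64 * 2 * 2 = 256."""
    return [
        (five_prime, *core, three_prime)
        for core in core_sequences()
        for five_prime in FLANKING_PAIRS
        for three_prime in FLANKING_PAIRS
    ]


def group_of(sequence):
    """A group is identified by the core triplet shared by its four flanking variants."""
    return sequence[1:4]


if __name__ == "__main__":
    print(len(core_sequences()), len(five_pair_sequences()))
```

Under these assumptions each core sequence defines a group of exactly four five-base-pair sequences, which matches the four-member groups described above.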

Fig. 1

Overview of RNA sequences with missing assignments. Four groups of RNAs were identified: A Group 1, RNAs 5–8; B Group 2, RNAs 21–24; C Group 3, RNAs 73–76; D Group 4, RNAs 89–92. Red nucleotide indicates the nucleotide with missing chemical shifts. The common triple of base pairs for each group is highlighted in a colored box. Sequence variations at flanking nucleotides were examined. The occurrence of aromatic proton and carbon chemical shifts in the BMRB for each RNA is indicated. Bold font highlights the RNAs from each group with missing chemical shifts, which are the RNAs further examined in this study. Flanking sequences for each RNA are shown. R, purine; Y, pyrimidine

Fig. 2

RNA constructs used in this study. Red nucleotides indicate the variable sequence relative to a single example for each group. The common triple of base pairs for each group is highlighted in a colored box and the central nucleotide is in bold font

The RNAs examined in this study were designed to fill specific gaps in the BMRB chemical shift database. Additionally, these RNAs represent helical regions from a diverse set of biological RNAs including ribosomes, tRNAs, riboswitches, internal ribosome entry sites, and viral RNAs (Table S2) as identified via the RNA fragments search engine and database [RNA FRABASE, (Popenda et al. 2008, 2010)]. The RNA FRABASE searches for RNA fragments based on known structures, therefore there are likely many more biologically relevant RNAs that are represented by these RNA oligonucleotides. Our efforts focused on the assignment of aromatic and anomeric protons, as these are the chemical shifts that are currently incorporated into the NMRFx pipeline. Furthermore, we report imino proton assignments which inform on secondary structure and continue to build on the database and prediction tool reported earlier this year (Wang et al. 2021). We expect that these additional chemical shift assignments, which are a step towards a more complete chemical shift database, will benefit the RNA NMR community.

Materials and Methods

Identification of RNA sequences that are underrepresented in the BMRB

The identification of RNA sequences that are underrepresented in the BMRB was done using the tools previously developed for training models for the prediction of RNA chemical shifts (Brown et al. 2015). We used these tools to analyze NMR-STAR files containing RNA data that we downloaded from the BMRB. The downloaded NMR-STAR files were analyzed with the scripts developed for the prediction model training. This process automatically identifies PDB files from which secondary structures are generated and does various validation checks. After this process was complete, 244 files were available for the subsequent analysis. This analysis generated a table of chemical shifts and attributes as used in the prediction software. The tabular data was searched for all 256 five-residue sequences described above and a new table generated containing a line for each of the 256 sequences. Each line contains the number of measured chemical shifts for each of the aromatic proton and carbon atoms for the corresponding sequence. Of these 256 sequences, 182 of them have at least one measurement of all the aromatic proton and carbons (Table S1). The remaining 74 entries were missing the measurement of the chemical shift of at least one atom.
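The tallying step described above can be illustrated as follows. This is a minimal sketch and not the scripts used in this work: the input layout (one `(sequence_id, atom_name)` pair per measured shift) and the per-sequence list of required atoms are assumptions made for the example.

```python
from collections import Counter


def count_measurements(rows):
    """Count measured shifts per (sequence_id, atom_name) pair.

    `rows` is a hypothetical flat listing with one entry per measured shift.
    """
    return Counter(rows)


def missing_atoms(required_atoms, counts, sequence_id):
    """Return the required atoms that have no measurement for this sequence."""
    return [atom for atom in required_atoms if counts[(sequence_id, atom)] == 0]


def split_by_completeness(sequence_atoms, counts):
    """Separate sequences with at least one shift for every required atom from the rest.

    `sequence_atoms` maps each sequence id to the aromatic atoms expected for it.
    """
    complete, incomplete = [], []
    for sequence_id, atoms in sequence_atoms.items():
        if missing_atoms(atoms, counts, sequence_id):
            incomplete.append(sequence_id)
        else:
            complete.append(sequence_id)
    return complete, incomplete
```

Applied to all 256 sequences, a split of this kind would yield the 182 complete and 74 incomplete entries reported above.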

Construct design and template preparation

DNA oligonucleotides corresponding to the 12 RNA constructs (Table 1) were ordered (Integrated DNA Technologies) with 2ʹ-O-methoxy modifications at the two 5ʹ-most positions to reduce non-templated transcription (Kao et al. 1999). To generate the templates for transcription, DNA oligonucleotides (Table 1) were annealed with an oligo corresponding to the T7 promoter sequence (5ʹ-TAATACGACTCACTATA-3ʹ). Briefly, 40 µL of each DNA oligonucleotide (200 µM) was individually mixed with 20 µL of the T7 promoter sequence oligonucleotide (600 µM). These samples were placed in boiling water for 3 min and left in the water bath to slowly cool to room temperature over at least 2 h. The annealed template was then diluted with 940 µL H2O to produce the partially double-stranded DNA templates for transcription.
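The concentrations implied by this annealing and dilution scheme follow from simple arithmetic, sketched below. The default values are the volumes and stock concentrations stated above; the function name and the returned keys are illustrative.

```python
def annealing_mix(template_ul=40.0, template_um=200.0,
                  promoter_ul=20.0, promoter_um=600.0,
                  water_ul=940.0):
    """Final strand concentrations after annealing and dilution.

    Concentration in µM multiplied by volume in µL gives an amount in pmol,
    so all amounts are tracked in pmol.
    """
    template_pmol = template_um * template_ul
    promoter_pmol = promoter_um * promoter_ul
    final_ul = template_ul + promoter_ul + water_ul
    return {
        "final_volume_ul": final_ul,
        "template_um": template_pmol / final_ul,
        "promoter_um": promoter_pmol / final_ul,
        "promoter_to_template": promoter_pmol / template_pmol,
    }


if __name__ == "__main__":
    for key, value in annealing_mix().items():
        print(f"{key}: {value}")
```

With the stated volumes, the diluted stock is 1 mL containing 8 µM template strand and 12 µM promoter strand, a 1.5-fold molar excess of the promoter oligonucleotide.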

Table 1 RNA constructs

RNA preparation

RNA transcription conditions were optimized prior to large scale preparation. RNAs were prepared by in vitro transcription in transcription buffer (40 mM Tris-base, 5 mM DTT, 1 mM Spermidine, 0.01% Triton-X, pH 8.5) with addition of 2–6 mM unlabeled NTPs, 10–20 mM MgCl2, 8–13 pmol annealed DNA template, 0.2 U/mL yeast inorganic pyrophosphatase (New England Biolabs) (Cunningham and Ofengand 1990), ~ 15 µM T7 RNA polymerase, and 10–20% (vol/vol) DMSO (Helmling et al. 2015). Reactions were incubated at 37 °C for 3–4 h and then quenched using a solution of 7 M urea and 250 mM EDTA (pH 8.5). The quenched transcription mixture was loaded onto 16% preparative-scale denaturing gels for RNA purification. The RNA band was identified by UV shadowing, excised, and electroeluted from the gel for ~ 24 h using a membrane trap elution system (Elutrap, Whatman). RNAs were spin concentrated, washed with 2 M high-purity sodium chloride, and exchanged into water using Amicon Centrifugal Filter Units (Millipore, Sigma). RNA purity was checked by running RNA on a 16% analytical denaturing gel.

NMR spectroscopy

RNA samples (300–400 µM in 300 µL) were prepared in 50 mM potassium phosphate buffer, pH 7.4. Samples were lyophilized and rehydrated in an equal volume of D2O (99.8%, Cambridge Isotope Laboratories, Inc.) or 10% D2O/90% H2O. 2D 1H–1H NOESY (in both D2O and H2O), 1H–1H TOCSY, and 1H–13C HMQC spectra were recorded for each RNA at 25 °C (see Table 2 for experimental parameters). NMR spectra were collected on a 600 MHz Bruker AVANCE NEO spectrometer equipped with a 5 mm TCI cryogenic probe (University of Michigan BioNMR Core). NMR data were processed with NMRFx (Norris et al. 2016) and NMRPipe (Delaglio et al. 1995) and analyzed with NMRViewJ (Johnson and Blevins 1994). 1H chemical shifts were referenced to water and 13C chemical shifts were indirectly referenced from the 1H chemical shift (Wishart et al. 1995).

Table 2 Experimental parameters for NMR data acquisition

Assignments of helical regions

For all 12 RNA oligonucleotides, the nonexchangeable protons were unambiguously assigned using 2D 1H–1H NOESY (Hwang and Shaka 1995), 2D 1H–1H TOCSY (Bax and Davis 1985), and 1H–13C HMQC (Bax et al. 1983) experiments. The 2D 1H–1H TOCSY spectrum yields strong H5-H6 cross-peaks, reporting on the number of pyrimidines (cytosine and uracil) in an RNA molecule, and was used to identify the H5-H6 correlations in the corresponding 1H–1H NOESY spectrum. The 1H–1H NOESY spectrum was used to make assignments of sequential nucleotides in A-helical regions. Near the diagonal, sequential aromatic-aromatic proton correlations were observable for many regions of each RNA. In the aromatic-anomeric region, sequential assignments were possible by following the NOE pattern for A-helical regions. Briefly, the H6/H8 of each residue (i) has an NOE to its own H1ʹ and to the H1ʹ of the preceding nucleotide (i − 1). Assignments could be confirmed by examining additional NOEs; for example, the aromatic H6/H8 proton of a residue (i) has a weak but detectable NOE to the H5 of a following (i + 1) pyrimidine. Additionally, the position of the adenosine C2 proton (H2) in the interior of the helix provides rich NOE data informing on both intra-strand (sequential) and cross-strand connectivities. When part of an A-form helix, the adenosine H2 (i) has NOEs to its own H1ʹ (i), to the H1ʹ of its 3ʹ residue (i + 1), and an inter-strand cross-peak to H1ʹ of the residue (j + 1) 3ʹ of the base to which it is paired (j). The 2D 1H–13C HMQC was used to help distinguish C2-H2 (13C δ ~ 152 ppm) from C6-H6 and C8-H8 resonances (13C δ ~ 139 ppm). Assigned 1H–1H NOESY (aromatic) and 1H–13C HMQC spectra for the 12 RNAs are presented in Figs. 3, 4, 5, 6, 7 and Supplemental Figs. S1–S9.
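The aromatic-anomeric NOE walk described above can be written out as a list of expected cross-peaks. The sketch below is illustrative only: the function names and the atom labelling (for example `G2.H8`) are assumptions, the example sequence is hypothetical and is not one of the constructs in Table 1, and the pattern holds only for A-helical stretches, not for loop residues.

```python
def aromatic_proton(base):
    """Purines (A, G) carry H8; pyrimidines (C, U) carry H6."""
    return "H8" if base in "AG" else "H6"


def noe_walk(sequence, first_residue=1):
    """Expected H6/H8 to H1' cross-peaks for a single A-helical strand.

    Each residue i gives an intra-residue peak to its own H1' and a
    sequential peak to the H1' of the preceding residue (i - 1).
    """
    peaks = []
    for index, base in enumerate(sequence):
        number = first_residue + index
        aromatic = f"{base}{number}.{aromatic_proton(base)}"
        peaks.append((aromatic, f"{base}{number}.H1'"))
        if index > 0:
            previous = sequence[index - 1]
            peaks.append((aromatic, f"{previous}{number - 1}.H1'"))
    return peaks


if __name__ == "__main__":
    # Hypothetical three-residue stretch, used only to show the pattern.
    for aromatic, anomeric in noe_walk("GGC"):
        print(aromatic, "->", anomeric)
```

For a strand of n residues this pattern predicts 2n − 1 aromatic-anomeric cross-peaks, which is a convenient sanity check when tracing the walk in a crowded spectrum.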

Fig. 3

RNA7 chemical shift assignments. A 1H–1H NOESY, B 1H–13C HMQC, C Summary of the sequence, secondary structure, and assignment validation for RNA7. The secondary structure is shown in Vienna format. NMRViewJ assignment summary to validate proton (H6/H8, H5/H2, H1ʹ, H2ʹ, H3ʹ) and carbon (C6/C8, C2) assignments (bottom). Assigned chemical shifts for specific atoms are indicated with open and filled circles. The vertical offset of the circles indicates the deviation from the predicted values for that atom. Filled circles indicate that there are chemical shifts for atoms with the same set of attributes in the BMRB. Open circles indicate atoms that have a prediction, but for which no exact matches of the attributes are available in the BMRB. Grey boxes represent atoms that are not present in a given base

Fig. 4

RNA23 chemical shift assignments. A 1H–1H NOESY, B 1H–13C HMQC, C summary of the sequence, secondary structure, and assignment validation for RNA23. The secondary structure is shown in Vienna format. NMRViewJ assignment summary to validate proton (H6/H8, H5/H2, H1ʹ, H2ʹ, H3ʹ) and carbon (C6/C8, C2) assignments (bottom). Assigned chemical shifts for specific atoms are indicated with open and filled circles. The vertical offset of the circles indicates the deviation from the predicted values for that atom. Filled circles indicate that there are chemical shifts for atoms with the same set of attributes in the BMRB. Open circles indicate atoms that have a prediction, but for which no exact matches of the attributes are available in the BMRB. Grey boxes represent atoms that are not present in a given base

Fig. 5

RNA73 chemical shift assignments. A 1H–1H NOESY, B 1H–13C HMQC, C summary of the sequence, secondary structure, and assignment validation for RNA73. The secondary structure is shown in Vienna format. NMRViewJ assignment summary to validate proton (H6/H8, H5/H2, H1ʹ, H2ʹ, H3ʹ) and carbon (C6/C8, C2) assignments (bottom). Assigned chemical shifts for specific atoms are indicated with open and filled circles. The vertical offset of the circles indicates the deviation from the predicted values for that atom. Filled circles indicate that there are chemical shifts for atoms with the same set of attributes in the BMRB. Open circles indicate atoms that have a prediction, but for which no exact matches of the attributes are available in the BMRB. Grey boxes represent atoms that are not present in a given base

Fig. 6

RNA89 chemical shift assignments. A 1H–1H NOESY, B 1H–13C HMQC, C summary of the sequence, secondary structure, and assignment validation for RNA89. The secondary structure is shown in Vienna format. NMRViewJ assignment summary to validate proton (H6/H8, H5/H2, H1ʹ, H2ʹ, H3ʹ) and carbon (C6/C8, C2) assignments (bottom). Assigned chemical shifts for specific atoms are indicated with open and filled circles. The vertical offset of the circles indicates the deviation from the predicted values for that atom. Filled circles indicate that there are chemical shifts for atoms with the same set of attributes in the BMRB. Open circles indicate atoms that have a prediction, but for which no exact matches of the attributes are available in the BMRB. Grey boxes represent atoms that are not present in a given base

Fig. 7

Assigned imino proton NOESY spectra for A RNA7, B RNA23, C RNA73, and D RNA89. Secondary structures for each RNA are inset

Assignments of imino protons

The exchangeable imino protons were unambiguously assigned using the 2D 1H–1H NOESY recorded in 10% D2O/90% H2O, and these assignments confirm the secondary structure of the RNAs. A:U base pairs are easily identifiable via a strong NOE cross peak between the uracil H3 imino proton and the adenosine H2 proton, which serves as a good starting point for the assignment of the imino proton region of the NOESY experiment. For G:C base pairs, the guanosine H1 imino proton shows a strong NOE correlation to the amino protons (H41 and H42) of the base-pairing cytosine. For this reason, and because of their strong NOE to the cytosine H5, the amino protons of cytosines involved in base pairing are easier to identify than the guanosine and adenosine amino protons. The imino-imino NOEs can be traced through each RNA secondary structure (Fig. 7 and Supplemental Fig. S9).

GAGA tetraloop

Hairpin or stem-loop structures are pervasive in RNAs. RNA hairpins with a four-nucleotide loop are known as tetraloops, and this is the most common loop size (Antao and Tinoco 1992). All RNA constructs were designed to contain a GAGA tetraloop with a U-A closing base pair (Fig. 2). The GAGA tetraloop was chosen due to its structural stability (Dale et al. 2000; Sheehy et al. 2010) and characteristic signals in the 1H–1H NOESY spectrum (Fig. 8). While GAGA and, more generally, GNRA tetraloops (where N represents any nucleotide and R represents a purine) are commonly closed with C-G base pairs (Brown et al. 2020; Legault et al. 1998; Pham et al. 2018; Vallurupalli and Moore 2003), we chose to close our tetraloops with U-A base pairs to further broaden the chemical shift database. Currently, there are only two depositions in the BMRB with a GAGA tetraloop closed with a U-A base pair (D'Souza et al. 2004; Imai et al. 2016) and only two GNRA tetraloops closed with a U-A base pair (Cornilescu et al. 2016; D'Souza et al. 2004). The NOE walk of the GAGA tetraloop presents a unique pattern which was helpful to both confirm proper folding of the RNA and facilitate resonance assignments. The three-dimensional structure of the nucleotides in the GAGA tetraloop causes a dramatic change in the H1ʹ frequency of the nucleotide following the loop (A14 in these constructs), out of the typical anomeric proton chemical shift range and to a smaller (upfield) chemical shift (Jucker et al. 1996).

Fig. 8

1H–1H NOESY spectrum of RNA24, highlighting the chemical shift assignments for the GAGA tetraloop. Tetraloop resonances are connected with green lines, resonances from flanking sequences are connected with purple lines. Dashed lines are used in crowded regions to indicate resonance assignment

Assignments and data deposition

In this work, we have unambiguously assigned all 330 aromatic C6-H6, C8-H8, and C2-H2 correlations. For the 66 adenosine residues, we have assigned 100% of the aromatic and anomeric protons (A-H2, A-H8, and A-H1ʹ). We were able to assign 86% of the adenosine H2ʹ and 74% of the adenosine H3ʹ protons (57/66 and 49/66, respectively). For the 90 guanosine residues, we have assigned 100% of the aromatic and anomeric protons (G-H8 and G-H1ʹ). We were able to assign 89% of the guanosine H2ʹ and 86% of the guanosine H3ʹ protons (80/90 and 77/90, respectively). For the 66 cytosine residues, we have assigned 100% of the aromatic and anomeric protons (C-H5, C-H6, and C-H1ʹ). We were able to assign 89% of the cytosine H2ʹ and 77% of the cytosine H3ʹ protons (59/66 and 51/66, respectively). Finally, for the 42 uracil residues, we have assigned 100% of the aromatic and anomeric protons (U-H5, U-H6, and U-H1ʹ). We were able to assign 90% of the uracil H2ʹ and 90% of the uracil H3ʹ protons (38/42 and 38/42, respectively). We assigned 100% of the imino protons (66 base-paired guanosine G-H1 and 42 base-paired uracil U-H3). Additionally, for the 66 base-paired cytosines, we assigned 92.4% of the amino C-H41 and C-H42 protons.
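The totals and percentages above can be cross-checked from the residue counts. The sketch below uses only numbers stated in this section; the dictionary layout and function names are illustrative.

```python
# Residue counts summed over the 12 RNAs, as stated in the text.
RESIDUES = {"A": 66, "G": 90, "C": 66, "U": 42}

# Assigned H2' and H3' protons per nucleotide type, as stated in the text.
ASSIGNED = {
    "A": {"H2'": 57, "H3'": 49},
    "G": {"H2'": 80, "H3'": 77},
    "C": {"H2'": 59, "H3'": 51},
    "U": {"H2'": 38, "H3'": 38},
}


def aromatic_correlations(residues):
    """C6-H6 for every pyrimidine, C8-H8 for every purine, C2-H2 for every adenosine."""
    pyrimidines = residues["C"] + residues["U"]
    purines = residues["A"] + residues["G"]
    return pyrimidines + purines + residues["A"]


def percent_assigned(assigned, residues):
    """Percentage of assigned protons per nucleotide type, rounded to an integer."""
    return {
        base: {atom: round(100 * count / residues[base]) for atom, count in atoms.items()}
        for base, atoms in assigned.items()
    }


if __name__ == "__main__":
    print(sum(RESIDUES.values()), aromatic_correlations(RESIDUES))
    print(percent_assigned(ASSIGNED, RESIDUES))
```

The counts sum to 264 residues and give 108 C6-H6, 156 C8-H8, and 66 C2-H2 correlations, for the stated total of 330.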

Assigned chemical shifts along with raw NMR data have been deposited in the BMRB under the following entry numbers: RNA5, 50933; RNA7, 50932; RNA8, 50931; RNA21, 50930; RNA23, 50929; RNA24, 50928; RNA73, 50927; RNA74, 50926; RNA75, 50925; RNA89, 50924; RNA90, 50923; RNA91, 50922. The new data have also been used to update the training of chemical shift predictions in NMRFx Analyst (Marchant et al. 2019). The more complete database will immediately allow for better quality predictions in the use of the molecular network assignment tool integrated in NMRFx Analyst (Marchant et al. 2019). Similarly, these additional data will aid in machine learning approaches that predict RNA secondary structure (Zhang and Frank 2020) and other RNA structural features from the assigned chemical shift data (Zhang et al. 2021).