Introduction

Now is an exciting time to be a marine scientist involved with estuarine-coastal systems. Whether the focus be viruses in a water sample, diversity of methanotrophs in a sediment sample, or characterizing a de novo genome of an organism, high-throughput sequencing (HTS) has fundamentally changed the way questions in estuarine-coastal science are approached. One of the most common application of HTS is metagenomics. Metagenomics is “the application of modern genomic techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species” (Chen and Pachter 2005). Metagenomics has existed since the late 1990s (Handelsman et al. 1998) and has exploded as a field since the availability of long read, high-throughput sequencing offered by the Roche/454 Life Sciences Genome Sequencer (Margulies et al. 2005). Beginning in 2006, a variety of papers surveying 16S rRNA microbial diversity (Sogin et al. 2006), marine viromes (Angly et al. 2006), and environmental genetics of microbes (Edwards et al. 2006) began to appear. With the advent of a true random sampling of genomes (i.e., “shotgun-approach” to sequencing) and new protocols (Frias-Lopez et al. 2008; Pernthaler et al. 2008), estuarine science and the study of coastal environments is experiencing a renaissance via genomics. By definition, metagenomics is the study of the genetic material from an environmental sample. While much of metagenomics focuses on bacteria (especially the 16S rRNA gene), the field is expanding rapidly to encompass the entire spectrum of organisms in an environmental sample, including bacteria, archaea, viruses, small eukaryotes, plasmids, and short RNAs.

HTS provides researchers and scientists with a revolutionary tool for quickly and affordably obtaining DNA sequence. The cost per unit of DNA sequence (i.e., base pair) obtained with HTS compared to traditional Sanger sequencing is dramatically more cost and time effective. The initial human genome was sequenced by an army of labs filled with ABI3730 Sanger sequencing instruments. This approach took 13 years and $3 billion US to go from sample collection to publication of the completed human genome (Venter et al. 2001). Today, a human genome can be sequenced de novo in 2 months for $1 million US (Wheeler et al. 2008). To put this in broader terms, a Roche/454 Life Science genome sequencer can produce the same sequence data at 10% of the cost and in 0.9% of the time that a single ABI3730 48 capillary can.

As of December 2008, there are three commercially available HTS platforms: the Applied Biosystems SOLiD, Illumina/Solexa 1G, and the Roche/454 Life Sciences Genome Sequencer. While the SOLiD and 1G systems currently use short reads (50 and 75 bp, respectively), the Roche/454 Life Sciences platform with the FLX chemistry produces average reads of 250 bp. As metagenomics is a random sampling of genetic information in a discrete sample (usually DNA extracted from a water, soil, or other environmental sample), short reads (i.e., less than 100 bp) make it practically impossible to be able to glean enough information to characterize DNA segments to either functional groups (i.e., Clusters of Orthologous Groups of proteins) or species (Krause et al. 2008; though see Wommack et al. 2008). A recent review by Mardis (2008) provides a point-by-point comparison of the applications as well as the pros/cons for each of these three platforms. In this paper, I will focus on the Roche/454 Life Science Genome Sequencing platform, as it is the most widely used HTS platform for metagenomics due to the relatively long read lengths.

The Roche/454 Life Sciences HTS process begins with 3–5 μg of any double-stranded starting material whether it be from a soil, water, plant, animal, or environmental sample (Fig. 1a, i). While high-molecular weight DNA is preferred (1.8 OD, 1.5 kb pieces or larger), fragmented DNA from museum collections or mummified samples is acceptable and has been used to sequence ancient genomes (Miller et al. 2008). Following an accurate quantification of the DNA concentration by a fluorescence method (i.e., PicoGreen), the DNA is fractionated into a size-range conducive for emulsion PCR (emPCR) and subsequent pyrosequencing.

Fig. 1
figure 1

Schematic portrayal of the Roche/454 Life Science work flow. DNA is extracted from environment/organism of interest and converted into a library (a), which is then used as a template for the emulsion PCR (b), which clonally amplifies the DNA. Clonally amplified DNA containing beads are then loaded on a PicoTiter plate (c), which is processed for the sequencing run. Raw data images from the pyrosequencing reactions are finally converted via on-instrument software to nucleotide sequence data

The power behind pyrosequencing lies in the combination of emulsion PCR and pyrosequencing. The ability to perform a very large number of independent, clonally amplified (i.e., PCR reactions) beads in a single tube using a mixture of oil and water to create emulsions provides a way of physically separating PCR reactions on a very small spatial scale. Emulsion PCR (emPCR) is a modification of standard PCR where micelles are created wherein each micelle contains a single oligo-coated bead, template DNA, and PCR reagents/enzymes. For the emPCR amplification to be effective, DNA fragments must be no larger than 500–600 bp (Roche protocol and Jones per obs). After preparation of the sample following established protocols (below), clonally amplified beads are sequenced using pyrosequencing, which involves the production of light via a luciferase-catalyzed reaction when a complementary nucleotide is incorporated during the sequencing by synthesis. Though fundamentally different from tradional dye-labeled sequencing approaches, the data and handling are the same for pyrosequencing.

Typically, DNA for Roche/454 pyrosequencing is sheared (=nebulized) using compressed nitrogen gas that forces the sample through a small opening in a nebulizer and randomly shears the DNA. Next, the nebulized DNA is checked on a Bioanalyzer DNAChip to ensure the size-fractionated DNA is suitable (ideally between 300 and 800 bp with a median of 500 bp). If the nebulized sample is acceptable, the ends of the DNA fragments are blunt-end-polished with T4 PNK and Roche library adaptors A and B are added via standard blunt-end ligation (Fig. 1a, iia). The complement of the Roche A adaptor is used as the sequencing primer (see Fig. 2), while the complement of the Roche B adaptor is on the Sephardose bead used in the emPCR reactions (Fig. 1b, i). One option at this point is to ligate adaptors that have barcodes (aka molecular IDs or MIDs) in their sequence (also see Fig. 2). The barcodes allow multiple discrete samples to be pooled together following library preparation and sequenced together in a pool with other like-treated samples for pyrosequencing. After the pyrosequencing run is complete and the DNA sequence data are collected, the software is then told to detect the barcodes in order to bin the reads and data accordingly.

Fig. 2
figure 2

Schematic of a Roche/454 single-stranded library. Adaptors “A” and “B” are blunt-end-ligated to the sample DNA fragment. The key (TCAG for sample DNA or ATCG for control beads) and the barcode are part of the larger Roche/454 A adaptor that is added during the blunt-end ligation. The sequencing primer anneals to the portion of the A adaptor immediately prior to the key, thus making the first base of the key the first nucleotide sequenced. The B adaptor is complementary to the oligo on the Sephardose bead

At this point in the process, the genomic sample is now fully converted into a single-stranded DNA library (sstDNA) that is compatible with the emPCR and Roche/454 pyrosequencing chemistry. After quantifying the amount of single-stranded DNA (a typical sstDNA library is 10–20 ng/μl) and estimating the size of the library using a Bioanalyzer RNA chip (Fig. 1b, iii), the molarity (number of DNA molecules per microliters) of the sample can be estimated. As each emPCR micelle is an independent clonal amplification of the template DNA inside (Fig. 1b), hundreds of thousands of discrete PCR products can be generated in a single 200 μl PCR tube.

In theory, 1 copy of DNA per bead is the preferred ratio. Because of the Poisson distribution of DNA at low concentration encountering a Sephardose bead in solution, a typical estimate of the DNA to bead ratio is done by taking the molarity and performing what’s referred to as a “titration”. The concentration of DNA to emPCR beads that will maximize data output is estimated by performing a series of test emPCRs where different amounts of DNA are added to emPCR beads (i.e., “titration”). In a typical sample library titration, four conditions are chosen (see below) to bracket the idealized concentration of 1 copy of DNA/bead.

As mentioned earlier, any double-stranded DNA can be sequenced with the Roche/454 Life Sciences genome sequencer. In addition to being double-stranded, PCR products also have two useful characteristics that we can use to our advantage with HTS. The first is that one can order PCR primers of interest with the Roche A and B adaptors as part of the sequence (Fig. 1a, iib). This principle is analogous to the M13-tailed primers for dye-terminator sequencing. During the PCR reaction, the Roche adaptors are being created as part of the region of interest being amplified and thus create the equivalent of the genomic sstDNA library. The second benefit of PCR products is that they are often easy to quantify for concentration and size by standard gel electrophoresis or Bioanalyzer DNAchip, thus making the calculation of molarity relatively simple. As such, the need for a titration step is eliminated, saving time and money.

While one DNA copy per bead for genomic material is the theoretical recommendation, numbers can range from 0.125 up to 32 copies in our experience. This variation seems dependent on the accurate estimation of the size and concentration of the sample library. For example, the sstDNA size may range from 300 to 1,200 bp with a median of 500 bp. However, since fragments larger than 600 bp do not amplify well in emPCR, the Bioanalyzer concentration of the sstDNA can dramatically overestimate the DNA to bead ratio.

Each individual library to be sequenced must be titrated to accurately calculate the DNA concentration for the emPCR. If too much DNA is added, then too many of the reads will fail during the run because of the presence of “mixed” signals (analogous to having two separate plasmid clones in one Sanger sequencing reaction). If too little DNA is added, not enough data will be collected, and the advantage of pyrosequencing is lost.

To determine the ideal amount of library to add to the beads, we set up small tests with two emulsion reactions per treatment (usually 0.5, 2, 4, and 16 copies/bead). Following the emulsion step, beads are enriched for those containing DNA, and the DNA negative beads are separated and counted on a coulter counter. The percentage of positive beads is then calculated by dividing the number of DNA positive beads by the sum of the DNA positive and the DNA negative beads. At our facility, we find that if an emulsion produces 12–15% DNA positive beads, this particular condition will produce the optimum amount of material for a Roche/454 pyrosequencing run while minimizing beads with no DNA and beads with more than one copy of DNA. If the beads are too positive (i.e., over 20%), the condition contains too much DNA per bead and will result in a high number of mixed signal sequences, which will be discarded by the software during the sequencing run.

Once the appropriate amount of beads are obtained from the emPCR for performing a run, the DNA positive beads are enriched and loaded onto a picotiter plate (PTP, Fig. 1b, iii) along with enzyme beads (containing the necessary enzymes for catalyzing the light production) and packing beads that help to hold the DNA bead in place (Fig. 1c, i). Since each PTP well is 40 μm in diameter and each DNA bead is 33 μm in diameter, each well will contain at most one DNA bead. At this point, the instrument run begins (see Fig. 1c, ii). Reagents are “flowed” over the PTP plate in the order of T, A, C, and G (=one flow). Each of the four nucleotides in a flow is recorded by a high-resolution CCD camera that is flush with the PTP. The onboard instrument software records each well for location (relative X, Y), intensity (two As are roughly twice as bright as one A), and key (TCAG for sample DNA and ATCG for control DNA; Fig. 1c, iii). A FLX LR70 run consists of 100 flows with each flow producing roughly 2.5 nucleotides for an average of 250 nucleotides for each of the 400,000 reads. Further details on the principles and chemistry of pyrosequencing can be found in Ronaghi et al. (1998).

Comparison of HTS with Sanger Sequencing

The cost/time savings benefit of HTS versus Sanger sequencing has already been illustrated, but how do the two chemistries compare in terms of coverage, accuracy, and error rate? Because there is no cloning step where the sample DNA is ligated into Escherichia coli plasmids, there is no selective step where pieces of DNA may be “missed” because of sequence composition, GC content, or secondary structure. As such, the shotgun sequencing approach with pyrosequencing is more inclusive of typically hard-to-sequence regions such as AT-rich regions.

Accuracy of HTS with the Roche/454 Life Sciences chemistry is touted to be 0.5%/bp, while general consensus with Sanger sequencing is that the error rate is ∼0.7%/bp. One of the key innovations of Roche/454 Life Sciences HTS is that for each run, a set of control beads are loaded along with the sample beads. With the current FLX chemistry, there are six control bead types that vary in composition (AT), homopolymers (a continuous stretch of a single nucleotide type), and length (150–250 bp). As the sequence of each of the six bead types are known and are sorted by the software as control beads (due to the ATGC rather than TCAG key at the beginning of the read), the accuracy of the individual run error rate can be easily estimated. On a standard LR70 with 420,000 sample reads, there are typically 15,000–30,000 control reads present. On a typical LR70 run, we see ∼0.3–0.8%/bp error rates.

Considerations

With the obvious benefits of HTS for cost and time, what are some drawbacks or limitations that one should be aware of prior to designing a HTS experiment for a metagenomics project?

Accurate Quantification of Starting Material

One of the biggest drawbacks with current HTS technology is the amount of starting material required. According to the Roche/454 Life Science library protocol, 3–5 μg of starting material is required. The reasons for this relatively large amount of material (which is often not feasible for environmental samples) is that the purification steps for the Roche/454 library preparation uses Qiagen columns (where up to 50% of material can be lost at each of the three Qiagen steps in the library preparation protocol) and the fact that the final check of the single-stranded library with the ligated adaptors must be quantified (typically by PicoGreen or an Agilent Bioanalyzer 2100) in order to estimate the ratio of DNA copies to beads for emulsion PCR (see above).

One of the most critical aspects of Roche/454 Life Sciences pyrosequencing protocol is accurate quantification of the sstDNA. Meyer et al. (2008b) have developed a protocol that takes advantage of the sensitivity of quantitative PCR (qPCR; aka real-time PCR) and the known sequences of the adaptors that have been ligated to the unknown sample. In this assay, a standard of known size and concentration is run alongside with experimental samples at specified dilutions to estimate a regression line. The number of molecules per microliter of experimental sample can be calculated at levels down to picograms levels rather than micrograms. As Meyer et al. (2008b) note, only 50 pg of initial DNA is needed for 12,000 reads on a single 16th region of a Roche/454 picotitre plate. For a full LR70 (420,000) reads, 1.75 ng of starting material would be sufficient for approximately 100 Mb of data.

Working with Low Amounts of Starting Material

The other drawback of HTS is actually having enough material to begin the library preparation. Roche/454 does not recommend library construction on samples where there is less than 1 μg of starting material. What does one do when their environmental sample produces only 0.5 μg of material? One option that has been formalized by Blow et al. (2008) is similar in concept to the qPCR protocol of Meyer et al. (2008b) in that it takes advantage of the adaptor sequences attached to the sample DNA. The twist is that the amplified product is used as the template for the PCR. In effect, a whole-genome amplification (WGA) takes place on the sample only after the library prep, rather than before. The benefit of the Blow et al. (2008) protocol is that if you have marginal levels of starting DNA (1–3 μg), you can still have the library prepared, and if there is too little as determined by a Bioanalyzer (range of detection is 5–50 ng/μl), then the sstDNA can be used as a template in a standard PCR with the complement of the Roche/454 adaptors as the primers. We have successfully used this protocol for BACs, bacterial complementary DNA (cDNA), eukaryotic cDNA, and bacterial genomes with no discernable bias in coverage as also noted by Blow et al. (2008). The one obvious drawback of this approach is an additional amplification step, which can be negated, to some degree by using a high quality Taq with 5′ and 3′ proofreading capability. One alternative to the Blow et al. (2008) protocol is the use of a commercially available WGA kit such as the GE Healthcare Phi29 DNA polymerase amplification or Qiagen REPLI-g, which are capable of increasing sample amount by an order of magnitude. While WGA is used in metagenomic projects (Abulencia et al. 2006; Lasken 2007), one should be aware of the introduction of biases attributed by an additional amplification step (Pinard et al. 2006).

Partial Plate Runs

A smaller, though important drawback to keep in mind is the vast amount of data obtained by HTS. If one is only interested in a small piece of DNA (i.e., BAC) or only need a small amount of data for a grant or proof-of-concept paper, 100 Mb of data is overkill. One solution is to pool together samples with other researchers and colleagues to fill out a plate. A second alternative is to run a partial plate. This latter option is often not available for people outsourcing their HTS work because many core facilities do not offer this or have limits on partial plates. Many in-house genome centers process partial plates for their researchers but it may take on the order of months for a plate to be sufficiently full to justify the cost of a run.

Homopolymers and Special Considerations for cDNA

HTS sequencing of cDNA poses a unique set of drawbacks, but is ultimately related to the issues of pyrosequencing homopolymers. Eukaryotic cDNA is transcribed from messenger RNA (mRNA) typically using commercially available kits that use a Poly-T primer. Because eukaryotic mRNA is poly-adenylated on the 3′ end, the poly-T primer is used to target this region, as T is the complement of A. The one drawback of pyrosequencing cDNA libraries is the presence of these long regions of homopolymers. In our experience, the accuracy of Roche/454 Life Sciences base calling on homopolymers diminishes markedly after the fifth or sixth base in a homopolymer stretch. Additionally, the most nucleotides that can be incorporated in a single flow for a given nucleotide is ∼10. If three consecutive flows fail to get through a homopolymer (i.e., more than 30 As), the software discards that read as a “dot” which means that the software has interpreted the bead as containing no DNA and that particular read is discarded. This problem of data loss by the A homopolymers is exasperated by the fact that the intensity of the bead with the 30+ As also obscures neighboring beads (compare Fig. 3b vs. a), thus causing additional loss of data. Finally, the faulty reads obscure the control reads such that even the data collected cannot be used with much confidence (Table 1). Fortunately, a recent, simple modification solves the polyA dilemma with Roche/454 Life Sciences pyrosequencing.

Fig. 3
figure 3

Framegrab of plate images from a Roche/454 run of a bacterial cDNA sample with polyA tails present (regions 6–8) and absent (regions 9–11). Wells in regions 9–11 are not visible due to the extreme white balance (>5,000) required to show the intensity of the A flow in the sequencing reaction for the polyA present sample

Table 1 Summary statistics for regions 6–11 in Fig. 3a and b

Since the confounding issue is the presence of the polyA tail on the 3′ end of the cDNA, Novaes et al. (2008) have incorporated a modified 3′ primer that has a rare restriction site (SfiI) built in the 3′ cDNA adaptor primer. Following relatively standard cDNA protocols, the additional step of digesting the cDNA with SfiI and removal of the polyA tail via column or Ampure purification greatly increases the percent passing reads and data yield. A similar solution has been proposed by Frias-Lopez et al. (2008) for bacterial cDNA. Bacterial cDNA (which lack adenylation on their mRNA) are poly-adenylated via ligation, thus making them effectively eukaryotic cDNA for downstream applications. A rare restriction enzyme (BpmI) recognition sequence is inserted between the poly-T and the T7 promoter sequence, thus allowing removal of the confounding polyA region prior to pyrosequencing library construction.

Coastal Ecosystems and Applications of HTS

As of December 2008, a total of 677 papers have referenced the Margulies et al. (2005) paper describing the proof-of-concept for the Roche/454 Life Sciences Genome Sequencer platform. Most applications of Roche/454 Life Sciences HTS have been for de novo characterization such as metagenomics, amplicons, and whole-genome sequencing of new organisms.

For metagenomics, there are two main approaches. The first is a systems-based approach where the entire sample of DNA is processed and analyzed. Depending on the sample origin, DNA from bacteria, archaea, eukaryotes, viruses, and plasmids will be detected by HTS. A widely used program to assist in the daunting task of characterizing a HTS metagenomic pyrosequencing run is the Metagenomic Rapid Annotation Subsystem Technology (MG-RAST, http://metagenomics.nmpdr.org; Meyer et al. 2008a). Through the use of a subsystem-based annotation approach, the MG-RAST web server provides annotation, phylogenetic as well as functional classification, and the ability to compare metagenomic samples to see both shared and unique genes/subsystems. Another widely used approach for processing and exploring metagenomic data is the MEGAN program (Huson et al. 2007).

The whole-system approach has been used to simultaneously characterize a variety of taxa: viruses in the ocean (Angly et al. 2006), multiple comparisons of biomes (Dinsdale et al. 2008), and bacterial community composition (Edwards et al. 2006; Sogin et al. 2006; Fierer et al. 2007; Huber et al. 2007; Pernthaler et al. 2008). One promising development has been the advancement of metagenomics beyond simple characterization of what species are present. Now that protocols exist for the construction of cDNA from bacteria, temporal and spatial studies of changes in bacterial communities are possible (Frias-Lopez et al. 2008). Additionally, the combination of metagenomics with the analysis of microbial community metabolism provides a powerful approach to understanding the function of microbial communities in a suite of environments (Turnbaugh and Gordon 2008). With the advent of autonomous times-series transcriptomics for coastal bacteria populations (Preston et al. 2009), our understanding of coastal communities and changes in both space and time promise to be an exciting area of research and development. Additionally, new protocols are being developed to pull a species of interest from the metagenomic haystack (Pernthaler et al. 2008; Zehr et al. 2008). As HTS becomes a more accessible tool in the biologist’s toolbox, more creative and exciting modifications are bound to change the way we view science in the ocean and beyond. The whole-system approach has the benefit that one is characterizing a community by the transcriptomic potential (i.e., are ammonia oxiders present in a particular sample/environment?).

The second approach to metagenomics is designed to identify which species are present in a sample. Unlike the first approach, this approach has the distinct likelihood of potentially missing certain taxa through the use of PCR primers for amplifying specific regions (usually variable regions of the 16S rRNA). Even with the known bias of PCR amplification and missing certain taxa due to primer mismatches (Fodor et al. 2009), the microbial diversity estimates being discovered using HTS are truly astounding (Fierer et al. 2007; Huber et al. 2007). The one drawback of this approach is that the transcriptomic potential of a community is being inferred by species’ assignment based on a single molecule (typically 16S rRNA).

Ideally, one would use a multi-tiered approach to characterizing a metagenomic community. Just as one should not rely on a single gene for a phylogeny or a particular dataset, one should consider all accessible avenues of data in order to strengthen ones conclusions. In order to accurately characterize a community, one would look at the genes expressed in a particular sample (i.e., cDNA) in addition to all genes present (i.e., metagenomic DNA). With advances in methodology such as Frias-Lopez et al. (2008) and the ability to barcode samples, HTS is a tool with great potential for revolutionizing the way we design studies for coastal ecosystems.

Data Handling and Storage

Many open-source programs and scripts (especially Perl and Python) exist and are freely available in the HTS/metagenomics community. The main issue that many people first entering into the HTS world encounter is that the raw data (the Ts, As, etc.) are rarely viewed by a human eye due to the enormous amount of data. The accuracy of Roche/454 pyrosequencing is over 99% for a single read and close to 100% with coverage (usually 10–20× depending on the project). Additionally, the individual nucleotide quality scores (Q, where Q = −10 log p and p is the estimated error probability of the base) are invariably over Q35 (usually closer to Q40), which is considerably better than the Q20 score (i.e., odds of miscalling an individual nucleotide is 1%) used as a cutoff with traditional Sanger sequencing. It is also not possible to manually go through all nucleotides in a HTS dataset to verify the accuracy of nucleotides as one would do with Sanger reads. Today’s biologist must be adept and knowledgeable with Unix/Linux commands and how to apply pipelines for data processing. The final note of warning for those considering HTS as part of their experimental toolbox is data storage. The current Roche/454 Life Sciences FLX run produces on average 100 Mb of data in a folder of approximately 12–18 Gb of data. The new chemistry upgrade (“Titanium”) that has become recently available produces ∼400 Mb of data in a folder of ∼35 Gb.

Conclusions

Our understanding of the environment has been revolutionalized by findings one could not have possibly imagined before the application of high-throughput sequencing to metagenomics. With new protocols and technologies continually being developed, the metagenomics field will continue to evolve and revolutionize our understanding of complex coastal communities and how they function and change over space and time.