Recent methodological advances in high-throughput DNA amplification and sequencing have opened new windows on many research questions in the life sciences (Shendure and Ji 2008). One such area is metagenomics, where the total DNA found at a sample site is sequenced and analyzed through, e.g., massively parallel (“454”) pyrosequencing (Margulies et al. 2005) or Illumina sequencing (Bentley 2006). This enables a wide range of functional and ecological inferences about the roles and capacities of the underlying species community (Trevors and Masson 2010; Wooley et al. 2010). Common to these pursuits is usually the need, or desire, to also examine the taxonomic composition of the community recovered. This is typically achieved through similarity searches of the ribosomal 12S/16S/18S small subunit (SSU) sequences of the query dataset against nucleotide sequence databases such as GenBank (Benson et al. 2009), SILVA (Pruesse et al. 2007), and RDP (Cole et al. 2009).

The process of identifying and annotating sequences with respect to taxonomic affiliation is not trivial and often requires both manual intervention and some degree of familiarity with the lineages recovered (Christen 2008; Kang et al. 2010; Nilsson et al. 2011). Furthermore, as the complexity of the samples and sample sites increases, so does that of the sequence identification process. The SSU, in addition to being present in the nucleus of eukaryotes and the core genome of prokaryotes, is also found in the mitochondria of eukaryotes and in the chloroplasts of photosynthetic eukaryotes. In the latter two cases the gene has independent endosymbiotic origins. As a consequence, these different SSU rRNAs should normally not be incorporated into, e.g., joint multiple alignments for taxonomic identification, phylogenetic analysis, or ecological inferences. Thus, if the metagenome under scrutiny contains prokaryotes as well as eukaryotes, there are many situations where the distinct classes of SSU sequences need to be delimited and extracted for separate analysis. This is a time-consuming and largely manual exercise that is further complicated by the considerable proportion of incorrectly identified or otherwise poorly annotated reference entries in the public sequence databases (Bidartondo et al. 2008; Ryberg et al. 2009). The present study offers a remedy in the form of an open source MacOS X/Linux/UNIX software tool, Metaxa, for automated detection of, and discrimination among, ribosomal SSU sequences from archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in large datasets (Online Resource 1; http://microbiology.se/software/metaxa/). The source code is written in Perl and takes advantage of multiple processor cores if available. Internet access is not needed to run the software.

Metaxa has a two-step analysis procedure, where each step may be run separately as needed: it first extracts all SSU sequences from the dataset and then subjects only the SSU sequences to detailed analysis, thus bypassing the need to spend further time on sequences that are not SSU in the first place. It expects query sequences of any number in the FASTA format (Pearson and Lipman 1988). By default, Metaxa starts by examining the query dataset for the presence of SSU sequences of any of the five origins. This is accomplished through HMMER 3.0 (Eddy 1998), the archaeal, bacterial, and eukaryote hidden Markov models (HMMs) of V-Xtractor 2.0 (Hartmann et al. 2010), and a set of newly generated HMMs for the mitochondrial and chloroplast SSU (Online Resource 1). Following the V-Xtractor recommendations, we built the new HMMs from conserved, ~50 bp sequence segments distributed across the full length of the SSUs; an average of 11 HMMs were made for each origin. The first step finds SSU sequences ranging from full length down to about 100–200 bp and assigns them to a tentative origin based on the HMM (e.g., bacterial) that produced the best match to the sequence in question. Some regions of the SSU are, however, so highly conserved across the organelles and lineages of the tree of life that HMMs computed for several different origins could produce nearly equally good matches to those regions, which cautions against making a final decision already at this stage. Instead, the second step uses the extracted SSU entries in BLAST-based sequence similarity searches (Altschul et al. 1997) against local filtered copies of the manually curated prokaryote, eukaryote, mitochondrial, and chloroplast SSU entries of the GreenGenes (DeSantis et al. 2006), SILVA, CRW (Cannone et al. 2002), and MitoZoa (Lupi et al. 2010) databases. By default, the five best BLAST matches of each query are examined for origin (archaea, bacteria, eukaryote, mitochondria, or chloroplast).
The origin of the best BLAST match is given a score of 5; the origin of the second best match a score of 4; that of the third a score of 3; the fourth 2; and the fifth 1. In addition, the origin determined by HMMER is given a score of 5 in order to make the HMMER step influential but not decisive. The scores are then summed for each origin. If the origin with the highest score is the same as the origin suggested by HMMER, the query sequence is assigned to that origin. If the origin with the highest score is different from that suggested by HMMER, the sequence is still assigned to the origin with the highest score but marked as in potential need of further scrutiny. Cases in which the scores of two or more origins are tied are treated in the same way, except that the corresponding sequences are classified as “uncertain” and that a multiple alignment is computed in MAFFT (Katoh and Toh 2008) for the query sequence together with its five best BLAST matches to facilitate manual examination and interpretation. This dual approach, where both HMMER and the best BLAST matches influence the final decision, minimizes the effect of single incorrectly annotated, or otherwise problematic, reference sequences, whose presence would distort efforts based on BLAST alone. The second step concludes by writing a separate FASTA file for each origin found (e.g., queryfile.bacteria.fasta), with each such file containing all query sequences of the origin in question. In addition, a detailed log file is generated.
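The scoring scheme described above can be illustrated with a minimal Python sketch. It is not taken from the Metaxa source code (which is written in Perl); the function name `assign_origin`, its parameters, and the returned dictionary layout are hypothetical and chosen only for illustration. The sketch assumes that the origins of the BLAST matches are supplied in order of decreasing match quality.

```python
from collections import defaultdict


def assign_origin(hmm_origin, blast_origins, n_matches=5, hmm_weight=5):
    """Combine the HMMER-suggested origin with the origins of the best BLAST matches.

    hmm_origin    -- origin suggested by the best-matching HMM (e.g. "bacteria")
    blast_origins -- origins of the BLAST matches, best match first
    Hypothetical helper: names and return format are illustrative assumptions.
    """
    scores = defaultdict(int)

    # Best match scores n_matches (5 by default), the next one 4, and so on down to 1.
    for rank, origin in enumerate(blast_origins[:n_matches]):
        scores[origin] += n_matches - rank

    # The HMMER origin receives a fixed weight: influential but not decisive.
    scores[hmm_origin] += hmm_weight

    best = max(scores.values())
    winners = sorted(origin for origin, score in scores.items() if score == best)

    if len(winners) > 1:
        # Tied scores: classified as "uncertain" and flagged for manual examination.
        return {"origin": "uncertain", "candidates": winners,
                "flag": True, "scores": dict(scores)}

    winner = winners[0]
    # A disagreement with HMMER keeps the top-scoring origin but flags the sequence.
    return {"origin": winner, "candidates": winners,
            "flag": winner != hmm_origin, "scores": dict(scores)}
```

For example, a query whose HMMER origin is bacterial and whose five best BLAST matches are all mitochondrial would obtain 15 points for mitochondria and 5 for bacteria; it would be assigned to mitochondria and flagged for further scrutiny.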

To evaluate the efficacy of Metaxa, we downloaded the 262,032 SSU sequences of the non-redundant SILVA 102 release that were annotated to origin (bacterial, archaeal, nuclear eukaryote, mitochondrial, or chloroplast). We required that each sequence should produce matches to at least two HMMs for the sequence to be classified as an SSU sequence. Metaxa identified more than 99.95% of the sequences to the correct origin (130 out of 262,032 (0.05%) sequences were classified to a different origin than that given by SILVA; Online Resource 2), although a slight drop in accuracy was noted for the mitochondrial sequences (18 out of 434 (4.15%) mitochondrial sequences were assigned to a different origin than that given by SILVA). We furthermore collected 100 random chloroplast SSU sequences from the full-length chloroplast genomes of cpBase (http://chloroplast.ocean.washington.edu/); 100 random SSU sequences from the full-length mitochondrial genomes of GOBASE (O’Brien et al. 2009) and MitoZoa; 100 SSU sequences from the full-length bacterial genomes of UCSC Archaeal Genome Browser (Schneider et al. 2006); 80 SSU sequences from the 80 public full-length archaeal genomes of UCSC Archaeal Genome Browser; and 100 eukaryote SSU sequences from the non-redundant SILVA 104 release. The sequence corpus was run in eight versions through Metaxa to mimic read lengths ranging from those obtained through traditional Sanger sequencing down to those obtained from present pyrosequencing technology and below: the full length, 1250, 1000, 750, 500, 300, 200, and 100 bp. Each length n was run in two versions: the first n basepairs of the SSU, and a random segment of n basepairs along the SSU. The first case simulates traditional, targeted PCR whereas the second case simulates metagenomic data. All 480 sequences were correctly identified to their respective origin in the full-length dataset (Table 1). 
At pyrosequencing read lengths of 500 bp, the percentage of correct assignments for both versions was at or above 99% for archaea and bacteria; at or above 98% for nuclear eukaryotes and chloroplasts; and at or above 94% for mitochondria (Table 1; Online Resource 3). To evaluate the susceptibility of Metaxa to false positives, we generated five 5-million-sequence datasets of random nucleotide data of the lengths 1250, 1000, 750, 500, and 300 bp in the EMBOSS 6.2.0 suite (Rice et al. 2000). As above, these datasets were run with the requirement that a sequence must produce matches against at least two HMMs to be classified as an SSU sequence. Three of these 25 million sequences (0.000012%) were incorrectly identified as SSU sequences, suggesting a considerable robustness against false-positive matches (Online Resource 4).
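The construction of the truncated test sets (the first n basepairs versus a random segment of n basepairs) can be sketched as follows. This is an illustrative Python sketch and not the script used in the study; the function names, the fixed random seed, and the choice to return a sequence unchanged when it is shorter than n are all assumptions made for the example.

```python
import random

# Read lengths (bp) evaluated in addition to the full-length sequences.
READ_LENGTHS = (1250, 1000, 750, 500, 300, 200, 100)


def first_segment(seq, n):
    """First n basepairs of the SSU (simulates traditional, targeted PCR)."""
    return seq[:n]


def random_segment(seq, n, rng=random):
    """Random window of n basepairs along the SSU (simulates metagenomic data).

    Assumption: sequences shorter than n are returned unchanged.
    """
    if n >= len(seq):
        return seq
    start = rng.randrange(len(seq) - n + 1)
    return seq[start:start + n]


def make_versions(seq, seed=0):
    """Return the full-length sequence plus both truncated versions per read length."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    versions = {"full": seq}
    for n in READ_LENGTHS:
        versions[("first", n)] = first_segment(seq, n)
        versions[("random", n)] = random_segment(seq, n, rng)
    return versions
```

Each reference sequence thus yields one full-length entry and two entries for each of the seven shorter read lengths.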

Table 1 Number of correctly assigned entries of the 480 reference SSU sequence dataset as reported for the five different origins (80 sequences from archaea, 100 from bacteria, 100 from eukaryotes (nuclear), 100 from chloroplasts, and 100 from mitochondria)

We view the proportion of incorrect assignments to origin (on average well below 0.5% for sequences longer than 750 bp) as acceptable given the complex evolutionary history of the SSU as reflected across the organelles and lineages of the tree of life. Although expertly curated and further filtered in this study, the BLAST databases employed by the software are likely to contain a small proportion of taxonomically misidentified or otherwise anomalous entries (Hartmann et al. 2011), which would add some degree of noise to the present effort. Lineages that hold basal positions within the five origins, such as those close to the mitochondria/alphaproteobacteria or the chloroplast/cyanobacteria ancestor demarcations, are probably more likely to be incorrectly assigned to origin than lineages deeply nested within the respective clades. Sequences from previously undiscovered or sparsely sampled lineages are similarly subject to a higher risk of misclassification. In light of these observations, we recommend that the entries on whose origin a final decision could not be reached should be examined manually. The software outputs ample information, including multiple alignments in the case of sequences of uncertain assignment, to assist such scrutiny. The focus on shorter sequences of the metagenomics type, as well as the ability to sort those sequences into the five different origins targeted, sets Metaxa apart from RNAmmer (Lagesen et al. 2007), which is an HMM-based software resource for detection of rRNA genes in full genome sequences. When compared for performance on the data underlying Table 1, Metaxa outperformed RNAmmer in terms of accuracy and speed on all sets of SSU sequences examined. In addition, Metaxa was able to satisfactorily address sequences shorter than 1000 bp as well as sequences of mitochondrial and chloroplast origin, both of which are out of reach for RNAmmer (Online Resource 5).

The time needed by Metaxa to analyse a dataset scales linearly with the number of SSU sequences, such that a doubling of the number of SSU sequences will, on average, double the runtime. Non-SSU sequences do not add much to the runtime, such that a one-million-sequence pyrosequencing metagenome with 0.5% SSU sequences will be processed in under 2 h. The test corpus of 262,032 true-positive SILVA SSU sequences took 34 h to run on a twelve-core 2.0 GHz Linux machine, and the five-million, 1250 bp sequence dataset of true negatives took 19 h. Since Metaxa loads the query sequences sequentially, there is no restriction on the number of query sequences. For the same reason, Metaxa does not require large amounts of computer memory; at no point during the execution of the 262,032 true-positive SSU dataset was more than ~250 MB of memory needed.
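The principle of sequential loading can be illustrated with a short Python generator that yields one FASTA record at a time, so that memory use depends on the longest single sequence rather than on the size of the dataset. This is a sketch of the general technique only and not the implementation used in Metaxa; the function name `read_fasta` is hypothetical.

```python
def read_fasta(handle):
    """Yield (header, sequence) tuples one record at a time from an open FASTA file.

    Only the record currently being read is held in memory.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

A caller would typically iterate over the generator inside a `with open(path) as handle:` block and process or discard each record before the next one is read.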

In conclusion, Metaxa detects SSU entries in larger bodies of sequences—such as metagenomes and environmental sequencing datasets—and assigns them to origin with a negligible proportion of false positives and negatives and at a relatively high speed. To rely solely on BLAST for the same purpose, in contrast, would be many times slower and less precise, and would require significant manual intervention. Metaxa is freely available under the GNU GPL v. 3 software licence (Online Resource 1; http://microbiology.se/software/metaxa/), and it is written in a way that makes integration into existing software pipelines for analysis of environmental sequences straightforward. We believe it may increase accuracy in the annotation and analysis of metagenomes and similar datasets, whose sizes tend to defy most attempts at manual processing and examination.