Introduction

The mitochondrial genomes (mtDNA) of bilaterian animals are typically single, ~16 kb circular molecules. They are inherited maternally and encode a highly conserved set of 37 mitochondrial genes, including 22 tRNAs, 2 rRNAs and 13 protein-coding genes. These mtDNAs are efficient molecules, generally having only a short non-coding region, no introns, and minimal to no intergenic sequences. Many genes overlap one another by several nucleotides (nts), such that post-transcriptional polyadenylation of the mRNA completes the transcript (Lavrov 2007). This succinctness has led to suggestions that larger mtDNA are selected against during evolution (Rand and Harrison 1989). These selective pressures may also contribute to the low frequency of inverted/tandem repeats or non-functional gene copies (pseudogenes) (Moritz et al. 1987).

However, complete sequencing of bilaterian animal mitochondrial genomes has revealed unusually large (>24 kb) mtDNA sporadically in a range of animals, primarily due to tandem repeats and lengthy non-coding regions. Examples include a fish (Inoue et al. 2010), a reptile (Fujita et al. 2007), molluscs (La Roche et al. 1990; Simison et al. 2006), nematodes (Azevedo and Hyman 1993; and GenBank accession no. EF368011), and an arthropod (Jeyaprakash and Hoy 2007). Pseudogenes of tRNAs, rRNAs and/or protein-coding genes have been characterized in the mtDNA of a range of organisms, including insects (Beckenbach et al. 2005; Dowton et al. 2003; Lessinger et al. 2004), amphibians (San Mauro et al. 2006), reptiles (Kumazawa et al. 1998; Russell and Beckenbach 2008; Zevering et al. 1991), fish (Mabuchi et al. 2004), birds (Mindell et al. 1998), a sponge (Rosengarten et al. 2008), an oyster (Milbury and Gaffney 2005), and nematodes (Gibson et al. 2007a; Gibson et al. 2007b; Howe and Denver 2008). Long homopolymer stretches have been reported in the coding regions of some animal mtDNA. Homopolymers are potentially detrimental as they are prone to transcriptional slippage, resulting in variation in the tract length relative to the template molecule (Wagner et al. 1990). Observations of decreased homopolymer tract lengths and frequencies in coding regions compared to non-coding regions have led to hypotheses of evolutionary selection against long (>~8 nts) homopolymer tracts in coding regions (Tamas et al. 2008). Despite this, long homopolymers are reported in the mtDNA of several invertebrate species, including a brachiopod, Terebratalia transversa (Helfenbein et al. 2001), and several nematodes, including Dirofilaria immitis (Hu et al. 2003b), Haemonchus contortus (Jex et al. 2008), Radopholus similis (Jacob et al. 2009), and Strongyloides stercoralis (Hu et al. 2003a).

Recently, we reported homopolymers in protein-coding mitochondrial genes of two potato cyst nematodes (Globodera rostochiensis and Globodera pallida) (Gibson et al. 2007a; Gibson et al. 2007b; Riepsamen et al. 2008). The mtDNA of these species are atypical, composed of several small circular mtDNA molecules (~6.4–9.5 kb in G. pallida), each with a large non-coding region and only some of the 37 mitochondrial genes in typical animal mtDNA (Armstrong et al. 2000). The gene organization is an overlapping mosaic of large multigenic fragments present on several circles. These multigenic fragments have up to 98% identity on different circles for both G. pallida and G. rostochiensis. Several of these gene copies produced truncated proteins when conceptually translated, and hence were considered pseudogenes. Though pseudogenes are rare in animal mtDNA, duplicated gene copies are predicted to have reduced selective pressures and hence accumulate mutations. However, we observed that the nucleotide sequence of these gene copies was neither divergent nor truncated, but had indel mutations (insertions/deletions of 1–2 nts) which disrupted the reading frame. Analysis of multiple clones revealed that these indels were consistently thymidine indels within polythymidine [poly(T)] tracts on the coding strand (or adenine insertions/deletions within polyadenine [poly(A)] tracts on the non-coding strand). Termed ‘poly(T) variation’, these indels were not considered artefactual for several reasons: (i) both DNA strands were sequenced for all experiments, (ii) electropherograms showed no ambiguous base-calling surrounding the homopolymers, (iii) the DNA polymerase used for G. rostochiensis had proofreaders (both 5′ → 3′ and 3′ → 5′), (iv) there was no evidence of polymerase errors, with substitution mutations observed between gene copies on different circles identical between G. pallida and G. rostochiensis, (v) the poly(T) variation was unrelated to experimental conditions, with reproduced amplifications on different populations having markedly different poly(T) variation, suggesting that poly(T) variation is population-specific, and (vi) poly(T) variation was characterized in amplicons from genomic DNA and mRNA (Riepsamen et al. 2008), and in EST databases generated independently (www.nematode.net).

Despite these observations, the possibility of poly(T) variation in Globodera mtDNA being an artefact of amplification could not be discounted, as for each instance an amplification preceded sequencing. Thus, the poly(T) variation observed was more precisely a measure of biological variation plus any artefactual variation introduced (biological + artefactual variation). It is extremely difficult to measure biological poly(T) variation in nematodes without an amplification step. Multiple individuals cannot be used to increase the amount of biological material, as this may introduce inter-individual poly(T) variation. However, it is relatively simple to measure the artefactual variation (i.e., polymerase error) that a poly(T) tract generates in vitro. By isolating a single DNA molecule from a region exhibiting poly(T) tracts, the amount of artefactual poly(T) variation can be measured by performing a second round of amplification, and sequencing multiple clones. If no polymerase errors occur, all copies amplified during this second round will be identical. If polymerase errors do occur and poly(T) variation is introduced, the proportion of clones with different poly(T) tract lengths will be a measure of the extent of artefactual poly(T) variation. As described above, the biological + artefactual variation can also be determined, i.e., the poly(T) variation observed after amplification of a biological sample (and the sequencing of multiple clones). By deducting the artefactual poly(T) variation from the biological + artefactual variation, the biologically derived poly(T) variation can be estimated. The use of a single DNA molecule as the template for the analysis of artefactual variation avoids the possible introduction of biological variation, for example, due to mtDNA heteroplasmy. 
Further, by measuring the extent of both artefactual and artefactual + biological poly(T) variation after a single round of cloning, any variation introduced during cloning would be inherent in both poly(T) variation estimates. Thus, any differences between these estimates will primarily be due to amplification error.
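The deduction described above can be written out as a short calculation. The sketch below is illustrative only: the function name and the example values are hypothetical, and clamping a negative difference to zero is an assumption made here rather than part of the approach described in the text.

```python
def estimate_biological_variation(total_rate, artefactual_rate):
    """Estimate biological variation by deducting the artefactual rate
    from the combined (biological + artefactual) rate.

    Both rates must be expressed in the same units. A negative difference
    is clamped to zero (an assumption of this sketch), because sampling
    noise can make the artefactual estimate exceed the combined estimate.
    """
    if total_rate < 0 or artefactual_rate < 0:
        raise ValueError("rates must be non-negative")
    return max(total_rate - artefactual_rate, 0.0)


# Hypothetical example values, not data from this study.
example = estimate_biological_variation(0.30, 0.25)
print(f"estimated biological variation: {example:.2f}")
```

Because both estimates pass through the same cloning step, any cloning-derived variation cancels in the subtraction, which is the rationale given above for measuring both after a single round of cloning.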

In this study, the extent of artefactual and biological + artefactual poly(T) variation was determined, using a mtDNA region from a heteroderid nematode (the family to which G. pallida and G. rostochiensis belong). We investigated the generation of poly(T) variation using a long DNA polymerase, as ultimately we intended to use this to amplify long fragments of nematode mitochondrial genomes for sequencing. This revealed that the level of poly(T) variation characterized for the biological sample after an amplification step was not significantly greater than that introduced as an artefact of amplification. This suggests that poly(T) variation observed in the heteroderid mtDNA is predominantly an artefact of amplification. The poly(T) variation generated during amplification of a broad range of heteroderid nematodes as well as in two non-cyst-forming plant parasitic nematodes was also studied to assess the potential each molecule has for poly(T) variation. These data were used to determine features that make these molecules susceptible to poly(T) variation. These results have significant implications for the sequencing of mitochondrial genomes containing extensive homopolymer regions. Unless strategies are applied to minimize or account for potential poly(T) variation, incorrect homopolymer tract lengths may be inferred, and genes can be incorrectly annotated as pseudogenes.

Materials and Methods

Genomic DNA Extraction and Molecular Identification

Nematode specimens were collected from Australian and international sources (see Table 1), and preserved in 95% ethanol on arrival. Single cysts of the Heterodera and Betulodera specimens were washed with distilled water and surface sterilized in 0.3% hypochlorous bleach as described (Nour et al. 2003). Sterilized cysts were washed with distilled water to remove any bleach prior to DNA extraction.

Table 1 Biological material used and origin of the specimens

DNA extractions utilized (i) the egg mass of Meloidogyne javanica within the gall formed on the infected root, (ii) an adult of Helicotylenchus dihystera, and (iii) sterilized cysts of the Heteroderidae nematodes. Cysts are formed from the posterior of a deceased gravid female. As the mtDNA of the eggs inside are maternally inherited (Tsang and Lemire 2002), this is the largest available mtDNA source representing a single individual. Total genomic DNA of all specimens was extracted using the salting out method described by Sunnucks and Hales (1996), but with the proteinase K step extended to 1 h at 65°C, followed by 10 min at 95°C to inactivate the enzyme. Extracts were resuspended in TE buffer (1 mM Tris–HCl, 0.1 mM EDTA [pH 8]) and stored at 4°C until use. Species identifications were molecularly verified by amplification and sequencing across the internal transcribed spacer (ITS) regions 1 and 2 using the Heteroderidae diagnostic primers TW81 (5′-GTTTCCGTAGGTGAACCTGC-3′) and AB28 (5′-ATATGCTTAAGTTCAGCGGGT-3′) as described by Subbotin et al. (2001). Amplification reactions used the BIO-X-ACT™ Long DNA Polymerase Kit (Bioline, Australia), and cycling conditions consisted of 94°C for 2 min, 35 cycles of 94°C for 30 s, 55°C for 30 s and 72°C for 2 min, with a final elongation of 72°C for 5 min. Sequencing reactions were carried out using the BigDye® Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Australia), and an ABI 3130xl Genetic Analyzer (Applied Biosystems, Australia). Sequences were analyzed using ChromasPro (Technelysium Ltd., Tewantin, Australia), BioEdit (Hall 1999) and ClustalW (Thompson et al. 1994) software. The RNAfold software (http://rna.tbi.univie.ac.at) was used to predict secondary structures in the DNA sequences (Hofacker 2003).

Amplification and Cloning of a Mitochondrial Fragment

A fragment of the cytochrome oxidase subunit II (COII) gene containing poly(T) tracts was amplified in Heterodera cajani using the BIO-X-ACT™ Long DNA Polymerase Kit, and primers Het-COII-F (5′-GTGGGTCATCARTGRTAYTG-3′) and Het-COII-R (5′-GAATGATTAGCTCCRCARATYTC-3′). PCR cycling conditions consisted of 94°C for 2 min, 35 cycles of 94°C for 10 s, 45–65°C for 30 s and 68°C for 1 min/kb, followed by 68°C for 5 min. The PCR product was cloned directly into the pGEM®-T Easy Vector System (Promega, Australia). Two positive clones were purified using the Wizard® Plus SV Minipreps DNA Purification System (Promega). Both DNA strands were sequenced using the sequencing and analysis methods described above, using M13 vector primers.

Characterization of Artefactual (In Vitro) Variation at Poly(T) Tracts

The COII fragment of the purified plasmid DNA (representing a single DNA molecule) was subcloned to facilitate analysis of the poly(T) variation generated in vitro, as per the approach of Clarke et al. (2001). Re-amplification and cloning used the same primers, enzyme, cycling conditions and vector system as described above. Two approaches were used to enable sequencing of these clones. In the first approach, purified products of colony-PCRs were used for sequencing. Colony-PCRs used the BIO-X-ACT™ Long DNA Polymerase Kit as described above, with products purified using ExoSAP-IT® (USB Corporation, Australia). In the second approach, plasmid DNA was purified for sequencing using the Wizard® Plus SV Minipreps DNA Purification System. For both approaches sequencing was performed on both DNA strands using M13 primers as described above.

Characterization of Total (Biological + Artefactual) Variation at Poly(T) Tracts

The same COII region was analyzed in a broad range of species to assess the features that may contribute to poly(T) variation. For Heterodera cajani, Heterodera cardiolata, Heterodera glycines and Heterodera zeae, species-specific primers P116F-1R (5′-CCACGGGTATAAAGGAATGA-3′) and H.Gly-COIIF_inFOR (5′-GGTAGTTGGTCATCAATGATATTG-3′) were used. To optimize amplification from divergent species, degenerate primers Het-COII-F and Het-COII-R (described above) were used for all species listed in Table 1. Both primer pairs encompass the same COII region. Amplification, cloning, and sequencing of multiple clones from the purified colony-PCR product were performed as described above.

An internal fragment of the cytochrome oxidase subunit I (COI) gene was also amplified in H. cardiolata and H. glycines using the primers H.Gly-COIIinFOR-4F (5′-TGCTTTAATTGGTGCTTTTGG-3′) and H.Gly-COI-FOR (5′-TATTCCAACAACAAATATATGATGAGC-3′). For H. glycines, the entire COI gene was further amplified, both as two overlapping fragments and as one larger fragment encompassing the entire gene. The primer pairs for these amplifications were H.Gly-COIFOR and H.Gly-COIIinFOR (5′-GGTAGTTGGTCATCAATGATATTG-3′) for the 5′ fragment, H.Gly-COI-REV (5′-CTGTTTTGGCAGGAGCTATTAC-3′) and H.Gly-ND4inREV1 (5′-GAATTAATTTTTTCAATTTGGACTCC-3′) for the 3′ fragment, and H.Gly-COIIinFOR-2F (5′-GCACCTTTAGCGTGATTGGT-3′) and H.Gly-ND4inREV1-2F (5′-TGAAAGCCAACTAAAGCCTCA-3′) for the entire gene. The methods for amplification, cloning, sequencing from the colony-PCR product and analyzing multiple clones were as described above for the COII fragment.

Results

Artefactual Poly(T) Variation

A region of the COII gene in H. cajani was amplified and the PCR product sequenced directly. The electropherogram showed evidence of heterogeneity (see Supplementary Fig. 1). This heterogeneity was evident as peaks of lesser intensity than the main peak, and was always seen immediately downstream (3′) of a poly(T) region. In each case, the minor peak corresponded to the next-most nucleotide, suggesting that the secondary peaks were the product of slippage errors. To determine the extent to which poly(T) variation was generated in vitro, individual (i.e., cloned) molecules of a fragment containing poly(T) tracts were characterized. A 274 bp region of the COII gene from H. cajani was sequenced in two clones (denoted a1 and a2). The electropherograms of these sequences were clean, with no evidence of heterogeneity (for an example, see Supplementary Fig. 1), suggesting negligible variation was introduced by cloning. The COII sequences of a1 and a2 were identical and translated a conserved polypeptide. To represent single DNA molecules, plasmid DNA of a1 and a2 was purified. This DNA was subjected to a second round of amplification and cloning. Ten clones were sequenced from each to assess the level of variation introduced during this second round of amplification, cloning and sequencing (i.e., artefactual variation). However, in order to sequence these clones, the template had to be amplified, either by culture of the bacterial clones (‘mini-prep’), or by colony-PCR. Sequences obtained from individual clones using either approach were identical (data not shown). Examination of the electropherograms of each clone sequenced using either approach showed no ambiguous peaks that would suggest that either the colony-PCR or the bacterial propagation was individually contributing to the sequence reported (for an example, see Supplementary Fig. 1). 
Thus, although some amplification errors may have been introduced during either the colony-PCR or the bacterial propagation, this did not influence the result, with each reported sequence representing a PCR amplicon. The more rapid approach of sequencing the colony-PCR product was therefore used for subsequent experiments.

Comparison of the 20 cloned sequences from amplification of a1 and a2 revealed that several poly(T) indels and substitution mutations were introduced in vitro (Table 2). Across the poly(T) tracts in this COII template, four clones had a single T deletion (−1T), three at a tract of 10 Ts (Fig. 1), and one at a tract of 9 Ts. This resulted in a poly(T) variation rate of 1.90 × 10⁻³ indels/poly(T) tract >6 Ts/cycle. In this study, indels at poly(T) tracts were only observed at tracts longer than 6 Ts. Thus, this rate will hereafter be referred to as indels/poly(T)/cycle. Substitution mutations were present in 10 clones, and included 13 transition mutations and one transversion, giving a substitution mutation rate of 7.30 × 10⁻⁵ mutations/bp/cycle (Table 2).
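As a worked check of these per-cycle rates, the sketch below divides the number of events by the number of opportunities (clones × sites per clone × cycles). This formula is an assumption, not one stated in this section; the 35 cycles come from the cycling conditions given in the Methods, and the figure of three poly(T) tracts >6 Ts in the H. cajani template is likewise an assumption made for illustration. The function name is hypothetical.

```python
def per_cycle_rate(events, clones, sites_per_clone, cycles):
    """Events per site per PCR cycle, averaged over all sequenced clones.

    ASSUMED formula: events / (clones * sites_per_clone * cycles).
    """
    opportunities = clones * sites_per_clone * cycles
    if opportunities <= 0:
        raise ValueError("clones, sites_per_clone and cycles must be positive")
    return events / opportunities


CYCLES = 35  # PCR cycles, from the cycling conditions in the Methods

# Artefactual experiment: 4 single-T deletions in 20 clones.
# ASSUMPTION: three poly(T) tracts >6 Ts per template molecule.
poly_t_rate = per_cycle_rate(events=4, clones=20, sites_per_clone=3, cycles=CYCLES)

# 14 substitutions (13 transitions + 1 transversion) across 274 bp in 20 clones.
substitution_rate = per_cycle_rate(events=14, clones=20, sites_per_clone=274, cycles=CYCLES)

print(f"{poly_t_rate:.2e} indels/poly(T)/cycle")
print(f"{substitution_rate:.2e} mutations/bp/cycle")
```

Under these assumptions the sketch gives 1.90e-03 and 7.30e-05, which agree with the rates reported above.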

Table 2 Poly(T) variation and substitution mutation rates within an ~274 bp COII fragment amplified and cloned from both individual molecules isolated in vitro (i.e., artefactual variation) and from biological samples from a range of nematode species (i.e., biological + artefactual variation)

Fig. 1 Section of an alignment of the cloned COII gene fragment amplified from Heterodera cajani. Clones a1.1–a1.10 and a2.1–a2.10 were each amplified from a single molecule, thus any sequence variation is an artefact of amplification. The remaining clones were amplified from genomic DNA and thus represent biological + artefactual variation, with clones a1, a2 and bD1–bD17 amplified using a degenerate primer set, and clones bS1–bS16 amplified using a specific primer set. TL indicates the conceptual translation of the a1 clone

Biological + Artefactual Poly(T) Variation

The extent of biological + artefactual poly(T) variation was then assessed in the same COII region of H. cajani (i.e., cloned COII fragments amplified from genomic DNA templates). Both specific and degenerate primer pairs were used to amplify this COII fragment, from which 16 and 17 clones were sequenced (denoted bS1–bS16 and bD1–bD17, respectively, in Fig. 1). Poly(T) length variation was present in eight of the 33 clones; at a tract of 10 Ts, four clones had −1T and two clones had a single T insertion (+1T) (Fig. 1), and at a tract of 9 Ts two clones had −1T. This gives a poly(T) variation rate of 2.18 × 10⁻³ indels/poly(T)/cycle (Table 2). Substitution mutations were present in nine clones, and included six transition mutations and four transversions, giving a substitution rate of 2.98 × 10⁻⁵ mutations/bp/cycle (Table 2).

A two-tailed χ² test was performed to investigate whether the biological + artefactual poly(T) variation was significantly different from the artefactual poly(T) variation (www.graphpad.com/quickcalcs/contingency1.cfm). There was no significant difference (P = 0.985) between the rates of artefactual poly(T) variation and the artefactual + biological poly(T) variation in the clones analyzed for this COII fragment from H. cajani. There was also no significant difference (P = 0.061) between the artefactual substitution rate and the artefactual + biological substitution rate. This suggests that the majority of both the poly(T) variation and substitutions observed in the biological sample are artefacts of amplification. However, a more extensive investigation may be required to detect low levels of biological poly(T) variation at statistically significant levels.

Template-Specific Poly(T) Variation in a COII Gene Fragment

To determine the sequence characteristics associated with the generation of poly(T) variation, this same COII region was cloned from a range of plant parasitic nematodes of the Hoplolaimoidea superfamily (Table 1). For all nematodes studied, the 274 bp (271 bp in B. betulae) COII fragment described for H. cajani was analyzed. For each species 9–22 clones were characterized (Table 2). In each case, intraspecific sequence identity was 98–100%, suggesting that divergent numts (nuclear copies of mitochondrial DNA) were not being co-amplified and analyzed. Interspecific sequence identity ranged from 63–85%, providing diverse template molecules for the characterization of the generation of poly(T) variation. Despite 14 instances of poly(T) tracts 6 nts long, variation was only evident at tracts >6 Ts. Only the Heteroderidae family displayed poly(T) variation, though template molecules of the Hoplolaimidae and Meloidogynidae families had only one poly(T) tract >6 Ts, compared with 2–5 for the Heteroderidae species (Table 2). Across the Heteroderidae, the rate of poly(T) variation ranged from 0.65 × 10⁻³ to 2.54 × 10⁻³ indels/poly(T)/cycle (Table 2). Variation consisted of −1T at 20 poly(T) tracts (ranging from 7 to 11 Ts), and a deletion of 2 Ts (−2 Ts) at two tracts (of 9 and 10 Ts). It is difficult to predict whether the −2 Ts is the result of two single T deletion events or one deletion event involving two Ts.

All species except H. pratensis displayed intraspecific nucleotide substitutions, with substitution rates ranging from 0.70 × 10⁻⁵ to 2.98 × 10⁻⁵ mutations/bp/cycle (Table 2).

Template-Specific Variation in a COI Gene Fragment

A 625 bp central region of the COI gene in H. cardiolata and H. glycines was also amplified, cloned and sequenced. None of the clones displayed poly(T) variation, despite the template molecules having three and two poly(T) tracts of seven Ts, respectively. Of the 10 clones sequenced for H. cardiolata, there was one transversion mutation, giving a substitution rate of 0.64 × 10⁻⁵ mutations/bp/cycle. Of the 11 clones sequenced for H. glycines, there were two substitution mutations (one transition and one transversion), giving a substitution rate of 1.16 × 10⁻⁵ mutations/bp/cycle.

Template-Specific Poly(T) Variation Across the Entire COI Gene

The entire COI gene was amplified and cloned in H. glycines, both as two overlapping fragments, and as a larger fragment encompassing the entire COI gene. COI was sequenced in two clones and determined to be 1524 bp long. Start and stop codons were predicted as GTT and TAA, respectively. The two overlapping fragments characterized spanned 883 bp of the 5′ end and 937 bp of the 3′ end. Two clones of each overlapping fragment were also sequenced, thus each nucleotide of COI was sequenced in at least four clones, with the central portion sequenced in six clones.
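The sequencing depth quoted above follows from simple arithmetic on the fragment lengths. The sketch below assumes that the 5′ and 3′ fragments are anchored exactly at the two ends of the 1524 bp gene, as the text implies; the function and constant names are hypothetical.

```python
def fragment_overlap(gene_length, five_prime_len, three_prime_len):
    """Length (bp) of the central region covered by both end-anchored fragments."""
    return max(five_prime_len + three_prime_len - gene_length, 0)


GENE_LENGTH = 1524          # COI length in H. glycines (bp)
FULL_LENGTH_CLONES = 2      # clones of the fragment spanning the entire gene
CLONES_PER_FRAGMENT = 2     # clones sequenced for each overlapping fragment

overlap = fragment_overlap(GENE_LENGTH, five_prime_len=883, three_prime_len=937)

# Outside the overlap, a position is covered by one fragment only;
# inside the overlap, it is covered by both fragments.
min_depth = FULL_LENGTH_CLONES + CLONES_PER_FRAGMENT
central_depth = FULL_LENGTH_CLONES + 2 * CLONES_PER_FRAGMENT

print(overlap, min_depth, central_depth)  # 296 4 6
```

Under this assumption the central 296 bp are covered by six clones and the remainder of the gene by four, consistent with the coverage described above.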

There were 18 poly(T) tracts across the COI gene, ranging in length from 7 to 16 Ts. When all clones were compared, there were 13 instances of poly(T) variation across all tracts with eight or more Ts. These included six tracts with −1T, three with +1T, two with −2 Ts, and two with insertions of 2 Ts (+2 Ts) (based on the COI sequence translating the correct reading frame). This resulted in a poly(T) variation rate of 5.02 × 10⁻³ indels/poly(T)/cycle. This is twofold higher than any values reported for COII, though this may have been skewed by the relatively low number of clones analyzed.

Across all clones, there were five substitution mutations (all transitions), giving a substitution rate of 2.14 × 10⁻⁵ mutations/bp/cycle.

Features of Sites Prone to Poly(T) Length Variation

Analysis of the DNA sequences both up- and downstream of poly(T) tracts did not locate any sequence motifs or potential secondary structures associated with the poly(T) variation observed. If poly(T) variation was associated with a sequence motif, closely related species (with more similar DNA sequences) would be expected to have more similar poly(T) variation rates. However, there was no apparent correlation between poly(T) variation and phylogenetic proximity. For example, H. cajani and H. glycines are two of the more closely related species studied (both of the Schachtii clade of Heterodera nematodes) (Subbotin et al. 2001), yet have differing poly(T) variation rates (2.18 × 10⁻³ and 0.95 × 10⁻³ indels/poly(T)/cycle, respectively; Table 2). Further, there was no correlation between clones with poly(T) variation and those with nucleotide substitutions.

The only characteristic common to all sites displaying poly(T) variation was the length of the poly(T) tract. Comparison of poly(T) tract lengths and poly(T) variation rates across all clones for all indel types (Fig. 2) identified no variation at any tract with six or fewer Ts, and a linear increase in variation rate with increased poly(T) tract length for tracts 7–16 Ts long. Observed individually, only the −1T indels appear to follow this trend (Supplementary Table 1). However, the distribution of other indels (+1T, −2 Ts and +2 Ts) at poly(T) tracts is more sporadic, and there are insufficient occurrences to establish a trend.

Fig. 2 Relationship between the rate of poly(T) variation and poly(T) tract length of all indel types in COI and COII for all Hoplolaimoidea species studied

Across the 630 poly(T) tracts longer than 6 Ts that were analyzed in all COI and COII clones, the overall rate of poly(T) variation was 1.95 × 10⁻³ indels/poly(T)/cycle. Of the indels observed, 74% were single T deletions, 12% were single T insertions, 9% were deletions of 2 Ts, and 5% were insertions of 2 Ts (see Supplementary Table 1).

Discussion

Features Associated with Poly(T) Sites

Analysis of the generation of poly(T) variation in vitro revealed that the majority of poly(T) variation was artefactual, not biological. Further, sequence analysis using multiple clones of COI and COII gene fragments in the mtDNA of Heteroderidae nematodes revealed poly(T) variation throughout many poly(T) tracts in these template molecules. This allowed the characteristics associated with poly(T) variation to be assessed. Single T deletions were the most frequent type of poly(T) variation, accounting for 74% of the total poly(T) variation observed. Overall, contractions were five times more frequent than expansions. Studies of homopolymers in nuclear DNA have found similar trends, with contractions 5–10 times more frequent than expansions (Kroutil et al. 1996; Kunkel 1986; Shinde et al. 2003; Tindall and Kunkel 1988). It was proposed that this may be due to contractions being more energetically favorable than expansions (Kroutil et al. 1996; Shinde et al. 2003). These studies predict a simple slippage model involving one base per indel event. Our observations support this model, with indels involving one nucleotide sixfold more frequent than those involving two. Thus, the observed indels of −2 Ts are likely the result of two successive single indel events. Further, poly(T) variation rates were found to increase with increasing tract lengths. Similar trends of increased mutation rate with increasing number of repeats have been reported for monomeric A/T microsatellites (Bacon et al. 2001; Lai and Sun 2003; Shinde et al. 2003) and for short poly(T) tracts up to 5 nts (Kroutil et al. 1998). If these errors arise by slippage of the polymerase or template molecule during replication, longer homopolymer runs may stabilize (and hence maintain) the mutation more efficiently (Kunkel 1990). The binding capacity of the polymerase active site is also thought to be significant. 
Typically there are 7–9 nts in the active site during replication; however, this can increase to up to 12 nts (Eom et al. 1996). Shinde et al. (2003) suggest that saturation of the active site with repetitive sequences may also contribute to an increased mutation rate with increasing number of repeats. Despite this study demonstrating a considerably higher rate of poly(T) variation for tracts ≥11 Ts, there are too few instances to support a correlation between this and the nucleotide binding capacity of the polymerase enzyme.
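The two ratios quoted in this section (contractions versus expansions, and one-nucleotide versus two-nucleotide indels) can be recovered from the percentages of each indel type reported in the Results. The short sketch below shows the arithmetic; the dictionary keys are labels chosen for illustration.

```python
# Percentage of all observed poly(T) indels, by type (from the Results).
indel_share = {"-1T": 74, "+1T": 12, "-2T": 9, "+2T": 5}

contractions = indel_share["-1T"] + indel_share["-2T"]   # deletions
expansions = indel_share["+1T"] + indel_share["+2T"]     # insertions
single_nt = indel_share["-1T"] + indel_share["+1T"]      # one-nucleotide indels
double_nt = indel_share["-2T"] + indel_share["+2T"]      # two-nucleotide indels

contraction_ratio = contractions / expansions
single_ratio = single_nt / double_nt

print(f"contractions : expansions = {contraction_ratio:.1f}")  # 4.9
print(f"single : double indels = {single_ratio:.1f}")          # 6.1
```

These values round to the fivefold and sixfold differences described above.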

Consideration of Poly(T) Errors Introduced In Vitro

The majority of genomic DNA studies involve in vitro DNA amplification using a polymerase. These studies often require downstream cloning and expression experiments which rely on accurate DNA replication during amplification. However, the generation of amplification artefacts in vitro is often understated despite the potential implications of not accounting for artefacts if they are introduced.

Polymerase Error Rates Associated with Homopolymer Tracts

Studies of homopolymer polymerase errors have estimated error rates at 1.5 × 10⁻² mutations/repeat/cycle in non-coding microsatellites (Shinde et al. 2003) and between 0.42 × 10⁻⁶ and 90 × 10⁻⁶ mutants/colony-forming unit/detectable site in a bacterial gene, lacZ (Kroutil et al. 1998). However, the differences in methodology between these studies and the present study (which estimated 1.9 × 10⁻³ indels/poly(T)/cycle) are significant and confound rate comparisons. The studies differed in the type of polymerase and template molecule used, and in the mutation detection and rate estimation methods. For example, this study estimated polymerase errors from a proofreading enzyme mix across poly(T) tracts up to 16 Ts long in isolated clones derived from biological templates. In comparison, Shinde et al. (2003) assessed the mutation rate of Taq polymerase (i.e., without a proofreader) across plasmids containing synthesized poly(A) inserts of 8–12 As. Poly(A) variation was detected via analysis of ‘stutter peaks’ associated with the primary peak during capillary electrophoresis of fluorescently labeled PCR products. Mutation rates were estimated from this using a complex mathematical model that accounted for replication variables associated with (for example) increasing cycle number. Further, Kroutil et al. (1998) assessed ‘gap-filling’ Taq polymerase error rates at poly(T) tracts up to five Ts long. Using the bacterial lacZ gene, mutants were detected by blue/white colony screening and sequence analysis. Thus, comparisons with other studies using long-PCR polymerase mixes may be more appropriate (although these have not examined the mutation rate at homopolymers). Mutation rates of Taq/Pfu and Klentaq/Pfu polymerase mixes were reported as 5.6 × 10⁻⁶ and 3.9 × 10⁻⁶ mutations/bp/duplication, respectively (Cline et al. 1996). 
These rates are approximately three orders of magnitude less than the poly(T) error rate observed in this study, and one order of magnitude less than the in vitro substitution rate in this study (7.30 × 10⁻⁵ mutations/bp/cycle). This further indicates the propensity towards polymerase errors at poly(T) tracts relative to other DNA sequences. Regardless, there remains a need to identify genomes susceptible to in vitro homopolymer length variation.

Genomes Susceptible to Polymerase Indel Errors

Analysis of the mtDNA sequences in GenBank from other species (Table 3) demonstrates a high frequency of poly(T) tracts >6 Ts in many other parasitic nematodes, but not in free-living nematodes, such as Caenorhabditis elegans. The mitochondrial genomes of the parasitic nematodes are therefore at high risk of indel errors being introduced in vitro. Characterizing the mtDNA of parasitic nematodes has previously been described as difficult, suspected to be a result of their high AT content (Hu et al. 2007). Poly(A/T) tracts contribute to this high AT content. For example, Radopholus similis and Heterodera glycines of the Chromadorea class of nematodes have the highest AT contents of any animal mtDNA sequenced to date, with 85.4% (Jacob et al. 2009) and 82.6% AT content (T. Gibson et al., unpublished), respectively (cf. 76.2% in C. elegans; GenBank accession no. NC_001328). R. similis and H. glycines also have some of the highest incidences of poly(T) tracts, with 161 and 347 poly(T) tracts >6 Ts, respectively (cf. 28 in C. elegans; Table 3). With poly(T) tracts so prevalent in parasitic nematodes, it is vital that the potential for poly(T) variation be recognized both during annotation, and during cloning and expression experiments. The annotation of R. similis mtDNA utilized sequence data from both direct sequencing of genomic DNA and from sequencing of plasmid DNA containing expressed sequence tags (ESTs) derived from mRNA. Different polymerases were used for each. Poly(T) variation was identified and the biological poly(T) length predicted from comparison of corresponding genomic DNA and EST sequences. Thus, for analyzing genomes with many homopolymers, methodologies need to be scrutinized to ensure they factor in the potential for homopolymer length variation.
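A screen of this kind can be sketched in a few lines. The example below counts maximal runs of T longer than 6 nts and computes AT content for a sequence supplied as a plain string. It is a minimal illustration, not the procedure used to compile Table 3: the function names and the toy sequence are hypothetical, and only the strand supplied is scanned (an assumption; poly(A) runs, which correspond to poly(T) on the opposite strand, are not counted).

```python
import re


def count_poly_t_tracts(sequence, min_length=7):
    """Count maximal runs of T that are at least min_length nts long.

    The greedy pattern consumes each run in full, so a long tract is
    counted once rather than once per window.
    """
    pattern = re.compile("T{%d,}" % min_length)
    return len(pattern.findall(sequence.upper()))


def at_content(sequence):
    """Percentage of A and T bases in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("A") + seq.count("T")) / len(seq)


# Toy sequence (hypothetical): runs of 6, 7 and 12 Ts.
toy = "ACG" + "T" * 6 + "G" + "T" * 7 + "CA" + "T" * 12
print(count_poly_t_tracts(toy))  # 2 (the run of 6 Ts is below the threshold)
print(f"{at_content(toy):.1f}% AT")
```

For a real genome, the sequence string would be read from a downloaded GenBank or FASTA record before being passed to these functions.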

Table 3 Incidence of poly(T) tracts longer than 6 nucleotides in the entire mtDNA of several animal species described as having many homopolymers, and relatives of these species also found to have many poly(T) tracts

Strategies for Minimizing Polymerase Errors and Their Potential Impact

Strategies for minimizing in vitro errors should be considered when amplifying from genomes suspected of having many homopolymers. For amplification across homopolymers, the choice of polymerase has been shown to have a significant impact on fidelity (Fazekas et al. 2010). DNA polymerases Bst and Tfu have demonstrated resilience to replication slippage, suggested to be a result of the strand-displacement activity of the enzyme (Viguera et al. 2001). Fusion polymerases, such as Phusion™ and Herculase® II Fusion, have also shown improved sequencing quality over other polymerases. The additional binding domain of fusion polymerases is thought to increase the contact between the polymerase and template molecules, and hence to improve stability and fidelity during replication (Fazekas et al. 2010). Although including a polymerase with proofreading activity can improve polymerase fidelity sevenfold (Bebenek and Kunkel 1990), Clarke et al. (2001) found no difference in in vitro homopolymer errors between Taq and Pfu DNA polymerases. Further studies are required to determine the optimal polymerase, or combination of polymerases, for accurately amplifying genomes with long homopolymer tracts. Increasing the amount of template DNA in the reaction while decreasing the number of cycles yields a similar quantity of product from fewer rounds of replication, which reduces the impact of length variation introduced early in the PCR. This strategy is further supported by observations that mutations occur more frequently during later PCR cycles (Eckert and Kunkel 1991). During sequence analysis, homopolymer tract lengths should be interpreted carefully. To ensure that all tract lengths are accurate, amplification and sequencing should not rely on a single approach.
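The trade-off between template amount and cycle number can be illustrated with a deliberately idealized model. The Python sketch below assumes perfect doubling in every cycle, no plateau, and equal amplification of all molecules; none of these hold exactly in a real PCR, and the function names and copy numbers are hypothetical. Under these assumptions, a length variant arising on a single molecule during cycle k makes up 1/(N0 × 2^k) of the final product, where N0 is the number of starting template copies.

```python
import math


def cycles_for_yield(template_copies, target_copies):
    """Cycles needed to reach `target_copies`, assuming ideal doubling."""
    if template_copies >= target_copies:
        return 0
    return math.ceil(math.log2(target_copies / template_copies))


def variant_fraction(template_copies, cycle_of_error):
    """Fraction of the final product descended from one molecule that
    acquired a length variant during `cycle_of_error` (1-based).

    After that cycle there are template_copies * 2**cycle_of_error
    molecules, one of which is the variant; ideal doubling of all
    molecules keeps this fraction constant in later cycles.
    """
    return 1.0 / (template_copies * 2 ** cycle_of_error)


if __name__ == "__main__":
    target = 1e12  # hypothetical desired number of product molecules
    for n0 in (1e3, 1e6):
        print(
            f"template={n0:.0e}: {cycles_for_yield(n0, target)} cycles, "
            f"first-cycle variant fraction={variant_fraction(n0, 1):.1e}"
        )
```

In this toy model, a thousandfold increase in template reduces the number of cycles needed for the same yield by about ten and reduces the share of product derived from any single first-cycle error by the same factor of a thousand.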

Sequencing Strategies for Genomes with Extensive Homopolymer Tracts

There is no error-free method currently available for sequencing across homopolymer tracts. Direct sequencing across poly(T) regions using amplification products can be difficult, particularly when longer tracts are concerned. As the length of the tract increases, the rate of indel errors increases, and so does the proportion of length variants that are sequenced. This is of particular concern when the error is introduced during an early cycle of the PCR, as it will be a significant component of the final product. During sequencing, this causes mixed signals in the sequence data downstream of poly(T) regions. This becomes progressively worse as the number of poly(T) tracts in the sequence increases. A study by Madani et al. (2010) encountered this problem when sequencing the mtDNA cytochrome b gene of G. pallida and G. rostochiensis. They were forced to clone the PCR product and sequence from the plasmid DNA of a single colony to obtain readable sequence data. This ensured that only one DNA amplicon was being sequenced. However, poly(T) variation was still present in many of the cloned sequences, preventing conceptual translation into a functional protein. If poly(T) variants are introduced during early PCR cycles, they may still form a considerable portion of the amplified molecules, and it becomes a matter of chance as to whether the plasmid sequenced contains DNA representative of the biological poly(T) tract length. Further, substitution errors introduced in vitro may also be cloned and sequenced. Thus, it is necessary to sequence several clones to ensure the accuracy of the sequence data. Next-generation sequencing technologies, such as the 454 sequencing platform, also have difficulties accurately sequencing tracts >6 Ts. These technologies are based on the detection and analysis of light emissions that are proportional to the number of identical nts incorporated adjacently. However, for long homopolymers, the detector becomes saturated with light and the base-calling software can no longer accurately interpret the amount of light emitted. This is predicted to occur when homopolymer tracts of seven or more nts are sequenced (Mardis 2008). Despite this, a study by Jex et al. (2010) evaluated the use of 454 technology to sequence the mtDNA of a chromadorean nematode, Haemonchus contortus, which has an AT content of 78.1% and poly(T) tracts up to 9 Ts. Although they observed many errors associated with homopolymers, the final 14,055 bp mtDNA consensus sequence was >99.7% accurate. Thus, the 40-fold depth of coverage achieved with the high-throughput analysis of 454 generally outweighed the frequency of base-calling errors. However, this sequencing approach was still preceded by PCR amplification from genomic DNA, so sequencing methods that avoid amplification altogether would be superior. Moreover, the longest poly(T) tract of H. contortus mtDNA is only 9 nts. Thus, although this single approach appears sufficient for H. contortus, it may not be suitable for mtDNA with longer homopolymer tracts, such as that of nematodes of the order Tylenchida, to which R. similis and H. glycines belong. (See Supplementary Fig. 2 for the frequencies of different tract lengths for the nematode species listed in Table 3.)
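The reasoning behind sequencing several clones, or relying on deep read coverage, can be sketched as a simple majority vote over the tract lengths observed in independent reads. The Python example below is a minimal illustration rather than the method of any study cited here: the function name, the 50% support threshold, and the read counts are assumptions. A majority vote also presumes that errors are not systematically biased towards one length, which may not hold for long tracts.

```python
from collections import Counter


def consensus_tract_length(observed_lengths, min_support=0.5):
    """Return (length, support) for the most frequent tract length.

    `length` is None when the most frequent length is seen in less than
    `min_support` of the reads, flagging the tract for confirmation by
    an independent method.
    """
    if not observed_lengths:
        raise ValueError("at least one observation is required")
    counts = Counter(observed_lengths)
    length, n_reads = counts.most_common(1)[0]
    support = n_reads / len(observed_lengths)
    return (length if support >= min_support else None), support


if __name__ == "__main__":
    # Hypothetical 40 reads spanning a single poly(T) tract.
    reads = [9] * 30 + [8] * 7 + [10] * 3
    print(consensus_tract_length(reads))

    # Hypothetical clones of a longer tract with no clear majority.
    clones = [11, 12, 12, 13, 11, 10]
    print(consensus_tract_length(clones))
```

A tract returned as None in such a scheme would be a candidate for confirmation with a second polymerase or template type, in the manner of the genomic DNA and EST comparison described for R. similis.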

These observations have significant implications for amplification, sequencing, annotation, and protein expression studies. The potential for the generation of poly(T) variation should be a consideration for all studies involving genes with many homopolymers longer than 6 nts.