Introduction

Soil is an important type of forensic evidence because it is spatially variable, frequently transferred during criminal acts, and can be overlooked as evidence of the crime by the perpetrator. The examination and characterization of soil material has been performed by forensic geologists for decades, and the information gleaned is often applied to constrain the circumstances of a crime. For instance: could the soil on the suspect’s shoe have been derived from the crime scene, or does it match an alibi location? In provenance cases, where the amount of material submitted is usually extremely limited (milligrams to a few grams), it is imperative that all components of the sample be analyzed to obtain as much probative information for investigative leads as possible. Biological material, such as plant and insect fragments, is often present in soil evidence but rarely taxonomically identified. This biological material could provide useful information, particularly in provenance cases, considering that plant and insect species inhabit specific ecosystems and may be present at specific times of the year. Traditionally, taxonomic identification of biological material is performed based on morphology. However, given that the majority of biological material observed in soil evidence is either a fragment or an incomplete specimen, morphological identification is not straightforward. In these circumstances, using DNA for identification is an attractive alternative approach, as it is present in all biological tissues and can be viable even in material not optimally preserved [1,2,3], such as would be expected in evidence samples.

The term DNA barcoding, which refers to the use of a standardized sequence of DNA, typically 400–800 base pairs in length, was coined in 2003 to describe a molecular approach for taxonomic identification [4]. Although DNA barcoding has received some criticism [5,6,7,8,9], it has gained broad acceptance given its application beyond taxonomy, to areas such as ecology, population genetics, and conservation [10,11,12,13,14], for monitoring and tracking invasive and economic pests [15,16,17,18,19,20,21] and in forensics [22,23,24,25]. The utility of DNA barcoding for species-level discrimination of unknown samples hinges on access to, and the ability to search, databases of reference barcode sequences that contain relatively complete coverage of the taxa of interest. There are currently two main public databases that contain DNA barcode data, the Barcode of Life Data Systems (BOLD) [26] and GenBank [27].

A 648-bp region of the mitochondrial cytochrome oxidase subunit I (COI) gene has been adopted as the standard barcoding region for animal/insect identification [4, 28,29,30], as it has a fast mutation rate and is found in high copies within tissues. The discrimination power of this region has been evaluated in more than 10,000 peer-reviewed articles and provides species-level resolution among vertebrates [31, 32] and invertebrates [33,34,35,36,37]. As COI evolves too slowly to facilitate species-level discrimination among plants, the Consortium for the Barcode of Life (CBOL) recommended the scientific community adopt a 2-locus barcode for discrimination among land plants: ribulose-1,5-bisphosphate carboxylase (rbcL) and maturase K (matK), both from the plastid genome [38]. The rationale surrounding the use of a 2-locus barcode is that although rbcL is more straightforward to amplify and sequence than matK, the level of resolution is limited (i.e., order and family as opposed to genus and species, respectively). Unlike rbcL, there is no universal primer pair to facilitate the amplification of matK across plants; thus, the taxonomic information obtained from rbcL can prove very useful for choosing the appropriate matK primers to ensure successful amplification (especially prudent when dealing with unknown material). In some plant groups, however, species discrimination using only these two markers is not possible, so a range of supplementary markers is often required to increase the level of species resolution (e.g., intergenic spacers trnH-psbA, atpF-atpH, and psbK-psbI and gene regions rpoB and rpoC1) [39, 40].

Current efforts that have used molecular-based approaches such as DNA barcoding to document the biodiversity within a soil sample have primarily been focused on a bulk metagenomic approach [41,42,43,44,45,46,47,48,49,50]. Using conserved primers for the desired barcode regions, individual taxa can be amplified and sequenced simultaneously (i.e., massively parallel sequencing) from a single bulk soil extraction. Although a metagenomic approach facilitates the collection of large amounts of data from potentially highly degraded samples, the current usefulness of this technique mainly lies with cross-sample comparisons; the operational taxonomic units (OTUs) identified in the unknown sample are compared to a series of knowns, to determine the level of similarity. Additionally, a large amount of soil (at least 100 mg) is needed for a DNA extraction [46], which would be problematic in forensic applications where sample mass is often limited and non-consumption analysis is preferred.

To enhance the forensic examination of soils, this study focused on developing a protocol for obtaining DNA barcode data from individual biological fragments isolated from forensic-type soil samples. Although protocols for obtaining DNA barcode data from both plants and insects have been well developed by the scientific community [29], these methods have been primarily optimized for fresh, pristine samples. Using these methods as a starting point, a DNA barcoding protocol was developed to work with both “new” and “old” biological material. The broad utility of the developed method was tested using fragments (n = 213) isolated from 11 soil samples collected from within Virginia, USA, which represent varied geology and ecohabitats. This paper outlines (1) the challenges with developing a protocol to obtain barcode data from forensic-type biological material, (2) the types of plant and insect fragments that are commonly recovered with surface soil samples (e.g., seeds, rootlets, legs, or heads), (3) whether such fragments contain viable DNA, (4) whether the appropriate DNA barcode regions could be amplified and sequenced using traditional Sanger methods, and (5) the level of taxonomic identification possible from barcode data when using public sequence databases (BOLD and GenBank).

Materials and methods

The protocol outlined below was originally developed and tested using two types of samples for both plants and insects: (1) new, fresh, and intact tissue collected immediately prior to extraction (surrogate positive control) and (2) old, fragmented tissue recovered from surface soil samples, which had been exposed to environmental conditions likely for several months (see Online Resource 1 for examples of old fragments).

DNA extraction

To remove any remaining soil particulates or fungal contaminants from the old samples, each fragment was submerged in a 5% bleach solution for 5 min, and subsequently washed three times with purified water [51]. After washing, each fragment was left to dry overnight in a drying cabinet (lid of the 1.5 mL centrifuge tube was left open). The length of each fragment, along with the dry weight of the plant fragments, was recorded. The insect fragments were not weighed given their extremely small size. Photographs were taken of each individual fragment using a Nikon D90 camera, to permit subsequent categorization. The total genomic DNA was isolated using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany) and the DNeasy Blood and Tissue DNA Purification Kit (Qiagen), for plant and insect fragments, respectively. To facilitate straightforward homogenization of the tissue, each fragment was snap frozen using liquid nitrogen and ground to a fine powder using a disposable mortar and pestle. The manufacturer’s protocols were followed for extraction with one exception: the DNA was eluted into two eluates of 50 μL of AE Buffer as opposed to one eluate of 100 μL, to increase the final DNA concentration.

Characterizing DNA quantity and purity

The quantity and purity of the extracted DNA were assessed using the Nanodrop ND-1000 (Thermo Scientific, Wilmington, DE, USA). AE Buffer was used to calibrate the blank of the instrument, and 1.5 μL of DNA eluate was used to obtain a reading. The quantity of DNA in each sample was recorded (ng/μL) along with the absorbance at descriptive wavelengths: 230 nm for phenols and humic acid; 260 nm for nucleic acids; and 280 nm for carbohydrates, proteins, and RNA.
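As an illustration of how such absorbance readings are typically turned into purity metrics, the following minimal Python sketch computes the A260/280 and A260/230 ratios and a concentration estimate. The factor of 50 ng/μL per absorbance unit is the conventional value for double-stranded DNA at a 1 cm path length; the function names and the readings are hypothetical and are not taken from this study.

```python
def purity_ratios(a230: float, a260: float, a280: float) -> dict:
    """Return the two absorbance ratios commonly used to judge DNA purity."""
    if a230 <= 0 or a280 <= 0:
        raise ValueError("A230 and A280 must be positive to form a ratio")
    return {"A260/280": a260 / a280, "A260/230": a260 / a230}


def dsdna_concentration_ng_per_ul(a260: float, path_cm: float = 1.0) -> float:
    """Estimate dsDNA concentration from A260.

    Assumes the conventional factor of 50 ng/uL per absorbance unit
    at a 1 cm path length (an assumption, not a value from the paper).
    """
    if path_cm <= 0:
        raise ValueError("path length must be positive")
    return a260 * 50.0 / path_cm


# Hypothetical reading, for illustration only.
example = purity_ratios(a230=0.25, a260=0.5, a280=0.25)
```

Low ratios relative to a clean reference extract would point toward co-extracted contaminants that absorb at 230 or 280 nm.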

Amplification

All amplifications were performed on a GeneAmp PCR System 9700 Thermal Cycler (Applied Biosystems, Foster City, CA, USA) using the primer pairs given in Table 1 and the cycling conditions outlined in Online Resource 2. Initially, all primer pairs were tested using a 20 μL reaction mix containing 0.4 μM of each primer, 2.5 mM MgCl2, 0.5 mM of each dNTP (Applied Biosystems), 5 U of AmpliTaq GOLD™ (Applied Biosystems), and 2 μL of genomic DNA (2 μL of nuclease-free water for the negative control and 2 μL of the new extract as a surrogate positive control). In additional experiments, KAPA3G Plant DNA polymerase (KAPA Biosystems, Wilmington, MA, USA), 2× KAPA Taq DNA polymerase (KAPA Biosystems), Q5® Hot Start High-Fidelity DNA polymerase (New England BioLabs Inc. [NEB], Ipswich, MA, USA), and the Q5® High-Fidelity DNA polymerase (NEB) were tested using the manufacturer’s suggested reaction mix constituents. Inhibitor removal steps or alternate PCR constituents were examined in some experiments and included betaine (Sigma-Aldrich [B-2754], St Louis, MO, USA), final concentrations of 1–2 M; polyvinylpyrrolidone (PVP; Sigma-Aldrich [P-5288]), final concentrations of 1–3% v/v; and dimethyl sulfoxide (DMSO; Sigma-Aldrich [D8418]), final concentrations of 3–10% v/v. Purification of extracted DNA was tested with the PowerClean® Pro DNA Cleanup Kit (Mo Bio Laboratories, Inc., Carlsbad, CA, USA), Agencourt® AMPure XP Reagent (Beckman Coulter, Inc., Brea, CA, USA) and an ammonium acetate precipitation (Sigma-Aldrich [A2706]).
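The final concentrations stated above can be converted to pipetting volumes with the dilution relation C1·V1 = C2·V2. The Python sketch below does this for the 20 μL reaction; the stock concentrations (10 μM primers, 25 mM MgCl2, a dNTP mix at 10 mM each) are assumptions chosen for illustration and are not reported in the text. Buffer and polymerase are left in the remaining volume because their stock strengths are not given.

```python
def volume_ul(final_conc: float, stock_conc: float, reaction_ul: float) -> float:
    """Volume of stock needed to reach final_conc in reaction_ul (C1*V1 = C2*V2)."""
    if stock_conc <= final_conc:
        raise ValueError("stock must be more concentrated than the final mix")
    return final_conc * reaction_ul / stock_conc


REACTION_UL = 20.0   # reaction volume stated in the text
TEMPLATE_UL = 2.0    # genomic DNA volume stated in the text

# name -> (final concentration from the text, ASSUMED stock concentration)
COMPONENTS = {
    "forward primer (uM)": (0.4, 10.0),
    "reverse primer (uM)": (0.4, 10.0),
    "MgCl2 (mM)": (2.5, 25.0),
    "dNTP mix (mM each)": (0.5, 10.0),
}


def plan_reaction(components: dict, reaction_ul: float, template_ul: float) -> dict:
    """Return per-component volumes plus the volume left for buffer, enzyme, and water."""
    volumes = {
        name: volume_ul(final, stock, reaction_ul)
        for name, (final, stock) in components.items()
    }
    volumes["template DNA"] = template_ul
    remaining = reaction_ul - sum(volumes.values())
    if remaining < 0:
        raise ValueError("components exceed the reaction volume")
    volumes["buffer, polymerase, and water"] = remaining
    return volumes


plan = plan_reaction(COMPONENTS, REACTION_UL, TEMPLATE_UL)
```

With these assumed stocks, each primer contributes 0.8 μL, MgCl2 2 μL, and the dNTP mix 1 μL, leaving roughly 13.4 μL for the remaining constituents.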

Table 1 Information on the targeted barcode regions and primer pairs used for amplification

Amplicon screening and purification

A total of 5 μL of PCR product and 1 μL of 6× loading dye (Promega, Madison, WI, USA) were loaded into a single well of a 1.2% agarose gel. To facilitate size quantitation of amplicons, 10 μL of 1 kbp DNA Ladder (Bioline, Taunton, MA, USA) was also run. Each gel was subjected to electrophoresis prior to ethidium bromide staining and visualization under ultraviolet light. ExoSAP-IT® (USB® Products, Cleveland, OH, USA), which digests any unincorporated primer and dNTPs, was used to purify amplicons. A total of 1 μL of ExoSAP-IT® was combined with every 5 μL of PCR product and incubated as per the manufacturer’s instructions. Purified samples were quantitated using the Agilent 2100 Bioanalyzer and the Agilent DNA 1000 kit (Agilent Technologies, Santa Clara, CA, USA) following the manufacturer’s protocol.

Sequencing and data analysis

Sequencing of ExoSAP-IT®-treated PCR products was performed using the ABI PRISM® BigDye™ Terminator Cycle Sequencing Kits (v3.1 for plant amplicons and v1.1 for insect amplicons) (Applied Biosystems). Each sequencing reaction contained 10 ng of purified PCR product, 3.9 μL of BigDye™ Ready Reaction Mix, and 0.175 μM of the appropriate forward amplification primer. Samples were subjected to the following cycling conditions on a GeneAmp 9700 Thermal Cycler: plant amplicons, 1× 96 °C for 1 min and 25× 96 °C for 15 s, 50 °C for 1 s, and 60 °C for 1 min, followed by a 4 °C hold; insect amplicons, 1× 96 °C for 1 min and 25× 96 °C for 15 s, 50 °C for 1 s, and 60 °C for 4 min, followed by a 4 °C hold. Individual sequencing reactions were purified using Centri-Sep™ strip columns (Princeton Separations, Freehold, NJ, USA) following the manufacturer’s protocol.

The sequencing products were separated using an ABI 3130xl Genetic Analyzer (Applied Biosystems), and Sequence Analysis 5.2 software (Applied Biosystems) was used for basecalling. Each sequence was manually edited using 4Peaks (Nucleobytes, Amsterdam, the Netherlands) to check for base ambiguities and to remove the primer sequences. The resulting edited nucleotide sequence was subjected to a nucleotide BLAST search (blastn, searching the “other” nucleotide collection database; available at http://blast.ncbi.nlm.nih.gov) and also searched against the appropriate BOLD database (available at www.boldsystems.org) to obtain a taxonomic identification.
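The primer-removal step was performed manually in this study. Purely as an illustration of the operation, the Python sketch below trims an exact forward primer from the 5′ end of a read and the reverse complement of the reverse primer from the 3′ end. The function names and the toy sequences are hypothetical; real barcoding primers often contain degenerate bases and read ends are frequently ambiguous, neither of which this sketch handles.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous (A/C/G/T) DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def trim_primers(read: str, forward_primer: str, reverse_primer: str) -> str:
    """Strip exact primer matches from both ends of a read.

    The forward primer is expected at the 5' end and the reverse
    complement of the reverse primer at the 3' end. Ends that do not
    match exactly are left untouched.
    """
    read = read.upper()
    fwd = forward_primer.upper()
    rev_rc = reverse_complement(reverse_primer)
    if fwd and read.startswith(fwd):
        read = read[len(fwd):]
    if rev_rc and read.endswith(rev_rc):
        read = read[: -len(rev_rc)]
    return read


# Toy sequences for illustration only (not real barcode primers).
trimmed = trim_primers("ACGTGATTACATCAA", "ACGT", "TTGA")
```

The trimmed sequence is what would then be submitted to the database searches described above.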

Broad assessment of the protocol

Once a protocol had been developed to work with both new and old samples, the broad utility of the protocol was tested on ~ 200 individual plant and insect fragments, isolated from 11 different soil samples collected within Virginia, USA (Online Resource 3). Soil was collected 0 to 3 cm below the litter layer, with the biological fragments isolated from the samples likely exposed to environmental conditions for several months (as soil collections were made in early winter and early spring, times separated from the major deposition of plant litter). For both COI and rbcL, when an amplicon for the entire barcode region could not be obtained, the “mini” primer pair was tested (primers fall within the entire barcode region) (Table 1). Given there is not a published mini primer pair for matK, a nested PCR, in which 3 μL of the initial amplification reaction mix was used as the DNA template rather than genomic DNA, was implemented using an internal primer pair (Table 1).
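The fallback order described in this paragraph can be written down as a small lookup, sketched here in Python. The dictionary labels and the `attempt_succeeded` callback are hypothetical names introduced only for illustration; in practice, success was judged from agarose gels.

```python
from typing import Callable, Optional

# Order of attempts per locus, as described in the text.
FALLBACKS = {
    "rbcL": ["full barcode", "mini primer pair"],
    "COI": ["full barcode", "mini primer pair"],
    "matK": ["full barcode", "nested PCR with internal primers"],
}


def choose_approach(
    locus: str, attempt_succeeded: Callable[[str, str], bool]
) -> Optional[str]:
    """Return the first approach that yields an amplicon, or None if all fail."""
    for approach in FALLBACKS[locus]:
        if attempt_succeeded(locus, approach):
            return approach
    return None


# Example: pretend the full-length barcode never amplifies.
def only_fallback_works(locus: str, approach: str) -> bool:
    return approach != "full barcode"


matk_route = choose_approach("matK", only_fallback_works)
```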

Results and discussion

Inhibition with plant extracts

During protocol development, amplicons of the expected size were only observed on agarose gels from the new plant extract, regardless of the primer pair used. It was possible that the DNA from the old fragment was highly degraded, such that even amplification of the smallest plant region (rbcL mini, ~ 230 bp) was not possible. However, as inhibitors such as polyphenolic/aromatic compounds, polysaccharides, and humic acid are common in plant material [57] and are known to interfere with PCR both directly and indirectly [58,59,60,61], it was also possible that such compounds were co-isolated. To confirm whether inhibitors were present in the old plant extract, an inhibition assay was performed using different sources of untreated DNA: (1) only new plant, (2) only old plant, (3) both new and old plant (with the final amount of DNA from both extracts being equal), and (4) negative control (nuclease free water). In the presence of the old DNA, the new DNA failed to amplify the fragment of interest, confirming the presence of inhibitors (Table 2). Three different strategies were used to address inhibition: incorporation of a second round of DNA purification, altering the constituents in PCR, and using an alternate specialized polymerase. Downstream efficacy was assessed using the inhibition assay outlined above for two different-sized fragments (~ 850 bp matK and ~ 230 bp rbcL mini).

Table 2 Steps taken to reduce PCR inhibition when amplifying barcode regions from old plant extracts

DNA purification

The efficacy of three DNA purification methods was tested individually for removing inhibitors from only the old plant extract (Table 2): (1) the PowerClean® Pro DNA Cleanup Kit, which utilizes a patented Inhibitor Removal Technology® to remove challenging impurities; (2) the Agencourt® AMPure XP Reagent, which uses magnetic bead technology to isolate all genomic DNA greater than 100 bp in length; and (3) ammonium acetate, to precipitate any polyphenolics and polysaccharides in the extract [62]. For each purification method, extracts obtained from a single old plant were purified in triplicate following the manufacturer’s protocol (methods 1 and 2) and as described by Miller [62] (method 3). All three methods were successful in removing the inhibitors present in the old extracts (Table 2, DNA purification panel). In instances where an amplicon was observed in the new and old reaction but not for the old alone, this was suggestive of degraded DNA (as seen by the absence of the large matK amplicon in Table 2).
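The reasoning behind the four-reaction inhibition assay can be summarized as a small decision function. The Python sketch below encodes the two interpretations stated in the text (failure of the mixed reaction indicates inhibitors; success of the mixed reaction with failure of the old extract alone suggests degraded DNA). The handling of a failed positive control or a contaminated negative control follows standard practice and is an assumption rather than something reported here.

```python
def interpret_inhibition_assay(
    new_only: bool, old_only: bool, new_plus_old: bool, negative: bool
) -> str:
    """Interpret amplicon presence (True) or absence (False) in each reaction."""
    if negative:
        # Assumed control rule: an amplicon in the water blank voids the run.
        return "invalid: amplicon in negative control"
    if not new_only:
        # Assumed control rule: the surrogate positive control must amplify.
        return "invalid: positive control failed"
    if not new_plus_old:
        # Old extract suppresses amplification of otherwise good template.
        return "inhibitors present in old extract"
    if not old_only:
        # No inhibition, yet the old template alone does not amplify.
        return "no inhibition detected; old DNA likely degraded"
    return "no inhibition detected; old DNA amplifiable"


# Pattern described for the untreated old plant extract.
untreated = interpret_inhibition_assay(
    new_only=True, old_only=False, new_plus_old=False, negative=False
)
```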

Modifying the PCR constituents

Additional experiments were performed to examine whether it was possible to suppress the activity of the inhibitors during PCR by modifying the constituents in the reaction mix. The commonly employed approach to lessen the impact of PCR inhibitors by reducing the volume of DNA extract added to the reaction mix [57] yielded no improvement in this study (volume of input DNA extract was decreased by ~ 10-fold; results not shown).

The addition of betaine, polyvinylpyrrolidone (PVP), and dimethyl sulfoxide (DMSO) to the reaction mix was investigated at a range of concentrations, which overlapped the levels previously documented to be effective in suppressing inhibitors (outlined in Table 2, PCR constituents panel) [57,58,59, 63,64,65]. Only PVP suppressed the inhibitors present in the old plant. In most instances, adding DMSO or betaine to the PCR reaction did not address the impact of inhibitors present in the old plant extract at any concentration; in some cases, DMSO had a negative impact on the new plant, by suppressing the amplification of both matK and the rbcL mini fragments.

Specialized polymerase

The KAPA3G Plant DNA polymerase is a high-efficiency polymerase formulated to improve tolerance to PCR inhibitors such as polyphenolics and polysaccharides and has previously permitted successful amplification with challenging samples [66, 67]. Therefore, we assessed the performance of this polymerase on the untreated/unpurified old plant extract. With the manufacturer’s suggested constituents for a 25-μL reaction and using the previously optimized cycling conditions (Online Resource 2), successful amplification of both the matK and rbcL mini barcode regions was achieved for the old plant (Table 2). The amount of product obtained when using the KAPA3G Plant DNA polymerase was far greater for both the new and old plant when compared to that obtained when using AmpliTaq GOLD™ (Fig. 1). Additionally, the KAPA3G Plant DNA polymerase provided strong and reproducible PCR amplifications for all of the plant primer pairs (Fig. 2). Given that the KAPA3G Plant DNA polymerase is not reported to repair DNA, amplification of the long matK fragment in the old extract (which likely contains only a few full-length, intact templates) may be due to the enzyme’s high efficiency. The resulting sequence data from amplicons generated using KAPA3G Plant DNA polymerase for all regions (matK, rbcL, and rbcL mini) were clean and matched the expected locus and taxa in GenBank and BOLD. Considering these results, the KAPA3G Plant DNA polymerase was used for amplifications in the broad assessment of the protocol, which utilized fragments isolated from forensic-type soils that likely contain similar inhibitors and DNA of suboptimal lengths.

Fig. 1

Amplification of the matK barcoding region (~ 850 bp; primers matK-KIM-1R/matK-KIM-3F) for both new and old plant fragments using AmpliTaq GOLD™ (lanes 2, 4) and the KAPA3G Plant DNA polymerase (lanes 3, 5). 1 kbp ladder shown (lane 1). Results shown are typical for those obtained from numerous experiments (n > 10)

Fig. 2

Plant and insect DNA barcoding region amplicons obtained using KAPA3G Plant DNA polymerase (lanes 2–5) and the Q5 Hot Start High-Fidelity DNA polymerase (lanes 6–7): (1) 1 kbp ladder; (2) ~ 850 bp matK (primers matK-KIM-1R/matK-KIM-3F); (3) nested ~ 830 bp matK (primers matK4La/matKMALPR1); (4) ~ 590 bp rbcL (primers rbcLa-F/rbcLa-R); (5) ~ 230 bp rbcL mini (primers rbcL1/rbcLB); (6) ~ 650 bp COI (primers LCO1490-L/HCO2198-L); (7) ~ 130 bp COI mini (primers uniminibarF1/uniminibarR1); (8) 1 kbp ladder

Amplification and sequencing of the insect barcode regions

Challenges with the 648 bp COI barcode fragment

When using AmpliTaq GOLD™ to amplify the 648 bp COI barcode region using the previously published primers (Table 1) and associated cycling conditions (Online Resource 2), only a faint band from the new extract was observed on an agarose gel (band for the old extract absent). By performing an inhibition assay similar to that employed for the plant extracts, the presence of inhibitors was ruled out as the reason for the failed PCR of the old insect extract (results not shown). Thus, it was likely that the failed amplification of the old insect was due to DNA degradation or low polymerase efficiency (given that only a faint band was observed with the new extract). To address this, a nested PCR was performed using 3 μL of the previous amplification reaction as template and the same initial amplification primers and cycling conditions. This approach yielded strong, clean amplicons of the expected size on gels for both the new and old insect extracts (old amplicon shown in Fig. 3a, Agilent electropherogram).

Fig. 3

Agilent DNA 1000 electropherograms (a, c) and Sanger sequencing electropherograms (b, d) for the old insect amplified using AmpliTaq GOLD™ in a nested PCR with a total of 80 cycles (a, b) and a non-nested PCR with Q5 Hot Start High-Fidelity DNA polymerase, using a total of 40 cycles (c, d). Agilent peaks denoted as follows: 1, lower marker; 2, the ~ 650 bp COI barcode region amplicon; 3, upper marker. The X-axis of the Agilent electropherograms is not linear, and the Y-axis reflects the relative concentration of the amplicons

Upon sequencing the nested COI amplicons, high background noise or mixed reads were observed in the sequence electropherograms, meaning the sequence was mostly unusable (Fig. 3b). To resolve this, a range of approaches known to improve the quality of the sequence data were systematically tested, including increasing the primer annealing temperature, decreasing separately the amount of dye and primer, adding DMSO in a final concentration of 5% v/v, and sequencing with alternate primers. None of these approaches produced reliable, clean sequence data. As the peak corresponding to the nested COI amplicon appeared somewhat broad at its base in the Agilent electropherogram (Fig. 3a), it is likely that obtaining clean sequence data was impeded by additional secondary products, either a few nucleotides shorter or longer than the desired fragment. Given that a nested PCR approach was utilized to obtain amplicons from both the new and old insects using AmpliTaq GOLD™, artifacts such as these can be expected.

To obtain clean sequences, reamplification of the COI barcode region from old insect DNA was tested using different polymerases (AmpliTaq GOLD™, Q5® Hot Start High-Fidelity DNA polymerase, Q5® High-Fidelity DNA polymerase, and 2× KAPA Taq polymerase) and with varying cycle numbers (40, 45, and 50) to increase the amount of product. When using either AmpliTaq GOLD™ or 2× KAPA Taq polymerase, at best faint bands of the expected size were observed, even when 50 amplification cycles were used (results not shown). Both of the NEB High-Fidelity polymerases produced strong amplicons at all cycle numbers; however, a number of strong secondary products were also visualized for the Q5® High-Fidelity DNA polymerase. The amplicon obtained when using the Q5® Hot Start High-Fidelity DNA polymerase and 40 amplification cycles appeared as a strong band on the agarose gel (Fig. 2, Lane 6) and a single peak on the Agilent after cleanup with ExoSAP-IT® (Fig. 3c), albeit at a lower concentration than that obtained with a nested PCR using AmpliTaq GOLD™ (Fig. 3a). Subsequent sequencing of this COI amplicon had limited background noise (Fig. 3d) and matched to the expected locus (COI) and insect (Danaus plexippus, monarch butterfly) in public sequence databases. To ensure clean, reproducible sequencing data when processing the forensic-type insect fragments, amplification of the COI barcode region was performed using Q5® Hot Start High-Fidelity DNA polymerase at 40 amplification cycles.

Optimizing the COI mini PCR

Considering that numerous papers have reported that amplifying COI mini using the uniminibar-F1/uniminibar-R1 primer pair is challenging [68, 69], a “touch-up” PCR is suggested [29] (Online Resource 2). When using the Q5® Hot Start High-Fidelity DNA polymerase with the published cycling conditions, a strong amplicon of the expected size (~ 130 bp) was obtained from the new and old extracts, along with numerous secondary products. A set of modified cycling conditions was identified that produced a single dominant amplicon; the annealing temperature in the first set of cycles was increased to 50 °C, and the extension time for all cycles was reduced to only 1 s (Fig. 2, Lane 7; Online Resource 2). The resulting sequence data were clean and reproducible for both extracts; however, given the small size of the amplicon, only ~ 100 bases could be used for downstream comparison to public databases after the removal of the primer sequence.

Utility of the developed protocol on forensic-type biological material

A summary schematic of the protocol developed to obtain DNA barcode data from forensic-type plant and insect fragments is given in Fig. 4, and protocol conditions that generate locus specific amplicons have been tabulated in Online Resource 2. The results outlined in the section below address the utility of this protocol for processing fragments isolated from soils collected across Virginia, which represent a broad range of parent soil and surface material, ecoregions, and pH (Online Resource 3).

Fig. 4

DNA barcoding protocol developed for processing biological materials isolated from forensic-type soil samples. ¹Final concentration of the reaction mix constituents and thermal cycling conditions used to amplify each of the barcoding regions are given in Online Resource 2

Characterization of fragments

Biological fragments were numerous (i.e., generally > 30) in most of the 11 soil samples; thus, a wide variety of fragment types were chosen to test the broad utility of the protocol. In total, 110 plant fragments and 103 insect fragments were processed and they were categorized as follows: plants—roots (24%), leaf (21%), branch (10%), bark (9%), entire seed (9%), casing of seed (8%), grass (2%), and other (17%); insects—unidentifiable part of exoskeleton (48%), thorax/abdomen (30%), leg (15%), head (5%), wing (1%), and spider’s web (1%). The average length of the insect fragments was far smaller than the plants, 1.8 ± 3.4 and 8.4 ± 7.3 mm, respectively. The average weight of the plant fragments was 1.8 ± 3.3 mg.

DNA quality and quantity

When only considering extracts for which the concentration was above the reliable detection limits of the Nanodrop (2 ng/μL), the average total DNA yields from plants and insects were 1.15 ± 3.7 and 0.45 ± 0.75 μg, respectively (Online Resource 4). The DNA purity of each extract was assessed based on absorption ratios at various wavelengths (A260/280 and A260/230). Unexpectedly, the insect extracts had higher levels of phenolics and humic acid, whereas the plant extracts contained considerable amounts of carbohydrates, proteins, and RNA (Online Resource 4). Researchers have documented high levels of humic acid and protein contamination either when extracting bulk soil samples [70] or individual degraded plant samples [71], using a range of commercially available kits.
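Total yield follows from the measured concentration and the elution volume. The Python sketch below applies the 2 ng/μL detection limit stated above and assumes, as a labeled simplification, that the two 50 μL eluates are treated as a single 100 μL volume at the measured concentration; the function names and example concentrations are hypothetical.

```python
from statistics import mean, stdev

DETECTION_LIMIT_NG_PER_UL = 2.0  # reliable Nanodrop limit stated in the text
ELUTION_VOLUME_UL = 100.0        # ASSUMPTION: two 50 uL eluates treated as one volume


def total_yield_ug(conc_ng_per_ul: float, volume_ul: float = ELUTION_VOLUME_UL) -> float:
    """Total DNA in micrograms from a concentration in ng/uL and a volume in uL."""
    return conc_ng_per_ul * volume_ul / 1000.0


def summarize_yields(concs_ng_per_ul: list) -> tuple:
    """Mean and sample standard deviation of yield for extracts above the limit."""
    usable = [c for c in concs_ng_per_ul if c > DETECTION_LIMIT_NG_PER_UL]
    if not usable:
        raise ValueError("no extracts above the detection limit")
    yields = [total_yield_ug(c) for c in usable]
    spread = stdev(yields) if len(yields) > 1 else 0.0
    return mean(yields), spread


# Hypothetical concentrations; the 1.0 ng/uL reading is excluded by the limit.
avg_ug, sd_ug = summarize_yields([1.0, 4.0, 6.0])
```

A standard deviation larger than the mean, as reported above for both sample types, indicates a strongly right-skewed distribution of yields.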

PCR and sequencing success

During protocol development, we confirmed that the reaction and cycling conditions for all primer pairs were reliable and specific, as the resulting sequence data matched the expected locus and taxa in public sequence databases. Given this, if a single band the same size as the surrogate positive control (the new extract) was observed on the agarose gel, the PCR was deemed successful. The entire barcode primers for both rbcL and COI returned a ~ 70% amplification success rate, whereas the matK barcode region was only amplified in a few samples (~ 5%; Table 3). A far greater amplification success rate for matK was observed when the nested PCR was implemented (Fig. 2, lane 3), and rbcL (full length or mini) was amplified in over 90% of fragments.

Table 3 Summary of PCR and sequencing success from 110 plant and 103 insect fragments isolated from forensic-type soil samples

Sequencing was deemed successful when clean sequence data (> 100 bp in length) was obtained from a purified PCR amplicon. At least two-thirds of all amplicons produced useable sequence data for downstream comparisons to public databases, with the majority of sequences being over 300 bp in length (Table 3). When a sequence was unusable due to high background noise, re-sequencing was attempted using the reverse amplification primer, with varying degrees of success. No distinguishable trends were observed based on the type of fragment (e.g., plants—leaf, roots, branch, bark, seeds; insects—legs, head, exoskeleton) and PCR or sequencing success.

Assessment of public sequence databases for taxonomic identification

All plant sequences (rbcL and matK) matched the expected locus when searched against GenBank (e.g., an rbcL sequence was identified as a portion of the rbcL locus) (Table 3), a reflection of high specificity in the primers and cycling conditions. When examining the taxonomic resolution obtained with DNA barcode sequences, the majority of rbcL and matK sequences achieved a minimum of order-level discrimination, with the resulting taxonomic identifications being highly concordant between the two public databases.

In instances where both rbcL and matK data are collected from a single sample, the taxonomic identification, especially at higher levels, should be congruent. In this study, 46 samples had sequence data from both rbcL and matK; however, high discordance (~ 75%) was noted in the taxonomic identifications from the two loci. In every case, the rbcL data indicated that the fragment was a pine species (Pinus, gymnosperm), whereas the matK data suggested the origin as an oak species (Quercus, angiosperm). Considering a nested PCR was implemented for matK using angiosperm primers, it was plausible that the matK data could be misleading. To verify this hypothesis, the intergenic spacer trnH-psbA (a supplemental plant barcoding locus) was amplified and sequenced (following the protocol outlined in [29]) for a subsample of the fragments in which discordance was noted. The trnH-psbA data confirmed the rbcL identifications; thus, if an amplicon is not obtained in an initial PCR with the matK-KIM primers, PCR should be performed with a primer pair degenerate to another plant group (perhaps Gym_F1A/Gym-R1A [72] for gymnosperms), instead of implementing a nested PCR. When only considering the rbcL data, family-level assignments were as follows: 58% Pinaceae (pine), 13% Fagaceae (oak/stone oak), 5% Vitaceae (grapes), 3% Brassicaceae (bittercress), and 3% Brachytheciaceae (moss), with the remaining 18% of fragments assigned to one of six other families. In this study, when using the developed DNA barcoding protocol, the level of plant biodiversity captured in the 11 soil samples was low, considering only rbcL data could be used reliably. With the analysis of more fragments, but more importantly the recovery of sequence data from the more discriminatory matK locus, better taxonomic resolution would be possible. The authors envisage limited difficulty in obtaining matK data from any fragment, when the KAPA3G Plant DNA polymerase is used in tandem with well-tested cycling conditions for alternate universal matK primer pairs (i.e., angiosperms, gymnosperms, ferns, and mosses).
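The cross-locus check described above, in which rbcL and matK assignments for the same fragment are compared, lends itself to a simple automated flag. The Python sketch below is one way this might be done at family level; the sample identifiers and the data layout are hypothetical and serve only to illustrate the comparison.

```python
def locus_concordance(assignments: dict) -> tuple:
    """Compare family-level calls from rbcL and matK for each sample.

    `assignments` maps a sample ID to a dict such as
    {"rbcL": "Pinaceae", "matK": "Fagaceae"}. Samples missing either
    locus are ignored. Returns (sorted discordant IDs, discordant fraction).
    """
    both = {
        sample: calls
        for sample, calls in assignments.items()
        if calls.get("rbcL") and calls.get("matK")
    }
    discordant = sorted(
        sample for sample, calls in both.items() if calls["rbcL"] != calls["matK"]
    )
    fraction = len(discordant) / len(both) if both else 0.0
    return discordant, fraction


# Hypothetical fragments mirroring the pine/oak conflict described in the text.
example_calls = {
    "frag01": {"rbcL": "Pinaceae", "matK": "Fagaceae"},
    "frag02": {"rbcL": "Pinaceae", "matK": "Fagaceae"},
    "frag03": {"rbcL": "Fagaceae", "matK": "Fagaceae"},
    "frag04": {"rbcL": "Pinaceae", "matK": "Fagaceae"},
    "frag05": {"rbcL": "Vitaceae"},  # matK missing, so not compared
}
flagged, rate = locus_concordance(example_calls)
```

Flagged samples would be candidates for confirmation with a third locus, as was done here with trnH-psbA.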

When examining the insect sequence data, despite ~ 75% matching to the COI locus in GenBank, only six sequences had a match in either public database to an organism from the class Insecta (Table 3); the best match for the vast majority of COI sequences was either to a fungus, marine invertebrate, algae, or uncultured bacterium. However, for any match, the similarity statistics were on average very poor and the average e-value from BLAST searches was higher than ideal. Given the extremely small size of the starting insect material (generally < 1 mm) and the known exposure of such fragments to prolonged environmental conditions, it was not surprising there was little insect DNA remaining for analysis. If more intact or larger insect fragments were processed using the developed protocol, the proportion of COI sequences matching to the class Insecta would likely increase, providing useful information for provenance cases. It is apparent that using the presence of an amplicon of the expected size on the agarose gel as a metric for PCR success provides a misleading representation of the likely downstream success of taxonomic identification.
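Because an amplicon of the expected size does not guarantee a meaningful identification, database hits can be screened before being counted as successes. The Python sketch below keeps only hits assigned to class Insecta that also pass identity and E-value cut-offs. The thresholds (97% identity, E-value of 1e-20) and all names are assumptions chosen for illustration; the study does not report specific cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseHit:
    taxon_class: str         # taxonomic class of the best match
    percent_identity: float  # similarity to the reference sequence
    e_value: float           # BLAST expect value (lower is better)


def credible_insect_hits(
    hits: list, min_identity: float = 97.0, max_e_value: float = 1e-20
) -> list:
    """Keep hits to class Insecta that pass both ASSUMED quality thresholds."""
    return [
        hit
        for hit in hits
        if hit.taxon_class == "Insecta"
        and hit.percent_identity >= min_identity
        and hit.e_value <= max_e_value
    ]


# Hypothetical best matches for three COI sequences.
hits = [
    DatabaseHit("Insecta", 99.1, 1e-60),
    DatabaseHit("Insecta", 82.0, 1e-5),          # insect, but a poor match
    DatabaseHit("Sordariomycetes", 98.5, 1e-50), # good match, but a fungus
]
kept = credible_insect_hits(hits)
```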

Conclusions

Using previously published studies as a guide, a protocol was developed that permits the collection of DNA barcode sequences from biological fragments exposed to environmental conditions. The utility of this developed protocol for taxonomic identifications was subsequently tested using 213 plant and insect fragments isolated from forensic-type soil samples collected within Virginia. Amplification and sequencing were straightforward, and the resulting sequence data matched the expected loci in public sequence databases. Despite this, the level of taxonomic discrimination was low, as a result of unreliable matK data and the absence of viable insect DNA. To capitalize on the application of this protocol for the identification of biological fragments encountered in forensic-type soil samples, further research should be focused on determining the number of fragments needed for analysis to sufficiently capture the biodiversity within a sample, along with impacts of seasonal variation. With the ever-advancing field of massively parallel sequencing (MPS), the developed protocol may need to be modified or a standardized protocol may be required to permit the collection of DNA barcode data from bulk soil samples. An MPS approach might assist with obtaining more information on the insect community, especially for samples in which individual insect fragments are very small and contain little viable DNA. However, for an MPS-based approach to be feasible within a forensic context where the evidence material is generally very limited, work is needed to optimize soil extractions for small sample amounts.