Introduction

The ability to properly and definitively identify the tissue types of biological samples found at a crime scene can assist in reconstructing the events of a crime and impact criminal litigation. Over 120,000 sexual assaults occurred in 2009 in the USA alone [1] and the number is likely under-reported. Given the volume of occurrence and heinous nature of the crime, providing law enforcement with an effective tool for semen detection would be beneficial.

A variety of non-DNA-based approaches to tissue identification exist, including presumptive chemical tests, cytological methods, protein catalytic activity and detection, spectroscopic methods, and microscopy. Limitations to these methods include lack of tissue specificity, consumption of sample, and/or variation in technologies that does not allow for parallel processing [2–8].

One promising method for tissue type determination is based on the difference of RNA expression in various tissue types. This method has been shown to be a robust and reliable approach even with relatively low quantities of RNA [9–11]. However, RNA typing has yet to gain wide acceptance, possibly because it does not allow for simultaneous DNA profiling and tissue type identification.

DNA typing methods are highly informative regarding individualization [12–16] but typically are not informative for determining the source tissue of a DNA sample. However, CpG methylation patterns of DNA are tissue specific and have been demonstrated to be capable of identifying the tissue type of DNA samples [17–19]. One approach to determining methylation patterns is to use a bisulfite reaction to convert non-methylated cytosines to uracil and then compare the base changes with a sequence from an untreated sample [20]. This methodology requires additional chemistries and sequencing of each target twice, once for the reference and once for the bisulfite-treated sample.

Another approach to the identification of tissue origin based on methylation patterns is to use CpG methylation-sensitive restriction enzyme digestion followed by polymerase chain reaction (PCR) of the targeted regions of DNA samples [17, 18]. In the assay, amplicons are selected in areas that will be preferentially methylated in some tissues versus others and contain a restriction site that will enable the identification of tissue(s). The DSI-Semen™ kit (Nucleix, Tel Aviv, Israel) is based on methylation-sensitive restriction digestion as described by Frumkin et al. [17, 18]. The kit is a multiplex assay containing primers specifically for the detection of semen. The assay contains three targets that are methylated in semen and two targets that are not methylated in semen but are methylated in other forensically relevant tissues. In addition, the assay targets two sites that are methylated in all forensically relevant tissues as positive amplification controls and one site that is unmethylated in all forensically relevant tissues as a digestion control. The electrophoresis-generated profiles result in characteristic signatures for undigested control, digested semen, and digested non-semen-derived DNA (Fig. 1 a–c, respectively) from amplicons that are all less than 160 base pairs in length.

Fig. 1

Electropherograms showing the peak signatures of DSI-Semen™ assay. a Eight amplification peaks exist in the assay and can be observed in an undigested sample. Numbered from left to right, there are two peaks that serve as positive amplification controls (peaks 1 and 8), one peak as a control for restriction digestion (peak 2), two peaks to indicate methylation patterns specific for non-semen tissues (peaks 3 and 4), and three peaks to indicate methylation patterns present in semen (peaks 5, 6, and 7). In samples that are properly digested, peak 2 will be absent (tailless arrowheads in b–e). In samples that are derived from semen, the third and fourth peaks will be absent (black arrows in c and e). In samples derived from a non-semen source, peaks 5, 6, and 7 are absent (gray arrows in b). Using these guidelines, the sample in b is derived from a non-semen source (buccal in this case), and the samples in c and e are derived from semen. The SourceIdentifier™ software properly identified the samples in b and c, but the sample in e (31 pg of total DNA) was sub-threshold for the algorithm; however, the peak signature for a properly digested semen-specific sample is observable. The sample in d, which has peaks 1 and 3–8, is representative of a sample containing a mixture of semen and non-semen

The system is accompanied by a proprietary analysis program, SourceIdentifier™ (Nucleix, Tel Aviv, Israel), to analyze the electropherograms and generate a probability of the tissue type associated with the sample. Thus, confidence can be conveyed with tissue source findings. The system was evaluated for its utility to detect semen in a variety of forensically relevant samples, and the results support that the DSI-Semen™ kit can be a useful tool for the analysis of appropriate forensic samples.

Materials and methods

Samples

Buccal, blood, semen, and saliva samples were collected from 26 unrelated males residing in Texas. Semen samples from unrelated males (four African American, one East Asian, four Southwestern Asian, one Southwestern Hispanic, and five Caucasian) were also kindly provided by Xytek (Atlanta, GA, USA). The samples were anonymized and collected in accordance with methods approved by the Institutional Review Board for the University of North Texas Health Science Center in Fort Worth, TX, USA. Frozen neat semen samples from horse (Equus ferus caballus), bovine (Bos taurus), and American bison (Bison bison) were kindly provided by Dr. James Derr at Texas A&M University College of Veterinary Medicine, College Station, TX, USA.

For limit of detection studies, a dilution series of DNA (2 ng, 1 ng, 500 pg, 250 pg, 125 pg, 62 pg, 31 pg, and 16 pg) derived from semen (n = 2) and buccal (n = 2) samples was assayed for tissue identification.
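The listed inputs are consistent with a serial two-fold dilution starting at 2 ng. The following minimal sketch assumes exact halving at each step (an assumption; the text only lists the rounded amounts) and prints the nominal input of each step in picograms. The function name is hypothetical.

```python
def twofold_dilution_series(start_pg: float, steps: int) -> list[float]:
    """Return nominal DNA inputs (pg) for a serial two-fold dilution.

    Assumes each step is an exact 1:2 dilution of the previous one.
    """
    return [start_pg / (2 ** i) for i in range(steps)]


# 2 ng starting input, eight dilution points as listed in the text
series = twofold_dilution_series(2000.0, 8)
for amount in series:
    print(f"{amount:g} pg")
```

Under this assumption, the last three nominal inputs are 62.5, 31.25, and 15.625 pg, which the text reports as 62, 31, and 16 pg.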

Known quantity mixtures were prepared by adding DNA from buccal (n = 2) and semen (n = 2) samples that were normalized to 250 pg/μL and mixed pairwise (semen to non-semen) in the following ratios: 19:1, 9:1, 6:1, 4:1, 2:1, 1:1, 1:2, 1:4, 1:6, 1:9, and 1:19—with a constant of 500 pg of total DNA per reaction. Mock casework single source and mixture samples were created by blotting or dropping collected fluids on cotton cloth and allowing them to dry at least overnight before processing. The sample types and number of samples analyzed are listed in Table 1 and Table S1 of the “Electronic supplementary material”.
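As an illustration of the mixture design, the sketch below computes the nominal amount of each component for every ratio at a constant 500 pg of total DNA, and the corresponding volume of a 250 pg/μL stock. It assumes that both stocks are exactly at the normalized concentration and that ratios are by DNA mass; the function and variable names are hypothetical.

```python
TOTAL_PG = 500.0         # total DNA per reaction (from the text)
STOCK_PG_PER_UL = 250.0  # normalized stock concentration (from the text)

# Semen : non-semen ratios listed in the text
RATIOS = [(19, 1), (9, 1), (6, 1), (4, 1), (2, 1), (1, 1),
          (1, 2), (1, 4), (1, 6), (1, 9), (1, 19)]


def mixture_components(semen_parts, other_parts, total_pg=TOTAL_PG):
    """Return (semen_pg, non_semen_pg) for a ratio at a fixed total input."""
    total_parts = semen_parts + other_parts
    semen_pg = total_pg * semen_parts / total_parts
    other_pg = total_pg * other_parts / total_parts
    return semen_pg, other_pg


for s, o in RATIOS:
    semen_pg, other_pg = mixture_components(s, o)
    # Volume of stock needed = mass / concentration
    print(f"{s}:{o}  semen {semen_pg:.1f} pg ({semen_pg / STOCK_PG_PER_UL:.2f} uL), "
          f"non-semen {other_pg:.1f} pg ({other_pg / STOCK_PG_PER_UL:.2f} uL)")
```

At the extreme 19:1 and 1:19 ratios, the minor component contributes only 25 pg, which is useful context when comparing mixture results with the limit of detection.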

Table 1 List of samples assayed with the DSI-Semen™ kit

UV exposure, DNA shearing, and inhibition studies

For UV exposure studies, neat semen was placed onto cotton cloth, allowed to dry overnight, and subsequently exposed to UV light at 254 nm and 120,000 μJ/cm² for 0, 30, 60, 120, and 240 min in a UVP CL-1000 ultraviolet crosslinker.

DNA shearing was accomplished using a Covaris S2x (Covaris, Woburn, MA, USA) adaptive acoustic shearing system according to the manufacturer’s recommendation to achieve relatively homogenous fragment sizes of 1,000, 500, and 150 bp. For a fragment size of potentially less than 150 bp, the amount of exposure time for the sample was doubled as the manufacturer does not provide recommendations for generating fragments below 150 bp.

For inhibition studies, fresh humic acid and hematin (Sigma, St. Louis, MO, USA) were substituted for the water component of the digest/PCR mix to achieve total concentrations of 0, 2.5, 5, and 7.5 ng/μL and 0, 2.5, 5, 7.5, and 15 μM, respectively.

Isolation of DNA and preparation of samples for analysis

DNA was isolated from buccal swabs and various mock casework samples using either the AutoMate Express® (Life Technologies, Carlsbad, CA, USA) or the DNA Investigator® Kit (Qiagen, Hilden, Germany) according to the manufacturers’ recommendations. The quantity of DNA was determined by qPCR using the Quantifiler® Quantification Kit and 7500 Real Time PCR® System (Life Technologies). Samples were then normalized to 250 pg/μL and stored at either −20 or −40 °C until amplification.

Restriction digest, amplification, and analysis

The DSI-Semen™ kit was kindly provided by Nucleix, and analysis was performed according to the manufacturer’s instructions. Briefly, with the exception of the limit of detection analysis, 500 pg of DNA was added to the restriction digest/amplification master mix. Then, the samples were incubated in a GeneAmp® PCR System 9700 (Life Technologies) for 15 min at 37 °C to facilitate restriction digestion (with the restriction enzyme HhaI, which is provided in the kit). The thermocycler amplification protocol was 95 °C for 11 min, followed by 30 cycles of 94 °C for 1 min, 59 °C for 1 min, and 72 °C for 1 min. To promote terminal adenylation, a final extension step was performed at 60 °C for 45 min. Samples were either immediately analyzed or stored at −20 °C until analysis.
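For planning purposes, the digest and amplification steps described above can be written as a simple protocol table. The sketch below sums the programmed hold times only; it ignores ramp times between temperatures, and the step labels are our own descriptive names rather than the manufacturer's.

```python
# (label, temperature in deg C, minutes per hold, number of repeats)
# Labels are illustrative; temperatures and times are those given in the text.
PROTOCOL = [
    ("restriction digest", 37, 15, 1),
    ("initial hold", 95, 11, 1),
    ("cycle step 1", 94, 1, 30),
    ("cycle step 2", 59, 1, 30),
    ("cycle step 3", 72, 1, 30),
    ("final extension", 60, 45, 1),
]


def total_hold_minutes(protocol):
    """Sum programmed hold times (ramp times are not included)."""
    return sum(minutes * repeats for _, _, minutes, repeats in protocol)


print(f"Programmed hold time: {total_hold_minutes(PROTOCOL)} min")
```

The programmed holds sum to 161 min, so the actual instrument run time will be somewhat longer once ramping is included.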

Each sample was prepared immediately prior to electrophoresis analysis by adding 1 μL of sample to an electrophoresis loading mix. The loading mix consisted of 8.5 μL of Hi-Di™ formamide (Life Technologies) and 0.5 μL of GeneScan™ ROX 500™ (Life Technologies) per sample. The samples were then subjected to electrophoresis on either an Applied Biosystems 3130xl or 3500xl Genetic Analyzer® (Life Technologies) according to the manufacturer’s recommendations. Electrophoretic data were analyzed using either GeneMapper® ID-X (Life Technologies) or SourceIdentifier™ v1.09 software (Nucleix). Peak height ratios for semen samples were obtained by dividing the intensity of the first semen-specific peak (the more intense peak in all samples tested) by the mean of the second and third peaks.
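The peak height ratio calculation described for semen samples can be sketched as follows. This is a literal reading of the sentence above, not code from the study; the function name and the example peak heights (in relative fluorescence units) are hypothetical.

```python
def semen_peak_height_ratio(first: float, second: float, third: float) -> float:
    """Ratio of the first semen-specific peak to the mean of the other two.

    Inputs are peak heights (e.g., in RFU) of the three semen-specific peaks.
    """
    mean_other = (second + third) / 2
    if mean_other <= 0:
        raise ValueError("second and third peaks must have positive mean height")
    return first / mean_other


# Hypothetical peak heights, for illustration only
ratio = semen_peak_height_ratio(1200.0, 900.0, 1100.0)
print(f"Peak height ratio: {ratio:.2f}")
```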

Results and discussion

Profile interpretation

Representative electropherograms of known semen and buccal swab samples are shown in Fig. 1. For an undigested sample, i.e., no HhaI added to the reaction mix, there are eight peaks observed in semen and non-semen samples (Fig. 1a). Peaks 1 and 8 are positive amplification control products which should be present in all samples. Peak 2 is a digestion control and should be absent in all completely digested samples. Peaks 3 and 4 are products of loci that are unmethylated in semen samples but are methylated in samples collected from other tested tissues (venous blood, menstrual blood, urine, saliva, buccal swabs, and vaginal swabs, according to the manufacturer’s information), and the presence of peaks 3 and 4 indicates that the source of the sample is a tissue other than semen (Fig. 1b). Peaks 5, 6, and 7 are products of loci that are methylated in semen but not in the other previously mentioned non-semen tissues (Fig. 1c). In samples that are mixtures of semen and another tissue, peaks 1, 3, 4, 5, 6, 7, and 8 will be present (Fig. 1d), and such a profile would be interpreted as a mixture of two or more tissues, one of which is semen. Figure 2 displays representative profiles from blood, buccal, saliva, and semen samples (Fig. 2b–e, respectively). An undigested sample is included as a reference sample.
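The visual interpretation rules described above can be summarized as a presence/absence decision scheme. The sketch below is only an illustration of those rules; it is not the SourceIdentifier™ algorithm, which works from relative peak heights. The function name and the labels returned for incomplete or unexpected patterns are our own assumptions.

```python
AMPLIFICATION_CONTROLS = {1, 8}  # should be present in all samples
DIGESTION_CONTROL = 2            # should be absent after complete digestion
NON_SEMEN_PEAKS = {3, 4}         # methylated in non-semen tissues
SEMEN_PEAKS = {5, 6, 7}          # methylated in semen


def interpret_profile(peaks_present) -> str:
    """Classify a profile from the set of peak numbers (1-8) that are present."""
    peaks = set(peaks_present)
    if not AMPLIFICATION_CONTROLS <= peaks:
        return "no result (amplification control missing)"  # assumed label
    if DIGESTION_CONTROL in peaks:
        return "undigested or incompletely digested"
    has_semen = SEMEN_PEAKS <= peaks
    has_non_semen = NON_SEMEN_PEAKS <= peaks
    if has_semen and has_non_semen:
        return "mixture containing semen"
    if has_semen and not (NON_SEMEN_PEAKS & peaks):
        return "semen"
    if has_non_semen and not (SEMEN_PEAKS & peaks):
        return "non-semen"
    return "inconclusive"  # assumed label for partial patterns


print(interpret_profile({1, 5, 6, 7, 8}))
print(interpret_profile({1, 3, 4, 8}))
print(interpret_profile({1, 3, 4, 5, 6, 7, 8}))
```

Treating any partial pattern as inconclusive is a deliberately conservative choice in this sketch; in practice, the peak heights would also need to be considered.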

Fig. 2

Profiles of multiple tissue types using the DSI-Semen™ assay. Sample profiles obtained using the DSI-Semen™ assay from blood (b), buccal (c), saliva (d), and semen (e). A profile from an undigested sample (a) is provided for visual reference

A probability of error value is generated by the accompanying analysis program, SourceIdentifier™. The algorithm measures the relative peak heights of the individual loci and compares the relative intensity of the tissue identification peaks to the mean amplification control peak height. Based on this ratio, an identification of the source tissue on a per locus basis is assigned. The product of the likelihood ratios at all of the loci for a tissue type is then subtracted from 1 for a probability of error.
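The final step of this calculation can be illustrated with a short sketch. Because SourceIdentifier™ is proprietary, the per-locus values used here are hypothetical, and the sketch assumes that each per-locus value is expressed on a 0 to 1 scale so that subtracting their product from 1 yields a probability. It should not be read as the program's actual implementation.

```python
import math


def probability_of_error(per_locus_values) -> float:
    """Return 1 minus the product of per-locus values.

    Assumes each value lies in [0, 1]; this scaling is an assumption,
    not a documented property of the software.
    """
    values = list(per_locus_values)
    if not values:
        raise ValueError("at least one locus value is required")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("per-locus values must lie in [0, 1]")
    return 1.0 - math.prod(values)


# Hypothetical per-locus values for five tissue identification loci
print(probability_of_error([0.999, 0.998, 0.999, 0.997, 0.999]))
```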

Limit of detection

Samples above 1 ng could not be analyzed by SourceIdentifier™ because the peak heights were above the threshold for the analysis program. However, with GeneMapper® ID-X, the requisite peaks for proper identification (peaks 1, 5, 6, 7, and 8 for semen and peaks 1, 3, 4, and 8 for non-semen) were readily detected, and the digestion control peak was absent. SourceIdentifier™ was able to analyze samples between 1 ng and 62 pg of total DNA, and correct tissue sources were obtained. A representative electropherogram from a sample at 250 pg of total DNA is shown in Fig. 1c. Identification of the tissue source was inconclusive at 31 pg of DNA using the analysis program; however, visually (Fig. 1e), the presence of the semen-specific peaks and the absence of the digestion control and non-semen-specific peaks were clearly observable. The visual calls were unambiguous. The profile in Fig. 1e is consistent with a DNA sample derived from semen even though it is sub-threshold for the kit’s analytical software program. The overall sensitivity of detection of the system is quite good, and the results observed are consistent with the dilution series. However, as a cautionary note, the quantity of the sample was determined prior to generating the dilution series, and as such each quantity tested might be slightly inaccurate due to pipetting error. While the sensitivity of detection results serve as a guide, a requisite internal validation study should be performed to determine the sensitivity of detection within a laboratory.

Figure 1e clearly demonstrates that the sensitivity of manual interpretation is greater than that of SourceIdentifier™ for identifying semen. The semen-specific peaks, although of low signal, are clearly visible. The software could have been designed differently using other parameters such that it would allow the interpretation of a lower signal. However, this would necessarily come at the expense of decreased specificity. If the sensitivity of the software is increased, small peaks originating from random noise could result in a false-positive reading. The parameters for the software were selected to be highly specific to minimize the number of false positives for semen detection, as a false detection of semen would have serious consequences. In our experience, the software is far too stringent. Therefore, manual interpretation is advocated as an adjunct to provide valuable information even when the software is unable to provide identification due to low signal.

Reproducibility

Semen (n = 5) and buccal (n = 5) derived DNA samples were assayed in ten replicates each. All ten replicates for the ten samples (five semen and five non-semen) yielded a correct result for their respective tissue source. Peak height ratios were examined for the amplification control peaks (Fig. 3a, b) and for the semen (Fig. 3a) and non-semen (Fig. 3b) peaks. When combined across individual samples of the same type, the mean peak height ratio was 87.03 % (±3.09) for the semen peaks and 57.05 % (±4.24) for the non-semen peaks (reported error is standard deviation). The low variance of peak height ratios across multiple replicates and samples demonstrates the reproducibility of the assay as measured by signal intensity.
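Summary statistics of this kind can be computed as sketched below. The replicate values are hypothetical, and the sketch uses the sample standard deviation (n − 1 denominator); the text does not state which form of the standard deviation was used, so this is an assumption.

```python
from statistics import mean, stdev


def summarize_ratios(ratios_percent):
    """Return (mean, sample standard deviation) of peak height ratios in percent."""
    values = list(ratios_percent)
    if len(values) < 2:
        raise ValueError("at least two replicates are required")
    return mean(values), stdev(values)


# Hypothetical replicate peak height ratios (%), for illustration only
replicates = [85.0, 87.0, 89.0]
avg, sd = summarize_ratios(replicates)
print(f"{avg:.2f} % (±{sd:.2f})")
```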

Fig. 3

Reproducibility of DSI-Semen™ assay. a Mean peak height ratio for five semen samples (n = 10 replicates). Light gray bars represent the positive amplification control peaks and dark gray bars represent the semen-specific peaks. b Mean peak height ratio for five non-semen samples (n = 10 replicates). Light gray bars represent positive amplification control peaks and white bars represent non-semen-specific peaks

Assay performance

DNA from neat semen samples (n = 35) was positive for semen (Table 1). However, two of the samples presented as mixtures. SourceIdentifier™ reported these two samples as negative for semen and did not return a probability of error, so they were not used to calculate the mean value reported in Table 1. On visual inspection, both profiles appeared to be mixtures. Due to the private nature of collection and anonymization of the samples, it is not possible to unequivocally determine the cause of these mixed tissue results. Plausible explanations include sample contamination by a sexual partner or the collector, contamination by white blood cells and/or epithelial cells of the donor, or a mutation causing a change in the methylation patterns at the two sites of amplification. For the latter to be plausible, two separate mutations would have to occur; thus, the absence of methylation-sensitive sites is unlikely.

The two samples were typed for STRs. For both of the samples, the STR profile presented as a single source. This makes the possibility of partner contamination unlikely. Possibly, the donors might have other cellular materials in their semen. Regardless, the samples correctly yielded a positive result for the presence of semen upon visual analysis of the electropherogram data. Currently, it is recommended that the electropherogram be examined visually in casework and that the assay be used only for semen detection rather than to determine the presence of other tissue types in addition to semen.

Buccal swabs were collected from 14 different individuals who also had donated semen. All of the buccal swabs were correctly identified as having DNA from a non-semen source (Table 1). These paired samples support that the assay is effective at distinguishing semen-derived from non-semen-derived DNA samples from the same individual.

Standard presumptive and confirmatory tests for blood and semen were used to analyze the buccal, blood, and semen samples. For the detection of semen, the acid phosphatase and prostate-specific antigen (p30) tests were used, and for blood, Leucomalachite Green and Hematrace (Abacus Diagnostics, West Hills, CA, USA) were used. The tissue sources of the samples were correctly identified by the respective presumptive and confirmatory tests as either semen, blood, or neither (in the case of the buccal samples). These findings were concordant with the Nucleix DSI-Semen™ Kit’s identification as “semen” or “non-semen.”

Three mammalian semen samples from horse (E. ferus caballus), cattle (B. taurus), and American bison (B. bison) were available for testing with the methylation assay. These samples did not yield any results.

UV exposure, degradation, and inhibition studies

To evaluate performance of the kit when typing environmentally insulted DNA samples, experiments were performed to explore the effects of UV exposure on the DNA and a limited number of known PCR inhibitors on the amplification process [21–24].

Neat semen was placed onto cotton cloth, allowed to dry overnight, and subsequently exposed to UV light at 254 nm and 120,000 μJ/cm² for 0, 30, 60, 120, and 240 min. For all UV light exposure times, the assay yielded a positive result for all semen samples.

Carrying out degradation studies is of little consequence, as it can be predicted that as long as fragments are larger than the amplicon size and of sufficient quantity (and excluding inhibitors), a profile will be observed. Nonetheless, DNA extracted from a semen sample was sheared using a Covaris S2x acoustic shearing system (Covaris, Inc., Woburn, MA, USA). The aliquots sheared to 1,000 and 500 bp (Fig. 4c, d) returned a full positive result via SourceIdentifier™. The overall signal was reduced for aliquots sheared to approximately 150 bp and less (Fig. 4e, f). This observation is consistent with the size of the amplicons in the assay (the largest amplicon is 160 bp in length). The software yielded an inconclusive result for these degraded profiles, but upon visual inspection, a semen-specific pattern was observed, and the samples were correctly identified as semen.

Fig. 4

Effects of mechanical DNA degradation on DSI-Semen™ assay. Electropherograms of a sample mechanically sheared by a Covaris S2x acoustic shearing system into c 1,000-bp fragments, d 500-bp fragments, e 150-bp fragments, and f less-than-150-bp fragments compared to a non-sheared control and b negative template control

Humic acid and hematin were added to the reaction mixes to achieve total concentrations of 0, 2.5, 5, and 7.5 ng/μL and 0, 2.5, 5, 7.5, and 15 μM, respectively. Two semen and two non-semen samples were tested. At all humic acid concentrations, the assay was able to correctly identify semen samples (Fig. 5b) and non-semen samples. Hematin at a concentration of 2.5 μM did not affect the assay, enabling semen (Fig. 5c) or non-semen samples to be positively identified. Concentrations of 5 μM hematin or greater completely inhibited the assay (Fig. 5d).

Fig. 5

Effect of inhibitors on DSI-Semen™ assay. The effect of a no inhibitor, b 7.5 ng/μL humic acid (maximum concentration assayed), c 2.5 μM hematin, and d 5 μM hematin on the DSI-Semen™ assay

Mixture analysis

Mixtures of 1:0 and 19:1 buccal to semen were assayed as containing no detectable semen (Table S2 of the “Electronic supplementary material”). This observation was confirmed by visual inspection of the electropherograms. Mixtures of 9:1 through 2:1 were identified as inconclusive by the analysis program. As explained above, the software was designed with parameters that minimize false-positive detection of semen at the expense of sensitivity. Manual inspection clearly identified the semen contribution to the mixture, while the software, due to its “conservative” parameters, yielded an inconclusive interpretation. Visually, the correct peak signature of a mixture of semen and another tissue type was obtained. An electropherogram of a 6:1 ratio of buccal to semen is shown in Fig. 6d. As mentioned earlier, visual interpretation of data is recommended for casework.

Fig. 6

DSI-Semen™ assay mixture profiles. DNA samples from two sources were mixed, at 500 pg of total DNA, in the following semen to buccal DNA ratios: a 1:0, b 6:1, c 1:1, d 1:6, and e 0:1

Mixtures of 1:1 were designated as positive for semen by the analysis program. However, by visual inspection, the electropherograms clearly display the correct non-semen-specific peaks as well (Fig. 6c). Mixtures ranging from 1:2 to 0:1 buccal to semen were all identified as positive for semen. Non-semen-specific peaks were visibly absent at a buccal to semen ratio of 1:6 (Fig. 6b). The current inability of the software to identify an additional tissue in a mixture does not detract from the assay performance. The presence of semen was correctly determined until levels of the tissue-specific DNA were below the sensitivity of detection of the assay.

Mock casework

To mimic casework samples, five semen samples, five blood samples, and five saliva samples that were dried on cotton were assayed. The correct designation of semen or non-semen was obtained for all 15 samples (Table 1). To more closely approximate mixtures that may derive from casework, five semen/semen mixtures, five semen/blood mixtures, and five semen/saliva mixtures were prepared on cotton cloth. All of the mixture samples except sample U-76, a saliva and semen mix, were identified correctly as positive for semen using SourceIdentifier™ (Table S2 of the “Electronic supplementary material”). U-76 was analyzed as inconclusive by the analysis software. Upon visual observation of the sample U-76 electropherogram, the peak signature was consistent with that of a mixed semen/non-semen sample. Based on these data, caution should be exercised when using the analysis program to interpret mixture samples where one of the components is relatively low compared with the other. Currently, visual review is recommended.

The aforementioned samples used in the study were extracted via column-based methods (i.e., DNA Investigator®, Qiagen). To ensure that the kit was compatible with other extraction methods, tests were performed to determine if the extraction method used to isolate the DNA could influence the performance of the assay. Samples were extracted by either organic extraction (semen n = 6, buccal n = 6) or an automated bead-based extraction system (PrepfilerPlus™ on an AutoMate Express™, Life Technologies; semen n = 10, buccal n = 10, blood n = 5). The samples extracted via the organic method were correctly typed for tissue source. However, the AutoMate Express™ extraction platform did not yield DNA that could be typed. The digestion control indicated that the restriction enzyme is possibly inhibited by some component that is co-purified with the DNA in this extraction method. More research is required to determine the cause of this effect. Currently, it is recommended that samples to be assayed with the DSI-Semen™ kit be extracted only with organic or column-based methods.

Conclusion

DNA is routinely collected for human identity testing from sexual assault crime evidence. Based on the results of this study, the DSI-Semen™ kit has been shown to be a reliable tool for the determination of the presence of semen, at or below the levels of template DNA used for human identity testing. The evidence currently suggests that the methylation patterns for semen occur in DNA from spermatozoa. As with any sperm-specific assay, such as microscopy, caution must be used when interpreting results that could possibly have come from an azoospermic individual.

The accompanying analysis software program can correctly identify tissues as containing or not containing semen. However, due to its conservative nature, the software returns an inconclusive result for some tissue mixtures. This limitation can be overcome by using GeneMapper® ID-X and visual analysis to determine the presence of semen in the sample.

Due to the small amounts required for tissue typing with this method, it is possible that residual extracted DNA from identity testing cases could be used for tissue typing of the same sample. The small amplicon sizes for the assay make the test somewhat resilient to typing of degraded samples. Lastly, the assay is compatible with commercially available human identity testing kits and instrumentation [17]. With further optimization, it is feasible that tissue source and short tandem repeat loci typing could be performed simultaneously in a combined multiplex system, especially with the advent of six-dye capabilities.