Introduction

The currently accepted standard for evaluating the male reproductive tract microbiome employs culturing techniques. These can be limiting [1, 2], as the majority of pathogens cannot be cultured [3]. To provide a broader overview, metagenomic/targeted DNA sequencing of 16S ribosomal RNA gene (rDNA) hypervariable regions and RNA sequencing of the 16S rRNA hypervariable regions using PCR or next-generation sequencing (NGS) have been performed on seminal fluid, semen, and urine [1, 2, 4,5,6]. These strategies are capable of genus and species identification [2] but are generally targeted to one bacterium or a group of bacteria [1]. Despite the increased sensitivity of 16S rRNA/rDNA sequencing, technical considerations and cost may be perceived as limiting factors in some clinical laboratories [7]. Similarly, virus detection must be performed separately and relies on targeted viral culture or virus-specific DNA/RNA sequencing strategies [8,9,10]. However, sequencing costs have decreased dramatically, from ~ 3.5 billion dollars for the first human genomes to less than 1000 dollars, and are still declining, with a non-targeted approach providing a more complete picture.

Male genital tract bacteria are primarily introduced by direct contact with the source, e.g., sexual transmission [6]. Escherichia coli is the most frequently noted bacterium in males, associated with genital tract infection or urethral/post-urethral contamination upon semen collection [6, 11]. Comparative 16S rDNA sequencing between male and female partners indicates that 85% of the phylotypes are shared, primarily within the genera Lactobacillus, Veillonella, Streptococcus, Porphyromonas, and Atopobium [4]. Compared to vaginal bacterial communities, the male seminal fluid has a less concentrated, more diverse (alpha diversity) bacterial community [4].

We sought to determine if human sperm RNA-seq data could provide a sensitive method of detection. To accomplish this objective, the microbial composition among 85 individuals was compared to previously published reports of the male reproductive tract microbiome [1, 6, 12,13,14,15]. Here, we report that human sperm RNA-seq is a viable method for the evaluation of the seminal microbiome. By RNA-seq, we have identified a male that may present with Streptococcus agalactiae bacterial colonization.

Materials and methods

Study population, sperm RNA isolation, sequencing, and alignment

A total of 85 deidentified, biobanked semen samples, collected with informed consent from idiopathic infertile couples, were obtained from two independent groups of sites within the USA and a single site in Canada. Males ranged in age from 24 to 44 years (mean age = 34.73 years). Each couple had undergone at least one cycle of timed intercourse (TIC), intrauterine insemination (IUI), intracytoplasmic sperm injection (ICSI), or in vitro fertilization (IVF). By study completion, 48 couples achieved live birth (LB) and 37 couples failed to achieve a birth (NLB).

The corresponding deidentified frozen semen samples were processed and analyzed as summarized in Fig. 1 in accord with IRB protocol 095701MP2E(5R). Sperm RNA was isolated as previously described [16,17,18,19] and sequenced on an Illumina Next-seq 500 to assess library quality and then on an Illumina Hi-seq 4000 to increase sequencing depth. Sequences aligned to the human genome [20] were removed from further analysis. Sequences not mapped to the human genome by HISAT2 [21] were aligned to the bacterial/viral/archaeal genomes (NCBI RefSeq release 95: all bacterial genomes; NCBI RefSeq release 76: all bacterial, viral, and archaeal genomes) with the 98% similarity taxonomic profiling method of Kraken v1 [22]. Using the Kraken-filter command, the threshold for calling a sequence was set at 0.15. For each sample, the percentage of microbial-assigned reads was calculated as a function of the total sequencing reads after removing the internal controls, Enterobacteria phage phiX174 sensu lato (phiX) and Escherichia virus MS2 (MS2). Within each sample, the percentage of species-specific microbial sequences was calculated for each sequencing run. One sample presented an abnormally high number of microbial-assigned sequencing reads (outlier) and was considered separately from the remaining group of 84 samples, which provided a background distribution. A subset of these samples, including the outlier, was also assessed using the Greengenes v13_8 reference database, aligning sequences based on 99% similarity to the 16S rRNA.
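The percentage calculation described above can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline: the function name, the dictionary layout, and the read counts are all hypothetical, and the sketch assumes that per-taxon read counts have already been extracted from a classifier report.

```python
# Spike-in controls named in the text; everything else here is hypothetical.
CONTROL_TAXA = {
    "Enterobacteria phage phiX174 sensu lato",
    "Escherichia virus MS2",
}

def percent_microbial(total_reads, taxon_counts, controls=CONTROL_TAXA):
    """Percentage of total reads assigned to microbes, excluding controls."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    microbial = sum(n for taxon, n in taxon_counts.items()
                    if taxon not in controls)
    return 100.0 * microbial / total_reads

# Illustrative counts for one made-up sample (not data from the study).
example_counts = {
    "Escherichia coli": 4000,
    "Streptococcus agalactiae": 6000,
    "Enterobacteria phage phiX174 sensu lato": 50000,
    "Escherichia virus MS2": 2000,
}
pct = percent_microbial(1_000_000, example_counts)
print(f"{pct:.2f}% microbial reads")
```

Keeping the denominator as the total number of sequencing reads, rather than the unmapped reads only, follows the description in the text.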

Fig. 1

Study design. Sample collection and processing is outlined. Sperm RNA from 85 semen samples was isolated and sequenced using a Next-seq 500 and a Hi-seq 4000 sequencer. Sequences were aligned to the human genome, and the aligned sequences were discarded. The unmapped sequences were retained and then aligned to the bacterial, viral, and archaeal genomes. Estimated richness of each sample (alpha diversity) employed the Shannon diversity index. Samples were then evaluated as a function of beta diversity (between group richness). With the exception of a single outlier, 84 samples showed the same statistical distribution. Microbial composition was then determined.

Estimated richness—alpha and beta diversity

The Kraken report was used to compute OTU tables by unique NCBI taxonomy ID and phylogeny tables for analysis with the R package phyloseq [23, 24]. Taxonomic lineage and rank for the NCBI taxonomy IDs designated by Kraken v1 [22] were assigned using Taxonkit [25]. Duplicate NCBI IDs were merged, and NCBI taxonomy IDs for the phiX and MS2 controls were removed. Missing values were assigned a raw abundance of 0. The remaining NCBI taxonomy IDs were renamed beginning with OTU1, and abundance was normalized to total library size using reads per million (RPM). Two OTU tables were generated (one with raw abundance and one with normalized abundance), each including the OTU ID and abundance value. The phylogeny tables were generated for each remaining OTU ID and associated taxonomic lineage (Kingdom, Phylum, Class, Order, Family, Genus, and Species). Alpha diversity (within-sample microbial diversity) was calculated using the estimate_richness() function, followed by plot_richness() [24] for visualization. The Shannon diversity index, a common measure of alpha diversity, was calculated for all 4426 OTUs prior to RPM normalization and visualized based on sequencing run (Next-seq vs Hi-seq), birth outcome (LB vs NLB), and sample group (LB + NLB vs outlier). The Mann-Whitney test in R [23] was used to evaluate group differences in alpha diversity.
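As an illustration of the two quantities named above, RPM normalization and the Shannon diversity index can be written out directly. This is a sketch in Python rather than the phyloseq code used in the study; the function names and the counts are hypothetical, and the natural logarithm is assumed for the Shannon index.

```python
import math

def reads_per_million(count, library_size):
    """Normalize a raw read count to reads per million (RPM)."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return count * 1_000_000 / library_size

def observed_richness(counts):
    """Number of taxa (OTUs) with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Two made-up communities with the same richness but different evenness.
even_community = [25, 25, 25, 25]
skewed_community = [97, 1, 1, 1]
print(observed_richness(even_community), round(shannon_index(even_community), 3))
print(observed_richness(skewed_community), round(shannon_index(skewed_community), 3))
```

The second community has the same observed richness as the first but a much lower Shannon index because a single taxon dominates. A sample can therefore combine a high observed OTU count with a low Shannon index, as the outlier in this study does.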

A phyloseq object was then created using the normalized OTU abundance table and merged at the species level by the tax_glom() function. This function combines abundance and lineage for all OTUs assigned to the same species. For example, only 1 OTU will represent the 49 OTUs originally assigned to Escherichia coli. All OTUs with no species assigned were re-labeled as NA at the species level but remained unmerged. Rarefaction of the species-merged phyloseq object was performed so that each sample library was of equal size. Briefly, the rarefy_even_depth() function was run with the random seed set to 711 (“set.seed(711)”), sampling without replacement and trimming OTUs no longer observed in any sample. This resulted in the removal of 581 OTUs from further analysis of the microbial diversity between samples (beta diversity). Beta diversity of the remaining 3645 OTUs was determined. This employed non-metric multidimensional scaling (NMDS) with unweighted UniFrac distances computed and visualized using the plot_ordination() function from phyloseq. Statistical significance of the unweighted UniFrac distances based on sequencing run, birth outcome, and sample group was assessed with the R package vegan [23, 26] using a permutational multivariate analysis of variance (PERMANOVA) test. Differences in microbiome variance between sequencers, birth outcomes, and sample groups were further evaluated by testing beta dispersion using the betadisper() and permutest() functions in vegan. Both functions were run using 1000 permutations.
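The species-level merge and the rarefaction step can be illustrated with a small stand-alone sketch. It mimics the behaviour described above (summing OTUs that share a species, then subsampling reads without replacement under a fixed seed) but is not the phyloseq implementation. The function names, OTU labels, and counts are hypothetical, and integer read counts are assumed.

```python
import random
from collections import Counter

def merge_by_species(otu_counts, otu_species):
    """Sum OTU counts sharing a species label; unlabeled OTUs stay separate."""
    merged = Counter()
    for otu, n in otu_counts.items():
        species = otu_species.get(otu)
        merged[species if species is not None else otu] += n
    return dict(merged)

def rarefy(counts, depth, seed=711):
    """Subsample `depth` reads without replacement; drop taxa left at zero."""
    pool = [taxon for taxon, n in sorted(counts.items()) for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the library size")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    return dict(Counter(rng.sample(pool, depth)))

# Made-up example: two OTUs share a species, one has no species assigned.
otu_counts = {"OTU1": 30, "OTU2": 20, "OTU3": 5}
otu_species = {"OTU1": "Escherichia coli", "OTU2": "Escherichia coli"}
merged = merge_by_species(otu_counts, otu_species)
rarefied = rarefy(merged, depth=20)
print(merged)
print(sum(rarefied.values()))
```

Because the seed is fixed, repeated calls return the same subsample, which is the role the seed of 711 plays in the analysis described above.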

Determining microbial background

Microbial background within human sperm RNA-seq data was determined as a function of the 84 LB + NLB group sample set that excluded the outlier from each sequencing run using normalized, unmerged, and unrarefied data. Sequencing runs were visualized as separate samples, as the alpha, but not beta, diversities were statistically different. This enabled visualization of microbial abundance between sequencing runs. Taxa were retained if they were present in at least one sequencing run for all samples and in > 60% of the samples at an RPM > 1 in both the Next-seq and Hi-seq sequencing runs. Bacteria were prevalent, followed by viral and archaeal assigned OTUs. Based on the above-threshold criteria, bacterial taxa displaying a relatively similar abundance in both sequencing runs were identified, beginning at phylum and moving towards genus taxonomic level. These were compared to bacterial phyla and genera reported as discernible in the male reproductive tract by culture or rRNA sequencing [1, 6, 12,13,14,15]. In addition, bacterial species reported to be contained within some reagents or to arise from unknown sources, which can be accentuated in low biomass samples such as sperm, were also considered. These are typically water or soil bacteria, often associated with nitrogen fixation, including some, but not all, Streptococcus species [27,28,29]. Background genera were then assessed as sample group means for comparison to the outlier genera. The Wilcoxon Mann-Whitney U test in R [23] was used to identify significant differences between sample sets (LB + NLB vs outlier) and the respective OTUs contributing to them. Group means are reported along with the confidence interval set at 95%.
Multiple testing (F-test with 100 permutations and Benjamini-Hochberg (BH) adjustment of p values) in phyloseq was used to evaluate OTUs assigned along the taxonomic lineage for genera exhibiting the most variability between the LB + NLB group and the outlier for their influence over alpha and beta diversity among sample groups.
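The background threshold described in this section (RPM > 1 in both sequencing runs for > 60% of samples) can be expressed as a simple prevalence filter. The sketch below is one possible reading of that criterion, not the authors' code. The function name, the tuple layout, and the RPM values are hypothetical.

```python
def is_background(rpm_by_sample, min_rpm=1.0, min_fraction=0.60):
    """Return True if a taxon passes the background prevalence threshold.

    rpm_by_sample: list of (nextseq_rpm, hiseq_rpm) tuples, one per sample.
    A sample counts as passing only if both runs exceed min_rpm.
    """
    if not rpm_by_sample:
        return False
    passing = sum(1 for nextseq, hiseq in rpm_by_sample
                  if nextseq > min_rpm and hiseq > min_rpm)
    return passing / len(rpm_by_sample) > min_fraction

# Made-up RPM values for two hypothetical genera across five samples.
genus_a = [(5.0, 7.2), (3.1, 2.4), (1.5, 9.0), (2.2, 2.0), (0.4, 3.3)]
genus_b = [(5.0, 0.2), (3.1, 2.4), (0.5, 9.0), (2.2, 2.0), (0.4, 0.3)]
print(is_background(genus_a))  # 4 of 5 samples pass in both runs
print(is_background(genus_b))  # 2 of 5 samples pass in both runs
```

Requiring both runs to exceed the threshold in the same sample is the stricter interpretation. Treating the two runs as independent samples would be an alternative reading of the text.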

Data availability

The microbial sequence data is available at SRA Bioproject PRJNA595621 (https://www.ncbi.nlm.nih.gov/sra/PRJNA595621).

Results

Microbial sequences in human sperm RNA-Seq

The cohort of 85 samples from idiopathic infertile couples comprised the 84-sample LB + NLB group and the NLB outlier. Semen parameter group means were within the healthy range (Table 1), with 84% (71 of 85) of the males fulfilling the World Health Organization (WHO) fertile-male characteristics for all four parameters [30] and 96% (82 of 85) for at least three of four parameters. By study completion, 48 couples achieved a live birth (LB), and 37 did not (NLB). Microbial-assigned sequencing reads averaged 1.0% (Next-seq 500) and 1.4% (Hi-seq 4000) of the total sequencing reads. The microbial species-specific sequencing reads averaged 0.41% (Next-seq 500) and 0.84% (Hi-seq 4000) of the total number of sequencing reads, in accord with that observed by others [31,32,33,34]. In comparison, the outlier sample far exceeded these averages, with total microbial-assigned sequencing reads of 32.1% (Next-seq 500) and 41.2% (Hi-seq 4000). This corresponded to 12.99% (Next-seq 500) and 15.78% (Hi-seq 4000) species-specific microbial sequences, respectively.

Table 1 Mean standard semen parameters for the 85 semen samples separated by sample group as compared to World Health Organization (WHO) minimum guidelines: 84 LB + NLB samples and 1 NLB outlier sample with an over-abundance of microbial sequencing reads. The 95% confidence interval is reported for group LB + NLB only

Estimated microbial richness

Microbial diversity (alpha and beta) was examined to identify shared microbial profiles between the samples. Microbial richness within each sample (alpha diversity) was evaluated using the Shannon diversity index to determine the number of different microbes (or OTUs) within each sample. Differences between the two sequencers and birth outcome (LB vs NLB) were examined. A statistically significant difference by sequencer (p value ≤ 0.001, Online Resource 1) but not birth outcome (p value = 0.596, Fig. 2) was noted. For the LB group, the mean observed alpha diversity was 689.98 (± 53.87) with a mean Shannon diversity index of 3.37 (± 0.11), while the NLB group presented a mean observed alpha diversity of 642.20 (± 61.25) and a Shannon diversity index of 3.28 (± 0.12). Differences in microbial diversity between samples (beta diversity) were then evaluated for both sequencer and birth outcome. The rarefied data were plotted in two dimensions as determined by a stress level of 0.18 (stress type 1 or weak stress) using NMDS and unweighted UniFrac distances. Beta diversity was not statistically different between sequencers. However, the unweighted UniFrac distances (p = 9.99e-4) and beta dispersion (BH adjusted p value = 0.001) showed statistical significance using the R vegan package when birth outcome was considered. To determine whether this difference was due to the presence of the outlier sample within the NLB group, the analysis was repeated following removal of the outlier. This yielded no significant difference between the remaining LB and NLB samples, indicating that the difference observed in the original analysis was due to the outlier sample. As such, the 48 LB and the remaining 36 NLB samples were combined into a single group, LB + NLB.
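The reported p value of 9.99e-4 is the smallest value a permutation test with 1000 permutations can return under the common convention p = (hits + 1) / (permutations + 1), since 1/1001 ≈ 9.99e-4. The sketch below illustrates that floor with a deliberately simplified permutation test on a difference in means. It is a stand-in for illustration, not PERMANOVA and not the vegan implementation. The function names and the data are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation p value for a difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a = pooled[:len(group_a)]
        perm_b = pooled[len(group_a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    # The observed arrangement is counted as one permutation, so p > 0.
    return (hits + 1) / (n_permutations + 1)

# Made-up, well-separated groups.
low = [float(i) for i in range(20)]
high = [float(i) + 100.0 for i in range(20)]
p = permutation_p_value(low, high)
print(f"p = {p:.2e}, floor = {1 / 1001:.2e}")
```

A p value at this floor indicates that none of the permuted arrangements was as extreme as the observed one, so it is best read as p ≤ 1/1001 rather than as an exact probability.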

Fig. 2

Estimated sample richness (alpha diversity) by birth outcome; live birth (LB) versus no live birth (NLB). The observed microbial richness (Observed) and Shannon diversity index (Shannon) based on birth outcome is not significantly different (p = 0.596). Both sequencing runs were considered as separate samples so that each sample has two reported values. The color of the dots indicates the sequencing run, while the NLB black dots indicate an outlier

Re-examining the samples based on sample group (LB + NLB) versus the NLB outlier showed a marked difference in the Shannon diversity index (p value = 0.015, Fig. 3). This corresponded to a major difference in alpha diversity between the LB + NLB group (mean observed = 661.49 (± 39.42), mean Shannon diversity index = 3.35 (± 0.08)) and the outlier (mean observed = 1315.00 (± 362.59), mean Shannon diversity index = 1.18 (± 0.04)). Both the unweighted UniFrac distances (p value = 9.99e-4) and beta dispersion, i.e., microbiome variance (p value = 9.99e-4), were significant for the sample group. Similar to alpha diversity, this difference in beta diversity was independent of the taxonomic database (Greengenes, NCBI RefSeq) used. The results were consistent with the above Kraken v1 assignments. This indicated that the sample in question was truly an outlier.

Fig. 3

Estimated sample richness (alpha diversity) by sample groups. Group (LB + NLB; n = 84) was compared to the NLB outlier (n = 1). The observed microbial richness (Observed) and Shannon diversity index (Shannon) of the outlier sample compared to LB + NLB group is significantly different (p value = 0.015). Both sequencing runs were considered as separate samples so that each sample has two reported values. The color of the dots indicates the sequencing run, while black dots indicate the sample replicate was an outlier

Sperm microbiota

Birth outcome did not impact the alpha diversity. Accordingly, the outlier sample was excluded, and the LB + NLB samples were combined as a group (n = 84), with replicates used to define the microbiota background (Fig. 4a). Four bacterial phyla, Firmicutes, Proteobacteria, Bacteroidetes, and Actinobacteria, were consistently observed in all sperm samples. OTUs from these phyla were assigned to the 11 genera previously identified in semen by PCR and/or 16S rDNA sequencing (Seq) and/or hNGS (human sperm RNA-seq) and/or bacterial culture (Table 2) [1, 6, 12,13,14,15]. The genera Escherichia, Staphylococcus, and Streptococcus were observed independent of the assay used (culture-PCR-Seq-hNGS), while the genera Corynebacterium and Pseudomonas were observed using PCR-Seq-hNGS but not by bacterial culture. In comparison, the genus Bacillus was only observed by Seq and hNGS, and the genus Acinetobacter was identified by PCR and hNGS. Finally, the genera Burkholderia, Cutibacterium, Flavobacterium, and Riemerella were only observed using hNGS (Table 2). The background level was established across the genera observed in all assays within the LB + NLB group. Corynebacterium, Staphylococcus, and Streptococcus were identified as significantly different between the LB + NLB group and the NLB outlier (Fig. 4a; Table 3).

Fig. 4

Average group abundance of assigned bacteria. a Genera are ordered based on assay identification; Escherichia, Streptococcus, and Staphylococcus (PCR-Seq-hNGS-culture); Acinetobacter, Corynebacterium, and Pseudomonas (PCR-Seq-hNGS); Bacillus (Seq-hNGS); Burkholderia, Cutibacterium, Flavobacterium, and Riemerella (hNGS). b Representative bacterial species are ordered by assay; PCR-Seq-hNGS-culture for Escherichia coli (49 OTUs); the Staphylococcus species S. aureus (31 OTUs), S. epidermidis (3 OTUs), S. haemolyticus (1 OTU), S. saprophyticus (1 OTU); and the Streptococcus species S. agalactiae (8 OTUs), S. anginosus (3 OTUs), S. dysgalactiae (95 OTUs), S. mitis (1 OTU), S. pneumoniae (22 OTUs), and PCR-Seq-hNGS for Pseudomonas aeruginosa (14 OTUs) and Pseudomonas putida (11 OTUs) known to be discernible within the seminal microbiome. S. dysgalactiae infection of the male reproductive tract has recently been reported and is abundant in the outlier sample only. Background genera were determined using the 84 LB + NLB sample group in which summed genera level OTUs had an RPM > 1 in both sequencing runs for > 60% of samples. A total of 545 OTUs directly assigned a taxonomic rank (to the genera level) or indirectly (to the species level within the given genera) were used for determination of genera abundance (a). Only OTUs assigned directly to the representative species were used to determine species abundance (b). Both sequencing runs were considered as separate samples. Bar color indicates sample group. PCR, 16S rDNA sequencing by PCR; NGS, next-generation sequencing; Seq, 16S rDNA sequencing by NGS; hNGS, human sperm RNA-seq

Table 2 Background genera identified among sperm samples. The 11 background genera identified by human sperm RNA-seq (hNGS) were compared with those reported as present by 16S rDNA sequencing (PCR, Seq: represents > 1% of sequences) and/or bacterial culture (cultured or not cultured). The data is summarized as present (+) and absent (−) by assay type. Background genera were determined using the LB + NLB group samples. Both sequencing runs were considered as separate samples. Summed OTUs for each genus had an RPM > 1 in both sequencing runs for > 60% of samples. PCR 16S rDNA sequencing by PCR; NGS next generation sequencing; Seq 16S rDNA sequencing by NGS
Table 3 Background genera significantly different in the outlier. The proportion of operational taxonomic units (OTUs) contributing to this difference are reported. Background genera were determined using the LB + NLB group samples. Both sequencing runs were considered as separate samples. Summed OTUs for each genus had an RPM > 1 in both sequencing runs for >60% of samples. P values were determined using the Wilcoxon Mann-Whitney U test

Seventeen bacterial species within the 11 genera were identified in previous reports as present in the male reproductive tract [1, 6, 15]. Of these, 12 species were sequenced and are visualized in Fig. 4b. As shown in Fig. 4b, the NLB outlier OTUs were directly assigned to S. agalactiae (9.21 to 11.03% total sequencing reads) and S. dysgalactiae (3.37 to 4.25% total sequencing reads) which comprise the majority of sequences assigned to the genus Streptococcus. In total, these account for 48.13% of all Streptococcus assigned sequencing reads (Fig. 4). The majority of these reads corresponded to OTU3813, assigned to S. agalactiae and OTU3848, assigned to S. dysgalactiae, reflective of the alpha and beta diversity difference exhibited by the outlier following multiple testing in phyloseq (Fig. 3). These observations were independent of the taxonomic database assignment method used (Greengenes, NCBI RefSeq).

Discussion

Human RNA-seq data includes sequencing reads that do not align and map back to the human genome [31]. For example, bacterial, viral, fungal, and archaeal sequences have now been observed in several large datasets including the Cancer Cell Line Encyclopedia (CCLE) [34], 1000 Genome Project [33], SRA database [31], and The Cancer Genome Atlas (TCGA) [32]. As shown above, species-specific microbial sequencing reads associated with sperm are present at a frequency of 0.41 to 0.84% of the total. This is in accord with the average microbial load obtained by merging the sequencing data irrespective of sequencing depth or sample type (e.g., intestine, brain, skin) [31,32,33,34].

Samples were considered within the context of the LB + NLB samples versus the NLB outlier sample following at least one cycle of TIC, IUI, ICSI, or IVF. No difference in alpha diversity was determined based on birth outcome (Fig. 2); however, one sample was identified as an outlier in the NLB group and was separately tested from the remaining samples, group LB + NLB versus NLB outlier. On one hand, the presence of bacterial components can be traced to their introduction to reagents during manufacture [27,28,29]. On the other hand, this could reflect the sample itself. This poses a particular challenge for low microbial biomass samples such as semen, as the small amount of starting material may be overpowered by these reagent bacteria [29]. Computationally, this noise can be removed from analysis. Again, as illustrated in Fig. 3, the Shannon diversity index suggested the outlier’s microbiome was vastly different from all other samples.

The most dramatic difference between the outlier and the remaining 84 samples was Streptococcus (Fig. 4a). This microbial difference was identified irrespective of the microbial reference genome (Greengenes v13_8 99%: 16S rRNA; NCBI RefSeq release 95: all bacterial genomes; NCBI RefSeq release 76: all bacterial, viral, and archaeal genomes) or the specific algorithm used for taxonomic assignment. Multiple testing of the contribution of each OTU towards differences in microbial composition between all 85 samples was performed for the taxonomic lineages of each of the 11 genera. The OTUs yielding significance were those for S. agalactiae (OTU3813) and S. dysgalactiae (OTU3848). When sub-setting along the taxonomic lineage, significant differences were evident within the phylum Firmicutes (BH adjusted p value = 0.044), class Bacilli (BH adjusted p value = 0.033), order Lactobacillales (BH adjusted p value = 0.015), family Streptococcaceae (BH adjusted p value = 0.011), and genus Streptococcus (BH adjusted p value = 0.009). Considering abundance and contribution towards alpha and beta diversity, the outlier was likely colonized with S. agalactiae and potentially with S. dysgalactiae.
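For readers unfamiliar with the Benjamini-Hochberg (BH) adjustment behind the p values quoted above, the procedure can be written in a few lines. This is a generic sketch of the standard BH step-up adjustment, not the phyloseq routine used in the study. The function name and the input p values are hypothetical and unrelated to the reported results.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p values for four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

Each raw p value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.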

Male genital tract infection by S. dysgalactiae, but not colonization, has been reported [35, 36]. In contrast, transient S. agalactiae colonization has been observed in 20% of college-aged males, as determined in urine (first of the day) and swabs from the bottom of the rectum, mouth, and interdigital spaces, a frequency half that observed in females [37,38,39]. Sexual partners share the same isolates of S. agalactiae when colonized, indicating these bacteria are sexually transmitted [37, 40]. As partners share isolates of S. agalactiae, the female partner of the outlier may have been colonized. S. agalactiae infection in women can result in stillbirth or pregnancy loss [41]. This may reflect that S. agalactiae actively forms biofilms and increases inflammation via leukocyte number [39, 42, 43]. Interestingly, this couple did not achieve a pregnancy after three ICSI cycles. While S. agalactiae colonization has been detected in sperm from couples seeking reproductive care [6], there was no difference in standard semen parameters (Table 1), suggesting that the presence of bacteria (specific or in groups) did not impact semen parameters [2, 12]. This remains to be reconciled, when larger studies are undertaken, with the data of others suggesting that certain bacterial infection(s) alter semen parameters including sperm count and the degree of necrosis [1, 5, 6, 42].

Conclusions

Un-biased sequencing of microorganisms (bacteria, viruses, archaea) has the potential to provide enhanced sensitivity and specificity as compared to targeted culturing or targeted rDNA/rRNA sequencing. This information can be recovered from the data typically cast aside as part of RNA-seq unmapped reads. Using this in-hand resource provided the opportunity to identify a sample presenting a noticeable level of S. agalactiae RNA-seq reads. While it cannot be excluded, it is unlikely that the presence of S. agalactiae is due to contamination introduced during processing, considering the mode of transmission and the use of personal protective equipment as part of standard clinical practice. Given the recent increase in, and severity of, Streptococcus (agalactiae) infection, as well as others, in adults, neonates, and newborns [40], non-targeted human sperm RNA-seq data, in addition to providing fertility status [16], may also prove useful as a diagnostic for microbial status. Perhaps it may become part of routine sample quality assessment as we move towards personalized care.