Cade Emlet

Mack Ruffin

Regina Lamendella

What Is the Human Virome?

Viruses are the most abundant and widely distributed biological entities on Earth, existing throughout the biosphere with a distribution equal to or greater than that of other microbes [1], and are thus influential in nearly every ecosystem [2]. The human body is no exception; within the vast microbial communities that thrive on and within our bodies, viruses exist in great abundance. Viruses are capable of regulating bacterial community structure and, subsequently, human health to an extent that has not yet been fully appreciated [3]. Owing to advances in viral community analysis, the human virome has recently become the subject of intense investigation.

The human virome is the diverse and abundant collection of all viruses found in or on humans; it includes both eukaryotic and prokaryotic viruses, encompassing animal-infecting viruses and bacteriophages. Other types of viruses, such as archaeal viruses and virophages, are also part of the human virome but are not as deeply studied because their function within the human body is poorly understood [4]. By either directly affecting host cell behavior and structure or by preying upon certain species of bacteria and subsequently altering bacterial communities, viruses can alter their own environment and exert strong direct or indirect influence over host health and physiology [5, 6]. While much research has focused on the role of the human bacteriome in the etiology and progression of diseases such as cancer, cardiovascular disease, and inflammatory bowel disease, the human virome has been less studied. Recent work is just beginning to shed light on the role the virome may play in human diseases such as periodontitis [7], inflammatory bowel disease [8, 9], cystic fibrosis [10, 11], and cancer [12].

How Can the Virome Induce Carcinogenesis?

In 2012, 15.4% of new cancers worldwide were attributable to carcinogenic infections. The following viruses accounted for over 60% of infection-associated cancers: human papillomavirus (640,000), hepatitis B virus (420,000), and Epstein–Barr virus (120,000), among others [13]. Notably, infection with these viruses does not always lead to cancer; most infected people do not develop further symptoms, let alone progress to cancer. However, the vast majority of people in the corresponding cancer cohorts have been infected with one of these viruses, or some combination thereof. Viral-linked cancers appear in the setting of persistent infection over years, sometimes decades [14]. This suggests that factors beyond the infection itself contribute to the carcinogenic process in the patients who go on to develop cancer. Other understudied viruses may also be involved in this process.

These potentially oncogenic viruses utilize several different strategies to develop carcinogenic persistence. These strategies include creating conditions for replication (inducing cell cycle progression, altering metabolic reprogramming, inducing angiogenesis, etc.), ensuring correct replication (recruiting or inhibiting DNA damage repair), maximizing viral production (preventing apoptosis until virions mature, evading cellular immune systems), and producing multiple latent episomes or proviruses [15]. There are several different mechanisms by which other viruses may impact carcinogenesis, including genomic alterations, impacting cellular and inflammatory pathways, and shaping bacterial community structure. Changes in human virome composition and diversity have been implicated in periodontal disease [7], HIV [16], cystic fibrosis [10], diseases after antibiotic exposure [17], urinary tract infections [18], and inflammatory bowel disease [19]. In Table 1, we highlight eukaryotic viruses associated with oncogenesis in the gut. These viruses could be conceptualized as drivers in carcinogenesis while the other viral communities facilitate or hinder the chronic infection that leads to cancer. We cannot exclude the other viral communities having a direct impact on the carcinogenesis process.

Table 1 Eukaryotic viruses associated with oncogenesis in the gut

One class of viruses to which colorectal cancer can be attributed is the human papillomaviruses (HPV). HPV is a double-stranded, non-enveloped DNA virus, and it is the most common sexually transmitted infectious agent in the USA [20, 21]. HPV has been linked to a number of cancers, including head and neck squamous cell carcinomas [22], oropharyngeal cancer [23], cervical cancer [24], prostate cancer [25], and colorectal cancer [26], among others. The increasing global incidence of young-onset colorectal cancer has been noted to include a more prominent rise in rectal cancer than in colon cancer [27]. Persistent high-risk HPV infection is clearly linked to anal and rectal cancer [28].

The varied effects of HPV can be attributed to the variation among human papillomaviruses themselves, of which over 150 different types have been identified [29]. Only certain types of HPV, including HPV-16, HPV-18, and HPV-45, have been detected at statistically significant levels in colorectal cancer tissue samples, with HPV-16 being the most prevalent of the HPV types found [30]. It is assumed that oncogenesis begins sometime after integration of the viral genome into the host genome, but how the subsequent induction of cancer occurs remains unknown and contested. The high-risk HPV genes E5, E6, and E7 encode potent oncoproteins that target almost all of the previously mentioned strategies that support replication and persistence [15].

Human polyomaviruses, a class of icosahedral, non-enveloped double-stranded DNA viruses, have also been linked to cancer [31]. Of the few human polyomaviruses suspected of having oncogenic properties (including Merkel cell polyomavirus, Trichodysplasia spinulosa polyomavirus, John Cunningham Polyomavirus, Simian Virus 40, and BK polyomavirus), only John Cunningham Polyomavirus (JCV) and BK polyomavirus have been associated with colorectal cancer [32,33,34,35]. Polyomaviruses encode the T-antigen, a non-structural oncogenic protein that is capable of inactivating the tumor suppressor proteins p53 and pRB, among other mechanisms of signaling pathway interference [36]. Specific oncogenic mechanisms remain unknown for human polyomaviruses in colorectal cancer, but are implied by their presence in cancerous colorectal tissue samples [37].

Bacteriophage Shapes Community Structure and Can Indirectly Induce Oncogenesis

Although the effects of eukaryotic viral species have been well documented via functional techniques, the mechanisms by which viral communities induce carcinogenesis in the gut are still being discovered with the advent of next-generation sequencing methods (NGS), including metagenomics and metatranscriptomics techniques. Bacteriophages are indirectly associated with the development of some cancers in humans, particularly in the gastrointestinal system [38,39,40]. Due to the well-established relationship between microbial dysbiosis and the evolution of gastrointestinal malignancy, the relationship between viruses and colorectal cancer is strongly inferred and is currently being investigated [41,42,43,44].

Community-based viral shotgun NGS techniques have revealed that colon virome diversity is altered in individuals with colorectal cancer (CRC) [39], with viral diversity being higher in CRC cohorts [40]. The association between the bacteriophage portion of the enteric virome and CRC is considered to be indirect: phages alter bacterial community structure and bacterial behavior, which can lead to carcinogenesis. Importantly, phages in the cancerous gut cannot be cleanly divided into lytic and lysogenic types; the vast majority of phages in the gut are temperate, capable of being either lytic or lysogenic [39]. With this in mind, it is difficult to pin carcinogenesis on any specific phage or viral action, as the role of the phage community as a whole is to act as an overarching community modulator for bacteria. In this capacity, phages can reduce some bacterial species while indirectly driving population growth in other, oftentimes more pathogenic, bacterial species. Additionally, horizontal and vertical viral gene transmission is suspected of altering bacterial behavior, particularly with respect to biofilm formation, the alteration of which can also lead to carcinogenesis.
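Statements about virome diversity being higher or lower in a cohort are usually based on an alpha-diversity index computed from per-taxon read counts. The following is a minimal sketch of one common choice, the Shannon index, and is not necessarily the metric used in the cited studies. The taxon names and counts are hypothetical and serve only to illustrate the calculation.

```python
import math


def shannon_diversity(counts):
    """Return the Shannon index H' = -sum(p_i * ln(p_i)) for a dict of taxon counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Taxa with zero reads contribute nothing to the sum and are skipped.
    return -sum(
        (n / total) * math.log(n / total) for n in counts.values() if n > 0
    )


# Hypothetical read counts per viral taxon (illustrative values, not real data).
healthy = {"phage_A": 90, "phage_B": 5, "phage_C": 5}
crc = {"phage_A": 40, "phage_B": 30, "phage_C": 20, "phage_D": 10}

print(f"healthy H' = {shannon_diversity(healthy):.3f}")
print(f"CRC     H' = {shannon_diversity(crc):.3f}")
```

A community dominated by one taxon scores low, while a richer and more even community scores higher. In practice, counts would typically be rarefied or otherwise normalized for sequencing depth before cohorts are compared.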

The ability to apply NGS to investigating trans-kingdom microbiomes is enabling a more holistic understanding of the pathogenesis of digestive diseases such as CRC, as tumor initiation and progression are influenced by complex host, environmental, and gut microbial factors. Several bacterial biomarkers have been correlated with CRC and with progressive stages of cancer development [45,46,47,48]; however, contributions from viral components of the microbiome remain underexplored. NGS studies in CRC patients have highlighted how opportunistic and persistent viruses are involved in the course of carcinogenesis [30, 49,50,51,52,53,54,55,56,57,58,59,60,61]; these studies are summarized in Table 2. In addition, phages that infect Gram-negative bacterial hosts, such as enterotoxigenic Bacteroides fragilis and E. coli, as well as Fusobacterium nucleatum, have been associated with CRC development [62,63,64,65]. Bacteriophages have also been shown to have a putative functional role in the regulation of biofilm production (among other virulence-related functional genes encoded by bacteriophages), which has been implicated in colorectal tumorigenesis [66,67,68,69]. In addition to shaping the bacterial community, bacteriophages have been shown to transfer directly into colonic epithelial cells, promoting tumor growth and invasiveness in CRC [70, 71]. Although our understanding of the full extent of the enteric virome’s indirect influence on carcinogenesis by means of modulating bacterial populations is in its infancy, a snapshot of how the virome changes across stages of colorectal cancer is beginning to form (Fig. 1). Shotgun NGS techniques have begun to identify the presence of certain phages as a common factor in early-, mid-, and late-stage colorectal cancer, if not as a driving factor, then potentially as a biomarker (Table 2).

Table 2 Phages that are involved in carcinogenesis, either through causing dysbiosis or serving as a biomarker for community shifts
Fig. 1

Bacteriophages alter bacterial communities and induce carcinogenesis through varying mechanisms. Panel A: The bacteriophage community serves as a community regulator for bacterial populations, and bacterial communities also have the ability to modulate bacteriophage populations. Dysbiosis in either community can allow carcinogenic bacterial populations to take hold, giving the virome a more indirect role in cancer development. Panel B: The hypothetical mechanism by which viruses indirectly induce carcinogenesis. Bacterial and viral communities begin in a healthy and well-regulated state. Variations in the phage community alter bacterial communities, reducing the populations of commensal bacteria while opening a niche for pathogenic (and potentially carcinogenic) bacteria to take hold. The deleterious bacterial communities form a biofilm, which spreads due to the biofilm-altering components of some phages. The tight junctions between cells are disrupted, allowing bacterial cells to infiltrate the spaces between them; this, in turn, leads to inflammation and creates the perfect environment for opportunistic pathogens, both viral and bacterial. Bacteria then thrive off of peptides secreted by the stressed epithelial cells, which leads them to release carcinogenic reactive oxygen species and polyspermines, subsequently inducing carcinogenesis. Figure 1 is reproduced from Hannigan et al. [39]. Reuse is covered under the mBio Creative Commons Attribution 4.0 International license found here https://creativecommons.org/licenses/by/4.0/

Methods for Investigating the Human Virome

Culture-based techniques involve the isolation of human viruses and phages from human-associated environments. The basis for understanding phage ecology and host interactions has been provided by numerous conventional plaque-based assays [72], and these in vitro methods will remain essential in furthering our understanding of these relationships. These cultivation-based techniques are inherently limited by the inability to culture and identify hosts, as well as questions surrounding the ability to develop robust and representative in vitro models [73, 74]. In addition, cultivation-based approaches do not allow for unbiased representation of viral community structure and ecology. While 16S rRNA-based studies have helped revolutionize our understanding of bacterial ecology, absence of conserved molecular marker genes in viruses has complicated the application of marker-specific nucleic acid amplification technologies (NAATs) for viral community profiling [75,76,77,78,79,80]. Targeted sequence capture panels, such as ViroCap, have also been used to enrich nucleic acid from DNA and RNA viruses from viral communities harbored by vertebrate hosts [81].

Recent improvements in parallelized NGS technologies and bioinformatics have enabled, for the first time, a deeper and more comprehensive view of human-associated viral communities than cultivation-based and PCR-based methodologies allow [82]. Since no single methodology can provide an all-inclusive approach, the design of a virome study should take into consideration the sample source and the type of viral particles to be assessed (i.e., DNA and/or RNA genomes, enveloped versus non-enveloped). Due to their small genome size, viral nucleic acids represent a minority of the total nucleic acids recovered from a given sample despite the greater abundance of viral particles, and thus, isolation of viral DNA/RNA and sample concentration are recommended before shotgun sequencing. It should be noted that the methods used for viral particle purification can have a strong effect on the viral populations recovered and are subject to differential contamination issues [83, 84]. Methods for isolating virus-like particles (VLPs) from human-associated sample matrices often utilize a combination of filtration and centrifugation techniques for concentration of VLPs, followed by the elimination of contaminating cells and free nucleic acids [83, 85]. Viral nucleic acids can then be extracted using a variety of methods, including those that utilize phenol–chloroform and Trizol, as well as a variety of commercially available kits, such as the DNeasy or QIAamp UltraSens Virus kits (QIAGEN, Germantown, MD). Because of the low concentration of viral nucleic acids, an amplification step, such as multiple displacement amplification (MDA), is generally used to amplify viral genomes prior to shotgun sequencing [83, 85, 86]. It should be noted that MDA introduces biases into viral community analysis due to preferential amplification of small circular viruses [87]. To analyze RNA viruses, RNA must first be reverse transcribed into more stable cDNA and then amplified by sequence-independent, single-primer amplification (SISPA) [88] or via amplification-based techniques [89]. Subsequently, library preparation can be performed using Illumina Nextera XT or FLX kits, which require very low nucleic acid inputs, followed by deep sequencing on high-output Illumina sequencers.
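The point that viral nucleic acids are a minority of the total, even when viral particles outnumber cells, can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a hypothetical virus-to-cell ratio of 10:1, a 50 kb phage genome, and a 4 Mb bacterial genome, and it ignores host DNA entirely. None of these figures come from the studies cited above; they are placeholders chosen only to show the arithmetic.

```python
def viral_nucleic_acid_fraction(virus_to_cell_ratio, viral_genome_bp, cell_genome_bp):
    """Fraction of total base pairs that is viral, per bacterial cell in the sample."""
    viral_bp = virus_to_cell_ratio * viral_genome_bp
    return viral_bp / (viral_bp + cell_genome_bp)


# Assumed, illustrative inputs: 10 phage particles per bacterium,
# 50 kb per phage genome, 4 Mb per bacterial genome.
fraction = viral_nucleic_acid_fraction(10, 50_000, 4_000_000)
print(f"Viral share of nucleic acid: {fraction:.1%}")  # prints 11.1%
```

Under these assumptions, a tenfold excess of particles still yields only about one ninth of the nucleic acid, which is the motivation for enriching VLPs before sequencing. Including human host DNA in the denominator would shrink the viral share further.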

Shotgun sequencing approaches have provided unprecedented glimpses into the viral fraction of human-associated ecosystems, revealing for the first time in-depth inventories of the composition and functional repertoire of the virome [90,91,92,93]. Microscopic techniques like transmission electron microscopy (TEM) have not only benefited our understanding of viral structure but have also supported the identification of novel viruses from shotgun-based sequencing studies [39, 40, 94, 95]. Perhaps the most comprehensive study characterizing viruses from a variety of sample types associated with the development of various cancers utilized a diverse selection enrichment method targeting all major viral groups, followed by high-throughput sequencing [96]. This study investigated total DNA and RNA, mRNA, retroviral [97] and vertebrate viral capture [98], as well as enrichment of small circular DNA [99].

Bioinformatics Analysis of Viral Shotgun Data

Transforming shotgun sequencing data into usable outputs for clinicians and biologists requires robust bioinformatics analysis. There are more than 30 different pipelines available to analyze viral shotgun data as reviewed in detail [100]. Among this sea of tools, selecting the appropriate workflow can be challenging [101, 102]. Most workflows include preprocessing and filtering of nontarget sequences, assembling short reads, database searching for taxonomic assignment, and post-processing to detect any potential false positive results. Tool selection should be made based on the type of application, such as simple detection (genus/species) or more specific identification to the subspecies level, the latter of which is generally performed for source-tracking and surveillance purposes. For discovery-based detection projects, finding remote homologs can be accomplished by using alignment-based or composition search algorithms against reference databases that span a wide range of viral taxa; however, these approaches are very computationally intensive and time-consuming [100].
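To make the database-search step more concrete, the following toy sketch assigns a read to the reference with which it shares the largest fraction of k-mers, and leaves it unassigned below a threshold. It is only an illustration of a composition-style search, not the algorithm of any pipeline cited above. The function names, reference sequences, k-mer size, and threshold are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, references, k=4, min_shared=0.5):
    """Assign a read to the reference sharing the most k-mers, or None if unassigned."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None  # read shorter than k
    best_taxon, best_score = None, 0.0
    for taxon, ref_seq in references.items():
        # Score is the fraction of the read's k-mers found in this reference.
        score = len(read_kmers & kmers(ref_seq, k)) / len(read_kmers)
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon if best_score >= min_shared else None


# Hypothetical reference "genomes" (made-up sequences for illustration only).
references = {
    "virus_X": "ATGCGTACGTTAGCCGATAGGCTA",
    "virus_Y": "TTTTAAAACCCCGGGGTTTTAAAA",
}

for read in ["CGTACGTTAGCC", "AAAACCCCGGGG", "GAGAGAGAGAGA"]:
    print(read, "->", classify_read(read, references))
```

Raising `min_shared` trades sensitivity for specificity, which mirrors the tool-selection trade-off described above. Real workflows use much longer k-mers, indexed databases, and post-processing to remove false positives.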

The viral metagenomics field has not yet adopted standard methods for classification of NGS results. Current tools include a few user-friendly online workflows, including VIROME [103] and Metavir [104]. Several more flexible command line programs are also available and generally require some basic background in Linux/Unix operating systems. Since methods and study goals vary widely, it is virtually impossible to assess the classification performance of these tools. Generally speaking, if high sensitivity is required, one should minimize the preprocessing steps on sequence data to retain as many viral reads as possible. If higher specificity is required, such as in a clinical setting, high-specificity workflows such as RIEMS [105] and MetLab [106] can be employed. If the user’s application is more focused on higher precision, such as for variant calling, more aggressive preprocessing/quality filtration and assembly steps should be employed [101]. RINS [107] and Kraken [108] combined with MetLab [106] can perform preprocessing, filtering, and assembly, and each was determined to have high precision [100].

In the era of NGS technologies, viruses that are represented by sequence data alone will likely need to be formally placed into the classification schemes maintained by the International Committee on Taxonomy of Viruses (ICTV) [109]. Currently, there are no published guidelines on divergence cutoffs for viruses to be considered new viral taxa. When comparing genomic distances for eukaryotic viruses, there is high consistency between their genetic content and their current family- and genus-level taxonomic assignments, which suggests that eukaryotic viral metagenomic signatures could be classified consistently with the current ICTV taxonomy [110]. However, the taxonomy of prokaryotic viruses is far more divergent than that of archaeal and eukaryotic viruses [111]. Thus, one of the biggest unmet needs of viral metagenomics is the development of a consistent classification of viruses in which assignment at family and finer taxonomic levels is based on genomic relatedness and other evidence-based classification schemes.

Numerous challenges remain in analyzing viral shotgun data. First is the problem of sensitivity and false positive detections. For example, viruses may be missed due to biases associated with wet laboratory procedures, database limitations, and poor sequencing coverage. Conversely, viruses that are not present may be detected because of homology to other viruses, incorrect annotation in databases, or sample cross-contamination. The intractability of viruses is compounded by their high recombination rate and by horizontal gene transfer and/or reassortment of genomic segments. Challenges for analyzing viral content in metagenomics datasets have been robustly discussed [75, 102, 112,113,114,115,116,117,118,119,120,121]. In order to make viral NGS data more useful in the clinical realm, we need to move beyond taxonomic classification alone. It has been suggested that taxonomy should be associated with confidence scores, as well as potential pathogenicity/cancer associations [122]. Functional gene analysis would enable detection of mechanisms of pathogenicity/carcinogenesis; however, it certainly complicates bioinformatics analysis of shotgun data [123].

Standardization of viral metagenomics data is much needed [101, 102, 116] and is vital to bringing this technology into clinical studies; it will be enabled by comparing and validating results within and between laboratories. In addition, metadata such as sample preparation methodology and sequencing technology should always be reported, and true and false positive and negative results should be established using synthetic constructs and experimental controls. Benchmarks should also be assessed for bioinformatics workflows, where different filtering, assembly, and annotation methods are tracked to optimize sensitivity and specificity [124,125,126,127,128]. Optimized parameterization will enable tractable and flexible frameworks that support implementation of different algorithms, so that users can utilize the best possible workflow tailored to their application. Updated reference databases, which include newly classified sequences, should be routinely used. While the field of viral metagenomics is young, there is much progress and momentum to advance the field by standardizing and validating methods through initiatives including the CAMI challenge (http://cami-challenge.org/), OMICtools [129], and COMPARE (http://www.compare-europe.eu/).
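As a rough illustration of how a workflow could be benchmarked against a synthetic construct of known composition, the sketch below computes sensitivity, specificity, and precision from the taxa expected in a mock community and the taxa a workflow reports. The function name, the taxon labels, and the idea of a fixed candidate "universe" of database taxa are assumptions made for this example and do not describe any specific benchmarking initiative.

```python
def benchmark(expected, detected, universe):
    """Compare detected taxa with the known content of a mock community."""
    expected, detected, universe = set(expected), set(detected), set(universe)
    tp = len(expected & detected)             # present and reported
    fp = len(detected - expected)             # reported but absent
    fn = len(expected - detected)             # present but missed
    tn = len(universe - expected - detected)  # absent and not reported

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical mock community: 10 candidate taxa in the database, 4 spiked in.
universe = [f"v{i}" for i in range(1, 11)]
expected = ["v1", "v2", "v3", "v4"]
detected = ["v1", "v2", "v3", "v5"]  # one taxon missed, one false positive

print(benchmark(expected, detected, universe))
```

Running the same comparison for each combination of filtering, assembly, and annotation settings gives a simple table from which the best-performing configuration for a given application can be chosen.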

Challenges and Future Directions

Given the early state of our understanding of the virome and colorectal cancer, there are numerous next steps for future research that could be considered. It may help to divide the potential research strategies into basic science and clinical science approaches. The basic science approach would take the currently known carcinogenic mechanisms related to sporadic colorectal cancer pathways, hereditary pathways, and nonpolyposis pathways and look for roles that single viruses, such as HPV, may play. However, this may have limited utility given that other microbiota, both bacterial and viral, likely play a role. As more microbiota are added to the conceptual model of carcinogenesis, complexity increases exponentially. Often the research team lacks expertise in the analysis of large, complex data sets, which impairs the synthesis of the information. Adding experts in managing and analyzing large data sets would improve both the quality and the efficiency of that synthesis. With respect to the clinical aspects of reducing suffering from colorectal cancer, the clinical or public health approach is divided into prevention, early detection, treatment, and surveillance for recurrence. If we knew which large bowel microbiota communities are associated with the risk of developing adenomatous cancer, or which support the expression of a genetic mutation, then we could begin looking into interventions to shift the microbial communities toward ones associated with less risk. This would lead to a more preventive strategy. The same research could lead to the development of an early detection test based on large bowel microbiota (alone or in combination with stool and blood testing, or stool genetic markers). Both strategies require access to adults over the age of 40 or 50, an easily collected biological source containing microbiota, and easy, inexpensive assays to evaluate the communities of microbiota. There are plenty of adults undergoing some form of screening for colorectal cancer. There is also plenty of stool to be collected and assayed. Stool collection, however, is not always easy or acceptable to patients. As highlighted in this chapter, there are numerous assays that can be used to characterize the viral communities present. Finally, the reams of data that would be produced require partnership with bioinformaticians from project design through analysis.

Establishing a causative relationship between viruses and colorectal cancer requires a deep investigation of the presence of etiologic agents in the diseased state and their absence in the non-diseased state. There are many challenges associated with fulfilling Koch’s postulates, as many viruses cannot be cultured and models that recapitulate viral pathology are lacking. Because of these challenges, alternative means for investigating the causality of viruses involved in CRC have been proposed; these include detection of viral nucleic acid and antigen in clinical specimens, in addition to visual detection of viral particles. Evidence for causation could be further strengthened by longitudinal sampling at different stages of CRC and by demonstration of a host immune response throughout the disease course [130].

While the application of NGS to virome profiling is exciting, future studies should incorporate proper quality control measures, including negative, positive, and internal controls throughout sampling, extraction, library preparation, and bioinformatics analysis, to minimize false negative and false positive detections. For example, several viruses that were reported as etiological agents have later been shown to be contaminants [131,132,133,134]. In addition, much analytical and clinical validation must be performed for NGS methods to demonstrate that test performance is comparable to or better than that of nucleic acid amplification techniques (NAATs) [135, 136]. As mandated by the Clinical Laboratory Improvement Amendments (CLIA), NGS viral detection tests will need to demonstrate analytical sensitivity and specificity, reproducibility, and accuracy of both laboratory and bioinformatics processes. These validation procedures should include patient samples with known spike-in standards, as well as positive and negative controls within each testing run. Validation of bioinformatics methods should include in silico controls to test algorithms and databases, as well as robust parameter sweeps for optimization of analyses.
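One simple way negative controls can be used during analysis is to flag any taxon whose signal in a sample is not clearly above its signal in the matched negative control. The sketch below shows that idea under stated assumptions: the function name, the tenfold threshold, and the taxon labels are hypothetical, and the counts are assumed to have already been normalized for sequencing depth. It is not a validated contaminant-removal method.

```python
def flag_contaminants(sample_counts, negative_control_counts, min_fold=10.0):
    """Split sample taxa into (retained, flagged) using a negative-control comparison."""
    retained, flagged = {}, {}
    for taxon, count in sample_counts.items():
        control = negative_control_counts.get(taxon, 0)
        # Flag a taxon seen in the control unless the sample signal
        # exceeds the control signal by at least min_fold.
        if control > 0 and count < min_fold * control:
            flagged[taxon] = count
        else:
            retained[taxon] = count
    return retained, flagged


# Hypothetical normalized read counts (illustrative values only).
sample = {"phage_A": 500, "phage_B": 20, "virus_C": 15}
negative_control = {"phage_B": 30, "phage_A": 2}

retained, flagged = flag_contaminants(sample, negative_control)
print("retained:", retained)
print("flagged:", flagged)
```

Flagged taxa are reported rather than silently dropped so that they can be reviewed, since a genuine low-abundance virus may also appear in a control through cross-contamination.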

NGS-based technologies hold much promise due to the ability to build predictive models that can describe the risk of opportunistic and emerging viral infections associated with CRC, which could help personalize treatment options. Integrating viral diversity with data on other microorganisms will yield new insights into CRC pathogenesis by generating reliable diagnostic biomarkers that could inform disease status and treatment strategies. However, the variation in the virome within the gut microbiome of CRC patients is not explained by clinical factors alone but may also be related to lifestyle factors, such as diet [93, 137, 138]. Thus, careful integration of viral and other microbial components with patient metadata will be imperative in the discovery and generation of reliable CRC biomarkers. Ultra-deep sequencing approaches are also necessary to exhaustively measure viral diversity associated with CRC by circumventing viral enrichment procedures that often lead to biases in viral profiling. Artificial intelligence and machine learning models can then be applied to determine which metadata, viral, and other microbial components are most predictive of CRC onset and progression. Thus, as the cost of NGS falls, deeper sequencing approaches will facilitate the integration of longitudinal trans-kingdom profiling in large CRC cohorts, providing accurate mechanistic insights into how the virome contributes to CRC etiology.

A variety of challenges still face the use of NGS for viral detection in clinical specimens. First, sample collection is challenging, especially obtaining samples that are representative of the true gut ecosystem. Typically, samples are acquired at the time of endoscopy or surgery, after the gut has been prepped; such samples are likely not representative of the normal microenvironment, although the gut returns to its usual communities within 2 weeks [139]. In addition, sample preservation should be considered in order to stabilize DNA and RNA directly at the time of collection. Finally, as previously mentioned, unbiased and comprehensive preparation of viral DNA and RNA should be considered using ultra-high-throughput sequencing, to circumvent enrichment strategies that impart biases on the viral profile.

With respect to bioinformatics analyses, more funding support should be granted for disseminating developed, documented, and tested software distributed through package managers or virtual machines to improve the lifespan of viral informatics software. In addition, attention must be given to the maintenance of updated and well-indexed, centralized sequence databases, with corresponding patient/sample metadata. This will facilitate the discovery of unclassified viral dark matter across several studies. As new software becomes available, benchmarking of software using standardized test datasets will help scientists analyze and optimize viral data analysis across multiple studies. The reporting and indexing of patient health information, socioeconomic data, and other relevant metadata will enable identification of predictive variables and covariates of viral presence and cancer development. In addition, direct communication among software developers and clinicians will help generate analyses that are interpretable and useful for clinicians interested in investigating viral communities and their role in CRC development and progression.

Implications for Clinicians

  • There will be an explosion of information pertaining to the virome and its implications in colorectal cancer over the next 5 years due to commercial advances in high-throughput sequencing.

  • Clinicians need to move away from a single infectious agent model for disease etiology by grasping this new, more encompassing etiological paradigm, in which communities of various microbial components interact with each other and the host.

  • As our understanding of the role of the virome in enteric carcinogenesis expands, new preventive methods, screening/surveillance techniques, or treatment modalities may develop.

  • It may be easy for some people to erroneously associate common infections with cancer. As the concept of viral contributions to carcinogenesis reaches the public, clinicians need to help patients remain calm when they encounter a viral infection.

  • Certain eukaryotic viruses are drivers of gastrointestinal cancers, but most people infected with these viruses do not develop cancer. Thus, infection by these potentially carcinogenic viruses does not necessarily imply future carcinogenesis; rather, their presence should be interpreted as a marker of elevated risk.