Introduction

The Flavivirus genus consists of more than 90 species that exhibit high variability in host range and specificity (Schoch et al. 2020). Some of these are significant public-health pathogens, such as dengue virus, zika virus, yellow fever virus, West Nile virus, and Japanese encephalitis virus. Dengue virus alone is responsible for about 390 million infections each year, and currently only limited antiviral treatments are available for its control (Thomas and Yoon 2019; Troost and Smit 2020). According to the World Health Organization, the number of reported dengue cases increased by over eightfold in the last two decades, with over 4.2 million reported cases in 2019 (World Health Organization 2020).

All flaviviruses are enveloped, positive-sense, single-stranded RNA (ssRNA) viruses with small genomes, ranging from 9 to 13 kb, that consist of 5’ and 3’ untranslated regions (UTRs) flanking a single open reading frame (ORF). This reading frame is translated into a polyprotein, from which both structural and non-structural proteins are derived. The structural region, located at the N terminus of the polyprotein, contains the capsid protein, followed by a pre-membrane glycoprotein, which protects the virion from pH changes in the host environment, and an envelope glycoprotein, which mediates receptor binding and entry into host cells. The non-structural region consists of the proteins NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5 (Chambers et al. 1991). These non-structural proteins are responsible for various functions in the virus, ranging from proteolysis and genomic replication to evasion of the host immune system (Chong et al. 2019).

The Flavivirus genus can be categorized into four (not necessarily monophyletic) groups according to their host range and transmission mode: insect-specific flavivirus (ISF), mosquito-borne flavivirus (MBFV), tick-borne flavivirus (TBFV), and no-known arthropod vector (NKV). While ISFs only infect invertebrate hosts and NKVs primarily infect vertebrate hosts, MBFVs and TBFVs can replicate in both. In accordance with the variation in their host tropism, flaviviruses from these different groups also vary in their mode of transmission between infected individuals. ISFs are usually transmitted vertically, from arthropod mother to offspring (Halbach et al. 2017), while NKVs were shown to transmit mainly horizontally between their vertebrate hosts (Blitvich and Firth 2017). In contrast, MBFVs and TBFVs, which possess dual-host tropism, undergo alternating transmission cycles between their arthropod vectors and their vertebrate hosts (Franz et al. 2015). As such, successful transmission of vector-borne viruses from their arthropod vectors to their vertebrate hosts entails the infection of both the midgut and the salivary gland of the vector (Brackney and Armstrong 2016; Miesen et al. 2016). Aside from transmission via insect bites, vertebrates can be infected through ingestion of the insect (Dobson et al. 1995), consumption of dairy and meat of infected vertebrates (Swanepoel et al. 1985; Offerdahl et al. 2016), and blood transfusions (Leiby and Gill 2004).

Vector-borne viruses exhibit greater host plasticity, infecting up to three times the number of host taxonomic groups, compared to non-vector-borne viruses (Kreuder Johnson et al. 2015). This property could be attributed to the increased opportunities for effective contact across diverse animal hosts enabled by vector-borne transmission. Indeed, vectors provide an effective bridge for transmission of disease from wild animals that do not normally contact humans, thereby increasing the zoonosis frequency of vector-borne viruses. Furthermore, vector-borne flaviviruses demonstrate highly variable host plasticity, with some infecting only closely related hosts while others infect a wide range of vertebrate taxa (Weaver and Barrett 2004). For example, the vector-borne West Nile virus can infect a wide range of vertebrate hosts, from humans and horses to birds (e.g., Turdus merula), while the vector-borne yellow fever virus was only isolated from primates (Mihara et al. 2016; Michel et al. 2019). These differences in host range make the Flavivirus genus a compelling target for studying the viral traits that control host range and specificity. Such knowledge can be utilized for early detection of outbreaks through identification of potential host species and monitoring of viral threats (Bicca-Marques and de Freitas 2010).

The evolutionary basis for the adaptability of vector-borne viruses to multiple hosts has been previously explored. Three hypotheses consider the fitness landscape of vector-borne viruses with respect to their multiple hosts (reviewed in Novella et al. 2012). The first hypothesis, termed the tradeoff hypothesis, proposes that vector-borne viruses maintain adequate fitness in their multiple hosts in exchange for lower adaptation capabilities towards any one of them (Wilson and Yoshimura 1994; Kassen 2002). The second hypothesis suggests that vector-borne viruses do occasionally reach an optimal peak that is shared across multiple hosts. The third hypothesis argues that the evolutionary constraints of vector-borne viruses are mainly determined by their vector host due to its key role in transmission to vertebrates. Accordingly, viruses reside in the optimal fitness peak within the vector, rather than in that of the vertebrate host. All three hypotheses are supported by different experimental studies (e.g., Ciota et al. 2008; Deardorff et al. 2011). The discrepancy across these studies could be attributed to differences in the methods utilized for measuring fitness and for manipulating the passage of vector-borne viruses through their different hosts. For example, the quasispecies theory proposes that viral fitness should be measured while considering the variation across multiple viral individuals within a population, rather than in individual viruses (Domingo et al. 2012). This variation, viewed as mutant clouds, is induced by a population of interacting viruses and enables the expression of phenotypic variance and virus adaptability (Eigen 1996; Wilke 2005).

In addition to intrinsic viral characteristics, host traits and environmental factors may also influence viral capability to replicate and persist within hosts. Thus, the identification of macro-ecological attributes that facilitate viral persistence can contribute to our understanding of the variation, prevalence, and propensity of viruses towards specific hosts. Several recent studies have utilized machine learning and explanatory models to predict virus-host associations based on traits of viruses and their respective hosts (Olival et al. 2017; Babayan et al. 2018; Pandit et al. 2018; Mollentze and Streicker 2020; Albery et al. 2020). Yet, the choice of the traits used for predictions and the underlying causes for their contribution to the prediction accuracy are not fully understood. In addition, empirical validation of such predictions is still largely missing. In this review, we first examine the phylogenetic relationships within the Flavivirus genus and point to the multiple potential shifts in host tropism. We then discuss infection barriers and phenotypic and genomic differences between members of the Flavivirus genus. Finally, we review emerging methods that partially utilize this knowledge, together with a range of viral and host traits, to predict potential host shifts within the Flavivirus genus and other viral genera.

Phylogenetic Relationships and Evolution Among Flaviviruses

Taking advantage of the many complete genomic sequences of flaviviruses available at that time, Moureau et al. (2015) established the phylogenetic relationships within the Flaviviridae family in the context of vector-host relationships. In agreement with previous studies (Cook et al. 2012; Marklewitz et al. 2015), the ISF clade was placed externally to all other clades, suggesting that the ancestral state at the root of the Flavivirus genus was limited to infecting invertebrates. Based on that phylogeny, at least five major transitions in host compatibility were observed (Fig. 1 in Moureau et al. 2015): two leading to the TBFV and NKV clades, and another to a large clade dominated by mosquito-borne viruses. Two additional transitions were observed within this latter clade. First, three NKV viruses (sokuluk, Entebbe bat and yokose viruses), which were later denoted NKV-like viruses, were clustered within the MBFV group. This suggests that the ancestors of these NKV-like viruses were able to replicate in mosquitoes, and that this ability has been lost in the present lineages (Cook and Holmes 2006). Despite its different categorization of the Flavivirus groups of species, the phylogeny presented in Ochsenreiter et al. (2019), which was also reconstructed based on the complete polyprotein sequences of several flaviviruses, also supports the paraphyly of MBFVs with the three NKV-like viruses. Second, seven previously unknown ISF viruses, later denoted ISF-like viruses, which do not exhibit an ability to replicate in vertebrate cells, were clustered together with MBFVs such as dengue and yellow fever viruses (Huhtamo et al. 2009, 2014; Junglen et al. 2009). Thus, these viruses may have lost the genetic trait that enabled them to infect vertebrate hosts.

Fig. 1

Phylogenetic tree of 83 Flavivirus species. The tree is colored according to host-tropism groups: insect-specific (ISF) in gray, no-known vector (NKV) in blue, tick-borne (TBFV) in green, mosquito-borne (MBFV) in red, NKV-like in orange, and ISF-like in yellow. Twelve species with available NS5 sequence data but whose classification is unknown or whose sequences are unreliable were pruned from the tree.

Following the accumulation of sequences from additional Flavivirus species, we have re-examined the phylogenetic relationships within Flavivirus based on the NS5 protein sequences (Fig. 1), as opposed to the longer and more variable polyprotein sequences that were used in Moureau et al. (2015). The vital functionality of NS5 in viral replication (Murray et al. 2008; Bollati et al. 2010) and suppression of the host immune response (Lin et al. 2006) leads to high genetic conservation, making it an appealing phylogenetic marker as well as a target for antiviral treatments. Sequences of different Flavivirus species were collected from the Virus Pathogen Resource database (Pickett et al. 2012). These were supplemented with several sequences of species that were present in Moureau et al. (2015) but absent from the Virus Pathogen Resource database. A single representative NS5 protein sequence was selected for each species using CD-HIT version 4.6 (Li and Godzik 2006). The sequences were then aligned with MAFFT version 7.471 (Katoh et al. 2002). The phylogenetic tree was reconstructed using IQ-TREE version 2.0.3 (Minh et al. 2020) given an amino-acid substitution model customized for Flavivirus evolution (Le and Vinh 2020), coupled with a four-category Gamma distribution modeling rate variation across sites, with bootstrap analysis consisting of 1000 repeats. The resulting tree was rooted using minimal ancestor deviation as implemented in Python (Tria et al. 2017). The accessions of the selected sequences (Supplementary text S1), the alignment (Supplementary text S2), the resulting phylogeny (Supplementary text S3), and the classification of species to host tropism (Supplementary text S4) are available in the supplementary materials.
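
The reconstruction steps described above can be outlined as a sequence of command-line calls. The following Python sketch only assembles the argument lists and is not a record of the exact commands used here: the file names, the CD-HIT identity threshold, the model string FLAVI+G4 (which assumes that the Flavivirus model is available under that name in the installed IQ-TREE build), and the use of the standard bootstrap option are all assumptions, and the rooting step is left out.

```python
def representative_command(species_fasta, out_fasta, identity=0.9):
    """Cluster the NS5 sequences of ONE species; CD-HIT writes one
    representative per cluster to out_fasta. The threshold is an assumption."""
    return ["cd-hit", "-i", species_fasta, "-o", out_fasta, "-c", str(identity)]


def alignment_and_tree_commands(pooled_fasta, prefix="ns5",
                                model="FLAVI+G4", bootstraps=1000):
    """Commands for aligning the pooled representatives and inferring a tree.

    MAFFT prints the alignment to standard output, so the caller is expected
    to redirect it into the file named under "alignment_file".
    """
    alignment_file = f"{prefix}.aligned.fasta"
    return {
        "alignment_file": alignment_file,
        "align": ["mafft", "--auto", pooled_fasta],
        "tree": ["iqtree2", "-s", alignment_file, "-m", model,
                 "-b", str(bootstraps), "--prefix", prefix],
    }


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(representative_command("dengue_ns5.fasta", "dengue_rep.fasta")))
    steps = alignment_and_tree_commands("all_representatives.fasta")
    print(" ".join(steps["align"]), ">", steps["alignment_file"])
    print(" ".join(steps["tree"]))
```

Each list can be passed to `subprocess.run` once the corresponding tool is installed. Keeping the model and the number of bootstrap repeats as parameters makes it straightforward to compare alternative settings, such as the WAG model used in the earlier phylogeny.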

The reconstructed phylogeny consists of 95 species, 31 of which are absent from the phylogeny reconstructed by Moureau et al. (2015). There is general agreement between the two phylogenies, despite differences in the reconstruction strategies employed: the two phylogenies differ with respect to the taxonomic level (species versus strains), sequence marker (polyprotein ORF versus NS5 protein), substitution model [WAG (Whelan and Goldman 2001) versus FLAVI (Le and Vinh 2020)], and phylogenetic reconstruction method (Bayesian versus maximum likelihood). In the phylogeny obtained here, all five transitions identified previously could be detected, although the exact placement of the NKV-like clade varied between the two phylogenies. In the phylogeny reconstructed here, an additional transition from a putative MBFV ancestor to the ISF-like nounane and barkedji viruses could be identified. This transition suggests the possibility that more than one back-transition from dual-host tropism to single-host tropism has occurred, although this additional transition is supported by a weak bootstrap value (0.32) and could be an artifact of the relatively short sequence length of the NS5 protein. Additional putative transitions of this type have been implicated. For example, the rabensburg virus, which is considered an ISF, is closely related to the mosquito-borne West Nile virus (Shah-Hosseini et al. 2014; Elrefaey et al. 2020), but its NS5 sequence is missing from our compilation and it is thus not included in our phylogeny. A promising future research direction would be to verify whether these back-transitions have occurred and, if so, which genetic modifications have led to the loss of dual-host tropism in these lineages.

Host Barriers of ISFs Preventing Infection of Vertebrates

The placement of the root in the Flavivirus phylogeny reconstructed here, as well as in those previously inferred (Cook et al. 2012; Marklewitz et al. 2015; Moureau et al. 2015; Ochsenreiter et al. 2019), suggests that vector-borne flaviviruses are the descendants of ISF lineages that later acquired dual-host tropism. Detecting the underlying genomic transitions that led to the acquisition of these abilities in vector-borne viruses could contribute to the identification of viral species which are more likely to acquire such abilities in the future and consequently widen their range of hosts. For an in-depth review of the bottlenecks of mosquito-specific viruses to enter and replicate in vertebrate cells, see Halbach et al. (2017). Below we point to their main findings and review additional research concerning TBFVs.

Higher temperature in vertebrates compared to invertebrates was shown to act as a replication bottleneck in mosquito-specific viruses (Jerzak et al. 2008). For example, studies on the insect-specific rabensburg virus, which is closely related to the mosquito-borne West Nile virus, showed that acquisition of the ability to replicate at higher temperatures can lead to vertebrate host compatibility (Aliota and Kramer 2012; Aliota et al. 2012; Ngo et al. 2019). However, studies on other ISFs indicate that additional factors, other than the ability to replicate at higher temperatures, are needed for successful transmission of ISFs to vertebrate hosts (Huhtamo et al. 2014).

Inherent vertebrate factors can also limit their infection by ISFs. Such factors can be either virus agonists, which are required for the virus to complete its replicative cycle in the infected cell (Junglen et al. 2017), or virus antagonists, of which the most widely recognized are interferons that suppress viral replication. It appears that certain proteins in vector-borne flaviviruses, including NS1 and NS5, interact with either of these host factor types to enable successful infection of vertebrate cells (Grant et al. 2016; Best 2017; Xia et al. 2018). The inability of ISFs to infect vertebrates suggests that these proteins interact differently with the host factors in ISFs, thereby limiting viral infection. This altered functionality of insect-specific viral proteins enabled the development of a vaccine against the mosquito-borne chikungunya virus (an alphavirus rather than a flavivirus) based on a recombinant insect-specific virus carrying the chikungunya virus structural proteins (Erasmus et al. 2017).

Invertebrate factors may also play a role in host restriction of ISFs through tissue-specific immune response. Successful infection of vertebrate hosts via insect bites requires the presence of viral particles in the insect salivary gland. However, the insect immune response in the salivary gland acts as a barrier to many ISFs, preventing their transmission to vertebrates (Blitvich and Firth 2015; Hall-Mendelin et al. 2016). While the mechanisms underlying suppression of this tissue-specific immune response by vector-borne viruses are not well understood, evidence indicates that mosquito-borne viruses overcome this response through neutralizing agents. For example, Kent et al. (2010) showed that co-infecting mosquitoes with the insect-specific Culex flavivirus and the mosquito-borne West Nile virus enabled successful infection of the salivary gland by both. This suggests that the ability of the West Nile virus to infect the salivary gland involves neutralization of some unknown mosquito factors that would otherwise prevent infection by ISFs. Tick-borne viruses face similar tissue-specific barriers in the infected tick (Nuttall 2014). Interestingly, the tick salivary gland was shown to contain active molecules that aid tick-borne viruses in coping with the vertebrate immune response following their transmission to vertebrates (Hermance and Thangamani 2015; Kotál et al. 2015).

Proteome variation between mosquito-borne and mosquito-specific flaviviruses may also explain differences in host compatibility. Aside from fixed proteome differences, dynamic variation can be the result of programmed ribosomal frameshifting (PRF), a phenomenon in which viruses harbor sequences that induce a proportion of translating ribosomes to shift by one nucleotide position to a new reading frame, thus producing a ‘transframe’ fusion protein (Firth and Brierley 2012). For example, a PRF product, termed NS1', has been observed in several MBFVs (Moureau et al. 2015) and was reported to play a role in the inhibition of the vertebrate interferon-mediated immune response (Zhou et al. 2018) and in increased transmission efficiency from mosquitoes to vertebrates (Melian et al. 2014). Another PRF that is unique to ISFs results in a modified NS2A/NS2B-coding region, termed Fairly Interesting Flavivirus ORF (FIFO) (Firth et al. 2010). The function of this PRF product is still unknown. Additional research is required to better understand the prevalence of other PRFs, whether they are also present in TBFVs, and their possible contribution to mediating host compatibility.

Finally, UTRs of flaviviruses include conserved, and often duplicated, RNA structural elements that were shown to be involved in various stages of the viral life cycle (Filomatori et al. 1995; Alvarez et al. 2005; Manzano et al. 2011; Brinton and Basu 2015; Ng et al. 2017). Unique structures were found in the different Flavivirus groups, suggesting that their presence could be associated with host adaptation (Davies and Pedersen 2008; Villordo et al. 2015, 2016; Pallarés et al. 2020). Ochsenreiter et al. (2019) performed a comparative analysis of the 3’UTR structures and detected a consistent 3’UTR architectural organization across TBFVs, and inconsistent organization in ISFs and NKVs, suggesting that the organization detected in TBFVs may contribute to their dual-host tropism.

Genome Composition of Flavivirus

Due to their alternating replication in invertebrate and vertebrate hosts, vector-borne flaviviruses face strong selective constraints that could limit reaching optimal fitness in either of them. These cumulative constraints are likely stronger than the ones that operate on single-host viruses. The fitness landscape of vector-borne viruses is constantly fluctuating according to their current host, leading to time-averaged adaptation (Wilke 2001). At the quasispecies level, these fluctuating environments are expected to affect the size of the mutant cloud and the pattern of mutation accumulation, reflecting the genetic variation within the virus population and its ability for adaptation. However, this does not always seem to be the case (Novella et al. 2012).

Molecular differences between dual-host and single-host viruses may also be observed at the genome composition level, reflected by variation in the relative frequencies of pairs of nucleotides and codons compared to their expected frequencies across genomic sequences, termed dinucleotide and codon usage, respectively. Lobo et al. (2009) explored the differential patterns of dinucleotide and codon usage across different Flaviviridae groups and their hosts. Using extracted ORF sequences from the complete genomes of 39 Flaviviridae species and the complete mRNA sequences of 9 vertebrates and 4 invertebrates, the authors discovered that the transcriptomes of all vertebrate-infecting flaviviruses display a dinucleotide usage pattern similar to that of vertebrates. Specifically, they found a bias against CpG and CpA in these groups, which was in accordance with the one observed in vertebrates. In vertebrates, this bias could be explained by cytosine methylation and deamination of the DNA in regions that are rich in CpG and CpA, a mechanism that is absent from insects (Bird 2007). In vertebrate-infecting flaviviruses, this bias can be explained as either mimicry of the host genomic composition (Kandimalla et al. 2003) or evasion of host immune system components, such as the zinc-finger antiviral protein (ZAP), which recognizes CpG-containing viral RNA sequences (Takata et al. 2017; Odon et al. 2019; Meagher et al. 2019; Luo et al. 2020). A similar bias was not observed in ISFs. Indeed, Colmant et al. (2021) showed that the insect-specific binjari and Hidden Valley viruses were able to replicate in ZAP-knockout human cells, leading to the conclusion that ZAP is an important barrier preventing ISF infection in vertebrates. Lobo et al. (2009) also observed a bias against UpA that was consistent across all viral groups. UpA is known to trigger a vertebrate immune response mediated by the RNase L enzyme (Han et al. 2004), but no such response has been recorded in insects. However, low UpA abundance can be explained in ISFs by mimicry of the dinucleotide usage of the insect transcriptome, which is underabundant in UpA-containing tRNAs (Sexton and Ebel 2019). In general, UpA avoidance in both vertebrates and invertebrates can be driven by avoidance of the TATA box, of the TAA and TAG stop codons, or of transcribed UA-rich RNA regions that lead to mRNA decay (Karlin and Mrázek 1997). Finally, the biases against CpG and UpA appear to be compensated by a positive bias towards UpG in all the examined groups.
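
Dinucleotide usage of the kind discussed above is commonly summarized as an odds ratio, i.e., the observed frequency of a dinucleotide divided by the product of the frequencies of its two nucleotides, so that values well below 1 indicate underrepresentation. The following Python function is a minimal sketch of that calculation; the function name and the handling of ambiguous bases are our own assumptions and do not reproduce the exact procedure of any of the cited studies.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGU"


def dinucleotide_odds_ratios(sequence):
    """Return {dinucleotide: observed / expected} for one RNA or DNA sequence.

    Expected frequency is the product of the two single-base frequencies.
    Pairs containing an ambiguous base (e.g., N) are ignored (an assumption).
    """
    seq = sequence.upper().replace("T", "U")
    mono = Counter(base for base in seq if base in BASES)
    pairs = Counter(a + b for a, b in zip(seq, seq[1:])
                    if a in BASES and b in BASES)
    n_mono, n_pairs = sum(mono.values()), sum(pairs.values())
    if n_pairs == 0:
        raise ValueError("sequence contains no unambiguous dinucleotide")

    ratios = {}
    for a, b in product(BASES, repeat=2):
        expected = (mono[a] / n_mono) * (mono[b] / n_mono)
        observed = pairs[a + b] / n_pairs
        # Undefined when one of the bases never occurs in the sequence.
        ratios[a + b] = observed / expected if expected > 0 else math.nan
    return ratios


if __name__ == "__main__":
    toy_orf = "AUGGCUGCAUGCUGAUGGCAUGCCUGA"  # toy sequence, not a real genome
    for pair, ratio in sorted(dinucleotide_odds_ratios(toy_orf).items()):
        print(pair, round(ratio, 2))
```

Applied to a viral ORF and to host mRNA sequences, the resulting 16 values per sequence can be compared directly, for instance to check whether CpG and UpA are depleted in both.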

Several studies conducted a codon usage analysis to examine whether co-evolutionary signals are expressed at the codon level as well. In accordance with their dinucleotide analysis, Lobo et al. (2009) found a bias against CpG-containing codons in vertebrate-infecting viruses and in their vertebrate hosts, but not in ISFs. However, no depletion in CpA-containing codons was found, as might be expected from the detected dinucleotide underabundance. The authors also found depletion in UpA-containing codons in both insects and ISFs, which suggested mimicry of host codon usage. Notably, the analysis of Lobo et al. (2009) indicated that genome composition in vector-borne flaviviruses is influenced mostly by their vertebrate hosts, a finding that contradicts previous hypotheses that considered the insect vectors as the dominant driver of vector-borne viral evolution (see Introduction).

A recent study by Di Paola et al. (2018) investigated the codon adaptation of flaviviruses to their hosts using the codon adaptation index (CAI) (Sharp et al. 1986; Sharp and Li 1987), based on extracted ORFs from the complete genomic sequences of 205 flaviviruses and the codon usage tables of three vertebrate species and three invertebrate species. As expected, the CAIs to human genes were found to be higher in vector-borne flaviviruses compared to those of ISFs. The authors also detected significantly higher CAI to vertebrate genes in MBFVs compared to that of TBFVs, suggesting more rapid adaptation in the former group. The higher adaptation to humans in vector-borne viruses may be associated with their increased translation and replication capabilities in vertebrates (Andersen et al. 2015; Cugola et al. 2016). However, codon usage biases have not been directly associated with increased replication rates (Vasilakis et al. 2009; Shin et al. 2015). Interestingly, Di Paola et al. (2018) also collected sequences of West Nile viral genomes from the time of its recent emergence out of Africa and into Europe (2010–2014). The authors observed increased CAI of the virus to human housekeeping genes between 2012 and 2014, compared to previous years. This increase was correlated with increased levels of infection, suggesting an association between codon usage adaptation and high infection efficacy. The increase in CAI during the time of the outbreak of the West Nile virus in Europe could potentially be attributed to increased exposure to humans. Thus, increasing similarity of codon usage between humans and vector-borne viruses that are not yet adapted to human hosts could possibly serve as an indicator of emerging zoonosis.
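
For readers unfamiliar with the CAI, the index is the geometric mean of the relative adaptiveness of the codons in a coding sequence, where the relative adaptiveness of a codon is its count in a reference gene set divided by the count of the most frequent synonymous codon. The sketch below illustrates this definition in Python under several assumptions of our own: the standard genetic code, a pseudocount of 0.5 for codons unseen in the reference, and a toy reference table that does not come from any host genome.

```python
import math
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa
                for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def relative_adaptiveness(reference_counts, pseudocount=0.5):
    """Map each codon to count / (highest count among its synonymous codons).

    Stop codons and single-codon amino acids (Met, Trp) are skipped, as are
    amino acids absent from the reference. The pseudocount is an assumption.
    """
    families = defaultdict(list)
    for codon, amino_acid in GENETIC_CODE.items():
        families[amino_acid].append(codon)
    weights = {}
    for amino_acid, family in families.items():
        if amino_acid == "*" or len(family) < 2:
            continue
        counts = [reference_counts.get(codon, 0) for codon in family]
        top = max(counts)
        if top == 0:
            continue
        for codon, count in zip(family, counts):
            weights[codon] = max(count, pseudocount) / top
    return weights


def codon_adaptation_index(orf, weights):
    """Geometric mean of the weights of the scored codons in an ORF."""
    seq = orf.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no codon in the ORF can be scored")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    toy_reference = {"GCT": 10, "GCC": 5, "AAA": 8, "AAG": 2}  # hypothetical
    w = relative_adaptiveness(toy_reference)
    print(round(codon_adaptation_index("GCUAAGGCC", w), 3))
```

In practice the reference counts would be taken from a codon usage table of the host of interest, and the same viral ORF could be scored against several hosts to compare its adaptation to each.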

Notably, the studies described above showed resemblance in dinucleotide bias and codon usage between flaviviruses and their hosts. Di Giallonardo et al. (2017) hypothesized that such characteristics in viral species are more closely associated with their evolutionary history than with their hosts. This hypothesis is supported by the observation that dinucleotide bias in viruses simply reflects background mutation pressure (Wright 1990; Jenkins and Holmes 2003). To test their hypothesis, Di Giallonardo et al. predicted the host group and the viral taxonomic classifications based on the odds ratios of all 16 dinucleotides of 29,310 viral sequences from 20 families spanning a variety of hosts, some of which belonged to the Flavivirus genus. The prediction sensitivity was much higher for viral taxonomic classification than for host groups (true positive prediction rate of 0.71 compared to 0.33). Moreover, exclusion of dual-host viruses from the analysis did not result in improved sensitivity of host prediction. However, the sensitivity of host prediction based on dinucleotide information was much higher for the Flaviviridae family (true positive prediction rate of 0.76) and even more so when focusing on Flaviviridae viruses from the “vector-borne” host category. This suggests that vector-borne flaviviruses carry a unique genomic compositional signature, and this may be utilized for predicting virus-host associations in this group.
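
The true positive prediction rate quoted above is the fraction of sequences of a given class that are assigned to that class. A minimal Python sketch of this measure follows; the labels are invented for illustration, and the way per-class values were aggregated in the original study is not stated here, so the unweighted mean shown is only one possible choice.

```python
from collections import Counter


def per_class_sensitivity(true_labels, predicted_labels):
    """Return {class: fraction of its members that were predicted correctly}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def mean_sensitivity(true_labels, predicted_labels):
    """Unweighted mean of the per-class sensitivities (an assumed aggregate)."""
    rates = per_class_sensitivity(true_labels, predicted_labels)
    return sum(rates.values()) / len(rates)


if __name__ == "__main__":
    # Hypothetical host-group labels for four viral sequences.
    truth = ["vertebrate", "vertebrate", "vertebrate", "invertebrate"]
    guess = ["vertebrate", "invertebrate", "vertebrate", "invertebrate"]
    print(per_class_sensitivity(truth, guess))
    print(mean_sensitivity(truth, guess))
```

Reporting the per-class values alongside the aggregate is useful when classes are imbalanced, since a single well-represented host group can otherwise dominate the summary.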

Methods for Prediction of Host Shifts and Compatibility

Identification of potential host shifts and zoonotic viral species can aid in the ongoing arms race against viral threats by enabling risk-based allocation of research and surveillance efforts. In this section, we discuss recent studies that have attempted to tackle these challenges, some of which focused on Flavivirus and others that examined a wider range of viruses.

Pandit et al. (2018) divided 35 Flavivirus species into several groups based on their known primary hosts and collected 29 traits of their hosts. Based on these traits, the authors trained a machine learning model for each group of flaviviruses to predict the identity of hosts that it may infect. The models corresponding to dengue and Japanese encephalitis viruses predicted 139 and 388 novel hosts for the respective viruses. The host traits that contributed most to the predictive model were those related to the geographical range of the species, body mass, and a feature that accounted for the biases in research efforts (i.e., the number of PubMed hits). In the model corresponding to the group of zika and Japanese encephalitis viruses, host metabolic rate was also identified as an important feature. This study demonstrated that geographical and physiological traits of the hosts can serve as important predictors of potential host shifts in flaviviruses. Interestingly, the importance of some features varied across models trained for different Flavivirus groups, which could indicate the presence of variation in the mechanisms conferring host adaptation in these different groups.

Other studies have built prediction models based on viral traits rather than those of the hosts. Many of these studies focused on genomic traits that have been suggested to indicate adaptation to hosts. For example, Babayan et al. (2018) trained gradient boosting models on features describing the phylogenetic relatedness between viruses as well as various genomic compositional biases of ssRNA viruses to predict for a given virus: (1) its host, (2) whether it can be transmitted via an arthropod vector, and if so (3) its transmission vector. The models produced predictions of high accuracy as assessed using cross-validation (0.84, 0.97, and 0.91 for the respective three categories). The models also predicted the hosts of 36 viruses whose hosts have not been verified empirically and arthropod vector compatibility for 17 viruses that have not been classified as vector-borne thus far. Finally, the models predicted the vector class (i.e., midge, mosquito, sandfly, tick) of 31 vector-borne viruses whose vectors are currently unknown. However, some of these predictions were found to be contradictory to current evidence. For example, the model for reservoir hosts predicted Pterobats and Vestbats as the hosts of the MERS virus, while evidence suggests that its main host is in fact an artiodactyl (Ghai et al. 2021). An interesting future development could combine features that are based on traits of both the host and the virus for predicting probable virus-host associations. Furthermore, additional features could be integrated into the learning framework to strengthen its predictive power, including those that are related to transmission bottlenecks such as the host body temperature, virus transmission mode, identity of duplicated molecular secondary structures in the viral 3’UTR, and the identity of different products of programmed ribosomal frameshifting (Fig. 2).
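
The cross-validated accuracies reported for such models are obtained by repeatedly holding out part of the labeled data, training on the remainder, and scoring the predictions for the held-out part. The Python sketch below illustrates k-fold cross-validation only. To remain dependency-free it uses a nearest-centroid classifier as a stand-in for the gradient boosting models of the cited study, and the two-feature vectors (imagined as CpG and UpA odds ratios) and labels are invented for illustration.

```python
import random


def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label whose mean training vector is closest to the query."""
    sums, counts = {}, {}
    for row, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - q) ** 2
                   for s, q in zip(sums[label], query))

    return min(sums, key=squared_distance)


def k_fold_accuracy(features, labels, k=5, seed=0):
    """Fraction of samples predicted correctly when each fold is held out once."""
    if not 2 <= k <= len(features):
        raise ValueError("k must be between 2 and the number of samples")
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # unstratified split
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in fold:
            prediction = nearest_centroid_predict(train_x, train_y, features[i])
            correct += prediction == labels[i]
    return correct / len(features)


if __name__ == "__main__":
    # Hypothetical [CpG odds ratio, UpA odds ratio] vectors for 20 viruses.
    x = [[0.30 + 0.01 * i, 1.0] for i in range(10)]
    x += [[1.00 + 0.01 * i, 0.9] for i in range(10)]
    y = ["vertebrate-infecting"] * 10 + ["insect-specific"] * 10
    print(k_fold_accuracy(x, y, k=5))
```

Because closely related viruses tend to share both composition and hosts, a random split such as this one can overestimate accuracy; holding out whole clades would give a more conservative estimate.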

Fig. 2

Host and virus factors that can be utilized for predictions of virus-host associations. The figure depicts the factors discussed within the scope of this review that can be used as features in machine learning models for predictions of virus-host associations in Flavivirus. Factors that characterize different hosts are shown at the left side while factors that characterize viruses are shown at the right. These features are fed into machine learning methods to predict various virus-host associations, such as probable hosts or dual-host tropism. The four Flavivirus host-tropism groups listed in the figure are: mosquito-borne (MBFV), tick-borne (TBFV), insect-specific (ISF), and no-known vector (NKV).

While computational methods provide the opportunity to detect the host range of novel viruses, empirical validation of their results can be challenging. Isolation and identification of viruses from hosts can be achieved in several ways, including methods based on serological samples, such as enzyme-linked immunosorbent assays (ELISA) and immunofluorescence assays (Gubler et al. 1984; Innis et al. 1989), and molecular techniques that involve nucleic acid amplification using polymerase chain reaction (PCR) (or reverse transcriptase PCR in the case of RNA viruses) (Guarneri et al. 2001) or nucleic acid hybridization and microarrays. While molecular methods are considered more sensitive (Lanciotti et al. 2000; Lanciotti and Kerst 2001), they may fail to detect viruses that produce low and short-lived viremias, such as the West Nile virus (Murray et al. 2011). Molecular methods can also be applied to archival samples to characterize epidemics over time and to retrospectively diagnose viral diseases (Frisbie et al. 2004; Decaro et al. 2013).

The above experimental detection methods rely on prior knowledge of the identity of the viruses that are expected to reside in a sample of interest and are thus limited to viruses that are well described. To overcome this limitation, detection methods based on metagenomic data were developed. Such methods use data collected from environmental samples and animal tissues and do not rely on viral amplification in cell culture (Delwart 2007). Rather, these methods entail the use of computational tools for the assembly and identification of viral sequences from high-throughput metagenomic sequence data (Cholleti et al. 2018). Viral metagenomic analysis has provided means for discovery of many novel viruses. For example, Li et al. (2015) identified 112 novel RNA viruses across 70 arthropod species. In another large-scale analysis of metagenomic samples extracted from 220 invertebrate species, Shi et al. (2016) discovered 1445 novel RNA viruses. Still, the use of metagenomic-based methods entails several challenges. First, assembly that is based on viral reference genomes may fail to detect viral sequences that have diverged from the reference, while a reference-free de novo assembly is more prone to producing ambiguous and chimeric sequences of closely related viruses (Domingo et al. 2012). Furthermore, classification of viral sequences is based on similarity searches and is thus limited by the variety of annotated viral genomes in existing databases (Simmonds 2015). A promising future development could utilize the abundance of available metagenomic data from different sources for the simultaneous analysis of multiple viral metagenomes using a pan-genomics approach, similar to approaches for profiling microbial communities (Zhong et al. 2021). This will enable distinguishing between genuine viral species and those that are a product of specific modifications in a metagenomic sample and are not likely to survive. Such developments would also enable the identification of viruses with a wide host range and the detection of conserved genetic regions that confer wide host compatibility.

Conclusions

In the past few years, important progress has been made in our understanding of Flavivirus evolution, including better resolution of the phylogenetic relationships and co-evolutionary patterns at the genomic level between viruses and their hosts. Ongoing advances in predictive models that utilize artificial intelligence techniques are expected to play a key role in research efforts aimed at deciphering the differences between distinct Flavivirus groups and the mechanisms that confer compatibility with their different host types. To date, learning methods trained on viral traits have demonstrated great potential for the prediction of novel vectors and hosts based on cross-validation. However, empirical support for the predicted virus-host associations is largely missing, and model results can be biased by imbalanced data or suffer increased error due to the “curse of dimensionality” (Agany et al. 2020). It is expected that models that utilize a combination of features, spanning both viral and host traits, would provide greater predictive power. Such anticipated developments should contribute greatly to the ongoing battle against viral threats and would focus viral surveillance efforts on host species and viral species with high predicted potential of transmission to humans.