Introduction

One of the reasons why viruses are so prone to causing epidemics stems from their high genetic diversity which, in the case of influenza A viruses (IAVs), is in part due to their high mutation rate and to their segmented genome, comprised of eight negative single stranded RNA molecules. The ten to twelve proteins encoded across these segments (Wise et al. 2009) play different roles in the life cycle of the virus and thus undergo different selective pressures. The two most studied proteins are the hemagglutinin (HA) and the neuraminidase (NA), responsible for host cell recognition and entry, and for facilitating virus release from infected cells, respectively (e.g., Neumann et al. 2009). The antigenic properties of these two cell surface antigens are used to name and classify IAV subtypes, which can form almost all possible combinations between the 17 known subtypes of HA and the ten known for NA (Tong et al. 2012). While most of these subtypes have been observed in wild waterfowl, the most prevalent subtypes in the human population are H1N1 and H3N2, with H3N2 dominating human epidemics since its emergence in 1968 (Rambaut et al. 2008; Guan et al. 2010).

Each of the eight IAV segments can be exchanged between viruses coinfecting the same host, thereby leading to reassortant viruses. This process has the potential to change the antigenic properties of the virus in a dramatic way, leading to an “antigenic shift.” Such a process was at the origin of both the 1968 H3N2 pandemic (Guan et al. 2010) and the 2009 H1N1 pandemic (Smith et al. 2009), and can be experimentally shown to lead to the adaptation of swine and/or avian reassortant viruses to a mammalian host (Imai et al. 2012). In addition to reassortment, the genetic material of the virus also undergoes mutations which, in the absence of a proofreading mechanism, lead to substitution rates as high as \(\approx 5 \times 10^{-3}\) substitutions per site per year (Rambaut et al. 2008). Evidence, however, suggests that not all subtypes evolve at the same rate. For instance, emergence of new H1N1 variants is often slow while H3N2 has often undergone rapid evolution and dissemination, as evidenced by the A/Sydney/5/97-like viruses that were detected in all parts of the world six months after their initial discovery (Hay et al. 2001).

In the face of these constantly evolving antigens, the human immune system detects the infecting virus and generates antibodies that contain current viruses and help prevent future infections. This immune machinery can easily recognize similar viral variants, but novel influenza strains that are antigenically different from their progenitors can evade the host immune system. Vaccines help boost the human immune system, but the rapid viral evolution described above demands that the composition of the influenza vaccine be updated every year. Currently, the trivalent vaccine targets two IAV subtypes (H3N2 and H1N1) and influenza B (Hay et al. 2001). Vaccine composition is reevaluated every year based on recommendations from the World Health Organization and other National Influenza Centers distributed around the world (see Hay et al. 2001) Footnote 1. Candidate strains for vaccine composition are based on circulating viruses. Strain selection begins 8–10 months before the vaccine is available to the public, but the data used to determine candidate strains are inaccessible to the general public (Salzberg 2008) and are largely based on HA inhibition assays that often have poor resolution in distinguishing between strains (Plotkin et al. 2002).

While the strain-selection process can be effective and match as much as 91 % of the circulating viruses, as in 2012–2013 Footnote 2, it remains imperfect. In 2007 for instance, a virulent H3N2 variant emerged in Australia and New Zealand a short time before the onset of the influenza season in the Southern hemisphere (April–September). To some extent, because the virulent strain was not part of the 2007 vaccine, a widespread epidemic with a threefold increase in prevalence compared to regular seasons ensued (Owen et al. 2008). This then novel and highly infectious strain, identified as Brisbane/10/2007, crossed the equator to North America just before the onset of the Northern hemisphere’s influenza season (November–March), eliciting a similar epidemic during the 2007–2008 season (Saks 2008).

The failure of the 2007 and 2007–2008 vaccines shows that there is a need for additional methods for determining which strain to include in the vaccine for each upcoming season. Computational methods have long been sought to predict the emergence of influenza viruses. An early method looked at nucleotide substitutions in codons of the HA gene undergoing positive selection and used a phylogenetic approach to determine which HA sequence was most likely to emerge (Bush et al. 1999). This type of directional evolution was later dismissed as it was determined that the evolution of HA genes tends to be more clustered than linear (Plotkin et al. 2002). The segmented structure of the influenza genome must also be taken into account when attempting to predict future influenza strains: the rate of change is not constant across all segments (Holmes et al. 2005), epistatic interactions are likely to shape the virulence of a given virus (Neumann et al. 2009; Kryazhimskiy et al. 2011), and recombination / reassortment are key processes in the evolution of most viruses (e.g., Holmes 2009, p. 48).

To address the current dearth of prediction tools for the emergence of novel viruses, we introduce a very general phylogenetic approach that takes both selection and recombination / reassortment into account. Because simulations only confirm that a model performs well in the absence of model misspecification, we put the model to test in the worst possible scenarios: (i) detecting the emergence of Brisbane/10/2007, and (ii) out of cluster prediction in a longitudinal study design. In each context, a retrospective analysis of HA and NA data sets shows that our model has a different but low predictive power for these two genes. We show that including punctual bursts of evolution in our model almost doubles predictive power for HA, but not for NA. In turn, this result suggests that the evolution of HA is more episodic than that of NA in H3N2 viruses.

Methods

Overview of the Model

The objective of the model is to generate a sample of sequences that have a high probability of emerging, given a set of observed sequences. Let us denote the observed sequences as \((X_1,\ldots,X_t)=X_{1:t}\). If time \(t\) represents the current influenza season, then \(X_{1:t}\) represents a set of sequences sampled over the recent \(t\) seasons, and \(X_{t+1}\) represents the sequences sampled from the future season. The quantity of interest here is the posterior predictive probability of the data at season \(t+1\), given the data observed over the recent \(t\) seasons, or \(p(X_{t+1}|X_{1:t})\). This quantity can be decomposed as:

$$ p\left(X_{t+1}|X_{1:t}\right) = \int\limits_{\Theta} p\left(X_{t+1}|\theta\right) p\left(\theta|X_{1:t}\right) {\rm d}\theta $$
(1)

where θ is a vector of nuisance parameters, typically the branch lengths of the phylogenetic tree and the parameters of the model of evolution, and where \(\Theta\) denotes the state space of θ. Equation (1) represents the sum (integral) over the product of two probability density functions: \(p(X_{t+1}|\theta)\), the likelihood of θ given the future data, and \(p(\theta|X_{1:t})\), the posterior distribution of the nuisance parameters θ given the observed data. According to Bayes’ theorem, this posterior distribution is proportional to the product of the likelihood of θ given the sampled data, \(p(X_{1:t}|\theta)\), and a prior on nuisance parameters \(p(\theta)\):

$$ p\left(\theta|X_{1:t}\right)={\frac{p\left(X_{1:t}|\theta\right)p(\theta)} {p(X_{1:t})}} $$
(2)

The posterior predictive probability (Eq. 1), therefore, summarizes the information about the probability of new (emerging) sequences given the likelihood, the prior, a model of evolution and the observed data. However, the integration in Eq. (1) cannot be done analytically. Instead, we resorted to a two-step procedure where we first sample θ from the posterior distribution as in Eq. (2), and then use these sampled θ values to simulate future sequences \(X_{t+1}\) (e.g., Pagel and Meade 2006; Liu and Pearl 2007; Liu et al. 2008).
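The two-step procedure can be read as a Monte Carlo approximation of Eq. (1). As a sketch, with \(N\) denoting the number of posterior draws and \(\theta^{(i)}\) the \(i\)-th draw (both symbols are notation introduced here for illustration and do not appear in the original derivation):

$$ \theta^{(i)} \sim p\left(\theta|X_{1:t}\right), \qquad X_{t+1}^{(i)} \sim p\left(X_{t+1}|\theta^{(i)}\right), \qquad p\left(X_{t+1}|X_{1:t}\right) \approx {\frac{1}{N}} \sum_{i=1}^{N} p\left(X_{t+1}|\theta^{(i)}\right) $$

Under this reading, each simulated \(X_{t+1}^{(i)}\) is a draw from the posterior predictive distribution, because θ is integrated out by sampling rather than analytically.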

Computational Details

In the first step (Fig. 1, top), the θ values are drawn with the reversible-jump Markov chain Monte Carlo (rjMCMC) sampler implemented in OmegaMap ver. 0.5 (Wilson and McVean 2006). This model describes the evolution of codon data with selection and recombination under a standard coalescent prior (constant population size). The model has two parameters, collectively denoted as θ in Eq. 1: a selection parameter ω, which is the rate ratio of nonsynonymous to synonymous substitutions; and the population recombination rate ρ. Both can vary along the sequence by defining a block-like structure that segments an alignment of length L into at most L selection blocks and L−1 recombination blocks. In both cases, the number of blocks is estimated from the data. The model was parameterized as follows. Prior distributions on the block structures of ω and ρ were set to have mean block lengths of 20 and 74 codons, respectively, while in intensity these parameters were assumed to follow exponential priors centered on 1 for \(\omega\,(\exp(1))\) and 0.01 for \(\rho\,(\exp(1/10))\). The model also includes nuisance parameters that are defined over the entire length of the alignment: the transition to transversion rate ratio \(\kappa\sim\exp(3)\), the rate of synonymous transversion \(\mu\sim\exp(1/14)\), and the insertion/deletion rate \(\phi\sim\exp(1/10)\); the specification of these priors followed Wilson and McVean (2006). Equilibrium codon frequencies were set to their empirical frequencies. The recombination model is asymmetric, as it assumes that one of the sampled sequences is a mosaic of the other sampled sequences; therefore, chains were run with ten random sequence orderings. Each sampler was run for \(10^7\) steps with a thinning of 100. Two independent runs were performed to check for convergence and to obtain the marginal distributions of \(\omega|X_{1:t}\) and \(\rho|X_{1:t}\), as well as that of their respective block structures. Burn-in periods were empirically determined.

Fig. 1

Workflow of the algorithm used to draw sequences from the posterior predictive distribution. The first step of the algorithm computes the posterior distribution \(p(\theta|X_{1:t})\) with the rjMCMC sampler implemented in OmegaMap. The second step of the algorithm samples from \(p(X_{t+1}|X_{1:t})\) by first drawing a sequence from \(X_{1:t}\); this sequence is used to generate a recombinant sequence and is then evolved by an amount ν under a codon model; this simulation step gives rise to two sequences, seq1 and seq2; one of these two sequences is drawn at random (red oval) to generate \(X_{t+1}\). The posterior predictive probability is then computed as described in the text (Color figure online)

The second step (Fig. 1, below the “+” sign) performs the predictive simulation of future sequences \(X_{t+1}\) based on the θ values sampled in the previous step. The procedure is initialized by estimating the average amount of evolution \(\overline{b}\) separating two sequences in \(X_{1:t}\); by so doing, we assume that the evolutionary process is on average time-homogeneous over the entire time window considered. Maximum likelihood pairwise branch length estimation is performed under the one-ratio codon model (Goldman and Yang 1994) with codeml (Yang 2007). Simulation of a sequence from \(p(X_{t+1}|X_{1:t})\) proceeds in two steps. First, a recombinant sequence is generated according to the recombination block structure sampled from \(p(\theta|X_{1:t})\). For that purpose, a “master” sequence is first drawn at random; this draw is limited to the most recent sequences in \(X_{1:t}\), i.e., those collected during the current season \(t\). The positions of the recombination blocks are extracted from the output of OmegaMap. For each of these blocks, a corresponding block is drawn with probability [\(\rho|X_{1:t}\)] from one sequence taken at random with replacement from the most recent sequences in \(X_{1:t}\). The blocks thus sampled are concatenated to form the recombinant sequence \(X^{\rho}_{t+1}\). Second, this recombinant sequence is evolved following the block structure of the selection (codon) process, as sampled from \(p(\theta|X_{1:t})\). Indels are first replaced by an arbitrary nucleotide (in practice, adenines) to give \(X^{\rho\setminus{\rm indels}}_{t+1}\). To reproduce the among-site variation in ω implemented in OmegaMap, each \(\omega|X_{1:t}\) block of \(X^{\rho\setminus{\rm indels}}_{t+1}\) is used as the root of a simulated two-sequence tree \((seq_1:\overline{b},seq_2:\overline{b})\) under the one-ratio codon model parameterized with \((\omega|X_{1:t},\kappa|X_{1:t})\). One of these sequences is drawn at random to form the final sequence; indels are repositioned in this simulated sequence to give \(X_{t+1}\). This process is repeated 100 times for each of the θ values drawn from \(p(\theta|X_{1:t})\).
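The block-wise construction of the recombinant sequence can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `make_recombinant`, the half-open block coordinates, and the treatment of the per-block recombination value as a swap probability are all assumptions made for the example.

```python
import random

def make_recombinant(master, donors, blocks, swap_prob, rng):
    """Build a mosaic sequence block by block.

    master    : sequence drawn from the most recent season
    donors    : aligned sequences from the most recent season
    blocks    : (start, end) half-open intervals covering the alignment
    swap_prob : per-block probability of taking the block from a donor
                (hypothetical stand-in for a value derived from rho | X_{1:t})
    """
    pieces = []
    for (start, end), p in zip(blocks, swap_prob):
        if rng.random() < p:
            donor = rng.choice(donors)      # donors are reused across blocks
            pieces.append(donor[start:end])
        else:
            pieces.append(master[start:end])
    return "".join(pieces)

rng = random.Random(1)
donors = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]  # toy alignment
master = rng.choice(donors)
blocks = [(0, 3), (3, 9), (9, 12)]
recombinant = make_recombinant(master, donors, blocks, [0.5, 0.5, 0.5], rng)
```

The resulting mosaic has the same length as the alignment, and each block comes from a single parent, which is the property the subsequent codon-model simulation relies on.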

Finally, the likelihood of the alignment that includes the simulated sequence is computed. To speed computations up, only the selection block structure was taken into account (the recombination block structure was ignored). For each \(\omega|X_{1:t}\) block drawn by the rjMCMC sampler, a matrix of maximum likelihood pairwise distances is first estimated under the one-ratio codon model (Goldman and Yang 1994), still using codeml. This matrix is used to obtain an approximate tree for this block by weighted Neighbor-Joining as implemented in weighbor (Bruno et al. 2000). Negative branch lengths are set to zero to avoid computational problems. A maximum likelihood tree could also be obtained for greater accuracy, for instance using codeml, but this approach is expected to increase the computational burden. The log-likelihood of each block is computed with codeml by reusing the parameters drawn from the posterior distribution (\(\omega|X_{1:t}\) and \(\kappa|X_{1:t}\)) and the weighbor branch lengths. The log-likelihood of the predicted alignment is obtained by summing the log-likelihood values over the selection blocks.
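The two bookkeeping operations in this step, zeroing negative branch lengths and summing per-block log-likelihoods, can be sketched as below. The function names and the numerical values are hypothetical; in the actual workflow the per-block log-likelihoods come from codeml.

```python
import math

def clamp_branch_lengths(branch_lengths):
    """Set negative branch lengths (a Neighbor-Joining artifact) to zero."""
    return [max(0.0, b) for b in branch_lengths]

def alignment_log_likelihood(block_log_likelihoods):
    """Sum per-block log-likelihoods, i.e. treat selection blocks as independent."""
    return math.fsum(block_log_likelihoods)

# Hypothetical values for one tree and three selection blocks.
branch_lengths = clamp_branch_lengths([0.012, -0.0004, 0.03])
total = alignment_log_likelihood([-1520.25, -987.5, -2301.0])
```

Summing on the log scale corresponds to multiplying block likelihoods, which is why ignoring the recombination block structure amounts to an independence assumption across selection blocks.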

Computations involved in the last two steps are easily distributed, either on a shared memory / multicore computer or on a computer cluster. Therefore, they are typically quick to perform (of the order of a few days for the data analyzed below after parallelization of the algorithm on a cluster). The main computational bottleneck is in the first step, when samples are drawn from the posterior distribution (of the order of a few weeks for the same data even on a large shared memory computer).

Episodic bursts of evolution or of recombination were incorporated into the model as follows. In the base model described above (Fig. 1), the simulated sequences are evolved on a two-sequence tree in which the branch lengths are both set to the average branch length \(\bar{b}|X_{1:t}\) within each selection block, while recombination follows the sampled \(\rho|X_{1:t}\) within each recombination block. Episodic bursts of evolution are then emulated by multiplying \(\bar{b}|X_{1:t}\) by a scaling factor denoted ν, while episodic bursts of recombination are generated by multiplying \(\rho|X_{1:t}\) by a scaling factor ϱ.
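As a minimal sketch of how the two scaling factors enter the simulation, assuming the average branch length and the per-block recombination rates are already available as plain numbers (the function name `apply_bursts` and the input values are hypothetical):

```python
def apply_bursts(b_bar, rho_blocks, nu=1.0, varrho=1.0):
    """Return the scaled branch length and per-block recombination rates.

    nu     : multiplier applied to the average branch length
    varrho : multiplier applied to every recombination rate
    """
    if nu <= 0 or varrho <= 0:
        raise ValueError("scaling factors must be positive")
    return b_bar * nu, [r * varrho for r in rho_blocks]

base = apply_bursts(0.01, [0.02, 0.05])            # nu = varrho = 1: base model
burst = apply_bursts(0.01, [0.02, 0.05], nu=5.0)   # fivefold burst of evolution
```

Setting both multipliers to 1 recovers the base model, so the burst models are nested extensions of it.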

Identification of the Simulated Sequences

The simulated sequences were then used as queries in BLASTn searches (Altschul et al. 1990) against a local copy of the influenza database Footnote 3. As a result, it is possible to infer the identity (year and country of sampling, subtype and accession number) of the most similar sequences present in the database, and check whether the algorithm is capable of sampling sequences from the future with a high probability.
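The identification step can be illustrated by extracting the best database hit for each simulated sequence from BLAST tabular output. The sketch below assumes the default 12-column tabular layout produced by `blastn -outfmt 6`; the exact BLAST settings used in the study are not stated, and the identifiers in the example are made up.

```python
def best_hits(tabular_text):
    """Map each query to its highest-bitscore subject from BLAST tabular output.

    Assumes the default 12-column layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

# Hypothetical hits for one simulated sequence.
example = "\n".join([
    "\t".join(["sim_0001", "strainA|2005", "98.9", "1701", "18", "0",
               "1", "1701", "1", "1701", "0.0", "3040"]),
    "\t".join(["sim_0001", "strainB|2007", "99.6", "1701", "6", "0",
               "1", "1701", "1", "1701", "0.0", "3105"]),
])
hits = best_hits(example)
```

The year and country of sampling would then be read from the metadata attached to the best-scoring subject.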

Sequence Data for the Retrospective Studies

Individual protein-coding sequences for both the HA and NA genes were downloaded from the Influenza Virus Resource (Bao et al. 2008). Only unique, full-length sequences collected between 2002 and 2007 were used, resulting in 555 HA sequences and 498 NA sequences. As these sequences were not limited to a particular geographic area, they represent the worldwide diversity of sampled influenza viruses available during this entire period of time.

Two designs were used to assess predictive power: one analyzing the whole period (2002–2007), and one analyzing the data year by year (longitudinal analyses). In the first design, due to the relatively large size of these sequence alignments, we clustered sequences with at least 95 % similarity (Abdussamad and Aris-Brosou 2011), so that the size of each data set is reduced while maintaining most of the diversity found in each data set. A single sequence from each resultant cluster was randomly picked as the representative sequence for that group, save for the Brisbane/10/2007 strain, which was set to represent its own cluster. This subsampling of each data set resulted in alignments comprising 19 HA and 30 NA sequences, which represent most of the diversity found in the original pool of sequences. In the second design, 24 sequences were randomly sampled for each year for both HA and NA data sets. The algorithm was then run on each year \(X_t\) to predict sequences circulating during \(X_{t+1}\); note that each year in the Northern hemisphere overlaps with two seasons, with the majority of sequences coming from the second half of the first season.
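The subsampling idea of the first design can be illustrated with a simple greedy clustering at a 95 % identity threshold. This is a swapped-in toy procedure, not the method of Abdussamad and Aris-Brosou (2011), whose details are not given here; the function names, the `forced` argument used to keep a target strain as representative, and the toy sequences are all assumptions.

```python
import random

def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_clusters(seqs, threshold=0.95):
    """Assign each sequence to the first cluster whose seed it matches."""
    clusters = []                       # each cluster is a list of names; first is the seed
    for name, seq in seqs.items():
        for cluster in clusters:
            if identity(seq, seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def representatives(clusters, rng, forced=()):
    """Pick one member per cluster at random, keeping any forced member."""
    picks = []
    for cluster in clusters:
        kept = [n for n in cluster if n in forced]
        picks.append(kept[0] if kept else rng.choice(cluster))
    return picks

seqs = {"s1": "A" * 40, "s2": "A" * 39 + "C", "s3": "C" * 40}  # toy alignment
clusters = greedy_clusters(seqs)
reps = representatives(clusters, random.Random(0), forced={"s2"})
```

Keeping one representative per cluster preserves the breadth of sequence diversity but discards information on how frequent each cluster is.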

In order to assess the predictive power of our model, we constructed Neighbor-Joining trees of the original and simulated sequences together. These trees were obtained using maximum likelihood pairwise distances estimated under the general-time reversible substitution model with among-site rate variation modeled with a discrete \(\Gamma\) distribution (e.g., Aris-Brosou and Rodrigue 2012). For each of the resulting trees, we computed the patristic distance between the simulated sequence and the target Brisbane/10/2007 sequence (both for HA and NA). If the model has good predictive power, then we expect that highly probable sequences will be very similar to the target sequence and hence show a significant relationship between the probability of the generated sequences and their distance to the target sequence. Predictive power was then quantified by computing the \(R^2\) value of the regression (proportion of the variance explained by the linear model). Equality of slopes was tested with an F test (Sokal and Rohlf 2011, p. 513). Trees in Fig. S3 were reconstructed by maximum likelihood using fasttree (Price et al. 2010) under the \(\hbox{GTR}+\Gamma\) model of evolution; support values are based on the SH-like P-values from the approximate likelihood ratio test (Anisimova and Gascuel 2006).
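The regression used to quantify predictive power can be sketched with ordinary least squares. The function name and the data points below are hypothetical and serve only to show how the slope and \(R^2\) are obtained from log-posterior predictive probabilities and patristic distances.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares fit of y on x; returns slope, intercept and R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)      # squared correlation = variance explained
    return slope, intercept, r2

# Hypothetical patristic distances to the target and log-probabilities.
distances = [0.01, 0.02, 0.03, 0.04, 0.05]
log_probs = [-3000.0, -3012.0, -3019.0, -3033.0, -3038.0]
slope, intercept, r2 = linear_fit_r2(distances, log_probs)
```

A negative slope with a high \(R^2\) is the pattern expected under good predictive power: the more probable a simulated sequence, the closer it sits to the target.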

Results

Prediction Under the Base Model

Full-length HA and NA sequences were extracted from NCBI. In the first retrospective study, the current sampling period was set to cover the 6-year period spanning 2002–2007. The objective was to test if the emergence of the Brisbane/10/2007 strain, our target sequences, can be predicted from first principles of molecular evolution, involving selection and recombination / reassortment.

The reduced HA and NA data sets (after clustering) were first analyzed together to demonstrate the possibility of identifying “breakpoints” in concatenated data sets. The results show that the algorithm is able to recover the concatenation point between the two genes as the most probable breakpoint (at codon position 567 in Fig. 2a). Two other codon positions have high breakpoint probabilities (dotted lines in Fig. 2a); they do not correspond to subunit limits: in HA for instance, the limit between HA1 (encoding the globular part of the antigen) and HA2 (encoding the transmembrane domain) is at position 328 (dashed line in Fig. 2a). However, these two peaks of breakpoint probabilities correspond to positions of elevated ω rate ratios (Fig. 2b, d), which is due to the confounding signals of recombination and selection (Anisimova et al. 2003).

Fig. 2

Concatenated analysis of HA and NA gene segments. All panels represent posterior estimates along the concatenated sequence of breakpoint probabilities for recombination blocks (a), selection blocks (b), posterior estimates of recombination (c), and selection (d). Gray lines indicate the 5 and 95 % percentiles of the posterior means (in black). HA runs from codon position 1–567 (vertical red line), while NA runs from position 568–1035. Dotted vertical lines mark the peaks of recombination breakpoint probabilities within each gene at position 246 for HA and 336 (904−568) for NA; for HA, this position differs from the HA1/HA2 boundary (vertical dashed line) (Color figure online)

To alleviate the computational cost of the algorithm, from here on, we analyzed the HA and NA genes independently of each other, so that ρ represents intragenic recombination and not reassortment or a mixture of these two processes. Output from the first step of the computation (Fig. 1) showed evidence of varying selective pressures (Fig. S1a, b) and recombination levels (Fig. S1c, d) across the entire length of both the HA and NA alignments (see also Fig. 2c, d). In the second step of our model (Fig. 1), these posterior distributions of ω and ρ were used to generate gene sequences drawn from their target posterior predictive distribution. A large proportion of the simulated sequences, for both data sets, was identified with BLASTn to be from the 2002–2007 time frame, our “current sampling period” (Fig. S2). The model was able to simulate sequences BLASTn-identified as the Brisbane/10/2007 strain for the NA data set (521/16,317 = 3.2 % in one particular run of the MCMC sampler), but not for HA. In addition, no simulated sequences were BLASTn-identified as coming from 2008 or later with a high posterior predictive probability (in the top 5 % of the distribution) using either the HA or NA data set (Fig. 3). On the other side of the prediction spectrum, both the HA and NA analyses contained simulated sequences that were already circulating well before 2002, illustrating the wide range of diversity simulated by the model, as well as the potentially long persistence time of viral sequences. Note that this persistence of circulating sequences might be less pronounced for the HA gene (Fig. 3a) than for the NA gene (Fig. 3b). Results were robust to the inclusion of the target sequences in the “current sampling period” (Fig. S2) for HA (distribution of top-scoring simulated sequences over the years: \(\chi^2_{48} = 54\), P = 0.2559; Fisher exact test: P = 1.0000) and for NA (\(\chi^2_{49} = 56\), P = 0.2289; Fisher exact test: P = 1.0000).

Fig. 3

Distribution of the BLASTn-identified sequences in the top 5 % of the posterior predictive distribution. Results are presented for data sets including the target sequence for (a) HA and (b) NA, and the data set excluding the target sequence for (c) HA and (d) NA. Shaded bars represent sequences BLASTn-identified as coming from the “current sampling period” (2002–2007), while empty bars represent sequences coming from outside of this period

In order to quantify the predictive power of the model, we plotted the log-posterior probabilities of the simulated sequences against the patristic distances to the target strain (Fig. 4) and calculated the \(R^2\) value of the regression. Table 1 shows that the base model has a predictive power of 26 % for HA and 18 % for NA.

Fig. 4

Patristic distances of both recombination rates and branch lengths. Log-posterior probabilities plotted against patristic distance between each simulated sequence and the target Brisbane/10/2007 sequences, both for HA (a) and NA (b)

Table 1 \(R^2\) values for the linear regressions of log-posterior predictive probabilities against patristic distances to the target sequence

Effect of Duration of Current Sampling Period

We then investigated the impact of the duration of the current sampling period. The hypothesis here was that the longer this time window, the higher the predictive power—assuming that the evolutionary process is stationary during the sampled time period, and that no multiple clades of H3N2 viruses were co-circulating. Alternatively, reducing the duration of the sampling period should decrease predictive power.

This hypothesis was evaluated by subsampling the original data sets of 555 (for HA) and 498 (for NA) sequences according to time. The original data were sampled between 2002 and 2007. A phylogenetic tree of the original 555 HA sequences highlighted multiple clusters of sequences that were sufficiently distant from the rest of the sequences to pose potential problems to the prediction model (Fig. S3). Closer inspection revealed that these clusters consisted of sequences circulating each year, suggesting that multiple evolutionary shifts might occur every year. Therefore, the most distant sequences, circulating in 2002 and 2003, were removed from both the HA and NA data sets, and the posterior predictive algorithm was then run on the remaining 2004–2007 sequences. This provided a 4-year time span of data, as opposed to the original 6-year span. Furthermore, a data set containing only the 2005 sequences was also extracted, as 2005 was the year that contained the largest number of sequences in the original data sets. As a result, we could compare the effectiveness of the predictive method across three sampling durations: 1 year (2005), 4 years (2004–2007), and 6 years (2002–2007). Note that all three sampling durations aim at making out-of-cluster predictions.

The results show that the 4-year and 1-year analyses have higher probabilities than the 6-year study performed above (Fig. S4 for HA), which is expected since the 6-year alignment is larger. More critically, the slope of the regression on the 4-year data set is smaller (in absolute value) than that of the full 6-year data set, which suggests a decrease in predictive power with smaller data sets (shorter sampling durations). Indeed, the \(R^2\) value for the 6-year data set stands at 0.37, and drops to 0.11 for the 4-year data set (P < 0.0001) and to \(6.5 \times 10^{-4}\) for the 1-year data set (P < 0.0001). Therefore, longer sampling durations improve the predictive (and out-of-cluster) power of our model. As a result, we only used the original 6-year data sets in the rest of this study.

Effect of Punctual Bursts of Evolution

So far, the model had a rather small predictive power, probably due to the structure of H3N2 circulation with each year forming its own cluster. However, cluster-shift or the emergence of “unexpected” strains might be due to episodes of accelerated evolution. To test this hypothesis, we incorporated punctual bursts of evolution and of recombination into the base model. Both the HA and NA data sets were used to test the effect of increasing the length of the branch leading to the predicted sequences by a factor ν, hereby mimicking a punctual burst of evolution. By default, this length is set to the average branch length of the tree containing only the sequences from the current sampling period. This rate multiplier ν was set to 1, 2, 5, and 10 for both data sets, HA and NA.

The results show a very significant negative relationship between posterior predictive probabilities and patristic distances for all ν multipliers, both for HA (Fig. 4a) and NA (Fig. 4b). Our model is, therefore, able to predict sequences that have a relatively high probability. Table A.1 further shows that for HA, the average probability is increasing with ν (see also Fig. 4a), while the slopes show a small but significant decrease in absolute value (Table A.1). The pattern is similar for NA, where the slopes are progressively decreasing (in absolute value) with ν, but to a much larger extent (Table 2 and Table A.1). As a result, the inclusion of bursts of evolution in the model helps out-of-cluster prediction for HA but not NA sequences. Indeed, the \(R^2\) values of the regressions for HA increase to almost 40 % as ν increases (Table 1). On the other hand, the inclusion of bursts of evolution makes our prediction of NA sequences worse, as \(R^2\) values decrease with increasing ν (Table 1). This shows that the evolution of HA sequences during that period of time for the H3N2 subtype was characterized by episodic bursts of evolution (at least between 2002 and 2007), while the evolution of NA was more gradual.

Table 2 Slopes for the regressions of log-posterior predictive probabilities against patristic distances to the target sequence

Effect of Punctual Bursts of Recombination

Homologous (intrasegmental) recombination is generally considered to be insignificant in IAVs (Nelson and Holmes 2007; Boni et al. 2008). In order to assess the impact of recombination on the emergence of the target Brisbane/10/2007 strain under a different perspective, we incorporated a burst of recombination in our model. Branch length multipliers ν were first kept constant and set to 1, while recombination rates along the branch leading to the simulated sequences were multiplied by a factor  ϱ that was varied from 1 to 5, a value of 5 meaning that recombination rates leading to the predicted sequences were 5 times larger than those sampled from the rest of the tree.

Predictive power was assessed again by plotting log-posterior probabilities of the simulated sequences against patristic distances to the target strain. These regressions are highly significant for both the HA and the NA genes (P < 0.0001; Fig. 4). However, increasing the recombination rate multiplier  ϱ essentially led to unchanged or even decreasing predictive power, both for HA and NA (Table 1). Therefore, our results confirm the general consensus that homologous (intrasegmental) recombination is not a significant process in the evolution of IAVs, at least in the case of the Brisbane/10/2007 strain.

Joint Effect of Bursts of Evolution and of Recombination

Despite the lack of evidence for any effect of recombination in the emergence of the Brisbane/10/2007 strain, we tested the hypothesis of a potential interaction between bursts of evolution and bursts of recombination. The model was then run on all combinations of multipliers for branch lengths (ν set to 1, 2, 5, and 10) and recombination rates (ϱ set to 1, 2, and 5). Phylogenetic trees were constructed as above for each of the 4 × 3 = 12 possible combinations.

The computation of patristic distances for the different ν and ϱ combinations supported the pattern of increased sequence diversity both for HA (Fig. 4a) and NA (Fig. 4b). Consistent with the results found when varying ν or ϱ independently, the HA gene proved to be more responsive than NA to a joint increase in ν and ϱ, while the impact of bursts of recombination was negligible in both cases (Table 1). These results again support the hypothesis that the evolution of HA is mostly driven by bursts of evolution, and that (intrasegmental) recombination did not play any role in the evolution of these two genes.

Longitudinal Analyses

In the preceding sections, we assessed predictive power with respect to one specific strain, Brisbane/10/2007. A more general way to assess predictive power is to monitor posterior predictive distributions longitudinally in time. We used a sliding window of width 1 year, from year y = 2002 to y = 2007, to try and predict year y + 1 in each case. Note that, in this longitudinal design, there is no “target sequence”: the goal is to be able to predict sequences that will be circulating in the upcoming year (y + 1).

For computational reasons, we capped the number of sequences to 24 in each year. We then ran the complete algorithm on each of the 6 years, for both HA and NA. Given the results in Table 1, we used for our simulations two sets of rate multipliers: a first set with ν = 1 and  ϱ = 1, and a second set with ν = 5 and  ϱ = 1, which is the rate multipliers’ setting that corresponds to the largest posterior predictive power for HA.

The results are essentially the same for the two settings of rate multipliers, so we only present the case where ν = 1 and ϱ = 1. The results are shown for HA (Fig. 5a) and for NA (Fig. 5b). If the predictive power were significant, then the average predicted year in the top 1 % of the log-posterior predictive distribution would be on or above the solid line. This was not the case: for every year analyzed, the average predicted year in the top 1 % of the posterior predictive distribution was not significantly different from the current year. This result is, however, in line with those obtained above (Fig. S4), where predictive power was nonexistent when the “current sampling period” was reduced to a single year. Further testing of the algorithm should focus on determining the optimal size of this sliding window.
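The summary used in this comparison (mean predicted year and its standard deviation within the top 1 % of the posterior predictive distribution) can be sketched as below. The inputs are assumptions: one log-posterior predictive value per simulated sequence, paired with the sampling year of its best BLASTn hit. The function name and the example values are hypothetical.

```python
import math
from statistics import mean, stdev


def top_fraction_year_summary(log_pp, hit_years, fraction=0.01):
    """Mean and SD of hit years among the highest-scoring simulations."""
    if len(log_pp) != len(hit_years):
        raise ValueError("log_pp and hit_years must have the same length")
    n_top = max(1, math.ceil(fraction * len(log_pp)))
    # Rank simulations from highest to lowest log-posterior predictive value.
    ranked = sorted(zip(log_pp, hit_years), key=lambda pair: pair[0], reverse=True)
    top_years = [year for _, year in ranked[:n_top]]
    sd = stdev(top_years) if len(top_years) > 1 else 0.0
    return mean(top_years), sd


# Hypothetical example: 200 simulations, so the top 1 % holds 2 of them.
log_pp = list(range(200))
hit_years = [2005] * 198 + [2006, 2008]
avg_year, sd_year = top_fraction_year_summary(log_pp, hit_years)
print(avg_year, sd_year)
```

Under this sketch, a predictive signal for a window on year y would show up as an average close to y + 1 rather than y.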

Fig. 5

Distribution of BLASTn-identified sequences in the longitudinal analyses. Analyses were performed on each individual year from 2002 to 2007 for (a) HA (2006 missing) and (b) NA. Open circles: average predicted year in the top 1 % of the posterior predictive distribution. Vertical bars: 1 SD. Solid line: expectation under the hypothesis that the analysis can predict emerging sequences in the following year. Broken line: first diagonal.

Discussion

The original motivation behind the development of the model presented here was to be able to predict the emergence of influenza viruses: (i) in a timely manner, and (ii) accurately. The computational time required by our model in a pilot phase was, at three months for about 500 sequences, far greater than is practical (compared to about 20 days with the reduced data, with samplers run 10 times longer). The burden was essentially caused by the first step, where posterior distributions are estimated with omegaMap. The second step, being amenable to parallelization (each posterior predictive simulation being carried out as an independent thread), is not a serious computational bottleneck. Therefore, a sampling method was used to produce smaller HA and NA data sets, while still preserving most of the existing sequence diversity (Abdussamad and Aris-Brosou 2011). However, this sampling method intrinsically discards information relative to haplotype frequencies, which may be critical to help predict emerging viruses. Yet, because genetic diversity of influenza viruses, as measured by effective population sizes scaled to generation time, is thought to be low (Rambaut et al. 2008), it is more likely that nonadaptive processes play a key role in the emergence of influenza viruses. If this nonadaptive hypothesis is correct, then our filtering of the data to represent most of the available sequence diversity circulating in a region or worldwide might be an efficient method to predict emerging viruses.

While the computational burden was reduced, the accuracy of the model as a prediction tool for emerging viruses was not impressive. The sequences generated by the predictive model reveal in particular that the majority of the high-probability sequences were generated from 2002 and 2007 (Fig. 3). This suggests that H3N2 strains continue to circulate for several years after their emergence (Holmes et al. 2005; Plotkin et al. 2002). The best prediction under the base model, which did not incorporate any burst of evolution or of recombination, was obtained by analyzing a window of 6 years. By including punctual bursts of evolution, the predictive power of the model increased from 25 to 40 % for HA, but not for NA, whose power remained low at 12–28 % (Table 1). Therefore, while the forte of the current approach may not lie in its predictive power, the analysis reveals two key features about the mode of evolution of HA and NA in H3N2 viruses: (i) neither of these genes undergoes recombination (Nelson and Holmes 2007; Boni et al. 2008); and (ii) the evolution of HA is episodic in H3N2 viruses, undergoing sporadic bursts of evolution, while NA evolves gradually. This confirms a recent report that took a more direct approach to estimate bursts of evolution (Westgeest et al. 2012), based on a codon model explicitly allowing selection to change episodically (Kosakovsky Pond et al. 2011).

Irrespective of these confirmatory results, the generally low accuracy of the approach presented here highlights the difficulties in predicting the emergence of influenza viruses in two nonexclusive situations: (i) long-term predictions, and (ii) predictions of out-of-cluster viruses. In the case of long-term predictions, the stationarity assumption of our model is likely to be violated by changes in the mode of evolution of viruses. For instance, pandemic H1N1 viruses showed in 2009 increased ω rate ratios, which were interpreted as an increase in surveillance and/or adaptation to the new human host (Smith et al. 2009). In the second situation, H3N2 viruses form clusters of co-circulating strains every year (Fig. S3) (see also Plotkin et al. 2002; Nelson et al. 2006), so that our approach here attempts to perform out-of-cluster prediction based on multicluster information. In this light, it becomes clear that making such out-of-cluster predictions is difficult (Fig. 5) as the underlying evolutionary process is nonstationary.

Compared to previous approaches, either rooted in phylogenetics (Bush et al. 1999; Plotkin et al. 2002; Ferguson and Anderson 2002) or in machine learning (Xia et al. 2009; Trtica-Majnaric et al. 2010; Lees et al. 2010; Ito et al. 2011), the predictive model we described has several unique features such as the incorporation of a more realistic model of natural selection along with a model of recombination/reassortment in a Bayesian phylogenetic framework. Our model would therefore be directly applicable to predicting the emergence of viruses that undergo intragenic recombination such as retroviruses (e.g., Holmes 2009, p. 50), or the evolution of viruses with segmented genomes like influenza by concatenating segments in a whole-genome analysis. However, with the present algorithm, this kind of analysis of long (genomic) sequences is computationally prohibitive. In addition to incorporating heterogeneity of the evolutionary process (Le et al. 2008) and the interplay between mutation and selection (Rodrigue et al. 2010), a more fruitful set of extensions would be either to consider antigenic determinants, by using predictive tools as in Abdussamad and Aris-Brosou (2011) or by using grammar models (Loose et al. 2006), or to take the ecology of the virus and the spatial patterns of its spread into account. Demographic models usually adopt a different formal structure, being based on systems of partial differential equations (Ferguson et al. 2005), and are, therefore, difficult to incorporate into genetic models. One notable exception attempted to reconcile the outputs of the two approaches (Ferguson et al. 2003), but the authors did not attempt to predict emerging strains. More recent forays into spatial studies addressed the surveillance issue from a phylogenetic point of view (Wallace et al. 2007; Parks et al. 2009; Janies et al. 2010; Cybis et al. 2013).
Although these tools have the potential to predict where a particular virus is likely to emerge (Janies et al. 2010), they do not attempt to predict which viral strain is likely to emerge. Finally, the development of predictors of epidemics and pandemics would clearly benefit from the release of a public database linking influenza genomes to a proxy of their phenotype, such as the results of hemagglutination inhibition assays (Smith et al. 2004). In order to increase the predictive power of the model presented here, special efforts will probably be required to combine spatial and immunological models with genetic models, without forgetting demographic modeling as well as the population genetics of the virus of interest.