Introduction

Accurate fitness measurements are central to questions in experimental evolution, quantitative genetics, and functional genomics. Traditional methods include approaches such as estimating maximum growth rate from growth curves (Hall et al. 2014) and quantifying colony sizes from spot assays (Baryshnikova et al. 2010). An alternative, and increasingly used, approach is a competitive fitness assay in which a reference strain with known fitness is competed directly with a test strain. The relative fitness of the test strain can be inferred from its change in frequency compared to the reference, with either colony counts (Lenski et al. 1991) or fluorescence as a readout (Breslow et al. 2008; Thompson et al. 2006). However, pairwise competition assays are lower throughput and challenging to scale to thousands of measurements.

Competitive fitness assays can be adapted from measuring fitness of a single test strain per assay to measuring fitnesses of several thousand strains in parallel. This typically involves uniquely tagging each strain with a DNA barcode and tracking changes in the frequency of the barcodes over time using deep sequencing (Smith et al. 2009, 2010). Applications of such sequencing-based fitness measurements and phenotyping include CRISPR (Shalem et al. 2014; Wang et al. 2014) and transposon mutagenesis screening for essential genes (van Opijnen and Camilli 2013; Wetmore et al. 2015), genetic interaction screens (Du et al. 2017; Jaffe et al. 2017), deep mutational scanning of proteins (Fowler and Fields 2014; Fowler et al. 2010; Stiffler et al. 2015), codon usage (Kelsic et al. 2016), fitness measurements of thousands of adaptive mutations from evolution experiments (Venkataram et al. 2016), genetic crosses (Nguyen Ba et al. 2022), and natural variants (Carrasquilla et al. 2022).

Despite their scalability, highly parallel sequencing-based fitness assays are prone to biases in estimation of fitness. Li et al. demonstrated that fold-enrichment-based fitness metrics cannot be quantitatively compared across pools of strains with different underlying distributions of fitness effects, and developed FitSeq, a fitness estimation method that accounts for changes in the mean population fitness over time (Li et al. 2018, 2023). Their simulations also showed that measuring multiple timepoints makes fitness estimates more robust to changes in the distribution. However, uncertainty in fitness measurements depends on the true fitness and is typically worse for more deleterious fitness effects. Consequently, it is not evident whether the parameter regimes that improve resolution of fitness measurements are the same regardless of the true fitness effect under investigation.

Here, we combined simulated fitness assays for a range of experimental regimes and reanalysis of a deeply sequenced transposon mutagenesis dataset to explore how experimental parameters impact uncertainty in fitness measurements across a wide range of fitness effects. Some of the results presented here have appeared in other work and are cited when appropriate (Li et al. 2018; Robinson et al. 2014); the purpose of this paper is to combine both our findings and these existing insights to derive recommendations for designing pooled sequencing-based fitness assays.

Note About Terminology

While we refer to fitness effects of mutations throughout the paper, the results can be extended to any collection of strains, for instance, derived from an evolution experiment, or from natural variation. For a mutant with a true fitness effect s (relative to a reference strain), the change in frequency of the mutant is given by:

$$f_2 = f_1 e^{st} \tag{1}$$

where f1 and f2 are the frequencies of the mutants before and after selection, and t is the number of generations of selection. We note that this equation holds when the frequency of the reference lineage does not change. This assumption means that all the mutant (i.e., non-reference) frequencies are very small and do not impact mean fitness. In a sequencing-based fitness assay, we can estimate the fitness effect of the mutation as follows:

$$\hat{s} = \frac{1}{t}\ln\frac{n_2/n_1}{N_2/N_1} \tag{2}$$

where n1 and n2 are the mutant counts before and after selection, N1 and N2 are the total counts at those timepoints, and t is the number of generations of selection. Note that this equation also assumes that the frequency of the reference lineage does not change. While this estimator is biased for finite read depth and bottleneck size, we found in simulations that this bias was negligible compared to measurement error (Fig. S1). This definition can be readily generalized to multiple timepoints as the slope of the linear regression of ln(frequency) vs number of generations of selection in the fitness assay. Under this definition, a neutral mutation has a fitness of 0, and an unviable mutation (say, loss of an essential gene) has a fitness effect of -ln(2) = −0.693 (Chevin 2011). Note that a fitness value of -ln(2) is in units of inverse generations, and is pertinent to microbes dividing by binary fission.
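As a minimal sketch of the two estimators described above, the following computes Eq. (2) for two timepoints and the regression slope of ln(frequency) against generations for several timepoints. The function names and the example counts are illustrative assumptions, not taken from the original analysis code.

```python
import math


def fitness_two_timepoints(n1, n2, N1, N2, t):
    """Eq. (2): log change in mutant frequency per generation."""
    return math.log((n2 / n1) / (N2 / N1)) / t


def fitness_regression(counts, totals, generations):
    """Ordinary least-squares slope of ln(frequency) vs generations."""
    x = list(generations)
    y = [math.log(n / N) for n, N in zip(counts, totals)]
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical counts: the mutant halves in frequency every generation,
# which corresponds to the unviable limit of -ln(2).
s_two = fitness_two_timepoints(n1=200, n2=100, N1=10_000, N2=10_000, t=1)
s_reg = fitness_regression([400, 200, 100], [10_000] * 3, [0, 1, 2])
```

With exactly two timepoints the regression slope reduces to Eq. (2), so the two functions agree in that case.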

Simulating Sequencing-Based Fitness Assays

In our simulations, we decouple mutant abundances and the read counts from sequencing (Fig. 1A). We assume that the initial mutant abundances are Poisson distributed (with mean equal to the bottleneck size). Sequencing this mutant pool also leads to Poisson sampling, introducing additional noise, determined by the depth of sequencing. Each subsequent passaging in the fitness assay involves an additional bottleneck step.

Fig. 1

a Illustration of fitness assay simulation approach. b Distribution of N(0) counts obtained after a bottleneck of size 50, and sequencing depth per timepoint of 100. c Scaling of variance and mean of N(0) sequencing read counts obtained from simulations with bottleneck of size 50

We estimated fitness effects using linear regression of log(read count frequencies) vs number of generations, averaging over 5 replicates for each mutant (in practice typically done with redundant barcoding). For mutant trajectories that disappeared (either due to demographic stochasticity or due to a deleterious fitness effect), we added a pseudocount and restricted the regression to timepoints up to the first appearance of a zero read count. We made a few additional simplifying assumptions: we restricted our analysis to a focal mutant of interest, ignoring changes in the mean fitness of the population. We further ignored noise in measurement of reference strains, and overdispersion in the initial distribution of mutant abundances.
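The simulation procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the figures: we assume each bottleneck draws the mutant's cell number from a Poisson distribution whose mean is the previous cell number scaled by exp(s × generations), that reads are Poisson with mean depth × cells / bottleneck, that a pseudocount of 1 is added to every count, and that the first zero count is kept in the regression. All function and parameter names are hypothetical.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Poisson draw via Knuth's method; adequate for lam up to a few hundred."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_reads(rng, s, bottleneck, depth, n_timepoints, gens_between):
    """Read counts of one focal mutant at each sequenced timepoint."""
    cells = poisson(rng, bottleneck)  # initial bottleneck
    reads = []
    for _ in range(n_timepoints):
        # Sequencing: Poisson sampling around the mutant's share of the depth
        reads.append(poisson(rng, depth * cells / bottleneck))
        # Selection for gens_between generations, then the next bottleneck
        cells = poisson(rng, cells * math.exp(s * gens_between))
    return reads


def estimate_fitness(reads, gens_between, pseudocount=1):
    """OLS slope of ln(count + pseudocount) vs generations, up to the first zero."""
    usable = []
    for r in reads:
        usable.append(r)
        if r == 0:  # keep the first zero, drop everything after it
            break
    if len(usable) < 2:
        return None  # not enough points for a slope
    x = [i * gens_between for i in range(len(usable))]
    y = [math.log(r + pseudocount) for r in usable]
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / sxx


def replicate_mean_and_sem(rng, n_replicates=5, **assay):
    """Mean and standard error of the mean across replicate assays."""
    gens = assay["gens_between"]
    estimates = [estimate_fitness(simulate_reads(rng, **assay), gens)
                 for _ in range(n_replicates)]
    estimates = [e for e in estimates if e is not None]
    if len(estimates) < 2:
        return None
    sem = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return statistics.mean(estimates), sem


rng = random.Random(0)
result = replicate_mean_and_sem(
    rng, s=-0.1, bottleneck=50, depth=100, n_timepoints=3,
    gens_between=math.log2(100))
```

Because the sequencing depth is the same at every timepoint and the reference frequency is assumed constant, the slope of ln(count) equals the slope of ln(frequency), so the sketch regresses on counts directly.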

We examined the distribution of initial read counts prior to the start of the fitness assay (N(0)), involving Poisson sampling during the bottleneck and sequencing steps. We found that the read counts were overdispersed (Fig. 1B), and the variance was significantly higher than the mean regardless of sequencing depth (Fig. 1C). Because of this overdispersion, the simulated counts are a more accurate proxy for real sequencing count datasets than naïve Poisson-distributed counts.
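The overdispersion of N(0) can be checked with a short simulation. The sketch below assumes a two-stage model (cell number drawn from a Poisson with mean equal to the bottleneck size, then reads drawn from a Poisson with mean depth × cells / bottleneck); this parameterization and the helper names are our assumptions for illustration.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Poisson draw via Knuth's method; adequate for lam up to a few hundred."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def initial_read_counts(rng, bottleneck, depth, n_mutants):
    """N(0) for many mutants: bottleneck sampling followed by sequencing."""
    counts = []
    for _ in range(n_mutants):
        cells = poisson(rng, bottleneck)
        counts.append(poisson(rng, depth * cells / bottleneck))
    return counts


rng = random.Random(0)
counts = initial_read_counts(rng, bottleneck=50, depth=100, n_mutants=5000)
mean_n0 = statistics.mean(counts)
var_n0 = statistics.variance(counts)
print(f"mean = {mean_n0:.1f}, variance = {var_n0:.1f}, "
      f"variance/mean = {var_n0 / mean_n0:.2f}")
```

A pure Poisson process would give a variance-to-mean ratio near 1; stacking the two sampling steps pushes the ratio above 1.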

Results

First, we investigated the relative importance of bottleneck size and sequencing depth on measurement error. We calculated the uncertainty as the standard error of the mean of fitness estimates from five replicates. For neutral mutations, errors decreased with both bottleneck size and sequencing depth, but increasing sequencing depth well beyond the bottleneck size had little impact on measurement error (Fig. 2). This pattern persisted for slightly deleterious mutations.
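One way to rationalize this diminishing return is the law of total variance applied to a two-stage sampling model. As an assumption for illustration (the symbols b, d, and K are ours), let the mutant's cell number be K ~ Poisson(b) for bottleneck size b, and let the read count given K be N | K ~ Poisson(dK/b) for sequencing depth d:

$$\begin{aligned}
\mathbb{E}[N] &= \mathbb{E}\!\left[\tfrac{d}{b}K\right] = d,\\
\operatorname{Var}(N) &= \mathbb{E}\big[\operatorname{Var}(N \mid K)\big] + \operatorname{Var}\big(\mathbb{E}[N \mid K]\big) = d + \frac{d^{2}}{b^{2}}\,b = d\left(1 + \frac{d}{b}\right),\\
\mathrm{CV}^{2} &= \frac{\operatorname{Var}(N)}{\mathbb{E}[N]^{2}} = \frac{1}{d} + \frac{1}{b}.
\end{aligned}$$

Under these assumptions the relative noise in counts has one term set by depth and one set by the bottleneck. Once d is much larger than b, the 1/b term dominates and additional sequencing barely reduces the noise.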

Fig. 2

Diminishing returns of increasing sequencing depth well beyond the bottleneck size. Uncertainty in measurements is obtained as the standard errors of measurements obtained from simulated fitness assays (fixed parameters: number of timepoints = 2, number of generations between timepoints = log2(100))

Errors in Fitness Measurements Depend on the True Fitness of a Mutant

Next, we explored how the uncertainty in fitness estimates depends on the “true” fitness of a mutation. We observed that errors are consistently larger for more deleterious mutations (Fig. 3A). Next, we turned to a deeply sequenced transposon sequencing dataset of E. coli B REL606 from our previous work (Limdi et al. 2022). We found a statistically significant negative correlation between the estimated fitness of disrupting a gene and the error in the estimate (p value < 0.001, Fig. 3B); this pattern was more evident when we binned by mutant effect sizes (Fig. 3C). This was consistent with the result that FitSeq errors are larger for deleterious mutations (Li et al. 2018). Because errors were dependent on the effect size of the mutation, we decided to explore how experimental parameters impact both near-neutral fitness effects and deleterious fitness effects separately.

Fig. 3

Errors in measurements depend on the true fitness of the mutation. a Uncertainty in measurements, defined as the standard error of mean of 5 replicates. Values plotted are the average of fitness assays for 1,000 mutations. Parameters: number of timepoints = 200, number of generations = log2(100), total sequencing depth per replicate = 200. b Error in fitness measurements (defined as standard error of mean) from a transposon sequencing dataset of E. coli B REL606, using two timepoints. c Same data as in b binned by fitness effects. Annotations above points indicate the number of genes in the bin

Near-Neutral Mutants Require More Time for Fitness Effects to Exceed Measurement Noise

Given fixed sequencing over the entire experiment, we investigated how varying the fitness assay design impacted the measurement errors of near-neutral mutations. First, we explored the impact of changing the number of timepoints in the fitness assay, keeping the number of generations between timepoints fixed. We found that in simulations, for a given sequencing depth, measurement errors were lower with more timepoints (and generations of selection) but with less sequencing per timepoint (Fig. 4A). We then tested this hypothesis in data from the transposon sequencing dataset, finding that over a 30-fold change in total sequencing depth over the experiment, errors were consistently smaller when sampling multiple timepoints and generations (Fig. 4B). Notably, an experiment with 3 × 10^5 reads spread out over five timepoints led to better estimates than 10^7 reads over two timepoints.

Fig. 4

For fixed total sequencing depth over the experiment, increasing number of generations of selection leads to better resolution of near-neutral fitness effects despite less sequencing per time point. Schematic indicates design of the fitness assay, with circles in blue indicating which timepoints were sequenced. a Simulations: uncertainty in fitness estimates for a neutral mutation as a function of timepoints (and number of generations, interval = log2(100) generations) keeping total sequencing depth per mutant constant over the experiment. b Reanalysis of the TnSeq dataset: measurement uncertainty as a function of timepoints (and number of generations, interval = log2(100) generations), keeping total reads in the experiment constant. c Simulations: uncertainty in fitness estimates keeping total sequencing depth per mutant and total generations of selection constant (26.5 generations), while varying number of generations between sequencing. d Reanalysis of TnSeq dataset: measurement uncertainty in fitness estimates keeping total reads over the experiment and total generations of selection constant (26.5 generations), while varying number of generations between sequencing. Since there is no ground truth of neutrality, we averaged over the measurement errors for genes in the range of (-0.05, 0.05)

In the above analysis, both number of timepoints and number of generations of selection were varied simultaneously. We next probed the effect of changing the frequency of sequencing the mutant pools, keeping the total numbers of generations of selection constant. In both simulations and experimental data, we found that at low sequencing depths, sequencing only at two timepoints ~ 26.5 generations apart performed better than sequencing five timepoints ~ 6.6 generations apart (Fig. 4C, D). However, at high sequencing depths, the measurement errors were nearly independent of frequency of sampling. These results suggest that measurements of near-neutral mutations improve with longer durations of selection, while only weakly depending on frequency of sequencing.
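A rough delta-method argument is consistent with this pattern. As a simplification of ours, treat the two read counts in Eq. (2) as independent Poisson variables, hold the total counts fixed, and ignore bottleneck noise. Then

$$\operatorname{Var}(\ln n_i) \approx \frac{1}{\mathbb{E}[n_i]}, \qquad \operatorname{Var}(\hat{s}) \approx \frac{1}{t^{2}}\left(\frac{1}{\mathbb{E}[n_1]} + \frac{1}{\mathbb{E}[n_2]}\right).$$

In this approximation the standard error of the estimate falls as 1/t with the duration of selection, but only as the inverse square root of the read counts. Doubling the number of generations would therefore be expected to help about as much as quadrupling the sequencing depth, provided the mutant remains detectable.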

For Deleterious Mutations, Increasing Timepoints and Generations of Selection at Fixed Total Sequencing Leads to Less Usable Data

Next, we investigated to what extent this intuition also held true for deleterious fitness estimates. For moderately deleterious mutations (s = − 0.1) we found that errors typically decreased with timepoints (and generations of selection). However, at low sequencing depth (25 across the experiment), we found that going from 4 to 5 timepoints made measurements less reliable (Fig. 5A). To investigate this, we examined the average number of timepoints used in calculating fitness estimates. At low total depth, increasing the number of timepoints measured (and therefore reducing sequencing reads per timepoint) in fact led to a reduction in usable data from the fitness assay, contributing to noisier measurements (Fig. 5B).

Fig. 5

For fixed total sequencing, adding timepoints leads to less usable data for estimating deleterious fitness effects in simulated assays. Schematic indicates fitness assay designs, with blue circles indicating which timepoints were sequenced. a Uncertainty in fitness measurement for a mutant with true s = -0.1, as a function of timepoints measured (keeping number of generations between timepoints fixed: log2(100)). b Average number of timepoints that were used to calculate fitness. c Uncertainty in fitness measurement for a mutant with true s = -0.25, as a function of timepoints measured (keeping number of generations between timepoints fixed: log2(100)). d Average number of timepoints that were used to calculate fitness (Color figure online)

Similarly, for a strongly deleterious mutation (s = − 0.25), we found that errors increased from 3 to 4 timepoints for all but the highest sequencing depth (Fig. 5C). This trend corresponded to a decrease in the average number of timepoints used in fitness estimates (Fig. 5D). This suggests that for fixed sequencing depth, measuring earlier timepoints at greater depth provides better resolution of deleterious fitness effects, and that sampling additional timepoints can reduce the amount of useful data.
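The loss of usable data can be illustrated with a back-of-the-envelope calculation. The sketch below assumes the total depth per mutant is split evenly across timepoints and that expected reads decay as exp(s × generations), ignoring bottleneck noise; the function names are hypothetical.

```python
import math


def expected_reads(total_depth, n_timepoints, s, gens_between):
    """Expected mutant reads at each timepoint when depth is split evenly."""
    per_timepoint = total_depth / n_timepoints
    return [per_timepoint * math.exp(s * k * gens_between)
            for k in range(n_timepoints)]


def prob_zero_reads(expected):
    """Poisson probability of observing no reads given the expected count."""
    return math.exp(-expected)


gens = math.log2(100)
for n_tp in (2, 3, 4, 5):
    reads = expected_reads(total_depth=25, n_timepoints=n_tp,
                           s=-0.1, gens_between=gens)
    p_zero_last = prob_zero_reads(reads[-1])
    print(n_tp, [round(r, 2) for r in reads], round(p_zero_last, 2))
```

For a total depth of 25 and s = -0.1, the expected count at the fifth timepoint is well under one read, so that timepoint is usually a zero and contributes little beyond the pseudocount.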

Tuning Frequency of Sequencing to Detect Deleterious Fitness Effects

While increasing the duration between sequencing (for fixed total depth) can be helpful for resolving near-neutral fitness effects, it is not necessarily optimal for deleterious fitness effects. Over tens of generations of selection, if no intermediate timepoints are sampled, it is not possible to distinguish a slightly deleterious mutation from an unviable mutation, as the expected mutant abundance and read counts are nearly zero for both scenarios. Under this scenario, the most deleterious fitness effect detectable can be estimated as:

$$s_{\min} = -\frac{\ln(N_1)}{t} \tag{3}$$

Assuming equal total sequencing depths at two timepoints, this approximation follows from Eq. (2) because when trajectories disappear to 0, n2 = 1 (from adding a pseudocount of 1). We verified this approximation in experimental data, finding that the average of the ten most deleterious fitness effects, a proxy for the most deleterious effect detectable, matched the theoretical predictions well (Fig. 6). This comparison shows that true fitnesses below the theoretical bound cannot be estimated from pooled fitness assays, and if calculated, will be systematically over-estimated.
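To get a feel for Eq. (3), the bound can be tabulated for a few depths and durations. In this sketch we read N1 as the read depth of the mutant at the first timepoint, which is our interpretation: with equal totals and n2 = 1, Eq. (2) reduces to −ln(n1)/t. The function name and the parameter grid are illustrative.

```python
import math


def most_deleterious_detectable(initial_reads, generations):
    """Eq. (3): s_min = -ln(N1) / t for a trajectory that drops to zero."""
    return -math.log(initial_reads) / generations


UNVIABLE = -math.log(2)  # fitness of an unviable mutant under binary fission

interval = math.log2(100)  # generations between transfers at 100-fold dilution
for n_intervals in (1, 2, 4):
    t = n_intervals * interval
    for reads in (10, 100, 1000):
        s_min = most_deleterious_detectable(reads, t)
        resolvable = s_min <= UNVIABLE
        print(f"t = {t:5.1f}, N1 = {reads:4d}, s_min = {s_min:.3f}, "
              f"resolves unviable: {resolvable}")
```

One consequence of the formula is that a depth of 100 reads over log2(100) generations gives a bound exactly at -ln(2), since ln(N)/log2(N) = ln(2) for any N.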

Fig. 6

Generations of selection between sequencing set a lower bound on fitness that can be inferred using pooled fitness assays. Schematic indicates fitness assay design, with blue circles indicating timepoints that were sequenced. a Predicted lower bound as a function of number of generations and sequencing depth, assuming that the mutant disappears after selection (see Eq. 3). b Average of the 10 most deleterious fitness effects estimated for a set of parameters (downsampling and number of generations of selection). We use this as a proxy for the resolution of deleterious effects in bulk fitness assays. Dotted line indicates the fitness of an unviable mutation, -ln(2)

Discussion

We found that sequencing more timepoints (over more generations of selection) at relatively lower sequencing depth, as opposed to fewer timepoints (and generations of selection) at very high depth, improves resolution of near-neutral fitness estimates. Conversely, for deleterious fitness effects, with additional time points there is less new, usable information obtained, as these variants are depleted over time. Our results highlight that the timescale of sampling in fitness assays should be tuned to the timescale of change in mutant frequencies. Moreover, they suggest that there is no combination of experimental parameters that optimally resolves both ranges of fitness effects for a fixed amount of data.

A limitation of our simulations is that we make several simplifying assumptions in modeling fitness assays. We do not consider noise from PCR amplification and DNA extraction steps, which likely contribute to higher measurement noise. We also do not account for changes in the mean fitness of populations over the course of the fitness assay. For a detailed discussion of how the underlying distribution of fitness effects impacts estimates of mutant fitness using log-fold change metrics, and more generally inferring fitnesses from barcode frequencies, we recommend the following (Li et al. 2018, 2023; Ascensao et al. 2023). While changing mean fitness can be corrected for using neutral, reference strains, any measurement errors in these lineages will propagate to fitness estimates of all mutants and can introduce systematic biases. Initial mutant abundances are not typically perfectly Poisson distributed, as generating mutant libraries involves growth steps which can skew abundances toward mutants that have a fitness advantage. Conversely, mutants growing poorly in the growth media (prior to the fitness assay) will have noisier measurements by virtue of starting off with fewer cells, and therefore read counts.

Our simulations and reanalysis of transposon sequencing data, combined with previously published results, can be distilled into principles for experimental design:

Identify Fitness Effects that are Relevant for the Biological Question at Hand

Errors in measurements depend on the true fitness of the mutations, and there exists a tradeoff between resolution of near-neutral fitnesses and deleterious fitness effects.

For Measurements Near Neutrality and Fixed Total Sequencing Budget, Sequence Mutant Pools at More Timepoints, with Less Sequencing Depth

Our reanalysis of deeply sequenced transposon sequencing data shows that sequencing more timepoints (over longer selection periods) at lower sequencing depth outperforms sequencing very deeply but for fewer generations of selection (Fig. 4B). While we find that there is no advantage to sequencing more timepoints if the period of selection is unchanged (Fig. 4C, D), it may provide additional robustness to fitness estimates.

Firstly, the mean fitness of the population can change over time depending on the underlying distribution of fitness effects. This can lead to biased fitness estimates; for instance, neutral mutations may appear deleterious without any correction. Quantifying mutant abundance over multiple timepoints allows for use of methods such as FitSeq to correct for this bias. Secondly, beneficial mutations can occur on otherwise neutral or even deleterious backgrounds over the course of the fitness assay purely due to chance. Sequencing multiple timepoints can allow for identifying such outlier events and excluding them from downstream analysis.

For Measurements of Deleterious Mutations and Fixed Total Sequencing Budget, Sequence at Higher Depth for Fewer Timepoints and Fewer Generations

For a fixed amount of sequencing, adding more timepoints does not add meaningful information, as deleterious mutations will go extinct over time. As a starting point for parameters, we recommend using a sequencing depth and number of generations of selection such that fitness effects of interest are above the lower bound predicted by Eq. 3. Lastly, we recommend (from experience) always sequencing the mutant pools prior to any fitness assay, as deleterious mutations (or variants) disappear from the pool rapidly in a few generations.

Using Pilot Experiments and Simulations to Guide Fitness Assay Design

We present an approach for tuning fitness assay design; we suggest performing a pilot fitness assay and sequencing experiment, using simulations as a starting point for experimental parameters, and reanalyzing the data with subsampling (either fewer reads or fewer timepoints). If the errors or resolution of fitness effects of interest do not change with subsampling, it is possible to collect data for more strains/genetic backgrounds with the same total sequencing.