1 Introduction

The Susceptible-Infectious-Recovered (SIR) model, first introduced in the early twentieth century, is a mathematical model describing the spread of a novel pathogen through a population (Kermack and McKendrick 1927; Ross 1916; Ross and Hudson 1917a, b). This model is governed by the ordinary differential equations

$$\begin{aligned} \frac{ds}{dt} = -\beta i s,\quad \frac{di}{dt} = \beta i s - \gamma i,\quad \frac{dr}{dt} = \gamma i. \end{aligned}$$
(1)

According to this model, the population is divided into three groups or “compartments,” each of which represents a proportion of the population. The susceptible compartment, s, consists of the proportion of individuals who have never been infected with the pathogen. The infected compartment, i, consists of the proportion of individuals who are currently infected. The removed compartment, r, consists of the proportion of individuals who have either recovered from the pathogen and are immune or have died, and are therefore removed from the population. Since the SIR model assumes all recovered individuals are permanently immune to the pathogen, the value of r can be obtained from s and i via the identity \(s+i+r = 1\).
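For readers who want to experiment with (1) directly, the following minimal sketch integrates the SIR equations numerically. The parameter values are purely illustrative (they match the example used later in Sect. 2), and the solver settings are our own choices rather than anything prescribed by the model.

```python
# Minimal sketch: numerical integration of the SIR equations (1).
# beta, gamma, and N are illustrative values only.
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, x, beta, gamma):
    s, i = x
    return [-beta * i * s, beta * i * s - gamma * i]   # r = 1 - s - i

beta, gamma = 0.21, 0.07
N = 1e7
x0 = [1 - 1 / N, 1 / N]                                # (s_0, i_0)

sol = solve_ivp(sir_rhs, (0, 200), x0, args=(beta, gamma),
                dense_output=True, rtol=1e-8, atol=1e-10)

t = np.arange(0, 201)
s, i = sol.sol(t)
r = 1 - s - i
peak_day = t[np.argmax(i)]                             # time of peak infection
print(f"peak infection at day {peak_day}, i_max = {i.max():.3f}")
```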

During the last century the SIR equations have been modified and extended to model a diverse range of epidemics including Ebola, cholera, H1N1, tuberculosis, HIV/AIDS, influenza, malaria, Dengue fever, Zika, and most recently SARS-CoV-2 (Brauer et al. 2019; Coburn et al. 2009; Eisenberg et al. 2013; Khaleque and Sen 2017; Lee et al. 2020; Pasquali et al. 2021; Rachah and Torres 2015; Yang et al. 2020). In many of these examples, additional terms are added to account for pathogen-specific characteristics of transmission. Additional compartments may also be added to model different subpopulations. One such example is the SEIR model, which includes a subpopulation of exposed (E) but non-infectious cases (Sauer et al. 2020). Collectively, the SIR model and its extensions and variations provide epidemiologists with a vast array of interpretable and highly expressive models to understand and predict the behavior of outbreaks. However, incorporating too many features can have subtle but important drawbacks, including limited or unreliable inference of model parameters early in an epidemic. The main contribution of this article is insight into the limits of what can be inferred about an epidemic (as captured by estimated parameters) from noisy, real-time observations of an outbreak.

Our work is motivated by the application of SIR and related compartmental models to real-time analysis of epidemics of human disease such as the 2014 Ebola outbreak in West Africa and the ongoing SARS-CoV-2 (“coronavirus”) pandemic. In such outbreaks, the initial aim of the public health response is to extinguish the epidemic while the number of infected individuals is still small, or at least to significantly slow the rate of infection to allow time for the pathogen to be better understood and effective therapeutics or vaccines to be developed. The stay-at-home orders instituted by many countries due to SARS-CoV-2 are one recent example which has had profound global economic impacts. As such, mathematical models employed in the real-time analysis of epidemics must provide accurate inferences about properties of the epidemic – encapsulated by model parameters – early in the epidemic, when only a small fraction of the population has been infected. Hereafter, we refer to estimation of unknown model parameters from observations as the inverse problem.

As noted in a review by Hamelin et al., many disease models proposed in the literature follow a similar structure: (1) a model is proposed, (2) a subset of model parameters are inferred from the literature, and (3) the remaining parameters are fit from data using least squares or maximum likelihood estimation (Hamelin et al. 2020). In order for these parameter estimates to be reliable, the parameters must be statistically identifiable, ruling out settings in which multiple parameter values are equally consistent with observed data. Such issues were first considered in the context of compartmental models by Bellman and Åström in 1970 (Bellman and Åström 1970). Specific details relevant to the SIR model may be found in Hamelin et al. (2020). Here we provide a brief overview of the well-posedness of the inverse problem.

A model is structurally identifiable when there is a single value of the parameters consistent with noise-free data observations. A comprehensive review of analytic methods for assessing structural identifiability is given in Chis et al. (2011); alternatively, software packages such as DAISY can be used (Bellu et al. 2007). There are many examples of such identifiability analyses in the literature (Brunel 2008; Chapman and Evans 2009; Daly et al. 2018; Eisenberg et al. 2013; Piazzola et al. 2020; Tuncer et al. 2016; Tuncer and Le 2018; Villaverde 2018). In particular, structural identifiability of the SIR parameters, \(\beta \) and \(\gamma \), is well understood with strong theoretical support. See Hamelin et al. (2020) for specific cases based on different observations of the compartments. Similar considerations arise in literature related to branching process models, which are also commonly used for modeling the dynamics of an outbreak. For example, Fok and Chou establish theoretical guarantees on ascertaining the progeny and lifetime distributions for Bellman-Harris processes when one knows the extinction time or population size distributions (Fok and Chou 2013). In practical applications, much less is typically known about the dynamics. Laredo et al. (2009) prove that when certain branching processes are observed only up to their nth generation, one can infer that the true model parameter belongs to a specific subset (which depends on n) of parameter space, but it is impossible to infer the exact true parameter for any finite n.

In practice, data observed during an epidemic tend to be very noisy, so we are far from the idealized noise-free case. Practical identifiability is the ability to discern different parameter values based on noisy observations. Despite considerable recent attention (Balsa-Canto et al. 2009, 2008; Chis et al. 2011; Srinath and Gunawan 2010), far less is known about practical identifiability. Present theoretical methods rely on sensitivity analysis and the computation of the Fisher information matrix, which is analytically intractable in the SIR model and its extensions. Instead, it is common to see Monte Carlo methods employed, wherein the model is simulated for a set value of the parameters, noise is added to the simulated observations, and a fitting procedure is conducted on the noisy data (Chis et al. 2011; Hamelin et al. 2020; Lee et al. 2020; Tuncer and Le 2018). The fidelity of parameter estimates relative to the known values is summarized using the average relative estimation error, which is then plotted as a function of the noise intensity.

Interestingly, the lack of practical identifiability manifests in a remarkably similar manner across multiple, different model formulations even in cases where the parameters are known to be structurally identifiable. As the magnitude of the noise is increased, Monte Carlo parameter estimates concentrate along a curve stretched throughout parameter space, indicating a functional relationship between model parameters (Browning et al. 2020; Eisenberg et al. 2013; Piazzola et al. 2020; Tuncer et al. 2016; Tuncer and Le 2018). Importantly, there are often great disparities in parameter values along this curve and hence huge uncertainty in the parameters. See Fig. 1 for a representative example in the specific case considered herein.

Fig. 1 The trajectory of the SIR model used in the simulation (left). Plots of \({\hat{\beta }}\) vs \({\hat{\gamma }}\) from 1000 realizations from the sampling distribution of their MLE (right)

The goal of this article is to provide theoretical tools for understanding practical identifiability in the context of the SIR model. We propose a formulation based on realistic observations early in an outbreak. Then, using linearizations similar to those of Sauer et al. (2020), we construct analytically tractable approximations to the SIR dynamics from which theoretical guarantees of the performance of the inverse problem are developed. We begin by introducing the model under consideration, reemphasizing ideas discussed previously to provide overt examples of the challenges of practical identifiability.

2 Statistical model

The data available to infer the parameters of an SIR model are usually noisy, biased measurements of the rate of change in the size of the susceptible compartment, discretized to unit time intervals \(\Delta _t = N(s_{t-1} - s_{t})\). For simplicity, we take the time unit to be one day. Here, N represents the total population size in the jurisdiction under study and \(s_t\) is the size of the susceptible compartment at time t. The quantity \(\Delta _t\) is the number of newly infected individuals between day \(t-1\) and day t. Data on daily confirmed cases, hospitalizations, or deaths are all examples of observable data that depend on the underlying value of \(\Delta _t\). Specifically, all are discrete convolutions of \(\Delta _t\) of the form \(p \sum _{s=0}^t \Delta _s \pi _{t-s}\), where p is the probability that an infected person goes on to be diagnosed, hospitalized, or die, and \(\pi _k\) is the conditional probability that a person tests positive, is hospitalized, or dies k days after becoming infected given that the corresponding outcome will eventually occur. It is likely that the parameters p and, to a lesser extent, \(\pi \) change over the course of an epidemic. However, changing values of these parameters can only make inference more difficult, and since our main focus is on studying limitations of inference, as a starting point we assume that p and \(\pi \) are fixed and known.
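As an illustration of this observation model, the short sketch below computes \(\Delta _t\) from a susceptible trajectory and forms the discrete convolution \(p \sum _{s\le t} \Delta _s \pi _{t-s}\). The delay distribution used in the example comment is a hypothetical one chosen for illustration, not a distribution taken from the paper.

```python
# Sketch: daily incidence Delta_t = N(s_{t-1} - s_t) and the reporting
# convolution p * sum_{s<=t} Delta_s * pi_{t-s}.  The delay distribution
# in the usage comment is a made-up example.
import numpy as np

def daily_incidence(s, N):
    """Delta_t for t = 1, ..., len(s)-1 from a daily susceptible trajectory s."""
    s = np.asarray(s)
    return N * (s[:-1] - s[1:])

def expected_reports(delta, p, pi):
    """Expected confirmed cases on days 1, ..., T: p * sum_{s<=t} Delta_s * pi_{t-s}."""
    full = p * np.convolve(delta, pi)      # length len(delta) + len(pi) - 1
    return full[:len(delta)]               # keep days 1, ..., T

# Usage with a hypothetical 5-day delay distribution:
# pi = np.array([0.1, 0.3, 0.3, 0.2, 0.1])
# reports = expected_reports(daily_incidence(s, N), p=0.05, pi=pi)
```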

While the inverse problem with known initial conditions but unknown parameters \(\theta =(\beta ,\gamma )\) is well-posed when even a partial trajectory of \(\Delta _t\) is observed, in reality we observe \(\Delta _t\) corrupted with noise, and we always have to work with finitely many discrete-time observations. In epidemic modeling, unlike some other inverse problems, we do not even have control of the sampling rate and are generally stuck with at best daily monitoring data. To simplify exposition, we focus on a simple but flexible noise model in which the observed data \(Y_t\) are realizations of a random variable satisfying \(\mathbb {E}[Y_t] = p \Delta _t\) for some known \(p \in (0,1)\). In this case, \(\pi _0=1\) and \(\pi _k = 0 \) for \(k>0\). While our results apply to many noise models, to fix ideas we begin with Gaussian noise

$$\begin{aligned} Y_t = p\Delta _t + \xi _t, \quad \xi _t \sim \mathcal {N}(0,\sigma ^2_t). \end{aligned}$$
(2)

In addition to simplifying exposition, our primary motivation for choosing Gaussian noise is to illustrate that the SIR model can, as we see shortly and explain later, be practically unidentifiable even for simple, idealized models like the one above. A secondary reason is that, despite its simplicity, (2) is not entirely unrealistic. For example, suppose any two people infected on day t have the same chance of eventually testing positive, that the chance any one such person tests positive is independent of whether any other such person does, and that the average number of people who became infected on day t who go on to test positive is roughly \(p\Delta _t\). Then in any sufficiently large population the central limit theorem implies \(Y_t\), which in this case is the number of people who become infected on day t and go on to get diagnosed, is approximately normally distributed with mean \(p\Delta _t\) and some variance \(\sigma ^2_t\), i.e. \(Y_t\) satisfies (2).

Initially, suppose that the variances \(\sigma ^2_t\) in (2) are known. A simple procedure for solving the inverse problem from data \(Y_t\) is maximum likelihood. The gradient of the log-likelihood can be obtained by numerically solving an extended ODE system (Gronwall 1919), which allows for easy fitting via gradient-based optimization methods. It can be shown that, even when the trajectory \(p\Delta _t\) is observed only at discrete time intervals and the peak of infections has not yet occurred, the maximum likelihood estimator (MLE) exists and is unique, and so the model is structurally identifiable (Hamelin et al. 2020). Problems become apparent, however, when one seeks to study uncertainty in the estimated parameters. Figure 1 gives a stark indication of the challenges. We simulate data from an SIR model with parameters \(\theta = (\beta ,\gamma ) = (0.21, 0.07)\) and initial conditions \(s_0 = 1-1/N, i_0 = 1/N\) for \(N = 10^7\). These parameters were selected to roughly approximate the dynamics of the coronavirus epidemic in New York City prior to the lockdown of March 16, 2020. The trajectories \(s_t,i_t\) for \(0 \le t \le 120\) are shown in the left panel. By \(t=80\), about 1 percent of the population has been infected, and the peak size of the infected compartment occurs around \(t=120\). The right panel of Fig. 1 is obtained by repeatedly simulating data from (2) using the trajectory in the left panel, with \(p=1\) and \(\sigma ^2_t = 100 N\) chosen for illustrative purposes. Other potentially more realistic values of p and \(\sigma _t\) are considered later in the text; see for example Table 3 in Sect. 3.4 and Cases 1 and 2 in Sect. 3.2. For each replicate simulation, the model is fit by maximum likelihood. The resulting estimates of \(\hat{\theta }\) are shown in Fig. 1, which plots \(\hat{\beta }\) against \({\hat{\gamma }}\). These are samples from the sampling distribution of the maximum likelihood estimator for these parameters. The estimates exhibit very tight concentration along a line of slope 1. The variation in \({\hat{R}}_0 = {\hat{\beta }}/\hat{\gamma }\) observed for these values is large, ranging from 1.88 to 5.01. This high degree of uncertainty occurs despite the fact that we have observed data up through the time when over half the population has been infected.
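A minimal sketch of the Monte Carlo experiment just described is given below: simulate noisy incidence data according to (2) from the known trajectory, then refit \((\beta ,\gamma )\) by maximum likelihood. The optimizer, starting point, and number of replicates are our own choices, not the exact settings used to produce Fig. 1.

```python
# Sketch of the Fig. 1 experiment: simulate noisy incidence data (2) from a
# known SIR trajectory, then recover (beta, gamma) by maximum likelihood.
# Optimizer, starting point, and replicate count are illustrative choices.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

N, p, T = 1e7, 1.0, 120
sigma2 = 100 * N                          # observation variance used for Fig. 1
theta_true = np.array([0.21, 0.07])
rng = np.random.default_rng(0)

def incidence(theta, T):
    """Delta_t = N(s_{t-1} - s_t) for t = 1..T under theta = (beta, gamma)."""
    beta, gamma = theta
    rhs = lambda t, x: [-beta * x[0] * x[1], beta * x[0] * x[1] - gamma * x[1]]
    sol = solve_ivp(rhs, (0, T), [1 - 1 / N, 1 / N], t_eval=np.arange(T + 1),
                    rtol=1e-8, atol=1e-10)
    s = sol.y[0]
    return N * (s[:-1] - s[1:])

def neg_log_lik(theta, y):
    if np.min(theta) <= 0:
        return np.inf
    resid = y - p * incidence(theta, T)
    return 0.5 * np.sum(resid ** 2) / sigma2    # Gaussian noise, known variance

delta_true = incidence(theta_true, T)
estimates = []
for _ in range(20):                             # the paper uses 1000 replicates
    y = p * delta_true + rng.normal(0.0, np.sqrt(sigma2), size=T)
    fit = minimize(neg_log_lik, x0=[0.3, 0.1], args=(y,), method="Nelder-Mead")
    estimates.append(fit.x)                     # (beta_hat, gamma_hat)
```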

The linear shape of the plot in Fig. 1 suggests a practical identifiability problem in this model. That is, while the MLE exists and is unique, the curvature of the log-likelihood in the neighborhood of the MLE is very small in the direction where \({\hat{\beta }}, {\hat{\gamma }}\) lie along a line. We are not the first to notice this phenomenon. Previous works include (Chis et al. 2011; Hamelin et al. 2020; Lee et al. 2020; Tuncer and Le 2018), which report qualitatively similar issues despite notable differences in the formulation of the likelihood in those settings.

While various empirical studies exist, our main contribution is a theoretical analysis of this phenomenon and the resulting limitations for solving the inverse problem from noisy observations. We take a two-step approach to the analysis. First, we characterize the sensitivity of the trajectories \(\Delta _t\) to perturbations of the parameters \(\theta \), and show that perturbations of \(\theta \) in the directions \(\pi /4\) and \(5\pi /4\) (equivalently, along the line of slope 1 through \(\theta \)) produce, to close approximation, the smallest variation in the trajectory \(s_t\) among all perturbations for which \(\Vert \theta _\epsilon -\theta \Vert = \epsilon \). We then give a computable approximate lower bound on \(\inf _{\theta _\epsilon : \Vert \theta - \theta _\epsilon \Vert = \epsilon } |s_t(\theta _\epsilon )-s_t(\theta )|\) for times t prior to the peak infection time. Taken together, these results provide an explanation for the phenomenon in Fig. 1.

In the second part of the analysis, we relate the problem of uncertainty quantification to hypothesis tests of the form

$$\begin{aligned} H_0 : \theta = \theta _0 \quad \text { vs. }\quad H_1 : \theta = \theta _\epsilon \end{aligned}$$

for \(\Vert \theta _\epsilon - \theta _0\Vert = \epsilon \). We use the result of the first part of our analysis to approximate the type II error of the test, which in turn allows for both theoretical and empirical analysis of the limits of epidemic prediction using SIR models.

3 Results

3.1 Perturbation bound for SIR trajectories

Informally, the phenomenon in Fig. 1 is a manifestation of the fact that very different values of \(\theta \) can lead to SIR model trajectories that are very close. To formalize this, let \(\varphi _t(x_0,\theta )\) be the (s, i)-trajectory of the SIR model starting from \(x_0=(s_0,i_0)\) with parameters \(\theta =(\beta ,\gamma )\). To aid the reader, all relevant notation is summarized in Table 1. We also remark that the analysis in this subsection and its associated appendices, “Appendices A and B”, applies directly to the deterministic SIR system (1). In particular, it is independent of our choice of statistical model, which will not become relevant until our discussion of hypothesis testing in Sect. 3.2.

Table 1 Summary of notation used throughout this article

For \(\epsilon >0\), let \(S_\epsilon (\theta )\) denote the circle of radius \(\epsilon \) about \(\theta \). That is,

$$\begin{aligned} S_\epsilon (\theta ) = \{\theta _\epsilon (\omega ):\omega \in [0,2\pi )\} \end{aligned}$$

where \(\theta _\epsilon (\omega )=\theta +\epsilon (\cos (\omega ),\sin (\omega ))\). We set \(\delta = \beta -\gamma \) and assume throughout that \(\delta >0\); if not, then the reproductive number \(R_0=\beta /\gamma \) is at most 1 and the epidemic does not grow even at time 0. Similarly, we assume \(\epsilon <\delta \). This ensures the \(R_0\) values of the perturbed parameters \(\theta _\epsilon (\omega )=(\beta +\epsilon \cos (\omega ),\gamma +\epsilon \sin (\omega ))\) are also strictly greater than 1, since

$$\begin{aligned} \beta +\epsilon \cos (\omega ) - \gamma -\epsilon \sin (\omega ) = \delta + \epsilon (\cos (\omega )-\sin (\omega )) \ge \delta -\epsilon >0 \end{aligned}$$

and so \(R_0(\epsilon ,\omega )=(\beta +\epsilon \cos (\omega ))/(\gamma +\epsilon \sin (\omega ))>1\) for every \(\omega \). Finally, for any fixed initial condition \(x_0\) and parameter \(\theta \) we define the peak time, denoted \(t_*\), to be the deterministic time at which the number of infected individuals \(i_t(x_0,\theta )\) is greatest; that is, \(t_*={{\,\mathrm{argmax}\,}}\{i_t(x_0,\theta ):t\ge 0\}\). Since \(di/dt=0\) if and only if \(i=0\) or \(s=1/R_0\), it follows that \(t_*\) exists and is unique whenever \(R_0>1\). With this notation, the main result of this subsection is the following proposition.

Approximation 1

Let \(\Vert \cdot \Vert \) denote the Euclidean norm on \(\mathbb {R}^2\) and let \(t_*\) be the time of peak infection corresponding to \(\theta \). Then for all \(t\in [0,0.8t_*)\),

$$\begin{aligned} \frac{\epsilon }{\delta \sqrt{2}}\big (e^{\delta t}-1\big )i_0 \approx \inf _{\omega \in [0,2\pi )} \big \Vert \varphi _t\big (x_0,\theta _\epsilon (\omega )\big ) - \varphi _t(x_0,\theta )\big \Vert . \end{aligned}$$
(3)

Furthermore, the infimum is approximately achieved when \(\omega =\pi /4\) or \(5\pi /4\).

The derivation of (3) is in “Appendix A”. Approximation 1 says that for any perturbation \(\theta _\epsilon (\omega )\) of \(\theta \), the distance between the perturbed trajectory \(\varphi _t(x_0,\theta _\epsilon (\omega ))\) and the true trajectory \(\varphi _t(x_0,\theta )\) is approximately bounded below by the left side of (3) for all times t up to roughly \(80\%\) of \(t_*\). The “\(\approx \)” in (3) indicates the bound is subject to error. Specifically, our derivation of Approximation 1 involves two approximations: First, we approximate the SIR model by a differential equation (A1) whose solution \({\widetilde{\varphi }}_t\) is given by (A2). Second, we use first-order Taylor expansions to approximate perturbations of \({\widetilde{\varphi }}_t\) resulting from perturbations in parameter space. Despite these approximations, numerical analysis of the error given in “Appendix B” indicates (3) holds for a wide range of parameter values and population sizes; see Fig. 2 below and Fig. 10 in “Appendix B”. This numerical analysis also motivates our choice of \(80\%\) of the peak time as a cutoff, though this cutoff can be extended to \(85\%\) or even \(90\%\) for larger populations and certain parameter values; see Table 5. To complement the numerical results of “Appendix B”, we give a theoretical upper bound on the error in “Appendix C”. The theoretical result is more mathematically rigorous than the numerical one; however, it is significantly less precise than the control on error obtained in “Appendix B”. We therefore use results derived from the numerical analysis of “Appendix B” (e.g. the 80% threshold) rather than the theoretical analysis of “Appendix C” for the remainder of this paper.
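The sketch below illustrates the content of Approximation 1 numerically, in the spirit of Fig. 2: it compares the left side of (3) with the distances between perturbed and true trajectories over a grid of angles. The angle grid, time horizon, and solver tolerances are our own choices.

```python
# Sketch of the numerical check behind Approximation 1 / Fig. 2: compare the
# approximate lower bound in (3) with the distance between perturbed and true
# (s, i)-trajectories over a grid of perturbation angles.
import numpy as np
from scipy.integrate import solve_ivp

N = 1e7
beta, gamma, eps = 0.21, 0.07, 0.03
delta, i0 = beta - gamma, 1 / N
t_grid = np.arange(0, 121)

def trajectory(beta, gamma):
    rhs = lambda t, x: [-beta * x[0] * x[1], beta * x[0] * x[1] - gamma * x[1]]
    sol = solve_ivp(rhs, (0, t_grid[-1]), [1 - i0, i0], t_eval=t_grid,
                    rtol=1e-9, atol=1e-12)
    return sol.y                                       # rows: s_t, i_t

base = trajectory(beta, gamma)
t_star = t_grid[np.argmax(base[1])]                    # peak infection time t_*
pre_peak = (t_grid >= 1) & (t_grid <= 0.8 * t_star)    # skip t = 0 (both sides are 0)

# Distances to the true trajectory for 90 equally spaced perturbation angles.
angles = np.linspace(0.0, 2 * np.pi, 90, endpoint=False)
dists = np.array([np.linalg.norm(
    trajectory(beta + eps * np.cos(w), gamma + eps * np.sin(w)) - base, axis=0)
    for w in angles])                                  # shape (90, len(t_grid))

bound = eps / (delta * np.sqrt(2)) * (np.exp(delta * t_grid) - 1) * i0
ratio = bound[pre_peak] / dists[:, pre_peak].min(axis=0)
print(f"t_* = {t_star}; bound / (min distance over angles) ranges "
      f"{ratio.min():.2f} to {ratio.max():.2f} before 0.8 t_*")
```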

Fig. 2 Distance between perturbed and true trajectories for different parameter values and population sizes. In each graph the horizontal axis is the number of days since the start of the epidemic and the vertical axis is the distance \(\Vert \varphi _t(x,\theta _\epsilon (\omega ))-\varphi _t(x,\theta )\Vert \) between a perturbed trajectory and the true trajectory at time t. The gray, green, and black curves correspond to 90 perturbed trajectories, one for each of 90 equally spaced angles \(\omega \) in \([0,2\pi )\). The black curves correspond to the angles \(\pi /4\) and \(5\pi /4\). The green curves correspond to the remaining angles in the intervals \([\pi /4-\pi /12,\pi /4+\pi /12)\) and \([5\pi /4-\pi /12,5\pi /4+\pi /12)\), i.e. in intervals of width \(\pi /6\) centered at \(\pi /4\) and \(5\pi /4\), respectively. The gray curves correspond to those angles in \([0,2\pi )\) outside these two intervals. Note the distances corresponding to angles close to \(\pi /4\) and \(5\pi /4\) (the green and black curves) are smaller than those distances corresponding to angles farther away from \(\pi /4\) and \(5\pi /4\) (the gray curves), which supports the claim that the inverse problem is least practically identifiable for parameter perturbations approximately along a line of slope 1. The approximate lower bound of Approximation 1 is in red. The peak time of the trajectory corresponding to \(\theta \) is indicated by the vertical blue line, and 80% of it by the vertical orange line. The first through fourth columns have population sizes \(10^4, 10^5, 10^6\), and \(10^7\), respectively, with only one initial infection in each case. The perturbation sizes for the first through fourth rows are \(\epsilon =.03, .03, .06\), and .1, respectively. The SIR parameters for the first through fourth rows are \((\beta ,\gamma )=(.21,.14), (.21,.07), (.42,.07)\), and (1.68, .14), which give respective \(R_0\) values of 1.5, 3, 6, and 12. Note the approximate lower bound holds roughly up to 80% of the peak time in all cases despite the wide range of parameters. Finally, we remark that the two seemingly “distinct” classes of gray curves in each plot correspond to different subsets of the 90 distinct angles. This, as well as the multimodality of certain curves (which becomes more apparent when our graphs are extended further beyond the peak time), is a consequence of the nonlinearity of the SIR model and is not directly relevant to our analysis

Fig. 3 Logarithm of distance between perturbed and true trajectories for different parameter values and population sizes. Everything is the same as in Fig. 2 except now we plot \(\log \Vert \varphi _t(x,\theta _\epsilon (\omega ))-\varphi _t(x,\theta )\Vert \) instead of \(\Vert \varphi _t(x,\theta _\epsilon (\omega ))-\varphi _t(x,\theta )\Vert \). This gives a better view of the approximate lower bound early in the epidemic. Note the vertical axis is now a log scale

Approximation 1 successfully predicts the directions in parameter space, namely \(\omega =\pi /4\) and \(5\pi /4\) (equivalently, along the line of slope 1 through \(\theta \)), corresponding to the most uncertainty about parameters even when data are observed up to the peak time, as in Fig. 1. In other words, the inverse problem of determining \(\theta \) from data is least practically identifiable when distinguishing between \(\theta \) and parameter values lying approximately on the line of slope 1 through \(\theta \). Furthermore, the approximate lower bound (3) quantifies the extent to which the inverse problem will not be practically identifiable which, as we discuss in the next subsection, is necessary for meaningful hypothesis testing. Finally, we find that the lower bound in (3) approximately holds for the s trajectory alone. That is, if \(s_t(x_0,\theta _\epsilon (\omega ))\) and \(s_t(x_0,\theta )\) are the s trajectories corresponding to \(\theta _\epsilon (\omega )\) and \(\theta \), respectively, then

$$\begin{aligned} \frac{\epsilon }{\delta \sqrt{2}}\big (e^{\delta t}-1\big )i_0 \approx \inf _{\omega \in [0,2\pi )}\big |s_t\big (x_0,\theta _\epsilon (\omega )\big ) - s_t(x_0,\theta )\big |, \end{aligned}$$
(4)

and the infimum is again achieved when \(\omega =\pi /4\) and \(5\pi /4\). The intuition behind (4) is that the s compartment is substantially larger than the i compartment early in an epidemic and therefore contributes significantly more to \(\Vert \varphi ^\epsilon _t-\varphi _t\Vert \) than i. This observation will be used for the hypothesis testing in Sect. 3.2 since our statistical model depends crucially on \(\Delta _t=N(s_{t-1}-s_{t})\), which in turn depends only on s rather than on s and i together. The approximation error implicit in the \(\approx \) symbol in (3) and (4) is the one quantity we do not have rigorous control over; see “Appendix B” for details.

3.2 Hypothesis testing for the inverse problem

In this subsection we revisit the inverse problem in light of the perturbation bounds (3) and (4). To give context and motivate the main result of this subsection, namely Approximation 2 and its subsequent discussion, we first give a brief overview of simple hypothesis testing and the Neyman-Pearson Lemma.

Suppose we observe data Y taking values in a space \(\mathcal {Y}\) and that these data are drawn from an unknown probability distribution belonging to a parametrized family of probability distributions \(\{\mathbb {P}_\theta \}\). Given two parameters \(\theta _0\) and \(\theta _1\), a natural question is whether the observed data came from \(\mathbb {P}_{\theta _0}\) or \(\mathbb {P}_{\theta _1}\). This is a simple hypothesis test, denoted by

$$\begin{aligned} H_0 : \theta = \theta _0 \quad \text { vs. }\quad H_1 : \theta = \theta _1, \end{aligned}$$
(5)

where \(H_0\) and \(H_1\) are the null and alternative hypotheses, respectively. Simple here refers to the fact that both \(H_0\) and \(H_1\) correspond to single \(\theta \) values which completely determine the distributions \(\mathbb {P}_{\theta _0}\) and \(\mathbb {P}_{\theta _1}\). The aim is to decide whether to reject \(H_0\) in favor of \(H_1\), which is done by choosing a subset \(\mathcal {R}\) of \(\mathcal {Y}\) called the rejection region. This choice of \(\mathcal {R}\) completely determines the test: If \(Y\in \mathcal {R}\), then reject \(H_0\) in favor of \(H_1\); if \(Y\notin \mathcal {R}\), then do not reject \(H_0\). Type I error occurs when \(H_0\) is true but is rejected, and type II error occurs when \(H_0\) is false but not rejected; see Table 2. This is quantified as

$$\begin{aligned} \begin{aligned}&\mathcal {E}_1(\mathcal {R}) = \text {Type I error rate} = \mathbb {P}_{\theta _0}(Y\in \mathcal {R}) = \mathbb {P}_{\theta _0}(\text {Reject}\ H_0), \\&\mathcal {E}_2(\mathcal {R}) = \text {Type II error rate} = \mathbb {P}_{\theta _1}(Y\notin \mathcal {R}) = \mathbb {P}_{\theta _1}(\text {Do not reject}\ H_0). \end{aligned} \end{aligned}$$

Ideally one would find a rejection region \(\mathcal {R}\) that simultaneously minimizes type I and type II error rates, but this is generally impossible. Instead, a common statistical paradigm is to fix a significance level \(\alpha >0\) and minimize \(\mathcal {E}_2(\mathcal {R})\) subject to the constraint \(\mathcal {E}_1(\mathcal {R})=\alpha \). For such an \(\alpha \), a region \(\mathcal {R}\) is called a most powerful level-\(\alpha \) rejection region if \(\mathcal {E}_1(\mathcal {R})=\alpha \) and \(\mathcal {E}_2(\mathcal {R})\le \mathcal {E}_2(\mathcal {R}')\) for all \(\mathcal {R}'\) satisfying \(\mathcal {E}_1(\mathcal {R}')=\alpha \). That is, \(\mathcal {R}\) minimizes type II error over all rejection regions with type I error equal to \(\alpha \). The Neyman-Pearson Lemma gives the most powerful rejection region in the case of a simple hypothesis test.

Lemma 1

(Neyman-Pearson) Let \(L(Y\vert \theta )\) denote the likelihood function for data Y and a parameter \(\theta \), and fix \(\alpha >0\). Then there exists an \(\eta \in \mathbb {R}\) such that

$$\begin{aligned} \mathcal {R}_{LR} = \bigg \{Y : \frac{L(Y\vert \theta _1)}{L(Y\vert \theta _0)} \ge \eta \bigg \} \end{aligned}$$
(6)

is a most powerful level-\(\alpha \) rejection region for the hypothesis test (5).

\(L(Y\vert \theta _1)/L(Y\vert \theta _0)\) is called the likelihood ratio and the decision to reject or not reject \(H_0\) based on the rejection region \(\mathcal {R}_{LR}\) is called the likelihood ratio test. Since the Neyman-Pearson Lemma guarantees the likelihood ratio test is most powerful in our setting, we henceforth consider only the rejection region \(\mathcal {R}_{LR}\) and set \(\mathcal {E}_1=\mathcal {E}_1(\mathcal {R}_{LR})\) and \(\mathcal {E}_2=\mathcal {E}_2(\mathcal {R}_{LR})\).

Table 2 Simple hypothesis test
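To make the likelihood ratio test concrete, the toy sketch below tests between two fully specified Gaussian models \(Y_t\sim \mathcal {N}(\mu _j(t),\sigma _t^2)\), \(j=0,1\) (the same structure as the SIR test considered next), and estimates both error rates by Monte Carlo. The means, variances, and simulation sizes are illustrative choices only.

```python
# Toy illustration of Lemma 1: a likelihood ratio test between two fully
# specified Gaussian models with known variances and different means.
# All numerical values below are illustrative.
import numpy as np

rng = np.random.default_rng(1)
T = 60
var = np.full(T, 4.0)                      # known variances sigma_t^2
mu0 = np.linspace(1.0, 6.0, T)             # mean trajectory under H_0
mu1 = 1.05 * mu0                           # mean trajectory under H_1 (small shift)

def log_lr(y):
    """log L(y | theta_1) - log L(y | theta_0) for independent Gaussians."""
    return np.sum(((y - mu0) ** 2 - (y - mu1) ** 2) / (2 * var))

def simulate(mu, n):
    return rng.normal(mu, np.sqrt(var), size=(n, T))

alpha = 0.05
null_stats = np.array([log_lr(y) for y in simulate(mu0, 20000)])
eta = np.quantile(null_stats, 1 - alpha)   # calibrate the threshold so E_1 = alpha
alt_stats = np.array([log_lr(y) for y in simulate(mu1, 20000)])

print("type I  ~", np.mean(null_stats >= eta))   # ~ alpha by construction
print("type II ~", np.mean(alt_stats < eta))     # E_2: fail to reject under H_1
```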

Returning now to the SIR model, our hypothesis test of interest is

$$\begin{aligned} H_0 : \theta = \theta _0 \quad \text { vs. }\quad H_1 : \theta = \theta _{\epsilon }(\omega ) \end{aligned}$$
(7)

where, as before, \(\theta _\epsilon (\omega )\) is a perturbation of \(\theta \) of size \(\epsilon \) in the direction \(\omega \). The observed data are \(Y_{1:T} = (Y_1,\dots , Y_T)\) for any time T before the time of peak infection, with each \(Y_t=p\Delta _t+\xi _t\) as in (2). For the rest of this paper \(\mathcal {E}_2(\omega )\) will denote the type II error rate of (7) for the likelihood ratio test with angle \(\omega \). As such, the likelihood ratio test (6) minimizes \(\mathcal {E}_2(\omega )\) thereby providing the most powerful technique for detecting differences of order \(\epsilon \) in the SIR model parameters. We also set \(\Delta _t^\epsilon (\omega )= \Delta _t(\theta _\epsilon (\omega ))\) where, recall, \(\Delta _t(\theta )=N(s_{t-1}(\theta )-s_t(\theta ))\), and let \(\Phi \) denote the standard normal cumulative distribution function. With this notation we now present the main result of this subsection.

Approximation 2

For any \(\epsilon >0\), \(\omega \in [0,2\pi )\), and significance level \(\alpha >0\),

$$\begin{aligned} \mathcal {E}_2( \omega )&\approx 1-\Phi \left( \Phi ^{-1}(\alpha )+pNi_0\sqrt{\sum _{t=1}^T\frac{e^{2\delta t}}{\sigma _t^2}\left[ \beta _\epsilon \bigg (\frac{e^{-\delta _\epsilon }-1}{-\delta _\epsilon }\bigg )e^{\epsilon tf(\omega )}-\beta \bigg (\frac{e^{-\delta }-1}{-\delta }\bigg )\right] ^2}\right) \end{aligned}$$
(8)
$$\begin{aligned}&\approx 1-\Phi \left( \Phi ^{-1}(\alpha )+pNi_0\sqrt{\sum _{t=1}^T\frac{e^{2\delta t}}{\sigma _t^2}\left[ (\beta +\epsilon \cos \omega ) e^{\epsilon tf(\omega )}-\beta \right] ^2}\right) , \end{aligned}$$
(9)

where \(f(\omega )=\cos (\omega )-\sin (\omega )\), \(\beta _\epsilon =\beta +\epsilon \cos \omega \), and \(\delta _\epsilon =\delta +\epsilon f(\omega )\). Moreover,

$$\begin{aligned} \mathcal {E}_2(\pi /4)\approx \mathcal {E}_2(5\pi /4)\approx \sup _{\omega \in [0,2\pi )}\mathcal {E}_2(\omega ). \end{aligned}$$

Fig. 4 Type II error as a function of perturbation size and noise level. The left panel shows the empirical and theoretical type II errors for the angles \(\omega =0,\pi /4\), and \(\pi \) as a function of perturbation size \(\epsilon \) with fixed noise level \(\sigma =0.3\). The right panel shows the empirical and theoretical type II errors for the same angles as a function of noise level \(\sigma \) with fixed perturbation size \(\epsilon =.03\). In each case the SIR parameters are those from Sect. 2, namely \((\beta ,\gamma )=(.21, .07)\), \(N=10^7\), and initial condition \(i_0=1/N\). The time horizon T is 60 days into the epidemic, which in this case is 60 days prior to the peak time. The significance level is \(\alpha =.05\). Here, theoretical refers to the first (red) and second (black) approximations of type II error \(\mathcal {E}_2(\omega )\) in Approximation 2, i.e. Eqs. (8) and (9), respectively. Empirical refers to the type II error obtained by performing 1000 simulations of the noisy SIR model (2) followed by a likelihood ratio test of the hypothesis in (7) for each set of parameters. More specifically, the red and black curves lying over the blue line are the type II error approximations (8) and (9) when \(\omega =\pi /4\), those lying over the purple line are when \(\omega =\pi \), and those lying over the green line are when \(\omega =0\), with the blue, green, and purple curves corresponding to the empirically computed type II error rates when \(\omega =\pi /4, 0\), and \(\pi \), respectively. In each case both theoretical results closely align with the empirical ones, with the first approximation being slightly better than the second as expected. Also as predicted, the empirical type II errors all approach \(1-\alpha =.95\) both as perturbation size goes to 0 and as the noise level gets large, and this approach is most rapid when \(\omega =\pi /4\). In each case the noise model is Case 2, \(\sigma _t=N\sigma i_t\). For the simulated blue, green, and purple curves, we used a numerical integrator to obtain the \(i_t\) values, while for the red and black curves we used the pre-peak approximation \(i_t\approx e^{\delta t}i_0\)

The derivation of Approximation 2 is in the “Appendix”. The first and second approximations of \(\mathcal {E}_2(\omega )\) correspond to the red and black curves in Fig. 4, respectively. Comparing these to the empirical type II error rates (the blue, green, and purple curves) we see these approximations are sound. In particular, the last part of Approximation 2 indicates the angles \(\pi /4\) and \(5\pi /4\) give rise to the largest type II error rate for the hypothesis test (7) with perturbation size \(\epsilon \) and significance level \(\alpha \). To quantify the magnitude of type II error in these cases, we substitute into the second approximation to get

$$\begin{aligned} \mathcal {E}_2(\tfrac{\pi }{4}) \approx \mathcal {E}_2(\tfrac{5\pi }{4}) \approx 1 - \Phi \left( \Phi ^{-1}(\alpha )+\frac{pNi_0\epsilon }{\sqrt{2}}\sqrt{\sum _{t=1}^T\frac{e^{2\delta t}}{\sigma _t^2}}\right) . \end{aligned}$$
(10)

Note that as the noise level \(\sigma _t^2\) goes to 0, the sum under the square root goes to infinity and the entire expression goes to \(1-\Phi (\infty )=0\). That is, if there is no noise then the type II error rate of the likelihood ratio test will vanish. If there is any noise at all, however, the sum is finite and (10) becomes arbitrarily close to \(1-\Phi (\Phi ^{-1}(\alpha ))=1-\alpha \) as either p, the probability of detecting an infected individual, or \(\epsilon \), the perturbation size, goes to 0. For example, if we set the type I error rate to \(\alpha =0.1\) then as either p or \(\epsilon \) goes to 0, the probability of making a type II error will approach 0.9. Similarly, type II error will go to \(1-\alpha \) as \(\sigma _t^2\) goes to infinity. This limit is unrealistic, though, since \(\sigma _t^2\) is the variance of observed data and as such should be less than the population size. This leads us to consider two cases for noise.

Case 1: Noise proportional to population size, i.e. \(\sigma _t = N\sigma \) for \(\sigma \in (0, 1)\).

Case 2: Noise proportional to number of infections, i.e. \(\sigma _t= N\sigma i_t\) for \(\sigma >0\).

In both cases \(\sigma \) is constant and independent of t. Case 2 involves \(i_t\), which is not expressible in closed form. However, we can use Approximation 1 and its derivation, specifically the approximate solution (A2), to circumvent this issue by replacing \(i_t\) with \(e^{\delta t}i_0\). As discussed in Sect. 3.1, this approximation is appropriate early in the epidemic. In Case 1, Eq. (10) becomes

$$\begin{aligned} \mathcal {E}_2 \approx \ 1 - \Phi \left( \Phi ^{-1}(\alpha )+\frac{pi_0\epsilon }{\sigma \sqrt{2}}\sqrt{\sum _{t=1}^T e^{2\delta t}}\right) . \end{aligned}$$

In addition to the aforementioned limits, we see in this case that the expression, and hence the type II error, approaches \(1-\alpha \) as the population N goes to infinity (so that \(i_0=1/N\) goes to 0). In Case 2, Eq. (10) becomes

$$\begin{aligned} \mathcal {E}_2(\tfrac{\pi }{4}) \approx \mathcal {E}_2(\tfrac{5\pi }{4}) \approx 1 - \Phi \left( \Phi ^{-1}(\alpha )+\frac{p\epsilon \sqrt{T}}{\sigma \sqrt{2}}\right) . \end{aligned}$$
(11)

The above expression depends neither on the population size N nor on the SIR parameters \(\beta \) and \(\gamma \), while the asymptotic results for p, \(\epsilon \), and \(\sigma \) still apply. Since Case 1 has noise proportional only to N, it implicitly assumes relative noise is larger earlier in the outbreak, which may not be realistic. Case 2 avoids this since relative noise will be small whenever the reported number of infected individuals is small, e.g. early in an epidemic. For this reason and its invariance under different model parameters and population sizes, we consider only Case 2 moving forward.
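The closed-form approximations (10) and (11) are immediate to evaluate; a minimal sketch for the Case 2 expression (11) follows, with the argument values (matching the setting of Fig. 4) included only as an example.

```python
# Minimal sketch: evaluating the Case 2 type II error approximation (11),
#   E_2(pi/4) ~ 1 - Phi( Phi^{-1}(alpha) + p*eps*sqrt(T) / (sigma*sqrt(2)) ).
# The argument values in the example call are illustrative.
from math import sqrt
from scipy.stats import norm

def type2_case2(alpha, p, eps, sigma, T):
    """Approximate type II error at the worst-case angles pi/4 and 5*pi/4."""
    return 1 - norm.cdf(norm.ppf(alpha) + p * eps * sqrt(T) / (sigma * sqrt(2)))

# e.g. 60 days of data, perfect reporting, 30% relative noise:
print(type2_case2(alpha=0.05, p=1.0, eps=0.03, sigma=0.3, T=60))
```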

3.3 Simple illustration: implications of Approximation 2

Approximation 2 says that given an SIR parameter \(\theta _0\), the probability of failing to reject the hypothesis \(H_0:\theta = \theta _0\) when the alternative \(H_1:\theta =\theta _\epsilon \) is true can be very high, especially when the angle of perturbation is \(\pi /4\) or \(5\pi /4\). In this subsection we take a closer look at what this means for epidemic prediction.

Consider for concreteness the familiar setting \(\theta _0=(.21, .07)\), \(N=10^7\), and \(i_0=1/N\). Figure 5 shows the total number of infections 10 days past the peak time as well as the duration of the epidemic for \(\theta _0\), \(\theta _\epsilon (\pi /4)\), and \(\theta _\epsilon (5\pi /4)\) and varying perturbation sizes \(\epsilon \).

Fig. 5 Consequences of type II error. The top panels show the total number of infections 10 days past the time of peak infection as a percentage of the total population. The bottom panels show the duration of the epidemic, which is defined to be the first day past the peak when less than 10 individuals are infectious. The left panels correspond to the parameter \(\theta _0=(0.21, 0.14)\) and the right panels to \(\theta _0=(.21, .07)\). The red lines give the total percent infected or duration of the epidemic for the true parameter \(\theta _0\) in each of their respective plots, while the blue and green curves give these values for \(\theta _\epsilon (\pi /4)\) and \(\theta _\epsilon (5\pi /4)\) over a range of \(\epsilon \) values, respectively. In all cases \(N=10^7\)

Setting \(\omega =\pi /4\) or \(5\pi /4\) and letting noise be as in Case 2, the first approximation in Approximation 2 can be rearranged to obtain

$$\begin{aligned} \epsilon \approx \frac{\big [\Phi ^{-1}(1-\mathcal {E}_2)-\Phi ^{-1}(\alpha )\big ]\sigma \delta e^\delta \sqrt{2}}{(e^\delta -1)p\sqrt{T}}. \end{aligned}$$

From this we can compute the consequences of type II error. For example, suppose we are 60 days into an epidemic \((T=60)\) and wish to test the hypothesis \(\theta _0=(0.21, 0.07)\) versus \(\theta _\epsilon (5\pi /4)\) as above. Moreover, suppose \(p=1\) (perfect diagnostics), \(\sigma = 0.2\) (infection standard deviation of \(\pm 20\%\) of new cases), and we set \(\alpha =.05\) and \(\mathcal {E}_2=0.5\). Then the above equation gives \(\epsilon \approx .064\). Thus, reading off the right panels in Fig. 5, we see that under these fairly generous conditions a type II error—which has a \(50\%\) chance of occurring—will result in underestimating the total number of infections of an epidemic by over \(5\%\) of the total population and the duration of an epidemic by over \(20\%\) of the predicted duration. For \(\pi /4\) a type II error in this setting will result in overestimating the total infections by nearly \(10\%\) of the total population and the duration by approximately \(10\%\) of the predicted one. A similar computation shows \(\epsilon \approx .062\) when \(\theta _0=(0.21, 0.14)\) with all other parameters the same, and again from the left panels in Fig. 5 we observe significantly different predicted outcomes depending on whether or not the null hypothesis \(H_0:\theta =\theta _0\) is rejected.
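The value \(\epsilon \approx .064\) quoted above follows directly from the displayed formula; a quick numerical check:

```python
# Quick check of the worked example: solving the displayed formula for eps with
# E_2 = 0.5, alpha = 0.05, sigma = 0.2, p = 1, T = 60, and delta = 0.21 - 0.07.
from math import exp, sqrt
from scipy.stats import norm

E2, alpha, sigma, p, T, delta = 0.5, 0.05, 0.2, 1.0, 60, 0.21 - 0.07
eps = ((norm.ppf(1 - E2) - norm.ppf(alpha)) * sigma * delta * exp(delta) * sqrt(2)
       / ((exp(delta) - 1) * p * sqrt(T)))
print(round(eps, 3))   # ~ 0.064
```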

Fig. 6 Practical identifiability of \(\delta \). The left and center panels use the first type II error approximation in Approximation 2 to graph type II error as a function of perturbation size, \(\epsilon \), and noise level, \(\sigma \), respectively. Each of the rainbow-colored curves in both panels corresponds to one of 150 different values of \(\omega \) spread uniformly across \([0,2\pi )\). The color chart in the right panel indicates the colors corresponding to different angles \(\omega \): The light red curves correspond to the \(\omega \) closest to \(\pi /4\) and \(5\pi /4\), the yellow are those a bit farther away, the blue still farther, and the purple are those farthest from \(\pi /4\) and \(5\pi /4\), i.e. closest to \(3\pi /4\) and \(7\pi /4\). Finally, the dark red curve in each of the two panels corresponds to \(\omega =\pi /4\) and \(5\pi /4\), which have the same type II error. Note the rapid fall off in type II error as angles get farther from \(\pi /4\) and \(5\pi /4\), especially as a function of \(\epsilon \). This agrees with the empirical observation in Fig. 1 that MLE favors parameters lying along a line of slope 1. In particular, the inverse problem for \(\delta \) is practically identifiable

While our empirical and theoretical results indicate the inverse problem of finding \(\theta =(\beta ,\gamma )\) is prone to error, they also show inference of \(\delta \) is robust and reliable (see for instance Figs. 1 and 6). In particular, since \(\theta =(\beta ,\gamma )\) is completely determined by \(\gamma \) and \(\delta \), knowledge of \(\delta \) reduces the inverse problem to finding \(\gamma \), the reciprocal of the average number of days an individual is infectious. In this case our new hypothesis test becomes

$$\begin{aligned} H_0: \gamma = \gamma _0 \quad \text { vs. }\quad H_1 : \gamma = \gamma _0+{\hat{\epsilon }} \end{aligned}$$
(12)

for some real number \({\hat{\epsilon }}\). Furthermore, knowing \(\delta \) implies \(\theta \) lies on the line of slope 1 with vertical intercept \(-\delta \). So by a simple geometric argument (see Fig. 7), the above hypothesis test is equivalent to the hypothesis test (7) with \(\epsilon =|{\hat{\epsilon }}|\sqrt{2}\) and angle \(\pi /4\) if \({\hat{\epsilon }}>0\) or \(5\pi /4\) if \({\hat{\epsilon }}<0\). So by (11) the type II error of (12) is

$$\begin{aligned} \mathcal {E}_2 = 1 - \Phi \left( \Phi ^{-1}(\alpha )+\frac{p\epsilon \sqrt{T}}{\sigma \sqrt{2}}\right) = 1 - \Phi \left( \Phi ^{-1}(\alpha )+\frac{p|{\hat{\epsilon }}|\sqrt{T}}{\sigma }\right) . \end{aligned}$$

Thus rather than consider the original hypothesis test, one can first infer \(\delta \), then consider the hypothesis test (12) with type II error rate as above.

Fig. 7 Going from hypothesis test (12) to (7)

3.4 Empirical analysis: NYC COVID-19 cases, March 2020

In this section we discuss the extension of the theoretical results on parametric non-identifiability to a real-world dataset. Consider the Spring 2020 COVID-19 outbreak in New York City. The New York City Health Department keeps a repository of all public COVID-19 data online [32]. Using their daily case data as a proxy for new infections, we directly apply Eq. (2) to the noisy data. To focus on estimation early in the pandemic, we restrict attention to reported daily cases from February 29, 2020 through March 14, 2020, which approximately represent the first two weeks of the pandemic in New York City. This period precedes the statewide lockdown, including the closing of schools on March 15. However, the increasing awareness of COVID-19 and increased testing capacity strongly suggest that the contact rate \(\beta \) and reporting rate p were likely non-constant during this time. These parameters are also not jointly identifiable. Thus, we make the simplifying assumption that they are constant. Below, we show estimates of \(\beta \), \(\gamma \), and \(\sigma \) for fixed values of p ranging from 0.01 to 0.25, consistent with the current literature on the underreporting rates of COVID-19 infection (de Oliveira et al. 2020; Richterich 2020; Lau et al. 2021).

To connect with the earlier analysis, we are following Case 2 as discussed in Sect. 3.2, in which the noise is proportional to number of infections, i.e. \(\sigma _t=N\sigma i_t\) for some \(\sigma > 0\) which is also inferred via maximum likelihood. We thus model daily infections by

$$\begin{aligned} y_t = pN(s_{t-1}-s_t) + \sqrt{N i_t}\epsilon _t, \quad \epsilon _t \sim N(0,\sigma ^2) \end{aligned}$$

from which we obtain the log-likelihood function

$$\begin{aligned} \ell (y_t\vert \beta ,\gamma ,\sigma ) \approx -\frac{1}{2}\sum _{k=1}^{t} \frac{\left( y_k-pN(s_{k-1}-s_{k})\right) ^2}{Ni_k\sigma ^2}. \end{aligned}$$

For a fixed value of \(p=0.05\), maximizing the above likelihood gives estimates

$$\begin{aligned} (\hat{\beta },\hat{\gamma },\hat{\sigma }) = (4.82,4.22,1.37) \end{aligned}$$

and a corresponding estimate of \(\hat{R}_0 = 1.14\). The SIR curve generated by the maximum likelihood estimates of \(\beta \) and \(\gamma \) is shown in Fig. 8 with corresponding 95% confidence regions based on the maximum likelihood estimate of \(\sigma .\) Additional results for \(p=0.01, 0.02\), and 0.1 are also shown in the figure.
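A sketch of this fitting procedure is given below. The 14 daily counts in the code are placeholders rather than the actual NYC data, the population size and optimizer settings are our own choices, and, unlike the displayed expression, the objective keeps the Gaussian normalization term so that \(\sigma \) remains identifiable.

```python
# Sketch of the fitting procedure in this subsection: maximize the Gaussian
# likelihood of daily counts y_t with mean p*N*(s_{t-1}-s_t) and variance
# N*i_t*sigma^2 over (beta, gamma, sigma) for a fixed reporting rate p.
# The data, population size, and optimizer settings below are placeholders.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

N, p = 8.4e6, 0.05                        # population size assumed here; fixed p
y = np.array([1., 1, 2, 3, 5, 8, 12, 20, 33, 52, 83, 130, 200, 310])  # placeholder counts
T = len(y)

def s_i_path(beta, gamma):
    rhs = lambda t, x: [-beta * x[0] * x[1], beta * x[0] * x[1] - gamma * x[1]]
    sol = solve_ivp(rhs, (0, T), [1 - 1 / N, 1 / N], t_eval=np.arange(T + 1),
                    rtol=1e-8, atol=1e-10)
    return sol.y[0], sol.y[1]

def neg_log_lik(params):
    beta, gamma, sigma = params
    if beta <= 0 or gamma <= 0 or sigma <= 0:
        return np.inf
    s, i = s_i_path(beta, gamma)
    mean = p * N * (s[:-1] - s[1:])
    var = N * i[1:] * sigma ** 2
    # full Gaussian log-likelihood (normalization kept so sigma is identifiable)
    return 0.5 * np.sum((y - mean) ** 2 / var + np.log(2 * np.pi * var))

fit = minimize(neg_log_lik, x0=[2.0, 1.5, 1.0], method="Nelder-Mead")
beta_hat, gamma_hat, sigma_hat = fit.x
print(beta_hat, gamma_hat, sigma_hat, beta_hat / gamma_hat)   # last value: R_0 estimate
```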

Fig. 8 New York City public testing results for COVID-19 from the first known case on February 29, 2020 to March 15, 2020. We have used maximum likelihood estimation to generate an SIR trajectory through the noisy data for each reporting rate

Returning to the testing framework, Fig. 9 provides type II error estimates based on the approximation of Eq. (11) with significance level \(\alpha = 0.1\), reporting rates \(p=0.01\), 0.02, 0.05, and 0.1, and \(T = 14\) days of new infection counts. For all values of p considered herein, the MLE \(\hat{\sigma }\) is greater than 0.75, which corresponds to a type II error greater than 80% for all values of \(\epsilon \) for which the perturbed parameters \(\hat{\theta }_\epsilon (\pi /4)\) or \(\hat{\theta }_\epsilon (5\pi /4)\) have corresponding \(R_0 > 1\).
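These estimates come from evaluating the same Case 2 expression (11) at \(T=14\) and \(\alpha =0.1\); the short sketch below reproduces that computation over a grid of \((\epsilon ,\sigma )\) values. The grid ranges and the reporting rate used here are our own illustrative choices, since Table 3 is not reproduced in this excerpt.

```python
# Sketch: type II error surface from the Case 2 approximation (11) with
# alpha = 0.1 and T = 14 days of observations.  Grid ranges and the reporting
# rate are illustrative choices.
import numpy as np
from scipy.stats import norm

alpha, T, p = 0.1, 14, 0.05
eps_grid = np.linspace(0.01, 0.6, 60)
sigma_grid = np.linspace(0.25, 2.0, 60)
E, S = np.meshgrid(eps_grid, sigma_grid)

type2 = 1 - norm.cdf(norm.ppf(alpha) + p * E * np.sqrt(T) / (S * np.sqrt(2)))
print(type2.min(), type2.max())    # > 0.8 everywhere on this particular grid
```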

Fig. 9 Type II error rate as a function of \(\epsilon \) and \(\sigma \) at significance level \(\alpha =0.1\), reporting rate \(p=0.15\), and \(T=14\) days of new infection observations

Thus, while there may be a large disparity between the true SIR parameters and our maximum likelihood estimates—hence large differences in the estimate of \(R_0\)—the hypothesis testing framework has very low power to detect such differences. This result is based on the most difficult-to-detect perturbations of \(\theta \) and is therefore pessimistic, but it provides important limits on the extent to which one can rely on parameter estimates from noisy, early-pandemic data.

One final note on the preceding example: the MLEs of \(\beta \) and \(\gamma \) in the previous analysis are quite sensitive to the reporting rate. For reference, Table 3 provides the corresponding maximum likelihood estimates of \(\beta \), \(\gamma \), and \(\sigma \) as a function of p.

Table 3 Maximum likelihood estimates of \(\beta \), \(\gamma \), and \(\sigma \) are shown for different choices of reporting rate p

However, the type II error plot in Fig. 9 is largely unchanged for the range of p in the preceding table. Since \(\hat{\sigma }\) is fairly robust to different choices of p, our conclusion about the limited power of testing holds true for the range of p considered. Therefore, one has limited statistical power to detect large differences in SIR model parameters in the worst-case scenario, regardless of the choice of reporting rate. As such, we believe this article serves as a cautionary tale to those fitting SIR-type models in the early days of an epidemic.

There is an important distinction to make. Having low power to detect a difference is not equivalent to being unable to tell that there is a difference. Certain parameter values are essentially impossible given natural assumptions about the dynamics of a pandemic. For example, \(1/\gamma \) is the average time an infected individual can spread the disease before they are either recovered or removed from the population by quarantine. Extremely large values of \(\gamma \) and hence small values of \(1/\gamma \), such as those attained in Table 3, are likely unrealistic. Thus, the inclusion of side or prior information on \(\gamma \) and/or \(\beta \) akin to the analysis in Sect. 3.3 can greatly improve one’s ability to disambiguate different SIR parameters.

4 Discussion

The preceding analysis was based on a simple implementation of the SIR model. Practitioners studying future outbreaks may consider a multitude of modifications to our model construction which result in different likelihood functions. Thus, we have decided to conclude this article with a short discussion of how one may adapt our techniques to these different settings to better understand issues of practical identifiability with noisy or sparse observations.

To construct an analytically tractable approximation to the type II error, we assumed the proportion of susceptible individuals remains essentially 1 and thus obtained a linear system, namely (A1), that approximates the SIR equations. Such approximations are suitable locally in time and are therefore appropriate when one is focused on the early stages of an outbreak. Importantly, a similar approach can be used to construct analytic approximations to the dynamics of any epidemic model. Such expressions will depend on unknown SIR parameters, fixed parameters such as population size, and other parameters such as reporting rate or behavioral factors, as in Cori et al. (2020). For example, Britton and Scalia Tomba (2019) assume the proportion of susceptible individuals remains 1 early in an epidemic to study the problem of inferring infection rate from observations of generation and serial times, which are often available via contact tracing. In all cases one can investigate the use of this and other realistic simplifications of the dynamics to approximate type II error and better understand potential limitations of their particular model. We believe this approach remains an interesting, potentially fruitful avenue toward understanding identifiability in a wide array of epidemic models.
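The core simplification just described (\(s_t\approx 1\), and hence \(i_t\approx e^{\delta t}i_0\) early on) is easy to check numerically against the full SIR model for any particular parameter setting; a minimal sketch with illustrative values:

```python
# Minimal sketch: comparing the early-epidemic linearization i_t ~ e^{delta*t} i_0
# (valid while s_t ~ 1) with the numerically integrated SIR model.
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma, N = 0.21, 0.07, 1e7
delta, i0 = beta - gamma, 1 / N
rhs = lambda t, x: [-beta * x[0] * x[1], beta * x[0] * x[1] - gamma * x[1]]
t = np.arange(0, 121)
sol = solve_ivp(rhs, (0, 120), [1 - i0, i0], t_eval=t, rtol=1e-9, atol=1e-12)

i_exact = sol.y[1]
i_linear = np.exp(delta * t) * i0
rel_err = np.abs(i_linear - i_exact) / i_exact
for day in (40, 60, 80, 100, 120):
    print(day, round(rel_err[day], 3))   # error grows as s_t moves away from 1
```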

On the theoretical side, the upper bound for the error term in Proposition 1 of “Appendix C” gives some rigorous justification for the approximate dynamics used throughout this work. However, the numerical results of the main text and “Appendix B” indicate the approximation is more accurate than the theoretical bound suggests. It is therefore an open question whether our theoretical bound on the approximation error can be improved upon, perhaps via other approximate solutions of the SIR model found in, for example, (Turkyilmazoglu 2021; Barlow and Weinstein 2020; Schlickeiser and Kröger 2021) and references therein. Finally, since the approximation of s by 1 is used in other models and to investigate other questions about epidemics (Sauer et al. 2020; Britton and Scalia Tomba 2019), it is also of interest whether estimates of the error in our setting can be used to control error for similar approximations in related settings.