1 Introduction

Single-molecule (SM) experiments allow us to peer deep into the molecular dynamics that drive biology at the microscopic scale [4, 14]. Though observing the dynamics of a single molecule is an amazing feat in and of itself, the information gleaned is limited by the small number of observables that can be simultaneously tracked, and the resolution at which this can be done. Faced with such limitations, mechanistic modeling and parameter estimation are often used to extract as much quantitative information as possible.

Using SM fluorescence or Förster resonance energy transfer (FRET) [15], it is possible to generate time distributions for reactions, such as the unbinding-time distributions of ligands unbinding from a single receptor (Fig. 5.1). Such distributions are particularly useful when the pathway includes multiple steps, as they can then be quite complex and information rich. Faced with systems exhibiting several characteristic times, least-squares (LS) fitting is often brought to bear on the problem. Though popular and often useful, there are situations in which standard LS approaches fail, and unfortunately they often do so in quite non-obvious ways. To help the reader understand and avoid such pitfalls, we here explore some of these situations through the lens of maximum likelihood (ML) estimation, an alternative approach that has become very popular in the physical sciences [1,2,3, 7, 8, 13, 16, 17, 24,25,26, 28].

Fig. 5.1

a A single-step ligand-receptor unbinding model. A dsRNA-binding protein releases dsRNA at a characteristic rate \( k_{\text{off}} \). For this model, we expect an exponential distribution of unbinding times, with the average unbinding time \( \hat{\tau }_{\text{off}} = 1/k_{\text{off}} \). b A histogram (bars) formed from 300 unbinding times picked from an exponential distribution with the true average unbinding time \( \hat{\tau }_{\text{off}} = 1 \) s. The predicted bin counts for a model with average unbinding time \( \tau_{\text{off}} = 1 \) s are shown as a red curve, and the notation used in Eq. (5.1) is indicated for bin \( b \) (pink bar). In the inset, we show the unweighted LS residue \( R^{\text{uwLS}} \left( {\tau_{\text{off}} } \right) \) (in log-scale) as a function of the model parameter \( \tau_{\text{off}} \). The function displays a global minimum close to the true average unbinding time (yellow arrow), as well as a local minimum at short times (red arrow). Beware that local minima can sometimes trap numerical minimization algorithms, leading them to erroneously report a local minimum as the sought-after global minimum

As it is straightforward, adaptable, and well suited to SM experiments, we here provide a self-contained introduction to ML estimation. We heuristically show that ML estimation should generally outperform LS fitting and explicitly show this to be the case in relevant SM FRET examples. We close with a discussion of how to use bootstrapping to estimate the standard deviation of fit parameters. The presentation is intended for SM experimenters who find fitting data indispensable to their work, but might find the advantages/limitations/rationale of various approaches hard to ascertain.

2 Prerequisites

In an effort to be self-contained, we start by discussing LS fitting, as well as error estimation and some crucial concepts in probability theory. These sections can be skipped by the initiated reader.

2.1 LS Fitting and the Distance Between Model and Data

LS fitting comes in several flavors, depending on how statistical fluctuations in bin counts are accounted for. The fitting is generally performed by collecting the available data into bins \( b = 1,2, \ldots ,B \), and finding the model parameter values that minimize the total square deviation between actual bin counts \( \left( {H_{b} } \right) \) and model predictions for bin counts \( \left( {h_{b} } \right) \) (Fig. 5.1b), normalized by the true standard deviation of the bin count \( \left( {\sigma_{b} } \right) \). We will refer to this approach as true LS (tLS) fitting. For unbinding times in the simple RNA-protein example of Fig. 5.1a, tLS fitting consists of finding the model parameter \( \tau_{\text{off}} \) (the average unbinding time of the model) that minimizes the total residue

$$ R^{\text{tLS}} \left( {\tau_{\text{off}} } \right) = \mathop \sum \limits_{b = 1}^{B} \left( {\frac{{\Delta H_{b} \left( {\tau_{\text{off}} } \right)}}{{\sigma_{b} }}} \right)^{2} , \quad \Delta H_{b} \left( {\tau_{\text{off}} } \right) = H_{b} - h_{b} \left( {\tau_{\text{off}} } \right). $$
(5.1)

Minimizing the total residue \( R^{\text{tLS}} \) makes intuitive sense, as it penalizes parameter values that give large deviations between predictions and measurements, in a manner scaled by the size of statistical fluctuations in each bin. A perfect estimate in a bin \( \left( {H_{b} = h_{b} \left( {\tau_{\text{off}} } \right)} \right) \) results in zero residue, while any positive (weighted) residue gives a measure of the “statistical distance” between model and data in that bin. By summing the residues in Eq. (5.1), we get a measure of the total distance between model and data; tLS fitting aims to minimize this distance.

Unfortunately, we do not often have access to the true standard deviation of counts in each bin, and various approximations to Eq. (5.1) must be deployed. For ease of presentation, we will here focus on two cases: In the first case, we assume that count fluctuations are almost constant over all bins, and we use unweighted LS (uwLS) residues by taking \( \sigma_{b} \) to be constant (e.g., see inset in Fig. 5.1b); in the second case, we assume a fixed total number (\( N \)) of independent measurements, such that the counts in each bin are binomially distributed, with \( \sigma_{b} = \sqrt{\left\langle H_{b} \right\rangle \left( 1 - \left\langle H_{b} \right\rangle / N \right)} \approx \sqrt{\left\langle H_{b} \right\rangle } \). Here, the angle brackets represent the statistical average over a large number of experiments, and we have in the last step assumed bins to be small enough that no bin on average contains a large fraction of the total number of observations (i.e., \( \left\langle H_{b} \right\rangle \ll N \) for all bins). With no better estimate at hand, the statistical average of bin counts is often approximated with the observed bin count by setting \( \sigma_{b} \approx \sqrt{H_{b}} \) in Eq. (5.1). We will refer to this approach as weighted LS (wLS) fitting.

Both wLS and uwLS fitting can be problematic. Using uwLS, we assume fluctuations in bin counts to be uniform over bins. As we shall see, this is often a poor approximation for systems with multiple characteristic timescales. Using wLS, we instead use individual bin counts to estimate the standard deviation of counts in that bin. As individual bin counts can be small, relative fluctuations can be large, resulting in large approximation errors when using \( \sigma_{b} \approx \sqrt {H_{b} } \) in Eq. (5.1).
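To make the two approximations concrete, here is a minimal sketch (using NumPy; the exponential model, bin count, and seed are illustrative choices, not from the text) that computes both flavors of the residue in Eq. (5.1) for a binned set of unbinding times:

```python
import numpy as np

def ls_residues(times, tau_off, n_bins=20):
    """Total uwLS and wLS residues (Eq. 5.1 with sigma_b constant, or
    sigma_b ~ sqrt(H_b)) for an exponential model on a common binning."""
    counts, edges = np.histogram(times, bins=n_bins)
    # Predicted bin counts: N times the model probability of each bin,
    # obtained from the exponential CDF 1 - exp(-t/tau_off).
    h = len(times) * np.diff(1.0 - np.exp(-edges / tau_off))
    dH = counts - h
    r_uwls = float(np.sum(dH**2))            # sigma_b taken as constant (=1)
    nz = counts > 0                          # wLS is undefined for empty bins
    r_wls = float(np.sum(dH[nz]**2 / counts[nz]))
    return r_uwls, r_wls

rng = np.random.default_rng(0)
t = rng.exponential(1.0, size=300)           # 300 unbinding times, tau = 1 s
r_uwls, r_wls = ls_residues(t, tau_off=1.0)
```

Note how the wLS branch must skip empty bins, foreshadowing the problems with low-count bins discussed below.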

2.2 Error Estimation, Variation, and Systematic Bias

For any estimation method applied to an experiment with a finite set of measurements, the estimated parameter value (\( \tau \)) will deviate from the true value (\( \hat{\tau } \)). To compare two methods, we need to understand the distribution of parameter estimates that each approach would yield were it to be repeated many times. Over a large number of experiments, the typical error can be measured by the mean square error, \( {\text{MSE}} = \left\langle \left( \tau - \hat{\tau } \right)^{2} \right\rangle \). To understand the nature of estimation errors, consider the bias \( \Delta \tau^{\text{bias}} = \left\langle \tau \right\rangle - \hat{\tau } \), capturing how the average estimate deviates from the true parameter value, as well as the standard deviation \( \Delta \tau^{\text{sd}} = \sqrt{\left\langle \left( \tau - \left\langle \tau \right\rangle \right)^{2} \right\rangle } \), capturing the typical spread of estimates around their average (Fig. 5.2). Conveniently, the bias and standard deviation add in quadrature to form the MSE [9]

Fig. 5.2

A histogram of estimates for a hypothetical process with the true parameter value \( \hat{\tau } \) =  1. The systematic bias \( \Delta \tau^{\text{bias}} \) and the typical size of fluctuations \( \Delta \tau^{\text{sd}} \) around the average estimate \( \langle \tau \rangle \) are indicated

$$ {\text{MSE}} = \left( {\Delta \tau^{\text{bias}} } \right)^{2} + \left( {\Delta \tau^{\text{sd}} } \right)^{2} . $$

The smaller the MSE the better, and we should seek to minimize both the bias and standard deviation as far as possible. A large bias can be introduced by the estimation method itself, while a large standard deviation typically results from a lack of data and/or accuracy of the measurements.
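The quadrature decomposition is easy to verify numerically. In this sketch (the estimator, sample sizes, and seed are arbitrary illustrative choices), the "method" is simply the sample mean of \( N \) exponential unbinding times:

```python
import numpy as np

# Monte Carlo illustration of MSE = bias^2 + sd^2 for a simple estimator:
# the sample mean of N exponential unbinding times (true value tau_hat = 1 s).
rng = np.random.default_rng(1)
tau_hat, N, n_rep = 1.0, 50, 20000
estimates = rng.exponential(tau_hat, size=(n_rep, N)).mean(axis=1)

mse = np.mean((estimates - tau_hat) ** 2)    # typical squared error
bias = np.mean(estimates) - tau_hat          # Delta tau^bias
sd = np.std(estimates)                       # Delta tau^sd
```

The identity holds exactly (up to floating-point rounding) for any distribution of estimates, since it is just the usual variance decomposition around the mean.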

2.3 Bayes’ Equation and Observation Frequencies

To explain the rationale behind ML estimation [9], we first introduce Bayes’ equation by way of Venn diagrams and the frequentist interpretation of probability. According to this interpretation, probabilities can be seen as the asymptotic frequency of outcomes, recorded over a large number of repetitions [12]. For concreteness, imagine a steady rainfall with water drops hitting the yellow (event \( A \))- and blue (event \( B \))-striped shapes shown in Fig. 5.3. Further imagine keeping track of the number of raindrops that fall on the yellow-striped shape \( \left( {N_{A} } \right) \), on the blue-striped shape \( \left( {N_{B} } \right) \), on the overlap carrying both yellow and blue stripes \( \left( {N_{A\& B} } \right) \), or anywhere on the two shapes \( \left( {N_{\text{tot}} = N_{A} + N_{B} - N_{A\& B} } \right) \). Among these various counts, the relationship

Fig. 5.3

Imagine exposing the blue and yellow shapes to rain, while keeping track of the number of raindrops that hit each differently striped area. If the rainfall is steady, we can use the frequentist interpretation of probability to relate the different fractions of raindrops landing on the various areas to probabilities. The trivial Eq. (5.2) then becomes Bayes’ equation as expressed in Eq. (5.3)

$$ \frac{{N_{A\& B} }}{{N_{\text{tot}} }} = \frac{{N_{A\& B} }}{{N_{B} }}\frac{{N_{B} }}{{N_{\text{tot}} }} = \frac{{N_{A\& B} }}{{N_{A} }}\frac{{N_{A} }}{{N_{\text{tot}} }} $$
(5.2)

holds trivially true, as can be seen by canceling the first denominator with the second numerator after each equal sign. If we collect enough raindrops, the fraction of raindrops that have so far landed on a particular section approaches the probability that the next raindrop will land on that same section. Taking the frequentist approach, we can thus translate Eq. (5.2) into Bayes’ equation for probabilities

$$ P\left( {A,B} \right) = P\left( {A|B} \right)P\left( B \right) = P\left( {B|A} \right)P\left( A \right). $$
(5.3)

In the above, \( P\left( {A,B} \right) = N_{A\& B} /N_{\text{tot}} \) is the joint probability that both \( A \) and \( B \) occur, \( P\left( A \right) = N_{A} /N_{\text{tot}} \) is the probability that \( A \) occurs irrespective of whether \( B \) occurs or not, \( P\left( {A |B} \right) = N_{A\& B} /N_{B} \) gives the conditional probability that \( A \) occurs given that \( B \) occurs, and similarly with \( A \) and \( B \) interchanged.
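The frequentist reading of Eq. (5.3) can be checked directly by simulation. In this sketch, uniformly scattered points stand in for raindrops and two overlapping rectangles stand in for the striped shapes of Fig. 5.3 (the regions, sample size, and seed are illustrative):

```python
import numpy as np

# Frequentist check of Bayes' equation (Eq. 5.3) with uniform "raindrops"
# on the unit square and two overlapping rectangular regions.
rng = np.random.default_rng(2)
n = 200_000
x, y = rng.random(n), rng.random(n)
A = x < 0.6          # event A: drop lands in the left region
B = y < 0.4          # event B: drop lands in the lower region

P_AB = np.mean(A & B)            # N_{A&B} / N_tot
P_A_given_B = np.mean(A[B])      # N_{A&B} / N_B
P_B = np.mean(B)                 # N_B / N_tot
P_B_given_A = np.mean(B[A])
P_A = np.mean(A)
```

As in Eq. (5.2), the relation \( P(A,B) = P(A|B)P(B) = P(B|A)P(A) \) holds exactly for the recorded frequencies, not just asymptotically, because the shared counts cancel.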

2.4 Continuous Outcomes and Probability Densities

We are ultimately interested in measurements that produce real numbers (such as unbinding times), while Bayes’ equation (Eq. 5.3) is valid for probabilities of discrete events. For outcomes that can take continuous values, the relevant concept is that of the probability density function (PDF). For two concurrent continuous outcomes, let events \( A \) and \( B \) denote recording the respective values in the intervals \( I_{a} \) and \( I_{b} \), centered around \( a \) and \( b \). For very short interval lengths \( \Delta I_{a} \) and \( \Delta I_{b} \), the probability to end up in an interval (denoted by an upper case \( P \)) is simply the relevant PDF (denoted by a lower case \( p \)) multiplied by the relevant interval length(s)

$$ P\left( {A,B} \right) = p\left( {a,b} \right)\Delta I_{a} \Delta I_{b} , $$
$$ P\left( A \right) = p\left( a \right)\Delta I_{a} , P\left( {A|B} \right) = p\left( {a |b} \right)\Delta I_{a} , P\left( B \right) = p\left( b \right)\Delta I_{b} , P\left( {B|A} \right) = p\left( {b |a} \right)\Delta I_{b} . $$

The above relations can be plugged into Bayes’ equation for probabilities of discrete events (Eq. 5.3), giving the sought after Bayes’ equation for PDFs of continuous outcomes

$$ p\left( {a,b} \right) = p\left( {a |b} \right) p\left( b \right) = p\left( {b |a} \right) p\left( a \right). $$
(5.4)

With the prerequisites covered, we are now ready to address the rationale behind ML estimation and assess how it compares and relates to LS fitting.

3 Maximum Likelihood

To keep the discussion general, consider an experiment where we collect \( N \) independent measurements \( \left\{ t \right\}_{N} = \left\{ {t_{1} ,t_{2} , \ldots ,t_{N} } \right\} \) and model it as a process with \( M \) parameters \( \left\{ \tau \right\}_{M} = \left\{ {\tau_{1} ,\tau_{2} , \ldots ,\tau_{M} } \right\} \). Based on our one experiment, we would like to determine the parameter values that gave rise to the data. As the stochasticity of the data makes it impossible to determine the parameters exactly, our best bet would by definition be to find the most probable set of parameter values, given the data. In the language of conditional PDFs, this corresponds to finding the model parameters which maximize the PDF \( p\left( {\left\{ \tau \right\}_{M} |\left\{ t \right\}_{N} } \right) \) of having a model with parameters \( \left\{ \tau \right\}_{M} \) given the measured data \( \left\{ t \right\}_{N} \) (for a lighthearted and instructive discussion of the meaning of the probability of a model, see [18]). Unfortunately, we do not have direct access to this conditional PDF. Still, we can make considerable progress by using Bayes’ equation and introducing a few additional assumptions.

3.1 The Most Likely Model

Through Bayes’ equation for PDFs (Eq. 5.4), we can relate the unknown PDF of interest to PDFs about which we do have some knowledge, or regarding which we can at least make some reasonable assumptions. Letting \( a = \left\{ t \right\}_{N} \) and \( b = \left\{ \tau \right\}_{M} \) in Eq. (5.4), we have

$$ p\left( {\left\{ \tau \right\}_{M} |\left\{ t \right\}_{N} } \right) = \frac{{p\left( {\left\{ \tau \right\}_{M} } \right)}}{{p\left( {\left\{ t \right\}_{N} } \right)}}p\left( {\left\{ t \right\}_{N} |\left\{ \tau \right\}_{M} } \right). $$

With the aim to maximize the left-hand side of the above expression with respect to the model parameters, we note that the denominator on the right-hand side does not depend on the model parameters and therefore will not influence which parameter value maximizes the left-hand side; we promptly ignore the denominator. The numerator can be interpreted as encoding what we knew of the correct parameter values before our experiments. If we assume little or no prior knowledge, it makes sense to also assume this prior PDF to be roughly uniform and thus largely independent of the model parameters; we promptly ignore the numerator as well. The last term on the right-hand side of the equation describes the PDF of a particular set of measurements, given the model parameters. This conditional PDF can be calculated if we have a model of the system!

Through the above argument, we conclude that by maximizing the likelihood function \( p\left( {\left\{ t \right\}_{N} |\left\{ \tau \right\}_{M} } \right) \), we can find an estimate for the model parameter values that best describe the data. Equivalently, we could choose to minimize the log-likelihood function \( L^{\text{ML}} \left( {\left\{ \tau \right\}_{M} } \right) = - \ln p\left( {\left\{ t \right\}_{N} |\left\{ \tau \right\}_{M} } \right) \), which has a global minimum for the same parameter values as the likelihood function has a global maximum. As we assume independent measurements, the PDF of the whole experimental outcome \( \left\{ t \right\}_{N} \) can simply be written as the product of the PDFs for each measurement. The log-likelihood function then has the convenient property that it turns into a sum over measurements,

$$ L^{\text{ML}} \left( {\left\{ \tau \right\}_{M} } \right) = - \ln \left( {\mathop \prod \limits_{n = 1}^{N} p\left( {t_{n} |\left\{ \tau \right\}_{M} } \right)} \right) = - \mathop \sum \limits_{n = 1}^{N} \ln p\left( {t_{n} |\left\{ \tau \right\}_{M} } \right). $$
(5.5)

Finding the parameter values that globally minimize Eq. (5.5) constitutes ML parameter estimation, and we now apply it to a few simple but illustrative examples to familiarize the reader with the approach.

3.2 ML Estimation for an Exponential Process

To demonstrate ML estimation in practice, we return to ligand–receptor unbinding. For simple unbinding kinetics, the unbinding times are exponentially distributed with the PDF \( p\left( {t|\tau_{\text{off}} } \right) = {\text{e}}^{{ - t/\tau_{\text{off}} }} /\tau_{\text{off}} \). Inserting this PDF into Eq. (5.5), we see that the log-likelihood function is given by

$$ L^{\text{ML}} \left( {\tau_{\text{off}} } \right) = N\left( {\ln \tau_{\text{off}} + \frac{{\bar{t}}}{{\tau_{\text{off}} }}} \right), \quad \bar{t} = \frac{1}{N}\mathop \sum \limits_{n = 1}^{N} t_{n} . $$

The ML estimate \( \left( {\tau_{\text{off}}^{\text{ML}} } \right) \) is now arrived at by minimizing \( L^{\text{ML}} \left( {\tau_{\text{off}} } \right) \) with respect to \( \tau_{\text{off}} \). In this simple example, we can find the ML estimate analytically by using the zero-derivative test for finding an optimum,

$$ 0 = \frac{{\partial L^{\text{ML}} }}{{\partial \tau_{\text{off}} }}\left( {\tau_{\text{off}}^{\text{ML}} } \right) = N\left( {\frac{1}{{\tau_{\text{off}}^{\text{ML}} }} - \frac{{\bar{t}}}{{(\tau_{\text{off}}^{\text{ML}} )^{2} }}} \right)\quad \Rightarrow \quad \tau_{\text{off}}^{\text{ML}} = \bar{t}. $$
(5.6)

Consequently, ML estimation confirms the well-known result that the characteristic time of an exponential process can be estimated by the average event time observed in the data; or simply, the off-rate estimate is \( k_{\text{off}}^{\text{ML}} = 1/\bar{t} \). Note that we did not need to perform any binning to extract this estimate, which constitutes a clear advantage over standard LS fitting methods.
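A quick numerical sanity check of Eq. (5.6) follows (sample size and seed are arbitrary choices for this sketch):

```python
import numpy as np

# ML estimation for an exponential process (Eq. 5.6): the estimate is the
# sample mean of the observed times, with no binning involved.
rng = np.random.default_rng(3)
tau_true = 1.0
times = rng.exponential(tau_true, size=10_000)

tau_ml = times.mean()          # Eq. (5.6): tau_off^ML = t-bar
k_off_ml = 1.0 / tau_ml        # corresponding off-rate estimate

# The sample mean also minimizes L(tau) = N (ln tau + t_bar / tau).
L = lambda tau: len(times) * (np.log(tau) + times.mean() / tau)
```

Evaluating `L` on either side of `tau_ml` confirms that the sample mean sits at the minimum of the log-likelihood function.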

3.3 ML Estimation for an Exponential Process with a Time Cutoff

The simplest additional characteristic time to consider is possibly that introduced by photobleaching in FRET experiments. With photobleaching, the experimental signal in our unbinding example can, in addition to unbinding, also be lost due to the stochastic degradation of fluorophores over time. We can account for photobleaching by interpreting the estimated characteristic rate (\( 1/ \tau_{\text{off}}^{\text{ML}} \)) of the PDF (which is still exponential), not purely as the unbinding rate, but as the sum of the unbinding and bleaching rates. As the bleaching rate can usually be independently measured, we can often readily estimate the unbinding rate by subtracting the bleaching rate from the estimated total rate.

Next, consider having a hard cutoff time \( T_{\text{cut}} \) limiting the duration of each measurement. Slightly more complex than photobleaching, this scenario will serve to demonstrate that the ML approach often allows us to utilize extra information in a rational manner. Though we cannot know the precise duration for any binding event lasting longer than \( T_{\text{cut}} \), there is information in the number of unbinding events that exceeded it. We start by noting that the simple ML recipe used in Eq. (5.6) does not work, as losing long unbinding times will clearly lead us to underestimate the characteristic unbinding time. Instead, we would like to keep the information regarding the number of measurements that exceeded the finite measurement time window. Combining the probability densities of the measured unbinding times \( \left( {\left\{ t \right\}_{{N_{\text{rec}} }} } \right) \) with the probabilities of the missed times \( \left( {\left\{ {t^{\prime}} \right\}_{{N_{\text{cut}} }} } \right) \), the relevant likelihood function is

$$ \underbrace {{\mathop \prod \limits_{n = 1}^{{N_{\text{rec}} }} p\left( {t_{n} |\tau_{\text{off}} } \right)}}_{{\begin{array}{*{20}c} {{\text{PDF}}\;{\text{of }}\;{\text{the}}\; N_{\text{rec}} } \\ {{\text{recorded }}\;{\text{events}}} \\ \end{array} }}\underbrace {{\mathop \prod \limits_{{n^{\prime} = 1}}^{{N_{\text{cut}} }} P\left( {t^{\prime}_{{n^{\prime}}} > T_{\text{cut}} |\tau_{\text{off}} } \right)}}_{{\begin{array}{*{20}c} {{\text{probability }}\;{\text{of}}\; {\text{the}}} \\ {N_{\text{cut}} \;{\text{missed }}\;{\text{events}}} \\ \end{array} }} = \mathop \prod \limits_{n = 1}^{{N_{\text{rec}} }} p\left( {t_{n} |\tau_{\text{off}} } \right) \left( {\mathop \smallint \limits_{{T_{\text{cut}} }}^{\infty } {\text{d}}t^{\prime}\,p\left( {t^{\prime}|\tau_{\text{off}} } \right)} \right)^{{N_{\text{cut}} }} . $$

The corresponding log-likelihood function will now become a sum over both probability densities (for the \( N_{\text{rec}} \) recorded times) and probabilities (for the \( N_{\text{cut}} \) missed times)

$$ \begin{aligned} L^{\text{ML}} \left( {\tau_{\text{off}} } \right) & = - \mathop \sum \limits_{n = 1}^{{N_{\text{rec}} }} \ln p\left( {t_{n} |\tau_{\text{off}} } \right) - N_{\text{cut}} \ln \mathop \smallint \limits_{{T_{\text{cut}} }}^{\infty } {\text{d}}t\,p\left( {t|\tau_{\text{off}} } \right) \\ & = N_{\text{rec}} \left( {\ln \tau_{\text{off}} + \frac{{\bar{t}}}{{\tau_{\text{off}} }}} \right) + \frac{{N_{\text{cut}} T_{\text{cut}} }}{{\tau_{\text{off}} }}. \\ \end{aligned} $$
(5.7)

The ML estimate can once again be found analytically through the zero-derivative condition, yielding the simple formula

$$ 0 = \frac{{\partial L^{\text{ML}} }}{{\partial \tau_{\text{off}} }}\left( {\tau_{\text{off}}^{\text{ML}} } \right) \quad \Rightarrow \quad \tau_{\text{off}}^{\text{ML}} = \bar{t}\left( {1 + \frac{{T_{\text{cut}} N_{\text{cut}} }}{{\bar{t} N_{\text{rec}} }}} \right) $$
(5.8)

to correct for the cutoff-induced bias. Note that the correction only becomes significant when the lower bound of the total duration of cut events (\( T_{\text{cut}} N_{\text{cut}} \)) becomes comparable to the total time of recorded events (\( \bar{t}N_{\text{rec}} \)).
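A simulated sketch of this correction follows (the cutoff, sample size, and seed are arbitrary illustrative choices):

```python
import numpy as np

# Exponential unbinding with a hard measurement cutoff T_cut: the naive mean
# of the recorded times underestimates tau_off, while Eq. (5.8) corrects the
# bias using only the *number* of events that exceeded the cutoff.
rng = np.random.default_rng(4)
tau_true, T_cut = 1.0, 1.5
t = rng.exponential(tau_true, size=50_000)

recorded = t[t <= T_cut]            # precise times are known below the cutoff
n_cut = int(np.sum(t > T_cut))      # above it, only the count survives
t_bar = recorded.mean()

tau_naive = t_bar                                               # biased low
tau_ml = t_bar * (1 + T_cut * n_cut / (t_bar * len(recorded)))  # Eq. (5.8)
```

With these parameters the naive estimate lands well below the true 1 s, while the corrected estimate recovers it, illustrating how the cut events carry real information.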

3.4 ML Estimation for a Double-Exponential Process

The unbinding process itself might have several characteristic times. We next consider the case where the model yields a double-exponential PDF of unbinding times and where the maximal measurement duration is large enough to be ignored. For the unbinding problem discussed above, such PDFs could originate in two interconvertible binding modes: a loose binding mode, to which the ligand first binds and from which it eventually unbinds, and a tight binding mode from which the ligand cannot unbind directly (see Fig. 5.4a). Alternatively, it could result from two protein populations with different unbinding rates. The PDF for either system can be written as (Fig. 5.4b)

Fig. 5.4

a A dsRNA-binding protein exhibiting two bound states, resulting in a double-exponential PDF for the unbinding time. b Histogram (bars, in log-scale) formed by picking 1000 unbinding times from a double-exponential distribution with a PDF characterized by \( \hat{\tau }_{1} = 1 \) s, \( \hat{\tau }_{2} = 5 \) s, and \( \hat{P}_{1} = \hat{P}_{2} = 0.5 \). The predicted bin counts for a model with \( \tau_{1} = 1 \) s, \( \tau_{2} = 5 \) s, and \( P_{1} = P_{2} = 0.5 \) are shown as a red curve

$$ p\left( {{\text{t|}}\tau_{1} ,\tau_{2} ,P_{1} } \right) = \frac{{P_{1} }}{{\tau_{1} }}{\text{e}}^{{ - t/\tau_{1} }} + \frac{{1 - P_{1} }}{{\tau_{2} }}{\text{e}}^{{ - t/\tau_{2} }} $$
(5.9)

where the characteristic times \( \tau_{1} \) and \( \tau_{2} \), as well as the population fraction \( P_{1} \) associated with \( \tau_{1} \), can be directly related to the microscopic rates of the relevant system. Attempting to use the PDF of Eq. (5.9) to calculate the log-likelihood function according to Eq. (5.5), it quickly becomes clear that we can no longer find a simple analytic solution to the minimization problem. This is quite generally the case, and one has to perform the minimization numerically, as we will do when comparing LS and ML approaches on simulated data below.
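A minimal numerical sketch of such a fit follows, with parameters as in Fig. 5.4b. A crude grid search stands in for a proper minimizer here; the grid ranges, sample size, and seed are illustrative choices:

```python
import numpy as np

# Numerical ML estimation for the double-exponential PDF of Eq. (5.9),
# with true parameters tau1 = 1 s, tau2 = 5 s, P1 = 0.5.
rng = np.random.default_rng(5)
N = 2000
fast = rng.random(N) < 0.5                      # which component each draw uses
t = np.where(fast, rng.exponential(1.0, N), rng.exponential(5.0, N))

def nll(tau1, tau2, p1):
    """Negative log-likelihood (Eq. 5.5) for the PDF of Eq. (5.9)."""
    pdf = p1 / tau1 * np.exp(-t / tau1) + (1 - p1) / tau2 * np.exp(-t / tau2)
    return np.sum(-np.log(pdf))

taus = np.linspace(0.5, 7.0, 27)                # 0.25 s steps
ps = np.linspace(0.1, 0.9, 17)                  # 0.05 steps
best = min((nll(a, b, p), a, b, p)
           for a in taus for b in taus for p in ps
           if a < b)                            # fix labeling: tau1 < tau2
_, tau1_ml, tau2_ml, p1_ml = best
```

In practice one would hand `nll` to a standard optimizer rather than a grid, but the grid makes the absence of an analytic solution, and the need for a numerical search, explicit.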

3.5 Coarse-Grained Likelihood

Though ML estimation has the clear advantage of requiring no binning of the data, for large data sets, it often becomes computationally demanding to numerically minimize a log-likelihood function with as many terms as there are measurements (see sum in Eq. 5.5). The computational efficiency can be drastically increased by considering the likelihood over bins, which should be a reasonable approximation as long as we choose the bin size small enough for there to be little change in the PDF over each bin. The probability \( P_{b} \) of a particular measurement ending up in bin \( b \) can then be related to the model PDF and used to calculate the predicted bin count \( h_{b} \) as

$$ h_{b} \left( {\left\{ \tau \right\}_{M} } \right) = NP_{b} \left( {\left\{ \tau \right\}_{M} } \right),\;P_{b} \left( {\left\{ \tau \right\}_{M} } \right) = \mathop \smallint \limits_{{t_{b} - \Delta t_{b} /2}}^{{t_{b} + \Delta t_{b} /2}} {\text{d}}t\,p\left( {t|\left\{ \tau \right\}_{M} } \right) \approx \Delta t_{b} p\left( {t_{b} |\left\{ \tau \right\}_{M} } \right), $$
(5.10)

where the integral runs over the whole width \( \Delta t_{b} \) of bin \( b \) centered around \( t_{b} \).

Splitting the sum over measurements in the definition of the log-likelihood function (Eq. 5.5) into a sum over bins and a sum over measurements in each bin, it can be approximated by the coarse-grained (cg) log-likelihood function

$$ L^{\text{ML}} \left( {\left\{ \tau \right\}_{M} } \right) = - \mathop \sum \limits_{b = 1}^{B} \mathop \sum \limits_{{\begin{array}{*{20}c} {t_{n} \,{\text{in}}} \\ {{\text{bin}}\,b} \\ \end{array} }} \ln p\left( {t_{n} |\left\{ \tau \right\}_{M} } \right) \approx - \mathop \sum \limits_{b = 1}^{B} H_{b} \ln h_{b} \left( {\left\{ \tau \right\}_{M} } \right) = L^{\text{cgML}} \left( {\left\{ \tau \right\}_{M} } \right). $$
(5.11)

Here, the last equality is a definition, and we have dropped constant terms and factors not affecting the minimizing parameter values. Note that the results of using cgML estimation can always be made arbitrarily close to the original ML estimate by choosing the bin widths small enough.
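For the exponential example, the binned cgML estimate can be checked against the exact ML estimate (the sample mean); in this sketch the bin count, search grid, and seed are arbitrary:

```python
import numpy as np

# Coarse-grained ML (Eq. 5.11) for an exponential process: minimize
# -sum_b H_b ln h_b, with h_b = N * P_b from Eq. (5.10).
rng = np.random.default_rng(6)
t = rng.exponential(1.0, size=5000)
counts, edges = np.histogram(t, bins=200)
N = len(t)

def L_cg(tau):
    # Exact bin probabilities from the model CDF, h_b = N * P_b (Eq. 5.10).
    h = N * np.diff(1.0 - np.exp(-edges / tau))
    return -np.sum(counts * np.log(h))

taus = np.linspace(0.5, 2.0, 1501)
tau_cgml = taus[np.argmin([L_cg(x) for x in taus])]
```

With 200 bins, `tau_cgml` already lands very close to `t.mean()`, the exact ML estimate of Eq. (5.6), and narrowing the bins shrinks the remaining difference further.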

3.6 The Connection Between LS and ML

We will now show that ML estimation can be seen as another approximation of tLS and, importantly, one that is generally expected to do better than both uwLS and wLS. The connection between LS and ML estimation has been studied for the case of independent and Gaussian-distributed data with equal variance [10], but in an effort to understand the differences in estimates more generally, we here employ a heuristic approach with wide applicability.

For any data set \( \left\{ t \right\}_{N} \) and model with parameter set \( \left\{ \tau \right\}_{M} \), we seek to compare tLS fitting to ML estimation. As the tLS scheme is based on binned data sets, we opt to compare it to equally binned cgML. The zero-derivative condition for finding the tLS parameter estimates \( \left\{ {\tau^{\text{tLS}} } \right\}_{M} \) from Eq. (5.1) is

$$ 0 = \frac{{\partial R^{{{\text{tLS}}}} }}{{\partial \tau _{m} }}\left( {\left\{ {\tau ^{{{\text{tLS}}}} } \right\}_{M} } \right) \approx \mathop \sum \limits_{b} \frac{{\Delta H_{b} \left( {\left\{ {\tau ^{{{\text{tLS}}}} } \right\}_{M} } \right)}}{{\left\langle {H_{b} } \right\rangle }}\frac{{\partial \Delta H_{b} \left( {\left\{ {\tau ^{{{\text{tLS}}}} } \right\}_{M} } \right)}}{{\partial \tau _{m} }},\;m = 1, \ldots ,{\text{~}}M. $$
(5.12)

Similarly, differentiating Eq. (5.11), and using the normalization of probabilities \( \left( {\mathop \sum \limits_{b} h_{b} = N} \right) \), the condition for finding the cgML estimate \( \left\{ {\tau^{\text{cgML}} } \right\}_{M} \) can be written as

$$ \begin{aligned} 0 & = - \frac{{\partial L^{\text{cgML}} }}{{\partial \tau_{m} }}\left( {\left\{ {\tau^{\text{cgML}} } \right\}_{M} } \right) \\ & = \mathop \sum \limits_{b} \frac{{\Delta H_{b} \left( {\left\{ {\tau^{\text{cgML}} } \right\}_{M} } \right)}}{{h_{b} \left( {\left\{ {\tau^{\text{cgML}} } \right\}_{M} } \right)}}\frac{{\partial \Delta H_{b} \left( {\left\{ {\tau^{\text{cgML}} } \right\}_{M} } \right)}}{{\partial \tau_{m} }},\;m = 1, \ldots , M. \\ \end{aligned} $$
(5.13)

Interestingly, though the functions that are minimized during tLS (Eq. 5.1) and cgML (Eq. 5.11) estimation are quite different, their minima are located in close proximity. From above, it is clear that the cgML minimization condition (Eq. 5.13) can be seen as an approximation to the tLS minimization condition (Eq. 5.12) with \( \left\langle {H_{b} } \right\rangle \approx h_{b} \left( {\left\{ {\tau ^{{{\text{cgML}}}} } \right\}_{M} } \right) \).

The cgML approximation \( \left( {\left\langle {H_{b} } \right\rangle \approx h_{b} \left( {\left\{ {\tau ^{{{\text{cgML}}}} } \right\}_{M} } \right)} \right) \) should be compared to the wLS approximation \( \left( {\left\langle {H_{b} } \right\rangle \approx H_{b} } \right) \). The wLS approximation includes only the data of each bin when estimating the variance in each bin. The cgML approximation takes into account the data in all bins, since \( \left\{ {\tau^{\text{cgML}} } \right\}_{M} \) is estimated from the whole data set by definition. As increasing the number of measurements generally reduces both the variance and systematic bias of estimates, we typically expect the cgML approach to outperform the wLS approach. It should be noted that the ML approach is not equivalent to setting \( \sigma_{b} \approx \sqrt {h_{b} \left( {\left\{ {\tau^{\text{cgML}} } \right\}_{M} } \right)} \) already in Eq. (5.1), as we would then need to know the optimal parameters before we have minimized the residue to find them. ML estimation elegantly bypasses this problem by enforcing the same approximation, not on the function to be minimized but directly on the condition defining the minimum (Eq. 5.13).

Having argued that we should generally expect (cg)ML to outperform wLS, we explicitly compare their performance, together with that of uwLS, on the examples used above.

4 Comparing LS and ML Through Simulations

Having established that uwLS, wLS, and cgML can all be seen as tLS approximations of various severity, we here numerically explore the consequences of these approximations. By generating data with a known distribution, we can quantify the success of the different approaches at estimating known parameter values. We do not discuss the numerical minimization schemes used when analytics fail, beyond stating that they are implemented in Mathematica™, using a simulated-annealing algorithm [22] to minimize the risk of finding a local rather than a global minimum (see the inset in Fig. 5.1b for an example of a local (red arrow) and a global (yellow arrow) minimum). There are many powerful software packages available with the required numerical optimization capabilities.

Without a sharp cutoff time for the measurements, we always expect many long-time bins to be empty in the tail end of the PDF. A zero count in any bin is catastrophic for wLS, as it gives a zero estimate for the standard deviation and so introduces infinite terms in Eq. (5.1). In an attempt to circumvent such issues, various re-binning procedures or reassignments of weights can be performed. Though such approaches avoid infinite terms in Eq. (5.1), they do change the details of the estimation method depending on the observed data, and so risk introducing a strong bias. For simplicity, we will here only consider the interval between the lowest and highest measured data points, and for wLS we choose the minimum constant bin size that leaves no empty bins in the intervening interval.

4.1 Method Comparison for an Exponential Process

Though trivial, the single-exponential process makes a natural starting point. Using Eq. (5.10), we can calculate the predicted bin counts \( h_{b} \left( {\tau_{\text{off}} } \right) \) from the PDF. It should be noted that we could in principle estimate both \( N \) and \( \tau_{\text{off}} \) by optimizing with respect to both in any LS or ML approach. Though this is often done, it is not advisable, as it will increase the MSE compared to an approach that heeds the fact that \( N \) is known and precisely dictates the translation from probability to histogram counts in Eq. (5.10).
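For the pure exponential without a cutoff, the ML estimate even has a well-known closed form: the sample mean. A short sketch (the bin-count expression below is the standard exponential-CDF difference, with \( N \) held fixed as argued above; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=1.0, size=100)  # synthetic unbinding times
N = data.size

# ML estimate for an exponential: minimizing -sum(ln p(t_n)) over tau_off
# gives the sample mean of the recorded times.
tau_ml = data.mean()

def h_b(tau, edges):
    # Predicted bin counts: N times the probability mass in each bin,
    # computed from the exponential CDF 1 - exp(-t/tau).
    return N * (np.exp(-edges[:-1] / tau) - np.exp(-edges[1:] / tau))

edges = np.linspace(0.0, data.max(), 11)
print(tau_ml, h_b(tau_ml, edges).sum())  # sum lies slightly below N, since
                                         # mass beyond the last edge is excluded
```

Note that \( N \) enters only as a known multiplicative factor; it is never treated as a fit parameter.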

In Fig. 5.5, we show the results of using uwLS, wLS, and ML estimation on 10,000 exponentially distributed data sets of 100 measurements each (\( \hat{\tau }_{\text{off}} = 1 \) s). Even after eliminating the zero bins for wLS (see above), the wLS estimate remains biased due to the unavoidable presence of low-count bins [11, 19,20,21, 23, 27]. This bias has been shown to be inversely proportional to the average occupancy of the bins [11]. The fact that uwLS estimation introduces a much smaller—if not vanishing—bias compared to wLS estimation might seem strange, given that the latter estimates the standard deviations in bins based on the data, while the former ignores the data and assumes them all equal. The explanation can likely be found in the fact that, though the weighted approach clearly employs a better approximation for bins with many counts, the relative errors in low-count bins can be very large, outstripping the error made when assuming the variance of counts to be equal in all bins. Among the three approaches, ML is clearly preferable, as both its bias and standard deviation are the smallest.

Fig. 5.5
figure 5

Distribution of unbinding-time estimates from 10,000 exponentially distributed data sets containing 100 samples each. There is a clear bias for wLS estimation (−0.04 s), while little bias is apparent for uwLS (0.009 s) and ML (0.0002 s) estimation. The standard deviation of ML estimation (0.10 s) is less than for wLS estimation (0.13 s), which in turn is less than for uwLS estimation (0.14 s). Notwithstanding the larger absolute bias, the \( \sqrt {\text{MSE}} \) for wLS estimation (0.13 s) is smaller than that for uwLS estimation (0.14 s), while ML estimation outperforms both other methods (0.10 s)

4.2 Method Comparison for an Exponential Process with a Cutoff

Next, we consider a measurement that is limited by a maximum measurement time \( T_{\text{cut}} \). If this cutoff time is large compared to the average unbinding time, we effectively have no cutoff, which is the case covered in the previous section. If we instead have a cutoff time comparable to the average unbinding time, there is information in the number of unbinding events that exceeded the maximal duration of the measurements. With a measurement cutoff time, the unbinding times are still exponentially distributed, but the number of experimental observations \( N = N_{\text{rec}} + N_{\text{cut}} \) has to be split into the \( N_{\text{rec}} \) events where the time was recorded, and the \( N_{\text{cut}} \) events for which we know only that they lasted longer than \( T_{\text{cut}} \). For both wLS and uwLS, we explicitly fit only the \( N_{\text{rec}} \) measurements falling within the observation window, while for ML estimation, we also include the information regarding the cut events, according to Eq. (5.8).
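For the single-exponential case, including the cut events in the likelihood again yields a closed form; this is the standard result for right-censored exponential data, sketched here in Python (names ours):

```python
import numpy as np

rng = np.random.default_rng(3)
T_cut = 1.0                                   # maximum measurement time
all_times = rng.exponential(scale=1.0, size=100)

recorded = all_times[all_times <= T_cut]      # the N_rec recorded unbinding times
N_cut = int(np.sum(all_times > T_cut))        # events known only to exceed T_cut

# The negative log-likelihood is
#   N_rec*ln(tau) + sum(t_n)/tau + N_cut*T_cut/tau,
# where the last term comes from the survival probability exp(-T_cut/tau)
# of each cut event. Setting the derivative to zero gives:
tau_ml = (recorded.sum() + N_cut * T_cut) / recorded.size
print(tau_ml)
```

Note how the cut events contribute their total censored time \( N_{\text{cut}} T_{\text{cut}} \) to the numerator but do not count toward the denominator, which is exactly the extra information that wLS and uwLS discard.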

Though we lose data, introducing a short cutoff time has the benefit of removing bins that are likely to have zero counts, and thus we decrease the need to re-bin data for wLS estimation. For small data sets (Fig. 5.6a, b), the counts in each bin will still have large (relative) fluctuations, and it is not surprising that we see a substantial error in wLS estimation. This error decreases as the cutoff is lowered and progressively fewer low-count bins are included (cf. Fig. 5.6a with b), even though a higher fraction of measurements falls outside the observation window. For a cutoff time close to the characteristic unbinding time, uwLS and ML estimation are comparable, as the variance in bin counts is roughly constant among the bins below the cutoff time. This shows a scenario where uwLS outperforms wLS, though ML estimation consistently remains the better alternative.

Fig. 5.6
figure 6

Parameter estimation for 10,000 exponentially distributed data sets with a cutoff. a For sets with 100 measurements and \( T_{\text{cut}} = 1 \) s, we see a clear bias in wLS estimation, while uwLS estimation has a somewhat larger standard deviation than ML estimation. b For a lower \( T_{\text{cut}} = 0.5 \) s, the bias for wLS estimation decreases slightly, while uwLS approaches ML estimation. c Increasing the size of the data sets to 10,000 measurements and considering a moderate cutoff time, the difference between wLS estimation and ML estimation diminishes, and both methods marginally outperform uwLS estimation. d For large data sets and a low cutoff time, all methods converge

As we increase the size of the data sets by a factor of 100 (Fig. 5.6c, d), we expect the relative fluctuations around the predicted bin counts to decrease, bringing wLS estimation closer to ML estimation. This effect can be clearly seen in Fig. 5.6c, d. It is interesting to note that for these large data sets, the extra information regarding the cut measurements included in the ML estimation had little effect on the fit, as all fits roughly coincide in Fig. 5.6c, d.

4.3 Method Comparison for a Double-Exponential Process

For data distributed according to the double-exponential PDF of Eq. (5.9), we need to fit two characteristic times (\( \hat{\tau }_{1} \,{\text{and}}\,\hat{\tau }_{2} \)), together with the fraction of events belonging to each (\( \hat{P}_{1} ,\hat{P}_{2} = 1 - \hat{P}_{1} \)). In Fig. 5.7, we show the results of 10,000 fits to data sets of size 10,000, for a process with moderately separated characteristic times (\( \hat{\tau }_{1} = 1 \) s, \( \hat{\tau }_{2} = 3 \) s) and for three different population fractions (\( \hat{P}_{1} = 0.1 \), Fig. 5.7a–c; \( \hat{P}_{1} = 0.5 \), Fig. 5.7d–f; \( \hat{P}_{1} = 0.9 \), Fig. 5.7g–i). In each case, we report the \( \sqrt {\text{MSE}} \)/s within parentheses in the legend.
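Simulated data of this kind can be generated by first assigning each event to the fast or slow pathway and then drawing its time from the corresponding exponential; a minimal sketch (names and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(11)
tau1, tau2, P1 = 1.0, 3.0, 0.5   # true characteristic times and fast fraction
N = 10_000

# Label each event as fast with probability P1, then draw times from both
# pathways and keep the draw matching each event's label.
fast = rng.random(N) < P1
times = np.where(fast,
                 rng.exponential(tau1, size=N),
                 rng.exponential(tau2, size=N))
print(times.mean())  # near the mixture mean P1*tau1 + (1-P1)*tau2 = 2 s
```

The resulting sample follows the double-exponential PDF of Eq. (5.9) and can be handed to any of the three estimation methods.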

Fig. 5.7
figure 7

Parameter estimation over 10,000 double-exponentially distributed data sets of size 10,000. Each column corresponds to parameter estimate distributions for a particular value of \( P_{1} \), and each row corresponds to a particular model parameter. a–c \( P_{1} = 0.1 \). d–f \( P_{1} = 0.5 \). g–i \( P_{1} = 0.9 \). In each case, we report the \( \sqrt {\text{MSE}} \)/s within parentheses in the legend. In all considered situations, ML estimation is clearly the preferable choice, as it has the lowest \( \sqrt {\text{MSE}} \)

The error in the short-timescale estimate (\( \tau_{1} \)) is dominated by the variance around the average for all methods, and all methods perform better the larger the fraction of events corresponding to the shorter timescale is (Fig. 5.7a, d, and g). The error in the long-timescale estimate \( \left( {\tau_{2} } \right) \) is also dominated by the variance, which is particularly large in uwLS estimation (Fig. 5.7b, e, and h). This can likely be traced back to the fact that the constant-variance assumption of uwLS suppresses the relative influence of long timescales, introducing a relatively low penalty for variation there. The error in the estimation of the fraction of measurements belonging to the short timescale \( \left( {P_{1} } \right) \) is also dominated by the variance, and uwLS is particularly affected due to its poor accounting of the change in variance going from short to long timescales (Fig. 5.7c, f, and i). For all parameter values considered, cgML estimation again clearly outperforms the other methods, as expected from our theoretical developments.

5 Fitting Experimental Data

In the previous section, we have examined the performance of LS and ML estimation on well-specified data sets without experimental noise. Though a proper treatment of experimental noise is outside our present scope, it is still interesting to apply the three fitting methods on experimental data to see to what extent they agree. Considering experimental data will also give us the opportunity to comment on how to estimate the variance of parameter estimates through bootstrapping.

5.1 All Fits Different, but All Naively Plausible

Continuing with our RNA–protein unbinding example, we now analyze SM total internal reflection fluorescence microscopy (TIRFM) data. The experiments measure the unbinding time of double-stranded (ds) RNA from viral RNA-binding proteins involved in protecting the viral genome from the hosts’ RNA interference-based defenses [6]. The viral suppressors of RNA interference (VSR) proteins are immobilized on a glass surface, and the binding/unbinding of fluorescently tagged dsRNAs to the immobilized VSRs is followed (for more information on the biological aspects and the interpretation of the data, see [6]).

The unbinding-time data for VSRs binding 50-nucleotide dsRNA are fitted with the uwLS, wLS, and cgML methods in Fig. 5.8a–c. In this particular system, and presumably due to the existence of weak and very strong binding modes, it is common to have a population of VSRs that unbinds quickly, as well as a population that remains bound for the duration of the measurement. In the latter case, the apparent unbinding time will report on the photobleaching time of the fluorophores, as discussed previously. In such situations, the appropriate PDF is double exponential (Eq. 5.9), and the information regarding the number of molecules still bound and fluorescing at the end of the experiment \( \left( {N_{\text{cut}} } \right) \) can be incorporated in the ML estimation along the lines of Eq. (5.7):

Fig. 5.8
figure 8

a The measured distribution of unbinding times (red) together with the uwLS fit (blue). b The measured distribution of unbinding times (red) together with the wLS fit (blue). c The measured distribution of unbinding times (red) together with the cgML fit (blue). In ac, the average number of measurements predicted to fall outside the observation window for the optimal fit is given as an inset. This should be compared to \( N_{\text{cut}} = 1298 \) in the fitted data set. d Histogram of estimates for the short timescale generated over 10,000 bootstrapped data sets. e Histogram of estimates for the long timescale generated over 10,000 bootstrapped data sets. f Histogram of estimates of the fraction of unbinding times originating in the short timescale, generated from 10,000 bootstrapped data sets. The parameter distributions vary significantly between the estimation methods, even though all fits look plausible in ac

$$ \begin{aligned} L^{\text{ML}} \left( {\tau_{1} , \tau_{2} , P_{1} } \right) & = - \mathop \sum \limits_{n = 1}^{{N_{\text{rec}} }} \ln \left( {\frac{{P_{1} }}{{\tau_{1} }}{\text{e}}^{{ - t_{n} /\tau_{1} }} + \frac{{1 - P_{1} }}{{\tau_{2} }}{\text{e}}^{{ - t_{n} /\tau_{2} }} } \right) \\ & \quad - N_{\text{cut}} \ln \left( {P_{1} {\text{e}}^{{ - T_{\text{cut}} /\tau_{1} }} + \left( {1 - P_{1} } \right){\text{e}}^{{ - T_{\text{cut}} /\tau_{2} }} } \right). \\ \end{aligned} $$
(5.14)

The information regarding \( N_{\text{cut}} \) is ignored in standard uwLS and wLS approaches, where the data is binned and fitted based on Eq. (5.9) only within the window capturing the \( N_{\text{rec}} \) unbinding times.
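Equation (5.14) translates directly into code; a minimal Python sketch (the function and variable names are ours), which can then be handed to a numerical minimizer:

```python
import numpy as np

def L_ML(tau1, tau2, P1, t_rec, N_cut, T_cut):
    """Negative log-likelihood of Eq. (5.14) for a double-exponential
    process with N_cut events censored at T_cut."""
    # Double-exponential PDF evaluated at each recorded unbinding time.
    p = (P1 / tau1) * np.exp(-t_rec / tau1) \
        + ((1 - P1) / tau2) * np.exp(-t_rec / tau2)
    # Survival probability beyond T_cut, contributed by each cut event.
    survival = P1 * np.exp(-T_cut / tau1) + (1 - P1) * np.exp(-T_cut / tau2)
    return -np.sum(np.log(p)) - N_cut * np.log(survival)
```

In practice, this function would be minimized over \( \left( \tau_{1}, \tau_{2}, P_{1} \right) \) with bounds enforcing \( \tau_{1}, \tau_{2} > 0 \) and \( 0 < P_{1} < 1 \), using a global scheme such as the simulated annealing discussed earlier.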

As can be seen in Fig. 5.8a–c, the three methods considered give very different results, all naively appearing to describe the data well. Lacking an objective way to evaluate the goodness of fit across scenarios, we can only point to the fact that our general developments and our numerical investigation suggest that the ML approach gives the best estimate of the fit parameters.

The insets in Fig. 5.8a–c report the average number of measurements that the best fit predicts should fall outside the measurement window. This average should be compared to the \( N_{\text{cut}} = 1298 \) measurements that actually fell outside the observation window. From this, it is clear that the extra information regarding the cut data included in the ML estimation does increase its predictive capabilities here, in contrast to the fits in Fig. 5.6c, d, where no such improvement was visible.

5.2 Bootstrapping: Doing the Best We Can with Limited Resources

To determine the standard deviation of our parameter estimates, we would ideally like to establish their distribution by repeating the same experiment many times, much like we did in our earlier numerical comparison between estimation methods. A common practice is to report the standard deviation of fit estimates over a triplicate of identical experiments. However, such a small sample can result in large errors in the estimated standard deviation. Unfortunately, repeating the same experiment a sufficient number of times is often too time-consuming and costly, and we have to rely on other means.

If we could perform repeat experiments, we would in effect draw new unbinding times from the true PDF describing the unbinding kinetics. Instead of repeating the experiments by drawing from the true PDF, we here repeatedly draw from our best estimate of the true PDF: the original data set. This approach is called bootstrapping the data [5]. To generate each “new experiment,” we randomly draw \( N \) unbinding times from our original data set (also of size \( N \)), allowing for repeated draws of the same data instance (this is known as random sampling with replacement). We then fit our bootstrapped data set in the same manner as we fit our original data sets. By repeating this process many times, we build up the desired distributions of fit parameters. In Fig. 5.8d–f, the distributions of the double-exponential fit parameters are plotted, using uwLS, wLS, and cgML methods over 10,000 bootstrapped data sets.
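The bootstrap loop itself is short; as a sketch we use the single-exponential ML estimate (the sample mean) as a stand-in for the full double-exponential fit, purely to keep the example compact (data, names, and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=1.0, size=300)   # stand-in for the measured times
N = data.size

estimates = []
for _ in range(1000):
    # One bootstrapped "experiment": N draws with replacement
    # from the original data set.
    resample = rng.choice(data, size=N, replace=True)
    # Fit the bootstrapped set as we fit the original
    # (here: the single-exponential ML estimate, i.e. the mean).
    estimates.append(resample.mean())

# Bootstrap estimate of the standard deviation of the fitted parameter.
print(np.std(estimates))
```

For the double-exponential fits of Fig. 5.8d–f, the `resample.mean()` line would simply be replaced by the corresponding numerical minimization of the chosen residue or likelihood.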

Contrary to the situation with our simulated data sets, we here do not know the true values of the model parameters, and so cannot establish the bias or the MSE; we thus lack an objective metric by which to compare the different approaches. In light of this, it is important to stress that the consistently smallest standard deviation for uwLS is not a good argument for preferring this approach. Given the disparate results of the various methods, even though all fits naively look good (Fig. 5.8a–c), it is clear that at least two of the three methods can go astray in very non-obvious ways, and that caution is warranted. Our heuristic arguments and simulations suggest that ML estimation is generally preferable.

6 Conclusion

We have provided an introduction to ML estimation as a powerful alternative to conventional LS fitting methods. Focusing on exponential distributions as examples, we showed how the ML method provides a general way to estimate the model parameters from stochastic data, in principle without the need for binning. We also showed that uwLS, wLS, and ML can all be thought of as approximations to tLS, utilizing various estimates for the a priori unknown standard deviation of bin counts. The main upshots of both our heuristic argument and numerical investigation are:

  1. 1.

    wLS becomes unreliable as soon as there are bins with low counts, as should always be expected in the tail end of distributions without a severe experimental cutoff time.

  2. 2.

    uwLS often outperforms wLS for processes with a single characteristic time, but for processes with multiple characteristic times, it becomes unreliable as it fails to appropriately weigh the contribution of data on different timescales.

  3. 3.

    (cg)ML consistently outperforms both wLS and uwLS by estimating bin-count variations from the whole data set, rather than ignoring them (uwLS) or estimating them on a bin-to-bin basis (wLS).

The first two points significantly limit the general applicability of both the uwLS and wLS methods. The maximum-likelihood method, by contrast, is generally applicable: it needs no binning (and, if binned, is not sensitive to empty bins) and outperforms both uwLS and wLS in all examples discussed. Although we focused on exponentially distributed data, our conclusions are general and should apply irrespective of the particular distribution describing the data. These advantages, together with the adaptability of the approach, have convinced the authors that ML estimation is the preferable choice for dealing with SM data; we hope our presentation has gone some way toward convincing the reader of the same.