Fitting to noisy, exponential data
tl;dr
All biological measurements are going to have some amount of noise. When you perform a fit to data, you are making an implicit assumption about the form that the noise takes. For the two most common fit strategies for exponential data (fitting an exponential directly to the data, or fitting a line to the log transform of the data), the underlying assumption is that the noise is independent of the signal or that the noise is proportional to the signal, respectively.
Introduction
Biological (especially microbiological) data often involves exponentials. Take balanced growth for instance: 1 cell divides to become 2 cells to become 4 cells to become 8 cells etc. Given this, a natural and biologically relevant question to ask is what the growth rate is (or, in other words, how long it takes for one cell to become two cells). Because the underlying process is an exponential process, this requires fitting an exponential to the data and then figuring out what the constant in the exponent is.
When fitting a model to exponential data, there are two commonly suggested and reasonable options:
- Use logarithmic transformation to convert the exponential data to linear data and then fit a straight line.
- Use nonlinear regression to fit an exponential curve to the data.
Notion AI’s first two suggestions when I prompted it with “how to fit a model to exponential data”
At first glance, it may seem like these two methods will give you the same result, and so people will generally fit a linear function to the logarithmic transform of the data because it is easier to implement. In reality, though, these two methods are subtly different because of how they handle noise, and it is important to know which one you should use.
All measurements of biological systems will contain some amount of noise, a combination of measurement noise and noisiness in the underlying biological system. Depending on what form the noise takes, it makes more sense to use one approach vs. another to fit a model to your data.
When noise in the measurement is independent of the measurement
Let us suppose that we make some measurement over time, $f(t) = A \cdot e ^ {B \cdot t} + \varepsilon$, that is an additive combination of the underlying exponential biological signal $y = A \cdot e ^ {B \cdot t}$ and some noise ε inherent to the measurement. For the purposes of this post, we are going to assume that we’ve got background subtraction covered and there is no background 😛. We will also at first assume that the noise in the measurement is independent of the measurement. In more colloquial terms, the noise does not depend on the magnitude of the measurement. We expect this to be the case when e.g. segmenting microscopy images of exponentially growing cells, since the magnitude of the noise will be the same whether the cells are 1 µm or 2 µm long. With methods more sensitive (but much more difficult) than microscopy, we can see that the growth of individual B. subtilis cells is well described by an exponential function.
Exponentially growing B. subtilis cells as captured by phase microscopy
Let’s generate some noisy exponential data and plot the log transform of the noisy data along with the residuals and see what we get!
Left: log transform of synthetic exponential data with constant error
Right: residuals of the data on the left
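In case you want to play along at home, here is a minimal sketch of how synthetic data like this could be generated and plotted. The parameter values and noise level below are made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Underlying exponential signal y = A * exp(B * t) (illustrative parameters)
A, B = 100.0, 0.5
t = np.linspace(0, 10, 100)
y = A * np.exp(B * t)

# Additive noise whose magnitude is independent of the signal
noise_sd = 20.0
f = y + rng.normal(0.0, noise_sd, size=t.shape)

# Log transform and the residuals relative to the noise-free signal
log_f = np.log(f)
residuals = log_f - np.log(y)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
ax_left.plot(t, log_f, ".")
ax_left.set(xlabel="time", ylabel="log(measurement)")
ax_right.plot(t, residuals, ".")
ax_right.set(xlabel="time", ylabel="residual")
fig.tight_layout()
plt.show()
```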
Hopefully you can appreciate from the synthetic data above that after we log transform the data, the noise is magnified the smaller the measurement is! What this means is that when we fit a line to the log transform of the data, we will overfit to the noise at the low end. (Feel free to skip ahead to the next section if you don’t care about a mathematical explanation of why this is the case.)
When we log transform the data, the residuals (the difference between the observed value and the model estimate) become:
$\begin{split} \log(f(t)) - \log(y) &= \log(y + \varepsilon) - \log(y) \\ &= \log\left(\frac{y + \varepsilon}{y}\right) \\ &= \log\left(1 + \frac{\varepsilon}{y}\right) \\ &\approx \frac{\varepsilon}{y} \end{split}$
for small ε
When we Taylor expand the equation $\log(1 + \frac{\varepsilon}{y})$ around $\varepsilon = 0$:
$\begin{split} \log(1 + \frac{\varepsilon}{y}) &\approx g\left(h\left(0\right)\right) + \frac{1}{1!} g'\left(h\left(0\right)\right) \cdot h'(0) \cdot \varepsilon\\&= 0 + \frac{1}{1 + \frac{0}{y}} \cdot \frac{1}{y} \cdot \varepsilon \\&= \frac{\varepsilon}{y}\end{split}$
where $g(h) = \log(h)$ and $h(\varepsilon) = 1 + \frac{\varepsilon}{y}$.
If this is still confusing to you and you want to read up more, I refer you to these links on Taylor series and the chain rule.
How do I fix this problem?
There are two solutions: either use a non-linear regression to fit an exponential curve to the data (per Notion AI’s suggestion using a function like scipy.optimize.curve_fit), or adjust the weights for the polynomial fit to correct for the non-uniform variance in $\log f(t)$.
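For the first option, a minimal sketch using scipy.optimize.curve_fit might look like the following (reusing the synthetic t and f from the snippet above; the initial guess is just a rough starting point I picked for illustration):

```python
from scipy.optimize import curve_fit

def exponential(t, A, B):
    """Exponential model y = A * exp(B * t)."""
    return A * np.exp(B * t)

# Rough initial guess: amplitude from the first point, rate from the endpoints
p0 = (f[0], np.log(f[-1] / f[0]) / (t[-1] - t[0]))

# Nonlinear least squares in linear space; this implicitly assumes the noise
# has the same variance at every time point.
popt, pcov = curve_fit(exponential, t, f, p0=p0)
A_fit, B_fit = popt
```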
If you are using the np.polynomial.polynomial.polyfit function from numpy to do a linear fit, then you want the weights to be $w_i = f_i$ so that the products $w_i \cdot \log(f_i)$ all have the same variance. (Note that not all functions that do linear regression allow you to change the weights of the different data points.)
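And a sketch of the second option, a weighted linear fit to the log-transformed data (again reusing t and f from above):

```python
from numpy.polynomial import polynomial as P

# Weighted straight-line fit to log(f); weighting by f_i makes the weighted
# residuals w_i * (log(f_i) - fit_i) roughly homoscedastic.
intercept, slope = P.polyfit(t, np.log(f), deg=1, w=f)
A_fit = np.exp(intercept)  # the intercept estimates log(A)
B_fit = slope              # the slope estimates B
```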
Does this mean that this is how I should always fit my exponential data?
Well, no. There is no one-size-fits-all in biology. Sometimes the noise in the data is dependent on and proportional to the measurement. For example, when taking growth curves using an OD600 plate reader, the noise is primarily caused by cells occasionally clumping up and the cell clumps passing in front of the optical path (thank you to Adam Strandberg for pointing this out!). The number and size of the cell clumps are proportional to the total signal (i.e. how many cells there are in the culture). In this case, we expect the measurement to have a form like $f(t) = y + \varepsilon \cdot y$ where $y = A \cdot e ^ {B \cdot t}$, and the residuals instead look something like
$\begin{split} \log(f(t)) - \log(y) &= \log(y + \varepsilon \cdot y) - \log(y) \\ &= \log \left( \frac{y + \varepsilon \cdot y}{y}\right) \\ &= \log(1 + \varepsilon) \end{split}$
In this case, because the residuals after the log transform are independent of the measurement, the correct thing to do is to fit an unweighted line to the log transform of the data.
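As a sketch (continuing with the t, y, rng, and P defined in the snippets above), generating data with this kind of multiplicative noise and fitting it might look like:

```python
# Multiplicative noise: the noise scales with the signal itself
eps = rng.normal(0.0, 0.05, size=t.shape)
f_prop = y * (1.0 + eps)

# After the log transform the noise is roughly uniform across the data,
# so a plain unweighted straight-line fit is appropriate here.
intercept, slope = P.polyfit(t, np.log(f_prop), deg=1)
A_fit, B_fit = np.exp(intercept), slope
```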
Of course, it is possible that the noise in the measurement will be neither independent of the signal nor linearly dependent on the signal and will instead be some other, more complicated third option! Nevertheless, it’s good to have an underlying model of what is causing the noise in your system and to adjust your fit strategy accordingly. Because most regression analyses assume homoscedasticity (homogeneity of variance), you want your noise to be independent of the data in whatever space (linear or logarithmic) you are fitting in.