[This module compares the results of fitting a model to data when Least Squares, Pearson chi-square, Poisson and Negative Binomial likelihood statistics are optimized]
In a previous module, we explored an example of Least squares fitting the parameters of a mathematical SIR contagion model to data from a real influenza epidemic using the Monte Carlo parameter sampling method. The R script fit_iteration.R performs the Monte Carlo iterations, randomly sampling values of the reproduction number, R0, and the time of introduction of the virus to the population, t0, from uniform random distributions, calculates the Least Squares statistic, and plots the result to show which value of R0 and t0 minimizes the Least Squares statistic.
If you have n data points Y_i (where i=1,…,n), and model predictions for those data points, model_i (note that these predictions depend on the model parameters!), then the least squares statistic is calculated like this (let’s call that statistic “LS”):
In this case, our model_i estimates for each week are coming from our SIR model, and the Y_i are the number of cases we observed for that week.
The fit_iteration.R script produces the following plot:
Fitting with the Pearson chi-square goodness of fit statistic
In another module, we discussed the underlying assumptions of the Least Squares statistic… namely that the data points are independent, and the stochasticity underlying the random variation in the data points about the model prediction is Normal, with equal variance for each data point (“homoskedasticity“). In actuality, count data usually are not homoskedastic, particularly if their is a wide range in counts in the data, from small to large. In this particular data set, our counts per time bin range from 2 to 254. Thus, while Least Squares fitting is conceptually easy to understand, it probably isn’t the best choice for these particular data.
In this past module, we discussed generalized Least Squares fitting using the Pearson chi-squared statistic. The Pearson chi-squared statistic is only appropriate for count data, and adjusts the goodness of fit statistic to take into account the heteroskedasticity seen in count data. It is a “generalized” or “weighted” least squares statistic, and is calculated as follows:
It’s underlying premise is that the true probability distribution underlying the data stochasticity is Poisson (which approaches Normal when the counts are high enough). Weighted least squares statistics weight the statistic by the square of the uncertainty on each data point. For Poisson distributed data, the uncertainty is the square root of the expected mean.
The R script fit_iteration_pearson.R fits to the same influenza data as above, but instead of looking for the R0 and t0 that minimize the Least Squares statistic, it optimizes the Pearson chi-square statistic. The script produces the following plot:
You’ll notice that the best-fit values are quite different than what we got using the Least Squares statistic! That is because the Least Squares statistic was giving too much weight to bins that had few counts… these are “low-information” bins with high variation relative to the expected value, and should be weighted accordingly. The Pearson chi-square statistic isn’t perfect when the data are over-dispersed, but for count data it is far preferable to Least Squares fitting.
Optimization using the Poisson negative log-likelhood
The Pearson chi-squared statistic, while better than Least Squares for count data, is still only a good choice if there are enough counts in the data that the Poisson distribution approaches the Normal (generalized least squares statistics still have the assumption of Normally distributed stochasticity). This occurs when the expected number of counts is around 10-ish. In our influenza data, we have several bins with less than 10 counts.
So, we need a fit statistic that properly takes into account that our data are Poisson distributed (let’s ignore over-dispersion for the moment). This is achieved by optimizing the negative log Poisson likelihood statistic, described in this past module:
where the k_i is the observed number of counts in the i^th bin, and lambda_i is your model prediction for that bin.
The R script fit_iteration_poisson_likelihood.R fits to our influenza epidemic data, calculating the Poisson negative log likelihood at each iteration. The script produces the following plot:
Over-dispersed count data: the Negative Binomial negative log-likelihood
As mentioned in this past module, if your research question involves count data, pretty much always such data are over-dispersed, meaning that the stochastic variation in the data is much larger than would be expected from the Poisson distribution.
In this case, the best choice is the Negative Binomial maximum likelihood, which is a discrete probability distribution with an extra parameter, alpha, that is a measure of how over-dispersed the data are. If alpha=0, the data are Poisson distributed. If alpha gets large, there is a lot of over-dispersion in the data.
The R script fit_iteration_negbinom_likelihood.R fits to the influenza data, optimizing the Negative Binomial negative log-likelihood. The parameter alpha is now an additional nuisance parameter we have to fit for. The script produces the following plot:
Notice that 10,000 Monte Carlo iterations isn’t really sufficient to exactly pinpoint our best fit values and to get a good idea of the parabolic envelope below which we don’t see any points in the first three plots. This is because the more parameters you are fitting for, the more Monte Carlo iterations you will have to do to pinpoint the best-fit for the combination of all parameters.
Many of the problems we encounter in our research questions involve integer count data. In this module, we discussed that Least Squares probably isn’t the best choice for such data due to heteroskedasticity (however, you will see in the literature examples where people apply LS fits to count data anyway!). Inappropriate uses of the LS statistic should be caught in review, but often aren’t.
We discussed how a weighted least squares statistic, like Pearson chi-square, can help adjust for the heteroskedasticity problem in count data, and is a nice alternative as long as you have at least 10 counts per bin in each of your bins.
If the count data involve low-counts, a better choice is the Poisson negative log-likelihood (and at times you see such fits in the literature too), but count data are usually over-dispersed, in which case the best choice always is the Negative Binomial negative log-likelihood. In fact, in general, the Negative Binomial statistic is always applicable to independent count data, whereas the other three statistics we discuss here each have limitations in their applicability.
The only draw back of using the Negative Binomial likelihood is that it requires fitting for an extra parameter, the over-dispersion parameter, and the mathematical expression of the statistic looks complicated and involved, and can potentially scare the bejeezuz out of reviewers of your papers. Don’t let that stop you from using it though. Simply cite well-written papers like the following as precedent for using NB likelihood for count data in the life and social sciences, and perhaps consider not writing the explicit formula for the NB likelihood in your paper… just mention that you used it: Maximum Likelihood Estimation of the Negative Binomial Dispersion Parameter for Highly Overdispersed Data, with Applications to Infectious Diseases.
If you want to get your work published in a timely fashion, strive to use methods that are rigorous, but as simple as possible. If you have to use a more complicated method, describe it in your paper in plain and simple terms. In my long experience, this can avoid many problems with things getting hung up unnecessarily long times in review.