Optimizing model parameters to data (aka inverse problems)

[This presentation discusses methods commonly used to optimize the parameters of a mathematical model for population or disease dynamics to pertinent data. Parameter optimization of such models is complicated by the fact that they usually have no analytic solution, but instead must be solved numerically. Choice of an appropriate "goodness-of-fit" statistic will be discussed, as will the benefits and drawbacks of various fitting methods such as gradient descent, Markov Chain Monte Carlo, and Latin hypercube and random sampling. An example of the application of some of these methods, using simulated data from a simple model for the incidence of a disease in a human population, will be presented.]


Introduction

Many of the parameters for models of the spread of disease or biological population dynamics are well known from prior observational studies. For diseases like dengue and malaria, for instance, the vector population dynamics have been extensively studied in the laboratory and in the field. For diseases like measles and human seasonal influenza, the latent and infectious periods are fairly well known from observational studies. For influenza, however, it is apparent that seasonal forcing plays a role in transmission and/or susceptibility, but it is currently unknown how much of this is due to climate and how much is due to seasonal patterns in human contact rates. A model for the spread of influenza could include parameters that account for both of these effects, and these parameters can be estimated by fitting the model to seasonal influenza data.

Often there are model parameters that must be estimated because prior knowledge is sketchy or non-existent. An example of such a parameter is the vector/host biting rate in models for diseases like dengue, malaria, or leishmaniasis. Similarly, for newly emerged disease outbreaks (for instance, Ebola), the reproduction number is not well known.

In order to accurately forecast the progression of an epidemic and/or examine the efficacy of control strategies, it is necessary to have a model with parameters that have been optimized to describe current data. In the mathematical life sciences, a hugely complicating factor in estimating model parameters is that the mathematical model itself often has no analytic solution, but rather must be solved numerically. Another problem is that our mathematical models are typically only an approximation to the actual dynamics. These computational complexities and model pathologies mean that the common regression methods built into platforms like Matlab or R simply will not work for these applications. In this presentation, I’ll thus discuss the applicability of various methods for model parameter estimation in the mathematical life sciences.

Some simulated data for a hypothetical epidemic in humans

For this presentation, we’ll use as example “data” a simulated pandemic influenza-like illness in humans, with true model:

$$\frac{dS}{dt} = -\beta(t)\frac{SI}{N}, \qquad \frac{dI}{dt} = \beta(t)\frac{SI}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$

where $N = S + I + R$ is the population size,

with

$$\beta(t) = \beta_0\left[1 + \epsilon\sin\left(\frac{2\pi t}{365} + \phi\right)\right]$$

 

The R script sir_harm_generate.R uses the above model with parameters 1/gamma = 3 days, beta_0 = 0.5, epsilon = 0.3, and phi = 0. One infected person is assumed to be introduced into an entirely susceptible population of 10 million on day 60, and only 1 in 5000 cases is assumed to be counted (this is approximately the fraction of flu cases that are typically detected by surveillance networks in the US). The script calculates from the true model the number of new cases we’d expect to see each day. To make this simulate real data, we need to randomly “smear” our expected number of counts to mimic the stochastic variation always seen in real epidemic data. In this case, I assume that the randomness is Poisson distributed.
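To make the procedure concrete, here is a minimal sketch in R of the kind of calculation the script performs (the function and variable names, and details like the simulation length, are my own assumptions, not necessarily what sir_harm_generate.R does):

```r
# Sketch of simulating epidemic "data" from a harmonically forced SIR model.
# Names and details here are assumptions; see sir_harm_generate.R for the
# actual implementation.
library(deSolve)

sir_harmonic <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    beta_t <- beta0 * (1 + eps * sin(2 * pi * t / 365 + phi))  # seasonal forcing
    dS <- -beta_t * S * I / N
    dI <-  beta_t * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

N     <- 1e7                                   # population of 10 million
parms <- c(beta0 = 0.5, eps = 0.3, phi = 0, gamma = 1/3, N = N)
y0    <- c(S = N - 1, I = 1, R = 0)            # one infected, introduced day 60
times <- seq(60, 365, by = 1)

out <- as.data.frame(ode(y0, times, sir_harmonic, parms))

# Expected newly *observed* cases per day: the daily drop in S, scaled by
# the 1-in-5000 surveillance detection fraction
expected <- -diff(out$S) / 5000

# Poisson "smearing" to mimic the stochasticity of real surveillance data
set.seed(42)
simulated <- rpois(length(expected), lambda = pmax(expected, 0))
```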

The script produces the following plot:

[Figure: simulated daily incidence for the harmonically forced SIR model]

 

The script outputs the simulated data to the file sir_harmonic_simulated.csv.
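The simulated data can then be read back in for fitting in the usual way (the column layout is an assumption here):

```r
# Load the simulated incidence data (column layout is an assumption)
flu_data <- read.csv("sir_harmonic_simulated.csv")
head(flu_data)
```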

 

Another example we will use is a simulation of a hypothetical vector-borne disease in humans, where, through monitoring of community mosquito traps, we know that the temporal distribution of the number of mosquitoes per unit area over two years is well described by

$$M(t) = A\left[1 + \sin\left(\frac{2\pi t}{365}\right)\right]$$

which, with A=100, looks like this:

[Figure: mosquito abundance M(t) over two years, with A = 100]

The model we will use to describe the disease in the human population (ignoring human vital dynamics) is:

$$\begin{aligned}
\frac{dS}{dt} &= -\beta\,\frac{M(t)}{N}\,S \\
\frac{dE}{dt} &= \beta\,\frac{M(t)}{N}\,S - \kappa E \\
\frac{dI}{dt} &= \kappa E - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}$$

We’ll assume an initially naive population of one million people, with 1/kappa = 10 days, 1/gamma = 10 days, and beta = 15 (we assume here that the disease is particularly transmissible from mosquitoes to the naive population). Again, we calculate the expected number of new cases per month, and randomly “smear” them to simulate the stochastic variation always seen in real epidemic data:

[Figure: simulated monthly incidence of the vector-borne disease]
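A minimal sketch of this simulation in R, under the model above, might look like the following (the helper names are mine, and I assume the force of infection simply tracks the trapped mosquito abundance; vector_borne_utils.R and vector_borne.R may differ in the details):

```r
# Sketch of simulating a vector-borne epidemic driven by seasonal mosquito
# abundance. Names and details are assumptions; see vector_borne.R for the
# actual implementation.
library(deSolve)

mosquitoes <- function(t, A = 100) A * (1 + sin(2 * pi * t / 365))

seir_vector <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    foi <- beta * mosquitoes(t) / N   # force of infection from the vectors
    dS <- -foi * S
    dE <-  foi * S - kappa * E
    dI <-  kappa * E - gamma * I
    dR <-  gamma * I
    list(c(dS, dE, dI, dR))
  })
}

N     <- 1e6
parms <- c(beta = 15, kappa = 1/10, gamma = 1/10, N = N)
y0    <- c(S = N, E = 0, I = 0, R = 0)
times <- seq(0, 730, by = 1)          # two years, daily time steps

out <- as.data.frame(ode(y0, times, seir_vector, parms))

# Aggregate the expected daily new cases into 24 roughly equal months,
# then Poisson-smear them
daily_new <- -diff(out$S)
month     <- ceiling(seq_along(daily_new) * 24 / length(daily_new))
monthly   <- tapply(daily_new, month, sum)
set.seed(42)
simulated <- rpois(length(monthly), lambda = pmax(monthly, 0))
```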

The simulated observed data is in the file simulated_vector_borne_data.csv.  The R files that made these plots are vector_borne_utils.R and vector_borne.R.

Things to keep in mind when embarking on model parameter optimization

Successfully obtaining a model that is parameterized to optimally describe disease data first and foremost requires choosing a model that appropriately describes the dynamics underlying the patterns in the data. This requires, at the very least, a knowledge of the biology of any vector population involved, and the epidemiology of the disease (i.e., the mechanism of spread, whether recovery confers immunity, etc.).

We need to obtain the most parsimonious model that incorporates the most relevant dynamics. If a model has too many parameters, even the most sophisticated parameter optimization methods will fail!

The next step, after choosing a model, is to choose a “goodness-of-fit” statistic that will give an indication of how well a particular model parameterization is describing the data.  Once we have that goodness-of-fit statistic, there are a variety of methods that can be used to find the parameters that optimize that statistic.

Note that the parameter estimates returned by model optimization are just that: estimates. Because of the randomness inherent in the data, there is stochasticity inherent in the parameter estimates. Parameter estimation thus involves not only finding the best-fit parameters, but also quantifying the uncertainty on those estimates (often expressed as a 95% confidence interval).

Choice of an appropriate “goodness-of-fit” statistic

Optimization of model parameters involves first choosing a “goodness-of-fit” statistic (aka: “figure of merit” statistic) that depends on the data and the model calculated under some hypothesis of the parameters.  The model parameter combination that minimizes this statistic is the “best-fit” combination.

Note that because of inherent stochasticity (randomness) in population or disease count data, even a best-fit model will usually not perfectly describe the data.  Stochasticity in the data can be due to a wide variety of things, like randomness in population movements or contacts, randomness in disease case identification (not all cases of many diseases are caught by surveillance systems), etc.

There are many different goodness-of-fit statistics that can be used, depending on the nature of the data. Choosing an appropriate goodness-of-fit statistic involves being aware of what probability distribution likely underlies the stochasticity in the data. Commonly used statistics in the mathematical life sciences include least squares and likelihood-based statistics, for example a Poisson negative log-likelihood when the data are counts, as in our simulated examples (this isn’t an exhaustive list).
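As a concrete illustration, here is what least-squares and Poisson negative log-likelihood statistics look like in R; `observed` and `expected` are placeholder names for the observed counts and the model prediction under a given parameter hypothesis:

```r
# Two common goodness-of-fit statistics; lower is better in both cases.
# 'observed' and 'expected' are placeholder names.
least_squares <- function(observed, expected) {
  sum((observed - expected)^2)
}

poisson_negloglik <- function(observed, expected) {
  -sum(dpois(observed, lambda = expected, log = TRUE))
}
```

The Poisson negative log-likelihood is the natural choice for our simulated examples, since the stochasticity in those data was generated as Poisson.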

Parameter optimization methods

A (not exhaustive) list of commonly used parameter optimization methods is as follows:

  • Gradient descent: best used when a closed-form expression exists for the gradient of the function being optimized. The algorithm is not readily parallelized.
  • Simplex (e.g., Nelder-Mead): no assessment of the function gradient is necessary. The algorithm is not readily parallelized.
  • Markov Chain Monte Carlo: no assessment of the function gradient is necessary, and the method provides an assessment of the parameter posterior probability density. The algorithm is not readily parallelized.
  • Latin hypercube and random sampling: no assessment of the function gradient is necessary, and the methods provide an assessment of the parameter posterior probability density. The algorithm is easily parallelized to use high-performance distributed computing resources, like XSEDE (a sketch of this approach, applied to our simulated influenza data, follows below).
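Here is a sketch of the random sampling approach applied to the simulated influenza data; `solve_model()` is a hypothetical helper that numerically solves the harmonically forced SIR model and returns the expected daily counts for a given (beta_0, epsilon) hypothesis, and `observed` stands in for the vector of simulated counts:

```r
# Random (Monte Carlo) parameter sampling: draw many parameter hypotheses,
# score each with the Poisson negative log-likelihood, and keep the best.
# solve_model() and observed are hypothetical stand-ins.
set.seed(1)
nsamp   <- 10000
results <- data.frame(beta0 = runif(nsamp, 0.2, 1.0),
                      eps   = runif(nsamp, 0.0, 1.0),
                      negll = NA_real_)

for (i in seq_len(nsamp)) {
  expected         <- solve_model(results$beta0[i], results$eps[i])
  results$negll[i] <- -sum(dpois(observed, lambda = expected, log = TRUE))
}

best <- results[which.min(results$negll), ]   # best-fit parameter estimates

# Approximate 95% CI for beta_0: parameter values whose negative
# log-likelihood lies within qchisq(0.95, 1)/2 ≈ 1.92 of the minimum
in_ci <- results$negll <= min(results$negll) + qchisq(0.95, df = 1) / 2
range(results$beta0[in_ci])
```

Because each sample is scored independently, the loop parallelizes trivially (for instance with the parallel package, or by farming batches of samples out to a cluster), which is what makes this approach attractive on distributed resources like XSEDE.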
