
The script reads in the data files summary_pandemic_data.txt and sunspot_wolf_and_group_1700_to_2014.txt.

The R script produces the following plot, shown in the paper:


**I’m a statistician, and I also have a PhD in experimental particle physics. Research in experimental particle physics can involve complex models of observable physical processes, and fitting those models to experimental data is not an uncommon task in that field. As in the field of applied mathematics in the life and social sciences (AMLSS), the models being fit at times have no analytic solution, and must be solved numerically using specialized methods. When I entered the field of AMLSS back in 2009, I had a lot to learn about the various models used in this field and the common methodologies, but I already had a solid tool box of specialized skills for connecting mathematical models to data, and those skills have turned out to be remarkably useful in exploring a wide range of research questions in the life and social sciences that I find interesting. I also apply these skills in my consulting projects.**

**First off: what is the difference between statistical and mathematical modelling, anyway?**

The difference between statistical and mathematical models is oftentimes confusing to people. In this past module on this site, I discuss the distinction, using an analysis of seasonal and pandemic influenza as an example.

**Example of an analysis combining statistical and mathematical modelling: Mathematical and statistical modelling of the contagious spread of panic in a population**

During the 2014 Ebola outbreak, there were a total of five cases that were ultimately identified in America, compared to tens of thousands of cases in West Africa. Even though the “outbreak” in America was essentially non-existent, once the first case was identified in the US in autumn 2014, the media shifted into 24/7 coverage of the supposed dire threat Ebola presented to Americans, complete with scary imagery.

In autumn 2014, I was teaching a course in the ASU AMLSS graduate program on statistical methods for fitting the parameters of mathematical models to data. Each year when I teach AML classes, I usually try to have a “class publication project” that encompasses the methodology I teach in the class. In this case, I thought it might be interesting to try to model the spread of Ebola-related panic in the US population, as expressed on social media, and explore the role news media might play in that.

The class did the analysis as a homework assignment, and we wrote the paper together, which was published in 2015. The paper received national media attention when it came out.

First: why was this analysis important? Well, it has been suggested in the past that people talking about a particular disease on social media might be used as a real-time means to assess the temporal and geospatial spread of the disease in the population, rather than relying on slower traditional surveillance methods, which can suffer from backlogs in laboratory testing. For instance, tracking influenza, or cholera:

However, up until the US Ebola “outbreak”, the problem was that it was impossible to say whether people were discussing a disease on social media merely because they were worried about it, rather than because they actually had it. During the Ebola outbreak, pretty much no one in the US actually had it, so everyone who was talking about it was doing so because they were concerned about it. This gave us the perfect opportunity to gauge what kind of temporal patterns we might see in social media chatter due simply to panic or concern about a disease!

The data we used in the study were the daily number of news stories about Ebola from large national news outlets. We also obtained Twitter data related to Ebola, and Google search data in the US with search terms related to Ebola, including “do I have Ebola?” from Google Trends. Here is what the data looked like:

We came up with a model that related the number of news videos, V, to the number of people, I, “infected” with the idea to tweet about Ebola or do an Ebola-related Google search:

The parameter beta is a measure of how many tweets (or Google searches) per person per unit time one news story would inspire, and gamma parameterizes the “boredom” effect, through which people eventually move to a “recovered and immune” class, upon which they never tweet again about Ebola no matter how many Ebola-related news stories they are exposed to. Using the statistical methodology taught in the AML course, the students fit the parameters of that model to data, and obtained the following model predictions, shown in red:

The blue lines on the plot represent a plain statistical model that simply regresses the Twitter and Google search data on the news media data, without taking into account the “boredom” effect. Can you see that the regression fits are systematically too high early on, and systematically too low later for all the plots, but the same is not true of our mathematical model? That tells us that our mathematical model that includes boredom really does do a better job of describing the dynamics of peoples’ Ebola-related social media behaviours!
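As an aside, the dynamics described above can be sketched in a few lines of R. The parameter values and the daily news-story counts below are entirely made up for illustration, and the paper’s actual model equations differ in detail:

```r
# Toy sketch of news-driven tweeting with a "boredom" effect.
# All numbers here are hypothetical, for illustration only.
V <- c(rep(50, 10), rep(5, 20))  # hypothetical daily Ebola news-story counts
beta <- 1e-6   # tweets inspired per person per news story per day (made up)
gamma <- 1/3   # rate of moving to the "bored" (recovered) class (made up)
N <- 1e6       # population size (made up)
S <- N         # people still susceptible to the urge to tweet
I <- 0         # people currently "infected" with the idea to tweet
tweets <- numeric(length(V))
for (t in seq_along(V)) {
  new_tweeters <- beta * V[t] * S    # news stories inspire new tweeters
  S <- S - new_tweeters
  I <- I + new_tweeters - gamma * I  # tweeters eventually get bored
  tweets[t] <- I                     # daily tweet volume proportional to I
}
```

In this toy version, tweet volume rises while the news coverage is heavy, then decays once the boredom effect takes over, even though some coverage continues.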

We found that each Ebola-related news story inspired on average thousands of tweets and Google searches. Also, on average, we found people were only interested enough to tweet or do a Google search for a few days after seeing a news story about Ebola before they became bored with the topic:

We couldn’t have done this analysis without both mathematical modelling and statistical methods; it was a nice “bringing together” of the methodologies to explore an interesting research question.

**Another example of an analysis that involved mathematical and statistical modelling methods: contagion in mass killings and school shootings**

In January 2014, there was a shooting at Purdue University, where one student entered a classroom and shot another student dead, then walked out and waited for police to arrest him.

At the time, it struck me that it was the third school shooting I had heard about in an approximately 10 day period. Even for the United States, which has a serious problem with firearm violence compared to other first world countries, this seemed like an unusual number to have in such a short period of time.

It led me to wonder if perhaps contagion was playing a role in these dynamics. Certainly, it had been noted in the past that suicide appears to be contagious, because (for example) in high schools where there is one suicide, it is statistically more likely that an ensuing cluster of suicides will follow. And the “copycat” effect in mass killings has long been suspected. I wondered if perhaps a mathematical model of contagion might be used to help *quantify* whether or not mass killings and school shootings are contagious. So, I talked with some colleagues:

And we decided to use a mathematical model known as a Hawkes point process “self-excitation” model to simulate the potential dynamics of contagion in mass killings. The idea behind the model is quite simple… there is a baseline probability (which may or may not depend on time) that a mass killing occurs by mere random chance (the dotted line, below). But if a mass killing does occur, contagion temporarily raises the probability that a similar event will occur in the near future. That probability decays exponentially:

How contagion would manifest itself in data is thus as unusual “bunching together in time” of events compared to what you would expect from just the baseline probability.

Here’s our (blurry) model:

The parameters of the model were Texcite, the average length of the excitation period, and Nsecondary, the average number of new mass killings inspired by one mass killing. N_0(t) was the baseline probability for mass killings to occur. We used statistical modelling methods to estimate N_0(t).
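A minimal sketch of the Hawkes intensity in R may make the mechanism concrete; the parameter values and event times below are made up for illustration (they are not our fitted values):

```r
# Sketch of a Hawkes "self-excitation" intensity.  All values below are
# hypothetical, for illustration only.
N0         <- 0.01  # baseline events per day (here taken as constant)
Texcite    <- 14    # average length of the excitation period, in days
Nsecondary <- 0.3   # average number of events inspired by one event
event_times <- c(100, 105, 300)  # hypothetical past event times, in days

# Event probability per unit time at time t: the baseline, plus an
# exponentially decaying bump contributed by each past event.
hawkes_intensity <- function(t) {
  past <- event_times[event_times < t]
  N0 + sum((Nsecondary / Texcite) * exp(-(t - past) / Texcite))
}
```

Shortly after a pair of events (say, t = 106) the intensity sits well above the baseline; long after them (say, t = 250) it has decayed back to essentially N0.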

We needed data in order to fit the parameters of our model. From USA Today we obtained data on mass killings (four or more people killed), and from the Brady Campaign to Prevent Gun Violence, we obtained data on school shootings, and data on mass shootings (three or more people shot, not necessarily killed). Mass shootings happen very frequently in the US!

We compared how well the Hawkes model fit the data compared to a model that didn’t include self-excitation. If contagion is evident, the former will fit the data significantly better.

The fit results were:

Both mass killings and school shootings appear to be significantly contagious! And the length of the contagion period is on average around two weeks for both.

Mass shootings in which three or more people were shot but fewer than four were killed, however, did not appear to be contagious.

Why? Well, mass shootings with low death counts happen on average more than once a week in the US. They happen so often that they rarely make it past the local news. In contrast, mass shootings with high death counts, and school shootings, usually get national and even international media attention. It may well be that widespread media attention is the vector for the contagion.

Again, this was an analysis that was made possible through the marriage of mathematical and statistical modelling methods.

**Statistical and mathematical modelling skills on the job market**

Quantitative and predictive analytics is a field that is growing very quickly. Statistical methods and data mining (“big data”) play a large role in predictive analytics, but the power of mathematical models is more and more being recognized as having advantages over statistical models alone, because mathematical models do not simply assume an “X causes Y” relationship, but instead can describe the complex dynamics of interacting systems. Having a tool box of skills that includes expertise in both mathematical and statistical modelling can lead to many interesting career opportunities, including consulting.


For many models, information about the parameters and/or initial conditions can be obtained from other studies. For instance, let’s examine the seasonal influenza SIR model we have used as an example in several other modules. Our data were influenza incidence data from an influenza epidemic in the Midwest, and we fit the transmission rate, beta (or alternatively, R0=beta/gamma), of an SIR model to these data, for example using the R script fit_midwest_negbinom_gamma_fixed.R.

The script performs a negative binomial likelihood fit to the influenza data, assuming that the average recovery period, 1/gamma, for flu is fixed at 4.8 days. The script produces the following plot (recall that alpha is the over-dispersion parameter for the negative binomial likelihood, and t0 is the time of introduction of the virus to the population).

The script gives the best-fit estimate using the graphical Monte Carlo fmin+1/2 method, and also the weighted mean method. Note that the plots should be much better populated in order to really get trustworthy estimates from the fmin+1/2 method.

**In reality, most of the parameters that we obtain from prior studies aren’t known to perfect precision**

In our script above, we assumed that 1/gamma was 4.8 days based on a prior study in the literature. However, this was estimated from observational studies of sick people, and in reality there are statistical uncertainties associated with that estimate. The paper describing the studies states that the central estimate and 95% confidence interval on 1/gamma was 4.80 [4.31, 5.29] days. **Unless told otherwise in the paper from which you get an estimate, you assume that the uncertainty on the parameter is Normally distributed.** Because the 95% CI is +/-1.96*sigma about the mean, this implies that the standard deviation of the Normal distribution is sigma=(4.80-4.31)/1.96~0.25 days.

Thus, our probability distribution for x=1/gamma is

P(x|mu,sigma) ~ exp(-0.5*(x-mu)^2/sigma^2)

with mu=4.8 days, and sigma = 0.25 days, in this case.
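In R, the conversion from the literature 95% CI to this Normal prior is just a couple of lines (the only assumption is the Normality just described):

```r
# Convert the literature 95% CI on x = 1/gamma to a Normal prior:
# the 95% CI spans mu +/- 1.96*sigma.
mu <- 4.80                   # central estimate, in days
sigma <- (5.29 - mu) / 1.96  # approximately 0.25 days
prior_density <- function(x) dnorm(x, mean = mu, sd = sigma)
```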

**Uncertainty on “known” parameters affects the uncertainties on the other model parameters you estimate from your fit to data!**

The uncertainty on 1/gamma will affect the uncertainty on our parameter estimates. For instance, is it clear that if all we knew about 1/gamma was that it was somewhere between 1 and 50 days, it would be much harder to pin down our transmission rate? (i.e., if we had no idea what 1/gamma was, and thus had to fit for gamma, as well as R0, t0, and alpha.) The script fit_midwest_negbinom_gamma_unconstrained.R does this, and produces the following plot:

You can see that the influenza data we have perhaps give us a little bit of sensitivity to the parameter gamma, but not much (basically, the fit just tells us 1/gamma is somewhere between 2 and 6 days, with 95% confidence). The uncertainties on our estimates of R0 and t0 have gone way up compared to the first fit, where we assumed 1/gamma was fixed at 4.8 days! Also, when you are using the weighted mean method to estimate parameters and the parameter uncertainties, you can also get the covariance matrix for your parameter estimates. The correlation matrix, derived from the covariance matrix, for this fit looks like this:

Notice that our estimates of R0 and 1/gamma are almost 100% correlated (this means that as 1/gamma goes up, R0 also has to go up to achieve a good fit to the data). You never want to see parameters so highly correlated in fits you do… it means that your best-fit parameters likely won’t give you a model with good predictive power for a different, equivalent data set (say, influenza data for the same region for the next flu season).

So, even though we seem to have a little bit of sensitivity to the value of 1/gamma in our fit, having that estimate 100% correlated to our estimate of R0 is not good, and a sign you shouldn’t trust the results of the fit.

**Incorporating uncertainties on “known” parameters from the literature in our fit likelihoods**

In order to take into account the uncertainty on our “known” parameter, x, you simply modify your fit likelihood to include the likelihood coming from the probability distribution for that parameter. Thus, the negative log likelihood is modified like so:

negloglike = negloglike + 0.5*(x-mu)^2/sigma^2

Then, in the fit, you do Monte Carlo sampling not only of all your other unknown parameters (like R0, t0, and alpha in this case), but you also uniformly randomly sample the parameter x over a range of approximately mu-4*sigma to mu+4*sigma.
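A sketch of what this looks like in R for one Monte Carlo iteration; the value of the data negative log likelihood below is just a placeholder:

```r
mu <- 4.8; sigma <- 0.25                    # prior for x = 1/gamma, in days
x  <- runif(1, mu - 4*sigma, mu + 4*sigma)  # uniformly sampled hypothesis for x
negloglike_data <- 123.4                    # placeholder: computed from the data
# add the Gaussian prior penalty for x to the fit statistic:
negloglike <- negloglike_data + 0.5 * (x - mu)^2 / sigma^2
```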

For 1/gamma, we know that mu=4.8 days, and sigma is 0.25 days. The R script fit_midwest_negbinom_gamma_constrained.R modifies the likelihood to take into account our probability distribution for our prior estimate of 1/gamma from the literature. The script produces the following plot:

(again, for the fmin+1/2 method, we’d like to see these plots much better populated!). Note that now our uncertainties on R0 and t0 from the weighted mean method are much smaller than they were when 1/gamma was completely unconstrained, but larger than they were when 1/gamma was fixed to 4.8 days. By modifying the likelihood to take into account the probability distribution of our prior belief for 1/gamma, we now have a fit that properly feeds that uncertainty into our uncertainty on R0 and t0.

When publishing analyses that involve fits like these, it is important to take into account your prior belief probability distributions for the parameter estimates you take from the literature. In some cases, your fit might be quite sensitive to the assumed values of those parameters; if the literature estimates are a bit off from what your data would “like” them to be to obtain a good fit, and you just assume a fixed central value for the parameter, sometimes you just won’t be able to get a good fit to your data.

When you include the parameter in your fit with a modified likelihood to take into account its prior belief probability distribution, the estimate you get from the fit to your data is known as the “posterior” estimate. Note that the posterior estimate and uncertainty on 1/gamma that we obtained from fit_midwest_negbinom_gamma_constrained.R is 4.798+/-0.247, which is pretty darn close to our prior belief estimate of 4.8+/-0.25. If our data were sensitive to the value of 1/gamma, our posterior estimate would have a smaller uncertainty than the prior belief estimate, and would likely have a different central value too.

As discussed in that module, the model parameters can be estimated from the parameter hypothesis for which the negative log-likelihood statistic, f, is minimal, and the one standard deviation uncertainty on the parameters is obtained by looking at the range of the parameters for which the negative log likelihood is less than 1/2 more than the minimum value, like so:

This method has the advantage that it is easy to understand how to execute (once you’ve seen a few examples). However, we talked about the fact that this procedure is only reliable if you have many, many sweeps of the model parameter values. For instance, the above plots are pretty sparsely populated, and it would be a bad idea to trust the confidence intervals seen in them… they are underestimated, because the green arrows don’t quite reach the edge of the parabolic envelope that encases the points.

The fmin+1/2 method also does not yield a convenient way to determine the covariance between the parameter estimates, without going through the complicated numerical gymnastics of estimating what is known as the Hessian matrix. The Hessian matrix (when maximizing a log likelihood) is a numerical approximation of the matrix of second partial derivatives of the likelihood function, evaluated at the point of the maximum likelihood estimates. Thus, it’s a measure of the steepness of the drop in the likelihood surface as you move away from the best-fit estimate.

It turns out that there is an easy, elegant way, when using the graphical Monte Carlo method, to use information coming from every single point that you sample to obtain (usually) more robust and reliable parameter estimates, and (usually) more reliable confidence intervals for the parameters.

**The weighted mean method**

To begin to understand how this might work, first recall from the previous module that the fmin+1/2 method gives you the one standard deviation confidence interval. Recall that to get the S standard deviation confidence interval, you need to go up 0.5*S^2 from the value of fmin, and examine the range of points under that line. This means that when we plot our negative log likelihood, f, vs our parameter hypotheses, the points that lie some value X above fmin are, in effect, sqrt(2*X) standard deviations away from the best-fit value. Here is what that looks like graphically:

The red lines correspond to the points that lie at fmin+1/2 (the one standard deviation confidence interval), the blue lines correspond to the points that lie at fmin+0.5*2^2=fmin+2 (the two standard deviation confidence interval), and the green lines correspond to the points that lie at fmin+0.5*3^2=fmin+4.5 (the three standard deviation confidence interval).

It should make sense to you that the points that are further away from fmin carry less information about the best-fit value than points that have a likelihood close to the minimum. After all, when using the graphical Monte Carlo method, you aim to populate the graphs well enough to get a good idea of the width of the parabolic envelope *in the vicinity of the best-fit value*.

So… if we were to take some kind of weighted average of our parameter hypotheses, giving more weight to values near the minimum in our likelihood, we should be able to estimate the approximate best-fit value.

It turns out that the weight that achieves this is intimately related to those confidence intervals we see above. If we do many Monte Carlo parameter sweeps, getting our parameter hypotheses and the corresponding negative log likelihoods, f, we can estimate our best fit values by taking the weighted average of our parameter hypotheses, weighted with weights

w=dnorm(sqrt(2*(f-fmin)))

where dnorm is the PDF of the Normal distribution. Notice that this is maximal when f=fmin, and gets smaller and smaller as f moves away from fmin. In fact, when f=fmin+0.5*S^2 (the value that corresponds to the S std dev CI), then

w=dnorm(S)

So, the points that are close to giving the minimum likelihood are given a greater weight in the fit, because they are more informative as to where the minimum actually lies. The plot of w vs S is:

Thus, the further f gets away from fmin, the less weight the points are given, *but they still have some weight*.

It turns out that not only can these weights be used to estimate our best-fit values, they also can be used to estimate the covariance matrix of our parameter estimates. If we have two parameters (for example), and we’ve randomly sampled N_MC parameter hypotheses, we would form a N_MCx2 matrix of these sampled values, and then take the weighted covariance of that matrix. The R cov.wt() function does this.
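Here is a toy illustration of the whole procedure, using a simple two-parameter quadratic negative log likelihood surface in place of a real model fit; because the true values and curvatures are known by construction, we can check that the method recovers them:

```r
set.seed(1)
n_mc <- 20000
# uniform Monte Carlo parameter sweeps, over ranges several sigma wide:
a <- runif(n_mc, -1, 1)   # true value 0,  true std dev 0.1
b <- runif(n_mc, 8, 12)   # true value 10, true std dev 0.5
# toy negative log likelihood with known minimum and curvature:
f <- 0.5 * (a / 0.1)^2 + 0.5 * ((b - 10) / 0.5)^2
# weight each sampled point by dnorm(sqrt(2*(f - fmin))):
w <- dnorm(sqrt(2 * (f - min(f))))
est <- cov.wt(cbind(a, b), wt = w)
est$center           # weighted-mean estimates, close to (0, 10)
sqrt(diag(est$cov))  # one-std-dev uncertainties, close to (0.1, 0.5)
```

With this many sweeps, the weighted means land very close to the true best-fit values, and the square roots of the diagonal of the weighted covariance matrix reproduce the true one standard deviation uncertainties.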

Advantages of the weighted mean method: with this method, every single point you sample gives information about the best-fit parameters and the covariance matrix for those parameter estimates, unlike the fmin+1/2 method, where only the points right near the minimum value of f and at fmin+1/2 really matter in calculating the confidence interval.

Also, using this weighted method you trivially get the estimated covariance matrix for the parameters, unlike the fmin+1/2 method, where this would be much harder to achieve.

Another advantage of this method is that you don’t have to populate your plots quite as densely as you would for the fmin+1/2 method in order for it to reliably work; this is because every single point you sample is now informing the calculation of the weighted mean and weighted covariance matrix.

The disadvantage of this method is that you must uniformly randomly sample the parameters (no preferential sampling of parameters using rnorm, for instance), and you must uniformly sample them over a broad enough range that it encompasses at least a three or four standard deviation confidence interval; otherwise, as we’ll see, you will underestimate the parameter uncertainties.

**An example**

As an example of how this works in practice, let’s return to the simple example we saw in this previous module, where we compared the performance of the fmin+1/2 method to that where we analytically calculate the Hessian to estimate the parameter uncertainties.

In the example, the model we considered was y=a*x+b, where a=0.1 and b=10, and x goes from 10 to 150 in integer increments. We simulate the stochasticity in the data by smearing with numbers drawn from the Poisson distribution with mean equal to the model prediction. Thus, an example of the simulated data looks like this:

Recall that the Poisson negative log likelihood looks like this

where the y_i^obs are our data observations, and y_i^pred are our model prediction (y_i^pred = a*x_i+b).
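In R, that statistic can be computed like so, dropping the log(y_obs!) term, which does not depend on the model parameters and so does not affect the fit:

```r
# Poisson negative log likelihood, up to a parameter-independent constant:
poisson_negloglike <- function(y_obs, y_pred) {
  sum(y_pred - y_obs * log(y_pred))
}
# Equivalently (up to the same constant), using R's built-in density:
#   -sum(dpois(y_obs, lambda = y_pred, log = TRUE))
```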

In the example hess.R, we randomly generated many different samples of our y^obs, and then used the Monte Carlo parameter sweep method to find the values of a and b that minimize the negative log likelihood. Then we calculated the Hessian about this minimum and estimated the one-standard-deviation uncertainties on a and b from the covariance matrix, which is the inverse of the Hessian matrix. Recall that the square roots of the diagonal elements of the covariance matrix are the parameter uncertainties.

We also did this using the fmin plus a half method, to show that if the fmin plus a half method works, its estimate of the one-standard-deviation confidence intervals should be very close to the Hessian estimate.

We can add into this exercise our weighted mean method. The R script hess_with_weighted_covariance_calculation.R does just this.

The script produces the following plot, histogramming the parameter estimates from the weighted mean method. As you can see, the estimates are unbiased, and the uncertainties on the parameters assessed by the weighted mean method are very close to those assessed by the analytic Hessian method:

The script also produces the following plot:

Notice in the top two plots that the parameter uncertainties assessed by the weighted mean method are quite close to those estimated by the Hessian method, but the uncertainties assessed by the fmin+1/2 method are always underestimates. This is because we didn’t sample that many points in our graphical Monte Carlo procedure, as can be seen in the examples in the two bottom plots; the plots are so sparsely populated, the green arrows that represent the CI’s estimated by the fmin+1/2 method don’t go all the way to the edge of the parabolic envelope.

So, even with relatively sparsely populated plots, the weighted mean method works quite well. If they are really, really sparsely populated, however, you will find that the performance of the method starts to degrade; take a look at what happens when you change nmc_iterations from 10,000 to 100 in hess_with_weighted_covariance_calculation.R:

The estimates of the parameter uncertainties are still scattered about the Hessian estimates (and the fmin+1/2 method fails miserably due to the sparsity of points). However, notice that there is quite a bit of variation in the weighted mean method’s uncertainty estimates about the red dotted line (compare to the other plot, above); the more MC iterations you have, the more closely these will cluster about the expected values (i.e., the more trustworthy your parameter uncertainty estimates will be). So, don’t skimp on the MC parameter sampling iterations, even when using the weighted mean method! In general, with this method, you need to run enough MC parameter sweep iterations to get a reasonable idea of the parabolic envelope in the vicinity of the best-fit value.

One catch of this method, as mentioned above, is that you must do your random uniform parameter sweeps over a broad enough range… if you sample parameters too close to the best-fit value, the weighted mean method will underestimate the confidence intervals. In general, you have to sweep over at least a four standard deviation confidence interval. As an example of what occurs when you use too narrow a range: instead of sampling parameter a uniformly from 0.06 to 0.14, uniformly sample it from 0.09 to 0.11 in the original hess_with_weighted_covariance_calculation.R script. We now get:

You can see that the confidence intervals are now severely under-estimated by the weighted mean method.

It needs to be kept in mind that the covariance matrix returned by the weighted mean method assumes that the confidence interval is symmetrically distributed about the best-fit value. In practice, this isn’t always the case; sometimes the plots of the neg log likelihood vs parameter hypotheses, instead of looking like they have a symmetric parabolic envelope, have a highly asymmetric parabolic envelope, like this, for example:

The weighted mean method will essentially produce a one standard deviation estimate that is derived from an “average” symmetric parabola fit to the asymmetric parabola. It will tend to underestimate confidence intervals in such cases. When you have highly asymmetric parabolic envelopes in your plots of the neg log likelihood vs your parameter hypotheses, it is thus best to use the fmin+1/2 method.


**Introduction**

In a previous module, we explored an example of Least Squares fitting the parameters of a mathematical SIR contagion model to data from a real influenza epidemic, using the Monte Carlo parameter sampling method. The R script fit_iteration.R performs the Monte Carlo iterations, randomly sampling values of the reproduction number, R0, and the time of introduction of the virus to the population, t0, from uniform distributions; it calculates the Least Squares statistic at each iteration, and plots the results to show which values of R0 and t0 minimize the Least Squares statistic.

If you have n data points Y_i (where i=1,…,n), and model predictions for those data points, model_i (note that these predictions depend on the model parameters!), then the least squares statistic is calculated like this (let’s call that statistic “LS”):

In this case, our model_i estimates for each week are coming from our SIR model, and the Y_i are the number of cases we observed for that week.
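In R, the statistic is a one-liner:

```r
# Least Squares statistic for observations Y and model predictions model_pred:
least_squares <- function(Y, model_pred) sum((Y - model_pred)^2)
```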

The fit_iteration.R script produces the following plot:

**Fitting with the Pearson chi-square goodness of fit statistic**

In another module, we discussed the underlying assumptions of the Least Squares statistic… namely, that the data points are independent, and that the stochasticity underlying the random variation of the data points about the model prediction is Normal, with equal variance for each data point (“homoskedasticity“). In actuality, count data usually are not homoskedastic, particularly if there is a wide range of counts in the data, from small to large. In this particular data set, our counts per time bin range from 2 to 254. Thus, while Least Squares fitting is conceptually easy to understand, it probably isn’t the best choice for these particular data.

In this past module, we discussed generalized Least Squares fitting using the Pearson chi-squared statistic. The Pearson chi-squared statistic is only appropriate for count data, and adjusts the goodness of fit statistic to take into account the heteroskedasticity seen in count data. It is a “generalized” or “weighted” least squares statistic, and is calculated as follows:

Its underlying premise is that the true probability distribution underlying the data stochasticity is Poisson (which approaches Normal when the counts are high enough). Weighted least squares statistics weight each squared residual by the inverse of the squared uncertainty on that data point. For Poisson distributed data, the uncertainty is the square root of the expected mean.
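In R, with Y the observed counts and model_pred the predicted means, each squared residual is divided by the Poisson variance, which equals the predicted mean:

```r
# Pearson chi-squared statistic for count data:
pearson_chisq <- function(Y, model_pred) sum((Y - model_pred)^2 / model_pred)
```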

The R script fit_iteration_pearson.R fits to the same influenza data as above, but instead of looking for the R0 and t0 that minimize the Least Squares statistic, it minimizes the Pearson chi-square statistic. The script produces the following plot:

You’ll notice that the best-fit values are quite different than what we got using the Least Squares statistic! That is because the Least Squares statistic was giving too much weight to bins that had few counts… these are “low-information” bins with high variation relative to the expected value, and should be weighted accordingly. The Pearson chi-square statistic isn’t perfect when the data are over-dispersed, but for count data it is far preferable to Least Squares fitting.

**Optimization using the Poisson negative log-likelihood**

The Pearson chi-squared statistic, while better than Least Squares for count data, is still only a good choice if there are enough counts in the data that the Poisson distribution approaches the Normal (generalized least squares statistics still carry the assumption of Normally distributed stochasticity). This occurs when the expected number of counts is around 10-ish. In our influenza data, we have several bins with fewer than 10 counts.

So, we need a fit statistic that properly takes into account that our data are Poisson distributed (let’s ignore over-dispersion for the moment). This is achieved by optimizing the negative log Poisson likelihood statistic, described in this past module:

where k_i is the observed number of counts in the i-th bin, and lambda_i is your model prediction for that bin.

The R script fit_iteration_poisson_likelihood.R fits to our influenza epidemic data, calculating the Poisson negative log likelihood at each iteration. The script produces the following plot:

**Over-dispersed count data: the Negative Binomial negative log-likelihood**

As mentioned in this past module, if your research question involves count data, pretty much always such data are over-dispersed, meaning that the stochastic variation in the data is much larger than would be expected from the Poisson distribution.

In this case, the best choice is the Negative Binomial likelihood. The Negative Binomial is a discrete probability distribution with an extra parameter, alpha, that is a measure of how over-dispersed the data are. As alpha approaches 0, the distribution approaches the Poisson; as alpha gets large, there is a lot of over-dispersion in the data.
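A sketch of this statistic in R, under the common parameterization where the variance is mu + alpha*mu^2 (note that R's `dnbinom` uses size = 1/alpha; the helper name is mine, not from the course scripts):

```r
# Negative Binomial negative log-likelihood under the parameterization where
# the variance is mu + alpha*mu^2 (as alpha -> 0, this recovers the Poisson).
# R's dnbinom uses size = 1/alpha
neg_log_like_negbinom <- function(k, mu, alpha) {
  -sum(dnbinom(k, size = 1 / alpha, mu = mu, log = TRUE))
}
```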

The R script fit_iteration_negbinom_likelihood.R fits to the influenza data, optimizing the Negative Binomial negative log-likelihood. The parameter alpha is now an additional nuisance parameter we have to fit for. The script produces the following plot:

Notice that 10,000 Monte Carlo iterations isn’t really sufficient to precisely pinpoint our best-fit values, or to get a good idea of the parabolic envelope below which we don’t see any points in the first three plots. This is because the more parameters you are fitting for, the more Monte Carlo iterations you need to pinpoint the best fit for the combination of all the parameters.
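The Monte Carlo parameter sweep idea itself can be sketched in a few lines of R; here the “model” is a deliberately trivial one-parameter toy (a constant Poisson rate), not the SIR model fit in the scripts above:

```r
# Bare-bones sketch of the Monte Carlo parameter sweep: randomly sample the
# parameter, evaluate the fit statistic for each sample, and keep the best.
set.seed(42)
observed <- c(4, 6, 5, 7, 3)

n_iter     <- 10000
lambda_try <- runif(n_iter, min = 1, max = 10)  # random parameter hypotheses
negll      <- sapply(lambda_try,
                     function(lam) sum(lam - observed * log(lam)))

best <- lambda_try[which.min(negll)]  # should be near mean(observed) = 5
```

With two or more parameters, each iteration draws a random value for every parameter simultaneously, which is why the number of iterations needed to pinpoint the joint best fit grows quickly.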

**Summary**

Many of the problems we encounter in our research questions involve integer count data. In this module, we discussed that Least Squares probably isn’t the best choice for such data due to heteroskedasticity (however, you will see in the literature examples where people apply LS fits to count data anyway!). Inappropriate uses of the LS statistic should be caught in review, but often aren’t.

We discussed how a weighted least squares statistic, like Pearson chi-square, can help adjust for the heteroskedasticity problem in count data, and is a nice alternative as long as you have at least 10 counts per bin in each of your bins.

If the count data involve low counts, a better choice is the Poisson negative log-likelihood (and at times you see such fits in the literature too), but count data are usually over-dispersed, in which case the best choice always is the Negative Binomial negative log-likelihood. In fact, in general, the Negative Binomial statistic is *always* applicable to independent count data, whereas the other three statistics we discuss here each have limitations in their applicability.

The only drawback of using the Negative Binomial likelihood is that it requires fitting for an extra parameter, the over-dispersion parameter, and the mathematical expression of the statistic looks complicated and involved, and can potentially scare the bejeezuz out of reviewers of your papers. Don’t let that stop you from using it though. Simply cite well-written papers like the following as precedent for using the NB likelihood for count data in the life and social sciences, and perhaps consider not writing the explicit formula for the NB likelihood in your paper… just mention that you used it: Maximum Likelihood Estimation of the Negative Binomial Dispersion Parameter for Highly Overdispersed Data, with Applications to Infectious Diseases.

If you want to get your work published in a timely fashion, strive to use methods that are rigorous, but as simple as possible. If you have to use a more complicated method, describe it in your paper in plain and simple terms. In my long experience, this can keep your manuscript from getting hung up for unnecessarily long periods in review.

The basic elements of virtually any scientific paper are as follows (Introduction, Methods and Materials, Results, and Discussion and Summary):

**Introduction:** This section always appears in a paper. At the beginning of the Introduction is where you motivate your work (ie; why should anyone care?). Start from a broad motivation, and move to focus in on the particular motivation of your work. For instance, let’s assume I was writing an Ebola paper that describes a compartmental modelling analysis I did to assess the effects of isolation and/or quarantine on the spread of the disease:

- At the very beginning of the introduction I’d start off talking about the number of people killed in past outbreaks, the wide geographic spread of outbreak locations, and the ever-present risk that Ebola cases can be imported to other areas of the world due to modern air travel (the point being that no matter where you live, you should care about Ebola). Then I’d talk about the high mortality of the disease. These would be among the first few sentences in pretty much **any** paper written about Ebola.
- For my particular paper, I’d then mention that the lack of current treatment options (like vaccines or medications) leaves better hygiene, quarantine, and isolation as the only options available to slow the spread of disease.

After *motivating* the project, you then move on to describing the *objective* of the paper. This is where you present your research question, and where you give a very short overview of what you did in your analysis, and how it advances the body of work in the published literature on the subject. In the Ebola paper case, I would add some sentences saying that mathematical models are being increasingly used to assess the efficacy of disease intervention strategies (and I would cite a few well known seminal publications on that topic). Then I would state that in this work we use a mathematical model to assess the efficacy of isolation and quarantine, and I would state that no one has ever done that before for Ebola (as of August 2014, this was true). *In this part of the paper it is very important to state what is new and novel about your work.*

Once you have described your motivation and objective, it is a good idea to end the Introduction with a sentence or two that gives a road map for what the reader should expect in the following sections. Something like “In the following section, we will describe the data sources and mathematical and statistical modelling methodologies used in these studies, followed by a presentation of results and discussion” (this is assuming your analysis uses data, a mathematical model, and statistical methods).
**Methods and Materials:** This section always appears in a paper. If you are using data in your analysis, the first subsection in this section should be **Data**. The Data section should *thoroughly* describe your sources of data. If you collected it, what were your laboratory or field protocols? If it is time series data, what time steps are used? What, precisely, is the data measuring? If you got the data online, give a reference to the source. Even if you didn’t collect the data, you need to describe the collection procedures of the person or group who did collect the data.

If you are using a mathematical or computational model, the next subsection should be **Model**. In this subsection, you will describe what kind of model you are using, and give citations to relevant related publications in the field. You will describe what is new and interesting about your model (if relevant… sometimes it is the data that are new and interesting, and what is novel is applying an old model to new data). Here you will give the model equations and compartmental flow diagram (if using a compartmental model), or other details about your mathematical or computational model. You need to give enough details that anyone could reproduce your work based on this information.

If you are using statistical methods that are fancier than your usual statistical tests based on Student t, Z scores, Spearman rho, etc, you need to have a subsection under Methods and Materials called **Statistical Methods**. This subsection would be appropriate, for instance, if the statistical methods you use are so esoteric that they are either new, or very rarely used in your field.

**Results:** This section always appears in a paper. Here is where, *without discussion*, you give the results of your paper, often in tables and figures, and accompanying text. *Do not discuss the results here!*

**Discussion:** This section always appears in a paper. Never put results here that you haven’t presented in the Results section… they belong in the Results section! In the Discussion section, you talk about notable things revealed by your results and how they fit in with (or contradict) the published literature.

**Summary:**This section is sometimes called Conclusions, and sometimes is lumped in with Discussion (and called Discussion and Summary). It depends on the journal. If there is a separate Summary section, you start off with a little paragraph describing what you presented in the paper, and why it is new and novel. In the summary you detail limitations of your study, possible future work, etc, and usually end with a “feel good” sentence about the utility of studies like yours.

**Lacum et al rubric for identifying seven key elements of scientific papers**

In 2014, Lacum et al published a study where they trained students to look for seven key elements when reading or writing papers. As I discussed above and in this post, where I describe what sections need to be in a scientific paper, these elements are integral to the sections of a paper:

- Motive: Statement indicating why the research was done (e.g., a gap in knowledge, contradictory results). The motive leads to the objective. The motive should appear in the Abstract and Introduction.
- Objective: Statement about what the authors want to know. The objective may be formulated as a research question, a research aim, or a hypothesis that needs to be tested. The objective should appear in the Abstract and Introduction.
- Main conclusion: Statement about the main outcome of the research. The main conclusion is closely connected to the objective. It answers the research question, it says whether the research aim was achieved, or it states whether the hypothesis was supported by evidence. The main conclusion will lead to an implication. The main conclusion is often the last sentence in the Abstract, and is of course also described in the Discussion and Summary.
- Implication: Statements indicating the consequences of the research. This can be a recommendation, a statement about the applicability of the results (in the scientific community or society), or a suggestion for future research. This may appear in the Abstract, and certainly appears in the Discussion and Summary.
- Support: The statements the authors use to justify their main conclusion. These statements can be based on their own data (or their interpretation) or can be statements from the literature (references).
- Counterargument: Statements that weaken or discredit the main conclusion. For example, possible methodological flaws, anomalous data, results that contradict previous studies, or alternative explanations. Counterarguments are sometimes presented as limitations. They are placed in the Discussion and Summary.
- Refutation: Statements that weaken or refute a counterargument. Refutation appears in the Discussion and Summary.


**Objectives:**

**This course is meant to provide students in applied mathematics with the broad skill-set needed to optimize model parameters to relevant biological or epidemic data. The course will almost entirely be based on material posted on this website.**

**Upon completing this course:**

**Students will gain a basic understanding of applied statistics, and will be functional in R.**

**Students will learn how to read in, manipulate, and export data in R, and will be able to create publication-quality plots in R. Students will be familiar with several different parameter optimization methods, and will understand the underlying assumptions of each.**

**List of course modules:**

- Good work habits, and requirements for homework
- Literature searches with Google Scholar
- Elements of scientific papers
- The basics of the R statistical programming language
- Difference between statistical and mathematical models
- Numerically solving systems of non-linear ODE’s in R: Euler’s method
- Numerical methods to solve non-linear ODE’s
- Numerically solving systems of non-linear ODE’s in R: 4th order Runge-Kutta using the deSolve library
- Extracting data from graphs in published literature
- Online sources of free data
- SIR disease model with age classes
- SIR modelling of influenza with a periodic transmission rate
- Fitting the parameters of an SIR model to influenza data using Least Squares and the Monte Carlo parameter sweep method
- An overview of goodness of fit statistics, and methods to fit parameters of mathematical models to data
- Fitting the parameters of an SIR model to influenza outbreak incidence count data with the Monte Carlo method: a comparison of Least Squares, Pearson chi-square weighted least squares, Poisson negative log-likelihood, and Negative Binomial negative log-likelihood
- Estimating parameter confidence intervals when using the Monte Carlo parameter sweep optimization method: the fmin+1/2 method
- A better method for estimation of confidence intervals compared to the fmin+1/2 method: the weighted mean method
- Temporal and geospatial patterns in threats to Jewish community centers: an example of contagion in social behaviours?
- Incorporating prior parameter estimates and their uncertainties into your likelihood fits
- Producing well written manuscripts in a timely fashion
- Giving a good presentation

**Course expectations:**

There will be regular homework projects assigned throughout the course, which will be worth 50% of the grade. Students are strongly encouraged to work together in groups to discuss issues related to the course and resolve problems. However, plagiarism of code will not be tolerated.

The culmination of the course will be a group term project (two to three students collaborating together, with the project worth 50% of the final grade) that requires the development of an R program to solve a system of ordinary differential equations that describes the dynamics of disease spread, interacting biological populations, etc. The students will then optimize the parameters of their model to data that the student has identified as being appropriate to describe with their model. The students will write-up the results of their project in a format suitable for publication, using the format required by a journal they have identified as being appropriate for the topic. A cover letter written to the editor of the journal is also required. **However, submission for publication is not required, but encouraged if the analysis is novel.**

Students are responsible for locating and obtaining sources of data, and developing an appropriate model for the project, so this should be something they begin to think about very early in the course.

**This course has no associated textbook, due to the unique nature of the course content. Instead, the course content consists of the modules that appear on this website. A textbook that students may find useful is Statistical Data Analysis, by G. Cowan.**

Students are expected to bring their laptops to class. Before the course begins, students are expected to have downloaded the R programming language onto their laptop from http://www.r-project.org/ (R is open-source free software).

Final project write-ups will be due **Friday, April 15th**. Each of the project groups will perform an in-class 20 min presentation on **Monday, April 24th, 2017 and Wed, April 26th, 2017**.

During the week of April 17th, project groups will meet with Dr. Towers to discuss their final project write-ups, and their upcoming presentation. By Friday, April 28th, all group members are to submit to Prof Towers a confidential email, detailing their contribution to the group project, and detailing the contributions of the other group members.


Near the ruins are petroglyphs, inscribed under a rock overhang:

(if you look closely, you will see at least two sets of nested circles, or spirals).

There have been several recorded instances of Pueblo spiral petroglyphs being associated with sun solstices. For instance, in Arizona, and Chaco Canyon. Indeed, the petroglyphs at Holly House have been interpreted as being a marker for the summer solstice, at which time a shaft of light bisects the spirals.

However, the location may have been used for winter solstice sun rise observations as well. On the horizon, immediately above the rock on which the petroglyphs are carved, the tip of Sleeping Ute mountain is clearly visible (note that the bearing displayed by the app is only approximate):

To obtain a much more precise bearing, using my GPS, I recorded the position of the site to be 37.39856N, 109.042946 W. The tip of Ute Mountain is located at 37.2842N, 108.7787W, at a bearing of 118.5 degrees.
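The bearing quoted above can be reproduced with a few lines of R, using the standard great-circle initial-bearing formula (my own quick check, not part of the original survey analysis):

```r
# Great-circle initial bearing from the petroglyph site to the tip of
# Sleeping Ute Mountain, using the GPS coordinates quoted above
bearing_deg <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dlon <- (lon2 - lon1) * to_rad
  y <- sin(dlon) * cos(phi2)
  x <- cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(dlon)
  (atan2(y, x) / to_rad) %% 360
}

bearing_deg(37.39856, -109.042946, 37.2842, -108.7787)  # ~118.5 degrees
```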

At the latitude of the petroglyphs, the sun rises at winter solstice at 119.5 degrees at a level horizon. Because of the presence of the mountain, the azimuth of the sun rise occurs slightly offset from this.

Thus, the site was very likely used as a sun watching station not just at the summer solstice, but also at the winter solstice, tied into a broader ritual landscape.

**Sherry Towers**

This is a presentation based on the work described in my recent paper, published in the Journal of Archaeological Science: Reports, *Advanced geometrical constructs in a Pueblo ceremonial site, c 1200 CE*.

The Sun Temple is a Pueblo III site at Mesa Verde National Park, Colorado, prominently located atop a mesa, with a commanding view of the surrounding area:

The World Heritage site Cliff Palace is just across the canyon.

**Excavation and dating**

The site was excavated by Jesse Walter Fewkes in 1915.

Largely based on construction patterns and proximity to the Cliff Palace site (which has dendrochronology dates associated with the site), the Sun Temple site is thought to have been constructed circa 1200 AD, shortly before the ancestral Pueblo peoples largely abandoned the area around 1300 AD.

(image from Excavation and Repair of Sun Temple, Mesa Verde National Park, by JW Fewkes)

Greg Munson has done several nice studies of the Sun Temple site, including a 2011 study estimating the heights of the walls. The study “Reading, Writing, and Recording the Architecture: How Astronomical Cycles May Be Reflected in the Architectural Construction at Mesa Verde National Park, Colorado” by Munson, Nordby, and Bates (2010) found that the site was constructed in stages, and that some elements of the site, such as the Kivas and the outer D wall, preceded the construction of other elements like the Annex.

**Ceremonial nature of the site**

The D shape and bi-wall nature of the site are known to denote ancestral Pueblo ceremonial structures (there are several examples throughout the San Juan River basin area).

The four tower-like round structures were referred to by Fewkes as “kivas”, which are ceremonial structures in ancestral Pueblo architecture (but in the case of Sun Temple the round features do not have all the usual kiva features).

(image from Excavation and Repair of Sun Temple, Mesa Verde National Park, by JW Fewkes)

The entire site is walled in; you cannot access the interior without a ladder. The walls were several meters high (today they stand between roughly 2 and 12 feet).

On the outside southwest corner, there is an eroded basin feature, less than half a meter across, called the “Sun Shrine”, that has small knee-walls encasing it.

A few meters to the north of the site there is a small pecked basin. Such basins have been hypothesized to be calendrical watching stations (Malville and Putnam (1998)).

**Archaeoastronomical significance of the site**

Many other Pueblo ruins are in the immediate vicinity of Sun Temple, including Balcony House, another ceremonial cliff dwelling. Balcony House is a known solar solstice observatory (on the summer solstice, the sun, when viewed from Balcony House, rises directly behind the La Plata mountains to the east… in fact, Balcony House is one of the few cliff dwellings that faces east, instead of south, and the position of the structure appears to be specifically related to solstice observations).

Chimney Rock National Monument, another Pueblo III site near Mesa Verde, has also been shown to have had astronomical alignments (specifically, to the position of the full moon at its most northern rise).

Sun Temple itself has also been the subject of archaeoastronomical studies; for instance, J. Kim Malville found that key features of the site appeared to be aligned with key Sun and Moon set positions on the horizon, when viewed from Cliff Palace. The alignments were further examined in Munson (2014).

However, alignments to the rise/set of celestial bodies had not hitherto been considered within the Sun Temple site itself.

**Using aerial imagery to perform site surveys**

My research interests involve using aerial imagery of sites visible from the air to assess the possibility of astronomical alignments. A full description of the methodologies I use can be found on these web pages, including all the computer programs I have developed for such studies.

My interest in Sun Temple was sparked by a vacation visit to Mesa Verde in summer 2012. My paper describing my archaeoastronomical study of the site can be found here and is also available here.

I use the free Google Earth Pro software program to obtain aerial images of archaeological sites. If you download the Google Earth virtual globe program to your computer and start it up, in the search bar on the upper left hand side you can search for locations.

For instance, you can type “Sun Temple Mesa Verde” and it will take you to the aerial view of the site:

Often there are many aerial images available that have been taken in the past of a site, and some are better quality than others. Google Earth allows you to easily access these past images. At the top menu bar of Google Earth, you’ll see a little clock with an arrow going counterclockwise. If you click on it, you’ll get a menu of past aerial images for the area. This can be very useful, since some aerial images give clearer views of the site than others due to atmospheric conditions, or time of day, or resolution of the camera.

**Using CAD software for ground feature measurements**

Here is a screen shot of an aerial view of Sun Temple that I obtained from Google Earth:

Whenever I obtain an aerial image of a site from Google Earth, I ensure that I include the distance scale (at the lower left hand corner) in the screen shot. I also use the Google Earth line measure tool to determine and record the ground width across the field of view of the image. In this way, I can later use the image in a CAD software package, and determine distances between features in the image.

Once I have the screen shot of an aerial image of a site, I read it into the free Xfig CAD software package, place datum points on key features of the site, and determine the distances between the datum points. Repeated measures are used to assess statistical uncertainty on the measurements.

The Pixelstick application for the Mac is also extraordinarily helpful for these kinds of studies.

**Avoiding “unanchored geometries”**

In my studies of the Sun Temple site, archaeoastronomical and otherwise, I only focus on measurements associated with key site features. For example, the length and width of the outer D, and the sizes of the four Kivas, and their respective positions relative to each other, and the SE and SW corners of the outer D. I only consider geometrical constructs associated with either the sizes of the features, or with at least two vertices anchored on the features.

This helps to avoid “unanchored geometries”, which are unfortunately prevalent in a lot of woo related to archaeological sites like these.

**A serendipitous discovery**

When using Xfig to examine the radii of the Kivas, I accidentally hit “÷” instead of “-” on my calculator when trying to determine the thickness of the wall of Kiva B, and obtained 1.42, which is within 1% of the square root of 2.
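The geometry behind this can be verified in a couple of lines of R: for a square of side c, the inscribed circle has radius c/2 and the circumscribed circle has radius c√2/2, so the ratio of the two radii is √2 ≈ 1.414, within 1% of the measured 1.42:

```r
# Ratio of the circumscribed to the inscribed circle radius of a square:
# r_outer / r_inner = (c*sqrt(2)/2) / (c/2) = sqrt(2) ~ 1.414
c_side  <- 1
r_inner <- c_side / 2
r_outer <- c_side * sqrt(2) / 2
ratio   <- r_outer / r_inner
```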

A quick check showed the same was true for Kivas C and D. Constructing these Kiva walls could easily be achieved by inscribing and circumscribing two circles on a square with side length=c:

**Another serendipitous discovery**

When I measured the length of the D, its ratio to the width of the complex was within approximately 1% of the golden ratio, φ=(1+√5)/2≈1.618

The golden rectangle is seen in architecture and art throughout the ancient world, including the Greek Parthenon:

Golden rectangles are straightforward to construct with a straightedge and cord (or compass):

The golden ratio is related to the Fibonacci series, where each number in the series is the sum of the two before it: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …
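The convergence of the ratios of successive Fibonacci numbers to φ is easy to check numerically in R:

```r
# Ratios of successive Fibonacci numbers converge to the golden ratio
fib <- numeric(20)
fib[1:2] <- 1
for (i in 3:20) fib[i] <- fib[i - 1] + fib[i - 2]

phi    <- (1 + sqrt(5)) / 2       # 1.6180339...
ratios <- fib[2:20] / fib[1:19]
tail(ratios, 1)                   # already very close to phi
```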

As the series progresses, the ratio of adjacent numbers approaches the golden ratio. The golden rectangle can be constructed in a spiral formation using the Fibonacci numbers:

Drawing arcs in each of the squares with radius equal to the square length yields a Fibonacci spiral:

The golden ratio and spiral are found throughout nature.

**Other geometrical constructs**

When we use Xfig and/or the pixelstick app to examine site features associated with the Sun Shrine, the positions and radii of the Kivas, and the position of the outer D, there is evidence of Pythagorean 3:4:5 triangles, squares, and equilateral triangles (triangles with all three sides equal).

Pythagorean 3:4:5 triangles are the simplest of the Pythagorean triple triangles, in which the sides a, b, and c are integer multiples of a unit length, the triangle is a right triangle (one angle equal to 90 degrees), and the hypotenuse c satisfies c^2 = a^2 + b^2.

Pythagorean 3:4:5 triangles are easily constructed with a cord and straight edge. Modern carpenters use 3:4:5 triangles to achieve square corners.

Note that the equilateral triangle with one vertex at the Sun Shrine is actually probably more likely to be a 30°:60°:90° right triangle (a triangle with base length equal to one unit, and hypotenuse of two units). The vertical edge of the right triangle goes through the ventilator shaft to the south of Kiva A.

The outer diameter of Kiva A appears to be constructed from the base of a 3:4:5 Pythagorean triangle, with height equal to the yellow lines. The inner radius is statistically consistent with being 3/4 of this.

**Ground Truthing**

In Summer 2014, I obtained a research permit from the NPS to access the site and perform a ground survey to verify the measurements obtained from aerial imagery, using tape, theodolite, and GPS measurements.

Ground truthing revealed that the walls of Kiva A, which are much higher than the remaining walls of Kivas B and C, slope gently inwards, with a slope of approximately 5 cm per 4000 cm. The inner radius at the ground of the walls of Kiva A was statistically consistent with the ground inner radii of Kivas B and C.

**Was there a common unit of measurement?**

In this figure, all of the yellow lines are of exactly the same length in the CAD drawing, and the red lines are exactly twice the length of the yellow (which is the width of the Golden rectangle encasing the D). The blue lines are exactly 1/3 the length of the red.

Based on this, it is apparent that the red, yellow, and dark blue lines likely represent integer multiples of some base unit of measurement.

A common unit of measurement is also evidenced by the fact that the inner radii of Kivas A, B, and C (measured at the ground level) are all statistically consistent with being equal.

Additionally, the outer radius of Kiva A is statistically consistent with being 4/3 the inner radius.

**Assessing the common unit of measurement**

We can assess the average unit of measurement based on the commonalities we see in the dimensions of the site features and apparent geometrical constructs.

Let X be the width of the outer D, which is statistically consistent with our average in the table. It appears that the inner radius of Kiva A (which is statistically consistent with the inner radii of Kivas B and C) is constructed from a Pythagorean 3:4:5 triangle such that the radius is 9X/64. If X is some integer multiple of a base unit, and the inner radii of Kivas A, B and C are also an integer multiple of that base unit, this implies that the base unit is, at most, L=X/64~30cm.

However, there is also evidence that this unit is perhaps also divisible by three, based on the distance between Kivas B and C, and the distance of Kiva D from the SE wall. This implies that the base unit is likely X/192~10 cm. This is very similar to the “hand” unit found in other ancient societies, which is the width of a clenched fist.

**Other notable geometric features**

The location of the pecked basin to the north of the site is not visible on aerial imagery; its location on the CAD drawing was instead derived from ground survey measurements. As described in Munson, Nordby, and Bates (2010), the pecked basin and the SW and SE corners of the D form, to within about 5% to 10%, an equilateral triangle.

A nearly perfect Pythagorean 3:4:5 triangle (to within 1%) also runs from the Sun Shrine to the pecked basin, and goes through the center of Kiva D. The edges of the triangle also intersect several other key points.

The pecked basin may therefore have been a datum point, supporting this assertion in Munson, Nordby, and Bates (2010).

**Summary**

Using aerial imagery, we surveyed the Sun Temple site, and a ground survey was performed to verify the aerial image survey. Note that the aerial survey analysis is reproducible by any interested party using aerial imagery from Google Earth.

We find evidence of Golden rectangles, Pythagorean triangles, and equilateral triangles in the Sun Temple complex (note that there may be other geometrical constructs present, not revealed by this particular analysis). This is the first evidence of knowledge of Pythagorean and equilateral triangles anywhere in the New World. Golden rectangles have been potentially noted in Mayan ceremonial architecture, but the evidence of how precise these rectangles were is somewhat unclear:

**Summary (cont)**

The base unit used to lay out the site appears to be approximately 10 cm. The prehistoric Maya had a similar unit called the *kab*, which was 9.2±0.3 cm.

Multiples of 3, 4, and 12 of this base unit are found throughout the site, similar to what has been noted in dimensions of the layout of Mayan ceremonial architecture. This may or may not reflect Mesoamerican influence on Pueblo culture.

The layout of the Sun Temple site appears to use the Sun Shrine as one of the primary datum points, with the geometrical constructs used to lay out the site perhaps starting from there. Even with the multiple geometrical constructs, it should be noted that there is still a wide degree of latitude in the exact placement of the site elements, such as the angle of the south wall, and the positions of Kivas B and C relative to the south wall and to Kiva A.

The ancestral Pueblo peoples appear to have laid out the Sun Temple site with remarkable care, and with a sophisticated knowledge of geometrical constructs. A feat made even more remarkable by the lack of a written language!

The site is of exceptional importance as an exemplar of Pueblo ceremonial construction, and deserves more recognition as such.^{1}

Notes:

- Now, can we cut down that tree in the middle of the site that’s destroying the walls of Kiva C? ↩

**Choosing an appropriate statistic to assess goodness-of-fit of the model to the data**

**Choosing an appropriate method to find the model parameters that optimise the goodness-of-fit statistic**

**The first point depends only on the data. The second involves picking an optimisation method appropriate for the type of model being used. Mathematical models that are often used in population biology, epidemiology, etc., are usually non-linear, and can only be solved numerically. As we will see, the computational overhead involved in numerically solving a model considerably narrows the range of choices of appropriate optimisation methods.**

**Due to limited time, we will only discuss methods for finding the best-fit solution, not how to assess uncertainties on the best-fit solution.**

- Probability distributions important to modelling in the life and social sciences
- Overview of methods for optimizing model parameters to data (aka inverse problems)
- Basics of the R statistical programming language
- Introduction to numerically solving ODE models in R, with examples based on an SIR model for disease transmission
- Fitting the parameters of an ODE model to data, with examples based on fitting an SIR model to data from an influenza epidemic
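The last two topics above can be sketched together in a few lines of R. This is a minimal illustration, not the module's actual code: it assumes the `deSolve` package, and the data object `obs_I` (an observed epidemic curve) and the parameter values are hypothetical.

```r
# Minimal sketch: numerically solve an SIR model with deSolve, then fit
# beta and gamma to an observed incidence curve by least squares.
library(deSolve)

sir_model <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

N     <- 10000
init  <- c(S = N - 1, I = 1, R = 0)
times <- seq(0, 50, by = 1)

# Goodness-of-fit statistic: least squares between model I(t) and the data.
# Each evaluation requires a full numerical solution of the ODE, which is
# why the choice of optimisation method matters so much.
sse <- function(par, obs) {
  out <- ode(y = init, times = times, func = sir_model,
             parms = c(beta = par[1], gamma = par[2], N = N))
  sum((obs - out[, "I"])^2)
}

# obs_I would be the observed epidemic curve (hypothetical here):
# fit <- optim(c(0.5, 0.2), sse, obs = obs_I)
```

The `optim()` call is a derivative-free Nelder-Mead search by default, a common choice when each objective evaluation involves an ODE solve.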

*(Spoiler alert: as of mid-April, 2016, Clinton appears to have the broadest appeal across many demographics. However, I do not mean this analysis to advocate one candidate or political party over another; I simply present the data as they are.)*

Based on the counties that had held primaries as of this writing in mid-April, 2016, we expressed the demographics of each county as a percentile relative to all the other counties that have voted, and visualized the results in a format sometimes called a “spider-web graph”. The spokes of the circular graph correspond to various demographics and social indicators, and the farther a point lies from the center along a spoke, the higher the county's percentile for the demographic that corresponds to that spoke.

So, for instance, if one of the spokes is “income”, the closer towards the center of the circle on that spoke, the lower the average household income compared to other counties, and the closer to the perimeter of the circle, the higher the average income. Here is what these spokes might look like for a bunch of different demographics and variables:

There is a lot of information on display all at once in the above plot! Let’s break it down a step at a time. The variables corresponding to each spoke are:

- fwhite: fraction of non-Hispanic whites in the population
- fover_65: fraction of the population over the age of 65
- no_high_school: fraction of the population 25 years and over without a high school diploma
- bachelors: fraction of the population 25 years and over with at least a bachelor's degree
- food stamps: fraction of family households receiving food stamp assistance
- uninsured: fraction of the population without health insurance
- bankruptcies: per-capita bankruptcy rates
- mobile homes: fraction of households that consist of mobile homes
- obese: fraction of the population that is obese
- overdose: per-capita death rates by drug overdoses
- suicide rate: per-capita age-adjusted suicide rate
- pop_dens: population density
- evangelical: fraction of population regularly attending an evangelical church
- firearm_owners: fraction of households that own firearms
- fvote_2012: fraction of adult population that voted in 2012 election
- f_obama_2012: fraction of votes that went to Obama
- f_independent_2012: fraction of votes that went to an independent candidate

The blue circle on the plot represents the median values for each of the demographics and variables for all counties that have voted in the primaries so far. The outer black circle represents the 100th percentile (basically, the county that has the highest value of that particular indicator along a spoke). The inner dashed line is the 25th percentile, and the outer dashed line is the 75th percentile.
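The percentile ranks behind the spokes can be computed with R's `ecdf()` function. This is a sketch with made-up numbers; the column names and values are illustrative, not the actual county data set.

```r
# Sketch: express each county's value of an indicator as a percentile
# rank relative to all other counties. Data here are hypothetical.
counties <- data.frame(income = c(41000, 52000, 38000, 67000),
                       obese  = c(0.31, 0.26, 0.35, 0.22))

# ecdf(x)(x) returns, for each county, the fraction of counties at or
# below its value -- i.e., its percentile rank on that spoke.
percentiles <- as.data.frame(lapply(counties, function(x) ecdf(x)(x)))

# The blue circle is the median rank on every spoke; the dashed circles
# are the 25th and 75th percentiles.
apply(percentiles, 2, median)
```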

Now, for a particular sub-group of counties (in the case in the figure, counties that favoured Trump over any other candidate by at least 5 points in the primary), we can show, with the red line, how the demographics in those counties compare to those of all other counties. You can see that, for example, the average median household income in counties that favoured Trump is much lower than that for all counties, because the red line dips sharply towards the center of the circle along the “income” spoke. And there is an unusually large fraction of people in those counties who do not have a high school diploma, because the red line deviates outwards along the “no_high_school” spoke.

Let’s look at this further, in more detail…

**Demographics of counties that heavily favour Trump in the Republican primaries**

Here we examine the average percentiles of counties that favoured Trump over any other candidate by at least 5 percentage points in the Republican primaries, which was 47% of all counties. This is what the demographics of those counties look like, where I have added a pink band to the plot above to show the 25th to 75th percentiles for those counties:

The counties that favour Trump over other candidates skew older and less Hispanic; they are more poorly educated, have a high fraction of families receiving food stamps, low incomes, and a relatively large fraction of people living in mobile homes, and are generally in poorer health than average. These counties were about average in voter participation, the percentage that voted for Obama in 2012, and the percentage that voted for an independent presidential candidate in 2012.

**Demographics of counties that heavily favour Cruz in the Republican primaries**

Now let’s look at the same plot for counties that favoured Cruz by at least 5 percentage points in the Republican primaries. This was 21% of counties:

These counties skew far more Hispanic, more white, somewhat younger, and higher income, and generally have better health than average, despite the very high average of people without health insurance. The counties also skew very rural (low population density), had generally very low voter participation in 2012, and skewed very Republican in the 2012 election.

**Demographics of counties that heavily favour Sanders in the Democratic primaries**

Now let’s look at the counties that favoured Sanders over Clinton by at least 5 percentage points in the Democratic primaries that have occurred so far (21% of counties):

These counties skew very white, very educated, much less evangelical, high income, low percentage of uninsured, and good health (except for overdose and suicide rates, which are about average). There was a high degree of voter participation in these counties in 2012, and they skewed Democrat and heavily Independent rather than Republican.

**Demographics of counties that heavily favour Clinton in the Democratic primaries**

Now let’s look at the counties that favoured Clinton over Sanders by at least 5 percentage points so far (67% of counties):

These counties skew perhaps somewhat less white than average, but for the most part are quite close to the average for all other counties.

**Which candidate has the broadest appeal?**

As I discussed above, the candidate with the broadest appeal would be favoured by counties that are representative of the national averages in the various demographics. It would appear that, as of mid-April 2016, neither Trump nor Cruz achieves this, although Trump so far comes closer than Cruz to broad appeal; Trump support appears to skew poorer, unhealthier, and less educated, while Cruz support appears to skew heavily rural and evangelical.

Clinton appears to so far have far broader appeal over a wide array of demographics than Sanders (and indeed, over any other candidate).

**Sources of data**

- The Politico website makes available the county level election results for most of the primaries that have taken place. They are missing the county level results for Iowa and Alaska, and Kansas and Minnesota have results by district, not counties.
- All cause and cardiovascular death rates from 2010 to 2014 from the CDC.
- Household firearm ownership is estimated using the fraction of suicides that are committed by firearm; the suicide data by cause from 2010 to 2014 is obtained from the CDC.
- Education, racial and age demographics, household living arrangements, percentage without health insurance, and income are obtained from the 2014 Census Bureau American Community Survey 5 year averages.
- Land area of counties obtained from the 2015 Census Bureau Gazetteer files.
- Religion demographics from the 2010 Census religion study.
- Drug overdose mortality from 1999 to 2014 from the CDC.
- Obesity and diabetes prevalence from the CDC.
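The firearm-ownership proxy mentioned in the sources above (the fraction of suicides committed with a firearm, often called the FS/S proxy) is a simple ratio. Here is a sketch of the calculation with hypothetical numbers; the column names are illustrative, not those of the actual CDC extract.

```r
# Sketch of the FS/S proxy for household firearm ownership:
# firearm suicides divided by total suicides, per county.
# All values and column names here are hypothetical.
cdc <- data.frame(county           = c("A", "B"),
                  firearm_suicides = c(30, 12),
                  total_suicides   = c(50, 40))

cdc$firearm_owners <- cdc$firearm_suicides / cdc$total_suicides
cdc
```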

One of my most frequent uses of bootstrapping and stochastic sampling methods is when doing cross-validation of a statistical model, such as a linear regression model.

When fitting a model to data, an important but too often overlooked issue is cross-validation: ensuring that the model you fit to one data set not only fits that data well, but also has good predictive ability for an equivalent but separate (i.e., independent) data set. However, if all you have is the one data set, cross-validating the model poses a problem.
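Bootstrap resampling offers one way around this problem: refit the model on a resampled copy of the data and evaluate prediction error on the observations left out of that resample. Here is a minimal sketch using R's built-in `mtcars` data as a stand-in; the model formula and replicate count are illustrative choices, not a prescription.

```r
# Sketch: bootstrap cross-validation of a linear regression when only
# one data set is available. Each bootstrap sample plays the role of a
# training set; the "out-of-bag" rows play the role of new data.
set.seed(42)
n <- nrow(mtcars)

err <- replicate(200, {
  idx <- sample(n, replace = TRUE)          # bootstrap training sample
  oob <- setdiff(seq_len(n), unique(idx))   # rows left out of the resample
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  pred <- predict(fit, newdata = mtcars[oob, ])
  mean((mtcars$mpg[oob] - pred)^2)          # squared prediction error
})

mean(err)  # bootstrap estimate of out-of-sample prediction error
```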

- Linear regression
- Best fit
- AIC
- AIC is a statistic
- Example calculating AIC with R
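A generic example of calculating AIC in R, in the spirit of the "Example calculating AIC with R" topic above: this is my own minimal sketch using the built-in `mtcars` data, not the protected post's code.

```r
# Sketch: AIC for two nested linear models. AIC = -2*logLik + 2*k,
# where k is the number of estimated parameters (for lm this includes
# the residual variance).
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit1)
AIC(fit2)                              # the lower AIC is preferred

# By hand, for comparison with the built-in:
k <- length(coef(fit1)) + 1            # intercept, slope, sigma^2
-2 * as.numeric(logLik(fit1)) + 2 * k  # matches AIC(fit1)
```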
