In this module, we’ll discuss how to make your own R library package, and how to upload it to the R CRAN repository.
In this module, students will become familiar with Negative Binomial likelihood fits for over-dispersed count data.
In this module students will become familiar with the Kolmogorov-Smirnov non-parametric test for equivalence of distributions
In this module we will discuss numerical methods that can be used to calculate the 95% CI on data that has been transformed by some function, if one knows the probability distribution underlying the stochasticity of the original data.
In this module, students will learn how the Least Squares fit statistic can be expressed as a likelihood.
In this module, students will become familiar with the importance of, and methods for, model validation.
In this module, students will become familiar with population standardized regression methods when working with data that is expressed as per-capita rates
In this module, students will become familiar with logistic (Binomial) regression for data that either consists of 1′s and 0′s (“yes” and “no”), or fractions that represent the number of successes out of n trials. We focus on the R glm() method for logistic linear regression. Continue reading
In this module, students will become familiar with Poisson regression for count data. We focus on the R glm() method for linear regression, and then describe the R optim() method that can be used for non-linear models. Continue reading
In this module we will discuss how to conduct one-sample and two-sample Students t-tests of sample means when the variance of the sample is unknown, testing the equality of the means of several samples, and z-test of sample means when the variance is known.
- Students t-test of the mean of one sample
- Example of Students t-test of the mean of one sample
- Students t-test comparing the means of two samples
- Example of Students t-test comparing the means of two samples
- Limitations of the Students t-test
- Testing for equality of more than two means (ANOVA)
- One and two sample Z-tests
[In this course we will be learning how to formulate figure-of-merit statistics that can help to answer research questions like “Is quantity A significantly greater/less than quantity B?”, or “Does quantity X appear to be significantly related to quantity Y?”. As we are about to discuss, statistics that can be used to answer these types of questions do so using the underlying probability distribution to the statistic. Every statistic used for hypothesis testing has an underlying probability distribution.]
In this module, we will discuss how to “mesh” two or more data sets together in order to answer a research question.
Statistical Methods for Students in the Life and Social Sciences
(aka: How to be a Data Boss)
This course is meant to introduce students in the life and social sciences to the skill set needed to do well-executed and well-explicated statistical analyses. The course is aimed at students with little prior experience in statistical analyses, but prior exposure to “stats 101″-type courses is helpful. The course will be almost entirely based on material posted on this website. The course syllabus can be found here. There is no textbook for this course, but recommended reading is How to Lie with Statistics by Irving Geis, Statistical Data Analysis by Glen Cowan, and Applied Linear Statistical Models by Kutner et al (doesn’t really matter which edition). Continue reading
The R code associated with the analysis presented in the paper Sunspot activity and influenza pandemics: a statistical assessment of the purported association is in sunspot_analysis.R
The R script produces the following plot, shown in the paper,
In this talk, I’ll discuss several past analyses I’ve done with collaborators where we have combined statistical and mathematical modelling methods to explore some interesting research questions.
I’m a statistician, and I also have a PhD in experimental particle physics. Research in experimental particle physics can involve complex models of observable physical processes, and fitting of those models to experimental data is a not uncommon task in that field. Like the field of applied mathematics in the life and social sciences (AMLSS), the models being fit at times have no analytic solution, and must be solved numerically using specialized methods. When I entered the field of AMLSS back in 2009, I had a lot to learn about the various models used in this field and the common methodologies, but I already had a solid tool box of specialized skills that allowed me to connect mathematical models to data, and it has turned out that those skills have been remarkably useful in exploring a wide range of research questions in the life and social sciences that I find interesting. I also apply these skills in consulting projects I do.
First off: what is the difference between statistical and mathematical modelling, anyway?
The difference between statistical and mathematical models is often times confusing to people. In this past module on this site, I discuss an example of the differences, with an analysis of seasonal and pandemic influenza used as an example.
Example of an analysis combining statistical and mathematical modelling: Mathematical and statistical modelling of the contagious spread of panic in a population
During the 2014 Ebola outbreak, there were a total of five cases that were ultimately identified in America, compared to tens of thousands of cases in West Africa. Even though the “outbreak” in America was essentially non-existent, once the first case was identified in the US in autumn 2014, the media shifted into 24/7 coverage of the supposed dire threat Ebola presented to Americans, complete with scary imagery.
Autumn 2014 I was teaching a course in the ASU AMLSS graduate program on statistical methods for fitting the parameters of mathematical models to data. Each year, when I teach AML classes, I usually try to have a “class publication project” that encompasses the methodology I teach in the class. In this case, I thought it might be interesting to try to model the spread of Ebola-related panic in the US population, as expressed on social media, and explore how news media might play a role in that.
The class did the analysis as a homework assignment, and we wrote the paper together, which was published in 2015. The paper received national media attention when it came out.
First; why was this analysis important? Well, it has been suggested in the past that people talking about a particular disease on social media might perhaps be used as a real-time means to assess the temporal and geospatial spread of the disease in the population, rather than relying on slower traditional surveillance methods, which can suffer from backlogs in laboratory testing. For instance, tracking influenza, or cholera:
However, up until the US Ebola “outbreak” the problem was that it was impossible to say whether people were just discussing a disease on social media because they were worried about it, rather than because they actually had it. During the Ebola outbreak, pretty much no one actually had it in the US, so everyone who was talking about it was doing so because they were concerned about it. This gave us the perfect instance to gauge what kind of temporal patterns we might see in social media chatter due simply to panic or concern about a disease!
The data we used in the study were the daily number of news stories about Ebola from large national news outlets. We also obtained Twitter data related to Ebola, and Google search data in the US with search terms related to Ebola, including “do I have Ebola?” from Google Trends. Here is what the data looked like:
We came up with a model that related the number of news videos, V, and people who were infected, I, with the idea to tweet about Ebola, or do a Google search related to Ebola:
The parameter beta is a measure of how many tweets (or Google searches) per person per unit time one news story would inspire, and gamma parameterizes the “boredom” effect, through which people eventually move to a “recovered and immune” class, upon which they never tweet again about Ebola no matter how many Ebola-related news stories they are exposed to. Using the statistical methodology taught in the AML course, the students fit the parameters of that model to data, and obtained the following model predictions, shown in red:
The blue lines on the plot represent a plain statistical model that simply regresses the Twitter and Google search data on the news media data, without taking into account the “boredom” effect. Can you see that the regression fits are systematically too high early on, and systematically too low later for all the plots, but the same is not true of our mathematical model? That tells us that our mathematical model that includes boredom really does do a better job of describing the dynamics of peoples’ Ebola-related social media behaviours!
We found that each Ebola-related new story inspired on average thousands of tweets and Google searches. Also, on average, we found people were only interested enough for a few days to tweet or do Google search after seeing a news story about Ebola before they became bored with the topic:
We couldn’t have done this analysis without both mathematical modelling and statistical methods; it was a nice “bringing together” of the methodologies to explore an interesting research question.
Another example of an analysis that involved mathematical and statistical modelling methods: contagion in mass killings and school shootings
In January, 2014 there was a shooting at Purdue University, where one student entered a classroom and shot another student dead, then walked out and waited for police to arrest him.
At the time, it struck me that it was the third school shooting I had heard about in an approximately 10 day period. Even for the United States, which has a serious problem with firearm violence compared to other first world countries, this seemed like an unusual number to have in such a short period of time.
It led me to wonder if perhaps contagion was playing a role in these dynamics. Certainly, in the past it had been noted that suicide appears to be contagious, because (for example) in high schools where there is one suicide it is statistically more likely to see an ensuing cluster of suicides. And the “copy cat” effect in mass killings has long been suspected. I wondered if perhaps a mathematical model of contagion might be used to help quantify whether or not mass killings and school shootings are contagious. So, I talked with some colleagues:
And, we decided to use a mathematical model known as a Hawke’s point process “self-excitation” model to simulate the potential dynamics of contagion in mass killings; the idea behind the model is quite simple… there is a baseline probability (which may or may not depend on time) of a mass killing to occur by mere random chance (the dotted line, below). But, if a mass killing does occur, due to contagion it temporarily raises the probability that a similar event will occur in the near future. That probability decays exponentially:
How contagion would manifest itself in data is thus as unusual “bunching together in time” of events compared to what you would expect from just the baseline probability.
Here’s our (blurry) model:
The parameters of the model were Texcite, the average length of the excitation period, and Nsecondary, the average number of new mass killings inspired by one mass killing. N_0(t) was the baseline probability for mass killings to occur. We used statistical modelling methods to estimate N_0(t).
We needed data in order to fit the parameters of our model. From USA Today we obtained data on mass killings (four or more people killed), and from the Brady Campaign to Prevent Gun Violence, we obtained data on school shootings, and data on mass shootings (three or more people shot, not necessarily killed). Mass shootings happen very frequently in the US!
We compared how well the Hawkes model fit the data compared to a model that didn’t include self-excitation. If contagion is evident, the former will fit the data significantly better.
The fit results were:
Both mass killings and school shootings appear to be significantly contagious! And the length of the contagion period is on average around two weeks for both.
Mass shootings with more than three people shot, but less than four people killed were not contagious though.
Why? Well, mass shootings with low death counts happen on average more than once a week in the US. They happen so often, that they rarely make it past the local news. In contrast, mass shootings with high death rates, and school shootings, usually get national and even international media attention. It may likely be that widespread media attention is the vector for the contagion.
Again, this was an analysis that was made possible through the marriage of mathematical and statistical modelling methods.
Statistical and mathematical modelling skills on the job market
Quantitative and predictive analytics is a field that is growing very quickly. Statistical methods and data mining (“big data”) play a large role in predictive analytics, but the power of mathematical models is more and more being recognized as having same advantages over statistical models alone because mathematical models do not simply assume an “X causes Y” relationship, but instead can describe the complex dynamics of interacting systems. Having a tool box of skills that includes expertise in both mathematical and statistical modelling can lead to many interesting career opportunities, including consulting.
Incorporating parameter prior-information into the fit
For many models, information about the parameters and/or initial conditions can be obtained from other studies. For instance, let’s examine the seasonal influenza SIR model we have used as an example in several other modules. Our data was influenza incidence data from an influenza epidemic in the midwest, and we fit the transmission rate, beta (or alternative, R0=beta/gamma), of an SIR model to this data. For example, using the R script fit_midwest_negbinom_gamma_fixed.R
The script performs a negative binomial likelihood fit to the influenza data, assuming that the average recovery period, 1/gamma, for flu is fixed at 4.8 days. The script produces the following plot (recall that alpha is the over-dispersion parameter for the negative binomial likelihood, and t0 is the time of introduction of the virus to the population.
The script gives the best-fit estimate using the graphical Monte Carlo fmin+1/2 method, and also the weighted mean method. Note that the plots should be much better populated in order to really get trustworthy estimates from the fmin+1/2 method.
In reality, most of our parameters that we obtain from prior studies aren’t know to perfect precision
In our script above, we assumed that 1/gamma was 4.8 days based on a prior study in the literature. However, this was estimated from observational studies of sick people, and, in reality, there are statistical uncertainties associated with that estimate. In the paper describing the studies, they state that their central estimate and 95% confidence interval on 1/gamma was 4.80 [4.31,5.29] days. Unless told otherwise in the paper from which you get an estimate, you assume that the uncertainty on the parameter is Normally distributed. Because the 95% CI is +/-1.96*sigma from the mean, this implies that the std deviation width of the Normal distribution is sigma=(4.8-4.31)/1.96~0.25 days
Thus, our probability distribution for x=1/gamma is
with mu=4.8 days, and sigma = 0.25 days, in this case.
Uncertainty on “known” parameters affects the uncertainties on the other model parameters you estimate from your fit to data!
The uncertainty on 1/gamma will affect the uncertainty on our parameter estimates. For instance, is it clear that if all we know about 1/gamma was that it was between 1 and 50 days, it would be much harder to pin down our transmission rate? (ie; we had no idea what 1/gamma was, and thus had fit for gamma, as well as R0, t0, and alpha) The script fit_midwest_negbinom_gamma_unconstrained.R does this, and produces the following plot:
You can see that the influenza data we have perhaps give us a little bit of sensitivity to the parameter gamma, but not much (basically, the fit just tells us 1/gamma is somewhere between 2 to 6 days, with 95% confidence). The uncertainties on our estimates of R0 and t0 have gone way up, compared to the first fit where we assumed 1/gamma was fixed at 4.8 days! Also, when you are using the weighted mean method to estimate parameters and the parameter uncertainties, you can also get the covariance matrix for your parameter estimates. The correlation matrix, derived from the covariance matrix, for this fit looks like this:
Notice that our estimates of R0 and 1/gamma are almost 100% correlated (this means that as 1/gamma goes up, R0 also has to go up to achieve a good fit to the data). You never want to see parameters so highly correlated in fits you do… it means that your best-fit parameters likely won’t give you a model with good predictive power for a different, equivalent data set (say, influenza data for the same region for the next flu season).
So, even though we seem to have a little bit of sensitivity to the value of 1/gamma in our fit, having that estimate 100% correlated to our estimate of R0 is not good, and a sign you shouldn’t trust the results of the fit.
Incorporating uncertainties on “known” parameters from the literature in our fit likelihoods
In order to take into account the uncertainties on our “known” parameter, x, you simply modify your fit likelihood to include the likelihood coming from the probability distribution for that parameter. Thus, the negative log likelihood is modified like so:
negloglike = negloglike + 0.5*(x-mu)^2/sigma^2
Then, in the fit, you do Monte Carlo sampling not only of all your other unknown parameters (like R0, t0, and alpha in this case), but also uniformly randomly sample parameter x over a range of around approximately mu-4*sigma to mu+4*sigma.
For 1/gamma, we know that mu=4.8 days, and sigma is 0.25 days. The R script fit_midwest_negbinom_gamma_constrained.R modifies the likelihood to take into account our probability distribution for our prior estimate of 1/gamma from the literature. The script produces the following plot:
(again, for the fmin+1/2 method, we’d like to see these plots much better populated!). Note that now our uncertainties on R0 and t0 from the weighted mean method are much smaller than they were when 1/gamma was completely unconstrained, but larger than they were when 1/gamma was fixed to 4.8 days. By modifying the likelihood to take into account the probability distribution of our prior belief for 1/gamma, we now have a fit that properly feeds that uncertainty into our uncertainty on R0 and t0.
When publishing analyses that involve fits like these, it is important to take into account your prior belief probability distributions for the parameter estimates you take from the literature. In some cases, your fit might be quite sensitive to the assumed values of those parameters; if the literature estimates are a bit off from what your data would “like” them to be to obtain a good fit, and you just assume a fixed central value for the parameter, sometimes you just won’t be able to get a good fit to your data.
When you include the parameter in your fit with a modified likelihood to take into account it’s prior belief probability distribution, the estimate you get from the fit to your data is known as the “posterior” estimate. Note that the posterior estimate, and uncertainty, on 1/gamma that we obtained from fit_midwest_negbinom_gamma_constrained.R is 4.798+/-0.247, and is pretty darn close to our prior belief estimate of 4.8+/-0.25. If our data were sensitive to the value of 1/gamma, our posterior estimate would have a smaller uncertainty than the prior belief estimate, and likely have a different central value too.
In this past module, we talked about the “fmin+1/2″ method that can be used to easily estimate one standard deviation confidence intervals on parameter estimates when using the graphical Monte Carlo method to fit our model parameters to data by minimizing a negative log likelihood goodness of fit statistic.
As discussed in that module the model parameters can be estimated from the parameter hypothesis for which the negative log-likelihood statistic, f, is minimal, and the one standard deviation uncertainty on the parameters is obtained by looking at the range of the parameters for which the negative log likelihood is less than 1/2 more than the minimum value, like so:
This method has the advantage that it is easy to understand how to execute (once you’ve seen a few examples). However, we talked about the fact that this procedure is only reliable if you have many, many sweeps of the model parameter values (for instance, the above plots are pretty sparsely populated, and it would be a bad idea to trust the confidence intervals seen in them…. they are underestimated because the green arrows don’t quite reach to the edge of the parabolic envelope that encases the points).
The fmin+1/2 method also does not yield a convenient way to determine the covariance between the parameter estimates, without going through the complicated numerical gymnastics of estimating what is known as the Hessian matrix. The Hessian matrix (when maximizing a log likelihood) is a numerical approximation of the matrix of second partial derivatives of the likelihood function, evaluated at the point of the maximum likelihood estimates. Thus, it’s a measure of the steepness of the drop in the likelihood surface as you move away from the best-fit estimate.
It turns out that there is an easy, elegant way, when using the graphical Monte Carlo method, to use information coming from every single point that you sample to obtain (usually) more robust and reliable parameter estimates, and (usually) more reliable confidence intervals for the parameters.
The weighted means method
To begin to understand how this might work, first recall from the previous module that the fmin+1/2 method gives you the one standard deviation confidence interval. Recall that to get the S standard deviation confidence interval, you need to go up 0.5*S^2 from the value of fmin, and examine the range of points under that line. This means that when we plot our negative log likelihood, f, vs our parameter hypotheses, the points that lie some value X above fmin are, in effect sqrt(2*X) standard deviations away from the best-fit value. Here is what that looks like graphically:
The red lines correspond to the points that lie at fmin+1/2 (the one standard deviation confidence interval), the blue lines correspond to the points that lie at fmin+0.5*2^2=fmin+2 (the two standard deviation confidence interval), and the green lines correspond to the points that lie at fmin+0.5*3^2=fmin+4.5 (the three standard deviation confidence interval).
It should make sense to you that the points that are further away from fmin carry less information about the best-fit value compared to points that are have a likelihood close to the minimum. After all, when using the graphical Monte Carlo method, you aim to populate the graphs well enough to get a good idea of the width of the parabolic envelope in the vicinity of the best fit value.
So… if we were to take some kind of weighted average of our parameter hypotheses, giving more weight to values near the minimum in our likelihood, we should be able to estimate the approximate best-fit value.
It turns out that the weight that achieves this is intimately related to those confidence intervals we see above. If we do many Monte Carlo parameter sweeps, getting our parameter hypotheses and the corresponding negative log likelihoods, f, we can estimate our best fit values by taking the weighted average of our parameter hypotheses, weighted with weights
where dnorm is the PDF of the Normal distribution. Notice that this is maximal when f=fmin, and gets smaller and smaller as f moves away from fmin. In fact, when f=fmin+0.5*S^2 (the value that corresponds to the S std dev CI), then
So, the points that are close to giving the minimum likelihood are given a greater weight in the fit, because they are more informative as to where the minimum actually lies. The plot of w vs S is:
Thus, the further f gets away from fmin, the less weight the points are given, but they still have some weight.
It turns out that not only can these weights be used to estimate our best-fit values, they also can be used to estimate the covariance matrix of our parameter estimates. If we have two parameters (for example), and we’ve randomly sampled N_MC parameter hypotheses, we would form a N_MCx2 matrix of these sampled values, and then take the weighted covariance of that matrix. The R cov.wt() function does this.
Advantages of the weighted mean method: with this method every single point you sample gives information about the best-fit parameters and the covariance matrix for those parameter estimates. Unlike the fmin+1/2 method, where it is only those points right near the minimum value of f and at fmin+1/2 that really matter in calculating the confidence interval.
Also, using this weighted method you trivially get the estimate covariance matrix for the parameters, unlike the fmin+1/2 method where this would be much harder to achieve.
Another advantage of this method is that you don’t have to populate your plots quite as densely as you would for the fmin+1/2 method in order for it to reliably work; this is because every single point you sample is now informing the calculation of the weighted mean and weighted covariance matrix.
The disadvantage of this method is that you must uniformly randomly sample the parameters (no preferential sampling of parameters using rnorm for instance), and you must uniformly sample them over a broad enough range that it encompasses at least a three or four standard deviation confidence interval; other wise, as we’ll see, you will underestimate the parameter uncertainties).
As an example of how this works in practice, let’s return to the simple example we saw in this previous module, where we compared the performance of the fmin+1/2 method to that where we analytically calculate the Hessian to estimate the parameter uncertainties.
In the example, the model we considered was y=a*x+b, where a=0.1 and b=10, and x goes from 10 to 150, in integer increments. We simulate the stochasticity in the data by smearing with numbers drawn from the Poisson distribution with mean equal to the model prediction. Thus, an example of the simulated data look like this:
Recall that the Poisson negative log likelihood looks like this
where the y_i^obs are our data observations, and y_i^pred are our model prediction (y_i^pred = a*x_i+b).
In the example hess.R, we randomly generated many different samples of our y^obs, and then used the Monte Carlo parameter sweep method to find the values of a and b that minimize the negative log likelihood. Then we calculated the Hessian about this minimum and estimated the one-standard deviation uncertainties on a and b from the covariance matrix that is the inverse of the Hessian matrix. Recall that the square root of the diagonal elements of the covariance matrix are the parameter uncertainties.
We also did this using the fmin plus a half method, to show that If the fmin plus a half method works, its estimate of the one-standard-deviation confidence intervals should be very close to the Hessian estimate.
We can add into this exercise our weighted mean method. The R script hess_with_weighted_covariance_calculation.R does just this.
The script produces the following plot, histogramming the parameter estimates from the weighted mean method. As you can see, the estimates are unbiased, and the uncertainties on the parameters assessed by the weighted mean method are very close to those assessed by the analytic Hessian method:
The script also produces the following plot:
Notice in the top two plots that the parameter uncertainties assessed by the weighted mean method are quite close to those estimated by the Hessian method, but the uncertainties assessed by the fmin+1/2 method are always underestimates. This is because we didn’t sample that many points in our graphical Monte Carlo procedure, as can be seen in the examples in the two bottom plots; the plots are so sparsely populated, the green arrows that represent the CI’s estimated by the fmin+1/2 method don’t go all the way to the edge of the parabolic envelope.
So, even with relatively sparsely populated plots, the weighted mean method works quite well. If they are really, really sparsely populated, however, you will find that the performance of the method starts to degrade; take a look at what happens when you change nmc_iterations from 10000 in to 100 in hess_with_weighted_covariance_calculation.R:
The estimates of the parameter uncertainties still are scattered about the Hessian estimates (and the fmin+1/2 method miserably fails due to the sparsity of points). However, notice that there is quite a bit of variation in the uncertainty estimates using the weighted mean method about the red dotted line (compare to the other plot, above); the more MC iterations you have, the more closely these will cluster about the expected values (ie; the more trustworthy your parameter uncertainty estimates will be). So, don’t skimp on the MC parameter sampling iterations, even when using the weighted mean method! In general, with this method, you need to run enough MC parameter sweep iterations to get a reasonable idea of the parabolic envelope in the vicinity of the best-fit value.
One catch of this method, as mentioned above, is to ensure that you do your random uniform parameter sweeps over a broad enough area… if you sample parameters to close to the best-fit value, the weighted mean method will underestimate the confidence intervals. In general, you have to sweep at least a four standard CI. As an example of what occurs when you use too narrow a range, instead of sampling parameter a uniformly from 0.06 to 0.14, uniformly sample it from 0.09 to 0.11 in the original hess_with_weighted_covariance_calculation.R script. We now get:
You can see that the confidence intervals are now severely under-estimated by the weighted mean method.
It needs to be kept in mind that the covariance matrix returned by the weighted mean method assumes that the confidence interval is symmetrically distributed about the best-fit value. In practice, this isn’t always the case; sometimes the plots of the neg log likelihood vs parameter hypotheses, instead of looking like they have a symmetric parabolic envelope, have a highly asymmetric parabolic envelope, like this, for example:
The weighted mean method will essentially produce a one standard deviation estimate that is derived from an “average” symmetric parabola fit to the asymmetric parabola. It will tend to underestimate confidence intervals in such cases. When you have highly asymmetric parabolic envelopes in your plots of the neg log likelihood vs your parameter hypotheses, it is thus best to use the fmin+1/2 method.