The Wikipedia pages for almost all probability distributions are excellent and very comprehensive (see, for instance, the page on the Normal distribution). The Negative Binomial distribution is one of the few exceptions: for application to epidemic and biological system modelling, I do not recommend reading its Wikipedia page. Instead, one of the best sources of information on the applicability of this distribution to epidemiology/population biology is this PLoS paper on the subject: Maximum Likelihood Estimation of the Negative Binomial Dispersion Parameter for Highly Overdispersed Data, with Applications to Infectious Diseases.

As the paper discusses, the Negative Binomial distribution is the distribution that underlies the stochasticity in over-dispersed count data. Over-dispersed count data means that the data have a greater degree of stochasticity than what one would expect from the Poisson distribution. In practice, this is frequently the case for count data arising in epidemic or population dynamics due to randomness in population movements or contact rates, and/or deficiencies in the model in capturing all intricacies of the population dynamics.

Recall that for count data with underlying stochasticity described by the Poisson distribution, the mean is mu=lambda and the variance is sigma^2=lambda. The Negative Binomial distribution is instead described by two parameters, mu and alpha (note that in the PLoS paper above, m=mu, and k=1/alpha): its mean is mu, and its variance is sigma^2=mu+alpha*mu^2. Notice that when alpha>0, the variance of the Negative Binomial distribution is always greater than the variance of a Poisson distribution with the same mean, and that as alpha approaches 0 the Negative Binomial distribution approaches the Poisson distribution.

There are several other formulations of the Negative Binomial distribution, but this is the one I’ve always seen used so far in analyses of biological and epidemic count data.

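The probability of observing X counts with the Negative Binomial distribution is (with m=mu and k=1/alpha):

$$P(X \mid m,k) = \frac{\Gamma(k+X)}{\Gamma(k)\,X!}\left(\frac{m}{m+k}\right)^{X}\left(1+\frac{m}{k}\right)^{-k} \qquad (1)$$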

Recall that m is the model prediction, and depends on the model parameters.
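As a quick sanity check of this parameterisation, here is a minimal sketch that writes the probability in terms of m and k, compares it to base R's dnbinom(), and checks the mean and variance with simulated counts. The values of m and alpha are my own illustrative choices, and the sketch assumes that the size and mu arguments of dnbinom() and rnbinom() correspond to k=1/alpha and m. It is only a cross-check, not a replacement for the course helper function.

```r
# Illustrative values (assumptions, not taken from the text)
m = 50        # mean of the distribution
alpha = 0.2   # dispersion parameter
k = 1 / alpha

# Probability of observing x counts in terms of m and k, computed on
# the log scale to avoid overflow in the Gamma functions
nb_prob = function(x, m, k) {
  exp(lgamma(k + x) - lgamma(k) - lgamma(x + 1) +
      x * log(m / (m + k)) - k * log(1 + m / k))
}

x = 0:200
p_formula = nb_prob(x, m, k)
p_builtin = dnbinom(x, size = k, mu = m)
max_diff = max(abs(p_formula - p_builtin))

# Empirical check of the mean and variance using simulated counts
set.seed(12345)
y = rnbinom(100000, size = k, mu = m)

cat("Largest difference between the formula and dnbinom():", max_diff, "\n")
cat("Sample mean:", mean(y), " expected:", m, "\n")
cat("Sample variance:", var(y), " expected:", m + alpha * m^2, "\n")
```

The sample variance should land close to mu+alpha*mu^2, which is much larger than the Poisson variance of mu.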

Example comparison of Poisson distributed and over-dispersed Negative Binomially distributed data

The R function for random generation for the Negative Binomial distribution is in the same nightmare format as the Negative Binomial distribution as described on the Wikipedia page I told you not to read above because it’s pretty much incomprehensible. In the AML_course_libs.R file, I have thus put a function for you called my_rnbinom(n,m,alpha) that generates n Negative Binomially distributed random numbers with mean=m and dispersion parameter alpha.

Let’s use this function to generate some Negative Binomially distributed simulated data about some true model, and compare it to Poisson distributed data about the same model:

source("AML_course_libs.R")

##########################################
# true model for log of y is linear in x
##########################################
a = 4.00
b = 0.01
x = seq(0,100,0.01)
logy = a+b*x
ypred = exp(logy)

##########################################
# randomly generate Poisson distributed
# and Negative Binomially distributed
# simulated data about the true model
##########################################
set.seed(314300)
yobs = rpois(length(ypred),ypred)
alpha = 0.2
yobs_nb = my_rnbinom(length(ypred),ypred,alpha)

##########################################
# plot the data
##########################################
require("sfsmisc")  # provides mult.fig()
mult.fig(1)
ylim = c(min(c(yobs,yobs_nb)),max(c(yobs,yobs_nb)))
plot(x,yobs,cex=1.0,xlab="x",ylab="y",ylim=ylim)
points(x,yobs_nb,cex=1.2,col=2)
points(x,yobs,cex=1.0)
lines(x,ypred,col=3,lwd=5)
# legend colours match the plotted colours: Poisson in black (1), NB in red (2)
legend("topleft",legend=c("Poisson distributed data","Negative Binomial distributed data","True model"),col=c(1,2,3),lwd=5,cex=0.9,bty="n")

You can see that the Negative Binomially distributed data are much more broadly dispersed about the true model than the Poisson distributed data. The NB data are “over-dispersed”. In real life, nearly all count data are over-dispersed because of various confounders (many of which are often unknowable) that may result in extra variation in the data over and above the hypothesized model.

**Likelihood fitting with the Negative Binomial distribution**

If we had N data points, we would take the product of the probabilities in Eqn 1 to get the overall likelihood for the model, and the best-fit parameters maximize this statistic. And just like we discussed with the Poisson likelihood, the negative of the sum of the logs of the individual probabilities (the negative log likelihood) is the statistic that is usually used, and minimized to determine the best-fit model parameters.

Just like the Poisson likelihood fit, the Negative Binomial likelihood fit uses a log-link for the model prediction, m.
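To make the negative log likelihood concrete, here is a minimal sketch that writes it out by hand and minimizes it with optim(). The data are simulated with illustrative values of my own choosing, the helper name nb_negloglike is hypothetical, and x is rescaled to x/100 purely so the optimizer works with coefficients of similar size. In practice you would use a packaged fitting method; this only shows what is being minimized.

```r
set.seed(2024)
x = seq(0, 100, 0.1)
xs = x / 100                      # rescaled x (an assumption for numerical convenience)
m_true = exp(4 + 1 * xs)          # log link: log(m) = a + b*xs
alpha_true = 0.2
y = rnbinom(length(xs), size = 1 / alpha_true, mu = m_true)

# Negative log likelihood: minus the sum of the logs of the
# individual Negative Binomial probabilities
nb_negloglike = function(par, xs, y) {
  m = exp(par[1] + par[2] * xs)   # model prediction through the log link
  alpha = exp(par[3])             # fit log(alpha) so that alpha stays positive
  -sum(dnbinom(y, size = 1 / alpha, mu = m, log = TRUE))
}

start = c(log(mean(y)), 0, log(0.5))
fit = optim(start, nb_negloglike, xs = xs, y = y, control = list(maxit = 5000))

a_hat = fit$par[1]
b_hat = fit$par[2]
alpha_hat = exp(fit$par[3])
cat("Estimated a, b, alpha:", a_hat, b_hat, alpha_hat, "\n")
cat("True      a, b, alpha:", 4, 1, alpha_true, "\n")
```

The estimates should land near the true values used in the simulation, with the dispersion parameter estimated alongside the model coefficients.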

In practice, using a Negative Binomial likelihood fit in place of a Poisson likelihood fit with count data will result in more or less the same central estimates of the fit parameters, but the confidence intervals on the fit estimates will be larger, because the fit now takes into account the fact that the data are more dispersed (have greater stochasticity) than the Poisson model allows for.

Let’s fit to our simulated data above, to illustrate this. The “MASS” package in R has a method called glm.nb that allows you to do Negative Binomial likelihood fits. If you don’t already have it installed, install it now by typing

install.packages("MASS")

Now let’s generate some simulated data that is truly over-dispersed, but fit it with Poisson likelihood then Negative Binomial likelihood:

a = 4.00
b = 0.001
x = seq(0,100,1)
logy = a+b*x
ypred = exp(logy)
set.seed(214388)
alpha = 0.2
yobs_overdispersed = my_rnbinom(length(ypred),ypred,alpha)
require("MASS")
pois_fit = glm(yobs_overdispersed~x,family="poisson")
NB_fit = glm.nb(yobs_overdispersed~x,control=glm.control(maxit=1000))
print(summary(pois_fit))
print(summary(NB_fit))

This produces the following results for the Poisson likelihood fit:

and these results for the NB likelihood fit:

You can see that in the Poisson likelihood fit, the fit coefficient for x appears to be highly statistically significant. But in the Negative Binomial likelihood fit, the confidence intervals are much wider, and x is no longer statistically significant… the NB likelihood fit properly takes into account the extreme over-dispersion in the data, and it properly adjusts the confidence intervals.

Of course, in our true model, log(y) really does depend on x, but if the data are very over-dispersed, and/or you only have a few data points, you lose sensitivity to detect that relationship.

**Moral of this story…**

Virtually all count data you will encounter in real life are over-dispersed. In general, Negative Binomial likelihood fits are far more trustworthy to use with count data than Poisson likelihood… the confidence intervals on the fit coefficients will be correct. If you try both types of fits, and the p-values are more or less the same, you can default to the simpler Poisson fits.
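One quick way to see whether a Poisson fit is underestimating the scatter in the data is to look at the sum of the squared Pearson residuals divided by the residual degrees of freedom. This rule-of-thumb check is my own addition rather than part of the fits above, and the simulated data use illustrative values. For truly Poisson distributed data the ratio should be close to 1, and values well above 1 point to over-dispersion.

```r
set.seed(99)
x = seq(0, 100, 1)
m = exp(4 + 0.001 * x)
# over-dispersed simulated counts (alpha = 0.2, so size = 1/alpha)
y = rnbinom(length(x), size = 1 / 0.2, mu = m)

pois_fit = glm(y ~ x, family = "poisson")

# Pearson chi-square per degree of freedom
dispersion = sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
cat("Pearson chi-square per degree of freedom:", dispersion, "\n")
```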

**But never use Poisson fit because you “like the answer better” that comes out of that fit compared to the NB fit (ie; the Poisson fit gives the apparently significant result you were “hoping” for, whereas it isn’t significant in the NB fit).**

**Model selection in R when using glm.nb()**

Just like we saw with Least Squares fitting using the R lm() method, and Poisson and Binomial likelihood fits using the R glm() method, you can do model selection in multivariate fits with R glm.nb model objects using the R stepAIC() function in the “MASS” library.


Sometimes in applied statistics we would like to test whether or not two sets of data appear to be drawn from the same underlying probability distribution. Most of the time we don’t even know what that probability distribution is… it could be some crazily shaped distribution that doesn’t match a nicely behaved smooth distribution like the Normal distribution, for example.

As an example, read in the data file drug_mortality_and_2016_election_data.csv. This file contains county-level data on per capita drug mortality rates from 2006 to 2016, along with the fraction of the voters in each county that voted Republican or Democrat in the 2016 presidential election. Use the following code:

a = read.table("drug_mortality_and_2016_election_data.csv",header=T,as.is=T,sep=",")
a = subset(a,!is.na(rate_drug_death_2006_to_2016_wonder))

a_rep = subset(a,prep_2016>pdem_2016)
a_dem = subset(a,prep_2016<pdem_2016)

x1 = a_rep[,3]
x2 = a_dem[,3]
require("sfsmisc")
mult.fig(4)
breaks = seq(0,max(a$rate_drug_death_2006_to_2016_wonder),length=100)
hist(x1,col=2,main="Republican majority counties",xlab="Drug death rate",breaks=breaks,freq=F)
hist(x2,col=3,main="Democrat majority counties",xlab="Drug death rate",breaks=breaks,freq=F,add=F)

We would like to know if those two distributions are consistent with being drawn from the same underlying probability distribution.

To do this, we turn to what are known as non-parametric tests. “Non-parametric” means that we do not assume some parameterisation of the underlying probability distribution (for instance, by assuming it’s Normal).

The Kolmogorov-Smirnov test (KS test) is an example of one such test. Given two data samples, X_1 and X_2, the algorithm first sorts both samples from smallest to largest, then creates a plot that shows the cumulative fraction in each sample below each value of X (note that X_1 and X_2 data samples do not need to be the same size). The KS test then looks at the maximum distance, D, between those two curves. As an example of this with our drug mortality data, use the code:

xtot = sort(c(x1,x2))
cumsum_x1 = numeric(0)
cumsum_x2 = numeric(0)
for (i in 1:length(xtot)){
  cumsum_x1 = c(cumsum_x1,sum(x1<xtot[i]))
  cumsum_x2 = c(cumsum_x2,sum(x2<xtot[i]))
}
cumsum_x1 = cumsum_x1/length(x1)
cumsum_x2 = cumsum_x2/length(x2)

mult.fig(1)
plot(xtot,cumsum_x1,type="l",col=2,lwd=8,ylab="Cumulative fraction in each sample",xlab="Drug death rate")
lines(xtot,cumsum_x2,col=3,lwd=8)
legend("bottomright",legend=c("Republican majority counties","Democrat majority counties"),col=c(2,3),lwd=4,bty="n")

D = abs(cumsum_x1-cumsum_x2)
iind = which.max(D)
arrows(xtot[iind],cumsum_x1[iind],xtot[iind],cumsum_x2[iind],code=3,lwd=4,length=0.1)

The larger the value of D, the greater the difference between the two distributions. The KS-test bases its statistical test on the value of D.
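If you don’t have the drug mortality file to hand, the same calculation can be sketched with simulated samples. The sample sizes and distributions here are illustrative assumptions, and the sketch uses base R’s ecdf() for the cumulative fractions rather than a loop. It computes D by hand and compares it to the statistic reported by ks.test().

```r
set.seed(1)
s1 = runif(1000, 0, 10)   # illustrative sample 1
s2 = rnorm(750, 5, 2)     # illustrative sample 2 (a different size is fine)

# empirical cumulative distribution functions of each sample
F1 = ecdf(s1)
F2 = ecdf(s2)

# evaluate both curves at every observed value and take the largest gap
grid = sort(c(s1, s2))
D = max(abs(F1(grid) - F2(grid)))

k = ks.test(s1, s2)
cat("D computed by hand:  ", D, "\n")
cat("D from ks.test():    ", k$statistic, "\n")
```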

The KS-test is implemented in R with the ks.test() function. It takes as its arguments the two samples of data.

k = ks.test(x1,x2)
print(k)

It prints out the p-value testing the null hypothesis that the two samples are drawn from the same distribution.

**Difference between the KS test and the Student’s t-test**

You may be wondering why we can’t just use a two sample Student’s t-test to determine if the means of the two distributions are statistically consistent. You could of course do that, but two distributions can have dramatically different shapes and yet have the same mean.

Here’s an example using simulated data where the two sample t-test shows the means of the samples are statistically consistent, but the KS test reveals they have very different shapes:

set.seed(8119990)
x_1 = runif(1000,0,10)
x_2 = rnorm(750,5,2)
require("sfsmisc")
mult.fig(4)
hist(x_1,col=2)
hist(x_2,col=3)
k = ks.test(x_1,x_2)
t = t.test(x_1,x_2)
print(k)
print(t)

**KS test with a known probability distribution**

The KS test can also be used parametrically, comparing the distribution of data to a known probability distribution. For example, if we wanted to compare the x_1 distribution from the above example to a Uniform probability distribution between 0 and 1, we would type

k=ks.test(x_1,"punif",0,1)
print(k)

If we wanted to compare it to a Uniform probability distribution between 0 and 10:

k=ks.test(x_1,"punif",0,10)

print(k)

If we wanted to compare it to a Normal distribution with mean 5 and standard deviation 2:

k=ks.test(x_1,"pnorm",5,2)

print(k)

If we wanted to compare it to a Poisson distribution with mean 6:

k=ks.test(x_1,"ppois",6)

print(k)

You get the idea… just use the cumulative distribution function of the probability distribution you want to examine, along with the arguments for that probability distribution.

**Propagation of uncertainties**

When doing statistical analyses, given an assumed probability distribution that underlies the stochasticity in the data, you might want to apply a function to the data, then obtain the 95% confidence interval on that transformation.

As an example of this, let’s assume that we are doing an analysis of the annual per capita rate of public mass shootings (with four or more people killed) that involved high capacity firearms during the Federal Assault Weapons ban from September 13, 1994 to September 13, 2004, compared to the rate of mass shootings since the ban lapsed, to the end of 2017.

During the ban, say we observed that there were 12 mass shootings involving high capacity firearms. The ban lasted 10 years, and the average population of the US was 279.2 million people during that time.

After the ban, we observed that there were 35 mass shootings, over a period of 13.3 years, and the average US population during that time was 310.9 million people.

Let’s call the number of mass shootings N, the population P, and the length of the time period T. The annual per capita rate of mass shootings is thus N/(P*T). We would like to estimate the 95% confidence interval on our estimated per capita rate of public mass shootings. To do this, we notice that the number of mass shootings can be assumed to be Poisson distributed, because it is count data. Thus, to estimate the 95% confidence interval on our transformed data, we can generate a large number of Poisson distributed random numbers with mean N, divide those random numbers by (P*T), and then use the R quantile() function to determine the 95% confidence interval. Like so:

N = 12
T = 10
P = 279.2
vN = rpois(1000000,N)
vrate = vN/(P*T)
q = quantile(vrate,probs=c(0.025,0.975))
cat("The average annual rate per million people is:",N/(P*T),"\n")
cat("The 95% confidence interval is [",q[1],",",q[2],"]\n",sep="")
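Because dividing by (P*T) is just a rescaling, this particular interval can be cross-checked without simulation using the Poisson quantile function qpois() in base R. The sketch below makes the same assumption as the text (the counts are Poisson distributed with mean N). I have renamed T to T_years here, since T is also R’s shorthand for TRUE, and the seed is arbitrary.

```r
set.seed(1234)
N = 12
T_years = 10      # length of the period (called T in the text)
P = 279.2         # average population, in millions

# Monte Carlo interval, as in the text
vN = rpois(1000000, N)
q_mc = as.numeric(quantile(vN / (P * T_years), probs = c(0.025, 0.975)))

# the same interval straight from the Poisson quantile function
q_exact = qpois(c(0.025, 0.975), N) / (P * T_years)

cat("Monte Carlo 95% CI: [", q_mc[1], ",", q_mc[2], "]\n", sep = "")
cat("qpois() 95% CI:     [", q_exact[1], ",", q_exact[2], "]\n", sep = "")
```

The shortcut only works for simple rescalings like this one. For more complicated transformations, such as the ratio of two rates, the Monte Carlo approach is the more general tool.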

Now, how about if we wanted to determine the ratio, R, of the rate after the ban to the rate during the ban. This is N_2/(P_2*T_2) divided by N_1/(P_1*T_1). We also want to estimate the 95% confidence interval on this quantity to determine if the interval contains 1. If it does, then there is no significant difference between the two rates.

set.seed(4289588)
N_1 = 12
T_1 = 10
P_1 = 279.2
N_2 = 35
T_2 = 13.3
P_2 = 310.9
vN_1 = rpois(1000000,N_1)
vN_2 = rpois(1000000,N_2)
vR = vN_2/(P_2*T_2)
vR = vR/(vN_1/(P_1*T_1))
R = N_2/(P_2*T_2)
R = R/(N_1/(P_1*T_1))
q = round(as.numeric(quantile(vR,probs=c(0.025,0.975))),2)
cat("The ratio of the rates is:",round(R,2),"\n")
cat("The 95% confidence interval is [",q[1],",",q[2],"]\n",sep="")

We see that the 95% CI does not include the number 1, thus we reject the null hypothesis that the two rates are statistically consistent.

If we wanted to get the p-value for the difference between the rates, we could have used population standardized Poisson regression, with a factor, vperiod, on the RHS of the regression equation indicating whether or not the period was during the ban, or after it:

vN = c(N_1,N_2)
vP = c(P_1,P_2)
vT = c(T_1,T_2)
vperiod = c(1,2)
myfit = glm(vN~offset(log(vP))+offset(log(vT))+factor(vperiod),family="poisson")
print(summary(myfit))
pvalue = summary(myfit)$coef[2,4]
cat("The p-value testing the null hypothesis the rates are statistically consistent is:",pvalue,"\n")

The “intercept” coefficient is the log of the rate when vperiod=1, and the sum of the first and second coefficients is the log of the rate when vperiod=2.
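Since the second coefficient is the difference of the two log rates, its exponential should reproduce the ratio of the rates. Here is a minimal sketch checking that, and also forming an approximate 95% interval from the coefficient and its standard error. The Wald-style interval is my own addition for comparison; it relies on a Normal approximation and need not agree exactly with the Monte Carlo interval.

```r
N_1 = 12; T_1 = 10;   P_1 = 279.2
N_2 = 35; T_2 = 13.3; P_2 = 310.9

vN = c(N_1, N_2)
vP = c(P_1, P_2)
vT = c(T_1, T_2)
vperiod = c(1, 2)
myfit = glm(vN ~ offset(log(vP)) + offset(log(vT)) + factor(vperiod), family = "poisson")

beta = summary(myfit)$coef[2, 1]   # log of the ratio of the rates
se = summary(myfit)$coef[2, 2]     # its standard error

R_regression = exp(beta)
R_direct = (N_2 / (P_2 * T_2)) / (N_1 / (P_1 * T_1))

# approximate (Wald) 95% interval on the ratio
ci_wald = exp(beta + c(-1, 1) * qnorm(0.975) * se)

cat("Ratio from the regression coefficient:", R_regression, "\n")
cat("Ratio computed directly:              ", R_direct, "\n")
cat("Approximate 95% CI on the ratio: [", ci_wald[1], ",", ci_wald[2], "]\n", sep = "")
```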

In a paper, we could have just stated the regression coefficient as our result, but it is more easily understood by the average reader if instead the results are expressed as a ratio of the rates, with the 95% confidence interval on that ratio. You can still use the p-value from the population standardized regression to assess the null hypothesis that the rates are statistically consistent.


Previous modules have discussed regression using the Least Squares fit statistic, and also Poisson and Binomial (logistic) likelihood fits.

Likelihood fits can sometimes seem a bit daunting to first time users, because Least Squares is a very intuitive fit statistic (minimizing the sum of squares of the distances between the data points and the model), but likelihoods perhaps less so. It perhaps doesn’t help that it is usually not pointed out that Least Squares itself is a fit statistic derived from a likelihood expression, and minimizing the Least Squares statistic maximizes that likelihood.

Recall that the underlying assumptions of the Least Squares fitting method are that the data are Normally distributed, with the same standard deviation, sigma (ie; the data are homoskedastic), and the points are independently distributed about the true model, mu.

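This means that the probability distribution for some observed dependent variable, y, is the Normal distribution:

$$P(y \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \qquad (1)$$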

Note that mu might depend on some explanatory variables. Perhaps in a linear fashion, like (for example)

$$\mu = a + b\,x$$

Now, just like we saw in the modules on Poisson and Binomial regression, the Least Squares likelihood of observing some set of dependent variables, y_i, given predicted model values for each point, mu_i, is derived from the product of the probability densities seen in Equation 1

$$L = \prod_{i=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)$$

(note that the sigma is the same for each data point because of the homoskedasticity assumption of Least Squares). The best fit parameters in the mu_i function will maximize this likelihood. You could call this likelihood the “homoskedastic Normal likelihood”.

Recall that having a product of probabilities (all of which are between 0 and 1) can be problematic in practical computation because of underflow errors when using numerical methods to optimize the likelihood. Thus, in practice, we always take the log of both sides of the likelihood equation (the parameters in the calculation of mu that maximize the likelihood will also maximize the log likelihood).

This yields:

$$\log L = -N\log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-\mu_i)^2$$

As we discussed in the Poisson and Binomial regression modules, in practice, numerical optimization methods in the underlying guts of statistical software packages minimize goodness-of-fit statistics, rather than maximize them. For this reason, we minimize the negative log likelihood:

$$-\log L = N\log\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-\mu_i)^2$$

Notice that because sigma is the same for all points, the first term is just a constant. And you’ll recognize the second term as the Least Squares statistic divided by 2*sigma^2.

Thus, whatever model parameters that go into the calculation of mu_i that minimize the Least Squares statistic will also minimize the Normal negative log likelihood!

**And thus Least Squares fits can equivalently be thought of as a homoskedastic Normal likelihood fit.**
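You can check this equivalence numerically. The following is a minimal sketch, using simulated data with illustrative values and a hypothetical helper called normal_negloglike. It fits a straight line once with lm() and once by minimizing the Normal negative log likelihood with optim(), holding sigma fixed, and compares the fitted coefficients.

```r
set.seed(7)
x = runif(200, 0, 1)
y = 2 + 3 * x + rnorm(200, 0, 0.5)

# Least Squares fit
fit_ls = lm(y ~ x)

# Normal negative log likelihood with the same sigma for every point
normal_negloglike = function(par, x, y, sigma) {
  mu = par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit_ml = optim(c(0, 0), normal_negloglike, x = x, y = y, sigma = 0.5,
               method = "BFGS", control = list(reltol = 1e-12))

cat("Least Squares coefficients:     ", coef(fit_ls), "\n")
cat("Normal likelihood coefficients: ", fit_ml$par, "\n")
```

The value chosen for sigma only changes the constant term and the overall scale of the statistic, so any fixed positive value leads to the same best-fit coefficients.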


In previous modules, we talked about the importance of model selection; selecting the most parsimonious model with the best predictive power for a particular data set. In particular, we have discussed the R stepAIC() method, which takes as its argument an R linear model fit object from either the lm() least squares linear regression method, or the glm() generalized linear model method (with, for example, the Poisson or Binomial families).

Model selection is important because the more potential explanatory variables you put on the right hand side of the equation in a statistical model, the larger the uncertainties on the fitted coefficients, and there is a real risk of masking significant relationships to true explanatory variables if variables with no explanatory power are included on the right hand side.

Beyond this, however, is the issue of model validation; ensuring that a model has good predictive power for an independent similar set of data.

A very simple and straightforward way to do this, for example, would be to divide your data in half, and label one sample the “training sample”, and the other sample the “testing sample”. Your statistical model then gets fit to the “training sample”, and then you predict the values of the dependent variable for the testing sample using your trained model. If the model truly has good predictive power, the predicted values for the test sample will describe a significant amount of the variance in the dependent variable.

**Example of model validation with a split sample**

For this initial example, we will be doing a Least Squares fit to daily incidence data of assaults and batteries in Chicago from 2001 to 2012 (note: why is this perhaps not the best fitting method to use for these data?)

To do this study, you will need to download the files chicago_pollution.csv, chicago_weather_summary.csv, and chicago_crime_summary.csv

You will also need to download the file AML_course_libs.R that has several helper functions related to calculating things related to dates, and also the number of daylight hours by day of year, at a particular latitude.

The file chicago_crime_read_in_data_utils.R contains a function read_in_crime_weather_pollution_data() that takes as its arguments year_min and year_max that are used to select the date range. It returns a data frame with the daily ozone and particulate matter, temperature, humidity, air pressure, wind speed, assaults and batteries, thefts, and burglaries.

Download all the files and type the following code:

source("chicago_crime_read_in_data_utils.R")
mydat = read_in_crime_weather_pollution_data(2001,2012)

print(names(mydat))

####################################################
# subset the data into a training and testing
# sample
####################################################
mydat_train = subset(mydat,year<=2006)
mydat_test = subset(mydat,year>2006)

This code divides the data frame into two halves.

The contents of the data frame are:

Now let’s fit a model to the daily number of assaults in the training data that includes all the weather variables, and pollution variables, and also includes weekday as a factor, number of daylight hours, and linear trend in time:

####################################################
# fit a model with trend, weekday, daylight
# hours, weather variables, and air pollution
# variables
####################################################
model_train = lm(assault~date+
                 factor(weekday)+
                 daylight_hours+
                 temperature+
                 humidity+
                 wind+
                 air_pressure+
                 ozone+
                 particulate_matter,
                 data=mydat_train)

print(summary(model_train))
print(AIC(model_train))

mult.fig(4,main="Chicago crime data 2001 to 2012")
plot(mydat_train$date,
     mydat_train$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Training data")
lines(mydat_train$date,model_train$fit,col=2,lwd=4)

Not all of the potential explanatory variables were perhaps needed in the fit. Let’s check by doing model selection using stepAIC():

####################################################
# now do model selection based on the training data
####################################################
require("MASS")
sub_model_train = stepAIC(model_train)

print(summary(sub_model_train))
print(AIC(sub_model_train))
lines(mydat_train$date,sub_model_train$fit,col=4,lwd=2)
legend("topright",
       legend=c("Full model","Best-fit model"),
       col=c(2,4),
       lwd=4,
       bty="n",
       cex=0.8)

This produces the following plot. Which variables got dropped from the fit?

Now let’s see how well this fitted model predicts the patterns in the testing data set. We use the R predict() function to do this:

####################################################
# now predict the results for the test data
# and overlay them
####################################################
y = predict(sub_model_train,mydat_test)

plot(mydat_test$date,
     mydat_test$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Testing data")
lines(mydat_test$date,y,col=3,lwd=2)

This code produces the plot:

Well…. just visually, the model appears to do a reasonable job of predicting the next six years of data. But what we need is a quantification of how much better it fits compared to the null hypothesis model (which is just fitting a flat line).

The following code fits the null model to the test data, then “fits” the extrapolated prediction from the training model as an offset() with no intercept term. This isn’t really a fit at all, in the sense that it has no coefficients, but it does allow us to extract the AIC, and compare it to the AIC of the null model. If the extrapolated data fits the test data better than the null hypothesis model, the AIC will be smaller:

####################################################
# fit the null model to the test sample
# (just the mean)
# In order to compare this to how well the
# extrapolation of the training sample fits the
# data, "fit" that model as offset(y) but without
# an intercept (-1 on the RHS forces the fit
# to not include an intercept term)
####################################################
null_fit_test = lm(mydat_test$assault~1)
comparison_training_fit_test = lm(mydat_test$assault~offset(y)-1)

print(AIC(null_fit_test))
print(AIC(comparison_training_fit_test))

Does the extrapolated fit do a better job of fitting the data than the null model?

**Example of split sample validation with Poisson likelihood fit**

In the previous example, we used Least Squares regression to fit count data. Even though there were plenty of assaults per day (and thus the Poisson distribution approaches the Normal), this still isn’t the best method to be using with count data. It would be better if we used Poisson likelihood fits with the glm() method instead:

####################################################
# now do a Poisson likelihood fit instead
####################################################
model_train = glm(assault~date+
                  factor(weekday)+
                  daylight_hours+
                  temperature+
                  humidity+
                  wind+
                  air_pressure+
                  ozone+
                  particulate_matter,
                  data=mydat_train,
                  family="poisson")

print(summary(model_train))
print(AIC(model_train))

plot(mydat_train$date,
     mydat_train$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Training data")
lines(mydat_train$date,model_train$fit,col=2,lwd=4)

# model selection starts from the Poisson fit to the training data
sub_model_train = stepAIC(model_train)
print(summary(sub_model_train))
print(AIC(sub_model_train))

lines(mydat_train$date,sub_model_train$fit,col=4,lwd=2)
legend("topright",
       legend=c("Full model","Best-fit model"),
       col=c(2,4),
       lwd=4,
       bty="n",
       cex=0.8)

Let’s predict the results for the test data, and then get the AIC of the model with the predicted results and compare it to the null hypothesis model of just a flat line. **Note that Poisson regression has a log-link, thus you have to make sure you take the log of the predicted values in the offset() function on the RHS of the fit equation!**

####################################################
# now predict the results for the test data
# and overlay them
####################################################
y = predict(sub_model_train,mydat_test,type="response")

plot(mydat_test$date,
     mydat_test$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Testing data")
lines(mydat_test$date,y,col=3,lwd=2)

####################################################
# fit the null model to the test sample
# (just the mean)
# In order to compare this to how well the
# extrapolation of the training sample fits the
# data, "fit" that model as offset(log(y)) (the log
# is needed because of the log-link) but without
# an intercept (-1 on the RHS forces the fit
# to not include an intercept term)
####################################################
null_fit_test = glm(mydat_test$assault~1,family="poisson")
comparison_training_fit_test = glm(mydat_test$assault~offset(log(y))-1,family="poisson")

print(AIC(null_fit_test))
print(AIC(comparison_training_fit_test))
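If you want to try the offset(log(y)) comparison without downloading the Chicago files, here is a minimal self-contained sketch with simulated Poisson data. The model, sample size, and variable names are illustrative assumptions. The first half of the simulated data is used for training and the second half for testing.

```r
set.seed(42)
n = 400
dat = data.frame(x = runif(n, 0, 10))
dat$y = rpois(n, exp(1 + 0.2 * dat$x))   # true model has a log-link

train = dat[1:(n / 2), ]
test = dat[(n / 2 + 1):n, ]

# train the model, then predict the expected counts for the test sample
fit_train = glm(y ~ x, data = train, family = "poisson")
pred = predict(fit_train, newdata = test, type = "response")

# null model (just the mean) versus the extrapolated prediction;
# the offset is on the log scale because of the log-link, and -1
# removes the intercept so that nothing is refit
null_fit = glm(y ~ 1, data = test, family = "poisson")
comparison_fit = glm(y ~ offset(log(pred)) - 1, data = test, family = "poisson")

cat("AIC of the null model:          ", AIC(null_fit), "\n")
cat("AIC of the extrapolated model:  ", AIC(comparison_fit), "\n")
```

Here the simulated y really does depend on x, so the extrapolated prediction should have the smaller AIC.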

This example code produced the following plot:

Note the fit results look quite similar to the Least Squares fits we did above. This is because there were quite a few assaults per day, and thus the Poisson distribution was in the Normal regime. We probably could have gotten away with a Least Squares analysis, but it is always good to be rigorous and use the most appropriate probability distribution.

**Monte Carlo methods for model validation**

Especially for time series data, it is always a good idea to show that a model that is fit to the first half of the time series has good predictive power for the second half, which is what we did in the above example. More generally, however, it is a good idea to show that a model trained on one half of the data randomly selected from the sample usually has good predictive power for the second half.

If you randomly select half the data many times, train the model, then compare its predictive capability to the second half, you can see how often the model prediction for the second half has better predictive power than just a null model.

The issue of model selection starts to get a bit tricky, however, because the terms that might be selected for one randomly selected training sample might not be exactly the terms selected for a different randomly selected sample.

One thing you can do is do the model selection process on the full sample. Then do a repetitive process where you fit that model to the randomly selected training sample, and see how often the extrapolated model provides a better prediction for the remaining data.

In the following code, we use the R formula() function to extract the model terms for the selected model that is produced by the stepAIC function:

####################################################
# first fit to the full sample, and
# select the most parsimonious best-fit model
####################################################
model = glm(assault~date+
            factor(weekday)+
            daylight_hours+
            temperature+
            humidity+
            wind+
            air_pressure+
            ozone+
            particulate_matter,
            data=mydat,
            family="poisson")

# model selection starts from the fit to the full sample
sub_model = stepAIC(model)
sub_formula = formula(sub_model)
cat("\n")
cat("The formula of the sub model is:\n")
print(sub_formula)

Now do many Monte Carlo iterations where we fit this sub model to one half of the data, randomly selected, and test it on the second half:

####################################################
# now do many iterations, randomly selecting
# half the data for training the sub model, and
# testing on the remaining half
####################################################
vAIC_null = numeric(0)
vAIC_comparison = numeric(0)
for (iter in 1:100){
  cat("Doing iteration:",iter,100,"\n")

  iind = sample(nrow(mydat),as.integer(nrow(mydat)/2))
  mydat_train_b = mydat[iind,]
  mydat_test_b = mydat[-iind,]

  sub_model_train_b = glm(sub_formula,data=mydat_train_b,family="poisson")

  y = predict(sub_model_train_b,mydat_test_b,type="response")
  null_fit_test = glm(mydat_test_b$assault~1,family="poisson")
  comparison_training_fit_test = glm(mydat_test_b$assault~offset(log(y))-1,family="poisson")

  vAIC_null = c(vAIC_null,AIC(null_fit_test))
  vAIC_comparison = c(vAIC_comparison,AIC(comparison_training_fit_test))
}

f = sum(vAIC_comparison<vAIC_null)/length(vAIC_null)
cat("The fraction of times the predicted model did better than the null:",f,"\n")

**The R bootStepAIC package**

R has a package called bootStepAIC. Within that package, there is a method called boot.stepAIC(model,data,B) that takes as its arguments an lm or glm model object previously fit to the data. It also takes as its argument the data frame the data were fit to, and a parameter, B, that states how many iterations should be done in the procedure.

If there are N points in the data samples, what boot.stepAIC does is randomly samples N points from that data, with replacement to create what is known as a “bootstrapped” data sample. Thus, the sampled data set looks somewhat like the original data set, but with some duplicated points, and some points missing. For large data samples, the bootstrapped data set will overlap the original data set by a fraction (1-1/e)~0.632
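The (1-1/e) overlap fraction is easy to check by simulation. In this minimal sketch the value of N and the seed are arbitrary choices. It draws N row indices with replacement and counts what fraction of the original points show up at least once.

```r
set.seed(2718)
N = 100000

# bootstrap sample of row indices: N draws with replacement
iind = sample(N, N, replace = TRUE)

# fraction of the original points that appear in the bootstrapped sample
frac_unique = length(unique(iind)) / N

cat("Fraction of original points in the bootstrapped sample:", frac_unique, "\n")
cat("Expected for large N, 1-1/e:", 1 - exp(-1), "\n")
```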

For the bootstrapped data set, the boot.stepAIC performs the stepAIC procedure, to determine which explanatory variables form the most parsimonious model with best explanatory power. It then stores the information of which variables were chosen.

Then it samples another bootstrapped sample and repeats the procedure. And again, and again, and again, until the iteration limit, B, has been reached (the default is B=100 iterations of this Monte Carlo bootstrapping procedure).

At the end of the procedure, the boot.stepAIC method tells you how often each explanatory variable was selected in the stepAIC procedure. An explanatory variable that was selected 100% of the time, for example, is likely to have good explanatory power for independent data sets. If it was only selected 20% of the time, for example, it is unlikely to have good general predictive power, and is likely reflecting over-fitting to statistical fluctuations in your particular data.

require("MASS")

model = glm.nb(assault~date+
               factor(weekday)+
               daylight_hours+
               temperature+
               humidity+
               wind+
               air_pressure+
               ozone+
               particulate_matter,
               data=mydat)
if (!"bootStepAIC"%in%installed.packages()[,1]){
  install.packages("bootStepAIC")
}
require("bootStepAIC")
b = boot.stepAIC(model,mydat,B=25)
print(b)

This produces the following output (amongst other things):

We can see that all variables except for wind were selected 100% of the time out of the 25 iterations. Except for wind, the coefficient for the other explanatory variables was either consistently +’ve or consistently -’ve 100% of the time. However, when we look at how often each variable was significant to p<0.05 in the fits, humidity was not always significant, and wind never was (ignore the weekday=6 factor level… if one of the levels is always significant, you should keep that factor).

How to proceed from here is largely arbitrary… If you use large B (use at least B=100… I used a smaller value of B here just to make the code run faster) you of course should keep all variables that are selected 100% of the time, that were always significant to p<0.05, and that were consistent in the sign of their influence. As for whether or not you relax the selections to include variables that were significant in at least 80% of the iterations (for example), that is up to you. **But because this is an arbitrary selection, it’s a good idea to do a cross-check of the robustness of your analysis conclusions to an equally reasonable selection, like 70%, or 90%.** Or, rely on the selection used in past papers on the subject; for example, this paper recommends a 60% selection. The former is preferable, but the latter will get through review.

To see how this procedure is talked about in a typical publication, see this paper. Specifically, pay attention to the second to last paragraph, where it is made clear that the model validation and robustness crosschecks are an important part of the analysis, and the middle and bottom of page 8 where the bootstrapping method and model validation methods are described.

This paper gives more information about bootstrapping methods. And this book is a good one to cite on the topic of the importance of model validation.


Data that are expressed as per-capita rates are frequently encountered in the life and social sciences. Examples include the per-capita rate of incidence of a certain disease, or per-capita crime rates. In both examples, “per-capita” means “per person”. And because the per-capita rate is expressed as “per person”, it is easy to get confused and think that Binomial linear regression methods might be most appropriate, because the data appear at first blush to be expressed as a fraction of the population.

But you’d be wrong… Binomial linear regression methods are only appropriate for fractions that are strictly constrained to be between 0 and 1. When we are talking about crime rates, for example, it’s entirely possible if crime were rampant that someone might be robbed several times a year. The per-capita annual rate could thus be above 1!

When talking about per-capita rates, the data consist of a counted number per some unit time, thus regression methods like Poisson or Negative Binomial methods would be appropriate. But we have to account for the population size, M, because obviously if the population doubles, the number of cases of crime we would count per unit time would also double (if the per-capita rate stayed the same).

To take this into account in a regression analysis, we use what is called “population-standardized” regression. Recall that Poisson regression methods use a log link for the expected number of events, lambda. In population-standardized linear regression with one explanatory variable, x (for example), we thus have log(lambda) = log(M) + beta_0 + beta_1*x.

Notice that if we bring the log(M) over the LHS, we get log(lambda/M)… the per-capita rate!

Also notice that the log(M) term does not (nor should it) have a coefficient multiplying it in the fit. Thus, if your observed number of crimes are in the y vector, what you do NOT want to do is the following (recall that the glm function with family=poisson in R uses a log-link by default):

b = glm(y~log(M)+x,family="poisson")

This is because R will attempt to find some kind of best-fit coefficient for the log(M) term, when what we actually need to do is force its coefficient to be 1 to be able to interpret our output as a per-capita rate.

The way to do this in R is to use the offset() function:

b = glm(y~offset(log(M)) + x, family="poisson")

Now, in the fit the log(M) term is forced to have coefficient equal to one.
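The difference between the two calls can be seen with simulated data. This is only a sketch under stated assumptions: the population sizes, the true per-capita rate, the coefficient values, and the names b_offset and b_free are all hypothetical, chosen for illustration.

```r
# Simulate counts whose expected value is (population) x (per-capita rate),
# where the log of the per-capita rate is linear in x
set.seed(2468)
N = 200
M = round(runif(N, 1e4, 1e5))   # hypothetical population sizes
x = runif(N, 0, 1)              # hypothetical explanatory variable
beta0_true = log(0.001)         # log of the per-capita rate when x=0
beta1_true = 0.5
y = rpois(N, M*exp(beta0_true + beta1_true*x))

# population-standardized fit: the coefficient of log(M) is fixed at 1
b_offset = glm(y~offset(log(M)) + x, family="poisson")

# for comparison: log(M) entered as an ordinary covariate,
# so its coefficient is estimated rather than fixed
b_free = glm(y~log(M) + x, family="poisson")

print(coef(b_offset))   # two coefficients: intercept and x
print(coef(b_free))     # three coefficients: intercept, log(M), and x
cat("Estimated per-capita rate at x=0:", exp(coef(b_offset)[[1]]), "\n")
```

Only in the offset fit can the exponential of the intercept be read directly as a per-capita rate.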

Here is a 2000 paper by D. W. Osgood on the subject of population standardized Poisson regression.

**Example**

Let’s look at the annual per-capita rates of public mass shootings before, during, and after the Federal Assault Weapons Ban, which was in effect from September 14, 1994 to September 13, 2004. In the file mass_killings_data_1982_to_2017.csv, there is the annual number of public mass shootings from 1982 to 2017, as obtained from the Mother Jones mass shootings data base, and supplemented with a few public mass shootings they missed, as listed in the USA Today mass killings data base. Also in the file is the US population that year (in millions), as obtained from the US Census Bureau. There is also a categorical variable, lperiod, indicating whether the year was before (lperiod=0), during (lperiod=1), or after (lperiod=2) the assault weapons ban (note that lperiod=1 when the weapons ban was in place for most of that particular year). Also included in the file is the number of people killed in mass shootings each year (the last two variables aren’t used in the following example).

The following code reads in this data, and does a population standardized fit, then plots the data and fit results:

adat=read.table("mass_killings_data_1982_to_2017.csv",sep=",",as.is=T,header=T)
b = glm(number_shootings~offset(log(pop)),family="poisson",data=adat)
plot(adat$year,adat$number_shootings,cex=2,xlab="Date",ylab="Number mass shootings")
lines(adat$year,b$fit,col=2,lwd=5)

You can see that the fit estimates properly take into account that the US population went up from 1982 to 2017, and thus, even if the per-capita rate were the same over that entire period, we would expect a higher annual *number* of events in 2017 compared to 1982.

The summary of the fit from summary(b) yields

To interpret this output, we note that the glm family="poisson" fit uses a log link. **Thus we must take the exponential of the intercept term to get the average expected per-capita annual rate**. This is exp(-4.6701)=0.0094 per million people, per year.

We can compare this to what we get if we just take the average of the number of killings per million people per year: mean(adat$number_shootings/adat$pop)=0.0091. The two numbers are almost identical… and they should be!
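The reason the two numbers are close but not exactly equal is that an intercept-only population-standardized Poisson fit reproduces the total number of events divided by the total population, sum(y)/sum(pop), which is a population-weighted average of the annual per-capita rates rather than their simple mean. A minimal sketch with simulated data (the populations, the rate of 0.01, and the variable names are hypothetical; this is not the mass shootings file):

```r
# Intercept-only population-standardized Poisson fit on simulated data
set.seed(1357)
pop = seq(230, 325, length.out=36)   # hypothetical populations, in millions
y = rpois(length(pop), 0.01*pop)     # simulated annual counts

b = glm(y~offset(log(pop)), family="poisson")

rate_fit = exp(coef(b)[[1]])    # per-capita rate from the fit
rate_pooled = sum(y)/sum(pop)   # total events over total population
rate_mean = mean(y/pop)         # simple mean of the annual rates

cat("exp(intercept)   :", rate_fit, "\n")
cat("sum(y)/sum(pop)  :", rate_pooled, "\n")
cat("mean(y/pop)      :", rate_mean, "\n")
```

The first two numbers agree to within the convergence tolerance of the fit; the third is similar whenever the population does not change drastically over the period.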

In the literature, you may see people doing fits to (y/pop) using least squares regression, but that is the wrong method to use: the y are Poisson distributed, and simply dividing by pop does not magically make (y/pop) Normally distributed, so the data will not satisfy the Normality assumption of the least squares method.

**Other potential types of standardized regression**

If one assumes that the population density of some animal (or people) is the same over some study areas, but might depend on time (for example), if the researchers count the individuals, y, in places with areas, A, at distinct points in time, t, they can estimate the change in population density using area standardized regression, using a function call like:

b = glm(y~offset(log(A)) + t, family="poisson")

Note that Poisson regression is the appropriate fitting method to use, because these are count data (and may in fact involve low counts).

Another type of standardized fit might be if a researcher had counts of some organism in different samples of fluid with volumes, V, and wished to see how this might depend on some other explanatory variable.

The Binomial probability distribution is appropriate for modelling the stochasticity in data that either consist of 1’s and 0’s (where 1 represents a “success” and 0 represents a “failure”), or fractional data like the total number of “successes”, k, out of n trials.

*Note that if our data set consists of n 1’s and 0’s, k of which are 1’s, we could alternatively express our 1’s and 0’s data as k successes out of n trials.*

There are other probability distributions that can be used to model the stochasticity in fractional data, like the Beta Binomial distribution, but the Binomial probability distribution is the simplest of the probability distributions for modelling the number of successes out of n trials. The Binomial probability mass function for observing k successes out of n trials when a fraction p is expected, is
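In the notation of this section (k successes, n trials, expected fraction p), the standard form of the Binomial probability mass function is:

```latex
P(k \mid n, p) = \binom{n}{k}\, p^{k}\, (1-p)^{n-k},
\qquad
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
```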

The parameter p is our “model expectation”. It can be just a constant, or a function of some explanatory variables, and it is the expected value of k/n.

Note that if our data are 1’s and 0’s, each point could be considered a “trial”, where k could be either 1 or 0 for each data point, and n would be 1 for each data point. This special case of the Binomial distribution is known as the Bernoulli distribution.

Our predicted probability of success, p, could, in theory at least, be a linear function of some explanatory variable, x: p = beta_0 + beta_1*x.

*However, this can present problems, because p must necessarily lie between 0 and 1 (because it is a fraction), but the explanatory variable might be negative, or even if it were positive, beta_0 and beta_1 might be such that the predicted value for p lies outside 0 and 1. This is a problem!*

Partly for this reason, Binomial logistic regression generally assumes what is known as a “logit-link”. The logit of a fraction is log(p/(1-p)), also known as the log-odds, because p/(1-p) is the odds of success. It is this logit link that gives “logistic regression” its name.

Note that because p lies between 0 and 1, p/(1-p) lies in the range of 0 to infinity. This means that the logit of p (the log of the odds) lies between -infinity and +infinity. With the logit-link, we regress the logit of p on the explanatory variables. For linear regression with one explanatory variable, this looks like logit(p) = log(p/(1-p)) = beta_0 + beta_1*x.

Because the logit lies in the range of -infinity to +infinity, now it doesn’t matter if the expression on the RHS of the equation is negative… the reverse transform will always give back a value of p between 0 and 1.

Nice!

By the way, if we call logit(p)=A, then the reverse transformation to calculate p is
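Solving logit(p)=A for p gives p = exp(A)/(1+exp(A)). The short sketch below checks this numerically; the function names logit and inv_logit are illustrative (they are not from any package), while qlogis() and plogis() are the equivalent functions in base R.

```r
# logit (log-odds) and its inverse
logit = function(p) log(p/(1-p))
inv_logit = function(A) exp(A)/(1+exp(A))

p = c(0.01, 0.269, 0.5, 0.934, 0.99)
A = logit(p)
print(A)
print(inv_logit(A))   # recovers the original p

# the same transformations using base R functions
print(all.equal(A, qlogis(p)))
print(all.equal(inv_logit(A), plogis(A)))

# any value of A, negative or positive, maps back to a p between 0 and 1
print(inv_logit(c(-20, -1, 0, 1, 20)))
```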

Let’s assume that we have N data points of some observed data, k_i successes out of n_i trials, where i = 1,…,N. This could be, for example, the daily fraction of firearms the TSA detects that have a round chambered, over a period of N days: k_i is the number found each day with a round chambered, and n_i is the total number found each day. The observed fraction might, at least hypothetically, depend on time, x_i, with the logit of the expected fraction linear in x_i. In this case, our model looks like logit(p_i) = beta_0 + beta_1*x_i.

In order to fit for beta_0 and beta_1 (or whatever the parameters of our model are), we need some “goodness of fit” statistic that we can optimize to estimate our best-fit values of our model with Binomially distributed data…

**Binomial likelihood**

The likelihood of observing our N data points, k_i, out of n_i when p_i are expected for each point is the product of the individual Binomial probabilities:

The “best-fit” parameters in the functional dependence of p_i on the explanatory variable, x_i (or variables… there doesn’t need to just be one), are the parameters that maximize this likelihood.

However, just like was pointed out in our discussion of Poisson regression methods for count data, in practice, underflow problems happen when you multiply a whole bunch of likelihoods (probabilities) together, each of which is between 0 and 1. To avoid this, what is normally done is take the logarithm of both sides of Eqn 1, and what is maximized is the logarithm of the likelihood, log(L):
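Written out for the Binomial case (a reconstruction from the Binomial probability mass function and the definitions above; the first line is the product of probabilities that the text refers to as Eqn 1, and the second line is its logarithm):

```latex
L = \prod_{i=1}^{N} \binom{n_i}{k_i}\, p_i^{k_i}\, (1-p_i)^{n_i-k_i}

\log L = \sum_{i=1}^{N} \left[ \log\binom{n_i}{k_i} + k_i \log p_i + (n_i-k_i)\log(1-p_i) \right]
```

The binomial coefficient term does not depend on the model parameters, so it has no influence on which parameter values maximize the log likelihood.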

The R glm() method with family=”binomial” option allows us to fit linear models to Binomial data, using a logit link, and the method finds the model parameters that maximize the above likelihood. If the success data is in a vector, k, and the number of trials data is in a vector, n, the function call looks like this:

myfit = glm(cbind(k,n-k)~x,family="binomial")
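To make the connection to the likelihood concrete, the following sketch fits simulated data in two ways: with glm(), and by directly minimizing the Binomial negative log likelihood with the general-purpose optimizer optim(). The simulated values, the choice of 25 trials per point, and the helper name neg_log_lik are assumptions made for illustration only.

```r
# Simulate Binomial data with logit(p) = beta_0 + beta_1*x
set.seed(97531)
N = 200
x = seq(-5, 5, length.out=N)
n = rep(25, N)                  # hypothetical number of trials per point
p_true = plogis(0.5 + 0.3*x)    # plogis() is the inverse logit
k = rbinom(N, n, p_true)

# Binomial negative log likelihood as a function of (beta_0, beta_1)
neg_log_lik = function(beta){
  p = plogis(beta[1] + beta[2]*x)
  -sum(dbinom(k, n, p, log=TRUE))
}

# direct numerical minimization, starting from beta_0 = beta_1 = 0
fit_optim = optim(c(0,0), neg_log_lik, method="BFGS")

# the same model fit with glm()
myfit = glm(cbind(k,n-k)~x, family="binomial")

print(fit_optim$par)
print(coef(myfit))
```

The two sets of estimates agree closely, which is what we expect if glm() is finding the parameters that maximize the Binomial likelihood.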

The glm() binomial method can also be used with data that are a bunch of 1’s and 0’s. For our little example here, the data might be at the individual firearm level, where ‘1’ indicates that the firearm has a round chambered, and ‘0’ indicates that it doesn’t. In this case, if the vector found_with_round_chambered contains these zeros and ones for all the firearms, and the vector day_gun_found contains the day each firearm was found (relative to some start day), we can fit to this data using the function call

myfit = glm(found_with_round_chambered~day_gun_found,family="binomial")

Note that in both cases, **it is exactly the same data**, just expressed a different way (you can always aggregate the 1’s and 0’s by day to get the total number of firearms found, and the number found with a round chambered each day, for example). This duality in how you can look at logistic regression is sometimes confusing to students who have been exposed to logistic regression methods either just using 0’s and 1’s, or just using fractional data.

**Example**

Let’s simulate some Binomial data, with a trend in time. In the example described above, perhaps this might be firearms detected at TSA airport checkpoints, and determining whether they had a round chambered (“1”), or didn’t have a round chambered (“0”). In this simulated example, the logit of the fraction loaded, p, has the predicted trend

set.seed(541831)
vday = seq(0,2*365)
vlogit_of_p_predicted = -1+0.005*vday
vp_predicted = exp(vlogit_of_p_predicted)/(1+exp(vlogit_of_p_predicted))

At time vday=0, the predicted average fraction of firearms found with a round chambered is thus:

p=exp(-1)/(1+exp(-1)) = 0.269

At time vday=730, the predicted average fraction of firearms found with a round chambered is:

p=exp(-1+0.005*730)/(1+exp(-1+0.005*730)) = 0.934

Let’s simulate some data where we assume the TSA detects exactly 10 firearms per day. We’ll simulate the data at the firearm level, where we record the day each firearm was found, and if it had a round chambered:

num_guns_found_per_day = 10
wfound_with_round_chambered = numeric(0)
wday_gun_found = numeric(0)
for (i in 1:length(vday)){
  v = rbinom(num_guns_found_per_day,1,vp_predicted[i])
  wfound_with_round_chambered = c(wfound_with_round_chambered,v)
  wday_gun_found = c(wday_gun_found,rep(vday[i],num_guns_found_per_day))
}

Notice that the wfound_with_round_chambered vector contains 1’s and 0’s.

We can recast this data instead by aggregating the number found with and without a round chambered by day:

num_aggregated_ones_per_day = aggregate(wfound_with_round_chambered,by=list(wday_gun_found),FUN="sum")
num_aggregated_zeros_per_day = aggregate(1-wfound_with_round_chambered,by=list(wday_gun_found),FUN="sum")

vday = num_aggregated_ones_per_day[[1]]
vnum_found_with_round_chambered = num_aggregated_ones_per_day[[2]]
vnum_found_without_round_chambered = num_aggregated_zeros_per_day[[2]]
vnum_found = vnum_found_with_round_chambered + vnum_found_without_round_chambered

Let’s plot the simulated data

vp_observed=vnum_found_with_round_chambered/vnum_found
plot(vday,vp_observed,cex=2,xlab="Time, in days",ylab="Fraction found with a round chambered")
lines(vday,vp_predicted,lwd=4,col=2)
legend("bottomright",legend=c("Observed","Predicted"),col=c(1,2),lwd=4,bty="n")

which produces the plot:

Now let’s do a linear logistic fit using the R glm() with family=”binomial” to the individual firearm data, and then to the data aggregated by day. When looking at aggregated data, we input the data to the fit as cbind(num_successes,num_failures). This can also be expressed as cbind(k,n-k), if k is num_successes, and n is the number of trials (n=num_successes+num_failures).

Note that the event and aggregated data are exactly the same data, so they should give exactly the same fit results!

fit_to_daily_data = glm(cbind(vnum_found_with_round_chambered,vnum_found_without_round_chambered)~vday,family="binomial")
fit_to_event_data = glm(wfound_with_round_chambered~wday_gun_found,family="binomial")

print(summary(fit_to_daily_data))
print(summary(fit_to_event_data))

This produces the output:

We can plot the fit results overlaid on the data. Note that even though glm() uses the logit link, it converts the fit prediction to a probability to save you the work of doing it.

plot(vday,vp_observed,cex=2,xlab="Time, in days",ylab="Fraction found with a round chambered")
lines(vday,vp_predicted,lwd=8,col=2)
lines(vday,fit_to_daily_data$fit,lwd=4,col=4,lty=3)
legend("bottomright",legend=c("Observed","True model","Fitted model"),col=c(1,2,4),lwd=4,bty="n")

Producing the plot:

You can see that our fitted model is pretty close to the true model. This is because there are many data points (10 each day, for two years). If the data were much more sparse, we would expect to perhaps see a bit more deviation of the fitted model from the true, not because the true model is wrong (it is after all, the true model we used to simulate our data), but because with a sparse data set the fit gets more affected by stochastic variations in the data.

The script example_glm_binomial_fit.R does the above fit. You can try different values of number of guns found per day, and different model coefficients to see how it affects the simulated data and the fits.

Note that in this example we assumed a constant number of firearms found per day… we could have varied that, if we wanted, and it would not change the linear dependence of the logit of the probability of finding a firearm with a round chambered…. whether the number found per day is 1, or 100000 (or whatever), it doesn’t affect the probability of success.

**Model selection**

Just like least squares linear regression with the lm() method, or Poisson regression with the glm() method with family=”poisson”, you can use the R stepAIC() function to find the most parsimonious model that best fits the data.

**A real life Binomial logistic regression analysis example**

An inspection of the launch pad revealed large quantities of ice collecting due to unusually cold overnight Florida temperatures. NASA had no experience launching the shuttle in temperatures as cold as on the morning of Jan. 28, 1986, when the Space Shuttle Challenger was launched. The temperatures of each of the 23 previous launches had been at least 20 degrees warmer.

At the launch site, the fuel segments were assembled vertically. Field joints containing rubber O-ring seals were installed between each fuel segment. There were three O-ring seals for each of the two solid-rocket boosters.

The O-rings had never been tested in extreme cold. On the morning of the launch, the cold rubber became stiff, failing to fully seal the joint.

As the shuttle ascended, one of the seals on a booster rocket opened enough to allow a plume of exhaust to leak out. Hot gases bathed the hull of the cold external tank full of liquid oxygen and hydrogen until the tank ruptured.

At 73 seconds after liftoff, at an altitude of 9 miles (14.5 kilometers), the shuttle was torn apart by aerodynamic forces.

The two solid-rocket boosters continued flying until the NASA range safety officer destroyed them by remote control.

The crew compartment ascended to an altitude of 12.3 miles (19.8 km) before free-falling into the Atlantic Ocean, killing all aboard.

**Decision to fly based on faulty analysis of data**

This paper by Dalal, Fowlkes, and Hoadley (1989) describes the O-ring failure data from the previous launches, and the reasoning behind the decision to launch on that cold day. The data from previous launches is shown in Table 1 of that paper, and I have put it in the file oring.csv.

As it mentions in the Dalal et al paper, managers in charge of the launch decision felt that the launches with zero O-ring failures were non-informative of the risk of failure versus temperature, and thus excluded that data from their decision making process.

The following code reads in that data, and plots it against temperature:

o = read.table("oring.csv",sep=",",header=T,as.is=T)
o_without_zero = subset(o,num_failure>0)

require("sfsmisc")
mult.fig(1)
ymax = 6  # upper limit of the y axis (there are 6 O-rings in total)
plot(o$temp,o$num_failure,cex=3,xlab="Temperature",ylab="\043 of O-ring failures",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Space Shuttle O-Ring failure data for launches prior to Jan, 1986")
points(o_without_zero$temp,o_without_zero$num_failure,col="orange",cex=3)

The points in orange are the non-zero points that were used to make the decision to launch.

It is unclear what statistical acumen, if any, was used in the risk analysis that went into that decision, but it should be pointed out here that at least one person behind the scenes was very vocal about the mistake that was being made by ignoring the zero data prior to the launch.

Let’s assume, as an example, that the analysis methodology might have been at the Stats 101 level, and a least squares regression was attempted on the data **(Note: why is this in fact a completely inappropriate method to use?)**

b = lm(num_failure~temp,data=o_without_zero)
print(summary(b))
newdata = data.frame(temp=sort(c(o$temp,seq(30,70))))
ypred = predict(b,newdata,interval="predict")
cat("The expected number of O-ring failures at 31 degrees from the LS fit to num_failure>0:",ypred[newdata$temp==31,1],"\n")
ymax = 6
plot(o_without_zero$temp,o_without_zero$num_failure,cex=3,xlab="Temperature",ylab="\043 of O-ring failures",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Least Squares Fit only to non-zero data")
lines(newdata$temp,ypred[,1],col=2,lwd=8)
lines(newdata$temp,ypred[,2],col=2,lwd=4,lty=3)
lines(newdata$temp,ypred[,3],col=2,lwd=4,lty=3)
legend("topright",legend=c("Least Squares fit","95% CI on fit prediction"),col=2,lty=c(1,3),lwd=4,bty="n")

(note that using interval="predict" in the R predict() method will return not only the fit prediction, but also its 95% prediction interval, which reflects both the uncertainty on the fit estimates and the scatter of individual observations about the fit)

From the fit summary, it is apparent that there is no significant slope wrt temperature (p=0.20). Thus, from a naive analysis like this, one might conclude that there is no significantly increased risk of O-ring failure at 31 degrees compared to 60 degrees.

How about if we redo the least squares regression, but this time including the zeros:

b = lm(num_failure~temp,data=o)
print(summary(b))
ypred = predict(b,newdata,interval="predict")
cat("The expected number of O-ring failures at 31 degrees from the LS fit to num_failure:",ypred[newdata$temp==31,1],"\n")
ymax = 6
plot(o$temp,o$num_failure,cex=3,xlab="Temperature",ylab="\043 of O-ring failures",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Least Squares Fit")
lines(newdata$temp,ypred[,1],col=2,lwd=8)
lines(newdata$temp,ypred[,2],col=2,lwd=4,lty=3)
lines(newdata$temp,ypred[,3],col=2,lwd=4,lty=3)
legend("topright",legend=c("Least Squares fit","95% CI on fit prediction"),col=2,lty=c(1,3),lwd=4,bty="n")

The fit now shows a significantly negative slope (p<0.001), but the predicted number of O-ring failures at 31 degrees is less than three. Given that they already had at least one successful prior launch with two O-ring failures, this hardly looks like something to be necessarily worried about.

But wait… that fit predicts negative O-ring failures when the temperature is above around 75 degrees. That doesn’t make sense. And there are only 6 O-rings in total… if we were to extrapolate the fit to even lower temperatures, it’s clear that we would eventually predict more than 6 O-ring failures for very low temperatures.

**Doing it right with Binomial logistic regression**

The following code does the fit using Binomial logistic linear regression. You’ll need to download the file AML_course_libs.R to run this; it contains a method get_prediction_and_confidence_intervals_from_binomial_fit that estimates the 95% interval on extrapolations of a Binomial regression fit.

source("AML_course_libs.R")
b = glm(cbind(num_failure,6-num_failure)~temp,family="binomial",data=o)
ypred = get_prediction_and_confidence_intervals_from_binomial_fit(b,newdata)
cat("The expected fraction of O-ring failures at 31 degrees from the logistic fit:",ypred[newdata$temp==31,1],"\n")
ymax = 1.0
plot(o$temp,o$frac,cex=3,xlab="Temperature",ylab="Fraction of O-rings that fail",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Logistic regression")
lines(newdata$temp,ypred[,1],col=2,lwd=8)
lines(newdata$temp,ypred[,2],col=2,lwd=4,lty=3)
lines(newdata$temp,ypred[,3],col=2,lwd=4,lty=3)
legend("topright",legend=c("Logistic regression fit","95% CI on fit prediction"),col=2,lty=c(1,3),lwd=4,bty="n")

The y axis is the fraction of the O-rings that are expected to fail. The Binomial logistic regression predicts that, at 31 degrees, 96% of the 6 rings will fail (i.e., the likelihood is high that all 6 rings will fail). In fact, with 95% confidence, at least half of the rings will fail.

Beyond a statistical analysis of past launch data, however, apparently the O-rings had not been tested for flexibility at low temperatures. Richard Feynman, a Physics Nobel Prize laureate, was a member of the scientific commission that was appointed to look into the shuttle disaster. In a dramatic moment during the commission news conference, he demonstrated the inflexibility of O-rings at low temperature by pulling a deformed O-ring out of his glass of ice water.

Since the shuttle disaster, there have been other, more elaborate studies of the pre-launch O-ring data to attempt to assess the temperature dependent risk of failure…. for example, this analysis which examines the issue of model extrapolation given the large difference between the temperature of 31 degrees and all other temperatures in the past data, which were significantly warmer.

**Moral of this story**

Rare are statistical analyses we might attempt that might actually kill someone if we get it wrong. But this is an excellent case of how proper choice of analysis methods could have averted a disaster.

The Poisson probability distribution is appropriate for modelling the stochasticity in count data. Examples include the number of people per household, the number of crimes per day, or the number of Ebola cases observed in West Africa per month.

There are other probability distributions that can be used to model the stochasticity in count data, like the Negative Binomial distribution, but the Poisson probability distribution is the simplest of the discrete probability distributions. The Poisson probability mass function for observing k counts when lambda are expected is:
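In the notation of this section (k observed counts, lambda expected), the standard form of the Poisson probability mass function is:

```latex
P(k \mid \lambda) = \frac{\lambda^{k}\, e^{-\lambda}}{k!}
```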

The lambda is our “model expectation”, and it might be just a constant, or a function of some explanatory variables.

For example, perhaps we are examining how the number of crimes per day, k, might linearly depend on the daily average temperature, x. In this case, our model equation for lambda might be lambda = beta_0 + beta_1*x,

where beta_0 and beta_1 are parameters of the model. But note that temperature can be negative, which might lead to negative values of the model expectation… clearly for count data this makes no sense!

An example of how using least squares linear regression can go horribly wrong with count data for this reason is given by the following code, which reads in some count data, y, vs an explanatory variable, x, from the file example_of_how_least_squares_fits_to_count_data_can_go_wrong.csv

adat=read.table("example_of_how_least_squares_fits_to_count_data_can_go_wrong.csv",header=T,sep=",",as.is=T)
b = lm(y~x,data=adat)

# predict the fit over a wider range of x than is in the data
mydat = data.frame(x=seq(0,2,0.1))
mydat$ypred = predict(b,mydat)

require("sfsmisc")
mult.fig(1)
xmin = min(c(adat$x,mydat$x))
xmax = max(c(adat$x,mydat$x))
ymin = min(c(adat$y,mydat$ypred))
ymax = max(c(adat$y,mydat$ypred))
plot(adat$x,adat$y,xlim=c(xmin,xmax),ylim=c(ymin,ymax),xlab="x",ylab="y")
lines(adat$x,b$fit,col=2,lwd=5)  # fitted values at the observed x
lines(mydat$x,mydat$ypred,col=2,lwd=5,lty=3)
lines(c(-1e6,1e6),c(0,0),lty=3,col=4)
legend("topleft",legend=c("Count data","Fit to data","Extrapolated fit"),col=c(1,2,2),lty=c(1,1,3),lwd=6,bty="n")

This produces the following plot… you can see that the extrapolated least squares fit predicts negative counts, which is impossible!

Solution…

With Poisson regression, we thus almost always use what is known as a “log-link” where we assume that the logarithm of lambda depends on the explanatory variables… this always ensures that lambda itself is greater than zero no matter what beta_0, beta_1 or x are: log(lambda) = beta_0 + beta_1*x.

Now, we might not know what beta_0 and beta_1 are, but if we have a bunch of observations of crimes over a series of N days, k_i (with i=1,…,N), and we also have for the same days, the average daily temperature, x_i, we can fit for beta_0 and beta_1 to determine which values best describe the observed relationship between the x_i and k_i. Our model for the expected number of crimes on the i^th day is thus log(lambda_i) = beta_0 + beta_1*x_i.

Using our collected data, we’d like to somehow estimate the “best-fit” values of beta_0 and beta_1 to the data. If the number of crimes per day is low, we can’t use least squares linear regression to do this because that method assumes that the data are Normally distributed, and it is only for large values of lambda that the Poisson distribution approaches the Normal.
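How quickly the Poisson distribution approaches the Normal can be checked directly. The sketch below compares the Poisson probability mass function to a Normal density with the same mean and variance (mean lambda, standard deviation sqrt(lambda)); the helper name rel_diff and the particular values of lambda are illustrative choices.

```r
# Largest difference between the Poisson pmf and the matching Normal density,
# expressed relative to the height of the Poisson peak
rel_diff = function(lambda){
  k = 0:ceiling(lambda + 10*sqrt(lambda))
  pois = dpois(k, lambda)
  norm = dnorm(k, mean=lambda, sd=sqrt(lambda))
  max(abs(pois - norm))/max(pois)
}

vlambda = c(1, 10, 100, 1000)
vrel = sapply(vlambda, rel_diff)
print(data.frame(lambda=vlambda, max_rel_diff=vrel))
```

The relative difference is large when lambda is of order 1 and shrinks as lambda grows, which is why least squares is a poor choice for low counts.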

We thus need a “goodness of fit” statistic that is appropriate to Poisson distributed data….

**Poisson likelihood**

The likelihood (probability) of observing our data, k_i, given our model predictions for each data point, lambda_i, is the product of the probabilities of observing each data point separately:
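Written out using the Poisson probability mass function (a reconstruction from the definitions above; this product is the expression the text goes on to call Eqn 1):

```latex
L = \prod_{i=1}^{N} \frac{\lambda_i^{k_i}\, e^{-\lambda_i}}{k_i!}
```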

*Our “best fit” values of lambda_i for this model are the ones that will maximize this probability.* The least squares goodness-of-fit statistic is one that is usually quite easy for students to visualize. Likelihood fit statistics, however, are often more difficult to conceptualize because there isn’t a nice visual diagram that can explain it (like the arrows showing the distance between points and a model prediction, like we showed for least squares regression, for example).

However, for non-Normally distributed data, if you know what probability distribution underlies the data, you can write the likelihood distribution for observing a set of data by taking the product of the individual probabilities obtained from the probability distribution, just like we did above. The “best-fit” model maximizes that probability.

**Fitting for the model parameters with Poisson likelihood**

Note that probabilities that are multiplied in Eqn 1 are always between 0 and 1, and thus for a sample size of N points, Eqn 1 involves multiplying N values between 0 and 1 together. This can easily lead to underflow errors in our computation, which is a real problem for us when we try to apply this in practice. The solution to this is to take the logarithm of both sides of Eqn 1. Before we do that, here is a bit of a refresher on logarithms:
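The two standard rules of logarithms that are needed for this derivation are:

```latex
\log(ab) = \log a + \log b,
\qquad
\log\left(x^{n}\right) = n \log x
```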

That is to say, the log of a product is the sum of the logs of the terms in the product. The log of x to some power is the same as that power times the logarithm of x. In this case, we will be taking the “natural log” (which is log_e, log to the base e) of both sides of Eqn 1. The natural log of e is log(e) = 1. Taking the natural logarithm of both sides of Eqn 1 thus yields:

Poisson regression has been around for a long time, but least squares regression methods have been around longer. Finding the best-fit in least squares regression involves finding the parameters that *minimize* the least squares statistic. But finding the best-fit in Poisson regression involves finding the parameters in lambda_i that *maximize* Eqn 2. The interior gut workings of an optimization method in any statistical software package always minimize goodness of fit statistics, mostly because of the least squares legacy.
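The underflow problem, and the switch from maximizing a log likelihood to minimizing its negative, can both be seen in a short sketch. The simulated data and the variable names here are just assumptions for illustration; dpois() with log=TRUE returns the log of each Poisson probability directly:

```r
set.seed(1)
x = seq(0,100,0.1)
lambda = exp(1.5+0.05*x)
k = rpois(length(x),lambda)

# the raw product of ~1000 probabilities underflows to exactly zero
raw_likelihood = prod(dpois(k,lambda))

# the sum of the log probabilities is perfectly well behaved
log_likelihood = sum(dpois(k,lambda,log=TRUE))

# an optimizer that minimizes wants the negative of this quantity
neglog_lik = -log_likelihood

cat("product of probabilities:",raw_likelihood,"\n")
cat("log likelihood:",log_likelihood,"\n")
cat("negative log likelihood:",neglog_lik,"\n")
```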

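Because of this, we take the negative of both sides of Eqn 2, and we say that the best-fit parameters in Poisson regression *minimize the negative log likelihood:*

-log(L) = sum_{i=1}^{N} [-k_i*log(lambda_i) + lambda_i + log(k_i!)]

For the special case of our linear model for log(lambda_i) that we are considering, log(lambda_i)=beta_0+beta_1*x_i, we get:

-log(L) = sum_{i=1}^{N} [-k_i*(beta_0+beta_1*x_i) + exp(beta_0+beta_1*x_i) + log(k_i!)]    (Eqn 3)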

Given some data k_i and x_i, the “best-fit” values of beta_0 and beta_1 minimize that expression. We could, in practice, guess a whole bunch of different values for beta_0 and beta_1, plug them into Eqn 3, and narrow it down to the pair of values that gives the smallest negative log likelihood. However, principles of calculus can be used to find the best-fit values of beta_0 and beta_1 that minimize the expression in Eqn 3. Such calculus-based methods are also used in the inner workings of the R least squares linear regression lm() function, which is used when the response variable is Normally distributed. When working with a linear regression model with Poisson distributed count data, the R generalized linear model method, glm(), can be used to perform the fit using the family="poisson" option. Just like with the R least squares method, the inner workings of glm() use calculus principles, invisibly to you, to find the best-fit model parameters that minimize the Poisson negative log likelihood. If the response data (our k_i) are in a vector y, and our explanatory variable, x_i, is in a vector x, and we are fitting a linear Poisson model, the function call looks like this:

myfit = glm(y~x,family="poisson")

Note that even though a log-link hasn’t been specified for the linear model, that is in fact what the glm() model with family=poisson by default assumes.
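A quick way to convince yourself of this is to fit the same data twice, once with the default and once with the log-link spelled out explicitly. This is only a sketch on made-up simulated data (the variable names are mine); the link that was actually used is stored in the family object of the fit:

```r
set.seed(2)
x = seq(0,10,0.1)
y = rpois(length(x),exp(0.5+0.2*x))

fit_default  = glm(y~x,family="poisson")
fit_explicit = glm(y~x,family=poisson(link="log"))

print(coef(fit_default))
print(coef(fit_explicit))

# the link actually used is stored in the family object
print(family(fit_default)$link)
```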

**Example**

Let’s try fitting some simulated data with the glm() method with family="poisson". The following code randomly generates some Poisson distributed data, with a linear model with a log-link:

########################################################################

# randomly generate some Poisson distributed data according to a linear model

########################################################################

set.seed(484272)

x = seq(0,100,0.1)

intercept_true = 1.5

slope_true = 0.05

log_lambda = intercept_true+slope_true*x

pred = exp(log_lambda)

y = rpois(length(x),pred)

########################################################################

# put the data in a data frame

########################################################################

mydat=data.frame(x=x,y=y)

Now let’s fit a linear model to these simulated data, under the assumption that the stochasticity is Poisson distributed. Note that the plotting area is divided up with the mult.fig() method in the R sfsmisc library. You need to have this library installed in R to run that line of code. If you don’t have it installed, first type install.packages("sfsmisc") and pick a download site relatively close to your location.

########################################################################
# Do the model fit using glm. Note that glm() with family="poisson"
# inherently assumes a log-link to the data
########################################################################
myfit_glm = glm(y~x,family=poisson,data=mydat)
print(summary(myfit_glm))

########################################################################
# Now plot the data with the fitted values overlaid. Note that
# even though the glm() method with family=poisson assumes a
# log-link, what it spits out in the fitted.values attribute
# is the exponent of that log-link
########################################################################
require("sfsmisc")
mult.fig(4,main="Poisson simulated data, generated with the model log(y)=a+b*x")
plot(x,y,xlab="x",ylab="y",cex=2,col="darkorchid4",main="y versus x")
lines(x,myfit_glm$fitted.values,col=3,lwd=5)
legend("topleft",legend=c("Simulated data","Best-fit Poisson linear model from R glm(y~x)"),col=c("darkorchid4",3),lwd=5,bty="n",cex=0.6)

plot(x,log(y),xlab="x",ylab="log(y)",cex=2,col="darkorchid4",main="log(y) versus x")
lines(x,log(myfit_glm$fitted.values),col=3,lwd=5)

The code produces the following output:

Are the fitted linear values statistically consistent with the true values we used to simulate the data? Do a z-test to check.
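One way to carry out that check is sketched below. It regenerates the simulated data with the same seed so that it runs on its own, then forms a z statistic for each coefficient from the difference between the fitted and true values divided by the standard error reported by summary(). The variable names (est, se, z, p, ci_lo, ci_hi) are my own, and the Wald confidence intervals assume the estimates are approximately Normally distributed:

```r
set.seed(484272)
x = seq(0,100,0.1)
intercept_true = 1.5
slope_true = 0.05
y = rpois(length(x),exp(intercept_true+slope_true*x))
myfit_glm = glm(y~x,family=poisson)

# estimates and standard errors from the fit summary
est = summary(myfit_glm)$coef[,1]
se  = summary(myfit_glm)$coef[,2]

# z statistic and two-sided p-value for the difference
# between each fitted value and the true value
z = (est-c(intercept_true,slope_true))/se
p = 2*pnorm(-abs(z))

# Wald 95% confidence intervals on the coefficients
ci_lo = est-qnorm(0.975)*se
ci_hi = est+qnorm(0.975)*se
print(cbind(est,se,z,p,ci_lo,ci_hi))
```

A p-value well above your chosen cut for both coefficients indicates that the fitted values are statistically consistent with the true ones.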

If I were presenting these results in a paper, I would say something along the lines of, “y is found to be significantly associated with x (Poisson linear regression coefficient 0.0501, with 95% CI [0.0507,0.0513], p<0.001).”

**Interpretation of the output**

With a log-link linear model, log(y)=a+b*x, thus y=exp(a)*exp(b*x). You may wish to interpret the model results in terms of how a change in x from x=x_0 to x=2*x_0 changes y.

We can see from the model that if x=x_0, then y=exp(a)*exp(b*x_0), and if we double x to x’=2*x_0, then y’=exp(a)*exp(2*b*x_0). Thus the relative change in y when we double x is y’/y=exp(b*x_0).

If b*x_0 is very small, then the first order Taylor expansion gives y’/y~1+b*x_0. In fact, some readers may have been taught to use this expression for interpreting log-link Poisson regression results for the relative change in y. It needs to be stressed, however, that this interpretation only works if the product b*x_0 is small!
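To see how quickly the approximation breaks down, here is a small numerical comparison. The slope b=0.05 is borrowed from the simulated example, and the values of x_0 are arbitrary choices for illustration:

```r
b = 0.05
x_0 = c(0.1,1,10,50)

exact  = exp(b*x_0)   # exact relative change in y when x doubles from x_0
approx = 1+b*x_0      # first order Taylor approximation

print(data.frame(x_0,b_times_x_0=b*x_0,exact,approx,
                 relative_error=(exact-approx)/exact))
```

For the smallest x_0 the two expressions agree closely, while for x_0=50 (where b*x_0=2.5) the Taylor approximation badly underestimates the relative change.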

**It is not just the R glm method with family="poisson" that assumes a log-link for Poisson regression!**

I’ve put this simulated data into a file simulated_poisson_log_linear_data.csv. If you have used other statistics software packages, like SAS, Stata, SPSS, Minitab, etc, try reading this data into that package and doing a Poisson linear regression fit. Compare the output of that software package to what you got in R. The coefficients and uncertainties should be the same. And what you should note in doing this exercise is that even though those other software packages may not explicitly state that the Poisson linear regression uses a log-link, they do.

The presentation in this module is not R specific: *all Poisson linear regression uses a log-link by default.*

**Another example, with more than one explanatory variable**

Let’s look at some real data…

The file chicago_crime_summary.csv contains the daily number of crimes in Chicago, sorted by FBI Uniform Crime Reporting code, from 2001 to 2013. FBI UCR code 4 is aggravated assaults (column x4 in the file). The file chicago_weather_summary.csv contains daily average weather variables for Chicago, including temperature, humidity, air pressure, cloud cover, and precipitation. The R script AML_course_libs.R contains some helper functions, including convert_month_day_year_to_date_information(month,day,year) that converts month, day, and year to a date expressed in fractions of years.

The following R code reads in these data sets, and meshes the temperature data into the crime data set. A few days are missing temperature data, so we remove those days from the data set. If you do not have the chron library already installed in R, first install it using install.packages("chron"), and pick a download site close to your location.

require("chron")

cdat = read.table("chicago_crime_summary.csv",header=T,as.is=T,sep=",")
wdat = read.table("chicago_weather_summary.csv",header=T,as.is=T,sep=",")

cdat$jul = julian(cdat$month,cdat$day,cdat$year)
cdat$temperature = wdat$temperature[match(cdat$jul,wdat$jul)]
cdat$weekday = day.of.week(cdat$month,cdat$day,cdat$year)
cdat = subset(cdat,!is.na(cdat$temperature))

source("AML_course_libs.R")
a = convert_month_day_year_to_date_information(cdat$month,cdat$day,cdat$year)
cdat$date = a$date

To regress the daily number of assaults (the column x4 in the data frame) on temperature, we use the R glm() method with family=poisson:

myfit = glm(cdat$x4~cdat$temperature,family=poisson)

require("sfsmisc")
mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fit,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

This produces the following plot:

The fit clearly needs a linear trend in time in order to fit the data better. The following code adds that:

myfit = glm(cdat$x4~cdat$temperature+cdat$date,family=poisson)

mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fit,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

This produces the following plot:

This looks to be a better fit.

But is the stochasticity in the data really consistent with being Poisson distributed? Just like the QQ plot we made with the Least Squares regression fits to test whether or not the data were truly Normally distributed about the model hypotheses, we can make a similar set of plots, but for the Poisson distribution. The AML_course_libs.R script contains a function

overlay_expected_distribution_from_poisson_glm_fit = function(count_data,glm_model_object)

that takes as its arguments the vector of count data, and the best-fit linear model from the glm() method.

In the first part of this function, for each data point it determines the shape of the probability mass function given the model prediction for that point… it then adds these mass functions up for all the data points. When we histogram the data, we can overlay this “Poisson model expectation” curve.

The second part of the script creates a QQ plot of the quantiles of the ranked data, vs the quantiles of a simulated data set, simulated assuming the best-fit model with Poisson stochasticity. If the data truly are Poisson distributed about the model, we would expect this plot to be linear. The following code implements this function with our data and our model to produce the plot:

overlay_expected_distribution_from_poisson_glm_fit(cdat$x4,myfit)

Even though our model with temperature plus linear trend in time is a better fit to the data than the model with just temperature, you can see that the above plots show that the data aren’t quite Poisson distributed about the model predictions. In fact, the QQ plot diagnostics indicate that the distribution appears to have some evidence of fat tails. This could point to potential confounding variables we haven’t yet taken into account (like, perhaps we might consider adding weekdays or holidays as factor levels in the fit). However, the data don’t appear to be grossly over-dispersed compared to the stochasticity expected from Poisson distributed data. Here is an example of including a factor in the explanatory variables (in this case weekday):

myfit = glm(cdat$x4~cdat$temperature+cdat$date+factor(cdat$weekday),family=poisson)
print(summary(myfit))

mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fit,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

overlay_expected_distribution_from_poisson_glm_fit(cdat$x4,myfit)
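A rough numerical companion to these diagnostic plots is the Pearson dispersion statistic: the sum of the squared Pearson residuals divided by the residual degrees of freedom, which should be close to 1 if the data really are Poisson distributed about the model. The sketch below uses simulated data rather than the Chicago data so that it runs on its own, and the helper name pearson_dispersion is hypothetical (it is not part of AML_course_libs.R):

```r
pearson_dispersion = function(glm_model_object){
  sum(residuals(glm_model_object,type="pearson")^2)/df.residual(glm_model_object)
}

set.seed(3)
x = runif(1000,0,50)
mu = exp(1+0.03*x)
y_pois = rpois(length(x),mu)              # truly Poisson
y_over = rnbinom(length(x),mu=mu,size=2)  # over-dispersed (Negative Binomial)

fit_pois = glm(y_pois~x,family=poisson)
fit_over = glm(y_over~x,family=poisson)

cat("dispersion, Poisson data:       ",pearson_dispersion(fit_pois),"\n")
cat("dispersion, over-dispersed data:",pearson_dispersion(fit_over),"\n")
```

A value well above 1 is a sign that the data are over-dispersed relative to the Poisson expectation.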

**Model selection**

Just like with least squares regression, it is important to select the most parsimonious model that gives the best description of the data. Every potential explanatory variable has stochasticity associated with it, and that extra stochasticity broadens the confidence intervals on all of the fit parameters.

If those variables actually don’t have any explanatory power, that added stochasticity thus carries the risk of disguising significant relationships to truly explanatory variables.

As with least squares linear regression, we can use the Akaike Information Criterion (AIC) statistic to compare how well models fit data, with a penalization term for the number of parameters, k:

AIC = 2*k - 2*log(L)

Note that the AIC includes the negative log likelihood; the smaller the negative log likelihood, the larger the likelihood. Thus, we want the most parsimonious model with the minimum value of the AIC.
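As a sanity check on what the AIC is made of, the following sketch computes it by hand as twice the number of fitted parameters minus twice the log likelihood, and compares that to the value returned by the R AIC() function. The simulated data, the deliberately useless explanatory variable junk, and the helper name aic_by_hand are all assumptions made for illustration:

```r
set.seed(4)
x = seq(0,10,0.05)
junk = rnorm(length(x))   # hypothetical variable with no explanatory power
y = rpois(length(x),exp(1+0.2*x))

fit_small = glm(y~x,family=poisson)
fit_large = glm(y~x+junk,family=poisson)

# AIC = 2*k - 2*log(L), with k the number of fitted parameters
aic_by_hand = function(fit){
  k = length(coef(fit))
  2*k-2*as.numeric(logLik(fit))
}

cat("AIC by hand:",aic_by_hand(fit_small),aic_by_hand(fit_large),"\n")
cat("AIC from R: ",AIC(fit_small),AIC(fit_large),"\n")
```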

The R stepAIC() function does model selection based on the AIC, dropping and adding terms in the candidate model one at a time, then calculating the AIC of the sub model.

After running the above code example, make sure the R MASS library is installed, and run the following code:

require("chron")
cdat = read.table("chicago_crime_summary.csv",header=T,as.is=T,sep=",")
wdat = read.table("chicago_weather_summary.csv",header=T,as.is=T,sep=",")
cdat$jul = julian(cdat$month,cdat$day,cdat$year)
source("AML_course_libs.R")
a = convert_month_day_year_to_date_information(cdat$month,cdat$day,cdat$year)
cdat$date = a$date
cdat$temperature = wdat$temperature[match(cdat$jul,wdat$jul)]
cdat$humidity = wdat$humidity[match(cdat$jul,wdat$jul)]
cdat$pressure = wdat$pressure[match(cdat$jul,wdat$jul)]
cdat$weekday = day.of.week(cdat$month,cdat$day,cdat$year)
cdat = subset(cdat,!is.na(cdat$temperature+cdat$humidity))

myfit = glm(cdat$x4~cdat$temperature+cdat$pressure+cdat$humidity+cdat$date+factor(cdat$weekday),family=poisson)
print(summary(myfit))

require("MASS")
d = stepAIC(myfit)
print(summary(myfit))
print(summary(d))

This produces the output:

and for the sub model fit selected by stepAIC():

Notice that air pressure was dropped from the fit by stepAIC because that submodel had a lower AIC. Also notice that the standard error went down on all the other parameter estimates once air pressure was dropped.

**Some cane waving…**

When I was a lass, working on my degree in experimental particle physics, we had to do model fitting very frequently. However, while we had a Fortran package (and later, a C++ package) that performed gradient descent optimization (or other optimization methods) of some function that you fed it, we didn’t have convenient pre-packaged methods like lm() or glm() where you could just fit a linear model with one tidy line of code. Instead, we had to write the code to actually program the likelihood ourselves.

We also had to walk to school ten miles a day, barefoot, through waist deep snow, even in the summer, and it was uphill both ways.

Get off my lawn.

While it can be a pain to have to code up the actual likelihood expression, the advantage of that stone age methodology was that we had to think carefully about what kind of stochasticity underlay our data, and code up the appropriate likelihood function (or least squares expression, if the stochasticity was Normally distributed). Using canned methods in statistical software packages for doing fitting can unfortunately sometimes lead to decreased understanding of what’s really going on with the fit.

Believe it or not, particle physicists still do fitting the same way they always have, coding up the likelihood function themselves. And they probably always will. Because it is critically important when testing hypotheses that you not only have your model right (ie; accounting for all potential confounding variables, and ensuring that the functional expression of the model is appropriate), but that you also have the correct specification of the probability distribution describing stochasticity in the data. **Otherwise your p-values testing your null hypothesis are garbage.**

**Getting up close and personal with Poisson regression in R**

R has a method called optim() that finds the parameters that minimize the function you feed to it. Unlike the glm() method, which can only find the parameters of a linear model, the optim() method can find the parameters of any kind of model. To show how optim() works, let’s code up the Poisson negative log likelihood ourselves, use optim() to fit a linear model to some data, and compare the results to what we get out of the glm() method with family="poisson". The two methods should yield the same results. Describing the optim() method also gives you a better idea of what R is doing inside the guts of the glm() method. The R script poisson_and_optim.R defines the following functions that define a linear model with a log-link, and also calculate the Poisson negative log likelihood, given some data vectors x and y contained in a data frame, mydata_frame.

########################################################################
# this is the function to calculate our linear model, assuming
# a log link
########################################################################
mymodel_log_prediction = function(mydata_frame,par){
  log_model_prediction = par[1] + par[2]*mydata_frame$x
  return(log_model_prediction)
}

########################################################################
# this is a function to compute the Poisson negative log likelihood
########################################################################
poisson_neglog_likelihood_statistic = function(mydata_frame,par){
  model_log_prediction = mymodel_log_prediction(mydata_frame,par)
  # lfactorial(y) is log(y!)
  neglog_likelihood = sum(-mydata_frame$y*model_log_prediction
                          +exp(model_log_prediction)
                          +lfactorial(mydata_frame$y))
  return(neglog_likelihood)
}

Now, we need some data to fit to. The R script also has code that simulates some data with Poisson distributed stochasticity according to a linear model with a log-link (same as the first example we showed above):

########################################################################
# randomly generate some Poisson distributed data according to a linear model
########################################################################
set.seed(484272)

x = seq(0,100,0.1)
intercept_true = 1.5
slope_true = 0.05
log_lambda = intercept_true+slope_true*x
pred = exp(log_lambda)
y = rpois(length(x),pred)

########################################################################
# put the data in a data frame
########################################################################
mydat=data.frame(x=x,y=y)

Now the script does the glm() fit, and the fit using the optim() method. The two methods return the results in an entirely different format, and it takes a bit more work to extract the parameter uncertainties using the optim() method:

########################################################################
# Do the model fit using glm. Note that glm() with family="poisson"
# inherently assumes a log-link to the data
########################################################################
myfit_glm = glm(y~x,family=poisson,data=mydat)
print(summary(myfit_glm))

coef = summary(myfit_glm)$coef[,1]
ecoef = summary(myfit_glm)$coef[,2]
cat("\n")
cat("Results of the glm fit:\n")
# ecoef[1] is the uncertainty on the intercept, ecoef[2] the uncertainty on the slope
cat("Intercept fitted, uncertainty, and true:",round(coef[1],3),round(ecoef[1],5),intercept_true,"\n")
cat("Slope fitted, uncertainty, and true:",round(coef[2],3),round(ecoef[2],5),slope_true,"\n")
cat("Negative log likelihood:",-logLik(myfit_glm),"\n")
cat("\n")

########################################################################
# now do the R optim() fit
#
# The results of the fit are in much more of a primitive format
# than the results that can be extracted from an R glm() object
# For example, in order to get the parameter estimate uncertainties,
# we need to calculate the covariance matrix from the inverse of the fit
# Hessian matrix (the parameter uncertainties are the square root of the
# diagonal elements of this matrix)
# Also, if we want the best-fit estimate, we need to calculate it
# ourselves from our model function, given the best-fit parameters.
########################################################################
myfit_optim = optim(par=c(1,0),poisson_neglog_likelihood_statistic,mydata_frame=mydat,hessian=T)
log_optim_fit = mymodel_log_prediction(mydat,myfit_optim$par)

coef = myfit_optim$par
coefficient_covariance_matrix = solve(myfit_optim$hessian)
ecoef = sqrt(diag(coefficient_covariance_matrix))

cat("\n")
cat("Results of the optim fit:\n")
cat("Intercept fitted, uncertainty, and true:",round(coef[1],3),round(ecoef[1],5),intercept_true,"\n")
cat("Slope fitted, uncertainty, and true:",round(coef[2],3),round(ecoef[2],5),slope_true,"\n")
cat("Negative log likelihood:",myfit_optim$value,"\n")
cat("\n")

This produces the following output:

The following code overlays the fit results from both methods on the data:

########################################################################
# Now plot the data with the fitted values overlaid. Note that
# even though the glm() method with family=poisson assumes a
# log-link, what it spits out in the fitted.values attribute
# is the exponent of that log-link
########################################################################
require("sfsmisc")
mult.fig(4,main="Poisson simulated data, generated with the model log(y)=a+b*x")
plot(x,y,xlab="x",ylab="y",cex=2,col="darkorchid4",main="y versus x")
lines(x,myfit_glm$fitted.values,col=3,lwd=5)
lines(x,exp(log_optim_fit),col=2,lwd=1)
legend("topleft",legend=c("Simulated data","Best-fit Poisson linear model from R glm(y~x)","Best-fit Poisson linear model from R optim()"),col=c("darkorchid4",3,2),lwd=5,bty="n",cex=0.6)

plot(x,log(y),xlab="x",ylab="log(y)",cex=2,col="darkorchid4",main="log(y) versus x")
lines(x,log(myfit_glm$fitted.values),col=3,lwd=5)
lines(x,log_optim_fit,col=2,lwd=1)

In this case we just did a simple linear model fit. However, with changes to the mymodel_log_prediction() method, optim() can fit arbitrarily complicated models, including non-linear models. Unlike optim(), the glm() method cannot fit non-linear models.
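As an illustration of that flexibility, here is a sketch of fitting a model that is not linear under any link: a hypothetical saturating curve, lambda=a*x/(b+x). The model, the simulated data, the starting values, and the function names are all assumptions made up for this example; the only thing carried over from above is the pattern of handing optim() a function that returns the Poisson negative log likelihood:

```r
# hypothetical saturating model: lambda = a*x/(b+x)
# the parameters are fit on the log scale to keep a and b positive
mymodel_prediction = function(mydata_frame,par){
  a = exp(par[1])
  b = exp(par[2])
  return(a*mydata_frame$x/(b+mydata_frame$x))
}

poisson_neglog_likelihood_nonlinear = function(mydata_frame,par){
  lambda = mymodel_prediction(mydata_frame,par)
  return(-sum(dpois(mydata_frame$y,lambda,log=TRUE)))
}

set.seed(5)
x = seq(1,100,0.25)
a_true = 50
b_true = 20
mydat_nl = data.frame(x=x,y=rpois(length(x),a_true*x/(b_true+x)))

myfit_nl = optim(par=c(log(30),log(10)),poisson_neglog_likelihood_nonlinear,
                 mydata_frame=mydat_nl,hessian=T)
cat("fitted a and b:",exp(myfit_nl$par),"  true:",a_true,b_true,"\n")
```

Fitting log(a) and log(b) rather than a and b themselves is a design choice that stops the optimizer from wandering into negative values of lambda, where the Poisson likelihood is undefined.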

**Contents:**

- Student’s t-test of the mean of one sample
- Example of Student’s t-test of the mean of one sample
- Student’s t-test comparing the means of two samples
- Example of Student’s t-test comparing the means of two samples
- Limitations of the Student’s t-test
- Testing for equality of more than two means (ANOVA)
- One and two sample Z-tests

The Student t distribution arises when estimating the mean of a Normally distributed population, particularly when sample sizes are small, and the true population standard deviation is unknown.

**Using the Student’s t-test to test whether a sample mean is consistent with some value**

If we wish to test the null hypothesis that the mean of a sample of Normally distributed values is equal to mu, we use the Student’s t statistic

t = (bar(X)-mu)/(s/sqrt(n))

with degrees of freedom

nu = n-1

where bar(X) is the sample mean, s is the sample standard deviation, and n is the sample size. The R t.test(x,mu=mu) method tests the null hypothesis that the sample mean of a vector of data points, x, is equal to mu under the assumption that the data are Normally distributed.
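To connect the statistic to what t.test() reports, the following sketch computes the one sample t statistic (the difference between the sample mean and mu, divided by s/sqrt(n)), the n-1 degrees of freedom, and the two-sided p-value by hand, and prints them next to the output of t.test(). The simulated data and variable names are only for illustration:

```r
set.seed(832723)
x = rnorm(1000,0.1,0.1)
mu = 0.105

n = length(x)
t_stat = (mean(x)-mu)/(sd(x)/sqrt(n))
dof = n-1
p_value = 2*pt(-abs(t_stat),dof)   # two-sided p-value

a = t.test(x,mu=mu)
cat("by hand: t =",t_stat," df =",dof," p =",p_value,"\n")
cat("t.test:  t =",a$statistic," df =",a$parameter," p =",a$p.value,"\n")
```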

**Note that it is up to the analyst to ensure that the data are, in fact Normally distributed.** The shapiro.test(x) method in R employs the Shapiro-Wilk test to test the Normality of the data.
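As a quick illustration of that check, the sketch below applies shapiro.test() to one simulated sample that really is Normal and one that clearly is not (an Exponential sample, chosen arbitrarily for this example):

```r
set.seed(6)
x_normal = rnorm(200,0.1,0.1)
x_skewed = rexp(200,rate=1)   # clearly not Normally distributed

# the null hypothesis of the Shapiro-Wilk test is that the data are Normal,
# so a small p-value is evidence against Normality
print(shapiro.test(x_normal)$p.value)
print(shapiro.test(x_skewed)$p.value)
```

Keep in mind that failing to reject Normality is not proof of Normality, particularly for small samples where the test has little power.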

**Example of one sample t-test**

The following R code shows an example of using the R t.test() method to do a one sample t test:

set.seed(832723)
n_1 = 1000
s = 0.1
mean_1 = 0.1
x = rnorm(n_1,mean_1,s)
t.test(x,mu=0.105)

which produces the output:

**Testing whether or not means of two samples are consistent with being equal**

The independent two sample t-test tests whether or not two samples, X_1 and X_2, of Normally distributed data appear to be drawn from distributions with the same mean. If we assume that the two samples have unequal variances, the test statistic is calculated as

t = (bar(X_1)-bar(X_2))/s_12

with, under the assumption that the variances of the two samples are unequal

s_12 = sqrt(s_1^2/n_1+s_2^2/n_2)

with s_1^2 and s_2^2 being the variances of the individual samples, and n_1 and n_2 the sample sizes.

The t-distribution of the test will have degrees of freedom

nu = (s_1^2/n_1+s_2^2/n_2)^2/[(s_1^2/n_1)^2/(n_1-1)+(s_2^2/n_2)^2/(n_2-1)]

This test is also known as Welch’s t-test.

If we instead assume that the two samples have equal variances, then we have

s_12 = s_p*sqrt(1/n_1+1/n_2), with s_p^2 = [(n_1-1)*s_1^2+(n_2-1)*s_2^2]/(n_1+n_2-2)

and the test has degrees of freedom

nu = n_1+n_2-2

The R method t.test(x,y) tests the null hypothesis that two Normally distributed samples have equal means. The option var.equal=T implements the t-test under the hypothesis that the sample variances are equal.
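By default (that is, without var.equal=T) t.test(x,y) performs the unequal variance, or Welch, version of the test. The sketch below computes the Welch statistic and its degrees of freedom by hand, using the standard Welch-Satterthwaite expression, and compares them to what t.test() returns. The simulated samples and variable names are for illustration only:

```r
set.seed(832723)
x = rnorm(1000,0.1,0.1)
y = rnorm(100,0.08,0.11)

# squared standard errors of the two sample means
vx = var(x)/length(x)
vy = var(y)/length(y)

t_stat = (mean(x)-mean(y))/sqrt(vx+vy)
# Welch-Satterthwaite approximation to the degrees of freedom
dof = (vx+vy)^2/(vx^2/(length(x)-1)+vy^2/(length(y)-1))

a = t.test(x,y)
cat("by hand: t =",t_stat," df =",dof,"\n")
cat("t.test:  t =",a$statistic," df =",a$parameter,"\n")
```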

When using the var.equal=T option, it is up to the analyst to do tests to determine whether or not the variances of the two samples are in fact statistically consistent with being equal. This can be achieved with the var.test(x,y) method in R, which performs an F test on the ratio of the variances of the x and y samples (and, like the t-test, assumes the data are Normally distributed).
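A minimal usage sketch of var.test() on simulated samples (the sample sizes and standard deviations are arbitrary choices for illustration). The F statistic it reports is simply the ratio of the two sample variances:

```r
set.seed(7)
x = rnorm(1000,0.1,0.10)
y = rnorm(100,0.08,0.11)

a = var.test(x,y)
# the statistic is simply the ratio of the two sample variances
cat("F statistic:",a$statistic,"  var(x)/var(y):",var(x)/var(y),"\n")
cat("p-value:",a$p.value,"\n")
```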

**Example of two sample t-test**

The following example code shows an implementation of the two sample t-test, first with the assumption of unequal variances, then with the assumption of equal variances (which is not true for this simulated data):

set.seed(832723)
n_1 = 1000
n_2 = 100
s_1 = 0.1
s_2 = 0.11
mean_1 = 0.1
mean_2 = 0.08
x = rnorm(n_1,mean_1,s_1)
y = rnorm(n_2,mean_2,s_2)
print(t.test(x,y))
print(t.test(x,y,var.equal=T))

which produces the following output:

**Limitations of Student’s t-test**

Hypothesis testing of sample means with the Student’s t distribution assumes that the data are Normally distributed. With real data this assumption is often violated. When using some statistic (like the Student’s t statistic) that assumes some underlying probability distribution (in this case, Normally distributed data), it is incumbent upon the analyst to ensure that the data are reasonably consistent with that underlying distribution. The problem is that the Student’s t-test is usually applied with very small sample sizes, in which case it is extremely difficult to test the assumption of Normality of the data. Also, the Student’s t-test can compare at most two means; it does not lend itself to comparison of more than two samples.

**Comparing the means of more than two samples, under the assumption of equal variance**

Under the assumption that several samples have equal variance, and are Normally distributed, but with potentially different means, one way to test if the sample means are significantly different is to chain the samples together, and create a vector of factor levels that identify which sample each data point represents.

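The R aov() method compares the variance between the group means to the average variance within the groups, using the F statistic:

F = (between-group mean square)/(within-group mean square)

This is known as an Analysis of Variance (ANOVA). Essentially, the F-test p-value tests the null hypothesis that all of the group means are equal, in which case the variance of the residuals of the model (the data minus their group means) is statistically consistent with the variance of the whole sample.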

Example:

set.seed(832723)
n_1 = 1000
n_2 = 100
n_3 = 250
s = 0.1
mean_1 = 0.1
mean_2 = 0.12
mean_3 = 0.07
x = rnorm(n_1,mean_1,s)
y = rnorm(n_2,mean_2,s)
z = rnorm(n_3,mean_3,s)
vsample = c(x,y,z)
vfactor = c(rep(1,n_1),rep(2,n_2),rep(3,n_3))
a = aov(vsample~factor(vfactor))
print(summary(a))

which produces the output:

But the thing I don’t like about the aov() method is that it doesn’t give quantitative information about the means of the sample for the different factor levels. Thus, an equivalent technique that I prefer is to use the R lm() method and regress the sample on the factor levels

myfit = lm(vsample~factor(vfactor))
print(summary(myfit))

which produces the output:

Now we have some information on how the means of the factor levels differ. Note that the F statistic p-values from the lm() and aov() methods are the same.
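When reading the lm() output, it helps to know that with R's default treatment of factors the intercept is the mean of the first factor level, and the remaining coefficients are the differences of the other levels' means from that first level. The sketch below checks this against the group means computed directly with tapply(); the data are simulated in the same way as in the example above:

```r
set.seed(832723)
vsample = c(rnorm(1000,0.1,0.1),rnorm(100,0.12,0.1),rnorm(250,0.07,0.1))
vfactor = c(rep(1,1000),rep(2,100),rep(3,250))

myfit = lm(vsample~factor(vfactor))
group_means = tapply(vsample,vfactor,mean)

# the intercept is the mean of the first factor level; the other
# coefficients are differences relative to that first level
means_from_fit = coef(myfit)[1]+c(0,coef(myfit)[2:3])
print(rbind(group_means,means_from_fit))
```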

**Z test of sample mean**

If you know the true population std deviation of the data, sigma, and want to test if the mean of a sample of size n is statistically consistent with some value, mu, you can use the Z-test

Z = (bar(X)-mu)/(sigma/sqrt(n))

For a given cut on the p-value, alpha, with a two sided Z-test, we reject the null hypothesis when |Z| is greater than Z_(alpha/2), where Z_(alpha/2) is the 100*(1-alpha/2) percentile of the standard Normal distribution (for alpha=0.05, this is the 97.5th percentile, Z_(alpha/2)=1.96).

You can also do one-sided Z-tests where you test the significance of bar(X)<mu or bar(X)>mu. However, unless you have very good reason to assume some direction to the relationship, *always* do a two-sided test of significance instead.
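The one sample, two-sided Z-test needs nothing beyond base R: the statistic is the difference between the sample mean and mu divided by sigma/sqrt(n), the critical value comes from qnorm(), and the p-value from pnorm(). The following sketch uses made-up simulated data, with sigma treated as known and alpha expressed as a fraction (0.05):

```r
set.seed(8)
sigma = 0.1            # population std deviation, assumed known
mu = 0.1               # value of the mean under the null hypothesis
x = rnorm(50,0.12,sigma)

z = (mean(x)-mu)/(sigma/sqrt(length(x)))
alpha = 0.05
z_crit = qnorm(1-alpha/2)      # approximately 1.96 for alpha=0.05
p_value = 2*pnorm(-abs(z))     # two-sided p-value

cat("Z =",z," critical value =",z_crit," p-value =",p_value,"\n")
cat("reject the null at alpha =",alpha,"?",abs(z)>z_crit,"\n")
```

Comparing |Z| to the critical value and comparing the p-value to alpha are two ways of stating the same decision.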

For the two sample Z test, to compare the means of two samples when the variance is known for both, we use the statistic

Z = (bar(X_1)-bar(X_2))/sqrt(sigma_1^2/n_1+sigma_2^2/n_2)

Now, recall that for large n, the Students t distribution approaches the Normal:

**For this reason, when the sample size is large, you can equivalently do a Z-test instead of a t-test, estimating sigma from the std deviation of the sample.**

The BSDA library in R has a z.test() function that either performs a one sample Z test with z.test(x,mu=mu,sigma.x=sigma) or a two sample Z test comparing the means of two samples with z.test(x,y,sigma.x=sigma_x,sigma.y=sigma_y).

**Example of one and two sample Z-tests compared to Student t-tests**

To run the following code, you will need to have installed the BSDA library in R, using the command install.packages("BSDA"), then choosing a download site relatively close to your location.

First let’s compare the Z-test and Student’s t-test for fairly large sample sizes (they should return p-values that are quite close):

require("BSDA")
set.seed(832723)
n_1 = 1000
n_2 = 100
s_1 = 0.1
s_2 = 0.11
mean_1 = 0.1
mean_2 = 0.08
x = rnorm(n_1,mean_1,s_1)
y = rnorm(n_2,mean_2,s_2)
a=t.test(x,y)
b=z.test(x,y,sigma.x=sd(x),sigma.y=sd(y))
cat("\n")
cat("Student t test p-value: ",a$p.value,"\n")
cat("Z test p-value: ",b$p.value,"\n")

This produces the output:

Now let’s do another example, but with much smaller sample sizes, and this time let’s put the means to be equal (thus the null hypothesis is true). In this case, the Student’s t-test is the more valid test to use:

require("BSDA")
set.seed(40056)
n_1 = 3
n_2 = 5
s_1 = 1
s_2 = 1.5
mean_1 = 0
mean_2 = 0
x = rnorm(n_1,mean_1,s_1)
y = rnorm(n_2,mean_2,s_2)
a=t.test(x,y)
b=z.test(x,y,sigma.x=sd(x),sigma.y=sd(y))
cat("\n")
cat("Student t test p-value: ",a$p.value,"\n")
cat("Z test p-value: ",b$p.value,"\n")

This produces the output:

In this example the Z-test rejects the null (even though it is true), while the Student t test fails to reject it. If this were an analysis that is made “more interesting” by finding a significant difference between the X_1 and X_2 samples, you run the risk of publishing a faulty result that incorrectly rejects the null because you used an inappropriate test. In a perfect world null results should always be considered just as “interesting” as results where you reject the null. In unfortunate reality, however, researchers tend to not even try to publish null results, leading to reporting bias (the published results are heavily weighted towards results that, incorrectly or correctly, rejected the null).

And it turns out that, for the same value of the test statistic, you’ll always get a smaller p-value from the Z-test than from the Student’s t-test: in the plot above that compares the Student’s t distribution to the Z distribution, you’ll note that the Student’s t distribution has much fatter tails than the Z distribution when the degrees of freedom are small. That means, for a given value of the Z-statistic, if the number of degrees of freedom is small in calculating the sample standard deviations, the Student’s t-test is the much more “conservative” test (i.e., it always produces a larger p-value than the Z-test). Thus, if you mistakenly use the Z-test when sample sizes are small, you run the danger of incorrectly concluding a significant difference in the means when the null hypothesis is actually true.

For large sample sizes, there is negligible difference between the Z-test and Student’s t-test p-values (even though the Student’s t-test p-values will always be slightly larger). This is why you will often see Z-tests quoted in the literature for large samples.
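To get a feel for the size of this effect, here is a minimal sketch that computes the two-sided p-value for one and the same test statistic under the standard Normal distribution, and under the Student’s t distribution with different degrees of freedom. The value of the statistic (2.0) and the list of degrees of freedom are arbitrary choices I made for illustration.

```r
# Two-sided p-values for the same test statistic under the standard Normal
# distribution and under Student's t with various degrees of freedom.
stat = 2.0                          # hypothetical value of the test statistic
df   = c(2, 5, 10, 30, 100, 1000)   # degrees of freedom to compare

p_z = 2*pnorm(-abs(stat))           # Z-test p-value (does not depend on df)
p_t = 2*pt(-abs(stat), df)          # t-test p-values, one for each df

print(data.frame(df=df, p_t=p_t, p_z=p_z))
```

The t p-value is larger than the Z p-value in every row, and the gap shrinks as the degrees of freedom grow.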


Let’s begin our discussion of hypothesis testing by looking at a data point, X, which under the null hypothesis is drawn from the Normal distribution with mean 0 and standard deviation 1 (i.e., the standard Normal distribution). Recall that the standard Normal distribution is symmetric about 0, with long tails. The further we get from zero, the lower the probability. Thus, if our observed X is close to zero, it is quite plausible that it was randomly drawn from the Normal distribution. If X is far from zero, however, say… X=+4.3, the probability is low to observe such a high value of X. In fact, the probability of observing a value of X at least that high is the integral of the upper tail of the Normal distribution from X to infinity. This is called a “one tailed” test of significance. If, on the other hand, we wanted to assess the probability of observing a value of X at least that far away from zero, then we concern ourselves with the probability of observing |X| at least as large as our observed value. This is the integral of the probability distribution from -infinity to -X, plus the integral from +X to infinity. This is called a “two tailed” test of significance.
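The two tail integrals described above can be calculated in R with the pnorm() function. The following is a short sketch using the X=+4.3 example; the variable names are just ones I picked for illustration.

```r
X = 4.3   # observed value, as in the example above

# one-tailed: probability of observing a value at least as high as X
p_one_tailed = pnorm(X, mean=0, sd=1, lower.tail=FALSE)

# two-tailed: integral from -infinity to -|X|, plus integral from +|X| to infinity
p_two_tailed = pnorm(-abs(X)) + pnorm(abs(X), lower.tail=FALSE)

cat("One-tailed p-value:", p_one_tailed, "\n")
cat("Two-tailed p-value:", p_two_tailed, "\n")
```

Because the Normal distribution is symmetric, the two-tailed probability is simply twice the one-tailed probability.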

For

The p-value is the probability that we would observe data at least as extreme as ours, if the null hypothesis were true. Alpha is the probability cut-off below which we say that the observed data are improbable given the null hypothesis. Usually a cut-off of alpha=0.05 is used in analyses. When the p-value<alpha, we say that we have a “statistically significant” result.

The use of alpha=0.05 is somewhat controversial because it is arbitrary. Plus, when the null hypothesis is actually true, we will still reject it one out of 20 times on average. This means that many spurious “statistically significant” results can make it into the literature, especially if multiple tests of significance were done in the analysis, and the researchers did not correct their alpha for how many tests they did (for example, if we did 100 tests of significance in an analysis, even when the null hypothesis is actually true, on average we would find 5 of those tests yielded a “significant” result causing us to reject the null hypothesis).

Because of this problem, one psychology journal has actually banned the use of p-values in the analyses it publishes.
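The “100 tests” arithmetic is easy to check with a quick simulation. The sketch below makes some arbitrary choices of my own (a one-sample t-test on 30 standard Normal draws, repeated 100 times), and also counts how many tests pass the stricter cut-off alpha/ntests, which is the Bonferroni correction (one common way, though not the only way, to correct alpha for multiple tests).

```r
set.seed(12345)
alpha  = 0.05
ntests = 100   # number of significance tests in the "analysis"
n      = 30    # sample size in each test (arbitrary choice)

# each test asks whether the mean of n standard Normal draws differs from 0
# (it does not, so the null hypothesis is true every single time)
pvals = replicate(ntests, t.test(rnorm(n), mu=0)$p.value)

cat("Expected number of false positives:", alpha*ntests, "\n")
cat("Observed number with p<alpha:", sum(pvals<alpha), "\n")
cat("Observed number with p<alpha/ntests:", sum(pvals<alpha/ntests), "\n")
```

Try changing the seed a few times; the number of “significant” results bounces around 5, even though there is nothing to find.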

Type I error: Incorrectly rejecting the null hypothesis when it is actually true. Can be controlled by decreasing alpha. Also need to reduce alpha if doing multiple tests of significance.

Type II error: Incorrectly failing to reject the null hypothesis when it is actually false. Larger sample sizes can reduce type II errors because they give better statistical power to distinguish between null and alternate hypotheses.
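As a rough illustration of how sample size affects the type II error rate, the base R function power.t.test() calculates the power of a t-test (power is one minus the probability of a type II error). In this sketch the true difference in means (0.5), the standard deviation (1), and the sample sizes are all values I made up for the example.

```r
n = c(5, 10, 20, 50, 100)   # per-group sample sizes to compare

# power of a two-sample, two-sided t-test to detect a true difference
# in means of 0.5 when the standard deviation is 1 and alpha=0.05
power = sapply(n, function(k){
  power.t.test(n=k, delta=0.5, sd=1, sig.level=0.05)$power
})
type_II = 1 - power

print(data.frame(n=n, power=round(power,3), type_II_error=round(type_II,3)))
```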

Example:

Null hypothesis (H0): “The person on trial is innocent.”

A type I error occurs when convicting an innocent person (a miscarriage of justice). “Beyond a reasonable doubt” is an attempt to make alpha in trials as small as possible to reduce the probability of rejecting this null when it is actually true.

A type II error occurs when letting a guilty person go free (an error of impunity).

A positive correct outcome occurs when convicting a guilty person. A negative correct outcome occurs when letting an innocent person go free.


When doing quantitative research in the life and social sciences and working with data, it is often necessary to mesh two or more disparate sources of data together in order to study a research question. Even though both sets of data might ostensibly cover the same geographic locales (for example, state-level or county-level data), or the same date range, some data might be missing in one data set for some locales or times, which presents some extra difficulties in trying to mesh the data sets together. Even if two or more data sets cover the same locales or times, they may be sorted in a different order, which means there isn’t a direct one-to-one crosswalk between the rows of one data set and the rows of another.

Other sources of difficulty in meshing together data sets might be that one data set might contain information for both locales and dates, but the other data sets of interest might just be for locales at specific dates, or by date at specific locales.

Some tips for the meshing process: always start by downloading and reading in each of the data sets separately. Do an initial exploratory analysis on each of the data sets, calculating averages, making simple plots, etc to ensure the data actually appear to be what you are expecting them to be. If the data sets are complicated, with a lot of fields, I find it helpful to preprocess each data set and produce a simpler preprocessed summary file that will make future analyses with the data easier and faster.

**Example**

As an example of this, we will mesh diabetes incidence data from the CDC from 2004 to 2013 with socioeconomic data from the US Census Bureau.

**Diabetes data**

Let’s begin with the diabetes incidence data. The CDC makes obesity prevalence and diabetes prevalence and incidence data at the county level available off of its County Data Indicators website. Navigating to the site, you should see something like this:

Click on the “Diagnosed Diabetes Incidence” tab to expand it:

Click on “All States” to download the Excel file.

Some R packages exist for reading Excel files into R, but I have not had much luck with them because they always seem to require libraries that are defunct, or the packages have bugs, etc etc (in fact, the link above which describes the packages notes many of these problems). By far the best solution I have found, and that is also recommended in the link above, is to open the Excel file in Excel, and then under File->Save As, click on .csv format in the dropdown menu.

Before you close off the Excel file, take a note of its columns. The columns contain the state name, the county Federal Information Processing Standard (FIPS) unique identifier for each county, the county name, and then for years 2004 to 2013 the number of new diabetes cases diagnosed each year, the rate per 1,000 people, and the lower and upper range of the 95% confidence interval on the rate estimate. The CDC obtained this confidence interval from the number of observed cases and population size by using a function similar to binom.test() in R.
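To show what such a confidence interval calculation looks like, here is a small sketch with binom.test(), which returns an exact (Clopper-Pearson) interval. The number of cases and the population size are made up for illustration, and I don’t know that this is exactly the method the CDC used.

```r
cases = 150      # hypothetical number of new diabetes cases in a county
pop   = 20000    # hypothetical county population

fit = binom.test(cases, pop)         # exact Binomial test and confidence interval

rate_per_1000 = 1000*cases/pop
ci_per_1000   = 1000*fit$conf.int    # 95% confidence interval, per 1,000 people

cat("Rate per 1000:", rate_per_1000, "\n")
cat("95% CI per 1000:", ci_per_1000[1], "to", ci_per_1000[2], "\n")
```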

On my computer, saving the Excel file in csv format produced the file INCIDENCE_ALL_STATES.csv. Once you do this, you have to visually inspect the file in a text editor to see if the header line is split across multiple lines. In this case, it is split across two lines. To read this file into R, we thus need to skip the first line:

a = read.table("INCIDENCE_ALL_STATES.csv",sep=",",header=T,as.is=T,skip=1)

Examining the column names of our data frame yields a list like the following (I couldn’t fit the entire list on my screen to take a screen shot):

Note that there are a number of columns with a name like “rate.per.1000”. Recall from the inspection of the Excel file that the first such column is the 2004 data, then the next is the 2005 data, and so on to the 2013 data.

Try histogramming the rate for 2004

hist(a$rate.per.1000)

You will note that you get the error message “Error in hist.default(a$rate.per.1000) : ‘x’ must be numeric”. To view the data in that column, type

a$rate.per.1000

You’ll notice that R thinks the data consist of strings, rather than numeric. This is because for some counties, the data entry is “No Data”. In order to convert the strings to numeric, type

a$rate.per.1000 = as.numeric(a$rate.per.1000)

The entries with “No Data” will now be NA, and the other entries will be converted to numeric. If we take the mean, you’ll notice that it will be equal to “NA”… this is because we have to tell the mean() function in R to ignore the NA values, and only calculate the mean from the defined values, using the na.rm=T option.

mean(a$rate.per.1000,na.rm=T)

Now, if you type

hist(a$rate.per.1000,col=2,xlab="Rate of diabetes incidence per 1000 population",main="2004 county-level data")

you will get the histogram

What you’re looking for here are any strange outliers (there don’t appear to be any). You also want to check if the data values are more or less what you expect. From the mean() of our values, we see the average incidence in 2004 is about 10/1000, or 1%. From the diabetes.org website, their latest report says that around 9% of Americans are living with diabetes (i.e., the diabetes prevalence). It is roughly plausible that perhaps 1/10 people living with diabetes are newly diagnosed each year.

We need to do this type of exploratory analysis for all the columns of interest in the data file. In the R script, diabetes.R, I do this for the 2004 to 2013 data. The R script produces the following plot:

The R script also produces the following output:

The data for all years look more or less similar, and reasonable. If the data had outliers, I would look at the counties for which there were outliers, and try to track down what the true value should be by doing an Internet search. Sometimes outliers are caused by mis-transcription of data. Sometimes they actually are true outliers (!)

The script puts the fips, year, and rate information into vectors, then creates a data table that is written out into a summary file preprocessed_diabetes_incidence_by_county_by_year_2004_to_2013.csv

Making such summary files is often useful in order to get the data in a nice format for further analyses.
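The preprocessing step described above can be sketched as follows. This is not the diabetes.R script itself; it is a toy version using a small made-up data frame with just two years, and the column names in the toy “wide” data frame (FIPS.Codes, rate.per.1000, rate.per.1000.1) are assumptions I made for illustration.

```r
# toy stand-in for the CDC file: one row per county, one rate column per year
wide = data.frame(FIPS.Codes      = c(1001, 1003, 1005),
                  rate.per.1000   = c("9.5", "No Data", "11.2"),   # 2004
                  rate.per.1000.1 = c("9.9", "10.4", "No Data"),   # 2005
                  stringsAsFactors = FALSE)
years     = c(2004, 2005)
rate_cols = c("rate.per.1000", "rate.per.1000.1")

# fill vectors of fips, year, and rate, one year at a time
vfips = numeric(0)
vyear = numeric(0)
vrate = numeric(0)
for (k in seq_along(years)){
  vfips = c(vfips, wide$FIPS.Codes)
  vyear = c(vyear, rep(years[k], nrow(wide)))
  # "No Data" entries become NA (as.numeric warns about this, so silence it)
  vrate = c(vrate, suppressWarnings(as.numeric(wide[[rate_cols[k]]])))
}

# one row per county per year, dropping rows with no data
long = data.frame(fips=vfips, year=vyear, diabetes_rate=vrate)
long = subset(long, !is.na(diabetes_rate))
print(long)
```

The resulting data frame could then be written out with write.table(long,"some_file.csv",sep=",",row.names=F).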

**Socio-economic data**

The US Census Bureau American FactFinder database has data on a wide variety of socio-economic demographics:

On the site, click on “Advanced Search->Show Me All”

Click on “Geographies”, and then “County”, “All Counties Within United States”, and then “ADD TO YOUR SELECTIONS”. You can then close out that selection window by clicking “Close X” at its top right corner:

Notice that in the “Your Selections” box on the upper left hand corner it states that you are now searching for data by county.

Now, let’s look for data related to poverty by county. We are ultimately going to test whether or not there appears to be an association between poverty rates and diabetes incidence in the population. In the “Refine your search results” box, type “poverty” (without the quotes). The following list will come up:

The acronym “ACS” refers to the US Census Bureau American Community Survey. They provide 5 year, 3 year, and 1 year running averages of various socioeconomic demographics. We want the one year averages. Click on sample S1701 “Poverty status in the last 12 months”. It brings up (note that I can’t fit all the rows on the screen for the screen shot):

There are a lot of goodies in this table. Not only is there information on poverty rates, but also information by age, race and ethnicity, employment status, etc. To download the data for a specific year, click on the year to load the table. Let’s do 2013. Once the table shows, click on the “Download” tab at the upper center part of the screen, and click on “Use the data”, then “OK”

Now click “Download” to complete the download process to your computer:

This will download a compressed folder with the data. On my computer, this folder is called ACS_13_1YR_S1701. “S1701” is the name of the data set, “1YR” indicates that it is the one year averages, and “13” indicates that the data are for 2013.

Inside that folder, there are several files. ACS_13_1YR_S1701_with_ann.csv contains the data of interest, but if you look at it in a text editor, you will see the column names are rather inscrutable, and there appear to be two lines of header information. The file ACS_13_1YR_S1701_metadata.csv contains the information on what each of the columns means. Move these two files to your working directory.

If you try to read ACS_13_1YR_S1701_with_ann.csv into R as is, R will complain that there are more columns than column names. Skipping the first row, like we did with the diabetes data set, won’t help. It is that second line of header information that is the problem. It contains extra commas in the quotes that mess R up when it tries to read in the data. We can proceed one of two ways… edit the file to comment that second line out with a # so that R ignores it, or, use the following code to make R skip that second line:

    all_content = readLines("ACS_13_1YR_S1701_with_ann.csv")
    skip_second = all_content[-2]
    b = read.csv(textConnection(skip_second), header = TRUE, stringsAsFactors = FALSE)

Typing names(b) yields (note, I couldn’t fit the entire list on the screen to take the screen shot):

To reiterate, it is the ACS_13_1YR_S1701_metadata.csv file that describes what each of these many columns are. I usually open this with a text editor, and determine which column name is the information I’m interested in. For example, opening this in a text editor shows:

GEO.id,Id

GEO.id2,Id2

GEO.display-label,Geography

HC01_EST_VC01,Total; Estimate; Population for whom poverty status is determined

HC01_MOE_VC01,Total; Margin of Error; Population for whom poverty status is determined

HC02_EST_VC01,Below poverty level; Estimate; Population for whom poverty status is determined

HC02_MOE_VC01,Below poverty level; Margin of Error; Population for whom poverty status is determined

HC03_EST_VC01,Percent below poverty level; Estimate; Population for whom poverty status is determined

HC03_MOE_VC01,Percent below poverty level; Margin of Error; Population for whom poverty status is determined

I can see that the column I’m interested in is named HC03_EST_VC01. Note that you cannot assume that this column name will always correspond to the percentage of the population in poverty for all years in the S1701 series of ACS data. You have to check for each year!

The column GEO.id2 is the FIPS code for each county.

The following lines of code read in the data, make a summary data file that is less inscrutable than the original file, and histogram the poverty rates.

    require("sfsmisc")
    ######################################################
    # the second line in the ACS files often makes it problematic to read the
    # file in with read.table or read.csv. The following three lines
    # of code tell R to skip the second line when reading in the file
    ######################################################
    all_content = readLines("ACS_13_1YR_S1701_with_ann.csv")
    skip_second = all_content[-2]
    b = read.csv(textConnection(skip_second), header = TRUE, stringsAsFactors = FALSE)

    wfips = b$GEO.id2
    wpoverty = b$HC03_EST_VC01
    mult.fig(1)
    hist(wpoverty,col="darkviolet",xlab="Poverty rate",main="2013 ACS poverty rate data")
    vdat = data.frame(fips=wfips,poverty=wpoverty)
    write.table(vdat,"preprocessed_poverty_rates_by_county_2013.csv",sep=",",row.names=F)

The output can be found in preprocessed_poverty_rates_by_county_2013.csv. The code produces the following plot:

All of the values look reasonable, and there do not appear to be any unusual outliers.

**Bringing it together**

Now we would like to examine the diabetes incidence and poverty rate data to determine if they appear to be related.

Note that there are over 3,000 counties in the US, but there are not that many counties in either the diabetes or poverty data sets. The one year American Community Survey data is usually much more limited than the 5 year survey estimates due to the amount of work and expense needed to do annual surveys. Thus, it is usually larger counties that are represented in the one year averages of socioeconomic and demographic data from the census bureau. As far as health data are concerned, there is the potential that some county health authorities haven’t reported their data to the CDC, for whatever reason, or that the number of diabetes cases newly diagnosed was below 20 for that county and year… for reasons of confidentiality, the CDC will not report aggregated data with fewer than 20 counts.

The following lines of code read in the two data sets, and report on the number of counties in one, but not in the other. The X%in%Y operator in R checks each element of the vector X in turn, and returns TRUE if that element is in Y (and FALSE if it is not). Taking !X%in%Y returns TRUE for each element of X that is not in Y.

    ddat=read.table("preprocessed_diabetes_incidence_by_county_by_year_2004_to_2013.csv",sep=",",header=T,as.is=T)
    pdat=read.table("preprocessed_poverty_rates_by_county_2013.csv",sep=",",header=T,as.is=T)

    ddat = subset(ddat,year==2013)
    cat("The number of counties in the diabetes data set is",nrow(ddat),"\n")
    cat("The number of counties in the poverty data set is",nrow(pdat),"\n")

    i=which(!ddat$fips%in%pdat$fips)
    j=which(!pdat$fips%in%ddat$fips)

    cat("The number of counties in the diabetes data set not in the poverty set is:",length(i),"\n")
    cat("The number of counties in the poverty data set not in the diabetes set is:",length(j),"\n")

We can subset the two data sets to ensure that they both contain information for the same set of counties:

    ddat = subset(ddat,fips%in%pdat$fips)
    pdat = subset(pdat,fips%in%ddat$fips)
    cat("The number of counties in the diabetes data set is",nrow(ddat),"\n")
    cat("The number of counties in the poverty data set is",nrow(pdat),"\n")

You will find that both data sets are now the same size (755 counties).

However, there is no guarantee that the counties are now in the same order in the two data sets. To get the one-to-one correspondence between the data sets, we can use the R match() function. match(ddat$fips,pdat$fips) returns, for each value of ddat$fips in turn, the index of the row of the pdat data frame that has the same fips.
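If the behaviour of %in% and match() isn’t clear, it can help to try them on a toy example first. In the sketch below, the FIPS codes and poverty values are all made up, and the vector names are just for illustration.

```r
d_fips = c(4013, 6037, 17031, 36061)   # made-up counties in the "diabetes" set
p_fips = c(36061, 4013, 48201, 6037)   # made-up counties in the "poverty" set (different order)
p_pov  = c(17.7, 17.1, 18.5, 18.9)     # made-up poverty rates, in the order of p_fips

print(d_fips %in% p_fips)              # is each diabetes county in the poverty set?
print(which(!d_fips %in% p_fips))      # indices of diabetes counties not in the poverty set

idx = match(d_fips, p_fips)            # where in p_fips is each element of d_fips?
print(idx)                             # NA where there is no match
print(p_pov[idx])                      # poverty rates re-ordered to line up with d_fips
```

Note that match() returns NA for elements with no counterpart, which is why it is a good idea to subset the data sets to the common set of counties first.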

We can thus create a new element of the ddat data frame called poverty, which is obtained from the pdat data frame with the corresponding fips to every fips in ddat:

ddat$poverty = pdat$poverty[match(ddat$fips,pdat$fips)]

and we can now plot the diabetes rate versus the poverty and overlay the regression line:

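    require("sfsmisc")
    mult.fig(1)
    plot(ddat$poverty,ddat$diabetes_rate,col="red",cex=2,xlab="Poverty rate",ylab="Diabetes rate",main="2013 data")
    # regress the diabetes rate on the poverty rate
    myfit = lm(diabetes_rate~poverty,data=ddat)
    # overlay the fitted line; abline() works even if lm() dropped rows with NA values
    abline(myfit,col=4,lwd=6)
    print(summary(myfit))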


**(aka: How to be a Data Boss)**

**This course is meant to introduce students in the life and social sciences to the skill set needed to do well-executed and well-explicated statistical analyses. The course is aimed at students with little prior experience in statistical analyses, but prior exposure to “stats 101”-type courses is helpful. The course will be almost entirely based on material posted on this website. The course syllabus can be found here.**

**There is no textbook for this course, but recommended reading is How to Lie with Statistics by Darrell Huff (illustrated by Irving Geis), Statistical Data Analysis by Glen Cowan, and Applied Linear Statistical Models by Kutner et al (doesn’t really matter which edition).**

**Upon completing this course:**

**Students will have an understanding of basic statistical methods, including hypothesis testing, linear regression and generalized regression methods, and will understand common pitfalls in statistical analyses, and how to avoid them (and detect them, when reviewing papers!). If we have the following problem as the course progresses, students need to tell me, because it means that I need to adjust the pace and content of the course material:**

**Upon completion of the course, students will have basic functionality in R, and will learn how to read in, manipulate, and export data in R, and will be able to create publication-quality plots in R. Methods for producing well-written scientific papers, and giving good oral presentations, are also heavily stressed throughout the course.**

**Dr. Towers’ Golden Rules for Statistical Data Analysis:**

- **All (or nearly all) data have stochasticity (i.e., randomness) associated with them**
- **A probability distribution underlies that stochasticity**
- **Hypothesis tests are based on that probability distribution**
- **Anything calculated using data (statistics like the mean or standard deviation, for example) has stochasticity associated with it, because the data are stochastic.**
- **Every statistical analysis needs to start with a “meet and greet” with your data: calculation of basic statistics (sample size, means, standard deviations, ranges, etc), and plots to explore the data and ensure no funny business is going on.**
- **When doing regression, you need two things: a model that describes how the data depend on the explanatory variables, and a goodness-of-fit statistic (like Least Squares, or Binomial likelihood, or Poisson likelihood, etc)**

**List of course modules:**

- Good work habits, and requirements for homework
- Literature searches with Google Scholar
- Elements of scientific papers
- The basics of the R statistical programming language
- Difference between statistical and mathematical models
- Probability distributions important to modelling in the life and social sciences
- Descriptive statistics: mean, covariance, variance, and correlation
- Online sources of free data
- Extracting data from graphs in the published literature
- Bringing together disparate sources of data
- Correlations, partial correlations, and confounding variables
- Exploratory data analysis examples
- Least squares linear regression
- Producing well written manuscripts in a timely fashion
- Giving a good presentation
- t-tests and z-tests of means, and ANOVA
- Poisson regression
- Logistic regression
- Population standardized Poisson regression for data expressed as per capita rates
- Kolmogorov-Smirnov test to compare the shape of two distributions
- Negative Binomial likelihood fits for over-dispersed count data
**Homework #8, due Friday April 20th at noon. (in class presentations the week of April 23rd)**

- Numerical methods for propagation of uncertainties
- Least Squares fitting is equivalent to homoskedastic Normal likelihood fitting
- Model validation methods
- Making choropleth maps in R
- K-means clustering
- R Shiny (more examples here)

**Course expectations:** There will be regular homework projects assigned throughout the course, which will be worth 50% of the grade. Students are strongly encouraged to work together in groups to discuss issues related to the homework and resolve problems. However, plagiarism of code will not be tolerated. There also may be unannounced in-class pop quizzes during the semester. If these occur, they will be counted among the homework grades.

The culmination of the course will be a group term project (two to three students collaborating together, with the project worth 50% of the final grade). Students will write up the results of their project in a format suitable for publication, using the format required by a journal they have identified as being appropriate for the topic. A cover letter written to the editor of the journal is also required. **Submission for publication is not required, but is encouraged if the analysis is novel.** Students are responsible for locating and obtaining sources of data, and developing an appropriate statistical model for the project, so this should be something they begin to think about very early in the course.

**This course has no associated textbook. Instead the course content consists of the modules that appear on this website. A textbook that students may find useful is Statistical Data Analysis, by G. Cowan.**

Students are expected to bring their laptops to class. Before the course begins, students are expected to have downloaded the R programming language onto their laptop from http://www.r-project.org/ (R is open-source free software).

Final project write-ups will be due **Friday, April 13th**. Each of the project groups will give an in-class 20 min presentation on **Monday, April 23rd, 2018 and Wed, April 25th, 2018**. During the week of April 16th, project groups will meet with Dr. Towers to discuss their final project write-ups, and their upcoming presentation. By Friday, April 27th, all group members are to submit to Prof Towers a confidential email detailing their contribution to the group project, and detailing the contributions of the other group members.


The file reads in the data files summary_pandemic_data.txt, and sunspot_wolf_and_group_1700_to_2014.txt

The R script produces the following plot, shown in the paper,


**I’m a statistician, and I also have a PhD in experimental particle physics. Research in experimental particle physics can involve complex models of observable physical processes, and fitting of those models to experimental data is a not uncommon task in that field. Like the field of applied mathematics in the life and social sciences (AMLSS), the models being fit at times have no analytic solution, and must be solved numerically using specialized methods. When I entered the field of AMLSS back in 2009, I had a lot to learn about the various models used in this field and the common methodologies, but I already had a solid tool box of specialized skills that allowed me to connect mathematical models to data, and it has turned out that those skills have been remarkably useful in exploring a wide range of research questions in the life and social sciences that I find interesting. I also apply these skills in consulting projects I do.**

**First off: what is the difference between statistical and mathematical modelling, anyway?**

The difference between statistical and mathematical models is often confusing to people. In this past module on this site, I discuss the differences, using an analysis of seasonal and pandemic influenza as an example.

**Example of an analysis combining statistical and mathematical modelling: Mathematical and statistical modelling of the contagious spread of panic in a population**

During the 2014 Ebola outbreak, there were a total of five cases that were ultimately identified in America, compared to tens of thousands of cases in West Africa. Even though the “outbreak” in America was essentially non-existent, once the first case was identified in the US in autumn 2014, the media shifted into 24/7 coverage of the supposed dire threat Ebola presented to Americans, complete with scary imagery.

In autumn 2014 I was teaching a course in the ASU AMLSS graduate program on statistical methods for fitting the parameters of mathematical models to data. Each year, when I teach AML classes, I usually try to have a “class publication project” that encompasses the methodology I teach in the class. In this case, I thought it might be interesting to try to model the spread of Ebola-related panic in the US population, as expressed on social media, and explore how news media might play a role in that.

The class did the analysis as a homework assignment, and we wrote the paper together, which was published in 2015. The paper received national media attention when it came out.

First, why was this analysis important? Well, it has been suggested in the past that people talking about a particular disease on social media might perhaps be used as a real-time means to assess the temporal and geospatial spread of the disease in the population, rather than relying on slower traditional surveillance methods, which can suffer from backlogs in laboratory testing. For instance, tracking influenza, or cholera:

However, up until the US Ebola “outbreak” the problem was that it was impossible to say whether people were just discussing a disease on social media because they were worried about it, rather than because they actually had it. During the Ebola outbreak, pretty much no one actually had it in the US, so everyone who was talking about it was doing so because they were concerned about it. This gave us the perfect instance to gauge what kind of temporal patterns we might see in social media chatter due simply to panic or concern about a disease!

The data we used in the study were the daily number of news stories about Ebola from large national news outlets. We also obtained Twitter data related to Ebola, and Google search data in the US with search terms related to Ebola, including “do I have Ebola?” from Google Trends. Here is what the data looked like:

We came up with a model that related the number of news videos, V, and people who were infected, I, with the idea to tweet about Ebola, or do a Google search related to Ebola:

The parameter beta is a measure of how many tweets (or Google searches) per person per unit time one news story would inspire, and gamma parameterizes the “boredom” effect, through which people eventually move to a “recovered and immune” class, upon which they never tweet again about Ebola no matter how many Ebola-related news stories they are exposed to. Using the statistical methodology taught in the AML course, the students fit the parameters of that model to data, and obtained the following model predictions, shown in red:

The blue lines on the plot represent a plain statistical model that simply regresses the Twitter and Google search data on the news media data, without taking into account the “boredom” effect. Can you see that the regression fits are systematically too high early on, and systematically too low later for all the plots, but the same is not true of our mathematical model? That tells us that our mathematical model that includes boredom really does do a better job of describing the dynamics of peoples’ Ebola-related social media behaviours!

We found that each Ebola-related news story inspired on average thousands of tweets and Google searches. Also, on average, we found people were only interested enough to tweet or do a Google search for a few days after seeing a news story about Ebola before they became bored with the topic:

We couldn’t have done this analysis without both mathematical modelling and statistical methods; it was a nice “bringing together” of the methodologies to explore an interesting research question.

**Another example of an analysis that involved mathematical and statistical modelling methods: contagion in mass killings and school shootings**

In January, 2014 there was a shooting at Purdue University, where one student entered a classroom and shot another student dead, then walked out and waited for police to arrest him.

At the time, it struck me that it was the third school shooting I had heard about in an approximately 10 day period. Even for the United States, which has a serious problem with firearm violence compared to other first world countries, this seemed like an unusual number to have in such a short period of time.

It led me to wonder if perhaps contagion was playing a role in these dynamics. Certainly, in the past it had been noted that suicide appears to be contagious, because (for example) in high schools where there is one suicide it is statistically more likely to see an ensuing cluster of suicides. And the “copy cat” effect in mass killings has long been suspected. I wondered if perhaps a mathematical model of contagion might be used to help *quantify* whether or not mass killings and school shootings are contagious. So, I talked with some colleagues:

And, we decided to use a mathematical model known as a Hawkes point process “self-excitation” model to simulate the potential dynamics of contagion in mass killings; the idea behind the model is quite simple… there is a baseline probability (which may or may not depend on time) that a mass killing occurs by mere random chance (the dotted line, below). But, if a mass killing does occur, due to contagion it temporarily raises the probability that a similar event will occur in the near future. That added probability decays exponentially:

How contagion would manifest itself in data is thus as unusual “bunching together in time” of events compared to what you would expect from just the baseline probability.

Here’s our (blurry) model:

The parameters of the model were Texcite, the average length of the excitation period, and Nsecondary, the average number of new mass killings inspired by one mass killing. N_0(t) was the baseline probability for mass killings to occur. We used statistical modelling methods to estimate N_0(t).
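A minimal sketch of the self-excitation idea, in R. The text only says that the added probability decays exponentially; the particular kernel below (normalised so that one event inspires Nsecondary events on average), the constant baseline, the function name `hawkes_rate`, and all numbers are assumptions made for illustration, not the fitted model.

```r
# Sketch of a self-exciting (Hawkes-type) event rate.
# ASSUMPTION: each past event at time t_i adds
#   (Nsecondary / Texcite) * exp(-(t - t_i) / Texcite)
# to the baseline N0, so the added rate integrates to Nsecondary.
hawkes_rate <- function(t, event_times, N0, Texcite, Nsecondary) {
  sapply(t, function(tt) {
    past <- event_times[event_times < tt]   # only earlier events excite
    N0 + sum(Nsecondary / Texcite * exp(-(tt - past) / Texcite))
  })
}

# Hypothetical values, for illustration only
event_times <- c(10, 12, 40)                # days on which events occurred
rate <- hawkes_rate(t = 0:60, event_times, N0 = 0.05,
                    Texcite = 13, Nsecondary = 0.3)
```

Plotting `rate` against `0:60` shows the flat baseline, a jump after each event, and the exponential decay back toward the baseline; events that fall close together stack their contributions, which is the “bunching” signature described above.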

We needed data in order to fit the parameters of our model. From USA Today we obtained data on mass killings (four or more people killed), and from the Brady Campaign to Prevent Gun Violence, we obtained data on school shootings, and data on mass shootings (three or more people shot, not necessarily killed). Mass shootings happen very frequently in the US!

We compared how well the Hawkes model fit the data with how well a model without self-excitation fit the same data. If contagion is evident, the former will fit the data significantly better.

The fit results were:

Both mass killings and school shootings appear to be significantly contagious! And the length of the contagion period is on average around two weeks for both.

Mass shootings with three or more people shot, but fewer than four people killed, were not contagious though.

Why? Well, mass shootings with low death counts happen on average more than once a week in the US. They happen so often that they rarely make it past the local news. In contrast, mass shootings with high death counts, and school shootings, usually get national and even international media attention. It may well be that widespread media attention is the vector for the contagion.

Again, this was an analysis that was made possible through the marriage of mathematical and statistical modelling methods.

**Statistical and mathematical modelling skills on the job market**

Quantitative and predictive analytics is a field that is growing very quickly. Statistical methods and data mining (“big data”) play a large role in predictive analytics, but mathematical models are increasingly being recognized as having some advantages over statistical models alone, because mathematical models do not simply assume an “X causes Y” relationship, but instead can describe the complex dynamics of interacting systems. Having a tool box of skills that includes expertise in both mathematical and statistical modelling can lead to many interesting career opportunities, including consulting.


For many models, information about the parameters and/or initial conditions can be obtained from other studies. For instance, let’s examine the seasonal influenza SIR model we have used as an example in several other modules. Our data were influenza incidence data from an influenza epidemic in the Midwest, and we fit the transmission rate, beta (or alternatively, R0=beta/gamma), of an SIR model to these data, for example using the R script fit_midwest_negbinom_gamma_fixed.R.

The script performs a negative binomial likelihood fit to the influenza data, assuming that the average recovery period, 1/gamma, for flu is fixed at 4.8 days. The script produces the following plot (recall that alpha is the over-dispersion parameter for the negative binomial likelihood, and t0 is the time of introduction of the virus to the population).

The script gives the best-fit estimate using the graphical Monte Carlo fmin+1/2 method, and also the weighted mean method. Note that the plots should be much better populated in order to really get trustworthy estimates from the fmin+1/2 method.

**In reality, most of the parameters that we obtain from prior studies aren’t known to perfect precision**

In our script above, we assumed that 1/gamma was 4.8 days based on a prior study in the literature. However, this was estimated from observational studies of sick people, and, in reality, there are statistical uncertainties associated with that estimate. In the paper describing the studies, they state that their central estimate and 95% confidence interval on 1/gamma was 4.80 [4.31,5.29] days. **Unless told otherwise in the paper from which you get an estimate, you assume that the uncertainty on the parameter is Normally distributed.** Because the 95% CI extends +/-1.96*sigma from the mean, this implies that the standard deviation of the Normal distribution is sigma=(4.8-4.31)/1.96~0.25 days.

Thus, our probability distribution for x=1/gamma is

P(x|mu,sigma)~exp(-0.5(x-mu)^2/sigma^2)

with mu=4.8 days, and sigma = 0.25 days, in this case.
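As a quick check of that arithmetic, here is a minimal R sketch that turns the reported 95% confidence interval into a Normal standard deviation and defines the corresponding negative log probability (with the constant term dropped). The variable and function names are illustrative.

```r
# Convert a reported central estimate and 95% CI into a Normal sigma
mu    <- 4.80                                   # central estimate of 1/gamma (days)
ci_lo <- 4.31
ci_hi <- 5.29
sigma <- (ci_hi - ci_lo) / (2 * qnorm(0.975))   # qnorm(0.975) is about 1.96

# Negative log of P(x|mu,sigma), up to an additive constant
prior_negloglike <- function(x) 0.5 * (x - mu)^2 / sigma^2

sigma
prior_negloglike(mu + sigma)   # one sigma from the mean costs 0.5
```

Using the full width of the interval, rather than one side only, also works when a paper reports a slightly asymmetric interval and you still want a single symmetric sigma.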

**Uncertainty on “known” parameters affects the uncertainties on the other model parameters you estimate from your fit to data!**

The uncertainty on 1/gamma will affect the uncertainty on our parameter estimates. For instance, is it clear that if all we knew about 1/gamma was that it was between 1 and 50 days, it would be much harder to pin down our transmission rate? (ie; we had no idea what 1/gamma was, and thus had to fit for gamma, as well as R0, t0, and alpha.) The script fit_midwest_negbinom_gamma_unconstrained.R does this, and produces the following plot:

You can see that the influenza data we have perhaps give us a little bit of sensitivity to the parameter gamma, but not much (basically, the fit just tells us 1/gamma is somewhere between 2 and 6 days, with 95% confidence). The uncertainties on our estimates of R0 and t0 have gone way up, compared to the first fit where we assumed 1/gamma was fixed at 4.8 days! Also, when you use the weighted mean method to estimate parameters and their uncertainties, you can get the covariance matrix for your parameter estimates. The correlation matrix, derived from the covariance matrix, for this fit looks like this:

Notice that our estimates of R0 and 1/gamma are almost 100% correlated (this means that as 1/gamma goes up, R0 also has to go up to achieve a good fit to the data). You never want to see parameters so highly correlated in fits you do… it means that your best-fit parameters likely won’t give you a model with good predictive power for a different, equivalent data set (say, influenza data for the same region for the next flu season).

So, even though we seem to have a little bit of sensitivity to the value of 1/gamma in our fit, having that estimate 100% correlated to our estimate of R0 is not good, and a sign you shouldn’t trust the results of the fit.

**Incorporating uncertainties on “known” parameters from the literature in our fit likelihoods**

In order to take into account the uncertainties on our “known” parameter, x, you simply modify your fit likelihood to include the likelihood coming from the probability distribution for that parameter. Thus, the negative log likelihood is modified like so:

negloglike = negloglike + 0.5*(x-mu)^2/sigma^2

Then, in the fit, you do Monte Carlo sampling not only of all your other unknown parameters (like R0, t0, and alpha in this case), but also uniformly randomly sample parameter x over a range of approximately mu-4*sigma to mu+4*sigma.
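The mechanics can be sketched in a few lines of R. In this sketch `negloglike_data` is a hypothetical stand-in for the negative log likelihood of the data (in a real fit it would come from comparing the model prediction to the data, and the other parameters would be sampled too); only the added Normal term and the uniform sampling range follow the recipe above.

```r
mu    <- 4.8     # prior central estimate of x = 1/gamma (days)
sigma <- 0.25    # prior one standard deviation uncertainty

# HYPOTHETICAL stand-in for the data likelihood as a function of x only
negloglike_data <- function(x) 0.02 * (x - 4)^2

set.seed(1)
n_mc <- 5000
x <- runif(n_mc, mu - 4 * sigma, mu + 4 * sigma)   # uniform sweep of x

# Modify the likelihood with the term from the prior probability distribution
negloglike <- negloglike_data(x) + 0.5 * (x - mu)^2 / sigma^2

x_best <- x[which.min(negloglike)]
x_best
```

Because the stand-in data likelihood is nearly flat in x, the best-fit value stays very close to the prior estimate, which is the behaviour to expect when the data carry little information about the constrained parameter.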

For 1/gamma, we know that mu=4.8 days, and sigma is 0.25 days. The R script fit_midwest_negbinom_gamma_constrained.R modifies the likelihood to take into account our probability distribution for our prior estimate of 1/gamma from the literature. The script produces the following plot:

(again, for the fmin+1/2 method, we’d like to see these plots much better populated!). Note that now our uncertainties on R0 and t0 from the weighted mean method are much smaller than they were when 1/gamma was completely unconstrained, but larger than they were when 1/gamma was fixed to 4.8 days. By modifying the likelihood to take into account the probability distribution of our prior belief for 1/gamma, we now have a fit that properly feeds that uncertainty into our uncertainty on R0 and t0.

When publishing analyses that involve fits like these, it is important to take into account your prior belief probability distributions for the parameter estimates you take from the literature. In some cases, your fit might be quite sensitive to the assumed values of those parameters; if the literature estimates are a bit off from what your data would “like” them to be to obtain a good fit, and you just assume a fixed central value for the parameter, sometimes you just won’t be able to get a good fit to your data.

When you include the parameter in your fit with a modified likelihood to take into account its prior belief probability distribution, the estimate you get from the fit to your data is known as the “posterior” estimate. Note that the posterior estimate, and uncertainty, on 1/gamma that we obtained from fit_midwest_negbinom_gamma_constrained.R is 4.798+/-0.247, and is pretty darn close to our prior belief estimate of 4.8+/-0.25. If our data were sensitive to the value of 1/gamma, our posterior estimate would have a smaller uncertainty than the prior belief estimate, and likely have a different central value too.

As discussed in that module, the model parameters can be estimated from the parameter hypothesis for which the negative log-likelihood statistic, f, is minimal, and the one standard deviation uncertainty on the parameters is obtained by looking at the range of the parameters for which the negative log likelihood is less than 1/2 more than the minimum value, like so:

This method has the advantage that it is easy to understand how to execute (once you’ve seen a few examples). However, we talked about the fact that this procedure is only reliable if you have many, many sweeps of the model parameter values (for instance, the above plots are pretty sparsely populated, and it would be a bad idea to trust the confidence intervals seen in them; they are underestimated because the green arrows don’t quite reach to the edge of the parabolic envelope that encases the points).
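For a concrete picture of the fmin+1/2 recipe, here is a minimal one-parameter sketch in R: estimating the mean of simulated Poisson counts. The data, sampling range, and variable names are made up for illustration. With only one parameter every sampled point lies on the likelihood curve itself, so the plot is easy to populate densely; with several parameters the same recipe is applied to the lower envelope of the points.

```r
set.seed(42)
y <- rpois(50, lambda = 20)                  # simulated count data

# Poisson negative log likelihood for a hypothesised mean
negloglike <- function(lambda) -sum(dpois(y, lambda, log = TRUE))

lambda_hyp <- runif(5000, 17, 23)            # Monte Carlo parameter sweep
f <- sapply(lambda_hyp, negloglike)

fmin <- min(f)
best <- lambda_hyp[which.min(f)]             # best-fit estimate
ci   <- range(lambda_hyp[f <= fmin + 0.5])   # one standard deviation interval

best
ci
```

The half-width of `ci` should be close to sqrt(mean(y)/50), the usual standard error on a Poisson mean. Rerunning with far fewer sampled hypotheses tends to shrink the interval, which is the underestimation described above.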

The fmin+1/2 method also does not yield a convenient way to determine the covariance between the parameter estimates, without going through the complicated numerical gymnastics of estimating what is known as the Hessian matrix. The Hessian matrix (when maximizing a log likelihood) is the matrix of second partial derivatives of the log likelihood function, evaluated at the point of the maximum likelihood estimates; in practice it is usually approximated numerically. Thus, it’s a measure of the steepness of the drop in the likelihood surface as you move away from the best-fit estimate.

It turns out that there is an easy, elegant way, when using the graphical Monte Carlo method, to use information coming from every single point that you sample to obtain (usually) more robust and reliable parameter estimates, and (usually) more reliable confidence intervals for the parameters.

**The weighted means method**

To begin to understand how this might work, first recall from the previous module that the fmin+1/2 method gives you the one standard deviation confidence interval. Recall that to get the S standard deviation confidence interval, you need to go up 0.5*S^2 from the value of fmin, and examine the range of points under that line. This means that when we plot our negative log likelihood, f, vs our parameter hypotheses, the points that lie some value X above fmin are, in effect, sqrt(2*X) standard deviations away from the best-fit value. Here is what that looks like graphically:

The red lines correspond to the points that lie at fmin+1/2 (the one standard deviation confidence interval), the blue lines correspond to the points that lie at fmin+0.5*2^2=fmin+2 (the two standard deviation confidence interval), and the green lines correspond to the points that lie at fmin+0.5*3^2=fmin+4.5 (the three standard deviation confidence interval).

It should make sense to you that the points that are further away from fmin carry less information about the best-fit value compared to points that have a likelihood close to the minimum. After all, when using the graphical Monte Carlo method, you aim to populate the graphs well enough to get a good idea of the width of the parabolic envelope *in the vicinity of the best fit value*.

So… if we were to take some kind of weighted average of our parameter hypotheses, giving more weight to values near the minimum in our likelihood, we should be able to estimate the approximate best-fit value.

It turns out that the weight that achieves this is intimately related to those confidence intervals we see above. If we do many Monte Carlo parameter sweeps, getting our parameter hypotheses and the corresponding negative log likelihoods, f, we can estimate our best fit values by taking the weighted average of our parameter hypotheses, weighted with weights

w=dnorm(sqrt(2*(f-fmin)))

where dnorm is the PDF of the Normal distribution. Notice that this is maximal when f=fmin, and gets smaller and smaller as f moves away from fmin. In fact, when f=fmin+0.5*S^2 (the value that corresponds to the S std dev CI), then

w=dnorm(S)

So, the points that are close to giving the minimum likelihood are given a greater weight in the fit, because they are more informative as to where the minimum actually lies. The plot of w vs S is:

Thus, the further f gets away from fmin, the less weight the points are given, *but they still have some weight*.
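The relationship between the height above fmin and the weight can be checked directly in R. The value of `fmin` below is arbitrary and only there for illustration.

```r
S    <- 0:4                        # standard deviations from the best fit
fmin <- 100                        # arbitrary minimum negative log likelihood
f    <- fmin + 0.5 * S^2           # heights of the S standard deviation lines

w <- dnorm(sqrt(2 * (f - fmin)))   # weight given to points at each height

# Weight relative to a point sitting exactly at the minimum
data.frame(S = S, f_minus_fmin = f - fmin, rel_weight = w / w[1])
```

A point on the one standard deviation line carries roughly 61% of the weight of a point at the minimum, a point on the two standard deviation line roughly 14%, and a point on the three standard deviation line roughly 1%.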

It turns out that not only can these weights be used to estimate our best-fit values, they also can be used to estimate the covariance matrix of our parameter estimates. If we have two parameters (for example), and we’ve randomly sampled N_MC parameter hypotheses, we would form an N_MC x 2 matrix of these sampled values, and then take the weighted covariance of that matrix. The R cov.wt() function does this.
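Here is a minimal sketch of that calculation with cov.wt(), on a toy problem where the answer is known in advance: the negative log likelihood is constructed to be exactly parabolic with a chosen best-fit point and covariance matrix, so we can see whether the weighted estimates recover them. The numbers `p_best` and `V_true` are hypothetical.

```r
set.seed(7)
p_best <- c(2, -1)                                # chosen best-fit values
V_true <- matrix(c(0.04, 0.03, 0.03, 0.09), 2)    # chosen covariance matrix
V_inv  <- solve(V_true)

n_mc <- 20000
# N_MC x 2 matrix of uniformly sampled hypotheses (about +/- 5 std devs)
P <- cbind(p1 = runif(n_mc, 1, 3), p2 = runif(n_mc, -2.5, 0.5))

# Exactly parabolic negative log likelihood: 0.5 * d' V^-1 d
D <- sweep(P, 2, p_best)
f <- 0.5 * rowSums((D %*% V_inv) * D)

w   <- dnorm(sqrt(2 * (f - min(f))))
est <- cov.wt(P, wt = w / sum(w), cor = TRUE)

est$center   # weighted means: estimates of the best-fit values
est$cov      # weighted covariance matrix of the parameter estimates
est$cor      # corresponding correlation matrix
```

The square roots of the diagonal of `est$cov` should come out near 0.2 and 0.3, and the off-diagonal element of `est$cor` near 0.5, matching the values that went into `V_true`.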

Advantages of the weighted mean method: with this method every single point you sample gives information about the best-fit parameters and the covariance matrix for those parameter estimates. This is unlike the fmin+1/2 method, where it is only those points right near the minimum value of f and at fmin+1/2 that really matter in calculating the confidence interval.

Also, using this weighted method you trivially get the estimated covariance matrix for the parameters, unlike the fmin+1/2 method where this would be much harder to achieve.

Another advantage of this method is that you don’t have to populate your plots quite as densely as you would for the fmin+1/2 method in order for it to reliably work; this is because every single point you sample is now informing the calculation of the weighted mean and weighted covariance matrix.

The disadvantage of this method is that you must uniformly randomly sample the parameters (no preferential sampling of parameters using rnorm, for instance), and you must uniformly sample them over a broad enough range that it encompasses at least a three or four standard deviation confidence interval (otherwise, as we’ll see, you will underestimate the parameter uncertainties).

**An example**

As an example of how this works in practice, let’s return to the simple example we saw in this previous module, where we compared the performance of the fmin+1/2 method to that where we analytically calculate the Hessian to estimate the parameter uncertainties.

In the example, the model we considered was y=a*x+b, where a=0.1 and b=10, and x goes from 10 to 150, in integer increments. We simulate the stochasticity in the data by smearing with numbers drawn from the Poisson distribution with mean equal to the model prediction. Thus, an example of the simulated data looks like this:
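One way such a simulated data set could be generated in R is sketched below; the seed and variable names are arbitrary choices for this illustration.

```r
set.seed(123)
a <- 0.1
b <- 10
x <- 10:150                         # integer steps from 10 to 150

ypred <- a * x + b                  # model prediction
yobs  <- rpois(length(x), ypred)    # Poisson-smeared "observations"

plot(x, yobs, xlab = "x", ylab = "y")
lines(x, ypred, col = "red")        # the true underlying model
```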

Recall that the Poisson negative log likelihood looks like this

where the y_i^obs are our data observations, and y_i^pred are our model prediction (y_i^pred = a*x_i+b).
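A minimal sketch of this negative log likelihood in R, assuming the standard Poisson form f = sum_i [ y_i^pred - y_i^obs*log(y_i^pred) + log(y_i^obs!) ]. The last term does not depend on a or b, so keeping or dropping it does not change the best-fit values or the fmin+1/2 intervals. The function name and the test values are illustrative.

```r
# Poisson negative log likelihood for the straight-line model y = a*x + b
poisson_negloglike <- function(a, b, x, yobs) {
  ypred <- a * x + b
  if (any(ypred <= 0)) return(Inf)          # a Poisson mean must be positive
  -sum(dpois(yobs, lambda = ypred, log = TRUE))
}

set.seed(1)
x    <- 10:150
yobs <- rpois(length(x), 0.1 * x + 10)

f_true  <- poisson_negloglike(0.1, 10, x, yobs)   # at the true parameters
f_worse <- poisson_negloglike(0.2, 10, x, yobs)   # a poor hypothesis for a
c(f_true, f_worse)
```

The poor hypothesis gives a much larger value of f, which is what lets the parameter sweep single out the region near the true values.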

In the example hess.R, we randomly generated many different samples of our y^obs, and then used the Monte Carlo parameter sweep method to find the values of a and b that minimize the negative log likelihood. Then we calculated the Hessian about this minimum and estimated the one-standard deviation uncertainties on a and b from the covariance matrix that is the inverse of the Hessian matrix. Recall that the square roots of the diagonal elements of the covariance matrix are the parameter uncertainties.
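For the straight-line Poisson model the second derivatives can be written down by hand: with y_i^pred = a*x_i+b, they are sum(y_i^obs*x_i^2/y_i^pred^2) for a, sum(y_i^obs/y_i^pred^2) for b, and sum(y_i^obs*x_i/y_i^pred^2) for the mixed term. The sketch below is not the contents of hess.R; it uses optim() in place of a parameter sweep to locate the minimum, and the seed and names are arbitrary.

```r
set.seed(2)
x    <- 10:150
yobs <- rpois(length(x), 0.1 * x + 10)

negloglike <- function(p) {
  ypred <- p[1] * x + p[2]
  if (any(ypred <= 0)) return(Inf)
  -sum(dpois(yobs, ypred, log = TRUE))
}

# Best fit from a numerical minimiser (stand-in for the parameter sweep)
fit   <- optim(c(0.1, 10), negloglike)
ypred <- fit$par[1] * x + fit$par[2]

# Analytic Hessian of the negative log likelihood at the best fit
H <- matrix(c(sum(yobs * x^2 / ypred^2), sum(yobs * x / ypred^2),
              sum(yobs * x   / ypred^2), sum(yobs     / ypred^2)), 2, 2)

V <- solve(H)      # covariance matrix is the inverse of the Hessian
sqrt(diag(V))      # one standard deviation uncertainties on a and b
cov2cor(V)         # a and b are strongly anti-correlated
```

The strong negative correlation makes sense for this model: a steeper slope can be partly compensated by a lower intercept.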

We also did this using the fmin plus a half method, to show that if the fmin plus a half method works, its estimate of the one-standard-deviation confidence intervals should be very close to the Hessian estimate.

We can add into this exercise our weighted mean method. The R script hess_with_weighted_covariance_calculation.R does just this.
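To show the shape of the calculation, here is a minimal, self-contained sketch of the weighted mean method applied to one simulated data set from this model. It is not the contents of hess_with_weighted_covariance_calculation.R; the seed, the number of iterations, and the sampling range for b are assumptions made for this illustration.

```r
set.seed(3)
x    <- 10:150
yobs <- rpois(length(x), 0.1 * x + 10)

n_mc <- 20000
a <- runif(n_mc, 0.06, 0.14)      # uniform sweeps over a broad range
b <- runif(n_mc, 6, 14)

# Poisson negative log likelihood for every sampled (a, b) pair
f <- mapply(function(a, b) -sum(dpois(yobs, a * x + b, log = TRUE)), a, b)

w   <- dnorm(sqrt(2 * (f - min(f))))
est <- cov.wt(cbind(a, b), wt = w / sum(w), cor = TRUE)

est$center             # weighted mean estimates of a and b
sqrt(diag(est$cov))    # one standard deviation uncertainties
est$cor[1, 2]          # correlation between the estimates of a and b
```

The uncertainties should land near 0.009 for a and 0.7 for b, which is roughly what the inverse of the expected Hessian gives for this model.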

The script produces the following plot, histogramming the parameter estimates from the weighted mean method. As you can see, the estimates are unbiased, and the uncertainties on the parameters assessed by the weighted mean method are very close to those assessed by the analytic Hessian method:

The script also produces the following plot:

Notice in the top two plots that the parameter uncertainties assessed by the weighted mean method are quite close to those estimated by the Hessian method, but the uncertainties assessed by the fmin+1/2 method are always underestimates. This is because we didn’t sample that many points in our graphical Monte Carlo procedure, as can be seen in the examples in the two bottom plots; the plots are so sparsely populated, the green arrows that represent the CI’s estimated by the fmin+1/2 method don’t go all the way to the edge of the parabolic envelope.

So, even with relatively sparsely populated plots, the weighted mean method works quite well. If they are really, really sparsely populated, however, you will find that the performance of the method starts to degrade; take a look at what happens when you change nmc_iterations from 10000 to 100 in hess_with_weighted_covariance_calculation.R:

The estimates of the parameter uncertainties still are scattered about the Hessian estimates (and the fmin+1/2 method miserably fails due to the sparsity of points). However, notice that there is quite a bit of variation in the uncertainty estimates using the weighted mean method about the red dotted line (compare to the other plot, above); the more MC iterations you have, the more closely these will cluster about the expected values (ie; the more trustworthy your parameter uncertainty estimates will be). So, don’t skimp on the MC parameter sampling iterations, even when using the weighted mean method! In general, with this method, you need to run enough MC parameter sweep iterations to get a reasonable idea of the parabolic envelope in the vicinity of the best-fit value.

One catch of this method, as mentioned above, is to ensure that you do your random uniform parameter sweeps over a broad enough area… if you sample parameters too close to the best-fit value, the weighted mean method will underestimate the confidence intervals. In general, you have to sweep at least a four standard deviation CI. As an example of what occurs when you use too narrow a range, instead of sampling parameter a uniformly from 0.06 to 0.14, uniformly sample it from 0.09 to 0.11 in the original hess_with_weighted_covariance_calculation.R script. We now get:

You can see that the confidence intervals are now severely under-estimated by the weighted mean method.
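The effect is easy to reproduce on a one-parameter toy problem in R, where the negative log likelihood is an exact parabola with best-fit value 0 and a true one standard deviation uncertainty of 1. The helper function `weighted_sd` and the numbers are hypothetical, chosen only to isolate the effect of the sampling range.

```r
set.seed(11)
sigma_true <- 1    # true one standard deviation uncertainty

weighted_sd <- function(half_width, n_mc = 20000) {
  p <- runif(n_mc, -half_width, half_width)   # uniform sweep about the best fit
  f <- 0.5 * p^2 / sigma_true^2               # exactly parabolic neg log likelihood
  w <- dnorm(sqrt(2 * (f - min(f))))
  sqrt(cov.wt(cbind(p), wt = w / sum(w))$cov[1, 1])
}

sd_broad  <- weighted_sd(4)   # sweep of +/- 4 std devs
sd_narrow <- weighted_sd(1)   # sweep of only +/- 1 std dev
c(sd_broad, sd_narrow)
```

With the broad sweep the estimate is close to 1, while the narrow sweep gives an estimate of only a little over half the true value, because the tails of the weight distribution were never sampled.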

It needs to be kept in mind that the covariance matrix returned by the weighted mean method assumes that the confidence interval is symmetrically distributed about the best-fit value. In practice, this isn’t always the case; sometimes the plots of the neg log likelihood vs parameter hypotheses, instead of looking like they have a symmetric parabolic envelope, have a highly asymmetric parabolic envelope, like this, for example:

The weighted mean method will essentially produce a one standard deviation estimate that is derived from an “average” symmetric parabola fit to the asymmetric parabola. It will tend to underestimate confidence intervals in such cases. When you have highly asymmetric parabolic envelopes in your plots of the neg log likelihood vs your parameter hypotheses, it is thus best to use the fmin+1/2 method.
