Data expressed as per-capita rates are frequently encountered in the life and social sciences; examples include the per-capita incidence rate of a certain disease, or per-capita crime rates. In both examples, “per-capita” means “per person”. Because the rate is expressed “per person”, it is easy to get confused and think that Binomial regression methods might be most appropriate, since the data appear at first blush to be expressed as a fraction of the population. However, Binomial regression methods are only appropriate for fractions that are strictly constrained to lie between 0 and 1. When we are talking about crime rates, for example, it’s entirely possible, if crime were rampant, that someone might be robbed several times a year. The per-capita rate could thus be above 1!

When talking about per-capita rates, the data consist of a counted number of events per unit time, so Poisson or Negative Binomial regression methods are appropriate. But we have to account for the population size, M, because if the population doubles, the number of crimes we would count per unit time would also double (if the per-capita rate stayed the same).

To take this into account in a regression analysis, we use what is called “population-standardized” regression. Recall that Poisson regression methods use a log link for the expected number of events, lambda. In population-standardized linear regression with one explanatory variable, x (for example), we thus have

$$\log(\lambda) = \log(M) + \beta_0 + \beta_1 x$$

Notice that if we bring the log(M) over to the LHS, we get log(lambda/M)… the log of the per-capita rate!

Also notice that the log(M) term does not (nor should it) have a coefficient multiplying it in the fit. Thus, if your observed numbers of crimes are in the vector y, what you do NOT want to do is the following (recall that the glm function with family=poisson in R uses a log-link by default):

b = glm(y~log(M)+x,family="poisson")

This is because R will attempt to find some kind of best-fit coefficient for the log(M) term, when what we actually need to do is force its coefficient to be 1 to be able to interpret our output as a per-capita rate.

The way to do this in R is to use the offset() function:

b = glm(y~offset(log(M)) + x, family="poisson")

Now, in the fit the log(M) term is forced to have coefficient equal to one.
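As a quick sanity check of the difference between the two calls, here is a minimal sketch using simulated data. The population sizes, the "true" coefficients (-3 and 0.4), the seed, and all of the variable names are made up purely for illustration.

```r
set.seed(1234)   # arbitrary seed, so the sketch is reproducible

N = 500
M = round(runif(N,1e3,1e5))   # hypothetical population sizes
x = runif(N,0,2)              # hypothetical explanatory variable
beta0_true = -3
beta1_true = 0.4

# expected number of events = population size * per-capita rate
lambda = M*exp(beta0_true+beta1_true*x)
y = rpois(N,lambda)

# coefficient of log(M) forced to be 1
fit_offset = glm(y~offset(log(M))+x,family="poisson")

# coefficient of log(M) left free (what we do NOT want)
fit_free = glm(y~log(M)+x,family="poisson")

print(coef(fit_offset))
print(coef(fit_free))
```

The offset fit has just two coefficients, which should land close to the values used in the simulation. The second fit carries an extra fitted coefficient for log(M); it comes out near 1 here only because the simulated data were generated that way.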

**Example**

Let’s look at the annual per-capita rates of public mass shootings before, during, and after the Federal Assault Weapons Ban, which was in effect from September 14, 1994 to September 13, 2004. The file mass_killings_data_1982_to_2017.csv contains the annual number of public mass shootings from 1982 to 2017, as obtained from the Mother Jones mass shootings database, supplemented with a few public mass shootings they missed, as listed in the USA Today mass killings database. Also in the file is the US population that year (in millions), as obtained from the US Census Bureau. There is also a categorical variable, lperiod, indicating whether the year was before (lperiod=0), during (lperiod=1), or after (lperiod=2) the assault weapons ban (note that lperiod=1 when the weapons ban was in place for most of that particular year). Also included in the file is the number of people killed in mass shootings each year.

The following code reads in this data, and does a population standardized fit, then plots the data and fit results:

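# read in the data (assumes a comma-separated file with a header row)
adat = read.table("mass_killings_data_1982_to_2017.csv",header=T,as.is=T,sep=",")
# population-standardized fit with just an intercept
b = glm(number_shootings~offset(log(pop)),family="poisson",data=adat)
# plot the data, with the fit overlaid
plot(adat$year,adat$number_shootings,cex=2,xlab="Date",ylab="Number mass shootings")
lines(adat$year,b$fitted.values,col=2,lwd=5)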

You can see that the fit estimates properly take into account that the US population went up from 1982 to 2017, and thus, even if the per-capita rate were the same over that entire period, we would expect a higher annual *number* of events in 2017 compared to 1982.

The summary of the fit from summary(b) yields

To interpret this output, we note that the glm family="poisson" fit uses a log link. Thus we must take the exponential of the intercept term to get the average expected per-capita annual rate. This is exp(-4.6701)=0.0094 per million people, per year.

We can compare this to what we get if we just take the average of the number of shootings per million people per year: mean(adat$number_shootings/adat$pop)=0.0091. The two numbers are almost identical… and they should be!
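The two numbers are close rather than exactly equal because an intercept-only Poisson fit with an offset estimates the pooled rate, sum(y)/sum(M), rather than the mean of the yearly rates. Here is a minimal sketch of that, using made-up data in place of the mass shootings file (the population values, the rate of 0.01 per million, and the seed are all hypothetical):

```r
set.seed(2718)   # arbitrary seed

year = 1982:2017
pop = seq(230,325,length=length(year))   # hypothetical population, in millions
y = rpois(length(year),0.01*pop)         # hypothetical annual counts

b = glm(y~offset(log(pop)),family="poisson")

rate_from_fit = exp(coef(b)[1])
pooled_rate = sum(y)/sum(pop)
mean_of_yearly_rates = mean(y/pop)

cat("Rate from the fit:        ",rate_from_fit,"\n")
cat("Pooled rate sum(y)/sum(M):",pooled_rate,"\n")
cat("Mean of the yearly rates: ",mean_of_yearly_rates,"\n")
```

The first two numbers agree to within the convergence tolerance of the fit, while the third is generally a little different because it weights every year equally, regardless of population.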

The Binomial probability distribution is appropriate for modelling the stochasticity in data that either consist of 1’s and 0’s (where 1 represents a “success” and 0 represents a “failure”), or fractional data like the total number of “successes”, k, out of n trials.

*Note that if our data set consists of n 1’s and 0’s, k of which are 1’s, we could alternatively express our 1’s and 0’s data as k successes out of n trials.*

There are other probability distributions that can be used to model the stochasticity in fractional data, like the Beta Binomial distribution, but the Binomial probability distribution is the simplest of the probability distributions for modelling the number of successes out of n trials. The Binomial probability mass function for observing k successes out of n trials, when a fraction p is expected, is

$$P(k \mid n, p) = \binom{n}{k}\, p^{k} (1-p)^{n-k}$$

The parameter p is our “model expectation”; it can be just a constant, or a function of some explanatory variables, and it is the expected value of k/n.

Note that if our data are 1’s and 0’s, each point could be considered a “trial”, where k could be either 1 or 0 for each data point, and n would be 1 for each data point. This special case of the Binomial distribution is known as the Bernoulli distribution.

Our predicted probability of success, p, could, in theory at least, be a linear function of some explanatory variable, x:

$$p = \beta_0 + \beta_1 x$$

*However, this can present problems, because p must necessarily lie between 0 and 1 (because it is a fraction), but the explanatory variable might be negative, or even if it were positive, beta_0 and beta_1 might be such that the predicted value for p lies outside 0 and 1. This is a problem!*

Partly for this reason, Binomial regression generally assumes what is known as a “logit-link”. The logit of a fraction is log(p/(1-p)), also known as the log-odds, because p/(1-p) is the odds of success. It is this logit link that gives “logistic regression” its name.

Note that because p lies between 0 and 1, p/(1-p) lies in the range of 0 to infinity. This means that the logit of p (the log of the odds) lies between -infinity and +infinity. With the logit-link, we regress the logit of p on the explanatory variables. For linear regression with one explanatory variable, this looks like:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

Because the logit lies in the range of -infinity to +infinity, now it doesn’t matter if the expression on the RHS of the equation is negative… the reverse transform will always give back a value of p between 0 and 1.

Nice!
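As a minimal sketch of the logit and its reverse transform in R (the helper function names logit and inverse_logit are made up for this illustration; qlogis() and plogis() are the built-in R equivalents):

```r
# the logit (log-odds), and the reverse transform back to a fraction
logit = function(p){ log(p/(1-p)) }
inverse_logit = function(A){ exp(A)/(1+exp(A)) }

A = seq(-10,10,0.5)   # any real number is allowed on the logit scale
p = inverse_logit(A)
print(range(p))       # always strictly between 0 and 1

# compare to the functions built in to R
print(all.equal(p,plogis(A)))
print(all.equal(logit(p),qlogis(p)))
```

No matter how negative or positive the value on the logit scale is, the reverse transform stays between 0 and 1.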

By the way, if we call logit(p)=A, then the reverse transformation to calculate p is

$$p = \frac{e^{A}}{1+e^{A}}$$

Let’s assume that we have N data points of some observed data, k_i successes out of n_i trials, where i = 1,…,N. This could be, for example, the daily fraction of firearms the TSA detects that have a round chambered, over a period of N days. k_i is the number found each day with a round chambered, and n_i is the total number found each day. The observed fraction might, at least hypothetically, depend on time, x_i. In this case, with the logit link, our model looks like

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i$$

In order to fit for beta_0 and beta_1 (or whatever the parameters of our model are), we need some “goodness of fit” statistic that we can optimize to estimate our best-fit values of our model with Binomially distributed data…

**Binomial likelihood**

The likelihood of observing our N data points, k_i, out of n_i when p_i are expected for each point is the product of the individual Binomial probabilities:

$$L = \prod_{i=1}^{N} \binom{n_i}{k_i}\, p_i^{k_i} (1-p_i)^{n_i-k_i} \qquad \text{(Eqn 1)}$$

The “best-fit” parameters in the functional dependence of p_i on the explanatory variable, x_i (or variables… there doesn’t need to just be one), are the parameters that maximize this likelihood.

However, just as was pointed out in our discussion of Poisson regression methods for count data, in practice, underflow problems happen when you multiply a whole bunch of likelihoods (probabilities) together, each of which is between 0 and 1. To avoid this, what is normally done is to take the logarithm of both sides of Eqn 1, and what is maximized is the logarithm of the likelihood, log(L):

$$\log L = \sum_{i=1}^{N} \left[ \log\binom{n_i}{k_i} + k_i \log(p_i) + (n_i-k_i)\log(1-p_i) \right]$$

The R glm() method with the family="binomial" option allows us to fit linear models to Binomial data, using a logit link, and the method finds the model parameters that maximize the above likelihood. If the success data is in a vector, k, and the number of trials data is in a vector, n, the function call looks like this:

myfit = glm(cbind(k,n-k)~x,family="binomial")
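To make the underflow issue concrete, and as a rough check that glm() is indeed working with the log likelihood, here is a minimal sketch with simulated data. The sample size, the number of trials per point, the model coefficients, and the seed are all hypothetical choices for illustration.

```r
set.seed(8675)   # arbitrary seed

N = 2000
x = seq(0,1,length=N)
n = rep(50,N)                          # hypothetical number of trials per point
p_true = exp(-1+2*x)/(1+exp(-1+2*x))   # hypothetical true model
k = rbinom(N,n,p_true)

myfit = glm(cbind(k,n-k)~x,family="binomial")
p_fit = myfit$fitted.values

# the product of the N Binomial probabilities underflows to zero
likelihood = prod(dbinom(k,n,p_fit))

# the sum of the logs of the probabilities is well behaved
log_likelihood = sum(dbinom(k,n,p_fit,log=TRUE))

cat("Likelihood:",likelihood,"\n")
cat("Log likelihood:",log_likelihood,"\n")
cat("Log likelihood reported by glm():",as.numeric(logLik(myfit)),"\n")
```

The product of probabilities is numerically zero for any parameter values we might try, so it is useless for optimization, whereas the summed log likelihood is finite and matches what logLik() reports for the fit.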

The glm() binomial method can also be used with data that are a bunch of 1’s and 0’s. For our little example here, the data might be at the individual firearm level, where ‘1’ indicates that the firearm has a round chambered, and ‘0’ indicates that it doesn’t. In this case, if the vector found_with_round_chambered contains these zeros and ones for all the firearms, and the vector day_gun_found contains the day each firearm was found (relative to some start day), we can fit to this data using the function call

myfit = glm(found_with_round_chambered~day_gun_found,family="binomial")

Note that in both cases, **it is exactly the same data**, just expressed a different way (you can always aggregate the 1’s and 0’s by day to get the total number of firearms found, and the number found with a round chambered each day, for example). This duality in how you can look at logistic regression is sometimes confusing to students who have been exposed to logistic regression methods either just using 0’s and 1’s, or just using fractional data.

**Example**

Let’s simulate some Binomial data, with a trend in time. In the example described above, perhaps this might be firearms detected at TSA airport checkpoints, and determining whether they had a round chambered (“1”), or didn’t have a round chambered (“0”). In this simulated example, the logit of the fraction loaded, p, has the predicted trend

set.seed(541831)
vday = seq(0,2*365)
vlogit_of_p_predicted = -1+0.005*vday
vp_predicted = exp(vlogit_of_p_predicted)/(1+exp(vlogit_of_p_predicted))

At time vday=0, the predicted average fraction of firearms found with a round chambered is thus:

p=exp(-1)/(1+exp(-1)) = 0.269

At time vday=730, the predicted average fraction of firearms found with a round chambered is:

p=exp(-1+0.005*730)/(1+exp(-1+0.005*730)) = 0.934

Let’s simulate some data where we assume the TSA detects exactly 10 firearms per day. We’ll simulate the data at the firearm level, where we record the day each firearm was found, and if it had a round chambered:

num_guns_found_per_day = 10
wfound_with_round_chambered = numeric(0)
wday_gun_found = numeric(0)
for (i in 1:length(vday)){
  # one Bernoulli draw (1 or 0) for each firearm found that day
  v = rbinom(num_guns_found_per_day,1,vp_predicted[i])
  wfound_with_round_chambered = c(wfound_with_round_chambered,v)
  wday_gun_found = c(wday_gun_found,rep(vday[i],num_guns_found_per_day))
}

Notice that the wfound_with_round_chambered vector contains 1’s and 0’s.

We can recast this data instead by aggregating the number found with and without a round chambered by day:

num_aggregated_ones_per_day = aggregate(wfound_with_round_chambered,by=list(wday_gun_found),FUN="sum")
num_aggregated_zeros_per_day = aggregate(1-wfound_with_round_chambered,by=list(wday_gun_found),FUN="sum")

# the first column of the aggregate() output is the day, the second is the sum
vday = num_aggregated_ones_per_day[[1]]
vnum_found_with_round_chambered = num_aggregated_ones_per_day[[2]]
vnum_found_without_round_chambered = num_aggregated_zeros_per_day[[2]]
vnum_found = vnum_found_with_round_chambered + vnum_found_without_round_chambered

Let’s plot the simulated data

vp_observed=vnum_found_with_round_chambered/vnum_found
plot(vday,vp_observed,cex=2,xlab="Time, in days",ylab="Fraction found with a round chambered")
lines(vday,vp_predicted,lwd=4,col=2)
legend("bottomright",legend=c("Observed","Predicted"),col=c(1,2),lwd=4,bty="n")

which produces the plot:

Now let’s do a linear logistic fit using the R glm() with family="binomial" to the individual firearm data, and then to the data aggregated by day. When looking at aggregated data, we input the data to the fit as cbind(num_successes,num_failures). This can also be expressed as cbind(k,n-k), if k is num_successes, and n is the number of trials (n=num_successes+num_failures).

Note that the event and aggregated data are exactly the same data, so they should give exactly the same best-fit coefficients and uncertainties (the deviances that summary() reports do differ between the two, because they are computed relative to different saturated models).

fit_to_daily_data = glm(cbind(vnum_found_with_round_chambered,vnum_found_without_round_chambered)~vday,family="binomial")
fit_to_event_data = glm(wfound_with_round_chambered~wday_gun_found,family="binomial")

print(summary(fit_to_daily_data))
print(summary(fit_to_event_data))

This produces the output:

We can plot the fit results overlaid on the data. Note that even though glm() uses the logit link, it converts the fit prediction to a probability to save you the work of doing it.

plot(vday,vp_observed,cex=2,xlab="Time, in days",ylab="Fraction found with a round chambered")
lines(vday,vp_predicted,lwd=8,col=2)
lines(vday,fit_to_daily_data$fitted.values,lwd=4,col=4,lty=3)
legend("bottomright",legend=c("Observed","True model","Fitted model"),col=c(1,2,4),lwd=4,bty="n")

Producing the plot:

You can see that our fitted model is pretty close to the true model. This is because there are many data points (10 each day, for two years). If the data were much more sparse, we would expect to perhaps see a bit more deviation of the fitted model from the true, not because the true model is wrong (it is after all, the true model we used to simulate our data), but because with a sparse data set the fit gets more affected by stochastic variations in the data.
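On the point that glm() hands back its fitted values on the probability scale rather than the logit scale, here is a minimal sketch using predict() with type="link" and type="response". The simulated data, coefficients, and seed are made up for illustration.

```r
set.seed(4321)   # arbitrary seed

x = seq(0,10,0.1)
n = rep(20,length(x))      # hypothetical number of trials per point
p_true = plogis(-1+0.3*x)  # hypothetical true model
k = rbinom(length(x),n,p_true)

myfit = glm(cbind(k,n-k)~x,family="binomial")

# prediction on the logit scale (the linear predictor)
logit_p = predict(myfit,type="link")
# prediction on the probability scale
p_hat = predict(myfit,type="response")

# the fitted.values are on the probability scale...
print(all.equal(p_hat,myfit$fitted.values))
# ...and are the reverse transform of the linear predictor
print(all.equal(p_hat,exp(logit_p)/(1+exp(logit_p))))
```

Both comparisons should print TRUE. To get predictions on the logit scale instead, for example to plot the fitted straight line, type="link" is the option to use.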

The script example_glm_binomial_fit.R does the above fit. You can try different values of number of guns found per day, and different model coefficients to see how it affects the simulated data and the fits.

Note that in this example we assumed a constant number of firearms found per day… we could have varied that, if we wanted, and it would not change the linear dependence of the logit of the probability of finding a firearm with a round chambered…. whether the number found per day is 1, or 100000 (or whatever), it doesn’t affect the probability of success.


The Poisson probability distribution is appropriate for modelling the stochasticity in count data; for example, the number of people per household, the number of crimes per day, or the number of Ebola cases observed in West Africa per month.

There are other probability distributions that can be used to model the stochasticity in count data, like the Negative Binomial distribution, but the Poisson probability distribution is the simplest of the discrete probability distributions for count data. The Poisson probability mass function for observing k counts when lambda are expected is:

$$P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

The lambda is our “model expectation”, and it might be just a constant, or a function of some explanatory variables.

For example, perhaps we are examining how the number of crimes per day, k, might linearly depend on the daily average temperature, x. In this case, our model equation for lambda might be

$$\lambda = \beta_0 + \beta_1 x$$

where beta_0 and beta_1 are parameters of the model. But note that temperature can be negative, which might lead to negative values of the model expectation… clearly for count data this makes no sense!

With Poisson regression, we thus almost always use what is known as a “log-link” where we assume that the logarithm of lambda depends on the explanatory variables… this always ensures that lambda itself is greater than zero no matter what beta_0, beta_1 or x are:

$$\log(\lambda) = \beta_0 + \beta_1 x$$

Now, we might not know what beta_0 and beta_1 are, but if we have a bunch of observations of crimes over a series of N days, k_i (with i=1,…,N), and we also have for the same days, the average daily temperature, x_i, we can fit for beta_0 and beta_1 to determine which values best describe the observed relationship between the x_i and k_i. Our model for the expected number of crimes on the i^th day is thus

$$\log(\lambda_i) = \beta_0 + \beta_1 x_i, \qquad \text{i.e.} \qquad \lambda_i = e^{\beta_0 + \beta_1 x_i}$$

Using our collected data, we’d like to somehow estimate the “best-fit” values of beta_0 and beta_1 to the data. If the number of crimes per day is low, we can’t use least squares linear regression to do this because that method assumes that the data are Normally distributed, and it is only for large values of lambda that the Poisson distribution approaches the Normal.

We thus need a “goodness of fit” statistic that is appropriate to Poisson distributed data….

**Poisson likelihood**

The likelihood (probability) of observing our data, k_i, given our model predictions for each data point, lambda_i, is the product of the probabilities of observing each data point separately:

$$L = \prod_{i=1}^{N} \frac{\lambda_i^{k_i} e^{-\lambda_i}}{k_i!} \qquad \text{(Eqn 1)}$$

*Our “best fit” values of lambda_i for this model are the ones that will maximize this probability.* The least squares goodness-of-fit statistic is one that is usually quite easy for students to visualize. Likelihood fit statistics, however, are often more difficult to conceptualize because there isn’t a nice visual diagram that can explain it (like the arrows showing the distance between points and a model prediction, like we showed for least squares regression, for example).

However, for non-Normally distributed data, if you know what probability distribution underlies the data, you can write the likelihood for observing a set of data by taking the product of the individual probabilities obtained from the probability distribution, just like we did above. The “best-fit” model maximizes that likelihood.

**Fitting for the model parameters with Poisson likelihood**

Note that the probabilities that are multiplied in Eqn 1 are always between 0 and 1, and thus for a sample size of N points, Eqn 1 involves multiplying N values between 0 and 1 together. This can easily lead to underflow errors in our computation, which is a real problem for us when we try to apply this in practice. The solution to this is to take the logarithm of both sides of Eqn 1. Before we do that, here is a bit of a refresher on logarithms:

$$\log(ab) = \log(a) + \log(b), \qquad \log(x^{c}) = c\,\log(x)$$

That is to say, the log of a product is the sum of the logs of the terms in the product. The log of x to some power is the same as that power times the logarithm of x. In this case, we will be taking the “natural log” (which is log_e, log to the base e) of both sides of Eqn 1. The natural log of e is log(e) = 1. Taking the natural logarithm of both sides of Eqn 1 thus yields:

$$\log L = \sum_{i=1}^{N} \left[ k_i \log(\lambda_i) - \lambda_i - \log(k_i!) \right] \qquad \text{(Eqn 2)}$$

Poisson regression has been around for a long time, but least squares regression methods have been around longer. Finding the best-fit in least squares regression involves finding the parameters that *minimize* the least squares statistic. But finding the best-fit in Poisson regression involves finding the parameters in lambda_i that *maximize* Eqn 2. The inner workings of the optimization methods in statistical software packages are typically written to minimize goodness of fit statistics, mostly because of the least squares legacy.

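Because of this, we take the negative of both sides of Eqn 2, and we say that the best-fit parameters in Poisson regression *minimize the negative log likelihood:*

$$-\log L = \sum_{i=1}^{N} \left[ \lambda_i - k_i \log(\lambda_i) + \log(k_i!) \right] \qquad \text{(Eqn 3)}$$

For the special case of our linear model for log(lambda_i) that we are considering, we get:

$$-\log L = \sum_{i=1}^{N} \left[ e^{\beta_0 + \beta_1 x_i} - k_i\,(\beta_0 + \beta_1 x_i) + \log(k_i!) \right]$$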

Given some data k_i and x_i, the “best-fit” values of beta_0 and beta_1 minimize that expression. We could, in practice, guess a whole bunch of different values for beta_0 and beta_1, plug them into Eqn 3, and narrow it down to which pair of values appears to give the smallest negative log likelihood. However, principles of calculus can be used to find the best-fit values of beta_0 and beta_1 that minimize the expression in Eqn 3. These methods are used in the inner workings of the R least squares linear regression lm() function, which is used when the response variable is Normally distributed.

When working with a linear regression model with Poisson distributed count data, the R generalized linear model method, glm(), can be used to perform the fit using the family="poisson" option. Just like with the R least squares method, invisible to you, the inner workings of the glm() method use calculus principles to find the best-fit model parameters that minimize the Poisson negative log likelihood. If the response data (our k_i) are in a vector y, and our explanatory variable, x_i, is in a vector x, and we are fitting a linear Poisson model, the function call looks like this:

myfit = glm(y~x,family="poisson")

Note that even though a log-link hasn’t been specified for the linear model, that is in fact what the glm() model with family=poisson by default assumes.
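To illustrate the “guess a whole bunch of values” approach mentioned above, here is a minimal sketch that scans a grid of guesses for beta_0 and beta_1, and compares the pair with the smallest Poisson negative log likelihood to what glm() finds. The simulated data, the grid ranges and step sizes, the seed, and the function name negloglike are all hypothetical choices made for this illustration.

```r
set.seed(97531)   # arbitrary seed

x = seq(0,10,0.1)
y = rpois(length(x),exp(0.5+0.2*x))   # hypothetical true model

# Poisson negative log likelihood for a linear model with a log link
negloglike = function(beta0,beta1){
  log_lambda = beta0+beta1*x
  sum(exp(log_lambda)-y*log_lambda+lfactorial(y))
}

# brute-force scan over a grid of guesses for beta0 and beta1
vbeta0 = seq(0,1,0.005)
vbeta1 = seq(0.1,0.3,0.001)
grid = expand.grid(beta0=vbeta0,beta1=vbeta1)
grid$nll = mapply(negloglike,grid$beta0,grid$beta1)
best = grid[which.min(grid$nll),]

# compare to the glm() fit
myfit = glm(y~x,family="poisson")
print(best)
print(coef(myfit))
```

The grid scan gets close to the glm() estimates, but only to within the coarseness of the grid, and it needed tens of thousands of likelihood evaluations for just two parameters. That is why calculus-based methods are used in practice.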

**Example**

Let’s try fitting some simulated data with the glm() method with family="poisson". The following code randomly generates some Poisson distributed data, with a linear model with a log-link:

########################################################################

# randomly generate some Poisson distributed data according to a linear model

########################################################################

set.seed(484272)

x = seq(0,100,0.1)

intercept_true = 1.5

slope_true = 0.05

log_lambda = intercept_true+slope_true*x

pred = exp(log_lambda)

y = rpois(length(x),pred)

########################################################################

# put the data in a data frame

########################################################################

mydat=data.frame(x=x,y=y)

Now let’s fit a linear model to these simulated data, under the assumption that the stochasticity is Poisson distributed. Note that the plotting area is divided up with the mult.fig() method in the R sfsmisc library. You need to have this library installed in R to run that line of code. If you don’t have it installed, first type install.packages("sfsmisc") and pick a download site relatively close to your location.

########################################################################
# Do the model fit using glm. Note that glm() with family="poisson"
# inherently assumes a log-link to the data
########################################################################
myfit_glm = glm(y~x,family=poisson,data=mydat)
print(summary(myfit_glm))

########################################################################
# Now plot the data with the fitted values overlaid. Note that
# even though the glm() method with family=poisson assumes a
# log-link, what it spits out in the fitted.values attribute
# is the exponential of the log-link prediction
########################################################################
require("sfsmisc")
mult.fig(4,main="Poisson simulated data, generated with the model log(y)=a+b*x")
plot(x,y,xlab="x",ylab="y",cex=2,col="darkorchid4",main="y versus x")
lines(x,myfit_glm$fitted.values,col=3,lwd=5)
legend("topleft",legend=c("Simulated data","Best-fit Poisson linear model from R glm(y~x)"),col=c("darkorchid4",3),lwd=5,bty="n",cex=0.6)

plot(x,log(y),xlab="x",ylab="log(y)",cex=2,col="darkorchid4",main="log(y) versus x")
lines(x,log(myfit_glm$fitted.values),col=3,lwd=5)

The code produces the following output:

Are the fitted linear values statistically consistent with the true values we used to simulate the data? Do a z-test to check.
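One way the z-test might be done is sketched below. This is just a minimal illustration, which regenerates the simulated data with the same seed and true values as above, and compares each fitted coefficient to its true value using the standard errors from summary(); the names est, se, z, and pvalue are made up for the sketch.

```r
set.seed(484272)
x = seq(0,100,0.1)
intercept_true = 1.5
slope_true = 0.05
y = rpois(length(x),exp(intercept_true+slope_true*x))

myfit_glm = glm(y~x,family=poisson)

# fitted values and their one standard deviation uncertainties
est = summary(myfit_glm)$coef[,1]
se = summary(myfit_glm)$coef[,2]

# z statistic: number of standard deviations between fitted and true
z = (est-c(intercept_true,slope_true))/se

# two-sided p-value, from the Normal distribution
pvalue = 2*pnorm(-abs(z))
print(cbind(est,se,z,pvalue))
```

If the fitted values are statistically consistent with the true values, we would expect |z| to be less than around 2 most of the time (about 95% of the time, if we repeated the simulation with many different seeds).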

**Another example, with more than one explanatory variable**

Let’s look at some real data…

The file chicago_crime_summary.csv contains the daily number of crimes in Chicago, sorted by FBI Uniform Crime Reporting code, from 2001 to 2013. FBI UCR code 4 is aggravated assaults (column x4 in the file). The file chicago_weather_summary.csv contains daily average weather variables for Chicago, including temperature, humidity, air pressure, cloud cover, and precipitation. The R script AML_course_libs.R contains some helper functions, including convert_month_day_year_to_date_information(month,day,year) that converts month, day, and year to a date expressed in fractions of years.

The following R code reads in these data sets, and meshes the temperature data into the crime data set. A few days are missing temperature data, so we remove those days from the data set. If you do not have the chron library already installed in R, first install it using install.packages("chron"), and pick a download site close to your location.

require("chron")

cdat = read.table("chicago_crime_summary.csv",header=T,as.is=T,sep=",")
wdat = read.table("chicago_weather_summary.csv",header=T,as.is=T,sep=",")

# match the weather data to the crime data by Julian day
cdat$jul = julian(cdat$month,cdat$day,cdat$year)
cdat$temperature = wdat$temperature[match(cdat$jul,wdat$jul)]
cdat$weekday = day.of.week(cdat$month,cdat$day,cdat$year)
# remove the days that are missing temperature data
cdat = subset(cdat,!is.na(cdat$temperature))

source("AML_course_libs.R")
a = convert_month_day_year_to_date_information(cdat$month,cdat$day,cdat$year)
cdat$date = a$date

To regress the daily number of assaults (the column x4 in the data frame) on temperature, we use the R glm() method with family=poisson:

myfit = glm(cdat$x4~cdat$temperature,family=poisson)

require("sfsmisc")
mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fitted.values,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

This produces the following plot:

The fit clearly needs a linear trend in time in order to fit the data better. The following code adds that:

myfit = glm(cdat$x4~cdat$temperature+cdat$date,family=poisson)

mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fitted.values,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

This produces the following plot:

This looks to be a better fit.

But is the stochasticity in the data really consistent with being Poisson distributed? Just like the QQ plot we made with the Least Squares regression fits to test whether or not the data were truly Normally distributed about the model hypotheses, we can make a similar set of plots, but for the Poisson distribution. The AML_course_libs.R script contains a function

overlay_expected_distribution_from_poisson_glm_fit = function(count_data,glm_model_object)

that takes as its arguments the vector of count data, and the best-fit linear model from the glm() method.

In the first part of this function, for each data point it determines the shape of the probability mass function given the model prediction for that point… it then adds these mass functions up for all the data points. When we histogram the data, we can overlay this “Poisson model expectation” curve.

The second part of the script creates a QQ plot of the quantiles of the ranked data, vs the quantiles of a simulated data set, simulated assuming the best-fit model with Poisson stochasticity. If the data truly are Poisson distributed about the model, we would expect this plot to be linear. The following code implements this function with our data and our model to produce the plot:

overlay_expected_distribution_from_poisson_glm_fit(cdat$x4,myfit)

Even though our model with temperature plus linear trend in time is a better fit to the data than the model with just temperature, you can see that the above plots show that the data aren’t quite Poisson distributed about the model predictions. In fact, the QQ plot diagnostics indicate that the distribution appears to have some evidence of fat tails. This could point to potential confounding variables we haven’t yet taken into account (like, perhaps we might consider adding weekdays or holidays as factor levels in the fit). However, the data don’t appear to be grossly over-dispersed compared to the stochasticity expected from Poisson distributed data. Here is an example of including a factor in the explanatory variables (in this case weekday):

myfit = glm(cdat$x4~cdat$temperature+cdat$date+factor(cdat$weekday),family=poisson)
print(summary(myfit))

mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fitted.values,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

overlay_expected_distribution_from_poisson_glm_fit(cdat$x4,myfit)
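As a rough numerical companion to these plots, one commonly used check for over-dispersion is the sum of the squared Pearson residuals divided by the residual degrees of freedom, which should be close to 1 if the data are Poisson distributed about the model. This is not part of the AML_course_libs.R helper functions; the sketch below is a minimal illustration with simulated data, and the function name dispersion_statistic, the model coefficients, and the seed are made up for the example.

```r
set.seed(13579)   # arbitrary seed

x = seq(0,10,0.01)
lambda = exp(1+0.2*x)   # hypothetical true model

# Poisson distributed data, and over-dispersed (Negative Binomial) data
y_pois = rpois(length(x),lambda)
y_nb = rnbinom(length(x),mu=lambda,size=2)

# sum of squared Pearson residuals, over the residual degrees of freedom
dispersion_statistic = function(fit){
  sum(residuals(fit,type="pearson")^2)/df.residual(fit)
}

fit_pois = glm(y_pois~x,family=poisson)
fit_nb = glm(y_nb~x,family=poisson)

cat("Dispersion, Poisson data:       ",dispersion_statistic(fit_pois),"\n")
cat("Dispersion, over-dispersed data:",dispersion_statistic(fit_nb),"\n")
```

For the Chicago fit, the same function could be applied to myfit. A value well above 1 would suggest that the data are over-dispersed relative to the Poisson expectation.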

**Some cane waving…**

When I was a lass, working on my degree in experimental particle physics, we had to do model fitting very frequently. However, while we had a Fortran (and later, a C++) package that performed gradient descent optimization (or other optimization methods) of some function that you fed it, we didn’t have convenient pre-packaged methods like lm() or glm() where you could just fit a linear model with one tidy line of code. Instead, we had to write the code to actually program the likelihood ourselves.

We also had to walk to school ten miles a day, barefoot, through waist deep snow, even in the summer, and it was uphill both ways.

Get off my lawn.

While it can be a pain to have to code up the actual likelihood expression, the advantage of that stone age methodology was that we had to think carefully about what kind of stochasticity underlay our data, and code up the appropriate likelihood function (or least squares expression, if the stochasticity was Normally distributed). Using canned methods in statistical software packages for doing fitting can unfortunately sometimes lead to decreased understanding of what’s really going on with the fit.

Believe it or not, particle physicists still do fitting the same way they always have, coding up the likelihood function themselves. And they probably always will. Because it is critically important when testing hypotheses that you not only have your model right (i.e., accounting for all potential confounding variables, and ensuring that the functional expression of the model is appropriate), but that you also have the correct specification of the probability distribution describing stochasticity in the data. **Otherwise your p-values testing your null hypothesis are garbage.**

**Getting up close and personal with Poisson regression in R**

R has a method called optim() that finds the parameters that minimize the function you feed to it. Unlike the glm() method, which can only find the parameters of a linear model, the optim() method can find the parameters of any kind of model. For instructive purposes, to show how optim() works, let’s code up the Poisson negative log likelihood, use optim() to fit a linear model to some data, and compare the results to what we get out of the glm() method with family="poisson". The two methods should yield the same results. Describing the optim() method also gives you a better idea of what R is doing inside the guts of the glm() method.

The R script poisson_and_optim.R defines the following functions that define a linear model with a log-link, and also calculate the Poisson negative log likelihood, given some data vectors x and y contained in a data frame, mydata_frame.

########################################################################
########################################################################
# this is the function to calculate our linear model, assuming
# a log link
########################################################################
mymodel_log_prediction = function(mydata_frame,par){
   log_model_prediction = par[1] + par[2]*mydata_frame$x
   return(log_model_prediction)
}

########################################################################
########################################################################
# this is a function to compute the Poisson negative log likelihood
########################################################################
poisson_neglog_likelihood_statistic = function(mydata_frame,par){
   model_log_prediction = mymodel_log_prediction(mydata_frame,par)
   # lfactorial(y) is log(y!)
   neglog_likelihood = sum(-mydata_frame$y*model_log_prediction
                           +exp(model_log_prediction)
                           +lfactorial(mydata_frame$y))
   return(neglog_likelihood)
}
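As a quick sanity check on the likelihood expression (a sketch that is not part of poisson_and_optim.R; the toy data and parameter values are made up for illustration), the hand-coded negative log likelihood can be compared to the one obtained from R's built-in dpois() density:

```r
# Toy data (hypothetical values, chosen only for illustration)
toy = data.frame(x=c(0,1,2,3,4),y=c(4,6,7,12,20))
par = c(1.5,0.4)   # intercept and slope on the log scale

# Hand-coded Poisson negative log likelihood with a log link
log_lambda = par[1] + par[2]*toy$x
nll_by_hand = sum(-toy$y*log_lambda + exp(log_lambda) + lfactorial(toy$y))

# The same quantity from the built-in Poisson density
nll_dpois = -sum(dpois(toy$y,lambda=exp(log_lambda),log=TRUE))

cat("By hand:",nll_by_hand," from dpois():",nll_dpois,"\n")
```

The two numbers should agree, which is a cheap way to catch sign errors when you code up a likelihood yourself.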

Now, we need some data to fit to. The R script also has code that simulates some data with Poisson distributed stochasticity according to a linear model with a log-link (same as the first example we showed above):

########################################################################
# randomly generate some Poisson distributed data according to a linear model
########################################################################
set.seed(484272)

x = seq(0,100,0.1)
intercept_true = 1.5
slope_true = 0.05
log_lambda = intercept_true+slope_true*x
pred = exp(log_lambda)
y = rpois(length(x),pred)

########################################################################
# put the data in a data frame
########################################################################
mydat=data.frame(x=x,y=y)

Now the script does the glm() fit, and the fit using the optim() method. The two methods return the results in an entirely different format, and it takes a bit more work to extract the parameter uncertainties using the optim() method:

########################################################################
# Do the model fit using glm. Note that glm() with family="poisson"
# inherently assumes a log-link to the data
########################################################################
myfit_glm = glm(y~x,family=poisson,data=mydat)
print(summary(myfit_glm))

coef = summary(myfit_glm)$coef[,1]
ecoef = summary(myfit_glm)$coef[,2]
cat("\n")
cat("Results of the glm fit:\n")
cat("Intercept fitted, uncertainty, and true:",round(coef[1],3),round(ecoef[1],5),intercept_true,"\n")
cat("Slope fitted, uncertainty, and true:",round(coef[2],3),round(ecoef[2],5),slope_true,"\n")
cat("Negative log likelihood:",-logLik(myfit_glm),"\n")
cat("\n")

########################################################################
# now do the R optim() fit
#
# The results of the fit are in much more of a primitive format
# than the results that can be extracted from an R glm() object
# For example, in order to get the parameter estimate uncertainties,
# we need to calculate the covariance matrix from the inverse of the fit
# Hessian matrix (the parameter uncertainties are the square root of the
# diagonal elements of this matrix)
# Also, if we want the best-fit estimate, we need to calculate it
# ourselves from our model function, given the best-fit parameters.
########################################################################
myfit_optim = optim(par=c(1,0),poisson_neglog_likelihood_statistic,mydata_frame=mydat,hessian=T)
log_optim_fit = mymodel_log_prediction(mydat,myfit_optim$par)

coef = myfit_optim$par
coefficient_covariance_matrix = solve(myfit_optim$hessian)
ecoef = sqrt(diag(coefficient_covariance_matrix))

cat("\n")
cat("Results of the optim fit:\n")
cat("Intercept fitted, uncertainty, and true:",round(coef[1],3),round(ecoef[1],5),intercept_true,"\n")
cat("Slope fitted, uncertainty, and true:",round(coef[2],3),round(ecoef[2],5),slope_true,"\n")
cat("Negative log likelihood:",myfit_optim$value,"\n")
cat("\n")

This produces the following output:

The following code overlays the fit results from both methods on the data:

########################################################################
# Now plot the data with the fitted values overlaid. Note that
# even though the glm() method with family=poisson assumes a
# log-link, what it spits out in the fitted.values attribute
# is the exponential of that log-link prediction
########################################################################
require("sfsmisc")
mult.fig(4,main="Poisson simulated data, generated with the model log(y)=a+b*x")
plot(x,y,xlab="x",ylab="y",cex=2,col="darkorchid4",main="y versus x")
lines(x,myfit_glm$fitted.values,col=3,lwd=5)
lines(x,exp(log_optim_fit),col=2,lwd=1)
legend("topleft",legend=c("Simulated data","Best-fit Poisson linear model from R glm(y~x)","Best-fit Poisson linear model from R optim()"),col=c("darkorchid4",3,2),lwd=5,bty="n",cex=0.6)

plot(x,log(y),xlab="x",ylab="log(y)",cex=2,col="darkorchid4",main="log(y) versus x")
lines(x,log(myfit_glm$fitted.values),col=3,lwd=5)
lines(x,log_optim_fit,col=2,lwd=1)

In this case we just did a simple linear model fit. However, with changes to the mymodel_log_prediction() method, optim() can fit arbitrarily complicated models, including non-linear models. Unlike optim(), the glm() method cannot fit non-linear models.
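To make that concrete, here is a minimal sketch of an optim() fit of a model that is non-linear in its parameters. The saturating model lambda = a*(1-exp(-b*x)), the function name nonlinear_neglog_likelihood, the simulated data, and the starting values are all assumptions made up for this illustration; they are not part of poisson_and_optim.R. The parameters are fit on the log scale so that a and b stay positive.

```r
# Simulate Poisson data from a saturating (non-linear) model
set.seed(20240)
x = seq(1,100,1)
a_true = 100
b_true = 0.05
y = rpois(length(x),a_true*(1-exp(-b_true*x)))
mydat_nl = data.frame(x=x,y=y)

# Poisson negative log likelihood for the non-linear model.
# par[1] is log(a) and par[2] is log(b)
nonlinear_neglog_likelihood = function(par,mydata_frame){
   lambda = exp(par[1])*(1-exp(-exp(par[2])*mydata_frame$x))
   return(-sum(dpois(mydata_frame$y,lambda,log=TRUE)))
}

# Starting values are a rough guess, not the true values
start = c(log(50),log(0.1))
fit_nl = optim(par=start,nonlinear_neglog_likelihood,mydata_frame=mydat_nl,hessian=TRUE)
est = exp(fit_nl$par)

cat("Fitted a and b:",est,"\n")
cat("True   a and b:",a_true,b_true,"\n")
cat("Negative log likelihood at the best fit:",fit_nl$value,"\n")
```

As with any numerical minimization, it is worth trying a few different starting values to check that optim() ends up at the same minimum each time.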

**Contents:**

- Student’s t-test of the mean of one sample
- Example of Student’s t-test of the mean of one sample
- Student’s t-test comparing the means of two samples
- Example of Student’s t-test comparing the means of two samples
- Limitations of the Student’s t-test
- Testing for equality of more than two means (ANOVA)
- One and two sample Z-tests

The Student t distribution arises when estimating the mean of a Normally distributed population, particularly when sample sizes are small, and the true population standard deviation is unknown.

**Using the Student’s t-test to test whether a sample mean is consistent with some value**

If we wish to test the null hypothesis that the mean of a sample of Normally distributed values is equal to mu, we use the Student’s t statistic

$$t = \frac{\bar{x}-\mu}{s/\sqrt{n}}$$

with degrees of freedom

$$\nu = n-1$$

where \bar{x} is the sample mean, s is the sample standard deviation, and n is the sample size. The R t.test(x,mu) method tests the null hypothesis that the mean of the population from which a vector of data points, x, was drawn is equal to mu, under the assumption that the data are Normally distributed.
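To see what t.test() is doing under the hood, the following sketch computes the t statistic and its two-sided p-value by hand from the standard one-sample formula, and compares them to the t.test() output. The variable names t_by_hand and p_by_hand are just illustrative.

```r
set.seed(832723)
x = rnorm(1000,0.1,0.1)
mu = 0.105
n = length(x)

# t statistic and two-sided p-value computed by hand
t_by_hand = (mean(x)-mu)/(sd(x)/sqrt(n))
p_by_hand = 2*pt(-abs(t_by_hand),df=n-1)

# the same test using the built-in method
a = t.test(x,mu=mu)

cat("t by hand:",t_by_hand," t from t.test():",a$statistic,"\n")
cat("p by hand:",p_by_hand," p from t.test():",a$p.value,"\n")
```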

**Note that it is up to the analyst to ensure that the data are, in fact Normally distributed.** The shapiro.test(x) method in R employs the Shapiro-Wilk test to test the Normality of the data.

**Example of one sample t-test**

The following R code shows an example of using the R t.test() method to do a one sample t test:

set.seed(832723)
n_1 = 1000
s = 0.1
mean_1 = 0.1
x = rnorm(n_1,mean_1,s)
t.test(x,mu=0.105)

which produces the output:

**Testing whether or not means of two samples are consistent with being equal**

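The independent two sample t-test tests whether or not two samples, X1 and X2, of Normally distributed data appear to be drawn from distributions with the same mean. The test statistic is calculated as

$$t = \frac{\bar{X}_1-\bar{X}_2}{s_{\bar{\Delta}}}$$

with, under the assumption that the variances of the two samples are unequal,

$$s_{\bar{\Delta}} = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$

with s_1^2 and s_2^2 being the variances of the individual samples, and n_1 and n_2 being the sample sizes.

The t-distribution of the test will have degrees of freedom

$$\nu = \frac{\left(s_1^2/n_1+s_2^2/n_2\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1-1}+\frac{\left(s_2^2/n_2\right)^2}{n_2-1}}$$

This test is also known as Welch’s t-test.

If we instead assume that the two samples have equal variances, then we have

$$s_{\bar{\Delta}} = s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$$

and the test has degrees of freedom

$$\nu = n_1+n_2-2$$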

The R method t.test(x,y) tests the null hypothesis that two Normally distributed samples have equal means. The option var.equal=T implements the t-test under the hypothesis that the sample variances are equal.
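As a check on what t.test(x,y) computes, the sketch below works out the standard Welch and pooled-variance statistics and degrees of freedom by hand, and compares them to the built-in method. The variable names (t_w, df_w, t_p, df_p) are only illustrative.

```r
set.seed(832723)
x = rnorm(1000,0.1,0.1)
y = rnorm(100,0.08,0.11)
n1 = length(x)
n2 = length(y)
v1 = var(x)
v2 = var(y)

# Welch's t-test (unequal variances)
t_w = (mean(x)-mean(y))/sqrt(v1/n1+v2/n2)
df_w = (v1/n1+v2/n2)^2/((v1/n1)^2/(n1-1)+(v2/n2)^2/(n2-1))

# pooled-variance t-test (equal variances)
sp = sqrt(((n1-1)*v1+(n2-1)*v2)/(n1+n2-2))
t_p = (mean(x)-mean(y))/(sp*sqrt(1/n1+1/n2))
df_p = n1+n2-2

a = t.test(x,y)
b = t.test(x,y,var.equal=TRUE)
cat("Welch:  t by hand",t_w,"df by hand",df_w," t.test():",a$statistic,a$parameter,"\n")
cat("Pooled: t by hand",t_p,"df by hand",df_p," t.test():",b$statistic,b$parameter,"\n")
```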

When using the var.equal=T option, it is up to the analyst to do tests to determine whether or not the variances of the two samples are in fact statistically consistent with being equal. This can be achieved with the var.test(x,y) method in R, which performs an F test on the ratio of the variances of the x and y samples, under the assumption that both samples are Normally distributed.
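A minimal sketch of using var.test() on simulated data follows (the sample sizes and standard deviations are arbitrary choices for illustration). The F statistic it reports is simply the ratio of the two sample variances:

```r
set.seed(832723)
x = rnorm(1000,0.1,0.1)
y = rnorm(100,0.08,0.11)

v = var.test(x,y)
F_by_hand = var(x)/var(y)

cat("Ratio of sample variances by hand:",F_by_hand,"\n")
cat("F statistic from var.test():",v$statistic,"\n")
cat("p-value for the null hypothesis of equal variances:",v$p.value,"\n")
```

If the p-value is small, the var.equal=T option is not appropriate, and the default Welch test should be used instead.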

**Example of two sample t-test**

The following example code shows an implementation of the two sample t-test, first with the assumption of unequal variances, then with the assumption of equal variances (which is not true for this simulated data):

set.seed(832723)
n_1 = 1000
n_2 = 100
s_1 = 0.1
s_2 = 0.11
mean_1 = 0.1
mean_2 = 0.08
x = rnorm(n_1,mean_1,s_1)
y = rnorm(n_2,mean_2,s_2)
print(t.test(x,y))
print(t.test(x,y,var.equal=T))

which produces the following output:

**Limitations of Students t-test**

Limitations of using Students-t distribution for hypothesis testing of means: hypothesis testing of sample means with the Student’s-t distribution assumes that the data are Normally distributed. In reality, with real data this is often violated. When using some statistic (like the Students-t statistic) that assumes some underlying probability distribution (in this case, Normally distributed data), it is incumbent upon the analyst to ensure that the data are reasonably consistent with that underlying distribution; the problem is that the Students-t test is usually applied with very small sample sizes, in which case it is extremely difficult to test the assumption of Normality of the data. Also, we can test the consistency of equality of at most two means; the Students-t test does not lend itself to comparison of more than two samples.

**Comparing the means of more than two samples, under the assumption of equal variance**

Under the assumption that several samples have equal variance, and are Normally distributed, but with potentially different means, one way to test if the sample means are significantly different is to chain the samples together, and create a vector of factor levels that identify which sample each data point represents.

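The R aov() method assesses the ratio of the variance between the group means to the average within-group variance, using the F statistic:

$$F = \frac{\sum_{i=1}^{K} n_i\left(\bar{Y}_i-\bar{Y}\right)^2/(K-1)}{\sum_{i=1}^{K}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{Y}_i\right)^2/(N-K)}$$

where K is the number of groups, n_i is the size of group i, N is the total number of data points, \bar{Y}_i is the mean of group i, and \bar{Y} is the overall mean.

This is known as an Analysis of Variance (ANOVA). Essentially, the p-value of the F-test tests the null hypothesis that the variance of the residuals of the model is equal to the variance of the sample (that is, that knowing the factor level tells us nothing, because the group means are all equal).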

Example:

set.seed(832723)
n_1 = 1000
n_2 = 100
n_3 = 250
s = 0.1
mean_1 = 0.1
mean_2 = 0.12
mean_3 = 0.07
x = rnorm(n_1,mean_1,s)
y = rnorm(n_2,mean_2,s)
z = rnorm(n_3,mean_3,s)
vsample = c(x,y,z)
vfactor = c(rep(1,n_1)
           ,rep(2,n_2)
           ,rep(3,n_3))
a = aov(vsample~factor(vfactor))
print(summary(a))

which produces the output:

But the thing I don’t like about the aov() method is that it doesn’t give quantitative information about the means of the sample for the different factor levels. Thus, an equivalent technique that I prefer is to use the R lm() method and regress the sample on the factor levels

myfit = lm(vsample~factor(vfactor))
print(summary(myfit))

which produces the output:

Now we have some information on how the means of the factor levels differ. Note that the F statistic p-values from the lm() and aov() methods are the same.
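To check the claim that aov() and lm() are doing the same thing, the following sketch computes the one-way ANOVA F statistic by hand (the between-group mean square divided by the within-group mean square) and compares it to the values reported by both methods. The variable names ss_between, ss_within, and F_by_hand are only illustrative.

```r
set.seed(832723)
x = rnorm(1000,0.1,0.1)
y = rnorm(100,0.12,0.1)
z = rnorm(250,0.07,0.1)
vsample = c(x,y,z)
vfactor = factor(c(rep(1,1000),rep(2,100),rep(3,250)))

# group means and sizes, in the order of the factor levels
group_mean = as.numeric(tapply(vsample,vfactor,mean))
group_n = as.numeric(tapply(vsample,vfactor,length))
K = nlevels(vfactor)
N = length(vsample)

# between-group and within-group sums of squares
ss_between = sum(group_n*(group_mean-mean(vsample))^2)
ss_within = sum((vsample-group_mean[as.integer(vfactor)])^2)
F_by_hand = (ss_between/(K-1))/(ss_within/(N-K))

F_aov = summary(aov(vsample~vfactor))[[1]][["F value"]][1]
F_lm = unname(summary(lm(vsample~vfactor))$fstatistic["value"])
cat("F by hand:",F_by_hand," aov():",F_aov," lm():",F_lm,"\n")
```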

**Z test of sample mean**

If you know the true population standard deviation of the data, sigma, and want to test if the mean of a sample of size n is statistically consistent with some value, mu, you can use the Z-test

$$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$$

For a given cut on the p-value, alpha, with a two sided Z-test, we reject the null hypothesis when |Z| is greater than Z_(alpha/2), where Z_(alpha/2) is the 100*(1-alpha/2) percentile of the standard Normal distribution.

You can also do one-sided Z-tests where you test the significance of bar(X)<mu or bar(X)>mu. However, unless you have very good reason to assume some direction to the relationship, *always* do a two-sided test of significance instead.

For the two sample Z test, to compare the means of two samples when the variance is known for both, we use the statistic

$$Z = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}}$$
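The one sample Z-test needs nothing more than base R. The sketch below (with simulated data, and a value of sigma that we pretend to know) computes the Z statistic, the two-sided p-value, and the critical value for alpha=0.05, and shows that the two ways of stating the rejection rule agree:

```r
set.seed(832723)
sigma = 0.1      # assumed to be the known population std deviation
x = rnorm(1000,0.1,sigma)
mu = 0.105       # value of the mean under the null hypothesis
alpha = 0.05

Z = (mean(x)-mu)/(sigma/sqrt(length(x)))
p_two_sided = 2*pnorm(-abs(Z))

# critical value: the 100*(1-alpha/2) percentile of the standard Normal
Z_crit = qnorm(1-alpha/2)
reject = abs(Z) > Z_crit

cat("Z =",Z," two-sided p-value =",p_two_sided,"\n")
cat("Critical value =",Z_crit," reject the null:",reject,"\n")
```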

Now, recall that for large n, the Students t distribution approaches the Normal:

**For this reason, when the sample size is large, you can equivalently do a Z-test instead of a t-test, estimating sigma from the standard deviation of the sample.**

The BSDA library in R has a z.test() function that either performs a one sample Z test with z.test(x,mu,sigma.x) or a two sample Z test comparing the means of two samples with z.test(x,y,sigma.x,sigma.y)

**Example of one and two sample Z-tests compared to Student t-tests**

To run the following code, you will need to have installed the BSDA library in R, using the command install.packages("BSDA"), then choosing a download site relatively close to your location.

First let’s compare the Z-test and Students t-test for fairly large sample sizes (they should return p-values that are quite close):

require("BSDA")
set.seed(832723)
n_1 = 1000
n_2 = 100
s_1 = 0.1
s_2 = 0.11
mean_1 = 0.1
mean_2 = 0.08
x = rnorm(n_1,mean_1,s_1)
y = rnorm(n_2,mean_2,s_2)
a=t.test(x,y)
b=z.test(x,y,sigma.x=sd(x),sigma.y=sd(y))
cat("\n")
cat("Student t test p-value: ",a$p.value,"\n")
cat("Z test p-value: ",b$p.value,"\n")

This produces the output:

Now let’s do another example, but with much smaller sample sizes, and this time let’s put the means to be equal (thus the null hypothesis is true). In this case, the Students t-test is the more valid test to use:

require("BSDA")
set.seed(40056)
n_1 = 3
n_2 = 5
s_1 = 1
s_2 = 1.5
mean_1 = 0
mean_2 = 0
x = rnorm(n_1,mean_1,s_1)
y = rnorm(n_2,mean_2,s_2)
a=t.test(x,y)
b=z.test(x,y,sigma.x=sd(x),sigma.y=sd(y))
cat("\n")
cat("Student t test p-value: ",a$p.value,"\n")
cat("Z test p-value: ",b$p.value,"\n")

This produces the output:

In this example the Z-test rejects the null (even though it is true), while the Student t test fails to reject it. If this were an analysis that is made “more interesting” by finding a significant difference between the X_1 and X_2 samples, you run the risk of publishing a faulty result that incorrectly rejects the null because you used an inappropriate test. In a perfect world null results should always be considered just as “interesting” as results where you reject the null. In unfortunate reality, however, researchers tend to not even try to publish null results, leading to reporting bias (the published results are heavily weighted towards results that, incorrectly or correctly, rejected the null).

And it turns out that you’ll always get a smaller p-value from the Z-test compared to the Students t-test: in the plot above that compares the Student t distribution to the Z distribution, you’ll note that the Students t distribution has much fatter tails than the Z distribution when the degrees of freedom are small. That means, for a given value of the Z-statistic, if the number of degrees of freedom are small in calculating the sample standard deviations, the Students t-test is the much more “conservative” test (ie; it always produces a larger p-value than the Z-test). Thus, if you mistakenly use the Z-test when sample sizes are small, you run the danger of incorrectly concluding a significant difference in the means when the null hypothesis is actually true.
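The fatter tails are easy to see numerically. In the sketch below, the same (arbitrarily chosen) value of the test statistic is converted to a two-sided p-value using the Student t distribution with various degrees of freedom, and using the standard Normal distribution:

```r
stat = 2.5                       # an arbitrary value of the test statistic
dof = c(2,5,10,30,100,1000)      # degrees of freedom to try

p_t = 2*pt(-abs(stat),df=dof)    # Student t two-sided p-values
p_z = 2*pnorm(-abs(stat))        # Z-test two-sided p-value

print(data.frame(dof=dof,p_t=signif(p_t,3),p_z=signif(p_z,3)))
```

The Student t p-value is larger than the Z-test p-value for every choice of degrees of freedom, and shrinks towards it as the degrees of freedom increase.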

For large sample sizes, there is negligible difference between the Z-test and Students t-test p-values (even though the Students t-test p-values will always be slightly larger). This is why you will often see Z-tests quoted in the literature for large samples.


Let’s begin our discussion of hypothesis testing by looking at a data point, X, which under the null hypothesis is drawn from the Normal distribution with mean 0 and std deviation 1 (ie; the standard Normal distribution). Recall that the standard Normal distribution is symmetric about 0, with long tails. The further we get from zero, the lower the probability. Thus, if our observed X is close to zero, it is quite likely that it was randomly drawn from the Normal distribution. If X is far from zero, however, say… X=+4.3, the probability is low to observe such a high value of X. In fact, the probability of observing a value of X at least that high is the integral of the upper tail of the Normal distribution from X to infinity. This is called a “one tailed” test of significance. If, on the other hand, we wanted to assess the probability of observing a value of X at least that far away from zero, then we concern ourselves with the probability of observing |X| at least as large as our observed value. This is the integral of the probability distribution from -infinity to -X, plus the integral from +X to infinity. This is called a “two tailed” test of significance.
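In R, these tail integrals are given by the pnorm() function. A short sketch for the X=+4.3 example:

```r
X = 4.3

# one tailed: probability of observing a value at least as high as X
p_one_tailed = pnorm(X,lower.tail=FALSE)

# two tailed: integral from -infinity to -|X|, plus integral from +|X| to infinity
p_two_tailed = pnorm(-abs(X)) + pnorm(abs(X),lower.tail=FALSE)

cat("One tailed p-value:",p_one_tailed,"\n")
cat("Two tailed p-value:",p_two_tailed,"\n")
```

Because the Normal distribution is symmetric, the two tailed probability is just twice the one tailed probability.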

For

The p-value is the probability that we would observe data at least as extreme as ours, given our null hypothesis. Alpha is the probability cut-off at which we say that the observed data are improbable given the null hypothesis. Usually a cut-off of alpha=0.05 is used in analyses. When the p-value<alpha, we say that we have a “statistically significant” result.

The use of alpha=0.05 is somewhat controversial because it is arbitrary. Plus, one out of 20 times, we reject the null hypothesis when it is actually true. This means that many spurious “statistically significant” results can make it into the literature, especially if multiple tests of significance were done in the analysis, and the researchers did not correct their alpha for how many tests they did (for example, if we did 100 tests of significance in an analysis, even when the null hypothesis is actually true, on average we would find 5 of those tests yielded a “significant” result causing us to reject the null hypothesis).
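The multiple testing problem is easy to demonstrate with a small simulation. The sketch below does 100 one sample t-tests on data for which the null hypothesis is true (the sample size of 30 and the seed are arbitrary choices), and counts how many come out “significant”. It also shows one common way of correcting alpha for the number of tests, the Bonferroni correction, which simply divides alpha by the number of tests; that particular choice of correction is my assumption for this illustration.

```r
alpha = 0.05
n_tests = 100

# expected number of false positives, and the probability of at least one
expected_false_pos = alpha*n_tests
p_at_least_one = 1-(1-alpha)^n_tests

# the same probability with a Bonferroni-corrected alpha
alpha_bonf = alpha/n_tests
p_at_least_one_bonf = 1-(1-alpha_bonf)^n_tests

# simulate n_tests tests where the null hypothesis (mean=0) is true
set.seed(1234)
pvals = replicate(n_tests,t.test(rnorm(30))$p.value)
n_sig = sum(pvals<alpha)
n_sig_bonf = sum(pvals<alpha_bonf)

cat("Expected number of false positives:",expected_false_pos,"\n")
cat("Probability of at least one false positive:",p_at_least_one,"\n")
cat("Same probability with corrected alpha:",p_at_least_one_bonf,"\n")
cat("Simulated significant results, uncorrected and corrected:",n_sig,n_sig_bonf,"\n")
```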

Because of this problem, one psychology journal has actually banned the use of p-values in the analyses it publishes.

Type I error: Incorrectly rejecting the null hypothesis when it is actually true. Can be controlled by decreasing alpha. Also need to reduce alpha if doing multiple tests of significance.

Type II error: Incorrectly accepting the null hypothesis when it is actually false. Larger sample sizes can reduce type II errors because they give better statistical power to distinguish between null and alternate hypotheses.
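The effect of sample size on statistical power (one minus the probability of a type II error) can be explored with the power.t.test() method in the R stats package. In the sketch below, the true difference in means (0.02) and the standard deviation (0.1) are hypothetical values chosen only for illustration:

```r
# sample sizes per group to try
n_values = c(10,50,100,500)

# power of a two sample t-test to detect a true difference of 0.02
# in the means, when the std deviation is 0.1 and alpha=0.05
power = sapply(n_values,function(n){
   power.t.test(n=n,delta=0.02,sd=0.1,sig.level=0.05)$power
})

print(data.frame(n=n_values,power=round(power,3),type_II_error=round(1-power,3)))
```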

Example:

Null hypothesis (H0): “The person on trial is innocent.”

A type I error occurs when convicting an innocent person (a miscarriage of justice). “Beyond a reasonable doubt” is an attempt to make alpha in trials as small as possible to reduce the probability of rejecting this null when it is actually true.

A type II error occurs when letting a guilty person go free (an error of impunity).

A positive correct outcome occurs when convicting a guilty person. A negative correct outcome occurs when letting an innocent person go free.


When doing quantitative research in the life and social sciences and working with data, it is often necessary to mesh two or more disparate sources of data together in order to study a research question. Even though both sets of data might ostensibly cover the same geographic locales (for example, state-level data or county-level data), or the same date range, some data might be missing in one data set for some locales or times, which presents some extra difficulties in trying to mesh the data sets together. Even if two or more data sets cover the same locales or times, they may be sorted in a different order, which means there isn’t a direct one-to-one crosswalk between the rows of one data set and the rows of another.

Other sources of difficulty in meshing together data sets might be that one data set might contain information for both locales and dates, but the other data sets of interest might just be for locales at specific dates, or by date at specific locales.

Some tips for the meshing process: always start by downloading and reading in each of the data sets separately. Do an initial exploratory analysis on each of the data sets, calculating averages, making simple plots, etc to ensure the data actually appear to be what you are expecting them to be. If the data sets are complicated, with a lot of fields, I find it helpful to preprocess each data set and produce a simpler preprocessed summary file that will make future analyses with the data easier and faster.

**Example**

As an example of this, we will mesh diabetes incidence data from the CDC from 2004 to 2013 with socioeconomic data from the US Census Bureau.

**Diabetes data**

Let’s begin with the diabetes incidence data. The CDC makes obesity prevalence and diabetes prevalence and incidence data at the county level available off of its County Data Indicators website. Navigating to the site, you should see something like this:

Click on the “Diagnosed Diabetes Incidence” tab to expand it:

Click on “All States” to download the Excel file.

Some R packages exist for reading Excel files into R, but I have not had much luck with them because they always seem to require libraries that are defunct, or the packages have bugs, etc etc (in fact, the link above which describes the packages notes many of these problems). By far the best solution I have found, and that is also recommended in the link above, is to open the Excel file in Excel, and then under File->Save As, click on .csv format in the dropdown menu.

Before you close off the Excel file, take a note of its columns. The columns contain the state name, the county Federal Information Processing Standard (FIPS) unique identifier for each county, the county name, and then for years 2004 to 2013 the number of new diabetes cases diagnosed each year, the rate per 1,000 people, and the lower and upper range of the 95% confidence interval on the rate estimate. The CDC obtained this confidence interval from the number of observed cases and population size by using a function similar to binom.test() in R.
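To get a feel for how such a confidence interval comes about, here is a sketch using binom.test() in R with made-up numbers (120 new cases in a county of 15,000 people; these are hypothetical values, not taken from the CDC file):

```r
cases = 120          # hypothetical number of newly diagnosed cases
population = 15000   # hypothetical county population

bt = binom.test(cases,population)
rate_per_1000 = 1000*cases/population
ci_per_1000 = 1000*bt$conf.int

cat("Rate per 1,000 people:",rate_per_1000,"\n")
cat("95% confidence interval per 1,000 people:",round(ci_per_1000,2),"\n")
```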

On my computer, saving the Excel file in csv format produced the file INCIDENCE_ALL_STATES.csv. Once you do this, you have to visually inspect the file in a text editor to see if the header line is split across multiple lines. In this case, it is split across two lines. To read this file into R, we thus need to skip the first line:

a = read.table("INCIDENCE_ALL_STATES.csv",sep=",",header=T,as.is=T,skip=1)

Examining the column names of our data frame yields a list like the following (I couldn’t fit the entire list on my screen to take a screen shot):

Note that there are a number of columns with a name like “rate.per.1000”. Recall from the inspection of the Excel file that the first such column is the 2004 data, the next is the 2005 data, and so on to the 2013 data.

Try histogramming the rate for 2004

hist(a$rate.per.1000)

You will note that you get the error message “Error in hist.default(a$rate.per.1000) : ‘x’ must be numeric”. To view the data in that column, type

a$rate.per.1000

You’ll notice that R thinks the data consist of strings, rather than numeric. This is because for some counties, the data entry is “No Data”. In order to convert the strings to numeric, type

a$rate.per.1000 = as.numeric(a$rate.per.1000)

The entries with “No Data” will now be NA, and the other entries will be converted to numeric. If we take the mean, you’ll notice that it will be equal to “NA”… this is because we have to tell the mean() function in R to ignore the NA values, and only calculate the mean from the defined values, using the na.rm=T option.

mean(a$rate.per.1000,na.rm=T)
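Here is the same sequence of steps on a tiny made-up vector, so you can see exactly what happens to the “No Data” entries (the values are hypothetical; as.numeric() issues a warning about NAs being introduced, which is suppressed here):

```r
# a toy column as R would read it in: strings, with one "No Data" entry
rate = c("10.2","No Data","8.7","11.1")

# convert to numeric; "No Data" becomes NA
rate_num = suppressWarnings(as.numeric(rate))
print(rate_num)

# the mean is NA unless we tell R to ignore the NA values
print(mean(rate_num))
print(mean(rate_num,na.rm=TRUE))
```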

Now, if you type

hist(a$rate.per.1000,col=2,xlab="Rate of diabetes incidence per 1000 population",main="2004 county-level data")

you will get the histogram

What you’re looking for here are any strange outliers (there don’t appear to be any). You also want to check if the data values are more or less what you expect. From the mean() of our values, we see the average incidence in 2004 is about 10/1000, or 1%. From the diabetes.org website, their latest report says that around 9% of Americans are living with diabetes (ie; the diabetes prevalence). It is roughly plausible that perhaps 1/10 people living with diabetes are newly diagnosed each year.

We need to do this type of exploratory analysis for all the columns of interest in the data file. In the R script, diabetes.R, I do this for the 2004 to 2013 data. The R script produces the following plot:

The R script also produces the following output:

The data for all years look more or less similar, and reasonable. If the data had outliers, I would look at the counties for which there were outliers, and try to track down what the true value should be by doing an Internet search. Sometimes outliers are caused by mis-transcription of data. Sometimes they actually are true outliers (!)

The script puts the fips, year, and rate information into vectors, then creates a data table that is written out into a summary file preprocessed_diabetes_incidence_by_county_by_year_2004_to_2013.csv

Making such summary files is often useful in order to get the data in a nice format for further analyses.

**Socio-economic data**

The US Census Bureau American FactFinder database has data on a wide variety of socio-economic demographics:

On the site, click on “Advanced Search->Show Me All”

Click on “Geographies”, and then “County”, “All Counties Within United States”, and then “ADD TO YOUR SELECTIONS”. You can then close out that selection window by clicking “Close X” at its top right corner:

Notice that in the “Your Selections” box on the upper left hand corner it states that you are now searching for data by county.

Now, let’s look for data related to poverty by county. We are ultimately going to test whether or not there appears to be an association between poverty rates and diabetes incidence in the population. In the “Refine your search results” box, type “poverty” (without the quotes). The following list will come up:

The acronym “ACS” refers to the US Census Bureau American Community Survey. They provide 5 year, 3 year, and 1 year running averages of various socioeconomic demographics. We want the one year averages. Click on sample S1701 “Poverty status in the last 12 months”. It brings up (note that I can’t fit all the rows on the screen for the screen shot):

There are a lot of goodies in this table. Not only is there information on poverty rates, but also information by age, race and ethnicity, employment status, etc. To download the data for a specific year, click on the year to load the table. Let’s do 2013. Once the table shows, click on the “Download” tab at the upper center part of the screen, and click on “Use the data”, then “OK”

Now click “Download” to complete the download process to your computer:

This will download a compressed folder with the data. On my computer, this folder is called ACS_13_1YR_S1701. “S1701” is the name of the data set, “1YR” indicates that it is the one year averages, and “13” indicates that the data are for 2013.

Inside that folder, there are several files. ACS_13_1YR_S1701_with_ann.csv contains the data of interest, but if you look at it in a text editor, you will see the column names are rather inscrutable, and there appear to be two lines of header information. The file ACS_13_1YR_S1701_metadata.csv contains the information on what each of the columns means. Move these two files to your working directory.

If you try to read ACS_13_1YR_S1701_with_ann.csv into R as is, R will complain that there are more columns than column names. Skipping the first row, like we did with the diabetes data set, won’t help. It is that second line of header information that is the problem. It contains extra commas in the quotes that mess R up when it tries to read in the data. We can proceed one of two ways… edit the file to comment that second line out with a # so that R ignores it, or, use the following code to make R skip that second line:

all_content = readLines("ACS_13_1YR_S1701_with_ann.csv")
skip_second = all_content[-2]
b = read.csv(textConnection(skip_second), header = TRUE, stringsAsFactors = FALSE)
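If you want to try this trick without downloading the ACS file, the following sketch writes a small made-up file with two header lines to a temporary location, and then reads it in while skipping the second line. The file contents are hypothetical, and only mimic the layout of the real file.

```r
# write a toy file with two header lines (contents are made up)
tmp = tempfile(fileext=".csv")
writeLines(c("GEO.id2,HC03_EST_VC01",
             "Id2,Percent below poverty level; Estimate",
             "01001,12.5",
             "01003,13.9"),tmp)

# read all lines, drop the second one, and parse what is left as a csv
all_content = readLines(tmp)
skip_second = all_content[-2]
b = read.csv(textConnection(skip_second),header=TRUE,stringsAsFactors=FALSE)
print(b)
unlink(tmp)
```

Note that read.csv() reads the FIPS codes in as numbers here, so a leading zero (as in 01001) is dropped. That is fine as long as the FIPS codes are treated as numeric in every data set you mesh together.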

Typing names(b) yields (note, I couldn’t fit the entire list on the screen to take the screen shot):

To reiterate, it is the ACS_13_1YR_S1701_metadata.csv file that describes what each of these many columns are. I usually open this with a text editor, and determine which column name is the information I’m interested in. For example, opening this in a text editor shows:

GEO.id,Id

GEO.id2,Id2

GEO.display-label,Geography

HC01_EST_VC01,Total; Estimate; Population for whom poverty status is determined

HC01_MOE_VC01,Total; Margin of Error; Population for whom poverty status is determined

HC02_EST_VC01,Below poverty level; Estimate; Population for whom poverty status is determined

HC02_MOE_VC01,Below poverty level; Margin of Error; Population for whom poverty status is determined

HC03_EST_VC01,Percent below poverty level; Estimate; Population for whom poverty status is determined

HC03_MOE_VC01,Percent below poverty level; Margin of Error; Population for whom poverty status is determined

I can see that the column I’m interested in is named HC03_EST_VC01. Note that you cannot assume that this column name will always correspond to the percentage of the population in poverty for all years in the S1701 series of ACS data. You have to check for each year!

The column GEO.id2 is the FIPS code for each county.

The following lines of code read in the data, make a summary data file that is less inscrutable than the original file, and histogram the poverty rates.

require("sfsmisc")
######################################################
# the second line in the ACS files often makes it problematic to read the
# file in with read.table or read.csv. The following three lines
# of code tell R to skip the second line when reading in the file
######################################################
all_content = readLines("ACS_13_1YR_S1701_with_ann.csv")
skip_second = all_content[-2]
b = read.csv(textConnection(skip_second), header = TRUE, stringsAsFactors = FALSE)

wfips = b$GEO.id2
wpoverty = b$HC03_EST_VC01
mult.fig(1)
hist(wpoverty,col="darkviolet",xlab="Poverty rate",main="2013 ACS poverty rate data")
vdat = data.frame(fips=wfips,poverty=wpoverty)
write.table(vdat,"preprocessed_poverty_rates_by_county_2013.csv",sep=",",row.names=F)

The output can be found in preprocessed_poverty_rates_by_county_2013.csv. The code produces the following plot:

All of the values look reasonable, and there do not appear to be any unusual outliers.

**Bringing it together**

Now we would like to examine the diabetes incidence and poverty rate data to determine if they appear to be related.

Note that there are over 3,000 counties in the US, but there are not that many counties in either the diabetes or poverty data sets. The one year American Community Survey data are usually much more limited than the 5 year survey estimates due to the amount of work and expense needed to do annual surveys. Thus, it is usually larger counties that are represented in the one year averages of socioeconomic and demographic data from the Census Bureau. As far as health data are concerned, there is the potential that some county health authorities haven’t reported their data to the CDC, for whatever reason, or that the number of diabetes cases newly diagnosed was below 20 for that county and year… for reasons of confidentiality, the CDC will not report aggregated data with fewer than 20 counts.

The following lines of code read in the two data sets, and report on the number of counties in one, but not in the other. The X%in%Y operator in R checks each element of the vector X, and returns TRUE if that element is in Y. Taking !X%in%Y returns TRUE for each element of X that is not in Y.

ddat=read.table("preprocessed_diabetes_incidence_by_county_by_year_2004_to_2013.csv",sep=",",header=T,as.is=T)
pdat=read.table("preprocessed_poverty_rates_by_county_2013.csv",sep=",",header=T,as.is=T)

ddat = subset(ddat,year==2013)
cat("The number of counties in the diabetes data set is",nrow(ddat),"\n")
cat("The number of counties in the poverty data set is",nrow(pdat),"\n")

i=which(!ddat$fips%in%pdat$fips)
j=which(!pdat$fips%in%ddat$fips)

cat("The number of counties in the diabetes data set not in the poverty set is:",length(i),"\n")
cat("The number of counties in the poverty data set not in the diabetes set is:",length(j),"\n")

We can subset the two data sets to ensure that they both contain information for the same set of counties:

ddat = subset(ddat,fips%in%pdat$fips)
pdat = subset(pdat,fips%in%ddat$fips)
cat("The number of counties in the diabetes data set is",nrow(ddat),"\n")
cat("The number of counties in the poverty data set is",nrow(pdat),"\n")

You will find that both data sets are now the same size (755 counties).

However, there is no guarantee that the counties are now in the same order in the two data sets. To get the one-to-one correspondence between the data sets, we can use the R match() function. match(ddat$fips,pdat$fips) returns, for every value of ddat$fips in turn, the index of the row of the pdat data frame with the corresponding fips.

We can thus create a new element of the ddat data frame called poverty, which is obtained from the pdat data frame with the corresponding fips to every fips in ddat:

ddat$poverty = pdat$poverty[match(ddat$fips,pdat$fips)]
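To see what match() is doing, here is a minimal sketch with two tiny data frames that hold the same counties in a different row order. The fips and poverty values are hypothetical and only serve to show the indexing.

```r
# Hypothetical data frames: same counties, different row order
d = data.frame(fips=c(4013, 1001, 6037))
p = data.frame(fips=c(6037, 4013, 1001), poverty=c(17.8, 17.0, 12.1))

# For each fips in d, the row of p that holds the same fips
idx = match(d$fips, p$fips)

# Pull the poverty values out of p in the row order of d
d$poverty = p$poverty[idx]

print(idx)
print(d)
```

If a fips in d were missing from p, match() would return NA for that row, which is why the two data sets are first restricted to their common counties.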

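and we can now plot the diabetes rate versus the poverty rate and overlay the regression line:

require("sfsmisc")  # provides mult.fig()
mult.fig(1)
plot(ddat$poverty,ddat$diabetes_rate,col="red",cex=2,xlab="Poverty rate",ylab="Diabetes rate",main="2013 data")
# fit the same column that is plotted; na.exclude keeps the fitted values aligned with the rows of ddat
myfit = lm(diabetes_rate~poverty,data=ddat,na.action=na.exclude)
lines(ddat$poverty,fitted(myfit),col=4,lwd=6)
print(summary(myfit))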


**(aka: How to be a Data Boss)**

**This course is meant to introduce students in the life and social sciences to the skill set needed to do well-executed and well-explicated statistical analyses. The course is aimed at students with little prior experience in statistical analyses, but prior exposure to “stats 101”-type courses is helpful. The course will be almost entirely based on material posted on this website. The course syllabus can be found here.**

**There is no textbook for this course, but recommended reading is How to Lie with Statistics by Darrell Huff (illustrated by Irving Geis), Statistical Data Analysis by Glen Cowan, and Applied Linear Statistical Models by Kutner et al. (it doesn’t really matter which edition).**

**Upon completing this course:**

**Students will have an understanding of basic statistical methods, including hypothesis testing, linear regression and generalized regression methods, and will understand common pitfalls in statistical analyses, and how to avoid them (and detect them, when reviewing papers!). If we have the following problem as the course progresses, students need to tell me, because it means that I need to adjust the pace and content of the course material:**

**Upon completion of the course, students will have basic functionality in R, will learn how to read in, manipulate, and export data in R, and will be able to create publication-quality plots in R. Methods for producing well-written scientific papers, and giving good oral presentations, are also heavily stressed throughout the course.**

**Dr. Towers’ Golden Rules for Statistical Data Analysis:**

- **All (or nearly all) data have stochasticity (i.e., randomness) associated with them.**
- **A probability distribution underlies that stochasticity.**
- **Hypothesis tests are based on that probability distribution.**
- **Anything calculated using data (statistics like the mean or standard deviation, for example) has stochasticity associated with it, because the data are stochastic.**
- **Every statistical analysis needs to start with a “meet and greet” with your data: calculation of basic statistics (sample size, means, standard deviations, ranges, etc.), and plots to explore the data and ensure no funny business is going on.**
- **When doing regression, you need two things: a model that describes how the data depend on the explanatory variables, and a goodness-of-fit statistic (like Least Squares, or Binomial likelihood, or Poisson likelihood, etc.).**

**List of course modules:**

- Good work habits, and requirements for homework
- Literature searches with Google Scholar
- Elements of scientific papers
- The basics of the R statistical programming language
- Difference between statistical and mathematical models
- Probability distributions important to modelling in the life and social sciences
- Descriptive statistics: mean, covariance, variance, and correlation
- Online sources of free data
- Extracting data from graphs in the published literature
- Bringing together disparate sources of data
- Correlations, partial correlations, and confounding variables
- Exploratory data analysis examples
- Least squares linear regression
- Producing well written manuscripts in a timely fashion
- Giving a good presentation
- t-tests and z-tests of means, and ANOVA
- Poisson regression
- Logistic regression
- Population standardization
- model validation bootstrapping methods
- Correcting alpha for multiple tests
- Robust linear regression
- The importance of robustness cross-checks
- Transformation of variables
- R Shiny (more examples here)
- Kmeans clustering
- Kolmogorov-Smirnov test

**Course expectations:** There will be regular homework projects assigned throughout the course, which will be worth 50% of the grade. Students are strongly encouraged to work together in groups to discuss issues related to the homework and resolve problems. However, plagiarism of code will not be tolerated. There may also be unannounced in-class pop quizzes during the semester. If these occur, they will be counted among the homework grades.

The culmination of the course will be a group term project (two to three students collaborating, with the project worth 50% of the final grade). Students will write up the results of their project in a format suitable for publication, using the format required by a journal they have identified as being appropriate for the topic. A cover letter written to the editor of the journal is also required. **Submission for publication is not required, but is encouraged if the analysis is novel.** Students are responsible for locating and obtaining sources of data, and developing an appropriate statistical model for the project, so this should be something they begin to think about very early in the course.

**This course has no associated textbook. Instead the course content consists of the modules that appear on this website. A textbook that students may find useful is Statistical Data Analysis, by G. Cowan.**

Students are expected to bring their laptops to class. Before the course begins, students are expected to have downloaded the R programming language onto their laptop from http://www.r-project.org/ (R is open-source free software).

Final project write-ups will be due **Friday, April 13th**. Each of the project groups will give a 20 minute in-class presentation on **Monday, April 23rd, 2018 and Wed, April 25th, 2018**. During the week of April 16th, project groups will meet with Dr. Towers to discuss their final project write-ups, and their upcoming presentation. By Friday, April 27th, all group members are to submit to Prof Towers a confidential email detailing their own contribution to the group project and the contributions of the other group members.


The file reads in the data files summary_pandemic_data.txt and sunspot_wolf_and_group_1700_to_2014.txt.

The R script produces the following plot, shown in the paper:


**I’m a statistician, and I also have a PhD in experimental particle physics. Research in experimental particle physics can involve complex models of observable physical processes, and fitting of those models to experimental data is a not uncommon task in that field. Like the field of applied mathematics in the life and social sciences (AMLSS), the models being fit at times have no analytic solution, and must be solved numerically using specialized methods. When I entered the field of AMLSS back in 2009, I had a lot to learn about the various models used in this field and the common methodologies, but I already had a solid tool box of specialized skills that allowed me to connect mathematical models to data, and it has turned out that those skills have been remarkably useful in exploring a wide range of research questions in the life and social sciences that I find interesting. I also apply these skills in consulting projects I do.**

**First off: what is the difference between statistical and mathematical modelling, anyway?**

The difference between statistical and mathematical models is often confusing to people. In this past module on this site, I discuss the differences, using an analysis of seasonal and pandemic influenza as an example.

**Example of an analysis combining statistical and mathematical modelling: Mathematical and statistical modelling of the contagious spread of panic in a population**

During the 2014 Ebola outbreak, there were a total of five cases that were ultimately identified in America, compared to tens of thousands of cases in West Africa. Even though the “outbreak” in America was essentially non-existent, once the first case was identified in the US in autumn 2014, the media shifted into 24/7 coverage of the supposed dire threat Ebola presented to Americans, complete with scary imagery.

In autumn 2014 I was teaching a course in the ASU AMLSS graduate program on statistical methods for fitting the parameters of mathematical models to data. Each year, when I teach AML classes, I usually try to have a “class publication project” that encompasses the methodology I teach in the class. In this case, I thought it might be interesting to try to model the spread of Ebola-related panic in the US population, as expressed on social media, and explore how news media might play a role in that.

The class did the analysis as a homework assignment, and we wrote the paper together, which was published in 2015. The paper received national media attention when it came out.

First, why was this analysis important? Well, it has been suggested in the past that people talking about a particular disease on social media might be used as a real-time means to assess the temporal and geospatial spread of the disease in the population, rather than relying on slower traditional surveillance methods, which can suffer from backlogs in laboratory testing. For instance, tracking influenza, or cholera:

However, up until the US Ebola “outbreak” the problem was that it was impossible to say whether people were just discussing a disease on social media because they were worried about it, rather than because they actually had it. During the Ebola outbreak, pretty much no one actually had it in the US, so everyone who was talking about it was doing so because they were concerned about it. This gave us the perfect instance to gauge what kind of temporal patterns we might see in social media chatter due simply to panic or concern about a disease!

The data we used in the study were the daily number of news stories about Ebola from large national news outlets. We also obtained Twitter data related to Ebola, and Google search data in the US with search terms related to Ebola, including “do I have Ebola?” from Google Trends. Here is what the data looked like:

We came up with a model that related the number of news videos, V, and people who were infected, I, with the idea to tweet about Ebola, or do a Google search related to Ebola:

The parameter beta is a measure of how many tweets (or Google searches) per person per unit time one news story would inspire, and gamma parameterizes the “boredom” effect, through which people eventually move to a “recovered and immune” class, upon which they never tweet again about Ebola no matter how many Ebola-related news stories they are exposed to. Using the statistical methodology taught in the AML course, the students fit the parameters of that model to data, and obtained the following model predictions, shown in red:

The blue lines on the plot represent a plain statistical model that simply regresses the Twitter and Google search data on the news media data, without taking into account the “boredom” effect. Can you see that the regression fits are systematically too high early on, and systematically too low later for all the plots, but the same is not true of our mathematical model? That tells us that our mathematical model that includes boredom really does do a better job of describing the dynamics of people’s Ebola-related social media behaviours!

We found that each Ebola-related news story inspired on average thousands of tweets and Google searches. Also, we found that after seeing a news story about Ebola, people were on average only interested enough to tweet or do a Google search for a few days before they became bored with the topic:

We couldn’t have done this analysis without both mathematical modelling and statistical methods; it was a nice “bringing together” of the methodologies to explore an interesting research question.

**Another example of an analysis that involved mathematical and statistical modelling methods: contagion in mass killings and school shootings**

In January, 2014 there was a shooting at Purdue University, where one student entered a classroom and shot another student dead, then walked out and waited for police to arrest him.

At the time, it struck me that it was the third school shooting I had heard about in an approximately 10 day period. Even for the United States, which has a serious problem with firearm violence compared to other first world countries, this seemed like an unusual number to have in such a short period of time.

It led me to wonder if perhaps contagion was playing a role in these dynamics. Certainly, in the past it had been noted that suicide appears to be contagious, because (for example) in high schools where there is one suicide it is statistically more likely to see an ensuing cluster of suicides. And the “copy cat” effect in mass killings has long been suspected. I wondered if perhaps a mathematical model of contagion might be used to help *quantify* whether or not mass killings and school shootings are contagious. So, I talked with some colleagues:

And, we decided to use a mathematical model known as a Hawkes point process “self-excitation” model to simulate the potential dynamics of contagion in mass killings. The idea behind the model is quite simple: there is a baseline probability (which may or may not depend on time) that a mass killing will occur by mere random chance (the dotted line, below). But, if a mass killing does occur, contagion temporarily raises the probability that a similar event will occur in the near future. That added probability decays exponentially:

How contagion would manifest itself in data is thus as unusual “bunching together in time” of events compared to what you would expect from just the baseline probability.

Here’s our (blurry) model:

The parameters of the model were Texcite, the average length of the excitation period, and Nsecondary, the average number of new mass killings inspired by one mass killing. N_0(t) was the baseline probability for mass killings to occur. We used statistical modelling methods to estimate N_0(t).
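The self-excitation idea can be sketched in a few lines of R. The exact equation from the paper is not reproduced in the text here, so the exponential kernel below, (Nsecondary/Texcite)*exp(-(t-t_i)/Texcite), is an assumption: it is one common form for which each event inspires Nsecondary new events on average over a mean excitation time of Texcite. The function name hawkes_intensity, the constant baseline N0, the event times, and all parameter values are hypothetical.

```r
# Event rate at times t, given the times of past events.
# Assumed form: constant baseline N0 plus an exponentially decaying
# contribution from every event that occurred before time t.
hawkes_intensity = function(t, event_times, N0, Nsecondary, Texcite){
  sapply(t, function(tt){
    past = event_times[event_times < tt]
    N0 + sum((Nsecondary/Texcite)*exp(-(tt-past)/Texcite))
  })
}

# Hypothetical event times (in days) and parameter values
event_times = c(20, 25, 60)
t = seq(0, 100, by=0.5)
rate = hawkes_intensity(t, event_times, N0=0.05, Nsecondary=0.3, Texcite=13)

# Before the first event the rate is just the baseline;
# it jumps after each event and then decays back towards the baseline
print(round(rate[t %in% c(10, 21, 26, 59, 61)], 4))
```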

We needed data in order to fit the parameters of our model. From USA Today we obtained data on mass killings (four or more people killed), and from the Brady Campaign to Prevent Gun Violence, we obtained data on school shootings, and data on mass shootings (three or more people shot, not necessarily killed). Mass shootings happen very frequently in the US!

We compared how well the Hawkes model fit the data compared to a model that didn’t include self-excitation. If contagion is evident, the former will fit the data significantly better.

The fit results were:

Both mass killings and school shootings appear to be significantly contagious! And the length of the contagion period is on average around two weeks for both.

Mass shootings with three or more people shot, but fewer than four people killed, were not contagious though.

Why? Well, mass shootings with low death counts happen on average more than once a week in the US. They happen so often that they rarely make it past the local news. In contrast, mass shootings with high death counts, and school shootings, usually get national and even international media attention. It may well be that widespread media attention is the vector for the contagion.

Again, this was an analysis that was made possible through the marriage of mathematical and statistical modelling methods.

**Statistical and mathematical modelling skills on the job market**

Quantitative and predictive analytics is a field that is growing very quickly. Statistical methods and data mining (“big data”) play a large role in predictive analytics, but mathematical models are more and more being recognized as having some advantages over statistical models alone, because mathematical models do not simply assume an “X causes Y” relationship, but instead can describe the complex dynamics of interacting systems. Having a tool box of skills that includes expertise in both mathematical and statistical modelling can lead to many interesting career opportunities, including consulting.


For many models, information about the parameters and/or initial conditions can be obtained from other studies. For instance, let’s examine the seasonal influenza SIR model we have used as an example in several other modules. Our data were influenza incidence data from an influenza epidemic in the Midwest, and we fit the transmission rate, beta (or alternatively, R0=beta/gamma), of an SIR model to these data, for example using the R script fit_midwest_negbinom_gamma_fixed.R.

The script performs a negative binomial likelihood fit to the influenza data, assuming that the average recovery period, 1/gamma, for flu is fixed at 4.8 days. The script produces the following plot (recall that alpha is the over-dispersion parameter for the negative binomial likelihood, and t0 is the time of introduction of the virus to the population):

The script gives the best-fit estimate using the graphical Monte Carlo fmin+1/2 method, and also the weighted mean method. Note that the plots should be much better populated in order to really get trustworthy estimates from the fmin+1/2 method.

**In reality, most of the parameters that we obtain from prior studies aren’t known to perfect precision**

In our script above, we assumed that 1/gamma was 4.8 days based on a prior study in the literature. However, this was estimated from observational studies of sick people, and, in reality, there are statistical uncertainties associated with that estimate. In the paper describing the studies, they state that their central estimate and 95% confidence interval on 1/gamma was 4.80 [4.31,5.29] days. **Unless told otherwise in the paper from which you get an estimate, you assume that the uncertainty on the parameter is Normally distributed.** Because the 95% CI spans +/-1.96*sigma about the mean, this implies that the standard deviation of the Normal distribution is sigma=(4.8-4.31)/1.96~0.25 days.

Thus, our probability distribution for x=1/gamma is

P(x|mu,sigma)~exp(-0.5(x-mu)^2/sigma^2)

with mu=4.8 days, and sigma = 0.25 days, in this case.
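The arithmetic for turning a quoted 95% confidence interval into a Normal standard deviation can be checked in R. This is a minimal sketch that assumes, as in the text, that the interval is symmetric and Normally distributed; the variable names are only for illustration.

```r
# Central estimate and 95% CI on 1/gamma quoted in the text (days)
mu = 4.80
ci_lo = 4.31
ci_hi = 5.29

# A 95% CI spans +/- qnorm(0.975) ~ 1.96 standard deviations about the mean
sigma = (ci_hi - ci_lo)/(2*qnorm(0.975))
cat("sigma =", round(sigma, 3), "days\n")

# Cross-check: the Normal distribution with this mu and sigma
# should give back the quoted interval
ci_check = qnorm(c(0.025, 0.975), mean=mu, sd=sigma)
print(round(ci_check, 2))
```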

**Uncertainty on “known” parameters affects the uncertainties on the other model parameters you estimate from your fit to data!**

The uncertainty on 1/gamma will affect the uncertainty on our parameter estimates. For instance, is it clear that if all we knew about 1/gamma was that it was between 1 and 50 days, it would be much harder to pin down our transmission rate? (That is, we had no idea what 1/gamma was, and thus had to fit for gamma as well as R0, t0, and alpha.) The script fit_midwest_negbinom_gamma_unconstrained.R does this, and produces the following plot:

You can see that the influenza data we have perhaps give us a little bit of sensitivity to the parameter gamma, but not much (basically, the fit just tells us 1/gamma is somewhere between 2 and 6 days, with 95% confidence). The uncertainties on our estimates of R0 and t0 have gone way up, compared to the first fit where we assumed 1/gamma was fixed at 4.8 days! Also, when you are using the weighted mean method to estimate parameters and the parameter uncertainties, you can also get the covariance matrix for your parameter estimates. The correlation matrix, derived from the covariance matrix, for this fit looks like this:

Notice that our estimates of R0 and 1/gamma are almost 100% correlated (this means that as 1/gamma goes up, R0 also has to go up to achieve a good fit to the data). You never want to see parameters so highly correlated in fits you do… it means that your best-fit parameters likely won’t give you a model with good predictive power for a different, equivalent data set (say, influenza data for the same region for the next flu season).
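For reference, a correlation matrix can be derived from a covariance matrix in R with cov2cor(). The covariance values below are hypothetical numbers chosen to give a strong positive correlation; they are not the output of the fit discussed above.

```r
# Hypothetical 2x2 covariance matrix for two parameter estimates
par_names = c("R0", "inv_gamma")
covmat = matrix(c(0.04,  0.057,
                  0.057, 0.09), nrow=2, byrow=TRUE,
                dimnames=list(par_names, par_names))

# One standard deviation uncertainties are the square roots of the diagonal
par_sd = sqrt(diag(covmat))

# cov2cor() divides each covariance by the product of the two standard deviations
cormat = cov2cor(covmat)
print(par_sd)
print(cormat)
```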

So, even though we seem to have a little bit of sensitivity to the value of 1/gamma in our fit, having that estimate 100% correlated to our estimate of R0 is not good, and a sign you shouldn’t trust the results of the fit.

**Incorporating uncertainties on “known” parameters from the literature in our fit likelihoods**

In order to take into account the uncertainties on our “known” parameter, x, you simply modify your fit likelihood to include the likelihood coming from the probability distribution for that parameter. Thus, the negative log likelihood is modified like so:

negloglike = negloglike + 0.5*(x-mu)^2/sigma^2

Then, in the fit, you do Monte Carlo sampling not only of all your other unknown parameters (like R0, t0, and alpha in this case), but also uniformly randomly sample parameter x over a range of approximately mu-4*sigma to mu+4*sigma.
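A minimal sketch of the constraint term follows, assuming a Normal prior on x=1/gamma with the mu and sigma from the text. The helper name gaussian_penalty is hypothetical, and the data part of the likelihood is left out because it depends on the model being fit.

```r
# Extra negative log likelihood contributed by a Normal constraint on x
gaussian_penalty = function(x, mu, sigma){
  0.5*(x-mu)^2/sigma^2
}

mu = 4.8      # prior central estimate of 1/gamma (days)
sigma = 0.25  # prior one standard deviation uncertainty (days)

# Uniformly sample the constrained parameter over mu +/- 4 sigma
set.seed(1)
n_mc = 1000
x = runif(n_mc, mu-4*sigma, mu+4*sigma)
penalty = gaussian_penalty(x, mu, sigma)

# In a real fit, each sampled x would go into the model, and the total would be
#   negloglike = negloglike_data + penalty
# where negloglike_data is the negative log likelihood of the fit to the data.

# The penalty equals -log(dnorm) up to an additive constant
check = -dnorm(x, mu, sigma, log=TRUE) + dnorm(mu, mu, sigma, log=TRUE)
print(max(abs(penalty - check)))
```

Dropping the additive constant is harmless because only differences in the negative log likelihood matter when finding the minimum and the confidence intervals.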

For 1/gamma, we know that mu=4.8 days, and sigma is 0.25 days. The R script fit_midwest_negbinom_gamma_constrained.R modifies the likelihood to take into account our probability distribution for our prior estimate of 1/gamma from the literature. The script produces the following plot:

(again, for the fmin+1/2 method, we’d like to see these plots much better populated!). Note that now our uncertainties on R0 and t0 from the weighted mean method are much smaller than they were when 1/gamma was completely unconstrained, but larger than they were when 1/gamma was fixed to 4.8 days. By modifying the likelihood to take into account the probability distribution of our prior belief for 1/gamma, we now have a fit that properly feeds that uncertainty into our uncertainty on R0 and t0.

When publishing analyses that involve fits like these, it is important to take into account your prior belief probability distributions for the parameter estimates you take from the literature. In some cases, your fit might be quite sensitive to the assumed values of those parameters; if the literature estimates are a bit off from what your data would “like” them to be to obtain a good fit, and you just assume a fixed central value for the parameter, sometimes you just won’t be able to get a good fit to your data.

When you include the parameter in your fit with a modified likelihood to take into account its prior belief probability distribution, the estimate you get from the fit to your data is known as the “posterior” estimate. Note that the posterior estimate, and uncertainty, on 1/gamma that we obtained from fit_midwest_negbinom_gamma_constrained.R is 4.798+/-0.247, and is pretty darn close to our prior belief estimate of 4.8+/-0.25. If our data were sensitive to the value of 1/gamma, our posterior estimate would have a smaller uncertainty than the prior belief estimate, and likely have a different central value too.

As discussed in that module, the model parameters can be estimated from the parameter hypothesis for which the negative log-likelihood statistic, f, is minimal, and the one standard deviation uncertainty on the parameters is obtained by looking at the range of the parameters for which the negative log likelihood is less than 1/2 more than the minimum value, like so:

This method has the advantage that it is easy to understand how to execute (once you’ve seen a few examples). However, we talked about the fact that this procedure is only reliable if you have many, many sweeps of the model parameter values (for instance, the above plots are pretty sparsely populated, and it would be a bad idea to trust the confidence intervals seen in them…. they are underestimated because the green arrows don’t quite reach to the edge of the parabolic envelope that encases the points).

The fmin+1/2 method also does not yield a convenient way to determine the covariance between the parameter estimates, without going through the complicated numerical gymnastics of estimating what is known as the Hessian matrix. The Hessian matrix (when maximizing a log likelihood) is the matrix of second partial derivatives of the log-likelihood function, evaluated at the maximum likelihood estimates; in practice it is usually approximated numerically. Thus, it’s a measure of how sharply the log-likelihood surface curves downward as you move away from the best-fit estimate.

It turns out that there is an easy, elegant way, when using the graphical Monte Carlo method, to use information coming from every single point that you sample to obtain (usually) more robust and reliable parameter estimates, and (usually) more reliable confidence intervals for the parameters.

**The weighted means method**

To begin to understand how this might work, first recall from the previous module that the fmin+1/2 method gives you the one standard deviation confidence interval. Recall that to get the S standard deviation confidence interval, you need to go up 0.5*S^2 from the value of fmin, and examine the range of points under that line. This means that when we plot our negative log likelihood, f, vs our parameter hypotheses, the points that lie some value X above fmin are, in effect sqrt(2*X) standard deviations away from the best-fit value. Here is what that looks like graphically:

The red lines correspond to the points that lie at fmin+1/2 (the one standard deviation confidence interval), the blue lines correspond to the points that lie at fmin+0.5*2^2=fmin+2 (the two standard deviation confidence interval), and the green lines correspond to the points that lie at fmin+0.5*3^2=fmin+4.5 (the three standard deviation confidence interval).
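The relationship between the height above fmin and the number of standard deviations is easy to tabulate. This short sketch just evaluates the two formulas from the text, 0.5*S^2 and sqrt(2*X).

```r
# Height above fmin that bounds the S standard deviation confidence interval
S = c(1, 2, 3)
delta_f = 0.5*S^2
print(data.frame(S=S, delta_f=delta_f))

# Going the other way: a point that lies X above fmin is
# sqrt(2*X) standard deviations from the best-fit value
X = c(0.5, 2, 4.5)
print(sqrt(2*X))
```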

It should make sense to you that the points that are further away from fmin carry less information about the best-fit value compared to points that have a negative log likelihood close to the minimum. After all, when using the graphical Monte Carlo method, you aim to populate the graphs well enough to get a good idea of the width of the parabolic envelope *in the vicinity of the best fit value*.

So… if we were to take some kind of weighted average of our parameter hypotheses, giving more weight to values near the minimum in our likelihood, we should be able to estimate the approximate best-fit value.

It turns out that the weight that achieves this is intimately related to those confidence intervals we see above. If we do many Monte Carlo parameter sweeps, getting our parameter hypotheses and the corresponding negative log likelihoods, f, we can estimate our best fit values by taking the weighted average of our parameter hypotheses, weighted with weights

w=dnorm(sqrt(2*(f-fmin)))

where dnorm is the PDF of the Normal distribution. Notice that this is maximal when f=fmin, and gets smaller and smaller as f moves away from fmin. In fact, when f=fmin+0.5*S^2 (the value that corresponds to the S std dev CI), then

w=dnorm(S)

So, the points that come close to the minimum negative log likelihood are given a greater weight in the fit, because they are more informative as to where the minimum actually lies. The plot of w vs S is:

Thus, the further f gets away from fmin, the less weight the points are given, *but they still have some weight*.
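The weights can be computed directly from a vector of negative log likelihoods. In this sketch the values of f are hypothetical, chosen to sit exactly at 0, 1, 2 and 3 standard deviations from the minimum.

```r
# Hypothetical negative log likelihoods from four sampled parameter hypotheses
f = c(100.0, 100.5, 102.0, 104.5)
fmin = min(f)

# Number of standard deviations each point lies from the best fit
S = sqrt(2*(f-fmin))

# Weight for each point
w = dnorm(S)
print(data.frame(f=f, S=S, w=round(w, 4)))

# Relative to the best point, the weight is just exp(-(f-fmin)),
# i.e. the likelihood relative to its maximum
print(w/max(w))

# The weight as a function of S
S_grid = seq(0, 4, by=0.05)
plot(S_grid, dnorm(S_grid), type="l", lwd=3, xlab="S", ylab="w")
```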

It turns out that not only can these weights be used to estimate our best-fit values, they also can be used to estimate the covariance matrix of our parameter estimates. If we have two parameters (for example), and we’ve randomly sampled N_MC parameter hypotheses, we would form an N_MC x 2 matrix of these sampled values, and then take the weighted covariance of that matrix. The R cov.wt() function does this.
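Here is a minimal, self-contained sketch of the whole procedure for a two-parameter straight-line model with Poisson-distributed data. It is not one of the course scripts; the simulated data, the random seed, the number of iterations, and the sampling range for b are assumptions made for the illustration.

```r
set.seed(12345)

# Simulated data: straight line with Poisson stochasticity
x = seq(10, 150)
y_obs = rpois(length(x), 0.1*x + 10)

# Uniformly randomly sample the parameter hypotheses
nmc = 20000
a = runif(nmc, 0.06, 0.14)
b = runif(nmc, 4, 16)      # sampling range for b is an assumption

# Poisson negative log likelihood for each hypothesis
f = numeric(nmc)
for (k in 1:nmc){
  f[k] = -sum(dpois(y_obs, a[k]*x + b[k], log=TRUE))
}

# Weights, then weighted mean and weighted covariance of the N_MC x 2 matrix
w = dnorm(sqrt(2*(f - min(f))))
fit = cov.wt(cbind(a, b), wt=w, cor=TRUE)

print(fit$center)            # weighted mean estimates of a and b
print(sqrt(diag(fit$cov)))   # one standard deviation uncertainties
print(fit$cor)               # correlation matrix
```

cov.wt() normalises the weights to sum to one internally, so they can be passed in as computed.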

Advantages of the weighted mean method: with this method every single point you sample gives information about the best-fit parameters and the covariance matrix for those parameter estimates, unlike the fmin+1/2 method, where it is only those points right near the minimum value of f and at fmin+1/2 that really matter in calculating the confidence interval.

Also, using this weighted method you trivially get the estimated covariance matrix for the parameters, unlike the fmin+1/2 method where this would be much harder to achieve.

Another advantage of this method is that you don’t have to populate your plots quite as densely as you would for the fmin+1/2 method in order for it to reliably work; this is because every single point you sample is now informing the calculation of the weighted mean and weighted covariance matrix.

The disadvantage of this method is that you must uniformly randomly sample the parameters (no preferential sampling of parameters using rnorm, for instance), and you must sample them over a broad enough range that it encompasses at least a three or four standard deviation confidence interval; otherwise, as we’ll see, you will underestimate the parameter uncertainties.

**An example**

As an example of how this works in practice, let’s return to the simple example we saw in this previous module, where we compared the performance of the fmin+1/2 method to that where we analytically calculate the Hessian to estimate the parameter uncertainties.

In the example, the model we considered was y=a*x+b, where a=0.1 and b=10, and x goes from 10 to 150, in integer increments. We simulate the stochasticity in the data by smearing with numbers drawn from the Poisson distribution with mean equal to the model prediction. Thus, an example of the simulated data looks like this:
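The simulation described above takes only a few lines. This sketch is not the original script; the random seed and plotting choices are arbitrary.

```r
set.seed(2718)

# True model: y = a*x + b
a = 0.1
b = 10
x = seq(10, 150)
y_pred = a*x + b

# Smear the model prediction with Poisson stochasticity
y_obs = rpois(length(x), y_pred)

plot(x, y_obs, cex=1.5, xlab="x", ylab="y", main="Simulated data")
lines(x, y_pred, col="red", lwd=3)
```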

Recall that the Poisson negative log likelihood looks like this:

f = sum_i [ y_i^pred - y_i^obs*log(y_i^pred) + log(y_i^obs!) ]

where the y_i^obs are our data observations, and the y_i^pred are our model predictions (y_i^pred = a*x_i+b).
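In R, the Poisson negative log likelihood can be obtained from dpois() with log=TRUE. The sketch below compares that with the sum written out term by term, sum(y_pred - y_obs*log(y_pred) + log(y_obs!)); the function name poisson_negloglike and the three data points are hypothetical.

```r
# Poisson negative log likelihood of observations y_obs
# given model predictions y_pred
poisson_negloglike = function(y_obs, y_pred){
  -sum(dpois(y_obs, y_pred, log=TRUE))
}

# Hypothetical observations and model predictions
y_obs = c(12, 9, 15)
y_pred = c(11, 12, 13)

f_dpois = poisson_negloglike(y_obs, y_pred)

# The same thing written out; lgamma(y+1) is log(y!)
f_explicit = sum(y_pred - y_obs*log(y_pred) + lgamma(y_obs+1))

print(c(f_dpois, f_explicit))
```

The log(y_obs!) term does not depend on the model parameters, so it shifts f by a constant and does not affect the best-fit values or the confidence intervals.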

In the example hess.R, we randomly generated many different samples of our y^obs, and then used the Monte Carlo parameter sweep method to find the values of a and b that minimize the negative log likelihood. Then we calculated the Hessian about this minimum and estimated the one-standard deviation uncertainties on a and b from the covariance matrix that is the inverse of the Hessian matrix. Recall that the square roots of the diagonal elements of the covariance matrix are the parameter uncertainties.
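The step from Hessian to parameter uncertainties can be sketched as follows. Note that this is not what hess.R does: instead of a Monte Carlo sweep and an analytic Hessian, the sketch uses R's optim() to find the minimum and to return a numerical approximation of the Hessian. The seed, starting values and parscale setting are assumptions.

```r
set.seed(31415)

# Simulated data from y = 0.1*x + 10 with Poisson stochasticity
x = seq(10, 150)
y_obs = rpois(length(x), 0.1*x + 10)

# Poisson negative log likelihood as a function of par = c(a, b)
negloglike = function(par){
  y_pred = par[1]*x + par[2]
  if (any(y_pred <= 0)) return(1e10)  # guard against invalid predictions
  -sum(dpois(y_obs, y_pred, log=TRUE))
}

# Minimize, and ask optim for a numerical Hessian at the minimum.
# parscale tells optim that a and b have very different magnitudes.
fit = optim(par=c(0.1, 10), fn=negloglike, hessian=TRUE,
            control=list(parscale=c(0.01, 1)))

# The covariance matrix is the inverse of the Hessian of the negative log likelihood
covmat = solve(fit$hessian)
par_sd = sqrt(diag(covmat))

cat("Best-fit a and b:", fit$par, "\n")
cat("One std dev uncertainties on a and b:", par_sd, "\n")
cat("Correlation between a and b:", cov2cor(covmat)[1, 2], "\n")
```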

We also did this using the fmin+1/2 method. If the fmin+1/2 method works, its estimate of the one-standard-deviation confidence intervals should be very close to the Hessian estimate.

We can add into this exercise our weighted mean method. The R script hess_with_weighted_covariance_calculation.R does just this.

The script produces the following plot, histogramming the parameter estimates from the weighted mean method. As you can see, the estimates are unbiased, and the uncertainties on the parameters assessed by the weighted mean method are very close to those assessed by the analytic Hessian method:

The script also produces the following plot:

Notice in the top two plots that the parameter uncertainties assessed by the weighted mean method are quite close to those estimated by the Hessian method, but the uncertainties assessed by the fmin+1/2 method are always underestimates. This is because we didn’t sample that many points in our graphical Monte Carlo procedure, as can be seen in the examples in the two bottom plots; the plots are so sparsely populated, the green arrows that represent the CI’s estimated by the fmin+1/2 method don’t go all the way to the edge of the parabolic envelope.

So, even with relatively sparsely populated plots, the weighted mean method works quite well. If they are really, really sparsely populated, however, you will find that the performance of the method starts to degrade; take a look at what happens when you change nmc_iterations from 10000 to 100 in hess_with_weighted_covariance_calculation.R:

The estimates of the parameter uncertainties are still scattered about the Hessian estimates (and the fmin+1/2 method fails miserably due to the sparsity of points). However, notice that there is quite a bit of variation in the uncertainty estimates using the weighted mean method about the red dotted line (compare to the other plot, above); the more MC iterations you have, the more closely these will cluster about the expected values (i.e., the more trustworthy your parameter uncertainty estimates will be). So, don’t skimp on the MC parameter sampling iterations, even when using the weighted mean method! In general, with this method, you need to run enough MC parameter sweep iterations to get a reasonable idea of the parabolic envelope in the vicinity of the best-fit value.

One catch of this method, as mentioned above, is to ensure that you do your random uniform parameter sweeps over a broad enough area… if you sample parameters too close to the best-fit value, the weighted mean method will underestimate the confidence intervals. In general, you have to sweep a range that covers at least four standard deviations on either side of the best-fit value. As an example of what occurs when you use too narrow a range, instead of sampling parameter a uniformly from 0.06 to 0.14, uniformly sample it from 0.09 to 0.11 in the original hess_with_weighted_covariance_calculation.R script. We now get:

You can see that the confidence intervals are now severely under-estimated by the weighted mean method.
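The effect of the sweep range can be illustrated with a small sketch. It rests on two assumptions that are not taken from hess_with_weighted_covariance_calculation.R: the data are a hypothetical straight-line toy model with known Normal noise, and the weighted mean method is implemented as a covariance of the sampled parameters weighted by the likelihood (via `cov.wt` from base R). The sweep widths are expressed in multiples of the known analytic uncertainty purely for illustration:

```r
# Hypothetical toy data (NOT the data used in the original script)
set.seed(42)
x     <- seq(0, 100, length.out = 50)
sigma <- 2
y_obs <- rnorm(length(x), mean = 0.1 * x + 10, sd = sigma)

negloglike <- function(a, b) sum((y_obs - (a * x + b))^2) / (2 * sigma^2)

# Known one-standard-deviation uncertainties for this toy linear model
sxx  <- sum((x - mean(x))^2)
se_a <- sigma / sqrt(sxx)
se_b <- sigma * sqrt(sum(x^2) / (length(x) * sxx))
best <- coef(lm(y_obs ~ x))
a0   <- unname(best["x"])
b0   <- unname(best["(Intercept)"])

# Sweep a uniformly over best-fit +/- k*se_a (b always over +/- 4*se_b),
# then take the likelihood-weighted covariance of the sampled parameters
weighted_se_a <- function(k, nmc_iterations = 20000) {
  a_samp <- runif(nmc_iterations, a0 - k * se_a, a0 + k * se_a)
  b_samp <- runif(nmc_iterations, b0 - 4 * se_b, b0 + 4 * se_b)
  nll    <- mapply(negloglike, a_samp, b_samp)
  w      <- exp(-(nll - min(nll)))  # ASSUMPTION: weights proportional to likelihood
  V      <- cov.wt(cbind(a = a_samp, b = b_samp), wt = w / sum(w))$cov
  sqrt(V["a", "a"])
}

se_broad  <- weighted_se_a(k = 4)  # sweep covers +/- 4 standard deviations
se_narrow <- weighted_se_a(k = 1)  # sweep covers only +/- 1 standard deviation
print(c(analytic = se_a, broad = se_broad, narrow = se_narrow))
```

With the narrow sweep the sampled points never reach the tails of the likelihood, so the weighted spread is cut off by the edges of the sampling range and comes out too small.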

It needs to be kept in mind that the covariance matrix returned by the weighted mean method assumes that the confidence interval is symmetrically distributed about the best-fit value. In practice, this isn’t always the case; sometimes the plots of the neg log likelihood vs parameter hypotheses, instead of looking like they have a symmetric parabolic envelope, have a highly asymmetric parabolic envelope, like this, for example:

The weighted mean method will essentially produce a one standard deviation estimate that is derived from an “average” symmetric parabola fit to the asymmetric parabola. It will tend to underestimate confidence intervals in such cases. When you have highly asymmetric parabolic envelopes in your plots of the neg log likelihood vs your parameter hypotheses, it is thus best to use the fmin+1/2 method.

**Introduction**

In a previous module, we explored an example of Least squares fitting the parameters of a mathematical SIR contagion model to data from a real influenza epidemic using the Monte Carlo parameter sampling method. The R script fit_iteration.R performs the Monte Carlo iterations, randomly sampling values of the reproduction number, R0, and the time of introduction of the virus to the population, t0, from uniform random distributions, calculates the Least Squares statistic, and plots the result to show which value of R0 and t0 minimizes the Least Squares statistic.

If you have n data points Y_i (where i=1,…,n), and model predictions for those data points, model_i (note that these predictions depend on the model parameters!), then the least squares statistic is calculated like this (let’s call that statistic “LS”):

$$\mathrm{LS} = \sum_{i=1}^{n} \left(Y_i - \mathrm{model}_i\right)^2$$

In this case, our model_i estimates for each week are coming from our SIR model, and the Y_i are the number of cases we observed for that week.
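As a minimal sketch, the Least Squares statistic is the sum of squared differences between the observed counts and the model predictions. The function name and the small vectors of counts below are hypothetical; they are not the influenza data or the SIR model output used in fit_iteration.R:

```r
# Least Squares statistic: sum of squared residuals
least_squares <- function(y_obs, model_pred) {
  sum((y_obs - model_pred)^2)
}

# Hypothetical observed weekly counts and model predictions
y_obs      <- c(2, 5, 11, 30, 62)
model_pred <- c(3, 6, 10, 28, 65)

ls_stat <- least_squares(y_obs, model_pred)
print(ls_stat)  # 16
```

In the Monte Carlo procedure, a function like this would be evaluated once per sampled (R0, t0) pair, with `model_pred` recomputed from the model each time.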

The fit_iteration.R script produces the following plot:

**Fitting with the Pearson chi-square goodness of fit statistic**

In another module, we discussed the underlying assumptions of the Least Squares statistic… namely that the data points are independent, and the stochasticity underlying the random variation in the data points about the model prediction is Normal, with equal variance for each data point (“homoskedasticity”). In actuality, count data usually are not homoskedastic, particularly if there is a wide range in counts in the data, from small to large. In this particular data set, our counts per time bin range from 2 to 254. Thus, while Least Squares fitting is conceptually easy to understand, it probably isn’t the best choice for these particular data.

In this past module, we discussed generalized Least Squares fitting using the Pearson chi-squared statistic. The Pearson chi-squared statistic is only appropriate for count data, and adjusts the goodness of fit statistic to take into account the heteroskedasticity seen in count data. It is a “generalized” or “weighted” least squares statistic, and is calculated as follows:

$$\chi^2 = \sum_{i=1}^{n} \frac{\left(Y_i - \mathrm{model}_i\right)^2}{\mathrm{model}_i}$$

Its underlying premise is that the true probability distribution underlying the data stochasticity is Poisson (which approaches Normal when the counts are high enough). Weighted least squares statistics divide each squared residual by the square of the uncertainty on that data point. For Poisson distributed data, the uncertainty is the square root of the expected mean, so each squared residual is divided by the model prediction for that bin.
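A minimal sketch of the Pearson chi-square statistic, using hypothetical counts and predictions (not the influenza data) and an illustrative function name. Each squared residual is divided by the Poisson variance, which is the model prediction itself:

```r
# Pearson chi-square statistic: squared residuals weighted by 1/variance,
# where the Poisson variance equals the model prediction
pearson_chisq <- function(y_obs, model_pred) {
  sum((y_obs - model_pred)^2 / model_pred)
}

# Hypothetical observed weekly counts and model predictions
y_obs      <- c(2, 5, 11, 30, 62)
model_pred <- c(3, 6, 10, 28, 65)

chisq_stat <- pearson_chisq(y_obs, model_pred)
print(chisq_stat)
```

Note that the model predictions must be strictly positive for the statistic to be defined.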

The R script fit_iteration_pearson.R fits to the same influenza data as above, but instead of looking for the R0 and t0 that minimize the Least Squares statistic, it optimizes the Pearson chi-square statistic. The script produces the following plot:

You’ll notice that the best-fit values are quite different than what we got using the Least Squares statistic! That is because the Least Squares statistic weights every squared residual equally, regardless of the expected count. For Poisson distributed data the variance equals the mean, so a deviation of a given size is far more significant in a bin with few expected counts than in a bin with many. Least Squares therefore lets the high-count bins dominate the fit, whereas the Pearson chi-square statistic weights each bin according to its expected variance. The Pearson chi-square statistic isn’t perfect when the data are over-dispersed, but for count data it is far preferable to Least Squares fitting.

**Optimization using the Poisson negative log-likelihood**

The Pearson chi-squared statistic, while better than Least Squares for count data, is still only a good choice if there are enough counts in the data that the Poisson distribution approaches the Normal (generalized least squares statistics still have the assumption of Normally distributed stochasticity). This occurs when the expected number of counts is roughly 10 or more. In our influenza data, we have several bins with fewer than 10 counts.

So, we need a fit statistic that properly takes into account that our data are Poisson distributed (let’s ignore over-dispersion for the moment). This is achieved by optimizing the negative log Poisson likelihood statistic, described in this past module:

$$-\log L = \sum_{i=1}^{n} \left[\lambda_i - k_i \log \lambda_i + \log\left(k_i!\right)\right]$$

where the k_i is the observed number of counts in the i^th bin, and lambda_i is your model prediction for that bin.
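A minimal sketch of this statistic in R, using the built-in Poisson density `dpois` with `log = TRUE`. The function name and the counts and predictions are hypothetical, not the influenza data:

```r
# Poisson negative log likelihood: minus the sum of the log Poisson
# probabilities of observing k_i counts when lambda_i are expected
poisson_negloglike <- function(k_obs, lambda_pred) {
  -sum(dpois(k_obs, lambda = lambda_pred, log = TRUE))
}

# Hypothetical observed weekly counts and model predictions
k_obs       <- c(2, 5, 11, 30, 62)
lambda_pred <- c(3, 6, 10, 28, 65)

nll_poisson <- poisson_negloglike(k_obs, lambda_pred)
print(nll_poisson)
```

Using `dpois(..., log = TRUE)` rather than taking `log(dpois(...))` avoids numerical underflow when the individual probabilities are very small.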

The R script fit_iteration_poisson_likelihood.R fits to our influenza epidemic data, calculating the Poisson negative log likelihood at each iteration. The script produces the following plot:

**Over-dispersed count data: the Negative Binomial negative log-likelihood**

As mentioned in this past module, if your research question involves count data, such data are almost always over-dispersed, meaning that the stochastic variation in the data is much larger than would be expected from the Poisson distribution.

In this case, the best choice is the Negative Binomial likelihood. The Negative Binomial is a discrete probability distribution with an extra parameter, alpha, that is a measure of how over-dispersed the data are. If alpha=0, the data are Poisson distributed. If alpha gets large, there is a lot of over-dispersion in the data.
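A minimal sketch of the Negative Binomial negative log likelihood, using the built-in `dnbinom` with its `mu` and `size` arguments. It assumes the parameterization in which the variance is mu + alpha*mu^2, so that `size = 1/alpha`; this choice, the function name, and the example numbers are assumptions, not taken from fit_iteration_negbinom_likelihood.R:

```r
# Negative Binomial negative log likelihood.
# ASSUMPTION: variance = mu + alpha * mu^2, i.e. size = 1 / alpha in dnbinom.
negbinom_negloglike <- function(k_obs, lambda_pred, alpha) {
  stopifnot(alpha > 0)  # alpha = 0 is the Poisson limit; use dpois for that case
  -sum(dnbinom(k_obs, size = 1 / alpha, mu = lambda_pred, log = TRUE))
}

# Hypothetical observed weekly counts and model predictions
k_obs       <- c(2, 5, 11, 30, 62)
lambda_pred <- c(3, 6, 10, 28, 65)

# A tiny alpha is nearly indistinguishable from the Poisson likelihood
nll_small_alpha <- negbinom_negloglike(k_obs, lambda_pred, alpha = 1e-8)

# A larger alpha allows for much more variation about the prediction
nll_large_alpha <- negbinom_negloglike(k_obs, lambda_pred, alpha = 0.5)

print(c(small_alpha = nll_small_alpha, large_alpha = nll_large_alpha))
```

In a Monte Carlo parameter sweep, alpha would be sampled alongside the model parameters, for example from a uniform distribution over a range of positive values.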

The R script fit_iteration_negbinom_likelihood.R fits to the influenza data, optimizing the Negative Binomial negative log-likelihood. The parameter alpha is now an additional nuisance parameter we have to fit for. The script produces the following plot:

Notice that 10,000 Monte Carlo iterations isn’t really sufficient to exactly pinpoint our best fit values and to get a good idea of the parabolic envelope below which we don’t see any points in the first three plots. This is because the more parameters you are fitting for, the more Monte Carlo iterations you will have to do to pinpoint the best-fit for the combination of all parameters.

**Summary**

Many of the problems we encounter in our research questions involve integer count data. In this module, we discussed that Least Squares probably isn’t the best choice for such data due to heteroskedasticity (however, you will see in the literature examples where people apply LS fits to count data anyway!). Inappropriate uses of the LS statistic should be caught in review, but often aren’t.

We discussed how a weighted least squares statistic, like Pearson chi-square, can help adjust for the heteroskedasticity problem in count data, and is a nice alternative as long as you have at least 10 counts in each of your bins.

If the count data involve low counts, a better choice is the Poisson negative log-likelihood (and at times you see such fits in the literature too), but count data are usually over-dispersed, in which case the best choice always is the Negative Binomial negative log-likelihood. In fact, in general, the Negative Binomial statistic is *always* applicable to independent count data, whereas the other three statistics we discuss here each have limitations in their applicability.

The only drawback of using the Negative Binomial likelihood is that it requires fitting for an extra parameter, the over-dispersion parameter, and the mathematical expression of the statistic looks complicated and involved, and can potentially scare the bejeezuz out of reviewers of your papers. Don’t let that stop you from using it though. Simply cite well-written papers like the following as precedent for using NB likelihood for count data in the life and social sciences, and perhaps consider not writing the explicit formula for the NB likelihood in your paper… just mention that you used it: Maximum Likelihood Estimation of the Negative Binomial Dispersion Parameter for Highly Overdispersed Data, with Applications to Infectious Diseases.

If you want to get your work published in a timely fashion, strive to use methods that are rigorous, but as simple as possible. If you have to use a more complicated method, describe it in your paper in plain and simple terms. In my long experience, this can avoid many problems with papers getting hung up for unnecessarily long times in review.

The basic elements of virtually any scientific paper are as follows (Introduction, Methods and Materials, Results, and Discussion and Summary):

**Introduction:** This section always appears in a paper. At the beginning of the Introduction is where you motivate your work (i.e., why should anyone care?). Start from a broad motivation, and move to focus in on the particular motivation of your work. For instance, let’s assume I was writing an Ebola paper that describes a compartmental modelling analysis I did to assess the effects of isolation and/or quarantine on the spread of the disease:

- At the very beginning of the introduction I’d start off talking about the number of people killed in past outbreaks, and the wide geographic spread of outbreak locations, and the ever present risk that Ebola cases can be imported to other areas of the world due to modern air travel (the point being that no matter where you live, you should care about Ebola). Then I’d talk about the high mortality of the disease. These would be among the first few sentences in pretty much **any** paper written about Ebola.
- For my particular paper, I’d then mention that the lack of current treatment options (like vaccines or medications) leaves better hygiene, quarantine, and isolation as the only options available to slow the spread of disease.

After *motivating* the project, you then move on to describing the *objective* of the paper. This is where you present your research question, and where you give a very short overview of what you did in your analysis, and how it advances the body of work in the published literature on the subject. In the Ebola paper case, I would add some sentences saying that mathematical models are being increasingly used to assess the efficacy of disease intervention strategies (and I would cite a few well known seminal publications on that topic). Then I would state that in this work we use a mathematical model to assess the efficacy of isolation and quarantine, and I would state that no one has ever done that before for Ebola (as of August 2014, this was true). *In this part of the paper it is very important to state what is new and novel about your work.*

Once you have described your motivation and objective, it is a good idea to end the Introduction with a sentence or two that gives a road map for what the reader should expect in the following sections. Something like “In the following section, we will describe the data sources and mathematical and statistical modelling methodologies used in these studies, followed by a presentation of results and discussion” (this is assuming your analysis uses data, a mathematical model, and statistical methods).

**Methods and Materials:** This section always appears in a paper. If you are using data in your analysis, the first subsection in this section should be **Data**. The Data section should *thoroughly* describe your sources of data. If you collected it, what were your laboratory or field protocols? If it is time series data, what time steps are used? What, precisely, is the data measuring? If you got the data online, give a reference to the source. Even if you didn’t collect the data, you need to describe the collection procedures of the person or group who did collect the data.

If you are using a mathematical or computational model, the next subsection should be **Model**. In this subsection, you will describe what kind of model you are using, and give citations to relevant related publications in the field. You will describe what is new and interesting about your model (if relevant… sometimes it is the data that are new and interesting, and what is novel is applying an old model to new data). Here you will give the model equations and compartmental flow diagram (if using a compartmental model), or other details about your mathematical or computational model. You need to give enough details that anyone could reproduce your work based on this information.

If you are using statistical methods that are fancier than your usual statistical tests based on Student T, Z scores, Spearman rho, etc etc, you need to have a subsection under Methods and Materials called **Statistical Methods**. This subsection would be appropriate, for instance, if the statistical methods you use are so esoteric that they are either new, or very rarely used in your field.

**Results:** This section always appears in a paper. Here is where, *without discussion*, you give the results of your paper, often in tables and figures, and accompanied text. *Do not discuss the results here!*

**Discussion:** This section always appears in a paper. Never put results here that you haven’t already presented in the Results section… they should be in the Results section! In the Discussion section, you talk about notable things revealed by your results and how this fits in with (or contradicts) the published literature.

**Summary:** This section is sometimes called Conclusions, and sometimes is lumped in with Discussion (and called Discussion and Summary). It depends on the journal. If there is a separate Summary section, you start off with a little paragraph describing what you presented in the paper, and why it is new and novel. In the summary you detail limitations of your study, possible future work, etc, and usually end with a “feel good” sentence about the utility of studies like yours.

**Lacum et al rubric for identifying seven key elements of scientific papers**

In 2014, Lacum et al published a study where they trained students to look for seven key elements when reading or writing papers. As I discussed above and in this post, where I describe what sections need to be in a scientific paper, these elements are integral in the sections of a paper:

- Motive: Statement indicating why the research was done (e.g., a gap in knowledge, contradictory results). The motive leads to the objective. The motive should appear in the Abstract and Introduction.
- Objective: Statement about what the authors want to know. The objective may be formulated as a research question, a research aim, or a hypothesis that needs to be tested. The objective should appear in the Abstract and Introduction.
- Main conclusion: Statement about the main outcome of the research. The main conclusion is closely connected to the objective. It answers the research question, it says whether the research aim was achieved, or it states whether the hypothesis was supported by evidence. The main conclusion will lead to an implication. The main conclusion is often the last sentence in the Abstract, and is of course also described in the Discussion and Summary.
- Implication: Statements indicating the consequences of the research. This can be a recommendation, a statement about the applicability of the results (in the scientific community or society), or a suggestion for future research. This may appear in the Abstract, and certainly appears in the Discussion and Summary.
- Support: The statements the authors use to justify their main conclusion. These statements can be based on their own data (or their interpretation) or can be statements from the literature (references).
- Counterargument: Statements that weaken or discredit the main conclusion. For example, possible methodological flaws, anomalous data, results that contradict previous studies, or alternative explanations. Counterarguments are sometimes presented as limitations. They are placed in the Discussion and Summary.
- Refutation: Statements that weaken or refute a counterargument. Refutation appears in the Discussion and Summary.
