In this past analysis, my colleagues and I examined mass killings data in the US, and fit the data with a model that included a contagion effect, and another model that included temporal trends but no contagion effect. Because the data were count data (number of mass killings per day), we used a Negative Binomial likelihood fit.

And in this past analysis, which was an AML 612 class publication project, we examined data from a norovirus gastrointestinal disease outbreak aboard a cruise ship, and fit the data with a model that included both direct transmission, and transmission from contaminated surfaces in the environment. We also fit a model with just direct transmission, and a model with just environmental transmission. Again, because the data were count data (number of cases per day), we used a Negative Binomial likelihood fit.

In this past analysis, which was also an AML 612 class publication project, we examined US Twitter data and Google search trend data related to sentiments expressing concern about Ebola during the virtually non-existent outbreak of Ebola in the US in 2014. We fit the data with a mathematical model of contagion that used the number of media reports per day as the “vector” that sowed panic in the population, and included a “recovery/boredom” effect where after a period of time, no matter how many news stories run about Ebola, people lose interest in the topic. We compared this to a simple regression fit that did not include a boredom effect.

What helped to make these analyses interesting and impactful was the exploration of what dynamics in the model had to be there in order to fit the data well. When you have the skill set to fit a model to data with the appropriate likelihood statistic and estimate model parameters and uncertainties, it opens up a wide range of interesting research questions you can explore… you fit the model to the data as it is today, and then using that model you can explore the effect of control strategies. However, if you also add in the ability to make statements about which dynamical systems fit the data “better” than alternate modelling hypotheses, you’ll find there is a lot of very interesting low-hanging research fruit out there.

For the purposes of the following discussion, we will assume that we are fitting to our data using a negative log-likelihood statistic. Recall that the Least Squares statistic can be transformed to the Normal negative log-likelihood statistic.

**Likelihood ratio test for nested models**

When two models are “nested”, meaning that one has all the dynamics of the other (the “null model”) plus one or more additional effects, we can use what is known as the likelihood ratio test to determine if the more complex model fits the data significantly better. To do this, first calculate the best-fit negative log-likelihood of the null model (the simpler model):

x_0=-log(L_0)

Then calculate the best-fit negative log-likelihood of the more complex model:

x_1=-log(L_1)

The likelihood ratio test is based on the statistic lambda = -2*(x_1-x_0). If the null model is correct, this test statistic is distributed as chi-squared with degrees of freedom, k, equal to the difference in the number of parameters between the two models *that are being fit for*. This is known as Wilks’ theorem.

If the null hypothesis is true, this means that the p-value

p=1-pchisq(lambda,k) Eqn(1)

should be drawn from a uniform distribution between 0 and 1.

If the p-value is very small, it means the null hypothesis is disfavoured, and we accept the alternate hypothesis that the more complex model is favoured. Generally, a p-value cutoff of p<0.05 is used to reject a null hypothesis.

Note that more complex models generally fit the data better than a simpler nested model. This is because the more complex model has more parameters, and those extra parameters give the model more wiggle room to fit to variations in the data. But the problem is that more parameters also increase the uncertainty on all the parameter estimates; so sure… it might give a lower negative log-likelihood, but there is a cost to be paid for that. The Wilks likelihood ratio test in effect penalizes you for the number of extra parameters you are fitting for (that k in Eqn 1 above). The higher k is, the lower the best-fit negative log-likelihood for the more complex model has to be in order for the null model to be rejected.

**Example**

One example of the kind of research question that can be answered using this methodology: when fitting to data for a predator-prey system (say, coyotes and rabbits), we can examine whether or not a Holling type II model fits the data significantly better than a Holling type I model. In the Holling type I model, the number of prey consumed, Y, is a linear function of the density of prey, X, the discovery rate, a, and the time spent searching, T:

Y = a*X*T Eqn(2)

In a Holling type II model, the relationship is

Y = a*X*T/(1+a*b*X) Eqn(3)

Note that the Holling type I model is nested within the Holling type II model when b=0, and thus a likelihood ratio test can be used to determine if one model fits the data significantly better. The Holling type II model has one extra parameter being fitted for compared to the Holling type I model.
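To make the nesting concrete, here is a minimal R sketch of Equations 2 and 3. The function names holling_type1 and holling_type2, the argument name Tsearch (used instead of T, which R treats as shorthand for TRUE), and all parameter values are my own illustrative choices, not taken from any particular analysis.

```r
# Holling type I: prey consumed is linear in the prey density X (Eqn 2)
holling_type1 = function(X, a, Tsearch){
  a*X*Tsearch
}

# Holling type II: consumption saturates at high prey density (Eqn 3)
holling_type2 = function(X, a, b, Tsearch){
  a*X*Tsearch/(1 + a*b*X)
}

X = seq(0, 100, by=10)   # hypothetical prey densities
a = 0.1                  # hypothetical discovery rate
Tsearch = 5              # hypothetical time spent searching

# with b=0 the type II model reduces exactly to the type I model
y1 = holling_type1(X, a, Tsearch)
y2 = holling_type2(X, a, b=0, Tsearch)
print(all.equal(y1, y2))
```

Because setting b=0 recovers the simpler model exactly, the type I model is a special case of the type II model, which is what the likelihood ratio test requires.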

For example, if our “null” Holling type I model when fit to the data with b=0 yields a best-fit negative log-likelihood of x_0=-log(L_0)=900, and our “alternate” Holling type II model yields a best-fit negative log-likelihood of x_1=-log(L_1) = 898.5, we would calculate the likelihood ratio test statistic as lambda = -2*(x_1-x_0)=2*1.5=3. If the null hypothesis is true, then lambda should be distributed as chi-squared with degrees of freedom, k, equal to the difference in the number of parameters between the two models (in this case, k=1, because the only difference in the parameters between the two models is the addition of the parameter b). Thus, the test is

pvalue_testing_null=1-pchisq(lambda,difference_in_degrees_of_freedom)

pvalue_testing_null=1-pchisq(3,1)

which yields pvalue_testing_null=0.083. Thus in this case, we would say there is no statistically significant evidence that the Holling type II model fits the data better than the Holling type I model. This doesn’t mean, btw, that the Holling type II model is “wrong” and the Holling type I model is “right”. It simply means that based on this particular data, there is no statistically significant difference. Having more data increases your sensitivity to detecting differences in the dynamics.

If, for example, you fit the models to another, larger, data set, and find x_0=532 and x_1=521, then in this case

pvalue_testing_null=1-pchisq(-2*(521-532),1)

and we get pvalue_testing_null=2.7e-6, which is very small indeed, and we conclude that the Holling type II model fits the data significantly better. In a paper, we would state these results along these lines: “The Negative Binomial negative log-likelihood best-fit statistics for the Holling type I and Holling type II models were 532 and 521, respectively. The likelihood ratio test statistic has p-value<0.001, thus we conclude the dynamics of the Holling type II model are favoured over the type I model.”
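The two comparisons above can be wrapped in a small helper. This is just a sketch: the function name lr_test and its argument names are my own, and it assumes you have already obtained the best-fit negative log-likelihoods from your fits.

```r
# Likelihood ratio test for nested models.
# x_null, x_alt: best-fit negative log-likelihoods of the null and alternate models
# k: number of extra parameters being fit for in the alternate model
lr_test = function(x_null, x_alt, k){
  lambda = -2*(x_alt - x_null)
  # lower.tail=FALSE is equivalent to 1-pchisq(lambda,k),
  # but is numerically more accurate when the p-value is tiny
  pvalue = pchisq(lambda, df=k, lower.tail=FALSE)
  return(list(lambda=lambda, k=k, pvalue=pvalue))
}

first  = lr_test(x_null=900, x_alt=898.5, k=1)
second = lr_test(x_null=532, x_alt=521,   k=1)
cat("first data set:  lambda =", first$lambda,  " p =", signif(first$pvalue, 2), "\n")
cat("second data set: lambda =", second$lambda, " p =", signif(second$pvalue, 2), "\n")
```

The printed p-values should match the ones quoted in the text (about 0.083 for the first data set, and about 2.7e-6 for the second).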

**What to do if the models being compared aren’t nested**

In all of the examples above, we assumed that the models were nested: for example, in the contagion in mass killings analysis, the model without contagion but just with temporal trends was nested within the model with contagion and temporal trends. In the norovirus analysis, the models with just direct transmission, and just environmental transmission, were nested in the model that contained both effects.

But what if we are trying to compare two plausible models that aren’t nested? In that case, we can use the Akaike Information Criterion (AIC) statistic. If the negative log-likelihood of model #1 is x_1=-log(L_1), and the negative log-likelihood of model #2 is x_2=-log(L_2), then the AIC statistic for model number 1 is

AIC_1 = 2*x_1 + 2*q_1

where q_1 is the total number of parameters being fitted for in model #1, and the AIC statistic for model #2 is

AIC_2 = 2*x_2 + 2*q_2

where q_2 is the total number of parameters being fitted for in model #2.

Now, find the minimum value of the two AIC statistics,

AIC_min = min(AIC_1,AIC_2)

Then, for each model, calculate the quantity:

B_i=exp((AIC_min-AIC_i)/2)

And then for each model calculate what is known as the “relative likelihood” (see also this paper):

p_i = B_i/sum(B_i)

If one of the models has a relative likelihood p_i>0.95, we conclude it is significantly favoured over the other.

Note that the AIC statistic penalizes you for the number of parameters being fit for… increasing the number of parameters might help bring the negative log-likelihood down, but it can result in an increase of the AIC.

Sometimes people just choose the model that has the lowest AIC statistic as being the “best” model (and in fact this is very commonly seen in the literature), but problems arise when there is only a small difference between the AIC statistics being compared. If they are very close, one model really is not much better than the other. Calculation of the relative likelihood statistic makes that apparent.

You can use the AIC statistic to compare nested models, but if the models truly are nested, then the Wilks likelihood ratio test is preferred.

**AIC Example #1**

As an example of the use of AIC statistics to compare models, let’s examine our Holling type I/II hypothetical analysis (even though it is a nested model example, and the Wilks likelihood ratio test would be the preferred method to compare the models).

If our Holling type I model when fit to the data with parameter b=0 yields a best-fit negative log-likelihood of x_1=-log(L_1)=900 and one parameter is being fit for (the “a” in Equation 2) then the AIC statistic for that model is

AIC_1 = 2*900 + 2*1=1802

If the fit of our Holling type II model yields a best-fit negative log-likelihood of x_2=-log(L_2) = 898.5, when two parameters are being fit for (the “a” and “b” in Equation 3), then the AIC statistic for that model is

AIC_2 = 2*898.5 + 2*2=1801

Thus, the AIC of the Holling type II model looks to be just a little bit lower than that of the Holling type I model. The B statistics for the two models are

B_1 = exp((1801-1802)/2) = 0.606

B_2 = exp(0) = 1

And the relative likelihoods are

p_1 = B_1/(B_1+B_2) = 0.606/1.606 = 0.377

p_2 = B_2/(B_1+B_2) = 1/1.606 = 0.623

Neither of these p_i’s is greater than 0.95, so we conclude that neither model is significantly favoured over the other.

In a paper, we would state these results along these lines: “The AIC statistics derived from the Holling type I and Holling type II model fits to the data are 1802 and 1801, respectively. Neither model appears to be strongly preferred [1].” With reference [1] being:

Wagenmakers EJ, Farrell S. AIC model selection using Akaike weights. Psychonomic bulletin & review. 2004 Feb 1;11(1):192-6.

However, as noted above, because these are actually nested models, a better choice would be the Wilks likelihood ratio test. In practice, you should only use AIC for non-nested models.
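The arithmetic in this example is easy to script. Here is a minimal sketch; the function name aic_relative_likelihood and its argument names are my own. It takes the best-fit negative log-likelihoods and the number of fitted parameters for any number of models, and returns the AIC, B, and relative likelihood of each.

```r
# negloglik: vector of best-fit negative log-likelihoods, one entry per model
# npar:      vector of the number of parameters being fit for, one entry per model
aic_relative_likelihood = function(negloglik, npar){
  aic = 2*negloglik + 2*npar
  B   = exp((min(aic) - aic)/2)
  p   = B/sum(B)
  return(data.frame(aic=aic, B=B, p=p))
}

# Holling type I (one parameter) and type II (two parameters) fits from this example
res = aic_relative_likelihood(negloglik=c(900, 898.5), npar=c(1, 2))
print(round(res, 3))
```

Because the function is vectorised, the same call works when comparing three or more candidate models at once.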

**AIC Example #2**

If our Holling type I model when fit to another set of data with parameter b=0 yields a best-fit negative log-likelihood of x_1=-log(L_1)=532 and one parameter is being fit for (the “a” in Equation 2) then the AIC statistic for that model is

AIC_1 = 2*532 + 2*1=1066

If the fit of our Holling type II model yields a best-fit negative log-likelihood of x_2=-log(L_2) = 521, when two parameters are being fit for (the “a” and “b” in Equation 3), then the AIC statistic for that model is

AIC_2 = 2*521 + 2*2=1046

The AIC of the Holling type II model is lower than that of the Holling type I model. The minimum value of the AIC for the two fits is 1046.

The B statistics for the two models are thus

B_1 = exp((1046-1066)/2) = exp(-10) = 0.000045

B_2 = exp(0) = 1

And the relative likelihoods are

p_1 = B_1/(B_1+B_2) = 0.000045/1.000045 = 0.000045

p_2 = B_2/(B_1+B_2) = 1/1.000045 = 0.99995

The relative likelihood of the second model is greater than 0.95, thus we conclude it is significantly favoured.

In a paper we would say “The AIC statistics derived from the Holling type I and Holling type II model fits to the data are 1066 and 1046, respectively. We conclude the Holling type II model is significantly preferred [1].” With reference [1] again being:

Wagenmakers EJ, Farrell S. AIC model selection using Akaike weights. Psychonomic bulletin & review. 2004 Feb 1;11(1):192-6.

**AIC Example #3**

The R script example_AIC_comparison.R generates some simulated data according to the model A*exp(B*x)+C with Poisson distributed stochasticity. It then uses the graphical Monte Carlo method to fit a linear model (intercept+slope*x) and the model A*exp(B*x)+C to the simulated data using the Poisson negative log-likelihood as the goodness of fit statistic.
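I can’t reproduce the whole script here, but the goodness-of-fit statistic it relies on is short. The sketch below is my own illustration, not an excerpt from example_AIC_comparison.R: the function names, the random seed, and the “true” parameter values are all hypothetical. It also makes one choice that you may or may not want, which is to return Inf whenever a model predicts a non-positive mean (the linear model can do that), so that such parameter hypotheses are never selected as the best fit.

```r
# Poisson negative log-likelihood of observed counts y given model predictions mu
pois_negloglik = function(y, mu){
  # a Poisson mean must be positive; treat anything else as an impossible hypothesis
  if (any(!is.finite(mu)) || any(mu <= 0)) return(Inf)
  -sum(dpois(y, lambda=mu, log=TRUE))
}

linear_model = function(x, intercept, slope) intercept + slope*x
exp_model    = function(x, A, B, C) A*exp(B*x) + C

set.seed(42)
x = 0:150
# hypothetical true parameters, chosen only for illustration
y = rpois(length(x), exp_model(x, A=5, B=0.02, C=10))

# the true model should describe the data far better than a flat line
nll_true = pois_negloglik(y, exp_model(x, A=5, B=0.02, C=10))
nll_flat = pois_negloglik(y, linear_model(x, intercept=mean(y), slope=0))
cat("true model:", nll_true, "  flat line:", nll_flat, "\n")
```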

After 50,000 iterations of the Monte Carlo sampling procedure, the script produces the plot:

Based on the best-fit negative log-likelihoods and the number of parameters being fit for in each model (two for the linear model, and three for the exponential model), the script also calculates the AIC statistic, and the relative likelihoods of the models derived from those statistics.

The script outputs the following:

We see from the relative likelihoods that the exponential model (which was actually the true model underlying the simulated data) is significantly favoured over the linear model.

Something to try is instead of using integer values of x from 0 to 150 in the script, use integer values from 0 to 50. You will find with this smaller dataset that there is no statistically significant difference between the two models… there is simply not enough data to tell.


In this past module, we discussed the graphical Monte Carlo method for fitting model parameters to data. In this module, we described how to estimate the one standard deviation uncertainties in model parameters using the “fmin+1/2” method, where fmin is the minimum value of the negative log likelihood.

Also in that module, we discussed that the Least Squares statistic (LS) is related to the Normal distribution negative log likelihood via

where min(LS) is the minimum value of the Least Squares statistic that you obtained from your many sampling iterations of the graphical Monte Carlo method.

When doing Least Squares fits with the graphical Monte Carlo method, to facilitate choosing the correct ranges used to Uniformly randomly sample the parameters and to estimate the parameter uncertainties, you should plot the Normal negative log-likelihood versus the model parameter hypotheses, rather than the Least Squares statistic.

Similarly, the negative log likelihood of the Pearson chi-squared weighted least squares statistic is

and if using that statistic, you should use this negative log-likelihood in your fits in order to assess the parameter uncertainty.

**Choosing parameter ranges**

Once you have your model, your data, and a negative log-likelihood statistic appropriate to the probability distribution underlying the stochasticity in the data, you need to write a program in R or Matlab (or whichever programming language you prefer) to randomly sample parameter hypotheses, compare the data to the model prediction, and calculate the negative log likelihood statistic. The program needs to do this sampling procedure many, many times, storing the parameter hypotheses and negative log-likelihoods in vectors.

*For the initial range to sample the parameters, choose a fairly broad range that you are pretty sure, based on your knowledge of the biology and/or epidemiology associated with the model, includes the true value.* Do enough iterations that you can assess more or less where the minimum is (I usually do 1,000 to 10,000 iterations in initial exploratory runs if I’m fitting for one or two parameters… you will need more iterations if you are fitting for more than two parameters at once). Then plot the results without any kind of limit on the range of the y-axis. Check to see if there appears to be an obvious minimum somewhere within the range sampled (and that the best-fit value does not appear to be at one side or the other of the sampled range). Using a broad range can also give you an indication if there are local and global minima in your likelihoods.

If the best-fit value of one of the parameters appears to be right at the edge of the sampling range, increase the sampling range and redo this first pass of fitting. Keep on doing that until it appears that the best-fit value is well within the sampled range.

Once you have the results from these initial broad ranges, when plotting the results restrict the range on the y axis by only plotting points with likelihood within 1000 of the minimum value of the likelihood. From these plots record the range in the parameter values of the points that satisfy this condition.

Now do another run of 1000 to 10,000 iterations (more if fitting for more than two parameters), sampling parameters within those new ranges (again, re-adjusting the range and iterating the process if necessary if the best-fit value of a parameter is at the left or right edge of the new sampled range).

Then plot the results, this time only plotting points with likelihood within 100 of the minimum value of the likelihood. Record the new range of parameters to sample, and run again.

Then plot the results of that new run, plotting points with likelihood within 15 of the minimum value of the likelihood and determine the new range of parameters to sample.

Then do a run with many more Monte Carlo iterations such that the final plot of the likelihood versus the sampled parameters is well populated (aim for hundreds of thousands to several million iterations… this can be achieved using high performance computing resources). Having many iterations allows precise determination of the best-fit values and one standard deviation uncertainties on your parameters (and 95% confidence interval).
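The “record the range of parameter values” step in the procedure above can be done by eye from the plots, but it can also be scripted. Here is a minimal sketch. The helper name range_within is my own, and the likelihood surface is a made-up stand-in for the output of a real fit.

```r
# Range of sampled parameter values whose negative log-likelihood
# is within delta of the minimum found in this run
range_within = function(par, nll, delta){
  keep = nll <= min(nll) + delta
  range(par[keep])
}

# toy illustration: a hypothetical likelihood surface with its minimum near par=40
set.seed(1)
par = runif(5000, 0, 100)
nll = 0.5*((par - 40)/2)^2 + 1234

for (delta in c(1000, 100, 15)){
  r = range_within(par, nll, delta)
  cat("within", delta, "of the minimum: sample from",
      round(r[1], 1), "to", round(r[2], 1), "\n")
}
```

In practice I would still look at the plots before trusting these numbers, and pad the range a little on each side, because a sparse run can miss the true minimum.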

**Example: fitting to simulated deer wasting disease data**

Chronic wasting disease (CWD) is a prion disease (akin to mad cow disease) that affects deer and elk species, first identified in US deer populations in the late 1960s. The deer never recover from it, and eventually die. There has yet to be a documented case of CWD passing to humans who eat venison, but lab studies have shown that it can be passed to monkeys.

CWD is believed to be readily transmitted between deer through direct transmission, and through the environment. There is also a high rate of vertical transmission (transmission from mother to offspring). CWD has been rapidly spreading across North America.

In this exercise, we’ll assume that the dynamics of CWD spread are approximately described by an SI model, with births and deaths that occur with rate mu:

where N=S+I, and beta is the transmission rate.

Setting the time derivatives equal to zero and solving for S yields that the possible equilibrium values of S* are S*=N (the disease free equilibrium) and S*=N*mu/beta (the endemic equilibrium). Note that we can express beta as beta=N*mu/S* = mu/(1-f), where f=I*/N is the endemic equilibrium prevalence of the disease.
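As a quick numerical check of the equilibrium algebra, here is a sketch in R. It assumes the standard SI model with births and deaths at rate mu and all newborns susceptible, dS/dt = mu*N - beta*S*I/N - mu*S and dI/dt = beta*S*I/N - mu*I, which is consistent with the equilibria quoted above. The function names and the value of mu are hypothetical, chosen only for illustration.

```r
# ASSUMED model: SI dynamics with births and deaths at rate mu,
# and all newborns entering the susceptible class
si_derivs = function(S, I, beta, mu){
  N = S + I
  c(dS = mu*N - beta*S*I/N - mu*S,
    dI = beta*S*I/N - mu*I)
}

# transmission rate implied by the endemic equilibrium prevalence f=I*/N
beta_from_f = function(mu, f) mu/(1 - f)

mu = 0.2       # hypothetical birth/death rate per year
f  = 0.7       # hypothetical endemic prevalence
N  = 100000

beta   = beta_from_f(mu, f)
S_star = N*mu/beta      # endemic equilibrium value of S
I_star = N - S_star

# both derivatives should be zero (up to rounding) at the endemic equilibrium
print(si_derivs(S_star, I_star, beta, mu))
```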

Let’s assume that officials randomly sample 100 deer each year out of a total population of 100,000, and test them for CWD (assume that the test doesn’t kill the deer) to estimate the prevalence of the disease in the deer population. The file cwd_deer_simulated.csv contains simulated prevalence data from such sampling studies carried out over many years.

The R script fit_si_deer_cwd_first_pass.R does a first pass fitting to this data, for the value of f=I*/N and the time of introduction, t0. The time of introduction clearly can’t be after the data begin to be recorded, so the script samples it from -50 years to year zero. We also don’t know what f is so we randomly uniformly sample it from 0 to 1. The script does 1000 iterations, sampling the parameters, calculating the model, then calculating the Binomial negative log-likelihood at each iteration, and produces the following plot:
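For reference, the Binomial negative log-likelihood used for this kind of prevalence data takes only a couple of lines. This is a sketch rather than an excerpt from the script: the function name, the argument names, and the example counts are my own hypothetical choices.

```r
# k_pos:      number of deer testing positive in each year
# n_tested:   number of deer tested in each year
# prevalence: model prediction of I/N in each year
binom_negloglik = function(k_pos, n_tested, prevalence){
  # a prevalence outside [0,1] is not a valid hypothesis
  if (any(prevalence < 0) || any(prevalence > 1)) return(Inf)
  -sum(dbinom(k_pos, size=n_tested, prob=prevalence, log=TRUE))
}

# hypothetical example: three years of testing 100 deer each year
k_pos    = c(12, 30, 55)
n_tested = 100

nll_close = binom_negloglik(k_pos, n_tested, c(0.10, 0.30, 0.55))
nll_far   = binom_negloglik(k_pos, n_tested, c(0.50, 0.50, 0.50))
cat("prediction close to data:", nll_close, "  prediction far from data:", nll_far, "\n")
```

At each Monte Carlo iteration, the model prediction for the prevalence would be recalculated from the sampled values of f and t0 and passed in as the third argument.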

f=I*/N has a clear minimum, but the time of introduction is perhaps not so obvious. Let’s try zooming in, only plotting points with likelihood within 1000 of the minimum value:

Here we can clearly see that the best-fit value of t0 is somewhere between -50 and zero (recall that it can’t be greater than zero because that is where the data begin). The best-fit value of I*/N appears to be around 0.70, but the parabola enveloping the likelihood in the vicinity of that minimum appears to be highly asymmetric. When choosing our new sampling ranges, we want to err on the side of caution, and sample past 0.73-ish to make sure we really have the minimum value of the likelihood within the new range.

In the file fit_si_deer_cwd_second_pass.R we do another 10,000 iterations of the Monte Carlo procedure, but this time, based on the above plot, sampling t0 from -50 to zero, and I*/N from 0.5 to 0.75. The script produces the following plot where it zooms in on points for which the likelihood is within 100 of the minimum value of the likelihood:

Based on these plots, it looks like in our next run, we should sample the time of introduction from -20 to zero, and sample I*/N from around 0.62 to 0.75. Going up to 0.75 may again be overly cautious in over-shooting the range to the right, but we’ll do another run, then adjust it again if we need to. The file fit_si_deer_cwd_third_pass.R does 10,000 more iterations of the Monte Carlo sampling, using these ranges, and produces the following plot only showing points that have likelihood within 15 of the minimum value:

Now, in a final run, we can sample t0 from -8 to zero, and I*/N from 0.67 to 0.72. The R script fit_si_deer_cwd_fourth_pass.R does this, and produces the following plot:

From this output we can estimate the best-fit, one standard deviation uncertainty (range of parameter values within fmin+1/2), and 95% confidence intervals (range of points within fmin+0.5*1.96^2).
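Those estimates can be pulled directly out of the vectors of sampled parameters and negative log-likelihoods. Here is a minimal sketch, using a made-up Normal likelihood with true value 0 and standard deviation 1 so that the answers are known in advance. The function name fmin_intervals is my own.

```r
# par: sampled parameter hypotheses;  nll: corresponding negative log-likelihoods
fmin_intervals = function(par, nll){
  fmin = min(nll)
  list(best = par[which.min(nll)],
       sd1  = range(par[nll <= fmin + 0.5]),           # one standard deviation
       ci95 = range(par[nll <= fmin + 0.5*1.96^2]))    # 95% confidence interval
}

# hypothetical likelihood surface: Normal with mean 0 and standard deviation 1
set.seed(2)
par = runif(200000, -5, 5)
nll = 0.5*par^2

res = fmin_intervals(par, nll)
cat("best fit:", signif(res$best, 2), "\n")
cat("one std dev range:", round(res$sd1, 2), "\n")
cat("95% CI:", round(res$ci95, 2), "\n")
```

The intervals should come out close to -1 to 1 and -1.96 to 1.96, respectively. How close they get depends on how densely the region near the minimum is populated, which is why the final run needs so many iterations.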

**Reducing the size of file output**

Once you have narrowed in on the best parameter ranges with the above iterative method, there are still many parameter hypothesis combinations that fall within those ranges that yield large likelihoods nowhere near the minimum value (especially if you are fitting for several parameters at once). When running R scripts in batch to get well-populated likelihood-vs-parameter plots to ensure precise estimates of the best-fit parameters and uncertainties, this can lead to really large file output if you output every single set of sampled parameter hypotheses and likelihood.

To control the size of file output, what I’ll often do is only store parameter hypotheses and likelihoods for which the likelihood was within, say, 100 of the minimum likelihood found so far by that script.
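Here is a sketch of what that looks like inside the sampling loop. The one-parameter “model” and its likelihood are hypothetical stand-ins, and the variable names are my own. Note that the running minimum keeps dropping as the script runs, so some points stored early on end up being more than 100 above the final minimum; the last few lines clean those up.

```r
set.seed(5)
niter = 20000
fmin_so_far = Inf
kept_par = c()
kept_nll = c()

for (i in 1:niter){
  par = runif(1, -50, 50)
  nll = 0.5*par^2            # hypothetical stand-in for a real likelihood calculation
  fmin_so_far = min(fmin_so_far, nll)
  # only store points within 100 of the best likelihood found so far
  if (nll <= fmin_so_far + 100){
    kept_par = c(kept_par, par)
    kept_nll = c(kept_nll, nll)
  }
}

# final clean-up against the final minimum
final = kept_nll <= min(kept_nll) + 100
kept_par = kept_par[final]
kept_nll = kept_nll[final]
cat("stored", length(kept_par), "of", niter, "sampled points\n")
```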


Once you have chosen an appropriate goodness-of-fit statistic comparing your model to data, you need to find the model parameters that optimise (minimise) the goodness-of-fit statistic. The graphical Monte Carlo Uniform random sampling method is a computationally intensive “brute force” parameter optimisation method that has the advantage that it doesn’t require any information about the gradient of the goodness-of-fit statistic, and is also easily parallelizable to make use of high performance computing resources.

In this module, we discussed a related method, graphical optimisation with Latin hypercube sampling. With Latin hypercube sampling, you assume that the true parameters must lie somewhere within a hypercube you define in the k-dimensional parameter space. You grid up the hypercube with M grid points in each dimension. In the center of each of the M^k points, you calculate the goodness-of-fit statistic. With this method you can readily determine the approximate location of the minimum of the GoF. Because nested loops are involved, with use of computational tools like OpenMP, the calculations can be readily parallelized to some degree. However, if the sampling region of the hypercube is too large, you need to make M large to get a sufficiently granular estimate of the dependence of the GoF statistic on the parameters. Several iterations of the procedure are usually necessary to narrow down the appropriate sampling region.

If you make the sampling range for each of the parameters large enough that you are sure the true parameters must lie within those ranges, with this method you can be reasonably sure that you are finding the global minimum of the GoF, rather than just a local minimum. This is important, because in real epidemic or population data there are very often local minima in GoF statistics.

Because it is difficult to parallelise Latin hypercube sampling, and there is discrete granularity in the sampling, a method which is preferable is randomly Uniformly sampling the parameters over a range. The method is easy to implement, and easy to parallelise. With this method, parameter hypotheses are randomly sampled from Uniform distributions in ranges that would be expected to include the best-fit values. The goodness-of-fit statistic is calculated for a particular set of hypotheses and stored, along with the hypotheses, in vectors. The procedure is repeated many, many times, and then the goodness-of-fit statistic is plotted versus the parameter hypotheses to determine which values appear to optimise the GoF statistic.
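To show the bare bones of the procedure, here is a sketch that fits a deliberately simple toy model (exponential growth with Poisson stochasticity) rather than an epidemic model. The model, the “true” parameter values, the sampling ranges, and the variable names are all hypothetical choices of mine.

```r
# simulate some hypothetical count data from y ~ Poisson(A*exp(B*x))
set.seed(7)
x = 0:50
y = rpois(length(x), 10*exp(0.05*x))   # hypothetical true values: A=10, B=0.05

niter = 5000
vA   = numeric(niter)
vB   = numeric(niter)
vnll = numeric(niter)

for (i in 1:niter){
  # randomly Uniformly sample the parameter hypotheses
  A = runif(1, 5, 15)
  B = runif(1, 0.03, 0.07)
  # calculate the model prediction and the Poisson negative log-likelihood
  mu = A*exp(B*x)
  vA[i]   = A
  vB[i]   = B
  vnll[i] = -sum(dpois(y, lambda=mu, log=TRUE))
}

ibest = which.min(vnll)
cat("best-fit A =", round(vA[ibest], 2), " best-fit B =", round(vB[ibest], 4), "\n")

# the "graphical" part: plot the GoF statistic versus each parameter hypothesis
if (interactive()){
  par(mfrow=c(1, 2))
  plot(vA, vnll, xlab="A hypothesis", ylab="negative log-likelihood")
  plot(vB, vnll, xlab="B hypothesis", ylab="negative log-likelihood")
}
```

Each iteration is independent of all the others, which is why the method is so easy to parallelise: you can run the same script on many CPUs with different random seeds and simply concatenate the output.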

**Example: fitting to simulated pandemic influenza data**

We’ll use as example “data” a simulated pandemic flu-like illness spreading in a population, with true model:

with seasonal transmission rate

The R script sir_harm_sde.R uses the above model with parameters 1/gamma=3 days, beta_0=0.5, epsilon=0.3, and phi=0. One infected person is assumed to be introduced to an entirely susceptible population of 10 million on day 60, and only 1/5000 cases is assumed to be counted (this is actually the approximate number of flu cases that are typically detected by surveillance networks in the US). The script uses Stochastic Differential Equations to simulate the outbreak with population stochasticity, then additionally simulates the fact that the surveillance system on average only catches 1/5000 cases. The script outputs the simulated data to the file sir_harmonic_sde_simulated_outbreak.csv

The R script sir_harm_uniform_sampling.R uses the parameter sweep method to find the best-fit model to the simulated pandemic data in sir_harmonic_sde_simulated_outbreak.csv. Recall that our simulated data and the true model looked like this:

The sir_harm_uniform_sampling.R script randomly Uniformly samples the parameters R0, t0, and epsilon over many iterations and calculates three different goodness-of-fit statistics (Least squares, Pearson chi-squared, and Poisson negative log-likelihood). Question: this is count data… which of those GoF statistics is probably the most appropriate to use?

Here is the result of running the script for 10,000 iterations, sweeping R0, the time of introduction t0, and epsilon parameters of our harmonic SIR model over certain ranges (how did I know which ranges to use? I ran the script a couple of times and fine-tuned the parameter ranges to ensure that the ranges actually contained the values that minimized the GoF statistics, and that the ranges weren’t so broad that I would have to run the script for ages to determine a fair approximation of the best-fit values). Question: based on what you see in the plots below do you think 10,000 iterations is enough? Also, do all three GoF statistics give the same estimates for the best-fit values of the three parameters? Which fit would you trust the most for this kind of data?

This method is known as the “graphical Monte Carlo method”. “Graphical” because you plot the goodness-of-fit statistic versus the parameter hypotheses to determine where the minimum is, and “Monte Carlo” because parameter hypotheses are randomly sampled.

The uniform random sampling method is easily parallelizable to make use of high performance computing resources, like the ASU Agave cluster, or NSF XSEDE resources.

Here is a summary of the results of running the sir_harm_uniform_sampling.R script in parallel using the ASU Agave high performance computing resources (note that the density of points in the plot below is what you are aiming to achieve when using the graphical Monte Carlo procedure… the plots above are too sparse to reliably estimate the best-fit parameters and their uncertainties!). Note that the Pearson chi-squared, least squares, and Poisson negative log likelihood statistics all predict somewhat different best-fit parameters. We were fitting to count data… which of these goodness of fit statistics is likely the most appropriate?

In addition to being easily parallelisable, the graphical Monte Carlo method with Uniform random sampling also has the advantage that Bayesian priors on the model parameters are trivially applied post-hoc to the results of the parameter sampling.
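One common way to do this (a sketch, under the assumption of a Normal prior on a single parameter) is to add the negative log of the prior probability density to the negative log-likelihood of each sampled point, after the sampling has already been done. The function name, the likelihood surface, and the prior values below are hypothetical.

```r
# add the negative log of a Normal prior to already-computed negative log-likelihoods
apply_normal_prior = function(par, nll, prior_mean, prior_sd){
  nll - dnorm(par, mean=prior_mean, sd=prior_sd, log=TRUE)
}

# hypothetical results of a graphical Monte Carlo run:
# likelihood centered at 0.30 with standard deviation 0.10
set.seed(3)
par = runif(10000, 0, 1)
nll = 0.5*((par - 0.30)/0.10)^2

# hypothetical prior from an earlier study: 0.25 +/- 0.05
nll_post = apply_normal_prior(par, nll, prior_mean=0.25, prior_sd=0.05)

best_before = par[which.min(nll)]
best_after  = par[which.min(nll_post)]
cat("best fit without prior:", round(best_before, 3),
    "  with prior:", round(best_after, 3), "\n")
```

No re-sampling or re-fitting is needed; you can try out different priors on the same stored output. The best-fit value is pulled toward the prior mean, by an amount that depends on how tight the prior is relative to the likelihood.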


Running an R script in parallel on many CPUs simultaneously with the aid of high performance computing resources allows you to quickly answer research questions that would take hours, days or even weeks to achieve with just your laptop or desktop alone. Batch processing is a submission system that allows you to submit a bunch of “jobs” (program runs) to many CPUs at a time.

All students in AML610 or AML612 can obtain an ASU Research Computing account for free by going to this link, as can any student who is working with faculty at ASU on research. Fill in the name of the professor teaching your course (or who you are working with), and their email. Under “Cluster” choice, choose the “Agave public cluster”.

For security reasons, ASU requires users to access the Agave cluster using the Cisco Virtual Private Network (VPN) client. From your “My ASU” page, go to the “My Apps” tab, and search for “Cisco”. Download the Cisco AnyConnect VPN app for your operating system.

Once you have downloaded the app, open it, and connect to sslvpn.asu.edu. Use your Asurite ID and password.

Once you are logged in to sslvpn.asu.edu through the Cisco AnyConnect client, go to this link: https://agaverstudio.rc.asu.edu and login with your Asurite ID and password.

This will start R Studio Agave in your browser window, which allows access to the Agave cluster. This is (more or less) what you will see:

Your directories in the right hand panel will be different than mine. I have several directories on the Agave cluster related to past work. It is good form to create a new directory for each project. Thus, let’s do an example of parallel computing in batch that we will put in the agave_R_test directory. Click on “New Folder” in the “Files” tab on the right hand panel, and create a directory “agave_R_test”.

You can see that a new folder has been created in the list of folders in the right hand panel:

Now, in the left hand panel notice there is a tab to access the R console, and a tab to access the “Terminal”. The Terminal tab connects you directly to the Agave cluster with the Unix operating system. **Don’t use the Terminal tab. Only use the R console tab.**

In the R Studio Agave console in the left hand panel, type

setwd("~/R")
file.copy("/sample/slurm/slurmscript.tmpl",".")
file.copy("/sample/slurm/batchtools.conf.R",".")
setwd("..")

**Running your first job in batch in R Studio on the Agave cluster**

In the R console in the left hand panel, type

setwd("agave_R_test")

Then, in the R Studio Agave console, type:

library(batchtools)
reg = makeRegistry(file.dir = "./registry", seed = 1)

This will result in a sub-directory called “registry” appearing off of the agave_R_test folder. The “seed” option in the makeRegistry() method sets the random seed for each job you run to the job.id plus the value of the seed that you passed (in this case, 1, but I could have put whatever integer I liked for that value).

To see what is in the agave_R_test folder, in the right-hand Files panel, click on the agave_R_test folder. You will see that there is a registry sub-folder in it. The registry sub-folder is where the results output by the scripts in our batch jobs will be.

Download the files R_agave_simple_example.R and agave_mydata.csv to a working directory on your laptop or desktop. Also download the files batchtools.conf.R and slurmscript.tmpl

The R_agave_simple_example.R file contains a function called myfun

myfun = function(n){
  aseed = 13413
  set.seed(aseed+n)
  a = read.table("~/agave_R_test/agave_mydata.csv",sep=",",as.is=T,header=T)
  y = rnorm(1000)
  return(list(seed=aseed+n,mean_ax=mean(a$x),mean_ay=mean(a$y),mean_y=mean(y),sd_y=sd(y)))
}

All this function does is set the random seed using the value of n passed to the function, read in the data frame, randomly sample 1000 Normally distributed random numbers, and return a list containing the random seed, the means of the x and y columns of a, and the mean and standard deviation of the randomly sampled numbers. It’s kind of a pointless script, but it does show how to set the random seed, read in a file, sample random numbers, and output some results.
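
Before uploading a function like this to the cluster, it can help to check locally that it runs for a few values of n. The following is a minimal sketch, assuming a made-up stand-in data file written to a temporary directory; the function name myfun_local and its fname argument are hypothetical, added here so the example does not depend on the ~/agave_R_test path:

```r
# Hypothetical local test of a myfun-style function.
# The data file here is a made-up stand-in for agave_mydata.csv.
datafile = file.path(tempdir(), "agave_mydata.csv")
write.csv(data.frame(x = 1:10, y = (1:10)^2), datafile, row.names = FALSE)

myfun_local = function(n, fname) {
  aseed = 13413
  set.seed(aseed + n)  # a different, reproducible seed for each n
  a = read.table(fname, sep = ",", as.is = TRUE, header = TRUE)
  y = rnorm(1000)
  return(list(seed = aseed + n,
              mean_ax = mean(a$x), mean_ay = mean(a$y),
              mean_y = mean(y), sd_y = sd(y)))
}

# Run the function for n = 1, 2, 3, as the batch system would for three jobs
results = lapply(1:3, myfun_local, fname = datafile)
print(results[[1]])
```

Each element of results corresponds to what one batch job would return for its value of n.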

Go back to the browser window where you are running R Studio Agave through https://agaverstudio.rc.asu.edu. In your agave_R_test folder in the right hand panel, upload the agave_mydata.csv file, and the R_agave_simple_example.R, batchtools.conf.R and slurmscript.tmpl files to the directory.

Now, in the R console in the left panel of the R Studio Agave window, type

source("R_agave_simple_example.R")

You will see something like this:

Now, go to the registry->results folder in the right hand Files panel, and you will see many files that look like this:

These files (there are 100 of them) are the output from the myfun() method in R_agave_simple_example.R. The files are in “rds” format, which stores a single R object per file. The files can be read into R using the readRDS() method.
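
As a rough illustration of how such files can be read back and combined (a sketch only; the directory and file names below are made up for the example, and this is not necessarily how the merge_results.R script works):

```r
# Write three small example result files in rds format to a temporary folder
resdir = file.path(tempdir(), "example_results")
dir.create(resdir, showWarnings = FALSE)
for (i in 1:3) {
  saveRDS(list(seed = 13413 + i, mean_y = i / 10),
          file.path(resdir, paste0(i, ".rds")))
}

# Read each file with readRDS() and stack the results into one data frame
files = list.files(resdir, pattern = "\\.rds$", full.names = TRUE)
rows = lapply(files, function(f) as.data.frame(readRDS(f)))
merged = do.call(rbind, rows)
print(merged)
```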

I’ve created a little script merge_results.R that can be used to concatenate these files together into one data frame. Download merge_results.R to a working directory on your laptop or desktop, then in your browser R Studio Agave window, upload the script to Files in the agave_R_test folder (*not* to the results folder!).

Now, in your R console in the left hand panel of the R Studio Agave window, type

source("merge_results.R")

The concatenated results are in a data frame called merged_data that you can then analyse in your R console. This data frame is in the file merged_results.csv which will be produced in your working directory on R Studio Agave. You can download it to your laptop or desktop from the Files console by ticking the box next to the file, then clicking on More->Export.

The merge_results.R script is a general script that can be run on the output of any R script you run with R Studio Agave. Just ensure that you always run it from the same working directory you ran your batch script in.

**More complicated example: fitting an SIR model with non-Exponentially distributed infectious state sojourn times to influenza outbreak data**

In this example, we’ll fit an SIR model with Erlang distributed sojourn times in the infectious state to the 2007-2008 Midwest influenza outbreak data. The standard SIR model assumes Exponentially distributed sojourn times in the infectious state, meaning that the most probable time to recover is immediately after you are infected, which is clearly highly unrealistic. The Erlang distribution is a special case of the Gamma distribution with integer “shape” parameter, k. The larger the value of k, the more pointy and sharp the distribution is. We can include Erlang distributed sojourn times in the infectious state by dividing up the infectious compartment into k stages, each with rate leaving the compartment k*gamma. The value of 1/gamma is the overall average time an individual spends in the infectious state. This is known as the “linear chain trick”.

In R Studio Agave Files window (at bottom right), create a folder called midwest_influenza off of your Home directory.

Change to this folder, and upload batchtools.conf.R and slurmscript.tmpl files (ie: first download them from here to your laptop, then in R Studio Agave upload the files from your laptop to the midwest_influenza working directory on R Studio Agave).

Also upload the file midwest_influenza_2007_to_2008.dat, and the files midwest_erlang_fit.R and midwest_erlang_fit_submission.R to your midwest_influenza working directory on R Studio Agave.

The midwest_erlang_fit.R script contains a function called my_fitting_func() that does many Monte Carlo iterations, sampling parameters for an SIR model with non-exponentially distributed infectious period, along with an over-dispersion parameter alpha that is used to calculate the Negative Binomial negative log-likelihood. The my_fitting_func() function outputs a data.frame with the results of many Monte Carlo iterations. Many of the Monte Carlo iterations result in negative log likelihoods nowhere near the minimum value; thus, to help control the size of the output file, I make a selection at the end of the function to keep only the Monte Carlo iterations that yielded values of the negative log likelihood in the ballpark of the minimum value. If you don’t do this, and run on many CPUs at once, your concatenated output files will end up being huge and unmanageable.
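
The overall structure of such a function can be sketched with a toy example. This is not the code in midwest_erlang_fit.R: the model here is a simple exponential growth curve fit to simulated counts, and the function name toy_fitting_func, the parameter ranges, and the cut of 10 on the negative log-likelihood are all assumptions chosen for illustration.

```r
# Simulated count data (a stand-in for real incidence data)
set.seed(1)
tdat = 0:20
ydat = rnbinom(length(tdat), mu = 5 * exp(0.15 * tdat), size = 1 / 0.1)

toy_fitting_func = function(n, tdat, ydat, niter = 2000) {
  set.seed(5000 + n)  # a different random seed for each job
  # Sample parameter hypotheses uniformly over (assumed) ranges
  vrate  = runif(niter, 0.05, 0.25)
  vy0    = runif(niter, 1, 10)
  valpha = runif(niter, 0.01, 0.5)  # over-dispersion parameter
  vneglog_like = numeric(niter)
  for (i in 1:niter) {
    ypred = vy0[i] * exp(vrate[i] * tdat)
    # Negative Binomial with mean ypred and variance ypred + alpha*ypred^2
    vneglog_like[i] = -sum(dnbinom(ydat, mu = ypred,
                                   size = 1 / valpha[i], log = TRUE))
  }
  vdat = data.frame(neglog_like = vneglog_like, rate = vrate,
                    y0 = vy0, alpha = valpha)
  # Keep only iterations in the ballpark of the minimum to limit output size
  vdat = vdat[vdat$neglog_like <= min(vdat$neglog_like) + 10, ]
  return(vdat)
}

out = toy_fitting_func(1, tdat, ydat)
print(out[which.min(out$neglog_like), ])
```

How wide to make the cut around the minimum is a judgment call: it needs to be comfortably wider than the range used later for confidence intervals, but narrow enough that the merged output stays manageable.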

The midwest_erlang_fit_submission.R script contains the R instructions to submit the my_fitting_func() function to 100 different CPUs on the Agave cluster. In the R Studio Agave console window, type

source("midwest_erlang_fit_submission.R")

The system will then print out a dialogue telling you that it is submitting the jobs.

Typing

getStatus()

in the R Studio Agave console window will give you the status of your jobs. If you’ve done all of the above correctly, you will see your jobs running 50 at a time.

Once getStatus() indicates that all the jobs have completed, you are ready to concatenate the output into a single file. The R file merge_results.R does this for you. Upload this file to your R Studio Agave working directory (i.e. first download it from here to your laptop, and then upload it from your laptop to R Studio Agave in the R Studio Agave Files window in the lower right).

In the R Studio Agave console window, type

source("merge_results.R")

This will produce the file merged_results.csv in your R Studio Agave working directory. Check the file merged_results.csv in the R Studio Agave Files window at bottom right. Then go to More->Export, and download the file to your laptop.

You can now read this file into R running on your laptop, and create plots of the negative log-likelihood versus your parameter hypotheses, and determine the best-fit parameters and their one-standard deviation and 95% confidence intervals using fmin+1/2 and fmin+0.5*1.96^2, respectively. For example, in the R file midwest_erlang_fit.R I have defined a function called plot_results() that does just this.
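
As a sketch of how those thresholds can be applied to sampled results (using made-up data with a negative log-likelihood that is parabolic in a single hypothetical parameter p, rather than the Agave output): all parameter hypotheses with negative log-likelihood below fmin+1/2 lie in the one standard deviation interval, and all those below fmin+0.5*1.96^2 lie in the 95% confidence interval.

```r
set.seed(42)
# Made-up example: negative log-likelihood that is parabolic in p,
# with true best-fit 2.0 and one-standard-deviation uncertainty 0.3
p = runif(50000, 0, 4)
neglog_like = 100 + 0.5 * ((p - 2.0) / 0.3)^2

fmin = min(neglog_like)
pbest = p[which.min(neglog_like)]
range_1sd = range(p[neglog_like <= fmin + 0.5])
range_95  = range(p[neglog_like <= fmin + 0.5 * 1.96^2])

cat("best-fit:", pbest, "\n")
cat("one std dev interval:", range_1sd, "\n")
cat("95% CI:", range_95, "\n")
```

With several fitted parameters, the same selection is applied to the full set of sampled hypotheses, and the interval for each parameter is read off from the range of that parameter among the selected rows.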

source("midwest_erlang_fit.R")
r = read.table("midwest_results.csv",header=T,as.is=T,sep=",")
plot_results(r)

Which produces the following plot:

The script also outputs:

The fit seems to favour a high value of k, which implies that the probability distribution for the sojourn in the infectious state is very narrow, and centered on 1/gamma. I haven’t tried values of k>150, but my guess is that if I did, we still wouldn’t find a clear minimum. With this kind of fit, based on just the epidemic curve, the best we can say is that k is greater than 2 with 95% confidence.

This fit is not only a good example of when we can only put a lower (or upper) bound on a parameter, but the plots show evidence of local minima in the goodness-of-fit statistic. Those local minima are why the graphical Monte Carlo method is preferred over “black box” fitting methods. Black box methods, depending on your initial guess at the parameters, could easily walk you to one of the local minima, rather than the global minimum in the goodness-of-fit statistic.

In a later module, we’ll talk about how we can use other information to help constrain the parameters of our fits. For example, in this paper by Carrat et al (2008) they reviewed influenza volunteer challenge studies, where people were voluntarily infected with influenza, then blood titres were done regularly to examine how much of the virus they were shedding. This plot is in the paper, and it is pretty clear from the plot that the width of the probability distribution for recovery spans several days (favouring lower values of k over higher values):

**Step-by-step notes for running jobs in batch with R Studio Agave**

- Make sure you are logged into sslvpn.asu.edu with the Cisco AnyConnect client in order to connect to R Studio Agave
- Sign in to R Studio Agave at https://agaverstudio.rc.asu.edu/auth-sign-in
- In the lower right Files pane, create a new working directory under your Home directory
- In the lower right Files pane, change to that directory and upload batchtools.conf.R and slurmscript.tmpl files to the directory (ie: first download them from here to your laptop, then in R Studio Agave upload from your laptop to the working directory on R Studio Agave).
- In your script that you wish to run on Agave, define a function (say, my_fitting_func) and pass to it a parameter n. In the function, make sure you have a line set.seed(XXXX+n), where you replace XXXX with some integer of your choice. This line is important: it controls the random seed used by each run of the script.
- Make sure all the functions used by my_fitting_func are defined within that function, and that all R libraries it needs are loaded within the function. For example, here is what the first few lines of midwest_erlang_fit.R look like:
my_fitting_func=function(n){
  set.seed(82914+n)
  require("chron")
  require("deSolve")

- Make sure that your my_fitting_func() function doesn’t do any plotting, or printing things out using cat(). You can plot things later using the output from R Studio Agave.
- Make sure your my_fitting_func function outputs a data frame with the results of your graphical Monte Carlo iterations. When using a negative log likelihood, it helps to control the size of the output file to only output the results with neg log likelihood in the ball park of the minimum value. For example midwest_erlang_fit.R has the last few lines:
vdat = data.frame(neglog_like=vneglog_likelihood,R0=vR0,t0=vt0,k=vk,alpha=valpha)
i = which(vdat$neglog_like<=(min(vdat$neglog_like)+1000))
vdat = vdat[i,]
return(vdat)
} # end definition of my_fitting_func

- Try out your fitting script on your local laptop to ensure it appears to be working correctly, and that the ranges you are sampling the parameters over appear to be narrow enough. Re-iterate this process, adjusting the parameter ranges, as many times as needed to ensure that they aren’t too broad or too narrow, and that the best-fit values appear to be well centered in the ranges.
- **Make sure you’ve uploaded your data file to your working directory on R Studio Agave!**
- In an R script that you want to submit to batch, source the file that has the my_fitting_func function, and put the lines (see, for example, midwest_erlang_fit_submission.R):
source("midwest_erlang_fit.R")
require("batchtools")
ncpu = 100
unlink("./registry",recursive=T)
reg = makeRegistry(file.dir = "./registry", seed = 1)
batchMap(fun = my_fitting_func, n = seq(1,ncpu))
submitJobs(resources = list(walltime = 3600, memory = 1024, ncpus = 1))

- Now, in the R Studio Agave window, use setwd() to set the working directory to the folder where you have your data and R scripts, and now source your R batch submission script
- Typing getStatus() in the R Studio Agave console window will give you the status of your jobs. Note that if all the jobs you submitted are listed under “Error:”, you’ve got a problem. Common causes of this are: not having batchtools.conf.R and/or slurmscript.tmpl in your working folder, not having the data file in your working folder, or the script needing an R library that you did not require within the my_fitting_func() function. If you get errors, in the lower-right Files window, go to your Home directory. There you will see a whole bunch of files that look like slurm.XXXX.err. Open the most recent (down at the bottom), and it will tell you why the error occurred. Fix the error in your R script(s) and try sourcing your submission script again from the R Studio Agave console window.
- Once getStatus() shows that all of your jobs have finished running, you’re ready to merge the results. Upload the R script merge_results.R to your R Studio Agave working directory (i.e. download it from here to your laptop first, then upload it from your laptop to R Studio Agave). Now source merge_results.R in the R Studio Agave console window. This will produce the file merged_results.csv in your R Studio Agave working directory.
- Check the file merged_results.csv in the R Studio Agave Files window at bottom right. Then go to More->Export, and download the file to your laptop.
- You can now read this file in, and create plots of the negative log-likelihood versus your parameter hypotheses, and determine the best-fit parameters and their one-standard deviation and 95% confidence intervals using fmin+1/2 and fmin+0.5*1.96^2, respectively. For example, in the R file midwest_erlang_fit.R I have defined a function called plot_results() that does just this.

A simple example of a compartmental model of infectious disease spread is the Susceptible, Infectious, Recovered model. In several past modules we have discussed this model in detail, but briefly, individuals in the susceptible compartment can be infected on contact with infectious people in the population (whereupon they flow to the “infectious” compartment). Infectious people recover with some rate, gamma, and flow into the “recovered and immune” compartment.

The compartmental diagram for the model looks like this:

and the system of ordinary differential equations describing these dynamics is:
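
In standard notation (assumed here: S, I and R are the numbers of susceptible, infectious and recovered individuals, N=S+I+R is the total population, beta is the transmission rate, and gamma is the recovery rate), these equations can be written as:

```latex
\begin{align}
\frac{dS}{dt} &= -\beta \frac{S I}{N} \\
\frac{dI}{dt} &= \beta \frac{S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{align}
```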

Inherent in these model equations is the assumption that the sojourn time in the infectious compartment is Exponentially distributed, with rate gamma. You can see this by following a group of I_0 individuals who are infectious at time t=0 and ignoring any new infections flowing into the compartment, in which case the equation for dI/dt is dI/dt=-gamma*I. The solution to this equation, for which I goes to zero as t goes to infinity, is I(t) = I_0 exp(-gamma*t).

The probability distribution for the sojourn time in the infectious state for this model thus looks like this:

Notice that the most probable time for leaving the infectious state is time t=0. This implies that the most probable time that you will recover from a disease like influenza or measles is immediately after being infected…. **this is clearly highly unrealistic**! For all diseases, realistically, the probability distribution for the sojourn time in the infectious state looks more like a bump, like this:

With this distribution, at time t=0, the probability of leaving the infectious state is zero (as it is in reality). A few people recover early on after being infected, but most of those infected recover near the middle of the bump. A few take much longer to recover, and are in the tails of the distribution.

So, how can we incorporate realistic sojourn times like this in compartmental models? And why would we even want to? (hint: think about control strategies, like treatment or isolation, that might be aimed at people at various times after they are first infected… there is often a delay between time of infection and the application of an intervention strategy).

**Gamma distributed sojourn times**

It turns out the Gamma distribution offers an easy way to incorporate realistic sojourn times in a model. The Gamma distribution has two parameters, a shape parameter, k, and a scale parameter theta. The mean of the distribution is mu=k*theta. The probability density function for the Gamma distribution is:
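
In the shape and scale parameterisation described above, the standard form of the Gamma probability density for x>0 is (for integer k, the normalising factor Gamma(k) is simply (k-1)!):

```latex
\begin{equation}
f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^{k}}, \qquad x > 0
\end{equation}
```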

When k is an integer, the distribution is called the Erlang distribution, and for the special case when k=1, the distribution is the Exponential probability distribution. It turns out that an Erlang distributed random number with shape k and scale parameter theta is the sum of k independent Exponentially distributed random numbers, each with mean theta (i.e. rate 1/theta). Here is an example of how the parameter k affects the shape of the Erlang distribution when scale parameter theta=1/k (and thus the mean of the distribution is one):

Notice that the higher the value of k, the more narrow and peaked the distribution is.
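
These properties can be checked numerically. The following sketch (the sample size and the choice k=5 are arbitrary) sums k Exponentially distributed random numbers, each with mean theta, and compares the result to the Gamma distribution with shape k and scale theta:

```r
set.seed(123)
k = 5
theta = 1 / k          # scale chosen so that the mean k*theta is one
nsamp = 100000

# Each row holds k Exponential random numbers with mean theta (rate 1/theta)
x = matrix(rexp(nsamp * k, rate = 1 / theta), ncol = k)
erlang_samples = rowSums(x)

cat("sample mean:    ", mean(erlang_samples), " expected:", k * theta, "\n")
cat("sample variance:", var(erlang_samples), " expected:", k * theta^2, "\n")

# Compare the sample quantiles to those of the Gamma distribution
probs = c(0.1, 0.5, 0.9)
print(rbind(simulated = quantile(erlang_samples, probs),
            gamma     = qgamma(probs, shape = k, scale = theta)))
```

The variance k*theta^2 equals 1/k when theta=1/k, which is why the distribution becomes narrower as k increases while the mean stays at one.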

**Linear chain trick**

Because an Erlang distributed random number with shape k and scale theta is the sum of k Exponentially distributed random numbers, each with mean theta, there is a method called the “linear chain trick” that adds k disease stages to a compartmental model, each of which flows into the next with rate k*gamma (except for the last, which flows to the recovered class), where 1/gamma is the average infectious period for the disease. Each stage thus has mean duration theta=1/(k*gamma), and the k stages together have mean duration k*theta=1/gamma.

For the SIR model, if we assume the rate is gamma, we get
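
Written out, with the infectious compartment split into stages I_1, …, I_k (notation assumed here, following the description of the linear chain trick, with N the total population and beta the transmission rate), the linear-chain SIR model is:

```latex
\begin{align}
\frac{dS}{dt} &= -\beta \frac{S}{N} \sum_{j=1}^{k} I_j \\
\frac{dI_1}{dt} &= \beta \frac{S}{N} \sum_{j=1}^{k} I_j - k\gamma I_1 \\
\frac{dI_j}{dt} &= k\gamma I_{j-1} - k\gamma I_j, \qquad j = 2, \ldots, k \\
\frac{dR}{dt} &= k\gamma I_k
\end{align}
```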

The R script sir_erlang_sojourn.R shows an example of how to code up a linear-chain model in R. It requires that the sfsmisc and deSolve libraries have been installed on your computer. If they have not, type in your R console

install.packages(c("sfsmisc","deSolve"))

and choose an R repository mirror site close to your location for the download. Then type in your R console

source("sir_erlang_sojourn.R")

This produces the following plot:

The higher the value of k, the narrower the peak of the outbreak. The script also prints out the final size for the various values of k: the final size of the outbreak is independent of k.

In the absence of births and deaths (vital dynamics) in the model, the reproduction number for this model is exactly the same as a model that just assumes k=1. That is to say, R0=beta/gamma.
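
To illustrate, here is a self-contained sketch of a linear-chain SIR model. It is not the sir_erlang_sojourn.R script: to avoid any package dependencies it uses a simple fixed-step fourth-order Runge-Kutta integrator instead of deSolve, and the function names and parameter values (R0=1.5, 1/gamma=3 days, N=10000) are assumptions chosen for illustration.

```r
# Derivatives for an SIR model with k infectious stages.
# State vector x = (S, I_1, ..., I_k, R)
sir_erlang_derivs = function(x, beta, gamma, k) {
  N = sum(x)
  S = x[1]
  I = x[2:(k + 1)]
  new_inf = beta * S * sum(I) / N
  outflow = k * gamma * I            # flow out of each infectious stage
  inflow = c(new_inf, outflow[-k])   # stage 1 is fed by new infections
  c(-new_inf, inflow - outflow, outflow[k])
}

# Fixed-step fourth-order Runge-Kutta integration of the model
run_model = function(k, beta = 0.5, gamma = 1 / 3, N = 10000, I0 = 1,
                     tmax = 400, dt = 0.05) {
  x = c(N - I0, I0, rep(0, k - 1), 0)
  peak = I0
  for (step in 1:round(tmax / dt)) {
    k1 = sir_erlang_derivs(x, beta, gamma, k)
    k2 = sir_erlang_derivs(x + 0.5 * dt * k1, beta, gamma, k)
    k3 = sir_erlang_derivs(x + 0.5 * dt * k2, beta, gamma, k)
    k4 = sir_erlang_derivs(x + dt * k3, beta, gamma, k)
    x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    peak = max(peak, sum(x[2:(k + 1)]))
  }
  c(k = k, final_size = x[k + 2] / N, peak_prevalence = peak / N)
}

results = t(sapply(c(1, 2, 5), run_model))
print(results)
```

The final_size column should come out essentially the same for every k, since it depends only on R0=beta/gamma and the initial conditions, while the peak prevalence changes with k.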

This paper discusses the relationship between R0, k, gamma, and the rate of exponential rise at the beginning of an outbreak for SIR and SEIR models.

The files for this worked example can be found in my GitHub repository https://github.com/smtowers/example_latex

The repository contains the main LaTeX document example_latex.tex, along with the BibTeX file example_latex.bib. In order to compile the document, you also need to download example_latex_histogram_plot.eps, which is the figure included in the file. To compile the document, run LaTeX once, then BibTeX, then LaTeX twice (which should resolve all references).

This should produce the file example_latex.pdf

Note that the encapsulated postscript (EPS) figure for the paper was produced with the R script example_latex.R (you need to install the R extrafont library before running the script). The R script also shows you how to automatically output results from your analysis code as \newcommand definitions that you can copy and paste into your LaTeX file, so that you can reference those results in the text of your paper without having to manually transcribe numbers (which can lead to unnecessary transcription errors).
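
As a sketch of that idea (this is not the code in example_latex.R; the data, the command names \samplemean and \samplesd, and the output file name are made up for illustration), an analysis script can write its results to a small .tex file of \newcommand definitions:

```r
set.seed(2024)
x = rnorm(500, mean = 10, sd = 2)  # made-up analysis "results"

# Format the numbers once, in code, so they never need to be typed by hand
lines = c(
  sprintf("\\newcommand{\\samplemean}{%.2f}", mean(x)),
  sprintf("\\newcommand{\\samplesd}{%.2f}", sd(x))
)
outfile = file.path(tempdir(), "analysis_results.tex")
writeLines(lines, outfile)
cat(readLines(outfile), sep = "\n")
```

The contents of that file can then be pasted into the preamble of the LaTeX document (or pulled in with \input{}), and the text can refer to \samplemean and \samplesd wherever the numbers are needed.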

**Git, an open source code management system, is used to store the source code for a project and track the complete history of all changes to that code. It allows developers to collaborate on a project more effectively by providing tools for managing possibly conflicting changes from multiple developers. GitHub allows developers to change, adapt and improve software from its public repositories for free. Repositories can have multiple collaborators and can be either public or private.**

GitHub facilitates social coding by providing a web interface to the Git code repository and management tools for collaboration.

Because GitHub is intuitive to use and its version-control tools are useful for collaboration, non-programmers have also begun to use GitHub to work on document-based and multimedia projects.

Three important terms used by developers in GitHub are fork, pull request and merge. A fork is simply a repository that has been copied from one member’s account to another member’s account; the related idea of a branch is a separate line of development within a single repository. Forks and branches allow a developer to make modifications without affecting the original code. If the developer would like to share the modifications, she can send a pull request to the owner of the original repository. If, after reviewing the modifications, the original owner would like to pull the modifications into the repository, she can accept the modifications and merge them with the original repository.

In the following, we’ll talk about GitHub at its simplest: as a repository for data files you might want to read into R, and also as a repository for R library packages you might develop. I won’t talk about the finer points of versioning here…. just the basics of how to create your own GitHub repository and upload files to it via the online interface.

**GitHub data repositories**

My primary use of GitHub is as a repository for data files that I want to share with others, and that can be read by R Shiny visual analytics scripts that I develop (although I can also incorporate the data files as part of the R Shiny application, so it doesn’t necessarily need to be in a repository like GitHub for this purpose). I could, of course, use Dropbox to share my files, but GitHub allows me to write descriptions of them, and also makes them searchable online.

For example, on my GitHub account, I have a data repository: https://github.com/smtowers/data

In this repository, I have several files that I share publicly, including the file Geneva_1918_influenza.csv, which is the daily incidence of influenza hospitalisations in Geneva, Switzerland during the 1918 influenza pandemic. The raw file can be found here. Putting this file on my GitHub repository allows me to share it publicly with whomever might want it simply by giving them the URL. Importantly, I can also read the file directly from GitHub within an R script. To try this out yourself, within the R console, type:

fname = "https://raw.githubusercontent.com/smtowers/data/master/Geneva_1918_influenza.csv"
thetable = read.table(fname,header=T,as.is=T,sep=",")
plot(thetable$num)

This also allows me to access the files in R Shiny scripts running off of servers like the shinyapps.io server, and to share the data file with whomever else might want to use it in their analysis or applications.

An R Shiny script that I have written that uses this data can be found at https://sjones.shinyapps.io/geneva/. The app reads in the data, plots it, and then overlays the predictions of an SIR disease model with seasonally forced transmission, with parameters input by the user via slider bars. In another module, I talk about how to create your own R Shiny applications (which may or may not read data from GitHub).

**Creating a GitHub account**

Creating a GitHub account is simple and free. Go to github.com and click on “Sign Up For GitHub”. Once you have the account, sign in. To create a new repository, click on the green “New” button at the left hand side of the page:

When the dialogue window pops up, give your repository a name and short description, and click the “Initialize this repository with a README” box:

Click “Create Repository”.

You now have a blank repository, ready to be filled with your files. To upload a file, click on the “Upload files” tab near the upper right:

It will take you to a dialogue box where you can choose the file you want to upload from your computer. Choose your file. Then a dialogue box opens asking you to fill in a description of the file:

Once you click “Commit Changes” your file will now appear in your GitHub repository.

Should you want to update the file in the future, simply repeat the process, starting with “Upload files”. If you upload a file with the same name as a file already in the main branch of the repository, it will be over-written.

**Making your own R library packages in GitHub**

It is remarkably easy to upload your own R code to GitHub as an R library package that others can download and install. This website gives the complete guide to doing that, and is in fact the main resource I used to learn how to do this myself.

I created an R library, for example, with some code related to an analysis my colleagues and I did quantifying the average number of infections that descend down the chain-of-infection of a person infected during an outbreak. Those include the people that person directly infects, plus the number those go on to infect, plus the number those go on to infect, and so on until the chain-of-infection eventually dies out. We called this quantity the “average number of descendant infections”, or ANDI. With ANDI, we can quantify the average probability that at least one person ends up hospitalised down the chain-of-infection from an unvaccinated person infected in an outbreak of vaccine preventable diseases like measles (turns out, that probability is almost 100% in locations where vaccine coverage is sub-standard).

Our analysis code would likely be of interest to others, so we made an R library package of the methods to make it easy for people to download and use (we called the package “ANDI”). We also mentioned the package in our paper. To install the package yourself from GitHub (or any other R library package you find on GitHub, and there are many), install the devtools package on R:

install.packages("devtools")

then type:

require("devtools")
install_github("smtowers/ANDI")
require("ANDI")

There is example code showing how to use the methods in the package in https://github.com/smtowers/ANDI/blob/master/example.R

**What is the field of “visual analytics”?**

(see original source here)

Visual analytics (or “viz”) involves the development of interactive tools that facilitate analytical reasoning based on visual interfaces. The idea behind viz is to put data and/or models into the hands of the public, policy makers, other researchers, or other stakeholders, and allow them to visually examine the data or your models, integrate their own knowledge and perform their own selections to help them reach conclusions of importance to them. It can help to solve problems whose size and complexity, and/or need for real-time expert input, would otherwise make them intractable.

For example, say there is an animal pathogen that has the potential to be used in a bio-terrorism attack against the farming economy. You may have developed a meta-population dynamical model of disease spread that includes the spread of the pathogen among domesticated and wild animals in local areas, and also spread of the pathogen across borders (for example because the animals are transported or move between areas, or people carry the pathogen on their shoes or clothing). For public officials, who might have very few options at their disposal to stop the outbreak, you could, for example, develop a visual application that shows a map of the progression of the outbreak in the areas, with the intensity of the map colours indicating the prevalence of infection in that area at a particular time step.

You could provide tools that allow the public officials to visually examine the relative efficacy of different control strategies, based on their knowledge of what is actually feasible. Things like culling animals in and around the initially infected areas, or perhaps examining how limited vaccine stores might be employed, or limiting the transport of animals, or sanitation of the boots and clothing of farm workers or veterinarians, or stopping of travel across borders altogether. The visual analytics application allows officials to combine their expert knowledge with the model predictions via the visual interface in order to reach optimal solutions under multiple constraints, particularly constraints that might change in time.

Lots of examples of visual analytics applications produced by various companies or organisations can be found online. For example here, here, and here.

Visualisation applications can involve quite complex integrated high-level coding environments, and may involve different kinds of output to several different screens simultaneously, such as the system used by the Decision Theater at Arizona State University. Designing maximally effective visual analytics apps is based on quantitative analytics, graphic design, perceptual psychology, and cognitive science.

However, visual analytics applications need not necessarily be complex to be impactful. For example, applied mathematicians in the life and social sciences use dynamical models in analyses for quantification and prediction; relatively simple visual analytics applications can put those models and associated data (if relevant) into the hands of policy makers to allow them to examine how the model predictions change under different initial conditions or with different parameters. In addition, it means that people don’t have to rely on just the two dimensional plots you put in a publication… given the URL of an online visual analytics app associated with the analysis, they can go to that app and further examine the model and data themselves.

More and more, I try to integrate visual analytics into my own research, because I believe it has the potential to make my research much more impactful. In addition, it provides me with a way to share my data, and to make my analysis methodologies as transparent as possible.

I am also finding that development of visual analytics apps for my own use is quite useful… it is remarkably helpful, for example, to have slider bars for data selection or model inputs and examine how the analysis results or model predictions change when my assumptions change. I find this much easier compared to constantly repeating the process of editing a program and re-running it.

**Some examples of visual analytics or code-and-data-sharing frameworks**

A “visual analytics” application is simply any application that allows users to interact with data and models, and associated analysis methods (like fitting methods, for example). In this sense, any programming language that allows you to make interactive plots allows you to write visual analytics applications. However “online visual analytics” applications are ones that are hosted online, and do not require any specialised software for the user to run (other than a web browser).

In the following, I’ll mention several different software packages that allow the creation of interactive applications. Not all, however, provide the potential for online hosting of the applications. And this is certainly not an exhaustive list of all tools that are out there.

Code sharing with Mathematica notebooks

Students may already have some experience with applications they might have shared with others in a “notebook” format. For example, Mathematica notebooks allow users to share Mathematica code, which might include dynamical interactive selection criteria provided by user-driven slider bars or radio buttons that allow the user to examine the code output or plots under different selection criteria. Mathematica notebooks are typically shared by email or by posting the code in the cloud for others to download and then use themselves with Mathematica running on their computer. Mathematica is not free software and requires a site license. However, they have a free application called the Wolfram CDF player that allows users to examine Mathematica notebooks.

Code sharing of interactive scripts with Matlab

Matlab also allows for incorporation of user-interface controls like slider bars (for example) in Matlab scripts that can then be shared with others via email, or by posting the code in the cloud for others to download and then use themselves within Matlab running on their computer. Matlab is not free software, and requires a site license. However, GNU Octave is free software that allows users to run Matlab scripts (but Matlab users can’t necessarily run Octave scripts).

Plotly visual analytics package in Python (allows online sharing)

The Plotly (or Plot.ly) package in the Python programming language allows users to make online graphing applications, and provides free online hosting for applications. Examples of Plotly interactive and non-interactive applications can be searched for here. In my opinion, Plotly has similarities to the R Shiny package that I describe below, but I have noted that most of the example applications I have so far come across online are non-interactive for some reason.

Sharing code and data in Julia, Python and R with Jupyter

Jupyter is free open-source software that is widely used in industry. It is an integrated data management and code development environment that interfaces to several different programming languages (including Julia, Python, and R… which is where Ju-pyt-er in fact got its name). You download it to your computer, and it displays the results of a programming script (which might involve interactive elements) that is either on your local computer, or loaded by Jupyter from a website and then run on your computer. Here is an example of an R script being run through the Jupyter interface. In 2014, Nature wrote an article discussing the advantages of the integrated code and data development environment provided by Jupyter.

Visual analytics with Tableau

Tableau allows for the creation of nice visual analytics apps with a simple drag-and-drop interface, and is quite popular in the business intelligence community. Unfortunately it is lacking in quantitative and computational tools we would typically use in dynamical modelling analyses (such as ODE and PDE solvers, or delayed differential equation solvers). Tableau is not free, and requires a site license.

D3 JavaScript library for creation of online interactive visual analytics

D3 is a JavaScript library for producing static or interactive visualisations in web browsers, and is widely used for many data visualisation applications, including by many online news sites. Some of the applications are quite fun to play with (even if sometimes it is unclear what the point was, other than the app looks cool). However, from an applied mathematics perspective, D3 suffers from many of the same shortcomings as Tableau, in the sense that there are no canned methods available to numerically solve the kinds of equations typically involved in the dynamical models we use; it is possible to write your own JavaScript methods to solve ODE’s, PDE’s, etc, but it is a significant amount of computational overhead on your part.

Visual analytics with Infogram

Infogram is a website that allows you to create online visual analytics dashboards via an intuitive drag-and-drop interface (you can also create static infographics like plots, pie charts, etc, with no user interactivity). The dashboards can then be shared with others via a URL. Signing up for an account is free, and allows you to host up to 10 dashboards on their site. There are paid options that allow you to host more. While the development interface is simple and intuitive, and the site allows the potential development of nice, relatively uncluttered looking visual analytics, there are no statistical, numerical and computational tools that allow for more complicated modelling-related visual analytics applications.

Visual analytics applications with Flourish

Flourish is another website that allows you to create simple online visualisation applications, again with a drag-and-drop interface, and again with free hosting. And again, the application can be shared via URL. The functionality appears to be even more basic than that provided by Infogram, and again has no statistical, numerical or computational tools that allow for more sophisticated applications. Some examples of visual analytics applications in Flourish can be seen here.

R Shiny library for creation of online interactive visual analytics

One advantage of examining visual analytic application examples on Flourish, and Infogram, and on other software platforms, is that you can get ideas on how to convey information in your own visual analytics applications in a clean, elegant looking format. However, as we saw above, most other visual analytics applications suffer from the drawback that you either have to pay for the software, or the software simply is not sophisticated enough for use in dynamical modelling applications, or in other more complicated statistical analysis applications.

R Shiny is free and open-source, and is part of the R programming language, and thus is integrated with the vast powerhouse of statistical, numerical and computational methods that is R. It allows for development of visual analytics applications that can be hosted off of websites like shinyapps.io, or off of your own server if you have installed the R Shiny Server (ASU unfortunately does not yet have this hosting ability, but I’m working on seeing if it can happen). The website shinyapps.io allows you to host up to five apps at a time, and gives you 25 hours per month of interactive access to them. This is plenty for most applications, unless you create a very popular application (in which case, shinyapps.io also allows paid plans that offer broader access options).

Of course, you could always opt to share your R Shiny applications with other R users by sending them your app code via old-fashioned email or in the cloud, but hosting off of a site like shinyapps.io means that all you have to do is give a user the URL of your app, rather than requiring them to have R installed, download the R code from their email or the cloud, and then run the app within R.

There are many online examples of R Shiny visual analytics applications that range from fairly simple to more fancy, fancier, and quite fancy.

Here is an example of an R Shiny app that allows you to examine movie reviews in the Rotten Tomatoes database (best viewed on its own hosting website):

And here is an example of an R Shiny app I wrote that reads in 1918 daily influenza hospitalisation data for Geneva, Switzerland during the 1918 influenza pandemic, and overlays the predictions of an SIR model with seasonal transmission, with model parameters provided by the user via slider bars (the app is best viewed off of the shinyapps.io website where it is hosted):

**Anatomy of an R Shiny application**

To use R shiny, you first have to download the shiny library. In the R console, type:

install.packages("shiny")

require("shiny")

An R shiny application is built on two building blocks that the runApp() function in the R shiny library uses to create an interactive web browser application running locally on your computer (and are the files you will need to upload your application to an R Shiny external server to share online with others):

- Code in a file called ui.R that defines the ui() function, which is the user interface function that sets up the page layout and defines the user inputs via text boxes, radio buttons, slider bars, etc.
- Code in a separate file called server.R that defines the server() function that takes the inputs from ui.R and makes selections on the data, and/or creates plots of the data, and/or creates plots of your model with the input parameters, and/or tabulates things, and/or lets the user download the data, etc etc etc.
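To make the division of labour between the two building blocks concrete, here is a minimal sketch of an app with one slider and one plot. It is not the Geneva app; the input name `n`, the output name `hist`, and the labels are all made up for illustration. For compactness both pieces are shown in one script and bundled with shinyApp(), but in the two-file layout described above the first part would go in ui.R and the second in server.R.

```r
library(shiny)

# --- would live in ui.R: page layout and user inputs ---
ui <- fluidPage(
  titlePanel("Minimal slider example"),
  sidebarLayout(
    sidebarPanel(
      # slider whose current value reaches the server as input$n
      sliderInput("n", "Number of points:", min = 10, max = 500, value = 100)
    ),
    mainPanel(
      # placeholder that the server fills in as output$hist
      plotOutput("hist")
    )
  )
)

# --- would live in server.R: turns the inputs into outputs ---
server <- function(input, output) {
  output$hist <- renderPlot({
    # re-runs automatically whenever the slider value changes
    hist(rnorm(input$n), main = paste(input$n, "standard Normal draws"))
  })
}

# bundle the two pieces into a single app object
app <- shinyApp(ui = ui, server = server)
```

Calling runApp(app) in an interactive R session should open this toy app in your web browser. Note the commas between layout elements in the ui part.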

You can also have server.R source other R files off of the working directory where you’ve written your own functions to do various things. It can also read data from files in your working directory. As we will see in a bit when we get to that point, when you deploy the app online, these files will automatically get uploaded to the server.

Probably the best way to get an initial understanding of what the ui.R and server.R files do is to look at an example application. I’ve put the R files related to the Geneva 1918 influenza app in my GitHub repository: https://github.com/smtowers/geneva

Create a working directory on your computer, and from my GitHub repository download the files ui.R, server.R, and geneva_utils.R. The geneva_utils.R file just contains a bunch of functions used by the app that I didn’t want littering my server.R file. In the server.R file, notice that I load the functions in geneva_utils.R file with the line:

source("geneva_utils.R",local=T)

Whenever you source another R file in a shiny app that is part of your shiny app package, use the local=T option.

In R, change to that working directory, and type:

require("shiny")

runApp()

A window should have opened up in your web browser, with the app running in it. This is running locally on your computer, using the code you just downloaded. In order to make an app public to other people online, you need to upload it to an R Shiny server (which is really easy… instructions on how to do that are below).

**Creating a shinyapps.io account to host your R Shiny applications**

Shinyapps.io is a website hosting service that allows hosting of up to five R Shiny apps per person for free. They have paid options if you need to host more apps.

Go to shinyapps.io and set up a free account, and then log in. You will be presented with these welcome pages:

Follow the instructions in steps 1 and 2 (you need to click the button “Show Secret” before you copy the R code to paste into R). You only need to do steps 1 and 2 once per computer you will be using R from. You might want to save that code snippet for later reference if you will be running R from multiple computers.

Step 3 will be coming up when you deploy your first app….

**Deploying the 1918 Geneva influenza application to your own shinyapps.io account**

Make sure that R is in the working directory to which you downloaded the ui.R, server.R and geneva_utils.R files. Now, in R, type

require("rsconnect")

deployApp(account="<your shinyapps.io account user name>",appName="<whatever you want to call the app>")

The deployApp() method will automatically build any library packages your app depends on, and then upload all your files to the server. Note: if the server times out, just repeat the deployApp() command. Note that deployApp() can be used over and over again whenever you update the code to upload the newest version of your app.

Once the app is finished deploying, it will pop up in your web browser. You can now share the URL with whomever you like. The app URL will look like:

https://<your shinyapps.io account user name>.shinyapps.io/<name you gave the app>

**Deleting or archiving R Shiny apps on shinyapps.io**

If you would like to delete or archive old shiny apps, simply log in to your shinyapps.io account, and go to https://www.shinyapps.io/admin/#/applications/all

It will list your apps, and you will see icons giving you options to delete the app or archive it.

**Including data files in your R Shiny app**

Your R shiny app can either read in web-based files in a GitHub or Dropbox repository, using commands like this:

fname = "https://raw.githubusercontent.com/smtowers/data/master/Geneva_1918_influenza.csv"

thetable = read.table(fname,header=T,as.is=T,sep=",")

or

fname = "https://www.dropbox.com/s/drn0nqnn8a85c7t/Geneva_1918_influenza.csv?dl=1"

thetable = read.table(fname,header=T,as.is=T,sep=",")

Or, you can simply put your data file in the working directory where you are developing your shiny app (or in a data subdirectory off of it, in which case include the subdirectory name in the file path), and read it in with your R shiny code with a command like:

fname = "./Geneva_1918_influenza.csv"

thetable = read.table(fname,header=T,as.is=T,sep=",")

When you deploy your app to a server like shinyapps.io using deployApp(), the data file will be uploaded with the rest of the files in the package.
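As a rough sketch of the local-file option, the script below builds a throwaway app directory with a data subdirectory, writes a tiny CSV file into it, and reads it back using a path relative to the app directory. The directory, the file name toy_counts.csv, and the counts themselves are all made up for illustration; in a real app the file would be your own data, and the read.table() lines would sit in server.R.

```r
# throwaway app directory with a data subdirectory (illustration only)
app_dir <- tempfile("shiny_app_")
dir.create(file.path(app_dir, "data"), recursive = TRUE)

# made-up counts standing in for a real data file
toy <- data.frame(day = 1:3, cases = c(2, 5, 9))
write.csv(toy, file.path(app_dir, "data", "toy_counts.csv"), row.names = FALSE)

# a shiny app runs with its own directory as the working directory,
# so paths in server.R are written relative to that directory
old_wd <- setwd(app_dir)
fname <- file.path("data", "toy_counts.csv")
thetable <- read.table(fname, header = TRUE, as.is = TRUE, sep = ",")
setwd(old_wd)

print(thetable)
```

Using file.path() rather than pasting slashes together keeps the path portable between your own computer and the hosting server.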

**Troubleshooting R Shiny apps**

When running your app locally, usually there will be error messages printed to your R terminal if the app has problems running. These will usually point you to specific lines in your files where there is a problem. A common problem is forgetting to put commas after various layout elements in your ui.R file.

If your app runs fine locally, but once you upload it you get the error message “Error: An error has occurred. Check your logs or contact the app author for clarification”, this can often be a sign that you forgot to explicitly load the R libraries needed by your script within the script itself using require() or library() statements. So, while at some point you might have loaded them in your R session when running some other script, the shiny app doesn’t know about them once it has been uploaded to the server.

**Styling your apps**

The basic shiny interface has a fair amount of flexibility in page layout, etc. If you want a greater array of fonts and colours, etc, you can use css style sheets, following the instructions here.
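As one possible sketch of what this looks like, CSS rules can be embedded directly in the ui definition with tags$head() and tags$style(). The particular font, colour, and slider below are arbitrary choices for illustration, not taken from any of the apps mentioned here.

```r
library(shiny)

# hypothetical CSS rules, written inline for illustration
css_rules <- "
  body { font-family: Georgia, serif; }
  h2   { color: #8C1D40; }
"

ui <- fluidPage(
  # tags$head() puts the <style> element into the page header
  tags$head(tags$style(HTML(css_rules))),
  # titlePanel() renders as an h2 heading, so the h2 rule applies to it
  titlePanel("Styled app"),
  sliderInput("beta", "Transmission rate:", min = 0, max = 2, value = 0.5)
)

# empty server: this sketch only demonstrates styling
server <- function(input, output) {}

app <- shinyApp(ui = ui, server = server)
```

For more than a few rules it is usually tidier to keep the CSS in its own file; shiny's includeCSS() function can read such a file into the ui.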

**Objectives:**

**This course is meant to provide students in applied mathematics with the broad skill-set needed to optimize the parameters of dynamical mathematical models to relevant biological or epidemic data. The course will almost entirely be based on material posted on this website.**

**Upon completing this course:**

**Students will gain a basic understanding of applied statistics, and will be functional in R.**

**Students will learn how to read in, manipulate, and export data in R, and will be able to create publication-quality plots in R. Students will know how to upload data and code to GitHub, and how to create an RShiny visual analytics application. Students will be familiar with several different parameter optimization methods, and for each will understand the underlying assumptions, and weaknesses and strengths of the methodology.**

**After taking this course, the labels in the following picture will be switched:**

**Dr. Towers’ Golden Rules for Any Statistical Data Analysis:**

- All (or nearly all) data has stochasticity (i.e., randomness) associated with it.
- A probability distribution underlies that stochasticity.
- Hypothesis tests and goodness-of-fit statistics are based on that probability distribution.
- When doing a model fitting analysis, you need three things: some data, a model that describes the trends in the data, and a goodness-of-fit statistic based on the probability distribution that underlies the stochasticity in the data. Goodness-of-fit statistics include Least Squares, Binomial likelihood, Poisson likelihood, Negative Binomial likelihood, etc. Picking the correct goodness-of-fit statistic is critical!
- Anything calculated using data (for example, statistics like the mean or standard deviation, or goodness-of-fit statistics) has stochasticity associated with it, because the data are stochastic.
- Every statistical analysis needs to start with a “meet and greet” with your data: calculate basic statistics (sample size, means, standard deviations, ranges, etc), and make plots to explore the data and ensure no funny business is going on.

**List of course modules:**

- Good work habits, and requirements for homework
- Literature searches with Google Scholar
- Elements of scientific papers
- The basics of the R statistical programming language
- Difference between statistical and mathematical models
- Numerically solving systems of non-linear ODE’s in R using deSolve: what does that black box do?
- Good practices in producing plots
- Example LaTex and BibTex documents
- Extracting data from graphs in published literature
- Online sources of free data
- Fitting the parameters of an SIR model to influenza data using Least Squares and the graphical Monte Carlo method
- SIR disease model with age classes
- SIR modelling of influenza with a periodic transmission rate
- Contagion models with non-exponentially distributed sojourn times in the infectious state
- An overview of goodness of fit statistics, and methods to fit parameters of mathematical models to data
- Estimating parameter confidence intervals when using the graphical Monte Carlo optimisation method: the fmin+1/2 method
- To calculate Least squares Normal negative log likelihood from Least Squares (LS) statistic, use

- negative_binomial_likelihood_calculation_functions.R contains helper functions to calculate the Negative Binomial neglog likelihood given data, model, and over-dispersion parameter alpha.
- If fmin is min value of neglog likelihood statistic, to calculate the 1 std dev confidence interval, determine range of parameter hypotheses with likelihood up to fmin+1/2
- To calculate the K-std dev confidence interval, determine range of parameter hypotheses with likelihood up to fmin+0.5*K^2
- To calculate 95% confidence interval, K=1.96, so determine range of parameter hypotheses with likelihood up to fmin+0.5*1.96^2

- Fitting the parameters of an SIR model to influenza outbreak incidence count data with the graphical Monte Carlo method: a comparison of Least Squares, Poisson negative log-likelihood, and Negative Binomial negative log-likelihood
- A better method for estimation of confidence intervals compared to the fmin+1/2 method: the weighted mean method
- How to determine range over which to sample parameters, and y axis range when plotting the likelihood vs the parameter hypotheses
- Data and R code repositories in GitHub
- Creating online visual analytics apps with R Shiny
- Running R in batch with ASU high performance computing resources
- Incorporating prior parameter estimates and their uncertainties (Bayesian priors) into your likelihood fits
- Markov Chain Monte Carlo optimisation methods, and why they aren’t ideal for our purposes
- Comparing two models and hypothesis testing: which gives a “significantly” better fit?
- Model validation
- Producing well written manuscripts in a timely fashion
- Predatory journals and conferences, and how to avoid them
- Submitting papers to the ArXiv pre-print server
- Giving a good presentation
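The fmin+1/2 rule quoted in the module list above (and its fmin+0.5*K^2 generalisation) can be illustrated with a toy example. The sketch below is only that: the counts are made up, the model is a constant Poisson rate rather than an SIR model, and the parameter hypotheses are scanned on a regular grid rather than sampled randomly as in the graphical Monte Carlo method.

```r
# made-up daily counts, purely for illustration
y <- c(4, 7, 5, 6, 8, 3, 6, 5, 7, 4)

# Poisson negative log-likelihood for a constant rate lambda
# (terms that do not depend on lambda are dropped)
negloglike <- function(lambda) sum(lambda - y * log(lambda))

# scan parameter hypotheses on a regular grid
lambda_grid <- seq(3, 9, by = 0.001)
f <- vapply(lambda_grid, negloglike, numeric(1))

# minimum of the statistic, and the hypothesis that achieves it
fmin <- min(f)
best <- lambda_grid[which.min(f)]

# K-std-dev interval: hypotheses with neglog likelihood up to fmin + 0.5*K^2
ci_from_scan <- function(K) range(lambda_grid[f <= fmin + 0.5 * K^2])

ci_1sd <- ci_from_scan(1)      # fmin + 1/2
ci_95  <- ci_from_scan(1.96)   # fmin + 0.5*1.96^2

print(best)
print(ci_1sd)
print(ci_95)
```

For this toy model the one standard deviation interval can be checked by hand: the best-fit rate is the sample mean, and the width of the interval should come out close to twice sqrt(sum(y))/length(y).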

**Course expectations:**

While there are no course pre-requisites for this course, students should a) know what a dynamical compartmental mathematical model entails and how to construct one appropriate to a research question that interests them, and b) be able to numerically solve a system of ODE’s in some programming language (for example Matlab, Mathematica, Maple, R, Python, etc).

There will be regular homework projects assigned throughout the course, which will be worth 50% of the grade. **Many of the homework assignments build sequentially upon each other; accordingly, failing to duly hand in a homework assignment will result in a full letter grade reduction for the course.**

In-class pop quizzes will also be given on occasion, and will be included in the homework grade.

I am always available for video meetings to discuss any issues students might have with the homework or course material. On weeks that I am resident at ASU (one week a month), all students are expected to schedule an in-person one-on-one meeting with me outside of class time. This will be counted towards the homework grade. **Accordingly, failing to schedule a meeting on weeks I am at ASU will result in a full letter grade reduction for the course.**

Students are strongly encouraged to work together in groups to discuss issues related to the course and resolve problems. However, plagiarism of code will not be tolerated.

The culmination of the course will be a group term project (two to three students collaborating together, with the project worth 50% of the final grade) that requires the development of an R program to solve a system of ordinary differential equations that describes the dynamics of disease spread, interacting biological populations, etc. The students will then optimize the parameters of their model to data that they have identified as being appropriate to describe with their model. The students will write up the results of their project in a format suitable for publication, using the format required by a journal they have identified as being appropriate for the topic. A cover letter written to the editor of the journal is also required. **Submission for publication is not required, but is encouraged if the analysis is novel.**

Students are responsible for locating and obtaining sources of data, and developing an appropriate model for the project, so this should be something they begin to think about very early in the course.

**This course has no associated textbook, due to the unique nature of the course content. Instead the course content consists of the modules that appear on this website. A textbook that students may find useful is Statistical Data Analysis, by G. Cowan.**

Students are expected to bring their laptops to class. Before the course begins, students are expected to have downloaded the R programming language onto their laptop from http://www.r-project.org/ (R is open-source free software).

Final project write-ups will be due **Friday, April 19th**. Each of the project groups will give an in-class 20 min presentation on **Monday, April 22nd, 2019 and Wed, April 24th, 2019**.

During the week of April 15th, project groups will meet with Dr. Towers to discuss their final project write-ups, and their upcoming presentation. By Friday, April 26th, all group members are to submit to Prof Towers a confidential email, detailing their contribution to the group project, and detailing the contributions of the other group members.


In 2008, a librarian at the University of Colorado, Jeffrey Beall, noticed that there was a sudden proliferation of online journals with apparent shady practices; that is to say, they all charged publication fees to the authors *and* often charged fees to read the journal, and (worst of all) had virtually no quality control in the review process (if indeed the articles were reviewed at all).

He began a list, which he published online, called Beall’s List of predatory journals. He maintained the list for several years and it was extremely helpful to many researchers, but due to threats of litigation, he has since ceased maintaining it. However, many others in academia have stepped up to help with the work of identifying and publicly naming predatory journals. And many people are needed to do this, because in recent years such journals have wildly proliferated.

Once you are in academia as a graduate student, postdoc, or faculty, your university email is generally public. Predatory journals compile huge lists of such emails and send out many spam emails encouraging you to publish in their journals (they don’t mention the steep prices you pay for that).

The problem with predatory journals is that they can charge anywhere from hundreds to thousands of dollars for publication, and are usually not open and up front about these charges. If you get through the “review” process and find out about the exorbitant charges at the very end, and don’t want to pay, they hold the copyright to your article, and you cannot submit it elsewhere (effectively holding your paper hostage). The other problem is that, even if you fork over the money, such journals also at times charge other people to read the article, which almost no one is going to do. People won’t be able to access such articles for free through their university libraries either, because libraries only do that for journals they’ve paid for, and libraries don’t pay for predatory journals. So, publishing in a predatory journal means you’ve just paid big bucks to likely put your article in a black hole.

I get anywhere from one to several emails from predatory journals every day. Most of them are poorly worded, with spelling mistakes and poor grammar, but a few are slick enough that sometimes I have to take a second look. The latter are in the vast minority, however.

Here is one example of such a mailing, plucked from my own email:

Notice that it doesn’t mention publication fees, but promises blindingly fast “review” with only 7 days from submission to publication (!). In addition, it mentions it has an “IF” of 4.61. I’ve noticed recently that these spam journals make a point of mentioning high “IF”s. “IF” normally stands for “Impact Factor”, which is a way of assessing a journal’s influence… it is a measure of the average number of citations per article per year. Every year Thomson Reuters publishes the list of impact factors for journals indexed in the Web of Science. The Web of Science is careful to index reputable journals. It should be noted as an aside here that there are disreputable journal indexing sites as well, such as the International Science Indexing site, where predatory publishers pay to have the site list whatever impact factor the predatory publisher wants to appear. ISI is just one example of many such predatory indexing sites. To be careful, only publish with journals indexed in the Web of Science.

I downloaded the impact factors of journals indexed in the Web of Science from the Thomson Reuters 2016 report, and histogrammed them:

The top right plot is simply zooming in on the top left plot… there are a few journals that have impact factor over 50, but they are rare (so rare you can’t even see them in the left hand histogram). The bottom plot shows the cumulative distribution… using the bottom plot, we can see that an impact factor of 4.61 is above the 90th percentile of all journals. Wow! The International Journal of Engineering Research and Development sounds very prestigious!

However, I have become convinced that this latest craze of quoting large “IF” numbers in predatory journal mailings is evidence that “IF” to them doesn’t mean impact factor. I think it probably stands for “Indicates Fraud”.

So, mailings from obscure journals that quote high “IF”s are a red flag that the journal is likely predatory, in particular when it also promises you that your article can be published in a very short period of time (like a week). Peer review and revision simply cannot be done in a week.

If you’re still wondering if the journal is actually predatory, you can search for it on the Web of Science, which indexes valid, reputable journals. A nice online interface for doing that is provided by Clarivate Analytics. If the journal name isn’t found, it’s likely a scam journal. Also, you can google the journal name, along with the words “predatory” or “scam”. For journals I find questionable, this rarely fails to bring up a page where people have pointed out reasons it is likely a scam (for example, because the publishing house is known to churn out scam journals).

In the case of the International Journal of Engineering Research and Development, googling the journal name plus “predatory” didn’t bring anything up (but some other very similarly named journals did), but it also wasn’t found on the Web of Science. So I tried to determine the publishing house, to see if it was a known predatory publisher (predatory publishing houses usually publish many predatory journals). The email came from “daum.net”, which is a web portal in South Korea (similar to having a “yahoo.com” or “hotmail.com” internet address). This isn’t a good sign. Reputable journals and publishers have their own domain name.

I tried going to the journal web site to get more info. I didn’t see the name of the publisher there, so I clicked on “contact”. It gave me absolutely no information about where the journal is based (just an online form you can fill out to contact them). However, it does say on that page that manuscripts can be submitted to ijerd@editormails.com. It turns out that editormails.com has no website associated with it, but when I google “editormails.com” plus “predatory” or “scam” I find that it is associated with various other shady journals (however, not all predatory journals use “editormails.com” for their emails, so just because a journal doesn’t use editormails.com for emails doesn’t mean it’s not predatory).

I did notice on the journal web site that they at least do not charge to read their articles. However perusing a recent issue reveals that the first paper is an obvious crackpot paper, the second paper is not on any kind of engineering topic, the third paper does not present any kind of cohesive analysis in its review, and so on…. And they have many grammatical, punctuation, and spelling mistakes. It is clear that none of them have been reviewed. Which isn’t to say that some of them might not have at least some merit, but without review you cannot trust that the analyses are sound.

In addition, when I looked back five years to the first couple of volumes published in 2015 and looked up the papers in Google Scholar (it’s fairly quick to do), I found that the 19 papers had garnered 29 citations in 5 years (most of which were actually self-citations by the authors of the original papers). That’s an impact factor of 0.3. That is a far cry from the 4.61 touted on the journal website, and puts the journal well below the 1st percentile.
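The back-of-the-envelope number quoted above is just citations per paper per year. As a quick check in R, using only the figures given in the text (the variable names are mine):

```r
n_papers    <- 19   # papers in the first couple of volumes
n_citations <- 29   # citations those papers garnered
n_years     <- 5    # years over which the citations accumulated

# average number of citations per paper per year
impact <- n_citations / (n_papers * n_years)
print(round(impact, 1))
```

This is a rough estimate in the spirit of an impact factor, not the formal calculation used for the Web of Science listings.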

When identifying a predatory journal, you have to do so on the balance of the various evidence: Is it indexed in reputable indexing sites like the Web of Science? Does the name pop up on lists of predatory journals? Does the publishing house name pop up on lists of predatory publishers? Can you even find the name of the publishing house? Is the journal website full of grammatical mistakes and mis-spellings? If you can access previous papers published by the journal, are they full of grammatical mistakes, run-on sentences, and mis-spellings? Check the number of citations garnered by some past papers… does it jibe with the “IF” the journal purports to have? Is the journal up-front and clear about its charges? Note: if they do mention a journal charge, and it seems quite cheap, that might just be the submission charge, not the publication charge. No reputable journal has submission charges (i.e., a charge just to have the editor read your manuscript).

Lastly, I need to stress here that just because a journal has publication charges does not mean it is predatory. Many very reputable journals are moving to an “online first” publishing paradigm where authors pay for their manuscript to be made freely available online after review. These journals are not predatory because the articles go through thorough review, and the publisher does not charge readers for access to the article, and the articles are indexed in reputable indexing sites.

**Predatory journal editorial board offers**

Sometimes you will get emails from predatory journals offering you a place on their editorial board (“My goodness! I’ve never heard of this journal but I’m so honoured… I’m being asked to be an editor and I’m so early in my career! This is wonderful! I need to call my mum to let her know…” ).

Danger! If you agree to become an editor of a predatory journal, you will often be responsible for publishing one or more of your own articles per year, and are also often responsible for getting colleagues to submit one or more articles per year (for example as “special issues” that you have to create and manage). Now, not only will you be getting dinged exorbitant charges for your papers to go into a black hole, but you will have to sucker your colleagues into doing it as well.

**Predatory conferences**

I also frequently get emails about conferences that are usually not even remotely in my field of expertise, but are advertised to occur in tourist destination areas like Valencia, Spain, or Brisbane, Australia, or other lovely sun drenched spots around the globe. The emails tell me they’d like me to be keynote speaker to talk about my esteemed research. They also list other “esteemed” researchers who will also be attending.

The problem with predatory conferences is that the list of other esteemed researchers who are supposed to be at the conference is fake, or the actual academic reputation of the listed researchers is wildly exaggerated. In addition, the companies organising predatory conferences often cancel at the last minute, and the fine print of their registration fee policy means that the registrants cannot get a refund (instead, if anything, you’ll just be given a “credit” towards another conference by the same company).

Even if the conference does take place, when you show up you find that the attendees are not the esteemed researchers promised by the company, but rather a bunch of other poor sots who got conned into going to a scam conference.

To avoid predatory conferences, google the conference name (making sure you spell it exactly… organisers of such conferences trickily try to name their conferences very similar to other, reputable conference names), along with “predatory” or “scam”. You’ll likely quickly find out what you need to know. If nothing comes up, go talk to a more senior person in your field about the conference. If it is a real conference, they should either know about it, and/or be acquainted with one or more of the listed organisers (if the organisers truly are experts in the field).

OMICS is a group infamous for publishing predatory journals and arranging predatory conferences. Never go to an OMICS conference. OMICS has many subsidiary shell companies now because of the infamy of its name, so unfortunately it is not always apparent that a predatory journal or conference is actually part of that company without doing some digging online. OMICS is so egregious in their practices, they have actually been sued by the US federal government.

**How do you find reputable journals, and reputable conferences?**

To find a reputable journal to which to submit your manuscript, take a look at the papers you cited in your manuscript… where did they publish? That is often a rich source of ideas for reputable journals within your field of research that you can publish in. Also, take a look at the list of journals your faculty mentors have published in… those are also a rich source of ideas for potential journals. A list of some reputable journal publishers includes Elsevier, Springer, Cambridge University Press (note: *not* Cambridge Scholars Publishing… that one is shady), Oxford University Press, XXX University Press (insert name of prestigious university here), SAGE, Taylor & Francis, CRC Press, PLoS, Macmillan, and Wiley-Blackwell. Proceedings of societies like the Royal Society and the National Academy of Sciences are also (very) reputable.

For conference ideas, ask your faculty mentors or colleagues you trust. Join the professional organisation associated with your field (for example, for applied mathematicians, SIAM is a good choice, and offers student discounts). I’m a member of the American Statistical Association, and it also has student discounts. Professional organisations always have annual meetings. They also usually have publications that advertise upcoming conferences organised by other entities. All of those can be considered trustworthy.


There are many reasons why you might want to make your own R library package. For example, perhaps you have written R methods that you use on a frequent basis, and would like an easy way to access them that would work from any computer. Or perhaps you want to share your code with others. Or perhaps you want to use your code and data within an RShiny application. When writing a paper, it is also a very nice touch to have your analysis code and data available to others in an R repository, and to reference that repository in the paper.

Uploading an R library to GitHub is fairly straightforward (see this nice tutorial here). Uploading to the official R CRAN repository is a bit more work because they are very particular in how they want the package formatted (and for good reason, because they want to ensure all packages are properly documented and error free). In the following, as you prepare your package, I give tips on what needs to be added where in order to ease the path towards getting your package into R CRAN. This tutorial is primarily aimed at users who are writing a library package based on scripts written in R.

First, read through all of these steps. Then do these steps sequentially, *following all directions*. If, for whatever reason, you need to stop partway through and have to return to a fresh R session to complete the task at a later date, repeat steps 2 and 3 before proceeding from where you left off.

**Step 1:**

Now, let’s work on creating our R library. Create a directory on your computer that will hold all the files associated with the library. Name it something that will make it easy to find. It doesn’t have to necessarily have the same name as your R library. I’m going to write a set of methods to overlay a compartmental Susceptible, Infected, Recovered (SIR) disease model on weekly incidence data from a seasonal flu outbreak in the Midwest. This is the same model and same data I described in this past module. I’m thus going to call my directory “sir_influenza”. Call your directory some descriptive name that matches your particular project.

**Step 2:**

In R, change to that directory (use your directory name):

setwd("~/sir_influenza")

**Step 3:**

Now, to create an R package, you will need to install devtools and roxygen2 in R. Type the following in the R command line (only do the install.packages commands the first time you use devtools and roxygen):

install.packages("devtools")
install.packages("roxygen2")
require("devtools")
require("roxygen2")

**Step 4:**

Now you’re ready to create the skeleton of your package. I’m going to call my package “SIRinfluenza”. This will be the name of my package in the R CRAN repository. Note that R library package names can only include ASCII letters, numbers, and dots, must have at least two characters, must start with a letter, and must not end with a dot. Using underscores “_”, or any other special character, in the package name is not allowed.
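As a quick sanity check before calling create(), the naming rules above can be written as a regular expression. This is only a sketch: the helper name is_valid_pkg_name is hypothetical (it is not part of devtools or base R), and it checks only the rules listed above, not whether the name is already taken on CRAN.

```r
# Hypothetical helper (not part of devtools): checks the naming rules above.
# The pattern requires a leading ASCII letter, then letters/digits/dots,
# and a final letter or digit, which also enforces at least two characters.
is_valid_pkg_name = function(name){
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", name)
}

candidates = c("SIRinfluenza","sir_influenza","SIR.influenza","SIRinfluenza.","2SIR","S")
print(data.frame(name=candidates, valid=is_valid_pkg_name(candidates)))
```

Only the first and third candidates pass; the others fail because of the underscore, the trailing dot, the leading digit, and the single character, respectively.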

On the R command line, type the following (using your package name, not mine):

create("SIRinfluenza")

If you look in your directory that you created for this library, you will now see that it has a subdirectory with your package name. That subdirectory contains an R/ subdirectory (that’s where your R scripts will go… more on that in a minute), and a DESCRIPTION file that will contain a description of the package.

**Step 5:**

Using the editor of your choice, edit the DESCRIPTION file in that subdirectory and put in your name, a title of the package and a brief description of it. **Make sure that there are no mis-spelled words or acronyms, otherwise the package will be rejected by CRAN.**

Also note that **the title of your package has to be given in title case (i.e. all principal words capitalised), otherwise the package will be rejected by CRAN.**

If your R scripts are going to depend on any external R libraries, you need to tell R to import them in the DESCRIPTION file. My R library will be using methods in the R deSolve library to solve the ODEs of the SIR model, and will also be using the R sfsmisc library methods for plotting. So I need to add the following line to the file:

Imports: deSolve, sfsmisc

If you want your R library to be uploaded to the R CRAN repository, **you are also going to need to add “Author:” and “Maintainer:” lines to your DESCRIPTION file, otherwise your submission will be rejected by CRAN.** For example:

Author: Sherry Towers [aut, cre]
Maintainer: Sherry Towers <smtowers@asu.edu>

Make sure there **is a space after the comma in “aut, cre” in the “Author:” line, otherwise your submission will be rejected by CRAN.** Also, make sure that the email is correct, and the best one to reach you at, because it is how CRAN will communicate with you regarding your package.

The automatically generated DESCRIPTION file has a “Depends:” line followed by the version of R that is installed on your machine. **If this version number doesn’t end in 0, you need to edit it such that it does, otherwise CRAN will reject your submission.** This was the original line in my automated DESCRIPTION file:

Depends: R (>= 3.4.3)

and this is what I changed it to:

Depends: R (>= 3.4.0)

The default version number on the “Version:” line ends in 9000. **Change that last number to something smaller (like 1, for example) otherwise your package will be rejected by R CRAN.**

XXXX Here is what my final DESCRIPTION file looks like.
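To see how the fields discussed in this step fit together, here is a rough sketch that assembles them with the base R write.dcf() function and reads them back with read.dcf(). It is not the author's actual file: the Title and Description text are placeholders I made up, any other fields that create() generates are omitted, and the file is written to a temporary location so that nothing in your package directory is touched.

```r
# Sketch only: assembles the DESCRIPTION fields discussed above.
# The Title and Description text are placeholders, not the author's wording.
desc = data.frame(
  Package     = "SIRinfluenza",
  Title       = "Placeholder Title in Title Case",
  Version     = "0.0.0.1",
  Author      = "Sherry Towers [aut, cre]",
  Maintainer  = "Sherry Towers <smtowers@asu.edu>",
  Description = "Placeholder sentence describing the package.",
  Depends     = "R (>= 3.4.0)",
  Imports     = "deSolve, sfsmisc",
  stringsAsFactors = FALSE
)

# write to a temporary file (so nothing in your package is overwritten),
# then read it back to confirm the fields parse as expected
tmp_file = tempfile("DESCRIPTION")
write.dcf(desc, file = tmp_file)
desc_check = read.dcf(tmp_file)
print(t(desc_check))
```

In practice you would simply edit the DESCRIPTION file by hand, as described above; reading it back with read.dcf() is just a convenient way to spot a malformed line before running the CRAN checks.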

**Step 6:**

Now you add the files associated with your R package. Let’s start with how to add data files (assuming that you have data files associated with your package).

In this past module I talked about fitting an SIR model to influenza outbreak data from the Midwest during the 2007-2008 flu season. Influenza data for past seasons can be downloaded from the CDC FluView website. In the file midwest_flu_2008.csv, I’ve put the influenza outbreak data for 2008 only (just to make things simpler for this example).

If I download this to my working directory, I can read it into R:

midwest_flu_2008 = read.table("midwest_flu_2008.csv",sep=",",header=T)

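Now let’s add this to our R package (note that in devtools version 2.0 and later, the use_data() method has moved to the usethis package, where it acts on the active project rather than taking a pkg argument):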

devtools::use_data(midwest_flu_2008,pkg="./SIRinfluenza")

If you look in your subdirectories, you will see that there is now a data/ subdirectory that contains a file in .rda format. This is R’s compressed data format.

**You need to document your data files, otherwise the package will be rejected by CRAN.** In the SIRinfluenza/R subdirectory, create a file called data.R, and fill in the relevant information. All of the lines except the last have to begin with #' (not just a #). These are directives to methods in the R roxygen2 and devtools libraries that will automatically build the documentation (we’ll be doing this in a later step). In the \item sections you describe what each of the columns in the data frame is:

#' Midwest Influenza Cases in Early 2008
#'
#' A dataset containing the weekly incidence of influenza cases identified by the
#' CDC in the Midwest HHS region in early 2008.
#' @format A data frame with three columns and 20 rows
#' \describe{
#' \item{date}{Date, in years}
#' \item{week}{Week of the year}
#' \item{num_influenza_cases}{Number of identified influenza cases}
#' }
#' @source \url{https://www.cdc.gov/flu/weekly/pastreports.htm}
"midwest_flu_2008"

The @source directive is included to indicate the source of the data.

**Note that the very last line must not begin with #' and must be the exact name of the data frame, in quotes, otherwise your package will be rejected by CRAN.**

**Step 7:**

Now create the R scripts that will contain the code associated with your package.

It is good form to have a separate R script for each method.

You need to add comments to each of your scripts that document the method. These comments will be processed by methods in the R roxygen2 and devtools libraries to automatically create documentation for the method. Again, all of these comments need to start with #' (not just #).

At the end of the set of comments, you need to add

#' @export

to ensure the library functions are public.

**You also need to add “#' @param” comments to document the meaning of each of the input parameters to the method, and “#' @return” comments to document the outputs of the method.** Here is what the comments on my SIRfunc.R file look like:

#' These are the differential equations for a Susceptible, Infected, Recovered (SIR)
#' deterministic model, input to the lsoda() method in the deSolve library
#'
#' @param t The time
#' @param x The current value of the model compartments
#' @param vparameters List of the parameters of the model
#' @return The current value of the derivatives of the model
#'
#' @export

I also have a method called SIR_solve_model() in the SIR_solve_model.R script that solves the model for given values of the basic reproduction number R0 and the recovery rate, gamma, assuming that the first infected person is introduced to the population at time t=0 (since the parameters don’t depend on time in this particular model, you can always just shift the results in time after you obtain the solution).

Note that this code depends on methods in the R deSolve library to numerically solve the set of differential equations of the SIR model. **But note that I do not have a library(deSolve) or require("deSolve") statement in this script… your R scripts in your library should never include these statements! Instead, you use an @import directive to let roxygen2 know that this code has dependencies on other libraries.** Like so:

#' @import deSolve

You also need to add examples of how to use your code using @examples XXXXX
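As an illustration of how the roxygen2 pieces fit together (description, @param, @return, @examples, and @export), here is a minimal sketch of a documented SIR derivative function. It is not the author's SIRfunc.R: the name SIRfunc_sketch is hypothetical, and I have assumed the parameters are passed as elements named beta (transmission rate) and gamma (recovery rate), which may differ from the parameterisation in the actual package.

```r
#' Derivatives of a basic SIR model (illustrative sketch)
#'
#' @param t The time
#' @param x Vector with the current values of S, I and R (in that order)
#' @param vparameters Named list or vector with elements beta and gamma
#' @return A list whose first element is the vector of derivatives
#'
#' @examples
#' x0 = c(S = 999, I = 1, R = 0)
#' SIRfunc_sketch(0, x0, list(beta = 0.5, gamma = 0.25))
#'
#' @export
SIRfunc_sketch = function(t, x, vparameters){
  S = x[[1]]
  I = x[[2]]
  R = x[[3]]
  npop  = S + I + R
  # assumed parameter names; [[ ]] works for both lists and named vectors
  beta  = vparameters[["beta"]]
  gamma = vparameters[["gamma"]]
  dS = -beta*S*I/npop
  dI = +beta*S*I/npop - gamma*I
  dR = +gamma*I
  list(c(dS, dI, dR))
}

# running the code from the @examples section by hand
x0 = c(S = 999, I = 1, R = 0)
derivs = SIRfunc_sketch(0, x0, list(beta = 0.5, gamma = 0.25))
print(derivs)
```

The lines under @examples are ordinary R code that gets run when the package is checked, so keep them short and make sure they run without error in a fresh session.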

Tips: If you use data frames in your code, **you’ll run into trouble when trying to submit to the R CRAN repository if you use the subset() command. R CRAN will complain it doesn’t know what the variables are that are used in the logical arguments for the subset. To get around this, instead use a boolean variable to find the rows in a data frame that satisfy a certain set of conditions. Let’s say this boolean is in the vector i.** Then do something like this:

my_new_data_frame = my_old_data_frame[i,]

You can see an example of this in SIR_solve_model.R
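Here is a small self-contained sketch of the boolean-vector approach, using a made-up data frame (the column names and values are for illustration only). It also confirms that the result matches what subset() would have returned.

```r
# made-up example data (for illustration only)
my_old_data_frame = data.frame(week = 1:6,
                               num_influenza_cases = c(3, 10, 25, 40, 18, 5))

# boolean vector marking the rows that satisfy the conditions;
# the columns are referenced explicitly with $, so there are no
# "unknown variable" complaints when the package is checked
i = my_old_data_frame$num_influenza_cases > 15 & my_old_data_frame$week <= 5
my_new_data_frame = my_old_data_frame[i,]
print(my_new_data_frame)

# for comparison only (this is the form to avoid inside a package)
same_as_subset = isTRUE(all.equal(my_new_data_frame,
                                  subset(my_old_data_frame,
                                         num_influenza_cases > 15 & week <= 5)))
print(same_as_subset)
```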

Also, and for much the same reason, **if you have data as part of your package (say, in the data/ directory in an .rda file), and you wish to use it in the methods in your package, you need to do it as follows (but with your package name in place of SIRinfluenza):**

SIRinfluenza::midwest_flu_2008

You can see an example of this in overlay_SIR_model_on_data.R

**Step 8:**

To create the documentation for your R package, type in the R command line (again, use your package name, not SIRinfluenza):

devtools::document("./SIRinfluenza")

If you look in the subdirectories, you will see in the man subdirectory there are now files ending in an “.Rd” suffix that correspond to each of your R scripts. If you edit one of those files, you’ll see that R has translated the comments in your R scripts into documentation for each of the methods. There is also a documentation file for the midwest_flu_2008.rda file that was produced by the comments in the SIRinfluenza/R/data.R script. The NAMESPACE file has also been filled with the export() and import() directives of your package. The imports are the packages your library depends on, and the exports are the method names of your library.

**Step 9:**

Now we’re going to go through the process of ensuring that our package passes the R CRAN requirements.

First, just to make sure that all the documentation is totally up to date for your package, on the R command line, type

devtools::document("./SIRinfluenza")

Now, to check compliance with R CRAN requirements, type:

devtools::check("./SIRinfluenza")

(note that the devtools check() purportedly runs the document() command… I always run the document() command first anyway, just because I’m superstitious I guess).

The check() method will spit out a bunch of output. You need to carefully go through this output and fix all ERRORS, WARNINGS, and NOTES. Just make a pot of coffee, and take it one at a time. **In order for submission to R CRAN, every single problem needs to be addressed, otherwise their automated submission system will just reject your package.** Hopefully, if you followed all of the instructions above, the number of problems with your package will be minimal or none.

**Step 10:**

Now we’re going to build what is known as the “tarball” of our package. This is a compressed file that will be uploaded to CRAN.

To build the tarball for your package, on the R command line, type

devtools::build("./SIRinfluenza")

If you look in your directory, you will see that there is a file with your package name, version number (as it appeared in the DESCRIPTION file), and ending in “tar.gz”.

**For users of Mac, Linux, or Unix machines only:**

Rather than doing this in R, alternatively, on Mac, Linux, or Unix machines, you can type the following in a terminal window in your main directory that contains your package (in my case, this was in the ~/sir_influenza directory… see Step 2 above):

R CMD build SIRinfluenza

Then list the directory, look for the name of the tarball, and run the check on it (using the name of your tarball, not mine):

R CMD check --as-cran SIRinfluenza_0.0.0.1.tar.gz

**Fix any ERRORS, WARNINGS, and NOTES that appear in the output, and then repeat the “R CMD build” and “R CMD check” lines until you no longer have any problems.**

*If you have a Mac, Linux, or Unix machine, I suggest you do this from the terminal, rather than in R. For whatever reason that I have been unable to ascertain by reading the R devtools::check() documentation, checking from the terminal appears to be more stringent than checking from within R. While rare these days after so many years of using it daily, sometimes my R kung fu is weak.*

**Step 11:**

*Do this step if you are going to submit to R CRAN. Otherwise, if you just want to submit to GitHub, go on to Step 12. GitHub is OK if you’re just looking to share your code, but getting it into R CRAN has the advantage that other users know your package has been vetted as far as documentation goes. Also, if you are going to use your package in RShiny, you want it in R CRAN.*

You are now ready to submit your package to the R CRAN repository!

Go to https://cran.r-project.org/submit.html and fill out the forms and upload your tarball. Follow the instructions. You will get an email, with further instructions to confirm and complete the submission.

Within about 24 hours you will get a message either telling you your package has been accepted (yes!), or with a list of further problems that need to be fixed. If the latter happens, fix the problems, and re-upload your package.

If you update your package in the future, follow the instructions on this webpage to upload your updated package to R CRAN.

**Step 12:**

The following steps are to upload your package to GitHub. *Note: if you have successfully uploaded your package to R CRAN, there is no reason to have it on GitHub as well.* The following instructions are for users of Mac, Linux, or Unix machines (sorry, I don’t use Windows and likely never will, so I don’t know how to do this on a Windows machine, but I’m sure if you dumpster dive on the Internet enough, there are instructions out there for Windows).

Go to the GitHub web-based code hosting service and sign up for an account (it’s free).

Install git on your local computer:

- Windows: http://git-scm.com/download/win (OK, maybe there is a little bit about Windows on here, but only because I copied and pasted this from another site).
- OS X: http://git-scm.com/download/mac.
- Debian/Ubuntu: `sudo apt-get install git`.
- Other Linux distros: http://git-scm.com/download/linux.

Tell Git your name and email address. These are used to label each commit so that when you start collaborating with others, it’s clear who made each change. In the terminal window, run:

git config --global user.name "YOUR FULL NAME"
git config --global user.email "YOUR EMAIL ADDRESS"

(You can check if you’re set up correctly by running `git config --global --list`.)

**Step 13:**

Log in to your new GitHub account and create an empty repository that will eventually hold your package. This will be at the URL https://github.com/<your github username>.

Create an empty repository by clicking on the ‘+’ icon at the top right hand corner of your screen, and clicking on ‘new repository’. Alternatively, go to https://github.com/new

Now, enter in the name of your new repository. **Name your GitHub repository the same name you named your R library.** In this example, my library was called SIRinfluenza, so that is what I named my GitHub repository. This is what I see on my screen:

Click on “Create repository”

**Step 14:**

In a terminal window on your Mac, Linux, or Unix machine, change directories to the directory of your R package

cd ~/sir_influenza/SIRinfluenza

Now type:

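git init
git add ./
git commit

The last command will open up an editor for the commit message. Type a short message describing the commit on the first line (the commented-out lines below it list the files that will be committed, in this case all of them), then save and exit.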

Now type (but with your own Github user name and repository name!)

git remote add origin https://github.com/<your github username>/<your R package name>
git push -u origin master

Now, when I go to https://github.com/smtowers/SIRinfluenza, I see that all of my files have been uploaded to the repository.

From here on in, if you edit a file and wish to upload the latest version to GitHub, from the SIRinfluenza directory (or whatever you named your package) type

git add ./
git commit
git push -u origin master

**Step 15:**

Now, others can install your github R library on their own computers. In a fresh R session, type:

require("devtools")
install_github("smtowers/SIRinfluenza")
require("SIRinfluenza")

Note that once you load the library, the R data file midwest_flu_2008.rda that is part of the package is loaded into memory too.


The Wikipedia pages for almost all probability distributions are excellent and very comprehensive (see, for instance, the page on the Normal distribution). The Negative Binomial distribution is one of the few distributions that (for application to epidemic/biological system modelling), I do not recommend reading the associated Wikipedia page. Instead, one of the best sources of information on the applicability of this distribution to epidemiology/population biology is this PLoS paper on the subject: Maximum Likelihood Estimation of the Negative Binomial Dispersion Parameter for Highly Overdispersed Data, with Applications to Infectious Diseases.

As the paper discusses, the Negative Binomial distribution is the distribution that underlies the stochasticity in over-dispersed count data. Over-dispersed count data means that the data have a greater degree of stochasticity than what one would expect from the Poisson distribution. In practice, this is frequently the case for count data arising in epidemic or population dynamics due to randomness in population movements or contact rates, and/or deficiencies in the model in capturing all intricacies of the population dynamics.

Recall that for count data with underlying stochasticity described by the Poisson distribution, the mean is mu=lambda, and the variance is sigma^2=lambda. In the case of the Negative Binomial distribution, the mean and variance are expressed in terms of two parameters, mu and alpha (note that in the PLoS paper above, m=mu, and k=1/alpha); the mean of the Negative Binomial distribution is mu, and the variance is sigma^2=mu+alpha*mu^2. Notice that when alpha>0, the variance of the Negative Binomial distribution is always greater than the variance of the Poisson distribution with the same mean.

There are several other formulations of the Negative Binomial distribution, but this is the one I’ve always seen used so far in analyses of biological and epidemic count data.

The probability of observing X counts with the Negative binomial distribution is (with m=mu and k=1/alpha):
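In the m and k parameterisation, this probability is usually written in the following form. This is a reconstruction from the definitions of m and k given here, which I have labelled Eqn 1; it has mean m and variance m+m^2/k, which matches sigma^2=mu+alpha*mu^2 when k=1/alpha:

```latex
P(X \mid m, k) \;=\; \frac{\Gamma(k+X)}{\Gamma(k)\,X!}
   \left(\frac{m}{m+k}\right)^{X}
   \left(1+\frac{m}{k}\right)^{-k}
   \qquad \text{(Eqn 1)}
```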

Recall that m is the model prediction, and depends on the model parameters.

**Example comparison of Poisson distributed and over-dispersed Negative Binomially distributed data**

The R function for random generation for the Negative Binomial distribution is in the same nightmare format as the Negative Binomial distribution as described on the Wikipedia page I told you not to read above because it’s pretty much incomprehensible. In the AML_course_libs.R file, I have thus put a function for you called my_rnbinom(n,m,alpha) that generates n Negative Binomially distributed random numbers with mean=m and dispersion parameter alpha.
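If you are curious how such a wrapper can work, here is one possible sketch. It is not the code in AML_course_libs.R (which I am not reproducing here); the name my_rnbinom_sketch is hypothetical. It relies on the fact that the base R rnbinom() function accepts a mean, mu, and a size argument, with variance mu+mu^2/size, so that size=1/alpha gives the variance mu+alpha*mu^2.

```r
# Hypothetical sketch of a wrapper like my_rnbinom(n,m,alpha):
# rnbinom() with the mu/size parameterisation has variance mu + mu^2/size,
# so setting size = 1/alpha gives variance mu + alpha*mu^2
my_rnbinom_sketch = function(n, m, alpha){
  rnbinom(n, size = 1/alpha, mu = m)
}

# check the mean and variance of a large simulated sample
set.seed(54321)
m_true = 50
alpha_true = 0.2
y_sim = my_rnbinom_sketch(100000, m_true, alpha_true)
cat("sample mean:    ", mean(y_sim), " expected:", m_true, "\n")
cat("sample variance:", var(y_sim), " expected:", m_true + alpha_true*m_true^2, "\n")
```

The sample mean and variance should land close to the expected values of 50 and 550, compared to a variance of just 50 for Poisson distributed data with the same mean.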

Let’s use this function to generate some Negative Binomially distributed simulated data about some true model, and compare it to Poisson distributed data about the same model:

source("AML_course_libs.R")
require("sfsmisc") # provides the mult.fig() method used below

##########################################
# true model for log of y is linear in x
##########################################
a = 4.00
b = 0.01
x = seq(0,100,0.01)
logy = a+b*x
ypred = exp(logy)

##########################################
# randomly generate Poisson distributed
# and Negative Binomially distributed
# simulated data about the true model
##########################################
set.seed(314300)
yobs = rpois(length(ypred),ypred)
alpha = 0.2
yobs_nb = my_rnbinom(length(ypred),ypred,alpha)

##########################################
# plot the data
##########################################
mult.fig(1)
ylim = c(min(c(yobs,yobs_nb)),max(c(yobs,yobs_nb)))
plot(x,yobs,cex=1.0,xlab="x",ylab="y",ylim=ylim)
points(x,yobs_nb,cex=1.2,col=2)
points(x,yobs,cex=1.0)
lines(x,ypred,col=3,lwd=5)
legend("topleft",legend=c("Poisson distributed data","Negative Binomial distributed data","True model"),col=c(1,2,3),lwd=5,cex=0.9,bty="n")

You can see that the Negative Binomially distributed data are much more broadly dispersed about the true model compared to the Poisson distributed data. The NB data are “over-dispersed”. In real life, nearly all count data are over-dispersed because of various confounders (many of which are often unknowable) that may result in extra variation in the data over and above the hypothesized model.

**Likelihood fitting with the Negative Binomial distribution**

If we had N data points, we would take the product of the probabilities in Eqn 1 to get the overall likelihood for the model, and the best-fit parameters maximize this statistic. And just like we discussed with the Poisson likelihood, the negative of the sum of the logs of the individual probabilities (the negative log likelihood) is the statistic that is usually used, and minimized to determine the best-fit model parameters.
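As a minimal sketch of that statistic, the negative log likelihood can be computed with the base R dnbinom() function, again using size=1/alpha and mu=m. The function name nb_negloglike_sketch and the example numbers below are made up for illustration; in a real fit, m would be the model prediction for each data point, and the statistic would be minimised over the model parameters and alpha.

```r
# Hypothetical helper: Negative Binomial negative log likelihood,
# given observed counts yobs, model predictions m, and dispersion alpha
nb_negloglike_sketch = function(yobs, m, alpha){
  -sum(dnbinom(yobs, size = 1/alpha, mu = m, log = TRUE))
}

# made-up observed counts and model predictions (for illustration only)
yobs_example = c(48, 61, 35, 90, 52)
m_example    = c(50, 55, 60, 65, 70)

nll_02 = nb_negloglike_sketch(yobs_example, m_example, alpha = 0.2)
nll_10 = nb_negloglike_sketch(yobs_example, m_example, alpha = 1.0)
print(c(nll_02, nll_10))
```

A function like this can be handed to a minimiser such as optim() when the model prediction comes from something more complicated than a linear model, such as the solution of a set of differential equations.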

Just like the Poisson likelihood fit, the Negative Binomial likelihood fit uses a log-link for the model prediction, m.

In practice, using a Negative Binomial likelihood fit in place of a Poisson likelihood fit with count data will result in more or less the same central estimates of the fit parameters, but the confidence intervals on the fit estimates will be larger, because the fit now takes into account the fact that the data are more dispersed (have greater stochasticity) than the Poisson model allows for.

Let’s fit to our simulated data above, to illustrate this. The “MASS” package in R has a method called glm.nb that allows you to do Negative Binomial likelihood fits. If you don’t already have it installed, install it now by typing

install.packages("MASS")

Now let’s generate some simulated data that is truly over-dispersed, but fit it with Poisson likelihood then Negative Binomial likelihood:

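# true model (my_rnbinom() comes from AML_course_libs.R, sourced above)
a = 4.00
b = 0.001
x = seq(0,100,1)
logy = a+b*x
ypred = exp(logy)

# simulate over-dispersed data about the true model
set.seed(214388)
alpha = 0.2
yobs_overdispersed = my_rnbinom(length(ypred),ypred,alpha)

# fit with Poisson likelihood, then Negative Binomial likelihood
require("MASS")
pois_fit = glm(yobs_overdispersed~x,family="poisson")
NB_fit = glm.nb(yobs_overdispersed~x,control=glm.control(maxit=1000))
print(summary(pois_fit))
print(summary(NB_fit))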

This produces the following results for the Poisson likelihood fit:

and these results for the NB likelihood fit:

You can see that in the Poisson likelihood fit, the fit coefficient for x appears to be highly statistically significant. But in the Negative Binomial likelihood fit, the confidence intervals are much wider, and the coefficient for x is no longer statistically significant… the NB likelihood fit properly takes into account the extreme over-dispersion in the data, and it properly adjusts the confidence intervals.

Of course, in our true model, log(y) really does depend on x, but if the data are very over-dispersed, and/or you only have a few data points, the sensitivity to detect that relationship is reduced.

**Moral of this story…**

Virtually all count data you will encounter in real life are over-dispersed. In general, Negative Binomial likelihood fits are far more trustworthy to use with count data than Poisson likelihood… the confidence intervals on the fit coefficients will be correct. If you try both types of fits, and the p-values are more or less the same, you can default to the simpler Poisson fits.

**But never use a Poisson fit because you “like the answer better” that comes out of that fit compared to the NB fit (i.e., the Poisson fit gives the apparently significant result you were “hoping” for, whereas it isn’t significant in the NB fit).**

**Model selection in R when using glm.nb()**

Just like we saw with Least Squares fitting using the R lm() method, and Poisson and Binomial likelihood fits using the R glm() method, you can do model selection in multivariate fits with R glm.nb model objects using the R stepAIC() function in the “MASS” library.
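Here is a minimal sketch of what that looks like, using simulated data in which the counts truly depend on one explanatory variable (x1) but not on another (x2). The variable names, the seed, and the simulated model are all made up for illustration, and the data are generated directly with rnbinom() using size=5, which corresponds to alpha=0.2.

```r
require("MASS")

# simulated over-dispersed count data: depends on x1, but not on x2
set.seed(20240)
n = 300
x1 = runif(n)
x2 = runif(n)
mu = exp(1 + 2*x1)
y = rnbinom(n, size = 5, mu = mu)
sim = data.frame(y = y, x1 = x1, x2 = x2)

# fit the full model with both explanatory variables, then let
# stepAIC() drop terms as long as doing so lowers the AIC
full_fit = glm.nb(y ~ x1 + x2, data = sim)
best_fit = stepAIC(full_fit, trace = FALSE)
print(formula(best_fit))
print(summary(best_fit))
```

With no scope argument, stepAIC() does backward selection starting from the full model. Setting trace=FALSE just suppresses the step-by-step printout; leave it at the default if you want to see which terms were tried.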


Sometimes in applied statistics we would like to test whether or not two sets of data appear to be drawn from the same underlying probability distribution. Most of the time we don’t even know what that probability distribution is… it could be some crazily shaped distribution that doesn’t match a nicely behaved smooth distribution like the Normal distribution, for example.

As an example, read in the data file drug_mortality_and_2016_election_data.csv. This file contains county-level data on per capita drug mortality rates from 2006 to 2016, along with the fraction of the voters in each county that voted Republican or Democrat in the 2016 presidential election. Use the following code:

a = read.table("drug_mortality_and_2016_election_data.csv",header=T,as.is=T,sep=",")
a = subset(a,!is.na(rate_drug_death_2006_to_2016_wonder))

a_rep = subset(a,prep_2016>pdem_2016)
a_dem = subset(a,prep_2016<pdem_2016)

x1 = a_rep[,3]
x2 = a_dem[,3]
require("sfsmisc")
mult.fig(4)
breaks = seq(0,max(a$rate_drug_death_2006_to_2016_wonder),length=100)
hist(x1,col=2,main="Republican majority counties",xlab="Drug death rate",breaks=breaks,freq=F)
hist(x2,col=3,main="Democrat majority counties",xlab="Drug death rate",breaks=breaks,freq=F,add=F)

We would like to know if those two distributions are consistent with being drawn from the same underlying probability distribution.

To do this, we turn to what are known as non-parametric tests. “Non-parametric” means that we do not assume some parameterisation of the underlying probability distribution (for instance, by assuming it’s Normal).

The Kolmogorov-Smirnov test (KS test) is an example of one such test. Given two data samples, X_1 and X_2, the algorithm first sorts both samples from smallest to largest, then creates a plot that shows the cumulative fraction in each sample below each value of X (note that X_1 and X_2 data samples do not need to be the same size). The KS test then looks at the maximum distance, D, between those two curves. As an example of this with our drug mortality data, use the code:

xtot = sort(c(x1,x2))
cumsum_x1 = numeric(0)
cumsum_x2 = numeric(0)
for (i in 1:length(xtot)){
  cumsum_x1 = c(cumsum_x1,sum(x1<xtot[i]))
  cumsum_x2 = c(cumsum_x2,sum(x2<xtot[i]))
}
cumsum_x1 = cumsum_x1/length(x1)
cumsum_x2 = cumsum_x2/length(x2)

mult.fig(1)
plot(xtot,cumsum_x1,type="l",col=2,lwd=8,ylab="Cumulative fraction in each sample",xlab="Drug death rate")
lines(xtot,cumsum_x2,col=3,lwd=8)
legend("bottomright",legend=c("Republican majority counties","Democrat majority counties"),col=c(2,3),lwd=4,bty="n")

D = abs(cumsum_x1-cumsum_x2)
iind = which.max(D)
arrows(xtot[iind],cumsum_x1[iind],xtot[iind],cumsum_x2[iind],code=3,lwd=4,length=0.1)

The larger the value of D, the greater the difference between the two distributions. The KS-test bases its statistical test on the value of D.
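As a quick check of this idea on data you can generate yourself, the sketch below computes D by hand with the base R ecdf() function (which returns the cumulative fraction at or below each value), and compares it to the statistic reported by ks.test(). The two simulated samples and their names are made up for illustration, and deliberately have different sizes.

```r
# two made-up samples of different sizes (for illustration only)
set.seed(42)
s1 = rnorm(200, mean = 0.0, sd = 1)
s2 = rnorm(150, mean = 0.5, sd = 1)

# empirical cumulative distributions of each sample,
# evaluated at every value in the pooled, sorted data
pooled = sort(c(s1, s2))
cum_s1 = ecdf(s1)(pooled)
cum_s2 = ecdf(s2)(pooled)

# D is the maximum vertical distance between the two curves
D_manual = max(abs(cum_s1 - cum_s2))

# compare to the statistic computed by ks.test()
D_ks = unname(ks.test(s1, s2)$statistic)
print(c(D_manual = D_manual, D_ks = D_ks))
```

The two values should agree, which confirms that the statistic ks.test() reports for two samples is just this maximum distance between the cumulative curves.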

The KS-test is implemented in R with the ks.test() function. It takes as its arguments the two samples of data.

k = ks.test(x1,x2)
print(k)

It prints out the p-value testing the null hypothesis that the two samples are drawn from the same distribution.

**Difference between the KS test and the Student’s t-test**

You may be wondering why we can’t just use a two sample Student’s t-test to determine if the means of the two distributions are statistically consistent. You could of course do that, but two distributions can have dramatically different shapes, yet have the same mean.

Here’s an example using simulated data where the two sample t-test shows the means of the samples are statistically consistent, but the KS test reveals they have very different shapes:

set.seed(8119990)
x_1 = runif(1000,0,10)
x_2 = rnorm(750,5,2)
require("sfsmisc")
mult.fig(4)
hist(x_1,col=2)
hist(x_2,col=3)
k = ks.test(x_1,x_2)
t = t.test(x_1,x_2)
print(k)
print(t)

**KS test with a known probability distribution**

The KS test can also be used parametrically, comparing the distribution of data to a known probability distribution. For example, if we wanted to compare the x_1 distribution from the above example to a Uniform probability distribution between 0 and 1, we would type

```r
k=ks.test(x_1,"punif",0,1)
print(k)
```

If we wanted to compare it to a Uniform probability distribution between 0 and 10:

```r
k=ks.test(x_1,"punif",0,10)
print(k)
```

If we wanted to compare it to a Normal distribution with mean 5 and standard deviation 2:

```r
k=ks.test(x_1,"pnorm",5,2)
print(k)
```

If we wanted to compare it to a Poisson distribution with mean 6:

```r
k=ks.test(x_1,"ppois",6)
print(k)
```

You get the idea… just use the cumulative distribution function of the probability distribution you want to examine, along with the arguments for that probability distribution. (Strictly speaking, the KS test assumes a continuous distribution, so for a discrete distribution like the Poisson the p-value should be treated as approximate.)

**Propagation of uncertainties**

When doing statistical analyses, given an assumed probability distribution that underlies the stochasticity in the data, you might want to apply a function to the data, then obtain the 95% confidence interval on that transformation.

As an example of this, let’s assume that we are doing an analysis of the annual per capita rate of public mass shootings (with four or more people killed) that involved high capacity firearms during the Federal Assault Weapons ban from September 13, 1994 to September 13, 2004, compared to the rate of mass shootings since the ban lapsed, to the end of 2017.

During the ban, say we observed that there were 12 mass shootings involving high capacity firearms. The ban lasted 10 years, and the average population of the US was 279.2 million people during that time.

After the ban, we observed that there were 35 mass shootings, over a period of 13.3 years, and the average US population during that time was 310.9 million people.

Let’s call the number of mass shootings N, the population P, and the length of the time period T. The annual per capita rate of mass shootings is thus N/(P*T). We would like to estimate the 95% confidence interval on our estimated per capita rate of public mass shootings. To do this, we notice that the number of mass shootings can be assumed to be Poisson distributed, because it is count data. Thus, to estimate the 95% confidence interval on our transformed data, we can generate a large number of Poisson distributed random numbers with mean N, divide those random numbers by (P*T), and then use the R quantile() function to determine the 95% confidence interval. Like so:

```r
N = 12
T = 10
P = 279.2
vN = rpois(1000000,N)
vrate = vN/(P*T)
q = quantile(vrate,probs=c(0.025,0.975))
cat("The average annual rate per million people is:",N/(P*T),"\n")
cat("The 95% confidence interval is [",q[1],",",q[2],"]\n",sep="")
```

Now, what if we wanted to determine the ratio, R, of the rate after the ban to the rate during the ban? This is N_2/(P_2*T_2) divided by N_1/(P_1*T_1). We also want to estimate the 95% confidence interval on this quantity to determine if the interval contains 1. If it does, then there is no significant difference between the two rates.

```r
set.seed(4289588)
N_1 = 12
T_1 = 10
P_1 = 279.2
N_2 = 35
T_2 = 13.3
P_2 = 310.9
vN_1 = rpois(1000000,N_1)
vN_2 = rpois(1000000,N_2)
vR = vN_2/(P_2*T_2)
vR = vR/(vN_1/(P_1*T_1))
R = N_2/(P_2*T_2)
R = R/(N_1/(P_1*T_1))
q = round(as.numeric(quantile(vR,probs=c(0.025,0.975))),2)
cat("The ratio of the rates is:",round(R,2),"\n")
cat("The 95% confidence interval is [",q[1],",",q[2],"]\n",sep="")
```

We see that the 95% CI does not include the number 1, thus we reject the null hypothesis that the two rates are statistically consistent.

If we wanted to get the p-value for the difference between the rates, we could have used population standardized Poisson regression, with a factor, vperiod, on the RHS of the regression equation indicating whether or not the period was during the ban, or after it:

```r
vN = c(N_1,N_2)
vP = c(P_1,P_2)
vT = c(T_1,T_2)
vperiod = c(1,2)
myfit = glm(vN~offset(log(vP))+offset(log(vT))+factor(vperiod),family="poisson")
print(summary(myfit))
pvalue = summary(myfit)$coef[2,4]
cat("The p-value testing the null hypothesis the rates are statistically consistent is:",pvalue,"\n")
```

The “intercept” coefficient is the log of the rate when vperiod=1, and the sum of the first and second coefficients is the log of the rate when vperiod=2.

In a paper, we could have just stated the regression coefficient as our result, but it is more easily understood by the average reader if instead the results are expressed as a ratio of the rates, with the 95% confidence interval on that ratio. You can still use the p-value from the population standardized regression to assess the null hypothesis that the rates are statistically consistent.
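To connect the regression output to the ratio of the rates: with a log link, the coefficient of the period factor is the log of the rate ratio, so exponentiating it should reproduce R. Here is a minimal sketch using the same counts as above. The variable names, and the use of confint.default() to get a Wald interval, are my choices for illustration, and the Wald interval will not exactly match the Monte Carlo interval:

```r
N = c(12, 35)        # number of mass shootings during and after the ban
Tyr = c(10, 13.3)    # length of each period in years
P = c(279.2, 310.9)  # average population in millions
period = factor(c("during", "after"), levels=c("during", "after"))

fit = glm(N ~ offset(log(P)) + offset(log(Tyr)) + period, family="poisson")

# exponentiate the period coefficient to get the ratio of the rates
ratio_from_fit = unname(exp(coef(fit)[2]))
ratio_direct = (N[2]/(P[2]*Tyr[2]))/(N[1]/(P[1]*Tyr[1]))

# Wald 95% confidence interval on the ratio (Normal approximation on the log scale)
ci = exp(confint.default(fit)[2,])
cat("Ratio from fit:", round(ratio_from_fit,2), " direct:", round(ratio_direct,2), "\n")
cat("Wald 95% CI: [", round(ci[1],2), ",", round(ci[2],2), "]\n", sep="")
```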


Previous modules have discussed regression using the Least Squares fit statistic, and also Poisson and Binomial (logistic) likelihood fits.

Likelihood fits can sometimes seem a bit daunting to first time users, because Least Squares is a very intuitive fit statistic (minimizing the sum of squares of the distances between the data points and the model), but likelihoods perhaps less so. It perhaps doesn’t help that it is usually not pointed out that Least Squares itself is a fit statistic derived from a likelihood expression, and minimizing the Least Squares statistic maximizes that likelihood.

Recall that the underlying assumptions of the Least Squares fitting method are that the data are Normally distributed, with the same standard deviation, sigma (i.e., the data are homoskedastic), and the points are independently distributed about the true model, mu.

This means that the probability distribution for some observed dependent variable, y, is the Normal distribution:

$$P(y|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \qquad (1)$$

Note that mu might depend on some explanatory variables. Perhaps in a linear fashion, like (for example)

$$\mu = \beta_0 + \beta_1 x$$

Now, just like we saw in the modules on Poisson and Binomial regression, the Least Squares likelihood of observing some set of dependent variables, y_i, given predicted model values for each point, mu_i, is derived from the product of the probability densities seen in Equation 1

$$L = \prod_{i=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)$$

(note that the sigma is the same for each data point because of the homoskedasticity assumption of Least Squares). The best fit parameters in the mu_i function will maximize this likelihood. You could call this likelihood the “homoskedastic Normal likelihood”.

Recall that having a product of probabilities (all of which are between 0 and 1) can be problematic in practical computation because of underflow errors when using numerical methods to optimize the likelihood. Thus, in practice, we always take the log of both sides of the likelihood equation (the parameters in the calculation of mu that maximize the likelihood will also maximize the log likelihood).

This yields:

$$\log L = -N\log\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N}\frac{(y_i-\mu_i)^2}{2\sigma^2}$$

As we discussed in the Poisson and Binomial regression modules, in practice, numerical optimization methods in the underlying guts of statistical software packages minimize goodness-of-fit statistics, rather than maximize them. For this reason, we minimize the negative log likelihood:

$$-\log L = N\log\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-\mu_i)^2$$

Notice that because sigma is the same for all points, the first term is just a constant. And you’ll recognize the second term as the Least Squares statistic divided by 2*sigma^2.

Thus, whatever model parameters that go into the calculation of mu_i that minimize the Least Squares statistic will also minimize the Normal negative log likelihood!

**And thus Least Squares fits can equivalently be thought of as a homoskedastic Normal likelihood fit.**
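You can check this equivalence numerically. The following is a minimal sketch on simulated data (the true line, the noise level, and the function name negloglike are all assumed for illustration): it minimizes the Normal negative log likelihood directly with optim(), and compares the result to the lm() Least Squares fit:

```r
set.seed(42)
x = seq(0, 10, length.out=50)
y = 2 + 0.5*x + rnorm(50, mean=0, sd=1)  # simulated data about an assumed true line

# Normal negative log likelihood with a linear model for mu
# sigma is held fixed; any fixed value gives the same best-fit betas
negloglike = function(par, sigma=1){
  mu = par[1] + par[2]*x
  -sum(dnorm(y, mean=mu, sd=sigma, log=TRUE))
}

# minimize the negative log likelihood numerically
like_fit = optim(c(0, 0), negloglike, method="BFGS", control=list(reltol=1e-12))

# the usual Least Squares fit
ls_fit = lm(y ~ x)

print(like_fit$par)
print(coef(ls_fit))
```

The intercept and slope from the two methods should agree to within the tolerance of the numerical optimizer.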


In previous modules, we talked about the importance of model selection; selecting the most parsimonious model with the best predictive power for a particular data set. In particular, we have discussed the R stepAIC() method, which takes as its argument an R linear model fit object from either the lm() least squares linear regression method, or the glm() generalized linear model method (with, for example, the Poisson or Binomial families).

Model selection is important because the more potential explanatory variables you put on the right hand side of the equation in a statistical model, the larger the uncertainties on the fitted coefficients, and there is a real risk of masking significant relationships to true explanatory variables if variables with no explanatory power are included on the right hand side.

Beyond this, however, is the issue of model validation; ensuring that a model has good predictive power for an independent similar set of data.

A very simple and straightforward way to do this, for example, would be to divide your data in half, and label one sample the “training sample”, and the other sample the “testing sample”. Your statistical model then gets fit to the “training sample”, and then you predict the values of the dependent variable for the testing sample using your trained model. If the model truly has good predictive power, the predicted values for the test sample will describe a significant amount of the variance in the dependent variable.
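Before turning to real data, the split-sample idea can be sketched on simulated data. Everything here (the data, the variable names, and the use of the squared correlation as a summary of the variance described) is assumed purely for illustration:

```r
set.seed(7)
n = 400
x = runif(n, 0, 10)
noise_var = rnorm(n)              # a variable with no true explanatory power
y = 5 + 2*x + rnorm(n, 0, 3)      # assumed true relationship plus Normal noise
simdat = data.frame(x=x, noise_var=noise_var, y=y)

# first half is the training sample, second half is the testing sample
train = simdat[1:(n/2),]
test = simdat[(n/2+1):n,]

# fit to the training sample only, then predict the testing sample
fit_train = lm(y ~ x + noise_var, data=train)
ypred = predict(fit_train, newdata=test)

# fraction of the variance in the test sample described by the prediction
r2_test = cor(ypred, test$y)^2
cat("Squared correlation of predicted and observed test values:", r2_test, "\n")
```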

**Example of model validation with a split sample**

For this initial example, we will be doing a Least Squares fit to daily incidence data of assaults and batteries in Chicago from 2001 to 2012 (note: why is this perhaps not the best fitting method to use for these data?)

To do this study, you will need to download the files chicago_pollution.csv, chicago_weather_summary.csv, and chicago_crime_summary.csv

You will also need to download the file AML_course_libs.R that has several helper functions related to calculating things related to dates, and also the number of daylight hours by day of year, at a particular latitude.

The file chicago_crime_read_in_data_utils.R contains a function read_in_crime_weather_pollution_data() that takes as its arguments year_min and year_max that are used to select the date range. It returns a data frame with the daily ozone and particulate matter, temperature, humidity, air pressure, wind speed, assaults and batteries, thefts, and burglaries.

Download all the files and type the following code:

```r
source("chicago_crime_read_in_data_utils.R")
mydat = read_in_crime_weather_pollution_data(2001,2012)

print(names(mydat))

####################################################
# subset the data into a training and testing
# sample
####################################################
mydat_train = subset(mydat,year<=2006)
mydat_test = subset(mydat,year>2006)
```

This code divides the data frame into two halves.

The contents of the data frame are:

Now let’s fit a model to the daily number of assaults in the training data that includes all the weather variables, and pollution variables, and also includes weekday as a factor, number of daylight hours, and linear trend in time:

```r
####################################################
# fit a model with trend, weekday, daylight
# hours, weather variables, and air pollution
# variables
####################################################
model_train = lm(assault~date+
                 factor(weekday)+
                 daylight_hours+
                 temperature+
                 humidity+
                 wind+
                 air_pressure+
                 ozone+
                 particulate_matter,
                 data=mydat_train)

print(summary(model_train))
print(AIC(model_train))

require("sfsmisc") # needed for mult.fig()
mult.fig(4,main="Chicago crime data 2001 to 2012")
plot(mydat_train$date,
     mydat_train$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Training data")
lines(mydat_train$date,model_train$fit,col=2,lwd=4)
```

Not all of the potential explanatory variables were perhaps needed in the fit. Let’s check by doing model selection using stepAIC():

```r
####################################################
# now do model selection based on the training data
####################################################
require("MASS")
sub_model_train = stepAIC(model_train)

print(summary(sub_model_train))
print(AIC(sub_model_train))
lines(mydat_train$date,sub_model_train$fit,col=4,lwd=2)
legend("topright",
       legend=c("Full model","Best-fit model"),
       col=c(2,4),
       lwd=4,
       bty="n",
       cex=0.8)
```

This produces the following plot. Which variables got dropped from the fit?

Now let’s see how well this fitted model predicts the patterns in the testing data set. We use the R predict() function to do this:

```r
####################################################
# now predict the results for the test data
# and overlay them
####################################################
y = predict(sub_model_train,mydat_test)

plot(mydat_test$date,
     mydat_test$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Testing data")
lines(mydat_test$date,y,col=3,lwd=2)
```

This code produces the plot:

Well…. just visually, the model appears to do a reasonable job of predicting the next six years of data. But what we need is a quantification of how much better it fits compared to the null hypothesis model (which is just fitting a flat line).

The following code fits the null model to the test data, then “fits” the extrapolated prediction from the training model as an offset() with no intercept term. This isn’t really a fit at all, in the sense that it has no coefficients, but it does allow us to extract the AIC, and compare it to the AIC of the null model. If the extrapolated data fits the test data better than the null hypothesis model, the AIC will be smaller:

```r
####################################################
# fit the null model to the test sample
# (just the mean)
# In order to compare this to how well the
# extrapolation of the training sample fits the
# data, "fit" that model as offset(y) but without
# an intercept (-1 on the RHS forces the fit
# to not include an intercept term)
####################################################
null_fit_test = lm(mydat_test$assault~1)
comparison_training_fit_test = lm(mydat_test$assault~offset(y)-1)

print(AIC(null_fit_test))
print(AIC(comparison_training_fit_test))
```

Does the extrapolated fit do a better job of fitting the data than the null model?

**Example of split sample validation with Poisson likelihood fit**

In the previous example, we used Least Squares regression to fit count data. Even though there were plenty of assaults per day (and thus the Poisson distribution approaches the Normal), this still isn’t the best method to be using with count data. It would be better if we used Poisson likelihood fits with the glm() method instead:

```r
####################################################
# now do a Poisson likelihood fit instead
####################################################
model_train = glm(assault~date+
                  factor(weekday)+
                  daylight_hours+
                  temperature+
                  humidity+
                  wind+
                  air_pressure+
                  ozone+
                  particulate_matter,
                  data=mydat_train,
                  family="poisson")

print(summary(model_train))
print(AIC(model_train))

plot(mydat_train$date,
     mydat_train$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Training data")
lines(mydat_train$date,model_train$fit,col=2,lwd=4)

# model selection on the Poisson fit to the training data
sub_model_train = stepAIC(model_train)
print(summary(sub_model_train))
print(AIC(sub_model_train))

lines(mydat_train$date,sub_model_train$fit,col=4,lwd=2)
legend("topright",
       legend=c("Full model","Best-fit model"),
       col=c(2,4),
       lwd=4,
       bty="n",
       cex=0.8)
```

Let’s predict the results for the test data, and then get the AIC of the model with the predicted results and compare it to the null hypothesis model of just a flat line. **Note that Poisson regression has a log-link, thus you have to make sure you take the log of the predicted values in the offset() function on the RHS of the fit equation!**

```r
####################################################
# now predict the results for the test data
# and overlay them
####################################################
y = predict(sub_model_train,mydat_test,type="response")

plot(mydat_test$date,
     mydat_test$assault,
     cex=2,
     xlab="Date",
     ylab="Number of assaults per day",
     main="Testing data")
lines(mydat_test$date,y,col=3,lwd=2)

####################################################
# fit the null model to the test sample
# (just the mean)
# In order to compare this to how well the
# extrapolation of the training sample fits the
# data, "fit" that model as offset(log(y)) but
# without an intercept (-1 on the RHS forces the
# fit to not include an intercept term)
####################################################
null_fit_test = glm(mydat_test$assault~1,family="poisson")
comparison_training_fit_test = glm(mydat_test$assault~offset(log(y))-1,family="poisson")

print(AIC(null_fit_test))
print(AIC(comparison_training_fit_test))
```

This example code produced the following plot:

Note the fit results look quite similar to the Least Squares fits we did above. This is because there were quite a few assaults per day, and thus the Poisson distribution was in the Normal regime. We probably could have gotten away with a Least Squares analysis, but it is always good to be rigorous and use the most appropriate probability distribution.

**Monte Carlo methods for model validation**

Especially for time series data, it is always a good idea to show that a model that is fit to the first half of the time series has good predictive power for the second half, which is what we did in the above example. More generally, however, it is a good idea to show that a model trained on one half of the data randomly selected from the sample usually has good predictive power for the second half.

If you randomly select half the data many times, train the model, then compare its predictive capability to the second half, you can see how often the model prediction for the second half has better predictive power than just a null model.

The issue of model selection starts to get a bit tricky, however, because the terms that might be selected for one randomly selected training sample might not be exactly the terms selected for a different randomly selected sample.

One thing you can do is do the model selection process on the full sample. Then do a repetitive process where you fit that model to the randomly selected training sample, and see how often the extrapolated model provides a better prediction for the remaining data.

In the following code, we use the R formula() function to extract the model terms for the selected model that is produced by the stepAIC function:

```r
####################################################
# first fit to the full sample, and
# select the most parsimonious best-fit model
####################################################
model = glm(assault~date+
            factor(weekday)+
            daylight_hours+
            temperature+
            humidity+
            wind+
            air_pressure+
            ozone+
            particulate_matter,
            data=mydat,
            family="poisson")

# model selection on the fit to the full sample
sub_model = stepAIC(model)
sub_formula = formula(sub_model)
cat("\n")
cat("The formula of the sub model is:\n")
print(sub_formula)
```

Now do many Monte Carlo iterations where we fit this sub model to one half of the data, randomly selected, and test it on the second half:

```r
####################################################
# now do many iterations, randomly selecting
# half the data for training the sub model, and
# testing on the remaining half
####################################################
vAIC_null = numeric(0)
vAIC_comparison = numeric(0)
for (iter in 1:100){
  cat("Doing iteration:",iter,100,"\n")

  iind = sample(nrow(mydat),as.integer(nrow(mydat)/2))
  mydat_train_b = mydat[iind,]
  mydat_test_b = mydat[-iind,]

  sub_model_train_b = glm(sub_formula,data=mydat_train_b,family="poisson")

  y = predict(sub_model_train_b,mydat_test_b,type="response")
  null_fit_test = glm(mydat_test_b$assault~1,family="poisson")
  comparison_training_fit_test = glm(mydat_test_b$assault~offset(log(y))-1,family="poisson")

  vAIC_null = c(vAIC_null,AIC(null_fit_test))
  vAIC_comparison = c(vAIC_comparison,AIC(comparison_training_fit_test))
}

f = sum(vAIC_comparison<vAIC_null)/length(vAIC_null)
cat("The fraction of times the predicted model did better than the null:",f,"\n")
```

**The R bootStepAIC package**

R has a package called bootStepAIC. Within that package, there is a method called boot.stepAIC(model,data,B) that takes as its arguments an lm or glm model object previously fit to the data. It also takes as its argument the data frame the data were fit to, and a parameter, B, that states how many iterations should be done in the procedure.

If there are N points in the data sample, what boot.stepAIC does is randomly sample N points from that data, with replacement, to create what is known as a “bootstrapped” data sample. Thus, the sampled data set looks somewhat like the original data set, but with some duplicated points, and some points missing. For large data samples, the bootstrapped data set will contain, on average, a fraction (1-1/e)~0.632 of the distinct points in the original data set.
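The 0.632 figure is easy to check by simulation. In this minimal sketch (the sample size and number of repetitions are arbitrary choices for illustration), we draw bootstrapped samples of the indices 1 to n and count what fraction of the original points show up at least once:

```r
set.seed(123)
n = 10000

# for each bootstrapped sample, the fraction of the original points
# that appear at least once
frac = replicate(200, length(unique(sample(n, n, replace=TRUE)))/n)

cat("Average fraction of distinct original points:", mean(frac), "\n")
cat("1-1/e =", 1-exp(-1), "\n")
```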

For the bootstrapped data set, boot.stepAIC performs the stepAIC procedure, to determine which explanatory variables form the most parsimonious model with best explanatory power. It then stores the information of which variables were chosen.

Then it samples another bootstrapped sample and repeats the procedure. And again, and again, and again, until the iteration limit, B, has been reached (the default is B=100 iterations of this Monte Carlo bootstrapping procedure).

At the end of the procedure, the boot.stepAIC method tells you how often each explanatory variable was selected in the stepAIC procedure. An explanatory variable that was selected 100% of the time, for example, is likely to have good explanatory power for independent data sets. If it only was selected 20% of the time, for example, it is unlikely to have good general predictive power, and is likely reflecting over-fitting to statistical fluctuations in your particular data.

```r
require("MASS")

model = glm.nb(assault~date+
               factor(weekday)+
               daylight_hours+
               temperature+
               humidity+
               wind+
               air_pressure+
               ozone+
               particulate_matter,
               data=mydat)
if (!"bootStepAIC"%in%installed.packages()[,1]){
  install.packages("bootStepAIC")
}
require("bootStepAIC")
b = boot.stepAIC(model,mydat,B=25)
print(b)
```

This produces the following output (amongst other things):

We can see that all variables except for wind were selected 100% of the time out of the 25 iterations. Except for wind, the coefficient for the other explanatory variables was either consistently +’ve or consistently -’ve 100% of the time. However, when we look at how often each variable was significant to p<0.05 in the fits, humidity was not always significant, and wind never was (ignore the weekday=6 factor level… if one of the levels is always significant, you should keep that factor).

How to proceed from here is largely arbitrary… If you use large B (use at least B=100… I used a smaller value of B here just to make the code run faster) you of course should keep all variables that are selected 100% of the time, that were always significant to p<0.05, and that were consistent in the sign of their influence. As for whether or not you relax the selections to include variables that were significant in at least 80% of the iterations (for example), that is up to you. **But because this is an arbitrary selection, it’s a good idea to do a cross-check of the robustness of your analysis conclusions to an equally reasonable selection, like 70%, or 90%.** Or, rely on the selection used in past papers on the subject; for example, this paper recommends a 60% selection. The former is preferable, but the latter will get through review.

To see how this procedure is talked about in a typical publication, see this paper. Specifically, pay attention to the second to last paragraph, where it is made clear that the model validation and robustness crosschecks are an important part of the analysis, and the middle and bottom of page 8 where the bootstrapping method and model validation methods are described.

This paper gives more information about bootstrapping methods. And this book is a good one to cite on the topic of the importance of model validation.


Data that are expressed as per-capita rates are frequently encountered in the life and social sciences. For example the per-capita rate of incidence of a certain disease, or per-capita crime rates. In both examples, “per-capita” means “per person”. And because the per-capita rate is expressed as “per person”, sometimes it might be easy to get confused and think that perhaps Binomial linear regression methods might be most appropriate because the data appear at first blush to be expressed as a fraction of the population.

But you’d be wrong… Binomial linear regression methods are only appropriate for fractions that are strictly constrained to be between 0 and 1. When we are talking about crime rates, for example, it’s entirely possible if crime were rampant that someone might be robbed several times a year. The per-capita annual rate could thus be above 1!

When talking about per-capita rates, the data consist of a counted number per some unit time, thus regression methods like Poisson or Negative Binomial methods would be appropriate. But we have to account for the population size, M, because obviously if the population doubles, the number of cases of crime we would count per unit time would also double (if the per-capita rate stayed the same).

To take this into account in a regression analysis, we use what is called “population-standardized” regression. Recall that Poisson regression methods use a log link for the expected number of events, lambda. In population-standardized linear regression with one explanatory variable (for example) we thus have

$$\log(\lambda) = \log(M) + \beta_0 + \beta_1 x$$

Notice that if we bring the log(M) over the LHS, we get log(lambda/M)… the per-capita rate!

Also notice that the log(M) term does not (nor should it) have a coefficient multiplying it in the fit. Thus, if your observed number of crimes are in the y vector, what you do NOT want to do is the following (recall that the glm function with family=poisson in R uses a log-link by default):

b = glm(y~log(M)+x,family="poisson")

This is because R will attempt to find some kind of best-fit coefficient for the log(M) term, when what we actually need to do is force its coefficient to be 1 to be able to interpret our output as a per-capita rate.

The way to do this in R is to use the offset() function:

b = glm(y~offset(log(M)) + x, family="poisson")

Now, in the fit the log(M) term is forced to have coefficient equal to one.
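To see the offset at work, here is a minimal sketch on simulated data. The population sizes, the explanatory variable, and the true coefficients (-1 and 0.3) are all made up for illustration; the point is only that the fit with offset(log(M)) recovers the parameters of the per-capita rate:

```r
set.seed(2024)
npts = 200
M = runif(npts, 1, 50)        # hypothetical population sizes
x = rnorm(npts)               # hypothetical explanatory variable
lambda = M*exp(-1 + 0.3*x)    # expected counts: population times per-capita rate
y = rpois(npts, lambda)       # simulated Poisson distributed counts

b = glm(y ~ offset(log(M)) + x, family="poisson")
print(coef(b))   # should be close to the assumed true values of -1 and 0.3

# the per-capita rate at x=0 is the exponential of the intercept
cat("Estimated per-capita rate at x=0:", exp(coef(b)[1]), "\n")
```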

Here is a 2000 paper by W. Osgood on the subject of population standardized Poisson regression.

**Example**

Let’s look at the annual per-capita rates of public mass shootings before, during, and after the Federal Assault Weapons Ban, which was in effect from September 13, 1994 to September 13, 2004. In the file mass_killings_data_1982_to_2017.csv, there is the annual number of public mass shootings from 1982 to 2017, as obtained from the Mother Jones mass shootings data base, and supplemented with a few public mass shootings they missed, as listed in the USA Today mass killings data base. Also in the file is the US population that year (in millions), as obtained from the US Census Bureau. There is also a categorical variable, lperiod, indicating whether the year was before (lperiod=0), during (lperiod=1), or after (lperiod=2) the assault weapons ban (note that lperiod=1 when the weapons ban was in place for most of that particular year). Also included in the file is the number of people killed in mass shootings each year (the last two variables aren’t used in the following example).

The following code reads in this data, and does a population standardized fit, then plots the data and fit results:

```r
adat=read.table("mass_killings_data_1982_to_2017.csv",sep=",",as.is=TRUE,header=TRUE)
b = glm(number_shootings~offset(log(pop)),family="poisson",data=adat)
plot(adat$year,adat$number_shootings,cex=2,xlab="Date",ylab="Number mass shootings")
lines(adat$year,b$fit,col=2,lwd=5)
```

You can see that the fit estimates properly take into account that the US population went up from 1982 to 2017, and thus, even if the per-capita rate were the same over that entire period, we would expect a higher annual *number* of events in 2017 compared to 1982.

The summary of the fit from summary(b) yields

To interpret this output, we note that the glm family="poisson" fit uses a log link. **Thus we must take the exponential of the intercept term to get the average expected per-capita annual rate**. This is exp(-4.6701)=0.0094 per million people, per year.

We can compare this to what we get if we just take the average of the number of killings per million people per year: mean(adat$number_shootings/adat$pop)=0.0091. The two numbers are almost identical… and they should be!
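The reason the two numbers are close but not exactly equal is that, for an intercept-only fit with an offset, the Poisson maximum likelihood estimate of the rate is the total number of events divided by the total population summed over the years, sum(y)/sum(pop), rather than the average of the yearly ratios. A minimal sketch on simulated data (the populations and counts here are made up, and are not the mass shootings data):

```r
set.seed(99)
pop = seq(230, 325, length.out=36)   # hypothetical populations in millions
y = rpois(36, 0.01*pop)              # hypothetical annual counts

b = glm(y ~ offset(log(pop)), family="poisson")

rate_fit = unname(exp(coef(b)[1]))   # rate from the population standardized fit
rate_total = sum(y)/sum(pop)         # total events over total population-years
rate_mean = mean(y/pop)              # average of the yearly ratios

cat("From fit:", rate_fit, " sum(y)/sum(pop):", rate_total, " mean(y/pop):", rate_mean, "\n")
```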

In the literature, you may see people doing fits to (y/pop) using least squares regression, but that is the wrong method to use: the y are Poisson distributed, and simply dividing by pop does not magically transform things to make (y/pop) Normally distributed, so the data will not satisfy the Normality assumption of the least squares method.

**Other potential types of standardized regression**

If one assumes that the population density of some animal (or people) is the same over some study areas, but might depend on time (for example), if the researchers count the individuals, y, in places with areas, A, at distinct points in time, t, they can estimate the change in population density using area standardized regression, using a function call like:

b = glm(y~offset(log(A)) + t, family="poisson")

Note that Poisson regression is the appropriate fitting method to use, because these are count data (and may in fact involve low counts).

Another type of standardized fit might be if a researcher had counts of some organism in different samples of fluid with volumes, V, and wished to see how this might depend on some other explanatory variable.

The Binomial probability distribution is appropriate for modelling the stochasticity in data that either consists of 1's and 0's (where 1 represents a “success” and 0 represents a “failure”), or fractional data like the total number of “successes”, k, out of n trials.

*Note that if our data set consists of n 1's and 0's, k of which are 1's, we could alternatively express our 1's and 0's data as k successes out of n trials.*

There are other probability distributions that can be used to model the stochasticity in fractional data, like the Beta Binomial distribution, but the Binomial probability distribution is the simplest of the probability distributions for modelling the number of successes out of n trials. The Binomial probability mass function for observing k successes out of n trials when a fraction p is expected, is

$$P(k|n,p) = \binom{n}{k}p^{k}(1-p)^{n-k}$$

The parameter p is our “model expectation”, and can be just a constant, or a function of some explanatory variables, and is the expected value of k/n.

Note that if our data are 1's and 0's, each point could be considered a “trial”, where k could be either 1 or 0 for each data point, and n would be 1 for each data point. This special case of the Binomial distribution is known as the Bernoulli distribution.

Our predicted probability of success, p, could, in theory at least, be a linear function of some explanatory variable, x:

$$p = \beta_0 + \beta_1 x$$

*However, this can present problems, because p must necessarily lie between 0 and 1 (because it is a fraction), but the explanatory variable might be negative, or even if it were positive, beta_0 and beta_1 might be such that the predicted value for p lies outside 0 and 1. This is a problem!*

Partly for this reason, Binomial logistic regression generally assumes what is known as a “logit-link”. The logit of a fraction is log(p/(1-p)), also known as the log-odds, because p/(1-p) is the odds of success. It is this logit link that gives “logistic regression” its name.

Note that because p lies between 0 and 1, p/(1-p) lies in the range of 0 to infinity. This means that the logit of p (the log of the odds) lies between -infinity and +infinity. With the logit-link, we regress the logit of p on the explanatory variables. For linear regression with one explanatory variable, this looks like:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

Because the logit lies in the range of -infinity to +infinity, now it doesn’t matter if the expression on the RHS of the equation is negative… the reverse transform will always give back a value of p between 0 and 1.

Nice!

By the way, if we call logit(p)=A, then the reverse transformation to calculate p is

$$p = \frac{e^{A}}{1+e^{A}}$$
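As a quick check of the logit transform and its reverse, here is a minimal sketch in R. The function names logit and inv_logit are just illustrative choices, not something from the course scripts; base R provides the same transforms as qlogis() and plogis().

```r
# Logit (log-odds) and its reverse transform; the function names are illustrative
logit = function(p) log(p/(1-p))
inv_logit = function(A) exp(A)/(1+exp(A))

p = c(0.01,0.25,0.5,0.75,0.99)
A = logit(p)
print(A)               # negative for p<0.5, positive for p>0.5
print(inv_logit(A))    # recovers the original fractions

# whatever the value of the linear predictor, the reverse transform
# gives back a value between 0 and 1
print(inv_logit(c(-50,-5,0,5,50)))

# base R equivalents
print(all.equal(A,qlogis(p)))
print(all.equal(inv_logit(A),plogis(A)))
```

In the rest of the sketches in this module plogis() is used for the reverse transform, simply because it is already built in.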

Let’s assume that we have N data points of some observed data, k_i successes out of n_i trials, where i = 1,…,N. This could be, for example, the daily fraction of firearms the TSA detects that have a round chambered, over a period of N days. k_i is the number found each day with a round chambered, and n_i is the total number found each day. The logit of the expected fraction might, at least hypothetically, depend linearly on time, x_i. In this case, our model looks like

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i$$

In order to fit for beta_0 and beta_1 (or whatever the parameters of our model are), we need some “goodness of fit” statistic that we can optimize to estimate our best-fit values of our model with Binomially distributed data…

**Binomial likelihood**

The likelihood of observing our N data points, k_i, out of n_i when p_i are expected for each point is the product of the individual Binomial probabilities:

$$L = \prod_{i=1}^{N} \binom{n_i}{k_i}\, p_i^{k_i} (1-p_i)^{n_i-k_i} \tag{1}$$

The “best-fit” parameters in the functional dependence of p_i on the explanatory variable, x_i (or variables… there doesn’t need to just be one), are the parameters that maximize this likelihood.

However, just as was pointed out in our discussion of Poisson regression methods for count data, in practice, underflow problems happen when you multiply a whole bunch of likelihoods (probabilities) together, each of which is between 0 and 1. To avoid this, what is normally done is to take the logarithm of both sides of Eqn 1, and what is maximized is the logarithm of the likelihood, log(L):

$$\log L = \sum_{i=1}^{N} \left[ \log\binom{n_i}{k_i} + k_i \log p_i + (n_i-k_i)\log(1-p_i) \right]$$

The R glm() method with the family="binomial" option allows us to fit linear models to Binomial data, using a logit link, and the method finds the model parameters that maximize the above likelihood. If the success data are in a vector, k, and the number of trials data are in a vector, n, the function call looks like this:

myfit = glm(cbind(k,n-k)~x,family="binomial")
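To see that glm() really is maximizing the Binomial log likelihood, here is a rough sketch that minimizes the negative log likelihood by hand with the general-purpose R optim() method and compares the result to glm(). The simulated data, the seed, and the function name neg_log_like are all just assumptions for illustration.

```r
set.seed(1234)                       # arbitrary seed
x = seq(0,10,0.5)                    # hypothetical explanatory variable
n = rep(20,length(x))                # hypothetical number of trials per point
k = rbinom(length(x),n,plogis(-2+0.4*x))   # plogis() is the reverse logit

# negative Binomial log likelihood for a linear model with a logit link:
# -sum[ log(choose(n,k)) + k*log(p) + (n-k)*log(1-p) ]
neg_log_like = function(par,k,n,x){
  A = par[1]+par[2]*x                        # the logit of p
  log_p = plogis(A,log.p=TRUE)               # log(p)
  log_one_minus_p = plogis(-A,log.p=TRUE)    # log(1-p)
  -sum(lchoose(n,k)+k*log_p+(n-k)*log_one_minus_p)
}

fit_optim = optim(c(0,0),neg_log_like,k=k,n=n,x=x,method="BFGS")
fit_glm = glm(cbind(k,n-k)~x,family="binomial")

print(fit_optim$par)     # by-hand estimates of beta_0 and beta_1
print(coef(fit_glm))     # glm() estimates
```

The two sets of estimates should agree closely. The glm() method uses a more specialized algorithm than optim(), but both are looking for the same maximum of the likelihood.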

The glm() binomial method can also be used with data that are a bunch of 1’s and 0’s. For our little example here, the data might be at the individual firearm level, where “1” indicates that the firearm has a round chambered, and “0” indicates that it doesn’t. In this case, if the vector found_with_round_chambered contains these zeros and ones for all the firearms, and the vector day_gun_found contains the day each firearm was found (relative to some start day), we can fit to this data using the function call

myfit = glm(found_with_round_chambered~day_gun_found,family="binomial")

Note that in both cases, **it is exactly the same data**, just expressed a different way (you can always aggregate the 1’s and 0’s by day to get the total number of firearms found, and the number found with a round chambered each day, for example). This duality in how you can look at logistic regression is sometimes confusing to students who have been exposed to logistic regression methods either just using 0’s and 1’s, or just using fractional data.
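Here is a small sketch of that duality, using hypothetical simulated data (the seed, the 8 trials per day, and the variable names are assumptions made just for this illustration). The same 1’s and 0’s are fit once at the individual level and once aggregated by day:

```r
set.seed(2468)                          # arbitrary seed
day = rep(0:49,each=8)                  # hypothetical: 8 trials on each of 50 days
outcome = rbinom(length(day),1,plogis(-1+0.05*day))   # individual 1's and 0's

# aggregate by day to get k successes out of n trials
k = as.vector(tapply(outcome,day,sum))
n = as.vector(tapply(outcome,day,length))
vday = sort(unique(day))

fit_event = glm(outcome~day,family="binomial")
fit_daily = glm(cbind(k,n-k)~vday,family="binomial")

print(coef(summary(fit_event)))
print(coef(summary(fit_daily)))
```

The coefficient estimates and their standard errors should come out the same for both fits. The deviance and AIC that R reports will differ between the two, because the aggregated likelihood includes the Binomial coefficient terms, which are constants that do not depend on the model parameters.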

**Example**

Let’s simulate some Binomial data, with a trend in time. In the example described above, perhaps this might be firearms detected at TSA airport checkpoints, and determining whether they had a round chambered (“1”), or didn’t have a round chambered (“0”). In this simulated example, the logit of the fraction loaded, p, has the predicted trend

set.seed(541831)
vday = seq(0,2*365)
vlogit_of_p_predicted = -1+0.005*vday
vp_predicted = exp(vlogit_of_p_predicted)/(1+exp(vlogit_of_p_predicted))

At time vday=0, the predicted average fraction of firearms found with a round chambered is thus:

p=exp(-1)/(1+exp(-1)) = 0.269

At time vday=730, the predicted average fraction of firearms found with a round chambered is:

p=exp(-1+0.005*730)/(1+exp(-1+0.005*730)) = 0.934

Let’s simulate some data where we assume the TSA detects exactly 10 firearms per day. We’ll simulate the data at the firearm level, where we record the day each firearm was found, and if it had a round chambered:

num_guns_found_per_day = 10
wfound_with_round_chambered = numeric(0)
wday_gun_found = numeric(0)
for (i in 1:length(vday)){
  v = rbinom(num_guns_found_per_day,1,vp_predicted[i])
  wfound_with_round_chambered = c(wfound_with_round_chambered,v)
  wday_gun_found = c(wday_gun_found,rep(vday[i],num_guns_found_per_day))
}

Notice that the wfound_with_round_chambered vector contains 1’s and 0’s.

We can recast this data instead by aggregating the number found with and without a round chambered by day:

num_aggregated_ones_per_day = aggregate(wfound_with_round_chambered,by=list(wday_gun_found),FUN="sum")
num_aggregated_zeros_per_day = aggregate(1-wfound_with_round_chambered,by=list(wday_gun_found),FUN="sum")

vday = num_aggregated_ones_per_day[[1]]
vnum_found_with_round_chambered = num_aggregated_ones_per_day[[2]]
vnum_found_without_round_chambered = num_aggregated_zeros_per_day[[2]]
vnum_found = vnum_found_with_round_chambered + vnum_found_without_round_chambered

Let’s plot the simulated data

vp_observed=vnum_found_with_round_chambered/vnum_found
plot(vday,vp_observed,cex=2,xlab="Time, in days",ylab="Fraction found with a round chambered")
lines(vday,vp_predicted,lwd=4,col=2)
legend("bottomright",legend=c("Observed","Predicted"),col=c(1,2),lwd=4,bty="n")

which produces the plot:

Now let’s do a linear logistic fit using the R glm() method with family="binomial" to the individual firearm data, and then to the data aggregated by day. When looking at aggregated data, we input the data to the fit as cbind(num_successes,num_failures). This can also be expressed as cbind(k,n-k), if k is num_successes, and n is the number of trials (n=num_successes+num_failures).

Note that the event and aggregated data are exactly the same data, so they should give exactly the same fit results!

fit_to_daily_data = glm(cbind(vnum_found_with_round_chambered,vnum_found_without_round_chambered)~vday,family="binomial")
fit_to_event_data = glm(wfound_with_round_chambered~wday_gun_found,family="binomial")

print(summary(fit_to_daily_data))
print(summary(fit_to_event_data))

This produces the output:

We can plot the fit results overlaid on the data. Note that even though glm() uses the logit link, it converts the fit prediction to a probability to save you the work of doing it.

plot(vday,vp_observed,cex=2,xlab="Time, in days",ylab="Fraction found with a round chambered")
lines(vday,vp_predicted,lwd=8,col=2)
lines(vday,fit_to_daily_data$fit,lwd=4,col=4,lty=3)
legend("bottomright",legend=c("Observed","True model","Fitted model"),col=c(1,2,4),lwd=4,bty="n")

Producing the plot:

You can see that our fitted model is pretty close to the true model. This is because there are many data points (10 each day, for two years). If the data were much more sparse, we would expect to perhaps see a bit more deviation of the fitted model from the true, not because the true model is wrong (it is after all, the true model we used to simulate our data), but because with a sparse data set the fit gets more affected by stochastic variations in the data.

The script example_glm_binomial_fit.R does the above fit. You can try different values of number of guns found per day, and different model coefficients to see how it affects the simulated data and the fits.

Note that in this example we assumed a constant number of firearms found per day… we could have varied that, if we wanted, and it would not change the linear dependence of the logit of the probability of finding a firearm with a round chambered…. whether the number found per day is 1, or 100000 (or whatever), it doesn’t affect the probability of success.

**Model selection**

Just like least squares linear regression with the lm() method, or Poisson regression with the glm() method with family="poisson", you can use the stepAIC() function in the R MASS library to find the most parsimonious model that best fits the data.
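As a minimal sketch of what this might look like for a Binomial fit, the following code simulates some hypothetical data where one variable (x1) truly affects the probability of success and another (x2) is pure noise, and then lets stepAIC() choose between the candidate models. The data, seed, and variable names are assumptions for illustration only.

```r
require("MASS")                         # stepAIC() is in the MASS library

set.seed(1357)                          # arbitrary seed
n = rep(25,200)                         # hypothetical number of trials per point
x1 = runif(200,0,10)                    # variable that truly affects p
x2 = rnorm(200)                         # pure noise, no explanatory power
k = rbinom(200,n,plogis(-2+0.4*x1))
sim = data.frame(k=k,n=n,x1=x1,x2=x2)

full_fit = glm(cbind(k,n-k)~x1+x2,family="binomial",data=sim)
best_fit = stepAIC(full_fit,direction="both",trace=0)

print(formula(best_fit))
print(c(AIC_full=AIC(full_fit),AIC_selected=AIC(best_fit)))
```

The selected model will never have a larger AIC than the model you started with. Whether the noise variable gets dropped depends on the particular simulated data, so it is worth trying a few different seeds.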

**A real life Binomial logistic regression analysis example**

An inspection of the launch pad revealed large quantities of ice collecting due to unusually cold overnight Florida temperatures. NASA had no experience launching the shuttle in temperatures as cold as on the morning of Jan. 28, 1986. The temperatures of each of the 23 previous launches had been at least 20 degrees warmer.

At the launch site, the fuel segments were assembled vertically. Field joints containing rubber O-ring seals were installed between each fuel segment. There were three O-ring seals for each of the two fuel tanks.

The O-rings had never been tested in extreme cold. On the morning of the launch, the cold rubber became stiff, failing to fully seal the joint.

As the shuttle ascended, one of the seals on a booster rocket opened enough to allow a plume of exhaust to leak out. Hot gases bathed the hull of the cold external tank full of liquid oxygen and hydrogen until the tank ruptured.

At 73 seconds after liftoff, at an altitude of 9 miles (14.5 kilometers), the shuttle was torn apart by aerodynamic forces.

The two solid-rocket boosters continued flying until the NASA range safety officer destroyed them by remote control.

The crew compartment ascended to an altitude of 12.3 miles (19.8 km) before free-falling into the Atlantic Ocean, killing all aboard.

**Decision to fly based on faulty analysis of data**

This paper by Dalal, Fowlkes, and Hoadley (1989) describes the O-ring failure data from the previous launches, and the reasoning behind the decision to launch on that cold day. The data from previous launches are shown in Table 1 of that paper, and I have put them in the file oring.csv

As mentioned in the Dalal et al. paper, managers in charge of the launch decision felt that the launches with zero O-ring failures were non-informative of the risk of failure versus temperature, and thus excluded that data from their decision making process.

The following code reads in that data, and plots it against temperature:

o = read.table("oring.csv",sep=",",header=T,as.is=T)
o_without_zero = subset(o,num_failure>0)

require("sfsmisc")
mult.fig(1)
ymax = 6   # upper limit of the y axis (there are only 6 O-rings in total)
plot(o$temp,o$num_failure,cex=3,xlab="Temperature",ylab="\043 of O-ring failures",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Space Shuttle O-Ring failure data for launches prior to Jan, 1986")
points(o_without_zero$temp,o_without_zero$num_failure,col="orange",cex=3)

The points in orange are the non-zero points that were used to make the decision to launch.

It is unclear what statistical acumen, if any, was used in the risk analysis that went into that decision, but it should be pointed out here that at least one person behind the scenes was very vocal about the mistake that was being made by ignoring the zero data prior to the launch.

Let’s assume, as an example, that the analysis methodology might have been at the Stats 101 level, and a least squares regression was attempted on the data **(Note: why is this in fact a completely inappropriate method to use?)**

b = lm(num_failure~temp,data=o_without_zero)
print(summary(b))
newdata = data.frame(temp=sort(c(o$temp,seq(30,70))))
ypred = predict(b,newdata,interval="predict")
cat("The expected number of O-ring failures at 31 degrees from the LS fit to num_failure>0:",ypred[newdata$temp==31,1],"\n")
ymax = 6
plot(o_without_zero$temp,o_without_zero$num_failure,cex=3,xlab="Temperature",ylab="\043 of O-ring failures",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Least Squares Fit only to non-zero data")
lines(newdata$temp,ypred[,1],col=2,lwd=8)
lines(newdata$temp,ypred[,2],col=2,lwd=4,lty=3)
lines(newdata$temp,ypred[,3],col=2,lwd=4,lty=3)
legend("topright",legend=c("Least Squares fit","95% CI on fit prediction"),col=2,lty=c(1,3),lwd=4,bty="n")

(note that using interval="predict" in the R predict() method will return not only the fit prediction, but also its 95% prediction interval, which reflects both the uncertainty on the fit estimates and the scatter of the data about the fit)

From the fit summary, it is apparent that there is no significant slope with respect to temperature (p=0.20). Thus, from a naive analysis like this, one might conclude that there is no significantly increased risk of O-ring failure at 31 degrees compared to 60 degrees.

How about if we redo the least squares regression, but this time including the zeros:

b = lm(num_failure~temp,data=o)
print(summary(b))
ypred = predict(b,newdata,interval="predict")
cat("The expected number of O-ring failures at 31 degrees from the LS fit to num_failure:",ypred[newdata$temp==31,1],"\n")
ymax = 6
plot(o$temp,o$num_failure,cex=3,xlab="Temperature",ylab="\043 of O-ring failures",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Least Squares Fit")
lines(newdata$temp,ypred[,1],col=2,lwd=8)
lines(newdata$temp,ypred[,2],col=2,lwd=4,lty=3)
lines(newdata$temp,ypred[,3],col=2,lwd=4,lty=3)
legend("topright",legend=c("Least Squares fit","95% CI on fit prediction"),col=2,lty=c(1,3),lwd=4,bty="n")

The fit now shows a significantly negative slope (p<0.001), but the predicted number of O-ring failures at 31 degrees is less than three. Given that they already had at least one successful prior launch with two O-ring failures, this hardly looks like something to be necessarily worried about.

But wait… that fit predicts negative O-ring failures when the temperature is above around 75 degrees. That doesn’t make sense. And there are only 6 O-rings in total… if we were to extrapolate the fit to even lower temperatures, it’s clear that we would eventually predict more than 6 O-ring failures for very low temperatures.

**Doing it right with Binomial logistic regression**

The following code does the fit using Binomial logistic linear regression. You’ll need to download the file AML_course_libs.R to run this; it contains a method get_prediction_and_confidence_intervals_from_binomial_fit that estimates the 95% interval on extrapolations of a Binomial regression fit.

source("AML_course_libs.R")
b = glm(cbind(num_failure,6-num_failure)~temp,family="binomial",data=o)
ypred = get_prediction_and_confidence_intervals_from_binomial_fit(b,newdata)
cat("The expected fraction of O-ring failures at 31 degrees from the logistic fit:",ypred[newdata$temp==31,1],"\n")
ymax = 1.0
plot(o$temp,o$frac,cex=3,xlab="Temperature",ylab="Fraction of O-rings that fail",ylim=c(0,ymax),xlim=c(31,max(o$temp)),main="Logistic regression")
lines(newdata$temp,ypred[,1],col=2,lwd=8)
lines(newdata$temp,ypred[,2],col=2,lwd=4,lty=3)
lines(newdata$temp,ypred[,3],col=2,lwd=4,lty=3)
legend("topright",legend=c("Logistic regression fit","95% CI on fit prediction"),col=2,lty=c(1,3),lwd=4,bty="n")

The y axis is the fraction of the O-rings that are expected to fail. The Binomial logistic regression predicts that 96% of the 6 rings will fail (i.e., the likelihood is high that all 6 rings will fail). In fact, with 95% confidence, at least half of the rings will fail.
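If you don’t have the course helper script handy, one common way to get an approximate interval on a logistic fit prediction is to ask predict() for the prediction and its standard error on the logit scale, and then transform back to a fraction. The sketch below does this on hypothetical simulated data (not the shuttle data), and it is not necessarily how the helper function in AML_course_libs.R does the calculation.

```r
set.seed(97531)                         # arbitrary seed
# hypothetical simulated data: 6 components per trial, with failures
# more likely at low temperature (these are NOT the shuttle data)
sim = data.frame(temp=round(runif(23,53,81)))
sim$num_failure = rbinom(nrow(sim),6,plogis(5-0.1*sim$temp))

fit = glm(cbind(num_failure,6-num_failure)~temp,family="binomial",data=sim)

# prediction and its standard error on the logit scale
newdata = data.frame(temp=seq(30,85))
pred = predict(fit,newdata,type="link",se.fit=TRUE)

# approximate 95% interval on the logit scale, transformed back to a fraction
z = qnorm(0.975)
ypred = data.frame(temp=newdata$temp,
                   fit=plogis(pred$fit),
                   lower=plogis(pred$fit-z*pred$se.fit),
                   upper=plogis(pred$fit+z*pred$se.fit))
print(subset(ypred,temp==31))
```

Because the interval is built on the logit scale before transforming back, it always stays between 0 and 1, which would not be guaranteed if we had added and subtracted the standard error on the probability scale.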

Beyond a statistical analysis of past launch data, however, apparently the O-rings had not been tested for flexibility at low temperatures. Richard Feynman, a Physics Nobel Prize laureate, was a member of the scientific commission that was appointed to look into the shuttle disaster. In a dramatic moment during the commission news conference, he demonstrated the inflexibility of O-rings at low temperature by pulling a deformed O-ring out of his glass of ice water.

Since the shuttle disaster, there have been other, more elaborate studies of the pre-launch O-ring data to attempt to assess the temperature dependent risk of failure…. for example, this analysis which examines the issue of model extrapolation given the large difference between the temperature of 31 degrees and all other temperatures in the past data, which were significantly warmer.

**Moral of this story**

Rare are statistical analyses we might attempt that might actually kill someone if we get it wrong. But this is an excellent case of how proper choice of analysis methods could have averted a disaster.

The Poisson probability distribution is appropriate for modelling the stochasticity in count data: for example, the number of people per household, the number of crimes per day, or the number of Ebola cases observed in West Africa per month.

There are other probability distributions that can be used to model the stochasticity in count data, like the Negative Binomial distribution, but the Poisson probability distribution is the simplest of the discrete probability distributions. The Poisson probability mass function for observing k counts when lambda are expected is:

$$P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

The lambda is our “model expectation”, and it might be just a constant, or a function of some explanatory variables.

For example, perhaps we are examining how the number of crimes per day, k, might linearly depend on the daily average temperature, x. In this case, our model equation for lambda might be

$$\lambda = \beta_0 + \beta_1 x$$

where beta_0 and beta_1 are parameters of the model. But note that temperature can be negative, which might lead to negative values of the model expectation… clearly for count data this makes no sense!

An example of how using least squares linear regression can go horribly wrong with count data for this reason is given by the following code, which reads in some count data, y, vs an explanatory variable, x, from the file example_of_how_least_squares_fits_to_count_data_can_go_wrong.csv

adat=read.table("example_of_how_least_squares_fits_to_count_data_can_go_wrong.csv",header=T,sep=",",as.is=T)
b = lm(y~x,data=adat)

mydat = data.frame(x=seq(0,2,0.1))
mydat$ypred = predict(b,mydat)

require("sfsmisc")
mult.fig(1)
xmin = min(c(adat$x,mydat$x))
xmax = max(c(adat$x,mydat$x))
ymin = min(c(adat$y,mydat$ypred))
ymax = max(c(adat$y,mydat$ypred))
plot(adat$x,adat$y,xlim=c(xmin,xmax),ylim=c(ymin,ymax),xlab="x",ylab="y")
lines(adat$x,b$fitted.values,col=2,lwd=5)      # fit over the range of the data
lines(mydat$x,mydat$ypred,col=2,lwd=5,lty=3)   # extrapolated fit
lines(c(-1e6,1e6),c(0,0),lty=3,col=4)
legend("topleft",legend=c("Count data","Fit to data","Extrapolated fit"),col=c(1,2,2),lty=c(1,1,3),lwd=6,bty="n")

This produces the following plot… you can see that the extrapolated least squares fit predicts negative counts, which is impossible!

Solution…

With Poisson regression, we thus almost always use what is known as a “log-link” where we assume that the logarithm of lambda depends on the explanatory variables… this always ensures that lambda itself is greater than zero no matter what beta_0, beta_1 or x are:

$$\log\lambda = \beta_0 + \beta_1 x$$

Now, we might not know what beta_0 and beta_1 are, but if we have a bunch of observations of crimes over a series of N days, k_i (with i=1,…,N), and we also have for the same days, the average daily temperature, x_i, we can fit for beta_0 and beta_1 to determine which values best describe the observed relationship between the x_i and k_i. Our model for the expected number of crimes on the i^th day is thus

$$\lambda_i = \exp(\beta_0 + \beta_1 x_i)$$

Using our collected data, we’d like to somehow estimate the “best-fit” values of beta_0 and beta_1 to the data. If the number of crimes per day is low, we can’t use least squares linear regression to do this because that method assumes that the data are Normally distributed, and it is only for large values of lambda that the Poisson distribution approaches the Normal.

We thus need a “goodness of fit” statistic that is appropriate to Poisson distributed data….

**Poisson likelihood**

The likelihood (probability) of observing our data, k_i, given our model predictions for each data point, lambda_i, is the product of the probabilities of observing each data point separately:

$$L = \prod_{i=1}^{N} \frac{\lambda_i^{k_i} e^{-\lambda_i}}{k_i!} \tag{1}$$

*Our “best fit” values of lambda_i for this model are the ones that will maximize this probability.* The least squares goodness-of-fit statistic is one that is usually quite easy for students to visualize. Likelihood fit statistics, however, are often more difficult to conceptualize because there isn’t a nice visual diagram that can explain it (like the arrows showing the distance between points and a model prediction, like we showed for least squares regression, for example).

However, for non-Normally distributed data, if you know what probability distribution underlies the data, you can write the likelihood distribution for observing a set of data by taking the product of the individual probabilities obtained from the probability distribution, just like we did above. The “best-fit” model maximizes that probability.

**Fitting for the model parameters with Poisson likelihood**

Note that the probabilities that are multiplied in Eqn 1 are always between 0 and 1, and thus for a sample size of N points, Eqn 1 involves multiplying N values between 0 and 1 together. This can easily lead to underflow errors in our computation, which is a real problem for us when we try to apply this in practice. The solution to this is to take the logarithm of both sides of Eqn 1. Before we do that, here is a bit of a refresher on logarithms:

$$\log(ab) = \log a + \log b, \qquad \log(x^{n}) = n\log x$$

That is to say, the log of a product is the sum of the logs of the terms in the product. The log of x to some power is the same as that power times the logarithm of x. In this case, we will be taking the “natural log” (which is log_e, log to the base e) of both sides of Eqn 1. The natural log of e is log(e) = 1. Taking the natural logarithm of both sides of Eqn 1 thus yields:

$$\log L = \sum_{i=1}^{N} \left[ k_i \log\lambda_i - \lambda_i - \log(k_i!) \right] \tag{2}$$

Poisson regression has been around for a long time, but least squares regression methods have been around longer. Finding the best-fit in least squares regression involves finding the parameters that *minimize* the least squares statistic. But finding the best-fit in Poisson regression involves finding the parameters in lambda_i that *maximize* Eqn 2. The inner workings of the optimization methods in statistical software packages are generally written to minimize goodness of fit statistics, mostly because of the least squares legacy.

Because of this, we take the negative of both sides of Eqn 2, and we say that the best-fit parameters in Poisson regression *minimize the negative log likelihood:*

$$-\log L = \sum_{i=1}^{N} \left[ \lambda_i - k_i \log\lambda_i + \log(k_i!) \right]$$

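For the special case of our linear model for log(lambda_i) that we are considering, we get:

$$-\log L = \sum_{i=1}^{N} \left[ e^{\beta_0+\beta_1 x_i} - k_i(\beta_0+\beta_1 x_i) + \log(k_i!) \right] \tag{3}$$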

Given some data k_i and x_i, the “best-fit” values of beta_0 and beta_1 minimize that expression. We could, in practice, guess a whole bunch of different values for beta_0 and beta_1, plug them into Eqn 3, and narrow it down to which pair of values appears to give the smallest negative log likelihood. However, principles of calculus can be used to find the best-fit values of beta_0 and beta_1 that minimize the expression in Eqn 3. These methods are used in the inner workings of the R least squares linear regression lm() function, which is used when the response variable is Normally distributed.

When working with a linear regression model with Poisson distributed count data, the R generalized linear model method, glm(), can be used to perform the fit using the family="poisson" option. Just like with the R least squares method, the inner workings of the glm() method use calculus principles, invisible to you, to find the best-fit model parameters that minimize the Poisson negative log likelihood. If the response data (our k_i) are in a vector y, and our explanatory variable, x_i, is in a vector x, and we are fitting a linear Poisson model, the function call looks like this:

myfit = glm(y~x,family="poisson")

Note that even though a log-link hasn’t been specified for the linear model, that is in fact what the glm() model with family=poisson by default assumes.
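To make the connection between the negative log likelihood and glm() concrete, here is a rough sketch that minimizes the Poisson negative log likelihood for a log-link linear model by hand, using the general-purpose R optim() method, and compares the result to glm(). It also shows the underflow problem that motivates working with the logarithm in the first place. The simulated data, the seed, and the function name neg_log_like are assumptions for illustration only.

```r
set.seed(4321)                          # arbitrary seed
x = seq(0,10,0.01)                      # hypothetical explanatory variable
k = rpois(length(x),exp(1+0.3*x))       # hypothetical true model: log(lambda)=1+0.3*x

# Poisson negative log likelihood for a linear model with a log-link:
# sum[ lambda - k*log(lambda) + log(k!) ]
neg_log_like = function(par,k,x){
  log_lambda = par[1]+par[2]*x
  sum(exp(log_lambda)-k*log_lambda+lgamma(k+1))   # lgamma(k+1) is log(k!)
}

fit_optim = optim(c(log(mean(k)),0),neg_log_like,k=k,x=x,method="BFGS")
fit_glm = glm(k~x,family="poisson")

print(fit_optim$par)     # by-hand estimates of beta_0 and beta_1
print(coef(fit_glm))     # glm() estimates

# the raw product of the probabilities underflows to zero,
# but the sum of the log probabilities is perfectly well behaved
lambda_fit = fitted(fit_glm)
print(prod(dpois(k,lambda_fit)))
print(sum(dpois(k,lambda_fit,log=TRUE)))
```

The glm() method uses a more specialized algorithm than optim(), but the two should land on essentially the same best-fit values because they are optimizing the same likelihood.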

**Example**

Let’s try fitting some simulated data with the glm() method with family="poisson". The following code randomly generates some Poisson distributed data, with a linear model with a log-link:

########################################################################

# randomly generate some Poisson distributed data according to a linear model

########################################################################

set.seed(484272)

x = seq(0,100,0.1)

intercept_true = 1.5

slope_true = 0.05

log_lambda = intercept_true+slope_true*x

pred = exp(log_lambda)

y = rpois(length(x),pred)

########################################################################

# put the data in a data frame

########################################################################

mydat=data.frame(x=x,y=y)

Now let’s fit a linear model to these simulated data, under the assumption that the stochasticity is Poisson distributed. Note that the plotting area is divided up with the mult.fig() method in the R sfsmisc library. You need to have this library installed in R to run that line of code. If you don’t have it installed, first type install.packages("sfsmisc") and pick a download site relatively close to your location.

########################################################################
# Do the model fit using glm. Note that glm() with family="poisson"
# inherently assumes a log-link to the data
########################################################################
myfit_glm = glm(y~x,family=poisson,data=mydat)
print(summary(myfit_glm))

########################################################################
# Now plot the data with the fitted values overlaid. Note that
# even though the glm() method with family=poisson assumes a
# log-link, what it spits out in the fitted.values attribute
# is the exponential of the log-link prediction
########################################################################
require("sfsmisc")
mult.fig(4,main="Poisson simulated data, generated with the model log(y)=a+b*x")
plot(x,y,xlab="x",ylab="y",cex=2,col="darkorchid4",main="y versus x")
lines(x,myfit_glm$fitted.values,col=3,lwd=5)
legend("topleft",legend=c("Simulated data","Best-fit Poisson linear model from R glm(y~x)"),col=c("darkorchid4",3),lwd=5,bty="n",cex=0.6)

plot(x,log(y),xlab="x",ylab="log(y)",cex=2,col="darkorchid4",main="log(y) versus x")
lines(x,log(myfit_glm$fitted.values),col=3,lwd=5)

The code produces the following output:

Are the fitted linear values statistically consistent with the true values we used to simulate the data? Do a z-test to check.
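A minimal sketch of how that z-test might be done is below. It regenerates the simulated data with the same seed and true values used above, and then compares each fitted coefficient to its true value in units of the standard error on the estimate; the variable names est, z, and pvalue are just illustrative.

```r
# regenerate the simulated data from above
set.seed(484272)
x = seq(0,100,0.1)
intercept_true = 1.5
slope_true = 0.05
y = rpois(length(x),exp(intercept_true+slope_true*x))
myfit_glm = glm(y~x,family=poisson)

# z statistic: (estimate - true value)/(standard error on the estimate)
est = coef(summary(myfit_glm))
z = (est[,"Estimate"]-c(intercept_true,slope_true))/est[,"Std. Error"]
pvalue = 2*pnorm(-abs(z))     # two-sided p-value
print(cbind(est[,c("Estimate","Std. Error")],z=z,pvalue=pvalue))
```

If the p-values are not small (say, larger than 0.05), there is no evidence that the fitted values are statistically inconsistent with the true values used in the simulation.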

If I was presenting these results in a paper, I would say something along the lines of, “y is found to be significantly associated with x (Poisson linear regression coefficient 0.0501, with 95% CI [0.0507,0.0513] , p<0.001).”

**Interpretation of the output**

With a log-link linear model, log(y)=a+b*x, thus y=exp(a)*exp(b*x). You may wish to interpret the model results in terms of how a change in x from x=x_0 to x=2*x_0 changes y

We can see from the model that if x=x_0, then y=exp(a)*exp(b*x_0), and if we double x to x’=2*x_0, then y’=exp(a)*exp(2*b*x_0). Thus the relative change in y when we double x is y’/y=exp(b*x_0).

If b*x_0 is very small, then the first order Taylor expansion gives y’/y~1+b*x_0. In fact, some readers may have been taught to use this expression for interpreting log-link Poisson regression results for the relative change in y. It needs to be stressed, however, that this interpretation only works if b*x_0 is small!
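A quick numerical sketch shows how fast the approximation degrades. The values of b and x_0 below are arbitrary, chosen just for illustration:

```r
b = c(0.001,0.01,0.05,0.5)          # hypothetical values of the slope coefficient
x_0 = 1                             # hypothetical starting value of x
exact = exp(b*x_0)                  # y'/y when x goes from x_0 to 2*x_0
approx_taylor = 1+b*x_0             # first order Taylor approximation
relative_error = (exact-approx_taylor)/exact
print(data.frame(b=b,exact=exact,approx_taylor=approx_taylor,relative_error=relative_error))
```

For the smallest value of b the two expressions are nearly indistinguishable, but by b*x_0=0.5 the approximation underestimates the relative change by several percent, and it only gets worse from there.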

**It is not just the R glm method with family="poisson" that assumes a log-link for Poisson regression!**

I’ve put this simulated data into a file simulated_poisson_log_linear_data.csv. If you have used other statistics software packages, like SAS, Stata, SPSS, Minitab, etc, try reading this data into that package and doing a Poisson linear regression fit. Compare the output of that software package to what you got in R. The coefficients and uncertainties should be the same. And what you should note in doing this exercise is that even though those other software packages may not explicitly state that the Poisson linear regression uses a log-link, they do.

The presentation in this module is not R specific: *all Poisson linear regression uses a log-link by default.*

**Another example, with more than one explanatory variable**

Let’s look at some real data…

The file chicago_crime_summary.csv contains the daily number of crimes in Chicago, sorted by FBI Uniform Crime Reporting code, from 2001 to 2013. FBI UCR code 4 is aggravated assaults (column x4 in the file). The file chicago_weather_summary.csv contains daily average weather variables for Chicago, including temperature, humidity, air pressure, cloud cover, and precipitation. The R script AML_course_libs.R contains some helper functions, including convert_month_day_year_to_date_information(month,day,year) that converts month, day, and year to a date expressed in fractions of years.

The following R code reads in these data sets, and meshes the temperature data into the crime data set. A few days are missing temperature data, so we remove those days from the data set. If you do not have the chron library already installed in R, first install it using install.packages("chron"), and pick a download site close to your location.

require("chron")

cdat = read.table("chicago_crime_summary.csv",header=T,as.is=T,sep=",")
wdat = read.table("chicago_weather_summary.csv",header=T,as.is=T,sep=",")

cdat$jul = julian(cdat$month,cdat$day,cdat$year)
cdat$temperature = wdat$temperature[match(cdat$jul,wdat$jul)]
cdat$weekday = day.of.week(cdat$month,cdat$day,cdat$year)
cdat = subset(cdat,!is.na(cdat$temperature))

source("AML_course_libs.R")
a = convert_month_day_year_to_date_information(cdat$month,cdat$day,cdat$year)
cdat$date = a$date

To regress the daily number of assaults (the column x4 in the data frame) on temperature, we use the R glm() method with family=poisson:

myfit = glm(cdat$x4~cdat$temperature,family=poisson)

require("sfsmisc")
mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fit,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

This produces the following plot:

The fit clearly needs a linear trend in time in order to fit the data better. The following code adds that:

myfit = glm(cdat$x4~cdat$temperature+cdat$date,family=poisson)

mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fit,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

This produces the following plot:

This looks to be a better fit.

But is the stochasticity in the data really consistent with being Poisson distributed? Just like the QQ plot we made with the Least Squares regression fits to test whether or not the data were truly Normally distributed about the model hypotheses, we can make a similar set of plots, but for the Poisson distribution. The AML_course_libs.R script contains a function

overlay_expected_distribution_from_poisson_glm_fit = function(count_data,glm_model_object)

that takes as its arguments the vector of count data, and the best-fit linear model from the glm() method.

In the first part of this function, for each data point it determines the shape of the probability mass function given the model prediction for that point… it then adds these mass functions up for all the data points. When we histogram the data, we can overlay this “Poisson model expectation” curve.

The second part of the script creates a QQ plot of the quantiles of the ranked data, vs the quantiles of a simulated data set, simulated assuming the best-fit model with Poisson stochasticity. If the data truly are Poisson distributed about the model, we would expect this plot to be linear. The following code implements this function with our data and our model to produce the plot:

overlay_expected_distribution_from_poisson_glm_fit(cdat$x4,myfit)
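The internals of that function aren't shown here, so the following is only a minimal sketch of the two steps just described, run on simulated data rather than the Chicago data. The variable names (lambda_hat, expected, observed, y_sim) and the simulated model are assumptions made for illustration, not the code in AML_course_libs.R.

```r
set.seed(1)

# simulate count data from a log-link linear model and fit it with glm()
x = seq(0,10,length.out=500)
y = rpois(length(x),exp(1+0.2*x))
fit = glm(y~x,family=poisson)
lambda_hat = fitted(fit)          # model prediction for each data point

# Part 1: for each data point, evaluate the Poisson probability mass
# function at every possible count k, then add these up over all points
k = 0:(max(y)+20)
expected = sapply(k,function(kk) sum(dpois(kk,lambda_hat)))
observed = tabulate(y+1,nbins=length(k))   # number of data points with count k

plot(k,observed,type="h",xlab="Count",ylab="Number of data points")
lines(k,expected,col=2,lwd=3)
legend("topright",legend=c("Data","Poisson model expectation"),col=c(1,2),lwd=3,bty="n",cex=0.7)

# Part 2: QQ plot of the data vs one data set simulated from the
# best-fit model with Poisson stochasticity
y_sim = rpois(length(lambda_hat),lambda_hat)
qqplot(y_sim,y,xlab="Quantiles of simulated data",ylab="Quantiles of observed data")
abline(0,1,col=3,lty=3,lwd=3)
```

Because the simulated data here really are Poisson distributed about the model, the histogram should track the red curve and the QQ plot should lie close to the dotted line. Fat tails in real data show up as points bending away from that line at the extremes.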

Even though our model with temperature plus linear trend in time is a better fit to the data than the model with just temperature, you can see that the above plots show that the data aren’t quite Poisson distributed about the model predictions. In fact, the QQ plot diagnostics indicate that the distribution appears to have some evidence of fat tails. This could point to potential confounding variables we haven’t yet taken into account (like, perhaps we might consider adding weekdays or holidays as factor levels in the fit). However, the data don’t appear to be grossly over-dispersed compared to the stochasticity expected from Poisson distributed data. Here is an example of including a factor in the explanatory variables (in this case weekday):

myfit = glm(cdat$x4~cdat$temperature+cdat$date+factor(cdat$weekday),family=poisson)
print(summary(myfit))

mult.fig(4,main="Daily assaults in Chicago, 2001 to 2013")
plot(myfit$fitted.values,cdat$x4,xlab="Best-fit model prediction",ylab="Observed data")
lines(c(0,1e6),c(0,1e6),col=3,lty=3,lwd=3)
legend("topleft",legend=c("Hypothetical 'perfect' fit"),col=c(3),lwd=3,bty="n",cex=0.7)

plot(cdat$date,cdat$x4,xlab="Date",ylab="Daily \043 of assaults")
lines(cdat$date,myfit$fitted.values,col=2,lwd=3)
legend("topright",legend=c("Data","Best-fit model"),col=c(1,2),lwd=3,bty="n",cex=0.7)

overlay_expected_distribution_from_poisson_glm_fit(cdat$x4,myfit)

**Model selection**

Just like with least squares regression, it is important to select the most parsimonious model that gives the best description of the data. Every potential explanatory variable has stochasticity associated with it, and that extra stochasticity broadens the confidence intervals on all of the fitted parameters.

If those variables don’t actually have any explanatory power, the added stochasticity carries the risk of disguising significant relationships with the truly explanatory variables.

As with least squares linear regression, we can use the Akaike Information Criterion (AIC) statistic to compare how well models fit data, with a penalization term for the number of parameters, k: AIC = 2k - 2 log(L), where L is the maximized likelihood of the model.

Note that the AIC includes the negative log likelihood… the smaller the negative log likelihood, the larger the likelihood. Thus, we want the most parsimonious model with the minimum value of the AIC.
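To make the definition concrete, here is a small sketch that computes the AIC by hand as 2k plus twice the negative log likelihood, and checks it against R's built-in AIC() function. The data are simulated, and the names x1, x2 and aic_by_hand are hypothetical, chosen just for this illustration; x2 is deliberately generated with no relationship to y.

```r
set.seed(2)

# simulate counts that depend on x1 but not on x2
n = 300
x1 = runif(n)
x2 = runif(n)
y = rpois(n,exp(1+1.5*x1))

fit_small = glm(y~x1,family=poisson)
fit_big = glm(y~x1+x2,family=poisson)

# AIC = 2k - 2 log(L), with k the number of fitted parameters
aic_by_hand = function(fit){
  k = length(coef(fit))
  return(2*k - 2*as.numeric(logLik(fit)))
}

cat("Model y~x1:    by hand",aic_by_hand(fit_small)," from AIC()",AIC(fit_small),"\n")
cat("Model y~x1+x2: by hand",aic_by_hand(fit_big)," from AIC()",AIC(fit_big),"\n")
```

Adding x2 can only increase the maximized likelihood, but it also costs an extra 2 in the penalty term, so the larger model only wins if the improvement in log likelihood is bigger than 1.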

The R stepAIC() function does model selection based on the AIC, dropping and adding terms in the candidate model one at a time, then calculating the AIC of the sub model.

After running the above code example, make sure the R MASS library is installed, and run the following code:

require("chron")
cdat = read.table("chicago_crime_summary.csv",header=T,as.is=T,sep=",")
wdat = read.table("chicago_weather_summary.csv",header=T,as.is=T,sep=",")
cdat$jul = julian(cdat$month,cdat$day,cdat$year)
source("AML_course_libs.R")
a = convert_month_day_year_to_date_information(cdat$month,cdat$day,cdat$year)
cdat$date = a$date
cdat$temperature = wdat$temperature[match(cdat$jul,wdat$jul)]
cdat$humidity = wdat$humidity[match(cdat$jul,wdat$jul)]
cdat$pressure = wdat$pressure[match(cdat$jul,wdat$jul)]
cdat$weekday = day.of.week(cdat$month,cdat$day,cdat$year)
cdat = subset(cdat,!is.na(cdat$temperature+cdat$humidity))
myfit = glm(cdat$x4~cdat$temperature+cdat$pressure+cdat$humidity+cdat$date+factor(cdat$weekday),family=poisson)
print(summary(myfit))
require("MASS")
d = stepAIC(myfit)
print(summary(myfit))
print(summary(d))

This produces the output:

and for the sub model fit selected by stepAIC():

Notice that air pressure was dropped from the fit by stepAIC because that submodel had a lower AIC. Also notice that the standard error went down on all the other parameter estimates once air pressure was dropped.

**Some cane waving…**

When I was a lass, working on my degree in experimental particle physics, we had to do model fitting very frequently. However, while we had a Fortran package (and later, a C++ package) that performed gradient descent optimization (or other optimization methods) of some function that you fed it, we didn’t have convenient pre-packaged methods like lm() or glm() where you could just fit a linear model with one tidy line of code. Instead, we had to write the code to actually program the likelihood ourselves.

We also had to walk to school ten miles a day, barefoot, through waist deep snow, even in the summer, and it was uphill both ways.

Get off my lawn.

While it can be a pain to have to code up the actual likelihood expression, the advantage of that stone age methodology was that we had to think carefully about what kind of stochasticity underlay our data, and code up the appropriate likelihood function (or least squares expression, if the stochasticity was Normally distributed). Using canned methods in statistical software packages for doing fitting can unfortunately sometimes lead to decreased understanding of what’s really going on with the fit.

Believe it or not, particle physicists still do fitting the same way they always have, coding up the likelihood function themselves. And they probably always will. Because it is critically important when testing hypotheses that you not only have your model right (i.e., accounting for all potential confounding variables, and ensuring that the functional expression of the model is appropriate), but that you also have the correct specification of the probability distribution describing stochasticity in the data. **Otherwise your p-values testing your null hypothesis are garbage.**
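Here is a small simulation sketch of what "garbage" looks like in practice. It generates over-dispersed counts (Negative Binomial, with parameter values I picked arbitrarily for illustration) that have no relationship at all to x, fits them assuming Poisson stochasticity, and counts how often the slope comes out "significant" at p<0.05. The quasipoisson fit, which rescales the uncertainties by an estimated dispersion factor, is included only as a point of comparison.

```r
set.seed(3)

n_sims = 200   # number of simulated data sets
n = 200        # number of points per data set
p_poisson = numeric(n_sims)
p_quasi = numeric(n_sims)

for (i in 1:n_sims){
  x = runif(n)
  # over-dispersed counts with mean 20, completely unrelated to x
  y = rnbinom(n,size=1,mu=20)
  # column 4 of the coefficient table is the p-value
  p_poisson[i] = summary(glm(y~x,family=poisson))$coefficients["x",4]
  p_quasi[i] = summary(glm(y~x,family=quasipoisson))$coefficients["x",4]
}

false_positive_poisson = mean(p_poisson<0.05)
false_positive_quasi = mean(p_quasi<0.05)
cat("Fraction of fits with p<0.05, Poisson likelihood:",false_positive_poisson,"\n")
cat("Fraction of fits with p<0.05, quasi-Poisson:",false_positive_quasi,"\n")
```

If the likelihood were correctly specified, about 5% of the fits would reject the (true) null hypothesis. With the Poisson assumption applied to data this over-dispersed, the fraction is far higher, because the parameter uncertainties are badly underestimated.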

**Getting up close and personal with Poisson regression in R**

R has a method called optim() that finds the parameters that minimize the function you feed to it. Unlike the glm() method, which can only find the parameters of a linear model, the optim() method can find the parameters of any kind of model. For instructive purposes, to show how optim() works, let’s code up the Poisson negative log likelihood, use optim() to fit a linear model to some data, and compare the results to what we get out of the glm() method with family="poisson". The two methods should yield the same results. Describing the optim() method also gives you a better idea of what R is doing inside the guts of the glm() method.

The R script poisson_and_optim.R defines the following functions, which define a linear model with a log-link and calculate the Poisson negative log likelihood, given some data vectors x and y contained in a data frame, mydata_frame.

########################################################################
# this is the function to calculate our linear model, assuming
# a log link
########################################################################
mymodel_log_prediction = function(mydata_frame,par){
  log_model_prediction = par[1] + par[2]*mydata_frame$x
  return(log_model_prediction)
}

########################################################################
# this is a function to compute the Poisson negative log likelihood
########################################################################
poisson_neglog_likelihood_statistic = function(mydata_frame,par){
  model_log_prediction = mymodel_log_prediction(mydata_frame,par)
  # lfactorial(y) is log(y!)
  neglog_likelihood = sum(-mydata_frame$y*model_log_prediction
                          +exp(model_log_prediction)
                          +lfactorial(mydata_frame$y))
  return(neglog_likelihood)
}
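One way to convince yourself that the hand-coded expression above really is the Poisson negative log likelihood is to compare it to R's built-in dpois() function, which returns the log of the Poisson probability mass when called with log=TRUE. The sketch below does that on a small simulated data set; the function names nll_by_hand and nll_from_dpois are hypothetical and aren't part of poisson_and_optim.R.

```r
set.seed(4)

# the same expression as above: -y*log(lambda) + lambda + log(y!)
nll_by_hand = function(y,log_lambda){
  return(sum(-y*log_lambda + exp(log_lambda) + lfactorial(y)))
}

# the same quantity using the built-in Poisson probability mass function
nll_from_dpois = function(y,log_lambda){
  return(-sum(dpois(y,exp(log_lambda),log=TRUE)))
}

x = seq(0,10,0.1)
log_lambda = 0.5+0.2*x
y = rpois(length(x),exp(log_lambda))

cat("By hand:   ",nll_by_hand(y,log_lambda),"\n")
cat("From dpois:",nll_from_dpois(y,log_lambda),"\n")
```

The two numbers should agree to within floating point precision. Using dpois(), or dnbinom() and friends for other distributions, is a handy way to avoid algebra mistakes when you write your own likelihood.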

Now, we need some data to fit to. The R script also has code that simulates some data with Poisson distributed stochasticity according to a linear model with a log-link (same as the first example we showed above):

########################################################################
# randomly generate some Poisson distributed data according to a linear model
########################################################################
set.seed(484272)

x = seq(0,100,0.1)
intercept_true = 1.5
slope_true = 0.05
log_lambda = intercept_true+slope_true*x
pred = exp(log_lambda)
y = rpois(length(x),pred)

########################################################################
# put the data in a data frame
########################################################################
mydat=data.frame(x=x,y=y)

Now the script does the glm() fit, and the fit using the optim() method. The two methods return the results in an entirely different format, and it takes a bit more work to extract the parameter uncertainties using the optim() method:

########################################################################
# Do the model fit using glm. Note that glm() with family="poisson"
# inherently assumes a log-link to the data
########################################################################
myfit_glm = glm(y~x,family=poisson,data=mydat)
print(summary(myfit_glm))

coef = summary(myfit_glm)$coef[,1]
ecoef = summary(myfit_glm)$coef[,2]
cat("\n")
cat("Results of the glm fit:\n")
cat("Intercept fitted, uncertainty, and true:",round(coef[1],3),round(ecoef[1],5),intercept_true,"\n")
cat("Slope fitted, uncertainty, and true:",round(coef[2],3),round(ecoef[2],5),slope_true,"\n")
cat("Negative log likelihood:",-logLik(myfit_glm),"\n")
cat("\n")

########################################################################
# now do the R optim() fit
#
# The results of the fit are in much more of a primitive format
# than the results that can be extracted from an R glm() object
# For example, in order to get the parameter estimate uncertainties,
# we need to calculate the covariance matrix from the inverse of the fit
# Hessian matrix (the parameter uncertainties are the square root of the
# diagonal elements of this matrix)
# Also, if we want the best-fit estimate, we need to calculate it
# ourselves from our model function, given the best-fit parameters.
########################################################################
myfit_optim = optim(par=c(1,0),poisson_neglog_likelihood_statistic,mydata_frame=mydat,hessian=T)
log_optim_fit = mymodel_log_prediction(mydat,myfit_optim$par)

coef = myfit_optim$par
coefficient_covariance_matrix = solve(myfit_optim$hessian)
ecoef = sqrt(diag(coefficient_covariance_matrix))

cat("\n")
cat("Results of the optim fit:\n")
cat("Intercept fitted, uncertainty, and true:",round(coef[1],3),round(ecoef[1],5),intercept_true,"\n")
cat("Slope fitted, uncertainty, and true:",round(coef[2],3),round(ecoef[2],5),slope_true,"\n")
cat("Negative log likelihood:",myfit_optim$value,"\n")
cat("\n")

This produces the following output:

The following code overlays the fit results from both methods on the data:

########################################################################
# Now plot the data with the fitted values overlaid. Note that
# even though the glm() method with family=poisson assumes a
# log-link, what it spits out in the fitted.values attribute
# is the exponential of that log-link prediction
########################################################################
require("sfsmisc")
mult.fig(4,main="Poisson simulated data, generated with the model log(y)=a+b*x")
plot(x,y,xlab="x",ylab="y",cex=2,col="darkorchid4",main="y versus x")
lines(x,myfit_glm$fitted.values,col=3,lwd=5)
lines(x,exp(log_optim_fit),col=2,lwd=1)
legend("topleft",legend=c("Simulated data","Best-fit Poisson linear model from R glm(y~x)","Best-fit Poisson linear model from R optim()"),col=c("darkorchid4",3,2),lwd=5,bty="n",cex=0.6)

plot(x,log(y),xlab="x",ylab="log(y)",cex=2,col="darkorchid4",main="log(y) versus x")
lines(x,log(myfit_glm$fitted.values),col=3,lwd=5)
lines(x,log_optim_fit,col=2,lwd=1)
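As a cross-check on the Hessian-based uncertainties from the optim() fit above, the following self-contained sketch compares them to the standard errors that glm() reports through vcov(). It uses a smaller simulated data set than the script, writes the negative log likelihood with dpois() for brevity, and the names nll, se_optim and se_glm are mine, chosen for this illustration.

```r
set.seed(5)

x = seq(0,10,0.1)
y = rpois(length(x),exp(1+0.2*x))

# Poisson negative log likelihood for a linear model with a log-link
nll = function(par,x,y){
  return(-sum(dpois(y,exp(par[1]+par[2]*x),log=TRUE)))
}

fit_optim = optim(par=c(1,0),fn=nll,x=x,y=y,hessian=TRUE)
fit_glm = glm(y~x,family=poisson)

# covariance matrix is the inverse of the Hessian of the negative
# log likelihood, evaluated at the best-fit parameters
se_optim = sqrt(diag(solve(fit_optim$hessian)))
se_glm = sqrt(diag(vcov(fit_glm)))

cat("Estimates from optim:",fit_optim$par,"  from glm:",coef(fit_glm),"\n")
cat("Uncertainties from optim:",se_optim,"  from glm:",se_glm,"\n")
```

The Hessian returned by optim() is computed numerically, so the agreement won't be exact, but the two sets of estimates and uncertainties should match to several significant digits.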

In this case we just did a simple linear model fit. However, with changes to the mymodel_log_prediction() method, optim() can fit arbitrarily complicated models, including non-linear models. Unlike optim(), the glm() method cannot fit non-linear models.
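To illustrate, here is a minimal sketch of fitting a model that is not linear in its parameters, using the same Poisson negative log likelihood approach. The saturating model lambda = A*x/(K+x), the true parameter values, and the starting values are all assumptions I made up for this example. The parameters are fit on the log scale so that optim() can't wander into negative values of A or K.

```r
set.seed(6)

# simulate Poisson counts about a saturating curve
x = seq(0.5,100,0.5)
A_true = 50
K_true = 10
y = rpois(length(x),A_true*x/(K_true+x))

# the non-linear model; par holds log(A) and log(K)
saturating_model = function(x,par){
  A = exp(par[1])
  K = exp(par[2])
  return(A*x/(K+x))
}

# Poisson negative log likelihood of the data given the model
nll = function(par,x,y){
  return(-sum(dpois(y,saturating_model(x,par),log=TRUE)))
}

fit = optim(par=log(c(30,5)),fn=nll,x=x,y=y,hessian=TRUE)
estimates = exp(fit$par)
cat("A fitted and true:",round(estimates[1],2),A_true,"\n")
cat("K fitted and true:",round(estimates[2],2),K_true,"\n")

plot(x,y,xlab="x",ylab="y",col="darkorchid4")
lines(x,saturating_model(x,fit$par),col=2,lwd=3)
```

Only the model function changed relative to the linear example; the likelihood and the call to optim() have the same structure. Note that the Hessian here gives uncertainties on log(A) and log(K), not on A and K themselves.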

]]>