The basics of the R statistical programming language

[After you have read through this module, and have downloaded and worked through the provided R examples, you should be proficient enough in R to be able to download and run other R scripts that will be provided in other posts on this site. You should understand the basics of good programming practices (in any language, not just R). You will also have learned how to read data in a file into a table in R, and produce a plot.]

Contents:

Why use R for modelling?

I have programmed in many different computing and scripting languages, but the ones I most commonly use on a day-to-day basis are C++, Fortran, Perl, and R (with some Python, Java, and Ruby on the side).  In particular, I use R every day because it is not only a programming language, but also provides graphics and a very large suite of statistical tools.  Connecting models to data is a process that requires statistical tools, and R provides those tools, plus a lot more.

Unlike SAS, Stata, SPSS, and Matlab, R is free and open source (it is hard to beat a package that is more comprehensive than pretty much any other product out there and is free!).

Continue reading

Epidemic modelling with compartmental models using R

[After reading through this module you should have an intuitive understanding of how infectious disease spreads in the population, and how that process can be described using a compartmental model with flow between the compartments.  You should be able to write down the differential equations of a simple disease model, and you will learn in this module how to numerically solve those differential equations in R to obtain the model estimate of the epidemic curve]

An excellent reference book with background material related to these lectures is Mathematical Epidemiology by Brauer et al. 

Contents:

Introduction

Models of disease spread can yield insights into the mechanisms and dynamics most important to the spread of disease (especially when the models are compared to epidemic data).  With this improved understanding, more effective disease intervention strategies can potentially be developed. Sometimes disease models are also used to forecast the course of an epidemic, and doing exactly that for the 2009 pandemic was my introduction to the field of computational epidemiology.

There are lots of different ways to model epidemics, and there are several modules on this site on the topic, but let’s begin with one of the simplest epidemic models for an infectious disease like influenza: the Susceptible, Infected, Recovered (SIR) model.
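As a flavour of what such a model looks like in practice, here is a minimal R sketch of an SIR model solved with the deSolve add-on package (this is not the module's own script, and the population size and parameter values are purely illustrative assumptions):

# A minimal SIR sketch (illustrative parameter values; not the module's own script)
library(deSolve)

sir_model = function(t, x, params){
  S = x[1]; I = x[2]; R = x[3]
  with(as.list(params), {
    dS = -beta*S*I/N
    dI =  beta*S*I/N - gamma*I
    dR =  gamma*I
    list(c(dS, dI, dR))
  })
}

N      = 10000                        # population size (assumed)
params = c(beta=0.5, gamma=1/3, N=N)  # transmission and recovery rates (assumed)
inits  = c(S=N-1, I=1, R=0)           # start with one infected person
times  = seq(0, 120, by=1)            # days

out = as.data.frame(lsoda(inits, times, sir_model, params))
plot(out$time, out$I, type="l", xlab="Time (days)", ylab="Number infected")

The module walks through where these equations come from and how to choose the parameters; the sketch above is only meant to show how compactly the numerical solution can be coded.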

Continue reading

Compartmental modelling without calculus

In another module on this site I describe how an epidemic for certain kinds of infectious diseases (like influenza) can be modelled with a simple Susceptible, Infectious, Recovered (SIR) model. Readers who have not yet been exposed to calculus (such as junior or senior high school students) may have been daunted by the system of differential equations shown in that post.  However, with only a small amount of programming experience in R, students without calculus can still easily model epidemics, or any other system that can be described with a compartmental model.  In this post I will show how that is done.
Continue reading

SIR infectious disease model with age classes

[After reading through this module, students should have an understanding of contact dynamics in a population with age structure (e.g., kids and adults). You should understand how population age structure can affect the spread of infectious disease. You should be able to write down the differential equations of a simple SIR disease model with age structure, and you will learn in this module how to solve those differential equations in R to obtain the model estimate of the epidemic curve.]

Contents:

Introduction

In a previous module I discussed epidemic modelling with a simple Susceptible, Infected, Recovered (SIR) compartmental model.  The model presented had only a single age class (i.e., it was homogeneous with respect to age).  But in reality, when we consider disease transmission, age likely does matter, because kids usually make more contacts during the day than adults. The differences in contact patterns between age groups can have quite a profound impact on the model estimate of the epidemic curve, and also have implications for the development of optimal disease intervention strategies (like age-targeted vaccination, social distancing, or closing schools).
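As a rough illustration of the idea (not the module's own code), here is a sketch of a two-age-class SIR model in R using the deSolve package; the contact matrix and all parameter values are made-up assumptions, chosen so that kids make more contacts than adults:

# Sketch of a two-age-class (kids, adults) SIR model; all numbers are illustrative
library(deSolve)

# assumed contact matrix: C[i,j] = daily contacts a person in group i has with group j
C = matrix(c(18, 6,
              6, 12), nrow=2, byrow=TRUE)

sir_age = function(t, x, params){
  S = x[1:2]; I = x[3:4]; R = x[5:6]
  with(as.list(params), {
    lambda = beta*(C %*% (I/N))   # force of infection felt by each age group
    dS = -lambda*S
    dI =  lambda*S - gamma*I
    dR =  gamma*I
    list(c(dS, dI, dR))
  })
}

N      = c(1500, 3500)                       # kids, adults (assumed)
params = list(beta=0.05, gamma=1/3, N=N, C=C)
inits  = c(S=N-c(1,0), I=c(1,0), R=c(0,0))   # start with one infected kid
out    = as.data.frame(lsoda(inits, seq(0,150,1), sir_age, params))
matplot(out$time, out[,grep("^I",names(out))], type="l", lty=1,
        xlab="Time (days)", ylab="Number infected")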
Continue reading

SIR modelling of influenza with a periodic transmission rate

[After going through this module, students will be familiar with time-dependent transmission rates in a compartmental SIR model, will have explored some of the complex dynamics that can be created when the transmission rate is not constant, and will understand applications to the modelling of influenza pandemics.]

Contents:


Introduction

Influenza is a seasonal disease in temperate climates, usually peaking in the winter.  This implies that the transmission of influenza is greater in the winter (whether this is due to increased crowding and higher contact rates in winter, and/or to higher transmissibility of the virus under favorable winter environmental conditions, is still being discussed in the literature).  What is very interesting about influenza is that sometimes summer epidemic waves can be seen with pandemic strains (followed by a larger autumn wave).  An SIR model with a constant transmission rate simply cannot replicate this dual-wave nature of an influenza pandemic.
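One way to see this is to let the transmission rate vary sinusoidally over the year.  The following is a rough R sketch of that idea using the deSolve package (the functional form and all parameter values are illustrative assumptions, not those used in the module):

# Sketch of an SIR model with a seasonally varying transmission rate
library(deSolve)

sir_seasonal = function(t, x, params){
  S = x[1]; I = x[2]; R = x[3]
  with(as.list(params), {
    beta_t = beta0*(1 + eps*cos(2*pi*t/365))  # transmission peaks once per year
    dS = -beta_t*S*I/N
    dI =  beta_t*S*I/N - gamma*I
    dR =  gamma*I
    list(c(dS, dI, dR))
  })
}

N      = 100000
params = c(beta0=0.45, eps=0.35, gamma=1/3, N=N)
inits  = c(S=N-10, I=10, R=0)
out    = as.data.frame(lsoda(inits, seq(0, 730, 1), sir_seasonal, params))
plot(out$time, out$I, type="l", xlab="Time (days)", ylab="Number infected")

Depending on when the epidemic is seeded relative to the seasonal peak, a model like this can produce a small early wave followed by a larger wave later in the year.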

Continue reading

Stochastic epidemic modelling with Agent Based Models

[After reading this module, you will be aware of the limitations of deterministic epidemic models, like the SIR model, and understand when stochastic models are important.  You will be introduced to three different methods of stochastic modelling, and understand the appropriate applications of each. By the end of this module, you will be able to implement a simple Agent Based stochastic model in R.]

Contents:

Continue reading

Good programming practices (in any language)

Easy readability, ease of editing, and ease of re-use are things to strive for in code you write in any language. Achieving that comes with practice, and with taking careful note of the hallmarks of tidy, readable, easily editable, and easily re-usable code written by other people.

While I’m certainly not perfect when it comes to utmost diligence in applying good programming practices, I do strive to write readable and re-usable code (if only because it makes my own life a lot easier when I return to code I wrote a year earlier and try to figure out what it was doing).

In the basics_programming.R script I make use of some good programming practices that ensure easy readability. For instance, code blocks that get executed within an if/then statement, for loop, or while loop are indented by a few spaces (usually two or three… be consistent in the number of indent spaces you use).  This makes it clear when reading the code which nested block of code you are looking at.  I strongly advise you not to use tabs to indent code.  To begin with, every piece of code I’ve ever had to modify that used tabs for indentation also used spaces here and there, and that mix makes it a nightmare to edit the code and have it remain easily readable.  Also, if you have more than one or two nested blocks of code, using tabs moves the inner blocks too far over for the code to remain easily readable.

In the R script sir_agent_func.R I define a function.  Notice in the script that instead of putting all the function arguments on one long line, I do it like this:

SIR_agent = function(N         # population size
                    ,I_0       # initial number infected
                    ,S_0       # initial number susceptible
                    ,gamma     # recovery rate in days^{-1}
                    ,R0        # reproduction number
                    ,tbeg      # begin time of simulation
                    ,tend      # end time of simulation
                    ,delta_t=0 # time step (if 0, then dynamic time step is  implemented)
                    ){

This line-by-line argument list makes it really easy to read the arguments (and allows inline descriptive comments).  It also makes the list really easy to edit: if you want to delete an argument, it is as simple as deleting that line, and if you want to add an argument, you just add a line to the list.

Descriptive variable names are a good idea because they make it easier for someone else to follow your code, and/or make it easier to figure out what your own code did when you look at it 6 months after you wrote it.

Other good programming practices are to heavily comment code, with comment breaks that clearly delineate sections of code that do specific tasks (this makes code easy to read and follow).  I like to put such comments “in lights” (i.e., I put a long line of comment characters both above and below the comment block, to make it stand out in the code).  If the comments are introducing a function, I will usually put two or three long lines of comment characters at the beginning of the comment block; this makes it clear and easy to see where the functions are defined when paging through code.
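To make the idea concrete, the layout I have in mind looks something like this (the function name and comments here are just placeholders):

##############################################################################
##############################################################################
# this_function_name_is_a_placeholder: a comment block "in lights", with long
# lines of comment characters above and below so that it stands out when
# paging through the code
##############################################################################
my_function = function(x){
  ############################################################################
  # a sub-section of the code, also set off with lines of comment characters
  ############################################################################
  return(x^2)
}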

In-line comments are also very helpful to make it clear what particular lines of code are doing.

A good programming practice that makes code much easier to re-use is to never hard-code numbers in the program.  For instance, at the top of the basics_programming.R script I create a vector that is n=1000 elements long.  In the subsequent for loop, I don’t have

for (i in 1:1000){}

Instead I have

for (i in 1:n){}

This makes that code reusable as-is if I decide to use it to loop over the elements of a vector with a different length.  All I have to do is change n to the length of the new vector.  Another example of not hard-coding numbers is found in the code associated with the while loop example.
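To summarize the pattern in a minimal sketch (the variable names here are placeholders, not necessarily those used in basics_programming.R):

n = 1000        # length of the vector, defined once at the top of the script
x = numeric(n)

for (i in 1:n){ # if n changes, this loop needs no editing
  x[i] = i^2
}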

As an aside here, I should mention that in any programming language you should never hard-code the value of a constant like π (as was pointed out in basics.R, it is a built-in constant in R, so you don’t need to worry about this for R code).  In other languages, you should encode pi as pi=acos(-1.0), rather than something like pi=3.14159.  I once knew a physics research group that made the disastrous discovery that they had a typo in their hard-coded value of pi… they had to redo a whole bunch of work once the typo was discovered.

Notice in the script that I have a comment block at the top of basics_programming.R that explains what the script does, gives the date of creation and the name of the person who wrote the script (i.e., me).  It also gives the contact info for the author of the script, and copyrights the script.  Every script or program you write should have that kind of boilerplate at the top (even if you think the program you are writing will only ever be used by you alone… you might unexpectedly end up sharing it with someone, and the boilerplate makes it clear that the program is *your* code, so that people can’t just pass it off as their own if they come across it).  It also helps you keep track of when you wrote the program.
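Such a header block might look something like this (the details are of course placeholders):

##############################################################################
##############################################################################
# some_script.R  (placeholder name)
#
# A short description of what the script does.
#
# Author:  Your Name   your_email@example.com
# Created: 2013-01-15
#
# Copyright Your Name (2013).  Not to be reproduced or used without
# permission of the author.
##############################################################################
##############################################################################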


ASU AML 610: probability distributions important to modelling in the life and social sciences

[After reading this module, students should be familiar with probability distributions most important to modelling in the life and social sciences; Uniform, Normal, Poisson, Exponential, Gamma, Negative Binomial, and Binomial.]

Contents:
Introduction
Probability distributions in general
Probability density functions
Mean, variance, and moments of probability density functions
Mean, variance, and moments of a sample of random numbers
Uncertainty on sample mean and variance, and hypothesis testing
The Poisson distribution
The Exponential distribution
The memory-less property of the Exponential distribution
The relationship between the Exponential and Poisson distributions
The Gamma and Erlang distributions
The Negative Binomial distribution
The Binomial distribution


Introduction

There are various probability distributions that are important to be familiar with if one wants to model the spread of disease or biological populations (especially with stochastic models).  In addition, a good understanding of these various probability distributions is needed if one wants to fit model parameters to data, because the data always have underlying stochasticity, and that stochasticity feeds into uncertainties in the model parameters.  It is important to understand what kind of probability distributions typically underlie the stochasticity in epidemic or biological data.
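As a quick taste of why the choice of distribution matters (the mean and dispersion parameter below are arbitrary), compare Poisson and Negative Binomial counts drawn about the same mean:

# Poisson counts have variance equal to the mean; Negative Binomial counts can
# be over-dispersed (variance greater than the mean)
set.seed(4534)
mean_count = 25
n = 10000

a = rpois(n, lambda=mean_count)
b = rnbinom(n, mu=mean_count, size=5)

cat("Poisson           mean and variance:", mean(a), var(a), "\n")
cat("Negative Binomial mean and variance:", mean(b), var(b), "\n")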
Continue reading

Fitting the parameters of an SIR model to influenza data using Least Squares and the Monte Carlo parameter sweep method

[After reading this module, students should understand the Least Squares goodness-of-fit statistic.   Students will be able to read an influenza data set from a comma delimited file into R, and understand the basic steps involved in a Monte Carlo parameter sweep method to fit an SIR model to the data to estimate the R0 of the influenza strain by minimizing the Least Squares statistic.  Students will be aware that parameter estimates have uncertainties associated with them due to stochasticity (randomness) in the data.]

A really good reference for statistical data analysis (including fitting) is Statistical Data Analysis, by G. Cowan.

Contents:

Introduction

When a new virus starts circulating in the population, one of the first questions that epidemiologists and public health officials want answered is the value of the reproduction number of the spread of the disease in the population (see, for instance, here and here).

The length of the infectious period can roughly be estimated from observational studies of infected people, but the reproduction number can only be estimated by examination of the spread of the disease in the population.  When early data in an epidemic are being used to estimate the reproduction number, I usually refer to this as “real-time” parameter estimation (i.e., the epidemic is still ongoing at the time of estimation).
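To give a flavour of the parameter sweep idea before diving into the module, here is a heavily simplified R sketch that fits R0 to simulated (not real) data by minimizing a Least Squares statistic; it fits the number currently infected rather than weekly incidence, and every numerical value in it is an illustrative assumption:

# Rough sketch of a Monte Carlo parameter sweep on simulated data
library(deSolve)

sir_model = function(t, x, params){
  with(as.list(c(x, params)), {
    dS = -beta*S*I/N
    dI =  beta*S*I/N - gamma*I
    dR =  gamma*I
    list(c(dS, dI, dR))
  })
}

set.seed(8871)
N = 10000; gamma = 1/3; true_R0 = 1.8
times = seq(0, 120, 1)
inits = c(S=N-1, I=1, R=0)

# simulated "observed" data: the model prediction plus Poisson noise
truth = as.data.frame(lsoda(inits, times, sir_model,
                            c(beta=true_R0*gamma, gamma=gamma, N=N)))
obs   = rpois(length(times), truth$I)

# sweep: randomly sample many candidate R0 values, keep the Least Squares best
nsweep  = 500
R0_try  = runif(nsweep, 1.1, 3.0)
leastsq = rep(NA, nsweep)
for (k in 1:nsweep){
  fit = as.data.frame(lsoda(inits, times, sir_model,
                            c(beta=R0_try[k]*gamma, gamma=gamma, N=N)))
  leastsq[k] = sum((obs - fit$I)^2)
}
cat("Best-fit R0:", R0_try[which.min(leastsq)], "\n")

Because the “data” here are stochastic, re-running the sketch with a different random seed gives a slightly different best-fit R0; that spread is the kind of parameter uncertainty referred to above.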

Continue reading

ASU AML 610 Module VIII: Fitting to initial exponential rise of epidemic curves

In this module students will compare the performance of several fitting methods (Least squares, Pearson chi-squared, and likelihood fitting methods) in estimating the rate of exponential rise in initial epidemic incidence data.  Students will learn about the properties of good estimators (bias and efficiency).

A good reference source for this material is Statistical Data Analysis, by G. Cowan.

Another good reference source (in a very condensed format) for statistical data analysis methods can be found here.

Contents:
Introduction
Properties of good estimators
Generating simulated exponential rise data
Estimation of the rate of exponential rise: Least Squares
Estimation of the rate of exponential rise: Pearson chi-squared
The Poisson maximum likelihood method
Estimation of parameter confidence intervals: any maximum likelihood method
Estimation of the rate of exponential rise: Poisson maximum likelihood method
Testing for over- or under-dispersion.
Correcting for over- or under-dispersion
Better method for determination of parameter estimates and their covariance when using the Pearson chi-squared method

Continue reading

Numerical methods to solve ordinary differential equations

After going through this module, students will be familiar with the Euler and Runge-Kutta methods for numerical solution of systems of ordinary differential equations.  Examples are provided to show students how complementary R scripts can be written to help debug Runge-Kutta methods implemented in C++.
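As a flavour of the simplest of these methods, here is a short sketch of the Euler method applied to an SIR system (the step size and parameter values are illustrative; the module's own scripts may differ):

# Euler method for an SIR system: step the state forward by dt at each iteration
euler_sir = function(beta, gamma, N, I0, tend, dt){
  nstep = ceiling(tend/dt)
  S = N - I0; I = I0; R = 0
  out = data.frame(time=0, S=S, I=I, R=R)
  for (k in 1:nstep){
    dS = -beta*S*I/N
    dI =  beta*S*I/N - gamma*I
    dR =  gamma*I
    S = S + dS*dt; I = I + dI*dt; R = R + dR*dt
    out = rbind(out, data.frame(time=k*dt, S=S, I=I, R=R))
  }
  return(out)
}

res = euler_sir(beta=0.5, gamma=1/3, N=10000, I0=1, tend=120, dt=0.1)
plot(res$time, res$I, type="l", xlab="Time (days)", ylab="Number infected")

The Runge-Kutta methods discussed in the module achieve much better accuracy for a given step size, but the bookkeeping is the same: a function that returns the derivatives, and a loop that steps the state forward in time.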

Contents

Continue reading

AML610 module XI: practical problems when connecting deterministic models to data

Some (potentially) useful utilities for random number generation and manipulating vectors in C++

I’ve written some C++ code containing utilities mainly related to vectors (calculating the weighted mean, running sums, extracting every nth element, etc.).  There are also utilities related to random number generation from various probability distributions, and methods to calculate the CDF of various probability distributions.

The files UsefulUtils.h and UsefulUtils.cpp contain the source code of a class providing utilities that can be useful when performing compartmental modelling in C++. These utilities will be used extensively in the examples that will be presented in this, and later, modules.  The file example_useful_utils.cpp gives examples of the use of the class.  It can be compiled with the makefile makefile_use with the command

make -f makefile_use example_useful_utils

Homework #4, due April 3rd, 2013 at 6pm. The data for the homework can be found here.

Continue reading

Meaning of bin-to-bin independence of data in a binned distribution


What does it mean for data bins to be “independent” in a binned distribution (like, for instance, the time series of the number of newly identified disease cases each week)?  It means that the stochastic variation in the data in the i^th bin is uncorrelated with the variation in the j^th bin.  You should be aware that for epidemic data this assumption is perhaps (probably) not satisfied, but the academic body of work on this topic has not yet come up with a good way of assessing the bin-to-bin correlations, so for now we have little choice but to assume the bins are uncorrelated. Note that the assumption of independent bins underlies the Maximum Likelihood, Least Squares, and Pearson chi-squared methods, unless special modification is made to those methods to take the correlations into account.


The difference between mathematical and statistical modelling (plus some more basics of R)

[In this module, we will discuss the difference between mathematical and statistical modelling, using pandemic influenza as an example.  Example R code that solves the differential equations of a compartmental SIR model with seasonal transmission (i.e., a mathematical model) is presented.  Also provided is an example of how to download add-on library packages in R, plus more examples of reading data sets into R and aggregating them by some quantity (in this case, a time series of influenza data in Geneva in 1918, aggregated by weekday).

Delving into how to write the R code to solve systems of ODEs related to a compartmental mathematical model is perhaps slightly off-topic for a statistical modelling course, but is worthwhile to examine: as mathematical and computational modellers, your aim in performing statistical analyses will usually be to uncover potential relationships that can be included in a mathematical model, to make that model better describe the underlying dynamics of the system.]

AML 610 Fall 2013 Module III: Review of Probability Distributions, Basic Statistics, and Hypothesis Testing

[In this module, students will learn about probability distributions important to statistical modelling, focussing primarily on probability distributions that underlie the stochasticity in time series data.

In addition, in this course we will be learning how to formulate figure-of-merit statistics that can help to answer research questions like “Is quantity A significantly greater/less than quantity B?”, or “Does quantity X appear to be significantly related to quantity Y?”.  As we are about to discuss, statistics that can be used to answer these types of questions do so via the probability distribution underlying the statistic.  Every statistic used for hypothesis testing has an underlying probability distribution.]

Continue reading

AML 610 Fall seminar series: Module IV linear regression

In this module, students will become familiar with linear regression methods. Note that before proceeding with any regression analysis, it is important to first perform initial data exploration, both with visual analyses (histograms, boxplots, and scatter plots) and with numerical summaries of the variables (means, standard deviations, maxima and minima, and correlations between the variables).  In this way, you can determine if there are any unexpected “quirks” or problems with the data (and, more often than not, there are).
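A minimal sketch of that kind of initial exploration in R (mydata here is just a placeholder data frame; with real data you would read it in from a file first):

# quick-look data exploration before fitting any regression model
mydata = data.frame(x=rnorm(100), y=rnorm(100), z=rnorm(100))  # placeholder data

summary(mydata)       # means, medians, maxima and minima of each variable
apply(mydata, 2, sd)  # standard deviations
cor(mydata)           # correlations between the variables
hist(mydata$x)        # histogram of one variable
boxplot(mydata)       # boxplots of all the variables
pairs(mydata)         # scatter plots of every pair of variables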

Continue reading

One sided vs two-sided tests

In this course so far we have covered the Z-test, the Student’s t-test, and the chi-squared test for testing the null hypothesis that quantity A is consistent with quantity B.

The Z and Student’s t statistics are examples of two-tailed statistics, because two directions are considered “extreme”: the far low end and the far high end.  If your Z statistic is very low or very high, you would conclude that you should reject the null hypothesis that A is consistent with B:

[Figure: p-value graph showing the two extreme tails of the distribution]

So far in this course, I’ve told you that if the p-value of the Z or Student’s t statistic returned by the pnorm() or pt() functions in R is less than 0.05, or greater than 0.95 (i.e., p<0.05 or (1-p)<0.05), you probably should reject the null hypothesis that A is consistent with B.  Some people use a more stringent test, where they reject the null if p<0.025 or (1-p)<0.025.  The value on the RHS that is used as the cut-off value for p is known as the “critical value”, and is often denoted as alpha.
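For example, a two-tailed comparison of two quantities might look like this in R (the numbers are made up for illustration):

# two-tailed Z test of whether quantity A is consistent with quantity B
A = 10.3; B = 9.8
sigma_A = 0.2; sigma_B = 0.2              # standard errors of A and B (assumed)
Z = (A - B)/sqrt(sigma_A^2 + sigma_B^2)

p = pnorm(Z)
cat("Z =", Z, "  pnorm(Z) =", p, "\n")
if (p < 0.05 || (1 - p) < 0.05) cat("Reject the null hypothesis that A is consistent with B\n")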


Using the chi-squared statistic to test if two Normally distributed variables A and B are statistically consistent with being equal to each other with a critical value of p<alpha* is exactly equivalent to using the two-sample Z test for the equivalence of A and B, with a critical value p<alpha/2 (or (1-p)<alpha/2)**.

* recall that if you use the pchisq() function in R, then the p-value of the test statistic is p=1-pchisq()

** recall that if you use the pnorm() function in R, then if pnorm()<0.5, the p-value of the test statistic is p=pnorm().  If pnorm()>0.5, then the p-value of the test statistic is p=1-pnorm().
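You can convince yourself of this equivalence numerically (the Z value here is arbitrary):

# the chi-squared p-value equals twice the single-tail Z-test p-value
Z = 1.7
p_z    = 1 - pnorm(abs(Z))       # one tail of the Z test
p_chi2 = 1 - pchisq(Z^2, df=1)   # p-value of the chi-squared test with 1 dof
cat(p_chi2, 2*p_z, "\n")         # these two numbers are the same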

The chi-squared test, by contrast, is an example of a one-tailed test, because only one end of the distribution is considered “extreme”.

K-means clustering

Sometimes we may want to determine if there are apparent “clusters” in our data (perhaps temporal/geo-spatial clusters, for instance).  Clustering analyses form an important aspect of large scale data-mining.

K-means clustering partitions n observations into k clusters, in which each point is assigned to the cluster with the nearest mean.

If you decide to cluster the data into k clusters, essentially what the algorithm does is minimize the within-cluster sum of squared distances of each point to the center of its cluster (does this sound familiar?  It is similar to Least Squares).

The algorithm starts by randomly picking k points to be the cluster centers.  Then each of the remaining points is assigned to the cluster nearest to it by Euclidean distance.  In the next step, the mean centroid of each of the clusters is calculated from the points assigned to the cluster.  Then the algorithm re-assigns all points to their nearest cluster centroid.  Then the centroid of each cluster is updated with the new point assignments.  And then the points are re-assigned to clusters, and the cluster centroids re-calculated… and so on and so on, until you reach a step where none of the points switch clusters.
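For readers who like to see the iteration spelled out in code, here is a from-scratch sketch of the algorithm just described (in practice you would use the built-in kmeans() function, as below; this sketch does nothing clever about empty clusters):

# bare-bones k-means: alternate point assignment and centroid update steps
my_kmeans = function(dat, k, max_iter=100){
  dat = as.matrix(dat)
  centers = dat[sample(nrow(dat), k), , drop=FALSE]  # random starting centers
  assign_old = rep(0, nrow(dat))
  for (iter in 1:max_iter){
    # assign each point to the nearest center (squared Euclidean distance)
    d = sapply(1:k, function(j)
          rowSums((dat - matrix(centers[j,], nrow(dat), ncol(dat), byrow=TRUE))^2))
    assign_new = apply(d, 1, which.min)
    if (all(assign_new == assign_old)) break  # no point switched cluster: done
    assign_old = assign_new
    # recalculate the centroid of each cluster from its assigned points
    for (j in 1:k) centers[j,] = colMeans(dat[assign_new == j, , drop=FALSE])
  }
  return(list(cluster=assign_new, centers=centers))
}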

How many clusters are in the data?

You likely want to explore how many distinct clusters appear to be in the data. To do this, you can examine how much of the variance in the data is described by our “model” (which in this case is the clustered data).  Just as for a linear least squares statistical model, we can thus calculate an adjusted R-squared from the total variance in the data and the sum of the within-group variances.  There is also a way to calculate the AIC for a kmeans model.

To see how to implement kmeans clustering in R, examine the following code that generates simulated data in 3 clusters:

require("sfsmisc")
set.seed(375523)
n1 = 1000
x1 = rnorm(n1,100,5)
y1 = rnorm(n1,0,1)
n2 = 750
x2 = rnorm(n2,125,7)
y2 = rnorm(n2,4,0.5)
n3 = 1500
x3 = rnorm(n3,75,10)  # note n3 here (using n2 would give vectors whose lengths don't match true_clus below)
y3 = rnorm(n3,-4,1)

x = c(x1,x2,x3)
y = c(y1,y2,y3)
true_clus = c(rep(1,n1),rep(2,n2),rep(3,n3))
pch_clus = c(rep(20,n1),rep(15,n2),rep(17,n3))
mydata = data.frame(x,y)
plot(x,y,col=true_clus,pch=pch_clus)

This produces the following plot (kmeans_1):

Now, to cluster these into nclus clusters, we can use the kmeans() function in R, like this:

mydata_for_clustering=mydata
n = nrow(mydata_for_clustering)
nclus = 3
myclus = kmeans(mydata_for_clustering,centers=nclus)
print(names(myclus))

###################################################################
# plot the data, colored by the clusters
###################################################################
mult.fig(1,main="Simulated data with three clusters")
scol = rainbow(nclus,end=0.8) # select colors from the rainbow
plot(mydata$x,mydata$y,col=scol[myclus$cluster],pch=pch_clus)

Producing this plot (kmeans_2):

Do you see something wrong here?

The problem is that the x dimension has a much larger range than the y dimension, and so it carries far too much weight in the calculation of the Euclidean distance. The solution is to normalize the data such that all dimensions are on the same scale. Try this:

mydata_for_clustering=mydata
mydata_for_clustering$x = (x-mean(x))/sd(x)
mydata_for_clustering$y = (y-mean(y))/sd(y)
nclus = 3

myclus = kmeans(mydata_for_clustering,centers=nclus)
print(names(myclus))

###################################################################
# plot the data, colored by the clusters
###################################################################
mult.fig(1,main="Simulated data with three clusters")
scol = rainbow(nclus,end=0.8) # select colors from the rainbow
plot(mydata$x,mydata$y,col=scol[myclus$cluster],pch=pch_clus)

which yields the following (kmeans_3):

In most cases, however, the number of clusters won’t be so obvious.  In that case, we need to use a figure-of-merit statistic like the adjusted R^2 or (even better) the AIC to determine which number of clusters appears to best describe the data.

To calculate the r-squared for a variety of cluster sizes, do this (after making sure your data are normalized!):

####################################################################
# the data frame mydata_for_clustering is ncol=2 dimensional
# and we wish to determine the minimum number of clusters that leads
# to a more-or-less minimum total within group sum of squares
####################################################################
kmax = 15 # the maximum number of clusters we will examine; you can change this
totwss = rep(0,kmax) # will be filled with the total within-group sum of squares for each number of clusters
kmfit = list() # create an empty list to store the fit for each number of clusters
for (i in 1:kmax){
  kclus = kmeans(mydata_for_clustering,centers=i,iter.max=20)
  totwss[i] = kclus$tot.withinss
  kmfit[[i]] = kclus
}

####################################################################
# calculate the adjusted R-squared
# and plot it
####################################################################
rsq = 1-(totwss*(n-1))/(totwss[1]*(n-seq(1,kmax))) # adjusted R-squared; totwss[1] is the total sum of squares (one cluster)

mult.fig(1,main="Simulated data with three clusters")
plot(seq(1,kmax),rsq,xlab="Number of clusters",ylab="Adjusted R-squared",pch=20,cex=2)

Producing the following plot (kmeans_4):

We want to describe as much of the variance as possible with the fewest clusters.  The usual method is to look for the “elbow” in the plot, and that is traditionally done visually.  However, we could try to construct a figure-of-merit statistic that identifies the elbow.  In the AML610 course, the students and the professor developed a quantitative method to formulate a statistic to identify the elbow point.  We call this statistic the COMINGS-elbow statistic (after the first initials of the students who came up with the algorithm); the statistic looks for the (k,aic) point that has the closest Euclidean distance to (1,min(aic)).  However, we noted that the performance of this statistic in identifying the elbow point was improved if we first normalized the k and aic vectors such that they lie between 0 and 1.  Thus, the elbow point is identified by the (k’,aic’) point that is closest to (0,0).

####################################################################
# try to locate the "elbow" point
####################################################################
vrsq = (rsq-min(rsq))/(max(rsq)-min(rsq)) # normalize the adjusted R-squared to lie between 0 and 1
k = seq(1,length(vrsq))
vk = (k - min(k))/(max(k)-min(k))         # normalize the number of clusters to lie between 0 and 1
dsq = (vk)^2 + (vrsq-1)^2                 # squared distance of each point from the "ideal" point (0,1)
nclus = which.min(dsq)                    # the elbow is the point closest to the ideal point

cat("The apparent number of clusters is: ",nclus,"\n")
points(nclus,rsq[nclus],col=2,pch=20,cex=2)

This figure-of-merit for elbow location doesn’t seem to work optimally. Can we think of a better one?

The AIC can be calculated with the following function:

kmeansAIC = function(fit){
  m = ncol(fit$centers)    # number of dimensions
  n = length(fit$cluster)  # number of data points
  k = nrow(fit$centers)    # number of clusters
  D = fit$tot.withinss     # total within-cluster sum of squares
  return(D + 2*m*k)        # AIC = deviance plus twice the number of fitted parameters
}

aic=sapply(kmfit,kmeansAIC)
mult.fig(1,main="Simulated data with three clusters")
plot(seq(1,kmax),aic,xlab="Number of clusters",ylab="AIC",pch=20,cex=2)

Again, we are looking for the elbow point in the plot.

####################################################################
# look for the elbow point 
####################################################################
v = -diff(aic)             # drop in AIC from adding each additional cluster
nv = length(v)
fom = v[1:(nv-1)]/v[2:nv]  # ratio of successive drops; the largest ratio marks the elbow
nclus = which.max(fom)+1
cat("The apparent number of clusters is: ",nclus,"\n")
points(nclus,aic[nclus],col=2,pch=20,cex=2)

This yields the following plot (kmeans_5):

With the AIC, our elbow locator FoM seems to work better (in this case, anyway).

In the file kmeans_example.R, I show the locations of terrorist attacks in Afghanistan from 2009 to 2011, from the Global Terrorism Database.  The data are in afghanistan_terror_attacks_2009_to_2011.txt