Basic Unix

In the Arizona State University AML610 course “Computational and Statistical Methods in Applied Mathematics”, we will ultimately be using supercomputing resources at ASU and the NSF XSEDE initiative to fit the parameters of a biological model to data.  To do this, it is necessary to know basic Unix commands to copy, rename, and delete files and directories, and to list directories and locate files.  We will also be compiling all our C++ programs from the Unix shell, and directing the output of our programs to files from the command line.
Continue reading

Good practices in producing plots

Years ago a mentor told me that one of the hallmarks of a well-written paper is its figures; a reader should be able to read the abstract and introduction and then, without reading any further, flip to the figures and find much of the evidence supporting the hypothesis of the paper.  I’ve kept this in mind in every paper I’ve since produced.  In this module, I’ll discuss various things you should focus on to produce good, clear, attractive plots.

Continue reading

Good programming practices (in any language)

Easy readability, ease of editing, and ease of re-usability are things to strive for in code you write in any language. Achieving them comes with practice, and with taking careful note of the hallmarks of tidy, readable, easily editable, and easily re-usable code written by other people.

While I’m certainly not perfect when it comes to utmost diligence in applying good programming practices, I do strive to make readable and re-usable code (if only because it makes my own life a lot easier when I return to code I wrote a year earlier and try to figure out what it does).

In the basics_programming.R script I make use of some good programming practices that ensure easy readability. For instance, code blocks that get executed within an if/else statement, for loop, or while loop are indented by a few spaces (usually two or three… be consistent in the number of indent spaces you use).  This makes it clear when reading the code which nested block of code you are looking at.  I strongly advise you not to use tabs to indent code.  To begin with, every piece of code I’ve ever had to modify that used tabs for indentation also used spaces here and there, which makes it a nightmare to edit while keeping the code easily readable. Also, if you have more than one or two nested blocks of code, tabs move the inner blocks too far to the right for the code to remain easily readable.
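For instance, a hypothetical snippet (my own illustration, not taken from basics_programming.R) with consistent three-space indentation makes the nesting obvious at a glance:

```r
# hypothetical example: nested blocks indented consistently by three spaces
total = 0
for (i in 1:10){
   if (i %% 2 == 0){
      total = total + i   # only reached for even i
   }
}
print(total)   # the sum of the even numbers 2,4,...,10, which is 30
```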

In the R script sir_agent_func.R I define a function.  Notice in the script that instead of putting all the function arguments on one long line, I do it like this:

SIR_agent = function(N         # population size
                    ,I_0       # initial number infected
                    ,S_0       # initial number susceptible
                    ,gamma     # recovery rate in days^{-1}
                    ,R0        # reproduction number
                    ,tbeg      # begin time of simulation
                    ,tend      # end time of simulation
                    ,delta_t=0 # time step (if 0, then dynamic time step is implemented)
                    ){

This line-by-line argument list makes it really easy to read the arguments (and allows inline descriptive comments).  It also makes the list easy to edit: if you want to delete an argument, it is as simple as deleting that line, and if you want to add an argument, you add a line to the list.

Descriptive variable names are a good idea because they make it easier for someone else to follow your code, and/or make it easier to figure out what your own code did when you look at it 6 months after you wrote it.

Other good programming practices are to heavily comment code, with comment breaks that clearly delineate sections of code that do specific tasks (this makes code easy to read and follow).  I like to put such comments “in lights” (ie; I put a long line of comment characters both above and below the comment block, to make it stand out in the code).  If the comments are introducing a function, I will usually put two or three long lines of comment characters at the beginning of the comment block; it makes it clear and easy to see when paging through code where the functions are defined.
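As a hypothetical illustration of what I mean by putting comments “in lights” (this example is mine, not taken from any of the course scripts):

```r
##################################################################
##################################################################
# Function to return the square of its argument.
# The doubled lines of comment characters above and below make
# the function definition easy to spot when paging through code.
##################################################################
##################################################################
square = function(x){
   return(x^2)
}

##################################################################
# a single comment break delineating a new section of the script
##################################################################
y = square(3)
```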

In-line comments are also very helpful to make it clear what particular lines of code are doing.

A good programming practice that makes code much easier to re-use is to never hard-code numbers in the program.  For instance, at the top of the basics_programming.R script I create a vector that is n=1000 elements long.  In the subsequent for loop, I don’t have

for (i in 1:1000){}

Instead I have

for (i in 1:n){}

This makes that code reusable as-is if I decide to use it to loop over the elements of a vector with a different length.  All I have to do is change n to the length of the new vector.  Another example of not hard-coding numbers is found in the code associated with the while loop example.
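A minimal sketch of the pattern (the vector name and contents here are illustrative, not the actual contents of basics_programming.R):

```r
# the length n is defined once at the top; change n and the rest
# of the code works unchanged
n = 1000
avec = numeric(n)
for (i in 1:n){
   avec[i] = i^2
}
# better still, derive the loop bound from the object itself
for (i in 1:length(avec)){
   avec[i] = sqrt(avec[i])
}
```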

As an aside here, I should mention that in any programming language you should never hard-code the value of a constant like π (as was pointed out in basics.R, it is a built-in constant in R, so you don’t need to worry about this for R code).  In other languages, you should encode pi as pi=acos(-1.0), rather than something like pi=3.14159.  I once knew a physics research group that made the disastrous discovery of a typo in their hard-coded value of pi… they had to redo a whole bunch of work once the typo was found.
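In R this is unnecessary, but just to illustrate the identity being exploited (cos(π) = −1, so acos(−1) returns π to machine precision):

```r
# pi to machine precision, without typing its digits by hand
pi_val = acos(-1.0)
```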

Notice in the script that I have a comment block at the top of basics_programming.R that explains what the script does, gives the date of creation and the name of the person who wrote the script (ie; me).  It also gives the contact info for the author of the script, and copyrights the script.  Every script or program you write should have that kind of boilerplate at the top (even if you think the program you are writing will only be used by you alone… you might unexpectedly end up sharing it with someone, and/or the boilerplate makes it clear that the program is *your* code, and that people can’t just pass it off as their own if they come across it).  It also helps you keep track of when you wrote the program.

Stochastic epidemic modelling with Agent Based Models

[After reading this module, you will be aware of the limitations of deterministic epidemic models, like the SIR model, and understand when stochastic models are important.  You will be introduced to three different methods of stochastic modelling, and understand the appropriate applications of each. By the end of this module, you will be able to implement a simple Agent Based stochastic model in R.]

Contents:

Continue reading

SIR modelling of influenza with a periodic transmission rate

[After going through this module, students will be familiar with time-dependent transmission rates in a compartmental SIR model, will have explored some of the complex dynamics that can arise when the transmission rate is not constant, and will understand applications to the modelling of influenza pandemics.]

Contents:

Introduction

Influenza is a seasonal disease in temperate climates, usually peaking in the winter.  This implies that the transmission of influenza is greater in the winter (whether this is because of increased crowding and higher contact rates in winter, and/or higher transmissibility of the virus under favorable winter environmental conditions, is still being debated in the literature).  What is very interesting about influenza is that summer epidemic waves can sometimes be seen with pandemic strains (followed by a larger autumn wave).  An SIR model with a constant transmission rate simply cannot replicate the dual epidemic waves seen within a single year of an influenza pandemic.
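A minimal sketch of what a periodic transmission rate might look like (the parameter values here are purely illustrative, not fit to any data):

```r
beta0   = 0.5   # mean transmission rate, in days^{-1}
epsilon = 0.3   # relative amplitude of the seasonal forcing
period  = 365   # period of the forcing, in days
beta_t = function(t){
   beta0*(1 + epsilon*cos(2*pi*t/period))   # peaks at t = 0 (midwinter, say)
}
```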

Continue reading

SIR infectious disease model with age classes

[After reading through this module, you should have an understanding of contact dynamics in a population with age structure (eg; kids and adults).  You should understand how population age structure can affect the spread of infectious disease.  You should be able to write down the differential equations of a simple SIR disease model with age structure, and you will learn in this module how to solve those differential equations in R to obtain the model estimate of the epidemic curve.]

Contents:

Introduction

In a previous module I discussed epidemic modelling with a simple Susceptible, Infected, Recovered (SIR) compartmental model.  The model presented had only a single age class (ie; it was homogeneous with respect to age).  But in reality, age likely does matter for disease transmission, because kids usually make more contacts during the day than adults. The differences in contact patterns between age groups can have quite a profound impact on the model estimate of the epidemic curve, and also have implications for the development of optimal disease intervention strategies (like age-targeted vaccination, social distancing, or closing schools).
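As a rough sketch of how age structure enters such a model, the force of infection on each age class sums the contributions from contacts with every class (the contact matrix and the other numbers below are illustrative only, not data):

```r
# average daily contacts: row = age class of the susceptible,
# column = age class of the contact (kids first, then adults)
C = matrix(c(18,  6,
              6, 12), nrow = 2, byrow = TRUE)
N = c(300, 700)   # population in each age class
I = c( 10,   5)   # currently infectious in each age class
p = 0.05          # probability of transmission per contact with an infective

# force of infection on a susceptible in age class a:
#   lambda[a] = p * sum_b C[a,b]*I[b]/N[b]
lambda = p*as.vector(C %*% (I/N))
```

With these illustrative numbers, the kids (who make more contacts) experience a larger force of infection than the adults.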
Continue reading

Compartmental modelling without calculus

In another module on this site I describe how an epidemic for certain kinds of infectious diseases (like influenza) can be modelled with a simple Susceptible, Infectious, Recovered (SIR) model. Readers who have not yet been exposed to calculus (such as junior or senior high school students) may have been daunted by the system of differential equations shown in that post.  However, with only a small amount of programming experience in R, students without calculus can still easily model epidemics, or any other system that can be described with a compartmental model.  In this post I will show how that is done.
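The essence of the approach is to step the compartments forward in many small time steps, with no calculus required; a minimal sketch (with purely illustrative parameter values) looks like this:

```r
N       = 1000     # population size
gamma   = 1/3      # recovery rate, in days^{-1}
R0      = 1.5      # reproduction number
beta    = R0*gamma # transmission rate, in days^{-1}
delta_t = 0.1      # time step, in days

S = N - 1; I = 1; R = 0
for (step in 1:(120/delta_t)){
   new_infections = beta*S*I/N*delta_t   # flow from S to I this step
   new_recoveries = gamma*I*delta_t      # flow from I to R this step
   S = S - new_infections
   I = I + new_infections - new_recoveries
   R = R + new_recoveries
}
# note that S + I + R is conserved at every step
```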
Continue reading

Epidemic modelling with compartmental models using R

[After reading through this module you should have an intuitive understanding of how infectious disease spreads in the population, and how that process can be described using a compartmental model with flow between the compartments.  You should be able to write down the differential equations of a simple disease model, and you will learn in this module how to numerically solve those differential equations in R to obtain the model estimate of the epidemic curve]

An excellent reference book with background material related to these lectures is Mathematical Epidemiology by Brauer et al. 

Contents:

Introduction

Models of disease spread can yield insights into the mechanisms and dynamics most important to the spread of disease (especially when the models are compared to epidemic data).  With this improved understanding, more effective disease intervention strategies can potentially be developed. Sometimes disease models are also used to forecast the course of an epidemic, and doing exactly that for the 2009 pandemic was my introduction to the field of computational epidemiology.

There are lots of different ways to model epidemics, and there are several modules on this site on the topic, but let’s begin with one of the simplest epidemic models for an infectious disease like influenza: the Susceptible, Infected, Recovered (SIR) model.

Continue reading

The basics of the R statistical programming language

[After you have read through this module, and have downloaded and worked through the provided R examples, you should be proficient enough in R to be able to download and run other R scripts that will be provided in other posts on this site. You should understand the basics of good programming practices (in any language, not just R). You will also have learned how to read data in a file into a table in R, and produce a plot.]

Contents:

Why use R for modelling?

I have programmed in many different computing and scripting languages, but the ones I most commonly use on a day to day basis are C++, Fortran, Perl, and R (with some Python, Java, and Ruby on the side).  In particular, I use R every day because it is not only a programming language, but also has graphics and a very large suite of statistical tools. Connecting models to data is a process that requires statistical tools, and R provides those tools, plus a lot more.

Unlike SAS, Stata, SPSS, and Matlab, R is free and open source (it is hard to beat a package that is more comprehensive than pretty much any other product out there and is free!).

Continue reading

Welcome to Polymatheia

I am a data scientist with a diverse background in visual analytics, data mining, social media analytics, machine learning, high performance computing, and mathematical and computational dynamical modeling.  As both an academic and a consultant, I have worked on a broad array of research topics in public health and the social sciences, including crime and violence risk analyses, the spread of political and partisan sentiments in a society, disease modelling, and other topics in applied modelling in the social and life sciences, with over 360 publications on a wide variety of subjects. My unique trans-disciplinary skill set enables me to examine a wide range of research questions that are often of broad interest and importance to policy makers and the general public.

My work in the computational social sciences has been high impact, including an analysis of how media can incite panic in a population, and how contagion may play a role in the temporal patterns observed in mass killings in the US. The latter work has received extensive media coverage, including on BBC, CNN, NBC, CBS, FOX, NPR, Reuters, USA Today, Washington Post, Los Angeles Times, The Wall Street Journal, Huffington Post, Christian Science Monitor, Newsweek, The Atlantic, The Telegraph, The Guardian, Vox, and many other local, national, and international news agencies. The study was also profiled in Season 8, Episode 4 of Morgan Freeman’s Through the Wormhole documentary series on the Discovery Channel in 2017, “Is Gun Crime a Virus?”.

Several of my publications in computational epidemiology have also received widespread attention, including forecasts for the progression of the 2009 influenza pandemic, and the 2014 Ebola epidemic in West Africa. Both analyses correctly forecast the progression of the epidemics, and my forecast for the 2009 pandemic was discussed in a special US Senate meeting on October 23, 2009 H1N1 Flu: Monitoring the Nation’s response.

Through my consulting company, Towers Consulting LLC, I provide consulting services for industry, academia, and the public sector in quantitative and predictive analytics, modelling, statistics, and visual analytics.  Notable past and current clients include the VACCINE Department of Homeland Security Center of Excellence, Cure Violence Global, the Carter Center, George Washington University, and Disney. Prospective clients can contact me at TowersConsultingLLC@gmail.com.

On this website I share information and computational tools related to a wide range of my research topics, including visual analytics, applied statistics, and mathematical and computational modelling methods. This website also includes material related to my lectures and seminars in statistical and computational methods in the life and social sciences.

My CV is available here.

Finding sources of data for computational, mathematical, or statistical modeling studies: free online data

[In this module we discuss methods for finding free sources of online data. We present examples of climate, population, and socio-economic data from a variety of online sources.  Other sources of potentially useful data are also discussed.  The data sources described here are by no means an exhaustive list of free online data that might be useful to use in a computational, statistical, or mathematical modeling study.] Continue reading

Finding sources of data: extracting data from the published literature

Connecting mathematical models to reality usually involves comparing your model to data, and finding the model parameters that make the model most closely match the observations in the data. And of course statistical models are wholly developed using sources of data.

Becoming adept at finding sources of data relevant to a model you are studying is a learned skill, but unfortunately one that isn’t taught in any textbook!

One thing to keep in mind is that any data that appears in a journal publication is fair game to use, even if it appears in graphical format only.  If the data is in graphical format, there are free programs, such as DataThief, that can be used to extract the data into a numerical file.

Continue reading

AML 610 Fall 2014: List of modules

The syllabus for this course can be found here.

The write-ups for the final group projects are due Monday, December 1st, 2014.  On Dec 2nd and 3rd students will meet with Prof Towers to receive feedback on their project and write-up.

The project groups will give in-class 20-minute presentations on Monday, Dec 8th, 2014 and Wed, Dec 10th, 2014. By Dec 9th, each group member is to submit a confidential email to Prof Towers detailing their own contribution to the group project, as well as the contributions of the other group members.

The list of modules for the Fall 2014 course in computational and statistical methods for mathematical biologists and epidemiologists:

AML610 Fall 2013 lecture series

Introduction

In this course students will be introduced to statistical modelling methods such as linear regression, factor regression, and time series analysis.  All modelling and data analysis will be performed in the R statistical programming language.  The course meets on Thursdays from 12:00-2:45 pm in PSA 546.  The course syllabus can be found here.

The course will be structured in a series of modules covering various topics.  Some modules may take more than one lecture to cover.  Homework will be occasionally assigned throughout the course, usually after completion of a module.

The final project for the course will account for 50% of the grade, and must be based on a statistical analysis of one or more data sets from this page, chosen in consultation with myself (and perhaps other faculty) to determine whether the topic of the analysis is novel.  Students are encouraged to work together on the project in groups of up to three, but the contribution of each student in the group must be clearly defined.  The final project write-up will be due approximately two weeks before the end of classes (date to be announced).  Oral presentations of the projects will take place in class during the last week of classes.

There is no one textbook that covers the material in this course. Since students in this course have varying backgrounds in statistics, I strongly recommend that you go to the library, take a look at the statistics texts there, and find one or two covering linear regression and/or time series analysis that you think are at your level. Among texts I think are good, Statistical Data Analysis by Cowan is a solid general reference for various statistical methods.  A standard textbook for linear regression is Applied Linear Statistical Models by Kutner et al.  Two good textbooks that cover time series analysis are Time Series Analysis with Applications in R by Cryer, and Time Series Analysis and its Applications with R Examples by Shumway.  The material covered in these texts is much more expansive in scope than the material covered in this course (they cover material that would form the basis of at least three or four different courses).  All of these texts are available off of libgen.info, but note that copyright infringement is a crime… one must never do such a terrible thing.