Running R in batch with ASU high performance computing resources

In this module, intended for students at ASU, we discuss how to use ASU high performance computing resources to run an R script on many CPUs simultaneously.

Running an R script in parallel on many CPUs simultaneously with the aid of high performance computing resources allows you to quickly answer research questions that would take hours, days, or even weeks with just your laptop or desktop alone.  Batch processing is a submission system that allows you to submit a bunch of "jobs" (program runs) to many CPUs at a time.

All students in AML610 or AML612 can obtain an ASU Research Computing account for free by going to this link, as can any student who is working with faculty at ASU on research.  Fill in the name of the professor teaching your course (or who you are working with), and their email.  Under “Cluster” choice, choose the “Agave public cluster”.

For security reasons, ASU requires users to access the Agave cluster using the Cisco Virtual Private Network (VPN) client.  From your “My ASU” page, go to the “My Apps” tab, and search for “Cisco”.  Download the Cisco AnyConnect VPN app for your operating system.

Once you have downloaded the app, open it, and connect to sslvpn.asu.edu.  Use your Asurite ID and password.

Once you are logged in to sslvpn.asu.edu through the Cisco AnyConnect client, go to this link: https://agaverstudio.rc.asu.edu and log in with your Asurite ID and password.

This will start R Studio Agave in your browser window, which gives you access to the Agave cluster.  This is (more or less) what you will see:

[Screenshot: the R Studio Agave interface in the browser]

Your directories in the right hand panel will be different than mine.  I have several directories on the Agave cluster related to past work.  It is good form to create a new directory for each project.  Thus, let’s do an example of parallel computing in batch that we will put in the agave_R_test directory.  Click on “New Folder” in the “Files” tab on the right hand panel, and create a directory “agave_R_test”.

You can see that a new folder has been created in the list of folders in the right hand panel:

[Screenshot: the new agave_R_test folder listed in the Files tab]

Now, in the left hand panel notice there is a tab to access the R console, and a tab to access the “Terminal”.  The Terminal tab connects you directly to the Agave cluster with the Unix operating system.  Don’t use the Terminal tab.  Only use the R console tab.

In the R Studio Agave console in the left hand panel, type

setwd("~/R")
file.copy("/sample/slurm/slurmscript.tmpl",".")
file.copy("/sample/slurm/batchtools.conf.R",".")
setwd("..")

Running your first job in batch in R Studio on the Agave cluster

In the R console in the left hand panel, type

setwd("agave_R_test")

Then, in the R Studio Agave console, type:

library(batchtools)
reg = makeRegistry(file.dir = "./registry", seed = 1)

This will result in a sub-directory called “registry” appearing off of the agave_R_test folder.  The “seed” option in the makeRegistry() method sets the random seed for each job you run to the job.id plus the value of the seed that you passed (in this case, 1, but I could have put whatever integer I liked for that value).

To see what is in the agave_R_test folder, in the right-hand Files panel, click on the agave_R_test folder.  You will see that there is a registry sub-folder in it.  The registry sub-folder is where the results output by our batch jobs will be stored.

Download the files R_agave_simple_example.R and agave_mydata.csv to a working directory on your laptop or desktop.  Also download the files batchtools.conf.R and slurmscript.tmpl.

The R_agave_simple_example.R file contains a function called myfun:

myfun = function(n){
  aseed = 13413
  set.seed(aseed+n)       # set the random seed using the job index n
  a = read.table("~/agave_R_test/agave_mydata.csv",sep=",",as.is=T,header=T)   # read in the data file
  y = rnorm(1000)         # sample 1000 Normally distributed random numbers
  return(list(seed=aseed+n,mean_ax=mean(a$x),mean_ay=mean(a$y),mean_y=mean(y),sd_y=sd(y)))
}

All this function does is set the random seed using the value of n passed to the function, read in the data frame, randomly sample 1000 Normally distributed random numbers, and return a (one line!) list containing the random seed, the means of the columns of a, and the mean and std deviation of the randomly sampled numbers.  It's kind of a pointless script, but it does show how to set the random seed, read in a file, sample random numbers, and output some results.
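Below the function definition, R_agave_simple_example.R also contains the lines that actually submit the jobs to the cluster.  As a rough sketch (the details of the actual script may differ, and the resource values here are just the ones used later in this module), they look something like this, creating one batch job for each value of n from 1 to 100:

library(batchtools)
batchMap(fun = myfun, n = seq(1,100))       # one job for each value of n
submitJobs(resources = list(walltime = 3600, memory = 1024, ncpus = 1))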

Go back to the browser window where you are running R Studio Agave (https://agaverstudio.rc.asu.edu).  In your agave_R_test folder in the right hand panel, upload the agave_mydata.csv, R_agave_simple_example.R, batchtools.conf.R and slurmscript.tmpl files to the directory.

Now, in the R console in the left panel of the R Studio Agave window, type

source("R_agave_simple_example.R")

You will see something like this:

[Screenshot: the output printed when the jobs are submitted]

Now, go to the registry->results folder in the right hand Files panel, and you will see many files that look like this:

[Screenshot: the .rds result files in the registry/results folder]

These files (there are 100 of them, one per job) are the output from the myfun() method in R_agave_simple_example.R.  The files are in RDS format, each storing a single R object, and can be read into R using the readRDS() method.
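If you want to peek at a single result before merging everything, you can read one of these files directly in the R console; for example (the exact file name under registry/results depends on the job id):

res1 = readRDS("registry/results/1.rds")   # read the result of job 1
str(res1)                                  # a list with the seed and the summary statistics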

I’ve created a little script merge_results.R that can be used to concatenate these files together into one data frame.  Download merge_results.R to a working directory on your laptop or desktop, then in your browser R Studio Agave window, upload the script to Files in the agave_R_test folder (*not* to the results folder!).

Now, in your R console in the left hand panel of the R Studio Agave window, type

source("merge_results.R")

The concatenated results are in a data frame called merged_data that you can then analyse in your R console.  This data frame is also written to the file merged_results.csv, which will be produced in your working directory on R Studio Agave.  You can download it to your laptop or desktop from the Files pane by ticking the box next to the file, then clicking on More->Export.

[Screenshot: exporting merged_results.csv via More->Export in the Files pane]

The merge_results.R script is a general script that can be run on the output of any R script you run with R Studio Agave.  Just ensure that you always run it from the same working directory you ran your batch script in.
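In case you are curious what the merging step boils down to, here is a minimal sketch (not necessarily the actual contents of merge_results.R) that reads each .rds file in registry/results and row-binds the results into one data frame:

files = list.files("registry/results", pattern="\\.rds$", full.names=T)    # all the .rds result files
results_list = lapply(files, function(f) as.data.frame(readRDS(f)))        # read each one into a data frame
merged_data = do.call(rbind, results_list)                                 # row-bind them into one data frame
write.csv(merged_data, "merged_results.csv", row.names=F)                  # write out the merged results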

More complicated example: fitting an SIR model with non-Exponentially distributed infectious state sojourn times to influenza outbreak data

In this example, we'll fit an SIR model with Erlang distributed sojourn times in the infectious state to the 2007-2008 Midwest influenza outbreak data.  The standard SIR model assumes Exponentially distributed sojourn times in the infectious state, meaning that the most probable time to recover is immediately after you are infected, which is clearly highly unrealistic.  The Erlang distribution is a special case of the Gamma distribution with integer "shape" parameter, k.  The larger the value of k, the narrower and more sharply peaked the distribution is.  We can include Erlang distributed sojourn times in the infectious state by dividing the infectious compartment into k stages, each of which is left at rate k*gamma.  The value of 1/gamma is the overall average time an individual spends in the infectious state.  This is known as the "linear chain trick".
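As an illustration of the linear chain trick, here is a generic sketch of the derivatives function for an SIR model with k infectious stages, written for the deSolve package (this is not the actual model code in midwest_erlang_fit.R, and the parameter values at the end are purely illustrative):

library(deSolve)

# SIR model with an Erlang distributed infectious period, via the linear chain trick.
# The state vector is c(S, I_1, ..., I_k, R); each of the k infectious stages is left
# at rate k*gamma, so the total time spent infectious is Erlang with mean 1/gamma.
sir_erlang = function(t, y, parms){
  with(as.list(parms), {
    S = y[1]
    I = y[2:(k+1)]                                       # the k infectious stages
    R = y[k+2]
    N = S + sum(I) + R
    dS = -beta*S*sum(I)/N
    dI = rep(0, k)
    dI[1] = beta*S*sum(I)/N - k*gamma*I[1]               # new infections enter the first stage
    if (k > 1) dI[2:k] = k*gamma*(I[1:(k-1)] - I[2:k])   # flow between successive stages
    dR = k*gamma*I[k]                                    # recovery out of the last stage
    list(c(dS, dI, dR))
  })
}

parms = c(beta=0.5, gamma=1/3, k=4)                      # illustrative values only
y0    = c(S=9999, rep(0.25,4), R=0)                      # one initial infective spread over the k stages
out   = ode(y=y0, times=0:100, func=sir_erlang, parms=parms)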

In the R Studio Agave Files window (at bottom right), create a folder called midwest_influenza off of your Home directory.

Change to this folder, and upload batchtools.conf.R and slurmscript.tmpl files (ie: first download them from here to your laptop, then in R Studio Agave upload the files from your laptop to the midwest_influenza working directory on R Studio Agave).

Also upload the file midwest_influenza_2007_to_2008.dat, and the files midwest_erlang_fit.R and  midwest_erlang_fit_submission.R to your midwest_influenza working directory on R Studio Agave.

The midwest_erlang_fit.R script contains a function called my_fitting_func() that does many Monte Carlo iterations, sampling parameters for an SIR model with non-exponentially distributed infectious period, along with an over-dispersion parameter alpha that is used to calculate the Negative Binomial negative log-likelihood.  The my_fitting_func() function outputs a data.frame with the results of the Monte Carlo iterations.  Many of the Monte Carlo iterations result in negative log-likelihoods nowhere near the minimum value; thus, to help control the size of the output file, I make a selection at the end of the function to keep only the Monte Carlo iterations that yielded values of the negative log-likelihood in the ballpark of the minimum value.  If you don't do this, and run on many CPUs at once, your concatenated output files will end up being huge and unmanageable.
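As a sketch of the likelihood calculation described above (the actual parameterisation used in midwest_erlang_fit.R may differ), a Negative Binomial negative log-likelihood with over-dispersion parameter alpha, such that the variance of the counts is mu + alpha*mu^2, can be computed like this:

# generic sketch of a Negative Binomial negative log-likelihood, where observed is the
# vector of observed incidence counts and mu is the vector of model-predicted counts
negbinom_negloglike = function(observed, mu, alpha){
  -sum(dnbinom(observed, mu=mu, size=1/alpha, log=T))
}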

The midwest_erlang_fit_submission.R script contains the R instructions to submit the my_fitting_func() function to 100 different CPUs on the Agave cluster.  In the R Studio Agave console window, type

source("midwest_erlang_fit_submission.R")

The system will then print out messages telling you that it is submitting the jobs.

Typing

getStatus()

in the R Studio Agave console window will give you the status of your jobs.  If you’ve done all of the above correctly, you will see your jobs running 50 at a time.

Once getStatus() indicates that all the jobs have completed, you are ready to concatenate the output into a single file.  The R file merge_results.R does this for you.  Upload this file to your R Studio Agave working directory (i.e. first download it from here to your laptop, and then upload it from your laptop to R Studio Agave in the R Studio Agave Files window in the lower right).

In the R Studio Agave console window, type

source("merged_results.R")

This will produce the file merged_results.csv in your R Studio Agave working directory. Check the file merged_results.csv in the R Studio Agave Files window at bottom right. Then go to More->Export, and download the file to your laptop.

You can now read this file into R running on your laptop, and create plots of the negative log-likelihood versus your parameter hypotheses, and determine the best-fit parameters and their one-standard deviation and 95% confidence intervals using fmin+1/2 and fmin+0.5*1.96^2, respectively. For example, in the R file midwest_erlang_fit.R I have defined a function called plot_results() that does just this.

source("midwest_erlang_fit.R")
r = read.table("merged_results.csv",header=T,as.is=T,sep=",")
plot_results(r)

This produces the following plot:

[Plot: negative log-likelihood versus the sampled parameter hypotheses for the Midwest influenza fit]

The script also outputs:

[Text output: the best-fit parameter values and their confidence intervals]

The fit seems to favour a high value of k, which implies that the probability distribution for the sojourn in the infectious state is very narrow, and centered on 1/gamma.  I haven’t tried values of k>150, but my guess is that if I did, we still wouldn’t find a clear minimum.  With this kind of fit, based on just the epidemic curve, the best we can say is that k is greater than 2 with 95% confidence.

This fit is not only a good example of a case where we can only put a lower (or upper) bound on a parameter; the plots also show evidence of local minima in the goodness-of-fit statistic.  Those local minima are why the graphical Monte Carlo method is preferred over "black box" fitting methods.  Black box methods, depending on your initial guess at the parameters, could easily walk you to one of the local minima, rather than the global minimum in the goodness-of-fit statistic.

In a later module, we'll talk about how we can use other information to help constrain the parameters of our fits.  For example, in this paper by Carrat et al (2008) they reviewed influenza volunteer challenge studies, where people were voluntarily infected with influenza, and titres were then measured regularly to examine how much of the virus they were shedding.  The plot below is from the paper, and it is pretty clear from it that the width of the probability distribution for recovery spans several days (favouring lower values of k over higher values):

[Figure: viral shedding over time in the volunteer challenge studies reviewed by Carrat et al (2008)]

Step-by-step notes for running jobs in batch with R Studio Agave

  1. Make sure you are logged into sslvpn.asu.edu with the Cisco AnyConnect client in order to connect to R Studio Agave
  2. Sign in to R Studio Agave at https://agaverstudio.rc.asu.edu/auth-sign-in
  3. In the lower right Files pane, create a new working directory under your Home directory
  4. In the lower right Files pane, change to that directory and upload batchtools.conf.R and slurmscript.tmpl files to the directory (ie: first download them from here to your laptop, then in R Studio Agave upload from your laptop to the working directory on R Studio Agave).
  5. In your script that you wish to run on Agave, define a function (say, my_fitting_func) and pass to it a parameter n.  In the function, make sure you have a line set.seed(XXXX+n), where you replace XXXX with some integer of your choice.  This line is important: it controls the random seed used by each run of the script.
  6. Make sure all the functions used by my_fitting_func are defined within that function, and that all R libraries it needs are loaded within the function.  For example, here is what the first few lines of midwest_erlang_fit.R look like:
     my_fitting_func=function(n){ 
       set.seed(82914+n) 
       require("chron") 
       require("deSolve")
  7. Make sure that your my_fitting_func() function doesn't do any plotting or print things out using cat().  You can plot things later using the output from R Studio Agave.
  8. Make sure your my_fitting_func function outputs a data frame with the results of your graphical Monte Carlo iterations.  When using a negative log likelihood, it helps to control the size of the output file to only output the results with neg log likelihood in the ball park of the minimum value.  For example midwest_erlang_fit.R has the last few lines:
      vdat = data.frame(neglog_like=vneglog_likelihood,R0=vR0,t0=vt0,k=vk,alpha=valpha) 
      i = which(vdat$neglog_like<=(min(vdat$neglog_like)+1000)) 
      vdat = vdat[i,] 
      return(vdat)
    } # end definition of my_fitting_func
  9. Try out your fitting script on your local laptop to ensure it appears to be working correctly, and that the ranges you are sampling the parameters over appear to be narrow enough.  Repeat this process, adjusting the parameter ranges as many times as needed, to ensure that they aren't too broad or too narrow, and that the best-fit values appear to be well centered in the ranges.
  10. Make sure you’ve uploaded your data file to your working directory on R Studio Agave!
  11. In the R script that you will use to submit the jobs to batch, source the file that defines my_fitting_func, and put the lines (see, for example, midwest_erlang_fit_submission.R):
    source("midwest_erlang_fit.R")
    require("batchtools") 
    ncpu = 100 
    unlink("./registry",recursive=T) 
    reg = makeRegistry(file.dir = "./registry", seed = 1) 
    batchMap(fun = my_fitting_func, n = seq(1,ncpu))
    submitJobs(resources = list(walltime = 3600, memory = 1024, ncpus = 1))
  12. Now, in the R Studio Agave console, use setwd() to set the working directory to the folder where you have your data and R scripts, and then source your R batch submission script.
  13. Typing getStatus() in the R Studio Agave console window will give you the status of your jobs.  Note that if all the jobs you submitted are listed under "Error:", you've got a problem.  Common causes of this are: not having batchtools.conf.R and/or slurmscript.tmpl in your working folder, or not having the data file in your working folder.  Or the script needs an R library that you did not require within the my_fitting_func() function.  If you get errors, in the lower-right Files window, go to your Home directory.  There you will see a whole bunch of files that look like slurm.XXXX.err.  Open the most recent (down at the bottom), and it will tell you why the error occurred.  Fix the error in your R script(s) and try sourcing your submission script again from the R Studio Agave console window.
  14. Once getStatus() shows that all of your jobs have finished running, you’re ready to merge the results.  Upload the R script merge_results.R to your R Studio Agave working directory (i.e. download it from here to your laptop first, then upload it from your laptop to R Studio Agave).  Now source merge_results.R in the R Studio Agave console window.  This will produce the file merged_results.csv in your R Studio Agave working directory.
  15. Check the file merged_results.csv in the R Studio Agave Files window at bottom right.  Then go to More->Export, and download the file to your laptop.
  16. You can now read this file in, and create plots of the negative log-likelihood versus your parameter hypotheses, and determine the best-fit parameters and their one-standard deviation and 95% confidence intervals using fmin+1/2 and fmin+0.5*1.96^2, respectively.  For example, in the R file midwest_erlang_fit.R I have defined a function called plot_results() that does just this (a minimal sketch of the threshold calculation is also given just after this list).
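As mentioned in steps 15 and 16, once you have merged_results.csv, the fmin+1/2 and fmin+0.5*1.96^2 thresholds give you the parameter uncertainties.  Here is a minimal sketch of that calculation for the R0 parameter, assuming the column names from the data frame defined in midwest_erlang_fit.R (plot_results() does something along these lines, but its exact implementation may differ):

r    = read.table("merged_results.csv", header=T, as.is=T, sep=",")
fmin = min(r$neglog_like)                                   # minimum negative log-likelihood

range_below = function(dat, param, threshold){
  i = which(dat$neglog_like <= threshold)                   # iterations below the threshold
  return(range(dat[[param]][i]))
}

best_R0   = r$R0[which.min(r$neglog_like)]                  # best-fit value of R0
R0_one_sd = range_below(r, "R0", fmin + 0.5)                # one standard deviation range
R0_95ci   = range_below(r, "R0", fmin + 0.5*1.96^2)         # 95% confidence interval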

 
