AML 610 Module XII: submitting jobs in batch to the ASU Saguaro distributed-computing system

The ASU Advanced Computing Center (A2C2) maintains the Saguaro distributed-computing system, which currently has over 5,000 processor cores.

ASU students in the spring semester of AML610 should have already applied for and received an account on the Saguaro system (per the instructions in last month’s email describing how to apply for an account).

Saguaro allows you to run multiple jobs simultaneously in batch, directing standard output to a log file.  For this course, we will be using Saguaro to solve a system of ODEs under a hypothesis for the parameter values and initial conditions (either chosen in a parameter sweep, or randomly sampled within some range); the output of the ODEs will then be compared to a data set, and a best-fit statistic (such as least squares, Pearson chi-squared, or maximum likelihood) computed.  The parameter values and best-fit statistics are then printed to standard output.

Access to cloud computing resources, and knowledge of how to utilize those resources, have many potential applications in modelling.  Learning how to use Saguaro as a tool for solving problems related to this course can thus open up many further avenues of future research to you.

Homework #5 is due Thursday, April 18th, 2013 at 6pm. Data for the homework can be found here.

This document gives lots of useful information on the usage of Saguaro.

Log on to Saguaro by typing

ssh -l <your user name> saguaro.fulton.asu.edu

Once logged on, create a directory for your project by typing

mkdir <directory name>

A2C2 prefers that Saguaro users compile and link C++ code with the Intel icc compiler when the code is intended to be submitted in batch.  To load the compiler, type on the command line:

module load intel
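To confirm that the module loaded and that the compiler is now on your path, you can type, for example:

module list
icc --version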

Copy all of your C++ files over to Saguaro by running scp on your local computer.  For instance:

scp <filename> <username>@saguaro.fulton.asu.edu:~/<directory>/<filename>

In the examples presented here, we will be using the code related to the SIR model, as presented in the previous module. The file makefile_sir_saguaro is a makefile that uses the Intel compiler to compile and link the relevant code.  Go ahead and compile the program SIR_initial_fit.
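The exact target name depends on how the makefile is written (the target name below is an assumption; check makefile_sir_saguaro itself), but the build step will look something like:

make -f makefile_sir_saguaro SIR_initial_fit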

Take a careful look at the SIR_initial_fit.cpp program.  Notice that in the first line of the program I set the random seed using the time stamp, in conjunction with an argument passed to the program.  If I were running the program interactively, I would have to supply that argument on the command line.  For instance, with the command

./SIR_initial_fit 50

I have deliberately designed the program this way so that when I submit the jobs in batch, a unique ID will be fed to the program as an argument on the command line, thus ensuring that each program will run with a unique random seed. If you don’t do this, all of the programs will have the same random seed, which would defeat the purpose of running the programs in parallel!
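The actual seeding code is in SIR_initial_fit.cpp; the following is just a minimal sketch of the idea, with variable names of my own choosing rather than those used in the course code:

#include <cstdlib>
#include <ctime>
#include <iostream>

int main(int argc, char* argv[]) {
  // Combine the current time with the integer passed on the command line,
  // so that jobs launched at the same moment still get different seeds.
  int job_id = (argc > 1) ? atoi(argv[1]) : 0;
  unsigned int seed = static_cast<unsigned int>(time(NULL)) + job_id;
  srand(seed);
  std::cout << "Using random seed " << seed << std::endl;
  // ... randomly sample parameter hypotheses with rand(), solve the SIR
  //     model, and print the parameters and best-fit statistic ...
  return 0;
}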

Now we need to develop a script that will submit the program to run in batch on Saguaro.  The Saguaro system uses the Portable Batch System (PBS) for submitting jobs to its queues.  In the file sir.pbs I have implemented the PBS directives that will be used when running jobs on the Saguaro batch system.  Notice that the final line launches the program (via mpiexec), and the job ID $PBS_JOBID is passed as an argument to the program.  This argument is unique for each submitted job, thus ensuring that each of the programs will run with a different random seed.
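Your sir.pbs already contains the necessary directives; the sketch below simply illustrates the general shape of such a script (the job name, resource requests, and module line here are assumptions for illustration, not a copy of the course file):

#!/bin/bash
#PBS -N SIR_initial_fit          # job name shown by qstat
#PBS -l nodes=1:ppn=1            # one core per job
#PBS -l walltime=00:30:00        # wall-clock time limit for the job
#PBS -j oe                       # merge standard output and standard error

cd $PBS_O_WORKDIR                # start in the directory the job was submitted from
module load intel

# Pass the unique job ID to the program so that each job gets its own random seed
mpiexec -n 1 ./SIR_initial_fit $PBS_JOBID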

So, this is something to keep in mind when running any kind of stochastic method in parallel; if all of your jobs are returning the exact same output, you haven’t set a unique random seed for each job.

To submit, for instance, four jobs to the Saguaro batch queue, type on the command line

for i in {0..3}; do qsub sir.pbs; done

To check on the progress of the jobs, type

qstat

Just after submitting the jobs you will see something like this:

Job id           Name             User            Time Use S Queue
----------------- ---------------- --------------- -------- - -----
5817902.newmoab    SIR_initial_fit  yourusername          0 Q normal         
5817903.newmoab    SIR_initial_fit  yourusername          0 Q normal         
5817904.newmoab    SIR_initial_fit  yourusername          0 Q normal         
5817905.newmoab    SIR_initial_fit  yourusername          0 Q normal

If the “S” column has a Q in it, that means the job is queued and has not yet started running. R means the job is running, and C means the job has completed.

Once the jobs start running you will see something like this:

Job id            Name             User            Time Use S Queue
----------------- ---------------- --------------- -------- - -----
5817902.newmoab   SIR_initial_fit  yourusername    00:02:12 R serial
5817903.newmoab   SIR_initial_fit  yourusername    00:02:13 R serial
5817904.newmoab   SIR_initial_fit  yourusername    00:02:13 R serial
5817905.newmoab   SIR_initial_fit  yourusername    00:02:12 R serial

If you wanted to delete one of the jobs from the queue, because you had submitted it by mistake (for instance, the last job in the list above), you would type

qdel 5817905.newmoab

Once the jobs have completed (i.e., once the time limit in sir.pbs has been reached), the “R”s will change to “C”s, and if you list the directory you will find the output files *.local.

To concatenate these files into one grand output file called temp.out, type

cat *.local >> temp.out
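If, for example, the best-fit statistic is printed in the first column of each output line (check the program’s output format; this column choice is an assumption), you could pull out the best few fits directly on Saguaro with something like:

sort -g -k 1 temp.out | head -n 5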

Now all that is left to be done is to scp the output file back to your local computer for further analysis 😉
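From your local computer, that would look something like (with your own user name and directory filled in):

scp <username>@saguaro.fulton.asu.edu:~/<directory>/temp.out .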
