THE PARASOL PARALLEL BATCH SYSTEM

 

 

OVERVIEW

Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to thousands of CPUs.  Parasol was developed initially by Jim Kent, and extended by other members of the Genome Bioinformatics Group at the University of California Santa Cruz.  Parasol is currently a fairly minimal system, but what it does it does well.   It can start up 500 jobs per second.   It restarts jobs in response to the inevitable systems failures that occur on large clusters.  If some of your jobs die because of your program bugs, Parasol can help manage restarting the crashed jobs after you fix your program as well.  The parasol source is at http://www.soe.ucsc.edu/~kent/src/parasol.tgz.

PARASOL QUICK START

 

To start things rolling you need to make a directory to put the batch in, and create a job list in this directory.  A basic job list is just a series of command lines, one for each job.  A fancier job list can contain checks on the input and output files for each job.  While the jobs in a batch are generally somehow related, they need not be.  Here's a sample job list that compiles some code in parallel:

  cc -c lions.c

  cc -c tigers.c

  cc -c bears.c

  cc -c turkeys.c

  cc -c bats.c

To run this on parasol you'd log onto the machine running the parasol server and type:

  mkdir compileTheAnimals

  cd compileTheAnimals

  para make ../job.lst

assuming you'd already created job.lst in the parent directory.  The para program will hang out, periodically printing a little information, until all the jobs are done or one of them fails repeatedly.

 

If your jobs have problems you can retrieve information about them with

  para problems

This will, among other things, copy over the standard error output from the cluster nodes.  You may need to put the parasol host in your .rhosts file for this to work.

 

RUNNING A BIG BATCH OF JOBS WITH PARASOL

Parasol is really meant for large batches of jobs.  One big set of jobs we do at the Genome group is comparing the mouse genome against the human genome.  Since humans and mice have a common ancestor, there are stretches of DNA that are similar (homologous) between the two species.  However, since this common ancestor lived almost 100 million years ago, most of the DNA has changed quite a bit.  It's not possible to find homologous regions with a simple string search.  Instead sophisticated "alignment" techniques must be used.  Aligning whole genomes against each other is one of the most compute-intensive areas in bioinformatics.  Fortunately it is a problem that can be easily distributed across many machines.  We can break the human genome into approximately 200 pieces, and the mouse genome into approximately 1000 pieces.  We then align each piece of the human genome against each piece of the mouse genome.  This lets us divide the big job into 200,000 little jobs.  Each of these little jobs might take about 10 minutes on one computer.  On 1000 computers the whole set of jobs should take less than two days.
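The rough arithmetic behind that estimate:

  200 human pieces x 1,000 mouse pieces = 200,000 jobs
  200,000 jobs x 10 minutes each        = 2,000,000 CPU-minutes
  2,000,000 CPU-minutes / 1,000 CPUs    = 2,000 minutes, or roughly a day and a half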

 

Since parasol wants a line of input for each job, clearly we need an automatic way of generating the job list.  This is where the program "gensub2" comes in.  Gensub2 takes as input two lists of files and a template file.  The template file contains three parts: a preamble (everything before the #LOOP line), which is copied literally to the output; a middle section, which is repeated in the output once for each pairing of files from the two lists, with the filenames substituted in; and a postscript (everything after the #ENDLOOP line), which is copied literally to the output.  We've used gensub2 with a number of schedulers – Condor and Codine as well as Parasol – and it has proven a simple but effective tool.  Here is a gensub2 template file for creating mouse/human alignments with the BLAT program:

  #LOOP

  /cluster/bin/blat $(path1) $(path2) out/$(root1)_$(root2)

  #ENDLOOP

Note that depending on how parasol was set up on your system you may need to include the full path name for executables.  To create a job list, do the following steps:

  mkdir mouseVsHuman

  cd mouseVsHuman

  mkdir out

  ls -1 /data/human/dna/pieces/*.seq > human.lst

  ls -1 /data/mouse/dna/pieces/*.seq > mouse.lst

  gensub2 human.lst mouse.lst template jobList
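Assuming the sequence pieces are named human001.seq, human002.seq, mouse001.seq and so on (hypothetical names), the first few lines of the generated jobList would look something like this, in gensub2's default diagonal order:

  /cluster/bin/blat /data/human/dna/pieces/human001.seq /data/mouse/dna/pieces/mouse001.seq out/human001_mouse001
  /cluster/bin/blat /data/human/dna/pieces/human002.seq /data/mouse/dna/pieces/mouse001.seq out/human002_mouse001
  /cluster/bin/blat /data/human/dna/pieces/human001.seq /data/mouse/dna/pieces/mouse002.seq out/human001_mouse002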

You could at this point run "para make" on jobList.  In the end, if nothing went wrong, you'd have a directory 'out' full of files with names like human001_mouse001, human001_mouse002, and so on.  However this is a big enough job that it's very likely something will go wrong.  It would be good to do a little sanity check on the output.

It wouldn't be a bad idea to write a program to double-check the output yourself.  However, Parasol can do some basic checks automatically.  If we change our template file to read:

 #LOOP

 /cluster/bin/blat $(path1) $(path2) {check out line+ out/$(root1)_$(root2)}

 #ENDLOOP

then Parasol will check that each file has at least one line, and that the last line is complete.   This won’t catch every problem, but it will catch the vast majority of them.  Parasol will also of course detect program crashes.  It’s possible to put checks on input files as well.
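For example, a template that also verified the inputs (not strictly needed here, just a sketch using the check syntax described in the command reference below) might read:

 #LOOP

 /cluster/bin/blat {check in exists+ $(path1)} {check in exists+ $(path2)} {check out line+ out/$(root1)_$(root2)}

 #ENDLOOP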

 

Since this is a job liable to take more than a day,  rather than running ‘para make’,  I like to take a more interactive approach as follows:

  para create jobList   # Create job tracking database

  para try              # Run ten jobs

  para check            # Check up on jobs

  para push             # Push up to 100,000 jobs

  para check            # Check up on jobs

  para push             # Retry any crashed jobs so far

  para time             # Collect timing information.

It's always good to do a 'try' before a 'push', letting the first ten jobs run for five minutes or so and then checking up on them.  Note that "para push" will only resubmit jobs that have already crashed at the time the push is run.  If you're interactively monitoring your jobs, periodically pushing and checking (and timing) is useful.  Once you decide things are going well you can issue a "para shove" command, which will check on your jobs every few minutes, restarting jobs if necessary, until the jobs are all done or one of the jobs has failed (crashed repeatedly).  If you want to go back to interactive monitoring of your jobs after a "para shove", just hit <control C> to stop the shoving.

 

Typically there are three types of problems you encounter with a big batch of jobs.  The most common problem is a new bug in the program you're running, which causes all or most of your jobs to crash.  Doing a "para try" will usually find these problems without wasting a lot of cluster time.  Another common problem is that one of the machines in the cluster is acting flaky for some reason – often because it is having an I/O problem of some sort.  When a "para check" shows that some jobs are crashing, this is the first thing to check.  Do a "para problems" to get a report on the jobs that are having troubles, and see if they are all on the same host.  If so, tell the system administrators about the problem.  They can use the "parasol remove machine" command to take the machine out of parasol service until the problem can be fixed.  The third common problem is due to rarely-manifesting bugs in your code.  The symptom of this is the same job crashing repeatedly on different machines.  There's no cure for this except doing some debugging – usually starting with isolating the minimum input needed to cause the problem.
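For instance, if "para problems" showed that every recent crash happened on the same host (kkr1u07.kilokluster.ucsc.edu here is just a made-up name), an administrator could take it out of service with:

  para problems

  parasol remove machine kkr1u07.kilokluster.ucsc.edu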

 

THE PARASOL ENVIRONMENT

Parasol sets up a few environment variables prior to running jobs.  The PARASOL variable is set to the version of Parasol that is currently running.  Programs can use this to determine whether they are running under Parasol.  The JOB_ID variable is set to a unique number for each job that is run.  How the PATH variable is set up depends on how Parasol was installed.  At UCSC we have it set up to run things out of the bin directory of each user as well as /bin, /usr/bin, and /cluster/bin.  In general Parasol's PATH will be different from the path in your shell, so when in doubt include the appropriate directory names in front of your executables.  You can also depend on the USER, HOME, and HOST variables being set to your user name, your home directory, and the name of the machine the program is running on.  Beyond this, what is in the environment varies from installation to installation.
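A job script can use these variables to tell whether it is running on the cluster.  Here's a minimal sketch, assuming a Bourne-style shell:

  #!/bin/sh
  # Report where and under what scheduler this job is running.
  if [ -n "$PARASOL" ]; then
      echo "job $JOB_ID running under Parasol $PARASOL on $HOST as $USER"
  else
      echo "not running under Parasol"
  fi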

INSTALLING PARASOL

 

To install parasol you'll need to create a list of the machines in your cluster, and then run 'paraNodeStart', which will launch a 'paraNode' server on each machine.  Next pick a machine, preferably one without much else going on on it, and launch a 'paraHub' server there.  Parasol users will then log onto this machine to launch jobs using either the 'para' or the 'parasol' commands.  You can bring down the paraNode servers with the 'paraNodeStop' command.  You can bring down the paraHub with 'paraHubStop'.
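For example, bringing the whole system down cleanly from the hub machine (using the same machine list used at startup) might look like:

  paraNodeStop machineList

  paraHubStop now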

 

Here’s a list of the user programs in the Parasol package:

·        para – a user-level command which manages a single batch of jobs

·        gensub2 – a user-level command for generating large batches of jobs, each on a different set of files.

Here’s a list of the administrative programs:

·        parasol – administrative command for looking at all active jobs in the system and adding or removing nodes on the fly. Users may find this command handy too, though several aspects of it can only be accessed by root.

·        paraNode – runs jobs on a compute node

·        paraHub – job scheduler

·        paraNodeStart – start up paraNode daemons

·        paraNodeStop – bring down paraNode daemons

·        paraNodeStatus – query status of paraNode daemons

·        paraHubStop – bring down paraHub daemon

 

The first thing an administrator needs to do is create a list of machines to use as compute nodes.  This list is a tab-separated file with one line for each machine and the following fields:

·        <host name> - Network name of host. 

·        <number of cpus> - Number of CPUs to use in the machine

·        <meg of memory> - Megabytes of memory in machine

·        <local temp dir> - A directory in the machine for temp files.  Parasol puts the standard error output here.  Ideally, files in this directory that have not been accessed for a week should be removed periodically.

·        <local data dir>  - A directory where local data resides.  Ignored for now.

·        <local data gig> - Size of local data directory.  Ignored for now.

·        <network switch> - Name of network switch this is on.  Ignored for now.


Here is a small example:

  testBad.node.ucsc 2 1024 /tmp /scratch 36000 r1
  kkr1u01.kilokluster.ucsc.edu 2 1024 /tmp /scratch 36000 r1
  kkr1u02.kilokluster.ucsc.edu 2 1024 /tmp /scratch 36000 r1

 

To start up the Parasol system do the following:

  ssh parasolHost

  mkdir parasolRootDir

  cd parasolRootDir

  cp machineList .

  su root

  paraNodeStart machineList hub=parasolHost \
      userPath=bin sysPath=/share/bin:/usr/local/bin \
      log=/tmp/log

  paraHub machineList -log=log &

You may well want to customize the userPath and sysPath according to the conventions used at your own installation.  The -log parameter will create a log that averages about 500 bytes per job, which can be useful for debugging.  Omit the -log parameter for no log (not even syslog).  For further security use the -subnet parameter to paraHub, which will restrict incoming messages to the hub to a particular subnet.  A common example of this would be

  paraHub machineList -log=log -subnet=124.168 &

 

The paraHub daemon will detect machines that are down and work around them.  Every ten minutes it will see if a machine that is down has come back up.  It's possible to add new machines without bringing down the daemon using

  parasol add machine machineName tempDir

ARCHITECTURE

The Parasol system consists primarily of a scheduling server and two clients: parasol and para.  The parasol scheduling system consists of a hub daemon, a heartbeat daemon, and a number of spoke daemons running on the server machine, and a node daemon running on each compute node.  The parasol and para clients both communicate only with the hub daemon.  The parasol client is quite lightweight, doing little more than forwarding messages from the command line to the parasol hub and printing the replies.  The para client is more substantial.  It creates a database around a batch of jobs submitted by the user and tracks the progress of these jobs through the scheduler.  Para does not expect the scheduler or the compute nodes to be completely reliable, and will resubmit jobs as necessary.

Figure 1.  Parasol processes (circles) and message flow (arrows).  All processes reside on the scheduling machine except for the node processes.    A spoke process can send messages to any node. 

The Scheduling System

The Hub Daemon and Its Spokes and Pulse

The hub daemon is the heart of the parasol scheduling system.  The hub daemon spawns the heartbeat daemon and the spoke daemons on startup.  The hub daemon then goes into a loop processing messages it receives on the hub TCP/IP socket.  The hub daemon does not do anything time consuming in this loop.  The main things the hub daemon does are to put jobs on the job list, move machines from the busy list to the free list, and call the 'runner' routine.

 

The runner routine looks to see if there is a free machine, a free spoke, and a job to run.  If so it will send a message to the spoke telling it to run the job on the machine, and then move the job from the 'pending' to the 'running' list, the spoke from the freeSpoke to the busySpoke list, and the machine from the freeMachine to the busyMachine list.  This indirection of starting jobs via a separate spoke process avoids the hub daemon itself having to wait to find out if a machine is down.

 

When a spoke is done assigning a job, the spoke sends a 'recycleSpoke' message to the hub, which puts the spoke back on the freeSpoke list.  Likewise when a job is done, the machine running the job sends a 'job done' message to the hub, which puts the machine back on the free list, writes the job exit code to a results file, and removes the job from the system.

 

Sometimes a spoke will find that a machine is down.  In this case it sends a 'node down' message to the hub as well as the 'spoke free' message.   The hub will then move the machine to the deadMachines list, and put the job back on the top of the pending list.

 

The heartbeat daemon simply sits in a loop sending heartbeat messages to the hub every so often (every 30 seconds currently), and sleeping the rest of the time.   When the hub gets a heartbeat message it does a few things:

·        It calls runner to try and start some more jobs.  (Runner is also called at the end of processing a recycleSpoke, jobDone, addJob or addMachine message.  Typically runner won't find anything new to run in the heartbeat, but this is put here mostly just in case of unforeseen issues.)

·        It calls a routine to see if machines on the dead list have come back to life.

·        It calls a routine to see if jobs the system thinks have been running for a long time are still running on the machine they have been assigned to.  If the machine has gone down, it is moved to the dead list and the job is reassigned.  If the machine is up, it lets the hub know if it has already finished the job.  This does happen occasionally – perhaps for 1 in 10,000 jobs – due to the jobDone message failing to get through.  If the job is still running, the hub is notified and checks back on the job five or ten minutes later.

This whole system depends on the hub daemon being able to finish processing messages fast enough to keep the connection queue on the hub socket from overflowing.  Each job involves 3 messages to the hub socket:

    addJob - from a client to add the job to the system

    recycleSpoke - from the spoke after it's dispatched the job

    jobDone - from the compute node when the job is finished

On some of the earlier Linux kernels we had trouble with the connection queue overflowing when dispatching lots of short jobs.  This seemed to be from the jobDone messages coming in faster than Linux could make connections, rather than from the hub daemon being slow.  On the kilokluster, with a more modern kernel, this has been less of a problem – only manifesting on 1,000 CPUs when running 50,000 jobs that do nothing.  (The shorter the job, the more stressful it is on the scheduler.)  Should overflow occur, the heartbeat processing will gradually rescue the system in any case, but the throughput will be greatly reduced.

 

When the hub daemon first comes up it queries each node daemon for running and recently finished jobs.  If the hub is brought down for maintenance or crashes due to a program error, CPU time already invested in running jobs on nodes is usually recovered.  Currently jobs that have not yet been sent to nodes will be lost, though if they were submitted via the 'para' client described below, the para client will resubmit these jobs after noting 'tracking errors.'  This feature of reconnecting to running jobs is new as of paraHub version 2.

The Node Daemon

The node daemon is relatively simple.  It forks off a process to run a job in response to a "run" message over its TCP/IP socket from a spoke.  It sends a "jobDone" message to the hub with the return status code and the user and system run times when a job finishes.  It keeps a count of free CPUs, and refuses to start a job unless a CPU is free.  It also will kill a job in response to a "kill" message from the hub.

 

The node daemon also participates in the heartbeat processing.  It responds to a "resurrect" message with an "alive" message, as part of the process that restores a machine that was down to use.  It responds to "check jobId" messages by saying whether the job is running, has finished, or has never been seen.

 

The node daemon needs to be run as root, so that it can setuid to run jobs as any particular user.

The Para Client

The para client manages batches of jobs through the scheduler.  It is designed to catch jobs which may have run into problems of any sort, and give the user a chance to rerun them after the problem is fixed.  The major input to para is a job list.  Each job can have checks associated with it that run before and after the job itself.  Initially para reads the job list and transforms it into a job database.  The central routine of para, paraCycle, reads the job database, queries the hub to see what jobs are running and waiting, looks at the results file to see what jobs are finished, performs output checks on the finished jobs, sends unsubmitted jobs or jobs that need to be rerun to the hub, updates the database in memory, and writes it back out.  The database is in a comma-delimited text format with one job per line.  The job database keeps track of the timing and status of each job submission.  The code to read and write this database was generated with AutoSql.  Para will avoid loading the hub with more than 100,000 jobs at a time, and will only submit failed jobs three times before giving up on them.

PARASOL COMMAND REFERENCE

Note that all commands will produce a usage summary if run with no arguments.  In general this summary will be more up to date than the descriptions here.

para

Manage a batch of jobs in parallel on a compute cluster.

Normal usage is to do a 'para create' followed by 'para push' until the job is done.  Use 'para check' to check status.

Command Line

usage:

   para command [command-specific arguments]

The commands are:

 

para create jobList

   This makes the job-tracking database from a text file with the

   command line for each job on a separate line.  See below for

   further description of the jobList.  The ‘in’ checks in the

   jobList will be performed at this time.

para push

   This pushes forward the batch of jobs, submitting jobs to parasol.

   It will limit parasol queue size to something not too big and

   retry failed jobs

   options:

      -retries=N   Number of retries per job - default 4.

      -maxQueue=N  Number of jobs to allow on parasol queue

                   -  default 200000

      -minPush=N  Minimum number of jobs to queue - default 1. 

                  Overrides maxQueue

      -maxPush=N  Maximum number of jobs to queue - default 100000

      -warnTime=N Number of minutes job runs before hang warning

                  - default 4320 (3 days)

      -killTime=N Number of minutes job runs before push kills it

                  - default 20160 (2 weeks)

para try

   This is like para push, but only submits up to 10 jobs

para shove

   Push jobs in this database until all are done or one fails after N

   retries

para make jobList

   Create database and run all jobs in it if possible.  If one job

   crashes repeatedly this will fail.  Suitable for inclusion in

   makefiles.  Same as a 'create' followed by a 'shove'.

para check

   This checks on the progress of the jobs.

para stop

   This stops all the jobs in the batch

para chill

   Tells system to not launch more jobs in this batch, but

   does not stop jobs that are already running.

para finished

   List jobs that have finished

para hung

   List hung jobs in the batch

para crashed

   List jobs that crashed or failed output checks the last time they

   were run.

para failed

   List jobs that crashed after repeated restarts.

para problems

   List jobs that had problems (even if successfully rerun).

   Includes host info

para running

   Print info on currently running jobs

para time

   List timing information

Para Job Lists

Job lists are files with one command per line.  The following two lines would be a simple job list:

    echo hello boss

    ls -l

Job lists can have built-in checks on either the input or the output.

The overall format for a check is:

   {check in|out exists|exists+|line|line+ fileName}

When the job is run, this will be replaced by simply fileName.

Checks "in" are used to make sure that input files are correct.

Checks "out" are used to make sure that output files are correct.

There are four types of checks:

   exists - file must exist

   exists+ - file must exist and be non-empty

   line - file must end with a complete line

   line+ - file must end with a complete line and be non-empty

Here's an example of a blat job spec:

  blat {check in line+ human.fa} {check in line+ mouse1.fa} {check out line+ hm1.psl}

  blat {check in line+ human.fa} {check in line+ mouse2.fa} {check out line+ hm2.psl}

Currently it’s not possible to redirect input or output in a job list.
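One workaround (a local convention, not a Parasol feature) is to push any redirection into a small wrapper script and put the wrapper in the job list instead.  A sketch, with hypothetical names:

  #!/bin/sh
  # sortOne.sh - wrapper so the job list line itself needs no redirection
  sort "$1" > "$2"

The corresponding job list line would then be something like the following (remembering that you may need to give the full path to the wrapper, as noted above):

  sortOne.sh in/piece001.txt {check out line+ out/piece001.txt}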

gensub2

Generate job submission file from template and two file lists.

 

usage:

    gensub2 <file list 1> <file list 2> <template file> <output file>

This will substitute each file in the file lists for $(path1)

and $(path2) in the template between #LOOP and #ENDLOOP, and write

the results to the output.  Other substitution variables are:

       $(path1)  - full path name of first file

       $(path2)  - full path name of second file

       $(dir1)   - first directory. Includes trailing slash if any.

       $(dir2)   - second directory

       $(lastDir1) - The last directory in the first path. Includes
                     trailing slash if any.

       $(lastDir2) - The last directory in the second path. Includes
                     trailing slash if any.

       $(root1)  - first file name without directory or extension

       $(root2)  - second file name without directory or extension

       $(ext1)   - first file extension

       $(ext2)   - second file extension

       $(file1)  - name without dir of first file

       $(file2)  - name without dir of second file

       $(num1)   - index of first file in list

       $(num2)   - index of second file in list

The <file list 2> parameter can be 'single' if there is only one file

list.  By default the order is diagonal, meaning if the first list is ABC and the second list is abc the combined order is Aa Ba Ab Ca Bb Ac Cb Bc Cc.  This tends to put the largest jobs first if the file lists are both sorted by size.  The following options can change this:

       -group1 - write elements in order Aa Ab Ac Ba Bb Bc Ca Cb Cc

       -group2 - write elements in order Aa Ba Ca Ab Bb Cb Ac Bc Cc
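If there is only a single list of files, the second list argument is just the word 'single'.  For example (with hypothetical file names):

    gensub2 mouse.lst single template jobList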

parasol

Parasol is the name given to the overall system for managing jobs on

a computer cluster and to this specific program.  This program is

intended primarily for system administrators, and some options of this command can only be used if logged in as root.  Regular users can do a 'parasol status' or any of the 'parasol list' commands, though, and may find these useful for finding out how busy the computer cluster is.

 

   parasol add machine machineName tempDir  - add new machine to pool

   parasol remove machine machineName   - remove machine from pool

   parasol add spoke - add a new spoke daemon

   parasol [options] add job command-line  - add job to list

         options: -out=out -in=in -dir=dir -results=file -verbose

   parasol remove job id - remove job of given ID

   parasol ping [count] - ping hub server to make sure it's alive.

   parasol remove jobs userName [jobPattern] - remove jobs submitted

         by user that match jobPattern.  The jobPattern may include

         the wildcards ? and *, though these will need to be preceded

         by escapes (‘\’) if run from most shells.

   parasol list machines - list machines in pool

   parasol list jobs - list jobs one per line

   parasol list users - list users one per line

   parasol list batches - list batches one per line

   parasol status - print status of machines, jobs, and spokes.

 

ADMINISTRATIVE COMMAND REFERENCE

paraHub

paraHub - parasol hub server.

usage:

    paraHub machineList

Where machineList is a file with machine names in the

first column, and number of CPUs in the second column.

This file may include additional columns as well.

options:

    spokes=N Number of processes that feed jobs to nodes - default 30

    jobCheckPeriod=N  Minutes between checking on job - default 10

    machineCheckPeriod=N Seconds between checking on machine
                         - default 20

    subnet=XXX.YYY.ZZZ Only accept connections from subnet

           (example subnet=192.168)

    nextJobId=N Starting job ID number, by default 100,000 past

                last job number used in previous run.

    log=logFile Write a log to logFile. Use 'stdout' here for console

    logFlush    Flush log with every write.  This will slow down the

                system somewhat, but can be useful in debugging.

    noResume    Don’t attempt to reconnect with jobs still running

                or recently finished on nodes.  Used for debugging.

 

paraHubStop

paraHubStop - Shut down paraHub daemon

usage:

   paraHubStop now

paraNode

paraNode - parasol node server.

usage:

    paraNode start

options:

    log=file - file may be 'stdout' to go to console

    umask=022 - file creation mask, defaults to 002

          so that files are world readable and group writable.

    userPath=bin:bin/i386 User dirs to add to path

    sysPath=/sbin:/local/bin System dirs to add to path

    hub=host - restrict access to connections from the hub

    cpu=N - Number of CPUs to use.  Default 1

 

paraNodeStart

paraNodeStart - Start up parasol node daemons on a list of machines

usage:

    paraNodeStart machineList

where machineList is a file containing a list of hosts

Machine list contains the following columns:

     <name> <number of cpus>

It may have other columns as well

options:

    exe=/path/to/paranode   

    umask=022 - file creation mask, defaults to 002

    userPath=bin:bin/i386 User dirs to add to path

    sysPath=/sbin:/local/bin System dirs to add to path

    log=/path/to/log/file - this is best on a local disk of the node

    hub=machineHostingParaHub - nodes will ignore

                            messages from elsewhere

    rsh=/path/to/rsh/like/command

paraNodeStop

paraNodeStop - Shut down parasol node daemons on a list of machines

usage:

    paraNodeStop machineList

paraNodeStatus

paraNodeStatus - Check status of paraNode on a list of machines

usage:

    paraNodeStatus machineList

options:

    -long  List details of recent and current jobs.