Running Multiple Similar Jobs

You might want to submit a bunch of jobs that are almost the same, but different in small ways.  This article explains how to do that.

Audience: Faculty, Postdocs, Researchers and Students

This KB Article References: High Performance Computing
Last Updated: June 28, 2018

There may come a time when you want to submit multiple jobs that are almost identical. In that case, you can submit a job array: a sequence of jobs that all run the same PBS script.

Suppose we want to extract every work listed on IMDB into a different file, depending on its type (movie, TV show, etc.). The data (which was obtained from here) can be found in /gpfs/projects/samples/sample-data/imdb, and the following script can be found in /gpfs/projects/samples/array-jobs/array-jobs.pbs.

#!/bin/bash
#PBS -keo
#PBS -qshort
#PBS -l nodes=1:ppn=28,walltime=02:00:00
#PBS -t 0-9

SAMPLE_DATA="/gpfs/projects/samples/sample-data/imdb"

case "$PBS_ARRAYID" in
  0) TYPE="movie" ;;
  1) TYPE="short" ;;
  2) TYPE="tvEpisode" ;;
  3) TYPE="tvMiniSeries" ;;
  4) TYPE="tvMovie" ;;
  5) TYPE="tvSeries" ;;
  6) TYPE="tvShort" ;;
  7) TYPE="tvSpecial" ;;
  8) TYPE="video" ;;
  9) TYPE="videoGame" ;;
  *) echo "Unknown index $PBS_ARRAYID" ;;
esac

zegrep "tt[[:digit:]]{7}[[:space:]]$TYPE" "$SAMPLE_DATA/title.basics.tsv.gz"

If you've submitted jobs before, you'll notice that this script has the same general shape as other job scripts. Let's break it down.

Indexing Jobs

#PBS -t 0-9

Passing the -t flag runs your job as an array. The indexes can be any set of non-negative integers, given as ranges, single values, or a comma-separated mix of both, and they don't have to start from zero. For example:

  • 1-10 numbers each job from 1 to 10.
  • 520-525,600 numbers each job from 520 to 525 (inclusive), and also 600.
  • 6 numbers a single job as 6.

There's a limit to the number of jobs you can queue at once, and each job in the array counts towards it.
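
Because every index counts towards that limit, it can help to count the indexes in a specification before submitting. The sketch below is only an illustration: expand_indexes is a hypothetical helper written for this article, not a PBS command, and it assumes the specification uses only the range and comma forms shown above.

```shell
# expand_indexes is a hypothetical helper, not part of PBS.
# It prints one index per line for a spec such as "520-525,600".
# The body runs in a subshell so IFS and globbing changes stay local.
expand_indexes() (
  IFS=','
  set -f                      # do not glob-expand the spec
  for part in $1; do          # split the spec on commas
    case "$part" in
      *-*) seq "${part%-*}" "${part#*-}" ;;   # a range such as 520-525
      *)   echo "$part" ;;                    # a single index such as 600
    esac
  done
)

# Count how many jobs this array would queue.
expand_indexes "520-525,600" | wc -l
```

For the specification 520-525,600 this counts seven indexes: six from the range plus the single index 600.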

Using the Index

The index of the current running job is given in the environment variable $PBS_ARRAYID.  You can use it anywhere you'd normally use an environment variable, but case statements are among the simplest:

case "$PBS_ARRAYID" in
  0) TYPE="movie" ;;
  1) TYPE="short" ;;
  2) TYPE="tvEpisode" ;;
  3) TYPE="tvMiniSeries" ;;
  4) TYPE="tvMovie" ;;
  5) TYPE="tvSeries" ;;
  6) TYPE="tvShort" ;;
  7) TYPE="tvSpecial" ;;
  8) TYPE="video" ;;
  9) TYPE="videoGame" ;;
  *) echo "Unknown index $PBS_ARRAYID" ;;
esac

There are ten categories of media listed by IMDB, and this case statement assigns exactly one of them to the variable $TYPE. The last pattern, *), means "anything else": it prints $PBS_ARRAYID if the index is anything other than the ten values above it. We recommend handling this case, since it makes your PBS script easier to debug.
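
As an alternative to a case statement, the same mapping can be sketched with a bash array indexed by $PBS_ARRAYID. This is a sketch, not the contents of array-jobs.pbs: lookup_type is a hypothetical function name, and the default value given to PBS_ARRAYID is only there so the snippet can be tried outside of a job.

```shell
# PBS sets PBS_ARRAYID inside a real array job; this default is an
# assumption that lets the snippet run by hand.
PBS_ARRAYID="${PBS_ARRAYID:-3}"

# The same ten media types as the case statement, in index order 0-9.
TYPES=(movie short tvEpisode tvMiniSeries tvMovie tvSeries tvShort tvSpecial video videoGame)

# lookup_type is a hypothetical helper: print the type for an index,
# or report the index on standard error and fail.
lookup_type() {
  local index="$1"
  # Accept only plain digits that fall inside the array bounds.
  # 10# forces base 10 so a value like 08 is not read as octal.
  if [[ "$index" =~ ^[0-9]+$ ]] && (( 10#$index < ${#TYPES[@]} )); then
    echo "${TYPES[10#$index]}"
  else
    echo "Unknown index $index" >&2
    return 1
  fi
}

TYPE="$(lookup_type "$PBS_ARRAYID")" || exit 1
echo "$TYPE"
```

Unlike the case statement above, this version stops the script when the index is unknown, so the search never runs with an empty $TYPE.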

If you're working with files that are already numbered, you can probably skip this step and use $PBS_ARRAYID directly. If the file numbers have leading zeroes, you'll need to zero-pad $PBS_ARRAYID with something like this:

FILE_NUMBER="$(printf '%03d' "$PBS_ARRAYID")"

This stores $PBS_ARRAYID in $FILE_NUMBER, padded with zeroes to at least three digits. 1 becomes 001, 27 becomes 027, and 100 and above are left as-is.
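
To show where the padded number would typically end up, here is a minimal sketch that builds a file name from the index. The naming scheme input_000.dat, input_001.dat, and so on is an assumption made up for this example, as is the function name padded_name.

```shell
# PBS sets PBS_ARRAYID inside a real array job; this default is an
# assumption that lets the snippet run by hand.
PBS_ARRAYID="${PBS_ARRAYID:-27}"

# padded_name is a hypothetical helper for a made-up naming scheme:
# input_000.dat, input_001.dat, ...
padded_name() {
  printf 'input_%03d.dat\n' "$1"
}

INPUT_FILE="$(padded_name "$PBS_ARRAYID")"
echo "$INPUT_FILE"
```

Each job in the array would then read its own $INPUT_FILE without any further lookup table.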

The Actual Task

The part of this script that does the interesting work looks like this:

zegrep "tt[[:digit:]]{7}[[:space:]]$TYPE" "$SAMPLE_DATA/title.basics.tsv.gz"

Here's how it works:

  • The file is compressed in gzip format. We could decompress it to a separate file and search that with egrep, but zegrep decompresses and searches in one step.
  • The regular expression means "the letters tt, then seven digits, one whitespace character (a tab in this case), and then $TYPE". The variable $TYPE is expanded to its value before zegrep is run.

See man grep for more on grep's usage.
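
The regular expression can be tried on a few rows before running it on the full file. The sketch below uses made-up rows (not real IMDB data) in an assumed ID, tab, type, tab, title layout, and plain grep -E instead of zegrep so that no compressed file is needed. One detail it brings out: nothing in the pattern follows $TYPE, so a type that is a prefix of another type (video and videoGame) matches both. The second pattern, with a trailing [[:space:]], is one possible way to tighten that, offered here as a suggestion rather than as part of the original script.

```shell
# Made-up rows in an assumed ID<TAB>type<TAB>title layout.
sample_rows() {
  printf 'tt0000001\tshort\tExample Short\n'
  printf 'tt0000002\tvideo\tExample Video\n'
  printf 'tt0000003\tvideoGame\tExample Game\n'
}

TYPE="video"

# Pattern from the article: it also matches the videoGame row,
# because nothing is required after $TYPE.
loose_count=$(sample_rows | grep -Ec "tt[[:digit:]]{7}[[:space:]]${TYPE}")

# Variant with a trailing whitespace class, so the type has to end
# at the next tab.
strict_count=$(sample_rows | grep -Ec "tt[[:digit:]]{7}[[:space:]]${TYPE}[[:space:]]")

echo "loose=$loose_count strict=$strict_count"
```

With these three rows the first pattern matches two lines and the second matches one.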

Running the Script

Submit this script with qsub, then run qstat -u $USER. The output will look similar to this:

[jbond@login ~]$ qstat -u $USER

sn-mgmt.cm.cluster:
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
189758[].sn-mgmt.cm.cl  jbond       short    array-jobs.pbs      --      1     28       --   02:00:00 Q       --

Notice the brackets ([]) on the Job ID, which in this case starts with 189758[]. The brackets signify that this is an array job. Some of the underlying jobs may be queued while others are running. To see the individual jobs in the array, run qstat again with the -t flag:

[jbond@login array-jobs]$ qstat -tu $USER

sn-mgmt.cm.cluster:
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
189758[0].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-0    --      1     28       --   02:00:00 Q       --
189758[1].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-1    --      1     28       --   02:00:00 Q       --
189758[2].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-2    --      1     28       --   02:00:00 Q       --
189758[3].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-3    --      1     28       --   02:00:00 Q       --
189758[4].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-4    --      1     28       --   02:00:00 Q       --
189758[5].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-5    --      1     28       --   02:00:00 Q       --
189758[6].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-6    --      1     28       --   02:00:00 Q       --
189758[7].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-7    --      1     28       --   02:00:00 Q       --
189758[8].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-8    --      1     28       --   02:00:00 Q       --
189758[9].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-9    --      1     28       --   02:00:00 Q       --

The number in brackets gives the index of that job. Apart from the index and the presence of $PBS_ARRAYID, these are ordinary jobs that are scheduled just like any other job, which means that they may or may not run simultaneously. To delete an individual job, run qdel job_number[index] (e.g. qdel 189758[6]). To delete the entire array, keep the brackets but omit the index (e.g. qdel 189758[]).
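
With larger arrays, the per-job listing gets long, and a count of jobs per state can be easier to read. The following is a sketch under stated assumptions: summarize_array_states is a hypothetical helper, and it assumes the column layout shown above, where the state column (S) is the tenth whitespace-separated field. The three rows fed to it are stand-ins in that layout, with one state changed to R for illustration.

```shell
# summarize_array_states is a hypothetical helper, not a PBS command.
# It reads "qstat -t" style output and prints STATE=COUNT per state,
# counting only rows whose job ID has an index in brackets.
summarize_array_states() {
  awk '$1 ~ /^[0-9]+\[[0-9]+\]/ { count[$10]++ }
       END { for (s in count) printf "%s=%d\n", s, count[s] }' | sort
}

# On the cluster: qstat -tu "$USER" | summarize_array_states
# Here, three stand-in rows take the place of real qstat output.
summarize_array_states <<'EOF'
189758[0].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-0    --      1     28       --   02:00:00 R       --
189758[1].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-1    --      1     28       --   02:00:00 Q       --
189758[2].sn-mgmt.cm.c  jbond       short    array-jobs.pbs-2    --      1     28       --   02:00:00 Q       --
EOF
```

If the qstat layout on your system differs, the field number in the awk program would need to change accordingly.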

If you use the -keo flag, each job in the array will redirect its standard output to ~/$PBS_JOBNAME-$PBS_ARRAYID.o$PBS_JOBID (with e in place of o for standard error).

When Not to Use Job Arrays

Job arrays are more useful in some scenarios than others. You shouldn't use job arrays in these cases:

Your array would exceed your queue limit.

If your array is too big, or if you have other jobs scheduled already, then attempts to schedule it will fail. You'll want to either wait for your other jobs to finish or split up your job array into batches.
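
Splitting an array into batches comes down to cutting one index range into smaller ranges. The sketch below prints those ranges; batch_ranges is a hypothetical helper, and the batch size of 40 is an arbitrary example value, not the actual queue limit.

```shell
# batch_ranges is a hypothetical helper, not a PBS command.
# Arguments: first index, last index, maximum jobs per batch.
# It prints one index range per line, each usable as a -t value.
# The body runs in a subshell so its variables stay local.
batch_ranges() (
  start="$1"; last="$2"; size="$3"
  while [ "$start" -le "$last" ]; do
    end=$((start + size - 1))
    # The final batch may be shorter than the others.
    if [ "$end" -gt "$last" ]; then
      end="$last"
    fi
    echo "$start-$end"
    start=$((end + 1))
  done
)

batch_ranges 0 99 40
```

For indexes 0 to 99 in batches of at most 40, this prints 0-39, 40-79, and 80-99. Each range could then be used as the value of -t for a separate submission once the previous batch has left the queue.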

You want each job in the array to share data with the others.

Job arrays are not intended to be used to share data.  Use MPI for that.  You can grant each job in an array multiple nodes with the -l flag, same as you would with a single job.

Each job in the array is single-threaded.

We must confess that this example is simplified and contrived. zegrep only uses one thread, and since we run it only ten times, fewer than the 28 cores requested on each node, it would be more appropriate to run all ten searches on one node. See /gpfs/projects/samples/array-jobs/better-version.pbs for a more realistic solution.
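
One common way to run several single-threaded commands on one node is to start each in the background and then wait for all of them. The sketch below illustrates that general pattern only; it is not the contents of better-version.pbs, which is not reproduced in this article. It uses a small made-up sample file and plain grep -E so that it can be tried anywhere; a real job would run zegrep against the compressed data file instead.

```shell
# Made-up rows stand in for the real data file (assumption for this sketch).
WORK_DIR="$(mktemp -d)"
printf 'tt0000001\tshort\tExample Short\n'  >  "$WORK_DIR/sample.tsv"
printf 'tt0000002\tmovie\tExample Movie\n'  >> "$WORK_DIR/sample.tsv"
printf 'tt0000003\tmovie\tAnother Movie\n'  >> "$WORK_DIR/sample.tsv"

# Start one background search per type, each writing its own output file.
for TYPE in movie short tvSeries; do
  grep -E "tt[[:digit:]]{7}[[:space:]]$TYPE" "$WORK_DIR/sample.tsv" \
    > "$WORK_DIR/$TYPE.out" &
done

# Block until every background search has finished.
wait

# Show how many rows each search found.
wc -l "$WORK_DIR"/*.out
```

Because all searches run inside a single job, this uses one slot in the queue instead of one per type.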

 

Getting Help


The Division of Information Technology provides support on all of our services. If you require assistance please submit a support ticket through the IT Service Management system.

For More Information Contact


IACS Support System