Job Routing and Priority Management
Partitions on Athena
The Athena cluster is divided into three partitions. A job is only allowed to run in one partition at a time. We impose this constraint because we have three sets of nodes with different performance characteristics. The main partition is the batch_nodes partition, consisting of 112 Phase I production nodes. These nodes are identical to one another and share the same connectivity. The debug_nodes partition consists of a set of 16 Phase I nodes that, although well connected to one another, are poorly connected to the remaining compute nodes in the cluster. Therefore, debug_nodes are segregated from batch_nodes by being in their own partition. The final partition is vpl_nodes. These are newer, faster Phase II nodes with 2GB/core (the rest of Athena has 1GB/core). Because of their different performance characteristics, they are kept in their own partition.
Batch_nodes partition
The main partition is batch_nodes and consists of 112 Phase I nodes. All of the nodes in this partition are owned by exactly one of four groups: physics, astro, int, cenpa. Correspondingly, all of the nodes owned by these groups are in the batch_nodes partition.
Jobs submitted to the default queue (see below) run in either the batch_nodes partition or the vpl_nodes partition.
Debug_nodes partition
The debug_nodes partition consists of the 16 Phase I nodes that are not well connected to the rest of the cluster. Their primary purpose is for smaller debugging and interactive jobs. Other than their network connectivity, they are identical in performance and configuration to the batch_nodes nodes.
Jobs submitted to the debug queue (see below) run only in the debug_nodes partition.
VPL_nodes partition
The vpl_nodes partition consists of 12 Phase II nodes that were purchased more recently. They are faster and have more memory (2GB/core vs. 1GB/core) than the original Athena nodes. Because of this performance difference, they are kept in their own partition. They are owned by the "vpl" group.
Jobs submitted to the default queue (see below) run in either the batch_nodes partition or the vpl_nodes partition.
Queues on Athena
The Athena cluster has 3 queues: default, debug, and scavenge. Which queue you submit your job to determines which partitions it can run in. You select a queue with the -q parameter. For example, to submit a job to the debug queue:
[richardc@athena0 ill] qsub -q debug -l nodes=2:ppn=8,walltime=30:00 myscript.csh
Default queue
The default queue is, obviously, the default queue for job submissions. Jobs submitted here run in either the batch_nodes partition or the vpl_nodes partition. The default walltime limit is 4 hours. The maximum walltime limit is 18 hours, except for special priority jobs (see below), which have a limit of 1 week (168 hours). Submitting a job longer than your limit will result in the system deferring your job in the BatchHold state.
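For example, a submission to the default queue needs no -q option; the node count and walltime below are only illustrative:
% qsub -l nodes=4:ppn=8,walltime=12:00:00 myscript.csh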
Debug queue
Jobs submitted to the debug queue will run only in the debug_nodes partition. Debug jobs have a maximum walltime limit of 1 hour. Submitting a job longer than 1 hour will result in the system deferring your job in the BatchHold state. Keeping "debug" jobs separate from "batch" jobs guarantees high availability of resources for short, interactive use.
In order to use the debug nodes interactively, simply submit your job with the -I option. This tells PBS not to process your batch script (in fact, you don't need to provide a batch script at all); instead, PBS will drop you at a command line on the compute node. You can then run your program interactively just like on your workstation.
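For example, a minimal interactive request (the node count and walltime are just illustrative):
% qsub -I -q debug -l nodes=1:ppn=8,walltime=1:00:00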
"Why should I bother going through PBS to get a node interactively when I can just ssh to one instead?" It is important to always use PBS to gain access to the compute nodes. If you circumvent this process, then PBS will not know you are using that node, and it will schedule another job on top of you. Bypassing PBS is the equivalent of cutting in line. Using a shared resource only works if everyone plays by the rules. If too many people start circumventing the scheduler, then we will have to turn off ssh access to compute nodes (like most supercomputing centers do). This will make everybody's life more inconvenient.
IF YOUR JOB SPANS MULTIPLE NODES you will be put at the command-line of the first node of the job (just like your batch script would have been). You can then access the other nodes that have been allocated to you just like your batch script would have done: through PBS or SSH. The environment variable $PBS_NODEFILE will have the location of a file that lists all of the nodes allocated to your job. Access this file to see which other nodes you can use.
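For example, a quick way to list the unique hostnames allocated to your job from within the session (a minimal sketch using standard shell tools):
% sort -u $PBS_NODEFILE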
Scavenge queue
Jobs submitted to the scavenge queue are eligible to run in either the batch_nodes partition or the debug_nodes partition (with the restriction that the job will not span partition boundaries when it runs). This flexibility comes at a price: if your job happens to be running in the debug_nodes partition and somebody else submits a job to the debug queue, that job will preempt yours if there are not enough free debug nodes. If your job instead finds room to run in the batch_nodes partition, it will be identical in priority and preemption behavior to jobs submitted to the default queue.
Therefore, the scavenge queue is designed for jobs that want to take advantage of the high availability of the nodes in the debug_nodes partition, at the expense of being preemptable by debug queue jobs. They will also compete and run on an equal footing with standard default jobs in the batch_nodes partition. Like default jobs, the maximum walltime limit is 18 hours. Submitting a job longer than this will result in the system deferring your job in the BatchHold state.
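For example, a scavenge submission looks just like any other queued submission with a different -q argument (the node count and walltime below are only illustrative):
% qsub -q scavenge -l nodes=8:ppn=8,walltime=12:00:00 myscript.csh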
NEW 9/24/2008: Due to some of the issues we have been having with debug jobs failing to preempt scavenge ones, scavenge jobs will not be able to use the debug_nodes partition between 8AM and 8PM Monday-Friday. These nodes will be held idle for debug jobs only.
Note that if you submit a scavenge job for more than 16 nodes (i.e. larger than the debug_nodes partition), it will be functionally identical to a default job. Furthermore, scavenge jobs longer than 12 hours will be unable to fit within the time constraints of the debug_nodes partition except on weekends.
Requesting Special Access to Nodes
UPDATED 3/2009 for new VPL nodes.
If you belong to the groups "astro," "vpl," "cenpa," "int," or "physics," you have the ability to request that your jobs get special handling on the nodes that are owned by that group. "astro," "int," and "physics" own 32 nodes each, "cenpa" owns 16, and "vpl" owns 12. Members of the "vpl" group are also members of the "astro" group and can choose whichever association best fits their job. There are two levels of special handling that each group can request: "priority" and "preemption." Each of these levels of handling also has an extended wallclock limit of 1 week (168 hours).
Note that you can request priority/preemptive handling only if you submit jobs to the default queue.
Priority access
If you request "priority" access, your job will be placed ahead of non-priority jobs for the nodes that your group owns. To request "priority" access, simply use the -l qos=<group> option for qsub, where <group> is the name of your group (e.g. astro, cenpa, int, physics, vpl). For example, if you are in the "physics" group and want priority access to the nodes owned by "physics":
[richardc@athena0 ill] qsub -l qos=physics,nodes=8:ppn=8,walltime=30:00 myImportantScript.csh
If two or more jobs of the same group request priority handling, the one submitted first will run first.
NEW for Astro nodes 9/24/08: QoSs "physics," "cenpa," and "int" will be preemptable by "physics_now," "cenpa_now," and "int_now" jobs respectively (see the "Preemptive access" below). The "astro" and "vpl" QoSs, on the other hand, will not be preemptable by "astro_now" or "vpl_now" respectively.
Preemptive access
Let's say you have an important deadline tomorrow and you really need your nodes now. You can opt to kick everyone else off of the nodes that your group owns (ah, that sweet thrill of ownership!). To do this, submit your job with -l qos=<group>_now. For example, if you are in the "physics" group and want to run a job on your nodes right now and preempt anybody else who may be using them:
[richardc@athena0 ill] qsub -l qos=physics_now,nodes=8:ppn=8,walltime=30:00 myVeryImportantScript.csh
Note that in this example, priority "physics" jobs will also be preempted (and "cenpa_now" preempts "cenpa," and "int_now" preempts "int"). So be prepared to be very very nice to your colleagues if you exercise this option.
NEW for Astro nodes 9/24/08: The "astro" and "vpl" nodes behave slightly differently, as requested by the Astronomy Department. If you are in the "astro" or "vpl" groups and want to run a job on your nodes right now, you will only preempt non-astro/non-vpl jobs. If you really need to run immediately, you can use the showq_i command to determine which jobs are running with "astro" or "vpl" priority by looking in the column labeled "Q" (for "QoS") and finding jobs with "as" in that column. Contact the user(s) who own the relevant priority jobs and ask them if they would be willing to cancel their jobs for you. As long as your preemptive job is in the queue before they cancel their jobs, your job will be the next one to run on those nodes (in other words, you do not have to do anything special to "reserve" the nodes that are being freed for you; your preemptive priority already ensures that you are next in line).
Preemptive priority should only be used when time is critical, as it will kill the running jobs it displaces and cause them to be requeued.
What if my job got preempted?
If your job is preempted, the default behavior is that it is automatically requeued. In this case, it will have a priority equal to the priority it would have had if it actually waited in the queue the entire time (i.e. you do not have to return to the back of the line).
If you do not wish your jobs to be requeued when preempted, but rather want them canceled instead, submit the job with the PBS option -r n.
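For example (a sketch; the resource request here is arbitrary):
% qsub -r n -l nodes=2:ppn=8,walltime=4:00:00 myscript.csh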
If you can, it will be easier for you in the long run if you write your scripts so that they can detect if they have been restarted and compensate for that.
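Here is one minimal csh sketch of that idea; the checkpoint file name (checkpoint.dat), the program name (mysimulation), and its -restart flag are hypothetical stand-ins for whatever your own code uses:
#!/bin/csh
# Hypothetical restart-aware batch script: resume from a checkpoint if one exists.
cd $PBS_O_WORKDIR
if ( -e checkpoint.dat ) then
    # A previous (preempted) run left a checkpoint behind, so resume from it.
    ./mysimulation -restart checkpoint.dat
else
    # No checkpoint found: this is a fresh start.
    ./mysimulation
endif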
Can I find out who the [expletive] was that preempted my job?
Not yet. Many of you asked for this at the last Athena Town Meeting. The vendor of our resource management software is currently implementing this feature at our request, and we will let users know when it is available.
NEW: Elastic job walltime limits and core counts 9/24/08
In response to feedback at the last Athena Town Meeting, we have enabled a functionality of MOAB that permits jobs to have "elastic" footprints in terms of job "length" (execution time) and "width" (number of cores).
Elastic job length
Ordinarily, when you submit a job with the -l walltime=<time> option, that wallclock limit is fixed for the life of the job.
When you submit your job, MOAB allows you to specify an optional minimum wallclock limit, "minwclimit". This is the minimum time that your job can run. In other words, MOAB will not start your job unless it can guarantee that it will run for at least this amount of time. The regular "walltime" parameter becomes the maximum walltime that your job will run (this is still subject to the normal walltime constraints, by the way).
For example, let's say I want my job to run at least 1 hour, but I want MOAB to let me run up to 18 hours:
% qsub -l minwclimit=1:00:00,walltime=18:00:00 myScript.csh
MOAB will start your job and guarantee at least 1 hour of runtime. After an hour, this runtime will be extended in 5-minute increments until another job comes along that has higher priority. This strategy is very effective for users who wish to exploit the queue backfill, especially for jobs in the scavenge queue. Since the amount of time that your job will run is not determined at its launch, there is no way for your job to know when it will be terminated. Therefore, your job must be able to checkpoint its progress as it runs.
More information on this option is available on the MOAB website.
Elastic job width
It is also possible to vary the number of cores on which your job runs. This is accomplished by using the "trl" ("Task Request List") capability of MOAB. You can specify a set of core counts for your job, e.g.:
% qsub -l trl=2:4:8:16,minwclimit=1:00:00,walltime=18:00:00 myScript.csh
will request that my task be run on 2, 4, 8, or 16 cores for somewhere between 1 and 18 hours. Note that you can also use the "trl" approach to specify unique runtimes as well:
% qsub -l trl=2@1000:4@500:8@250 myScript.csh
Tells MOAB to give you 2 cores for 1000 minutes, or 4 cores for 500 minutes, or 8 cores for 250 minutes. More information on Task Request List usage is available on the MOAB website.
Your PBS script can determine the number of cores that it is ultimately being run on by counting the lines in the $PBS_NODEFILE, e.g.:
set numprocs = `wc -l < $PBS_NODEFILE`
will set the shell variable "numprocs" to the number of cores on which your job is executing. You can then pass this parameter to other commands in your script, such as mpirun.
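For example, a sketch of feeding that count to mpirun (the exact mpirun flags depend on your MPI installation, and myprogram is a placeholder):
set numprocs = `wc -l < $PBS_NODEFILE`
mpirun -np $numprocs -machinefile $PBS_NODEFILE ./myprogram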
HPC Etiquette: So where should I actually submit my job?
Odds are that you are a member of one of the privileged groups, and consequently have multiple job submission avenues open to you. Here are some things to consider when deciding which qsub incantation to use:
- Debug jobs should go in the debug queue. You will get immediate access to up to 16 nodes, which is very useful for debugging problems that your code has on the cluster.
- Completely monopolizing your group's nodes may make you unpopular within your group. Therefore, if you have lots of processing to do, you may actually benefit (at least socially) from submitting with normal priority.
- If you submit a priority job, it will only ever run on the nodes owned by your group. It is possible that the nodes in your group are already being used while nodes in other groups are totally free. In this case, submitting a priority job will result in you waiting, whereas a job with normal priority would start instantly. Therefore, you may wish to use the showstate command (documented below) to figure out which nodes are free.
- In general, the more long-term computing you have to do and/or the more resilient your code is to restarts, the better off you are using the scavenge queue. This queue will give you access to the most nodes; the trade-off is that you are more subject to preemption.
- Similarly, if you prefer to be preempted only rarely but still have a lot of long-term computing to do, you are better off using the default queue with normal priority. This will give you access to all of the batch nodes while only having a small chance of preemption.
- Preemption should only be exercised when actually needed. If you need immediate access to nodes for testing or interactive use, please use the debug queue instead.
[edit] Using "showstate" to figure out which nodes are free
You can use the showstate
command to figure out which nodes belonging to which groups and partitions are free.
% showstate

cluster state summary for Mon May 5 17:34:50

    JobID              S User      Group    Procs   Remaining            StartTime
    ------------------ - --------- -------- ----- -----------  -------------------
(A) 62909              R ytakimot  int         16     2:17:13  Mon May 19 15:52:03
(B) 62910              R ytakimot  int         16     2:39:08  Mon May 19 16:13:58
(C) 62859              R wdetmold  physics    128     3:54:45  Mon May 19 13:29:35
(D) 62874              R wdetmold  physics    128     4:15:48  Mon May 19 13:50:38
(E) 62875              R wdetmold  physics    128     7:42:40  Mon May 19 17:17:30
(F) 62872              R cbrook    astro      112    14:07:51  Mon May 19 13:42:41
(G) 62854              R cbrook    astro      112  6:16:50:47  Mon May 19 10:25:37
(H) 62855              R cbrook    astro      112  6:18:37:16  Mon May 19 12:12:06
(I) 62871              R cbrook    astro       32  6:23:43:08  Mon May 19 17:17:58

usage summary: 9 active jobs  98 active nodes

          [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1][1][1][1][2][2][2][2][2][2][2][2][2][2][2][3][3][3][3][3]
          [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5]

Rack 03:  [ ][ ][ ][E][E][E][E][E][B][B][A][D][F][F][F][F][F][F][F][F][ ][!][ ][E][E][E][E][E][E][A][C][C][ ][ ][ ]
Rack 04:  [D][D][D][D][F][F][C][C][C][C][C][C][C][F][C][F][C][D][D][C][C][ ][ ][ ][ ][ ][C][ ][F][F][C][C][ ][ ][ ]
Rack 05:  [ ][ ][ ][D][D][D][E][D][D][D][D][ ][E][ ][ ][ ][ ][D][D][ ][ ][ ][ ][ ][ ][ ][ ][ ][E][E][ ][E][ ][ ][ ]
Rack 06:  [G][I][I][I][H][H][H][H][H][H][H][G][G][G][G][H][H][H][G][G][G][G][G][G][H][G][G][H][H][G][I][H][ ][ ][ ]

Key: [?]:Unknown  [*]:Down w/Job  [#]:Down  [ ]:Idle  [@] Busy w/No Job  [!] Drained
There are 3 partitions: batch_nodes, debug_nodes, and vpl_nodes. A job cannot cross partition boundaries. The batch_nodes partition is further subdivided into 4 groups. Jobs that are not group-specific (all scavenge queue jobs, and default queue jobs that do not request special priority) can cross the group boundaries.
batch_nodes partition:
- INT: Rack 03, nodes 1-32
- Physics: Rack 04, nodes 1-32
- CENPA: Rack 05, nodes 04-13, 18-19, 29-32
- Astro: Rack 06, nodes 1-32
debug_nodes partition:
- Rack 05, nodes 01-03, 14-17, 20-28
vpl_nodes partition:
- All racks, nodes 33-35
Using the showstate output above, we could submit up to a 13-node job to the default queue with normal priority and get instant access to the cluster (it would be routed to either the batch_nodes or the vpl_nodes partitions). We could also submit up to a 16-node job to the debug queue and get instant access. If we are in the physics group, we can have immediate priority non-preemptive access to at most 6 nodes.