How to apply a TORQUE wrench

Job Wrangling
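
The usual first look is from the Moab side (-v asks checkjob to be verbose; the job id here is the one from the example below):

 % checkjob -v 62741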

If checkjob does not yield anything useful, try asking Torque:

 % tracejob 62741

 Job: 62741.athena0.npl.washington.edu

 05/16/2008 18:17:19  S    enqueuing into scavenge, state 1 hop 1
 05/16/2008 18:17:19  S    Job Queued at request of wdetmold@athena0.npl.washington.edu, owner =
                           wdetmold@athena0.npl.washington.edu, job name =
                           PBS_t10z0y0x0_P_l2864f21b676m010m050.306, queue = scavenge
 05/16/2008 18:17:19  S    Job Modified at request of root@athena0.npl.washington.edu
 05/16/2008 18:17:19  S    Job Run at request of root@athena0.npl.washington.edu
 05/16/2008 18:17:19  S    send of job to compute-3-22.local failed error = 15010
 05/16/2008 18:17:21  S    unable to run job, MOM rejected/rc=2
 05/16/2008 19:14:06  S    Holds uso released at request of root@athena0.npl.washington.edu

Verify all nodes are correctly reporting:

 % pbsnodes -a

Taking a node "offline" (functionally equivalent to "down", but better suited to marking nodes by hand):

 % pbsnodes -o compute-3-22.local
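
To put the node back in service later, clear the offline mark:

 % pbsnodes -c compute-3-22.local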

Forcibly purge a job:

 % qdel -p <jobid>
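
Plain deletion is worth trying first; -p purges the job on the server without consulting the MOM at all (compare the "purging job without checking MOM" line in the case study below):

 % qdel <jobid>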

Deleting all jobs of a specific user

This was given to me by Ty Robinson, but I (Jeff) have not tried it yet, so test it first! :)

qdel $(showq | grep <username> | cut -f1 -d" ")
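
An alternative, equally untested here, that stays within PBS and avoids parsing showq output: qselect prints the matching job ids one per line, ready for xargs.

qselect -u <username> | xargs qdel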


Case study: Node(s) report state "[@] Busy w/No Job"

% showstate
...
              [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1][1][1][1][2][2][2][2][2][2][2][2][2][2][3][3][3][3][3][3]
              [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5]
...
Rack      06: [C][C][P][P][P][A][C][B][C][C][B][P][P][B][@][@][@][@][Q][@][@][@][@][@][@][@][@][@][@][@][@][P][ ][ ][ ]

Key:  [?]:Unknown [*]:Down w/Job [#]:Down [ ]:Idle [@] Busy w/No Job [!] Drained

What is the matter with nodes 15-31 (except 19) on rack 6???

% pbsnodes > ~/blah
% emacs ~/blah
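
(If you already know which nodes are suspect, pbsnodes also takes node names directly and prints just those entries:)

% pbsnodes compute-6-15.local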

Now look for the entries for the problem nodes. Here is the one for compute-6-15:

compute-6-15.local
     state = job-exclusive
     np = 8
     ntype = cluster
     jobs = 0/58097.athena0.npl.washington.edu, 1/58097.athena0.npl.washington.edu, 2/58097.athena0.npl.washington.edu, 3/58097.athena0.npl.washington.edu, 4/58097.athena0.npl.washington.edu, 5/58097.athena0.npl.washington.edu, 6/58097.athena0.npl.washington.edu, 7/58097.athena0.npl.washington.edu
     status = opsys=linux,uname=Linux compute-6-15.local 2.6.9-55.0.12.ELsmp #1 SMP Wed Oct 17 08:15:59 EDT 2007 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=5295499,totmem=9181680kb,availmem=8958508kb,physmem=8161564kb,ncpus=? 0,loadave=0.00,netload=498704772302,state=free,jobs=,varattr=,rectime=1230607386

It looks like job 58097 is "stuck" on this node. In this particular instance, showq and showstate knew nothing about the job; as far as Moab was concerned, it was done. However, PBS still knows about it:

% tracejob 58097
Job: 58097.athena0.npl.washington.edu

12/29/2008 15:36:29  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 15:36:29  S    Job sent signal SIGTERM on delete
12/29/2008 15:36:29  A    requestor=root@athena0.npl.washington.edu
12/29/2008 15:36:31  S    Job sent signal SIGKILL on delete
12/29/2008 17:59:01  S    enqueuing into scavenge, state 4 hop 1
12/29/2008 17:59:01  S    Requeueing job, substate: 42 Requeued in queue: scavenge
12/29/2008 17:59:06  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 17:59:06  S    Job sent signal SIGTERM on delete
12/29/2008 17:59:06  A    requestor=root@athena0.npl.washington.edu
12/29/2008 17:59:08  S    Job sent signal SIGKILL on delete

For jobs more than a day old, tell tracejob to look back further (-n gives the number of days of logs to search):

% tracejob -n 7 58097

So let's try forcibly removing it:

% qdel -p 58097
% tracejob 58097
Job: 58097.athena0.npl.washington.edu

12/29/2008 15:36:29  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 15:36:29  S    Job sent signal SIGTERM on delete
12/29/2008 15:36:29  A    requestor=root@athena0.npl.washington.edu
12/29/2008 15:36:31  S    Job sent signal SIGKILL on delete
12/29/2008 17:59:01  S    enqueuing into scavenge, state 4 hop 1
12/29/2008 17:59:01  S    Requeueing job, substate: 42 Requeued in queue: scavenge
12/29/2008 17:59:06  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 17:59:06  S    Job sent signal SIGTERM on delete
12/29/2008 17:59:06  A    requestor=root@athena0.npl.washington.edu
12/29/2008 17:59:08  S    Job sent signal SIGKILL on delete
12/29/2008 19:24:36  S    purging job without checking MOM
12/29/2008 19:24:36  S    dequeuing from scavenge, state RUNNING

That seems to have done something. And showstate agrees:

% showstate
...
              [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1][1][1][1][2][2][2][2][2][2][2][2][2][2][3][3][3][3][3][3]
              [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5]
...
Rack      06: [C][C][P][P][P][A][C][B][C][C][B][P][P][B][ ][ ][ ][ ][Q][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][T][P][ ][ ][ ]

Key:  [?]:Unknown [*]:Down w/Job [#]:Down [ ]:Idle [@] Busy w/No Job [!] Drained
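
If a forced purge is not enough and the MOM on the node is still holding the stale job, momctl can reportedly clear it on the node itself (not needed in this case, so untested):

% momctl -h compute-6-15 -c 58097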

Queue Wrangling

There is a new queue called 'systest' that anyone in group 'systest' can submit to. Default priority is 100000, and you can also request preemption with qos=test_now. It will send jobs to either partition, with a preference for the debug_nodes.
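
A submission using it might look like the line below (hedged: myjob.sh is a placeholder, and the exact way to request the QOS depends on how Moab is configured here, but passing it via the resource list is the usual route):

 % qsub -q systest -l qos=test_now myjob.sh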

Checking server configuration:

 % qmgr -c 'p s'
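
To dump just one queue instead of the whole server (the systest queue above, say):

 % qmgr -c 'print queue systest'
 % qstat -Qf systest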

Starting and stopping queues

Keep a queue from starting new jobs, but still let users submit to it:

 % qmgr -c "set queue <queue_name> started = False"

Keep a queue from accepting new submissions:

 % qmgr -c "set queue <queue_name> enabled = False"
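
Setting the same attributes back to True undoes each of these:

 % qmgr -c "set queue <queue_name> started = True"
 % qmgr -c "set queue <queue_name> enabled = True"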

Disabling a specific user

If you want to prevent a specific user from submitting to a general queue, there is a completely non-obvious way of accomplishing this (according to a PBS expert JPG knows; JPG has not tried it himself yet, so test it first):

# enable ACLs on the default queue
qmgr -c 's q default acl_user_enable = true'

# allow all users to access the default queue
qmgr -c 's q default acl_users = +'

# disallow richardc from accessing the default queue
qmgr -c 's q default acl_users += -richardc'
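
To see what the ACL ended up as, and (presumably; also untested) to lift the block again later by removing the entry with qmgr's -= operator:

# list the resulting ACL
qmgr -c 'l q default acl_users'

# remove the -richardc entry again
qmgr -c 's q default acl_users -= -richardc'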

Prologue and epilogue scripts

Documentation: http://www.clusterresources.com/torquedocs21/a.gprologueepilogue.shtml

Suggestions for what to have an epilogue script do (a rough sketch follows the list):

  • ping (can the node still reach the head node / network?)
  • try running a general computation (does 2 + 2 still = 4?)
  • file system check (still mounted, etc.?)
  • is the user's home directory still there?
  • is the local filesystem still there?
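
A minimal epilogue sketch along those lines (illustrative only, not deployed on athena: the /home and /scratch paths, the script location, and the decision to offline the node are all assumptions; TORQUE passes the job id and the job owner as the first two arguments, per the documentation linked above):

#!/bin/sh
# Hypothetical /var/spool/torque/mom_priv/epilogue -- test before trusting.
jobid=$1
owner=$2

fail() {
    echo "epilogue: $1 (job $jobid)" | logger -t epilogue
    # Take this node out of service; this only works if root on the node
    # is allowed to run pbsnodes against the server, otherwise just log.
    pbsnodes -o `hostname -s`.local
    exit 1
}

# does 2 + 2 still = 4?
[ `expr 2 + 2` -eq 4 ] || fail "basic arithmetic broken"

# is the owner's (NFS) home directory still there?
[ -d "/home/$owner" ] || fail "home directory for $owner missing"

# is the local filesystem still there and writable?
touch /scratch/.epilogue_check && rm -f /scratch/.epilogue_check \
    || fail "local /scratch not writable"

# can we still reach the head node?
ping -c 1 -w 5 athena0 > /dev/null 2>&1 || fail "cannot ping athena0"

exit 0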