How to apply a TORQUE wrench
Job Wrangling
If checkjob does not yield anything useful, try asking Torque:
% tracejob 62741

Job: 62741.athena0.npl.washington.edu

05/16/2008 18:17:19  S    enqueuing into scavenge, state 1 hop 1
05/16/2008 18:17:19  S    Job Queued at request of wdetmold@athena0.npl.washington.edu, owner = wdetmold@athena0.npl.washington.edu, job name = PBS_t10z0y0x0_P_l2864f21b676m010m050.306, queue = scavenge
05/16/2008 18:17:19  S    Job Modified at request of root@athena0.npl.washington.edu
05/16/2008 18:17:19  S    Job Run at request of root@athena0.npl.washington.edu
05/16/2008 18:17:19  S    send of job to compute-3-22.local failed error = 15010
05/16/2008 18:17:21  S    unable to run job, MOM rejected/rc=2
05/16/2008 19:14:06  S    Holds uso released at request of root@athena0.npl.washington.edu
Verify all nodes are correctly reporting:
% pbsnodes -a
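If you only want to scan for problem nodes, pbsnodes can list just the ones it considers down, offline, or unknown, which is quicker to read than the full dump (this should be the standard -l behavior; verify on our TORQUE version):

% pbsnodes -l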
Taking a node "offline" (note: "offline" is functionally equivalent to "down", but it is the state intended for marking nodes by hand):
% pbsnodes -o compute-3-22.local
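To put the node back in service once it is fixed, the -c flag should clear the offline mark (check with pbsnodes -a afterwards that the node returns to a usable state):

% pbsnodes -c compute-3-22.local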
Forcibly purge a job:
% qdel -p <jobid>
Deleting all jobs of a specific user
This was given to me by Ty Robinson, but I (Jeff) have not tried it yet. So test it first! :)
qdel $(showq | grep <username> | cut -f1 -d" ")
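An alternative (also untested here) that avoids parsing showq output is to let TORQUE pick out the jobs itself with qselect, which prints the ids of jobs matching a filter; -u filters by owner:

qdel $(qselect -u <username>)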
Case study: Node(s) report state "[@] Busy w/No Job"
% showstate
...
         [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1][1][1][1][2][2][2][2][2][2][2][2][2][2][3][3][3][3][3][3]
         [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5]
...
Rack 06: [C][C][P][P][P][A][C][B][C][C][B][P][P][B][@][@][@][@][Q][@][@][@][@][@][@][@][@][@][@][@][@][P][ ][ ][ ]

Key:  [?]:Unknown  [*]:Down w/Job  [#]:Down  [ ]:Idle  [@]:Busy w/No Job  [!]:Drained
What is the matter with nodes 15-31 (except 19) on rack 6???
% pbsnodes > ~/blah
% emacs ~/blah
Now look for the entries for the problem nodes. Here is the one for compute-6-15:
compute-6-15.local
     state = job-exclusive
     np = 8
     ntype = cluster
     jobs = 0/58097.athena0.npl.washington.edu, 1/58097.athena0.npl.washington.edu, 2/58097.athena0.npl.washington.edu, 3/58097.athena0.npl.washington.edu, 4/58097.athena0.npl.washington.edu, 5/58097.athena0.npl.washington.edu, 6/58097.athena0.npl.washington.edu, 7/58097.athena0.npl.washington.edu
     status = opsys=linux,uname=Linux compute-6-15.local 2.6.9-55.0.12.ELsmp #1 SMP Wed Oct 17 08:15:59 EDT 2007 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=5295499,totmem=9181680kb,availmem=8958508kb,physmem=8161564kb,ncpus=? 0,loadave=0.00,netload=498704772302,state=free,jobs=,varattr=,rectime=1230607386
It looks like job 58097 is "stuck" on this node. In this particular instance, showq and showstate admitted knowing nothing about this job. As far as MOAB was concerned, it was done. However, PBS still knows about it:
% tracejob 58097

Job: 58097.athena0.npl.washington.edu

12/29/2008 15:36:29  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 15:36:29  S    Job sent signal SIGTERM on delete
12/29/2008 15:36:29  A    requestor=root@athena0.npl.washington.edu
12/29/2008 15:36:31  S    Job sent signal SIGKILL on delete
12/29/2008 17:59:01  S    enqueuing into scavenge, state 4 hop 1
12/29/2008 17:59:01  S    Requeueing job, substate: 42 Requeued in queue: scavenge
12/29/2008 17:59:06  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 17:59:06  S    Job sent signal SIGTERM on delete
12/29/2008 17:59:06  A    requestor=root@athena0.npl.washington.edu
12/29/2008 17:59:08  S    Job sent signal SIGKILL on delete
For jobs over 24 hours old, add a specifier telling tracejob how many days of logs to look back through (here, 7 days):
% tracejob -n 7 58097
So let's try forcibly removing it:
% qdel -p 58097
% tracejob 58097

Job: 58097.athena0.npl.washington.edu

12/29/2008 15:36:29  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 15:36:29  S    Job sent signal SIGTERM on delete
12/29/2008 15:36:29  A    requestor=root@athena0.npl.washington.edu
12/29/2008 15:36:31  S    Job sent signal SIGKILL on delete
12/29/2008 17:59:01  S    enqueuing into scavenge, state 4 hop 1
12/29/2008 17:59:01  S    Requeueing job, substate: 42 Requeued in queue: scavenge
12/29/2008 17:59:06  S    Job deleted at request of root@athena0.npl.washington.edu
12/29/2008 17:59:06  S    Job sent signal SIGTERM on delete
12/29/2008 17:59:06  A    requestor=root@athena0.npl.washington.edu
12/29/2008 17:59:08  S    Job sent signal SIGKILL on delete
12/29/2008 19:24:36  S    purging job without checking MOM
12/29/2008 19:24:36  S    dequeuing from scavenge, state RUNNING
That seems to have done something. And showstate agrees:
% showstate
...
         [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1][1][1][1][2][2][2][2][2][2][2][2][2][2][3][3][3][3][3][3]
         [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0][1][2][3][4][5]
...
Rack 06: [C][C][P][P][P][A][C][B][C][C][B][P][P][B][ ][ ][ ][ ][Q][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][T][P][ ][ ][ ]

Key:  [?]:Unknown  [*]:Down w/Job  [#]:Down  [ ]:Idle  [@]:Busy w/No Job  [!]:Drained
Queue Wrangling
There is a new queue called 'systest' that anyone in group 'systest' can submit to. Default priority is 100000, and you can also request preemption with qos=test_now. It will send jobs to either partition, with a preference for the debug_nodes.
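Something along these lines (untested; the -l qos= syntax is an assumption about how MOAB picks up QoS requests on this setup, and myjob.sh is just a placeholder script name) should submit to systest with the preempting QoS:

% qsub -q systest -l qos=test_now myjob.sh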
Checking the server configuration ('p s' is qmgr shorthand for 'print server'):
% qmgr -c 'p s'
Starting and stopping queues
Keep a queue from starting new jobs, but still let users submit to it:
% qmgr -c "set queue <queue_name> started = False"
Keep a queue from accepting new submissions:
% qmgr -c "set queue <queue_name> enabled = False"
Disabling a specific user
If you want to prevent a specific user from submitting to a general queue, there is a completely non-obvious way of accomplishing this (according to a PBS expert JPG knows; he has not tried it himself yet):
# enable ACLs on queue default
qmgr -c 's q default acl_user_enable = true'
# allow all users to access batch
qmgr -c 's q default acl_users = +'
# disallow richardc to access batch
qmgr -c 's q default acl_users += -richardc'
Prologue and epilogue scripts
Documentation: http://www.clusterresources.com/torquedocs21/a.gprologueepilogue.shtml
Suggestions for what to have an epilogue script do (a rough sketch follows this list):
- ping
- try running a general computation (does 2+2 still =4?)
- file systems check (still mounted, etc?)
- user's home directory still there?
- is the local filesystem still there?
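Here is a rough sketch of what such an epilogue could look like. The first two arguments ($1 = job id, $2 = job owner) follow the TORQUE prologue/epilogue docs linked above; the head-node name, mount point, and logging choices are illustrative assumptions rather than our actual setup, so adapt before deploying.

#!/bin/sh
# epilogue sketch: runs on the node after each job finishes.
# $1 = job id, $2 = job owner (per the TORQUE prologue/epilogue docs).
JOBID=$1
JOBUSER=$2

# trivial computation: does 2+2 still equal 4?
[ $((2 + 2)) -eq 4 ] || logger -t epilogue "$JOBID: arithmetic check failed"

# is the user's home directory still visible from this node?
[ -d "/home/$JOBUSER" ] || logger -t epilogue "$JOBID: /home/$JOBUSER missing"

# is the local scratch filesystem still mounted? (path is an assumption)
grep -q " /scratch " /proc/mounts || logger -t epilogue "$JOBID: /scratch not mounted"

# can we still reach the head node? (host name is an assumption)
ping -c 1 -w 5 athena0 > /dev/null 2>&1 || logger -t epilogue "$JOBID: cannot ping athena0"

exit 0

This sketch only logs problems to syslog and always exits 0; check the documentation above for how TORQUE treats non-zero prologue/epilogue exit codes before making it fail hard.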