Moab Setup
From athena
Downloaded Moab
[root@athena0 moab]# curl -u washphys: -o moab-5.2.0-linux-x86_64-torque.tar.gz http://www.clusterresources.com/downloads/mwm/moab-5.2.0-linux-x86_64-torque.tar.gz
[root@athena0 moab-5.2.0]# ./configure --with-torque=/opt/torque
make install
Moab did not work when we started it: it complained that the TORQUE dynamic library was not in the library path. To fix this, I added a torque.conf file to the /etc/ld.so.conf.d/ directory containing the path to the TORQUE libraries:
/opt/torque/lib/
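The step above can be sketched as a small shell helper. This is only an illustration, not the exact commands that were run: the function name register_torque_libs is hypothetical, and the directory is passed as an argument so the helper can be tried against a scratch directory first. Note that the dynamic linker cache generally has to be rebuilt with ldconfig (as root) before a new entry takes effect.

```shell
# Hypothetical helper: write a library path into DIR/torque.conf.
# Usage: register_torque_libs DIR LIBPATH
register_torque_libs() {
    conf_dir=$1
    lib_path=$2
    # One path per line is the format ld.so.conf.d files use.
    printf '%s\n' "$lib_path" > "$conf_dir/torque.conf"
}

# On the real system (as root) this would be followed by ldconfig:
#   register_torque_libs /etc/ld.so.conf.d /opt/torque/lib && ldconfig
```

Running ldconfig -p | grep torque afterwards is one way to check that the libraries are now visible to the linker.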
After I tried starting Moab, I received the following error:
[root@athena0 moab]# /usr/local/sbin/moab
ERROR: server must be started on host 'athena0.local' (currently on 'athena0.npl.washington.edu') - see SCHEDCFG parameter
To resolve this, the host's name needs to change from the fully qualified domain name to just the short hostname. This may break Rocks, though, so I need to call Cluster Resources about it.
I also found some instructions on the moab.cfg file:
http://www.clusterresources.com/products/mwm/moabdocs/2.3initialtesting.shtml
If initial evaluation is complete or not required, the scheduler may be placed directly into production by setting the MODE attribute of the SCHEDCFG parameter to NORMAL and (re)starting the scheduler. Further sections within this manual will introduce key concepts and commands required to properly manage the scheduler in production operation.
OK, next I needed to remove the references to Sun Grid Engine:
[root@athena0 profile.d]# grep engine *
sge-binaries.csh:setenv SGE_ROOT /opt/gridengine
sge-binaries.sh:SGE_ROOT=/opt/gridengine; export SGE_ROOT
[root@athena0 profile.d]# mkdir DISABLED
[root@athena0 profile.d]# mv sge-binaries.csh DISABLED/
[root@athena0 profile.d]# mv sge-binaries.sh DISABLED/
vi /etc/profile.d/moab-binaries.sh
And I added the following:
MOAB_ROOT=/usr/local; export MOAB_ROOT
DEFAULTMANPATH=$MOAB_ROOT/man
PATH=$MOAB_ROOT/bin:$PATH; export PATH
shlib_path_name=$MOAB_ROOT/lib
To start Moab:
/usr/local/sbin/moab
To stop Moab:
/usr/local/bin/mschedctl -k
I organized the racks and slots accordingly. I also created five partitions for the scheduler: astro, cenpa, int, phys, and fastinter. The fastinter partition is on the single Cisco switch, which is 50% non-blocking, as opposed to 33% non-blocking on the larger Cisco switch. Here's an example from the /opt/moab/moab.cfg file:
NODECFG[compute-0-29] RACK=3 SLOT=30 PARTITION=int
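With many nodes, lines like the one above are tedious to type by hand. The sketch below generates them for a range of nodes. The function name emit_nodecfg is hypothetical, and the rule "slot = node index + 1" is an assumption inferred only from the single example (compute-0-29 in SLOT=30); it should be checked against the real rack layout before use.

```shell
# Hypothetical generator for NODECFG lines.
# Usage: emit_nodecfg RACK PARTITION FIRST_INDEX LAST_INDEX
emit_nodecfg() {
    rack=$1
    partition=$2
    i=$3
    last=$4
    while [ "$i" -le "$last" ]; do
        # Assumption: slot number is the node index plus one.
        printf 'NODECFG[compute-0-%d] RACK=%s SLOT=%d PARTITION=%s\n' \
            "$i" "$rack" "$((i + 1))" "$partition"
        i=$((i + 1))
    done
}
```

For example, emit_nodecfg 3 int 29 29 prints the same line shown above, and the output for a larger range can be reviewed and then pasted into moab.cfg.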
On the phone with the dudes:
Added a line to moab.cfg:
GROUPCFG[cenpa]
Restarted the scheduler:
[root@athena0 moab]# mschedctl -R
Access control modifiers:
SRCFG[cenpa] GROUPLIST=cenpa,~phys
SRCFG[cenpa] OWNER=GROUP:cenpa
PREEMPTPOLICY REQUEUE
Jobs need to be marked "restartable", so Nick's going to check on that.
The checkjob command will show the flags set on a job.
Other cool commands:
mdiag -r mched
Good websites for end users:
http://www.clusterresources.com/products/mwm/moabdocs/15.0improvingusereffectiveness.shtml
Submitting jobs:
[richardc@athena0 ill]$ msub -l nodes=5:ppn=8 advres=cenpa ill
That didn't work because you cannot submit an executable directly. I wrapped a script around it, and then it ran.
msub -l advres=cenpa ill.exe
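The notes do not show the wrapper itself, so the following is only a minimal sketch of what such a script might look like. The file name run_ill.sh is hypothetical, and it assumes the executable ill sits in the directory the job is submitted from. TORQUE sets PBS_O_WORKDIR to the submission directory, which the wrapper uses because batch jobs otherwise start in the user's home directory.

```shell
# Create a hypothetical wrapper script around the "ill" executable.
cat > run_ill.sh <<'EOF'
#!/bin/sh
# Change to the directory the job was submitted from
# (fall back to the current directory when run by hand).
cd "${PBS_O_WORKDIR:-.}" || exit 1
# Replace the shell with the real program.
exec ./ill
EOF
chmod +x run_ill.sh
```

A script like this could then be submitted in the same way as above, for example with msub -l advres=cenpa run_ill.sh.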
The default allocation of time is one hour. You can set this on the command line:
msub -l walltime=3:10:00,advres=cenpa ill.exe
Here we're saying that we want 3 hours and 10 minutes on the reservation for CENPA.
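The walltime value is given as hours:minutes:seconds. As a small convenience, the sketch below converts a number of minutes into that form; the function name to_walltime is hypothetical and not part of Moab.

```shell
# Hypothetical helper: convert whole minutes to H:MM:SS for walltime.
# Usage: to_walltime MINUTES
to_walltime() {
    minutes=$1
    printf '%d:%02d:00\n' "$((minutes / 60))" "$((minutes % 60))"
}
```

For instance, to_walltime 190 prints 3:10:00, so the submission above could also be written as msub -l walltime=$(to_walltime 190),advres=cenpa ill.exe.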