Moab Setup

From athena

Downloaded Moab

 [root@athena0 moab]# curl -u washphys: -o moab-5.2.0-linux-x86_64-torque.tar.gz http://www.clusterresources.com/downloads/mwm/moab-5.2.0-linux-x86_64-torque.tar.gz
 [root@athena0 moab-5.2.0]# ./configure --with-torque=/opt/torque
 make install

Moab would not start at first: it complained that the Torque dynamic library was not in the library path. To get it working, I added a torque.conf file to the /etc/ld.so.conf.d/ directory containing the path to the Torque libraries:

 /opt/torque/lib/
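
The notes do not record it, but a new file under /etc/ld.so.conf.d/ normally only takes effect once the linker cache is rebuilt with ldconfig. Here is a minimal sketch of the whole step. CONF_DIR is a hypothetical variable that defaults to a scratch directory so the snippet can be tried without root; on the real host it would be /etc/ld.so.conf.d.

```shell
#!/bin/sh
# Sketch only: register the Torque library directory with the dynamic linker.
# CONF_DIR is a hypothetical knob; it defaults to a scratch directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
TORQUE_LIB_DIR="/opt/torque/lib/"

# One directory per line is all the file needs.
printf '%s\n' "$TORQUE_LIB_DIR" > "$CONF_DIR/torque.conf"
cat "$CONF_DIR/torque.conf"

# On the real host (as root, with CONF_DIR=/etc/ld.so.conf.d), rebuild the
# cache and check that the Torque library is now visible:
#   ldconfig
#   ldconfig -p | grep -i torque
```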

When I then tried to start Moab, I got the following error:

 [root@athena0 moab]# /usr/local/sbin/moab
 ERROR:    server must be started on host 'athena0.local' (currently on 'athena0.npl.washington.edu') - see SCHEDCFG parameter

To resolve this, the hostname of the host apparently has to be changed from the fully qualified domain name to just the short hostname. That may break Rocks, so I need to call Cluster Resources about this one.
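
The error message points at the SCHEDCFG parameter, so before changing anything it can help to compare the host named there with what the machine itself reports. The sketch below is an illustration under assumptions: the sample moab.cfg line, its port number, and the helper name sched_server_host are made up for the example, and the sed pattern assumes a SERVER=host[:port] form.

```shell
#!/bin/sh
# Sketch only: print the host named in a "SCHEDCFG[...] SERVER=host[:port]" line.
# sched_server_host is a hypothetical helper, not a Moab command.
sched_server_host() {
    sed -n 's/^SCHEDCFG\[[^]]*\].*SERVER=\([^: ]*\).*/\1/p' "$1" | head -n 1
}

# Illustrative config written to a scratch file; the real file is moab.cfg.
SAMPLE_CFG="$(mktemp)"
printf '%s\n' 'SCHEDCFG[Moab] SERVER=athena0.local:42559 MODE=NORMAL' > "$SAMPLE_CFG"

CONFIGURED_HOST="$(sched_server_host "$SAMPLE_CFG")"
ACTUAL_HOST="$(uname -n)"

echo "moab.cfg expects: $CONFIGURED_HOST"
echo "host reports:     $ACTUAL_HOST"
if [ "$CONFIGURED_HOST" != "$ACTUAL_HOST" ]; then
    echo "mismatch: this is the situation the error above describes"
fi
```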

I also found some instructions on the moab.cfg file:

http://www.clusterresources.com/products/mwm/moabdocs/2.3initialtesting.shtml

 If initial evaluation is complete or not required, the scheduler may be placed directly into production by setting the MODE attribute
 of the SCHEDCFG parameter to NORMAL and (re)starting the scheduler.  Further sections within this manual will introduce key concepts
 and commands required to properly manage the scheduler in production operation.

OK, next I needed to remove the references to Sun Grid Engine (SGE):

 [root@athena0 profile.d]# grep engine *
 sge-binaries.csh:setenv SGE_ROOT /opt/gridengine
 sge-binaries.sh:SGE_ROOT=/opt/gridengine; export SGE_ROOT
 [root@athena0 profile.d]# mkdir DISABLED
 [root@athena0 profile.d]# mv sge-binaries.csh DISABLED/
 [root@athena0 profile.d]# mv sge-binaries.sh DISABLED/
 vi /etc/profile.d/moab-binaries.sh

And I added the following:

 MOAB_ROOT=/usr/local; export MOAB_ROOT
 DEFAULTMANPATH=$MOAB_ROOT/man
 PATH=$MOAB_ROOT/bin:$PATH; export PATH
 shlib_path_name=$MOAB_ROOT/lib
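
One way to check a profile.d snippet like this before logging in again is to source a copy of it in a subshell and look at the result. This is a rough sketch: the scratch file stands in for /etc/profile.d/moab-binaries.sh, and only the MOAB_ROOT and PATH settings are reproduced.

```shell
#!/bin/sh
# Sketch only: source a copy of the snippet and confirm PATH picked up Moab.
SNIPPET="$(mktemp)"
cat > "$SNIPPET" <<'EOF'
MOAB_ROOT=/usr/local; export MOAB_ROOT
PATH=$MOAB_ROOT/bin:$PATH; export PATH
EOF

# The command substitution runs in a subshell, so the current
# environment is left alone.
FIRST_PATH_ENTRY="$(
    . "$SNIPPET"
    echo "$PATH" | cut -d: -f1
)"
echo "first PATH entry after sourcing: $FIRST_PATH_ENTRY"
```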


To start Moab:

 /usr/local/sbin/moab

To stop Moab:

 /usr/local/bin/mschedctl -k

I assigned each node its rack and slot. I also created five partitions for the scheduler: astro, cenpa, int, phys, and fastinter. Fastinter is on the single Cisco switch, which is 50% non-blocking, as opposed to 33% non-blocking on the larger Cisco switch. Here's an example from the /opt/moab/moab.cfg file:

 NODECFG[compute-0-29] RACK=3 SLOT=30 PARTITION=int
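
With many nodes, lines like this are tedious to type by hand. The sketch below generates them from a small table. The helper name nodecfg_line and the table file are made up for illustration; only the compute-0-29 row comes from these notes, and the second row is a placeholder rather than the real layout.

```shell
#!/bin/sh
# Sketch only: emit NODECFG lines from "node rack slot partition" rows.
# nodecfg_line is a hypothetical helper, not part of Moab.
nodecfg_line() {
    printf 'NODECFG[%s] RACK=%s SLOT=%s PARTITION=%s\n' "$1" "$2" "$3" "$4"
}

# Only the first row is taken from the notes; the second is a placeholder.
NODE_TABLE="$(mktemp)"
cat > "$NODE_TABLE" <<'EOF'
compute-0-29 3 30 int
compute-0-30 3 31 int
EOF

NODECFG_LINES="$(
    while read -r node rack slot partition; do
        nodecfg_line "$node" "$rack" "$slot" "$partition"
    done < "$NODE_TABLE"
)"
echo "$NODECFG_LINES"
```

The output can be reviewed and then pasted into moab.cfg.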


On the phone with the dudes:

Added a line to moab.cfg:

 GROUPCFG[cenpa]

Restarted Moab:

 [root@athena0 moab]# mschedctl -R 


Access control modifiers:

 SRCFG[cenpa] GROUPLIST=cenpa,~phys
 SRCFG[cenpa] OWNER=GROUP:cenpa
 PREEMPTPOLICY REQUEUE

Jobs need to be marked "restartable", so Nick's going to check on that.

 checkjob will check the flags...

Other cool commands:

 mdiag -r 
 mched

Good websites for end users:

http://www.clusterresources.com/products/mwm/moabdocs/15.0improvingusereffectiveness.shtml

Submitting jobs:

 [richardc@athena0 ill]$ msub -l nodes=5:ppn=8 advres=cenpa ill

That didn't work because you cannot submit an executable directly. I wrapped a script around it, and then it ran:

 msub -l advres=cenpa ill.exe
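
The wrapper itself is not recorded in these notes. A minimal sketch of what such a wrapper might look like follows. The file name run_ill.sh, the scratch directory, and the stand-in "ill" program are all hypothetical, and the use of PBS_O_WORKDIR assumes Torque sets it for the job as it does for qsub submissions.

```shell
#!/bin/sh
# Sketch only: msub wants a script, so wrap the binary in a small one.
WORKDIR="$(mktemp -d)"

# Stand-in for the real executable, so the sketch can be run anywhere.
printf '#!/bin/sh\necho "ill ran"\n' > "$WORKDIR/ill"
chmod +x "$WORKDIR/ill"

# The wrapper: change to the submission directory, then run the program.
cat > "$WORKDIR/run_ill.sh" <<'EOF'
#!/bin/sh
# Fall back to the wrapper's own directory when not run under Torque.
cd "${PBS_O_WORKDIR:-$(dirname "$0")}" || exit 1
exec ./ill "$@"
EOF
chmod +x "$WORKDIR/run_ill.sh"

# Try the wrapper locally before submitting it.
WRAPPER_OUTPUT="$("$WORKDIR/run_ill.sh")"
echo "$WRAPPER_OUTPUT"

# On the cluster the wrapper would then be submitted instead of the binary:
#   msub -l advres=cenpa run_ill.sh
```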

The default walltime allocation is one hour. You can set it on the command line:

 msub -l walltime=3:10:00,advres=cenpa ill.exe

Here we're saying that we want 3 hours and 10 minutes on the reservation for CENPA.
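
As a quick sanity check on the HH:MM:SS form, the request can be converted to seconds. This is just arithmetic in a sketch; walltime_to_seconds is a hypothetical helper, not a Moab command.

```shell
#!/bin/sh
# Sketch only: convert an HH:MM:SS walltime string to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# 3 hours and 10 minutes, as in the msub example above.
WALLTIME_SECONDS="$(walltime_to_seconds 3:10:00)"
echo "$WALLTIME_SECONDS"
```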