Grub errors
From athena
After the EMC and moab update, two nodes could not boot because of a reported "grub 15" error (GRUB's "file not found"). At least one of them (4-8) was undoubtedly caused by me killing it during the Rocks install, which left the disk image incomplete. The other one (6-20) had been down at the start of the exercise, so it was already coming up a little bit sideways.
The *fix* was to
- (a) shut down a running node (6-5 in this case)
- (b) move its hard drive to the 2nd slot in the "sick" system
- (c) attach an external DVD drive
- (d) boot from a Fedora 9 install/rescue disk, using F11 during boot to select it
(Helix couldn't handle the PE1950's hardware, nor could it talk to a USB keyboard).
- (e) once in rescue mode, let it mount the target (sick) disk as /mnt/sysimage
- (f) cd to /, mkdir foo, mount the "good" disk as /foo:
    cd /
    mkdir /foo
    mount /dev/sdb1 /foo
- (g) inspect the target disk for what's missing.
(for 6-20, i just needed to rebuild /boot/grub; for 4-8, i needed to build /boot)
- (h) copy over the needed structures (i cd'd to /foo/boot and did):
    tar cf - . | (cd /mnt/sysimage/boot && tar xf -)
- (i) cd to /, umount /foo, umount /mnt/sysimage, shut down.
- (j) REMOVE THE GOOD DISK
- (k) reboot
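Steps (g) and (h) can be sketched as two small shell helpers. This is only a sketch, not what was actually typed on the day: the function names list_missing and copy_tree are made up for illustration, and it assumes the rescue environment has find, sort and tar available. The first prints whatever exists on the good disk but not on the sick one; the second is the same tar pipe as in step (h).

```shell
#!/bin/sh
# Hypothetical helpers for steps (g) and (h); names are illustrative only.

# list_missing GOOD TARGET
# Print every path that exists under GOOD but is absent under TARGET.
list_missing() {
    good=$1
    target=$2
    (cd "$good" && find . ! -name . | sort) | while IFS= read -r p; do
        # Treat dangling symlinks as present, so test -L as well as -e.
        [ -e "$target/$p" ] || [ -L "$target/$p" ] || printf '%s\n' "${p#./}"
    done
}

# copy_tree GOOD TARGET
# Copy everything under GOOD into TARGET with a tar pipe, creating
# TARGET first in case it is missing altogether.
copy_tree() {
    good=$1
    target=$2
    mkdir -p "$target" &&
        (cd "$good" && tar cf - .) | (cd "$target" && tar xpf -)
}
```

In rescue mode this would be used as `list_missing /foo/boot /mnt/sysimage/boot` to see what is gone, then `copy_tree /foo/boot /mnt/sysimage/boot` to put it back.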
Both systems came up -not- installing Rocks: the grub menu (shown by hitting a key at the appropriate second) only offered the SMP and uniprocessor Linux images. For 6-20, a powercord reboot after it finished coming up in Linux worked, in that it -did- cause a Rocks reinstall. For 4-8, there was not enough of a system on the disk to sustain a full boot, so i had to manually invoke the grub command line interface during the next attempt and give it the sequence for a Rocks reinstall, namely:
    root (hd0,0)
    kernel /boot/kickstart/default/vmlinuz ro root=LABEL=/ ramdisk_size=150000 kssendmac ks selinux=0
    initrd /boot/kickstart/default/initrd.img
    boot
...and that convinced it to reload Rocks.
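Since the grub sequence only works if the kernel and initrd it names are really on the disk, it may be worth checking for them while still in rescue mode, before pulling the good disk. A minimal sketch, assuming the sick disk's root is still mounted at /mnt/sysimage; the function name check_reinstall_files is made up for illustration:

```shell
#!/bin/sh
# Hypothetical pre-reboot check: are the files named in the grub
# "Rocks reinstall" sequence present and non-empty on the target disk?

# check_reinstall_files [ROOT]
# ROOT defaults to /mnt/sysimage (where rescue mode mounts the sick disk).
# Returns 0 if both files are there, 1 otherwise.
check_reinstall_files() {
    root=${1:-/mnt/sysimage}
    status=0
    for f in boot/kickstart/default/vmlinuz \
             boot/kickstart/default/initrd.img; do
        if [ -s "$root/$f" ]; then
            echo "ok       /$f"
        else
            echo "MISSING  /$f"
            status=1
        fi
    done
    return $status
}
```

If either file is reported MISSING, the grub sequence would most likely just fail on a missing file again, so copy it over from the good disk first.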
Finally, restore the "good" disk to the donor node and reboot it, too (see: "how to take a node down in a gentle manner" and, even more fun if i manage it, "how to get it back up").