mgarl10024 wrote:
Grub
Next, I need to sort grub. Currently, grub is just on the first harddisk, so on boot if the 1st harddisk is down, the system will not be able to start. Reading around, I need to add 'fallback' to grub's menu.lst, and then point it at a Kernel declaration just like
http://www.howtoforge.com/software-raid1-grub-boot-debian-etch-p2.
On booting with just one harddisk, the system asks me what to do, but tells me that I can add bootdegraded=true to the kernel boot options - any idea where these go? menu.lst has something called "kopt", so I'm thinking here.
adding kernel boot parameters to 'kopt' would affect every new kernel image installed (at least when using ubuntu-packaged kernel images with corresponding hooks in post-installation scripts). you still have to add this option to every 'kernel'-directive of the boot items in menu.lst you wish to be affected by the boot option.
alternatively, you could manually add any kernel boot options directly to the 'kernel' directive of every boot item in menu.lst. in this case, you would want to edit your menu.lst every time a new kernel image package is installed.
also, don't forget to make sure grub's stage1 is installed to the second disk's master boot record, too.
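as a rough illustration of the kernel option edit described above, here is a minimal python sketch that appends the option to every active 'kernel' directive and to the 'kopt' line. the function name add_kernel_option and the sample paths are made up for this example, and it assumes the ubuntu-style menu.lst layout in which kopt lives in a line starting with '# kopt='. it only transforms text, so you would still have to read and write /boot/grub/menu.lst yourself (after making a backup).

```python
def add_kernel_option(menu_text, option="bootdegraded=true"):
    """Return menu.lst text with `option` appended where it is missing.

    Assumption: GRUB legacy menu.lst as shipped by Ubuntu, where the
    defaults for newly installed kernels live in a '# kopt=' line.
    """
    out = []
    for line in menu_text.splitlines():
        words = line.split()
        is_kernel = bool(words) and words[0] == "kernel"  # active directive only
        is_kopt = line.startswith("# kopt=")              # magic comment line
        # Skip lines that already carry the option, so reruns change nothing.
        if (is_kernel or is_kopt) and option not in words:
            line = line.rstrip() + " " + option
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical menu.lst fragment, just to show the effect.
sample = (
    "# kopt=root=/dev/md0 ro\n"
    "title Ubuntu\n"
    "root (hd0,0)\n"
    "kernel /boot/vmlinuz root=/dev/md0 ro quiet\n"
    "initrd /boot/initrd.img\n"
)
print(add_kernel_option(sample), end="")
```

commented-out entries such as '# kernel ...' are left alone, because only lines whose first word is 'kernel' are treated as directives.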
Quote:
Testing
Next, I need to test this to prove to myself it is working reliably in a failure.
The scenarios are
1) Drive 1 fails during boot
2) Drive 2 fails during boot
it depends. which stage of the boot process? what kind of grub setup? bootdegraded option used or not?
basically, if a disk (it does not matter at all whether it is the first or the second one) fails AFTER the md subsystem is loaded, the failure should be detected by the md subsystem. any running MD array using the failed disk is then degraded automatically: every physical device located on the failed disk is removed from the array.
if a disk fails before the MD subsystem is started, and your BIOS and your bootloader are both set up to be able to boot from either drive, and you are using the bootdegraded option, your system will come up, and any MD device using a physical device on the failed disk will run in degraded mode (of course, this depends on the raid level used. a raid0 MD device with one of its physical devices on a failed disk won't be able to start at all, for example).
Quote:
3) Drive 1 fails whilst machine on
4) Drive 2 fails whilst machine on
I'm guessing that when I put a "new" drive back in, I can use mdadm to add the drive to the array. When I do this, I am guessing I will need to go through synching again
exactly. you'll have to partition the 'new' drive accordingly and add the 'new' partitions to the corresponding arrays (hot-adding works, you don't have to stop an array to add a new physical device). this will trigger a resync. you'll be able to use the array during the resync, but the performance will be severely degraded.
it does not matter at all if 'disk1' or 'disk2' fails.
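the replacement steps above could be sketched like this. the python helper below only builds and prints the commands, it does not run anything. the device names, the array layout and the function name replacement_plan are assumptions for illustration, and copying the partition table with sfdisk assumes both disks use an MBR partition table and that the new disk is at least as large as the surviving one.

```python
def replacement_plan(good_disk, new_disk, arrays):
    """Build the shell commands for adding a replacement disk to RAID1 arrays.

    arrays maps each md device to the partition number it uses on both
    disks, e.g. {"/dev/md0": 1}. Nothing is executed here.
    """
    # Copy the partition layout from the surviving disk to the new one.
    cmds = [f"sfdisk -d {good_disk} | sfdisk {new_disk}"]
    # Hot-add each new partition to its array; this starts the resync.
    for md, partno in sorted(arrays.items()):
        cmds.append(f"mdadm --manage {md} --add {new_disk}{partno}")
    # Resync progress shows up in /proc/mdstat.
    cmds.append("cat /proc/mdstat")
    return cmds


# Hypothetical layout: /dev/sda survived, /dev/sdb is the new disk.
for cmd in replacement_plan("/dev/sda", "/dev/sdb", {"/dev/md0": 1, "/dev/md1": 2}):
    print(cmd)
```

check the printed commands against your real device names before running any of them as root; mixing up the source and target disk in the sfdisk step would overwrite the partition table of the good disk.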
Quote:
Reporting
Is there a way that I can set up reporting, so if a drive fails I am notified. A silently failing drive is no good as from the outside all will appear well!
mdadm can be set up to send out notifications. /etc/mdadm/mdadm.conf and/or /etc/default/mdadm are the files to check & modify to set the notifications up. it should be trivial to set up a cronjob checking for md-device status and alerting you in any other way in case of a failure -- page your mobile phone, buzz the buzzer, automatically place an order for a replacement disk... whatever you need/consider useful.
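for the cronjob idea, a minimal python sketch of such a check might look like the one below. it assumes the usual /proc/mdstat layout, where each array has a status such as '[2/1] [U_]' and an underscore marks a missing member. the function names are made up for this example, and what you do with the result (mail, pager, buzzer) is left to you.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)  # e.g. "md0"
            continue
        # Status looks like "[2/1] [U_]"; "_" stands for a missing device.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded.append(current)
            current = None  # report each array only once
    return degraded


def read_mdstat(path="/proc/mdstat"):
    """Read the kernel's md status file (only present when md is loaded)."""
    with open(path) as f:
        return f.read()
```

a cron wrapper could call degraded_arrays(read_mdstat()) and alert whenever the returned list is not empty. this is meant as an addition to mdadm's own notifications, not as a replacement for them.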
smartmontools is a nice-to-have as a kind of 'early detection' system, but I've seen enough dying drives with perfect SMART values. don't forget to set smartmontools up to run regular self-tests (the -s directive in smartd.conf; 'smartctl -t' starts a single test by hand), this makes it much more useful. you may be able to detect possible drive failures long before MD is able to tell the drive is failing, thus possibly giving you more time to prepare everything for the actual failure.
oh, and don't forget to back up your raid1 arrays! i just cannot stress it enough that no redundant raid setup could ever be a replacement for a regular backup routine. raid1 gives you an operational reliability bonus (being able to keep the services running while you're replacing a failed disk), but does not affect the safety of your data in any way. even the cheapest and simplest possible backup solution, like rsyncing to some kind of external storage device (a usb-connected hard disk, for example), would contribute much more to the safety of your data than any kind of raid setup.
another issue to be aware of is a problem related to most journalling filesystems (ext3fs, for example) and modern disks using out-of-order writeback caching mechanisms (see
http://lwn.net/Articles/283161/,
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal...)