Now that we have two machines that are identical (with the exception of IP address and host name), the next step is to download, install and configure DRBD. To achieve this I combined a few posts into the setup that I wanted.

The first thing I did was check the disks to ensure that they were identical on both hosts and all connected to the machine correctly.

sudo fdisk -l

###
Disk /dev/sdb: 1073 MB, 1073741824 bytes
34 heads, 61 sectors/track, 1011 cylinders, total 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 1979.1 GB, 1979120929792 bytes
255 heads, 63 sectors/track, 240614 cylinders, total 3865470566 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x02020202

Disk /dev/sdc doesn't contain a valid partition table
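
If your version of util-linux ships lsblk, it gives a more compact view of the same information (the device names below match my setup, so adjust to suit):

##Both
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT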

DRBD

Next I installed DRBD and heartbeat

##Both
sudo apt-get install drbd8-utils heartbeat
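
The tools rely on the drbd kernel module that ships with Ubuntu, so it does no harm to confirm it loads before going any further:

##Both
sudo modprobe drbd
lsmod | grep drbd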

I am going to create a single share called disk1. With that in mind, I created its configuration in the following file.

##Both
sudo vi /etc/drbd.d/disk1.res

Here is the configuration I added into that file.

resource disk1 {

        protocol C;

        handlers {
                pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
                pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
                local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
                outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        }

        startup {
                degr-wfc-timeout 120;
                wfc-timeout 30;
                outdated-wfc-timeout 20;
        }

        disk {
                on-io-error detach;
        }

        net {
                cram-hmac-alg sha1;
                shared-secret "password";
                after-sb-0pri disconnect;
                after-sb-1pri disconnect;
                after-sb-2pri disconnect;
                rr-conflict disconnect;
        }

        syncer {
                rate 100M;
                verify-alg sha1;
                al-extents 257;
        }

        on vNas01 {
                device  /dev/drbd0;
                disk    /dev/sdc;
                address 192.168.0.211:7788;
                meta-disk /dev/sdb[0];
        }

        on vNas02 {
                device  /dev/drbd0;
                disk    /dev/sdc;
                address 192.168.0.212:7788;
                meta-disk /dev/sdb[0];
        }
}

A few items I would like to point out.

  • Protocol C – There are three possible replication protocols (A, B and C), each with a different level of risk; the trade-offs are covered in the DRBD documentation.
  • shared-secret – This needs to be the same on both ends.
  • rate 100M – This lets you cap the synchronisation speed between the hosts, which is also covered in the DRBD documentation.
  • Pay particular attention to the on vNas01 and on vNas02 sections; they need to reflect the settings of each host (i.e. which drive is being used, the host's IP, etc.).

This file needs to be duplicated onto both hosts.
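
One way to duplicate it, assuming SSH access between the nodes (swap in your own user name), is to copy the file across from vNas01 and then ask drbdadm to parse it on both ends; if the syntax is wrong it will complain here rather than later:

##vNas01
scp /etc/drbd.d/disk1.res dean@192.168.0.212:/tmp/

##vNas02
sudo mv /tmp/disk1.res /etc/drbd.d/disk1.res

##Both
sudo drbdadm dump disk1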

Then we change the ownership and permissions on the DRBD binaries so that they can be used correctly by Heartbeat.

##Both
sudo chgrp haclient /sbin/drbdsetup
sudo chmod o-x /sbin/drbdsetup
sudo chmod u+s /sbin/drbdsetup
sudo chgrp haclient /sbin/drbdmeta
sudo chmod o-x /sbin/drbdmeta
sudo chmod u+s /sbin/drbdmeta
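
If you want to double-check that took effect, both binaries should now show the setuid bit and the haclient group:

##Both
ls -l /sbin/drbdsetup /sbin/drbdmeta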

Initialize the meta-data disk on both hosts

##Both
sudo drbdadm create-md disk1

And then started DRBD on both hosts

##Both
sudo /etc/init.d/drbd start
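
Before picking a master it is worth confirming the two nodes can actually see each other. drbdadm can report the connection state of the resource; it should show Connected (or WFConnection while one side is still waiting for its peer):

##Both
sudo drbdadm cstate disk1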

Now it's time to pick a master. I used vNas01 as the master, and to start the initial sync I ran

##vNas01
sudo drbdadm -- --overwrite-data-of-peer primary disk1

This sync took a long time (almost 24 hours). To watch its progress you can run

watch cat /proc/drbd

###
Every 2.0s: cat /proc/drbd                                                  Sun Jun  8 08:00:49 2014

version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
 0: cs:SyncTarget ro:Secondary/Secondary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:90397696 dw:90389504 dr:0 al:0 bm:5516 lo:9 pe:6 ua:8 ap:0 ep:1 wo:f oos:1842345780
        [>....................] sync'ed:  4.7% (1799164/1887436)Mfinish: 14:11:31 speed: 36,044 (36,
168) want: 3,280 K/sec

Once the initial sync is complete we can start to set up the file system on the drive. First things first, we need to install jfsutils, as we are going to use JFS. This is done on both hosts.

##Both
sudo apt-get install jfsutils

Once the sync is complete and jfsutils is installed, go ahead and create the file system on the DRBD device and then create a mount point for it. This only needs to be done on the primary host (vNas01).

##vNas01
sudo mkfs.jfs /dev/drbd0
sudo mkdir -p /srv/data
sudo mount /dev/drbd0 /srv/data
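
A quick sanity check that the new file system is mounted where expected:

##vNas01
df -h /srv/data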

Now, it is a good idea to test the file system to ensure it is working correctly. Write a test file to the drive from the primary.

##vNas01
sudo dd if=/dev/zero of=/srv/data/test.zeros bs=1M count=1000

A little bit of housekeeping on the primary: unmount the file system and demote the DRBD resource.

##vNas01
sudo umount /srv/data
sudo drbdadm secondary disk1

Switch to the secondary host (vNas02), where we will mount the resource and then remove the file to ensure sync works both ways.

##vNas02
sudo mkdir -p /srv/data
sudo drbdadm primary disk1
sudo mount /dev/drbd0 /srv/data

sudo ls /srv/data
##Check for test.zeros file

sudo rm /srv/data/test.zeros
sudo umount /srv/data
sudo drbdadm secondary disk1

Then we switch back to vNas01 to check the file is missing

##vNas01
sudo drbdadm primary disk1
sudo mount /dev/drbd0 /srv/data

sudo ls /srv/data
##Check that the test.zeros file is no longer there

NFS

Install the NFS Server onto both hosts.

##Both
sudo apt-get install nfs-kernel-server

Remove the runlevel init scripts on both nodes so that Heartbeat, rather than init, is responsible for starting and stopping the NFS services.

##Both
sudo update-rc.d -f nfs-kernel-server remove
sudo update-rc.d -f nfs-common remove
sudo update-rc.d nfs-kernel-server stop 20 0 1 2 3 4 5 6 .
sudo update-rc.d nfs-common  stop 20 0 1 2 3 4 5 6 .

Move the NFS locks and configuration onto the DRBD device. (This way the settings travel with whichever host holds the resource.)

##vNas01
sudo mount /dev/drbd0 /srv/data
sudo mv /var/lib/nfs/ /srv/data/
sudo ln -s /srv/data/nfs/ /var/lib/nfs
sudo mv /etc/exports /srv/data
sudo ln -s /srv/data/exports /etc/exports

##vNas02
sudo rm -rf /var/lib/nfs
sudo ln -s /srv/data/nfs/ /var/lib/nfs
sudo rm /etc/exports
sudo ln -s /srv/data/exports /etc/exports

Then we export the file system as the NFS share. (My network is 192.168.0.0/24.)

##vNas01
sudo mkdir /srv/data/export
sudo chmod 777 /srv/data/export
sudo vi /etc/exports

##Add to the bottom
/srv/data/export        192.168.0.*(rw,async,no_subtree_check)
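
Once the cluster is up and the NFS server is running on the active node, the exports can be re-read and listed without restarting anything (exportfs is part of nfs-kernel-server):

##Active node only
sudo exportfs -ra
sudo exportfs -v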

Heartbeat

Last, I configured Heartbeat to transfer the resources between machines if one goes offline. More information on the ha.cf options can be found in the Heartbeat documentation.

##Both
sudo vi /etc/heartbeat/ha.cf

##File contents
# Give cluster 30 seconds to start
initdead 30
# Keep alive packets every 1 second
keepalive 1
# Misc settings
traditional_compression off
deadtime 10
deadping 10
warntime 5
# Nodes in cluster
node vNas01 vNas02
# Use ipmi to check power status and reboot nodes
stonith_host    vNas01 external/ipmi vNas02 192.168.0.212 dean XXXXXXX lan
stonith_host    vNas02 external/ipmi vNas01 192.168.0.211 dean XXXXXXX lan
# Use logd, configure /etc/logd.cf
use_logd on
# Don't move service back to preferred host when it comes up
auto_failback off
# If all systems are down, it's failure
ping_group lan_ping 192.168.0.211 192.168.0.212
# Takover if pings (above) fail
respawn hacluster /usr/lib/heartbeat/ipfail

##### Use unicast instead of default multicast so firewall rules are easier
# vNas01
ucast eth0 192.168.0.211
# vNas02
ucast eth0 192.168.0.212

Then we define the Authkeys for the cluster

##Both
sudo vi /etc/heartbeat/authkeys

##File contents
auth 3
3 md5 password
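
One thing worth doing here: Heartbeat will refuse to start if the authkeys file is readable by anyone other than root, so tighten the permissions on both hosts.

##Both
sudo chmod 600 /etc/heartbeat/authkeys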

And lastly we define the resources that should be failed over if required

##Both
sudo vi /etc/heartbeat/haresources

##File Contents
vNas01 IPaddr2::192.168.0.210/24/eth0 drbddisk::disk1 Filesystem::/dev/drbd0::/srv/data::jfs nfs-kernel-server

Then we reboot both servers.

Testing

To test the setup, I added the NFS share to one of my ESXi boxes and started a file transfer into the storage. Then I restarted the primary node and watched the service fail over.
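
If you want to trigger a failover without rebooting a node, Heartbeat ships a couple of helper scripts for moving resources by hand (the path can vary slightly between distributions):

##On the node currently holding the resources
sudo /usr/share/heartbeat/hb_standby

##Or, from the other node, pull the resources over
sudo /usr/share/heartbeat/hb_takeover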

At any time you can see the status of the cluster by running

sudo cat /proc/drbd

which will show you the status of DRBD from the perspective of that host (i.e. whether it is currently the primary, etc.).
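
You can also see which node currently holds the cluster IP from haresources (192.168.0.210 in my case) by checking eth0 on each host:

##Either host
ip addr show eth0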

Next I will cover some notes and thoughts about the setup.