Adventures in High Availability: HA iSCSI with DRBD, iSCSI, and Pacemaker

High availability for applications and physical machines is key to making services appear to never be down. With cloud computing, failure-resilient deployments are needed for services that must always be available.

The purpose of this blog is to provide more of the technical information for HA Open iSCSI that my good friend and colleague, Lester, mentioned in his blog. Our goal was to set up HA with Open iSCSI without access to a SAN. To accomplish this, we used Pacemaker, Open iSCSI, and DRBD. The great guys from Linbit provided us with the documentation to deploy this environment.

Setup

[Diagram: HA iSCSI setup where viking-07 is the active node]

In our setup, we used the following:


[root@viking-07 ~]# crm status
============
Last updated: Tue Apr 3 20:28:29 2012
Stack: openais
Current DC: viking-07.eucalyptus-systems.com - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ viking-07.eucalyptus-systems.com viking-08.eucalyptus-systems.com ]

Resource Group: rg_clustervol
p_lvm_clustervol (ocf::heartbeat:LVM): Started viking-07.eucalyptus-systems.com
p_target_clustervol (ocf::heartbeat:iSCSITarget): Started viking-07.eucalyptus-systems.com
p_lu_clustervol_lun1 (ocf::heartbeat:iSCSILogicalUnit): Started viking-07.eucalyptus-systems.com
p_ip_clustervolip (ocf::heartbeat:IPaddr2): Started viking-07.eucalyptus-systems.com
Master/Slave Set: ms_drbd_clustervol
Masters: [ viking-07.eucalyptus-systems.com ]
Slaves: [ viking-08.eucalyptus-systems.com ]

Installation

The installation instructions are pretty straightforward. The instructions are the same for both machines unless otherwise noted. These instructions assume that CentOS 5.7 has already been installed. If it is not, please refer to the CentOS 5 Documentation for installation instructions.

Both Nodes

Installing SCSI Target Framework

The SCSI Target Framework (tgt) is needed for the iSCSI servers we will use in the cluster setup.  To install, run the following command:

yum install scsi-target-utils

Once you have done this, make sure that the tgtd service is part of the system startup:

/sbin/chkconfig tgtd on

Installing Pacemaker cluster manager

Pacemaker is an open source, high availability resource manager.  The packages for the Pacemaker project are provided by clusterlabs.org.  To install Pacemaker, do the following:

  • Download the clusterlabs.repo file with wget or curl to the /etc/yum.repos.d directory:

    wget -O /etc/yum.repos.d/pacemaker.repo http://www.clusterlabs.org/rpm/epel-5/clusterlabs.repo

  • Install Pacemaker (and dependencies):

    yum install pacemaker pacemaker-libs corosync corosynclib

  • Make sure Corosync, the messaging layer that Pacemaker runs on here, is part of the system startup:

    /sbin/chkconfig corosync on

Install DRBD

DRBD provides us the storage backend for the cluster.  It mirrors the data written to the disk to the peer node.  For more information about what DRBD does, refer to the Mirroring section on the DRBD site. To install DRBD, run the following command:

yum install drbd83 kmod-drbd83

Configuration

Configure DRBD resource

To configure DRBD, we need to create and edit a resource file, clustervol.res, under /etc/drbd.d on both nodes. One thing to note: we gave DRBD a separate device (/dev/sdd2) as its backing disk, and LVM runs on top of the resulting DRBD device (DRBD syncs the content served up by tgtd). We did this to make disk recovery easier in case of failure.

Both Nodes

First, use an editor (such as vi) to open the file clustervol.res under /etc/drbd.d:

vi /etc/drbd.d/clustervol.res

Edit the file to match your environment. Our resource file looks like the following:


# cat /etc/drbd.d/clustervol.res
resource clustervol {
    device /dev/drbd1;
    disk /dev/sdd2;
    meta-disk internal;

    on viking-07.eucalyptus-systems.com {
        address 192.168.39.107:7790;
    }
    on viking-08.eucalyptus-systems.com {
        address 192.168.39.108:7790;
    }
}

The main parts of the configuration file are as follows:

  • resource - the name of the resource managed by DRBD
  • device - the DRBD block device that will be created
  • disk - the backing device that DRBD will use
  • address - the IP address and port that DRBD will use
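One thing that is easy to get wrong here: DRBD matches the name in each "on" section against the node's host name as reported by uname -n. The following is a minimal sketch of a check for that, assuming a resource file laid out like ours. The helper name drbd_res_has_host is hypothetical and not part of DRBD.

```shell
# Hypothetical helper: succeed if HOST appears in an "on <host> {" line of a
# DRBD resource file. DRBD matches these names against `uname -n`.
drbd_res_has_host() {
    res_file=$1
    host=$2
    # Print the second field of every line whose first field is "on", then
    # look for an exact, whole-line match of the host name.
    awk '$1 == "on" { print $2 }' "$res_file" | grep -Fxq -- "$host"
}

# Usage on a cluster node (path taken from this post):
#   drbd_res_has_host /etc/drbd.d/clustervol.res "$(uname -n)" \
#       && echo "host name matches" || echo "host name NOT in resource file"
```

If the check fails on a node, compare the output of uname -n with the names in the resource file before bringing the resource up.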

DRBD can be tuned for disk sync performance and failover responsiveness. For more information on configuring DRBD, please refer to the sections Configuring DRBD and Optimizing DRBD performance of the DRBD 8.3 User’s Guide.  This document is a *must have* reference.  I suggest reading it and trying out different configurations before putting any service using DRBD into a production environment.

LVM Configuration

There are a few ways that LVM can be utilized with DRBD. For our setup, we configured a DRBD resource as a Physical Volume, as described in the documentation provided by Linbit. For more information concerning using LVM with DRBD, please refer to the section entitled Using LVM with DRBD in the DRBD 8.3 User’s Guide.

We need to make sure to instruct LVM to read the Physical Volume signatures from the DRBD devices only.

Both Nodes

Configure LVM to look at Physical Volume signatures from DRBD devices only by editing the /etc/lvm/lvm.conf file:

filter = [ "a|/dev/drbd.*|", "r|.*|" ]

Disable LVM cache (in /etc/lvm/lvm.conf):

write_cache_state = 0

After disabling the LVM cache, make sure to remove any stale cache entries by deleting the /etc/lvm/cache/.cache file.
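Since both settings have to be in place on both nodes, a quick check can save some head scratching later. This is only a sketch: the helper name lvm_conf_ready is made up for this post, and the patterns assume the settings are written the way they are shown above.

```shell
# Hypothetical helper: succeed if an lvm.conf-style file contains the two
# settings used in this post. Commented-out lines do not count.
lvm_conf_ready() {
    conf=$1
    # An active (non-comment) filter line that mentions drbd devices.
    grep -Eq '^[[:space:]]*filter[[:space:]]*=.*drbd' "$conf" || return 1
    # An active write_cache_state line set to 0.
    grep -Eq '^[[:space:]]*write_cache_state[[:space:]]*=[[:space:]]*0' "$conf" || return 1
}

# Usage on each node: only remove the stale cache once the settings are in place.
#   lvm_conf_ready /etc/lvm/lvm.conf && rm -f /etc/lvm/cache/.cache
```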

After this is done on both nodes, we need to create an LVM Volume Group by initializing the DRBD resource as an LVM Physical Volume.  Before that can happen, the metadata for the resource must be created.  Our resource name is clustervol.

Both Nodes

Run the following command:


# drbdadm create-md clustervol
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.

Next, bring the resource up:


# drbdadm up clustervol

Primary Node

Now we need to do the initial sync between the nodes.  This needs to be done on the primary node.  For us, this is viking-07:


[root@viking-07 ~]# drbdadm primary --force clustervol
[root@viking-07 ~]# drbdadm -- --overwrite-data-of-peer primary clustervol

To monitor the status of the sync and of the DRBD resource, run the following command:


[root@viking-07 ~]# drbd-overview verbose

Once the syncing has started, we are ready to initialize the DRBD resource as an LVM Physical Volume by running the following command:


[root@viking-07 ~]# pvcreate /dev/drbd/by-res/clustervol

Next, we create an LVM Volume Group that includes this PV:


[root@viking-07 ~]# vgcreate clustervol /dev/drbd/by-res/clustervol

Finally, we need to add a logical volume to represent the iSCSI Logical Unit (LU). There can be multiple LUs, but in our setup we created one 10 GB logical volume for testing purposes.  We created the LV with the following command:


[root@viking-07 ~]# lvcreate -L 10G -n lun1 clustervol

When the DRBD sync has finally completed, the output of drbd-overview verbose should look similar to this:


[root@viking-07 ~]# drbd-overview verbose
1:clustervol Connected Primary/Secondary UpToDate/UpToDate C r----- lvm-pv: clustervol 930.99G 10.00G
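If you would rather not keep re-running the command by hand, the check can be scripted. The sketch below assumes the output format shown above; the helper name drbd_line_in_sync is hypothetical, and the 30-second polling interval in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: succeed when a drbd-overview line for RESOURCE shows
# both the local and the peer disk as UpToDate.
drbd_line_in_sync() {
    resource=$1
    line=$2
    case "$line" in
        *":$resource "*"UpToDate/UpToDate"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the Primary node: poll until the initial sync has finished.
#   until drbd_line_in_sync clustervol "$(drbd-overview | grep ':clustervol ')"; do
#       sleep 30
#   done
```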

Pacemaker Configuration

Pacemaker is a cluster resource manager that handles resource-level failover.  Corosync is the messaging layer that handles node membership in the cluster and node failure at the infrastructure level.

To configure Pacemaker, do the following:

Primary Node

Generate corosync key:


[root@viking-07 ~]# corosync-keygen

Make the authkey file readable only by root, then copy it to the other node:


[root@viking-07 ~]# chmod 0400 /etc/corosync/authkey
[root@viking-07 ~]# scp /etc/corosync/authkey root@192.168.39.108:/etc/corosync/

Copy corosync.conf.example to corosync.conf under /etc/corosync:


[root@viking-07 ~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf

Edit the following fields (this example assumes eth0 is the primary interface):


[root@viking-07 ~]# export coro_port=4000
[root@viking-07 ~]# export coro_mcast=226.94.1.1
[root@viking-07 ~]# export coro_addr=`ifconfig eth0 | grep "inet addr" | awk '{print $2}' | cut -d ":" -f 2 | cut -d "." -f 1,2,3 | xargs -I {} echo {}".0"`
[root@viking-07 ~]# sed -i.gres "s/.*mcastaddr:.*/mcastaddr:\ $coro_mcast/g" /etc/corosync/corosync.conf
[root@viking-07 ~]# sed -i.gres "s/.*mcastport:.*/mcastport:\ $coro_port/g" /etc/corosync/corosync.conf
[root@viking-07 ~]# sed -i.gres "s/.*bindnetaddr:.*/bindnetaddr:\ $coro_addr/g" /etc/corosync/corosync.conf
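The coro_addr pipeline above depends on the "inet addr" wording of ifconfig output and simply swaps the last octet for 0, which is only correct for a /24 network. Here is a smaller sketch of that same last step, under the same /24 assumption; the function name bindnetaddr_for is made up for illustration.

```shell
# Derive the network address of a /24 network from an IPv4 address by
# replacing the last octet with 0 (same assumption as the pipeline above).
bindnetaddr_for() {
    ip=$1
    # ${ip%.*} strips the final ".<octet>" from the address.
    printf '%s.0\n' "${ip%.*}"
}

bindnetaddr_for 192.168.39.107   # prints 192.168.39.0
```

For any other netmask, work out the network address by hand and set bindnetaddr to that value instead.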

Copy the corosync.conf file to the other node:


[root@viking-07 ~]# scp /etc/corosync/corosync.conf root@192.168.39.108:/etc/corosync/corosync.conf

Both Nodes

Now you are ready to start corosync on both nodes:


# service corosync start

For more detailed information about configuring Pacemaker, please refer to the documentation provided by clusterlabs.

Primary Node

To finish up the configuration, the following commands need to be executed to prepare for the HA iSCSI target configuration for a 2-node cluster:


[root@viking-07 ~]# crm
crm(live)# configure
crm(live)configure# property stonith-enabled="false"
crm(live)configure# property no-quorum-policy="ignore"
crm(live)configure# property default-resource-stickiness="200"
crm(live)configure# commit

For more information on how to use the crm shell, please refer to the CRM CLI (command line interface) tool documentation provided by clusterlabs.

Active/Passive iSCSI Configuration

Now we are ready to configure the Active/Passive iSCSI cluster. The following cluster resources are needed for an active/passive iSCSI Target:

  • A DRBD resource to replicate data.  This is controlled by the cluster manager by switching between the Primary and Secondary roles.
  • An LVM Volume Group, which will be available on whichever node currently holds the DRBD resource in Primary Role
  • A virtual, floating IP for the cluster. This will allow initiators to connect to the target no matter which physical node it is running on
  • iSCSI Target
  • At least one iSCSI LU that corresponds to a Logical Volume in the LVM Volume Group

In our setup, the Pacemaker configuration uses 192.168.44.30 as the virtual IP address for the target with iSCSI Qualified Name (IQN) iqn.1994-05.com.redhat:cfd95480cf87.clustervol. (An important note here is to make sure both nodes have the same initiator name. The initiator name for this configuration is iqn.1994-05.com.redhat:cfd95480cf87.  This information is in the /etc/iscsi/initiatorname.iscsi file.)
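To compare the initiator name on the two nodes without opening the files by hand, something like the following can be used. It is a sketch that assumes the file contains a line of the form InitiatorName=<value>; the helper name initiator_name is hypothetical.

```shell
# Hypothetical helper: print the value of the InitiatorName= line of a file.
initiator_name() {
    sed -n 's/^InitiatorName=//p' "$1"
}

# Usage: run on each node and compare the two values.
#   initiator_name /etc/iscsi/initiatorname.iscsi
```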

The target contains the Logical Unit with LUN1, mapping to the Logical Volume named lun1.

To begin configuring the resources, open the crm shell as root and issue the following commands on the Primary node (i.e. viking-07):


crm(live)# configure
crm(live)configure# primitive p_drbd_clustervol \
    ocf:linbit:drbd \
    params drbd_resource="clustervol" \
    op monitor interval="29" role="Master" \
    op monitor interval="31" role="Slave"
crm(live)configure# ms ms_drbd_clustervol p_drbd_clustervol \
    meta master-max="1" master-node-max="1" clone-max="2" \
    clone-node-max="1" notify="true"

The commands above create a master/slave resource mapping to the DRBD resource clustervol. Next, create the virtual IP, LVM Volume Group, and iSCSI Target resources:


crm(live)configure# primitive p_ip_clustervolip \
    ocf:heartbeat:IPaddr2 \
    params ip="192.168.44.30" cidr_netmask="24" \
    op monitor interval="10s"
crm(live)configure# primitive p_lvm_clustervol \
    ocf:heartbeat:LVM \
    params volgrpname="clustervol" \
    op monitor interval="10s" timeout="30" depth="0"
crm(live)configure# primitive p_target_clustervol \
    ocf:heartbeat:iSCSITarget \
    params iqn="iqn.1994-05.com.redhat:cfd95480cf87.clustervol" \
    tid="1" \
    op monitor interval="10s" timeout="20s"

Now we add the Logical Unit:


crm(live)configure# primitive p_lu_clustervol_lun1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.1994-05.com.redhat:cfd95480cf87.clustervol" lun="1" path="/dev/clustervol/lun1" implementation="tgt" \
    op monitor interval="10"

For iSCSI security considerations, please refer to the section entitled Security Considerations in the document Highly Available iSCSI Storage with DRBD and Pacemaker, provided by Linbit.

To bring it all together, we need to create a resource group from the resources associated with our iSCSI target:


crm(live)configure# group rg_clustervol \
    p_lvm_clustervol \
    p_target_clustervol p_lu_clustervol_lun1 p_ip_clustervolip

The Pacemaker default for the resource group is ordered and co-located.  This means resources contained in the resource group will always run on the same physical machine, will be started in the same order as specified, and stopped in reverse order.

To wrap things up, make sure that the resource group is started on the node where DRBD is in the Primary role:


crm(live)configure# order o_drbd_before_clustervol \
    inf: ms_drbd_clustervol:promote rg_clustervol:start
crm(live)configure# colocation c_clustervol_on_drbd \
    inf: rg_clustervol ms_drbd_clustervol:Master

We have now finished our configuration.  All that is left is to activate it.  To do so, issue the following command in the crm shell:


crm(live)configure# commit

For more information about adding a DRBD-backed service to the cluster configuration, please reference Adding a DRBD-backed service to the cluster configuration in the DRBD 8.3 User’s Guide.

To see the setup of your configured resource group, run the following command using the crm CLI:


[root@viking-07 ~]# crm resource show
Resource Group: rg_clustervol
p_lvm_clustervol (ocf::heartbeat:LVM) Started
p_target_clustervol (ocf::heartbeat:iSCSITarget) Started
p_lu_clustervol_lun1 (ocf::heartbeat:iSCSILogicalUnit) Started
p_ip_clustervolip (ocf::heartbeat:IPaddr2) Started
Master/Slave Set: ms_drbd_clustervol
Masters: [ viking-07.eucalyptus-systems.com ]
Slaves: [ viking-08.eucalyptus-systems.com ]

Accessing the iSCSI Target

Make sure that the cluster is online by using the crm CLI:


[root@viking-07 ~]# crm status
============
Last updated: Tue Apr 3 19:51:25 2012
Stack: openais
Current DC: viking-07.eucalyptus-systems.com - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ viking-07.eucalyptus-systems.com viking-08.eucalyptus-systems.com ]

Resource Group: rg_clustervol
p_lvm_clustervol (ocf::heartbeat:LVM): Started viking-07.eucalyptus-systems.com
p_target_clustervol (ocf::heartbeat:iSCSITarget): Started viking-07.eucalyptus-systems.com
p_lu_clustervol_lun1 (ocf::heartbeat:iSCSILogicalUnit): Started viking-07.eucalyptus-systems.com
p_ip_clustervolip (ocf::heartbeat:IPaddr2): Started viking-07.eucalyptus-systems.com
Master/Slave Set: ms_drbd_clustervol
Masters: [ viking-07.eucalyptus-systems.com ]
Slaves: [ viking-08.eucalyptus-systems.com ]

The status of the cluster reflects the diagram at the beginning of the blog.  We are ready to access iSCSI.

On another machine (here it's viking-09), start up Open iSCSI and attempt to access the target:


[root@viking-09 ~]# service iscsi start
iscsid is stopped
Starting iSCSI daemon: [ OK ]
[ OK ]

We need to start a discovery session on the target portal. Use the cluster IP address for this:


[root@viking-09 ~]# iscsiadm -m discovery -p 192.168.44.30 -t sendtargets

The output from this command should be the name of the target we configured:


192.168.44.30:3260,1 iqn.1994-05.com.redhat:cfd95480cf87.clustervol

Now we are ready to log in to the target:


[root@viking-09 ~]# iscsiadm -m node -p 192.168.44.30 -T iqn.1994-05.com.redhat:cfd95480cf87.clustervol --login
Logging in to [iface: default, target: iqn.1994-05.com.redhat:cfd95480cf87.clustervol, portal: 192.168.44.30,3260] (multiple)
Login to [iface: default, target: iqn.1994-05.com.redhat:cfd95480cf87.clustervol, portal: 192.168.44.30,3260] successful.

For more information on connecting to iSCSI targets, please refer to the section entitled Using highly available iSCSI Targets in the document Highly Available iSCSI Storage with DRBD and Pacemaker.

Running dmesg should show us what block device is associated with the target:


......
scsi9 : iSCSI Initiator over TCP/IP
Vendor: IET Model: Controller Rev: 0001
Type: RAID ANSI SCSI revision: 05
scsi 9:0:0:0: Attached scsi generic sg4 type 12
Vendor: IET Model: VIRTUAL-DISK Rev: 0001
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sde: 20971520 512-byte hdwr sectors (10737 MB)
sde: Write Protect is off
sde: Mode Sense: 49 00 00 08
SCSI device sde: drive cache: write back
SCSI device sde: 20971520 512-byte hdwr sectors (10737 MB)
sde: Write Protect is off
sde: Mode Sense: 49 00 00 08
SCSI device sde: drive cache: write back
sde: sde1
sd 9:0:0:1: Attached scsi disk sde
sd 9:0:0:1: Attached scsi generic sg5 type 0
.....
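The sde name in the output above is whatever the kernel happened to assign, so it may differ on your machine or after a reboot. On many udev-based systems, iSCSI disks also get a stable symlink under /dev/disk/by-path. The sketch below only builds the name we would expect for our target and LUN 1. The naming scheme is an assumption about the udev rules on the initiator, and the helper name iscsi_by_path is made up, so check the directory on your system before relying on it.

```shell
# Build the expected udev by-path name for an iSCSI LUN. The scheme
# ip-<portal>-iscsi-<iqn>-lun-<n> is an assumption about the udev rules on
# the initiator; compare with `ls -l /dev/disk/by-path/` on your system.
iscsi_by_path() {
    portal=$1   # IP address and port, e.g. 192.168.44.30:3260
    iqn=$2      # target IQN
    lun=$3      # LUN number
    printf '/dev/disk/by-path/ip-%s-iscsi-%s-lun-%s\n' "$portal" "$iqn" "$lun"
}

iscsi_by_path 192.168.44.30:3260 iqn.1994-05.com.redhat:cfd95480cf87.clustervol 1
```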

Now the only thing left is to format the device.  For this exercise, we just used XFS.


[root@viking-09 ~]# mkfs.xfs /dev/sde

Once that's completed, we mount the device at a mount point. Our mount point is /mnt/iscsi_vol.


[root@viking-09 ~]# mkdir -p /mnt/iscsi_vol
[root@viking-09 ~]# mount -o noatime,nobarrier /dev/sde /mnt/iscsi_vol/

Using df -ah, we can see the filesystem mounted:


[root@viking-09 ~]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/md1 2.7T 1.5G 2.6T 1% /
proc 0 0 0 - /proc
sysfs 0 0 0 - /sys
devpts 0 0 0 - /dev/pts
/dev/md0 487M 29M 433M 7% /boot
tmpfs 7.9G 0 7.9G 0% /dev/shm
none 0 0 0 - /proc/sys/fs/binfmt_misc
sunrpc 0 0 0 - /var/lib/nfs/rpc_pipefs
/dev/sde 10G 4.6M 10G 1% /mnt/iscsi_vol

Testing HA Failover

A simple way to test failover is to use the crm CLI. Here is the current status of our cluster:


[root@viking-07 ~]# crm status
============
Last updated: Tue Apr 3 20:22:06 2012
Stack: openais
Current DC: viking-07.eucalyptus-systems.com - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ viking-07.eucalyptus-systems.com viking-08.eucalyptus-systems.com ]

Resource Group: rg_clustervol
p_lvm_clustervol (ocf::heartbeat:LVM): Started viking-07.eucalyptus-systems.com
p_target_clustervol (ocf::heartbeat:iSCSITarget): Started viking-07.eucalyptus-systems.com
p_lu_clustervol_lun1 (ocf::heartbeat:iSCSILogicalUnit): Started viking-07.eucalyptus-systems.com
p_ip_clustervolip (ocf::heartbeat:IPaddr2): Started viking-07.eucalyptus-systems.com
Master/Slave Set: ms_drbd_clustervol
Masters: [ viking-07.eucalyptus-systems.com ]
Slaves: [ viking-08.eucalyptus-systems.com ]

To fail over, issue the following crm shell command:


[root@viking-07 ~]# crm resource move rg_clustervol viking-08.eucalyptus-systems.com

This crm command moves the resource group from viking-07 to viking-08.  The status of the cluster should look like the following:


[root@viking-07 ~]# crm status
============
Last updated: Tue Apr 3 20:23:30 2012
Stack: openais
Current DC: viking-07.eucalyptus-systems.com - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ viking-07.eucalyptus-systems.com viking-08.eucalyptus-systems.com ]

Resource Group: rg_clustervol
p_lvm_clustervol (ocf::heartbeat:LVM): Started viking-08.eucalyptus-systems.com
p_target_clustervol (ocf::heartbeat:iSCSITarget): Started viking-08.eucalyptus-systems.com
p_lu_clustervol_lun1 (ocf::heartbeat:iSCSILogicalUnit): Started viking-08.eucalyptus-systems.com
p_ip_clustervolip (ocf::heartbeat:IPaddr2): Started viking-08.eucalyptus-systems.com
Master/Slave Set: ms_drbd_clustervol
Masters: [ viking-08.eucalyptus-systems.com ]
Slaves: [ viking-07.eucalyptus-systems.com ]

This is a diagram of what the cluster looks like now:

[Diagram: HA iSCSI setup where viking-08 is the active node]

You can switch the resource back to viking-07 by running the following crm CLI command:


[root@viking-07 ~]# crm resource move rg_clustervol viking-07.eucalyptus-systems.com

The status of the cluster should look similar to this:


[root@viking-07 ~]# crm status
============
Last updated: Tue Apr 3 20:28:29 2012
Stack: openais
Current DC: viking-07.eucalyptus-systems.com - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ viking-07.eucalyptus-systems.com viking-08.eucalyptus-systems.com ]

Resource Group: rg_clustervol
p_lvm_clustervol (ocf::heartbeat:LVM): Started viking-07.eucalyptus-systems.com
p_target_clustervol (ocf::heartbeat:iSCSITarget): Started viking-07.eucalyptus-systems.com
p_lu_clustervol_lun1 (ocf::heartbeat:iSCSILogicalUnit): Started viking-07.eucalyptus-systems.com
p_ip_clustervolip (ocf::heartbeat:IPaddr2): Started viking-07.eucalyptus-systems.com
Master/Slave Set: ms_drbd_clustervol
Masters: [ viking-07.eucalyptus-systems.com ]
Slaves: [ viking-08.eucalyptus-systems.com ]
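One thing to be aware of: crm resource move works by adding a location constraint for the resource, and that constraint stays in the configuration until it is removed. Once you are done testing, crm resource unmove clears it. The wrapper below is a small sketch; the CRM variable and the function name clear_move_constraint are conveniences invented for this post so the command can be previewed without a cluster.

```shell
# CRM can be overridden (e.g. CRM=echo) to preview the command without a cluster.
CRM=${CRM:-crm}

# Remove the location constraint left behind by "crm resource move".
clear_move_constraint() {
    "$CRM" resource unmove "$1"
}

# Usage on a cluster node:
#   clear_move_constraint rg_clustervol
```

Afterwards, crm configure show should no longer list a location constraint for rg_clustervol.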

A more thorough test is to take an md5sum of a file (e.g. an ISO) and copy it from a desktop/laptop to the machine that has the iSCSI target mounted (e.g. viking-09).  While the copy is happening, you can use the crm CLI to fail over back and forth. The copy should continue without a noticeable delay.  You can monitor the status of the cluster by using crm_mon. After the copy is complete, take an md5sum of the ISO on the machine it was copied to.  The md5sums should match.
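The comparison at the end of that test can be scripted. This is a minimal sketch: the helper name verify_md5 and the file names in the usage comment are placeholders, not part of our setup.

```shell
# Hypothetical helper: succeed if FILE's MD5 checksum equals EXPECTED.
verify_md5() {
    expected=$1
    file=$2
    # Reading from stdin makes md5sum print the checksum without a file name.
    actual=$(md5sum < "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Usage: note the checksum on the desktop/laptop first,
#   md5sum image.iso
# then, on viking-09 after the copy has finished:
#   verify_md5 <checksum from the desktop> /mnt/iscsi_vol/image.iso && echo "checksums match"
```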

The next phase of this blog will be to script as much of this install and configuration as possible (e.g. using Puppet or Chef).  Stay tuned for more information about this.  Hope you enjoyed this blog.  Let me know if you have questions, suggestions, and/or comments.

Enjoy!