HAST (Highly Available Storage) is a new concept for FreeBSD and is under constant development. HAST lets you transparently store data on two physically separated machines connected over a TCP/IP network. HAST operates at the block level, which makes it transparent to file systems, and provides disk-like devices in the /dev/hast directory.
In this article we will create two identical HAST nodes, hast1 and hast2. Both nodes will use one NIC connected to a VLAN for data synchronization, and another NIC configured via CARP in order to share the same IP address across the network. The first node will be called “storage1.hast.test”, the second “storage2.hast.test”, and they will both listen on a common IP address which we will bind to “storage.hast.test”.
HAST binds its resource names according to the machine’s hostname. Therefore, we will use “hast1.freebsd.loc” and “hast2.freebsd.loc” as the machines’ hostnames so that HAST can operate without complaining.
For starters, let’s set up two identical nodes. For this example I have installed FreeBSD 9.0-RELEASE on two different instances on a Linux KVM host. Both nodes have 512MB of RAM, one SATA drive containing the OS and three SATA drives which will be used to create our shared raidz1 pool. The final result looks like this:
In order for CARP to work we don’t have to compile a new kernel. We can just load it as a module by adding the following line to /boot/loader.conf:
if_carp_load="YES"
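Since /boot/loader.conf may not exist yet on a fresh install, the line can be added in a way that is safe to repeat. This is only a minimal sketch: the helper name ensure_line is made up for this example and is not a FreeBSD tool.

```sh
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE unless an identical line is
# already there. "ensure_line" is a hypothetical helper for this sketch.
ensure_line() {
    file=$1
    line=$2
    # Create the file if it is missing.
    [ -f "$file" ] || : > "$file"
    # -F fixed string, -x whole-line match, -q quiet.
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (as root, on both nodes):
#   ensure_line /boot/loader.conf 'if_carp_load="YES"'
```

Running it twice leaves a single copy of the line, so it can go into a provisioning script without further checks.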
Both nodes are now set up, so it is time to make some adjustments. First, a decent /etc/rc.conf for the first node:
zfs_enable="YES"
###Primary Interface##
ifconfig_re0="inet 10.10.10.181 netmask 255.255.255.0"
###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.100 netmask 255.255.255.0"
defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast1.freebsd.loc"
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"
hastd_enable=YES
The second node’s configuration matches the first except for the IP addresses, the hostname and the CARP advskew value:
zfs_enable="YES"
###Primary Interface##
ifconfig_re0="inet 10.10.10.182 netmask 255.255.255.0"
###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.101 netmask 255.255.255.0"
defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast2.freebsd.loc"
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 100"
hastd_enable=YES
At this point each node’s re1 has its own IP address for HAST synchronization. Each node’s re0 also has its own IP address, and the two nodes share a third, common IP address assigned to carp0.
As a result, re1 is used for HAST synchronization on its own VLAN, while carp0, which is associated with re0, sits on the same VLAN as the rest of our clients.
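To see which node currently holds the shared address, we can look at the carp line that ifconfig prints for carp0. The sketch below assumes that line has the form "carp: MASTER vhid 1 advbase 1 advskew 0"; if your release prints it differently, the pattern needs adjusting. The function name carp_state is made up for this example.

```sh
#!/bin/sh
# carp_state: read `ifconfig carp0` output on stdin and print the CARP
# state (for example MASTER or BACKUP).
# Assumption: the state is the second word on the line starting "carp:".
carp_state() {
    awk '$1 == "carp:" { print $2; exit }'
}

# Usage:
#   ifconfig carp0 | carp_state
```

With the advskew values above, hast1 (advskew 0) is expected to win the election and report MASTER while hast2 reports BACKUP.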
In order for HAST to function correctly, every node has to resolve the correct IPs. We don’t want to rely on DNS for this because DNS can fail. Instead we will use an /etc/hosts file that is identical on every node:
::1 localhost localhost.freebsd.loc
127.0.0.1 localhost localhost.freebsd.loc
192.168.100.100 hast1.freebsd.loc hast1
192.168.100.101 hast2.freebsd.loc hast2
10.10.10.181 storage1.hast.test storage1
10.10.10.182 storage2.hast.test storage2
10.10.10.180 storage.hast.test storage
Next, we have to create the /etc/hast.conf file. Here we will declare the resources that we want to create. Each resource will eventually appear as a device under /dev/hast on the primary node. Every resource names a local physical device and the remote node it replicates to. The /etc/hast.conf file must be exactly the same on every node.
resource disk1 {
        on hast1 {
                local /dev/ad1
                remote hast2
        }
        on hast2 {
                local /dev/ad1
                remote hast1
        }
}
resource disk2 {
        on hast1 {
                local /dev/ad2
                remote hast2
        }
        on hast2 {
                local /dev/ad2
                remote hast1
        }
}
resource disk3 {
        on hast1 {
                local /dev/ad3
                remote hast2
        }
        on hast2 {
                local /dev/ad3
                remote hast1
        }
}
In this example we are sharing three resources: disk1, disk2 and disk3. Each resource specifies a local device and the address of the remote node. With this configuration in place, we are ready to begin setting up our HAST devices.
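Because the three resource blocks differ only in the resource name and the disk, the file can also be generated instead of typed. The following is a rough sketch, assuming the same device path exists on both nodes as in our setup; the function name gen_hast_conf and its NAME:DEVICE argument format are invented for this example.

```sh
#!/bin/sh
# gen_hast_conf NODE_A NODE_B NAME:DEVICE...
# Prints one resource block per NAME:DEVICE pair, using the same local
# device on both nodes. Hypothetical helper, not part of HAST.
gen_hast_conf() {
    node_a=$1
    node_b=$2
    shift 2
    for pair in "$@"; do
        res=${pair%%:*}     # text before the first colon
        dev=${pair#*:}      # text after the first colon
        printf 'resource %s {\n' "$res"
        printf '\ton %s {\n\t\tlocal %s\n\t\tremote %s\n\t}\n' "$node_a" "$dev" "$node_b"
        printf '\ton %s {\n\t\tlocal %s\n\t\tremote %s\n\t}\n' "$node_b" "$dev" "$node_a"
        printf '}\n'
    done
}

# Usage:
#   gen_hast_conf hast1 hast2 disk1:/dev/ad1 disk2:/dev/ad2 disk3:/dev/ad3 > /etc/hast.conf
```

Generating the file once and copying it to the other node is one way to make sure both copies really are identical.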
Let’s start hastd on both nodes first:
hast1# /etc/rc.d/hastd start
hast2# /etc/rc.d/hastd start
Now on the primary node we will initialize our resources, create them and finally assign the primary role:
hast1# hastctl role init disk1
hast1# hastctl role init disk2
hast1# hastctl role init disk3
hast1# hastctl create disk1
hast1# hastctl create disk2
hast1# hastctl create disk3
hast1# hastctl role primary disk1
hast1# hastctl role primary disk2
hast1# hastctl role primary disk3
Next, on the secondary node we will initialize our resources, create them and finally assign a secondary role:
hast2# hastctl role init disk1
hast2# hastctl role init disk2
hast2# hastctl role init disk3
hast2# hastctl create disk1
hast2# hastctl create disk2
hast2# hastctl create disk3
hast2# hastctl role secondary disk1
hast2# hastctl role secondary disk2
hast2# hastctl role secondary disk3
There are other ways of creating resources and assigning roles to them. Having repeated this procedure a few times, I found that this sequence usually works.
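The nine commands per node follow one pattern, so they can be wrapped in a loop. This is a sketch of the same sequence (init all, create all, then assign the role), not a different procedure. The function name hast_setup and the HASTCTL override are assumptions made for this example; the override only exists so the commands can be printed instead of executed.

```sh
#!/bin/sh
# Set HASTCTL="echo hastctl" for a dry run that only prints the commands.
: "${HASTCTL:=hastctl}"

# hast_setup ROLE RESOURCE...
# Runs "role init", "create" and "role ROLE" for every resource, in the
# same order as the manual commands. Hypothetical helper.
hast_setup() {
    role=$1
    shift
    for step in "role init" "create" "role $role"; do
        for res in "$@"; do
            # $HASTCTL and $step are unquoted on purpose so they split
            # into separate words.
            $HASTCTL $step "$res" || return 1
        done
    done
}

# Usage:
#   hast_setup primary disk1 disk2 disk3      # on hast1
#   hast_setup secondary disk1 disk2 disk3    # on hast2
```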
Now check the status on both nodes:
hast1# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ada1
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ada2
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ada3
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
The first node looks good. Status is complete.
hast2# hastctl status
disk1:
  role: secondary
  provname: disk1
  localpath: /dev/ada1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: secondary
  provname: disk2
  localpath: /dev/ada2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: secondary
  provname: disk3
  localpath: /dev/ada3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
So does the second. As mentioned earlier, there are different ways of doing this the first time. What you are looking for is the line status: complete. If you get a degraded status you can always repeat the procedure.
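Checking three resources by eye is easy, but the same check can be scripted. The sketch below assumes the multi-line output format shown above, with one "status:" line per resource; on a release where hastctl status prints something else, the parser would have to change. The function name all_complete is made up.

```sh
#!/bin/sh
# all_complete: read `hastctl status` output on stdin. Succeeds only if
# at least one "status:" line was seen and every one says "complete".
all_complete() {
    awk '
        $1 == "status:" { seen++; if ($2 != "complete") bad++ }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Usage:
#   hastctl status | all_complete && echo "all resources complete"
```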
Now it is time to create our ZFS pool. The primary node should have a /dev/hast directory containing our resources. This directory appears only on the active node.
hast1# zpool create zhast raidz1 /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3
hast1# zpool status zhast
  pool: zhast
 state: ONLINE
  scan: none requested
config:
        NAME            STATE     READ WRITE CKSUM
        zhast           ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            hast/disk1  ONLINE       0     0     0
            hast/disk2  ONLINE       0     0     0
            hast/disk3  ONLINE       0     0     0
We can now use hastctl status on each node to see if everything looks ok. The magic word we are looking for here is: replication: fullsync
At this point both of our nodes should be available for failover. We have storage1 running as primary and sharing a pool called zhast, while storage2 is currently in standby mode. If we have set up DNS properly we can ssh to storage.hast.test, or directly to its CARP IP, 10.10.10.180.
In order to perform a failover we first have to export our pool from the first node and change the role of each resource to secondary. Then we change the role of each resource to primary on the standby node and import the pool. We will do this manually first to test that failover really works, but for a real HA solution we will eventually create a script that takes care of it.
First let’s export our pool and change the role of our resources:
hast1# zpool export zhast
hast1# hastctl role secondary disk1
hast1# hastctl role secondary disk2
hast1# hastctl role secondary disk3
Now, let’s reverse the procedure on the standby node:
hast2# hastctl role primary disk1
hast2# hastctl role primary disk2
hast2# hastctl role primary disk3
hast2# zpool import zhast
The roles have changed successfully. Let’s see our pool status:
hast2# zpool status zhast
  pool: zhast
 state: ONLINE
  scan: none requested
config:
        NAME            STATE     READ WRITE CKSUM
        zhast           ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            hast/disk1  ONLINE       0     0     0
            hast/disk2  ONLINE       0     0     0
            hast/disk3  ONLINE       0     0     0
errors: No known data errors
Again, by using hastctl status on each node we can verify that the roles have indeed changed and that the status is complete. This is a sample output from the second node now in charge:
hast2# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ad1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ad2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ad3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
It is now time to automate this procedure. When do we want our servers to fail over automatically?
One reason would be that the primary node is not responding on the external network and is therefore unable to serve its clients. Using devd events we can catch the carp interface going up or down, which signals a state change.
Add the following lines to /etc/devd.conf on both nodes:
notify 30 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_UP";
        action "/usr/local/bin/failover master";
};
notify 30 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_DOWN";
        action "/usr/local/bin/failover slave";
};
And now let’s create the failover script, which will do automatically what we did manually before:
#!/bin/sh

# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>
# Modified by George Kontostanos <gkontos.mail@gmail.com>

# The names of the HAST resources, as listed in /etc/hast.conf
resources="disk1 disk2 disk3"

# delay in mounting HAST resource after becoming master
# make your best guess
delay=3

# logging
log="local0.debug"
name="failover"
pool="zhast"

# end of user configurable stuff

case "$1" in
    master)
        logger -p $log -t $name "Switching to primary provider for ${resources}."
        sleep ${delay}

        # Wait for any "hastd secondary" processes to stop
        for disk in ${resources}; do
            while $( pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1 ); do
                sleep 1
            done

            # Switch role for each disk
            hastctl role primary ${disk}
            if [ $? -ne 0 ]; then
                logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
                exit 1
            fi
        done

        # Wait for the /dev/hast/* devices to appear
        for disk in ${resources}; do
            for I in $( jot 60 ); do
                [ -c "/dev/hast/${disk}" ] && break
                sleep 0.5
            done

            if [ ! -c "/dev/hast/${disk}" ]; then
                logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
                exit 1
            fi
        done

        logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."

        logger -p $log -t $name "Importing Pool"
        # Import ZFS pool. Do it forcibly as it remembers hostid of
        # the other cluster node.
        # Note: ${resource} is never set in this script, so it expands to
        # an empty string in the pool messages logged below.
        out=`zpool import -f "${pool}" 2>&1`
        if [ $? -ne 0 ]; then
            logger -p local0.error -t hast "ZFS pool import for resource ${resource} failed: ${out}."
            exit 1
        fi
        logger -p local0.debug -t hast "ZFS pool for resource ${resource} imported."
    ;;

    slave)
        logger -p $log -t $name "Switching to secondary provider for ${resources}."

        # Switch roles for the HAST resources
        zpool list | egrep -q "^${pool} "
        if [ $? -eq 0 ]; then
            # Forcibly export file pool.
            out=`zpool export -f "${pool}" 2>&1`
            if [ $? -ne 0 ]; then
                logger -p local0.error -t hast "Unable to export pool for resource ${resource}: ${out}."
                exit 1
            fi
            logger -p local0.debug -t hast "ZFS pool for resource ${resource} exported."
        fi

        for disk in ${resources}; do
            sleep $delay
            hastctl role secondary ${disk} 2>&1
            if [ $? -ne 0 ]; then
                logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
                exit 1
            fi
            logger -p $log -t $name "Role switched to secondary for resource ${disk}."
        done
    ;;
esac
Let’s try it and see if it works. Log into both the currently active and the standby node. Make sure that you are on the active node by issuing a hastctl status command. Then force a failover by bringing down the interface associated with carp0:
hast1# ifconfig re0 down
Watch the generated messages:
hast1# tail -f /var/log/debug.log
Feb 6 15:01:41 hast1 failover: Switching to secondary provider for disk1 disk2 disk3.
Feb 6 15:01:49 hast1 hast: ZFS pool for resource exported.
Feb 6 15:01:52 hast1 failover: Role switched to secondary for resource disk1.
Feb 6 15:01:55 hast1 failover: Role switched to secondary for resource disk2.
Feb 6 15:01:58 hast1 failover: Role switched to secondary for resource disk3.
hast2# tail -f /var/log/debug.log
Feb 6 15:02:15 hast2 failover: Switching to primary provider for disk1 disk2 disk3.
Feb 6 15:02:19 hast2 failover: Role for HAST resources disk1 disk2 disk3 switched to primary.
Feb 6 15:02:19 hast2 failover: Importing Pool
Feb 6 15:02:52 hast2 hast: ZFS pool for resource imported.
Voila! The failover worked like a charm and now hast2 has assumed the primary role.
Further considerations:
What we did today is a basic setup of two nodes sharing a raidz1 pool with automatic role failover in case of a failure that would result in a loss of a carp interface.
Obviously, a similar devd event would be generated if we lose the HAST replication interface. This needs to be addressed in a similar way, since losing that interface leaves us with no synchronization at all.
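Until such an event is handled properly, a periodic check can at least tell us when the replication peer stops answering. This is only a sketch of one possible check, not something from the setup above: the function names and the PING override are invented here, and a ping over the replication network says nothing about whether hastd itself is healthy.

```sh
#!/bin/sh
# PING can be overridden for testing; by default the system ping is used.
: "${PING:=ping}"

# replication_link_ok PEER: succeed if PEER answers one echo request.
replication_link_ok() {
    $PING -c 1 "$1" > /dev/null 2>&1
}

# report_replication_link PEER: print a one-line verdict and return
# non-zero when the peer is unreachable. Hypothetical helper.
report_replication_link() {
    if replication_link_ok "$1"; then
        echo "replication peer $1 reachable"
    else
        echo "replication peer $1 NOT reachable, HAST cannot synchronize"
        return 1
    fi
}

# Usage (from cron on hast1; hast2 resolves to the re1 address via /etc/hosts):
#   report_replication_link hast2 > /dev/null || \
#       logger -p local0.error -t failover "replication link to hast2 is down"
```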
Going further, we would have to add scripts that bring services up and down during a failover.
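As a starting point, such a helper could look like the sketch below. Everything in it is an assumption for illustration: the function name services_ctl, the SERVICE override (used only for dry runs) and the example service names. Services are started in the order given and stopped in reverse order.

```sh
#!/bin/sh
# Set SERVICE="echo service" for a dry run that only prints the commands.
: "${SERVICE:=service}"

# services_ctl ACTION NAME...
# Runs "service NAME ACTION" for each name. For stop/onestop the list is
# processed in reverse so dependent services go down first.
services_ctl() {
    action=$1
    shift
    list=""
    for svc in "$@"; do
        case "$action" in
            stop|onestop) list="$svc $list" ;;   # prepend: reverse order
            *)            list="$list $svc" ;;   # append: given order
        esac
    done
    for svc in $list; do
        $SERVICE "$svc" "$action" || echo "warning: $svc $action failed" >&2
    done
}

# Usage (service names are examples only):
#   services_ctl onestart mountd nfsd    # after the pool has been imported
#   services_ctl onestop mountd nfsd     # before the pool is exported
```

In the failover script, the stop call would belong in the slave branch before the pool export, and the start call in the master branch after the pool import.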



Great guide. Very well commented and easy to follow. What would happen if the primary node disappears due to a serious hardware failure? If the primary cannot properly export its ZFS pool and switch its role, how will the secondary node compensate? Would it still be able to import the ZFS pool successfully and continue on as the primary/only node? What would happen when the failed node comes back online, as both will be configured as primary?
I would love to see another post discussing how to setup systems to survive a sudden total loss of a node.
Thanks Brian. The idea of a manual failover is to properly switch nodes from primary to secondary in case of scheduled maintenance. So, for example, if we were to issue “/usr/local/bin/failover slave” on the primary node and “/usr/local/bin/failover master” on the secondary node, this would reverse the roles and we would have a perfectly synchronized failover.
In real life disaster can happen. If the hardware issue affects the ethernet interface that carp is configured to listen on, then devd will initiate a failover. If however this doesn’t happen, we can always manually execute “/usr/local/bin/failover master” on the secondary node, which will immediately bring it to active mode.
The pool doesn’t necessarily have to be exported for the failover to work. HAST works at the GEOM provider level, meaning that data is always in sync as long as there is an active connection between both nodes. Therefore, the secondary node would assume its resources as primary and then import the pool.
I have been struggling with this also.
If you disable a carp interface this all works well.
Even if a node completely disappears, the slave will become master and all is fine.
But if a node disappears AND comes back alive, the carp interface will ALWAYS start as master. This bug was well known before the 9.0 release.
Due to the fact that it starts as master, the master script will run if a node comes back up after a power failure or whatever reason it rebooted.
So you have two masters.
In my testing, and it seems a little random, you can get into a split brain scenario.
I could not get the job done without several reboots of both master and slave to stay out of the split brain.
I always reached the split brain scenario.
I will test it once more maybe something changed.
regards
Johan
There is a commit in 9-STABLE which allows the user to set the state of the carp cluster.
Link: http://svnweb.freebsd.org/base?view=revision&revision=232486
So if I read this correctly, the failover script that kicks the slave to master would have to set “sysctl net.inet.carp.preempt” so it can not be demoted if the original master comes back up?
Would one then need code added to the startup actions of the nodes to check for the presence of an existing master and configure itself accordingly? This way when the original master comes back up, it does not try to be the master and create the split brain scenario Johan observed.
Thanks,
Brian
I believe so. This can be achieved by setting sysctl net.inet.carp.preempt=0.
I really have to try this again in the future with a 9.0-STABLE.
Great article. Have found your site from the FreeBSD forums and actually you have lots of useful guides.
I read the article, but I think this is a solution only worthwhile for enterprise stuff which has to be available 100% of the time. I think at home a better solution would be simply to set up a NAS plus a backup server (on separate physical machines) which may eventually receive ZFS snapshots. Or do you think this HAST solution may be equally cost-effective (at home)?
Hi Stefano, thanks for stopping by.
This is a solution for high availability so yes it is too much for home use!
I personally use ZFS differential snapshots to an external drive for my SOHO server.
Let’s say I wanted to phase this in to a current production environment. Would I be able to set up a single HAST node by omitting all the hast2 data, create a backup by rsyncing my data from the primary file server to the backup, make my backup the primary, then rebuild the primary and add it in as the slave? I figure the process would be something like:
1) Set HAST up as a single node on the back up file server, HAST1
2) Create a ZFS Pool on the HAST1 drives
3) Copy Data from the primary file server to the HAST1 ZFS Pool.
4) Migrate all machines to use HAST1.
5) Configure and create the HAST2 machine using the old primary file server.
6) Alter both configs so that there will be 2 nodes.
7) Restart HAST services on both machines and watch them sync
I don’t see why not. However, test it first on a virtual machine!
thanks for all this info.
will adding another node to the cluster follow the same procedure?
do you have another, more advanced article about adding more nodes?
thanks.
Adding another node is not an option yet. At this stage HAST only works with 2 nodes.
Hast is a SAN (without iscsi) right?
or what can i use as a replacement for iscsi using HAST with freebsd.
can you please tell me if you know.
thanks.
I would recommend istgt. Have a look here for more information: http://people.freebsd.org/~rse/iscsi/iscsi.txt
can we create logical units after you created zpool hast?
thanks.
@George
I am not sure I understand the question…
like the command below, the way we use to do it with nexenta 3.0.
sbdadm create-lu /dev/zvol/rdsk/tank/zvol1
thanks.
Absolutely, you can do anything as long as you don’t make any direct changes to the vdevs. In this case it would go like:
# zfs create -V <size> zhast/zvol1
thanks, i’m actually testing it now.
for the module if_carp_load="YES"
/boot/loader.conf
i think you meant /boot/defaults/loader.conf
No, you don’t want to change the defaults file. /boot/loader.conf will override the defaults and will never be accidentally changed by an OS upgrade.
gKontos,
i got it working… I did everything step by step. I connected the storage using the ip carp to a small cloud and created some vm to it, unplugged the Ethernet cable and automatically hast2 became primary.
later on when i brought hast1, it became secondary.
and all my vm never went off line.
i see if i can bring it to production now, and see what happens.
thank you very much!
one thing: freebsd doesn’t have any file named /boot/loader.conf
so i put if_carp_load="YES" in /boot/defaults/loader.conf.
you said that /boot/loader.conf will override the defaults, but it can’t because it doesn’t exist.
I am glad it worked for you. The /boot/loader.conf file does not exist by default, so you have to create it.