GlusterFS (GFS) is an open source counterpart to Microsoft's Distributed File System (DFS). It's a service that replicates the contents of a filesystem in real time from one server to another. Clients can connect to any server, and changes made to a file replicate automatically. It's similar in spirit to rsync or Syncthing, but far more automatic and transparent. A FreeBSD port has been available since v3.4, and (as of this post) it is at version 8.0, with 9.0 due to be released soon.
This is a technology demonstration and tutorial for deploying GlusterFS on FreeBSD. As a bonus this is an opportunity to play with Bhyve.
Requirements
If you want to follow along, you will need:
- A workstation or desktop that is capable of virtualization[1].
- FreeBSD 12.2 or later as your virtualization host's OS.
- At least 16 GB of RAM and 64 GB of free disk space. (You can lower some of the values in the templates, but not by too much.)
- Root access. All these commands should be run as the root user.
GlusterFS will run on any version of FreeBSD after 11.0, and works best with a minimum of 3 hosts.
Summary
Bhyve will be used to run the virtual machines that act as our GlusterFS nodes. It's a great way to familiarize yourself with the process before going out and doing it on production VMs or physical hardware. The guide consists of four parts:
- Initializing the Host Machine for Bhyve
- Creating and starting VMs
- Deploying and configuring GlusterFS
- Setting up clients
If you plan on using real hosts, you can skip the first few sections and jump ahead to the part about installing GlusterFS. Just make sure FreeBSD is configured, along with the volume you wish to use for GlusterFS. The official GlusterFS documentation was used as reference material while putting together the GlusterFS-specific sections.
Install Bhyve Tooling
Bhyve is a type-2 hypervisor for FreeBSD that has been part of the base system since 10.0-RELEASE. Sample scripts are included to help you get started, but they are low level and cumbersome. The FreeBSD ports tree includes a higher-level tool called sysutils/vm-bhyve that wraps the management tasks in a single command (vm). For full details about VM-Bhyve's features and usage, refer to its documentation. The workflow should be familiar to users of other virtualization platforms like VMware or VirtualBox.
Install vm-bhyve using pkg. The qemu-utils package is needed to use the cloud images feature[2].
pkg install -y vm-bhyve qemu-utils
Do as the post installation message says. We'll use a local directory under our home folder as the datastore.
sysrc vm_enable="YES"
sysrc vm_dir="/usr/home/me/Documents/Bhyve"
mkdir -p /usr/home/me/Documents/Bhyve
vm init
cp /usr/local/share/examples/vm-bhyve/* /usr/home/me/Documents/Bhyve/.templates/
The virtual guests will gain network access using a bridge on the host's interface. Usually this is done with ifconfig, but VM-Bhyve can take care of it with a few simple commands. VM-Bhyve will also store the configuration and apply it automatically the next time the system boots.
Use the vm command to create a virtual switch and 'plug in' our host machine's interface.
vm switch create public
vm switch add public rl1
Replace rl1 with the name of your primary network interface. Refer to the virtual switch documentation for additional configuration options. If you are unable to use bridging, host-only networking with NAT is a viable alternative that requires a little more effort to set up; a rough sketch follows.
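A minimal, untested sketch of the host-only approach might look like the following. The 10.10.10.0/24 subnet and the switch name are just examples, and the exact vm switch options (particularly -a and the nat sub-command) should be verified against the vm-bhyve documentation for your version:

# create a private switch and give the host an address on it (example subnet)
vm switch create -a 10.10.10.1/24 private
# some vm-bhyve versions can enable NAT for a switch automatically (this
# typically requires pf and dnsmasq on the host); otherwise add a pf NAT
# rule for 10.10.10.0/24 yourself
vm switch nat private on

Guests attached to this switch would then use addresses in 10.10.10.0/24 instead of the 192.168.0.x addresses used throughout the rest of this guide.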
Prepare VM Templates
The FreeBSD build process produces virtual machine disk images as part of each release. We'll use a 64-bit FreeBSD 12.2-RELEASE machine for our GlusterFS hosts.
Fetch the disk image from the official FreeBSD releases FTP repository.
vm img https://download.freebsd.org/ftp/releases/VM-IMAGES/12.2-RELEASE/amd64/Latest/FreeBSD-12.2-RELEASE-amd64.raw.xz
Our GlusterFS hosts will have 2 volumes. The first volume will hold the OS. The second will be used for the replicated GlusterFS volume. To save time we are going to create a template that will be used when provisioning the virtual machines. The template will be based off the default one that ships with VM-Bhyve.
Copy the default template to a new template file and name it freebsd-gluster.conf
cp /usr/local/share/examples/vm-bhyve/default.conf /usr/home/me/Documents/Bhyve/.templates/freebsd-gluster.conf
Open the file using your favorite editor and add the block device for the GlusterFS volume:
disk1_type="virtio-blk"
disk1_name="gluster0.img"
disk1_size=1G
Change the CPU and memory values. Keep in mind you'll be running 3 of these VMs simultaneously, so plan accordingly.
cpu=2
memory=4G
This is what my template file looks like:
loader="bhyveload"
cpu=2
memory=4G
network0_type="virtio-net"
network0_switch="public"
disk0_type="virtio-blk"
disk0_name="disk0.img"
disk1_type="virtio-blk"
disk1_name="gluster0.img"
disk1_size=1G
Create and Boot Virtual Machines
For our setup we will need 3 virtual machines. We'll name them Sun, Earth, and Moon, and assign static IPs.
| Machine Name | Hostname | IP Address |
|---|---|---|
| Sun | sun.gluster.domain.tld | 192.168.0.46 |
| Earth | earth.gluster.domain.tld | 192.168.0.47 |
| Moon | moon.gluster.domain.tld | 192.168.0.48 |
Issue the following commands to create 3 new FreeBSD virtual machines:
vm create -t freebsd-gluster -i FreeBSD-12.2-RELEASE-amd64.raw Sun
vm create -t freebsd-gluster -i FreeBSD-12.2-RELEASE-amd64.raw Earth
vm create -t freebsd-gluster -i FreeBSD-12.2-RELEASE-amd64.raw Moon
Boot your virtual machines:
vm start Sun
vm start Earth
vm start Moon
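You can confirm that all three machines came up using vm list; each should be reported as running.

# list the VMs in the datastore along with their loader, CPU/memory, and state
vm list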
Congratulations, you just provisioned and booted three FreeBSD virtual machines using Bhyve!
Connect to Your VMs
These FreeBSD virtual machine images ship with only the most basic configuration. They will have the default hostname and (probably) no network access[3]. You may also notice that there is no GUI! The 'correct' way to access your newly provisioned FreeBSD VM is to connect to its serial console using cu. VM-Bhyve provides a shortcut with the console sub-command.
Connect to the Sun virtual machine:
vm console Sun
You will see the text "Connected" printed on your terminal. If the machine has already booted to the login prompt, it will likely not produce any additional output. Press ENTER to re-print the login prompt.
FreeBSD/amd64 (freebsd) (ttyu0)
login:
To exit the console, hold down SHIFT, then tap the ~ (tilde) key on your keyboard. Release SHIFT and press CTRL+D.
~
[EOT]
If the above key combination does not work, try the alternative: hold SHIFT, tap and release ~ (tilde), then press . (a period).
Configure Your VMs' Networking
Configure each of your virtual machines as required. At minimum you should set a root password, add a non-root user, set a hostname, configure networking, and set up DNS. For convenience, you may also want to enable SSH access to the machines. Here are the relevant contents of the /etc/rc.conf file for each of the machines:
For the Sun
hostname="sun.gluster.domain.tld"
ifconfig_vtnet0="inet 192.168.0.46 netmask 0xffffff00"
defaultrouter="192.168.0.100"
sshd_enable="YES"
The Earth
hostname="earth.gluster.domain.tld"
ifconfig_vtnet0="inet 192.168.0.47 netmask 0xffffff00"
defaultrouter="192.168.0.100"
sshd_enable="YES"
And Moon
hostname="moon.gluster.domain.tld"
ifconfig_vtnet0="inet 192.168.0.48 netmask 0xffffff00"
defaultrouter="192.168.0.100"
sshd_enable="YES"
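If you would rather not edit /etc/rc.conf by hand from the console, the same settings can be applied with sysrc. Shown here for Moon; substitute the matching hostname and address on the other two machines:

# writes the same values into /etc/rc.conf as shown above
sysrc hostname="moon.gluster.domain.tld"
sysrc ifconfig_vtnet0="inet 192.168.0.48 netmask 0xffffff00"
sysrc defaultrouter="192.168.0.100"
sysrc sshd_enable="YES"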
The /etc/resolv.conf file should be the same on all the hosts. Notice that my search domain is gluster.domain.tld.
search gluster.domain.tld
nameserver 192.168.0.10
nameserver 192.168.0.11
If you don't have, or can't manage, internal DNS[4], it's perfectly fine to maintain the hostname-to-IP-address mappings in the /etc/hosts file on each guest VM and on the host machine:
192.168.0.46 sun sun.gluster.domain.tld
192.168.0.47 earth earth.gluster.domain.tld
192.168.0.48 moon moon.gluster.domain.tld
Reboot each VM and try to connect using SSH. Make sure the VMs can reach one another by hostname:
root@sun:/home/admin # ping moon
PING moon.gluster.domain.tld (192.168.0.48): 56 data bytes
64 bytes from 192.168.0.48: icmp_seq=0 ttl=64 time=0.445 ms
root@sun:/home/admin # ping earth
PING earth.gluster.domain.tld (192.168.0.47): 56 data bytes
64 bytes from 192.168.0.47: icmp_seq=0 ttl=64 time=0.391 ms
Prepare GlusterFS Volume (aka Brick)
A GlusterFS "brick" as it's formally called is the block device with a filesystem that will be used to create a replicated GlusterFS volume. Of course this being FreeBSD we are going to use ZFS as the underlying filesystem for our GlusterFS brick.
Verify that the second disk is present on the system. It should show up as vtbd1.
# geom disk list
Geom name: vtbd0
Providers:
1. Name: vtbd0
Mediasize: 21474836480 (20G)
Sectorsize: 512
Stripesize: 131072
Stripeoffset: 0
Mode: r2w2e3
descr: (null)
ident: BHYVE-17E2-F5D1-6C23
rotationrate: unknown
fwsectors: 0
fwheads: 0
Geom name: vtbd1
Providers:
1. Name: vtbd1
Mediasize: 1073741824 (1.0G)
Sectorsize: 512
Stripesize: 131072
Stripeoffset: 0
Mode: r0w0e0
descr: (null)
ident: BHYVE-5A78-52C0-8FE1
rotationrate: unknown
fwsectors: 0
fwheads: 0
Yep, there it is.
On all three hosts, enable ZFS and create a pool named gluster. Also, enable lz4 compression.
sysrc zfs_enable="YES"
zpool create gluster vtbd1
zfs set compression=lz4 gluster
All three hosts should now have an identical ZFS pool:
# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
gluster  76.5K   832M    24K  /gluster
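You can also confirm that the compression setting took effect on each host:

# should report the compression property set to lz4 (source: local)
zfs get compression gluster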
GlusterFS suggests that you don't use the root of a volume as a brick. Our brick is therefore going to live under /gluster/replicated:
Create a subdirectory under your ZFS pool (or dataset) to act as the brick's root.
mkdir /gluster/replicated
IMPORTANT: If you created a dataset under the pool, you should still create a subdirectory for your GlusterFS brick, as in the example below.
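For example, if you prefer to keep the brick on its own dataset (an optional layout; the dataset name here is just illustrative), it might look like this:

# optional: a dedicated dataset for the brick, with the brick root one level below it
zfs create gluster/bricks
mkdir /gluster/bricks/replicated

If you go this route, remember to use the matching path (e.g. sun:/gluster/bricks/replicated) when creating the volume later.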
Install and Set Up GlusterFS
On every host, install the latest version of net/glusterfs using pkg[5].
pkg install -y glusterfs
On each host enable the GlusterFS service and start it.
sysrc glusterd_enable="YES"
service glusterd start
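Before moving on, it's worth checking that the daemon actually came up and noting the installed version:

# confirm the management daemon is running and print the GlusterFS version
service glusterd status
gluster --version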
Let's pause for a moment to review what we have done so far. There are three individual hosts, each with a ZFS volume and the GlusterFS service daemon running. GlusterFS is currently not doing anything and doesn't know about the other two hosts. We need to join these three hosts together into a cluster to create a trusted pool (not to be confused with a ZFS pool). This is done by probing the hosts. GlusterFS refers to its pool members as "peers".
Pay close attention to which host you run the commands on. It matters[6].
On Sun, run the initial probe
gluster peer probe earth
gluster peer probe moon
On Earth, probe Sun
gluster peer probe sun
The command should return peer probe: success each time it was run.
root@sun:/home/admin # gluster peer probe earth
peer probe: success
root@sun:/home/admin # gluster peer probe moon
peer probe: success
root@earth:/home/admin # gluster peer probe sun
peer probe: success
Verify that all peers have joined the cluster by running the peer status command on any host.
root@earth:/home/admin # gluster peer status
Number of Peers: 2
Hostname: sun.gluster.domain.tld
Uuid: cf1e33c2-90e0-4054-abbc-ea2201cef9a7
State: Peer in Cluster (Connected)
Other names:
sun
Hostname: moon
Uuid: f83673bb-1cd3-4fba-b60d-f3272ea8c7bc
State: Peer in Cluster (Connected)
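Another quick way to see all members of the trusted pool at a glance (including the host you run it on) is the pool list command:

# prints the UUID, hostname, and connection state of every peer
gluster pool list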
HINT: If for some reason you need to start over: on all hosts, stop the GlusterFS service, delete the contents of /var/db/glusterd, and reinstall the package.
Create Replicated Volume
Now that all three hosts have discovered each other to form a cluster, we can bootstrap a new volume. Select any one host and use the volume create command to create a new volume.
Create a replica type volume named replicated using 3 peers.
gluster volume create replicated replica 3 sun:/gluster/replicated earth:/gluster/replicated moon:/gluster/replicated
The successful result should look something like the following
volume create: replicated: success: please start the volume to access data
On the same host, start the volume.
gluster volume start replicated
A successful result should look something like:
volume start: replicated: success
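At this point you can also inspect the volume to confirm the type, replica count, and brick list:

# should show Type: Replicate, three bricks, and Status: Started
gluster volume info replicated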
HINT: This time, if you need to start over you will need to destroy and recreate your ZFS volume in addition to stopping the Gluster service, deleting the contents of /var/db/glusterd, and reinstalling the package on every host.
Client Access to Shared Volume
Clients access the replicated share using fusefs. You should never write to a GlusterFS brick directly; doing so will almost certainly create what's called a split-brain situation. Instead, mount replicated volumes using mount_glusterfs. It is included as part of the GlusterFS package and works like any other mount command. Client machines needing access to the volume only have to install the package and load the fusefs driver. No additional configuration beyond that is required.
On a client machine, enable and load the required kernel module (note that on FreeBSD 12.1 and later the driver is named fusefs(4); if loading fuse fails, use fusefs_load and kldload fusefs instead):
sysrc -f /boot/loader.conf fuse_load="YES"
kldload fuse
Install the net/glusterfs package:
pkg install -y glusterfs
Mount the GlusterFS replicated volume as /mnt/replicated:
mkdir /mnt/replicated
mount_glusterfs sun:replicated /mnt/replicated
The host "sun" can be replaced with the resolvable name of any one of the GlusterFS peer nodes.
Note: For automatic fail-over you need to pass additional options (backup-volfile-servers) to the mount program specifying which hosts to use as backup[8].
GlusterFS doesn't require that the volume be mounted on each of the peer nodes in order to function. However, there is no harm in doing so[7], and it's perfectly fine to set the target host on each peer node to itself.
For added convenience, add an entry to /etc/fstab so that the filesystem is mounted automatically at boot time.
On Earth
earth:replicated /mnt/replicated fusefs rw,_netdev,backup-volfile-servers=sun:moon,mountprog=/usr/local/sbin/mount_glusterfs,late 0 0
The other two hosts would have a similar entry.
Moon:
moon:replicated /mnt/replicated fusefs rw,_netdev,backup-volfile-servers=sun:earth,mountprog=/usr/local/sbin/mount_glusterfs,late 0 0
Sun:
sun:replicated /mnt/replicated fusefs rw,_netdev,backup-volfile-servers=earth:moon,mountprog=/usr/local/sbin/mount_glusterfs,late 0 0
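With the entry in place you can test it without rebooting (assuming the paths in your fstab line are correct):

# mount using the fstab entry, then verify it shows up
mount /mnt/replicated
df -h /mnt/replicated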
Testing with Some Data
Assuming you have mounted your replicated file system under /mnt/replicated, go ahead and pick any host (or a client machine) and create a new file.
echo "Hello World" > /mnt/replicated/hello.txt
Now pick any other host (or client) and read back the file's contents
# cat /mnt/replicated/hello.txt
Hello World
Next, try copying or creating a large file and rebooting the host your client is connected to. For example, if you mounted a client using:
mount_glusterfs -o backup-volfile-servers=earth:moon sun:replicated /mnt/replicated
You could use dd to start writing a large file, then reboot Sun while the write is in progress. The write process should continue as normal.
dd if=/dev/zero of=/mnt/replicated/test.img bs=1M count=100
The log file under /var/log/glusterfs/mnt-replicated.log should indicate an automatic fail over.
[2021-01-27 11:54:46.175448] I [glusterfsd-mgmt.c:2641:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: sun
[2021-01-27 11:54:46.175470] I [glusterfsd-mgmt.c:2681:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server earth
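Once Sun is back up, you can check from any other client mount (for example on Earth, if the volume is mounted there as shown above) that the file made it through intact:

# the file should exist with the full size that was written (100 MB in this example)
ls -lh /mnt/replicated/test.img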
Closing Remarks
GlusterFS on FreeBSD is not yet what I would call production quality. There are a few known bugs and performance problems that could ruin your day. The most notable is the fact that GlusterFS uses poll instead of kqueue/kevent. There is also an issue where GlusterFS is (sometimes) unable to correctly find the process ID of the self-heal daemon, causing volume heal commands to fail[9].
Other annoyances include the Linuxisms that remain in the codebase and testing framework. This means GlusterFS might make some Linux-specific system calls or look for certain files in unexpected locations.
My hope is that there will be more interest in running GlusterFS on FreeBSD so that these issues can get solved.
Any data you host with GlusterFS should be safe, but like any good sysadmin, you should be keeping backups.
[1] GlusterFS does not require virtualization to run on FreeBSD; virtualization is only used to simplify this guide.
[2] Bhyve does not use qemu. This dependency is only for providing the CLI utilities for fetching, extracting, and converting disk images.
[3] If you have a DHCP server on your LAN, the VMs may each have received an IP address.
[4] DNS is not required to use GlusterFS; IP addresses work fine. DNS does, however, make things much easier to work with.
[5] The package's post-install message tells you to enable /proc. This does not appear to be needed, but more testing is required to confirm.
[6] You must pick one host (as the first) to probe two other hosts. Then choose one of the two other hosts to probe the first host.
[7] This depends on your environment needs and security policies.
[8] GlusterFS loads the list of backup volume servers from the one host, but this does not appear to be working correctly.
[9] This will prevent you from rebuilding a failed host and upgrading to a new version of GlusterFS. Work is in progress to track this down and fix it.