Guide: Create your own Windows fileserver cluster on 45drives hardware with GlusterFS

Intro

This is a complete how-to guide, from first boot to first file copy, on how to build your very own Gluster file storage cluster on 45drives hardware (or any storage device capable of running Linux). It is meant to serve not only as a straight copy/pasteable recipe for creating the initial cluster and for adding nodes, but also to give insight into the hows and whys of system settings and planning.
If you don’t care to read the hows and whys, or if you already have a good understanding of Gluster and Linux, you can scroll down to the “Full Command List” section.

Our goal:

* To have a fully redundant, Windows-compatible file server cluster that is hardware agnostic, able to scale to any size, expand compute along with storage space, survive a full node failure as well as individual drive failures, and best of all be completely open source!

In this guide we will:

* Plan out our filesystem size and node/disk redundancy
* Install CentOS 7 on our target cluster devices
* Install any extra drivers required by 45drives (RocketRAID PCI cards)
* Enable remote web management/alerting of said raid cards
* Install and configure ZFS from source for on-the-fly dedupe/compression and in-memory caching
* Optimize networking for large packets
* Install a current version of GlusterFS (the 3.8 or 3.10 branch)
* Build our initial Gluster cluster
* Create a Gluster volume accessible across all Gluster nodes
* Tune the newly built cluster and volume for fast failover in the event of a complete node failure
* Enable load balancing and a floating IP address via CTDB
* Enable NFS and SMB (Windows shares) and have CTDB monitor their health across nodes
* Bind each node to Windows Active Directory and allow domain users and admins to read/write to the shared gluster volume

Following this guide:

* A section marked **PERFORM ON ALL** means you should perform the below listed actions on all nodes in the cluster
* A section marked **PERFORM ON ONE** means to perform the action on only one node in the cluster
* nano means you’ll be editing the file in question; feel free to use another text editor if you prefer
* You should have a basic understanding of linux, the command line, networking, filesharing, and active directory to be able to get the most from this guide

Hardware and Planning

Cluster hardware:

With hardware-agnostic open source Linux software, in theory you could run this on any two or more devices that can boot 64-bit Linux. However, I’ve found the following to hit a pretty good price point, and the rest of this guide is focused around the 45drives Storinator. Redundant power supplies, higher end RAID cards, and no screwdrivers required for installing/removing drives were a few reasons I ended up with them over other similar options. Whatever you go with, remember: the more nodes you have in your cluster, the faster and more tolerant of failure it’ll be.

Another important note: you’ll want storage nodes with a decent CPU and a good amount of RAM. ZFS will be doing on-the-fly dedupe and compression among other backend tasks, and you really don’t want a CPU bottleneck while writing large or multiple files. The more RAM, the more cache you’ll be providing ZFS for fast file access, so be sure to toss in a good amount for a cheap speed upgrade!

Gluster cluster redundancy/expansion planning:

Browsing the Gluster docs for explanations of the different configuration and redundancy options can quickly make your head spin. I’ll try to quickly explain what I believe is the best option for both node failure protection and storage capacity.
Gluster is built from bricks; each brick is backed by a ZFS pool (think RAID array), and a ZFS pool is made up of individual hard drives. I went with raidz2 (similar to RAID 6) for two-drive failure protection, as it seemed to provide the best trade-off between speed and usable capacity for my goal (cheap and deep storage). I also created only one brick per node for simplicity, which limits expansion slightly (I’ll explain below). The two main reasons for the one-brick-per-node design were:
1. Simplicity: you only have to worry about one zpool and one brick for each node. Sure, if you only have two or three nodes it’s pretty easy to keep track of 2, 4, or 6 zpools/bricks per node, but once you expand to 10+ nodes keeping it all straight becomes a far bigger headache for not much gain.
2. Easier expandability as more nodes are added. With multiple bricks per node, you could in theory expand one node at a time. While this looks great on paper, expanding each node introduces much more risk, as you have to ‘break’ one of the existing brick pairs before expanding into the new node. This was a risk I wasn’t willing to take with production data, especially given the low cost of the 45drives hardware and the ability to just buy two nodes at a time instead.
Some downsides to the one brick per node strategy that you should understand are:
1. You must expand in pairs of nodes. You can’t expand one single node at a time as you might be able to with other brick configurations.
2. The rebuild size of the zpool/brick is larger, and if you have a very large node with over 60 drives it can take quite some time to complete if a drive fails, thus increasing the risk of another drive failing during rebuild.
This is probably best explained with a visual. Let’s assume you start with 3 Gluster nodes and plan on adding a fourth some time in the future (and more after that). Also keep in mind that with one brick per node, buying 3 nodes to start doesn’t really make sense, but anyway…

One brick per node:

With one brick per node, your first two nodes are paired, aka the data is mirrored between them, allowing one node to go offline while the data remains available. The third node in this case does nothing; you can either use it to hold a third copy of the data (for extra protection) or for something else entirely (maybe backups).

Two bricks per node:

In this configuration, each node has two bricks (think RAID arrays). This still gives you the ‘mirrored’ or paired data between nodes, but in a ‘staggered’ layout, allowing a third node to be added without breaking the pairing rules.
 
One important thing you want to avoid is pairing two bricks that reside on the same node. If that node fails, you lose both halves of that replica pair at once, and the entire volume goes down with it.
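As a quick sketch of how the pairing actually works: when you create a replicated-distributed volume, gluster pairs bricks in the order you list them on the command line, so adjacent entries become replica pairs. With placeholder hostnames:

gluster volume create vol1 replica 2 node1:/brick1/vol node2:/brick1/vol node3:/brick1/vol node4:/brick1/vol

Here node1/node2 form one pair and node3/node4 form the other. If you instead listed two bricks from the same node next to each other, that node’s failure would take out both copies of that data; recent gluster versions will warn about (or refuse) such a layout unless you add ‘force’.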

Expanding one brick per node:

Expanding with one brick per node is a straightforward process: simply add the new two-node pair to the overall gluster volume and your capacity increases on the fly. No data needs to be rebuilt, and everything continues as it was, just with more space.
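As an example of what that looks like in practice (hostnames are placeholders, and this assumes the new pair has already been peer-probed into the pool):

gluster volume add-brick vol1 <server3hostname>:/brick1/vol <server4hostname>:/brick1/vol
gluster volume rebalance vol1 start

The rebalance step is optional, but it spreads existing files across the new pair instead of only landing new writes there.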

Expanding two bricks per node:

Expanding with two bricks per node in a staggered configuration is a bit more complex and requires removing a brick pair and then establishing two new pairs in its place. While technically possible, there is some risk of data loss and long rebuild times while the pairs are adjusted and rebuilt.
Either way, for this guide, we’ll assume you are also doing a one brick per node configuration… onward to networking!

Networking the nodes:

You’ll want two networks minimum, and ideally four NICs. Note that I labeled the links to each network as “bonded”; in this case bonded simply means able to fail over (either active/active or active/standby). In my case I used vmware’s built-in network failover capabilities, but similar options are available in CentOS itself. Regardless of the technology you use, you still want to provide two paths to each network. You’ll also want your cluster traffic on a different VLAN than your normal filesharing traffic; cluster traffic can be quite chatty and take up _a lot_ of resources, so keeping it isolated on the network is hugely important.
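If you’d rather handle the failover in CentOS itself instead of vmware, here’s a minimal active-backup bonding sketch (interface names and IPs are placeholders, adjust for your hardware):

nano /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100"
BOOTPROTO=none
IPADDR=172.16.2.1
PREFIX=24
ONBOOT=yes

Then for each physical NIC feeding the bond (repeat for the second NIC):

nano /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes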
 

Failover:

Having two runs to each network allows a single switch to fail while you still have a way to communicate, either with the other gluster nodes (cluster network) or out to clients. Even if it isn’t directly hardware related, you’ll eventually have to perform maintenance on networking equipment; better to allow for failover now than to schedule maintenance windows hastily later.
Network specific commands, and switch specific configurations, are currently outside the scope of this guide. If there is enough interest at some point in the future though, I may create a secondary guide purely based around gluster networking.

Configuration of Gluster

 Ok enough theory and planning, let’s put these 45drives boxes to work and get us a cluster!

**PERFORM ON ALL NODES**

1. Install centos 7 and patch to the latest version, also disable the firewall:
yum update -y && systemctl disable firewalld.service && systemctl stop firewalld.service
2. get some basic tools needed later:
yum install -y wget perl nano gcc kernel-devel zlib-devel libuuid-devel ntp
3. Create a temporary download folder and download the latest RocketRaid raid card drivers so that centos can see the 45drives hard drives:
cd /tmp && mkdir driver && cd driver && wget http://www.highpoint-tech.com/BIOS_Driver/R750/Linux/R750_Linux_Src_v1.2.7-16_08_23.tar.gz
4. extract the download, then run setup:
tar -xzvf R750_Linux_Src_v1.2.7-16_08_23.tar.gz && ./r750-linux-src-v1.2.7-16_08_23.bin
5. reboot the node to discover the new drives
6. Download the web management tool from RocketRaid:
cd /tmp && mkdir mgmt && cd mgmt && wget www.highpoint-tech.com/BIOS_Driver/HRM/Linux/WebGUI_Linux_2.1.7_14_07_30.tgz
And then extract and install it:
tar -xzvf WebGUI_Linux_2.1.7_14_07_30.tgz &&  chmod +x hptsvr-https-2.1.7-14.0730.x86_64.rpm &&  yum install hptsvr-https-2.1.7-14.0730.x86_64.rpm
It will then become available on http://<serverip>:7402 with username: RAID, password: hpt
7. Download zfs from source as the centos version is horribly out of date:
cd /tmp && mkdir zfs && cd zfs && wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.9/spl-0.6.5.9.tar.gz && wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.9/zfs-0.6.5.9.tar.gz
8. Extract, build, and install zfs and its required dependency:
tar -xzvf spl-0.6.5.9.tar.gz && cd spl-0.6.5.9 && ./configure && make && make install && cd /tmp/zfs && tar -xzvf zfs-0.6.5.9.tar.gz && cd zfs-0.6.5.9 && ./configure && make && make install && modprobe zfs
9. Create the zfs zpool (raid6 array) from the raw drives and call it ‘brick1’. If you want to add a spare drive, place ‘spare sdx’ at the tail end of this command, where sdx is the spare drive to use.
zpool create -f brick1 raidz2 sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq sdar sdas sdat spare sdau
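Before moving on, it’s worth a quick sanity check that the pool built the way you expect (one raidz2 vdev plus the spare):
zpool status brick1 && zpool list brick1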
10. Set up things like zfs caching and auto mounting:
systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target && systemctl enable zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target
11. Because of some odd zfs race condition bugs with centos, we have to use a script to have it load the zpool on boot:
Edit  /usr/lib/systemd/system/zfs-import-cache.service and change the line ExecStart=  to  ExecStart=/usr/local/libexec/zfs/startzfscache.sh
Create a new file with nano /usr/local/libexec/zfs/startzfscache.sh and add the following lines:
#!/bin/sh
sleep 10
/usr/local/sbin/zpool import -c /usr/local/etc/zfs/zpool.cache -aN
zfs mount -a
Then make the new script runnable: chmod +x /usr/local/libexec/zfs/startzfscache.sh
12. Create a new directory where we’ll store the actual volume information which we will need later on:
mkdir /brick1/vol
13. Set 45drives recommended zfs settings, also set things like on the fly compression:
zfs set atime=off brick1 && zfs set xattr=sa brick1 && zfs set exec=off brick1 && zfs set sync=disabled brick1 && zfs set compression=lz4 brick1 && zfs set redundant_metadata=most brick1
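You can confirm the settings took effect with a quick zfs get, for example:
zfs get compression,atime,sync,xattr brick1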
14. Enable jumbo packets on the backend network if not already previously done:
nano /etc/sysconfig/network-scripts/ifcfg-ethxxx  and add a line at the bottom: MTU=9014
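Once the interface has been bounced (or the node rebooted), a simple way to prove jumbo frames work end to end is a do-not-fragment ping between two cluster IPs, sized just under a 9000-byte payload (adjust the size to match whatever MTU you actually set):
ip link show | grep mtu
ping -M do -s 8972 172.16.2.2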
15. Edit /etc/hosts and add all of the gluster servers’ static backend (cluster network) IP addresses:
nano /etc/hosts
172.16.2.1 Gluster1
172.16.2.2 Gluster2
16. Install the gluster 3.10 release repo. You can choose the older 3.8 branch instead by replacing ‘310’ with ‘38’:
yum install -y centos-release-gluster310
17. Enable and install the rest of gluster from centos repos:
yum install -y glusterfs-server
18. Set gluster to start on boot, then start the gluster services:
systemctl enable glusterd && systemctl start glusterd

**PERFORM ON ONE NODE ONLY**

19. Establish the initial cluster using the hostnames we added to /etc/hosts (probe from each server so that both ends know each other by hostname):
Server1: gluster peer probe <server2’s hostname>
Server2: gluster peer probe <server1’s hostname>
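A quick check that the peers actually see each other before building the volume:
gluster peer status
gluster pool list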
20. Create the Gluster brick pair and volume:
gluster volume create vol1 replica 2 <server1hostname>:/brick1/vol <server2hostname>:/brick1/vol
21. Start the newly created volume, and set it to fail over after only 1 second (the default network.ping-timeout is 42 seconds):
gluster volume start vol1
gluster volume set vol1 network.ping-timeout 1
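You can verify the volume came up, and that the ping-timeout change stuck, with:
gluster volume info vol1
gluster volume status vol1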

**PERFORM ON ALL NODES**

22. Create a new folder to house the mounting of the new gluster volume:
mkdir /cluster && mkdir /cluster/vol1
23. Have the new gluster volume auto-mount on bootup, also mount it right now:
echo localhost:/vol1 /cluster/vol1 glusterfs defaults,_netdev 0 0 >> /etc/fstab && mount -a
24. Install CTDB for virtual ip addressing and create a place for it to hold its lockfile:
yum install ctdb && mkdir /cluster/ctdb_lock && mkdir /brick_lock

**PERFORM ON ONE NODE ONLY**

25. Create a gluster volume specifically for the ctdb lockfile (it seems to like its own volume for this):
gluster volume create ctdb_lock replica 2 <server1>:/brick_lock <server2>:/brick_lock force && gluster volume start ctdb_lock

**PERFORM ON ALL NODES**

26. Have the new lock volume auto-mount on boot etc:
echo localhost:/ctdb_lock /cluster/ctdb_lock glusterfs defaults,_netdev 0 0 >> /etc/fstab && mount -a
27. Tell CTDB where to place its lockfile, also where to look for its list of nodes:
nano /etc/sysconfig/ctdb and edit the following two lines to match:
CTDB_RECOVERY_LOCK=/cluster/ctdb_lock/.CTDB-lockfile
CTDB_NODES=/etc/ctdb/nodes
28. Add the CTDB nodes:
nano /etc/ctdb/nodes and add each gluster server’s cluster (backend) IP address, one per line
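Using the cluster IPs from the /etc/hosts example above, the nodes file would simply contain (the order must be identical on every node):
172.16.2.1
172.16.2.2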
29. Add at least one floating IP address that clients will use to reach the gluster cluster; make sure you include the subnet and the interface to use:
nano /etc/ctdb/public_addresses
add:
172.16.1.100/24 eth0
30. For some reason ctdb doesn’t play nice with SELinux, so set it to permissive:
nano /etc/sysconfig/selinux and change the SELINUX= line to ‘permissive’
31. Reboot for the selinux changes to take effect.
32. Have ctdb start on boot and start it now:
service ctdb start && chkconfig ctdb on
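Once ctdb has had a minute to settle, you can check cluster health and see which node currently holds the floating IP with:
ctdb status
ctdb ip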

**PERFORM ON ONE NODE ONLY IF NFS IS WANTED/USED**

33. Enable nfs on gluster:
gluster volume set vol1 nfs.disable off
34. Allow an IP range to reach gluster over nfs:
gluster volume set vol1 nfs.rpc-auth-allow <ip range>
35. Allow the nfs volumes to be reachable:
gluster volume set vol1 nfs.export-volumes on
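From a client, the volume can then be mounted over NFS. Gluster’s built-in NFS server speaks NFSv3, so force the version; the IP below is the CTDB floating address from earlier:
mount -t nfs -o vers=3 172.16.1.100:/vol1 /mnt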

**PERFORM ON ALL NODES**

36. Tune ctdb for faster failover
nano /etc/sysconfig/ctdb and add the following to the bottom of the file:
CTDB_SET_MonitorInterval=5
CTDB_SET_TakeoverTimeout=5
CTDB_SET_ElectionTimeout=2
CTDB_SET_KeepaliveLimit=3
CTDB_SET_KeepaliveInterval=1
CTDB_SET_ControlTimeout=15
37. Restart the ctdb service after the settings changes
service ctdb restart
At this point, if all you want is NFS, congratulations! You’re done! You can now connect to the gluster cluster via the public-facing IP address you added to CTDB. I highly recommend creating a DNS A record for this IP for ease of use.
If you also want Windows fileshare capabilities, read/continue on:

**** FOR SAMBA****

**PERFORM ON ALL NODES**

1. Install samba and all its required bits:

yum install realmd samba samba-common oddjob oddjob-mkhomedir sssd ntpdate ntp samba-winbind* authconfig-gtk*

2. Make sure we get proper time sync from a domain controller:

ntpdate <your primary domain controller>

3. Set time services to start and start on boot:

service ntpd start && chkconfig ntpd on

4. Join this node to the active directory domain:

realm join --user=<username@domain.com> <yourdomain>
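You can sanity check the join before moving on with:

realm list

The domain should show as configured. Note that resolving domain users (for example with id <someuser>@<yourdomain>) may not work until winbind is up and authconfig has run (steps 5 and 6 below).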

5. Make sure winbind (the service that ties this box to AD) starts on boot and then start it now:

chkconfig winbind on && service winbind start

6. I found that the above is not enough to get the nodes properly talking to AD, you also have to use authconfig to bind them as well:

authconfig-tui Then check/do the following on the prompts that appear:

use winbind, use shadow passwords, use winbind authentication, local authorization is sufficient (next)

ads, /bin/bash, domain: <your domain>, domain controllers: <your domain controller>, ADS realm: <your full domain name> (join domain)

(save)

(enter creds, ok)

(ok)

7. Edit krb5 for proper dns lookup

nano /etc/krb5.conf and add the following line after ‘default_realm’:

dns_lookup_kdc = true

8. Set winbind to tie to local centos groups and such:

nano /etc/nsswitch.conf and add winbind to the end of the following lines:

passwd

group

9. Set samba to start on boot, and start it now:

systemctl enable smb.service && systemctl enable nmb.service && systemctl restart smb.service && systemctl restart nmb.service

10. Reboot (no I don’t know why you have to reboot here, just do it)

11. Create our new windows share folder:

mkdir /cluster/vol1/Shares

12. Now the magic, configure samba for the new share and tie in all of our AD information:

nano /etc/samba/smb.conf  (add if missing):

 [global]
        encrypt passwords = yes
        winbind enum groups = yes
        load printers = no
        printcap name = /dev/null
        log file = /var/log/samba/%m.log
 #      log level = 4
        max log size = 50
        winbind nss info = sfu
        inherit acls = yes
        invalid users = root
        inherit permissions = yes
        map acl inherit = yes
        store dos attributes = yes
        vfs objects = acl_xattr
        winbind separator = +
 
[Shares]
        path = /cluster/vol1/Shares
        browsable = yes
        writable = yes
        guest ok = yes
        read only = no
        inherit acls = yes
        write list = @"domain admins@<yourdomain>"
        create mask = 775
        directory mask = 775

13. Restart winbind and samba:

service smb restart && service winbind restart

14. Use the following to verify our AD groups have been successfully connected (you’ll want to see at least one domain group appear here):

getent group

15. Set Domain Admins to read/write for the new windows share:

chgrp '<yourdomain>+Domain Admins' /cluster/vol1/Shares && chmod 775 /cluster/vol1/Shares

16. Allow windows file/folder permissions even though this is a linux box:

zfs set aclinherit=passthrough brick1 && zfs set xattr=sa brick1

17. Grant domain admins rights to change folder/file permissions on the windows share (this is explicitly needed, don’t skip it!):

net rpc rights grant "<yourdomain>+Domain Admins" SeDiskOperatorPrivilege -U "<yourdomain>+<yourusername>"

18. Let CTDB deal with starting/stopping samba and winbind:

nano /etc/sysconfig/ctdb and uncomment the following two lines:

CTDB_MANAGES_SAMBA=yes

CTDB_MANAGES_WINBIND=yes

19. Stop winbind and samba from starting at boot on their own, and stop them from running right now:

chkconfig winbind off && chkconfig smb off && service winbind stop && service smb stop

20. Restart the ctdb service so it can start up winbind and samba itself:

service ctdb restart

Finish:

That’s it! You can now reach the new windows share from any Windows client computer via \\<your ctdb public ip address>\Shares
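If you’d like to verify the share from one of the nodes (or any linux box with the samba client tools installed) before testing from Windows, something like this should list the shares and let you browse with a domain account (note the winbind separator ‘+’ we set in smb.conf):

smbclient -L //<your ctdb public ip address> -U "<yourdomain>+<yourusername>"
smbclient //<your ctdb public ip address>/Shares -U "<yourdomain>+<yourusername>" -c ls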

Remember, if you ever plan on doing node maintenance, stop the ctdb service on that node before beginning the maintenance process!


Comments

9 responses to “Guide: Create your own Windows fileserver cluster on 45drives hardware with GlusterFS”

  1. David O.

    How’s your small file performance with this setup? We built an almost identical setup, but made a few different design decisions (e.g. HW RAID 6 with three arrays per server and XFS instead of ZFS ).

    It performs well for the most part, but small file write performance is absolute dogsh*t. We’re getting 42 files/second write when the underlying bricks can do 20K+/sec. (Tested using bonnie++)

    Curious if yours is any better with the large bricks and ZFS instead of XFS.

    1. Michael Rickert

      I found that larger bricks improved performance enough that it was worth it over smaller bricks, but not enough to make it a ‘must do’.

      What’s your cpu usage on the backend storage like when you’re doing small file benchmarking?

  2. Ryan

    I’ve followed this guide and others but after successfully setting up gluster I’m getting stuck setting up CTDB. How does the public_addresses file IP address, subnet, and interface translate into a physical connection? That is, if I enter 192.168.1.3/24 eth1 in the public_addresses file does this create an interface on Linux that then requests the 192.168.1.3 address from my DHCP server? I can’t set a static IP without a MAC address. Or am I supposed to setup some type of virtual IP on my router (pfSense)? Thanks for any advice you can provide, I can’t find the answer anywhere online.

    1. Michael Rickert

      public_addresses are the VIPs (virtual IP addresses) you want the cluster to have. CTDB will then begin responding to ARP requests for that IP address, providing the switch/router with a valid mac address. This means that if another server tries to ping that 192.168.1.3 255.255.255.0 ip address, whichever clustered server that currently ‘holds’ that virtual ip will respond back and also provide its current real mac address.

      It’s all done outside of an additional load balancer (pfsense, F5, whatever), so think of it more like built-in failover protection of sorts provided by CTDB itself. Of course that means it’s not providing actual load balancing (aka each request goes to a different VIP/ip), so you’re welcome to toss all of the VIPs behind pfsense to get true load balancing across nodes. Also if a node goes down, that VIP will change to a new host. For example if you have three nodes, and three VIPs, you should get one VIP per node. This is helpful for load balancing because you can put all three VIPs behind your pfsense load balancer, and it will properly round robin requests. If one of the hosts were to then go down, you would have one of the remaining hosts answering for two VIPs, ensuring connections to the now dead host still get answered/responded to.

      Of course you’re welcome to not use the VIPs at all, and instead put the real host IPs into pfsense, but you may have issues with active sessions dying if a host goes down. As far as DHCP, you’ll want to exclude any public_addresses IP’s from the DHCP scope, so that the VIP is not given out to another computer on the network by mistake in some way. Hopefully this helps clear things up a bit, if not let me know!

  3. Guy Boisvert

    Great article!

    How would you configure GlusterFS to migrate Windows server VMs that have 2 x 2TB local LVs on the KVM Servers? We are migrating from DAS on our KVM servers to GlusterFS to host our VM Images / LVMs maps. I have a problem figuring out how to convert/migrate large LVs into the Gluster cluster.

    1. Michael Rickert

      Do you mean GlusterFS is having a hard time handling 2TB+ individual files? You want to be careful with extremely large single flat files since if gluster gets out of sync it will have to re-copy the entire file due to split brain issues. If you are running a larger node count it’s _usually_ not an issue but it’s still a risk and not ideal in terms of how gluster was designed to be used. The very best solution would be to stand up new windows fileservers in kvm (or p2v only the OS drive/partition), then have those servers point via iscsi to gluster for storage on a separate windows drive letter. Once you have windows backed by iscsi instead of large kvm disks, you can do a lot more with windows’ built-in failover features, MPIO, and you also have a MUCH better overall footprint for glusterfs to handle file replication/sync across nodes since it would only be syncing individual windows files.

      1. Guy Boisvert

        I kinda sorted this out using sharding! I think that sync would be much faster with shards; I’m used to using rsync with the benefit of split files! So I created a large (2TB) sparse file (qemu-img -o preallocation=metadata) on the Gluster volume. What do you think about this?

        Now the second part of my reflection is regarding JBOD vs RAID on the servers. Gluster supports Erasure Coding but I don’t really know how it would tax the servers (storage and KVM server). If I compare it to Linux software RAID, it seems to be not too heavy on resources. On the other hand, I know that the Mellanox Gen6 (I think) Infiniband/Ethernet cards support EC acceleration; the problem is confirming whether the driver/API is supported on my CentOS 7 test bench!

  4. Thomas Particke

    Hello and thank you so much for all the effort you put into the documentation.
    I’ve been working on a solution for some time, but unfortunately I’ve always had the problem with the permissions on the lockfile.
    According to your documentation it finally worked.
    Contrary to many others, you don’t have a ‘cluster’ entry in your SMB configuration, which honestly surprised me.
    And you have created an extra volume for the lockfile.
    I think that was the solution for me.
    Thanks again. Good work.

  5. whovahkiin

    browseable = yes

    not
    browsable = yes

    You’re missing an e.

    Also this doesn’t work for CentOS 8 as it no longer uses authconfig-gtk.
