ZFS on Linux
READ FIRST
Some considerations when working with ZFS
- ZFS pools are built from vdevs (groups of disks), not directly from individual physical disks.
- Be careful about how you add new disks to the array. Do not add and remove disks at random (the exceptions are upgrading disks or replacing a failed disk).
- ZFS is very powerful; be mindful of what you are going to do and plan it out!
- After a raidz vdev is created, it cannot be removed from the pool, and (on the ZFS releases these notes were written against) disks cannot be added to it.
Example:
NAME          STATE     READ WRITE CKSUM
pool4tb       ONLINE       0     0     0
  raidz1-0    ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
raidz1-0 is a vdev. To add more disks (other than hot spares) you must create a second vdev. In this case the vdev holds a pair of drives, so it is best to add a second vdev built from another pair of drives.
NAME          STATE     READ WRITE CKSUM
pool4tb       ONLINE       0     0     0
  raidz1-0    ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
  raidz1-1    ONLINE       0     0     0
    sde       ONLINE       0     0     0
    sdf       ONLINE       0     0     0
Now data will be striped across both vdevs.
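As a rough sketch of how the second vdev could be added: the helper below only prints the command so it can be reviewed first. The function name add_vdev_cmd is hypothetical, and the disk names are taken from the status output above. The -n flag asks zpool for a dry run that shows the resulting layout without changing the pool.

```shell
# Sketch: print (not run) the command that adds a second vdev to a pool.
# add_vdev_cmd is a hypothetical helper, not part of ZFS.
add_vdev_cmd() {
    pool=$1
    vdev_type=$2
    shift 2
    # -n makes zpool show the would-be configuration without modifying the pool.
    echo "zpool add -n $pool $vdev_type $*"
}

# Prints: zpool add -n pool4tb raidz sde sdf
add_vdev_cmd pool4tb raidz sde sdf
```

Once the dry-run layout looks right, the same command without -n performs the change. Because the new vdev stays in the pool for good, the review step is worth the extra minute.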
ZFS on Linux Installation
CentOS 7
It has been reported that when zfs and its dependencies are installed at the same time, the kernel modules do not get built. Below are the steps I found to work when installing ZFS.
yum -y install epel-release
Make sure the system is completely up to date.
yum -y update
reboot -h
After reboot
yum -y install kernel-devel
yum -y localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el7.noarch.rpm
yum -y install spl
If everything was done right, the following command will take a while (depending on hardware)
yum -y install zfs-dkms
yum -y install zfs
/sbin/modprobe zfs
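To confirm the module actually loaded after the steps above, a small check along these lines may help. It is a sketch: module_loaded is a hypothetical helper, and it assumes the usual /proc/modules format in which each line starts with the module name followed by a space.

```shell
# Sketch: report whether a kernel module appears in a /proc/modules-style file.
# module_loaded is a hypothetical helper; the optional second argument lets the
# check be tried against a sample file instead of the live system.
module_loaded() {
    name=$1
    modules_file=${2:-/proc/modules}
    # Each line of /proc/modules begins with the module name and a space.
    grep -q "^$name " "$modules_file" 2>/dev/null
}

if module_loaded zfs; then
    echo "zfs module is loaded"
else
    echo "zfs module is not loaded; try: /sbin/modprobe zfs"
fi
```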
Fedora 28
[1] The instructions on the zfsonlinux.org site are correct, except for enabling the repo before installing. Even issuing "dnf --set-enable zfs.repo" resulted in failure. I had to edit the repo file directly (/etc/yum.repos.d/zfs.repo) to enable it. Not a big deal, but good to know.
Create ZFS Pool
At this point you can create your pool. Most of the time we will be interested in a RAID-Z configuration. Depending on how much parity you want, use raidz (an alias for raidz1), raidz2, or raidz3.
zpool create <name of pool> raidz <disk1> <disk2> <etc>
NOTE: By default this will create a mount point of "/<name of pool>"
To add a spare drive
zpool add <name of pool> spare <disk>
Make sure to enable automatic rebuild when a drive fails, especially when using hot spares.
zpool set autoreplace=on <name of pool>
I ran into the following, which helps with managing the disks. Creating a label for each disk would have saved me time in the past.[2] Note that glabel is a FreeBSD tool, so this example applies to FreeBSD/FreeNAS rather than Linux.
# glabel label rex1 ada0
# glabel label rex2 ada1
# glabel label rex3 ada2
# zpool create rex raidz1 label/rex1 label/rex2 label/rex3
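On Linux, the persistent names under /dev/disk/by-id can play a similar role to the labels above. The sketch below lists them next to the kernel device each one points at. list_disk_ids is a hypothetical helper, and the directory argument exists only so the function can be tried against a test directory.

```shell
# Sketch: map persistent /dev/disk/by-id names to the kernel device they point at.
# list_disk_ids is a hypothetical helper, not part of ZFS or udev.
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(basename "$(readlink "$link")")"
    done
}

list_disk_ids
```

The names it prints can then be passed to zpool create in place of sdb, sdc, and so on, so the pool does not depend on the order in which the kernel enumerates the disks.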
Create ZFS Volumes
zfs create <name of pool>/<Volume Name>
zfs set mountpoint=<mount point> <name of pool>/<Volume Name>
Example:
zfs create pool4tb/archive
mkdir /archive
zfs set mountpoint=/archive pool4tb/archive
Additional Options
To enable compression
zfs set compression=lz4 <name of pool>
To increase the number of copies of a file on a dataset
zfs set copies=<1|2|3> <name of pool>/<Volume Name>
To have the pool auto-expand
zpool set autoexpand=on <name of pool>
- Encryption
http://www.makethenmakeinstall.com/2014/10/zfs-on-linux-with-luks-encrypted-disks/
EXAMPLE
1x2TB HDD
  sdb
4x1TB HDDs
  sdc sdd sde sdf

Using the above drives it is possible to create a variety of deployments. In this example we will create a RAID5-like configuration that spans three devices: the 2TB disk plus two volumes, each carved from a pool built on a pair of 1TB disks. We start by creating the pools and adding the drives.

[root@nas ~]# zpool create -f set1 raidz /dev/sdc /dev/sdd
[root@nas ~]# zpool create -f set2 raidz /dev/sde /dev/sdf
[root@nas ~]# zpool status
  pool: set1
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        set1          ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            sdc       ONLINE       0     0     0
            sdd       ONLINE       0     0     0

errors: No known data errors

  pool: set2
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        set2          ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0

errors: No known data errors

[root@nas ~]# zfs create -V 1.50T set1/vdev1
[root@nas ~]# zfs create -V 1.50T set2/vdev1
[root@nas ~]# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
set1        1.55T   214G  57.5K  /set1
set1/vdev1  1.55T  1.76T    36K  -
set2        1.55T   214G  57.5K  /set2
set2/vdev2  1.55T  1.76T    36K  -

[root@nas ~]# ls /dev/
<condensed output>
zd0 zd16

[root@nas ~]# zpool create -f data raidz1 /dev/sdb /dev/zd0 /dev/zd16
[root@nas ~]# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data  4.47T   896K  4.47T         -     0%     0%  1.00x  ONLINE  -
set1  1.81T   742K  1.81T         -     0%     0%  1.00x  ONLINE  -
set2  1.81T   429K  1.81T         -     0%     0%  1.00x  ONLINE  -

[root@nas ~]# df -lh
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        33G  1.6G   32G   5% /
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.8G     0  3.8G   0% /dev/shm
tmpfs           3.8G  8.5M  3.8G   1% /run
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/sda1       497M  200M  298M  41% /boot
tmpfs           775M     0  775M   0% /run/user/0
set1            214G  128K  214G   1% /set1
set2            214G  128K  214G   1% /set2
data            2.9T  256K  2.9T   1% /data

As you can see, a LOT of space is wasted using this method. Where we should have ~4TB of usable space, we end up with ~3TB. This was only an example; the better option is to create multiple independent data sets.
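The ~3TB figure can be roughly reproduced with the usual back-of-the-envelope rule for raidz: usable space is about (number of devices minus parity devices) times the smallest member. The sketch below applies it to the data pool from the example. raidz_usable is a hypothetical helper, and the estimate ignores metadata and padding overhead.

```shell
# Sketch: rough raidz capacity estimate, ignoring metadata and padding overhead.
# raidz_usable is a hypothetical helper; sizes are whole numbers in any one unit.
raidz_usable() {
    devices=$1
    parity=$2
    smallest=$3
    echo $(( (devices - parity) * smallest ))
}

# The "data" pool: sdb plus two 1.50T zvols, single parity.
# The smallest member is 1.50T, which is 1536 GiB.
raidz_usable 3 1 1536    # prints 3072 (GiB), close to the 2.9T that df reports
```

Because every member is treated as if it were the size of the smallest one, the extra capacity on sdb goes unused, on top of the parity already spent inside set1 and set2.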
ZFS Send
Example of using ZFS send to replicate snapshots from a local pool to a local external drive.
nohup zfs send -R tank/datastore@auto-20180629.0000-2w | zfs recv -F backuppool/backup &
Incremental [3]
zfs send -R -i tank/datastore@auto-20180630.0000-2w tank/datastore@auto-20180701.0000-2w | zfs recv -F backuppool/backup
ssh[4]
nohup zfs send tank/datastore@auto-20180629.0000-2w | ssh root@somehost 'zfs receive backuppool/datastore@auto-20180629.0000-2w'
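When incremental sends are run regularly, only the two snapshot names change between runs. The sketch below builds the incremental pipeline from those names and prints it for review. incremental_send_cmd is a hypothetical helper, and the dataset and snapshot names are the ones used in the examples above.

```shell
# Sketch: print the incremental send/receive pipeline for two snapshots.
# incremental_send_cmd is a hypothetical helper; it prints the command only.
incremental_send_cmd() {
    dataset=$1
    from_snap=$2
    to_snap=$3
    target=$4
    echo "zfs send -R -i ${dataset}@${from_snap} ${dataset}@${to_snap} | zfs recv -F ${target}"
}

incremental_send_cmd tank/datastore auto-20180630.0000-2w auto-20180701.0000-2w backuppool/backup
```

The printed line matches the incremental example above. The older snapshot must already exist on the receiving side for the incremental stream to apply.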
Troubleshooting
Auto import pool at boot
[5] There is a cache file that ZFS uses to import pools at boot. If your pools are not being imported at boot, make sure the following service is enabled.
[root@nas ~]# systemctl enable zfs-import-cache.service
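Enabling the service only helps if the cache file it reads is present. A quick sanity check might look like the sketch below, which assumes the default ZFS on Linux location /etc/zfs/zpool.cache. cache_file_ok is a hypothetical helper.

```shell
# Sketch: check that the pool cache file exists and is not empty.
# cache_file_ok is a hypothetical helper; the default path is an assumption
# based on the usual ZFS on Linux location.
cache_file_ok() {
    cache=${1:-/etc/zfs/zpool.cache}
    # -s is true only when the file exists and has a size greater than zero.
    [ -s "$cache" ]
}

if cache_file_ok; then
    echo "cache file present"
else
    echo "cache file missing or empty; consider: zpool set cachefile=/etc/zfs/zpool.cache <name of pool>"
fi
```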
Kernel Module Failure After Upgrade
I ran the standard yum upgrade process on my home CentOS 7 server. After a reboot ZFS failed stating the module was not loaded and I should load it. However, modprobe would fail.
[root@nas ~]# modprobe zfs
modprobe: ERROR: could not insert 'zfs': Invalid argument
Checking dmesg
[root@nas ~]# grep zfs /var/log/dmesg*
/var/log/dmesg.old:[    3.445947] zfs: disagrees about version of symbol vn_getattr
/var/log/dmesg.old:[    3.445950] zfs: Unknown symbol vn_getattr (err -22)
/var/log/dmesg.old:[    5.103167] zfs: disagrees about version of symbol vn_getattr
/var/log/dmesg.old:[    5.103172] zfs: Unknown symbol vn_getattr (err -22)
/var/log/dmesg.old:[    5.154686] zfs: disagrees about version of symbol vn_getattr
/var/log/dmesg.old:[    5.154691] zfs: Unknown symbol vn_getattr (err -22)
/var/log/dmesg.old:[    5.273800] zfs: disagrees about version of symbol vn_getattr
/var/log/dmesg.old:[    5.273804] zfs: Unknown symbol vn_getattr (err -22)
/var/log/dmesg.old:[    5.377193] zfs: disagrees about version of symbol vn_getattr
/var/log/dmesg.old:[    5.377200] zfs: Unknown symbol vn_getattr (err -22)
/var/log/dmesg.old:[   92.649735] zfs: disagrees about version of symbol vn_getattr
/var/log/dmesg.old:[   92.649739] zfs: Unknown symbol vn_getattr (err -22)
I found a post about this[6], and it mentioned to check the dkms status. Below is what I found.
[root@nas ~]# dkms status
spl, 0.7.12, 3.10.0-862.14.4.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)
zfs, 0.7.12, 3.10.0-862.14.4.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)

[root@nas ~]# rpm -qa | grep kernel
kernel-3.10.0-862.11.6.el7.x86_64
kernel-tools-3.10.0-862.14.4.el7.x86_64
kernel-3.10.0-693.5.2.el7.x86_64
kernel-tools-libs-3.10.0-862.14.4.el7.x86_64
kernel-3.10.0-862.14.4.el7.x86_64
kernel-3.10.0-862.9.1.el7.x86_64
kernel-headers-3.10.0-862.14.4.el7.x86_64
kernel-3.10.0-862.6.3.el7.x86_64
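To pick the affected modules out of longer dkms output, a filter along these lines could be used. It is only a sketch: dkms_mismatches is a hypothetical helper, and it assumes the comma-separated "module, version, kernel" layout shown in the dkms status output above.

```shell
# Sketch: list module, version and kernel for dkms entries flagged as mismatched.
# dkms_mismatches is a hypothetical helper that reads dkms status output on stdin.
dkms_mismatches() {
    awk -F', ' 'NF >= 3 && /Diff between built and installed module/ { print $1, $2, $3 }'
}

# Sample line copied from the output above; on a live system use:
#   dkms status | dkms_mismatches
printf '%s\n' 'zfs, 0.7.12, 3.10.0-862.14.4.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!)' | dkms_mismatches
```

Comparing the kernel version it prints with the output of uname -r shows whether the modules were built for the kernel that is actually running. Rebuilding the flagged modules with dkms for that kernel is one possible next step, though that is an assumption on my part and not something the steps recorded here confirm.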
Set Hot Spare as replacement device
I had an issue where I created a raidz2 pool without spares (which is fine for this deployment). A drive failed, and I installed a replacement as a spare using the FreeNAS GUI (this one was not ZFSoL). I was then stuck with a perpetually degraded pool.

  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0 days 00:55:01 with 0 errors on Sat Jun 30 12:16:59 2018
config:

NAME                                              STATE     READ WRITE CKSUM
tank                                              DEGRADED     0     0     0
  raidz2-0                                        DEGRADED     0     0     0
    gptid/ca363e73-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/cca8828b-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d1b86990-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d51049fe-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d804819b-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/625287ff-7c6b-11e8-a699-002590fde644    ONLINE       0     0     0
    gptid/dda24b58-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/e11d1f00-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/e39e8936-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    spare-9                                       DEGRADED     0     0     0
      17637264324123775223                        OFFLINE      0     0     0  was /dev/gptid/e55c7104-5d4d-11e8-aaf6-002590fde644
      gptid/051c5d74-612e-11e8-8357-002590fde644  ONLINE       0     0     0
    gptid/e837b4dd-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
logs
  gptid/bdafc060-6ccc-11e8-8357-002590fde644      ONLINE       0     0     0
spares
  227308045836062793                              INUSE     was /dev/gptid/051c5d74-612e-11e8-8357-002590fde644

errors: No known data errors

Had I read the manual[7], I would have known to detach the failed drive that I had previously taken offline.

root@freenas:~ # zpool detach tank 17637264324123775223
root@freenas:~ # zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:55:01 with 0 errors on Sat Jun 30 12:16:59 2018
config:

NAME                                              STATE     READ WRITE CKSUM
tank                                              ONLINE       0     0     0
  raidz2-0                                        ONLINE       0     0     0
    gptid/ca363e73-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/cca8828b-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d1b86990-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d51049fe-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d804819b-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/625287ff-7c6b-11e8-a699-002590fde644    ONLINE       0     0     0
    gptid/dda24b58-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/e11d1f00-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/e39e8936-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/051c5d74-612e-11e8-8357-002590fde644    ONLINE       0     0     0
    gptid/e837b4dd-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
logs
  gptid/bdafc060-6ccc-11e8-8357-002590fde644      ONLINE       0     0     0

errors: No known data errors
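The numeric GUID passed to zpool detach is easy to mistype. A small filter like the sketch below can pull it out of the status output instead. offline_guids is a hypothetical helper, and it assumes the failed device is listed by GUID with state OFFLINE, as in the output above.

```shell
# Sketch: print the GUIDs of devices that zpool status reports as OFFLINE.
# offline_guids is a hypothetical helper that reads zpool status output on stdin.
offline_guids() {
    # Keep rows whose first column is purely numeric and whose state is OFFLINE.
    awk '$2 == "OFFLINE" && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Sample row copied from the degraded pool above; on a live system use:
#   zpool status tank | offline_guids
printf '%s\n' '17637264324123775223 OFFLINE 0 0 0 was /dev/gptid/e55c7104-5d4d-11e8-aaf6-002590fde644' | offline_guids
```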
ZFS Not Mounting After Reboot
For some reason my system stopped mounting my ZFS volumes at boot. For a year I mounted them manually as needed (reboots were rare), but I have now found the fix.[8]
systemctl enable zfs-import.target
FreeNAS Specific
Replace drive using CLI
Recently I installed a FreeNAS server as part of a consulting gig, but the refurbished drives that came with the server started to fail and needed replacement. I had to do this remotely without the GUI due to limited VPN connectivity, which posed an issue with gaining the gptid of the replacement drive. Up until now I have relied on the GUI to provision the gptid and import the disk. My previous examples also show that I normally use the entire disk instead of using partitions on the disk.
The following is what I did to obtain a gptid for the drive.[9]
- First I obtained the drive information. Had a local tech provide me the SN.
- Ran a script I wrote to pull SN from drives listed in /dev to obtain the correct device (/dev/da13)
- At this point I created the GPT partition on the disk using the steps from the reference above.
gpart create -s gpt da13
gpart add -t freebsd-ufs da13
- Then I checked to see if the disk showed up with a label.
glabel list | grep da13
- At which point I could find the label in the full list.
glabel list
- Then started the replacement of the failed disk that I previously took offline.
zpool replace tank 17805351018045823548 gptid/625287ff-7c6b-11e8-a699-002590fde644

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jun 30 09:43:44 2018
        1.58T scanned at 647M/s, 840G issued at 335M/s, 1.58T total
        72.7G resilvered, 51.81% done, 0 days 00:39:48 to go
config:

NAME                                              STATE     READ WRITE CKSUM
tank                                              DEGRADED     0     0     0
  raidz2-0                                        DEGRADED     0     0     0
    gptid/ca363e73-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/cca8828b-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d1b86990-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d51049fe-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/d804819b-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    replacing-5                                   OFFLINE      0     0     0
      17805351018045823548                        OFFLINE      0     0     0  was /dev/gptid/db20f312-5d4d-11e8-aaf6-002590fde644
      gptid/625287ff-7c6b-11e8-a699-002590fde644  ONLINE       0     0     0  (resilvering)
    gptid/dda24b58-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/e11d1f00-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
    gptid/e39e8936-5d4d-11e8-aaf6-002590fde644    ONLINE       0     0     0
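The serial-number lookup mentioned in the steps above could be sketched roughly as follows. This is not the script I actually used: serial_of and find_disk_by_serial are hypothetical helpers, and the sketch assumes smartctl is installed and that the disks show up as /dev/daN, as they do on this FreeNAS box.

```shell
# Sketch: find the device whose SMART serial number matches the one supplied.
# serial_of and find_disk_by_serial are hypothetical helpers.

# Extract the serial number from "smartctl -i" output read on stdin.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Walk the da* devices and print every one whose serial matches.
find_disk_by_serial() {
    wanted=$1
    for dev in /dev/da[0-9]*; do
        [ -e "$dev" ] || continue
        if [ "$(smartctl -i "$dev" | serial_of)" = "$wanted" ]; then
            echo "$dev"
        fi
    done
}

# Usage on the server: find_disk_by_serial <serial number from the local tech>
```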
SLOG Disk performance
ADATA SU800 128GB
=== START OF INFORMATION SECTION ===
Device Model:     ADATA SU800
Serial Number:    ---
LU WWN Device Id: 5 707c18 300038465
Firmware Version: Q0922FS
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 17 07:49:26 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@freenas[~]# diskinfo -wS /dev/da21
/dev/da21
        512             # sectorsize
        128035676160    # mediasize in bytes (119G)
        250069680       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        15566           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        ATA ADATA SU800 # Disk descr.
        2J2720018797    # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM
        Not_Zoned       # Zone Mode

Synchronous random writes:
         0.5 kbytes:    781.3 usec/IO =      0.6 Mbytes/s
           1 kbytes:    784.3 usec/IO =      1.2 Mbytes/s
           2 kbytes:    800.7 usec/IO =      2.4 Mbytes/s
           4 kbytes:    805.7 usec/IO =      4.8 Mbytes/s
           8 kbytes:    795.7 usec/IO =      9.8 Mbytes/s
          16 kbytes:    806.0 usec/IO =     19.4 Mbytes/s
          32 kbytes:    787.7 usec/IO =     39.7 Mbytes/s
          64 kbytes:    944.2 usec/IO =     66.2 Mbytes/s
         128 kbytes:   1353.6 usec/IO =     92.3 Mbytes/s
         256 kbytes:   2001.1 usec/IO =    124.9 Mbytes/s
         512 kbytes:   3185.4 usec/IO =    157.0 Mbytes/s
        1024 kbytes:   5407.7 usec/IO =    184.9 Mbytes/s
        2048 kbytes:   7622.4 usec/IO =    262.4 Mbytes/s
        4096 kbytes:  12125.0 usec/IO =    329.9 Mbytes/s
        8192 kbytes:  21478.9 usec/IO =    372.5 Mbytes/s
Innodisk 3MG2-P (FreeNAS L2ARC)
This is the official FreeNAS L2ARC SSD sold on Amazon by ixSystems. Please note that this was not intended to be a SLOG disk, but I had it on hand so why not test it?
=== START OF INFORMATION SECTION ===
Model Family:     Innodisk 3IE2/3ME2/3MG2/3SE2 SSDs
Device Model:     2.5" SATA SSD 3MG2-P
Serial Number:    ---
Firmware Version: M150821
User Capacity:    124,034,899,968 bytes [124 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 17 08:09:09 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@freenas[~]# diskinfo -wS /dev/da21
/dev/da21
        512             # sectorsize
        124034899968    # mediasize in bytes (116G)
        242255664       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        15079           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        ATA 2.5" SATA SSD 3M    # Disk descr.
        20170503AA8931853024    # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM
        Not_Zoned       # Zone Mode

Synchronous random writes:
         0.5 kbytes:   1449.3 usec/IO =      0.3 Mbytes/s
           1 kbytes:   1458.5 usec/IO =      0.7 Mbytes/s
           2 kbytes:   1477.6 usec/IO =      1.3 Mbytes/s
           4 kbytes:   1492.7 usec/IO =      2.6 Mbytes/s
           8 kbytes:   1471.4 usec/IO =      5.3 Mbytes/s
          16 kbytes:   1503.7 usec/IO =     10.4 Mbytes/s
          32 kbytes:   1554.2 usec/IO =     20.1 Mbytes/s
          64 kbytes:   1711.3 usec/IO =     36.5 Mbytes/s
         128 kbytes:   2101.6 usec/IO =     59.5 Mbytes/s
         256 kbytes:   2535.3 usec/IO =     98.6 Mbytes/s
         512 kbytes:   3598.5 usec/IO =    138.9 Mbytes/s
        1024 kbytes:   5856.2 usec/IO =    170.8 Mbytes/s
        2048 kbytes:   8262.6 usec/IO =    242.1 Mbytes/s
        4096 kbytes:  13505.4 usec/IO =    296.2 Mbytes/s
        8192 kbytes:  23919.1 usec/IO =    334.5 Mbytes/s
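The two columns in the diskinfo tables are related by simple arithmetic: throughput is the write size divided by the time per IO. The sketch below recomputes the Mbytes/s figure from the other two. mbytes_per_sec is a hypothetical helper, and it assumes 1 Mbyte means 1024 kbytes, which is what reproduces the reported numbers.

```shell
# Sketch: recompute diskinfo's Mbytes/s column from write size and latency.
# mbytes_per_sec is a hypothetical helper: kbytes per write, then usec per IO.
mbytes_per_sec() {
    awk -v kb="$1" -v us="$2" 'BEGIN { printf "%.1f\n", (kb / 1024) / (us / 1000000) }'
}

mbytes_per_sec 4 805.7     # ADATA SU800 at 4 kbytes; prints 4.8
mbytes_per_sec 4 1492.7    # Innodisk 3MG2-P at 4 kbytes; prints 2.6
```

At small write sizes the latency barely changes with size, so throughput scales almost linearly. That per-IO latency is the number that matters most for a SLOG, and on it the ADATA is roughly twice as fast as the Innodisk in these runs.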
[1] https://github.com/zfsonlinux/zfs/wiki/Fedora
[2] https://forums.freebsd.org/threads/how-to-recover-degraded-zpool.28084/
[3] https://docs.oracle.com/cd/E19253-01/819-5461/gbchx/index.html
[4] https://128bit.io/2010/07/23/fun-with-zfs-send-and-receive/
[5] http://serverfault.com/questions/732184/zfs-datasets-dissappear-on-reboot
[6] https://github.com/zfsonlinux/zfs/issues/1155
[7] https://docs.oracle.com/cd/E19253-01/819-5461/gcvdi/index.html
[8] https://serverfault.com/questions/914173/zfs-datasets-no-longer-automatically-mount-on-reboot-after-system-upgrade
[9] https://mikebeach.org/2014/03/01/how-to-format-a-disk-gpt-in-freenas/