On 29/11/2018 13:44, Olaf Meeuwissen wrote:
> Q: Isn't there some filesystem type that supports settings at a more
> granular level than the device? Like directory or per file?
> A: Eh ... Don't know. Haven't checked ...
> Solution: Go fish!
>
> # I haven't gone fishing yet but a vague recollection of Roger's post
> # where he mentioned ZFS seemed promising ...
You could set them at the directory level if you wanted with ZFS, by
creating a filesystem per directory. But that might get a bit
unmanageable, so doing it in a coarser-grained way is usually
sufficient. Let me show some examples.
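To give a flavour up front, creating a filesystem for a single directory
is just one command (pool and dataset names here are invented):

% zfs create -o mountpoint=/srv/www tank/www

It looks like an ordinary directory to everything above it, but it has
its own properties, snapshots and mount options.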
Firstly, this is an example of a FreeBSD 11.2 ZFS NAS with a few
terabytes of HDD storage. There's a dedicated system pool, plus this
data pool:
% zpool status red
  pool: red
 state: ONLINE
  scan: scrub repaired 0 in 4h12m with 0 errors on Wed Nov 28 19:51:31 2018
config:

	NAME                         STATE     READ WRITE CKSUM
	red                          ONLINE       0     0     0
	  mirror-0                   ONLINE       0     0     0
	    gpt/zfs-b1-WCC4M0EYCTAZ  ONLINE       0     0     0
	    gpt/zfs-b2-WCC4M5PV83PY  ONLINE       0     0     0
	  mirror-1                   ONLINE       0     0     0
	    gpt/zfs-b3-WCC4N7FLJD34  ONLINE       0     0     0
	    gpt/zfs-b4-WCC4N4UDKN8F  ONLINE       0     0     0
	logs
	  mirror-2                   ONLINE       0     0     0
	    gpt/zfs-a1-B492480446    ONLINE       0     0     0
	    gpt/zfs-a2-B492480406    ONLINE       0     0     0

errors: No known data errors
This is arranged as two mirror vdevs, referenced by GPT labels. Each
mirror vdev is effectively RAID1, and writes are then striped across
them. It's not /actually/ striping; it's cleverer than that, biasing
writes to balance throughput and free space across all the vdevs, but
the effect is similar. The last vdev is a "log" device made of a pair
of mirrored SSDs. This "ZIL" is basically a fast write cache. I could
also add an SSD as an L2ARC read cache, but it's not necessary for this
system's workload. No errors have occurred on any of the devices in
the pool.
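If the workload ever did call for it, attaching an L2ARC is a single
command (the GPT label here is invented):

% zpool add red cache gpt/zfs-ssd-l2arc

It can be removed again later with "zpool remove" if it turns out not
to help.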
All filesystems allocate storage from this pool. Here are a few of them:
% zfs list -r red | head -n6
NAME            USED  AVAIL  REFER  MOUNTPOINT
red            2.57T  1.82T   104K  /red
red/bhyve       872K  1.82T    88K  /red/bhyve
red/data        718G  1.82T   712G  /export/data
red/distfiles  11.9G  1.82T  11.8G  /red/distfiles
red/home        285G  1.82T    96K  /export/home
There are actually 74 filesystems in this pool! Because the storage is
shared, there's no limit on how many you can have, so unlike
traditional partitioning, or even LVM (without a thin pool), it's
massively more flexible. You can create, snapshot and destroy datasets
on a whim, and even delegate administration of parts of the tree to
different users and groups, so users could create datasets under their
home directories, snapshot them, and send them to and from other
systems. You can organise your data into whatever filesystem structure
makes sense.
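As a rough sketch of that delegation (the user, remote host and
destination dataset are invented; red/home is real):

% zfs allow -u alice create,mount,snapshot,send,destroy red/home/alice
% zfs snapshot red/home/alice@2018-11-29
% zfs send red/home/alice@2018-11-29 | ssh backuphost zfs recv tank/backups/alice

The first command lets the user manage datasets and snapshots under her
own home dataset; the last streams a snapshot to a pool on another
machine.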
Let's look at the pool used on the Linux system I'm writing this email on:
% zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h5m with 0 errors on Mon Nov 26 17:15:55 2018
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  sda2      ONLINE       0     0     0

errors: No known data errors
This is a single-SSD "root pool" for the operating system, with data in
other pools not shown here.
% zfs list -r rpool
NAME                 USED  AVAIL  REFER  MOUNTPOINT
rpool               45.4G  62.2G    96K  none
rpool/ROOT          14.7G  62.2G    96K  none
rpool/ROOT/default  14.7G  62.2G  12.0G  /
rpool/home          3.18M  62.2G    96K  none
rpool/home/root     3.08M  62.2G  3.08M  /root
rpool/swap          8.50G  63.2G  7.51G  -
rpool/var           5.72G  62.2G    96K  none
rpool/var/cache     5.32G  62.2G  5.18G  /var/cache
rpool/var/log        398M  62.2G   394M  /var/log
rpool/var/spool     8.14M  62.2G  8.02M  /var/spool
rpool/var/tmp        312K  62.2G   152K  /var/tmp
These datasets comprise the entire operating system (I've omitted some
third-party software package datasets from rpool/opt).
% zfs list -t snapshot -r rpool
NAME                             USED  AVAIL  REFER  MOUNTPOINT
rpool@cosmic-post                  0B      -    96K  -
rpool/ROOT@cosmic-post             0B      -    96K  -
rpool/ROOT/default@cosmic-post  2.66G      -  12.3G  -
rpool/home@cosmic-post             0B      -    96K  -
rpool/home/root@cosmic-post        0B      -  3.08M  -
rpool/var@cosmic-post              0B      -    96K  -
rpool/var/cache@cosmic-post      148M      -  5.06G  -
rpool/var/log@cosmic-post       3.99M      -   337M  -
rpool/var/spool@cosmic-post      128K      -  7.79M  -
rpool/var/tmp@cosmic-post        160K      -   192K  -
These are snapshots which would permit a rollback after a recent
upgrade [as the snapshot names suggest, this particular system is
Ubuntu; I've not yet tried out ZFS on Devuan].
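Rolling one of them back is a single command; for example, for the
package cache (rolling back the root dataset itself is best done from a
rescue environment, since it's in use):

% zfs rollback rpool/var/cache@cosmic-post

If snapshots newer than the target exist, zfs refuses unless you pass
-r, which destroys them.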
Each dataset has particular properties associated with it. These are
the properties for the root filesystem:
% zfs get all rpool/ROOT/default
NAME                PROPERTY              VALUE                  SOURCE
rpool/ROOT/default  type                  filesystem             -
rpool/ROOT/default  creation              Sun Jun 12 10:46 2016  -
rpool/ROOT/default  used                  14.7G                  -
rpool/ROOT/default  available             62.2G                  -
rpool/ROOT/default  referenced            12.0G                  -
rpool/ROOT/default  compressratio         1.63x                  -
rpool/ROOT/default  mounted               yes                    -
rpool/ROOT/default  quota                 none                   default
rpool/ROOT/default  reservation           none                   default
rpool/ROOT/default  recordsize            128K                   default
rpool/ROOT/default  mountpoint            /                      local
rpool/ROOT/default  sharenfs              off                    default
rpool/ROOT/default  checksum              on                     default
rpool/ROOT/default  compression           lz4                    inherited from rpool
rpool/ROOT/default  atime                 off                    inherited from rpool
rpool/ROOT/default  devices               off                    inherited from rpool
rpool/ROOT/default  exec                  on                     default
rpool/ROOT/default  setuid                on                     default
rpool/ROOT/default  readonly              off                    default
rpool/ROOT/default  zoned                 off                    default
rpool/ROOT/default  snapdir               hidden                 default
rpool/ROOT/default  aclinherit            restricted             default
rpool/ROOT/default  createtxg             15                     -
rpool/ROOT/default  canmount              on                     default
rpool/ROOT/default  xattr                 on                     default
rpool/ROOT/default  copies                1                      default
rpool/ROOT/default  version               5                      -
rpool/ROOT/default  utf8only              on                     -
rpool/ROOT/default  normalization         formD                  -
rpool/ROOT/default  casesensitivity       sensitive              -
rpool/ROOT/default  vscan                 off                    default
rpool/ROOT/default  nbmand                off                    default
rpool/ROOT/default  sharesmb              off                    default
rpool/ROOT/default  refquota              none                   default
rpool/ROOT/default  refreservation        none                   default
rpool/ROOT/default  guid                  3867409876204186651    -
rpool/ROOT/default  primarycache          all                    default
rpool/ROOT/default  secondarycache        all                    default
rpool/ROOT/default  usedbysnapshots       2.66G                  -
rpool/ROOT/default  usedbydataset         12.0G                  -
rpool/ROOT/default  usedbychildren        0B                     -
rpool/ROOT/default  usedbyrefreservation  0B                     -
rpool/ROOT/default  logbias               latency                default
rpool/ROOT/default  dedup                 off                    default
rpool/ROOT/default  mlslabel              none                   default
rpool/ROOT/default  sync                  standard               default
rpool/ROOT/default  dnodesize             legacy                 default
rpool/ROOT/default  refcompressratio      1.62x                  -
rpool/ROOT/default  written               2.39G                  -
rpool/ROOT/default  logicalused           22.0G                  -
rpool/ROOT/default  logicalreferenced     18.0G                  -
rpool/ROOT/default  volmode               default                default
rpool/ROOT/default  filesystem_limit      none                   default
rpool/ROOT/default  snapshot_limit        none                   default
rpool/ROOT/default  filesystem_count      none                   default
rpool/ROOT/default  snapshot_count        none                   default
rpool/ROOT/default  snapdev               hidden                 default
rpool/ROOT/default  acltype               off                    default
rpool/ROOT/default  context               none                   default
rpool/ROOT/default  fscontext             none                   default
rpool/ROOT/default  defcontext            none                   default
rpool/ROOT/default  rootcontext           none                   default
rpool/ROOT/default  relatime              on                     temporary
rpool/ROOT/default  redundant_metadata    all                    default
rpool/ROOT/default  overlay               off                    default
Some are inherited from the parent dataset, some are general defaults,
others have been set explicitly, and some are read-only information.
Note the atime/devices/exec/setuid/readonly properties, which set the
mount options, as well as the mountpoint. Other properties control
quotas and pre-allocated reservations of blocks for this filesystem,
while others are for performance tuning, such as the cache, logbias and
sync options. Transparent compression with lz4 is enabled. Others
control dataset integrity, such as the checksum, copies and
redundant_metadata options (the latter two store multiple copies of
blocks /in addition to/ the effective RAID redundancy provided by the
storage pool).
So the mount options can be set on a per-filesystem basis, and you can
have as many filesystems as you like within the directory hierarchy.
Ultimate flexibility!
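Tweaking any of these is a one-liner per dataset. For instance
(datasets from the NAS pool above; the values are arbitrary examples):

% zfs set exec=off red/distfiles
% zfs set quota=50G red/home
% zfs get exec,quota red/distfiles red/home

Settable properties take effect immediately, and most of them are
inherited by child datasets unless a child overrides them (quota is a
bit different: it limits the dataset and everything beneath it rather
than being inherited).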
This is the fstab:
% cat /etc/fstab
PARTUUID=7542d544-adc9-40b3-b6ef-5aa3ac5afbfb /boot/efi vfat defaults 0 1
/dev/zvol/rpool/swap none swap defaults 0 0
Just the swap volume (which is a ZFS zvol) and the EFI system partition
(plus some NFS mounts I've omitted). All the ZFS filesystems get
mounted using the dataset properties shown above, so there's a single
place to administer the options: zfs itself. Mountpoint and other
property changes take immediate effect. It's simple and powerful to
administer.
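For example, relocating a filesystem is just a property change (the
target path here is invented):

% zfs set mountpoint=/srv/data red/data

ZFS unmounts it and remounts it at the new location on the spot, with
no fstab to edit. And here's what's actually mounted from the root
pool right now: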
% mount | grep rpool
rpool/ROOT/default on / type zfs (rw,relatime,xattr,noacl)
rpool/home/root on /root type zfs (rw,nodev,noatime,xattr,noacl)
rpool/var/cache on /var/cache type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
rpool/var/log on /var/log type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
rpool/var/spool on /var/spool type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
rpool/var/tmp on /var/tmp type zfs (rw,nosuid,nodev,noatime,xattr,noacl)
So that's a brief look at ZFS on Linux as a root filesystem. Hope it
was interesting.
Why should anyone care? I'd like to begin by contrasting this with Rick
Moen's points regarding placement of partitions on the disc for maximum
performance. While such a strategy is technically correct, it suffers
from inflexible partition arrangements, it's extremely time-consuming
to profile enough variants to ensure the layout is optimal for the
workload, and it assumes the workload will never change once you've
adopted a particular layout. Is placing /usr in the middle better than
the rootfs, or /var, or particular user data? Who knows? And who has
time for that? Certainly not any of the admins who looked after my
systems. I did this twenty years ago when I had far too much time to
micro-optimise this stuff, but I haven't done so for a long time now.
For the small performance gain you might obtain, it's hugely costly.
The ZFS approach is to place all the storage in a huge pool, and then to
allow performance tuning of the pool and individual datasets within it.
The physical placement of the data becomes largely irrelevant; ZFS
handles this for you, and it probably does a better job of placing the
data efficiently. That is, after all, its entire purpose.
The other point is that ZFS scales well. Take the 4-disc NAS above:
streaming reads can pull data off all four discs in parallel via a
dedicated HBA, and streaming writes are balanced across all the discs.
I can tune each dataset for throughput or latency, adjust caching
options and the default blocksize to match the workload, and add fast
SSD storage for read and/or write caching to further improve
performance. There are dozens of knobs to tweak and plenty of
different strategies to employ in the pool layout, plus a load of
kernel parameters to tune its behaviour as well. There are several
books written on this stuff (by Michael Lucas and Allan Jude).
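To give a flavour of the per-dataset tuning (red/media, red/db,
red/scratch and red/backups are invented names; the values are
plausible starting points, not recommendations):

% zfs set recordsize=1M red/media
% zfs set recordsize=16K red/db
% zfs set logbias=throughput red/scratch
% zfs set primarycache=metadata red/backups

Large records suit streaming of big files, small records suit
database-style random I/O, logbias=throughput trades sync latency for
pool bandwidth, and primarycache=metadata keeps backup streams from
evicting more useful data from the ARC. Each setting applies only to
that dataset (and its children), so different workloads in the same
pool can be tuned independently.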
When it comes to filesystems, we still have the option of using plain
partitions, or md, and/or LVM. However, we've made a lot of progress in
storage technology over the last two decades, both in hardware and
software. Storage managers and filesystems like ZFS provide a lot of
value the older systems do not. Personally, I'm sold on it. It beats
the pants off LVM, and it doesn't eat your data or unbalance itself to
unusability like Btrfs.
If you're a ZFS user and you look at the issue of mounting /usr as a
separate filesystem, it's really a non-issue:
- it's just another dataset in the pool
- being mounted directly when the pool is activated means there are zero
issues mounting from the initramfs vs directly; /all/ the filesystems in
the pool are mounted automatically
- if you do have a separate /usr, you can control the mount options just
like any other dataset (see the sketch below), but /usr is nothing
special; you can divide the filesystem hierarchy as finely as you like
with no problems
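For instance, carving /usr out into its own dataset would be a
one-liner (a sketch only; this dataset doesn't exist on the system
above, and you'd populate it before switching over):

% zfs create -o mountpoint=/usr -o atime=off -o devices=off rpool/usr

From then on it's mounted with those options whenever the pool is
imported, exactly like every other dataset.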
Regards,
Roger