Re: [DNG] /usr to merge or not to merge... that is the question

Author: Roger Leigh
Date:
To: dng
Subject: Re: [DNG] /usr to merge or not to merge... that is the question

On 29/11/2018 13:44, Olaf Meeuwissen wrote:
> Q: Isn't there some filesystem type that supports settings at a more
> granular level than the device? Like directory or per file?
> A: Eh ... Don't know. Haven't checked ...
> Solution: Go fish!
>
> # I haven't gone fishing yet but a vague recollection of Roger's post
> # where he mentioned ZFS seemed promising ...

You could set them at the directory level if you wanted with ZFS, by
creating a filesystem per directory. But that might get a bit
unmanageable, so doing it in a coarser-grained way is usually
sufficient. Let me show some examples.

Firstly, this is an example of a FreeBSD 11.2 ZFS NAS with a few
terabytes of HDD storage. There's a dedicated system pool, plus this
data pool:

     % zpool status red
       pool: red
      state: ONLINE
       scan: scrub repaired 0 in 4h12m with 0 errors on Wed Nov 28 19:51:31 2018
     config:

             NAME                         STATE     READ WRITE CKSUM
             red                          ONLINE       0     0     0
               mirror-0                   ONLINE       0     0     0
                 gpt/zfs-b1-WCC4M0EYCTAZ  ONLINE       0     0     0
                 gpt/zfs-b2-WCC4M5PV83PY  ONLINE       0     0     0
               mirror-1                   ONLINE       0     0     0
                 gpt/zfs-b3-WCC4N7FLJD34  ONLINE       0     0     0
                 gpt/zfs-b4-WCC4N4UDKN8F  ONLINE       0     0     0
             logs
               mirror-2                   ONLINE       0     0     0
                 gpt/zfs-a1-B492480446    ONLINE       0     0     0
                 gpt/zfs-a2-B492480406    ONLINE       0     0     0

     errors: No known data errors

This is arranged as a pair of mirrors, referenced by GPT labels.
Each of the mirror "vdev"s is RAID1, and then writes are striped across
them. It's not /actually/ striping, it's more clever, and it biases the
writes to balance throughput and free space across all the vdevs, but
it's similar. The last vdev is a "log" device made of a pair of
mirrored SSDs. It holds the "ZIL" (ZFS intent log), and basically acts
as a fast write cache for synchronous writes. I could also have an SSD
added as an L2ARC read cache, but it's not necessary for this system's
workload. No errors have occurred on any of the devices in the pool.

All filesystems allocate storage from this pool. Here are a few:

     % zfs list -r red | head -n6
     NAME                 USED  AVAIL  REFER  MOUNTPOINT
     red                 2.57T  1.82T   104K  /red
     red/bhyve            872K  1.82T    88K  /red/bhyve
     red/data             718G  1.82T   712G  /export/data
     red/distfiles       11.9G  1.82T  11.8G  /red/distfiles
     red/home             285G  1.82T    96K  /export/home

There are actually 74 filesystems in this pool! Because the storage is
shared, there's no fixed limit on how many you can have. So it's
massively more flexible than traditional partitioning, or even LVM
(without a thin pool). You can create, snapshot and destroy datasets on a whim,
and even delegate administration of parts of the tree to different users
and groups. So users could create datasets under their home directory,
snapshot them and send them to and from other systems. You can organise
your data into whatever filesystem structure makes sense.
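
As a rough sketch of that kind of delegation, the commands would look
something like the following. The user name "alice", the dataset
red/home/alice and the exact permission list are all hypothetical, and
whether an unprivileged user may actually mount a dataset depends on
the platform. The function only prints the commands so they can be
reviewed before anything is run:

```shell
#!/bin/sh
# Print (not run) the commands to delegate a subtree to one user.
# "alice" and "red/home/alice" below are made-up names for illustration.
delegate_plan() {
    user="$1"
    dataset="$2"
    # Create the subtree as root, then hand over a limited set of permissions.
    echo "zfs create ${dataset}"
    echo "zfs allow -u ${user} create,mount,snapshot,send,receive ${dataset}"
    # With only a dataset argument, "zfs allow" lists what has been delegated.
    echo "zfs allow ${dataset}"
}

delegate_plan alice red/home/alice
```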

Let's look at the pool used on the Linux system I'm writing this email on:

     % zpool status rpool
       pool: rpool
      state: ONLINE
       scan: scrub repaired 0B in 0h5m with 0 errors on Mon Nov 26 17:15:55 2018
     config:

             NAME        STATE     READ WRITE CKSUM
             rpool       ONLINE       0     0     0
               sda2      ONLINE       0     0     0

     errors: No known data errors

This is a single SSD "root pool" for the operating system, with data in
other pools not shown here.

     % zfs list -r rpool
     NAME                 USED  AVAIL  REFER  MOUNTPOINT
     rpool               45.4G  62.2G    96K  none
     rpool/ROOT          14.7G  62.2G    96K  none
     rpool/ROOT/default  14.7G  62.2G  12.0G  /
     rpool/home          3.18M  62.2G    96K  none
     rpool/home/root     3.08M  62.2G  3.08M  /root
     rpool/swap          8.50G  63.2G  7.51G  -
     rpool/var           5.72G  62.2G    96K  none
     rpool/var/cache     5.32G  62.2G  5.18G  /var/cache
     rpool/var/log        398M  62.2G   394M  /var/log
     rpool/var/spool     8.14M  62.2G  8.02M  /var/spool
     rpool/var/tmp        312K  62.2G   152K  /var/tmp

These datasets comprise the entire operating system (I've omitted some
third-party software package datasets from rpool/opt).

     % zfs list -t snapshot -r rpool
     NAME                             USED  AVAIL  REFER  MOUNTPOINT
     rpool@cosmic-post                  0B      -    96K  -
     rpool/ROOT@cosmic-post             0B      -    96K  -
     rpool/ROOT/default@cosmic-post  2.66G      -  12.3G  -
     rpool/home@cosmic-post             0B      -    96K  -
     rpool/home/root@cosmic-post        0B      -  3.08M  -
     rpool/var@cosmic-post              0B      -    96K  -
     rpool/var/cache@cosmic-post      148M      -  5.06G  -
     rpool/var/log@cosmic-post       3.99M      -   337M  -
     rpool/var/spool@cosmic-post      128K      -  7.79M  -
     rpool/var/tmp@cosmic-post        160K      -   192K  -

These are snapshots which would permit rollback after a recent upgrade
[as you can see, this particular system is Ubuntu; I've not yet tried
out ZFS on Devuan].
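
A set of snapshots like this could be produced with a single recursive
snapshot. That is an assumption here (the identical snapshot name on
every dataset is consistent with it, but the listing doesn't say how
they were made). The sketch below only prints the commands for taking
such a snapshot and for rolling individual datasets back to it:

```shell
#!/bin/sh
# Print (not run) the commands; review them, then run as root if they fit.

# One recursive snapshot gives every dataset under the pool the same tag.
snapshot_plan() {
    pool="$1"
    tag="$2"
    echo "zfs snapshot -r ${pool}@${tag}"
    echo "zfs list -t snapshot -r ${pool}"
}

# Rollback works on one dataset at a time, so each one is listed explicitly.
# (Rolling back past newer snapshots needs "zfs rollback -r", which
# destroys those newer snapshots.)
rollback_plan() {
    tag="$1"
    shift
    for dataset in "$@"; do
        echo "zfs rollback ${dataset}@${tag}"
    done
}

snapshot_plan rpool cosmic-post
rollback_plan cosmic-post rpool/ROOT/default rpool/var/log
```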

Each dataset has particular properties associated with it. These are
the properties for the root filesystem:

     % zfs get all rpool/ROOT/default
     NAME                PROPERTY              VALUE                  SOURCE
     rpool/ROOT/default  type                  filesystem             -
     rpool/ROOT/default  creation              Sun Jun 12 10:46 2016  -
     rpool/ROOT/default  used                  14.7G                  -
     rpool/ROOT/default  available             62.2G                  -
     rpool/ROOT/default  referenced            12.0G                  -
     rpool/ROOT/default  compressratio         1.63x                  -
     rpool/ROOT/default  mounted               yes                    -
     rpool/ROOT/default  quota                 none                   default
     rpool/ROOT/default  reservation           none                   default
     rpool/ROOT/default  recordsize            128K                   default
     rpool/ROOT/default  mountpoint            /                      local
     rpool/ROOT/default  sharenfs              off                    default
     rpool/ROOT/default  checksum              on                     default
     rpool/ROOT/default  compression           lz4                    inherited from rpool
     rpool/ROOT/default  atime                 off                    inherited from rpool
     rpool/ROOT/default  devices               off                    inherited from rpool
     rpool/ROOT/default  exec                  on                     default
     rpool/ROOT/default  setuid                on                     default
     rpool/ROOT/default  readonly              off                    default
     rpool/ROOT/default  zoned                 off                    default
     rpool/ROOT/default  snapdir               hidden                 default
     rpool/ROOT/default  aclinherit            restricted             default
     rpool/ROOT/default  createtxg             15                     -
     rpool/ROOT/default  canmount              on                     default
     rpool/ROOT/default  xattr                 on                     default
     rpool/ROOT/default  copies                1                      default
     rpool/ROOT/default  version               5                      -
     rpool/ROOT/default  utf8only              on                     -
     rpool/ROOT/default  normalization         formD                  -
     rpool/ROOT/default  casesensitivity       sensitive              -
     rpool/ROOT/default  vscan                 off                    default
     rpool/ROOT/default  nbmand                off                    default
     rpool/ROOT/default  sharesmb              off                    default
     rpool/ROOT/default  refquota              none                   default
     rpool/ROOT/default  refreservation        none                   default
     rpool/ROOT/default  guid                  3867409876204186651    -
     rpool/ROOT/default  primarycache          all                    default
     rpool/ROOT/default  secondarycache        all                    default
     rpool/ROOT/default  usedbysnapshots       2.66G                  -
     rpool/ROOT/default  usedbydataset         12.0G                  -
     rpool/ROOT/default  usedbychildren        0B                     -
     rpool/ROOT/default  usedbyrefreservation  0B                     -
     rpool/ROOT/default  logbias               latency                default
     rpool/ROOT/default  dedup                 off                    default
     rpool/ROOT/default  mlslabel              none                   default
     rpool/ROOT/default  sync                  standard               default
     rpool/ROOT/default  dnodesize             legacy                 default
     rpool/ROOT/default  refcompressratio      1.62x                  -
     rpool/ROOT/default  written               2.39G                  -
     rpool/ROOT/default  logicalused           22.0G                  -
     rpool/ROOT/default  logicalreferenced     18.0G                  -
     rpool/ROOT/default  volmode               default                default
     rpool/ROOT/default  filesystem_limit      none                   default
     rpool/ROOT/default  snapshot_limit        none                   default
     rpool/ROOT/default  filesystem_count      none                   default
     rpool/ROOT/default  snapshot_count        none                   default
     rpool/ROOT/default  snapdev               hidden                 default
     rpool/ROOT/default  acltype               off                    default
     rpool/ROOT/default  context               none                   default
     rpool/ROOT/default  fscontext             none                   default
     rpool/ROOT/default  defcontext            none                   default
     rpool/ROOT/default  rootcontext           none                   default
     rpool/ROOT/default  relatime              on                     temporary
     rpool/ROOT/default  redundant_metadata    all                    default
     rpool/ROOT/default  overlay               off                    default

Some are inherited from the parent dataset, some are general defaults,
while others have been set explicitly and some are read-only information.
Note the atime/devices/exec/setuid/readonly options which set the mount
options, as well as the mountpoint. Other options control quotas and
pre-allocated reservations of blocks to this filesystem, while others
are for performance tuning such as the cache, logbias and sync options.
Transparent compression with lz4 is enabled. Other options control
dataset integrity such as the checksum, copies and redundant_metadata
options (the latter two store multiple copies of blocks /in addition to/
the effective RAID redundancy provided by the storage pool).

So the mount options can be set on a per-filesystem basis, and you can
have as many filesystems as you like within the directory hierarchy.
Ultimate flexibility!
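
For instance, splitting off a directory with stricter options might
look like this. The dataset name tank/scratch and the mountpoint
/srv/scratch are hypothetical; the property names are the ones in the
listing above. The function prints the commands rather than running
them:

```shell
#!/bin/sh
# Print (not run) the commands to create a dataset with restrictive
# mount-related properties. Names are placeholders for illustration.
restricted_dataset_plan() {
    dataset="$1"
    mountpoint="$2"
    echo "zfs create -o mountpoint=${mountpoint} -o exec=off -o setuid=off -o devices=off ${dataset}"
    # Properties can also be changed later, on an existing dataset.
    echo "zfs set atime=off ${dataset}"
    # Confirm the values and where each one comes from.
    echo "zfs get -o property,value,source exec,setuid,devices,atime ${dataset}"
}

restricted_dataset_plan tank/scratch /srv/scratch
```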

This is the fstab:

     % cat /etc/fstab
     PARTUUID=7542d544-adc9-40b3-b6ef-5aa3ac5afbfb /boot/efi vfat defaults 0 1
     /dev/zvol/rpool/swap none swap defaults 0 0

Just the swap volume (which is a ZFS volume), and the EFI thing (plus
some NFS mounts I omitted). All the ZFS filesystems get mounted using
the dataset properties, as shown above. There's a single place to
administer the options, within zfs itself. Mountpoint and other
property changes take immediate effect. It's simple, easy and powerful
to administer.

     % mount | grep rpool
     rpool/ROOT/default on / type zfs (rw,relatime,xattr,noacl)
(rw,nodev,noatime,xattr,noacl)
     rpool/var/cache on /var/cache type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
     rpool/var/log on /var/log type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
     rpool/var/spool on /var/spool type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
     rpool/var/tmp on /var/tmp type zfs (rw,nosuid,nodev,noatime,xattr,noacl)
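
The correspondence between the on/off dataset properties and the
options in the mount output can be sketched as a small shell function.
This is only an illustration covering five properties; it is not how
ZFS itself derives the options, and the function name is made up:

```shell
#!/bin/sh
# Usage: zfs_mount_opts READONLY SETUID DEVICES EXEC ATIME
# Each argument is the dataset property value, "on" or "off".
zfs_mount_opts() {
    if [ "$1" = "on" ]; then opts="ro"; else opts="rw"; fi
    if [ "$2" = "off" ]; then opts="${opts},nosuid"; fi
    if [ "$3" = "off" ]; then opts="${opts},nodev"; fi
    if [ "$4" = "off" ]; then opts="${opts},noexec"; fi
    if [ "$5" = "off" ]; then opts="${opts},noatime"; fi
    printf '%s\n' "${opts}"
}

# Everything off: prints rw,nosuid,nodev,noexec,noatime
zfs_mount_opts off off off off off
# exec left on: prints rw,nosuid,nodev,noatime
zfs_mount_opts off off off on off
```

The two calls reproduce the leading options of the /var/cache and
/var/tmp lines in the mount output (the xattr and noacl flags are left
out of the sketch).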

So that's a brief look at ZFS on Linux as a root filesystem. Hope it
was interesting.

Why should anyone care? I'd like to begin by contrasting this with Rick
Moen's points regarding placement of partitions on the disc for maximum
performance. While such a strategy is technically correct, it suffers
from inflexible partition arrangements, it's extremely time-consuming
to profile enough variants to ensure the layout is optimal for the
workload, and it assumes that the workload will never change once
you've adopted a particular layout. Is
placing /usr in the middle better than the rootfs, or /var or particular
user data? Who knows? And who has time for that? Certainly not any of
the admins who looked after my systems. I did this twenty years ago
when I had far too much time to micro-optimise this stuff, but I haven't
done so for a long time now. For the small performance gain you might
obtain, it's hugely costly.

The ZFS approach is to place all the storage in a huge pool, and then to
allow performance tuning of the pool and individual datasets within it.
The physical placement of the data becomes largely irrelevant; ZFS
handles this for you, and it probably does a better job of placing the
data efficiently. That is, after all, its entire purpose.

The other point is that ZFS scales well. Take the 4-disc NAS.
Streaming reads can pull data off all 4 discs in parallel via a
dedicated HBA. Streaming writes are balanced across all the discs. And
I can tune each dataset for throughput or latency, as well as adjusting
caching options and default blocksize to match the workload, and I can
add fast SSD storage for read and/or write caching to further improve
performance. There are dozens of knobs to tweak and plenty of different
strategies to employ in the pool layout. Plus a load of kernel
parameters to tune its behaviour there as well. There are several books
written on this stuff (by Michael Lucas and Allan Jude).
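
Per-dataset tuning of that kind comes down to setting properties. A
minimal sketch follows, using the dataset red/data from the listing
above; the values are placeholders rather than recommendations, and
the function only prints the commands:

```shell
#!/bin/sh
# Print (not run) a few per-dataset tuning commands.
# The values used below are placeholders, not tuning advice.
tuning_plan() {
    dataset="$1"
    recordsize="$2"
    echo "zfs set logbias=throughput ${dataset}"
    echo "zfs set recordsize=${recordsize} ${dataset}"
    # Cache only metadata in RAM (the ARC) for this dataset.
    echo "zfs set primarycache=metadata ${dataset}"
    # Check the result, including where each value was set.
    echo "zfs get logbias,recordsize,primarycache ${dataset}"
}

tuning_plan red/data 128K
```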

When it comes to filesystems, we still have the option of using plain
partitions, or md, and/or LVM. However, we've made a lot of progress in
storage technology over the last two decades, both in hardware and
software. Storage managers and filesystems like ZFS provide a lot of
value the older systems do not. Personally, I'm sold on it. It beats
the pants off LVM, and it doesn't eat your data or unbalance itself to
unusability like Btrfs.

If you're a ZFS user, and you look at the issue of mounting /usr as a
separate filesystem, it's really a non-issue:

- it's just another dataset in the pool
- being mounted directly when the pool is activated means there are zero
issues mounting from initramfs vs directly; /all/ the filesystems in the
pool are mounted automatically
- if you do have a separate /usr, you can control the mount options just
like any other dataset, but /usr is nothing special; you can divide the
filesystem hierarchy as finely as you like with no problems

Regards,
Roger
