[TUHS] Quotas - did anyone ever use them?

Nelson H. F. Beebe beebe at math.utah.edu
Sat Jun 1 01:05:01 AEST 2019


[This is another essay-length post about O/Ses and ZFS...]

Arthur Krewat <krewat at kilonet.net> asks on Thu, 30 May 2019 20:42:33
-0400, in response to my TUHS list posting about running a large
computing facility without user disk quotas, about our experiences
with ZFS:

>> I have yet to play with Linux and ZFS but would appreciate to
>> hear your experiences with it.

First, ZFS on the Solaris family (including DilOS, Dyson, Hipster,
Illumian, Omnios, Omnitribblix, OpenIndiana, Tribblix, Unleashed, and
XStreamOS), on the FreeBSD family (including ClonOS, FreeNAS, GhostBSD,
HardenedBSD, MidnightBSD, PCBSD, Trident, and TrueOS), and on
GNU/Linux (1000+ distributions, due to theological differences) offers
important data safety features and ease of management.

There are lots of details about ZFS that you can find in the slides of
a talk that we have given several times:

	http://www.math.utah.edu/~beebe/talks/2017/zfs/zfs.pdf

The slides at the end of that file contain pointers to ZFS resources,
including recent books.

Some of the key ZFS features are listed below; a short command sketch
illustrating several of them follows the list:

	* all disks form a dynamic shared pool from which space can be
	  drawn for datasets, on top of which filesystems can be
	  created;

	* the pool can exploit data redundancy via various RAID Zn
	  choices to survive loss of individual disks, and optionally,
	  provide hot spares shared across the pool, and available to
	  all datasets;

        * hardware RAID controllers are unneeded, and discouraged ---
          a JBOD (just a bunch of disks) array is quite satisfactory;

	* all metadata, and all file data blocks, have checksums that
	  are replicated elsewhere in the pool, and checked on EVERY
	  read and write, allowing automatic silent recovery (via data
	  redundancy) from transient or permanent errors in disk
	  blocks --- ZFS is self healing;

	* ZFS filesystems can have unlimited numbers of snapshots;

	* snapshots are extremely fast, typically less than one
	  second, even in multi-terabyte filesystems;

        * snapshots are read-only, and thus immune to ransomware
          attacks;

	* ZFS send and receive operations allow propagation of copies
	  of filesystems by transferring only data blocks that have
	  changed since the last send operation;

	* the ZFS copy-on-write policy means that in-use blocks are
	  never changed, and that block updates are guaranteed to be
	  atomic;

        * quotas can optionally be enabled on datasets, and grown as
          needed (quota shrink is not yet possible, but is in ZFS
          development plans);

	* ZFS optionally supports encryption, data compression, block
	  deduplication, and n-way disk replication;

        * unlike traditional fsck, which requires filesystems to be
          offline during the checks, ZFS scrub operations can be run
          (usually by cron jobs, and at lower priority) to go through
          datasets to verify data integrity and filesystem sanity
          while normal services continue.
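
Here is that sketch; the pool layout, dataset names, and disk names
are hypothetical placeholders, and the device names in particular
vary by O/S:

        # create a RAID-Z2 pool with a shared hot spare
        zpool create tank raidz2 disk0 disk1 disk2 disk3 disk4 disk5 spare disk6

        # create a compressed dataset, making intermediate datasets as needed
        zfs create -p -o compression=lz4 tank/export/home

        # give the dataset a quota that can later be grown
        zfs set quota=500G tank/export/home

        # take a named, read-only, nearly instantaneous snapshot
        zfs snapshot tank/export/home@auto-2019-05-31

        # verify checksums across the whole pool while it stays in service
        zpool scrub tank
        zpool status tank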

ZFS likes to cache metadata, and active data blocks, in memory.  Most
of our VMs that have other filesystems, like EXT{2,3,4}, FFS, JFS,
MFS, ReiserFS, UFS, and XFS, run quite happily with 1GB of DRAM.  The
ZFS, DragonFly BSD Hammer, and BTRFS ones are happier with 2GB to 4GB
of DRAM.  Our central fileservers have 256GB to 768GB of DRAM.
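
That in-memory cache is the ZFS ARC (Adaptive Replacement Cache), and
its maximum size can be capped when DRAM is scarce.  A small sketch
for a ZFS-on-Linux system, with a hypothetical 2GiB limit:

        # report the current and maximum ARC sizes
        awk '/^(size|c_max)/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats

        # cap the ARC at 2GiB across reboots (zfs module option, in bytes)
        echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf

The FreeBSD and Solaris families expose equivalent tunables through
loader.conf and /etc/system, respectively.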

The major drawback of copy-on-write and snapshots is that once a
snapshot has been taken, a filesystem-full condition cannot be
ameliorated by removing a few large files.  Instead, you have to
either increase the dataset quota (our normal practice), or free
older snapshots.
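
Both remedies are one-line administrative commands.  A sketch,
assuming the dataset behind the mount in the example below is named
tank/export/home/2001:

        # check current usage against the quota
        zfs get quota,used,available tank/export/home/2001

        # remedy 1: grow the dataset quota (our normal practice)
        zfs set quota=600G tank/export/home/2001

        # remedy 2: list snapshots oldest-first, then destroy the oldest
        zfs list -r -t snapshot -o name,used -s creation tank/export/home/2001
        zfs destroy tank/export/home/2001@auto-2019-05-18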

Our view is that the benefits of snapshots for recovery of earlier
file versions far outweigh that one drawback: I myself did such a
recovery yesterday when I accidentally clobbered a critical file full
of digital signature keys.

On Solaris and FreeBSD families, snapshots are visible to users as
read-only filesystems, like this (for ftp://ftp.math.utah.edu/pub/texlive 
and http://www.math.utah.edu/pub/texlive):

	% df /u/ftp/pub/texlive
	Filesystem             1K-blocks      Used Available Use% Mounted on
	tank:/export/home/2001 518120448 410762240 107358208  80% /home/2001

	% ls /home/2001/.zfs/snapshot
	AMANDA           auto-2019-05-21  auto-2019-05-25  auto-2019-05-29
	auto-2019-05-18  auto-2019-05-22  auto-2019-05-26  auto-2019-05-30
	auto-2019-05-19  auto-2019-05-23  auto-2019-05-27  auto-2019-05-31
	auto-2019-05-20  auto-2019-05-24  auto-2019-05-28

	% ls /home/2001/.zfs/snapshot/auto-2019-05-21/ftp/pub/texlive
	Contents  Images  Source  historic  protext  tlcritical  tldump  tlnet  tlpretest

That is, you first use the df command to find the source of the
current mount point, then use ls to examine the contents of
.zfs/snapshot under that source, and finally follow your pathname
downward to locate a file that you want to recover, or to compare
with the current copy or with another snapshot copy.
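
The last step is then an ordinary copy out of the read-only snapshot
tree; for example (the pathname below the snapshot is a placeholder):

        % cp /home/2001/.zfs/snapshot/auto-2019-05-21/path/to/lost-file \
             /home/2001/path/to/lost-file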

On Network Appliance systems with the WAFL filesystem design (see

	https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout

), snapshots are instead mapped to hidden directories inside each
directory, which is more convenient for human users, and is a feature
that we would really like to see on ZFS.

A nuisance for us is that the current ZFS implementation on CentOS 7
(a subset of the pay-for-service Red Hat Enterprise Linux 7) does not
show any files under the .zfs/snapshot/auto-YYYY-MM-DD directories,
except on the fileserver itself.

During the 15+ years that we used Solaris ZFS, our users could
recover previous file versions themselves, following the instructions
at

	http://www.math.utah.edu/faq/files/files.html#FAQ-8

Since our move to a GNU/Linux fileserver, they no longer can; instead,
they have to contact systems management to access such files.

We sincerely hope that CentOS 8 will resolve that serious deficiency:
see

	http://www.math.utah.edu/pub/texlive-utah/README.html#rhel-8

for comments on the production of that O/S release from the recent
major new Red Hat EL8 release.

We have a large machine-room UPS, and an outside diesel generator, so
our physical servers are immune to power outages and power surges, the
latter being a common problem in Utah during summer lightning storms.
Thus, unplanned fileserver outages should never happen.

A second issue for us is that on Solaris and FreeBSD, we have never
seen a fileserver crash due to ZFS issues, and on Solaris, our servers
have sometimes been up for one to three years before we took them down
for software updates.  However, with ZFS on CentOS 7, we have seen 13
unexplained reboots in the last year.  Each has happened late at
night, or in the early morning, while backups to our tape robot, and
ZFS send/receive operations to a remote datacenter, were in progress.
The crash times suggest to us that heavy ZFS activity is exposing a
kernel or Linux ZFS bug.  We hope that CentOS 8 will resolve that
issue.

We have ZFS on about 70 physical and virtual machines, and GNU/Linux
BTRFS on about 30 systems.  With ZFS, freeing a snapshot moves its
blocks to the free list within seconds.  With BTRFS, freeing snapshots
often takes tens of minutes, and sometimes, hours, before space
recovery is complete.  That can be aggravating when it stops your work
on that system.

By contrast, taking snapshots on both BTRFS and ZFS is fast.
However, their space overhead appears to be far smaller on ZFS than
on BTRFS.  We have VMs and
physical machines with ZFS that have 300 to 1000 daily snapshots with
little noticeable reduction in free space, whereas those with BTRFS
seem to lose about a gigabyte a day.  My home TrueOS system has
sufficient space for about 25 years of ZFS dailies.  Consequently, I
run nightly reports of free space on all of our systems, and manually
intervene on the BTRFS ones when space hits a critical level (I try to
keep 10GB free).
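
On the ZFS side, such a nightly free-space report needs little more
than queries like these (a sketch; pool and dataset names are
hypothetical):

        # pool-level capacity summary
        zpool list -o name,size,allocated,free,capacity

        # per-dataset space, including the share held only by snapshots
        zfs list -r -o name,used,usedbysnapshots,available tank

        # rough BTRFS equivalent on the Linux systems
        btrfs filesystem df /home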

On both ZFS and BTRFS, packages are available to trim old snapshots,
and we run the ZFS trimmer via cron jobs on our main fileservers.
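
Our own trimmer aside, a minimal cron-driven version can be sketched
in one pipeline; the retention count (30) and dataset name here are
hypothetical:

        # keep the 30 newest auto- snapshots of one dataset, destroy the rest
        # (GNU xargs; -r skips the run when there is nothing to destroy)
        zfs list -H -t snapshot -o name -S creation -r tank/export/home/2001 | \
            grep '@auto-' | tail -n +31 | xargs -r -n 1 zfs destroy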

In the GNU/Linux world, however, only openSUSE comes by default with a
cron-enabled BTRFS snapshot trimmer, so intervention is unnecessary on
that O/S flavor.  I have never installed snapshot trimmer packages on
any of our other VMs, because it just means more management work to
deal with variants in trimmer packages, configuration files, and cron
jobs.

Teams of ZFS developers from FreeBSD and GNU/Linux are working on
merging divergent features back into a common OpenZFS code base that
all O/Ses that support ZFS can use; that merger is expected to happen
within the next few months.  ZFS has been ported by third parties to
Apple macOS and Microsoft Windows, so it has the potential to become
a universal filesystem available on all common desktop environments.
Then we could use ZFS send/receive instead of .iso, .dmg, and .img
files to copy entire filesystems between different O/Ses.
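
Between hosts that already run ZFS, that kind of whole-filesystem copy
is a routine send/receive pipeline; a sketch, with hypothetical host
and dataset names, of a full copy followed by an incremental update:

        # full replication stream of a dataset tree to a remote pool
        zfs snapshot -r tank/export/home@copy-1
        zfs send -R tank/export/home@copy-1 | \
            ssh otherhost zfs receive -F tank/export/home

        # later, send only the blocks changed since the previous snapshot
        zfs snapshot -r tank/export/home@copy-2
        zfs send -R -i @copy-1 tank/export/home@copy-2 | \
            ssh otherhost zfs receive -F tank/export/home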

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe at math.utah.edu  -
- 155 S 1400 E RM 233                       beebe at acm.org  beebe at computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------

