Karlston posted a news in Software News

ZFS fans, rejoice—RAIDz expansion will be a thing very soon

Founding OpenZFS dev Matthew Ahrens opened a pull request last week.

[Image: OpenZFS supports many complex disk topologies, but "spiral stack sitting on a desk" still isn't one of them. Credit: Jim Salter]

OpenZFS founding developer Matthew Ahrens opened a PR for one of the most sought-after features in ZFS history—RAIDz expansion—last week. The new feature allows a ZFS user to expand the size of a single RAIDz vdev. For example, you can use the new feature to turn a three-disk RAIDz1 into a four-, five-, or six-disk RAIDz1.

OpenZFS is a complex filesystem, and things are necessarily going to get a bit chewy explaining how the feature works. So if you're a ZFS newbie, you may want to refer back to our comprehensive ZFS 101 introduction.

Expanding storage in ZFS

In addition to being a filesystem, ZFS is a storage array and volume manager, meaning that you can feed it a whole pile of disk devices, not just one. The heart of a ZFS storage system is the zpool—this is the most fundamental level of ZFS storage. The zpool in turn contains vdevs, and vdevs contain actual disks within them. Writes are split into units called records or blocks, which are then distributed semi-evenly among the vdevs.

A storage vdev can be one of five types—a single disk, mirror, RAIDz1, RAIDz2, or RAIDz3. You can add more vdevs to a zpool, and you can attach more disks to a single or mirror vdev. But managing storage this way requires some planning ahead and budgeting—which hobbyists and homelabbers frequently aren't too enthusiastic about.

Conventional RAID, which does not share the "pool" concept with ZFS, generally offers the ability to expand and/or reshape an array in place. For example, you might add a single disk to a six-disk RAID6 array, thereby turning it into a seven-disk RAID6 array.
Undergoing a live reshaping can be pretty painful, especially on nearly full arrays; it's entirely possible that such a task might require a week or more, with array performance limited to a quarter or less of normal the entire time.

Historically, ZFS has eschewed this sort of expansion. ZFS was originally developed for business use, and live array re-shaping is generally a non-starter in the business world. Dropping your storage's performance to unusable levels for days on end generally costs more in payroll and overhead than buying an entirely new set of hardware would. Live expansion is also potentially very dangerous since it involves reading and re-writing all data and puts the array in a temporary and far less well-tested "half this, half that" condition until it completes.

For users with many disks, the new RAIDz expansion is unlikely to materially change how they use ZFS. It will still be both easier and more practical to manage vdevs as complete units rather than trying to muck about inside them. But hobbyists, homelabbers, and small users who run ZFS with a single vdev will likely get a lot of use out of the new feature.

How does it work?

[Image: In this slide, we see a four-disk RAIDz1 (left) expanded to a five-disk RAIDz1 (right). Note that the data is still written in four-wide stripes! Credit: Matthew Ahrens]

From a practical perspective, Ahrens' new vdev expansion feature merely adds new capabilities to an existing command, namely, zpool attach, which is normally used to add a disk to a single-disk vdev (turning it into a mirror vdev) or add an extra disk to a mirror (for example, turning a two-disk mirror into a three-disk mirror).

With the new code, you'll be able to attach new disks to an existing RAIDz vdev as well. Doing so expands the vdev in width but does not change the vdev type, so you can turn a six-disk RAIDz2 vdev into a seven-disk RAIDz2 vdev, but you can't turn it into a seven-disk RAIDz3.
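As a rough illustration of what that might look like in practice, the sketch below assembles (but does not run) a zpool attach invocation that targets a RAIDz vdev. The pool name tank, the vdev name raidz2-0, and the device path /dev/sdh are hypothetical placeholders, and the argument order assumes the form described for the pull request, which could change before release.

```python
import shlex


def build_raidz_attach_command(pool, raidz_vdev, new_disk):
    """Return the argv for attaching one new disk to an existing RAIDz vdev.

    Assumed form: zpool attach <pool> <raidz-vdev> <new-disk>.
    Nothing is executed here; the caller decides whether to run it.
    """
    return ["zpool", "attach", pool, raidz_vdev, new_disk]


# Hypothetical pool, vdev, and device names, for illustration only.
command = build_raidz_attach_command("tank", "raidz2-0", "/dev/sdh")
print(shlex.join(command))
```

Building the argument list separately from running it makes it easy to review the command before handing it to something like subprocess.run on a real system.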
Upon issuing your zpool attach command, the expansion begins. During expansion, each block or record is read from the vdev being expanded and is then rewritten. The sectors of the rewritten block are distributed among all disks in the vdev, including the new disk(s), but the width of the stripe itself is not changed. So a RAIDz2 vdev expanded from six disks to ten will still be full of six-wide stripes after expansion completes.

So while the user will see the extra space made available by the new disks, the storage efficiency of the expanded data will not have improved due to the new disks. In the example above, we went from a six-disk RAIDz2 with a nominal storage efficiency of 67 percent (four of every six sectors are data) to a ten-disk RAIDz2. Data newly written to the ten-disk RAIDz2 has a nominal storage efficiency of 80 percent—eight of every ten sectors are data—but the old expanded data is still written in six-wide stripes, so it still has the old 67 percent storage efficiency.

It's worth noting that this isn't an unexpected or bizarre state for a vdev to be in—RAIDz already uses a dynamic, variable stripe width to account for blocks or records too small to stripe across all the disks in a single vdev.

For example, if you write a single metadata block—the data containing a file's name, permissions, and location on disk—it fits within a single sector on disk. If you write that metadata block to a ten-wide RAIDz2, you don't write a full ten-wide stripe—instead, you write an undersized block only three disks wide; a single data sector plus two parity sectors. So the "undersized" blocks in a newly expanded RAIDz vdev aren't anything for ZFS to get confused about. They're just another day at the office.

Is there any lasting performance impact?

As we discussed above, a newly expanded RAIDz vdev won't look quite like one designed that way from "birth"—at least, not at first.
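To put numbers on that difference, here is a minimal sketch of the nominal storage-efficiency arithmetic from the six-to-ten-disk example, plus the three-disk-wide metadata block. The function names are made up for illustration, and the model deliberately ignores padding, compression, and other allocation details.

```python
def nominal_efficiency(width, parity):
    """Fraction of sectors in a full-width RAIDz stripe that hold data."""
    return (width - parity) / width


def stripe_width_for_block(data_sectors, vdev_width, parity):
    """Disks touched by one stripe row: data sectors (capped at the
    number of data columns) plus the parity sectors."""
    data_columns = vdev_width - parity
    return min(data_sectors, data_columns) + parity


old = nominal_efficiency(width=6, parity=2)   # blocks written before expansion
new = nominal_efficiency(width=10, parity=2)  # blocks written after expansion
print(f"six-wide RAIDz2 blocks: {old:.0%}")
print(f"ten-wide RAIDz2 blocks: {new:.0%}")

# A one-sector metadata block on a ten-wide RAIDz2:
# one data sector plus two parity sectors.
print(stripe_width_for_block(data_sectors=1, vdev_width=10, parity=2))
```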
Although there are more disks in the mix, the internal structure of the data isn't changed. Adding one or more new disks to the vdev means that it should be capable of somewhat higher throughput. Even though the legacy blocks don't span the entire width of the vdev, the added disks mean more spindles to distribute the work around.

This probably won't make for a jaw-dropping speed increase, though—six-wide stripes on a seven-disk vdev mean that you still can't read or write two blocks simultaneously, so any speed improvements are likely to be minor.

The net impact to performance can be difficult to predict. If you are expanding from a six-disk RAIDz2 to a seven-disk RAIDz2, for example, your original six-disk configuration didn't need any padding. A 128KiB block can be cut evenly into four 32KiB data pieces, with two 32KiB parity pieces. The same record split among seven disks requires padding because 128KiB/five data pieces doesn't come out to an even number of sectors.

Similarly, in some cases—particularly with a small recordsize or volblocksize set—the workload per individual disk may be significantly less challenging in the older, narrower layout than in the newer, wider one. A 128KiB block split into 32KiB pieces for a six-wide RAIDz2 can be read or written more efficiently per disk than one split into 16KiB pieces for a ten-wide RAIDz2, for example—so it's a bit of a crapshoot whether more disks but smaller pieces will provide more throughput than fewer disks but larger pieces did.

What you can reasonably expect is that the newly expanded configuration will typically perform as well as the original non-expanded version—and that once the majority of data is (re)written in the new width, the expanded vdev won't perform any differently, or be any less reliable, than one that was designed that way from the start.

Why not reshape records/blocks during expansion?
It might seem odd that the initial expansion process doesn't rewrite all existing blocks to the new width while it's running—after all, it's reading and re-writing the data anyway, right? We asked Ahrens why the original width was left as-is, and the answer boils down to "it's easier and safer that way."

One key factor to recognize is that technically, the expansion isn't moving blocks; it's just moving sectors. The way it's written, the expansion code doesn't need to know where ZFS' logical block boundaries are—the expansion routine has no idea whether an individual sector is parity or data, let alone which block it belongs to.

Expansion could traverse all the block pointers to locate block boundaries, and then it would know which sector belongs to what block and how to re-shape the block, but according to Ahrens, doing things that way would be extremely invasive to ZFS' on-disk format. The expansion would need to continually update spacemaps on metaslabs to account for changes in the on-disk size of each block—and if the block is part of a dataset rather than a zvol, update the per-dataset and per-file space accounting as well.

If it really makes your teeth itch knowing you have four-wide stripes on a freshly five-wide vdev, you can just read and re-write your data yourself after expansion completes. The simplest way to do this is to use zfs snapshot, zfs send, and zfs receive to replicate entire datasets and zvols. If you're not worried about ZFS properties, a simple mv operation will do the trick (as long as it moves the data to a different dataset; a mv within the same dataset is only a rename and rewrites nothing).

However, we'd recommend in most cases just relaxing and letting ZFS do its thing. Your undersized blocks from older data aren't really hurting anything, and as you naturally delete and/or alter data over the life of the vdev, most of them will get re-written naturally as necessary, without the need for admin intervention or long periods of high storage load due to obsessively reading and re-writing everything all at once.

When will RAIDz expansion hit production?
Ahrens' new code is not yet a part of any OpenZFS release, let alone added to anyone else's repositories. We asked Ahrens when we might expect to see the code in production, and unfortunately, it will be a while. It's too late for RAIDz expansion to be included in the upcoming OpenZFS 2.1 release, expected very soon (2.1 release candidate 7 is available now). It should be included in the next major OpenZFS release; it's too early for concrete dates, but major releases typically happen about once per year. Broadly speaking, we expect RAIDz expansion to hit production in the likes of Ubuntu and FreeBSD somewhere around August 2022, but that's just a guess. TrueNAS may very well put it into production sooner than that, since ixSystems tends to pull ZFS features from master before they officially hit release status. Matt Ahrens presented RAIDz expansion at the FreeBSD Developer Summit—his talk begins at 1 hour 41 minutes in this video.
Karlston posted a topic in Guides & Tutorials

ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner

We exhaustively tested ZFS and RAID performance on our Storage Hot Rod server.

[Image: Neither the stopwatch nor the denim jacket is strictly necessary, if we're being honest about it. Credit: Aurich Lawson / Getty]

This has been a long while in the making—it's test results time. To truly understand the fundamentals of computer storage, it's important to explore the impact of various conventional RAID (Redundant Array of Inexpensive Disks) topologies on performance. It's also important to understand what ZFS is and how it works. But at some point, people (particularly computer enthusiasts on the Internet) want numbers.

First, a quick note: This testing, naturally, builds on those fundamentals. We're going to draw heavily on lessons learned as we explore ZFS topologies here. If you aren't yet entirely solid on the difference between pools and vdevs or what ashift and recordsize mean, we strongly recommend you revisit those explainers before diving into testing and results.

And although everybody loves to see raw numbers, we urge an additional focus on how these figures relate to one another. All of our charts relate the performance of ZFS pool topologies at sizes from two to eight disks to the performance of a single disk. If you change the model of disk, your raw numbers will change accordingly—but for the most part, their relation to a single disk's performance will not.

Equipment as tested

We used the eight empty bays in our Summer 2019 Storage Hot Rod for this test. It's got oodles of RAM and more than enough CPU horsepower to chew through these storage tests without breaking a sweat.
Specs at a glance: Summer 2019 Storage Hot Rod, as tested
OS: Ubuntu 18.04.4 LTS
CPU: AMD Ryzen 7 2700X—$250 on Amazon
RAM: 64GB ECC DDR4 UDIMM kit—$459 at Amazon
Storage Adapter: LSI-9300-8i 8-port Host Bus Adapter—$148 at Amazon
Storage: 8x 12TB Seagate Ironwolf—$320 ea at Amazon
Motherboard: Asrock Rack X470D4U—$260 at Amazon
PSU: EVGA 850GQ Semi Modular PSU—$140 at Adorama
Chassis: Rosewill RSV-L4112—Typically $260, currently unavailable due to CV19

The Storage Hot Rod's also got a dedicated LSI-9300-8i Host Bus Adapter (HBA) which isn't used for anything but the disks under test. The first four bays of the chassis have our own backup data on them—but they were idle during all tests here and are attached to the motherboard's SATA controller, entirely isolated from our test arrays.

How we tested

As always, we used fio to perform all of our storage tests. We ran them locally on the Hot Rod, and we used three basic random-access test types: read, write, and sync write. Each of the tests was run with both 4K and 1M blocksizes, and we ran the tests both with a single process and iodepth=1 as well as with eight processes with iodepth=8.

For all tests, we're using ZFS on Linux 0.7.5, as found in main repositories for Ubuntu 18.04 LTS. It's worth noting that ZFS on Linux 0.7.5 is two years old now—there are features and performance improvements in newer versions of OpenZFS that weren't available in 0.7.5.

We tested with 0.7.5 anyway—much to the annoyance of at least one very senior OpenZFS developer—because when we ran the tests, 18.04 was the most current Ubuntu LTS and one of the most current stable distributions in general. In the next article in this series—on ZFS tuning and optimization—we'll update to the brand-new Ubuntu 20.04 LTS and a much newer ZFS on Linux 0.8.3.

Initial setup: ZFS vs mdraid/ext4

When we tested mdadm and ext4, we didn't really use the entire disk—we created a 1TiB partition at the head of each disk and used those 1TiB partitions.
We also had to invoke arcane arguments—mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0—to avoid ext4's preallocation from contaminating our results.

Using these relatively small partitions instead of the entire disks was a practical necessity, since ext4 needs to grovel over the entire created filesystem and disperse preallocated metadata blocks throughout. If we had used the full disks, the usable space on the eight-disk RAID6 topology would have been roughly 65TiB—and it would have taken several hours to format, with similar agonizing waits for every topology tested.

ZFS, happily, doesn't need or want to preallocate metadata blocks—it creates them on the fly as they become necessary instead. So we fed ZFS each 12TB Ironwolf disk in its entirety, and we didn't need to wait through lengthy formatting procedures—each topology, even the largest, was ready for use a second or two after creation, with no special arguments needed.

ZFS vs conventional RAID

A conventional RAID array is a simple abstraction layer that sits between a filesystem and a set of disks. It presents the entire array as a virtual "disk" device that, from the filesystem's perspective, is indistinguishable from an actual, individual disk—even if it's significantly larger than the largest single disk might be.

ZFS is an entirely different animal, and it encompasses functions that normally might occupy three separate layers in a traditional Unixlike system. It's a logical volume manager, a RAID system, and a filesystem all wrapped into one. Merging traditional layers like this has caused many a senior admin to grind their teeth in outrage, but there are very good reasons for it.

There is an absolute ton of features ZFS offers, and users unfamiliar with them are highly encouraged to take a look at our 2014 coverage of next-generation filesystems for a basic overview as well as our recent ZFS 101 article for a much more comprehensive explanation.
Megabytes vs Mebibytes

As in the last article, our units of performance measurement here are kibibytes (KiB) and mebibytes (MiB). A kibibyte is 1,024 bytes, a mebibyte is 1,024 kibibytes, and so forth—in contrast to a kilobyte, which is 1,000 bytes, and a megabyte, which is 1,000 kilobytes.

Kibibytes and their big siblings have always been the standard units for computer storage. Prior to the 1990s, computer professionals simply referred to them as K and M—and used the inaccurate metric prefixes when they spelled them out. But any time your operating system refers to GB, MB, or KB—whether in terms of free space, network speed, or amounts of RAM—it's really referring to GiB, MiB, and KiB.

Storage vendors, unfortunately, eventually seized upon the difference between the metrics as a way to more cheaply produce "gigabyte" drives and then "terabyte" drives—so a 500GB SSD is really only 465 GiB, and 12TB hard drives like the ones we're testing today are really only 10.9TiB each.

Testing and analysis, using ZFS default settings

As we did with the mdraid performance tests, we used fio to test our Ironwolf disks under ZFS. Once again, we're focusing entirely on random access, in two block sizes: 4KiB and 1MiB, which we test for read, write, and synchronous write on all topologies.

We also have some additional variables in play when testing ZFS. We wanted to show what happens when you misconfigure the ashift value (which sets the sector size) and also what happens when you tune recordsize to better reflect your workload. In order to focus on all this juicy and relevant data, we needed to cut some fluff—so we're only looking at multi-process operations this time, with fio set to numjobs=8 and iodepth=8.

The major reason we tested single-process operations in the mdraid performance article was to very directly demonstrate that multiple-disk topologies tend not to accelerate single-threaded workloads.
That's still true here—and while we did test single-threaded workloads against ZFS, there was nothing to be seen there that you don't see in the otherwise much more interesting multi-threaded, 8-process workloads. So that's what we're focusing on.

Performance scales with vdevs

One of the most common—and most pernicious—myths I encounter when talking to people about storage is the idea that performance scales well with the number of disks in a vdev, rather than the number of vdevs in a pool.

The very mistaken idea—which seems reasonable on the surface—is that as the number of data chunks in a stripe goes up, performance goes up with it. If you have an eight-disk RAIDz2, you've got six "data disks" per stripe, so six times the performance, give or take—right? Meanwhile, an 8-disk pool of mirrors only has four "data disks" per stripe, so—lower performance!

In the above charts, we show performance trends per vdev for single-disk, 2-disk mirror, and 4-disk RAIDz2 vdevs. These are shown in solid lines, and the "joker"—a single RAIDz2 vdev, becoming increasingly wide—is the dark, dashed line.

So at n=4, we're looking at four single-disk vdevs, four 2-disk mirror vdevs—and a single, 6-wide RAIDz2 vdev. Remember, n for the joker isn't the total number of disks—it's the total number of disks, minus parity.

Let's take a look at n=2 on the 1M Async Write chart above. For single disk vdevs, two-wide mirror vdevs, and four-wide RAIDz2 vdevs, we see just the scaling we'd expect—at 202 percent, 203 percent, and 207 percent of the performance of a single vdev, of the same class.

The RAIDz2 "joker" at n=2, on the other hand, is four disks wide—and since RAIDz2 is dual parity, that means it's got two "data disks," hence n=2. And we can see that it's underperforming badly compared to the per-vdev lines, with only 160 percent of the performance of a single "data disk."
The trend only gets worse as the single RAIDz gets wider, while the trend-lines for per-vdev scaling keep a clean, positive linear slope. We see similar trends in 4K writes and 1M reads alike. The closest the increasingly wide single-RAIDz2 vdev comes to linear scale is on 1MiB reads, where at first its per-disk scale appears to be keeping up with per-vdev scale. But it sharply falls off after n=4—and that trend will continue to get worse, as the vdev gets wider.

We'll talk about how and why those reads are falling off so sharply later—but for now, let's get into some simple performance tests, with raw numbers.

RAIDz2 vs RAID6—default settings

If you carefully test RAIDz2 versus RAID6, the first thing that stands out is just how fast the writes are. Even sync writes, which you might intuitively expect to be slower on ZFS—since they must often be "double-committed," once to ZIL and once again to main storage—are significantly faster than they were on mdraid6.

Uncached reads, unfortunately, trend the other way—RAIDz vdevs tend to pay for their fast writes with slow reads. This disadvantage tends to be strongly offset in the real world due to the ARC's higher cache hit ratio as compared to the simple LRU cache used by the kernel, but that effect is difficult or impossible to estimate in simple, synthetic tests like these.

On servers with heavy, concurrent mixed read/write workloads, the effect of RAIDz's slow reads can also be offset by how much faster the writes are going—remember, storage is effectively half-duplex; you can't read and write at the same time. Any decrease in utilization on the write side will show up as increased availability on the read side. Storage is a complex beast!

Caveats, hedges, and weasel words aside, let's make this clear—RAIDz is not a strong performer for uncached, pure read workloads.
For the most part, we're seeing the same phenomenon in the 4KiB scale that we did at 1MiB—RAIDz2 handily beats RAID6 at writes, but it gets its butt handed to it on reads. What might not be quite so obvious is that RAIDz2 is suffering badly from a misconfiguration here—most users' real-world experience of 4KiB I/O is due to small files, such as the dotfiles in a Linux user's home directory, the similar .INIs and what have you in a Windows user's home directory, and the hordes of small .INI or .conf files most systems are plagued with.

But the 4KiB RAIDz2 performance you're seeing here is not the RAIDz2 performance you'd be seeing with 4KiB files. You see, fio writes single, very large files and seeks inside them. Since we're testing on the default recordsize of 128KiB, that means that each RAIDz2 4KiB read is forced to pull in an additional 124KiB of useless, unwanted data.

Later, we'll see what happens when we properly tune our ZFS system for an appropriate recordsize—which much more closely approximates real-world experience with 4KiB I/O in small files, even without tuning. But for now, let's keep everything on defaults and move ahead to performance when using two-wide mirror vdevs.

ZFS mirror vdevs vs RAID10—default settings

The comparison between mirror vdevs and RAID10 is a fun one, because these are easily the highest-performing topologies. Who doesn't like big numbers?

At first blush, the two seem pretty evenly matched. In 1MiB write, sync write, and uncached read, both systems exhibit near-linear scale and positive slope, and they tend to be pretty close. RAID10 clearly has the upper hand when it comes to pure uncached reads—but unlike RAIDz2 vs RAID6, the lead isn't enormous.

We're leaving a lot of performance on the table, though, by sticking to default settings.
We'll revisit that, but for now let's move on to see how mirrors and RAID10 fare with 4KiB random I/O.

Although the curve is a little tweaky, we see Linux RAID10 clearly outperform ZFS mirrors on 4KiB writes—a first, for mdraid and ext4. But when we move on to sync 4KiB writes, the trend reverses. RAID10 is unable to keep up with even a single ext4 disk, while the pool of mirrors soars to better than 500 percent of a single ext4 disk's performance.

Moving on to 4KiB reads, we once again see the ZFS topology suffering due to misconfiguration. Having left our recordsize at the default 128KiB, we're reading in an extra 124KiB with every 4KiB we actually want. In some cases, we can get some use out of that unnecessary data later—by caching it and servicing future read requests from the cache instead of from the metal. But we're working with a large dataset, so very few of those cached "extras" are ever of any use.

Retesting ZFS with recordsize set correctly

We believe it's important to test things the way they come out of the box. The shipping defaults should be sane defaults, and that's a good place for everyone to start from.

While the ZFS defaults are reasonably sane, fio doesn't interact with disks in quite the same way most users normally do. Most user interaction with storage can be characterized by reading and writing files in their entirety—and that's not what fio does. When you ask fio to show you random read and write behavior, it creates one very large file for each testing process (eight of them, in today's tests), and that process seeks within that large file.

With the default recordsize=128K, ZFS will store a 4KiB file in an undersized record, which only occupies a single 4KiB sector—and reads of that file later will also only need to light up a single 4KiB sector on disk.
But when performing 4KiB random I/O with fio, since the 4KiB requests are pieces of a very large file, ZFS must read (and write) the requests in full-sized 128KiB increments.

Although the impact is somewhat smaller, the default 128KiB recordsize also penalizes large file access. After all, it's not exactly optimal to store and retrieve an 8MiB digital photo in 64 128KiB blocks, rather than only 8 1MiB blocks.

In this section, we're going to set recordsize=4K (with zfs set) for the 4KiB random I/O tests, and recordsize=1M for the 1MiB random I/O tests.

Is ZFS getting "special treatment" here?

An experienced sysadmin might reasonably object to ZFS being given special treatment while mdraid is left to its default settings. But there's a reason for that, and it's not just "we really like ZFS."

While you can certainly tune chunk size on a kernel RAID array, any such tuning affects the entire device globally. If you tune a 48TB mdraid10 for 4KiB I/O, it's going to absolutely suck at 1MiB I/O—and similarly, a 48TB mdraid10 tuned for 1MiB I/O will perform horribly at 4KiB I/O. To fix that, you must destroy the entire array and any filesystems and data on it, recreate everything from scratch, and restore your data from backup—and it can still only be tuned for one performance use case.

In sharp contrast, if you've got a 48TB ZFS pool, you can set recordsize per dataset—and datasets can be created and destroyed as easily as folders. If your ZFS server has 20TiB of random user-saved files (most of which are several MiB, such as photos, movies, and office documents) along with a 2TiB MySQL database, each can coexist peacefully and simply:

    zfs create pool/samba
    zfs set recordsize=1M pool/samba
    zfs create pool/mysql
    zfs set recordsize=16K pool/mysql

Just like that, you've created what look like "folders" on the server which are optimized for the workloads to be found within.
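The recordsize arithmetic from this section can be checked with a few lines of Python. This is a simplified sketch, not a model of ZFS internals: it assumes whole records are read for requests inside a large file and ignores caching and compression, and the helper names are invented for the example.

```python
KIB = 1024
MIB = 1024 * KIB


def bytes_read_for_request(request_size, recordsize):
    """Bytes pulled from disk for one aligned request inside a large file,
    assuming every touched record is read in full."""
    records = -(-request_size // recordsize)  # ceiling division
    return records * recordsize


def records_for_file(file_size, recordsize):
    """Number of records needed to hold a file of the given size."""
    return -(-file_size // recordsize)


# A 4KiB random read inside a large file, at the default and tuned recordsize.
for recordsize in (128 * KIB, 4 * KIB):
    read = bytes_read_for_request(4 * KIB, recordsize)
    extra_kib = (read - 4 * KIB) // KIB
    print(f"recordsize={recordsize // KIB}K: {extra_kib}KiB of unwanted data per 4KiB read")

# An 8MiB photo stored at the default recordsize versus recordsize=1M.
print(records_for_file(8 * MIB, 128 * KIB), "records at 128K")
print(records_for_file(8 * MIB, 1 * MIB), "records at 1M")
```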
If your users create a bunch of 4KiB files, that's fine—the 4KiB files will still only occupy one sector, while the larger files reap the benefit of similarly large logical block sizes. Meanwhile, the MySQL database gets a recordsize which perfectly matches its own internal 16KiB pagesize, optimizing performance there without hurting it on the rest of the server.

If you install a PostgreSQL instance later, you can tune for its default 8KiB page size just as easily:

    zfs create pool/postgres
    zfs set recordsize=8K pool/postgres

And if you later re-tune your MySQL instance to use a larger or smaller page size, you can re-tune your ZFS dataset to match. If all you do is change recordsize, the already-written data won't change, but any new writes to the database will follow the dataset's new recordsize parameter. (If you want to re-write the existing data structure, you also need to do a block-for-block copy of it, e.g., with the mv command across datasets; a mv within the same dataset is only a rename.)

ZFS recordsize=1M—large blocks for large files

We know everybody loves to see big performance numbers, so let's look at some. In this section, we're going to re-run our earlier 1MiB read, write, and sync write workloads against ZFS datasets with recordsize=1M set.

We want to reiterate that this is a pretty friendly configuration for any normal "directory full of files" type of situation—ZFS will write smaller files in smaller blocks automatically. You really only need smaller recordsize settings in special cases with a lot of random access inside large files, such as database binaries and VM images.

RAIDz2 vs RAID6—1MiB random I/O, recordsize=1M

RAIDz2 writes really take off when its recordsize is tuned to fio's workload. Its 1MiB asynchronous writes leap from 568MiB/sec to 950MiB/sec, sometimes higher.
These are rewrites of an existing fio workload file: The first fio write test always goes significantly faster on ZFS storage than additional test runs re-using the same file do. This effect doesn't increase, however. The second, 20th, and 200th run will always be the same. But in the interests of fairness, we throw away that first, significantly higher test run for ZFS. In this case, that first "throwaway" async write run was at 1,238MiB/sec.

Sync writes are similarly boosted, with RAIDz2 turning in an additional 54MiB/sec over its untuned results, more than doubling RAID6's already-lagging performance.

Unfortunately, tuning recordsize didn't help our 1MiB uncached reads—although the eight-wide RAIDz2 vdev improved by just under 100MiB/sec, RAIDz2 still lags significantly behind mdraid6 with five or more disks in the vdev. A pool with two four-wide RAIDz2 vdevs (not shown above) comes much closer, pulling 406MiB/sec to eight-wide mdraid6's 485MiB/sec.

RAIDz2 vs RAID6—4KiB random I/O, recordsize=4K

Moving along to 4KiB random access I/O, we see the same broad trends observed above. In both sync and async writes, RAIDz2 drastically outperforms RAID6. When committing small writes, RAID6 arrays fail to keep up with even a single ext4 disk.

This trend reverses, again, when shifting to uncached reads. Wide RAIDz2 vdevs do at least manage to outperform a single ext4 disk, but they lag significantly behind mdraid and ext4, with an eight-wide RAIDz2 vdev being outperformed roughly 4:1.

When deciding between these two topologies on a performance basis, the question becomes whether you'd prefer to have a 20:1 increase in write performance at the expense of a 4:1 decrease in reads, or vice versa. On the surface of it, this sounds like a no-brainer—but different workloads are different.
A cautious—and wise!—admin would be well advised to do workload-specific testing both ways before making a final decision, if performance is the only metric that matters.

ZFS mirror vdevs vs RAID10—1MiB random I/O, recordsize=1M

If you were looking for an unvarnished performance victory for team ZFS, here's where you'll find it. RAID10 is the highest performing conventional RAID topology on every metric we test, and a properly-tuned ZFS pool of 2-wide mirror vdevs outperforms it everywhere.

We can already hear conventional RAID fans grumbling that their hardware RAID would outrun that silly ZFS, since it's got a battery or supercapacitor backed cache and will therefore handle sync writes about as rapidly as standard asynchronous writes—but the governor's not entirely off on the ZFS side of that race, either. (We'll cover the use of a LOG vdev to accelerate sync writes in a different article.)

ZFS mirror vdevs vs RAID10—4KiB random I/O, recordsize=4K

Down in the weeds at 4KiB blocksize, our pool of mirrors strongly outperforms RAID10 on both synchronous and asynchronous writes. It does, however, lose to RAID10 on uncached 4KiB reads.

Much like the comparison of RAIDz2 vs RAID6, however, it's important to look at the ratios. Although a pure 4KiB uncached random read workload performs not quite twice as well on RAID10 as on ZFS mirrors, such a workload is probably fairly rare—and the write performance advantage swings 5:1 in the other direction (or 12:1, for sync writes).

Most 4KiB-heavy workloads will also be constantly saturated workloads, as the disks are constantly thrashing trying to keep up with demand—meaning that it wouldn't take many write operations to overwhelm RAID10's 4KiB read performance benefits with its write performance discrepancies.
4KiB random read workloads also tend to heavily favor better cache algorithms. We did not attempt to test cache efficiency here, but an ARC can safely be assumed to strongly outperform a simple LRU on nearly any workload.

Why are RAIDz2 reads so slow?

There's one burning question that needs to be answered after all this testing: Why are RAIDz reads so much slower than conventional RAID6 reads? With recordsize tuned appropriately for workload, RAIDz2 outperforms RAID6 on writes by as much as 20:1, which makes it that much more confusing why reads would be slower.

The answer is fairly simple, and it largely amounts to the flip side of the same coin. Remember the RAID hole? Conventional RAID6 is not only willing but effectively forced to pack multiple blocks/files into the same stripe, since it doesn't have a variable stripe width. In addition to opening up the potential for corruption due to partial stripe write, this subjects RAID6 arrays to punishing read-modify-write performance penalties when writing partial stripes.

RAIDz2, on the other hand, writes every block or file as a full stripe, even very small ones, by adjusting the width of the stripe. We captured the difference between the two topologies' 1MiB reads in the series of screenshots above.

When recordsize is set to 1M, a 1MiB block gets carved into roughly 176KiB chunks and distributed among six of the eight disks in an eight-wide RAIDz2, with the other two disks each carrying a roughly 176KiB parity chunk. So when we read a 1MiB block from our eight-wide RAIDz2, we light up six of eight disks to do so.

By contrast, the same disks in an eight-wide RAID6 default to a 512KiB chunk size, which means 512KiB of data (or parity) is written to each disk during a RAID6 write. When we go back to read that data from the RAID6, we only need to light up two of our eight disks for each block, as compared to RAIDz2's six.
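The read fan-out described above can be sketched with a little arithmetic. This is a simplified model, not how either implementation actually lays out data: it assumes the RAIDz block is split evenly across every data disk with no padding, and that the RAID6 read is aligned to a chunk boundary. The function names are hypothetical. The naive even split comes out a little under the roughly 176KiB quoted above, presumably because of on-disk rounding that this sketch does not model.

```python
import math

KIB = 1024
MIB = 1024 * KIB
SECTOR = 4 * KIB  # 4KiB on-disk sectors, as in the article


def raidz_read(block_bytes, width, parity):
    """Return (data disks read, KiB per disk) for a block split evenly
    across every data disk of a RAIDz vdev. Ignores padding and rounding."""
    data_disks = width - parity
    return data_disks, block_bytes / data_disks / KIB


def raid6_read(block_bytes, chunk_bytes):
    """Return (disks read, 4KiB sectors per disk) for a chunk-aligned read
    from a conventional RAID6 array with a fixed chunk size."""
    disks = math.ceil(block_bytes / chunk_bytes)
    sectors_per_disk = min(block_bytes, chunk_bytes) // SECTOR
    return disks, sectors_per_disk


z_disks, z_kib = raidz_read(1 * MIB, width=8, parity=2)
r_disks, r_sectors = raid6_read(1 * MIB, chunk_bytes=512 * KIB)
print(f"RAIDz2: {z_disks} disks, about {z_kib:.0f}KiB each")
print(f"RAID6:  {r_disks} disks, {r_sectors} sectors each")
```

Fewer disks each doing a larger contiguous read is what lets the RAID6 array serve more concurrent 1MiB reads at once.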
In addition, the two disks we light up on RAID6 are performing a larger, higher-efficiency operation. They're reading 128 contiguous 4KiB on-disk sectors, as compared to RAIDz2's six disks only reading 44 contiguous 4KiB on-disk sectors for each operation.

If we want to get deep into the weeds, we could more extensively tune the RAIDz2 to work around this penalty: we could set zfs_max_recordsize=4194304 in /etc/modprobe.d/zfs.conf, export all pools and reload the ZFS kernel module, then zfs set recordsize=4M on the dataset.

Setting a 4MiB recordsize would mean that each disk gets 512KiB chunks, just like RAID6 does, and performance would go up accordingly: if we write a 1MiB record, it gets stored on two disks in 512KiB chunks. And when we read it back, we read that 1MiB record by lighting up only those two disks, just like we did on RAID6.

Unfortunately, that also means storage efficiency goes down, because that 1MiB record was written as an undersized stripe, with two chunks of data and two chunks of parity. So now we're performing as well as or better than RAID6, but we're at 50 percent storage efficiency while it's still at 75 percent (six chunks out of every eight are data).

To be fair, this is a problem for the eight-process, iodepth=8 reads we tested, but not for single-process, iodepth=1 reads, which we tested but did not graph. For single-process reads, RAIDz2 significantly outperforms RAID6 (at 129MiB/sec to 47MiB/sec), and for the exact same reason. It lights up three times as many disks to read the same 1MiB of data.

TANSTAAFL: There Ain't No Such Thing As A Free Lunch.

Conclusions

If you're looking for raw, unbridled performance it's hard to argue against a properly-tuned pool of ZFS mirrors. RAID10 is the fastest per-disk conventional RAID topology in all metrics, and ZFS mirrors beat it resoundingly (sometimes by an order of magnitude) in every category tested, with the sole exception of 4KiB uncached reads.
ZFS' implementation of striped parity arrays, the RAIDz vdev type, is a bit more of a mixed bag. Although RAIDz2 decisively outperforms RAID6 on writes, it underperforms it significantly on 1MiB reads. If you're implementing a striped parity array, 1MiB is hopefully the blocksize you're targeting in the first place, since those arrays are particularly awful with small blocksizes.

When you add in the wealth of additional features ZFS offers (incredibly fast replication, per-dataset tuning, automatic data healing, high-performance inline compression, instant formatting, dynamic quota application, and more), we think it's difficult to justify any other choice for most general-purpose server applications.

ZFS still has more performance options to offer. We haven't yet covered the support vdev classes: LOG, CACHE, and SPECIAL. We'll cover those, and perhaps experiment with recordsize larger than 1MiB, in another fundamentals of storage chapter soon.

Source: ZFS versus RAID: Eight Ironwolf disks, two filesystems, one winner (Ars Technica) (To view the article's image galleries, please visit the above link)
zanderthunder posted a topic in Software News

Canonical has announced the release of Ubuntu 19.10 Eoan Ermine. The new version will be supported for nine months, as is the case with all non-LTS releases, and is the final version before the next LTS update in April 2020. With this update, Canonical has updated the core pieces of software including GNOME, which was bumped to version 3.34. Additionally, users can also use ZFS on the root file system as an experimental feature.

While the standard Ubuntu release will be what most people flock to, the 19.10 update is also available for other Ubuntu flavours such as Ubuntu Budgie, Kubuntu, Lubuntu, Ubuntu Kylin, Ubuntu MATE, Ubuntu Studio, and Xubuntu. Each ships with a different desktop environment, and one may suit your particular needs better than another.

Aside from GNOME 3.34 and ZFS support, Ubuntu 19.10 also ships with Linux kernel 5.3, which includes support for new hardware. If you use recent hardware and have had issues getting Ubuntu to recognise components, now could be a good time to have another look to see if everything is working.

Installing Ubuntu isn’t too difficult a task: just head over to the Ubuntu website and download the ISO you’d like. Once that’s done, write the ISO to a DVD or USB drive and begin the installation. You can find more information on this in Canonical’s tutorial.

Source: Ubuntu 19.10 Eoan Ermine released with GNOME 3.34 and ZFS support (via Neowin)
Karlston posted a topic in Software News

Ubuntu 20.04’s zsys adds ZFS snapshots to package management

ZFS for the masses is on the way with Ubuntu's zsys management system.

[Image: This is a Fossa. It appears to be focusing. (Cryptoprocta ferox is a small, catlike carnivore native to Madagascar.) Credit: Mathias Appel]

Last October, an experimental ZFS installer showed up in Eoan Ermine, the second interim Ubuntu release of 2019. Next month, Focal Fossa, Ubuntu's next LTS (Long Term Support) release, is due to drop, and it retains the ZFS installer while adding several new features to Ubuntu's system management with the fledgling zsys package.

Phoronix reported this weekend that zsys is taking snapshots prior to package-management operations now, so we decided to install the latest Ubuntu 20.04 daily build and see how the new feature works.

Taking Focal Fossa for a quick spin

Focal installs much as any other Ubuntu release has, but it retains 19.10's ZFS installer, which is still hidden behind "advanced features" and still labeled experimental. After selecting a ZFS install, you give your OK to the resulting partition layout, with one primary partition for UEFI boot and three logical partitions for swap, boot ZFS pool, and root ZFS pool. A few minutes later, you've got yourself an Ubuntu installation.

A quick look under the hood

After installing Fossa, the first thing we did was verify the installed version of zsys. The apt management snapshots were added very recently in 0.4.1, and we've learned not to take for granted what's installed on beta or pre-beta daily builds of Linux distributions. Zsys was, in fact, already installed by default and was at version 0.4.1.
There weren't any snapshots on the freshly installed system yet, so we did a quick apt install gimp. Afterward, we saw that zsys had taken a snapshot in every dataset present on rpool. Having a snapshot taken prior to installing new packages means that, if something should go haywire, we can easily revert the system to its state prior to the new package being installed. Carving the system up into so many different datasets means, in turn, that we can roll back only those parts of the system affected by the package manager—for example, we can roll back packages without affecting data in the user's home directory. After installing gimp and seeing new snapshots available, we tried installing a second package. One apt install pv later, we again checked for snapshots. Although we still found the snapshots taken prior to installing gimp, there were no new snapshots to roll back our pv installation. After several more experimental installations and removals with no new snapshots, we started grep-ing our way through the /etc directory to find out why. In apt.conf.d we find a config file named 90_zsys_system_autosnapshot that adds a pre-install hook to dpkg. This pre-install hook calls zsys-system-autosnapshot prior to making any changes to the package system. We weren't sure why we hadn't gotten any new snapshots, so we tried running zsys-system-autosnapshot directly—still no new snapshot. When we then took a look at zsys-system-autosnapshot itself, the reason for no new snapshots being taken was obvious. A minimum interval is built into that script so that it exits without doing anything if it has been less than 20 minutes since the last time it took snapshots. We're pretty dubious about this minimum-interval feature. On the one hand, once you accumulate a few thousand snapshots, you can begin seeing filesystem performance issues. On the other hand, we foresee a lot of problematic package installations not getting covered with snapshots this way. 
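The minimum-interval behavior is easy to model. The following is a minimal sketch of that kind of guard, not the actual zsys-system-autosnapshot script, whose contents are not reproduced here. The function names and the simulated install times are hypothetical; only the 20-minute floor comes from the description above.

```python
import time

MIN_INTERVAL_SECONDS = 20 * 60  # the 20-minute floor described above


def should_snapshot(last_snapshot_time, now=None,
                    min_interval=MIN_INTERVAL_SECONDS):
    """Return True when no snapshot exists yet, or when at least
    min_interval seconds have passed since the previous one."""
    if now is None:
        now = time.time()
    if last_snapshot_time is None:
        return True
    return (now - last_snapshot_time) >= min_interval


def simulate(install_times):
    """For each install time (in seconds), report whether a snapshot
    would have been taken before that install."""
    last = None
    covered = []
    for t in install_times:
        if should_snapshot(last, now=t):
            last = t  # a snapshot is taken and the timer restarts
            covered.append(True)
        else:
            covered.append(False)
    return covered


# Hypothetical timeline: installs at 0, 5, 10, and 25 minutes.
print(simulate([0, 300, 600, 1500]))
```

In this model the installs at 5 and 10 minutes get no snapshot of their own, which is the gap in coverage we ran into with pv.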
Zsys is still early in development

We should note that zsys is nowhere near complete yet. The tool promises all manner of added functionality, and it's already useful, but it's still missing so much of the polish that normal users will need to see.

We can see that zsys refers to these automatically generated snapshots as "system state," that zsysctl save will take those snapshots, and that zsysctl show will give us a high-level overview of what sets of state have been saved. But there's no corresponding zsysctl load yet, and until there is, trying to use these saves to actually recover from disaster will remain a little more "expert" of an operation than it ought to be.

Ubuntu's ZFS installer carves up the base system into a bewildering 21 separate datasets, so zsys really needs that high-level rollback assistant. It's easy enough to roll back any individual dataset using the zfs command itself (e.g., zfs rollback rpool/USERDATA/[email protected]_pmxbuj), but we don't anticipate users having a good time navigating such commands.

We fully expect zsysctl to add functionality for easier rollbacks eventually. It's just not here yet.

Source: Ubuntu 20.04’s zsys adds ZFS snapshots to package management (Ars Technica) (To view the article's image galleries, please visit the above link)