sharkytm
Oct 9, 2003

Ba

By

Sharkytm doot doo do doot do doo


Fallen Rib

taqueso posted:

Yeah, it's obviously nice to have but millions of people get away without somehow

Millions of people aren't running ZFS. However, I don't think it's worth selling usable hardware to buy "server-grade" stuff unless you NEED IPMI, Xeon support, dual-Xeon, etc. Or you want homelab cred, FWIW.

Both my FreeNAS systems are running ECC, but that's because the motherboards require it. ECC UDIMMs, too, which aren't cheap the way registered ECC is.

I'd just run what you've got; at worst, you have to build a new box and transfer the data over. If you're using ZFS, that's stupid-simple (see the sketch below). If not, grab a couple of backup disks, back it all up, and copy it over. Also, RAID (including RAIDZx) isn't backup.
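
A minimal sketch of that transfer, assuming a source pool named tank and a destination pool named newtank on a box called newbox (all three names made up for the example):

code:
# Recursive snapshot, then replicate the whole pool to the new box.
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | ssh newbox zfs receive -Fu newtank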


Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

D. Ebdrup posted:

Say what?!

https://blocksandfiles.com/2020/04/15/shingled-drives-have-non-shingled-zones-for-caching-writes/

Yep. They use a 30-100GB non-shingled "CMR area" as a write buffer and then rewrite the shingled zones behind the scenes. But if you write too much at once, the drive hangs until it's ready to accept more, which can take multiple minutes, and the controller naturally drops the drive.

Most home users don't fill the whole thing at once, so you may not run into this at first, but when you hit it with a rebuild it's quite likely to drop.

Oh, and to make this all worse, what you think is a read/write operation the HDD can just scan straight through without seeking may not be sequential at all. Like an SSD, it's got its own "page tables" internally, and the logical mappings don't follow the physical layout. It's kind of an obvious approach when you think about the CMR area (it needs some mapping to know what's in the buffer and what's been flushed out), but it's not the behavior you expect from an HDD.

Additionally, WD has a bug (or a differing interpretation of the "correct" behavior) where, when ZFS tries to read an area the drive knows hasn't been written (because it's not in the page tables), the drive throws a hardware error, and that probably also leads to the drive being dropped.
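
If you want to see the stall for yourself, something like this should provoke it (just a sketch: the device name, size, and log interval are placeholders, and it will destroy whatever is on the target drive) by pushing sustained sequential writes well past the CMR staging area while logging latency:

code:
# Watch the latency log for multi-second spikes once the CMR buffer fills.
fio --name=smr-fill --filename=/dev/sdX --rw=write --bs=1M --direct=1 \
    --size=300G --write_lat_log=smr-fill --log_avg_msec=1000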

Paul MaudDib fucked around with this message at 22:31 on Apr 15, 2020

DrDork
Dec 29, 2003
commanding officer of the Army of Dorkness
Welp. Glad I decided to populate my NAS with 8TB drives. Seriously, though, that's bad juju all the way around. I could see (and sorta agree with) using SMR on a Green/Blue drive where you can reasonably assume the user won't need to push more than 100GB at once, and doing so lets you get more storage at a given price point. But putting it in Reds that they're actively advertising for NAS usage is straight-up irresponsible.

BlankSystemDaemon
Mar 13, 2009



sharkytm posted:

Millions of people aren't running ZFS.
Millions of people probably have their data stored on ZFS without knowing it.

Paul MaudDib posted:

https://blocksandfiles.com/2020/04/15/shingled-drives-have-non-shingled-zones-for-caching-writes/

Yep. They use a 30-100GB non-shingled "CMR area" as a write buffer and then rewrite the shingled zones behind the scenes. But if you write too much at once, the drive hangs until it's ready to accept more, which can take multiple minutes, and the controller naturally drops the drive.

Most home users don't fill the whole thing at once, so you may not run into this at first, but when you hit it with a rebuild it's quite likely to drop.

Oh, and to make this all worse, what you think is a read/write operation the HDD can just scan straight through without seeking may not be sequential at all. Like an SSD, it's got its own "page tables" internally, and the logical mappings don't follow the physical layout. It's kind of an obvious approach when you think about the CMR area (it needs some mapping to know what's in the buffer and what's been flushed out), but it's not the behavior you expect from an HDD.

Additionally, WD has a bug (or a differing interpretation of the "correct" behavior) where, when ZFS tries to read an area the drive knows hasn't been written (because it's not in the page tables), the drive throws a hardware error, and that probably also leads to the drive being dropped.
Welp, not buying any more WD or HGST drives, then. I can't afford 10TB or bigger.
Are there even vendors that sell non-poo poo 6TB drives?

DrDork
Dec 29, 2003
commanding officer of the Army of Dorkness

D. Ebdrup posted:

Welp, not buying any more WD or HGST drives, then. I can't afford 10TB or bigger.
Are there even vendors that sell non-poo poo 6TB drives?

Your options are Toshiba and hoping that HGST is still operating independently enough to not be doing shady poo poo. Seagate has a mix of CMR and SMR drives in the 4-8TB range. I think their IronWolf drives are CMR, but they also tend to have lower reliability than WD Reds.

WD's 8TB drives are (at least according to that article) proper CMR drives, so that's a little easier on the wallet. And considering what Toshiba drives cost, you may be able to get a WD Red 8TB for about the same price as a Toshiba 6TB.

eames
May 9, 2009

Paul MaudDib posted:

[interesting SMR info]

Seagate's SMR drives even use a hierarchy of DRAM (a few megabytes), NAND (a few gigabytes), CMR (a few hundred gigabytes) and SMR (a few terabytes), and the drive firmware quietly shuffles data blocks around as it sees fit. I posted about this a few months ago. WD might do the same.

It was quite fascinating because the drive is relatively smart about keeping hot files in the cache, so editing photos and documents felt like working on an SSD (because everything was kept in the NAND), but multi-hundred-gigabyte transfers IIRC slowed to 20 megabytes per second once all buffers were full and the firmware was forced to write directly to SMR.

This behavior kind of works because very few consumers write more than the CMR section (let's say 200GB) in one sitting; most workloads give the drive some idle time to flush to the SMR section.
Reading isn't as big of a problem; if I remember correctly, SMR even has an advantage over CMR when it comes to sequential read throughput, but don't quote me on that.

ZFS RAID is a great example of a workload that would grind this whole contraption to a halt, because the drive's firmware doesn't know what the filesystem is doing (I'm assuming it just looks at blocks and shuffles those around, so a scrub or resilver would leave it thoroughly confused) and the filesystem has no idea what it's dealing with.

I think the concept itself has potential for many use cases, but I see a need for a standard that makes the OS/filesystem fully aware of the underlying hardware architecture.
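
Part of that standard already exists for host-aware/host-managed SMR, which Linux exposes as zoned block devices; the catch is that drive-managed disks like these Reds typically report nothing. A quick check on a reasonably recent kernel and util-linux (sda is just an example device):

code:
# "none" = conventional or drive-managed SMR; "host-aware"/"host-managed" = zones the OS can see.
lsblk -o NAME,SIZE,ZONED
cat /sys/block/sda/queue/zoned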

Corb3t
Jun 7, 2003

My Trump check got deposited into my account today; who else plans on buying a couple more 10/12/14TB EasyStores to shuck with it?

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I can grab 32GB of Samsung DDR4-2400 RDIMMs off eBay for $100, easy. That's substantially cheaper than DIMMs of varying quality off Newegg.

sharkytm
Oct 9, 2003

Ba

By

Sharkytm doot doo do doot do doo


Fallen Rib

D. Ebdrup posted:

Millions of people probably have their data stored on ZFS without knowing it.

And I'll bet those storage units are using ECC... But I digress.

Hughlander
May 11, 2005

I have two zpools, Main-Volume and datastore. Main-Volume is six 4TB Reds in RAIDZ1; datastore is ten 8TB Reds in RAIDZ1. All of Main-Volume is on an LSI controller built into the motherboard, and eight of datastore's drives are in an external enclosure connected via eSATA to another LSI controller, while the two remaining drives are on the same LSI as Main-Volume. Both controllers are flashed to IT mode.

I turned off all VMs and LXCs, shut down Prometheus collections and ran some fio runs.

code:
# fio --filename=/Main-Volume/subvol-102-disk-1/root/testfile --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bs=4k --rwmixread=100 --iodepth=16 --numjobs=2 --runtime=60 --group_reporting --name=4ktest --size=4G
4ktest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=16
...
fio-3.12
Starting 2 processes
Jobs: 2 (f=2): [r(2)][100.0%][r=868KiB/s][r=217 IOPS][eta 00m:00s]
4ktest: (groupid=0, jobs=2): err= 0: pid=30439: Thu Apr 16 06:50:49 2020
  read: IOPS=171, BW=686KiB/s (703kB/s)(40.2MiB/60005msec)
    clat (usec): min=3, max=55565, avg=11653.73, stdev=9812.16
     lat (usec): min=3, max=55565, avg=11653.99, stdev=9812.15
    clat percentiles (usec):
     |  1.00th=[    5],  5.00th=[   15], 10.00th=[   18], 20.00th=[   39],
     | 30.00th=[   43], 40.00th=[11076], 50.00th=[12780], 60.00th=[14353],
     | 70.00th=[16712], 80.00th=[19530], 90.00th=[23462], 95.00th=[28967],
     | 99.00th=[36439], 99.50th=[39584], 99.90th=[46924], 99.95th=[49546],
     | 99.99th=[54789]
   bw (  KiB/s): min=  240, max=  600, per=50.00%, avg=343.01, stdev=49.74, samples=240
   iops        : min=   60, max=  150, avg=85.69, stdev=12.44, samples=240
  lat (usec)   : 4=0.01%, 10=3.70%, 20=10.34%, 50=19.12%, 100=0.17%
  lat (usec)   : 250=0.02%, 500=0.04%, 1000=0.02%
  lat (msec)   : 10=3.17%, 20=45.26%, 50=18.11%, 100=0.05%
  cpu          : usr=0.06%, sys=0.75%, ctx=6956, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10295,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=686KiB/s (703kB/s), 686KiB/s-686KiB/s (703kB/s-703kB/s), io=40.2MiB (42.2MB), run=60005-60005msec
700kB/sec read on Main-Volume

code:
fio --filename=/datastore/Media/testfile --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bs=4k --rwmixread=100 --iodepth=16 --numjobs=2 --runtime=60 --group_reporting --name=4ktest --size=4G
4ktest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=16
...
fio-3.12
Starting 2 processes
Jobs: 2 (f=2): [r(2)][100.0%][r=324MiB/s][r=83.1k IOPS][eta 00m:00s]
4ktest: (groupid=0, jobs=2): err= 0: pid=31685: Thu Apr 16 06:51:43 2020
  read: IOPS=82.8k, BW=324MiB/s (339MB/s)(8192MiB/25315msec)
    clat (usec): min=2, max=1042, avg=23.76, stdev=10.15
     lat (usec): min=2, max=1042, avg=23.80, stdev=10.15
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[   20],
     | 30.00th=[   23], 40.00th=[   24], 50.00th=[   26], 60.00th=[   27],
     | 70.00th=[   29], 80.00th=[   30], 90.00th=[   32], 95.00th=[   34],
     | 99.00th=[   41], 99.50th=[   46], 99.90th=[   61], 99.95th=[  125],
     | 99.99th=[  277]
   bw (  KiB/s): min=160464, max=169768, per=50.05%, avg=165857.96, stdev=1461.51, samples=100
   iops        : min=40116, max=42442, avg=41464.47, stdev=365.37, samples=100
  lat (usec)   : 4=3.85%, 10=7.69%, 20=9.31%, 50=78.86%, 100=0.23%
  lat (usec)   : 250=0.02%, 500=0.03%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=3.32%, sys=96.22%, ctx=2050, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2097152,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=324MiB/s (339MB/s), 324MiB/s-324MiB/s (339MB/s-339MB/s), io=8192MiB (8590MB), run=25315-25315msec
339MB/s on datastore.

Note: with this setup it's probably not a controller issue, as 20% of datastore's drives are on the same controller as Main-Volume.

code:
zpool status
  pool: Main-Volume
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 10:31:45 with 0 errors on Sun Apr 12 10:55:47 2020
config:

        NAME                                          STATE     READ WRITE CKSUM
        Main-Volume                                   ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E7SRLCY4  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EJPD3F55  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E8RF5925  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0AETE3E  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0AETJX7  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EFAKRXYF  ONLINE       0     0     0

errors: No known data errors

  pool: datastore
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 10:20:51 with 0 errors on Sun Apr 12 10:45:08 2020
config:

        NAME                              STATE     READ WRITE CKSUM
        datastore                         ONLINE       0     0     0
          raidz1-0                        ONLINE       0     0     0
            wwn-0x5000cca252c9c3e5-part2  ONLINE       0     0     0
            wwn-0x5000cca252c97647-part2  ONLINE       0     0     0
            wwn-0x5000cca252cd7334-part2  ONLINE       0     0     0
            wwn-0x5000cca252cd944b-part2  ONLINE       0     0     0
            wwn-0x5000cca252cd655c-part2  ONLINE       0     0     0
            wwn-0x5000cca252cd63df-part2  ONLINE       0     0     0
            wwn-0x5000cca252c8603f-part2  ONLINE       0     0     0
            wwn-0x5000cca252c8779d-part2  ONLINE       0     0     0
            wwn-0x5000cca252c857d2-part2  ONLINE       0     0     0
            wwn-0x5000cca252c95502-part2  ONLINE       0     0     0

errors: No known data errors
No errors, and a scrub had just completed when I started looking into it. I also checked the SMART status of each individual drive; all were marked as PASSED with no pre-fail indications.

code:
 zpool iostat
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
Main-Volume  4.59T  17.2T     67    518   969K  5.25M
datastore    33.6T  38.9T    966    246  17.5M  7.35M

zpool list
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Main-Volume  21.8T  4.59T  17.2T        -         -    26%    21%  1.00x    ONLINE  -
datastore    72.5T  33.8T  38.7T        -         -     2%    46%  1.00x    ONLINE  -
iostat confirms what we saw above: the read bandwidth on Main-Volume is really, really low. Fragmentation is a bit higher than I'd expect given the allocation, but still completely fine.
Given how much free space datastore has, I'm now doing a syncoid to transfer everything over, and if I can figure out what's going on I'm prepared to rebuild the pool completely.

The questions are:
A) How do I get more data on what's going on?
B) What is going on?
C) How do I fix it?

Less Fat Luke
May 23, 2003

Exciting Lemon
Can you do a pastebin or gist of `zpool get all` for both, and maybe a `zdb -C` of both as well? Offhand, maybe the ashift is really wrong for Main-Volume.
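
For reference, a quick way to pull just the ashift values, as a sketch using the pool names from the post above (zdb reads the pool config from its cache file, so it can be picky about what it will show):

code:
zdb -C Main-Volume | grep ashift
zdb -C datastore | grep ashift
zpool get all Main-Volume datastore > zpool-get-all.txt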

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
We just had a discussion above on WD Reds using SMR - I suspect your main volume is suffering from SMR issues. Your request latency being so insanely high is enough evidence of that pattern. I have a bunch of Toshiba drives to offset this problem in my 4TB-drive array, but because I have several WD Reds in there I'm now looking to migrate the whole array to perhaps 12TB drives. Just trying to figure out how many total drives I should try keeping online, because it's already at 16 drives in a craptastic garbo setup I keep putting off properly housing...

Less Fat Luke
May 23, 2003

Exciting Lemon
I was gonna say SMR too, but the read speeds in SMR arrays (or ones with some SMR drives) are not nearly that bad.

Hughlander
May 11, 2005

necrobobsledder posted:

We just had a discussion above on WD Reds using SMR - I suspect your main volume is suffering from SMR issues. Your request latency being so insanely high is enough evidence of that pattern. I have a bunch of Toshiba drives to offset this problem in my 4TB-drive array, but because I have several WD Reds in there I'm now looking to migrate the whole array to perhaps 12TB drives. Just trying to figure out how many total drives I should try keeping online, because it's already at 16 drives in a craptastic garbo setup I keep putting off properly housing...

The Reds are from 2014, and they're EFRX, which is not the line that has SMR.

Less Fat Luke posted:

Can you do a pastebin or gist of `zpool get all` for both, and maybe a `zdb -C` of both as well? Offhand, maybe the ashift is really wrong for Main-Volume.

100% possible; I was looking into that yesterday. http://sprunge.us/KIhcOv -- zpool get all and zdb (not -C).

ashift is 12 for Main-Volume; zdb alone didn't dump datastore, though.
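
If it helps, zdb generally only shows pools that are in the cache file it reads, so pointing it at the cache file explicitly sometimes works (the path below is the usual ZoL default, but it varies by setup):

code:
zpool get cachefile datastore
zdb -C -U /etc/zfs/zpool.cache datastore | grep ashift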

IOwnCalculus
Apr 2, 2003





Fragmentation shouldn't be it - I just checked my main pool and it's at 14% and the performance is way better than that.

Are any of the drives reporting errors in SMART or syslog? Possible that a drive is dying in a way that's just making it very, very slow to respond, but not actually chuck data errors.

not-edit: I'm wondering if something is weird with how fio is testing this, because I just ran that same test on my datastore and it's claiming read BW of 1MB/sec, which seems physically impossible given the workloads this array supports. I didn't stop everything else but the server isn't doing that much at the moment:

code:
Jobs: 2 (f=2): [r(2)][100.0%][r=996KiB/s,w=0KiB/s][r=249,w=0 IOPS][eta 00m:00s]
4ktest: (groupid=0, jobs=2): err= 0: pid=16723: Thu Apr 16 08:11:08 2020
   read: IOPS=258, BW=1035KiB/s (1059kB/s)(60.6MiB/60011msec)
    clat (usec): min=3, max=169682, avg=7727.77, stdev=11452.48
     lat (usec): min=3, max=169682, avg=7728.20, stdev=11452.50
    clat percentiles (usec):
     |  1.00th=[     7],  5.00th=[    15], 10.00th=[    41], 20.00th=[    49],
     | 30.00th=[    56], 40.00th=[    65], 50.00th=[   437], 60.00th=[  7504],
     | 70.00th=[  9896], 80.00th=[ 13829], 90.00th=[ 22152], 95.00th=[ 30016],
     | 99.00th=[ 51643], 99.50th=[ 59507], 99.90th=[ 81265], 99.95th=[ 87557],
     | 99.99th=[105382]
   bw (  KiB/s): min=  184, max= 1016, per=50.03%, avg=517.30, stdev=161.34, samples=240
   iops        : min=   46, max=  254, avg=129.30, stdev=40.36, samples=240
  lat (usec)   : 4=0.01%, 10=2.96%, 20=2.87%, 50=16.51%, 100=25.98%
  lat (usec)   : 250=0.46%, 500=2.08%, 750=0.60%, 1000=0.10%
  lat (msec)   : 2=0.40%, 4=1.48%, 10=17.35%, 20=16.54%, 50=11.51%
  lat (msec)   : 100=1.15%, 250=0.01%
  cpu          : usr=0.08%, sys=1.04%, ctx=8086, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=15522,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1035KiB/s (1059kB/s), 1035KiB/s-1035KiB/s (1059kB/s-1059kB/s), io=60.6MiB (63.6MB), run=60011-60011msec
rsyncing a 1.1GB file (that probably hasn't been opened recently) to an SSD:
code:
real    0m5.959s
user    0m4.201s
sys     0m2.476s

Hughlander
May 11, 2005

IOwnCalculus posted:

Fragmentation shouldn't be it - I just checked my main pool and it's at 14% and the performance is way better than that.

Are any of the drives reporting errors in SMART or syslog? Possible that a drive is dying in a way that's just making it very, very slow to respond, but not actually chuck data errors.

not-edit: I'm wondering if something is weird with how fio is testing this, because I just ran that same test on my datastore and it's claiming read BW of 1MB/sec, which seems physically impossible given the workloads this array supports. I didn't stop everything else but the server isn't doing that much at the moment:


I don't know enough about fio to know what a good command line is. This was just one I found online, and I got these two very different results. SMART status was checked: all drives are marked as PASSED, and none of the counters looked bad or in pre-fail territory.

Less Fat Luke
May 23, 2003

Exciting Lemon
Yeah, my guess is that one drive is bad; RAIDZ1 is going to read from everything simultaneously, so if one drive is making GBS threads the bed it'll stall everything. I'd run a copy or cat from the pool and watch `iostat -x` to see if one drive's util% column is maxed out compared to the others. Alternatively, export the pool and cat the raw drive devices one by one to /dev/null and watch the speed metrics (sketch below).
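
Something along these lines, as a rough sketch (assuming the Main-Volume disks turn out to be sdb through sdg; both steps are read-only):

code:
# During a big sequential read from the pool, look for one drive pegged near 100 %util.
iostat -x 5
# Or take the pool offline and read each member raw, watching for one slow outlier.
zpool export Main-Volume
for d in /dev/sd{b..g}; do echo "$d"; dd if="$d" of=/dev/null bs=1M count=4096 status=progress; done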

BlankSystemDaemon
Mar 13, 2009



Hughlander posted:

The Reds are from 2014, and they're EFRX, which is not the line that has SMR.
Oh, the three 6TB drives I have lying around (waiting for two more before I can start building my array) are also EFRX, so I guess I'm safe if I keep buying them (since, paradoxically, shucking is not cheaper in Denmark).

Hughlander
May 11, 2005

Less Fat Luke posted:

Yeah, my guess is that one drive is bad; RAIDZ1 is going to read from everything simultaneously, so if one drive is making GBS threads the bed it'll stall everything. I'd run a copy or cat from the pool and watch `iostat -x` to see if one drive's util% column is maxed out compared to the others. Alternatively, export the pool and cat the raw drive devices one by one to /dev/null and watch the speed metrics.
code:
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda              2.00  645.00      0.00  60740.00     0.00     2.00   0.00   0.31   27.50    1.33   0.39     0.00    94.17   0.81  52.40
sdh              2.00  618.00      0.00  60732.00     0.00     2.00   0.00   0.32   65.50    1.42   0.45     0.00    98.27   0.78  48.40
sdb            178.00    0.00 113628.00      0.00     0.00     0.00   0.00   0.00   17.33    0.00   2.73   638.36     0.00   3.33  59.20
sdc            311.00    0.00 112768.00      0.00     0.00     0.00   0.00   0.00    9.63    0.00   2.54   362.60     0.00   2.24  69.60
sdd            365.00    0.00 109784.00      0.00     0.00     0.00   0.00   0.00    8.09    0.00   2.48   300.78     0.00   1.74  63.60
sde            180.00    0.00 105520.00      0.00     0.00     0.00   0.00   0.00   17.09    0.00   2.69   586.22     0.00   3.18  57.20
sdf            332.00    0.00 112540.00      0.00     0.00     0.00   0.00   0.00    9.05    0.00   2.54   338.98     0.00   1.96  65.20
sdg            190.00    0.00 108432.00      0.00     0.00     0.00   0.00   0.00   15.93    0.00   2.67   570.69     0.00   3.05  58.00
sdi              1.00    0.00      4.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     4.00     0.00   4.00   0.40
sdj              3.00    0.00     44.00      0.00     0.00     0.00   0.00   0.00    0.33    0.00   0.00    14.67     0.00   2.67   0.80
sdk              2.00  541.00      0.00  60700.00     0.00     0.00   0.00   0.00   34.00    1.57   0.33     0.00   112.20   0.93  50.40
sdl              2.00  518.00      0.00  59268.00     0.00     2.00   0.00   0.38   61.50    1.69   0.38     0.00   114.42   0.98  51.20
sdm              2.00  541.00      0.00  60688.00     0.00     0.00   0.00   0.00   33.50    1.68   0.40     0.00   112.18   0.97  52.40
sdn              2.00  558.00      0.00  60720.00     0.00     1.00   0.00   0.18   63.50    1.76   0.46     0.00   108.82   0.99  55.60
sdo              2.00  543.00      0.00  60720.00     0.00     4.00   0.00   0.73   47.50    1.50   0.29     0.00   111.82   0.99  54.00
sdp              2.00  462.00      0.00  60744.00     0.00     0.00   0.00   0.00   44.00    2.40   0.59     0.00   131.48   1.03  48.00
sdq              2.00  493.00      0.00  60712.00     0.00     0.00   0.00   0.00   30.00    2.42   0.72     0.00   123.15   0.97  48.00
sdr              2.00  571.00      0.00  60748.00     0.00     2.00   0.00   0.35   48.50    1.50   0.34     0.00   106.39   0.98  56.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00   39.10    7.31    0.00   53.08
This is while running a syncoid to move data from Main-Volume to datastore. sda/sdh are the two 8TB drives in the main chassis, sdb-sdg are the Main-Volume drives doing roughly the same reads, sdi/sdj are mirrored SSDs off the motherboard (not on the LSI controllers), and sdk-sdr are the rest of the drives in the external array.

EDITED:
Actual iostat -x
code:
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda             95.11   32.62   1744.92   1235.08     0.01     0.06   0.01   0.19    0.99    1.72   0.09    18.35    37.86   0.71   9.06
sdh             95.22   32.56   1743.17   1233.35     0.01     0.07   0.01   0.22    0.95    1.70   0.08    18.31    37.88   0.70   8.99
sdb             22.32   85.74   1064.81    891.84     0.00     0.04   0.02   0.05   14.82    1.08   0.38    47.72    10.40   0.83   8.93
sdc             22.46   81.83   1041.02    850.21     0.00     0.04   0.01   0.05   13.84    1.08   0.35    46.36    10.39   0.84   8.81
sdd             22.46   86.09   1063.66    892.96     0.00     0.04   0.02   0.05   14.06    0.91   0.35    47.35    10.37   0.82   8.93
sde             21.65   82.04   1046.93    851.33     0.00     0.04   0.01   0.05   14.46    0.92   0.34    48.35    10.38   0.85   8.85
sdf             22.20   86.14   1065.47    892.97     0.00     0.04   0.02   0.04   14.25    0.88   0.34    47.99    10.37   0.83   8.96
sdg             22.27   82.23   1042.71    851.07     0.00     0.04   0.01   0.05   13.60    0.88   0.33    46.83    10.35   0.84   8.82
sdi              3.37   25.51     49.29    404.74     0.00     0.00   0.00   0.00    0.84    0.94   0.02    14.61    15.87   0.29   0.84
sdj              3.28   24.31     45.69    404.74     0.00     0.00   0.00   0.00    0.84    0.90   0.02    13.95    16.65   0.30   0.84
sdk             94.43   30.99   1746.20   1235.04     0.01     0.06   0.01   0.18    1.02    1.89   0.09    18.49    39.85   0.73   9.11
sdl             94.10   30.82   1745.54   1233.37     0.01     0.07   0.01   0.22    1.00    1.92   0.09    18.55    40.01   0.72   9.02
sdm             94.41   30.98   1744.51   1233.45     0.01     0.07   0.01   0.23    0.99    1.87   0.09    18.48    39.82   0.72   9.04
sdn             93.94   30.93   1746.92   1235.08     0.01     0.06   0.01   0.18    1.04    1.93   0.09    18.60    39.93   0.73   9.11
sdo             94.17   31.12   1746.36   1235.09     0.01     0.05   0.01   0.16    1.02    1.87   0.09    18.55    39.69   0.73   9.11
sdp             94.03   30.93   1745.07   1233.39     0.01     0.06   0.01   0.20    1.00    1.90   0.09    18.56    39.87   0.72   9.03
sdq             93.62   30.66   1746.09   1233.42     0.01     0.07   0.01   0.22    1.01    1.98   0.09    18.65    40.23   0.73   9.01
sdr             94.83   31.09   1745.30   1235.08     0.01     0.05   0.01   0.17    1.01    1.85   0.09    18.40    39.72   0.72   9.11

Less Fat Luke
May 23, 2003

Exciting Lemon
Yeah, check the iostat during the fio run. That iostat is showing about 500MB/s being read, accounting for parity, so if that's Main-Volume then you're getting good speeds.

Edit: Also, 4GB is a very small test for fio; it should be larger than your ARC or RAM in my opinion, though maybe it's smart enough with that direct flag to bypass even the ARC? Either way I'd do it with a much larger file.

Additionally, random reads in ZFS with something like fio *should* be terrible; you're using spinning disks in RAIDZ1. If anything, both of those tests are anomalous, because 300+MB/s in random reads can't possibly be coming from real spinners.
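
For what it's worth, here's a sketch of a bigger sequential test that should mostly take the ARC out of the picture (the file path and the 64G size are placeholders; pick a size comfortably larger than RAM):

code:
fio --name=bigread --filename=/Main-Volume/testfile --rw=read --bs=1M \
    --direct=1 --size=64G --runtime=120 --time_based --group_reporting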

Less Fat Luke fucked around with this message at 17:02 on Apr 16, 2020

Hughlander
May 11, 2005

Less Fat Luke posted:

Yeah, check the iostat during the fio run. That iostat is showing about 500MB/s being read, accounting for parity, so if that's Main-Volume then you're getting good speeds.

Edit: Also, 4GB is a very small test for fio; it should be larger than your ARC or RAM in my opinion, though maybe it's smart enough with that direct flag to bypass even the ARC? Either way I'd do it with a much larger file.

Additionally, random reads in ZFS with something like fio *should* be terrible; you're using spinning disks in RAIDZ1. If anything, both of those tests are anomalous, because 300+MB/s in random reads can't possibly be coming from real spinners.

Thanks, I've never used fio, so I'd appreciate any pointers. To avoid the X/Y problem I dug myself into: I started by looking at a real-world issue I was having. Scanning files for a backup would take 40-60 minutes for 1 million files, and all it was doing was grabbing the mtime for each file. A similar-sized scan on the other zpool would take 2:00-2:30. From there I looked to get reproducible measurements to show that yes, one pool is slower than the other. I'm taking a different tack now: I'm doing a zfs send/receive to move the same 1M files from Main-Volume to datastore, and I'll run the same backup there. If it completes in 2 minutes, though, then I'll still want to understand what's going on between the two pools and how to improve the performance of one.
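
If it's useful as a reproducible stand-in for the backup's scan, timing a bare mtime walk on each pool is easy enough (the directory paths are placeholders):

code:
time find /Main-Volume/some/dataset -type f -printf '%T@\n' | wc -l
time find /datastore/some/dataset -type f -printf '%T@\n' | wc -l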

Less Fat Luke
May 23, 2003

Exciting Lemon
Yeah, that's interesting; I almost wonder if you're somehow priming the ARC in the "fast" pool and all the mtimes are readily available in memory. I bet clearing the ARC with a reboot (or dropping and re-increasing its size) before each test would help narrow the issue down.
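
A rough sketch of doing that without a full reboot: exporting and re-importing a pool evicts its cached data, and you can watch the ARC to confirm (the arcstat tool name varies a bit between ZFS-on-Linux versions):

code:
zpool export Main-Volume && zpool import Main-Volume
arcstat 1 5   # confirm the ARC size/hit rate actually dropped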

Hughlander
May 11, 2005

Less Fat Luke posted:

Yeah, that's interesting; I almost wonder if you're somehow priming the ARC in the "fast" pool and all the mtimes are readily available in memory. I bet clearing the ARC with a reboot (or dropping and re-increasing its size) before each test would help narrow the issue down.

It was a separate set of files that also hadn't been accessed very recently. If this is the expected time, I may just need to find a different backup solution. I really miss CrashPlan and its file watcher so much.

Kia Soul Enthusias
May 9, 2004

zoom-zoom
Toilet Rascal
Do any of these homebrew distros have a backup client for Windows that will automatically back up over the internet (via VPN or similar)? Imagine you are trying to keep the systems of family members in their 70s backed up.

H110Hawk
Dec 28, 2006

Charles posted:

Do any of these homebrew distros have a backup client for Windows that will automatically back up over the internet (via VPN or similar)? Imagine you are trying to keep the systems of family members in their 70s backed up.

I know this isn't the answer to the question you asked as phrased, but I strongly encourage using an off-the-shelf "cloud" backup solution like Backblaze. $4.583/month/computer with a 2-year plan. Yes, it's cheaper to DIY in absolute dollars, but in sanity-bux it's :suicide: trying to handle this safely over the internet.

TraderStav
May 19, 2006

It feels like I was standing my entire life and I just sat down

H110Hawk posted:

I know this isn't the answer to the question you asked as phrased, but I strongly encourage using an off-the-shelf "cloud" backup solution like Backblaze. $4.583/month/computer with a 2-year plan. Yes, it's cheaper to DIY in absolute dollars, but in sanity-bux it's :suicide: trying to handle this safely over the internet.

I've been looking for a mass backup solution for my NAS and have been avoiding it, as I used CrashPlan ages ago and it was miserable. Is Backblaze a good solution? Will I spend 2020 and 2021 uploading my data and have a lovely interface to pull things down?

xarph
Jun 18, 2001


Well poo poo, I bought one EFAX to finish upgrading my zpool from 3TB to 4TB disks, and THEN I read this.

It's replacing now; if it gets all the way through, am I clear? This is just a boring rear end file server with low churn, aside from a couple of 20GB VMs to run stuff that doesn't have FreeBSD ports.

Is there a good list of known-good SKUs? I think the Seagate Exos line lists whether a drive is CMR or SMR in the datasheets, but a) Seagate, b) $$$.

H110Hawk
Dec 28, 2006

TraderStav posted:

I've been looking for a mass backup solution for my NAS and have been avoiding it, as I used CrashPlan ages ago and it was miserable. Is Backblaze a good solution? Will I spend 2020 and 2021 uploading my data and have a lovely interface to pull things down?

B2 works for me with the Synology.

Devian666
Aug 20, 2008

Take some advice Chris.

Fun Shoe

xarph posted:

Well poo poo, I bought one EFAX to finish upgrading my zpool from 3TB to 4TB disks, and THEN I read this.

It's replacing now; if it gets all the way through, am I clear? This is just a boring rear end file server with low churn, aside from a couple of 20GB VMs to run stuff that doesn't have FreeBSD ports.

Is there a good list of known-good SKUs? I think the Seagate Exos line lists whether a drive is CMR or SMR in the datasheets, but a) Seagate, b) $$$.

I have 2 x 6TB EFAX drives, mirrored. I replaced the existing 2TB drives this year, and it took about 11 hours to remirror the data with no problems. If you get all the way through and there are no errors or strange latency problems, then it should be fine.

For my setup I just keep adding data to the storage, which shouldn't create an issue. The only change is that I now use it as a work server while on lockdown, but I haven't noticed any performance issues.

HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast
:siren: SMR drive watch :siren:

https://blocksandfiles.com/2020/04/16/toshiba-desktop-disk-drives-undocumented-shingle-magnetic-recording/

blocks & files posted:

Western Digital, Seagate and Toshiba have now confirmed to Blocks & Files the undocumented use of SMR technology in desktop HDDs and, in WD's case, WD Red consumer NAS drives.

It just gets worse. Yay! (the new news being Toshiba, but Seagate hasn't been mentioned in this thread yet, so here's more on them: https://blocksandfiles.com/2020/04/15/seagate-2-4-and-8tb-barracuda-and-desktop-hdd-smr/)

HalloKitty fucked around with this message at 15:57 on Apr 17, 2020

BlankSystemDaemon
Mar 13, 2009



So, the only models known not to use SMR as of this post are WD EFRX and Toshiba X300 drives? That's mighty slim pickings.
Weirdly, Toshiba X300 6TB drives are a lot cheaper than 6TB EFRX drives here in Denmark.

sharkytm
Oct 9, 2003

Ba

By

Sharkytm doot doo do doot do doo


Fallen Rib
The WD white-label EMAZ doesn't. That's what you usually find in the shuckable external drives. I can't believe they'd gently caress the users who spend the money on legit Red drives, and not the external drives where performance is expected to be slower.

HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast

sharkytm posted:

The WD white-label EMAZ doesn't. That's what you usually find in the shuckable external drives. I can't believe they'd gently caress the users who spend the money on legit Red drives, and not the external drives where performance is expected to be slower.

Yup. It's totally rear end-backwards. Turns out, the real winners are the shuckers. Cheaper AND better drives.

Steakandchips
Apr 30, 2009

So, 8TB and higher WD Reds are confirmed to be non-SMR, correct?

xarph
Jun 18, 2001


Devian666 posted:

I have 2 x 6TB EFAX drives, mirrored. I replaced the existing 2TB drives this year, and it took about 11 hours to remirror the data with no problems. If you get all the way through and there are no errors or strange latency problems, then it should be fine.

For my setup I just keep adding data to the storage, which shouldn't create an issue. The only change is that I now use it as a work server while on lockdown, but I haven't noticed any performance issues.

My zpool replace operation completed successfully sometime overnight. Pool is fine.

The story has hit Ars: https://arstechnica.com/gadgets/2020/04/caveat-emptor-smr-disks-are-being-submarined-into-unexpected-channels/

sharkytm
Oct 9, 2003

Ba

By

Sharkytm doot doo do doot do doo


Fallen Rib

xarph posted:

My zpool replace operation completed successfully sometime overnight. Pool is fine.

The story has hit Ars: https://arstechnica.com/gadgets/2020/04/caveat-emptor-smr-disks-are-being-submarined-into-unexpected-channels/

Cue the class-action lawsuit in 3... 2... 1...

DrDork
Dec 29, 2003
commanding officer of the Army of Dorkness

sharkytm posted:

Cue the class-action lawsuit in 3... 2... 1...

I wouldn't be surprised. You can gently caress over consumers like that all day and no one is gonna complain all that much, but you start changing "known quantity" SKUs that are meant for businesses and that's gonna get a lot more attention.

ChiralCondensate
Nov 13, 2007

what is that man doing to his colour palette?
Grimey Drawer

Steakandchips posted:

So, 8TB and higher WD Reds are confirmed to be non-SMR, correct?

I'd like to see a script that does a drive fill-up plus whatever benchmarking is needed to incur SMR-rewrite behavior; then we could test all the drives.

BlankSystemDaemon
Mar 13, 2009



ChiralCondensate posted:

I'd like to see a script that does a drive fill-up plus whatever benchmarking is needed to incur SMR-rewrite behavior; then we could test all the drives.
By the sounds of it, it's as simple as this one-liner.
pre:
zpool create tank raidz3 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 \
&& camdd -i file=/dev/random,bs=1M,depth=`sysctl -n hw.ncpu` -o file=/tank/random.bin -m 1024G \
&& zpool scrub tank
Granted, Linux might have trouble with it because of its CSPRNG and its lack of camdd, which can operate with multiple queues (i.e. make use of FreeBSD's threaded CSPRNG, which cannot be exhausted because it's based on Fortuna and doesn't block).
Then again, that seems like a problem for Linux.

BlankSystemDaemon fucked around with this message at 20:53 on Apr 17, 2020


H110Hawk
Dec 28, 2006

D. Ebdrup posted:

By the sounds of it, it's as simple as this one-liner.
pre:
zpool create tank raidz3 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 \
&& camdd -i file=/dev/random,bs=1M,depth=`sysctl -n hw.ncpu` -o file=/tank/random.bin -m 1024G \
&& zpool scrub tank
Granted, Linux might have trouble with it because of its CSPRNG and its lack of camdd, which can operate with multiple queues (i.e. make use of FreeBSD's threaded CSPRNG, which cannot be exhausted because it's based on Fortuna and doesn't block).
Then again, that seems like a problem for Linux.

Just use /dev/urandom instead. I also think they've improved /dev/random materially in the last decade to make exhaustion less likely.
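
A rough Linux translation of that one-liner, as a sketch (device names and the 1TiB fill size are placeholders):

code:
zpool create tank raidz3 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde \
  && dd if=/dev/urandom of=/tank/random.bin bs=1M count=1048576 status=progress \
  && zpool scrub tank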
