frunksock
Feb 21, 2002

H110Hawk posted:

What zpool configuration are you using? Any specific kernel tweaks?
I'm not who you asked, but the configuration I ended up with is pretty similar: 7x(5+1) raidz1 vdevs and 4 spares, all in one pool, plus mirrored OS disks in the rpool. Since this is an X4540, my first crack at it put the OS on a CF card, but besides being a pretty big pain in the rear end, that doesn't really buy me anything except 2 extra spares, because for performance and reliability reasons I'd settled on 5+1 stripes with no stripe containing more than one disk from a single controller. The command to create it would go something like

code:
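# one 5+1 raidz vdev per group below; each group takes one disk from each of the six controllers (c1-c6)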
zpool create foopool \
raidz c1t0d0 c2t2d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 \
raidz c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 c1t2d0 \
raidz c3t2d0 c4t2d0 c5t2d0 c6t2d0 c1t3d0 c2t3d0 \
raidz c4t3d0 c5t3d0 c6t3d0 c1t4d0 c2t4d0 c3t4d0 \
raidz c5t4d0 c6t4d0 c1t5d0 c2t5d0 c3t5d0 c4t5d0 \
raidz c6t5d0 c1t6d0 c2t6d0 c3t6d0 c4t6d0 c5t6d0 \
raidz c1t7d0 c2t7d0 c3t7d0 c4t7d0 c5t7d0 c6t7d0 \
spare c3t3d0 c4t4d0 c5t5d0 c6t6d0
Notice each stripe has exactly one disk from each of the 6 controllers (c0 got eaten by the CF card experiment).

The rpool is a mirror of c1t1d0 and c2t0d0.
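For illustration only, that mirror expressed as a zpool command would look like the following (in practice the installer builds the rpool for you):

code:
zpool create rpool mirror c1t1d0 c2t0d0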

Using bonnie++, I get about 500MB/s writes and 730MB/s reads.
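For what it's worth, the runs were nothing fancy -- an invocation along these lines (the directory and size here are placeholders, not my exact flags):

code:
bonnie++ -d /foopool/bench -s 64g -u nobody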

This is a log file repo where people do a lot of sorting and grepping and whatnot, so I used ZFS gzip-6 compression, which gets me over 9x compression and (this is awesome) actually improves performance, since it moves the bottleneck from the disks to the CPU (we now read far fewer blocks off of disk).
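Turning that on is a one-liner, and the compressratio property tells you what you're actually getting back (substitute your own dataset name):

code:
zfs set compression=gzip-6 foopool
zfs get compressratio foopool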

Doing an 8-way parallel grep, I can *grep* logfiles at over 820MB/s -- that number includes everything -- reading from disk, decompressing, and the grepping itself. Again, the bottleneck is CPU, not disk. It's awesome.
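By 8-way parallel I just mean something in this spirit -- one grep per file, backgrounded, then wait (the paths are hypothetical):

code:
for f in /foopool/logs/chunk[0-7].log; do
    grep -c 'some pattern' "$f" &
done
wait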

Oh, and the bonnie++ numbers are from an uncompressed filesystem. Since bonnie++ writes and reads blocks of 0s, you get insanely high (and fake) numbers if you run it on a gzip filesystem. My eyes got wide when I first ran it on the gzipped filesystem, before I realized what was happening -- I forget the numbers, something like 3GB/s reads, maybe.
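It's easy to see why: write a gigabyte of zeros to the gzip dataset and then look at what actually hit the disk (path hypothetical):

code:
dd if=/dev/zero of=/foopool/gz/zeros bs=1M count=1024
du -h /foopool/gz/zeros    # all-zero blocks compress away to almost nothing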

frunksock fucked around with this message at 23:56 on Sep 6, 2009


frunksock
Feb 21, 2002

adorai posted:

since it's apparently commodity hardware, i think you could probably run opensolaris w/ raidz2 on it if you wanted to.

The speculation on opensolaris.org is that they didn't do OpenSolaris / ZFS because OpenSolaris support for the SiI 3726 port multiplier that they're using only came out a couple weeks ago.

frunksock
Feb 21, 2002

Not sure if this is the thread, but I want to talk about Linux md for a bit. I've worked with ZFS and VxVM and with high-end enterprise arrays (DMX, etc.), but not much with Linux md on cheap 1U/2Us with SATA disks.

My understanding is that, when using software RAID without a battery-backed RAID controller, it's commonly recommended to disable the write-back cache on SATA disks to protect against corruption and/or data loss in the event of a power failure or crash. I understand that this risk exists even on a single, non-RAID drive, but that it's multiplied in software RAID configurations, especially RAID5/RAID6. Here are some things I am not 100% clear on:

RAID1 / RAID10: Does doing RAID1 or RAID10 pose any increased risk of data corruption or data loss due to power failure and pending writes (writes ACKed and cached by the disks, but uncommitted)? If so, how does this work?

Barriers: Does using ext3 or XFS barriers afford the same protection in this situation as disabling the write cache entirely (again, say, for RAID10)? I also understand that barriers do not work with Linux md RAID5/6; do they work with RAID10?
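To be concrete about what I mean by "using barriers" (device and mount point are placeholders):

code:
mount -o barrier=1 /dev/md0 /data    # ext3: barriers are off by default, so ask explicitly
mount -o barrier /dev/md0 /data      # XFS: barriers have been the default for a while now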

Disabling the disks' write cache: I know how to do this using hdparm, but I also know that it is not a persistent change. If the machine reboots, the disks will come back up with the write-cache re-enabled. Worse, if there's a disk reset, they will come back up with the write-cache re-enabled (making the idea of doing it with a startup script inadequate). In RHEL4, there used to be an /etc/sysconfig/harddisks, but this no longer exists in RHEL5. What is the current method of persistently disabling the write cache on SATA disks? Is there a kernel option?
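For reference, the non-persistent version is just the hdparm call below, and the best candidate I've seen for making it stick is a udev rule that reapplies it on add/change events (which should also catch disk resets) -- but I don't know if that's the blessed way, hence the question. The rule file name and device match here are just examples:

code:
hdparm -W 0 /dev/sda

# e.g. in /etc/udev/rules.d/60-disk-write-cache.rules
ACTION=="add|change", KERNEL=="sd[a-z]", RUN+="/sbin/hdparm -W 0 /dev/%k"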

frunksock
Feb 21, 2002

lilbean posted:

I would just mitigate most of the risk by using a UPS and - if you can afford it - systems with redundant power. I mean a motherboard can still blow and take down the whole system immediately, but most drives follow the flush and sync commands enough to not worry that much.

The colo these servers are in has issues often enough that that's not enough for me. And I'd want to understand what's what even if I had bulletproof systems and datacenters.
