StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

adorai posted:

since it's apparently commodity hardware, i think you could probably run opensolaris w/ raidz2 on it if you wanted to.

i wouldn't even bother. the power situation on those boxes isn't even remotely trustworthy; any "raid" needs to be between boxes.

optikalus
Apr 17, 2008

H110Hawk posted:

I think you're wrong here; fan count is a lot less important than CFM forced over the disks. Sure, the disks may be hotter than in a 2U server with 4 disks in it, but they will be consistently the same temperature. The disks will probably suffer some premature failure, but that is the whole point of RAID. Get a cheapo Seagate AS disk with a 5-year warranty and just replace them as they fail.

The way the disks are laid out, the drives in the middle row will no doubt be several degrees hotter than the drives next to the fans. Air also has very poor thermal conductivity, so having such a small distance between the drives means that:

1) minimal air contact with the drive surfaces
2) maximum radiant heat transfer between drives

The drive rails on many servers actually act as heatsinks as well, to dissipate heat to the chassis. There are no such heatsinks in this chassis.

I'd love to see a plot of the temperatures of the disks vs. location in the chassis. Even in my SuperMicros, the 8th disk consistently runs 3 degrees C hotter than all the other drives:

code:
Chassis    HDD1  HDD2  HDD3  HDD4  HDD5  HDD6  HDD7  HDD8
   1        34    35    33    33    35    36    36    39
   2        30    31    29    32    30    33    34    36
   3        30    31    29    30    31    32    33    34
   4        32    33    32    34    34    35    33    36
and so on (temperatures in degrees C)

The drives are installed:

1 3 5 7
2 4 6 8

Drives 1-6 are cooled by three very high-CFM fans, whereas 7 and 8 sit in front of the PSUs, which have their own, less powerful fans, so naturally those two bake.




Given the density of those 5 drives, I can't see those fans in front pushing much air through the drive array; they'd probably just send it out the side vents. The fans in the rear are then blowing that heated air over the CPU and PSU, which can't be good for them either.

Further, they're running *software* RAID. I can't count how many times I've tried it and had it gently caress me over somehow. It's flaky at best when your drives are in perfect working order; I can only imagine what it'd do when half the disks it knows about drop offline due to a popped breaker or a bad PSU.

Don't get me wrong, I think it is a great idea, just poor execution. Instead of 45 drives per chassis, I'd stick to 30 or so. That'd give about 3/4" clearance between each drive, which would allow sufficient air flow and reduce radiant transfer.

H110Hawk
Dec 28, 2006

optikalus posted:

The way the disks are laid out, the drives in the middle row will no doubt be several degrees hotter than the drives next to the fans. Air also has very poor thermal conductivity, so having such a small distance between the drives means that:

I'd love to see a plot of the temperatures of the disks vs. location in the chassis. Even in my SuperMicros, the 8th disk consistently runs 3 degrees C hotter than all the other drives:

I think you miss the point. The disks need a constant temperature, not a low temperature. Remember the Google article everyone loved a year or two back? Nothing has changed. It also doesn't matter if they have a slightly elevated failure rate. Their cost for downtime is nearly 0 compared to most other applications out there. Build cost vs. technician time is what they have to minimize, and in that case, lowest price wins. See the other storage thread for my arguments.

Have you never opened up a Sun X4500? They cram in disks the same way; it's what this was apparently modeled after.

http://www.seagate.com/staticfiles/support/disc/manuals/desktop/Barracuda%207200.11/100507013e.pdf
http://www.sun.com/servers/x64/x4540/gallery/index.xml?t=4&p=2&s=1

namaste friends
Sep 18, 2004

by Smythe
NetApp has had something called System Manager out for a couple of months now. It's meant as a FilerView replacement. It's also free, and you can download it from NOW. For more info: http://blogs.netapp.com/storage_nuts_n_bolts/2009/03/sneak-preview-netapp-system-manager-nsm.html You'll need MMC 3.0 and all your SNMP ports to your filers open. I can't comment on its reliability because I've never used it, though.

The 2000 series filers are all clusterable btw. In fact, I've only ever come across one customer that bought a 2000 series with one node. With respect to expandability, the 2020 can be expanded with FC or SATA shelves. This is rather odd because the 2020 is populated with SAS drives.

Also, last week NetApp released Ontap 8, Ontap 7.2.3, the 2040 and the DS4243 SAS shelf. The big deal with Ontap 8 is that it supports aggregates bigger than 16 TB.

KS
Jun 10, 2003
Outrageous Lumpwad

optikalus posted:

Don't get me wrong, I think it is a great idea, just poor execution. Instead of 45 drives per chassis, I'd stick to 30 or so. That'd give about 3/4" clearance between each drive, which would allow sufficient air flow and reduce radiant transfer.

There are a half dozen vendors with an identical layout, including Sun, HP, Overland, and Nexsan. Jamming 48 1.5TB drives into 4U is kinda the next big thing and a centerpiece of D2D backup strategies. Density is important, and a temperature variance between drives is not important at all.

frunksock
Feb 21, 2002

H110Hawk posted:

What zpool configuration are you using? Any specific kernel tweaks?

I'm not who you asked, but the configuration I ended up with is pretty similar: 7x(5+1) raidz1 with 4 spares, all in one pool, plus mirrored OS disks in the rpool. Since this is an X4540, on my first crack at it I put the OS on a CF card, but besides being a pretty big pain in the rear end, I realized it doesn't really get me anything beyond 2 extra spares, since for performance and reliability reasons I'd settled on 5+1 stripes with no one stripe containing more than one disk from a single controller. The command to create it goes something like

code:
zpool create foopool \
raidz c1t0d0 c2t2d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 \
raidz c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 c1t2d0 \
raidz c3t2d0 c4t2d0 c5t2d0 c6t2d0 c1t3d0 c2t3d0 \
raidz c4t3d0 c5t3d0 c6t3d0 c1t4d0 c2t4d0 c3t4d0 \
raidz c5t4d0 c6t4d0 c1t5d0 c2t5d0 c3t5d0 c4t5d0 \
raidz c6t5d0 c1t6d0 c2t6d0 c3t6d0 c4t6d0 c5t6d0 \
raidz c1t7d0 c2t7d0 c3t7d0 c4t7d0 c5t7d0 c6t7d0 \
spare c3t3d0 c4t4d0 c5t5d0 c6t6d0
Notice each stripe has exactly one disk from each of the 6 controllers (c0 got eaten by the CF card experiment).

The rpool mirror uses c1t1d0 and c2t0d0.

Using bonnie++, I get about 500MB/s writes and 730MB/s reads.

This is a log file repo where people do a lot of sorting and grepping and whatnot, so I used ZFS gzip-6 compression, which gets me over 9x compression and (this is awesome) actually improves performance, since it moves the bottleneck from the disks to the CPU (because we now read far fewer blocks off of disk).
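
For reference, turning that on is a one-liner; the dataset name below is made up since the post doesn't give one:

code:
# hypothetical dataset name; gzip-6 matches the gzip-6 compression setting described above
zfs set compression=gzip-6 foopool/logs
# only blocks written after this point are compressed; check the achieved ratio with
zfs get compressratio foopool/logs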

Doing an 8-way parallel grep, I can *grep* logfiles at over 820MB/s -- that number includes everything -- reading from disk, decompressing, and the grepping itself. Again, the bottleneck is CPU, not disk. It's awesome.
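
An 8-way grep like that can be approximated with plain xargs (paths and pattern here are hypothetical):

code:
# run 8 grep workers in parallel across the log files; ZFS decompresses the blocks transparently
ls /foopool/logs/*.log | xargs -n 16 -P 8 grep -c 'ERROR' > /dev/null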

Oh, and the bonnie++ numbers are from an uncompressed filesystem. Since bonnie++ writes and reads blocks of 0s, you get insanely high (and fake) numbers if you run it on a gzip filesystem. My eyes got wide when I first ran it on the gzipped filesystem, before I realized what was happening -- I forget the numbers, something like 3GB/s reads, maybe.

frunksock fucked around with this message at 23:56 on Sep 6, 2009

frunksock
Feb 21, 2002

adorai posted:

since it's apparently commodity hardware, i think you could probably run opensolaris w/ raidz2 on it if you wanted to.

The speculation on opensolaris.org is that they didn't do OpenSolaris / ZFS because OpenSolaris support for the SiI 3726 port multiplier they're using only came out a couple of weeks ago.

EnergizerFellow
Oct 11, 2005

More drunk than a barrel of monkeys
A few idle observations to cheaply improve the Backblaze box:

- LVM/RAID6/JFS? Get those bad boys on OpenSolaris with RAID-Z2 and ZFS.

- Boot from CompactFlash via a direct-plug IDE adapter. CF+adapter can be had for <$30/ea. Run OS in memory from tmpfs. Flash is vastly more reliable than a spindle, if you keep the write cycle count down.

- 110V @ 14A. Seriously, 110V? These boxes need to be on 208V ASAP. Better efficiency from the PSUs and the AC lines too.

- Bump the motherboard budget to get something with ECC memory, PCIe x4 slots, and multiple NICs that you can channel bond (802.1ax). Something like an ASUS P5BV-M will run you ~$150/ea.

- 2x 4-port SATA PCIe x4 cards for ~$60/ea, and run one of the SATA backplanes off the motherboard SATA port. Fewer chips to fail in the box, and it eliminates the badly oversaturated PCI SATA card and one of the PCIe controllers.

- Run the numbers on low-power CPUs and 5400/7200 RPM 'green' drives. Given the large number of boxes, the additional component cost can be offset by power-consumption savings and datacenter BTU load.
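
On the channel-bonding bullet above, a minimal Linux sketch of the classic bonding setup (interface names and addresses are made up, and the switch side needs LACP configured):

code:
# /etc/modprobe.d/bonding.conf -- mode 4 is 802.3ad/LACP dynamic link aggregation
alias bond0 bonding
options bonding mode=4 miimon=100

# bring the bond up and enslave the two on-board NICs (ifenslave package)
modprobe bond0
ifconfig bond0 10.0.0.50 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1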

brent78
Jun 23, 2004

I killed your cat, you druggie bitch.
The 80GB boot drive is a waste; they should be booting from the network via PXE. I'm also curious what they do when a drive fails, since it's not hot-swap. Again, it seems like MogileFS would be a perfect fit and probably give higher uptime.
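
A minimal sketch of the PXE side, assuming ISC dhcpd and PXELINUX (the subnet and addresses are made up):

code:
# /etc/dhcp/dhcpd.conf fragment -- point the pods at a TFTP server carrying pxelinux.0
subnet 10.10.0.0 netmask 255.255.255.0 {
  range 10.10.0.100 10.10.0.200;
  next-server 10.10.0.2;       # TFTP server holding the kernel/initrd
  filename "pxelinux.0";       # PXELINUX bootloader
}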

Edit: If you look at the pics, the disk arrays sit on top of a rail. I wouldn't want to be the one that has to pull one out for maintenance.

brent78 fucked around with this message at 09:03 on Sep 7, 2009

ragzilla
Sep 9, 2005
don't ask me, i only work here


EnergizerFellow posted:

- 2x 4-port SATA PCIe x4 cards for ~$60/ea, and run one of the SATA backplanes off the motherboard SATA port. Fewer chips to fail in the box, and it eliminates the badly oversaturated PCI SATA card and one of the PCIe controllers.

They mentioned that on the page: the onboard SATA ports have issues with multipliers. Of course, that may not be an issue on a different mobo.

H110Hawk
Dec 28, 2006

EnergizerFellow posted:

- 110V @ 14A. Seriously, 110V? These boxes need to be on 208V ASAP. Better efficiency from the PSUs and the AC lines too.

Sometimes it's hard to get 200v power in datacenters. :(

quote:

- Run the numbers on low-power CPUs and 5400/7200 RPM 'green' drives. Given the large number of boxes, the additional component cost can be offset by power-consumption savings and datacenter BTU load.

It appears that the same disk from the 'green' line saves you 3 Watts per disk. Their Seagate disk costs $120; the WD green disk costs $122. At 45 disks per box and ~10 boxes per rack, that's 1,350W (12.2A @ 110V) per rack saved, at a cost of $900. It also reduces the power per box from 14A to about 12.8A. A rack costs $1,000-1,200/month with 60 amps of 110V power and mostly adequate cooling, so over the course of 2-3 months they would make the money back by being able to squeeze nearly one more pod into each rack. This assumes they get the performance they need from the disks, which they likely do.
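
A quick back-of-the-envelope check of those figures, assuming bash and bc (the 3W/disk and $2/disk deltas come from the numbers above):

code:
echo $(( 3 * 45 * 10 ))                    # 1350 W saved per rack (3 W/disk, 45 disks/pod, ~10 pods/rack)
echo "scale=1; 3 * 45 * 10 / 110" | bc     # ~12.2 A freed up per rack at 110 V
echo $(( (122 - 120) * 45 * 10 ))          # $900 extra drive cost per rack
echo "scale=1; 14 - 3 * 45 / 110" | bc     # ~12.8 A per pod instead of 14 A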

I wonder if there are other factors preventing this, such as disk supply or other inefficiencies in the system they aren't showing us.

http://www.wdc.com/en/products/products.asp?driveid=575
http://www.seagate.com/staticfiles/support/disc/manuals/desktop/Barracuda%207200.11/100507013e.pdf
http://www.google.com/products/catalog?q=western+digital+1.5tb&cid=5473896154771326069&sa=title#scoring=p

EnergizerFellow
Oct 11, 2005

More drunk than a barrel of monkeys

ragzilla posted:

They mentioned that on the page: the onboard SATA ports have issues with multipliers. Of course, that may not be an issue on a different mobo.

I noticed that too, and from looking into it, their post is the only reference I can find on modern Intel chipsets having issues with Silicon Image multipliers. From their wording, I would also infer they tried running all or most of the backplanes from the Intel ICH controller. Under the idea I had, only one of the backplanes would be off the motherboard.

Another issue they alluded to was the SATA controller chipset. All low-end multipliers on the market are from Silicon Image, as near as I can tell, and the Marvell chipsets didn't support multipliers until rather recently (~1-2 years ago). It also looks like all commodity 4-port SATA PCIe chips on the market are from Marvell, while the 2-ports are Silicon Image; thus their odd card choice, I'm guessing. They probably standardized their layout before the current Marvell chips were available.

H110Hawk posted:

I wonder if there are other factors preventing this, such as disk supply or other inefficiencies in the system they aren't showing us.

I do wonder that as well, such as why they have a seemingly high-speed CPU. I wonder if they have single-thread performance issues.

brent78
Jun 23, 2004

I killed your cat, you druggie bitch.

EnergizerFellow posted:

I do wonder that as well, such as why they have a seemingly high-speed CPU. I wonder if they have single-thread performance issues.

They mention doing all their operations via HTTPS, so encryption/decryption is all done by the CPU.

H110Hawk posted:

Sometimes it's hard to get 200v power in datacenters. :(

If you mean mom-and-pop basements, then yes. I get 208V 60A 3-phase from my local co-lo down the street. 208 is standard fare for a datacenter. They host all their gear at 365 Main, so that's not a problem there.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

H110Hawk posted:

Sometimes it's hard to get 200v power in datacenters. :(

What? No it isn't.

unknown
Nov 16, 2002
Ain't got no stinking title yet!


I've seen colos ("professional" ones, that is) that can't handle doing 230/208 for administrative reasons rather than technical ones.

All their usage/growth planning is done in 120V, and they have no way to deal with or convert other voltages to fit that. A side benefit is that it usually keeps the rack density down, so they can rent more racks to the same customer.

Of course, these are the colos that provide 3-phase 208V for the same price as single-phase 208V once they do the administrative "upgrade". [I know of one group that got that into their contract - lucky bastards, for the time being.]

Farmer Crack-Ass
Jan 2, 2001

this is me posting irl

EnergizerFellow posted:

I do wonder that as well, such as why they have a seemingly high-speed CPU. I wonder if they have single-thread performance issues.

Don't they do software RAID on those? A 48-disk software RAID array probably gobbles a huge amount of CPU power.

Echidna
Jul 2, 2003

Well, I have finally got a "proper" iSCSI array for a small Xen virtualisation setup, so I can shift away from the current homebrew DRBD/Heartbeat/IET setup.

It's the Dell MD3000i, which I saw mentioned earlier along with some vaguely negative comments. It is a budget array, but I have to say for the price it's not a bad bit of kit, especially after we got our Dell account manager to knock the price down by a huge amount as we were ordering just in time for their end of month tally.

We've got it configured with dual controllers and 8x300GB plus 7x146GB 15k SAS drives. Throughput is around GigE wire speed - 110MB/s for both reads and writes. I'm also seeing respectable IOPS figures depending on workload; during an iozone run, I could see it sustaining around 1.5K IOPS to a RAID5 volume.

True, the management features are a world apart from the usual Sun and HP kit I'm used to, but it does the job. My main gripes are:

  • No built in graphing (seriously, Dell - WTF?), but you can do it from the CLI - see http://www.delltechcenter.com/page/MD3000i+Performance+Monitoring
  • Can't resize or change the I/O profile of a virtual disk once it's set up. This is a PITA, so make sure you set things up correctly the first time! You can, however, change the RAID level of a disk group once it's been created.
  • You need a Windows or RHEL box to run the administration GUI on - I'm sure you can probably hack a way to get the CLI running under Debian, but I haven't tried. You're probably SOL if you want to run it on anything else like Solaris. update: It looks like the admin tool and SMcli are just shell scripts that run Java apps. I tried a quick'n'dirty hack of installing everything under RHEL, tarring up /opt/dell and /var/opt/SM and then transferring them over to a Debian Lenny host. All I had to change was the #!/bin/sh to #!/bin/bash at the top of the SMcli and SMclient wrappers, and they seem to work. I haven't put them through any serious testing though...
  • Can't mix SAS and SATA in the same enclosure. The controllers do support SATA as well as SAS, although SATA drives don't show up as options in the Dell pricing configuration thingy. Our account manager advised us that although technically you can mix SAS and SATA in the same enclosure, they'd experienced a higher than average number of disk failures in that configuration, due to the vibration patterns created by disks spinning at different rates (15K SAS and 7.2K SATA). If you need to mix the two types, your only real option is to attach a MD1000 array to the back (you can add up to two of these) and have each chassis filled with just one type of drive.


The hardware failover works nicely - the array is active/passive for each virtual disk, as both controllers are typically active, each handling separate virtual disks for load-balancing purposes. When a controller fails, the remaining "good" controller takes over the virtual disks or disk groups from the failed controller. Failback is pretty transparent - the GUI guides you through the steps, but I found that simply inserting a replacement HD/Controller/etc. just did the job automagically.

Multipath support under RHEL/CentOS works fine with some tweaking - it uses the RDAC modules, which lead to some oddness on CentOS 5.3. What tends to happen is that the first time device mapper picks up the paths, RDAC doesn't get a chance to initialise things properly (the scsi_dh_rdac module isn't loaded), so you end up with all sorts of SCSI errors showing up in your logs. After flushing your paths (multipath -F) and restarting multipathd, things are OK. This is apparently fixed in RHEL 5.4 (https://bugzilla.redhat.com/show_bug.cgi?id=487293), so it should make its way out to CentOS from there. I'm unsure what the status is on other distros, though.
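
A sketch of that CentOS 5.3 workaround as a command sequence (same module and tools named above):

code:
modprobe scsi_dh_rdac          # load the RDAC device handler that wasn't loaded in time
multipath -F                   # flush the stale path maps
service multipathd restart     # rebuild the maps with the handler in place
multipath -ll                  # verify the rdac hwhandler and priority groups look sane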

My multipath.conf contains the following:
code:
devices {
        device {
                vendor "DELL"
                product "MD3000i"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                path_checker rdac
                prio_callout "/sbin/mpath_prio_rdac /dev/%n"
                hardware_handler "1 rdac"
                failback immediate
        }
}
And with everything working, multipath -ll shows:
code:
360026b90002ab6f40000056a4aa9e87b dm-12 DELL,MD3000i
[size=409G][features=0][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
 \_ 21:0:0:1  sdi 8:128 [active][ready]
 \_ 22:0:0:1  sdj 8:144 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 20:0:0:1  sdg 8:96  [active][ghost]
 \_ 23:0:0:1  sdh 8:112 [active][ghost]
Just thought I'd chime in with my experiences as I didn't see any feedback on this particular array before.

Echidna fucked around with this message at 11:08 on Sep 15, 2009

Halo_4am
Sep 25, 2003

Code Zombie

Echidna posted:

  • Can't mix SAS and SATA in the same enclosure. --- Our account manager advised us that although technically you can mix SAS and SATA in the same enclosure, they'd experienced a higher than average number of disk failures in that configuration, due to the vibration patterns created by disks spinning at different rates (15K SAS and 7.2K SATA). If you need to mix the two types, your only real option is to attach a MD1000 array to the back (you can add up to two of these) and have each chassis filled with just one type of drive.



That sounds like some BS to me. Even if all your disks are spinning at 15k, they're not likely to be perfectly synced to vibrate in the exact same pattern, and that's assuming the enclosure is made of thin paper and individual disk vibration matters at all to its neighbors. Even if it were true, so what? The disks are warrantied, and your SAS drives will be in RAID sets just like the SATA drives. If they fail more frequently due to uneven vibration patterns, it's only Dell that loses. I'm thinking they're just trying to sell you an additional MD1000...

Thanks for the write-up otherwise though.

H110Hawk
Dec 28, 2006

Halo_4am posted:

I'm thinking they're just trying to sell you an additional MD1000...

Agreed. You should compliment him on his quick thinking on the vibration patterns, though. Ask him if it was his idea. Sales guys occasionally need to be laughed at and called out.

lilbean
Oct 2, 2003

H110Hawk posted:

Agreed. You should compliment him on his quick thinking on the vibration patterns, though. Ask him if it was his idea. Sales guys occasionally need to be laughed at and called out.

Our Sun reseller gave us this bullshit before, and we called him out on it. He said it was in the sales team's training literature. Dickheads.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Echidna posted:

Can't mix SAS and SATA in the same enclosure. The controllers do support SATA as well as SAS, although SATA drives don't show up as options in the Dell pricing configuration thingy. Our account manager advised us that although technically you can mix SAS and SATA in the same enclosure, they'd experienced a higher than average number of disk failures in that configuration, due to the vibration patterns created by disks spinning at different rates (15K SAS and 7.2K SATA). If you need to mix the two types, your only real option is to attach a MD1000 array to the back (you can add up to two of these) and have each chassis filled with just one type of drive.

Hahahahaha you bought it

Pretty much every SAN vendor I've ever seen has mixed and matched storage types in entry-level to mid-range SANs for enclosure-level redundancy, and last I heard EMC and IBM are still in business.

Echidna
Jul 2, 2003

lilbean posted:

Our Sun reseller gave us this bullshit before, and we called him out on it. He said it was in the sales teams training literature. Dickheads.

Interesting that you've heard that line before as well, then. It did sound a bit fishy to me (I've used a mixed chassis before), but figured that regardless of the truth behind that statement, if that's what they're saying then I'd rather go with their recommended solution as I really don't want any finger pointing when getting support later.

Ah well, it actually worked out in our favour - our account manager obviously believed it, and he knew I wasn't going to buy an additional MD1000 (no physical space for it in the rack), so he swapped out the SATA drives in my quote request and replaced them with higher-capacity 15k SAS drives so we'd all be on the same drive type. He then slashed the price of the whole thing down to well under what we'd been quoted for the original SATA solution.

It's amazing what putting an order through on the last day of the month can do, when they have targets to meet...

Echidna fucked around with this message at 07:19 on Sep 15, 2009

lilbean
Oct 2, 2003

Echidna posted:

It's amazing what putting an order through on the last day of the month can do, when they have targets to meet...

Oh absolutely. My last big order was six Sun T5140 systems and a couple of J4200 disk arrays for them, and our CDW sales rep was falling over himself to get us to order it by the end of the month (in January). He called me twice a day to check on the status and whatnot, and then I finally e-mailed him to tell him to calm down.

So he calls me and he says "Look, if I make this sale I win a 42-inch plasma TV." As if I give a poo poo. So one of our other Sun sales guys gets us the lowest price (Parameter Driven Solutions in Toronto) and we go with them.

Then the CDW sales guy leaves a threatening message on my voicemail! He's angry as gently caress, saying poo poo like "I worked for a month on your quote and I DON'T LOSE SALES" and so on. I didn't reply, but CDW's website helpfully lists the contact information for your rep's manager, which let me forward the voicemail to him. So the salesman no longer works there :)

Fake edit: Also, I've seen Apple sales people and sales engineers give the same lines of bullshit about vibrational testing.

namaste friends
Sep 18, 2004

by Smythe

lilbean posted:

Oh absolutely. My last big order was six Sun T5140 systems and a couple of J4200 disk arrays for them, and our CDW sales rep was falling over himself to get us to order it by the end of the month (in January). He called me twice a day to check on the status and whatnot, and then I finally e-mailed him to tell him to calm down.

So he calls me and he says "Look, if I make this sale I win a 42-inch plasma TV." As if I give a poo poo. So one of our other Sun sales guys gets us the lowest price (Parameter Driven Solutions in Toronto) and we go with them.

Then the CDW sales guy leaves a threatening message on my voicemail! He's angry as gently caress, saying poo poo like "I worked for a month on your quote and I DON'T LOSE SALES" and so on. I didn't reply, but CDW's website helpfully lists the contact information for your rep's manager, which let me forward the voicemail to him. So the salesman no longer works there :)

Fake edit: Also, I've seen Apple sales people and sales engineers give the same lines of bullshit about vibrational testing.

That's appalling. It's common for sales cycles to last 6 months in the storage industry. In fact, the cycle often stretches beyond 6 months, and not just for high-end arrays.

oblomov
Jun 20, 2002

Meh... #overrated
Yeap, it probably took us 5-6 months each of the last couple of times we were selecting a storage solution. There were usually 3-4 vendors in the running and only 1 was selected each time, so there were plenty of losers, but everyone understood that this is what it takes in the enterprise.

oblomov
Jun 20, 2002

Meh... #overrated
Speaking of storage, anyone have experience with fairly large systems, as in 600-800TB, with most of that being short-term archive-type storage? If so, what do you guys use? NetApp wasn't really a great solution for this due to volume size limitations, which I guess one could mask with a software solution on top, but that's clunky. They just came out with 8.0, but I have 0 experience with that revision. What about EMC, say the Clariion 960? Anyone used that? Symmetrix would do this, but that's just stupidly expensive. Most of my experience is NetApp, with Equallogic thrown in for good measure (over the last year or so).

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer
For that much storage, Sun is probably the cheapest outside of a roll-your-own type solution.

TrueWhore
Oct 1, 2000

Can someone tell me if my basic idea here is feasible:

I'd like to set up a ZFS volume and share it out as an iSCSI target (rw).

I'd also like to take that same volume and share it out as several read-only iSCSI targets. I think this should be possible using ZFS clones? But then the clones wouldn't update when the master rw volume does, correct? Is there some other way to get it set up the way I want it to work, i.e. one iSCSI initiator can write to a volume and several others can read it at the same time and see updates as they happen?

Basically, if you haven't guessed, I am trying to get a semi-SAN setup as a stopgap measure until we can get a real SAN. I have 4 video editing stations that need access to archived material, and I am willing to have just one writer and several readers. If worst comes to worst, I will go with the clones method and just unmount, destroy the clone, create a new clone, and remount whenever I need a reader client to see the updated master volume.
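
For what it's worth, the clone-refresh cycle described above would look roughly like this on OpenSolaris; pool and volume names are made up, and the old shareiscsi property is assumed rather than COMSTAR:

code:
# one-time setup: snapshot the master zvol, clone it, export the clone as its own iSCSI target
zfs snapshot tank/master@r1
zfs clone tank/master@r1 tank/reader1
zfs set shareiscsi=on tank/reader1

# later, to let the readers see newer writes: tear the clone down and rebuild it
zfs destroy tank/reader1
zfs destroy tank/master@r1
zfs snapshot tank/master@r2
zfs clone tank/master@r2 tank/reader1
zfs set shareiscsi=on tank/reader1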

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

oblomov posted:

Speaking of storage, anyone have experience with fairly large systems, as in 600-800TB, with most of that being short-term archive-type storage? If so, what do you guys use? NetApp wasn't really a great solution for this due to volume size limitations, which I guess one could mask with a software solution on top, but that's clunky. They just came out with 8.0, but I have 0 experience with that revision. What about EMC, say the Clariion 960? Anyone used that? Symmetrix would do this, but that's just stupidly expensive. Most of my experience is NetApp, with Equallogic thrown in for good measure (over the last year or so).

Isilon is really big in this space. They mostly deal with data warehousing and near-line storage for multimedia companies and high-performance computing. They're very competitive on price for orders of this size, but I can't speak yet for the reliability of their product.

I haven't gotten to play with ours yet; it's sitting in boxes in the corner of the datacenter.

TrueWhore posted:

Can someone tell me if my basic idea here is feasible:

I'd like to set up a ZFS volume and share it out as an iSCSI target (rw).

I'd also like to take that same volume and share it out as several read-only iSCSI targets. I think this should be possible using ZFS clones? But then the clones wouldn't update when the master rw volume does, correct? Is there some other way to get it set up the way I want it to work, i.e. one iSCSI initiator can write to a volume and several others can read it at the same time and see updates as they happen?

Basically, if you haven't guessed, I am trying to get a semi-SAN setup as a stopgap measure until we can get a real SAN. I have 4 video editing stations that need access to archived material, and I am willing to have just one writer and several readers. If worst comes to worst, I will go with the clones method and just unmount, destroy the clone, create a new clone, and remount whenever I need a reader client to see the updated master volume.

Tell us a little more (read: a lot more) about what you're trying to do with this shared LUN (especially in terms of operating systems involved), because unless you're using a cluster filesystem, I don't think this is going to work the way you think it's going to work. Operating systems maintain extensive caches to speed disk I/O, and unless those caches stay coherent (meaning something forces them to update when something changes on the cluster), they're going to be seeing garbage all over the drive.

On top of this, I don't think Windows will even mount a SCSI LUN that's in read-only mode. I don't have any idea about Mac or various Unixes.

Why can't you just use a network filesystem?

Vulture Culture fucked around with this message at 05:01 on Sep 17, 2009

Vanilla
Feb 24, 2002

Hay guys what's going on in th

oblomov posted:

Speaking of storage, anyone have experience with fairly large systems, as in 600-800TB, with most of that being short-term archive-type storage? If so, what do you guys use? NetApp wasn't really a great solution for this due to volume size limitations, which I guess one could mask with a software solution on top, but that's clunky. They just came out with 8.0, but I have 0 experience with that revision. What about EMC, say the Clariion 960? Anyone used that? Symmetrix would do this, but that's just stupidly expensive. Most of my experience is NetApp, with Equallogic thrown in for good measure (over the last year or so).

All the time, depends exactly what you need it for - just to stream to and delete shortly after? More details?

Typically see this on Clariion with 1TB drives. When 2 TB drives come along the footprint will be a lot less.

paperchaseguy
Feb 21, 2002

THEY'RE GONNA SAY NO

oblomov posted:

Speaking of storage, anyone have experience with fairly large systems, as in 600-800TB, with most of that being short-term archive-type storage? If so, what do you guys use? NetApp wasn't really a great solution for this due to volume size limitations, which I guess one could mask with a software solution on top, but that's clunky. They just came out with 8.0, but I have 0 experience with that revision. What about EMC, say the Clariion 960? Anyone used that? Symmetrix would do this, but that's just stupidly expensive. Most of my experience is NetApp, with Equallogic thrown in for good measure (over the last year or so).

I've put together a few CX4-480s and 960s, though I was mostly designing for performance (mail systems with 100k+ users). At the extreme, you can get 740TB+ usable with the 960 these days. (With 1TB and 2TB drives I would recommend RAID 6 since they take forever to rebuild.) Soon you will be able to get 800TB raw on a single floor tile.

For short-term archiving, are you going to tape? Consider a CDL or a Data Domain DDX?

http://www.datadomain.com/pdf/DataDomain-DDXArraySeries-Datasheet.pdf

oblomov
Jun 20, 2002

Meh... #overrated

Vanilla posted:

All the time, depends exactly what you need it for - just to stream to and delete shortly after? More details?

Typically see this on Clariion with 1TB drives. When 2 TB drives come along the footprint will be a lot less.

2TB is not out yet as far as I know. It's mostly just to stream data to (sequential writes, mostly) and then archive for a few weeks. I was basically thinking Clariion or NetApp 6000 series. The only problem is that 15/16TB is the max volume size (not sure on the Clariion, could be wrong there). However, today I found out that the size needs to be double, i.e. 1.5PB or so. My guess is that management will look at the cost and abandon the project, but hey, I get to come up with something, and it's sure to be an interesting design exercise :).

Will also look at Isilon, never heard of them before.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!
I think NetApp GX may address this particular need.

Here's a whitepaper on using a CX4-960 for data warehousing:
http://www.emc.com/collateral/hardware/white-papers/h5548-deploying-clariion-dss-workloads-wp.pdf

Either way you go, you're going to have to aggregate multiple devices into a single logical volume. If you stick with NetApp, you'll just have to create several 16TB LUNs and stripe them together using LVM or Veritas or something like that. On EMC (correct me if I'm wrong, Vanilla) you'll end up with a bunch of 32TB metaLUNs to achieve the same goal.
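
As a rough illustration of the LVM route (the multipath device names are hypothetical):

code:
# stripe four 16TB multipath LUNs into one large logical volume
pvcreate /dev/mapper/lun{0,1,2,3}
vgcreate archive_vg /dev/mapper/lun{0,1,2,3}
lvcreate -i 4 -I 256 -l 100%FREE -n archive_lv archive_vg   # -i 4 stripes across all four PVs, 256KB stripe size
mkfs.xfs /dev/archive_vg/archive_lv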

Vanilla
Feb 24, 2002

Hay guys what's going on in th
The largest LUN the Clariion can create is 16 exabytes.

Basically, as long as your OS can address it, the Clariion can create it. To expand the LUN you would add metaLUNs, but this is transparent to the host/app. You'll need to use metaLUNs to build your huge LUN.

If you are just making one huge LUN, you have the option of using ALUA, which would mean the LUN can be accessed from both Storage Processors and get the performance of both (active/active). This avoids the situation where one SP owns the LUN and the other SP does nothing.

2TB drives are hitting the consumer market now, so it won't be long until the usual vendors pass them through their QA.

*Edited for clarification.*

Vanilla fucked around with this message at 10:56 on Sep 19, 2009

namaste friends
Sep 18, 2004

by Smythe

1000101 posted:

I think NetApp GX may address this particular need.

Here's a whitepaper on using a CX4-960 for data warehousing:
http://www.emc.com/collateral/hardware/white-papers/h5548-deploying-clariion-dss-workloads-wp.pdf

Either way you go, you're going to have to aggregate multiple devices into a single logical volume. If you stick with NetApp, you'll just have to create several 16TB LUNs and stripe them together using LVM or Veritas or something like that. On EMC (correct me if I'm wrong, Vanilla) you'll end up with a bunch of 32TB metaLUNs to achieve the same goal.

Ontap 8 running in "classic" mode overcomes the 16 TB aggregate limit.

http://www.ntapgeek.com/2009/09/64-bit-aggregates-in-data-ontap-8.html

GX, or cluster mode as it is known in Ontap 8, has several limitations, including an inability to SnapMirror.

namaste friends fucked around with this message at 20:57 on Sep 20, 2009

oblomov
Jun 20, 2002

Meh... #overrated

Cultural Imperial posted:

Ontap 8 running in "classic" mode overcomes the 16 TB aggregate limit.

http://www.ntapgeek.com/2009/09/64-bit-aggregates-in-data-ontap-8.html

GX, or cluster mode as it is known in Ontap 8, has several limitations, including an inability to SnapMirror.

Yeap, talked to my NetApp reps, and it looks like Ontap 8 will do this. Also talked to some EMC guys, and Vanilla is correct as well: the Clariion will handle the LUN size. There is no way in hell that I would be doing LVM stripes and such :).

crazyfish
Sep 19, 2002

Sorry to bump this thread from the grave, but does anyone know offhand what the maximum size of a single LUN is on Windows Server 2008 x64 (namely using the iSCSI software initiator)? The LUN is exposed to Windows as a basic disk with 4k sectors. I was under the impression that the 2^32 - 1 sector address limit on basic disks (2TB for 512-byte sectors, 16TB for 4k) was a 32-bit limitation and not a limitation in 64-bit. I tried creating a 17TB LUN and Windows wouldn't initialize the disk, and a wire capture didn't show any writes at all, so I'm certain it's not a corruption problem on the target side.

edit: I was using GPT as I'm aware of MBR's limitations.

crazyfish fucked around with this message at 19:13 on Oct 30, 2009

complex
Sep 16, 2003

Don't know if you've found this in searching already, but: http://powerwindows.wordpress.com/2009/02/21/maximum-lun-partition-disk-volume-size-for-windows-servers/

Also, I don't know if that helps answer your question.

What is the target, i.e. the device presenting the iSCSI LUN?

crazyfish
Sep 19, 2002

I did find that article. The problem is that it doesn't address the basic disk vs. dynamic disk limit specifically, and when I log in to the iSCSI target and attempt to initialize the LUN via Disk Management, it fails to initialize as GPT with an "I/O error" and doesn't allow me to convert it to dynamic without being initialized first. I checked a wire capture and saw absolutely no I/O errors being returned from the target, only a series of reads from sector 0 (which I would expect during initialization) and no writes, so I'm inclined to believe this is a Windows issue.

15TB LUNs work exactly as expected, allowing me to initialize to GPT, convert them to dynamic, and create a striped volume across two of them.

The target implementation was developed in-house and exposes a virtual block device.

EnergizerFellow
Oct 11, 2005

More drunk than a barrel of monkeys
16TB (minus 4 KB?) is a hard limit under the Microsoft iSCSI initiator, regardless of x86 or x64.
